The columns in a vector spreadsheet can be of three types: variate, text or factor. A variate is a numeric column, while text can contain letters or a mixture of letters and numbers.
A factor is sometimes called a ‘categorical’ or ‘class’ variable within other statistical applications. Each group of a factor can be represented with a text label and/or a numerical value (level). The groups are also assigned ordinal values that are numbered from 1 upwards, and these determine the order in which the levels or labels of the factor will be displayed in the Output. For example, the table below shows a factor that has 4 levels:
The level in position 1 (ordinal 1) will be displayed first. The ordinal values are always numbered 1,2,3…n, with no gaps, and you can reverse the order (3, 2, 1) to reverse the order displayed in the Output. You can also reorder the levels (numeric values) and labels (text) of a factor in other ways by sorting them (see Sorting levels or labels). For example, in the table below, the levels and labels have been reordered by sorting the labels alphabetically:
In the spreadsheet below, the columns Sex, Age, Severity and Treatment are factors. The factor Sex has two specified levels with the labels Female and Male. The factor Age has 3 levels, 10, 11 and 12. The factor Severity has four levels 0, 1, 2, 3 with the labels None, Mild, Moderate and Acute. The factor Treatment has four levels 0, 1, 2, 10 with the labels Control, Normal, Medium and High.
Although Age is a continuous measurement (as we continually grow older), the individuals have been assigned to a discrete age group based on their age at their last birthday.
Severity is an assessment of the degree of a disease in an individual, and a doctor will have assigned a level to each person, based on his or her clinical judgement. The levels of severity have a natural ordering, with 0 as an obvious starting value, so the numerical levels 0-3 have been assigned to these, but text labels have also been assigned to describe what each level means.
The factor Treatment is the intended amount of a drug that has been prescribed to each individual, so it has a natural numerical value (dose in milligrams), but also a label to give extra information on these doses. In an experimental trial, a small set of fixed values is often used for a treatment factor that could vary continuously (like the amount of drug given in this case), so that each level used is replicated several times. The variation between the replicates is then used to assess the consistency of the response to the treatment.
Details of the Severity factor are shown below, using the Edit Factor Levels and Labels dialog. This shows that any factor column has three components that Genstat calls Ordinals, Levels and Labels. The ordinals are numbered from 1 upwards, and they dictate the order that the levels of a factor will be displayed. If both levels and labels are present, you have a choice of which is displayed within a spreadsheet column (the default is to display the text labels). In the spreadsheet image above, the column Severity has text labels displayed, whilst Treatment has numeric levels displayed. Within a spreadsheet column you can also choose to display the ordinals of a factor (1, 2, 3, etc.) – this is the default if no levels or labels have been assigned to the factor.
Each factor can have a text Label, to allow clearer identification of the group; for example, the Severity factor shown above has labels None, Mild, Moderate, and Acute, which are probably more meaningful than their associated levels (0, 1, 2 and 3). Each factor can also have a numerical value called a Level assigned to it. A factor can have both levels and labels defined at the same time.
The numeric levels are distinct from the ordinals, and the levels need not be in numerical order. If no user-defined levels are given for a factor, then the default levels are the same as the ordinal values of the factor, i.e. 1, 2, 3, etc.
The availability of ordinals, levels and labels for factors gives you great flexibility in how you present results that are in tables and graphs. For example, the order of rows in a table is always dictated by the ordinals, but what you will see is either the levels, or the labels. When doing a numerical calculation with a factor, the values of the factor levels are used. When sorting on a factor, the values of the ordinals are used.
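The relationship between ordinals, levels and labels can be illustrated outside Genstat with a small Python sketch. This is only a toy model, not Genstat's implementation: the class name Factor, its fields and the row data in codes are hypothetical. It assumes each data row stores the ordinal of its group, that calculations use the levels, and that sorting uses the ordinals, as described above.

```python
from dataclasses import dataclass


@dataclass
class Factor:
    """Toy model of a factor (hypothetical, not Genstat's internals).

    The group with ordinal k is stored at position k - 1 of levels and labels.
    """
    levels: list  # numeric level of each group, in ordinal order
    labels: list  # text label of each group, in ordinal order
    codes: list   # ordinal (1-based) of the group each data row belongs to

    def level_values(self):
        # Values used in numerical calculations: the level of each row.
        return [self.levels[c - 1] for c in self.codes]

    def label_values(self):
        # Values shown when labels are displayed: the label of each row.
        return [self.labels[c - 1] for c in self.codes]

    def sort_order(self):
        # Row indices sorted by ordinal (ties keep their original order).
        return sorted(range(len(self.codes)), key=lambda i: self.codes[i])


# Treatment factor from the text; the five rows in codes are made-up data.
treatment = Factor(
    levels=[0, 1, 2, 10],
    labels=["Control", "Normal", "Medium", "High"],
    codes=[4, 1, 3, 1, 2],
)

mean_dose = sum(treatment.level_values()) / len(treatment.codes)
```

Here the mean is computed from the levels (0, 1, 2, 10), while sort_order() depends only on the ordinals, so changing which of levels or labels is displayed would not change the row order.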
A factor is indicated in a spreadsheet by a red exclamation mark at the start of the column name. A column imported from another data source, such as Excel, can be marked as a factor containing grouping information by ending the column name with an exclamation mark, e.g. Chemical!.
The factor, Treatment, shown in the Edit Factor Levels and Labels dialog below, has 4 groups each with both a numerical level and a text label.
The groups in a factor can be reordered: initially the groups in Treatment have been sorted numerically on the levels. The groups in Treatment could instead be sorted alphabetically by label (as shown below), and the membership of individuals in each group would remain the same. However, the order in which the groups appear in an analysis or table would change.
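The effect of reordering can be sketched in Python. This is an illustration under stated assumptions, not Genstat code: the function reorder_by_label and the row data in codes are hypothetical, and each row is assumed to store the ordinal of its group. Sorting the groups by label changes the ordinals, but every row keeps the same level and label.

```python
def reorder_by_label(levels, labels, codes):
    """Sort the groups alphabetically by label and renumber the ordinals.

    levels and labels are in ordinal order; codes holds the 1-based ordinal
    of each data row. Returns the reordered (levels, labels, codes).
    """
    # Old 0-based group positions, listed in their new order.
    order = sorted(range(len(labels)), key=lambda i: labels[i])
    # Map each old ordinal to its new ordinal.
    new_ordinal = {old + 1: new + 1 for new, old in enumerate(order)}
    return (
        [levels[i] for i in order],
        [labels[i] for i in order],
        [new_ordinal[c] for c in codes],
    )


levels = [0, 1, 2, 10]
labels = ["Control", "Normal", "Medium", "High"]
codes = [4, 1, 3, 1, 2]  # made-up rows: High, Control, Medium, Control, Normal

new_levels, new_labels, new_codes = reorder_by_label(levels, labels, codes)
```

After the call the groups are in the order Control, High, Medium, Normal with levels 0, 10, 2, 1, and looking up each row through its new ordinal gives the same label as before.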
The ordinals are always numbered in continuous sequence 1,2,3… (or reverse order) even when the factor levels have been reordered. When importing data from other file formats, Genstat will by default order the levels of any factor either numerically or alphabetically or by the order they are first met in the data. This behaviour is controllable via the spreadsheet options or in the dialogs used to load data. You can manipulate factor levels and labels by selecting Spread | Factor and selecting one of the menu options shown below.
As spreadsheet packages do not have a factor cell type (they only have numeric or text cells), Genstat lets you specify a factor in one of these files by appending a ! to the end of the column name (e.g. Sex! or Age!). When you import the file into Genstat, columns with an exclamation mark will be recognized as factors.
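If you prepare such files with a script, the naming convention is easy to check before import. The Python sketch below is an assumption-laden illustration, not part of Genstat: the function split_factor_columns is hypothetical, and it assumes the trailing exclamation mark is only a marker and is dropped from the resulting column name.

```python
def split_factor_columns(column_names):
    """Separate column names marked as factors (trailing '!') from the rest.

    Returns (factors, others); the '!' marker is removed from factor names.
    """
    factors, others = [], []
    for name in column_names:
        stripped = name.strip()
        if stripped.endswith("!"):
            factors.append(stripped[:-1])
        else:
            others.append(stripped)
    return factors, others


factors, others = split_factor_columns(["Sex!", "Age!", "Weight", "Chemical!"])
```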
Unlike Genstat, Excel does not allow both numerical and text values in one cell. To enable both to be specified for a factor (as in Severity or Treatment), you can add a comment to the column name as described below.
The factor labels or levels must start with a ! on a new line in the comment. If you are providing levels use the format !(100,90,50,10). If you are providing labels use the format !T(Control,A,B,C) or !t('Control','A','B','C'). The order of the items in the comment will define the order of the levels or labels in the factor. If a column in Excel just contains ordinal values (i.e. 1…n), the comment can still be used to assign labels or levels to these groups. The first item in the comment will define the level or label for group 1 in the factor etc. The column description information can also be given in the comment, as a line not starting with an exclamation mark.
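The comment format can be checked with a short Python sketch before the file is imported. This is a minimal illustration of the rules described above, not Genstat's own parser: the function parse_factor_comment is hypothetical, it assumes one specification per line, and it splits on every comma, so labels that themselves contain commas are not handled.

```python
def parse_factor_comment(comment):
    """Split a column comment into levels, labels and description lines.

    Lines of the form !(...) give levels, !T(...) or !t(...) give labels,
    and any other non-empty line is treated as column description text.
    """
    result = {"levels": None, "labels": None, "description": []}
    for raw in comment.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("!") and line.endswith(")"):
            if line[1:3].lower() == "t(":
                # Labels: strip optional surrounding quotes from each item.
                items = line[3:-1].split(",")
                result["labels"] = [s.strip().strip("'\"") for s in items]
            elif line[1] == "(":
                # Levels: numeric values in the order given.
                result["levels"] = [float(s) for s in line[2:-1].split(",")]
        else:
            result["description"].append(line)
    return result


parsed = parse_factor_comment("Dose of drug\n!(100,90,50,10)\n!T(Control,A,B,C)")
quoted = parse_factor_comment("!t('Control','A','B','C')")
```

The first item in each list corresponds to group 1 of the factor, the second to group 2, and so on, matching the ordering rule above.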
The following image shows two comments entered into Excel (to do this right-click and select Insert Comment, making sure you don’t put line breaks in the middle of the list).
A factor must be used in many places (e.g. ANOVA) where group membership is specified. You can convert variates and texts to a factor and back again.