Groups and explanatory variables
It was explained earlier that data from different groups can be combined in a single data matrix with a categorical variable that gives group membership. In a similar way, a categorical variable can be used to split a data set into groups.
In some data sets, one categorical variable can be treated as a response whose values may depend on a second categorical variable, the explanatory variable. We can then think of the explanatory variable as defining different groups and ask how the response distribution differs between the groups.
Do not use the response variable to define the groups.
If one categorical variable is a response and the other is an explanatory variable, the methods in the previous section can be used to see how the explanatory variable affects the response.
Bipolar disorder and family history
In a study of bipolar disorder (a mental disorder involving severe mood changes), information was collected from a group of subjects with the disorder about their age at onset of the disorder and their family history of mood disorders. The contingency table below describes the data that were collected.
Family history | Age at onset: Early (18 or younger) | Age at onset: Late (19 or older)
---|---|---
Negative | 28 | 35
Bipolar disorder | 19 | 38
Unipolar | 41 | 44
Unipolar and bipolar | 53 | 60
In this data set, Age at onset is the response and Family history is the explanatory variable: family history could affect the age at which a subject was first diagnosed with bipolar disorder, but the age at diagnosis cannot affect family history.
We can therefore use the methods in the previous section to compare the distributions for people with different family histories. For example, the following table shows the percentages within type of family history.
Family history | Age at onset: Early (18 or younger), % | Age at onset: Late (19 or older), % | Total, %
---|---|---|---
Negative | 44.4 | 55.6 | 100.0
Bipolar disorder | 33.3 | 66.7 | 100.0
Unipolar | 48.2 | 51.8 | 100.0
Unipolar and bipolar | 46.9 | 53.1 | 100.0
Although the sample size is small, there is an indication that people whose family history is of bipolar disorder only are more likely to have late onset themselves (66.7%, compared with between 51.8% and 55.6% in the other three groups).
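The percentages within each type of family history can be reproduced from the counts in the contingency table. The following is a minimal sketch in Python, used here purely for illustration; the names `counts` and `row_percentages` are assumptions and do not come from the original text.

```python
# Counts from the contingency table: (early onset, late onset) for each family history.
counts = {
    "Negative": (28, 35),
    "Bipolar disorder": (19, 38),
    "Unipolar": (41, 44),
    "Unipolar and bipolar": (53, 60),
}


def row_percentages(table):
    """Express each row as percentages of its own row total (1 decimal place)."""
    result = {}
    for group, row in table.items():
        total = sum(row)
        result[group] = tuple(round(100 * n / total, 1) for n in row)
    return result


row_pct = row_percentages(counts)
for group, (early, late) in row_pct.items():
    print(f"{group:22s} early {early:5.1f}%   late {late:5.1f}%")
```

Because each row is divided by its own total, the groups can be compared directly even though they contain different numbers of subjects.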
It is, however, unhelpful to treat Age at onset as defining the groups. For example, the percentages in the following table, which are calculated within each age-at-onset group, are much harder to interpret and compare.
Family history | Age at onset: Early (18 or younger), % | Age at onset: Late (19 or older), %
---|---|---
Negative | 19.9 | 19.8
Bipolar disorder | 13.5 | 21.5
Unipolar | 29.1 | 24.9
Unipolar and bipolar | 37.6 | 33.9
Total | 100.0 | 100.0
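For comparison, percentages within each age-at-onset group divide each count by its column total instead of its row total. The sketch below is again illustrative Python with assumed variable names. It shows why these figures are awkward to read: every percentage depends on how many subjects with each family history happened to be in the study.

```python
# Counts from the contingency table: (early onset, late onset) for each family history.
counts = {
    "Negative": (28, 35),
    "Bipolar disorder": (19, 38),
    "Unipolar": (41, 44),
    "Unipolar and bipolar": (53, 60),
}

# Column totals: number of subjects with early and with late onset.
early_total = sum(early for early, _ in counts.values())
late_total = sum(late for _, late in counts.values())

# Each count as a percentage of its column total (1 decimal place).
col_pct = {
    group: (round(100 * early / early_total, 1), round(100 * late / late_total, 1))
    for group, (early, late) in counts.items()
}

for group, (early, late) in col_pct.items():
    print(f"{group:22s} early {early:5.1f}%   late {late:5.1f}%")
```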
Bivariate data without an explanatory variable
Not all data sets have variables that can be categorised as a response and an explanatory variable. Sometimes the relationship between the variables is more symmetrical but we still want to discover whether particular values of one variable are associated with values of the other.
For numerical variables, we would use a correlation coefficient to describe the strength of the relationship (as opposed to least squares for variables that can be classified as a response and explanatory variable). When the two variables are categorical, different methods are needed to describe the association between the variables.
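The distinction for numerical variables can be illustrated with a short Python sketch. The data values and function names below are hypothetical and chosen only for illustration. The correlation coefficient is unchanged when the two variables are swapped, whereas the least squares slope depends on which variable is treated as the response.

```python
from math import sqrt


def _sums_of_squares(x, y):
    """Return the corrected sums Sxx, Syy and Sxy for paired data."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxx, syy, sxy


def correlation(x, y):
    """Correlation coefficient: treats x and y symmetrically."""
    sxx, syy, sxy = _sums_of_squares(x, y)
    return sxy / sqrt(sxx * syy)


def slope(explanatory, response):
    """Least squares slope for predicting the response from the explanatory variable."""
    sxx, _, sxy = _sums_of_squares(explanatory, response)
    return sxy / sxx


# Hypothetical paired measurements, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 10]

print(correlation(x, y), correlation(y, x))  # the same value either way
print(slope(x, y), slope(y, x))              # different slopes
```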
The remainder of this section describes some methods of analysing data of this form.
Alcohol and nicotine intake
As part of a study of how drinking and smoking by pregnant women affected their children, data were collected from 452 mothers about their nicotine intake during pregnancy and their alcohol intake before their pregnancy was recognised. The contingency table below describes the relationship between these two ordinal categorical variables.
Alcohol (oz/day) | Nicotine: None | Nicotine: 1 to 15 mg/day | Nicotine: Over 15 mg/day
---|---|---|---
None | 105 | 7 | 11
0.01 to 0.10 | 58 | 5 | 13
0.11 to 0.99 | 84 | 37 | 42
1.00 or more | 57 | 16 | 17
The variables cannot be classified as a response and explanatory variable, since both have similar status. However, it is reasonable to ask whether high alcohol consumption tends to be associated with high nicotine intake.
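When neither variable defines the groups, one symmetric starting point is to express each cell as a percentage of the overall total, alongside the marginal totals for each variable. The Python sketch below does this for the table above. It is an illustration with assumed variable names and is not necessarily the method developed in the rest of this section.

```python
alcohol_levels = ["None", "0.01 to 0.10", "0.11 to 0.99", "1.00 or more"]  # oz/day
nicotine_levels = ["None", "1 to 15", "Over 15"]                           # mg/day

# Rows are alcohol levels, columns are nicotine levels (counts of mothers).
counts = [
    [105, 7, 11],
    [58, 5, 13],
    [84, 37, 42],
    [57, 16, 17],
]

# Marginal totals for each variable and the overall number of mothers.
alcohol_totals = [sum(row) for row in counts]
nicotine_totals = [sum(col) for col in zip(*counts)]
grand_total = sum(alcohol_totals)

# Each cell as a percentage of all mothers; this treats both variables alike.
joint_pct = [[round(100 * n / grand_total, 1) for n in row] for row in counts]

print("Total mothers:", grand_total)
for level, row in zip(alcohol_levels, joint_pct):
    print(f"{level:14s}", "  ".join(f"{p:5.1f}" for p in row))
```

The grand total agrees with the 452 mothers reported for the study.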