Groups and explanatory variables

It was explained earlier that data from different groups can be combined in a single data matrix with a categorical variable that gives group membership. In a similar way, a categorical variable can be used to split a data set into groups.

In some data sets, one categorical variable can be treated as a response whose values may depend on a second categorical variable, the explanatory variable. The explanatory variable then defines the groups, and we ask how the distribution of the response differs between them.

Do not use the response variable to define the groups.

If one categorical variable is a response and the other is an explanatory variable, the methods in the previous section can be used to see how the explanatory variable affects the response.

Drug screening of job applicants

Urine drug screening was performed on 2537 applicants for postal jobs. Among the categorical variables recorded for each applicant were the type of drug detected (if any) and the applicant's gender. The contingency table below shows these data.

           Negative   Marijuana   Cocaine   Other drugs   Total
Male           1465         146        33            28    1672
Female          764          52        22            27     865

In this data set, the result of the drug test is the response and gender is the explanatory variable — it is possible for gender to affect the type of drug detected, but not the reverse (!).

We can therefore use the methods in the previous section to compare the distributions for males and females. For example, the following table shows the percentages within each gender group.

           Negative   Marijuana   Cocaine   Other drugs   Total
Male           87.6         8.7       2.0           1.7   100.0
Female         88.3         6.0       2.5           3.1   100.0

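This table shows that the differences between males and females are fairly small. About 88% of each group tested negative, although marijuana was detected somewhat more often among males (8.7% against 6.0%) and other drugs somewhat more often among females (3.1% against 1.7%).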
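The percentages within each gender group can be reproduced from the counts with a few lines of code. The following is a minimal sketch in Python, not part of the original analysis; the names `counts` and `row_percentages` are chosen here purely for illustration.

```python
# Counts from the drug-screening contingency table.
# Rows are the groups (explanatory variable: gender);
# columns are the response categories (drug test result).
counts = {
    "Male":   {"Negative": 1465, "Marijuana": 146, "Cocaine": 33, "Other drugs": 28},
    "Female": {"Negative": 764,  "Marijuana": 52,  "Cocaine": 22, "Other drugs": 27},
}


def row_percentages(table):
    """Express each row's counts as percentages of that row's total."""
    result = {}
    for group, row in table.items():
        total = sum(row.values())
        result[group] = {category: 100 * n / total for category, n in row.items()}
    return result


row_pct = row_percentages(counts)

# Print the percentages rounded to one decimal place, one line per group.
for group, row in row_pct.items():
    print(group, {category: round(p, 1) for category, p in row.items()})
```

Because each row is divided by its own total, the two rows can be compared directly even though there were nearly twice as many male applicants as female applicants.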

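It is, however, unhelpful to treat the drug result as defining the groups. The percentages in the following table are much harder to interpret and compare, because every column is dominated by the fact that there were nearly twice as many male applicants (1672) as female applicants (865).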

           Negative   Marijuana   Cocaine   Other drugs
Male           65.7        73.7      60.0          50.9
Female         34.3        26.3      40.0          49.1
Total         100.0       100.0     100.0         100.0
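For comparison, the percentages within each drug-result column can be computed in the same way, dividing by column totals instead of row totals. This is again only an illustrative Python sketch; `counts` and `column_percentages` are names introduced here.

```python
# Counts from the drug-screening contingency table (rows: gender, columns: result).
counts = {
    "Male":   {"Negative": 1465, "Marijuana": 146, "Cocaine": 33, "Other drugs": 28},
    "Female": {"Negative": 764,  "Marijuana": 52,  "Cocaine": 22, "Other drugs": 27},
}


def column_percentages(table):
    """Express each count as a percentage of its column total."""
    categories = list(next(iter(table.values())).keys())
    col_totals = {c: sum(row[c] for row in table.values()) for c in categories}
    return {
        group: {c: 100 * row[c] / col_totals[c] for c in categories}
        for group, row in table.items()
    }


col_pct = column_percentages(counts)

# Print the percentages rounded to one decimal place, one line per gender.
for group, row in col_pct.items():
    print(group, {category: round(p, 1) for category, p in row.items()})
```

These values describe the gender mix among applicants with each test result, which depends heavily on how many males and females applied, so they say little about how the test result depends on gender.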

Bivariate data without an explanatory variable

Not all data sets have variables that can be categorised as a response and an explanatory variable. Sometimes the relationship between the variables is more symmetrical, but we still want to discover whether particular values of one variable are associated with particular values of the other.

For numerical variables, we would use a correlation coefficient to describe the strength of the relationship (as opposed to least squares for variables that can be classified as a response and explanatory variable). When the two variables are categorical, different methods are needed to describe the association between the variables.

The remainder of this section describes some methods of analysing data of this form.

Alcohol and nicotine intake

As part of a study of how drinking and smoking by pregnant women affected their children, data were collected from 452 mothers on their nicotine intake during pregnancy and their alcohol intake before their pregnancy was recognised. The contingency table below shows the number of mothers in each combination of categories of these two ordinal categorical variables.

                   Nicotine (milligrams/day)
Alcohol (oz/day)   None   1 to 15   Over 15
None                105         7        11
0.01 to 0.10         58         5        13
0.11 to 0.99         84        37        42
1.00 or more         57        16        17

The variables cannot be classified as a response and an explanatory variable; both have similar status. However, it is reasonable to ask whether high alcohol consumption tends to be associated with high nicotine intake.
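One informal way to take a first look at this question is to compute, for each alcohol category, the percentage of mothers reporting any nicotine intake. The Python sketch below is an illustration only and is not necessarily the method developed in the rest of this section; the variable names are invented here. Since neither variable is a response, the calculation could equally be done the other way round, within nicotine categories.

```python
# Counts from the alcohol and nicotine contingency table.
# Rows: alcohol intake (oz/day); columns: nicotine intake (milligrams/day).
alcohol_levels = ["None", "0.01 to 0.10", "0.11 to 0.99", "1.00 or more"]
nicotine_levels = ["None", "1 to 15", "Over 15"]
counts = [
    [105, 7, 11],
    [58, 5, 13],
    [84, 37, 42],
    [57, 16, 17],
]

# Total number of mothers in the table.
total_mothers = sum(sum(row) for row in counts)

# Percentage of mothers with any nicotine intake, within each alcohol category.
none_index = nicotine_levels.index("None")
pct_any_nicotine = {}
for level, row in zip(alcohol_levels, counts):
    row_total = sum(row)
    smokers = row_total - row[none_index]
    pct_any_nicotine[level] = 100 * smokers / row_total

for level, pct in pct_any_nicotine.items():
    print(f"{level:>12}: {pct:.1f}% with some nicotine intake")
```

If the two variables were unrelated, these percentages would be expected to be broadly similar across the alcohol categories.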