Groups and explanatory variables

It was explained earlier that data from different groups can be combined in a single data matrix with a categorical variable that gives group membership. In a similar way, a categorical variable can be used to split a data set into groups.
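As a minimal sketch of this idea, the following Python fragment splits a small data matrix into groups using a categorical variable. The records, the variable names (group, value) and the helper split_by are all made up for illustration; they do not come from any data set in this section.

```python
from collections import defaultdict

# Hypothetical data matrix: one row per individual, with a categorical
# variable 'group' giving group membership and a numerical 'value'.
data = [
    {"group": "A", "value": 4.1},
    {"group": "B", "value": 3.7},
    {"group": "A", "value": 5.0},
    {"group": "B", "value": 2.9},
    {"group": "A", "value": 4.6},
]


def split_by(rows, variable):
    """Split the rows into groups according to a categorical variable."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[variable]].append(row)
    return dict(groups)


groups = split_by(data, "group")
for name, rows in groups.items():
    print(name, [row["value"] for row in rows])
```

Each group can then be summarised separately and the summaries compared.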

In some data sets, one categorical variable can be regarded as a response whose values may depend on a second categorical variable, the explanatory variable. We can then think of the explanatory variable as defining different groups and ask how the response distribution differs between the groups.

Do not use the response variable to define the groups.

If one categorical variable is a response and the other is an explanatory variable, the methods in the previous section can be used to see how the explanatory variable affects the response.

Drug screening of job applicants

Urine drug screening was performed on 2537 applicants for postal jobs. Among the categorical variables recorded for each applicant were the type of drug detected (if any) and the applicant's gender. The contingency table below shows these data.

          Negative   Marijuana   Cocaine   Other drugs   Total
Male          1465         146        33            28    1672
Female         764          52        22            27     865

In this data set, the result of the drug test is the response and gender is the explanatory variable — it is possible for gender to affect the type of drug detected, but not the reverse (!).

We can therefore use the methods in the previous section to compare the distributions for males and females. For example, the following table shows the percentages within each gender group.

          Negative   Marijuana   Cocaine   Other drugs   Total
Male          87.6         8.7       2.0           1.7   100.0
Female        88.3         6.0       2.5           3.1   100.0

From this table, it can be seen that the differences between males and females are fairly small: the largest, for marijuana, is under 3 percentage points.
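The within-gender percentages can be reproduced from the counts in the contingency table. The Python sketch below is one way to do the arithmetic; the names counts and row_percentages are chosen here for illustration only.

```python
# Counts copied from the contingency table (rows: gender, columns: test result).
results = ["Negative", "Marijuana", "Cocaine", "Other drugs"]
counts = {
    "Male": [1465, 146, 33, 28],
    "Female": [764, 52, 22, 27],
}


def row_percentages(table):
    """Rescale each row so that it sums to 100 (rounded to 1 decimal place)."""
    percentages = {}
    for group, row in table.items():
        total = sum(row)  # number of applicants in this gender group
        percentages[group] = [round(100 * n / total, 1) for n in row]
    return percentages


row_pct = row_percentages(counts)
for group, row in row_pct.items():
    print(group, dict(zip(results, row)))
```

Because each row is divided by its own total, the two genders can be compared directly even though there are almost twice as many males as females.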

It is, however, unhelpful to treat the drug result as defining the groups. For example, the percentages in the following table, which are calculated within each test result rather than within each gender, are much harder to interpret and compare.

          Negative   Marijuana   Cocaine   Other drugs
Male          65.7        73.7      60.0          50.9
Female        34.3        26.3      40.0          49.1
Total        100.0       100.0     100.0         100.0
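For comparison, the percentages within each test result come from dividing by column totals instead of row totals. The Python sketch below shows this calculation; the names counts and column_percentages are illustrative only.

```python
# Counts copied from the contingency table (rows: gender, columns: test result).
results = ["Negative", "Marijuana", "Cocaine", "Other drugs"]
counts = {
    "Male": [1465, 146, 33, 28],
    "Female": [764, 52, 22, 27],
}


def column_percentages(table):
    """Rescale each column so that it sums to 100 (rounded to 1 decimal place)."""
    # zip(*rows) transposes the table so each tuple holds one column.
    column_totals = [sum(column) for column in zip(*table.values())]
    return {
        group: [round(100 * n / total, 1) for n, total in zip(row, column_totals)]
        for group, row in table.items()
    }


col_pct = column_percentages(counts)
for group, row in col_pct.items():
    print(group, dict(zip(results, row)))
```

Every column is dominated by males simply because 1672 of the 2537 applicants were male, which is part of why these percentages are hard to interpret.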

Bivariate data without an explanatory variable

Not all data sets have variables that can be categorised as a response and an explanatory variable. Sometimes the relationship between the variables is more symmetrical but we still want to discover whether particular values of one variable are associated with values of the other.

For numerical variables, we would use a correlation coefficient to describe the strength of the relationship (as opposed to least squares for variables that can be classified as a response and explanatory variable). When the two variables are categorical, different methods are needed to describe the association between the variables.

The remainder of this section describes some methods of analysing data of this form.

Customer ratings of two product ranges

A company selling both quality stereo systems and musical instruments is interested in how its reputation for one product line is related to its reputation for the other. A sample of 543 people is asked to rate each product line on a three-point scale, and the contingency table below shows the relationship between these two ordinal categorical variables.

                        Rating of stereo products
Rating of instruments   Below average   Average   Above average
Below average                     105         7              11
Average                            58         5              13
Above average                      84        37              42

This relationship is not causal; both variables have similar status. However, it is reasonable to ask whether good ratings of the stereo products tend to be associated with good ratings of the musical instruments.