Data sets with two categorical variables
In this section, we explain how to model data sets that consist of pairs of categorical values — bivariate categorical data.
Examples in which two categorical variables are measured on each 'individual' include:
'Individuals' | Variable X | Variable Y |
---|---|---|
Employees in a large company | Sex (M or F) | Education (none, high school or tertiary) |
Rose plants | Aphids (yes or no) | Quality of blooms (poor, OK or good) |
TVs leaving a production line | Assembler (A, B, C or D) | Status (defective or OK) |
In each case, the data that are collected are pairs of categorical values that are measurements of the two variables. Data of this form are usually summarised in a contingency table.
Patients taking prescribed medicine
Many patients do not adhere to the medical regimens prescribed for them. Predicting which patients are likely to stop taking their prescribed medication would help in managing their treatment.
One study in the USA examined 62 patients who had been diagnosed as having glaucoma (a disease that affects vision) and who had been prescribed eye drops to take at home. Each patient was classified by whether or not they complied with the treatment prescribed (i.e. took their eye drops regularly) and by racial group.
Patient | Race | Compliance |
---|---|---|
1 | White | Complier |
2 | Black | Non-complier |
3 | White | Non-complier |
... | ... | ... |
The data from the 62 patients are summarised in the contingency table below.
Race | Compliers | Non-compliers | Total |
---|---|---|---|
White | 13 | 10 | 23 |
Non-white | 13 | 26 | 39 |
Total | 26 | 36 | 62 |
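The step from a list of (race, compliance) pairs to the contingency table can be sketched in Python. This is only an illustration: the full patient listing is not reproduced in the text, so the list of pairs is rebuilt from the four cell counts, and the variable names are hypothetical.

```python
from collections import Counter

# Cell counts copied from the contingency table above.
cell_counts = {
    ("White", "Complier"): 13,
    ("White", "Non-complier"): 10,
    ("Non-white", "Complier"): 13,
    ("Non-white", "Non-complier"): 26,
}

# Rebuild one (race, compliance) pair per patient. The true patient order is
# not given in the source, so this ordering is an assumption for illustration.
pairs = [pair for pair, count in cell_counts.items() for _ in range(count)]

# Tabulate the pairs into a contingency table and its marginal totals.
table = Counter(pairs)
row_totals = Counter(race for race, _ in pairs)
column_totals = Counter(compliance for _, compliance in pairs)

columns = ("Complier", "Non-complier")
for race in ("White", "Non-white"):
    print(race, [table[(race, c)] for c in columns], row_totals[race])
print("Total", [column_totals[c] for c in columns], len(pairs))
```

The row and column totals produced this way should agree with the Total row and column of the table.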
Joint probabilities
To model bivariate categorical data, we assume an underlying population of pairs of categorical values. The data are treated as a random sample of pairs of values from this population. A real finite population occasionally underlies the data, but we must usually hypothesise an infinite underlying population.
The proportion of times that the pair of categorical values (x, y) occurs in the population, its probability, is denoted by $p_{xy}$. The probabilities $p_{xy}$ are called the joint probabilities for the two variables.
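Because the joint probabilities are population proportions taken over every possible pair of values, they satisfy the usual conditions for a probability distribution. Written out, with the sums running over all categories of X and Y:

```latex
p_{xy} \ge 0 \quad \text{for every pair } (x, y),
\qquad
\sum_{x} \sum_{y} p_{xy} = 1
```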
Gambling simulation
A gambler draws a card from a shuffled deck and also tosses a coin, resulting in a pair of categorical values:
Variable | Possible values |
---|---|
Coin side, X | Head or Tail |
Card suit, Y | Heart, Club, Diamond or Spade |
Each of the eight possible combinations of coin side and card suit is equally likely and would occur in the same proportions in the underlying population.
The probabilities for all pairs are therefore the same,
$p_{\text{head, heart}} = p_{\text{head, club}} = \dots = p_{\text{tail, spade}} = 1/8 = 0.125$
These joint probabilities are shown in the table below.
Coin side | Heart | Club | Diamond | Spade |
---|---|---|---|---|
Head | 0.125 | 0.125 | 0.125 | 0.125 |
Tail | 0.125 | 0.125 | 0.125 | 0.125 |
A random sample of 100 coin-card pairs from this population can be summarised either as a contingency table of counts or as a table of sample proportions. The sample proportions vary from one sample to the next, but they are less variable when the sample size is larger, for example 500 rather than 100.
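The sampling experiment can be imitated with a short simulation. The sketch below is not the original demonstration. It assumes that choosing a suit uniformly at random is equivalent to drawing a card from a full shuffled deck (13 of the 52 cards belong to each suit), and the function names, the seed and the choice of 200 repeated samples are all illustrative.

```python
import random
import statistics
from collections import Counter

COIN_SIDES = ["Head", "Tail"]
CARD_SUITS = ["Heart", "Club", "Diamond", "Spade"]

def take_sample(n, rng):
    """Return counts of (coin side, card suit) pairs in a sample of size n."""
    return Counter(
        (rng.choice(COIN_SIDES), rng.choice(CARD_SUITS)) for _ in range(n)
    )

def sample_proportions(counts):
    """Convert a table of counts into a table of sample proportions."""
    n = sum(counts.values())
    return {pair: count / n for pair, count in counts.items()}

def spread_of_proportion(n, repeats, rng, pair=("Head", "Heart")):
    """Standard deviation of one cell's sample proportion over many samples."""
    props = [take_sample(n, rng)[pair] / n for _ in range(repeats)]
    return statistics.pstdev(props)

rng = random.Random(2024)  # arbitrary seed, only for reproducibility

counts_100 = take_sample(100, rng)
props_100 = sample_proportions(counts_100)
print(props_100)  # each proportion should be near 0.125, but not equal to it

# Compare the sample-to-sample variability for n = 100 and n = 500.
spread_100 = spread_of_proportion(100, 200, rng)
spread_500 = spread_of_proportion(500, 200, rng)
print(spread_100, spread_500)
```

The second printed value should be noticeably smaller than the first, reflecting the smaller variability of sample proportions in larger samples.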
Interest in the model
We are usually interested in the joint probabilities in the underlying model, rather than the corresponding proportions from the sample data that have been collected (the contingency table). However, the joint probabilities are unknown parameters in most practical situations and must be estimated from the sample data.
Patients taking prescribed medicine
In situations of practical importance, the underlying probabilities are unknown. In the data set that was collected to examine which patients take prescribed medicine, the 62 glaucoma patients in the study were not the focus of attention; the researcher wanted to generalise to all other similar patients.
The population proportions are unknown, but the sample proportions provide estimates of them.
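As a minimal sketch of this estimation step, each joint probability can be estimated by the corresponding sample proportion, that is, the cell count divided by the sample size. The counts below are those of the glaucoma contingency table; the variable names are illustrative.

```python
# Cell counts from the glaucoma contingency table (62 patients in total).
counts = {
    ("White", "Complier"): 13,
    ("White", "Non-complier"): 10,
    ("Non-white", "Complier"): 13,
    ("Non-white", "Non-complier"): 26,
}

n = sum(counts.values())

# Estimate each joint probability by its sample proportion: count / n.
estimated_joint = {pair: count / n for pair, count in counts.items()}

for pair, p_hat in estimated_joint.items():
    print(pair, round(p_hat, 3))
```

These are estimates only. A different sample of 62 patients would give somewhat different proportions.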