Data sets with two categorical variables

In this section, we explain how to model data sets that consist of pairs of categorical values — bivariate categorical data.

Examples where pairs of categorical variables are measured from 'individuals' are:

'Individuals'      Variable X                                     Variable Y
Trapped rabbits    Sex (M or F)                                   Signs of disease (yes or no)
Rose plants        Aphids (yes or no)                             Quality of blooms (poor, OK or good)
Fruit fly larvae   Developmental stage (1st, 2nd or 3rd instar)   Surviving heat treatment (alive or dead)

In each case, the data that are collected are pairs of categorical values that are measurements of the two variables. Data of this form are usually summarised in a contingency table.

Patients taking prescribed medicine

Many patients do not adhere to the medical regimens prescribed for them. Being able to predict which patients are likely to stop taking their prescribed medication would help in planning treatment.

One study in the USA examined 62 patients who had been diagnosed as having glaucoma (a disease that affects vision) and who had been prescribed eye drops to take at home. Each patient was classified by whether or not they complied with the treatment prescribed (i.e. took their eye drops regularly) and by racial group.

Patient   Race    Compliance
1         White   Complier
2         Black   Non-complier
3         White   Non-complier
...       ...     ...

The data from the 62 patients are summarised in the contingency table below.

Race        Compliers   Non-compliers   Total
White       13          10              23
Non-white   13          26              39
Total       26          36              62
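The row and column totals in the table follow directly from the four cell counts. The following is a minimal Python sketch of that bookkeeping, using only the counts given above; the dictionary layout and the variable names are illustrative assumptions, not part of the original study.

```python
# Cell counts from the glaucoma contingency table above.
# Keys are (race, compliance) pairs; this layout is an illustrative choice.
counts = {
    ("White", "Complier"): 13,
    ("White", "Non-complier"): 10,
    ("Non-white", "Complier"): 13,
    ("Non-white", "Non-complier"): 26,
}

rows = ["White", "Non-white"]
cols = ["Complier", "Non-complier"]

# Marginal totals: sum the cell counts over the other variable.
row_totals = {r: sum(counts[(r, c)] for c in cols) for r in rows}
col_totals = {c: sum(counts[(r, c)] for r in rows) for c in cols}
grand_total = sum(counts.values())

print(row_totals)   # {'White': 23, 'Non-white': 39}
print(col_totals)   # {'Complier': 26, 'Non-complier': 36}
print(grand_total)  # 62
```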

Joint probabilities

To model bivariate categorical data, we assume an underlying population of pairs of categorical values. The data are treated as a random sample of pairs of values from this population. A real finite population occasionally underlies the data, but we must usually hypothesise an infinite underlying population.

The proportion of times that the pair of categorical values (x, y) occurs in the population, that is, its probability, is denoted by p_{xy}. The probabilities p_{xy} are called the joint probabilities for the two variables.

Gambling simulation

A gambler draws a card from a shuffled deck and also tosses a coin, resulting in a pair of categorical values,

Variable       Possible values
Coin side, X   Head or Tail
Card suit, Y   Heart, Club, Diamond or Spade

Each of the eight possible combinations of coin side and card suit is equally likely and would occur in the same proportions in the underlying population.

p_{xy} = 1/8

The probabilities for all pairs are therefore the same:

p_{head, heart} = p_{head, club} = ... = p_{tail, spade} = 1/8 = 0.125

These joint probabilities are shown in blue in the table below.
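As a rough illustration, the joint probabilities for this model can be written out in a few lines of Python. This is only a sketch: the names `coin_sides`, `card_suits` and `joint` are chosen here for illustration, and exact fractions are used so that the check that the probabilities sum to 1 is free of rounding error.

```python
from fractions import Fraction
from itertools import product

coin_sides = ["Head", "Tail"]
card_suits = ["Heart", "Club", "Diamond", "Spade"]

# Every (coin side, card suit) pair is equally likely,
# so each of the 8 pairs has probability 1/8.
pairs = list(product(coin_sides, card_suits))
joint = {pair: Fraction(1, len(pairs)) for pair in pairs}

# Print the joint probabilities as a 2 x 4 table, one row per coin side.
for side in coin_sides:
    print(side, [str(joint[(side, suit)]) for suit in card_suits])

# Joint probabilities over all possible pairs must sum to 1.
total = sum(joint.values())
print(total)  # 1
```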

The two lower tables (in black) describe 100 pairs of values sampled from this population — 100 coin-card pairs. The first of these tables shows the sample as a contingency table of counts; the other table displays the counts as proportions.

Taking repeated samples of 100 coin-card pairs shows how much the sample proportions vary from one sample to the next. If the sample size is increased to 500, the sample proportions are less variable.
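The effect of sample size on sampling variability can also be explored with a short simulation. The sketch below assumes that each coin-card pair is generated independently, with the card drawn from a freshly shuffled full deck each time; the function names, the seed and the number of repeated samples are arbitrary choices made for illustration.

```python
import random
from collections import Counter
from statistics import pstdev

coin_sides = ["Head", "Tail"]
card_suits = ["Heart", "Club", "Diamond", "Spade"]

def take_sample(n, rng):
    """Draw n independent coin-card pairs and return a Counter of pair counts."""
    return Counter(
        (rng.choice(coin_sides), rng.choice(card_suits)) for _ in range(n)
    )

def proportions(counts):
    """Convert a table of counts into a table of sample proportions."""
    n = sum(counts.values())
    return {(s, c): counts[(s, c)] / n for s in coin_sides for c in card_suits}

rng = random.Random(1)  # fixed seed (arbitrary) so that a run can be repeated

# One sample of 100 pairs, shown first as counts and then as proportions.
sample = take_sample(100, rng)
props = proportions(sample)
for side in coin_sides:
    print(side, [sample[(side, suit)] for suit in card_suits])
for side in coin_sides:
    print(side, [round(props[(side, suit)], 2) for suit in card_suits])

def spread(n, repeats=500):
    """Standard deviation of the (Head, Heart) sample proportion over repeated samples."""
    values = [
        proportions(take_sample(n, rng))[("Head", "Heart")]
        for _ in range(repeats)
    ]
    return pstdev(values)

sd_100 = spread(100)
sd_500 = spread(500)
print(sd_100, sd_500)  # the second value should be the smaller of the two
```

The exact counts depend on the seed, but the spread of the sample proportion for samples of 500 should come out noticeably smaller than for samples of 100.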

Interest in the model

We are usually interested in the joint probabilities in the underlying model, rather than the corresponding proportions from the sample data that have been collected (the contingency table). However, the joint probabilities are unknown parameters in most practical situations and must be estimated from the sample data.

Patients taking prescribed medicine

In situations of practical importance, the underlying probabilities are unknown. In the data set that was collected to examine which patients take prescribed medicine, the 62 glaucoma patients in the study were not the focus of attention; the researcher wanted to generalise to all other similar patients.

The population proportions are unknown, but the sample proportions provide estimates of them.
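For the glaucoma data, the estimates are the cell counts divided by the total number of patients. A minimal Python sketch of this calculation follows; the variable names are illustrative, and the counts are those in the contingency table for the 62 patients.

```python
# Cell counts from the glaucoma contingency table (62 patients in total).
counts = {
    ("White", "Complier"): 13,
    ("White", "Non-complier"): 10,
    ("Non-white", "Complier"): 13,
    ("Non-white", "Non-complier"): 26,
}

n = sum(counts.values())

# Each sample proportion estimates the corresponding joint probability.
estimates = {pair: count / n for pair, count in counts.items()}

for pair, p_hat in estimates.items():
    print(pair, f"{p_hat:.3f}")
# ('White', 'Complier') 0.210
# ('White', 'Non-complier') 0.161
# ('Non-white', 'Complier') 0.210
# ('Non-white', 'Non-complier') 0.419
```

Like the joint probabilities they estimate, the four sample proportions sum to 1.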