Data sets with two categorical variables

In this section, we explain how to model data sets that consist of pairs of categorical values — bivariate categorical data.

Examples in which a pair of categorical variables is measured from each 'individual' are:

'Individuals'                    Variable X                                     Variable Y
Employees in a large company     Sex (M or F)                                   Education (none, high school or tertiary)
Customers leaving supermarket    Checkout operator type (full- or part-time)    Rating of quality of service (poor, OK or good)
TVs leaving a production line    Assembler (A, B, C or D)                       Status (defective or OK)

In each case, the data that are collected are pairs of categorical values that are measurements of the two variables. Data of this form are usually summarised in a contingency table.

Requests for promotional material by travellers

Travel agents provide 'destination-specific travel literature' about activities, facilities and prices to tourists free of charge on request. A study was made to investigate the differences between information seekers (who requested such literature) and non-seekers, with the aim of better targeting such material.

A sample of 686 tourists was selected and each was classified as an information seeker or non-seeker, and in various other ways, including educational level.

Tourist    Educational level     Information seeker?
1          High school degree    Yes
2          College degree        Yes
3          Some high school      No
...        ...                   ...

The data from the 686 tourists are summarised in the contingency table below.

                      Information seeker?
Education             Yes     No    Total
Some high school       13     27       40
High school degree     64    118      182
Some college          100    123      223
College degree         59     69      128
Graduate degree        67     46      113
Total                 303    383      686
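
As a minimal sketch, the marginal totals of the contingency table above can be recomputed from its cell counts. The dictionary layout and the names counts, row_totals, col_totals and grand_total are illustrative assumptions and are not part of the original study.

```python
# Cell counts copied from the contingency table above
# (rows: educational level, columns: information seeker Yes/No).
counts = {
    "Some high school":   {"Yes": 13,  "No": 27},
    "High school degree": {"Yes": 64,  "No": 118},
    "Some college":       {"Yes": 100, "No": 123},
    "College degree":     {"Yes": 59,  "No": 69},
    "Graduate degree":    {"Yes": 67,  "No": 46},
}

# Row totals: number of tourists at each educational level.
row_totals = {edu: sum(cells.values()) for edu, cells in counts.items()}

# Column totals: number of information seekers and non-seekers.
col_totals = {"Yes": 0, "No": 0}
for cells in counts.values():
    for answer, n in cells.items():
        col_totals[answer] += n

# Grand total: the sample size.
grand_total = sum(row_totals.values())

print(row_totals)
print(col_totals)
print(grand_total)
```

The recomputed totals should agree with the Total row and column of the table.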

Joint probabilities

To model bivariate categorical data, we assume an underlying population of pairs of categorical values. The data are treated as a random sample of pairs of values from this population. A real finite population occasionally underlies the data, but we must usually hypothesise an infinite underlying population.

The proportion of times that the pair of categorical values (x, y) occurs in the population, which is its probability, is denoted by p_xy. The probabilities p_xy are called the joint probabilities for the two variables.

Gambling simulation

A gambler draws a card from a shuffled deck and also tosses a coin, resulting in a pair of categorical values:

Variable        Possible values
Coin side, X    Head or Tail
Card suit, Y    Heart, Club, Diamond or Spade

Each of the eight possible combinations of coin side and card suit is equally likely, so each occurs in the same proportion in the underlying population and p_xy = 1/8 for every pair (x, y).

The probabilities for all pairs are therefore the same,

p_{head, heart}  =  p_{head, club}  =  ...  =  p_{tail, spade}  =  1/8  =  0.125
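
The joint probabilities for the coin and card example can be listed explicitly. The following is a small sketch that uses exact fractions; the variable names are illustrative choices, not notation from the text.

```python
from fractions import Fraction
from itertools import product

coin_sides = ["Head", "Tail"]
card_suits = ["Heart", "Club", "Diamond", "Spade"]

# All (coin side, card suit) pairs in the population.
pairs = list(product(coin_sides, card_suits))

# Every pair is equally likely, so each joint probability is 1 / (number of pairs).
joint = {pair: Fraction(1, len(pairs)) for pair in pairs}

for (side, suit), p in joint.items():
    print(f"{side:4} {suit:8} {p} = {float(p)}")

# Joint probabilities over all possible pairs must add up to 1.
total = sum(joint.values())
print(total)
```

Summing the joint probabilities over all eight pairs gives 1, as it must for any set of joint probabilities.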

These joint probabilities describe the population.

A sample of 100 pairs of values from this population (100 coin-card pairs) can be summarised as a contingency table of counts, or as a table that displays the counts as proportions.

Repeated samples of 100 coin-card pairs give different sample proportions. If the sample size is increased to 500 and the sampling is repeated, the sample proportions are less variable than they are for samples of 100.
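
The effect of sample size described above can be imitated with a short simulation. This is only a sketch under stated assumptions: each coin toss and each card suit is generated independently and uniformly (as if the deck were reshuffled before every draw), the seed and the 200 repeats are arbitrary choices, and the helper names take_sample and spread are hypothetical.

```python
import random
from collections import Counter
from statistics import pstdev

COIN_SIDES = ["Head", "Tail"]
CARD_SUITS = ["Heart", "Club", "Diamond", "Spade"]

def take_sample(n, rng):
    """Simulate n coin-card pairs and return the eight sample proportions."""
    counts = Counter(
        (rng.choice(COIN_SIDES), rng.choice(CARD_SUITS)) for _ in range(n)
    )
    return {
        (side, suit): counts[(side, suit)] / n
        for side in COIN_SIDES
        for suit in CARD_SUITS
    }

def spread(n, rng, repeats=200):
    """Standard deviation of the (Head, Heart) proportion over repeated samples."""
    proportions = [take_sample(n, rng)[("Head", "Heart")] for _ in range(repeats)]
    return pstdev(proportions)

rng = random.Random(1)  # fixed seed (an arbitrary choice) so a run is repeatable

sample_100 = take_sample(100, rng)  # one sample of 100 pairs, as proportions
spread_100 = spread(100, rng)       # variability with samples of 100
spread_500 = spread(500, rng)       # variability with samples of 500

print(sample_100)
print(spread_100, spread_500)
```

The two printed spreads give a rough numerical measure of how much the sample proportion for one pair varies at each sample size.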

Interest in the model

We are usually interested in the joint probabilities in the underlying model, rather than the corresponding proportions from the sample data that have been collected (the contingency table). However, the joint probabilities are unknown parameters in most practical situations and must be estimated from the sample data.

Requests for promotional material by travellers

In situations of practical importance, the underlying probabilities are unknown. In the data set that was collected to examine which travellers request promotional material, the 686 travellers in the study were not the focus of attention — the researcher wanted to generalise to all other similar travellers.

The population proportions are unknown, but the sample proportions provide estimates of them.
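
As a sketch of this estimation step, each cell count in the travellers' contingency table can be divided by the sample size of 686 to give an estimate of the corresponding joint probability. The names counts, n and estimates are illustrative assumptions.

```python
# Cell counts from the travellers' contingency table,
# keyed by (educational level, information seeker?).
counts = {
    ("Some high school", "Yes"): 13,    ("Some high school", "No"): 27,
    ("High school degree", "Yes"): 64,  ("High school degree", "No"): 118,
    ("Some college", "Yes"): 100,       ("Some college", "No"): 123,
    ("College degree", "Yes"): 59,      ("College degree", "No"): 69,
    ("Graduate degree", "Yes"): 67,     ("Graduate degree", "No"): 46,
}

n = sum(counts.values())  # sample size

# The sample proportion in each cell estimates the unknown joint probability.
estimates = {pair: count / n for pair, count in counts.items()}

for (education, seeker), estimate in estimates.items():
    print(f"{education:20} {seeker:3} {estimate:.3f}")
```

These values are estimates only; a different sample of travellers would give somewhat different proportions.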