Data sets with two categorical variables
In this section, we explain how to model data sets that consist of pairs of categorical values — bivariate categorical data.
Examples where pairs of categorical variables are measured from 'individuals' are:
'Individuals' | Variable X | Variable Y |
---|---|---|
Employees in a large company | Sex (M or F) | Education (none, high school or tertiary) |
Customers leaving supermarket | Checkout operator type (full- or part-time) | Rating of quality of service (poor, OK or good) |
TVs leaving a production line | Assembler (A, B, C or D) | Status (defective or OK) |
In each case, the data that are collected are pairs of categorical values that are measurements of the two variables. Data of this form are usually summarised in a contingency table.
Requests for promotional material by travellers
Travel agents provide 'destination-specific travel literature' about activities, facilities and prices to tourists free of charge on request. A study was made to investigate the differences between information seekers (who requested such literature) and non-seekers, with the aim of better targeting such material.
A sample of 686 tourists was selected and each was classified as an information seeker or non-seeker and in various other ways including educational level.
Tourist | Educational level | Information seeker? |
---|---|---|
1 | High school degree | Yes |
2 | College degree | Yes |
3 | Some high school | No |
... | ... | ... |
The data from the 686 tourists are summarised in the contingency table below.
Education | Information seeker: Yes | Information seeker: No | Total |
---|---|---|---|
Some high school | 13 | 27 | 40 |
High school degree | 64 | 118 | 182 |
Some college | 100 | 123 | 223 |
College degree | 59 | 69 | 128 |
Graduate degree | 67 | 46 | 113 |
Total | 303 | 383 | 686 |
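As a rough illustration of how a list of pairs is turned into a contingency table, the sketch below tallies only the three tourists shown in the raw-data listing above (the remaining records are not reproduced in this section). The names `records`, `cell_counts` and `marginal_totals` are hypothetical and chosen for this example only.

```python
from collections import Counter

# The three tourists listed in the raw-data table above; the other
# records are elided in the text, so they are not included here.
records = [
    ("High school degree", "Yes"),
    ("College degree", "Yes"),
    ("Some high school", "No"),
]

# Count how often each (education, information seeker) pair occurs.
cell_counts = Counter(records)


def marginal_totals(counts):
    """Return row totals, column totals and the grand total of a table of cell counts."""
    row_totals, col_totals = Counter(), Counter()
    for (row, col), n in counts.items():
        row_totals[row] += n
        col_totals[col] += n
    return row_totals, col_totals, sum(counts.values())


row_totals, col_totals, grand_total = marginal_totals(cell_counts)
print(cell_counts[("College degree", "Yes")], col_totals["Yes"], grand_total)
```

Applied to all 686 records, the same tally would give the cell counts and totals of the contingency table.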
Joint probabilities
To model bivariate categorical data, we assume an underlying population of pairs of categorical values. The data are treated as a random sample of pairs of values from this population. A real finite population occasionally underlies the data, but we must usually hypothesise an infinite underlying population.
The proportion of times that the pair of categorical values (x, y) occurs in the population, which is its probability, is denoted by $p_{xy}$. The probabilities $p_{xy}$ are called the joint probabilities for the two variables.
Gambling simulation
A gambler draws a card from a shuffled deck and also tosses a coin, resulting in a pair of categorical values:
Variable | Possible values |
---|---|
Coin side, X | Head or Tail |
Card suit, Y | Heart, Club, Diamond or Spade |
Each of the eight possible combinations of coin side and card suit is equally likely and would occur in the same proportions in the underlying population.
The probabilities for all pairs are therefore the same:
$$p_{\text{head, heart}} = p_{\text{head, club}} = \dots = p_{\text{tail, spade}} = \frac{1}{8} = 0.125$$
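The value 1/8 can be checked by enumerating the eight pairs. This is a minimal sketch that assumes a fair coin, a full well-shuffled deck (each suit has probability 1/4) and a coin toss that does not depend on the card drawn; the variable names are illustrative only.

```python
from fractions import Fraction
from itertools import product

coin_sides = ["Head", "Tail"]
card_suits = ["Heart", "Club", "Diamond", "Spade"]

# Assumptions: fair coin, four equally likely suits, and the coin toss
# is independent of the card draw.
p_coin = {side: Fraction(1, 2) for side in coin_sides}
p_suit = {suit: Fraction(1, 4) for suit in card_suits}

# Joint probability of each (coin side, card suit) pair.
joint = {(x, y): p_coin[x] * p_suit[y] for x, y in product(coin_sides, card_suits)}

for pair, p in joint.items():
    print(pair, p, float(p))
```

Exact fractions are used so that the eight joint probabilities can be seen to add to exactly 1.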
These joint probabilities are shown in the table below.
Coin side | Heart | Club | Diamond | Spade |
---|---|---|---|---|
Head | 0.125 | 0.125 | 0.125 | 0.125 |
Tail | 0.125 | 0.125 | 0.125 | 0.125 |
A sample of 100 pairs of values from this population (100 coin-card pairs) can be summarised as a contingency table of counts, and the counts can also be displayed as proportions.
Repeated samples of 100 coin-card pairs vary from one sample to the next. If the sample size is increased to 500, the sample proportions are less variable.
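The sampling experiment can be imitated in code. The sketch below is not the original simulation: it assumes each of the eight pairs has probability 1/8, and the function names, the seed and the choice of 200 repeated samples are arbitrary. It measures how much the sample proportion of one pair (Head, Heart) varies between samples of size 100 and of size 500.

```python
import random
from collections import Counter
from statistics import pstdev

COIN_SIDES = ["Head", "Tail"]
CARD_SUITS = ["Heart", "Club", "Diamond", "Spade"]


def take_sample(n, rng):
    """Draw n coin-card pairs; each of the eight pairs has probability 1/8."""
    return Counter((rng.choice(COIN_SIDES), rng.choice(CARD_SUITS)) for _ in range(n))


def proportions(counts):
    """Convert a contingency table of counts into sample proportions."""
    n = sum(counts.values())
    return {(x, y): counts[(x, y)] / n for x in COIN_SIDES for y in CARD_SUITS}


def spread_of_proportion(n, repeats, rng, pair=("Head", "Heart")):
    """Standard deviation of one pair's sample proportion over repeated samples of size n."""
    values = [proportions(take_sample(n, rng))[pair] for _ in range(repeats)]
    return pstdev(values)


rng = random.Random(1)  # arbitrary seed so that the run can be repeated
print(proportions(take_sample(100, rng)))
print(spread_of_proportion(100, 200, rng), spread_of_proportion(500, 200, rng))
```

The second printed line gives the spread for samples of 100 followed by the spread for samples of 500; the latter should be noticeably smaller.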
Interest in the model
We are usually interested in the joint probabilities in the underlying model, rather than the corresponding proportions from the sample data that have been collected (the contingency table). However, the joint probabilities are unknown parameters in most practical situations and must be estimated from the sample data.
Requests for promotional material by travellers
In situations of practical importance, the underlying probabilities are unknown. In the data set that was collected to examine which travellers request promotional material, the 686 travellers in the study were not the focus of attention — the researcher wanted to generalise to all other similar travellers.
The population proportions are unknown, but the sample proportions provide estimates of them.
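For the tourist data, each estimate is a cell count from the contingency table divided by the sample size of 686. The sketch below carries out that arithmetic; the dictionary layout and variable names are choices made for this example.

```python
# Cell counts from the contingency table of 686 tourists:
# (education, information seeker?) -> count.
counts = {
    ("Some high school", "Yes"): 13, ("Some high school", "No"): 27,
    ("High school degree", "Yes"): 64, ("High school degree", "No"): 118,
    ("Some college", "Yes"): 100, ("Some college", "No"): 123,
    ("College degree", "Yes"): 59, ("College degree", "No"): 69,
    ("Graduate degree", "Yes"): 67, ("Graduate degree", "No"): 46,
}

n = sum(counts.values())  # sample size

# Sample proportions, used as estimates of the unknown joint probabilities.
estimates = {pair: count / n for pair, count in counts.items()}

for (education, seeker), p in estimates.items():
    print(f"{education:20s} {seeker:3s} {p:.3f}")
```

The estimated joint probabilities add to 1, just as the unknown population probabilities do.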