Assessing independence, based on a sample

Independence is an important special case of models for bivariate data. However, it is a property of the joint population probabilities, and in most practical situations these are unknown.

We must therefore assess independence from a sample of individuals, summarised as a contingency table.


Skiing injuries

The contingency table below categorises a sample of 214 child skiers according to their skiing ability and whether or not they were injured.

Sample Data
              Injured  Uninjured  Total
Beginner           20         60     80
Intermediate        9         84     93
Advanced            2         39     41
Total              31        183    214

Independence would be an important characteristic of the child skiers since it would imply that the occurrence of injury did not depend on skiing ability.

Are these sample data consistent with a model of independence?


Marginal distributions and independence

The marginal counts in a contingency table describe the univariate distributions of the two variables on their own, but do not tell you anything about their relationship. For example, the two contingency tables below have the same margins.

Strong relationship
        C1   C2   C3   Total
R1      30    0    0      30
R2       0   40    0      40
R3       0    0   30      30
Total   30   40   30     100
 
No relationship
        C1   C2   C3   Total
R1       9   12    9      30
R2      12   16   12      40
R3       9   12    9      30
Total   30   40   30     100

However, the first table (Strong relationship) shows an extremely strong relationship: if the row category is known, we can predict the column category exactly. In contrast, the second table (No relationship) shows no evidence of association: each row contains the column categories in the same proportions (30%, 40% and 30%).
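The comparison can be checked numerically. The following Python sketch uses only the standard library; the variable and function names are illustrative and not part of the original text. It confirms that the two tables share the same margins and then lists the column proportions within each row.

```python
from fractions import Fraction

# The two example tables (rows R1-R3, columns C1-C3), without their margins.
strong = [[30, 0, 0], [0, 40, 0], [0, 0, 30]]
no_relationship = [[9, 12, 9], [12, 16, 12], [9, 12, 9]]


def margins(table):
    """Return (row totals, column totals) for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    return row_totals, col_totals


def row_proportions(table):
    """Express each row's counts as exact fractions of that row's total."""
    return [[Fraction(count, sum(row)) for count in row] for row in table]


print(margins(strong) == margins(no_relationship))  # both tables have the same margins

for name, table in [("strong", strong), ("no relationship", no_relationship)]:
    print(name)
    for row in row_proportions(table):
        print("  ", [str(p) for p in row])
```

In the second table every row has the proportions 3/10, 2/5 and 3/10, whereas in the first table each row puts all of its count in a different column.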

Estimated cell counts under independence

In practice, the pattern of counts in a contingency table is rarely so easily interpreted. A first step is to determine the pattern that is most consistent with independence of the rows and columns, based on the observed margins.

        C1   C2   C3   Total
R1       ?    ?    ?      30
R2       ?    ?    ?      40
R3       ?    ?    ?      30
Total   30   40   30     100

If the rows and columns are independent, the conditional probabilities are the same for each row, so we distribute each marginal row total between the column categories in the same proportions — determined by the marginal proportions for the column categories.

This pattern gives the estimated cell counts. Writing $e_{xy}$ for the estimated count in row x and column y, they can be evaluated with the formula

$$e_{xy} = \frac{n_x \, n_y}{n}$$

where $n$ denotes the total for the whole table and $n_x$ and $n_y$ denote the marginal totals for row x and column y.
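As a sketch of this calculation, the Python function below multiplies each row total by each column total and divides by the overall total. The function name `estimated_counts` is illustrative and not from the original text. It is applied to the margins of the 3 x 3 table above.

```python
def estimated_counts(row_totals, col_totals):
    """Estimated cell counts under independence: n_x * n_y / n for each cell."""
    n = sum(row_totals)
    if n != sum(col_totals):
        raise ValueError("row and column totals must add up to the same overall total")
    return [[n_x * n_y / n for n_y in col_totals] for n_x in row_totals]


# Margins of the 3 x 3 table with unknown cells: rows 30, 40, 30 and columns 30, 40, 30.
for row in estimated_counts([30, 40, 30], [30, 40, 30]):
    print(row)
# [9.0, 12.0, 9.0]
# [12.0, 16.0, 12.0]
# [9.0, 12.0, 9.0]
```

With these margins the estimated counts coincide with the "No relationship" table shown earlier, which is the pattern in which every row has the same column proportions.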

Skiing injuries

We now find the pattern of estimated cell counts for the skiing injury data that is most consistent with independence of ability level and injury, based only on the margins of the observed contingency table.

              Injured  Uninjured  Total
Beginner            ?          ?     80
Intermediate        ?          ?     93
Advanced            ?          ?     41
Total              31        183    214

If injury is indeed independent of skiing ability, we estimate that the proportion of the 80 beginners who are injured is the same as the marginal proportion who are injured. Since 31 of the 214 skiers are injured, the estimated number of injured beginners is

$$80 \times \frac{31}{214} \approx 11.59$$

This is an instance of the general formula presented earlier, with $n_x = 80$ (the Beginner row total), $n_y = 31$ (the Injured column total) and $n = 214$:

$$\frac{n_x \, n_y}{n} = \frac{80 \times 31}{214} \approx 11.59$$

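The complete table of estimated cell counts, obtained by applying the same calculation to every cell and rounding to two decimal places, is:

              Injured  Uninjured  Total
Beginner        11.59      68.41     80
Intermediate    13.47      79.53     93
Advanced         5.94      35.06     41
Total           31        183       214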
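The estimated counts for the skiing data can be reproduced with a few lines of Python. This is a minimal sketch using only the standard library; the variable names are illustrative and the observed counts are those of the sample table given earlier.

```python
# Observed counts from the skiing sample (columns: Injured, Uninjured).
abilities = ["Beginner", "Intermediate", "Advanced"]
observed = [[20, 60], [9, 84], [2, 39]]

row_totals = [sum(row) for row in observed]        # 80, 93, 41
col_totals = [sum(col) for col in zip(*observed)]  # 31, 183
n = sum(row_totals)                                # 214

# Estimated count for each cell under independence: n_x * n_y / n.
estimated = [[n_x * n_y / n for n_y in col_totals] for n_x in row_totals]

for ability, row in zip(abilities, estimated):
    print(f"{ability:<13}" + "".join(f"{count:>10.2f}" for count in row))
```

The estimated counts in each row and each column add up to the observed margins, so they differ from the observed table only in how each total is spread across the cells.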