Assessing independence, based on a sample

Independence is an important special case of models for bivariate data. However, it is a property of the joint population probabilities, and in most practical situations these are unknown.

We must assess independence from a sample of individuals — a contingency table.


Recruiting source and success

A sample of 1,400 store clerks hired in one year by a large US retailing chain was selected by researchers who wanted to determine whether the recruiting source for employees is related to whether they perform satisfactorily in their job (determined from supervisor evaluations). Four recruiting sources were defined.

Observed and estimated cell counts (estimated counts in parentheses)
  Unsatisfactory Satisfactory Total
Employee referral 167 (149.9) 85 (102.1) 252
In-store notice 383 (383.2) 261 (260.8) 644
Employment agency 33 (29.8) 17 (20.2) 50
Media announcement 250 (270.1) 204 (183.9) 454
Total 833 567 1400

Independence would be an important property of these data, since it would imply that employees recruited from all sources have the same probability of satisfactory performance.

Are these sample data consistent with a model of independence?


Marginal distributions and independence

The marginal counts in a contingency table describe the univariate distributions of the two variables on their own, but do not tell you anything about their relationship. For example, the two contingency tables below have the same margins.

Strong relationship
  C1 C2 C3 Total
R1 30 0 0 30
R2 0 40 0 40
R3 0 0 30 30
Total 30 40 30 100
 
No relationship
  C1 C2 C3 Total
R1 9 12 9 30
R2 12 16 12 40
R3 9 12 9 30
Total 30 40 30 100

However, the 'Strong relationship' table shows an extremely strong relationship: if the row category is known, we can predict the column category exactly. In contrast, there is no evidence of association in the 'No relationship' table, since each row contains the column categories in the same proportions (30%, 40% and 30%).
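The comparison of the two tables can be checked with a short Python sketch. The function names `margins` and `row_proportions` are illustrative choices, not part of the original text; the counts are copied from the two tables.

```python
# The two example tables from the text (rows R1..R3, columns C1..C3).
strong_relationship = [[30, 0, 0], [0, 40, 0], [0, 0, 30]]
no_relationship = [[9, 12, 9], [12, 16, 12], [9, 12, 9]]


def margins(table):
    """Return (row totals, column totals) for a table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    return row_totals, col_totals


def row_proportions(table):
    """Return each row divided by its own total (proportions within each row)."""
    return [[count / sum(row) for count in row] for row in table]


# Both tables have identical margins ...
print(margins(strong_relationship))
print(margins(no_relationship))

# ... but very different proportions within the rows.
for row in row_proportions(strong_relationship):
    print(row)
for row in row_proportions(no_relationship):
    print(row)
```

The margins agree for both tables, whereas the within-row proportions are identical across rows only for the 'No relationship' table.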

Estimated cell counts under independence

In practice, the pattern of counts in a contingency table is rarely so easily interpreted. A first step is to determine the pattern that is most consistent with independence of the rows and columns, based on the observed margins.

  C1 C2 C3 Total
R1 ? ? ? 30
R2 ? ? ? 40
R3 ? ? ? 30
Total 30 40 30 100

If the rows and columns are independent, the conditional probabilities are the same for each row, so we distribute each marginal row total between the column categories in the same proportions — determined by the marginal proportions for the column categories.

This pattern gives the estimated cell counts, which can be evaluated with the following formula:

$$ e_{xy} = \frac{n_x \, n_y}{n} $$

where $e_{xy}$ is the estimated count for the cell in row x and column y, $n$ denotes the total for the whole table, and $n_x$ and $n_y$ denote the marginal totals for row x and column y.
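As a minimal sketch, the estimated counts can be computed from the margins alone: each cell is its row total times its column total, divided by the table total. The function name `estimated_counts` is a hypothetical helper introduced here for illustration, applied to the margins of the 3 x 3 example above.

```python
def estimated_counts(row_totals, col_totals):
    """Estimated cell counts under independence: row total * column total / n."""
    n = sum(row_totals)
    if n != sum(col_totals):
        raise ValueError("row and column totals must add up to the same table total")
    return [[r * c / n for c in col_totals] for r in row_totals]


# Margins of the 3 x 3 example table (rows R1..R3, columns C1..C3).
estimated = estimated_counts([30, 40, 30], [30, 40, 30])
for row in estimated:
    print(row)
```

For these margins the result reproduces the counts in the 'No relationship' table shown earlier, which is the pattern with identical proportions in every row.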

Recruiting source and success

We now find the pattern of estimated cell counts for the recruitment data that is most consistent with independence of recruiting source and success, based only on the margins of the observed contingency table.

Sample Data
  Unsatisfactory Satisfactory Total
Employee referral ? ? 252
In-store notice ? ? 644
Employment agency ? ? 50
Media announcement ? ? 454
Total 833 567 1400

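If satisfactory performance is indeed independent of recruitment, then we estimate that the proportion of the 252 recruited from 'Employee referral' who are satisfactory would be the same as the marginal proportion who are satisfactory. Since 567 out of the total 1400 in the study are satisfactory, we therefore expect that the number recruited from 'Employee referral' who are satisfactory would be

$$ 252 \times \frac{567}{1400} = 102.06 \approx 102.1 $$

This is an example of the general formula that was presented earlier, with row total $n_x = 252$, column total $n_y = 567$ and table total $n = 1400$:

$$ e_{xy} = \frac{n_x \, n_y}{n} = \frac{252 \times 567}{1400} $$

The complete table of estimated cell counts is:

Estimated cell counts
  Unsatisfactory Satisfactory Total
Employee referral 149.9 102.1 252
In-store notice 383.2 260.8 644
Employment agency 29.8 20.2 50
Media announcement 270.1 183.9 454
Total 833 567 1400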

If recruitment and success are indeed independent, then the observed cell counts in the sample data should be similar to these estimated cell counts.

Observed and estimated cell counts (estimated counts in parentheses)
  Unsatisfactory Satisfactory Total
Employee referral 167 (149.9) 85 (102.1) 252
In-store notice 383 (383.2) 261 (260.8) 644
Employment agency 33 (29.8) 17 (20.2) 50
Media announcement 250 (270.1) 204 (183.9) 454
Total 833 567 1400
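The estimated counts in the table above can be reproduced from the observed counts with a few lines of Python. This is a sketch only: the function name `estimated_counts` and the variable names are illustrative, and the observed counts are copied from the table.

```python
def estimated_counts(observed_rows):
    """Estimated counts under independence, computed from the table's own margins."""
    row_totals = [sum(row) for row in observed_rows]
    col_totals = [sum(col) for col in zip(*observed_rows)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


sources = [
    "Employee referral",
    "In-store notice",
    "Employment agency",
    "Media announcement",
]
# Observed counts per source: [unsatisfactory, satisfactory].
observed = [[167, 85], [383, 261], [33, 17], [250, 204]]

estimated = estimated_counts(observed)
for source, row in zip(sources, estimated):
    print(f"{source:<20}{row[0]:>10.2f}{row[1]:>10.2f}")
```

Rounded to one decimal place, these values agree with the bracketed counts in the table.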

Comparison of observed and estimated cell counts

The hypothesis of independence is assessed by asking whether the observed and estimated cell counts are 'sufficiently close' — are the observed counts consistent with the counts estimated under independence? We address this formally in the following pages.
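Before any formal assessment, one informal way to look at 'closeness' is to tabulate the differences between observed and estimated counts. The sketch below does only that; it is a descriptive aid under the assumption that simple differences are of interest, not the formal method developed in the following pages. The function name `differences` is hypothetical.

```python
def differences(observed_rows):
    """Observed minus estimated counts, with estimates computed under independence."""
    row_totals = [sum(row) for row in observed_rows]
    col_totals = [sum(col) for col in zip(*observed_rows)]
    n = sum(row_totals)
    return [
        [obs - r * c / n for obs, c in zip(row, col_totals)]
        for row, r in zip(observed_rows, row_totals)
    ]


sources = [
    "Employee referral",
    "In-store notice",
    "Employment agency",
    "Media announcement",
]
# Observed counts per source: [unsatisfactory, satisfactory].
observed = [[167, 85], [383, 261], [33, 17], [250, 204]]

diffs = differences(observed)
for source, row in zip(sources, diffs):
    print(f"{source:<20}{row[0]:>+10.2f}{row[1]:>+10.2f}")
```

Because the estimated counts share the observed margins, the differences in each row and each column sum to zero. The largest differences are for 'Employee referral' (about 17 more unsatisfactory clerks than estimated) and 'Media announcement' (about 20 fewer); whether differences of this size are consistent with independence is the question that the formal assessment answers.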