Lurking variables and relationships between categorical variables
The relationship between two categorical variables, X and Y, can also be strongly influenced by a third lurking variable, Z. Repeating the point made before, the marginal relationship between X and Y can be very different from their conditional relationship given Z.
When the direction of the relationship reverses, the effect is called Simpson's paradox. As with other 'paradoxes', there is no real contradiction; it just takes a bit more thought to understand why your initial intuition is wrong.
Smoking and survival
In the early 1970s, a one-in-six health survey was conducted in Whickham in the North of England. Twenty years later, a follow-up study was conducted. The table below shows how many of the 1,314 women classified in the earlier survey as either current smokers or never having smoked survived until the second survey.
A naive examination of the data suggests that smoking decreases the probability of dying.
Click Slice to see the corresponding data for different age groups. Observe that the probability of dying is higher for the smokers in each age group. Age is a lurking variable, and the conditional relationship between smoking and survival, given age, is the reverse of their marginal relationship.
The marginal relationship between smoking and survival is misleading. The conditional relationships for different age groups are much more meaningful.
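The comparison between marginal and conditional proportions can be sketched in a few lines of Python. The counts below are an assumption: they are the figures commonly quoted from the published Whickham follow-up, pooled here into three age bands chosen for illustration, and they may not match the table or the age groups shown on this page. The names COUNTS, death_rate, conditional_rates and marginal_rates are likewise hypothetical.

```python
# Assumed counts (dead, alive) per age band and smoking status.
# The three age bands are an illustrative choice, not necessarily the page's.
COUNTS = {
    "18-44": {"smoker": (19, 269), "non-smoker": (13, 327)},
    "45-64": {"smoker": (78, 167), "non-smoker": (52, 147)},
    "65+": {"smoker": (42, 7), "non-smoker": (165, 28)},
}


def death_rate(dead, alive):
    """Proportion of women in a cell who died before the follow-up."""
    return dead / (dead + alive)


def conditional_rates(counts):
    """Death rate for each smoking status within each age band."""
    return {
        age: {status: death_rate(*cell) for status, cell in by_status.items()}
        for age, by_status in counts.items()
    }


def marginal_rates(counts):
    """Death rate for each smoking status, pooling all age bands."""
    totals = {}
    for by_status in counts.values():
        for status, (dead, alive) in by_status.items():
            d, a = totals.get(status, (0, 0))
            totals[status] = (d + dead, a + alive)
    return {status: death_rate(d, a) for status, (d, a) in totals.items()}


if __name__ == "__main__":
    for age, rates in conditional_rates(COUNTS).items():
        print(age, {s: round(r, 3) for s, r in rates.items()})
    print("overall", {s: round(r, 3) for s, r in marginal_rates(COUNTS).items()})
```

Under these assumed counts, the pooled death rate is lower for smokers than for non-smokers, while within every age band it is higher for smokers, which is the reversal described above.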
Graphical representation
This reversal can be illustrated in a proportional Venn diagram. In this diagram, the area of the rectangle for any combination of smoking, age and survival is proportional to the number of women in that combination.
Initially the rectangles are arranged to display the conditional proportions. Click any blue rectangle to display the proportions dying in that age group for smokers and non-smokers, and observe that P(dead) is higher for smokers than non-smokers in each age group.
Now select Group by smoker? from the pop-up menu. The rectangles are rearranged to show the overall proportions of smokers and non-smokers dying. Observe that the overall proportion of smokers dying is less than the overall proportion of non-smokers dying.
The reason for the reversal of the relationship is that most of the older women were non-smokers. Older women had a much higher proportion of deaths, regardless of smoking status, and this increases the marginal proportion of deaths for the non-smokers.
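This explanation can be checked numerically: each marginal death rate is a weighted average of the age-specific death rates, with weights given by the age mix of that smoking group. The sketch below is a minimal illustration, not the method used on this page. It assumes counts commonly quoted from the published Whickham follow-up, pooled into three age bands chosen for illustration; the names COUNTS, age_mix_and_rates and marginal_from_mix are hypothetical.

```python
# Assumed counts (dead, alive) per age band and smoking status.
# The three age bands are an illustrative choice, not necessarily the page's.
COUNTS = {
    "18-44": {"smoker": (19, 269), "non-smoker": (13, 327)},
    "45-64": {"smoker": (78, 167), "non-smoker": (52, 147)},
    "65+": {"smoker": (42, 7), "non-smoker": (165, 28)},
}


def age_mix_and_rates(counts, status):
    """For one smoking status, return {age: (share of group, death rate)}."""
    group_total = sum(sum(by_status[status]) for by_status in counts.values())
    rows = {}
    for age, by_status in counts.items():
        dead, alive = by_status[status]
        n = dead + alive
        rows[age] = (n / group_total, dead / n)
    return rows


def marginal_from_mix(rows):
    """Marginal death rate as the share-weighted average of age-specific rates."""
    return sum(share * rate for share, rate in rows.values())


if __name__ == "__main__":
    for status in ("smoker", "non-smoker"):
        rows = age_mix_and_rates(COUNTS, status)
        for age, (share, rate) in rows.items():
            print(f"{status:11s} {age:6s} share={share:.3f} death rate={rate:.3f}")
        print(f"{status:11s} overall death rate={marginal_from_mix(rows):.3f}")
```

With these assumed counts, the oldest band makes up a much larger share of the non-smokers than of the smokers, so the high death rate in that band carries more weight in the non-smokers' overall rate.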