Lurking variables and relationships between categorical variables
When the direction of a relationship between two categorical variables reverses once a lurking variable is taken into account, the effect is called Simpson's paradox. As with other 'paradoxes', there is no real contradiction; it just takes a bit more thought to understand why your initial intuition is wrong.
Smoking and survival
In a health survey, 1,314 women were classified as smokers or non-smokers, and their survival after 20 years was recorded.
Survival, all ages

Smoker? | Dead | Alive | Total | P(Dead) |
---|---|---|---|---|
Smoker | 139 | 443 | 582 | 0.239 |
Non-smoker | 230 | 502 | 732 | 0.314 |

A naive examination of the data suggests that smoking decreases the probability of dying, but the opposite is true if the women are split into age groups: within every age group, the smokers have the higher death rate.
Survival, age 18-44

Smoker? | Dead | Alive | Total | P(Dead) |
---|---|---|---|---|
Smoker | 19 | 269 | 288 | 0.066 |
Non-smoker | 13 | 327 | 340 | 0.038 |

Survival, age 45-64

Smoker? | Dead | Alive | Total | P(Dead) |
---|---|---|---|---|
Smoker | 78 | 167 | 245 | 0.318 |
Non-smoker | 52 | 147 | 199 | 0.261 |

Survival, age 65+

Smoker? | Dead | Alive | Total | P(Dead) |
---|---|---|---|---|
Smoker | 42 | 7 | 49 | 0.857 |
Non-smoker | 165 | 28 | 193 | 0.855 |

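The reversal can be checked directly from the counts. The following is a minimal Python sketch, not part of the original survey analysis; the names `counts`, `death_rate`, `by_age` and `overall` are chosen here purely for illustration. It recomputes P(Dead) within each age group and again after pooling the age groups.

```python
# Counts copied from the tables above: (dead, alive) for each
# age group and smoking status.
counts = {
    "18-44": {"Smoker": (19, 269), "Non-smoker": (13, 327)},
    "45-64": {"Smoker": (78, 167), "Non-smoker": (52, 147)},
    "65+":   {"Smoker": (42, 7),   "Non-smoker": (165, 28)},
}


def death_rate(dead, alive):
    """Proportion of women who died among dead + alive."""
    return dead / (dead + alive)


# Death rate within each age group.
by_age = {
    age: {status: death_rate(*cell) for status, cell in group.items()}
    for age, group in counts.items()
}

# Death rate after pooling the three age groups (the all-ages table).
overall = {}
for status in ("Smoker", "Non-smoker"):
    dead = sum(group[status][0] for group in counts.values())
    alive = sum(group[status][1] for group in counts.values())
    overall[status] = death_rate(dead, alive)

for age, rates in by_age.items():
    print(age, {s: round(r, 3) for s, r in rates.items()})
print("All ages", {s: round(r, 3) for s, r in overall.items()})
```

The printed rates should match the P(Dead) columns above: smokers have the higher rate in each age group, yet the lower rate once the groups are pooled.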
Proportional Venn diagram
Simpson's paradox is explained in the proportional Venn diagram below. The area of each rectangle is proportional to the number of women with that combination of values for the variables.
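Most of the women aged 65+ were non-smokers (193 of 242). This age group has by far the highest death rate, and it makes up a much larger share of the non-smokers (193 of 732) than of the smokers (49 of 582). This increased the overall death rate of the non-smokers.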
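One way to see the role of the age mix is to write each overall death rate as a weighted average of the age-specific rates, with weights equal to the proportion of that group in each age band. The sketch below does this in Python. The final step, which applies the non-smokers' age mix to the smokers' rates, is a hypothetical comparison added here for illustration and is not part of the original analysis; all function and variable names are likewise assumptions.

```python
# Age-specific death rates and group sizes, copied from the tables above.
rates = {
    "Smoker":     {"18-44": 19 / 288, "45-64": 78 / 245, "65+": 42 / 49},
    "Non-smoker": {"18-44": 13 / 340, "45-64": 52 / 199, "65+": 165 / 193},
}
sizes = {
    "Smoker":     {"18-44": 288, "45-64": 245, "65+": 49},
    "Non-smoker": {"18-44": 340, "45-64": 199, "65+": 193},
}


def age_weights(group_sizes):
    """Proportion of a group that falls in each age band."""
    total = sum(group_sizes.values())
    return {age: n / total for age, n in group_sizes.items()}


def weighted_rate(rate_by_age, weights):
    """Weighted average of age-specific death rates."""
    return sum(rate_by_age[age] * weights[age] for age in rate_by_age)


weights = {status: age_weights(sizes[status]) for status in sizes}

# Each group's overall rate is its own rates weighted by its own age mix.
overall = {s: weighted_rate(rates[s], weights[s]) for s in rates}

# Hypothetical: the smokers' rates combined with the non-smokers' age mix.
smokers_with_nonsmoker_ages = weighted_rate(rates["Smoker"], weights["Non-smoker"])

print({s: round(w["65+"], 3) for s, w in weights.items()})
print({s: round(r, 3) for s, r in overall.items()})
print(round(smokers_with_nonsmoker_ages, 3))
```

Under this hypothetical, giving the smokers the same age mix as the non-smokers raises the smokers' overall death rate above that of the non-smokers, which is consistent with the age-group tables.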