Testing for independence
We now formally describe a hypothesis test for whether two categorical variables are independent.
Warning about low estimated cell counts
The p-value for the test can be found because the χ² test statistic has approximately a chi-squared distribution. This approximation is close for most data sets that are encountered, but is less so when the sample size, n, is small. The guidelines that are often given suggest that the p-value can be relied on if:
If the cell counts are small enough that these conditions do not hold, the p-value is less reliable. (But advanced statistical methods are required to do better!)
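A check of this kind can be sketched in a few lines of Python. The sketch below is illustrative only. It assumes one commonly quoted rule of thumb (every estimated cell count is at least 1, and no more than 20% of them are below 5), which may differ from the guideline intended here. The helper name `expected_counts_ok`, its threshold parameters, and the example table are all hypothetical.

```python
import numpy as np
from scipy.stats import chi2_contingency


def expected_counts_ok(observed, min_count=1.0, usual_count=5.0, max_frac_small=0.2):
    """Check a rule of thumb for the chi-squared approximation.

    The thresholds are assumptions (a commonly quoted guideline),
    not values taken from this text.
    """
    observed = np.asarray(observed, dtype=float)
    # chi2_contingency returns the statistic, the p-value, the degrees of
    # freedom, and the estimated (expected) cell counts under independence.
    chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)
    # Condition 1 (assumed): no estimated cell count is below min_count.
    all_above_min = bool((expected >= min_count).all())
    # Condition 2 (assumed): at most max_frac_small of cells are below usual_count.
    frac_small = float((expected < usual_count).mean())
    ok = all_above_min and frac_small <= max_frac_small
    return ok, expected, p_value


# Hypothetical 2x2 table with a small sample size.
table = [[12, 5], [3, 2]]
ok, expected, p_value = expected_counts_ok(table)
print("estimated cell counts:\n", expected.round(2))
print("p-value:", round(p_value, 3), "| approximation looks adequate:", ok)
```

For this small table, two of the four estimated cell counts fall below 5, so the check flags the p-value as less reliable under the assumed rule.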
Simulation: Independent variables