Data with small counts
The chi-squared goodness-of-fit test that was described earlier in this section has two requirements:
First, consider how we might test whether the following data set is a random sample from a \(\PoissonDistn(\lambda=2)\) distribution.
1 | 3 | 2 | 2 | 5 | 4 | 5 | 2 | 0 | 2 |
2 | 4 | 3 | 5 | 2 | 3 | 1 | 4 | 2 | 6 |
These counts are small, so the chi-squared test cannot be directly used.
Frequency table
Rather than applying the test to the raw counts, we first summarise the data in a frequency table.
x | 0 | 1 | 2 | 3 | 4 | 5 | 6+ |
---|---|---|---|---|---|---|---|
Freq(x) | 1 | 2 | 7 | 3 | 3 | 3 | 1 |
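The tabulation above can be reproduced with a few lines of Python. This is only a minimal sketch: the names `data`, `TOP` and `freq` are illustrative choices, and pooling everything from 6 upwards into one cell is assumed from the "6+" column of the table.

```python
from collections import Counter

# The 20 observations from the data set above.
data = [1, 3, 2, 2, 5, 4, 5, 2, 0, 2,
        2, 4, 3, 5, 2, 3, 1, 4, 2, 6]

TOP = 6  # assumption: values of 6 or more share a single "6+" cell

# Cap each value at TOP so that everything >= 6 lands in the last cell.
counts = Counter(min(x, TOP) for x in data)

# Frequencies for x = 0, 1, ..., 5 and then "6+".
freq = [counts.get(x, 0) for x in range(TOP + 1)]

labels = [str(x) for x in range(TOP)] + [f"{TOP}+"]
for label, f in zip(labels, freq):
    print(f"{label:>3}  {f}")

print(freq)  # [1, 2, 7, 3, 3, 3, 1]
```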
We will treat these frequencies as our observed counts, then find expected counts from the Poisson distribution's probability function,
\[ p(x) \;\;=\;\; \frac{\lambda^x e^{-\lambda}}{x!}\]
Since there were \(n=20\) values in the data set, and the null hypothesis value is \(\lambda=2\), the count that we would expect in the frequency table for the value \(x\) is
\[ E_x \;\;=\;\; 20 \times p(x) \;\;=\;\; 20 \times \frac{2^x e^{-2}}{x!}\]
For the final "6+" cell, the expected count is 20 times the probability of a value of 6 or more, which is what remains after the probabilities for 0 to 5 are subtracted from 1. This results in the following table of observed and expected counts.
x | 0 | 1 | 2 | 3 | 4 | 5 | 6+ |
---|---|---|---|---|---|---|---|
\(O_x\) | 1 | 2 | 7 | 3 | 3 | 3 | 1 |
\(E_x\) | 2.707 | 5.413 | 5.413 | 3.609 | 1.804 | 0.722 | 0.331 |
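As a check on the expected counts, the calculation can be sketched in Python using only the standard library. The helper `poisson_pmf` and the variable names are illustrative, not part of any particular package. The "6+" cell is assumed to hold the whole upper tail, so its probability is one minus the sum of the others.

```python
import math

n = 20      # number of values in the data set
lam = 2.0   # value of lambda under the null hypothesis
TOP = 6     # the last cell is "6 or more"

def poisson_pmf(x, lam):
    """Poisson probability function p(x) = lam^x e^(-lam) / x!."""
    return lam ** x * math.exp(-lam) / math.factorial(x)

# Probabilities for x = 0, ..., 5, then the upper tail P(X >= 6).
probs = [poisson_pmf(x, lam) for x in range(TOP)]
probs.append(1.0 - sum(probs))

# Expected count in each cell is n times its probability.
expected = [n * p for p in probs]

print([round(e, 3) for e in expected])
# [2.707, 5.413, 5.413, 3.609, 1.804, 0.722, 0.331]
```

Because the tail probability is found by subtraction, the expected counts add up to 20 by construction.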
Combining cells
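These counts still do not satisfy the requirement that all \(E_x\) are ≥1 and at least 80% are ≥5: only two of the seven expected counts are 5 or more, and two are below 1. Before applying a chi-squared goodness-of-fit test, adjacent cells in the table should be combined until the expected counts are large enough. Here we combine the values 0 and 1, and all values of 3 or more.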
x | 0,1 | 2 | 3+ |
---|---|---|---|
\(O_x\) | 3 | 7 | 10 |
\(E_x\) | 8.120 | 5.413 | 6.466 |
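The grouping can be checked with a short Python sketch. The functions `combine` and `rule_ok` are hypothetical helpers written for this illustration, and the expected counts are the rounded values from the earlier table. The rule is coded as "every expected count at least 1, and at least 80% of them at least 5".

```python
# Observed and (rounded) expected counts for x = 0, 1, ..., 5, 6+.
observed = [1, 2, 7, 3, 3, 3, 1]
expected = [2.707, 5.413, 5.413, 3.609, 1.804, 0.722, 0.331]

# Cell positions to merge: {0, 1}, {2}, and {3, 4, 5, 6+}.
groups = [(0, 1), (2,), (3, 4, 5, 6)]

def combine(values, groups):
    """Add up the values in each group of cell positions."""
    return [sum(values[i] for i in g) for g in groups]

def rule_ok(expected):
    """True if all expected counts are >= 1 and at least 80% are >= 5."""
    all_at_least_1 = all(e >= 1 for e in expected)
    share_at_least_5 = sum(e >= 5 for e in expected) / len(expected)
    return all_at_least_1 and share_at_least_5 >= 0.8

obs_combined = combine(observed, groups)
exp_combined = combine(expected, groups)

print(rule_ok(expected))       # False: the original cells are too sparse
print(obs_combined)            # [3, 7, 10]
print([round(e, 3) for e in exp_combined])
print(rule_ok(exp_combined))   # True: all three expected counts exceed 5
```

Other groupings might also satisfy the rule. This one simply follows the table above.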
Based on these three counts,
\[ \begin{align} X^2 \;&=\; \sum_{i=1}^{3} {\frac{\left(O_i - E_i\right)^2}{E_i}} \\ &=\; \frac{(3-8.120)^2}{8.120} + \frac{(7-5.413)^2}{5.413} + \frac{(10-6.466)^2}{6.466} \\ &=\; 5.624 \end{align} \]
P-value and conclusion
To calculate the degrees of freedom, we note that there are 3 categories (after grouping) and one constraint: the expected frequencies from the model are “constrained” to add up to 20 by how they were calculated.
\[ \sum{E_i} \;=\; \sum{O_i} \;=\; 20 \]
The chi-squared test should therefore be based on \(3-1 = 2\) degrees of freedom. The p-value for testing whether our model fits the data is the probability that a value from a \(\ChiSqrDistn(2 \text{ df})\) distribution is at least as large as the observed value,
p-value = \(P(X^2 \ge 5.624) = 0.060\)
Since this is between 0.05 and 0.10, we would conclude that there is only very weak evidence against the null hypothesis that the original data set was a random sample from a \(\PoissonDistn(2)\) distribution.
(It is hardly surprising that a data set with only 20 values does not reveal problems with a model; a larger data set would be more sensitive to any lack of fit.)
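The test statistic and p-value can be reproduced with the standard library alone. This is a sketch under one stated assumption: with exactly 2 degrees of freedom, the upper tail probability of the chi-squared distribution has the closed form \(e^{-x/2}\), so no statistics package is needed. For other degrees of freedom a library routine such as `scipy.stats.chi2.sf` would be the usual choice. The variable names are illustrative.

```python
import math

n, lam = 20, 2.0
observed = [3, 7, 10]   # combined cells: {0, 1}, {2}, {3 or more}

# Poisson(2) probabilities for the three combined cells.
p0 = math.exp(-lam)
p1 = lam * math.exp(-lam)
p2 = lam ** 2 * math.exp(-lam) / 2
probs = [p0 + p1, p2, 1.0 - (p0 + p1 + p2)]

expected = [n * p for p in probs]

# Chi-squared goodness-of-fit statistic.
x2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Three cells, one constraint (the expected counts sum to n).
df = len(observed) - 1

# Closed form, valid only for df == 2: P(chi-squared >= x) = exp(-x / 2).
p_value = math.exp(-x2 / 2)

print(round(x2, 3), df, round(p_value, 3))  # 5.624 2 0.06
```

Using unrounded expected counts gives 5.624 for the statistic, in agreement with the value quoted above.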