Data with small counts

The chi-squared goodness-of-fit test that was described earlier in this section has two requirements:

Consider first how we might test whether the following data set is a random sample from a \(\PoissonDistn(\lambda=2)\) distribution.

1 3 2 2 5 4 5 2 0 2
2 4 3 5 2 3 1 4 2 6

These counts are small (each has an expected value of only 2 under the null hypothesis), so the chi-squared test cannot be applied to them directly.

Frequency table

Rather than applying the test to the raw counts, we first summarise the data in a frequency table.

x 0 1 2 3 4 5 6+
Freq(x)  1 2 7 3 3 3 1
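The tabulation above can be reproduced with a few lines of Python. This is only an illustrative sketch: the function name `frequency_table` and the parameter `top` (the value at which the open-ended "6+" category starts) are assumptions introduced for the example, not part of the original text.

```python
from collections import Counter

# The 20 observed values from the data set above.
data = [1, 3, 2, 2, 5, 4, 5, 2, 0, 2,
        2, 4, 3, 5, 2, 3, 1, 4, 2, 6]

def frequency_table(values, top=6):
    """Count each value 0..top-1 separately and pool all values >= top."""
    counts = Counter(min(v, top) for v in values)
    return [counts.get(x, 0) for x in range(top + 1)]

observed = frequency_table(data)
print(observed)  # frequencies for x = 0, 1, 2, 3, 4, 5, 6+
```

The resulting list should match the Freq(x) row of the table, and its entries sum to the sample size of 20.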

We will treat these frequencies as our observed counts, then find expected counts from the Poisson distribution's probability function,

\[ p(x) \;\;=\;\; \frac{\lambda^x e^{-\lambda}}{x!}\]

Since there were \(n=20\) values in the data set, and the null hypothesis value is \(\lambda=2\), the count that we would expect in the frequency table for the value \(x\) is

\[ E_x \;\;=\;\; 20 \times p(x) \;\;=\;\; 20 \times \frac{2^x e^{-2}}{x!}\]

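For the final open-ended category, 6+, the expected count uses the tail probability \(P(X \ge 6) = 1 - \sum_{x=0}^{5} p(x)\) in place of \(p(x)\). This results in the following table of observed and expected counts.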

x 0 1 2 3 4 5 6+
\(O_x\) 1 2 7 3 3 3 1
\(E_x\) 2.707 5.413 5.413 3.609 1.804 0.722 0.331
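The expected counts in this table can be checked numerically. The sketch below assumes the last category is the open-ended tail \(x \ge 6\), so its probability is one minus the sum of the others; the helper names `poisson_pmf` and `expected_counts` are hypothetical and chosen only for this illustration.

```python
import math

def poisson_pmf(x, lam):
    """Poisson probability function p(x) = lam^x e^(-lam) / x!."""
    return lam**x * math.exp(-lam) / math.factorial(x)

def expected_counts(n, lam, top=6):
    """Expected counts for x = 0..top-1 plus a pooled tail category x >= top."""
    probs = [poisson_pmf(x, lam) for x in range(top)]
    probs.append(1.0 - sum(probs))  # tail probability P(X >= top)
    return [n * p for p in probs]

expected = expected_counts(n=20, lam=2.0)
print([round(e, 3) for e in expected])
```

Because the tail probability is included, the expected counts add up to 20, the same total as the observed counts.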

Combining cells

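These counts still do not satisfy the requirement that all \(E_x\) are ≥1 and at least 80% of them are ≥5. Before applying a chi-squared goodness-of-fit test to them, cells in the table should be combined. Here we merge the values 0 and 1 into one category and pool all values of 3 or more into another.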

x 0,1 2 3+
\(O_x\) 3 7 10
\(E_x\) 8.120 5.413 6.466
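The rule of thumb for expected counts can be written as a small check. This is a sketch under the assumption that the rule is exactly "every expected count at least 1, and at least 80% of them at least 5"; the function name `meets_rule` and its parameters are made up for this example.

```python
def meets_rule(expected, min_all=1.0, min_most=5.0, share=0.8):
    """True if every expected count is >= min_all and
    at least `share` of them are >= min_most."""
    all_ok = all(e >= min_all for e in expected)
    frac_large = sum(e >= min_most for e in expected) / len(expected)
    return all_ok and frac_large >= share

raw = [2.707, 5.413, 5.413, 3.609, 1.804, 0.722, 0.331]  # before combining
combined = [8.120, 5.413, 6.466]                         # after combining

print(meets_rule(raw), meets_rule(combined))
```

The uncombined table fails the check (two expected counts are below 1), whereas all three combined expected counts exceed 5.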

Based on these three counts,

\[ \begin{align} X^2 \;&=\; \sum_{i=1}^{3} {\frac{\left(O_i - E_i\right)^2}{E_i}} \\ &=\; \frac{(3-8.120)^2}{8.120} + \frac{(7-5.413)^2}{5.413} + \frac{(10-6.466)^2}{6.466} \\ &=\; 5.624 \end{align} \]
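As a check on this arithmetic, the following sketch recomputes the statistic from unrounded expected counts. The variable `groups`, which describes the combined cells as index ranges into the seven-cell table, is an assumption introduced for the illustration.

```python
import math

observed = [1, 2, 7, 3, 3, 3, 1]  # frequencies for x = 0..5 and 6+
n, lam = 20, 2.0

# Poisson(2) probabilities for x = 0..5, then the tail P(X >= 6).
probs = [lam**x * math.exp(-lam) / math.factorial(x) for x in range(6)]
probs.append(1.0 - sum(probs))
expected = [n * p for p in probs]

# Combined cells as half-open index ranges: {0,1}, {2}, {3+}.
groups = [(0, 2), (2, 3), (3, 7)]
obs_g = [sum(observed[a:b]) for a, b in groups]
exp_g = [sum(expected[a:b]) for a, b in groups]

# Chi-squared goodness-of-fit statistic over the three combined cells.
x2 = sum((o - e) ** 2 / e for o, e in zip(obs_g, exp_g))
print(obs_g, [round(e, 3) for e in exp_g], round(x2, 3))
```

Working from unrounded expected counts gives a statistic that agrees with 5.624 to three decimal places; summing the three terms with the rounded values shown in the table gives a figure that differs slightly in the last digit.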

P-value and conclusion

To calculate the degrees of freedom, we note that there are 3 categories (after grouping) and one constraint: the expected frequencies from the model are “constrained” to add up to 20 by how they were calculated.

\[ \sum{E_i} \;=\; \sum{O_i} \;=\; 20 \]

The chi-squared test should therefore be based on \(3-1 = 2\) degrees of freedom. The p-value for testing whether our model for the data fits is the probability that a \(\ChiSqrDistn(2 \text{ df})\) distribution is greater than the observed value,

p-value = \(P(X^2 \ge 5.624) = 0.060\)
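For the special case of 2 degrees of freedom, the upper tail probability of the chi-squared distribution has the closed form \(e^{-x/2}\), so the p-value can be verified without statistical tables. The sketch below relies on that special case only; for other degrees of freedom a library routine such as `scipy.stats.chi2.sf` would be needed instead. The function name `chi2_upper_tail_2df` is hypothetical.

```python
import math

def chi2_upper_tail_2df(x):
    """P(X^2 >= x) for a chi-squared distribution with exactly 2 df.
    Uses the closed form exp(-x/2), which is valid only when df = 2."""
    return math.exp(-x / 2.0)

p_value = chi2_upper_tail_2df(5.624)
print(round(p_value, 3))
```

The result rounds to 0.060, in agreement with the p-value quoted above.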

Since this is between 0.05 and 0.10, we would conclude that there is only very weak evidence against the null hypothesis that the original data set was a random sample from a \(\PoissonDistn(2)\) distribution.

(It is hardly surprising that a data set with only 20 values does not show up problems with a model — a larger data set would be more sensitive to any possible lack of fit of the model.)