Goodness-of-fit test for other discrete distributions

The approach used on the previous page can be applied to test whether a discrete data set is a random sample from any specified family of discrete distributions.

  1. Estimate any unknown parameters of the model.
  2. Summarise the data in a frequency table; the frequencies in the table are our observed counts, \(\{O_x\}\).
  3. Use the model's probability function (with estimated parameters) to estimate probabilities for the values in the frequency table.
  4. Multiply these probabilities by the number of raw data values to get expected counts, \(\{E_x\}\).
  5. Combine cells in the frequency table to avoid small expected counts.
  6. Calculate the chi-squared test statistic,
\[ X^2 \;=\; \sum_{x} {\frac{\left(O_x - E_x\right)^2}{E_x}} \]
  7. The number of 'constraints' is the number of estimated parameters plus one (since \(\sum_x{E_x} \;=\; \sum_x{O_x}\)). The degrees of freedom are the number of cells after combining minus the number of constraints.
  8. Find the p-value for the test as the upper tail probability of the chi-squared distribution with this number of degrees of freedom.
  9. Interpret the p-value: small values give evidence that the data do not fit the distribution.
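
The final steps of the procedure (test statistic, degrees of freedom and p-value) can be sketched in Python as follows. This is a minimal illustration using only the standard library, not a definitive implementation. The function names `chi2_upper_tail` and `goodness_of_fit` are hypothetical, the expected counts are assumed to have already been computed and combined, and the die-roll counts in the usage lines are made up purely for illustration.

```python
import math

def chi2_upper_tail(stat, df):
    """Upper-tail probability P(X >= stat) for a chi-squared
    distribution with a positive integer number of degrees of freedom."""
    if stat <= 0:
        return 1.0
    x = stat / 2.0
    # Start from the closed form for df = 2 (even) or df = 1 (odd) ...
    if df % 2 == 0:
        a, q = 1.0, math.exp(-x)
    else:
        a, q = 0.5, math.erfc(math.sqrt(x))
    # ... then step up using Q(a+1, x) = Q(a, x) + x^a e^(-x) / Gamma(a+1).
    while a < df / 2.0:
        q += math.exp(a * math.log(x) - x - math.lgamma(a + 1.0))
        a += 1.0
    return q

def goodness_of_fit(observed, expected, n_estimated):
    """Return (test statistic, degrees of freedom, p-value)."""
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Constraints: one per estimated parameter, plus one for equal totals.
    df = len(observed) - (n_estimated + 1)
    return stat, df, chi2_upper_tail(stat, df)

# Hypothetical data: 60 rolls of a die, tested against a fair die
# (no parameters are estimated, so n_estimated = 0).
rolls = [8, 12, 9, 11, 6, 14]
fair = [10.0] * 6
stat, df, p_value = goodness_of_fit(rolls, fair, n_estimated=0)
print(stat, df, p_value)
```

The upper-tail calculation avoids third-party libraries by building up from the one and two degree of freedom cases; a statistics library's chi-squared distribution function could be used instead.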

The steps will be illustrated in an example.

Example

The following table gives the number of male children among the first 12 children in 6,115 families of size 13, taken from hospital records in 19th-century Saxony. (The 13th child has been ignored to avoid possible distortion from families stopping once a child of the desired sex was born.)

Males          0     1     2     3     4     5     6     7     8     9    10    11    12
Frequency      3    24   104   286   670  1033  1343  1112   829   478   181    45     7

Assuming independence and that each child has the same probability of being male, \(\pi\), this would be a random sample from a \(\BinomDistn(n=12, \; \pi)\) distribution.

Is there evidence that the probability of a birth being male differs from family to family?

This is equivalent to asking whether the 6,115 values are a random sample from a \(\BinomDistn(n=12, \; \pi)\) distribution. If the probability of a birth being male varies from family to family, there would be overdispersion and a beta-binomial distribution may fit better.

The binomial distribution has one unknown parameter, \(\pi\). Its maximum likelihood estimate is the overall proportion of males among the \(12 \times 6,115\) children counted in the 6,115 families,

\[ \hat{\pi} \;=\; \frac{38,100}{12 \times 6,115} \;=\; 0.5192\]
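
As a check on this arithmetic, the estimate can be reproduced from the frequency table. The following is a small sketch; the variable names are ours and are not part of the original analysis.

```python
# Frequency table: number of families with x males among 12 children.
males = list(range(13))
families = [3, 24, 104, 286, 670, 1033, 1343, 1112, 829, 478, 181, 45, 7]
n_children = 12

n_families = sum(families)
total_males = sum(x * f for x, f in zip(males, families))

# Overall proportion of males among all children counted.
pi_hat = total_males / (n_children * n_families)
print(n_families, total_males, round(pi_hat, 4))  # prints 6115 38100 0.5192
```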

Using the binomial probability function, we can obtain expected cell counts for the frequency table,

\[ E_x \;=\; 6,115 \times {{12} \choose x} {0.5192}^x (1 - 0.5192)^{12-x}\]

These are shown in the table below.

x              0       1       2       3       4       5       6       7       8       9      10      11      12
\(O_x\)        3      24     104     286     670    1033    1343    1112     829     478     181      45       7
\(E_x\)     0.93   12.09   71.80  258.48  628.06  1085.2  1367.3  1265.6  854.25  410.01  132.84   26.08    2.35
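
The expected counts can be reproduced with a few lines of Python. This is a sketch using only the standard library; `binomial_pmf` is a helper name introduced here for illustration. It uses the unrounded estimate of \(\pi\), so the results may differ from the tabulated values in the last digit.

```python
import math

n_families = 6115
n_children = 12
pi_hat = 38100 / (n_children * n_families)  # estimated probability of a male

def binomial_pmf(x, n, p):
    """Binomial probability of x successes in n trials."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

# Expected count = number of families times the binomial probability.
expected = [n_families * binomial_pmf(x, n_children, pi_hat)
            for x in range(n_children + 1)]

for x, e in enumerate(expected):
    print(x, round(e, 2))
```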

Since the first expected cell count is under one, we combine it with the next:

x            0,1       2       3       4       5       6       7       8       9      10      11      12
\(O_x\)       27     104     286     670    1033    1343    1112     829     478     181      45       7
\(E_x\)    13.02   71.80  258.48  628.06  1085.2  1367.3  1265.6  854.25  410.01  132.84   26.08    2.35
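
The combining step can be sketched as a small routine that merges cells from the lower end of the table while the first expected count is too small. The function name `combine_left_tail` and the threshold argument `min_expected` are hypothetical; the threshold of one is an assumption taken from the criterion used in this example, and other rules of thumb would give different groupings.

```python
labels = [str(x) for x in range(13)]
observed = [3, 24, 104, 286, 670, 1033, 1343, 1112, 829, 478, 181, 45, 7]
expected = [0.93, 12.09, 71.80, 258.48, 628.06, 1085.2, 1367.3,
            1265.6, 854.25, 410.01, 132.84, 26.08, 2.35]

def combine_left_tail(labels, observed, expected, min_expected=1.0):
    """Merge the first cell into its neighbour while its expected
    count is below min_expected. Works on copies of the inputs."""
    labels, observed, expected = list(labels), list(observed), list(expected)
    while len(expected) > 1 and expected[0] < min_expected:
        labels[1] = labels[0] + "," + labels[1]
        observed[1] += observed[0]
        expected[1] += expected[0]
        del labels[0], observed[0], expected[0]
    return labels, observed, expected

new_labels, new_observed, new_expected = combine_left_tail(
    labels, observed, expected)
print(new_labels[0], new_observed[0], round(new_expected[0], 2))
```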

The test statistic is

\[ X^2 \;=\; \sum_{x} {\frac{\left(O_x - E_x\right)^2}{E_x}} \;=\; 109.19 \]

and this should be compared to a chi-squared distribution with \((12 - 1 - 1) = 10\) degrees of freedom (12 combined cells, less one estimated parameter, less one for the matching totals), giving a p-value that is virtually zero.
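
The statistic and p-value can be checked with the following sketch, which uses the rounded counts from the combined table, so the statistic may differ from the value quoted above in the second decimal place. It relies on the closed form for the chi-squared upper tail that holds when the degrees of freedom are even; the variable names are ours.

```python
import math

# Combined table: cells for x = 0 and x = 1 merged.
observed = [27, 104, 286, 670, 1033, 1343, 1112, 829, 478, 181, 45, 7]
expected = [13.02, 71.80, 258.48, 628.06, 1085.2, 1367.3, 1265.6,
            854.25, 410.01, 132.84, 26.08, 2.35]

chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# One estimated parameter, plus one constraint for matching totals.
df = len(observed) - 1 - 1

# For even df, P(X >= x) = exp(-x/2) * sum over j < df/2 of (x/2)^j / j!
half = chi_sq / 2
p_value = math.exp(-half) * sum(half ** j / math.factorial(j)
                                for j in range(df // 2))
print(round(chi_sq, 1), df, p_value)
```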

A difference this large between the observed and expected cell counts would be almost impossible if the data were a random sample from a binomial distribution, so there is very strong evidence that this model does not hold.

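Comparing the observed and expected counts shows overdispersion: the observed counts exceed the expected counts in both tails (4 or fewer males, and 9 or more) and fall short of them in the middle (5 to 8 males). There were more families with mostly boys or mostly girls than the binomial model predicts.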
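
One informal way to quantify this overdispersion, which is not part of the test above, is to compare the sample variance of the number of males per family with the variance \(12\hat{\pi}(1 - \hat{\pi})\) implied by the fitted binomial model. The sketch below does this from the frequency table; the variable names, including `dispersion_ratio`, are ours.

```python
males = list(range(13))
families = [3, 24, 104, 286, 670, 1033, 1343, 1112, 829, 478, 181, 45, 7]
n_children = 12

n_families = sum(families)
mean = sum(x * f for x, f in zip(males, families)) / n_families

# Sample variance of the number of males per family.
sample_var = sum(f * (x - mean) ** 2
                 for x, f in zip(males, families)) / (n_families - 1)

# Variance implied by the fitted binomial model.
pi_hat = mean / n_children
binomial_var = n_children * pi_hat * (1 - pi_hat)

dispersion_ratio = sample_var / binomial_var
print(round(sample_var, 3), round(binomial_var, 3), round(dispersion_ratio, 3))
```

A ratio above one indicates more variability between families than the binomial model allows for.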