Goodness-of-fit test for other discrete distributions
The approach used on the previous page can be applied to test whether a discrete data set is a random sample from any other discrete distribution.
The steps will be illustrated in an example.
Example
The following table gives the number of male children among the first 12 children in 6,115 families of size 13, taken from hospital records in 19th-century Saxony. (The 13th child has been ignored to avoid the possible distortion caused by families stopping when a desired sex is reached.)
Males | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Frequency | 3 | 24 | 104 | 286 | 670 | 1033 | 1343 | 1112 | 829 | 478 | 181 | 45 | 7 |
Assuming independence and that each child has the same probability of being male, \(\pi\), this would be a random sample from a \(\BinomDistn(n=12, \; \pi)\) distribution.
Is there evidence that the probability of a birth being male differs from family to family?
This is equivalent to asking whether the 6,115 values are a random sample from a \(\BinomDistn(n=12, \; \pi)\) distribution. If the probability of a birth being male varies from family to family, the counts would be more variable than the binomial model predicts (overdispersion), and a beta-binomial distribution may fit better.
The binomial distribution has one unknown parameter, \(\pi\). Its maximum likelihood estimate is the overall proportion of males among the 12 × 6,115 births recorded in the table, of which 38,100 were male,
\[ \hat{\pi} \;=\; \frac{38{,}100}{12 \times 6{,}115} \;=\; 0.5192\]
Using the binomial probability function, we can obtain expected cell counts for the frequency table,
\[ E_x \;=\; 6{,}115 \times {{12} \choose x} {0.5192}^x (1 - 0.5192)^{12-x}\]
These are shown in the table below.
x | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
\(O_x\) | 3 | 24 | 104 | 286 | 670 | 1033 | 1343 | 1112 | 829 | 478 | 181 | 45 | 7 |
\(E_x\) | 0.93 | 12.09 | 71.80 | 258.48 | 628.06 | 1085.2 | 1367.3 | 1265.6 | 854.25 | 410.01 | 132.84 | 26.08 | 2.35 |
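The estimate of \(\pi\) and the expected counts can be reproduced with a short script. The following is a minimal sketch in Python using only the standard library; the variable names (`observed`, `pi_hat`, `expected`) are illustrative and do not come from the original text. It uses the unrounded estimate of \(\pi\), so the results may differ slightly from the table in the last digit.

```python
from math import comb

# Observed number of families with x = 0, 1, ..., 12 males (from the table above)
observed = [3, 24, 104, 286, 670, 1033, 1343, 1112, 829, 478, 181, 45, 7]
n_children = 12
n_families = sum(observed)  # 6,115 families

# Overall proportion of males: total males divided by total births
total_males = sum(x * f for x, f in enumerate(observed))
pi_hat = total_males / (n_children * n_families)

# Expected count for each x under a Binomial(12, pi_hat) model
expected = [
    n_families * comb(n_children, x) * pi_hat**x * (1 - pi_hat) ** (n_children - x)
    for x in range(n_children + 1)
]

print(f"pi_hat = {pi_hat:.4f}")
for x, (o, e) in enumerate(zip(observed, expected)):
    print(f"x = {x:2d}   O = {o:5d}   E = {e:8.2f}")
```

Because the expected counts are the binomial probabilities scaled by the number of families, they sum to 6,115, which is a quick check on the arithmetic.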
Since the first expected cell count is under one, we combine it with the next:
x | 0,1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
\(O_x\) | 27 | 104 | 286 | 670 | 1033 | 1343 | 1112 | 829 | 478 | 181 | 45 | 7 |
\(E_x\) | 13.02 | 71.80 | 258.48 | 628.06 | 1085.2 | 1367.3 | 1265.6 | 854.25 | 410.01 | 132.84 | 26.08 | 2.35 |
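The pooling step can be sketched as a small helper. This is an illustrative Python sketch, not a standard routine: the function name `pool_ends` and the threshold argument `min_expected` are assumptions introduced here, and the rule implemented (merge an end cell into its neighbour while its expected count is below one) is simply the one applied in this example.

```python
def pool_ends(observed, expected, min_expected=1.0):
    """Merge end cells into their neighbours while their expected count is too small."""
    obs, exp = list(observed), list(expected)
    # Pool from the left-hand end
    while len(exp) > 1 and exp[0] < min_expected:
        obs[1] += obs[0]
        exp[1] += exp[0]
        del obs[0], exp[0]
    # Pool from the right-hand end
    while len(exp) > 1 and exp[-1] < min_expected:
        obs[-2] += obs[-1]
        exp[-2] += exp[-1]
        del obs[-1], exp[-1]
    return obs, exp


# Counts for x = 0, 1, ..., 12, copied from the earlier table
observed = [3, 24, 104, 286, 670, 1033, 1343, 1112, 829, 478, 181, 45, 7]
expected = [0.93, 12.09, 71.80, 258.48, 628.06, 1085.2, 1367.3,
            1265.6, 854.25, 410.01, 132.84, 26.08, 2.35]

obs_pooled, exp_pooled = pool_ends(observed, expected)
print(obs_pooled)
print([round(e, 2) for e in exp_pooled])
```

Only the cells for 0 and 1 males are merged here, since the last expected count (2.35) is above the threshold of one.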
The test statistic is
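\[ X^2 \;=\; \sum_{x} {\frac{\left(O_x - E_x\right)^2}{E_x}} \;=\; 109.19 \]
This should be compared to a chi-squared distribution with \((12 - 1 - 1) = 10\) degrees of freedom (12 cells after combining, less one because the counts have a fixed total and one for the estimated parameter \(\pi\)), giving a p-value that is virtually zero.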
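The statistic and its p-value can be checked numerically. The sketch below is one possible Python version using only the standard library; the helper `chi2_upper_tail` is a name introduced here, and it relies on the closed form for the chi-squared upper tail that holds only when the degrees of freedom are even (as they are in this example). It uses the rounded expected counts from the table, so the statistic may differ from 109.19 in the second decimal place.

```python
from math import exp, factorial

# Pooled observed and expected counts, copied from the table above
observed = [27, 104, 286, 670, 1033, 1343, 1112, 829, 478, 181, 45, 7]
expected = [13.02, 71.80, 258.48, 628.06, 1085.2, 1367.3,
            1265.6, 854.25, 410.01, 132.84, 26.08, 2.35]

# Pearson chi-squared statistic
x2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Degrees of freedom: cells - 1 - number of estimated parameters
n_estimated = 1  # only pi was estimated
df = len(observed) - 1 - n_estimated


def chi2_upper_tail(x, df):
    """P(chi-squared with df degrees of freedom > x); valid for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive, even df")
    half = x / 2
    return exp(-half) * sum(half**i / factorial(i) for i in range(df // 2))


p_value = chi2_upper_tail(x2, df)
print(f"X^2 = {x2:.2f}, df = {df}, p-value = {p_value:.3g}")
```

For general degrees of freedom, a library routine such as `scipy.stats.chi2.sf(x2, df)` would be the usual choice, assuming SciPy is available.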
Since such a large difference between the observed and expected cell counts would be almost impossible if the data were a random sample from a binomial distribution, we conclude that the binomial model almost certainly does not hold.
Comparing the observed and expected counts shows overdispersion: there were more families with mostly boys, and more with mostly girls, than would be expected under the binomial model.
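The pattern of overdispersion can be made concrete with a short calculation. The following Python sketch lists which cells have more families than expected, and, as an added illustration that is not part of the original example, compares the sample variance of the number of males with the variance \(12\hat{\pi}(1-\hat{\pi})\) implied by the fitted binomial model. All variable names are illustrative.

```python
from math import comb

# Observed number of families with x = 0, 1, ..., 12 males
observed = [3, 24, 104, 286, 670, 1033, 1343, 1112, 829, 478, 181, 45, 7]
n_children = 12
n_families = sum(observed)

# Fitted binomial model (unrounded estimate of pi)
pi_hat = sum(x * f for x, f in enumerate(observed)) / (n_children * n_families)
expected = [
    n_families * comb(n_children, x) * pi_hat**x * (1 - pi_hat) ** (n_children - x)
    for x in range(n_children + 1)
]

# Which cells contain more families than the binomial model predicts?
excess = [o > e for o, e in zip(observed, expected)]
for x, (o, e) in enumerate(zip(observed, expected)):
    label = "more than expected" if o > e else "fewer than expected"
    print(f"x = {x:2d}   O - E = {o - e:8.2f}   {label}")

# Sample variance of the counts versus the binomial variance
mean = sum(x * f for x, f in enumerate(observed)) / n_families
sample_var = sum(f * (x - mean) ** 2 for x, f in enumerate(observed)) / n_families
binom_var = n_children * pi_hat * (1 - pi_hat)
print(f"sample variance = {sample_var:.3f}, binomial variance = {binom_var:.3f}")
print(f"ratio = {sample_var / binom_var:.3f}")
```

A ratio above one indicates more spread than the binomial model allows, which agrees with the excess of families in both tails.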