Frequency table for continuous data
The same chi-squared test can be used to assess whether a continuous distribution is an appropriate model for continuous data, but the data must first be summarised in a frequency table. This is done by splitting the range of possible values of the distribution into classes such as "10 ≤ X < 11", then counting the number of values in each class. (This is the same as would be done when drawing a histogram of the data.)
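The class-counting step can be sketched in a few lines of Python. This is only an illustration: the helper name `class_counts`, the data values and the class boundaries are all invented here and do not come from the example that follows.

```python
from bisect import bisect_right

def class_counts(values, edges):
    """Count values in the classes X < e0, e0 <= X < e1, ..., X >= e_last.

    `edges` must be sorted in increasing order.
    """
    counts = [0] * (len(edges) + 1)
    for v in values:
        # bisect_right gives the number of edges that are <= v, which is
        # the index of the class whose lower boundary is the largest such edge.
        counts[bisect_right(edges, v)] += 1
    return counts

# Hypothetical measurements, invented purely for illustration.
values = [9.7, 10.0, 10.4, 10.9, 11.0, 11.3, 12.6]
# Classes: X < 10, 10 <= X < 11, 11 <= X < 12, X >= 12
edges = [10.0, 11.0, 12.0]

counts = class_counts(values, edges)
print(counts)
```

A value that sits exactly on a boundary, such as 10.0, is counted in the class that has it as its lower limit, matching the "10 ≤ X < 11" convention.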
The method is complicated slightly by the need to use the distribution's cumulative distribution function to find the expected counts in the different classes. An example illustrates the method.
Body temperature
In a study to determine the "normal" body temperature of healthy adults, body temperatures (in degrees Fahrenheit) were recorded for 130 adults. The following frequency table summarises the data.
Temperature, x | Frequency |
---|---|
\(X \lt 96.0\) | 0 |
\(96.0 \le X \lt 96.5\) | 2 |
\(96.5 \le X \lt 97.0\) | 4 |
\(97.0 \le X \lt 97.5\) | 13 |
\(97.5 \le X \lt 98.0\) | 21 |
\(98.0 \le X \lt 98.5\) | 38 |
\(98.5 \le X \lt 99.0\) | 33 |
\(99.0 \le X \lt 99.5\) | 15 |
\(99.5 \le X \lt 100.0\) | 2 |
\(100.0 \le X \lt 100.5\) | 1 |
\(100.5 \le X \lt 101.0\) | 1 |
\(X \ge 101.0\) | 0 |
We will examine whether these data could be a random sample from a normal distribution. The diagram below shows a histogram of the data based on these classes.
A normal distribution has been superimposed on the histogram, using the method of moments estimates of the normal distribution's parameters,
\[ \hat{\mu} \;=\; \overline{x} \;=\; 98.25 \spaced{and} \hat{\sigma} \;=\; s \;=\; 0.7332 \]
The normal distribution seems a reasonably close match to the histogram, but we will apply a chi-squared goodness-of-fit test to formally test this. The probabilities for values within these classes were found from the best-fitting \(\NormalDistn(\mu=98.25, \sigma = 0.7332)\) distribution, then multiplied by the number of values, 130, to get expected counts.
Temperature, x | Observed count, O | Expected count, E |
---|---|---|
\(X \lt 96.0\) | 0 | 0.14 |
\(96.0 \le X \lt 96.5\) | 2 | 0.97 |
\(96.5 \le X \lt 97.0\) | 4 | 4.64 |
\(97.0 \le X \lt 97.5\) | 13 | 14.20 |
\(97.5 \le X \lt 98.0\) | 21 | 27.76 |
\(98.0 \le X \lt 98.5\) | 38 | 34.69 |
\(98.5 \le X \lt 99.0\) | 33 | 27.72 |
\(99.0 \le X \lt 99.5\) | 15 | 14.16 |
\(99.5 \le X \lt 100.0\) | 2 | 4.62 |
\(100.0 \le X \lt 100.5\) | 1 | 0.96 |
\(100.5 \le X \lt 101.0\) | 1 | 0.13 |
\(X \ge 101.0\) | 0 | 0.01 |
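The expected counts in the table above can be approximated with the following sketch, which uses Python's standard-library `statistics.NormalDist` for the cumulative distribution function. The variable names are our own, and because the fitted mean and standard deviation are entered to only a few significant figures, the results may differ from the table in the second decimal place.

```python
from statistics import NormalDist

n = 130                                        # number of adults
fitted = NormalDist(mu=98.25, sigma=0.7332)    # method of moments fit

# Class boundaries 96.0, 96.5, ..., 101.0; the two end classes are open-ended.
edges = [96.0 + 0.5 * i for i in range(11)]

# Cumulative probability at each boundary, padded with 0 and 1 for the tails.
cdf = [0.0] + [fitted.cdf(e) for e in edges] + [1.0]

# P(a <= X < b) = F(b) - F(a); multiply by n to get the expected count.
expected = [n * (hi - lo) for lo, hi in zip(cdf, cdf[1:])]

for e in expected:
    print(f"{e:.2f}")
```

Padding the cumulative probabilities with 0 and 1 guarantees that the expected counts add up to 130, the same total as the observed counts.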
Since several expected counts are low, classes must be combined before calculating the chi-squared goodness-of-fit statistic.
Temperature, x | Observed count, O | Expected count, E |
---|---|---|
\(X \lt 97.0\) | 6 | 5.61 |
\(97.0 \le X \lt 97.5\) | 13 | 14.20 |
\(97.5 \le X \lt 98.0\) | 21 | 27.76 |
\(98.0 \le X \lt 98.5\) | 38 | 34.69 |
\(98.5 \le X \lt 99.0\) | 33 | 27.72 |
\(99.0 \le X \lt 99.5\) | 15 | 14.16 |
\(X \ge 99.5\) | 4 | 5.71 |
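One way to automate the pooling of classes is sketched below. The function name `merge_tails` is hypothetical, and the threshold of 5 for the smallest acceptable expected count is a commonly quoted rule of thumb that we assume here; the text does not state which threshold was used. The expected counts are the rounded entries from the uncombined table, so the pooled expected totals need not reproduce the combined table's values exactly.

```python
def merge_tails(observed, expected, min_expected=5.0):
    """Pool classes inward from each tail until both end classes
    have an expected count of at least `min_expected`."""
    obs, exp = list(observed), list(expected)
    # Lower tail: fold the first class into its neighbour.
    while len(exp) > 1 and exp[0] < min_expected:
        obs[1] += obs[0]
        exp[1] += exp[0]
        del obs[0], exp[0]
    # Upper tail: fold the last class into its neighbour.
    while len(exp) > 1 and exp[-1] < min_expected:
        obs[-2] += obs[-1]
        exp[-2] += exp[-1]
        del obs[-1], exp[-1]
    return obs, exp

# Observed counts and (rounded) expected counts for the twelve classes.
observed = [0, 2, 4, 13, 21, 38, 33, 15, 2, 1, 1, 0]
expected = [0.14, 0.97, 4.64, 14.20, 27.76, 34.69,
            27.72, 14.16, 4.62, 0.96, 0.13, 0.01]

pooled_obs, pooled_exp = merge_tails(observed, expected)
print(pooled_obs)
print([round(e, 2) for e in pooled_exp])
```

This sketch only pools the tails, which is enough for a unimodal fit like this one; a class with a low expected count in the middle of the range would need separate handling.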
The test statistic is
\[ X^2 \;=\; \sum_{x} {\frac{\left(O_x - E_x\right)^2}{E_x}} \;=\; 3.657 \]
We used 7 counts to evaluate the test statistic, and there were 2 estimated parameters plus 1 other constraint on the expected values (the constraint that \(\sum {E_x} = \sum {O_x}\)). The test statistic should therefore be compared to the chi-squared distribution with \((7-2-1) = 4\) degrees of freedom. The p-value is the probability of a value from the \(\ChiSqrDistn(4 \text{ df})\) distribution being at least as high as 3.657 and can be found (e.g. using Excel) to be 0.4545.
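The calculation can be checked with a short standard-library sketch. It uses the rounded counts from the combined table, so the statistic and p-value may differ from the quoted values in the last digit. The helper `chi2_upper_tail_even_df` is our own and relies on the closed-form upper tail probability that holds only when the degrees of freedom are even, \(e^{-x/2}\sum_{k=0}^{m-1}(x/2)^k/k!\) for \(2m\) degrees of freedom; a statistics library would normally be used instead.

```python
from math import exp

observed = [6, 13, 21, 38, 33, 15, 4]
expected = [5.61, 14.20, 27.76, 34.69, 27.72, 14.16, 5.71]

# Chi-squared goodness-of-fit statistic: sum of (O - E)^2 / E.
x2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Degrees of freedom: classes - estimated parameters - 1.
n_estimated = 2          # mu and sigma
df = len(observed) - n_estimated - 1

def chi2_upper_tail_even_df(x, df):
    """P(X >= x) for a chi-squared variable with an even number of df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form used here needs positive, even df")
    half = x / 2
    term, total = 1.0, 0.0
    for k in range(df // 2):
        total += term            # adds (x/2)^k / k!
        term *= half / (k + 1)
    return exp(-half) * total

p_value = chi2_upper_tail_even_df(x2, df)
print(f"X^2 = {x2:.3f}, df = {df}, p-value = {p_value:.4f}")
```

With 4 degrees of freedom the formula reduces to \(e^{-x/2}(1 + x/2)\), which makes the p-value easy to verify by hand.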
To interpret this: if the data did come from a normal distribution, there would be about a 45% probability of getting observed counts at least as far from the expected counts as those seen here. The data are therefore consistent with a normal distribution; there is no evidence of problems with the normal model.