Samples from several groups
We now consider data sets that arise as random samples from two or more groups.
Sugar beet yields
An experiment was conducted to assess how the application of nitrogen fertiliser at different rates affected the yield of sugar beet. Five plots (randomly chosen from a pool of 30 plots) were given each of six different amounts of fertiliser — the experimental treatments. The table below shows the sugar beet yields from all 30 plots.
Nitrogen level (lb/acre) | Yield from plot (tons/acre) | | | | |
---|---|---|---|---|---|
0 | 31.3 | 33.4 | 29.2 | 32.2 | 33.9 |
50 | 38.8 | 37.5 | 37.4 | 35.8 | 38.4 |
100 | 40.9 | 39.2 | 39.5 | 38.6 | 39.8 |
150 | 40.9 | 41.7 | 39.4 | 40.1 | 40.0 |
200 | 39.7 | 40.6 | 39.2 | 38.7 | 41.9 |
250 | 40.6 | 41.0 | 41.5 | 41.1 | 39.8 |
The five recorded yields at each level of nitrogen can be treated as a random sample from some distribution, but the six distributions corresponding to the different nitrogen levels may differ.
A normal model for the data
The distributions in the different groups, from which we are assuming that the data are random samples, may differ in various ways. However, it is common to assume that all groups have normal distributions, with a \(\NormalDistn(\mu_i,\;\sigma_i^2)\) distribution in the \(i\)'th of the \(g\) groups. If we can also assume that the variance is the same in each group, our model is that
\[ Y_{ij} \;\;\sim\;\; \NormalDistn(\mu_i, \sigma^2) \qquad \text{for }i = 1,\dots,g \text{ and }j = 1,\dots,n_i \]where \(Y_{ij}\) is the \(j\)'th value in group \(i\). Note that we are using \(n_i\) to denote the number of values in the \(i\)'th group and the total number of values is \(n = \sum_{i=1}^g n_i\).
Parameter estimates
The best estimates of the group means \(\{\mu_i\}\) — the maximum likelihood estimates — are the sample means in the groups,
\[ \hat{\mu}_{i} \;\;=\;\; \overline{Y}_i \]but how should we estimate the common group variance, \(\sigma^2\)?
Definition
The pooled estimate of the common group variance is
\[ S_{\text{pooled}}^2 \;=\; \frac{\sum_{i=1}^g (n_i - 1)S_i^2}{\sum_{i=1}^g (n_i - 1)} \]where
\[ S_i^2 \;=\; \frac{\sum_{j=1}^{n_i} (Y_{ij} - \overline{Y}_i)^2} {n_i - 1} \]is the sample variance in group \(i\).
To help understand this estimator, note that when there are the same number of values in all groups (all \(n_i\) are equal), it is simply the average of the group variances. If the \(n_i\) differ, more weight is given to the variances of the larger groups.
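The weighting described above can be checked numerically. The following is a minimal sketch using only Python's standard library; the function name `pooled_variance` is a label chosen here for illustration, and the unequal-size case is constructed by truncating one of the sugar beet groups, which is an assumption made purely to show the weighting.

```python
from statistics import variance


def pooled_variance(groups):
    """Pooled estimate of a common variance from a list of samples."""
    # Each group variance is weighted by its degrees of freedom, n_i - 1.
    numerator = sum((len(g) - 1) * variance(g) for g in groups)
    denominator = sum(len(g) - 1 for g in groups)
    return numerator / denominator


# Two groups from the sugar beet table (0 and 50 lb/acre of nitrogen).
group_0 = [31.3, 33.4, 29.2, 32.2, 33.9]
group_50 = [38.8, 37.5, 37.4, 35.8, 38.4]

# Equal group sizes: the pooled estimate is the plain average of the variances.
equal_sizes = pooled_variance([group_0, group_50])
plain_average = (variance(group_0) + variance(group_50)) / 2

# Unequal sizes (illustration only): the first group is cut to three values,
# so its variance gets weight 2 while the second group's gets weight 4.
unequal_sizes = pooled_variance([group_0[:3], group_50])

print(equal_sizes, plain_average, unequal_sizes)
```

With equal group sizes the first two printed values agree; in the unequal case the result lies closer to the variance of the larger group.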
Distribution of pooled variance
The pooled estimator \(S_{\text{pooled}}^2\) is an unbiased estimator of \(\sigma^2\) and
\[ \frac{n-g}{\sigma^2}S_{\text{pooled}}^2 \;\;\sim\;\; \ChiSqrDistn(n-g\;\text{df}) \]
We have already shown that the sample variance from a normal distribution is an unbiased estimator of \(\sigma^2\), so \(E[S_i^2] = \sigma^2\). Therefore
\[ E[S_{\text{pooled}}^2] \;=\; \frac{\sum_{i=1}^g (n_i - 1)E[S_i^2]}{\sum_{i=1}^g (n_i - 1)} \;=\; \frac{\sum_{i=1}^g (n_i - 1)\sigma^2}{\sum_{i=1}^g (n_i - 1)} \;=\; \sigma^2 \]We also already know that a scaled version of the sample variance in any group has a Chi-squared distribution,
\[ \frac{n_i - 1}{\sigma^2}S_i^2 \;\sim\; \ChiSqrDistn(n_i - 1\;\text{df}) \]Since \(\sum_{i=1}^g (n_i - 1) = (n-g)\),
\[ \frac{n-g}{\sigma^2}S_{\text{pooled}}^2 \;=\; \sum_{i=1}^g \frac{n_i - 1}{\sigma^2}S_i^2 \]The groups are independent random samples, so the \(S_i^2\) are independent. Using the fact that the sum of independent Chi-squared variables also has a Chi-squared distribution whose degrees of freedom are the sum of those of its components,
\[ \frac{n-g}{\sigma^2}S_{\text{pooled}}^2 \;=\; \sum_{i=1}^g \frac{n_i - 1}{\sigma^2}S_i^2 \;\sim\; \ChiSqrDistn(n - g\;\text{df}) \]Note that we can also prove that the pooled estimator is unbiased from its Chi-squared distribution — the distribution's mean equals its degrees of freedom.
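The distributional result can be checked informally by simulation. The sketch below is not part of the proof; the group means, group sizes, \(\sigma\), number of replications and seed are all arbitrary values assumed for illustration. It draws repeated data sets from the normal model and compares the simulated mean of \(S_{\text{pooled}}^2\) with \(\sigma^2\), and the mean and variance of the scaled statistic with \(n-g\) and \(2(n-g)\), the mean and variance of a Chi-squared distribution with \(n-g\) degrees of freedom.

```python
import random
from statistics import mean, variance


def sample_variance(values):
    """Sample variance with divisor n - 1."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


def pooled_variance(groups):
    """Pooled estimate of the common variance."""
    numerator = sum((len(g) - 1) * sample_variance(g) for g in groups)
    return numerator / sum(len(g) - 1 for g in groups)


def simulate_pooled(mus, sizes, sigma, reps, seed=1):
    """Pooled variances from `reps` simulated data sets under the normal model."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    results = []
    for _ in range(reps):
        groups = [[rng.gauss(mu, sigma) for _ in range(n)]
                  for mu, n in zip(mus, sizes)]
        results.append(pooled_variance(groups))
    return results


# Assumed settings: three groups of unequal size with a common sigma of 2.
mus = [10.0, 12.0, 15.0]
sizes = [3, 5, 8]
sigma = 2.0
df = sum(sizes) - len(sizes)  # n - g = 13

estimates = simulate_pooled(mus, sizes, sigma, reps=20000)
scaled = [df * s2 / sigma ** 2 for s2 in estimates]

mean_estimate = mean(estimates)  # should be close to sigma**2 = 4
mean_scaled = mean(scaled)       # should be close to df = 13
var_scaled = variance(scaled)    # should be close to 2 * df = 26

print(mean_estimate, mean_scaled, var_scaled)
```

Agreement of these three summaries with their theoretical values is consistent with the result, although a simulation of this kind can only support it, not prove it.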
We now apply this result to the sugar beet data.
Sugar beet yields
The table below shows the sample means and variances in the six groups. Since there were \(n_i = 5\) values in each group, all group variances have 4 degrees of freedom.
Nitrogen level (lb/acre) | Group mean \(\overline{y}_i\) | Group variance \(s_i^2\) | Degrees of freedom |
---|---|---|---|
0 | 32.00 | 3.485 | 4 |
50 | 37.58 | 1.342 | 4 |
100 | 39.60 | 0.725 | 4 |
150 | 40.42 | 0.797 | 4 |
200 | 40.02 | 1.597 | 4 |
250 | 40.80 | 0.415 | 4 |
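The summary values in the table above can be reproduced from the raw yields. This is a minimal sketch using Python's standard `statistics` module; the variable names (`yields`, `summary`, `s2_pooled`) are chosen here for illustration.

```python
from statistics import mean, variance

# Sugar beet yields (tons/acre), keyed by nitrogen level (lb/acre).
yields = {
    0:   [31.3, 33.4, 29.2, 32.2, 33.9],
    50:  [38.8, 37.5, 37.4, 35.8, 38.4],
    100: [40.9, 39.2, 39.5, 38.6, 39.8],
    150: [40.9, 41.7, 39.4, 40.1, 40.0],
    200: [39.7, 40.6, 39.2, 38.7, 41.9],
    250: [40.6, 41.0, 41.5, 41.1, 39.8],
}

# For each group: (sample mean, sample variance, degrees of freedom).
summary = {
    level: (mean(values), variance(values), len(values) - 1)
    for level, values in yields.items()
}

# Pooled variance: group variances weighted by their degrees of freedom.
total_df = sum(df for _, _, df in summary.values())
s2_pooled = sum(df * s2 for _, s2, df in summary.values()) / total_df

for level, (m, s2, df) in summary.items():
    print(f"{level:>3} | {m:.2f} | {s2:.3f} | {df}")
print(f"pooled variance = {s2_pooled:.4f} on {total_df} df")
```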
Since all group sizes are equal, the pooled variance estimate is the average of the six group variances,
\[ s_{\text{pooled}}^2 \;=\; \frac{3.485 + 1.342 + 0.725 + 0.797 + 1.597 + 0.415}{6} \;=\; 1.3935 \]and this has 24 degrees of freedom. Since
\[ \frac{24}{\sigma^2}S_{\text{pooled}}^2 \;\sim\; \ChiSqrDistn(24\;\text{df}) \]we can use this as a pivot to get a 95% confidence interval for \(\sigma^2\) in the same way that was done with a single sample,
\[ \frac{24 s_{\text{pooled}}^2}{\chi_{24,\;0.975}^2} \;\;\lt\;\; \sigma^2 \;\;\lt\;\; \frac{24s_{\text{pooled}}^2}{\chi_{24,\;0.025}^2} \] \[ 0.850 \;\;\lt\;\; \sigma^2 \;\;\lt\;\; 2.697 \]This confidence interval is very wide, but the width should not be surprising when the group variances are so varied. Large sample sizes are necessary in order to estimate variances accurately.
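The interval above can be reproduced numerically. The sketch below avoids external libraries by using the closed-form Chi-squared distribution function that holds when the degrees of freedom are even (as with 24 here) and finding quantiles by bisection. The helper names and the bisection bracket are assumptions made for this illustration; in practice a statistics library's quantile function would normally be used instead.

```python
import math


def chi2_cdf_even(x, df):
    """Chi-squared CDF using the closed form that is valid only for even df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive even df")
    half = x / 2.0
    term = 1.0   # (x/2)^0 / 0!
    total = 1.0
    for i in range(1, df // 2):
        term *= half / i          # (x/2)^i / i!
        total += term
    return 1.0 - math.exp(-half) * total


def chi2_quantile_even(p, df, tol=1e-10):
    """Quantile by bisection; the bracket [0, 10*df] is an assumed search range."""
    lo, hi = 0.0, 10.0 * df
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if chi2_cdf_even(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


df = 24
s2_pooled = 1.3935  # pooled variance of the sugar beet yields

upper_q = chi2_quantile_even(0.975, df)  # 97.5% point of Chi-squared(24 df)
lower_q = chi2_quantile_even(0.025, df)  # 2.5% point of Chi-squared(24 df)

# Dividing by the larger quantile gives the lower limit, and vice versa.
ci_lower = df * s2_pooled / upper_q
ci_upper = df * s2_pooled / lower_q

print(f"95% CI for sigma^2: ({ci_lower:.3f}, {ci_upper:.3f})")
```

Taking square roots of the two limits would give a corresponding interval for \(\sigma\) itself.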