We now consider data sets that arise as random samples from two or more groups.

A normal model

We often assume that all groups have normal distributions, with a \(\NormalDistn(\mu_i,\;\sigma_i^2)\) distribution in the \(i\)'th of the \(g\) groups. It is also common to assume that the variance is the same in all groups, so

\[ Y_{ij} \;\;\sim\;\; \NormalDistn(\mu_i, \sigma^2) \qquad \text{for }i = 1,\dots,g \text{ and }j = 1,\dots,n_i \]

where \(Y_{ij}\) is the \(j\)'th value in group \(i\). Here \(n_i\) denotes the number of values in the \(i\)'th group, and the total number of values is \(n = \sum_{i=1}^g n_i\).

The maximum likelihood estimates of the group means \(\{\mu_i\}\) are

\[ \hat{\mu}_{i} \;\;=\;\; \overline{Y}_i \]

but how should we estimate the common group variance, \(\sigma^2\)?

Definition

The pooled estimate of the common group variance is

\[ S_{\text{pooled}}^2 \;=\; \frac{\sum_{i=1}^g (n_i - 1)S_i^2}{\sum_{i=1}^g (n_i - 1)} \]

where

\[ S_i^2 \;=\; \frac{\sum_{j=1}^{n_i} (Y_{ij} - \overline{Y}_i)^2} {n_i - 1} \]

is the sample variance in group \(i\).
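The definition can be sketched in a few lines of Python. This is only an illustration, not part of the original text: the function name `pooled_variance` and the data values are made up, and every group is assumed to hold at least two values so that its sample variance exists. Note that the denominator, \(\sum_{i=1}^g (n_i - 1)\), is simply \(n - g\).

```python
from statistics import variance  # sample variance, with divisor n_i - 1


def pooled_variance(groups):
    """Pooled estimate of a common variance (hypothetical helper).

    `groups` is a list of lists, one list of values per group.
    Each group must contain at least two values.
    """
    # Numerator: sample variances weighted by their degrees of freedom
    numerator = sum((len(values) - 1) * variance(values) for values in groups)
    # Denominator: total degrees of freedom, which equals n - g
    denominator = sum(len(values) - 1 for values in groups)
    return numerator / denominator


# Made-up data for two groups with n_1 = 3 and n_2 = 4
groups = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0, 8.0]]
print(pooled_variance(groups))  # approximately 4.4
```

In this example the group variances are 1 and 20/3, so the pooled estimate is \((2 \times 1 + 3 \times 20/3)/5 = 4.4\). It lies closer to the variance of the larger group, which carries more degrees of freedom.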

We now give its distribution.

Distribution of pooled variance

The pooled estimator \(S_{\text{pooled}}^2\) is an unbiased estimator of \(\sigma^2\) and

\[ \frac{n-g}{\sigma^2}S_{\text{pooled}}^2 \;\;\sim\;\; \ChiSqrDistn(n-g\;\text{df}) \]

(Proved in full version)

Since the quantity on the left is a pivot for \(\sigma^2\) (its distribution does not involve any unknown parameter), it can be used to find a confidence interval for \(\sigma^2\), in much the same way as one was found from a single random sample.
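As a sketch of how such an interval could be obtained, write \(\chi^2_{n-g,\,p}\) for the value that has probability \(p\) below it in the \(\ChiSqrDistn(n-g\;\text{df})\) distribution. This notation, and the choice of an interval with equal tail probabilities, are assumptions made here for illustration. From the distribution of the pivot,

\[ P\left( \chi^2_{n-g,\,\alpha/2} \;\le\; \frac{n-g}{\sigma^2}S_{\text{pooled}}^2 \;\le\; \chi^2_{n-g,\,1-\alpha/2} \right) \;=\; 1 - \alpha \]

Rearranging the two inequalities to isolate \(\sigma^2\) gives a \(100(1-\alpha)\%\) confidence interval,

\[ \frac{(n-g)\,S_{\text{pooled}}^2}{\chi^2_{n-g,\,1-\alpha/2}} \;\;\le\;\; \sigma^2 \;\;\le\;\; \frac{(n-g)\,S_{\text{pooled}}^2}{\chi^2_{n-g,\,\alpha/2}} \]

with \(S_{\text{pooled}}^2\) replaced by its observed value from the data.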