Comparing the populations
For two-group data sets, we usually want to compare the underlying populations. In particular, the main questions of interest are:
Comparing the population means
The two standard deviations in the groups may differ, but we are usually more interested in differences between the population means. The earlier questions can be asked in terms of the difference between these means,
δ = μ2 − μ1
If the group means are equal (so that µ2 - µ1 is zero), then, on average, values in neither group are higher than in the other. Indeed, if the distributions are normal and σ1 and σ2 are also equal, then a zero value for µ2 - µ1 also implies that the distributions in the two groups are identical.
µ2 - µ1 describes how much higher the values in group 2 are (on average) than the values in group 1.
The best estimate of µ2 - µ1 is, naturally, the difference between the means of the two samples, x̄2 - x̄1.
Randomness of sample difference
Unfortunately, x̄2 - x̄1 cannot give a definitive answer to questions about µ2 - µ1 because it is a random summary statistic that varies from sample to sample. The distribution of x̄2 - x̄1 must be understood before we can make any inference about µ2 - µ1.
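As a sketch of what this distribution looks like, the standard result for two independent random samples is written out below. The sample sizes n1 and n2 are symbols introduced here for illustration and are not defined in the text above; the result assumes that the two samples are independent of each other.

```latex
\[
E\left[\bar{x}_2 - \bar{x}_1\right] = \mu_2 - \mu_1,
\qquad
\operatorname{sd}\left(\bar{x}_2 - \bar{x}_1\right)
  = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}
\]
```

If both populations are normal, the difference between the sample means is itself normally distributed with this mean and standard deviation.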
Simulation: Effect of zinc on colds
An experiment was conducted on volunteers who had developed a cold within the previous 24 hours. They were randomised to take either zinc acetate lozenges or placebo lozenges (identical but with no active ingredient). The duration of cold symptoms for the subjects who received zinc had mean 4.5 days and standard deviation 1.6 days. The control group's duration had mean 8.1 days and standard deviation 1.8 days.
We will conduct a simulated experiment based on this scenario. In the simulation, we will generate 'symptom durations' for 24 patients with the zinc treatment from a normal distribution with µ1 = 4.5 and σ1 = 1.6. Another 24 patients in the placebo group will have normal durations with µ2 = 8.1 days and σ2 = 1.8 days.
Note that the duration of symptoms is, on average, µ2 - µ1 = 3.6 days lower for those who get the zinc lozenges.
(The normal distributions from which the data are sampled are represented by a pale blue band at µ ± 2σ. The narrower darker blue band includes half of the population distribution.)
Click Accumulate, then take several samples. Observe that the difference between the sample means is a random quantity whose distribution is centred on µ2 - µ1 = 3.6 days.
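The same repeated-sampling experiment can be sketched in code. The version below is a minimal illustration, not the program behind the diagram: the function names, the number of repeats (2000) and the random seed are all assumptions made here, while the population means, standard deviations and group size of 24 come from the scenario above.

```python
import random
import statistics

# Population parameters taken from the scenario in the text.
MU1, SIGMA1 = 4.5, 1.6   # zinc group (days)
MU2, SIGMA2 = 8.1, 1.8   # placebo group (days)
N = 24                   # patients per group


def sample_difference(rng):
    """Simulate one experiment and return placebo mean minus zinc mean."""
    zinc = [rng.gauss(MU1, SIGMA1) for _ in range(N)]
    placebo = [rng.gauss(MU2, SIGMA2) for _ in range(N)]
    return statistics.fmean(placebo) - statistics.fmean(zinc)


def simulate(repeats=2000, seed=1):
    """Repeat the experiment; repeats and seed are illustrative choices."""
    rng = random.Random(seed)
    return [sample_difference(rng) for _ in range(repeats)]


differences = simulate()
print(f"mean of simulated differences: {statistics.fmean(differences):.2f}")
print(f"sd of simulated differences:   {statistics.stdev(differences):.2f}")
```

The simulated differences should cluster around 3.6 days, while any individual difference will typically miss that value by a fraction of a day.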
The difference in means from a single data set, x̄2 - x̄1, is therefore an estimate of µ2 - µ1, but is unlikely to be exactly equal to it.
Birth weights in summer and winter
In practice, the underlying population means (and their difference) are unknown, and only a single sample from each group is available. The data set below is a typical example.
Without an understanding of the distribution of x̄2 - x̄1, it is impossible to properly interpret what the sample difference, 0.104 kg, tells you about the difference between the underlying population means.