Long page descriptions

Chapter 11   Comparing Groups

11.1   Models for two groups

11.1.1   Interest in underlying population

As with single-group data, the populations underlying two-group data sets are usually of more interest than the specific sample data.

11.1.2   Model for two groups

Two-group data sets are often modelled as separate random samples from two normal populations.

11.1.3   Parameters of the normal model

The normal model has four parameters — the means and standard deviations in the two groups.

11.1.4   Parameter estimates

The parameters of the normal model can be estimated by the sample means and standard deviations in the two groups.

11.1.5   Difference between means

The difference between the population means is of particular interest. The difference between the sample means provides an estimate; like any estimate, it varies from sample to sample and therefore has a distribution.

11.2   Distn of sums and differences

11.2.1   Means and sums of samples

The mean of a random sample is approximately normal with s.d. equal to σ divided by √n. The sum of a random sample is also approximately normal, but its s.d. is σ times √n.
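These two results can be checked with a small simulation. The following sketch (the values σ = 10 and n = 25 are chosen purely for illustration, not taken from the text) uses only Python's standard library:

```python
import random
import statistics

random.seed(1)                   # reproducible simulation
sigma, n, reps = 10.0, 25, 10000

means, sums = [], []
for _ in range(reps):
    sample = [random.gauss(0, sigma) for _ in range(n)]
    sums.append(sum(sample))
    means.append(sum(sample) / n)

# s.d. of the mean should be near sigma / sqrt(n) = 2.0,
# s.d. of the sum near sigma * sqrt(n) = 50.0
print(round(statistics.stdev(means), 2), round(statistics.stdev(sums), 1))
```

With 10,000 repetitions the two simulated standard deviations settle close to the theoretical values 2.0 and 50.0.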

11.2.2   Sum and difference

The sum and difference of two independent normal variables are also normally distributed. If both variables have the same standard deviation, σ, the sum and difference each have standard deviation √2 σ ≈ 1.414σ. (Their variance is 2σ².)

11.2.3   Sum and difference (cont)

This page generalises the results to the sum and difference of variables whose standard deviations may be different.

11.2.4   Probabilities for sums and differences

If two variables are independent and have normal distributions, probabilities relating to their sum and difference can be found using the formulae for the mean and standard deviation of sums and differences.
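As a sketch of this calculation (the distributions N(50, 3) and N(45, 4) are invented for illustration), `statistics.NormalDist` can evaluate the resulting normal probabilities:

```python
from statistics import NormalDist

# X ~ N(50, 3) and Y ~ N(45, 4), independent (illustrative values)
mu_x, sd_x = 50.0, 3.0
mu_y, sd_y = 45.0, 4.0

# Sum and difference both have s.d. sqrt(sd_x^2 + sd_y^2) = 5
sd_both = (sd_x ** 2 + sd_y ** 2) ** 0.5

total = NormalDist(mu_x + mu_y, sd_both)   # X + Y ~ N(95, 5)
diff = NormalDist(mu_x - mu_y, sd_both)    # X - Y ~ N(5, 5)

print(round(1 - total.cdf(100), 3))        # P(X + Y > 100)
print(round(1 - diff.cdf(0), 3))           # P(X > Y) = P(X - Y > 0)
```

Both probabilities correspond to z-scores of ±1 here, so they come out near 0.159 and 0.841.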

11.3   Comparing means in two groups

11.3.1   Distn of difference between means

The difference between the means of two samples from normal populations has a normal distribution whose mean and s.d. can be found from the population means and s.d.s. This is the approximate distribution even when the populations are non-normal.

11.3.2   SE of difference between means

When the difference between the sample means is used to estimate the difference between the underlying population means, there is likely to be an error. The error distribution is approximately normal with mean 0. A formula for its standard deviation is given.

11.3.3   CI for difference between means

A 95% confidence interval is given for the difference between two population means. Its properties are demonstrated.
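A rough sketch of the calculation, using invented summary statistics and the normal critical value 1.96 in place of the slightly wider t critical value used in the text:

```python
import math
from statistics import NormalDist

# Hypothetical two-group summaries: size, mean, s.d.
n1, mean1, s1 = 30, 172.0, 8.0
n2, mean2, s2 = 25, 167.0, 7.0

# Standard error of the difference between the sample means
se = math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)

z = NormalDist().inv_cdf(0.975)   # ~1.96; a t critical value would be a little larger
lo = (mean1 - mean2) - z * se
hi = (mean1 - mean2) + z * se
print(round(lo, 2), round(hi, 2))
```

For these illustrative numbers the interval is roughly (1.03, 8.97), so zero lies outside it.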

11.3.4   Testing a hypothesis

A hypothesis test is developed for testing whether two group means are the same.
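A minimal sketch of such a test, with invented group summaries and a normal approximation to the t distribution for the p-value (the text's test uses the t distribution itself):

```python
import math
from statistics import NormalDist

# Hypothetical summaries; H0: the two population means are equal
n1, mean1, s1 = 30, 172.0, 8.0
n2, mean2, s2 = 25, 167.0, 7.0

se = math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)
t_stat = (mean1 - mean2) / se

# Two-tailed p-value (normal approximation); a one-tailed test would use one tail only
p = 2 * (1 - NormalDist().cdf(abs(t_stat)))
print(round(t_stat, 2), round(p, 4))
```

Here the test statistic is about 2.47, giving a two-tailed p-value near 0.013.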

11.3.5   One-tailed tests for differences

If the alternative hypothesis is for one particular mean to be greater, then the p-value for the test is found from only one tail of the t distribution.

11.4   Comparing two proportions

11.4.1   Modelling two proportions

Two-group categorical data can be modelled as samples from two categorical populations with different probabilities of 'success'.

11.4.2   Distribution of difference in proportions

The difference between two sample proportions has a distribution that is approximately normal and whose parameters can be estimated using earlier results about the mean and standard deviation of differences.

11.4.3   CI for difference in proportions

The standard deviation of the difference between two sample proportions can be estimated. From this, a 95% confidence interval is developed for the difference between two probabilities.
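As a sketch with invented counts of successes in two groups, the estimated standard error and a 95% interval for the difference between the two probabilities can be computed as:

```python
import math
from statistics import NormalDist

# Hypothetical numbers of 'successes' out of each sample size
x1, n1 = 45, 120
x2, n2 = 30, 110

p1, p2 = x1 / n1, x2 / n2

# Estimated s.d. of the difference between the sample proportions
se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

z = NormalDist().inv_cdf(0.975)   # ~1.96
lo = (p1 - p2) - z * se
hi = (p1 - p2) + z * se
print(round(lo, 3), round(hi, 3))
```

For these illustrative counts the interval is roughly (-0.018, 0.222), so it includes zero.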

11.4.4   Testing for difference in probabilities

A hypothesis test is developed to assess whether two population probabilities are the same.
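Unlike the confidence interval, the test pools the two samples to estimate the common probability under the null hypothesis. A sketch with invented counts:

```python
import math
from statistics import NormalDist

# Hypothetical counts; H0: the two population probabilities are equal
x1, n1 = 45, 120
x2, n2 = 30, 110

p1, p2 = x1 / n1, x2 / n2
pooled = (x1 + x2) / (n1 + n2)       # common probability estimated under H0

# Standard error of the difference, computed under H0
se0 = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
z = (p1 - p2) / se0
p_value = 2 * (1 - NormalDist().cdf(abs(z)))
print(round(z, 2), round(p_value, 3))
```

Here z is about 1.65 and the two-tailed p-value about 0.10, so the difference would not be judged significant at the 5% level.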

11.5   Paired t test

11.5.1   Paired data

Paired data are a type of bivariate data in which two similar measurements are made from each individual. We are usually interested in testing whether the means of both measurements are the same.

11.5.2   Analysis of differences

For paired data, differences between the two measurements hold all information about whether the means of both variables are the same.

11.5.3   Paired t-test

Testing for a difference between the means of the measurements is done with an ordinary t-test for whether the mean difference is zero.
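A sketch of the calculation with invented before/after measurements on eight individuals; the t statistic is formed from the differences exactly as in a one-sample test of mean zero:

```python
import math
import statistics

# Hypothetical paired measurements on the same individuals
before = [72, 68, 75, 80, 66, 71, 77, 69]
after = [70, 65, 74, 76, 67, 68, 74, 66]

diffs = [b - a for b, a in zip(before, after)]
n = len(diffs)

# One-sample t statistic for H0: mean difference = 0, with n - 1 d.f.
t_stat = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))
print(round(t_stat, 2), "on", n - 1, "d.f.")
```

For these illustrative values the statistic is about 4.02 on 7 degrees of freedom, well into the tail of the t distribution.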

11.5.4   Pairing and experimental design

To estimate or test the difference between two means, it is sometimes possible to collect data from two independent samples or from paired units. If the paired units are similar, the paired design gives more accurate results.

11.6   Comparing several means

11.6.1   Model

To compare the means of several groups, a normal model is used for each group, but all group standard deviations must be assumed to be equal.

11.6.2   Parameter estimates

The sample standard deviations in the separate groups can be combined to give a pooled estimate of the common standard deviation, σ.
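The pooled estimate weights each group's variance by its degrees of freedom. A sketch, with three hypothetical groups summarised by (size, s.d.):

```python
import math

# Hypothetical group summaries: (n_i, s_i)
groups = [(10, 4.2), (12, 3.8), (9, 5.0)]

# Weight each sample variance by its degrees of freedom, n_i - 1
num = sum((n - 1) * s ** 2 for n, s in groups)
df = sum(n - 1 for n, s in groups)

pooled_sd = math.sqrt(num / df)   # estimate of the common sigma
print(round(pooled_sd, 2), "on", df, "d.f.")
```

For these illustrative summaries the pooled estimate is about 4.30 on 28 degrees of freedom.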

11.6.3   Revisiting two groups ((optional))

Earlier CIs and tests for equality of two group means can be improved when the group standard deviations are known to be the same. However this refinement is not recommended for general use.

11.6.4   Variation between and within groups

Both variability between group means and variability within groups must be used to assess whether the groups differ.

11.6.5   Sums of squares

Variability within groups and between groups are described by sums of squares.

11.6.6   Coefficient of determination

The coefficient of determination (R-squared) is the ratio of the between-groups sum of squares to the total sum of squares. It is the proportion of variation that can be explained by differences between the groups.
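Both quantities can be computed directly from the group values. A sketch with three invented groups of four values each:

```python
# Hypothetical data: three groups of four values
groups = [
    [23, 25, 28, 22],
    [30, 32, 29, 35],
    [27, 26, 24, 29],
]

all_vals = [x for g in groups for x in g]
grand = sum(all_vals) / len(all_vals)

ss_total = sum((x - grand) ** 2 for x in all_vals)
ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)

# The two component sums of squares add to the total
r_squared = ss_between / ss_total
print(ss_between, ss_within, ss_total, round(r_squared, 3))
```

For these illustrative values the between- and within-groups sums of squares (104 and 55) add to the total (159), and about 65% of the variation is explained by the group differences.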

11.6.7   Test for differences between groups

The F-ratio is a test statistic that is based on the between- and within-groups sums of squares. The associated p-value tests whether all groups have the same mean.
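The F-ratio divides each sum of squares by its degrees of freedom first. A sketch using illustrative sums of squares (not from the text), with k groups and n values in total:

```python
# Hypothetical anova quantities
ss_between, ss_within = 104.0, 55.0   # illustrative sums of squares
k, n = 3, 12                          # number of groups, total observations

ms_between = ss_between / (k - 1)     # between-groups mean square
ms_within = ss_within / (n - k)       # within-groups mean square

f_ratio = ms_between / ms_within      # large values suggest unequal group means
print(round(f_ratio, 2))
```

Here the F-ratio is about 8.5 on (2, 9) degrees of freedom; the p-value would be found from the corresponding F distribution.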

11.6.8   Examples

The F-test is applied to a few data sets.

11.7   Randomised blocks

11.7.1   Generalising the idea of paired data

In some data sets, the values arise in blocks of 3 or more related measurements. Randomised block and repeated measures data are of this form.

11.7.2   Example with baseline treatment

Ignoring the blocking of values loses important information about the difference between treatments. It may be possible to compare each treatment separately against a baseline treatment using paired differences.

11.7.3   Use of blocking information

If there is no baseline treatment against which to compare the other measurements in each block, it is possible to simultaneously test whether all treatment means are equal. Again, ignoring the blocks loses important information.

11.7.4   Randomised block designs

Data of this form often arise from a randomised block experiment in which the experimental units occur in related blocks and treatments are randomly allocated within each block.

11.7.5   Model for randomised blocks ((optional))

Although blocks and treatments arise in different ways, they are modelled similarly. A 3-dimensional display of the data represents both blocks and treatments in the same way.

11.7.6   Removing block effects

The variation between blocks can be removed by adding or subtracting a value within each block to make all block means equal. This reduces the residual (unexplained) sum of squares.

11.7.7   Sums of squares

The total sum of squares can be split into sums of squares for blocks and treatments, and a residual sum of squares.

11.7.8   Anova table and examples

An anova table shows these sums of squares and associated degrees of freedom. The F-ratio for treatments in the table is the basis of a test for equal treatment means. Several examples are given.
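The split of the total sum of squares can be sketched directly from a small table of invented data, with rows as blocks and columns as treatments:

```python
# Hypothetical randomised block data: rows = blocks, columns = treatments
data = [
    [12, 15, 14],
    [10, 14, 12],
    [11, 16, 13],
    [13, 18, 16],
]
b, t = len(data), len(data[0])        # numbers of blocks and treatments
grand = sum(sum(row) for row in data) / (b * t)

block_means = [sum(row) / t for row in data]
trt_means = [sum(data[i][j] for i in range(b)) / b for j in range(t)]

ss_total = sum((x - grand) ** 2 for row in data for x in row)
ss_blocks = t * sum((m - grand) ** 2 for m in block_means)
ss_trt = b * sum((m - grand) ** 2 for m in trt_means)
ss_resid = ss_total - ss_blocks - ss_trt

# F-ratio for treatments, on (t - 1) and (b - 1)(t - 1) degrees of freedom
f_trt = (ss_trt / (t - 1)) / (ss_resid / ((b - 1) * (t - 1)))
print(round(ss_blocks, 2), round(ss_trt, 2), round(ss_resid, 2), round(f_trt, 1))
```

For these illustrative values almost all of the variation is attributed to blocks and treatments, leaving a small residual sum of squares, so the F-ratio for treatments is very large.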