Standard deviation and variance

The variance of a set of values is the square of its standard deviation. Although standard deviations are much easier to interpret since they are in the same units as the original values (e.g. kg or degrees C), variances are the basis of many advanced statistical methods.

In particular, the total sum of squares is closely related to the overall variance of the observations, and the within-group sum of squares is closely related to the variance within the groups.

Mean sums of squares

The connection between sums of squares and variances occurs through quantities called mean sums of squares.

These are obtained by dividing each of the three sums of squares by a value called its degrees of freedom.

The mean total sum of squares is the sample variance of the response (ignoring groups).
The mean within-group sum of squares describes the variance within groups.
The mean between-group sum of squares is harder to directly interpret.

Note that the within-group and between-group degrees of freedom (the denominators) also add to give the total degrees of freedom.
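As a minimal sketch of these calculations, the Python below computes the three sums of squares and divides each by its degrees of freedom. The data values and variable names are made up for illustration. The degrees of freedom used are the standard ones (n - 1 for the total, g - 1 between groups and n - g within groups, for n values in g groups), which this section does not state explicitly.

```python
from statistics import mean, variance

# Hypothetical data: 3 measurements from each of 4 groups (illustrative only).
groups = {
    "A": [4.0, 5.0, 6.0],
    "B": [7.0, 8.0, 9.0],
    "C": [5.0, 7.0, 9.0],
    "D": [10.0, 11.0, 12.0],
}

values = [y for ys in groups.values() for y in ys]
n = len(values)   # total number of values
g = len(groups)   # number of groups
overall_mean = mean(values)

# Sums of squares: total, within groups and between groups.
ss_total = sum((y - overall_mean) ** 2 for y in values)
ss_within = sum((y - mean(ys)) ** 2 for ys in groups.values() for y in ys)
ss_between = sum(len(ys) * (mean(ys) - overall_mean) ** 2 for ys in groups.values())

# Degrees of freedom (assumed standard definitions, see the note above).
df_total, df_between, df_within = n - 1, g - 1, n - g

# Mean sums of squares: each sum of squares divided by its degrees of freedom.
mss_total = ss_total / df_total
mss_between = ss_between / df_between
mss_within = ss_within / df_within

print(f"Between: SS={ss_between:.2f} df={df_between} MSS={mss_between:.2f}")
print(f"Within:  SS={ss_within:.2f} df={df_within} MSS={mss_within:.2f}")
print(f"Total:   SS={ss_total:.2f} df={df_total} MSS={mss_total:.2f}")
print("Sample variance of all values:", round(variance(values), 2))
```

In this sketch the between-group and within-group rows add to the total row for both the sums of squares and the degrees of freedom, and the mean total sum of squares matches the ordinary sample variance of all the values.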

The mean sums of squares are all variances.

This is not easy to explain for the mean between-group sum of squares, but it is the reason the methodology is called analysis of variance.

Analysis of variance table

The calculations are usually presented in a table called an analysis of variance table. (This is often abbreviated to an anova table.)

(The final column contains a value that is used to test whether the group means are equal, but it can be ignored here. Textbooks usually call it an F ratio because the test is known as an F test, but a better name might be a Variance Ratio, because the numerator and denominator mean sums of squares are types of variances.)

Interpreting the mean sums of squares

The square root of MSSTotal is the overall standard deviation of the data.

The square root of MSSWithin is a kind of 'average' of the standard deviations within the g groups.
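The sense in which this is an 'average' can be sketched in Python. The example assumes the standard definition of the within-group degrees of freedom (n - g) and uses made-up data. With equal group sizes, MSSWithin is exactly the mean of the group variances, so its square root lies between the smallest and largest group standard deviations.

```python
from math import sqrt
from statistics import mean, stdev, variance

# Hypothetical data: 4 groups of 3 values each (illustrative only).
groups = [
    [4.0, 5.0, 6.0],
    [7.0, 8.0, 9.0],
    [5.0, 7.0, 9.0],
    [10.0, 11.0, 12.0],
]

n = sum(len(ys) for ys in groups)  # total number of values
g = len(groups)                    # number of groups

# The within-group sum of squares is the sum of (n_i - 1) * variance_i over groups.
ss_within = sum((len(ys) - 1) * variance(ys) for ys in groups)
mss_within = ss_within / (n - g)

group_sds = [stdev(ys) for ys in groups]
mean_group_variance = mean([variance(ys) for ys in groups])

print("Group standard deviations:", [round(s, 3) for s in group_sds])
print("Square root of MSSWithin:", round(sqrt(mss_within), 3))
print("Mean of group variances: ", round(mean_group_variance, 3))
```

When the group sizes differ, each group variance is weighted by its own degrees of freedom (its size minus 1), so larger groups have more influence on the result.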

Illustration of calculations

The dot plots on the left below show 3 numerical measurements from each of 4 groups.

The slider adjusts the relative size of the between-group and within-group sums of squares. Observe how this affects the relative sizes of SSWithin and SSBetween.

Maximum temperatures in Bulawayo

The diagram below repeats the jittered dot plot for all monthly maximum temperatures from July 1951 to April 2001, and separate dot plots for individual months.

The anova table for these data is shown next.

Source            SS      df    MSS      F ratio
Between groups    4,417    11   401.52   148.4
Within groups     1,585   586     2.70
Total             6,002   597    10.05

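From this table, the square root of the within-groups mean sum of squares, √2.70 ≈ 1.64, is a typical standard deviation of the maximum temperatures within a single month. It is much smaller than the overall standard deviation given by the square root of the total mean sum of squares, because much of the overall variation is explained by differences between months.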
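The arithmetic behind the table can be rechecked with a short Python sketch. It uses only the sums of squares and degrees of freedom printed in the table; the variable names are illustrative. Because the printed sums of squares are rounded to whole numbers, recomputed values can differ from the printed ones in the last digit.

```python
from math import sqrt

# Sums of squares and degrees of freedom copied from the anova table
# (the SS values are rounded to whole numbers in the table).
table = {
    "Between groups": (4417, 11),
    "Within groups": (1585, 586),
    "Total": (6002, 597),
}

# Mean sum of squares = sum of squares / degrees of freedom.
mss = {source: ss / df for source, (ss, df) in table.items()}

# F ratio = between-group MSS divided by within-group MSS.
f_ratio = mss["Between groups"] / mss["Within groups"]

for source, value in mss.items():
    print(f"{source:15s} MSS = {value:7.2f}   square root = {sqrt(value):5.2f}")
print(f"F ratio = {f_ratio:.1f}")
```

The square roots printed for the within-groups and total rows are the standard deviations described under "Interpreting the mean sums of squares".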


Generalising

The idea that overall variation in a variable can be partly explained by other information is a general one that is the basis of many advanced statistical methods.


In all cases, we try to use the available information to reduce the unexplained variation, and analysis of variance is used to analyse the data.

This partly explains the importance of variance and its square root, standard deviation, in all aspects of data analysis.