Types of variation

The overall variation of values is usually larger than the variation within individual groups. A general way to explain this effect is in terms of explained and unexplained variation.

Overall variation
This is the variation in the values when their grouping is ignored.
Unexplained variation
This is the variation of values within each group. Knowing the group gives us no information that helps to understand this variability, so it is unexplained by the groups.
Explained variation
This is the difference between the overall and unexplained variation. For example, if the groups corresponded to seasons, knowledge of the season would help to predict rainfall, so the season explains part of the variation in rainfall.
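
The three types of variation can be illustrated numerically. The sketch below is only one way to do this: it assumes that variation is measured by sums of squared deviations (the text above does not commit to a particular measure), and the rainfall values and season names are made up for illustration.

```python
from statistics import mean

# Hypothetical rainfall values (mm) grouped by season; not real data.
groups = {
    "wet": [120.0, 135.0, 110.0, 125.0],
    "dry": [10.0, 5.0, 15.0, 10.0],
}

all_values = [x for values in groups.values() for x in values]
overall_mean = mean(all_values)

# Overall variation: squared deviations from the overall mean, ignoring groups.
total_ss = sum((x - overall_mean) ** 2 for x in all_values)

# Unexplained variation: squared deviations of each value from its own group mean.
within_ss = sum(
    (x - mean(values)) ** 2 for values in groups.values() for x in values
)

# Explained variation: the difference between overall and unexplained variation.
explained_ss = total_ss - within_ss

print(f"overall={total_ss}, unexplained={within_ss}, explained={explained_ss}")
```

In this made-up example almost all of the overall variation is explained, because the two seasons have very different means and little spread around them.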

Reducing the unexplained variability

Low unexplained variability means that the values within each group are relatively close to their group mean. As a result, the group mean will be a relatively accurate prediction of future values.

A good way to split data into groups is one that gives high explained variation (large differences between group means) and low unexplained variation (small spread within groups).

Finding a way to group data such that there is low standard deviation within groups is therefore worthwhile.
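
As a rough sketch of how two candidate groupings might be compared, the code below computes a pooled within-group standard deviation for each. The helper name `pooled_within_sd`, the choice of a pooled standard deviation as the summary, and the six data values are all assumptions made for illustration.

```python
from math import sqrt
from statistics import mean


def pooled_within_sd(groups):
    """Pooled standard deviation of values around their own group means."""
    within_ss = sum((x - mean(g)) ** 2 for g in groups for x in g)
    n = sum(len(g) for g in groups)
    # Divide by (number of values - number of groups), as for a pooled variance.
    return sqrt(within_ss / (n - len(groups)))


# The same six hypothetical values, split in two different ways.
good_split = [[2.0, 3.0, 4.0], [10.0, 11.0, 12.0]]
poor_split = [[2.0, 10.0, 4.0], [3.0, 11.0, 12.0]]

good_sd = pooled_within_sd(good_split)
poor_sd = pooled_within_sd(poor_split)
print(f"good split: {good_sd:.2f}, poor split: {poor_sd:.2f}")
```

The first split separates small values from large ones, so its within-group standard deviation is much lower and its group means would be better predictions.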

Forecasts of rainfall in Moorings (Monze), Zambia

The most important planting time in Southern Zambia is October-December, and availability of good rainfall forecasts for that period would help farmers to decide which crops to plant and when the planting should take place.

The upper jittered dot plot below shows the rainfall totals for October-December at Moorings for the years 1922 to 2003.

Imagine a forecast of rainfall for this period that is provided in September each year. Such forecasts are often provided in the form High, Average or Low rather than numerically. The lower section of the diagram shows one type of forecast for each year: each rainfall is drawn on a row corresponding to the forecast for that year.

Observe that the rainfalls within each forecast group are less spread out than the rainfalls overall.

The better the forecast, the lower the standard deviation within the groups and the more the forecast narrows the range of likely rainfalls.

The lower the variation of rainfall among years that are given the same forecast (Low, Average or High), the better the quality of the forecast.
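
The link between forecast quality and within-group spread can be sketched with a small simulation. Everything here is an assumption for illustration: the rainfall values are synthetic (not the Moorings record), the `skill` parameter is a hypothetical stand-in for forecast quality, and a forecast is taken to be "correct" when it names the third of the distribution in which the rainfall actually fell.

```python
import random
from math import sqrt
from statistics import mean

LABELS = ["Low", "Average", "High"]


def pooled_within_sd(values, labels):
    """Pooled standard deviation of values around their forecast-group means."""
    groups = {}
    for x, label in zip(values, labels):
        groups.setdefault(label, []).append(x)
    within_ss = sum((x - mean(g)) ** 2 for g in groups.values() for x in g)
    return sqrt(within_ss / (len(values) - len(groups)))


def simulate_forecasts(rainfall, skill, rng):
    """With probability `skill` give the correct category, else a random one."""
    ordered = sorted(rainfall)
    low_cut = ordered[len(ordered) // 3]
    high_cut = ordered[2 * len(ordered) // 3]
    forecasts = []
    for x in rainfall:
        truth = "Low" if x < low_cut else "High" if x >= high_cut else "Average"
        forecasts.append(truth if rng.random() < skill else rng.choice(LABELS))
    return forecasts


rng = random.Random(0)
# Synthetic October-December totals for 82 years; not real data.
rainfall = [rng.gauss(300.0, 80.0) for _ in range(82)]

for skill in (0.0, 0.5, 1.0):
    sd = pooled_within_sd(rainfall, simulate_forecasts(rainfall, skill, rng))
    print(f"skill={skill:.1f}  within-group SD={sd:.1f}")
```

With `skill` set to 0 the forecast carries no information and the within-group standard deviation is close to the overall one; as `skill` approaches 1 the within-group standard deviation drops.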


Predicting a future value

In most practical examples, the grouping of values is fixed by the nature of the data set. The lower variation within groups, compared to the overall variation, means that we should be able to predict a future value more accurately (with the group mean) if we know its group membership.

Maximum temperatures in Bulawayo

The table below summarises the maximum monthly temperatures in Bulawayo from July 1951 to April 2001.

Month        Mean     Standard deviation
January      32.13    2.11
February     31.44    2.17
March        31.24    2.08
April        30.46    1.73
May          28.53    1.69
June         26.10    1.37
July         26.43    1.32
August       30.04    1.28
September    33.44    1.24
October      34.93    1.01
November     34.34    1.49
December     32.62    1.75
Overall      30.99    3.17

Consider prediction of the maximum temperature in Bulawayo in a future month, assuming no long-term trend.

If we are not told which month:
The best prediction would be 30.99 degrees, the overall mean in our combined data set. The standard deviation, s = 3.17, describes the likely errors in this prediction. Using the 70-95-100 rule of thumb, our prediction has about a 95% chance of being within 2s = 6.34 degrees of the actual temperature.
If we know that the month is July:
There is much less variation within the July temperatures. Using only historical data from July, we would predict the maximum temperature to be 26.43 degrees (the group mean), and the standard deviation within July temperatures, s = 1.32, would describe the likely prediction errors. This prediction has about a 95% chance of being within 2s = 2.64 degrees of the actual temperature.

Using knowledge of the month, we get a much more accurate prediction.

This is because the standard deviation within each month is lower than the overall standard deviation.
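
The two predictions can be set side by side with a few lines of arithmetic. This is a minimal sketch that uses the means and standard deviations from the table; the function name `two_sd_interval` is hypothetical.

```python
def two_sd_interval(mean, sd):
    """Approximate 95% range: the mean plus or minus two standard deviations."""
    return (mean - 2 * sd, mean + 2 * sd)


# Month unknown: use the overall mean and standard deviation.
overall_low, overall_high = two_sd_interval(30.99, 3.17)

# Month known to be July: use the July mean and standard deviation.
july_low, july_high = two_sd_interval(26.43, 1.32)

print(f"Month unknown: {overall_low:.2f} to {overall_high:.2f} degrees")
print(f"July:          {july_low:.2f} to {july_high:.2f} degrees")
```

The range for an unknown month runs from about 24.65 to 37.33 degrees, whereas the July range runs from about 23.79 to 29.07 degrees, which is less than half as wide.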


The overall standard deviation, s = 3.17, describes the variation if we do not take into account differences between the months. Some of this variation is explained by differences between the months.

The within-July standard deviation, s = 1.32, describes year-to-year variation in the July maximum temperatures. This variation is unexplained from the available information.
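
As a rough indication of how much of the overall variation the months explain, the monthly standard deviations in the table can be pooled and compared with the overall standard deviation. The sketch below is approximate: it assumes that every month has the same number of observations (so the pooled within-month variance is a simple average of the monthly variances) and it ignores small degrees-of-freedom corrections.

```python
from math import sqrt

# Within-month standard deviations from the table above.
within_month_sd = {
    "January": 2.11, "February": 2.17, "March": 2.08, "April": 1.73,
    "May": 1.69, "June": 1.37, "July": 1.32, "August": 1.28,
    "September": 1.24, "October": 1.01, "November": 1.49, "December": 1.75,
}
overall_sd = 3.17

# Assumption: equal group sizes, so average the monthly variances directly.
within_var = sum(sd ** 2 for sd in within_month_sd.values()) / len(within_month_sd)
overall_var = overall_sd ** 2

pooled_within_sd = sqrt(within_var)
explained_fraction = 1 - within_var / overall_var

print(f"pooled within-month SD: {pooled_within_sd:.2f}")
print(f"approximate fraction of variance explained by month: {explained_fraction:.2f}")
```

Under these assumptions the pooled within-month standard deviation is about 1.64 degrees, and the month explains roughly 73% of the overall variance in maximum temperature.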