Types of variation

The overall variation of values is usually larger than the variation within individual groups. A general way to explain this effect is in terms of explained and unexplained variation.

Overall variation
This is the variation in the values when their grouping is ignored.
Unexplained variation
This is the variation of the values within each group. We have no information that helps us to understand why values in the same group differ, so the within-group variability is unexplained by the groups.
Explained variation
This describes the difference between the overall and unexplained variation. For example, if the groups corresponded to seasons, knowledge of the season would help to predict rainfall, so the season explains part of the variation in rainfalls.
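
The three types of variation can be illustrated numerically. The sketch below is one possible way to do it, assuming that variation is measured by sums of squared deviations (a choice made here because the pieces then add up exactly). The rainfall values and the season names are made up for illustration.

```python
from statistics import mean

# Hypothetical rainfall totals (mm) grouped by season; illustrative values only.
groups = {
    "wet": [210.0, 250.0, 230.0, 270.0],
    "dry": [20.0, 35.0, 10.0, 15.0],
}

def sum_sq(values, centre):
    """Sum of squared deviations of values from centre."""
    return sum((v - centre) ** 2 for v in values)

all_values = [v for vals in groups.values() for v in vals]
overall_mean = mean(all_values)

# Overall variation: the grouping is ignored.
total_ss = sum_sq(all_values, overall_mean)

# Unexplained variation: spread of values around their own group mean.
within_ss = sum(sum_sq(vals, mean(vals)) for vals in groups.values())

# Explained variation: spread of the group means around the overall mean.
between_ss = sum(
    len(vals) * (mean(vals) - overall_mean) ** 2 for vals in groups.values()
)

explained_share = between_ss / total_ss
print(f"overall={total_ss:.1f} unexplained={within_ss:.1f} explained={between_ss:.1f}")
print(f"share of variation explained by season: {explained_share:.3f}")
```

With this measure, the explained variation equals the overall variation minus the unexplained variation, matching the definition above.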

Reducing the unexplained variability

Low unexplained variability means that the values within each group are relatively close to their group mean. As a result, the group mean will be a relatively accurate prediction of future values.

A useful grouping of the data has high explained variation (large differences between the group means) and low unexplained variation (small spread within the groups).

Finding a way to group data such that there is low standard deviation within groups is therefore worthwhile.

Forecasts of rainfall in Moorings (Monze), Zambia

The most important planting time in Southern Zambia is October-December, and availability of good rainfall forecasts for that period would help farmers to decide which crops to plant and when the planting should take place.

The upper jittered dot plot below shows the rainfall totals for October-December at Moorings for the years 1922 to 2003.

Imagine a forecast of rainfall for this period that is provided in September each year. Such forecasts are often provided in the form High, Average or Low rather than numerically. The lower section of the diagram shows one type of forecast for each year: each rainfall is drawn on a row corresponding to the forecast for that year.

Observe that the rainfalls within each forecast group have lower spread than the rainfalls as a whole.

Use the slider to adjust the characteristics of the forecast. The better the forecast, the lower the standard deviation within the groups and the more the forecast narrows the range of likely rainfalls.

The lower the variation of rainfall within the years forecast as Low, Average or High, the better the quality of the forecast.
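
The link between forecast quality and within-group spread can be sketched with simulated data. Everything below is an assumption made for illustration: the rainfall totals are synthetic (not the Moorings record), and the hypothetical forecaster is modelled as seeing the true total blurred by random noise, then labelling the lowest third of those signals Low, the middle third Average and the rest High.

```python
import random
from statistics import mean, stdev

def pooled_within_sd(values, labels):
    """Pooled standard deviation of values around their own group means."""
    groups = {}
    for v, lab in zip(values, labels):
        groups.setdefault(lab, []).append(v)
    ss = sum(sum((v - mean(g)) ** 2 for v in g) for g in groups.values())
    return (ss / (len(values) - len(groups))) ** 0.5

def forecast_labels(signal):
    """Label the lowest third of the signal Low, the middle third Average, the rest High."""
    ranked = sorted(signal)
    n = len(ranked)
    low_cut, high_cut = ranked[n // 3], ranked[2 * n // 3]
    return [
        "Low" if s < low_cut else "High" if s >= high_cut else "Average"
        for s in signal
    ]

rng = random.Random(1)
# Synthetic October-December totals (mm) for 82 years; NOT real data.
rainfall = [rng.gauss(300, 80) for _ in range(82)]
overall_sd = stdev(rainfall)

within_by_noise = {}
for noise_sd in (0, 40, 400):
    # More noise in the forecaster's signal means a worse forecast.
    signal = [r + rng.gauss(0, noise_sd) for r in rainfall]
    within_by_noise[noise_sd] = pooled_within_sd(rainfall, forecast_labels(signal))

print(f"overall sd: {overall_sd:.1f}")
for noise_sd, within in within_by_noise.items():
    print(f"forecast noise {noise_sd:>3}: within-group sd {within:.1f}")
```

Under these assumptions, the noise-free forecast gives the smallest within-group standard deviation, and a very noisy forecast leaves the within-group spread close to the overall spread.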


Predicting a future value

In most practical examples, the grouping of values is fixed by the nature of the data set. The lower variation within groups, compared to the overall variation, means that we should be able to predict a future value more accurately (with the group mean) if we know its group membership.

Maximum temperatures in Boston

The table below summarises, by calendar month, the monthly mean daily maximum temperatures (°C) in Boston from January 1950 to April 2014.

Month        Mean (°C)   Standard deviation (°C)
January         2.48       2.27
February        3.53       2.00
March           7.46       1.81
April          13.54       1.66
May            19.26       1.71
June           24.72       1.61
July           27.80       1.40
August         26.69       1.04
September      22.56       1.24
October        16.79       1.31
November       11.04       1.67
December        5.08       2.27
Overall        15.04       9.00

Consider predicting the mean daily maximum temperature in Boston for a future month, assuming no long-term trend.

If we are not told which month:
The best prediction would be 15.04°C, the overall mean in our combined data set. The standard deviation, s = 9.00, describes the likely errors in this prediction. Using the 70-95-100 rule of thumb, our prediction has about 95% chance of being within 2s = 18°C of the actual temperature.
If we know that the month is July:
There is much less variation within the July temperatures. Using only historical data from July, we would predict the maximum temperature to be 27.80°C (the group mean) and the standard deviation within July temperatures, s = 1.40, would describe the likely prediction errors. This prediction has about 95% chance of being within 2s = 2.80°C of the actual temperature.
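
The two predictions can be compared side by side with a few lines of arithmetic. The helper name `two_sd_range` is hypothetical; the means and standard deviations are the Overall and July rows of the table.

```python
def two_sd_range(mean_value, sd):
    """Range within 2 standard deviations of the mean (the 'about 95%' band)."""
    return (mean_value - 2 * sd, mean_value + 2 * sd)

# Month unknown: use the overall mean and standard deviation.
overall_low, overall_high = two_sd_range(15.04, 9.00)

# Month known to be July: use the July mean and standard deviation.
july_low, july_high = two_sd_range(27.80, 1.40)

print(f"month unknown: {overall_low:.2f} to {overall_high:.2f}")
print(f"July:          {july_low:.2f} to {july_high:.2f}")

# How many times narrower the July band is than the overall band.
narrowing = (overall_high - overall_low) / (july_high - july_low)
print(f"the July range is about {narrowing:.1f} times narrower")
```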

Using knowledge of the month, we get a much more accurate prediction.

The gain comes from the standard deviations within months being much lower than the overall standard deviation.


The overall standard deviation, s = 9.00, describes the variation if we do not take into account differences between the months. Some of this variation is explained by differences between the months.

The within-July standard deviation, s = 1.40, describes year-to-year variation in the July maximum temperatures. This variation is unexplained by the available information.
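
As a rough check on how much of the overall variation the months explain, the table can be turned into sums of squares. This is a sketch under two assumptions that the text does not state: the record has no missing months (so January to April each have 65 values and May to December each have 64), and the tabulated standard deviations use the usual n - 1 divisor.

```python
# (mean, standard deviation, assumed number of years) for each calendar month.
# The counts are an assumption based on the January 1950 to April 2014 range.
months = {
    "January": (2.48, 2.27, 65),
    "February": (3.53, 2.00, 65),
    "March": (7.46, 1.81, 65),
    "April": (13.54, 1.66, 65),
    "May": (19.26, 1.71, 64),
    "June": (24.72, 1.61, 64),
    "July": (27.80, 1.40, 64),
    "August": (26.69, 1.04, 64),
    "September": (22.56, 1.24, 64),
    "October": (16.79, 1.31, 64),
    "November": (11.04, 1.67, 64),
    "December": (5.08, 2.27, 64),
}
overall_sd = 9.00  # "Overall" row of the table

n_total = sum(n for _, _, n in months.values())

# The count-weighted average of the monthly means should recover the overall mean.
weighted_mean = sum(m * n for m, _, n in months.values()) / n_total

# Unexplained variation: within-month sums of squares, (n - 1) * s^2 per month.
within_ss = sum((n - 1) * s ** 2 for _, s, n in months.values())

# Overall variation: (N - 1) * s^2 using the overall standard deviation.
total_ss = (n_total - 1) * overall_sd ** 2

explained_share = 1 - within_ss / total_ss
pooled_within_sd = (within_ss / (n_total - len(months))) ** 0.5

print(f"weighted mean of monthly means: {weighted_mean:.2f}")
print(f"pooled within-month sd: {pooled_within_sd:.2f}")
print(f"share of variation explained by month: {explained_share:.3f}")
```

Under these assumptions the weighted mean agrees with the tabulated overall mean of 15.04, and the months account for the large majority of the overall variation, leaving a typical within-month standard deviation well below 9.00.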