Types of variation
The overall variation of values is usually larger than the variation within individual groups. A general way to explain this effect is in terms of explained and unexplained variation.
Reducing the unexplained variability
Low unexplained variability means that the values within each group are relatively close to their group mean. As a result, the group mean will be a relatively accurate prediction of future values.
It is good to split data into groups in a way that has high explained variation (differences between group means) and low unexplained variation (within groups).
Finding a way to group data such that there is low standard deviation within groups is therefore worthwhile.
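One common way to put numbers on explained and unexplained variation is to split the total sum of squared deviations into a between-group part and a within-group part. The sketch below is only an illustration: the function name `variation_split` and the three small groups are made up for this example and do not come from any data set discussed here.

```python
from statistics import fmean


def variation_split(groups):
    """Return (total, explained, unexplained) sums of squares.

    groups: dict mapping a group label to a list of numeric values.
    """
    all_values = [v for vals in groups.values() for v in vals]
    overall_mean = fmean(all_values)

    # Total variation: squared distances from the overall mean.
    total = sum((v - overall_mean) ** 2 for v in all_values)

    explained = 0.0    # differences between group means
    unexplained = 0.0  # spread of values around their own group mean
    for vals in groups.values():
        group_mean = fmean(vals)
        explained += len(vals) * (group_mean - overall_mean) ** 2
        unexplained += sum((v - group_mean) ** 2 for v in vals)

    return total, explained, unexplained


# Hypothetical values, chosen only to make the arithmetic easy to follow.
example_groups = {
    "A": [10.0, 12.0, 11.0],
    "B": [20.0, 19.0, 21.0],
    "C": [15.0, 14.0, 16.0],
}
total, explained, unexplained = variation_split(example_groups)
print(f"total={total:.1f} explained={explained:.1f} unexplained={unexplained:.1f}")
```

The explained and unexplained parts add up to the total, so a grouping that raises one necessarily lowers the other.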
Forecasts of rainfall in Moorings (Monze), Zambia
The most important planting time in Southern Zambia is October-December, and availability of good rainfall forecasts for that period would help farmers to decide which crops to plant and when the planting should take place.
The upper jittered dot plot below shows the rainfall totals for October-December at Moorings for the years 1922 to 2003.
Imagine a forecast of rainfall for this period that is provided in September each year. Such forecasts are often provided in the form High, Average or Low rather than numerically. The lower section of the diagram shows one type of forecast for each year: each rainfall is drawn on a row corresponding to the forecast for that year.
Observe that the rainfalls within each forecast group are less spread out than the rainfalls overall.
The better the forecast, the lower the standard deviation within the groups and the more the forecast narrows the range of likely rainfalls.
In other words, the lower the variation of rainfall among the years forecast as Low, as Average and as High, the better the quality of the forecast.
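The link between forecast quality and within-group spread can be explored with a small simulation. This is a rough sketch under stated assumptions, not a model of the Moorings data: the rainfall values are simulated with arbitrary parameters, and the helper names `make_forecasts` and `pooled_within_sd` are invented for this illustration. A forecast is imitated by labelling each year from a noisy view of its actual rainfall, so more noise means a poorer forecast.

```python
import random
import statistics


def pooled_within_sd(values, labels):
    """Pooled within-group SD: sqrt(within-group SS / (n - number of groups))."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    within_ss = 0.0
    for vals in groups.values():
        group_mean = statistics.fmean(vals)
        within_ss += sum((v - group_mean) ** 2 for v in vals)
    return (within_ss / (len(values) - len(groups))) ** 0.5


def make_forecasts(rain, noise_sd, rng):
    """Label each year Low/Average/High from its rainfall plus random noise."""
    noisy = [r + rng.gauss(0, noise_sd) for r in rain]
    ranked = sorted(noisy)
    low_cut = ranked[len(ranked) // 3]
    high_cut = ranked[2 * len(ranked) // 3]
    return [
        "Low" if x < low_cut else "High" if x >= high_cut else "Average"
        for x in noisy
    ]


rng = random.Random(1)
# Simulated rainfall totals (arbitrary mean and SD), NOT the Moorings record.
rain = [rng.gauss(300, 100) for _ in range(82)]

print(f"overall SD: {statistics.stdev(rain):.1f}")
for noise_sd in (0, 50, 100, 1000):
    labels = make_forecasts(rain, noise_sd, rng)
    print(f"noise {noise_sd:>4}: within-group SD {pooled_within_sd(rain, labels):.1f}")
```

With no noise the forecast groups are simply the lowest, middle and highest thirds of the rainfalls, so the within-group standard deviation is far below the overall one. With very large noise the labels are close to random and the within-group standard deviation is about as large as the overall one.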
Predicting a future value
In most practical examples, the grouping of values is fixed by the nature of the data set. The lower variation within groups, compared to the overall variation, means that we should be able to predict a future value more accurately (with the group mean) if we know its group membership.
Maximum temperatures in Bulawayo
The table below summarises the maximum monthly temperatures in Bulawayo from July 1951 to April 2001.
Month | Mean (°C) | Standard deviation (°C) |
---|---|---|
January | 32.13 | 2.11 |
February | 31.44 | 2.17 |
March | 31.24 | 2.08 |
April | 30.46 | 1.73 |
May | 28.53 | 1.69 |
June | 26.10 | 1.37 |
July | 26.43 | 1.32 |
August | 30.04 | 1.28 |
September | 33.44 | 1.24 |
October | 34.93 | 1.01 |
November | 34.34 | 1.49 |
December | 32.62 | 1.75 |
Overall | 30.99 | 3.17 |
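Consider predicting the maximum temperature in Bulawayo in a future month, assuming no long-term trend. Without knowing the month, a natural prediction is the overall mean, 30.99, and actual values typically differ from it by about the overall standard deviation, 3.17.
Using knowledge of the month, we get a much more accurate prediction. For a future October, for example, we would predict 34.93, and the October values have a standard deviation of only 1.01.
The improvement arises because the standard deviation within each month is lower than the overall standard deviation.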
The overall standard deviation, s = 3.17, describes the variation if we do not take into account differences between the months. Some of this variation is explained by differences between the months.
The within-July standard deviation, s = 1.32, describes year-to-year variation in the July maximum temperatures. This variation is unexplained from the available information.
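The table can also give a rough idea of how much of the overall variation is explained by the month. The sketch below averages the twelve within-month variances and compares the result with the overall variance. It is an approximation with two assumptions that are not stated in the text: every month contributes about the same number of observations, and the small difference between dividing by n and by n - 1 is ignored. The variable names are chosen for this example only.

```python
from math import sqrt

# Within-month standard deviations copied from the table above.
monthly_sd = {
    "January": 2.11, "February": 2.17, "March": 2.08, "April": 1.73,
    "May": 1.69, "June": 1.37, "July": 1.32, "August": 1.28,
    "September": 1.24, "October": 1.01, "November": 1.49, "December": 1.75,
}
overall_sd = 3.17  # "Overall" row of the table

# Average within-month variance (assumes roughly equal counts per month).
mean_within_var = sum(sd ** 2 for sd in monthly_sd.values()) / len(monthly_sd)
typical_within_sd = sqrt(mean_within_var)

# Share of the overall variance left unexplained by the month, and the rest.
unexplained_share = mean_within_var / overall_sd ** 2
explained_share = 1 - unexplained_share

print(f"typical within-month SD: {typical_within_sd:.2f}")
print(f"approximate share explained by month: {explained_share:.0%}")
```

Under these assumptions the typical within-month standard deviation is roughly 1.6, about half of the overall 3.17, and a little under three quarters of the overall variance is accounted for by differences between the months.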