Outliers and the standard deviation

The mean and standard deviation of a data set summarise the centre and spread of values but contain no further information about the shape of the distribution. They are therefore poor descriptions of distributions that have clusters, outliers or skewness.

It is worth spending a little more time investigating the effect of an outlier on the mean and standard deviation. Although an outlier has a reasonably strong influence on the mean of the data, its influence on the standard deviation is much greater.

Outliers have an extremely strong effect on the standard deviation.

By applying the 70-95-100 rule of thumb (about 70% of values within 1 standard deviation of the mean, about 95% within 2 and nearly all within 3) and asking whether the resulting proportions are reasonable in the context of the data, you may be able to tell that something is wrong. For example, is it reasonable that 70% of the values are between, say, 14 and 18, and 30% are outside this interval?
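The check described above can be sketched in Python. This is only an illustration and not part of the original material: the function name proportions_within and the sample values are invented for the example, and the sample standard deviation (divisor n - 1) is assumed.

```python
from statistics import mean, stdev

def proportions_within(values, max_k=3):
    """Proportion of values within k standard deviations of the mean, for k = 1..max_k."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation (divisor n - 1); an assumption
    n = len(values)
    return {
        k: sum(abs(v - m) <= k * s for v in values) / n
        for k in range(1, max_k + 1)
    }

# Hypothetical data: nine ordinary values and one outlier (100).
sample = [14, 15, 15, 16, 16, 16, 17, 17, 18, 100]
props = proportions_within(sample)
for k, p in props.items():
    print(f"within {k} sd: {p:.0%}")
```

For this invented sample, 90% of the values lie within 1 standard deviation of the mean, well above the 70% suggested by the rule of thumb. A mismatch of this kind hints that the standard deviation has been inflated by an outlier.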

The mean and standard deviation are bad summaries of data sets with outliers.

A graphical display such as a dot plot is the best way to detect an outlier and you should always look at the data before summarising with a mean and standard deviation.

An outlier should be carefully examined. Was the value incorrectly recorded? Was there something unusual about the individual from which the measurement was obtained? If we are convinced that there was something wrong with the value, it should be removed from the data set before further analysis.

Date of first rains in Samaru, Nigeria

The stacked dot plot below shows the date of the first planting rain in Samaru each year from 1928 to 1983, defined as the first occasion after 1st April when there was more than 20 mm of rain within one or two days.

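The dates are recorded as day numbers within the year, counting 1st January as day 1 (e.g. 1st April = day 92, 1st May = day 122). These example values correspond to a calendar in which 29th February is counted.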
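A minimal sketch of this conversion is given here, using Python's standard datetime module. The function name day_number is invented for the example, and the choice of a leap year (1928) as the reference year is an assumption made so that the results agree with the day numbers quoted above; the source does not say how 29th February is handled in other years.

```python
from datetime import date

def day_number(month, day, reference_year=1928):
    """Day number within the year, counting 1st January as day 1.

    reference_year=1928 (a leap year) is an assumption, chosen so that
    29th February is always counted.
    """
    return date(reference_year, month, day).timetuple().tm_yday

print(day_number(4, 1))  # 1st April
print(day_number(5, 1))  # 1st May
```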

We will now consider adding an 'outlier' to this data set; we will pretend that it is an incorrectly recorded date for the first rains in 1927. Click High outlier to add a value of 240 to the data set. The mean date increases by only 2 days, but the standard deviation increases from 18.7 to 23.9 days, a much greater relative change.

Drag the outlier to 300, increasing the standard deviation to 29.5. The 70-95-100 rule with this standard deviation gives a misleading impression of the chance of a very early or late date for the first rains.
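The different sensitivity of the two summaries can be seen from the formulas for adding a single value x to n values with mean m and standard deviation s. The sketch below is an illustration only: the function name stats_after_adding and the day numbers are invented (they are not the Samaru data), and the sample standard deviation (divisor n - 1) is assumed.

```python
from math import sqrt
from statistics import mean, stdev

def stats_after_adding(n, m, s, x):
    """Mean and sample sd after adding one value x to n values with mean m and sd s."""
    new_mean = m + (x - m) / (n + 1)
    # Sum of squares about the new mean:
    # old sum of squares + n/(n+1) * (x - m)^2
    new_ss = (n - 1) * s**2 + n / (n + 1) * (x - m) ** 2
    # There are now n + 1 values, so the divisor is n.
    return new_mean, sqrt(new_ss / n)

# Hypothetical day numbers, invented for illustration.
days = [105, 112, 118, 121, 125, 128, 131, 136, 142, 150]

for outlier in (240, 300):
    new_mean, new_sd = stats_after_adding(len(days), mean(days), stdev(days), outlier)
    print(f"outlier {outlier}: mean {mean(days):.1f} -> {new_mean:.1f}, "
          f"sd {stdev(days):.1f} -> {new_sd:.1f}")
```

The mean moves by (x - m)/(n + 1), which grows in proportion to the distance of the outlier from the mean. The sum of squares grows with the square of that distance, which is why the standard deviation reacts so much more strongly.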

Missing value

When a data value is missing (e.g. the date of the first rains in 1927 may not have been recorded), it is often coded as an 'impossible' value, such as '999'. Click Missing value (coded 999) to change the value for 1927 to 999, and observe the effect on the mean and standard deviation.

No planting rain

Another complication with rainfall data in Africa is that sometimes the rains do not start at all. Consider what would happen if there was no planting rain in 1984. Such a year might be coded as the value '0'. Such zeros are important information (unlike missing values), but they should not be included when calculating means and standard deviations.
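One way to handle such codes is sketched below. It is a minimal illustration under stated assumptions: the codes 999 and 0 are taken from the text, but the function name summarise_dates and the records are invented for the example. Coded years are excluded from the mean and standard deviation, and the years with no planting rain are counted separately so that this information is not lost.

```python
from statistics import mean, stdev

MISSING_CODE = 999  # code for a missing value, as in the text
NO_RAIN_CODE = 0    # code for a year with no planting rain, as in the text

def summarise_dates(day_numbers):
    """Mean and sd of the genuine dates, plus counts of the coded years."""
    valid = [d for d in day_numbers if d not in (MISSING_CODE, NO_RAIN_CODE)]
    return {
        "mean": mean(valid),
        "sd": stdev(valid),
        "n_missing": day_numbers.count(MISSING_CODE),
        "n_no_rain": day_numbers.count(NO_RAIN_CODE),
    }

# Hypothetical records, invented for illustration.
records = [999, 110, 120, 130, 140, 0]
summary = summarise_dates(records)
print(summary)
```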

Click Low outlier and then No planting rain (coded 0) to see the effect of erroneously treating the value 0 for 1984 as a proper date. Again the standard deviation is badly affected.

The mean and standard deviation may appear 'reasonable' even if there are outliers. Always examine a dot plot, histogram or box plot before analysis.