Outliers
Values that are considerably larger or smaller than the bulk of the data are called outliers.
Detection of outliers is particularly important. An outlier may have been incorrectly recorded, or there may have been other anomalous circumstances associated with it. Outliers must be carefully checked if possible. If anything atypical can be found, outliers should be deleted from the data set and their deletion noted in any reports about the data.
Aircraft pollution
The table below shows emissions of CO2 per takeoff/landing cycle for several jet aircraft in the mid 1990s.
Aircraft | CO2 emissions (kg per cycle) | Weight (1000 kg) | CO2 emissions per 1000 kg |
---|---|---|---|
The raw CO2 emissions cannot be compared meaningfully; it is hardly surprising that larger aircraft tend to emit more than smaller ones! A more useful comparison of CO2 emissions can be obtained by first dividing by aircraft weight. The last column in the table gives emissions in each takeoff/landing cycle per 1000 kg of aircraft weight.
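The division described above can be sketched in a few lines of Python. The aircraft names and figures here are hypothetical placeholders chosen for illustration; they are not the values from the table.

```python
# Hypothetical figures for illustration only; they are NOT the values from the table.
# Each entry: aircraft name -> (CO2 in kg per cycle, weight in 1000 kg)
aircraft = {
    "Large jet (hypothetical)": (8000.0, 100.0),
    "Medium jet (hypothetical)": (4500.0, 50.0),
    "Small jet (hypothetical)": (1260.0, 6.0),
}


def emissions_per_1000kg(co2_kg, weight_1000kg):
    """CO2 emitted per cycle for each 1000 kg of aircraft weight."""
    return co2_kg / weight_1000kg


per_weight = {
    name: emissions_per_1000kg(co2, weight)
    for name, (co2, weight) in aircraft.items()
}

for name, value in per_weight.items():
    print(f"{name}: {value:.0f} kg CO2 per 1000 kg")
```

In this made-up data the smallest aircraft has the lowest raw emissions but the highest emissions per 1000 kg, which is the kind of reversal that only becomes visible after dividing by weight.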
These values are displayed in a stem and leaf plot. The display shows one outlier. The Gates Learjet 24D emits over 200 kg CO2 per 1000 kg aircraft weight in each takeoff/landing cycle, more than double that of the other aircraft. An engineer would question which characteristics of the plane might explain its emissions. Is it smaller than the other planes? Older? Using a different engine technology?
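To see how an outlier shows up in a stem and leaf plot, here is a minimal text-based sketch in Python. The function name `stem_and_leaf`, its `leaf_unit` parameter and the sample values are all assumptions made for illustration; the values are not the aircraft data. Stems are printed with the largest at the top, and empty stems are kept so that a gap in the data stays visible.

```python
from collections import defaultdict


def stem_and_leaf(values, leaf_unit=1):
    """Return the lines of a text stem and leaf plot, largest stem first.

    Each value is rounded to a whole number of leaf units; the last digit
    becomes the leaf and the remaining digits form the stem.
    """
    leaves = defaultdict(list)
    for v in sorted(values):
        scaled = int(round(v / leaf_unit))
        leaves[scaled // 10].append(scaled % 10)

    lowest, highest = min(leaves), max(leaves)
    lines = []
    # Include empty stems so that gaps before an outlier are not hidden.
    for stem in range(highest, lowest - 1, -1):
        row = "".join(str(leaf) for leaf in leaves.get(stem, []))
        lines.append(f"{stem:>3} | {row}")
    return lines


# Hypothetical values (kg CO2 per 1000 kg), NOT the figures from the table.
sample = [62, 74, 88, 91, 103, 214]
plot_lines = stem_and_leaf(sample)
print("\n".join(plot_lines))
```

With these sample values the top row holds a single leaf, separated from the rest of the data by a long run of empty stems. That gap is what makes the value stand out as an outlier.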
Outliers and skew distributions
An extreme data value that stands out from the rest of the data does not necessarily indicate that there is a mistake in the data or something unusual about the individual. Our interpretation of the extreme value should also take into account the shape of the distribution of values for the rest of the data.
Storm duration
The stem and leaf plot below shows the durations (in minutes) of the first 50 storms in the 1983/4 rainy season in the Bvumbwe catchment in Malawi.
One storm lasted much longer than the others (880 minutes). It is certainly worth checking the records for this storm (was the duration perhaps really 88 minutes?). However, the value is not necessarily a mistake.
Most storms are short, with durations less than 100 minutes, so the longest rows of leaves are at the bottom of the stem and leaf plot. There are fewer storms lasting 100-200 minutes and fewer still lasting 200-300 minutes. This pattern continues, with the frequency of storms decreasing steadily up the stem and leaf plot. This shape of distribution is called a skew distribution, as opposed to a symmetric distribution, whose tails decrease at a similar rate on both sides of the peak density.
Perhaps this 'outlier' is a continuation of the pattern into the tail of the distribution and is just a long storm that could be expected once every hundred or so storms.
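The steadily decreasing frequencies that characterise a skew distribution can be checked by counting values in equal-width intervals. The sketch below is in Python; the helper name `counts_per_bin` and the durations are assumptions made up for illustration, not the Bvumbwe storm records.

```python
from collections import Counter


def counts_per_bin(values, width=100):
    """Count values in consecutive intervals [0, width), [width, 2*width), ..."""
    counts = Counter(int(v // width) for v in values)
    # Report every interval up to the largest one, including empty intervals.
    return [counts.get(b, 0) for b in range(max(counts) + 1)]


# Hypothetical storm durations in minutes, NOT the recorded data.
durations = [5, 12, 20, 33, 41, 58, 70, 95, 110, 150, 180, 240, 310, 880]
bin_counts = counts_per_bin(durations)

for b, n in enumerate(bin_counts):
    print(f"{b * 100:>4}-{(b + 1) * 100:<4} {'*' * n}")
```

In this made-up sample the counts fall away as duration increases, leaving a long sparse tail. Whether an isolated value far out in that tail is a recording error or simply a rare long storm remains a judgement that the counts alone cannot settle.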