Outliers

Values that are considerably larger or smaller than the bulk of the data are called outliers.

Detection of outliers is particularly important. An outlier may have been incorrectly recorded, or there may have been other anomalous circumstances associated with it. Outliers must be carefully checked if possible. If anything atypical can be found, outliers should be deleted from the data set and their deletion noted in any reports about the data.

Aircraft pollution

The table below shows emissions of CO2 per takeoff/landing cycle for several jet aircraft in the mid 1990s.

Aircraft              CO2 emissions     Weight       CO2 emissions
                      (kg per cycle)    (1000 kg)    per 1000 kg
B747-400              10822.72          394.0         27.469
B747-200              11673.04          351.5         33.209
MD-11                  8115.23          274.0         29.618
DC10-30                7313.15          251.7         29.055
L1011-200              8283.17          195.0         42.478
A300                   5633.66          165.0         34.143
DC8-63                 6246.50          161.0         38.798
A310                   4880.16          150.0         32.534
B707-320B              6246.50          148.3         42.121
B767-300               5351.02          137.0         39.059
B757-200               4614.96          108.8         42.417
B727-200               4866.75           88.3         55.116
A320                   2898.05           73.5         39.429
B737-300               2758.79           63.0         43.790
B737-100               3244.50           52.3         62.036
DC9-50                 3272.23           48.9         66.917
BAe-146                1855.08           40.5         45.804
BAC 111-400            2516.73           39.4         63.876
Fokker 28              2034.24           29.4         69.192
Dassault Falcon 20     1117.43           12.8         87.299
Gates Learjet 36        536.79            8.1         66.270
Gates Learjet 35        536.79            7.7         69.713
Gates Learjet 24D      1221.07            6.1        200.175
Cessna Citation         462.08            5.2         88.862

The raw CO2 emissions cannot be compared meaningfully: it is hardly surprising that larger aircraft tend to emit more than smaller ones. A more useful comparison is obtained by first dividing each aircraft's emissions by its weight. The last column in the table gives the emissions in each takeoff/landing cycle per 1000 kg of aircraft weight.
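The division described above can be sketched in a few lines of Python. This is only an illustration using three rows copied from the table; the function name emissions_per_1000kg and the variable names are hypothetical choices made for this sketch, not part of the original material.

```python
# Three rows copied from the table: (aircraft, CO2 kg per cycle, weight in 1000 kg).
aircraft = [
    ("B747-400", 10822.72, 394.0),
    ("A320", 2898.05, 73.5),
    ("Gates Learjet 24D", 1221.07, 6.1),
]


def emissions_per_1000kg(co2_kg, weight_1000kg):
    """Return CO2 emissions per takeoff/landing cycle per 1000 kg of aircraft weight."""
    return co2_kg / weight_1000kg


# Round to three decimals to match the precision used in the table.
per_weight = {
    name: round(emissions_per_1000kg(co2, weight), 3)
    for name, co2, weight in aircraft
}

for name, value in per_weight.items():
    print(f"{name:<20}{value:>9.3f}")
```

After the division, the largest aircraft in this subset has the lowest value and the smallest has the highest, which is the reverse of the ordering of the raw emissions.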

These values are displayed in the stem and leaf plot on the right.

The display shows one outlier. The Gates Learjet 24D emits over 200 kg of CO2 per 1000 kg of aircraft weight in each takeoff/landing cycle, more than double the figure for any other aircraft.
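The size of the gap between the outlier and the remaining aircraft can be checked directly from the last column of the table. The sketch below simply ranks the values and compares the largest with the second largest. It is an informal check written for this example, not a formal outlier test, and the variable names are hypothetical.

```python
# CO2 emissions per 1000 kg aircraft weight, copied from the last column of the table.
per_1000kg = {
    "B747-400": 27.469, "B747-200": 33.209, "MD-11": 29.618,
    "DC10-30": 29.055, "L1011-200": 42.478, "A300": 34.143,
    "DC8-63": 38.798, "A310": 32.534, "B707-320B": 42.121,
    "B767-300": 39.059, "B757-200": 42.417, "B727-200": 55.116,
    "A320": 39.429, "B737-300": 43.790, "B737-100": 62.036,
    "DC9-50": 66.917, "BAe-146": 45.804, "BAC 111-400": 63.876,
    "Fokker 28": 69.192, "Dassault Falcon 20": 87.299,
    "Gates Learjet 36": 66.270, "Gates Learjet 35": 69.713,
    "Gates Learjet 24D": 200.175, "Cessna Citation": 88.862,
}

# Rank aircraft from highest to lowest emissions per 1000 kg.
ranked = sorted(per_1000kg.items(), key=lambda item: item[1], reverse=True)
(top_name, top_value), (next_name, next_value) = ranked[0], ranked[1]

# How many times larger is the most extreme value than the runner-up?
ratio = top_value / next_value

print(f"Highest: {top_name} ({top_value})")
print(f"Next highest: {next_name} ({next_value})")
print(f"Ratio: {ratio:.2f}")
```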

An engineer would question which characteristics of the plane might explain its emissions. Is it smaller than the other planes? Older? Using a different engine technology?


Outliers and skew distributions

An extreme data value that stands out from the rest of the data does not necessarily indicate that there is a mistake in the data or something unusual about the individual. Our interpretation of the extreme value should also take into account the shape of the distribution of values for the rest of the data.

Symmetric distribution

Skew distribution
If the distribution of values has its peak at one side and a long tail to the other side, the distribution is called skew. It is not unusual for the extreme value in a very skew distribution to be a fair distance from the other values.

If the tail is to the right, we call this a right (or positively) skewed distribution. Similarly, if the tail is to the left, the distribution is left (or negatively) skewed.
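The terms "positively" and "negatively" skewed match the sign of a numerical measure of skewness. The sketch below uses the moment coefficient of skewness, one common measure that the text above does not itself define, so treat it as an assumption made for illustration. The three small data sets are invented toy values, not real measurements.

```python
def skewness(values):
    """Moment coefficient of skewness: third central moment over (variance ** 1.5)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5


# Invented toy data, chosen only to show the sign of the measure.
right_tailed = [1, 1, 2, 2, 3, 10]            # long tail to the right
left_tailed = [11 - x for x in right_tailed]  # mirror image: long tail to the left
symmetric = [1, 2, 3, 4, 5]                   # same shape on both sides of the centre

for label, data in [("right", right_tailed), ("left", left_tailed), ("symmetric", symmetric)]:
    print(f"{label:<10}{skewness(data):+.3f}")
```

A positive result corresponds to a tail on the right, a negative result to a tail on the left, and a value near zero to a roughly symmetric distribution.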

Storm duration

The stem and leaf plot below shows the durations (in minutes) of the first 50 storms in the 1983/4 rainy season in the Bvumbwe catchment in Malawi.

One storm lasted much longer than the others (880 minutes). It is certainly worth checking the records for this storm (was the duration perhaps really 88 minutes?). However, the value is not necessarily a mistake.

Most storms are short, with durations of less than 100 minutes, so the longest rows of leaves are at the bottom of the stem and leaf plot. There are fewer storms lasting 100-200 minutes, fewer still lasting 200-300 minutes, and this pattern continues, with the frequency of storms decreasing steadily up the stem and leaf plot. This shape of distribution is called a skew distribution, as opposed to a symmetric distribution, whose tails fall away at a similar rate on both sides of the peak.

Perhaps this 'outlier' is a continuation of the pattern into the tail of the distribution and is just a long storm that could be expected once every hundred or so storms.