Improved box plot
One problem with the basic box plot that was described in the previous pages is that it cannot show whether there are outliers in the data set. A common modification draws some of the most extreme values as separate crosses on the box plot and extends the 'whisker' only as far as the most extreme observations that are not drawn separately.
The modified box plot makes outliers stand out.
(The rule used to decide which values are displayed as crosses is explained below.)
In many practical applications, skew distributions with a long tail towards the higher values are common. For example, experiments involving survival times of plants or insects, or times until failure of manufactured items, usually result in data with occasional high values.
The data set below has a skew distribution with no outliers; no values stand out as unusual.
Drag the slider to change the data set into one with a fairly symmetric distribution and a single outlier. The basic box plot does not show the existence of the outlier.
The basic box plot cannot distinguish between a very long-tailed distribution and a distribution with an outlier.
Now select Box plot showing outliers from the pop-up menu and again use the slider to see how the improved box plot distinguishes between a skew distribution and one with an outlier.
Which extreme values are displayed as crosses?
We first define the interquartile range to be the distance between the upper and lower quartiles (i.e. the length of the central box in the box plot). Any value that lies more than 1.5 times this distance beyond the nearer end of the box is displayed as a separate cross. The 'whiskers' that are drawn to the sides of the central box extend only as far as the most extreme values within these limits.
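The rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name box_plot_summary and the data values are invented for the example, and the quartiles are taken from the standard library's statistics.quantiles with the "inclusive" method. That is one of several common quartile definitions, so it may differ slightly from the one used to draw the box plots on this page.

```python
from statistics import quantiles


def box_plot_summary(values, k=1.5):
    """Return quartiles, whisker ends and outliers for a box plot.

    Hypothetical helper for illustration. k is the multiple of the
    interquartile range beyond which a value is drawn as a cross.
    """
    data = sorted(values)
    # Assumption: the "inclusive" quartile definition; others exist.
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1                      # length of the central box
    lower_limit = q1 - k * iqr
    upper_limit = q3 + k * iqr
    # Values within the limits determine how far the whiskers extend.
    inside = [x for x in data if lower_limit <= x <= upper_limit]
    # Values beyond the limits are displayed as separate crosses.
    outliers = [x for x in data if x < lower_limit or x > upper_limit]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": outliers,
    }


# Invented data: first with a highest value of 6.5, then with it moved to 9.0.
original = [1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 6.5]
modified = [1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 9.0]
print(box_plot_summary(original))
print(box_plot_summary(modified))
```

With these invented values the quartiles are 2.5 and 4.5, so the limits are -0.5 and 7.5. The value 6.5 lies inside the upper limit, so the whisker reaches it and no cross is drawn. Replacing 6.5 by 9.0 leaves the quartiles unchanged, the upper whisker stops at 5.0, and 9.0 is reported as an outlier.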
The diagram below allows you to investigate these improved box plots.
Drag the cross on the jittered dot plot corresponding to the highest value (6.5) to the right, increasing its value to turn it into an outlier.
The other crosses on the jittered dot plot can be dragged in the same way to change the distribution of values. When are the extreme values displayed separately as crosses in the box plot?