Outliers
The strength of the relationship between two variables is usually the most important information that we gain from a scatterplot. However, two other features may also be apparent and, when present, they often provide the most useful information in the data. This page examines outliers; clusters are discussed on the next page.
Sometimes one cross on a scatterplot lies well away from the rest of the scatter of points. Such an observation is called an outlier.
An outlier may be an extreme value of one or both variables and the outlier may therefore be apparent in the marginal distributions of the variables.
However, an individual may stand apart from the bulk of the data in a scatterplot without being an extreme observation for either variable on its own. Univariate displays of the separate variables may not detect such an outlier.
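The point about outliers that are invisible in univariate displays can be illustrated with a small sketch. The data below are simulated purely for illustration, and the Mahalanobis distance is one common way of measuring how far a point lies from the bulk of a scatter; it is an assumption of this sketch, not a method described on this page.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 points with a strong positive relationship.
x = rng.normal(0.0, 1.0, 200)
y = x + rng.normal(0.0, 0.3, 200)

# Add one point that is unremarkable on each axis but breaks the pattern.
x = np.append(x, 1.5)
y = np.append(y, -1.5)
data = np.column_stack([x, y])

# Univariate check: z-score of each variable separately.
z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)

# Bivariate check: Mahalanobis distance from the centre of the scatter.
centred = data - data.mean(axis=0)
inv_cov = np.linalg.inv(np.cov(data, rowvar=False))
mahal = np.sqrt(np.einsum("ij,jk,ik->i", centred, inv_cov, centred))

last = len(data) - 1
print("z-scores of added point:", z[last])
print("largest Mahalanobis distance at index:", int(np.argmax(mahal)))
```

The added point has modest z-scores for both variables, so neither marginal distribution flags it, yet it has the largest bivariate distance because it lies well off the diagonal band formed by the other points.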
Brain and body weight
The scatterplot below shows the brain and body weights of a collection of animals. Since body weight ranges from 0.023 kg (mouse) to 9,400 kg (triceratops) and has an extremely skew distribution, the diagram plots log brain weight against log body weight.
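As a rough sketch of why a log scale helps here, the snippet below uses only the two extreme body weights quoted above. The choice of base-10 logarithms is an assumption; the page does not say which base the diagram uses.

```python
import math

# The two extreme body weights quoted in the text, in kilograms.
lightest_kg = 0.023   # mouse
heaviest_kg = 9400.0  # triceratops

# On the raw scale the heaviest animal is several hundred thousand times
# heavier than the lightest, so most animals would be squashed near zero.
ratio = heaviest_kg / lightest_kg

# On a log10 scale the same range spans fewer than six units.
log_light = math.log10(lightest_kg)
log_heavy = math.log10(heaviest_kg)

print("ratio of heaviest to lightest:", round(ratio))
print("log10 weights:", round(log_light, 2), "to", round(log_heavy, 2))
```

Equal distances on the log axis correspond to equal ratios of body weight, which spreads the small animals out instead of crowding them into one corner of the plot.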
There is quite a strong relationship between the variables: heavier animals tend to have larger brains.
Importance of outliers
Outliers are features of a data set that must be carefully checked. An outlier is often caused by a recording or transcription error, so...
First check that the values of the variables are correctly recorded.
Sometimes an outlier arises because an individual is fundamentally different from the others. This may be important information and may point to, for example, a new plant species, an employee who is performing poorly, or a single mature student in a class of teenagers.
The individual should be examined further (perhaps by collecting more information about it) to try to assess whether it has distinct characteristics.
An outlier that is either extreme or that has other distinctive characteristics is often deleted from the data set, but the deletion should be mentioned in any report about the data.
House price and floor area
The following scatterplot shows information obtained from a sample of 141 real estate sales records from two suburbs of a city.
One house had an extremely high floor area — it stands out as an outlier in the marginal distribution of floor area. From the relationship between floor area and price of the other houses, we would also have expected this house's sale price to have been higher, so it is an outlier because the house does not follow the pattern that is evident in the rest of the data.
Ideally, we would try to find out what was special about the house. Was it mis-classified as a house when it was really a commercial property or block of apartments?
Since no further information is available about this outlier, we could omit it from further analysis. In any report, however, we should state that our conclusions are only valid for houses below 400 square metres.
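The reasoning in this example can be sketched in code: fit the pattern using the houses below 400 square metres, then see how far the large house falls from it. The numbers below are simulated and hypothetical (they are not the 141 sales records), and a straight-line fit is an assumption made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sales records (NOT the data from the scatterplot):
# floor area in square metres, price in thousands of dollars.
area = rng.uniform(80, 300, 140)
price = 50 + 1.2 * area + rng.normal(0, 30, 140)

# One hypothetical very large house that sold for less than the pattern suggests.
area = np.append(area, 450.0)
price = np.append(price, 350.0)

# Fit a straight line using only the houses below 400 square metres.
bulk = area < 400
slope, intercept = np.polyfit(area[bulk], price[bulk], 1)

# Residual spread among the houses used in the fit.
resid_bulk = price[bulk] - (intercept + slope * area[bulk])
resid_sd = resid_bulk.std(ddof=2)

# How far the large house falls from the fitted line.
big = int(np.argmax(area))
resid_big = price[big] - (intercept + slope * area[big])
print("residual for the large house:", round(resid_big, 1))
print("in residual standard deviations:", round(resid_big / resid_sd, 1))
```

Fitting on the bulk of the data alone keeps the outlier from pulling the line towards itself. The prediction at 450 square metres is an extrapolation beyond the fitted range, which is one more reason to restrict conclusions to houses below 400 square metres.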