Outliers and errors

An outlier is a measurement that does not fit in with the pattern exhibited by the rest of the data. By definition, an outlier does not satisfy the normal linear model that fits the rest of the data, so it should be omitted from the analysis.

Standardised residuals

Unfortunately, in a real data set, the errors are unknown, so we must use the residuals from the least squares line as estimates of the errors. The residuals can be used in a similar way to give information about whether there is an outlier.

Most statistical software will evaluate standardised residuals for you when you fit a line by least squares and will automatically report any that are more extreme than ±3, since these are often taken to indicate possible outliers.
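As a rough sketch of what such software does, the Python code below fits a least squares line and divides each residual by its estimated standard deviation. It assumes one common definition of a standardised residual, in which the residual is divided by s√(1 − h), where s is the residual standard deviation and h is the leverage of the point. The function name and the data are made up for illustration, and individual packages may use a slightly different scaling.

```python
import numpy as np

def standardised_residuals(x, y):
    """Fit y = a + b*x by least squares and return the residuals, each
    divided by its estimated standard deviation s * sqrt(1 - leverage).
    (Assumed definition; some software scales residuals differently.)"""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    sxx = np.sum((x - x.mean()) ** 2)
    slope = np.sum((x - x.mean()) * (y - y.mean())) / sxx
    intercept = y.mean() - slope * x.mean()
    residuals = y - (intercept + slope * x)
    s = np.sqrt(np.sum(residuals ** 2) / (n - 2))        # residual standard deviation
    leverage = 1.0 / n + (x - x.mean()) ** 2 / sxx       # leverage of each point
    return residuals / (s * np.sqrt(1.0 - leverage))

# Made-up data: 30 points scattered about the line y = 2 + 0.5x,
# with one response (index 15) moved 8 units away from the pattern.
x = np.linspace(0, 10, 30)
scatter = np.where(np.arange(30) % 2 == 0, 0.5, -0.5)
y = 2 + 0.5 * x + scatter
y[15] += 8.0

std_res = standardised_residuals(x, y)
flagged = np.flatnonzero(np.abs(std_res) > 3)            # indices outside ±3
print(flagged)
```

Only the point that was moved should be reported here, since every other value lies close to the fitted line.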

It is worth remembering, however, that there is still a probability of about 0.003 that a value from the standard normal distribution will be outside ±3. In a data set of 1,000 values with no outliers at all, about 3 values would therefore be expected to be labelled as 'outliers' by this rule.
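The arithmetic behind this can be checked in a few lines of Python. The sketch treats the standardised residuals as if they were exactly standard normal, which is only an approximation, and the function name is ours.

```python
import math

def two_sided_tail(z):
    """P(|Z| > z) for a standard normal Z, using the complementary error function."""
    return math.erfc(z / math.sqrt(2))

p3 = two_sided_tail(3.0)          # chance that a single value falls outside ±3
expected_flags = 1000 * p3        # expected number flagged among 1,000 values

print(round(p3, 4), round(expected_flags, 1))   # 0.0027 2.7
```

The exact probability is slightly under 0.003, so the expected count is a little below 3.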

Standardised residuals when there are no outliers

Standardised residuals would need to be outside ±3.5 or ±4 for us to be really confident that they are outliers.
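To see why the wider limits give more confidence, the sketch below estimates the chance that at least one standardised residual falls outside each limit when there are no outliers. It assumes the standardised residuals behave like independent standard normal values, which is only approximately true, and the sample size of 100 is a hypothetical choice.

```python
import math

def two_sided_tail(z):
    """P(|Z| > z) for a standard normal Z."""
    return math.erfc(z / math.sqrt(2))

def prob_any_flagged(n, cutoff):
    """Chance that at least one of n independent standard normal values
    falls outside ±cutoff (independence is an assumption of this sketch)."""
    return 1.0 - (1.0 - two_sided_tail(cutoff)) ** n

n = 100  # hypothetical number of observations
false_alarm = {cutoff: prob_any_flagged(n, cutoff) for cutoff in (3.0, 3.5, 4.0)}

for cutoff, p in false_alarm.items():
    print(f"limit ±{cutoff}: {p:.3f}")
```

Under these assumptions the chance of a false alarm is roughly 24% with limits of ±3, about 5% with ±3.5 and under 1% with ±4.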

Problems with residuals as indicators of outliers

Illustration

Do not rely on an extreme residual to tell you whether a high-leverage point is an outlier.
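The effect can be reproduced numerically. In the sketch below, the same downward shift of 5 units is applied first to a response in the middle of the data and then to the response at a high-leverage point. The data are made up, and the standardised residual is again assumed to be the residual divided by s√(1 − h), where h is the leverage.

```python
import numpy as np

def standardised_residuals(x, y):
    """Least squares line fit; residuals divided by s * sqrt(1 - leverage)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    sxx = np.sum((x - x.mean()) ** 2)
    slope = np.sum((x - x.mean()) * (y - y.mean())) / sxx
    intercept = y.mean() - slope * x.mean()
    residuals = y - (intercept + slope * x)
    s = np.sqrt(np.sum(residuals ** 2) / (n - 2))
    leverage = 1.0 / n + (x - x.mean()) ** 2 / sxx
    return residuals / (s * np.sqrt(1.0 - leverage))

# Made-up data: x = 1, ..., 20 plus one far-away value x = 100 (high leverage).
# Responses lie 1 unit above or below the line y = x.
x = np.append(np.arange(1.0, 21.0), 100.0)
y_clean = x + np.where(x % 2 == 1, 1.0, -1.0)

# Case A: shift the response at x = 10 (low leverage) down by 5.
y_a = y_clean.copy()
y_a[9] -= 5.0
r_low_leverage = standardised_residuals(x, y_a)[9]

# Case B: shift the response at x = 100 (high leverage) down by 5.
y_b = y_clean.copy()
y_b[-1] -= 5.0
r_high_leverage = standardised_residuals(x, y_b)[-1]

print(r_low_leverage, r_high_leverage)
```

The shifted central point gives a standardised residual beyond ±3, but the high-leverage point pulls the fitted line towards itself, so its standardised residual stays well inside ±3 even though the response was moved by the same amount.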