A simple linear model is not always appropriate

A simple linear model for a response Y in terms of an explanatory variable X is often a useful summary of a relationship, but is not appropriate for all data sets.

A simple linear model is only appropriate when the cloud of crosses in a scatterplot of the data is regularly spread around a straight line.

If the crosses are scattered around a curve, the relationship is called nonlinear and other models must be used. We will consider nonlinear models in the next section.

On this page, we examine another problem: outliers.

Outliers should be investigated

An outlier is an observation that does not conform to the pattern and variability exhibited by the rest of the data. In a linear model, the most important type of outlier is a data point that lies far from the line that would fit the rest of the data.
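The idea of measuring each point against the line fitted to the rest of the data can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data values are made up, and the helper names fit_line and leave_one_out_residuals are hypothetical, not part of any library.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def leave_one_out_residuals(xs, ys):
    """For each point, its vertical distance from the line fitted to the OTHER points."""
    residuals = []
    for i in range(len(xs)):
        rest_x = xs[:i] + xs[i + 1:]
        rest_y = ys[:i] + ys[i + 1:]
        intercept, slope = fit_line(rest_x, rest_y)
        residuals.append(ys[i] - (intercept + slope * xs[i]))
    return residuals


# Made-up illustrative data: y is roughly 2 + 3x, except for the point at x = 5.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [5.1, 7.9, 11.2, 13.8, 27.0, 20.1, 22.9, 26.2]

residuals = leave_one_out_residuals(xs, ys)

# Index of the point lying furthest from the line through the rest of the data.
most_unusual = max(range(len(xs)), key=lambda i: abs(residuals[i]))
print("most unusual point:", (xs[most_unusual], ys[most_unusual]))
print("its distance from the line through the others:", round(residuals[most_unusual], 2))
```

Leaving each point out before fitting matters because an outlier included in the fit pulls the line towards itself, which makes its own residual look smaller than it should.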

The individual corresponding to any outlier should be carefully examined. Recording or transcription errors may be the cause. Alternatively, it may be possible to determine some distinguishing characteristic of the individual that underlies the unusual response measurement.

If an outlier is extreme enough, or if a special cause for its unusual behaviour can be found from outside information, the individual can be classified as aberrant and deleted from the data set.

It is important to look at any data set graphically before fitting a linear model, to check that neither curvature nor outliers are present.


Temperature and latitude

The scatterplot below shows the maximum January temperature (degrees Fahrenheit) and latitude (degrees north of the equator) for various cities in the USA, as published in the book Data Analysis and Regression by F. Mosteller and J. W. Tukey in 1977. The least squares line is superimposed on the data.

There are four unusually large residuals that might be considered to be outliers. They are all considerably warmer than the model would predict for their latitude. Dragging with the mouse over the four points reveals their names and highlights their positions on a map of the USA.

You may have noticed that one of the outliers, Jacksonville, is too far north on the map. The latitude of Jacksonville was wrongly recorded in the published data set as 38 degrees instead of 30 degrees. Click the checkbox Correct Jacksonville to change Jacksonville's latitude to 30 degrees. Note that the position of the least squares line is not affected much.
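The effect of correcting a single wrongly recorded latitude can be sketched as follows. The (latitude, temperature) pairs are invented for illustration and are not the values from the published data set; only the recorded and corrected latitudes of 38 and 30 degrees come from the text above. The helper fit_line is a hypothetical name.

```python
def fit_line(points):
    """Return (intercept, slope) of the least squares line through (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Invented (latitude, temperature) pairs; NOT the published values.
other_cities = [(26, 74), (30, 66), (33, 60), (35, 55), (39, 46),
                (41, 41), (43, 36), (45, 31), (47, 25)]

# One hypothetical city whose latitude was recorded as 38 instead of 30.
as_recorded = other_cities + [(38, 65)]
corrected = other_cities + [(30, 65)]

b0_rec, b1_rec = fit_line(as_recorded)
b0_cor, b1_cor = fit_line(corrected)

# Residual of the affected city under each version of the data.
residual_as_recorded = 65 - (b0_rec + b1_rec * 38)
residual_corrected = 65 - (b0_cor + b1_cor * 30)

print("slope as recorded:", round(b1_rec, 3), " slope corrected:", round(b1_cor, 3))
print("residual as recorded:", round(residual_as_recorded, 1),
      " residual corrected:", round(residual_corrected, 1))
```

In this made-up example the wrongly recorded latitude sits near the middle of the latitude range, so the point has little influence on the slope; the correction mainly removes one large residual.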

The remaining three outliers are Portland, Seattle and Juneau. These are all in the northwest of the USA, and a little geographical research shows that they are warmed by a sea current in the winter. Retaining these cities causes us to overestimate the temperatures of other high-latitude cities. Their common characteristic suggests that we should perhaps model the northwest of the USA separately from the remainder of the country. Click the checkbox Delete NorthWest to remove the three cities. The least squares line now fits the remaining data points much more closely.
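The comparison of the two fits, with and without a group of unusually warm high-latitude cities, can be sketched as follows. All numbers are invented for illustration and are not the published temperatures or latitudes; the names fit_line, rss, main_group and warm_group are hypothetical.

```python
def fit_line(points):
    """Return (intercept, slope) of the least squares line through (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def rss(points, intercept, slope):
    """Residual sum of squares of the points about the given line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in points)


# Invented (latitude, temperature) pairs; NOT the published values.
main_group = [(26, 74), (30, 66), (33, 60), (35, 55), (39, 46),
              (41, 41), (43, 36), (45, 31), (47, 25)]
# Three hypothetical high-latitude cities that are warmer than the trend.
warm_group = [(46, 45), (48, 44), (58, 30)]

b0_all, b1_all = fit_line(main_group + warm_group)
b0_main, b1_main = fit_line(main_group)

# Predicted temperature for an ordinary city at latitude 47 under each fit.
pred_all = b0_all + b1_all * 47
pred_main = b0_main + b1_main * 47

# How well each line fits the cities that remain after deletion.
rss_all_fit = rss(main_group, b0_all, b1_all)
rss_main_fit = rss(main_group, b0_main, b1_main)

print("slope with all cities:", round(b1_all, 2), " without warm group:", round(b1_main, 2))
print("prediction at latitude 47:", round(pred_all, 1), "versus", round(pred_main, 1))
print("RSS for remaining cities:", round(rss_all_fit, 1), "versus", round(rss_main_fit, 1))
```

With these invented values, keeping the warm group flattens the line and raises the predicted temperature for an ordinary city at latitude 47, which mirrors the overestimation described above.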