Always look at a scatterplot first

Although the correlation coefficient is a good description of the strength of many relationships, it does not adequately describe others.

A scatterplot should always be examined to help assess whether there are features in the data that the correlation coefficient cannot describe.


Anscombe's data

The diagram below shows a scatterplot and several numerical summary statistics including the correlation coefficient.

The pop-up menu allows you to select one of four data sets. All four data sets have the same summary statistics, but the scatterplot shows that the relationships between X and Y are very different.
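The claim that the summary statistics agree can be checked directly. The sketch below assumes the four data sets are the ones published by Anscombe (1973) and matches them to the labels used on this page by their descriptions. The helper `pearson_r` and the dictionary `datasets` are names introduced here for illustration only.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Assumed data: Anscombe's four data sets. The first three share their x-values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
datasets = {
    "Moderate linear relationship": (
        x123,
        [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68],
    ),
    "Strong nonlinear relationship": (
        x123,
        [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74],
    ),
    "Strong linear & outlier": (
        x123,
        [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73],
    ),
    "Only one X is different": (
        [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
        [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89],
    ),
}

# Print the mean of x, the mean of y and the correlation for each data set.
for name, (xs, ys) in datasets.items():
    n = len(xs)
    print(f"{name:30s} mean x = {sum(xs) / n:.2f}  "
          f"mean y = {sum(ys) / n:.2f}  r = {pearson_r(xs, ys):.3f}")
```

With these values, each data set has a mean x of 9, a mean y of about 7.5 and a correlation coefficient of about 0.816, so the numbers alone give no hint of how different the four scatterplots are.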

Moderate linear relationship
There is a moderately strong positive linear relationship between X and Y and the summary statistics adequately describe this data set.
Strong nonlinear relationship
All points lie exactly on a curve, so the relationship is stronger than the correlation coefficient suggests. The coefficient only measures how closely the points follow a straight line.
Strong linear & outlier
In this data set, all points lie on a straight line except for a single outlier. The outlier should be checked: is it a transcription or measurement error?
Only one X is different
X = 8 for all points except one where X = 19. This X could possibly be an outlier, but even if it is a genuine measurement, the point has an extreme influence on the correlation coefficient. Changing its y-value to 7, which is very close to the mean y-value of the other points, would reduce the correlation coefficient to almost exactly zero. The apparent strength of the relationship is determined entirely by the y-value of this one point, so the point is called highly influential.
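The influence of the single point with X = 19 can be illustrated numerically. The sketch below again assumes the values of Anscombe's fourth data set. The helper `pearson_r` and the variable names are introduced here for illustration. Because every other point has the same x-value, the sign and size of the correlation depend only on how far the y-value of the unusual point lies from the mean y-value of the rest.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Assumed data: Anscombe's fourth data set (X = 8 everywhere except one point).
x = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

# Locate the single point whose x-value differs from the rest.
i = x.index(19)

# Mean y-value of the ten points that share X = 8.
others = [v for j, v in enumerate(y) if j != i]
mean_others = sum(others) / len(others)

# Correlation with the original data, and after moving that point to y = 7.
r_original = pearson_r(x, y)
y_changed = list(y)
y_changed[i] = 7.0
r_changed = pearson_r(x, y_changed)

print(f"mean y of the other points: {mean_others:.3f}")
print(f"r with original y-value:    {r_original:.3f}")
print(f"r with y-value set to 7:    {r_changed:.3f}")
```

Moving the point further above or below that mean would make the correlation strongly positive or strongly negative, even though the other ten points carry no information about how Y changes with X.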

The important message is:

Don't just calculate summary statistics; look at a scatterplot of the data too.