Correlation coefficient and nonlinear relationships

The correlation coefficient, r, is a good description of the strength of a linear relationship, but not of a nonlinear one. If a scatterplot shows marked curvature, r can considerably understate the strength of the relationship.

How can you summarise the strength of a nonlinear relationship? One solution is to use a different summary statistic that has better properties than r. On this page, we describe an alternative solution to the problem.

Transform the variables to linearise the relationship

Applying a nonlinear transformation (such as a log transformation) to a single variable changes the shape of its distribution and can be used to eliminate skewness.

Nonlinear transformation of a variable has a more important effect on bivariate data — it alters the shape of the relationship. (It is easier to explain how this happens in the diagram below than in words!) It is often possible to linearise a relationship by transforming one or both variables.

Therefore one way to use r to describe nonlinear relationships is to apply a transformation to one or both variables to remove the nonlinearity before evaluating r.
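The idea can be sketched in a few lines of Python. The data here are synthetic (an exact exponential decay chosen purely for illustration, not measurements from any study), and `pearson_r` is a helper written for this sketch rather than a library function.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Synthetic data (an assumption for illustration): y decays exponentially with t.
t = list(range(1, 16))
y = [500.0 * math.exp(-0.2 * ti) for ti in t]

# r on the original scale, then r after a log transformation of y.
r_raw = pearson_r(t, y)
r_log = pearson_r(t, [math.log(yi) for yi in y])

print(f"r on original scale:   {r_raw:.3f}")
print(f"r after log transform: {r_log:.3f}")
```

Because log(y) is an exact linear function of t in this artificial example, the second correlation is -1 apart from rounding error, whereas the first is noticeably weaker even though y is completely determined by t.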

Marine bacteria surviving X-rays

The scatterplot below shows the numbers of a marine bacterium surviving exposure to 200 kilovolt X-rays for periods ranging from t=1 to t=15 intervals of 6 minutes. The relationship is clearly nonlinear with a steep initial decline in numbers followed by a more gradual decrease.

The correlation coefficient is r = -0.907, indicating a fairly strong relationship between the numbers of survivors and the dose of X-rays, but because the relationship is nonlinear, this value understates its strength.

Drag the red line on the vertical axis upwards to apply a power transformation to the number of survivors. When the power becomes close to zero (which corresponds to a log transformation of the response), the relationship becomes nearly linear. Since the relationship between log(survivors) and dose is approximately linear, the correlation coefficient between these two variables, r = -0.994, is a better description of the strength of the relationship.
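The slider experiment can be imitated roughly in code by trying a few powers and recomputing r each time. This is only a sketch: the counts are synthetic (an exact exponential decay, not the bacteria data), the list of powers is arbitrary, and `power_transform` is a hypothetical helper that adopts the convention that power 0 means the natural log.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def power_transform(values, power):
    """Raise positive values to `power`; power 0 is taken to mean log."""
    if power == 0:
        return [math.log(v) for v in values]
    return [v ** power for v in values]

# Synthetic stand-in for the survivor counts (NOT the real data).
t = list(range(1, 16))
survivors = [500.0 * math.exp(-0.2 * ti) for ti in t]

powers = [1.0, 0.5, 0.25, 0.0, -0.25, -0.5]
r_by_power = {p: pearson_r(t, power_transform(survivors, p)) for p in powers}

for p in powers:
    print(f"power {p:5.2f}: r = {r_by_power[p]:.3f}")

# Negative powers reverse the direction of the relationship, so r changes sign;
# compare strengths using |r|.
best_power = max(powers, key=lambda p: abs(r_by_power[p]))
print("power giving the strongest linear relationship:", best_power)
```

For these artificial counts the log transformation (power 0) gives the strongest linear relationship, which mirrors what the diagram shows for the real data.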

Choosing the transformation

How do you decide which variable to transform to linearise a relationship? There is no easy answer that works for all data sets, but a useful approach is to look at the marginal distributions of each variable separately and to initially try transformations that remove any skewness in these distributions.
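As a rough numerical counterpart to inspecting box plots, the skewness of each variable can be computed before and after a candidate transformation. The sketch below uses the moment coefficient of skewness, which is one common definition among several, and a synthetic right-skewed sample invented for illustration.

```python
import math

def skewness(values):
    """Moment coefficient of skewness, m3 / m2**1.5 (0 for a symmetric sample)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Synthetic right-skewed sample (not real data): exponentials of evenly
# spaced values, so the log transformation makes it exactly symmetric.
raw = [math.exp(z / 2) for z in range(-4, 5)]
logged = [math.log(v) for v in raw]

skew_raw = skewness(raw)
skew_log = skewness(logged)

print(f"skewness before transformation: {skew_raw:.3f}")
print(f"skewness after log:             {skew_log:.3f}")
```

A value near zero after transformation suggests the transformation is worth trying first; it does not guarantee that the relationship itself becomes linear, so the scatterplot should still be checked.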

Metabolic rate and lifespan

The scatterplot below shows the lifespan (in years) and metabolic rate (measured by oxygen intake per gram of weight) of a selection of mammals. The box plots on the two axes show that each of the variables has a skew distribution, and the scatterplot shows that the relationship is nonlinear.

Drag the red lines on the axes to remove marginal skewness. For these data (and many others), the transformations that remove marginal skewness also linearise the relationship. The correlation coefficient between the transformed variables (about -0.7) is therefore a fairer description of the strength of the relationship than that of the original measurements (-0.47).

You probably found that metabolic rate raised to the power 0.2 had a reasonably symmetric distribution. However, values on such power scales are hard to interpret, so drag to a log transformation of both variables instead: their relationship is equally linear and the graph is easier to interpret.

After linearising the relationship, man no longer seems to be an outlier. Which mammals now seem to be the most unusual?
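Taking logs of both variables, as suggested above, can be sketched as follows. The numbers are invented for illustration (a power-law trend with some arbitrary scatter) and are not the mammal data; the variable names are likewise hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Invented, right-skewed "metabolic rate" values (NOT the real mammal data).
rate = [0.2, 0.3, 0.5, 0.8, 1.2, 2.0, 3.5, 6.0, 10.0]
# Arbitrary multiplicative scatter around a power-law trend.
scatter = [1.2, 0.85, 1.1, 0.9, 1.15, 0.8, 1.05, 0.95, 1.1]
lifespan = [30.0 * x ** -0.4 * s for x, s in zip(rate, scatter)]

r_raw = pearson_r(rate, lifespan)
r_loglog = pearson_r([math.log(x) for x in rate],
                     [math.log(y) for y in lifespan])

print(f"r on original scales:      {r_raw:.3f}")
print(f"r after logging both axes: {r_loglog:.3f}")
```

A relationship of the form y = a x^b becomes log(y) = log(a) + b log(x), a straight line with slope b, which is one reason the log-log scale is comparatively easy to interpret.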