Correlation coefficient and nonlinear relationships

The correlation coefficient, r, is a good description of the strength of a linear relationship but not of a nonlinear one. If a scatterplot shows marked curvature, the correlation coefficient can considerably understate the strength of the relationship.

How can you summarise the strength of a nonlinear relationship? One solution is to use a different summary statistic that has better properties than r. On this page, we describe an alternative solution to the problem.

Transform the variables to linearise the relationship

Applying a nonlinear transformation (such as a log transformation) to a single variable changes the shape of its distribution and can be used to eliminate skewness.

Nonlinear transformation of a variable has a more important effect on bivariate data — it alters the shape of the relationship. (It is easier to explain how this happens in the diagram below than in words!) It is often possible to linearise a relationship by transforming one or both variables.

Therefore one way to use r to describe nonlinear relationships is to apply a transformation to one or both variables to remove the nonlinearity before evaluating r.
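As a minimal sketch of this idea, the Python snippet below computes r before and after a log transformation of the response. The data are made up for illustration (a noise-free exponential decay, not taken from any data set on this page), and the helper name pearson_r is hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Made-up data: y decays exponentially with x (no noise),
# so log(y) is an exact linear function of x.
x = list(range(11))
y = [100.0 * math.exp(-0.3 * xi) for xi in x]

r_raw = pearson_r(x, y)                             # curved relationship
r_log = pearson_r(x, [math.log(yi) for yi in y])    # linearised relationship

print(f"r(x, y)     = {r_raw:.3f}")
print(f"r(x, log y) = {r_log:.3f}")
```

Because these artificial values follow the curve exactly, the relationship is as strong as it can be, yet r on the raw scale is weaker than -1. After the log transformation r equals -1 to within rounding error.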

Prices of Mazda cars

The scatterplot below shows the price and age of second-hand Mazda cars that were advertised for sale in the Melbourne Age newspaper on 8 February 1992. The relationship is clearly nonlinear, with a steep initial decline in price followed by a more gradual decrease.

The correlation coefficient is r = -0.8, indicating a fairly strong relationship between the price and age of these cars, but because the relationship is nonlinear, this value understates its strength.

Drag the red line on the vertical axis upwards to apply a power transformation to the car prices. When the power becomes close to zero (which corresponds to a log transformation of the response), the relationship becomes nearly linear. Since the relationship between log(price) and age is approximately linear, the correlation coefficient between these two variables, r = -0.9, is a better description of the strength of the relationship.
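To see why a power close to zero corresponds to a log transformation, it may help to look at a scaled power transformation numerically. The sketch below assumes the Box-Cox form (y^p - 1) / p, which may not be the exact scaling used by the interactive diagram. The function name power_transform and the value 5000 are hypothetical.

```python
import math

def power_transform(y, p):
    """Box-Cox style power transformation of a positive value y.

    Assumed form: (y**p - 1) / p, with p = 0 defined as log(y).
    """
    if p == 0:
        return math.log(y)
    return (y ** p - 1.0) / p

y = 5000.0  # arbitrary positive value, for illustration only

# As p shrinks towards zero, the transformed value approaches log(y).
for p in (1.0, 0.5, 0.1, 0.01, 0.001):
    print(f"p = {p:<6} transformed = {power_transform(y, p):.4f}")
print(f"log(y)              = {math.log(y):.4f}")
```

In this scaled form the transformation is increasing in y for every power, including negative ones, so the ordering of the values is preserved whichever power is chosen.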

Choosing the transformation

How do you decide which variable to transform and which transformation to use? There is no easy answer that works for all data sets, but a useful approach is to look at the marginal distributions of each variable separately and to initially try transformations that remove any skewness in these distributions.
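Skewness of a marginal distribution is usually judged by eye from a box plot or histogram, but a numerical check can also help. The sketch below uses the moment coefficient of skewness (one of several possible definitions, chosen here as an assumption) on made-up right-skewed values. The function name skewness is hypothetical.

```python
import math

def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5.

    Positive for a long right tail, negative for a long left tail,
    and zero for a symmetric set of values.
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Made-up right-skewed values: most are small, a few are very large.
raw = [1, 2, 4, 8, 16, 32, 64, 128, 256]
logged = [math.log(v) for v in raw]

print(f"skewness of raw values:    {skewness(raw):.3f}")
print(f"skewness of logged values: {skewness(logged):.3f}")
```

For these artificial values the log transformation gives equally spaced points, so the skewness drops from clearly positive to essentially zero.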

GDP and population

The scatterplot below shows the Gross Domestic Product (GDP in US$billion) and population (million) in all the countries of the world in 2012. The box plots on the two axes show that each of the variables has a skewed distribution. Because of the dense cluster of countries near the origin of the plot, it is difficult to assess whether or not there is any relationship between the variables. Linearity cannot even be assessed.

Drag the red line on each axis to remove the skewness in that variable. Raising each variable to a power of approximately -0.2 removes its skewness, and increases the correlation coefficient from 0.25 to 0.59, a much better indication of the strength of the relationship. (For fine tuning of the powers, use the arrow keys on your keyboard.)

You probably found that each variable raised to the power -0.2 had a reasonably symmetric distribution. It is, however, hard to interpret values on these scales, so drag both powers back to 0 (corresponding to a log transformation). The correlation between log(GDP) and log(population) is 0.63, which is probably the best indication of the strength of the relationship.
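Dragging a slider until the skewness disappears amounts to a search over powers. A rough sketch of that search for a single variable is given below. It assumes a Box-Cox style scaling and the moment coefficient of skewness, uses made-up values rather than the GDP data, and all function names are hypothetical.

```python
import math

def power_transform(y, p):
    """Assumed Box-Cox form: (y**p - 1) / p, with p = 0 meaning log(y)."""
    return math.log(y) if p == 0 else (y ** p - 1.0) / p

def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def least_skewed_power(values, powers):
    """Return the candidate power whose transformed values have skewness closest to zero."""
    return min(
        powers,
        key=lambda p: abs(skewness([power_transform(v, p) for v in values])),
    )

# Made-up positive values whose logs are symmetric, so power 0 should be chosen.
values = [math.exp(k / 2) for k in range(-4, 5)]
candidate_powers = [-1.0, -0.5, -0.2, 0.0, 0.2, 0.5, 1.0]

best = least_skewed_power(values, candidate_powers)
print("least skewed power:", best)
```

With real data the least skewed power is rarely a round number. As the text notes, a nearby interpretable choice such as 0 (the log) is often preferable to the exact optimum.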

Note that the 'rich' countries are those that lie on the top left of the cluster of points. The 'poor' countries are towards the bottom right of the cluster. Drag over the crosses to identify the countries. Are there any outliers or clusters of countries?