Correlation coefficient and nonlinear relationships
The correlation coefficient, r, is a good description of the strength of a linear relationship but not of a nonlinear one. If a scatterplot shows marked curvature, the correlation coefficient can considerably understate the strength of the relationship.
How can you summarise the strength of a nonlinear relationship? One solution is to use a different summary statistic that has better properties than r. In this page, we describe an alternative solution to the problem.
Transform the variables to linearise the relationship
Applying a nonlinear transformation (such as a log transformation) to a single variable changes the shape of its distribution and can be used to eliminate skewness.
Nonlinear transformation of a variable has a more important effect on bivariate data — it alters the shape of the relationship. (It is easier to explain how this happens in the diagram below than in words!) It is often possible to linearise a relationship by transforming one or both variables.
Therefore one way to use r to describe nonlinear relationships is to apply a transformation to one or both variables to remove the nonlinearity before evaluating r.
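The idea can be sketched with a small calculation. The data here are synthetic and purely illustrative (an exponential decline, not data from this page), and pearson_r is a helper written for this sketch. Because y is determined exactly by x, the relationship is as strong as it can be, yet r for the raw values falls well short of -1. After a log transformation of y the relationship is linear and r reflects its full strength.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Synthetic data (an assumption for illustration only):
# y falls exponentially with x, so y is determined exactly by x.
x = list(range(16))
y = [20000 * math.exp(-0.25 * xi) for xi in x]

r_raw = pearson_r(x, y)                            # curved relationship
r_log = pearson_r(x, [math.log(yi) for yi in y])   # linear after the log

print(f"r for (x, y):     {r_raw:.3f}")
print(f"r for (x, log y): {r_log:.3f}")
```

With these made-up values, r for the raw data is roughly -0.9, while r after the log transformation is -1 to within rounding error.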
Prices of Mazda cars
The scatterplot below shows the price and age of second-hand Mazda cars that were advertised for sale in the Melbourne Age newspaper on 8 February 1992. The relationship is clearly nonlinear, with a steep initial decline in price followed by a more gradual decrease.
The correlation coefficient is r = -0.8, indicating a fairly strong relationship between the price and age of these cars. However, because the relationship is nonlinear, this value understates its strength.
Drag the red line on the vertical axis upwards to apply a power transformation to the car prices. When the power becomes close to zero (which corresponds to a log transformation of the response), the relationship becomes nearly linear. Since the relationship between log(price) and age is approximately linear, the correlation coefficient between these two variables, r = -0.9, is a better description of the strength of the relationship.
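The effect of moving through a family of power transformations can also be explored numerically. The sketch below makes two assumptions that are not taken from this page: it uses the Box-Cox form of the power transformation, (y^p - 1)/p, which keeps the ordering of the values and approaches log(y) as p approaches zero, and it uses synthetic exponential "price" and "age" values rather than the Mazda data. The names power_transform and pearson_r are helpers written for this illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def power_transform(values, power):
    """Box-Cox form of a power transformation for positive values.

    power == 0 is treated as the log transformation, which is the
    limit of (v**power - 1) / power as power tends to zero.
    """
    if power == 0:
        return [math.log(v) for v in values]
    return [(v ** power - 1) / power for v in values]

# Synthetic stand-in for the car data (an assumption, not the real prices).
age = list(range(16))
price = [20000 * math.exp(-0.25 * a) for a in age]

# Correlation between age and the transformed price for several powers.
candidate_powers = [1.0, 0.5, 0.0, -0.5, -1.0]
r_by_power = {p: pearson_r(age, power_transform(price, p))
              for p in candidate_powers}

for p, r in r_by_power.items():
    print(f"power {p:+.1f}: r = {r:.3f}")

# The power giving the strongest linear relationship.
best_power = max(r_by_power, key=lambda p: abs(r_by_power[p]))
print("strongest linear relationship at power", best_power)
```

For these synthetic values the log transformation (power 0) gives the strongest correlation, because the data were generated from an exponential curve. Real data will rarely be linearised so exactly.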
Choosing the transformation
How do you decide which variable to transform and which transformation to use? There is no easy answer that works for all data sets, but a useful approach is to look at the marginal distribution of each variable separately and to initially try transformations that remove any skewness in these distributions.
GDP and population
The scatterplot below shows the Gross Domestic Product (GDP in US$billion) and population (million) in 2011 for all countries of the world for which data were available. The box plots on the two axes show that each of the variables has a skewed distribution. Because of the dense cluster of countries near the origin of the plot, it is difficult to assess whether or not there is any relationship between the variables. Linearity cannot even be assessed.
Drag the red line on each axis to remove the skewness in that variable. Raising each variable to a power of approximately zero (corresponding to a log transformation) removes its skewness, and increases the correlation coefficient from 0.498 to 0.782, a much better indication of the strength of the relationship. (For fine tuning of the powers, use the arrow keys on your keyboard.)
You may have felt that the variables raised to slightly different powers had more symmetric distributions. However, it is hard to interpret values when the powers are values such as -0.1 or +0.1, so it is usually best to use powers such as 0.5, 0.0 (log), -0.5 or -1.0.
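Choosing among a few convenient powers by looking at skewness can be sketched as follows. This is only one possible way to automate the visual judgement described above. It assumes the moment-based skewness coefficient as the measure of asymmetry and the Box-Cox form of the power transformation, and it uses synthetic right-skewed values (exponentials of evenly spaced normal quantiles), not the GDP or population data. The helper names are hypothetical.

```python
import math
from statistics import NormalDist

def skewness(values):
    """Moment-based skewness coefficient: m3 / m2**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def power_transform(values, power):
    """Box-Cox form of a power transformation; power == 0 means log."""
    if power == 0:
        return [math.log(v) for v in values]
    return [(v ** power - 1) / power for v in values]

# Synthetic right-skewed variable (an assumption for illustration only).
n = 200
z = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
values = [math.exp(zi) for zi in z]

# Restrict attention to easily interpreted powers.
candidate_powers = [1.0, 0.5, 0.0, -0.5, -1.0]
skew_by_power = {p: skewness(power_transform(values, p))
                 for p in candidate_powers}

for p, s in skew_by_power.items():
    print(f"power {p:+.1f}: skewness = {s:+.2f}")

# The candidate power whose transformed values are most symmetric.
best_power = min(skew_by_power, key=lambda p: abs(skew_by_power[p]))
print("most symmetric at power", best_power)
```

Here the log transformation is selected because the synthetic values were constructed to be symmetric on the log scale. With real data, the skewness values are only a guide, and the choice should still be checked against the scatterplot.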
Note that the 'rich' countries are those that lie on the top left of the cluster of points. The 'poor' countries are towards the bottom right of the cluster. Drag over the crosses to identify the countries. Are there any outliers or clusters of countries?