Relationship between two numerical variables

In an earlier chapter, we presented various methods for describing the relationship between two numerical variables, Y and X.

Although these methods honestly describe the data, they can sometimes give a misleading impression of how the two variables are related.

Marginal and conditional relationships

The problem arises when other variables are also associated with both Y and X. This can happen whether the extra variables are numerical or categorical, but it is easiest to explain with an extra categorical explanatory variable, Z.

The marginal relationship between Y and X is the relationship that is evident in a scatterplot of the two variables without taking into account any other variables. In contrast, the conditional relationship between Y and X given Z is the relationship that is evident within subsets of the data corresponding to different values of Z.

The strength and direction of the marginal relationship between Y and X may be different from those of the conditional relationships given Z.

It is possible for the marginal relationship between Y and X to show positive correlation even when the conditional relationships all have zero or even negative correlation.
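To make this concrete, here is a minimal simulation sketch in Python. All numbers are invented for illustration: Z has three hypothetical categories ("a", "b", "c") whose typical values of X and Y both increase from one category to the next, while X and Y are generated independently within each category.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


rng = random.Random(1)  # fixed seed so that the run is repeatable

# Hypothetical group centres: X and Y are both higher in later categories of Z.
centres = {"a": (0.0, 0.0), "b": (5.0, 5.0), "c": (10.0, 10.0)}

groups = {}
for z, (cx, cy) in centres.items():
    xs = [rng.gauss(cx, 1.0) for _ in range(200)]
    ys = [rng.gauss(cy, 1.0) for _ in range(200)]  # drawn independently of xs
    groups[z] = (xs, ys)

# Marginal relationship: pool all categories and ignore Z.
all_x = [x for xs, _ in groups.values() for x in xs]
all_y = [y for _, ys in groups.values() for y in ys]
marginal_r = pearson(all_x, all_y)

# Conditional relationships: one correlation for each value of Z.
conditional_r = {z: pearson(xs, ys) for z, (xs, ys) in groups.items()}

print(f"marginal r = {marginal_r:.2f}")
for z, r in conditional_r.items():
    print(f"r given Z = {z}: {r:.2f}")
```

Under these assumptions, the pooled correlation should be strongly positive while each within-category correlation stays close to zero, because the only thing linking X and Y is their shared dependence on Z.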

When the marginal and conditional relationships differ, the conditional relationships are usually more meaningful.

(However, we give an example at the end of this section where the marginal relationship is more easily interpreted.)

Lurking (or hidden) variables

It is possible to examine the marginal and conditional relationships between Y and X if the third variable, Z, has been recorded. However, Z might not be one of the variables in our data set.

If the marginal relationship between X and Y is different from their conditional relationship given Z, but Z has either not been recorded or is ignored when analysing the data, then Z is called a lurking variable (or a hidden variable).

Always think about whether there might be a lurking variable, Z, that is distorting the relationship that is observed between Y and X.

The following example illustrates the difference between the marginal and conditional relationships between two numerical variables, X and Y, and shows how the marginal relationship can be misleading.

Eagle flight

The distances that eagles fly vary greatly and a biologist investigates whether flight distances depend on the weights of the birds. In a study that extended over twelve months, 120 different eagles were each tagged with a radio transmitter for a period of two weeks. The average flight length (km per day) and weight (kg) of each eagle were recorded. A scatterplot of these measurements is shown below.

The marginal relationship between flight length and weight shows a positive correlation, so it might be taken to imply that heavy eagles tend to fly further than light eagles, and a naive biologist might try to explain this in terms of their body mass providing the energy necessary to extend their range.

However, this conclusion is flawed. The data were collected at different times during the year, and season is a third variable that affects both the weights of the birds and the distances that they fly.

Slicing the data by month displays the relationship separately for the birds that were observed in each month. Within each month, there is no clear relationship between weight and flight length.

The observed relationship between flight length and weight was caused by a lurking variable, season.
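The following Python sketch imitates this situation with simulated data. It does not use the biologist's measurements: the seasonal pattern, the typical weight of 4.5 kg, the typical flight length of 30 km per day and the noise levels are all hypothetical values chosen for illustration. Season raises or lowers both weight and flight length, but within any month the two are generated independently.

```python
import random
from math import cos, pi, sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


rng = random.Random(7)  # fixed seed so that the run is repeatable

months, weights, flights = [], [], []
for month in range(12):
    season = cos(2 * pi * month / 12)  # hypothetical seasonal index in [-1, 1]
    for _ in range(10):                # 10 eagles per month, 120 in total
        months.append(month)
        weights.append(4.5 + 0.5 * season + rng.gauss(0, 0.2))  # kg
        flights.append(30 + 10 * season + rng.gauss(0, 4))      # km per day


def centre_within_month(values):
    """Subtract each month's mean, leaving only within-month variation."""
    means = {}
    for m in set(months):
        vals = [v for v, mm in zip(values, months) if mm == m]
        means[m] = sum(vals) / len(vals)
    return [v - means[m] for v, m in zip(values, months)]


# Marginal relationship: ignore the month of observation.
marginal_r = pearson(weights, flights)

# Conditional relationship: correlate what is left after removing month means.
within_month_r = pearson(centre_within_month(weights),
                         centre_within_month(flights))

print(f"marginal r = {marginal_r:.2f}")
print(f"within-month r = {within_month_r:.2f}")
```

Removing the monthly means is one simple way to summarise the twelve conditional relationships in a single number. With only ten birds in each month, separate monthly correlations would be too variable to read individually.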

Nutrition and exam performance

A different investigator is studying child nutrition in developing countries. For each child taking a particular school examination, the number of calories eaten daily and the exam mark are recorded. There is a positive relationship between food intake and exam results. The investigator claims that this shows that some children are performing poorly because they are not eating enough, and these findings are published widely.

Can you see the flaw in this conclusion?

The nutritionist should have considered what lurking variables might affect both exam results and food intake. Parents who are well educated are likely to have children who are better nourished, and their children are also likely to perform well at school (either for genetic reasons or because they are given greater encouragement and help).

Family background is therefore a lurking variable that could underlie the observed relationship between food intake and exam results, even if increasing food intake does not directly cause improved exam results.