Relationship between two numerical variables

In an earlier chapter, we presented various methods for describing the relationship between two numerical variables, Y and X.

Although these methods honestly describe the data, they can sometimes give a misleading impression of how the two variables are related.

Marginal and conditional relationships

The problem arises when other variables are also associated with both Y and X. The extra variables may be numerical or categorical, but the idea is easiest to explain with an extra categorical explanatory variable, Z.

The marginal relationship between Y and X is the relationship that is evident in a scatterplot of the two variables without taking into account any other variables. In contrast, the conditional relationship between Y and X given Z is the relationship that is evident within subsets of the data corresponding to different values of Z.

The strength and direction of the marginal relationship between Y and X may be different from those of the conditional relationships given Z.

It is possible for the marginal relationship between Y and X to show positive correlation when the conditional relationships all show zero or even negative correlation.
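One way to see why this can happen is a standard probability identity, the law of total covariance. It is not part of the original discussion and is stated here for population quantities, treating X, Y and Z as random variables:

```latex
\operatorname{Cov}(X, Y)
  = \operatorname{E}\bigl[\operatorname{Cov}(X, Y \mid Z)\bigr]
  + \operatorname{Cov}\bigl(\operatorname{E}[X \mid Z],\; \operatorname{E}[Y \mid Z]\bigr)
```

The first term averages the conditional covariances within the groups defined by Z. The second term measures how the group means of X and Y move together across values of Z. If every conditional covariance is zero, the first term vanishes, but the marginal covariance is still positive whenever groups with a higher mean of X also tend to have a higher mean of Y.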

When the marginal and conditional relationships differ, the conditional relationships are usually more meaningful.

(However, we give an example at the end of this section where the marginal relationship is more easily interpreted.)

Lurking (or hidden) variables

It is possible to examine the marginal and conditional relationships between Y and X if the third variable, Z, has been recorded. However, Z might not be one of the variables in our data set.

If the marginal relationship between X and Y is different from their conditional relationship given Z, but Z has either not been recorded or is ignored when analysing the data, then Z is called a lurking variable (or a hidden variable).

Always think about whether there might be a lurking variable, Z, that is distorting the relationship that is observed between Y and X.

The following example illustrates the difference between the marginal and conditional relationships between two numerical variables, X and Y, and shows how the marginal relationship can be misleading.

Reading ability and height

A scientist examining genetic factors affecting reading ability in primary children collects a variety of information from each child at a particular school. A scatterplot of the reading ability and height of the children is shown below.

The marginal relationship between reading ability and height shows a positive correlation, so it might be taken to imply that there is a genetic effect on reading ability since taller children tend to be better readers.

However, this conclusion is flawed. The ages of the children in the study ranged from 6 to 10 years old, and age is a third variable that affects both height and reading ability.

Click the checkbox Slice, then use the slider to display the relationship separately for the children of each age. Within each age group, there is no clear relationship between height and reading ability.

The observed marginal relationship between reading ability and height was caused by a lurking variable, age.
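The same pattern can be reproduced with simulated data. The sketch below is only an illustration, not the scientist's data: the group sizes, the height and reading scales, and all coefficients are invented assumptions. Age drives both height and reading score, while height has no direct effect on reading score. The helper `pearson` is written out by hand so that only the Python standard library is needed.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Sample correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


rng = random.Random(1)  # fixed seed so the sketch is reproducible

# Hypothetical children (all numbers are invented for illustration).
# Age affects both variables; height never enters the reading formula.
children = []
for age in range(6, 11):
    for _ in range(200):
        height = 85 + 6 * age + rng.gauss(0, 5)   # cm, assumed scale
        reading = 10 * age + rng.gauss(0, 8)      # assumed test score
        children.append((age, height, reading))

# Marginal relationship: ignore age and use all children together.
marginal_r = pearson([c[1] for c in children], [c[2] for c in children])
print(f"marginal r = {marginal_r:.2f}")

# Conditional relationships: one correlation within each age group.
conditional_r = {}
for age in range(6, 11):
    group = [c for c in children if c[0] == age]
    conditional_r[age] = pearson([c[1] for c in group], [c[2] for c in group])
    print(f"age {age}: r = {conditional_r[age]:.2f}")
```

Under these assumptions the marginal correlation is strongly positive, while the correlation within each age group stays close to zero, which mirrors what slicing the scatterplot by age shows.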

Nutrition and exam performance

A different investigator is studying child nutrition in developing countries. The number of calories eaten daily by each child taking a particular school examination and the exam mark are recorded. There is a positive relationship between food intake and exam results. The investigator claims that this shows that some children are performing poorly because they are not eating enough, and these findings are published widely.

Can you see the flaw in this conclusion?

The investigator should have considered what lurking variables might affect both exam results and food intake. Families whose parents are well educated are likely to have children who are better nourished, and they are also likely to have children who perform well at school (either for genetic reasons or because greater encouragement and help are given).

Family background is therefore a lurking variable that could underlie the observed relationship between food intake and exam results, even if increasing food intake does not directly cause improved exam results.
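When the extra variable is numerical rather than categorical, the data cannot be sliced into a few groups, but a partial correlation plays a similar role to the conditional relationship. The sketch below assumes, hypothetically, that a numerical family background score had been recorded; the score, the calorie and mark scales, and all coefficients are invented for illustration. In the simulation, calories have no direct effect on marks. The adjusted value uses the standard first-order partial correlation formula.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Sample correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


rng = random.Random(2)  # fixed seed so the sketch is reproducible

# Hypothetical data: family background (z) affects both calories and marks,
# but calories never enter the formula for marks.
background, calories, marks = [], [], []
for _ in range(2000):
    z = rng.gauss(0, 1)                           # assumed background score
    background.append(z)
    calories.append(2000 + 300 * z + rng.gauss(0, 200))
    marks.append(60 + 8 * z + rng.gauss(0, 6))

# Marginal correlation between food intake and exam mark.
r_xy = pearson(calories, marks)

# Partial correlation of calories and marks, adjusting for background.
r_xz = pearson(calories, background)
r_yz = pearson(marks, background)
partial_r = (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

print(f"marginal r = {r_xy:.2f}")
print(f"partial r given background = {partial_r:.2f}")
```

In this invented setting the marginal correlation is clearly positive, yet the partial correlation is close to zero, so the simulated relationship between food intake and exam marks is entirely due to family background. With real data the lurking variable is often unrecorded, so this adjustment may not be available.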