
Chapter 6   Multivariate Data

6.1   Displaying multivariate data

6.1.1   Representing a third variable

Different plotting symbols for a third variable

The simplest way to show the relationship between three numerical variables is based on a scatterplot of two of the variables, with the third variable represented by varying the plotting symbols instead of drawing identical 'crosses'. Possibilities include varying the size, shape or colour of the symbols.

Although this kind of scatterplot is easy to draw, it is usually hard to interpret. The decision about which of the three variables to represent using the plotting symbol can make interpretation easier.
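As a rough illustration, the sketch below (Python with matplotlib, using made-up data and variable names) encodes a third variable z in a scatterplot of y against x, once as symbol size and once as symbol colour.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up data: two main variables plus a third, z, to encode in the symbols.
rng = np.random.default_rng(1)
x = rng.normal(50, 10, 100)
y = 0.8 * x + rng.normal(0, 5, 100)
z = rng.uniform(0, 1, 100)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Left panel: z determines the symbol size.
axes[0].scatter(x, y, s=10 + 90 * z)
axes[0].set_title("z as symbol size")

# Right panel: z determines the symbol colour.
points = axes[1].scatter(x, y, c=z, cmap="viridis")
fig.colorbar(points, ax=axes[1], label="z")
axes[1].set_title("z as symbol colour")

plt.show()
```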

6.1.2   Rotating 3D scatterplots

3-dimensional scatterplots

The most direct way to extend a standard scatterplot of two variables, X and Y, to include a third variable, Z, is to add a third dimension as an axis for Z. Each individual is then represented by a cross in a 3-dimensional cube.

Although a computer screen is only 2-dimensional, it can display a projection of such a 3-dimensional scatterplot. Rotating such a display with the mouse gives a good feel for the shape of the underlying 3-dimensional scatter of points. This is a dynamic display that is only effective on a computer — the information in the display is conveyed by movement. On paper, we can only present information in two dimensions.

Three-dimensional scatterplots are an interesting (and occasionally useful) way to display data. They are however much overrated as an analysis technique and simpler displays are usually more effective for extracting information from multivariate data.

6.1.3   Scatterplot matrix and brushing

Scatterplots of all pairs of variables

The problem of displaying relationships becomes even more difficult when there are more than three variables. It is possible to gain some insight into their relationships with an array of scatterplots of all pairs of variables, called a scatterplot matrix.
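Most statistics packages can draw one; as a minimal sketch, pandas provides a scatterplot matrix directly (made-up data and column names):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Made-up data frame with four numerical variables.
rng = np.random.default_rng(2)
df = pd.DataFrame(rng.normal(size=(100, 4)), columns=["A", "B", "C", "D"])

# One scatterplot for every pair of variables; histograms on the diagonal.
pd.plotting.scatter_matrix(df, diagonal="hist", figsize=(8, 8))
plt.show()
```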

Brushing

Although a static scatterplot matrix reveals some aspects of the relationships between the variables, more insight into the data is obtained by adding dynamic features.

On a computer display, the scatterplots can be dynamically linked, so that clicking on a cross on one scatterplot highlights that individual in all scatterplots. Brushing extends this to allow highlighting of multiple crosses on a scatterplot with a 'brush' tool.

6.1.4   Brushing example

More about brushing

Brushing can be used on any linked displays of the same data set.

6.1.5   Slicing

Slicing a scatterplot using a third variable

Slicing is a dynamic technique that only displays a subset of the individuals, based on a restricted range of values for one particular variable. For example, we might display a scatterplot of Y against X, but only for individuals whose values of Z are between 20 and 30. The subset can be dynamically changed with a slider or other control to show whether the relationship between X and Y is different for individuals with low Z and high Z.
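A static version of one slice is easy to program. The sketch below (Python, made-up data) keeps only the individuals with z between 20 and 30; a full implementation would attach the bounds to a slider control (for example matplotlib.widgets.Slider) so that the slice can be dragged through the data.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up data in which the X-Y relationship changes with z.
rng = np.random.default_rng(3)
x = rng.normal(0, 1, 200)
z = rng.uniform(0, 50, 200)
y = np.where(z > 25, x, -x) + rng.normal(0, 0.3, 200)

# Display only the slice of individuals with 20 <= z <= 30.
in_slice = (z >= 20) & (z <= 30)
plt.scatter(x[in_slice], y[in_slice], s=10)
plt.xlabel("x")
plt.ylabel("y")
plt.title("y vs x for the slice 20 <= z <= 30")
plt.show()
```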

Trellis plot

Slicing is a dynamic method and must therefore be performed on a computer. Although less effective, an alternative is to show a series of static plots for different slices of the data. This is called a trellis plot.
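A trellis plot can be produced with an ordinary grid of subplots; a minimal sketch (Python, made-up data, three equal-width slices of z):

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up data in which the slope of y against x changes with z.
rng = np.random.default_rng(4)
x = rng.normal(0, 1, 300)
z = rng.uniform(0, 60, 300)
y = (z / 30 - 1) * x + rng.normal(0, 0.3, 300)

# One static panel per slice of z.
edges = [0, 20, 40, 60]
fig, axes = plt.subplots(1, 3, figsize=(12, 4), sharex=True, sharey=True)
for ax, lo, hi in zip(axes, edges[:-1], edges[1:]):
    keep = (z >= lo) & (z < hi)
    ax.scatter(x[keep], y[keep], s=10)
    ax.set_title(f"{lo} <= z < {hi}")
plt.show()
```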

6.2   Groups and regression

6.2.1   Additional variables in regression

Three or more variables

We previously described the relationship between two numerical variables, X and Y.

In many applications, more than two measurements are made from each individual and the additional variables may throw light on the relationship.

Use of this extra information may lead to more accurate predictions of Y.

6.2.2   Displaying groups

Distinguishing groups in a scatterplot

We now consider how a categorical variable, Z, can help explain the relationship between two numerical variables, X and Y. Equivalently, we examine whether the relationship between X and Y is the same in each of several groups.

As in most other situations, data analysis should start by examining the data graphically. Differences between the groups can be shown with different symbols and/or colours for the crosses in a scatterplot of Y against X.

6.2.3   Regression with grouped data

Least squares in each group

We can separately examine the relationship between X and Y in each group (or equivalently for each value of a categorical variable, Z). If the relationship between X and Y is different in the different groups, this should lead to more accurate predictions of Y from X.

A regression line can be separately fitted by least squares in each group.
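A minimal sketch of fitting a separate least squares line in each group (Python, made-up data; np.polyfit returns the slope and intercept):

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up data with a different X-Y slope in each of two groups.
rng = np.random.default_rng(5)
x = rng.uniform(0, 10, 120)
group = rng.integers(0, 2, 120)
y = 2 + (1 + group) * x + rng.normal(0, 1, 120)

for g, colour in [(0, "tab:blue"), (1, "tab:red")]:
    xg, yg = x[group == g], y[group == g]
    slope, intercept = np.polyfit(xg, yg, 1)   # least squares within the group
    plt.scatter(xg, yg, s=10, color=colour)
    xx = np.array([x.min(), x.max()])
    plt.plot(xx, intercept + slope * xx, color=colour,
             label=f"group {g}: y = {intercept:.2f} + {slope:.2f}x")
plt.legend()
plt.show()
```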

6.2.4   Parallel regression lines

Separate regression lines

The problem with fitting separate regression lines by least squares in the different groups is that it is difficult to concisely explain the difference between the groups — the difference between the predicted response in the groups depends on the value of the explanatory variable.

Parallel regression lines

Interpretation is considerably simplified if we constrain the regression lines for the different groups to have the same slope. With parallel lines, the difference between the groups is the same for all values of X.

Parallel lines are not appropriate descriptions of all data sets. Always check a scatterplot first.

Least squares

The principle behind fitting parallel lines to two or more groups is the same as in ordinary simple regression — we choose the parameters to minimise the sum of squared residuals (vertical distances between the data crosses and their corresponding line). The resulting formulae are complicated, but most statistical software will do the calculations for you.
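One standard way to fit parallel lines, sketched below with made-up data, is to add a 0/1 group indicator to the design matrix: both groups share the slope b1, and b2 is the constant vertical difference between the lines.

```python
import numpy as np

# Made-up data: two groups with the same slope but different intercepts.
rng = np.random.default_rng(6)
n = 100
x = rng.uniform(0, 10, n)
g = rng.integers(0, 2, n)            # group indicator, 0 or 1
y = 1 + 3 * g + 0.5 * x + rng.normal(0, 1, n)

# Design matrix [1, x, g]: a common slope plus a constant group offset.
X = np.column_stack([np.ones(n), x, g])
(b0, b1, b2), *_ = np.linalg.lstsq(X, y, rcond=None)

print(f"group 0:  y = {b0:.2f} + {b1:.2f} x")
print(f"group 1:  y = {b0 + b2:.2f} + {b1:.2f} x")   # parallel: same slope b1
```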

6.2.5   Transformed variables and groups ((advanced))

Transformation of the response

It is much easier to interpret the parameters when parallel lines are fitted to different groups than when their slopes are different, but the data or the context may not justify such a simplification.

After a nonlinear transformation of the response, the relationships in the two groups may be closer to parallel. A transformation that reduces skewness in the response often works well.

Understanding the model for ln(y)

Logarithmic transformations are particularly useful because the parallel least squares lines have a simple interpretation. For one example data set, the fitted lines are:

Female:    ln (y) = 0.391 + 0.0747 x
Male:       ln (y) = 1.192 + 0.0747 x

This means that:

ln(y) is (1.192 - 0.391) = 0.801 higher for males than for females with the same x.

We can now concisely summarise the difference between males and females:

y for males is e^0.801 = 2.23 times that for females with the same x.
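The arithmetic behind this summary is simply that a difference of logs is the log of a ratio. For any value of x,

ln(y_male) - ln(y_female)  =  (1.192 + 0.0747 x) - (0.391 + 0.0747 x)  =  0.801

so, exponentiating both sides,

y_male / y_female  =  e^0.801  =  2.23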

6.2.6   Grouping with a numerical variable ((optional))

Definition of groups

A categorical variable naturally splits the individuals into groups, but a numerical variable, Z, can also be used to define groups.
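For example, with pandas the numerical variable can be cut into ranges to define the groups; a minimal sketch (made-up data and cut-points):

```python
import numpy as np
import pandas as pd

# Made-up numerical variable (e.g. age in years).
rng = np.random.default_rng(7)
z = pd.Series(rng.uniform(18, 80, 200))

# Cut the numerical variable into ranges to define three groups.
groups = pd.cut(z, bins=[18, 45, 65, 80], right=False,
                labels=["18-44", "45-64", "65+"])
print(groups.value_counts())
```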

6.2.7   Scatterplot matrix with groups

Distinguishing groups in a scatterplot matrix

A scatterplot matrix can help to describe the relationships between 3 or more numerical variables. If each individual belongs to one of several known groups, different symbols or colours can be used to distinguish the groups.
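Seaborn's pairplot is one convenient way to do this; a minimal sketch with made-up data and group labels:

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Made-up data: three numerical variables and a group label.
rng = np.random.default_rng(8)
n = 150
df = pd.DataFrame({
    "A": rng.normal(0, 1, n),
    "B": rng.normal(0, 1, n),
    "C": rng.normal(0, 1, n),
    "group": rng.choice(["one", "two"], n),
})

# Scatterplot matrix with a different colour for each group.
sns.pairplot(df, hue="group")
plt.show()
```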

6.3   Multiple regression

6.3.1   More than one explanatory variable

Response and explanatory variables

We are often interested in how a 'response' variable, Y, depends on other explanatory variables. If there is a single explanatory variable, X, we can predict Y from X with a simple linear model of the form,

y  =  b0 + b1 x

However if other explanatory variables have been recorded from each individual, we should be able to use them to predict the response more accurately.

6.3.2   Multiple regression equation

Adding extra variables

A simple linear model for a single explanatory variable,

y  =  b0 + b1 x

can be easily extended to describe the effect of a second explanatory variable, Z, with an extra linear term,

y  =  b0 + b1 x + b2 z

and so on with more explanatory variables,

y  =  b0 + b1 x + b2 z + b3 w + ...

This type of model is called a multiple regression model.

Coefficients

Despite our use of the same symbols (b0, b1, ...) for all three models above, their 'best' values are often different for the different models. An example is given on the next page.

6.3.3   Interpreting coefficients

Marginal and conditional relationships

In a linear model that predicts a response from several explanatory variables, the least squares coefficient associated with any explanatory variable describes its effect on the response if all other variables are held constant. This is also called the variable's conditional effect on the response.

This may be very different from the size and even the sign of the coefficient when a linear model is fitted with only that single explanatory variable. This simple linear model describes the marginal relationship between the response and that variable.

Example

In a model for predicting the percentage body fat of men, the least squares fit of a simple model with Weight alone is

Predicted body fat   =   -10.00  +  0.162 Weight

However if we add Abdomen circumference to the model, the best values for the coefficients are

Predicted body fat   =   -41.35  -  0.136 Weight  +  0.915 Abdomen
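This sign change is easy to reproduce with synthetic data. In the sketch below (Python, with made-up numbers loosely mimicking the body fat example, not the study's actual data), weight and abdomen circumference are positively correlated, so the marginal coefficient of weight comes out positive even though its conditional coefficient is negative.

```python
import numpy as np

# Synthetic data: abdomen circumference rises with weight, and body fat
# depends on both (all numbers are illustrative).
rng = np.random.default_rng(9)
n = 250
weight = rng.normal(180, 30, n)
abdomen = 0.5 * weight + rng.normal(0, 5, n)
fat = -41.35 - 0.136 * weight + 0.915 * abdomen + rng.normal(0, 3, n)

def least_squares(y, *predictors):
    """Least squares coefficients (intercept first) for the given predictors."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

print("weight alone:      ", least_squares(fat, weight))
print("weight and abdomen:", least_squares(fat, weight, abdomen))
```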

6.3.4   Scatterplot of three variables

Graphical display of data and models

Before applying statistical methods to any data set, we should try to display the data graphically. For a situation with response Y and two explanatory variables, X and Z, a rotating 3-dimensional scatterplot can display the data.

This kind of 3-dimensional diagram can also effectively illustrate linear models and their parameters.

(Similar graphical displays are not possible when there are 3 or more explanatory variables, but the concepts are the same.)

6.3.5   Regression plane

Graphical display of the equation

The linear equation to predict a response from a single explanatory variable can be displayed as a straight line on a scatterplot of Y against X.

y  =  b0 + b1 x

The corresponding equation for predicting Y from X and Z is

y  =  b0 + b1 x + b2 z

This linear model can be displayed as a plane on a 3-dimensional scatterplot of Y against X and Z.
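A static sketch of such a display (Python with matplotlib, made-up data) fits the plane by least squares and draws it through the 3-dimensional scatter:

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up data for a response y and two explanatory variables x and z.
rng = np.random.default_rng(10)
n = 80
x = rng.uniform(0, 10, n)
z = rng.uniform(0, 10, n)
y = 5 + 1.5 * x - 0.8 * z + rng.normal(0, 2, n)

# Fit the plane y = b0 + b1*x + b2*z by least squares.
X = np.column_stack([np.ones(n), x, z])
(b0, b1, b2), *_ = np.linalg.lstsq(X, y, rcond=None)

ax = plt.figure().add_subplot(projection="3d")
ax.scatter(x, z, y)

# Draw the fitted plane over a grid of (x, z) values.
xg, zg = np.meshgrid(np.linspace(0, 10, 10), np.linspace(0, 10, 10))
ax.plot_surface(xg, zg, b0 + b1 * xg + b2 * zg, alpha=0.3)
ax.set_xlabel("x"); ax.set_ylabel("z"); ax.set_zlabel("y")
plt.show()
```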

6.3.6   Fitted values and residuals

Fitted values

A linear model provides a prediction of y for any values of x and z. For the i'th individual, the prediction is called its fitted value,

fitted value,   ŷi  =  b0 + b1 xi + b2 zi

Residuals

The difference between the fitted value and the actual response is called the individual's residual.

residual,   ei  =  yi - ŷi

Residuals correspond to vertical distances between crosses in a 3-dimensional scatterplot and the plane representing the model.

6.3.7   Least squares estimation

Goal of small residuals

When fitting a model to data, the fitted values are unlikely to match exactly the observed response values and the prediction 'errors' are the residuals,

ei  =  yi - ŷi

'Small' residuals are desirable.

Least squares

The size of the residuals is described by their sum of squares,

SSResidual  =  Σ ei²  =  Σ (yi - ŷi)²

The least squares estimates of b0, b1 and b2 are the values that minimise this.

The solution can be obtained algebraically but the formulae are relatively complex and a computer should be used to evaluate the least squares estimates.
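A minimal sketch with numpy (made-up data): build the design matrix, let np.linalg.lstsq choose the coefficients, and evaluate the residual sum of squares that they minimise.

```python
import numpy as np

# Made-up data for a response y and two explanatory variables x and z.
rng = np.random.default_rng(11)
n = 50
x = rng.uniform(0, 10, n)
z = rng.uniform(0, 5, n)
y = 2 + 0.8 * x + 1.5 * z + rng.normal(0, 1, n)

# Least squares chooses b0, b1, b2 to minimise the residual sum of squares.
X = np.column_stack([np.ones(n), x, z])
b, *_ = np.linalg.lstsq(X, y, rcond=None)

fitted = X @ b            # fitted values, b0 + b1*xi + b2*zi
residuals = y - fitted    # ei = yi - fitted value
print("least squares estimates:", b)
print("SSResidual:", np.sum(residuals ** 2))
```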

6.4   Marginal relationships

6.4.1   Misleading marginal correlation

Relationship between two numerical variables

Scatterplots, correlation coefficients and least squares lines honestly summarise the relationship between two numerical variables, Y and X.

However they can sometimes give a misleading impression about the relationship. The problem arises when other variables are also associated with both Y and X.

Marginal relationship between Y and X
The relationship that is evident in a scatterplot of the two variables without taking into account any other variables.
Conditional relationship between Y and X, given Z
The relationship between Y and X that is evident within subsets of the data corresponding to different values of Z.

The marginal and conditional relationships between Y and X are often different and may even be in a different direction.

Lurking (or hidden) variables

If the marginal relationship between X and Y is different from their conditional relationship given Z, but Z has either not been recorded or is ignored when analysing the data, then Z is called a lurking variable (or a hidden variable).

Always think about whether there might be a lurking variable, Z, that is distorting the relationship that is observed between Y and X.

6.4.2   Misleading marginal difference in means

Relationship between a numerical and a categorical variable

The previous page showed that the marginal relationship between two numerical variables, X and Y, can be very different from their conditional relationship for specific values of Z. The same can happen when X is a categorical variable, perhaps defining different groups.

6.4.3   Simpson's paradox

Lurking variables and relationships between categorical variables

When the direction of a relationship reverses after allowing for a lurking variable, the effect is called Simpson's paradox. As with other 'paradoxes', there is no real contradiction; it just takes a bit more thought to understand why your initial intuition is wrong.

Smoking and survival

In a health survey, 1,314 women were classified as smokers or non-smokers, and their survival after 20 years was recorded.

  Smoker?        Dead   Alive   Total   P(Dead)
  Smoker          139     443     582     0.239
  Non-smoker      230     502     732     0.314

A naive examination of the data suggests that smoking decreases the probability of dying, but the opposite is true if the women are split into age groups.

Age 18-44
  Smoker?        Dead   Alive   Total   P(Dead)
  Smoker           19     269     288     0.066
  Non-smoker       13     327     340     0.038

Age 45-64
  Smoker?        Dead   Alive   Total   P(Dead)
  Smoker           78     167     245     0.318
  Non-smoker       52     147     199     0.261

Age 65+
  Smoker?        Dead   Alive   Total   P(Dead)
  Smoker           42       7      49     0.857
  Non-smoker      165      28     193     0.855
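A few lines of Python confirm the reversal directly from the counts in the tables above: smokers have the lower overall death rate, yet the higher death rate within every age group.

```python
# (dead, total) counts copied from the tables above.
counts = {
    "overall": {"smoker": (139, 582), "non-smoker": (230, 732)},
    "18-44":   {"smoker": (19, 288),  "non-smoker": (13, 340)},
    "45-64":   {"smoker": (78, 245),  "non-smoker": (52, 199)},
    "65+":     {"smoker": (42, 49),   "non-smoker": (165, 193)},
}

for stratum, groups in counts.items():
    rates = {name: dead / total for name, (dead, total) in groups.items()}
    print(f"{stratum:8}  P(Dead | smoker) = {rates['smoker']:.3f}   "
          f"P(Dead | non-smoker) = {rates['non-smoker']:.3f}")
```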

Proportional Venn diagram

Simpson's paradox is explained by the proportional Venn diagram below, in which the area of each rectangle is proportional to the number of women with that combination of values of the variables.

Most of the women aged 65+ were non-smokers, and death rates were far higher in this age group than in the younger ones. This increased the overall death rate of the non-smokers.

6.4.4   Other examples with lurking variables

Analysis using 'lurking' variables

Variables are only called 'lurking' variables if they are either unrecorded or are unused when analysing the data. In this section's examples, we presented two analyses of the data sets:

  1. A simple naive analysis ignoring the lurking variable, and
  2. A more complete analysis using this variable (which should therefore no longer be called 'lurking')

Although the simple analysis can result in wrong conclusions, a full analysis using the 'lurking' variable is always more complex.

Do not ignore lurking variables to simplify the analysis — you could reach the wrong conclusions from the data.