Chapter 6   Multivariate Data

6.1   Displaying multivariate data

6.1.1   Representing a third variable

A third numerical variable can be represented in a scatterplot by use of different symbols or colours.

6.1.2   Rotating 3D scatterplots

Three numerical variables can be displayed in a 3-dimensional scatterplot; this may be rotated to help understand the relationships in the data.

6.1.3   Scatterplot matrix and brushing

An array of scatterplots of all pairs of variables is often informative, especially if the scatterplots are dynamically linked.

6.1.4   Brushing example

'Brushing' refers to dynamic highlighting of the same individuals in multiple linked displays.

6.1.5   Slicing

Slicing is another dynamic technique. Only observations within a range of values of one variable (a slice) are displayed in linked displays.

6.2   Groups and regression

6.2.1   Additional variables in regression

Correlation and least squares are used to describe the relationship between two numerical variables. Additional measurements from each individual can potentially help to refine our understanding of the relationship.

6.2.2   Displaying groups

Different symbols or colours can be used to represent a third categorical variable in a scatterplot.

6.2.3   Regression with grouped data

The relationship between Y and X can be described separately by a least squares line within each group. This should improve prediction of the response if the relationship differs between groups.
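As a minimal sketch of this idea, the following fits one least squares line to all the data and then a separate line within each group. The data values, the group labels "A" and "B", and the function names are made up for illustration; they do not come from the text.

```python
# Made-up illustrative data: (x, y, group label).
data = [
    (1, 3, "A"), (2, 5, "A"), (3, 7, "A"), (4, 9, "A"),
    (1, 10, "B"), (2, 9, "B"), (3, 8, "B"), (4, 7, "B"),
]

def least_squares_line(xs, ys):
    """Return (intercept, slope) of the least squares line of y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def fit_by_group(rows):
    """Fit a separate least squares line within each group."""
    grouped = {}
    for x, y, g in rows:
        xs, ys = grouped.setdefault(g, ([], []))
        xs.append(x)
        ys.append(y)
    return {g: least_squares_line(xs, ys) for g, (xs, ys) in grouped.items()}

# One line for everyone, ignoring the groups.
pooled = least_squares_line([r[0] for r in data], [r[1] for r in data])
# One line per group.
by_group = fit_by_group(data)

print(f"pooled:  y = {pooled[0]:.2f} + {pooled[1]:.2f} x")
for g, (a, b) in by_group.items():
    print(f"group {g}: y = {a:.2f} + {b:.2f} x")
```

In this made-up data the two groups have slopes of opposite sign, so the single pooled line describes neither group well.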

6.2.4   Parallel regression lines

If regression lines for the different groups are parallel, it is easy to summarise the group differences numerically and interpret these differences.
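One way to obtain parallel lines, sketched here under the assumption that the model is y = a_g + b x (a common slope b and a separate intercept a_g for each group g), is to estimate the slope from the within-group deviations. The data and names are hypothetical.

```python
# Made-up illustrative data: group label -> list of (x, y) pairs.
groups = {
    "A": [(1, 2), (2, 4), (3, 5), (4, 7)],
    "B": [(1, 5), (2, 7), (3, 8), (4, 10)],
}

def fit_parallel_lines(groups):
    """Least squares fit of y = a_g + b*x: one slope, one intercept per group."""
    means = {}
    sxx = 0.0
    sxy = 0.0
    for g, pts in groups.items():
        mx = sum(x for x, _ in pts) / len(pts)
        my = sum(y for _, y in pts) / len(pts)
        means[g] = (mx, my)
        # Accumulate within-group sums of squares and cross-products.
        sxx += sum((x - mx) ** 2 for x, _ in pts)
        sxy += sum((x - mx) * (y - my) for x, y in pts)
    slope = sxy / sxx
    intercepts = {g: my - slope * mx for g, (mx, my) in means.items()}
    return slope, intercepts

slope, intercepts = fit_parallel_lines(groups)

# With parallel lines, the vertical gap between them is the same at every x,
# so a single number summarises the group difference.
gap = intercepts["B"] - intercepts["A"]
print(f"common slope = {slope:.2f}, B minus A = {gap:.2f}")
```

Because the lines are parallel, the difference between intercepts can be read as the difference between the groups at any fixed value of X.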

6.2.5   Transformed variables and groups (advanced)

Transformations may linearise the relationship between the response and explanatory variables in each group and also give parallel regression lines.

6.2.6   Grouping with a numerical variable (optional)

A numerical variable can be used to split the individuals into groups.

6.2.7   Displaying groups

Groups can also be represented with different symbols or colours on a scatterplot matrix that describes the relationships between 3 or more other variables.

6.3   Multiple regression

6.3.1   More than one explanatory variable

In many data sets, two or more explanatory variables could potentially affect the response. Using two or more explanatory variables may give more accurate predictions.

6.3.2   Multiple regression equation

A simple linear model with a single explanatory variable can be extended with extra terms to explain the additional effect of other explanatory variables.
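For two explanatory variables X and Z, the extended equation can be written as follows. The symbols b_0, b_1 and b_2 are assumed notation for the intercept and the two slope coefficients, not notation taken from the text.

```latex
\hat{y} = b_0 + b_1 x + b_2 z
```

Setting b_2 = 0 recovers the simple linear model with the single explanatory variable X.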

6.3.3   Interpreting coefficients

The slope coefficient associated with an explanatory variable describes its effect if all other variables are held constant. It may have a different sign from the correlation coefficient between the variable and the response.
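The sign reversal can be illustrated with a small sketch. It assumes a model with two explanatory variables, y = b0 + b1 x + b2 z, fitted by least squares using closed-form expressions for that case. The data are made up so that y = -x + 3z exactly, with x and z tending to increase together; all names are hypothetical.

```python
def fit_plane(xs, zs, ys):
    """Least squares fit of y = b0 + b1*x + b2*z (two explanatory variables)."""
    n = len(ys)
    mx, mz, my = sum(xs) / n, sum(zs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    szz = sum((z - mz) ** 2 for z in zs)
    sxz = sum((x - mx) * (z - mz) for x, z in zip(xs, zs))
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    szy = sum((z - mz) * (y - my) for z, y in zip(zs, ys))
    det = sxx * szz - sxz ** 2  # zero if x and z are perfectly collinear
    b1 = (szz * sxy - sxz * szy) / det
    b2 = (sxx * szy - sxz * sxy) / det
    return my - b1 * mx - b2 * mz, b1, b2

def correlation(us, vs):
    """Correlation coefficient between two equal-length lists."""
    n = len(us)
    mu, mv = sum(us) / n, sum(vs) / n
    suv = sum((u - mu) * (v - mv) for u, v in zip(us, vs))
    suu = sum((u - mu) ** 2 for u in us)
    svv = sum((v - mv) ** 2 for v in vs)
    return suv / (suu * svv) ** 0.5

# Made-up data: x and z increase together, and y = -x + 3z exactly.
xs = [1, 2, 3, 4, 5]
zs = [1, 3, 2, 4, 5]
ys = [-x + 3 * z for x, z in zip(xs, zs)]

r_xy = correlation(xs, ys)
b0, b1, b2 = fit_plane(xs, zs, ys)
print(f"correlation of x and y: {r_xy:.2f}")
print(f"coefficient of x with z held constant: {b1:.2f}")
```

Here the correlation between x and y is positive because larger x tends to come with larger z, yet the coefficient of x is negative once z is held constant.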

6.3.4   Scatterplot of three variables

The relationship between a response variable and two explanatory variables can be effectively displayed in a rotating 3-dimensional scatterplot.

6.3.5   Regression plane

The equation of a linear model for Y in terms of X and Z can be displayed as a plane in 3 dimensions.

6.3.6   Fitted values and residuals

The residuals are the vertical distances from the data points (drawn as crosses) on a 3-dimensional scatterplot to the plane representing the model.

6.3.7   Least squares estimation

An objective estimation method is to minimise the sum of squared residuals -- the principle of least squares.
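The principle can be checked numerically. This sketch assumes a plane y = b0 + b1 x + b2 z, uses closed-form least squares expressions for two explanatory variables, and compares the sum of squared residuals at the fitted coefficients with the sum at slightly altered coefficients. The data and names are invented for illustration.

```python
# Made-up noisy data for a response y and two explanatory variables x and z.
xs = [1, 2, 3, 4, 5]
zs = [1, 3, 2, 4, 5]
ys = [2.3, 6.6, 3.4, 8.1, 9.6]

def fit_plane(xs, zs, ys):
    """Least squares coefficients (b0, b1, b2) for y = b0 + b1*x + b2*z."""
    n = len(ys)
    mx, mz, my = sum(xs) / n, sum(zs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    szz = sum((z - mz) ** 2 for z in zs)
    sxz = sum((x - mx) * (z - mz) for x, z in zip(xs, zs))
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    szy = sum((z - mz) * (y - my) for z, y in zip(zs, ys))
    det = sxx * szz - sxz ** 2
    b1 = (szz * sxy - sxz * szy) / det
    b2 = (sxx * szy - sxz * sxy) / det
    return my - b1 * mx - b2 * mz, b1, b2

def residuals(coefs):
    """Vertical distances from each data point to the plane."""
    b0, b1, b2 = coefs
    return [y - (b0 + b1 * x + b2 * z) for x, z, y in zip(xs, zs, ys)]

def sum_sq_residuals(coefs):
    return sum(e ** 2 for e in residuals(coefs))

best = fit_plane(xs, zs, ys)
best_ssr = sum_sq_residuals(best)

# Nudge each coefficient in turn; every nudge should increase the sum of squares.
perturbed_ssr = []
for i in range(3):
    nudged = list(best)
    nudged[i] += 0.1
    perturbed_ssr.append(sum_sq_residuals(nudged))

print(f"sum of squared residuals at the fit: {best_ssr:.4f}")
print("after nudging each coefficient:", [round(s, 4) for s in perturbed_ssr])
```

Any other choice of coefficients gives a larger sum of squared residuals than the least squares fit, which is what makes the method objective.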

6.4   Marginal relationships

6.4.1   Misleading marginal correlation

Variables Y and X may be positively correlated overall, but have zero or even negative correlation at each value of a categorical variable, Z. The variable Z is called a lurking (or hidden) variable.
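A small made-up data set shows how this can happen. In the sketch below, the values of X and Y and the two levels of Z ("low" and "high") are hypothetical: Y falls as X rises within each level of Z, but both X and Y are larger in the "high" group.

```python
def correlation(us, vs):
    """Correlation coefficient between two equal-length lists."""
    n = len(us)
    mu, mv = sum(us) / n, sum(vs) / n
    suv = sum((u - mu) * (v - mv) for u, v in zip(us, vs))
    suu = sum((u - mu) ** 2 for u in us)
    svv = sum((v - mv) ** 2 for v in vs)
    return suv / (suu * svv) ** 0.5

# Made-up data: level of the lurking variable Z -> list of (x, y) pairs.
by_z = {
    "low": [(1, 5), (2, 4), (3, 3)],
    "high": [(6, 10), (7, 9), (8, 8)],
}

# Marginal correlation: ignore Z and pool all the points.
all_points = [p for pts in by_z.values() for p in pts]
overall = correlation([x for x, _ in all_points], [y for _, y in all_points])

# Conditional correlations: one for each value of Z.
within = {
    z: correlation([x for x, _ in pts], [y for _, y in pts])
    for z, pts in by_z.items()
}

print(f"overall correlation: {overall:.2f}")
for z, r in within.items():
    print(f"within Z = {z}: {r:.2f}")
```

The overall correlation is positive only because the "high" group sits above and to the right of the "low" group; within each group the correlation is negative.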

6.4.2   Misleading marginal difference in means

A lurking variable can also distort the difference between the means of Y in two groups (i.e. for two values of a categorical variable, X).

6.4.3   Simpson's paradox

If X, Y and Z are all categorical, the marginal relationship between X and Y can be in the opposite direction to their conditional relationships at each value of Z. This reversal is called Simpson's paradox.
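Simpson's paradox can be reproduced with a table of hypothetical counts. In this sketch X is a treatment, Y is success or failure, and Z is severity; the category names and all counts are invented for illustration.

```python
# Hypothetical counts: (successes, total) for each treatment within each level of Z.
counts = {
    "mild": {"treatment 1": (18, 20), "treatment 2": (64, 80)},
    "severe": {"treatment 1": (24, 80), "treatment 2": (4, 20)},
}
treatments = ["treatment 1", "treatment 2"]

# Conditional success rates: one rate per treatment at each value of Z.
conditional = {
    z: {t: s / n for t, (s, n) in row.items()}
    for z, row in counts.items()
}

# Marginal success rates: add the counts over Z before dividing.
marginal = {}
for t in treatments:
    successes = sum(counts[z][t][0] for z in counts)
    total = sum(counts[z][t][1] for z in counts)
    marginal[t] = successes / total

for z, rates in conditional.items():
    print(z, {t: round(r, 2) for t, r in rates.items()})
print("overall", {t: round(r, 2) for t, r in marginal.items()})
```

Treatment 1 has the higher success rate within both severity levels, yet treatment 2 has the higher rate overall, because treatment 1 was mostly given to severe cases in these invented counts.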

6.4.4   Other examples with lurking variables

Further examples show how a hidden variable, Z, can lead to a misleading conclusion from the marginal relationship. A full analysis using Z is always more complex but is essential for understanding the relationship.