Long page descriptions

Chapter 3   Two Numerical Variables

3.1   Scatterplots

3.1.1   Bivariate data sets

Many data sets contain two or more measurements from each individual. Even when the main interest is in one variable, the others can help to understand its distribution.

3.1.2   Scatterplots

The main display that shows the relationship between two variables is a scatterplot.

3.1.3   Limitations of univariate displays

Univariate displays don't show relationships between variables.

3.1.4   Marginal distributions

A scatterplot of two variables can be enhanced with box plots or histograms of the individual variables drawn on its margins.

3.1.5   Time series

When a single measurement is made at regular intervals, the data are called a time series. Time series data can be treated as bivariate, with time being the second variable.
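As a minimal illustration in Python (the sales figures are invented), a time series can be converted into (time, value) pairs so that any bivariate display or summary applies:

```python
# A hypothetical series of monthly sales, measured at regular intervals.
sales = [12.0, 15.0, 14.0, 18.0, 21.0, 20.0]

# Treat the series as bivariate: pair each value with its time index.
pairs = list(enumerate(sales, start=1))
print(pairs[:2])  # [(1, 12.0), (2, 15.0)]
```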

3.2   Understanding relationships

3.2.1   Strength of a relationship

The main feature of interest in a scatterplot is the strength of the relationship between the two variables.

3.2.2   Outliers

An extreme value of one or both of the variables is an outlier; so is an unusual combination of values, even when neither value is extreme on its own.

3.2.3   Clusters

If the crosses on a scatterplot separate into clusters, this suggests that the individuals fall into distinct groups.

3.2.4   Dangers of over-interpretation

In small data sets, there may be considerable variability, so patterns should be strongly evident before they are reported.

3.2.5   Explanatory and response variables

One variable can often be classified as an explanatory variable that either causally affects the response variable, or is useful for predicting its value.

3.3   Correlation

3.3.1   Units for X and Y

A numerical description of the strength of a relationship should not be affected by rescaling the variables.

3.3.2   Units-free variables (z-scores)

Standardising a variable gives z-scores that do not depend on the units of the original variable. (The correlation coefficient will be defined in terms of z-scores for X and Y.)
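A minimal sketch in Python (the heights are made-up values, and `z_scores` is an illustrative helper, not something defined in the text) showing that z-scores are unchanged by a change of units:

```python
from statistics import mean, stdev

def z_scores(values):
    """Standardise: subtract the mean, divide by the standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

# Hypothetical heights in centimetres; re-expressing them in inches
# rescales the raw data but leaves the z-scores identical (units-free).
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
heights_in = [v / 2.54 for v in heights_cm]
print(z_scores(heights_cm))
print(z_scores(heights_in))
```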

3.3.3   Correlation coefficient

The correlation coefficient, r, summarises the strength of the relationship between X and Y. It is +1 when the scatterplot crosses lie exactly on a straight line with positive slope, -1 when they lie on a line with negative slope, and zero when there is no linear relationship between X and Y.
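A sketch of the z-score definition in Python (the data are invented; `correlation` is an illustrative helper, using the sample standard deviation with divisor n - 1):

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """r is the sum of the products of the z-scores, divided by n - 1."""
    mx, my = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    z_products = ((x - mx) / sx * ((y - my) / sy) for x, y in zip(xs, ys))
    return sum(z_products) / (len(xs) - 1)

x = [1.0, 2.0, 3.0, 4.0, 5.0]
print(correlation(x, [2 * v + 1 for v in x]))    # exactly linear, positive slope: r is 1 (up to rounding)
print(correlation(x, [-3 * v + 10 for v in x]))  # exactly linear, negative slope: r is -1 (up to rounding)
```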

3.3.4   Scatterplots and the value of r

You should be able to estimate the value of r by looking at a scatterplot, and to imagine a scatter of crosses corresponding to any given value of r.

3.3.5   Nonlinear relationships

The correlation coefficient is only a good measure of the strength of a relationship if the points in a scatterplot are scattered round a straight line, not a curve.
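A classic illustration of this in Python (with made-up, symmetric data): a perfect but curved relationship can have a correlation of exactly zero.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Sample correlation coefficient (divisor n - 1)."""
    mx, my, sx, sy = mean(xs), mean(ys), stdev(xs), stdev(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / ((len(xs) - 1) * sx * sy)

x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [xi ** 2 for xi in x]  # a perfect, but curved, relationship
print(correlation(x, y))   # 0.0 -- r completely misses the pattern
```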

3.3.6   r does not tell the whole story

The correlation coefficient cannot identify curvature, outliers or clusters and can be misleading if these features are present. A scatterplot must always be examined too.

3.4   Least squares

3.4.1   Predicting Y from X

A line or curve is useful for predicting the value of Y from a known value of X.

3.4.2   Linear models

A straight line can often be used to predict one variable from another.

3.4.3   Fitted values and residuals

The difference between the actual value of Y and the value predicted by a line is called a residual. Small residuals are clearly desirable.

3.4.4   Least squares

The sum of squared residuals describes the accuracy of predictions from a line. The method of least squares positions the line to minimise the sum of squared residuals.
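A rough sketch of the method in Python (the data are invented, and `least_squares` is an illustrative helper using the standard textbook formulas for the slope and intercept):

```python
from statistics import mean

def least_squares(xs, ys):
    """Fit y = a + b*x by minimising the sum of squared residuals."""
    mx, my = mean(xs), mean(ys)
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]           # made-up, roughly linear data
a, b = least_squares(x, y)
fitted = [a + b * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]
rss = sum(e ** 2 for e in residuals)     # the quantity least squares minimises
print(f"y-hat = {a:.2f} + {b:.2f}x, RSS = {rss:.3f}")
```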

3.4.5   Curvature and outliers

A linear model is not appropriate if a scatterplot of the data shows curvature or outliers. Outliers should be carefully examined.

3.4.6   Residual plots

Outliers and curvature in the relationship are often displayed more clearly in a plot of residuals.

3.4.7   Predicting Y and predicting X (advanced)

Least squares does not treat Y and X symmetrically. The best line for predicting Y from X is different from the best line for predicting X from Y.
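The asymmetry can be demonstrated in Python (the data are invented and `slope` is an illustrative helper): if least squares were symmetric, the slope for predicting Y from X would be the reciprocal of the slope for predicting X from Y, but it is not unless r is exactly +1 or -1.

```python
from statistics import mean

def slope(xs, ys):
    """Least-squares slope for predicting ys from xs."""
    mx, my = mean(xs), mean(ys)
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

# Made-up data with a moderate (not perfect) relationship.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 5.0, 4.0, 7.0, 6.0]

b_yx = slope(x, y)   # best line for predicting Y from X
b_xy = slope(y, x)   # best line for predicting X from Y
# The two fitted lines differ: b_yx is not equal to 1 / b_xy.
print(b_yx, 1 / b_xy)
```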

3.5   Nonlinear relationships

3.5.1   Transformations and correlation

The correlation coefficient does not adequately describe the strength of a nonlinear relationship. Transforming the variables to linearise the relationship helps.

3.5.2   Transformations and models

If a relationship is nonlinear, a linear model can often be fitted to transformed response or explanatory variables.
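A sketch in Python (the exponential data are constructed for illustration, and `least_squares` is an illustrative helper): taking logarithms of the response linearises an exponential relationship, after which an ordinary least squares line can be fitted.

```python
from math import exp, log
from statistics import mean

def least_squares(xs, ys):
    """Least-squares intercept and slope for predicting ys from xs."""
    mx, my = mean(xs), mean(ys)
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b

# Hypothetical exponential growth: y = 2 * e^(0.5x), so log(y) is linear in x.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [2 * exp(0.5 * xi) for xi in x]

a, b = least_squares(x, [log(yi) for yi in y])
# Back-transforming gives the model y-hat = exp(a) * exp(b * x).
print(exp(a), b)  # recovers approximately 2 and 0.5
```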

3.5.3   Quadratic models

An alternative solution to nonlinearity is to fit a quadratic curve to the data, again using the principle of least squares.
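One way to sketch a quadratic least-squares fit in Python without any libraries (the data and coefficients are invented; in practice a statistics package would be used) is to solve the normal equations for the basis (1, x, x²):

```python
def fit_quadratic(xs, ys):
    """Least-squares y = c0 + c1*x + c2*x**2 via the normal equations."""
    rows = [[1.0, x, x * x] for x in xs]
    # Normal equations: (X^T X) c = X^T y.
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(3)]
    # Gauss-Jordan elimination with partial pivoting on the 3x3 system.
    a = [xtx[i] + [xty[i]] for i in range(3)]
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(a[r][col]))
        a[col], a[piv] = a[piv], a[col]
        for r in range(3):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [v - f * w for v, w in zip(a[r], a[col])]
    return [a[i][3] / a[i][i] for i in range(3)]

# Exactly quadratic data: y = 1 - 2x + 3x^2, which the fit should recover.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1 - 2 * xi + 3 * xi * xi for xi in x]
print(fit_quadratic(x, y))
```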

3.5.4   Dangers of extrapolation

Since the form of a relationship is unknown beyond the range of x-values in the data, it is always dangerous to extrapolate.