Long page descriptions

Chapter 1   Simple Linear Regression

1.1   Correlation

1.1.1   Strength of a relationship

The main feature of interest in a scatterplot is the strength of the relationship between the two variables.

1.1.2   Units for X and Y

A numerical description of the strength of a relationship should not be affected by rescaling the variables.

1.1.3   Units-free variables (z-scores)

Standardising a variable gives z-scores that do not depend on the units of the original variable. (The correlation coefficient will be defined in terms of z-scores for X and Y.)
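
As a minimal sketch (not part of the original page), the calculation can be written in a few lines of Python; the data values below are made up purely for illustration.

    import numpy as np

    x = np.array([1.2, 3.4, 2.8, 5.1, 4.0])   # hypothetical measurements

    # Standardise: subtract the mean, divide by the standard deviation.
    # The resulting z-scores are unchanged if x is linearly rescaled
    # (e.g. centimetres to inches).
    z = (x - x.mean()) / x.std(ddof=1)
    print(z)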

1.1.4   Correlation coefficient

The correlation coefficient summarises the strength of the relationship between X and Y. It is +1 when the crosses on the scatterplot lie exactly on a straight line with positive slope, -1 when they lie exactly on a line with negative slope, and zero when X and Y are unrelated.
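
A rough sketch of the definition, assuming r is computed as the 'average' product of the z-scores of X and Y (the values below are invented for illustration):

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])           # hypothetical data
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

    zx = (x - x.mean()) / x.std(ddof=1)
    zy = (y - y.mean()) / y.std(ddof=1)

    # r is the average product of the z-scores (dividing by n - 1)
    r = np.sum(zx * zy) / (len(x) - 1)
    print(r, np.corrcoef(x, y)[0, 1])                 # the two values agree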

1.1.5   Scatterplots and the value of r

You should be able to estimate the value of r by looking at a scatterplot, and to imagine a scatter of crosses corresponding to any value of r.

1.1.6   Nonlinear relationships

The correlation coefficient is only a good measure of the strength of a relationship if the points in a scatterplot are scattered round a straight line, not a curve.

1.1.7   r does not tell the whole story

The correlation coefficient cannot identify curvature, outliers or clusters and can be misleading if these features are present. A scatterplot must always be examined too.

1.1.8   Outliers

An observation with an extreme value of one or both variables is an outlier. An observation with an unusual combination of values is also called an outlier, even if neither value is extreme on its own.

1.1.9   Clusters

If the crosses on a scatterplot separate into clusters, different groups of individuals are suggested.

1.1.10   Dangers of over-interpretation

In small data sets, there may be considerable variability, so patterns should be strongly evident before they are reported.

1.2   Association & causal relationships

1.2.1   Interest in relationships

For most data sets, we are interested in understanding the relationships between the variables. However, interpreting relationships must be done with care.

1.2.2   Causal and non-causal relationships

If the relationship between X and Y is causal, it is possible to predict the effect of changing the value of X.

1.2.3   Detecting causal relationships

Causality can only be deduced from how the data were collected — the data values themselves do not contain any information about causality.

1.2.4   Observational and experimental data

In an observational study, values are passively recorded from individuals. Experiments are characterised by the experimenter's control over the values of one or more variables.

1.2.5   Data collection and causality

Causal relationships can only be deduced from well-designed experiments.

1.3   Least squares

1.3.1   Predicting Y from X

A line or curve is useful for predicting the value of Y from a known value of X.

1.3.2   Linear models

A straight line can often be used to predict one variable from another.

1.3.3   Fitted values and residuals

The difference between the actual value of Y and the value predicted by a line is called a residual. Small residuals are clearly desirable.

1.3.4   Least squares

The sum of squared residuals describes the accuracy of predictions from a line. The method of least squares positions the line to minimise the sum of squared residuals.
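
A minimal sketch of fitted values, residuals and the least squares line, using the standard formulas b1 = Sxy/Sxx and b0 = ybar - b1*xbar; the data are hypothetical.

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])           # hypothetical data
    y = np.array([2.3, 2.9, 4.2, 4.8, 6.1])

    # Least squares slope and intercept
    Sxx = np.sum((x - x.mean()) ** 2)
    Sxy = np.sum((x - x.mean()) * (y - y.mean()))
    b1 = Sxy / Sxx
    b0 = y.mean() - b1 * x.mean()

    fitted = b0 + b1 * x            # fitted values
    resid = y - fitted              # residuals
    ssr = np.sum(resid ** 2)        # sum of squared residuals (minimised by b0, b1)
    print(b0, b1, ssr)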

1.3.5   Curvature and outliers

A linear model is not appropriate if there is curvature or there are outliers in a scatterplot of the data. Outliers should be carefully examined.

1.3.6   Residual plots

Outliers and curvature in the relationship are often displayed more clearly in a plot of residuals.

1.3.7   Predicting Y and predicting X (advanced)

Least squares does not treat Y and X symmetrically. The best line for predicting Y from X is different from the best line for predicting X from Y.

1.4   Linear regression models

1.4.1   Interest in generalising from data

Some bivariate data sets describe complete populations. Others are 'representative' of an underlying population or process.

1.4.2   Distribution of Y for each X

Bivariate data can be modelled by specifying a response distribution for each possible X.

1.4.3   Normal linear model

The response is often modelled with a normal distribution whose mean is a linear function of X and whose standard deviation is constant.

1.4.4   Another way to describe the model

A normal linear model can be described in terms of 'errors'. In samples from the model, approximately 95% of errors are within 2 standard deviations of zero, so about 95% of the points in a scatterplot are within this distance of the regression line.
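
A small simulation sketch of this model, with arbitrary parameter values chosen only for illustration; it also checks the claim that roughly 95% of points lie within 2 standard deviations of the line.

    import numpy as np

    rng = np.random.default_rng(1)
    beta0, beta1, sigma = 5.0, 2.0, 1.5     # arbitrary parameter values

    x = np.linspace(0, 10, 200)
    # Response: normal with mean beta0 + beta1*x and constant sd sigma
    y = beta0 + beta1 * x + rng.normal(0.0, sigma, size=x.size)

    # Proportion of points within 2*sigma of the model line (about 0.95)
    within = np.mean(np.abs(y - (beta0 + beta1 * x)) < 2 * sigma)
    print(within)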

1.4.5   Model parameters

The normal linear model has three unknown parameters: the intercept, the slope and the error standard deviation. For many data sets, these parameters have meaningful interpretations.

1.5   Estimating parameters

1.5.1   Estimating the slope and intercept

A least squares line provides estimates of the linear model's slope and intercept. These estimates are random values — they vary from sample to sample.

1.5.2   Estimating the error standard devn

The third parameter of the normal linear model is the error standard deviation. It can be estimated using the residuals from the least squares line.
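
A minimal sketch of the estimate, dividing the sum of squared residuals by n - 2; the data are made up for illustration.

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])       # hypothetical data
    y = np.array([3.1, 4.4, 4.9, 6.8, 7.2, 8.9])
    n = len(x)

    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    resid = y - (b0 + b1 * x)

    # Error standard deviation estimate: divide by n - 2 because two
    # parameters (slope and intercept) were estimated from the data
    s = np.sqrt(np.sum(resid ** 2) / (n - 2))
    print(b0, b1, s)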

1.5.3   Distn of least squares estimates

The least squares estimate of the model's slope has a normal distribution that is centred on the true value.

1.5.4   Standard error of least squares slope

The standard deviation of the distribution of the least squares slope (its standard error) may be estimated from a single data set.

1.5.5   95% confidence interval for slope

A confidence interval for the model's slope can be obtained from its least squares estimate and its standard error.
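
A sketch of the usual interval, se(b1) = s / sqrt(Sxx) and b1 plus or minus a t multiplier times se(b1) with n - 2 degrees of freedom; the data are hypothetical.

    import numpy as np
    from scipy import stats

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])       # hypothetical data
    y = np.array([3.1, 4.4, 4.9, 6.8, 7.2, 8.9])
    n = len(x)

    Sxx = np.sum((x - x.mean()) ** 2)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
    b0 = y.mean() - b1 * x.mean()
    s = np.sqrt(np.sum((y - b0 - b1 * x) ** 2) / (n - 2))

    se_b1 = s / np.sqrt(Sxx)                 # standard error of the slope
    t_crit = stats.t.ppf(0.975, df=n - 2)    # t multiplier for 95% confidence
    ci = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)
    print(se_b1, ci)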

1.5.6   Properties of confidence interval

Confidence intervals for the model's slope have the same properties as confidence intervals for population means or proportions.

1.5.7   Influences on accuracy (advanced)

The standard error of the least squares slope depends on the response standard deviation round the model line, the sample size and the standard deviation of X. Collecting data with a big spread of x-values gives more accurate estimates but there are disadvantages.

1.6   Testing regression parameters

1.6.1   Importance of zero slope

If the model's slope is zero, the response distribution does not depend on the explanatory variable. This special case is particularly meaningful in many studies.

1.6.2   Testing whether slope is zero

The p-value for testing whether a linear model's slope is zero is the probability that its least squares estimate would be at least as far from zero as the value actually recorded, if the slope really were zero.
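
A minimal sketch of the test, using the t-statistic b1 / se(b1) and a two-tailed p-value from the t distribution with n - 2 degrees of freedom; the data are invented.

    import numpy as np
    from scipy import stats

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])       # hypothetical data
    y = np.array([3.1, 4.4, 4.9, 6.8, 7.2, 8.9])
    n = len(x)

    Sxx = np.sum((x - x.mean()) ** 2)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
    b0 = y.mean() - b1 * x.mean()
    s = np.sqrt(np.sum((y - b0 - b1 * x) ** 2) / (n - 2))

    t_stat = b1 / (s / np.sqrt(Sxx))
    # Two-tailed p-value: probability of an estimate at least this far from zero
    p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)
    print(t_stat, p_value)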

1.6.3   Strength of evidence and relationship

It is important to distinguish the strength of a relationship (summarised by the correlation coefficient) and the strength of evidence for existence of a relationship (summarised by the p-value).

1.6.4   Properties of p-values (advanced)

As with other tests, all p-values between 0 and 1 are equally likely if the null hypothesis holds (model slope is zero), but p-values near 0 are more likely if the alternative hypothesis holds (model slope is non-zero).

1.7   Predicting the response

1.7.1   Estimated response distn at X

From estimates of the 3 linear model parameters, we can obtain an estimated response distribution at any x-value.

1.7.2   Variability of estimate at X

The predicted response at any X varies from sample to sample. The prediction is more variable at x-values far from the mean of the 'training' data.

1.7.3   Estimating the mean vs prediction

A distinction is made between estimating the mean response at X and predicting a new individual's response at X. Errors are larger (on average) when predicting a new individual's response.

1.7.4   Confidence and prediction intervals

A 95% confidence interval is used to estimate the mean response at X. A 95% prediction interval is similar, but gives a range of likely values for a new response value. The prediction interval is wider than the confidence interval.
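
A sketch of the two intervals at a single x-value x0; they differ only in an extra "+1" under the square root, which makes the prediction interval wider. The data and x0 are invented for illustration.

    import numpy as np
    from scipy import stats

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])       # hypothetical data
    y = np.array([3.1, 4.4, 4.9, 6.8, 7.2, 8.9])
    n = len(x)
    x0 = 4.5                                            # x-value of interest

    Sxx = np.sum((x - x.mean()) ** 2)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
    b0 = y.mean() - b1 * x.mean()
    s = np.sqrt(np.sum((y - b0 - b1 * x) ** 2) / (n - 2))
    y0 = b0 + b1 * x0                                   # estimated mean response at x0

    t_crit = stats.t.ppf(0.975, df=n - 2)
    se_mean = s * np.sqrt(1 / n + (x0 - x.mean()) ** 2 / Sxx)       # for the mean
    se_pred = s * np.sqrt(1 + 1 / n + (x0 - x.mean()) ** 2 / Sxx)   # for a new value

    print("confidence interval:", y0 - t_crit * se_mean, y0 + t_crit * se_mean)
    print("prediction interval:", y0 - t_crit * se_pred, y0 + t_crit * se_pred)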

1.8   Linear model assumptions

1.8.1   Assumptions in a normal linear model

The normal linear model involves assumptions of linearity, constant variance, normal error distribution and independence of different observations. Residuals can be examined to assess whether these assumptions are appropriate for a particular data set.

1.8.2   Curvature — transforming X

If the relationship between Y and X is nonlinear, a transformation of X may linearise the relationship.

1.8.3   Curvature and non-constant variance

Transforming the response may remove curvature in the relationship, but also affects whether the error standard deviation is constant. Fortunately, the same transformation of Y often removes curvature and non-constant standard deviation.

1.8.4   Transformations and prediction

If a normal linear model describes the relationship between a transformation of the response and a transformation of the explanatory variable, predictions can be made by fitting the linear model to the transformed data, then performing the inverse transformation on the prediction.
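
A small sketch of this idea, assuming (purely for illustration) that a log transformation of the response straightens the relationship; the prediction is back-transformed with the exponential function.

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])             # hypothetical data
    y = np.array([2.5, 4.1, 7.2, 11.8, 20.3])            # curved, increasing spread

    # Fit a straight line to log(y) against x ...
    ly = np.log(y)
    b1 = np.sum((x - x.mean()) * (ly - ly.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = ly.mean() - b1 * x.mean()

    # ... then back-transform the prediction to the original scale
    x0 = 3.5
    pred = np.exp(b0 + b1 * x0)
    print(pred)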

1.8.5   Non-normal errors

The errors in a normal linear model are assumed to have normal distributions. Violation of this assumption is less important than nonlinearity, non-constant variance or outliers, but a probability plot of the residuals can be used to assess normality.

1.8.6   Correlated errors

The errors in a normal linear model are assumed to be independent. In data where the observations are recorded sequentially, successive errors are sometimes found to be correlated. Correlated errors can arise whatever the x-variable, but are most often seen when the x-variable is time itself.

1.9   Leverage, outliers and influence

1.9.1   Leverage

The most effective x-value at which to take a new response observation is one where predictions are most variable. The variance of predictions at x, divided by sigma-squared, is called the leverage at x.
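
In simple linear regression the leverage of the i-th point works out to 1/n + (x_i - xbar)^2 / Sxx, so points far from the mean x have high leverage. A minimal sketch with invented x-values:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 10.0])   # last x-value is far from the rest
    n = len(x)

    Sxx = np.sum((x - x.mean()) ** 2)
    # Leverage of each point in simple linear regression
    h = 1 / n + (x - x.mean()) ** 2 / Sxx
    print(h)          # the extreme x-value has by far the largest leverage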

1.9.2   Outliers and leverage

If an outlier is also a high-leverage point, it can badly 'pull' the least squares line and the resulting residual often does not indicate that it is an outlier.

1.9.3   Variances of the residuals

Even when all data points come from a normal linear model, the residuals do not all have the same standard deviation.

1.9.4   Standardised residuals

Dividing the residuals by an estimate of their standard deviation gives values that can be compared to ±2 and ±3 to look for outliers.
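
A sketch of standardised residuals, using the fact that the i-th residual has standard deviation s*sqrt(1 - h_i) rather than s; the data are invented, with one response made deliberately high.

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])       # hypothetical data
    y = np.array([3.1, 4.4, 4.9, 9.0, 7.2, 8.9])        # 4th value looks high
    n = len(x)

    Sxx = np.sum((x - x.mean()) ** 2)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
    b0 = y.mean() - b1 * x.mean()
    e = y - (b0 + b1 * x)
    s = np.sqrt(np.sum(e ** 2) / (n - 2))
    h = 1 / n + (x - x.mean()) ** 2 / Sxx

    # Each residual has standard deviation s*sqrt(1 - h), not s
    std_resid = e / (s * np.sqrt(1 - h))
    print(std_resid)   # compare against +/-2 and +/-3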

1.9.5   Deleted residuals

Standardised residuals still do not show up outliers that are high-leverage points. Deleted residuals are based on the difference between the response and the prediction from the data with that observation omitted.

1.9.6   Externally studentised residuals

Rather than standardising each residual by dividing by a standard deviation based on the mean squared residual for the whole data set, it is better to standardise using the mean squared residual from the data set with that observation omitted.
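
A sketch of externally studentised residuals, using the standard identity (n - 3) * s2_del = (n - 2) * s2 - e^2/(1 - h) for simple regression to avoid refitting the model n times; data as in the previous sketch.

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])       # hypothetical data
    y = np.array([3.1, 4.4, 4.9, 9.0, 7.2, 8.9])
    n = len(x)

    Sxx = np.sum((x - x.mean()) ** 2)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
    b0 = y.mean() - b1 * x.mean()
    e = y - (b0 + b1 * x)
    s2 = np.sum(e ** 2) / (n - 2)
    h = 1 / n + (x - x.mean()) ** 2 / Sxx

    # Error variance re-estimated with each observation omitted in turn
    s2_del = ((n - 2) * s2 - e ** 2 / (1 - h)) / (n - 3)
    ext_studentised = e / np.sqrt(s2_del * (1 - h))
    print(ext_studentised)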

1.9.7   Influence on fitted values

Leverage describes the potential of each point to influence the results. DFITS describes its actual influence on the fitted values.

1.9.8   Influence on regression coefficients

An alternative measure of influence describes the influence of each point on the least squares coefficients.
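
The page does not name the measures here, but DFITS and Cook's distance are the standard choices for influence on the fitted values and on the coefficients respectively. A rough sketch, built from the studentised residuals and leverages of the hypothetical data used above:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])       # hypothetical data
    y = np.array([3.1, 4.4, 4.9, 9.0, 7.2, 8.9])
    n, p = len(x), 2                                     # p = number of coefficients

    Sxx = np.sum((x - x.mean()) ** 2)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
    b0 = y.mean() - b1 * x.mean()
    e = y - (b0 + b1 * x)
    s2 = np.sum(e ** 2) / (n - p)
    h = 1 / n + (x - x.mean()) ** 2 / Sxx

    int_stud = e / np.sqrt(s2 * (1 - h))                          # internally studentised
    s2_del = ((n - p) * s2 - e ** 2 / (1 - h)) / (n - p - 1)
    ext_stud = e / np.sqrt(s2_del * (1 - h))                      # externally studentised

    dfits = ext_stud * np.sqrt(h / (1 - h))            # influence on the fitted value
    cooks_d = int_stud ** 2 * h / (p * (1 - h))        # influence on the coefficients
    print(dfits, cooks_d)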

1.9.9   Summary and examples

This page summarises the various residual and influence measures and gives a few examples in which residuals, leverage and influence are interpreted.