The main feature of interest in a scatterplot is the strength of the relationship between the two variables.
A numerical description of the strength of a relationship should not be affected by rescaling the variables.
Standardising a variable gives z-scores that do not depend on the units of the original variable. (The correlation coefficient will be defined in terms of z-scores for X and Y.)
The correlation coefficient summarises the strength of the relationship between X and Y. It is +1 when the crosses in a scatterplot lie exactly on a straight line with positive slope, -1 when they lie exactly on a line with negative slope, and close to zero when X and Y are unrelated.
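As a concrete illustration of the z-score definition mentioned above, the following sketch (plain NumPy, with hypothetical data arrays x and y) computes r as the average product of z-scores and checks it against NumPy's built-in calculation.

```python
import numpy as np

# Hypothetical paired measurements of X and Y
x = np.array([1.2, 2.3, 3.1, 4.8, 5.0, 6.7, 7.4])
y = np.array([2.0, 2.9, 3.8, 5.1, 4.9, 6.8, 7.9])

# Standardise each variable (ddof=1 gives the sample standard deviation)
zx = (x - x.mean()) / x.std(ddof=1)
zy = (y - y.mean()) / y.std(ddof=1)

# Correlation coefficient: average product of the z-scores
r = np.sum(zx * zy) / (len(x) - 1)

print(round(r, 4))
print(round(np.corrcoef(x, y)[0, 1], 4))  # agrees with NumPy's corrcoef
```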
You should be able to estimate the value of r by looking at a scatterplot, and to imagine a scatter of crosses corresponding to any value of r.
The correlation coefficient is only a good measure of the strength of a relationship if the points in a scatterplot are scattered round a straight line, not a curve.
The correlation coefficient cannot identify curvature, outliers or clusters and can be misleading if these features are present. A scatterplot must always be examined too.
A point with an extreme value of one or both variables is an outlier. A point with an unusual combination of values is also called an outlier, even if neither value is extreme on its own.
If the crosses on a scatterplot separate into clusters, different groups of individuals are suggested.
In small data sets, there may be considerable variability, so patterns should be strongly evident before they are reported.
For most data sets, we are interested in understanding the relationships between the variables. However, interpreting relationships must be done with care.
If the relationship between X and Y is causal, it is possible to predict the effect of changing the value of X.
Causality can only be deduced from how the data were collected — the data values themselves do not contain any information about causality.
In an observational study, values are passively recorded from individuals. Experiments are characterised by the experimenter's control over the values of one or more variables.
Causal relationships can only be deduced from well-designed experiments.
A line or curve is useful for predicting the value of Y from a known value of X.
A straight line can often be used to predict one variable from another.
The difference between the actual value of Y and the value predicted by a line is called a residual. Small residuals are clearly desirable.
The sum of squared residuals describes the accuracy of predictions from a line. The method of least squares positions the line to minimise the sum of squared residuals.
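A minimal sketch of the least squares idea, assuming hypothetical arrays x and y: np.polyfit finds the slope and intercept that minimise the sum of squared residuals, and the residuals can then be computed directly.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])

# Least squares: choose slope b1 and intercept b0 to minimise sum(residual^2)
b1, b0 = np.polyfit(x, y, deg=1)

fitted = b0 + b1 * x        # value of Y predicted by the line at each X
residuals = y - fitted      # actual value minus predicted value
rss = np.sum(residuals**2)  # the quantity that least squares minimises

print(f"slope = {b1:.3f}, intercept = {b0:.3f}, RSS = {rss:.3f}")
```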
A linear model is not appropriate if a scatterplot of the data shows curvature or outliers. Outliers should be carefully examined.
Outliers and curvature in the relationship are often displayed more clearly in a plot of residuals.
Least squares does not treat Y and X symmetrically. The best line for predicting Y from X is different from the best line for predicting X from Y.
Some bivariate data sets describe complete populations. Others are 'representative' of an underlying population or process.
Bivariate data can be modelled by specifying a response distribution for each possible X.
The response is often modelled with a normal distribution whose mean is a linear function of X and whose standard deviation is constant.
A normal linear model can be described in terms of 'errors'. In samples from the model, approximately 95% of errors are within 2 standard deviations of zero, so about 95% of the points in a scatterplot are within this distance of the regression line.
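In the usual notation (a sketch; the symbols vary between texts), the model and its 'errors' can be written as:

```latex
y_i = \beta_0 + \beta_1 x_i + \varepsilon_i,
\qquad \varepsilon_i \sim \text{normal}(0, \sigma^2) \ \text{independently}
```

The three unknown parameters referred to below are the intercept β0, the slope β1 and the error standard deviation σ; since the errors are normally distributed, about 95% of them lie between -2σ and +2σ.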
The normal linear model has 3 unknown parameters. For many data sets, these parameters have meaningful interpretations.
A least squares line provides estimates of the linear model's slope and intercept. These estimates are random values — they vary from sample to sample.
The third parameter of the normal linear model is the error standard deviation. It can be estimated using the residuals from the least squares line.
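As a sketch (using scipy.stats.linregress and hypothetical data), the slope and intercept estimates come straight from least squares, and the error standard deviation is estimated from the sum of squared residuals divided by n - 2:

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.3, 3.1, 4.8, 5.2, 6.9, 7.1, 8.8, 9.4])

fit = stats.linregress(x, y)                    # least squares slope and intercept
residuals = y - (fit.intercept + fit.slope * x)

n = len(x)
s = np.sqrt(np.sum(residuals**2) / (n - 2))     # estimate of the error standard deviation

print(f"b0 = {fit.intercept:.3f}, b1 = {fit.slope:.3f}, s = {s:.3f}")
```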
The least squares estimate of the model's slope has a normal distribution that is centred on the true value.
The distribution of the least squares slope may be estimated from a single data set.
A confidence interval for the model's slope can be obtained from its least squares estimate and its standard error.
Confidence intervals for the model's slope have the same properties as confidence intervals for population means or proportions.
The standard error of the least squares slope depends on the response standard deviation round the model line, the sample size and the standard deviation of X. Collecting data with a wide spread of x-values gives a more accurate estimate of the slope, but there can be disadvantages.
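In the usual notation (a sketch; s is the estimated error standard deviation and s_X the standard deviation of the x-values), the standard error of the least squares slope and the resulting confidence interval are:

```latex
\operatorname{se}(b_1) \;=\; \frac{s}{\sqrt{\sum_i (x_i - \bar{x})^2}}
                      \;=\; \frac{s}{s_X \sqrt{n-1}},
\qquad
b_1 \;\pm\; t^{*}_{n-2}\,\operatorname{se}(b_1)
```

This makes the dependence on the response standard deviation, the sample size and the spread of the x-values explicit.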
If the model's slope is zero, the response distribution does not depend on the explanatory variable. This special case is particularly meaningful in many studies.
The p-value for testing whether a linear model's slope is zero is the probability that its least squares estimate is at least as far from zero as the recorded value, if the model's slope really is zero.
It is important to distinguish the strength of a relationship (summarised by the correlation coefficient) and the strength of evidence for existence of a relationship (summarised by the p-value).
As with other tests, all p-values between 0 and 1 are equally likely if the null hypothesis holds (model slope is zero), but p-values near 0 are more likely if the alternative hypothesis holds (model slope is non-zero).
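scipy.stats.linregress also reports the two-sided p-value for the test of zero slope; a minimal sketch with hypothetical simulated data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = np.arange(20, dtype=float)
y = 1.5 + 0.0 * x + rng.normal(0, 2, size=x.size)  # true slope is zero in this simulation

fit = stats.linregress(x, y)
# The p-value reported is for the null hypothesis that the slope is zero
print(f"b1 = {fit.slope:.3f}, se = {fit.stderr:.3f}, p = {fit.pvalue:.3f}")
```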
From estimates of the 3 linear model parameters, we can obtain an estimated response distribution at any x-value.
The predicted response at any X varies from sample to sample. The prediction is more variable at x-values far from the mean of the 'training' data.
A distinction is made between estimating the mean response at X and predicting a new individual's response at X. Errors are larger (on average) when predicting a new individual's response.
A 95% confidence interval is used to estimate the mean response at X. A 95% prediction interval is similar, but gives a range of likely values for a new response value. The prediction interval is wider than the confidence interval.
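A sketch with statsmodels (hypothetical data), showing that at each x-value the 95% prediction interval is wider than the 95% confidence interval for the mean response:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = np.linspace(0, 10, 30)
y = 3 + 0.8 * x + rng.normal(0, 1.5, size=x.size)

X = sm.add_constant(x)                  # design matrix with an intercept column
model = sm.OLS(y, X).fit()

new_x = sm.add_constant(np.array([2.0, 5.0, 9.5]), has_constant='add')
pred = model.get_prediction(new_x).summary_frame(alpha=0.05)

# mean_ci_* : 95% confidence interval for the mean response
# obs_ci_*  : 95% prediction interval for a new individual response
print(pred[['mean', 'mean_ci_lower', 'mean_ci_upper',
            'obs_ci_lower', 'obs_ci_upper']])
```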
The normal linear model involves assumptions of linearity, constant variance, normal error distribution and independence of different observations. Residuals can be examined to assess whether these assumptions are appropriate for a particular data set.
If the relationship between Y and X is nonlinear, a transformation of X may linearise the relationship.
Transforming the response may remove curvature in the relationship, but also affects whether the error standard deviation is constant. Fortunately, the same transformation of Y often removes curvature and non-constant standard deviation.
If a normal linear model describes the relationship between a transformation of the response and a transformation of the explanatory variable, predictions can be made by fitting the linear model to the transformed data, then performing the inverse transformation on the prediction.
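For example (a sketch with a hypothetical exponential-growth data set), a log transformation of the response straightens the relationship; predictions are made on the log scale and then exponentiated back:

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(1, 10, 25)
y = 2.0 * np.exp(0.3 * x) * rng.lognormal(0, 0.1, size=x.size)  # curved relationship

# Fit a straight line to log(y) against x
b1, b0 = np.polyfit(x, np.log(y), deg=1)

# Predict at a new x on the log scale, then apply the inverse transformation
x_new = 7.5
log_pred = b0 + b1 * x_new
print(f"predicted y at x = {x_new}: {np.exp(log_pred):.2f}")
```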
The errors in a normal linear model are assumed to have normal distributions. Violation of this assumption is less important than nonlinearity, non-constant variance or outliers, but a probability plot of the residuals can be used to assess normality.
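A probability plot of the residuals can be drawn with scipy.stats.probplot (a sketch, with hypothetical data and the residuals computed as before):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(4)
x = np.linspace(0, 10, 40)
y = 1 + 0.5 * x + rng.normal(0, 1, size=x.size)

b1, b0 = np.polyfit(x, y, deg=1)
residuals = y - (b0 + b1 * x)

# Points lying close to a straight line suggest the normality assumption is reasonable
stats.probplot(residuals, dist="norm", plot=plt)
plt.show()
```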
The errors in a normal linear model are assumed to be independent. In data where the observations are recorded sequentially, successive errors are sometimes found to be correlated. Correlated errors can arise whatever the x-variable, but are most often seen when the x-variable is time itself.
The most effective x-value at which to take a new response observation is one where predictions are most variable. The variance of predictions at x, divided by σ², is called the leverage at x.
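For a simple linear model, the leverage of the i-th observation has the standard form (a sketch in the usual notation):

```latex
h_i \;=\; \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_j (x_j - \bar{x})^2}
```

Leverage is smallest at the mean of the x-values and increases as x_i moves further from the mean, matching the earlier description of how prediction variability depends on x.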
If an outlier is also a high-leverage point, it can badly 'pull' the least squares line and the resulting residual often does not indicate that it is an outlier.
Even when all data points come from a normal linear model, the residuals do not all have the same standard deviation.
Dividing each residual by an estimate of its standard deviation gives values that can be compared to ±2 and ±3 to look for outliers.
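A sketch of the standardisation step (hypothetical data, with leverages computed from the formula above):

```python
import numpy as np

rng = np.random.default_rng(5)
x = np.linspace(0, 10, 30)
y = 2 + 0.7 * x + rng.normal(0, 1, size=x.size)
y[10] += 5.0                      # plant an outlier

b1, b0 = np.polyfit(x, y, deg=1)
residuals = y - (b0 + b1 * x)

n = len(x)
h = 1 / n + (x - x.mean())**2 / np.sum((x - x.mean())**2)   # leverages
s = np.sqrt(np.sum(residuals**2) / (n - 2))                 # estimated error st. deviation

std_resid = residuals / (s * np.sqrt(1 - h))   # standardised residuals
print(np.where(np.abs(std_resid) > 2)[0])      # indices of points worth examining
```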
Standardised residuals still do not show up outliers that are high leverage points. Deleted residuals are based on the difference between the response and the prediction from the data without that observation.
Rather than standardising each residual with a standard deviation based on the mean squared residual for the whole data set, it is better to standardise it with the mean squared residual from the data set with that observation omitted.
Leverage describes the potential of each point to influence the results. DFITS describes its actual influence on the fitted values.
An alternative measure of influence describes the influence of each point on the least squares coefficients.
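In the usual notation (a sketch; definitions differ slightly between texts and software), writing s_(i) for the error standard deviation estimated with observation i deleted and r_i for the standardised residual, the measures described above are often computed as:

```latex
t_i = \frac{e_i}{s_{(i)}\sqrt{1 - h_i}}, \qquad
\mathrm{DFITS}_i = t_i \sqrt{\frac{h_i}{1 - h_i}}, \qquad
D_i = \frac{r_i^2}{2} \cdot \frac{h_i}{1 - h_i}
```

Here t_i is the studentised deleted residual and D_i is one common version of the alternative influence measure (often called Cook's distance); the divisor 2 is the number of least squares coefficients in a simple linear model.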
This page summarises the various measures of residual and influence and gives a few examples where residuals, leverage and influence are interpreted.