Some bivariate data sets describe complete populations. Others are 'representative' of an underlying population or process.
Bivariate data can be modelled by specifying a response distribution for each possible X.
The response is often modelled with a normal distribution whose mean is a linear function of X and whose standard deviation is constant.
A normal linear model can be described in terms of 'errors'. In samples from the model, approximately 95% of errors are within 2 standard deviations of zero, so about 95% of the points in a scatterplot are within this vertical distance of the model line.
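As a concrete sketch, one might simulate data from such a model; the parameter values below (beta0 = 10, beta1 = 2, sigma = 3) are purely illustrative, not from any real data set:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative parameter values (invented for this sketch)
beta0, beta1, sigma = 10.0, 2.0, 3.0

x = rng.uniform(0, 10, size=50)          # explanatory values
errors = rng.normal(0, sigma, size=50)   # normal errors: mean 0, sd sigma
y = beta0 + beta1 * x + errors           # responses from the model

# Roughly 95% of errors lie within 2 standard deviations of zero
print(np.mean(np.abs(errors) < 2 * sigma))
```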
The normal linear model has 3 unknown parameters. For many data sets, these parameters have meaningful interpretations.
A least squares line provides estimates of the linear model's slope and intercept. These estimates are random values — they vary from sample to sample.
The third parameter of the normal linear model is the error standard deviation. It can be estimated using the residuals from the least squares line.
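Continuing the simulated data above, a minimal sketch of the least squares estimates and of the residual-based estimate of the error standard deviation; the residual sum of squares is divided by n - 2 because two parameters have been estimated:

```python
# Continuing with x and y from the simulation above
n = len(x)
sxx = np.sum((x - x.mean()) ** 2)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx   # least squares slope
b0 = y.mean() - b1 * x.mean()                        # least squares intercept

residuals = y - (b0 + b1 * x)
s = np.sqrt(np.sum(residuals ** 2) / (n - 2))        # divisor n - 2, not n
print(b0, b1, s)
```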
The least squares estimate of the model's slope has a normal distribution that is centred on the true value.
The distribution of the least squares slope may be estimated from a single data set.
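A quick simulation, reusing the setup above, illustrates this: the slope estimates vary from sample to sample but centre on the true value:

```python
# Continuing with rng, x, beta0, beta1, sigma from the simulation above
slopes = []
for _ in range(10_000):
    y_new = beta0 + beta1 * x + rng.normal(0, sigma, size=len(x))
    slopes.append(np.sum((x - x.mean()) * (y_new - y_new.mean()))
                  / np.sum((x - x.mean()) ** 2))

print(np.mean(slopes))   # close to the true slope, beta1 = 2
```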
A confidence interval for the model's slope can be obtained from its least squares estimate and its standard error.
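A sketch, continuing with the estimates above; the 97.5th percentile of a t distribution with n - 2 degrees of freedom gives the 95% multiplier:

```python
from scipy import stats

# Continuing with b1, s, sxx, n from the sketches above
se_b1 = s / np.sqrt(sxx)                # standard error of the slope
t_crit = stats.t.ppf(0.975, df=n - 2)   # 95% multiplier, n - 2 df
print(b1 - t_crit * se_b1, b1 + t_crit * se_b1)
```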
Confidence intervals for the model's slope have the same properties as confidence intervals for population means or proportions.
The standard error of the least squares slope depends on the response standard deviation around the model line, the sample size and the standard deviation of X. Collecting data with a big spread of x-values gives a more accurate estimate of the slope, but there are disadvantages: with the x-values concentrated at the extremes, curvature in the relationship becomes harder to detect.
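In symbols, writing sigma for the error standard deviation, n for the sample size and s_X for the standard deviation of the x-values, the standard formula is:

```latex
\operatorname{se}(\hat{\beta}_1)
  = \frac{\sigma}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}}
  = \frac{\sigma}{s_X \sqrt{n - 1}}
```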
If the model's slope is zero, the response distribution does not depend on the explanatory variable. This special case is particularly meaningful in many studies.
The p-value for testing whether a linear model's slope is zero is the probability that its least squares estimate would be as far from zero as the value actually recorded, if the model's slope really were zero.
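Continuing the sketches above, the test statistic and two-tailed p-value:

```python
# Continuing with b1, se_b1, n from the sketches above
t_stat = b1 / se_b1                              # t statistic for H0: slope = 0
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)  # two-tailed p-value
print(t_stat, p_value)
```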
It is important to distinguish the strength of a relationship (summarised by the correlation coefficient) and the strength of evidence for existence of a relationship (summarised by the p-value).
As with other tests, all p-values between 0 and 1 are equally likely if the null hypothesis holds (model slope is zero), but p-values near 0 are more likely if the alternative hypothesis holds (model slope is non-zero).
From estimates of the 3 linear model parameters, we can obtain an estimated response distribution at any x-value.
The predicted response at any X varies from sample to sample. The prediction is more variable at x-values far from the mean of the 'training' data.
A distinction is made between estimating the mean response at X and predicting a new individual's response at X. Errors are larger (on average) when predicting a new individual's response.
A 95% confidence interval is used to estimate the mean response at X. A 95% prediction interval is similar, but gives a range of likely values for a new response value. The prediction interval is wider than the confidence interval.
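A sketch of both intervals at an arbitrary x-value x0 = 5, continuing with the estimates above; the extra '1 +' under the square root is what makes the prediction interval wider:

```python
x0 = 5.0                                   # arbitrary x-value of interest
y_hat = b0 + b1 * x0                       # estimated mean response at x0

se_mean = s * np.sqrt(1 / n + (x0 - x.mean()) ** 2 / sxx)      # for the mean
se_pred = s * np.sqrt(1 + 1 / n + (x0 - x.mean()) ** 2 / sxx)  # for a new value

t_crit = stats.t.ppf(0.975, df=n - 2)
print(y_hat - t_crit * se_mean, y_hat + t_crit * se_mean)  # confidence interval
print(y_hat - t_crit * se_pred, y_hat + t_crit * se_pred)  # prediction interval
```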
The normal linear model involves assumptions of linearity, constant variance, normal error distribution and independence of different observations. Residuals can be examined to assess whether these assumptions are appropriate for a particular data set.
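For example, plotting the residuals against x, continuing the sketches above; curvature suggests nonlinearity and a funnel shape suggests non-constant variance:

```python
import matplotlib.pyplot as plt

# Continuing with x and residuals from the sketches above
plt.scatter(x, residuals)
plt.axhline(0, color="grey")
plt.xlabel("x")
plt.ylabel("residual")
plt.show()
```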
If the relationship between Y and X is nonlinear, a transformation of X may linearise the relationship.
Transforming the response may remove curvature in the relationship, but also affects whether the error standard deviation is constant. Fortunately, the same transformation of Y often removes curvature and non-constant standard deviation.
If a normal linear model describes the relationship between a transformation of the response and a transformation of the explanatory variable, predictions can be made by fitting the linear model to the transformed data, then performing the inverse transformation on the prediction.
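A hypothetical illustration with a log transformation of the response, reusing the random generator from the first sketch (the data-generating values are invented):

```python
# Invented data whose relationship is linearised by log(y)
x2 = rng.uniform(1, 10, size=40)
y2 = np.exp(0.5 + 0.3 * x2 + rng.normal(0, 0.2, size=40))

ly = np.log(y2)                                      # transform the response
b1t = np.sum((x2 - x2.mean()) * (ly - ly.mean())) / np.sum((x2 - x2.mean()) ** 2)
b0t = ly.mean() - b1t * x2.mean()

pred = np.exp(b0t + b1t * 7.0)   # predict log(y) at x = 7, then back-transform
print(pred)
```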
An outlier is a response value that is unusually large or small, given its x-value. An extreme residual suggests an outlier, and standardised residuals can be used to assess it. However, if the outlier corresponds to an extreme x-value (a high-leverage point), it may not show up as a large residual.
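Standardised residuals and leverages can be computed directly, as in this sketch continuing from above; standardised residuals beyond about plus or minus 2 deserve a closer look, and leverage is high for extreme x-values:

```python
# Continuing with x, residuals, s, sxx, n from the sketches above
leverage = 1 / n + (x - x.mean()) ** 2 / sxx         # high for extreme x-values
std_resid = residuals / (s * np.sqrt(1 - leverage))  # standardised residuals
print(np.where(np.abs(std_resid) > 2)[0])            # indices worth a closer look
```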
The errors in a normal linear model are assumed to have normal distributions. Violation of this assumption is less important than nonlinearity, non-constant variance or outliers, but a probability plot of the residuals can be used to assess normality.
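A sketch using scipy's probability plot, continuing with the residuals above; an approximately straight plot is consistent with normal errors:

```python
# Continuing with residuals, stats and plt from the sketches above
stats.probplot(residuals, dist="norm", plot=plt)  # roughly straight if normal
plt.show()
```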
The errors in a normal linear model are assumed to be independent. In data where the observations are recorded sequentially, successive errors are sometimes found to be correlated. Correlated errors can arise whatever the x-variable, but are most often seen when the x-variable is time itself.
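For sequentially recorded data, the lag-1 autocorrelation of the residuals gives a simple check, as sketched below (assuming the residuals are in time order):

```python
# Continuing with residuals from the sketches above (assumed in time order)
r = residuals - residuals.mean()
lag1 = np.sum(r[1:] * r[:-1]) / np.sum(r ** 2)   # lag-1 autocorrelation
print(lag1)                                      # near 0 for independent errors
```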