The shape of a relationship is only known around the data
The models that we have used to describe the relationship between a response, Y, and an explanatory variable, X, are usually only approximations to the 'real' relationship. For example, a scatterplot may look linear, but we really have no information about the shape of the relationship beyond our data.
As a result, the models that we have described may be used to predict Y from values of X that are within the range of x-values in our data, but we should be very cautious about using a fitted model for predictions outside this range. Predicting outside the range of the data is called extrapolation, and such predictions can be badly in error.
Avoid using a model to predict Y far beyond the available data.
What model for curvature is best?
We described two different types of nonlinear model for data with curvature: quadratic models and linear models based on transformed variables. There are many data sets for which both types of model fit equally well.
However, the different types of model can give very different predictions when used to extrapolate from the observed data. The data cannot help us to decide which is better.
Forbes' data
One such data set is the Forbes data, for which we hope to describe the relationship between barometric pressure, Y, and the boiling point of water, X.
We saw earlier that a quadratic model fitted well; the least squares curve is
estimate of pressure = 116.59 − 1.4165 x + 0.004752 x²
The logarithm of the pressure is also approximately linearly related to boiling point with the equation
estimate of ln (pressure) = −0.9518 + 0.02052 x
This equation can be rewritten in the form
estimate of pressure = exp( −0.9518 + 0.02052 x )
The two models (quadratic in the original measurements and linear in the transformed measurements) both fit the data equally well with residual plots that are almost identical. The data give no indication of which of the alternative models fits better.
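As a rough illustration, the two fitted equations can be evaluated side by side. The following Python sketch uses only the coefficients quoted above. The function names and the boiling points chosen are our own, for illustration only: 200 and 210 are assumed to lie within the range of the data, and 100 and 300 well outside it.

```python
import math

def quadratic_pressure(x):
    """Quadratic model fitted to the original pressure measurements."""
    return 116.59 - 1.4165 * x + 0.004752 * x ** 2

def loglinear_pressure(x):
    """Linear model for ln(pressure), back-transformed to pressure."""
    return math.exp(-0.9518 + 0.02052 * x)

# Illustrative boiling points (assumed values, not taken from the data set)
for x in (100, 200, 210, 300):
    quad = quadratic_pressure(x)
    loglin = loglinear_pressure(x)
    print(f"x = {x}: quadratic = {quad:.2f}, "
          f"log linear = {loglin:.2f}, difference = {quad - loglin:.2f}")
```

At 200 and 210 the two predictions differ by less than 0.1. At 100 the quadratic model predicts about 22.5 and the log linear model about 3.0, and at 300 the two predictions differ by more than 60.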
Use the pull-down menu to display the two fitted lines (Quadratic model and Log linear model). Observe that they almost coincide.
Select the option Both models, then use the slider to extend the two axes. When the boiling point is less than approximately 150 degrees, or more than 250 degrees, the two models give very different predictions of the pressure.
Indeed, the quadratic model (which is green) predicts that the pressure will increase when the boiling point falls below 150 degrees, which does not conform to physics! For these data, the log linear model is likely to give better predictions at low boiling points.
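The boiling point at which the quadratic model stops decreasing can be checked from its coefficients: a quadratic a + bx + cx² with c > 0 has its minimum where x = −b / (2c). The short Python sketch below does this calculation with the fitted coefficients; the variable names are our own.

```python
# Coefficients of the fitted quadratic: pressure = a + b*x + c*x**2
a, b, c = 116.59, -1.4165, 0.004752

# A quadratic with c > 0 has its minimum where the slope b + 2*c*x is zero
turning_point = -b / (2 * c)
min_pressure = a + b * turning_point + c * turning_point ** 2

print(f"turning point at x = {turning_point:.1f}")
print(f"predicted pressure there = {min_pressure:.2f}")
```

The turning point is at a boiling point of about 149, so for lower boiling points the fitted quadratic predicts that pressure rises as boiling point falls.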
However, we really have little idea about which curve is better at high boiling points; indeed, neither may adequately predict pressure. The point here is that...
Predictions from any model will be very unreliable if they are made away from the values of the explanatory variables where we have data.