The notion of prediction

Causal relationships
In causal relationships, one variable is thought to directly affect the other. When the value of the explanatory variable is known for a further individual (and may even be under our control), we might hope to predict the resulting value of the response.
 
For example, the concentration of a chemical may be recorded after a reaction is conducted at a variety of temperatures. We would like to predict the concentration from future runs of the experiment at different temperatures.
 
Non-causal relationships
In other situations, the relationship is not causal but we are still interested in predicting the value of one variable from a known value of the other variable.
 
For example, accurate measurements of body fat are difficult to make, whereas measurements of skinfold thickness are relatively easy to obtain. It would be useful to be able to predict body fat from skinfold thickness, based on a dataset with both measurements from a group of people.

Notation and convention

In both of the above cases, the variables can be classified as an explanatory variable and a response. When we talk in general about this type of data, we will use the letter X to denote the explanatory variable and Y to denote the response.

Always draw the response variable, Y, on the vertical axis of a scatterplot and X on the horizontal axis.

Describing the form of the relationship

The correlation coefficient describes the strength of the relationship and whether the scatter of crosses on a scatterplot has positive or negative slope, but it holds no information about where the crosses lie on the scatterplot, that is, about the form of the relationship.
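A small Python sketch can illustrate this point. The numbers below are invented for illustration, and `pearson_r` is a hypothetical helper written here rather than a library function. The two responses lie along quite different lines, yet their correlation with X is the same.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of paired values."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: six x-values and a fixed pattern of scatter.
x = [1, 2, 3, 4, 5, 6]
scatter = [0.3, -0.2, 0.1, -0.4, 0.2, 0.0]

# First response: scattered around the line y = 2 + x.
y_a = [2 + xi + e for xi, e in zip(x, scatter)]

# Second response: a shifted and stretched copy of the first,
# so its points sit in a very different part of the scatterplot.
y_b = [50 + 10 * yi for yi in y_a]

r_a = pearson_r(x, y_a)
r_b = pearson_r(x, y_b)

print(f"r for first response:  {r_a:.4f}")
print(f"r for second response: {r_b:.4f}")
```

The two printed values agree, even though a line describing the second response would be much higher and steeper than one describing the first.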

The form of the relationship can be described by a straight line or curve that lies close to the crosses in a scatterplot of Y against X. Such a line is called a regression line.

Predicting the response from the regression line

The regression line (i.e. the curve or straight line on the scatterplot that describes the form of the relationship) can be drawn close to the crosses 'by eye'. (We will describe better, more objective ways to position the line later.)

Any such regression line can be used to 'read off' the y-value corresponding to any x. This gives a prediction of the y-value that would be likely to be recorded if a new observation were made at this x, and it can be expressed as

y = f(x)

where f(x) is the function represented by the regression line.
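As a minimal sketch of this 'reading off', suppose a straight line has been drawn by eye and two points on it have been noted. The helper `line_through` and the coordinates used are hypothetical, chosen only to show how a prediction y = f(x) is obtained.

```python
def line_through(p1, p2):
    """Return the function f for the straight line through p1 and p2."""
    (x1, y1), (x2, y2) = p1, p2
    slope = (y2 - y1) / (x2 - x1)

    def f(x):
        # Height of the line at x, i.e. the predicted response.
        return y1 + slope * (x - x1)

    return f

# Two hypothetical points lying on a line drawn by eye.
f = line_through((10.0, 4.0), (30.0, 12.0))

# Read off the prediction for a new observation at x = 20.
prediction = f(20.0)
print(f"Predicted y at x = 20: {prediction:.2f}")
```

The prediction is only as good as the line: a different line drawn by eye would give a different f and therefore a different predicted value.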

Predicting impurity in a chemical process

The scatterplot below shows the percentage impurity of the output of a chemical process at various reaction temperatures.

First position the grey curve to fit close to the data by dragging the five red circles up and down. (This is a computer-based way to position the curve 'by eye'.)

When your curve is close to the points, click 'Finished sketching curve'. You can then use the curve to make predictions of process impurity at any temperature. Drag the red vertical line towards the left or right to see these predictions.
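The sketching step can be imitated in code. The following is only an illustration: the five (temperature, impurity) control points are invented rather than taken from the scatterplot, `make_curve` is a hypothetical helper, and joining the points with straight segments is an assumption (the curve in the diagram may well be smoother).

```python
from bisect import bisect_right

def make_curve(control_points):
    """Return f(x) that joins the control points with straight segments."""
    pts = sorted(control_points)
    xs = [p[0] for p in pts]
    ys = [p[1] for p in pts]

    def f(x):
        # Only predict inside the range covered by the control points.
        if not xs[0] <= x <= xs[-1]:
            raise ValueError("x is outside the range covered by the curve")
        # Index of the control point at the right-hand end of the segment.
        i = min(bisect_right(xs, x), len(xs) - 1)
        x0, x1 = xs[i - 1], xs[i]
        y0, y1 = ys[i - 1], ys[i]
        return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

    return f

# Hypothetical positions of five control points:
# (reaction temperature, percentage impurity).
curve = make_curve([
    (60.0, 8.0),
    (70.0, 5.0),
    (80.0, 4.0),
    (90.0, 4.5),
    (100.0, 7.0),
])

# Equivalent of dragging the vertical line to a temperature of 75.
print(f"Predicted impurity at 75: {curve(75.0):.2f}%")
```

The function refuses to predict outside the range of the control points, since the sketched curve says nothing about temperatures beyond those that were observed.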