The notion of prediction

Causal relationships
In causal relationships, one variable is thought to directly affect the other. When the value of the explanatory variable is known for a further individual (and may even be under our control), we might hope to predict the resulting value of the response.
 
For example, a supermarket chain may want to discover how sales of a particular brand of disposable razors are affected by their price. In 20 similar stores, the price is randomly set at 79, 89, 99, 109 or 119 cents and sales are recorded over one week. The chain would like to predict sales if it sets the price at 104 cents.
 
Non-causal relationships
In other situations, the relationship is not causal but we are still interested in predicting the value of one variable from a known value of the other variable.
 
For example, the above supermarket chain may record the number of days that each checkout operator is absent due to illness and their length of service. It would be useful to be able to predict sick days for each employee from their length of service, even though illness could also affect an employee's length of service.

Notation and convention

In both of the above cases, the variables can be classified as an explanatory variable and a response. When we talk in general about this type of data, we will use the letter X to denote the explanatory variable and Y to denote the response.

Always draw the response variable, Y, on the vertical axis of a scatterplot and X on the horizontal axis.

Describing the form of the relationship

The correlation coefficient describes the strength of the relationship and whether the scatter of crosses on a scatterplot has positive or negative slope, but it holds no information about the position of the crosses on the scatterplot — the form of the relationship.
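The point that the correlation coefficient says nothing about position can be checked numerically. The sketch below uses made-up values (they are an assumption for illustration, not data from this page) and a hand-written helper called correlation. It moves a scatter of points upwards and makes it steeper, then shows that the correlation coefficient is unchanged.

```python
from math import sqrt


def correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical illustrative values, not data from the text.
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9]

# The same scatter, shifted up by 100 and made five times steeper.
y_moved = [100 + 5 * value for value in y]

r_original = correlation(x, y)
r_moved = correlation(x, y_moved)

# The two coefficients agree (up to rounding error), although the
# crosses would sit in very different places on a scatterplot.
print(round(r_original, 6), round(r_moved, 6))
```

Since both scatters give the same coefficient, the coefficient alone cannot be used to predict a y-value from x. Something that records where the crosses lie is needed as well.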

The form of the relationship can be described by a straight line or curve that lies close to the crosses in a scatterplot of Y against X. Such a line is called a regression line.

Predicting the response from the regression line

The regression line (i.e. the curve or straight line on the scatterplot that describes the form of the relationship) can be drawn close to the crosses 'by eye'. (We will describe better, more objective ways to position the line later.)

It is possible to use any such regression line to 'read off' the y-value corresponding to any x. This provides a prediction of the likely y-value that would be recorded if a new observation were made at this x and can be expressed as

y = f(x)

where f() corresponds to the regression line.
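As a rough illustration of 'reading off' a prediction, the sketch below treats a straight line drawn by eye as the function f. The helper name line_through and the two points on the line are assumptions made up for this example. They are not values from the razor study, whose data are not given here.

```python
def line_through(point_a, point_b):
    """Return f for the straight line through two points chosen by eye."""
    (x1, y1), (x2, y2) = point_a, point_b
    slope = (y2 - y1) / (x2 - x1)

    def f(x):
        # Start at the first point and move along the line to x.
        return y1 + slope * (x - x1)

    return f


# Hypothetical line: it passes through (price 80 cents, 500 sales)
# and (price 120 cents, 300 sales). These numbers are invented.
f = line_through((80, 500), (120, 300))

# Read off the predicted y-value at x = 104.
prediction = f(104)
print(prediction)
```

Any other line or curve could be used in the same way, provided it can be evaluated at the x of interest.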

Predicting the effect of advertising

The scatterplot below shows the weekly retail sales of televisions of a particular brand, against weekly advertising expenditure.

First position the grey curve to fit close to the data by dragging the five red circles up and down. (This is a computer-based way to position the curve 'by eye'.)

When your curve is close to the points, click Finished sketching curve. You can use the curve to make predictions of sales at any level of expenditure. Drag the red vertical line towards the left or right to see these predictions.
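For readers without access to the interactive diagram, the idea can be sketched in code. The example below assumes the curve is defined by five control points (standing in for the five red circles) and joins them with straight segments. This is a simplification chosen for brevity, since the diagram may well draw a smooth curve. The coordinates and the names make_curve and sales_curve are hypothetical.

```python
from bisect import bisect_right


def make_curve(control_points):
    """Return f for the piecewise-linear curve through the control points."""
    points = sorted(control_points)
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]

    def f(x):
        if not xs[0] <= x <= xs[-1]:
            raise ValueError("x is outside the range covered by the curve")
        # Index of the segment containing x (the last one when x == xs[-1]).
        i = min(bisect_right(xs, x) - 1, len(xs) - 2)
        fraction = (x - xs[i]) / (xs[i + 1] - xs[i])
        return ys[i] + fraction * (ys[i + 1] - ys[i])

    return f


# Hypothetical control points: (weekly advertising expenditure, weekly sales).
circles = [(0, 20), (10, 45), (20, 60), (30, 68), (40, 70)]
sales_curve = make_curve(circles)

# Equivalent to dragging the vertical line to an expenditure of 15.
print(sales_curve(15))
```

The function refuses to predict outside the range of the control points, because a curve sketched by eye gives no guidance about the relationship beyond the data it was fitted to.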