Many data sets contain two or more measurements from each individual. Even when the main interest is in one variable, the others can help to understand its distribution.
The main display that shows the relationship between two variables is a scatterplot.
Univariate displays don't show relationships between variables.
A scatterplot of two variables can be enhanced with box plots or histograms in its margins.
When a single measurement is made at regular intervals, the data are called a time series. Time series data can be treated as bivariate, with time being the second variable.
The main feature of interest in a scatterplot is the strength of the relationship between the two variables.
An extreme value of one or both of the variables is an outlier. An unusual combination of values is also called an outlier.
If the crosses on a scatterplot separate into clusters, different groups of individuals are suggested.
In small data sets, there may be considerable variability, so patterns should be strongly evident before they are reported.
One variable can often be classified as an explanatory variable that either causally affects the other variable, called the response, or is useful for predicting its value.
A numerical description of the strength of a relationship should not be affected by rescaling the variables.
Standardising a variable gives z-scores that do not depend on the units of the original variable. (The correlation coefficient will be defined in terms of z-scores for X and Y.)
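As a minimal sketch of standardising, the code below converts the same measurements to z-scores in two different units and gets the same result. The heights are made-up illustrative values, and the sample standard deviation (divisor n - 1) is assumed.

```python
from statistics import mean, stdev

def z_scores(values):
    """Standardise: subtract the mean, then divide by the sample standard deviation."""
    m = mean(values)
    s = stdev(values)  # assumption: sample standard deviation, divisor n - 1
    return [(v - m) / s for v in values]

# Hypothetical heights of five people, recorded in centimetres and in inches.
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
heights_in = [h / 2.54 for h in heights_cm]

z_cm = z_scores(heights_cm)
z_in = z_scores(heights_in)

# The two lists of z-scores agree, so the units have dropped out.
for a, b in zip(z_cm, z_in):
    print(round(a, 4), round(b, 4))
```

The z-scores always have mean zero and standard deviation one, whatever the original units.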
The correlation coefficient, r, summarises the strength of the linear relationship between X and Y. It is +1 when the scatterplot crosses are on a straight line with positive slope, -1 when on a line with negative slope, and close to zero when there is no linear relationship between X and Y.
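One common way to write the correlation coefficient in terms of z-scores is the sum of the products of the z-scores for X and Y, divided by n - 1. The sketch below assumes that form (with sample standard deviations) and uses made-up data to check the values +1 and -1 and the lack of effect of rescaling.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Sum of products of z-scores divided by n - 1 (sample standard deviations assumed)."""
    n = len(xs)
    mx, my = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    return sum(((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)) / (n - 1)

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 5.0]  # illustrative values only

r_up = correlation(x, [2 * v + 1 for v in x])      # crosses on a line with positive slope
r_down = correlation(x, [10 - 3 * v for v in x])   # crosses on a line with negative slope
r_original = correlation(x, y)
r_rescaled = correlation([100 * v for v in x], y)  # X measured in different units

print(r_up, r_down, r_original, r_rescaled)
```

Because each z-score is unit-free, multiplying X by 100 leaves the result unchanged.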
You should be able to estimate the value of r from looking at a scatterplot and imagine a scatter of crosses corresponding to any value of r.
The correlation coefficient is only a good measure of the strength of a relationship if the points in a scatterplot are scattered round a straight line, not a curve.
The correlation coefficient cannot identify curvature, outliers or clusters and can be misleading if these features are present. A scatterplot must always be examined too.
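Two small made-up examples illustrate how the correlation coefficient can mislead. In the first, Y is determined exactly by X but the relationship is curved. In the second, a single outlier creates a high correlation among otherwise unrelated crosses. The data and the helper name correlation are illustrative assumptions.

```python
import math
from statistics import mean

def correlation(xs, ys):
    """Correlation coefficient computed from sums of squares and products."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# A perfect but curved relationship: Y is exactly X squared.
x_curve = [-2.0, -1.0, 0.0, 1.0, 2.0]
y_curve = [v ** 2 for v in x_curve]
r_curve = correlation(x_curve, y_curve)

# Four crosses at the corners of a square show no relationship at all...
x_square = [1.0, 1.0, 2.0, 2.0]
y_square = [1.0, 2.0, 1.0, 2.0]
r_square = correlation(x_square, y_square)

# ...but adding one extreme cross makes the correlation look strong.
r_outlier = correlation(x_square + [10.0], y_square + [10.0])

print(r_curve, r_square, round(r_outlier, 3))
```

Neither value of r hints at the curvature or the outlier; only the scatterplot shows them.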
A line or curve is useful for predicting the value of Y from a known value of X.
A straight line can often be used to predict one variable from another.
The difference between the actual value of Y and the value predicted by a line is called a residual. Small residuals are clearly desirable.
The sum of squared residuals describes the accuracy of predictions from a line. The method of least squares positions the line to minimise the sum of squared residuals.
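The least squares line for a single explanatory variable has a closed-form solution. The sketch below uses made-up data and hypothetical function names to compute it, then checks that shifting or tilting the line increases the sum of squared residuals.

```python
from statistics import mean

def least_squares_line(xs, ys):
    """Return (intercept, slope) of the line minimising the sum of squared residuals."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = my - slope * mx
    return intercept, slope

def sum_squared_residuals(xs, ys, intercept, slope):
    """Residual = actual Y minus the value predicted by the line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]  # illustrative values only

b0, b1 = least_squares_line(x, y)
best = sum_squared_residuals(x, y, b0, b1)
shifted = sum_squared_residuals(x, y, b0 + 0.1, b1)   # line moved up slightly
tilted = sum_squared_residuals(x, y, b0, b1 + 0.05)   # line made slightly steeper

print(b0, b1, best, shifted, tilted)
```

Any other choice of intercept and slope gives a sum of squared residuals at least as large as that of the least squares line.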
A linear model is not appropriate if a scatterplot of the data shows either curvature or outliers. Outliers should be carefully examined.
Outliers and curvature in the relationship are often displayed more clearly in a plot of residuals.
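To illustrate how residuals expose curvature, the sketch below fits a least squares line to made-up data that follow a curve exactly and lists the residuals in order of X.

```python
from statistics import mean

# Illustrative data: Y is exactly X squared, so the true relationship is curved.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [v ** 2 for v in x]

# Least squares line.
mx, my = mean(x), mean(y)
slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
intercept = my - slope * mx

# Residual = actual Y minus the value predicted by the line.
residuals = [b - (intercept + slope * a) for a, b in zip(x, y)]

for a, res in zip(x, residuals):
    print(a, round(res, 6))
```

The residuals are positive at both ends and negative in the middle. A systematic pattern like this in a plot of residuals against X indicates curvature, whereas residuals from an adequate linear model should show no pattern.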
Least squares does not treat Y and X symmetrically. The best line for predicting Y from X is different from the best line for predicting X from Y.
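The asymmetry can be checked numerically. The sketch below, using made-up data and a hypothetical helper slope, fits the line for predicting Y from X and the line for predicting X from Y, then expresses both as slopes of Y against X.

```python
from statistics import mean

def slope(explanatory, response):
    """Least squares slope for predicting the response from the explanatory variable."""
    me, mr = mean(explanatory), mean(response)
    num = sum((e - me) * (r - mr) for e, r in zip(explanatory, response))
    den = sum((e - me) ** 2 for e in explanatory)
    return num / den

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]  # illustrative values only

slope_y_on_x = slope(x, y)  # line for predicting Y from X
slope_x_on_y = slope(y, x)  # line for predicting X from Y

# Drawn on the same axes (Y against X), the second line has slope 1 / slope_x_on_y.
second_line_slope = 1.0 / slope_x_on_y

print(slope_y_on_x, second_line_slope)
```

The two lines coincide only when the crosses lie exactly on a straight line. In general the product of the two fitted slopes equals the squared correlation coefficient.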
The correlation coefficient does not adequately describe the strength of a nonlinear relationship. Transforming the variables to linearise the relationship helps.
If a relationship is nonlinear, a linear model can often be fitted to transformed response or explanatory variables.
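As a sketch of linearising by transformation, the example below uses made-up data in which Y grows exponentially with X. A log transformation of the response is assumed here; other transformations suit other shapes of curve.

```python
import math
from statistics import mean

def slope_intercept_r(xs, ys):
    """Return the least squares slope, intercept and correlation coefficient."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx, sxy / math.sqrt(sxx * syy)

# Illustrative data: Y doubles with each unit increase in X.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.0 * 2.0 ** v for v in x]
log_y = [math.log(v) for v in y]

_, _, r_raw = slope_intercept_r(x, y)
slope_log, intercept_log, r_log = slope_intercept_r(x, log_y)

# Predict log(Y) from the fitted line, then back-transform to the original scale.
predicted_y = math.exp(intercept_log + slope_log * 2.5)

print(round(r_raw, 3), round(r_log, 3), round(predicted_y, 3))
```

On the log scale the crosses lie on a straight line, so the correlation coefficient and the least squares line describe the relationship well. Predictions are made on the transformed scale and converted back.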
An alternative solution to nonlinearity is to fit a quadratic curve to the data, again using the principle of least squares.
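A quadratic can be fitted by least squares by solving three simultaneous "normal" equations for its coefficients. The sketch below is one way to do this with plain Gaussian elimination; the function name fit_quadratic and the data are illustrative assumptions, and statistical software would normally be used instead.

```python
def fit_quadratic(xs, ys):
    """Least squares coefficients (a, b, c) for y = a + b*x + c*x**2."""
    # Power sums: S[k] = sum of x**k, T[k] = sum of y * x**k.
    S = [sum(x ** k for x in xs) for k in range(5)]
    T = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]
    # Augmented matrix of the normal equations: row i is [S[i], S[i+1], S[i+2] | T[i]].
    A = [[S[i], S[i + 1], S[i + 2], T[i]] for i in range(3)]
    # Gaussian elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(col + 1, 3):
            factor = A[r][col] / A[col][col]
            A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    # Back substitution.
    coef = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        known = sum(A[i][j] * coef[j] for j in range(i + 1, 3))
        coef[i] = (A[i][3] - known) / A[i][i]
    return tuple(coef)

# Illustrative data lying exactly on y = 1 + 2x + 0.5x^2.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0 + 2.0 * v + 0.5 * v ** 2 for v in x]

a, b, c = fit_quadratic(x, y)
print(round(a, 6), round(b, 6), round(c, 6))
```

With data that lie exactly on a quadratic, the fit recovers its coefficients. With real data, the fitted curve is the quadratic with the smallest sum of squared residuals.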
Since the form of a relationship is unknown beyond the range of x-values in the data, it is always dangerous to extrapolate.
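The danger can be seen in a small made-up example. A straight line is fitted to data observed only for X between 0 and 6, where the true relationship (assumed here to be Y equal to X squared) is mildly curved. The line is then used for prediction far outside that range.

```python
from statistics import mean

# Illustrative data observed only for X between 0 and 6.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [v ** 2 for v in x]

# Least squares line fitted to the observed range.
mx, my = mean(x), mean(y)
slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
intercept = my - slope * mx

def predict(value):
    """Prediction of Y from the fitted line."""
    return intercept + slope * value

error_inside = abs(predict(3.0) - 3.0 ** 2)     # within the range of the data
error_outside = abs(predict(20.0) - 20.0 ** 2)  # far beyond the range of the data

print(error_inside, error_outside)
```

The prediction error inside the observed range is modest, but the error at X = 20 is many times larger, and nothing in the data could have warned of it.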