Long page descriptions

Chapter 3   Analysis of Variance

3.1   Anova for simple linear model

3.1.1   Components for regression model

In regression data, the difference between the response and its overall mean can be split into an explained component and a residual.

3.1.2   Sums of squares

The total sum of squares equals the explained sum of squares plus the residual sum of squares.
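
A sketch of this identity in standard notation, where \hat{y}_i are the fitted values and \bar{y} is the overall mean:

    \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} (y_i - \hat{y}_i)^2

i.e. total SS = explained (regression) SS + residual SS, with n - 1 = 1 + (n - 2) degrees of freedom for simple linear regression.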

3.1.3   Coefficient of determination

The relative sizes of the explained and residual sums of squares hold information about the strength of the relationship. The coefficient of determination describes the proportion of total variation that is explained.
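
In the same notation, the coefficient of determination is

    R^2 = \frac{\text{SS}_{\text{regn}}}{\text{SS}_{\text{total}}} = 1 - \frac{\text{SS}_{\text{resid}}}{\text{SS}_{\text{total}}}

which lies between 0 (no linear relationship) and 1 (all points exactly on the least squares line); for simple linear regression it equals the square of the correlation coefficient between x and y.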

3.1.4   Coefficient of determination and experiments (optional)

For experimental data, the coefficient of determination is affected both by the strength of the relationship and the range of x-values chosen by the experimenter.

3.1.5   Analysis of variance test

The F ratio can be used to test whether the variables are related (i.e. to test whether the model slope is zero). Since the F ratio is the square of the t statistic for this test, the conclusions for the F and t tests are identical.
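
A sketch of the test statistic, assuming the usual normal linear model:

    F = \frac{\text{SS}_{\text{regn}} / 1}{\text{SS}_{\text{resid}} / (n - 2)} = \frac{\text{MS}_{\text{regn}}}{\text{MS}_{\text{resid}}}

which is compared with the F distribution with 1 and n - 2 degrees of freedom; F equals the square of the t statistic for testing a zero slope.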

3.2   All-or-nothing anova

3.2.1   Components of variation for Y vs X and Z

The difference between each response value and the overall mean can be split into a component explained by the explanatory variables and a residual.

3.2.2   Sums of squares for Y vs X and Z

The total, regression and residual sums of squares contain information about how well the explanatory variables explain variability in the response. The coefficient of determination is a useful summary statistic.

3.2.3   F test for regression of Y vs X and Z

The ratio of the regression and residual mean squares has an F distribution if the response is unrelated to the explanatory variables but tends to be larger if they are related. It can be used as a test statistic for whether there is a relationship.

3.2.4   All-or-nothing F test for any GLM

A similar F test can simultaneously test whether all slope parameters in a GLM are zero.
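
A sketch of the general form, for a GLM with p explanatory variables (plus an intercept) fitted to n observations:

    F = \frac{\text{SS}_{\text{regn}} / p}{\text{SS}_{\text{resid}} / (n - p - 1)}

which is compared with the F distribution with p and n - p - 1 degrees of freedom under the hypothesis that all p slope parameters are zero.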

3.2.5   Different interpretations of R-sqr and F

The coefficient of determination, R-sqr, describes the proportion of response variation that is explained by the model. The F ratio describes the strength of evidence for there being any relationship at all. In large samples, R-sqr can be small even when F is large.
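
The two summaries are linked. In the same notation,

    F = \frac{R^2 / p}{(1 - R^2) / (n - p - 1)}

so for a fixed R-sqr the F ratio grows roughly in proportion to n, which is why a weak but real relationship can give a highly significant F in a large data set.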

3.3   Multicollinearity of X and Z

3.3.1   Correlated explanatory variables

When the explanatory variables, X and Z, are correlated, their slope parameters can be estimated less accurately than for uncorrelated explanatory variables covering the same spreads of x- and z-values.

3.3.2   Variance inflation factors

The variance inflation factors for the slope parameters quantify the increase in their standard errors due to the explanatory variables being correlated.
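
A sketch of the standard definition: if R_j^2 is the coefficient of determination from regressing the j-th explanatory variable on all the others, then

    \text{VIF}_j = \frac{1}{1 - R_j^2}

and the standard error of the corresponding slope estimate is larger by a factor of \sqrt{\text{VIF}_j} than it would be for uncorrelated explanatory variables with the same spreads.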

3.3.3   Understanding multicollinearity

The slope coefficient for X is the slope of a slice through the regression plane at any z-value. When X and Z are highly correlated, each such slice of the data contains only a small range of x-values and therefore holds little information about the parameter.

3.3.4   Understanding multicollinearity (cont)

The position of the least squares plane is most accurately determined near the data. When X and Z are highly correlated, the least squares plane can be very variable away from the data.

3.3.5   F- and t-tests: a paradox?

If X and Z are correlated, the F-test can show that the explanatory variables are related to Y but t-tests of the separate slopes may show that either one of X or Z can be dropped from the full model.

3.3.6   T-tests in full and partial models

If X and Z are correlated, the t-test for X in the full model with X and Z can give a different result from the t-test in the model with only the single explanatory variable X.

3.4   Sequential sums of squares

3.4.1   Sequentially adding variables

As explanatory variables are added to the model, the regression plane gets closer to the data points. The regression planes for the models with only X or only Z have zero slope for the omitted variable.

3.4.2   Splitting the explained sum of squares

Each additional variable reduces the residual sum of squares by an amount that is the sum of squares of differences between the least squares fits of the two models.
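
In symbols, for the order X then Z,

    \text{SS}(Z \mid X) = \text{SS}_{\text{resid}}(X) - \text{SS}_{\text{resid}}(X, Z) = \sum_{i=1}^{n} \left( \hat{y}_i^{(X,Z)} - \hat{y}_i^{(X)} \right)^2

where the two sets of fitted values come from the least squares fits with X only and with both X and Z.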

3.4.3   Order of adding X and Z

The explained sum of squares for X can be different, depending on whether Z is already in the model.
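
A minimal sketch of this in Python with statsmodels (the data frame, variable names and simulated values here are purely illustrative):

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf
    from statsmodels.stats.anova import anova_lm

    # Simulated data in which x and z are correlated and y depends on both.
    rng = np.random.default_rng(1)
    x = rng.normal(size=100)
    z = 0.8 * x + rng.normal(scale=0.6, size=100)
    y = 1.0 + 0.5 * x + 0.7 * z + rng.normal(size=100)
    df = pd.DataFrame({"y": y, "x": x, "z": z})

    # Sequential (type I) anova tables for the two orders of adding the variables.
    print(anova_lm(smf.ols("y ~ x + z", data=df).fit()))   # x added before z
    print(anova_lm(smf.ols("y ~ z + x", data=df).fit()))   # z added before x

The sums of squares for x and z differ between the two tables, but the residual sum of squares (and the fit of the full model) is the same.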

3.4.4   Anova tests for individual variables

There are two ways to split the total sum of squares in an anova table. The F-test for the final variable added to the model gives identical results to the t-test for the coefficient in the full model.

3.4.5   Orthogonal variables

When the two explanatory variables are uncorrelated (orthogonal), the results are easier to interpret. The slope coefficient for X is the same whether or not Z is in the model, and the two anova tables are identical.

3.4.6   Orthogonal variables and experimental design

Orthogonal variables usually only arise from designed experiments. They result in the most accurate parameter estimates and results that are relatively easy to interpret.

3.4.7   Other sequences of models

For any sequence of models with increasing complexity, component sums of squares can be defined that compare successive models in the sequence.

3.5   Testing linearity in regression

3.5.1   Linear and quadratic models

Linearity can be assessed by comparing the fits of a linear and quadratic model. The total sum of squares can be split into linear, quadratic and residual sums of squares.

3.5.2   Understanding the sums of squares

The quadratic sum of squares compares the fits of the linear and quadratic models and therefore holds information about whether there is curvature in the data.

3.5.3   Testing for linearity

An F ratio comparing the quadratic and residual mean squares can be used to test for linearity.
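
A sketch of the test statistic, in the same notation as the earlier anova tests:

    F = \frac{\text{SS}_{\text{quadratic}} / 1}{\text{SS}_{\text{resid}} / (n - 3)}

which is compared with the F distribution with 1 and n - 3 degrees of freedom; a large value is evidence of curvature.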

3.5.4   Polynomial models

In polynomial models, only one order of adding terms is meaningful. This means that only a single anova table is possible.

3.5.5   Testing curvature in complex models

Analysis of variance can test the significance of the reduction in the residual sum of squares from adding quadratic terms in X and Z to a model with linear terms in both variables.

3.5.6   Inference for models with interaction

Testing whether the coefficient of XZ is zero can be done with either a t-test or analysis of variance. Both tests give the same p-value.

3.6   Several explanatory variables

3.6.1   Marginal sums of squares

The marginal sums of squares in a general linear model describe the effect on the residual sum of squares of deleting single variables from the full model.
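
A sketch of the definition for the j-th explanatory variable:

    \text{SS}_{\text{marginal}}(x_j) = \text{SS}_{\text{resid}}(\text{model without } x_j) - \text{SS}_{\text{resid}}(\text{full model})

i.e. the increase in the residual sum of squares when that single variable is deleted from the full model.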

3.6.2   Variable selection

The variable with the smallest marginal sum of squares is least important, and its p-value indicates whether it can be dropped from the model. The marginal sums of squares can then be recalculated and further variables dropped in an iterative procedure.

3.6.3   Multicollinearity

When the explanatory variables are uncorrelated, parameter estimates and marginal sums of squares are unaffected by removing other variables. Variance inflation factors indicate the degree of multicollinearity.

3.6.4   Sequential sums of squares

Sequential sums of squares describe changes to the residual sum of squares when the explanatory variables are added sequentially. The sums of squares depend on the order of adding the variables.

3.6.5   Sequential sums of squares and fitted values

The sequential sums of squares are also the sum of squared differences between the fitted values of consecutive models. In some applications, these differences can be shown graphically to illustrate the sequential sum of squares.

3.6.6   Order of variables in ssq table

The sequential sums of squares depend on the order of adding the variables except when the explanatory variables are uncorrelated.

3.6.7   Sums of squares for groups of variables

Individual explanatory variables can be grouped together by adding their sums of squares and degrees of freedom.

3.6.8   Hypothesis tests in anova tables

The sum of squares table can be extended with mean squares and F ratios. P-values can be found for the F ratios to indicate whether each variable can be dropped from the model, but each p-value should only be interpreted if the p-values for the variables added later in the table are not significant.