Backward elimination
On the previous page, we saw that the marginal sums of squares and their associated p-values can be used to decide which variable, if any, should be removed first from the full model. This process can be continued, successively dropping the remaining variable with the lowest marginal sum of squares (equivalently, the highest p-value) until all remaining variables have 'small enough' p-values. This procedure is called backward elimination of variables.
A common 'stopping rule' for this iterative procedure is to stop when none of the remaining variables has a p-value greater than 0.05.
Note that the marginal sums of squares and p-values change each time a variable is dropped. They must be re-calculated at each step of the procedure.
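As an illustration, backward elimination might be sketched in Python as follows, assuming the explanatory variables are the columns of a pandas DataFrame X and the response is a Series y, and using statsmodels for the fits (these names and the library choice are assumptions for the sketch, not part of this page). Because each variable contributes one degree of freedom, dropping the variable with the largest p-value is the same as dropping the one with the smallest marginal sum of squares.

    import statsmodels.api as sm

    def backward_elimination(X, y, alpha=0.05):
        # Start with the full model and repeatedly drop the variable with
        # the largest p-value (smallest marginal sum of squares), refitting
        # after each removal, until every remaining p-value is below alpha.
        remaining = list(X.columns)
        while remaining:
            fit = sm.OLS(y, sm.add_constant(X[remaining])).fit()
            pvalues = fit.pvalues.drop("const")   # ignore the intercept
            worst = pvalues.idxmax()
            if pvalues[worst] <= alpha:
                break                             # all survivors are significant
            remaining.remove(worst)
        return remaining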
Forward selection
This procedure is similar to backward elimination, but it starts with a model containing no explanatory variables and adds variables one at a time. At each step, the variable that reduces the residual sum of squares most is added; this is equivalent to adding the variable that would have the lowest (most significant) p-value if it were added to the model.
The procedure stops when the p-values for all variables not yet in the model are high, say above 0.05.
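A matching Python sketch of forward selection, under the same assumptions as before (candidate variables in a DataFrame X, response in a Series y):

    import statsmodels.api as sm

    def forward_selection(X, y, alpha=0.05):
        # Start with no variables and repeatedly add the candidate with the
        # smallest p-value when entered (equivalently, the candidate that
        # reduces the residual sum of squares most), until none would enter
        # with a p-value below alpha.
        selected = []
        candidates = list(X.columns)
        while candidates:
            pvalues = {v: sm.OLS(y, sm.add_constant(X[selected + [v]])).fit().pvalues[v]
                       for v in candidates}
            best = min(pvalues, key=pvalues.get)
            if pvalues[best] >= alpha:
                break
            selected.append(best)
            candidates.remove(best)
        return selected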
Stepwise variable selection
After several variables have been deleted in backward elimination, it is occasionally found that the p-value for re-adding one of the deleted variables has dropped below 0.05, so that variable could usefully be put back into the model.
Stepwise variable selection allows this by alternating between steps of backward elimination (provided at least one variable in the model has a high enough p-value) and forward selection (provided at least one variable not in the model has a low enough p-value).
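Under the same assumptions as the two sketches above, stepwise selection might be coded by alternating a forward and a backward step until neither changes the model:

    import statsmodels.api as sm

    def stepwise_selection(X, y, alpha_in=0.05, alpha_out=0.05):
        selected = []
        changed = True
        while changed:
            changed = False
            # Forward step: add the best candidate if its p-value is low enough.
            candidates = [v for v in X.columns if v not in selected]
            if candidates:
                pvalues = {v: sm.OLS(y, sm.add_constant(X[selected + [v]])).fit().pvalues[v]
                           for v in candidates}
                best = min(pvalues, key=pvalues.get)
                if pvalues[best] < alpha_in:
                    selected.append(best)
                    changed = True
            # Backward step: drop the worst included variable if its p-value is too high.
            if selected:
                fit = sm.OLS(y, sm.add_constant(X[selected])).fit()
                pvalues = fit.pvalues.drop("const")
                worst = pvalues.idxmax()
                if pvalues[worst] > alpha_out:
                    selected.remove(worst)
                    changed = True
        return selected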
Backward elimination for body fat data
The diagram below is similar to the one on the previous page. It initially gives the p-values and Type 3 (marginal) sums of squares for the full model with all 13 explanatory variables.
As before, knee is the variable with the smallest Type 3 sum of squares, so click its checkbox to remove it from the model. The p-values and Type 3 sums of squares are then recalculated for the 12-variable model without knee.
Now remove the variable with the smallest remaining Type 3 sum of squares and continue until all remaining variables in the model have p-values lower than 0.05.
The model with weight, abdomen, forearm and wrist has a residual sum of squares that is only a little higher than that of the full 13-variable model.
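The same steps can be reproduced outside the interactive diagram. A sketch, reusing the backward_elimination function above and assuming the data are in a CSV file with the 13 predictors and a response column named as below (the file name and column names are assumptions):

    import pandas as pd

    bodyfat = pd.read_csv("bodyfat.csv")   # hypothetical file name
    predictors = ["age", "weight", "height", "neck", "chest", "abdomen", "hip",
                  "thigh", "knee", "ankle", "biceps", "forearm", "wrist"]

    kept = backward_elimination(bodyfat[predictors], bodyfat["bodyfat"])
    print(kept)   # the steps above suggest: weight, abdomen, forearm, wrist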
Forward selection for body fat data
The diagram below is again similar, but it also shows the Type 3 sums of squares and p-values for adding variables.
Click the button Remove all. The red p-values and Type 3 sums of squares are those for adding each variable. Observe that height is the only variable that is not significant on its own; every other variable would be significantly related to body fat if it were the only variable in the model.
Abdomen is the variable with the highest Type 3 sum of squares. Click its checkbox to add it to the model; it gives the 1-variable model with the smallest residual sum of squares. Observe that the other p-values and Type 3 sums of squares change: height has now become important, whereas forearm has lost its importance.
Continue adding the variable with the highest Type 3 sum of squares at each step until the p-values for all variables not in the model (the red p-values) are greater than 0.05.
Again the model with weight, abdomen, forearm and wrist is selected as 'best'.
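Forward selection can be run on the same data with the forward_selection sketch above, under the same assumed file and column names:

    added = forward_selection(bodyfat[predictors], bodyfat["bodyfat"])
    print(added)   # abdomen should enter first; the final set matches backward elimination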
Stepwise variable selection for body fat data
For the body fat data, the p-values for the variables that were added in forward selection remained less than 0.05, and the p-values for the variables that were deleted in backward elimination similarly remained greater than 0.05, so there was no need to alternate between forward selection and backward elimination.
Better methods for variable selection
The above iterative procedures for variable selection were historically developed to avoid fitting all possible subsets of variables. For example, in the body fat example with 13 explanatory variables, there are 1,716 different models with 6 explanatory variables. Unfortunately, none of the iterative procedures is guaranteed to find the best (minimum residual sum of squares) of these 1,716 six-variable models, though the chosen six-variable model will usually have a residual sum of squares close to the minimum.
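The number of candidate models grows rapidly with the number of variables; the counts quoted above are easy to check:

    from math import comb

    print(comb(13, 6))   # 1716 different 6-variable models from 13 predictors
    print(2 ** 13)       # 8192 possible subsets in total, including the empty model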
Read a textbook about multiple regression for better methods of variable selection.