Uncorrelated explanatory variables and experiments
When the explanatory variables are correlated with each other, as in the Body Fat data, the least squares coefficient for any variable depends on which of the other variables are in the model. Its marginal sum of squares similarly changes as other variables are added to or deleted from the model. This makes the data much harder to analyse.
In experimental data, it is possible to choose the values of the explanatory variables to make them uncorrelated. The simplest way to do this is to decide on a small number of values (often 2 or 3) for each variable and conduct an equal number of runs of the experiment (often 1) for every possible combination of these values.
If the explanatory variables are uncorrelated, the least squares estimate and marginal sum of squares for each variable are the same irrespective of the other variables in the model.
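One way to see why this holds is sketched below. The notation is introduced here for illustration and is not from the original text: $x_j$ is the centred column of values for the $j$-th explanatory variable, $X$ is the matrix with these columns, $y$ is the centred response and $b_j$ is the least squares coefficient. Uncorrelated centred columns are orthogonal, so the normal equations separate into one equation per variable.

$$
x_j^{\top} x_k = 0 \;\; (j \ne k)
\quad\Longrightarrow\quad
X^{\top} X = \operatorname{diag}\!\left(x_1^{\top} x_1, \dots, x_p^{\top} x_p\right)
\quad\Longrightarrow\quad
b_j = \frac{x_j^{\top} y}{x_j^{\top} x_j},
\qquad
SS_j = \frac{\left(x_j^{\top} y\right)^2}{x_j^{\top} x_j}
$$

Neither $b_j$ nor its sum of squares $SS_j$ involves any column other than $x_j$, so both are unaffected by which other variables are in the model.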
Plasma etching
The following data came from an industrial experiment that was conducted when developing a nitride etch process on a single-wafer plasma etcher. The process uses C2F6 as the reactant gas. Four numerical explanatory variables were varied in the experiment, with each taking one of two values. All combinations of values for the four explanatory variables were used once, giving 2⁴ = 16 runs of the experiment.
| Variable | Description |
|---|---|
| **Response** | |
| Etch rate | Etch rate for silicon nitride, measured in Å per minute |
| **Explanatory variables** | |
| A-c gap | Anode-cathode gap, either 0.8 cm or 1.2 cm |
| Pressure | Pressure in the reactor chamber, either 4.5 mTorr or 550 mTorr |
| Gas flow | Flow rate of C2F6, either 125 SCCM or 200 SCCM |
| Power | Power applied to the cathode, either 275 W or 325 W |
Since each combination of values for the explanatory variables was used once, the explanatory variables are uncorrelated.
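The lack of correlation in such a design can be checked numerically. The following is a minimal sketch, assuming the two settings of each factor are coded as -1 (low) and +1 (high); the coding and the variable names are illustrative choices, not part of the original experiment's description.

```python
from itertools import product

import numpy as np

# Coded levels (an assumption for illustration): -1 = low setting, +1 = high setting.
factors = ["A-c gap", "Pressure", "Gas flow", "Power"]

# Every combination of the two levels of the four factors, each used once.
design = np.array(list(product([-1, 1], repeat=len(factors))))
print(design.shape)  # (16, 4)

# Correlation matrix of the four columns: the off-diagonal entries are all zero.
corr = np.corrcoef(design, rowvar=False)
print(np.round(corr, 10))
```

Any linear recoding of the levels (for example back to cm, mTorr, SCCM and W) leaves the correlations at zero, since correlation is unaffected by changes of location and scale.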
When variables are deleted from the model, the least squares estimates for the remaining parameters stay the same, as do their Type 3 sums of squares.
(However, the reported standard errors and p-values do change, since they use the mean residual sum of squares to estimate σ.)
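This behaviour can be reproduced with a small simulation. The sketch below uses a hypothetical, randomly generated response (not the etch-rate data, which are not listed here), coded -1/+1 factor levels, and a helper function `fit` that is defined only for this illustration.

```python
from itertools import product

import numpy as np

rng = np.random.default_rng(0)

# 2^4 full factorial design in coded units; columns: gap, pressure, flow, power.
X = np.array(list(product([-1.0, 1.0], repeat=4)))

# Hypothetical response for illustration only (NOT the etch-rate measurements).
y = 700 + X @ np.array([-50.0, 40.0, 30.0, 150.0]) + rng.normal(0, 20, size=16)


def fit(columns):
    """Least squares fit with an intercept, using the chosen columns of X."""
    Z = np.column_stack([np.ones(len(y)), X[:, columns]])
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ coef
    s2 = resid @ resid / (len(y) - Z.shape[1])  # mean residual sum of squares
    se = np.sqrt(s2 * np.diag(np.linalg.inv(Z.T @ Z)))
    return coef, se


full_coef, full_se = fit([0, 1, 2, 3])  # all four variables
sub_coef, sub_se = fit([0, 3])          # pressure and gas flow deleted

# Coefficients for gap and power are identical in both models ...
print(full_coef[[1, 4]], sub_coef[[1, 2]])
# ... but their standard errors differ because the estimate of sigma changes.
print(full_se[[1, 4]], sub_se[[1, 2]])
```

In this simulated example the deleted variables have real effects, so their variation is absorbed into the residual sum of squares and the standard errors in the smaller model are larger.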
Variance inflation factors
We earlier described the use of variance inflation factors in linear models with two explanatory variables, X and Z, to quantify the loss of accuracy of each least squares coefficient due to correlation between the explanatory variables.
A similar definition is used when there are more explanatory variables: it is the ratio of the actual variance of the least squares coefficient to the value that would have arisen if the explanatory variables had the same spread but were uncorrelated.
For example, if the VIF for the parameter associated with X is 4.0, its variance is 4 times what it would have been if X had been uncorrelated with the other explanatory variables — i.e. its standard deviation is twice the minimum possible.
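The definition can be written as a standard formula. The symbol $R_j^2$ is introduced here for illustration: it denotes the coefficient of determination obtained when the $j$-th explanatory variable is regressed on all the other explanatory variables.

$$
\mathrm{VIF}_j = \frac{1}{1 - R_j^2},
\qquad
\text{which for two explanatory variables reduces to}
\qquad
\mathrm{VIF} = \frac{1}{1 - r_{XZ}^2}
$$

In the example above, a VIF of 4.0 corresponds to $R_j^2 = 0.75$, and the standard deviation of the coefficient is inflated by the factor $\sqrt{4.0} = 2$.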
Some authors suggest that a VIF of 10 or higher is 'of concern'. It certainly means that there will be problems interpreting the effect of explanatory variables — least squares coefficients could vary greatly depending on the other variables in the model.
For observational data, multicollinearity cannot usually be avoided, but deletion of some variables from the model will usually decrease the VIFs.
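The effect of deleting a variable can be illustrated numerically. The sketch below uses simulated data (not the Body Fat measurements) and computes each VIF as a diagonal element of the inverse of the correlation matrix of the explanatory variables, which is equivalent to 1 / (1 - R²) from regressing that variable on the others. The function name `vifs` and the variables `x1` to `x4` are hypothetical.

```python
import numpy as np


def vifs(data):
    """VIFs as the diagonal of the inverse correlation matrix of the columns."""
    corr = np.corrcoef(data, rowvar=False)
    return np.diag(np.linalg.inv(corr))


rng = np.random.default_rng(1)
n = 200

# Simulated data: x1, x2 and x3 share a common component, x4 is unrelated.
common = rng.normal(size=n)
x1 = common + 0.3 * rng.normal(size=n)
x2 = common + 0.3 * rng.normal(size=n)
x3 = common + 0.3 * rng.normal(size=n)
x4 = rng.normal(size=n)
data = np.column_stack([x1, x2, x3, x4])

all_vifs = vifs(data)                    # VIFs with all four variables
reduced_vifs = vifs(data[:, [0, 1, 3]])  # VIFs after deleting x3

print(np.round(all_vifs, 2))
print(np.round(reduced_vifs, 2))
```

Deleting a variable can never increase the VIF of a remaining variable, because the R² from regressing it on fewer variables cannot be larger. How much the VIFs fall depends on how strongly the deleted variable was related to the others.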
Body fat
In data from observational studies, such as the Body Fat data, the explanatory variables are often highly correlated. The correlations between the variables are shown below; several of the correlations between explanatory variables are over 0.8.
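| | Fat | Age | Wt | Ht | Neck | Chest | Abdom | Hip | Thigh | Knee | Ankle | Bicep | Fore | Wrist |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Fat | 1.00 | | | | | | | | | | | | | |
| Age | 0.29 | 1.00 | | | | | | | | | | | | |
| Wt | 0.61 | -0.01 | 1.00 | | | | | | | | | | | |
| Ht | -0.09 | -0.17 | 0.31 | 1.00 | | | | | | | | | | |
| Neck | 0.49 | 0.11 | 0.83 | 0.25 | 1.00 | | | | | | | | | |
| Chest | 0.70 | 0.18 | 0.89 | 0.13 | 0.78 | 1.00 | | | | | | | | |
| Abdomen | 0.81 | 0.23 | 0.89 | 0.09 | 0.75 | 0.92 | 1.00 | | | | | | | |
| Hip | 0.63 | -0.05 | 0.94 | 0.17 | 0.73 | 0.83 | 0.87 | 1.00 | | | | | | |
| Thigh | 0.56 | -0.20 | 0.87 | 0.15 | 0.70 | 0.73 | 0.77 | 0.90 | 1.00 | | | | | |
| Knee | 0.51 | 0.02 | 0.85 | 0.29 | 0.67 | 0.72 | 0.74 | 0.82 | 0.80 | 1.00 | | | | |
| Ankle | 0.27 | -0.11 | 0.61 | 0.26 | 0.48 | 0.48 | 0.45 | 0.56 | 0.54 | 0.61 | 1.00 | | | |
| Biceps | 0.49 | -0.04 | 0.80 | 0.21 | 0.73 | 0.73 | 0.68 | 0.74 | 0.76 | 0.68 | 0.48 | 1.00 | | |
| Forearm | 0.36 | -0.09 | 0.63 | 0.23 | 0.62 | 0.58 | 0.50 | 0.55 | 0.57 | 0.56 | 0.42 | 0.68 | 1.00 | |
| Wrist | 0.35 | 0.21 | 0.73 | 0.32 | 0.74 | 0.66 | 0.62 | 0.63 | 0.56 | 0.66 | 0.57 | 0.63 | 0.59 | 1.00 |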
In particular, Weight and Hip have high correlations with the other explanatory variables. The diagram below shows the variance inflation factors for the explanatory variables. Observe that Weight and Hip have the highest VIF values.
Three variables have a VIF over 10, so we should be aware that their coefficients could change substantially depending on the other explanatory variables in the model.