Mixtures of numerical and categorical explanatory variables

We have already examined models with the following types of explanatory variables:

By now, the pattern in such models should be clear. The mean response was modelled as the sum of terms that describe the effects of the individual explanatory variables,

The definition of each effect depends on whether the explanatory variable is numerical or categorical.

Numerical explanatory variable, X
This can be modelled using either a linear function of x or a polynomial function, such as

        or         

Categorical explanatory variable, X
This is modelled using indicator variables for all categories (groups) except a 'baseline' category,

We can use models of this form for any combination of numerical and categorical explanatory variables.

Since the model parameters appear linearly, all models of this form are general linear models (GLMs)

Hypothesis tests

In these models, we are interested in testing whether individual variables can be deleted from the full model. This is simplest when there are only 2-group categorical variables and linear terms for numerical variables in the model since the tests only ask whether a single parameter in the model is zero.

Single parameter for each explanatory variable
A table of parameter estimates, standard errors, t-statistics and p-values provides tests for whether each explanatory variable can be omitted from the full model.

When some categorical explanatory variables have 3 or more distinct categories, this approach does not work since 2 or more parameters should be simultaneously tested.

There are also problems with standard analysis of variance tables since they usually depend on the order of adding the explanatory variables to the model. (The explanatory variables are rarely orthogonal in practice when there are mixtures of explanatory variable types.) Several anova tables would be required to fully analyse the data.

Some explanatory variables with 2+ parameters
A more general test for whether 2 or more parameters are zero in the full model is based on the Type 3 sums of squares — the change in residual sum of squares that results from removing the corresponding variables. The mean Type 3 sums of squares are divided by the mean residual sum of squares to form an F ratio for which a p-value can be found.
These tests can be arranged in a table corresponding to the table of t-statistics for single parameters.

The tests using Type 3 sums of squares are a more general form of the t-tests for individual parameters.

The p-values in the table of Type 3 sums of squares are interpreted in exactly the same way as those in the table of t-statistics for single parameters.

The p-value for a single parameter is identical whether it is tested with a t-test or using its Type 3 sum of squares.

Body fat of AIS athletes

Data were collected from a sample of 202 elite athletes who were in training at the Australian Institute of Sport. We will use these data to model the percentage body fat of the athletes in terms of four numerical explanatory variables, Height, Weight, Red cell count and White cell count, and two categorical explanatory variables, Sex and Sport.

The table below shows the least squares estimates of the coefficients of the numerical explanatory variables and the indicator variables for the categorical explanatory variables. (The '±' values give 95% confidence intervals for them.)

The table of Type 3 sums of squares provides hypothesis tests for whether the six explanatory variables can be deleted from the full model.

Note that this is not an anova table. Since the sums of squares are not sequential, they do not add to the total sum of squares.

The only variables that are not significant (large p-values) are Red cell count and White cell count. Click the checkboxes for these two variables to remove them from the model. The remaining four explanatory variables are all highly significant, so we conclude that body fat is related to all four.

The following interpretation can be made of some of the model parameters:

  • For athletes of the same weight and height and in the same sport, women are estimated to have 9.6% more body fat than men.
  • For athletes of the same sex, sport and height, body fat is estimated to be 0.27% higher for each extra kg in weight.
  • For athletes of the same sex, height and weight, swimmers are estimated to have 3.53% less body fat than basketballers.