Long page descriptions

Chapter 4   Indicator Variables

4.1   Indicator variable for two groups

4.1.1   Model and hypotheses

We assume that the data arise as random samples from two normal populations with the same standard deviation.

4.1.2   Indicator variables

A linear model with an explanatory variable whose value is 0 or 1 depending on the group is equivalent to the model with separate parameters for each of the two group means.
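
For example, the equivalence can be checked directly; a minimal sketch with made-up numbers, fitting the indicator model by least squares:

    import numpy as np

    y = np.array([4.1, 5.0, 4.6, 6.2, 5.8, 6.5])  # made-up responses
    x = np.array([0, 0, 0, 1, 1, 1])               # indicator: 1 for group 2
    b1, b0 = np.polyfit(x, y, 1)                   # least squares line
    print(b0, y[x == 0].mean())                    # b0 equals the group 1 mean
    print(b0 + b1, y[x == 1].mean())               # b0 + b1 equals the group 2 mean

The intercept and slope (b0, b1) are just a reparameterisation of the two group means (mu1, mu2 - mu1).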

4.1.3   Least squares estimates

The least squares estimates for the GLM with an indicator variable give fitted values that are the two group means.

4.1.4   Test for equal group means

The t-test for whether the coefficient of the indicator variable is zero is identical to the standard t-test for comparing two sample means.
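
The identity can be verified numerically; a sketch with made-up data, using scipy for the two-sample test and statsmodels for the regression:

    import numpy as np
    import scipy.stats as stats
    import statsmodels.api as sm

    y = np.array([4.1, 5.0, 4.6, 6.2, 5.8, 6.5])  # made-up responses
    x = np.array([0, 0, 0, 1, 1, 1])               # indicator of group 2
    # standard two-sample t-test assuming equal variances
    t, p = stats.ttest_ind(y[x == 1], y[x == 0])
    # t-test for the indicator's coefficient in the regression
    fit = sm.OLS(y, sm.add_constant(x)).fit()
    print(t, fit.tvalues[1])                       # identical t statistics
    print(p, fit.pvalues[1])                       # identical two-sided p-values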

4.1.5   Sums of squares

The explained sum of squares can be interpreted as the sum of squares between groups; the residual sum of squares is the sum of squares within groups.
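
The decomposition can be checked directly; a minimal sketch with made-up samples:

    import numpy as np

    groups = [np.array([4.1, 5.0, 4.6]), np.array([6.2, 5.8, 6.5])]
    y = np.concatenate(groups)
    grand = y.mean()
    between = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)
    within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    total = ((y - grand) ** 2).sum()
    print(total, between + within)   # total SS = between-group SS + within-group SS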

4.1.6   Anova test for equal means

The anova F test for equal group means gives the same p-value and conclusion as the standard t-test, since the F statistic is the square of the t statistic.

4.2   Indicator variables for 3+ groups

4.2.1   Model for several groups

The normal model for several groups has the same standard deviation in each group but allows the means to be different.

4.2.2   Indicator variables

Different means in g groups can be modelled with (g-1) indicator variables whose GLM coefficients are the differences between the group means and that of a baseline group.
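
As a sketch with made-up data for g = 3 groups, with group 0 as the baseline, g-1 = 2 indicator columns reproduce this parameterisation:

    import numpy as np

    y = np.array([4.1, 5.0, 4.6, 6.2, 5.8, 6.5, 7.0, 7.4, 6.8])  # made-up data
    group = np.repeat([0, 1, 2], 3)           # group labels; group 0 is baseline
    d1 = (group == 1).astype(float)           # indicator for group 1
    d2 = (group == 2).astype(float)           # indicator for group 2
    X = np.column_stack([np.ones(9), d1, d2])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    # b[0] is the baseline mean; b[1] and b[2] are differences from it
    print(b[0], y[group == 0].mean())
    print(b[1], y[group == 1].mean() - y[group == 0].mean())
    print(b[2], y[group == 2].mean() - y[group == 0].mean())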

4.2.3   Least squares estimates

The least squares estimates for the GLM with indicator variables for the groups result in fitted values that are the group sample means.

4.2.4   Tests for separate indicator variables

T-tests for the coefficients of separate indicator variables test whether the mean of that group equals the mean of the baseline group. This may make sense if the baseline group is a control treatment, but in general there are too many pairwise comparisons to rely on the results of separate t-tests.

4.2.5   Explained and residual sums of squares

Testing for equal group means requires simultaneous testing of the coefficients of the (g-1) indicator variables using analysis of variance. The explained and residual sums of squares describe between-group and within-group variation.

4.2.6   Coefficient of determination

The coefficient of determination is the proportion of response variation that is explained by the groups.

4.2.7   Anova test for equal group means

An anova table with between-group and within-group sums of squares provides a test for equal group means.
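
A minimal sketch of the test (made-up samples), using scipy's one-way anova:

    import scipy.stats as stats

    a = [4.1, 5.0, 4.6]                # three made-up groups
    b = [6.2, 5.8, 6.5]
    c = [7.0, 7.4, 6.8]
    F, p = stats.f_oneway(a, b, c)     # tests H0: all group means are equal
    print(F, p)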

4.2.8   Can groups be combined? (advanced)

Many parameterisations are possible for the model with arbitrary group means. An example is given that allows testing whether a subset of the group means are equal.

4.3   Numerical & categorical variables

4.3.1   Model for X and 2 groups

A simple linear model for Y against X can be augmented with a 0/1 indicator variable distinguishing the groups. The model is a GLM and can be represented by two parallel lines on the scatterplot of Y vs X.
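
A sketch with made-up data: the design matrix holds a constant, the numerical variable x and one indicator, and the fitted model is two parallel lines.

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0])  # made-up data
    d = np.array([0, 0, 0, 0, 1, 1, 1, 1])                  # indicator of group 2
    y = np.array([2.1, 3.0, 3.8, 5.1, 3.9, 5.2, 5.8, 7.0])
    X = np.column_stack([np.ones(8), x, d])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    # group 1 line: y = b[0] + b[1]*x
    # group 2 line: y = (b[0] + b[2]) + b[1]*x   (same slope, shifted intercept)
    print(b)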

4.3.2   Inference for X and 2 groups

A t-test for whether the coefficient of the indicator variable is zero tests whether the two parallel lines coincide, i.e. whether the groups differ after allowing for X.

4.3.3   Model for X and 3+ groups

If there are g groups, (g-1) indicator variables can be added to the simple linear model. This corresponds to g parallel lines for the groups on the scatterplot of Y vs X.

4.3.4   Inference for X and 3+ groups

Analysis of variance provides a single test for differences between the groups. If X is not orthogonal to the groups, there are two different anova tables corresponding to the two orders of adding X and the indicator variables.
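
The order-dependence can be seen from statsmodels' sequential (Type 1) anova tables; a sketch with made-up data in which x is not orthogonal to the groups:

    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.formula.api import ols

    df = pd.DataFrame({                      # made-up, non-orthogonal data
        'x': [1.0, 2.0, 3.0, 1.5, 2.5, 3.5],
        'g': ['A', 'A', 'A', 'B', 'B', 'B'],
        'y': [2.0, 3.1, 3.9, 4.2, 5.0, 6.1]})
    # sequential (Type 1) sums of squares depend on the order of the terms
    print(sm.stats.anova_lm(ols('y ~ x + C(g)', data=df).fit(), typ=1))
    print(sm.stats.anova_lm(ols('y ~ C(g) + x', data=df).fit(), typ=1))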

4.3.5   Categorical variables and groups

Observations that are split into groups can be equivalently considered as a single data set with a categorical variable defining group membership.

4.3.6   Model for 2 categorical variables

Data with two categorical explanatory variables often arise from designed experiments. Two sets of indicator variables can be used to model the effects of the two variables in a GLM.

4.3.7   Inference for 2 categorical variables

The effects of the categorical explanatory variables can be tested with analysis of variance. The categorical explanatory variables are usually orthogonal in designed experiments and a single anova table can test both variables.

4.3.8   Mixtures of explanatory variable types

The models in this section can be extended with terms for any mixture of numerical and categorical explanatory variables. If one or more categorical explanatory variables have 3+ levels, F-tests based on Type 3 sums of squares should be used to test significance instead of t-tests for individual parameters.
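
For example (a sketch reusing the same kind of made-up data frame as above), statsmodels can produce Type 3 tables; sum-to-zero contrasts are generally advised with Type 3 sums of squares:

    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.formula.api import ols

    df = pd.DataFrame({                      # made-up data
        'x': [1.0, 2.0, 3.0, 1.5, 2.5, 3.5],
        'g': ['A', 'A', 'A', 'B', 'B', 'B'],
        'y': [2.0, 3.1, 3.9, 4.2, 5.0, 6.1]})
    # C(g, Sum) requests sum-to-zero contrasts for the categorical variable
    model = ols('y ~ x + C(g, Sum)', data=df).fit()
    print(sm.stats.anova_lm(model, typ=3))   # one F-test per term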

4.4   Interactions

4.4.1   Interaction between numerical variables

Interaction between two numerical variables can be modelled in a GLM with a term involving the product of the variables.
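
In formula notation the product term is written x1:x2; a minimal sketch with made-up data:

    import pandas as pd
    from statsmodels.formula.api import ols

    df = pd.DataFrame({                      # made-up data
        'x1': [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
        'x2': [1.0, 1.0, 1.0, 2.0, 2.0, 2.0],
        'y':  [2.1, 3.0, 4.2, 2.8, 4.6, 6.5]})
    # x1:x2 adds the product of the two variables as an extra term
    fit = ols('y ~ x1 + x2 + x1:x2', data=df).fit()
    print(fit.params)                        # x1:x2 coefficient is the interaction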

4.4.2   Numerical-categorical interaction

If there is no interaction between a numerical and categorical explanatory variable, the regression lines for all categories are parallel. If these regression lines are not parallel, then there is interaction. Extra terms can be added to the no-interaction GLM to model the interaction.

4.4.3   Inference for num-cat interaction

The existence of interaction can be tested with a test for whether the parameters for the interaction terms are zero. This can be done with a t-test if there are only 2 categories, but an F-test is needed for more categories.
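
A sketch of the test with made-up data for two groups (with 3+ categories the same model comparison gives the F-test): fit the models with and without the interaction terms and compare them.

    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.formula.api import ols

    df = pd.DataFrame({                             # made-up data, two groups
        'x': [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0],
        'g': ['A'] * 4 + ['B'] * 4,
        'y': [2.1, 3.0, 3.8, 5.1, 3.0, 5.2, 6.9, 9.1]})
    parallel = ols('y ~ x + C(g)', data=df).fit()   # common slope
    separate = ols('y ~ x * C(g)', data=df).fit()   # one slope per group
    # tests whether all interaction coefficients are zero
    print(sm.stats.anova_lm(parallel, separate))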

4.4.4   Interaction between categorical variables

Indicator variables can be added to the no-interaction GLM to model an interaction between 2 categorical variables.

4.4.5   Inference for categorical interaction

Testing for interaction is equivalent to testing whether the parameters for the interaction indicator variables are all zero.
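
A sketch with a made-up 2x2 design: the interaction indicators are products of the two sets of main-effect indicators, and a single F-test covers them all at once.

    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.formula.api import ols

    df = pd.DataFrame({          # made-up 2x2 design, 2 replicates per cell
        'a': ['lo', 'lo', 'hi', 'hi'] * 2,
        'b': ['p', 'q', 'p', 'q'] * 2,
        'y': [2.0, 3.1, 4.2, 7.9, 2.2, 2.9, 4.0, 8.3]})
    model = ols('y ~ C(a) + C(b) + C(a):C(b)', data=df).fit()
    print(sm.stats.anova_lm(model, typ=1))   # the C(a):C(b) row tests the interaction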

4.4.6   Interactions with several variables

An example is shown with several main effects and interactions.

4.5   Goodness-of-fit and pure error

4.5.1   Modelling nonlinearity with g groups

When there are several response values at each x, the most general model for curvature allows for an arbitrary response mean at each x. This model places no constraint on the shape of the curvature.

4.5.2   Understanding the sums of squares

The nonlinearity sum of squares describes the distances of the group means from the fitted straight line.

4.5.3   Test for linearity

An F ratio comparing the nonlinearity and residual (pure error) sums of squares provides a test for linearity.
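
A from-scratch sketch of the test, with made-up data containing three response values at each of g = 4 x-values:

    import numpy as np
    from scipy import stats

    x = np.repeat([1.0, 2.0, 3.0, 4.0], 3)              # made-up data
    y = np.array([2.0, 2.3, 1.9, 3.9, 4.2, 4.0,
                  5.2, 5.0, 5.4, 5.8, 6.1, 5.9])
    g, n = 4, len(y)
    slope, intercept = np.polyfit(x, y, 1)
    line = intercept + slope * np.unique(x)             # fitted line at each x
    gmeans = np.array([y[x == v].mean() for v in np.unique(x)])
    ss_nonlin = (3 * (gmeans - line) ** 2).sum()        # nonlinearity SS, g-2 df
    ss_pure = sum(((y[x == v] - y[x == v].mean()) ** 2).sum()
                  for v in np.unique(x))                # pure error SS, n-g df
    F = (ss_nonlin / (g - 2)) / (ss_pure / (n - g))
    print(F, stats.f.sf(F, g - 2, n - g))               # p-value for the linearity test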

4.5.4   Contrasts for testing linearity (optional)

The g-group model that allows for arbitrary response means at each x can be parameterised in a way that makes the test for linearity equivalent to testing whether the coefficients of (g-2) indicator variables are zero.

4.5.5   Comparison of tests

The quadratic test is more likely to detect 'smooth' nonlinearity. The test based on the factor model is better at detecting more irregular types of nonlinearity, including those that can arise from badly randomised experiments.

4.5.6   Goodness-of-fit for other models

In experiments where there are repeated response measurements at different x-z combinations it is possible to perform a more general anova test about the fit of a model. This can detect both curvature and interaction.