This page describes notation for experimental data -- experimental units, controlled variables and a response.
Decisions must be made about the experimental units to use, the response to measure from each, the controlled variables to vary and the values of these variables to use. These are generally chosen for non-statistical reasons.
Experimental design defines which experimental treatments are applied to which units. A poor design can lead to incorrect conclusions. Randomising the allocation of treatments to units guards against biased results.
If the experimental units are not identical, grouping them into blocks of similar units improves the accuracy of comparisons between treatments.
In the simplest type of experiment, there are no known differences between the experimental units and a single factor is varied. In a completely randomised design, the different levels of the factor are randomly allocated to the pool of experimental units.
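As a minimal sketch of the random allocation described above, the following Python shuffles a list of factor levels over a pool of units. The function name, the unit labels and the assumption of equal replication are illustrative and do not come from the original pages.

```python
import random

def completely_randomised_design(units, levels, seed=None):
    """Randomly allocate factor levels to units with equal replication (assumed)."""
    if len(units) % len(levels) != 0:
        raise ValueError("number of units must be a multiple of the number of levels")
    reps = len(units) // len(levels)
    # One copy of each level per replicate, then shuffle the whole list.
    treatments = [level for level in levels for _ in range(reps)]
    rng = random.Random(seed)
    rng.shuffle(treatments)
    return dict(zip(units, treatments))

# Hypothetical experiment: 12 plots and a factor with 3 levels.
allocation = completely_randomised_design(
    units=[f"plot{i}" for i in range(1, 13)],
    levels=["A", "B", "C"],
    seed=1,
)
```

Fixing the seed only makes the allocation reproducible; a different seed gives a different, equally valid allocation.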
This page gives a few data sets from completely randomised experiments for a single factor.
Varying the controlled factor causes variability in the response -- explained variation. Other response variability remains unexplained.
The mean responses at the different factor levels summarise differences between the treatments -- the explained variation.
The response is usually modelled as the sum of two terms, a term for explained variation that depends on the factor level and a random term with a normal distribution describing unexplained variation.
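The two-term model above can be illustrated by simulating from it. This is a sketch only: the level means, the error standard deviation and the function name are made-up values chosen for illustration.

```python
import random

def simulate_response(level_means, sigma, reps, seed=None):
    """Simulate response = (mean for the factor level) + normal error with sd sigma."""
    rng = random.Random(seed)
    return {
        level: [mu + rng.gauss(0.0, sigma) for _ in range(reps)]
        for level, mu in level_means.items()
    }

# Hypothetical level means and error standard deviation.
simulated = simulate_response({"A": 21.0, "B": 26.0, "C": 23.0}, sigma=1.5, reps=4, seed=0)
```

Setting sigma to zero removes the unexplained variation, so every simulated response equals the mean for its factor level.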
If the relationship between the response and x is nonlinear, the mean response can be modelled with a quadratic function of x. An even more general model uses a separate parameter for the mean response at each x that is used; it is also appropriate for a categorical explanatory variable.
The simplest model for an experiment with one numerical controlled variable, x, is a linear model in which the mean response is a linear function of x.
All models involve unknown parameters. The least squares estimates of the parameters minimise the sum of squared residuals.
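For a straight-line model, the least squares estimates have a simple closed form. The sketch below computes them and checks that nudging the slope away from its estimate increases the sum of squared residuals. The data and the function names are hypothetical.

```python
def fit_line(x, y):
    """Least squares intercept and slope for the model: mean response = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

def residual_ss(x, y, intercept, slope):
    """Sum of squared residuals for a given intercept and slope."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

# Made-up data: three values of x, two replicates at each.
x = [1, 1, 2, 2, 3, 3]
y = [2.0, 3.0, 4.0, 5.0, 5.0, 7.0]

intercept, slope = fit_line(x, y)
best_rss = residual_ss(x, y, intercept, slope)
# Any other parameter values give a larger sum of squared residuals.
nudged_rss = residual_ss(x, y, intercept, slope + 0.1)
```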
If only 2 values of a numerical factor are used in an experiment, a linear model has identical fit to a model that treats the factor as categorical. If 3 values of the factor have been used, a quadratic model is equivalent to a model that treats the factor as categorical.
Evenly spaced values of a numerical factor can be replaced by any other evenly spaced values, such as 1, 2, ... without changing the fit of the model. A numerical or categorical factor with 2 levels is often modelled as a numerical factor with values -1 and +1.
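Both statements can be checked numerically for a factor with two levels. In the sketch below (made-up data, hypothetical function name), the fitted values from a straight line equal the two treatment means, and recoding the levels as -1 and +1 leaves the fitted values unchanged.

```python
from statistics import mean

def fitted_values(x, y):
    """Fitted values from the least squares straight line of y on x."""
    x_bar, y_bar = mean(x), mean(y)
    slope = (
        sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
        / sum((xi - x_bar) ** 2 for xi in x)
    )
    return [y_bar + slope * (xi - x_bar) for xi in x]

# Made-up data: a numerical factor run at two levels, 10 and 20.
x = [10, 10, 10, 20, 20, 20]
y = [4.0, 5.0, 6.0, 9.0, 8.0, 10.0]

fit_original = fitted_values(x, y)

# Recode the two levels as -1 and +1 and refit.
coded = [-1 if xi == 10 else 1 for xi in x]
fit_coded = fitted_values(coded, y)

# Treatment means, as a categorical model would fit them.
low_mean = mean(y[:3])
high_mean = mean(y[3:])
```

The slope itself does change with the coding (it is expressed per unit of the coded variable), but the fitted values and the residuals do not.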
Assessing whether a categorical factor affects the response must take into account both variation between the treatment means and also variation within each factor level.
For experiments with numerical factors, the ideas of between- and within-treatment variation must be generalised to explained and unexplained variation. Both types of variation affect our assessment of whether the factor affects the response.
Explained and unexplained variation are summarised by quantities called explained and unexplained sums of squares.
The explained and unexplained sums of squares form the basis of an analysis of variance table that can be used to test whether the factor really does affect the response.
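For a categorical factor, the sums of squares and the resulting analysis of variance table can be sketched as follows. The data are made up and the function name and dictionary keys are illustrative. The F ratio would be compared with an F distribution on the two degrees of freedom shown, and that step is not coded here.

```python
from statistics import mean

def one_way_anova(groups):
    """Explained and unexplained sums of squares, mean squares and F ratio."""
    all_y = [y for ys in groups.values() for y in ys]
    grand_mean = mean(all_y)
    # Explained: variation of the treatment means about the overall mean.
    explained_ss = sum(len(ys) * (mean(ys) - grand_mean) ** 2 for ys in groups.values())
    # Unexplained: variation of responses about their own treatment mean.
    unexplained_ss = sum((y - mean(ys)) ** 2 for ys in groups.values() for y in ys)
    df_explained = len(groups) - 1
    df_unexplained = len(all_y) - len(groups)
    ms_explained = explained_ss / df_explained
    ms_unexplained = unexplained_ss / df_unexplained
    return {
        "explained": (explained_ss, df_explained, ms_explained),
        "unexplained": (unexplained_ss, df_unexplained, ms_unexplained),
        "total_ss": sum((y - grand_mean) ** 2 for y in all_y),
        "f_ratio": ms_explained / ms_unexplained,
    }

# Made-up responses at three factor levels.
data = {
    "A": [20.0, 22.0, 21.0, 21.0],
    "B": [25.0, 27.0, 26.0, 26.0],
    "C": [23.0, 22.0, 24.0, 23.0],
}
table = one_way_anova(data)
```

The explained and unexplained sums of squares add up to the total sum of squares about the overall mean, which gives a useful check on the arithmetic.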
A linear model is the simplest one for a numerical factor, but a quadratic model and one that treats the factor as categorical allow increasing degrees of curvature in the relationship. Models that allow more curvature never have larger residual sums of squares.
The explained sum of squares for changing from a quadratic to a categorical model is the basis of an anova test of goodness-of-fit of a quadratic model. The explained sum of squares for changing from a linear to a quadratic model can be used to test for curvature.
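The sequence of models can be compared by fitting each one and differencing the residual sums of squares. The sketch below does this for made-up data with four factor levels. The helper names are hypothetical, and dividing each sum of squares by the categorical model's residual mean square is one common choice of denominator, assumed here rather than taken from the original pages.

```python
from statistics import mean

def solve(a, b):
    """Solve the square system a * coef = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    coef = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(m[r][c] * coef[c] for c in range(r + 1, n))
        coef[r] = (m[r][n] - known) / m[r][r]
    return coef

def poly_rss(x, y, degree):
    """Residual sum of squares for the least squares polynomial of this degree."""
    cols = [[xi ** p for xi in x] for p in range(degree + 1)]
    xtx = [[sum(u * v for u, v in zip(ci, cj)) for cj in cols] for ci in cols]
    xty = [sum(u * yi for u, yi in zip(ci, y)) for ci in cols]
    coef = solve(xtx, xty)
    fitted = [sum(c * xi ** p for p, c in enumerate(coef)) for xi in x]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))

def categorical_rss(x, y):
    """Residual sum of squares when each distinct x has its own mean."""
    groups = {}
    for xi, yi in zip(x, y):
        groups.setdefault(xi, []).append(yi)
    return sum((yi - mean(ys)) ** 2 for ys in groups.values() for yi in ys)

# Made-up data: four evenly spaced factor levels, two replicates each.
x = [1, 1, 2, 2, 3, 3, 4, 4]
y = [2.0, 3.0, 5.0, 6.0, 7.0, 7.0, 6.0, 7.0]

rss_linear = poly_rss(x, y, 1)
rss_quadratic = poly_rss(x, y, 2)
rss_categorical = categorical_rss(x, y)

curvature_ss = rss_linear - rss_quadratic         # linear -> quadratic, 1 df
lack_of_fit_ss = rss_quadratic - rss_categorical  # quadratic -> categorical, 1 df here
error_ms = rss_categorical / (len(y) - len(set(x)))
f_curvature = curvature_ss / error_ms
f_lack_of_fit = lack_of_fit_ss / error_ms
```

With these made-up numbers the curvature F ratio is large and the lack-of-fit F ratio is small, which is the pattern expected when a quadratic describes the treatment means well.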
The mean residual sum of squares estimates the variance of the 'errors' in the model. This is also the variance of replicate observations within any factor level.
Confidence intervals for the treatment means provide a good summary of the effect of a factor.
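A sketch of such intervals, assuming a categorical factor and a common error variance estimated by the mean residual sum of squares. The data are made up, the function name is hypothetical, and the t critical value is supplied by the caller from tables because the Python standard library has no t distribution.

```python
from math import sqrt
from statistics import mean

def treatment_mean_intervals(groups, t_crit):
    """Pooled error variance and an interval (mean -/+ t * standard error) per level."""
    n_total = sum(len(ys) for ys in groups.values())
    df_residual = n_total - len(groups)
    residual_ss = sum((y - mean(ys)) ** 2 for ys in groups.values() for y in ys)
    # The mean residual sum of squares estimates the error variance.
    error_variance = residual_ss / df_residual
    intervals = {}
    for level, ys in groups.items():
        half_width = t_crit * sqrt(error_variance / len(ys))
        intervals[level] = (mean(ys) - half_width, mean(ys) + half_width)
    return error_variance, intervals

# Made-up responses at three factor levels.
data = {
    "A": [20.0, 22.0, 21.0, 21.0],
    "B": [25.0, 27.0, 26.0, 26.0],
    "C": [23.0, 22.0, 24.0, 23.0],
}
# 2.262 is the tabulated upper 2.5% point of the t distribution with 9 df,
# giving 95% intervals for this layout (12 observations, 3 levels).
error_variance, intervals = treatment_mean_intervals(data, t_crit=2.262)
```

Because the error variance is pooled over all factor levels, every interval has the same width when the levels are equally replicated.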