Multiple hypothesis tests
Analysis of many data sets involves several hypothesis tests, either testing different independent aspects of the data or testing terms in sequence. For example, analysing the 4-factor experiment at the end of the previous section required sequentially testing several interaction terms to find the simplest model that was consistent with the data.
A more extreme example arises with microarray experiments in genetics, where tens of thousands of response measurements may be made. If an analysis of variance is applied separately to each response, the same number of hypothesis tests results.
When testing sequentially to find the 'best' model for a data set, we usually treat the p-values only as rough guidance and should avoid interpreting them as strict probabilities. However, it is important to understand the problems that can arise when multiple tests are conducted.
Probability of at least one significant result
First, consider n independent hypothesis tests. If we conclude that a particular test is significant when its p-value is under 0.05, then the probability that at least one of the n tests gives a false positive result (a p-value under 0.05 when the null hypothesis is true) is
P(at least one significant result) = 1 - (1 - 0.05)ⁿ
As n increases, this probability gets further from the nominal significance level of the individual tests, 0.05, and closer to 1.
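As a quick check, the formula can be evaluated for a few values of n. The following is a minimal Python sketch using only the standard library:

```python
# Probability of at least one false positive among n independent tests,
# each conducted at significance level 0.05.
alpha = 0.05

for n in (1, 5, 10, 20, 50, 100):
    p_any = 1 - (1 - alpha) ** n
    print(f"n = {n:3d}:  P(at least one significant result) = {p_any:.3f}")
```

With 20 independent tests the probability is already about 0.64, and with 100 tests it exceeds 0.99.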
In many practical situations, the n tests are not independent. An exact formula for the probability cannot then be given, but it can be shown that
P(at least one significant result) ≤ n × 0.05
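The bound can be illustrated with a small Monte Carlo simulation. The sketch below assumes numpy and scipy are available, and uses a hypothetical dependence structure in which the test statistics share a common pairwise correlation rho:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_tests, n_sims, rho = 10, 100_000, 0.5

# Correlated z-statistics under H0: multivariate normal with a
# common pairwise correlation rho (a hypothetical dependence structure).
cov = np.full((n_tests, n_tests), rho)
np.fill_diagonal(cov, 1.0)
z = rng.multivariate_normal(np.zeros(n_tests), cov, size=n_sims)

p = 2 * stats.norm.sf(np.abs(z))          # two-sided p-values
any_sig = (p < 0.05).any(axis=1).mean()

print(f"Estimated P(at least one significant): {any_sig:.3f}")
print(f"If the tests were independent:         {1 - 0.95 ** n_tests:.3f}")
print(f"Bonferroni upper bound:                {min(n_tests * 0.05, 1.0):.3f}")
```

The estimated probability lies below both the independent-tests value and the upper bound, as expected.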
Adjusting the critical value for hypothesis tests
The two formulae above can be inverted to give the critical value needed for a specified overall probability of at least one test being significant. For example, inverting the upper bound for dependent tests shows that, for an overall significance level of 0.05, we should only reject the null hypothesis for an individual test if:
p-value < 0.05 / n
This adjustment is called a Bonferroni correction.
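As a sketch of how the correction is applied in practice (the p-values here are hypothetical; a real analysis would take them from the fitted model):

```python
def bonferroni_significant(p_values, overall_alpha=0.05):
    """Flag each test as significant only if its p-value is below
    overall_alpha / n, where n is the number of tests."""
    threshold = overall_alpha / len(p_values)
    return [p < threshold for p in p_values]

# Hypothetical p-values from six tests; the per-test threshold is
# 0.05 / 6 ≈ 0.0083, so only the first and fifth are flagged.
p_values = [0.003, 0.040, 0.012, 0.200, 0.008, 0.650]
print(bonferroni_significant(p_values))
# [True, False, False, False, True, False]
```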
Fixed significance levels for individual tests
Drag the slider in the above diagram to see how quickly the overall probability of at least one of several tests being significant increases.
Fixed overall significance level
The following diagram does the inverse by fixing the overall significance level and showing the significance levels required for the individual tests to achieve this.
Again drag the slider to the right. Observe that much stronger evidence is needed from the individual tests to keep the overall probability of getting at least one significant result (when H0 holds) at 0.05.
Factorial experiment with four factors
Consider a factorial experiment with four factors in which we test in sequence the 4-factor interaction, the 3-factor and 2-factor interactions, and finally test the main effects if no interactions are significant. (The analysis at the end of the previous section was of this form.)
If none of the factors actually affects the response, there is still a probability of more than 0.5 that at least one of the 15 interaction and main effect tests will give a p-value less than 0.05.
If none of the factors affects the response, we might want our testing to reach this conclusion with probability 0.95. Since this requires that none of the 15 interaction and main effect tests is significant, the Bonferroni correction implies that an individual p-value should be below 0.05 / 15 ≈ 0.0033 before we treat the corresponding term as significant.
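Both numbers quoted above can be reproduced directly. The sketch below uses the independence formula from earlier on this page, even though the 15 ANOVA tests are not strictly independent:

```python
alpha = 0.05
n_terms = 15   # 4 main effects + 6 two-factor + 4 three-factor
               # + 1 four-factor interaction

p_false_positive = 1 - (1 - alpha) ** n_terms
print(f"P(at least one p-value < 0.05) = {p_false_positive:.2f}")   # 0.54

bonferroni_level = alpha / n_terms
print(f"Per-test significance level    = {bonferroni_level:.4f}")   # 0.0033
```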
In practice
In any hypothesis test, there are two possible types of error:
| True state of nature | Decision: accept H0 | Decision: reject H0 |
|---|---|---|
| H0 is true | correct | Type I error |
| HA (H0 is false) | Type II error | correct |
Adjusting the decision rule to reduce the probability of a Type I error generally increases the probability of a Type II error. The Bonferroni correction reduces the probability of a Type I error for individual tests, so
The Bonferroni correction increases the probability that situations in which H0 is false will not be detected.
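This loss of power can be demonstrated with a small simulation. The sketch below assumes two-sample t-tests on normal data via scipy; the sample size and effect size are arbitrary illustrative choices. It compares detection rates at the raw 0.05 level and at the Bonferroni-corrected level 0.05 / 15:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_sims, n_obs, effect, n_tests = 2_000, 10, 1.0, 15

raw_hits = bonf_hits = 0
for _ in range(n_sims):
    # One test in which H0 is false: the group means differ by `effect`.
    x = rng.normal(0.0, 1.0, n_obs)
    y = rng.normal(effect, 1.0, n_obs)
    p = stats.ttest_ind(x, y).pvalue
    raw_hits += p < 0.05
    bonf_hits += p < 0.05 / n_tests

print(f"Detection rate at 0.05:      {raw_hits / n_sims:.2f}")
print(f"Detection rate at 0.05 / 15: {bonf_hits / n_sims:.2f}")
```

In this simulation the corrected test detects the real difference far less often than the uncorrected one.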
Failing to detect factors (or interactions) that really do affect the response usually has consequences as serious as concluding that terms are significant when they are not. Therefore:
Except in special situations, Bonferroni corrections are rarely used.
However, if multiple tests have been performed, it is important to take the results on this page into account when reporting conclusions about a significant result.