The term statistical inference refers to techniques for obtaining information about a statistical model's parameters based on data from the model. There are two different but related types of question about the parameter (or parameters) that we might ask.
Uncertainty and strength of evidence
A distribution's parameter cannot be determined exactly from a single random sample — there is a 5% chance that a 95% confidence interval will not include the true parameter value.
In a similar way, a single random sample can rarely provide enough information about a parameter to allow us to be sure whether or not any statement about it will be true. The best we can hope for is an indication of the strength of the evidence against the statement.
All types of hypothesis test conform to the same framework.
Data and model
Hypothesis tests are based on data that are collected by some random mechanism. We can usually specify some characteristics of this random mechanism — a model for the data.
In this e-book, we assume that the data are a random sample from some distribution and may even be able to argue that this distribution belongs to a specific family such as a Poisson distribution. We concentrate on some specific characteristic of this family of distributions — a parameter of the distribution whose value is unknown.
Null and alternative hypotheses
In hypothesis testing, we want to compare two statements about an unknown parameter in the model.
The null hypothesis is the more restrictive of the two hypotheses and often specifies a single value for the unknown parameter such as \(\alpha = 0\). It is a 'default' value that can be accepted as holding if there is no evidence against it. A researcher often collects data with the express hope of disproving the null hypothesis.
If the null hypothesis is not true, we say that the alternative hypothesis holds. (You can understand most of hypothesis testing without paying much attention to the alternative hypothesis however!)
Either the null hypothesis or the alternative hypothesis must be true.
Simplifying the null hypothesis
In some situations, both the null and alternative hypotheses cover ranges of values for the parameter. To simplify the analysis, we do the test as though the null hypothesis specified the single value closest to the alternative hypothesis range.
For example, we treat the hypotheses
in exactly the same way as
A hypothesis test is based on a quantity called a test statistic that is a function of the data. Because it is found from random data, it can also be treated as a random variable with a distribution. The test statistic should have the following properties.
Binomial example
In a situation where we are interested in the probability of success, \(\pi\), our data might be the successes and failures in \(n = 90\) independent trials. To test the hypotheses,
The number of successes, \(X\), might be used as a test statistic since its distribution is fully known when the null hypothesis holds.
The null and alternative hypotheses are treated differently in statistical hypothesis testing. We compare them by asking ...
Are the data consistent with the null hypothesis?
A hypothesis test is based on a p-value — the probability of getting a value of the test statistic as "extreme" as the one calculated from the actual data set, assuming that the null hypothesis holds.
Definition
For a hypothesis test using a test statistic \(T\), if the values of \(T\) that favour the alternative hypothesis more than the observed value of the test statistic, \(t\), are the set \(A\), the p-value for the test is the probability of such a value when the null hypothesis holds,
\[ \text{p-value} \;\;=\;\; P(T \in A \mid H_0) \]For example, if large values of the test statistic, \(T\), would favour the alternative hypothesis and it is evaluated to be \(t\) from the recorded data, the p-value is
\[ \text{p-value} \;\;=\;\; P(T \ge t \mid H_0) \]Since we know the distribution of the test statistic when the null hypothesis holds, the p-value can always be evaluated.
Interpretation
A small p-value means that our data would have been unlikely if the null hypothesis was true. This gives evidence that the data are not consistent with the null hypothesis. The following table may be regarded as an oversimplification, but can be used as a guide to interpreting p-values.
p-value | Interpretation |
---|---|
over 0.1 | no evidence that the null hypothesis does not hold |
between 0.05 and 0.1 | very weak evidence that the null hypothesis does not hold |
between 0.01 and 0.05 | moderately strong evidence that the null hypothesis does not hold |
under 0.01 | strong evidence that the null hypothesis does not hold |
Binomial example
If \(X\) is the number of successes in \(n = 90\) independent binomial trials, \(X \sim \BinomDistn(n=90, \pi)\). We might want to test
If \(x = 36\) successes were recorded, the p-value is the probability of 36 or more successes from a \(\BinomDistn(n=90, \diagfrac {\small 1} {\small 3})\) distribution.
There would be an 11% chance of getting 36 or more successes if \(\pi\) were \(\diagfrac {\small 1} {\small 3}\), so the data would not be particularly unusual if H0 was true. We would therefore conclude that the data are consistent with the null hypothesis and there is no evidence suggesting that the alternative hypothesis is really true.
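The tail probability itself is easily evaluated with statistical software. As a minimal sketch (here in Python with scipy.stats, our choice of software rather than anything used in the e-book):

```python
from scipy import stats

# P(X >= 36) when X ~ BinomDistn(n=90, pi=1/3), the p-value under H0
p_value = stats.binom.sf(35, n=90, p=1/3)   # sf(35) = P(X > 35) = P(X >= 36)
print(f"p-value = {p_value:.3f}")           # about 0.11, matching the text
```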
One- and two-tailed tests
In some situations, the alternative hypothesis only allows for probabilities on one side of the null hypothesis value, such as
This is called a one-tailed test. However the alternative hypothesis often allows for parameter values on either side of the null hypothesis value — called a two-tailed test. An example would be
P-value for two-tailed test
A two-tailed test is usually based on the same test statistic that would be used for the corresponding one-tailed test, but values in both tails of its distribution usually give support to the alternative hypothesis.
The p-value is double the smaller tail area of the test statistic's distribution, to take account of the fact that values in the opposite tail of the distribution would give equally strong evidence against H0.
Example: Ethics codes in companies
In 1999, The Conference Board surveyed 124 companies and found that 97 had their own ethics codes ("Business Bulletin", Wall Street Journal, Aug 19, 1999). In 1997, it was believed that 72% of companies had ethics codes, so is there any evidence that the proportion has changed?
If H0 is true, the number with an ethics code will be
\[ X \;\;\sim\;\; \BinomDistn(n=124, \pi = 0.72) \]The p-value is found from this distribution,
Getting 97 companies with ethics codes is therefore not unlikely, so we conclude that there is no evidence from these data of a change in the proportion of companies with ethics codes since 1997.
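As a sketch of the calculation (Python with scipy.stats), doubling the smaller binomial tail as described above:

```python
from scipy import stats

n, pi0, x = 124, 0.72, 97
lower = stats.binom.cdf(x, n, pi0)       # P(X <= 97) under H0
upper = stats.binom.sf(x - 1, n, pi0)    # P(X >= 97) under H0
p_value = 2 * min(lower, upper)          # double the smaller tail area
print(f"p-value = {p_value:.3f}")
```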
Distribution of p-values
In any hypothesis test, the p-value is itself a random quantity whose distribution depends on whether the null or alternative hypothesis holds.
The diagram below shows typical distributions that might be obtained.
P-values and probability
P-values have a rectangular distribution between 0 and 1 when H0 holds. A consequence of this is that the probability of obtaining a p-value of 0.1 or lower is exactly 0.1 (when H0 holds). This is illustrated on the left of the diagram below.
Similarly, the probability of obtaining a p-value of 0.01 or lower is exactly 0.01, etc. (when H0 holds).
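This property is easy to check by simulation. The sketch below (Python with numpy and scipy.stats; the choice of a one-sample t-test on normal samples is ours, purely for illustration) generates many data sets for which H0 is true and records the p-values:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# 10,000 samples of n = 30 values from a normal distribution with mu = 10,
# each tested against H0: mu = 10, so the null hypothesis is always true
p_values = np.array([stats.ttest_1samp(rng.normal(10, 2, size=30), 10).pvalue
                     for _ in range(10_000)])

# Under H0 the p-values should be roughly rectangular on (0, 1)
print(np.mean(p_values <= 0.1))    # close to 0.1
print(np.mean(p_values <= 0.01))   # close to 0.01
```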
P-values are more likely to be near 0 than near 1 if the alternative hypothesis holds
Counts with Poisson distributions
Suppose that we have \(k\) independent discrete random variables \(\{X_1, X_2,\dots, X_k\}\) that are counts of events. We will now consider whether they might be counts from Poisson processes in which the rate of events for \(X_i\) is \(\lambda_i\),
\[ X_i \;\;\sim\;\; \PoissonDistn(\lambda_i) \]If this model holds then, from the properties of Poisson distributions,
\[ E[X_i] \;\;=\;\; \Var(X_i) \;\;=\;\; \lambda_i \]Standardised versions of these counts
\[ Z_i \;\;=\;\; \frac{X_i - \lambda_i}{\sqrt{\lambda_i}} \]will have mean zero and standard deviation one. If the \(\{\lambda_i\}\) are large,
\[ Z_i \;\;=\;\; \frac{X_i - \lambda_i}{\sqrt{\lambda_i}} \;\; \underset{\text{approx}}{\sim} \;\; \NormalDistn(0,1) \]

Chi-squared statistic
If the Poisson model holds,
\[ \sum_{i=1}^k {Z_i^2} \;\;=\;\; \sum_{i=1}^k {\frac{\left(X_i - \lambda_i\right)^2}{\lambda_i}} \;\; \underset{\text{approx}}{\sim} \;\; \ChiSqrDistn(k \text{ df}) \]In the context of goodness-of-fit tests, this is often written as
\[ X^2 \;\;=\;\; \sum_{i=1}^k {\frac{\left(O_i - E_i\right)^2}{E_i}} \;\; \underset{\text{approx}}{\sim} \;\; \ChiSqrDistn(k \text{ df}) \]In practice, this approximation is reasonable provided most of the \(\{E_i\}\) — the Poisson means — are reasonably large. The usual guideline is that all of the \(\{E_i\}\) should be at least 1, and at least 80% of them should be at least 5.
If these guidelines are not met, the chi-squared distribution should not be used to find probabilities related to \(X^2\).
We now consider a hypothesis test about whether such a Poisson model underlies a set of counts,
Test statistic
The chi-squared statistic can be used as a test statistic.
\[ X^2 \;\;=\;\; \sum_{i=1}^k {\frac{\left(O_i - E_i\right)^2}{E_i}} \]since
P-value and conclusion
The p-value is the probability that the test statistic, \(X^2\), is as large as was recorded from the actual data, \(x^2\),
\[ \text{p-value} \;\;=\;\; P(X^2 \ge x^2) \]when H0 is true. This can be found from the upper tail of the \(\ChiSqrDistn(k \text{ df})\) distribution.
Example
The following table describes the number of heart attacks in a city in ten weeks.
Week | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
---|---|---|---|---|---|---|---|---|---|---|
Count | 6 | 11 | 13 | 10 | 21 | 8 | 16 | 6 | 9 | 19 |
Test whether the heart attacks occurred at random with a rate of \(\lambda = 10\) per week.
(Solved in full version)
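Although the worked solution is in the full version, a sketch of the computation is short (Python with scipy.stats): the statistic compares each weekly count with the hypothesised mean of \(\lambda = 10\), and the p-value comes from the upper tail of the chi-squared distribution with 10 degrees of freedom.

```python
from scipy import stats

counts = [6, 11, 13, 10, 21, 8, 16, 6, 9, 19]
lam = 10                                         # rate per week under H0
x2 = sum((o - lam) ** 2 / lam for o in counts)   # = 28.5 for these counts
p_value = stats.chi2.sf(x2, df=len(counts))      # upper tail, 10 df
print(f"X^2 = {x2:.1f}, p-value = {p_value:.4f}")
```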
Constraints on the expected counts
The goodness-of-fit test is easiest when the null hypothesis specifies all Poisson distribution means, \(\{E_i\}\). In practice however, the null hypothesis values of the \(\{E_i\}\) involve unknown parameters that must be estimated from the data. The simplest example is
Estimating the unknown parameters makes the chi-squared test statistic smaller. If \(c\) parameters are estimated from the data and used to get values for the \(\{E_i\}\), we say that there are \(c\) constraints on the \(\{E_i\}\) and
\[ X^2 \;\;=\;\; \sum_{i=1}^k {\frac{\left(O_i - E_i\right)^2}{E_i}} \;\; \underset{\text{approx}}{\sim} \;\; \ChiSqrDistn(k - c \text{ df}) \]

Example
This table shows the number of heart attacks in a city in each of ten weeks.
Week | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
---|---|---|---|---|---|---|---|---|---|---|
Count | 6 | 11 | 13 | 10 | 21 | 8 | 16 | 6 | 9 | 19 |
Test whether the heart attacks have occurred at random with a constant rate over this period.
(Solved in full version)
The earlier test had two requirements: the null hypothesis must fully specify the expected counts, \(\{E_i\}\), and these expected counts must be large enough for the chi-squared approximation to hold.
The chi-squared test cannot therefore be used directly to test whether the following data set is a random sample from a \(\PoissonDistn(\lambda=2)\) distribution, since the \(\{E_i\}\) are all 2.
1 | 3 | 2 | 2 | 5 | 4 | 5 | 2 | 0 | 2 |
2 | 4 | 3 | 5 | 2 | 3 | 1 | 4 | 2 | 6 |
Frequency table
We first summarise the data in a frequency table.
x | 0 | 1 | 2 | 3 | 4 | 5 | 6+ |
---|---|---|---|---|---|---|---|
Freq(x) | 1 | 2 | 7 | 3 | 3 | 3 | 1 |
Treating these frequencies as our observed counts, \(\{O_i\}\), we can find expected counts from the Poisson distribution's probability function,
\[ p(x) \;\;=\;\; \frac{\lambda^x e^{-\lambda}}{x!}\]Since \(n=20\) and \(\lambda=2\) when H0 holds,
\[ E_x \;\;=\;\; 20 \times p(x) \;\;=\;\; 20 \times \frac{2^x e^{-2}}{x!}\]giving
x | 0 | 1 | 2 | 3 | 4 | 5 | 6+ |
---|---|---|---|---|---|---|---|
\(O_x\) | 1 | 2 | 7 | 3 | 3 | 3 | 1 |
\(E_x\) | 2.707 | 5.413 | 5.413 | 3.609 | 1.804 | 0.722 | 0.331 |
Combining cells
Since these expected counts do not meet the guidelines (all of the \(\{E_x\}\) at least 1, and at least 80% of them at least 5), cells in the table should be combined.
x | 0,1 | 2 | 3+ |
---|---|---|---|
\(O_x\) | 3 | 7 | 10 |
\(E_x\) | 8.120 | 5.413 | 6.466 |
Based on these three counts,
\[ \begin{align} X^2 \;&=\; \sum_{i=1}^{3} {\frac{\left(O_i - E_i\right)^2}{E_i}} \\ &=\; \frac{(3-8.120)^2}{8.120} + \frac{(7-5.413)^2}{5.413} + \frac{(10-6.466)^2}{6.466} \\ &=\; 5.624 \end{align} \]

P-value and conclusion
There are 3 categories (after grouping) and one constraint,
\[ \sum{E_i} \;=\; \sum{O_i} \;=\; 20 \]The test should therefore be based on a chi-squared distribution with \((3-1) = 2\) degrees of freedom. The p-value here is
p-value = \(P(X^2 \ge 5.624) = 0.060\)
We conclude that there is only very weak evidence against H0 (that the original data set was a random sample from a \(\PoissonDistn(2)\) distribution).
(It is hardly surprising that a data set with only 20 values does not show up problems with a model — a larger data set would be more sensitive to any possible lack of fit of the model.)
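The whole calculation can be scripted. The sketch below (Python with numpy and scipy.stats) rebuilds the combined frequency table and should reproduce the values above up to rounding:

```python
import numpy as np
from scipy import stats

data = [1, 3, 2, 2, 5, 4, 5, 2, 0, 2,
        2, 4, 3, 5, 2, 3, 1, 4, 2, 6]
n, lam = len(data), 2

# Observed counts for the combined classes {0,1}, {2} and {3+}
observed = np.array([sum(x <= 1 for x in data),
                     sum(x == 2 for x in data),
                     sum(x >= 3 for x in data)])

# Expected counts under H0 from the Poisson(2) probability function
p = stats.poisson.pmf([0, 1, 2], lam)
expected = n * np.array([p[0] + p[1], p[2], 1 - p.sum()])

x2 = ((observed - expected) ** 2 / expected).sum()
p_value = stats.chi2.sf(x2, df=3 - 1)     # 3 classes, 1 constraint
print(f"X^2 = {x2:.3f}, p-value = {p_value:.3f}")   # about 5.62 and 0.060
```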
This approach can be applied to test whether a discrete data set of \(n\) values is a random sample from any distribution.
Example
The following table gives the number of male children among the first 12 children in 6,115 families of size 13, taken from hospital records in 19th century Saxony. (The 13th child has been ignored to avoid the possible distortion of families stopping when a desired sex is reached.)
Males | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Frequency | 3 | 24 | 104 | 286 | 670 | 1033 | 1343 | 1112 | 829 | 478 | 181 | 45 | 7 |
Assuming independence and that each child has the same probability of being male, \(\pi\), this would be a random sample from a \(\BinomDistn(n=12, \; \pi)\) distribution.
Is there evidence that the probability of a birth being male differs from family to family?
(Solved in full version)
This test can also be used for continuous data if the data are first summarised in a frequency table. The range of possible values of the distribution is partitioned into classes such as "10 ≤ X < 11", then the frequencies are the numbers of values in these classes.
Body temperature
In a study to determine the "normal" body temperature of healthy adults, body temperatures were found from 130 adults. The following frequency table summarises the data.
Temperature, x | Frequency |
---|---|
\(X \lt 96.0\) | 0 |
\(96.0 \le X \lt 96.5\) | 2 |
\(96.5 \le X \lt 97.0\) | 4 |
\(97.0 \le X \lt 97.5\) | 13 |
\(97.5 \le X \lt 98.0\) | 21 |
\(98.0 \le X \lt 98.5\) | 38 |
\(98.5 \le X \lt 99.0\) | 33 |
\(99.0 \le X \lt 99.5\) | 15 |
\(99.5 \le X \lt 100.0\) | 2 |
\(100.0 \le X \lt 100.5\) | 1 |
\(100.5 \le X \lt 101.0\) | 1 |
\(X \ge 101.0\) | 0 |
Could this be a random sample from a normal distribution?
We first find the method of moments estimates of the normal distribution's parameters,
\[ \hat{\mu} \;=\; \overline{x} \;=\; 98.25 \spaced{and} \hat{\sigma} \;=\; s \;=\; 0.7332 \]A histogram of the data and the best-fitting normal distribution seem similar in shape, but we will formally test the normal model with a chi-squared goodness-of-fit test.
The probabilities for values within these classes were found from the best-fitting \(\NormalDistn(\mu=98.25, \sigma = 0.7332)\) distribution, then multiplied by the number of values, 130, to get expected counts. However since several expected counts are low, classes must be combined before calculating the chi-squared goodness-of-fit statistic.
Temperature, x | Observed count, O | Expected count, E |
---|---|---|
\(X \lt 97.0\) | 6 | 5.61 |
\(97.0 \le X \lt 97.5\) | 13 | 14.20 |
\(97.5 \le X \lt 98.0\) | 21 | 27.76 |
\(98.0 \le X \lt 98.5\) | 38 | 34.69 |
\(98.5 \le X \lt 99.0\) | 33 | 27.72 |
\(99.0 \le X \lt 99.5\) | 15 | 14.16 |
\(X \ge 99.5\) | 4 | 5.71 |
The test statistic is
\[ X^2 \;=\; \sum_{x} {\frac{\left(O_x - E_x\right)^2}{E_x}} \;=\; 3.657 \]Since there are 7 counts in the combined frequency table and 2 estimated parameters, the test statistic should be compared to the chi-squared distribution with \((7-2-1) = 4\) degrees of freedom. The p-value is the probability of a value from the \(\ChiSqrDistn(4 \text{ df})\) distribution as high as 3.657 and can be found (e.g. using Excel) to be 0.4545.
Since there would be almost a 50% chance of getting observed counts as far from their expected values as these if the data did come from a normal distribution, we conclude that the data are consistent with coming from a normal distribution.
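A sketch of the whole computation (Python with numpy and scipy.stats) is given below; because \(\hat{\mu}\) and \(\hat{\sigma}\) are rounded, the expected counts and test statistic may differ slightly from the table above.

```python
import numpy as np
from scipy import stats

mu, sigma, n = 98.25, 0.7332, 130
cuts = [97.0, 97.5, 98.0, 98.5, 99.0, 99.5]      # class boundaries after combining
observed = np.array([6, 13, 21, 38, 33, 15, 4])

# Class probabilities from the fitted normal distribution
cdf = stats.norm.cdf(cuts, mu, sigma)
probs = np.diff(np.concatenate(([0.0], cdf, [1.0])))
expected = n * probs

x2 = ((observed - expected) ** 2 / expected).sum()
p_value = stats.chi2.sf(x2, df=7 - 2 - 1)        # 7 classes, 2 estimated parameters
print(f"X^2 = {x2:.3f}, p-value = {p_value:.4f}")
```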
We now concentrate on a random sample from a \(\NormalDistn(\mu, \sigma^2)\) distribution and develop hypothesis tests about the distribution's two parameters. Initially we assume that \(\sigma^2\) is a known value and consider how to perform tests about \(\mu\). The test may be one-tailed, such as
or two-tailed,
where \(\mu_0\) is a known constant.
Test statistic
The sample mean, \(\overline{X}\), could be used as a test statistic but, in practice, it is easier to use its standardised version as the test statistic,
\[ Z \;\;=\;\; \frac{\overline{X} - \mu_0}{\diagfrac{\sigma}{\sqrt{n}}} \]This has a standard normal distribution, \(Z \sim \NormalDistn(0,1)\) if H0 is true.
P-value and interpretation
The p-value for the test is found by comparing the value of the test statistic (evaluated from the data set) to the standard normal distribution. For a one-tailed test, this is one tail area of the distribution, but for a two-tailed test, it is double the smaller tail area since values of \(\overline{X}\) below \(\mu_0\) give the same evidence against H0 as values above it.
The p-value is interpreted in the same way as for all other hypothesis tests. A small value means that a value of \(\overline{X}\) as far as was observed from \(\mu_0\) would be unlikely if the null hypothesis was true, and this provides evidence suggesting that the alternative hypothesis is true. The diagram below illustrates for a 2-tailed test.
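A minimal sketch of the z-test (Python with numpy and scipy.stats; the data below are hypothetical, invented only to make the code runnable):

```python
import numpy as np
from scipy import stats

def z_test(x, mu0, sigma):
    """Z statistic and p-values for H0: mu = mu0 when sigma is known."""
    z = (np.mean(x) - mu0) / (sigma / np.sqrt(len(x)))
    p_two = 2 * stats.norm.sf(abs(z))   # two-tailed: double the smaller tail
    p_upper = stats.norm.sf(z)          # one-tailed, HA: mu > mu0
    return z, p_two, p_upper

x = [11.2, 9.8, 13.1, 10.4, 12.6, 11.9, 8.7, 12.2]   # hypothetical data
z, p_two, p_upper = z_test(x, mu0=10, sigma=2)
print(f"z = {z:.3f}, two-tailed p = {p_two:.3f}, one-tailed p = {p_upper:.3f}")
```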
In most practical situations where normal distributions are used as models for data, the normal variance, \(\sigma^2\), is an unknown parameter. The test statistic on the previous page,
\[ Z \;\;=\;\; \frac{\overline{X} - \mu_0}{\diagfrac {\sigma}{\sqrt n}} \]can no longer be evaluated in a test for \(\mu\) since \(\sigma^2\) is now unknown.
Test statistic
If \(\sigma\) is replaced by the sample standard deviation, \(S\), the test statistic
\[ T \;\;=\;\; \frac{\overline{X} - \mu_0}{\diagfrac {S}{\sqrt n}} \]no longer has a standard normal distribution, but its distribution is another standard distribution when H0 is true.
Distribution of test statistic T
If \(\overline{X}\) and \(S^2\) are the mean and variance of a random sample of size \(n\) from a \(\NormalDistn(\mu_0, \sigma^2)\) distribution,
\[ T \;\;=\;\; \frac{\overline{X} - \mu_0}{\diagfrac{S}{\sqrt{n}}} \;\;\sim\;\; \TDistn(n-1 \text{ df}) \](Proved in full version)
T-test for \(\mu\)
The test statistic \(T\) is used in a similar way to the \(Z\) statistic on the previous page, but the p-value is obtained as a tail probability from a t distribution instead of a standard normal one.
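In code, the only changes from the z-test sketch are the use of \(S\) in place of \(\sigma\) and a t distribution tail area. A sketch (Python with numpy and scipy.stats; the data are again hypothetical), together with scipy's built-in equivalent:

```python
import numpy as np
from scipy import stats

def t_test(x, mu0):
    """T statistic and two-tailed p-value for H0: mu = mu0, sigma unknown."""
    n = len(x)
    t = (np.mean(x) - mu0) / (np.std(x, ddof=1) / np.sqrt(n))
    return t, 2 * stats.t.sf(abs(t), df=n - 1)

x = [11.2, 9.8, 13.1, 10.4, 12.6, 11.9, 8.7, 12.2]   # hypothetical data
print(t_test(x, mu0=10))
print(stats.ttest_1samp(x, popmean=10))   # built-in version, same answer
```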
In most situations where data are modelled as a random sample from a \(\NormalDistn(\mu, \sigma^2)\) distribution, the parameter \(\mu\) is of most interest.
However very occasionally, a test about the distribution's variance, \(\sigma^2\), is needed. This could be a one-tailed test such as
H0 : \(\sigma^2 = \sigma_0^2\)
HA : \(\sigma^2 \gt \sigma_0^2\)
where \(\sigma_0^2\) is some constant, or a two-tailed test of the form
H0 : \(\sigma^2 = \sigma_0^2\)
HA : \(\sigma^2 \ne \sigma_0^2\)
Chi-squared test
The test is based on the sample variance, \(S^2\), of a random sample of \(n\) values. If H0 holds,
\[ X^2 \;=\; \frac {n-1}{\sigma_0^2} S^2 \]has a \(\ChiSqrDistn(n - 1\;\text{df})\) distribution. The test's p-value can be found from a tail probability of this distribution.
Example
The following 20 values are a random sample from a \(\NormalDistn(\mu, \sigma^2)\) distribution.
18.68 | 16.28 | 26.02 | 21.57 | 20.54 | 19.45 | 24.55 | 23.03 | 19.34 | 24.69 |
21.31 | 15.22 | 22.81 | 20.53 | 21.01 | 14.98 | 20.52 | 22.39 | 23.37 | 23.23 |
Test whether the distribution's variance is \(\sigma^2 = 4\).
(Solved in full version)
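The solution is in the full version, but a sketch of the computation (Python with numpy and scipy.stats) follows the recipe above directly:

```python
import numpy as np
from scipy import stats

x = [18.68, 16.28, 26.02, 21.57, 20.54, 19.45, 24.55, 23.03, 19.34, 24.69,
     21.31, 15.22, 22.81, 20.53, 21.01, 14.98, 20.52, 22.39, 23.37, 23.23]
n, var0 = len(x), 4                     # H0: sigma^2 = 4

x2 = (n - 1) * np.var(x, ddof=1) / var0
df = n - 1
# Two-tailed p-value: double the smaller tail of the chi-squared distribution
p_value = 2 * min(stats.chi2.cdf(x2, df), stats.chi2.sf(x2, df))
print(f"X^2 = {x2:.2f}, p-value = {p_value:.3f}")
```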
Robustness
The tests in this section are all based on the assumption that the data are a random sample from a normal distribution. The t-test for the distribution's mean is not affected badly if the underlying distribution is non-normal, so we say that this test is robust.
However the chi-squared test for the variance does not provide an accurate p-value if the distribution from which the data are sampled has a non-normal shape. This chi-squared test is not robust.
We now consider random samples from two normal populations,
\[ \begin{align} X_{1,i} \;\;&\sim\;\; \NormalDistn(\mu_1,\;\sigma^2) \qquad \text{for } i=1,\dots,n_1 \\ X_{2,i} \;\;&\sim\;\; \NormalDistn(\mu_2,\;\sigma^2) \qquad \text{for } i=1,\dots,n_2 \end{align}\]The difference between the sample means is normally distributed,
\[ \overline{X}_1 - \overline{X}_2 \;\;\sim\;\; \NormalDistn\left(\mu_1 - \mu_2,\;\sigma^2\left(\frac 1{n_1} + \frac 1{n_2}\right)\right) \]The best estimate of the common variance, \(\sigma^2\), is
\[ S_{\text{pooled}}^2 \;\;=\;\; \frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2} \]and we showed earlier that its distribution is proportional to a chi-squared distribution,
\[ \frac{n_1 + n_2 - 2}{\sigma^2}S_{\text{pooled}}^2 \;\;\sim\;\; \ChiSqrDistn(n_1 + n_2 - 2 \text{ df}) \]Our best estimate of the standard error of \( \overline{X}_1 - \overline{X}_2\) is therefore
\[ \se(\overline{X}_1 - \overline{X}_2) \;=\; \sqrt{\frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2} \left(\frac 1{n_1} + \frac 1{n_2}\right) } \]

Testing for equal means
We now consider a hypothesis test for whether the two means are equal,
H0 : \(\mu_1 = \mu_2\)
HA : \(\mu_1 \ne \mu_2\)
(or the corresponding one-tailed alternative). The following function of the data can be used as a test statistic — its distribution is fully known when the null hypothesis holds.
Test statistic
If \(\overline{X}_1\) and \(S_1^2\) are the mean and variance of a sample of \(n_1\) values from a \(\NormalDistn(\mu_1, \sigma^2)\) distribution and \(\overline{X}_2\) and \(S_2^2\) are the mean and variance of an independent sample of \(n_2\) values from a \(\NormalDistn(\mu_2, \sigma^2)\) distribution,
\[ T \;\;=\;\; \frac{\overline{X}_1 - \overline{X}_2}{\se(\overline{X}_1 - \overline{X}_2)} \;\;\sim\;\; \TDistn(n_1 + n_2 - 2 \text{ df}) \]provided \(\mu_1 = \mu_2\).
(Proved in full version)
A p-value for the test is the probability of a value from this t distribution that is further from zero than the value that is evaluated from the actual data.
Example
A botanist is interested in comparing the growth response of dwarf pea stems to two different levels of the hormone indoleacetic acid (IAA). Using 16 day old pea plants the botanist obtains 5 millimetre sections and floats these sections on solutions with different hormone concentrations to observe the effect of the hormone on the growth of the pea stem. Let \(X\) and \(Y\) denote respectively the independent growths that can be attributed to the hormone during the first 26 hours after sectioning for \((0.5 \times 10^{-4})\) and \(10^{-4}\) levels of concentration of IAA.
The botanist measured the growths of pea stem segments in millimetres for \(n_X = 11\) observations of \(X\):
0.8 | 1.8 | 1.0 | 0.1 | 0.9 | 1.7 | 1.0 | 1.4 | 0.9 | 1.2 | 0.5 |
and \(n_Y = 13\) observations of \(Y\):
1.0 1.8 |
0.8 2.5 |
1.6 1.4 |
2.6 1.9 |
1.3 2.0 |
1.1 1.2 |
2.4 |
Test whether the larger hormone concentration results in greater growth of the pea plants.
(Solved in full version)
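A sketch of the test (Python with scipy.stats, version 1.6 or later for the `alternative` argument); the one-tailed alternative is that the higher concentration gives greater mean growth:

```python
from scipy import stats

x = [0.8, 1.8, 1.0, 0.1, 0.9, 1.7, 1.0, 1.4, 0.9, 1.2, 0.5]            # lower IAA
y = [1.0, 1.8, 0.8, 2.5, 1.6, 1.4, 2.6, 1.9, 1.3, 2.0, 1.1, 1.2, 2.4]  # higher IAA

# Pooled-variance (equal_var=True) t-test of H0: mu_Y = mu_X vs HA: mu_Y > mu_X
t, p = stats.ttest_ind(y, x, equal_var=True, alternative='greater')
print(f"t = {t:.3f}, one-tailed p-value = {p:.4f}")
```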
When testing whether the means of two normal distributions are equal, we often assume that their variances are the same. We now describe a hypothesis test to assess this assumption.
H0 : \(\sigma_1^2 = \sigma_2^2\)
HA : \(\sigma_1^2 \ne \sigma_2^2\)
We showed earlier that the two sample variances, \(S_1^2\) and \(S_2^2\), have distributions proportional to chi-squared distributions,
\[ \frac{n_1-1}{\sigma_1^2} S_1^2 \;\;\sim\;\; \ChiSqrDistn(n_1 - 1 \text{ df}) \]and similarly for \(S_2^2\).
Hypothesis test
The ratio of the two sample variances can be used as a test statistic for this test.
Test statistic
If \(\overline{X}_1\) and \(S_1^2\) are the mean and variance of a sample of \(n_1\) values from a \(\NormalDistn(\mu_1, \sigma_1^2)\) distribution and \(\overline{X}_2\) and \(S_2^2\) are the mean and variance of an independent sample of \(n_2\) values from a \(\NormalDistn(\mu_2, \sigma_2^2)\) distribution,
\[ F \;\;=\;\; \frac{S_1^2}{S_2^2} \;\;\sim\;\; \FDistn(n_1 - 1,\; n_2 - 1 \text{ df}) \]provided \(\sigma_1^2 = \sigma_2^2\).
(Proved in full version)
The p-value for the test can be found from the tail probabilities of this distribution.
Example
When analysing the data set about the effect of the hormone IAA on the growth of dwarf pea stems on the previous page, an assumption was made that the underlying normal distribution's variance was the same for both hormone levels. Test whether the data are consistent with this assumption.
(Solved in full version)
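A sketch of the variance-ratio test for these data (Python with numpy and scipy.stats); since scipy has no built-in two-sample F-test for variances, the statistic and its two-tailed p-value are computed directly:

```python
import numpy as np
from scipy import stats

x = [0.8, 1.8, 1.0, 0.1, 0.9, 1.7, 1.0, 1.4, 0.9, 1.2, 0.5]            # lower IAA
y = [1.0, 1.8, 0.8, 2.5, 1.6, 1.4, 2.6, 1.9, 1.3, 2.0, 1.1, 1.2, 2.4]  # higher IAA

f = np.var(x, ddof=1) / np.var(y, ddof=1)        # F = S1^2 / S2^2
dfx, dfy = len(x) - 1, len(y) - 1
# Two-tailed p-value: double the smaller tail area of the F distribution
p_value = 2 * min(stats.f.cdf(f, dfx, dfy), stats.f.sf(f, dfx, dfy))
print(f"F = {f:.3f}, p-value = {p_value:.3f}")
```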
Revisiting p-values
A hypothesis test is based on two competing hypotheses about the value of a parameter, \(\theta\). The null hypothesis is the simpler one, restricting \(\theta\) to a single value, whereas the alternative hypothesis allows for a range of values. Initially we consider a 1-tailed alternative,
The hypothesis test is based on a test statistic,
\[ Q \;\;=\;\; g(X_1, X_2, \dots, X_n \mid \theta_0) \]whose distribution is fully known (without any unknown parameters) when H0 is true (i.e. when \(\theta_0\) is the true parameter value). The p-value for the test is the probability of a test statistic value as extreme as that observed in the data. A p-value close to zero throws doubt on the null hypothesis.
Fixed significance level
An alternative way to perform the test involves a rule that results in a decision about which of the two hypotheses holds. Any such data-based rule can lead us to the wrong decision, so we must take into account the probability of this.
Definition
The significance level is the probability of wrongly concluding that H0 does not hold when it actually does.
For example, it might be acceptable to have a 5% chance of concluding that \(\theta \gt \theta_0\) when \(\theta\) is really equal to \(\theta_0\). This means a significance level of \(\alpha = 0.05\) for the test.
To attain this significance level, we should reject H0 when the test statistic, \(Q\), falls in the "rejection region" below.
Two-tailed tests
For a two-tailed test, values of the test statistic in both tails of its distribution usually provide evidence that H0 does not hold, so the rejection region should correspond to area \(\diagfrac {\alpha} 2\) in each tail to attain a significance level of \(\alpha\).
Saturated fat content of cooking oil
Cooking oil made from soybeans has little cholesterol and has been claimed to have only 15% saturated fat. A clinician believes that the saturated fat content is greater than 15% and randomly samples 13 bottles of soybean cooking oil for testing.
15.2 12.4 |
15.4 13.5 |
15.9 17.1 |
16.9 14.3 |
19.1 18.2 |
15.5 16.3 |
20.0 |
Assuming that the data are a random sample from a normal distribution, the clinician wants to test the following hypotheses.
H0 : \(\mu = 15\%\)
HA : \(\mu \gt 15\%\)
What is his conclusion from testing these hypotheses with a significance level of \(\alpha = 5\%\)?
(Solved in full version)
Decisions from tests
Hypothesis tests often result in some action by the researchers that depends on whether we conclude that H0 or HA is true. This decision depends on the data.
Decision | Action |
---|---|
accept H0 | some action (often the status quo) |
reject H0 | a different action (often a change to a process) |
There are two types of error that can be made, represented by the red cells below:
True state of nature | accept H0 | reject H0 |
---|---|---|
H0 is true | correct | Type I error |
HA (H0 is false) | Type II error | correct |
A good decision rule should have small probabilities for both kinds of error.
Saturated fat content of cooking oil
The clinician who tested the saturated fat content of soybean cooking oil was interested in the hypotheses.
H0 : \(\mu = 15\%\)
HA : \(\mu \gt 15\%\)
If H0 is rejected, the clinician intends to report the high saturated fat content to the media. The two possible errors that could be made are described below.
Truth | accept H0 (do nothing) | reject H0 (contact media) |
---|---|---|
H0: µ is really 15% | correct | wrongly accuses manufacturers |
HA: µ is really over 15% | fails to detect high saturated fat | correct |
Ideally the decision should be made in a way that keeps both probabilities low.
Decision rule and significance level
A decision rule's probability of a Type I error is its significance level. Fixing the significance level at say 5% therefore sets the details of the decision rule such that
\[ P(\text{reject }H_0 \mid H_0 \text{ is true}) \;\;=\;\; 0.05 \]This does not however tell you the probability of a Type II error.
Illustration
Consider a test about the mean of a normal distribution with \(\sigma = 4\), based on a random sample of \(n = 16\) values:
H0 : μ = 10
HA : μ > 10
The sample mean will be used as a test statistic since its distribution is known when the null hypothesis holds,
\[ \overline{X} \;\;\sim\;\; \NormalDistn\left(\mu_0, \; \frac{\sigma^2}{n}\right) \;\;=\;\; \NormalDistn(10,\; 1) \]Large values of \(\overline{X}\) would usually be associated with the alternative hypothesis, so we will consider decision rules of the form
Data | Decision |
---|---|
\(\overline{X} \le k\) | accept H0 |
\(\overline{X} \gt k\) | reject H0 |
for some value of \(k\), the critical value for the test.
The diagram below illustrates the probabilities of Type I and Type II errors for different decision rules — these are the red areas in the upper and lower parts of each pair of normal distributions.
Note how reducing the probability of a Type I error increases the probability of a Type II error — it is impossible to simultaneously make both probabilities small with only \(n\) = 16 observations.
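The trade-off can be quantified directly. In this example \(\overline{X}\) has standard error 1, so both error probabilities are normal tail areas; the sketch below (Python with numpy and scipy.stats) evaluates them for a few candidate critical values, using \(\mu = 13\) as the alternative value considered in the diagram:

```python
import numpy as np
from scipy import stats

sigma, n, mu0, mu1 = 4, 16, 10, 13     # mu1 is one value allowed by HA
se = sigma / np.sqrt(n)                # = 1, so Xbar ~ NormalDistn(mu, 1)

for k in [11.0, 11.5, 12.0]:           # candidate critical values
    alpha = stats.norm.sf(k, mu0, se)  # P(Type I)  = P(Xbar > k | mu = 10)
    beta = stats.norm.cdf(k, mu1, se)  # P(Type II) = P(Xbar <= k | mu = 13)
    print(f"k = {k}: P(Type I) = {alpha:.3f}, P(Type II) = {beta:.3f}")
```

Raising \(k\) lowers the Type I error probability but raises the Type II error probability, exactly the trade-off shown in the diagram.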
The above diagram used an alternative hypothesis value of \(\mu = 13\). The alternative hypothesis allows other values of \(\mu \gt 10\), and the probability of a Type II error reduces as \(\mu\) increases. For a decision rule that results in a 5% significance level, the diagram below illustrates this.
This is as should be expected — the further that the real value \(\mu\) is above 10, the more likely we are to detect that it is higher than 10 from the sample mean.
The decision rule affects the probabilities of Type I and Type II errors and there is always a trade-off between these two probabilities.
Significance level
Performing hypothesis tests through interpretation of p-values and through decision rules with fixed significance levels are closely related.
Although some of the underlying theory depends on the type of test, the decision rule for any test can be based on its p-value. For example, for a test with significance level 5%, the decision rule is always:
p-value | Decision |
---|---|
p-value > 0.05 | accept H0 |
p-value < 0.05 | reject H0 |
For example, to conduct a test with significance level 1%, the null hypothesis, H0, should be rejected if the p-value is less than 0.01.
If computer software provides the p-value for a hypothesis test, it is therefore easy to translate it into a decision about whether to accept (or reject) the null hypothesis at the 5% or 1% significance level.
Discrete distributions
For continuous distributions, we can usually determine the critical value for a decision rule that exactly corresponds to any required significance level. Unfortunately this is usually impossible when the data come from discrete distributions.
Failure of printed circuit boards
Suppose a manufacturer of a certain printed circuit has found that the probability of a board failing is \(\pi = 0.06\), and an engineer suggests some changes to the production process that might reduce this probability. Suppose that \(n = 200\) circuits will be produced using the proposed new method and \(Y\) of these will fail. We will use the sample number failing, \(y\), to test the hypotheses
We now consider two possible decision rules. Their significance levels can be found by adding probabilities from the \(\BinomDistn(n=200, \pi=0.06)\) distribution.
Since there are no intermediate decision rules, there is no decision rule with a significance level of exactly 0.05.
If a hypothesis test is required at a specified significance level, such as 0.05, a conservative approach should be taken. The decision rule should be chosen such that its significance level is under the required value. For example, we should reject the null hypothesis (and conclude that the new production method is better) if 6 or fewer boards fail in the example above.
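The significance levels of candidate rules are simple binomial tail sums. A sketch (Python with scipy.stats), assuming the two rules considered were "reject if \(y \le 6\)" and "reject if \(y \le 7\)":

```python
from scipy import stats

# Significance level of the rule "reject H0 if y <= k", Y ~ BinomDistn(200, 0.06)
for k in [6, 7]:
    alpha = stats.binom.cdf(k, n=200, p=0.06)
    print(f"reject if y <= {k}: significance level = {alpha:.4f}")
```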
Definition
The power of a decision rule is the probability of correctly deciding that the alternative hypothesis is true when it really is true,
\[ \begin{align} \text{Power} \;&=\; 1 - P(\text{Type II error}) \\[0.3em] &=\; P\left(\text {decide }H_A \text{ is true} \mid H_A\right) \end{align} \]A test with high power is therefore desirable.
However the alternative hypothesis usually allows for a range of different parameter values, such as \(\mu \gt 12\). The power is really a power function that can be graphed against the possible parameter values.
Failure of printed circuit boards
The example on the previous page about a new method of producing printed circuit boards examined the hypotheses
We now consider the decision rule that rejects the null hypothesis if \(y=7\) or fewer fail out of a sample of \(n=200\) boards. The power function is
\[ \begin{align} \operatorname{Power}(\pi) \;&=\; P\left(\text {decide }H_A \text{ is true} \mid \pi\right) \\[0.4em] &=\; P(X \le 7 \mid \pi) \\ &=\; \sum_{x=0}^7 {200 \choose x} \; \pi^x \; (1 - \pi)^{200-x} \end{align} \]At the null hypothesis value, \(\pi = 0.06\), the power function is simply the significance level of the test. The probability of rejecting H0 is higher when \(\pi\) is lower than this.
The power of the test is 0.746 when \(\pi = 0.03\). The engineer would be disappointed with this. It means that if the new manufacturing method has actually decreased the probability of failure to 0.03 from 0.06 (a big improvement), there is still a good chance (25%) that the null hypothesis, \(\pi = 0.06\), is accepted and the improvement is rejected.
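A sketch of the power function (Python with scipy.stats), evaluated at a few values of \(\pi\) allowed by the alternative hypothesis:

```python
from scipy import stats

def power(pi, n=200, k=7):
    """P(reject H0) = P(X <= k) when the true failure probability is pi."""
    return stats.binom.cdf(k, n, pi)

for pi in [0.06, 0.05, 0.04, 0.03, 0.02]:
    print(f"pi = {pi:.2f}: power = {power(pi):.3f}")
# power(0.06) is the significance level; power(0.03) is about 0.746
```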
Changing the decision criterion
The trade-off between low significance level and high power is illustrated in the diagram below. Changing the decision rule to increase the power (and decrease the probability of a Type II error) also increases its significance level (and probability of a Type I error).
The only way to increase the power of a test without also increasing the significance level is to collect more data.
Failure of printed circuit boards
Consider the above printed circuit board example. Let us assume that the engineers have decided that it is acceptable for the probability of a Type I error (deciding that \(\pi\) < 0.06 when \(\pi\) is really 0.06) to be 5%:
significance level = P(Type I error) = 0.05
and that it is acceptable for the probability of a Type II error (deciding that \(\pi\) is 0.06) to be 10% if \(\pi\) is really 0.03.
With a sample size of \(n\) and decision rule to reject the null hypothesis if \(x \le k\), we therefore require
\[ \begin{align} \alpha \;&=\; P\left(\text {reject }H_0 \mid \pi = 0.06\right) \\[0.4em] &=\; P(X \le k \mid \pi = 0.06) \\ &=\; \sum_{x=0}^k {n \choose x} \; 0.06^x \; 0.94^{n-x} \\ &=\; 0.05 \end{align} \]and
\[ \begin{align} \beta \;&=\; P\left(\text {accept }H_0 \mid \pi = 0.03\right) \\[0.4em] &=\; 1 - P(X \le k \mid \pi = 0.03) \\ &=\; 1 - \sum_{x=0}^k {n \choose x} \; 0.03^x \; 0.97^{n-x} \\ &=\; 0.10 \end{align} \]The values of \(n\) (the sample size) and \(k\) (the cut-off for the decision rule) are the values that satisfy these two equations (or are as close as possible to this). By trial and error with different \(n\) and \(k\), we can find that \(n = 400\) and \(k = 16\) come close to satisfying both requirements.
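The trial and error is easy to automate. The sketch below (Python with scipy.stats) evaluates both error probabilities for a few candidate designs; the \((n, k)\) pairs other than \((400, 16)\) are our own illustrative guesses:

```python
from scipy import stats

# Evaluate both error probabilities for candidate designs (n, k)
for n, k in [(300, 12), (400, 16), (500, 20)]:
    alpha = stats.binom.cdf(k, n, 0.06)   # P(reject H0 | pi = 0.06)
    beta = stats.binom.sf(k, n, 0.03)     # P(accept H0 | pi = 0.03)
    print(f"n = {n}, k = {k}: alpha = {alpha:.3f}, beta = {beta:.3f}")
```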
For tests involving continuous distributions, it is often possible to find explicit solutions to the equations for the probabilities of Type I and II errors.
Nested models
In some situations, a particular statistical model can be regarded as a special case of a more complex model (with more parameters). We will call the simpler model the small model, \(\mathcal{M}_S\), and say that it is nested in the more general big model, \(\mathcal{M}_B\). To compare these, we use the hypotheses,
Poisson parameter
The following table describes the number of defective items produced on a production line in 20 successive days.
1 2 |
3 4 |
2 3 |
2 5 |
5 2 |
4 3 |
5 1 |
2 4 |
0 2 |
2 6 |
It might be assumed that the data are a random sample from a \(\PoissonDistn(\lambda)\) distribution, and we might want to test whether the rate of defective items was \(\lambda = 2\) per day. Since the \(\PoissonDistn(2)\) distribution is a special case of the \(\PoissonDistn(\lambda)\) distribution,
Exponential means
Clinical records give the survival time, in months from diagnosis, of 30 sufferers from a certain disease as
9.73 5.56 4.28 4.87 |
1.55 6.20 1.08 7.17 |
28.65 6.10 16.16 9.92 |
2.40 6.19 7.67 1.11 |
4.66 4.35 7.31 3.28 |
13.38 3.08 0.41 4.33 |
2.16 4.49 0.75 |
4.45 10.29 0.90 |
In a clinical trial of a new drug treatment, 21 sufferers had survival times of:
22.07 12.47 6.42 |
8.15 0.64 20.04 |
17.49 2.22 3.00 |
28.09 3.94 8.59 |
4.26 32.82 8.32 |
2.12 18.53 |
9.95 4.25 |
3.70 5.82 |
Is there any difference in survival times for those using the new drug?
An exponential model might be considered a reasonable model for the data in each group. This would have a common death rate in both groups, \(\lambda\), if the drug had no effect on survival times, or different rates, \(\lambda_C\) for the control group and \(\lambda_D\) for the group getting the new drug, if it did.
Exponential vs Weibull distribution
The \(\ExponDistn(\lambda)\) distribution is a special case of the \(\WeibullDistn(\alpha, \lambda)\) distribution corresponding to \(\alpha = 1\). Testing whether the failure rate is constant can therefore be done using the following small and big models.
Exponential vs Gamma
In a similar way, the \(\ExponDistn(\lambda)\) distribution is a special case of the \(\GammaDistn(\alpha, \lambda)\) distribution corresponding to \(\alpha = 1\).
Does the simpler "small" model fit the data, or is the more general "big" model needed?
Since it has more parameters to adjust, the big model, \(\mathcal{M}_B\), always fits better than the small model, \(\mathcal{M}_S\), but does it fit significantly better?
A model's fit can be described by its maximum possible likelihood,
\[ L(\mathcal{M}) \;\;=\;\; \underset{\text{all models in }\mathcal{M}}{\operatorname{max}} P(\text{data} \mid \text{model}) \]This is the likelihood with all unknown parameters replaced by maximum likelihood estimates.
Likelihood ratio
Because \(\mathcal{M}_S\) is a special case of \(\mathcal{M}_B\),
\[ L(\mathcal{M}_B) \;\;\ge\;\; L(\mathcal{M}_S) \]Equivalently, the likelihood ratio is always at least one,
\[ R \;\;=\;\; \frac{L(\mathcal{M}_B)}{L(\mathcal{M}_S)} \;\;\ge\;\; 1 \]Big values of \(R\) suggest that \(\mathcal{M}_S\) does not fit as well as \(\mathcal{M}_B\). Equivalently,
\[ \log(R) \;\;=\;\; \ell(\mathcal{M}_B) - \ell(\mathcal{M}_S) \;\;\ge\;\; 0 \]and again, big values suggest that \(\mathcal{M}_S\) does not fit as well as \(\mathcal{M}_B\).
We now formally test the hypotheses
The following test statistic is used,
\[ X^2 \;\;=\;\; 2\log(R) \;\;=\;\; 2\left(\ell(\mathcal{M}_B) - \ell(\mathcal{M}_S)\right) \]\(X^2\) has (approximately) a standard distribution when H0 holds, and is likely to be large if \(\mathcal{M}_S\) is not correct.
Distribution of test statistic
If the data do come from \(\mathcal{M}_S\), and \(\mathcal{M}_B\) has \(k\) more parameters than \(\mathcal{M}_S\),
\[ X^2 \;\;=\;\; 2\left( \ell(\mathcal{M}_B) - \ell(\mathcal{M}_S)\right) \;\;\underset{\text{approx}}{\sim} \;\; \ChiSqrDistn(k \text{ df}) \]

Likelihood ratio test
Question
The following table describes the number of defective items from a production line in each of 20 days.
1 2 |
3 4 |
2 3 |
2 5 |
5 2 |
4 3 |
5 1 |
2 4 |
0 2 |
2 6 |
Assuming that the data are a random sample from a \(\PoissonDistn(\lambda)\) distribution, use a likelihood ratio test for whether the rate of defects was \(\lambda = 2\) per day.
(Solved in full version)
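The solution is in the full version, but the likelihood ratio calculation is short enough to sketch (Python with numpy and scipy.stats). The maximised log-likelihood under the big model uses \(\hat{\lambda} = \overline{x}\):

```python
import numpy as np
from scipy import stats

data = np.array([1, 2, 3, 4, 2, 3, 2, 5, 5, 2, 4, 3, 5, 1, 2, 4, 0, 2, 2, 6])
lam0 = 2
lam_hat = data.mean()                  # MLE of lambda under the big model

def loglik(lam):
    return stats.poisson.logpmf(data, lam).sum()

x2 = 2 * (loglik(lam_hat) - loglik(lam0))
p_value = stats.chi2.sf(x2, df=1)      # big model has 1 extra parameter
print(f"lambda_hat = {lam_hat}, X^2 = {x2:.2f}, p-value = {p_value:.4f}")
```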
We now use a maximum likelihood test to compare two random samples from exponential distributions.
Question
Clinical records give the survival time in months from diagnosis of 30 sufferers from a certain disease as
9.73 5.56 4.28 4.87 |
1.55 6.20 1.08 7.17 |
28.65 6.10 16.16 9.92 |
2.40 6.19 7.67 1.11 |
4.66 4.35 7.31 3.28 |
13.38 3.08 0.41 4.33 |
2.16 4.49 0.75 |
4.45 10.29 0.90 |
In a clinical trial of a new drug treatment, 21 sufferers had survival times of
22.07 12.47 6.42 |
8.15 0.64 20.04 |
17.49 2.22 3.00 |
28.09 3.94 8.59 |
4.26 32.82 8.32 |
2.12 18.53 |
9.95 4.25 |
3.70 5.82 |
Assuming that survival times are exponentially distributed, perform a likelihood ratio test for whether the death rate is different for those getting the new drug.
(Solved in full version)
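Again the full solution is elsewhere, but a sketch of the likelihood ratio computation (Python with numpy and scipy.stats) is direct: for an \(\ExponDistn(\lambda)\) sample the log-likelihood is \(n \log \lambda - \lambda \sum x_i\) and the maximum likelihood estimate is \(\hat{\lambda} = n / \sum x_i\).

```python
import numpy as np
from scipy import stats

control = np.array([9.73, 5.56, 4.28, 4.87, 1.55, 6.20, 1.08, 7.17, 28.65, 6.10,
                    16.16, 9.92, 2.40, 6.19, 7.67, 1.11, 4.66, 4.35, 7.31, 3.28,
                    13.38, 3.08, 0.41, 4.33, 2.16, 4.49, 0.75, 4.45, 10.29, 0.90])
drug = np.array([22.07, 12.47, 6.42, 8.15, 0.64, 20.04, 17.49, 2.22, 3.00, 28.09,
                 3.94, 8.59, 4.26, 32.82, 8.32, 2.12, 18.53, 9.95, 4.25, 3.70, 5.82])

def loglik(x, lam):
    # Exponential log-likelihood: n log(lam) - lam * sum(x)
    return len(x) * np.log(lam) - lam * x.sum()

# Small model: common rate; big model: a separate rate for each group
lam_common = (len(control) + len(drug)) / (control.sum() + drug.sum())
l_small = loglik(control, lam_common) + loglik(drug, lam_common)
l_big = (loglik(control, len(control) / control.sum()) +
         loglik(drug, len(drug) / drug.sum()))

x2 = 2 * (l_big - l_small)
p_value = stats.chi2.sf(x2, df=1)      # big model has 1 extra parameter
print(f"X^2 = {x2:.2f}, p-value = {p_value:.4f}")
```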
The ideas of test statistics and pivots are very closely related.
If a function \(g(X_1, \dots, X_n, \theta)\) can be used as a pivot for \(\theta\), then \(g(X_1, \dots, X_n, \theta_0)\) can be used as a test statistic for testing whether \(\theta = \theta_0\).
To simplify the notation, we write \(Q(\theta) \sim \mathcal{Standard\;distn}\).
Relationship between confidence interval and test
These are also closely related. The diagram below illustrates a possible standard distribution for \(Q(\theta) \sim \mathcal{Standard\;distn}\).
Because of this relationship between confidence intervals and tests, we can perform a hypothesis test from a confidence interval.
Test from confidence interval
If we reject the null hypothesis in a 2-tailed test about whether \(\theta = \theta_0\) when the parameter value \(\theta_0\) is outside a \((1 - \alpha)\) confidence interval for \(\theta\), then the test has significance level \(\alpha\).
Similarly, we can find a confidence interval based on a hypothesis test.
Confidence interval from test
A \((1 - \alpha)\) confidence interval can be found as the values \(\theta_0\) that are not rejected by a 2-tailed test of whether \(\theta = \theta_0\) at significance level \(\alpha\).
Because of the relationship between confidence intervals and hypothesis tests, a hypothesis test at significance level \(\alpha\) for
can be performed in the following way.
T-shirt sizes
A retail clothing outlet has collected the following data from random sampling of invoices of T-shirts over the past month.
Small | Medium | Large | XL | Total | |
---|---|---|---|---|---|
North Island | 2 | 15 | 24 | 9 | 50 |
South Island | 4 | 17 | 23 | 6 | 50 |
Concentrating on the probability that a North Island T-shirt is Small, \(\pi\), we have the approximate pivot,
\[ \frac{x - n\pi}{\sqrt{n \pi(1 - \pi)}} \;\;\underset{\text{approx}}{\sim} \;\; \NormalDistn(0,1) \]where \(x = 2\) and \(n = 50\). This can be rearranged to get a 95% confidence interval
\[ 0.011 \;\;\lt\;\; \pi \;\;\lt\;\; 0.135 \]If we wanted to perform a two-tailed test of whether \(\pi = 0.15\),
we note that 0.15 is not in the 95% confidence interval. This means that we would reject the null hypothesis in a test at the 5% significance level.
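Rearranging the pivot gives a quadratic inequality in \(\pi\) (the Wilson score interval). A sketch of the arithmetic (Python with numpy and scipy.stats):

```python
import numpy as np
from scipy import stats

x, n = 2, 50
z = stats.norm.ppf(0.975)              # 1.96 for a 95% interval
p_hat = x / n

# Solving |x - n*pi| = z * sqrt(n*pi*(1-pi)) for pi
centre = (p_hat + z**2 / (2 * n)) / (1 + z**2 / n)
halfwidth = (z * np.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
             / (1 + z**2 / n))
print(f"{centre - halfwidth:.3f} < pi < {centre + halfwidth:.3f}")   # (0.011, 0.135)
```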
There are two reasons why hypothesis tests may be easier than confidence intervals:
In these situations, a good approach to finding a confidence interval may be through hypothesis tests.
Binomial probability
The earlier confidence intervals that we showed for the binomial distribution's parameter \(\pi\) were based on normal approximations to the number or proportion of successes. However a hypothesis test can be performed using binomial probabilities without the need for this normal approximation.
A \((1 - \alpha)\) confidence interval for a parameter \(\theta\) can be found as follows:
Exact confidence interval for \(\pi\) when \(n = 20\) and \(x = 7\)
For a 2-tailed hypothesis test with H0: \(\pi = \pi_0\), we can use the test statistic
\[ X \;\;\sim\;\; \BinomDistn(n=20, \pi_0) \]For any value of \(\pi_0\), the p-value is double the smaller tail probability from this binomial distribution,
\[ 2 \times P(X \le 7 \mid \pi = \pi_0) \spaced{or} 2 \times P(X \ge 7 \mid \pi = \pi_0) \]For a test at significance level \(\alpha = 5\%\), we reject H0 if the p-value is less than 0.05.
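A sketch of this test inversion (Python with numpy and scipy.stats); the grid search over \(\pi_0\) simply automates the trial and error described next:

```python
import numpy as np
from scipy import stats

x, n = 7, 20

def p_value(pi0):
    # Two-tailed exact p-value: double the smaller binomial tail probability
    return 2 * min(stats.binom.cdf(x, n, pi0), stats.binom.sf(x - 1, n, pi0))

# The 95% confidence interval is the set of pi0 with p-value >= 0.05
grid = np.arange(0.001, 0.999, 0.001)
accepted = [pi0 for pi0 in grid if p_value(pi0) >= 0.05]
print(f"{min(accepted):.3f} < pi < {max(accepted):.3f}")   # about (0.154, 0.592)
```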
A 95% confidence interval for \(\pi\) consists of the values \(\pi_0\) that are not rejected in this test. By trial-and-error with different values of \(\pi_0\), we find the following p-values,
H0 is only accepted at the 5% significance level for values of \(\pi_0\) between these, so the exact 95% confidence interval is
\[ 0.154 \;\;\lt\;\; \pi \;\;\lt\;\; 0.592\]

We will now find a 95% confidence interval for the median of a data set from a random sample, without making any assumptions about the shape of the underlying distribution. This will be done by inverting a test for the hypotheses
where the parameter \(\theta\) is the median of the distribution. This can be based on the number of values less than \(\theta_0\). If \(\theta_0\) really is the median (and H0 is true),
\[ Y \;\;=\;\; \text{number of values below }\theta_0 \;\;\sim\;\; \BinomDistn(n, \pi=0.5) \]The p-value for the test is the probability of \(Y\) being further from \(\diagfrac{n}{2}\) than was observed in the data.
A 95% confidence interval can be found (by trial-and-error) as the values of \(\theta_0\) that would result in the null hypothesis being accepted — i.e. with p-values greater than 0.05.
Question
A certain disease in dogs is characterized in the early stages by unusually high levels of a blood protein. This measurement has been proposed as a diagnostic test for infection: if the measured level is above a threshold value, the dog is diagnosed as having the disease. A ‘false positive’ occurs when a healthy dog happens to have a level above the threshold and is wrongly diagnosed as having the disease.
Measurements on a sample of 50 unaffected dogs gave the following results:
14.4 16.1 11.9 7.5 9.3 |
9.3 16.4 4.9 12.8 23.7 |
23.7 13.5 9.3 8.6 17.6 |
19.0 20.3 17.0 30.4 8.3 |
13.9 20.1 10.2 23.6 14.9 |
50.4 8.5 7.5 23.0 18.7 |
5.5 11.2 31.8 20.4 13.0 |
13.4 11.7 7.8 19.4 21.4 |
16.4 28.2 31.3 30.3 26.6 |
Measurements on a sample of 27 diseased dogs gave the results:
21.9 40.8 37.6 |
41.7 23.3 39.8 |
66.3 34.4 27.8 |
49.8 19.3 55.5 |
50.7 27.5 30.2 |
60.7 8.5 51.2 |
24.2 24.9 16.1 |
28.2 30.2 |
15.7 18.3 |
22.5 8.4 |
Find 95% confidence intervals for the median level of blood protein in the two groups.
(Solved in full version)
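As an indication of how one of these intervals might be computed, the sketch below (Python with numpy and scipy.stats) inverts the sign test for the diseased group; doubling the smaller binomial tail is used as the two-tailed p-value.

```python
import numpy as np
from scipy import stats

# Blood protein measurements for the 27 diseased dogs
x = np.array([21.9, 40.8, 37.6, 41.7, 23.3, 39.8, 66.3, 34.4, 27.8,
              49.8, 19.3, 55.5, 50.7, 27.5, 30.2, 60.7, 8.5, 51.2,
              24.2, 24.9, 16.1, 28.2, 30.2, 15.7, 18.3, 22.5, 8.4])
n = len(x)

def p_value(theta0):
    y = int(np.sum(x < theta0))     # Y ~ BinomDistn(n, 0.5) under H0
    return 2 * min(stats.binom.cdf(y, n, 0.5), stats.binom.sf(y - 1, n, 0.5))

# The 95% confidence interval is the set of theta0 that are not rejected
grid = np.arange(x.min(), x.max(), 0.05)
accepted = grid[[p_value(t) >= 0.05 for t in grid]]
print(f"{accepted.min():.1f} < median < {accepted.max():.1f}")
```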
Consider a likelihood ratio test for the hypotheses
This is based on the log-likelihood when H0 holds, \(\ell(\theta_0)\), and its maximum possible value under HA, \(\ell(\hat{\theta})\), where \(\hat{\theta}\) is the maximum likelihood estimate of \(\theta\). For a hypothesis test with 5% significance level, we reject H0 if
\[ 2\left(\ell(\hat{\theta}) - \ell(\theta_0)\right) \;\;\gt\;\; K \]where the constant \(K\) is the 95th percentile of the \(\ChiSqrDistn(1 \text{ df})\) distribution, \(K = 3.841\).
Inverting the likelihood ratio test
A 95% confidence interval for \(\theta\) can therefore be found as the values for which
\[ \begin{align} 2\left(\ell(\hat{\theta}) - \ell(\theta_0)\right) \;\;&\le\;\; 3.841 \\[0.4em] \ell(\theta_0)\;\;&\gt\;\; \ell(\hat{\theta}) - \frac{3.841}{2} \\[0.5em] \ell(\theta_0)\;\;&\gt\;\; \ell(\hat{\theta}) - 1.921 \end{align} \]It is therefore the values of \(\theta\) for which the log-likelihood is within 1.921 of its maximum.
Question
Clinical records give the survival time in months from diagnosis of 30 sufferers from a certain disease as
9.73 5.56 4.28 4.87 |
1.55 6.20 1.08 7.17 |
28.65 6.10 16.16 9.92 |
2.40 6.19 7.67 1.11 |
4.66 4.35 7.31 3.28 |
13.38 3.08 0.41 4.33 |
2.16 4.49 0.75 |
4.45 10.29 0.90 |
If the survival times are exponentially distributed with a death rate of \(\lambda\), find a 95% confidence interval for \(\lambda\).
(Solved in full version)
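The solution is in the full version, but a sketch (Python with numpy) shows the idea: compute \(\hat{\lambda} = n / \sum x_i\), then search for the values of \(\lambda\) whose log-likelihood is within 1.921 of the maximum.

```python
import numpy as np

x = np.array([9.73, 5.56, 4.28, 4.87, 1.55, 6.20, 1.08, 7.17, 28.65, 6.10,
              16.16, 9.92, 2.40, 6.19, 7.67, 1.11, 4.66, 4.35, 7.31, 3.28,
              13.38, 3.08, 0.41, 4.33, 2.16, 4.49, 0.75, 4.45, 10.29, 0.90])
n, total = len(x), x.sum()

def loglik(lam):
    # Exponential log-likelihood: n log(lam) - lam * sum(x)
    return n * np.log(lam) - lam * total

lam_hat = n / total                    # maximum likelihood estimate
# The interval is all lambda with log-likelihood within 1.921 of the maximum
grid = np.arange(0.005, 1.0, 0.0005)
inside = grid[loglik(grid) >= loglik(lam_hat) - 1.921]
print(f"lambda_hat = {lam_hat:.4f}")
print(f"95% CI: {inside.min():.4f} < lambda < {inside.max():.4f}")
```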