Inference
Statistical inference refers to statistical techniques for obtaining information about a population parameter (or parameters) from a random sample. There are two branches of inference:
Estimation
Point estimates and confidence intervals give answers to questions of the form:
What parameter values would be consistent with the sample data?
Hypothesis tests
This chapter deals with a related type of question:
Are the sample data consistent with some statement about the parameters?
Errors and strength of evidence
A single random sample can rarely provide enough information about a population parameter to allow us to be sure whether or not any statement (hypothesis) about that parameter will be true. The best we can hope for is an indication of the strength of the evidence against it.
Randomness in sports results
Although we like to think that the 'best' team wins in sports competitions, there is actually considerable variability in the results that can only be explained through randomness. For example when two teams play a series of matches, the same team rarely wins all matches.
English Premier Soccer League, 2008/09
In the English Premier Soccer league, each team plays every other team twice (home and away) during the season. Three points are awarded for a win and one point for a draw. The table below shows the wins, draws, losses and total points for all teams at the end of the 2008/09 season.
Position | Team | Wins | Draws | Losses | Points |
---|---|---|---|---|---|
1 | Manchester City | 27 | 5 | 6 | 86 |
2 | Liverpool | 26 | 6 | 6 | 84 |
3 | Chelsea | 25 | 7 | 6 | 82 |
4 | Arsenal | 24 | 7 | 7 | 79 |
5 | Everton | 21 | 9 | 8 | 72 |
6 | Tottenham Hotspur | 21 | 6 | 11 | 69 |
7 | Manchester United | 19 | 7 | 12 | 64 |
8 | Southampton | 15 | 11 | 12 | 56 |
9 | Stoke City | 13 | 11 | 14 | 50 |
10 | Newcastle United | 15 | 4 | 19 | 49 |
11 | Crystal Palace | 13 | 6 | 19 | 45 |
12 | Swansea City | 11 | 9 | 18 | 42 |
13 | West Ham United | 11 | 7 | 20 | 40 |
14 | Sunderland | 10 | 8 | 20 | 38 |
15 | Aston Villa | 10 | 8 | 20 | 38 |
16 | Hull City | 10 | 7 | 21 | 37 |
17 | West Bromwich Albion | 7 | 15 | 16 | 36 |
18 | Norwich City | 8 | 9 | 21 | 33 |
19 | Fulham | 9 | 5 | 24 | 32 |
20 | Cardiff City | 7 | 9 | 22 | 30 |
Were all teams evenly matched?
A simulation can help us to investigate this question. A computer can be used to generate results for all 380 matches in the season for evenly matched teams, each result having probabilities 0.372, 0.372 and 0.255 of being a win, loss or draw for the home team. (A proportion 0.255 of games in the actual league resulted in draws.)
If there are differences between teams, we would expect the worst teams to have very few points at the end of the season and the best to have many. On the other hand, for evenly matched teams, we would expect the final points of all 20 teams to be similar. The spread of final points in the league table should tell us something about whether the teams are evenly matched.
In the actual league table, the standard deviation of the final points for the 20 teams was 18.236. The diagram below shows the standard deviations in 200 simulated league tables with evenly matched teams.
The spread of points in the actual league table was much higher than the spread that would be likely for evenly matched teams, so:
There is strong evidence that the top teams are 'better' than the bottom teams.
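The simulation described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `simulate_season` is ours, the win and loss probabilities are taken as (1 − 0.255)/2 each, and the spread is measured with the sample standard deviation (the text does not say which divisor was used).

```python
import random
import statistics

def simulate_season(n_teams=20, p_draw=0.255, rng=None):
    """Simulate one season for evenly matched teams and return the final points."""
    if rng is None:
        rng = random.Random()
    p_home_win = (1 - p_draw) / 2      # assumption: win and loss equally likely
    points = [0] * n_teams
    for home in range(n_teams):
        for away in range(n_teams):
            if home == away:
                continue               # each ordered pair is one match (380 in all)
            u = rng.random()
            if u < p_home_win:         # home win
                points[home] += 3
            elif u < 2 * p_home_win:   # away win
                points[away] += 3
            else:                      # draw
                points[home] += 1
                points[away] += 1
    return points

rng = random.Random(1)
simulated_sds = [statistics.stdev(simulate_season(rng=rng)) for _ in range(200)]
actual_sd = 18.236                     # standard deviation quoted in the text
n_as_extreme = sum(sd >= actual_sd for sd in simulated_sds)
print(f"largest simulated SD: {max(simulated_sds):.2f}")
print(f"simulated tables with SD >= {actual_sd}: {n_as_extreme} of 200")
```

If none of the simulated standard deviations comes close to the actual value, a spread that large would be very unlikely for evenly matched teams.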
Is a population proportion equal to 0.8, or is it lower?
Consider a random sample of n = 100 values from a population in which it is claimed that the probability of success is at least 0.80.
Is an observed count of x = 72 successes low enough to throw doubt on the claimed probability of success, 0.80?
Simulation
We can perform a simulation with probability 0.80 of success to generate a random sample of 100 successes and failures. The diagram below shows the numbers of successes in 200 such simulations.
If the probability of success was really 0.80, there would be approximately a 10/200 = 5% chance of 72 or fewer successes being observed. Since this is low,
There is moderately strong evidence that the probability of success is less than 0.80 — observing only 72 successes would be unlikely if the probability really was 0.80.
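A minimal sketch of this kind of simulation is given below. The function name and the seed are our own choices, and because the samples are random the count of low results will differ a little from the 10 out of 200 reported above.

```python
import random

def simulate_success_counts(n=100, p=0.80, n_sims=200, seed=0):
    """Return the number of successes in each of n_sims simulated samples of size n."""
    rng = random.Random(seed)
    return [sum(rng.random() < p for _ in range(n)) for _ in range(n_sims)]

counts = simulate_success_counts()
n_low = sum(c <= 72 for c in counts)
print(f"{n_low} of {len(counts)} simulated samples had 72 or fewer successes")
```

Increasing `n_sims` gives a more stable estimate of the chance of seeing 72 or fewer successes when the probability of success is 0.80.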
Does a random sample have mean 520?
In an industrial process, some measurement, X, is normally distributed with standard deviation σ = 10. Its mean should be µ = 520 but can drift from this, so samples of n = 10 measurements are regularly collected as part of quality control.
If one such sample had mean 529, does the process need to be adjusted? The question can be reexpressed as:
If the underlying population mean was really µ = 520, what is the chance of a sample of 10 values having a mean as far from 520 as 529?
Simulation
We can base our answer on the distribution of the sample mean, assuming that X has a normal distribution with µ = 520 and σ = 10. Simulations of 10 values from this distribution can be used to get an approximate distribution.
From the 200 simulated samples above, it seems very unlikely that a sample mean of 529 would have been recorded if the process mean had been µ = 520. We therefore conclude that:
There is strong evidence that the process no longer has a mean of µ = 520 and needs to be adjusted.
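The following sketch shows one way such a simulation might be run, assuming (as in the text) normal measurements with µ = 520 and σ = 10. The function name and seed are illustrative only.

```python
import random
import statistics

def simulate_sample_means(mu=520.0, sigma=10.0, n=10, n_sims=200, seed=0):
    """Return the means of n_sims simulated samples of n normal measurements."""
    rng = random.Random(seed)
    return [statistics.fmean(rng.gauss(mu, sigma) for _ in range(n))
            for _ in range(n_sims)]

means = simulate_sample_means()
# Count simulated means that are at least as far from 520 as the observed 529
n_far = sum(abs(m - 520) >= 9 for m in means)
print(f"{n_far} of {len(means)} simulated means were as far from 520 as 529")
```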
Simulation and randomisation
Simulation and randomisation both involve randomly generated data sets.
Randomisation of samples from two populations
Consider random samples from two populations. If these populations are identical, any of the sampled values could have equally belonged to either population. Random variations of the data can be generated by randomly allocating the values to the two samples.
The diagram below gives an example of two-group data and a random allocation of the 101 values to the two groups.
Test for equal population means
In the example above, the mean for Group A was 0.902 higher than that for Group B. To examine whether the underlying population means are equal, we can find how far apart the sample means would be in randomised variations of the data.
The diagram below shows the difference between the Group A and Group B means for 100 randomised variations of the data.
A difference between sample means as large as that recorded would be unusual if both underlying populations were the same, so
We conclude that there is strong evidence that the population means for the two groups are different.
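A randomisation test of this kind can be sketched as follows. The two-group data from the diagram are not reproduced in this summary, so the values below are hypothetical and made up purely for illustration; the function name is also our own.

```python
import random
import statistics

def randomisation_test_means(group_a, group_b, n_rand=1000, seed=0):
    """Return the observed difference in means and the proportion of random
    allocations whose difference is at least as far from zero."""
    rng = random.Random(seed)
    observed = statistics.fmean(group_a) - statistics.fmean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_extreme = 0
    for _ in range(n_rand):
        rng.shuffle(pooled)                      # random allocation to the two groups
        diff = statistics.fmean(pooled[:n_a]) - statistics.fmean(pooled[n_a:])
        if abs(diff) >= abs(observed):
            n_extreme += 1
    return observed, n_extreme / n_rand

# Hypothetical data (not the data set used in the text)
group_a = [5.1, 6.3, 5.8, 7.0, 6.1, 6.6]
group_b = [4.2, 5.0, 4.8, 5.5, 4.4, 5.2]
observed, p_value = randomisation_test_means(group_a, group_b)
print(f"observed difference = {observed:.3f}, approximate p-value = {p_value:.3f}")
```

The proportion returned plays the role of a p-value: a small value means a difference as large as that observed would be unusual if both populations were the same.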
Another example using randomisation
Randomisation can be used to assess whether two variables are uncorrelated.
Y | X |
---|---|
65 | 63 |
60 | 62 |
58 | 41 |
57 | 41 |
55 | 50 |
49 | 51 |
46 | 51 |
43 | 34 |
42 | 32 |
40 | 45 |
39 | 36 |
37 | 41 |
36 | 53 |
Are the variables uncorrelated in the underlying population?
Randomisation
If Y and X were unrelated, any permutation of the values of X would have been equally likely, so we can generate random variations of the data by shuffling the values of X between the individuals.
The correlation between Y and X in the data was 0.537. To assess whether such a large correlation could have arisen if Y and X were unrelated, we can use randomisations to investigate the distribution of the correlation coefficient when they are unrelated. The diagram below shows the correlation coefficients from 200 randomisations of the data.
Since there would only be a 7/200 = 3.5% chance of a sample correlation being as far from zero as that observed, we can conclude that:
There is moderately strong evidence that there is a real relationship between Y and X (in the underlying population).
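The randomisation can be sketched with the 13 pairs of values tabulated above. The function names and seed are our own, and a different set of random shuffles will not give exactly the 7 out of 200 reported in the text.

```python
import math
import random

def correlation(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def randomised_correlations(x, y, n_rand=200, seed=0):
    """Correlations obtained after shuffling the x-values between individuals."""
    rng = random.Random(seed)
    shuffled = list(x)
    results = []
    for _ in range(n_rand):
        rng.shuffle(shuffled)
        results.append(correlation(shuffled, y))
    return results

y = [65, 60, 58, 57, 55, 49, 46, 43, 42, 40, 39, 37, 36]
x = [63, 62, 41, 41, 50, 51, 51, 34, 32, 45, 36, 41, 53]
r_obs = correlation(x, y)
rand_r = randomised_correlations(x, y)
n_extreme = sum(abs(r) >= abs(r_obs) for r in rand_r)
print(f"observed r = {r_obs:.3f}")
print(f"{n_extreme} of {len(rand_r)} randomisations were as far from zero")
```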
A general framework
You may find it difficult to spot the common theme in the examples in this section, but they are all examples of hypothesis testing and fit into a common framework that is used for all hypothesis tests.
Data, model and question
We assess whether the null hypothesis is true by asking ...
Are the data consistent with the null hypothesis?
p-value | Interpretation |
---|---|
over 0.1 | no evidence that the null hypothesis does not hold |
between 0.05 and 0.1 | very weak evidence that the null hypothesis does not hold |
between 0.01 and 0.05 | moderately strong evidence that the null hypothesis does not hold |
under 0.01 | strong evidence that the null hypothesis does not hold |
Examples covered by this framework: soccer league in one season, proportion, process mean, comparison of groups, correlation coefficient.
Inference and random samples
In the rest of this chapter, we assume that the observed data are a random sample from some population. Inference asks questions about characteristics of the underlying population distribution — unknown population parameters.
The null and alternative hypotheses specify values for the unknown parameters.
Categorical populations
For categorical populations, the unknowns are the population probabilities for the different categories. We focus on one category that is of particular interest ('success').
The null and alternative hypotheses are specified in terms of the probability of success, π.
Test statistic
A test about a probability, π, could be based on the corresponding sample proportion, p, but it is more convenient to use the number of successes, x, rather than p since we know its distribution,
X ~ binomial (n , π)
P-value
The p-value is the probability of getting such an 'extreme' set of data if the null hypothesis is true. Since we know the binomial distribution of X when the null hypothesis holds,
The p-value is a sum of binomial probabilities
Note that the p-value can be obtained exactly without need for simulations or randomisation.
Example
In 100 values from a categorical population, 72 successes were observed. Is this consistent with probability π = 0.80 of success, or is the probability of success lower?
H0: π = 0.80
HA: π < 0.80
If the null hypothesis is true, we know the distribution of the number of successes, X,
X ~ binomial (n = 100 , π = 0.80)
The p-value for the test is the probability of observing 72 or fewer successes, assuming that the null hypothesis holds. Since this is 0.0342, we conclude that there would be little chance of seeing as low a number of successes if π = 0.80, so
There is moderately strong evidence that π < 0.80
Telepathy experiment
In a telepathy experiment, one subject selects 90 cards with random shapes (circle, square or cross) and attempts to 'send' the shapes to another subject who is out of sight. The second subject correctly guesses 36 shapes.
H0: π = 1/3 (guessing)
HA: π > 1/3 (telepathy)
If the subject is guessing, the number of correct choices will be binomial with π = 1/3 and n = 90.
The p-value for the test is the probability of getting as 'extreme' a count as was observed if the subject guesses, 0.1103. Since this is high, we conclude that there is no evidence of telepathy from the data.
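Since the p-value is a sum of binomial probabilities, it can be computed directly. The sketch below uses only the Python standard library; the helper names are our own. It evaluates the lower tail for the example with 72 successes in 100 and the upper tail for the telepathy experiment.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def lower_tail(x, n, p):
    """P(X <= x), the p-value when HA is of the form pi < pi0."""
    return sum(binomial_pmf(k, n, p) for k in range(0, x + 1))

def upper_tail(x, n, p):
    """P(X >= x), the p-value when HA is of the form pi > pi0."""
    return sum(binomial_pmf(k, n, p) for k in range(x, n + 1))

p_low = lower_tail(72, 100, 0.80)     # H0: pi = 0.80, HA: pi < 0.80
p_high = upper_tail(36, 90, 1 / 3)    # H0: pi = 1/3,  HA: pi > 1/3
print(f"P(X <= 72) = {p_low:.4f}")
print(f"P(X >= 36) = {p_high:.4f}")
```

These should agree closely with the p-values of 0.0342 and 0.1103 quoted in the two examples.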
Interpretation of p-values
A p-value only tells you whether the data are consistent with the null hypothesis or are inconsistent with it. From a very small p-value, we can conclude that the null hypothesis is probably wrong. However a high p-value does not mean that the null hypothesis is correct, only that the observed data are consistent with it. In the telepathy example, we could never be sure that π was not very very slightly different from 1/3.
A hypothesis test should never conclude that the null hypothesis is correct.
For the telepathy example, the correct interpretation of p-values would be...
p-value | Conclusion |
---|---|
p > 0.1 | There is no evidence against π = 1/3. |
0.05 < p < 0.1 | There is only slight evidence against π = 1/3. |
0.01 < p < 0.05 | There is moderately strong evidence against π = 1/3. |
p < 0.01 | There is strong evidence against π = 1/3. |
P-value for a one-tailed test
Consider a test of the hypotheses
H0 : π = π0
HA : π < π0
where π0 is a constant of interest. The following diagram shows how the p-value is found:
Since the probabilities in one tail of the distribution are added, this is called a one-tailed test.
P-value for a two-tailed test
If the alternative hypothesis allows either high or low values of x, the test is called a two-tailed test,
H0 : π = π0
HA : π ≠ π0
The p-value is then double the smaller tail probability since values of x in both tails of the binomial distribution would provide evidence for HA.
Example
In a population of people, a proportion 0.574 have blood group O. In a sub-group of this population, a sample of 54 individuals were tested and 26 of these had blood group O. Is there any evidence that they differ from the main population?
This question can be expressed with the hypotheses
H0 : π = 0.574
HA : π ≠ 0.574
If the sub-group had the same proportion with blood group O as the main population, the number out of 54 with this blood group would have the binomial distribution below.
There is a probability 0.1085 of getting 26 or fewer, but large sample numbers with blood group O would also throw doubt on the null hypothesis, so the p-value is double the low-tail probability, 0.2169.
From the large p-value, we conclude that there is no evidence of a difference between the sub-group and the main population.
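The doubling of the smaller tail probability can be sketched as below for the blood group example. The function name is our own, and capping the result at 1 is an added safeguard rather than something stated in the text.

```python
from math import comb

def two_tailed_binomial_p(x, n, p0):
    """Two-tailed p-value: double the smaller tail probability of binomial(n, p0)."""
    pmf = [comb(n, k) * p0**k * (1 - p0)**(n - k) for k in range(n + 1)]
    low = sum(pmf[: x + 1])    # P(X <= x)
    high = sum(pmf[x:])        # P(X >= x)
    return min(1.0, 2 * min(low, high))

p_value = two_tailed_binomial_p(26, 54, 0.574)
print(f"two-tailed p-value = {p_value:.4f}")
```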
Normal approximation
When the sample size, n, is large, a normal distribution may be used as an approximation to the binomial.
Approximate p-value
We again test the hypotheses
H0 : π = π0
HA : π < π0
If n is large, the approximate normal distribution for x can be used to obtain the p-value for the test.
Home-based businesses owned by women
A study found that 369 out of 899 sampled home-based businesses were owned by women. Are they less likely to be owned by females than by males? The hypotheses are...
H0 : π = 0.5
HA : π < 0.5
where π = P(owned by female).
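If H0 is true, the number of businesses owned by women, X, is approximately normal with mean nπ0 = 449.5 and standard deviation √(nπ0(1 − π0)) = 14.99, so the z-score z = (x − 449.5) / 14.99 can be compared with a standard normal distribution. Since x is discrete,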
P(X ≤ 369) = P(X ≤ 369.5) = P(X ≤ 369.9) = ...
To find this tail probability, any value of x between 369 and 370 might have been used when evaluating the z-score. The p-value can be more accurately estimated by using 369.5. This is called a continuity correction.
The continuity correction involves either adding or subtracting 0.5 from the observed count, x, before finding the z-score.
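As an illustration, the approximate p-value for the home-based businesses example can be evaluated with and without the continuity correction. This is a sketch using the standard library's `NormalDist`; the function name and its `continuity` flag are our own.

```python
from math import sqrt
from statistics import NormalDist

def normal_approx_lower_tail(x, n, p0, continuity=True):
    """Return (z, p-value) for HA: pi < pi0 using a normal approximation to X."""
    mean = n * p0
    sd = sqrt(n * p0 * (1 - p0))
    x_adj = x + 0.5 if continuity else x   # lower tail, so 0.5 is added to x
    z = (x_adj - mean) / sd
    return z, NormalDist().cdf(z)

z_cc, p_cc = normal_approx_lower_tail(369, 899, 0.5)
z_plain, p_plain = normal_approx_lower_tail(369, 899, 0.5, continuity=False)
print(f"with correction:    z = {z_cc:.3f}, p-value = {p_cc:.2e}")
print(f"without correction: z = {z_plain:.3f}, p-value = {p_plain:.2e}")
```

Both versions give a p-value that is virtually zero here; the correction matters more when the p-value is near a threshold such as 0.05.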
Difference between parameter and estimate
If the value of a parameter specified by the null hypothesis (e.g. a population proportion, π0) is close to the corresponding sample statistic (e.g. the sample proportion, p) then there is no reason to doubt the null hypothesis. However if they are far apart, the data are not consistent with the null hypothesis and we should conclude that the alternative hypothesis holds.
A large distance between the estimate and hypothesized value gives evidence against the null hypothesis.
Statistical distance
To help assess this difference, we express it as a number of standard errors since we know from the 70-95-100 rule of thumb that 2 (standard errors) is a large distance, 3 is a very large distance, and 1 is not much.
For a proportion, the number of standard errors is
z = (p − π0) / √(π0(1 − π0) / n)
In general, the statistical distance of an estimate to a hypothesised value of the underlying parameter is
z = (estimate − hypothesised value) / standard error
Values more than 2, or less than -2, suggest that the hypothesized value is wrong. However if z is close to zero, p is reasonably close to π0 and we should not doubt the null hypothesis.
Test statistic and p-value
The statistical distance of an estimate to a hypothesised value of the underlying parameter is
z = (estimate − hypothesised value) / standard error
If the null hypothesis holds, z has approximately a standard normal distribution and it can be used as a test statistic for tests about the parameter. The p-value can be determined from the tail areas of this standard normal distribution.
For a two-tailed test, the p-value is the total area in the two tails of this distribution beyond ±|z|, and can be looked up using either normal tables or Excel.
Example
We again examine a data set in which 369 out of 899 home-based businesses were owned by women. Are less than 50% of such businesses owned by women?
H0 : π = 0.5
HA : π < 0.5
The diagram below shows how the 'statistical distance' of the sample proportion from 0.5 is calculated.
The p-value for the test is the lower tail area of the standard normal distribution and is virtually zero here, so we again conclude that it is almost certain that less than 50% of such businesses are owned by women.
Using a 'statistical distance' to test a proportion gives a p-value that is identical to the p-value based on a normal approximation to the number of successes without a continuity correction. (The p-value is slightly different if a continuity correction is used.) However this approach will be used to test many different kinds of parameter in later sections.
(The procedure will be refined slightly when applied to situations where the standard error of the estimate must itself be estimated from the sample data. A t distribution will be used instead of a standard normal distribution.)
Tests about numerical populations
The most important characteristic of a numerical population is usually its mean, µ. Hypothesis tests therefore usually question the value of this parameter.
Null and alternative hypotheses
Two-tailed tests about a population mean involve the hypotheses
H0 : μ = μ0
HA : μ ≠ μ0
where µ0 is the constant that we think may be the true mean.
In a one-tailed test, the alternative hypothesis involves only high (or low) values of µ, such as
H0 : μ = μ0
HA : μ > μ0
Hypotheses and p-value
We initially assume that the population standard deviation σ is a known value. The null hypothesis is usually
H0 : µ = µ0
The test is based on the sample mean, x̄. This has a distribution that is approximately normal and has mean and standard deviation
mean of x̄ = μ
standard deviation of x̄ = σ / √n
Since the distribution of x̄ is fully known when H0 is true, a tail area of its distribution gives the p-value for the test. The tail of the distribution to use depends on the form of the alternative hypothesis.
Statistical distance and p-value
If σ is a known value, the calculation to find the p-value for testing the mean can be expressed in terms of the general formula for the statistical distance between a parameter and its estimate,
z = (estimate − hypothesised value) / standard error
In the context of a test about means,
z = (x̄ − µ0) / (σ / √n)
Since z has a standard normal(0, 1) distribution when the null hypothesis holds, it can be used as a test statistic and the p-value for the test can be determined from its tail areas.
For a two-tailed test, the p-value is the total area in the two tails of this distribution beyond ±|z|.
Example
The mean of a sample of n = 30 values is 16.8. If the population standard deviation is known to be σ = 7.1, does the population have mean µ = 18.3, or is the mean now lower than 18.3?
H0 : µ = 18.3
HA : µ < 18.3
The p-value for the test is the lower tail area of the standard normal distribution below the statistical distance of 16.8 from 18.3 (a z statistic),
z = (16.8 − 18.3) / (7.1 / √30) ≈ −1.16
which gives a p-value of about 0.12.
The p-value is reasonably large, meaning that a sample mean as low as 16.8 would not be unusual if µ = 18.3, so there is no evidence against µ = 18.3.
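The calculation for this example can be checked with a short sketch. The function name is our own and the tail area comes from the standard library's `NormalDist`.

```python
from math import sqrt
from statistics import NormalDist

def z_test_lower_tail(xbar, mu0, sigma, n):
    """Return (z, p-value) for H0: mu = mu0 against HA: mu < mu0, with sigma known."""
    std_error = sigma / sqrt(n)
    z = (xbar - mu0) / std_error
    return z, NormalDist().cdf(z)   # lower tail area of the standard normal

z, p_value = z_test_lower_tail(xbar=16.8, mu0=18.3, sigma=7.1, n=30)
print(f"z = {z:.3f}, p-value = {p_value:.4f}")
```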
Test statistic if σ is unknown
In practical problems, the value of σ is rarely known so we cannot use
z = (x̄ − µ0) / (σ / √n)
as a test statistic; it cannot be evaluated even when H0 is true. Instead, we must use a closely related type of 'statistical distance' between the sample mean and µ0,
t = (x̄ − µ0) / (s / √n)
where s is the sample standard deviation. This test statistic no longer has a normal distribution. It has greater spread due to the extra variability that results from estimating σ by s, and has a standard distribution called a t distribution with (n - 1) degrees of freedom.
Finding a p-value from the t distribution
When testing the value of µ when σ is unknown, we use the test statistic
t = (x̄ − µ0) / (s / √n)
This has a t distribution (with n − 1 degrees of freedom) when H0 is true, so the p-value is found from a tail area of this distribution.
One-tailed test
H0 : μ = μ0
HA : μ < μ0
The steps for testing these hypotheses are shown in the diagram below.
Example
Consider a sample of n = 13 values with mean x̄ = 16.14 and standard deviation s = 2.15. A test for whether the population mean is more than 15.0 uses the hypotheses:
H0 : µ = 15
HA : µ > 15
Since the population standard deviation, σ, is unknown, the test must be based on a t statistic.
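The t statistic for this example is easily computed. The Python standard library has no t distribution, so the sketch below approximates the upper tail area by simulating t statistics from normal samples for which H0 is true. This simulation is our own substitute for looking up the t distribution with 12 degrees of freedom, and the function names are illustrative.

```python
import random
from math import sqrt

def t_statistic(xbar, s, n, mu0):
    """Statistical distance of the sample mean from mu0 when sigma is estimated by s."""
    return (xbar - mu0) / (s / sqrt(n))

def simulated_upper_tail(t_obs, n, n_sims=20000, seed=0):
    """Approximate P(T >= t_obs) for a t distribution with n - 1 degrees of freedom."""
    rng = random.Random(seed)
    n_extreme = 0
    for _ in range(n_sims):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]   # H0 holds: mean is 0
        mean = sum(sample) / n
        s = sqrt(sum((v - mean) ** 2 for v in sample) / (n - 1))
        if mean / (s / sqrt(n)) >= t_obs:
            n_extreme += 1
    return n_extreme / n_sims

t_obs = t_statistic(xbar=16.14, s=2.15, n=13, mu0=15.0)
p_value = simulated_upper_tail(t_obs, n=13)
print(f"t = {t_obs:.3f}, approximate p-value = {p_value:.3f}")
```

Statistical software would normally report the exact tail area of the t distribution instead of a simulated approximation.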
Decisions from tests
Many hypothesis tests are followed by some action that depends on whether we conclude that H0 or HA is true. This decision depends on the data.
Decision | Action |
---|---|
accept H0 | some action (often the status quo) |
reject H0 | a different action (often a change to a process) |
There are two ways in which an error might be made — wrongly rejecting H0 when it is true (called a Type I error), and wrongly accepting H0 when it is false (called a Type II error).
True state of nature | Decision: accept H0 | Decision: reject H0 |
---|---|---|
H0 is true | correct | Type I error |
HA (H0 is false) | Type II error | correct |
A good decision rule about whether to accept or reject H0 (and perform the corresponding action) should ideally have small probabilities for both kinds of error.
Using a sample mean to make decisions
We assume initially that a population is normally distributed with known standard deviation, σ, and that we want a test for the hypotheses:
H0 : μ = μ0
HA : μ > μ0
Large values of x̄ throw doubt on H0, so our decision should be of the form:
Data | Decision |
---|---|
x̄ < k | accept H0 |
x̄ ≥ k | reject H0 |
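The probabilities of Type I and Type II errors correspond to the two error cells of the table below, where k is the critical value in the decision rule:
Truth | Decision: accept H0 | Decision: reject H0 |
---|---|---|
H0 is true | | P(Type I error) = P(x̄ ≥ k) when H0 is true |
HA (H0 is false) | P(Type II error) = P(x̄ < k) when HA is true | |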
Example: Test for the hypotheses:
H0 : μ = 10
HA : μ > 10
If it is known that σ = 4, then the mean of a random sample of n = 16 values is approximately normal with mean µ and standard deviation σ/√n = 1. If the decision rule rejects H0 when the sample mean is greater than k, the diagram below illustrates the probabilities of Type I and Type II errors.
Increasing k reduces P(Type I error) but increases P(Type II error). The choice of k for the decision rule is a trade-off between the acceptable sizes of the two types of error.
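The trade-off can be illustrated numerically for this example. The probability of a Type II error depends on the actual value of µ, which the text does not specify, so the sketch below assumes a hypothetical alternative mean of µ = 12; the function name is also our own.

```python
from statistics import NormalDist

def error_probabilities(k, mu0=10.0, mu_alt=12.0, sigma=4.0, n=16):
    """Error probabilities for the rule 'reject H0 if the sample mean exceeds k'.
    mu_alt is a hypothetical value of the mean under HA (an assumption)."""
    std_error = sigma / n ** 0.5
    p_type1 = 1 - NormalDist(mu0, std_error).cdf(k)   # reject although H0 is true
    p_type2 = NormalDist(mu_alt, std_error).cdf(k)    # accept although mu = mu_alt
    return p_type1, p_type2

for k in (10.5, 11.0, 11.5, 12.0):
    p1, p2 = error_probabilities(k)
    print(f"k = {k:4.1f}: P(Type I) = {p1:.3f}, P(Type II) = {p2:.3f}")
```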
Significance level
The decision rule affects the probabilities of Type I and Type II errors and there is always a trade-off between these two probabilities. Selecting a critical value to reduce one error probability will increase the other.
In practice, we usually concentrate on the probability of a Type I error. The decision rule is chosen to make the probability of a Type I error equal to a pre-chosen value, often 5% or 1%. This probability is called the significance level of the test.
P-values and decisions
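Since the distribution of a test statistic is always fully known when H0 is true, the critical value for any significance level can be found as a quantile of this distribution. For example, in a one-tailed test with HA : μ > μ0 that is based on a z statistic, H0 is rejected at the 5% significance level when z exceeds 1.645, the upper 5% quantile of the standard normal distribution.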
The critical value for a test depends on the distribution of its test statistic (e.g. a normal, t or other distribution). However the decision rule can equivalently be based in a simple way on the p-value for the test. For example, for a test with significance level 5%, the decision rule can always be expressed as:
p-value | Decision |
---|---|
p-value > 0.05 | accept H0 |
p-value < 0.05 | reject H0 |
For a test with significance level 1%, the null hypothesis, H0, should be rejected if the p-value is less than 0.01.
If computer software provides the p-value for a hypothesis test, it is easy to translate it into a decision to accept (or reject) the null hypothesis at any significance level.
Power of a test
For any decision rule, the significance level gives the probability of an error when H0 is true — the Type I error. It is common to describe what happens if HA is true with the probability of correctly picking HA. This is called the power of the test and is one minus the probability of a Type II error.
Truth | Decision: accept H0 | Decision: reject H0 |
---|---|---|
H0 is true | | Significance level = P (Type I error) |
HA (H0 is false) | P (Type II error) | Power = 1 - P (Type II error) |
When the alternative hypothesis includes a range of possible parameter values (e.g. µ ≠ 0), the power is not a single value but depends on the actual parameter value.
Increasing the power of a test
It is clearly desirable to use a test whose power is as close to 1.0 as possible. There are three different ways to increase the power.
When the significance level is fixed, increasing the sample size is usually the only way to improve the power.
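The effect of sample size on power can be sketched for a one-tailed test about a mean with known σ. The parameter values below (µ0 = 10, a true mean of 11 and σ = 4) are hypothetical choices for illustration, as is the function name.

```python
from statistics import NormalDist

def power_upper_z_test(mu0, mu_true, sigma, n, alpha=0.05):
    """Power of the test of H0: mu = mu0 against HA: mu > mu0 when the mean is mu_true."""
    std_error = sigma / n ** 0.5
    # Critical value for the sample mean at significance level alpha
    k = mu0 + NormalDist().inv_cdf(1 - alpha) * std_error
    return 1 - NormalDist(mu_true, std_error).cdf(k)

for n in (16, 32, 64, 128):
    power = power_upper_z_test(mu0=10.0, mu_true=11.0, sigma=4.0, n=n)
    print(f"n = {n:3d}: power = {power:.3f}")
```

With the significance level held at 5%, the power rises towards 1 as n increases.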
Symmetric hypotheses
In some situations there is a kind of symmetry between two competing hypotheses. For example, if two candidates, A and B, stand in an election and π is the population proportion who will vote for A, we are interested in which candidate will win:
H1 : π > 0.5
H2 : π < 0.5
Null and alternative hypotheses
In statistical hypothesis testing, the two hypotheses are not treated symmetrically in this way. Instead, we ask whether the sample data are consistent with one particular hypothesis (the null hypothesis, denoted by H0). If the data are not consistent with H0, then we can conclude that the competing hypothesis (the alternative hypothesis, denoted by HA) must be true.
The two possibilities are:
We should never conclude that H0 is likely to be true.
Example
Consider a test for whether a population mean is zero:
H0 : µ = 0.0
HA : µ ≠ 0.0
Based on a random sample, we might conclude:
Describing the credibility of the null hypothesis
A p-value is a numerical summary statistic that describes the strength of the evidence against H0 that is provided by the data.
P-values are interpreted in the same way for all hypothesis tests.
Interpretation of p-values
In all hypothesis tests,
Distribution of p-values
P-values are found from random samples so they have distributions. Regardless of the hypothesis test,
Simulation
Consider a test for whether a population mean is zero:
H0 : µ = 0.0
HA : µ ≠ 0.0
The diagram below shows the p-values from a t-test for these hypotheses, based on several random samples from a normal distribution for which H0 is true. Note that the p-value is equally likely to be anywhere between 0 and 1.
The next diagram shows p-values calculated in the same way, but based on random samples from a normal distribution for which HA is true. Note that the p-value is more likely to be near zero.
Although it is possible to obtain a low p-value when H0 holds and a high p-value when HA holds, low p-values are more likely under HA than under H0.
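The behaviour of p-values under the two hypotheses can be explored with a small simulation. To stay within the Python standard library, this sketch uses a two-tailed z test with known σ = 1 rather than the t-test described above, and the mean of 1.0 used when HA is true is a hypothetical choice.

```python
import random
from statistics import NormalDist, fmean

def simulate_p_values(mu_true, n=10, sigma=1.0, n_sims=2000, seed=0):
    """Two-tailed p-values for H0: mu = 0 from samples whose real mean is mu_true."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    p_values = []
    for _ in range(n_sims):
        xbar = fmean(rng.gauss(mu_true, sigma) for _ in range(n))
        z = xbar / (sigma / n ** 0.5)
        p_values.append(2 * (1 - std_normal.cdf(abs(z))))
    return p_values

p_h0 = simulate_p_values(mu_true=0.0)            # H0 is true
p_ha = simulate_p_values(mu_true=1.0, seed=1)    # HA is true (hypothetical mean)
for label, ps in (("H0 true", p_h0), ("HA true", p_ha)):
    below = sum(p < 0.05 for p in ps) / len(ps)
    print(f"{label}: proportion of p-values below 0.05 = {below:.3f}")
```

When H0 is true, about 5% of the p-values should fall below 0.05, reflecting their even spread between 0 and 1; when HA is true the proportion is much higher.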
P-values and probability
When H0 holds,
On the other hand, when HA holds, p-values are more likely to be near zero and
Examples
Of course, we may be wrong. A p-value of 0.0023 could arise when either H0 or HA holds but it is more likely under HA. And a p-value of 0.4 could also arise when either hypothesis is true.
Interpretation of p-values for all tests
p-value | Interpretation |
---|---|
over 0.1 | no evidence that the null hypothesis does not hold |
between 0.05 and 0.1 | very weak evidence that the null hypothesis does not hold |
between 0.01 and 0.05 | moderately strong evidence that the null hypothesis does not hold |
under 0.01 | strong evidence that the null hypothesis does not hold |
Applying the general properties of p-values to different tests
P-values for all hypothesis tests have the properties that were described earlier in this section. You should now be able to interpret any p-value if you know the null and alternative hypotheses that it tests. (A statistical computer program is generally used to perform hypothesis tests, so knowing the details of how the p-value is obtained is of little importance.)
Example
The following data have been collected. Are they sampled from a normally distributed population?
41.9 90.6 29.9 10.2 33.7 26.9 88.5 6.5 16.6 19.2 12.6 32.0 3.6 8.1 |
68.1 57.9 -3.0 42.2 14.5 25.7 28.1 78.4 126.2 42.0 66.6 20.6 54.6 31.7 |
2.3 45.5 55.5 37.2 51.6 97.1 80.3 41.1 7.3 31.0 30.2 1.7 27.0 38.0 |
144.9 27.8 121.9 26.0 -11.5 15.5 16.9 27.3 23.9 61.1 68.2 10.0 37.8 77.1 |
24.3 63.2 -0.6 1.0 12.1 134.5 53.8 60.4 9.0 -6.4 31.0 -2.8 114.6 19.8 |
11.5 39.6 59.0 20.7 37.3 23.1 32.7 13.0 70.6 87.3 -3.2 -20.8 119.1 -0.1 |
104.4 -4.6 72.5 7.7 31.4 36.9 47.2 74.7 29.1 70.5 77.7 81.0 191.8 1.6 |
-0.8 59.4 -2.2 -12.5 81.6 44.0 63.6 114.3 33.6 83.0 70.8 50.1 55.8 28.3 |
-7.9 51.3 37.7 48.3 88.9 59.4 126.9 35.0 51.0 91.1 -2.7 79.2 0.1 12.9 |
16.2 23.0 22.4 64.4 10.2 7.6 27.7 8.0 23.5 25.3 22.5 |
The diagram below shows a histogram of the data and the best-fitting normal distribution. Could the skewness in the data have occurred by chance from a normal population?
The Shapiro-Wilk W test can be used to test whether data come from a normal distribution:
H0 : population distribution is normal
HA : population distribution is not normal
Computer software reports the p-value for this test as "under 0.01". We conclude that the probability of obtaining such a non-normal looking sample from a normal distribution is less than 0.01, so there is strong evidence that the data do not come from a normal population.