If you don't want to print now,

Chapter 10   Testing Hypotheses

10.1   Introduction to hypothesis tests

10.1.1   Inference

Inference

Statistical inference refers to statistical techniques for obtaining information about a population parameter (or parameters) from a random sample. There are two branches of inference:

Estimation

Point estimates and confidence intervals give answers to questions of the form:

What parameter values would be consistent with the sample data?

Hypothesis tests

This chapter deals with a related type of question:

Are the sample data consistent with some statement about the parameters?

Errors and strength of evidence

A single random sample can rarely provide enough information about a population parameter to allow us to be sure whether or not any statement (hypothesis) about that parameter will be true. The best we can hope for is an indication of the strength of the evidence against it.

10.1.2   Soccer league simulation

Randomness in sports results

Although we like to think that the 'best' team wins in sports competitions, there is actually considerable variability in the results that can only be explained through randomness. For example when two teams play a series of matches, the same team rarely wins all matches.

English Premier Soccer League, 2008/09

In the English Premier Soccer league, each team plays every other team twice (home and away) during the season. Three points are awarded for a win and one point for a draw. The table below shows the wins, draws, losses and total points for all teams at the end of the 2008/09 season.

 
Team
Wins Draws Losses Points
1.
2.
3.
4.
5.
6.
7.
8.
9.
10.
11.
12.
13.
14.
15.
16.
17.
18.
19.
20.
Manchester City
Liverpool
Chelsea
Arsenal
Everton
Tottenham Hotspur
Manchester United
Southampton
Stoke City
Newcastle United
Crystal Palace
Swansea City
West Ham United
Sunderland
Aston Villa
Hull City
West Bromwich Albion
Norwich City
Fulham
Cardiff City
27
26
25
24
21
21
19
15
13
15
13
11
11
10
10
10
7
8
9
7
5
6
7
7
9
6
7
11
11
4
6
9
7
8
8
7
15
9
5
9
6
6
6
7
8
11
12
12
14
19
19
18
20
20
20
21
16
21
24
22
86
84
82
79
72
69
64
56
50
49
45
42
40
38
38
37
36
33
32
30

Were all teams evenly matched?

A simulation can help us to investigate this question. We could be used to generate results from all 380 matches in the season for evenly matched teams, each result having probabilities 0.372, 0.372 and 0.255 of being a win, loss or draw for the home team. (A proportion 0.255 of games in the actual league resulted in draws.)

If there are differences between teams, we would expect the worst teams to have very few points at the end of the season and the best to have many. On the other hand, for evenly matched teams, we would expect all 20 finals points to be similar. The spread of final points in the league table should tell us something about whether the teams are evenly matched.

In the actual league table, the standard deviation of the final points for the 20 teams was 18.236. The diagram below shows the standard deviations in 200 simulated league tables with evenly matched teams.

The spread of points in the actual league table was much higher than the spread that would be likely for evenly matched teams, so:

There is strong evidence that the top teams are 'better' than the bottom teams.

10.1.3   Simulation to test a proportion

Is a population proportion equal to 0.8, or is it lower?

Consider a random sample of n = 100 values from a population in which it is claimed that the probability of success is at least 0.80.

Is an observed count of x = 72 successes low enough to throw doubt on the claimed probability of success, 0.80?

Simulation

We can perform a simulation with probability 0.80 of success to generate a random sample 100 successes and failures. The diagram below shows the numbers of successes in 200 such simulations.

If the probability of success was really 0.80, there would be approximately 10/200 = 5% chance of 72 or successes being observed. Since this is low,

There is moderately strong evidence that the probability of success is less than 0.80 — observing only 72 successes would be unlikely if the probability really was 0.80.

10.1.4   Test for a mean

Does a random sample have mean 520?

In an industrial process, some measurement, X, is normally distributed with standard deviation σ = 10. Its mean should be µ = 520 but can drift from this, so samples of n = 10 measurements are regularly collected as part of quality control.

If one such sample had mean 529, does the process need to be adjusted? The question can be reexpressed as:

If the underlying population mean was really µ = 520, what is the chance a sample of 10 values having a mean as far from 520 as 529?

Simulation

We can base our answer on the distribution of the sample mean, assuming that X has a normal distribution with µ = 520 and σ = 10. Simulations of 10 values from this distribution can be used to get an approximate distribution.

From the 200 simulated samples above, it seems very unlikely that a sample mean of 529 would have been recorded if the process meanhad been µ = 520. We therefore conclude that:

There is strong evidence that the process no longer has a mean of µ = 520 and needs to be adjusted.

10.1.5   Randomisation tests ((optional))

Simulation and randomisation

Simulation and randomisation both involve randomly generated data sets.

Simulation
New data sets are generated directly from a model.
Randomisation
Random variations are made to the data.

Randomisation of samples from two populations

Consider random samples from two populations. If these populations are identical, any of the sampled values could have equally belonged to either population. Random variations of the data can be generated by randomly allocating the values to the two samples.

The diagram below gives an example of two-group data and a random allocation of the 101 values to the two groups.

Test for equal population means

In the example above, the mean for Group A was 0.902 higher than that for Group B. To examine whether the underlying population means are equal, we can find how far apart the sample means would be in randomised variations of the data.

The diagram below shows the difference between the Group A and Group B means for 100 randomised variations of the data.

A difference between sample means as large as that recorded would be unusual if both underlying populations were the same, so

We conclude that there is strong evidence that the population means for the two groups are different.

10.1.6   Randomisation test for correlation ((advanced))

Another example using randomisation

Randomisation can be used to assess whether two variables are uncorrelated.

Variable
Y X
65
60
58
57
55
49
46
43
42
40
39
37
36
63
62
41
41
50
51
51
34
32
45
36
41
53

Are the variables uncorrelated in the underlying population?

Randomisation

If Y and X were unrelated, any permutation of the values of X would have been equally likely, so we can generate random variations of the data by shuffling the values of X between the individuals.

The correlation between Y and X in the data was 0.537. To assess whether such a large correlation could have arisen if Y and X were unrelated, we can use randomisations to investigate the distribution of the correlation coefficient when they are unrelated. The diagram below shows the correlation coefficients from 200 randomisations of the data.

Since there would only be a 7/200 = 3.5% chance of a sample correlation being as far from zero as that observed, we can conclude that:

There is moderately strong evidence that there is a real relationship between Y and X (in the underlying population).

10.1.7   Common patterns in tests

A general framework

You may find it difficult to spot the common theme in the examples in this section, but they are all examples of hypothesis testing and fit into a common framework that is used for all hypothesis tests.

Data, model and question

Data (and model)
The data were assumed to arise from a random mechanism (model), some of whose characteristics are unknown.
Null hypothesis
This is a statement about the characteristics of the model. We are particularly interested in whether the null hypothesis is true.
Alternative hypothesis
This is usually just the opposite of the null hypothesis.

We assess whether the null hypothesis is true by asking ...

Are the data consistent with the null hypothesis?

Test statistic
This is some function of the data that throws light on whether the null or alternative hypothesis holds.
P-value
This is the probability of obtaining a test statistic value as 'extreme' as the one recorded if the null hypothesis holds.
Interpreting the p-value
The following table can be used as a guide:
p-value Interpretation
over 0.1 no evidence that the null hypothesis does not hold
between 0.05 and 0.1 very weak evidence that the null hypothesis does not hold
between 0.01 and 0.05 moderately strong evidence that the null hypothesis does not hold
under 0.01 strong evidence that the null hypothesis does not hold

Soccer league in one season

Data (and model)
End-of-season points for the teams in the league. The model involves probabilities for wins, losses and draws in all matches, but we know little about these probabilities.
Null hypothesis
All teams have the same probability of winning each match.
Alternative hypothesis
All teams do not have the same probabilities of winning.
Test statistic
The standard deviation of final points. It will be low if the teams have the same abilities (null hypothesis) and higher otherwise (alternative hypothesis).
P-value
Probability that the standard deviation is as high as 16.7 (the actual data) if all teams are equally matched. None of the simulated leagues had standard deviations as high as 16.7, so the p-value is zero.
Interpreting the p-value
There is extremely strong evidence that the teams do not have the same probability of winning.

Proportion

Data (and model)
Number of successes in 100 values. Our model assumes that P(success) is the same for all values and that they were independent.
Null hypothesis
The probability of success is 0.80.
Alternative hypothesis
The probability of success is less than 0.80.
Test statistic
The number of successes in the 100 values. It will be near 80 if the underlying probability of success is 0.80 (null hypothesis) and lower than 80 if it is less (alternative hypothesis).
P-value
Probability of 72/100 or fewer successes (the actual data) if the underlying population proportion is 0.80. The p-value was 0.05.
Interpreting the p-value
Since 72 or fewer successes would be unlikely if the population proportion was 0.80, we have moderately strong evidence that it is lower than 0.80.

Process mean

Data (and model)
Sample of 10 values from a process. Our model is that they were sampled from a normally distribution with σ = 10 and unknown µ.
Null hypothesis
The distribution of values has mean µ = 520.
Alternative hypothesis
The alternative hypothesis is that µ ≠ 520 gm.
Test statistic
The mean of our sample of 10 values. It will be close to 520 if the process is working correctly (the null hypothesis) and farther from this if the process mean has drifted from 520 (alternative hypothesis).
P-value
Probability of a sample mean as far from 520 as the value in our actual data (529) if the underlying population mean is 520. Since none of the simulated samples had means as far from 520, the p-value is 0.0.
Interpreting the p-value
Since a mean as far from 520 as our actual mean (529) is very unlikely, there is strong evidence that the mean weight is no longer 520 gm.

Comparison of groups

Data (and model)
Samples of values from two groups. No assumptions are made about the shape of the underlying distributions (model) but the data are assumed to be random samples from them.
Null hypothesis
The population distributions are the same in both groups.
Alternative hypothesis
The groups have different distributions.
Test statistic
The difference between the means of the two groups. It should be near zero if the two populations are the same (null hypothesis) and only different from zero if the groups are different (alternative hypothesis).
P-value
Probability that the difference in means is further from zero than 0.902 (the actual data) if both samples come from the same population. Since this never happened in our randomised samples, the p-value is zero.
Interpreting the p-value
We conclude that it is almost certain that the null hypothesis does not hold —values are higher in Group A than in Group B.

Correlation coefficient

Data (and model)
Pairs of values from two variables (Y and X). No assumptions are made about a model underlying the data.
Null hypothesis
The correlation coefficient between Y and X in the population is zero.
Alternative hypothesis
The correlation is non-zero.
Test statistic
The sample correlation coefficient between Y and X. It will be close to zero if the variables are uncorrelated in the underlying population (null hypothesis) and further from zero otherwise (alternative hypothesis).
P-value
Probability that the correlation coefficient is further from zero than 0.537 (the actual data) if the variables are uncorrelated in the underlying population. Only 3.5% of the simulated samples had relationships as strong, so this is the p-value.
Interpreting the p-value
A correlation coefficient as far from zero would be unlikely if the null hypothesis was true, so there is moderately strong evidence that the variables are correlated.

10.2   Tests about proportions

10.2.1   Inference about parameters

Inference and random samples

In the rest of this chapter, we assume that the observed data are a random sample from some population. Inference asks questions about characteristics of the underlying population distribution — unknown population parameters.

The null and alternative hypotheses specify values for the unknown parameters.

Categorical populations

For categorical populations, the unknowns are the population probabilities for the different categories. We focus on one category that is of particular interest ('success').

The null and alternative hypotheses are specified in terms of the probability of success, π.

10.2.2   P-value for testing proportion

Test statistic

A test about a probability, π, could be based on the corresponding sample proportion, p, but it is more convenient to use the number of successes, x, rather than p since we know its distribution,

X  ~  binomial (n , π)

P-value

The p-value is the probability of getting such an 'extreme' set of data if the null hypothesis is true. Since we know the binomial distribution of X when the null hypothesis holds,

The p-value is a sum of binomial probabilities

Note that the p-value can be obtained exactly without need for simulations or randomisation.

Example

In 100 values from a categorical population, 72 successes were observed. Is this consistent with probability π = 0.80 of success, or is the probability of success lower?

H0:   π = 0.80

HA:   π < 0.80

If the null hypothesis is true, we know the distribution of the number of successes, X,

The p-value for the test is the probability of observing 72 or fewer successes, assuming that the null hypothesis holds. Since this is 0.0342, we conclude that there would be little chance of seeing as low a number of successes if π = 0.80, so

There is moderately strong evidence that π < 0.80

10.2.3   Another example

Rat recognition of symbols

In an experiment, a rat was trained to 'recognise' three symbols by placing food in one of three boxes (marked with a circle, cross and square) and shown a card with the correct symbol, this being repeated 100 times.

After this training period, 90 cards were presented to the rat and the box with the same symbol was picked 36 times.

H0:   π = 1/3       (guessing)

HA:   π > 1/3       (learned)

If the rat is guessing, the number of correct choices will be binomial with π = 1/3 and n = 90.

The p-value for the test is the probability of getting as 'extreme' a count as was observed if the rat guesses, 0.1103. Since this is high, we conclude that there is no evidence of learning from the data.

Interpretation of p-values

A p-value only tells you whether the data are consistent with the null hypothesis or are inconsistent with it. From a very small p-value, we can conclude that the null hypothesis is probably wrong. However a high p-value does not mean that the null hypothesis is correct, only that the observed data are consistent with it. In the rat training example, we could never be sure that π was not very very slightly different from 1/3.

A hypothesis test should never conclude that the null hypothesis is correct.

For the telepathy example, the correct interpretation of p-values would be...

p-value Conclusion
p >  0.1 No evidence against π = 1/3.
0.05 < p < 0.1 Only slight evidence against π = 1/3.
0.01 < p < 0.05    Moderately strong evidence against π = 1/3.
p < 0.01 Strong evidence against π = 1/3.

10.2.4   One- and two-tailed tests

P-value for a one-tailed test

Consider a test of the hypotheses

H0 :   π  =  π0
HA :   π  <  π0

where π0 is a constant of interest. The following diagram shows how the p-value is found:

Since the probabilities in one tail of the distribution are added, this is called a one-tailed test.

P-value for a two-tailed test

If the alternative hypothesis allows either high or low values of x, the test is called a two-tailed test,

H0 :   π  =  π0
HA :   π  ≠  π0

The p-value is then double the smaller tail probability since values of x in both tails of the binomial distribution would provide evidence for HA.

Example

In a population of people, a proportion 0.574 have blood group O. In a sub-group of this population, a sample of 54 individuals were tested and 26 of these had blood group O. Is there any evidence that they differ from the main population?

This question can be expressed with the hypotheses

H0 :   π  =  0.574
HA :   π  ≠  0.574

If the sub-group had the same proportion with blood group O as the main population, the number out of 54 with this blood group would have the binomial distribution below.

There is a probability 0.1085 of getting 26 or fewer, but large sample numbers with blood group O would also throw doubt on the null hypothesis, so the p-value is double the low-tail probability, 0.2169.

From the large p-value, we conclude that there is no evidence of a difference between the sub-group and the main population.

10.2.5   Normal approximation

Normal approximation

When the sample size, n, is large, a normal distribution may be used as an approximation to the binomial.

Approximate p-value

We again test the hypotheses

H0 :   π  =  π0
HA :   π  <  π0

If n is large, the approximate normal distribution for x can be used to obtain the p-value for the test.

Adverse reactions to drug

A pharmaceutical company claims that only 1% of a drug's users experience adverse reactions. An agency monitors 2500 patients taking the drug and observes adverse reactions in 37 cases. Is the occurrence of adverse reactions more common than claimed by the company?

H0 :   π  =  0.01
HA :   π  >  0.01

where π = P(adverse effect).

with a standard normal distribution. Since x is discrete,

P(X ≥ 37)   =  P(X ≤ 36.5)   =   P(X ≤ 36.9)   =   ...

To find this tail probability, any value of x between 36 and 37 might have been used when evaluating the z-score. The p-value can be more accurately estimate by using 36.5. This is called a continuity correction. This more accurate p-value in the above example is 0.011 leading to a similar conclusion.

The continuity correction involves either adding or subtracting 0.5 from the observed count, x, before finding the z-score.

10.2.6   Statistical distance

Difference between parameter and estimate

If the value of a parameter specified by the null hypothesis (e.g. a population proportion, π0) is close to the corresponding sample statistic (e.g. the sample proportion, p) then there is no reason to doubt the null hypothesis. However if they are far apart, the data are not consistent with the null hypothesis and we should conclude that the alternative hypothesis holds.

A large distance between the estimate and hypothesized value gives evidence against the null hypothesis.

Statistical distance

To help assess this difference, we express it as a number of standard errors since we know from the 70-95-100 rule of thumb that that 2 (standard errors) is a large distance, 3 is a very large distance, and 1 is not much.

For a proportion, the number of standard errors is

In general, the statistical distance of an estimate to a hypothesised value of the underlying parameter is

Values more than 2, or less than -2, suggests that the hypothesized value is wrong. However if z is close to zero, p is reasonably close to π0 and we should not doubt the null hypothesis.

10.2.7   Tests based on statistical distance

Test statistic and p-value

The statistical distance of an estimate to a hypothesised value of the underlying parameter is

If the null hypothesis holds, z has approximately a standard normal distribution and it can be used as a test statistic for tests about the parameter. The p-value can be determined from the tail areas of this standard normal distribution.

For a two-tailed test, the p-value is the red tail area and can be looked up using either normal tables or in Excel.

Example

We again examine a data set in which a proportion 37/2500 = 0.0148 of 2,500 patients had adverse reactions to a drug. Do more than 1% of such patients have adverse reactions?

H0 :   π  =  0.01
HA :   π  >  0.01

The diagram below shows how the 'statistical distance' of the sample proportion from 0.01 is calculated.

The p-value for the test is the upper tail area of the standard normal distribution and is 0.0079 here, so we again conclude that there is strong evidence that more than 1% of patients have adverse reactions from the drug.

Using a 'statistical distance' to test a proportion gives a p-value that is identical to the p-value based on a normal approximation to the number of successes without a continuity correction. (The p-value is slightly different if a continuity correction is used.) However this approach will be used to test many different kinds of parameter in later sections.

(The procedure will be refined slightly when applied to situations where the standard error of the estimate must itself be estimated from the sample data. A t distribution will be used instead of a standard normal distribution.)

10.3   Tests about means

10.3.1   Introduction

Tests about numerical populations

The most important characteristic of a numerical population is usually its mean, µ. Hypothesis tests therefore usually question the value of this parameter.

Null and alternative hypotheses

Two-tailed tests about a population mean involve the hypotheses

H0 :   μ  =  μ0
HA :   μ  ≠  μ0

where µ0 is the constant that we think may be the true mean.

In a one-tailed test, the alternative hypothesis involves only high (or low) values of µ, such as

H0 :   μ  =  μ0
HA :   μ  >  μ0

10.3.2   Test for mean (known σ)

Hypotheses and p-value

We initially assume that the population standard deviation σ is a known value. The null hypothesis is usually

H0 :   µ  =  µ 0

The test is based on the sample mean, . This has a distribution that is approximately normal and has mean and standard deviation

 =  μ
 = 

Since the distribution of is fully known when H0 is true, a tail area of its distribution gives the p-value for the test. The tail of the distribution to use depends on the form of the alternative hypothesis.

One-tailed test (HA : µ  >  µ 0)
The p-value is the upper tail area (shown in green below).

For HA : µ  <  µ 0, the opposite tail of the distribution is used.
Two-tailed test
The p-value is the sum of the two tail areas below. It would be calculated as twice the smaller tail area.

10.3.3   P-value from statistical distance

Statistical distance and p-value

If σ is a known value, the calculation to find the p-value for testing the mean can be expressed in terms of the general formula for the statistical distance between a parameter and its estimate,

In the context of a test about means,

Since z has a standard normal(0, 1) distribution when the null hypothesis holds, it can be used as a test statistic and the p-value for the test can be determined from its tail areas.

For a two-tailed test, the p-value is the red tail area.

Example

The mean of a sample of n = 30 values is 16.8. Does the population have mean µ = 18.3 and standard deviation σ = 7.1, or is the mean now lower than 18.3?

H0 :   µ  =  18.3

HA :   µ  <  18.3

The p-value for the test is shown below:

The p-value can be evaluated using the statistical distance of 16.8 from 18.3 (a z statistic),

The p-value is reasonably large, meaning that a sample mean as low as 16.8 would not be unusual if µ = 18.3, so there is no evidence against µ = 18.3.

10.3.4   The t distribution

Test statistic if σ is unknown

In practical problems, the value of σ is rarely known so we cannot use

as a test statistic — it cannot be evaluated even when H0 is true. Instead, we must use a closely related type of 'statistical distance' between the sample mean and µ0,

where s is the sample standard deviation. This test statistic no longer has a normal distribution — it has greater spread due to the extra variability that results from estimating s, and has a standard distribution called a t distribution with (n - 1) degrees of freedom.

10.3.5   The t test for a mean

Finding a p-value from the t distribution

When testing the value of µ when σ is unknown, we use the test statistic

This has a t distribution (with n − 1 degrees of freedom) when H0 is true, so the p-value is found from a tail area of this distribution.

One-tailed test

H0 :   μ  =  μ0
HA :   μ  <  μ0

The steps for testing these hypotheses are shown in the diagram below.

Example

Consider a sample of n = 13 values with mean = 16.14 and standard deviation s = 2.15. A test for whether the population mean is more than 15.0 uses the hypotheses:

H0 :   µ  =  15

HA :   µ  >  15

Since the population standard deviation, σ, is unknown, the test must be based on a t statistic.

10.4   Decisions and significance

10.4.1   Hypothesis tests and decisions

Decisions from tests

Many hypothesis tests are followed by some action that depends on whether we conclude that H0 or HA is true. This decision depends on the data.

Decision    Action
accept H0    some action (often the status quo)   
reject H0    a different action (often a change to a process)   

There are two ways in which an error might be made — wrongly rejecting H0 when it is true (called a Type I error), and wrongly accepting H0 when it is false (called a Type II error).

Decision
  accept H0     reject H0  
True state of nature H0 is true    correct Type I error
HA (H0 is false)     Type II error correct

A good decision rule about whether to accept or reject H0 (and perform the corresponding action) should ideally have small probabilities for both kinds of error.

10.4.2   Decision rules

Using a sample mean to make decisions

We assume initially that a population is normally distributed with known standard deviation, σ, and that we want a test for the hypotheses:

H0 :   μ  =  μ0
HA :   μ  >  μ0

Large values of throw doubt on H0, so our decision should be of the form:

Data Decision
< k    accept H0
is k or higher    reject H0   

The probabilities of Type I and Type II errors are shown in the red cells of the table below:

Decision
  accept H0     reject H0  
Truth H0 is true     
HA (H0 is false)      

Example: Test for the hypotheses:

H0 :   μ = 10
HA :   μ > 10

If it is known that σ = 4, then the mean of a random sample of n = 16 values is approximately normal with mean µ and standard deviation 1. If the decision rule rejects H0 when the sample mean is less than k, the diagram below illustrates the probabilities of Type I and Type II errors.

Increasing k reduces P(Type I error) but increases P(Type II error). The choice of k for the decision rule is a trade-off between the acceptable sizes of the two types of error.

10.4.3   Significance level and p-values

Significance level

The decision rule affects the probabilities of Type I and Type II errors and there is always a trade-off between these two probabilities. Selecting a critical value to reduce one error probability will increase the other.

In practice, we usually concentrate on the probability of a Type I error. The decision rule is chosen to make the probability of a Type I error equal to a pre-chosen value, often 5% or 1%. This probability is called the significance level of the test. In many applications the significance level is set at 5%.

P-values and decisions

Since the distribution of a test statistic is always fully known when H0 is true, the critical value for any significance level can be found as a quantile of this distribution. For example,

The critical value for a test depends on the distribution of its test statistic (e.g. a normal, t or other distribution). However the decision rule can equivalently be based in a simple way on the p-value for the test. For example, for a test with significance level 5%, the decision rule can always be expressed as:

Decision
p-value > 0.05     accept H0
p-value < 0.05     reject H0

For a test with significance level 1%, the null hypothesis, H0, should be rejected if the p-value is less than 0.01.

If computer software provides the p-value for a hypothesis test, it is easy to translate it into a decision to accept (or reject) the null hypothesis at any significance level.

10.4.4   Sample size and power

Power of a test

For any decision rule, the significance level gives the probability of an error when H0 is true — the Type I error. It is common to describe what happens if HA is true with the probability of correctly picking HA. This is called the power of the test and is one minus the probability of a Type II error.

Decision
  accept H0     reject H0  
Truth H0 is true      Significance level =
P (Type I error)
HA (H0 is false)     P (Type II error) Power =
1 - P (Type II error)

When the alternative hypothesis includes a range of possible parameter values (e.g. µ ≠ 0), the power is not a single value but depends on the actual parameter value.

Increasing the power of a test

It is clearly desirable to use a test whose power is as close to 1.0 as possible. There are three different ways to increase the power.

Change the critical value
This cannot be done if the significance level is fixed — adjusting the critical value to increase the power also increases the probability of a Type I error.
Use a different decision rule
In this e-book, we only describe the most powerful type of decision rule to test any hypotheses, so you will not be able to increase the power by changing the decision rule.
Increase the sample size
By increasing the amount of data on which we base our decision about whether to accept or reject H0, the probabilities of making both types of error can be reduced.

When the significance level is fixed, increasing the sample size is usually the only way to improve the power.

10.5   Properties of p-values

10.5.1   Null and alternative hypotheses

Symmetric hypotheses

In some situations there is a kind of symmetry between two competing hypotheses. For example, if two candidates, A and B, stand in an election and π is the population proportion who will vote for A, we are interested in which candidate will win:

H1 :   π  >  0.5

H2 :   π  <  0.5

Null and alternative hypotheses

In statistical hypothesis testing, the two hypotheses are not treated symmetrically in this way. Instead, we ask whether the sample data are consistent with one particular hypothesis (the null hypothesis, denoted by H0). If the data are not consistent with H0, then we can conclude that the competing hypothesis (the alternative hypothesis, denoted by HA) must be true.

The two possibilities are:

We should never conclude that H0 is likely to be true.

Example

Consider a test for whether a population mean is zero:

H0 :   µ  =  0.0

HA :   µ  ≠  0.0

Based on a random sample, we might conclude:

10.5.2   Consistency with null hypothesis

Describing the credibility of the null hypothesis

A p-value is a numerical description of the strength of the evidence against H0 that is provided by the data .

A p-value is a numerical summary statistic that describes the strength of the evidence against H0

P-values are interpreted in the same way for all hypothesis tests.

10.5.3   Distribution of p-values

Interpretation of p-values

In all hypothesis tests,

Distribution of p-values

P-values are found from random samples so they have distributions. Regardless of the hypothesis test,

Simulation

Consider a test for whether a population mean is zero:

H0 :   µ  =  0.0

HA :   µ  ≠  0.0

The diagram below shows the p-values from a t-test for these hypotheses, based on several random samples from a normal distribution for which H0 is true. Note that the p-value is equally likely to be anywhere between 0 and 1.

The next diagram shows p-values calculated in the same way, but based on random samples from a normal distribution for which HA is true. Note that the p-value is more likely to be near zero.

Although it is possible to obtain a low p-value when H0 holds and a high p-value when HA holds, low p-values are more likely under HA than under H0.

10.5.4   Interpretation of a p-value

P-values and probability

When H0 holds,

On the other hand, when HA holds, p-values are more likely to be near zero and

Examples

p-value = 0.0023
From this, we know that there would be only 0.0023 probability of getting such a small p-value if H0 was true. This is unlikely, so there is strong evidence that H0 does not hold.
p-value = 0.4
There is probability 0.4 of seeing such a low p-value if H0 is true, so there is no evidence against H0.

Of course, we may be wrong. A p-value of 0.0023 could arise when either H0 or HA holds but it is more likely under HA. And a p-value of 0.4 could also arise when either hypothesis is true.

Interpretation of p-values for all tests

p-value Interpretation
over 0.1 no evidence that the null hypothesis does not hold
between 0.05 and 0.1 very weak evidence that the null hypothesis does not hold
between 0.01 and 0.05   moderately strong evidence that the null hypothesis does not hold
under 0.01 strong evidence that the null hypothesis does not hold

10.5.5   P-values for other tests

Applying the general properties of p-values to different tests

P-values for all hypothesis tests have the properties that were described earlier in this section. You should now be able to interpret any p-value if you know the null and alternative hypotheses that it tests. (A statistical computer program is generally used to perform hypothesis tests, so knowing the details of how the p-value is obtained is of little importance.)

Example

The following data have been collected. Are they sampled from a normally distributed population?

41.9
90.6
29.9
10.2
33.7
26.9
88.5
6.5
16.6
19.2
12.6
32.0
3.6
8.1
68.1
57.9
-3.0
42.2
14.5
25.7
28.1
78.4
126.2
42.0
66.6
20.6
54.6
31.7
2.3
45.5
55.5
37.2
51.6
97.1
80.3
41.1
7.3
31.0
30.2
1.7
27.0
38.0
144.9
27.8
121.9
26.0
-11.5
15.5
16.9
27.3
23.9
61.1
68.2
10.0
37.8
77.1
24.3
63.2
-0.6
1.0
12.1
134.5
53.8
60.4
9.0
-6.4
31.0
-2.8
114.6
19.8
11.5
39.6
59.0
20.7
37.3
23.1
32.7
13.0
70.6
87.3
-3.2
-20.8
119.1
-0.1
104.4
-4.6
72.5
7.7
31.4
36.9
47.2
74.7
29.1
70.5
77.7
81.0
191.8
1.6
-0.8
59.4
-2.2
-12.5
81.6
44.0
63.6
114.3
33.6
83.0
70.8
50.1
55.8
28.3
-7.9
51.3
37.7
48.3
88.9
59.4
126.9
35.0
51.0
91.1
-2.7
79.2
0.1
12.9
16.2
23.0
22.4
64.4
10.2
7.6
27.7
8.0
23.5
25.3
22.5
 
 
 

The diagram below shows a histogram of the data and the best-fitting normal distribution. Could the skewness in the data have occurred by chance from a normal population?

The Shapiro-Wilkes W test can be used to test whether data come from a normal distribution:

H0 :  population distribution is normal
HA :  population distribution is not normal

Computer software reports the p-value for this test as "under 0.01". We conclude that the probability of obtaining such a non-normal looking sample from a normal distribution is less than 0.01, so there is strong evidence that the data do not come from a normal population.