Computational problem
To find the p-value for a hypothesis test about a proportion, tail probabilities for a binomial distribution must be summed.
If the sample size n is large, there may be a huge number of probabilities to add together, which is tedious and can introduce numerical errors.
Home-based businesses owned by women
A recent study that was reported in the Wall Street Journal sampled 899 home-based businesses and found that 369 were owned by women.
Are home-based businesses less likely to be owned by females than by males? This question can be expressed as a hypothesis test. If the population proportion of home-based businesses owned by females is denoted by π, the hypotheses can be written as
H0 : π = 0.5
HA : π < 0.5
If the null hypothesis is true, the sample number owned by females will have a binomial distribution with parameters n = 899 and π = 0.5. The p-value for the test is therefore the sum of binomial probabilities,
p-value = P(X ≤ 369) = P(0) + P(1) + ... + P(368) + P(369)
In all, 370 probabilities must be evaluated and summed, and each of them is close to zero.
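The sum above can be evaluated exactly with integer arithmetic, which avoids the rounding problems of adding many tiny floating-point terms. The following is a minimal Python sketch, not part of the original study; the function name `binomial_lower_tail_half` is hypothetical, and the shortcut relies on π = 0.5, so that every binomial probability is a binomial coefficient divided by 2^n.

```python
from math import comb


def binomial_lower_tail_half(n: int, x: int) -> float:
    """Return P(X <= x) for X ~ Binomial(n, 0.5) using exact integer sums."""
    # With pi = 0.5, P(X = k) = C(n, k) / 2**n, so add up the counts first.
    favourable = sum(comb(n, k) for k in range(x + 1))
    # Dividing two Python integers rounds only once, at the very end.
    return favourable / 2**n


exact_p_value = binomial_lower_tail_half(899, 369)
print(f"exact p-value = {exact_p_value:.3g}")
```

For a null value other than 0.5 this shortcut does not apply, and the individual probabilities would have to be computed some other way (for example with exact fractions).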
Normal approximation
We saw earlier that the normal distribution may be used as an approximation to the binomial when n is large. Both the sample proportion of successes, p, and the number of successes, x = np, are approximately normal when n is large.
The best-fitting normal distribution can be used to obtain an approximation to any binomial tail probability. In particular, it can be used to find an approximate p-value for a hypothesis test.
Approximate p-value
A large random sample of size n is selected from a population with probability π of success and x successes are observed. We will again test the hypotheses
H0 : π = π0
HA : π < π0
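The normal approximation to the distribution of x can be used to find the tail probability,
p-value = P(X ≤ x)
where, if the null hypothesis is true, X is approximately normal with mean nπ0 and standard deviation √(nπ0(1 − π0)).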
Home-based businesses owned by women
In this example, the sample size, n = 899, is large, so we can use a normal approximation to obtain the probability of 369 or fewer businesses being owned by females if the underlying population probability were 0.5 (the null hypothesis).
Click Accumulate then simulate sampling of 899 businesses about 300 times. (Hold down the button Simulate.) From the simulation, it is clear that the probability of obtaining 369 or fewer businesses owned by females is extremely small — there is strong evidence against the null hypothesis.
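A comparable simulation can be sketched in Python for readers without access to the interactive diagram. This is only an illustration under stated assumptions: the function name `simulate_counts` and the seed are hypothetical choices, and each business is modelled as an independent draw with probability 0.5 of being owned by a female.

```python
import random


def simulate_counts(n: int, pi: float, repetitions: int, seed: int = 1) -> list[int]:
    """Simulate the number of successes in n trials, repeated several times."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    return [
        sum(rng.random() < pi for _ in range(n))  # successes in one sample
        for _ in range(repetitions)
    ]


counts = simulate_counts(n=899, pi=0.5, repetitions=300)
as_extreme = sum(c <= 369 for c in counts)
print(f"smallest simulated count: {min(counts)}")
print(f"samples with 369 or fewer: {as_extreme} of {len(counts)}")
```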
The same conclusion can be reached without a simulation.
Select Bar chart from the pop-up menu, then select Normal approximation. From the normal approximation, we can determine that the p-value for the test (the tail area below 369) is extremely close to zero.
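The tail area can also be computed directly. The sketch below assumes the usual normal approximation to the binomial, with mean nπ0 and standard deviation √(nπ0(1 − π0)); the function name `normal_lower_tail` is hypothetical, and no continuity correction is applied.

```python
from math import erfc, sqrt


def normal_lower_tail(x: float, n: int, pi0: float) -> float:
    """Approximate P(X <= x) for X ~ Binomial(n, pi0) by a normal tail area."""
    mean = n * pi0
    sd = sqrt(n * pi0 * (1 - pi0))
    z = (x - mean) / sd
    # Standard normal CDF written with erfc, which stays accurate far in the tail.
    return 0.5 * erfc(-z / sqrt(2))


approx_p_value = normal_lower_tail(369, n=899, pi0=0.5)
print(f"approximate p-value = {approx_p_value:.3g}")
```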
Continuity correction (advanced)
The approximate p-value could be found by comparing the z-score for x,
z = (x − nπ0) / √(nπ0(1 − π0))
with a standard normal distribution. Since x is discrete,
P(X ≤ 369) = P(X ≤ 369.5) = P(X ≤ 369.9) = ...
To find this tail probability, any value between 369 and 370 could have been used for x when evaluating the z-score. The p-value can be estimated more accurately by using 369.5. This is called a continuity correction.
The continuity correction involves either adding 0.5 to or subtracting 0.5 from the observed count, x, before finding the z-score.
Be careful about whether to add or subtract — the probability statement should be unchanged. For example, P(X ≥ 410) = P(X ≥ 409.5), so 0.5 should be subtracted from x = 410 as a continuity correction in order to find this probability using a normal approximation and z-score.
The continuity correction is most important when the observed count is near either 0 or n.
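The effect of the correction can be checked numerically. The following Python sketch uses hypothetical helper names (`z_score`, `phi`) and reuses the home-based business figures, n = 899 and π0 = 0.5; it compares the lower-tail approximation with and without the correction and shows the upper-tail case P(X ≥ 410), where 0.5 is subtracted.

```python
from math import erfc, sqrt


def z_score(x: float, n: int, pi0: float) -> float:
    """Standardise a count using the binomial mean and standard deviation."""
    return (x - n * pi0) / sqrt(n * pi0 * (1 - pi0))


def phi(z: float) -> float:
    """Standard normal cumulative probability P(Z <= z)."""
    return 0.5 * erfc(-z / sqrt(2))


n, pi0 = 899, 0.5

# Lower tail: P(X <= 369) = P(X <= 369.5), so add 0.5 to the count.
uncorrected = phi(z_score(369, n, pi0))
corrected = phi(z_score(369.5, n, pi0))

# Upper tail: P(X >= 410) = P(X >= 409.5), so subtract 0.5 from the count.
# By symmetry, P(Z >= z) = phi(-z).
upper_corrected = phi(-z_score(409.5, n, pi0))

print(f"lower tail, no correction:   {uncorrected:.3g}")
print(f"lower tail, with correction: {corrected:.3g}")
print(f"P(X >= 410), with correction: {upper_corrected:.4f}")
```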