Problems with using a binomial distribution when n is large
Although the number of 'successes' in a random sample always has a binomial distribution, it is computationally difficult to obtain probabilities from a binomial distribution when n is large. In a large random sample of, say, n = 10,000 categorical values, the probabilities of interest usually require summing the binomial probabilities for a large number of individual counts of successes. For example:
P(X < 5,600) = P(X = 0) + P(X = 1) + ... + P(X = 5,599)
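To give a feel for the amount of work involved, the sum above can be evaluated directly. The following is only a sketch: the text does not give a value for the probability of success, so π = 0.55 is an assumption made purely for illustration, and the helper names `binomial_pmf` and `prob_less_than` are hypothetical.

```python
import math

def binomial_pmf(k, n, pi):
    """P(X = k) for X ~ Binomial(n, pi), computed on the log scale
    so that the huge binomial coefficients do not overflow."""
    log_coeff = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return math.exp(log_coeff + k * math.log(pi) + (n - k) * math.log(1 - pi))

def prob_less_than(x, n, pi):
    """P(X < x), obtained by summing P(X = 0) + P(X = 1) + ... + P(X = x - 1)."""
    return math.fsum(binomial_pmf(k, n, pi) for k in range(x))

n = 10_000
pi = 0.55  # assumed probability of success (not given in the text)

p = prob_less_than(5_600, n, pi)
print(f"P(X < 5,600) = {p:.4f}, from a sum of 5,600 separate terms")
```

A computer handles this sum easily, but every one of the 5,600 terms has to be evaluated, which is impractical by hand or from tables.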
We next describe a way to approximate such probabilities without summing so many values.
Proportions and means
If we assign a code of '1' to the successes and '0' to the failures in the random sample, then the resulting values are called an indicator variable.
Individual | Categorical variable | Indicator variable |
---|---|---|
1 | success | 1 |
2 | success | 1 |
3 | failure | 0 |
4 | success | 1 |
5 | failure | 0 |
6 | failure | 0 |
7 | success | 1 |
... | ... | ... |
The mean of the indicator variable is identical to the proportion of successes.
A sample proportion is really a kind of mean.
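This equivalence is easy to check numerically. The short sketch below uses only the first seven values shown in the table; the variable names are chosen here for illustration.

```python
# The first seven categorical values from the table
categories = ["success", "success", "failure", "success",
              "failure", "failure", "success"]

# Code each success as 1 and each failure as 0
indicator = [1 if value == "success" else 0 for value in categories]

# Proportion of successes, counted directly from the categories
proportion = categories.count("success") / len(categories)

# Mean of the indicator variable
mean = sum(indicator) / len(indicator)

print(indicator)
print(proportion, mean)  # the two values are identical
```

Both calculations divide the number of successes (4) by the sample size (7), which is why they must agree.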
Therefore the results that we met earlier about the distribution of sample means can also be applied to sample proportions. In particular, when the sample size is large, the distribution of a sample proportion becomes close to a normal distribution.
In the diagram below, use the sliders to observe that for any fixed π, the shape of the binomial distribution becomes closer to normal as n increases.
Formulae for the binomial mean and standard deviation
Not only does the proportion of successes, p, have a distribution that is close to normal when n is large, but it is also possible to obtain formulae for the mean and standard deviation of this approximating distribution. Since the number of successes, x = np, is a constant times p, there are similar formulae for the mean and standard deviation of x.
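The formulae themselves do not appear in this text. The standard results are written out below, with π denoting the probability of success (the symbol used for it elsewhere on this page); the subscripted names μ and σ are notation assumed here.

```latex
\mu_p = \pi, \qquad \sigma_p = \sqrt{\frac{\pi(1-\pi)}{n}}

\mu_x = n\pi, \qquad \sigma_x = \sqrt{n\pi(1-\pi)}
```

The second pair follows from the first because x = np: multiplying a variable by the constant n multiplies both its mean and its standard deviation by n.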
The diagram below shows a binomial distribution and its normal approximation.
Use the sliders to verify that the binomial distribution has a very similar shape to its normal approximation when n is large.
With n fairly large and π moderate, drag over the bars of the binomial barchart. The binomial probability of getting a count less than x is shown beneath the barchart. The corresponding probability from the normal approximation is shown on the right.
Observe that probabilities obtained from the normal approximation are close to the true binomial probabilities when n is fairly large.
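Finally, note that the normal approximation is evaluated at values of x that end in '.5'. Because a count can only take whole-number values, the probability of a count less than 45, say, is approximated by the normal probability below 44.5. This is sometimes called a continuity correction.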
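The effect of the continuity correction can be checked with a small calculation. This is a sketch under assumed values (n = 100, π = 0.4 and a cut-off of 45 are illustrative choices, not taken from the diagram), using the normal distribution with mean nπ and standard deviation √(nπ(1 − π)).

```python
import math
from statistics import NormalDist

def binomial_prob_below(x, n, pi):
    """Exact P(X < x) for X ~ Binomial(n, pi), summing P(X = 0) ... P(X = x - 1)."""
    return sum(math.comb(n, k) * pi**k * (1 - pi)**(n - k) for k in range(x))

n, pi = 100, 0.4   # assumed sample size and probability of success
x = 45             # assumed cut-off: we want P(X < 45)

# Normal approximation with the binomial mean and standard deviation
approx = NormalDist(mu=n * pi, sigma=math.sqrt(n * pi * (1 - pi)))

exact = binomial_prob_below(x, n, pi)
with_correction = approx.cdf(x - 0.5)   # evaluated at 44.5
without_correction = approx.cdf(x)      # evaluated at 45

print(f"exact binomial:          {exact:.4f}")
print(f"normal, at x - 0.5:      {with_correction:.4f}")
print(f"normal, at x (no corr.): {without_correction:.4f}")
```

For these values, evaluating the normal distribution at 44.5 gives a probability closer to the exact binomial value than evaluating it at 45.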