How much data do I need to collect?

On the previous page, we investigated how to determine the sample size needed to estimate a population mean to a specified accuracy. A similar calculation can be used to find the sample size required for estimating a probability.

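A 95% confidence interval for a probability π is of the form

$$p \;\pm\; 2\sqrt{\frac{p(1-p)}{n}}$$

where p is the sample proportion and n is the sample size.

If we want our estimate to be within k of π with probability 0.95, then we need n to be large enough that

$$2\sqrt{\frac{p(1-p)}{n}} \;\le\; k$$

To use this inequality, we need a guess at the value of p before the data are collected. The guess does not need to be particularly accurate.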

A small pilot survey is often conducted to obtain a preliminary estimate for the proportion.

If we can do no better, the 'worst-case' value, p = 0.5, can be used, but the resulting sample size may be larger than needed.
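The reason 0.5 is the worst case is that the required sample size grows with the product p(1 − p), which is largest at p = 0.5. The short Python sketch below illustrates this over a few candidate guesses; the function name `variance_term` and the list of guesses are illustrative choices, not part of the original page.

```python
def variance_term(p):
    """The factor p(1 - p) that drives the width of the interval."""
    return p * (1 - p)

# A few hypothetical guesses for the proportion.
guesses = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

for p in guesses:
    print(f"p = {p:.1f}  p(1 - p) = {variance_term(p):.2f}")

# The guess with the largest p(1 - p) needs the largest sample.
worst = max(guesses, key=variance_term)
print("worst-case guess:", worst)
```

Any guess further from 0.5 gives a smaller product and therefore a smaller required sample.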

The necessary sample size can be found by trial-and-error in the above inequality.
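As a rough sketch of that trial-and-error search, the Python code below increases n one step at a time until the '±' value, taken here to be 2√(p(1 − p)/n), is no larger than k. The multiplier 2 (rather than 1.96) and the function names are assumptions. The check is done in squared form with exact fractions so that boundary cases are not affected by floating-point rounding.

```python
from fractions import Fraction

def margin_ok(n, p, k):
    """True when 2*sqrt(p(1-p)/n) <= k, tested as 4p(1-p) <= n*k^2."""
    return 4 * p * (1 - p) <= n * k * k

def smallest_n_by_search(p_guess, k):
    """Try n = 1, 2, 3, ... until the margin condition holds."""
    p = Fraction(p_guess)   # pass decimals as strings, e.g. "0.5"
    k = Fraction(k)
    n = 1
    while not margin_ok(n, p, k):
        n += 1
    return n

print(smallest_n_by_search("0.5", "0.02"))
print(smallest_n_by_search("0.2", "0.02"))
```

Passing the guess and the target accuracy as strings lets `Fraction` represent values such as 0.02 exactly.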

Survival from lung cancer

Health researchers want to estimate the proportion of lung cancer patients that survive five years after diagnosis.

How many patient records must be examined to be at least 95% confident that the resulting estimate will be within 0.02 (2 percentage points) of the true population proportion?

The following diagram helps with the calculations.

We have not been given a guess at the value of π, so drag the slider to 0.5 — the worst-case scenario. (Use the arrow keys on your keyboard for fine adjustment of π.)

Drag the sample size slider until the '±' value is less than 0.02. Verify that the sample size should be about 2,500 or higher.


Recent reports about survival rates from lung cancer have claimed that the five-year survival rate is only between 8% and 14%, due to the late diagnosis of the disease in many patients. We may feel that it is safe to assume that the current survival rate is no higher than 20%. Use the slider to change the guess of π to 0.20 in the diagram above and verify that a sample size of about 1,600 would be enough to estimate π to within 0.02 with 95% confidence.
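Without the diagram, the same checks can be made numerically. The sketch below evaluates the '±' value, assumed here to be 2√(p(1 − p)/n), for the two sample sizes quoted in this example, plus one smaller sample for comparison. The function name `half_width` and the extra sample size of 1,000 are illustrative.

```python
import math

def half_width(p, n):
    """Approximate '±' value of a 95% interval: 2 * sqrt(p(1-p)/n)."""
    return 2 * math.sqrt(p * (1 - p) / n)

# (guess for the proportion, sample size)
cases = [(0.5, 2500), (0.2, 1600), (0.2, 1000)]

for p, n in cases:
    print(f"p = {p}, n = {n}: ± {half_width(p, n):.4f}")
```

The first two cases give a '±' value of about 0.02, while the smaller sample of 1,000 gives a wider interval.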

Obtaining the sample size by solving an equation

Trial-and-error can be avoided with a little algebra. The equation

$$2\sqrt{\frac{p(1-p)}{n}} = k$$

can be re-written in the form

$$n = \frac{4\,p(1-p)}{k^2}$$
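Setting 2√(p(1 − p)/n) equal to k and solving for n gives n = 4p(1 − p)/k², so the required sample size is this value rounded up to a whole number. The Python sketch below applies that rearrangement. The function name `required_n` is hypothetical, and the multiplier 2 is assumed from the sample sizes quoted in the lung cancer example.

```python
import math
from fractions import Fraction

def required_n(p_guess, k):
    """Smallest whole n with 2*sqrt(p(1-p)/n) <= k, i.e. n >= 4p(1-p)/k^2."""
    p = Fraction(p_guess)   # pass decimals as strings for exact arithmetic
    k = Fraction(k)
    return math.ceil(4 * p * (1 - p) / (k * k))

print(required_n("0.5", "0.02"))   # worst-case guess
print(required_n("0.2", "0.02"))   # guess of 0.20 from the example
```

For the lung cancer example, this reproduces the sample sizes of 2,500 and 1,600 found with the sliders.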