Grouping of individuals

A simple random sample of individuals from some population is conceptually the easiest sampling scheme. However, more accurate estimates of population characteristics can often be obtained with other sampling schemes.

If the individuals in the population can be split into different groups (called strata in sampling terminology), it is often better to take a simple random sample within each separate group than to sample randomly from the whole population. This is called a stratified random sample.

For example, a simple random sample of 40 students from a class of 200 males and 200 females might (by chance) include 25 males and 15 females. A stratified random sample would randomly select 20 males and 20 females, ensuring that the sex-ratio in the sample matched that in the population.
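The two schemes in this example can be sketched in Python. The roster below is hypothetical (students are just labelled by sex and an ID number), and the fixed seed is only there to make the sketch repeatable.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical class roster: 200 males and 200 females.
males = [("M", i) for i in range(200)]
females = [("F", i) for i in range(200)]
students = males + females

# Simple random sample: 40 students drawn from the whole class.
srs = rng.sample(students, 40)

# Stratified random sample: 20 students drawn separately from each sex.
stratified = rng.sample(males, 20) + rng.sample(females, 20)


def count_males(sample):
    """Number of male students in a sample of (sex, id) pairs."""
    return sum(1 for sex, _ in sample if sex == "M")


print("Males in simple random sample:", count_males(srs))
print("Males in stratified sample:   ", count_males(stratified))
```

The count for the simple random sample changes with the seed, whereas the stratified sample always contains exactly 20 males.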

The benefits from stratified random sampling are greatest if the measurement being sampled is different in the different strata. For example, we might want to estimate the mean summer income of the students. If male students tend to have higher incomes than female students, a stratified random sample based on gender will be more accurate than a simple random sample.

Weekly turnover by grocery stores

The diagram below shows the weekly turnover of 100 grocery stores in a city. Of these stores, 50 belong to large grocery chains and the other 50 are smaller independent stores. The 50 stores belonging to chains tend to have higher turnovers. (This is not real data — the difference between the two types of store is more extreme than would usually be observed — but does illustrate the potential gains from stratified sampling.)

The left half of the diagram illustrates simple random sampling of 10 from the 100 stores, whereas stratified random sampling of 5 stores from each group is illustrated on the right.

Click Take sample a few times to observe the variability of the mean turnover for the two sampling schemes. (A jittered dot plot of the means is shown to the right of each sample. A normal curve shows the distribution of the sample means.)

Observe that stratified random sampling gives sample means with less variability. The mean from a stratified random sample is therefore a more accurate estimate of the population mean.

In practice, the aim is more likely to be estimation of the total grocery turnover in the city, but this is simply 100 times the mean turnover, so stratified sampling gives the same improvement over a simple random sample.
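A rough simulation can mimic the comparison of the two schemes without the diagram. The sketch below is not the data behind the diagram: the turnover values are made up (normal distributions with assumed means of 100 for chain stores and 40 for independent stores), and the function name `compare_schemes` is hypothetical.

```python
import random
import statistics


def compare_schemes(num_samples=2000, seed=1):
    """Return the standard deviations of the sample mean under simple
    random sampling and under stratified sampling (5 stores per group)."""
    rng = random.Random(seed)

    # Made-up population: 50 chain stores and 50 independent stores.
    chain = [rng.gauss(100, 10) for _ in range(50)]
    independent = [rng.gauss(40, 10) for _ in range(50)]
    population = chain + independent

    srs_means = []
    stratified_means = []
    for _ in range(num_samples):
        # Simple random sample of 10 from all 100 stores.
        srs_means.append(statistics.mean(rng.sample(population, 10)))
        # Stratified sample: 5 stores from each group.
        sample = rng.sample(chain, 5) + rng.sample(independent, 5)
        stratified_means.append(statistics.mean(sample))

    return statistics.stdev(srs_means), statistics.stdev(stratified_means)


srs_sd, stratified_sd = compare_schemes()
print("SD of mean, simple random sampling:", round(srs_sd, 2))
print("SD of mean, stratified sampling:   ", round(stratified_sd, 2))
```

Under these assumptions the stratified means should show a noticeably smaller standard deviation, because the large difference between the two groups no longer contributes to the sampling variability.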


Groups with different variability (advanced)

In stratified random samples, random samples are usually taken from the different strata in proportion to the number of population values in the strata. For example, if a population of 1,000 values is split into three strata of N1 = 500, N2 = 300 and N3 = 200 values and a sample of n = 50 is to be taken, then samples of n1 = 25, n2 = 15 and n3 = 10 would be taken from the three strata — i.e. 1/20 of the population within each stratum.
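The arithmetic of this proportional allocation can be sketched as a small Python function. The name `proportional_allocation` is hypothetical, and the simple rounding used here is an assumption: when the stratum sizes do not divide evenly, the rounded sizes may not add up to n exactly.

```python
def proportional_allocation(stratum_sizes, n):
    """Split a total sample size n across strata in proportion to their sizes.

    Each share is rounded to the nearest whole number, so the result is only
    guaranteed to sum to n when the proportions work out exactly.
    """
    total = sum(stratum_sizes)
    return [round(n * size / total) for size in stratum_sizes]


allocation = proportional_allocation([500, 300, 200], 50)
print(allocation)  # [25, 15, 10]
```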

This proportionality is not essential, however, and greater accuracy can be obtained by selecting larger samples from strata with greater variability. However, if sample sizes are not proportional to stratum sizes, the overall sample mean is no longer appropriate for estimating the overall population mean, since the strata that are sampled more heavily would be over-represented in it.

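If there are $k$ strata of size $N_1, N_2, \dots, N_k$, and samples of size $n_1, n_2, \dots, n_k$ are taken from the strata, giving sample means $\bar{x}_1, \bar{x}_2, \dots, \bar{x}_k$, then the population mean should be estimated by

$$\hat{\mu} = \frac{N_1 \bar{x}_1 + N_2 \bar{x}_2 + \cdots + N_k \bar{x}_k}{N_1 + N_2 + \cdots + N_k} = \frac{\sum_{i=1}^{k} N_i \bar{x}_i}{\sum_{i=1}^{k} N_i}$$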
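The weighted estimate, in which each stratum mean is weighted by the size of its stratum in the population, can be sketched in Python as follows. The function name `stratified_mean` and the numbers (80 small stores and 20 chain stores, with sample means of 40 and 100 from samples of 3 and 7) are illustrative assumptions, not values from the diagram.

```python
def stratified_mean(stratum_sizes, stratum_means):
    """Estimate the population mean by weighting each stratum's sample mean
    by the number of population values in that stratum."""
    total = sum(stratum_sizes)
    weighted_sum = sum(N * m for N, m in zip(stratum_sizes, stratum_means))
    return weighted_sum / total


# Hypothetical strata: 80 small stores and 20 chain stores.
sizes = [80, 20]
means = [40.0, 100.0]   # assumed sample means within each stratum
sample_sizes = [3, 7]   # disproportionate sample sizes

estimate = stratified_mean(sizes, means)
print(estimate)  # 52.0

# For comparison, the plain mean of all 10 sampled values weights the strata
# by sample size instead, so the over-sampled chain stores dominate it.
pooled = sum(n * m for n, m in zip(sample_sizes, means)) / sum(sample_sizes)
print(pooled)  # 82.0
```

With proportional sample sizes the two calculations agree, which is why the ordinary sample mean is adequate in that case.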

Weekly turnover by grocery stores

The following diagram is similar to the one above, but in this example, there are 80 small stores (with relatively low turnover) and 20 chain stores with higher turnover and also a higher spread in their distribution.

The left half of the diagram uses stratified random sampling with sample sizes proportional to the stratum sizes (8 small stores and 2 chain stores). On the right, a disproportionately large sample is taken from the chain stores because of their higher variability: 3 small stores and 7 chain stores.

Click Take sample a few times to verify that the estimated mean weekly turnover is more accurate when a larger sample is taken from the chain stores — the variability in the estimate is lower.
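The same comparison can be approximated by simulation. This is only a sketch under stated assumptions: the turnover values are made up (small stores drawn from a normal distribution with mean 40 and standard deviation 5, chain stores with mean 150 and standard deviation 40), the names `stratified_estimate` and `compare_allocations` are hypothetical, and each estimate weights the stratum means by stratum size.

```python
import random
import statistics


def stratified_estimate(strata, sample_sizes, rng):
    """Draw a random sample from each stratum and return the estimate of the
    population mean, weighting each sample mean by its stratum size."""
    total = sum(len(stratum) for stratum in strata)
    weighted_sum = sum(
        len(stratum) * statistics.mean(rng.sample(stratum, n))
        for stratum, n in zip(strata, sample_sizes)
    )
    return weighted_sum / total


def compare_allocations(num_samples=3000, seed=2):
    """Return the standard deviations of the estimate under proportional
    (8, 2) and disproportionate (3, 7) sample sizes."""
    rng = random.Random(seed)

    # Made-up population: 80 small stores, 20 more variable chain stores.
    small = [rng.gauss(40, 5) for _ in range(80)]
    chain = [rng.gauss(150, 40) for _ in range(20)]
    strata = [small, chain]

    proportional = [
        stratified_estimate(strata, (8, 2), rng) for _ in range(num_samples)
    ]
    disproportionate = [
        stratified_estimate(strata, (3, 7), rng) for _ in range(num_samples)
    ]
    return statistics.stdev(proportional), statistics.stdev(disproportionate)


proportional_sd, disproportionate_sd = compare_allocations()
print("SD of estimate, 8 small + 2 chain:", round(proportional_sd, 2))
print("SD of estimate, 3 small + 7 chain:", round(disproportionate_sd, 2))
```

Under these assumptions the estimate based on 3 small stores and 7 chain stores should vary less, since most of the uncertainty comes from the highly variable chain stores.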

An extreme example of disproportionate sample sizes occurs when using sampling to estimate the mean profits of companies. If a list of 'large' companies is available, it is often best to record information from all of the large companies but only sample a small fraction of the smaller companies.