Long page descriptions

Chapter 7   Sampling and Variability

7.1   Finite populations

7.1.1   Census or sample?

A sample provides information about a population when it is too difficult or expensive to make measurements from the whole population.

7.1.2   Variability in a sample

A sample is usually collected to provide information about an underlying population. However, sample-to-sample variability must be taken into account when generalising from the sample to the population.

7.1.3   Sampling error

When a sample is used to estimate a population characteristic, an error is usually involved. Sampling error is caused by random selection of the sample from the population.

7.1.4   Sampling error and sample size

As the sample size is increased, the sampling error tends to become smaller.

7.1.5   Sampling from finite populations (optional)

Sampling with replacement from a finite population permits the same value to be selected two or more times. Sampling without replacement ensures that there are no such duplicates.
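
The two schemes can be compared with a minimal sketch using Python's standard `random` module. The population of the values 1 to 10, the sample size of 5 and the seed are illustrative assumptions, not taken from the text.

```python
import random

rng = random.Random(1)            # fixed seed (an assumption) so the run is repeatable
population = list(range(1, 11))   # hypothetical finite population: the values 1..10

# With replacement: each draw is made from the full population,
# so the same value may be selected more than once.
with_replacement = rng.choices(population, k=5)

# Without replacement: a selected value cannot be selected again,
# so the sample never contains duplicates.
without_replacement = rng.sample(population, k=5)

print("with replacement:   ", with_replacement)
print("without replacement:", without_replacement)
```

Re-running with different seeds shows that duplicates appear only in the first sample.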

7.1.6   Selecting a random sample

Random digits can be selected by rolling a 10-sided die, looking up a table of random digits or using a computer. These random digits can be combined to select a random member of a population. Repeating the process gives a random sample.
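
As a rough illustration, the sketch below lets a computer play the role of the 10-sided die and combines pairs of digits into a label. The population of 100 members labelled 00 to 99 and the function names are hypothetical.

```python
import random

rng = random.Random(7)  # fixed seed (an assumption) for a repeatable run

def random_digit():
    """One random digit, 0..9, each equally likely (like a 10-sided die)."""
    return rng.randrange(10)

def random_member(n_digits=2):
    """Combine digits into a label, e.g. digits 4 then 7 give label 47."""
    label = 0
    for _ in range(n_digits):
        label = 10 * label + random_digit()
    return label

# Repeating the selection gives a random sample of labels from 00..99.
sample = [random_member() for _ in range(5)]
print(sample)
```

Because a label can come up twice, this sketch samples with replacement; discarding repeated labels would give a sample without replacement.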

7.2   Samples from distributions

7.2.1   Data as representatives

We often have little interest in the specific individuals from whom data are collected. The data are representative of some wider situation and we want to generalise from the data to describe features of this more general situation.

7.2.2   Randomness of data

If data collection was repeated, perhaps from different individuals, the values would be different. All such data sets should give similar but not identical information. Interpretation of a data set should take into account this randomness.

7.2.3   Model to explain randomness

Many data sets are not generated as random samples from real finite populations. However, it is often useful to treat the data as a random sample from some abstract population containing the measurements that might have been recorded.

7.2.4   Infinite populations (distributions)

The populations that are imagined to underlie data often contain an infinite number of values and are called distributions.

7.2.5   Information from a sample

The mechanism of sampling from a population explains randomness. Although the population is unknown and we only have a single sample, the sample provides information about the population.

7.3   Probability & probability density

7.3.1   Finite populations

When randomly selecting one value from a population of N different values, the probability of getting any individual value is 1/N. More generally, the probability of sampling a value in any range is the proportion of population values in the range.
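
The rule can be checked on a small example. The eight population values and the helper name `prob_in_range` below are made up for illustration.

```python
# Hypothetical finite population of N = 8 different values.
population = [12, 15, 18, 21, 24, 27, 30, 33]

def prob_in_range(pop, low, high):
    """Probability that one randomly selected value lies in [low, high]:
    the proportion of population values in that range."""
    return sum(low <= v <= high for v in pop) / len(pop)

# Four of the eight values (15, 18, 21, 24) lie between 15 and 24.
p = prob_in_range(population, 15, 24)
print(p)  # 0.5

# A single value has probability 1/N.
print(prob_in_range(population, 12, 12))  # 0.125
```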

7.3.2   Probabilities with infinite populations

The probability of any type of value is again the proportion of such values in the population. It can also be interpreted as the limiting proportion from a sample of values if the sample size is increased indefinitely.

7.3.3   Bar charts of discrete probabilities

Infinite categorical or discrete populations can be described by bar charts of the probabilities.

7.3.4   Probability density functions

Infinite continuous numerical populations are described with a type of histogram called a probability density function.

7.3.5   Normal distributions

Normal distributions are infinite continuous populations. A normal distribution is symmetric and its two parameters, µ and σ, can be adjusted to alter the distribution's location and spread.

7.3.6   Probability and area under the pdf

When a value is sampled from an infinite continuous population, the probability that it is between a and b equals the area under the probability density function (pdf) between these two values.
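
One way to see this numerically is to add up thin rectangles under a density curve and compare the total with the probability from the cumulative distribution. The sketch assumes a standard normal population and the limits a = -1 and b = 2, which are illustrative choices only.

```python
from statistics import NormalDist

dist = NormalDist(mu=0, sigma=1)   # assumed population: standard normal
a, b = -1.0, 2.0                   # illustrative limits

# Probability from the cumulative distribution function.
prob_cdf = dist.cdf(b) - dist.cdf(a)

# Area under the density between a and b, approximated by the midpoint rule.
n_strips = 10_000
width = (b - a) / n_strips
area = sum(dist.pdf(a + (i + 0.5) * width) for i in range(n_strips)) * width

print(round(prob_cdf, 4), round(area, 4))
```

The two values agree to several decimal places.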

7.3.7   Properties of probability (optional)

This page describes some rules that are obeyed by probabilities.

7.4   Simulation (optional)

7.4.1   Probability models and simulation

Probability can be used to model complex situations. A simulation of the model involves using the model's probabilities to generate an instance of the situation. Repeating the simulation can give insight into the behaviour of the system.

7.4.2   Simulation: Will the best team win?

A simulation demonstrates that the best team is often not top of a league at the end of the season even if it has a much higher probability of winning individual matches than all other teams.
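
A simulation of this kind might be sketched as follows. The league format (10 teams, every pair meeting twice, no draws), the strength values and the rule that a team wins with probability proportional to its strength are all assumptions made for the sketch, not details from the text.

```python
import random

def best_team_top_share(n_teams=10, strength_best=2.0, n_seasons=1000, seed=3):
    """Proportion of simulated seasons in which team 0 (the best team)
    finishes top or joint top on number of wins."""
    rng = random.Random(seed)
    # Hypothetical abilities: team 0 is stronger, all other teams are equal.
    strengths = [strength_best] + [1.0] * (n_teams - 1)
    top_count = 0
    for season in range(n_seasons):
        wins = [0] * n_teams
        for i in range(n_teams):
            for j in range(i + 1, n_teams):
                for leg in range(2):  # each pair meets twice
                    # Assumed rule: P(i beats j) is proportional to strength.
                    p_i = strengths[i] / (strengths[i] + strengths[j])
                    if rng.random() < p_i:
                        wins[i] += 1
                    else:
                        wins[j] += 1
        if wins[0] == max(wins):
            top_count += 1
    return top_count / n_seasons

share = best_team_top_share()
print(f"best team top (or joint top) in {share:.0%} of seasons")
```

With these settings the best team wins each match with probability 2/3, yet it does not finish top in every simulated season.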

7.4.3   Is there evidence of skill in a league?

A simulation of a soccer league shows that the spread of points at the end of an actual soccer season is not consistent with all teams having equal abilities.

7.4.4   Assessing unusual features in data

The variability of displays from 'regular' populations (such as normal distributions) can be used to assess features in a single data set, such as outliers, clusters or skewness.

7.4.5   Random numbers (advanced)

Simulations are based on randomly generated values. These are generally based on random numbers for which any value between 0 and 1 is equally likely.

7.4.6   Generating categorical values (advanced)

A random categorical value can be easily generated from a random number between 0 and 1.
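
One common approach, sketched here, splits the interval from 0 to 1 into pieces whose lengths equal the category probabilities and reports the piece that contains the random number. The categories, probabilities and function name are illustrative.

```python
import random

def random_category(categories, probs, u):
    """Return the category whose slice of [0, 1) contains u.
    The slices have lengths equal to the category probabilities."""
    cumulative = 0.0
    for category, p in zip(categories, probs):
        cumulative += p
        if u < cumulative:
            return category
    return categories[-1]  # guard against rounding in the cumulative sum

categories = ["A", "B", "C"]   # hypothetical categories
probs = [0.2, 0.5, 0.3]        # hypothetical probabilities (sum to 1)

rng = random.Random(4)
values = [random_category(categories, probs, rng.random()) for _ in range(10)]
print(values)
```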

7.4.7   Generating numerical values (advanced)

Generating random numerical values from a particular distribution is harder. This page describes one such method.
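
The description does not say which method the page uses. As one possibility, the sketch below uses the inverse-transform method, which passes a random number between 0 and 1 through the inverse of the cumulative distribution function. The exponential distribution and its rate of 2 are assumptions chosen because the inverse has a simple formula.

```python
import math
import random

def random_exponential(rate, u):
    """Inverse-transform method for an exponential distribution:
    solve u = 1 - exp(-rate * x) for x."""
    return -math.log(1.0 - u) / rate

rng = random.Random(9)   # fixed seed (an assumption) for a repeatable run
rate = 2.0               # hypothetical rate; the distribution's mean is 1 / rate
values = [random_exponential(rate, rng.random()) for _ in range(20_000)]

sample_mean = sum(values) / len(values)
print(round(sample_mean, 3), "compared with a theoretical mean of", 1 / rate)
```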

7.5   Distribution of sample mean

7.5.1   Variability of sample statistics

A summary value describing a population is called a parameter and the corresponding value in a sample is called a statistic. Sample statistics vary from sample to sample.

7.5.2   Variability of sample mean

A sample mean has a distribution that is centred on the population mean but has a smaller spread than the population.

7.5.3   Standard deviation of sample mean

The spread of the sample mean decreases as sample size increases. A formula is given for the standard deviation of the sample mean in terms of the sample size and population standard deviation.
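
For a random sample of n independent values, the standard result is that the standard deviation of the sample mean is σ/√n. The sketch below compares this with a simulation. The normal population with µ = 50 and σ = 10, the sample size of 25 and the 5000 repetitions are illustrative assumptions.

```python
import random
from statistics import mean, pstdev

rng = random.Random(5)             # fixed seed (an assumption)
mu, sigma, n = 50.0, 10.0, 25      # hypothetical population and sample size

# Draw many samples of size n and record each sample mean.
sample_means = [
    mean(rng.gauss(mu, sigma) for _ in range(n))
    for _ in range(5000)
]

simulated_sd = pstdev(sample_means)   # spread of the simulated means
formula_sd = sigma / n ** 0.5         # sigma / sqrt(n)

print(round(simulated_sd, 2), formula_sd)
```

Increasing n in the sketch shrinks both values at the same rate.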

7.5.4   Means from normal populations

Sample means from normal populations also have normal distributions.

7.5.5   Large-sample normality of means

The shape of the sample mean's distribution is usually close to normal, even if the population distribution is skewed or multimodal. The bigger the sample size, the closer the distribution is to normal.

7.5.6   Distribution of mean from a sample

It is possible to estimate the distribution of a sample mean from a single sample.

7.5.7   Requirement of independence

If the sample values are positively correlated rather than independent, the usual formula underestimates the standard deviation of the mean.

7.5.8   Sampling from finite populations (advanced)

If a random sample is selected without replacement, the formula for the standard deviation of the sample mean must be modified.
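
The usual modification multiplies σ/√n by the finite population correction √((N − n)/(N − 1)), where N is the population size and σ is the population standard deviation computed with divisor N. A small sketch, with hypothetical numbers and function name:

```python
import math

def sd_mean_without_replacement(sigma, n, N):
    """Standard deviation of the mean of n values sampled without
    replacement from a finite population of N values."""
    correction = math.sqrt((N - n) / (N - 1))   # finite population correction
    return sigma / math.sqrt(n) * correction

# Hypothetical example: sigma = 10, a sample of 25 from a population of 100.
print(round(sd_mean_without_replacement(10, 25, 100), 3))

# Sampling the whole population (n = N) leaves no sampling variability.
print(sd_mean_without_replacement(10, 100, 100))  # 0.0
```

When N is much larger than n, the correction is close to 1 and the unmodified formula is recovered.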

7.6   Normal distributions

7.6.1   Importance of normal distributions

Normal distributions are sometimes useful as models for data, but the main reason for their importance is that sample means and many other summary statistics have approximately normal distributions.

7.6.2   Shape of normal distributions

All normal distributions look the same on a scale of standard deviations from the mean.

7.6.3   Sketching a normal distribution

Normal distributions are centred on their mean, µ, and have hardly any area more than 3σ from the mean on either side. A small area (about 5%) is more than 2σ from the mean.

7.6.4   Some normal probabilities

The probabilities of being within σ, 2σ and 3σ of the mean are approximately 0.68, 0.95 and 0.997 respectively for all normal distributions. This is a close match to the 70-95-100 rule of thumb for numerical data sets.
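
These three probabilities can be reproduced with `statistics.NormalDist` from the Python standard library. The mean of 170 and standard deviation of 8 are arbitrary; any other normal distribution gives the same answers.

```python
from statistics import NormalDist

dist = NormalDist(mu=170, sigma=8)   # hypothetical normal distribution

def prob_within(dist, k):
    """Probability of a value within k standard deviations of the mean."""
    upper = dist.mean + k * dist.stdev
    lower = dist.mean - k * dist.stdev
    return dist.cdf(upper) - dist.cdf(lower)

within = [prob_within(dist, k) for k in (1, 2, 3)]
print([round(p, 3) for p in within])
```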

7.6.5   Z-scores

Any value, x, can be translated into a z-score that gives the number of standard deviations by which x lies above the mean (negative if x is below the mean).

7.6.6   Finding normal probabilities

Z-scores have a standard normal distribution (µ = 0 and σ = 1). The probability of a value less than x can be translated into a probability about a z-score.

7.6.7   Other probabilities

The probability that X is greater than a specified value or is between two values can also be translated into a probability about a z-score.
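
The three kinds of probability can be sketched as below, using z = (x − µ)/σ and the standard normal distribution. The values µ = 100 and σ = 15 and the cut-offs 85 and 130 are hypothetical.

```python
from statistics import NormalDist

mu, sigma = 100.0, 15.0        # hypothetical normal distribution for X
standard_normal = NormalDist() # mu = 0, sigma = 1

def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above the mean."""
    return (x - mu) / sigma

z_high = z_score(130, mu, sigma)   # (130 - 100) / 15 = 2.0
z_low = z_score(85, mu, sigma)     # (85 - 100) / 15 = -1.0

p_less = standard_normal.cdf(z_high)                               # P(X < 130)
p_greater = 1 - p_less                                             # P(X > 130)
p_between = standard_normal.cdf(z_high) - standard_normal.cdf(z_low)  # P(85 < X < 130)

print(round(p_less, 4), round(p_greater, 4), round(p_between, 4))
```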

7.6.8   Normal tables

If a computer is not available, tables of probabilities for the standard normal distribution are used to find normal probabilities.

7.6.9   Finding normal quantiles

The inverse problem of finding the x-value corresponding to a given probability is also solved using z-scores. If normal tables are used, they must be looked up differently.
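
With a computer, the inverse look-up can be sketched as follows: find the z-score with the required probability below it, then convert back with x = µ + zσ. The distribution (µ = 100, σ = 15) and the probability 0.95 are hypothetical.

```python
from statistics import NormalDist

mu, sigma = 100.0, 15.0   # hypothetical normal distribution
prob = 0.95               # required probability of a value below x

# z-score with probability 0.95 below it on the standard normal distribution.
z = NormalDist().inv_cdf(prob)

# Translate the z-score back to the original scale.
x = mu + z * sigma

print(round(z, 3), round(x, 1))
```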

7.6.10   Normal probability plots (advanced)

A normal probability plot is an informal graphical method to help assess whether a data set comes from a normal distribution. Curvature in the probability plot suggests that the data may not be normal.

7.7   Distribution of sample proportion

7.7.1   Proportion and probability

The population proportion in a category is called its probability. Proportions and probabilities can be obtained from categorical and numerical variables.

7.7.2   Properties of counts and proportions

The sample proportion is a statistic that varies from sample to sample. The sample count in a category is also random with a discrete distribution.

7.7.3   Binomial distribution

The sample count in a category has a standard distribution called a binomial distribution with parameters n and π.

7.7.4   Binomial probability examples

The binomial distribution can be used to find probabilities relating to sample counts.
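
A short sketch of such a calculation uses the binomial probability formula P(X = k) = C(n, k) π^k (1 − π)^(n − k). The values n = 10 and π = 0.3 are hypothetical.

```python
from math import comb

def binomial_pmf(k, n, pi):
    """Probability of exactly k individuals in the category out of n."""
    return comb(n, k) * pi ** k * (1 - pi) ** (n - k)

n, pi = 10, 0.3   # hypothetical sample size and category probability

p_exactly_3 = binomial_pmf(3, n, pi)
p_at_most_2 = sum(binomial_pmf(k, n, pi) for k in range(0, 3))

print(round(p_exactly_3, 4), round(p_at_most_2, 4))
```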

7.7.5   Normal approximation to binomial

When the sample size is large, the distribution of the sample count in a category becomes close to a normal distribution.

7.7.6   Normal approximation examples

When the sample size is large, a normal approximation to the binomial distribution can be used to find binomial probabilities.
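
The sketch below compares an exact binomial probability with its normal approximation, using mean nπ and standard deviation √(nπ(1 − π)). The values n = 100 and π = 0.5 are hypothetical, and the continuity correction (using 55.5 rather than 55) is a common refinement that the text does not mention.

```python
from math import comb, sqrt
from statistics import NormalDist

n, pi = 100, 0.5   # hypothetical sample size and category probability

# Exact binomial probability of a count of 55 or fewer.
exact = sum(comb(n, k) * pi ** k * (1 - pi) ** (n - k) for k in range(0, 56))

# Normal approximation with the same mean and standard deviation.
approx_dist = NormalDist(mu=n * pi, sigma=sqrt(n * pi * (1 - pi)))
approx = approx_dist.cdf(55.5)   # 55.5 applies a continuity correction

print(round(exact, 4), round(approx, 4))
```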

7.8   Sampling in practice

7.8.1   Stratified sampling

More accurate estimates can sometimes be obtained by taking separate random samples within different parts of the population.
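
A minimal sketch of the idea is given below. The two strata ("urban" and "rural"), their sizes and values, and the use of proportional allocation are all assumptions made for illustration.

```python
import random
from statistics import mean

rng = random.Random(11)   # fixed seed (an assumption)

# Hypothetical population split into two strata with different typical values.
strata = {
    "urban": [rng.gauss(60, 5) for _ in range(600)],
    "rural": [rng.gauss(40, 5) for _ in range(400)],
}
total = sum(len(values) for values in strata.values())
sample_size = 50

estimate = 0.0
for name, values in strata.items():
    weight = len(values) / total
    n_h = round(sample_size * weight)          # proportional allocation
    stratum_sample = rng.sample(values, n_h)   # separate random sample per stratum
    estimate += weight * mean(stratum_sample)  # weight each stratum mean by its size

population_mean = mean(v for values in strata.values() for v in values)
print(round(estimate, 2), round(population_mean, 2))
```

Because each stratum is guaranteed its share of the sample, the estimate cannot be thrown off by a sample that happens to over-represent one stratum.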

7.8.2   Cluster sampling

When individuals are grouped in clusters, it is often cheaper to sample complete clusters rather than separate individuals.

7.8.3   Two-stage sampling

When the target population is spread over a wide area, it may be cheaper to take a sample from only a few regions (groups of individuals) than to sample from all the regions.

7.8.4   Sampling and non-sampling errors

Samples are usually collected to estimate population characteristics. The ordinary variation caused by the sampling scheme is called sampling error. Practical difficulties with conducting the survey can cause biased estimates (non-sampling errors).

7.8.5   Coverage and non-response errors

Coverage error and non-response error prevent some individuals from being included in the sample.

7.8.6   Interviewer and instrument errors

Interviewer error and instrument error can result in 'incorrect' measurements from the sampled individuals.

7.8.7   Survey design issues

Survey information is collected by a variety of mechanisms, from mailed questionnaires to telephone interviews. Each has its own advantages and disadvantages.