A sample provides information about a population when it is too difficult or expensive to make measurements from the whole population.
A sample is usually collected to provide information about an underlying population. However, sample-to-sample variability must be taken into account when doing this.
When a sample is used to estimate a population characteristic, an error is usually involved. Sampling error is caused by random selection of the sample from the population.
As the sample size is increased, the sampling error becomes smaller.
Sampling with replacement from a finite population permits the same value to be selected two or more times. Sampling without replacement ensures that there are no such duplicates.
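The difference between the two schemes can be sketched in a few lines of Python. The population of the integers 1 to 10 and the sample size of 5 are illustrative assumptions, not values from the text, and the seed is fixed only so the sketch is repeatable.

```python
import random

population = list(range(1, 11))  # hypothetical finite population: 1..10
rng = random.Random(0)           # seeded only for repeatability

# With replacement: draws are independent, so a value may appear more than once.
with_replacement = rng.choices(population, k=5)

# Without replacement: each population member can be selected at most once.
without_replacement = rng.sample(population, k=5)

print("with replacement:   ", with_replacement)
print("without replacement:", without_replacement)
```

The sample drawn without replacement never contains duplicates, whereas the one drawn with replacement may.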
Random digits can be selected by rolling a 10-sided die, looking up a table of random digits or using a computer. These random digits can be combined to select a random member of a population. Repeating the process gives a random sample.
We often have little interest in the specific individuals from whom data are collected. The data are representative of some wider situation and we want to generalise from the data to describe features of this more general situation.
If data collection was repeated, perhaps from different individuals, the values would be different. All such data sets should give similar but not identical information. Interpretation of a data set should take into account this randomness.
Many data sets are not generated as random samples from real finite populations. However, it is often useful to treat the data as a random sample from some abstract population containing the measurements that might have been recorded.
The populations that are imagined to underlie data often contain an infinite number of values and are called distributions.
The mechanism of sampling from a population explains randomness. Although the population is unknown and we only have a single sample, the sample provides information about the population.
When randomly selecting one value from a population of N different values, the probability of getting any individual value is 1/N. More generally, the probability of sampling a value in any range is the proportion of population values in the range.
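As a small illustration of probability as a population proportion, the sketch below uses a made-up population of N = 8 distinct values. The helper name `prob_in_range` is hypothetical.

```python
population = [2, 3, 5, 7, 11, 13, 17, 19]  # hypothetical population of N distinct values
N = len(population)

# Each individual value is equally likely to be selected.
p_single = 1 / N

def prob_in_range(values, low, high):
    """Proportion of population values v with low <= v <= high."""
    return sum(low <= v <= high for v in values) / len(values)

print(p_single)                          # 1/8
print(prob_in_range(population, 5, 13))  # 4 of the 8 values lie in [5, 13]
```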
The probability of any type of value is again the proportion of such values in the population. It can also be interpreted as the limiting proportion from a sample of values if the sample size is increased indefinitely.
Infinite categorical or discrete populations can be described by bar charts of the probabilities.
Infinite continuous numerical populations are described with a type of histogram called a probability density function.
Normal distributions are infinite continuous populations. A normal distribution is symmetric and its two parameters, µ and σ, can be adjusted to alter the distribution's location and spread.
When a value is sampled from an infinite continuous population, the probability that it is between a and b equals the area under the p.d.f. between these two values.
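The area interpretation can be checked numerically. The sketch below assumes a standard normal population and the interval from a = -1 to b = 1, both chosen only for illustration. It compares a trapezoidal approximation of the area under the p.d.f. with the difference of cumulative probabilities.

```python
from statistics import NormalDist

def area_under_pdf(pdf, a, b, steps=10_000):
    """Trapezoidal approximation to the area under pdf between a and b."""
    width = (b - a) / steps
    total = 0.5 * (pdf(a) + pdf(b))
    for i in range(1, steps):
        total += pdf(a + i * width)
    return total * width

dist = NormalDist(mu=0, sigma=1)   # assumed population for this sketch
a, b = -1.0, 1.0

approx = area_under_pdf(dist.pdf, a, b)
exact = dist.cdf(b) - dist.cdf(a)  # P(a < X < b) from the cumulative distribution
print(approx, exact)
```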
This page describes some rules that are obeyed by probabilities.
Probability can be used to model complex situations. A simulation of the model involves using the model's probabilities to generate an instance of the situation. Repeating the simulation can give insight into the behaviour of the system.
A simulation demonstrates that the best team is often not top of a league at the end of the season even if it has a much higher probability of winning individual matches than all other teams.
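A minimal sketch of such a simulation follows. The model is an assumption made for illustration only: 10 teams, every pair meeting twice, no draws, the best team (team 0) winning each of its matches with probability 0.6, and all other matches being 50/50. The function name `simulate_season` is hypothetical.

```python
import random

def simulate_season(n_teams, p_best, rng):
    """Return the number of wins per team; team 0 is the 'best' team."""
    wins = [0] * n_teams
    for home in range(n_teams):
        for away in range(n_teams):
            if home == away:
                continue
            # Probability that `home` wins this match (no draws in this sketch).
            if home == 0:
                p_home = p_best
            elif away == 0:
                p_home = 1 - p_best
            else:
                p_home = 0.5
            if rng.random() < p_home:
                wins[home] += 1
            else:
                wins[away] += 1
    return wins

rng = random.Random(1)  # seeded only for repeatability
seasons = 2000
best_on_top = 0
for _ in range(seasons):
    wins = simulate_season(10, 0.6, rng)
    # Count the season only if team 0 finishes strictly ahead of every other team.
    if wins[0] == max(wins) and wins.count(wins[0]) == 1:
        best_on_top += 1

share_top = best_on_top / seasons
print(f"best team finished outright top in {share_top:.1%} of simulated seasons")
```

Changing `p_best` or the number of teams shows how strongly the conclusion depends on the assumed model.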
A simulation of a soccer league shows that the spread of points at the end of an actual soccer season is not consistent with all teams having equal abilities.
The variability of displays from 'regular' populations (such as normal distributions) can be used to assess features in a single data set, such as outliers, clusters or skewness.
Simulations are based on randomly generated values. These are generally based on random numbers for which any value between 0 and 1 is equally likely.
A random categorical value can be easily generated from a random number between 0 and 1.
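One way to do this is to split the interval from 0 to 1 into sub-intervals whose lengths equal the category probabilities, then report the category whose sub-interval contains the random number. The sketch below assumes that approach. The category names, the probabilities and the function name `random_category` are illustrative.

```python
import random

def random_category(probabilities, u):
    """Map a uniform number u in [0, 1) to a category via cumulative probabilities."""
    cumulative = 0.0
    for category, p in probabilities.items():
        cumulative += p
        if u < cumulative:
            return category
    return category  # guards against rounding if the probabilities sum to just under 1

match_result = {"win": 0.5, "draw": 0.3, "lose": 0.2}  # hypothetical probabilities

rng = random.Random(3)
print([random_category(match_result, rng.random()) for _ in range(10)])
```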
Generating random numerical values from a particular distribution is harder. This page describes one such method.
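The summary does not say which method the page describes. One widely used technique is inverse transform sampling, which feeds a random number between 0 and 1 into the inverse of the cumulative distribution function. The sketch below applies it to an exponential distribution, chosen here only because its inverse has a simple closed form.

```python
import math
import random

def exponential_from_uniform(u, rate):
    """Invert the exponential c.d.f. F(x) = 1 - exp(-rate * x) at u in [0, 1)."""
    return -math.log(1.0 - u) / rate

rng = random.Random(7)  # seeded only for repeatability
rate = 2.0              # assumed parameter; the distribution mean is 1 / rate
values = [exponential_from_uniform(rng.random(), rate) for _ in range(20_000)]

sample_mean = sum(values) / len(values)
print(sample_mean)      # should be close to 1 / rate
```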
A summary value describing a population is called a parameter and the corresponding value in a sample is called a statistic. Sample statistics vary from sample to sample.
A sample mean has a distribution that is centred round the population mean but has smaller spread than the population.
The spread of the sample mean decreases as sample size increases. A formula is given for the standard deviation of the sample mean in terms of the sample size and population standard deviation.
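The summary does not reproduce the formula. For a random sample of n independent values from a population with standard deviation σ, the standard result is the one below. The symbol for the standard deviation of the mean is notation introduced here.

```latex
\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}
```

For example, with σ = 10 and n = 25, the standard deviation of the sample mean is 10 / 5 = 2.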
Sample means from normal populations also have normal distributions.
The shape of the sample mean's distribution is usually close to normal, even if the population distribution is skew or multimodal. The bigger the sample size, the closer the distribution to normal.
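A short simulation can illustrate this. The sketch below assumes a strongly skew population (an exponential distribution with mean 1 and standard deviation 1) and a sample size of 30; both choices are illustrative. It checks that the sample means are centred near the population mean with spread near σ/√n.

```python
import random
from statistics import mean, stdev

rng = random.Random(42)  # seeded only for repeatability
n = 30                   # assumed sample size
repetitions = 5000       # number of simulated samples

# Each entry is the mean of one random sample of n exponential values.
sample_means = [
    sum(rng.expovariate(1.0) for _ in range(n)) / n
    for _ in range(repetitions)
]

print("centre of sample means:", mean(sample_means))   # near the population mean, 1
print("spread of sample means:", stdev(sample_means))  # near 1 / sqrt(30)
```

A histogram of `sample_means` would look far more symmetric than the exponential population itself.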
It is possible to estimate the distribution of a sample mean from a single sample.
If the sample values are positively correlated, the standard deviation of the mean will be underestimated.
If a random sample is selected without replacement, the formula for the standard deviation of the sample mean must be modified.
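The usual modification is the finite population correction, which multiplies σ/√n by √((N − n)/(N − 1)) for a population of size N. The summary does not state the formula, so the sketch below assumes this standard form. The function name `sd_of_mean` is hypothetical.

```python
import math

def sd_of_mean(sigma, n, population_size=None):
    """Standard deviation of the sample mean.

    With population_size=None the sample is treated as drawn with replacement
    (or from an infinite population). Otherwise the finite population
    correction for sampling without replacement is applied.
    """
    sd = sigma / math.sqrt(n)
    if population_size is not None:
        sd *= math.sqrt((population_size - n) / (population_size - 1))
    return sd

print(sd_of_mean(10, 25))                       # no correction
print(sd_of_mean(10, 25, population_size=101))  # smaller, since 25 of 101 are sampled
print(sd_of_mean(10, 25, population_size=25))   # zero: the sample is the whole population
```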
Normal distributions are sometimes useful as models for data, but the main reason for their importance is that sample means and many other summary statistics have approximately normal distributions.
All normal distributions look the same on a scale of standard deviations from the mean.
Normal distributions are centred on their mean, µ, and have hardly any area beyond 3σ on each side. A small area (about 5%) is over 2 standard deviations away from the mean.
The probabilities of being within σ, 2σ and 3σ of the mean are 0.68, 0.95 and 0.997 respectively for all normal distributions. This is a close match to the 70-95-100 rule-of-thumb for numerical data sets.
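These three probabilities can be reproduced with Python's standard library. Because all normal distributions look the same on a scale of standard deviations, the sketch works with the standard normal distribution only.

```python
from statistics import NormalDist

standard = NormalDist(mu=0, sigma=1)

# Probability of being within k standard deviations of the mean, for k = 1, 2, 3.
within = {k: standard.cdf(k) - standard.cdf(-k) for k in (1, 2, 3)}

for k, p in within.items():
    print(f"within {k} standard deviation(s): {p:.4f}")
```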
Any value, x, can be translated into a z-score that gives the number of standard deviations above the mean.
Z-scores have a standard normal distribution (µ = 0 and σ = 1). The probability of a value less than x can be translated into a probability about a z-score.
The probability that X is greater than a specified value or is between two values can also be translated into a probability about a z-score.
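The three kinds of probability can be sketched as follows. The mean of 100 and standard deviation of 15 are assumed values for illustration, and `z_score` is a hypothetical helper name.

```python
from statistics import NormalDist

mu, sigma = 100.0, 15.0  # assumed population parameters
standard = NormalDist()  # standard normal: mean 0, standard deviation 1

def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above the mean."""
    return (x - mu) / sigma

# P(X < 115) = P(Z < 1)
p_less = standard.cdf(z_score(115, mu, sigma))

# P(X > 130) = 1 - P(Z < 2)
p_greater = 1 - standard.cdf(z_score(130, mu, sigma))

# P(85 < X < 115) = P(Z < 1) - P(Z < -1)
p_between = standard.cdf(z_score(115, mu, sigma)) - standard.cdf(z_score(85, mu, sigma))

print(p_less, p_greater, p_between)
```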
If a computer is not available, tables of probabilities for the standard normal distribution are used to find normal probabilities.
The inverse problem of finding the x-value corresponding to a given probability is also solved using z-scores. If normal tables are used, they must be looked up differently.
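With a computer, the inverse lookup is a single call. The sketch below assumes a mean of 100, a standard deviation of 15 and a target probability of 0.95, all chosen for illustration. It finds the z-score with that cumulative probability and then converts it back to an x-value.

```python
from statistics import NormalDist

mu, sigma = 100.0, 15.0  # assumed population parameters
target = 0.95            # assumed probability: find x with P(X < x) = 0.95

# Step 1: z-score whose cumulative standard normal probability is the target.
z = NormalDist().inv_cdf(target)

# Step 2: translate the z-score back to the original scale.
x = mu + z * sigma

print(z, x)
```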
A normal probability plot is an informal graphical method to help assess whether a data set comes from a normal distribution. Curvature in the probability plot suggests that the data may not be normal.
The population proportion in a category is called its probability. Proportions and probabilities can be obtained from categorical and numerical variables.
The sample proportion is a statistic that varies from sample to sample. The sample count in a category is also random with a discrete distribution.
The sample count in a category has a standard distribution called a binomial distribution with parameters n and π.
The binomial distribution can be used to find probabilities relating to sample counts.
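A minimal sketch of such calculations, using only the standard library. The function names and the example values n = 4 and π = 0.5 are illustrative.

```python
from math import comb

def binomial_pmf(k, n, pi):
    """P(X = k) when X is the count in a category from n independent values."""
    return comb(n, k) * pi**k * (1 - pi) ** (n - k)

def binomial_cdf(k, n, pi):
    """P(X <= k), obtained by summing the individual probabilities."""
    return sum(binomial_pmf(i, n, pi) for i in range(k + 1))

n, pi = 4, 0.5  # assumed sample size and category probability
print(binomial_pmf(2, n, pi))  # probability of exactly 2 in the category
print(binomial_cdf(1, n, pi))  # probability of at most 1 in the category
```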
When the sample size is large, the distribution of the sample count in a category becomes close to a normal distribution.
When the sample size is large, a normal approximation to the binomial distribution can be used to find binomial probabilities.
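The sketch below compares an exact binomial probability with its normal approximation for n = 100 and π = 0.5, values assumed for illustration. It uses a continuity correction (evaluating the normal distribution at k + 0.5), which is a common refinement that the summary itself does not mention.

```python
from math import comb, sqrt
from statistics import NormalDist

n, pi = 100, 0.5  # assumed sample size and category probability
k = 55            # we want P(X <= 55)

# Exact binomial probability, summing P(X = i) for i = 0..k.
exact = sum(comb(n, i) * pi**i * (1 - pi) ** (n - i) for i in range(k + 1))

# Normal approximation with the binomial mean and standard deviation.
mu = n * pi
sigma = sqrt(n * pi * (1 - pi))
approx = NormalDist(mu, sigma).cdf(k + 0.5)  # continuity correction (an assumption here)

print(exact, approx)
```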
More accurate estimates can sometimes be obtained by taking separate random samples within different parts of the population.
When individuals are grouped in clusters, it is often cheaper to sample complete clusters rather than separate individuals.
When the target population is spread over a wide area, it may be cheaper to take a sample from only a few regions (groups of individuals) than to sample from all the regions.
Samples are usually collected to estimate population characteristics. The ordinary variation caused by the sampling scheme is called sampling error. Practical difficulties with conducting the survey can cause biased estimates (non-sampling errors).
Coverage error and non-response error prevent some individuals being included in the sample.
Interviewer error and instrument error can result in 'incorrect' measurements from the sampled individuals.
Survey information is collected by a variety of mechanisms, from mailed questionnaires to telephone interviews. Each has its own advantages and disadvantages.
Many business and industrial processes are continuously monitored in order to detect problems early.
A run chart plots regular measurements from a process against time. Extreme values indicate problems with the process.
For most distributions, almost all values are within 3 standard deviations of the mean. Control limits 3s on each side of the sample mean from training data can be used to indicate problems with the process.
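Setting and applying such limits can be sketched as follows. The training measurements, the new measurements and the helper name `out_of_control` are made up for illustration.

```python
from statistics import mean, stdev

# Hypothetical training data recorded while the process was running normally.
training = [10.1, 9.8, 10.0, 10.2, 9.9, 10.0, 10.1, 9.9]

centre = mean(training)
s = stdev(training)  # sample standard deviation of the training data
lower, upper = centre - 3 * s, centre + 3 * s

def out_of_control(values, lower, upper):
    """Return (index, value) pairs for values outside the control limits."""
    return [(i, v) for i, v in enumerate(values) if v < lower or v > upper]

new_values = [10.0, 10.3, 10.6, 9.9]  # hypothetical later measurements
flagged = out_of_control(new_values, lower, upper)

print(f"control limits: {lower:.3f} to {upper:.3f}")
print("flagged:", flagged)
```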
Runs of successive values that are more than s or 2s from the mean can also be used to indicate problems in the process.
False alarms occasionally happen even when the process is in control. Several successive triggers give a clearer indication of problems.
Control limits are set from variation when a process is under control then applied to future observations.
Control charts are often based on regular small samples rather than individual values. If individual values have a skew distribution, the ±3s limits may be exceeded in 2% or more of values. Sample means are closer to normal, so a control chart for means rarely exceeds its control limits when the process is under control.
A control chart for the range of successive samples can indicate increases in process variability.
If a problem is detected in a process, brainstorming and cause-and-effect diagrams can help to determine the cause of the problem.