Data that are not sampled from a finite population

Sometimes data are actually sampled from a real finite population. For example, a public opinion poll may select individuals from the population of all residents in a city. The previous section showed that:

Random sampling of values from a finite population can explain the sample-to-sample variability of some data.

However, most data sets have no real finite population underlying them from which the values can be treated as sampled. The randomness in such data must be explained in a different way.

Accidents in a factory

In a large factory, the management are concerned about their liability for accidents that occur on the factory floor. The following numbers of accidents were recorded each week over a 50-week period.

Number of accidents in week
0  3  1  2  4  5  0  1  6  2
3  4  2  2  0  1  1  0  3  1
2  0  4  1  0  0  3  1  5  1
4  3  1  0  2  0  0  1  0  6
0  0  2  1  2  1  3  7  0  0

There is no real finite population from which these data can be considered to be sampled. However, the values vary from week to week, and operating the factory for another 50 weeks would produce a different set of values.
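The variability within the data set can be seen by tallying how many weeks had each number of accidents. The following is a minimal sketch in Python; the names `accidents` and `frequency_table` are chosen here for illustration and are not part of the original material.

```python
from collections import Counter

# Weekly accident counts for the 50 weeks, copied from the data above.
accidents = [
    0, 3, 1, 2, 4, 5, 0, 1, 6, 2,
    3, 4, 2, 2, 0, 1, 1, 0, 3, 1,
    2, 0, 4, 1, 0, 0, 3, 1, 5, 1,
    4, 3, 1, 0, 2, 0, 0, 1, 0, 6,
    0, 0, 2, 1, 2, 1, 3, 7, 0, 0,
]


def frequency_table(values):
    """Return {number of accidents: number of weeks}, ordered by accidents."""
    tally = Counter(values)
    return {k: tally[k] for k in sorted(tally)}


table = frequency_table(accidents)
mean_per_week = sum(accidents) / len(accidents)

# Print a simple text bar chart of the weekly counts.
for count, weeks in table.items():
    print(f"{count} accidents: {weeks:2d} weeks {'*' * weeks}")
print(f"Mean accidents per week: {mean_per_week:.2f}")
```

The tally shows 15 weeks with no accidents and a single week with 7, with a mean of 1.82 accidents per week.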

Sampling from an abstract population

Random sampling from a population is such an intuitive way to explain sample-to-sample variability that we also use it when there is no real population from which the data were sampled.

We replace the real population that usually underlies survey data with an abstract population of all values that might have been obtained if the data collection had been repeated. We can then treat the observed data as a random sample from this abstract population.

The variation in the underlying abstract population gives us information about the variation in similar data in general.

Defining such an underlying population therefore not only explains sample-to-sample variability but also gives us a focus for generalising from our specific data.
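The idea of repeated sampling from an abstract population can be sketched with a short simulation. In the sketch below, the abstract population is described by a table of probabilities for the number of accidents in a week. These probabilities are purely hypothetical values invented for illustration, as are the names `population` and `sample_weeks`; in practice the population distribution is unknown.

```python
import random

# HYPOTHETICAL abstract population: assumed probabilities for the number
# of accidents in a week. These values are made up for illustration only.
population = {0: 0.30, 1: 0.25, 2: 0.15, 3: 0.12,
              4: 0.08, 5: 0.04, 6: 0.04, 7: 0.02}


def sample_weeks(n_weeks, rng):
    """Draw a random sample of n_weeks values from the abstract population."""
    values = list(population)
    weights = list(population.values())
    return rng.choices(values, weights=weights, k=n_weeks)


rng = random.Random(1)  # fixed seed so that the run is repeatable

# Two repetitions of the 50-week data collection.
first = sample_weeks(50, rng)
second = sample_weeks(50, rng)

print("First 50 weeks: ", first)
print("Second 50 weeks:", second)
print("Means:", sum(first) / 50, sum(second) / 50)
```

Each call to `sample_weeks` plays the role of repeating the data collection. The two samples come from the same population, so their differences are sample-to-sample variability.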

Accidents in a factory

It is convenient to model the data as a sample from the infinite population of all possible numbers of accidents that could have been observed. The variability in this hypothetical population reflects the variability in the occurrence of accidents.

The distribution of accidents in the sample of 50 weeks provides information about the distribution of this underlying population, that is, the distribution of accidents in this type of factory in general.
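As a rough illustration of how the sample informs us about the underlying population, sample proportions can be used as estimates of the corresponding population probabilities. The sketch below assumes the 50 weekly counts given earlier; the helper `proportion` is a name introduced here, and the results are only estimates that would differ for another 50 weeks.

```python
# Weekly accident counts for the 50 weeks, copied from the data above.
accidents = [
    0, 3, 1, 2, 4, 5, 0, 1, 6, 2,
    3, 4, 2, 2, 0, 1, 1, 0, 3, 1,
    2, 0, 4, 1, 0, 0, 3, 1, 5, 1,
    4, 3, 1, 0, 2, 0, 0, 1, 0, 6,
    0, 0, 2, 1, 2, 1, 3, 7, 0, 0,
]


def proportion(values, condition):
    """Proportion of values in the sample that satisfy the condition."""
    return sum(1 for v in values if condition(v)) / len(values)


# Sample proportions, used as estimates of population probabilities.
p_no_accidents = proportion(accidents, lambda v: v == 0)
p_three_or_more = proportion(accidents, lambda v: v >= 3)

# Sample mean, used as an estimate of the population mean.
mean_estimate = sum(accidents) / len(accidents)

print(f"Estimated P(no accidents in a week):  {p_no_accidents:.2f}")
print(f"Estimated P(3 or more in a week):     {p_three_or_more:.2f}")
print(f"Estimated mean accidents per week:    {mean_estimate:.2f}")
```

In this sample, 15 of the 50 weeks had no accidents and 15 had three or more, so both probabilities are estimated as 0.30.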