Population proportions and probabilities

Categorical data are often obtained as a random sample from some finite population.

We concentrate on a single category which we will call success and we collectively call the other categories failures. The population proportion of successes is denoted by π. It is also the probability that a single randomly selected value from the population is a success.
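The link between the population proportion and the probability of a single success can be checked with a small sketch. The population below is hypothetical (the size of 1000 and the 300 successes are made-up values, not real data). The code selects a single value at random many times over and compares the relative frequency of successes with the population proportion.

```python
import random

# Hypothetical finite population: 300 successes (True) and 700 failures (False).
population = [True] * 300 + [False] * 700

# Population proportion of successes, pi.
pi = sum(population) / len(population)

# Select a single value at random, many times over, and count the successes.
rng = random.Random(1)  # fixed seed so the run is repeatable
n_draws = 100_000
successes = sum(rng.choice(population) for _ in range(n_draws))
relative_frequency = successes / n_draws

print(f"population proportion: {pi:.3f}")
print(f"relative frequency of success in single draws: {relative_frequency:.3f}")
```

The two printed values should be close, and they get closer as the number of draws is increased.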

In other situations, categorical data are not physically sampled from a real population, but we can still consider them to be sampled from an underlying process in which the probability of success is π.

We will not further distinguish between these two situations — in both cases, we are interested in estimating an underlying probability, π. Although it is more general to treat π as a probability, it is usually easier to interpret π as a 'population proportion'.

Parameter estimate and error

The sample proportion of successes is denoted by p and is an estimate of the population proportion, π.

As in other situations in which a population parameter is estimated, there will be an estimation error,

error = p − π

Since π is unknown, we cannot find a numerical value for the error, but we know that the error has a distribution. On the next page, we will show how to find the error distribution.

Rice survey

As part of a survey of rice producers in Sri Lanka, 36 farmers were randomly selected from 4 villages. Each sampled farmer was asked about the variety of rice that he used and the varieties were categorised as 'Old', 'Traditional' or 'New'. The 36 categorical values are described by the following frequency table.

Rice variety             Number of farmers
Old varieties            17
Traditional varieties    15
New varieties             4
Total farmers            36

We will concentrate here on the 'Old' varieties of rice. We want to estimate the proportion of farmers in this part of Sri Lanka who use these old varieties; this is the population parameter, π. However, we have only a small sample of farmers, so our best estimate is the sample proportion, 17/36 = 0.472.
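The arithmetic behind this estimate can be reproduced with a few lines of Python. This is only a sketch; the variable names are chosen for illustration, and the counts are those in the frequency table above.

```python
# Frequency table from the rice survey.
counts = {"Old": 17, "Traditional": 15, "New": 4}

n = sum(counts.values())   # total number of sampled farmers
p = counts["Old"] / n      # sample proportion using 'Old' varieties

print(f"n = {n}, p = {p:.3f}")   # prints: n = 36, p = 0.472
```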

How accurate is this estimate?

Since π is unknown here, the estimation error cannot be determined.

Simulation: Error distribution

The diagram below simulates preference data in which a sample of 100 people say which of two types of coffee is preferred. The values could equally be from any 'success/failure' binomial scenario.

Our sample of 100 values provides an estimate of the population proportion preferring Coffee A, but the error cannot be found unless we know the value of π.

Since this is a simulation, we can 'cheat'. Click 'Peek at population' to see the population proportion preferring Coffee A and evaluate the estimation error.

Click 'Another sample' several times to repeat the simulation and build up the distribution of the estimation error.

The error distribution describes the likely size of errors from this type of estimate.

Note that this is not a practical way to obtain the error distribution, since we can rarely take multiple samples from the population. However, it does illustrate the concept of an error distribution.
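The same idea can be sketched in code. The example below is a minimal illustration rather than the simulation used on this page: the population proportion of 0.6 and the 1000 repeated samples are assumed values, and the function name sample_proportion is hypothetical. Each simulated sample of 100 values gives one sample proportion and therefore one error, and the collection of errors approximates the error distribution.

```python
import random
import statistics

def sample_proportion(pi, n, rng):
    """Return the proportion of successes in one simulated sample of n values."""
    successes = sum(rng.random() < pi for _ in range(n))
    return successes / n

# Assumed values, for illustration only. In practice pi is unknown.
pi = 0.6            # hypothetical population proportion preferring Coffee A
n = 100             # sample size, as in the simulation described above
n_samples = 1000    # hypothetical number of repeated samples

rng = random.Random(2)  # fixed seed so the run is repeatable

# One error per simulated sample: sample proportion minus population proportion.
errors = [sample_proportion(pi, n, rng) - pi for _ in range(n_samples)]

print(f"mean error: {statistics.mean(errors):.4f}")
print(f"standard deviation of errors: {statistics.stdev(errors):.4f}")

# Share of samples whose estimate was within 0.1 of pi.
within = sum(abs(e) <= 0.1 for e in errors) / n_samples
print(f"proportion of errors within 0.1 of zero: {within:.3f}")
```

The errors should be centred near zero, and their spread indicates how far a sample proportion from 100 values is likely to be from π under these assumed settings.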