A sample proportion has a distribution
If a categorical data set is modelled as a random sample from a categorical population, the sample proportions in the various categories must be treated as random quantities — they vary from sample to sample.
The population proportion in any category of a categorical population is called the category's probability, and the Greek letter π is often used to denote the probability of a particular category of interest. The corresponding sample proportion is usually denoted by p.
 | Sample Statistic | Population Parameter |
---|---|---|
Mean | x̄ | µ |
Standard deviation | s | σ |
Proportion/probability | p | π |
Note carefully that...
In statistics, the symbol π is used to represent a probability that may take any value between 0 and 1, depending on context. Do not confuse it with the mathematical constant π.
It is important that you understand the distinction between a sample proportion and the underlying population probability.
Visitors to Hawaii
In January 2003, 517,141 tourists arrived in Hawaii by air. Of these, 177,759 (34.4%) were recorded as visiting the island of Maui. We can model the decision that a tourist arriving in early February makes about visiting Maui as a categorical value (visited or not) sampled from a hypothetical infinite population in which 34.4% visit Maui and 65.6% do not.
Consider ten tourists arriving on 1st February. They would be modelled as a random sample of n = 10 values from this population.
Taking several samples from this model shows the variability between them. In particular, the sample proportion who visit Maui varies from sample to sample.
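The sampling model for the ten tourists can also be mimicked in code. The following is a minimal Python sketch, assuming each tourist independently visits Maui with probability 0.344; the function name `sample_proportion`, the seed, and the choice of eight repeated samples are illustrative assumptions, not part of the original example.

```python
import random

PI_MAUI = 0.344  # population probability of visiting Maui (34.4%, from the text)
N = 10           # number of tourists in each sample


def sample_proportion(n, pi, rng):
    """Draw n independent visit / no-visit values; return the proportion who visit."""
    visits = sum(1 for _ in range(n) if rng.random() < pi)
    return visits / n


rng = random.Random(2003)  # arbitrary seed, chosen only so the run is repeatable

# Eight separate samples of ten tourists, each giving its own sample proportion.
proportions = [sample_proportion(N, PI_MAUI, rng) for _ in range(8)]
print(proportions)
```

Each value printed is the sample proportion p from one sample of ten tourists. The population probability π = 0.344 is the same for every sample; only p changes.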
Unknown probabilities
In some applications, we know the population probabilities for the categories of interest, but in practice these parameters are usually unknown constants. The corresponding sample proportions are estimates of these probabilities, but it is important to recognise that the underlying probabilities themselves remain unknown.
Use of the internet by doctors
The internet has had a major impact on how many types of business operate. In a nationwide survey of 400 practicing physicians in the USA, it was found that 356 were using the Internet. The symbol π denotes the probability of a physician in the USA using the Internet. The diagram below shows the results graphically.
The unknown parameter π is of greatest interest, but we only know the sample proportion using the Internet, p = 0.89, which throws some light on the likely value of π.
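The sample proportion quoted above comes directly from the survey counts. A short Python check of the arithmetic (the variable names are illustrative):

```python
n_surveyed = 400        # physicians in the survey
n_using_internet = 356  # physicians who reported using the Internet

# Sample proportion: the fraction of the sample in the category of interest.
p = n_using_internet / n_surveyed
print(f"p = {p:.2f}")
```

This value describes only the 400 physicians surveyed; π, the probability for physicians in the USA as a whole, remains unknown.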
Understanding the sample-to-sample variability of a proportion allows us to assess the proportion observed in a single data set.
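One way to get a feel for this variability is simulation. The sketch below assumes, purely for illustration, that the population probability is π = 0.89 (the true value is unknown) and draws many hypothetical surveys of 400 physicians; the names `HYPOTHETICAL_PI`, `NUM_SAMPLES`, and `sample_proportion` are assumptions made for this example.

```python
import random
import statistics

HYPOTHETICAL_PI = 0.89  # assumed for illustration only; the real pi is unknown
N = 400                 # sample size, as in the physician survey
NUM_SAMPLES = 1000      # number of simulated surveys (arbitrary choice)

rng = random.Random(1)  # arbitrary seed so the run is repeatable


def sample_proportion(n, pi, rng):
    """Return the proportion of n independent draws that fall in the category."""
    return sum(rng.random() < pi for _ in range(n)) / n


simulated = [sample_proportion(N, HYPOTHETICAL_PI, rng) for _ in range(NUM_SAMPLES)]

# Summaries of how much the sample proportion moves from survey to survey.
print("smallest p:", min(simulated))
print("largest p: ", max(simulated))
print("mean of p: ", round(statistics.mean(simulated), 4))
print("std dev:   ", round(statistics.stdev(simulated), 4))
```

The spread of the simulated proportions indicates how far a single observed p might plausibly lie from the population probability under this assumed model.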