Large and small data sets
If national data are available about the distribution of marks for some standard test — a large data set — then we will have fairly detailed information about the shape of the distribution and a simple curve may not match the data well enough.
However, for small data sets, such as a single class set of fewer than 30 marks, a histogram cannot strongly suggest the detailed shape of an approximating curve. It is therefore usually acceptable to use a very simple generic curve.
In the rest of this section, we restrict attention to a 'family' of distributions (curves) with a limited range of shapes, called normal distributions, and pick one member of this family to approximate the histogram of a data set.
Shape of the normal distribution
Normal distributions are all symmetric 'bell-shaped' curves. There are two numerical parameters called µ and σ that can be adjusted to give a range of symmetric distributional shapes. (The two parameters are the distribution's mean and standard deviation — see Chapter 3, Numerical Summaries.)
If we are looking for a curve that can be used as a model for a particular data set, we can therefore choose a normal distribution whose parameters give a shape that matches a histogram of the data reasonably closely.
The diagram below illustrates the range of distributions from the normal family.
Use the two sliders to adjust the normal parameters. Observe that the location and spread of the distribution are changed, but other aspects of its shape remain the same for all values of the parameters.
Note also that the total area under the probability density function remains the same (exactly 1.0) for all values of the parameters. This holds for all probability density functions.
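The two observations above can also be checked numerically. The sketch below is only an illustration, not the code behind the diagram: it assumes the usual formula for the normal probability density function, and the function names normal_pdf and area_under_pdf and the parameter values are chosen here for the example. It approximates the area under the curve for a few values of µ and σ, and shows that rescaling any normal curve by its own µ and σ reproduces the standard curve (µ = 0, σ = 1).

```python
import math

def normal_pdf(x, mu, sigma):
    """Normal probability density with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def area_under_pdf(mu, sigma, n_steps=20000, width=10.0):
    """Trapezoidal approximation of the area between mu - width*sigma and mu + width*sigma."""
    lo = mu - width * sigma
    hi = mu + width * sigma
    h = (hi - lo) / n_steps
    total = 0.5 * (normal_pdf(lo, mu, sigma) + normal_pdf(hi, mu, sigma))
    for i in range(1, n_steps):
        total += normal_pdf(lo + i * h, mu, sigma)
    return total * h

# Illustrative parameter values only (assumed, not taken from the diagram).
for mu, sigma in [(0, 1), (30, 5), (45, 12)]:
    area = area_under_pdf(mu, sigma)
    # Height 1.5 standard deviations above the mean, rescaled by sigma,
    # compared with the standard normal height at z = 1.5.
    rescaled = sigma * normal_pdf(mu + 1.5 * sigma, mu, sigma)
    standard = normal_pdf(1.5, 0, 1)
    print(f"mu={mu}, sigma={sigma}: area={area:.6f}, "
          f"rescaled height={rescaled:.6f}, standard height={standard:.6f}")
```

In each case the computed area should be very close to 1, and the rescaled height should agree with the standard normal height. This is one way of seeing that changing µ and σ shifts and stretches the curve without altering its basic shape.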
The diagram below shows a histogram of marks (out of 60) for 60 year 7 students in a vocabulary test, with a superimposed normal probability density function.
Use the sliders to adjust the normal parameters until the curve matches the histogram as closely as possible. This normal distribution can be used as an approximate model for how the data might have arisen.
We have used a subjective procedure of matching the shapes of the histogram and probability density 'by eye'. A more objective way to 'estimate' the normal parameters will be presented in the next chapter. Click the Best fit button to apply this objective method.
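Since µ and σ are the mean and standard deviation of the normal distribution, one natural objective choice is to set them equal to the mean and standard deviation of the data. The sketch below assumes this approach, which may or may not be exactly the method used by the Best fit button. The marks are invented for illustration and are not the vocabulary test data shown in the diagram.

```python
import statistics

# Hypothetical marks out of 60, invented for this example
# (not the data set shown in the diagram).
marks = [31, 42, 27, 38, 45, 33, 36, 29, 40, 35, 48, 24, 37, 34, 41]

# Assumed estimation method: use the sample mean and sample standard deviation.
mu_hat = statistics.mean(marks)
sigma_hat = statistics.stdev(marks)  # uses the n - 1 divisor

print(f"estimated mu = {mu_hat:.2f}, estimated sigma = {sigma_hat:.2f}")
```

Unlike matching by eye, this calculation gives the same answer to anyone who applies it to the same data.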