Flexibility in bin widths and bin starting positions

There is much more freedom in the choice of histogram bins than in the corresponding bins for stem and leaf plots. Indeed, any values can be used for the bin boundaries in a histogram.

We initially restrict attention to histograms in which all bins have the same width, but even then both the bin width and the starting position of the first bin can be chosen freely.

Bins should be chosen for smoothness

As in stem and leaf plots, we aim for smoothness in the outline of the histogram rectangles. The histogram below, showing the ages at which students reached reading age 8, is reasonably smooth; we informally interpret it in the same way as the smooth blue curve that has been superimposed on it 'by eye'.

Histogram bins should therefore be chosen to make the outline of the histogram as smooth as possible. Adjusting bin width is most important in attaining this goal.

There is no substitute for trial and error in this process!

The histogram below shows the distribution of 200 values.

Use the buttons below the histogram to investigate the effect of narrowing and widening the histogram bins. Which histogram is smoothest (and therefore best)?

The general principle is to use the smallest bin width that does not give a jagged outline. This is a subjective judgment: any bin width between 4.0 and 8.0 would be acceptable here, though a bin width at the lower end of this range is better.
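The same trial-and-error can be sketched in code. The example below is only an illustration: the 200 values are randomly generated and are not the data behind the histogram above, and the function names `bin_counts` and `direction_changes` are hypothetical. The jaggedness score is a crude assumption made for this sketch (it counts how often the outline switches between rising and falling); it is not a standard statistic and does not replace judging the histogram by eye.

```python
import math
import random

def bin_counts(values, width, start):
    """Count values in equal-width bins [start + k*width, start + (k+1)*width)."""
    if width <= 0:
        raise ValueError("width must be positive")
    if start > min(values):
        raise ValueError("start must not exceed the smallest value")
    n_bins = math.floor((max(values) - start) / width) + 1
    counts = [0] * n_bins
    for v in values:
        counts[math.floor((v - start) / width)] += 1
    return counts

def direction_changes(counts):
    """Crude jaggedness score (an assumption, not a standard statistic):
    how often the outline switches between rising and falling."""
    # Ignore flat steps so that plateaus do not count as changes of direction.
    steps = [b - a for a, b in zip(counts, counts[1:]) if b != a]
    return sum(1 for s, t in zip(steps, steps[1:]) if (s > 0) != (t > 0))

random.seed(1)
values = [random.gauss(50, 10) for _ in range(200)]  # illustrative data only
start = math.floor(min(values))

# Try several bin widths, from narrow to wide, and compare the outlines.
for width in (1, 2, 4, 8, 16):
    counts = bin_counts(values, width, start)
    print(f"width {width:>2}: {direction_changes(counts):>2} direction changes, counts {counts}")
```

Narrow bins would typically give many changes of direction, and wide bins few. A score near 1 suggests a single smooth hump, but very wide bins achieve this only by hiding detail, which is why the smallest acceptable width is preferred.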

Warning about histograms of small data sets

Adjusting the bin width and the starting position for the first bin can give a surprising amount of variability in histogram shape for small data sets. As a result, you must be wary of over-interpreting features such as clusters or skewness in such histograms.
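The effect of the starting position can be seen in a small sketch. The 25 values below are randomly generated for illustration only (they are not any data set discussed in this section), and `bin_counts` and `text_histogram` are hypothetical helper names. The bin width is held at 10 while the start of the first bin is shifted.

```python
import math
import random

def bin_counts(values, width, start):
    """Count values in equal-width bins whose first bin begins at `start`."""
    n_bins = math.floor((max(values) - start) / width) + 1
    counts = [0] * n_bins
    for v in values:
        counts[math.floor((v - start) / width)] += 1
    return counts

def text_histogram(counts, width, start):
    """Render counts as rows of '#' characters, one row per bin."""
    rows = []
    for k, c in enumerate(counts):
        low = start + k * width
        rows.append(f"{low:6.1f} to {low + width:6.1f} | {'#' * c}")
    return "\n".join(rows)

random.seed(7)
sample = [round(random.gauss(60, 15)) for _ in range(25)]  # made-up small data set
width = 10
base = math.floor(min(sample) / width) * width  # a bin boundary at or below the minimum

# Same data, same bin width; only the start of the first bin changes.
for shift in (0, 2.5, 5, 7.5):
    start = base - shift
    print(f"first bin starts at {start}")
    print(text_histogram(bin_counts(sample, width, start), width, start))
    print()
```

Even the three values 10, 19 and 20 show the effect: with bins starting at 10 the counts are 2 and 1, but with bins starting at 5 they are 1 and 2.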

Maths test mark data

The histogram below shows the 25 maths test marks that we examined earlier.

Use the buttons under the histogram to adjust the bin width and to shift the histogram bins to the left or right. Note that the marks appear to split into clusters in only some of the histograms.

Are the clusters real, or are they just an artifact of our choice of bins? Without further supporting evidence, the clusters are not pronounced enough for us to conclude that the students form two meaningful groups. However, they do give an indication of clustering that a good 'data detective' would investigate further.

Dot plots should be used in preference to histograms for small data sets. They show the size of the data set more clearly and hence give some warning about the risk of over-interpretation.

Histograms of larger data sets are more representative

For large data sets, changes to the bins have less effect on the histogram shape — we would sketch a similar smooth 'canopy' over most resulting histograms. Since they provide a much less cluttered display of the data than dot plots or stem and leaf plots, histograms are good summaries of the distribution of values in a large data set.

Finally, for large data sets the shape of the histogram also varies less from one data set to another when the data sets are measured from the same underlying process.
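A rough simulation illustrates this. Everything in the sketch below is an assumption made for illustration: the 'underlying process' is taken to be a normal distribution with mean 60 and standard deviation 15, clipped to the range 0 to 100, and the measure of how much two histograms differ (the largest gap between matching bin proportions) is one simple choice among many. The function names are hypothetical.

```python
import random

def draw_marks(n, rng):
    """Hypothetical underlying process: marks ~ Normal(60, 15), clipped to 0..100."""
    return [min(100.0, max(0.0, rng.gauss(60, 15))) for _ in range(n)]

def bin_proportions(values, width=10, low=0, high=100):
    """Proportion of values in each equal-width bin between `low` and `high`."""
    n_bins = int((high - low) / width)
    counts = [0] * n_bins
    for v in values:
        k = min(int((v - low) // width), n_bins - 1)  # a value equal to `high` goes in the last bin
        counts[k] += 1
    return [c / len(values) for c in counts]

def mean_shape_difference(n, reps, rng):
    """Average, over `reps` pairs of samples of size n, of the largest
    gap between the two samples' bin proportions."""
    total = 0.0
    for _ in range(reps):
        p = bin_proportions(draw_marks(n, rng))
        q = bin_proportions(draw_marks(n, rng))
        total += max(abs(a - b) for a, b in zip(p, q))
    return total / reps

rng = random.Random(0)
for n in (25, 300):
    print(f"n = {n:>3}: mean largest gap in bin proportions = "
          f"{mean_shape_difference(n, 200, rng):.3f}")
```

Under these assumptions, the gap for samples of 300 should come out clearly smaller than for samples of 25, which is the sense in which histograms of larger data sets are more stable.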

The histogram below shows the distribution of 300 marks.

Click the button Sample under the histogram to observe the distribution of another 300 marks recorded from similar students. Repeat several times and observe that although details of the distribution's shape vary, the following features are visible in most sample histograms:

Use the buttons under the histogram to adjust the bin width and shift the bins left or right, and observe that the above features persist.