Clusters
If a dot plot, stem and leaf plot or histogram separates into two or more groups of values (clusters), this suggests that the 'individuals' from which the data were recorded may similarly be split into two or more groups. Further investigation might reveal that the clusters correspond to ...
Detecting the cause of differences between the groups may lead to valuable insights into the data. For example, if the data are yields of corn, the clusters may correspond to two varieties, one of which gives a higher yield than the other. Growing only the higher-yielding variety would then improve yields.
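As a rough numerical companion to looking at a plot, the sketch below splits a batch of values at the widest gap between neighbouring sorted values. The function name `split_at_largest_gap` and the yield values are hypothetical and invented for illustration; they do not come from any real data set. Every data set has a widest gap, so a split found this way is only a prompt to look at the plot, not evidence of clusters in itself.

```python
def split_at_largest_gap(values):
    """Split values into two groups at the widest gap between sorted neighbours."""
    ordered = sorted(values)
    if len(ordered) < 2:
        raise ValueError("need at least two values")
    # Gap between each value and the next one in sorted order.
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    # Index of the widest gap; the split falls just after this position.
    i = max(range(len(gaps)), key=gaps.__getitem__)
    return ordered[: i + 1], ordered[i + 1 :], gaps[i]


# Hypothetical corn yields (illustrative values only, not real data).
yields = [4.1, 7.0, 4.4, 6.8, 4.3, 7.3, 4.6, 7.1]
low, high, gap = split_at_largest_gap(yields)
print("lower group:", low)
print("upper group:", high)
print("widest gap:", round(gap, 2))
```

If the widest gap is large compared with the spacing of the values inside each group, as it is for these invented values, the dot plot would show two clearly separated clusters.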
Eruptions of Old Faithful geyser
Old Faithful is a geyser in Yellowstone National Park in the USA that is known for its regular eruptions. Volunteers collected information about all eruptions in October 1980 (except for those from midnight to 6 am). The dot plot below shows the durations of these eruptions.
The eruption durations form two distinct clusters, so there seem to be two different types of eruption. What other characteristics of the eruptions are different between the two types?
The next dot plot shows the distribution of the intervals between successive eruptions. Again, there are two clusters, though not quite as distinct.
Are the same eruptions in the same clusters for both variables? Are successive eruptions in the same or different clusters? (More advanced statistical methods are needed to answer these questions.)
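A first descriptive look at the second question can be sketched as follows: label each eruption by its cluster, then count how often each type follows each other type. The durations, the cutoff of 3 minutes and the function names are all assumptions made for illustration; the values are invented and say nothing about the real geyser. This is only a tabulation, not one of the more advanced methods mentioned above.

```python
from collections import Counter


def label_clusters(values, cutoff):
    """Label each value 'short' or 'long' relative to an assumed cutoff."""
    return ["short" if v < cutoff else "long" for v in values]


def transition_counts(labels):
    """Count (current, next) label pairs for successive observations."""
    return Counter(zip(labels, labels[1:]))


# Hypothetical durations in minutes, in time order (illustrative only).
durations = [4.4, 1.9, 4.2, 2.0, 4.5, 4.1, 1.8, 4.6]
CUTOFF = 3.0  # assumed boundary between the two clusters

labels = label_clusters(durations, CUTOFF)
counts = transition_counts(labels)
for (current, following), n in sorted(counts.items()):
    print(f"{current} -> {following}: {n}")
```

The same `Counter(zip(...))` idea applied to the cluster labels of two variables, such as duration and interval, gives a two-way table showing whether the same eruptions fall in the same clusters for both.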
Discovery of clusters is important information that should lead to further research.
Steel Works Slag
The stem and leaf plot on the right describes the percentage of FeO in the slag that remained after 20 batches of iron ore had been smelted. There is considerable variation in the percentage of FeO remaining, with values ranging from about 1% to 8%.
There appears to be a low-density gap in the distribution between 3% and 5%, suggesting that the samples may be split into two separate clusters.
Although this is only a small data set and the clusters are not well separated, they should be further investigated.
Further investigation of these data revealed that the two clusters of values corresponded largely to ore from two different sources.
It is misleading to examine all the data together; we should display the data from the two sources separately and contrast them.
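Treating the sources separately can be sketched as grouping the values by source before summarising them. The source labels "A" and "B", the percentages and the function name `summarise_by_group` are hypothetical; the 20 slag measurements themselves are not reproduced here.

```python
from collections import defaultdict
from statistics import mean


def summarise_by_group(records):
    """Return count, minimum, maximum and mean of the values in each group."""
    groups = defaultdict(list)
    for source, value in records:
        groups[source].append(value)
    return {
        source: {
            "n": len(values),
            "min": min(values),
            "max": max(values),
            "mean": mean(values),
        }
        for source, values in groups.items()
    }


# Hypothetical (ore source, % FeO) pairs; illustrative values only.
records = [("A", 1.5), ("B", 6.0), ("A", 2.0), ("B", 5.0), ("A", 2.5), ("B", 7.0)]

summary = summarise_by_group(records)
for source in sorted(summary):
    print(source, summary[source])
```

A single summary of the pooled values would describe neither source well, which is why the per-source summaries (or two separate plots on a common scale) are the more useful display.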