Clusters
If a dot plot, stem and leaf plot or histogram separates into two or more groups of values (clusters), this suggests that the 'individuals' from which the data were recorded may similarly be split into two or more groups. Further investigation might reveal that the clusters correspond to ...
Detecting the cause of differences between the groups may lead to valuable insights into the data. For example, if the data are yields of corn from two varieties, one variety may give a higher yield than the other. Growing only the higher-yielding variety would then improve yields.
Eruptions of Old Faithful geyser
Old Faithful is a geyser in Yellowstone National Park in the USA that is known for its regular eruptions. Volunteers collected information about all eruptions in October 1980 (except for those between midnight and 6 am). The dot plot below shows the durations of these eruptions.
The eruption durations form two distinct clusters, so there seem to be two different types of eruption. What other characteristics of the eruptions are different between the two types?
The next dot plot shows the distribution of the intervals between successive eruptions. Again, there are two clusters, though not quite as distinct.
Are the same eruptions in the same clusters for both variables? Are successive eruptions in the same or different clusters? (More advanced statistical methods are needed to answer these questions.)
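As a first, informal look at the first question, each eruption could be labelled by the cluster it falls into for each variable and the combinations counted. The sketch below is only an illustration and does not replace the more advanced methods mentioned above. The eruption values, the cutoffs (3.0 minutes and 65 minutes), the pairing of each duration with one interval, and the helper names `label` and `cross_tabulate` are all assumptions made for the example; they are not taken from the Old Faithful data.

```python
from collections import Counter


def label(value, cutoff):
    """Assign a value to the 'low' or 'high' cluster using a single cutoff."""
    return "low" if value < cutoff else "high"


def cross_tabulate(pairs, duration_cutoff, interval_cutoff):
    """Count eruptions in each (duration cluster, interval cluster) combination."""
    return Counter(
        (label(duration, duration_cutoff), label(interval, interval_cutoff))
        for duration, interval in pairs
    )


# Invented (duration in minutes, interval in minutes) pairs, for illustration only.
eruptions = [(1.8, 52), (4.5, 80), (2.0, 55), (4.3, 78), (4.7, 84), (1.9, 50)]

# Hypothetical cutoffs chosen by eye to sit in the gap between the clusters.
table = cross_tabulate(eruptions, duration_cutoff=3.0, interval_cutoff=65)

for combination, count in sorted(table.items()):
    print(combination, count)
```

If almost all of the counts fall in the ('low', 'low') and ('high', 'high') cells, as they do for these invented values, the same eruptions are in corresponding clusters for both variables.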
Discovery of clusters is important information that should lead to further research.
Rain days
The stem and leaf plot on the right describes the number of rainy days each year for 20 years in a village. The data vary considerably, from 53 days to 119 days.
There appears to be a low-density gap in the distribution between 75 and 90 days, suggesting that the years may be split into two separate clusters.
Although this is only a small data set and the clusters are not well separated, they should be further investigated.
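One simple way to make the idea of a gap concrete is to sort the values and find the widest space between neighbouring values. The following is a minimal sketch under stated assumptions: the 20 values in `rain_days` are invented to resemble the description above (minimum 53, maximum 119, nothing between 75 and 90) and are not the values in the stem and leaf plot, and the function names `largest_gap` and `split_at_gap` are hypothetical.

```python
def largest_gap(values):
    """Return (gap, low, high) for the widest gap between neighbouring sorted values."""
    ordered = sorted(values)
    if len(ordered) < 2:
        raise ValueError("need at least two values")
    # Compare each value with its successor and keep the widest difference.
    return max((b - a, a, b) for a, b in zip(ordered, ordered[1:]))


def split_at_gap(values):
    """Split the values into the groups below and above the widest gap."""
    _, low, high = largest_gap(values)
    lower = [v for v in values if v <= low]
    upper = [v for v in values if v >= high]
    return lower, upper


# Invented rainy-day counts for 20 years, for illustration only.
rain_days = [53, 57, 60, 62, 64, 66, 68, 70, 72, 75,
             90, 93, 96, 99, 102, 105, 108, 112, 116, 119]

gap, low, high = largest_gap(rain_days)
lower, upper = split_at_gap(rain_days)
print(f"Widest gap: {gap} days, between {low} and {high}")
print(f"{len(lower)} years at or below {low}, {len(upper)} years at or above {high}")
```

A wide gap found this way is only a prompt for further investigation. With a small data set, a gap of this size can also arise by chance.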
The data collector should look for other systematic differences between the clusters. Perhaps the clusters correspond to El Niño and La Niña years, or perhaps two people from different parts of the village recorded the data in different years and classified 'rainy days' differently. Information about clustering is often of great importance to the data analyst.
If the two clusters were found to correspond to different people recording the data, it would be misleading to examine all the data together.