Clusters

If a dot plot, stem and leaf plot or histogram separates into two or more groups of values (clusters), this suggests that the 'individuals' from which the data were recorded may similarly be split into two or more groups. Further investigation might reveal that the clusters correspond to ...

Detecting the cause of differences between the groups may lead to valuable insights into the data. For example, in a milk bottling plant it may be found that the evening shift has a lower quality output than the other shifts. Changes to this shift could potentially improve overall quality.

Eruptions of Old Faithful geyser

Old Faithful is a geyser in Yellowstone National Park in the USA that is known for its regular eruptions. Volunteers recorded every eruption in October 1980 (except for those between midnight and 6 am). The dot plot below shows the durations of these eruptions.

The eruption durations form two distinct clusters, so there seem to be two different types of eruption. What other characteristics of the eruptions are different between the two types?

The next dot plot shows the distribution of the intervals between successive eruptions. Again, there are two clusters, though not quite as distinct.

Are the same eruptions in the same clusters for both variables? Are successive eruptions in the same or different clusters? (More advanced statistical methods are needed to answer these questions.)
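A first, informal look at the second question is to label each eruption by cluster and count how often each label follows each other label. The Python sketch below does this. The durations are made-up values for illustration only (not the October 1980 recordings), the 3-minute threshold is an assumption, and the function names are hypothetical. It does not replace the more advanced methods mentioned above.

```python
from collections import Counter


def label_clusters(values, threshold):
    """Label each value 'low' or 'high' relative to a chosen threshold."""
    return ["low" if v < threshold else "high" for v in values]


def transition_counts(labels):
    """Count how often each label is immediately followed by each label."""
    return Counter(zip(labels, labels[1:]))


# Made-up eruption durations (minutes) in time order, for illustration only.
durations = [4.4, 1.9, 4.6, 2.0, 4.3, 4.5, 1.8, 4.7]

# The threshold of 3.0 minutes is an assumed split point between clusters.
labels = label_clusters(durations, threshold=3.0)
counts = transition_counts(labels)

for (current, following), n in sorted(counts.items()):
    print(f"{current} -> {following}: {n}")
```

If successive eruptions tended to fall in different clusters, most of the counts would lie in the low-to-high and high-to-low pairs.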

The discovery of clusters is important information and should prompt further research.

Accounting Software Support data

The stem and leaf plot on the right shows the times (in minutes) taken by the technical support staff at a software company to respond to queries about an expensive accounting program during one week. There is considerable variation in the times, which range from about 10 minutes to 80 minutes.

There appears to be a low-density gap in the distribution between 30 and 50 minutes, suggesting that the queries may be split into two separate clusters.

Although this is only a small data set and the clusters are not well separated, they should be further investigated.
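One crude way to look for such a gap numerically is to sort the values and split them at the widest space between neighbours. The Python sketch below illustrates the idea. The response times are made-up values (not those in the stem and leaf plot) and the function name is hypothetical. When the gap is a low-density region rather than an empty one, this heuristic is only a rough guide, and the plot remains the better tool.

```python
def split_at_largest_gap(values):
    """Split values into two groups at the widest gap between sorted neighbours."""
    ordered = sorted(values)
    if len(ordered) < 2:
        raise ValueError("need at least two values to look for a gap")
    # Difference between each value and the next larger one.
    gaps = [hi - lo for lo, hi in zip(ordered, ordered[1:])]
    # Position of the widest gap; the split falls just after this position.
    cut = max(range(len(gaps)), key=gaps.__getitem__)
    return ordered[:cut + 1], ordered[cut + 1:]


# Made-up response times (minutes), for illustration only.
times = [12, 15, 18, 21, 24, 27, 29, 52, 55, 61, 64, 70, 78]

low, high = split_at_largest_gap(times)
print("lower cluster:", low)
print("upper cluster:", high)
```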

Further investigation of the data revealed that the two clusters of values corresponded largely to two different types of query: questions from new users (involving initial installation and setup) took much less time to answer than questions from experienced users, which often required accounting knowledge.

It is misleading to examine all the data together; the data from the two types of query should be displayed separately and contrasted.
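As a minimal sketch of treating the two types of query separately, the Python code below groups records by query type and summarises each group on its own. The records are made-up values for illustration only, and the group labels and function name are assumptions rather than part of the original data set.

```python
from statistics import median


def summarise_by_group(records):
    """Return count, minimum, median and maximum for each group label."""
    groups = {}
    for group, value in records:
        groups.setdefault(group, []).append(value)
    return {
        group: {
            "n": len(values),
            "min": min(values),
            "median": median(values),
            "max": max(values),
        }
        for group, values in groups.items()
    }


# Made-up (query type, response time in minutes) pairs, for illustration only.
records = [
    ("new user", 12), ("new user", 18), ("new user", 25),
    ("experienced user", 35), ("experienced user", 52),
    ("experienced user", 61), ("experienced user", 74),
]

summary = summarise_by_group(records)
for group, stats in summary.items():
    print(group, stats)
```

Comparing the two summaries side by side shows the contrast that a single summary of the pooled times would hide.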