In small data sets, features of displays may not be meaningful
Be careful not to overinterpret patterns in small data sets. Clusters, outliers or skewness may appear by chance even if there is no meaningful basis to these features.
Random data
To investigate this further, we will examine some samples of 50 values from a homogeneous process with no separate sub-groups or clusters.
The stem and leaf plot on the left above describes 50 values from this process. Do you think that there are clusters or outliers?
Click the "Another sample" button several times to examine other samples. Even though the sample size is not particularly small, there is surprising variability in the shape of the distribution. By chance, gaps occasionally appear, and some values are occasionally separated from the rest and look like outliers.
Look at several samples and click "Remember" to retain the data set that gives the strongest appearance of separating into two clusters. Then do the same, retaining the data set that looks most likely to contain an outlier.
In this example, we know that these features in the samples do not reflect real clusters or outliers in the underlying process.
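If the interactive display is not available, the same experiment can be approximated in a few lines of Python. The sketch below is only illustrative: the text does not say which distribution the samples come from, so a normal distribution with mean 50 and standard deviation 10 is assumed here, and the function names stem_and_leaf and another_sample are hypothetical. Values are rounded to whole numbers, with tens as stems and units as leaves.

```python
import random
from collections import defaultdict


def stem_and_leaf(values):
    """Return the lines of a stem and leaf plot (stems = tens, leaves = units).

    Values are rounded to integers; this simple helper assumes they are
    all non-negative.
    """
    rounded = sorted(round(x) for x in values)
    if rounded and rounded[0] < 0:
        raise ValueError("this sketch only handles non-negative values")
    groups = defaultdict(list)
    for v in rounded:
        stem, leaf = divmod(v, 10)
        groups[stem].append(leaf)
    lines = []
    # Include empty stems so that gaps in the data remain visible.
    for stem in range(min(groups), max(groups) + 1):
        leaves = "".join(str(leaf) for leaf in groups.get(stem, []))
        lines.append(f"{stem:>3} | {leaves}".rstrip())
    return lines


def another_sample(rng, n=50, mean=50.0, sd=10.0):
    """Draw n values from one homogeneous process (assumed normal here)."""
    return [rng.gauss(mean, sd) for _ in range(n)]


if __name__ == "__main__":
    rng = random.Random()  # pass a seed, e.g. random.Random(1), to repeat a run
    for i in range(3):
        print(f"Sample {i + 1}")
        print("\n".join(stem_and_leaf(another_sample(rng))))
        print()
```

Running the script several times plays the role of the "Another sample" button: every sample comes from the same single distribution, so any gap or isolated value in the output has arisen by chance.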
Steel works slag
In steel works, iron ore is smelted to extract as much iron as possible, but some iron remains in the waste from the process (slag) in the form of iron oxide (FeO). The stem and leaf plot below shows the percentage of FeO in slag sampled from 20 batches of iron ore.
The display seems to split into two clusters. However, without outside supporting evidence, you should not conclude that a gap such as this must correspond to a meaningful grouping of the iron ore batches into two clusters, since the appearance of clusters may be caused only by the randomness of the data.
If outliers or clusters are pronounced, they may be taken as indicative of something meaningful in the underlying process. However, less pronounced outliers or clusters must be supported by outside evidence before these features can be interpreted as meaningful.
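One rough way to judge how easily a gap can arise by chance is to simulate many samples of 20 values from a single homogeneous distribution and count how often a large gap appears. The sketch below does not use the slag data. It assumes a standard normal distribution, and the measure of a "large gap" (the largest gap between adjacent sorted values as a fraction of the range, compared with an arbitrary threshold of 0.25) is a hypothetical choice made only for illustration, as are the function names.

```python
import random


def largest_gap_fraction(values):
    """Largest gap between adjacent sorted values, as a fraction of the range."""
    ordered = sorted(values)
    spread = ordered[-1] - ordered[0]
    if spread == 0:
        return 0.0  # all values equal, so there is no gap
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    return max(gaps) / spread


def proportion_with_gap(n=20, threshold=0.25, trials=10_000, seed=0):
    """Proportion of homogeneous samples whose largest gap reaches the threshold.

    Samples are drawn from a standard normal distribution (an assumption).
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if largest_gap_fraction(sample) >= threshold:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    p = proportion_with_gap()
    print(f"Proportion of samples of 20 with a gap of at least 25% of the range: {p:.3f}")
```

The largest gap may lie between two groups of values (suggesting clusters) or between one extreme value and the rest (suggesting an outlier), so this measure does not distinguish the two. If the proportion is not small, a gap of that size in a real sample of 20 is weak evidence of a real grouping on its own.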