In small data sets, features of displays may not be meaningful

Be careful not to overinterpret patterns in small data sets. Clusters, outliers, or skewness may appear by chance even if there is no meaningful basis for these features.

For example, without outside supporting evidence, you should not conclude that a gap such as that in the maths test marks dataset must correspond to a meaningful grouping of the students into two clusters.

Class marks are usually fairly small data sets of between 20 and 30 values, so this warning is particularly important.

Random marks

To investigate this further, we will randomly generate 25 marks (each out of 100) for a group of students in which there are no separate sub-groups or clusters.

The stem and leaf plot on the left describes the 25 marks that have been obtained. Click the button New Sample several times to examine other sets of marks. By chance, there are occasionally gaps in the distribution and occasionally outliers.
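The same experiment can be tried outside the interactive display. The sketch below is only an illustration: the text does not say how the marks are generated, so it assumes a single normal population with mean 60 and standard deviation 15, rounded to whole marks and clipped to the range 0 to 100. The function names `random_marks` and `stem_and_leaf` are hypothetical.

```python
import random
from collections import defaultdict


def random_marks(n=25, mean=60.0, sd=15.0, rng=None):
    """Draw n whole-number marks from ONE population (no sub-groups).

    Assumption: normal distribution, clipped to the range 0..100.
    """
    rng = rng or random.Random()
    return sorted(min(100, max(0, round(rng.gauss(mean, sd)))) for _ in range(n))


def stem_and_leaf(marks):
    """Return the lines of a stem-and-leaf plot (stems = tens, leaves = units)."""
    leaves = defaultdict(list)
    for m in sorted(marks):
        leaves[m // 10].append(m % 10)
    lowest, highest = min(leaves), max(leaves)
    # Empty stems are printed too, so that gaps remain visible.
    return [
        f"{stem:2d} | {''.join(str(d) for d in leaves.get(stem, []))}"
        for stem in range(lowest, highest + 1)
    ]


# Three different samples from the same population; each seed plays the
# role of one click on "New Sample".
for seed in range(3):
    print(f"Sample {seed + 1}")
    print("\n".join(stem_and_leaf(random_marks(rng=random.Random(seed)))))
    print()
```

Running this with different seeds gives different plots. Any gap or isolated value that shows up comes from sampling variation alone, because every mark is drawn from the same distribution.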

Because of how the values have been obtained, we know that these features in the samples are only random artifacts; they do not reflect real clusters or outliers in the underlying process.
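To get a rough feel for how often such artifacts arise, many samples can be generated and the proportion containing a gap counted. This is a minimal sketch under assumptions not given in the text: marks come from a single normal population (mean 60, standard deviation 15, clipped to 0 to 100), and a "gap" is taken to mean an empty tens-stem lying between the lowest and highest marks. The names `has_gap` and `gap_rate` are hypothetical.

```python
import random


def has_gap(marks):
    """True if a tens-stem between the lowest and highest stems has no marks."""
    stems = {m // 10 for m in marks}
    return any(s not in stems for s in range(min(stems), max(stems) + 1))


def gap_rate(trials=10_000, n=25, mean=60.0, sd=15.0, seed=0):
    """Estimate the proportion of samples of size n that show a gap.

    Assumption: every mark is drawn from one normal population,
    rounded to a whole number and clipped to 0..100.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        marks = [min(100, max(0, round(rng.gauss(mean, sd)))) for _ in range(n)]
        hits += has_gap(marks)
    return hits / trials


print(f"Proportion of samples with at least one empty stem: {gap_rate():.2f}")
```

The proportion depends on the assumed distribution and on how a gap is defined, so the figure is only indicative. The point is that it is not zero even though the population contains no clusters.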

 
If outliers or clusters are pronounced, they may be taken as indicative of something meaningful in the underlying process. However, less pronounced outliers or clusters must be supported by outside evidence before they can be interpreted as meaningful.