In small data sets, features must be very prominent to be called outliers or clusters
We have described some information that may be read from a scatterplot. But how strong must the corresponding patterns be before we should report them?
In both univariate and bivariate data sets, outliers or clusters must be very distinct before we should conclude that they are real, in the absence of further external information confirming that the individuals are distinct.
Particularly in small data sets, outliers, clusters and other patterns may arise by chance, without being associated with any real features in the individuals.
Be careful not to overinterpret features in a scatterplot unless they are well defined, especially if the sample size is small.
Accuracy of warehouse inventory
Warehouses keep computer records of their inventory, but these are often inaccurate due to theft and paperwork errors. It was thought that the errors in these computer records might be related to the 'activity' of the warehouse, that is, the volume of goods added to and removed from the warehouse daily.
The scatterplot below shows, for each of 400 warehouses, the percentage error in the computer records (found from a complete inventory) plotted against an 'activity score'. There is a positive relationship, but no other significant features.
It is expensive to check the inventory manually in all 400 warehouses, so what would have happened if only 20 had been examined? Click the button Take sample to see the scatterplot for a sample of 20 warehouses. Click the button several more times and observe the variability in the scatterplots. Although there are no outliers or clusters in the complete data set, the scatterplot of a sample occasionally gives a false suggestion of an outlier, clusters or even a curved relationship.
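The same experiment can be imitated without the interactive diagram. The sketch below is only an illustration: it does not use the real warehouse data, and the population it simulates (activity scores uniform between 0 and 10, percentage error rising linearly with activity plus normal noise) is an assumption chosen to have a positive relationship and no genuine outliers or clusters. The function names and the two summaries reported for each sample (the correlation and the largest standardized residual from a least-squares line) are likewise choices made for this example.

```python
import random


def correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / (sxx * syy) ** 0.5


def largest_standardized_residual(xs, ys):
    """Largest absolute residual from a least-squares line, in residual-SD units."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    residual_sd = (sum(r * r for r in residuals) / (n - 2)) ** 0.5
    return max(abs(r) for r in residuals) / residual_sd


# Assumed population: 400 simulated warehouses with a positive linear
# relationship and no real outliers or clusters (NOT the original data).
rng = random.Random(2024)
population = []
for _ in range(400):
    activity = rng.uniform(0, 10)
    pct_error = 2.0 + 0.4 * activity + rng.gauss(0, 1.0)
    population.append((activity, pct_error))

pop_x, pop_y = zip(*population)
population_r = correlation(pop_x, pop_y)
print(f"All 400 warehouses: r = {population_r:.2f}")

# Repeatedly 'take a sample' of 20 warehouses and summarize each one.
summaries = []
for i in range(10):
    sample = rng.sample(population, 20)
    xs, ys = zip(*sample)
    r = correlation(xs, ys)
    biggest = largest_standardized_residual(xs, ys)
    summaries.append((r, biggest))
    print(f"Sample {i + 1:2d}: r = {r:.2f}, largest standardized residual = {biggest:.2f}")
```

Running it shows how much the summaries change from one sample of 20 to the next, even though every sample comes from the same well-behaved population. A sample whose largest standardized residual looks big corresponds to a scatterplot in which one warehouse might be mistaken for an outlier.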