The shape of a distribution
Many different distributions have the same mean and standard deviation.
The mean and standard deviation hold no information about the shape of a distribution, other than its centre and spread.
In particular, the mean and standard deviation give no indication about whether
a data set contains:
- two or more clusters
- an outlier
- a skew distribution with a long tail to one side
These are important features of a data set and should influence the analysis
that you perform and the conclusions that you reach. If you ignore outliers
or clusters, you could easily reach the wrong conclusions.
It is therefore essential
that you look at the distribution with a dot plot, histogram or box plot
before 'condensing' the data into a mean and standard deviation for further
analysis.
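The point that very different shapes can share a mean and standard deviation can be checked with a minimal sketch. The helper `rescale` and the three small data sets are hypothetical illustrations, not the data sets discussed in this section; the sketch assumes that s is the sample standard deviation (divisor n - 1).

```python
import statistics

def rescale(values, target_mean, target_sd):
    """Linearly rescale values so their sample mean and sd match the targets."""
    m = statistics.mean(values)
    s = statistics.stdev(values)  # sample standard deviation (n - 1 divisor)
    return [target_mean + target_sd * (v - m) / s for v in values]

# Three hypothetical shapes (illustrative only, not the data from the text)
symmetric = [1, 2, 3, 4, 5, 6, 7, 8, 9]
clustered = [1, 1, 2, 2, 2, 8, 8, 8, 9]
skewed = [1, 1, 1, 2, 2, 3, 4, 6, 12]

for name, data in [("symmetric", symmetric),
                   ("clustered", clustered),
                   ("skewed", skewed)]:
    scaled = rescale(data, 248.5, 91.1)
    # Each data set now has the same two summary statistics,
    # although the shapes are still completely different.
    print(name,
          round(statistics.mean(scaled), 1),
          round(statistics.stdev(scaled), 1))
```

A linear rescaling changes the centre and spread but not the shape, so any data set can be given the same mean and standard deviation as any other.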
Distributions with the same mean and standard deviation
The following four data sets all contain the same number of values, n = 100,
and have the same mean, x̄ = 248.5, and standard deviation, s = 91.1, but
should be analysed in different ways.
- Symmetric bell-shaped distribution
This data set has a distribution whose shape is what would be imagined
from the mean and standard deviation. Its shape is well described by these
two summary statistics.
- Outlier
This data set contains an outlier. It is probably either a measurement or
recording error, or the 'individual' differs in some other way from the rest
of the data and should not be analysed with them.
- After deleting the outlier, the mean falls from 248.5 to 241.4 and
the standard deviation drops from 91.1 to 57.6. The measurements are therefore
much less variable than the raw standard deviation suggests.
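The quoted means are enough to work out roughly where the outlier lies, since the sum of all n values minus the sum of the remaining n - 1 values is the deleted value. The sketch below uses only the figures given above; because they are rounded to one decimal place, the result is approximate.

```python
n = 100            # number of values, from the text
mean_all = 248.5   # mean with the outlier included
mean_rest = 241.4  # mean after deleting the outlier
sd_rest = 57.6     # standard deviation after deleting the outlier

# Sum of all n values minus the sum of the remaining n - 1 values
outlier = n * mean_all - (n - 1) * mean_rest

# Distance of the outlier from the remaining values, in standard deviations
z = (outlier - mean_rest) / sd_rest

print(round(outlier, 1))  # 951.4
print(round(z, 1))        # 12.3
```

A single value about 12 standard deviations above the rest is enough to raise the standard deviation of the whole data set from 57.6 to 91.1.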
- Clusters
In this data set, the values separate into two distinct clusters. The
researcher should investigate what is different about the 'individuals'
in the two clusters. For example, annual rainfalls may have been recorded
in two types of years (e.g. La Niña and El Niño), or two different varieties
of maize may have been grown in a survey of crop yields.
- The two clusters have different means and the standard deviation within
each cluster is much smaller than 91.1, so again, the overall mean and
standard deviation do not adequately describe the data.
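The gap between the overall and within-cluster standard deviations is easy to reproduce. The following is a small sketch with made-up values (the lists `low` and `high` are hypothetical, not the data set described above), again assuming the sample standard deviation.

```python
import statistics

# Hypothetical clustered data (illustrative only)
low = [140, 150, 155, 160, 165, 170, 180]
high = [320, 330, 335, 340, 345, 350, 360]
combined = low + high

# The overall sd is dominated by the distance between the cluster means,
# not by the variability of values within each cluster.
print("overall mean:", statistics.mean(combined))
print("overall sd:", round(statistics.stdev(combined), 1))
print("low cluster mean and sd:",
      statistics.mean(low), round(statistics.stdev(low), 1))
print("high cluster mean and sd:",
      statistics.mean(high), round(statistics.stdev(high), 1))
```

Note that the overall mean falls in the gap between the clusters, where there are no values at all.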
- Skew distribution
This data set is skew with a long tail towards the high values. The 70-95-100
rule suggests that about 15% of values are below x̄ - s and 15% above x̄ + s
(and 70% between these values), but this distribution has no values lower
than x̄ - s, but 14% are above x̄ + s, 6% are above x̄ + 2s and 2% are
above x̄ + 3s.
- The 70-95-100 rule therefore gives a poor impression of this distribution;
its percentages are only approximately correct for fairly symmetric,
bell-shaped distributions.
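A comparison of this kind can be made for any data set by counting the values in each tail. The sketch below is one way to do it; `tail_fractions` is a hypothetical helper, the data are made up, and s is assumed to be the sample standard deviation.

```python
import statistics

def tail_fractions(values, k=1):
    """Return the fractions of values below mean - k*s and above mean + k*s."""
    m = statistics.mean(values)
    s = statistics.stdev(values)  # sample standard deviation (n - 1 divisor)
    n = len(values)
    below = sum(v < m - k * s for v in values) / n
    above = sum(v > m + k * s for v in values) / n
    return below, above

# Hypothetical right-skewed data (illustrative only)
skewed = [1, 2, 3, 4, 100]

for k in (1, 2, 3):
    below, above = tail_fractions(skewed, k)
    print(f"k = {k}: {below:.0%} below mean - {k}s, {above:.0%} above mean + {k}s")
```

For a fairly symmetric, bell-shaped data set the two fractions for k = 1 should both be near 15%; a large imbalance between them is a sign of skewness.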
In the presence of an outlier, clusters
or skewness, the mean and standard deviation fail to capture an important
aspect of the distribution's shape. They are particularly misleading
in the presence of outliers or clusters.
The diagram below shows the four distributions together as histograms to
make comparison easier.
A histogram or dot plot is needed to describe the clustered distribution,
but a box plot would capture the main features of the skew distribution
and of the distribution with an outlier.
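To see why a box plot can miss clusters, compare the five numbers that a box plot is drawn from with a crude text histogram of the same values. This is only a sketch: the data and the helpers `five_number_summary` and `text_histogram` are hypothetical, and the quartiles use the default method of Python's `statistics.quantiles`, which may differ slightly from other quartile definitions.

```python
import statistics
from collections import Counter

def five_number_summary(values):
    """Minimum, lower quartile, median, upper quartile and maximum."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return min(values), q1, median, q3, max(values)

def text_histogram(values, bin_width):
    """Return one line per bin, with one star per value (integer data only)."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    lines = []
    for start in range(min(counts), max(counts) + bin_width, bin_width):
        lines.append(f"{start:>4} | " + "*" * counts.get(start, 0))
    return lines

# Hypothetical clustered data (illustrative only)
data = [140, 150, 155, 160, 165, 170, 180,
        320, 330, 335, 340, 345, 350, 360]

print(five_number_summary(data))
print("\n".join(text_histogram(data, 50)))
```

The five-number summary gives no hint that there are no values between 180 and 320, and its median lies inside that gap, whereas the empty bins in the histogram make the two clusters obvious.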