Meaningful information can be obtained from variation in the values of a variable.
A dot plot displays each value as a cross along a numerical axis.
Jittering is a modification to the basic dot plot that avoids some problems associated with overlapping crosses.
Stacking of the crosses is an alternative to jittering that highlights ranges of high or low density.
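As a minimal sketch of the two ideas above, the following code computes plotting positions for a jittered and a stacked dot plot. The function names, the jitter amount and the data are illustrative assumptions, not taken from the text.

```python
import random
from collections import Counter

def jitter(values, amount=0.4, seed=0):
    """Pair each value with a random vertical offset in [-amount, amount]."""
    rng = random.Random(seed)  # fixed seed so the layout is reproducible
    return [(x, rng.uniform(-amount, amount)) for x in values]

def stack(values):
    """Pair each value with its height in the stack of equal values."""
    seen = Counter()
    points = []
    for x in values:
        seen[x] += 1
        points.append((x, seen[x]))
    return points

# Hypothetical data, rounded so that several values coincide.
values = [3, 5, 3, 3, 5, 8]
print(jitter(values))
print(stack(values))
```

In the stacked version the largest height at any position is the number of values there, which is why the stacks show density directly.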
Stem and leaf plots are similar to stacked dot plots, but a digit is used instead of a cross to retain extra information.
To increase the flexibility of the display, each stem may be repeated either 2 or 5 times, increasing the number of classes in the basic stem and leaf plot by a factor of 2 or 5.
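The sketch below builds a text stem and leaf plot and shows the effect of repeating each stem. It assumes non-negative whole numbers with the tens digit as stem and the units digit as leaf; the function name and the data are hypothetical.

```python
from collections import defaultdict

def stem_and_leaf(values, splits=1):
    """Return the rows of a stem and leaf plot as strings.

    Each stem is repeated `splits` times (1, 2 or 5), so each row
    holds 10, 5 or 2 of the possible leaf digits.
    """
    if splits not in (1, 2, 5):
        raise ValueError("splits must be 1, 2 or 5")
    leaves_per_row = 10 // splits
    rows = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        rows[(stem, leaf // leaves_per_row)].append(leaf)
    stem, part = min(rows)
    last = max(rows)
    lines = []
    while (stem, part) <= last:
        # Empty rows are kept so that gaps in the data stay visible.
        leaves = "".join(str(d) for d in rows.get((stem, part), []))
        lines.append(f"{stem:>3} | {leaves}")
        part += 1
        if part == splits:
            stem, part = stem + 1, 0
    return lines

# Hypothetical data.
values = [12, 15, 21, 23, 23, 28, 34, 47]
print("\n".join(stem_and_leaf(values)))
print("\n".join(stem_and_leaf(values, splits=2)))
```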
For data analysis, stem and leaf plots are rarely more informative than stacked dot plots, but they are easy to draw by hand.
Do the data contain any outliers -- values that are atypically large or small? The extreme values in a skew distribution are often mistaken for outliers.
Do the data split into separate clusters -- ranges of values with high density separated by ranges with low density? Clusters may correspond to different groups of individuals.
The distribution gives information about a typical value round which the data are spread (the distribution's location or centre) and the variability of the values (the spread of the distribution).
Additional information about the items from which measurements have been made can help us understand the distribution of values in the data.
If we know that the values come from 2 or more groups of individuals, dot plots can be modified to show this extra information.
There is a risk of over-interpreting patterns in small data sets.
The heights of the stacks of crosses in a dot plot describe the density of values.
In a simple histogram, the height of the rectangle above each class on the axis equals the number of values in the class -- the class frequency.
Class width and start-point should be chosen to make the histogram as smooth as possible -- neither too blocky nor too jagged.
The shape of a histogram can be very dependent on the choice of classes if the data set is small; beware of over-interpreting its shape. Stacked dot plots are a better display of small data sets.
In a histogram, the proportion of the total area that is above any class equals the relative frequency of the class.
The vertical axis should be relative frequency, not frequency, when comparing two groups with histograms. Population pyramids are often used to compare age distributions.
If a histogram has varying class widths, the vertical axis must be 'density'. The histogram shape would be misleading if frequency or relative frequency was used for the vertical axis.
The proportion of values in any set of classes always equals the proportion of the total histogram area that is above those classes.
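To make the link between area and relative frequency concrete, this sketch computes frequency, relative frequency and density for classes of unequal width. Density is taken here to be relative frequency divided by class width, so that the total area is one; that scaling, the function name and the data are assumptions for illustration.

```python
def histogram_classes(values, boundaries):
    """Summarise each class [lo, hi) of a histogram.

    Assumes every value lies inside the boundaries and that no value
    falls exactly on a boundary.
    """
    n = len(values)
    classes = []
    for lo, hi in zip(boundaries, boundaries[1:]):
        freq = sum(lo <= v < hi for v in values)
        rel = freq / n
        classes.append({
            "class": (lo, hi),
            "frequency": freq,
            "relative_frequency": rel,
            "density": rel / (hi - lo),  # rectangle height
        })
    return classes

# Hypothetical data with classes of width 5, 5 and 10.
values = [1, 2, 3, 4, 6, 7, 12, 18]
for c in histogram_classes(values, [0, 5, 10, 20]):
    print(c)
```

The last two classes contain the same number of values, but the wider one gets half the height, so both rectangles have the same area.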
Frequency polygons are closely related to histograms but give a less 'blocky' display of density. Different groups can be compared more easily with them.
Kernel density estimates show density in a still smoother display.
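One way to picture a kernel density estimate is as the average of small smooth bumps, one centred on each data value. The sketch below assumes a Gaussian (normal) bump and a bandwidth chosen by hand; the text does not specify either, and the function name and data are hypothetical.

```python
import math

def make_kde(values, bandwidth):
    """Return a function that evaluates a Gaussian kernel density estimate."""
    n = len(values)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values
        )

    return density

# Hypothetical data; a larger bandwidth gives a smoother curve.
values = [2.1, 2.4, 3.0, 3.2, 5.5]
density = make_kde(values, bandwidth=0.5)
for x in [2.0, 3.0, 4.0, 5.5]:
    print(x, round(density(x), 3))
```

As with a histogram on a density scale, the area under the curve is one.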
Histograms are based on frequency tables. Class boundaries should avoid possible data values.
Dot plots, stem and leaf plots and histograms contain detailed information that is distracting when two or more data sets are being compared.
The median and quartiles split a batch of values into four equal-sized sets of values. A box plot is a graphical display of the median, quartiles and extremes.
A box plot clearly shows the centre, spread and skewness of a data set. It splits the corresponding histogram into 4 approximately equal areas.
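The five values drawn in a basic box plot can be computed as in the sketch below. Several definitions of the quartiles are in use; this example relies on the "inclusive" method of Python's statistics.quantiles, which is one choice among several. The function name and the data are hypothetical.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, lower quartile, median, upper quartile, maximum)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)

# Hypothetical data.
values = [7, 2, 9, 4, 15, 4, 8, 11, 5]
print(five_number_summary(values))
```

The distance between the two quartiles is the inter-quartile range, and a longer whisker on one side suggests skewness in that direction.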
The basic box plot is often modified to display outliers as separate crosses.
Box plots cannot show clusters, so must never be used for data with clusters.
Box plots are particularly effective for displaying differences between several groups of values.
Box plots are relatively stable, and contain less 'noise' than other displays. They can concisely describe differences between even small groups.
The centre of a distribution is a 'typical value'. The spread describes how far the values are from the centre.
The median is a summary of the centre of a distribution. The range and inter-quartile range both describe spread.
The median and mean are alternative measures of the centre of a distribution.
When a data set is not symmetric, the mean and median may differ substantially.
The standard deviation is the most commonly used numerical summary of the spread of values in a data set.
The 70-95-100 rule-of-thumb is useful for understanding the numerical value of the standard deviation.
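The rule is read here as saying that roughly 70% of values lie within one standard deviation of the mean, about 95% within two, and nearly all within three. The sketch below checks these proportions for a data set; the function name and the data are hypothetical, and a small data set will only match the rule approximately.

```python
import statistics

def proportion_within(values, k):
    """Proportion of values within k standard deviations of the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # uses the n - 1 divisor
    return sum(abs(v - mean) <= k * sd for v in values) / len(values)

# Hypothetical data.
values = [4, 7, 8, 9, 10, 10, 11, 12, 13, 16]
for k in (1, 2, 3):
    print(k, proportion_within(values, k))
```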
It is possible to roughly guess the mean and standard deviation from a histogram and roughly sketch a symmetric histogram matching any given mean and standard deviation.
The mean and standard deviation cannot give any indication of the existence of outliers, skewness or clusters. A dot plot or histogram should be examined before reporting these numerical summaries.
If a data set contains an outlier, the mean and especially the standard deviation can be badly affected. Their values may be obviously wrong when the 70-95-100 rule is applied in the context of the data, but examining a dot plot or box plot is the best check.
The standard deviation within groups is usually lower than the overall standard deviation.
Splitting a data set into groups of 'similar' values results in more accurate predictions of future values if the group membership is known. The grouping is said to explain some of the overall variation.
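A small illustration of variation being explained by groups: the sketch compares the overall standard deviation with the standard deviation inside each group. The group labels, the function name and the data are hypothetical.

```python
import statistics

def spread_by_group(groups):
    """Return (overall sd, {group name: sd within that group})."""
    all_values = [v for values in groups.values() for v in values]
    overall = statistics.stdev(all_values)
    within = {name: statistics.stdev(values) for name, values in groups.items()}
    return overall, within

# Hypothetical data: two groups centred on very different values.
groups = {"A": [10, 12, 14], "B": [30, 32, 34]}
overall, within = spread_by_group(groups)
print("overall:", round(overall, 2))
print("within groups:", within)
```

Most of the overall spread here comes from the difference between the group centres, so a prediction that uses the group membership can be much more precise.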
The square of the standard deviation is called the variance; its value is harder to understand but it is the basis of important advanced statistical methods. The degrees of freedom are the number of pieces of information contributing to the standard deviation (or variance).
The root mean squared error summarises how close the values in a data set are to a target, k.
The standard deviation is similar to the root mean squared error, but summarises distances to the mean of the data. Its value can be interpreted in terms of the average area of squares on a graph.
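The two summaries can be compared directly in code. This sketch assumes the root mean squared error divides by the number of values n, while the standard deviation divides by n - 1 (the degrees of freedom); the function names and the data are hypothetical.

```python
import math

def rmse(values, k):
    """Root mean squared distance of the values from a target k."""
    return math.sqrt(sum((v - k) ** 2 for v in values) / len(values))

def standard_deviation(values):
    """Like rmse about the mean, but with divisor n - 1."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))

# Hypothetical data with mean 5.
values = [2, 4, 4, 4, 5, 5, 7, 9]
print(rmse(values, 5))             # distances to the mean
print(rmse(values, 6))             # distances to another target
print(standard_deviation(values))
```

Each squared difference is the area of a square whose side is the distance from a value to the target, which is the geometric interpretation mentioned above.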
A data set containing annual rainfalls in Samaru, Nigeria, will be used for illustrative purposes.
Half the data are lower than the median. A quarter and three quarters are lower than the lower and upper quartiles respectively. At any other value, x, the proportion of data values that are x or lower is called its cumulative proportion.
A graph of the cumulative proportion below x against x is a step function that increases from zero (at small x) to one (at high x).
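The cumulative proportion is simple to compute, and evaluating it at a few points shows why its graph is a step function. The function name and the data are hypothetical.

```python
def cumulative_proportion(values, x):
    """Proportion of the values that are x or lower."""
    return sum(v <= x for v in values) / len(values)

# Hypothetical data.
values = [3, 5, 5, 8, 10]
for x in [2, 3, 4, 5, 7, 8, 10, 12]:
    print(x, cumulative_proportion(values, x))
```

The proportion only changes at the data values themselves, jumping by 1/n at each one (or by more where values are repeated), and is flat in between.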
Given any target proportion, p, it is possible to find a corresponding value, x, for which approximately this proportion of values is x or lower. For example, the percentile for p = 50% is the median.
The 0, 25, 50, 75 and 100th percentiles are displayed as a box plot. Other percentiles can be displayed in a similar shaded rectangle.
Box plots are useful for comparing groups. If the groups are in order (e.g. the months of a year), the median, quartiles and extremes can be joined and shaded as bands. This effectively describes how the distribution of values varies.
In some applications, different percentiles are important. They can also be joined and shaded as bands to compare ordered groups.
The graph of cumulative proportions is a step function. Most software reports percentiles that are equivalent to reading values off a smoothed version of this step function.
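The sketch below shows one possible smoothing: linear interpolation between neighbouring sorted values. This is only one convention, and statistical packages differ in the details, so the results may not match a particular program exactly. The function name and the data are hypothetical.

```python
def percentile(values, p):
    """Percentile for 0 <= p <= 100 by linear interpolation.

    The sorted values are placed at equally spaced positions from
    0% (the minimum) to 100% (the maximum).
    """
    if not 0 <= p <= 100:
        raise ValueError("p must be between 0 and 100")
    ordered = sorted(values)
    position = (p / 100) * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])

# Hypothetical data.
values = [2, 4, 4, 5, 7, 8, 9, 11, 15]
for p in (0, 10, 25, 50, 75, 100):
    print(p, percentile(values, p))
```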
Linear transformations of data affect the scale on the axis of graphical displays, but do not otherwise change the shape of the distribution of values.
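As an illustration, converting temperatures from Celsius to Fahrenheit is a linear transformation. The sketch below shows that the mean and standard deviation change in a predictable way while the standardised values, and so the shape, stay the same. The function name and the data are hypothetical.

```python
import statistics

def standardise(values):
    """Express each value as a number of standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

# Hypothetical temperatures.
celsius = [10, 15, 20, 25, 30]
fahrenheit = [32 + 1.8 * c for c in celsius]

print(statistics.mean(celsius), statistics.stdev(celsius))
print(statistics.mean(fahrenheit), statistics.stdev(fahrenheit))
print(standardise(celsius))
print(standardise(fahrenheit))
```

The mean is transformed in the same way as the individual values, and the standard deviation is multiplied by the scale factor 1.8 but is not affected by adding 32.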
Nonlinear transformations change the shape of the distribution of values more profoundly. A logarithmic transformation can help detect patterns in very skew data sets.
Logarithmic transformations are most useful for 'quantity' data that cover several orders of magnitude.
Power transformations are a more flexible family of nonlinear transformations that are useful in data exploration.
The effect of power transformations on the skewness of data is evident in a wide range of graphical displays.
Discrete data sets contain counts whereas continuous data sets could potentially contain any values within an interval. Stacked dot plots are good displays of small discrete data sets containing small counts.
When the range of possible counts is moderate or large, a histogram is an effective display of the distribution. Class width should be a whole number and class boundaries should end in '.5'.
When the range of possible counts is small, a bar chart is a better representation of the data than a histogram.
A frequency table is often used to summarise discrete data. The mean and standard deviation can be evaluated easily from the frequency table.
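A sketch of the calculation from a frequency table, in which each distinct count is weighted by its frequency. It assumes the n - 1 divisor for the standard deviation; the function name and the table are hypothetical.

```python
import math

def mean_sd_from_table(table):
    """Mean and standard deviation from a {value: frequency} table."""
    n = sum(table.values())
    mean = sum(x * f for x, f in table.items()) / n
    sum_sq = sum(f * (x - mean) ** 2 for x, f in table.items())
    return mean, math.sqrt(sum_sq / (n - 1))

# Hypothetical table: the value 0 occurred twice, 1 four times, and so on.
table = {0: 2, 1: 4, 2: 3, 3: 1}
print(mean_sd_from_table(table))
```

The result is the same as listing every value individually, but the table needs only one term for each distinct value.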