Chapter 2   One Numerical Variable

2.1   Graphical display of values

2.1.1   Analysing variation

Meaningful information can be obtained from variation in the values of a variable.

2.1.2   Basic dot plot

A dot plot displays each value as a cross along a numerical axis.

2.1.3   Jittered dot plot

Jittering is a modification to the basic dot plot that avoids some problems associated with overlapping crosses.

2.1.4   Stacked dot plots

Stacking of the crosses is an alternative to jittering that highlights ranges of high or low density.

2.1.5   Stem and leaf plots

Stem and leaf plots are similar to stacked dot plots, but a digit is used instead of a cross to retain extra information.
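
As a rough illustration of the idea (not taken from the original text), the sketch below builds a basic stem and leaf plot in Python. It assumes non-negative whole-number data, with the tens as the stem and the units digit as the leaf; the function name `stem_and_leaf` and the data values are made up for the example.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return the lines of a basic stem and leaf plot.

    Assumes non-negative integers: stem = tens, leaf = units digit.
    """
    leaves = defaultdict(list)
    for v in sorted(values):
        leaves[v // 10].append(v % 10)

    lines = []
    # Include empty stems so that gaps in the data remain visible.
    for stem in range(min(leaves), max(leaves) + 1):
        digits = "".join(str(d) for d in leaves.get(stem, []))
        lines.append(f"{stem:>2} | {digits}")
    return lines


data = [12, 15, 21, 23, 23, 38, 40, 41]  # made-up values
plot = stem_and_leaf(data)
print("\n".join(plot))
```

Each row has the same outline as a stack of crosses in a stacked dot plot, but the leaf digits let the original values be read back.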

2.1.6   Splitting the stems

To increase the flexibility of the display, each stem may be repeated either 2 or 5 times, increasing the number of classes in the basic stem and leaf plot by a factor of 2 or 5.

2.1.7   Drawing stem and leaf plots

For data analysis, stem and leaf plots are rarely more informative than stacked dot plots, but they are easy to draw by hand.

2.2   Understanding distributions

2.2.1   Outliers

Does the data contain any outliers -- values that are atypically large or small? The extreme values in a skew distribution are often mistaken for outliers.

2.2.2   Clusters

Does the data split into separate clusters -- ranges of values with high density separated by ranges with low density? Clusters may correspond to different groups of individuals.

2.2.3   Distribution of values

The distribution gives information about a typical value round which the data are spread (the distribution's location or centre) and the variability of the values (the spread of the distribution).

2.2.4   Names of individuals

Additional information about the items from which measurements have been made can help us understand the distribution of values in the data.

2.2.5   Distinguishing known groups

If we know that the values come from 2 or more groups of individuals, dot plots can be modified to show this extra information.

2.2.6   Dangers of over-interpretation

There is a risk of over-interpreting patterns in small data sets.

2.3   Histograms and density

2.3.1   Density of values

The heights of the stacks of crosses in a dot plot describe the density of values.

2.3.2   Histogram with equal class widths

In a simple histogram, the height of the rectangle above each class on the axis equals the number of values in the class -- the class frequency.

2.3.3   Choice of classes

Class width and start-point should be chosen to make the histogram as smooth as possible -- neither too blocky nor too jagged.

2.3.4   Histograms of small data sets

The shape of a histogram can be very dependent on the choice of classes if the data set is small; beware over-interpreting its shape. Stacked dot plots are a better display of small data sets.

2.3.5   Relative frequency and area

In a histogram, the proportion of the total area that is above any class equals the relative frequency of the class.

2.3.6   Comparing groups

The vertical axis should be relative frequency, not frequency, when comparing two groups with histograms. Population pyramids are often used to compare age distributions.

2.3.7   Histograms with varying class widths

If a histogram has varying class widths, the vertical axis must be 'density'. The histogram shape would be misleading if frequency or relative frequency was used for the vertical axis.
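
A minimal sketch of the calculation, assuming the usual definition of density as relative frequency divided by class width. The function name `histogram_density`, the class boundaries and the data are hypothetical, and each class is taken to include its lower boundary but not its upper one.

```python
def histogram_density(values, boundaries):
    """Return (low, high, frequency, density) for each class.

    density = relative frequency / class width, so the area of each
    rectangle equals the relative frequency of its class.
    """
    n = len(values)
    classes = []
    for low, high in zip(boundaries, boundaries[1:]):
        frequency = sum(low <= v < high for v in values)
        density = (frequency / n) / (high - low)
        classes.append((low, high, frequency, density))
    return classes


data = [1, 2, 3, 4, 6, 8, 12, 18]   # made-up values
boundaries = [0, 5, 10, 20]         # class widths 5, 5 and 10
classes = histogram_density(data, boundaries)

# Total area of the rectangles (density x width) is 1.
total_area = sum(d * (high - low) for low, high, _, d in classes)
```

In this example the last two classes have the same frequency (2), but the wider class has half the density, which is why plotting frequency on the vertical axis would exaggerate it.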

2.3.8   Understanding histograms

The proportion of values in any set of classes always equals the proportion of the total histogram area that is above those classes.

2.3.9   Frequency polygons

Frequency polygons are closely related to histograms but give a less 'blocky' display of density. Different groups can be compared more easily with them.

2.3.10   Kernel density estimates (optional)

Kernel density estimates show density in a still smoother display.
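
One common form of kernel density estimate can be sketched as below. It assumes a Gaussian (normal) kernel and a hand-picked bandwidth `h`; neither choice comes from the original text, and the function name and data are made up for illustration.

```python
import math


def kernel_density(x, values, h):
    """Gaussian kernel density estimate at x with bandwidth h.

    Each data value contributes a small normal-shaped bump centred on
    itself; the estimate is the average of these bumps.
    """
    total = 0.0
    for v in values:
        z = (x - v) / h
        total += math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
    return total / (len(values) * h)


data = [1.0, 2.0, 2.5, 6.0]   # made-up values
h = 1.0                       # assumed bandwidth; larger h gives a smoother curve

density_near_data = kernel_density(2.0, data, h)
density_far_away = kernel_density(10.0, data, h)
```

The estimate is high where values are dense and close to zero far from the data, and the area under the whole curve is 1, as for a histogram drawn on a density scale.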

2.3.11   Drawing histograms by hand (optional)

Histograms are based on frequency tables. Class boundaries should avoid possible data values.

2.4   Median, quartiles & box plots

2.4.1   The need to summarise

Dot plots, stem and leaf plots and histograms contain detailed information that is distracting when two or more data sets are being compared.

2.4.2   Median, quartiles and box plot

The median and quartiles split a batch of values into four equal-sized sets of values. A box plot is a graphical display of the median, quartiles and extremes.
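
The five values behind a box plot can be computed as in the sketch below. Quartile conventions differ between textbooks and software; this example assumes the "inclusive" interpolation method of Python's `statistics.quantiles`, which may not match the definition used elsewhere in this chapter. The data are made up.

```python
import statistics

data = [7, 1, 9, 3, 5, 2, 8, 4, 6]   # made-up values

# Three cut points that split the sorted data into four parts.
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")

# The median, quartiles and extremes displayed by a basic box plot.
five_number_summary = (min(data), q1, median, q3, max(data))
iqr = q3 - q1

print(five_number_summary, iqr)
```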

2.4.3   Interpreting a box plot's shape

A box plot clearly shows the centre, spread and skewness of a data set. It splits the corresponding histogram into 4 approximately equal areas.

2.4.4   Displaying outliers

The basic box plot is often modified to display outliers as separate crosses.

2.4.5   Clusters

Box plots cannot show clusters, so they must never be used for data with clusters.

2.4.6   Comparison of groups

Box plots are particularly effective for displaying differences between several groups of values.

2.4.7   Dangers of over-interpretation

Box plots are relatively stable, and contain less 'noise' than other displays. They can concisely describe differences between even small groups.

2.5   Describing centre and spread

2.5.1   Centre and spread

The centre of a distribution is a 'typical value'. The spread describes how far the values are from the centre.

2.5.2   Median, range and IQR

The median is a summary of the centre of a distribution. The range and inter-quartile range both describe spread.

2.5.3   Summaries of centre

The median and mean are alternative measures of the centre of a distribution.

2.5.4   Properties of median and mean

When a data set is not symmetric, the mean and median may differ substantially.

2.5.5   Standard deviation

The standard deviation is the most commonly used numerical summary of the spread of values in a data set.

2.5.6   Rules of thumb for st devn

The 70-95-100 rule-of-thumb is useful for understanding the numerical value of the standard deviation.
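
The rule can be checked on any data set by counting the proportion of values within 1, 2 and 3 standard deviations of the mean. The sketch below uses a small made-up data set, so the proportions only roughly match 70%, 95% and 100%; the helper name `proportion_within` is hypothetical.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up values

mean = statistics.mean(data)
sd = statistics.stdev(data)       # sample standard deviation (divisor n - 1)


def proportion_within(k):
    """Proportion of values within k standard deviations of the mean."""
    return sum(abs(x - mean) <= k * sd for x in data) / len(data)


within_1, within_2, within_3 = (proportion_within(k) for k in (1, 2, 3))
print(within_1, within_2, within_3)
```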

2.5.7   Understanding means and st devns

It is possible to roughly guess the mean and standard deviation from a histogram and roughly sketch a symmetric histogram matching any given mean and standard deviation.

2.5.8   Warnings about mean & st devn

The mean and standard deviation cannot give any indication of the existence of outliers, skewness or clusters. A dot plot or histogram should be examined before reporting these numerical summaries.

2.6   More about variation (optional)

2.6.1   Effect of outliers

If a data set contains an outlier, the mean and especially the standard deviation can be badly affected. The values may be obviously wrong when the 70-95-100 rule is applied in the context of the data, but examining a dot plot or box plot is the best check.

2.6.2   Standard deviation of grouped data

The standard deviation within groups is usually lower than the overall standard deviation.
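
A small made-up example of this effect: two groups whose values are tightly packed but whose centres differ. The group labels and values below are hypothetical.

```python
import statistics

groups = {
    "A": [10, 11, 12],   # made-up values
    "B": [20, 21, 22],
}

# Standard deviation within each group separately.
sd_within = {name: statistics.stdev(values) for name, values in groups.items()}

# Standard deviation of all values, ignoring the grouping.
all_values = [v for values in groups.values() for v in values]
sd_overall = statistics.stdev(all_values)

print(sd_within, sd_overall)
```

Here each group has standard deviation 1, while the overall standard deviation is about 5.5 because it also reflects the difference between the group centres.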

2.6.3   Explained and unexplained variation

Splitting a data set into groups of 'similar' values results in more accurate predictions of future values if the group membership is known. The grouping is said to explain some of the overall variation.

2.6.4   Variance and degrees of freedom (advanced)

The square of the standard deviation is called the variance; its value is harder to understand but it is the basis of important advanced statistical methods. The degrees of freedom are the number of pieces of information contributing to the standard deviation (or variance).

2.6.5   Root mean squared error (advanced)

The root mean squared error summarises how close the values in a data set are to a target, k.

2.6.6   Distances from the mean (advanced)

The standard deviation is similar to the root mean squared error, but summarises distances to the mean of the data. Its value can be interpreted in terms of the average area of squares on a graph.
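
The relationship between the two summaries can be sketched as follows, assuming the usual definitions: the root mean squared error about a target k is the square root of the average squared distance from k, and the sample standard deviation uses distances from the mean with divisor n - 1. The function name `rmse`, the target and the data are made up.

```python
import math
import statistics


def rmse(values, k):
    """Root mean squared error of the values about a target k."""
    return math.sqrt(sum((x - k) ** 2 for x in values) / len(values))


data = [9, 10, 11, 14]   # made-up values
target = 10              # hypothetical target, k

rmse_about_target = rmse(data, target)
rmse_about_mean = rmse(data, statistics.mean(data))

# Same squared distances from the mean, but divided by n - 1 instead of n.
sd = statistics.stdev(data)

print(rmse_about_target, rmse_about_mean, sd)
```

The root mean squared error is smallest when the target equals the mean, and the standard deviation is slightly larger than that minimum because of the n - 1 divisor.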

2.7   Proportions and percentiles

2.7.1   Illustrative data set

A data set containing annual rainfalls in Samaru, Nigeria, will be used for illustrative purposes.

2.7.2   Cumulative proportions

Half the data are lower than the median. A quarter and three quarters are lower than the lower and upper quartiles respectively. At any other value, x, the proportion of data values that are x or lower is called its cumulative proportion.
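
The definition translates directly into code. The sketch below uses a few made-up values rather than the Samaru rainfall data, and the function name `cumulative_proportion` is hypothetical.

```python
def cumulative_proportion(values, x):
    """Proportion of the data values that are x or lower."""
    return sum(v <= x for v in values) / len(values)


data = [3, 5, 5, 8, 10]   # made-up values

at_2 = cumulative_proportion(data, 2)     # below every value
at_5 = cumulative_proportion(data, 5)     # three of the five values are 5 or lower
at_10 = cumulative_proportion(data, 10)   # at the largest value

print(at_2, at_5, at_10)
```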

2.7.3   Graph of cumulative proportions

A graph of the cumulative proportion below x against x is a step function that increases from zero (at small x) to one (at high x).

2.7.4   Percentiles

Given any target proportion, p, it is possible to find a corresponding value, x, for which approximately this proportion of values is x or lower. For example, the percentile for p = 50% is the median.
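
One simple convention, sketched below, takes the percentile to be the smallest data value whose cumulative proportion is at least p. This is only one of several definitions in use, and statistical software often interpolates between values instead; the function name and data are made up.

```python
import math


def percentile(values, p):
    """Smallest data value with at least a proportion p of values at or below it.

    p is a proportion between 0 and 1 (e.g. 0.5 for the 50th percentile).
    """
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered)))   # 1-based position in sorted data
    return ordered[rank - 1]


data = [3, 5, 5, 8, 10]   # made-up values

p50 = percentile(data, 0.5)    # the median of these five values
p100 = percentile(data, 1.0)   # the maximum

print(p50, p100)
```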

2.7.5   Displaying percentiles

The 0th, 25th, 50th, 75th and 100th percentiles are displayed as a box plot. Other percentiles can be displayed in a similar shaded rectangle.

2.7.6   Comparing groups

Box plots are useful for comparing groups. If the groups are in order (e.g. the months of a year), the median, quartiles and extremes can be joined and shaded as bands. This effectively describes how the distribution of values varies.

2.7.7   Comparing groups with other percentiles

In some applications, different percentiles are important. They can also be joined and shaded as bands to compare ordered groups.

2.7.8   Better definition of percentiles (advanced)

The graph of cumulative proportions is a step function. Most software reports percentiles that are equivalent to reading values off a smoothed version of this step function.

2.8   Transformations

2.8.1   Linear transformations

Linear transformations of data affect the scale on the axis of graphical displays, but do not otherwise change the shape of the distribution of values.
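
As a hedged illustration, the sketch below converts made-up temperatures from Celsius to Fahrenheit, a linear transformation y = a + b x with a = 32 and b = 1.8. The mean follows the same transformation and the standard deviation is multiplied by |b|, which is why only the axis scale changes.

```python
import statistics

celsius = [0, 10, 20, 30]   # made-up values
a, b = 32, 1.8              # Fahrenheit = 32 + 1.8 * Celsius

fahrenheit = [a + b * x for x in celsius]

mean_c, sd_c = statistics.mean(celsius), statistics.stdev(celsius)
mean_f, sd_f = statistics.mean(fahrenheit), statistics.stdev(fahrenheit)

# The centre is transformed like the data; the spread is scaled by |b|.
print(mean_f, a + b * mean_c)
print(sd_f, abs(b) * sd_c)
```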

2.8.2   Log transformations

Nonlinear transformations change the shape of the distribution of values more profoundly. A logarithmic transformation can help detect patterns in very skew data sets.
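
A small made-up example of the effect, assuming base-10 logarithms: the raw values are very skew (the mean is far above the median), while their logarithms are evenly spaced and symmetric.

```python
import math
import statistics

raw = [1, 10, 100, 1000, 10000]       # made-up, very skew values
logs = [math.log10(x) for x in raw]   # approximately 0, 1, 2, 3, 4

raw_mean, raw_median = statistics.mean(raw), statistics.median(raw)
log_mean, log_median = statistics.mean(logs), statistics.median(logs)

print(raw_mean, raw_median)   # mean pulled far above the median by the long tail
print(log_mean, log_median)   # mean and median agree after the transformation
```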

2.8.3   When to use log transform?

Logarithmic transformations are most useful for 'quantity' data that cover several orders of magnitude.

2.8.4   Power transformations (advanced)

Power transformations are a more flexible family of nonlinear transformations that are useful in data exploration.

2.8.5   Power transforms & skewness

The effect of power transformations on the skewness of data is evident in a wide range of graphical displays.

2.9   Discrete data (counts)

2.9.1   Discrete and continuous data

Discrete data sets contain counts whereas continuous data sets could potentially contain any values within an interval. Stacked dot plots are good displays of small discrete data sets containing small counts.

2.9.2   Histograms for counts

When the range of possible counts is moderate or large, a histogram is an effective display of the distribution. Class width should be a whole number and class boundaries should end in '.5'.

2.9.3   Bar charts

When the range of possible counts is small, a bar chart is a better representation of the data than a histogram.

2.9.4   Mean and st devn (advanced)

A frequency table is often used to summarise discrete data. The mean and standard deviation can be evaluated easily from the frequency table.
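
A minimal sketch of the calculation, assuming the sample standard deviation with divisor n - 1. The frequency table below is made up; each key is a count value and each entry is how often that value occurs.

```python
import math
import statistics

# Made-up frequency table: value -> frequency.
frequency_table = {0: 3, 1: 5, 2: 2}

n = sum(frequency_table.values())

# Each value contributes once for every time it occurs.
mean = sum(x * f for x, f in frequency_table.items()) / n
sum_of_squares = sum(f * (x - mean) ** 2 for x, f in frequency_table.items())
sd = math.sqrt(sum_of_squares / (n - 1))

# Cross-check against the raw data rebuilt from the table.
raw = [x for x, f in frequency_table.items() for _ in range(f)]
sd_check = statistics.stdev(raw)

print(mean, sd, sd_check)
```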