Long page descriptions

Chapter 5   Categorical Variables

5.1   Frequency tables

5.1.1   Frequency tables

A frequency table holds all information about the distribution of a categorical variable.

5.1.2   Proportions and percentages

The proportions or percentages of values are usually easier to interpret than the raw frequencies.
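As a minimal sketch of the step from raw frequencies to proportions, the following Python builds a frequency table from a list of category labels and divides each count by the total. The data values and the helper name `frequency_table` are made up for illustration and are not taken from the text.

```python
from collections import Counter


def frequency_table(values):
    """Return {category: (count, proportion)} for a list of category labels."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: (n, n / total) for category, n in counts.items()}


# Hypothetical data: each 'individual' contributes one label.
colours = ["red", "blue", "red", "green", "red", "blue", "red", "green"]

table = frequency_table(colours)
for category, (count, proportion) in table.items():
    # Show the count, then the same information as a percentage.
    print(f"{category:<6} {count:>3} {proportion:7.1%}")
```

The proportions always sum to 1 (the percentages to 100), which is what makes tables based on different numbers of individuals comparable.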

5.1.3   Recognising frequency tables

In a frequency table, each 'individual' contributes a count (frequency) of 1 to exactly one table entry. Many published tables of counts and percentages are not proper frequency tables.

5.1.4   Changes to the categories

The information in a frequency table may be clearer if the categories are re-ordered, if associated categories are combined, or if the table describes only a subset of the categories.

5.2   Bar and pie charts

5.2.1   Bar charts

A bar chart displays the frequencies in a frequency table graphically.

5.2.2   Pareto diagrams

For bar charts of categorical data whose categories have no natural ordering, it is often helpful to sort the categories in order of decreasing frequency.
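The sorting step can be sketched in a few lines of Python. The category names and counts below are invented for illustration, and the text-based bars are only a stand-in for a real chart.

```python
# Hypothetical frequencies for categories with no natural ordering.
counts = {"scratch": 12, "dent": 30, "misalignment": 5, "other": 3}

# Sort the (category, frequency) pairs into decreasing frequency.
pareto = sorted(counts.items(), key=lambda item: item[1], reverse=True)

for category, n in pareto:
    # One '#' per individual gives a crude horizontal bar chart.
    print(f"{category:<13} {'#' * n} {n}")
```

After sorting, the largest categories appear first, so the reader does not have to scan the whole chart to find them.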

5.2.3   Chartjunk and misleading bar charts

The temptation to embellish simple bar charts should be resisted. Some such 'artistic' embellishments are misleading.

5.2.4   Stacked bar charts and pie charts

These are alternatives to a bar chart for categorical data.

5.2.5   Comparison of bar and pie charts

Bar charts make it easier to distinguish the proportions in single categories. Pie charts are better for showing the proportion in a group of categories.

5.2.6   Chartjunk for pie charts

Three-dimensional pie charts are often drawn. If a pie chart holds little information, it is better to draw it small than to enhance it in this way.

5.2.7   Bar and pie charts for quantities

Bar and pie charts can be used to display any data where a total quantity is split into parts.

5.3   Comparing groups

5.3.1   Contingency tables

Categorical data collected from different groups can be shown in a contingency table. It contains a simple frequency table for each group.

5.3.2   Contingency table examples

Contingency tables sometimes arise from experiments. Data collected from surveys are often presented in contingency tables.

5.3.3   Bar charts using proportions

It is easier to compare groups if bar charts are drawn from the proportions within each group rather than the frequencies.

5.3.4   Stacked bar charts

Stacking the bars in each bar chart makes it easier to compare groups. Stacked bar charts are particularly effective for ordinal categorical data.

5.3.5   Two special cases

When the variable has only two categories, stacked bar charts can be simplified. An example is also given where the grouping forms a time series.

5.4   Bivariate categorical distributions

5.4.1   Relationships between variables

If two categorical variables can be treated as a response and an explanatory variable, the explanatory variable can be used to split the individuals into groups. If neither variable can be treated as explanatory, we need other ways to describe their association.

5.4.2   3-dimensional bar charts

Bivariate distributions can be displayed in 3-dimensional bar charts but there are better ways to show the data.

5.4.3   Clustered bar charts

A 2-dimensional bar chart can be used to display the joint distribution of two categorical variables, with the bars clustered by one or other of the variables.

5.4.4   Marginal distributions

The two marginal distributions are found by adding the joint frequencies (or proportions) across rows or down columns of a contingency table.
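A minimal sketch of this calculation, assuming a small made-up contingency table stored as a list of rows (the counts are hypothetical):

```python
# Hypothetical joint frequencies: rows are categories of X, columns of Y.
joint = [
    [20, 10],
    [5, 15],
]

# Adding across each row gives the marginal frequencies of X.
row_totals = [sum(row) for row in joint]

# Adding down each column gives the marginal frequencies of Y.
col_totals = [sum(col) for col in zip(*joint)]

# Dividing by the grand total turns frequencies into proportions.
grand_total = sum(row_totals)
x_marginal = [t / grand_total for t in row_totals]
y_marginal = [t / grand_total for t in col_totals]

print("X marginal:", x_marginal)
print("Y marginal:", y_marginal)
```

Each marginal distribution is an ordinary univariate distribution; it describes one variable while ignoring the other.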

5.4.5   Conditional distributions

Conditional distributions are obtained by scaling the rows (or columns) of a contingency table to make them all sum to 1.0.
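The scaling can be sketched as follows, using a made-up table of counts; the helper name `scale_to_one` is hypothetical. Scaling rows and scaling columns give different sets of conditional distributions.

```python
def scale_to_one(values):
    """Divide each value by the total so that the results sum to 1.0."""
    total = sum(values)
    return [v / total for v in values]


# Hypothetical joint frequencies: rows are categories of X, columns of Y.
joint = [
    [20, 10],
    [5, 15],
]

# Scaling each row gives the distribution of Y within each category of X.
y_given_x = [scale_to_one(row) for row in joint]

# Scaling each column gives the distribution of X within each category of Y.
x_given_y = [scale_to_one(col) for col in zip(*joint)]

print("Y given X:", y_given_x)
print("X given Y:", x_given_y)
```

With these counts, the proportion in the first Y category given the first X category is 20/30, whereas the proportion in the first X category given the first Y category is 20/25, so the two directions of conditioning are not interchangeable.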

5.4.6   More about conditional distributions (advanced)

Be careful to distinguish between the conditional distribution of Y given X, and that of X given Y.

5.4.7   Conditional vs marginal distributions

Proportional Venn diagrams are a useful way to display joint distributions and to understand the relationships between conditional and marginal distributions.

5.5   Presenting data in tables

5.5.1   Gridlines and white space

Never use gridlines to box all values in a table. In large multi-column tables, reading across rows is easier with occasional hairlines or light shading; otherwise, consider using white space to separate associated groups of rows or columns.

5.5.2   Layout and annotation

Use white space to group related rows and columns. Rearranging rows or columns may bring values that should be compared closer together. Summarise and interpret the table in the body of a report, but do not simply repeat its values.

5.5.3   Significant digits and data noise

The meaningful information is 'signal'. Information that does not help understanding of the data is 'noise'. Noise includes data noise and unnecessary embellishments to the table. Reducing the number of significant digits displayed often decreases data noise.

5.5.4   Meaningful variables

Showing proportions in a multi-column table instead of frequencies makes it easier to compare groups. Ratios of variables can be easier to interpret than their raw values.

5.5.5   Swapping rows and columns

It is easier to compare values down columns than across rows. Interchanging the rows and columns of a table can therefore make comparisons easier.

5.5.6   Reordering rows

Rearranging the rows (or columns) may make the information in large tables stand out better.

5.5.7   Example

An example shows a published table whose presentation can be improved in many ways.

5.6   Logistic regression

5.6.1   Categorical responses

With a categorical response and a numerical explanatory variable, stacked bar charts at each value of X are an effective display.

5.6.2   Fitted values and predictions

A straight line is not appropriate for describing how the proportion in a category depends on X, because it eventually gives fitted proportions below 0 or above 1. A curve is required.

5.6.3   Logistic curve

A 'logistic' curve can be used to model how a proportion depends on X.
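As an illustration, one common form of the logistic curve is p(x) = 1 / (1 + exp(-(b0 + b1 x))). This parameterisation, the names `b0` and `b1`, and the parameter values below are assumptions made for the sketch; the text does not state which form it uses.

```python
import math


def logistic(x, b0, b1):
    """Logistic curve: a proportion that always lies strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))


# Hypothetical parameters, chosen only to show the shape of the curve.
b0, b1 = -4.0, 2.0

for x in [0, 1, 2, 3, 4]:
    print(f"x = {x}: proportion = {logistic(x, b0, b1):.3f}")
```

In this form the curve passes through a proportion of 0.5 where b0 + b1 x = 0 (here at x = 2), and it flattens towards 0 and 1 at the extremes instead of crossing them as a straight line would.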

5.6.4   Obtaining a good fit

A logistic curve is fitted to an example data set.