
Chapter 5   Categorical Variables

5.1   Frequency tables

5.1.1   Frequency tables

Numerical and categorical data

In a data set, a numerical variable contains a number from each individual. A categorical variable classifies each individual into one of several groups.

Frequency tables

For a categorical variable, the frequencies for the distinct categories are the number of times each category occurs in the data set. The frequencies fully capture all information about the distribution of values and are usually presented in a frequency table.

5.1.2   Proportions and percentages

Proportions

The proportions of values in the categories (their relative frequencies) are the frequencies divided by the total number of values.

Percentages

Percentages are simply proportions multiplied by 100. It is usually easier to quickly compare a column of percentages than the corresponding column of proportions.
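As a minimal sketch of these definitions, the Python snippet below builds a frequency table with proportions and percentages from raw categorical values. The data values and the function name `frequency_table` are hypothetical and used only for illustration.

```python
from collections import Counter

# Hypothetical raw data: one categorical value per individual.
values = ["car", "bus", "car", "walk", "car", "bus", "cycle", "car"]

def frequency_table(data):
    """Return {category: (frequency, proportion, percentage)}."""
    counts = Counter(data)
    total = sum(counts.values())
    return {
        category: (count, count / total, 100 * count / total)
        for category, count in counts.items()
    }

table = frequency_table(values)
for category, (freq, prop, pct) in table.items():
    # One row per category: frequency, proportion, percentage
    print(f"{category:<6} {freq:>3} {prop:>6.3f} {pct:>6.1f}%")
```

The frequencies add to the number of individuals, so the proportions add to 1 and the percentages to 100.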

5.1.3   Recognising frequency tables

Necessary property of a frequency table

A frequency table distributes each of a collection of 'individuals' into one of several categories. Each individual must therefore contribute 1 to exactly one of the counts in the table.

Make sure that you can recognise whether a table of counts or percentages is a frequency table.

5.1.4   Changes to the categories

Modifying a frequency table

Sometimes a frequency table can be modified to make the information clearer or to highlight particular aspects.

Reordering categories
If there is a natural ordering of the categories, they should be arranged in this order. Otherwise it often helps to sort the categories by their frequencies.

Alphabetic ordering of the categories is rarely best.

Combining categories
The information in the table may be clearer if associated categories are combined. (The frequencies and percentages of the combined categories should be added.)

Looking at subsets of categories
It may be useful to look only at the distribution of a subset of categories, i.e. a sub-group of the individuals. (The percentages should be recalculated relative to the total for the displayed categories, so they still add to 100%.)
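The two modifications can be sketched in Python as follows. The frequencies and the helper names `combine` and `percentages` are hypothetical; the point is only that combined categories add their counts, and that percentages for a subset are recomputed from the subset's own total.

```python
def combine(freqs, new_name, members):
    """Merge the categories in `members` into one category by adding counts."""
    merged = {k: v for k, v in freqs.items() if k not in members}
    merged[new_name] = sum(freqs[k] for k in members)
    return merged

def percentages(freqs):
    """Percentages relative to the total of the categories shown."""
    total = sum(freqs.values())
    return {k: 100 * v / total for k, v in freqs.items()}

# Hypothetical frequencies, for illustration only.
freqs = {"car": 40, "bus": 25, "train": 15, "walk": 12, "cycle": 8}

# Combining associated categories: the counts are added.
public = combine(freqs, "public transport", ["bus", "train"])

# Looking at a subset: percentages use the subset total, so they add to 100.
active = {k: freqs[k] for k in ("walk", "cycle")}
active_pct = percentages(active)
```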

5.2   Bar and pie charts

5.2.1   Bar charts

Bar charts

The main graphical display of categorical data is a bar chart. In a bar chart, the height of each bar is equal to the frequency (or equivalently relative frequency) of that category.

5.2.2   Pareto diagrams

Ordering categories

If the categories have a natural ordering (an ordinal categorical variable), this ordering should be used in a bar chart.

For nominal categorical variables (no natural ordering), alphabetic ordering of the categories should be avoided. It is better to sort them in order of decreasing frequencies, giving a Pareto diagram.

Detecting 'important' categories

Pareto diagrams are particularly useful in industrial quality control and quality improvement where information is collected about the causes of problems in manufacturing processes. The Pareto principle states that:

A large percentage of instances of any problem result from a small percentage of the possible causes.

The leftmost categories in a Pareto diagram are most important. A line is usually added showing the cumulative proportions for the different causes. For the i'th category, the height of the line gives the proportion of problems from any of the i most common categories.
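The ordering and the cumulative line can be computed as in the sketch below. The defect counts are hypothetical, and `pareto` is an illustrative helper name rather than a standard function.

```python
def pareto(freqs):
    """Sort categories by decreasing frequency and attach cumulative proportions."""
    total = sum(freqs.values())
    ordered = sorted(freqs.items(), key=lambda item: item[1], reverse=True)
    rows, running = [], 0
    for cause, count in ordered:
        running += count
        # Height of the cumulative line above this category
        rows.append((cause, count, running / total))
    return rows

# Hypothetical counts of causes of defects, for illustration only.
defects = {"scratches": 12, "misalignment": 45, "cracks": 8,
           "wrong colour": 30, "other": 5}

rows = pareto(defects)
for cause, count, cumulative in rows:
    print(f"{cause:<14} {count:>3} {cumulative:>5.2f}")
```

In this made-up example the two leftmost causes account for three quarters of all problems, which is the kind of pattern the Pareto principle describes.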

5.2.3   Chartjunk and misleading bar charts

Chartjunk

Bar charts can be very simple and need not take up much space in reports. Avoid the temptation to embellish them 'artistically' to make them more visually appealing. These additions are collectively called chartjunk.

Chartjunk adds 'noise' to a bar chart and makes it harder to read the real information that it contains. Rather than adding chartjunk, draw the bar chart small or replace it with a frequency table.

A common form of chartjunk arises when each bar is drawn as a 3-dimensional object. When the resulting 3-dimensional picture is rotated, it often becomes harder to compare the heights of bars and to read off values from the axes. In particular, perspective views should be avoided.

Replacing bars with objects

A more serious problem arises when the rectangular bars in a bar chart are replaced with pictures of objects. This often visually misrepresents the proportions in the different categories, because the visual importance of a picture is determined by its area or volume, not its height.

5.2.4   Stacked bar charts and pie charts

Other displays of categorical data

A stacked bar chart is simply a bar chart whose bars are stacked on top of each other. Stacked bar charts are often used to compare two or more groups of individuals.

A pie chart splits a circle into segments according to the proportions in the categories. The angle for a category is proportional to its proportion.

In all three displays, the area of ink for any category is proportional to the proportion of values in that category.
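Since a full circle is 360 degrees, the angle of a pie segment is the category's proportion multiplied by 360. A small Python sketch, with hypothetical frequencies and an illustrative function name:

```python
def pie_angles(freqs):
    """Angle in degrees of each pie segment: proportion x 360."""
    total = sum(freqs.values())
    return {k: 360 * v / total for k, v in freqs.items()}

# Hypothetical frequencies, for illustration only.
freqs = {"A": 50, "B": 30, "C": 20}
angles = pie_angles(freqs)
```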

5.2.5   Comparison of bar and pie charts

Bar charts and pie charts highlight different aspects of the data

Although a bar chart and a pie chart are visual representations of the same values (the proportions in the categories), they highlight different features of these proportions.

Bar charts provide better comparisons of the individual proportions, whereas pie charts allow us to assess the proportions in two or more adjacent categories. The latter is particularly useful if the categories are ordered or split into meaningful groups.

5.2.6   Chartjunk for pie charts

Chartjunk

Resist the temptation to 'enhance' pie charts with chartjunk. In particular, 3-dimensional pie charts can over-emphasise the categories closest to the viewer.

In general, it is better to draw a standard pie chart smaller rather than embellishing it with chartjunk.

5.2.7   Bar and pie charts for quantities

Bar charts for quantities

Bar charts are most often used to show frequencies for discrete or categorical data but they can be used to display any quantity data. (Quantity data are 'amounts' of something and are always positive. They are also called ratio variables.)

Pie charts for quantities

Pie charts can also be used to display quantity data, but there is an additional requirement that must be satisfied before a pie chart is used. The total of all the data that are displayed must itself be meaningful.

In the published pie chart below, the individual values are death rates per 100,000 live births, so their total is meaningless. A pie chart therefore should not have been used.

5.3   Comparing groups

5.3.1   Contingency tables

We are often interested in whether a categorical distribution is the same in two or more groups of individuals. The categorical data in each group can be expressed as a frequency table. Combining these frequency tables into a single rectangular array gives a contingency table.

Categorical variables and groups

The raw data may be a list of values from each of several groups (as above) or the groups may be specified by a categorical variable in a single data matrix.

5.3.2   Contingency table examples

A contingency table may arise from an experiment (where one variable is controlled by the experimenter) or a survey (where there is no control over the individuals).

Example (from experiment)

A company that produces and markets video continuing education programs for the financial industry has traditionally emailed sample videos with previews of the programs to prospective customers. To find out whether giving temporary access to the full videos would increase the number of purchases, 40 contacts were randomly selected from the company's mailing list and given access to full videos, and another 40 were sent sample videos.

                Purchased   Not purchased
Sample video         6            34
Full video          14            26

Example (from survey)

Urine drug screening was performed on 2537 applicants for career craft positions in the US Postal Service's Boston Management Sectional Center. The contingency table below shows the distribution of test results, split by gender. (Those testing positive for more than one drug were classified under the more serious of the drugs, so each individual only contributed to a single cell in the table.)

          Negative   Marijuana   Cocaine   Other drugs   Total
Male        1465        146         33          28        1672
Female       764         52         22          27         865

5.3.3   Bar charts using proportions

Proportions within groups

To compare the distributions of a categorical variable in different groups, it is best to examine the proportions within the groups — the cell frequencies divided by their group totals.

The table below shows the work settings of all enrolled nurses in Australia in 1993, 1996 and 1999.

                        Workplace
Year   Hospitals   Aged homes   Community   Other    Total
1993     19,981      14,714       1,717     6,255   42,667
1996     20,367      11,899       1,571     4,860   38,697
1999     19,847      10,376       2,315     3,159   35,700

From the table of within-year percentages below, it is clearer that the percentage of nurses working in hospitals has increased and the percentage working in aged homes has decreased.

                        Workplace
Year   Hospitals   Aged homes   Community   Other    Total
1993      46.8        34.5         4.0       14.7    100.0
1996      52.6        30.7         4.1       12.6    100.0
1999      55.6        29.1         6.5        8.8    100.0
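The within-year percentages can be reproduced from the counts of enrolled nurses with a few lines of Python. This is only a sketch: the variable and function names are illustrative, and each cell is divided by the sum of the four workplace counts in its row.

```python
workplaces = ["Hospitals", "Aged homes", "Community", "Other"]

# Counts of enrolled nurses by year and workplace (from the table of counts).
counts = {
    1993: [19981, 14714, 1717, 6255],
    1996: [20367, 11899, 1571, 4860],
    1999: [19847, 10376, 2315, 3159],
}

def row_percentages(row):
    """Cell frequencies divided by their group (row) total, as percentages."""
    total = sum(row)
    return [round(100 * c / total, 1) for c in row]

within_year = {year: row_percentages(row) for year, row in counts.items()}

for year, pcts in within_year.items():
    print(year, dict(zip(workplaces, pcts)))
```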

Bar charts of proportions

Bar charts can be used to graphically compare groups and it is again best to use proportions within groups rather than raw frequencies, especially if the groups are of different sizes.

Clustering the bars

Each cluster of bars above is a valid bar chart for one group. Alternatively, the same bars can be clustered by the variable of interest:

This makes it easier to make comparisons between the groups.

5.3.4   Stacked bar charts

Stacking the bars

The bars for each group in a bar chart can be stacked to help make comparisons between the groups. Stacked bar charts are particularly effective when the categorical variable is ordinal (has categories that can be meaningfully ordered).

The diagram below is a stacked bar chart showing results from a questionnaire sent to account holders at a bank.

The age distribution of account holders is clearest from this bar chart. By stacking the proportions within age groups, this information is lost but it is easier to see that a bigger proportion consider the service to be excellent in the oldest two age groups and almost 20% in the 18-30 age group consider the service to be only acceptable or worse.

5.3.5   Two special cases

Time series

When sets of categorical measurements are recorded at successive times, time can be treated as a grouping variable.

The diagram below shows the increasing percentage of True values.

Binary variables

When the variable of interest can take only two possible values, it is called a binary variable. If the proportions in each group for one of these values are small, the bars for this category can be shown with an expanded vertical scale. No information is lost, since the proportion in the other category is one minus the proportion shown.

5.4   Bivariate categorical distributions

5.4.1   Relationships between variables

Bivariate data without an explanatory variable

In some bivariate categorical data sets, one variable can be treated as a response whose value depends on the other explanatory variable. The explanatory variable can then be used to split the individuals into groups.

In other bivariate data, the relationship between the variables is more symmetrical but we still want to discover whether particular values of one variable are associated with values of the other. A contingency table again summarises the data.

                 Variable X
Variable Y     X1     X2     X3
Y1            105      7     11
Y2             58      5     13
Y3             84     37     42
Y4             57     16     17

5.4.2   3-dimensional bar charts

Graphical display in a bar chart

If we do not want to classify the variables in a contingency table as a response and explanatory variable, the data can be displayed with a 3-dimensional bar chart.

Three-dimensional bar charts are 'interesting' but there are more informative ways to display the data.

Chartjunk and perspective displays

Beware of adding chartjunk and perspective viewpoints to the display — they just make it harder to understand the data.

5.4.3   Clustered bar charts

Clustering bars in 2-dimensional bar chart

Rather than using a 3-dimensional bar chart, it is usually easier to assess the relationships between two variables from 2-dimensional bar charts. The bars can be clustered by either variable and it is often informative to examine both of these displays.

5.4.4   Marginal distributions

Marginal counts

Although our main interest is usually in the relationship between two categorical variables, it can also be of interest to examine the overall distribution of each variable separately. These are called the marginal distributions of the two variables and are determined by the row and column totals of the contingency table.

                     Variable X
Variable Y     X1     X2     X3     X4    Total
Y1              2      3     57      6      68
Y2             52    170    163     17     402
Y3            156    125     61      6     348
Y4            220     83     39      4     346
Total         430    381    320     33

The row and column totals correspond to the heights of the stacks in stacked bar charts. For example, the above row totals are the heights of the stacks in the following diagram.

 


5.4.5   Conditional distributions

Splitting into groups

If the values of X are used to split the individuals into groups, the conditional distributions of Y given X are the distributions within each of these groups. They are found by dividing the cell counts by the totals for each such group. The columns of the table below show the conditional distributions for a contingency table, expressed as percentages.

Conditional percentages for Y, given X
                     Variable X
Variable Y      X1      X2      X3      X4
Y1              0.5     0.8    17.8    18.2
Y2             12.1    44.6    50.9    51.5
Y3             36.3    32.8    19.1    18.2
Y4             51.2    21.8    12.2    12.1
Total         100.0   100.0   100.0   100.0

The conditional distributions of X given Y are similarly found by using Y to create the groups of individuals. They are found by dividing the cell counts by the totals in the other margin of the original contingency table.

Conditional percentages for X, given Y
                     Variable X
Variable Y      X1      X2      X3      X4    Total
Y1              2.9     4.4    83.8     8.8   100.0
Y2             12.9    42.3    40.5     4.2   100.0
Y3             44.8    35.9    17.5     1.7   100.0
Y4             63.6    24.0    11.3     1.2   100.0

Both tables of conditional proportions (or percentages) are often informative.
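A short Python sketch of these calculations, using the cell counts from the contingency table in the section on marginal distributions. The variable names are illustrative only. Dividing by column totals gives the conditional distributions of Y given X, and dividing by row totals gives those of X given Y.

```python
x_levels = ["X1", "X2", "X3", "X4"]

# Cell counts: one row per value of Y, one column per value of X.
table = {
    "Y1": [2, 3, 57, 6],
    "Y2": [52, 170, 163, 17],
    "Y3": [156, 125, 61, 6],
    "Y4": [220, 83, 39, 4],
}

# Marginal counts: row totals for Y, column totals for X.
row_totals = {y: sum(row) for y, row in table.items()}
col_totals = [sum(row[j] for row in table.values())
              for j in range(len(x_levels))]

# Conditional percentages for Y given X: divide by the column total.
y_given_x = {
    y: [round(100 * c / col_totals[j], 1) for j, c in enumerate(row)]
    for y, row in table.items()
}

# Conditional percentages for X given Y: divide by the row total.
x_given_y = {
    y: [round(100 * c / row_totals[y], 1) for c in row]
    for y, row in table.items()
}
```

Each column of `y_given_x` and each row of `x_given_y` adds to 100, apart from rounding.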

5.4.6   More about conditional distributions (advanced)

Conditional distributions of X given Y and Y given X

The conditional proportions for X given Y can be quite different from the corresponding conditional proportions for Y given X and you must be careful to distinguish between them.

As an extreme example, under 5% of women are pregnant at any time, but 100% of pregnant people are women!

5.4.7   Conditional vs marginal distributions

Conditional and marginal distributions

The distinction between the marginal distribution of a variable and its conditional distributions is illustrated by the following contingency table, which describes bruising of 96 apples in a packing plant.

                OK    Bruised
Granny Smith    40       8
Fuji            24      24
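For these counts, the marginal and conditional proportions of bruised apples can be worked out directly. The Python sketch below is only one way to lay out the calculation; the variable names are illustrative.

```python
# Counts of apples by variety and condition (from the table above).
apples = {
    "Granny Smith": {"OK": 40, "Bruised": 8},
    "Fuji": {"OK": 24, "Bruised": 24},
}

# Marginal proportion bruised: ignore variety and use all 96 apples.
total = sum(sum(row.values()) for row in apples.values())
bruised = sum(row["Bruised"] for row in apples.values())
marginal_bruised = bruised / total

# Conditional proportions bruised: look within each variety separately.
conditional_bruised = {
    variety: row["Bruised"] / sum(row.values())
    for variety, row in apples.items()
}
```

The marginal proportion (32 of 96, one third) lies between the two conditional proportions (8 of 48 for Granny Smith and 24 of 48 for Fuji).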

The diagram below shows the apples, arranged in rows by variety.

Observe that:

The apples can be rearranged as follows:

Now observe that:

5.5   Presenting data in tables

5.5.1   Gridlines and white space

Tables from spreadsheets

Never publish tables in which all values are boxed (the default format for tables produced by many spreadsheets). Consider using a bold typeface for headings or using extra white space to separate rows and columns as an alternative to lines.

Reason                   No.       %
Needle/Surg. Injuries    279       0.2
Rape                     1502      0.8
TB                       1564      0.9
STI                      2745      1.5
Med Exam                 4717      2.6
Clinical Suspicion       15387     8.5
PMTCT                    45590     25.0
VCT                      102443    56.3
Other                    7825      4.3

The table below presents the data more effectively.

Reason                       No.       %
Needle/Surg. Injuries        279      .2
Rape                       1,502      .8
TB                         1,564      .9
STI                        2,745     1.5
Med Exam                   4,717     2.6
Clinical Suspicion        15,387     8.5
PMTCT (pregnancy)         45,590    25.0
VCT (voluntary)          102,443    56.3
Other                      7,825     4.3

Large tables

In large tables, it can be difficult to read across rows. To help the eye to match values on the same line, hairlines can be drawn between occasional rows, or some rows can be printed on a very light grey background.

5.5.2   Layout and annotation

Layout

Think carefully about how to arrange the rows and columns. Values that we want to compare should be close to each other, ideally in a column. Judicious use of white space can help to show the structure of complex tables.

The layout above has little structure. The table below contains the same information but is easier to understand.

Annotation

When a table is included in a report, the main information that can be gained from the table should also be summarised in the body of the report in words.

Do not simply repeat the values in the table. The annotation should summarise and interpret.

5.5.3   Significant digits and data noise

Signal and noise

The useful information in a graphical or tabular display of data is called its signal. Parts of the display that do not contain information that can be usefully interpreted are called noise. We can distinguish:

Non-data noise
This 'non-data ink' includes unnecessary graphics and gridlines that have been added to displays.
Data noise
This is information from the data that is displayed but does not help the reader to understand the 'signal'.

Noise makes it harder to detect the signal in a display and should be avoided.

Significant digits

Many tables contain values that are reported with more significant digits than necessary. Usually the pattern of values in a table can be understood from only their first 2 or 3 digits — the remaining digits are data noise.

Reducing the number of significant digits and rearranging the columns makes the information easier to understand.
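Rounding to a fixed number of significant digits can be sketched in Python as below. The helper name `round_sig` is hypothetical, and the example values are counts from the table in the section on gridlines and white space.

```python
import math

def round_sig(value, digits=2):
    """Round `value` to `digits` significant digits."""
    if value == 0:
        return 0
    # Position of the leading digit: 0 for units, 1 for tens, and so on.
    exponent = math.floor(math.log10(abs(value)))
    return round(value, digits - 1 - exponent)

counts = [279, 15387, 45590, 102443]
rounded = [round_sig(c) for c in counts]
```

With two significant digits, the relative sizes of the counts are still clear, while the trailing digits no longer distract.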

5.5.4   Meaningful variables

Percentages and proportions

It is often easier to understand proportions (or percentages) than raw counts. This is particularly important for comparing groups of individuals. The table below shows the origin and 'lifestage' of tourists (in thousands) arriving in Hawaii in 2005.

                          US West   US East     Japan   Canada    Europe
Wedding/honeymoon           103.1     110.0     192.7      8.0     131.5
Family (with children)      667.1     297.1     485.6     44.5      94.4
Young (18-34)               403.3     243.1     229.1     38.8     210.1
Middle aged (35-54)         955.2     634.7     308.0     75.1     374.2
Seniors (55+)               903.7     643.5     303.5     82.3     314.6
Total                     3,032.5   1,929.3   1,517.4    248.6   1,123.7

The information is easier to understand as percentages within each country of origin. Scanning across rows in the table below, the highlighted percentages stand out as 'unusual'.

                          US West   US East   Japan   Canada   Europe
Wedding/honeymoon             3.4       5.7    12.7      3.2     11.7
Family (with children)       22.0      15.4    32.0     17.9      8.4
Young (18-34)                13.3      12.6    15.1     15.6     18.7
Middle aged (35-54)          31.5      32.9    20.3     30.2     33.3
Seniors (55+)                29.8      33.3    20.0     33.1     28.0
Total                       100.0     100.0   100.0    100.0    100.0

Ratios

It is sometimes better to divide values by some measure of 'size' before analysis or display.

5.5.5   Swapping rows and columns

Comparing values down columns

It is easiest to compare values if they are close together in a table. Layout and white space should be chosen to encourage comparison of related values.

In particular, it is easier to compare values down columns than across rows — their most significant digits are closer — so carefully consider whether to swap the rows and columns of a table.

5.5.6   Reordering rows

Order for the rows of a table

In many tables, the rows are ordered alphabetically by their row names, but it is usually better to reorder them in another meaningful way.

If there is no better ordering, sort the rows into decreasing order of the values in the column of most interest.

5.5.7   Example

Tourist arrivals in South Africa

The following table was published as part of a report on tourism in South Africa. It describes the origin of tourist arrivals in 2004 and the amounts that they spent in South Africa (excluding capital expenditure).

This table can be improved by removing grid lines, decreasing the number of significant digits, and reordering the countries within each region.

5.6   Logistic regression

5.6.1   Categorical responses

Comparing the response distributions at different x-values

If the response, Y, is numerical and the explanatory variable, X, is categorical, box plots can be used to compare the response distribution at the different x-values.

If the response, Y, is categorical and the explanatory variable, X, is numerical, we are again interested in comparing the response distribution at different x-values. We might use X to define 'groups' by splitting its values into classes (as might be done to draw a histogram) and this allows us to use stacked bar charts to describe the relationship.

It is not necessary for the 'classes' to be of equal width. For example, some of the age groups below are of width 3 months, whereas others are 6 months and the extreme classes are wider still.

5.6.2   Fitted values and predictions

Linear model

It is tempting to try a linear model to explain how the proportion in one response category is affected by the explanatory variable,

predicted proportion $= b_0 + b_1 x$

Unfortunately this may result in predicted proportions greater than 1 or less than 0.

Nonlinear models

We should use a model that gives values between 0 and 1 for all possible values of X. This means that the equation must be nonlinear in X.

5.6.3   Logistic curve

A curve that lies between 0 and 1 for all values of x

Various nonlinear equations have values between 0 and 1 for all values of x, but the simplest of these is a logistic curve,

predicted proportion $= \dfrac{e^{b_0 + b_1 x}}{1 + e^{b_0 + b_1 x}}$

The parameters of the logistic curve

The parameter b1 is called the slope of the curve. Increasing it makes the curve steeper, and its sign determines whether the curve slopes upwards or downwards.

The parameter b0 is the curve's intercept and it determines the horizontal position of the curve. Increasing it shifts the curve to the left.
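The effect of the two parameters can be explored with a small Python sketch. It assumes the usual form of the logistic curve, exp(b0 + b1 x) / (1 + exp(b0 + b1 x)); the function names and parameter values are illustrative only.

```python
import math

def logistic(x, b0, b1):
    """Logistic curve: exp(b0 + b1*x) / (1 + exp(b0 + b1*x))."""
    z = b0 + b1 * x
    # Two algebraically equivalent forms, chosen to avoid overflow in exp().
    if z >= 0:
        return 1 / (1 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1 + ez)

def midpoint(b0, b1):
    """x-value at which the predicted proportion is 0.5 (needs b1 != 0)."""
    return -b0 / b1

# Predicted proportions stay between 0 and 1 over a range of x-values.
curve = [logistic(x, -1.0, 0.5) for x in range(-20, 21)]
```

With a positive slope, the curve passes through 0.5 at x = -b0/b1, so increasing b0 moves this point to the left, as described above.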

5.6.4   Obtaining a good fit

Estimating the logistic parameters

Estimating the parameters b0 and b1 of a logistic model is more difficult than estimating the parameters for a linear model by least squares, but many statistical programs will do the appropriate calculations for you.

We therefore take a 'black box' approach and simply show what parameter estimation gives without further justification.
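For readers curious about what is inside the 'black box', the sketch below shows one common approach: maximum likelihood estimation by Newton-Raphson iteration. This is an assumption about what a statistical program might do, not a description of any particular package, and the data are made up purely for illustration.

```python
import math

def fit_logistic(xs, ys, iterations=25):
    """Maximum-likelihood estimates (b0, b1) by Newton-Raphson.

    xs: numerical explanatory values; ys: 0/1 responses.
    """
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        # Gradient (g0, g1) and information matrix [[h00, h01], [h01, h11]]
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = 1 / (1 + math.exp(-(b0 + b1 * x)))
            w = p * (1 - p)
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system H * step = g
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Hypothetical data: x-values and whether each individual is in the
# response category of interest (1) or not (0).
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]

b0, b1 = fit_logistic(xs, ys)
fitted = [1 / (1 + math.exp(-(b0 + b1 * x))) for x in xs]
```

The sketch has no convergence check and would fail if the two response categories were perfectly separated by x, so in practice it is safer to rely on an established statistical program.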