Numerical and categorical data
In a data set, a numerical variable records a number for each individual. A categorical variable classifies each individual into one of several groups. For example, an investigation of the religions with which a group of 100 individuals identify might result in the 100 values,
catholic, anglican, atheist, anglican, muslim, ...
In many data sets, the values are not ordered in any meaningful way. For example, the 100 individuals above were not surveyed in any particular order. (If the data were collected in order, time series methods should be used to analyse them.) We only consider unordered categorical data in this chapter.
Frequency tables
An unordered numerical data set holds much detailed information about the distribution of values. (A dot plot shows full information about the distribution, though we may choose to summarise with a histogram or summary statistics.)
In contrast, an unordered categorical data set contains much less information. The frequency of a category is the number of times that category occurs in the data set.
Because the order of the values carries no meaning, the frequencies capture all the information about the distribution of values.
These frequencies are usually presented as a frequency table.
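As a minimal sketch of how such a table might be computed, the Python below counts the five religion values quoted earlier (the other 95 values are not listed in the text, so they are not included). The function name frequency_table is an illustrative choice, not something defined in this chapter.

```python
from collections import Counter

# The five values quoted in the text; the remaining 95 are not given there.
religions = ["catholic", "anglican", "atheist", "anglican", "muslim"]

def frequency_table(values):
    """Return a dict mapping each distinct category to its frequency."""
    return dict(Counter(values))

table = frequency_table(religions)

# Print one row per category, in alphabetical order.
for category, count in sorted(table.items()):
    print(f"{category:<10}{count:>3}")
```

The frequencies always add up to the number of individuals, which is a useful check on any frequency table.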
Rice survey
As part of a survey of rice producers in Sri Lanka, 36 farmers were randomly selected from 4 villages. Each sampled farmer was asked about the variety of rice that he used and the varieties were categorised into 'Old', 'Traditional' or 'New'. The 36 resulting categorical values are shown on the left of the diagram below.
To calculate the frequencies for each of the three types of rice by hand, you would work through the table of values, drawing a line against the appropriate category name for each value (a tally). These tallies would finally be counted to give the frequencies.
Click on each of the categorical values in turn to illustrate how the tallies and frequencies are obtained.
The final table of frequencies on the right summarises usage of the three types of rice. The frequency table contains all information about the distribution of rice types.
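The tallying procedure described above can be imitated in a few lines of Python. This is only a sketch: the six values below are made up for illustration and are not the 36 responses from the survey, and the function name tally is hypothetical.

```python
# Made-up values for illustration only; NOT the survey's 36 responses.
values = ["New", "Old", "New", "Traditional", "New", "Old"]
categories = ["Old", "Traditional", "New"]

def tally(values, categories):
    """Work through the values, adding one tally mark per value,
    then count the marks to obtain the frequencies."""
    marks = {category: "" for category in categories}
    for value in values:
        marks[value] += "|"          # draw a line against the category
    frequencies = {category: len(m) for category, m in marks.items()}
    return marks, frequencies

marks, frequencies = tally(values, categories)

# Show category, tally marks and frequency side by side.
for category in categories:
    print(f"{category:<12}{marks[category]:<8}{frequencies[category]}")
```

Each value contributes exactly one mark, so the number of marks against a category is its frequency.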
Examining one variable from many
In surveys like the rice survey above, several measurements are often recorded from each participant. Although an in-depth analysis of the data would investigate the relationships between the variables, it is often useful to first examine the distributions of the variables one at a time.
Rice survey
In the rice survey that was described above, five variables were measured from each farmer.
Frequency tables could be used to summarise the categorical variables, whereas dot plots could summarise the distributions of the three numerical variables. The diagram below shows the data in tabular form, and we will again build up the frequency distribution of the rice types.
Click on each row (farmer) in turn to build up the frequency table.
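When each individual has several recorded variables, a frequency table for one of them is obtained by picking out that variable from every row and counting as before. The sketch below assumes the data are held as a list of records. All field names and values are invented for illustration (the text does not name the five survey variables), and only three hypothetical fields are shown.

```python
from collections import Counter

# Hypothetical records: the field names and every value are invented,
# and do not come from the rice survey itself.
farmers = [
    {"village": "A", "rice_type": "New", "area": 1.2},
    {"village": "A", "rice_type": "Old", "area": 0.8},
    {"village": "B", "rice_type": "New", "area": 2.0},
    {"village": "B", "rice_type": "Traditional", "area": 1.5},
]

def column(records, name):
    """Extract a single variable (one value per row) from the records."""
    return [record[name] for record in records]

# Examine one variable at a time: here, the rice type.
rice_counts = Counter(column(farmers, "rice_type"))

for rice_type, count in rice_counts.items():
    print(f"{rice_type:<12}{count}")
```

The same column function could be used to pull out a numerical variable for a dot plot or summary statistics.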