Categorical variables and groups
A categorical variable can be used to split the individuals in a data set into groups. For example, the schools in a school census may be recorded as government or private. We could use the 'school type' variable to split them into a group of government schools and a group of private schools, and then examine the class sizes of each type of school separately.
Conversely, if data were separately collected from different groups of individuals, the resulting data sets could be combined with a categorical variable distinguishing between the groups. For example, an experiment might be conducted to compare the lifetimes of three brands of electric light bulb. The three sets of lifetimes might be combined into a single data set with one numerical variable (the lifetimes) and a categorical variable to distinguish the three brands. This type of data set often arises from experiments.
A categorical variable and groups are often two ways of representing the same data.
Data presented in a separate list for each group are often called unstacked data. Data presented as a single list alongside a categorical variable are called stacked data.
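As a minimal sketch of the conversion from unstacked to stacked form, the Python below combines one list per group into a single list of values paired with a group label. The brand names, the lifetimes and the function name stack are all hypothetical, chosen only to illustrate the idea.

```python
# Hypothetical unstacked data: one list of lifetimes (hours) per brand.
# The brand names and values are invented for illustration only.
unstacked = {
    "Brand A": [980, 1020, 1005],
    "Brand B": [1100, 1075],
    "Brand C": [890, 940, 915],
}


def stack(groups):
    """Combine per-group lists into a single list of (value, group) rows."""
    return [
        (value, label)
        for label, values in groups.items()
        for value in values
    ]


# Stacked form: one numerical column and one categorical column.
stacked = stack(unstacked)
for lifetime, brand in stacked:
    print(lifetime, brand)
```

Each row of the stacked result holds one lifetime together with the brand it came from, so no information is lost when the three lists are combined.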
Rice survey
As part of a survey of rice producers in Sri Lanka, 36 farmers were randomly selected from 4 villages. The yield of rice (tonnes per hectare) was determined for each farmer.
These data are naturally presented as a separate list of yields for each village. Click on values on the left to see how they are represented using a categorical 'Village' variable in a data matrix.
Days off work from illness
The manager of a factory was concerned by the number of days that workers took off work each year due to illness. Personnel records for 60 workers were examined, and each worker's sick days were recorded along with an age group: old (defined as 40 or over) or young (under 40).
These data are naturally recorded in a data matrix with columns for days off work and for age group, but the ages of the workers can be used to split the other variable into groups. Click on the top row of the data table and drag down to see how the numerical values are split into groups.
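The split described here can be sketched in a few lines of Python. The records below are hypothetical (they are not the 60 workers from the factory), and split_by_group is an illustrative helper name. Once the values are separated by age group, each group can be summarised on its own, here with a mean.

```python
from statistics import mean

# Hypothetical stacked records: (days off work, age group).
# These values are invented and are not the factory data.
records = [
    (3, "young"), (12, "old"), (0, "young"),
    (7, "old"), (5, "young"), (2, "old"),
]


def split_by_group(rows):
    """Split (value, group) rows into one list of values per group."""
    groups = {}
    for value, label in rows:
        groups.setdefault(label, []).append(value)
    return groups


days_by_age = split_by_group(records)
for label, days in days_by_age.items():
    print(label, days, "mean:", round(mean(days), 2))
```

The order of values within each group follows their order in the original rows, so the split is simply a rearrangement of the same data.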