Data with one categorical and one numerical variable
We have previously examined bivariate data sets with...
This section briefly examines the remaining combination...
Numerical response and categorical explanatory variable
In some situations, the numerical variable must be treated as the response. Consider a large company that is trying to profile its employees. The annual income and educational level (degree, completed high school or did not complete high school) of each employee aged 25-29 were noted. For analysis, income should be treated as the response variable since educational level could affect income, but income could not affect educational level.
When the explanatory variable is categorical, it should be used to split the individuals into groups. The methods that were described earlier for comparison of numerical distributions can be used. For example, the distributions might be compared with box plots.
This diagram helps us to understand how income depends on education.
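As a rough sketch of the grouping step, the snippet below splits incomes by educational level and computes the five-number summary that a box plot would draw for each group. The income values are invented purely for illustration (they are not the company's data), and `five_number_summary` is a hypothetical helper name.

```python
from statistics import quantiles

# Hypothetical annual incomes (in $1000s), invented for illustration only.
incomes_by_education = {
    "degree": [38, 42, 45, 51, 60, 72],
    "completed high school": [24, 27, 30, 33, 35, 41],
    "did not complete high school": [15, 18, 20, 22, 26, 29],
}

def five_number_summary(values):
    """Return (min, lower quartile, median, upper quartile, max)."""
    # Quartile conventions differ between packages; "inclusive" is one choice.
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return (min(values), q1, q2, q3, max(values))

# One summary per category of the explanatory variable.
summaries = {
    level: five_number_summary(values)
    for level, values in incomes_by_education.items()
}

for level, summary in summaries.items():
    print(f"{level:30s} {summary}")
```

Comparing the medians and quartiles across the three groups gives the same information that would be read from side-by-side box plots.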
Categorical response and numerical explanatory variable
When the categorical variable is the response, a different analysis is required. If we were analysing the relationship between income and membership of an optional pension scheme in the above company, membership of a pension scheme should be treated as the response variable.
Analysis is harder, but we might split income into categories (e.g. under $20,000, $20,000 to $29,999, ...) and use this to split the individuals into groups. Stacked bar charts might then be used to display the relationship.
This diagram helps us to understand how the proportion in a pension scheme depends on income.
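The binning approach can be sketched as follows, assuming $10,000-wide income bands with everything under $20,000 pooled into a single band, as in the text. The employee records are invented for illustration, and `band_lower` is a hypothetical helper name.

```python
from collections import defaultdict

# Hypothetical (income in dollars, is pension-scheme member) records,
# invented for illustration only.
employees = [
    (15_000, False), (18_500, False), (19_000, True),
    (22_000, False), (24_000, True), (28_000, True), (29_500, False),
    (31_000, True), (36_000, True), (38_000, True),
]

def band_lower(income):
    """Lower bound of the income band; everything under $20,000 is pooled."""
    return 0 if income < 20_000 else (income // 10_000) * 10_000

# band lower bound -> [number of members, total employees in band]
counts = defaultdict(lambda: [0, 0])
for income, is_member in employees:
    band = band_lower(income)
    counts[band][0] += int(is_member)
    counts[band][1] += 1

# Proportion in the pension scheme within each income band.
proportions = {
    band: members / total
    for band, (members, total) in sorted(counts.items())
}

for band, p in proportions.items():
    label = "under $20,000" if band == 0 else f"${band:,} to ${band + 9_999:,}"
    print(f"{label:20s} {p:.2f}")
```

Each proportion corresponds to the height of the "member" segment in one stacked bar, so the values can be compared directly across income bands.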
When there is no unique response...
In other situations, the classification of variables into a response and explanatory variable is less clear. If the two variables in the above study were income and whether the respondent had ever been married, either variable could plausibly affect the other.
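There are therefore two complementary ways to examine the association between the variables: treat income as the response and compare its distribution between the two marital groups, or treat marital status as the response and examine how the proportion ever married varies with income.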
The remainder of this section expands on how we might explain a categorical response in terms of a numerical explanatory variable.
The following example is not a business one, but is a nice example of data with a categorical response.
Menstruation and age
A study was conducted in Warsaw to determine the proportions of girls who had started menstruating at different ages. A total of 3,898 girls of various ages between 8 and 19 were asked whether they had started menstruating.
Age class (to nearest month) | Menstruating | Total girls
---|---|---
The response is a categorical variable with two possible values (menstruating or not menstruating). How does the proportion menstruating depend on the explanatory variable, age?
The bar charts below help to explain the relationship. The bar chart for each age group is centred on the middle age in the class.
Click the checkbox Stacked. Both the stacked and unstacked displays show clearly the increase in the proportion menstruating with age.
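A minimal sketch of the calculation behind such a display is given below: each age class is reduced to the proportion menstruating, which is then drawn as a text-based stacked bar. The counts are invented for illustration and are not the Warsaw data.

```python
# Hypothetical rows: (middle age of class in years, number menstruating, total girls).
# These counts are invented for illustration; they are NOT the Warsaw data.
age_classes = [
    (10.0, 0, 100),
    (12.0, 30, 120),
    (13.0, 60, 100),
    (15.0, 95, 100),
]

BAR_WIDTH = 40  # characters used to draw each stacked bar

proportions = []
for mid_age, menstruating, total in age_classes:
    p = menstruating / total
    proportions.append(p)
    filled = round(p * BAR_WIDTH)
    # '#' marks the proportion menstruating, '.' the proportion not menstruating.
    bar = "#" * filled + "." * (BAR_WIDTH - filled)
    print(f"age {mid_age:4.1f}  {bar}  {p:.2f}")
```

Because every bar has the same total length, the unequal number of girls in each age class does not distort the comparison between ages.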
Bad displays of the data
Choose the option Frequency from the pop-up menu. There are two problems with the stacked and unstacked bar charts of the counts.