Data sets with two categorical variables
Bivariate categorical data sets are usually summarised with a contingency table.
For example, a study examined 686 tourists and classified each by educational level and by whether they were 'information seekers' (who requested destination-specific literature from travel agents) or 'non-seekers':
Information seeker?

| Education | Yes | No | Total |
|---|---|---|---|
| Some high school | 13 | 27 | 40 |
| High school degree | 64 | 118 | 182 |
| Some college | 100 | 123 | 223 |
| College degree | 59 | 69 | 128 |
| Graduate degree | 67 | 46 | 113 |
| Total | 303 | 383 | 686 |
Joint probabilities
Bivariate categorical data can be modelled as a random sample from an underlying population of pairs of categorical values. The population proportion for each pair (x, y) is denoted by $p_{xy}$ and is called the joint probability for (x, y).
In games of chance, we can often work out the joint probabilities exactly. For example, if a gambler draws a card from a shuffled deck (recording its suit) and also tosses a coin, there are eight possible suit-and-coin combinations, each with joint probability 1/8.
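As a minimal sketch of this calculation (the variable names are ours, purely illustrative), the eight combinations and their joint probabilities can be enumerated in Python:

```python
from itertools import product

# Four equally likely suits (13 cards each in a 52-card deck) and two
# equally likely coin faces give eight (suit, face) combinations.
suits = ["hearts", "diamonds", "clubs", "spades"]
faces = ["heads", "tails"]

# The suit and the coin are independent, so each joint probability
# is 1/4 * 1/2.
joint = {(s, f): (1 / 4) * (1 / 2) for s, f in product(suits, faces)}

for pair, p in joint.items():
    print(pair, p)            # every pair has joint probability 0.125

print(sum(joint.values()))    # the joint probabilities sum to 1.0
```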
Probabilities for a single variable
A model for two categorical variables is characterised by the joint probabilities $p_{xy}$.

The marginal probability, $p_x$, for a variable X is the proportion of (x, y) pairs in the population with X = x. It is found by adding the joint probabilities of all pairs with this x-value:

$$p_x = \sum_y p_{xy}$$

There is a similar formula for the marginal probabilities of the other variable:

$$p_y = \sum_x p_{xy}$$
Example
In the following example, the marginal probabilities for X are the row of totals under the table, and the marginal probabilities for Y are the column of totals on the right.
| Variable Y | X = A | X = B | X = C | Total |
|---|---|---|---|---|
| Y = 1 | 0.2576 | 0.1364 | 0.1212 | 0.5152 |
| Y = 2 | 0.0909 | 0.0758 | 0.0152 | 0.1818 |
| Y = 3 | 0.0455 | 0.0758 | 0.0606 | 0.1818 |
| Y = 4 | 0.0152 | 0.0303 | 0.0758 | 0.1212 |
| Total | 0.4091 | 0.3182 | 0.2727 | 1.0000 |
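Assuming the joint probabilities are stored as a NumPy array, the marginal totals in this table can be reproduced by summing rows and columns (a sketch, not part of the original example):

```python
import numpy as np

# Joint probabilities p_xy: rows are Y = 1..4, columns are X = A, B, C.
p = np.array([
    [0.2576, 0.1364, 0.1212],
    [0.0909, 0.0758, 0.0152],
    [0.0455, 0.0758, 0.0606],
    [0.0152, 0.0303, 0.0758],
])

p_x = p.sum(axis=0)   # marginals for X: column sums
p_y = p.sum(axis=1)   # marginals for Y: row sums

# Matches the table's "Total" row and column to within rounding error
# in the four-decimal joint probabilities.
print(p_x)   # ~ 0.4091, 0.3182, 0.2727
print(p_y)   # ~ 0.5152, 0.1818, 0.1818, 0.1212
```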
Probabilities in a sub-population
Conditional probabilities for Y, given X = x
The general definition of the conditional probability for Y, given that the value of X is x, is

$$p_{y \mid x} = \frac{p_{xy}}{p_x}$$

These conditional probabilities can be found by rescaling the corresponding row of the table of joint probabilities (dividing by $p_x$) so that the row sums to 1.0.
Two sets of conditional probabilities
Conditional probabilities for X given that Y has the value y are defined in a similar way:

$$p_{x \mid y} = \frac{p_{xy}}{p_y}$$
You should be careful to distinguish between $p_{x \mid y}$ and $p_{y \mid x}$.

For example, the probability of being pregnant, given that a randomly selected person is female, is fairly small. The probability of being female, given that a person is pregnant, is 1.0!
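To make the distinction concrete for the table above, both sets of conditional probabilities can be computed by rescaling columns or rows of the joint probabilities (a NumPy sketch; array layout as in the previous snippet):

```python
import numpy as np

# Joint probabilities: rows are Y = 1..4, columns are X = A, B, C.
p = np.array([
    [0.2576, 0.1364, 0.1212],
    [0.0909, 0.0758, 0.0152],
    [0.0455, 0.0758, 0.0606],
    [0.0152, 0.0303, 0.0758],
])

p_x = p.sum(axis=0)             # marginals for X
p_y = p.sum(axis=1)             # marginals for Y

p_y_given_x = p / p_x           # rescale each column to sum to 1
p_x_given_y = p / p_y[:, None]  # rescale each row to sum to 1

# The two conditional tables are quite different.
print(p_y_given_x.round(3))
print(p_x_given_y.round(3))
print(p_y_given_x.sum(axis=0))  # each column sums to 1
print(p_x_given_y.sum(axis=1))  # each row sums to 1
```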
Proportional Venn diagrams
A proportional Venn diagram is drawn from the marginal probabilities of one variable and the conditional probabilities for the other variable. Rewriting the definition of conditional probabilities,

$$p_{xy} = p_x \times p_{y \mid x}$$

The area of any rectangle in the diagram therefore equals the joint probability of the categories it represents.
An alternative proportional Venn diagram can be drawn from the marginal probabilities of Y and the conditional probabilities of X given Y. The area of the rectangle corresponding to any pair (x, y) is again its joint probability, $p_{xy}$.
Example
The table below is based on the world population in 2002, categorised by region and by age group. It shows the joint probabilities for a randomly chosen person being in each age/region category.
| Region | Age 0-19 | Age 20-64 | Age 65+ |
|---|---|---|---|
| Africa and Near East | 0.085 | 0.073 | 0.006 |
| Asia | 0.215 | 0.315 | 0.035 |
| America, Europe and Oceania | 0.084 | 0.158 | 0.030 |
The two proportional Venn diagrams are shown below.
Note that the areas are the same in both diagrams — they are simply rearranged.
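The identity underlying both diagrams, $p_{xy} = p_x \, p_{y \mid x} = p_y \, p_{x \mid y}$, can be verified numerically for this table (a sketch; comments give the row and column order):

```python
import numpy as np

# Joint probabilities: rows are the three regions (in table order),
# columns are the age groups 0-19, 20-64, 65+.
p = np.array([
    [0.085, 0.073, 0.006],   # Africa and Near East
    [0.215, 0.315, 0.035],   # Asia
    [0.084, 0.158, 0.030],   # America, Europe and Oceania
])

p_region = p.sum(axis=1)                    # marginals for region
p_age = p.sum(axis=0)                       # marginals for age group

p_age_given_region = p / p_region[:, None]  # conditionals within regions
p_region_given_age = p / p_age              # conditionals within age groups

# Both decompositions recover the joint probabilities, i.e. the
# rectangle areas are the same in the two Venn diagrams.
print(np.allclose(p_region[:, None] * p_age_given_region, p))  # True
print(np.allclose(p_age * p_region_given_age, p))              # True
```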
Marginal and conditional probabilities can be found from joint probabilities (and vice versa)

We have used three types of probability to describe a model for two categorical variables: the joint probabilities, the marginal probabilities for the two variables, and the conditional probabilities for each variable given the value of the other. These sets of probabilities are closely related. Indeed, the model can be equivalently described by any of the following:

- the joint probabilities, $p_{xy}$
- the marginal probabilities for X, $p_x$, together with the conditional probabilities for Y given X, $p_{y \mid x}$
- the marginal probabilities for Y, $p_y$, together with the conditional probabilities for X given Y, $p_{x \mid y}$

Each can be found from the others using

$$p_{xy} = p_x \, p_{y \mid x} = p_y \, p_{x \mid y}, \qquad p_x = \sum_y p_{xy}, \qquad p_y = \sum_x p_{xy}$$
Bayes' theorem

In particular, note that it is possible to obtain the conditional probabilities for X given Y, $p_{x \mid y}$, from the marginal probabilities of X, $p_x$, and the conditional probabilities for Y given X, $p_{y \mid x}$. This can be expressed in a single formula, called Bayes' theorem,

$$p_{x \mid y} = \frac{p_x \, p_{y \mid x}}{\sum_{x'} p_{x'} \, p_{y \mid x'}}$$

but it is easier in practice to do the calculation in two steps, obtaining the joint probabilities, $p_{xy}$, in the first step. There are several important applications of Bayes' theorem.
Detection of fraudulent tax claims
Tax inspectors investigate some of the tax returns that are submitted by individuals if they think that some claims for expenses are too high or are unjustified. There are two possible types of error when an inspector decides whether or not to investigate a claim:

- a good claim may be investigated, and
- a bad claim may not be investigated.
Consider a procedure with

$$p_{\text{investigated} \mid \text{good claim}} = 0.1 \qquad\qquad p_{\text{not investigated} \mid \text{bad claim}} = 0.2$$

From these, we can also write

$$p_{\text{not investigated} \mid \text{good claim}} = 0.9 \qquad\qquad p_{\text{investigated} \mid \text{bad claim}} = 0.8$$

We will also assume that 10% of tax returns have bad claims,

$$p_{\text{bad claim}} = 0.10$$
From this information, we can find the probability that a claim is bad, given the decision about whether or not to investigate it.
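A sketch of the two-step calculation in Python (the variable names are ours):

```python
# Step 1: joint probabilities for claim type and decision.
p_bad = 0.10
p_good = 1 - p_bad                                            # 0.90

p_investigated_given_good = 0.1
p_not_investigated_given_bad = 0.2

p_bad_and_inv = p_bad * (1 - p_not_investigated_given_bad)    # 0.08
p_bad_and_not = p_bad * p_not_investigated_given_bad          # 0.02
p_good_and_inv = p_good * p_investigated_given_good           # 0.09
p_good_and_not = p_good * (1 - p_investigated_given_good)     # 0.81

# Step 2: condition on the decision (this is Bayes' theorem).
p_inv = p_bad_and_inv + p_good_and_inv                        # 0.17
p_bad_given_inv = p_bad_and_inv / p_inv                       # ~0.471
p_bad_given_not = p_bad_and_not / (1 - p_inv)                 # ~0.024

print(round(p_bad_given_inv, 3), round(p_bad_given_not, 3))
```

So under these assumptions, nearly half of the investigated claims are bad, but only about 2.4% of the uninvestigated ones are.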
Relationships
The relationship between two numerical variables can be summarised by a correlation coefficient and least squares line. Two categorical variables may also be related.
We say that two categorical variables are associated if knowledge of the value of one tells you something about the likely value of the other.
If the conditional distribution of Y given X = x depends on the value of x, we say that X and Y are associated.
Example
We illustrate the idea of association with an artificial example relating absenteeism of employees in a supermarket chain to their weight. The table below shows the joint probabilities for these employees.
Attendance record

| Weight group | Poor | Satisfactory | Above average | Marginal |
|---|---|---|---|---|
| Underweight | 0.0450 | 0.0900 | 0.0150 | 0.1500 |
| Normal | 0.0825 | 0.3025 | 0.1650 | 0.5500 |
| Overweight | 0.0500 | 0.1200 | 0.0300 | 0.2000 |
| Obese | 0.0300 | 0.0650 | 0.0050 | 0.1000 |
| Marginal | 0.1700 | 0.5400 | 0.2900 | 1.0000 |
A proportional Venn diagram displays the conditional probabilities for attendance, given weight category, graphically.
If we know that an employee has normal weight, the probability of above-average attendance is 0.30 (0.1650/0.5500), whereas for an overweight employee it is only 0.15 (0.0300/0.2000). Since the conditional probabilities for attendance given weight differ between weight categories, the two variables are associated.
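The same comparison can be made for every weight category by rescaling each row of the joint table (a NumPy sketch):

```python
import numpy as np

# Joint probabilities: rows are the weight groups (Underweight, Normal,
# Overweight, Obese), columns are Poor / Satisfactory / Above average.
p = np.array([
    [0.0450, 0.0900, 0.0150],
    [0.0825, 0.3025, 0.1650],
    [0.0500, 0.1200, 0.0300],
    [0.0300, 0.0650, 0.0050],
])

cond = p / p.sum(axis=1, keepdims=True)  # conditional on weight group
print(cond.round(2))
# The rows differ (P(above average) is 0.10, 0.30, 0.15, 0.05 for the
# four weight groups), so attendance and weight are associated.
```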
Independence
If the conditional probabilities for Y are the same for all values of X, then Y is said to be independent of X.
If X and Y are independent, knowing the value of X does not give us any information about the likely value for Y.
Example
An example of independence is given by the following table of joint probabilities for the weight category and work performance (as assessed by a supervisor) of supermarket employees.
Work performance

| Weight group | Poor | Satisfactory | Above average | Marginal |
|---|---|---|---|---|
| Underweight | 0.0225 | 0.1125 | 0.0150 | 0.1500 |
| Normal | 0.0825 | 0.4125 | 0.0550 | 0.5500 |
| Overweight | 0.0300 | 0.1500 | 0.0200 | 0.2000 |
| Obese | 0.0150 | 0.0750 | 0.0100 | 0.1000 |
| Marginal | 0.1500 | 0.7500 | 0.1000 | 1.0000 |
The proportional Venn diagram for this model is shown below.
The conditional probability of above average work performance is the same for all weight categories — knowing an employee's weight would not help you to predict their work performance. The two variables are therefore independent.
Mathematical definition of independence
If Y is independent of X, then for all values x and y,

$$p_{xy} = p_x \times p_y$$
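For the work-performance table above, this factorisation holds exactly, which can be checked with an outer product of the marginals (a sketch):

```python
import numpy as np

# Joint probabilities: rows are the weight groups, columns are
# Poor / Satisfactory / Above average work performance.
p = np.array([
    [0.0225, 0.1125, 0.0150],
    [0.0825, 0.4125, 0.0550],
    [0.0300, 0.1500, 0.0200],
    [0.0150, 0.0750, 0.0100],
])

p_row = p.sum(axis=1)   # marginals for weight: 0.15, 0.55, 0.20, 0.10
p_col = p.sum(axis=0)   # marginals for performance: 0.15, 0.75, 0.10

# Independence: every joint probability equals p_x * p_y.
print(np.allclose(np.outer(p_row, p_col), p))   # True
```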
Assessing independence from a sample
Independence is an important concept, but it is defined in terms of the joint population probabilities, and in most practical situations these are unknown. We must assess independence from a sample of individuals, that is, from a contingency table.
Example
The contingency table below categorises a sample of 214 individuals by gender and some other characteristic (possibly weight group or grade in a test).
| Grade | Male | Female | Total |
|---|---|---|---|
| A | 20 | 60 | 80 |
| B | 9 | 84 | 93 |
| C | 2 | 39 | 41 |
| Total | 31 | 183 | 214 |
Is this consistent with a model of independence of the characteristic and gender? (Are the probabilities of A, B and C grades the same for males and females?)
Estimated cell counts under independence
To assess independence, we first find the pattern of cell counts that is most consistent with independence in a contingency table with the observed marginal totals.
| Grade | Male | Female | Total |
|---|---|---|---|
| A | ? | ? | 80 |
| B | ? | ? | 93 |
| C | ? | ? | 41 |
| Total | 31 | 183 | 214 |
The pattern that is most consistent with independence has the following estimated cell counts:

$$e_{xy} = \frac{n_x \times n_y}{n}$$

where n denotes the total for the whole table and $n_x$ and $n_y$ denote the marginal totals for row x and column y.
Applying this to our example gives the following table:

| Grade | Male | Female | Total |
|---|---|---|---|
| A | 11.59 | 68.41 | 80 |
| B | 13.47 | 79.53 | 93 |
| C | 5.94 | 35.06 | 41 |
| Total | 31 | 183 | 214 |
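These estimated counts are just an outer product of the marginal totals divided by the overall total, as the formula above says (a sketch):

```python
import numpy as np

# Observed counts: rows are grades A, B, C; columns are Male, Female.
observed = np.array([
    [20, 60],
    [9, 84],
    [2, 39],
])

row_totals = observed.sum(axis=1)   # 80, 93, 41
col_totals = observed.sum(axis=0)   # 31, 183
n = observed.sum()                  # 214

# e_xy = n_x * n_y / n for every cell.
expected = np.outer(row_totals, col_totals) / n
print(expected.round(2))
# [[11.59 68.41]
#  [13.47 79.53]
#  [ 5.94 35.06]]
```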
Comparison of observed and estimated cell counts
We test for independence with the hypotheses:
H0 : X and Y are independent
HA : X and Y are dependent
The test asks whether the observed and estimated cell counts are 'sufficiently close' — are the observed counts consistent with the counts estimated under independence?
The observed counts are shown below, with the estimated counts under independence in parentheses:

| Grade | Male | Female | Total |
|---|---|---|---|
| A | 20 (11.59) | 60 (68.41) | 80 |
| B | 9 (13.47) | 84 (79.53) | 93 |
| C | 2 (5.94) | 39 (35.06) | 41 |
| Total | 31 | 183 | 214 |
Possible test statistic?
A simple summary of how close the observed counts, $n_{xy}$, are to the estimated cell counts, $e_{xy}$, is the sum of squared differences,

$$\sum_{x,y} (n_{xy} - e_{xy})^2$$

Unfortunately this would be a bad test statistic: its distribution depends not only on the numbers of rows and columns in the table, but also on the number of individuals classified (the overall total for the table). A better test statistic is presented below.
A better test statistic
The following $\chi^2$ (pronounced chi-squared) statistic has much better properties than the raw sum of squares above:

$$\chi^2 = \sum_{x,y} \frac{(n_{xy} - e_{xy})^2}{e_{xy}}$$
Its distribution only depends on the number of rows and columns in the contingency table.
Distribution of chi-squared statistic
When there is independence, the χ2 statistic for a contingency table with r rows and c columns has approximately a standard distribution called a chi-squared distribution with (r - 1)(c - 1) degrees of freedom.
The mean of a chi-squared distribution equals its degrees of freedom, and the distribution is skewed to the right; its shape depends on the numbers of rows and columns in the table only through the degrees of freedom.
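A small simulation illustrates this behaviour (a sketch, assuming NumPy): tables are generated under independence, the χ² statistic is computed for each, and the average is compared with the degrees of freedom.

```python
import numpy as np

rng = np.random.default_rng(0)

# Cell probabilities for a 3x2 table that satisfy independence.
probs = np.outer([0.4, 0.4, 0.2], [0.3, 0.7]).ravel()

stats = []
for _ in range(10_000):
    table = rng.multinomial(200, probs).reshape(3, 2)
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / table.sum()
    stats.append(((table - expected) ** 2 / expected).sum())

# The mean should be close to (3 - 1) * (2 - 1) = 2 degrees of freedom,
# and a histogram of `stats` would be skewed to the right.
print(np.mean(stats))
```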
Testing for independence
H0 : X and Y are independent
HA : X and Y are dependent
The following test statistic is used:

$$\chi^2 = \sum_{x,y} \frac{(n_{xy} - e_{xy})^2}{e_{xy}}$$

Its p-value is found from the chi-squared distribution with (r - 1)(c - 1) degrees of freedom.
P-value
The p-value is interpreted in the same way as for other hypothesis tests. It describes the strength of evidence against the null hypothesis:
| p-value | Interpretation |
|---|---|
| over 0.1 | no evidence against the null hypothesis (independence) |
| between 0.05 and 0.1 | very weak evidence of dependence between the row and column variables |
| between 0.01 and 0.05 | moderately strong evidence of dependence between the row and column variables |
| under 0.01 | strong evidence of dependence between the row and column variables |
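In practice the whole calculation is usually a single library call. A sketch using SciPy's `chi2_contingency` for the grade-by-gender example:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts: rows are grades A, B, C; columns are Male, Female.
observed = np.array([
    [20, 60],
    [9, 84],
    [2, 39],
])

chi2, p_value, dof, expected = chi2_contingency(observed)

# chi-squared is about 11.9 on (3-1)*(2-1) = 2 degrees of freedom,
# giving a p-value around 0.003: strong evidence of dependence
# between the grade and gender.
print(f"chi2 = {chi2:.2f}, df = {dof}, p = {p_value:.4f}")
```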
Warning about low estimated cell counts
The $\chi^2$ test statistic has only approximately a chi-squared distribution. The p-value found from it can be relied on if:

- all estimated cell counts, $e_{xy}$, are at least 1, and
- at least 80% of the estimated cell counts are 5 or more.

If the cell counts are small enough that these conditions do not hold, the p-value is less reliable. (But advanced statistical methods are required to do better!)
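A small helper can check these conditions before the p-value is trusted (a sketch; the function name is ours and the thresholds are the rule of thumb stated above):

```python
import numpy as np

def conditions_ok(expected):
    """All estimated cell counts at least 1, and at least 80% of
    them at least 5."""
    e = np.asarray(expected, dtype=float)
    return bool(e.min() >= 1 and np.mean(e >= 5) >= 0.8)

# Estimated counts from the grade-by-gender example.
expected = np.array([
    [11.59, 68.41],
    [13.47, 79.53],
    [5.94, 35.06],
])
print(conditions_ok(expected))   # True -> the p-value can be trusted
```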
Contingency tables and groups
Contingency tables can arise either from bivariate categorical data or from univariate categorical data recorded separately in several groups.

The chi-squared test assesses independence in bivariate data. Exactly the same test can be used to compare the groups when the data are grouped.