
Chapter 13   Independence

13.1   Probability and applications

13.1.1   Joint probabilities

Data sets with two categorical variables

Bivariate categorical data sets are usually summarised with a contingency table.

For example, a study examined 686 tourists and classified each by educational level and by whether they were 'information seekers' (who requested destination-specific literature from travel agents) or 'non-seekers':

                        Information seeker?
Education               Yes     No    Total
Some high school         13     27       40
High school degree       64    118      182
Some college            100    123      223
College degree           59     69      128
Graduate degree          67     46      113
Total                   303    383      686

Joint probabilities

Bivariate categorical data can be modelled as a random sample from an underlying population of pairs of categorical values. The population proportion for each pair (x, y) is denoted by p_xy and is called the joint probability for (x, y).

In games of chance, we can often work out the joint probabilities. For example, if a gambler draws a card from a shuffled deck and also tosses a coin, there are eight possible combinations of card suit and coin result. Since the suit and the coin toss are unrelated, each combination has joint probability (1/4) × (1/2) = 1/8.
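The card-and-coin example can be checked with a short sketch (Python is not used in the text; the suit-times-coin interpretation of the eight combinations is as described above):

```python
from fractions import Fraction
from itertools import product

# The eight combinations: the card's suit (4 equally likely values)
# paired with the coin's face (2 equally likely values).
suits = ["hearts", "diamonds", "clubs", "spades"]
coin = ["head", "tail"]

# Suit and coin result are physically unrelated, so each joint
# probability is the product of the two marginal probabilities.
joint = {(s, c): Fraction(1, 4) * Fraction(1, 2) for s, c in product(suits, coin)}

print(len(joint))                  # 8
print(joint[("hearts", "head")])   # 1/8
```

Exact fractions are used so that the eight probabilities sum to exactly 1.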

13.1.2   Marginal probabilities

Probabilities for a single variable

A model for two categorical variables is characterised by the joint probabilities p_xy.

The marginal probability, p_x, for a variable X is the proportion of (x, y) pairs in the population with X = x. It can be found by adding all joint probabilities for pairs with this x-value,

    p_x  =  Σ_y p_xy

There is a similar formula for the marginal probabilities of the other variable,

    p_y  =  Σ_x p_xy

Example

In the following example, the marginal probabilities for X are the row of totals under the table, and the marginal probabilities for Y are the column of totals on the right.

Joint probabilities
  Variable X  
Variable Y X = A X = B X = C Total
Y = 1 0.2576 0.1364 0.1212 0.5152
Y = 2 0.0909 0.0758 0.0152 0.1818
Y = 3 0.0455 0.0758 0.0606 0.1818
Y = 4 0.0152 0.0303 0.0758 0.1212
Total 0.4091 0.3182 0.2727 1.0000
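The marginal totals in this table can be reproduced by summing the joint probabilities, as a minimal Python sketch (the dictionary simply transcribes the table above; Python is not part of the text):

```python
# Joint probabilities from the table above, indexed as (y, x).
p = {
    (1, "A"): 0.2576, (1, "B"): 0.1364, (1, "C"): 0.1212,
    (2, "A"): 0.0909, (2, "B"): 0.0758, (2, "C"): 0.0152,
    (3, "A"): 0.0455, (3, "B"): 0.0758, (3, "C"): 0.0606,
    (4, "A"): 0.0152, (4, "B"): 0.0303, (4, "C"): 0.0758,
}

# Marginal probability of X = x: add the joint probabilities down that column.
p_x = {x: sum(p[(y, x)] for y in (1, 2, 3, 4)) for x in ("A", "B", "C")}
# Marginal probability of Y = y: add the joint probabilities along that row.
p_y = {y: sum(p[(y, x)] for x in ("A", "B", "C")) for y in (1, 2, 3, 4)}

# The computed totals can differ from the printed table in the last
# decimal place, because the table entries are themselves rounded.
print(round(p_x["A"], 3))  # 0.409
print(round(p_y[1], 3))    # 0.515
```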

13.1.3   Conditional probabilities

Probabilities in a sub-population

Complete population
The joint probabilities p_xy and the marginal probabilities p_x and p_y all describe proportions in the complete population of (x, y) pairs.
Sub-population
In contrast, it is sometimes meaningful to restrict attention to a subset of the (x, y) pairs. For example, we may be interested only in pairs for which the first variable, X, has some particular value. Probabilities that relate to a sub-population are called conditional probabilities.

Conditional probabilities for Y, given X = x

The general definition of the conditional probabilities for Y, given that the value of X is x, is

    p_y|x  =  p_xy / p_x

They can be found by rescaling that row of the table of joint probabilities (dividing each entry by p_x) so that the row sums to 1.0.

Two sets of conditional probabilities

Conditional probabilities for X, given that Y has the value y, are defined in a similar way:

    p_x|y  =  p_xy / p_y

You should be careful to distinguish between p_x|y and p_y|x.

For example, the probability of being pregnant, given that a randomly selected person is female, is fairly small. The probability of being female, given that a person is pregnant, is 1.0!
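The distinction can be made concrete with the joint probability table from the previous page (a sketch only; the dictionary transcribes that table):

```python
# Joint probabilities from the earlier example table, indexed as (y, x).
p = {
    (1, "A"): 0.2576, (1, "B"): 0.1364, (1, "C"): 0.1212,
    (2, "A"): 0.0909, (2, "B"): 0.0758, (2, "C"): 0.0152,
    (3, "A"): 0.0455, (3, "B"): 0.0758, (3, "C"): 0.0606,
    (4, "A"): 0.0152, (4, "B"): 0.0303, (4, "C"): 0.0758,
}
p_x = {x: sum(p[(y, x)] for y in (1, 2, 3, 4)) for x in ("A", "B", "C")}
p_y = {y: sum(p[(y, x)] for x in ("A", "B", "C")) for y in (1, 2, 3, 4)}

# p_{y|x=A}: rescale the X = A column so that it sums to 1.
p_y_given_A = {y: p[(y, "A")] / p_x["A"] for y in (1, 2, 3, 4)}
# p_{x|y=1}: rescale the Y = 1 row so that it sums to 1.
p_x_given_1 = {x: p[(1, x)] / p_y[1] for x in ("A", "B", "C")}

print(round(p_x_given_1["A"], 2))  # 0.5
print(round(p_y_given_A[1], 2))    # 0.63 -- a quite different probability
```

The two conditional probabilities share the same joint probability, 0.2576, but divide it by different marginals, so they are not the same.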

13.1.4   Graphical display of probabilities

Proportional Venn diagrams

A proportional Venn diagram is drawn from the marginal probabilities of one variable and the conditional probabilities for the other variable.

Rewriting the definition of conditional probabilities,

    p_xy  =  p_x × p_y|x

The area of any rectangle in the diagram therefore equals the joint probability of the categories it represents.

An alternative proportional Venn diagram can be drawn from the marginal probabilities of Y and the conditional probabilities of X given Y. The area of the rectangle corresponding to any (x, y) is again its joint probability, p_xy.

Example

The table below is based on the world population in 2002, categorised by region and by age group. It shows the joint probabilities for a randomly chosen person being in each age/region category.

Joint probabilities
                                      Age
Region                        0-19    20-64    65+
Africa and Near East         0.085    0.073  0.006
Asia                         0.215    0.315  0.035
America, Europe and Oceania  0.084    0.158  0.030

The two proportional Venn diagrams are shown below.

Note that the areas are the same in both diagrams — they are simply rearranged.

13.1.5   Calculations with probabilities

Marginal and conditional probabilities can be found from joint probabilities (and vice versa)

We have used three types of probability to describe a model for two categorical variables: the joint probabilities, the marginal probabilities for the two variables, and the conditional probabilities for each variable given the value of the other. These sets of probabilities are closely related. Indeed, the model can be equivalently described by any of the following:

    the joint probabilities, p_xy
    the marginal probabilities for X and the conditional probabilities for Y given X  (p_x and p_y|x)
    the marginal probabilities for Y and the conditional probabilities for X given Y  (p_y and p_x|y)

Each can be found from the others:

    p_xy  =  p_x × p_y|x  =  p_y × p_x|y

Bayes theorem

In particular, note that it is possible to obtain the conditional probabilities for X given Y, p_x|y, from the marginal probabilities of X, p_x, and the conditional probabilities for Y given X, p_y|x. This can be expressed in a single formula that is called Bayes theorem,

    p_x|y  =  (p_x × p_y|x) / Σ_x' (p_x' × p_y|x')

but it is easier in practice to do the calculations in two steps, obtaining the joint probabilities, p_xy, in the first step. There are several important applications of Bayes theorem.

Detection of fraudulent tax claims

Tax inspectors investigate some of the tax returns that are submitted by individuals if they think that some claims for expenses are too high or are unjustified. There are two possible types of error when an inspector decides whether or not to investigate a claim: investigating a claim that is actually good, and failing to investigate one that is actually bad.

Consider a procedure with

    p_investigated | good claim  =  0.1           p_not investigated | bad claim  =  0.2

From these, we can also write

    p_not investigated | good claim  =  0.9           p_investigated | bad claim  =  0.8

We will also assume that 10% of tax returns have bad claims,

    p_bad claim  =  0.10

From this information, we can find the probabilities of a claim being bad, given the decision about whether or not to investigate it.
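The two-step calculation described above (joint probabilities first, then conditioning on the decision) can be sketched in Python using the numbers given in the text:

```python
# Given information.
p_bad = 0.10
p_good = 1 - p_bad
p_inv_given_good = 0.1
p_inv_given_bad = 0.8

# Step 1: joint probabilities, p(decision, claim) = p(claim) * p(decision | claim).
p_inv_and_good = p_good * p_inv_given_good        # 0.9 * 0.1 = 0.09
p_inv_and_bad = p_bad * p_inv_given_bad           # 0.1 * 0.8 = 0.08
p_not_and_good = p_good * (1 - p_inv_given_good)  # 0.9 * 0.9 = 0.81
p_not_and_bad = p_bad * (1 - p_inv_given_bad)     # 0.1 * 0.2 = 0.02

# Step 2: condition on the decision by dividing by its marginal probability.
p_bad_given_inv = p_inv_and_bad / (p_inv_and_bad + p_inv_and_good)
p_bad_given_not = p_not_and_bad / (p_not_and_bad + p_not_and_good)

print(round(p_bad_given_inv, 3))  # 0.471 -- most investigated claims are still good
print(round(p_bad_given_not, 3))  # 0.024
```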

13.2   Independence

13.2.1   Association

Relationships

The relationship between two numerical variables can be summarised by a correlation coefficient and least squares line. Two categorical variables may also be related.

We say that two categorical variables are associated if knowledge of the value of one tells you something about the likely value of the other.

If the conditional distribution of Y given X = x depends on the value of x, we say that X and Y are associated.

Example

We illustrate the idea of association with an artificial example relating absenteeism of employees in a supermarket chain to their weight. The table below shows the joint probabilities for these employees.

Joint Probabilities
                        Attendance record
Weight          Poor   Satisfactory   Above average   Marginal
Underweight   0.0450         0.0900          0.0150     0.1500
Normal        0.0825         0.3025          0.1650     0.5500
Overweight    0.0500         0.1200          0.0300     0.2000
Obese         0.0300         0.0650          0.0050     0.1000
Marginal      0.1700         0.5400          0.2900     1.0000

A proportional Venn diagram displays the conditional probabilities for attendance, given weight category, graphically.

If we know that an employee has normal weight, there is a higher probability of above average attendance than for an overweight employee. Since the conditional probabilities for attendance, given weight, are different for different weight categories, the two variables are associated.

13.2.2   Independence

Independence

If the conditional probabilities for Y are the same for all values of X, then Y is said to be independent of X.

If X and Y are independent, knowing the value of X does not give us any information about the likely value for Y.

Example

An example of independence is given by the following table of joint probabilities for the weight category and work performance (as assessed by a supervisor) of supermarket employees.


Joint Probabilities
                        Work performance
Weight          Poor   Satisfactory   Above average   Marginal
Underweight   0.0225         0.1125          0.0150     0.1500
Normal        0.0825         0.4125          0.0550     0.5500
Overweight    0.0300         0.1500          0.0200     0.2000
Obese         0.0150         0.0750          0.0100     0.1000
Marginal      0.1500         0.7500          0.1000     1.0000

The proportional Venn diagram for this model is shown below.

The conditional probability of above average work performance is the same for all weight categories — knowing an employee's weight would not help you to predict their work performance. The two variables are therefore independent.

Mathematical definition of independence

If Y is independent of X, then the conditional probabilities for Y do not depend on the value of X:

    p_y|x  =  p_y        for every x

Equivalently, every joint probability is the product of the corresponding marginal probabilities,

    p_xy  =  p_x × p_y

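As a numerical check, a short Python sketch can verify that every joint probability in the work-performance table above equals the product of its two marginal probabilities (the dictionary transcribes that table):

```python
# Joint probabilities from the work-performance table, indexed (weight, performance).
joint = {
    ("Underweight", "Poor"): 0.0225, ("Underweight", "Satisfactory"): 0.1125,
    ("Underweight", "Above average"): 0.0150,
    ("Normal", "Poor"): 0.0825, ("Normal", "Satisfactory"): 0.4125,
    ("Normal", "Above average"): 0.0550,
    ("Overweight", "Poor"): 0.0300, ("Overweight", "Satisfactory"): 0.1500,
    ("Overweight", "Above average"): 0.0200,
    ("Obese", "Poor"): 0.0150, ("Obese", "Satisfactory"): 0.0750,
    ("Obese", "Above average"): 0.0100,
}
weights = ["Underweight", "Normal", "Overweight", "Obese"]
grades = ["Poor", "Satisfactory", "Above average"]

p_w = {w: sum(joint[(w, g)] for g in grades) for w in weights}
p_g = {g: sum(joint[(w, g)] for w in weights) for g in grades}

# Under independence every cell satisfies p_xy = p_x * p_y.
independent = all(abs(joint[(w, g)] - p_w[w] * p_g[g]) < 1e-6
                  for w in weights for g in grades)
print(independent)  # True
```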
13.2.3   Independence from samples

Assessing independence from a sample

Independence is an important concept, but it is defined in terms of the joint population probabilities, and in most practical situations these are unknown. We must assess independence from a sample of individuals, summarised in a contingency table.

Example

The contingency table below categorises a sample of 214 individuals by gender and some other characteristic (possibly weight group or grade in a test).

Sample Data
Grade    Male   Female   Total
A          20       60      80
B           9       84      93
C           2       39      41
Total      31      183     214

Is this consistent with a model of independence of the characteristic and gender? (Are the probabilities of A, B and C grades the same for males and females?)

Estimated cell counts under independence

To assess independence, we first find the pattern of cell counts that is most consistent with independence in a contingency table with the observed marginal totals.

    Male   Female Total
A ? ? 80
B ? ? 93
C ? ? 41
Total 31 183 214

The pattern that is most consistent with independence has the following estimated cell counts:

    e_xy  =  (n_x × n_y) / n

where n denotes the total for the whole table and n_x and n_y denote the marginal totals for row x and column y.

Applying this to our example gives the following table:

Grade     Male    Female   Total
A        11.59     68.41      80
B        13.47     79.53      93
C         5.94     35.06      41
Total       31       183     214
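The estimated cell counts come straight from the formula e_xy = n_x × n_y / n, as this short Python sketch shows:

```python
# Marginal totals from the contingency table.
row_totals = {"A": 80, "B": 93, "C": 41}
col_totals = {"Male": 31, "Female": 183}
n = 214

# Estimated cell count under independence: e_xy = n_x * n_y / n.
expected = {(r, c): row_totals[r] * col_totals[c] / n
            for r in row_totals for c in col_totals}

print(round(expected[("A", "Male")], 2))    # 11.59
print(round(expected[("C", "Female")], 2))  # 35.06
```

Note that the estimated counts have the same row and column totals as the observed table.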

13.2.4   Testing for independence

Comparison of observed and estimated cell counts

We test for independence with the hypotheses:

H0 :  X and Y are independent
HA :  X and Y are dependent  

The test asks whether the observed and estimated cell counts are 'sufficiently close' — are the observed counts consistent with the counts estimated under independence?

Observed and estimated cell counts
(estimated counts under independence in brackets)
Grade          Male           Female      Total
A         20 (11.59)       60 (68.41)       80
B          9 (13.47)       84 (79.53)       93
C          2  (5.94)       39 (35.06)       41
Total             31              183      214

Possible test statistic?

A simple summary of how close the observed counts, n_xy, are to the estimated cell counts, e_xy, is the sum of the squared differences,

    Σ (n_xy − e_xy)²

Unfortunately this would be a bad test statistic: its distribution depends not only on the numbers of rows and columns in the table, but also on the number of individuals classified, the overall total for the table. A better test statistic is presented on the next page.

13.2.5   Chi-squared test statistic

A better test statistic

The following χ² (pronounced chi-squared) statistic has much better properties than the raw sum of squares on the previous page:

    χ²  =  Σ (n_xy − e_xy)² / e_xy

Its distribution depends only on the numbers of rows and columns in the contingency table.

Distribution of chi-squared statistic

When there is independence, the χ² statistic for a contingency table with r rows and c columns has approximately a standard distribution called a chi-squared distribution with (r − 1)(c − 1) degrees of freedom.

The mean of a chi-squared distribution equals its degrees of freedom, and the distribution is skewed to the right. Some examples are given below for contingency tables of different sizes:

13.2.6   P-value for chi-squared test

Testing for independence

H0 :  X and Y are independent
HA :  X and Y are dependent  

The following test statistic is used:

    χ²  =  Σ (n_xy − e_xy)² / e_xy

If X and Y are independent
χ² has (approximately) a chi-squared distribution with (r − 1)(c − 1) degrees of freedom, a standard distribution with no unknown parameters
If X and Y are associated
The pattern of observed counts, n_xy, is expected to differ from that of the e_xy, so χ² is expected to be larger.
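For the grade-by-gender example above, the whole test can be sketched in a few lines of Python. (The shortcut exp(−χ²/2) for the p-value is a closed form that holds only for 2 degrees of freedom, which is the case for this 3 × 2 table.)

```python
import math

# Observed counts and marginal totals from the grade-by-gender table.
observed = {("A", "Male"): 20, ("A", "Female"): 60,
            ("B", "Male"): 9, ("B", "Female"): 84,
            ("C", "Male"): 2, ("C", "Female"): 39}
row_totals = {"A": 80, "B": 93, "C": 41}
col_totals = {"Male": 31, "Female": 183}
n = 214

# chi-squared = sum over cells of (observed - expected)^2 / expected,
# with expected counts e_xy = n_x * n_y / n.
chi2 = sum((observed[(r, c)] - e) ** 2 / e
           for r in row_totals for c in col_totals
           for e in [row_totals[r] * col_totals[c] / n])

df = (len(row_totals) - 1) * (len(col_totals) - 1)  # (3 - 1)(2 - 1) = 2

# For 2 degrees of freedom the chi-squared upper-tail probability
# has the simple closed form exp(-x/2).
p_value = math.exp(-chi2 / 2)

print(round(chi2, 2))     # 11.93
print(round(p_value, 4))  # 0.0026 -- strong evidence of dependence
```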

P-value

The p-value is interpreted in the same way as for other hypothesis tests. It describes the strength of evidence against the null hypothesis:

p-value Interpretation
over 0.1 no evidence against the null hypothesis (independence)
between 0.05 and 0.1    very weak evidence of dependence between the row and column variables
between 0.01 and 0.05    moderately strong evidence of dependence between the row and column variables
under 0.01 strong evidence of dependence between the row and column variables

Warning about low estimated cell counts

The χ² test statistic has only approximately a chi-squared distribution. The p-value found from it can be relied on if:

    all estimated cell counts, e_xy, are at least 1, and
    at least 80% of the estimated cell counts are at least 5.

If the cell counts are small enough that these conditions do not hold, the p-value is less reliable. (But advanced statistical methods are required to do better!)

13.2.7   Examples

Examples

13.2.8   Comparing groups

Contingency tables and groups

Contingency tables can arise either from bivariate categorical data or from univariate categorical data recorded separately in several groups.

The chi-squared test assesses independence in bivariate data. The same test can also be used to compare the groups when the data are grouped.

Null hypothesis (corresponding to independence)
The category probabilities are the same within each group.
Alternative hypothesis (corresponding to association)
The different groups have different probabilities.

Example