Relationships between numerical variables
When two or more measurements are made from each individual in a population, we are usually interested in whether these variables are related to each other. When both variables are numerical, the strength of the relationship can be described with a correlation coefficient and regression models allow us to test whether two variables are related on the basis of sample data.
Relationships between categorical variables
Two categorical measurements may also be related.
As with numerical variables, we may be able to conclude that any relationship between categorical variables is causal if it results from an experiment (e.g. a randomised experiment in which some pea seeds are coated and others are uncoated). From observational data however, we usually cannot deduce a causal relationship — all we can say is that the variables may be associated.
What does association mean?
We say that two variables are associated if knowledge of the value of one tells you something about the likely value of the other.
If the conditional distribution of Y given X = x depends on the value of x, we say that X and Y are associated.
For example, if the conditional distribution of the Job satisfaction of new employees given Job type = secretary is different from the conditional distribution of Job satisfaction given Job type = manager, then we say that Job satisfaction and Job type are associated.
In the next page, we will characterise two variables that are not associated, but first we give an example of variables that are related.
Absenteeism and weight
To illustrate the idea of association, we use a table of joint probabilities that constitute a possible model for absenteeism of employees in a supermarket chain and their weight.
Note that the joint probabilities in this model do not accurately represent the effect of weight on absenteeism — they are only used to illustrate the concepts.
Attendance record | ||||
---|---|---|---|---|
Poor | Satisfactory | Above average | Marginal | |
Underweight | 0.0450 | 0.0900 | 0.0150 | 0.1500 |
Normal | 0.0825 | 0.3025 | 0.1650 | 0.5500 |
Overweight | 0.0500 | 0.1200 | 0.0300 | 0.2000 |
Obese | 0.0300 | 0.0650 | 0.0050 | 0.1000 |
Marginal | 0.1700 | 0.5400 | 0.2900 | 1.0000 |
The implications of this model are best explained from conditional probabilities for athletic performance, given weight:
Attendance record | ||||
---|---|---|---|---|
Poor | Satisfactory | Above average | Total | |
Underweight | 0.30 | 0.60 | 0.10 | 1.0 |
Normal | 0.15 | 0.55 | 0.30 | 1.0 |
Overweight | 0.25 | 0.60 | 0.15 | 1.0 |
Obese | 0.30 | 0.65 | 0.05 | 1.0 |
A proportional Venn diagram displays these conditional probabilities graphically.
If this model is correct, the conditional probability of poor attendance is lowest for staff with 'normal' weight, increasing as weight gets further from 'normal'. Similarly, the probability of above average attendance is highest for those with 'normal' weight.