Bivariate categorical data are modelled as a sample from a population that consists of pairs of categorical values. The joint probability for any pair of categories is their population proportion.
The marginal probabilities for a variable are the population proportions for its possible values. They can be found by summing joint probabilities.
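As a minimal sketch of this summation, assume a hypothetical 2 x 3 table of joint probabilities (the numbers and the names `joint`, `marginal_x` and `marginal_y` are invented for illustration). Summing across a row gives a marginal probability for X, and summing down a column gives one for Y.

```python
# Hypothetical joint probabilities (illustrative values only).
# Rows are the values of X, columns are the values of Y; all cells sum to 1.
joint = [
    [0.10, 0.20, 0.10],  # X = x1
    [0.30, 0.15, 0.15],  # X = x2
]

# Marginal probabilities for X: sum each row over the values of Y.
marginal_x = [sum(row) for row in joint]

# Marginal probabilities for Y: sum each column over the values of X.
marginal_y = [sum(col) for col in zip(*joint)]

print("P(X):", marginal_x)
print("P(Y):", marginal_y)
```

Up to floating-point rounding, each set of marginal probabilities sums to one.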
Conditional probabilities for a variable are proportions within the sub-population that has a specific value of the other variable. They are found by rescaling the joint probabilities in that sub-population so that they sum to one.
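The rescaling can be sketched as follows, again with a hypothetical table of joint probabilities. The function name `conditional_y_given_x` is an assumption made for this example. Each row of joint probabilities is divided by its own total, which is the marginal probability of that value of X.

```python
# Hypothetical joint probabilities (illustrative values only).
# Rows are the values of X, columns are the values of Y.
joint = [
    [0.10, 0.20, 0.10],  # X = x1
    [0.30, 0.15, 0.15],  # X = x2
]


def conditional_y_given_x(joint):
    """Divide each row by its total so that the row sums to one."""
    return [[p / sum(row) for p in row] for row in joint]


cond = conditional_y_given_x(joint)
for label, row in zip(["x1", "x2"], cond):
    print("P(Y | X = %s):" % label, row)
```

In this invented table the two conditional distributions differ, so knowing X changes the probabilities for Y.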
Joint, marginal and conditional probabilities can be displayed graphically.
The model can be equivalently described by (a) joint probabilities, (b) marginal probabilities for X and conditional probabilities for Y, or (c) marginal probabilities for Y and conditional probabilities for X. Any of these sets of probabilities can be found from any other set.
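One way to see the equivalence is to start from description (b), rebuild the joint probabilities of description (a) by multiplication, and then derive description (c) from them. The sketch below does this for hypothetical values; all names and numbers are assumptions for illustration.

```python
# Description (b): hypothetical marginal probabilities for X and
# conditional probabilities for Y given each value of X.
marginal_x = [0.4, 0.6]
cond_y_given_x = [
    [0.25, 0.50, 0.25],  # P(Y | X = x1)
    [0.50, 0.25, 0.25],  # P(Y | X = x2)
]

# (b) -> (a): joint probability = P(X = x) * P(Y = y | X = x).
joint = [[px * p for p in row] for px, row in zip(marginal_x, cond_y_given_x)]

# (a) -> (c): marginal probabilities for Y are the column sums ...
marginal_y = [sum(col) for col in zip(*joint)]

# ... and P(X = x | Y = y) rescales each column by its total.
# cond_x_given_y[j][i] holds P(X = x_i | Y = y_j).
cond_x_given_y = [
    [joint[i][j] / marginal_y[j] for i in range(len(marginal_x))]
    for j in range(len(marginal_y))
]

print("joint:", joint)
print("P(Y):", marginal_y)
print("P(X | Y):", cond_x_given_y)
```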
Two categorical variables, X and Y, are associated (related) when the conditional distribution of Y given X=x is different for different values of x. Knowing the value of X therefore tells you something about Y.
When the conditional distribution of Y is the same for all values of X, the variables are called independent. This special case is of practical importance.
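Identical conditional distributions are equivalent to every joint probability being the product of the two corresponding marginal probabilities. The sketch below checks that product rule for two hypothetical tables; the function name `is_independent` and the tolerance are assumptions for this example.

```python
import math


def is_independent(joint, tol=1e-9):
    """Return True when every joint probability equals P(X = x) * P(Y = y)."""
    marginal_x = [sum(row) for row in joint]
    marginal_y = [sum(col) for col in zip(*joint)]
    return all(
        math.isclose(joint[i][j], marginal_x[i] * marginal_y[j], abs_tol=tol)
        for i in range(len(marginal_x))
        for j in range(len(marginal_y))
    )


# Hypothetical table built as a product of marginals, so X and Y are independent.
independent_joint = [
    [0.20, 0.12, 0.08],
    [0.30, 0.18, 0.12],
]

# Hypothetical table in which the rows have different conditional distributions.
associated_joint = [
    [0.10, 0.20, 0.10],
    [0.30, 0.15, 0.15],
]

print(is_independent(independent_joint))
print(is_independent(associated_joint))
```

This check applies to population probabilities. Sample proportions will rarely satisfy it exactly even when the population variables are independent.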
Independence is a population property. To assess independence from a sample contingency table, the observed cell counts are compared to those estimated from a model with independence.
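Under independence, the estimated count for a cell is its row total times its column total, divided by the overall total. A minimal sketch with an invented contingency table follows; the counts and the function name `expected_counts` are assumptions for illustration.

```python
# Hypothetical observed contingency table (rows: values of X, columns: values of Y).
observed = [
    [30, 20, 10],
    [20, 30, 40],
]


def expected_counts(observed):
    """Estimate cell counts under independence: row total * column total / n."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


expected = expected_counts(observed)
for obs_row, exp_row in zip(observed, expected):
    print("observed:", obs_row, " estimated:", exp_row)
```

The estimated counts have the same row and column totals as the observed table.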
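The raw sum of squared differences between observed and estimated cell counts is not a good test statistic, because it ignores the size of the estimated counts: a given difference is more unusual in a cell with a small estimated count than in a cell with a large one.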
The 'chi-squared' statistic is a modified sum of squared differences that has a standard distribution (a chi-squared distribution) when there is independence.
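The modification divides each squared difference by the estimated count for that cell before summing. The sketch below computes the statistic and its degrees of freedom, (rows - 1) x (columns - 1), for a hypothetical table; the data and the function name `chi_squared_statistic` are assumptions for illustration.

```python
# Hypothetical observed contingency table (illustrative counts only).
observed = [
    [30, 20, 10],
    [20, 30, 40],
]


def chi_squared_statistic(observed):
    """Return (statistic, degrees of freedom) for a two-way table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            est = row_totals[i] * col_totals[j] / n  # estimated count under independence
            statistic += (obs - est) ** 2 / est      # squared difference scaled by est
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df


statistic, df = chi_squared_statistic(observed)
print("chi-squared statistic:", round(statistic, 3), "on", df, "degrees of freedom")
```

The sketch assumes every row and column total is positive, so that no estimated count is zero.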
The chi-squared statistic can be used to find a p-value for testing independence. The p-value has a similar interpretation and similar properties to p-values for other hypothesis tests.
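The p-value is the upper-tail probability of the chi-squared distribution beyond the observed statistic, since large values of the statistic are evidence against independence. Statistical software is normally used for this calculation. The sketch below instead uses a closed-form expression that holds only when the degrees of freedom are even, so that it needs nothing beyond the standard library; the function name and the input values are assumptions for illustration.

```python
import math


def chi2_upper_tail_even_df(statistic, df):
    """Upper-tail chi-squared probability, valid only for even df.

    Uses P(chi2 > x) = exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this sketch only handles positive even degrees of freedom")
    half = statistic / 2.0
    term = 1.0   # the k = 0 term of the sum
    total = 1.0
    for k in range(1, df // 2):
        term *= half / k  # builds (x/2)^k / k! incrementally
        total += term
    return math.exp(-half) * total


# Hypothetical statistic from a 2 x 3 table, which has 2 degrees of freedom.
statistic = 50.0 / 3.0
p_value = chi2_upper_tail_even_df(statistic, 2)
print("p-value:", p_value)
```

A p-value this close to zero would be interpreted as strong evidence against independence. Tables with odd degrees of freedom, such as 2 x 2 tables, need a different formula or a statistics library.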
The chi-squared test is applied to a few real data sets. When the variables are found to be associated, the nature of the relationship is described from a comparison of observed and estimated cell counts.
The chi-squared test assesses independence of two categorical variables. It is also used to test whether a single categorical variable has the same distribution in several groups.