Alternative to the method of moments
Although the method of moments often provides a good estimator for a single unknown parameter, it is difficult to extend to statistical models with two or more unknown parameters.
We now describe a better estimation method that can be extended to models with many unknown parameters, and even to situations in which the available data are not a random sample.
Probability of data
We start by considering a random situation in which the variables that will be recorded, \(\{X_1, X_2, \dots, X_n\}\), have a joint probability that involves an unknown parameter, \(\theta\).
\[ p(x_1, x_2, \dots, x_n \;|\; \theta) \]
For example, if the variables are a random sample from a \(\GeomDistn(\pi)\) distribution, independence means that the joint probability is the following product.
\[ P(X_1=x_1 \textbf{ and } X_2=x_2 \textbf{ and } \dots \textbf{ and } X_n=x_n) \;=\; \prod_{i=1}^n {\pi (1-\pi)^{x_i-1}} \]
Likelihood function
In the joint probability, we usually treat the parameter \(\theta\) as a constant and consider how the probability depends on the x-values. We will now do the opposite; we will treat the data values as fixed constants and examine how the probability depends on the value of \(\theta\).
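To make this switch of viewpoint concrete, here is a minimal sketch in plain Python (not part of the original text; the observed value \(x = 4\) and the parameter values shown are hypothetical choices for illustration). It evaluates a single geometric probability \(p(x \;|\; \pi) = \pi(1-\pi)^{x-1}\) both ways: first varying the data with the parameter fixed, then varying the parameter with the data fixed.

```python
def geom_prob(x, pi):
    """Probability that the first success occurs on trial x."""
    return pi * (1 - pi) ** (x - 1)

# Usual viewpoint: fix the parameter, vary the data values.
for x in range(1, 6):
    print(f"p({x} | pi=0.3) = {geom_prob(x, 0.3):.4f}")

# Opposite viewpoint: fix the observed data (x = 4), vary the parameter.
for pi in (0.1, 0.2, 0.3, 0.4, 0.5):
    print(f"p(4 | pi={pi}) = {geom_prob(4, pi):.4f}")
```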
Definition
If random variables \(\{X_1, X_2, \dots, X_n\}\) have joint probability
\[ p(x_1, x_2, \dots, x_n \;|\; \theta) \]
then the function
\[ L(\theta \;|\; x_1, x_2, \dots, x_n) \;=\; p(x_1, x_2, \dots, x_n \;|\; \theta) \]
is called the likelihood function of \(\theta\).
The likelihood function tells you the probability of getting the data that were observed, for different values of the parameter, \(\theta\). More informally,
\(L(\theta) \;=\; \operatorname{Prob}(\text{getting the data that were observed})\), if the parameter value were really \(\theta\).
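For instance, for the random sample from a \(\GeomDistn(\pi)\) distribution introduced above, the likelihood is the same product as before, now viewed as a function of \(\pi\); collecting the \(n\) factors of \(\pi\) and the powers of \(1-\pi\) gives
\[ L(\pi \;|\; x_1, x_2, \dots, x_n) \;=\; \prod_{i=1}^n \pi(1-\pi)^{x_i-1} \;=\; \pi^n(1-\pi)^{\sum_{i=1}^n x_i - n} \]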
We now give a simple example.
Binomial random variable
Consider a random variable \(X\) that is the number of successes in \(n=20\) independent trials, each of which has probability \(\pi\) of success. The probability function of \(X\) is
\[ p(x \;|\; \pi) = {{20} \choose x} \pi^x(1-\pi)^{20-x} \quad \quad \text{for } x=0, 1, \dots, 20 \]
If the experiment resulted in \(x=6\) successes, this would have probability
\[ p(6 \;|\; \pi) = {{20} \choose 6} \pi^6(1-\pi)^{14} = 38{,}760 \times \pi^6(1-\pi)^{14} \]
The likelihood function treats this as a function of \(\pi\),
\[ L(\pi) \;=\; p(6 \;|\; \pi) \;=\; 38{,}760 \times \pi^6(1-\pi)^{14} \]
The likelihood function gives the probability of getting the data that we observed (6 successes) for different values of the parameter \(\pi\).
Note that the likelihood function is a continuous function of \(\pi\), even though the random variable \(X\) is discrete.
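To see this numerically, here is a short sketch (plain Python; the grid of \(\pi\) values is an arbitrary illustrative choice, not from the original text) that evaluates \(L(\pi)\) at several parameter values.

```python
from math import comb

def likelihood(pi, x=6, n=20):
    """Binomial likelihood L(pi) = C(n, x) * pi^x * (1 - pi)^(n - x)."""
    return comb(n, x) * pi**x * (1 - pi)**(n - x)

# Evaluate the likelihood on a grid of parameter values.
for pi in (0.1, 0.2, 0.3, 0.4, 0.5, 0.6):
    print(f"L({pi:.1f}) = {likelihood(pi):.4f}")
```

Among these values, the likelihood is largest at \(\pi = 0.3\), the observed proportion of successes, \(6/20\).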
The top half of the following diagram shows the likelihood function. The bottom half is a bar chart of a binomial distribution.
The bar chart initially shows the binomial distribution with \(\pi = 0.4\) and the probability \(p(6\;|\;\pi = 0.4)\) is highlighted on it.
This probability is the value of the likelihood for \(\pi = 0.4\) and is also shown in the graph of the likelihood function at the top.
Now drag the slider to highlight \(p(6\;|\;\pi)\) for different values of \(\pi\).
For some values of \(\pi\), you would have been unlikely to observe 6 successes, whereas for other values of \(\pi\), 6 successes would have been much more likely. For example,