Alternative to the method of moments

Although the method of moments often provides a good estimator for a single unknown parameter, it is difficult to extend to statistical models with more unknown parameters.

We now describe a better estimation method that can be extended to models with many unknown parameters, and even to situations in which the available data are not a random sample.

Probability of data

We start by considering a random situation in which the variables that will be recorded, \(\{X_1, X_2, \dots, X_n\}\), have a joint probability that involves an unknown parameter, \(\theta\).

\[ p(x_1, x_2, \dots, x_n \;| \; \theta) \]

For example, if the variables are a random sample from a \(\GeomDistn(\pi)\) distribution, independence means that the joint probability is the following product.

\[ P(X_1=x_1 \textbf{ and } X_2=x_2 \textbf{ and } \dots \textbf{ and } X_n=x_n) = \prod_{i=1}^n {\pi (1-\pi)^{x_i-1}} \]
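To make this product concrete, here is a minimal sketch in Python (not part of the original text; the sample values are hypothetical) that evaluates the joint probability of a geometric random sample for a given value of \(\pi\).

```python
import numpy as np
from scipy import stats

def geometric_joint_prob(x, pi):
    """Joint probability of a geometric random sample: product of pi*(1-pi)^(x_i - 1)."""
    x = np.asarray(x)
    return np.prod(pi * (1 - pi) ** (x - 1))

sample = [3, 1, 5, 2]                      # hypothetical observed values
print(geometric_joint_prob(sample, 0.3))

# Cross-check with SciPy (scipy.stats.geom counts trials up to the first success)
print(np.prod(stats.geom.pmf(sample, 0.3)))
```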

Likelihood function

In the joint probability, we usually treat the parameter \(\theta\) as a constant and consider how the probability depends on the x-values. We will now do the opposite; we will treat the data values as fixed constants and examine how the probability depends on the value of \(\theta\).

Definition

If random variables \(\{X_1, X_2, \dots, X_n\}\) have joint probability

\[ p(x_1, x_2, \dots, x_n \;| \; \theta) \]

then the function

\[ L(\theta \; | \; x_1, x_2, \dots, x_n) \;=\; p(x_1, x_2, \dots, x_n \;| \; \theta) \]

is called the likelihood function of \(\theta\).

The likelihood function tells you the probability of getting the data that were observed, for different values of the parameter, \(\theta\). More informally,

\[ L(\theta) \;=\; P(\text{getting the data that were observed, if the parameter value were really } \theta) \]
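As a sketch of this change of viewpoint (illustrative code of my own, with a hypothetical helper name, not from the text), the likelihood can be built by holding the data fixed inside a function of the parameter:

```python
import numpy as np

def geometric_joint_prob(x, pi):
    """Joint probability p(x_1, ..., x_n | pi) for a geometric random sample."""
    x = np.asarray(x)
    return np.prod(pi * (1 - pi) ** (x - 1))

def make_likelihood(joint_prob, data):
    """Return L(theta) = p(data | theta): same formula, data fixed, theta varying."""
    return lambda theta: joint_prob(data, theta)

L = make_likelihood(geometric_joint_prob, [3, 1, 5, 2])
print(L(0.2), L(0.3), L(0.4))   # one probability per candidate parameter value
```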

We now give a simple example.

Binomial random variable

Consider a random variable \(X\) that is the number of successes in \(n=20\) independent trials, each of which has probability \(\pi\) of success. The probability function of \(X\) is

\[ p(x \; | \; \pi) = {{20} \choose x} \pi^x(1-\pi)^{20-x} \quad \quad \text{for } x=0, 1, \dots, 20 \]

If the experiment resulted in \(x=6\) successes, this would have probability

\[ p(6 \; | \; \pi) = {{20} \choose 6} \pi^6(1-\pi)^{14} = (38,760) \times \pi^6(1-\pi)^{14} \]
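As a quick numerical check (illustrative Python, not from the text), the binomial coefficient and this probability can be evaluated directly or with SciPy's pmf:

```python
from math import comb
from scipy import stats

print(comb(20, 6))                         # 38760

pi = 0.4                                   # an arbitrary trial value of pi
print(comb(20, 6) * pi**6 * (1 - pi)**14)  # direct evaluation of p(6 | pi)
print(stats.binom.pmf(6, 20, pi))          # the same value from SciPy
```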

The likelihood function treats this as a function of \(\pi\),

\[ L(\pi) \;=\; p(6 \; | \; \pi) \;=\; (38,760) \times \pi^6(1-\pi)^{14} \]

The likelihood function gives the probability of getting the data that we observed (6 successes) for different values of the parameter \(\pi\).

Note that the likelihood function is a continuous function of \(\pi\), even though the random variable \(X\) is discrete.
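A short sketch (illustrative Python; the grid size and plotting choices are mine) evaluates \(L(\pi)\) on a fine grid of \(\pi\) values to display this continuous function:

```python
import numpy as np
import matplotlib.pyplot as plt

pi_grid = np.linspace(0.001, 0.999, 500)
likelihood = 38760 * pi_grid**6 * (1 - pi_grid)**14   # L(pi) = p(6 | pi)

plt.plot(pi_grid, likelihood)
plt.xlabel(r"$\pi$")
plt.ylabel(r"$L(\pi)$")
plt.title("Likelihood of 6 successes in 20 trials")
plt.show()

# Value of pi at which the observed 6 successes would have been most likely
print(pi_grid[np.argmax(likelihood)])                 # close to 6/20 = 0.3
```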


[Interactive diagram: the top half shows the likelihood function \(L(\pi)\); the bottom half is a bar chart of the binomial distribution for the current value of \(\pi\), initially \(\pi = 0.4\), with the probability \(p(6\;|\;\pi)\) highlighted. Dragging a slider changes \(\pi\), showing how this probability, which is the value of the likelihood, varies.]

For some values of \(\pi\), you would have been unlikely to observe 6 successes, whereas for other values of \(\pi\), 6 successes would have been much more likely. For example,