
Chapter 4   Estimating a Parameter

4.1   Estimation

4.1.1   Unknown parameters

Finding an appropriate distribution

Thinking about how a measurement is made sometimes suggests that the variable's distribution should belong to a family of standard distributions, such as a uniform, binomial, geometric or negative binomial distribution.

This reasoning may require some assumptions about the process underlying the variable.

Unfortunately, this usually only leads to a family of standard distributions with one or more parameters whose values are unknown, such as the probability of success \(\pi\) in a series of Bernoulli trials.

How can we find the value of any such unknown parameter?

4.1.2   Estimators of parameters

To get information about unknown parameters, we need data whose distribution depends on these parameters. A function of the data is used to estimate each parameter. We start with models involving a single unknown parameter, \(\theta\).

Definition

If \({X_1, X_2, \dots, X_n}\) is a random sample from a distribution whose shape depends on an unknown parameter \(\theta\), then any function of the random sample,

\[ \hat{\theta}(X_1, X_2, \dots, X_n) \]

is a random variable and could potentially be used as an estimator of \(\theta\).

Possible estimators are:

Sample mean
\[\hat{\theta} = \frac {\sum_{i=1}^n {X_i}} n\]
Sample median
\[\hat{\theta} = median(X_1, X_2, \dots, X_n) \]
Sample maximum
\[\hat{\theta} = max(X_1, X_2, \dots, X_n) \]
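To make these concrete, the following Python sketch (added here as an illustration, not part of the original text) simulates an arbitrary random sample and evaluates the three candidate estimators; the distribution and sample size are hypothetical choices.

```python
import random
import statistics

random.seed(1)

# Hypothetical random sample: n = 20 values from a Uniform(0, 10) distribution
# (both the distribution and the sample size are arbitrary, for illustration only).
sample = [random.uniform(0, 10) for _ in range(20)]

# Three candidate estimators, each a function of the random sample.
theta_hat_mean = statistics.mean(sample)      # sample mean
theta_hat_median = statistics.median(sample)  # sample median
theta_hat_max = max(sample)                   # sample maximum

print(theta_hat_mean, theta_hat_median, theta_hat_max)
```

Which of these functions is an appropriate estimator depends on the model and on which parameter is being estimated.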

Since there are various possible functions of the data that might be used as estimators, what makes a good estimator of a parameter?

4.1.3   Bias

Distribution of an estimator

If \({X_1, X_2, \dots, X_n}\) is a random sample from a distribution involving a single unknown parameter \(\theta\), an estimator \( \hat{\theta}(X_1, X_2, \dots, X_n) \) is a function of these \(n\) random variables and also has a distribution. It is often simply written as \( \hat{\theta}\).

The properties of an estimator depend on its distribution. For example, an estimator with a continuous distribution might have the pdf shown below.

For \( \hat{\theta}\) to be a good estimator of \(\theta\), its distribution should be concentrated near \(\theta\).

Bias

A good estimator of \(\theta\) should have a distribution whose "centre" is close to \(\theta\). This can be summarised by the distance of the estimator's mean from \(\theta\).

Definition

The bias of an estimator \(\hat{\theta}\) of a parameter \(\theta\) is defined to be

\[ \Bias(\hat{\theta}) \;=\; E[\hat{\theta}] - \theta \]

If its bias is zero, \(\hat{\theta}\) is called an unbiased estimator of \(\theta\).

Many popular estimators are unbiased.

Sample mean

If \({X_1, X_2, \dots, X_n}\) is a random sample from a distribution with mean \(\mu\), the sample mean, \(\overline{X}\), is an unbiased estimator of the distribution mean, \(\mu\).

(Proved in full version)

A sample variance is also an unbiased estimator of a distribution's variance, \(\sigma^2\), but this is harder to prove.

Sample variance

If \({X_1, X_2, \dots, X_n}\) is a random sample from a distribution with variance \(\sigma^2\), the sample variance,

\[ S^2 = \sum_{i=1}^n {\frac {(X_i - \overline{X})^2} {n-1}} \]

is an unbiased estimator of \(\sigma^2\).

(Proved in full version)

Although the sample variance is unbiased, the sample standard deviation is a biased estimator.

Sample standard deviation

The sample standard deviation, \(S\), is a biased estimator of a distribution's standard deviation, \(\sigma\).

(Proved in full version)
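These bias results can be checked by simulation. The sketch below (an added illustration, with an arbitrary normal distribution and sample size) averages the sample variance and sample standard deviation over many samples; the average of \(S^2\) should be close to \(\sigma^2 = 4\), while the average of \(S\) falls slightly below \(\sigma = 2\).

```python
import random
import statistics

random.seed(1)

mu, sigma, n = 5.0, 2.0, 10       # arbitrary illustrative values
num_samples = 50_000

mean_s2 = 0.0
mean_s = 0.0
for _ in range(num_samples):
    x = [random.gauss(mu, sigma) for _ in range(n)]
    mean_s2 += statistics.variance(x)   # sample variance, divisor n - 1
    mean_s += statistics.stdev(x)       # sample standard deviation
mean_s2 /= num_samples
mean_s /= num_samples

print("average of S^2:", round(mean_s2, 3), "  sigma^2:", sigma ** 2)
print("average of S:  ", round(mean_s, 3), "  sigma:  ", sigma)
```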

4.1.4   Standard error

A good estimator's distribution should have a mean that is equal to (or at least close to) the parameter being estimated.

However, this is not enough to characterise a good estimator.

A good estimator should also have a distribution with a small standard deviation.

Definition

The standard error of an estimator \(\hat{\theta}\) of a parameter \(\theta\) is defined to be its standard deviation.

The standard error is also the standard deviation of the estimation error,

\[ error \;\; = \;\; \hat{\theta} - \theta \]

and this is the reason for its name — it is a 'typical' estimation error.

Question: Active ingredient in medicine

Pharmaceutical companies routinely test their products to ensure that the concentration of active ingredient, \(\mu\), is within tight limits. However the chemical analysis is not precise and repeated measurements of the same specimen differ slightly.

One type of analysis gives estimated concentrations of the active ingredient that are normally distributed with standard deviation \(\sigma = 0.0068\) grams per litre. A product is tested 16 times, giving a sample mean concentration of \(\hat{\mu} = \overline{x} = 0.0724\) grams per litre.

(Solved in full version)

The formula for the standard error often involves unknown parameters, so its exact numerical value cannot be obtained. These parameters are often replaced by estimates in order to get a numerical value for the standard error.
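For example, the standard error of a sample mean is \(\sigma/\sqrt{n}\), which involves the unknown \(\sigma\); replacing \(\sigma\) by the sample standard deviation \(s\) gives a usable numerical value. The sketch below (added as an illustration, using simulated data with arbitrary settings) shows this plug-in step.

```python
import math
import random
import statistics

random.seed(1)

# Hypothetical sample, simulated here purely for illustration.
x = [random.gauss(10.0, 3.0) for _ in range(25)]
n = len(x)

xbar = statistics.mean(x)   # estimate of the distribution's mean
s = statistics.stdev(x)     # estimate of the unknown sigma

# Standard error of the sample mean, sigma / sqrt(n), with sigma replaced by s.
se_xbar = s / math.sqrt(n)
print(xbar, se_xbar)
```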

Question: Heat treatment of mangoes

In an experiment to assess the effectiveness of heat-treatment of mangoes as a method of killing fruit fly eggs and larvae, several infested fruit were heat-treated at 44°C. Out of 572 fruit fly eggs in the mangoes, 30 survived, giving an estimate of the probability of survival at this temperature of:

\[ \hat{\pi} = P = \frac {30} {572} = 0.05245 \]

What are the bias and standard error of the estimator?

(Solved in full version)

4.1.5   Mean squared error

A good estimator should have a small bias and a small standard error. These two criteria can be combined into a single value called the estimator's mean squared error.

Definition

The mean squared error of an estimator \(\hat{\theta}\) of a parameter \(\theta\) is

\[ \MSE(\hat{\theta}) = E\left[ (\hat{\theta} - \theta)^2 \right] \]

Its relationship to our earlier definitions of bias and standard error is given by the following result.

Mean squared error

The mean squared error of an estimator \(\hat{\theta}\) of \(\theta\) is

\[ \MSE(\hat{\theta}) = \Var(\hat{\theta}) + \Bias(\hat{\theta})^2 \]

(Proved in full version)
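This identity can be verified numerically. The sketch below (an added illustration with arbitrary settings) uses the sample standard deviation \(S\) as a biased estimator of \(\sigma\) and compares a simulation estimate of its mean squared error with its variance plus squared bias.

```python
import random
import statistics

random.seed(1)

sigma, n = 2.0, 10            # arbitrary illustrative values
num_samples = 50_000

estimates = []
for _ in range(num_samples):
    x = [random.gauss(0.0, sigma) for _ in range(n)]
    estimates.append(statistics.stdev(x))   # S, used to estimate sigma

mean_est = statistics.mean(estimates)
var_est = statistics.pvariance(estimates)
bias = mean_est - sigma
mse = statistics.mean((e - sigma) ** 2 for e in estimates)

print("MSE:          ", round(mse, 5))
print("Var + Bias^2: ", round(var_est + bias ** 2, 5))
```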

4.1.6   Consistency

Consistency

When an estimator is based on a random sample of \(n\) values, \(\hat{\theta}(X_1, X_2, \dots, X_n)\), we would like the estimator to become closer to the parameter being estimated, \(\theta\), as the sample size increases.

Definition

An estimator \(\hat{\theta}(X_1, X_2, \dots, X_n)\) that is based on a random sample of \(n\) values is said to be a consistent estimator of \(\theta\) if

\[ \hat{\theta}(X_1, X_2, \dots, X_n) \;\; \xrightarrow[n \rightarrow \infty]{} \;\; \theta \]

A precise definition of consistency explains exactly what this limit means, but it is relatively complex. The idea is that the distribution of \(\hat{\theta}\) becomes more and more concentrated around \(\theta\) as \(n\) increases.

Consistency can usually be proved using the following result.

Consistent estimators

An estimator \(\hat{\theta}(X_1, X_2, \dots, X_n)\) is a consistent estimator of \(\theta\) if the following two conditions hold:

\[ \begin{align} \Var(\hat{\theta}) \;\; &\xrightarrow[n \rightarrow \infty]{} \;\; 0 \\[0.5em] \Bias(\hat{\theta}) \;\; &\xrightarrow[n \rightarrow \infty]{} \;\; 0 \end{align} \]
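For example, the sample mean is a consistent estimator of a distribution's mean \(\mu\), since it is unbiased and \(\Var(\overline{X}) = \sigma^2/n \to 0\). The short simulation below (added for illustration, with arbitrary values of \(\mu\) and \(\sigma\)) shows its variance shrinking as \(n\) increases.

```python
import random
import statistics

random.seed(1)

mu, sigma = 5.0, 2.0            # arbitrary illustrative values

for n in (10, 100, 1000):
    # Simulate 2000 sample means, each from a sample of size n.
    means = [statistics.mean(random.gauss(mu, sigma) for _ in range(n))
             for _ in range(2000)]
    print(n, round(statistics.pvariance(means), 4), "theory:", sigma ** 2 / n)
```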

4.1.7   Sample mean or median?

Bias and standard error can be used to compare alternative estimators.

Samples from a normal distribution

Consider a random sample of \(n\) values from a normal distribution with known standard deviation,

\[ X \;\; \sim \; \; \NormalDistn(\mu, \;\;\sigma^2 = 0.2^2) \]

Since the normal distribution is symmetric, both its mean and median are \(\mu\), so is the sample mean or sample median a better estimator?

\[ \overline{X} \;\; \sim \; \; \NormalDistn\left(\mu, \;\;\sigma_{\overline{X}}^2 = \frac {0.2^2} n \right) \]

The distribution of the sample median, \(\tilde{X}\), is harder to find but, in large samples, there is an approximate result,

\[ \tilde{X} \;\; \underset{\text{approx}}{\sim} \; \; \NormalDistn\left(\mu, \;\;\sigma_{\tilde{X}}^2 = \frac {0.2^2} n \times 1.571 \right) \]

Both estimators are therefore unbiased, but the standard error of the mean, \(\overline{X}\), is lower than that of the median, \(\tilde{X}\), so the sample mean is the better estimator.
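The factor 1.571 (approximately \(\pi/2\)) can be checked by simulation. The sketch below (added for illustration, with an arbitrary \(\mu\) and sample size) compares the simulated variances of the sample mean and sample median; their ratio should be roughly 1.5, approaching 1.571 as \(n\) grows.

```python
import random
import statistics

random.seed(1)

mu, sigma, n = 10.0, 0.2, 25    # mu and n are arbitrary illustrative choices
num_samples = 20_000

means, medians = [], []
for _ in range(num_samples):
    x = [random.gauss(mu, sigma) for _ in range(n)]
    means.append(statistics.mean(x))
    medians.append(statistics.median(x))

var_mean = statistics.pvariance(means)
var_median = statistics.pvariance(medians)
print("var(mean):  ", var_mean, "  theory:", sigma ** 2 / n)
print("var(median):", var_median, "  ratio:", var_median / var_mean)
```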

Samples from a skew distribution

On the other hand, consider samples from the following skew distribution.

If the median of this distribution, \(\gamma\), was unknown, the obvious estimator would be the median of a random sample, \(\tilde{X}\).

The sample mean and the sample median have similar standard errors in this situation. However

\[ E[\overline{X}] \;\; = \; \; \mu \;\;=\;\; 4 \]

so the sample mean has a bias of 1.23 when it is used to estimate this distribution's median, whereas the sample median is approximately unbiased. Moreover, the sample mean's bias does not decrease as \(n\) increases, so it is not a consistent estimator of the median.

The sample median would be the better estimator to use to estimate the median of such a skew distribution.

4.2   Method of moments

4.2.1   The method of moments

We now introduce a simple method for estimating a parameter \(\theta\).

Definition

If \(\{X_1, X_2, \dots, X_n\}\) is a random sample from a distribution whose mean, \(\mu(\theta)\), depends on an unknown parameter, \(\theta\), the method of moments estimator of \(\theta\) is the solution to the equation

\[ \mu(\theta) = \overline{X} \]

We now illustrate this with a simple example.

Estimating a normal distribution's mean (σ known)

Consider a random sample from a normal distribution whose standard deviation, \(\sigma\), is a known value.

\[ X \;\; \sim \; \; \NormalDistn(\mu,\; \sigma=0.6) \]

Since the distribution's mean is \(\mu\), the method of moments estimate is

\[ \hat{\mu} = \overline{x} \]
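When \(\mu(\theta)\) is a more complicated function, the equation \(\mu(\theta) = \overline{x}\) can be solved numerically instead. The sketch below (an added illustration, using a purely hypothetical mean function \(\mu(\theta) = \theta + \theta^2\) and an illustrative sample mean) finds the method of moments estimate with a simple bisection search.

```python
def bisect(f, lo, hi, tol=1e-10):
    """Find a root of f between lo and hi, assuming f(lo) and f(hi) differ in sign."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(lo) * f(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

# Hypothetical model: a distribution whose mean is mu(theta) = theta + theta**2, theta > 0.
def mu(theta):
    return theta + theta ** 2

xbar = 3.5   # sample mean of the observed data (illustrative value)

# Method of moments: solve mu(theta) = xbar for theta.
theta_mm = bisect(lambda t: mu(t) - xbar, 0.0, 10.0)
print(theta_mm)   # about 1.436, since 1.436 + 1.436**2 is roughly 3.5
```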

4.2.2   Examples

We now give two further estimators that can be found by the method of moments.

Question: Sex ratio of Siberian tigers

The probability of a newborn tiger being male is an unknown parameter, \(\pi\). Assuming that the sexes of tigers in a litter are independently determined, the number of males in a litter of size three will be

\[ X \;\; \sim \; \; \BinomDistn(3, \pi) \]

A researcher recorded the numbers of males from a sample of \(n = 207\) such litters, as summarised by the following frequency table.

Number of males     0     1     2     3
Frequency          33    66    80    28

What is the method of moments estimate of \(\pi\)?

(Solved in full version)

This is an unbiased estimator, but the method of moments sometimes results in an estimator whose bias is non-zero.

Question: Sample from a geometric distribution

If \(\{X_1, X_2, \dots, X_n\}\) is a random sample from a geometric distribution with probability function

\[ p(x) = \pi (1-\pi)^{x-1} \quad \quad \text{for } x = 1, 2, \dots \]

what is the method of moments estimator of \(\pi\)? Is it unbiased?

(Solved in full version)

4.3   Maximum likelihood

4.3.1   Likelihood function

Alternative to the method of moments

The method of moments often provides a good estimator when there is a single unknown parameter, but another estimation method, called maximum likelihood, is far more general. It can be used for models with several unknown parameters and even in situations where the available data are not a random sample.

Likelihood function

The joint probability of \(\{X_1, X_2, \dots, X_n\}\) may involve an unknown parameter, \(\theta\).

\[ p(x_1, x_2, \dots, x_n \;| \; \theta) \]

For example, if the variables are a random sample from a \(\GeomDistn(\pi)\) distribution, independence means that the joint probability is

\[ P(X_1=x_1 \textbf{ and } X_2=x_2 \textbf{ and } \dots \textbf{ and } X_n=x_n) = \prod_{i=1}^n {\pi (1-\pi)^{x_i-1}} \]

The likelihood function is defined from this.

Definition

If random variables \(\{X_1, X_2, \dots, X_n\}\) have joint probability

\[ p(x_1, x_2, \dots, x_n \;| \; \theta) \]

then the function

\[ L(\theta \; | \; x_1, x_2, \dots, x_n) \;=\; p(x_1, x_2, \dots, x_n \;| \; \theta) \]

is called the likelihood function of \(\theta\).

The likelihood function tells you the probability of getting the data that were observed, for different values of the parameter, \(\theta\). More informally,

\(L(\theta) = Prob(\text{getting the data that were observed})\) if the parameter value was really \(\theta\).

We now give a simple example.

Binomial random variable

If \(X\) is the number of successes in \(n=20\) independent trials, each with probability \(\pi\) of success, its probability function is

\[ p(x \; | \; \pi) = {{20} \choose x} \pi^x(1-\pi)^{20-x} \quad \quad \text{for } x=0, 1, \dots, 20 \]

If we observed \(x=6\) successes, this would have probability

\[ p(6 \; | \; \pi) = {{20} \choose 6} \pi^6(1-\pi)^{14} = (38,760) \times \pi^6(1-\pi)^{14} \]

The likelihood function treats this as a function of \(\pi\),

\[ L(\pi) \;=\; p(6 \; | \; \pi) \;=\; (38,760) \times \pi^6(1-\pi)^{14} \]

The likelihood function gives the probability of getting the data that we observed (6 successes) for different values of \(\pi\). For example, if \(\pi = 0.4\), the probability of observing \(x = 6\) would be 0.124.

From the likelihood function we could also find, for example, that if \(\pi\) was 0.1, the probability of observing \(x = 6\) would be only about 0.009.

Since there would be such a low probability of observing our actual data if \(\pi\) was 0.1, this throws some doubt on whether this would be the correct value of the parameter \(\pi\).
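These likelihood values can be reproduced directly from the formula. The short sketch below (added as an illustration) evaluates \(L(\pi)\) at a few values of \(\pi\); it should give roughly 0.124 at \(\pi = 0.4\) and only about 0.009 at \(\pi = 0.1\).

```python
from math import comb

def likelihood(p, x=6, n=20):
    """L(pi) = C(n, x) * pi^x * (1 - pi)^(n - x) for the observed x successes."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

for p in (0.1, 0.3, 0.4, 0.6):
    print(p, round(likelihood(p), 4))
```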

4.3.2   Maximising the likelihood

Maximum likelihood estimate

The likelihood function, \(L(\theta \; | \; x_1, x_2, \dots, x_n) = p(x_1, x_2, \dots, x_n \;| \; \theta)\), gives the probability of getting the data that were recorded for different values of the unknown parameter \(\theta\). A value of \(\theta\) that gives the observed data high probability is more likely to be correct than one that would make the observed data unlikely.

Definition

The maximum likelihood estimate of a parameter \(\theta\) is the value that maximises the likelihood function,

\[ L(\theta \; | \; x_1, x_2, \dots, x_n) = p(x_1, x_2, \dots, x_n \;| \; \theta) \]

Finding a maximum likelihood estimate (MLE) therefore involves maximising a function of \(\theta\). This is usually a "turning point" of the likelihood function.

Finding the maximum likelihood estimate

The maximum likelihood estimate of a parameter \(\theta\) can normally be obtained as a solution to the equation

\[ \frac {d\; L(\theta \; | \; x_1, x_2, \dots, x_n)} {d\; \theta} \;\; = \;\; 0 \]

It is often easier mathematically to maximise the logarithm of the likelihood function rather than the likelihood function itself.

Maximising the log-likelihood

Writing

\[ \ell(\theta) \;=\; \log L(\theta) \]

the maximum likelihood estimate of a parameter \(\theta\) can normally be found by solving the equation

\[ \frac {d\; \log L(\theta \; | \; x_1, x_2, \dots, x_n)} {d\; \theta} \;\; = \;\; \ell'(\theta) \;\; = \;\; 0 \]

(Proved in full version)

We will now give a simple example.

A simple binomial example

Consider a random variable \(X\) that is the number of successes in \(n=20\) independent trials, each of which has probability \(\pi\) of success. If the experiment resulted in \(x=6\) successes, the likelihood function would be

\[ L(\pi) = {{20} \choose 6} \pi^6(1-\pi)^{20-6} \;\; = \;\;38,760 \; \times \pi^6(1-\pi)^{14} \]

Instead of differentiating \(L(\pi)\), it is easier to differentiate the log-likelihood to find the maximum likelihood estimate,

\[ \ell(\pi) \;\; = \;\; \log L(\pi) \;\; = \;\; 6 \log(\pi) + 14 \log(1 - \pi) + K\]

where \(K\) is a constant that does not depend on \(\pi\). We solve

\[ \frac {d \; \ell(\pi)} {d\; \pi} \;\; = \;\; \frac 6 {\pi} - \frac {14} {1 - \pi} \;\; = \;\; 0 \] \[ 6(1-\pi) = 14\pi \] \[ 6 = 20\pi \]

The maximum likelihood estimate of \(\pi\) is therefore \( \hat {\pi} = \frac 6 {20} \), the sample proportion of successes.

The diagram below shows both the likelihood function and the log-likelihood. It illustrates the fact that both functions have their maximum at the same value of \(\pi\).

Generalising to a binomial experiment in which \(x\) successes are observed in \(n\) trials,

\[ \ell(\pi) \; = \; \log L(\pi) \; = \; x \log(\pi) + (n-x) \log(1 - \pi) + K(n, x) \] \[ \frac {d \; \ell(\pi)} {d\; \pi} \; = \; \frac x {\pi} - \frac {n-x} {1 - \pi} \; = \; 0 \]

which can be solved to give

\[ \hat {\pi} \;=\; \frac x n \]

Note that we are using natural logarithms (base-e) here, not logarithms to the base 10.
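The algebraic result \(\hat{\pi} = x/n\) can also be checked numerically by evaluating the log-likelihood over a fine grid of \(\pi\) values. The sketch below (added as an illustration) does this for \(x = 6\) and \(n = 20\); the maximum should occur at \(\pi = 0.30\).

```python
from math import log

x, n = 6, 20

def log_likelihood(p):
    # The constant log C(n, x) is omitted: it does not affect where the maximum is.
    return x * log(p) + (n - x) * log(1 - p)

# Grid search over pi in (0, 1).
grid = [i / 1000 for i in range(1, 1000)]
pi_hat = max(grid, key=log_likelihood)
print(pi_hat)   # 0.3, the sample proportion x/n
```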

4.3.3   Examples

Random sample

If \(\{x_1, x_2, \dots, x_n\}\) is a random sample from a distribution with probability function \(p(x\;|\;\theta)\), then

\[ L(\theta \;|\;x_1, x_2, \dots, x_n) = \prod_{i=1}^n p(x_i \;|\; \theta) \]

so the log-likelihood can be written as

\[ \ell(\theta) = \sum_{i=1}^n \log\left(p(x_i \;|\; \theta)\right) \]

In the next two examples, maximum likelihood estimates must be found from random samples.

Question: Sex ratio of Siberian tigers

The probability of a newborn tiger being male is an unknown parameter, \(\pi\). A researcher recorded the number of males in a sample of \(n = 207\) litters, and these values are summarised in the following frequency table.

Number of males     0     1     2     3
Frequency          33    66    80    28

If it is assumed that the sexes of all tigers in a litter are independently determined, what is the maximum likelihood estimate of \(\pi\)?

(Solved in full version)

In the binomial example (and many others), the maximum likelihood and method of moments estimators are equal. However for some other models, they differ.

Question: Sample from a geometric distribution

If \(\{X_1, X_2, \dots, X_n\}\) is a random sample from a geometric distribution with probability function

\[ p(x) = \pi (1-\pi)^{x-1} \quad \quad \text{for } x = 1, 2, \dots \]

what is the maximum likelihood estimate of \(\pi\)?

(Solved in full version)

The method is now illustrated with a numerical example.

Illustration

Consider a data set {1, 1, 1, 1, 2, 2, 4} which is assumed to be a random sample from a geometric distribution,

\[ X \;\; \sim \; \; \GeomDistn(\pi) \]

The bar chart below shows geometric probabilities when \(\pi = 0.4\). The likelihood is the product of these probabilities for all the data values — i.e. with one term for each cross in the diagram. The log-likelihood is the logarithm of this.

Changing \(\pi\) to 0.8 results in the geometric probabilities below. Although the probabilities are higher for the four values that are "1", the probability of getting the value "4" is now very small. As a result, the product of the probabilities is lower than it might be and the likelihood is not maximised.

The value \(\pi = 0.583\) is a compromise that still has fairly high probabilities for the four 1's but has a larger probability of getting a 4 — this is the value that maximises the likelihood.
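This comparison can be carried out numerically. The sketch below (added for illustration) evaluates the geometric log-likelihood for the data {1, 1, 1, 1, 2, 2, 4} at \(\pi = 0.4\), \(0.583\) and \(0.8\); the middle value should give the largest log-likelihood.

```python
from math import log

data = [1, 1, 1, 1, 2, 2, 4]

def log_likelihood(p):
    # Geometric probability function: p(x) = pi * (1 - pi)^(x - 1), x = 1, 2, ...
    return sum(log(p) + (x - 1) * log(1 - p) for x in data)

for p in (0.4, 0.583, 0.8):
    print(p, round(log_likelihood(p), 3))
```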

4.4   Asymptotic properties of MLEs

4.4.1   Bias, variance and normality

Maximum likelihood estimators have very good large-sample properties.

These strictly require certain "regularity conditions" to be satisfied. We will not state them in this e-book; they almost always hold in practice.

The following results apply to the maximum likelihood estimator, \(\hat {\theta}\), of a parameter \(\theta\), based on a random sample of size \(n\), as \(n \to \infty\); that is, they are its asymptotic properties.

Bias

It is asymptotically unbiased,

\[ E[\hat {\theta}] \;\; \xrightarrow[n \rightarrow \infty]{} \;\; \theta \]

Variance and consistency

It asymptotically has variance,

\[ \Var(\hat {\theta}) \;\; \xrightarrow[n \rightarrow \infty]{} \;\; - \frac 1 {n \times E\left[\large\frac {d^2 \log\left(p(X \;|\; \theta)\right)} {d\theta^2} \right]} \]

Since this tends to zero as \(n \rightarrow \infty\) and the bias is asymptotically zero, a maximum likelihood estimator is also consistent.

Asymptotic normality

It asymptotically has a normal distribution,

\[ \hat {\theta} \;\; \xrightarrow[n \rightarrow \infty]{} \;\; \text{a normal distribution} \]

We now express these three properties together in a slightly more formal way.

All together

If \(\hat {\theta} \) is the maximum likelihood estimator of a parameter, \(\theta\), based on a random sample of size \(n\),

\[ (\hat {\theta} - \theta) \times \sqrt {-n \times E\left[\frac {d^2\; \log\left(p(X \;|\; \theta)\right)} {d\;\theta^2} \right]} \;\; \xrightarrow[n \rightarrow \infty]{} \;\; \NormalDistn(0, 1) \]

A final result states that a maximum likelihood estimator cannot be beaten in large samples.

Asymptotically "best"

Other estimators of a parameter, \(\theta\), may have lower mean squared errors in small samples, but none have lower mean squared error than the maximum likelihood estimator if the sample size is large enough.

4.4.2   Standard error

Practical problems

There are two practical problems with the approximate variance formula for \(\hat {\theta} \),

\[ \Var(\hat {\theta})\; \approx \;- \dfrac 1 {n \times E\left[\frac {\large d^2\; \log\left(p(X \;|\; \theta)\right)} {\large d\;\theta^2} \right]} \]
Difficulty evaluating the expected value
For many distributions, it is impossible to find a simple formula for the expected value.
Unknown value of \(\theta\)
Even if this expected value can be found, it is usually a function of \(\theta\) and \(\theta\) is an unknown value.

Avoiding the expected value

A numerical value for the approximate variance of \(\hat {\theta}\) can be found with a further approximation,

\[ \Var(\hat {\theta})\; \approx \;- \frac 1 {\ell''(\hat {\theta})} \]

(Justified in full version)

Its square root provides us with a numerical value for the standard error of the maximum likelihood estimator,

\[ \se(\hat {\theta}) \;\;\approx\;\; \sqrt {- \frac 1 {\ell''(\hat {\theta})}} \]

This formula lets us find an approximate numerical value for the standard error of almost any maximum likelihood estimator — even when based on models in which the data are not a simple random sample.
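When even \(\ell''(\theta)\) is awkward to derive algebraically, it can be approximated numerically. The sketch below (an added illustration, reusing the binomial example with \(x = 6\) and \(n = 20\)) uses a central finite difference; the resulting standard error should be close to the familiar binomial formula \(\sqrt{\hat{\pi}(1-\hat{\pi})/n} \approx 0.102\).

```python
from math import log, sqrt

x, n = 6, 20
pi_hat = x / n

def log_likelihood(p):
    return x * log(p) + (n - x) * log(1 - p)

# Central finite-difference approximation to the second derivative at pi_hat.
h = 1e-5
ell2 = (log_likelihood(pi_hat + h) - 2 * log_likelihood(pi_hat)
        + log_likelihood(pi_hat - h)) / h ** 2

se = sqrt(-1 / ell2)
print(se, sqrt(pi_hat * (1 - pi_hat) / n))
```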

4.4.3   Examples

The following questions relate to the maximum likelihood estimators for the unknown parameters in two standard distributions.

Question: Single binomial value

In a series of \(n\) independent success/failure trials with probability \(\pi\) of success, \(x\) successes were observed. What is the maximum likelihood estimator of \(\pi\) and what are its bias and standard error?

Question: Geometric random sample

If \(\{x_1, x_2, \dots, x_n\}\) is a random sample from a geometric distribution with parameter \(\pi\), what is the maximum likelihood estimator of \(\pi\) and what are its bias and standard error?

(Both solved in full version)

4.5   Numerical methods for MLEs

4.5.1   Newton-Raphson algorithm

The maximum likelihood estimate of a parameter \(\theta\) is usually a value that satisfies the equation

\[ \ell'(\theta) \;\; = \;\; 0 \]

where \(\ell(\theta)\) is the log-likelihood function. Sometimes this equation cannot be solved algebraically, so an iterative numerical method is required to obtain the maximum likelihood estimate.

One numerical method for solving an equation is the Newton-Raphson algorithm. Consider an equation

\[ g(\theta) \;\; = \;\; 0 \]

Newton-Raphson algorithm

Starting at an initial guess of the solution, \(\theta_0\), successive values

\[ \theta_{i+1} \;\; = \;\; \theta_i - \frac {g(\theta_i)} { g'(\theta_i)} \qquad \text{for } i=0,\dots\]

define the Newton-Raphson algorithm. If the sequence converges, its limit is a solution of the equation \(g(\theta) = 0\).

This is justified by a Taylor series expansion of \(g(\theta)\) around \(\theta_0\).

Applying the algorithm to maximum likelihood

To apply it to maximum likelihood, we use the function \(g(\theta) = \ell'(\theta)\). The Newton-Raphson algorithm can therefore be expressed as

\[ \theta_{i+1} \;\; = \;\; \theta_i - \frac {\ell'(\theta_i)} { \ell''(\theta_i)}\]

This usually converges to the maximum likelihood estimate, provided the initial guess, \(\theta_0\), is not too far from the correct value. The algorithm may need to be run from several starting values until one is found for which it converges.
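As a check of the algorithm on a case where the answer is known, the sketch below (added for illustration) applies these iterations to the binomial log-likelihood of Section 4.3.2, with \(x = 6\) successes in \(n = 20\) trials; it should converge to \(\hat{\pi} = 0.3\).

```python
x, n = 6, 20

def dl(p):    # first derivative of the log-likelihood
    return x / p - (n - x) / (1 - p)

def d2l(p):   # second derivative of the log-likelihood
    return -x / p ** 2 - (n - x) / (1 - p) ** 2

theta = 0.5                      # initial guess
for _ in range(20):
    step = dl(theta) / d2l(theta)
    theta -= step
    if abs(step) < 1e-8:
        break
print(theta)   # converges to 0.3, the sample proportion
```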

4.5.2   Log-series distribution

To illustrate the use of the Newton-Raphson algorithm to find maximum likelihood estimates, we will examine another standard distribution that is occasionally encountered, the log-series distribution.

Definition

A discrete random variable \(X\) is said to have a log-series distribution if its probability function is

\[ p(x) \;=\; \frac {-1} {\log(1-\theta)} \times \frac {\theta^x} x \quad\quad \text{for } x=1, 2, \dots \]

where \(0 \lt \theta \lt 1\).

Its shape is similar to that of the geometric distribution, but it has greater spread, with a higher probability at one and a longer tail.

Maximum likelihood

If a random sample \({x_1, x_2, \dots, x_n}\) is collected from this distribution, what is the maximum likelihood estimate of \(\theta\)? The logarithm of the probability function is

\[ \log \left(p(x)\right) \;=\; x \log(\theta) - \log \left(- \log(1 - \theta) \right) - \log(x) \]

so the log-likelihood function is

\[ \ell(\theta) \;=\; \sum_{i=1}^n \log \left(p(x_i)\right) \;=\; {\sum x_i} \log(\theta) - n \times \log \left( -\log(1 - \theta) \right) + K \]

where \(K\) is a constant whose value does not depend on \(\theta\). The MLE is the solution of

\[ \ell'(\theta) \;=\; \frac {\sum x_i} {\theta} + \frac n {(1 - \theta)\log(1 - \theta)} \;=\; 0 \]

Unfortunately this equation cannot be rearranged to obtain an explicit formula for \(\theta\), so a numerical method must be used to find the maximum likelihood estimate. The Newton-Raphson algorithm also requires the second derivative of the log-likelihood,

\[ \ell''(\theta) \;=\; -\frac {\sum x_i} {\theta^2} + \frac {n \left(1 + \log(1 - \theta) \right)} {(1 - \theta)^2\log^2(1 - \theta)} \]

The algorithm uses these derivatives iteratively to refine an initial estimate, \(\theta_0\),

\[ \theta_{i+1} \;\; = \;\; \theta_i - \frac {\ell'(\theta_i)} { \ell''(\theta_i)}\]

Numerical illustration

Consider the following data set that is assumed to arise from a log-series distribution.

3 5 1 4 8 10 2 1 1 2
1 8 1 6 13 1 6 2 1 3
1 1 1 2 1 6 1 1 1 1

The derivatives of the log-likelihood involve the values \(n = 30\) and \(\sum x = 95\). Iterations of the Newton-Raphson algorithm from an initial guess at the value of \(\theta = 0.7\) are:

Iteration, i    \(\theta_i\)    \(\ell'(\theta_i)\)    \(\ell''(\theta_i)\)
0 0.7000 52.656 -240.78
1 0.9187 -43.613 -1200.14
2 0.8823 -11.484 -661.52
3 0.8650 -1.139 -538.41
4 0.8629 -0.013 -526.41
5 0.8628 -0.000 -526.28
6 0.8628 -0.000 -526.28
7 0.8628

It converges quickly to \(\hat {\theta} = 0.8628\).
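The table above can be reproduced with a few lines of code. The sketch below (added for illustration) uses the derivative formulas given earlier with \(n = 30\) and \(\sum x_i = 95\), starting from \(\theta_0 = 0.7\); the iterations should match the table and settle at 0.8628.

```python
from math import log

n, sum_x = 30, 95      # summary statistics of the data above

def dl(t):             # l'(theta) for the log-series distribution
    return sum_x / t + n / ((1 - t) * log(1 - t))

def d2l(t):            # l''(theta)
    return -sum_x / t ** 2 + n * (1 + log(1 - t)) / ((1 - t) ** 2 * log(1 - t) ** 2)

theta = 0.7            # initial guess
for i in range(8):
    print(i, round(theta, 4), round(dl(theta), 3), round(d2l(theta), 2))
    theta -= dl(theta) / d2l(theta)
```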

The diagram below illustrates the first iteration of the algorithm; it approximates the shape of the log-likelihood using a quadratic curve with the same value, slope and curvature as the log-likelihood at \(\theta = 0.7\). The next iteration is the value of \(\theta\) that maximises this quadratic.

As the iterations get closer to the MLE, the quadratic's shape becomes closer to that of the actual log-likelihood, so the iterations approach the MLE more quickly. The diagram below illustrates this from a starting value of 0.88.

4.5.3   Standard error

The asymptotic formula for the standard error is

\[ \se(\hat {\theta}) \;\;\approx\;\; \sqrt {- \frac 1 {\ell''(\hat {\theta})}} \]

A numerical value for the second derivative of the log-likelihood is available from the last iteration of the Newton-Raphson algorithm.

Standard error for log-series distribution

The iterations of the Newton-Raphson algorithm for finding the MLE of the log-series distribution's parameter, \(\theta\), from the data on the previous page, starting at an initial guess \(\theta_0 = 0.7\), were:

Iteration, i \(\theta_i\) \(\ell'(\theta_i)\) \(\ell''(\theta_i)\)
0 0.7000 52.656 -240.78
1 0.9187 -43.613 -1200.14
: : : :
5 0.8628 -0.000 -526.28
6 0.8628 -0.000 -526.28
7 0.8628

The second derivative of the log-likelihood converges to \(\ell''(\hat{\theta}) = -526.28\). The approximate standard error of the estimate is therefore

\[ \se(\hat {\theta}) \;\;\approx\;\; \sqrt {- \frac 1 {\ell''(\hat {\theta})}} \;\;=\;\; \sqrt {\frac 1 {526.28}} \;\;=\;\; 0.0436 \]

4.6   Confidence intervals

4.6.1   Interval estimates

Describing accuracy with an interval estimate

Reporting a single value as a parameter estimate (a point estimate) does not convey any information about the estimator's accuracy — i.e. the likely size of the estimation error.

It is better to give an interval of values within which we are confident that the parameter will lie — an interval estimate.

Will an interval estimate include θ?

If an interval estimate is too narrow, there will be very little chance of it containing the true parameter value, but intervals that are too wide do not convey much information.

We need to find a way to quantify our confidence that any particular interval estimate will include the parameter being estimated.

4.6.2   Distribution of estimation error

Error distribution

Most parameter estimators are unbiased (or at least asymptotically unbiased), and a formula for their standard deviation (or an approximation to it) can be found. In particular, from the asymptotic properties of maximum likelihood estimators,

\[ \begin{align} E[\hat{\theta}] \;\; &\approx \; \; \theta \\ \se(\hat {\theta}) \;\;&\approx\;\; \sqrt {- \frac 1 {\ell''(\hat {\theta})}} \end{align} \]

Since MLEs are also asymptotically normally distributed, we can find an approximate distribution for the estimation error,

\[ error \;\;=\;\; \hat {\theta} - \theta \;\; \sim \;\; \NormalDistn\left(\mu=0, \;\;\sigma=\se(\hat{\theta})\right) \]

Binomial example

In a series of independent Bernoulli trials with probability \(\pi\) of success, the maximum likelihood estimator of \(\pi\) is the proportion of successes. From the binomial distribution, we have exact formulae for its mean and standard deviation (standard error) and approximate normality in large samples,

\[ \hat {\pi} \;\; \sim \;\; \NormalDistn\left(\pi, \;\;\sigma=\sqrt{\frac {\pi(1-\pi)} n} \right) \]

In the example below, the standard error is used to sketch the approximate distribution of the sampling errors.

From this error distribution, we can conclude that:

Our estimate of the proportion of beginners getting injured during a week of skiing, 0.25, is unlikely to be more than 0.1 from the corresponding proportion of such skiers in general, \(\pi\).


4.6.3   Confidence intervals

95% confidence interval

Assuming that the error distribution is normal (or approximately so), we can use the fact that 95% of any normal distribution is within 1.96 standard deviations of its mean.

We can therefore write

\[ \begin{align} P\left(-1.96 \times \se(\hat {\theta}) \;\;\lt\;\; error \;\;\lt\;\; 1.96 \times \se(\hat {\theta})\right) \;\;&\approx\;\; 0.95 \\ P\left(\hat{\theta}-1.96 \times \se(\hat {\theta}) \;\;\lt\;\; \theta \;\;\lt\;\; \hat{\theta}+1.96 \times \se(\hat {\theta})\right) \;\;&\approx\;\; 0.95 \end{align} \]

We call the interval

\[ \hat{\theta}-1.96 \times \se(\hat {\theta}) \quad \text{ to } \quad \hat{\theta}+1.96 \times \se(\hat {\theta}) \]

a 95% confidence interval for \(\theta\) and we have 95% confidence that it will include the actual value of the parameter.

Other confidence levels

90% of values from a normal distribution are within 1.645 standard deviations of the distribution's mean.

This leads to a 90% confidence interval,

\[ \hat{\theta}-1.645 \times \se(\hat {\theta}) \quad \text{ to } \quad \hat{\theta}+1.645 \times \se(\hat {\theta}) \]

We say that we are 90% confident that an interval that is calculated in this way will include the true parameter value, \(\theta\).

Binomial example
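As an added sketch of this type of calculation (not the original worked example), the code below applies the 95% confidence interval formula to the heat-treated mango data from earlier in the chapter (30 surviving eggs out of 572), using the large-sample standard error \(\sqrt{\hat{\pi}(1-\hat{\pi})/n}\).

```python
from math import sqrt

x, n = 30, 572                     # survivors out of eggs (Section 4.1.4)
pi_hat = x / n                     # maximum likelihood estimate, about 0.0524
se = sqrt(pi_hat * (1 - pi_hat) / n)

lower = pi_hat - 1.96 * se
upper = pi_hat + 1.96 * se
print(round(pi_hat, 4), ":", round(lower, 4), "to", round(upper, 4))
# roughly 0.034 to 0.071
```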

4.6.4   Other examples

In the next example, there is no exact formula for the standard error of the estimator.

Question: Geometric random sample

If \(\{x_1, x_2, \dots, x_n\}\) is a random sample from a geometric distribution with parameter \(\pi\), find a large-sample 90% confidence interval for the parameter \(\pi\).

(Solved in full version)

In the next example, the Newton-Raphson algorithm should be used to obtain the maximum likelihood estimate and its standard error.

Question: Log-series distribution

The following data set is assumed to arise from a log-series distribution with probability function

\[ p(x) \;=\; \frac {-1} {\log(1-\theta)} \times \frac {\theta^x} x \quad\quad \text{for } x=1, 2, \dots \]
3 5 1 4 8 10 2 1 1 2
1 8 1 6 13 1 6 2 1 3
1 1 1 2 1 6 1 1 1 1

Find a large-sample 95% confidence interval for the parameter \(\theta\).

(Solved in full version)

 

4.6.5   Properties of confidence intervals

A confidence interval does not always "work" — it may not actually include the unknown parameter value. A 95% confidence interval might be

\[ 0.7773 \;\;\lt\;\; \theta \;\;\lt\;\; 0.9483 \]

but the value of \(\theta\) is unknown and may not actually lie within this interval. The best we can say is that we are 95% confident that \(\theta\) will be between these two values.

If confidence intervals were found from other similar random samples, 95% of them would include \(\theta\).

The notion of a confidence level is therefore more a characteristic of the method used to find the confidence interval than of the specific confidence interval from a single data set.

Simulation

The simulation below took 100 random samples of size \(n = 200\) from a population with \(\pi = 0.6\). Most of the confidence intervals included \(\pi = 0.6\), but some did not. If the simulation were repeated many more times, the proportion of intervals including 0.6 would be close to 0.95.
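A simulation like this can be written in a few lines. The sketch below (an added illustration of the same idea) draws 100 samples of size \(n = 200\) with \(\pi = 0.6\), computes a 95% confidence interval from each, and counts how many of the intervals include 0.6.

```python
import random
from math import sqrt

random.seed(1)

pi, n, num_samples = 0.6, 200, 100
covered = 0
for _ in range(num_samples):
    x = sum(random.random() < pi for _ in range(n))   # number of successes
    p_hat = x / n
    se = sqrt(p_hat * (1 - p_hat) / n)
    if p_hat - 1.96 * se < pi < p_hat + 1.96 * se:
        covered += 1

print(covered, "of", num_samples, "intervals include pi = 0.6")
```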

In practice, we only have a single sample and a single confidence interval, but we have "95% confidence" that it will include the true (and usually unknown) value of \(\pi\).