Uses kernel density estimation to estimate the underlying density of a sample (P.W. Goedhart).

### Options

`PRINT` = string token |
What to print (`integral` , `summary` , `monitoring` , `graph` ); default `inte` |
---|---|

`METHOD` = string token |
Which automatic bandwidth selection method should be used when the `BANDWIDTH` option is not set (`s1` , `s2` , `s3` , `sj); ` default `sj` |

`BANDWIDTH` = scalar or variate |
Which bandwidth value or values are to be used; default `*` |

`NGRIDEXPONENT2` = scalar |
Defines the number of grid points as 2`**NGRIDEXPONENT2` ; default 11 |

`SAVEGRIDEXTENT` = scalar |
Defines the lower and upper limit of the interval on which the kernel density is saved; the default value of 4 uses the full interval on which the kernel density is calculated |

`NFOURIER` = scalar |
Defines the upper limit of the sample size for which the kernel density is calculated directly (when the sample size exceeds the setting of this option, the fast Fourier transform is used to calculate the kernel density); default 100 |

`PROPORTION` = variate |
Proportions at which to calculate quantiles of the kernel density estimate; default !(0.025, 0.25, 0.5, 0.75, 0.975) |

`PLOT` = string tokens |
Specifies the graphs to be plotted (`kerneldensity` , `histogram` , `sample` ); default `kern` , `hist` , `samp` |

`TITLE` = text |
General title(s) for the graph(s); default `*` |

`WINDOW` = scalar or variate |
Window number(s) for the graph(s); default 1 |

`SCREEN` = string token |
Whether to clear the screen before plotting into the first window, or whether to or continue plotting on the old screen (`clear` , `keep` ); default `clea` |

### Parameters

`SAMPLE` = variates |
The sample for which to calculate the kernel density estimate |
---|---|

`GRID` = variates |
Saves the grid of equidistant points at which the kernel density is calculated |

`DENSITY` = variates or pointers |
Saves the kernel density estimate |

`CUMULATIVE` = variates or pointers |
Saves the estimated cumulative distribution |

`QUANTILE` = variates or pointers |
Saves the quantiles calculated from the estimated cumulative distribution |

`SAVEBANDWIDTH` = scalars |
Saves the automatically selected bandwidths as specified by the `METHOD` option |

### Description

Kernel density estimation is a useful tool for exploring the unknown underlying distribution of a sample, see e.g. Silverman (1986) for a general introduction in density estimation. The kernel method constructs an estimate *f _{h}*(

*t*) of the true density function by placing a kernel function

*K*(

*t*;

*x*,

_{i}*h*) over each observation

*x*in the sample. The kernel function

_{i}*K*(

*t*;

*x*,

*h*) is itself a density function with location parameter

*x*and scale parameter

*h*, also called bandwidth in this context. The density estimate is then given by

*f _{h}*(

*t*) = (1 / (

*n*×

*h*)) × ∑

_{i = 1…n}

*K*((

*t*–

*x*) /

_{i}*h*)

**(1)**

where *n* denotes the sample size. It turns out that the choice of kernel function *K* is not very critical for the resulting estimate *f _{h}*(

*t*), see Section 3.3 of Silverman (1986). The Gaussian kernel is commonly used and is therefore adopted here as kernel function, i.e.

*K*(t) = (1 / √(2 × π)) × exp((-*t*^{2}) / 2) ** (2)**

For this choice of kernel function *K*, there is an efficient algorithm available for the calculation of *f _{h}*(

*t*). This algorithm employs the fast Fourier transform of the data.

The choice of bandwidth *h* is of crucial importance in kernel density estimation. A large value of *h* will give rise to an oversmoothed density estimate, while a small value of *h* will produce a very ragged density with many spikes at the observations. Silverman (1986) recommends examining kernel density estimates for several values of *h*, since this will highlight different features of the data. For automatic use of kernel density estimation, estimation of the bandwidth *h* from the data is very helpful. Silverman (1986) suggests the following normal-based estimates:

S1 = 1.06 × (standard deviation) × *n*^{-1/5}

S2 = 0.79 × (interquartile range) × *n*^{-1/5}

S3 = 0.90 × minimum(standard deviation, interquartile range/1.34) × *n*^{-1/5}

These estimates are popular due to their simplicity. Jones, Marron & Sheather (1996), who provide an extensive review of the many automatic methods for choosing the bandwidth, advise against these estimates. They recommend the method of Sheather & Jones (1991) for general purposes. This method, denoted below by SJ, is therefore the default method used in the `KERNELDENSITY`

procedure.

The sample, for which to estimate the underlying density, must be specified by means of the `SAMPLE`

parameter. The `METHOD`

and `BANDWIDTH`

options determine which bandwidths *h* are used. When the `BANDWIDTH`

option is set to a scalar or variate, then these values are used for the bandwidth *h*. When the `BANDWIDTH`

option is unset, the `METHOD`

option determines which automatic bandwidth selection method is used. The default setting of the `METHOD`

option is `sj`

, which indicates that the method of Sheather & Jones (1991) is to be used. The automatically selected bandwidth can be saved by means of the `SAVEBANDWIDTH`

parameter.

The kernel density estimate is calculated on an interval at a grid of equidistant points. The grid is returned using the `GRID`

parameter, and the density estimate and corresponding cumulative density can be saved with the `DENSITY`

and `CUMULATIVE`

parameters. When the `BANDWIDTH`

option is set to a variate, the `DENSITY`

and `CUMULATIVE parameters are pointers to variates: one variate for each bandwidth value. The number of grid points can be set using the NGRIDEXPONENT2`

option as 2`**NGRIDEXPONENT2`

. The lower and upper limit of the interval on which the kernel density is calculated are given by:

`CALCULATE lower = MINIMUM(SAMPLE) - 4*MAXIMUM(BANDWIDTH)`

`CALCULATE upper = MAXIMUM(SAMPLE) + 4*MAXIMUM(BANDWIDTH)`

This ensures that the integral of the kernel density will be very close to one. The `SAVEGRIDEXTENT`

option can be used to save the grid and the (cumulative) density at a more limited interval defined by

`CALCULATE lowsave = MINIMUM(SAMPLE)\`

` - SAVEGRIDEXTENT*MAXIMUM(BANDWIDTH)`

`CALCULATE uppsave = MAXIMUM(SAMPLE)\`

` + SAVEGRIDEXTENT*MAXIMUM(BANDWIDTH)`

The setting of the `NFOURIER`

option determines whether the kernel density is calculated directly by means of equation (1) or by employing the fast Fourier transform of the data. When the sample size *n* exceeds the setting of the `NFOURIER`

option, the fast Fourier transform is used.

The parameter `QUANTILES`

can be used to save quantiles of the kernel density estimate, for proportions specified by means of the `PROPORTION`

option. When the `BANDWIDTH`

option is set to a variate, the `QUANTILES`

are saved in a pointer containing a set of variates.

The `PRINT`

option controls the output displayed by `KERNELDENSITY`

. The `integral`

setting prints the integral of the kernel density, which should be close to one, while the `summary`

setting print summary statistics of the sample and of the kernel density estimate. The `monitoring`

setting can be used to monitor the iterative bandwidth estimation method SJ. Finally, the setting `graph`

produces a high-resolution plots of the kernel densities, superimposed over a rough histogram estimate of the density calculated as the proportion of the sample falling into `CEILING`

(`SQRT`

(number of samples))+1 equal intervals across the range of sample values. (There will be as many plots as there were bandwidths.) The sample values are also plotted, using the symbol `+`

, along the bottom of the plots. The `PLOT`

option controls which elements (`kerneldensity`

, `histogram`

, `sample`

) are plotted. The `TITLE`

option can provide a title for each graph. The `WINDOW`

option specifies the windows to be used for the plots (default 1), and the `SCREEN`

option controls whether or not the screen is cleared before plotting into the first window (default clear).

Options: `PRINT`

, `METHOD`

, `BANDWIDTH`

, `NGRIDEXPONENT2`

, `SAVEGRIDEXTENT`

, `NFOURIER`

, `PROPORTION`

, `PLOT`

, `TITLE`

, `WINDOW`

, `SCREEN`

.

Parameters: `SAMPLE`

, `GRID`

, `DENSITY`

, `CUMULATIVE`

, `QUANTILE`

, `SAVEBANDWIDTH`

.

### Method

The interquartile range is calculated by means of the `TABULATE`

directive. For sample sizes larger than the setting of the `NFOURIER`

option, the fast Fourier transform of the data is used. This employs the algorithm of Silverman (1982), with the modification of Jones & Lotwick (1984), using the `FOURIER`

directive. The cumulative density is calculated by applying the trapezodial rule to the density. Quantiles of the estimated distribution are calculated with the `INTERPOLATE`

directive applied to the cumulative distribution. The difference between the cumulative distribution calculated directly and by means of the Fourier transform, was found to be less than 0.0001 for samples of size 100 from a wide variety of mixed distributions.

The bandwidth selection method of Sheather & Jones (1991) requires solving of a complicated equation by means of an iterative method. The iterative method stops when the relative difference between subsequent estimates of the bandwidth is smaller than 0.0001. The implementation in Genstat is a transcription of a Fortran subroutine which was kindly made available by Jones (1991). The algorithm uses all squared pairwise differences of the sample, i.e. (*x _{i}* –

*x*)

_{j}^{2}for all

*i*,

*j*. When the sample size is large there are too many such differences. The sampled values are then discretizised on a grid; i.e. all the sampled values are assigned to a binning interval and the expectation of this value is used instead of the exact value (

*x*–

_{i}*x*)

_{j}^{2}. Assuming a uniform distribution of

*x*and

_{i}*x*over their respective binning intervals, the expectation is given by

_{j}*d*

^{2}(

*k*

^{2}+ 1/6), where

*d*is the length of a binning interval and

*k*is the number of intervals in between

*x*and

_{i}*x*. The data are discretizised when the sample size exceeds 500. The grid is refined in a loop until there are more than 500 bins with sampled values assigned to them. The relative difference between the Sheather & and Jones estimate calculated with exact squared pairwise differences and by discretization of the data, was found to be less than 0.0003 for samples of size 100 from a wide variety of mixed distributions.

_{j}The automatic bandwidth selection method of Sheather & Jones (1991) is implemented in a subsidiary procedure `_KERNELSJ`

which is called by `KERNELDENSITY`

. This has the following 5 options:

`PRINT` = string token |
What to print (`monitoring` ); default `*` |
---|---|

`NTIMES` = scalar |
Maximum number of iterations; default 20 |

`TOLERANCE` = scalar |
Convergence criterion; default 0.0001 |

`BINSAMPLE` = scalar |
Defines the upper limit of the sample size for which the method uses exact squared pairwise differences of the sampled value (for sample sizes exceeding the setting of this option, discretization of the sample is used); default 500 |

`NGRID` = scalar |
Defines the number of bins for discretization of the sampled values – the grid is refined in a loop until there are more than `NGRID` bins with sampled values assigned to them; default 500 |

It also has three parameters which must all be set

`SAMPLE` = variates |
The sample for which to estimate the bandwidth |
---|---|

`IQR` = scalar |
Interquartile range of the sample |

`SJ` = scalars |
Saves the estimated bandwidth using Sheather & Jones (1991) |

### Action with `RESTRICT`

The `SAMPLE`

parameter can be restricted. The grid and kernel density estimate are then calculated using only those units that are in the restriction set.

### References

Jones, M.C. & Lotwick, H.W. (1984). A remark on algorithm AS 176. Kernel density estimation using the fast Fourier transform. Remark AS R50. *Applied Statistics*, 33, 120-122.

Jones, M.C. (1991). Fortran subroutine SJEQD for the automatic bandwidth selection method of Sheather & Jones (1991). *Personal communication*.

Jones, M.C., Marron, J.S. & Sheather, S.J. (1996). Progress in data-based bandwidth selection for kernel density estimation. *Computational Statistics*, 11, 337-381.

Sheather, S.J. & Jones, M.C. (1991). A reliable data-based bandwidth selection method for kernel density estimation. *Journal of the Royal Statistical Society, Series B*, 53, 683-690.

Silverman, B.W. (1982). Kernel density estimation using the fast Fourier transform. Applied Statistics Algorithm AS 176. *Applied Statistics*, 31, 93-99.

Silverman, B.W. (1986). *Density Estimation for Statistics and Data Analysis*. Chapman & Hall, London.

### See also

Directive: `DISTRIBUTION`

.

Procedures: `MSEKERNEL2D`

, `PTKERNEL2D`

, `PTK3D`

.

Commands for: Basic and nonparametric statistics.

### Example

CAPTION 'KERNELDENSITY example',!t(\ 'Eruption lengths (in minutes) of the Old Faithful geyser',\ 'from Table 2.2 of Silverman (1986). Note that Jones, Marron',\ '& Sheather (1996) find a bandwidth of 0.21 for these data',\ 'using the method of Sheather & Jones (1991). A bandwidth of',\ '0.2043 is found by the KERNELDENSITY procedure. The difference',\ 'between these values is probably due to distinct definitions',\ 'of the interquartile range.'); STYLE=meta,plain VARIATE [NVALUES=107] Eruption READ Eruption 4.37 3.87 4.00 4.03 3.50 4.08 2.25 4.70 1.73 4.93 1.73 4.62 3.43 4.25 1.68 3.92 3.68 3.10 4.03 1.77 4.08 1.75 3.20 1.85 4.62 1.97 4.50 3.92 4.35 2.33 3.83 1.88 4.60 1.80 4.73 1.77 4.57 1.85 3.52 4.00 3.70 3.72 4.25 3.58 3.80 3.77 3.75 2.50 4.50 4.10 3.70 3.80 3.43 4.00 2.27 4.40 4.05 4.25 3.33 2.00 4.33 2.93 4.58 1.90 3.58 3.73 3.73 1.82 4.63 3.50 4.00 3.67 1.67 4.60 1.67 4.00 1.80 4.42 1.90 4.63 2.93 3.50 1.97 4.28 1.83 4.13 1.83 4.65 4.20 3.93 4.33 1.83 4.53 2.03 4.18 4.43 4.07 4.13 3.95 4.10 2.72 4.58 1.90 4.50 1.95 4.83 4.12 : KERNELDENSITY [PRINT=summary,graph] Eruption CAPTION 'Plot the kernel density estimate for different bandwidths.' VARIATE [VALUES=0.05, 0.1, 0.2, 0.8] bandwidth KERNELDENSITY [PRINT=graph; BANDWIDTH=bandwidth; WINDOW=!(5...8)] Eruption