KERNELDENSITY procedure

Uses kernel density estimation to estimate the underlying density of a sample (P.W. Goedhart).

Options

`PRINT` = string token	What to print (`integral`, `summary`, `monitoring`, `graph`); default `inte`
`METHOD` = string token	Which automatic bandwidth selection method should be used when the `BANDWIDTH` option is not set (`s1`, `s2`, `s3`, `sj);` default `sj`
`BANDWIDTH` = scalar or variate	Which bandwidth value or values are to be used; default `*`
`NGRIDEXPONENT2` = scalar	Defines the number of grid points as 2`**NGRIDEXPONENT2`; default 11
`SAVEGRIDEXTENT` = scalar	Defines the lower and upper limit of the interval on which the kernel density is saved; the default value of 4 uses the full interval on which the kernel density is calculated
`NFOURIER` = scalar	Defines the upper limit of the sample size for which the kernel density is calculated directly (when the sample size exceeds the setting of this option, the fast Fourier transform is used to calculate the kernel density); default 100
`PROPORTION` = variate	Proportions at which to calculate quantiles of the kernel density estimate; default !(0.025, 0.25, 0.5, 0.75, 0.975)
`PLOT` = string tokens	Specifies the graphs to be plotted (`kerneldensity`, `histogram`, `sample`); default `kern`, `hist`, `samp`
`TITLE` = text	General title(s) for the graph(s); default `*`
`WINDOW` = scalar or variate	Window number(s) for the graph(s); default 1
`SCREEN` = string token	Whether to clear the screen before plotting into the first window, or whether to or continue plotting on the old screen (`clear`, `keep`); default `clea`

Parameters

`SAMPLE` = variates	The sample for which to calculate the kernel density estimate
`GRID` = variates	Saves the grid of equidistant points at which the kernel density is calculated
`DENSITY` = variates or pointers	Saves the kernel density estimate
`CUMULATIVE` = variates or pointers	Saves the estimated cumulative distribution
`QUANTILE` = variates or pointers	Saves the quantiles calculated from the estimated cumulative distribution
`SAVEBANDWIDTH` = scalars	Saves the automatically selected bandwidths as specified by the `METHOD` option

Description

Kernel density estimation is a useful tool for exploring the unknown underlying distribution of a sample, see e.g. Silverman (1986) for a general introduction in density estimation. The kernel method constructs an estimate f_h(t) of the true density function by placing a kernel function K(t; x_i, h) over each observation x_i in the sample. The kernel function K(t; x, h) is itself a density function with location parameter x and scale parameter h, also called bandwidth in this context. The density estimate is then given by

f_h(t) = (1 / (n × h)) × ∑_{i = 1…n} K((t – x_i) / h) (1)

where n denotes the sample size. It turns out that the choice of kernel function K is not very critical for the resulting estimate f_h(t), see Section 3.3 of Silverman (1986). The Gaussian kernel is commonly used and is therefore adopted here as kernel function, i.e.

K(t) = (1 / √(2 × π)) × exp((-t²) / 2) (2)

For this choice of kernel function K, there is an efficient algorithm available for the calculation of f_h(t). This algorithm employs the fast Fourier transform of the data.

The choice of bandwidth h is of crucial importance in kernel density estimation. A large value of h will give rise to an oversmoothed density estimate, while a small value of h will produce a very ragged density with many spikes at the observations. Silverman (1986) recommends examining kernel density estimates for several values of h, since this will highlight different features of the data. For automatic use of kernel density estimation, estimation of the bandwidth h from the data is very helpful. Silverman (1986) suggests the following normal-based estimates:

S1 = 1.06 × (standard deviation) × n^-1/5

S2 = 0.79 × (interquartile range) × n^-1/5

S3 = 0.90 × minimum(standard deviation, interquartile range/1.34) × n^-1/5

These estimates are popular due to their simplicity. Jones, Marron & Sheather (1996), who provide an extensive review of the many automatic methods for choosing the bandwidth, advise against these estimates. They recommend the method of Sheather & Jones (1991) for general purposes. This method, denoted below by SJ, is therefore the default method used in the KERNELDENSITY procedure.

The sample, for which to estimate the underlying density, must be specified by means of the SAMPLE parameter. The METHOD and BANDWIDTH options determine which bandwidths h are used. When the BANDWIDTH option is set to a scalar or variate, then these values are used for the bandwidth h. When the BANDWIDTH option is unset, the METHOD option determines which automatic bandwidth selection method is used. The default setting of the METHOD option is sj, which indicates that the method of Sheather & Jones (1991) is to be used. The automatically selected bandwidth can be saved by means of the SAVEBANDWIDTH parameter.

The kernel density estimate is calculated on an interval at a grid of equidistant points. The grid is returned using the GRID parameter, and the density estimate and corresponding cumulative density can be saved with the DENSITY and CUMULATIVE parameters. When the BANDWIDTH option is set to a variate, the DENSITY and CUMULATIVE parameters are pointers to variates: one variate for each bandwidth value. The number of grid points can be set using the NGRIDEXPONENT2 option as 2**NGRIDEXPONENT2. The lower and upper limit of the interval on which the kernel density is calculated are given by:

CALCULATE lower = MINIMUM(SAMPLE) - 4*MAXIMUM(BANDWIDTH)

CALCULATE upper = MAXIMUM(SAMPLE) + 4*MAXIMUM(BANDWIDTH)

This ensures that the integral of the kernel density will be very close to one. The SAVEGRIDEXTENT option can be used to save the grid and the (cumulative) density at a more limited interval defined by

CALCULATE lowsave = MINIMUM(SAMPLE)\

- SAVEGRIDEXTENT*MAXIMUM(BANDWIDTH)

CALCULATE uppsave = MAXIMUM(SAMPLE)\

+ SAVEGRIDEXTENT*MAXIMUM(BANDWIDTH)

The setting of the NFOURIER option determines whether the kernel density is calculated directly by means of equation (1) or by employing the fast Fourier transform of the data. When the sample size n exceeds the setting of the NFOURIER option, the fast Fourier transform is used.

The parameter QUANTILES can be used to save quantiles of the kernel density estimate, for proportions specified by means of the PROPORTION option. When the BANDWIDTH option is set to a variate, the QUANTILES are saved in a pointer containing a set of variates.

The PRINT option controls the output displayed by KERNELDENSITY. The integral setting prints the integral of the kernel density, which should be close to one, while the summary setting print summary statistics of the sample and of the kernel density estimate. The monitoring setting can be used to monitor the iterative bandwidth estimation method SJ. Finally, the setting graph produces a high-resolution plots of the kernel densities, superimposed over a rough histogram estimate of the density calculated as the proportion of the sample falling into CEILING(SQRT(number of samples))+1 equal intervals across the range of sample values. (There will be as many plots as there were bandwidths.) The sample values are also plotted, using the symbol +, along the bottom of the plots. The PLOT option controls which elements (kerneldensity, histogram, sample) are plotted. The TITLE option can provide a title for each graph. The WINDOW option specifies the windows to be used for the plots (default 1), and the SCREEN option controls whether or not the screen is cleared before plotting into the first window (default clear).

Options: PRINT, METHOD, BANDWIDTH, NGRIDEXPONENT2, SAVEGRIDEXTENT, NFOURIER, PROPORTION, PLOT, TITLE, WINDOW, SCREEN.

Parameters: SAMPLE, GRID, DENSITY, CUMULATIVE, QUANTILE, SAVEBANDWIDTH.

Method

The interquartile range is calculated by means of the TABULATE directive. For sample sizes larger than the setting of the NFOURIER option, the fast Fourier transform of the data is used. This employs the algorithm of Silverman (1982), with the modification of Jones & Lotwick (1984), using the FOURIER directive. The cumulative density is calculated by applying the trapezodial rule to the density. Quantiles of the estimated distribution are calculated with the INTERPOLATE directive applied to the cumulative distribution. The difference between the cumulative distribution calculated directly and by means of the Fourier transform, was found to be less than 0.0001 for samples of size 100 from a wide variety of mixed distributions.

The bandwidth selection method of Sheather & Jones (1991) requires solving of a complicated equation by means of an iterative method. The iterative method stops when the relative difference between subsequent estimates of the bandwidth is smaller than 0.0001. The implementation in Genstat is a transcription of a Fortran subroutine which was kindly made available by Jones (1991). The algorithm uses all squared pairwise differences of the sample, i.e. (x_i – x_j)² for all i, j. When the sample size is large there are too many such differences. The sampled values are then discretizised on a grid; i.e. all the sampled values are assigned to a binning interval and the expectation of this value is used instead of the exact value (x_i – x_j)². Assuming a uniform distribution of x_i and x_j over their respective binning intervals, the expectation is given by d² (k² + 1/6), where d is the length of a binning interval and k is the number of intervals in between x_i and x_j. The data are discretizised when the sample size exceeds 500. The grid is refined in a loop until there are more than 500 bins with sampled values assigned to them. The relative difference between the Sheather & and Jones estimate calculated with exact squared pairwise differences and by discretization of the data, was found to be less than 0.0003 for samples of size 100 from a wide variety of mixed distributions.

The automatic bandwidth selection method of Sheather & Jones (1991) is implemented in a subsidiary procedure _KERNELSJ which is called by KERNELDENSITY. This has the following 5 options:

`PRINT` = string token	What to print (`monitoring`); default `*`
`NTIMES` = scalar	Maximum number of iterations; default 20
`TOLERANCE` = scalar	Convergence criterion; default 0.0001
`BINSAMPLE` = scalar	Defines the upper limit of the sample size for which the method uses exact squared pairwise differences of the sampled value (for sample sizes exceeding the setting of this option, discretization of the sample is used); default 500
`NGRID` = scalar	Defines the number of bins for discretization of the sampled values – the grid is refined in a loop until there are more than `NGRID` bins with sampled values assigned to them; default 500

It also has three parameters which must all be set

`SAMPLE` = variates	The sample for which to estimate the bandwidth
`IQR` = scalar	Interquartile range of the sample
`SJ` = scalars	Saves the estimated bandwidth using Sheather & Jones (1991)

Action with `RESTRICT`

The SAMPLE parameter can be restricted. The grid and kernel density estimate are then calculated using only those units that are in the restriction set.

References

Jones, M.C. & Lotwick, H.W. (1984). A remark on algorithm AS 176. Kernel density estimation using the fast Fourier transform. Remark AS R50. Applied Statistics, 33, 120-122.

Jones, M.C. (1991). Fortran subroutine SJEQD for the automatic bandwidth selection method of Sheather & Jones (1991). Personal communication.

Jones, M.C., Marron, J.S. & Sheather, S.J. (1996). Progress in data-based bandwidth selection for kernel density estimation. Computational Statistics, 11, 337-381.

Sheather, S.J. & Jones, M.C. (1991). A reliable data-based bandwidth selection method for kernel density estimation. Journal of the Royal Statistical Society, Series B, 53, 683-690.

Silverman, B.W. (1982). Kernel density estimation using the fast Fourier transform. Applied Statistics Algorithm AS 176. Applied Statistics, 31, 93-99.

Silverman, B.W. (1986). Density Estimation for Statistics and Data Analysis. Chapman & Hall, London.

Example

CAPTION    'KERNELDENSITY example',!t(\
           'Eruption lengths (in minutes) of the Old Faithful geyser',\
           'from Table 2.2 of Silverman (1986). Note that Jones, Marron',\
           '& Sheather (1996) find a bandwidth of 0.21 for these data',\
           'using the method of Sheather & Jones (1991). A bandwidth of',\
           '0.2043 is found by the KERNELDENSITY procedure. The difference',\
           'between these values is probably due to distinct definitions',\
           'of the interquartile range.'); STYLE=meta,plain
VARIATE    [NVALUES=107] Eruption
READ       Eruption
  4.37 3.87 4.00 4.03 3.50 4.08 2.25    4.70 1.73 4.93 1.73 4.62 3.43 4.25
  1.68 3.92 3.68 3.10 4.03 1.77 4.08    1.75 3.20 1.85 4.62 1.97 4.50 3.92
  4.35 2.33 3.83 1.88 4.60 1.80 4.73    1.77 4.57 1.85 3.52 4.00 3.70 3.72
  4.25 3.58 3.80 3.77 3.75 2.50 4.50    4.10 3.70 3.80 3.43 4.00 2.27 4.40
  4.05 4.25 3.33 2.00 4.33 2.93 4.58    1.90 3.58 3.73 3.73 1.82 4.63 3.50
  4.00 3.67 1.67 4.60 1.67 4.00 1.80    4.42 1.90 4.63 2.93 3.50 1.97 4.28
  1.83 4.13 1.83 4.65 4.20 3.93 4.33    1.83 4.53 2.03 4.18 4.43 4.07 4.13
  3.95 4.10 2.72 4.58 1.90 4.50 1.95    4.83 4.12 :
KERNELDENSITY [PRINT=summary,graph] Eruption
CAPTION    'Plot the kernel density estimate for different bandwidths.'
VARIATE    [VALUES=0.05, 0.1, 0.2, 0.8] bandwidth
KERNELDENSITY [PRINT=graph; BANDWIDTH=bandwidth; WINDOW=!(5...8)] Eruption

Updated on February 7, 2023

Was this article helpful?

Yes No