Uses kernel density estimation to estimate the underlying density of a sample (P.W. Goedhart).
Options
PRINT = string token |
What to print (integral , summary , monitoring , graph ); default inte |
---|---|
METHOD = string token |
Which automatic bandwidth selection method should be used when the BANDWIDTH option is not set (s1 , s2 , s3 , sj); default sj |
BANDWIDTH = scalar or variate |
Which bandwidth value or values are to be used; default * |
NGRIDEXPONENT2 = scalar |
Defines the number of grid points as 2**NGRIDEXPONENT2 ; default 11 |
SAVEGRIDEXTENT = scalar |
Defines the lower and upper limit of the interval on which the kernel density is saved; the default value of 4 uses the full interval on which the kernel density is calculated |
NFOURIER = scalar |
Defines the upper limit of the sample size for which the kernel density is calculated directly (when the sample size exceeds the setting of this option, the fast Fourier transform is used to calculate the kernel density); default 100 |
PROPORTION = variate |
Proportions at which to calculate quantiles of the kernel density estimate; default !(0.025, 0.25, 0.5, 0.75, 0.975) |
PLOT = string tokens |
Specifies the graphs to be plotted (kerneldensity , histogram , sample ); default kern , hist , samp |
TITLE = text |
General title(s) for the graph(s); default * |
WINDOW = scalar or variate |
Window number(s) for the graph(s); default 1 |
SCREEN = string token |
Whether to clear the screen before plotting into the first window, or whether to or continue plotting on the old screen (clear , keep ); default clea |
Parameters
SAMPLE = variates |
The sample for which to calculate the kernel density estimate |
---|---|
GRID = variates |
Saves the grid of equidistant points at which the kernel density is calculated |
DENSITY = variates or pointers |
Saves the kernel density estimate |
CUMULATIVE = variates or pointers |
Saves the estimated cumulative distribution |
QUANTILE = variates or pointers |
Saves the quantiles calculated from the estimated cumulative distribution |
SAVEBANDWIDTH = scalars |
Saves the automatically selected bandwidths as specified by the METHOD option |
Description
Kernel density estimation is a useful tool for exploring the unknown underlying distribution of a sample, see e.g. Silverman (1986) for a general introduction in density estimation. The kernel method constructs an estimate fh(t) of the true density function by placing a kernel function K(t; xi, h) over each observation xi in the sample. The kernel function K(t; x, h) is itself a density function with location parameter x and scale parameter h, also called bandwidth in this context. The density estimate is then given by
fh(t) = (1 / (n × h)) × ∑i = 1…n K((t – xi) / h) (1)
where n denotes the sample size. It turns out that the choice of kernel function K is not very critical for the resulting estimate fh(t), see Section 3.3 of Silverman (1986). The Gaussian kernel is commonly used and is therefore adopted here as kernel function, i.e.
K(t) = (1 / √(2 × π)) × exp((-t2) / 2) (2)
For this choice of kernel function K, there is an efficient algorithm available for the calculation of fh(t). This algorithm employs the fast Fourier transform of the data.
The choice of bandwidth h is of crucial importance in kernel density estimation. A large value of h will give rise to an oversmoothed density estimate, while a small value of h will produce a very ragged density with many spikes at the observations. Silverman (1986) recommends examining kernel density estimates for several values of h, since this will highlight different features of the data. For automatic use of kernel density estimation, estimation of the bandwidth h from the data is very helpful. Silverman (1986) suggests the following normal-based estimates:
S1 = 1.06 × (standard deviation) × n-1/5
S2 = 0.79 × (interquartile range) × n-1/5
S3 = 0.90 × minimum(standard deviation, interquartile range/1.34) × n-1/5
These estimates are popular due to their simplicity. Jones, Marron & Sheather (1996), who provide an extensive review of the many automatic methods for choosing the bandwidth, advise against these estimates. They recommend the method of Sheather & Jones (1991) for general purposes. This method, denoted below by SJ, is therefore the default method used in the KERNELDENSITY
procedure.
The sample, for which to estimate the underlying density, must be specified by means of the SAMPLE
parameter. The METHOD
and BANDWIDTH
options determine which bandwidths h are used. When the BANDWIDTH
option is set to a scalar or variate, then these values are used for the bandwidth h. When the BANDWIDTH
option is unset, the METHOD
option determines which automatic bandwidth selection method is used. The default setting of the METHOD
option is sj
, which indicates that the method of Sheather & Jones (1991) is to be used. The automatically selected bandwidth can be saved by means of the SAVEBANDWIDTH
parameter.
The kernel density estimate is calculated on an interval at a grid of equidistant points. The grid is returned using the GRID
parameter, and the density estimate and corresponding cumulative density can be saved with the DENSITY
and CUMULATIVE
parameters. When the BANDWIDTH
option is set to a variate, the DENSITY
and CUMULATIVE parameters are pointers to variates: one variate for each bandwidth value. The number of grid points can be set using the NGRIDEXPONENT2
option as 2**NGRIDEXPONENT2
. The lower and upper limit of the interval on which the kernel density is calculated are given by:
CALCULATE lower = MINIMUM(SAMPLE) - 4*MAXIMUM(BANDWIDTH)
CALCULATE upper = MAXIMUM(SAMPLE) + 4*MAXIMUM(BANDWIDTH)
This ensures that the integral of the kernel density will be very close to one. The SAVEGRIDEXTENT
option can be used to save the grid and the (cumulative) density at a more limited interval defined by
CALCULATE lowsave = MINIMUM(SAMPLE)\
- SAVEGRIDEXTENT*MAXIMUM(BANDWIDTH)
CALCULATE uppsave = MAXIMUM(SAMPLE)\
+ SAVEGRIDEXTENT*MAXIMUM(BANDWIDTH)
The setting of the NFOURIER
option determines whether the kernel density is calculated directly by means of equation (1) or by employing the fast Fourier transform of the data. When the sample size n exceeds the setting of the NFOURIER
option, the fast Fourier transform is used.
The parameter QUANTILES
can be used to save quantiles of the kernel density estimate, for proportions specified by means of the PROPORTION
option. When the BANDWIDTH
option is set to a variate, the QUANTILES
are saved in a pointer containing a set of variates.
The PRINT
option controls the output displayed by KERNELDENSITY
. The integral
setting prints the integral of the kernel density, which should be close to one, while the summary
setting print summary statistics of the sample and of the kernel density estimate. The monitoring
setting can be used to monitor the iterative bandwidth estimation method SJ. Finally, the setting graph
produces a high-resolution plots of the kernel densities, superimposed over a rough histogram estimate of the density calculated as the proportion of the sample falling into CEILING
(SQRT
(number of samples))+1 equal intervals across the range of sample values. (There will be as many plots as there were bandwidths.) The sample values are also plotted, using the symbol +
, along the bottom of the plots. The PLOT
option controls which elements (kerneldensity
, histogram
, sample
) are plotted. The TITLE
option can provide a title for each graph. The WINDOW
option specifies the windows to be used for the plots (default 1), and the SCREEN
option controls whether or not the screen is cleared before plotting into the first window (default clear).
Options: PRINT
, METHOD
, BANDWIDTH
, NGRIDEXPONENT2
, SAVEGRIDEXTENT
, NFOURIER
, PROPORTION
, PLOT
, TITLE
, WINDOW
, SCREEN
.
Parameters: SAMPLE
, GRID
, DENSITY
, CUMULATIVE
, QUANTILE
, SAVEBANDWIDTH
.
Method
The interquartile range is calculated by means of the TABULATE
directive. For sample sizes larger than the setting of the NFOURIER
option, the fast Fourier transform of the data is used. This employs the algorithm of Silverman (1982), with the modification of Jones & Lotwick (1984), using the FOURIER
directive. The cumulative density is calculated by applying the trapezodial rule to the density. Quantiles of the estimated distribution are calculated with the INTERPOLATE
directive applied to the cumulative distribution. The difference between the cumulative distribution calculated directly and by means of the Fourier transform, was found to be less than 0.0001 for samples of size 100 from a wide variety of mixed distributions.
The bandwidth selection method of Sheather & Jones (1991) requires solving of a complicated equation by means of an iterative method. The iterative method stops when the relative difference between subsequent estimates of the bandwidth is smaller than 0.0001. The implementation in Genstat is a transcription of a Fortran subroutine which was kindly made available by Jones (1991). The algorithm uses all squared pairwise differences of the sample, i.e. (xi – xj)2 for all i, j. When the sample size is large there are too many such differences. The sampled values are then discretizised on a grid; i.e. all the sampled values are assigned to a binning interval and the expectation of this value is used instead of the exact value (xi – xj)2. Assuming a uniform distribution of xi and xj over their respective binning intervals, the expectation is given by d2 (k2 + 1/6), where d is the length of a binning interval and k is the number of intervals in between xi and xj. The data are discretizised when the sample size exceeds 500. The grid is refined in a loop until there are more than 500 bins with sampled values assigned to them. The relative difference between the Sheather & and Jones estimate calculated with exact squared pairwise differences and by discretization of the data, was found to be less than 0.0003 for samples of size 100 from a wide variety of mixed distributions.
The automatic bandwidth selection method of Sheather & Jones (1991) is implemented in a subsidiary procedure _KERNELSJ
which is called by KERNELDENSITY
. This has the following 5 options:
PRINT = string token |
What to print (monitoring ); default * |
---|---|
NTIMES = scalar |
Maximum number of iterations; default 20 |
TOLERANCE = scalar |
Convergence criterion; default 0.0001 |
BINSAMPLE = scalar |
Defines the upper limit of the sample size for which the method uses exact squared pairwise differences of the sampled value (for sample sizes exceeding the setting of this option, discretization of the sample is used); default 500 |
NGRID = scalar |
Defines the number of bins for discretization of the sampled values – the grid is refined in a loop until there are more than NGRID bins with sampled values assigned to them; default 500 |
It also has three parameters which must all be set
SAMPLE = variates |
The sample for which to estimate the bandwidth |
---|---|
IQR = scalar |
Interquartile range of the sample |
SJ = scalars |
Saves the estimated bandwidth using Sheather & Jones (1991) |
Action with RESTRICT
The SAMPLE
parameter can be restricted. The grid and kernel density estimate are then calculated using only those units that are in the restriction set.
References
Jones, M.C. & Lotwick, H.W. (1984). A remark on algorithm AS 176. Kernel density estimation using the fast Fourier transform. Remark AS R50. Applied Statistics, 33, 120-122.
Jones, M.C. (1991). Fortran subroutine SJEQD for the automatic bandwidth selection method of Sheather & Jones (1991). Personal communication.
Jones, M.C., Marron, J.S. & Sheather, S.J. (1996). Progress in data-based bandwidth selection for kernel density estimation. Computational Statistics, 11, 337-381.
Sheather, S.J. & Jones, M.C. (1991). A reliable data-based bandwidth selection method for kernel density estimation. Journal of the Royal Statistical Society, Series B, 53, 683-690.
Silverman, B.W. (1982). Kernel density estimation using the fast Fourier transform. Applied Statistics Algorithm AS 176. Applied Statistics, 31, 93-99.
Silverman, B.W. (1986). Density Estimation for Statistics and Data Analysis. Chapman & Hall, London.
See also
Directive: DISTRIBUTION
.
Procedures: MSEKERNEL2D
, PTKERNEL2D
, PTK3D
.
Commands for: Basic and nonparametric statistics.
Example
CAPTION 'KERNELDENSITY example',!t(\ 'Eruption lengths (in minutes) of the Old Faithful geyser',\ 'from Table 2.2 of Silverman (1986). Note that Jones, Marron',\ '& Sheather (1996) find a bandwidth of 0.21 for these data',\ 'using the method of Sheather & Jones (1991). A bandwidth of',\ '0.2043 is found by the KERNELDENSITY procedure. The difference',\ 'between these values is probably due to distinct definitions',\ 'of the interquartile range.'); STYLE=meta,plain VARIATE [NVALUES=107] Eruption READ Eruption 4.37 3.87 4.00 4.03 3.50 4.08 2.25 4.70 1.73 4.93 1.73 4.62 3.43 4.25 1.68 3.92 3.68 3.10 4.03 1.77 4.08 1.75 3.20 1.85 4.62 1.97 4.50 3.92 4.35 2.33 3.83 1.88 4.60 1.80 4.73 1.77 4.57 1.85 3.52 4.00 3.70 3.72 4.25 3.58 3.80 3.77 3.75 2.50 4.50 4.10 3.70 3.80 3.43 4.00 2.27 4.40 4.05 4.25 3.33 2.00 4.33 2.93 4.58 1.90 3.58 3.73 3.73 1.82 4.63 3.50 4.00 3.67 1.67 4.60 1.67 4.00 1.80 4.42 1.90 4.63 2.93 3.50 1.97 4.28 1.83 4.13 1.83 4.65 4.20 3.93 4.33 1.83 4.53 2.03 4.18 4.43 4.07 4.13 3.95 4.10 2.72 4.58 1.90 4.50 1.95 4.83 4.12 : KERNELDENSITY [PRINT=summary,graph] Eruption CAPTION 'Plot the kernel density estimate for different bandwidths.' VARIATE [VALUES=0.05, 0.1, 0.2, 0.8] bandwidth KERNELDENSITY [PRINT=graph; BANDWIDTH=bandwidth; WINDOW=!(5...8)] Eruption