Uses kernel density estimation to estimate the underlying density of a sample (P.W. Goedhart).
|What to print (
||Which automatic bandwidth selection method should be used when the
||Which bandwidth value or values are to be used; default
||Defines the number of grid points as 2
||Defines the lower and upper limit of the interval on which the kernel density is saved; the default value of 4 uses the full interval on which the kernel density is calculated|
||Defines the upper limit of the sample size for which the kernel density is calculated directly (when the sample size exceeds the setting of this option, the fast Fourier transform is used to calculate the kernel density); default 100|
||Proportions at which to calculate quantiles of the kernel density estimate; default !(0.025, 0.25, 0.5, 0.75, 0.975)|
||Specifies the graphs to be plotted (
||General title(s) for the graph(s); default
||Window number(s) for the graph(s); default 1|
||Whether to clear the screen before plotting into the first window, or whether to or continue plotting on the old screen (
||The sample for which to calculate the kernel density estimate|
||Saves the grid of equidistant points at which the kernel density is calculated|
||Saves the kernel density estimate|
||Saves the estimated cumulative distribution|
||Saves the quantiles calculated from the estimated cumulative distribution|
||Saves the automatically selected bandwidths as specified by the
Kernel density estimation is a useful tool for exploring the unknown underlying distribution of a sample, see e.g. Silverman (1986) for a general introduction in density estimation. The kernel method constructs an estimate fh(t) of the true density function by placing a kernel function K(t; xi, h) over each observation xi in the sample. The kernel function K(t; x, h) is itself a density function with location parameter x and scale parameter h, also called bandwidth in this context. The density estimate is then given by
fh(t) = (1 / (n × h)) × ∑i = 1…n K((t – xi) / h) (1)
where n denotes the sample size. It turns out that the choice of kernel function K is not very critical for the resulting estimate fh(t), see Section 3.3 of Silverman (1986). The Gaussian kernel is commonly used and is therefore adopted here as kernel function, i.e.
K(t) = (1 / √(2 × π)) × exp((-t2) / 2) (2)
For this choice of kernel function K, there is an efficient algorithm available for the calculation of fh(t). This algorithm employs the fast Fourier transform of the data.
The choice of bandwidth h is of crucial importance in kernel density estimation. A large value of h will give rise to an oversmoothed density estimate, while a small value of h will produce a very ragged density with many spikes at the observations. Silverman (1986) recommends examining kernel density estimates for several values of h, since this will highlight different features of the data. For automatic use of kernel density estimation, estimation of the bandwidth h from the data is very helpful. Silverman (1986) suggests the following normal-based estimates:
S1 = 1.06 × (standard deviation) × n-1/5
S2 = 0.79 × (interquartile range) × n-1/5
S3 = 0.90 × minimum(standard deviation, interquartile range/1.34) × n-1/5
These estimates are popular due to their simplicity. Jones, Marron & Sheather (1996), who provide an extensive review of the many automatic methods for choosing the bandwidth, advise against these estimates. They recommend the method of Sheather & Jones (1991) for general purposes. This method, denoted below by SJ, is therefore the default method used in the
The sample, for which to estimate the underlying density, must be specified by means of the
SAMPLE parameter. The
BANDWIDTH options determine which bandwidths h are used. When the
BANDWIDTH option is set to a scalar or variate, then these values are used for the bandwidth h. When the
BANDWIDTH option is unset, the
METHOD option determines which automatic bandwidth selection method is used. The default setting of the
METHOD option is
sj, which indicates that the method of Sheather & Jones (1991) is to be used. The automatically selected bandwidth can be saved by means of the
The kernel density estimate is calculated on an interval at a grid of equidistant points. The grid is returned using the
GRID parameter, and the density estimate and corresponding cumulative density can be saved with the
CUMULATIVE parameters. When the
BANDWIDTH option is set to a variate, the
CUMULATIVE parameters are pointers to variates: one variate for each bandwidth value. The number of grid points can be set using the NGRIDEXPONENT2 option as 2
**NGRIDEXPONENT2. The lower and upper limit of the interval on which the kernel density is calculated are given by:
CALCULATE lower = MINIMUM(SAMPLE) - 4*MAXIMUM(BANDWIDTH)
CALCULATE upper = MAXIMUM(SAMPLE) + 4*MAXIMUM(BANDWIDTH)
This ensures that the integral of the kernel density will be very close to one. The
SAVEGRIDEXTENT option can be used to save the grid and the (cumulative) density at a more limited interval defined by
CALCULATE lowsave = MINIMUM(SAMPLE)\
CALCULATE uppsave = MAXIMUM(SAMPLE)\
The setting of the
NFOURIER option determines whether the kernel density is calculated directly by means of equation (1) or by employing the fast Fourier transform of the data. When the sample size n exceeds the setting of the
NFOURIER option, the fast Fourier transform is used.
QUANTILES can be used to save quantiles of the kernel density estimate, for proportions specified by means of the
PROPORTION option. When the
BANDWIDTH option is set to a variate, the
QUANTILES are saved in a pointer containing a set of variates.
integral setting prints the integral of the kernel density, which should be close to one, while the
summary setting print summary statistics of the sample and of the kernel density estimate. The
monitoring setting can be used to monitor the iterative bandwidth estimation method SJ. Finally, the setting
graph produces a high-resolution plots of the kernel densities, superimposed over a rough histogram estimate of the density calculated as the proportion of the sample falling into
SQRT(number of samples))+1 equal intervals across the range of sample values. (There will be as many plots as there were bandwidths.) The sample values are also plotted, using the symbol
+, along the bottom of the plots. The
PLOT option controls which elements (
sample) are plotted. The
TITLE option can provide a title for each graph. The
WINDOW option specifies the windows to be used for the plots (default 1), and the
SCREEN option controls whether or not the screen is cleared before plotting into the first window (default clear).
The interquartile range is calculated by means of the
TABULATE directive. For sample sizes larger than the setting of the
NFOURIER option, the fast Fourier transform of the data is used. This employs the algorithm of Silverman (1982), with the modification of Jones & Lotwick (1984), using the
FOURIER directive. The cumulative density is calculated by applying the trapezodial rule to the density. Quantiles of the estimated distribution are calculated with the
INTERPOLATE directive applied to the cumulative distribution. The difference between the cumulative distribution calculated directly and by means of the Fourier transform, was found to be less than 0.0001 for samples of size 100 from a wide variety of mixed distributions.
The bandwidth selection method of Sheather & Jones (1991) requires solving of a complicated equation by means of an iterative method. The iterative method stops when the relative difference between subsequent estimates of the bandwidth is smaller than 0.0001. The implementation in Genstat is a transcription of a Fortran subroutine which was kindly made available by Jones (1991). The algorithm uses all squared pairwise differences of the sample, i.e. (xi – xj)2 for all i, j. When the sample size is large there are too many such differences. The sampled values are then discretizised on a grid; i.e. all the sampled values are assigned to a binning interval and the expectation of this value is used instead of the exact value (xi – xj)2. Assuming a uniform distribution of xi and xj over their respective binning intervals, the expectation is given by d2 (k2 + 1/6), where d is the length of a binning interval and k is the number of intervals in between xi and xj. The data are discretizised when the sample size exceeds 500. The grid is refined in a loop until there are more than 500 bins with sampled values assigned to them. The relative difference between the Sheather & and Jones estimate calculated with exact squared pairwise differences and by discretization of the data, was found to be less than 0.0003 for samples of size 100 from a wide variety of mixed distributions.
The automatic bandwidth selection method of Sheather & Jones (1991) is implemented in a subsidiary procedure
_KERNELSJ which is called by
KERNELDENSITY. This has the following 5 options:
||What to print (
||Maximum number of iterations; default 20|
||Convergence criterion; default 0.0001|
||Defines the upper limit of the sample size for which the method uses exact squared pairwise differences of the sampled value (for sample sizes exceeding the setting of this option, discretization of the sample is used); default 500|
||Defines the number of bins for discretization of the sampled values – the grid is refined in a loop until there are more than
It also has three parameters which must all be set
||The sample for which to estimate the bandwidth|
||Interquartile range of the sample|
||Saves the estimated bandwidth using Sheather & Jones (1991)|
SAMPLE parameter can be restricted. The grid and kernel density estimate are then calculated using only those units that are in the restriction set.
Jones, M.C. & Lotwick, H.W. (1984). A remark on algorithm AS 176. Kernel density estimation using the fast Fourier transform. Remark AS R50. Applied Statistics, 33, 120-122.
Jones, M.C. (1991). Fortran subroutine SJEQD for the automatic bandwidth selection method of Sheather & Jones (1991). Personal communication.
Jones, M.C., Marron, J.S. & Sheather, S.J. (1996). Progress in data-based bandwidth selection for kernel density estimation. Computational Statistics, 11, 337-381.
Sheather, S.J. & Jones, M.C. (1991). A reliable data-based bandwidth selection method for kernel density estimation. Journal of the Royal Statistical Society, Series B, 53, 683-690.
Silverman, B.W. (1982). Kernel density estimation using the fast Fourier transform. Applied Statistics Algorithm AS 176. Applied Statistics, 31, 93-99.
Silverman, B.W. (1986). Density Estimation for Statistics and Data Analysis. Chapman & Hall, London.
Commands for: Basic and nonparametric statistics.
CAPTION 'KERNELDENSITY example',!t(\ 'Eruption lengths (in minutes) of the Old Faithful geyser',\ 'from Table 2.2 of Silverman (1986). Note that Jones, Marron',\ '& Sheather (1996) find a bandwidth of 0.21 for these data',\ 'using the method of Sheather & Jones (1991). A bandwidth of',\ '0.2043 is found by the KERNELDENSITY procedure. The difference',\ 'between these values is probably due to distinct definitions',\ 'of the interquartile range.'); STYLE=meta,plain VARIATE [NVALUES=107] Eruption READ Eruption 4.37 3.87 4.00 4.03 3.50 4.08 2.25 4.70 1.73 4.93 1.73 4.62 3.43 4.25 1.68 3.92 3.68 3.10 4.03 1.77 4.08 1.75 3.20 1.85 4.62 1.97 4.50 3.92 4.35 2.33 3.83 1.88 4.60 1.80 4.73 1.77 4.57 1.85 3.52 4.00 3.70 3.72 4.25 3.58 3.80 3.77 3.75 2.50 4.50 4.10 3.70 3.80 3.43 4.00 2.27 4.40 4.05 4.25 3.33 2.00 4.33 2.93 4.58 1.90 3.58 3.73 3.73 1.82 4.63 3.50 4.00 3.67 1.67 4.60 1.67 4.00 1.80 4.42 1.90 4.63 2.93 3.50 1.97 4.28 1.83 4.13 1.83 4.65 4.20 3.93 4.33 1.83 4.53 2.03 4.18 4.43 4.07 4.13 3.95 4.10 2.72 4.58 1.90 4.50 1.95 4.83 4.12 : KERNELDENSITY [PRINT=summary,graph] Eruption CAPTION 'Plot the kernel density estimate for different bandwidths.' VARIATE [VALUES=0.05, 0.1, 0.2, 0.8] bandwidth KERNELDENSITY [PRINT=graph; BANDWIDTH=bandwidth; WINDOW=!(5...8)] Eruption