EDFTEST procedure

Performs empirical-distribution-function goodness-of-fit tests (V.M. Cave).

 

Options

PRINT = string tokens Controls printed output (summary, tests); default summ, test
PLOT = string tokens What graphs to plot (kerneldensity, histogram); default *
TEST = string tokens Specifies the type of goodness-of-fit test to perform (andersondarling, cramervonmises, kolmogorovsmirnov); default ande, cram, kolm
DISTRIBUTION = string tokens Continuous distribution that is hypothesized to have generated the DATA; (beta, b2, burr, cauchy, chisquare, ev1 (or gumbel), ev2 (or frechet), ev3, exponential, fdistribution, gamma, gev, gpareto, iburr, igamma, invnormal, iweibull, laplace, loggamma, logistic, loglogistic, lognormal, normal, paralogistic, pareto, stdnormal, stduniform, tdistribution, ubetamix, ugammamix, uniform, weibull, calculated); default norm
CONSTANT = string tokens Whether to estimate a constant for the distribution, when the parameter values are estimated from the DATA (estimate, omit); default omit
TMETHOD = string tokens Specifies the method used to perform the goodness-of-fit tests (likelihoodratio, traditional); default like
PARAMETERS = scalar or variate Parameter values for the hypothesized distribution; if this is not set, parameter values are estimated from the DATA
NAMES = text Names to identify the parameters in PARAMETERS; if this is not set, the default parameter ordering is assumed
CDFCALCULATION = expression Expression, formed using argument X, that defines the cumulative distribution function of the hypothesized distribution; must be specified when DISTRIBUTION = calculated
MCPARAMETERS = string tokens Whether the parameters are re-estimated or fixed during the Monte-Carlo simulations, when the parameter values are estimated from the DATA (fix, estimate); default esti
NTIMES = scalar Number of Monte-Carlo simulations to perform; default 999
SEED = scalar Seed for random number generation; default 0 continues an existing sequence or, if none, selects a seed automatically
TITLE = text Title for the graphs; default generates the title automatically
YTITLE = text Y-axis title for the graphs; default generates the title automatically
XTITLE = text X-axis title for the graphs; default generates the title automatically
WINDOW = scalar Window to use for the graphs; default 3
SCREEN = string tokens Whether to clear the screen before plotting the graph or to continue plotting on the old screen, when a single graph is requested (clear, keep); default clear

 

Parameters

DATA = variate Identifier of the variate holding the data
STATISTIC = pointer Pointer to scalar(s) to save the test statistic(s)
MCSTATISTICS = pointer Pointer to variate(s) to save the Monte-Carlo simulated test statistic(s)
PROBABILITY = pointer Pointer to scalar(s) to save the probability value(s) of the test statistic(s)

 

Description

EDFTEST performs one-sample two-sided empirical-distribution-function goodness-of-fit tests to assess whether a sample of data comes from a specified continuous distribution. The data values must be supplied, in a variate, using the DATA parameter. The tests to be performed are specified by the TEST option, with settings andersondarling (Anderson-Darling), cramervonmises (Cramér-von Mises) and kolmogorovsmirnov (Kolmogorov-Smirnov); by default all three are performed.

The method used to perform these tests is specified by the TMETHOD option, with settings likelihoodratio for the Zhang (2002) likelihood-ratio based method, and traditional for the traditional approach. The default is to use the likelihood-ratio based tests, which are generally more powerful.
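As a minimal sketch of how TEST and TMETHOD combine, the call below requests only two of the three tests and uses the traditional method. The data values are borrowed from the Example section of this page purely for illustration, and the choice of tests is arbitrary.

```genstat
VARIATE [VALUES=0.621,0.503,0.203,0.477,0.710,0.581,0.329,0.480,0.554,0.382] x
"Traditional Kolmogorov-Smirnov and Anderson-Darling tests of Normality."
EDFTEST [TEST=kolmogorovsmirnov,andersondarling; TMETHOD=traditional;\
         DISTRIBUTION=normal] x
```

Omitting TMETHOD would give the likelihood-ratio based versions of the same tests.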

The distribution from which the data are assumed to arise is specified using the DISTRIBUTION option; the default is normal. Values for the parameters can be supplied, in either a scalar or a variate, by the PARAMETERS option. If any parameter values are supplied, a value must be specified for every parameter.

If parameter values are not supplied, they are estimated from the DATA, except when DISTRIBUTION is set to stdnormal, stduniform or calculated.
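The two modes can be sketched as follows. The data are the values used in the Example section, and the hypothesized mean and standard deviation (0.5 and 0.15) are illustrative assumptions rather than recommended values.

```genstat
VARIATE [VALUES=0.621,0.503,0.203,0.477,0.710,0.581,0.329,0.480,0.554,0.382] x
"Fully specified hypothesis: Normal with mean 0.5 and sd 0.15 (default ordering)."
EDFTEST [DISTRIBUTION=normal; PARAMETERS=!(0.5,0.15)] x
"PARAMETERS not set: the mean and sd are estimated from the data."
EDFTEST [DISTRIBUTION=normal] x
```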

The NAMES option specifies a text to identify the individual parameter values within a variate of PARAMETERS. When the names are not supplied, the default ordering of the parameters is assumed. (This matches the ordering in which parameter estimates are saved using the ESTIMATES parameter of the DPROBABILITY procedure.) The parameter names for each distribution are listed below, in the default parameter ordering:

    Beta Type I (beta) ashape, bshape;
    Beta Type II (b2) ashape, bshape, rate;
    Burr (burr) ashape, scale, bshape;
    Cauchy (cauchy) location, scale;
    Chi-square (chisquare) df;
    Extreme Value Type I (ev1 or gumbel) location, scale;
    Extreme Value Type II (ev2 or frechet) location, scale, shape;
    Extreme Value Type III (ev3) location, scale, shape;
    Exponential (exponential) rate;
    F (fdistribution) ndf, ddf;
    Gamma (gamma) shape, rate, constant (optional);
    Generalized Extreme Value (gev) shape, location, scale;
    Generalized Pareto (gpareto) shape, scale;
    Inverse Burr (iburr) ashape, scale, bshape;
    Inverse Gamma (igamma) shape, scale;
    Inverse Normal (invnormal) mean, shape;
    Inverse Weibull (iweibull) scale, shape;
    Laplace (laplace) location, scale;
    Log-Gamma (loggamma) shape, rate;
    Logistic (logistic) location, scale;
    Log-Logistic (loglogistic) shape, scale;
    Log-Normal (lognormal) mean, sd, constant (optional);
    Normal (normal) mean, sd;
    Paralogistic (paralogistic) shape, scale;
    Pareto (pareto) shape, scale, constant (optional);
    t (tdistribution) df;
    Uniform-Beta mixture (ubetamix) weight, ashape, bshape;
    Uniform-Gamma mixture (ugammamix) weight, shape, scale;
    Uniform (uniform) min, max;
    Weibull (weibull) shape, rate, constant (optional).

The Gamma, Log-Normal, Pareto and Weibull distributions can have an extra constant parameter, so that the data values minus the constant then follow the specified distribution. When PARAMETERS are not supplied, you can set option CONSTANT = estimate to estimate a constant from the DATA. The default is not to estimate a constant.
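A short sketch of both options, using the Gamma distribution. The data are again the Example values, the shape and rate values are illustrative assumptions, and whether a constant can be estimated reliably from so small a sample is not something this sketch addresses.

```genstat
VARIATE [VALUES=0.621,0.503,0.203,0.477,0.710,0.581,0.329,0.480,0.554,0.382] x
"Parameters supplied out of the default order (shape, rate), so NAMES identifies them."
EDFTEST [DISTRIBUTION=gamma; PARAMETERS=!(20,10); NAMES=!t(rate,shape)] x
"Parameters estimated from the data, including the optional constant."
EDFTEST [DISTRIBUTION=gamma; CONSTANT=estimate] x
```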

The DISTRIBUTION option provides the common distributions. Alternatively, for traditional tests (i.e. TMETHOD = traditional) you can set DISTRIBUTION=calculated to define your own distribution. You must then use the CDFCALCULATION option to provide an expression, formed using argument X, to calculate the cumulative distribution function. For example, the exponential distribution with rate parameter of 2 could be specified by setting options

DISTRIBUTION=calculated

and

CDFCALCULATION=!E(X=1-EXP(-2*X)).
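Put together as a complete call, this might look like the sketch below. The expression is copied from the text above; the data are the Example values, used only because they are non-negative.

```genstat
VARIATE [VALUES=0.621,0.503,0.203,0.477,0.710,0.581,0.329,0.480,0.554,0.382] x
"Traditional tests against an exponential distribution with rate 2, defined by its CDF."
EDFTEST [TMETHOD=traditional; DISTRIBUTION=calculated;\
         CDFCALCULATION=!E(X=1-EXP(-2*X))] x
```

TMETHOD=traditional is set explicitly because the default likelihood-ratio method is not available for DISTRIBUTION=calculated.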

Monte-Carlo simulations are used to calculate the empirical probability values of the test statistics under the likelihood-ratio based method (i.e. TMETHOD = likelihoodratio), or, by default, under the traditional method when the parameters are estimated from the DATA. The NTIMES option defines how many Monte-Carlo simulations are used; default 999. The SEED option can be set to initialize the random-number generator used during the Monte-Carlo simulations; calling the procedure again with the same nonzero seed and the same settings then gives identical results. The default of zero continues the sequence of random numbers from a previous generation or, if this is the first use of the generator in this run of Genstat, the seed is initialized automatically.

By default, when the parameters are estimated from the DATA, they are re-estimated for each simulated data set during the Monte-Carlo simulations, to ensure that the correct probability values are obtained. However, this can be overridden by setting the MCPARAMETERS option to fix.
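The simulation settings can be sketched as follows. The seed and the number of simulations are arbitrary illustrative choices, and the data are the Example values.

```genstat
VARIATE [VALUES=0.621,0.503,0.203,0.477,0.710,0.581,0.329,0.480,0.554,0.382] x
"Reproducible run: nonzero seed, and more simulations than the default 999."
EDFTEST [DISTRIBUTION=lognormal; NTIMES=4999; SEED=73251] x
"Same test, but holding the estimated parameters fixed across the simulations."
EDFTEST [DISTRIBUTION=lognormal; MCPARAMETERS=fix; NTIMES=4999; SEED=73251] x
```

Comparing the two sets of probabilities gives a feel for how much re-estimation matters for a given sample.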

Printed output is controlled by the PRINT option, with settings:

    summary to print summary information; and
    tests to print the test statistic(s), with their probability value(s) under the assumption that the data are from the hypothesized distribution (so a low probability indicates that the data are unlikely to be from the hypothesized distribution).

The printed output can be suppressed by setting option PRINT = *. The default is to print the summary and the tests.

The PLOT option controls graphical output, with settings:

    histogram to plot a histogram of the Monte-Carlo simulated test statistics; and
    kerneldensity to produce a kernel density plot of the Monte-Carlo simulated test statistics.

By default, nothing is plotted.

The TITLE, YTITLE and XTITLE options can supply an overall title, a y-axis title and an x-axis title for the graphs, respectively. If these are not supplied, suitable titles are generated automatically. When a single plot is requested, you can set option SCREEN = keep to plot the graph on an existing screen; by default the screen is cleared first. The WINDOW option defines the window to use for the plots; default 3.
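A sketch of the graphics options for a single test. The title strings, window number and seed are illustrative assumptions, and the data are the Example values.

```genstat
VARIATE [VALUES=0.621,0.503,0.203,0.477,0.710,0.581,0.329,0.480,0.554,0.382] x
"Histogram of the simulated Kolmogorov-Smirnov statistics, with user-supplied titles."
EDFTEST [PLOT=histogram; TEST=kolmogorovsmirnov; DISTRIBUTION=normal;\
         TITLE='Simulated Kolmogorov-Smirnov statistics';\
         XTITLE='Test statistic'; WINDOW=1; SEED=73251] x
```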

The STATISTIC, PROBABILITY and MCSTATISTICS parameters allow the test statistics, their probabilities and the Monte-Carlo simulated test statistics to be saved, respectively, in pointers.
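For example, the following sketch runs a single test silently and then prints the saved values. The identifiers stat, mcstat and prob are hypothetical names chosen for this illustration, and the data are the Example values.

```genstat
VARIATE [VALUES=0.621,0.503,0.203,0.477,0.710,0.581,0.329,0.480,0.554,0.382] x
"Run only the Anderson-Darling test, suppress printing, and save the results."
EDFTEST [PRINT=*; TEST=andersondarling; DISTRIBUTION=normal; SEED=73251]\
        DATA=x; STATISTIC=stat; MCSTATISTICS=mcstat; PROBABILITY=prob
"The null subscript [] refers to every element of a pointer."
PRINT stat[],prob[]
```

The simulated statistics in mcstat can be passed on in the same way, for instance to draw a customized plot.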

Options: PRINT, PLOT, TEST, DISTRIBUTION, CONSTANT, TMETHOD, PARAMETERS, NAMES, CDFCALCULATION, MCPARAMETERS, NTIMES, SEED, TITLE, YTITLE, XTITLE, WINDOW, SCREEN.

Parameters: DATA, STATISTIC, MCSTATISTICS, PROBABILITY.

 

Method

If TMETHOD=traditional, EDFTEST calculates the traditional Anderson-Darling, Cramér-von Mises and Kolmogorov-Smirnov goodness-of-fit tests. When PARAMETERS are supplied (or if MCPARAMETERS = fix), the probability of the Anderson-Darling test statistic is calculated using the fast algorithm (adinf) of Marsaglia & Marsaglia (2004), the probability of the Cramér-von Mises test statistic is calculated using the one-term linking approximation (equation 1.8) of Csörgő & Faraway (1996), and the probability of the Kolmogorov-Smirnov test statistic is calculated using the method of Carvalho (2015) for data sets with fewer than 171 values or using the Wang et al. (2003) approximation for larger data sets. When PARAMETERS are not supplied, Monte-Carlo simulation is used by default to obtain empirical probability values of the test statistics. However, empirical probability values are not available for DISTRIBUTION = ubetamix or ugammamix.

If TMETHOD = likelihoodratio, EDFTEST calculates likelihood-ratio based goodness-of-fit test statistics using the method of Zhang (2002). (Note, however, that the likelihood-ratio based method is not available for DISTRIBUTION = ubetamix, ugammamix, or calculated.) The resulting tests are generally more powerful than their traditional analogues. Monte-Carlo simulation is used to obtain empirical probability values of the test statistics.

When PARAMETERS are not supplied, maximum-likelihood estimates are obtained using the methods in the DPROBABILITY procedure. When MCPARAMETERS = estimate, the parameter values are re-estimated for each simulated data set using the DPROBABILITY procedure.

The kernel-density plot is generated by the KERNELDENSITY procedure, using the method of Sheather & Jones (1991), with the default number of grid points. The simulated test statistics are plotted using red + symbols along the x-axis, and the location of the observed test statistic is denoted by a blue line. As the observed test statistic contributes to the null distribution, it is included in the calculation of both the kernel density and the histogram.

 

Action with RESTRICT

The DATA variate can be restricted to assess a subset of the data.
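A minimal sketch of testing a subset. The condition (values above 0.4) is an arbitrary illustration, and the data are the Example values.

```genstat
VARIATE [VALUES=0.621,0.503,0.203,0.477,0.710,0.581,0.329,0.480,0.554,0.382] x
"Restrict x to the units with values above 0.4, then test only those units."
RESTRICT x; CONDITION=x.GT.0.4
EDFTEST [DISTRIBUTION=normal] x
"Remove the restriction afterwards."
RESTRICT x
```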

 

References

Carvalho, L. (2015). An improved evaluation of Kolmogorov’s distribution. Journal of Statistical Software, 65(3), 1-7.

Csörgő, S. & Faraway, J.J. (1996). The exact and asymptotic distributions of Cramér-von Mises statistics. Journal of the Royal Statistical Society, Series B, 58, 221-234.

Marsaglia, G. & Marsaglia, J. (2004). Evaluating the Anderson-Darling distribution. Journal of Statistical Software, 9(2), 1-5.

Sheather, S.J. & Jones, M.C. (1991). A reliable data-based bandwidth selection method for kernel density estimation. Journal of the Royal Statistical Society, Series B, 53, 683-690.

Wang, J., Tsang, W.W. & Marsaglia, G. (2003). Evaluating Kolmogorov’s distribution. Journal of Statistical Software, 8(18), 1-4.

Zhang, J. (2002). Powerful goodness-of-fit tests based on the likelihood ratio. Journal of the Royal Statistical Society, Series B, 64, 281-294.

 

See also

Directive: DISTRIBUTION.

Procedures: DPROBABILITY, NORMTEST, KOLMOG2, WSTATISTIC.

Commands for: Basic and nonparametric statistics.

Example

CAPTION 'EDFTEST example',\
        !t('Random sample of size 10 assumed to come from the Uniform distribution.'),\
        !t('From W.J. Conover (1980), Practical Nonparametric Statistics 2ed, pg 348.');\
        STYLE=meta,plain,plain     
VARIATE [VALUES=0.621,0.503,0.203,0.477,0.710,0.581,0.329,0.480,0.554,0.382] x

"Assuming a Uniform[0,1] distribution."
"Likelihood-ratio based tests with histograms of the Monte-Carlo test statistics."
EDFTEST [PLOT=histogram; DISTRIBUTION=uniform; PARAMETERS=!(1,0); NAMES=!t(max,min);\ 
         SEED=1234; NTIMES=999] x
"Traditional tests."
EDFTEST [TMETHOD=traditional; DISTRIBUTION=uniform; PARAMETERS=!(1,0);\ 
         NAMES=!t(max,min)] x
         
"Estimating parameter values from the data."
"Likelihood-ratio based tests with kernel density plots of the Monte-Carlo test
 statistics."
EDFTEST [PLOT=kerneldensity; DISTRIBUTION=uniform; SEED=1234; NTIMES=999] x
"Traditional tests with kernel density plots of the Monte-Carlo test statistics."
EDFTEST [TMETHOD=traditional;  PLOT=kerneldensity; DISTRIBUTION=uniform; SEED=1234;\ 
         NTIMES=999] x
Updated on January 15, 2018
