DISTRIBUTION directive

Estimates the parameters of continuous and discrete distributions.

Options

PRINT = string tokens Printed output required from each individual fit (parameters, samplestatistics, fittedvalues, proportions, monitoring); default para, samp, fitt
CBPRINT = string tokens Printed output required from a fit combining all the input data (parameters, samplestatistics, fittedvalues, proportions, monitoring); default *
DISTRIBUTION = string token Distribution to be fitted (Poisson, geometric, logseries, negativebinomial, NeymanA, PolyaAeppli, PlogNormal, PPascal, Normal, dNvequal, dNvunequal, logNormal, exponential, gamma, Weibull, b1, b2, Pareto); default * i.e. fit nothing
CONSTANT = string token Whether to estimate a location parameter for the gamma, logNormal, Pareto or Weibull distributions (estimate, omit); default omit
LIMITS = variate Variate to specify or save upper limits for classifying the data into groups; default *
NGROUPS = scalar When LIMITS is not specified, this defines the number of groups (of approximately equal size) into which the data are to be classified; default is the integer value nearest to the square root of the number of data values
XDEVIATES = variate Variate to specify points up to which the CUMPROPORTIONS are to be estimated
JOINT = string token Requests joint estimates from the combined fit to be used for a re-fit to the separate data sets (dispersion, variancemeanratio, Poissonindex); default *
PARAMETERS = variate Estimated parameters from the combined fit
SE = variate Standard errors for the estimated parameters of the combined fit
VCOVARIANCE = symmetric matrix Variance-covariance matrix for the estimated parameters of the combined fit
CUMPROPORTIONS = variate Estimated cumulative proportions of the combined distribution up to the values specified by the XDEVIATES option
MAXCYCLE = scalar Maximum number of iterations; default 30
TOLERANCE = scalar Convergence criterion; default 0.0001

Parameters

DATA = variates or tables Data values either classified (table) or unclassified (variate)
NOBSERVATIONS = tables One-way table to save the data classified into groups
RESIDUALS = tables Residuals from each (individual) fit
FITTEDVALUES = tables Fitted values from each fit
PARAMETERS = variates Estimated parameters from each fit
SE = variates Standard errors of the estimates
VCOVARIANCE = symmetric matrices Variance-covariance matrix for each set of estimated parameters
CUMPROPORTIONS = variates Estimated cumulative proportions of each distribution up to the values specified by the XDEVIATES option
CBRESIDUALS = tables Residuals from the combined fit
CBFITTEDVALUES = tables Fitted values from the combined fit
STEPLENGTH = variates Initial step lengths for each fit
INITIAL = variates Initial values of the parameters for each fit

Description

The DISTRIBUTION directive fits a theoretical distribution function to an observed sample of data, in order to obtain maximum-likelihood estimates of the parameters of the distribution and to test the goodness of fit. The data consist of observations $x_i$ of a random variable X, which has a distribution function $F(x)$ defined by $F(x) = \Pr(X \le x)$. A selection of both discrete and continuous distributions is available; full details are given below.

For discrete distributions X may take non-negative integer values only, except for the log-series distribution where only positive integer values are allowed. For continuous distributions the random variable X may take any value, subject to constraints for certain distributions; for example, data values must be strictly positive in order to fit a log-Normal distribution. Constraints are detailed with the individual distributions described below.

The data can be supplied to DISTRIBUTION as a variate or as a one-way table of counts. If the raw data are available, then these should be supplied (as a variate), since the raw data contain more information than grouped data.

If raw data are not available, then a one-way table of counts, or frequencies, should be given. The factor classifying the table must have its levels vector declared explicitly, since the levels are used to indicate the boundary values of the raw data used to create the grouping. For example, if the discrete variable X takes the values 0…8, with numbers of observations 2,6,7,4,2,1,0,1,0 respectively, a table of counts can be declared by

FACTOR [LEVELS=!(0...8)] F

TABLE [CLASSIFICATION=F; VALUES=2,6,7,4,2,1,0,1,0] T

The factor levels do not have to specify single data values: often it will be desirable to group certain values together, and indeed for continuous data this is the only sensible way to proceed. In general, for a classifying factor with levels $l_1, l_2, \ldots, l_f$, the count $n_k$ for the $k$th cell of the table will be the number of observations $x_i$ such that

    $x_i \le l_1$, for $k = 1$
    $l_{k-1} < x_i \le l_k$, for $2 \le k \le f-1$
    $l_{f-1} < x_i$, for $k = f$

This means that for all except the last cell of the table, the factor level represents the upper limit on values in that cell. The final class of the table is termed the tail; it is formed by combining the frequencies for all values of X greater than $l_{f-1}$, and the upper limit on values in the tail is infinity. For continuous distributions with no lower bound, the first class will be the lower tail. You will often want to form the tail(s) by amalgamating groups with low counts. In the example above, you might amalgamate the groups for values 6-8:

FACTOR [LEVELS=!(0...5,99)] F2

TABLE [CLASSIFICATION=F2; VALUES=2,6,7,4,2,1,1] T2

Note that the final factor level, for the tail, can be given a dummy value of 99 to indicate that it has no upper limit, since this value is never used in calculations.
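As a minimal sketch of how such a table might then be analysed, the amalgamated table T2 can be passed directly to DISTRIBUTION. The Poisson distribution is chosen here only for illustration; it is not a recommendation for these counts.

```genstat
" Declare the amalgamated table of counts from the example above "
FACTOR [LEVELS=!(0...5,99)] F2
TABLE [CLASSIFICATION=F2; VALUES=2,6,7,4,2,1,1] T2

" Fit a Poisson distribution (an illustrative choice) to the grouped counts "
DISTRIBUTION [DISTRIBUTION=Poisson] DATA=T2
```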

When data are supplied as a table instead of as a variate, the computed log-likelihood is only an approximation to the full log-likelihood and the solution obtained will depend to some extent on the choice of class limits. More reliable results will be achieved with a larger number of classes, since this gives more information on the data distribution, so only classes with very few observations should be amalgamated. In general, care should be taken to choose class limits that give a reasonable number of counts in each class, but with none of the individual classes holding a disproportionately large number of observations.

The DISTRIBUTION option should be set to indicate which distribution is to be fitted to the data. The following distributions are available:

Discrete: Binomial (as a special case of the negative binomial), Poisson, Geometric, Log-series, Negative binomial, Neyman type A, Pólya-Aeppli, Poisson-log-Normal, Poisson-Pascal

Continuous: Normal, Double Normal (equal variances), Double Normal (unequal variances), Log-Normal, Exponential, Gamma, Weibull, Beta type I and type II, Pareto

Note: the parameterization for the gamma distribution differs from that used in the gamma probability functions. DISTRIBUTION uses the shape parameter k and the rate parameter b, while the functions use the shape parameter k and the scale parameter t, which is the reciprocal of the rate (t=1/b).

The first step of the fitting process is to compute and print various sample statistics. Examining these may help in the selection of appropriate distributions for fitting – properties of the various distributions are listed at the end of this section. The setting DISTRIBUTION=* can be used to produce this output without any model fitting. The following sample statistics are calculated:

Sample size: $n$

Sample mean: $m = \sum x_i / n$

Sample variance: $s^2 = \sum x_i^2 / n - m^2$ (discrete distributions); $s^2 = \sum (x_i - m)^2 / (n-1)$ (continuous distributions)

Sample skewness: $g_1 = \sum (x_i - m)^3 / ((n-1) s^3) = m_3 / s^3$

Sample kurtosis: $g_2 = \sum (x_i - m)^4 / ((n-1) s^4) - 3$ (continuous distributions only)

Sample quartiles: $x_p$ such that $F(x_p) = p$

Poisson index: $(s^2 - m) / m^2$ (discrete distributions only)

Negative binomial index: $m (m_3 - 3 s^2 + 2m) / (s^2 - m)^2$ (discrete distributions only)

If the original data are not available, the sample statistics are calculated by substituting class mid-points in place of the data. For the lower tail, the class “mid-point” is taken to be $l_1 - \tfrac{1}{2}(l_2 - l_1)$ and for the upper tail, $l_{f-1} + \tfrac{1}{2}(l_{f-1} - l_{f-2})$. No corrections are made for grouping. When a distribution has been fitted to data, the relevant theoretical statistics of that distribution are printed for comparison with the sample statistics, as a check on the appropriateness of the model for the data.
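To inspect the sample statistics before choosing a distribution, a call along the following lines may be used. This is a sketch using the table of counts from the earlier example; because the data are tabulated, the statistics are based on class mid-points as described above.

```genstat
" Table of counts from the earlier example "
FACTOR [LEVELS=!(0...8)] F
TABLE [CLASSIFICATION=F; VALUES=2,6,7,4,2,1,0,1,0] T

" Print the sample statistics only; DISTRIBUTION=* means that nothing is fitted "
DISTRIBUTION [PRINT=samplestatistics; DISTRIBUTION=*] DATA=T
```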

A summary of the fit is given: the parameter estimates are printed with their standard errors and correlations, including the working parameters, which are stable functions of the parameters defining the distribution and are used in the internal algorithm. The goodness of fit to the chosen distribution is indicated by the residual deviance, which has an asymptotic chi-square distribution with the specified degrees of freedom. The deviance is also the preferred statistic for comparison of nested models, for example the double Normal distribution with equal and unequal variances.

This is followed by a table of observed and fitted values (expected frequencies), together with weighted residuals. If raw data are supplied, by default this table is formed by dividing the data into √n groups of approximately equal observed frequency, which are therefore likely to be of unequal widths. The NGROUPS option may be used to set the number of groups for this table. If data are supplied as a table, the fitted values use the classification from that table. In either case the LIMITS option may be used to supply a different set of limits, with the constraint that, if tabulated data are analysed, these limits should be a subset of the original limits so that the new groups are formed by aggregation.
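The following sketch shows raw data supplied as a variate, with NGROUPS used to override the default number of groups in the table of fitted values. The variate X simply expands the counts from the earlier table example into 23 individual observations; the choice of 4 groups and of the Poisson distribution is arbitrary.

```genstat
" Raw data: the 23 observations summarized by the table of counts shown earlier "
VARIATE [VALUES=0,0,1,1,1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,4,4,5,7] X

" Fit a Poisson distribution, forming the table of fitted values from 4 groups "
" instead of the default (the integer nearest to the square root of 23, i.e. 5) "
DISTRIBUTION [DISTRIBUTION=Poisson; NGROUPS=4] DATA=X
```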

The NOBSERVATIONS, RESIDUALS and FITTEDVALUES parameters can be used to save the number of observations in each cell, the fitted number, and the residual respectively (all in tables). The parameter estimates and their standard errors can be saved in variates specified by PARAMETERS and SE. The variance-covariance matrix for the estimated parameters can be saved as a symmetric matrix using the VCOVARIANCE parameter.
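A sketch of saving these results is shown below. The identifiers Nobs, Resid, Fitted, Est, Ese and Vcov are hypothetical names chosen for this illustration, and the Poisson distribution is again an arbitrary choice.

```genstat
" Amalgamated table of counts from the earlier example "
FACTOR [LEVELS=!(0...5,99)] F2
TABLE [CLASSIFICATION=F2; VALUES=2,6,7,4,2,1,1] T2

" Fit the distribution and save the results in (hypothetically named) structures "
DISTRIBUTION [DISTRIBUTION=Poisson] DATA=T2; NOBSERVATIONS=Nobs; \
   RESIDUALS=Resid; FITTEDVALUES=Fitted; PARAMETERS=Est; SE=Ese; \
   VCOVARIANCE=Vcov

" Display the saved estimates and their standard errors "
PRINT Est,Ese
```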

Having fitted the required distribution, the estimated cumulative distribution function (CDF) can be evaluated at specified values of X. These are defined using the XDEVIATES option. The values of the CDF can be printed (by selecting PRINT=proportions) or saved in a variate by setting the CUMPROPORTIONS parameter.
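For instance, the fitted CDF might be evaluated at X=2 and X=4 as sketched below. The identifier Cprop is a hypothetical name for the saved variate, and the evaluation points are arbitrary.

```genstat
" Amalgamated table of counts from the earlier example "
FACTOR [LEVELS=!(0...5,99)] F2
TABLE [CLASSIFICATION=F2; VALUES=2,6,7,4,2,1,1] T2

" Estimate the cumulative proportions up to 2 and up to 4 under the fitted "
" distribution; print them and also save them in the variate Cprop "
DISTRIBUTION [PRINT=parameters,proportions; DISTRIBUTION=Poisson; \
   XDEVIATES=!(2,4)] DATA=T2; CUMPROPORTIONS=Cprop
PRINT Cprop
```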

If you have several sets of data you may be interested in fitting the distribution individually to each set; this can be done by setting the DATA parameter to a list of identifiers. A separate analysis is then performed for each set of data, but any option settings are common to all the data sets. The data sets should all be specified in the same way, either as raw data or as tabulated counts. For tabulated counts, the same categories must be used for defining every table.

You can also carry out one final fit to the combined data set, in order to investigate whether the data can be adequately modelled as coming from a single population. This combined fit is produced if any of the options relating to the combined fit have been set (that is, options CBPRINT, PARAMETERS, SE, VCOVARIANCE or CUMPROPORTIONS, which print or save information from the combined analysis). For each individual data set you can also save fitted values and residuals based on the parameters estimated from the combined data set, using the CBFITTEDVALUES and CBRESIDUALS parameters. The JOINT option can be used to specify that certain parameters should be held constant at their estimated values from the combined analysis during re-fits to the individual data sets. For continuous distributions only, a common dispersion parameter can be requested; for discrete distributions a common value can be requested for either the Poisson index or the ratio of variance to mean. An analysis of deviance is printed to compare the nested models.
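A sketch of fitting two data sets with a combined fit and a common Poisson index is given below. The counts in T3 are invented purely to illustrate the syntax and are not taken from any real study; the negative binomial distribution is likewise an illustrative choice.

```genstat
" Both tables must be classified by the same categories "
FACTOR [LEVELS=!(0...5,99)] F2

" T2 holds the counts from the earlier example; the counts in T3 are invented "
TABLE [CLASSIFICATION=F2; VALUES=2,6,7,4,2,1,1] T2
TABLE [CLASSIFICATION=F2; VALUES=5,8,4,3,1,1,2] T3

" Separate fits, a combined fit (requested by setting CBPRINT), and re-fits "
" to each data set with the Poisson index held at its combined estimate "
DISTRIBUTION [DISTRIBUTION=negativebinomial; CBPRINT=parameters; \
   JOINT=Poissonindex] DATA=T2,T3
```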

If the original data are available, the full log-likelihood is used in the optimization algorithm. Otherwise, an approximate log-likelihood is optimized, using representative values for each class. For some distributions, it is necessary to use stable working parameters in the optimization algorithm (Ross 1990), and the defining parameters for the distribution are then evaluated by a simple transformation.

The deviance and corresponding degrees of freedom that are printed as part of the model summary are based on the table of fitted values, and thus may be affected by the choice of limits. The residuals computed are deviance residuals (McCullagh & Nelder 1989), and the deviance is therefore the sum of squared residuals. The degrees of freedom are $n - p - 1$, where $n$ is the number of cells in the table of fitted values and $p$ is the number of parameters estimated in the model. The default limits for grouping the raw data are designed to avoid small expected frequencies (for example in the tail cells), which can have an inflationary effect on the deviance; however, if the tails are important, because of the origin of the data, it may be important to specify the limits explicitly.

An iterative Gauss-Newton optimization method is used to estimate the parameters of the distribution. The parameterization is chosen for each model so that the optimization is stable, but if there are any problems with particular data sets it may be necessary to control this process. The MAXCYCLE and TOLERANCE options allow you to increase the number of iterations and alter the convergence criterion for data sets that fail to converge. You can also specify initial values and step lengths for the parameters for each set of data using the INITIAL and STEPLENGTH parameters. These parameters should be set to variates of length appropriate for the distribution being fitted; for example, if DISTRIBUTION=Poisson they should have just one value. Another use of INITIAL and STEPLENGTH is to constrain a parameter to a particular value; for example, when fitting a double Normal the proportion parameter p could be fixed at 0.5 by setting the initial value to 0.5 and the step length to 0, thus fitting a double Normal in equal proportions. Note that the degrees of freedom are not adjusted to take account of this.
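The control settings might be combined as in the following sketch. The starting value 2 and step length 0.1 are arbitrary illustrative numbers, and it is an assumption of this example that they are sensible on the scale of the single Poisson parameter.

```genstat
" Amalgamated table of counts from the earlier example "
FACTOR [LEVELS=!(0...5,99)] F2
TABLE [CLASSIFICATION=F2; VALUES=2,6,7,4,2,1,1] T2

" Allow more iterations and a tighter convergence criterion than the defaults "
" (30 and 0.0001), and supply one initial value and one step length because "
" the Poisson distribution has a single parameter "
DISTRIBUTION [DISTRIBUTION=Poisson; MAXCYCLE=50; TOLERANCE=0.00001] \
   DATA=T2; INITIAL=!(2); STEPLENGTH=!(0.1)
```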

Options: PRINT, CBPRINT, DISTRIBUTION, CONSTANT, LIMITS, NGROUPS, XDEVIATES, JOINT, PARAMETERS, SE, VCOVARIANCE, CUMPROPORTIONS, MAXCYCLE, TOLERANCE.

Parameters: DATA, NOBSERVATIONS, RESIDUALS, FITTEDVALUES, PARAMETERS, SE, VCOVARIANCE, CUMPROPORTIONS, CBRESIDUALS, CBFITTEDVALUES, STEPLENGTH, INITIAL.

Action with RESTRICT

You can restrict the units of a DATA variate to fit a distribution to a subset of its values.

References

McCullagh, P. & Nelder, J.A. (1989). Generalized Linear Models (second edition). Chapman and Hall, London.

Ross, G.J.S. (1990). Nonlinear Estimation. Springer-Verlag, New York.

See also

Procedures: BBINOMIAL, CUMDISTRIBUTION, DPROBABILITY, EDFTEST, FDRMIXTURE, KERNELDENSITY, NORMTEST, WSTATISTIC, RSURVIVAL.

Functions: CLBETA, CLBINOMIAL, CLBVARIATENORMAL, CLCHISQUARE, CLF, CLGAMMA, CLHYPERGEOMETRIC, CLINVNORMAL, CLLOGNORMAL, CLNORMAL, CLOGLOG, CLPOISSON, CLSMMODULUS, CLSRANGE, CLT, CLUNIFORM, CUBETA, CUBINOMIAL, CUBVARIATENORMAL, CUCHISQUARE, CUF, CUGAMMA, CUHYPERGEOMETRIC, CUINVNORMAL, CULOGNORMAL, CUNORMAL, CUPOISSON, CUSMMODULUS, CUSRANGE, CUT, CUUNIFORM, EDBETA, EDBINOMIAL, EDCHISQUARE, EDF, EDGAMMA, EDHYPERGEOMETRIC, EDINVNORMAL, EDLOGNORMAL, EDNORMAL, EDPOISSON, EDSMMODULUS, EDSRANGE, EDT, EDUNIFORM, GRBETA, GRBINOMIAL, GRCHISQUARE, GRF, GRGAMMA, GRHYPERGEOMETRIC, GRLOGNORMAL, GRNORMAL, GRPOISSON, GRSAMPLE, GRSELECT, GRT, GRUNIFORM, PRBETA, PRBINOMIAL, PRCHISQUARE, PRF, PRGAMMA, PRHYPERGEOMETRIC, PRINVNORMAL, PRLOGNORMAL, PRNORMAL, PRPOISSON, PRSMMODULUS, PRSRANGE, PRT, PRUNIFORM.

Commands for: Basic and nonparametric statistics.

Example

" Example DIST-1: Negative Binomial and Log-Series distributions

  Taken from Chatfield et al. (1966), JRSS A, 129, p317-360.
      The data are recorded frequencies of number of purchases of
      a household product by 2000 households over a 26 week period.
      Thus, 1612 households made 0 purchases of the product, 164 made
      1 purchase, and so on. The final cell of the table is the tail, 
      the number of households that made 21 or more purchases.
"

FACTOR [LEVELS=!(0...21)] Npurchase; DECIMALS=0
TABLE [CLASSIFICATION=Npurchase] Purchases; DECIMALS=0
READ Purchases
1612 164 71 47 28 17 12 12 5 7 6 3 3 5 0 0 0 2 0 0 1 5 :

" Fit negative binomial"
DISTRIBUTION [DISTRIBUTION=negativebinomial] Purchases

" Fit logseries distribution"
DISTRIBUTION [DISTRIBUTION=logseries] Purchases
Updated on March 8, 2019
