Fits the models of Williams (1982) to overdispersed proportions (M.S. Ridout & P.W. Goedhart).
|PRINT|What to print if iterative estimation process converges successfully and whether to monitor the iterations (
|CONSTANT|How to treat constant (
|FACTORIAL|Limit for expansion of model terms; default 3|
|NOMESSAGE|Which warning messages to suppress (
|METHOD|Which model to fit to take account of the extra variation (II, III); default II|
||Whether to leave the modified
|WEIGHTS|To save estimated weights|
|PHI|To save estimated overdispersion parameter|
|MAXCYCLE|Maximum number of iterations; default 10|
|TOLERANCE|Convergence criterion; default 0.01|
|TERMS|Model terms to be fitted; if unset it is assumed that the model consists only of a constant term|
In binomial regression models, residual variability is often larger than would be expected if the data were indeed binomially distributed. This may be due to a few outliers or a poor choice of link function but often it simply indicates that the data are from a distribution more variable than the binomial. Such data are said to be “overdispersed” or to exhibit “extra-binomial variation”.
Williams (1982) discusses two possible models to extend the usual binomial model (Model I). Model II assumes that the true variance exceeds the binomial variance by a factor
V = 1 + (NBINOMIAL – 1) × φ    (0 ≤ φ ≤ 1)
If the overdispersion parameter φ were known, the data could be analysed using a binomial model with prior weights 1/V. Procedure EXTRABINOMIAL estimates φ so that the residual chi-square statistic from this weighted analysis is (approximately) equal to the residual degrees of freedom (Moore 1987). If the binomial totals are all equal, Model II is equivalent to setting the DISPERSION option of MODEL equal to the residual chi-square statistic divided by its degrees of freedom.
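The Model II arithmetic can be illustrated with a minimal sketch. It is written in Python purely for illustration; it is not Genstat code and not the procedure's implementation. The first function computes the prior weights 1/V for a given φ; the second recovers φ in the equal-totals case from the ratio of the residual chi-square statistic (from the unweighted binomial fit) to its degrees of freedom. The function names and the example numbers are assumptions made for this illustration.

```python
def model2_weights(nbinomial, phi):
    """Prior weights 1/V, with V = 1 + (n - 1) * phi for each binomial total n."""
    if not 0.0 <= phi <= 1.0:
        raise ValueError("phi must lie in [0, 1]")
    return [1.0 / (1.0 + (n - 1) * phi) for n in nbinomial]


def phi_equal_totals(chi2, df, n):
    """phi when every binomial total equals n (n > 1).

    chi2 is the residual chi-square from the unweighted binomial fit.
    With equal totals V is the same for every unit, so the weighted
    chi-square is chi2 / V; setting that equal to df gives V = chi2 / df.
    """
    v = chi2 / df
    return (v - 1.0) / (n - 1)


# Hypothetical numbers: chi-square of 30 on 20 d.f., all totals equal to 11.
phi = phi_equal_totals(30.0, 20, 11)
weights = model2_weights([11, 11, 11], phi)
print(phi, weights)
```

In this made-up case V = 30/20 = 1.5, which is exactly the value that would be supplied as the dispersion in the equal-totals equivalence described above.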
Alternatively, Model III assumes that the linear predictor varies about its expectation with a constant variance. Usually this variation is assumed to follow a normal distribution; if there is then a logit link, the error distribution will be a logistic normal. Extensions to Model III to have several normal distributions contributing to the variation on the linear predictor, similar to those that occur in stratified analysis of variance, form the basis of many methods suggested for analysing generalized linear mixed models. For Model III, there is generally no simple expression for the exact variance. But the delta method can be used to show that, approximately, the variance exceeds the binomial variance by a factor
V = 1 + (NBINOMIAL – 1) × φ × F² / (P × (1 – P))
where φ is the variance on the scale of the linear predictor, P is the fitted probability and F is the derivative of the inverse of the link function, evaluated at the fitted value of the linear predictor.
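For a logit link the inverse link is P = 1 / (1 + exp(–η)), whose derivative is F = P × (1 – P), so the Model III factor reduces to 1 + (NBINOMIAL – 1) × φ × P × (1 – P). The short Python sketch below (illustrative only, not Genstat code; the function names and the example values are assumptions) evaluates the general expression for that link.

```python
import math


def inverse_logit(eta):
    """Fitted probability P for linear predictor eta under a logit link."""
    return 1.0 / (1.0 + math.exp(-eta))


def model3_factor(nbinomial, phi, eta):
    """Approximate Model III variance inflation factor for a logit link."""
    p = inverse_logit(eta)
    f = p * (1.0 - p)  # derivative of the inverse logit, evaluated at eta
    return 1.0 + (nbinomial - 1) * phi * f * f / (p * (1.0 - p))


# Hypothetical values: total of 41, phi = 0.1, linear predictor 0 (so P = 0.5).
v = model3_factor(41, 0.1, 0.0)
print(v)
```

Unlike Model II, the factor here depends on the fitted probability, so the weights 1/V differ between units even when the binomial totals are all equal.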
A MODEL statement must be given, in the usual way, to define the y-variate, the binomial totals, the link and any offset. The error distribution must, of course, also be set to binomial, but any settings of DISPERSION are ignored.
The form of EXTRABINOMIAL is similar in many ways to the FIT directive. There is a single parameter TERMS to define the model terms to be fitted, and the first four options, PRINT, CONSTANT, FACTORIAL and NOMESSAGE, all have the same syntax and purpose as in FIT. The remaining options are specific to EXTRABINOMIAL.
The METHOD option selects which model to use (II or III); by default METHOD=II. Both models involve the estimation of the weight variate (1/V) required to fit the model using the standard Genstat facilities for generalized linear models. If the relevant option is set, EXTRABINOMIAL will leave the MODEL statement in its modified form (provided the iterative estimation of φ converges), with the WEIGHTS option set to these weights and the DISPERSION option set to 1, so that directives like DROP can be used to study the effects of individual terms in the model in the usual way. The TERMS directive will also be left set to the model specified by the TERMS parameter of EXTRABINOMIAL, and this model will be the one most recently fitted, so further output can be obtained using the standard regression directives.
The WEIGHTS and PHI options allow the weights and the estimated value of φ, respectively, to be saved. The MAXCYCLE option specifies the maximum number of iterations in the estimation, and the TOLERANCE option defines the convergence criterion:
ABS(Chi-square – Residual d.f.) < TOLERANCE × Residual d.f.
If the binomial totals are all equal, φ is determined (non-iteratively) from the residual chi-square statistic.
Otherwise, φ must be found iteratively and the method used (Williams, 1982) involves nested iterations. Each outer iteration (involving a model fit) requires an inner iteration (which uses only CALCULATE statements) to get the updated estimate of φ. The option MAXCYCLE controls the maximum number of outer iterations. The maximum number of inner iterations is fixed at 10.
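The nested structure can be sketched in Python for the simplest case, Model II with a constant-only model. This is an illustration under stated assumptions, not the procedure's algorithm: the inner step here uses plain bisection on φ instead of the update of Williams (1982), the inner step count is not limited to 10, and all function names are hypothetical. The data are the germination counts from the example on this page.

```python
# Germination data from the example on this page (NGerm, NSeeds).
NGERM = [10, 23, 23, 26, 17, 5, 53, 55, 32, 46, 10,
         8, 10, 8, 23, 0, 3, 22, 15, 32, 3]
NSEEDS = [39, 62, 81, 51, 39, 6, 74, 72, 51, 79, 13,
          16, 30, 28, 45, 4, 12, 41, 30, 51, 7]


def fit_constant(y, n, phi):
    """Weighted estimate of the common probability (constant-only model)."""
    w = [1.0 / (1.0 + (ni - 1) * phi) for ni in n]
    num = sum(wi * yi for wi, yi in zip(w, y))
    den = sum(wi * ni for wi, ni in zip(w, n))
    return num / den


def chi_square(y, n, p, phi):
    """Pearson chi-square with Model II prior weights 1/V."""
    total = 0.0
    for yi, ni in zip(y, n):
        v = 1.0 + (ni - 1) * phi
        total += (yi - ni * p) ** 2 / (ni * p * (1.0 - p) * v)
    return total


def estimate_phi(y, n, maxcycle=10, tolerance=0.01, inner_steps=60):
    """Return (phi, p, chi2, cycles, converged)."""
    df = len(y) - 1  # one fitted parameter: the constant
    phi = 0.0
    for cycle in range(1, maxcycle + 1):
        p = fit_constant(y, n, phi)          # outer step: refit the model
        chi2 = chi_square(y, n, p, phi)
        if abs(chi2 - df) < tolerance * df:  # criterion quoted above
            return phi, p, chi2, cycle, True
        lo, hi = 0.0, 1.0                    # inner step: bisection on phi
        for _ in range(inner_steps):
            mid = 0.5 * (lo + hi)
            if chi_square(y, n, p, mid) > df:
                lo = mid                     # chi-square too big: raise phi
            else:
                hi = mid
        phi = 0.5 * (lo + hi)
    p = fit_constant(y, n, phi)
    return phi, p, chi_square(y, n, p, phi), maxcycle, False


phi, p, chi2, cycles, converged = estimate_phi(NGERM, NSEEDS)
print(phi, p, chi2, cycles, converged)
```

Bisection works here because, for fixed fitted values, the weighted chi-square decreases steadily as φ increases. Because the sketch fits only a constant, its estimate of φ is not comparable with the one printed by the Genstat example, which fits Seed*RtExtrct.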
Very precise convergence is not important in practice; the default setting of the TOLERANCE option (1%) should give a perfectly adequate estimate of φ, usually within 3 iterations.
Any of the following structures may be restricted: the Y variate; the NBINOMIAL variate; the WEIGHTS variate; the OFFSET variate; any variate or factor appearing in the model formula. Restrictions on different structures must be compatible. Restricted units are excluded from the analysis.
Moore, D.F. (1987). Modelling the extraneous variance in the presence of extra-binomial variation. Applied Statistics, 36, 8-14.
Williams, D.A. (1982). Extra-binomial variation in logistic linear models. Applied Statistics, 31, 144-148.
Commands for: Regression analysis.
CAPTION 'EXTRABIN example',\
        !t('A 2 x 2 factorial experiment comparing germination',\
        'of two types of seed and two root extracts (Crowder, M.J.,',\
        '1978, Appl. Statist., 27, 34-37).'); STYLE=meta,plain
FACTOR  [LABELS=!T(O_75,O_73); VALUES=1,10(1,2)] Seed
FACTOR  [LABELS=!T(Bean,Cucumber); VALUES=5(1,2),2,5(1,2)] RtExtrct
VARIATE NGerm,NSeeds ;\
        VALUES=!(10,23,23,26,17,5,53,55,32,46,10,8,10,8,23,0,3,22,15,32,3),\
        !(39,62,81,51,39,6,74,72,51,79,13,16,30,28,45,4,12,41,30,51,7)
MODEL   [DISTRIBUTION=binomial; LINK=logit] NGerm; NBINOMIAL=NSeeds
EXTRABIN [PRINT=estimates; PHI=Phi] Seed*RtExtrct
PRINT   Phi