MODEL directive

Defines the response variate(s) and the type of model to be fitted for linear, generalized linear, generalized additive and nonlinear models.

Options

`DISTRIBUTION` = string token	Distribution of the response variable (`normal`, `poisson`, `binomial`, `gamma`, `inversenormal`, `multinomial`, `calculated`, `negativebinomial`, `geometric`, `exponential`, `bernoulli`); default `norm`
`LINK` = string token	Link function (`canonical`, `identity`, `logarithm`, `logit`, `reciprocal`, `power`, `squareroot`, `probit`, `complementaryloglog`, `calculated`, `logratio`); default `cano` (i.e. `iden` for `DIST=norm` or `calc`; `loga` for `DIST=pois`; `logi` for `DIST=bino`, `bern` or `mult`; `reci` for `DIST=gamm` or `expo`; `powe` for `DIST=inve`; `logr` for `DIST=nega` or `geom`)
`EXPONENT` = scalar	Exponent for power link; default -2
`AGGREGATION` = scalar	Fixed parameter for negative binomial distribution (parameter k as in variance function Var = mean + mean²/k); default 1
`KLOGRATIO` = scalar	Parameter for logratio link, in form log(mean/(mean+k)); default as set in `AGGREGATION` option
`DISPERSION` = scalar	Value of dispersion parameter in calculation of s.e.s etc; default `*` for `DIST=norm`, `gamm`, `inve` or `calc`, and 1 for `DIST=pois`, `bino`, `mult`, `nega`, `geom`, `expo` or `bern`
`WEIGHTS` = variate or symmetric matrix	Variate of weights for weighted regression, or symmetric matrix of weights (one row and column for each unit of data) for generalized least squares; default `*`
`OFFSET` = variate	Offset variate to be included in model; default `*`
`GROUPS` = factor	Absorbing factor defining the groups for within-groups linear or generalized linear regression; default `*`
`RMETHOD` = string token	Type of residuals to form, if any, after each model is fitted (`deviance`, `Pearson`, `simple`); default `devi`
`DMETHOD` = string token	Basis of estimate of dispersion, if not fixed by `DISPERSION` option (`deviance`, `Pearson`); default `devi`
`FUNCTIONVALUE` = scalar	Scalar whose value is to be minimized by calculation; default `*`
`YRELATION` = string token	Whether to analyse the y-variates separately, as in ordinary regression, or to analyse them cumulatively as counts in successive categories of a multinomial distribution (`separate`, `cumulative`); default `sepa`
`DCALCULATION` = expression structures	Calculations to define the deviance contributions and variance function for a non-standard distribution; must be specified when `DIST=calc`
`LCALCULATION` = expression structures	Calculations to define the fitted values and link derivative for a non-standard link; must be specified when `LINK=calc`
`DFDISPERSION` = scalar	Number of degrees of freedom for a dispersion parameter specified by the `DISPERSION` option; if this is not set, the supplied dispersion is assumed to be known exactly
`SAVE` = identifier	To name regression save structure; default `*`

Parameters

`Y` = variates	Response variates; only the first is used in nonlinear models and in generalized linear models except when `DIST=mult`, when they specify the numbers in each category of an ordinal response model
`NBINOMIAL` = variate or scalar	Total numbers for `DIST=bino`
`RESIDUALS` = variates	To save residuals for each y variate after fitting a model
`FITTEDVALUES` = variates	To save fitted values, and provide fitted values if no terms are given in `FITNONLINEAR`
`LINEARPREDICTOR` = variate	Specifies the identifier of the variate to hold the linear predictor
`DERIVATIVE` = variate	Specifies the identifier of the variate to hold the derivative of the link function at each unit
`DEVIANCE` = variate	Specifies the identifier of the variate to hold the contribution to the deviance from each unit
`VFUNCTION` = variate	Specifies the identifier of the variate to hold the value of the variance function at each unit

Description

The MODEL directive does not actually fit anything: it simply sets up some structures inside Genstat that are used when you give a FIT, FITCURVE or FITNONLINEAR statement later on. So when you are doing regression, MODEL will always be accompanied by at least one other regression statement, such as FIT, to fit a model.

The Y parameter allows a list of variates; if you supply more than one for linear regression, you will get an analysis for each. This is more efficient than separate pairs of MODEL and FIT statements when you are doing many linear regressions with the same explanatory variables. With additive models, generalized linear models and nonlinear models, only the first variate will be analysed (with the exception of multinomial response models); the others will be ignored.
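As a minimal sketch of fitting several responses in one pass, the statements below assume three hypothetical variates, Yield, Protein and Nitrogen, that have already been declared and given values of equal length.

```genstat
" Hypothetical variates: Yield and Protein are responses, Nitrogen is
  the shared explanatory variate. One MODEL and one FIT statement
  give a separate linear regression for each response. "
MODEL Yield,Protein
FIT   Nitrogen
```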

The RESIDUALS and FITTEDVALUES parameters allow you to specify variates to contain the residuals and fitted values for each response variable. The residuals are the “unexplained” component of the response variable, standardized in some way according to the RMETHOD option. The fitted values are the “explained” component: that is, the combination of parameters and explanatory variables fitted in the model. You can get access to these sets of values in a different way through the RKEEP directive.
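The sketch below shows both routes for the apple data used in the Example section. The identifiers Res, Fit, Res2 and Fit2 are hypothetical names chosen for illustration.

```genstat
VARIATE [VALUES= 8, 6,11,22,14,17,18,24,19,23,26,40] Cropsize
VARIATE [VALUES=59,58,56,53,50,45,43,42,39,38,30,27] Wormy

" Route 1: name the variates in MODEL; they are filled in by FIT. "
MODEL Wormy; RESIDUALS=Res; FITTEDVALUES=Fit
FIT   [PRINT=*] Cropsize
PRINT Wormy,Fit,Res

" Route 2: extract the same quantities after the fit with RKEEP. "
RKEEP RESIDUALS=Res2; FITTEDVALUES=Fit2
```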

The DISTRIBUTION and LINK options are used to specify a generalized linear model (McCullagh & Nelder 1989). By default the data are assumed to follow a Normal distribution, as required for ordinary linear regression, but other distributions can be selected using the DISTRIBUTION option. The LINK option specifies the link function that relates the linear model to the expected values of the distribution; in the default ordinary linear regression, this is the identity function (indicating no transformation). So, for example, for a log-linear model we would specify DISTRIBUTION=poisson and LINK=logarithm, while for logistic regression we would have DISTRIBUTION=binomial and LINK=logit. The NBINOMIAL parameter must also be set when DISTRIBUTION=binomial, to give the number of binomial trials for each unit.
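The two models mentioned above might be set up as follows. The data values and the identifiers Dose, Dead and Total are invented purely for illustration.

```genstat
" Illustrative dose-response data: Dead insects out of Total per dose. "
VARIATE [VALUES=1,2,3,4,5]      Dose
VARIATE [VALUES=2,5,9,14,18]    Dead
VARIATE [VALUES=20,20,20,20,20] Total

" Logistic regression: NBINOMIAL supplies the number of trials. "
MODEL [DISTRIBUTION=binomial; LINK=logit] Dead; NBINOMIAL=Total
FIT   Dose

" Log-linear model, treating the same counts as Poisson. "
MODEL [DISTRIBUTION=poisson; LINK=logarithm] Dead
FIT   Dose
```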

The EXPONENT option specifies the exponent when LINK=power. Similarly, the AGGREGATION option specifies the aggregation parameter k when DISTRIBUTION=negativebinomial. This is a measure of the tendency for observations to cluster together, and it appears in the formula for the variance as a function of the mean:

variance = mean + mean²/k

The default value of k is set at 1, which corresponds to the geometric distribution. The parameter k must be positive, and as it increases to infinity the distribution approaches the Poisson distribution. The KLOGRATIO option sets the parameter k for the logratio link.
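A negative binomial model with a fixed aggregation parameter could be requested along these lines. Count and Density are hypothetical variates assumed to exist already, and the value k = 2 is arbitrary.

```genstat
" AGGREGATION fixes k = 2 in variance = mean + mean**2/k.
  KLOGRATIO is left unset, so the logratio link takes its k
  from the AGGREGATION option. "
MODEL [DISTRIBUTION=negativebinomial; LINK=logratio; AGGREGATION=2] Count
FIT   Density
```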

You can also define your own distribution or link function for a generalized linear model.

To specify your own distribution, set DISTRIBUTION=calculated and use the DCALCULATION option to supply expression structures that calculate the deviance contribution and the variance function for each unit of the response variate, using the current values of the fitted-values variate. You must also set the FITTEDVALUES, DEVIANCE and VFUNCTION parameters to indicate which identifiers represent these quantities in the expressions.

To specify your own link, set LINK=calculated and use the LCALCULATION option to supply expressions for two calculations: one to form the fitted values, and one to form the derivative of the link function for each unit of the response variate, both using the current values of the linear predictor. You must also set the FITTEDVALUES, LINEARPREDICTOR and DERIVATIVE parameters to specify the identifiers that represent these quantities in the calculations. In addition, you must provide initial values for the linear predictor, so that the iterative process can get started. Often this can be done just by applying the link function to the response variate itself, but it may be necessary to modify extreme values such as 0 that may be mapped to infinity by the link function.
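As a rough sketch of a user-defined link, the statements below re-create the logarithmic link by calculation, so that the result can be checked against LINK=logarithm. Count and Dose are hypothetical variates assumed to exist. Fv, Lp, Dr, Lfit and Lder are names invented for this illustration, and the derivative is taken to be that of the linear predictor with respect to the mean.

```genstat
" Fitted values from the linear predictor: mean = exp(Lp). "
EXPRESSION [VALUE=Fv = EXP(Lp)] Lfit
" Derivative of the log link at the fitted values: 1/mean. "
EXPRESSION [VALUE=Dr = 1/Fv]    Lder

MODEL [DISTRIBUTION=poisson; LINK=calculated; LCALCULATION=Lfit,Lder] \
      Count; FITTEDVALUES=Fv; LINEARPREDICTOR=Lp; DERIVATIVE=Dr

" Initial linear predictor: apply the link to the response, shifting
  the counts slightly so that zeros are not mapped to minus infinity. "
CALCULATE Lp = LOG(Count + 0.5)
FIT Dose
```

The shift of 0.5 is an arbitrary choice; any adjustment that keeps the starting values finite should serve the same purpose.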

You can fit ordinal response models by setting option YRELATION=cumulative and option DISTRIBUTION=multinomial.
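For instance, assuming hypothetical variates Absent, Mild and Severe that hold the number of subjects in each ordered category at each level of a hypothetical explanatory variate Dose, an ordinal model might be specified as:

```genstat
" The y-variates are listed in category order; the link is left at its
  default, which is logit for the multinomial distribution. "
MODEL [DISTRIBUTION=multinomial; YRELATION=cumulative] Absent,Mild,Severe
FIT   Dose
```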

The DISPERSION option controls how the variance of the distribution of the response values is calculated. By default, the variance is estimated from the residual mean square, and standard errors and standardized residuals are calculated from the estimate. If you use DISPERSION to supply a value for the variance of the Normal distribution, or for the dispersion parameter of other distributions, then standard errors and residuals are based on this given value instead. In a generalized linear model, the dispersion of the chosen distribution can be fixed at a value provided by the DISPERSION option, or estimated from either the residual deviance or the Pearson chi-square statistic, as specified by the DMETHOD option.

The DFDISPERSION option allows you to specify the number of degrees of freedom for a value specified by the DISPERSION option. You might want to use this, for example, if you had estimated the dispersion from some other data set. If DFDISPERSION is not set, the supplied dispersion is assumed to be known exactly.
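The following sketch illustrates the two situations, using hypothetical variates (Count, Dose, Yield, Nitrogen) and illustrative numerical values.

```genstat
" Overdispersed counts: estimate the dispersion from the Pearson
  chi-square statistic instead of fixing it at 1. "
MODEL [DISTRIBUTION=poisson; DISPERSION=*; DMETHOD=Pearson] Count
FIT   Dose

" Normal model with the variance supplied as 2.5, treated as an
  estimate on 20 degrees of freedom from an earlier data set. "
MODEL [DISPERSION=2.5; DFDISPERSION=20] Yield
FIT   Nitrogen
```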

The WEIGHTS option allows you to specify a variate holding weights for each unit. In simple linear regression, the estimate of dispersion is then the weighted residual mean square. Thus, if the variance of the response variable is not constant, and you know the relative size of the variance for each observation, you can set the weight to be proportional to the inverse of the variance of an observation. Alternatively, if the variance is related in a simple way to the mean, you may just need to specify a different distribution for the response. The WEIGHTS option can also be set to a symmetric matrix, supplying weights corresponding to some pattern of correlation or covariance between units as well as variance of each unit. The subsequent analysis is known as generalized least-squares if the response distribution is Normal.
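A weighted regression with inverse-variance weights could be sketched as below. Yield, Nitrogen and Relvar are hypothetical variates, with Relvar assumed to hold the known relative variance of each observation.

```genstat
" Weight each unit by the inverse of its relative variance. "
CALCULATE Wt = 1/Relvar
MODEL [WEIGHTS=Wt] Yield
FIT   Nitrogen
```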

The OFFSET option allows you to include in the regression a variable with no corresponding parameter. Linear regression analysis of Y with offset O is just the same as analysis of Y–O, but the offset has non-trivial applications in generalized linear models.
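One common use of an offset in a generalized linear model is to model rates with a log-linear model. The sketch assumes hypothetical variates Claims (counts), Policies (numbers at risk) and Age (explanatory).

```genstat
" With a log link, an offset of log(Policies) means that the linear
  model describes the log of the claim rate per policy. "
CALCULATE Logpol = LOG(Policies)
MODEL [DISTRIBUTION=poisson; LINK=logarithm; OFFSET=Logpol] Claims
FIT   Age
```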

The GROUPS option specifies a factor whose effects you want to eliminate before any regression is fitted. The factor must already have been defined. This method of elimination is sometimes called absorption; you might want to use it when data from many different groups are to be modelled. Use of GROUPS gives less information than you would get if you included the factor explicitly in the model (leverages, predictions and some parameter correlations cannot be formed), but it saves space and time in fitting the model when the factor has many levels. You can use GROUPS only with linear and generalized linear regression.
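A within-groups regression might look like this, assuming hypothetical variates Weight and Age and an already-declared factor Herd with many levels.

```genstat
" The effects of Herd are absorbed before Age is fitted, so no
  separate parameter is estimated for each herd. "
MODEL [GROUPS=Herd] Weight
FIT   Age
```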

The RMETHOD option controls how residuals are formed. By default, residuals are deviance residuals standardized by their estimated variance. The alternative Pearson residuals are defined in exactly the same way if the distribution is Normal, but for regression models with distributions other than Normal the two kinds of residual are different. If you do not want residuals, you can set the option to a missing value (*) to save space within Genstat. However, you will then not be able to get residuals, fitted values or leverages, and the automatic checks on the fit of a model will not be done.
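The two non-default settings could be used as follows; Count and Dose are hypothetical variates, and Pres is a hypothetical name for the saved residuals.

```genstat
" Save Pearson residuals instead of deviance residuals. "
MODEL [DISTRIBUTION=poisson; RMETHOD=Pearson] Count; RESIDUALS=Pres
FIT   Dose

" Form no residuals at all, to save space. "
MODEL [DISTRIBUTION=poisson; RMETHOD=*] Count
FIT   Dose
```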

The FUNCTIONVALUE option is relevant only when you want to use FITNONLINEAR to optimize a general function. It then identifies the scalar that stores the results in the expression that calculates the function to be minimized (see the CALCULATION option of FITNONLINEAR). This should calculate a deviance if you are using this general facility to fit a statistical model. FUNCTIONVALUE is ignored if the Y parameter of MODEL is set.
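A minimal sketch of general function minimization is given below. The function, the parameter names A and B, the scalar F and the expression Fcalc are all invented for illustration, and the use of RCYCLE to declare the parameters and their initial values is an assumption about the usual FITNONLINEAR workflow, not something stated in this section.

```genstat
" F holds the value of the function to be minimized; note that
  MODEL is given no Y parameter. "
SCALAR     A,B,F
EXPRESSION [VALUE=F = (A - 3)**2 + (B + 1)**2] Fcalc
MODEL      [FUNCTIONVALUE=F]
RCYCLE     A,B; INITIAL=1,1
FITNONLINEAR [CALCULATION=Fcalc]
```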

The SAVE option allows you to specify an identifier for the regression save structure. This structure stores the current state of the regression model, and can be used explicitly in the directives RDISPLAY, RKEEP, PREDICT and RFUNCTION. If the identifier in SAVE belongs to a regression save structure that already has values, those values are deleted. You can reset the current regression save structure at any point in a program by using the SET directive; later regression statements will then use the model stored in this save structure.
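The sketch below keeps two fitted models available at once. Yield and Nitrogen are hypothetical variates, Lin and Quad are hypothetical names for the save structures, and the RSAVE option of SET is assumed to be the one that resets the current regression save structure.

```genstat
MODEL [SAVE=Lin] Yield
FIT   [PRINT=*] Nitrogen
MODEL [SAVE=Quad] Yield
FIT   [PRINT=*] POL(Nitrogen; 2)

" Display results from the first model without refitting it. "
RDISPLAY [SAVE=Lin]

" Make Lin the current save structure for later regression statements. "
SET [RSAVE=Lin]
```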

Options: DISTRIBUTION, LINK, EXPONENT, AGGREGATION, KLOGRATIO, DISPERSION, WEIGHTS, OFFSET, GROUPS, RMETHOD, DMETHOD, FUNCTIONVALUE, YRELATION, DCALCULATION, LCALCULATION, DFDISPERSION, SAVE.

Parameters: Y, NBINOMIAL, RESIDUALS, FITTEDVALUES, LINEARPREDICTOR, DERIVATIVE, DEVIANCE, VFUNCTION.

Action with `RESTRICT`

You can restrict the units that Genstat will use for the regression by putting a restriction on any of the vectors involved in the MODEL statement (response variates, weight variate, offset variate, grouping factor or variate of binomial totals), or on any explanatory variate or factor in a subsequent TERMS statement. However, you are not allowed to have different restrictions on the different vectors. You should not alter the restriction applied to the vectors between the TERMS statement and subsequent fitting statements.

Reference

McCullagh, P. & Nelder, J.A. (1989). Generalized Linear Models (second edition). Chapman and Hall, London.

Example

" Example FIT-1: Simple linear regression

  Modelling the relationship between counts of apples from 12 trees
  (recorded as 100s of fruit) and percentage damage by codling moth.
  (Snedecor & Cochran, Statistical Methods, 1980, p162.)"

VARIATE [VALUES= 8, 6,11,22,14,17,18,24,19,23,26,40] Cropsize
&       [VALUES=59,58,56,53,50,45,43,42,39,38,30,27] Wormy
DGRAPH  Wormy; Cropsize

" It is expected that the larger the crop is the less the damage will be,
  since the density of the flying moths is unrelated to the crop size.
  Try fitting a linear model relating the percentage of damage directly
  to the size of the crop."
MODEL Wormy
FIT Cropsize

" Tree number 4 seems different from the rest: perhaps it was not
  adequately protected by the standard spraying programme, or was on the
  side from which the codling moths flew in to the orchard.
  Tree number 12 has a much larger crop than the rest: the results of the
  regression are strongly influenced by this one observation.
  Display all the fitted values, residuals and leverages (influence)."
RDISPLAY [PRINT=fittedvalues]

" Check the effect of omitting tree number 4."
RESTRICT Wormy; .NOT.EXPAND(4; 12)
FIT [PRINT=summary] Cropsize

" Return to the complete dataset, and display the fitted line."
RESTRICT Wormy
FIT [PRINT=*] Cropsize
RGRAPH [GRAPHICS=high]

" Plot the fitted values against the residuals, to check that the
  variance is roughly constant; use the procedure RCHECK from the
  Genstat Procedure Library."
RCHECK [GRAPHICS=high] residual; fittedvalues

Updated on June 19, 2019
