Defines the response variate(s) and the type of model to be fitted for linear, generalized linear, generalized additive and nonlinear models.
DISTRIBUTION = string token
|Distribution of the response variable (
LINK = string token
|Link function (
EXPONENT = scalar
|Exponent for power link; default -2
AGGREGATION = scalar
|Fixed parameter for negative binomial distribution (parameter k as in variance function Var = mean + mean²/k); default 1
KLOGRATIO = scalar
|Parameter for logratio link, in form log(mean/(mean+k)); default as set in
DISPERSION = scalar
|Value of dispersion parameter in calculation of s.e.s etc; default
calc, and 1 for
WEIGHTS = variate or symmetric matrix
|Variate of weights for weighted regression, or symmetric matrix of weights (one row and column for each unit of data) for generalized least squares; default
OFFSET = variate
|Offset variate to be included in model; default
GROUPS = factor
|Absorbing factor defining the groups for within-groups linear or generalized linear regression; default
RMETHOD = string token
|Type of residuals to form, if any, after each model is fitted (
DMETHOD = string token
|Basis of estimate of dispersion, if not fixed by the DISPERSION option (deviance, Pearson); default
FUNCTIONVALUE = scalar
|Scalar whose value is to be minimized by calculation; default
YRELATION = string token
|Whether to analyse the y-variates separately, as in ordinary regression, or to analyse them cumulatively as counts in successive categories of a multinomial distribution (separate, cumulative); default
DCALCULATION = expression structures
|Calculations to define the deviance contributions and variance function for a non-standard distribution; must be specified when
LCALCULATION = expression structures
|Calculations to define the fitted values and link derivative for a non-standard link; must be specified when
DFDISPERSION = scalar
|Allows you to specify the number of degrees of freedom for a dispersion parameter specified by the DISPERSION option; if this is not set, the supplied dispersion is assumed to be known exactly
SAVE = identifier
|To name regression save structure; default
Y = variates
|Response variates; only the first is used in nonlinear models and in generalized linear models except when DIST=mult, when they specify the numbers in each category of an ordinal response model
NBINOMIAL = variate or scalar
|Total numbers for
RESIDUALS = variates
|To save residuals for each y variate after fitting a model
FITTEDVALUES = variates
|To save fitted values, and provide fitted values if no terms are given in
LINEARPREDICTOR = variate
|Specifies the identifier of the variate to hold the linear predictor
DERIVATIVE = variate
|Specifies the identifier of the variate to hold the derivative of the link function at each unit
DEVIANCE = variate
|Specifies the identifier of the variate to hold the contribution to the deviance from each unit
VFUNCTION = variate
|Specifies the identifier of the variate to hold the value of the variance function at each unit
The MODEL directive does not actually fit anything: it simply sets up some structures inside Genstat that are used when you give a fitting statement, such as FIT or FITNONLINEAR, later on. So when you are doing regression, MODEL will always be accompanied by at least one other regression statement to fit a model, such as FIT.
The Y parameter allows a list of variates; if you supply more than one for linear regression, you will get an analysis for each. This is more efficient for doing many linear regressions with the same explanatory variables than using separate pairs of MODEL and FIT statements. With additive models, generalized linear models and nonlinear models, only the first variate will be analysed (with the exception of multinomial response models); the others will be ignored.
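As a minimal sketch of the multiple-response case described above, the statements below fit the same explanatory variate to two response variates with a single MODEL statement. The identifiers Age, Height and Weight and all the data values are hypothetical, invented only for illustration.

```genstat
" Hypothetical data: two responses measured on the same six units."
VARIATE [VALUES=1,2,3,4,5,6] Age
VARIATE [VALUES=10.2,11.9,14.1,15.8,18.3,19.9] Height
VARIATE [VALUES=3.1,3.9,5.2,5.8,7.1,7.7] Weight
" One MODEL statement with two y-variates: FIT then reports an
  analysis for Height and another for Weight."
MODEL Height,Weight
FIT Age
```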
The RESIDUALS and FITTEDVALUES parameters allow you to specify variates to contain the residuals and fitted values for each response variable. The residuals are the “unexplained” component of the response variable, standardized in some way according to the RMETHOD option. The fitted values are the “explained” component: that is, the combination of parameters and explanatory variables fitted in the model. You can get access to these sets of values in a different way through the
The DISTRIBUTION and LINK options are used to specify a generalized linear model (McCullagh & Nelder 1989). By default the data are assumed to follow a Normal distribution, as required for ordinary linear regression, but other distributions can be selected using the DISTRIBUTION option. The LINK option specifies the link function that relates the linear model to the expected values of the distribution; in the default ordinary linear regression, this is the identity function (indicating no transformation). So, for example, for a log-linear model we would specify LINK=log, while for logistic regression we would have DISTRIBUTION=binomial and LINK=logit. The NBINOMIAL parameter must also be set when DISTRIBUTION=binomial, to give the number of binomial trials for each unit.
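The following sketch shows how a logistic regression might be set up along the lines just described. The identifiers Dose, Total and Dead and their values are hypothetical; the only settings taken from the text are DISTRIBUTION=binomial and the NBINOMIAL parameter, with the logit link assumed as the usual choice for logistic regression.

```genstat
" Hypothetical dose-response data: Dead out of Total at each Dose."
VARIATE [VALUES=1,2,3,4,5] Dose
VARIATE [VALUES=20,20,20,20,20] Total
VARIATE [VALUES=2,5,10,16,19] Dead
" Binomial distribution with logit link; NBINOMIAL supplies the
  number of binomial trials for each unit."
MODEL [DISTRIBUTION=binomial; LINK=logit] Dead; NBINOMIAL=Total
FIT Dose
```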
The EXPONENT option specifies the exponent when LINK=power. Similarly, the AGGREGATION option specifies the aggregation parameter k when DISTRIBUTION=negativebinomial. This is a measure of the tendency for observations to cluster together, and it appears in the formula for the variance as a function of the mean:
variance = mean + mean²/k
The default value of k is 1, which corresponds to the geometric distribution. The parameter k must be positive, and as it increases to infinity the distribution approaches the Poisson distribution. The KLOGRATIO option sets the parameter k for the logratio link.
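A short sketch of a negative binomial model with a fixed aggregation parameter is given below. The identifiers Density and Count, the data values and the choice k = 2 are hypothetical; with k = 2 the assumed variance function is mean + mean²/2.

```genstat
" Hypothetical overdispersed counts observed at six densities."
VARIATE [VALUES=1,2,3,4,5,6] Density
VARIATE [VALUES=0,3,2,9,6,15] Count
" Negative binomial errors with k fixed at 2 and a log link."
MODEL [DISTRIBUTION=negativebinomial; LINK=log; AGGREGATION=2] Count
FIT Density
```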
You can also define your own distribution or link function for a generalized linear model. To specify your own distribution, you need to set DISTRIBUTION=calculated and then specify expression structures with the DCALCULATION option to calculate the deviance and the variance function for each unit of the response variate, using the current values of the fitted-values variate. You must also set the corresponding parameters of MODEL (including VFUNCTION) to indicate which identifiers are used to represent these in the expressions.
To specify your own link, you need to set LINK=calculated and provide expressions with the LCALCULATION option for two other calculations, which form the fitted values and the derivative of the link function for each unit of the response variate, using the current values of the linear predictor. You must also set the corresponding parameters of MODEL (including DERIVATIVE) to specify the identifiers used to represent these in the calculations. In addition, you must provide initial values for the linear predictor, so that the iterative process can get started: often this can be done just by applying the link function to the response variate itself, but it may be necessary to modify extreme values, such as 0, that may be mapped to infinity by the link function.
You can fit ordinal response models by setting option YRELATION=cumulative and option DISTRIBUTION=multinomial.
The DISPERSION option controls how the variance of the distribution of the response values is calculated. By default, the variance is estimated from the residual mean square, and standard errors and standardized residuals are calculated from the estimate. If you use DISPERSION to supply a value for the variance of the Normal distribution, or for the dispersion parameter of other distributions, then standard errors and residuals are based on this given value instead. In a generalized linear model, the dispersion of the chosen distribution can be fixed at a value provided by the DISPERSION option, or estimated from either the residual deviance or the Pearson chi-square statistic, as specified by the DMETHOD option.
The DFDISPERSION option allows you to specify the number of degrees of freedom for a value specified by the DISPERSION option. You might want to use this, for example, if you had estimated the dispersion from some other data set. If DFDISPERSION is not set, the supplied dispersion is assumed to be known exactly.
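As an illustration of supplying a dispersion estimated elsewhere, the sketch below fixes the variance of a Normal model and states how many degrees of freedom that estimate was based on. The identifiers Nitrogen and Yield, the data, the variance 2.5 and the 10 degrees of freedom are all hypothetical.

```genstat
" Hypothetical yields at six nitrogen levels."
VARIATE [VALUES=0,20,40,60,80,100] Nitrogen
VARIATE [VALUES=4.1,5.3,6.8,7.4,8.9,9.6] Yield
" Use a variance of 2.5, assumed to have been estimated from another
  data set with 10 degrees of freedom, instead of the residual
  mean square from this fit."
MODEL [DISPERSION=2.5; DFDISPERSION=10] Yield
FIT Nitrogen
```

Omitting DFDISPERSION in this sketch would treat the value 2.5 as known exactly.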
The WEIGHTS option allows you to specify a variate holding weights for each unit. In simple linear regression, the estimate of dispersion is then the weighted residual mean square. Thus, if the variance of the response variable is not constant, and you know the relative size of the variance for each observation, you can set the weight to be proportional to the inverse of the variance of an observation. Alternatively, if the variance is related in a simple way to the mean, you may just need to specify a different distribution for the response. The WEIGHTS option can also be set to a symmetric matrix, supplying weights corresponding to some pattern of correlation or covariance between units as well as the variance of each unit. The subsequent analysis is known as generalized least squares if the response distribution is Normal.
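The variate form of weighting can be sketched as follows, with weights taken as the inverse of a known relative variance. The identifiers X, Y, Relvar and W and all values are hypothetical; the symmetric-matrix form is not shown here.

```genstat
" Hypothetical data in which the variance of Y grows with X."
VARIATE [VALUES=1,2,3,4,5,6] X
VARIATE [VALUES=2.1,3.8,6.3,7.6,10.9,11.4] Y
VARIATE [VALUES=1,1,2,2,4,4] Relvar
" Weights proportional to the inverse of each unit's variance."
CALCULATE W = 1 / Relvar
MODEL [WEIGHTS=W] Y
FIT X
```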
The OFFSET option allows you to include in the regression a variable with no corresponding parameter. Linear regression analysis of Y with offset O is just the same as analysis of Y - O, but the offset has non-trivial applications in generalized linear models.
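One common use of an offset in a generalized linear model is to model rates: in a log-linear model, the log of the exposure enters the linear predictor with its coefficient fixed at 1. The sketch below assumes that application; the identifiers Age, Exposure, Cases and Logexp and the data are hypothetical.

```genstat
" Hypothetical event counts with differing amounts of exposure."
VARIATE [VALUES=1,2,3,4,5] Age
VARIATE [VALUES=120,95,150,80,60] Exposure
VARIATE [VALUES=3,4,9,7,8] Cases
" The log of the exposure is included with no parameter, so the
  model describes the rate of cases per unit of exposure."
CALCULATE Logexp = LOG(Exposure)
MODEL [DISTRIBUTION=poisson; LINK=log; OFFSET=Logexp] Cases
FIT Age
```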
The GROUPS option specifies a factor whose effects you want to eliminate before any regression is fitted. The factor must already have been defined. This method of elimination is sometimes called absorption; you might want to use it when data from many different groups are to be modelled. Use of GROUPS gives less information than you would get if you included the factor explicitly in the model (leverages, predictions and some parameter correlations cannot be formed), but it saves space and time in fitting the model when the factor has many levels. You can use GROUPS only with linear and generalized linear regression.
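A minimal sketch of within-groups regression is shown below. The factor Block, the variates X and Y and their values are hypothetical; a real application would typically involve a factor with many more levels.

```genstat
" Hypothetical data: three units in each of three groups."
FACTOR [LEVELS=3; VALUES=1,1,1,2,2,2,3,3,3] Block
VARIATE [VALUES=1,2,3,1,2,3,1,2,3] X
VARIATE [VALUES=4.2,5.1,6.3,6.0,7.2,7.9,8.1,9.0,10.2] Y
" The effects of Block are absorbed before X is fitted, so no
  parameters are reported for the individual groups."
MODEL [GROUPS=Block] Y
FIT X
```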
The RMETHOD option controls how residuals are formed. By default, residuals are deviance residuals standardized by their estimated variance. The alternative Pearson residuals are defined in exactly the same way if the distribution is Normal, but for regression models with distributions other than Normal the two kinds of residual are different. If you do not want residuals, you can set the option to a missing value (*) to save space within Genstat. However, you will then not be able to get residuals, fitted values or leverages, and the automatic checks on the fit of a model will not be done.
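The sketch below requests Pearson residuals for a Poisson model, where they differ from the default deviance residuals, and saves them with the RESIDUALS parameter. The identifiers Time, Count and Res and the data are hypothetical.

```genstat
" Hypothetical counts recorded at six times."
VARIATE [VALUES=1,2,3,4,5,6] Time
VARIATE [VALUES=2,3,6,7,12,15] Count
" Pearson rather than deviance residuals, saved in the variate Res."
MODEL [DISTRIBUTION=poisson; LINK=log; RMETHOD=pearson] Count; \
   RESIDUALS=Res
FIT Time
PRINT Res
```

Setting RMETHOD=* instead would suppress the residuals, in which case Res could not be formed.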
The FUNCTIONVALUE option is relevant only when you want to use FITNONLINEAR to optimize a general function. It then identifies the scalar that holds the result in the expression that calculates the function to be minimized (see the CALCULATION option of FITNONLINEAR). This expression should calculate a deviance if you are using this general facility to fit a statistical model. FUNCTIONVALUE is ignored if the Y parameter of MODEL is set.
The SAVE option allows you to specify an identifier for the regression save structure. This structure stores the current state of the regression model, and can be used explicitly in other directives, such as RFUNCTION. If the identifier supplied by SAVE belongs to a regression save structure that already has values, those values are deleted. You can reset the current regression save structure at any point in a program by using the SET directive. Later regression statements then use the model stored in this save structure.
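As a sketch of how named save structures might be used, the statements below keep two fitted models side by side and then return to the first. The identifiers X, Xsq, Y, Linear and Quad and the data are hypothetical, and the use of the SAVE option of RDISPLAY to select a stored model is an assumption that is not stated in the text above.

```genstat
" Hypothetical data for comparing two models of the same response."
VARIATE [VALUES=1,2,3,4,5,6] X
VARIATE [VALUES=1.2,3.9,9.3,15.8,25.4,35.7] Y
CALCULATE Xsq = X**2
" Store each model in its own regression save structure."
MODEL [SAVE=Linear] Y
FIT [PRINT=*] X
MODEL [SAVE=Quad] Y
FIT [PRINT=*] X + Xsq
" Display the first model again, although it is no longer current
  (assumes RDISPLAY accepts a SAVE option)."
RDISPLAY [SAVE=Linear]
```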
You can restrict the units that Genstat will use for the regression by putting a restriction on any of the vectors involved in the MODEL statement (response variates, weight variate, offset variate, grouping factor or variate of binomial totals), or on any explanatory variate or factor in a subsequent TERMS statement. However, you are not allowed to have different restrictions on the different vectors. You should not alter the restriction applied to the vectors between the TERMS statement and subsequent fitting statements.
McCullagh, P. & Nelder, J.A. (1989). Generalized Linear Models (second edition). Chapman and Hall, London.
" Example FIT-1: Simple linear regression

  Modelling the relationship between counts of apples from 12 trees
  (recorded as 100s of fruit) and percentage damage by codling moth.
  (Snedecor & Cochran, Statistical analysis, 1980, p162.)"

VARIATE [VALUES= 8, 6,11,22,14,17,18,24,19,23,26,40] Cropsize
&       [VALUES=59,58,56,53,50,45,43,42,39,38,30,27] Wormy
DGRAPH  Wormy; Cropsize

" It is expected that the larger the crop is the less the damage will be,
  since the density of the flying moths is unrelated to the crop size.
  Try fitting a linear model relating the percentage of damage directly
  to the size of the crop."
MODEL Wormy
FIT   Cropsize

" Tree number 4 seems different from the rest: perhaps it was not
  adequately protected by the standard spraying programme, or was on the
  side from which the codling moths flew in to the orchard.
  Tree number 12 has a much larger crop than the rest: the results of the
  regression are strongly influenced by this one observation.
  Display all the fitted values, residuals and leverages (influence)."
RDISPLAY [PRINT=fittedvalues]

" Check the effect of omitting tree number 4."
RESTRICT Wormy; .NOT.EXPAND(4; 12)
FIT      [PRINT=summary] Cropsize

" Return to the complete dataset, and display the fitted line."
RESTRICT Wormy
FIT      [PRINT=*] Cropsize
RGRAPH   [GRAPHICS=high]

" Plot the fitted values against the residuals, to check that the
  variance is roughly constant; use the procedure RCHECK from the
  Genstat Procedure Library."
RCHECK [GRAPHICS=high] residual; fittedvalues