Defines the response variate(s) and the type of model to be fitted for linear, generalized linear, generalized additive and nonlinear models.
Options
Option | Description |
---|---|
DISTRIBUTION = string token | Distribution of the response variable (normal, poisson, binomial, gamma, inversenormal, multinomial, calculated, negativebinomial, geometric, exponential, bernoulli); default norm |
LINK = string token | Link function (canonical, identity, logarithm, logit, reciprocal, power, squareroot, probit, complementaryloglog, calculated, logratio); default cano (i.e. iden for DIST=norm or calc; loga for DIST=pois; logi for DIST=bino, bern or mult; reci for DIST=gamm or expo; powe for DIST=inve; logr for DIST=nega or geom) |
EXPONENT = scalar | Exponent for power link; default -2 |
AGGREGATION = scalar | Fixed parameter for negative binomial distribution (parameter k as in variance function Var = mean + mean²/k); default 1 |
KLOGRATIO = scalar | Parameter for logratio link, in form log(mean/(mean+k)); default as set in AGGREGATION option |
DISPERSION = scalar | Value of dispersion parameter in calculation of s.e.s etc; default * for DIST=norm, gamm, inve or calc, and 1 for DIST=pois, bino, mult, nega, geom, expo or bern |
WEIGHTS = variate or symmetric matrix | Variate of weights for weighted regression, or symmetric matrix of weights (one row and column for each unit of data) for generalized least squares; default * |
OFFSET = variate | Offset variate to be included in model; default * |
GROUPS = factor | Absorbing factor defining the groups for within-groups linear or generalized linear regression; default * |
RMETHOD = string token | Type of residuals to form, if any, after each model is fitted (deviance, Pearson, simple); default devi |
DMETHOD = string token | Basis of estimate of dispersion, if not fixed by DISPERSION option (deviance, Pearson); default devi |
FUNCTIONVALUE = scalar | Scalar whose value is to be minimized by calculation; default * |
YRELATION = string token | Whether to analyse the y-variates separately, as in ordinary regression, or to analyse them cumulatively as counts in successive categories of a multinomial distribution (separate, cumulative); default sepa |
DCALCULATION = expression structures | Calculations to define the deviance contributions and variance function for a non-standard distribution; must be specified when DIST=calc |
LCALCULATION = expression structures | Calculations to define the fitted values and link derivative for a non-standard link; must be specified when LINK=calc |
DFDISPERSION = scalar | Number of degrees of freedom for a dispersion parameter specified by the DISPERSION option; if this is not set, the supplied dispersion is assumed to be known exactly |
SAVE = identifier | To name regression save structure; default * |
Parameters
Parameter | Description |
---|---|
Y = variates | Response variates; only the first is used in nonlinear models and in generalized linear models except when DIST=mult, when they specify the numbers in each category of an ordinal response model |
NBINOMIAL = variate or scalar | Total numbers for DIST=bino |
RESIDUALS = variates | To save residuals for each y variate after fitting a model |
FITTEDVALUES = variates | To save fitted values, and provide fitted values if no terms are given in FITNONLINEAR |
LINEARPREDICTOR = variate | Specifies the identifier of the variate to hold the linear predictor |
DERIVATIVE = variate | Specifies the identifier of the variate to hold the derivative of the link function at each unit |
DEVIANCE = variate | Specifies the identifier of the variate to hold the contribution to the deviance from each unit |
VFUNCTION = variate | Specifies the identifier of the variate to hold the value of the variance function at each unit |
Description
The MODEL directive does not actually fit anything: it simply sets up some structures inside Genstat that are used when you give a FIT, FITCURVE or FITNONLINEAR statement later on. So when you are doing regression, MODEL will always be accompanied by at least one other regression statement, such as FIT, to fit a model.
The Y parameter allows a list of variates; if you supply more than one for linear regression, you will get an analysis for each. This is more efficient than separate pairs of MODEL and FIT statements when you want to do many linear regressions with the same explanatory variables. With additive models, generalized linear models and nonlinear models, only the first variate will be analysed (with the exception of multinomial response models); the others will be ignored.
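As a minimal sketch of the multiple-response form, the statements below analyse two response variates with a single MODEL and FIT pair. The data values and the identifiers X, Y1, Y2, R1, R2, F1 and F2 are hypothetical, chosen only for illustration.

```
" Hypothetical data: two responses measured on the same six units "
VARIATE [VALUES=1,2,3,4,5,6] X
VARIATE [VALUES=2.1,3.9,6.2,7.8,10.1,12.2] Y1
VARIATE [VALUES=5.0,4.1,3.2,1.9,1.1,0.2] Y2

" One MODEL statement lists both responses, with a residual and
  fitted-value variate for each "
MODEL Y1,Y2; RESIDUALS=R1,R2; FITTEDVALUES=F1,F2

" A single FIT then produces a separate linear regression for Y1 and Y2 "
FIT X
```

The lists in RESIDUALS and FITTEDVALUES run in parallel with the Y list, so R1 and F1 belong to Y1, and R2 and F2 to Y2.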
The RESIDUALS and FITTEDVALUES parameters allow you to specify variates to contain the residuals and fitted values for each response variable. The residuals are the “unexplained” component of the response variable, standardized in some way according to the RMETHOD option. The fitted values are the “explained” component: that is, the combination of parameters and explanatory variables fitted in the model. You can get access to these sets of values in a different way through the RKEEP directive.
The DISTRIBUTION and LINK options are used to specify a generalized linear model (McCullagh & Nelder 1989). By default the data are assumed to follow a Normal distribution, as required for ordinary linear regression, but other distributions can be selected using the DISTRIBUTION option. The LINK option specifies the link function that relates the linear model to the expected values of the distribution; in the default ordinary linear regression, this is the identity function (indicating no transformation). So, for example, for a log-linear model we would specify DISTRIBUTION=poisson and LINK=logarithm, while for logistic regression we would have DISTRIBUTION=binomial and LINK=logit. The NBINOMIAL parameter must also be set when DISTRIBUTION=binomial, to give the number of binomial trials for each unit.
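The two models mentioned above might be set up as follows. This is a sketch only: the data and the identifiers Dose, Count, Total and Dead are invented for the illustration.

```
" Hypothetical data for six dose levels "
VARIATE [VALUES=1,2,3,4,5,6] Dose
VARIATE [VALUES=2,3,6,7,12,18] Count
VARIATE [VALUES=20,20,20,20,20,20] Total
VARIATE [VALUES=1,3,8,12,17,19] Dead

" Log-linear model: Poisson counts with a logarithmic link "
MODEL [DISTRIBUTION=poisson; LINK=logarithm] Count
FIT Dose

" Logistic regression: Dead out of Total at each dose;
  NBINOMIAL supplies the number of binomial trials for each unit "
MODEL [DISTRIBUTION=binomial; LINK=logit] Dead; NBINOMIAL=Total
FIT Dose
```

In both cases the LINK setting could be omitted, since the canonical link is the logarithm for the Poisson distribution and the logit for the binomial.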
The EXPONENT option specifies the exponent when LINK=power. Similarly, the AGGREGATION option specifies the aggregation parameter k when DISTRIBUTION=negativebinomial. This is a measure of the tendency for observations to cluster together, and it appears in the formula for the variance as a function of the mean:
variance = mean + mean²/k
The default value of k is 1, which corresponds to the geometric distribution. The parameter k must be positive, and as it increases to infinity the distribution approaches the Poisson distribution. The KLOGRATIO option sets the parameter k for the logratio link.
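A sketch of a negative binomial model with a fixed aggregation parameter is given below. The value k = 2 and the data are hypothetical; in practice k would come from prior knowledge or a separate estimate.

```
" Hypothetical overdispersed counts at six dose levels "
VARIATE [VALUES=1,2,3,4,5,6] Dose
VARIATE [VALUES=0,4,3,11,9,25] Count

" Negative binomial with k fixed at 2, so that
  variance = mean + mean**2/2; a logarithmic link is requested here
  in place of the default logratio link "
MODEL [DISTRIBUTION=negativebinomial; LINK=logarithm; AGGREGATION=2] Count
FIT Dose
```

If LINK were left at its default, the logratio link would be used, with its parameter k taken from AGGREGATION unless KLOGRATIO is set.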
You can also define your own distribution or link function for a generalized linear model. To specify your own distribution, you need to set DISTRIBUTION=calculated and then specify expression structures with the DCALCULATION option to calculate the deviance and the variance function for each unit of the response variate, using the current values of the fitted-values variate. You must also set the FITTEDVALUES, DEVIANCE and VFUNCTION parameters to indicate which identifiers are used to represent these in the expressions. To specify your own link, you need to set LINK=calculated and provide expressions with the LCALCULATION option for the two calculations that form the fitted values and the derivative of the link function for each unit of the response variate, using the current values of the linear predictor. You must also set the FITTEDVALUES, LINEARPREDICTOR and DERIVATIVE parameters to specify the identifiers used to represent these in the calculations. In addition, you must provide initial values for the linear predictor, so that the iterative process can get started: often this can be done just by applying the link function to the response variate itself, but it may be necessary to modify extreme values such as 0 that may be mapped to infinity by the link function.
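To make the two LCALCULATION quantities concrete, consider the logarithmic link. It is already built in, and is used here only because its algebra is simple. Writing η for the linear predictor and μ for the fitted value (symbols introduced for this illustration), the expressions would need to compute the inverse link and the link derivative:

```latex
\eta = g(\mu) = \log \mu
\quad\Longrightarrow\quad
\mu = g^{-1}(\eta) = e^{\eta},
\qquad
\frac{d\eta}{d\mu} = g'(\mu) = \frac{1}{\mu}
```

The first result would be stored in the FITTEDVALUES variate and the second in the DERIVATIVE variate. Initial values could be taken as η = log y, with any y = 0 first replaced by a small positive value, because log 0 is not finite; the size of that adjustment is a choice for the user.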
You can fit ordinal response models by setting option YRELATION=cumulative and option DISTRIBUTION=multinomial.
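An ordinal response model might be set up as in the sketch below, where each response variate holds the count in one ordered category. The data and the identifiers Dose, Cat1, Cat2 and Cat3 are hypothetical.

```
" Hypothetical counts in three ordered categories at four dose levels "
VARIATE [VALUES=1,2,3,4] Dose
VARIATE [VALUES=15,10,6,2] Cat1
VARIATE [VALUES=4,7,8,6] Cat2
VARIATE [VALUES=1,3,6,12] Cat3

" The response variates are analysed cumulatively as successive
  categories of a multinomial distribution "
MODEL [DISTRIBUTION=multinomial; YRELATION=cumulative] Cat1,Cat2,Cat3
FIT Dose
```

With LINK left at its default, the canonical link for the multinomial distribution (logit) is used.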
The DISPERSION option controls how the variance of the distribution of the response values is calculated. By default, the variance is estimated from the residual mean square, and standard errors and standardized residuals are calculated from the estimate. If you use DISPERSION to supply a value for the variance of the Normal distribution, or for the dispersion parameter of other distributions, then standard errors and residuals are based on this given value instead. In a generalized linear model, the dispersion of the chosen distribution can be fixed at a value provided by the DISPERSION option, or estimated from either the residual deviance or the Pearson chi-square statistic, as specified by the DMETHOD option.
The DFDISPERSION option allows you to specify the number of degrees of freedom for a value specified by the DISPERSION option. You might want to use this, for example, if you had estimated the dispersion from some other data set. If DFDISPERSION is not set, the supplied dispersion is assumed to be known exactly.
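The sketch below shows the two ways of handling dispersion described above: estimating it for a Poisson model, and supplying a value obtained elsewhere. The data, the dispersion value 2.5 and its 20 degrees of freedom are all hypothetical.

```
" Hypothetical data "
VARIATE [VALUES=1,2,3,4,5,6] X
VARIATE [VALUES=2,3,6,7,12,18] Count
VARIATE [VALUES=2.1,3.9,6.2,7.8,10.1,12.2] Y

" Poisson model with the dispersion estimated from the Pearson
  chi-square statistic, instead of being fixed at the default of 1 "
MODEL [DISTRIBUTION=poisson; DISPERSION=*; DMETHOD=Pearson] Count
FIT X

" Ordinary regression with a variance of 2.5 supplied from another
  data set, where it was estimated on 20 degrees of freedom "
MODEL [DISPERSION=2.5; DFDISPERSION=20] Y
FIT X
```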
The WEIGHTS option allows you to specify a variate holding weights for each unit. In simple linear regression, the estimate of dispersion is then the weighted residual mean square. Thus, if the variance of the response variable is not constant, and you know the relative size of the variance for each observation, you can set the weight to be proportional to the inverse of the variance of an observation. Alternatively, if the variance is related in a simple way to the mean, you may just need to specify a different distribution for the response. The WEIGHTS option can also be set to a symmetric matrix, supplying weights corresponding to some pattern of correlation or covariance between units as well as variance of each unit. The subsequent analysis is known as generalized least-squares if the response distribution is Normal.
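Inverse-variance weighting with a weight variate can be sketched as follows. The relative variances in Relvar are hypothetical, as are the other identifiers and data.

```
" Hypothetical data, with the relative variance of each observation "
VARIATE [VALUES=1,2,3,4,5,6] X
VARIATE [VALUES=2.1,3.9,6.2,7.8,10.1,12.2] Y
VARIATE [VALUES=1,1,2,2,4,4] Relvar

" Weights proportional to the inverse of the variance "
CALCULATE W = 1/Relvar

MODEL [WEIGHTS=W] Y
FIT X
```

Only the relative sizes of the weights matter for the parameter estimates; the estimated dispersion is then on the scale of an observation with weight 1.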
The OFFSET option allows you to include in the regression a variable with no corresponding parameter. Linear regression analysis of Y with offset O is just the same as analysis of Y - O, but the offset has non-trivial applications in generalized linear models.
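One common use of an offset in a generalized linear model is to model rates: in a Poisson model with a logarithmic link, the log of the exposure enters the linear predictor with its coefficient fixed at 1. The sketch below assumes hypothetical data and identifiers (Age, Exposure, Claims, Logexp).

```
" Hypothetical counts of claims with differing exposure per unit "
VARIATE [VALUES=1,2,3,4,5,6] Age
VARIATE [VALUES=120,150,90,200,180,60] Exposure
VARIATE [VALUES=3,6,5,14,15,7] Claims

" The offset is on the scale of the linear predictor, so take logs "
CALCULATE Logexp = LOG(Exposure)

MODEL [DISTRIBUTION=poisson; LINK=logarithm; OFFSET=Logexp] Claims
FIT Age
```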
The GROUPS option specifies a factor whose effects you want to eliminate before any regression is fitted. The factor must already have been defined. This method of elimination is sometimes called absorption; you might want to use it when data from many different groups are to be modelled. Use of GROUPS gives less information than you would get if you included the factor explicitly in the model (leverages, predictions and some parameter correlations cannot be formed), but it saves space and time in fitting the model when the factor has many levels. You can use GROUPS only with linear and generalized linear regression.
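A within-groups regression might be requested as in the sketch below. The factor Farm and the data are hypothetical; a real application would usually involve a factor with many more levels.

```
" Hypothetical data: two units from each of three farms "
FACTOR [LEVELS=3; VALUES=1,1,2,2,3,3] Farm
VARIATE [VALUES=1,2,3,4,5,6] X
VARIATE [VALUES=2.1,3.9,6.2,7.8,10.1,12.2] Y

" Farm effects are absorbed before the regression on X is fitted "
MODEL [GROUPS=Farm] Y
FIT X
```

The factor is declared before the MODEL statement, as required, and does not appear in the FIT formula.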
The RMETHOD option controls how residuals are formed. By default, residuals are deviance residuals standardized by their estimated variance. The alternative Pearson residuals are defined in exactly the same way if the distribution is Normal, but for regression models with distributions other than Normal the two kinds of residual are different. If you do not want residuals, you can set the option to a missing value (*) to save space within Genstat. However, you will then not be able to get residuals, fitted values or leverages, and the automatic checks on the fit of a model will not be done.
The FUNCTIONVALUE option is relevant only when you want to use FITNONLINEAR to optimize a general function. It then identifies the scalar that stores the results in the expression that calculates the function to be minimized (see the CALCULATION option of FITNONLINEAR). This should calculate a deviance if you are using this general facility to fit a statistical model. FUNCTIONVALUE is ignored if the Y parameter of MODEL is set.
The SAVE option allows you to specify an identifier for the regression save structure. This structure stores the current state of the regression model, and can be used explicitly in the directives RDISPLAY, RKEEP, PREDICT and RFUNCTION. If the identifier given in SAVE refers to a regression save structure that already has values, those values are deleted. You can reset the current regression save structure at any point in a program by using the SET directive. Later regression statements then use the model stored in this save structure.
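The sketch below keeps two fitted models available at once through named save structures. It assumes that RDISPLAY accepts a SAVE option and that the RSAVE option of SET is the one that resets the current regression save structure; check both against the documentation for those directives. The identifiers and data are hypothetical.

```
" Hypothetical data "
VARIATE [VALUES=1,2,3,4,5,6] X
VARIATE [VALUES=2.1,3.9,6.2,7.8,10.1,12.2] Y1
VARIATE [VALUES=5.0,4.1,3.2,1.9,1.1,0.2] Y2

" Fit two models, each with its own save structure "
MODEL [SAVE=Save1] Y1
FIT X
MODEL [SAVE=Save2] Y2
FIT X

" Display the first model by naming its save structure explicitly "
RDISPLAY [SAVE=Save1; PRINT=estimates]

" Alternatively, make Save1 the current save structure (assumed
  option RSAVE) so that later statements refer to the first model "
SET [RSAVE=Save1]
RDISPLAY [PRINT=summary]
```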
Options: DISTRIBUTION, LINK, EXPONENT, AGGREGATION, KLOGRATIO, DISPERSION, WEIGHTS, OFFSET, GROUPS, RMETHOD, DMETHOD, FUNCTIONVALUE, YRELATION, DCALCULATION, LCALCULATION, DFDISPERSION, SAVE.
Parameters: Y, NBINOMIAL, RESIDUALS, FITTEDVALUES, LINEARPREDICTOR, DERIVATIVE, DEVIANCE, VFUNCTION.
Action with RESTRICT
You can restrict the units that Genstat will use for the regression by putting a restriction on any of the vectors involved in the MODEL statement (response variates, weight variate, offset variate, grouping factor or variate of binomial totals), or on any explanatory variate or factor in a subsequent TERMS statement. However, you are not allowed to have different restrictions on the different vectors. You should not alter the restriction applied to the vectors between the TERMS statement and subsequent fitting statements.
Reference
McCullagh, P. & Nelder, J.A. (1989). Generalized Linear Models (second edition). Chapman and Hall, London.
See also
Directives: FIT, FITCURVE, FITNONLINEAR, TERMS.
Commands for: Regression analysis.
Example
" Example FIT-1: Simple linear regression

  Modelling the relationship between counts of apples from 12 trees
  (recorded as 100s of fruit) and percentage damage by codling moth.
  (Snedecor & Cochran, Statistical analysis, 1980, p162.)"

VARIATE [VALUES= 8, 6,11,22,14,17,18,24,19,23,26,40] Cropsize
&       [VALUES=59,58,56,53,50,45,43,42,39,38,30,27] Wormy
DGRAPH Wormy; Cropsize

" It is expected that the larger the crop is the less the damage will be,
  since the density of the flying moths is unrelated to the crop size.
  Try fitting a linear model relating the percentage of damage directly
  to the size of the crop."

MODEL Wormy
FIT Cropsize

" Tree number 4 seems different from the rest: perhaps it was not
  adequately protected by the standard spraying programme, or was on the
  side from which the codling moths flew in to the orchard.
  Tree number 12 has a much larger crop than the rest: the results of the
  regression are strongly influenced by this one observation.
  Display all the fitted values, residuals and leverages (influence)."

RDISPLAY [PRINT=fittedvalues]

" Check the effect of omitting tree number 4."

RESTRICT Wormy; .NOT.EXPAND(4; 12)
FIT [PRINT=summary] Cropsize

" Return to the complete dataset, and display the fitted line."

RESTRICT Wormy
FIT [PRINT=*] Cropsize
RGRAPH [GRAPHICS=high]

" Plot the fitted values against the residuals, to check that the
  variance is roughly constant; use the procedure RCHECK from the
  Genstat Procedure Library."

RCHECK [GRAPHICS=high] residual; fittedvalues