Fits a linear, generalized linear, generalized additive or generalized nonlinear model.
|What to print (
||Calculation of explanatory variates involving nonlinear parameters|
||Option setting for
||How to treat the constant (
||Limit for expansion of model terms; default as in previous
||Whether to pool ss in accumulated summary between all terms fitted in a linear model (
||Whether to base ratios in accumulated summary on rms from model with smallest residual ss or smallest residual ms (
||Which warning messages to suppress (
||Printing of probabilities for variance and deviance ratios (
||Printing of probabilities for t-statistics (
||Statistics to be displayed in the summary of analysis produced by
||Probability level for confidence intervals for parameter estimates; default 0.95|
||Number of values of each nonlinear parameter for a grid of function evaluations|
||Whether to calculate s.e.s for linear parameters when nonlinear parameters are also estimated (
||Setting to be used for the
||Setting to be used for the
||Description for line in accumulated analysis of variance (or deviance) table when
|formula||List of explanatory variates and factors, or model formula|
A FIT statement must always be preceded by a
MODEL statement, though not necessarily immediately. You can give several
FIT statements after a single
MODEL statement; for example, to try out different explanatory variables.
The parameter of the
FIT directive specifies the explanatory variables in the model. In simple regression, it consists of the identifier of a single explanatory variate. If you omit the parameter, Genstat fits a null model; that is, a model consisting of just one parameter, the overall mean. In multiple regression the parameter consists of a list of explanatory variates, and factors may also appear to include the main effects of qualitative explanatory variables.
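The cases described above can be sketched as follows: a null model, a simple regression, and a regression that also includes the main effect of a factor, all fitted after a single MODEL statement. The variates Cropsize and Wormy are taken from the apple example in this section; the factor Side is hypothetical and is invented purely for illustration.

```genstat
VARIATE [VALUES= 8, 6,11,22,14,17,18,24,19,23,26,40] Cropsize
VARIATE [VALUES=59,58,56,53,50,45,43,42,39,38,30,27] Wormy
" Hypothetical factor (not part of the original data): side of the orchard."
FACTOR  [LEVELS=2; VALUES=1,1,1,1,1,1,2,2,2,2,2,2] Side

MODEL Wormy
" Null model: the parameter is omitted, so only the overall mean is fitted."
FIT
" Simple linear regression on a single explanatory variate."
FIT   Cropsize
" Multiple regression: a variate plus the main effect of a factor."
FIT   Cropsize + Side
```

Each FIT statement refits from scratch using the response defined by the preceding MODEL statement, so alternative sets of explanatory variables can be compared without repeating MODEL.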
More generally, the parameter may be in the form of a model formula, including interactions between explanatory variables and functions of explanatory variables. The interaction between two or more variates is interpreted as another variate formed from the product of the constituent variates. The interaction between factors is interpreted as in the
TREATMENTSTRUCTURE directive; and in general the expansion of model formulae is controlled by the
FACTORIAL option in the same way as in the
ANOVA directive. The interaction between a variate and a factor represents differential responses for the variate at each level of the factor, and similarly if several variates or factors are involved. A formula may also include
COMPARISON functions of variates or factors, representing polynomial contrasts (up to order 4), orthogonalized regression or polynomial contrasts (up to order 8) and non-orthogonalized regression contrasts (up to order 8) respectively. Variates may also appear in
SSPLINE functions, representing cubic smoothing spline effects with specified numbers of degrees of freedom or specified smoothing parameters. Similarly, variates may appear in
LOESS functions, representing smoothed effects from locally weighted regressions. Multi-dimensional smoothing can be achieved by supplying a pointer containing up to four variates as the first argument of
LOESS. Models including such terms are called additive or generalized additive models (Hastie & Tibshirani 1990). Smoothed variates may also appear in interactions, where they represent the same effects as if the variate did not appear in the
SSPLINE function; the model then fits a common smooth effect in addition to the usual linear effects for each combination of factor levels.
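As a rough illustration of the kinds of model formula described above, the sketch below fits a polynomial, a smoothing spline and a variate-by-factor interaction. The data are those of the apple example in this section; the factor Side is hypothetical, and the choices of a quadratic and of 3 degrees of freedom are arbitrary assumptions made only for the example.

```genstat
VARIATE [VALUES= 8, 6,11,22,14,17,18,24,19,23,26,40] Cropsize
VARIATE [VALUES=59,58,56,53,50,45,43,42,39,38,30,27] Wormy
" Hypothetical factor, invented for illustration."
FACTOR  [LEVELS=2; VALUES=1,1,1,1,1,1,2,2,2,2,2,2] Side

MODEL Wormy
" Polynomial contrasts: linear and quadratic effects of Cropsize."
FIT   POL(Cropsize; 2)
" Additive model: cubic smoothing spline with 3 degrees of freedom."
FIT   SSPLINE(Cropsize; 3)
" Main effects plus the interaction, giving a separate linear response"
" to Cropsize at each level of Side."
FIT   Cropsize * Side
```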
The CALCULATION option allows you to specify one or more expressions to be evaluated before carrying out the linear or generalized linear fit. This is only done if an
RCYCLE statement has been given to list nonlinear parameters. The expressions can then make use of the current values of the nonlinear parameters to derive components of the fitted model. At each stage of the nonlinear search for the best estimates of these parameters, the linear or generalized linear model is fitted after evaluating the expressions with the current values of these parameters. Models of this kind are referred to as generalized nonlinear models (Lane 1996).
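A minimal sketch of a generalized nonlinear model of this kind is shown below, assuming an exponential curve Wormy = a + b × R**Cropsize in which R is the only nonlinear parameter. The names Zcalc, Z and R, and the bounds and initial value given to RCYCLE, are illustrative assumptions and do not come from the text above; the syntax of RCYCLE should be checked against its own description.

```genstat
VARIATE [VALUES= 8, 6,11,22,14,17,18,24,19,23,26,40] Cropsize
VARIATE [VALUES=59,58,56,53,50,45,43,42,39,38,30,27] Wormy

" Expression evaluated at each step of the search, using the current"
" value of the nonlinear parameter R to form the explanatory variate Z."
EXPRESSION Zcalc; VALUE=!E(Z = R**Cropsize)

MODEL  Wormy
" Declare R as a nonlinear parameter (bounds and start are assumptions)."
RCYCLE PARAMETER=R; LOWER=0.5; UPPER=1; INITIAL=0.9
" The linear parameters (constant and coefficient of Z) are refitted"
" for each trial value of R."
FIT    [CALCULATION=Zcalc] Z
```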
The model setting gives a description of the model, including response and explanatory variates.
The output from the
summary setting is a summary analysis of variance, or analysis of deviance in generalized linear models. The summary includes F-probabilities if option
FPROBABILITY=yes, but the interpretation of these probabilities depends on the usual assumptions of regression analysis, and they are only approximate in generalized linear models. Following the analysis of variance further information is presented about the fit of the model, the contents of which are controlled by the
SELECTION option. By default, for models with the Normal distribution, this consists of the percentage variance accounted for and the standard error of the observations. The percentage variance accounted for is the adjusted R² statistic, expressed as a percentage: 100 × (1 – (Residual m.s.)/(Total m.s.)). The standard error of the observations is estimated by the square root of the residual mean square. For the gamma distribution, the default is to display the coefficient of variation instead, while for other distributions the default is to display the dispersion. The setting
aic presents the Akaike information criterion, and the settings
sic are synonyms that present the Schwarz (Bayesian) information criterion (see Koehler & Murphree 1988 for a comparison); the values calculated by Genstat omit some constant terms that depend on the data rather than the model, so it is the differences between values for different models that should be of interest rather than the absolute values. There may also be messages in the output, produced as a result of several checks made by Genstat on the adequacy of the model. Extreme residuals and leverage values are reported, and simple checks are made on constancy of variance and systematic departure from the fitted model. You can prevent these messages appearing by using the
NOMESSAGE option. They will not appear in any case if you have set option
RMETHOD=* in the
estimates setting produces the estimates of parameters in the model. The standard errors of the estimates are based by default on the residual mean square. Alternatively, you can supply an estimate of variance by using the
DISPERSION option of
MODEL; if you do this, Genstat will print a reminder about the basis of the standard errors. You can prevent this reminder appearing by setting the
NOMESSAGE option. T-statistics are also displayed, allowing you to test whether each parameter differs significantly from zero, keeping the other parameters fixed; these probabilities too depend on the usual assumptions of regression analysis. The number of degrees of freedom for such a test appears in the column heading. If the estimate of variance is supplied, then the “t-statistics” actually have a standard Normal distribution, indicated by the column heading “t(*)”. If the
TPROBABILITY option is set, the corresponding probabilities are displayed. You can also display confidence intervals for the parameters by including the
confidence setting. The probability value for the intervals is set by the
PROBABILITY option; default 0.95.
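The options discussed above might be combined as in the following sketch, which uses the apple data from the example in this section. The 90% level and the supplied dispersion of 40 are arbitrary values assumed only for illustration.

```genstat
VARIATE [VALUES= 8, 6,11,22,14,17,18,24,19,23,26,40] Cropsize
VARIATE [VALUES=59,58,56,53,50,45,43,42,39,38,30,27] Wormy

" Standard errors based on the residual mean square (the default),"
" with t-probabilities and 90% confidence intervals."
MODEL Wormy
FIT   [PRINT=estimates,confidence; TPROBABILITY=yes; PROBABILITY=0.90] \
      Cropsize

" Standard errors based on a supplied variance (40 is an assumed value);"
" NOMESSAGE=dispersion suppresses the reminder about their basis."
MODEL [DISPERSION=40] Wormy
FIT   [PRINT=estimates; NOMESSAGE=dispersion] Cropsize
```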
The variance inflation factor is calculated for each parameter, and a message is generated if any is greater than 100, to warn that some explanatory terms are nearly aliased and that the standard errors of their parameters are consequently inflated. The parameters involved in the relationship are listed with the inflation factors. The variance inflation factor is defined to be the current diagonal value of the inverse matrix (XᵀX)⁻¹ corresponding to the parameter, multiplied by the corrected sum of squares of the variate or dummy variate corresponding to the parameter. This can be interpreted as the ratio of the variance of the parameter estimate in the current model to that of the estimate in a model containing just that parameter and the constant. The check will not be made if the current model contains any
POL submodels, or any term involving interaction between a variate and a factor, because the dummy variates generated to represent these effects are very likely to be nearly aliased with each other. The check is also omitted if the constant term is excluded from the model. When a generalized linear model is fitted with a log or logit link function, the antilogs of the parameters are also displayed, to summarize their multiplicative effects on the natural or odds scale respectively.
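To see the antilogs of the parameters, a generalized linear model with a log link can be fitted, as in the sketch below. Treating the percentage damage from the apple example as gamma distributed is an assumption made purely to have a positive response to hand, not a recommended analysis of those data.

```genstat
VARIATE [VALUES= 8, 6,11,22,14,17,18,24,19,23,26,40] Cropsize
VARIATE [VALUES=59,58,56,53,50,45,43,42,39,38,30,27] Wormy

" Gamma errors with a log link (an illustrative assumption)."
MODEL [DISTRIBUTION=gamma; LINK=logarithm] Wormy
" With a log link, the antilog of the slope estimates the multiplicative"
" change in expected damage per unit increase in Cropsize."
FIT   [PRINT=estimates] Cropsize
```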
For a linear model with Normally distributed response, the
accumulated setting displays an analysis of variance that attributes the variation to the explanatory terms, in the order in which they are given in the parameter of
FIT; no subdivision is available for generalized linear or nonlinear models unless terms are explicitly added or dropped one at a time using further directives such as
SWITCH. The subdivision is also not made if the
POOL option is set to
yes. The denominator of the ratios in the analysis can be controlled by setting the
DENOMINATOR option. The lines of the accumulated table are usually labelled by the names of the model terms that have been added or dropped. When
POOL=yes, however, this may become rather too long or complicated, so you can then use the
AOVDESCRIPTION option to supply your own description. If you supply a null text (containing just a single, empty line), the line is omitted from the table.
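The accumulated analysis and its pooled form could be requested as sketched below. The variate LogCrop is derived here only to provide a second term, and the description text is an invented label; the DENOMINATOR setting ms is assumed from the option description at the top of this section.

```genstat
VARIATE [VALUES= 8, 6,11,22,14,17,18,24,19,23,26,40] Cropsize
VARIATE [VALUES=59,58,56,53,50,45,43,42,39,38,30,27] Wormy
" Derived variate, used only to give the model a second term."
CALCULATE LogCrop = LOG(Cropsize)

MODEL Wormy
" One line per term, in the order Cropsize then LogCrop."
FIT   [PRINT=accumulated; FPROBABILITY=yes; DENOMINATOR=ms] \
      Cropsize + LogCrop
" A single pooled line for both terms, with a user-supplied label."
FIT   [PRINT=accumulated; POOL=yes; AOVDESCRIPTION='Crop size terms'] \
      Cropsize + LogCrop
```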
The deviance setting produces an abbreviated summary of the analysis. The
correlations setting gives a correlation matrix of the parameter estimates. The
fitted setting displays a table of unit labels, values of response variate, fitted values, standardized residuals and leverages. The
monitoring setting reports the progress of any iterative search, as used in generalized linear, additive and nonlinear models. Finally, the
grid setting is relevant only for generalized nonlinear models when the
NGRIDLINES option is set, as in
The CONSTANT option controls whether the constant parameter is included in the model. In simple linear regression, this parameter is the intercept, in other words the estimate of the response variable when the explanatory variable is zero. In models containing factors, the constant will be the parameter corresponding to the reference level of the factor or factors, and the estimates printed for other levels will be differences between the parameter for those levels and that for the reference level (for more details, see the Guide to the Genstat Command Language, Part 2, Section 3.3.2). Consequently, the constant should then not be omitted unless the
FULL option of
TERMS has been set to ensure that the model contains a parameter for every level of the factor. If you set
CONSTANT=omit for a model containing factors without setting
TERMS, Genstat gives a failure diagnostic. The diagnostic can be suppressed by setting
CONSTANT=ignore instead, but this should be done only in special circumstances (as, for example, inside the procedure
HGANALYSE which fits hierarchical generalized linear models).
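The two uses of CONSTANT=omit described above can be sketched as follows: a regression through the origin, and a model with a factor where TERMS is used first so that every level has its own parameter. The data are from the apple example in this section, and the factor Side is hypothetical.

```genstat
VARIATE [VALUES= 8, 6,11,22,14,17,18,24,19,23,26,40] Cropsize
VARIATE [VALUES=59,58,56,53,50,45,43,42,39,38,30,27] Wormy
" Hypothetical factor, invented for illustration."
FACTOR  [LEVELS=2; VALUES=1,1,1,1,1,1,2,2,2,2,2,2] Side

" Regression through the origin: no intercept is estimated."
MODEL Wormy
FIT   [CONSTANT=omit] Cropsize

" With a factor, request a parameter for every level before omitting"
" the constant; the estimates are then the means for each level."
MODEL Wormy
TERMS [FULL=yes] Side
FIT   [CONSTANT=omit] Side
```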
The NOMESSAGE option controls printing of messages. The
aliasing setting suppresses messages about aliasing of parameters, and the
marginality setting suppresses reports of violation of marginality principles when fitting interactions between explanatory variables. The
leverage setting prevents messages about large leverages, and
residual prevents messages about large residuals or non-constant variance or systematic pattern in the residuals. The
inflation setting suppresses messages about the variance inflation factor, and the
dispersion setting prevents reminders appearing about the basis of the standard errors (as can be produced by the
estimates setting of the
OUTOWN options are as in the
FITNONLINEAR directive, and allow the model calculations for a generalized nonlinear model to be specified in a lower-level language, such as Fortran. The
SELINEAR options are also relevant to these models only, and provide a grid of function values and standard errors of linear parameters, respectively, as in
After fitting a regression using
FIT, the model can be modified using the
TRY directives, further output can be displayed using the
RDISPLAY directive, and results can be copied into Genstat data structures using the
RKEEP directive. The fit can be assessed graphically using the procedures
You can restrict the units that Genstat will use for the regression by putting a restriction on any of the vectors involved in the
MODEL statement (response variates, weight variate, offset variate, grouping factor or variate of binomial totals), or on any explanatory variate or factor. However, you are not allowed to have different restrictions on the different vectors.
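For instance, a restriction placed on the explanatory variate alone is enough to limit the units used in the fit, as in this sketch with the apple data from the example in this section. The cut-off of 40 is chosen only to exclude the tree with the largest crop.

```genstat
VARIATE [VALUES= 8, 6,11,22,14,17,18,24,19,23,26,40] Cropsize
VARIATE [VALUES=59,58,56,53,50,45,43,42,39,38,30,27] Wormy

" Restrict the explanatory variate only; the response needs no"
" restriction of its own."
RESTRICT Cropsize; CONDITION=Cropsize.LT.40
MODEL Wormy
FIT   Cropsize

" Remove the restriction and refit using all 12 trees."
RESTRICT Cropsize
FIT   Cropsize
```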
Hastie, T.J. & Tibshirani, R.J. (1990). Generalized Additive Models. Chapman and Hall, London.
Koehler, A.B. & Murphree, E.S. (1988). A comparison of the Akaike and Schwarz criteria for selecting model order. Applied Statistics, 37, 187-195.
Lane, P.W. (1996). Generalized nonlinear models. COMPSTAT 1996 Proceedings in Computational Statistics (ed. Prat, A.), 331-336.
Commands for: Regression analysis.
" Example FIT-1: Simple linear regression Modelling the relationship between counts of apples from 12 trees (recorded as 100s of fruit) and percentage damage by codling moth. (Snedecor & Cochran, Statistical analysis, 1980, p162.)" VARIATE [VALUES= 8, 6,11,22,14,17,18,24,19,23,26,40] Cropsize & [VALUES=59,58,56,53,50,45,43,42,39,38,30,27] Wormy DGRAPH Wormy; Cropsize " It is expected that the larger the crop is the less the damage will be, since the density of the flying moths is unrelated to the crop size. Try fitting a linear model relating the percentage of damage directly to the size of the crop." MODEL Wormy FIT Cropsize " Tree number 4 seems different from the rest: perhaps it was not adequately protected by the standard spraying programme, or was on the side from which the codling moths flew in to the orchard. Tree number 12 has a much larger crop than the rest: the results of the regression are strongly influenced by this one observation. Display all the fitted values, residuals and leverages (influence)." RDISPLAY [PRINT=fittedvalues] " Check the effect of omitting tree number 4." RESTRICT Wormy; .NOT.EXPAND(4; 12) FIT [PRINT=summary] Cropsize " Return to the complete dataset, and display the fitted line." RESTRICT Wormy FIT [PRINT=*] Cropsize RGRAPH [GRAPHICS=high] " Plot the fitted values against the residuals, to check that the variance is roughly constant; use the procedure RCHECK from the Genstat Procedure Library." RCHECK [GRAPHICS=high] residual; fittedvalues