Fits a linear, generalized linear, generalized additive or generalized nonlinear model.
Options
PRINT = string tokens | What to print (model, deviance, summary, estimates, correlations, fittedvalues, accumulated, monitoring, grid, confidence); default mode, summ, esti or grid if NGRIDLINES is set |
---|---|
CALCULATION = expression structures | Calculation of explanatory variates involving nonlinear parameters |
OWN = scalar | Option setting for OWN directive if this is to be used rather than CALCULATE to calculate explanatory variates |
CONSTANT = string token | How to treat the constant (estimate, omit, ignore); default esti |
FACTORIAL = scalar | Limit for expansion of model terms; default as in previous TERMS statement, or 3 if no TERMS given |
POOL = string token | Whether to pool ss in accumulated summary between all terms fitted in a linear model (yes, no); default no |
DENOMINATOR = string token | Whether to base ratios in accumulated summary on rms from model with smallest residual ss or smallest residual ms (ss, ms); default ss |
NOMESSAGE = string tokens | Which warning messages to suppress (dispersion, leverage, residual, aliasing, marginality, vertical, df, inflation); default * |
FPROBABILITY = string token | Printing of probabilities for variance and deviance ratios (yes, no); default no |
TPROBABILITY = string token | Printing of probabilities for t-statistics (yes, no); default no |
SELECTION = string tokens | Statistics to be displayed in the summary of analysis produced by PRINT=summary; seobservations is relevant only for a Normally distributed response, and %cv only for a gamma-distributed response (%variance, %ss, adjustedr2, r2, seobservations, dispersion, %cv, %meandeviance, %deviance, aic, bic, sic); default %var, seob if DIST=normal, %cv if DIST=gamma, and disp for other distributions |
PROBABILITY = scalar | Probability level for confidence intervals for parameter estimates; default 0.95 |
NGRIDLINES = scalar | Number of values of each nonlinear parameter for a grid of function evaluations |
SELINEAR = string token | Whether to calculate s.e.s for linear parameters when nonlinear parameters are also estimated (yes, no); default no |
INOWN = identifiers | Setting to be used for the IN parameter of OWN if used to calculate explanatory variates |
OUTOWN = identifiers | Setting to be used for the OUT parameter of OWN if used to calculate explanatory variates |
AOVDESCRIPTION = text | Description for line in accumulated analysis of variance (or deviance) table when POOL=yes |
Parameter
formula | List of explanatory variates and factors, or model formula |
---|
Description
A FIT statement must always be preceded by a MODEL statement, though not necessarily immediately. You can give several FIT statements after a single MODEL statement; for example, to try out different explanatory variables.
The parameter of the FIT directive specifies the explanatory variables in the model. In simple regression, it consists of the identifier of a single explanatory variate. If you omit the parameter, Genstat fits a null model; that is, a model consisting of just one parameter, the overall mean. In multiple regression the parameter consists of a list of explanatory variates, and factors may also appear to include the main effects of qualitative explanatory variables.
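As a minimal sketch of the pattern described above, the following reuses the Wormy and Cropsize data from the Example section and gives several FIT statements after one MODEL statement. The variate LogCrop is a hypothetical name introduced only for illustration.

```genstat
VARIATE [VALUES=59,58,56,53,50,45,43,42,39,38,30,27] Wormy
VARIATE [VALUES= 8, 6,11,22,14,17,18,24,19,23,26,40] Cropsize
" LogCrop is a hypothetical derived variate, used here as an alternative
  explanatory variable."
CALCULATE LogCrop = LOG(Cropsize)
MODEL Wormy
" No parameter: the null model, containing only the overall mean."
FIT
" Simple regression on one explanatory variate."
FIT Cropsize
" Same response, a different explanatory variate; no new MODEL statement
  is needed."
FIT LogCrop
```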
More generally, the parameter may be in the form of a model formula, including interactions between explanatory variables and functions of explanatory variables. The interaction between two or more variates is interpreted as another variate formed from the product of the constituent variates. The interaction between factors is interpreted as in the TREATMENTSTRUCTURE directive; and in general the expansion of model formulae is controlled by the FACTORIAL option in the same way as in the ANOVA directive. The interaction between a variate and a factor represents differential responses for the variate at each level of the factor, and similarly if several variates or factors are involved. A formula may also include POL, REG and COMPARISON functions of variates or factors, representing polynomial contrasts (up to order 4), orthogonalized regression or polynomial contrasts (up to order 8) and non-orthogonalized regression contrasts (up to order 8) respectively. Variates may also appear in SSPLINE functions, representing cubic smoothing spline effects with specified numbers of degrees of freedom or specified smoothing parameters. Similarly, variates may appear in LOESS functions, representing smoothed effects from locally weighted regressions. Multi-dimensional smoothing can be achieved by supplying a pointer containing up to four variates as the first argument of LOESS. Models including such terms are called additive or generalized additive models (Hastie & Tibshirani 1990). Smoothed variates may also appear in interactions, where they represent the same effects as if the variate did not appear in the SSPLINE function; the model then fits a common smooth effect in addition to the usual linear effects for each combination of factor levels.
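The kinds of formula described above might be written as in the sketch below. The data and the identifiers Dose, Drug and Response are invented for illustration, and the second argument of SSPLINE is assumed here to be the number of degrees of freedom.

```genstat
" Hypothetical data: 6 doses recorded for each of 2 drugs."
VARIATE [VALUES=1,2,3,4,5,6,1,2,3,4,5,6] Dose
FACTOR [LEVELS=2; VALUES=6(1,2)] Drug
VARIATE [VALUES=2.1,3.9,6.2,8.1,9.8,12.2,1.0,1.9,3.2,3.9,5.1,6.0] Response
MODEL Response
" Main effects plus the variate-by-factor interaction, giving a separate
  slope for Dose at each level of Drug."
FIT Dose*Drug
" Polynomial contrasts of Dose up to order 2 (linear and quadratic)."
FIT POL(Dose; 2)
" Cubic smoothing spline in Dose, assumed here to use 3 degrees of freedom."
FIT SSPLINE(Dose; 3)
```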
The CALCULATION option allows you to specify one or more expressions to be evaluated before carrying out the linear or generalized linear fit. This is only done if an RCYCLE statement has been given to list nonlinear parameters. The expressions can then make use of the current values of the nonlinear parameters to derive components of the fitted model. At each stage of the nonlinear search for the best estimates of these parameters, the linear or generalized linear model is fitted after evaluating the expressions with the current values of these parameters. Models of this kind are referred to as generalized nonlinear models (Lane 1996).
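A sketch of how such a model might be set up follows, under stated assumptions. The data, the identifiers Time, Y, K, Z and Decay, and the starting value for K are all invented. The sketch also assumes that RCYCLE accepts an INITIAL parameter and that Z should be declared before use; check both against the RCYCLE documentation.

```genstat
" Hypothetical data following an exponential decay towards an asymptote."
VARIATE [VALUES=0,1,2,3,4,5,6,7] Time
VARIATE [VALUES=9.8,7.6,5.9,4.7,3.8,3.1,2.6,2.3] Y
" K is the nonlinear parameter; Z is the explanatory variate derived from it."
SCALAR K; VALUE=0.3
VARIATE [NVALUES=8] Z
EXPRESSION Decay; VALUE=!E(Z = EXP(-K*Time))
MODEL Y
" List the nonlinear parameter, with an assumed starting value."
RCYCLE PARAMETER=K; INITIAL=0.3
" At each step of the search, Decay is evaluated with the current K and
  the linear regression of Y on Z is then fitted."
FIT [CALCULATION=Decay] Z
```

With this arrangement only K is found by the nonlinear search; the constant and the coefficient of Z are linear parameters estimated by the regression at each step.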
The PRINT option controls output. You can give several settings at the same time, to provide reports on several aspects of the analysis. The model setting gives a description of the model, including response and explanatory variates.
The output from the summary setting is a summary analysis of variance, or analysis of deviance in generalized linear models. The summary includes F-probabilities if option FPROBABILITY=yes, but the interpretation of these probabilities depends on the usual assumptions of regression analysis, and they are only approximate in generalized linear models. Following the analysis of variance further information is presented about the fit of the model, the contents of which are controlled by the SELECTION option. By default, for models with the Normal distribution, this consists of the percentage variance accounted for and the standard error of the observations. The percentage variance accounted for is the adjusted R² statistic, expressed as a percentage: 100 × (1 - (Residual m.s.)/(Total m.s.)). The standard error of the observations is estimated by the square root of the residual mean square. For the gamma distribution, the default is to display the coefficient of variation instead, while for other distributions the default is to display the dispersion. The setting aic presents the Akaike information criterion, and the settings bic and sic are synonyms that present the Schwarz (Bayesian) information criterion (see Koehler & Murphree 1988 for a comparison); the values calculated by Genstat omit some constant terms that depend on the data rather than the model, so it is the differences between values for different models that should be of interest rather than the absolute values. There may also be messages in the output, produced as a result of several checks made by Genstat on the adequacy of the model. Extreme residuals and leverage values are reported, and simple checks are made on constancy of variance and systematic departure from the fitted model. You can prevent these messages appearing by using the NOMESSAGE option. They will not appear in any case if you have set option RMETHOD=* in the MODEL statement.
The estimates setting produces the estimates of parameters in the model. The standard errors of the estimates are based by default on the residual mean square. Alternatively, you can supply an estimate of variance by using the DISPERSION option of MODEL; if you do this, Genstat will print a reminder about the basis of the standard errors. You can prevent this reminder appearing by setting the NOMESSAGE option. T-statistics are also displayed, allowing you to test whether each parameter differs significantly from zero, keeping the other parameters fixed; these tests too depend on the usual assumptions of regression analysis. The number of degrees of freedom for such a test appears in the column heading. If the estimate of variance is supplied, then the “t-statistics” actually have a standard Normal distribution, indicated by the column heading “t(*)”. If the TPROBABILITY option is set, the corresponding probabilities are displayed. You can also display confidence intervals for the parameters by including the confidence setting. The probability value for the intervals is set by the PROBABILITY option; default 0.95.
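The output options discussed above can be combined in one statement. The sketch below reuses the Wormy and Cropsize data from the Example section; the particular choice of settings is only illustrative.

```genstat
VARIATE [VALUES=59,58,56,53,50,45,43,42,39,38,30,27] Wormy
VARIATE [VALUES= 8, 6,11,22,14,17,18,24,19,23,26,40] Cropsize
MODEL Wormy
" Summary with F-probability, plus a chosen set of fit statistics."
FIT [PRINT=model,summary; FPROBABILITY=yes; \
     SELECTION=%variance,seobservations,aic,bic] Cropsize
" Estimates with t-probabilities and 99% confidence intervals."
FIT [PRINT=estimates,confidence; TPROBABILITY=yes; PROBABILITY=0.99] Cropsize
```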
The variance inflation factor is calculated for each parameter, and a message is generated if any is greater than 100, to warn that some explanatory terms are nearly aliased and that the standard errors of their parameters are consequently inflated. The parameters involved in the relationship are listed with the inflation factors. The variance inflation factor is defined to be the current diagonal value of the inverse matrix (XᵀX)⁻¹ corresponding to the parameter, multiplied by the corrected sum of squares of the variate or dummy variate corresponding to the parameter. This can be interpreted as the ratio of the variance of the parameter estimate in the current model compared with that of the estimate in a model containing just that parameter and the constant. The check will not be made if the current model contains any POL submodels, or any term involving interaction between a variate and a factor, because the dummy variates generated to represent these effects are very likely to be nearly aliased with each other. The check is also omitted if the constant term is excluded from the model. When a generalized linear model is fitted with a log or logit link function, the antilogs of the parameters are also displayed, to summarize their multiplicative effects on the natural or odds scale respectively.
For a linear model with Normally distributed response, the accumulated setting displays an analysis attributing the variance to the explanatory terms in the order in which they are given in the parameter of FIT; no subdivision is available for generalized linear or nonlinear models unless terms are explicitly added or dropped one at a time using further directives such as ADD, DROP or SWITCH. The subdivision is also not made if the POOL option is set to yes. The denominator of the ratios in the analysis can be controlled by setting the DENOMINATOR option. The lines of the accumulated table are usually labelled by the names of the model terms that have been added or dropped. When POOL=yes, however, this may become rather too long or complicated, so you can then use the AOVDESCRIPTION option to supply your own description. If you supply a null text (containing just a single, empty line), the line is omitted from the table.
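The two forms of the accumulated summary might be requested as in this sketch. The data, the identifiers X1, X2 and Y, and the description text are invented for illustration.

```genstat
" Hypothetical data with two explanatory variates."
VARIATE [VALUES=1,2,3,4,5,6,7,8] X1
VARIATE [VALUES=2,1,4,3,6,5,8,7] X2
VARIATE [VALUES=3.1,3.9,7.2,7.8,11.1,11.9,15.2,15.8] Y
MODEL Y
" One line per term, in the order given in the formula."
FIT [PRINT=accumulated; FPROBABILITY=yes] X1 + X2
" A single pooled line for both terms, with a user-supplied label."
FIT [PRINT=accumulated; POOL=yes; AOVDESCRIPTION='Both covariates'] X1 + X2
```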
The deviance setting produces an abbreviated summary of the analysis. The correlations setting gives a correlation matrix of the parameter estimates. The fitted setting displays a table of unit labels, values of response variate, fitted values, standardized residuals and leverages. The monitoring setting reports the progress of any iterative search, as used in generalized linear, additive and nonlinear models. Finally, the grid setting is relevant only for generalized nonlinear models when the NGRIDLINES option is set, as in FITNONLINEAR.
The CONSTANT option controls whether the constant parameter is included in the model. In simple linear regression, this parameter is the intercept, in other words the estimate of the response variable when the explanatory variable is zero. In models containing factors, the constant will be the parameter corresponding to the reference level of the factor or factors, and the estimates printed for other levels will be differences between the parameter for those levels and that for the reference level (for more details, see the Guide to the Genstat Command Language, Part 2, Section 3.3.2). Consequently, the constant should then not be omitted unless the FULL option of TERMS has been set to ensure that the model contains a parameter for every level of the factor. If you set CONSTANT=omit for a model containing factors without setting FULL=yes in TERMS, Genstat gives a failure diagnostic. The diagnostic can be suppressed by setting CONSTANT=ignore instead, but this should be done only in special circumstances (as, for example, inside the procedure HGANALYSE which fits hierarchical generalized linear models).
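A sketch of omitting the constant from a model containing a factor, following the rule above that FULL=yes must first be set in TERMS. The data and the identifiers Variety and Yield are invented for illustration.

```genstat
" Hypothetical data: 4 plots of each of 3 varieties."
FACTOR [LEVELS=3; VALUES=4(1,2,3)] Variety
VARIATE [VALUES=5.1,4.8,5.3,5.0, 6.2,6.0,6.4,6.1, 4.1,4.3,3.9,4.2] Yield
MODEL Yield
" Request a parameter for every level of the factor."
TERMS [FULL=yes] Variety
" With no constant, each estimate should be a variety mean rather than
  a difference from the reference level."
FIT [CONSTANT=omit] Variety
```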
The NOMESSAGE option controls printing of messages. The aliasing setting suppresses messages about aliasing of parameters, and the marginality setting suppresses reports of violation of marginality principles when fitting interactions between explanatory variables. The leverage setting prevents messages about large leverages, and residual prevents messages about large residuals or non-constant variance or systematic pattern in the residuals. The inflation setting suppresses messages about the variance inflation factor, and the dispersion setting prevents reminders appearing about the basis of the standard errors (as can be produced by the estimates setting of the PRINT option).
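For instance, the Example section notes that tree 4 stands out and that tree 12 has a much larger crop than the rest, so the default fit is likely to report messages about them. A sketch of suppressing those messages while keeping the others:

```genstat
VARIATE [VALUES=59,58,56,53,50,45,43,42,39,38,30,27] Wormy
VARIATE [VALUES= 8, 6,11,22,14,17,18,24,19,23,26,40] Cropsize
MODEL Wormy
" Suppress only the leverage and residual messages."
FIT [NOMESSAGE=leverage,residual] Cropsize
```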
The OWN, INOWN and OUTOWN options are as in the FITNONLINEAR directive, and allow the model calculations for a generalized nonlinear model to be specified in a lower-level language, such as Fortran. The NGRIDLINES and SELINEAR options are also relevant to these models only, and provide a grid of function values and standard errors of linear parameters, respectively, as in FITNONLINEAR.
After fitting a regression using FIT, the model can be modified using the ADD, DROP, STEP, SWITCH and TRY directives, further output can be displayed using the RDISPLAY directive, and results can be copied into Genstat data structures using the RKEEP directive. The fit can be assessed graphically using the procedures RGRAPH and RCHECK.
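A sketch of a typical sequence after FIT follows. The data and the identifiers X1, X2, Y, Res and Fitted are invented for illustration.

```genstat
" Hypothetical data with two explanatory variates."
VARIATE [VALUES=1,2,3,4,5,6,7,8] X1
VARIATE [VALUES=2,1,4,3,6,5,8,7] X2
VARIATE [VALUES=3.1,3.9,7.2,7.8,11.1,11.9,15.2,15.8] Y
MODEL Y
" Declare the largest model that will be considered."
TERMS X1 + X2
FIT X1
" Modify the fitted model by adding a term."
ADD X2
" Further output from the current model, without refitting."
RDISPLAY [PRINT=estimates,accumulated]
" Copy residuals and fitted values into data structures for later use."
RKEEP RESIDUALS=Res; FITTEDVALUES=Fitted
PRINT Res,Fitted
```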
Options: PRINT, CALCULATION, OWN, CONSTANT, FACTORIAL, POOL, DENOMINATOR, NOMESSAGE, FPROBABILITY, TPROBABILITY, SELECTION, PROBABILITY, NGRIDLINES, SELINEAR, INOWN, OUTOWN, AOVDESCRIPTION.
Parameter: unnamed.
Action with RESTRICT
You can restrict the units that Genstat will use for the regression by putting a restriction on any of the vectors involved in the MODEL statement (response variates, weight variate, offset variate, grouping factor or variate of binomial totals), or on any explanatory variate or factor. However, you are not allowed to have different restrictions on the different vectors.
References
Hastie, T.J. & Tibshirani, R.J. (1990). Generalized Additive Models. Chapman and Hall, London.
Koehler, A.B. & Murphree, E.S. (1988). A comparison of the Akaike and Schwarz criteria for selecting model order. Applied Statistics, 37, 187-195.
Lane, P.W. (1996). Generalized nonlinear models. COMPSTAT 1996 Proceedings in Computational Statistics (ed. Prat, A.), 331-336.
See also
Directives: MODEL, TERMS, RDISPLAY, PREDICT, RKEEP, RKESTIMATES, ADD, DROP, SWITCH, STEP, TRY, FITCURVE, FITNONLINEAR, RCYCLE, RFUNCTION.
Procedures: RCHECK, RGRAPH, RMPLCONFIDENCE, RPLCONFIDENCE, RPERMTEST, RWALD, FITINDIVIDUALLY, FITMULTINOMIAL, GLMM, HGANALYSE, RAR1, RLOESSGROUPS.
Functions: COMPARISON, POL, REG, LOESS, SSPLINE.
Commands for: Regression analysis.
Example
" Example FIT-1: Simple linear regression

  Modelling the relationship between counts of apples from 12 trees
  (recorded as 100s of fruit) and percentage damage by codling moth.
  (Snedecor & Cochran, Statistical analysis, 1980, p162.)"

VARIATE [VALUES= 8, 6,11,22,14,17,18,24,19,23,26,40] Cropsize
&       [VALUES=59,58,56,53,50,45,43,42,39,38,30,27] Wormy
DGRAPH Wormy; Cropsize

" It is expected that the larger the crop is the less the damage will be,
  since the density of the flying moths is unrelated to the crop size.
  Try fitting a linear model relating the percentage of damage directly
  to the size of the crop."
MODEL Wormy
FIT Cropsize

" Tree number 4 seems different from the rest: perhaps it was not
  adequately protected by the standard spraying programme, or was on
  the side from which the codling moths flew in to the orchard.
  Tree number 12 has a much larger crop than the rest: the results of
  the regression are strongly influenced by this one observation.
  Display all the fitted values, residuals and leverages (influence)."
RDISPLAY [PRINT=fittedvalues]

" Check the effect of omitting tree number 4."
RESTRICT Wormy; .NOT.EXPAND(4; 12)
FIT [PRINT=summary] Cropsize

" Return to the complete dataset, and display the fitted line."
RESTRICT Wormy
FIT [PRINT=*] Cropsize
RGRAPH [GRAPHICS=high]

" Plot the fitted values against the residuals, to check that the
  variance is roughly constant; use the procedure RCHECK from the
  Genstat Procedure Library."
RCHECK [GRAPHICS=high] residual; fittedvalues