Fits a linear, generalized linear, generalized additive or generalized nonlinear model.

### Options

| Option | Description |
|---|---|
| `PRINT` = string tokens | What to print (`model`, `deviance`, `summary`, `estimates`, `correlations`, `fittedvalues`, `accumulated`, `monitoring`, `grid`, `confidence`); default `mode`, `summ`, `esti`, or `grid` if `NGRIDLINES` is set |
| `CALCULATION` = expression structures | Calculation of explanatory variates involving nonlinear parameters |
| `OWN` = scalar | Option setting for `OWN` directive if this is to be used rather than `CALCULATE` to calculate explanatory variates |
| `CONSTANT` = string token | How to treat the constant (`estimate`, `omit`, `ignore`); default `esti` |
| `FACTORIAL` = scalar | Limit for expansion of model terms; default as in previous `TERMS` statement, or 3 if no `TERMS` given |
| `POOL` = string token | Whether to pool ss in accumulated summary between all terms fitted in a linear model (`yes`, `no`); default `no` |
| `DENOMINATOR` = string token | Whether to base ratios in accumulated summary on rms from model with smallest residual ss or smallest residual ms (`ss`, `ms`); default `ss` |
| `NOMESSAGE` = string tokens | Which warning messages to suppress (`dispersion`, `leverage`, `residual`, `aliasing`, `marginality`, `vertical`, `df`, `inflation`); default `*` |
| `FPROBABILITY` = string token | Printing of probabilities for variance and deviance ratios (`yes`, `no`); default `no` |
| `TPROBABILITY` = string token | Printing of probabilities for t-statistics (`yes`, `no`); default `no` |
| `SELECTION` = string tokens | Statistics to be displayed in the summary of analysis produced by `PRINT=summary` (`%variance`, `%ss`, `adjustedr2`, `r2`, `seobservations`, `dispersion`, `%cv`, `%meandeviance`, `%deviance`, `aic`, `bic`, `sic`); `seobservations` is relevant only for a Normally distributed response, and `%cv` only for a gamma-distributed response; default `%var`, `seob` if `DIST=normal`, `%cv` if `DIST=gamma`, and `disp` for other distributions |
| `PROBABILITY` = scalar | Probability level for confidence intervals for parameter estimates; default 0.95 |
| `NGRIDLINES` = scalar | Number of values of each nonlinear parameter for a grid of function evaluations |
| `SELINEAR` = string token | Whether to calculate s.e.s for linear parameters when nonlinear parameters are also estimated (`yes`, `no`); default `no` |
| `INOWN` = identifiers | Setting to be used for the `IN` parameter of `OWN` if used to calculate explanatory variates |
| `OUTOWN` = identifiers | Setting to be used for the `OUT` parameter of `OWN` if used to calculate explanatory variates |
| `AOVDESCRIPTION` = text | Description for line in accumulated analysis of variance (or deviance) table when `POOL=yes` |

### Parameter

| Parameter | Description |
|---|---|
| formula | List of explanatory variates and factors, or model formula |

### Description

A `FIT` statement must always be preceded by a `MODEL` statement, though not necessarily immediately. You can give several `FIT` statements after a single `MODEL` statement; for example, to try out different explanatory variables.

The parameter of the `FIT` directive specifies the explanatory variables in the model. In simple regression, it consists of the identifier of a single explanatory variate. If you omit the parameter, Genstat fits a *null model*; that is, a model consisting of just one parameter, the overall mean. In multiple regression the parameter consists of a list of explanatory variates; factors may also appear, to include the main effects of qualitative explanatory variables.
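As a minimal sketch of these three cases, assume a response variate `Yield`, an explanatory variate `Nitrogen` and a factor `Variety`. All three names are hypothetical and are taken to be declared and holding data already.

```genstat
MODEL Yield
" Null model: only the overall mean is estimated. "
FIT
" Simple linear regression on a single explanatory variate. "
FIT Nitrogen
" Multiple regression: a variate plus the main effect of a factor. "
FIT Nitrogen + Variety
```

Each `FIT` statement refits from scratch against the same `MODEL` statement, so the three fits can be compared directly.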

More generally, the parameter may be in the form of a model formula, including interactions between explanatory variables and functions of explanatory variables. The interaction between two or more variates is interpreted as another variate formed from the product of the constituent variates. The interaction between factors is interpreted as in the `TREATMENTSTRUCTURE` directive; and in general the expansion of model formulae is controlled by the `FACTORIAL` option in the same way as in the `ANOVA` directive. The interaction between a variate and a factor represents differential responses for the variate at each level of the factor, and similarly if several variates or factors are involved. A formula may also include `POL`, `REG` and `COMPARISON` functions of variates or factors, representing polynomial contrasts (up to order 4), orthogonalized regression or polynomial contrasts (up to order 8) and non-orthogonalized regression contrasts (up to order 8) respectively.

Variates may also appear in `SSPLINE` functions, representing cubic smoothing spline effects with specified numbers of degrees of freedom or specified smoothing parameters. Similarly, variates may appear in `LOESS` functions, representing smoothed effects from locally weighted regressions. Multi-dimensional smoothing can be achieved by supplying a pointer containing up to four variates as the first argument of `LOESS`. Models including such terms are called *additive* or *generalized additive models* (Hastie & Tibshirani 1990). Smoothed variates may also appear in interactions, where they represent the same effects as if the variate did not appear in the `SSPLINE` function; the model then fits a common smooth effect in addition to the usual linear effects for each combination of factor levels.
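The following sketch illustrates some of these formula forms. The identifiers `Yield`, `Nitrogen` (variate), `Variety` and `Block` (factors) are hypothetical. The second argument of `POL` is assumed here to give the polynomial order, and that of `SSPLINE` the degrees of freedom.

```genstat
MODEL Yield
" Main effects plus interaction: a separate slope on Nitrogen for each Variety. "
FIT Nitrogen * Variety
" Limit expansion of the formula to terms involving at most two components. "
FIT [FACTORIAL=2] Nitrogen * Variety * Block
" Quadratic polynomial in Nitrogen. "
FIT POL(Nitrogen; 2)
" Cubic smoothing spline in Nitrogen with 3 degrees of freedom. "
FIT SSPLINE(Nitrogen; 3)
```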

The `CALCULATION` option allows you to specify one or more expressions to be evaluated before carrying out the linear or generalized linear fit. This is only done if an `RCYCLE` statement has been given to list nonlinear parameters. The expressions can then make use of the current values of the nonlinear parameters to derive components of the fitted model. At each stage of the nonlinear search for the best estimates of these parameters, the linear or generalized linear model is fitted after evaluating the expressions with the current values of these parameters. Models of this kind are referred to as *generalized nonlinear models* (Lane 1996).
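A rough sketch of this mechanism, for an exponential curve y = a + b·r^x that is linear in a and b but nonlinear in r. The identifiers `Y`, `X`, `Z`, `R` and `Rcalc` are hypothetical, and the bounds and starting value for `R` are illustrative assumptions. Whether `Z` must be declared beforehand may depend on the Genstat release, so it is declared explicitly here.

```genstat
" Z is recalculated from the current value of R at every step of the search. "
VARIATE Z
EXPRESSION Rcalc; VALUE=!E(Z = R**X)
MODEL Y
" List R as the nonlinear parameter, with assumed bounds and starting value. "
RCYCLE PARAMETER=R; LOWER=0; UPPER=1; INITIAL=0.9
" For each trial value of R, fit the linear regression of Y on Z. "
FIT [CALCULATION=Rcalc] Z
```

Only `R` is searched over iteratively; the constant and the coefficient of `Z` are estimated by the ordinary linear fit at each step.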

The `PRINT` option controls output. You can give several settings at the same time, to provide reports on several aspects of the analysis. The `model` setting gives a description of the model, including response and explanatory variates.

The output from the `summary` setting is a summary analysis of variance, or analysis of deviance in generalized linear models. The summary includes F-probabilities if option `FPROBABILITY=yes`, but the interpretation of these probabilities depends on the usual assumptions of regression analysis, and they are only approximate in generalized linear models. Following the analysis of variance, further information is presented about the fit of the model, the contents of which are controlled by the `SELECTION` option. By default, for models with the Normal distribution, this consists of the percentage variance accounted for and the standard error of the observations. The percentage variance accounted for is the *adjusted* $R^2$ *statistic*, expressed as a percentage: 100 × (1 − (Residual m.s.)/(Total m.s.)). The standard error of the observations is estimated by the square root of the residual mean square. For the gamma distribution, the default is to display the coefficient of variation instead, while for other distributions the default is to display the dispersion. The setting `aic` presents the Akaike information criterion, and the settings `bic` and `sic` are synonyms that present the Schwarz (Bayesian) information criterion (see Koehler & Murphree 1988 for a comparison); the values calculated by Genstat omit some constant terms that depend on the data rather than the model, so it is the differences between values for different models that should be of interest rather than the absolute values.

There may also be messages in the output, produced as a result of several checks made by Genstat on the adequacy of the model. Extreme residuals and leverage values are reported, and simple checks are made on constancy of variance and systematic departure from the fitted model. You can prevent these messages appearing by using the `NOMESSAGE` option. They will not appear in any case if you have set option `RMETHOD=*` in the `MODEL` statement.
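For instance, using the variates `Wormy` and `Cropsize` from the example at the end of this page, a summary with F-probabilities and a chosen set of statistics might be requested as sketched below. The particular `SELECTION` settings are only an illustration.

```genstat
MODEL Wormy
" Summary analysis of variance with F-probability, followed by the "
" percentage variance accounted for, s.e. of observations and AIC.  "
FIT [PRINT=summary; FPROBABILITY=yes; \
     SELECTION=%variance,seobservations,aic] Cropsize
```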

The `estimates` setting produces the estimates of parameters in the model. The standard errors of the estimates are based by default on the residual mean square. Alternatively, you can supply an estimate of variance by using the `DISPERSION` option of `MODEL`; if you do this, Genstat will print a reminder about the basis of the standard errors. You can prevent this reminder appearing by setting the `NOMESSAGE` option. T-statistics are also displayed, allowing you to test whether each parameter differs significantly from zero, keeping the other parameters fixed; these tests too depend on the usual assumptions of regression analysis. The number of degrees of freedom for such a test appears in the column heading. If the estimate of variance is supplied, then the “t-statistics” actually have a standard Normal distribution, indicated by the column heading “t(*)”. If the `TPROBABILITY` option is set, the corresponding probabilities are displayed. You can also display confidence intervals for the parameters by including the `confidence` setting. The probability value for the intervals is set by the `PROBABILITY` option; the default is 0.95.
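A short sketch of these settings, again using `Wormy` and `Cropsize` from the example at the end of this page. The 90% level is an arbitrary choice for illustration.

```genstat
MODEL Wormy
" Estimates with s.e.s, t-statistics and their probabilities, "
" plus 90% confidence intervals for the parameters.           "
FIT [PRINT=estimates,confidence; TPROBABILITY=yes; PROBABILITY=0.9] Cropsize
```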

The variance inflation factor is calculated for each parameter, and a message is generated if any is greater than 100, to warn that some explanatory terms are nearly aliased and that the standard errors of their parameters are consequently inflated. The parameters involved in the relationship are listed with the inflation factors. The variance inflation factor is defined to be the current diagonal value of the inverse matrix $(X^{T}X)^{-1}$ corresponding to the parameter, multiplied by the corrected sum of squares of the variate or dummy variate corresponding to the parameter. This can be interpreted as the ratio of the variance of the parameter estimate in the current model to that of the estimate in a model containing just that parameter and the constant. The check will not be made if the current model contains any `POL` submodels, or any term involving interaction between a variate and a factor, because the dummy variates generated to represent these effects are very likely to be nearly aliased with each other. The check is also omitted if the constant term is excluded from the model.

When a generalized linear model is fitted with a log or logit link function, the antilogs of the parameters are also displayed, to summarize their multiplicative effects on the natural or odds scale respectively.

For a linear model with Normally distributed response, the `accumulated` setting displays an analysis attributing the variance to the explanatory terms in the order in which they are given in the parameter of `FIT`; no subdivision is available for generalized linear or nonlinear models unless terms are explicitly added or dropped one at a time using further directives such as `ADD`, `DROP` or `SWITCH`. The subdivision is also not made if the `POOL` option is set to `yes`. The denominator of the ratios in the analysis can be controlled by setting the `DENOMINATOR` option. The lines of the accumulated table are usually labelled by the names of the model terms that have been added or dropped. When `POOL=yes`, however, this label may become rather too long or complicated, so you can then use the `AOVDESCRIPTION` option to supply your own description. If you supply a null text (containing just a single, empty line), the line is omitted from the table.
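The contrast between the subdivided and pooled forms can be sketched as follows. The identifiers `Yield`, `Nitrogen` and `Variety` are hypothetical, as is the description text.

```genstat
MODEL Yield
" One line per term, in the order in which the terms are given. "
FIT [PRINT=accumulated; FPROBABILITY=yes] Nitrogen + Variety
" Both terms pooled into a single line with a user-supplied label. "
FIT [PRINT=accumulated; POOL=yes; \
     AOVDESCRIPTION='Nitrogen and variety'] Nitrogen + Variety
```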

The `deviance` setting produces an abbreviated summary of the analysis. The `correlations` setting gives a correlation matrix of the parameter estimates. The `fittedvalues` setting displays a table of unit labels, values of response variate, fitted values, standardized residuals and leverages. The `monitoring` setting reports the progress of any iterative search, as used in generalized linear, additive and nonlinear models. Finally, the `grid` setting is relevant only for generalized nonlinear models when the `NGRIDLINES` option is set, as in `FITNONLINEAR`.

The `CONSTANT` option controls whether the constant parameter is included in the model. In simple linear regression, this parameter is the intercept, in other words the estimate of the response variable when the explanatory variable is zero. In models containing factors, the constant will be the parameter corresponding to the reference level of the factor or factors, and the estimates printed for other levels will be differences between the parameter for those levels and that for the reference level (for more details, see the *Guide to the Genstat Command Language*, Part 2, Section 3.3.2). Consequently, the constant should then not be omitted unless the `FULL` option of `TERMS` has been set to ensure that the model contains a parameter for every level of the factor. If you set `CONSTANT=omit` for a model containing factors without setting `FULL=yes` in `TERMS`, Genstat gives a failure diagnostic. The diagnostic can be suppressed by setting `CONSTANT=ignore` instead, but this should be done only in special circumstances (as, for example, inside the procedure `HGANALYSE` which fits hierarchical generalized linear models).
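A minimal sketch of omitting the constant from a model containing a factor, assuming a hypothetical response variate `Yield` and factor `Variety`:

```genstat
MODEL Yield
" Request a parameter for every level of the factor. "
TERMS [FULL=yes] Variety
" With no constant, each estimate is the mean for one level of Variety, "
" rather than a difference from the reference level.                   "
FIT [CONSTANT=omit] Variety
```

Without the `TERMS` statement, the same `FIT` statement would produce the failure diagnostic described above.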

The `NOMESSAGE` option controls printing of messages. The `aliasing` setting suppresses messages about aliasing of parameters, and the `marginality` setting suppresses reports of violation of marginality principles when fitting interactions between explanatory variables. The `leverage` setting prevents messages about large leverages, and `residual` prevents messages about large residuals or non-constant variance or systematic pattern in the residuals. The `inflation` setting suppresses messages about the variance inflation factor, and the `dispersion` setting prevents reminders appearing about the basis of the standard errors (as can be produced by the `estimates` setting of the `PRINT` option).
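For example, to suppress the leverage and residual messages for the regression in the example at the end of this page (where trees 4 and 12 are commented on as unusual), something like the following sketch could be used:

```genstat
MODEL Wormy
" Fit as usual, but omit warnings about large leverages and residuals. "
FIT [NOMESSAGE=leverage,residual] Cropsize
```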

The `OWN`, `INOWN` and `OUTOWN` options are as in the `FITNONLINEAR` directive, and allow the model calculations for a generalized nonlinear model to be specified in a lower-level language, such as Fortran. The `NGRIDLINES` and `SELINEAR` options are also relevant to these models only, and provide a grid of function values and standard errors of linear parameters, respectively, as in `FITNONLINEAR`.

After fitting a regression using `FIT`, the model can be modified using the `ADD`, `DROP`, `STEP`, `SWITCH` and `TRY` directives, further output can be displayed using the `RDISPLAY` directive, and results can be copied into Genstat data structures using the `RKEEP` directive. The fit can be assessed graphically using the procedures `RGRAPH` and `RCHECK`.
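One possible sequence after a fit is sketched below. The identifiers `Yield`, `Nitrogen`, `Variety`, `Res` and `Fit` are hypothetical, and the `RESIDUALS` and `FITTEDVALUES` parameters of `RKEEP` are used on the assumption that they behave as described in the `RKEEP` documentation.

```genstat
MODEL Yield
FIT Nitrogen
" Add a factor to the current model. "
ADD Variety
" Display further output from the modified model. "
RDISPLAY [PRINT=accumulated]
" Save residuals and fitted values in new variates. "
RKEEP RESIDUALS=Res; FITTEDVALUES=Fit
" Plot the fitted model. "
RGRAPH
```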

Options: `PRINT`, `CALCULATION`, `OWN`, `CONSTANT`, `FACTORIAL`, `POOL`, `DENOMINATOR`, `NOMESSAGE`, `FPROBABILITY`, `TPROBABILITY`, `SELECTION`, `PROBABILITY`, `NGRIDLINES`, `SELINEAR`, `INOWN`, `OUTOWN`, `AOVDESCRIPTION`.

Parameter: unnamed.

### Action with `RESTRICT`

You can restrict the units that Genstat will use for the regression by putting a restriction on any of the vectors involved in the `MODEL` statement (response variates, weight variate, offset variate, grouping factor or variate of binomial totals), or on any explanatory variate or factor. However, you are not allowed to have different restrictions on the different vectors.

### References

Hastie, T.J. & Tibshirani, R.J. (1990). *Generalized Additive Models*. Chapman and Hall, London.

Koehler, A.B. & Murphree, E.S. (1988). A comparison of the Akaike and Schwarz criteria for selecting model order. *Applied Statistics*, 37, 187-195.

Lane, P.W. (1996). Generalized nonlinear models. *COMPSTAT 1996 Proceedings in Computational Statistics* (ed. Prat, A.), 331-336.

### See also

Directives: `MODEL`, `TERMS`, `RDISPLAY`, `PREDICT`, `RKEEP`, `RKESTIMATES`, `ADD`, `DROP`, `SWITCH`, `STEP`, `TRY`, `FITCURVE`, `FITNONLINEAR`, `RCYCLE`, `RFUNCTION`.

Procedures: `RCHECK`, `RGRAPH`, `RMPLCONFIDENCE`, `RPLCONFIDENCE`, `RPERMTEST`, `RWALD`, `FITINDIVIDUALLY`, `FITMULTINOMIAL`, `GLMM`, `HGANALYSE`, `RAR1`, `RLOESSGROUPS`.

Functions: `COMPARISON`, `POL`, `REG`, `LOESS`, `SSPLINE`.

Commands for: Regression analysis.

### Example

    " Example FIT-1: Simple linear regression

      Modelling the relationship between counts of apples from 12 trees
      (recorded as 100s of fruit) and percentage damage by codling moth.
      (Snedecor & Cochran, Statistical Methods, 1980, p162.)"

    VARIATE [VALUES= 8, 6,11,22,14,17,18,24,19,23,26,40] Cropsize
    &       [VALUES=59,58,56,53,50,45,43,42,39,38,30,27] Wormy
    DGRAPH Wormy; Cropsize

    " It is expected that the larger the crop is the less the damage will be,
      since the density of the flying moths is unrelated to the crop size.
      Try fitting a linear model relating the percentage of damage directly
      to the size of the crop."
    MODEL Wormy
    FIT Cropsize

    " Tree number 4 seems different from the rest: perhaps it was not
      adequately protected by the standard spraying programme, or was on
      the side from which the codling moths flew in to the orchard.
      Tree number 12 has a much larger crop than the rest: the results of
      the regression are strongly influenced by this one observation.
      Display all the fitted values, residuals and leverages (influence)."
    RDISPLAY [PRINT=fittedvalues]

    " Check the effect of omitting tree number 4."
    RESTRICT Wormy; .NOT.EXPAND(4; 12)
    FIT [PRINT=summary] Cropsize

    " Return to the complete dataset, and display the fitted line."
    RESTRICT Wormy
    FIT [PRINT=*] Cropsize
    RGRAPH [GRAPHICS=high]

    " Plot the fitted values against the residuals, to check that the
      variance is roughly constant; use the procedure RCHECK from the
      Genstat Procedure Library."
    RCHECK [GRAPHICS=high] residual; fittedvalues