FIT directive

Fits a linear, generalized linear, generalized additive or generalized nonlinear model.

Options

`PRINT` = string tokens	What to print (`model`, `deviance`, `summary`, `estimates`, `correlations`, `fittedvalues`, `accumulated`, `monitoring`, `grid`, `confidence`); default `mode, summ, esti` or `grid` if `NGRIDLINES` is set
`CALCULATION` = expression structures	Calculation of explanatory variates involving nonlinear parameters
`OWN` = scalar	Option setting for `OWN` directive if this is to be used rather than `CALCULATE` to calculate explanatory variates
`CONSTANT` = string token	How to treat the constant (`estimate`, `omit`, `ignore`); default `esti`
`FACTORIAL` = scalar	Limit for expansion of model terms; default as in previous `TERMS` statement, or 3 if no `TERMS` given
`POOL` = string token	Whether to pool ss in accumulated summary between all terms fitted in a linear model (`yes, no`); default `no`
`DENOMINATOR` = string token	Whether to base ratios in accumulated summary on rms from model with smallest residual ss or smallest residual ms (`ss, ms`); default `ss`
`NOMESSAGE` = string tokens	Which warning messages to suppress (`dispersion`, `leverage`, `residual`, `aliasing`, `marginality`, `vertical`, `df`, `inflation`); default `*`
`FPROBABILITY` = string token	Printing of probabilities for variance and deviance ratios (`yes, no`); default `no`
`TPROBABILITY` = string token	Printing of probabilities for t-statistics (`yes, no`); default `no`
`SELECTION` = string tokens	Statistics to be displayed in the summary of analysis produced by `PRINT=summary`; `seobservations` is relevant only for a Normally distributed response, and `%cv` only for a gamma-distributed response (`%variance`, `%ss`, `adjustedr2`, `r2`, `seobservations`, `dispersion`, `%cv`, `%meandeviance`, `%deviance`, `aic`, `bic`, `sic`); default `%var`, `seob` if `DIST=normal`, `%cv` if `DIST=gamma`, and `disp` for other distributions
`PROBABILITY` = scalar	Probability level for confidence intervals for parameter estimates; default 0.95
`NGRIDLINES` = scalar	Number of values of each nonlinear parameter for a grid of function evaluations
`SELINEAR` = string token	Whether to calculate s.e.s for linear parameters when nonlinear parameters are also estimated (`yes`, `no`); default `no`
`INOWN` = identifiers	Setting to be used for the `IN` parameter of `OWN` if used to calculate explanatory variates
`OUTOWN` = identifiers	Setting to be used for the `OUT` parameter of `OWN` if used to calculate explanatory variates
`AOVDESCRIPTION` = text	Description for line in accumulated analysis of variance (or deviance) table when `POOL=yes`

Parameter

formula	List of explanatory variates and factors, or model formula

Description

A FIT statement must always be preceded by a MODEL statement, though not necessarily immediately. You can give several FIT statements after a single MODEL statement; for example, to try out different explanatory variables.

The parameter of the FIT directive specifies the explanatory variables in the model. In simple regression, it consists of the identifier of a single explanatory variate. If you omit the parameter, Genstat fits a null model; that is, a model consisting of just one parameter, the overall mean. In multiple regression the parameter consists of a list of explanatory variates, and factors may also appear to include the main effects of qualitative explanatory variables.
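As a minimal sketch of these three forms, the statements below reuse the apple data from the Example section. The variate `Age` and its values are hypothetical, added only so that a second explanatory variate is available.

```genstat
" Cropsize and Wormy are the data from the Example section;
  Age is a hypothetical extra variate with made-up values."
VARIATE [VALUES= 8, 6,11,22,14,17,18,24,19,23,26,40] Cropsize
&       [VALUES=59,58,56,53,50,45,43,42,39,38,30,27] Wormy
&       [VALUES= 5, 7, 6, 9, 8,10,12,11, 9,13,14,15] Age

MODEL Wormy
FIT                  " null model: the overall mean only"
FIT Cropsize         " simple linear regression"
FIT Cropsize, Age    " multiple regression on two variates"
```

All three FIT statements follow the same MODEL statement, so each fit replaces the previous one as the current model.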

More generally, the parameter may be in the form of a model formula, including interactions between explanatory variables and functions of explanatory variables. The interaction between two or more variates is interpreted as another variate formed from the product of the constituent variates. The interaction between factors is interpreted as in the TREATMENTSTRUCTURE directive; and in general the expansion of model formulae is controlled by the FACTORIAL option in the same way as in the ANOVA directive. The interaction between a variate and a factor represents differential responses for the variate at each level of the factor, and similarly if several variates or factors are involved. A formula may also include POL, REG and COMPARISON functions of variates or factors, representing polynomial contrasts (up to order 4), orthogonalized regression or polynomial contrasts (up to order 8) and non-orthogonalized regression contrasts (up to order 8) respectively. Variates may also appear in SSPLINE functions, representing cubic smoothing spline effects with specified numbers of degrees of freedom or specified smoothing parameters. Similarly, variates may appear in LOESS functions, representing smoothed effects from locally weighted regressions. Multi-dimensional smoothing can be achieved by supplying a pointer containing up to four variates as the first argument of LOESS. Models including such terms are called additive or generalized additive models (Hastie & Tibshirani 1990). Smoothed variates may also appear in interactions, where they represent the same effects as if the variate did not appear in the SSPLINE function; the model then fits a common smooth effect in addition to the usual linear effects for each combination of factor levels.
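The following sketch shows how some of these formula elements might be written. The factor `Side` is hypothetical, and the second arguments of `POL` and `SSPLINE` (polynomial order and degrees of freedom) are illustrative choices rather than recommendations.

```genstat
" Cropsize and Wormy are the data from the Example section;
  Side is a hypothetical two-level factor with made-up values."
VARIATE [VALUES= 8, 6,11,22,14,17,18,24,19,23,26,40] Cropsize
&       [VALUES=59,58,56,53,50,45,43,42,39,38,30,27] Wormy
FACTOR  [LEVELS=2; VALUES=1,1,1,1,1,1,2,2,2,2,2,2] Side

MODEL Wormy
FIT Cropsize*Side          " Cropsize + Side + Cropsize.Side: a separate slope for each level"
FIT POL(Cropsize; 2)       " linear and quadratic polynomial contrasts"
FIT SSPLINE(Cropsize; 3)   " cubic smoothing spline with 3 degrees of freedom (additive model)"
```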

The CALCULATION option allows you to specify one or more expressions to be evaluated before carrying out the linear or generalized linear fit. This is only done if an RCYCLE statement has been given to list nonlinear parameters. The expressions can then make use of the current values of the nonlinear parameters to derive components of the fitted model. At each stage of the nonlinear search for the best estimates of these parameters, the linear or generalized linear model is fitted after evaluating the expressions with the current values of these parameters. Models of this kind are referred to as generalized nonlinear models (Lane 1996).
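A rough sketch of a generalized nonlinear model of this kind is given below. It assumes an exponential form a + b × R**Cropsize, which is chosen purely for illustration; the identifiers `R`, `Z` and `Calc` are hypothetical, and the `PARAMETER` and `INITIAL` settings of RCYCLE should be checked against the RCYCLE documentation before use.

```genstat
" Cropsize and Wormy are the data from the Example section.
  R, Z and Calc are hypothetical names used for illustration."
VARIATE [VALUES= 8, 6,11,22,14,17,18,24,19,23,26,40] Cropsize
&       [VALUES=59,58,56,53,50,45,43,42,39,38,30,27] Wormy
VARIATE [NVALUES=12] Z
SCALAR  R

" Expression evaluated at each step of the search, using the current value of R."
EXPRESSION Calc; VALUE=!E(Z = R**Cropsize)

MODEL  Wormy
RCYCLE PARAMETER=R; INITIAL=0.9     " declare R as a nonlinear parameter (assumed settings)"
FIT    [CALCULATION=Calc] Z         " linear in Z, nonlinear in R"
```

Here the constant and the coefficient of `Z` are linear parameters, re-estimated by the linear fit for each trial value of `R`.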

The PRINT option controls output. You can give several settings at the same time, to provide reports on several aspects of the analysis. The model setting gives a description of the model, including response and explanatory variates.

The output from the summary setting is a summary analysis of variance, or analysis of deviance in generalized linear models. The summary includes F-probabilities if option FPROBABILITY=yes, but the interpretation of these probabilities depends on the usual assumptions of regression analysis, and they are only approximate in generalized linear models. Following the analysis of variance, further information is presented about the fit of the model, the contents of which are controlled by the SELECTION option. By default, for models with the Normal distribution, this consists of the percentage variance accounted for and the standard error of the observations. The percentage variance accounted for is the adjusted R² statistic, expressed as a percentage: 100 × (1 – (Residual m.s.)/(Total m.s.)). The standard error of the observations is estimated by the square root of the residual mean square. For the gamma distribution, the default is to display the coefficient of variation instead, while for other distributions the default is to display the dispersion. The setting aic presents the Akaike information criterion, and the settings bic and sic are synonyms that present the Schwarz (Bayesian) information criterion (see Koehler & Murphree 1988 for a comparison); the values calculated by Genstat omit some constant terms that depend on the data rather than the model, so it is the differences between values for different models that should be of interest rather than the absolute values. There may also be messages in the output, produced as a result of several checks made by Genstat on the adequacy of the model. Extreme residuals and leverage values are reported, and simple checks are made on constancy of variance and systematic departure from the fitted model. You can prevent these messages appearing by using the NOMESSAGE option. They will not appear in any case if you have set option RMETHOD=* in the MODEL statement.
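For instance, a summary with F-probabilities and a chosen set of fit statistics could be requested as sketched below. The particular SELECTION settings are just one possible choice from the list given under Options.

```genstat
" Cropsize and Wormy are the data from the Example section."
VARIATE [VALUES= 8, 6,11,22,14,17,18,24,19,23,26,40] Cropsize
&       [VALUES=59,58,56,53,50,45,43,42,39,38,30,27] Wormy

MODEL Wormy
" Summary analysis of variance with F-probability, followed by the
  percentage variance accounted for, the s.e. of observations and the AIC."
FIT [PRINT=summary; FPROBABILITY=yes; \
     SELECTION=%variance,seobservations,aic] Cropsize
```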

The estimates setting produces the estimates of parameters in the model. The standard errors of the estimates are based by default on the residual mean square. Alternatively, you can supply an estimate of variance by using the DISPERSION option of MODEL; if you do this, Genstat will print a reminder about the basis of the standard errors. You can prevent this reminder appearing by setting the NOMESSAGE option. T-statistics are also displayed, allowing you to test whether each parameter differs significantly from zero, keeping the other parameters fixed; these probabilities too depend on the usual assumptions of regression analysis. The number of degrees of freedom for such a test appears in the column heading. If the estimate of variance is supplied, then the “t-statistics” actually have a standard Normal distribution, indicated by the column heading “t(*)”. If the TPROBABILITY option is set, the corresponding probabilities are displayed. You can also display confidence intervals for the parameters by including the confidence setting. The probability value for the intervals is set by the PROBABILITY option; default 0.95.
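A short sketch of these settings used together is shown below; the 99% level is an arbitrary illustration of changing the default.

```genstat
" Cropsize and Wormy are the data from the Example section."
VARIATE [VALUES= 8, 6,11,22,14,17,18,24,19,23,26,40] Cropsize
&       [VALUES=59,58,56,53,50,45,43,42,39,38,30,27] Wormy

MODEL Wormy
" Estimates with t-probabilities, plus 99% confidence intervals
  in place of the default 95% intervals."
FIT [PRINT=estimates,confidence; TPROBABILITY=yes; PROBABILITY=0.99] Cropsize
```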

The variance inflation factor is calculated for each parameter, and a message is generated if any is greater than 100, to warn that some explanatory terms are nearly aliased and that the standard errors of their parameters are consequently inflated. The parameters involved in the relationship are listed with the inflation factors. The variance inflation factor is defined to be the current diagonal value of the inverse matrix (XᵀX)⁻¹ corresponding to the parameter, multiplied by the corrected sum of squares of the variate or dummy variate corresponding to the parameter. This can be interpreted as the ratio of the variance of the parameter estimate in the current model compared with that of the estimate in a model containing just that parameter and the constant. The check will not be made if the current model contains any POL submodels, or any term involving interaction between a variate and a factor, because the dummy variates generated to represent these effects are very likely to be nearly aliased with each other. The check is also omitted if the constant term is excluded from the model. When a generalized linear model is fitted with a log or logit link function, the antilogs of the parameters are also displayed, to summarize their multiplicative effects on the natural or odds scale respectively.
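Written out, the definition of the variance inflation factor given above takes the following form. The symbols are notation introduced here for illustration: X is the design matrix of the current model, x_ij is the value of the j-th variate or dummy variate for unit i, and σ² is the residual variance.

```latex
\mathrm{VIF}_j
  = \left[(X^{\mathsf{T}}X)^{-1}\right]_{jj}\,
    \sum_{i=1}^{n}\left(x_{ij}-\bar{x}_j\right)^{2}
  = \frac{\sigma^{2}\left[(X^{\mathsf{T}}X)^{-1}\right]_{jj}}
         {\sigma^{2}\big/\sum_{i=1}^{n}\left(x_{ij}-\bar{x}_j\right)^{2}}
```

The numerator of the final ratio is the variance of the estimate in the current model, and the denominator is its variance in a model containing only that parameter and the constant, which is the interpretation described above.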

For a linear model with Normally distributed response, the accumulated setting displays an analysis attributing the variance of the explanatory terms in the order in which they are given in the parameter of FIT; no subdivision is available for generalized linear or nonlinear models unless terms are explicitly added or dropped one at a time using further directives such as ADD, DROP or SWITCH. The subdivision is also not made if the POOL option is set to yes. The denominator of the ratios in the analysis can be controlled by setting the DENOMINATOR option. The lines of the accumulated table are usually labelled by the names of the model terms that have been added or dropped. When POOL=yes, however, this may become rather too long or complicated, so you can then use the AOVDESCRIPTION option to supply your own description. If you supply a null text (containing just a single, empty line), the line is omitted from the table.
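The two statements below sketch the difference between the subdivided and pooled forms of the accumulated analysis. The variate `Age` and the description text are hypothetical.

```genstat
" Cropsize and Wormy are the data from the Example section;
  Age is a hypothetical extra variate with made-up values."
VARIATE [VALUES= 8, 6,11,22,14,17,18,24,19,23,26,40] Cropsize
&       [VALUES=59,58,56,53,50,45,43,42,39,38,30,27] Wormy
&       [VALUES= 5, 7, 6, 9, 8,10,12,11, 9,13,14,15] Age

MODEL Wormy
" One line per term, in the order given, with ratios based on the
  model with the smallest residual mean square."
FIT [PRINT=accumulated; FPROBABILITY=yes; DENOMINATOR=ms] Cropsize + Age

" A single pooled line for both terms, with a user-supplied label."
FIT [PRINT=accumulated; POOL=yes; AOVDESCRIPTION='Crop size and age'] \
    Cropsize + Age
```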

The deviance setting produces an abbreviated summary of the analysis. The correlations setting gives a correlation matrix of the parameter estimates. The fittedvalues setting displays a table of unit labels, values of response variate, fitted values, standardized residuals and leverages. The monitoring setting reports the progress of any iterative search, as used in generalized linear, additive and nonlinear models. Finally, the grid setting is relevant only for generalized nonlinear models when the NGRIDLINES option is set, as in FITNONLINEAR.

The CONSTANT option controls whether the constant parameter is included in the model. In simple linear regression, this parameter is the intercept, in other words the estimate of the response variable when the explanatory variable is zero. In models containing factors, the constant will be the parameter corresponding to the reference level of the factor or factors, and the estimates printed for other levels will be differences between the parameter for those levels and that for the reference level (for more details, see the Guide to the Genstat Command Language, Part 2, Section 3.3.2). Consequently, the constant should then not be omitted unless the FULL option of TERMS has been set to ensure that the model contains a parameter for every level of the factor. If you set CONSTANT=omit for a model containing factors without setting FULL=yes in TERMS, Genstat gives a failure diagnostic. The diagnostic can be suppressed by setting CONSTANT=ignore instead, but this should be done only in special circumstances (as, for example, inside the procedure HGANALYSE which fits hierarchical generalized linear models).
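As a sketch of the combination described above, the statements below omit the constant from a model containing a factor, after using TERMS with FULL=yes so that every level has its own parameter. The factor `Side` is hypothetical.

```genstat
" Wormy is from the Example section; Side is a hypothetical
  two-level factor with made-up values."
VARIATE [VALUES=59,58,56,53,50,45,43,42,39,38,30,27] Wormy
FACTOR  [LEVELS=2; VALUES=1,1,1,1,1,1,2,2,2,2,2,2] Side

MODEL Wormy
TERMS [FULL=yes] Side          " one parameter for every level of Side"
FIT   [CONSTANT=omit] Side     " estimates are now the two level means"
```

Without the TERMS statement, the same FIT statement would produce the failure diagnostic mentioned above.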

The NOMESSAGE option controls printing of messages. The aliasing setting suppresses messages about aliasing of parameters, and the marginality setting suppresses reports of violation of marginality principles when fitting interactions between explanatory variables. The leverage setting prevents messages about large leverages, and residual prevents messages about large residuals or non-constant variance or systematic pattern in the residuals. The inflation setting suppresses messages about the variance inflation factor, and the dispersion setting prevents reminders appearing about the basis of the standard errors (as can be produced by the estimates setting of the PRINT option).
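For example, the leverage and residual messages could be suppressed for a fit where the unusual units are already known, as sketched below.

```genstat
" Cropsize and Wormy are the data from the Example section."
VARIATE [VALUES= 8, 6,11,22,14,17,18,24,19,23,26,40] Cropsize
&       [VALUES=59,58,56,53,50,45,43,42,39,38,30,27] Wormy

MODEL Wormy
" Suppress messages about large leverages and about large residuals,
  non-constant variance or systematic pattern in the residuals."
FIT [NOMESSAGE=leverage,residual] Cropsize
```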

The OWN, INOWN and OUTOWN options are as in the FITNONLINEAR directive, and allow the model calculations for a generalized nonlinear model to be specified in a lower-level language, such as Fortran. The NGRIDLINES and SELINEAR options are also relevant to these models only, and provide a grid of function values and standard errors of linear parameters, respectively, as in FITNONLINEAR.

After fitting a regression using FIT, the model can be modified using the ADD, DROP, STEP, SWITCH and TRY directives, further output can be displayed using the RDISPLAY directive, and results can be copied into Genstat data structures using the RKEEP directive. The fit can be assessed graphically using the procedures RGRAPH and RCHECK.
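A typical sequence might look like the sketch below. The variate `Age` and the identifiers `Est`, `Se` and `Fit` are hypothetical, and the RKEEP parameter names shown are assumed from common usage, so check them against the RKEEP documentation.

```genstat
" Cropsize and Wormy are the data from the Example section;
  Age is a hypothetical extra variate with made-up values."
VARIATE [VALUES= 8, 6,11,22,14,17,18,24,19,23,26,40] Cropsize
&       [VALUES=59,58,56,53,50,45,43,42,39,38,30,27] Wormy
&       [VALUES= 5, 7, 6, 9, 8,10,12,11, 9,13,14,15] Age

MODEL Wormy
FIT      Cropsize
ADD      [PRINT=accumulated] Age     " add Age to the current model"
DROP     [PRINT=estimates] Age       " remove it again"
RDISPLAY [PRINT=correlations]        " further output from the current model"

" Save results in data structures (parameter names assumed)."
RKEEP    ESTIMATES=Est; SE=Se; FITTEDVALUES=Fit
PRINT    Est,Se
```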

Options: PRINT, CALCULATION, OWN, CONSTANT, FACTORIAL, POOL, DENOMINATOR, NOMESSAGE, FPROBABILITY, TPROBABILITY, SELECTION, PROBABILITY, NGRIDLINES, SELINEAR, INOWN, OUTOWN, AOVDESCRIPTION.

Parameter: unnamed.

Action with `RESTRICT`

You can restrict the units that Genstat will use for the regression by putting a restriction on any of the vectors involved in the MODEL statement (response variates, weight variate, offset variate, grouping factor or variate of binomial totals), or on any explanatory variate or factor. However, you are not allowed to have different restrictions on the different vectors.

References

Hastie, T.J. & Tibshirani, R.J. (1990). Generalized Additive Models. Chapman and Hall, London.

Koehler, A.B. & Murphree, E.S. (1988). A comparison of the Akaike and Schwarz criteria for selecting model order. Applied Statistics, 37, 187-195.

Lane, P.W. (1996). Generalized nonlinear models. COMPSTAT 1996 Proceedings in Computational Statistics (ed. Prat, A.), 331-336.

Example

" Example FIT-1: Simple linear regression

  Modelling the relationship between counts of apples from 12 trees
  (recorded as 100s of fruit) and percentage damage by codling moth.
  (Snedecor & Cochran, Statistical analysis, 1980, p162.)"

VARIATE [VALUES= 8, 6,11,22,14,17,18,24,19,23,26,40] Cropsize
&       [VALUES=59,58,56,53,50,45,43,42,39,38,30,27] Wormy
DGRAPH  Wormy; Cropsize

" It is expected that the larger the crop is the less the damage will be,
  since the density of the flying moths is unrelated to the crop size.
  Try fitting a linear model relating the percentage of damage directly
  to the size of the crop."
MODEL Wormy
FIT Cropsize

" Tree number 4 seems different from the rest: perhaps it was not
  adequately protected by the standard spraying programme, or was on the
  side from which the codling moths flew in to the orchard.
  Tree number 12 has a much larger crop than the rest: the results of the
  regression are strongly influenced by this one observation.
  Display all the fitted values, residuals and leverages (influence)."
RDISPLAY [PRINT=fittedvalues]

" Check the effect of omitting tree number 4."
RESTRICT Wormy; .NOT.EXPAND(4; 12)
FIT [PRINT=summary] Cropsize

" Return to the complete dataset, and display the fitted line."
RESTRICT Wormy
FIT [PRINT=*] Cropsize
RGRAPH [GRAPHICS=high]

" Plot the fitted values against the residuals, to check that the
  variance is roughly constant; use the procedure RCHECK from the
  Genstat Procedure Library."
RCHECK [GRAPHICS=high] residual; fittedvalues

Updated on February 7, 2023
