Selects terms to include in or exclude from a linear, generalized linear or generalized additive model according to the ratio of residual mean squares.

### Options

| Option | Description |
|---|---|
| `PRINT` = string tokens | What to print (`model`, `deviance`, `summary`, `estimates`, `correlations`, `fittedvalues`, `accumulated`, `monitoring`, `changes`, `confidence`); default `mode,summ,esti,chan` |
| `FACTORIAL` = scalar | Limit for expansion of model terms; default `*` i.e. that in previous `TERMS` statement |
| `POOL` = string token | Whether to pool ss in accumulated summary between all terms fitted in a linear model (`yes`, `no`); default `no` |
| `DENOMINATOR` = string token | Whether to base ratios in accumulated summary on rms from model with smallest residual ss or smallest residual ms (`ss`, `ms`); default `ss` |
| `NOMESSAGE` = string tokens | Which warning messages to suppress (`dispersion`, `leverage`, `residual`, `aliasing`, `marginality`, `vertical`, `df`, `inflation`); default `*` |
| `FPROBABILITY` = string token | Printing of probabilities for variance and deviance ratios (`yes`, `no`); default `no` |
| `TPROBABILITY` = string token | Printing of probabilities for t-statistics (`yes`, `no`); default `no` |
| `SELECTION` = string tokens | Statistics to be displayed in the summary of analysis produced by `PRINT=summary`, `seobservations` is relevant only for a Normally distributed response, and `%cv` only for a gamma-distributed response (`%variance`, `%ss`, `adjustedr2`, `r2`, `seobservations`, `dispersion`, `%cv`, `%meandeviance`, `%deviance`, `aic`, `bic`, `sic`); default `%var`, `seob` if `DIST=normal`, `%cv` if `DIST=gamma`, and `disp` for other distributions |
| `INRATIO` = scalar | Criterion for inclusion of terms; default 1.0 |
| `OUTRATIO` = scalar | Criterion for exclusion of terms; default 1.0 |
| `MAXCYCLE` = scalar | Limit on number of times to repeat stepwise selection, unless no change is made; default 1 |
| `PROBABILITY` = scalar | Probability level for confidence intervals for parameter estimates; default 0.95 |

### Parameter

| Parameter | Description |
|---|---|
| formula | List of explanatory variates and factors, or model formula |

### Description

`STEP` modifies the current regression model, which may be linear, generalized linear or generalized additive, in order to achieve the biggest “improvement”. Terms in the specified formula are dropped from the current model if they are already there, or are added to it if they are not. For each term, the residual sum of squares (or deviance) and the residual degrees of freedom are recorded; then Genstat reverts to the original model before trying the next term.

The current model is finally modified by the best term, according to a criterion based on the variance (or deviance) ratios. In a linear model, suppose that the residual sum of squares and residual degrees of freedom of the current model are $s_0$ and $d_0$, and of the model after making a one-term change are $s_1$ and $d_1$. If the variance ratio for any term that is dropped is less than the setting of the `OUTRATIO` option, then the term that most reduces or least increases the residual mean square is dropped. That is, when the dispersion is being estimated, a term will be dropped only if at least one term has

$$\frac{(s_1 - s_0)/(d_1 - d_0)}{s_0/d_0} < \texttt{OUTRATIO}$$

When the dispersion is fixed, the equation becomes

$$\frac{s_1 - s_0}{d_1 - d_0} < \texttt{OUTRATIO}$$

If you have set `OUTRATIO=*`, then no term is dropped. Note that, though the criteria are ratios of variances, you should not interpret them as F-statistics with the usual interpretation of significance. The probability levels would need to be adjusted to take account of correlations between the explanatory variables concerned, and the number of changes being considered.
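The dropping criterion can be checked by hand with a few lines of arithmetic. The following Python sketch is only an illustration of the two inequalities above, not Genstat code; the function name `drop_ratio` and the numbers are assumptions made up for the example.

```python
def drop_ratio(s0, d0, s1, d1, dispersion_fixed=False):
    """Ratio compared with OUTRATIO when one term is dropped (illustrative only).

    s0, d0: residual SS (or deviance) and residual d.f. of the current model.
    s1, d1: the same quantities after dropping the term (so d1 > d0).
    """
    change_ms = (s1 - s0) / (d1 - d0)   # mean square attributable to the term
    if dispersion_fixed:
        return change_ms                # fixed dispersion: no scaling
    return change_ms / (s0 / d0)        # scale by the current residual mean square


# Hypothetical figures: current model has s0 = 48.0 on d0 = 8 d.f.;
# dropping a single-parameter term gives s1 = 57.0 on d1 = 9 d.f.
ratio = drop_ratio(48.0, 8, 57.0, 9)
print(ratio)         # (9.0 / 1) / (48.0 / 8) = 1.5
print(ratio < 4.0)   # True: the term is a candidate for dropping if OUTRATIO=4
```

With these made-up values the ratio is 1.5, so the term would be eligible for dropping under `OUTRATIO=4` but not under the default `OUTRATIO=1.0`.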

If no term satisfies the criterion for dropping, then the term that most reduces the residual mean square will be added to the model if its variance ratio is greater than the setting of the `INRATIO` option. That is, when the dispersion is being estimated, the term is added if

$$\frac{(s_0 - s_1)/(d_0 - d_1)}{s_1/d_1} > \texttt{INRATIO}$$

When the dispersion is fixed, the equation becomes

$$\frac{s_0 - s_1}{d_0 - d_1} > \texttt{INRATIO}$$

Likewise, if you have set `INRATIO=*`, no term will be added.
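The inclusion criterion mirrors the exclusion one, except that the denominator is the residual mean square of the model after the term has been added. A minimal Python sketch of that arithmetic follows; it is not Genstat code, and the name `add_ratio` and the figures are hypothetical.

```python
def add_ratio(s0, d0, s1, d1, dispersion_fixed=False):
    """Ratio compared with INRATIO when one term is added (illustrative only).

    s0, d0: residual SS (or deviance) and residual d.f. of the current model.
    s1, d1: the same quantities after adding the term (so d1 < d0).
    """
    change_ms = (s0 - s1) / (d0 - d1)   # mean square explained by the new term
    if dispersion_fixed:
        return change_ms                # fixed dispersion: no scaling
    return change_ms / (s1 / d1)        # scale by the new residual mean square


# Hypothetical figures: current model has s0 = 120.0 on d0 = 11 d.f.;
# adding a single-parameter term gives s1 = 50.0 on d1 = 10 d.f.
ratio = add_ratio(120.0, 11, 50.0, 10)
print(ratio)         # (70.0 / 1) / (50.0 / 10) = 14.0
print(ratio > 4.0)   # True: the term qualifies for inclusion if INRATIO=4
```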

If neither criterion is met, the current model is left unchanged.
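Putting the two criteria together, one change of the search can be sketched as below. This is a reading of the description above for the case where the dispersion is estimated, not Genstat's actual implementation; `step_once`, the `candidates` layout and all numbers are assumptions for illustration, and `None` stands in for the `*` setting.

```python
def step_once(s0, d0, candidates, inratio=1.0, outratio=1.0):
    """Decide one stepwise change (illustrative sketch, estimated dispersion).

    candidates maps a term name to (in_model, s1, d1), where s1 and d1 are the
    residual SS and d.f. after dropping the term (if in_model) or adding it.
    Returns ("drop", term), ("add", term) or ("none", None).
    """
    drops = {t: (s1, d1) for t, (inm, s1, d1) in candidates.items() if inm}
    adds = {t: (s1, d1) for t, (inm, s1, d1) in candidates.items() if not inm}

    def rms(item):
        # Residual mean square of the model after the one-term change.
        s1, d1 = item[1]
        return s1 / d1

    # Dropping is considered first: at least one ratio must fall below OUTRATIO.
    if outratio is not None and drops:
        ratios = [((s1 - s0) / (d1 - d0)) / (s0 / d0) for s1, d1 in drops.values()]
        if min(ratios) < outratio:
            term, _ = min(drops.items(), key=rms)
            return ("drop", term)

    # Otherwise the best addition is made if its ratio exceeds INRATIO.
    if inratio is not None and adds:
        term, (s1, d1) = min(adds.items(), key=rms)
        if ((s0 - s1) / (d0 - d1)) / (s1 / d1) > inratio:
            return ("add", term)

    return ("none", None)


# Hypothetical current model containing X1 and X3, with s0 = 48.0 on d0 = 8 d.f.
candidates = {
    "X1": (True, 200.0, 9),
    "X3": (True, 57.0, 9),
    "X2": (False, 40.0, 7),
    "X4": (False, 46.0, 7),
}
print(step_once(48.0, 8, candidates, inratio=4.0, outratio=4.0))   # ('drop', 'X3')
print(step_once(48.0, 8, candidates, inratio=4.0, outratio=None))  # ('none', None)
```

In the first call X3 has a drop ratio of 1.5, below 4, so it is dropped. In the second call dropping is disabled and the best addition (X2, ratio about 1.4) does not exceed 4, so the model is left unchanged.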

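Usually, the effect of the `STEP` directive is to make one change of a stepwise regression search. You can make `STEP` repeat the search by setting the `MAXCYCLE` option to define a maximum number of changes; `STEP` will stop at this limit, or earlier if no further changes can be made. Because no term is dropped when `OUTRATIO=*` and no term is added when `INRATIO=*`, these settings restrict the search to forward selection and backward elimination respectively.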
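The stopping rule for repeated changes can be pictured as a simple bounded loop. The Python sketch below is illustrative only and is not how Genstat is implemented; `repeat_steps` and the `one_step` callable are hypothetical names, with `one_step` standing in for a single stepwise change that reports whether the model was altered.

```python
def repeat_steps(one_step, maxcycle=1):
    """Call one_step() at most maxcycle times; stop early if nothing changes.

    one_step is a hypothetical callable that returns True when it changed the
    model and False when neither criterion was met.
    Returns the number of changes actually made.
    """
    changes = 0
    for _ in range(maxcycle):
        if not one_step():
            break          # no further change is possible, so stop early
        changes += 1
    return changes


# Toy stand-in for a search that can make three changes and then stalls.
outcomes = iter([True, True, True, False])
print(repeat_steps(lambda: next(outcomes), maxcycle=10))   # 3
```

With `maxcycle=10` the loop stops after the third change because the fourth attempt reports no change; with `maxcycle=2` it would stop at the limit instead.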

The `changes` setting of the `PRINT` option produces a list of terms with the corresponding residual mean squares (or deviances) and residual degrees of freedom, ordered according to the sizes of the residual mean squares; this list is not available for display later by the `RDISPLAY` directive. The `INRATIO` and `OUTRATIO` options are explained above. The rest of the options are as in the `FIT` directive, except that there is no `CONSTANT` option.

Options: `PRINT`, `FACTORIAL`, `POOL`, `DENOMINATOR`, `NOMESSAGE`, `FPROBABILITY`, `TPROBABILITY`, `SELECTION`, `INRATIO`, `OUTRATIO`, `MAXCYCLE`, `PROBABILITY`.

Parameter: unnamed.

### Action with `RESTRICT`

If a `TERMS` statement was given before fitting the model, any restrictions on the variates or factors in the model will have been implemented then. So any restrictions on vectors involved in the model specified by `STEP` will be ignored. If no `TERMS` statement has been given and `STEP` involves new terms not already in the model, restrictions on the variates or factors in these terms will be taken into account and may cause the units involved in the regression to be redefined.

### See also

Directives: `MODEL`, `TERMS`, `FIT`, `ADD`, `DROP`, `SWITCH`, `TRY`.

Commands for: Regression analysis.

### Example

" Examples 2:3.2, 2:3.2.1-6, 2:3.2.7a-b, 2:3.2.8 " " Multiple linear regression of the heat given out by setting cement on four chemical constituents. Data from Woods, Steinour & Starke (1932); analysed by Draper & Smith (1981) p.629." OPEN '%GENDIR%/Examples/GuidePart2/Cement.dat'; CHANNEL=2 READ [PRINT=data; CHANNEL=2] X[3,1,4,2],%gypsum,Heat " Analyse only those samples with 3.2% gypsum." RESTRICT Heat; %gypsum==3.2 MODEL Heat " Constituents are: X[1] tricalcium aluminate X[2] tricalcium silicate X[3] tetracalcium aluminoferrite X[4] beta-dicalcium silicate " FIT [FPROBABILITY=yes; TPROBABILITY=yes] X[] RDISPLAY [PRINT=accumulated; FPROBABILITY=yes] RKESTIMATES X[]; ESTIMATES=Est[1...4]; SE=se[1...4] PRINT Est[1],se[1],Est[2],se[2],Est[3],se[3],Est[4],se[4]; FIELD=10,8 TERMS [PRINT=correlation] X[] ADD [PRINT=deviance,estimates; TPROBABILITY=yes] X[1,2,4] DROP [PRINT=deviance,estimates; TPROBABILITY=yes] X[4] SWITCH [PRINT=estimates,accumulated; FPROBABILITY=yes;\ TPROBABILITY=yes] X[2,4] TRY X[2,3] FIT [FPROBABILITY=yes; TPROBABILITY=yes] X[] RWALD FIT [PRINT=*] X[1] STEP [INRATIO=4; OUTRATIO=4; FPROBABILITY=yes; TPROBABILITY=yes] X[1...4] TERMS X[] STEP [PRINT=changes; INRATIO=4; OUTRATIO=4; MAXCYCLE=10] X[] RDISPLAY [FPROBABILITY=yes; TPROBABILITY=yes] RSEARCH [METHOD=allpossible] X[1...4] CLOSE 2