RSEARCH procedure

Helps search through models for a regression or generalized linear model (P.W. Goedhart).

Options

`PRINT` = string token	Printed output required (`model`, `results`); default `mode`, `resu`
`METHOD` = string tokens	Model selection method to employ (`allpossible`, `forward`, `backward`, `fstepwise`, `bstepwise`, `accumulated`, `pooled`); default `allp`
`FORCED` = formula	Model formula to include in every model; default `*`
`CONSTANT` = string token	How to treat the constant (`estimate`, `omit`); default `esti`
`FACTORIAL` = scalar	Limit for expansion of all model terms; default 3
`DENOMINATOR` = string token	Whether to base ratios in accumulated summaries on rms from model with smallest residual ss or smallest residual ms (`ss`, `ms`); default `ss`
`INRATIO` = scalar	Criterion for inclusion of terms for forward selection, backward elimination and stepwise regression; default 1.0
`OUTRATIO` = scalar	Criterion for exclusion of terms for forward selection, backward elimination and stepwise regression; default 1.0
`MAXCYCLE` = scalar	Limit on number of times to repeat stepwise selection methods, unless no change is made; default 50
`CRITERION` = string token	Criterion for selecting best models among all possible models (`r2`, `adjusted`, `cp`, `ep`, `aic`, `bic`, `sic`, `meandeviance`, `deviance`); default `adju`
`EXTRA` = string token	Criterion which is also printed for the selected best models (`r2`, `adjusted`, `cp`, `ep`, `aic`, `bic`, `sic`, `meandeviance`, `deviance`); default `cp` when `DISPERSION=*`, and `mean` otherwise
`AFACTORIAL` = scalar	Limit for expansion of `FREE` model terms for the fitting of all possible models; default 3
`PENALTY` = scalar	Penalty for Mallows Cp and Akaike’s information criterion AIC; default 2
`NTERMS` = scalar	Limit on the number of terms to be fitted when fitting all possible models; default 16
`NBESTMODELS` = scalar	Number of best models printed for each subset size; default 8
`PPROBABILITY` = scalar	When `METHOD=allpossible`, only models with all probabilities less than `PPROBABILITY` are printed; default 1 i.e. all models are printed
`FINALMODELS` = pointer	Pointer to save the final models for forward, backward, `fstepwise` and `bstepwise` regression methods
`ALLMODELS` = pointer	Pointer to save formulae for all possible regression models containing the fitted terms of all the models; every formula includes the `FORCED` formula if set
`ESTIMATES` = pointer	Pointer to save variates for all possible regression models containing the parameter estimates
`SE` = pointer	Pointer to save variates for all possible regression models containing standard errors of the parameter estimates
`RESULTS` = pointer	Pointer to save variates for all possible regression models containing the criteria (`r2`, `adjusted`, `cp`, `ep`, `aic`, `sic` or `bic`, `deviance`, `meandeviance`), degrees of freedom for residual and total number of fitted parameters p
`STATISTICS` = pointer	Pointer to save variates for all possible regression models containing the test statistics. These are F-to-delete statistics (i.e. deviance ratios) when the `DISPERSION` option of the `MODEL` directive is set to `*`, and Chi-square-to-delete statistics (i.e. deviance differences scaled by the dispersion parameter) for a fixed dispersion parameter
`DF` = pointer	Pointer to save variates for all possible regression models containing the degrees of freedom for the numerator of the test statistics
`PROBABILITIES` = pointer	Pointer to save variates for all possible regression models containing the probabilities of the test statistics
`MARGINALTERMS` = string token	How to treat terms that are marginal to other terms in the `FREE` formula (`forced`, `free`); default `forc`

Parameter

`FREE` = formula	Model formula specifying the candidate model terms

Description

There are various methods for choosing a regression model when there are many candidate model terms, see e.g. Montgomery & Peck (1992) or Miller (1990). The STEP directive provides forward selection, backward elimination and stepwise regression. However these methods result in only one model and alternative models, with an equivalent or even better fit, are easily overlooked. Especially in observational studies with many non-orthogonal terms there are frequently a number of alternative models, and then selection of just one well-fitting model is unsatisfactory and possibly misleading. A preferable method is to fit all possible regression models, and to evaluate these according to some criterion. In this way a number of best regression models can be selected. However the fitting of all possible regression models is very computer intensive. It should also be used with caution, because models can be selected which appear to have a lot of explanatory power, but contain noise variables only, see e.g. Flack & Chang (1987). This may occur particularly when the number of parameters is large in comparison with the number of units, as illustrated by the example for RSEARCH. Terms should therefore not be selected on the basis of a statistical analysis alone.

RSEARCH can be used to perform these model selection methods. The call to RSEARCH must be preceded by a MODEL statement which defines the response variate and, if required, all other aspects of a (generalized) linear model. Only one response variate is allowed unless the DISTRIBUTION option of MODEL is set to multinomial. The FREE parameter specifies the candidate model terms. These may include variates, factors, interactions and regression functions like POL and SSPLINE. The METHOD option controls which model selection methods are employed:

`accumulated`	prints an accumulated analysis of deviance in which all model terms are added one by one to the model in the given order;
`pooled`	prints an accumulated analysis of deviance in which terms with the same number of identifiers, e.g. main effects or two-factor interactions, are pooled;
`forward`	prints an accumulated analysis of deviance resulting from forward selection;
`backward`	prints an accumulated analysis of deviance resulting from backward elimination;
`fstepwise`	prints an accumulated analysis of deviance resulting from stepwise regression starting with no candidate terms in the model;
`bstepwise`	prints an accumulated analysis of deviance resulting from stepwise regression starting with all candidate terms in the model;
`allpossible`	prints summary statistics for a number of best models among all possible models.

For each model with METHOD=allpossible, the selection criterion and the degrees of freedom of the included terms are printed. The probability for the hypothesis that an included term can be deleted as the last term is also printed. These probabilities are based on F-to-delete statistics (i.e. deviance ratios) when the DISPERSION option of the MODEL directive is set to *, and Chi-square-to-delete statistics (i.e. deviance differences scaled by the dispersion parameter) for a fixed dispersion parameter.

The PPROBABILITY option allows you to reduce the amount of output when METHOD=allpossible. If this is set, only models where all the probabilities are less than PPROBABILITY are printed. (By default PPROBABILITY=1, and so they are all printed.)

It is sometimes desirable to include specific terms in every model. Such terms may be specified by means of the FORCED option. The FORCED model terms are always fitted first. The CONSTANT option controls whether the constant parameter is included in the model. The limit for expanding the FREE and FORCED model formulae can be set with the FACTORIAL option, which has default value 3. The PRINT option can be used to control the output from RSEARCH.

The criteria for inclusion and exclusion of terms for forward selection, backward elimination and stepwise regression can be specified by the INRATIO and OUTRATIO options respectively. The MAXCYCLE option specifies the number of steps. These operate exactly as in the STEP directive. The DENOMINATOR option controls the way in which variance ratios are calculated in accumulated analysis of deviance summaries.

All possible regression models are fitted only when the number of candidate FREE model terms does not exceed 16. If the FREE formula specifies a main effects model, i.e. a model without interactions, the main effects are the candidate terms. When the FREE formula contains interactions, the default is to remove any terms marginal to an interaction from the FREE formula, and include them instead in the FORCED formula. However, you can set option MARGINALTERMS to free to retain them in FREE formula. Note that RSEARCH considers only models that obey the principle of marginality. This states that a model that includes an interaction term must also include all its marginal terms. For example, a model that includes the interaction A.B must also include the main effects A and B.

The AFACTORIAL option can be used to limit the expansion of the FREE model terms for the fitting of all possible regression models. The expansion is limited in addition to the limitation imposed by the FACTORIAL option. As an example, the following calls to RSEARCH result in identical candidate model terms, namely a.b, a.c, b.c and d, for all possible regression models:

RSEARCH [METHOD=forward,backward,allpossible;\

FACTORIAL=3; AFACTORIAL=2] a*b*c + d

RSEARCH [METHOD=forward,backward,allpossible;\

FACTORIAL=2; AFACTORIAL=2; FORCED=a+b+c] a*b*c + dHowever, forward selection starts with no terms in the first call and with the model a+b+c in the second call. Backward elimination starts with the full model including the three factor interaction a.b.c in the first call, while this term is not fitted in the second call.

The CRITERION option controls the selection of the best models among all possible regression models. The criteria employed in RSEARCH are defined as follows:

`r2`	100 × [1 – Dev / Dev0]
`adjusted`	100 × [1 – (Dev / (n–p)) / (Dev0 / (n–p₀))]
`cp`	Dev / f + 2 × p – n
`ep`	Dev × (n+1) × (n-2) / [n × (n–p) × (n–p-1)]
`aic`	Dev / f + 2 × p
`sic` or `bic` (synonyms)	Dev / f + Ln(n) × p
`deviance`	Dev
`meandeviance`	Dev / (n–p)

where

Dev	is the deviance of the current model;
Dev0	is the deviance of the null model;
p	is the number of fitted parameters of the current model;
p₀	is the number of fitted parameters of the null model;
n	is the number of units;
f	is the dispersion parameter.

The null model is the model with only a constant term, which may include the fitting of a grouping factor for a within groups regression and/or the fitting of cut-points for an ordinal response model.

The dispersion parameter f is specified by the DISPERSION option of the MODEL directive or, when DISPERSION is set to *, is estimated by the mean deviance of the model with all the candidate terms. In ordinary linear regression R², adjusted R² and Mallows Cp are widely used. When R² is used, there is no penalty for adding a term, i.e. R² always improves with the addition of a term. When adjusted R² or Cp is employed, there is a penalty for adding a term. Adjusted R² improves when the F-ratio due to the addition of the term is larger than 1, while Cp improves when the F-ratio is larger than 2. Clearly, Cp is the more conservative criterion and will tend to select models with fewer terms as compared to R² and adjusted R². Minimizing Cp minimizes the mean squared error of prediction in ordinary linear regression in the case where predictions will be made at the same values as are present in the current data set. Models with negligible bias have Cp » p. For predictions at new random values, as is common in observational studies, Ep estimates the mean squared error of prediction; then Ep should be minimized. Thompson (1978) and Miller (1990) discuss Cp and Ep in detail.

Criteria suggested for generalized linear models are the Akaike information criterion (AIC) and the Schwarz (Bayesian) information criterion (SIC, or its synonym BIC). The definition of both criteria used here is different from that in the literature. The deviance is used instead of the maximum value of the log-likelihood, which implies a constant shift for distributions without dispersion parameter. Moreover, in the spirit of generalized linear models, the deviance is scaled by the dispersion parameter. This makes AIC equivalent to Cp. Clearly, SIC is the more conservative criterion, especially when the number of units is large.

Note that the best models have a small Cp, Ep, AIC, SIC, deviance and mean deviance, but a large R² and adjusted R². The default penalty of 2 in the definition of Cp and AIC can be altered by setting the PENALTY option, in which case Cp and AIC improves when the F-ratio is larger than PENALTY. The EXTRA option specifies an extra criterion which is printed alongside the selection criterion. The default for CRITERION is adjusted. The default for EXTRA is cp when DISPERSION is set to *, and meandeviance otherwise.

The NTERMS option specifies the maximum number of candidate terms in a model. This can be used when only models with few candidate terms are relevant or to reduce the computational burden. For example with 12 candidate terms there are 4096 different models, while there are only 299 models with maximally three terms. Specifying NTERMS=3 then saves a considerable amount of computing time. The NBESTMODELS option specifies the number of best models within each subset size for which summary statistics are printed.

The FINALMODEL option can be used to save the last models for forward selection, backward elimination and fstepwise and bstepwise regression. Results of the fitting of all possible regression models can be saved by means of the parameters ALLMODELS, ESTIMATES, SE, RESULTS, STATISTICS, DF and PROBABILITIES. This saves results from all the fitted models not only from those that are printed. This includes the constant model.

All regression warnings are suppressed. This is to prevent the printing of long lists of similar warnings like “Iterative weights have become 0, or have been held at a limit”. Note that the printed output of all possible regression models is adjusted to the width of the output file.

Options: PRINT, METHOD, FORCED, CONSTANT, FACTORIAL, DENOMINATOR, INRATIO, OUTRATIO, MAXCYCLE, CRITERION, EXTRA, AFACTORIAL, PENALTY, NTERMS, NBESTMODELS, PPROBABILITY, FINALMODELS, ALLMODELS, ESTIMATES, SE, RESULTS, STATISTICS, DF, PROBABILITIES, MARGINALTERMS.

Parameters: FREE.

Method

First the FREE and FORCED formulae are checked using subsidiary procedure _RSEARCHCHECK, and terms that appear in both are dropped from the FREE formula. Then the full model is fitted and aliased predictors are dropped from both formulae. Forward selection, backward elimination and stepwise regression are straightforward implemented using the STEP directive.

The fitting of all possible regression models uses a sequence of models in which, within each subset size, every model is fitted by dropping one term from the previous model and adding another term. Test statistics are calculated as though the tested term is the last term to enter the model. When the DISPERSION option of the MODEL directive is set to *, terms are tested by means of F-to-delete statistics, which are deviance ratios. For a fixed dispersion parameter Chi-square-to-delete statistics, i.e. deviance differences scaled by the dispersion parameter, are used to calculate probabilities.

Smoothing splines are not allowed in the FREE formula for METHOD=allpossible due to a limitation of the FCLASSIFICATION directive.

Action with `RESTRICT`

Factors and variates in the FREE and FORCED formulae should not be restricted. Any restriction applied to vectors used in the MODEL statement applies also to the results from RSEARCH.

References

Flack, V.F. & Chang, P.C. (1987). Frequency of selecting noise variables in subset regression analysis: a simulation study. The American Statistician, 41, 84-86.

Miller, A.J. (1990). Subset Selection in Regression. Chapman & Hall, London.

Montgomery, D.C. & Peck, E.A. (1992). Introduction to Linear Regression Analysis (second edition). Wiley, New York.

Thompson, M.L. (1978). Selection of variables in multiple regression: Part I. A review and evaluation. International Statistical Review, 46, 1-19.

Example

CAPTION   'RSEARCH example',\
          !t('Analysis of the so-called Hald data: heat evolved in calories',\
          'per gram of cement as a function of the amount of each of four',\
          'ingredients in the mix: tricalcium aluminate, tricalcium',\
          'silicate, tetracalcium alumino ferrite and dicalcium silicate;',\
          'see page 276 of Montgomery & Peck (1992), Introduction to',\
          'Linear Regression Analysis, 2nd Edition, Wiley.'); 
          STYLE=meta,plain
VARIATE   [NVALUES=13] Ingred1,Ingred2,Ingred3,Ingred4,Heat
READ      Ingred1,Ingred2,Ingred3,Ingred4,Heat
     7  26   6  60   78.5     1  29  15  52   74.3    11  56   8  20  104.3
    11  31   8  47   87.6     7  52   6  33   95.9    11  55   9  22  109.2
     3  71  17   6  102.7     1  31  22  44   72.5     2  54  18  22   93.1
    21  47   4  26  115.9     1  40  23  34   83.8    11  66   9  12  113.3
    10  68   8  12  109.4 :
MODEL     Heat
RSEARCH   [METHOD=forward,backward,fstepwise,bstepwise,allpossible]\ 
          Ingred1,Ingred2,Ingred3,Ingred4
CAPTION   !t('Analysis of the damage caused by waves to forward sections of',\
          'cargo-carrying ships. The data are counts of damage incidents',\
          'for each combination of three risk factors: the type of ship,',\
          'the year of construction and the period of operation; see',\
          'page 204 of McCullagh & Nelder (1989), Generalized Linear',\
          'Models, 2nd Edition, Chapman & Hall.')
VARIATE   [NVALUES=40] Service,Incident
FACTOR    [NVALUES=40; LABELS=!t(A,B,C,D,E)] Type
&         [LABELS=!t('60-64','65-69','70-74','75-79')] Construction
&         [LABELS=!t('60-74','75-79')] Operation
GENERATE  Type,Construction,Operation
READ      Service,Incident
    127  0    63  0  1095  3  1095  4  1512  6  3353 18     *  *  2244 11
  44882 39 17176 29 28609 58 20370 53  7064 12 13099 44     *  *  7117 18
   1179  1   552  1   781  0   676  1   783  6  1948  2     *  *   274  1
    251  0   105  0   288  0   192  0   349  2  1208 11     *  *  2051  4
     45  0     0  0   789  7   437  7  1157  5  2161 12     *  *   542  1 :
CALCULATE Service = LOG(Service)
MODEL     [DISTRIBUTION=poisson; OFFSET=Service; DISPERSION=*] Incident
RSEARCH   [FACTORIAL=1; EXTRA=deviance] Type * Construction * Operation
RSEARCH   [METHOD=pooled,accumulated,allpossible; FACTORIAL=2;\
          EXTRA=deviance] Type * Construction * Operation
CAPTION   !t('The following example shows that when the number of',\
          'predictors is large compared with the number of units,',\
          'models with noise variables only may appear to have a lot',\
          'of explanatory power.')
VARIATE   [NVALUES=12] Y,Noise[1...10]
CALCULATE Y,Noise[] = URAND(912439,10(0); 12)
MODEL     Y
RSEARCH   Noise[]

Updated on June 18, 2019

Was this article helpful?

Yes No