Helps search through models for a regression or generalized linear model (P.W. Goedhart).
|Printed output required (
||Model selection method to employ (
||Model formula to include in every model; default
||How to treat the constant (
||Limit for expansion of all model terms; default 3|
||Whether to base ratios in accumulated summaries on rms from model with smallest residual ss or smallest residual ms (
||Criterion for inclusion of terms for forward selection, backward elimination and stepwise regression; default 1.0|
||Criterion for exclusion of terms for forward selection, backward elimination and stepwise regression; default 1.0|
||Limit on number of times to repeat stepwise selection methods, unless no change is made; default 50|
||Criterion for selecting best models among all possible models (
||Criterion which is also printed for the selected best models (
||Limit for expansion of
||Penalty for Mallows Cp and Akaike’s information criterion AIC; default 2|
||Limit on the number of terms to be fitted when fitting all possible models; default 16|
||Number of best models printed for each subset size; default 8|
||Pointer to save the final models for forward, backward,
||Pointer to save formulae for all possible regression models containing the fitted terms of all the models; every formula includes the
||Pointer to save variates for all possible regression models containing the parameter estimates|
||Pointer to save variates for all possible regression models containing standard errors of the parameter estimates|
||Pointer to save variates for all possible regression models containing the criteria (
||Pointer to save variates for all possible regression models containing the test statistics. These are F-to-delete statistics (i.e. deviance ratios) when the
||Pointer to save variates for all possible regression models containing the degrees of freedom for the numerator of the test statistics|
||Pointer to save variates for all possible regression models containing the probabilities of the test statistics|
||How to treat terms that are marginal to other terms in the
||Model formula specifying the candidate model terms|
There are various methods for choosing a regression model when there are many candidate model terms, see e.g. Montgomery & Peck (1992) or Miller (1990). The
STEP directive provides forward selection, backward elimination and stepwise regression. However these methods result in only one model and alternative models, with an equivalent or even better fit, are easily overlooked. Especially in observational studies with many non-orthogonal terms there are frequently a number of alternative models, and then selection of just one well-fitting model is unsatisfactory and possibly misleading. A preferable method is to fit all possible regression models, and to evaluate these according to some criterion. In this way a number of best regression models can be selected. However the fitting of all possible regression models is very computer intensive. It should also be used with caution, because models can be selected which appear to have a lot of explanatory power, but contain noise variables only, see e.g. Flack & Chang (1987). This may occur particularly when the number of parameters is large in comparison with the number of units, as illustrated by the example for
RSEARCH. Terms should therefore not be selected on the basis of a statistical analysis alone.
RSEARCH can be used to perform these model selection methods. The call to
RSEARCH must be preceded by a
MODEL statement which defines the response variate and, if required, all other aspects of a (generalized) linear model. Only one response variate is allowed unless the
DISTRIBUTION option of
MODEL is set to
multinomial. The FREE parameter specifies the candidate model terms. These may include variates, factors, interactions and regression functions like
METHOD option controls which model selection methods are employed:
||prints an accumulated analysis of deviance in which all model terms are added one by one to the model in the given order;|
||prints an accumulated analysis of deviance in which terms with the same number of identifiers, e.g. main effects or two-factor interactions, are pooled;|
||prints an accumulated analysis of deviance resulting from forward selection;|
||prints an accumulated analysis of deviance resulting from backward elimination;|
||prints an accumulated analysis of deviance resulting from stepwise regression starting with no candidate terms in the model;|
||prints an accumulated analysis of deviance resulting from stepwise regression starting with all candidate terms in the model;|
||prints summary statistics for a number of best models among all possible models.|
For each model with
METHOD=allpossible, the selection criterion and the degrees of freedom of the included terms are printed. The probability for the hypothesis that an included term can be deleted as the last term is also printed. These probabilities are based on F-to-delete statistics (i.e. deviance ratios) when the
DISPERSION option of the
MODEL directive is set to
*, and Chi-square-to-delete statistics (i.e. deviance differences scaled by the dispersion parameter) for a fixed dispersion parameter.
PPROBABILITY option allows you to reduce the amount of output when
METHOD=allpossible. If this is set, only models where all the probabilities are less than
PPROBABILITY are printed. (By default
PPROBABILITY=1, and so they are all printed.)
It is sometimes desirable to include specific terms in every model. Such terms may be specified by means of the
FORCED option. The
FORCED model terms are always fitted first. The
CONSTANT option controls whether the constant parameter is included in the model. The limit for expanding the
FORCED model formulae can be set with the
FACTORIAL option, which has default value 3. The
The criteria for inclusion and exclusion of terms for forward selection, backward elimination and stepwise regression can be specified by the
OUTRATIO options respectively. The
MAXCYCLE option specifies the number of steps. These operate exactly as in the
STEP directive. The
DENOMINATOR option controls the way in which variance ratios are calculated in accumulated analysis of deviance summaries.
All possible regression models are fitted only when the number of candidate
FREE model terms does not exceed 16. If the
FREE formula specifies a main effects model, i.e. a model without interactions, the main effects are the candidate terms. When the
FREE formula contains interactions, the default is to remove any terms marginal to an interaction from the
FREE formula, and include them instead in the
FORCED formula. However, you can set option
free to retain them in
FREE formula. Note that
RSEARCH considers only models that obey the principle of marginality. This states that a model that includes an interaction term must also include all its marginal terms. For example, a model that includes the interaction
A.B must also include the main effects
AFACTORIAL option can be used to limit the expansion of the
FREE model terms for the fitting of all possible regression models. The expansion is limited in addition to the limitation imposed by the
FACTORIAL option. As an example, the following calls to
RSEARCH result in identical candidate model terms, namely
d, for all possible regression models:
FACTORIAL=3; AFACTORIAL=2] a*b*c + d
FACTORIAL=2; AFACTORIAL=2; FORCED=a+b+c] a*b*c + dHowever, forward selection starts with no terms in the first call and with the model
a+b+c in the second call. Backward elimination starts with the full model including the three factor interaction
a.b.c in the first call, while this term is not fitted in the second call.
CRITERION option controls the selection of the best models among all possible regression models. The criteria employed in
RSEARCH are defined as follows:
||100 × [1 – Dev / Dev0]|
||100 × [1 – (Dev / (n–p)) / (Dev0 / (n–p0))]|
||Dev / f + 2 × p – n|
||Dev × (n+1) × (n-2) / [n × (n–p) × (n–p-1)]|
||Dev / f + 2 × p|
||Dev / f + Ln(n) × p|
||Dev / (n–p)|
|Dev||is the deviance of the current model;|
|Dev0||is the deviance of the null model;|
|p||is the number of fitted parameters of the current model;|
|p0||is the number of fitted parameters of the null model;|
|n||is the number of units;|
|f||is the dispersion parameter.|
The null model is the model with only a constant term, which may include the fitting of a grouping factor for a within groups regression and/or the fitting of cut-points for an ordinal response model.
The dispersion parameter f is specified by the
DISPERSION option of the
MODEL directive or, when
DISPERSION is set to *, is estimated by the mean deviance of the model with all the candidate terms. In ordinary linear regression R², adjusted R² and Mallows Cp are widely used. When R² is used, there is no penalty for adding a term, i.e. R² always improves with the addition of a term. When adjusted R² or Cp is employed, there is a penalty for adding a term. Adjusted R² improves when the F-ratio due to the addition of the term is larger than 1, while Cp improves when the F-ratio is larger than 2. Clearly, Cp is the more conservative criterion and will tend to select models with fewer terms as compared to R² and adjusted R². Minimizing Cp minimizes the mean squared error of prediction in ordinary linear regression in the case where predictions will be made at the same values as are present in the current data set. Models with negligible bias have Cp » p. For predictions at new random values, as is common in observational studies, Ep estimates the mean squared error of prediction; then Ep should be minimized. Thompson (1978) and Miller (1990) discuss Cp and Ep in detail.
Criteria suggested for generalized linear models are the Akaike information criterion (AIC) and the Schwarz (Bayesian) information criterion (SIC, or its synonym BIC). The definition of both criteria used here is different from that in the literature. The deviance is used instead of the maximum value of the log-likelihood, which implies a constant shift for distributions without dispersion parameter. Moreover, in the spirit of generalized linear models, the deviance is scaled by the dispersion parameter. This makes AIC equivalent to Cp. Clearly, SIC is the more conservative criterion, especially when the number of units is large.
Note that the best models have a small Cp, Ep, AIC, SIC, deviance and mean deviance, but a large R² and adjusted R². The default penalty of 2 in the definition of Cp and AIC can be altered by setting the
PENALTY option, in which case Cp and AIC improves when the F-ratio is larger than
EXTRA option specifies an extra criterion which is printed alongside the selection criterion. The default for
adjusted. The default for
DISPERSION is set to
NTERMS option specifies the maximum number of candidate terms in a model. This can be used when only models with few candidate terms are relevant or to reduce the computational burden. For example with 12 candidate terms there are 4096 different models, while there are only 299 models with maximally three terms. Specifying
NTERMS=3 then saves a considerable amount of computing time. The
NBESTMODELS option specifies the number of best models within each subset size for which summary statistics are printed.
FINALMODEL option can be used to save the last models for forward selection, backward elimination and fstepwise and bstepwise regression. Results of the fitting of all possible regression models can be saved by means of the parameters
PROBABILITIES. This saves results from all the fitted models not only from those that are printed. This includes the constant model.
All regression warnings are suppressed. This is to prevent the printing of long lists of similar warnings like “Iterative weights have become 0, or have been held at a limit”. Note that the printed output of all possible regression models is adjusted to the width of the output file.
FORCED formulae are checked using subsidiary procedure
_RSEARCHCHECK, and terms that appear in both are dropped from the
FREE formula. Then the full model is fitted and aliased predictors are dropped from both formulae. Forward selection, backward elimination and stepwise regression are straightforward implemented using the
The fitting of all possible regression models uses a sequence of models in which, within each subset size, every model is fitted by dropping one term from the previous model and adding another term. Test statistics are calculated as though the tested term is the last term to enter the model. When the
DISPERSION option of the
MODEL directive is set to
*, terms are tested by means of F-to-delete statistics, which are deviance ratios. For a fixed dispersion parameter Chi-square-to-delete statistics, i.e. deviance differences scaled by the dispersion parameter, are used to calculate probabilities.
Smoothing splines are not allowed in the
FREE formula for
METHOD=allpossible due to a limitation of the
Factors and variates in the
FORCED formulae should not be restricted. Any restriction applied to vectors used in the
MODEL statement applies also to the results from
Flack, V.F. & Chang, P.C. (1987). Frequency of selecting noise variables in subset regression analysis: a simulation study. The American Statistician, 41, 84-86.
Miller, A.J. (1990). Subset Selection in Regression. Chapman & Hall, London.
Montgomery, D.C. & Peck, E.A. (1992). Introduction to Linear Regression Analysis (second edition). Wiley, New York.
Thompson, M.L. (1978). Selection of variables in multiple regression: Part I. A review and evaluation. International Statistical Review, 46, 1-19.
Commands for: Regression analysis.
CAPTION 'RSEARCH example',\ !t('Analysis of the so-called Hald data: heat evolved in calories',\ 'per gram of cement as a function of the amount of each of four',\ 'ingredients in the mix: tricalcium aluminate, tricalcium',\ 'silicate, tetracalcium alumino ferrite and dicalcium silicate;',\ 'see page 276 of Montgomery & Peck (1992), Introduction to',\ 'Linear Regression Analysis, 2nd Edition, Wiley.'); STYLE=meta,plain VARIATE [NVALUES=13] Ingred1,Ingred2,Ingred3,Ingred4,Heat READ Ingred1,Ingred2,Ingred3,Ingred4,Heat 7 26 6 60 78.5 1 29 15 52 74.3 11 56 8 20 104.3 11 31 8 47 87.6 7 52 6 33 95.9 11 55 9 22 109.2 3 71 17 6 102.7 1 31 22 44 72.5 2 54 18 22 93.1 21 47 4 26 115.9 1 40 23 34 83.8 11 66 9 12 113.3 10 68 8 12 109.4 : MODEL Heat RSEARCH [METHOD=forward,backward,fstepwise,bstepwise,allpossible]\ Ingred1,Ingred2,Ingred3,Ingred4 CAPTION !t('Analysis of the damage caused by waves to forward sections of',\ 'cargo-carrying ships. The data are counts of damage incidents',\ 'for each combination of three risk factors: the type of ship,',\ 'the year of construction and the period of operation; see',\ 'page 204 of McCullagh & Nelder (1989), Generalized Linear',\ 'Models, 2nd Edition, Chapman & Hall.') VARIATE [NVALUES=40] Service,Incident FACTOR [NVALUES=40; LABELS=!t(A,B,C,D,E)] Type & [LABELS=!t('60-64','65-69','70-74','75-79')] Construction & [LABELS=!t('60-74','75-79')] Operation GENERATE Type,Construction,Operation READ Service,Incident 127 0 63 0 1095 3 1095 4 1512 6 3353 18 * * 2244 11 44882 39 17176 29 28609 58 20370 53 7064 12 13099 44 * * 7117 18 1179 1 552 1 781 0 676 1 783 6 1948 2 * * 274 1 251 0 105 0 288 0 192 0 349 2 1208 11 * * 2051 4 45 0 0 0 789 7 437 7 1157 5 2161 12 * * 542 1 : CALCULATE Service = LOG(Service) MODEL [DISTRIBUTION=poisson; OFFSET=Service; DISPERSION=*] Incident RSEARCH [FACTORIAL=1; EXTRA=deviance] Type * Construction * Operation RSEARCH [METHOD=pooled,accumulated,allpossible; FACTORIAL=2;\ EXTRA=deviance] Type * Construction * Operation CAPTION !t('The following example shows that when the number of',\ 'predictors is large compared with the number of units,',\ 'models with noise variables only may appear to have a lot',\ 'of explanatory power.') VARIATE [NVALUES=12] Y,Noise[1...10] CALCULATE Y,Noise = URAND(912439,10(0); 12) MODEL Y RSEARCH Noise