Selects the best set of variates to discriminate between groups (D.B. Baird, L.H. Schmitt & J.W. McNicol).
|Printed output from the analysis (
||What plots to produce (
||What to display on the discriminant plot (
||The variable selection method to use (
||Number of variates to select; default 4|
||Criterion to use to select variables (
||Which model to save (
||Validation method to use to calculate error rates (
||Number of bootstraps or cross-validation sets to use for selection and for validation; default
||Number of groups for cross-validation, default 10|
||Seed for random number generation; default 0|
||Specifies roots for plotting on y-axes|
||Specifies roots for plotting on x-axes|
||Each pointer contains a set of variates that are available to be selected|
||Define groupings for the units in each training set|
||Variates that must be included in the model|
||Saves the variates in the final model|
||Saves the criterion values for each step in the model selection|
||Saves the validation error rate for the final model|
||Saves the specificity table for the final model|
||Saves the groups allocated by the final model|
||Saves the LRVs from the final discriminant analysis|
||Saves discriminant scores for units from the final model|
SDISCRIMINATE uses forward selection or backwards elimination to search for the best set of variates to discriminate between groups. The variates that are available for the discrimination must be specified, in a pointer, by the
DATA parameter. The membership of the groups must be specified, in a factor, by the
GROUPS parameter. If there are some variates that must always be included in the model, these can be specified, in a pointer, by the
Printed output is controlled by the option
||summary of the model fitting,|
||criterion values evaluated at each step of the model fitting,|
||error rates at each model step,|
||specificity of allocation (i.e. the proportion of each group that is assigned correctly),|
||the standard discriminant analysis output for the final model, and|
||criterion values for each model tried.|
The default is
The PLOT option controls what plots are displayed, with settings:
||error rate at each selection step,|
||criterion values at each step of the model fitting,|
||specificity at each selection step, and|
||the standard discriminant plot from the final model.|
By default these are all plotted. The
DDISCRIMINANT option allows group means, labels for group means, unit scores, group polygons enclosing units, and 95% confidence circles around group means to be included on the discriminant plot. The
XROOT options specify the roots for the axes.
The selection method is defined by the
METHOD option. The
forward setting starts with the
FORCED model and then, at each step, looks to see which of the
DATA variates not already in the model gives the best improvement; this is the default. The
backward setting starts with the full model, and looks to see which variate in the model (other than those in
FORCED) gives the least reduction in the criterion when eliminated at that step.
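The two search strategies can be sketched in a few lines of Python. This is only an illustration of the stepping logic, not the procedure's own code: the function names, the `criterion` callback and the convention that a lower criterion value is better are all assumptions made for the example.

```python
from typing import Callable, List, Sequence

Criterion = Callable[[List[str]], float]  # hypothetical: lower value = better model


def forward_select(candidates: Sequence[str], forced: Sequence[str],
                   criterion: Criterion, nselect: int) -> List[List[str]]:
    """Start from the forced variates and add the best candidate at each step."""
    model = list(forced)
    remaining = [c for c in candidates if c not in model]
    steps = []
    for _ in range(min(nselect, len(remaining))):
        # Try each remaining variate in turn; keep the one giving the best criterion.
        best = min(remaining, key=lambda v: criterion(model + [v]))
        model.append(best)
        remaining.remove(best)
        steps.append(list(model))
    return steps


def backward_eliminate(candidates: Sequence[str], forced: Sequence[str],
                       criterion: Criterion, nselect: int) -> List[List[str]]:
    """Start from the full model and drop the least useful non-forced variate."""
    removable = [c for c in candidates if c not in forced]
    model = list(forced) + removable
    steps = []
    while len(removable) > nselect:
        # Drop the variate whose removal leaves the best criterion value.
        drop = min(removable,
                   key=lambda v: criterion([m for m in model if m != v]))
        model.remove(drop)
        removable.remove(drop)
        steps.append(list(model))
    return steps


# Toy criterion (made up): each variate lowers the criterion by a fixed amount.
gain = {"a": 0.5, "b": 0.1, "c": 0.3, "d": 0.05}


def toy_criterion(model: List[str]) -> float:
    return 1.0 - sum(gain[v] for v in model)


forward_steps = forward_select(["a", "b", "c"], ["d"], toy_criterion, 2)
backward_steps = backward_eliminate(["a", "b", "c"], ["d"], toy_criterion, 2)
# forward_steps  == [["d", "a"], ["d", "a", "c"]]
# backward_steps == [["d", "a", "c"]]
```

In both functions the forced variates are never candidates for addition or removal, which mirrors the role of the FORCED model described above.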
The criterion for evaluating the model is defined by the
CRITERION option, with settings:
||uses the ratio of the determinant of the within-group sums of squares and products to the determinant of the total sums of squares and products (default),|
||uses the cross-validation error rate,|
||uses the bootstrap error rate, and|
Cross-validation, bootstrapping and jackknifing take much longer than the use of Wilks’ lambda.
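The default criterion can be written in symbols. The notation is introduced here for illustration only: W is the within-group sums of squares and products matrix and T is the total sums of squares and products matrix, both computed from the variates in the model being evaluated.

```latex
\Lambda \;=\; \frac{\lvert \mathbf{W} \rvert}{\lvert \mathbf{T} \rvert},
\qquad 0 \le \Lambda \le 1 .
```

Smaller values of \Lambda mean that less of the total variation lies within groups, that is, the groups are better separated by the chosen variates.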
The number of variates in the final model (excluding those in the
FORCED model) is set by the
NSELECT option. The
MODELCHOICE option indicates how to choose the final model. The default setting
optimal takes the model from the step with the minimum validation error. Alternatively, the
nselect setting takes the model with the number of variates specified by the
VALIDATIONMETHOD option specifies the validation method, with settings for prediction, cross-validation, jackknife and bootstrap. Cross-validation works by randomly splitting the units into a number of groups specified by the
NCROSSVALIDATIONGROUPS option (default 10). It then omits each of the groups, in turn, and predicts how the omitted units are allocated to the discrimination groups. Jackknifing leaves the units out one at a time, and uses the rest of the data to predict the group of the omitted unit. The bootstrap method works by drawing a bootstrap sample of units (a random sample of units, drawn with replacement, of the same size as the original sample), and predicting the units that are not present in the random sample. The resulting bootstrap error rate is then calculated as a weighted average of the error rate of the omitted observations and the predictive error rate of the bootstrap sample. The weights used are 0.632 and 0.368 respectively, and so this is known as the 632 rule.
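The resampling schemes described above can be sketched as follows. This is a minimal Python illustration, not the procedure's implementation: the function names are hypothetical, and the discriminant analysis that would turn each split into an error rate is deliberately left out.

```python
import random
from typing import List, Tuple


def crossvalidation_groups(n_units: int, n_groups: int,
                           rng: random.Random) -> List[List[int]]:
    """Randomly split unit indices into n_groups folds of near-equal size."""
    order = list(range(n_units))
    rng.shuffle(order)
    # Taking every n_groups-th index gives folds whose sizes differ by at most 1.
    return [order[i::n_groups] for i in range(n_groups)]


def bootstrap_split(n_units: int,
                    rng: random.Random) -> Tuple[List[int], List[int]]:
    """Draw n_units indices with replacement; also return the units left out."""
    sample = [rng.randrange(n_units) for _ in range(n_units)]
    chosen = set(sample)
    omitted = [i for i in range(n_units) if i not in chosen]
    return sample, omitted


def error_632(omitted_error: float, sample_error: float) -> float:
    """Combine the two error rates with the weights 0.632 and 0.368."""
    return 0.632 * omitted_error + 0.368 * sample_error


rng = random.Random(925081)            # fixed seed so the splits are repeatable
folds = crossvalidation_groups(50, 10, rng)
sample, omitted = bootstrap_split(50, rng)
combined = error_632(0.30, 0.10)       # 0.632*0.30 + 0.368*0.10 = 0.2264
```

Jackknifing corresponds to the special case where every fold contains exactly one unit, i.e. `crossvalidation_groups(n, n, rng)`.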
The NSIMULATIONS option sets the number of simulations for cross-validation or bootstrapping. It should be set to a variate with two values: the first value defines the number of simulations to use during selection (default 10), and the second sets the number to use in the estimation of the error rates (default 50).
The SEED option provides the seed for the random numbers used for the randomizations in the simulations. The default value of 0 continues an existing sequence of random numbers or, if none have been used in the current Genstat job, initializes the seed automatically using the computer clock.
The SELECTED parameter can save the contents of the chosen model, in a pointer. The
STEPS parameter can save a pointer with a variate for each step of the selection, containing the criterion evaluated for each
DATA variate at that step. The variates contain a missing value if the
DATA variate had already been included in or excluded from the model. The
ERRORRATE parameter can save a variate with the minimum value of the validation error rate after each step. The
SPECIFICITY parameter can save a matrix containing the specificity table for the final model. The
LRV parameter can save the latent roots, vectors and trace from the final discriminant analysis, and the
SCORES parameters can save the assigned groups and discriminant scores.
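Specificity is described above as the proportion of each group that is assigned correctly. A minimal Python sketch of that calculation, given the true groups and the allocated groups, is shown below. The function name is hypothetical, and the exact layout of the saved specificity table is not described here, so only the per-group proportion is computed.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def specificity_by_group(true_groups: Sequence[Hashable],
                         allocated_groups: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Proportion of the units in each true group that were allocated to it."""
    totals = Counter(true_groups)
    correct = Counter(t for t, a in zip(true_groups, allocated_groups) if t == a)
    # A Counter returns 0 for groups with no correct allocations.
    return {g: correct[g] / totals[g] for g in totals}


true_groups = ["a", "a", "a", "b", "b"]
allocated = ["a", "a", "b", "b", "b"]
spec = specificity_by_group(true_groups, allocated)
# Group "a": 2 of 3 units correct; group "b": 2 of 2 units correct.
```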
The procedure steps through the models using
FSSPM to calculate Wilks’ Lambda, and subsidiary procedures
_SDISBOOTSTRAP to calculate the other selection criteria.
DISCRIMINATE is called to provide the output for the final model.
The input variates and factor may be restricted (but any restrictions must be identical). The restricted units are omitted from the analysis.
Commands for: Multivariate and cluster analysis.
CAPTION 'SDISCRIMINATE example'; STYLE=meta
SPLOAD FILE='%gendir%/examples/Automobile.gsh'
POINTER [VALUES=normalized_losses,wheel_base,length,width,height,\
         curb_weight,engine_size,bore,stroke,compression_ratio,\
         horsepower,peak_rpm,city_mpg,highway_mpg,price] Xvars
SDISCRIMINATE [NSELECT=6; SEED=925081] DATA=Xvars; GROUPS=symboling