SDISCRIMINATE procedure

Selects the best set of variates to discriminate between groups (D.B. Baird, L.H. Schmitt & J.W. McNicol).

Options

`PRINT` = string tokens	Printed output from the analysis (`summary`, `steps`, `validation`, `specificity`, `discrimination`, `monitoring`); default `summ`, `vali`, `spec`, `disc`
`PLOT` = string tokens	What plots to produce (`errorrate`, `steps`, `specificity`, `discriminant`); default `erro`, `steps`, `spec`, `disc`
`DDISCRIMINANT` = string tokens	What to display on the discriminant plot (`means`, `mlabels`, `scores`, `polygons`, `confidencecircle`); default `means`, `mlabels`, `scores`, `conf`
`METHOD` = string token	The variable selection method to use (`forward`, `backward`); default `forw`
`NSELECT` = scalar	Number of variates to select; default 4
`CRITERION` = string token	Criterion to use to select variables (`wilkslambda`, `crossvalidation`, `bootstrap`, `jackknife`); default `wilk`
`MODELCHOICE` = string token	Which model to save (`optimal`, `nselect`); default `opti`
`VALIDATIONMETHOD` = string token	Validation method to use to calculate error rates (`bootstrap`, `crossvalidation`, `jackknife`, `prediction`); default `cros`
`NSIMULATIONS` = variate	Number of bootstraps or cross-validation sets to use for selection and for validation; default `!(10,50)`
`NCROSSVALIDATIONGROUPS` = scalar	Number of groups for cross-validation; default 10
`SEED` = scalar	Seed for random number generation; default 0
`YROOT` = scalars	Specifies roots for plotting on y-axes
`XROOT` = scalars	Specifies roots for plotting on x-axes

Parameters

`DATA` = pointers	Each pointer contains a set of variates that are available to be selected
`GROUPS` = factors	Define groupings for the units in each training set
`FORCED` = pointers	Variates that must be included in the model
`SELECTED` = pointers	Saves the variates in the final model
`STEPS` = pointers	Saves the criterion values for each step in the model selection
`ERRORRATE` = scalars	Saves the validation error rate for the final model
`SPECIFICITY` = matrices	Saves the specificity table for the final model
`ALLOCATION` = factors	Saves the groups allocated by the final model
`LRV` = LRVs	Saves the LRVs from the final discriminant analysis
`SCORES` = matrices or pointers	Saves discriminant scores for units from the final model

Description

SDISCRIMINATE uses forward selection or backwards elimination to search for the best set of variates to discriminate between groups. The variates that are available for the discrimination must be specified, in a pointer, by the DATA parameter. The membership of the groups must be specified, in a factor, by the GROUPS parameter. If there are some variates that must always be included in the model, these can be specified, in a pointer, by the FORCED parameter.

Printed output is controlled by the option PRINT, with settings:

`summary`	summary of the model fitting,
`steps`	criterion values evaluated at each step of the model fitting,
`validation`	error rates at each model step,
`specificity`	specificity of allocation (i.e. the proportion of each group that is assigned correctly),
`discrimination`	the standard discriminant analysis output for the final model, and
`monitoring`	criterion values for each model tried.

The default is PRINT=summ,vali,spec,disc.

The PLOT option controls what plots are displayed, with settings:

`errorrate`	error rate at each selection step,
`steps`	criterion values at each step of the model fitting,
`specificity`	specificity at each selection step, and
`discriminant`	the standard discriminant plot from the final model.

By default these are all plotted. The DDISCRIMINANT option allows group means, labels for group means, unit scores, group polygons enclosing units, and 95% confidence circles around group means to be included on the discriminant plot. The YROOT and XROOT options specify the roots for the axes.

The selection method is defined by the METHOD option. The forward setting (the default) starts with the FORCED model and then, at each step, looks to see which of the DATA variates not already in the model gives the best improvement in the criterion. The backward setting starts with the full model and, at each step, looks to see which variate in the model (other than those in FORCED) gives the least reduction in the criterion when eliminated.
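The forward search can be sketched outside Genstat as a greedy loop. The Python below is only an illustration of the idea, not the procedure's own code: the function `forward_select`, the toy criterion and the variate names are all hypothetical, and the sketch assumes that smaller criterion values are better (as with Wilks' lambda or an error rate).

```python
from typing import Callable, List, Sequence, Tuple


def forward_select(
    candidates: Sequence[str],
    forced: Sequence[str],
    criterion: Callable[[List[str]], float],
    nselect: int,
) -> List[Tuple[str, float]]:
    """Greedy forward selection; smaller criterion values count as better."""
    model = list(forced)  # start from the forced variates
    remaining = [v for v in candidates if v not in model]
    steps = []
    for _ in range(min(nselect, len(remaining))):
        # Evaluate the criterion for each variate not yet in the model.
        scored = [(criterion(model + [v]), v) for v in remaining]
        best_value, best_var = min(scored)
        model.append(best_var)
        remaining.remove(best_var)
        steps.append((best_var, best_value))
    return steps


# Toy criterion (illustrative only): every variate multiplies the criterion
# by a fixed factor, so adding a useful variate makes the value smaller.
gain = {"a": 0.9, "b": 0.5, "c": 0.7, "d": 0.95}


def toy_criterion(model: List[str]) -> float:
    value = 1.0
    for v in model:
        value *= gain[v]
    return value


steps = forward_select(["a", "b", "c", "d"], forced=["a"],
                       criterion=toy_criterion, nselect=2)
print([name for name, _ in steps])
```

Backward elimination would run the same loop in reverse: start from all the variates and, at each step, drop the non-forced variate whose removal harms the criterion least.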

The criterion for evaluating the model is defined by the CRITERION option, with settings:

`wilkslambda`	uses the ratio of the determinant of the within-group sums of squares and products matrix to the determinant of the total sums of squares and products matrix (default),
`crossvalidation`	uses the cross-validation error rate,
`bootstrap`	uses the bootstrap error rate, and
`jackknife`	uses jackknifing.

Cross validation, bootstrapping and jackknifing take much longer than the use of Wilks’ lambda.
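In symbols, the Wilks' lambda criterion described above can be written as follows. The notation is introduced here for illustration and is not taken from the procedure: W denotes the within-group sums of squares and products matrix, B the between-group matrix, and T the total matrix.

```latex
\Lambda = \frac{\det(\mathbf{W})}{\det(\mathbf{T})},
\qquad \mathbf{T} = \mathbf{W} + \mathbf{B}
```

The ratio lies between 0 and 1. Values near 0 indicate that most of the variation is between groups (good discrimination), while values near 1 indicate little separation.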

The number of variates in the final model (excluding those in the FORCED model) is set by the NSELECT option. The MODELCHOICE option indicates how to choose the final model. The default setting, optimal, takes the model from the step with the minimum validation error. Alternatively, the nselect setting takes the model with the number of variates specified by the NSELECT option.
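The difference between the two MODELCHOICE settings can be illustrated with a small Python sketch. It assumes forward selection with one variate added per step, and it assumes that ties are resolved in favour of the earliest step, which the documentation does not state. The function `choose_step` and the error rates are hypothetical.

```python
def choose_step(error_rates, modelchoice="optimal", nselect=4):
    """Return the 1-based selection step whose model would be kept."""
    if modelchoice == "optimal":
        # Step with the minimum validation error (earliest step on ties).
        best = min(range(len(error_rates)), key=error_rates.__getitem__)
        return best + 1
    if modelchoice == "nselect":
        # Step at which the requested number of variates has been selected.
        return min(nselect, len(error_rates))
    raise ValueError("modelchoice must be 'optimal' or 'nselect'")


# Hypothetical validation error rates after steps 1 to 4.
rates = [0.42, 0.31, 0.27, 0.29]
print(choose_step(rates, "optimal"))
print(choose_step(rates, "nselect", nselect=4))
```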

The VALIDATIONMETHOD option specifies the validation method, with settings for prediction, cross-validation, jackknife and bootstrap. Cross-validation works by randomly splitting the units into a number of groups specified by the NCROSSVALIDATIONGROUPS option (default 10). It then omits each of the groups, in turn, and predicts how the omitted units are allocated to the discrimination groups. Jackknifing leaves the units out one at a time, and uses the rest of the data to predict the group of the omitted unit. The bootstrap method works by drawing a bootstrap sample of units (a random sample, drawn with replacement, of the same size as the original sample), and predicting the units that are not present in the random sample. The resulting bootstrap error rate is then calculated as a weighted average of the error rate of the omitted observations and the predictive error rate of the bootstrap sample. The weights used are 0.632 and 0.368 respectively, and so this is known as the 632 rule.
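The resampling schemes can be sketched in Python as below. This is a minimal illustration of the description above, not the procedure's implementation: the function names are hypothetical, the way units are dealt into cross-validation groups is an assumption, and the two error rates passed to `error_632` are made-up inputs.

```python
import random


def crossvalidation_groups(n_units, n_groups, rng):
    """Randomly split unit indices into n_groups groups of near-equal size."""
    units = list(range(n_units))
    rng.shuffle(units)
    return [units[g::n_groups] for g in range(n_groups)]


def bootstrap_split(n_units, rng):
    """Draw n_units indices with replacement; also return the omitted units."""
    sample = [rng.randrange(n_units) for _ in range(n_units)]
    in_sample = set(sample)
    omitted = [i for i in range(n_units) if i not in in_sample]
    return sample, omitted


def error_632(omitted_error, sample_error):
    """Weighted average of the two error rates (weights 0.632 and 0.368)."""
    return 0.632 * omitted_error + 0.368 * sample_error


rng = random.Random(925081)  # fixed seed so the splits are reproducible
groups = crossvalidation_groups(25, 10, rng)
sample, omitted = bootstrap_split(25, rng)
print(len(groups), len(sample), len(omitted))
print(error_632(0.30, 0.10))
```

In each case the model is fitted to the units that are kept and the error rate is measured on the units that were left out. Jackknifing corresponds to the special case in which every group contains a single unit.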

The NSIMULATIONS option sets the number of simulations for cross-validation or bootstrapping. It should be set to a variate with two values: the first value defines the number of simulations to use during selection (default 10), and the second sets the number to use in the estimation of the error rates (default 50).

The SEED option provides the seed for the random numbers used for the randomizations during the simulations. The default value of 0 continues an existing sequence of random numbers or, if none have been used in the current Genstat job, initializes the seed automatically using the computer clock.

The SELECTED parameter can save the contents of the chosen model, in a pointer. The STEPS parameter can save a pointer with a variate for each step of the selection, containing the criterion evaluated for each DATA variate at that step. The variates contain a missing value if the DATA variate had already been included in or excluded from the model. The ERRORRATE parameter can save a variate with the minimum value of the validation error rate after each step. The SPECIFICITY parameter can save a matrix containing the specificity table for the final model. The LRV parameter can save the latent roots, vectors and trace from the final discriminant analysis, and the ALLOCATION and SCORES parameters can save the assigned groups and discriminant scores.

Options: PRINT, PLOT, DDISCRIMINANT, METHOD, NSELECT, CRITERION, MODELCHOICE, VALIDATIONMETHOD, NSIMULATIONS, NCROSSVALIDATIONGROUPS, SEED, YROOT, XROOT.

Parameters: DATA, GROUPS, FORCED, SELECTED, STEPS, ERRORRATE, SPECIFICITY, ALLOCATION, LRV, SCORES.

Method

The procedure steps through the models using FSSPM to calculate Wilks’ Lambda, and subsidiary procedures _SDISCROSSVALIDATE and _SDISBOOTSTRAP to calculate the other selection criteria. DISCRIMINATE is called to provide the output for the final model.

Action with `RESTRICT`

The input variates and factor may be restricted (but any restrictions must be identical). The restricted units are omitted from the analysis.

Example

CAPTION       'SDISCRIMINATE example'; STYLE=meta
SPLOAD        FILE='%gendir%/examples/Automobile.gsh'
POINTER       [VALUES=normalized_losses,wheel_base,length,width,height,\
              curb_weight,engine_size,bore,stroke,compression_ratio,\
              horsepower,peak_rpm,city_mpg,highway_mpg,price] Xvars
SDISCRIMINATE [NSELECT=6; SEED=925081] DATA=Xvars; GROUPS=symboling

Updated on March 5, 2019
