
SDISCRIMINATE procedure

Selects the best set of variates to discriminate between groups (D.B. Baird, L.H. Schmitt & J.W. McNicol).

Options

PRINT = string tokens Printed output from the analysis (summary, steps, validation, specificity, discrimination, monitoring); default summ, vali, spec, disc
PLOT = string tokens What plots to produce (errorrate, steps, specificity, discriminant); default erro, steps, spec, disc
DDISCRIMINANT = string tokens What to display on the discriminant plot (means, mlabels, scores, polygons, confidencecircle); default means, mlabels, scores, conf
METHOD = string token The variable selection method to use (forward, backward); default forw
NSELECT = scalar Number of variates to select; default 4
CRITERION = string token Criterion to use to select variables (wilkslambda, crossvalidation, bootstrap, jackknife); default wilk
MODELCHOICE = string token Which model to save (optimal, nselect); default opti
VALIDATIONMETHOD = string token Validation method to use to calculate error rates (bootstrap, crossvalidation, jackknife, prediction); default cros
NSIMULATIONS = variate Number of bootstraps or cross-validation sets to use for selection and for validation; default !(10,50)
NCROSSVALIDATIONGROUPS = scalar Number of groups for cross-validation; default 10
SEED = scalar Seed for random number generation; default 0
YROOT = scalars Specifies roots for plotting on y-axes
XROOT = scalars Specifies roots for plotting on x-axes

Parameters

DATA = pointers Each pointer contains a set of variates that are available to be selected
GROUPS = factors Define groupings for the units in each training set
FORCED = pointers Variates that must be included in the model
SELECTED = pointers Saves the variates in the final model
STEPS = pointers Saves the criterion values for each step in the model selection
ERRORRATE = scalars Saves the validation error rate for the final model
SPECIFICITY = matrices Saves the specificity table for the final model
ALLOCATION = factors Saves the groups allocated by the final model
LRV = LRVs Saves the LRVs from the final discriminant analysis
SCORES = matrices or pointers Saves discriminant scores for units from the final model

Description

SDISCRIMINATE uses forward selection or backwards elimination to search for the best set of variates to discriminate between groups. The variates that are available for the discrimination must be specified, in a pointer, by the DATA parameter. The membership of the groups must be specified, in a factor, by the GROUPS parameter. If there are some variates that must always be included in the model, these can be specified, in a pointer, by the FORCED parameter.
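As a minimal sketch of how the three input parameters fit together, the statements below force one variate into every model and let the procedure select from the rest. The identifiers Forced and Candidates, and the choice of price as the forced variate, are hypothetical; the data file and variate names are those used in the Example section. This page does not say whether FORCED variates may also appear in DATA, so the sketch keeps the two sets separate.

```genstat
"Load the data set used in the Example section of this page"
SPLOAD        FILE='%gendir%/examples/Automobile.gsh'
"Hypothetical choice: price must always be in the model"
POINTER       [VALUES=price] Forced
"Candidate variates for selection (price left out, as it is forced)"
POINTER       [VALUES=wheel_base,length,width,height,curb_weight,\
              engine_size,horsepower,city_mpg,highway_mpg] Candidates
SDISCRIMINATE [NSELECT=4; SEED=925081] DATA=Candidates; GROUPS=symboling;\
              FORCED=Forced
```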

Printed output is controlled by the option PRINT, with settings:

    summary summary of the model fitting,
    steps criterion values evaluated at each step of the model fitting,
    validation error rates at each model step,
    specificity specificity of allocation (i.e. the proportion of each group that is assigned correctly),
    discrimination the standard discriminant analysis output for the final model, and
    monitoring criterion values for each model tried.

The default is PRINT=summ,vali,spec,disc.

The PLOT option controls what plots are displayed, with settings:

    errorrate error rate at each selection step,
    steps criterion values at each step of the model fitting,
    specificity specificity at each selection step, and
    discriminant the standard discriminant plot from the final model.

By default these are all plotted. The DDISCRIMINANT option allows group means, labels for group means, unit scores, group polygons enclosing units, and 95% confidence circles around group means to be included on the discriminant plot. The YROOT and XROOT options specify the roots for the axes.
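The output options described above might be combined as in the following sketch, which prints only the summary and the step-by-step criterion values, and draws only the discriminant plot with group means, their labels and group polygons. It assumes that the pointer Xvars and the factor symboling have been set up as in the Example section; the particular combination of settings is illustrative.

```genstat
"Assumes Xvars and symboling are defined as in the Example section"
SDISCRIMINATE [PRINT=summary,steps; PLOT=discriminant;\
              DDISCRIMINANT=means,mlabels,polygons; NSELECT=6; SEED=925081]\
              DATA=Xvars; GROUPS=symboling
```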

The selection method is defined by the METHOD option. The forward setting starts with the FORCED model and then, at each step, looks to see which of the DATA variates not already in the model gives the best improvement in the criterion; this is the default. The backward setting starts with the full model, and looks to see which variate in the model (other than those in FORCED) gives the least reduction in the criterion when eliminated at that step.
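For comparison with the default forward selection, a backward elimination run could be requested as sketched below. It assumes that Xvars and symboling are defined as in the Example section; the NSELECT and SEED values are simply carried over from that example.

```genstat
"Assumes Xvars and symboling are defined as in the Example section"
SDISCRIMINATE [METHOD=backward; NSELECT=6; SEED=925081]\
              DATA=Xvars; GROUPS=symboling
```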

The criterion for evaluating the model is defined by the CRITERION option, with settings:

    wilkslambda uses the ratio of the determinant of the within-group sums of squares and products matrix to the determinant of the total sums of squares and products matrix (default),
    crossvalidation uses the cross-validation error rate,
    bootstrap uses the bootstrap error rate, and
    jackknife uses jackknifing.

Cross validation, bootstrapping and jackknifing take much longer than the use of Wilks’ lambda.
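The wilkslambda criterion described above can be written compactly as follows. The symbols W, B and T are notation introduced here, not taken from the procedure: W is the within-group sums of squares and products matrix, B the between-group matrix, and T the total matrix for the variates in the model being evaluated.

```latex
\Lambda = \frac{\lvert W \rvert}{\lvert T \rvert}, \qquad T = W + B
```

Values of Λ lie between 0 and 1, and smaller values indicate better separation of the groups.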

The number of variates in the final model (excluding those in the FORCED model) is set by the NSELECT option. The MODELCHOICE option indicates how to choose the final model. The default setting optimal takes the model from the step with the minimum validation error. Alternatively, the nselect setting takes the model with the number of variates specified by the NSELECT option.
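To keep a model of a fixed size instead of the one with the minimum validation error, the two options can be set together, as in this sketch. It assumes that Xvars and symboling are defined as in the Example section; the value NSELECT=3 is an arbitrary illustration.

```genstat
"Assumes Xvars and symboling are defined as in the Example section"
SDISCRIMINATE [NSELECT=3; MODELCHOICE=nselect; SEED=925081]\
              DATA=Xvars; GROUPS=symboling
```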

The VALIDATIONMETHOD option specifies the validation method, with settings prediction, crossvalidation, jackknife and bootstrap. Cross-validation works by randomly splitting the units into the number of groups specified by the NCROSSVALIDATIONGROUPS option (default 10). It then omits each of the groups, in turn, and predicts how the omitted units are allocated to the discrimination groups. Jackknifing leaves the units out one at a time, and uses the rest of the data to predict the group of the omitted unit. The bootstrap method works by drawing a bootstrap sample of units (a random sample, drawn with replacement, of the same size as the original sample), and predicting the units that are not present in that sample. The resulting bootstrap error rate is then calculated as a weighted average of the error rate of the omitted observations and the predictive error rate of the bootstrap sample. The weights used are 0.632 and 0.368 respectively, and so this is known as the 632 rule.
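The weighted average used for the bootstrap error rate can be written as below. The symbols are notation introduced here for illustration: the first term is the error rate for the units omitted from the bootstrap sample, and the second is the predictive error rate of the bootstrap sample itself.

```latex
\widehat{\mathrm{Err}}_{632}
  = 0.632 \, \mathrm{Err}_{\text{omitted}}
  + 0.368 \, \mathrm{Err}_{\text{sample}}
```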

The NSIMULATIONS option sets the number of simulations for cross-validation or bootstrapping. It should be set to a variate with two values: the first value defines the number of simulations to use during selection (default 10), and the second sets the number to use in the estimation of the error rates (default 50).

The SEED option provides the seed for the random numbers used for the randomizations during the simulations. The default value of 0 continues an existing sequence of random numbers; if none have been used in the current Genstat job, it initializes the seed automatically using the computer clock.
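The simulation options above might be combined as in the following sketch, which selects variates by cross-validation with 5 groups and 5 simulations, and then estimates the error rates from 100 bootstrap samples. It assumes that Xvars and symboling are defined as in the Example section; the numbers of groups and simulations are arbitrary illustrations, and the non-zero seed is given so that a rerun should reproduce the same randomizations.

```genstat
"Assumes Xvars and symboling are defined as in the Example section"
SDISCRIMINATE [CRITERION=crossvalidation; VALIDATIONMETHOD=bootstrap;\
              NCROSSVALIDATIONGROUPS=5; NSIMULATIONS=!(5,100); SEED=925081]\
              DATA=Xvars; GROUPS=symboling
```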

The SELECTED parameter can save the contents of the chosen model, in a pointer. The STEPS parameter can save a pointer with a variate for each step of the selection, containing the criterion evaluated for each DATA variate at that step. The variates contain a missing value if the DATA variate had already been included in or excluded from the model. The ERRORRATE parameter can save a variate with the minimum value of the validation error rate after each step. The SPECIFICITY parameter can save a matrix containing the specificity table for the final model. The LRV parameter can save the latent roots, vectors and trace from the final discriminant analysis, and the ALLOCATION and SCORES parameters can save the assigned groups and discriminant scores.
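A sketch of saving the results for later use is given below. The identifiers Chosen, Steps, Errate, Spec, Alloc, Lrv and Scores are hypothetical names chosen for illustration, and Xvars and symboling are assumed to be defined as in the Example section. Setting PRINT=* and PLOT=* follows the usual Genstat convention for requesting no output; that these settings suppress all output from this procedure is an assumption.

```genstat
"Assumes Xvars and symboling are defined as in the Example section"
SDISCRIMINATE [PRINT=*; PLOT=*; NSELECT=6; SEED=925081] DATA=Xvars;\
              GROUPS=symboling; SELECTED=Chosen; STEPS=Steps;\
              ERRORRATE=Errate; SPECIFICITY=Spec; ALLOCATION=Alloc;\
              LRV=Lrv; SCORES=Scores
"Display two of the saved structures"
PRINT         Errate
PRINT         Spec
```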

Options: PRINT, PLOT, DDISCRIMINANT, METHOD, NSELECT, CRITERION, MODELCHOICE, VALIDATIONMETHOD, NSIMULATIONS, NCROSSVALIDATIONGROUPS, SEED, YROOT, XROOT.

Parameters: DATA, GROUPS, FORCED, SELECTED, STEPS, ERRORRATE, SPECIFICITY, ALLOCATION, LRV, SCORES.

Method

The procedure steps through the models using FSSPM to calculate Wilks’ Lambda, and subsidiary procedures _SDISCROSSVALIDATE and _SDISBOOTSTRAP to calculate the other selection criteria. DISCRIMINATE is called to provide the output for the final model.

Action with RESTRICT

The input variates and factor may be restricted (but any restrictions must be identical). Units excluded by the restriction are omitted from the analysis.

See also

Directive: CVA.

Procedures: CVAPLOT, DBIPLOT, DISCRIMINATE, QDISCRIMINATE.

Commands for: Multivariate and cluster analysis.

Example

CAPTION       'SDISCRIMINATE example'; STYLE=meta
SPLOAD        FILE='%gendir%/examples/Automobile.gsh'
POINTER       [VALUES=normalized_losses,wheel_base,length,width,height,\
              curb_weight,engine_size,bore,stroke,compression_ratio,\
              horsepower,peak_rpm,city_mpg,highway_mpg,price] Xvars
SDISCRIMINATE [NSELECT=6; SEED=925081] DATA=Xvars; GROUPS=symboling
Updated on March 5, 2019
