DISCRIMINATE procedure

Performs discriminant analysis (L.H. Schmitt & P.G.N. Digby).

Options

`PRINT` = string tokens	Printed output from the analysis (`counts`, `lrv`, `tests`, `ccorrelations`, `icorrelations`, `correlations`, `adjustments`, `means`, `gdistances`, `scores`, `distances`, `newgroups`, `table`, `validation`); default `coun`
`NROOTS` = scalar	The number of dimensions to be used for printed and saved output, and used in calculating the distances and the allocation of units; default is to use the full dimensionality
`REALLOCATE` = string token	Whether units from the training set are to be reallocated to groups (`no`, `yes`); default `no`
`PLOT` = string tokens	Features for the plots (`means`, `mlabels`, `scores`, `polygons`, `confidencecircle`); default `mean`, `scor`, `poly` (Note: `*` suppresses plotting)
`VALIDATIONMETHOD` = string token	Validation method to use to calculate error rates (`bootstrap`, `crossvalidation`, `jackknife`); default `cros`
`NSIMULATIONS` = variate	Number of bootstraps or cross-validation sets to use for selection and for validation; default `!(10,50)`
`NCROSSVALIDATIONGROUPS` = scalar	Number of groups for cross-validation; default 10
`SEED` = scalar	Seed for random number generation; default 0
`YROOT` = scalars	Specifies roots for plotting on y-axes
`XROOT` = scalars	Specifies roots for plotting on x-axes
`TITLE` = strings	Titles for plots
`WINDOW` = scalars	Windows for plots
`SCREEN` = string tokens	Action before each plot (`keep`, `clear`); default `clea`

Parameters

`DATA` = pointers	Each pointer contains a set of variates to be analysed
`GROUPS` = factors	Define groupings for the units in each training set, or missing values for the units to be allocated
`NEWGROUPS` = factors	Saves allocations (and reallocations)
`ALLOCATION` = factors	Saves allocations to groups including those not present in the training set
`MEANS` = matrices or pointers	Saves scores for group means
`SCORES` = matrices or pointers	Saves scores for units
`DISTANCES` = matrices	Saves unit to group-mean squared distances
`LRV` = LRVs	Saves the LRVs from the canonical variates analyses
`ADJUSTMENTS` = matrices	Saves adjustments to the canonical variates analyses
`GDISTANCES` = symmetric matrices	Saves the distances between groups
`CCORRELATIONS` = matrices	Saves canonical correlation coefficients
`ICORRELATIONS` = symmetric matrices	Saves within-group correlation matrices of the input variates
`CORRELATIONS` = matrices	Saves within-group correlations between the input and canonical variates

Description

DISCRIMINATE performs discriminant analysis (see, for example, Mardia, Kent & Bibby 1979).

The input for the procedure is given by a pointer and a factor, specified by the DATA and GROUPS parameters, respectively. The pointer contains a set of variates defining the attributes of the units. Any unit with a missing value in any of the variates is excluded from the analysis. Units can also be excluded from the analysis by restricting the factor or variates; any such restrictions must be consistent (the rules here are exactly as used by the FSSPM directive). The factor specifies the pre-defined groupings of the units from which the allocation is derived (the “training set”); the units to be allocated by the analysis have missing factor values.
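As a minimal sketch of this input layout, the commands below assume that the Measures pointer from the Example section has already been read. The factor Known and the saved factor Predicted are hypothetical names, and treating the last 10 units of each species as the units to be allocated is purely illustrative.

```genstat
"Hypothetical grouping: 40 training units per species, then 10 missing"
FACTOR   [NVALUES=150; LABELS=!t(Setosa,Versicolor,Virginica);\
         VALUES=40(1),10(*),40(2),10(*),40(3),10(*)] Known
"Units with a missing value of Known are allocated by the analysis"
DISCRIMINATE [PRINT=counts,newgroups,table] Measures; GROUPS=Known;\
         NEWGROUPS=Predicted
```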

Printed output is controlled by the option PRINT with settings:

`counts`	tables of the number of units in each group with a complete set of observations;
`lrv`	canonical variate loadings, latent roots and trace;
`tests`	chi-square tests (as given by `CVA`);
`ccorrelations`	canonical correlation coefficients (see Klecka 1980);
`icorrelations`	within-group correlation matrix of the input variates;
`correlations`	within-group correlations between the input and canonical variates;
`adjustments`	adjustments required to the canonical variate scores;
`means`	canonical variate scores for the group means;
`gdistances`	inter-group distances (as given by `CVA`);
`scores`	canonical variate scores for the units;
`distances`	Mahalanobis squared distances between the units and the group means;
`newgroups`	initial grouping and the allocation of units to groups;
`table`	tables of counts of allocations; and
`validation`	estimated error rates (see the `VALIDATIONMETHOD` option below).

The NROOTS option specifies how many dimensions are printed and retained for the latent roots and vectors, and for the scores of the means and units. The distances of the units from the group means, and thus the allocation of units, are also formed from the scores in the number of dimensions specified by NROOTS. By default, the results are for the full dimensionality, i.e. the smaller of the number of variates and one less than the number of groups.
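For instance, a sketch that limits the results to the first canonical variate might look as follows, assuming Measures and Species are set up as in the Example section. With four variates and three groups the full dimensionality would be 2, so NROOTS=1 drops the second dimension from the output, the distances and the allocation.

```genstat
"Use only the first dimension; PLOT=* suppresses the plots"
DISCRIMINATE [PRINT=lrv,means,table; NROOTS=1; REALLOCATE=yes; PLOT=*]\
         Measures; GROUPS=Species
```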

The REALLOCATE option specifies whether the units in the training set are to be reallocated to groups by the procedure. If the default setting no is used then their group values, either printed or saved, will be missing.

The VALIDATIONMETHOD option specifies the validation method, with settings for cross-validation, jackknife and bootstrap. Cross-validation works by randomly splitting the units into a number of groups specified by the NCROSSVALIDATIONGROUPS option (default 10). It then omits each of the groups, in turn, and predicts how the omitted units are allocated to the discrimination groups. Jackknifing leaves the units out one at a time, and uses the rest of the data to predict the group of the omitted unit. The bootstrap method works by drawing a bootstrap sample of units (a random sample of units, drawn with replacement, of the same size as the original sample), and predicting the units that are not present in the random sample. The resulting bootstrap error rate is then calculated as a weighted average of the error rate of the omitted observations and the predictive error rate of the bootstrap sample. The weights used are 0.632 and 0.368 respectively, and so this is known as the 632 rule.
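The weighted average described above can be written out as below. The symbols are notation introduced here for illustration, not names used by the procedure: the first error rate is that of the omitted observations and the second is the predictive error rate of the bootstrap sample. The weight 0.632 is approximately 1 - 1/e, the chance that a given unit appears in a bootstrap sample.

```latex
\widehat{\mathrm{Err}}_{632}
  = 0.632\,\widehat{\mathrm{Err}}_{\text{omitted}}
  + 0.368\,\widehat{\mathrm{Err}}_{\text{sample}}

% Worked example with hypothetical error rates 0.10 (omitted) and 0.02 (sample):
0.632 \times 0.10 + 0.368 \times 0.02 = 0.0632 + 0.00736 = 0.07056
```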

The NSIMULATIONS option sets the number of simulations for cross-validation or bootstrapping. It should be set to a variate with two values: the first value defines the number of simulations to use during selection (default 10), and the second sets the number to use in the estimation of the error rates (default 50).

The SEED option provides the seed for the random numbers used for the randomizations in the simulations. The default value of 0 continues an existing sequence of random numbers or, if none have been used in the current Genstat job, initializes the seed automatically using the computer clock.
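The validation options might be combined as in the sketch below, which assumes Measures and Species from the Example section. The seed and the numbers of simulations are arbitrary illustrative values; a non-zero SEED is used so that the randomizations can be repeated.

```genstat
"Bootstrap error rates: 10 simulations for selection, 200 for validation"
DISCRIMINATE [PRINT=validation; VALIDATIONMETHOD=bootstrap;\
         NSIMULATIONS=!(10,200); SEED=271828; PLOT=*] Measures;\
         GROUPS=Species
"10-fold cross-validation with the same seed"
DISCRIMINATE [PRINT=validation; VALIDATIONMETHOD=crossvalidation;\
         NCROSSVALIDATIONGROUPS=10; NSIMULATIONS=!(10,50); SEED=271828;\
         PLOT=*] Measures; GROUPS=Species
```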

The PLOT option provides for group means, labels for group means, unit scores, group polygons enclosing units, and 95% confidence circles around group means. The YROOT and XROOT options specify the roots for the axes. The TITLE, WINDOW and SCREEN options allow further control of the plots. More than one plot can be output by having a list of scalars for YROOT. In this case, the values of XROOT, TITLE, WINDOW and SCREEN are cycled in parallel. A rug-like plot is drawn if only one root is extracted or if YROOT is set to a missing value.

Results from the analysis can be saved using the parameters NEWGROUPS, ALLOCATION, MEANS, SCORES, DISTANCES, LRV, ADJUSTMENTS, GDISTANCES, CCORRELATIONS, ICORRELATIONS and CORRELATIONS. The structures specified for these parameters need not be declared in advance. The default is to save MEANS and SCORES in matrices. However, if you declare either as a pointer, it will instead store the results as a data matrix (i.e. a pointer of variates corresponding to the columns of the matrix). The results correspond to p dimensions, where p is the smaller of either the number of variates, or the number of groups minus one.

Options: PRINT, NROOTS, REALLOCATE, PLOT, VALIDATIONMETHOD, NSIMULATIONS, NCROSSVALIDATIONGROUPS, SEED, YROOT, XROOT, TITLE, WINDOW, SCREEN.

Parameters: DATA, GROUPS, NEWGROUPS, ALLOCATION, MEANS, SCORES, DISTANCES, LRV, ADJUSTMENTS, GDISTANCES, CCORRELATIONS, ICORRELATIONS, CORRELATIONS.

Method

A canonical variates analysis (CVA) is used to obtain the scores for the group means and the LRV containing the loadings (L), roots and trace; the analysis excludes units omitted by RESTRICT, or that have missing values in the data variates or the GROUPS factor. Scores are then calculated for all the units (i.e. ignoring any restrictions or missing values), using the formula

( X L ) – ( J A )

where X is a matrix containing the full set of units-by-variables data, J is a column vector of ones, and A is a row vector of adjustments required to place the scores for the units onto the same scale as those for the group means.

Mahalanobis squared distances between the units and the group means are calculated from the canonical variate scores. Each unit is then allocated to the group for which it has the smallest Mahalanobis squared distance to the group mean.
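The scoring and allocation steps can be sketched in matrix notation as follows. The dimensions n (units), v (variates) and r (the number of dimensions set by NROOTS), and the symbols s and m for the unit and group-mean scores, are notation introduced here. The distance formula assumes that the canonical variates are scaled so that the Mahalanobis squared distance reduces to a sum of squared differences between scores; this is an assumption of the sketch rather than something stated above.

```latex
S = X L - J A, \qquad
X \in \mathbb{R}^{n \times v},\;
L \in \mathbb{R}^{v \times r},\;
J = (1,\dots,1)^{\mathsf T} \in \mathbb{R}^{n \times 1},\;
A \in \mathbb{R}^{1 \times r}

d_{ig}^{2} = \sum_{k=1}^{r} \left( s_{ik} - m_{gk} \right)^{2}, \qquad
\hat{g}_i = \arg\min_{g} \, d_{ig}^{2}
```

Here s_ik is the score of unit i on canonical variate k, m_gk is the score of the mean of group g, and unit i is allocated to the group with the smallest squared distance.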

There are two internal procedures: _DISAXSCALE and _DISENCLOSE.

Action with `RESTRICT`

The input variates and factor may be restricted, but the restrictions must be identical. The canonical variates analysis is based only on the units that are not excluded by the restriction and that have non-missing values for all the data variates. Scores are calculated for all the units with a complete set of non-missing values; however, these are based only on the non-excluded units: i.e. the adjustments for the canonical variate scores are calculated from the non-excluded units, and the loadings used to calculate the scores are those from the canonical variates analysis. If there is a restriction in place, the counts setting of the PRINT option will produce two parallel tables, one with the number of units in the training set and another with the number of units if the data were not restricted. The table setting of the PRINT option will produce two tables, one using only those units present in the training set and another for those units excluded by the restriction.

If the restriction results in levels of the GROUPS factor being unrepresented in the training set, the group centroids for these levels are estimated from the scores of the units that were excluded, and the levels will be included in the GDISTANCES symmetric matrix. The DISTANCES parameter will include the distances to all the centroids, including those for levels not in the training set. The ALLOCATION parameter will allocate to the nearest centroid even if it was not in the training set (as distinct from the NEWGROUPS factor).
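A sketch of this situation, assuming Measures and Species from the Example section, is given below. The restriction that excludes the third species, and the names NewGrp, Alloc, Dists and GDists, are hypothetical. The same condition is applied to the factor and to every variate so that the restrictions are identical.

```genstat
"Hypothetical restriction: exclude Virginica (level 3) from the training set"
RESTRICT Species,Measures[]; CONDITION=Species.NE.3
DISCRIMINATE [PRINT=counts,table; REALLOCATE=yes; PLOT=*] Measures;\
         GROUPS=Species; NEWGROUPS=NewGrp; ALLOCATION=Alloc;\
         DISTANCES=Dists; GDISTANCES=GDists
"Remove the restriction again"
RESTRICT Species,Measures[]
```

In this sketch NewGrp would hold allocations to the two training-set groups only, whereas Alloc could also allocate units to the excluded group.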

For levels and units in the training set, plotted means are marked with symbol 1 (×) and the units with symbol 3 (+). Means for levels and units excluded by the restriction are plotted with symbols 19 and 20 respectively. Units with a missing GROUPS value are plotted with symbol 18 if they are not in the excluded set; otherwise symbol 21 is used. Polygons are not drawn around groups excluded from the training set by a restriction.

References

Klecka, W.R. (1980). Discriminant Analysis (Quantitative Applications in the Social Sciences). Sage Publishing, Newbury Park, California.

Mardia, K.V., Kent, J.T. & Bibby, J.M. (1979). Multivariate Analysis. Academic Press, London.

Example

CAPTION  'DISCRIMINATE example','Fisher''s Iris data.'; STYLE=meta,plain
POINTER  [VALUES=Length,Width] Sepal
FACTOR   [NVALUES=150; LABELS=!t(Setosa,Versicolor,Virginica);\
         VALUES=50(1,2,3)] Species
VARIATE  [NVALUES=150] Sepal_L,Sepal_W,Petal_L,Petal_W
POINTER  [VALUES=Sepal_L,Sepal_W,Petal_L,Petal_W] Measures
READ     Measures[]
 5.1  3.5  1.4  0.2
 4.9  3.0  1.4  0.2
 4.7  3.2  1.3  0.2
 4.6  3.1  1.5  0.2
 5.0  3.6  1.4  0.2
 5.4  3.9  1.7  0.4
 4.6  3.4  1.4  0.3
 5.0  3.4  1.5  0.2
 4.4  2.9  1.4  0.2
 4.9  3.1  1.5  0.1
 5.4  3.7  1.5  0.2
 4.8  3.4  1.6  0.2
 4.8  3.0  1.4  0.1
 4.3  3.0  1.1  0.1
 5.8  4.0  1.2  0.2
 5.7  4.4  1.5  0.4
 5.4  3.9  1.3  0.4
 5.1  3.5  1.4  0.3
 5.7  3.8  1.7  0.3
 5.1  3.8  1.5  0.3
 5.4  3.4  1.7  0.2
 5.1  3.7  1.5  0.4
 4.6  3.6  1.0  0.2
 5.1  3.3  1.7  0.5
 4.8  3.4  1.9  0.2
 5.0  3.0  1.6  0.2
 5.0  3.4  1.6  0.4
 5.2  3.5  1.5  0.2
 5.2  3.4  1.4  0.2
 4.7  3.2  1.6  0.2
 4.8  3.1  1.6  0.2
 5.4  3.4  1.5  0.4
 5.2  4.1  1.5  0.1
 5.5  4.2  1.4  0.2
 4.9  3.1  1.5  0.2
 5.0  3.2  1.2  0.2
 5.5  3.5  1.3  0.2
 4.9  3.6  1.4  0.1
 4.4  3.0  1.3  0.2
 5.1  3.4  1.5  0.2
 5.0  3.5  1.3  0.3
 4.5  2.3  1.3  0.3
 4.4  3.2  1.3  0.2
 5.0  3.5  1.6  0.6
 5.1  3.8  1.9  0.4
 4.8  3.0  1.4  0.3
 5.1  3.8  1.6  0.2
 4.6  3.2  1.4  0.2
 5.3  3.7  1.5  0.2
 5.0  3.3  1.4  0.2
 7.0  3.2  4.7  1.4
 6.4  3.2  4.5  1.5
 6.9  3.1  4.9  1.5
 5.5  2.3  4.0  1.3
 6.5  2.8  4.6  1.5
 5.7  2.8  4.5  1.3
 6.3  3.3  4.7  1.6
 4.9  2.4  3.3  1.0
 6.6  2.9  4.6  1.3
 5.2  2.7  3.9  1.4
 5.0  2.0  3.5  1.0
 5.9  3.0  4.2  1.5
 6.0  2.2  4.0  1.0
 6.1  2.9  4.7  1.4
 5.6  2.9  3.6  1.3
 6.7  3.1  4.4  1.4
 5.6  3.0  4.5  1.5
 5.8  2.7  4.1  1.0
 6.2  2.2  4.5  1.5
 5.6  2.5  3.9  1.1
 5.9  3.2  4.8  1.8
 6.1  2.8  4.0  1.3
 6.3  2.5  4.9  1.5
 6.1  2.8  4.7  1.2
 6.4  2.9  4.3  1.3
 6.6  3.0  4.4  1.4
 6.8  2.8  4.8  1.4
 6.7  3.0  5.0  1.7
 6.0  2.9  4.5  1.5
 5.7  2.6  3.5  1.0
 5.5  2.4  3.8  1.1
 5.5  2.4  3.7  1.0
 5.8  2.7  3.9  1.2
 6.0  2.7  5.1  1.6
 5.4  3.0  4.5  1.5
 6.0  3.4  4.5  1.6
 6.7  3.1  4.7  1.5
 6.3  2.3  4.4  1.3
 5.6  3.0  4.1  1.3
 5.5  2.5  4.0  1.3
 5.5  2.6  4.4  1.2
 6.1  3.0  4.6  1.4
 5.8  2.6  4.0  1.2
 5.0  2.3  3.3  1.0
 5.6  2.7  4.2  1.3
 5.7  3.0  4.2  1.2
 5.7  2.9  4.2  1.3
 6.2  2.9  4.3  1.3
 5.1  2.5  3.0  1.1
 5.7  2.8  4.1  1.3
 6.3  3.3  6.0  2.5
 5.8  2.7  5.1  1.9
 7.1  3.0  5.9  2.1
 6.3  2.9  5.6  1.8
 6.5  3.0  5.8  2.2
 7.6  3.0  6.6  2.1
 4.9  2.5  4.5  1.7
 7.3  2.9  6.3  1.8
 6.7  2.5  5.8  1.8
 7.2  3.6  6.1  2.5
 6.5  3.2  5.1  2.0
 6.4  2.7  5.3  1.9
 6.8  3.0  5.5  2.1
 5.7  2.5  5.0  2.0
 5.8  2.8  5.1  2.4
 6.4  3.2  5.3  2.3
 6.5  3.0  5.5  1.8
 7.7  3.8  6.7  2.2
 7.7  2.6  6.9  2.3
 6.0  2.2  5.0  1.5
 6.9  3.2  5.7  2.3
 5.6  2.8  4.9  2.0
 7.7  2.8  6.7  2.0
 6.3  2.7  4.9  1.8
 6.7  3.3  5.7  2.1
 7.2  3.2  6.0  1.8
 6.2  2.8  4.8  1.8
 6.1  3.0  4.9  1.8
 6.4  2.8  5.6  2.1
 7.2  3.0  5.8  1.6
 7.4  2.8  6.1  1.9
 7.9  3.8  6.4  2.0
 6.4  2.8  5.6  2.2
 6.3  2.8  5.1  1.5
 6.1  2.6  5.6  1.4
 7.7  3.0  6.1  2.3
 6.3  3.4  5.6  2.4
 6.4  3.1  5.5  1.8
 6.0  3.0  4.8  1.8
 6.9  3.1  5.4  2.1
 6.7  3.1  5.6  2.4
 6.9  3.1  5.1  2.3
 5.8  2.7  5.1  1.9
 6.8  3.2  5.9  2.3
 6.7  3.3  5.7  2.5
 6.7  3.0  5.2  2.3
 6.3  2.5  5.0  1.9
 6.5  3.0  5.2  2.0
 6.2  3.4  5.4  2.3
 5.9  3.0  5.1  1.8  :
CAPTION  !T('Use DISCRIMINATE: allowing training set to be reallocated;',\ 
            'printing LRV and adjustments from CVA, and allocation;',\ 
            'saving allocation, scores and distances.')
POINTER  MScore,UScore
DISCRIMINATE [PRINT=counts,lrv,tests,icorrelations,correlations,means,\
         adjustments,gdistances,scores,distances,newgroups,table;\
         REALLOCATE=yes; PLOT=means,mlabels,scores,polygons,confidence]\
         Measures; GROUPS=Species; NEWGROUPS=New_Spec; MEANS=MScore;\
         SCORES=UScore; DISTANCES=UMDists
CAPTION  'Tabulate the original grouping and the reallocation of units.'
TABULATE [PRINT=counts; CLASSIFICATION=Species,New_Spec; MARGIN=yes]
PRINT    Species,New_Spec,UScore[] & MScore[] & UMDists

Updated on January 12, 2022
