Performs discriminant analysis (L.H. Schmitt & P.G.N. Digby).
Options
PRINT = string tokens | Printed output from the analysis (counts, lrv, tests, ccorrelations, icorrelations, correlations, adjustments, means, gdistances, scores, distances, newgroups, table, validation); default coun |
---|---|
NROOTS = scalar | The number of dimensions to be used for printed and saved output, and used in calculating the distances and the allocation of units; default is to use the full dimensionality |
REALLOCATE = string token | Whether units from the training set are to be reallocated to groups (no, yes); default no |
PLOT = string tokens | Features for the plots (means, mlabels, scores, polygons, confidencecircle); default mean, scor, poly (Note: * suppresses plotting) |
VALIDATIONMETHOD = string token | Validation method to use to calculate error rates (bootstrap, crossvalidation, jackknife); default cros |
NSIMULATIONS = variate | Number of bootstraps or cross-validation sets to use for selection and for validation; default !(10,50) |
NCROSSVALIDATIONGROUPS = scalar | Number of groups for cross-validation; default 10 |
SEED = scalar | Seed for random number generation; default 0 |
YROOT = scalars | Specifies roots for plotting on y-axes |
XROOT = scalars | Specifies roots for plotting on x-axes |
TITLE = strings | Titles for plots |
WINDOW = scalars | Windows for plots |
SCREEN = string tokens | Action before each plot (keep, clear); default clea |
Parameters
DATA = pointers | Each pointer contains a set of variates to be analysed |
---|---|
GROUPS = factors | Define groupings for the units in each training set, with missing values for the units to be allocated |
NEWGROUPS = factors | Saves allocations (and reallocations) |
ALLOCATION = factors | Saves allocations to groups including those not present in the training set |
MEANS = matrices or pointers | Saves scores for group means |
SCORES = matrices or pointers | Saves scores for units |
DISTANCES = matrices | Saves unit to group-mean squared distances |
LRV = LRVs | Saves the LRVs from the canonical variates analyses |
ADJUSTMENTS = matrices | Saves adjustments to the canonical variates analyses |
GDISTANCES = symmetric matrices | Saves the distances between groups |
CCORRELATIONS = matrices | Saves canonical correlation coefficients |
ICORRELATIONS = symmetric matrices | Saves within-group correlation matrices of the input variates |
CORRELATIONS = matrices | Saves within-group correlations between the input and canonical variates |
Description
DISCRIMINATE performs discriminant analysis (see, for example, Mardia, Kent & Bibby 1979).
The input for the procedure is given by a pointer and a factor, specified by the DATA and GROUPS parameters, respectively. The pointer contains a set of variates defining the attributes of the units. Any unit with a missing value in any of the variates is excluded from the analysis. Units can also be excluded from the analysis by restricting the factor or variates; any such restrictions must be consistent (the rules here are exactly as used by the FSSPM directive). The factor specifies the pre-defined groupings of the units from which the allocation is derived (the “training set”); the units to be allocated by the analysis have missing factor values.
Printed output is controlled by the PRINT option, with settings:
counts | tables of the number of units in each group with a complete set of observations; |
---|---|
lrv | canonical variate loadings, latent roots and trace; |
tests | chi-square tests (as given by CVA); |
ccorrelations | canonical correlation coefficients (see Klecka 1980); |
icorrelations | within-group correlation matrix of the input variates; |
correlations | within-group correlations between the input and canonical variates; |
adjustments | adjustments required to the canonical variate scores; |
means | canonical variate scores for the group means; |
gdistances | inter-group distances (as given by CVA); |
scores | canonical variate scores for the units; |
distances | Mahalanobis squared distances between the units and the group means; |
newgroups | initial grouping and the allocation of units to groups; |
table | tables of counts of allocations; and |
validation | estimated error rates (see the VALIDATIONMETHOD option below). |
The NROOTS option specifies how many dimensions are printed and retained for the latent roots and vectors, and for the scores of the means and units. The distances of the units from the group means, and thus the allocation of units, are also formed from the scores in the number of dimensions specified by NROOTS. By default, the results are for the full dimensionality, i.e. the smaller of the number of variates and one less than the number of groups.
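As a small worked illustration of the default dimensionality (written in Python rather than Genstat, with a hypothetical helper name), the rule is just a minimum of two counts. For the Iris data used in the Example, four variates and three groups give two dimensions.

```python
def full_dimensionality(n_variates, n_groups):
    """Default number of roots: the smaller of the number of variates
    and one less than the number of groups.

    The function name is illustrative only; it is not part of Genstat.
    """
    return min(n_variates, n_groups - 1)


# Four measurements on three species, as in the Iris example.
print(full_dimensionality(4, 3))
```

Setting NROOTS to a smaller value than this would restrict both the output and the distances used for allocation to fewer dimensions.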
The REALLOCATE option specifies whether the units in the training set are to be reallocated to groups by the procedure. If the default setting, no, is used, then their group values, either printed or saved, will be missing.
The VALIDATIONMETHOD option specifies the validation method, with settings for cross-validation, jackknife and bootstrap. Cross-validation works by randomly splitting the units into a number of groups specified by the NCROSSVALIDATIONGROUPS option (default 10). It then omits each of the groups, in turn, and predicts how the omitted units are allocated to the discrimination groups. Jackknifing leaves the units out one at a time, and uses the rest of the data to predict the group of the omitted unit. The bootstrap method works by drawing a bootstrap sample of units (a random sample of units, drawn with replacement, of the same size as the original sample), and predicting the units that are not present in the random sample. The resulting bootstrap error rate is then calculated as a weighted average of the error rate of the omitted observations and the predictive error rate of the bootstrap sample. The weights used are 0.632 and 0.368 respectively, and so this is known as the 632 rule.
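The three validation schemes can be sketched outside Genstat as follows. This is a minimal Python illustration under stated assumptions, not the procedure's own code: the toy one-dimensional nearest-mean classifier stands in for the discriminant rule, all function names are hypothetical, and the "predictive error rate of the bootstrap sample" is read here as the error rate when the rule is applied back to the bootstrap sample itself.

```python
import random


def nearest_mean_classifier(train):
    """Fit a toy rule from (value, group) pairs: allocate a value to the
    group whose mean is closest. A stand-in for the discriminant rule."""
    sums = {}
    for value, group in train:
        total, count = sums.get(group, (0.0, 0))
        sums[group] = (total + value, count + 1)
    means = {g: total / count for g, (total, count) in sums.items()}
    return lambda value: min(means, key=lambda g: (value - means[g]) ** 2)


def error_rate(classify, units):
    """Proportion of (value, group) units that are misallocated."""
    if not units:
        return 0.0
    wrong = sum(1 for value, group in units if classify(value) != group)
    return wrong / len(units)


def crossvalidation_error(units, n_groups, rng):
    """Split the units at random into n_groups sets, omit each set in turn
    and predict it from the rest. n_groups = len(units) gives the jackknife."""
    shuffled = units[:]
    rng.shuffle(shuffled)
    folds = [shuffled[i::n_groups] for i in range(n_groups)]
    wrong = 0
    for i, held_out in enumerate(folds):
        train = [u for j, fold in enumerate(folds) if j != i for u in fold]
        classify = nearest_mean_classifier(train)
        wrong += sum(1 for value, group in held_out if classify(value) != group)
    return wrong / len(units)


def combine_632(error_omitted, error_sample):
    """Weighted average used by the 632 rule."""
    return 0.632 * error_omitted + 0.368 * error_sample


def bootstrap_632_error(units, rng):
    """One bootstrap replicate: resample with replacement, fit the rule to
    the sample, then combine the error on the omitted units with the error
    on the sample itself."""
    n = len(units)
    picks = [rng.randrange(n) for _ in range(n)]
    sample = [units[i] for i in picks]
    chosen = set(picks)
    omitted = [units[i] for i in range(n) if i not in chosen]
    classify = nearest_mean_classifier(sample)
    return combine_632(error_rate(classify, omitted), error_rate(classify, sample))


# Two well-separated toy groups of ten units each.
units = [(float(v), "a") for v in range(10)] + [(100.0 + v, "b") for v in range(10)]
rng = random.Random(1)
print(crossvalidation_error(units, 10, rng))
print(bootstrap_632_error(units, rng))
```

In practice the replicate would be repeated and averaged, which is the role the NSIMULATIONS option plays in the procedure.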
The NSIMULATIONS option sets the number of simulations for cross-validation or bootstrapping. It should be set to a variate with two values: the first value defines the number of simulations to use during selection (default 10), and the second sets the number to use in the estimation of the error rates (default 50).
The SEED option provides the seed for the random numbers used for the randomizations in the simulations. The default value of 0 continues an existing sequence of random numbers or, if none have been used in the current Genstat job, initializes the seed automatically using the computer clock.
The PLOT option selects the features to be plotted: group means, labels for group means, unit scores, group polygons enclosing units, and 95% confidence circles around group means. The YROOT and XROOT options specify the roots for the axes. The TITLE, WINDOW and SCREEN options allow further control of the plots. More than one plot can be output by supplying a list of scalars for YROOT. In this case, the values of XROOT, TITLE, WINDOW and SCREEN are cycled in parallel. A rug-like plot is drawn if only one root is extracted or if YROOT is set to a missing value.
Results from the analysis can be saved using the parameters NEWGROUPS, ALLOCATION, MEANS, SCORES, DISTANCES, LRV, ADJUSTMENTS, GDISTANCES, CCORRELATIONS, ICORRELATIONS and CORRELATIONS. The structures specified for these parameters need not be declared in advance. The default is to save MEANS and SCORES in matrices. However, if you declare either as a pointer, the results are stored instead as a data matrix (i.e. a pointer of variates corresponding to the columns of the matrix). The results correspond to p dimensions, where p is the smaller of the number of variates and the number of groups minus one.
Options: PRINT, NROOTS, REALLOCATE, PLOT, VALIDATIONMETHOD, NSIMULATIONS, NCROSSVALIDATIONGROUPS, SEED, YROOT, XROOT, TITLE, WINDOW, SCREEN.
Parameters: DATA, GROUPS, NEWGROUPS, ALLOCATION, MEANS, SCORES, DISTANCES, LRV, ADJUSTMENTS, GDISTANCES, CCORRELATIONS, ICORRELATIONS, CORRELATIONS.
Method
A canonical variates analysis (CVA) is used to obtain the scores for the group means and the LRV containing the loadings (L), roots and trace; the analysis excludes units omitted by RESTRICT, or that have missing values in the data variates or the GROUPS factor. Scores are then calculated for all the units (i.e. ignoring any restrictions or missing values), using the formula
(X L) - (J A)
where X is a matrix containing the full set of units-by-variables data, J is a column vector of ones, and A is a row vector of adjustments required to place the scores for the units onto the same scale as those for the group means.
Mahalanobis squared distances between the units and the group means are calculated from the canonical variate scores. Each unit is then allocated to the group for which it has the smallest Mahalanobis squared distance to the group mean.
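The scoring and allocation steps above can be sketched numerically as follows. This is an illustrative Python sketch, not the procedure's implementation: the function names are hypothetical, the loadings and adjustments are made-up values rather than output from CVA, and it assumes that the canonical variates are scaled so that the Mahalanobis squared distance reduces to the squared Euclidean distance between score vectors.

```python
def canonical_scores(X, L, A):
    """Return (X L) - (J A): multiply each unit's data row by the loadings
    and subtract the adjustment for each root. J, the column of ones, simply
    means the same row A is subtracted from every unit."""
    n_roots = len(A)
    return [
        [sum(row[k] * L[k][r] for k in range(len(row))) - A[r] for r in range(n_roots)]
        for row in X
    ]


def squared_distances(scores, mean_scores):
    """Squared distance from each unit's scores to each group-mean score
    vector (assumed equal to the Mahalanobis squared distance here)."""
    return [
        [sum((s - m) ** 2 for s, m in zip(unit, mean)) for mean in mean_scores]
        for unit in scores
    ]


def allocate(distances):
    """Allocate each unit to the group (numbered from 1) with the smallest
    squared distance."""
    return [min(range(len(row)), key=row.__getitem__) + 1 for row in distances]


# Made-up inputs: two units, two variates, two roots, two groups.
X = [[1.0, 2.0], [3.0, 4.0]]
L = [[1.0, 0.0], [0.0, 1.0]]
A = [2.0, 3.0]
mean_scores = [[-1.0, -1.0], [1.0, 1.0]]

scores = canonical_scores(X, L, A)
distances = squared_distances(scores, mean_scores)
print(scores)
print(distances)
print(allocate(distances))
```

Restricting the calculation to fewer roots, as the NROOTS option does, would correspond to using only the leading columns of the scores when forming the distances.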
There are two internal procedures, _DISAXSCALE and _DISENCLOSE.
Action with RESTRICT
The input variates and factor may be restricted. The restrictions must be identical. The canonical variates analysis is based only on the units not excluded by the restriction and having non-missing values for all data variates. Scores are calculated for all the units with a complete set of non-missing values; however, these are based only on the non-excluded units: i.e. the adjustments for the canonical variate scores are calculated from the non-excluded units, and the loadings used to calculate the scores are those from the canonical variates analysis. If there is a restriction in place, the counts setting of the PRINT option will produce two parallel tables, one with the number of units in the training set and another with the number of units if the data were not restricted. The table setting of the PRINT option will produce two tables, one using only those units present in the training set and another for those units excluded by the restriction.
If the restriction results in levels of the GROUPS factor being unrepresented in the training set, the group centroids for these levels are estimated from the scores of the units that were excluded, and the levels will be included in the GDISTANCES symmetric matrix. The DISTANCES parameter will include the distances to all the centroids, including those for levels not in the training set. The ALLOCATION parameter will allocate to the nearest centroid even if it was not in the training set (as distinct from the NEWGROUPS factor).
For levels and units in the training set, plotted means are marked with symbol 1 (×) and the units with symbol 3 (+). For levels and units excluded by the restriction, the means and the units are plotted with symbols 19 and 20 respectively. Units with a missing GROUPS value are plotted with symbol 18 if they are not in the excluded set; otherwise symbol 21 is used. Polygons are not drawn around groups excluded from the training set by a restriction.
References
Klecka, W.R. (1980). Discriminant Analysis (Quantitative Applications in the Social Sciences). Sage Publishing, Newbury Park, California.
Mardia, K.V., Kent, J.T. & Bibby, J.M. (1979). Multivariate Analysis. Academic Press, London.
See also
Directive: CVA.
Procedures: CVAPLOT, DBIPLOT, QDISCRIMINATE, SDISCRIMINATE.
Commands for: Multivariate and cluster analysis.
Example
CAPTION 'DISCRIMINATE example','Fisher''s Iris data.'; STYLE=meta,plain
POINTER [VALUES=Length,Width] Sepal
FACTOR [NVALUES=150; LABELS=!t(Setosa,Versicolor,Virginica);\
  VALUES=50(1,2,3)] Species
VARIATE [NVALUES=150] Sepal_L,Sepal_W,Petal_L,Petal_W
POINTER [VALUES=Sepal_L,Sepal_W,Petal_L,Petal_W] Measures
READ Measures[]
5.1 3.5 1.4 0.2  4.9 3.0 1.4 0.2  4.7 3.2 1.3 0.2  4.6 3.1 1.5 0.2  5.0 3.6 1.4 0.2
5.4 3.9 1.7 0.4  4.6 3.4 1.4 0.3  5.0 3.4 1.5 0.2  4.4 2.9 1.4 0.2  4.9 3.1 1.5 0.1
5.4 3.7 1.5 0.2  4.8 3.4 1.6 0.2  4.8 3.0 1.4 0.1  4.3 3.0 1.1 0.1  5.8 4.0 1.2 0.2
5.7 4.4 1.5 0.4  5.4 3.9 1.3 0.4  5.1 3.5 1.4 0.3  5.7 3.8 1.7 0.3  5.1 3.8 1.5 0.3
5.4 3.4 1.7 0.2  5.1 3.7 1.5 0.4  4.6 3.6 1.0 0.2  5.1 3.3 1.7 0.5  4.8 3.4 1.9 0.2
5.0 3.0 1.6 0.2  5.0 3.4 1.6 0.4  5.2 3.5 1.5 0.2  5.2 3.4 1.4 0.2  4.7 3.2 1.6 0.2
4.8 3.1 1.6 0.2  5.4 3.4 1.5 0.4  5.2 4.1 1.5 0.1  5.5 4.2 1.4 0.2  4.9 3.1 1.5 0.2
5.0 3.2 1.2 0.2  5.5 3.5 1.3 0.2  4.9 3.6 1.4 0.1  4.4 3.0 1.3 0.2  5.1 3.4 1.5 0.2
5.0 3.5 1.3 0.3  4.5 2.3 1.3 0.3  4.4 3.2 1.3 0.2  5.0 3.5 1.6 0.6  5.1 3.8 1.9 0.4
4.8 3.0 1.4 0.3  5.1 3.8 1.6 0.2  4.6 3.2 1.4 0.2  5.3 3.7 1.5 0.2  5.0 3.3 1.4 0.2
7.0 3.2 4.7 1.4  6.4 3.2 4.5 1.5  6.9 3.1 4.9 1.5  5.5 2.3 4.0 1.3  6.5 2.8 4.6 1.5
5.7 2.8 4.5 1.3  6.3 3.3 4.7 1.6  4.9 2.4 3.3 1.0  6.6 2.9 4.6 1.3  5.2 2.7 3.9 1.4
5.0 2.0 3.5 1.0  5.9 3.0 4.2 1.5  6.0 2.2 4.0 1.0  6.1 2.9 4.7 1.4  5.6 2.9 3.6 1.3
6.7 3.1 4.4 1.4  5.6 3.0 4.5 1.5  5.8 2.7 4.1 1.0  6.2 2.2 4.5 1.5  5.6 2.5 3.9 1.1
5.9 3.2 4.8 1.8  6.1 2.8 4.0 1.3  6.3 2.5 4.9 1.5  6.1 2.8 4.7 1.2  6.4 2.9 4.3 1.3
6.6 3.0 4.4 1.4  6.8 2.8 4.8 1.4  6.7 3.0 5.0 1.7  6.0 2.9 4.5 1.5  5.7 2.6 3.5 1.0
5.5 2.4 3.8 1.1  5.5 2.4 3.7 1.0  5.8 2.7 3.9 1.2  6.0 2.7 5.1 1.6  5.4 3.0 4.5 1.5
6.0 3.4 4.5 1.6  6.7 3.1 4.7 1.5  6.3 2.3 4.4 1.3  5.6 3.0 4.1 1.3  5.5 2.5 4.0 1.3
5.5 2.6 4.4 1.2  6.1 3.0 4.6 1.4  5.8 2.6 4.0 1.2  5.0 2.3 3.3 1.0  5.6 2.7 4.2 1.3
5.7 3.0 4.2 1.2  5.7 2.9 4.2 1.3  6.2 2.9 4.3 1.3  5.1 2.5 3.0 1.1  5.7 2.8 4.1 1.3
6.3 3.3 6.0 2.5  5.8 2.7 5.1 1.9  7.1 3.0 5.9 2.1  6.3 2.9 5.6 1.8  6.5 3.0 5.8 2.2
7.6 3.0 6.6 2.1  4.9 2.5 4.5 1.7  7.3 2.9 6.3 1.8  6.7 2.5 5.8 1.8  7.2 3.6 6.1 2.5
6.5 3.2 5.1 2.0  6.4 2.7 5.3 1.9  6.8 3.0 5.5 2.1  5.7 2.5 5.0 2.0  5.8 2.8 5.1 2.4
6.4 3.2 5.3 2.3  6.5 3.0 5.5 1.8  7.7 3.8 6.7 2.2  7.7 2.6 6.9 2.3  6.0 2.2 5.0 1.5
6.9 3.2 5.7 2.3  5.6 2.8 4.9 2.0  7.7 2.8 6.7 2.0  6.3 2.7 4.9 1.8  6.7 3.3 5.7 2.1
7.2 3.2 6.0 1.8  6.2 2.8 4.8 1.8  6.1 3.0 4.9 1.8  6.4 2.8 5.6 2.1  7.2 3.0 5.8 1.6
7.4 2.8 6.1 1.9  7.9 3.8 6.4 2.0  6.4 2.8 5.6 2.2  6.3 2.8 5.1 1.5  6.1 2.6 5.6 1.4
7.7 3.0 6.1 2.3  6.3 3.4 5.6 2.4  6.4 3.1 5.5 1.8  6.0 3.0 4.8 1.8  6.9 3.1 5.4 2.1
6.7 3.1 5.6 2.4  6.9 3.1 5.1 2.3  5.8 2.7 5.1 1.9  6.8 3.2 5.9 2.3  6.7 3.3 5.7 2.5
6.7 3.0 5.2 2.3  6.3 2.5 5.0 1.9  6.5 3.0 5.2 2.0  6.2 3.4 5.4 2.3  5.9 3.0 5.1 1.8 :
CAPTION !T('Use DISCRIMINATE: allowing training set to be reallocated;',\
  'printing LRV and adjustments from CVA, and allocation;',\
  'saving allocation, scores and distances.')
POINTER MScore,UScore
DISCRIMINATE [PRINT=counts,lrv,tests,icorrelations,correlations,means,\
  adjustments,gdistances,scores,distances,newgroups,table;\
  REALLOCATE=yes; PLOT=means,mlabels,scores,polygons,confidence]\
  Measures; GROUPS=Species; NEWGROUPS=New_Spec; MEANS=MScore;\
  SCORES=UScore; DISTANCES=UMDists
CAPTION 'Tabulate the original grouping and the reallocation of units.'
TABULATE [PRINT=counts; CLASSIFICATION=Species,New_Spec; MARGIN=yes]
PRINT Species,New_Spec,UScore[]
& MScore[]
& UMDists