Performs discriminant analysis (L.H. Schmitt & P.G.N. Digby).

### Options

`PRINT` = string tokens | Printed output from the analysis (`counts`, `lrv`, `tests`, `ccorrelations`, `icorrelations`, `correlations`, `adjustments`, `means`, `gdistances`, `scores`, `distances`, `newgroups`, `table`, `validation`); default `coun` |
---|---|
`NROOTS` = scalar | The number of dimensions to be used for printed and saved output, and used in calculating the distances and the allocation of units; default is to use the full dimensionality |
`REALLOCATE` = string token | Whether units from the training set are to be reallocated to groups (`no`, `yes`); default `no` |
`PLOT` = string tokens | Features for the plots (`means`, `mlabels`, `scores`, `polygons`, `confidencecircle`); default `mean`, `scor`, `poly` (Note: `*` suppresses plotting) |
`VALIDATIONMETHOD` = string token | Validation method to use to calculate error rates (`bootstrap`, `crossvalidation`, `jackknife`); default `cros` |
`NSIMULATIONS` = variate | Number of bootstraps or cross-validation sets to use for selection and for validation; default `!(10,50)` |
`NCROSSVALIDATIONGROUPS` = scalar | Number of groups for cross-validation; default 10 |
`SEED` = scalar | Seed for random number generation; default 0 |
`YROOT` = scalars | Specifies roots for plotting on y-axes |
`XROOT` = scalars | Specifies roots for plotting on x-axes |
`TITLE` = strings | Titles for plots |
`WINDOW` = scalars | Windows for plots |
`SCREEN` = string tokens | Action before each plot (`keep`, `clear`); default `clea` |

### Parameters

`DATA` = pointers | Each pointer contains a set of variates to be analysed |
---|---|
`GROUPS` = factors | Define groupings for the units in each training set, or missing values for the units to be allocated |
`NEWGROUPS` = factors | Saves allocations (and reallocations) |
`ALLOCATION` = factors | Saves allocations to groups including those not present in the training set |
`MEANS` = matrices or pointers | Saves scores for group means |
`SCORES` = matrices or pointers | Saves scores for units |
`DISTANCES` = matrices | Saves unit to group-mean squared distances |
`LRV` = LRVs | Saves the LRVs from the canonical variates analyses |
`ADJUSTMENTS` = matrices | Saves adjustments to the canonical variates analyses |
`GDISTANCES` = symmetric matrices | Saves the distances between groups |
`CCORRELATIONS` = matrices | Saves canonical correlation coefficients |
`ICORRELATIONS` = symmetric matrices | Saves within-group correlation matrices of the input variates |
`CORRELATIONS` = matrices | Saves within-group correlations between the input and canonical variates |

### Description

`DISCRIMINATE` performs discriminant analysis (see, for example, Mardia, Kent & Bibby 1979).

The input for the procedure is given by a pointer and a factor, specified by the `DATA` and `GROUPS` parameters, respectively. The pointer contains a set of variates defining the attributes of the units. Any unit with a missing value in any of the variates is excluded from the analysis. Units can also be excluded from the analysis by restricting the factor or variates; any such restrictions must be consistent (the rules here are exactly as used by the `FSSPM` directive). The factor specifies the pre-defined groupings of the units from which the allocation is derived (the “training set”); the units to be allocated by the analysis have missing factor values.

Printed output is controlled by the option `PRINT` with settings:

`counts` | tables of the number of units in each group with a complete set of observations; |
---|---|
`lrv` | canonical variate loadings, latent roots and trace; |
`tests` | chi-square tests (as given by `CVA`); |
`ccorrelations` | canonical correlation coefficients (see Klecka 1980); |
`icorrelations` | within-group correlation matrix of the input variates; |
`correlations` | within-group correlations between the input and canonical variates; |
`adjustments` | adjustments required to the canonical variate scores; |
`means` | canonical variate scores for the group means; |
`gdistances` | inter-group distances (as given by `CVA`); |
`scores` | canonical variate scores for the units; |
`distances` | Mahalanobis squared distances between the units and the group means; |
`newgroups` | initial grouping and the allocation of units to groups; |
`table` | tables of counts of allocations; and |
`validation` | estimated error rates (see the `VALIDATIONMETHOD` option below). |

The `NROOTS` option specifies how many dimensions are printed and retained for the latent roots and vectors, and for the scores of the means and units. The distances of the units from the group means, and thus the allocation of units, are also formed from the scores in the number of dimensions specified by `NROOTS`. By default, the results are for the full dimensionality, i.e. the smaller of the number of variates and one less than the number of groups.
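As a minimal sketch of how `NROOTS` might be used, the call below restricts the printed output and the allocation to the first canonical dimension. It assumes that the pointer `Measures` and the factor `Species` have already been set up as in the Example section (four variates and three groups, so the full dimensionality would be two).

```
"Sketch only: Measures and Species are assumed to exist already."
DISCRIMINATE [PRINT=lrv,means,table; NROOTS=1; REALLOCATE=yes] \
  DATA=Measures; GROUPS=Species
```

Here `REALLOCATE=yes` is included so that the `table` output has training-set allocations to count.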

The `REALLOCATE` option specifies whether the units in the training set are to be reallocated to groups by the procedure. If the default setting `no` is used then their group values, either printed or saved, will be missing.

The `VALIDATIONMETHOD` option specifies the validation method, with settings for cross-validation, jackknife and bootstrap. Cross-validation works by randomly splitting the units into a number of groups specified by the `NCROSSVALIDATIONGROUPS` option (default 10). It then omits each of the groups in turn, and predicts how the omitted units are allocated to the discrimination groups. Jackknifing leaves the units out one at a time, and uses the rest of the data to predict the group of the omitted unit. The bootstrap method works by drawing a bootstrap sample of units (a random sample of units, drawn with replacement, of the same size as the original sample), and predicting the units that are not present in the random sample. The resulting bootstrap error rate is then calculated as a weighted average of the error rate of the omitted observations and the predictive error rate of the bootstrap sample. The weights used are 0.632 and 0.368 respectively, and so this is known as the *632 rule*.
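Written as a formula, the weighted average described above can be sketched as follows. The symbols are introduced here purely for illustration and do not appear in the procedure's output: the first term is the error rate for the units omitted from the bootstrap sample, and the second is the predictive error rate of the bootstrap sample itself.

```latex
\widehat{\mathrm{err}}_{632}
  = 0.632\,\widehat{\mathrm{err}}_{\text{omitted}}
  + 0.368\,\widehat{\mathrm{err}}_{\text{sample}}
```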

The `NSIMULATIONS` option sets the number of simulations for cross-validation or bootstrapping. It should be set to a variate with two values: the first value defines the number of simulations to use during selection (default 10), and the second sets the number to use in the estimation of the error rates (default 50).

The `SEED` option provides the seed for the random numbers used for the randomizations during the simulations. The default value of 0 continues an existing sequence of random numbers or, if none have been used in the current Genstat job, initializes the seed automatically using the computer clock.
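The validation options might be combined as in the sketch below. It again assumes the `Measures` pointer and `Species` factor from the Example section; the seed value and the number of simulations are arbitrary choices for illustration, not recommendations.

```
"Sketch only: Measures and Species are assumed to exist already.
 The second value of NSIMULATIONS (200) replaces the default of 50
 simulations used to estimate the error rates. SEED is set to an
 arbitrary non-zero value, on the assumption that this makes the
 random sampling repeatable from run to run."
DISCRIMINATE [PRINT=validation; VALIDATIONMETHOD=bootstrap; \
  NSIMULATIONS=!(10,200); SEED=271828] DATA=Measures; GROUPS=Species
```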

The `PLOT` option selects the features to be shown in the plots: group means, labels for group means, unit scores, group polygons enclosing units, and 95% confidence circles around group means. The `YROOT` and `XROOT` options specify the roots for the axes. The `TITLE`, `WINDOW` and `SCREEN` options allow further control of the plots. More than one plot can be output by having a list of scalars for `YROOT`. In this case, the values of `XROOT`, `TITLE`, `WINDOW` and `SCREEN` are cycled in parallel. A rug-like plot is drawn if only one root is extracted or if `YROOT` is set to a missing value.
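A single plot of the second root against the first could be requested along the following lines. This is a sketch under the same assumptions as before (`Measures` and `Species` as in the Example section); the title text is invented for illustration.

```
"Sketch only: Measures and Species are assumed to exist already."
DISCRIMINATE [PLOT=means,mlabels,scores,confidencecircle; \
  YROOT=2; XROOT=1; TITLE='Iris data: canonical variates 1 and 2'] \
  DATA=Measures; GROUPS=Species
```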

Results from the analysis can be saved using the parameters `NEWGROUPS`, `ALLOCATION`, `MEANS`, `SCORES`, `DISTANCES`, `LRV`, `ADJUSTMENTS`, `GDISTANCES`, `CCORRELATIONS`, `ICORRELATIONS` and `CORRELATIONS`. The structures specified for these parameters need not be declared in advance. The default is to save `MEANS` and `SCORES` in matrices. However, if you declare either as a pointer, it will instead store the results as a data matrix (i.e. a pointer of variates corresponding to the columns of the matrix). The results correspond to *p* dimensions, where *p* is the smaller of either the number of variates, or the number of groups minus one.

Options: `PRINT`, `NROOTS`, `REALLOCATE`, `PLOT`, `VALIDATIONMETHOD`, `NSIMULATIONS`, `NCROSSVALIDATIONGROUPS`, `SEED`, `YROOT`, `XROOT`, `TITLE`, `WINDOW`, `SCREEN`.

Parameters: `DATA`, `GROUPS`, `NEWGROUPS`, `ALLOCATION`, `MEANS`, `SCORES`, `DISTANCES`, `LRV`, `ADJUSTMENTS`, `GDISTANCES`, `CCORRELATIONS`, `ICORRELATIONS`, `CORRELATIONS`.

### Method

A canonical variates analysis (`CVA`) is used to obtain the scores for the group means and the LRV containing the loadings (*L*), roots and trace; the analysis excludes units omitted by `RESTRICT`, or that have missing values in the data variates or the `GROUPS` factor. Scores are then calculated for all the units (i.e. ignoring any restrictions or missing values), using the formula

( *X L* ) – ( *J A* )

where *X* is a matrix containing the full set of units-by-variables data, *J* is a column vector of ones, and *A* is a row vector of adjustments required to place the scores for the units onto the same scale as those for the group means.

Mahalanobis squared distances between the units and the group means are calculated from the canonical variate scores. Each unit is then allocated to the group for which it has the smallest Mahalanobis squared distance to the group mean.
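The allocation rule can be sketched in symbols as follows. The notation is introduced here for illustration only: *s* denotes the canonical variate score of unit *i* in dimension *k*, *m* the score of the mean of group *g*, and *r* the number of dimensions retained. The sketch also assumes that the canonical variates are scaled so that the Mahalanobis squared distance reduces to a squared Euclidean distance between scores, which is not stated explicitly above.

```latex
d^2_{ig} = \sum_{k=1}^{r} \left( s_{ik} - m_{gk} \right)^2 ,
\qquad
\hat{g}(i) = \arg\min_{g} \; d^2_{ig}
```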

There are two internal procedures, `_DISAXSCALE` and `_DISENCLOSE`.

### Action with `RESTRICT`

The input variates and factor may be restricted. The restrictions must be identical. The canonical variates analysis is based only on the units not excluded by the restriction and having non-missing values for all data variates. Scores are calculated for all the units with a complete set of non-missing values; however, these are based only on the non-excluded units: i.e. the adjustments for the canonical variate scores are calculated from the non-excluded units, and the loadings used to calculate the scores are those from the canonical variates analysis. If there is a restriction in place, the `counts` setting of the `PRINT` option will produce two parallel tables, one with the number of units in the training set and another with the number of units if the data were not restricted. The `table` setting of the `PRINT` option will produce two tables, one using only those units present in the training set and another for those units excluded by the restriction.

If the restriction results in levels of the `GROUPS` factor being unrepresented in the training set, the group centroids for these levels are estimated from the scores of the units that were excluded and the levels will be included in the `GDISTANCES` symmetric matrix. The `DISTANCES` parameter will include the distances to all the centroids, including those levels not in the training set. The `ALLOCATION` parameter will allocate to the nearest centroid even if it was not in the training set (as distinct from the `NEWGROUPS` factor).
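The following sketch shows one way a restriction might be used to leave a whole group out of the training set. It assumes the `Measures` pointer and `Species` factor from the Example section; the identifiers `NewSp`, `Alloc` and `Dist` are hypothetical names chosen for illustration.

```
"Sketch only: Measures and Species are assumed to exist already.
 Restrict the variates and the factor identically, so that units in
 the third level of Species are excluded from the training set."
RESTRICT Measures[],Species; CONDITION=Species.NE.3
DISCRIMINATE [PRINT=counts,table; REALLOCATE=yes] \
  DATA=Measures; GROUPS=Species; NEWGROUPS=NewSp; \
  ALLOCATION=Alloc; DISTANCES=Dist
"Remove the restriction again afterwards."
RESTRICT Measures[],Species
```

Under the behaviour described above, `Alloc` would be expected to allow allocation to the centroid of the excluded level, whereas `NewSp` would not.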

For levels and units in the training set, plotted means are marked with symbol 1 (×) and the units with symbol 3 (+). Means for levels and units excluded by the restriction are plotted with symbols 19 and 20 respectively. Units with a missing `GROUPS` value are plotted with symbol 18 if they are not in the excluded set; otherwise symbol 21 is used. Polygons are not drawn around groups excluded from the training set by a restriction.

### References

Klecka, W.R. (1980). *Discriminant Analysis (Quantitative Applications in the Social Sciences)*. Sage Publishing, Newbury Park, California.

Mardia, K.V., Kent, J.T. & Bibby, J.M. (1979). *Multivariate Analysis*. Academic Press, London.

### See also

Directive: `CVA`.

Procedures: `CVAPLOT`, `DBIPLOT`, `QDISCRIMINATE`, `SDISCRIMINATE`.

Commands for: Multivariate and cluster analysis.

### Example

    CAPTION 'DISCRIMINATE example','Fisher''s Iris data.'; STYLE=meta,plain
    POINTER [VALUES=Length,Width] Sepal
    FACTOR [NVALUES=150; LABELS=!t(Setosa,Versicolor,Virginica);\
      VALUES=50(1,2,3)] Species
    VARIATE [NVALUES=150] Sepal_L,Sepal_W,Petal_L,Petal_W
    POINTER [VALUES=Sepal_L,Sepal_W,Petal_L,Petal_W] Measures
    READ Measures[]
    5.1 3.5 1.4 0.2  4.9 3.0 1.4 0.2  4.7 3.2 1.3 0.2  4.6 3.1 1.5 0.2  5.0 3.6 1.4 0.2
    5.4 3.9 1.7 0.4  4.6 3.4 1.4 0.3  5.0 3.4 1.5 0.2  4.4 2.9 1.4 0.2  4.9 3.1 1.5 0.1
    5.4 3.7 1.5 0.2  4.8 3.4 1.6 0.2  4.8 3.0 1.4 0.1  4.3 3.0 1.1 0.1  5.8 4.0 1.2 0.2
    5.7 4.4 1.5 0.4  5.4 3.9 1.3 0.4  5.1 3.5 1.4 0.3  5.7 3.8 1.7 0.3  5.1 3.8 1.5 0.3
    5.4 3.4 1.7 0.2  5.1 3.7 1.5 0.4  4.6 3.6 1.0 0.2  5.1 3.3 1.7 0.5  4.8 3.4 1.9 0.2
    5.0 3.0 1.6 0.2  5.0 3.4 1.6 0.4  5.2 3.5 1.5 0.2  5.2 3.4 1.4 0.2  4.7 3.2 1.6 0.2
    4.8 3.1 1.6 0.2  5.4 3.4 1.5 0.4  5.2 4.1 1.5 0.1  5.5 4.2 1.4 0.2  4.9 3.1 1.5 0.2
    5.0 3.2 1.2 0.2  5.5 3.5 1.3 0.2  4.9 3.6 1.4 0.1  4.4 3.0 1.3 0.2  5.1 3.4 1.5 0.2
    5.0 3.5 1.3 0.3  4.5 2.3 1.3 0.3  4.4 3.2 1.3 0.2  5.0 3.5 1.6 0.6  5.1 3.8 1.9 0.4
    4.8 3.0 1.4 0.3  5.1 3.8 1.6 0.2  4.6 3.2 1.4 0.2  5.3 3.7 1.5 0.2  5.0 3.3 1.4 0.2
    7.0 3.2 4.7 1.4  6.4 3.2 4.5 1.5  6.9 3.1 4.9 1.5  5.5 2.3 4.0 1.3  6.5 2.8 4.6 1.5
    5.7 2.8 4.5 1.3  6.3 3.3 4.7 1.6  4.9 2.4 3.3 1.0  6.6 2.9 4.6 1.3  5.2 2.7 3.9 1.4
    5.0 2.0 3.5 1.0  5.9 3.0 4.2 1.5  6.0 2.2 4.0 1.0  6.1 2.9 4.7 1.4  5.6 2.9 3.6 1.3
    6.7 3.1 4.4 1.4  5.6 3.0 4.5 1.5  5.8 2.7 4.1 1.0  6.2 2.2 4.5 1.5  5.6 2.5 3.9 1.1
    5.9 3.2 4.8 1.8  6.1 2.8 4.0 1.3  6.3 2.5 4.9 1.5  6.1 2.8 4.7 1.2  6.4 2.9 4.3 1.3
    6.6 3.0 4.4 1.4  6.8 2.8 4.8 1.4  6.7 3.0 5.0 1.7  6.0 2.9 4.5 1.5  5.7 2.6 3.5 1.0
    5.5 2.4 3.8 1.1  5.5 2.4 3.7 1.0  5.8 2.7 3.9 1.2  6.0 2.7 5.1 1.6  5.4 3.0 4.5 1.5
    6.0 3.4 4.5 1.6  6.7 3.1 4.7 1.5  6.3 2.3 4.4 1.3  5.6 3.0 4.1 1.3  5.5 2.5 4.0 1.3
    5.5 2.6 4.4 1.2  6.1 3.0 4.6 1.4  5.8 2.6 4.0 1.2  5.0 2.3 3.3 1.0  5.6 2.7 4.2 1.3
    5.7 3.0 4.2 1.2  5.7 2.9 4.2 1.3  6.2 2.9 4.3 1.3  5.1 2.5 3.0 1.1  5.7 2.8 4.1 1.3
    6.3 3.3 6.0 2.5  5.8 2.7 5.1 1.9  7.1 3.0 5.9 2.1  6.3 2.9 5.6 1.8  6.5 3.0 5.8 2.2
    7.6 3.0 6.6 2.1  4.9 2.5 4.5 1.7  7.3 2.9 6.3 1.8  6.7 2.5 5.8 1.8  7.2 3.6 6.1 2.5
    6.5 3.2 5.1 2.0  6.4 2.7 5.3 1.9  6.8 3.0 5.5 2.1  5.7 2.5 5.0 2.0  5.8 2.8 5.1 2.4
    6.4 3.2 5.3 2.3  6.5 3.0 5.5 1.8  7.7 3.8 6.7 2.2  7.7 2.6 6.9 2.3  6.0 2.2 5.0 1.5
    6.9 3.2 5.7 2.3  5.6 2.8 4.9 2.0  7.7 2.8 6.7 2.0  6.3 2.7 4.9 1.8  6.7 3.3 5.7 2.1
    7.2 3.2 6.0 1.8  6.2 2.8 4.8 1.8  6.1 3.0 4.9 1.8  6.4 2.8 5.6 2.1  7.2 3.0 5.8 1.6
    7.4 2.8 6.1 1.9  7.9 3.8 6.4 2.0  6.4 2.8 5.6 2.2  6.3 2.8 5.1 1.5  6.1 2.6 5.6 1.4
    7.7 3.0 6.1 2.3  6.3 3.4 5.6 2.4  6.4 3.1 5.5 1.8  6.0 3.0 4.8 1.8  6.9 3.1 5.4 2.1
    6.7 3.1 5.6 2.4  6.9 3.1 5.1 2.3  5.8 2.7 5.1 1.9  6.8 3.2 5.9 2.3  6.7 3.3 5.7 2.5
    6.7 3.0 5.2 2.3  6.3 2.5 5.0 1.9  6.5 3.0 5.2 2.0  6.2 3.4 5.4 2.3  5.9 3.0 5.1 1.8 :
    CAPTION !T('Use DISCRIMINATE: allowing training set to be reallocated;',\
      'printing LRV and adjustments from CVA, and allocation;',\
      'saving allocation, scores and distances.')
    POINTER MScore,UScore
    DISCRIMINATE [PRINT=counts,lrv,tests,icorrelations,correlations,means,\
      adjustments,gdistances,scores,distances,newgroups,table;\
      REALLOCATE=yes; PLOT=means,mlabels,scores,polygons,confidence]\
      Measures; GROUPS=Species; NEWGROUPS=New_Spec; MEANS=MScore;\
      SCORES=UScore; DISTANCES=UMDists
    CAPTION 'Tabulate the original grouping and the reallocation of units.'
    TABULATE [PRINT=counts; CLASSIFICATION=Species,New_Spec; MARGIN=yes]
    PRINT Species,New_Spec,UScore[]
    & MScore[]
    & UMDists