CVA directive

Performs canonical variates analysis.

Options

`PRINT` = string tokens	Printed output required (`roots`, `loadings`, `means`, `residuals`, `distances`, `tests`); default `*` i.e. no printing
`NROOTS` = scalar	Number of latent roots for printed output; default `*` requests them all to be printed
`SMALLEST` = string token	Whether to print the smallest roots instead of the largest (`yes`, `no`); default `no`

Parameters

`WSSPM` = SSPMs	Within-group sums of squares and products, means etc (input for the analyses)
`LRV` = LRVs	Saves loadings, roots and trace from each analysis
`SCORES` = matrices	Saves canonical variate means
`RESIDUALS` = matrices	Saves distances of the means from the dimensions fitted in each analysis
`DISTANCES` = symmetric matrices	Saves inter-group-mean Mahalanobis distances
`ADJUSTMENTS` = matrices	Saves the adjustment terms
`SAVE` = pointers	Saves details of the analysis; if unset, an unnamed save structure is saved automatically (and this can be accessed using the `GET` directive)

Description

You specify the input for CVA using its first parameter, WSSPM. This may contain a list of structures, in which case Genstat repeats the analysis for each of them. Each input must be an SSPM structure, declared with the GROUPS option of the SSPM directive set to a factor giving the grouping of the units. If the variates used to form this SSPM structure are restricted, then the SSPM is restricted in the same way, and so the CVA directive takes account of the restriction.

The SSPM contains information on the within-group sums of squares and products, pooled over all the groups; it also contains the group means and group sizes, from which Genstat can derive the between-group sums of squares and products. CVA finds linear combinations of the original variables that maximize the ratio of between-group to within-group variation, thereby giving functions of the original variables that can be used to discriminate between the groups. The squares of the printed distances between group means are Mahalanobis D² statistics when all the dimensions are used; otherwise they are approximations. You can form exact Mahalanobis distances with the PCO directive.

The three options of the CVA directive control the printed output. By default there is no printed output, and so you should set the PRINT option to indicate which sections you want. Results can be printed for a subset of the latent roots by setting the NROOTS and SMALLEST options of CVA. NROOTS specifies the number of roots for which you want the results to be printed. By default these will be the largest roots, unless you set SMALLEST=yes; then the results will be printed for the smallest non-zero roots. When you print a subset of the results, residuals can be formed and printed from the dimensions that are not displayed.

The significance tests that are printed are for a significant dimensionality greater than k, that is for the joint significance of the first, second, …, (k+1)th latent roots. This test is printed for k = 0, 1, …, min(g−1, v)−1. If the test is “not significant” for k = r, then the values of chi-square for k > r should be ignored, as the indication is that the remaining dimensions have no interesting structure. The test statistic (Bartlett 1938) is asymptotically distributed as chi-square with (v−k)×(g−k−1) degrees of freedom. Here n is the number of units, g is the number of groups, v is the number of variables, and l_i is the ith latent root. If the coefficient [n−g−½(v−g)] is less than zero, there are too few units for the statistics to be calculated and a message is printed to this effect. In any case, the tests should be treated with caution unless n−g is very much larger than v.
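
The formula for the test statistic is not reproduced in the paragraph above. A hedged reconstruction follows. It assumes the usual form of Bartlett's statistic, uses the coefficient quoted in the text, and takes the sum over the roots beyond the kth. The summation range and the symbol m are assumptions, not taken from this page.

```latex
\chi^2_k = \left[\, n - g - \tfrac{1}{2}(v - g) \,\right] \sum_{i=k+1}^{m} \log(1 + l_i),
\qquad m = \min(g - 1,\, v),
\qquad \text{d.f.} = (v - k)(g - k - 1).
```

For the brooch data in the Example section (n = 28, g = 4, v = 7), the coefficient is 28 − 4 − ½(7 − 4) = 22.5, the tests are printed for k = 0, 1, 2, and the corresponding degrees of freedom are 7×3 = 21, 6×2 = 12 and 5×1 = 5.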

The latent vectors, or loadings, are scaled in such a way that the average within-group variability in each canonical variate dimension is 1: thus the within-group variation is equally represented in each dimension. Since the latent roots are the successive maxima of the ratio of between-group to within-group variation, loadings corresponding to roots less than 1 are for dimensions in the canonical variate space that exhibit more within-group variation than between-group variation.

The scores for the means are arranged so that their centroid, weighted by group size, is at the origin. This is done by subtracting a constant term, for each canonical variate dimension, from the scores initially formed as a linear combination of the group means of the original variables. These adjustments can be saved, in a matrix of size one by number of groups, using the ADJUSTMENTS parameter.
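
The centring described above can be written out as follows. The symbols are introduced here for illustration only: s_{ij} is the initial score of group i on canonical variate j, n_i is the size of group i, a_j is the adjustment for dimension j, and z_{ij} is the centred score.

```latex
a_j = \frac{\sum_{i=1}^{g} n_i \, s_{ij}}{\sum_{i=1}^{g} n_i},
\qquad
z_{ij} = s_{ij} - a_j,
\qquad\text{so that}\qquad
\sum_{i=1}^{g} n_i \, z_{ij} = 0 \quad \text{for each dimension } j.
```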

If you ask for distances, they are formed from the group mean scores for the canonical variate dimensions that are printed. If results are printed for the full dimensionality, the distances will be Mahalanobis distances between the groups.
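
As a sketch of this calculation, assume that z_{ir} denotes the mean score of group i on canonical variate r and that q is the number of dimensions printed (both symbols are introduced here for illustration). The distance between groups i and j is then the ordinary Euclidean distance between their mean scores:

```latex
d_{ij} = \sqrt{\sum_{r=1}^{q} \left( z_{ir} - z_{jr} \right)^2 }
```

Because each canonical variate is scaled to have unit within-group variability, d_{ij}² is the Mahalanobis D² statistic when q equals the full dimensionality, and an approximation to it otherwise.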

The LRV parameter allows you to save the loadings, latent roots and their sum (the trace) in an LRV structure, while the SCORES parameter saves the canonical variate means. If you have declared the LRV already, its number of rows must be the same as the number of variates involved in forming the input SSPM. The number of rows of the SCORES matrix, if previously declared, must be equal to the number of groups.

The number of columns of the LRV and of the SCORES matrix corresponds to the number of dimensions to be saved from the analysis, and this must be the same for both of them. If the structures have been declared already, Genstat will take the larger of the numbers of columns declared for either, and declare (or redeclare) the other one to match. If neither has been declared and option SMALLEST retains the default setting no, Genstat takes the number of columns from the setting of the NROOTS option. Otherwise, Genstat saves results for the full set of dimensions. The trace saved as the third component of the LRV structure, however, will contain the sum of all the latent roots, whether or not they have all been saved. Procedure LRVSCREE can be used to produce a “scree” diagram, which can be helpful in deciding how many dimensions to save.
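
The following sketch shows one way to fix the number of saved dimensions by declaring the structures in advance. It assumes that W is the SSPM formed in the Example section (7 variates, 4 groups). The ROWS and COLUMNS settings follow the usual Genstat declarations and should be checked against the LRV and MATRIX directive pages.

```genstat
" Sketch only: W is assumed to be the SSPM from the Example section.
  Declare the LRV (7 variates) and scores matrix (4 groups) with two
  columns each, so that two dimensions are saved. "
LRV [ROWS=7; COLUMNS=2] L
MATRIX [ROWS=4; COLUMNS=2] Meanscrs
CVA WSSPM=W; LRV=L; SCORES=Meanscrs
PRINT L[],Meanscrs
```

Declaring both structures with the same number of columns avoids relying on Genstat to redeclare one of them to match the other.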

The RESIDUALS parameter allows you to save the distances of the means from the dimensions fitted in the analysis in a matrix with number of rows equal to the number of groups and one column. If the latent roots and vectors (loadings) are saved from the analysis, the residuals will correspond to the dimensions not saved; the same applies if you save scores. If neither the LRV nor scores are saved, the saved residuals will correspond to the smallest latent roots not printed.

The DISTANCES parameter allows you to save the inter-group-mean Mahalanobis distances in a symmetric matrix.
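
As a minimal sketch, the residuals, distances and adjustments can be saved in a single call, without any printed output from CVA itself. W is assumed to be the SSPM formed in the Example section, and the identifiers Res, Dist and Adj are hypothetical names chosen for illustration.

```genstat
" Sketch only: save the residuals, distances and adjustments for the
  two largest roots. Res, Dist and Adj are illustrative names. "
CVA [NROOTS=2] WSSPM=W; RESIDUALS=Res; DISTANCES=Dist; ADJUSTMENTS=Adj
PRINT Res
PRINT Dist
PRINT Adj
```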

The SAVE parameter can supply a pointer to save a multivariate save structure containing all the details of the analysis. If this is unset, an unnamed save structure is saved automatically (and this can be accessed using the GET directive). Alternatively, you can set SAVE=* to prevent any save structure being formed if, for example, you have a very large data set and want to avoid committing the storage space.
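
The two explicit settings of SAVE can be sketched as below. W is again assumed to be the SSPM from the Example section, and Cvasave is a hypothetical identifier.

```genstat
" Sketch only: keep the details of the analysis in a named save
  structure (Cvasave is an illustrative name). "
CVA [PRINT=roots] WSSPM=W; SAVE=Cvasave

" Suppress the save structure altogether, for example to save space
  with a very large data set. "
CVA [PRINT=roots] WSSPM=W; SAVE=*
```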

Options: PRINT, NROOTS, SMALLEST.
Parameters: WSSPM, LRV, SCORES, RESIDUALS, DISTANCES, ADJUSTMENTS, SAVE.

Reference

Bartlett, M.S. (1938). Further aspects of the theory of multiple regression. Proceedings of the Cambridge Philosophical Society, 34, 33-40.

Example

" Genstat example CVA-1: Canonical Variates Analysis

  The data for this example deal with measurements made on 28 brooches
  found at the archaeological site of the cemetery at Musingen. Seven
  measurements are used and have been transformed by taking logarithms.
  
  A grouping factor, obtained from a cluster analysis, with four levels 
  has also been included.

  (Doran and Hodson, Mathematics and computers in archaeology. (1975))
"

"
  Declare the four-level grouping factor.
"
FACTOR [LEVELS=4; VALUES=3,1,2,2,2,1,1,4,2,3,3,4,2,2,2,2,2,4,\ 
  1,3,4,4,2,2,2,1,1,3] Groupno

"
  The data are held in the file 'CVA-1.DAT' and names for the data columns
  are on the first line. Read the file, saving the names in a pointer
  structure called Data.
"
FILEREAD [NAME='%gendir%/examples/CVA-1.DAT';\ 
  IMETHOD=read; MAXCATEGORY=4; ISAVE=Data] 

"
  Declare a sums of squares and products data structure called W.
  The sums of squares and products have to be calculated for our pointer 
  of measurement variates, with groups for within-group SSPMs specified 
  by the grouping factor Groupno.

  Form the structure W.
"  
SSPM [TERMS=Data[]; GROUPS=Groupno] W
FSSPM W

"
  Perform the canonical variates analysis for W, printing out the 
  resulting roots, loadings, the means for the canonical variate groups,
  the values for the significance tests for the latent roots and the 
  distances between groups.
"
CVA [PRINT=roots,loadings,means,tests,distances] W

"
  Carry out the analysis once again for the two largest roots, saving the
  latent roots, loadings and trace in the LRV structure L, and the means
  in Meanscrs.
"
CVA [PRINT=residuals,distances; NROOTS=2] WSSPM=W; LRV=L; SCORES=Meanscrs
PRINT L[],Meanscrs

"
  If required, the smallest roots can be requested instead of the largest.
"
CVA [PRINT=roots,residuals; NROOTS=2; SMALLEST=yes] W; LRV=L
PRINT L[]

Updated on February 6, 2023
