Does an analysis of distance of multivariate data (R.W. Payne & R.P. White).
Options
| PRINT = string tokens | Controls printed output (aodtable, permutationtest); default aodt |
|---|---|
| TERMS = formula | Model terms to fit in the analysis; must be specified |
| FACTORIAL = scalar | Limit on the number of factors or variates in a term for it to be included in the analysis; default 3 |
| NTIMES = scalar | Number of permutations to use in the permutation test; default 999 |
| SEED = scalar | Seed for the random number generator used to make the permutations; default 0 continues from the previous generation or (if none) initializes the seed automatically |
Parameters
| DATA = symmetric matrices | Supplies the squared distances between the data points |
|---|---|
| SSD = variates | Saves the sums of squared distances |
| DF = variates | Saves the numbers of degrees of freedom |
| PRPERMUTATION = variates | Saves probabilities from the permutation test |
| DISTANCES = pointers | Contains a symmetric matrix of distances for each model term |
Description
This procedure implements the analysis of multivariate distance devised by Gower & Krzanowski (1999). This is useful when you have units whose positions in multi-dimensional space may be explained by a linear statistical model. It provides a breakdown of the sums of squared distances between the units, similar to that provided for sums of squares in an analysis of variance. So, the total squared distance between the units is partitioned into the components that can be explained by each of the terms in the model. These cannot be tested directly as in an analysis of variance, as it is unclear what probability distributions would be appropriate. Instead, the importance of the terms can be assessed by a permutation test, in which several permutations of the units are made. The significance of each sum of squared distances from the observed data is then calculated by seeing where it lies in the distribution of values obtained from all the analyses (the original analysis and those of the permuted data sets).
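The permutation test described above can be sketched in Python. This is an illustration under stated assumptions, not the procedure's own code. It assumes the between-group sum of squared distances is the total minus the within-group part, where each part is the sum of squared distances over pairs divided by the number of units involved. It also assumes the probability is the proportion of all analyses (observed plus permuted) whose value is at least as large as the observed one. The function names and the toy data are hypothetical.

```python
import random
from itertools import combinations


def scaled_ssd(sqdist, members):
    """Sum of squared distances over all pairs in `members`, divided by their number."""
    total = sum(sqdist[i][j] for i, j in combinations(members, 2))
    return total / len(members)


def between_group_ssd(sqdist, labels):
    """Total minus within-group sum of squared distances (assumed decomposition)."""
    units = list(range(len(labels)))
    total = scaled_ssd(sqdist, units)
    within = sum(
        scaled_ssd(sqdist, [i for i in units if labels[i] == label])
        for label in set(labels)
    )
    return total - within


def permutation_p_value(observed, permuted):
    """Proportion of all analyses (observed included) at least as large as observed."""
    n_extreme = 1 + sum(1 for value in permuted if value >= observed)
    return n_extreme / (1 + len(permuted))


# Toy data: six units on a line, so squared distances are (x_i - x_j)^2.
coords = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
sqdist = [[(a - b) ** 2 for b in coords] for a in coords]
labels = ["a", "a", "a", "b", "b", "b"]

observed = between_group_ssd(sqdist, labels)

# Permuting the units is done here by shuffling the group labels.
rng = random.Random(629856)  # arbitrary seed, for repeatability of the sketch
permuted = []
for _ in range(99):
    shuffled = labels[:]
    rng.shuffle(shuffled)
    permuted.append(between_group_ssd(sqdist, shuffled))

p_value = permutation_p_value(observed, permuted)
print(observed, p_value)
```

With 99 permutations the smallest probability this sketch can return is 1/100, which is why the number of permutations limits the resolution of the test.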
The squared distances between the units must be supplied in a symmetric matrix, using the DATA parameter. In some situations, these may be actual distances. Alternatively, the units may often be described by a collection of attributes ranging from continuous measurements to categorical variables, like the presence or absence of a particular feature. In these circumstances, the FSIMILARITY directive can be used to combine these attributes to give a symmetric matrix that represents the similarity between each pair of units. This can then be converted into a squared distance matrix, for example, by subtracting the similarities from one. (So MVAOD can be regarded as providing an alternative to multivariate analysis of variance, for units whose attributes are not all continuous variables.)
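As a rough illustration of the conversion from similarities to squared distances, the Python sketch below builds a similarity matrix and subtracts it from one. It assumes a Gower-style city-block coefficient for continuous attributes (one minus the absolute difference divided by the attribute's range, averaged over attributes). That formula is an assumption made for this sketch and is not taken from the FSIMILARITY documentation. The function names and data are hypothetical.

```python
def cityblock_similarity(rows):
    """Similarity matrix from continuous attributes (assumed Gower-style coefficient).

    Every attribute is assumed to have a non-zero range.
    """
    n_units = len(rows)
    n_vars = len(rows[0])
    ranges = [
        max(row[k] for row in rows) - min(row[k] for row in rows)
        for k in range(n_vars)
    ]
    similarity = [[0.0] * n_units for _ in range(n_units)]
    for i in range(n_units):
        for j in range(n_units):
            contributions = [
                1.0 - abs(rows[i][k] - rows[j][k]) / ranges[k]
                for k in range(n_vars)
            ]
            similarity[i][j] = sum(contributions) / n_vars
    return similarity


def to_squared_distances(similarity):
    """Convert similarities to squared distances by subtracting them from one."""
    return [[1.0 - s for s in row] for row in similarity]


# Three hypothetical units measured on two continuous attributes.
rows = [[1.0, 10.0], [2.0, 30.0], [5.0, 20.0]]
similarity = cityblock_similarity(rows)
sqdist = to_squared_distances(similarity)
for row in sqdist:
    print(row)
```

The result has zeros on the diagonal and is symmetric, which is the form the DATA parameter expects.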
The model to fit in the analysis is specified by the TERMS option. The FACTORIAL option sets a limit on the number of factors or variates that the terms can contain; any terms with more factors or variates are deleted from the analysis.
Printed output is controlled by the PRINT option, with settings:
| aodtable | for an analysis-of-distance table, giving the sums of squared distances and numbers of degrees of freedom for each model term; and |
|---|---|
| permutationtest | adds a column to the analysis-of-distance table containing probabilities from the permutation test. |
The NTIMES option specifies the number of permutations to perform; the default is 999. The SEED option specifies the seed to use to generate the random numbers that are used to select the permutations; the default of zero continues the sequence of random numbers from a previous generation or, if none have yet been used in this Genstat job, it initializes the seed automatically. MVAOD checks whether NTIMES is greater than the number of possible permutations available for the data set. If so, it does an exact test instead, which uses each possible permutation once.
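The switch between a random and an exact test can be sketched as follows. This is a hedged illustration only. It assumes the number of possible permutations is the factorial of the number of units, which the text above does not state, so MVAOD may count them differently. The function name and its arguments are hypothetical.

```python
import math
import random
from itertools import permutations


def choose_permutations(n_units, ntimes, seed):
    """Return ('exact', every ordering) if ntimes exceeds n_units!,
    otherwise ('random', ntimes randomly shuffled orderings)."""
    n_possible = math.factorial(n_units)  # assumed count of possible permutations
    if ntimes > n_possible:
        # Exact test: each possible ordering is used once.
        return "exact", [list(order) for order in permutations(range(n_units))]
    rng = random.Random(seed)
    sample = []
    for _ in range(ntimes):
        order = list(range(n_units))
        rng.shuffle(order)
        sample.append(order)
    return "random", sample


mode_small, perms_small = choose_permutations(5, 999, seed=629856)
mode_large, perms_large = choose_permutations(10, 999, seed=629856)
print(mode_small, len(perms_small))  # exact 120
print(mode_large, len(perms_large))  # random 999
```

With five units there are only 120 orderings, fewer than the default of 999, so the sketch enumerates them all. With ten units it draws 999 random orderings.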
The SSD, DF and PRPERMUTATION parameters allow you to save the sums of squared distances, degrees of freedom and permutation probabilities. These are each saved in a variate, with each unit labelled by the name of the model term concerned. There are also two final units in each variate to save the corresponding information for the residual and the total.
The DISTANCES parameter can save a pointer containing a symmetric matrix for each model term. Each matrix has a row for each combination of levels of the factors in the corresponding term, and its values are the distances between the factor combinations in the multi-dimensional space defined by the possible effects of the term. So, to investigate the relationships between the effects of the term, you could convert the DISTANCES to similarities, and then use them as input for a principal coordinates analysis (see PCO for details).
Options: PRINT, TERMS, FACTORIAL, NTIMES, SEED.
Parameters: DATA, SSD, DF, PRPERMUTATION, DISTANCES.
Method
The method of analysis is described by Gower & Krzanowski (1999) and Krzanowski (2002), who show that the sum of squares of distances for each term i is given by
TRACE( Proj[i] *+ DATA *+ Proj[i]) / 2
where Proj[i] is a projection matrix for the term. If the model contains only factors, MVAOD uses ANOVA to check whether the model is orthogonal and, if so, it calculates the projection matrices using the method described by Payne & Tobias (1992). For a non-orthogonal model, MVAOD adjusts the design matrix X[i] of each term i for the earlier terms by using its columns as y-variates in a regression analysis, fitting all the earlier terms, and then reforming the design matrix by replacing each column with the residuals from the corresponding regression. The projection matrix is then
X[i] *+ Ginverse(T(X[i]) *+ X[i]) *+ T(X[i])
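To make the projection formula concrete, here is a minimal Python sketch for the simplest case, a single factor whose design matrix is a matrix of 0/1 indicator columns. It is not the procedure's implementation. It assumes an unadjusted indicator matrix, for which the cross-product matrix is diagonal, so the generalized inverse reduces to reciprocals of the non-zero diagonal entries. The helper names are hypothetical.

```python
def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def indicator_matrix(labels):
    """One 0/1 column per factor level, one row per unit."""
    levels = sorted(set(labels))
    return [[1.0 if label == level else 0.0 for level in levels] for label in labels]


def diagonal_ginverse(m):
    """Generalized inverse of a diagonal matrix: invert the non-zero diagonal entries."""
    size = len(m)
    return [
        [1.0 / m[i][i] if i == j and m[i][i] != 0.0 else 0.0 for j in range(size)]
        for i in range(size)
    ]


def projection_one_factor(labels):
    """X (X'X)^- X' for the indicator matrix X of a single factor."""
    x = indicator_matrix(labels)
    xtx = matmul(transpose(x), x)  # diagonal matrix of group sizes
    return matmul(matmul(x, diagonal_ginverse(xtx)), transpose(x))


labels = ["a", "a", "b", "b", "b"]
proj = projection_one_factor(labels)
for row in proj:
    print(row)
```

Each entry is the reciprocal of the group size when the two units share a level and zero otherwise. The matrix is symmetric and idempotent, as a projection matrix should be.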
References
Gower, J.C. & Krzanowski, W.J. (1999) Analysis of distance for structured multivariate data and extensions to multivariate analysis of variance. Applied Statistics, 48, 505-519.
Krzanowski, W.J. (2002) Multifactorial analysis of distance in studies of ecological community structure. Journal of Agricultural, Biological, and Environmental Statistics, 7, 222-232.
Payne, R.W. & Tobias, R.D. (1992) General balance, combination of information and the analysis of covariance. Scandinavian Journal of Statistics, 19, 3-23.
See also
Directive: PCO.
Procedures: MANOVA, RMULTIVARIATE.
Commands for: Multivariate and cluster analysis.
Example
CAPTION 'MVAOD example',\
        !t('Analysis of distance of public bad data; see',\
        'Gower & Krzanowski 1999, Applied Statistics, 48, 505-519.');\
        STYLE=meta,plain
SPLOAD FILE='%gendir%/examples/Publicbad.gsh'
" Form similarity matrix using city-block metric."
FSIMILARITY [SIMILARITY=pbsimilarity] publicbad[]; TEST=cityblock
" Convert to squared distances."
CALCULATE pbdistances = 1 - pbsimilarity
" Between-group analysis."
FACPRODUCT [IMETHOD=include] !p(G,S,T,N); PRODUCT=group
MVAOD [PRINT=aod; TERMS=group; NTIMES=99]\
      pbdistances; DISTANCES=groupdistances
" PCO analysis of between-group similarities
  (Gower & Krzanowski, Figure 3)."
PCO [PRINT=roots] 1-groupdistances[1]; LRV=grouplrv
CALCULATE groupscore[1,2] = grouplrv[1]$[*; 1,2]
FRAME 3; SCALING=xyequal
XAXIS 3; YORIGIN=0; LPOSITION=*; MPOSITION=*
YAXIS 3; XORIGIN=0; LPOSITION=*; MPOSITION=*
TXCONSTRUCT [TEXT=groupno] !(1...16)
PEN 1; SYMBOLS=0; LABELS=groupno
DGRAPH [TITLE='Principal coordinate analysis'; WINDOW=3; KEY=0]\
       groupscore[2]; groupscore[1]; PEN=1
GETATTRIBUTE [ATTRIBUTE=labels] group; groupatt
TXCONSTRUCT [TEXT=groupkey] !(1...16),' = ',groupatt['labels']
FOR
  CAPTION 'Key to points on the graph'; STYLE=minor
  PRINT [IPRINT=*] groupkey
ENDFOR
" Factorial model - note: this is on a different scale and gives
  a slightly different breakdown from Table 2 of Gower & Krzanowski,
  as their analysis was unweighted by group size.
  Only 99 permutations are made, to save computing time."
MVAOD [PRINT=aod,permutation; TERMS=N*T*S*G; NTIMES=99; SEED=629856]\
      pbdistances
" For Gower & Krzanowski breakdown, use between-group distance matrix."
FACTOR [NVALUES=16; LEVELS=2] Gb,Sb,Tb,Nb
GENERATE Gb,Sb,Tb,Nb
MVAOD [PRINT=aod,permutation; TERMS=Nb*Tb*Sb*Gb] groupdistances[1]