Does correspondence analysis, or reciprocal averaging (P.G.N. Digby & A.I. Glaser).
|Printed output from the analysis (
||Type of analysis required (
||Number of latent roots for printed output; default * requests them all to be printed|
||How to represent proportions or %s in quality statistics (
||Number of dimensions for which quality statistics are required; default 2|
||Indexes of subset rows|
||Indexes of subset columns|
||Indexes of passive rows|
||Indexes of passive columns|
||Data to be analysed|
||Saves the squared singular values from each analysis|
||Saves the scores for the rows of the data matrix|
||Saves the scores for the columns of the data matrix|
||Saves the inertias for the rows of the data matrix|
||Saves the inertias for the columns of the data matrix|
||Saves the quality statistics for rows of the data|
||Saves the quality statistics for columns of the data|
||Saves details of the analysis for use by
Correspondence analysis is an ordination technique used to analyse two-way categorical data tables. Ordination techniques approximate relationships between variables in a reduced number of dimensions.
The type of analysis is specified by the
METHOD option, with one of the following settings:
||correspondence analysis (Greenacre 1984),|
||an alternative implementation of correspondence analysis described by Digby & Kempton (1987),|
||reciprocal averaging (see Digby & Kempton 1987), or|
||a similar biplot-style analysis (again see Digby & Kempton 1987).|
The default setting is correspondence, and this should be retained if either of the options to subset rows or columns is set.
The data for the procedure are specified by the
DATA parameter as either a matrix or a datamatrix (i.e. a pointer to variates, all with the same length). The matrix must not contain any missing values; it is unchanged on exit from the procedure.
Printed output is controlled by the
||to print the roots (together with the roots expressed as percentages and cumulative percentages),|
||to print the scores for the rows of the data matrix,|
||to print the inertias for the rows of the data matrix,|
||to print the row masses,|
||to print the row chi-square distances,|
||to print the quality statistics for the rows,|
||to print the scores for the columns of the data matrix,|
||to print the inertias for the columns of the data matrix,|
||to print the column masses,|
||to print the column chi-square distances, and|
||to print the quality statistics for the columns.|
The NROOTS option controls the printed output of roots, scores and inertias. By default, results are printed for all the roots, but you can set the NROOTS option to request fewer.
The quality settings produce tables with the following columns:
● the mass of the row (or column), in proportion to the total mass;
● the “quality” of the representation, i.e. how much of the inertia of a row (or column) is represented by the dimensions shown;
● the proportion of the total inertia of the row (or column) compared to the total inertia for all rows (or columns);
● principal coordinates of the rows (or columns) in the specified dimension;
● the amount of inertia for each row (or column) in the specified dimension relative to the total amount of inertia given by the value of the quality statistic – hence the sum of a specific row (or column) across the dimensions shown will be equal to the value given by the quality statistic;
● the proportion of inertia explained by a row (or column) in a dimension, compared to the total inertia in that dimension.
The representation of the columns of proportions is controlled by the %METHOD option; these can be printed as proportions (default), as percentages, or as permills (i.e. tenths of a percent). The NDIMENSIONS option specifies the number of dimensions for which quality statistics are printed; default 2.
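As an illustration of the first and third columns of the quality table, the following Python sketch computes the row masses and each row's share of the total inertia for the smoking data used in the example at the end of this page, and shows them under each representation. It is not part of the procedure: the function names and the labels proportion, percent and permill are assumptions made for the illustration, and the actual setting names of %METHOD may differ.

```python
# Smoking data from the example at the end of this page
# (rows: staff groups; columns: None, Light, Medium, Heavy).
counts = [
    [4, 2, 3, 2],
    [4, 3, 7, 4],
    [25, 10, 12, 4],
    [18, 24, 33, 13],
    [10, 6, 7, 2],
]

# Hypothetical labels for the three representations.
SCALE = {"proportion": 1, "percent": 100, "permill": 1000}


def mass_and_inertia(table):
    """Return (row masses, row shares of total inertia, total inertia)."""
    n = sum(map(sum, table))
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    inertia = []
    for row, rt in zip(table, row_tot):
        chi = 0.0
        for x, ct in zip(row, col_tot):
            expected = rt * ct / n
            chi += (x - expected) ** 2 / expected
        # A row's chi-square contribution divided by n is its inertia.
        inertia.append(chi / n)
    total = sum(inertia)
    masses = [rt / n for rt in row_tot]
    shares = [v / total for v in inertia]
    return masses, shares, total


def show(values, method="proportion"):
    """Express proportions as proportions, percentages or permills."""
    k = SCALE[method]
    return [round(v * k) if k > 1 else round(v, 3) for v in values]


masses, shares, total_inertia = mass_and_inertia(counts)
print("mass (permill):   ", show(masses, "permill"))
print("inertia (permill):", show(shares, "permill"))
print("total inertia:    ", round(total_inertia, 5))
```

Permills keep the table compact because every entry is a whole number of at most three or four digits, which is why this form is common in published quality tables.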
When carrying out correspondence analysis, there may be rows and/or columns (for example outliers with low mass) that you would like to ignore during the calculation of the roots or inertia, so that they have no influence. Instead of removing these rows and/or columns from the data before running
CORANALYSIS, an alternative is to list the indexes of the rows or columns that are to be ignored using the
COLPASSIVE options. These “passive” rows or columns will still be included in the table of quality statistics, where their relative contributions will be shown and compared with the total for all the passive rows or columns.
You may want to apply a correspondence analysis calculated from the whole data set to only a subset of the rows and/or columns, for example when some of the rows and/or columns divide into groups with common traits. This can be done by setting the
COLSUBSET options to the indexes of the rows and/or columns in the subset of interest. If any of these options is set, the
METHOD option must be set to
COLSUBSET) are both set, any indexes that occur in both will be removed from the
Results from the analysis can be saved using the parameters
COLQUALITY. The structures specified for these parameters need not be declared in advance. The
SAVE parameter can save full details of the analysis for use by the
Full details of correspondence analysis (i.e.
METHOD=correspondence) are given by Greenacre (1984 & 2007). The other methods are described by Digby & Kempton (1987).
The data matrix X is scaled to have sum one for
digbycorrespondence. The matrices U, S and V are taken from the singular-value decomposition of
Y = (X – R C) / √(R C)
Y = ( R^{-1/2} X C^{-1/2} )
for the other methods, where R and C are diagonal matrices of row and column totals of the data matrix X. The scores for the rows and columns from
A = ( R^{-1/2} U )
B = ( C^{-1/2} V )
The scores from
METHOD=digbycorrespondence are similar, but are multiplied by S. This makes the row scores obtained here the same as the principal coordinates given with the quality statistics.
With the other two methods X is not scaled to total one, and the scores are given by A = ( R^{-1/2} U S^m ) and B = ( C^{-1/2} V S^m ): the parameter m is zero for
METHOD=reciprocal, and 0.5 for
The inertia values for the rows and columns are given by
( R A A′ ) S′
( C B B′ ) S′
where S′ = S for METHOD=correspondence, and S′ = 1 for the other methods; see Greenacre (1984) for further information.
The roots are the squares of the singular values. Note that the first singular value will always be one for methods other than
correspondence; this corresponds to a trivial solution given in the first column of A and B above, which is automatically removed from the results printed and saved from
Rows and/or columns chosen as passive rows and/or columns are separated from the original data matrix before it is scaled. Rows and/or columns chosen as subset rows and/or columns are separated from Y after this scaling.
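The difference in ordering can be sketched in Python as follows. This is a minimal illustration rather than the procedure's own code: it assumes that Y is formed element by element as (x_ij – r_i c_j) / √(r_i c_j), with r_i and c_j the row and column totals of the table after scaling to sum one, and the helper names scaled_residuals and split_passive are hypothetical. Indexes are 1-based, as in the options.

```python
from math import sqrt

counts = [
    [4, 2, 3, 2],
    [4, 3, 7, 4],
    [25, 10, 12, 4],
    [18, 24, 33, 13],
    [10, 6, 7, 2],
]


def scaled_residuals(table):
    """Scale the table to sum one and return the matrix Y (assumed form)."""
    n = sum(map(sum, table))
    r = [sum(row) / n for row in table]
    c = [sum(col) / n for col in zip(*table)]
    return [
        [(x / n - ri * cj) / sqrt(ri * cj) for x, cj in zip(row, c)]
        for row, ri in zip(table, r)
    ]


def split_passive(table, passive_rows):
    """Separate passive rows (1-based indexes) from the raw table."""
    passive = set(passive_rows)
    active = [row for i, row in enumerate(table, 1) if i not in passive]
    held = [row for i, row in enumerate(table, 1) if i in passive]
    return active, held


# Passive rows: removed first, so they do not affect the scaling.
active, held = split_passive(counts, [5])
y_active = scaled_residuals(active)

# Subset rows: the full table is scaled, then rows are picked out of Y.
y_full = scaled_residuals(counts)
y_subset = [y_full[i - 1] for i in (1, 2)]

full_inertia = sum(v * v for row in y_full for v in row)
subset_inertia = sum(v * v for row in y_subset for v in row)
print(round(full_inertia, 5), round(subset_inertia, 5))
```

Because the subset is taken after scaling, the masses and centring still come from the complete table, and the inertia of the subset is simply the part of the total inertia carried by the chosen rows.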
For the quality statistics, the weighted sum-of-squares of the principal coordinates on the ith dimension is equal to the ith squared singular value. The row and column scores for
METHOD=digbycorrespondence are equivalent to the principal coordinates. Conversely the row and column scores for
reciprocal are equivalent to standard coordinates, where the weighted sum-of-squares for each dimension is equal to one.
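The relationship between standard and principal coordinates can be checked numerically with the textbook reciprocal-averaging iteration, sketched below in Python for the first non-trivial dimension of the smoking data. This is an illustrative assumption about how the idea works, not the algorithm used by the procedure (which is based on the singular-value decomposition); it works with masses (proportions) rather than raw totals, and the function name reciprocal_averaging is hypothetical.

```python
from math import sqrt

counts = [
    [4, 2, 3, 2],
    [4, 3, 7, 4],
    [25, 10, 12, 4],
    [18, 24, 33, 13],
    [10, 6, 7, 2],
]


def reciprocal_averaging(table, iterations=100):
    """Return (row scores, column scores, root) for the first dimension."""
    n = sum(map(sum, table))
    columns = list(zip(*table))
    r = [sum(row) / n for row in table]      # row masses
    c = [sum(col) / n for col in columns]    # column masses

    def standardise(v, w):
        # Remove the weighted mean (the trivial solution), then rescale
        # so that the weighted sum-of-squares is one.
        mean = sum(wi * vi for wi, vi in zip(w, v))
        v = [vi - mean for vi in v]
        norm = sqrt(sum(wi * vi * vi for wi, vi in zip(w, v)))
        return [vi / norm for vi in v], norm

    def row_averages(y):
        # Each row score is the average of the column scores in that row.
        return [sum(x * yj for x, yj in zip(row, y)) / (ri * n)
                for row, ri in zip(table, r)]

    y, _ = standardise(list(range(len(c))), c)   # arbitrary starting scores
    root = 0.0
    for _ in range(iterations):
        x = row_averages(y)
        # Each column score is the average of the row scores in that column.
        y_new = [sum(v * xi for v, xi in zip(col, x)) / (cj * n)
                 for col, cj in zip(columns, c)]
        # The shrinkage in one full cycle converges to the root.
        y, root = standardise(y_new, c)
    return row_averages(y), y, root, r, c


row_scores, col_scores, root, r, c = reciprocal_averaging(counts)
print("root:", round(root, 5))
print("weighted SS of column scores:",
      round(sum(cj * yj * yj for cj, yj in zip(c, col_scores)), 6))
print("weighted SS of row scores:   ",
      round(sum(ri * xi * xi for ri, xi in zip(r, row_scores)), 6))
```

In this sketch the column scores are on the standard scale (weighted sum-of-squares one), while the row scores, being plain averages of them, are on the principal scale (weighted sum-of-squares equal to the root). Dividing the row scores by the square root of the root would put them on the standard scale too.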
Digby, P.G.N. & Kempton, R.A. (1987). Multivariate Analysis of Ecological Communities. Chapman & Hall, London.
Greenacre, M.J. (1984). Theory and Applications of Correspondence Analysis. Academic Press, London.
Greenacre, M. (2007). Correspondence Analysis in Practice, second edition. Chapman & Hall, London.
Commands for: Multivariate and cluster analysis.
CAPTION 'CORANALYSIS example',\
        'Data from Table 9.1 of Greenacre (2007)'; STYLE=meta,plain
TEXT    Staff,St; VALUES=!T(Sen_Mngr,Jun_Mngr,Sen_Empl,Jun_Empl,Secretry),\
        !T(SM, JM, SE, JE, Sy)
&       Smoke; VALUES=!T(None,Light,Medium,Heavy)
MATRIX  [ROWS=Staff; COLUMNS=Smoke] Smoking; VALUES=\
        !( 4, 2, 3, 2, 4, 3, 7, 4, 25, 10, 12, 4, 18, 24, 33, 13, 10, 6, 7, 2)
PRINT   Smoking; FIELDWIDTH=8; DECIMALS=0
CAPTION 'Use CORANALYSIS, printing all results, saving SCORES only.'
CORANALYSIS [PRINT=roots,rowscores,colscores,rowinertia,colinertia;\
        METHOD=correspondence] Smoking; SAVE=cora1
"Print rowmass"
PRINT   cora1['rowmass']
"Plot the scores in the 1st and 2nd dimensions.
 Rows are in principal coordinates and columns are in standard coordinates.
 Figure 9.2 of Greenacre (2007)."
CABIPLOT [COLSCALING=standard] LROW=St