Performs principal components analysis.
Options
PRINT = string tokens | Printed output required (loadings, roots, residuals, scores, tests); default * i.e. no printing |
---|---|
NROOTS = scalar | Number of latent roots for printed output; default * requests them all to be printed |
SMALLEST = string token | Whether to print the smallest roots instead of the largest (yes, no); default no |
METHOD = string token | Whether to use sums of squares, correlations or variances and covariances (ssp, correlation, vcovariance, variancecovariance); default ssp |
Parameters
DATA = pointers or matrices or SSPMs | Pointer of variates forming the data matrix, or matrix storing the variate values by columns, or SSPM giving their sums of squares and products (or correlations) etc |
---|---|
LRV = LRVs | To store the principal component loadings, roots and trace from each analysis |
SSPM = SSPMs | To store the computed sum-of-squares-and-products or correlation matrix |
SCORES = matrices | To store the principal component scores |
RESIDUALS = matrices or variates | To store residuals from the dimensions fitted in the analysis (i.e. number of columns of the SCORES matrix, or as defined by the NROOTS option) |
SAVE = pointers | Saves details of the analysis; if unset, an unnamed save structure is saved automatically (and this can be accessed using the GET directive) |
Description
Principal components analysis finds linear combinations of a set of variates that maximize the variation contained within them, thereby displaying most of the original variability in a smaller number of dimensions. Principal components analysis operates on sums of squares and products, or a correlation matrix, or a matrix of variances and covariances, formed from the variates.
You supply the input for PCP using the first parameter; this list may have more than one entry, in which case Genstat repeats the analysis for each of the input structures. Instead of supplying an SSPM, you can supply a pointer containing the set of variates, or a matrix storing the variate values by columns. Genstat will then calculate the sums of squares and products, or correlations, or variances and covariances for the analysis (see option METHOD below).
For example, these two forms of input are equivalent:
SSPM [TERMS=Height,Length,Width,Weight] S
FSSPM S
PCP [PRINT=roots] S
and
PCP [PRINT=roots] !P(Height,Length,Width,Weight)
The first form, however, leaves the sums of squares and products available for later use in the SSPM S. In the second form the pointer is unnamed, but you may prefer to use a named pointer. For example:
POINTER [VALUES=Height,Length,Width,Weight] Dmat
PCP [PRINT=roots] Dmat
By default the PCP directive does not print any results: you use the PRINT option to specify what output you require. The printed output is in five sections, each with a corresponding setting, as illustrated in the examples below.
The columns of the matrices of principal component loadings and scores correspond to the latent roots. Each latent root corresponds to a single dimension, and gives the variability of the scores in that dimension. The loadings give the linear coefficients of the variables that are used to construct the scores in each dimension.
The significance tests are for equality of the k smallest roots l_i (i = 1, 2, …, k). The test statistic is
$$\left( n - \frac{2p + 11}{6} \right) \, k \left[ \log\left( \frac{1}{k} \sum_{i=1}^{k} l_i \right) - \frac{1}{k} \sum_{i=1}^{k} \log( l_i ) \right]$$
where n is the number of units and p is the number of variables. Asymptotically, the statistics have a chi-square distribution with (k+2)(k-1)/2 degrees of freedom. If any latent roots are zero, Genstat excludes them from the calculation of the test statistic; the effective value of p is reduced accordingly.
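As a quick arithmetic check, here is a sketch that assumes the dimensions of the example data set described in the Example section (n = 12 units, p = 4 variates) and a test on the k = 3 smallest roots. The choice k = 3 is purely illustrative.

```latex
\[
n - \frac{2p + 11}{6} = 12 - \frac{19}{6} = \frac{53}{6} \approx 8.83,
\qquad
\frac{(k+2)(k-1)}{2} = \frac{5 \times 2}{2} = 5 .
\]
```

Under these assumptions the statistic would be referred to a chi-square distribution on 5 degrees of freedom. Note that k = 1 gives zero degrees of freedom, so a test is only meaningful for k of 2 or more.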
If you omit the NROOTS option, Genstat prints by default the results corresponding to all the latent roots. The number of latent roots is the number of variates involved in the input to PCP. The NROOTS option allows you to print only part of the results, corresponding to the first or last r latent roots. You may then want to print the residuals formed from the remaining columns of scores. The residuals are all positive: this is because residuals from multivariate analyses generally occupy several dimensions, so they represent distances in multidimensional space and signs cannot be attached to them.
To print results corresponding to the r smallest latent roots, you must set option NROOTS to r and option SMALLEST to yes. Now if residuals are printed they will be formed from the scores corresponding to the largest roots. The NROOTS and SMALLEST options apply to the latent roots and vectors, the principal component scores and the residuals. So you cannot print directly, for example, the first two columns of scores and the last three columns of loadings. This is rarely required but, if necessary, it can be done by saving the relevant results and printing them separately.
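The effect of NROOTS and SMALLEST might be sketched as follows. This is a minimal illustration, assuming the example data file used in the Example section is available; the choice of two roots is arbitrary.

```genstat
" Read the variates of the shipped example file into the pointer Dmat
  (the same FILEREAD call as in the Example section). "
FILEREAD [NAME='%gendir%/examples/PCP-1.DAT';\
          IMETHOD=read; ISAVE=Dmat] FGROUPS=no

" Results for the two largest roots; the residuals are formed from
  the remaining columns of scores. "
PCP [PRINT=roots,loadings,scores,residuals; NROOTS=2] Dmat

" Results for the two smallest roots; the residuals are now formed
  from the scores corresponding to the largest roots. "
PCP [PRINT=roots,loadings,scores,residuals; NROOTS=2; SMALLEST=yes] Dmat
```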
By default, the PCP directive operates on the SSPM but you can set the METHOD option to correlation to operate on a derived matrix of correlations, or to vcovariance (or its synonym variancecovariance) to use variances and covariances. Note that when correlations are analysed the significance-test statistics no longer have asymptotic chi-square distributions.
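As a hedged sketch of the METHOD settings described above, again assuming the example data file from the Example section:

```genstat
" Read the example variates into the pointer Dmat. "
FILEREAD [NAME='%gendir%/examples/PCP-1.DAT';\
          IMETHOD=read; ISAVE=Dmat] FGROUPS=no

" Analyse variances and covariances instead of the default
  sums of squares and products. "
PCP [PRINT=roots,loadings; METHOD=vcovariance] Dmat

" Analyse correlations; tests are not requested here because their
  statistics no longer have asymptotic chi-square distributions. "
PCP [PRINT=roots,loadings; METHOD=correlation] Dmat
```

Correlations are the usual choice when the variates are measured on very different scales, since each variate then contributes equally to the total variation.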
The LRV parameter allows you to save the principal component loadings, the latent roots, and their sum (the trace) in an LRV structure, while the SCORES parameter saves the principal component scores in a matrix. If you have declared the LRV already, its number of rows must be the same as the number of variates supplied in an input pointer or implied by an input SSPM. The number of rows of the SCORES matrix, if previously declared, must be equal to the number of units.
The number of columns of the LRV and of the SCORES matrix corresponds to the number of dimensions to be saved from the analysis, and this must be the same for both of them. If the structures have been declared already, Genstat will take the larger of the numbers of columns declared for either, and declare (or redeclare) the other one to match. If neither has been declared and option SMALLEST retains the default setting no, Genstat takes the number of columns from the setting of the NROOTS option. Otherwise, Genstat saves results for the full set of dimensions. The trace saved as the third component of the LRV structure, however, will contain the sum of all the latent roots, whether or not they have all been saved. Procedure LRVSCREE can be used to produce a “scree” diagram which can be helpful in deciding how many dimensions to save.
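The saving rules above might be exercised as in the following sketch. The identifiers Lall, L and Sc are hypothetical names chosen for illustration, and the call LRVSCREE Lall assumes that the procedure accepts an LRV as its parameter; check the LRVSCREE documentation before relying on it.

```genstat
" Read the example variates into the pointer Dmat. "
FILEREAD [NAME='%gendir%/examples/PCP-1.DAT';\
          IMETHOD=read; ISAVE=Dmat] FGROUPS=no

" Save the full set of roots and inspect them in a scree diagram
  (assumed call form for the LRVSCREE procedure). "
PCP Dmat; LRV=Lall
LRVSCREE Lall

" Neither L nor Sc has been declared and SMALLEST keeps its default,
  so NROOTS=2 sets the number of columns saved in both structures. "
PCP [NROOTS=2] Dmat; LRV=L; SCORES=Sc
PRINT L[]
PRINT Sc
```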
The SSPM parameter can save the SSPM structure used for the analysis. A particularly convenient instance is when you have supplied an SSPM structure as input but, for example, have set METHOD=correlation: the SSPM that is saved will then contain correlations instead of sums of squares and products.
The RESIDUALS parameter allows you to save the principal component residuals, in a matrix with number of rows equal to the number of units and one column. If the latent roots and vectors (loadings) are saved from the analysis, the residuals will correspond to the dimensions not saved; the same applies if you save scores. If neither the LRV nor scores are saved, the saved residuals will correspond to the smallest latent roots not printed.
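A minimal sketch of saving residuals alongside scores, assuming the example data file from the Example section; Sc and Res are hypothetical identifiers:

```genstat
" Read the example variates into the pointer Dmat. "
FILEREAD [NAME='%gendir%/examples/PCP-1.DAT';\
          IMETHOD=read; ISAVE=Dmat] FGROUPS=no

" Save two dimensions of scores; Res then holds the residuals
  from the dimensions that were not saved. "
PCP [NROOTS=2] Dmat; SCORES=Sc; RESIDUALS=Res
PRINT Res
```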
The SAVE parameter can supply a pointer to save a multivariate save structure containing all the details of the analysis. If this is unset, an unnamed save structure is saved automatically (and this can be accessed using the GET directive). Alternatively, you can set SAVE=* to prevent any save structure being formed if, for example, you have a very large data set and want to avoid committing the storage space.
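The two explicit uses of SAVE could look like this sketch, which assumes the example data file from the Example section; PcpSave is a hypothetical identifier:

```genstat
" Read the example variates into the pointer Dmat. "
FILEREAD [NAME='%gendir%/examples/PCP-1.DAT';\
          IMETHOD=read; ISAVE=Dmat] FGROUPS=no

" Keep the details of the analysis in a named save structure. "
PCP [PRINT=roots] Dmat; SAVE=PcpSave

" Suppress the save structure altogether, for example to spare
  storage with a very large data set. "
PCP [PRINT=roots] Dmat; SAVE=*
```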
If you want principal component scores or residuals to be printed or saved from the analysis, the original data must be available. The matrices to save such results must have been declared with as many rows as the variates have values, ignoring any restriction. You can calculate the analysis from one subset of units, but calculate the scores and residuals for all the units, by using as input to PCP an SSPM structure formed using a weight variate with zeros for the excluded sampling units and unity for those to be included. For example, to exclude a known set of outliers from an analysis, but to print scores for them, these statements could be used:
POINTER [NVALUES=5] V
FACTOR [LABELS=!T(No,Yes)] Outlier
READ [CHANNEL=2] Outlier,V[]
CALCULATE Wt = Outlier .IN. 'No'
SSPM [TERMS=V] S
FSSPM [WEIGHT=Wt] S
PCP [PRINT=scores] S
Principal component regression is provided by procedure RIDGE.
Options: PRINT, NROOTS, SMALLEST, METHOD.
Parameters: DATA, LRV, SSPM, SCORES, RESIDUALS, SAVE.
Action with RESTRICT
If the variables used to form the SSPM structure are restricted, then the analysis will be subject to that restriction. Similarly, if a pointer to a set of variates is used as input to PCP, then any restriction on the variates will be taken into account by the analysis.
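The behaviour with restricted variates might be sketched as follows. The restriction condition (units at or below the mean of the first variate) is hypothetical and chosen only to produce a subset; Cut is an illustrative identifier.

```genstat
" Read the example variates into the pointer Dmat. "
FILEREAD [NAME='%gendir%/examples/PCP-1.DAT';\
          IMETHOD=read; ISAVE=Dmat] FGROUPS=no

" Hypothetical restriction: keep the units whose first variate
  does not exceed its mean. "
CALCULATE Cut = MEAN(Dmat[1])
RESTRICT Dmat[]; CONDITION=Dmat[1] .LE. Cut

" The analysis now uses only the units in the restricted set. "
PCP [PRINT=roots,loadings] Dmat

" Remove the restriction again. "
RESTRICT Dmat[]
```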
See also
Directives: CVA, FCA, MDS, PCO, ROTATE, SSPM.
Procedures: LRVSCREE, DBIPLOT, DMST, PLS, PCPCLUSTER, RIDGE.
Commands for: Multivariate and cluster analysis.
Example
" Genstat example PCP-1: Principal Components analysis This example carries out a principal components analysis of four variates, each of length 12 " " The data are in a file called 'PCP-1.DAT' and names for the data columns are on the first line. To carry out the analysis, the PCP directive requires the data in a matrix storing the variate values by column, SSPM giving their sums of squares and products (or correlations), or most simply, a pointer containing the set of variates. Therefore use the ISAVE option to save the data variates into a pointer called Dmat. " FILEREAD [NAME='%gendir%/examples/PCP-1.DAT';\ IMETHOD=read; ISAVE=Dmat] FGROUPS=no " Carry out the principal component analysis on the supplied data, printing the resulting latent roots, component scores and loadings, and significance tests for equality of the k smallest roots. " PCP [PRINT=roots,scores,tests,loadings] Dmat " Carry out the analysis again, printing no output, but writing the score values for the first two principal components, to the matrix PCPscore. " PCP Dmat; SCORES=PCPscore PRINT PCPscore " If we specify a data pointer as input to the PCP directive, Genstat will automatically calculate the sums of squares and product, correlations or variances and covariances for the analysis. Another approach is to supply a SSPM matrix which has been previously declared. This approach has the advantage that we have the sums of squares and products readily available for later use. Therefore, declare the sums of squares and product matrix for our set of data and then form this matrix SS. " SSPM [TERMS=Dmat[]] SS FSSPM [PRINT=sspm] SS " Carry out the analysis once more, this time using the sums of squares and product matrix SS as input and write the component loadings, latent roots and trace to the LRV structure L. " PCP [PRINT=roots] SS; LRV=L PRINT L[] " Carry out the analysis again with the sums of squares and product matrix SS as input and using the correlation method. 
By setting SSPM=SS when METHOD=correlation, the matrix SS will be overwritten and will then contain correlations instead sums of squares and products. " PCP [PRINT=roots; NROOTS=2; METHOD=correlation] SS; SSPM=SS PRINT SS