PCP directive

Performs principal components analysis.

Options

`PRINT` = string tokens	Printed output required (`loadings, roots, residuals, scores, tests`); default `*` i.e. no printing
`NROOTS` = scalar	Number of latent roots for printed output; default `*` requests them all to be printed
`SMALLEST` = string token	Whether to print the smallest roots instead of the largest (`yes, no`); default `no`
`METHOD` = string token	Whether to use sums of squares, correlations or variances and covariances (`ssp`, `correlation`, `vcovariance`, `variancecovariance`); default `ssp`

Parameters

`DATA` = pointers or matrices or SSPMs	Pointer of variates forming the data matrix, or matrix storing the variate values by columns, or SSPM giving their sums of squares and products (or correlations) etc
`LRV` = LRVs	To store the principal component loadings, roots and trace from each analysis
`SSPM` = SSPMs	To store the computed sum-of-squares-and-products or correlation matrix
`SCORES` = matrices	To store the principal component scores
`RESIDUALS` = matrices or variates	To store residuals from the dimensions fitted in the analysis (i.e. number of columns of the `SCORES` matrix, or as defined by the `NROOTS` option)
`SAVE` = pointers	Saves details of the analysis; if unset, an unnamed save structure is saved automatically (and this can be accessed using the `GET` directive)

Description

Principal components analysis finds linear combinations of a set of variates that maximize the variation contained within them, thereby displaying most of the original variability in a smaller number of dimensions. Principal components analysis operates on sums of squares and products, or a correlation matrix, or a matrix of variances and covariances, formed from the variates.

You supply the input for PCP using the first parameter; this list may have more than one entry, in which case Genstat repeats the analysis for each of the input structures. Instead of supplying an SSPM, you can supply a pointer containing the set of variates, or a matrix storing the variate values by columns. Genstat will then calculate the sums of squares and products, or correlations, or variances and covariances for the analysis (see option METHOD below).

For example, these two forms of input are equivalent:

SSPM [TERMS=Height,Length,Width,Weight] S

FSSPM S

PCP [PRINT=roots] S

and

PCP [PRINT=roots] !P(Height,Length,Width,Weight)

The first form, however, leaves the sums of squares and products available for later use in the SSPM S. In the second form the pointer is unnamed, but you may prefer to use a named pointer. For example:

POINTER [VALUES=Height,Length,Width,Weight] Dmat

PCP [PRINT=roots] Dmat

By default the PCP directive does not print any results: you use the PRINT option to specify what output you require. The printed output is in five sections, each with a corresponding setting, as illustrated in the examples below.

The columns of the matrices of principal component loadings and scores correspond to the latent roots. Each latent root corresponds to a single dimension, and gives the variability of the scores in that dimension. The loadings give the linear coefficients of the variables that are used to construct the scores in each dimension.

The significance tests are for equality of the k smallest roots, l_i (i = 1, 2, …, k). The test statistic is

$$
\left( n - \frac{2p + 11}{6} \right) k \left[ \log\left( \frac{1}{k} \sum_{i=1}^{k} l_i \right) - \frac{1}{k} \sum_{i=1}^{k} \log l_i \right]
$$

where n is the number of units and p is the number of variables. Asymptotically, the statistic has a chi-square distribution with (k+2)(k-1)/2 degrees of freedom. If any latent roots are zero, Genstat excludes them from the calculation of the test statistic; the effective value of p is reduced accordingly.
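As a quick check on the degrees-of-freedom expression (k+2)(k-1)/2, here it is evaluated for a few small values of k. This is plain arithmetic rather than Genstat output:

```latex
\begin{aligned}
k = 2 &: \quad \tfrac{(2+2)(2-1)}{2} = 2 \\
k = 3 &: \quad \tfrac{(3+2)(3-1)}{2} = 5 \\
k = 4 &: \quad \tfrac{(4+2)(4-1)}{2} = 9
\end{aligned}
```

For k = 1 the expression is zero, which reflects the fact that there is nothing to test when only one root is considered.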

If you omit the NROOTS option, Genstat prints by default the results corresponding to all the latent roots. The number of latent roots is the number of variates involved in the input to PCP. The NROOTS option allows you to print only part of the results, corresponding to the first or last r latent roots. You may then want to print the residuals formed from the remaining columns of scores. The residuals are all positive: this is because residuals from multivariate analyses generally occupy several dimensions, so they represent distances in multidimensional space and signs cannot be attached to them.

To print results corresponding to the r smallest latent roots, you must set option NROOTS to r and option SMALLEST to yes. Now if residuals are printed they will be formed from the scores corresponding to the largest roots. The NROOTS and SMALLEST options apply to the latent roots and vectors, the principal component scores and the residuals. So you cannot print directly, for example, the first two columns of scores and the last three columns of loadings. This is rarely required but, if necessary, it can be done by saving the relevant results and printing them separately.
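As a minimal sketch of these two options, assuming the named pointer Dmat from the earlier example holds the data variates, the following statements print the roots and loadings first for the two largest roots and then for the two smallest:

```genstat
" Results for the two largest latent roots (SMALLEST defaults to no) "
PCP [PRINT=roots,loadings; NROOTS=2] Dmat

" Results for the two smallest latent roots instead "
PCP [PRINT=roots,loadings; NROOTS=2; SMALLEST=yes] Dmat
```

Both statements use only the option settings listed above; the choice of two roots is purely illustrative.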

By default, the PCP directive operates on the sums of squares and products, but you can set the METHOD option to correlation to operate on a derived matrix of correlations, or to vcovariance (or its synonym variancecovariance) to use variances and covariances. Note that when correlations are analysed, the significance-test statistics no longer have asymptotic chi-square distributions.
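The statements below sketch how the METHOD settings might be used, again assuming the pointer Dmat from the earlier example. The tests are requested only for the variance-covariance analysis, since the chi-square approximation does not hold for correlations:

```genstat
" Analysis based on the correlation matrix of the variates in Dmat "
PCP [PRINT=roots,loadings; METHOD=correlation] Dmat

" Analysis based on variances and covariances, with significance tests "
PCP [PRINT=roots,loadings,tests; METHOD=vcovariance] Dmat
```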

The LRV parameter allows you to save the principal component loadings, the latent roots, and their sum (the trace) in an LRV structure, while the SCORES parameter saves the principal component scores in a matrix. If you have declared the LRV already, its number of rows must be the same as the number of variates supplied in an input pointer or implied by an input SSPM. The number of rows of the SCORES matrix, if previously declared, must be equal to the number of units.

The number of columns of the LRV and of the SCORES matrix corresponds to the number of dimensions to be saved from the analysis, and this must be the same for both of them. If the structures have been declared already, Genstat will take the larger of the numbers of columns declared for either, and declare (or redeclare) the other one to match. If neither has been declared and option SMALLEST retains the default setting no, Genstat takes the number of columns from the setting of the NROOTS option. Otherwise, Genstat saves results for the full set of dimensions. The trace saved as the third component of the LRV structure, however, will contain the sum of all the latent roots, whether or not they have all been saved. Procedure LRVSCREE can be used to produce a “scree” diagram, which can be helpful in deciding how many dimensions to save.
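A short sketch of saving results, assuming the pointer Dmat from the earlier example and that neither structure has been declared beforehand. The identifiers L and Sc are illustrative names, not ones required by PCP:

```genstat
" Save loadings, roots and trace in the LRV L, and scores in the matrix Sc.
  With neither structure declared and SMALLEST left at its default,
  NROOTS=2 should give both structures two columns. "
PCP [NROOTS=2] Dmat; LRV=L; SCORES=Sc

" Print the components of the LRV, then the scores "
PRINT L[]
PRINT Sc
```

The trace printed as part of L is still the sum of all the latent roots, even though only two dimensions are saved.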

The SSPM parameter can save the SSPM structure used for the analysis. A particularly convenient instance is when you have supplied an SSPM structure as input but, for example, have set METHOD=correlation: the SSPM that is saved will then contain correlations instead of sums of squares and products.

The RESIDUALS parameter allows you to save the principal component residuals, in a matrix with number of rows equal to the number of units and one column. If the latent roots and vectors (loadings) are saved from the analysis, the residuals will correspond to the dimensions not saved; the same applies if you save scores. If neither the LRV nor scores are saved, the saved residuals will correspond to the smallest latent roots not printed.
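For instance, the following sketch saves scores for two dimensions together with the residuals from the remaining dimensions. It assumes the pointer Dmat from the earlier example; Sc and Res are illustrative names:

```genstat
" Scores for the first two dimensions; residuals from the dimensions not saved "
PCP [NROOTS=2] Dmat; SCORES=Sc; RESIDUALS=Res
PRINT Sc
PRINT Res
```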

The SAVE parameter can supply a pointer to save a multivariate save structure containing all the details of the analysis. If this is unset, an unnamed save structure is saved automatically (and this can be accessed using the GET directive). Alternatively, you can set SAVE=* to prevent any save structure being formed if, for example, you have a very large data set and want to avoid committing the storage space.
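The two uses of SAVE described above might look like this, assuming the pointer Dmat from the earlier example; PcSave is an illustrative name for the save structure:

```genstat
" Keep the details of the analysis in a named save structure "
PCP [PRINT=roots] Dmat; SAVE=PcSave

" Suppress the save structure altogether, e.g. for a very large data set "
PCP [PRINT=roots] Dmat; SAVE=*
```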

If you want principal component scores or residuals to be printed or saved from the analysis, the original data must be available. The matrices to save such results must have been declared with as many rows as the variates have values, ignoring the restriction. You can calculate the analysis from one subset of units, but calculate the scores and residuals for all the units, by using as input to PCP an SSPM structure formed using a weight variate with zeros for the excluded sampling units and unity for those to be included. For example, to exclude a known set of outliers from an analysis, but to print scores for them, these statements could be used:

POINTER [NVALUES=5] V

FACTOR [LABELS=!T(No,Yes)] Outlier

READ [CHANNEL=2] Outlier,V[]

CALCULATE Wt = Outlier .IN. 'No'

SSPM [TERMS=V] S

FSSPM [WEIGHT=Wt] S

PCP [PRINT=scores] S

Principal component regression is provided by procedure RIDGE.
Options: PRINT, NROOTS, SMALLEST, METHOD.
Parameters: DATA, LRV, SSPM, SCORES, RESIDUALS, SAVE.

Action with `RESTRICT`

If the variables used to form the SSPM structure are restricted, then the analysis will be subject to that restriction. Similarly, if a pointer to a set of variates is used as input to PCP, then any restriction on the variates will be taken into account by the analysis.
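As a sketch of the second case, the statements below restrict the variates in the pointer Dmat before the analysis. Group is a hypothetical variate, not part of the examples on this page, and the condition is only an illustration:

```genstat
" Restrict every variate in Dmat to the units where Group equals 1 "
RESTRICT Dmat[]; CONDITION=Group.EQ.1

" The analysis now uses only the units in the restricted set "
PCP [PRINT=roots,loadings] Dmat
```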

Example

" Genstat example PCP-1: Principal Components analysis

  This example carries out a principal components analysis of
  four variates, each of length 12
"

"
  The data are in a file called 'PCP-1.DAT' and names for
  the data columns are on the first line.

  To carry out the analysis, the PCP directive requires the data
  in a matrix storing the variate values by column, or an SSPM giving
  their sums of squares and products (or correlations), or, most simply,
  a pointer containing the set of variates. Therefore use the ISAVE
  option to save the data variates into a pointer called Dmat.
"
FILEREAD [NAME='%gendir%/examples/PCP-1.DAT';\ 
  IMETHOD=read; ISAVE=Dmat] FGROUPS=no

"
  Carry out the principal component analysis on the supplied data, printing
  the resulting latent roots, component scores and loadings, and significance
  tests for equality of the k smallest roots.
"
PCP [PRINT=roots,scores,tests,loadings] Dmat

"
  Carry out the analysis again, printing no output, but saving the
  scores for the first two principal components in the matrix
  PCPscore (NROOTS=2 limits the saved dimensions to two).
"
PCP [NROOTS=2] Dmat; SCORES=PCPscore
PRINT PCPscore

"
  If we specify a data pointer as input to the PCP directive, Genstat
  will automatically calculate the sums of squares and products,
  correlations, or variances and covariances for the analysis.

  Another approach is to supply an SSPM structure which has been
  previously declared. This approach has the advantage that we have the
  sums of squares and products readily available for later use.

  Therefore, declare the sums-of-squares-and-products matrix SS for our
  set of data and then form it.
"
SSPM [TERMS=Dmat[]] SS
FSSPM [PRINT=sspm] SS

"
  Carry out the analysis once more, this time using the
  sums-of-squares-and-products matrix SS as input, and save the
  component loadings, latent roots and trace in the LRV structure L.
"
PCP [PRINT=roots] SS; LRV=L
PRINT L[]

"
  Carry out the analysis again with the sums-of-squares-and-products
  matrix SS as input and using the correlation method. By setting SSPM=SS
  when METHOD=correlation, the matrix SS will be overwritten and will then
  contain correlations instead of sums of squares and products.
"
PCP [PRINT=roots; NROOTS=2; METHOD=correlation] SS; SSPM=SS
PRINT SS

Updated on February 7, 2023
