Forms groups of units using the densities of their `PCP` scores (R.W. Payne).

### Options

| Option | Description |
|---|---|
| `PRINT` = string tokens | What to print (`cellclusters`, `density`, `summary`); default `summ` |
| `PLOT` = string tokens | What to plot (`cellclusters`, `density`, `histogram`, `summary`); default `cell`, `dens`, `hist` |
| `NROOTS` = scalars | Numbers of dimensions to use; default 2 |
| `NPARTITIONS` = scalars | Numbers of partitions in each dimension; default 10 |
| `CLUSTERS` = pointer | Saves variates defining the clusters for each minimum number of points |
| `CELLCLUSTERS` = pointer | Saves tables containing the clusters of cells for each minimum number of points |
| `DENSITY` = table | Saves the table of cell densities |
| `SUMMARY` = pointer | Saves the summary table |
| `MINUNITS` = variate or scalar | Minimum numbers of units within cells at which to form clusters |

### Parameter

| Parameter | Description |
|---|---|
| `SAVE` = pointer | Save structure from the `PCP` analysis to use; default uses the most recent analysis |

### Description

The `PCPCLUSTER` procedure provides a way to perform cluster analysis for a large data set. The first simplification is that it reduces the number of attributes of the units by taking scores from a `PCP` analysis. The `SAVE` parameter supplies the save structure from the `PCP` analysis that is to be used. The default is to use the most recent analysis. The `NROOTS` option specifies the number of dimensions of scores to use; default 2.

The second simplification addresses the space and computing problems that occur when there are large numbers of units. Instead of forming a unit-by-unit similarity matrix, the algorithm, in the `PTFCLUSTERS` procedure, divides the multi-dimensional space defined by the scores into cells, and forms a density table by tabulating the number of units in each cell. The `NPARTITIONS` option specifies the number of cells to form in each dimension; default 10. The clusters are formed by finding contiguous collections of cells in which the density (or number of units) exceeds thresholds specified by the `MINUNITS` option. The units in these clusters of cells will be connected to each other in a similar way to the units in a hierarchical cluster analysis. Note, though, that points in sparsely populated parts of the space will not be allocated to any cluster. These units can thus be identified as unusual or aberrant. The default for `MINUNITS` is to use a list of values calculated as the maximum density multiplied by 0.8, 0.75, 0.7, 0.65, 0.6, 0.55, 0.5, 0.45, 0.4, 0.35, 0.3, 0.25 and 0.2.
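The gridding step can be illustrated outside Genstat. The following Python sketch is not the `PTFCLUSTERS` code: the function names `density_table` and `default_minunits` are hypothetical, and it assumes equal-width cells spanning the observed range of each dimension, which the text above does not specify. It also leaves the default thresholds unrounded, which may differ from what the procedure does.

```python
import numpy as np

def density_table(scores, npartitions=10):
    """Count the units falling in each cell of a regular grid over the scores.

    Assumption: cells have equal width between the minimum and maximum score
    in each dimension; the real procedure may place its partitions differently.
    """
    scores = np.asarray(scores, dtype=float)
    n_units, n_dims = scores.shape
    cells = np.empty((n_units, n_dims), dtype=int)
    for d in range(n_dims):
        lo, hi = scores[:, d].min(), scores[:, d].max()
        span = hi - lo
        width = span / npartitions if span > 0 else 1.0
        idx = np.floor((scores[:, d] - lo) / width).astype(int)
        # The maximum score would land one past the last cell, so clip it back.
        cells[:, d] = np.clip(idx, 0, npartitions - 1)
    density = np.zeros((npartitions,) * n_dims, dtype=int)
    # Unbuffered add, so units sharing a cell are all counted.
    np.add.at(density, tuple(cells.T), 1)
    return density, cells

def default_minunits(density):
    """Maximum density multiplied by 0.8, 0.75, ..., 0.2 (13 values)."""
    fractions = np.arange(80, 15, -5) / 100
    return density.max() * fractions
```

With two dimensions and `npartitions=10` this gives a 10 x 10 table of counts, so the storage needed depends on the grid size rather than on the number of units.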

`PTFCLUSTERS` starts with the first `MINUNITS` value and finds a cell containing more than that number of units. This is the starting point for the first cluster. Additional cells are added to the cluster if they neighbour a cell already in the cluster and themselves contain more than that minimum number of units. When this cluster is complete, `PTFCLUSTERS` looks for a cell that is not in the cluster but which contains more than the minimum number of units. This provides the starting point for another cluster. The process continues until all the cells with more than that minimum number of units have been allocated to a cluster. `PTFCLUSTERS` then takes the next `MINUNITS` value and expands the clusters to contain neighbours with that smaller minimum number of units, merging clusters if they become neighbours. For each `MINUNITS` value, `PTFCLUSTERS` records the number of clusters, the mean number of units within the cells inside and outside the clusters, the mean number of units within the cells just inside and just outside the boundaries, the minimum number of units within cells on the boundaries, and the maximum number of units within cells just outside the boundaries. This summary information should help to assess which `MINUNITS` value gives the best set of clusters.
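The cell-growing step described above can be sketched as a flood fill over the density table. This Python illustration is not the `PTFCLUSTERS` implementation, and `neighbours` and `cluster_cells` are hypothetical names. It assumes that neighbouring cells are those sharing a face, and that a cell qualifies when its count is strictly greater than the threshold. It also relabels from scratch at each threshold instead of expanding and merging the previous clusters; for decreasing thresholds this should give the same grouping of cells, though the cluster numbers may differ.

```python
from collections import deque
import numpy as np

def neighbours(cell, shape):
    """Yield the cells sharing a face with `cell` (assumed neighbourhood)."""
    for d in range(len(shape)):
        for step in (-1, 1):
            pos = cell[d] + step
            if 0 <= pos < shape[d]:
                yield cell[:d] + (pos,) + cell[d + 1:]

def cluster_cells(density, minunits):
    """Label contiguous cells holding more than `minunits` units.

    Returns the label table (0 = not allocated) and the number of clusters.
    """
    density = np.asarray(density)
    dense = density > minunits
    labels = np.zeros(density.shape, dtype=int)
    n_clusters = 0
    for start in zip(*np.nonzero(dense)):
        start = tuple(int(i) for i in start)
        if labels[start]:
            continue  # already absorbed into an earlier cluster
        n_clusters += 1
        labels[start] = n_clusters
        queue = deque([start])
        while queue:
            cell = queue.popleft()
            for nb in neighbours(cell, density.shape):
                if dense[nb] and not labels[nb]:
                    labels[nb] = n_clusters
                    queue.append(nb)
    return labels, n_clusters

# Two clusters at threshold 3 merge into one when the threshold drops to 0.
density = np.array([[5, 4, 1, 4, 6]])
for threshold in (3, 0):
    labels, n_clusters = cluster_cells(density, threshold)
    print(threshold, n_clusters, labels.tolist())
```

Running the loop over a decreasing list of thresholds gives one label table per `MINUNITS` value, which is the kind of information that `CELLCLUSTERS` is described as saving.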

The `PRINT` option controls the printed output, with settings:

`cellclusters` shows how the cells are clustered for each minimum number of units,

`density` prints the table showing the number of units in each cell,

`summary` prints the summary information recorded for each minimum number of units (default).

The `PLOT` option controls the graphical output, with settings:

`cellclusters` displays the clustering of the cells for each minimum number of points, as a shade plot if there are 2 dimensions or as a 3-d graph if there are 3,

`density` displays shade plots showing the numbers of units in each pair of dimensions,

`histogram` plots a histogram of the numbers of units in the cells,

`summary` plots the summary information against the minimum numbers of units.

The default is to plot all of these.

The `CLUSTERS` option can save a pointer containing details of the clusters of units formed at each `MINUNITS` value. The clusters have integer numbers, from one upwards. The pointer contains a variate for each `MINUNITS` value. These contain either cluster numbers, or missing values for units in cells that have not been allocated to any cluster.
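As an illustration of how per-unit cluster numbers follow from the cell clusters, the Python sketch below looks up each unit's cell in a table of cell labels. The name `unit_clusters` and both inputs are hypothetical: `cells` is assumed to hold one row of grid indices per unit, and `cell_labels` is assumed to use 0 for cells that belong to no cluster. `NaN` stands in for the Genstat missing value.

```python
import numpy as np

def unit_clusters(cells, cell_labels):
    """Return each unit's cluster number, or NaN if its cell is unallocated."""
    cells = np.asarray(cells, dtype=int)
    # One index array per dimension picks out the label of each unit's cell.
    labels = np.asarray(cell_labels)[tuple(cells.T)].astype(float)
    labels[labels == 0] = np.nan
    return labels

# Three units: the second lies in a cell that belongs to no cluster.
clusters = unit_clusters([[0, 0], [0, 1], [1, 1]], [[1, 0], [0, 2]])

# Recode missing values as 0 so that unallocated units form their own group.
recoded = np.where(np.isnan(clusters), 0, clusters)
```

The final recoding plays the same role as replacing the missing values by zero before forming groups from a saved variate.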

The `CELLCLUSTERS` option can similarly save a pointer containing details of the clusters of cells formed at each `MINUNITS` value. The pointer contains a table for each `MINUNITS` value. These contain either a cluster number, or a missing value for cells that have not been allocated to any cluster.

The `DENSITY` option can save the table containing the number of units within each cell.

The `SUMMARY` option can save the summary table, in a pointer with elements labelled `'Min. no. points'`, `'No. clusters'`, `'Mean inside clusters'`, `'Mean outside clusters'`, `'Mean on boundary'`, `'Mean outside boundary'`, `'Min. on boundary'` and `'Max. outside boundary'`.

Options: `PRINT`, `PLOT`, `NROOTS`, `NPARTITIONS`, `CLUSTERS`, `CELLCLUSTERS`, `DENSITY`, `SUMMARY`, `MINUNITS`.

Parameter: `SAVE`.

### Method

`PCPCLUSTER` calls the `PTFCLUSTERS` procedure to cluster the cells.

### See also

Directives: `CLUSTER`, `PCP`.

Procedures: `PTFCLUSTERS`, `PTFILLCLUSTERS`.

Commands for: Multivariate and cluster analysis.

### Example

    CAPTION 'PCPCLUSTER example'; STYLE=meta
    SPLOAD '%data%/iris.gsh'
    PCP [PRINT=loadings,roots] !p(Sepal_Length,Sepal_Width,Petal_Length,Petal_Width);\
        SCORES=Scores
    PEN 1,2,3; SYMBOL='circle'; CFILL='match'
    DGRAPH Scores$[*;1]; Scores$[*;2]; PEN=Species
    PCPCLUSTER [PRINT=cellclusters,density,summary; PLOT=cellclusters,density,histogram,summary;\
        NROOTS=2; NPARTITIONS=8; CLUSTERS=clust]
    CALCULATE clust2 = MVREPLACE(clust[2]; 0)
    GROUPS clust2; FACTOR=Clusters
    TABULATE [PRINT=Counts; CLASSIFICATION=Species,Clusters]