Forms a non-hierarchical classification.

### Options

| Option | Description |
|---|---|
| `PRINT` = string tokens | Printed output required (`criterion`, `optimum`, `units`, `typical`, `initial`, `random`); default `*` i.e. no printing |
| `DATA` = matrix or pointer | Data from which the classification is formed, supplied as a units-by-variates matrix or as a pointer containing the variates of the data matrix |
| `CRITERION` = string token | Criterion for clustering (`sums`, `predictive`, `within`, `Mahalanobis`); default `sums` |
| `INTERCHANGE` = string token | Permitted moves between groups (`transfer`, `swop`); default `tran` (implies `swop` also) |
| `START` = factor | Initial classification; default `*` splits the units, in order, into `NGROUPS` classes of nearly equal size |
| `NSTARTS` = scalar | Number of random starting configurations to be used; default 0 |
| `SEED` = scalar | Seed for the random numbers used to form random starting configurations; default 0 |

### Parameters

| Parameter | Description |
|---|---|
| `NGROUPS` = scalars | Numbers of classes into which the units are to be classified; note that the values of the scalars must be in descending order |
| `GROUPS` = factors | Saves the classification formed for each number of classes |
| `CRITERIONVALUE` = scalars | Saves the criterion values (representing within-class homogeneity) |
| `BCRITERIONVALUE` = scalars | Saves the subsidiary criterion values (representing between-class heterogeneity for maximal predictive classification) |
| `MEANS` = matrices | Saves the variate means for the groups of each classification |
| `PREDICTORS` = matrices | Saves the group predictors from maximal predictive classification |

### Description

Printed output is controlled by the `PRINT` option. This has the following possible settings.

| Setting | Effect |
|---|---|
| `criterion` | prints the optimal criterion value. |
| `optimum` | prints the optimal classification. |
| `units` | prints the data with the units ordered into the optimal classes. |
| `typical` | prints a typical value for each class: for maximal predictive classification this is the class predictor; for the other methods it is the class mean. |
| `initial` | if this is set, the requested sections of output are also printed for the initial classification. |
| `random` | if this is set, the requested sections of output are also printed for the optimum configuration obtained from every random start. |

The `DATA` option supplies the data to be classified. This specifies a single structure that must be either a matrix, with rows corresponding to the units and columns to the variables, or a pointer whose values are the identifiers of the variates in the data matrix. Internally, `CLUSTER` operates on a matrix, and so it will copy the variate values into a matrix if you supply a pointer as input; thus, it is more efficient to supply a matrix, especially with large data sets.

The `CRITERION` option specifies which criterion `CLUSTER` is to optimize. The four available settings are `sums`, `predictive` (maximal predictive classification), `within` and `Mahalanobis`. The default is `sums`.

The `INTERCHANGE` option specifies which types of interchange (transfers or swops) are to be used. The default is `transfer`, which is taken to imply that both transfers and swops are used, since a swop is simply two transfers. If you set `INTERCHANGE=swop`, only swops are used. If `INTERCHANGE=*`, the algorithm does not attempt to improve the classification from the initial classification; you might want this, in conjunction with the `PRINT=initial` setting, to display the results for an existing classification that you do not wish to improve.
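The difference between the two kinds of move can be illustrated with a small Python sketch. This is not Genstat code and not `CLUSTER`'s actual algorithm: the function names, the simple hill-climbing loop, the rule that a transfer may not empty a class, and the use of the within-class sum of squares as the criterion are all assumptions made for illustration. A transfer moves one unit to another class; a swop exchanges the classes of two units.

```python
def within_ss(data, labels):
    """Within-class sum of squares, summed over all classes."""
    classes = {}
    for row, g in zip(data, labels):
        classes.setdefault(g, []).append(row)
    total = 0.0
    for rows in classes.values():
        means = [sum(col) / len(rows) for col in zip(*rows)]
        total += sum((x - m) ** 2 for row in rows for x, m in zip(row, means))
    return total


def improve(data, labels, n_groups, interchange="transfer"):
    """Hill-climb from `labels` (hypothetical helper, not part of Genstat).

    interchange="transfer": try transfers and swops;
    interchange="swop": try swops only;
    interchange=None: return the initial classification unchanged.
    """
    labels = list(labels)
    best = within_ss(data, labels)
    if interchange is None:
        return labels, best
    improved = True
    while improved:
        improved = False
        if interchange == "transfer":
            for i in range(len(labels)):
                for g in range(1, n_groups + 1):
                    # Skip no-op moves and moves that would empty a class.
                    if g == labels[i] or labels.count(labels[i]) == 1:
                        continue
                    trial = labels[:i] + [g] + labels[i + 1:]
                    value = within_ss(data, trial)
                    if value < best - 1e-12:
                        labels, best, improved = trial, value, True
        for i in range(len(labels)):
            for j in range(i + 1, len(labels)):
                if labels[i] == labels[j]:
                    continue
                # A swop exchanges the classes of units i and j.
                trial = list(labels)
                trial[i], trial[j] = trial[j], trial[i]
                value = within_ss(data, trial)
                if value < best - 1e-12:
                    labels, best, improved = trial, value, True
    return labels, best


# Illustrative data only: four units measured on one variate.
data = [[0.0], [1.0], [10.0], [11.0]]
labels, value = improve(data, [1, 2, 1, 2], n_groups=2)
print(labels, value)
```

Note that swops alone keep the class sizes fixed, whereas transfers can change them; passing `interchange=None` mirrors the idea of reporting on an existing classification without altering it.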

The `START` option can be used to supply a factor to define the initial classification. This might be constructed using the `CLASSIFY` procedure. If there are *k* classes, `CLASSIFY` finds the *k* units that are furthest apart in the multi-dimensional space defined by the data variates. These are then used as the nuclei for the classes, with each remaining unit being allocated to the class containing the nearest nucleus. The default splits the units, in order, into `NGROUPS` classes of nearly equal size.
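The two ways of forming a starting classification can be sketched in Python as follows. This is an illustration, not Genstat code: the function names are hypothetical, the choice of which classes receive the extra unit when the sizes cannot be equal is an assumption, and "nearest" is assumed to mean smallest squared Euclidean distance. The sketch also takes the nuclei as given rather than reproducing how `CLASSIFY` selects them.

```python
def default_start(n_units, n_groups):
    """Split units, in order, into n_groups classes of nearly equal size.

    Assumption: when n_units is not a multiple of n_groups, the earliest
    classes each take one extra unit.
    """
    base, extra = divmod(n_units, n_groups)
    labels = []
    for g in range(n_groups):
        size = base + (1 if g < extra else 0)
        labels.extend([g + 1] * size)
    return labels


def allocate_to_nuclei(data, nuclei):
    """Assign every unit to the class of its nearest nucleus.

    data: list of rows (one per unit); nuclei: row indices of the units
    chosen as class nuclei. Distance is squared Euclidean (an assumption).
    """
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    labels = []
    for row in data:
        distances = [sqdist(row, data[k]) for k in nuclei]
        labels.append(distances.index(min(distances)) + 1)
    return labels


print(default_start(10, 3))  # [1, 1, 1, 1, 2, 2, 2, 3, 3, 3]

# Illustrative data only: units 0 and 3 act as the two nuclei.
data = [[0.0], [1.0], [10.0], [11.0]]
print(allocate_to_nuclei(data, [0, 3]))  # [1, 1, 2, 2]
```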

As an alternative to the use of `CLASSIFY`, the `NSTARTS` option allows you to specify a number of random permutations of the initial classification to try. `CLUSTER` then saves the best classification that it finds. By default, `NSTARTS=0`, i.e. no randomization is done. The `SEED` option supplies the seed for the random numbers that are used to do the permutations. The default of zero continues the existing sequence of random numbers, if `CLUSTER` has already been used in the current Genstat job. If `CLUSTER` has not yet been used, Genstat picks a seed at random.
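The idea of random starts can be sketched in Python as below. Again this is an illustration under stated assumptions, not Genstat's implementation: `best_of_random_starts` and `evaluate_only` are hypothetical names, the unpermuted initial classification is assumed to be tried as well, smaller criterion values are assumed to be better, and the seed is passed straight to Python's own generator (so the special meaning of `SEED=0` in Genstat is not reproduced). Permuting the labels keeps the class sizes of the initial classification.

```python
import random


def best_of_random_starts(initial, n_starts, optimise, seed):
    """Try n_starts random permutations of `initial`; keep the best result.

    optimise(labels) must return (labels, criterion_value), where a
    smaller criterion value is better.
    """
    rng = random.Random(seed)
    best_labels, best_value = optimise(list(initial))
    for _ in range(n_starts):
        start = list(initial)
        rng.shuffle(start)  # random permutation of the class labels
        labels, value = optimise(start)
        if value < best_value:
            best_labels, best_value = labels, value
    return best_labels, best_value


def evaluate_only(labels):
    """Toy stand-in for an optimiser: scores the labels without moving units."""
    points = [0.0, 1.0, 10.0, 11.0]  # illustrative data only
    total = 0.0
    for g in set(labels):
        members = [p for p, lab in zip(points, labels) if lab == g]
        mean = sum(members) / len(members)
        total += sum((p - mean) ** 2 for p in members)
    return labels, total


labels, value = best_of_random_starts([1, 2, 1, 2], 20, evaluate_only, seed=12345)
print(labels, value)
```

With `n_starts=0` the function simply returns the result for the initial classification, which corresponds to the default of no randomization.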

The first parameter, `NGROUPS`, specifies the number of groups, or classes, to be formed. Often you will want several classifications from a single data set, into different numbers of groups. In this case, the `NGROUPS` parameter should be a list of scalars, defining the numbers of groups in descending order. To form the initial classification for the second number of groups, `CLUSTER` takes the optimal classification from the first number of groups, and does some reallocation of units to make a smaller number of groups. This is repeated, as often as required, to provide initial classifications for all the later analyses; hence the need to specify the numbers in descending order. Random starts are done only for the first number of groups.

The `GROUPS` parameter can specify a list of factors to save the optimal classifications. The `CRITERIONVALUE` parameter can specify a list of scalars to save the criterion values for each number of groups. The subsidiary criterion values involved in maximal predictive classification can be saved (also in scalars) using the `BCRITERIONVALUE` parameter. The `MEANS` parameter can save matrices containing the means of the variates within the groups of the classifications, and the `PREDICTORS` parameter can save matrices containing the group predictors from maximal predictive classifications.

Options: `PRINT`, `DATA`, `CRITERION`, `INTERCHANGE`, `START`, `NSTARTS`, `SEED`.

Parameters: `NGROUPS`, `GROUPS`, `CRITERIONVALUE`, `BCRITERIONVALUE`, `MEANS`, `PREDICTORS`.

### Action with `RESTRICT`

Any restrictions, for example on the variates in a `DATA` pointer, are ignored.

### See also

Directives: `FSIMILARITY`, `HCLUSTER`, `HREDUCE`.

Procedures: `CLASSIFY`, `BCLASSIFICATION`, `CINTERACTION`, `HCOMPAREGROUPINGS`, `MASCLUSTER`, `PCPCLUSTER`.

Commands for: Multivariate and cluster analysis.

### Example

    " Example CLUS-1: Cluster analysis with binary data."

    " The data are in a file called CLUS-1.DAT "
    FILEREAD [NAME='%gendir%/examples/CLUS-1.DAT'] Y[1...4]; FGROUPS=no

    " Carry out the non-hierarchical clustering, printing the optimal
      criterion value, the optimal classification, a typical value for
      each class (for maximal predictive classification this is the
      class predictor) and the data with the units ordered into the
      optimal classification. Save the optimal classifications formed
      for 5 and 2 classes into factors Optimum[5] and Optimum[2]
      respectively. For binary data, set the CRITERION option to
      predictive (maximal predictive classification). "
    CLUSTER [PRINT=criterion,optimum,typical,units; DATA=Y;\
             CRITERION=predictive] NGROUPS=5,2; GROUPS=Optimum[5,2]

    " One preliminary to comparing two classifications is to tabulate
      them. To do this, use the factors Optimum[5,2] saved from the
      clustering as classification factors in a TABULATE command.
      The printed table shows that the first group of the classification
      into 2 groups is formed from groups 1 and 5 of the 5-group
      classification; group 2 is formed from groups 2, 3 and 4. "
    TABULATE [PRINT=counts; CLASSIFICATION=Optimum[5,2]; MARGIN=yes]