Forms clusters of points from their densities in multi-dimensional space (R.W. Payne).

### Options

| Option | Description |
|---|---|
| `PRINT` = string tokens | What to print (`cellclusters`, `density`, `summary`); default `summ` |
| `PLOT` = string tokens | What to plot (`cellclusters`, `density`, `histogram`, `summary`); default `cell`, `dens`, `hist` |
| `CLUSTERS` = pointer | Saves variates defining the clusters for each minimum number of points |
| `CELLCLUSTERS` = pointer | Saves tables containing the clusters of cells for each minimum number of points |
| `DENSITY` = table | Saves or supplies the table of cell densities |
| `SUMMARY` = pointer | Saves the summary table |
| `INITIALCELLCLUSTERS` = table | Defines clusters of cells to use to start the clustering |
| `MINPOINTS` = variate or scalar | Minimum numbers of points within cells at which to form clusters |

### Parameters

| Parameter | Description |
|---|---|
| `DATA` = variates | Coordinates of the points |
| `NPARTITIONS` = scalars | Numbers of partitions in each dimension; default 10 |

### Description

The `PTFCLUSTERS` procedure forms clusters of points in multi-dimensional space by finding contiguous regions where the density of points exceeds thresholds specified by the `MINPOINTS` option. The points in these regions will be connected to each other in a similar way to the units in a hierarchical cluster analysis. Note, though, that points in sparsely populated parts of the space will not be allocated to any cluster. These points can thus be identified as unusual or aberrant.

`PTFCLUSTERS` divides the space into cells, and uses `TABULATE` to calculate the number of points in each cell. `PTFCLUSTERS` starts with the first `MINPOINTS` value and finds a cell containing more than that number of points. This is the starting point for the first cluster. Additional cells are added to the cluster if they contain more than that minimum number of points and are neighbours of cells already in the cluster. When this cluster is complete, `PTFCLUSTERS` looks for a cell that is not in the cluster but which contains more than the minimum number of points. This provides the starting point for another cluster. The process continues until all the cells with more than that minimum number of points have been allocated to a cluster. `PTFCLUSTERS` then takes the next `MINPOINTS` value and expands the clusters to contain neighbours with that smaller minimum number of points, merging clusters if they become neighbours.

For each `MINPOINTS` value, `PTFCLUSTERS` records the number of clusters, the mean number of points within cells inside and outside the clusters, the mean number within cells just inside and just outside the cluster boundaries, the minimum number within cells on the boundaries, and the maximum number within cells just outside the boundaries. These should help to assess which `MINPOINTS` value gives the best set of clusters.
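The search at a single `MINPOINTS` value can be illustrated outside Genstat. The following Python sketch is not the procedure's own code: it assumes a two-dimensional density table, treats cells as neighbours only when they share an edge (the rule applied by `NEIGHBOURS` is not stated here), and uses the hypothetical name `cluster_cells`. It follows the wording above in requiring a cell to contain more than the minimum number of points.

```python
from collections import deque

def cluster_cells(density, min_points):
    """Label contiguous cells whose count exceeds min_points.

    density is a 2-D list of cell counts. Returns (labels, n_clusters),
    where labels has the same shape as density and holds cluster numbers
    (1, 2, ...) or None for cells not allocated to any cluster.
    """
    nrows, ncols = len(density), len(density[0])
    labels = [[None] * ncols for _ in range(nrows)]
    n_clusters = 0
    for r in range(nrows):
        for c in range(ncols):
            # Skip sparse cells and cells already in a cluster.
            if density[r][c] <= min_points or labels[r][c] is not None:
                continue
            # This cell is the starting point for a new cluster.
            n_clusters += 1
            labels[r][c] = n_clusters
            queue = deque([(r, c)])
            while queue:
                i, j = queue.popleft()
                # Assumption: neighbours are the edge-sharing cells.
                for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ni, nj = i + di, j + dj
                    if (0 <= ni < nrows and 0 <= nj < ncols
                            and labels[ni][nj] is None
                            and density[ni][nj] > min_points):
                        labels[ni][nj] = n_clusters
                        queue.append((ni, nj))
    return labels, n_clusters

# Made-up counts for illustration only.
density = [
    [9, 8, 0, 0],
    [7, 1, 0, 6],
    [0, 0, 0, 9],
]
labels, n = cluster_cells(density, min_points=5)
```

With these made-up counts the dense cells in the top-left corner form one cluster and the two dense cells on the right form another; every other cell is left unallocated.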

As mentioned above, the `MINPOINTS` option specifies the minimum numbers of points that are used to form the clusters. The default is to use a list of values calculated as the maximum density multiplied by 0.8, 0.75, 0.7, 0.65, 0.6, 0.55, 0.5, 0.45, 0.4, 0.35, 0.3, 0.25 and 0.2.
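As a worked illustration of the default list, the Python sketch below multiplies a maximum density by the thirteen fractions. The function name `default_minpoints` and the maximum density of 195 are hypothetical, and the sketch does not round the products, since any rounding applied by the procedure is not described here.

```python
# Fractions of the maximum density, expressed as percentages so that
# the arithmetic below stays exact for integer densities.
PERCENTAGES = [80, 75, 70, 65, 60, 55, 50, 45, 40, 35, 30, 25, 20]

def default_minpoints(max_density):
    """Return the unrounded default thresholds for a given maximum density."""
    return [max_density * p / 100 for p in PERCENTAGES]

# Hypothetical maximum cell density, for illustration only.
thresholds = default_minpoints(195)
```

Here the first and last thresholds are 156.0 and 39.0, with eleven further values in between.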

The `DATA` parameter supplies a list of variates containing the coordinates of the points in the various dimensions. The `NPARTITIONS` parameter supplies a list of scalars indicating the number of partitions to make along each dimension in order to form the multi-dimensional cells; default 10.
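The way coordinates and partition counts combine into a table of cell counts can be sketched in Python. This is an illustration only: it assumes two dimensions and equal-width partitions spanning the observed range of each coordinate, which is an assumption rather than a documented property of the procedure, and the names `cell_index` and `density_table` are hypothetical.

```python
def cell_index(value, low, high, npartitions):
    """Return the 0-based index of the equal-width partition holding value."""
    if high == low:
        return 0
    k = int((value - low) / (high - low) * npartitions)
    # The largest value would fall just past the end, so clamp it
    # into the last partition.
    return min(k, npartitions - 1)

def density_table(x, y, nx=10, ny=10):
    """Count the points in each cell of an ny-by-nx grid (rows follow y)."""
    counts = [[0] * nx for _ in range(ny)]
    xlo, xhi, ylo, yhi = min(x), max(x), min(y), max(y)
    for xv, yv in zip(x, y):
        row = cell_index(yv, ylo, yhi, ny)
        col = cell_index(xv, xlo, xhi, nx)
        counts[row][col] += 1
    return counts

# Five made-up points, two partitions in each dimension.
x = [0.0, 0.1, 0.2, 0.9, 1.0]
y = [0.0, 0.1, 0.1, 0.9, 1.0]
table = density_table(x, y, nx=2, ny=2)
```

Three of the made-up points fall in the cell with the lowest coordinates and two in the cell with the highest, so the other two cells are empty.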

The `PRINT` option controls the printed output, with settings:

`cellclusters` shows how the cells are clustered for each minimum number of points,

`density` prints the table showing the number of points in each cell,

`summary` prints the summary information recorded for each minimum number of points (default).

The `PLOT` option specifies what is plotted, with settings:

`cellclusters` displays the clustering of the cells for each minimum number of points, as a shade plot if there are 2 dimensions or as a 3-d graph if there are 3,

`density` displays shade plots showing the numbers of points in each pair of dimensions,

`histogram` plots a histogram of the numbers of points in the cells,

`summary` plots the summary information against the minimum number of points.

The default is to plot all of these.

The `CLUSTERS` option can save a pointer containing details of the clusters of points formed at each `MINPOINTS` value. The clusters have integer numbers, from one upwards. The pointer contains a variate for each `MINPOINTS` value with a unit for every point. These contain either cluster numbers, or missing values for points in cells that have not been allocated to any cluster.
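The relationship between the clustering of cells and the cluster number given to each point can be sketched in Python. This is an illustration under stated assumptions, not Genstat code: the name `point_clusters` is hypothetical, each point is represented by the (row, column) position of its cell, and `None` stands in for a missing value.

```python
def point_clusters(point_cells, cell_labels):
    """Give each point the cluster number of the cell that contains it.

    point_cells holds one (row, column) pair per point; cell_labels is a
    2-D list of cluster numbers, with None for unallocated cells.
    """
    return [cell_labels[r][c] for r, c in point_cells]

# Made-up cell clustering and point positions.
cell_labels = [
    [1, None],
    [None, 2],
]
point_cells = [(0, 0), (0, 1), (1, 1), (0, 0)]
membership = point_clusters(point_cells, cell_labels)
```

The second made-up point lies in an unallocated cell, so its entry is `None`; the others take the numbers of their cells' clusters.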

The `CELLCLUSTERS` option can similarly save a pointer containing details of the clusters of cells formed at each `MINPOINTS` value. The pointer contains a table for each `MINPOINTS` value. These contain either cluster numbers, or missing values for cells that have not been allocated to any cluster.

The `DENSITY` option can save the table containing the number of points within each cell. Alternatively, if you do not set the `DATA` parameter, it can be used to supply a previously calculated density table. This is useful to save computing time if you want to make several attempts to find clusters with the same data set.

The `SUMMARY` option can save the summary table, in a pointer with elements labelled `'Min. no. points'`, `'No. clusters'`, `'Mean inside clusters'`, `'Mean outside clusters'`, `'Mean on boundary'`, `'Mean outside boundary'`, `'Min. on boundary'` and `'Max. outside boundary'`.

The `INITIALCELLCLUSTERS` option can supply a table of cell cluster allocations, to act as a starting point for the clustering. For example, you could specify a table previously saved by the `CELLCLUSTERS` option, if you wanted to expand those clusters with some different values of `MINPOINTS`.

A final, important point is that `PTFCLUSTERS` does not form a units-by-units similarity matrix, as in ordinary cluster analysis, but instead works with a (small) density table. It is therefore suitable for clustering (very) large data sets.

Options: `PRINT`, `PLOT`, `CLUSTERS`, `CELLCLUSTERS`, `DENSITY`, `SUMMARY`, `INITIALCELLCLUSTERS`, `MINPOINTS`.

Parameters: `DATA`, `NPARTITIONS`.

### Method

`PTFCLUSTERS` uses the `NEIGHBOURS` procedure to find the neighbouring cells.

### Action with `RESTRICT`

If any of the `DATA` variates are restricted, only the points not excluded by the restriction will be clustered.

### See also

Directive: `CLUSTER`.

Procedures: `NEIGHBOURS`, `PCPCLUSTER`, `PTFILLCLUSTERS`.

Commands for: Multivariate and cluster analysis, Spatial statistics.

### Example

    CAPTION     'PTFCLUSTERS example'; STYLE=meta
    " generate random points: some scattered over the plane (0),
      2 overlapping polygons (1 & 2), and a separate polygon (3) "
    VARIATE     xhexagon; VALUES=!(0.3,0.0,0.3,0.7,1.0,0.7)
    &           yhexagon; VALUES=!(0.0,0.5,1.0,1.0,0.5,0.0)
    GRCSR       [PRINT=*] YPOLYGON=yhexagon; XPOLYGON=xhexagon; NPOINTS=100;\
                YCSR=ycsr[0]; XCSR=xcsr[0]; SEED=35719
    GRCSR       [PRINT=*] YPOLYGON=yhexagon/2; XPOLYGON=xhexagon/2; NPOINTS=600;\
                YCSR=ycsr[1]; XCSR=xcsr[1]
    GRCSR       [PRINT=*] YPOLYGON=0.4+yhexagon/2; XPOLYGON=xhexagon/2; NPOINTS=700;\
                YCSR=ycsr[2]; XCSR=xcsr[2]
    GRCSR       [PRINT=*] YPOLYGON=yhexagon/3; XPOLYGON=0.65+xhexagon/3; NPOINTS=800;\
                YCSR=ycsr[3]; XCSR=xcsr[3]
    VARIATE     [VALUES=#xcsr[]] xvar
    VARIATE     [VALUES=#ycsr[]] yvar
    " plot the points "
    DPTMAP      [YLOWER=0; YUPPER=1; XLOWER=0; XUPPER=1] Y=yvar; X=xvar
    " form clusters with default minimum points in each cell
      (156, 127 ... 39) "
    PTFCLUSTERS [PRINT=cellclusters,density,summary; PLOT=cellclusters,density,histogram;\
                CLUSTERS=clust; CELLCLUSTERS=cellclust; DENSITY=density] yvar,xvar; NPARTITIONS=8
    " continue from 39 down to 10 minimum points in each cell,
      saving calculations by using the previously saved density table,
      and the clustering at 39 minimum points as the initial clustering "
    PTFCLUSTERS [PRINT=summary; PLOT=cellclusters; CLUSTERS=clust; DENSITY=density;\
                INITIALCELLCLUSTERS=cellclust[39]; MINPOINTS=!(39...10)]
    " choose the clusters formed with 20 minimum number of points per cell "
    FSPREAD     xvar,yvar,clust[20]