Displays results ancillary to hierarchical cluster analyses: a matrix of mean similarities between and within groups, a set of nearest neighbours for each unit, a minimum spanning tree, and the most typical elements from each group.
Option
PRINT = string tokens | Printed output required (neighbours, tree, typicalelements, gsimilarities); default tree |
---|---|
Parameters
SIMILARITY = symmetric matrices | Input similarity matrix for each cluster analysis |
---|---|
NNEIGHBOURS = scalars | Number of nearest neighbours to be printed |
NEIGHBOURS = matrices | Matrix to store nearest neighbours of each unit |
GROUPS = factors | Indicates the groupings of the units (for calculating typical elements and mean similarities between groups) |
TREE = matrices | To store the minimum spanning tree (as a series of links and corresponding lengths) |
GSIMILARITY = symmetric matrices | To store similarities between groups |
Description
You can use the HDISPLAY directive to print ancillary information useful for interpreting cluster analyses, and to save information to use elsewhere in Genstat, for example for plotting.
The SIMILARITY parameter specifies a list of symmetric similarity matrices. These are operated on, in turn, to produce the output requested by the PRINT option and to save the information specified by other parameters. Since the interpretations of the remaining parameters are closely linked to the different settings of the PRINT option, each setting is discussed below with the relevant parameters.
The NNEIGHBOURS parameter gives a list of scalars indicating how many neighbours will appear in the printed table of nearest neighbours.
The NEIGHBOURS parameter can specify a list of identifiers to store details of nearest neighbours. These will be declared implicitly, if necessary, as matrices. The rows of the matrices correspond to the units; there should be an even number of columns. The values in the odd-numbered columns represent the neighbouring units in order of their similarity, while the values in the even-numbered columns are the corresponding similarities. If you have declared the matrix previously and it does not have enough columns, then NEIGHBOURS stores as many neighbours as possible. If there is an odd number of columns in the matrix, the last column is not filled. If the matrix is declared implicitly, the number of columns will be twice the value of the NNEIGHBOURS scalar.
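The column layout described above can be illustrated outside Genstat. The following is a minimal Python sketch, not Genstat code and not Genstat's implementation: the function name `neighbours_matrix`, the full square list-of-lists input, and the toy similarity values are all assumptions made for illustration. How Genstat orders tied similarities is not stated here; the sketch simply keeps the lower unit number first.

```python
def neighbours_matrix(sim, n_neighbours):
    """Return one row per unit: [unit, similarity, unit, similarity, ...].

    `sim` is a full square similarity matrix (illustrative assumption).
    Units are numbered from 1.
    """
    n = len(sim)
    rows = []
    for i in range(n):
        # Rank every other unit by decreasing similarity to unit i.
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: sim[i][j], reverse=True)
        row = []
        for j in others[:n_neighbours]:
            # Odd-numbered columns hold the neighbour, even-numbered its similarity.
            row.extend([j + 1, sim[i][j]])
        rows.append(row)
    return rows


# Hypothetical similarities (percentages) between four units.
sim = [
    [100, 90, 40, 30],
    [90, 100, 50, 20],
    [40, 50, 100, 80],
    [30, 20, 80, 100],
]
table = neighbours_matrix(sim, 2)
for row in table:
    print(row)
```

With two neighbours requested, each row has four columns, which matches the rule that an implicitly declared matrix has twice as many columns as the NNEIGHBOURS value.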
If the PRINT option includes the setting neighbours, Genstat prints a table of nearest neighbours for every unit, together with their values of similarity. The number of neighbours printed is determined by the value of the NNEIGHBOURS scalar; if NNEIGHBOURS is not set, the table is not printed. This information is also useful for interpreting clusters and ordinations.
The GROUPS parameter specifies a factor to divide the units of each similarity matrix into clusters. You may have formed the factor from a previous hierarchical cluster analysis, using HCLUSTER. This parameter must be set if the PRINT option includes the settings typicalelements or gsimilarities.
If the PRINT option includes the setting typicalelements, Genstat prints the average similarity of each group member with the other group members. This is to help you identify typical members of each group: typical members will have relatively large average similarities compared to those of the other members. Within each group, members are printed in decreasing order of average similarity.
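As a rough illustration of the quantity described above, the Python sketch below (not Genstat code) averages each unit's similarity with the other members of its group and sorts each group in decreasing order. The function name `typical_elements`, the data, and the treatment of single-member groups are assumptions for illustration only.

```python
def typical_elements(sim, groups):
    """Map each group to a list of (unit, mean similarity to the other
    members of that group), sorted by decreasing mean similarity."""
    result = {}
    for g in sorted(set(groups)):
        members = [i for i, gi in enumerate(groups) if gi == g]
        scores = []
        for i in members:
            others = [sim[i][j] for j in members if j != i]
            # A single-member group has nothing to average over (assumption: None).
            mean = sum(others) / len(others) if others else None
            scores.append((i + 1, mean))
        scores.sort(key=lambda t: -t[1] if t[1] is not None else 0.0)
        result[g] = scores
    return result


# Hypothetical similarities between five units in two groups.
sim = [
    [100, 80, 60, 20, 10],
    [80, 100, 70, 30, 20],
    [60, 70, 100, 40, 30],
    [20, 30, 40, 100, 90],
    [10, 20, 30, 90, 100],
]
groups = [1, 1, 1, 2, 2]
typical = typical_elements(sim, groups)
print(typical[1])  # [(2, 75.0), (1, 70.0), (3, 65.0)]
```

In this toy example unit 2 is the most typical member of group 1, because its average similarity with units 1 and 3 is the largest.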
The GSIMILARITY parameter specifies a list of symmetric matrices in which you can save the mean between-group and within-group similarities. Any structure that you have not declared already will be declared implicitly to be a symmetric matrix with number of rows equal to the number of levels of the factor in the GROUPS parameter.
If the PRINT option includes the setting gsimilarities, Genstat prints the mean similarities between groups and within groups. Self-similarities are excluded.
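The group summary can be sketched in Python as follows (not Genstat code). This assumes that "self-similarities are excluded" means the diagonal of the similarity matrix is left out of the within-group means. The function name `group_similarities` and the data are hypothetical.

```python
def group_similarities(sim, groups):
    """Return a square matrix of mean similarities with one row per group:
    diagonal = within-group means, off-diagonal = between-group means."""
    levels = sorted(set(groups))
    index = {g: k for k, g in enumerate(levels)}
    m = len(levels)
    total = [[0.0] * m for _ in range(m)]
    count = [[0] * m for _ in range(m)]
    for i in range(len(sim)):
        for j in range(i):  # strictly below the diagonal: no self-similarities
            a, b = index[groups[i]], index[groups[j]]
            # Accumulate symmetrically; the set collapses to one cell when a == b.
            for r, c in {(a, b), (b, a)}:
                total[r][c] += sim[i][j]
                count[r][c] += 1
    return [[total[r][c] / count[r][c] if count[r][c] else None
             for c in range(m)] for r in range(m)]


# Hypothetical similarities between five units in two groups.
sim = [
    [100, 80, 60, 20, 10],
    [80, 100, 70, 30, 20],
    [60, 70, 100, 40, 30],
    [20, 30, 40, 100, 90],
    [10, 20, 30, 90, 100],
]
groups = [1, 1, 1, 2, 2]
gsim = group_similarities(sim, groups)
print(gsim)  # [[70.0, 25.0], [25.0, 90.0]]
```

The result has one row and column per group, in line with the dimensions given for the GSIMILARITY matrix.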
The TREE parameter can specify a matrix to save the minimum spanning tree. The matrix is set up with two columns and number of rows equal to the number of units. For each unit, the value in the first column is the unit to which that unit is linked on its left; the second column is the corresponding similarity. The first unit is not linked to any unit on its left, as it is always the first unit on the tree; so the first row of the matrix contains missing values. The HFAMALGAMATIONS procedure can use the tree to form an amalgamations matrix, representing how the clusters would be formed with this similarity matrix by single-linkage cluster analysis.
Setting the PRINT option to tree prints the minimum spanning tree associated with the similarity matrix specified by the SIMILARITY parameter. The minimum spanning tree (MST) is not a Genstat structure, but it can be kept in the form described above: that is, in a matrix with two columns. An MST is a tree connecting the n points of a multidimensional representation of the sampling units. In a tree every unit is linked to a connected network and there are no closed loops; the special feature of the MST is that, of all trees with a sampling unit at every node, it is the one whose links have minimum total length. The links include all those that join nearest neighbours; the MST is closely related to single linkage hierarchical trees. Minimum spanning trees are also useful if you superimpose them on ordinations to reveal regions in which distance is badly distorted (see procedure DMST); if neighbouring points, as given by the MST, are distant in the ordination then something is badly wrong.
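To make the two-column representation concrete, here is a minimal Python sketch (not Genstat code) that builds a spanning tree with Prim's algorithm. It starts from the first unit and repeatedly attaches the outside unit with the largest similarity to the tree, which is equivalent to minimising total length when length decreases as similarity increases. The algorithm choice, the function name, and the data are assumptions; Genstat's own method and the order in which it arranges units on the tree may differ.

```python
def minimum_spanning_tree(sim):
    """Return one [linked_unit, similarity] row per unit (units numbered
    from 1). The first row is [None, None], standing in for missing values."""
    n = len(sim)
    link = [None] * n        # unit each unit is attached to
    strength = [None] * n    # similarity of that link
    in_tree = [False] * n
    in_tree[0] = True
    # best[k] = (largest similarity from k to the tree so far, tree unit giving it)
    best = [(sim[0][k], 0) for k in range(n)]
    for _ in range(n - 1):
        # Attach the outside unit that is most similar to some unit in the tree.
        j = max((k for k in range(n) if not in_tree[k]),
                key=lambda k: best[k][0])
        in_tree[j] = True
        strength[j], parent = best[j]
        link[j] = parent + 1
        for k in range(n):
            if not in_tree[k] and sim[j][k] > best[k][0]:
                best[k] = (sim[j][k], j)
    return [[link[i], strength[i]] for i in range(n)]


# Hypothetical similarities (percentages) between four units.
sim = [
    [100, 90, 40, 30],
    [90, 100, 50, 20],
    [40, 50, 100, 80],
    [30, 20, 80, 100],
]
tree = minimum_spanning_tree(sim)
print(tree)  # [[None, None], [1, 90], [2, 50], [3, 80]]
```

Every nearest-neighbour pair (1 with 2, 3 with 4) appears as a link, in line with the statement that the MST includes all links that join nearest neighbours.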
Option: PRINT.
Parameters: SIMILARITY, NNEIGHBOURS, NEIGHBOURS, GROUPS, TREE, GSIMILARITY.
See also
Directives: HCLUSTER, HLIST, HSUMMARIZE.
Procedures: DDENDROGRAM, DMST.
Commands for: Multivariate and cluster analysis.
Example
" Examples 2:6.19.1, 2:6.19.2a-d, 2:6.19.3a-b, 2:6.19.4-5, 2:6.19.7-8 "
UNITS [NVALUES=16]
VARIATE Engcc,Ncyl,Tankl,Weight,Length,Width,Height,Wbase,Tspeed,Stst,\
Carb,Drive,Vct[1...3]
POINTER Cd; VALUES=!P(Engcc,Ncyl,Tankl,Weight,Length, \
Width,Height,Wbase,Tspeed,Stst)
READ [PRINT=errors] #Cd,Carb,Drive
1490 4 50 966 414 161 133 245 177 10.9 1 2
1409 4 50 845 399 162 139 242 174 10.2 1 2
2492 6 49 1160 433 163 140 251 210 8.2 1 1
3185 8 87 1430 458 179 126 265 249 7.4 2 1
4942 12 120 1506 449 198 113 255 291 5.8 2 1
1995 4 70 1180 450 176 143 266 209 7.8 2 2
965 4 35 761 338 149 146 216 134 16.8 1 2
1585 4 55 970 426 165 141 244 180 10.0 1 2
1714 4 55 980 426 165 141 245 150 18.9 3 2
999 4 42 720 364 155 143 236 145 16.2 1 2
1498 4 48 912 397 157 118 220 171 11.0 1 1
5167 12 120 1446 414 200 107 245 286 4.9 1 1
1585 4 45 1000 389 162 138 247 195 8.2 1 2
1995 4 70 1150 459 175 143 266 224 7.6 2 2
1049 4 47 790 339 151 143 216 179 11.8 1 2
1995 4 45 1050 414 162 125 228 190 9.0 2 1 :
TEXT [VALUES=Estate,'Arna1.5','Alfa2.5',Mondialqc,\
Testarossa,Croma,Panda,Regatta,Regattad,Uno,\
X19,Contach,Delta,Thema,Y10,Spider] Carname
FACTOR [NVALUES=Carname; LEVELS=16] Fcar; VALUES=!(1...16)
SYMMETRICMATRIX [ROWS=Carname] Carsim
" Form similarity matrix between cars."
FSIMILARITY [SIMILARITY=Carsim; PRINT=*] #Cd,Carb,Drive; \
TEST=4(cityblock),4(Euclidean),2(cityblock),2(simplematch)
HCLUSTER [PRINT=dendrogram; METHOD=averagelink] Carsim; \
GTHRESHOLD=70; GROUPS=Cargrp; PERMUTATION=Carperm; \
AMALGAMATIONS=Caramalg
FSIMILARITY [PRINT=similarities; SIMILARITY=Carsim; \
PERMUTATION=Carperm; STYLE=abbreviated]
MATRIX [ROWS=Carname; COLUMNS=4] Carneig
HDISPLAY [PRINT=neighbours] Carsim; NNEIGHBOURS=3; NEIGHBOURS=Carneig
PRINT Carneig
FACTOR [LABELS=!t(Fiat,'Alfa Romeo',Lancia,Ferrari,Lamborghini,\
Pinninfarina)] Maker; VALUES=!(2,2,2,4,4,1,1,1,1,1,1,5,3,3,3,6)
HDISPLAY [PRINT=typical] Carsim; GROUPS=Maker
HDISPLAY [PRINT=gsimilarities] Carsim; GROUPS=Maker; \
GSIMILARITY=Cargsim
PRINT Cargsim
HDISPLAY [PRINT=tree] Carsim; TREE=Cartree
PRINT Cartree
HLIST [UNITS=Carname] #Cd,Carb,Drive; \
TEST=4(cityblock),4(Euclidean),2(cityblock),2(simplematch)
HLIST [GROUPS=Maker; UNITS=Carname] #Cd,Carb,Drive; \
TEST=4(cityblock),4(Euclidean),2(cityblock),2(simplematch)
HSUMMARIZE [GROUPS=Cargrp] Weight,Carb; \
TEST=cityblock,simplematch
TEXT Cars; VALUES=!T(Estate,'Arna1.5','Alfa2.5',Mondialqc,\
Testarossa,Croma,Panda,Regatta,Regattad,Uno,\
X19,Contach,Delta,Thema,Y10,Spider)
FRAME 1; YLOWER=0; YUPPER=1; XLOWER=0; XUPPER=1
DDENDROGRAM [STYLE=lower; ORDERING=given; LOWSIMILARITY=0; \
DSIMILARITY=yes] Caramalg; PERMUTATION=Carperm; LABELS=Cars;\
TITLE='Dendrogram as from HCLUSTER'; SAVE=DKeep
" types of ordering "
FRAME 5...8; YLOWER=2(0.5,0.0); YUPPER=2(1.0,0.5);\
XLOWER=(0.0,0.5)2; XUPPER=(0.5,1.0)2
DDENDROGRAM [STYLE=average; ORDERING=first; REVERSE=yes; SCREEN=clear;\
ENDACTION=continue; CHANGE=order; DSIMILARITY=yes] DATA=DKeep;\
TITLE='A: STYLE=average, ORDER=first'; WINDOW=5; SAVE=DSFrstAv
DDENDROGRAM [STYLE=centroid; ORDERING=size,ziggurat;\
SCREEN=keep; ENDACTION=continue; CHANGE=order; DSIMILARITY=yes]\
DATA=DKeep; TITLE='B: STYLE=centroid, ORDER=size,zig'; WINDOW=6
DDENDROGRAM [STYLE=lower; ORDERING=first; REVERSE=yes;\
SCREEN=keep; ENDACTION=continue; CHANGE=dendrogram; DSIMILARITY=yes]\
DATA=DSFrstAv; TITLE='C: STYLE=lower, ORDER=first'; WINDOW=7
DDENDROGRAM [STYLE=full; ORDER=ziggurat,size; SCREEN=keep; \
ENDACTION=pause; CHANGE=order; DSIMILARITY=yes] DATA=DKeep;\
PERMUTATION=PSave; TITLE='D: STYLE=full, ORDER=zig,size'; WINDOW=8;\
ZIGGURAT=ZigDeg; SAVE=DSave
HCLUSTER [PRINT=dendrogram; METHOD=singlelink] Carsim; \
GTHRESHOLD=90; GROUPS=Cargrpsing
PRINT Cargrp,Cargrpsing
HCOMPAREGROUPINGS [PRINT=indexes,tests; METHOD=arand,jaccard,rand]\
FIRSTGROUPING=Cargrp; SECONDGROUPING=Cargrpsing; SEED=93587
" obtain the clusters from the original cluster analysis "
HFCLUSTERS Caramalg; CLUSTERS=Clusters
" see how often these clusters occur in 100 bootstrap samples of data variables "
HBOOTSTRAP [PRINT=clusters; METHOD=averagelink; NTIMES=100; SEED=161647;\
CLUSTERS=Clusters; REPLICATION=Reps] #Cd,Carb,Drive;\
TEST=4(cityblock),4(Euclidean),2(cityblock),2(simplematch)
" replot the original dendrogram "
DDENDROGRAM [STYLE=average; ORDERING=given; LOWSIMILARITY=0; \
DSIMILARITY=yes] Caramalg; PERMUTATION=Carperm;\
LABELS=Cars; WINDOW=1
" plot the numbers of occurrence on the dendrogram "
DCLUSTERLABELS [WINDOW=1] #Clusters; LABEL=#Reps