HCOMPAREGROUPINGS procedure

Compares groupings generated, for example, from cluster analyses (R.W. Payne).

Options

PRINT = string tokens Controls printed output (indexes, tests); default indexes
PLOT = string What to plot (histogram); default *
METHOD = string tokens Which indexes to calculate (arand, jaccard, rand); default arand
NTIMES = scalar Number of permutations to make for the tests; default 999

Parameters

FIRSTGROUPING = factors First set of groupings
SECONDGROUPING = factors Second set of groupings
ESTIMATES = pointers Saves the values of the indexes calculated from the original data set
SEED = scalars Seed for the random number generator used to make the permutations; default 0 continues from the previous generation or (if none) initializes the seed automatically
PERMUTATIONESTIMATES = pointers Saves the values of the indexes calculated from the permuted data sets

Description

HCOMPAREGROUPINGS calculates indexes to assess the similarity between two sets of groupings, which are specified in factors using the FIRSTGROUPING and SECONDGROUPING parameters. These may, for example, have been obtained from two different cluster analyses.

The METHOD option selects the indexes, with settings:

arand adjusted Rand index,
jaccard Jaccard index, and
rand Rand index.

Details are given in the Method section. The default is to calculate only the adjusted Rand index.

The ESTIMATES parameter can save the calculated values in a pointer containing a scalar for each index. The elements of the pointer are labelled by the index names, and are defined so that you can refer to them in lower case, upper case or a mixture.

The PRINT option controls the printed output, with settings:

indexes prints the indexes, and
tests prints probabilities obtained from random permutation tests.

The random permutation tests allow you to assess whether the similarity may have arisen only by chance. The NTIMES option specifies the number of permutations to take (default 999). HCOMPAREGROUPINGS checks whether NTIMES is greater than the number of possible permutations available for the data set. If so, it does an exact test instead, which uses each possible permutation once. The SEED option specifies the seed that is used to obtain the random numbers used to form the permutations.
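
As an illustration outside Genstat, the idea of the random permutation test can be sketched in Python. This is a minimal sketch under stated assumptions, not the procedure's own code: the names rand_index and permutation_test are hypothetical, the probability uses the common (count + 1) / (NTIMES + 1) convention (the text above does not say how HCOMPAREGROUPINGS forms its probabilities), and the exact-test fallback is not implemented.

```python
import random
from collections import Counter
from math import comb


def rand_index(first, second):
    """Rand index, computed from the cross-classification counts."""
    total_pairs = comb(len(first), 2)
    # Pairs in the same group in both factors (np1).
    np1 = sum(comb(m, 2) for m in Counter(zip(first, second)).values())
    same_first = sum(comb(a, 2) for a in Counter(first).values())
    same_second = sum(comb(b, 2) for b in Counter(second).values())
    # Pairs in different groups in both factors (np2), by inclusion-exclusion.
    np2 = total_pairs - same_first - same_second + np1
    return (np1 + np2) / total_pairs


def permutation_test(first, second, ntimes=999, seed=None):
    """Permute the second grouping at random and recompute the index.

    Assumption: seed=None lets Python initialize the generator itself.
    Assumption: probability = (number of permuted values >= observed + 1)
    divided by (ntimes + 1).
    """
    rng = random.Random(seed)
    observed = rand_index(first, second)
    shuffled = list(second)
    permuted = []
    for _ in range(ntimes):
        rng.shuffle(shuffled)
        permuted.append(rand_index(first, shuffled))
    n_extreme = sum(value >= observed for value in permuted)
    probability = (n_extreme + 1) / (ntimes + 1)
    return observed, permuted, probability


observed, permuted, probability = permutation_test(
    [1, 1, 2, 2, 3, 3], [1, 1, 2, 3, 3, 3], ntimes=99, seed=353445
)
print(observed, probability)
```

The list of permuted values plays the role of the variate that PERMUTATIONESTIMATES saves for each index. Fixing the seed makes the run reproducible.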

The PERMUTATIONESTIMATES parameter can save the values calculated in the random permutations, in a pointer containing a variate for each index. The elements of the pointer are labelled by the index names, and are defined so that you can refer to them in lower case, upper case or a mixture.

You can set option PLOT=histogram to plot histograms showing where the calculated value of each index lies within those obtained from the permutation tests.

Options: PRINT, PLOT, METHOD, NTIMES.
Parameters: FIRSTGROUPING, SECONDGROUPING, ESTIMATES, SEED, PERMUTATIONESTIMATES.

Method

The Rand index (Rand 1971) is defined as

( np1 + np2 ) / NC2

where

np1 is the number of pairs of units that are in the same group in both factors,
np2 is the number of pairs of units that are in different groups in both factors,
N  is the total number of units, and
NC2 is the total number of ways of selecting 2 units from a sample of N units,
which can be calculated as N×(N-1)/2.

This ranges from zero (for no similarity) to one (for complete similarity).
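
The Rand index definition can be checked on a small data set with a short Python sketch. This is an illustration outside Genstat, assuming the groupings are given as two equal-length lists of group labels; the function names pair_counts and rand_index are hypothetical.

```python
from itertools import combinations


def pair_counts(first, second):
    """Return (np1, np2): pairs together in both factors, and apart in both."""
    if len(first) != len(second):
        raise ValueError("groupings must have the same number of units")
    np1 = np2 = 0
    for i, j in combinations(range(len(first)), 2):
        same_first = first[i] == first[j]
        same_second = second[i] == second[j]
        if same_first and same_second:
            np1 += 1
        elif not same_first and not same_second:
            np2 += 1
    return np1, np2


def rand_index(first, second):
    """Rand index (np1 + np2) / NC2."""
    n = len(first)
    if n < 2:
        raise ValueError("at least two units are needed")
    total_pairs = n * (n - 1) // 2  # NC2
    np1, np2 = pair_counts(first, second)
    return (np1 + np2) / total_pairs


first = [1, 1, 2, 2, 3, 3]
second = [1, 1, 2, 3, 3, 3]
print(pair_counts(first, second))  # (2, 10)
print(rand_index(first, second))   # 0.8
```

Here 2 of the 15 pairs are together in both groupings and 10 are apart in both, giving (2 + 10) / 15 = 0.8.
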
The adjusted Rand index of Hubert & Arabie (1985) is defined as

[ ∑ij ( mijC2 ) – { ∑i ( aiC2 ) × ∑j ( bjC2 ) } / NC2 ] /
[ ½ × { ∑i ( aiC2 ) + ∑j ( bjC2 ) } – { ∑i ( aiC2 ) × ∑j ( bjC2 ) } / NC2 ]

where

mij  is the number of units that are in group i for the first factor, and group j for the second factor,
ai  is the number of units in group i of the first factor, and
bj  is the number of units in group j of the second factor.

The first term in the numerator measures the agreement between the groupings. The second term is the expected value of the first term, assuming a generalized hypergeometric distribution, and the first term of the denominator is its maximum value. The index has an expected value of zero if the groupings are independent, and a value of one if they are in complete agreement.
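
The adjusted Rand index can be sketched in Python from the cross-classification counts. This is an illustration outside Genstat, taking the maximum term as half the sum of the two marginal pair counts, as in Hubert & Arabie (1985). The function name adjusted_rand_index is hypothetical, and returning NaN when the maximum equals the expected value is an assumption of this sketch.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(first, second):
    """Adjusted Rand index from two equal-length lists of group labels."""
    if len(first) != len(second):
        raise ValueError("groupings must have the same number of units")
    if len(first) < 2:
        raise ValueError("at least two units are needed")
    total_pairs = comb(len(first), 2)  # NC2
    # Sum over cells of the two-way table of mij choose 2.
    sum_m = sum(comb(m, 2) for m in Counter(zip(first, second)).values())
    # Sums over the margins of ai choose 2 and bj choose 2.
    sum_a = sum(comb(a, 2) for a in Counter(first).values())
    sum_b = sum(comb(b, 2) for b in Counter(second).values())
    expected = sum_a * sum_b / total_pairs
    maximum = (sum_a + sum_b) / 2
    if maximum == expected:
        # Assumption: the index is undefined here (for example, when both
        # groupings put every unit into a single group), so return NaN.
        return float("nan")
    return (sum_m - expected) / (maximum - expected)


first = [1, 1, 2, 2, 3, 3]
second = [1, 1, 2, 3, 3, 3]
print(round(adjusted_rand_index(first, second), 4))  # 0.4444
```

For these data the agreement term is 2, the expected value is 3 × 4 / 15 = 0.8 and the maximum is 3.5, so the index is 1.2 / 2.7.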

The Jaccard index is defined as

np1 / ( NC2 – np2 )

This is similar to the Rand index, except that it excludes the pairs of units that are in different groups in both factors.
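
The Jaccard index can be sketched in Python in the same way. This is an illustration outside Genstat; the function name jaccard_index is hypothetical, and returning NaN when every pair is in different groups in both factors is an assumption of this sketch.

```python
from itertools import combinations


def jaccard_index(first, second):
    """Jaccard index np1 / (NC2 - np2) for two equal-length label lists."""
    if len(first) != len(second):
        raise ValueError("groupings must have the same number of units")
    np1 = np2 = total_pairs = 0
    for i, j in combinations(range(len(first)), 2):
        total_pairs += 1
        same_first = first[i] == first[j]
        same_second = second[i] == second[j]
        if same_first and same_second:
            np1 += 1  # together in both factors
        elif not same_first and not same_second:
            np2 += 1  # apart in both factors
    denominator = total_pairs - np2
    if denominator == 0:
        # Assumption: no pair is together in either factor, so the index
        # is undefined and NaN is returned.
        return float("nan")
    return np1 / denominator


first = [1, 1, 2, 2, 3, 3]
second = [1, 1, 2, 3, 3, 3]
print(jaccard_index(first, second))  # 0.4
```

With 15 pairs in total, 10 of them apart in both groupings and 2 together in both, the index is 2 / (15 - 10) = 0.4.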

Action with RESTRICT

There must be no restrictions.

See also

Directives: CLUSTER, FACTOR, HCLUSTER.
Commands for: Multivariate and cluster analysis.

Example

CAPTION     'HCOMPAREGROUPINGS example',\
            !t('Compare groupings from average and single-link cluster',\
            'analyses of cars in Guide to Genstat, Part 2, Section 6.1.2.');\
            STYLE=meta,plain
TEXT        Cars; !T(Estate,'Arna1.5','Alfa2.5',Mondialqc,Testarossa,Croma,\
            Panda,Regatta,Regattad,Uno,X19,Contach,Delta,Thema,Y10,Spider)
POINTER     Vars; !P(CC,NCyl,Tank,Wt,Length,Width,Ht,WBase,TSpeed,StSt,\
            Carb,Drive)
VARIATE     [NVALUES=Cars] Vars[]
READ        [PRINT=*] Vars[]
 1490  4  50  966 414 161 133 245 177 10.9  1  2
 1409  4  50  845 399 162 139 242 174 10.2  1  2
 2492  6  49 1160 433 163 140 251 210  8.2  1  1
 3185  8  87 1430 458 179 126 265 249  7.4  2  1
 4942 12 120 1506 449 198 113 255 291  5.8  2  1
 1995  4  70 1180 450 176 143 266 209  7.8  2  2
  965  4  35  761 338 149 146 216 134 16.8  1  2
 1585  4  55  970 426 165 141 244 180 10.0  1  2
 1714  4  55  980 426 165 141 245 150 18.9  3  2
  999  4  42  720 364 155 143 236 145 16.2  1  2
 1498  4  48  912 397 157 118 220 171 11.0  1  1
 5167 12 120 1446 414 200 107 245 286  4.9  1  1
 1585  4  45 1000 389 162 138 247 195  8.2  1  2
 1995  4  70 1150 459 175 143 266 224  7.6  2  2
 1049  4  47  790 339 151 143 216 179 11.8  1  2
 1995  4  45 1050 414 162 125 228 190  9.0  2  1 :
SYMMETRIC   [ROWS=Cars] CarSim
FSIMILARITY [SIMILARITY=CarSim]\
            Vars[]; TEST=4(cityblock,euclidean),2(cityblock,simplematching)
HCLUSTER    [PRINT=dendrogram; METHOD=average] CarSim;\
            GROUPS=AverageLink; GTHRESHOLD=90
HCLUSTER    [PRINT=dendrogram; METHOD=single] CarSim;\
            GROUPS=SingleLink; GTHRESHOLD=90
SORT        [INDEX=AverageLink,Cars] AverageLink,Cars; NEWV=Group,Car
PRINT       Group,Car
SORT        [INDEX=SingleLink,Cars] SingleLink,Cars; NEWV=Group,Car
PRINT       Group,Car
HCOMPAREGROUPINGS [PRINT=indexes,tests] FIRSTGROUPING=AverageLink;\
            SECONDGROUPING=SingleLink; SEED=353445

 

Updated on September 12, 2019
