FSIMILARITY directive

Forms a similarity matrix or a between-group-elements similarity matrix, or prints an existing similarity matrix.

Options

`PRINT` = string token	Printed output required (`similarities`, `summary`); default `*` i.e. no printing
`STYLE` = string token	Print percentage similarities in full or just the 10% digit (`full, abbreviated`); default `full`
`METHOD` = string token	Form similarity matrix or rectangular between-group-element similarity matrix (`similarities, betweengroupsimilarities`); default `similarities`
`SIMILARITY` = matrix or symmetric matrix	Input or output matrix of similarities; default `*`
`GROUPS` = factor	Grouping of units into two groups for between-group-element similarity matrix; default `*`
`PERMUTATION` = variate	Permutation of units (possibly from `HCLUSTER`) for order in which units of the similarity matrix are printed; default `*`
`UNITS` = text or variate	Unit names to label the rows of the similarity matrix; default `*`
`MINKOWSKI` = scalar	Index t for use with `TEST=minkowski`

Parameters

`DATA` = variates or factors	The data values
`TEST` = string tokens	Test type, defining how each `DATA` variate or factor is treated in the calculation of the similarity between each unit (`simplematching`, `jaccard`, `russellrao`, `dice`, `antidice`, `sneathsokal`, `rogerstanimoto`, `cityblock`, `manhattan`, `ecological`, `euclidean`, `pythagorean`, `minkowski`, `divergence`, `canberra`, `braycurtis`, `soergel`); default `*` ignores that variate or factor
`RANGE` = scalars	Range of possible values of each `DATA` variate or factor; if omitted, the observed range is taken

Description

The FSIMILARITY directive forms similarity matrices, essentially using the method described by Gower (1971). The similarity coefficient that is calculated allows variables to be qualitative, quantitative or dichotomous, or mixtures of these types; values of some of the variables may be missing for some samples. The values of a similarity coefficient vary between zero and unity: two samples have a similarity of unity only when both have identical values for all variables; a value of zero occurs when the values for the two samples differ maximally for all variables.

You can form a symmetric matrix of similarities, or a rectangular matrix of similarities between the units in two groups. You can save either form of similarity matrix, using the SIMILARITY option. FSIMILARITY can also be used to print the symmetric matrix of similarities after it has formed it; alternatively, you can input an existing similarity matrix for printing, using the SIMILARITY option.

The DATA parameter specifies a list of variates or factors, all of which must be of the same length. If you want to print an existing similarity matrix, the DATA parameter (and the TEST and RANGE parameters) should be omitted, and the SIMILARITY option used to input the matrix concerned.

The TEST parameter specifies a list of strings, one for each variate or factor in the DATA parameter list, that define their “types”. If you want to exclude a variate or factor from contributing, you should specify an empty string (* or ''). Otherwise the similarity between units i and j is calculated as

∑_k { w_k(x_ik, x_jk) s_k(x_ik, x_jk) } / ∑_k w_k(x_ik, x_jk)

where x_ik is the value of the DATA variate k in unit i, and the contribution functions s_k and weight functions w_k for a variate or factor k of the available types are defined in the tables below (for further details see Gower 1971, 1985).
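As a minimal sketch of the formula above (written in Python rather than Genstat, with hypothetical function names), the coefficient for a pair of units is a weighted mean of the per-variable contributions. The `simplematching` and `cityblock` definitions follow the tables in this section; returning zero when every weight is zero is an assumption of the sketch, not documented Genstat behaviour.

```python
def simplematching(xi, xj):
    """Return (contribution s_k, weight w_k) for a qualitative variable."""
    return (1.0 if xi == xj else 0.0, 1.0)


def cityblock(r):
    """Build a (contribution, weight) function for a quantitative variate of range r."""
    def contribution(xi, xj):
        return (1.0 - abs(xi - xj) / r, 1.0)
    return contribution


def similarity(unit_i, unit_j, tests):
    """Weighted mean of the contributions s_k, with weights w_k, over all variables k."""
    numerator = 0.0
    denominator = 0.0
    for xi, xj, test in zip(unit_i, unit_j, tests):
        s, w = test(xi, xj)
        numerator += w * s
        denominator += w
    # Assumption of this sketch: if no variable carries any weight, return 0.
    return numerator / denominator if denominator > 0 else 0.0


# Two units described by a colour (qualitative) and a length with range 10.
tests = [simplematching, cityblock(10.0)]
sim = similarity(("red", 4.0), ("red", 9.0), tests)
print(sim)  # (1*1 + 1*0.5) / (1 + 1) = 0.75
```

The colours match (contribution 1) and the lengths differ by half the range (contribution 0.5), so the two equally weighted contributions average to 0.75.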

The first table contains the types appropriate for variates that are recording the presence or absence of a characteristic; these cannot be used with factors.

Type	Contribution s_k	Weight w_k
`Jaccard`	if x_i ≠ 0 and x_j ≠ 0, then 1	1
	if x_i = x_j = 0, then 0	0
	if only one of x_i or x_j = 0, then 0	1
`RussellRao`	if x_i ≠ 0 and x_j ≠ 0, then 1	1
	if x_i = 0 or x_j = 0, then 0	1
`Dice`	if x_i ≠ 0 and x_j ≠ 0, then 1	1
	if x_i = x_j = 0, then 0	0
	if only one of x_i or x_j = 0, then 0	0.5
`antidice`	if x_i ≠ 0 and x_j ≠ 0, then 1	1
	if x_i = x_j = 0, then 0	0
	if only one of x_i or x_j = 0, then 0	2
`SneathSokal`	if x_i ≠ 0 and x_j ≠ 0, then 1	1
	if x_i = x_j = 0, then 1	1
	if only one of x_i or x_j = 0, then 0	0.5
`RogersTanimoto`	if x_i ≠ 0 and x_j ≠ 0, then 1	1
	if x_i = x_j = 0, then 1	1
	if only one of x_i or x_j = 0, then 0	2
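The table above can be read as a lookup from the three possible cases (both present, both absent, exactly one present) to a contribution and a weight. The following Python sketch, with hypothetical names, encodes the table that way and pools the contributions over several presence/absence variates. It is an illustration of the arithmetic only, and returning zero when all weights vanish is an assumption.

```python
# (s_k, w_k) for the cases: both present, both absent, exactly one present.
DICHOTOMOUS = {
    "jaccard":        {"both": (1, 1), "neither": (0, 0), "one": (0, 1)},
    "russellrao":     {"both": (1, 1), "neither": (0, 1), "one": (0, 1)},
    "dice":           {"both": (1, 1), "neither": (0, 0), "one": (0, 0.5)},
    "antidice":       {"both": (1, 1), "neither": (0, 0), "one": (0, 2)},
    "sneathsokal":    {"both": (1, 1), "neither": (1, 1), "one": (0, 0.5)},
    "rogerstanimoto": {"both": (1, 1), "neither": (1, 1), "one": (0, 2)},
}


def case(xi, xj):
    """Classify a pair of presence/absence values (non-zero means present)."""
    if xi != 0 and xj != 0:
        return "both"
    if xi == 0 and xj == 0:
        return "neither"
    return "one"


def dichotomous_similarity(unit_i, unit_j, test):
    """Pool contributions over all variates, treating each with the same type."""
    numerator = denominator = 0.0
    for xi, xj in zip(unit_i, unit_j):
        s, w = DICHOTOMOUS[test][case(xi, xj)]
        numerator += w * s
        denominator += w
    # Assumption of this sketch: return 0 when every weight is zero.
    return numerator / denominator if denominator > 0 else 0.0


unit_i = [1, 1, 0, 0, 1]
unit_j = [1, 0, 0, 1, 1]
results = {name: dichotomous_similarity(unit_i, unit_j, name) for name in DICHOTOMOUS}
for name, value in results.items():
    print(f"{name:15s} {value:.4f}")
```

Here two characteristics are shared, one is absent from both units and two are present in only one unit, so for example the Jaccard type gives 2 / (2 + 2) = 0.5 while the Russell-Rao type gives 2 / 5 = 0.4.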

The simplematching type is appropriate for qualitative variables, which may be either variates or factors.

Type	Contribution s_k	Weight w_k
`simplematching`	if x_i = x_j, then 1	1
	if x_i ≠ x_j, then 0	1

The next table shows the types that can be used for quantitative variates (but not factors). In the definitions, r is the range of the variate and t is the Minkowski index (defined by the MINKOWSKI option). Note that BrayCurtis and Soergel should not be mixed with other types.

Type	Contribution s_k	Weight w_k
`cityblock`	1 – |x_i – x_j| / r	1
`Manhattan`	synonymous with `cityblock`
`ecological`	1 – |x_i – x_j| / r	1
	unless x_i = x_j = 0	0
`Euclidean`	1 – {(x_i – x_j) / r}²	1
`Pythagorean`	synonymous with `Euclidean`
`Minkowski`	1 – |x_i – x_j|^t / r^t	1
`Divergence`	1 – {(x_i – x_j) / (x_i + x_j)}²	1
`Canberra`	1 – |x_i – x_j| / (|x_i| + |x_j|)	1
`BrayCurtis`	1 – |x_i – x_j| / (x_i + x_j)	x_i + x_j
`Soergel`	1 – |x_i – x_j| / max(x_i, x_j)	max(x_i, x_j)
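A short Python sketch (hypothetical function names, not Genstat code) may help in reading the table. It implements a few of the quantitative types as (contribution, weight) pairs and shows that, because the BrayCurtis and Soergel weights depend on the data, the pooled coefficient reduces to a ratio of sums over the variates. Giving zero weight to a pair whose weight term is zero is an assumption of the sketch.

```python
def cityblock(xi, xj, r):
    return 1 - abs(xi - xj) / r, 1.0


def euclidean(xi, xj, r):
    return 1 - ((xi - xj) / r) ** 2, 1.0


def minkowski(xi, xj, r, t):
    return 1 - abs(xi - xj) ** t / r ** t, 1.0


def braycurtis(xi, xj):
    total = xi + xj
    if total == 0:
        return 0.0, 0.0  # assumption: a pair with zero weight contributes nothing
    return 1 - abs(xi - xj) / total, total


def soergel(xi, xj):
    largest = max(xi, xj)
    if largest == 0:
        return 0.0, 0.0  # assumption: a pair with zero weight contributes nothing
    return 1 - abs(xi - xj) / largest, largest


def pooled(unit_i, unit_j, test):
    """Weighted mean of the contributions over all variates."""
    pairs = [test(xi, xj) for xi, xj in zip(unit_i, unit_j)]
    total_weight = sum(w for _, w in pairs)
    if total_weight == 0:
        return 0.0
    return sum(w * s for s, w in pairs) / total_weight


unit_i = [2.0, 5.0, 3.0]
unit_j = [4.0, 1.0, 3.0]

bray = pooled(unit_i, unit_j, braycurtis)
soer = pooled(unit_i, unit_j, soergel)

# The same values written as ratios of sums over the variates.
abs_diff = sum(abs(a - b) for a, b in zip(unit_i, unit_j))
bray_ratio = 1 - abs_diff / sum(a + b for a, b in zip(unit_i, unit_j))
soer_ratio = 1 - abs_diff / sum(max(a, b) for a, b in zip(unit_i, unit_j))

print(bray, bray_ratio)
print(soer, soer_ratio)
```

With t = 1 the Minkowski contribution coincides with `cityblock`, and with t = 2 it coincides with `Euclidean`, as the definitions in the table imply.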

The RANGE parameter contains a list of scalars, one for each variate or factor in the DATA list. This allows you to check that the values of each variate lie within the given range. If any variate or factor fails the range check, FSIMILARITY gives an error diagnostic and terminates without forming the similarity matrix. The range is also used to standardize quantitative variates; this lets you impose a standard range, for example when variates are measured on commensurate scales. You can omit the RANGE parameter for all or any of the variates or factors by giving a missing identifier or a scalar with a missing value; Genstat then uses the observed range. If PRINT=summary, Genstat prints the name, the minimum value, and the range for each variate and factor.
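To illustrate how the range standardizes a quantitative variate, the following Python sketch compares the `cityblock` contribution under the observed range and under an imposed one. The data values and names are hypothetical, and "observed range" is taken here to mean maximum minus minimum.

```python
def cityblock_contribution(xi, xj, r):
    """Contribution 1 - |x_i - x_j| / r for a quantitative variate of range r."""
    return 1 - abs(xi - xj) / r


heights = [150.0, 160.0, 170.0]  # hypothetical measurements in cm

# Assumption: the observed range is the maximum minus the minimum.
observed_range = max(heights) - min(heights)
imposed_range = 50.0

# The two extreme units are maximally dissimilar under the observed range ...
s_observed = cityblock_contribution(heights[0], heights[2], observed_range)
# ... but only moderately dissimilar when a wider standard range is imposed.
s_imposed = cityblock_contribution(heights[0], heights[2], imposed_range)

print(observed_range)  # 20.0
print(s_observed)      # 1 - 20/20 = 0.0
print(s_imposed)       # 1 - 20/50
```

Imposing the same range on several commensurate variates therefore keeps a given absolute difference equally important in each of them, rather than letting each variate's own spread decide.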

The METHOD option controls what type of matrix is produced. METHOD=similarities, the default, gives a symmetric matrix of similarities amongst a single set of units. METHOD=betweengroupsimilarities gives a rectangular matrix of similarities between two sets of units. To form a rectangular matrix of similarities, you must also define the grouping of units by setting the GROUPS option (see below).

The PRINT, STYLE and PERMUTATION options govern the printing of a symmetric matrix of similarities. You can either form the similarity matrix within FSIMILARITY, or input it using the SIMILARITY option. To print the similarity matrix, set option PRINT=similarities. The STYLE option has two settings, full (the default) or abbreviated. In full style, the values are displayed as percentages with one decimal place. With STYLE=abbreviated, each value is printed as a single digit with no spaces between digits, the digit being the tens digit of the similarity expressed as a percentage. In both cases the matrix itself stores the actual similarities, in the range 0 to 1. The PERMUTATION option lets you specify a variate whose values give the order in which the rows of the similarity matrix are to be printed. Reordering the rows is most effective when the permutation arises from a hierarchical clustering and corresponds to the dendrogram order.
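The two print styles and the effect of a permutation can be sketched in Python as follows. This is not Genstat's output routine: the names are hypothetical, the permutation is assumed to hold 1-based unit numbers applied to both rows and columns, the abbreviated digit is assumed to be obtained by truncation, and only the entries below the diagonal are formatted because the rendering of a 100% similarity is not described here.

```python
def format_full(s):
    """Percentage similarity with one decimal place."""
    return f"{100 * s:.1f}"


def format_abbreviated(s):
    """Tens digit of the percentage similarity (assumes s < 1, truncating)."""
    return str(int(100 * s // 10))


def reorder(sim, permutation):
    """Reorder rows and columns of a symmetric matrix by 1-based unit numbers."""
    index = [p - 1 for p in permutation]
    return [[sim[a][b] for b in index] for a in index]


sim = [
    [1.0,   0.347, 0.912],
    [0.347, 1.0,   0.58],
    [0.912, 0.58,  1.0],
]
permutation = [1, 3, 2]  # hypothetical order, e.g. from a clustering

ordered = reorder(sim, permutation)

# Format the strict lower triangle, row by row.
full_rows = [[format_full(ordered[a][b]) for b in range(a)] for a in range(1, 3)]
abbreviated_rows = ["".join(format_abbreviated(ordered[a][b]) for b in range(a))
                    for a in range(1, 3)]

for row in full_rows:
    print(" ".join(row))
for row in abbreviated_rows:
    print(row)
```

After reordering, units 1 and 3 (similarity 0.912) become neighbours, which is the kind of grouping that makes an abbreviated printout easier to scan.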

You use the GROUPS option to specify a partition of the units into two groups, by giving a factor with two levels. The units with level 1 of the factor correspond to the rows of the matrix, while the units with level 2 correspond to the columns.
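As a rough illustration of the layout, the Python sketch below builds a rectangular between-group matrix from a two-level grouping: units at level 1 index the rows and units at level 2 index the columns. The data, the single `cityblock` variate with range 10 and the function names are all hypothetical.

```python
def cityblock_similarity(xi, xj, r=10.0):
    """Similarity on a single quantitative variate with range r."""
    return 1 - abs(xi - xj) / r


def between_group_matrix(values, groups, similarity):
    """Rows are units with group level 1, columns are units with level 2."""
    rows = [v for v, g in zip(values, groups) if g == 1]
    cols = [v for v, g in zip(values, groups) if g == 2]
    return [[similarity(r, c) for c in cols] for r in rows]


values = [0.0, 10.0, 5.0, 2.0, 8.0]
groups = [1, 2, 1, 2, 1]  # two-level factor: three units in group 1, two in group 2

matrix = between_group_matrix(values, groups, cityblock_similarity)

print(len(matrix), len(matrix[0]))  # 3 rows, 2 columns
for row in matrix:
    print([round(s, 3) for s in row])
```

No similarities within a group are computed, so the result has one row per level-1 unit and one column per level-2 unit rather than being square.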

The UNITS option lets you label the rows of the output similarity matrix if the variates of the DATA parameter do not have any unit labels, or if you want to use different labels from those labelling the units of the variates. This labelling also applies to the rows and columns of a matrix of similarities between group elements.

Options: PRINT, STYLE, METHOD, SIMILARITY, GROUPS, PERMUTATION, UNITS, MINKOWSKI.
Parameters: DATA, TEST, RANGE.

Action with `RESTRICT`

If any of the DATA variates or factors is restricted, or if the factor in the GROUPS option is restricted, then that restriction is applied to all the variates or factors. If more than one is restricted, then the restrictions must all be to the same set of units. The dimension of the resulting symmetric matrix of similarities is taken from the number of units that contribute to the similarity matrix.

References

Gower, J.C. (1971). A general coefficient of similarity and some of its properties. Biometrics, 27, 857-871.
Gower, J.C. (1985). Measures of similarity, dissimilarity and distance. In: Encyclopedia of Statistical Sciences, Volume 5, 397-405.

Example

" Genstat example HCLU-1: Cluster analysis

   Data from 'Observers Book of Automobiles', 1986
   16 Italian cars and 12 measurements:
   1.  engine capacity        c.c.        CC
   2.  number of cylinders                NCyl
   3.  fuel tank              litres      Tank
   4.  unladen weight         kg          Wt
   5.  length                 cm          Length
   6.  width                  cm          Width
   7.  height                 cm          Ht
   8.  wheelbase              cm          Wbase
   9.  top speed              kph         TSpeed
  10.  time to 100kph         secs        StSt
  11.  carburettor/inj/diesel 1/2/3       Carb
  12.  front/rear wheel drive 1/2         Drive
"

TEXT [VALUES=Estate,'Arna1.5','Alfa2.5',Mondialqc,Testarossa,Croma,\ 
  Panda,Regatta,Regattad,Uno,X19,Contach,Delta,Thema,Y10,Spider] Cars
POINTER [VALUES=CC,NCyl,Tank,Wt,Length,Width,Ht,WBase,TSpeed,StSt,\ 
  Carb,Drive] Vars
" Read the data - measurements and carnames - from the file
 'HCLU-1.DAT', and then display it."
OPEN '%gendir%/examples/HCLU-1.DAT'; CHANNEL=cardat
READ [CHANNEL=cardat] Vars[]
CLOSE cardat

" Treat the number of cylinders, data[2], differently to the 
  continuous measurements."
HLIST [UNITS=Cars] \
  Vars[]; TEST=4(cityblock,euclidean),2(cityblock,simplematching)

" Form a hierarchical clustering of the cars,
  using the single linkage method."
SYMMETRIC [ROWS=Cars] CarSim
FSIMILARITY [SIMILARITY=CarSim]\ 
  Vars[]; TEST=4(cityblock,euclidean),2(cityblock,simplematching)
HCLUSTER [PRINT=amalgamations; METHOD=single] CarSim

" Use the average-linkage method."
HCLUSTER [PRINT=dendrogram; METHOD=average] CarSim;\ 
  AMALGAMATIONS=Am; PERMUTATION=Perm

" Display a high-resolution dendrogram."
DDENDROGRAM [ORDERING=given] DATA=Am; PERMUTATION=Perm; LABELS=Cars;\ 
  TITLE='Italian cars clustered by average linkage'

Updated on September 2, 2019
