FSIMILARITY directive

Forms a similarity matrix or a between-group-elements similarity matrix, or prints a similarity matrix.

Options

PRINT = string token                     Printed output required (similarities, summary); default * i.e. no printing
STYLE = string token                     Print percentage similarities in full or just the 10% digit (full, abbreviated); default full
METHOD = string token                    Form similarity matrix or rectangular between-group-element similarity matrix (similarities, betweengroupsimilarities); default simi
SIMILARITY = matrix or symmetric matrix  Input or output matrix of similarities; default *
GROUPS = factor                          Grouping of units into two groups for between-group-element similarity matrix; default *
PERMUTATION = variate                    Permutation of units (possibly from HCLUSTER) for order in which units of the similarity matrix are printed; default *
UNITS = text or variate                  Unit names to label the rows of the similarity matrix; default *
MINKOWSKI = scalar                       Index t for use with TEST=minkowski

Parameters

DATA = variates or factors  The data values
TEST = string tokens        Test type, defining how each DATA variate or factor is treated in the calculation of the similarity between each unit (simplematching, jaccard, russellrao, dice, antidice, sneathsokal, rogerstanimoto, cityblock, manhattan, ecological, euclidean, pythagorean, minkowski, divergence, canberra, braycurtis, soergel); default * ignores that variate or factor
RANGE = scalars             Range of possible values of each DATA variate or factor; if omitted, the observed range is taken

Description

The FSIMILARITY directive forms similarity matrices, essentially using the method described by Gower (1971). The similarity coefficient that is calculated allows variables to be qualitative, quantitative or dichotomous, or mixtures of these types; values of some of the variables may be missing for some samples. The values of a similarity coefficient vary between zero and unity: two samples have a similarity of unity only when both have identical values for all variables; a value of zero occurs when the values for the two samples differ maximally for all variables.

You can form a symmetric matrix of similarities, or a rectangular matrix of similarities between the units in two groups. You can save either form of similarity matrix, using the SIMILARITY option. FSIMILARITY can also be used to print the symmetric matrix of similarities after it has formed it; alternatively, you can input an existing similarity matrix for printing, using the SIMILARITY option.

The DATA parameter specifies a list of variates or factors, all of which must be of the same length. If you want to print an existing similarity matrix, the DATA parameter (and the TEST and RANGE parameters) should be omitted, and the SIMILARITY option used to input the matrix concerned.
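
As a minimal sketch of the print-only use described above, the following enters a small similarity matrix and prints it. The matrix Sim and its values are hypothetical, chosen only for illustration.

```genstat
" Hypothetical 3 x 3 similarity matrix, entered in lower-triangular order. "
SYMMETRICMATRIX [ROWS=3; VALUES=1, 0.8,1, 0.25,0.5,1] Sim
" DATA, TEST and RANGE are omitted, so FSIMILARITY only prints the matrix. "
FSIMILARITY [PRINT=similarities; SIMILARITY=Sim]
```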

The TEST parameter specifies a list of strings, one for each variate or factor in the DATA parameter list, that define their “types”. If you want to exclude a variate or factor from contributing, you should specify an empty string (* or ''). Otherwise the similarity between units i and j is calculated as

$$\frac{\sum_k w_k(x_{ik}, x_{jk}) \, s_k(x_{ik}, x_{jk})}{\sum_k w_k(x_{ik}, x_{jk})}$$

where x_ik is the value of the DATA variate k in unit i, and the contribution functions s_k and weight functions w_k for a variate or factor k of the available types are defined in the tables below (for further details see Gower 1971, 1985).

The first table contains the types appropriate for variates that are recording the presence or absence of a characteristic; these cannot be used with factors.

Type            Contribution s_k                         Weight w_k
Jaccard         if x_i ≠ 0 and x_j ≠ 0, then 1           1
                if x_i = x_j = 0, then 0                 0
                if only one of x_i or x_j = 0, then 0    1
RussellRao      if x_i ≠ 0 and x_j ≠ 0, then 1           1
                if x_i = 0 or x_j = 0, then 0            1
Dice            if x_i ≠ 0 and x_j ≠ 0, then 1           1
                if x_i = x_j = 0, then 0                 0
                if only one of x_i or x_j = 0, then 0    0.5
antidice        if x_i ≠ 0 and x_j ≠ 0, then 1           1
                if x_i = x_j = 0, then 0                 0
                if only one of x_i or x_j = 0, then 0    2
SneathSokal     if x_i ≠ 0 and x_j ≠ 0, then 1           1
                if x_i = x_j = 0, then 1                 1
                if only one of x_i or x_j = 0, then 0    0.5
RogersTanimoto  if x_i ≠ 0 and x_j ≠ 0, then 1           1
                if x_i = x_j = 0, then 1                 1
                if only one of x_i or x_j = 0, then 0    2
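
As a worked illustration of this table (the data are hypothetical), suppose units i and j are compared on four presence/absence variates whose value pairs are (1,1), (1,0), (0,0) and (0,1). Giving all four variates the same type, and writing each term of the numerator as weight times contribution, the similarity works out as follows:

```latex
\begin{align*}
\text{jaccard:}     \quad & \frac{1\cdot 1 + 1\cdot 0 + 0\cdot 0 + 1\cdot 0}{1 + 1 + 0 + 1} = \frac{1}{3} \\
\text{dice:}        \quad & \frac{1\cdot 1 + 0.5\cdot 0 + 0\cdot 0 + 0.5\cdot 0}{1 + 0.5 + 0 + 0.5} = \frac{1}{2} \\
\text{sneathsokal:} \quad & \frac{1\cdot 1 + 0.5\cdot 0 + 1\cdot 1 + 0.5\cdot 0}{1 + 0.5 + 1 + 0.5} = \frac{2}{3}
\end{align*}
```

The joint absence (0,0) has weight 0 under jaccard and dice, so it drops out of both sums, whereas sneathsokal counts it as a match.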

The simplematching type is appropriate for qualitative variables, which may be either variates or factors.

Type            Contribution s_k          Weight w_k
simplematching  if x_i = x_j, then 1      1
                if x_i ≠ x_j, then 0      1
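
The following sketch combines presence/absence variates with a qualitative factor. The structures Spots, Stripes and Colour and their values are hypothetical.

```genstat
" Hypothetical data: four units, two presence/absence variates and one factor. "
VARIATE [VALUES=1,1,0,0] Spots
VARIATE [VALUES=1,0,0,1] Stripes
FACTOR [LEVELS=3; VALUES=1,1,2,3] Colour
" One TEST setting per DATA structure, in the same order. "
FSIMILARITY [PRINT=similarities] DATA=Spots,Stripes,Colour; \
  TEST=jaccard,jaccard,simplematching
```

Working from the tables above, units 1 and 2 agree on Spots and Colour and differ on Stripes, each with weight 1, so their similarity should be 2/3.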

The next table shows the types that can be used for quantitative variates (but not factors). In the definitions, r is the range of the variate, and t is the Minkowski index (defined by the MINKOWSKI option). Note, however, that BrayCurtis and Soergel should not be mixed with other types.

Type         Contribution s_k                       Weight w_k
cityblock    1 − |x_i − x_j| / r                    1
Manhattan    synonymous with cityblock
ecological   1 − |x_i − x_j| / r                    1
             unless x_i = x_j = 0                   0
Euclidean    1 − {(x_i − x_j) / r}^2                1
Pythagorean  synonymous with Euclidean
Minkowski    1 − |x_i − x_j|^t / r^t                1
Divergence   1 − {(x_i − x_j) / (x_i + x_j)}^2      1
Canberra     1 − |x_i − x_j| / (|x_i| + |x_j|)      1
BrayCurtis   1 − |x_i − x_j| / (x_i + x_j)          x_i + x_j
Soergel      1 − |x_i − x_j| / max(x_i, x_j)        max(x_i, x_j)
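
As a sketch of how the Minkowski index might be supplied, the example below sets MINKOWSKI=3 and uses the minkowski type for one variate and cityblock for the other. The variates Height and Weight and their values are hypothetical.

```genstat
" Hypothetical measurements on five units. "
VARIATE [VALUES=1.2,3.4,2.2,5.0,4.1] Height
VARIATE [VALUES=10,14,9,20,16] Weight
" MINKOWSKI supplies the index t used by any variate with TEST=minkowski. "
FSIMILARITY [PRINT=similarities; MINKOWSKI=3] DATA=Height,Weight; \
  TEST=minkowski,cityblock
```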

The RANGE parameter contains a list of scalars, one for each variate or factor in the DATA list. This allows you to check that the values of each variate lie within the given range. If any variate or factor fails the range check, FSIMILARITY gives an error diagnostic and terminates without forming the similarity matrix. The range is also used to standardize quantitative variates; this lets you impose a standard range, for example when variates are measured on commensurate scales. You can omit the RANGE parameter for all or any of the variates or factors by giving a missing identifier or a scalar with a missing value; Genstat then uses the observed range. If PRINT=summary, Genstat prints the name, the minimum value, and the range for each variate and factor.
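
The sketch below imposes a range on one variate and leaves the other on its observed range. The variates Score and Length and their values are hypothetical, and the range of 10 assumes that Score is recorded on a scale of width 10.

```genstat
" Hypothetical scores on a scale of width 10, plus an unbounded measurement. "
VARIATE [VALUES=2,5,7,9] Score
VARIATE [VALUES=12.5,30.1,18.4,22.0] Length
" Range 10 is imposed on Score; * leaves Length on its observed range. "
FSIMILARITY [PRINT=summary,similarities] DATA=Score,Length; \
  TEST=cityblock,cityblock; RANGE=10,*
```

With PRINT=summary, the printed minimum and range for each variate show which range was actually used.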

The METHOD option controls what type of matrix is produced. METHOD=similarity, the default, gives a symmetric matrix of similarities amongst a single set of units. METHOD=betweengroupsimilarity gives a rectangular matrix of similarities between two sets of units. To form a rectangular matrix of similarities, you must also define the grouping of units by setting the GROUPS option (see below).

The PRINT, STYLE and PERMUTATION options govern the printing of a symmetric matrix of similarities. You can either form the similarity matrix within FSIMILARITY, or input it using the SIMILARITY option. To print the similarity matrix, set option PRINT=similarity. The STYLE option has two settings, full (the default) and abbreviated. In full style, the values of the similarity matrix are displayed as percentages with one decimal place. With STYLE=abbreviated, the values are printed as single digits with no spaces, each digit being the tens digit of the similarity expressed as a percentage. In both cases, though, the actual similarities in the range 0 to 1 are stored in the similarity matrix itself. The PERMUTATION option lets you specify a variate with values corresponding to the order in which you want the rows of the similarity matrix to be printed. The reordering of the rows is most effective when the permutation arises from a hierarchical clustering and corresponds to the dendrogram order.
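
The printing options can be sketched as follows: a similarity matrix is formed and saved, HCLUSTER supplies a permutation, and the matrix is then printed in abbreviated style in that order. The identifiers Height, Weight, Sim and Perm and the data values are hypothetical.

```genstat
" Hypothetical measurements on five units. "
VARIATE [VALUES=1.2,3.4,2.2,5.0,4.1] Height
VARIATE [VALUES=10,14,9,20,16] Weight
" Form and save the symmetric matrix of similarities. "
SYMMETRICMATRIX [ROWS=5] Sim
FSIMILARITY [SIMILARITY=Sim] DATA=Height,Weight; TEST=cityblock,cityblock
" Save the dendrogram order from an average-linkage clustering. "
HCLUSTER [METHOD=average] Sim; PERMUTATION=Perm
" Print the existing matrix as single digits, with rows in that order. "
FSIMILARITY [PRINT=similarities; STYLE=abbreviated; SIMILARITY=Sim; \
  PERMUTATION=Perm]
```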

You use the GROUPS option to specify a partition of the units into two groups, by giving a factor with two levels. The units with level 1 of the factor correspond to the rows of the matrix, while the units with level 2 correspond to the columns.
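
A between-group matrix might be formed as in the sketch below. The identifiers Height, Weight, Grp and Between and the data values are hypothetical. The matrix is declared beforehand, following the pattern of the Example section, although this may not be required.

```genstat
" Hypothetical measurements on five units. "
VARIATE [VALUES=1.2,3.4,2.2,5.0,4.1] Height
VARIATE [VALUES=10,14,9,20,16] Weight
" Units 1-3 form group 1 (rows); units 4-5 form group 2 (columns). "
FACTOR [LEVELS=2; VALUES=1,1,1,2,2] Grp
MATRIX [ROWS=3; COLUMNS=2] Between
FSIMILARITY [METHOD=betweengroupsimilarities; GROUPS=Grp; \
  SIMILARITY=Between] DATA=Height,Weight; TEST=cityblock,cityblock
" The PRINT option covers symmetric matrices, so display this one directly. "
PRINT Between
```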

The UNITS option lets you label the rows of the output similarity matrix if the variates of the DATA parameter do not have any unit labels, or if you want to use different labels from those labelling the units of the variates. This labelling also applies to the rows and columns of a matrix of similarities between group elements.
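
As a short sketch of labelling, the text Names below supplies row labels for variates that carry no unit labels of their own. All identifiers and values are hypothetical.

```genstat
" Hypothetical labels and measurements for four units. "
TEXT [VALUES=Ann,Bob,Cat,Dan] Names
VARIATE [VALUES=1.2,3.4,2.2,5.0] Height
VARIATE [VALUES=10,14,9,20] Weight
FSIMILARITY [PRINT=similarities; UNITS=Names] DATA=Height,Weight; \
  TEST=cityblock,cityblock
```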

Options: PRINT, STYLE, METHOD, SIMILARITY, GROUPS, PERMUTATION, UNITS, MINKOWSKI.
Parameters: DATA, TEST, RANGE.

Action with RESTRICT

If any of the DATA variates or factors is restricted, or if the factor in the GROUPS option is restricted, then that restriction is applied to all the variates or factors. If more than one is restricted, then the restrictions must all be to the same set of units. The dimension of the resulting symmetric matrix of similarities is taken from the number of units that contribute to the similarity matrix.
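
The effect of a restriction can be sketched as follows, with hypothetical variates Height and Weight. Only Height is restricted, but by the rule above the restriction applies to both variates, so the similarity matrix should have four rows rather than five.

```genstat
" Hypothetical measurements on five units. "
VARIATE [VALUES=1.2,3.4,2.2,5.0,4.1] Height
VARIATE [VALUES=10,14,9,20,16] Weight
" Restrict to the units with Weight below 20, which excludes unit 4. "
RESTRICT Height; CONDITION=Weight.LT.20
FSIMILARITY [PRINT=similarities] DATA=Height,Weight; TEST=cityblock,cityblock
" Remove the restriction afterwards. "
RESTRICT Height
```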

References

Gower, J.C. (1971). A general coefficient of similarity and some of its properties. Biometrics, 27, 857-871.
Gower, J.C. (1985). Measures of similarity, dissimilarity and distance. In: Encyclopedia of Statistical Sciences, Volume 5, 397-405.

See also

Directives: CLUSTER, HCLUSTER, PCO, HREDUCE.
Procedures: ECANOSIM, HBOOTSTRAP, MANTEL, MASCLUSTER.
Commands for: Calculations and manipulation, Multivariate and cluster analysis.

Example

" Genstat example HCLU-1: Cluster analysis

   Data from 'Observers Book of Automobiles', 1986
   16 Italian cars and 10 measurements:
   1.  engine capacity        c.c.        CC
   2.  number of cylinders                NCyl
   3.  fuel tank              litres      Tank
   4.  unladen weight         kg          Wt
   5.  length                 cm          Length
   6.  width                  cm          Width
   7.  height                 cm          Ht
   8.  wheelbase              cm          Wbase
   9.  top speed              kph         TSpeed
  10.  time to 100kph         secs        StSt
  11.  carburettor/inj/diesel 1/2/3       Carb
  12.  front/rear wheel drive 1/2         Drive
"

TEXT [VALUES=Estate,'Arna1.5','Alfa2.5',Mondialqc,Testarossa,Croma,\ 
  Panda,Regatta,Regattad,Uno,X19,Contach,Delta,Thema,Y10,Spider] Cars
POINTER [VALUES=CC,NCyl,Tank,Wt,Length,Width,Ht,WBase,TSpeed,StSt,\ 
  Carb,Drive] Vars
" Read the data - measurements and carnames - from the file
 'HCLU-1.DAT', and then display it."
OPEN '%gendir%/examples/HCLU-1.DAT'; CHANNEL=cardat
READ [CHANNEL=cardat] Vars[]
CLOSE cardat

" Treat the number of cylinders, data[2], differently to the 
  continuous measurements."
HLIST [UNITS=Cars] \
  Vars[]; TEST=4(cityblock,euclidean),2(cityblock,simplematching)

" Form a hierarchical clustering of the cars,
  using the single linkage method."
SYMMETRIC [ROWS=Cars] CarSim
FSIMILARITY [SIMILARITY=CarSim]\ 
  Vars[]; TEST=4(cityblock,euclidean),2(cityblock,simplematching)
HCLUSTER [PRINT=amalgamations; METHOD=single] CarSim

" Use the average-linkage method."
HCLUSTER [PRINT=dendrogram; METHOD=average] CarSim;\ 
  AMALGAMATIONS=Am; PERMUTATION=Perm

" Display a high-resolution dendrogram."
DDENDROGRAM [ORDERING=given] DATA=Am; PERMUTATION=Perm; LABELS=Cars;\ 
  TITLE='Italian cars clustered by average linkage'

Updated on September 2, 2019
