Forms a similarity matrix or a between-group-elements similarity matrix or prints a similarity matrix.
Options
| PRINT = string token | Printed output required (similarities, summary); default * i.e. no printing |
|---|---|
| STYLE = string token | Print percentage similarities in full or just the 10% digit (full, abbreviated); default full |
| METHOD = string token | Form similarity matrix or rectangular between-group-element similarity matrix (similarities, betweengroupsimilarities); default simi |
| SIMILARITY = matrix or symmetric matrix | Input or output matrix of similarities; default * |
| GROUPS = factor | Grouping of units into two groups for between-group-element similarity matrix; default * |
| PERMUTATION = variate | Permutation of units (possibly from HCLUSTER) for order in which units of the similarity matrix are printed; default * |
| UNITS = text or variate | Unit names to label the rows of the similarity matrix; default * |
| MINKOWSKI = scalar | Index t for use with TEST=minkowski |
Parameters
| DATA = variates or factors | The data values |
|---|---|
| TEST = string tokens | Test type, defining how each DATA variate or factor is treated in the calculation of the similarity between each unit (simplematching, jaccard, russellrao, dice, antidice, sneathsokal, rogerstanimoto, cityblock, manhattan, ecological, euclidean, pythagorean, minkowski, divergence, canberra, braycurtis, soergel); default * ignores that variate or factor |
| RANGE = scalars | Range of possible values of each DATA variate or factor; if omitted, the observed range is taken |
Description
The FSIMILARITY directive forms similarity matrices, essentially using the method described by Gower (1971). The similarity coefficient that is calculated allows variables to be qualitative, quantitative or dichotomous, or mixtures of these types; values of some of the variables may be missing for some samples. The values of a similarity coefficient vary between zero and unity: two samples have a similarity of unity only when both have identical values for all variables; a value of zero occurs when the values for the two samples differ maximally for all variables.
You can form a symmetric matrix of similarities, or a rectangular matrix of similarities between the units in two groups. You can save either form of similarity matrix, using the SIMILARITY option. FSIMILARITY can also be used to print the symmetric matrix of similarities after it has formed it; alternatively, you can input an existing similarity matrix for printing, using the SIMILARITY option.
The DATA parameter specifies a list of variates or factors, all of which must be of the same length. If you want to print an existing similarity matrix, the DATA parameter (and the TEST and RANGE parameters) should be omitted, and the SIMILARITY option used to input the matrix concerned.
The TEST parameter specifies a list of strings, one for each variate or factor in the DATA parameter list, that define their “types”. If you want to exclude a variate or factor from contributing, you should specify an empty string (* or ''). Otherwise the similarity between units i and j is calculated as

$$\frac{\sum_k w_k(x_{ik}, x_{jk})\, s_k(x_{ik}, x_{jk})}{\sum_k w_k(x_{ik}, x_{jk})}$$

where $x_{ik}$ is the value of the DATA variate k in unit i, and the contribution functions $s_k$ and weight functions $w_k$ for a variate or factor k of the available types are defined in the tables below (for further details see Gower 1971, 1985).
The first table contains the types appropriate for variates that are recording the presence or absence of a characteristic; these cannot be used with factors.
| Type | Contribution $s_k$ | Weight $w_k$ |
|---|---|---|
| Jaccard | if $x_i \ne 0$ and $x_j \ne 0$, then 1 | 1 |
| | if $x_i = x_j = 0$, then 0 | 0 |
| | if only one of $x_i$ or $x_j = 0$, then 0 | 1 |
| RussellRao | if $x_i \ne 0$ and $x_j \ne 0$, then 1 | 1 |
| | if $x_i = 0$ or $x_j = 0$, then 0 | 1 |
| Dice | if $x_i \ne 0$ and $x_j \ne 0$, then 1 | 1 |
| | if $x_i = x_j = 0$, then 0 | 0 |
| | if only one of $x_i$ or $x_j = 0$, then 0 | 0.5 |
| antidice | if $x_i \ne 0$ and $x_j \ne 0$, then 1 | 1 |
| | if $x_i = x_j = 0$, then 0 | 0 |
| | if only one of $x_i$ or $x_j = 0$, then 0 | 2 |
| SneathSokal | if $x_i \ne 0$ and $x_j \ne 0$, then 1 | 1 |
| | if $x_i = x_j = 0$, then 1 | 1 |
| | if only one of $x_i$ or $x_j = 0$, then 0 | 0.5 |
| RogersTanimoto | if $x_i \ne 0$ and $x_j \ne 0$, then 1 | 1 |
| | if $x_i = x_j = 0$, then 1 | 1 |
| | if only one of $x_i$ or $x_j = 0$, then 0 | 2 |
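The contribution and weight rules for the presence/absence types can be sketched in Python as a check on the table. This is an illustration only, not Genstat code: the names `RULES` and `dichotomous_similarity` are hypothetical, and returning NaN when every weight is zero is an assumption of the sketch, since the text does not say what FSIMILARITY does in that case.

```python
# Illustrative sketch only (not Genstat code).
# For each type, the (contribution s, weight w) pair in the three cases:
#   "both"    : both values non-zero
#   "neither" : both values zero
#   "one"     : exactly one value zero
RULES = {
    "jaccard":        {"both": (1, 1), "neither": (0, 0), "one": (0, 1)},
    "russellrao":     {"both": (1, 1), "neither": (0, 1), "one": (0, 1)},
    "dice":           {"both": (1, 1), "neither": (0, 0), "one": (0, 0.5)},
    "antidice":       {"both": (1, 1), "neither": (0, 0), "one": (0, 2)},
    "sneathsokal":    {"both": (1, 1), "neither": (1, 1), "one": (0, 0.5)},
    "rogerstanimoto": {"both": (1, 1), "neither": (1, 1), "one": (0, 2)},
}


def dichotomous_similarity(unit_i, unit_j, test):
    """Weighted mean of the contributions over all variates for one pair of units."""
    num = 0.0
    den = 0.0
    for a, b in zip(unit_i, unit_j):
        if a != 0 and b != 0:
            case = "both"
        elif a == 0 and b == 0:
            case = "neither"
        else:
            case = "one"
        s, w = RULES[test][case]
        num += w * s
        den += w
    # Assumption of this sketch: undefined (NaN) when all weights are zero.
    return num / den if den > 0 else float("nan")


unit_i = [1, 1, 0, 0, 1]
unit_j = [1, 0, 0, 1, 1]
for test in RULES:
    print(test, dichotomous_similarity(unit_i, unit_j, test))
```

With two joint presences, two mismatches and one joint absence, the Jaccard weights give 2 / (2 + 2) and the Dice weights give 2 / (2 + 1), which match the usual 2×2 forms a / (a + b + c) and 2a / (2a + b + c).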
The simplematching type is appropriate for qualitative variables, which may be either variates or factors.
| Type | Contribution $s_k$ | Weight $w_k$ |
|---|---|---|
| simplematching | if $x_i = x_j$, then 1 | 1 |
| | if $x_i \ne x_j$, then 0 | 1 |
The next table shows the types that can be used for quantitative variates (but not factors). In the definitions, r is the range of the variate and t is the Minkowski index (defined by the MINKOWSKI option). Note, however, that BrayCurtis and Soergel should not be mixed with other types.
| Type | Contribution $s_k$ | Weight $w_k$ |
|---|---|---|
| cityblock | $1 - \lvert x_i - x_j \rvert / r$ | 1 |
| Manhattan | synonymous with cityblock | |
| ecological | $1 - \lvert x_i - x_j \rvert / r$ | 1 |
| | unless $x_i = x_j = 0$ | 0 |
| Euclidean | $1 - \{(x_i - x_j) / r\}^2$ | 1 |
| Pythagorean | synonymous with Euclidean | |
| Minkowski | $1 - \lvert x_i - x_j \rvert^t / r^t$ | 1 |
| Divergence | $1 - \{(x_i - x_j) / (x_i + x_j)\}^2$ | 1 |
| Canberra | $1 - \lvert x_i - x_j \rvert / (\lvert x_i \rvert + \lvert x_j \rvert)$ | 1 |
| BrayCurtis | $1 - \lvert x_i - x_j \rvert / (x_i + x_j)$ | $x_i + x_j$ |
| Soergel | $1 - \lvert x_i - x_j \rvert / \max(x_i, x_j)$ | $\max(x_i, x_j)$ |
The RANGE parameter contains a list of scalars, one for each variate or factor in the DATA list. This allows you to check that the values of each variate lie within the given range. If any variate or factor fails the range check, FSIMILARITY gives an error diagnostic and terminates without forming the similarity matrix. The range is also used to standardize quantitative variates; this lets you impose a standard range, for example when variates are measured on commensurate scales. You can omit the RANGE parameter for all or any of the variates or factors by giving a missing identifier or a scalar with a missing value; Genstat then uses the observed range. If PRINT=summary, Genstat prints the name, the minimum value, and the range for each variate and factor.
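The role of the range can be sketched in Python as follows. This is an illustration, not Genstat code: `effective_range` is a hypothetical helper, and the exact form of the range check (observed maximum minus minimum must not exceed the supplied range) is an assumption of the sketch rather than something the text specifies.

```python
# Illustrative sketch only (not Genstat code).
def effective_range(values, supplied=None):
    """Return the range used for standardization.

    If no range is supplied, the observed range (max - min) is used.
    Assumption of this sketch: the check fails when the observed spread
    exceeds the supplied range.
    """
    observed = max(values) - min(values)
    if supplied is None:
        return observed
    if observed > supplied:
        raise ValueError(
            f"observed range {observed} exceeds supplied range {supplied}"
        )
    return supplied


heights = [160, 170, 190]  # hypothetical data
print(effective_range(heights))               # observed range
print(effective_range(heights, supplied=50))  # imposed standard range
```

Imposing a common range on several variates gives each the same scale in the cityblock, Euclidean and Minkowski contributions, whereas the observed range rescales each variate by its own spread.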
The METHOD option controls what type of matrix is produced. METHOD=similarities, the default, gives a symmetric matrix of similarities amongst a single set of units. METHOD=betweengroupsimilarities gives a rectangular matrix of similarities between two sets of units. To form a rectangular matrix of similarities, you must also define the grouping of units by setting the GROUPS option (see below).
The PRINT, STYLE and PERMUTATION options govern the printing of a symmetric matrix of similarities. You can either form the similarity matrix within FSIMILARITY, or input it using the SIMILARITY option. To print the similarity matrix you should set option PRINT=similarities. The STYLE option has two settings, full (the default) or abbreviated. The similarity matrix printed in full style has its values displayed as percentages with one decimal place. If you set STYLE=abbreviated, the values of the similarity matrix are printed as single digits with no spaces, each digit being the tens digit of the similarity expressed as a percentage. In both cases, though, the actual similarities in the range 0-1 are stored in the similarity matrix itself. The PERMUTATION option lets you specify a variate with values corresponding to the order in which you want the rows of the similarity matrix to be printed. The reordering of the rows is most effective when the permutation arises from a hierarchical clustering and corresponds to the dendrogram order.
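The two printing styles and the effect of a permutation can be sketched in Python. This is an illustration only and does not reproduce Genstat's actual output layout. The sketch assumes that the permutation holds 1-based unit numbers in printing order, that the lower triangle is shown including the diagonal, and that a similarity of 100% is shown as 9 in abbreviated style; none of these details is stated in the text, and all function names are hypothetical.

```python
# Illustrative sketch only (not Genstat code or Genstat's output format).
def full_entry(s):
    """Full style: similarity as a percentage with one decimal place."""
    return f"{s * 100:.1f}"


def tens_digit(s):
    """Abbreviated style: tens digit of the percentage similarity."""
    percent = round(s * 100, 6)  # guard against floating-point noise
    # Assumption of this sketch: 100% is capped at the digit 9.
    return str(min(int(percent) // 10, 9))


def abbreviated_rows(sim, permutation):
    """Lower-triangle rows, reordered by a 1-based permutation of the units."""
    order = [p - 1 for p in permutation]
    rows = []
    for pos, i in enumerate(order):
        rows.append("".join(tens_digit(sim[i][j]) for j in order[: pos + 1]))
    return rows


# Hypothetical 3-unit similarity matrix, stored on the 0-1 scale.
sim = [
    [1.00, 0.87, 0.42],
    [0.87, 1.00, 0.65],
    [0.42, 0.65, 1.00],
]
print(full_entry(sim[0][1]))
for row in abbreviated_rows(sim, [3, 1, 2]):
    print(row)
```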
You use the GROUPS option to specify a partition of the units into two groups, by giving a factor with two levels. The units with level 1 of the factor correspond to the rows of the matrix, while the units with level 2 correspond to the columns.
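The shape of the between-group matrix can be sketched in Python. This is an illustration, not Genstat code: `between_group_similarities` is a hypothetical function, the data are invented, and a single cityblock variate is used to keep the example short.

```python
# Illustrative sketch only (not Genstat code).
def between_group_similarities(values, groups, r):
    """Rectangular matrix: rows are level-1 units, columns are level-2 units.

    Uses the cityblock contribution 1 - |xi - xj| / r on a single variate.
    """
    rows = [v for v, g in zip(values, groups) if g == 1]
    cols = [v for v, g in zip(values, groups) if g == 2]
    return [[1.0 - abs(a - b) / r for b in cols] for a in rows]


values = [1.0, 3.0, 5.0, 2.0]  # hypothetical variate
groups = [1, 2, 1, 2]          # two-level grouping factor
matrix = between_group_similarities(values, groups, r=4.0)
for row in matrix:
    print(row)
```

Here units 1 and 3 form the rows and units 2 and 4 form the columns, so the result has two rows and two columns rather than the 4 × 4 symmetric matrix formed without grouping.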
The UNITS option lets you label the rows of the output similarity matrix if the variates of the DATA parameter do not have any unit labels, or if you want to use different labels from those labelling the units of the variates. This labelling also applies to the rows and columns of a matrix of similarities between group elements.
Options: PRINT, STYLE, METHOD, SIMILARITY, GROUPS, PERMUTATION, UNITS, MINKOWSKI.
Parameters: DATA, TEST, RANGE.
Action with RESTRICT
If any of the DATA variates or factors is restricted, or if the factor in the GROUPS option is restricted, then that restriction is applied to all the variates or factors. If more than one is restricted, then the restrictions must all be to the same set of units. The dimension of the resulting symmetric matrix of similarities is taken from the number of units that contribute to the similarity matrix.
References
Gower, J.C. (1971). A general coefficient of similarity and some of its properties. Biometrics, 27, 857-871.
Gower, J.C. (1985). Measures of similarity, dissimilarity and distance. In: Encyclopedia of Statistical Sciences, Volume 5, 397-405.
See also
Directives: CLUSTER, HCLUSTER, PCO, HREDUCE.
Procedures: ECANOSIM, HBOOTSTRAP, MANTEL, MASCLUSTER.
Commands for: Calculations and manipulation, Multivariate and cluster analysis.
Example
" Genstat example HCLU-1: Cluster analysis

  Data from 'Observers Book of Automobiles', 1986

  16 Italian cars and 10 measurements:
    1. engine capacity          c.c.    CC
    2. number of cylinders              NCyl
    3. fuel tank                litres  Tank
    4. unladen weight           kg      Wt
    5. length                   cm      Length
    6. width                    cm      Width
    7. height                   cm      Ht
    8. wheelbase                cm      Wbase
    9. top speed                kph     TSpeed
   10. time to 100kph           secs    StSt
   11. carburettor/inj/diesel   1/2/3   Carb
   12. front/rear wheel drive   1/2     Drive
"
TEXT [VALUES=Estate,'Arna1.5','Alfa2.5',Mondialqc,Testarossa,Croma,\
  Panda,Regatta,Regattad,Uno,X19,Contach,Delta,Thema,Y10,Spider] Cars
POINTER [VALUES=CC,NCyl,Tank,Wt,Length,Width,Ht,WBase,TSpeed,StSt,\
  Carb,Drive] Vars
" Read the data - measurements and carnames - from the file 'HCLU-1.DAT',
  and then display it."
OPEN '%gendir%/examples/HCLU-1.DAT'; CHANNEL=cardat
READ [CHANNEL=cardat] Vars[]
CLOSE cardat
" Treat the number of cylinders, data[2], differently to the continuous
  measurements."
HLIST [UNITS=Cars] \
  Vars[]; TEST=4(cityblock,euclidean),2(cityblock,simplematching)
" Form a hierarchical clustering of the cars, using the single linkage
  method."
SYMMETRIC [ROWS=Cars] CarSim
FSIMILARITY [SIMILARITY=CarSim]\
  Vars[]; TEST=4(cityblock,euclidean),2(cityblock,simplematching)
HCLUSTER [PRINT=amalgamations; METHOD=single] CarSim
" Use the average-linkage method."
HCLUSTER [PRINT=dendrogram; METHOD=average] CarSim;\
  AMALGAMATIONS=Am; PERMUTATION=Perm
" Display a high-resolution dendrogram."
DDENDROGRAM [ORDERING=given] DATA=Am; PERMUTATION=Perm; LABELS=Cars;\
  TITLE='Italian cars clustered by average linkage'