Performs non-metric multidimensional scaling.
Options
PRINT = string tokens | Printed output required (coordinates, roots, distances, fitteddistances, stress, monitoring); default * i.e. no printing |
---|---|
DATA = symmetric matrix | Distances amongst a set of units |
METHOD = string token | Whether to use non-metric scaling, or metric scaling with linear regression of the fitted distances to the actual distances (nonmetric, linear); default nonm |
SCALING = string token | Whether least-squares, least-squares-squared, or log-stress scaling is to be used (ls, lss, logstress); default ls |
TIES = string token | Treatment of tied data values (primary, secondary, tertiary); default prim |
WEIGHTS = symmetric matrix | Weights for each distance value; default * i.e. all distances with weight one |
INITIAL = matrix | Initial configuration; default * i.e. a principal coordinate solution is used |
NSTARTS = scalar | Number of starting configurations to be used, by making random perturbations to the initial configuration; default 10 |
SEED = scalar | Seed for the random-number generator; default 0 |
MAXCYCLE = scalar | Maximum number of iterations; default 30 |
Parameters
NDIMENSIONS = scalars | Number of dimensions for each solution |
---|---|
COORDINATES = matrices | To store the coordinates of the units for each solution |
STRESS = scalars | To store the stress value for each solution |
DISTANCES = symmetric matrices | To store the distances amongst the points for the units in the fitted number of dimensions |
FITTEDDISTANCES = symmetric matrices | To store the fitted distances from the monotonic (METHOD=nonmetric) or linear (METHOD=linear) regression |
Description
The MDS directive carries out iterative scaling, including metric and non-metric scaling. The input data consists of a symmetric matrix whose values may be interpreted, in a general sense, as distances between a set of objects. The matrix is specified by the DATA option; thus only one matrix can be analysed each time the MDS directive is used.
The objective of the MDS directive is to find a set of coordinates whose inter-point distances match, as closely as possible, those of the input data matrix. When plotted, the coordinates provide a display which can be interpreted in the same way as a map: for example, if points in the display are close together, their distance apart in the data matrix was small.
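The inter-point distances of a configuration are ordinary Euclidean distances between its rows. As an illustration outside Genstat, the short Python sketch below computes them for a made-up two-dimensional configuration; the function name `interpoint_distances` and the coordinates are hypothetical and are not taken from any Genstat output.

```python
import math

def interpoint_distances(coords):
    """Return the full symmetric matrix of Euclidean distances between the rows of coords."""
    n = len(coords)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i):
            d = math.dist(coords[i], coords[j])
            dist[i][j] = dist[j][i] = d
    return dist

# Hypothetical 2-D configuration of three units (illustrative values only).
config = [(0.0, 0.0), (3.0, 4.0), (3.0, 0.0)]
distances = interpoint_distances(config)
```

These are the distances that the scaling tries to match to the entries of the data matrix.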
The MDS directive uses the method of steepest descent to move from an initial configuration of points to the final matrix of coordinates, which has the minimum stress of all the configurations examined.
Printed output is controlled by the PRINT option; by default nothing is printed. There are six possible settings:
coordinates | prints the solution coordinates, rotated to principal coordinates; |
---|---|
roots | prints the latent roots of the solution coordinates; |
distances | prints the inter-unit distances, computed from the solution configuration; |
fitteddistances | prints the fitted values from the regression of the inter-unit distances on the distances in the data matrix; the regression is either monotonic or linear through the origin, depending on the setting of the METHOD option; |
stress | prints the stress of the solution coordinates; |
monitoring | prints a summary of the results at each iteration. |
The METHOD option determines whether metric or non-metric scaling is performed. The algorithm involves regression of the distances, calculated from the solution coordinates, against the dissimilarities in the symmetric matrix specified by the DATA option. With the default setting, METHOD=nonmetric, monotonic regression is used; if METHOD=linear, the algorithm uses linear regression through the origin.
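The difference between the two kinds of regression can be seen in a small sketch. The Python code below is an illustration only, not Genstat's implementation: it fits a non-decreasing least-squares curve by pooling adjacent violators, and a least-squares line through the origin. The function names and the four data values are hypothetical, and the monotonic fit assumes there are no tied dissimilarities.

```python
def linear_through_origin(x, y):
    """Least-squares fit of y = b * x with no intercept; returns the fitted values."""
    b = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
    return [b * xi for xi in x]

def monotonic_regression(x, y):
    """Non-decreasing least-squares fit of y taken in the order of x.

    Uses pool-adjacent-violators; assumes the x values contain no ties.
    """
    order = sorted(range(len(x)), key=lambda k: x[k])
    blocks = []  # each block holds [sum of y values, number of values]
    for k in order:
        blocks.append([y[k], 1])
        # Pool the last two blocks while their means are out of order.
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted_sorted = []
    for s, c in blocks:
        fitted_sorted.extend([s / c] * c)
    fitted = [0.0] * len(x)
    for pos, k in enumerate(order):
        fitted[k] = fitted_sorted[pos]
    return fitted

# Hypothetical dissimilarities and configuration distances for four pairs of units.
dissimilarities = [1.0, 2.0, 3.0, 4.0]
distances = [1.0, 3.0, 2.0, 4.0]
mono = monotonic_regression(dissimilarities, distances)   # only the ordering matters
lin = linear_through_origin(dissimilarities, distances)   # proportional to the dissimilarities
```

In the monotonic fit the two out-of-order distances are replaced by their mean, whereas the linear fit forces every fitted value to be a fixed multiple of the dissimilarity.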
The stress function to be minimized can be selected using the SCALING option. There are three possibilities.
ls (least squares): | $\sqrt{ \dfrac{ \sum_i \sum_j w_{ij} (d_{ij} - \hat{d}_{ij})^2 }{ m \sum_i \sum_j w_{ij} d_{ij}^2 } }$ |
---|---|
lss (least-squares-squared): | $\sqrt{ \dfrac{ \sum_i \sum_j w_{ij} (d_{ij}^2 - \hat{d}_{ij}^2)^2 }{ m \sum_i \sum_j w_{ij} d_{ij}^4 } }$ |
logstress: | $\sqrt{ \dfrac{ \sum_i \sum_j w_{ij} (\log d_{ij} - \log \hat{d}_{ij})^2 }{ m } }$ |
where the $d_{ij}$ are the elements of the dissimilarity matrix calculated for the fitted configuration, the $\hat{d}_{ij}$ are the fitted values from the regression selected by the METHOD option, the $w_{ij}$ are the corresponding weights and $m$ is the number of off-diagonal elements in the dissimilarity matrix.
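The three definitions can be checked numerically with a short sketch. The Python function below is an illustration written from the formulas above, not Genstat's own code. It assumes that the distances, fitted values and weights are supplied as flat lists with one entry per pair of units, and that m is the number of pairs supplied; the name `stress` and the example values are hypothetical.

```python
import math

def stress(dist, fitted, weights=None, scaling="ls"):
    """Stress for the ls, lss or logstress definition.

    dist, fitted and weights hold one value per pair of units.
    Assumption: m is taken as the number of pairs supplied.
    """
    m = len(dist)
    w = weights if weights is not None else [1.0] * m
    if scaling == "ls":
        num = sum(wk * (d - f) ** 2 for wk, d, f in zip(w, dist, fitted))
        den = m * sum(wk * d ** 2 for wk, d in zip(w, dist))
    elif scaling == "lss":
        num = sum(wk * (d ** 2 - f ** 2) ** 2 for wk, d, f in zip(w, dist, fitted))
        den = m * sum(wk * d ** 4 for wk, d in zip(w, dist))
    elif scaling == "logstress":
        num = sum(wk * (math.log(d) - math.log(f)) ** 2 for wk, d, f in zip(w, dist, fitted))
        den = m
    else:
        raise ValueError("scaling must be 'ls', 'lss' or 'logstress'")
    return math.sqrt(num / den)

# Hypothetical distances and fitted values for four pairs of units.
dist = [1.0, 3.0, 2.0, 4.0]
fitted = [1.0, 2.5, 2.5, 4.0]
ls_value = stress(dist, fitted, scaling="ls")
lss_value = stress(dist, fitted, scaling="lss")
log_value = stress(dist, fitted, scaling="logstress")
```

All three measures are zero when the fitted values reproduce the distances exactly, and a pair with weight zero contributes nothing to either sum.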
The TIES option allows you to vary the way in which tied data values in the input data matrix are treated. By default, the treatment of ties is primary, and no restrictions are placed on the distances corresponding to tied dissimilarities in the input data matrix. In the secondary treatment of ties, the distances corresponding to tied dissimilarities are required to be as nearly equal as possible. Kendall (1977) describes a compromise between the primary and secondary approaches to ties: the block of ties corresponding to the smallest dissimilarity is handled by the secondary treatment, and the remaining blocks of ties are handled by the primary treatment. This tertiary treatment of ties is useful when the dissimilarities take only a few values. For example, in the reconstruction of maps from abuttal information, the dissimilarity coefficient takes only two values: zero if localities abut, and one if they do not. The block of ties associated with the dissimilarity of zero is handled by the secondary treatment, and the block of ties with dissimilarity one by the primary treatment.
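One common way of realising the three treatments inside a monotonic regression is sketched below in Python. It is an assumption that this matches what Genstat does internally; the sketch only mirrors the description above. Under the primary treatment, tied pairs are ordered by their distances and fitted separately; under the secondary treatment, each block of ties is pooled so that its members share one fitted value; under the tertiary treatment, only the block with the smallest dissimilarity is pooled. The function name `monotone_fit` and the data are hypothetical.

```python
from itertools import groupby

def monotone_fit(diss, dist, ties="primary"):
    """Monotonic regression of dist on diss with a chosen treatment of ties."""
    # Within a block of tied dissimilarities, take the pairs in order of distance.
    order = sorted(range(len(diss)), key=lambda k: (diss[k], dist[k]))
    start = []  # starting blocks, each a list of pair indices
    for g, (_, grp) in enumerate(groupby(order, key=lambda k: diss[k])):
        members = list(grp)
        pooled = ties == "secondary" or (ties == "tertiary" and g == 0)
        if pooled:
            start.append(members)                 # tied pairs share one fitted value
        else:
            start.extend([k] for k in members)    # tied pairs are left unrestricted
    blocks = []  # each block holds [sum of distances, list of pair indices]
    for members in start:
        blocks.append([sum(dist[k] for k in members), list(members)])
        # Pool adjacent blocks while their means are out of order.
        while len(blocks) > 1 and blocks[-2][0] / len(blocks[-2][1]) > blocks[-1][0] / len(blocks[-1][1]):
            s, idx = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += idx
    fitted = [0.0] * len(diss)
    for s, idx in blocks:
        for k in idx:
            fitted[k] = s / len(idx)
    return fitted

# Hypothetical abuttal-style data: dissimilarity 0 if localities abut, 1 if not.
diss = [0, 0, 1, 1]
dist = [1.0, 2.0, 1.5, 3.0]
primary = monotone_fit(diss, dist, "primary")
secondary = monotone_fit(diss, dist, "secondary")
tertiary = monotone_fit(diss, dist, "tertiary")
```

With these values the tertiary fit gives the two abutting pairs a common fitted value, while the two non-abutting pairs remain free to differ.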
The WEIGHTS option can be used to specify a symmetric matrix of weights. Each element of the matrix gives the weight to be attached to the corresponding element of the input data matrix. If the option is not set, the elements of the data matrix are weighted equally: $w_{ij}=1$ for all i and j. The most important use of the option occurs when the matrix of weights contains only zeros and ones; the zeros then correspond to missing values in the input data matrix, allowing incomplete data matrices to be scaled. Up to about two thirds of the data matrix may be missing before the algorithm breaks down. This allows experimenters to design studies in which only a subset of all the dissimilarities need to be observed. This is particularly useful when there is a large number of units; if the number of units is n, say, a complete n × n data matrix requires n(n-1)/2 dissimilarities to be observed.
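The bookkeeping for an incomplete design can be sketched as follows. This Python fragment is an illustration only: it builds a symmetric matrix of zeros and ones and reports the fraction of the pairwise dissimilarities that are treated as missing. The function names and the choice of unobserved pairs are hypothetical.

```python
def weight_matrix(n, missing_pairs):
    """Symmetric n x n matrix of weights: 0 for unobserved pairs, 1 otherwise."""
    w = [[1.0] * n for _ in range(n)]
    for i, j in missing_pairs:
        w[i][j] = w[j][i] = 0.0
    return w

def fraction_missing(w):
    """Fraction of the pairwise dissimilarities (below the diagonal) that have weight zero."""
    n = len(w)
    total = n * (n - 1) // 2
    missing = sum(1 for i in range(n) for j in range(i) if w[i][j] == 0.0)
    return missing / total

# Ten units give 45 pairs; three hypothetical pairs were not observed.
w = weight_matrix(10, [(1, 0), (5, 2), (9, 7)])
missing_share = fraction_missing(w)
```

Here 3 of the 45 pairs are missing, well below the rough limit of two thirds mentioned above.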
Since the algorithm is iterative, making use of the method of steepest descent, there is no guarantee that the solution coordinates found from any given starting configuration have the minimum stress of all possible configurations. The algorithm may have found a local, rather than the global, minimum. This problem may be partially overcome by using a series of different starting configurations. If several of the starts arrive at the same lowest-stress solution, then you may be reasonably confident of having found the global minimum. The NSTARTS option determines the number of starting configurations to be used. The starting configuration used on the first start can be specified by the INITIAL option; if this is not set, the default is to take the principal coordinate solution obtained from a PCO analysis of the input dissimilarity matrix. Subsequent starting configurations are found by perturbing each coordinate of the first starting configuration by successively larger amounts. This strategy generally results in at least one starting configuration that does not get trapped in a local minimum; however, there can be no guarantee that the global minimum for the stress function has been found. Experience suggests that, for safety, the NSTARTS option should be set to at least 10. By default NSTARTS=10.
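The multi-start strategy can be sketched in a few lines. The Python code below is a simplified illustration and not Genstat's algorithm: it minimises a raw sum of squared differences between configuration distances and target distances (no regression step), uses a fixed step length, and perturbs the starting coordinates uniformly at random. The function names, step length, perturbation scale and data are all assumptions made for the example.

```python
import math
import random

def raw_stress(coords, target):
    """Sum over pairs of (configuration distance - target distance) squared."""
    total = 0.0
    for i in range(len(coords)):
        for j in range(i):
            total += (math.dist(coords[i], coords[j]) - target[i][j]) ** 2
    return total

def descend(coords, target, step=0.05, cycles=200):
    """Fixed-step steepest descent on raw_stress (step and cycles are arbitrary choices)."""
    coords = [list(p) for p in coords]
    n, ndim = len(coords), len(coords[0])
    for _ in range(cycles):
        grad = [[0.0] * ndim for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                d = math.dist(coords[i], coords[j])
                if d == 0.0:
                    continue  # coincident points: gradient undefined, so skip the pair
                factor = 2.0 * (d - target[i][j]) / d
                for a in range(ndim):
                    grad[i][a] += factor * (coords[i][a] - coords[j][a])
        for i in range(n):
            for a in range(ndim):
                coords[i][a] -= step * grad[i][a]
    return coords

def multi_start(initial, target, nstarts=10, seed=1):
    """Run descend from nstarts configurations and keep the one with the lowest stress."""
    rng = random.Random(seed)  # a fixed seed makes the perturbations reproducible
    best_coords, best_stress = None, math.inf
    for start in range(nstarts):
        scale = 0.1 * start  # first start is unperturbed; later ones move further away
        trial = [[x + rng.uniform(-scale, scale) for x in p] for p in initial]
        fitted = descend(trial, target)
        s = raw_stress(fitted, target)
        if s < best_stress:
            best_coords, best_stress = fitted, s
    return best_coords, best_stress

# Hypothetical target distances (a 3-4-5 triangle) and a rough starting configuration.
target = [[0.0, 3.0, 5.0], [3.0, 0.0, 4.0], [5.0, 4.0, 0.0]]
initial = [[0.0, 0.0], [2.5, 0.5], [3.5, 3.0]]
best_coords, best_stress = multi_start(initial, target, nstarts=5, seed=722922)
```

Because the first start is left unperturbed, the best result over all the starts can never be worse than the result of a single run from the initial configuration.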
The SEED option supplies the seed for the random numbers that are used to perturb the initial configuration. The default of zero continues the existing sequence of random numbers if MDS has already been used in the current Genstat job. If MDS has not yet been used, Genstat picks a seed at random.
The MAXCYCLE option determines the maximum number of iterations of the algorithm. The default of 30 should usually be sufficient. However, it may be necessary to set a larger value for very large data matrices or when using the logstress setting of the SCALING option. The monitoring setting of the PRINT option may be used to see how convergence is progressing.
The NDIMENSIONS parameter must be set to a scalar (or scalars) to indicate the number(s) of dimensions in which the multidimensional scaling is to be performed on the data matrix. An MDS statement with a list of scalars will carry out a series of scaling operations, all based on the same matrix of dissimilarities, but with different numbers of dimensions.
The remaining parameters of the MDS directive allow output to be saved in Genstat data structures. The COORDINATES parameter can list matrices to store the minimum stress coordinates in each of the dimensions given by the NDIMENSIONS parameter, and the STRESS parameter can specify scalars to store the associated minimum stresses. The parameters DISTANCES and FITTEDDISTANCES can specify symmetric matrices to store the distances computed from the coordinates matrix and the fitted distances computed from the monotonic or linear regressions, respectively.
Options: PRINT, DATA, METHOD, SCALING, TIES, WEIGHTS, INITIAL, NSTARTS, SEED, MAXCYCLE.
Parameters: NDIMENSIONS, COORDINATES, STRESS, DISTANCES, FITTEDDISTANCES.
Reference
Kendall, D.G. (1977). On the tertiary treatment of ties. Proceedings of the Royal Society of London, Series A, 354, 407-423.
See also
Directives: MONOTONIC, PCP, PCO, CVA, FCA.
Commands for: Multivariate and cluster analysis.
Example
" Genstat example MDS-1: Multidimensional Scaling.
  The data for this example (Nathanson J A 1971. An application of
  multivariate analysis in astronomy. Applied Statistics 20, 239-249)
  gives squared distances amongst ten types of galaxy: those of an
  elliptical shape, eight different kinds of spiral galaxy, and
  irregularly-shaped galaxies. The spiral types vary from those which
  are mainly made up of a central core (coded as types SO and SBO) to
  those that are extremely tenuous (Sc and SBc).
  This example forms an ordination of the ten galaxy types. "
" Declare the symmetric data matrix "
TEXT GalaxyType; !T(E,SO,SBO,Sa,SBa,Sb,SBb,Sc,SBc,I)
SYMMETRIC [ROWS=GalaxyType] Galaxy
READ Galaxy
0
1.87 0
2.24 0.91 0
4.03 2.05 1.51 0
4.09 1.74 1.59 0.68 0
5.38 3.41 3.15 1.86 1.27 0
7.03 3.85 3.24 2.25 1.89 2.02 0
6.02 4.85 4.11 3.00 2.13 1.71 1.45 0
6.88 5.70 5.12 3.72 3.01 2.97 1.75 1.13 0
4.12 3.77 3.86 3.93 3.27 3.77 3.52 2.79 3.29 0 :
" Carry out the analysis, printing out the stress, latent roots,
  the coordinates, the inter-unit distances between the coordinates,
  and the fitted values from the regression of the distances on the
  observed distances. Note that MDS requires distances, not squared
  distances, as input - transform data appropriately. "
CALC Galaxy = sqrt(Galaxy)
PRINT Galaxy
MDS [PRINT=coordinates, roots,distances, stress,fitteddistances;\
  DATA=Galaxy; SEED=722922] NDIM=2; COORD=C; DISTANCE=Dist; FITTEDDIST=FitD
XAXIS 4; TITLE='Distances'
YAXIS 4; TITLE='Fitted Distances'
PEN 3; METHOD=line; SYMBOL=0
DGRAPH [WINDOW=4;KEYWINDOW=0;TITLE='MDS - Shepard Diagram']\
  Dist,FitD; Galaxy; PEN=1,3;\
  DESCRIPTION='Fitted Distances','Monotone Regression'
PEN 4; LABELS=GalaxyType; SYMBOL=0; COLOUR=5; SIZE=1.5
DGRAPH [WINDOW=3;KEYWINDOW=0;TITLE='MDS - Fitted Configuration']\
  C$[*;2]; C$[*;1]; PEN=4