MDS directive

Performs metric or non-metric multidimensional scaling.

Options

`PRINT` = string tokens	Printed output required (`coordinates, roots, distances, fitteddistances, stress, monitoring`); default `*` i.e. no printing
`DATA` = symmetric matrix	Distances amongst a set of units
`METHOD` = string token	Whether to use non-metric scaling, or metric scaling with linear regression of the fitted distances to the actual distances (`nonmetric, linear`); default `nonm`
`SCALING` = string token	Whether least-squares, least-squares-squared, or log-stress scaling is to be used (`ls, lss, logstress`); default `ls`
`TIES` = string token	Treatment of tied data values (`primary, secondary, tertiary`); default `prim`
`WEIGHTS` = symmetric matrix	Weights for each distance value; default `*` i.e. all distances with weight one
`INITIAL` = matrix	Initial configuration; default `*` i.e. a principal coordinate solution is used
`NSTARTS` = scalar	Number of starting configurations to be used, by making random perturbations to the initial configuration; default 10
`SEED` = scalar	Seed for the random-number generator; default 0
`MAXCYCLE` = scalar	Maximum number of iterations; default 30

Parameters

`NDIMENSIONS` = scalars	Number of dimensions for each solution
`COORDINATES` = matrices	To store the coordinates of the units for each solution
`STRESS` = scalars	To store the stress value for each solution
`DISTANCES` = symmetric matrices	To store the distances amongst the points for the units in the fitted number of dimensions
`FITTEDDISTANCES` = symmetric matrices	To store the fitted distances from the monotonic (`METHOD=nonmetric`) or linear (`METHOD=linear`) regression

Description

The MDS directive carries out iterative scaling, including metric and non-metric scaling. The input data consists of a symmetric matrix whose values may be interpreted, in a general sense, as distances between a set of objects. The matrix is specified by the DATA option; thus only one matrix can be analysed each time the MDS directive is used.

The objective of the MDS directive is to find a set of coordinates whose inter-point distances match, as closely as possible, those of the input data matrix. When plotted, the coordinates provide a display which can be interpreted in the same way as a map: for example, if points in the display are close together, their distance apart in the data matrix was small.
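The inter-point distances referred to above can be illustrated with a short Python sketch. This is not Genstat code: the function name `interpoint_distances` and the three example points are assumptions chosen for illustration, and Euclidean distance is assumed.

```python
import math

def interpoint_distances(coords):
    """Return the full symmetric matrix of Euclidean distances between the rows of coords."""
    n = len(coords)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i):
            d = math.dist(coords[i], coords[j])
            dist[i][j] = dist[j][i] = d
    return dist

# Three hypothetical units placed in two dimensions (a 3-4-5 right triangle)
points = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0)]
distances = interpoint_distances(points)
```

Scaling works in the opposite direction: it starts from a matrix such as `distances` and searches for coordinates such as `points` that reproduce it.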

The MDS directive uses the method of steepest descent to move from an initial configuration of points to a final matrix of coordinates: the configuration with the minimum stress of all those examined.
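The idea of steepest descent can be sketched in Python as follows. This is a simplified illustration, not Genstat's implementation: it assumes a fixed step length, minimizes a plain sum of squared differences between configuration distances and target distances (no regression step and no normalization), and the names `raw_stress`, `steepest_descent`, `step` and `max_cycle` are hypothetical.

```python
import math

def raw_stress(coords, target):
    """Sum over pairs of squared differences between configuration and target distances."""
    n = len(coords)
    return sum((math.dist(coords[i], coords[j]) - target[i][j]) ** 2
               for i in range(n) for j in range(i))

def steepest_descent(coords, target, step=0.05, max_cycle=30):
    """Fixed-step steepest descent; returns the lowest-stress configuration examined."""
    coords = [list(p) for p in coords]
    best, best_stress = [p[:] for p in coords], raw_stress(coords, target)
    n, k = len(coords), len(coords[0])
    for _ in range(max_cycle):
        grad = [[0.0] * k for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                d = math.dist(coords[i], coords[j])
                if d == 0.0:
                    continue  # coincident points: direction undefined, skip the pair
                factor = 2.0 * (d - target[i][j]) / d
                for a in range(k):
                    grad[i][a] += factor * (coords[i][a] - coords[j][a])
        # Move every point a short distance down the gradient
        for i in range(n):
            for a in range(k):
                coords[i][a] -= step * grad[i][a]
        s = raw_stress(coords, target)
        if s < best_stress:
            best, best_stress = [p[:] for p in coords], s
    return best, best_stress

# Hypothetical target distances (a 3-4-5 triangle) and a rough starting configuration
target = [[0.0, 3.0, 4.0], [3.0, 0.0, 5.0], [4.0, 5.0, 0.0]]
start = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
initial_stress = raw_stress(start, target)
best, best_stress = steepest_descent(start, target, max_cycle=300)
```

Keeping track of the best configuration seen, rather than simply returning the last one, mirrors the description above of returning the minimum-stress configuration of all those examined.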

Printed output is controlled by the PRINT option; by default nothing is printed. There are six possible settings:

`coordinates`	prints the solution coordinates, rotated to principal coordinates;
`roots`	prints the latent roots of the solution coordinates;
`distances`	prints the inter-unit distances, computed from the solution configuration;
`fitteddistances`	prints the fitted values from the regression of the inter-unit distances on the distances in the data matrix (the regression may be monotonic or linear through the origin, depending on the setting of the `METHOD` option);
`stress`	prints the stress of the solution coordinates;
`monitoring`	prints a summary of the results at each iteration.

The METHOD option determines whether metric or non-metric scaling is performed. The algorithm involves regression of the distances, calculated from the solution coordinates, against the dissimilarities in the symmetric matrix specified by the DATA option. With the default setting, METHOD=nonmetric, monotonic regression is used; if METHOD=linear, the algorithm uses linear regression through the origin.
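The two kinds of regression can be sketched in Python. This is an illustration under assumptions, not Genstat's code: monotonic regression is shown using the pool-adjacent-violators algorithm, a standard way to compute a least-squares non-decreasing fit, and the function names are hypothetical. The inputs are assumed to be ordered by increasing dissimilarity.

```python
def linear_through_origin(x, y):
    """Fitted values b*x, where b = sum(x*y) / sum(x*x) is the least-squares slope."""
    slope = sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)
    return [slope * a for a in x]

def monotonic_regression(y):
    """Pool-adjacent-violators: least-squares non-decreasing fit to y,
    where y is already ordered by increasing dissimilarity."""
    blocks = []  # each block holds [sum of values, number of values]
    for value in y:
        blocks.append([value, 1])
        # Merge backwards while neighbouring block means violate the ordering
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return fitted

# Hypothetical dissimilarities (sorted) and the matching configuration distances
dissimilarities = [1.0, 2.0, 3.0, 4.0]
config_distances = [1.0, 3.0, 2.0, 4.0]
fitted_nonmetric = monotonic_regression(config_distances)
fitted_linear = linear_through_origin(dissimilarities, config_distances)
```

The monotonic fit uses only the rank order of the dissimilarities, which is why non-metric scaling is insensitive to monotone transformations of the data; the linear fit uses their actual values.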

The stress function to be minimized can be selected using the SCALING option. There are three possibilities.

`ls` (least squares):	$\sqrt{ \sum_i \sum_j w_{ij} (d_{ij} - \hat{d}_{ij})^2 \,/\, \left( m \sum_i \sum_j w_{ij} d_{ij}^2 \right) }$
`lss` (least-squares-squared):	$\sqrt{ \sum_i \sum_j w_{ij} (d_{ij}^2 - \hat{d}_{ij}^2)^2 \,/\, \left( m \sum_i \sum_j w_{ij} d_{ij}^4 \right) }$
`logstress`:	$\sqrt{ \sum_i \sum_j w_{ij} (\log d_{ij} - \log \hat{d}_{ij})^2 \,/\, m }$

where the $d_{ij}$ are the elements of the dissimilarity matrix calculated for the fitted configuration, the $\hat{d}_{ij}$ are the fitted values from the regression selected by the METHOD option, the $w_{ij}$ are the corresponding weights and $m$ is the number of off-diagonal elements in the dissimilarity matrix.
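As a worked illustration, the three stress functions can be written out in Python. This is a sketch, not Genstat's code: the function name `stress` is hypothetical, and it assumes that the distances, fitted values and weights are supplied as equal-length lists with one entry per off-diagonal pair, so that m is the length of those lists.

```python
import math

def stress(d, d_hat, w=None, scaling="ls"):
    """Stress for configuration distances d and fitted values d_hat.

    d, d_hat, w: equal-length lists, one entry per off-diagonal pair (assumption).
    """
    m = len(d)
    if w is None:
        w = [1.0] * m  # all distances with weight one
    if scaling == "ls":
        num = sum(wi * (di - fi) ** 2 for wi, di, fi in zip(w, d, d_hat))
        den = m * sum(wi * di ** 2 for wi, di in zip(w, d))
    elif scaling == "lss":
        num = sum(wi * (di ** 2 - fi ** 2) ** 2 for wi, di, fi in zip(w, d, d_hat))
        den = m * sum(wi * di ** 4 for wi, di in zip(w, d))
    elif scaling == "logstress":
        num = sum(wi * (math.log(di) - math.log(fi)) ** 2
                  for wi, di, fi in zip(w, d, d_hat))
        den = m
    else:
        raise ValueError("scaling must be 'ls', 'lss' or 'logstress'")
    return math.sqrt(num / den)

# Two hypothetical pairs: one fitted exactly, one not
ls_value = stress([1.0, 2.0], [2.0, 2.0], scaling="ls")
log_value = stress([1.0, 2.0], [2.0, 2.0], scaling="logstress")
```

A pair with weight zero contributes nothing to any of the three sums. Note that the `logstress` form takes logarithms of the distances, so it requires strictly positive values.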

The TIES option controls how tied data values in the input data matrix are treated. By default, the treatment of ties is primary, and no restrictions are placed on the distances corresponding to tied dissimilarities in the input data matrix. In the secondary treatment of ties, the distances corresponding to tied dissimilarities are required to be as nearly equal as possible. Kendall (1977) describes a compromise between the primary and secondary approaches to ties: the block of ties corresponding to the smallest dissimilarity is handled by the secondary treatment, and the remaining blocks of ties are handled by the primary treatment. This tertiary treatment of ties is useful when the dissimilarities take only a few values. For example, in the reconstruction of maps from abuttal information, the dissimilarity coefficient takes only two values: zero if localities abut, and one if they do not. The block of ties associated with the dissimilarity of zero is handled by the secondary treatment, and the block of ties with dissimilarity one by the primary treatment.
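The effect of the three treatments on the fitted values can be sketched in Python. This follows one common formulation (ties broken by configuration distance under the primary treatment, tied distances pooled into a single averaged observation under the secondary treatment) and is an assumption about the general technique, not a description of Genstat's internals. The function names `pava` and `fitted_distances` are hypothetical.

```python
from itertools import groupby

def pava(values, weights):
    """Weighted pool-adjacent-violators; returns one fitted value per input value."""
    blocks = []  # [weighted sum, total weight, number of inputs pooled]
    for v, w in zip(values, weights):
        blocks.append([v * w, w, 1])
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            s, w_tot, n = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += w_tot
            blocks[-1][2] += n
    out = []
    for s, w_tot, n in blocks:
        out.extend([s / w_tot] * n)
    return out

def fitted_distances(pairs, ties="primary"):
    """pairs: list of (dissimilarity, configuration distance).
    Returns fitted values in the order of sorted(pairs)."""
    ordered = sorted(pairs)  # by dissimilarity, then by distance within ties
    values, weights = [], []
    for g, (_, block) in enumerate(groupby(ordered, key=lambda p: p[0])):
        dists = [dist for _, dist in block]
        if ties == "secondary" or (ties == "tertiary" and g == 0):
            # Pool the tie block into one observation so its fitted values are equal
            values.append(sum(dists) / len(dists))
            weights.append(len(dists))
        else:
            # Primary: each tied pair keeps its own distance
            values.extend(dists)
            weights.extend([1] * len(dists))
    fitted = pava(values, weights)
    result = []
    for f, w in zip(fitted, weights):
        result.extend([f] * w)  # expand pooled blocks back to one value per pair
    return result

# Hypothetical abuttal-style data: dissimilarity 0 (abut) or 1 (do not abut)
pairs = [(0, 1.0), (0, 3.0), (1, 2.0), (1, 5.0)]
primary = fitted_distances(pairs, "primary")
secondary = fitted_distances(pairs, "secondary")
tertiary = fitted_distances(pairs, "tertiary")
```

Under the secondary treatment every pair in a tie block receives the same fitted value, so unequal distances within the block add to the stress; under the primary treatment they need not.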

The WEIGHTS option can be used to specify a symmetric matrix of weights. Each element of the matrix gives the weight to be attached to the corresponding element of the input data matrix. If the option is not set, the elements of the data matrix are weighted equally: w_ij=1 for all i and j. The most important use of the option occurs when the matrix of weights contains only zeros and ones; the zeros then correspond to missing values in the input data matrix, allowing incomplete data matrices to be scaled. Up to about two thirds of the data matrix may be missing before the algorithm breaks down. This allows experimenters to design studies in which only a subset of all the dissimilarities need to be observed. This is particularly useful when there are a large number of units; if the number of units is n, say, a complete n × n data matrix requires n(n-1)/2 dissimilarities to be observed.
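The construction of a zero-one weight matrix from an incomplete data matrix can be sketched in Python. This is an illustration only: the helper names are hypothetical, the data values are invented, and `None` stands in for a missing value (which Genstat itself writes as `*`).

```python
def weights_from_missing(data):
    """data: lower-triangular list of rows, with None marking an unobserved dissimilarity.
    Returns a 0/1 weight triangle of the same shape."""
    return [[0 if value is None else 1 for value in row] for row in data]

def fraction_missing(weights):
    """Share of off-diagonal elements that have weight zero."""
    off_diag = [w for i, row in enumerate(weights)
                for j, w in enumerate(row) if j < i]
    return off_diag.count(0) / len(off_diag)

# Hypothetical incomplete dissimilarities for four units
n_units = 4
data = [
    [0.0],
    [1.2, 0.0],
    [None, 0.9, 0.0],
    [2.1, None, 1.5, 0.0],
]
weights = weights_from_missing(data)
n_pairs = n_units * (n_units - 1) // 2  # dissimilarities in a complete matrix
missing_share = fraction_missing(weights)
```

Here two of the six dissimilarities are unobserved, so one third of the matrix is missing, which is well within the limit of about two thirds quoted above.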

Since the algorithm is an iterative one, making use of the method of steepest descent, there is no guarantee that the solution coordinates found from any given starting configuration have the minimum stress of all possible configurations. The algorithm may have found a local, rather than the global, minimum. This problem may be partially overcome by using a series of different starting configurations. If several of the starts arrive at the same lowest-stress solution, then you may be reasonably confident of having found the global minimum. The NSTARTS option determines the number of starting configurations to be used. The starting configuration used on the first start can be specified by the INITIAL option; if this is not set, the default is to take the principal coordinate solution obtained from a PCO analysis of the input dissimilarity matrix. Subsequent starting configurations are found by perturbing each coordinate of the first starting configuration by successively larger amounts. This strategy generally results in at least one starting configuration that does not get trapped in a local minimum; however, there can be no guarantee that the global minimum for the stress function has been found. Experience suggests that, for safety, the NSTARTS option should be set to at least 10. By default NSTARTS=10.
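The multiple-start strategy can be sketched in Python. To keep the example short, a one-dimensional toy function with one local and one global minimum stands in for the stress function. The names `objective`, `descend`, `multi_start` and `spread`, the perturbation scheme (a uniform random amount scaled by the start number) and the seed value are all assumptions made for illustration, not details of Genstat's algorithm.

```python
import random

def objective(x):
    """Toy stand-in for stress: local minimum near x = 1.13, global minimum near x = -1.30."""
    return x ** 4 - 3 * x ** 2 + x

def descend(x, step=0.01, max_cycle=2000):
    """Fixed-step steepest descent on the toy objective."""
    for _ in range(max_cycle):
        x -= step * (4 * x ** 3 - 6 * x + 1)  # derivative of the objective
    return x

def multi_start(initial, n_starts=10, seed=722922, spread=0.3):
    """Run descent from several starts and keep the lowest value found."""
    rng = random.Random(seed)  # fixed seed makes the perturbations reproducible
    best_x = best_f = None
    for k in range(n_starts):
        # Start 0 is the initial value itself; later starts are perturbed
        # by successively larger random amounts.
        start = initial + k * spread * rng.uniform(-1.0, 1.0)
        x = descend(start)
        f = objective(x)
        if best_f is None or f < best_f:
            best_x, best_f = x, f
    return best_x, best_f

single_x = descend(1.0)            # a single start from x = 1 finds the local minimum
best_x, best_f = multi_start(1.0)  # several starts can only match or improve on it
```

Because the unperturbed start is always included, the multi-start result is never worse than the single-start result; whether it reaches the global minimum depends on the perturbations drawn.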

The SEED option supplies the seed for the random numbers that are used to perturb the initial configuration. The default of zero continues the existing sequence of random numbers if MDS has already been used in the current Genstat job. If MDS has not yet been used, Genstat picks a seed at random.

The MAXCYCLE option determines the maximum number of iterations of the algorithm. The default of 30 should usually be sufficient. However, it may be necessary to set a larger value for very large data matrices or when using the logstress setting of the SCALING option. The monitoring setting of the PRINT option may be used to see how convergence is progressing.

The NDIMENSIONS parameter must be set to a scalar (or scalars) to indicate the number(s) of dimensions in which the multidimensional scaling is to be performed on the data matrix. An MDS statement with a list of scalars will carry out a series of scaling operations, all based on the same matrix of dissimilarities, but with different numbers of dimensions.

The remaining parameters of the MDS directive allow output to be saved in Genstat data structures. The COORDINATES parameter can list matrices to store the minimum stress coordinates in each of the dimensions given by the NDIMENSIONS parameter, and the STRESS parameter can specify scalars to store the associated minimum stresses. The parameters DISTANCES and FITTEDDISTANCES can specify symmetric matrices to store the distances computed from the coordinates matrix and the fitted distances computed from the monotonic or linear regressions, respectively.

Options: PRINT, DATA, METHOD, SCALING, TIES, WEIGHTS, INITIAL, NSTARTS, SEED, MAXCYCLE.

Parameters: NDIMENSIONS, COORDINATES, STRESS, DISTANCES, FITTEDDISTANCES.

Reference

Kendall, D.G. (1977). On the tertiary treatment of ties. Proceedings of the Royal Society of London, Series A, 354, 407-423.

Example

" Genstat example MDS-1: Multidimensional Scaling.

  The data for this example (Nathanson J A 1971. An application of
  multivariate analysis in astronomy. Applied Statistics 20, 239-249)
  gives squared distances amongst ten types of galaxy: those of an
  elliptical shape, eight different kinds of spiral galaxy, and
  irregularly-shaped galaxies. The spiral types vary from those which 
  are mainly made up of a central core (coded as types SO and SBO) to 
  those that are extremely tenuous (Sc and SBc).

  This example forms an ordination of the ten galaxy types.
"
 
"
  Declare the symmetric data matrix
"
TEXT GalaxyType; !T(E,SO,SBO,Sa,SBa,Sb,SBb,Sc,SBc,I)
SYMMETRIC [ROWS=GalaxyType] Galaxy
READ Galaxy
0
1.87 0
2.24 0.91 0
4.03 2.05 1.51 0
4.09 1.74 1.59 0.68 0
5.38 3.41 3.15 1.86 1.27 0
7.03 3.85 3.24 2.25 1.89 2.02 0
6.02 4.85 4.11 3.00 2.13 1.71 1.45 0
6.88 5.70 5.12 3.72 3.01 2.97 1.75 1.13 0
4.12 3.77 3.86 3.93 3.27 3.77 3.52 2.79 3.29 0 :
"
 Carry out the analysis, printing out the stress, latent
 roots, the coordinates, the inter-unit distances between the
 coordinates, and the fitted values from the regression of the distances
 on the observed distances.

 Note that MDS requires distances, not squared distances, as input -
 transform data appropriately.
"
CALC Galaxy = sqrt(Galaxy)
PRINT Galaxy

MDS [PRINT=coordinates, roots,distances, stress,fitteddistances;\ 
    DATA=Galaxy; SEED=722922] NDIM=2;  COORD=C; DISTANCE=Dist; FITTEDDIST=FitD

XAXIS 4; TITLE='Distances'
YAXIS 4; TITLE='Fitted Distances'
PEN 3;  METHOD=line; SYMBOL=0
DGRAPH [WINDOW=4;KEYWINDOW=0;TITLE='MDS - Shepard Diagram']\ 
         Dist,FitD; Galaxy; PEN=1,3;\
         DESCRIPTION='Fitted Distances','Monotone Regression'

PEN 4; LABELS=GalaxyType; SYMBOL=0; COLOUR=5; SIZE=1.5
DGRAPH [WINDOW=3;KEYWINDOW=0;TITLE='MDS - Fitted Configuration']\
         C$[*;2]; C$[*;1]; PEN=4

Updated on June 19, 2019
