SOMESTIMATE procedure

Estimates the weights for self-organizing maps (R.W. Payne).

Options

`PRINT` = string tokens	Controls output (`weights`, `errors`, `monitoring`, `report`); default `weig`, `repo`
`PLOT` = string token	Controls what to plot (`fit`, `totalerror`); default `fit`
`DMETHOD` = string token	Method for calculating the distances of data points from the nodes (`euclidean`, `cityblock`); default `eucl`
`WMETHOD` = string token	Method for calculating the contribution of a data point to each node when revising the weights (`gaussian`, `neighbour`); default `gaus`
`ALPHA` = scalar or variate	Initial alpha value for each set of iterations; default `!(1, 0.1)`
`SIGMA` = scalar or variate	Initial sigma value for each set of iterations when `WMETHOD=gaussian`; default `!(1, 0.1)` multiplied by the maximum distance between nodes
`THRESHOLD` = scalar or variate	Initial distance threshold for each set of iterations when `WMETHOD=neighbour`; default `!(0.5, 0.1)` multiplied by the maximum distance between nodes
`NCYCLE` = scalar or variate	Number of cycles in each set of iterations; default 500
`NSTOP` = scalar	Number of consecutive cycles with no changes required for convergence; default 10

Parameters

`SOM` = pointers	Save the information about each map
`DATA` = matrices or pointers	Data values for training each map
`ERRORS` = matrices	Reconstruction errors at the nodes of each map
`FITROWS` = factors	Save the positions of the rows allocated to the data points
`FITCOLUMNS` = factors	Save the positions of the columns allocated to the data points
`Y` = variates	Save y-values used to plot the data points
`X` = variates	Save x-values used to plot the data points
`PEN` = scalars, variates or factors	Pens used to plot the maps
`SEED` = scalars	Seed for the random numbers used to initialize the weights in each map

Description

A self-organizing map is a two-dimensional grid of nodes, used to classify vectors of observations on p variables. Each node is characterized by a vector of p weights (one for each variable).

Before estimating the weights, you first need to declare a SOM structure to store the map. The SOM procedure, which does this, defines the row and column positions of the nodes on the grid. It also stores the names of the weight variables and information about how distances are to be measured on the grid and how the weights should be adjusted during their estimation. The SOM structure is then input to SOMESTIMATE by the SOM parameter.

The training dataset used to estimate the weights is specified by the DATA parameter, either as a matrix with n rows and p columns (where n is the number of observations in the training set) or as a pointer containing p variates each with n units. SOMESTIMATE gives a warning if the column names of a DATA matrix or the names of the variates in a DATA pointer differ from the names stored for the weight variables in the SOM structure.

The weights are estimated by a sequence of iterations, which are performed by the SOMADJUST procedure. In each iteration, the training observations are taken in turn, and each observation i is assessed to find its closest node. The method used to measure distance will have been specified by the DMETHOD option of SOM, and stored with the SOM structure when it was declared. However, SOMESTIMATE also has a DMETHOD option in case you want to override the stored setting. The default setting for the DMETHOD option of SOM is euclidean. If X_i is a variate containing the values of the variables for observation i and W_j is the variate of weights at node j, the distance is then given by

d_ij = SQRT(SUM((X_i - W_j)**2))

The alternative setting, cityblock, calculates the distance as

d_ij = SUM(ABS(X_i - W_j))
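
As an illustration only, the two distance measures and the search for the closest node can be sketched in Python. This is not Genstat code, and it is not how SOMADJUST is implemented; the function names and the node weights are made up for the example.

```python
import math

def euclidean(x, w):
    """d_ij = SQRT(SUM((X_i - W_j)**2))"""
    return math.sqrt(sum((xv - wv) ** 2 for xv, wv in zip(x, w)))

def cityblock(x, w):
    """d_ij = SUM(ABS(X_i - W_j))"""
    return sum(abs(xv - wv) for xv, wv in zip(x, w))

def closest_node(x, weights, distance=euclidean):
    """Return the index k of the node whose weights are nearest to x.
    Ties go to the first such node (an assumption of this sketch)."""
    return min(range(len(weights)), key=lambda j: distance(x, weights[j]))

x_i = [5.1, 3.5, 1.4, 0.2]          # first observation in the example data
weights = [[6.0, 3.0, 4.5, 1.5],    # made-up weights for two nodes
           [5.0, 3.4, 1.5, 0.2]]
k = closest_node(x_i, weights)      # index of the nearer node
k_city = closest_node(x_i, weights, distance=cityblock)
```

With these made-up weights, both measures pick the second node, but in general the two settings can allocate an observation to different nodes.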

Once the closest node, k, has been found, the weights at that node and other nodes are adjusted. The method to use will have been specified by the WMETHOD option of SOM when the SOM structure was declared. However, SOMESTIMATE again has its own WMETHOD option that you can use to override the stored setting. The default setting for the WMETHOD option of SOM is gaussian. This adjusts the weights W_j at every node j to become

W_j + alpha * EXP( -0.5 * (d_jk / sigma)**2) * (X_i - W_j)

where d_jk is the distance between nodes j and k. With the alternative setting, neighbour, the weights at node j are adjusted to become

W_j + alpha * (X_i - W_j)

but only if d_jk is less than a threshold r.
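
The two adjustment rules can be illustrated with a short Python sketch. This is not Genstat code and does not reproduce SOMADJUST; the function names and the `node_dist` argument (a function returning the distance d_jk between nodes j and k) are hypothetical.

```python
import math

def adjust_gaussian(weights, x, k, node_dist, alpha, sigma):
    """Move every node j towards x, scaled by a Gaussian in the distance d_jk."""
    for j, w in enumerate(weights):
        h = math.exp(-0.5 * (node_dist(j, k) / sigma) ** 2)
        weights[j] = [wv + alpha * h * (xv - wv) for wv, xv in zip(w, x)]

def adjust_neighbour(weights, x, k, node_dist, alpha, r):
    """Move node j towards x only if its distance d_jk is less than r."""
    for j, w in enumerate(weights):
        if node_dist(j, k) < r:
            weights[j] = [wv + alpha * (xv - wv) for wv, xv in zip(w, x)]

def row_dist(j, k):
    """Three nodes in a single row: distance is the difference in position."""
    return abs(j - k)

w_gauss = [[0.0], [0.0], [0.0]]
adjust_gaussian(w_gauss, [1.0], k=0, node_dist=row_dist, alpha=0.5, sigma=1.0)

w_neigh = [[0.0], [0.0], [0.0]]
adjust_neighbour(w_neigh, [1.0], k=0, node_dist=row_dist, alpha=0.5, r=1.5)
```

In the gaussian case every node moves, by an amount that shrinks with its distance from node k; in the neighbour case the nodes within the threshold all move by the same amount and the rest are left unchanged.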

The values of alpha, sigma and r change at each iteration. By default, SOMESTIMATE runs two sets of iterations. At the start of the first set, the parameters have initial values

alpha = 1

sigma = dmax

r = dmax / 2

where dmax is the maximum distance between any two nodes in the network. At the end of the first set, they have final values

alpha = 0.1

sigma = dmax / 10

r = dmax / 10

There are 500 iterations in the first set, and the parameters decrease in equal steps from their initial to their final values. There are also 500 iterations in the second set, and the parameters now decrease in equal steps to final values

alpha = 0

sigma = 0

r = dmin

where dmin is the minimum distance between any two nodes in the network. If dmax/10 is less than dmin, then the value of r at the end of the first set will be dmin too.
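
The equal-step decrease within a set can be sketched in Python as follows. This is an illustration, not the Genstat implementation: it assumes that the first iteration of a set uses the initial value and the last iteration uses the final value, and the values of `dmax` and `dmin` are made up.

```python
def linear_steps(start, end, ncycle):
    """Values for ncycle iterations, decreasing in equal steps from start to end.
    Assumes the first iteration uses `start` and the last uses `end`."""
    if ncycle == 1:
        return [start]
    step = (start - end) / (ncycle - 1)
    return [start - i * step for i in range(ncycle)]

dmax, dmin = 10.0, 1.0   # made-up maximum and minimum distances between nodes

# First default set of 500 iterations
alpha = linear_steps(1.0, 0.1, 500)
sigma = linear_steps(dmax, dmax / 10, 500)
# r ends the first set at dmax/10, or at dmin if dmax/10 is smaller
r = linear_steps(dmax / 2, max(dmax / 10, dmin), 500)
```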

You can define your own sequence of iterations using the ALPHA, SIGMA, THRESHOLD and NCYCLE options (where SIGMA is relevant only when WMETHOD=gaussian, and THRESHOLD only when WMETHOD=neighbour). Setting all the relevant options to scalars defines a single set of iterations where the parameters decrease from initial values set by the options to the final values specified above. Alternatively, you can set ALPHA and either SIGMA or THRESHOLD to variates to specify initial values for several sets of iterations. NCYCLE can be set to a scalar if all the sets are to contain the same number of iterations, or to a variate of the same length as ALPHA if you want each set to contain a different number.
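
One way to picture how scalar and variate settings translate into sets of iterations is the Python sketch below. It is an interpretation rather than a description of the Genstat code: it assumes, by analogy with the default settings, that each set ends at the initial value of the next set and that the last set ends at the final value. The function name is hypothetical.

```python
def iteration_sets(initial, ncycle, final):
    """Return a (start, end, ncycle) triple for each set of iterations.

    initial: scalar or list of initial values (e.g. the SIGMA setting)
    ncycle:  single integer, or a list with one entry per set
    final:   value reached at the end of the last set
    """
    starts = [initial] if isinstance(initial, (int, float)) else list(initial)
    counts = [ncycle] * len(starts) if isinstance(ncycle, int) else list(ncycle)
    if len(counts) != len(starts):
        raise ValueError("NCYCLE must be a scalar or match the number of sets")
    # Assumption: a set ends where the next one starts
    ends = starts[1:] + [final]
    return list(zip(starts, ends, counts))

# Settings like SIGMA=!(5,1) and NCYCLE=!(100,200), with a final sigma of 0
sets = iteration_sets([5, 1], [100, 200], 0.0)
# A scalar setting defines a single set
single = iteration_sets(2.0, 500, 0.0)
```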

The weights are initialized to have random positions within the plane of the first two principal components for the DATA matrix. The SEED parameter supplies a seed for the random numbers used to define the positions. The default value of zero initializes the random number generator automatically if this is the first time that it has been used in the current job, or continues the existing sequence of random numbers.

By default SOMESTIMATE will stop the estimation process if there are more than ten successive iterations in which no observation changes its closest node. Different numbers of successive iterations with no changes can be specified using the NSTOP option.
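
The stopping rule can be sketched in Python as a counter of consecutive iterations in which no observation changes its closest node. This is an illustration only: the function name is hypothetical, and the sketch stops as soon as the count reaches `nstop`, which is one possible reading of the rule.

```python
def cycles_until_stable(allocations_per_cycle, nstop=10):
    """Return the number of cycles performed before stopping.

    allocations_per_cycle supplies, for each cycle in turn, the list of
    closest-node indices for all the observations. The loop stops once
    nstop consecutive cycles leave every allocation unchanged.
    """
    previous, unchanged, cycles = None, 0, 0
    for current in allocations_per_cycle:
        cycles += 1
        unchanged = unchanged + 1 if current == previous else 0
        previous = current
        if unchanged >= nstop:
            break
    return cycles

# Allocations of two observations change over the first 3 cycles, then settle
history = [[0, 1], [0, 2], [1, 2]] + [[1, 2]] * 50
n_done = cycles_until_stable(history, nstop=10)
```

A larger value of `nstop` makes the check more conservative, at the cost of extra iterations after the allocations have settled.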

Printed output is controlled by the PRINT option, with settings:

`weights`	to print the weights at each node of the map;
`errors`	to print the reconstruction errors at each node of the map;
`monitoring`	to provide monitoring about each iteration; and
`report`	to print a report at the end of the estimation process.

By default PRINT=weights,report.

The PLOT option controls which plots are produced, with settings:

`fit`	for a plot showing how the data observations are allocated to the nodes of the map; and
`totalerror`	for a plot showing how the total reconstruction error changes at each iteration.

By default, the map is plotted. The PEN parameter can be used to define the pen or pens to be used to plot the points on the map. If PEN is set to a scalar, the same pen will be used for every point, so you would simply be able to assess the density of points around the map. Alternatively, you can supply a variate or factor to distinguish different groups of observations.

The ERRORS parameter can save a matrix with the reconstruction error at each node of the map. The Y and X parameters can save the coordinates used to plot the points on the map. These are formed by adding a small amount of random variation to the row and column of the nodes, to ensure that points allocated to the same node are not all plotted in the same position.
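
The idea behind the plotting coordinates can be illustrated in Python. This is a sketch only: the function name, the uniform distribution and the `spread` value are assumptions, as the size and form of the random variation that SOMESTIMATE uses are not stated here.

```python
import random

def jitter_positions(fit_rows, fit_columns, spread=0.3, seed=None):
    """Return (y, x) plotting coordinates: node row and column plus small noise.
    `spread` is a made-up half-width for the uniform random variation."""
    rng = random.Random(seed)
    y = [row + rng.uniform(-spread, spread) for row in fit_rows]
    x = [col + rng.uniform(-spread, spread) for col in fit_columns]
    return y, x

# Three observations allocated to the same node (row 2, column 1)
y, x = jitter_positions([2, 2, 2], [1, 1, 1], seed=1)
```

Because the variation is smaller than half the spacing between nodes, each point stays visibly attached to its own node while overlapping points are separated.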

Options: PRINT, PLOT, DMETHOD, WMETHOD, ALPHA, SIGMA, THRESHOLD, NCYCLE, NSTOP.

Parameters: SOM, DATA, ERRORS, FITROWS, FITCOLUMNS, Y, X, PEN, SEED.

Method

The individual iterations involved in the estimation are carried out by the SOMADJUST procedure.

Action with `RESTRICT`

SOMESTIMATE takes account of any restrictions defined on the DATA variates.

Example

CAPTION 'SOMESTIMATE example',!t('Fisher''s Iris Data'); STYLE=meta,plain
SOM     Som; VARIABLENAMES=!t(Sepal_L,Sepal_W,Petal_L,Petal_W)
MATRIX  [ROWS=150; COLUMNS=!t(Sepal_L,Sepal_W,Petal_L,Petal_W)] Measures
READ    Measures
 5.1  3.5  1.4  0.2
 4.9  3.0  1.4  0.2
 4.7  3.2  1.3  0.2
 4.6  3.1  1.5  0.2
 5.0  3.6  1.4  0.2
 5.4  3.9  1.7  0.4
 4.6  3.4  1.4  0.3
 5.0  3.4  1.5  0.2
 4.4  2.9  1.4  0.2
 4.9  3.1  1.5  0.1
 5.4  3.7  1.5  0.2
 4.8  3.4  1.6  0.2
 4.8  3.0  1.4  0.1
 4.3  3.0  1.1  0.1
 5.8  4.0  1.2  0.2
 5.7  4.4  1.5  0.4
 5.4  3.9  1.3  0.4
 5.1  3.5  1.4  0.3
 5.7  3.8  1.7  0.3
 5.1  3.8  1.5  0.3
 5.4  3.4  1.7  0.2
 5.1  3.7  1.5  0.4
 4.6  3.6  1.0  0.2
 5.1  3.3  1.7  0.5
 4.8  3.4  1.9  0.2
 5.0  3.0  1.6  0.2
 5.0  3.4  1.6  0.4
 5.2  3.5  1.5  0.2
 5.2  3.4  1.4  0.2
 4.7  3.2  1.6  0.2
 4.8  3.1  1.6  0.2
 5.4  3.4  1.5  0.4
 5.2  4.1  1.5  0.1
 5.5  4.2  1.4  0.2
 4.9  3.1  1.5  0.2
 5.0  3.2  1.2  0.2
 5.5  3.5  1.3  0.2
 4.9  3.6  1.4  0.1
 4.4  3.0  1.3  0.2
 5.1  3.4  1.5  0.2
 5.0  3.5  1.3  0.3
 4.5  2.3  1.3  0.3
 4.4  3.2  1.3  0.2
 5.0  3.5  1.6  0.6
 5.1  3.8  1.9  0.4
 4.8  3.0  1.4  0.3
 5.1  3.8  1.6  0.2
 4.6  3.2  1.4  0.2
 5.3  3.7  1.5  0.2
 5.0  3.3  1.4  0.2
 7.0  3.2  4.7  1.4
 6.4  3.2  4.5  1.5
 6.9  3.1  4.9  1.5
 5.5  2.3  4.0  1.3
 6.5  2.8  4.6  1.5
 5.7  2.8  4.5  1.3
 6.3  3.3  4.7  1.6
 4.9  2.4  3.3  1.0
 6.6  2.9  4.6  1.3
 5.2  2.7  3.9  1.4
 5.0  2.0  3.5  1.0
 5.9  3.0  4.2  1.5
 6.0  2.2  4.0  1.0
 6.1  2.9  4.7  1.4
 5.6  2.9  3.6  1.3
 6.7  3.1  4.4  1.4
 5.6  3.0  4.5  1.5
 5.8  2.7  4.1  1.0
 6.2  2.2  4.5  1.5
 5.6  2.5  3.9  1.1
 5.9  3.2  4.8  1.8
 6.1  2.8  4.0  1.3
 6.3  2.5  4.9  1.5
 6.1  2.8  4.7  1.2
 6.4  2.9  4.3  1.3
 6.6  3.0  4.4  1.4
 6.8  2.8  4.8  1.4
 6.7  3.0  5.0  1.7
 6.0  2.9  4.5  1.5
 5.7  2.6  3.5  1.0
 5.5  2.4  3.8  1.1
 5.5  2.4  3.7  1.0
 5.8  2.7  3.9  1.2
 6.0  2.7  5.1  1.6
 5.4  3.0  4.5  1.5
 6.0  3.4  4.5  1.6
 6.7  3.1  4.7  1.5
 6.3  2.3  4.4  1.3
 5.6  3.0  4.1  1.3
 5.5  2.5  4.0  1.3
 5.5  2.6  4.4  1.2
 6.1  3.0  4.6  1.4
 5.8  2.6  4.0  1.2
 5.0  2.3  3.3  1.0
 5.6  2.7  4.2  1.3
 5.7  3.0  4.2  1.2
 5.7  2.9  4.2  1.3
 6.2  2.9  4.3  1.3
 5.1  2.5  3.0  1.1
 5.7  2.8  4.1  1.3
 6.3  3.3  6.0  2.5
 5.8  2.7  5.1  1.9
 7.1  3.0  5.9  2.1
 6.3  2.9  5.6  1.8
 6.5  3.0  5.8  2.2
 7.6  3.0  6.6  2.1
 4.9  2.5  4.5  1.7
 7.3  2.9  6.3  1.8
 6.7  2.5  5.8  1.8
 7.2  3.6  6.1  2.5
 6.5  3.2  5.1  2.0
 6.4  2.7  5.3  1.9
 6.8  3.0  5.5  2.1
 5.7  2.5  5.0  2.0
 5.8  2.8  5.1  2.4
 6.4  3.2  5.3  2.3
 6.5  3.0  5.5  1.8
 7.7  3.8  6.7  2.2
 7.7  2.6  6.9  2.3
 6.0  2.2  5.0  1.5
 6.9  3.2  5.7  2.3
 5.6  2.8  4.9  2.0
 7.7  2.8  6.7  2.0
 6.3  2.7  4.9  1.8
 6.7  3.3  5.7  2.1
 7.2  3.2  6.0  1.8
 6.2  2.8  4.8  1.8
 6.1  3.0  4.9  1.8
 6.4  2.8  5.6  2.1
 7.2  3.0  5.8  1.6
 7.4  2.8  6.1  1.9
 7.9  3.8  6.4  2.0
 6.4  2.8  5.6  2.2
 6.3  2.8  5.1  1.5
 6.1  2.6  5.6  1.4
 7.7  3.0  6.1  2.3
 6.3  3.4  5.6  2.4
 6.4  3.1  5.5  1.8
 6.0  3.0  4.8  1.8
 6.9  3.1  5.4  2.1
 6.7  3.1  5.6  2.4
 6.9  3.1  5.1  2.3
 5.8  2.7  5.1  1.9
 6.8  3.2  5.9  2.3
 6.7  3.3  5.7  2.5
 6.7  3.0  5.2  2.3
 6.3  2.5  5.0  1.9
 6.5  3.0  5.2  2.0
 6.2  3.4  5.4  2.3
 5.9  3.0  5.1  1.8  :
FACTOR       [NVALUES=150; LABELS=!t(Setosa,Versicolor,Virginica);\
             VALUES=50(1,2,3)] Species
SOMESTIMATE  [PRINT=weights,errors,report; PLOT=fit,totalerror;\
             NCYCLE=!(100,200); SIGMA=!(5,1)] Som; DATA=Measures;\
             PEN=Species;  SEED=419749

Updated on March 5, 2019