Estimates the weights for self-organizing maps (R.W. Payne).
Options
PRINT = string tokens | Controls output (weights, errors, monitoring, report); default weig, repo |
---|---|
PLOT = string token | Controls what to plot (fit, totalerror); default fit |
DMETHOD = string token | Method for calculating the distances of data points from the nodes (euclidean, cityblock); default eucl |
WMETHOD = string token | Method for calculating the contribution of a data point to each node when revising the weights (gaussian, neighbour); default gaus |
ALPHA = scalar or variate | Initial alpha value for each set of iterations; default !(1, 0.1) |
SIGMA = scalar or variate | Initial sigma value for each set of iterations when WMETHOD=gaussian; default !(1, 0.1) multiplied by the maximum distance between nodes |
THRESHOLD = scalar or variate | Initial distance threshold for each set of iterations when WMETHOD=neighbour; default !(0.5, 0.1) multiplied by the maximum distance between nodes |
NCYCLE = scalar or variate | Number of cycles in each set of iterations; default 500 |
NSTOP = scalar | Number of consecutive cycles with no changes required for convergence; default 10 |
Parameters
SOM = pointers | Save the information about each map |
---|---|
DATA = matrices or pointers | Data values for training each map |
ERRORS = matrices | Reconstruction errors at the nodes of each map |
FITROWS = factors | Save the positions of the rows allocated to the data points |
FITCOLUMNS = factors | Save the positions of the columns allocated to the data points |
Y = variates | Save y-values used to plot the data points |
X = variates | Save x-values used to plot the data points |
PEN = scalars, variates or factors | Pens used to plot the maps |
SEED = scalars | Seed for the random numbers used to initialize the weights in each map |
Description
A self-organizing map is a two-dimensional grid of nodes, used to classify vectors of observations on p variables. Each node is characterized by a vector of p weights (one for each variable).
Before estimating the weights, you first need to declare a SOM structure to store the map. The SOM procedure, which does this, defines the row and column positions of the nodes on the grid. It also stores the names of the weight variables and information about how distances are to be measured on the grid and how the weights should be adjusted during their estimation. The SOM structure is then input to SOMESTIMATE by the SOM parameter.
The training dataset used to estimate the weights is specified by the DATA parameter, either as a matrix with n rows and p columns (where n is the number of observations in the training set) or as a pointer containing p variates each with n units. SOMESTIMATE gives a warning if the column names of a DATA matrix or the names of the variates in a DATA pointer differ from the names stored for the weight variables in the SOM structure.
The weights are estimated by a sequence of iterations, which are performed by the SOMADJUST procedure. In an iteration, the training observations are taken in turn. Each observation i is assessed to find its closest node. The method used to measure distance will have been specified by the DMETHOD option of SOM, and stored with the SOM structure when it was declared. However, SOMESTIMATE also has a DMETHOD option in case you want to override the stored setting. The default setting for the DMETHOD option of SOM is euclidean. If X_i is a variate containing the values of the variables for observation i and W_j is the variate of weights at node j, the distance is then given by
d_ij = SQRT(SUM((X_i - W_j)**2))
The alternative setting, cityblock, calculates the distance as
d_ij = SUM(ABS(X_i - W_j))
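The two distance measures, and the search for the closest node, can be sketched in Python as follows. This is only an illustration of the formulas above, not the code used by SOMADJUST; the function names and the small set of weights are hypothetical.

```python
import math

def euclidean(x, w):
    """d_ij = SQRT(SUM((X_i - W_j)**2))"""
    return math.sqrt(sum((xi - wi) ** 2 for xi, wi in zip(x, w)))

def cityblock(x, w):
    """d_ij = SUM(ABS(X_i - W_j))"""
    return sum(abs(xi - wi) for xi, wi in zip(x, w))

def closest_node(x, weights, distance=euclidean):
    """Return the index k of the node whose weights are nearest to x.

    Ties are resolved in favour of the first node found (an assumption;
    the text does not say how ties are handled).
    """
    return min(range(len(weights)), key=lambda j: distance(x, weights[j]))

# Hypothetical example: two nodes, two variables.
weights = [[3.0, 3.0], [0.0, 5.0]]
x = [0.0, 0.0]
k_euclidean = closest_node(x, weights, euclidean)  # distances 4.24 and 5
k_cityblock = closest_node(x, weights, cityblock)  # distances 6 and 5
```

In this small example the two measures choose different nodes, which is why the DMETHOD setting can change the fitted map.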
Once the closest node, k, has been found, the weights at that node and other nodes are adjusted. The method to use will have been specified when the SOM structure was declared, by the WMETHOD option of SOM. However, SOMESTIMATE again has its own WMETHOD option, which you can use to override the stored setting. The default setting for the WMETHOD option of SOM is gaussian. This adjusts the weights W_j at every node j to become
W_j + alpha * EXP( -0.5 * (d_jk / sigma)**2) * (X_i - W_j)
where d_jk is the distance between nodes j and k. With the alternative setting, neighbour, the weights at node j are adjusted to become
W_j + alpha * (X_i - W_j)
but only if d_jk is less than a threshold r.
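As a rough illustration, the two adjustment rules could be written in Python like this. The function names are hypothetical, the sketch assumes sigma is greater than zero, and it is not the SOMADJUST implementation.

```python
import math

def update_gaussian(w_j, x_i, d_jk, alpha, sigma):
    """W_j + alpha * EXP(-0.5 * (d_jk / sigma)**2) * (X_i - W_j)

    Assumes sigma > 0; the limiting case sigma = 0 is not handled here.
    """
    h = math.exp(-0.5 * (d_jk / sigma) ** 2)
    return [w + alpha * h * (x - w) for w, x in zip(w_j, x_i)]

def update_neighbour(w_j, x_i, d_jk, alpha, r):
    """W_j + alpha * (X_i - W_j), applied only if d_jk is less than r."""
    if d_jk < r:
        return [w + alpha * (x - w) for w, x in zip(w_j, x_i)]
    return list(w_j)  # node j is outside the neighbourhood: unchanged

# Hypothetical values: the winning node itself (d_jk = 0) moves halfway to x_i.
moved = update_gaussian([0.0, 0.0], [2.0, 4.0], d_jk=0.0, alpha=0.5, sigma=1.0)
# A node at grid distance 3 with threshold 2 is left alone.
kept = update_neighbour([0.0, 0.0], [2.0, 4.0], d_jk=3.0, alpha=0.5, r=2.0)
```

With the gaussian rule every node moves, by an amount that shrinks smoothly with its grid distance from node k; with the neighbour rule nodes inside the threshold all move by the same fraction alpha and the rest do not move at all.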
The values of alpha, sigma and r change at each iteration. By default, SOMESTIMATE runs two sets of iterations. At the start of the first set, the parameters have initial values
alpha = 1
sigma = dmax
r = dmax / 2
where dmax is the maximum distance between any two nodes in the network. At the end of the first set, they have final values
alpha = 0.1
sigma = dmax / 10
r = dmax / 10
There are 500 iterations in the first set, and the parameters decrease in equal steps from their initial to their final values. There are also 500 iterations in the second set, and the parameters now decrease in equal steps to final values
alpha = 0
sigma = 0
r = dmin
where dmin is the minimum distance between any two nodes in the network. If dmax/10 is less than dmin, then the value of r at the end of the first set will be dmin too.
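The equal-step decrease over the first set of default iterations can be sketched in Python as below. The values of dmax and dmin are hypothetical, and the sketch assumes that the first iteration uses the initial value and the last iteration uses the final value; the text does not say exactly where the end points fall.

```python
def linear_schedule(initial, final, ncycle):
    """Return ncycle values stepping evenly from initial to final.

    Assumption: both end points are included in the sequence.
    """
    if ncycle == 1:
        return [initial]
    step = (final - initial) / (ncycle - 1)
    return [initial + step * t for t in range(ncycle)]

dmax = 5.0  # hypothetical maximum distance between nodes
dmin = 1.0  # hypothetical minimum distance between nodes

alpha = linear_schedule(1.0, 0.1, 500)
sigma = linear_schedule(dmax, dmax / 10, 500)
# r is not allowed to end the first set below dmin.
r = linear_schedule(dmax / 2, max(dmax / 10, dmin), 500)
```

Here dmax/10 is 0.5, which is less than dmin, so r finishes the first set at dmin (1.0) rather than at 0.5.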
You can define your own sequence of iterations using the ALPHA, SIGMA, THRESHOLD and NCYCLE options (where SIGMA is relevant only when WMETHOD=gaussian, and THRESHOLD only when WMETHOD=neighbour). Setting all the relevant options to scalars defines a single set of iterations where the parameters decrease from initial values set by the options to the final values specified above. Alternatively, you can set ALPHA and either SIGMA or THRESHOLD to variates to specify initial values for several sets of iterations. NCYCLE can be set to a scalar if all the sets are to contain the same number of iterations, or to a variate of the same length as ALPHA if you want each set to contain a different number.
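One way to picture how variate settings translate into sets of iterations is the Python sketch below. It assumes that each set finishes at the initial value of the next set, and that the last set finishes at a supplied final value; this matches the default behaviour described earlier but is an assumption for user-supplied variates. The function name is hypothetical.

```python
def iteration_sets(initials, ncycle, final):
    """Return a list of (start, end, cycles) tuples, one per set.

    initials : initial parameter value for each set (like ALPHA or SIGMA)
    ncycle   : one int for all sets, or a list with one entry per set
    final    : value reached at the end of the last set
    """
    if isinstance(ncycle, int):
        ncycle = [ncycle] * len(initials)  # scalar applies to every set
    if len(ncycle) != len(initials):
        raise ValueError("ncycle must be a scalar or match the number of sets")
    ends = list(initials[1:]) + [final]    # each set ends where the next starts
    return list(zip(initials, ends, ncycle))

# Initial sigma values 5 and 1 with 100 and 200 cycles, ending at sigma = 0.
sigma_sets = iteration_sets([5.0, 1.0], [100, 200], final=0.0)
```

Under these assumptions `sigma_sets` describes a first set of 100 cycles taking sigma from 5 to 1, then a second set of 200 cycles taking it from 1 to 0.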
The weights are initialized to have random positions within the plane of the first two principal components for the DATA matrix. The SEED parameter supplies a seed for the random numbers used to define the positions. The default value of zero initializes the random number generator automatically if this is the first time that it has been used in the current job, or continues the existing sequence of random numbers.
By default SOMESTIMATE will stop the estimation process if there are more than ten successive iterations in which no observation changes its closest node. Different numbers of successive iterations with no changes can be specified using the NSTOP option.
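The stopping rule amounts to counting consecutive iterations in which the allocation of observations to nodes is unchanged. A minimal Python sketch follows; the function name is hypothetical, and stopping as soon as the count reaches nstop (rather than exceeds it) is an assumption about the exact boundary.

```python
def run_until_stable(assignments_per_iteration, nstop=10):
    """Return the number of iterations performed before stopping.

    assignments_per_iteration yields, for each iteration, the list of
    closest-node indices for all observations.
    """
    previous = None
    unchanged = 0   # consecutive iterations with no change of closest node
    performed = 0
    for current in assignments_per_iteration:
        performed += 1
        if previous is not None and current == previous:
            unchanged += 1
        else:
            unchanged = 0
        previous = current
        if unchanged >= nstop:
            break
    return performed

# Hypothetical history: allocations settle at the third iteration.
history = [[0, 1], [0, 2]] + [[1, 2]] * 20
iterations_used = run_until_stable(history, nstop=3)
```

A larger nstop makes the procedure more cautious about declaring convergence, at the cost of extra iterations.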
Printed output is controlled by the PRINT option, with settings:
weights | to print the weights at each node of the map; |
---|---|
errors | to print the reconstruction errors at each node of the map; |
monitoring | to provide monitoring about each iteration; and |
report | to print a report at the end of the estimation process. |
By default PRINT=weights,report.
The PLOT option controls which plots are produced, with settings:
fit | for a plot showing how the data observations are allocated to the nodes of the map; and |
---|---|
totalerror | for a plot showing how the total reconstruction error changes at each iteration. |
By default PLOT=fit, so only the map is plotted. The PEN parameter can be used to define the pen or pens to be used to plot the points on the map. If PEN is set to a scalar, the same pen will be used for every point, so you will be able to assess only the density of points around the map. Alternatively, you can supply a variate or factor to distinguish different groups of observations.
The ERRORS parameter can save a matrix with the reconstruction error at each node of the map. The Y and X parameters can save the coordinates used to plot the points on the map. These are formed by adding a small amount of random variation to the row and column of the nodes, to ensure that points allocated to the same node are not all plotted in the same position.
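The idea behind the plotting coordinates can be illustrated with the Python sketch below. The uniform distribution and the `spread` value are hypothetical choices, since the text does not state how much random variation SOMESTIMATE adds or how it is distributed.

```python
import random

def jitter_positions(rows, columns, spread=0.2, seed=None):
    """Return (y, x) plotting coordinates with small random offsets.

    rows, columns : node row and column allocated to each data point
    spread        : hypothetical half-width of the uniform offsets
    """
    rng = random.Random(seed)
    y = [r + rng.uniform(-spread, spread) for r in rows]
    x = [c + rng.uniform(-spread, spread) for c in columns]
    return y, x

# Three points, two of them allocated to the same node (row 1, column 3).
y, x = jitter_positions([1, 1, 2], [3, 3, 1], spread=0.2, seed=1)
```

Because the offsets are kept well below half the spacing between nodes, each point stays visibly attached to its own node while overlapping points are separated.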
Options: PRINT, PLOT, DMETHOD, WMETHOD, ALPHA, SIGMA, THRESHOLD, NCYCLE, NSTOP.
Parameters: SOM, DATA, ERRORS, FITROWS, FITCOLUMNS, Y, X, PEN, SEED.
Method
The individual iterations involved in the estimation are carried out by the SOMADJUST procedure.
Action with RESTRICT
SOMESTIMATE takes account of any restrictions defined on the DATA variates.
See also
Procedures: SOM, SOMADJUST, SOMDESCRIBE, SOMIDENTIFY, SOMPREDICT.
Commands for: Data mining.
Example
CAPTION 'SOMESTIMATE example',!t('Fisher''s Iris Data'); STYLE=meta,plain
SOM Som; VARIABLENAMES=!t(Sepal_L,Sepal_W,Petal_L,Petal_W)
MATRIX [ROWS=150; COLUMNS=!t(Sepal_L,Sepal_W,Petal_L,Petal_W)] Measures
READ Measures
5.1 3.5 1.4 0.2 4.9 3.0 1.4 0.2 4.7 3.2 1.3 0.2 4.6 3.1 1.5 0.2 5.0 3.6 1.4 0.2
5.4 3.9 1.7 0.4 4.6 3.4 1.4 0.3 5.0 3.4 1.5 0.2 4.4 2.9 1.4 0.2 4.9 3.1 1.5 0.1
5.4 3.7 1.5 0.2 4.8 3.4 1.6 0.2 4.8 3.0 1.4 0.1 4.3 3.0 1.1 0.1 5.8 4.0 1.2 0.2
5.7 4.4 1.5 0.4 5.4 3.9 1.3 0.4 5.1 3.5 1.4 0.3 5.7 3.8 1.7 0.3 5.1 3.8 1.5 0.3
5.4 3.4 1.7 0.2 5.1 3.7 1.5 0.4 4.6 3.6 1.0 0.2 5.1 3.3 1.7 0.5 4.8 3.4 1.9 0.2
5.0 3.0 1.6 0.2 5.0 3.4 1.6 0.4 5.2 3.5 1.5 0.2 5.2 3.4 1.4 0.2 4.7 3.2 1.6 0.2
4.8 3.1 1.6 0.2 5.4 3.4 1.5 0.4 5.2 4.1 1.5 0.1 5.5 4.2 1.4 0.2 4.9 3.1 1.5 0.2
5.0 3.2 1.2 0.2 5.5 3.5 1.3 0.2 4.9 3.6 1.4 0.1 4.4 3.0 1.3 0.2 5.1 3.4 1.5 0.2
5.0 3.5 1.3 0.3 4.5 2.3 1.3 0.3 4.4 3.2 1.3 0.2 5.0 3.5 1.6 0.6 5.1 3.8 1.9 0.4
4.8 3.0 1.4 0.3 5.1 3.8 1.6 0.2 4.6 3.2 1.4 0.2 5.3 3.7 1.5 0.2 5.0 3.3 1.4 0.2
7.0 3.2 4.7 1.4 6.4 3.2 4.5 1.5 6.9 3.1 4.9 1.5 5.5 2.3 4.0 1.3 6.5 2.8 4.6 1.5
5.7 2.8 4.5 1.3 6.3 3.3 4.7 1.6 4.9 2.4 3.3 1.0 6.6 2.9 4.6 1.3 5.2 2.7 3.9 1.4
5.0 2.0 3.5 1.0 5.9 3.0 4.2 1.5 6.0 2.2 4.0 1.0 6.1 2.9 4.7 1.4 5.6 2.9 3.6 1.3
6.7 3.1 4.4 1.4 5.6 3.0 4.5 1.5 5.8 2.7 4.1 1.0 6.2 2.2 4.5 1.5 5.6 2.5 3.9 1.1
5.9 3.2 4.8 1.8 6.1 2.8 4.0 1.3 6.3 2.5 4.9 1.5 6.1 2.8 4.7 1.2 6.4 2.9 4.3 1.3
6.6 3.0 4.4 1.4 6.8 2.8 4.8 1.4 6.7 3.0 5.0 1.7 6.0 2.9 4.5 1.5 5.7 2.6 3.5 1.0
5.5 2.4 3.8 1.1 5.5 2.4 3.7 1.0 5.8 2.7 3.9 1.2 6.0 2.7 5.1 1.6 5.4 3.0 4.5 1.5
6.0 3.4 4.5 1.6 6.7 3.1 4.7 1.5 6.3 2.3 4.4 1.3 5.6 3.0 4.1 1.3 5.5 2.5 4.0 1.3
5.5 2.6 4.4 1.2 6.1 3.0 4.6 1.4 5.8 2.6 4.0 1.2 5.0 2.3 3.3 1.0 5.6 2.7 4.2 1.3
5.7 3.0 4.2 1.2 5.7 2.9 4.2 1.3 6.2 2.9 4.3 1.3 5.1 2.5 3.0 1.1 5.7 2.8 4.1 1.3
6.3 3.3 6.0 2.5 5.8 2.7 5.1 1.9 7.1 3.0 5.9 2.1 6.3 2.9 5.6 1.8 6.5 3.0 5.8 2.2
7.6 3.0 6.6 2.1 4.9 2.5 4.5 1.7 7.3 2.9 6.3 1.8 6.7 2.5 5.8 1.8 7.2 3.6 6.1 2.5
6.5 3.2 5.1 2.0 6.4 2.7 5.3 1.9 6.8 3.0 5.5 2.1 5.7 2.5 5.0 2.0 5.8 2.8 5.1 2.4
6.4 3.2 5.3 2.3 6.5 3.0 5.5 1.8 7.7 3.8 6.7 2.2 7.7 2.6 6.9 2.3 6.0 2.2 5.0 1.5
6.9 3.2 5.7 2.3 5.6 2.8 4.9 2.0 7.7 2.8 6.7 2.0 6.3 2.7 4.9 1.8 6.7 3.3 5.7 2.1
7.2 3.2 6.0 1.8 6.2 2.8 4.8 1.8 6.1 3.0 4.9 1.8 6.4 2.8 5.6 2.1 7.2 3.0 5.8 1.6
7.4 2.8 6.1 1.9 7.9 3.8 6.4 2.0 6.4 2.8 5.6 2.2 6.3 2.8 5.1 1.5 6.1 2.6 5.6 1.4
7.7 3.0 6.1 2.3 6.3 3.4 5.6 2.4 6.4 3.1 5.5 1.8 6.0 3.0 4.8 1.8 6.9 3.1 5.4 2.1
6.7 3.1 5.6 2.4 6.9 3.1 5.1 2.3 5.8 2.7 5.1 1.9 6.8 3.2 5.9 2.3 6.7 3.3 5.7 2.5
6.7 3.0 5.2 2.3 6.3 2.5 5.0 1.9 6.5 3.0 5.2 2.0 6.2 3.4 5.4 2.3 5.9 3.0 5.1 1.8 :
FACTOR [NVALUES=150; LABELS=!t(Setosa,Versicolor,Virginica);\
  VALUES=50(1,2,3)] Species
SOMESTIMATE [PRINT=weights,errors,report; PLOT=fit,totalerror;\
  NCYCLE=!(100,200); SIGMA=!(5,1)] Som; DATA=Measures;\
  PEN=Species; SEED=419749