HBOOTSTRAP procedure

Performs bootstrap analyses to assess the reliability of clusters from hierarchical cluster analysis (R.W. Payne).

Options

`PRINT` = string token	Controls printed output (`clusters`, `dendrograms`); default `*` i.e. none
`METHOD` = string token	Criterion for forming clusters (`singlelink`, `nearestneighbour`, `completelink`, `furthestneighbour`, `averagelink`, `mediansort`, `groupaverage`); default `sing`
`CLIMIT` = scalar	Similarity value below which clusters are not recorded; default `0`
`UNITS` = text or variate	Names to label the units of the clusters when they are printed; default `*`
`MINKOWSKI` = scalar	Index t for use with `TEST=minkowski`
`CLUSTERS` = pointer	Specifies or saves the clusters
`REPLICATION` = variate	Saves the replication of the clusters in the bootstrap samples
`NDATASAMPLE` = scalar	Number of `DATA` vectors to take in each sample; default takes the same number as supplied by the `DATA` parameter
`NTIMES` = scalar	Number of times to resample; default `100`
`SEED` = scalar	Seed for random number generator; default continue from previous generation or use system clock

Parameters

`DATA` = variates or factors	The characteristics of the units to be clustered
`TEST` = string tokens	Test type, defining how each `DATA` variate or factor is treated in the calculation of the similarity between each unit (`simplematching`, `jaccard`, `russellrao`, `dice`, `antidice`, `sneathsokal`, `rogerstanimoto`, `cityblock`, `manhattan`, `ecological`, `euclidean`, `pythagorean`, `minkowski`, `divergence`, `canberra`, `braycurtis`, `soergel`); default `*` ignores that variate or factor
`RANGE` = scalar	Range of possible values of each `DATA` variate or factor; if omitted, the observed range is taken

Description

HBOOTSTRAP uses bootstrapping to assess the reliability of the clusters formed in a hierarchical cluster analysis. The characteristics of the units to be clustered are supplied as a list of variates and factors by the DATA parameter. The TEST parameter defines how each variate or factor is used when calculating similarities, and the RANGE parameter can specify the range of its possible values; if RANGE is omitted, the observed range is taken. These parameters operate as in the FSIMILARITY directive, which is used to form the similarity matrix for each cluster analysis. The MINKOWSKI option specifies the index t for the Minkowski type of test.
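
As a minimal sketch of how TEST, RANGE and MINKOWSKI line up with the DATA list, the call below gives each variate its own test type and range. The variates x1, x2 and x3, their values and the chosen ranges are hypothetical, invented only for illustration.

```
" hypothetical measurements on six units (not from any real data set) "
VARIATE     [VALUES=1.2,3.4,2.2,5.1,4.8,0.9] x1
VARIATE     [VALUES=10,14,11,20,19,8] x2
VARIATE     [VALUES=0.5,0.7,0.4,1.1,1.0,0.3] x3
" x1 and x2 use the Euclidean test; x3 uses the Minkowski test with index t=3; "
" the ranges are set a little wider than the observed ranges "
HBOOTSTRAP  [PRINT=clusters; MINKOWSKI=3] x1,x2,x3;\
            TEST=euclidean,euclidean,minkowski; RANGE=5,15,1
```

The three parameter lists are read in parallel, so the first TEST and RANGE settings apply to x1, the second to x2, and so on.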

Each bootstrap sample is formed by sampling with replacement from the DATA vectors (that is, from the variates and factors themselves, not from the units). The NDATASAMPLE option specifies the number of vectors to take; by default this is the same as the number of vectors supplied by DATA. The NTIMES option specifies the number of bootstrap samples; default 100. The SEED option specifies the seed for the random numbers used to select each sample; the default of zero continues an existing sequence of random numbers or, if there is none, initializes the sequence using the system clock. For each sample, HBOOTSTRAP performs a cluster analysis of the sampled vectors using the HCLUSTER directive, with the clustering criterion specified by the METHOD option, and obtains the resulting clusters using the HFCLUSTERS procedure. The CLIMIT option can specify a similarity value below which clusters are not recorded; default 0.
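
The sampling controls might be combined as in the sketch below. It assumes that three variates x1, x2 and x3 (hypothetical names) already hold measurements on the same units; the option values are arbitrary choices for illustration.

```
" assumes x1, x2 and x3 are existing variates of equal length (hypothetical) "
" 500 bootstrap samples, each drawing two of the three DATA vectors with "
" replacement; the fixed seed makes the run repeatable "
HBOOTSTRAP  [PRINT=clusters; METHOD=completelink; NDATASAMPLE=2;\
            NTIMES=500; SEED=271828] x1,x2,x3; TEST=3(euclidean)
```

Setting SEED to a non-zero value is the usual way to make the replications reproducible between runs; leaving it at the default continues the existing random sequence.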

The CLUSTERS option can supply a pointer containing a list of clusters whose reliability is to be assessed. This would usually have been obtained previously, from a cluster analysis performed with all the DATA vectors. Alternatively, if CLUSTERS is set to a pointer whose number of values has not been defined, or to an undeclared data structure, it will be defined as a pointer containing each distinct cluster that occurred during the bootstrapping. Each cluster is represented as a variate containing the numbers of the units in that cluster. (The number of a unit is its position in the DATA vectors.)

The REPLICATION option can save a variate containing the number of times each cluster has occurred during the bootstrapping. These replications can be used by the DCLUSTERLABELS procedure to label the clusters on a dendrogram.
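
As a sketch of the second use of CLUSTERS, the call below saves every cluster found during the bootstrapping together with its replication, and then expresses each count as a percentage of the samples. The identifiers x1, x2, x3, allclus, allreps and pctreps are hypothetical; x1, x2 and x3 are assumed to be existing variates of equal length.

```
" allclus is undeclared, so it is defined as a pointer holding each distinct "
" cluster found; allreps holds the corresponding numbers of occurrences "
HBOOTSTRAP  [NTIMES=200; SEED=314159; CLUSTERS=allclus; REPLICATION=allreps]\
            x1,x2,x3; TEST=3(euclidean)
" express each count as a percentage of the 200 bootstrap samples "
CALCULATE   pctreps = 100 * allreps / 200
PRINT       allreps,pctreps
```

The divisor must match the NTIMES setting, because the replications are counts of occurrences over the bootstrap samples.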

The clusters and their replications can be printed by setting option PRINT=clusters. The UNITS option can be set to a text or a variate, to provide textual labels or other numbers to identify the units of the clusters, instead of the unit numbers stored in the CLUSTERS variates. The other PRINT setting, dendrograms, prints the dendrogram of the cluster analysis from each bootstrap sample.

Options: PRINT, METHOD, CLIMIT, UNITS, MINKOWSKI, CLUSTERS, REPLICATION, NDATASAMPLE, NTIMES, SEED.
Parameters: DATA, TEST, RANGE.

Action with `RESTRICT`

The DATA variates and factors must not be restricted.

Example

CAPTION        'HBOOTSTRAP example',\
               !t('Bootstrap assessment of clusters for automobile data',\
               'from UCI Machine Learning Repository',\
               'http://archive.ics.uci.edu/ml/datasets/Automobile');\
               STYLE=meta,plain
SPLOAD         [PRINT=*] '%gendir%/examples/Automobile.gsh'
" select cars with wagon body style "
SUBSET         [body_style.IN.'wagon'] make,\
               fuel_type,aspiration,number_doors,drive_wheels,\
               engine_location,engine_type,number_cylinders,fuel_system,\
               wheel_base,length,width,height,curb_weight,\
               engine_size,bore,stroke,compression_ratio,horsepower,\
               peak_rpm,city_mpg,highway_mpg,price
" form labels for the cars from make and price "
TXCONSTRUCT    [TEXT=car] make,' ',price
" cluster analysis using all the data variables "
FSIMILARITY    [PRINT=*; SIMILARITY=similarity; UNITS=car]\
               fuel_type,aspiration,number_doors,drive_wheels,\
               engine_location,engine_type,number_cylinders,fuel_system,\
               wheel_base,length,width,height,curb_weight,\
               engine_size,bore,stroke,compression_ratio,horsepower,\
               peak_rpm,city_mpg,highway_mpg,price;\
               TEST=8(simplematching),14(euclidean)
HCLUSTER       [METHOD=averagelink] similarity; AMALGAMATIONS=amalg;\
               PERMUTATION=perm
" plot dendrogram "
FRAME          3; XMLOWER=0.2; XMUPPER=0
DDENDROGRAM    [STYLE=average; ORDERING=given; DSIMILARITY=yes] amalg;\
               PERMUTATION=perm; LABELS=car; WINDOW=3
" form the clusters in the dendrogram "
HFCLUSTERS     amalg; CLUSTERS=clusters
" see how often these clusters occur in 100 bootstrap samples of data variables "
HBOOTSTRAP     [PRINT=clusters; METHOD=averagelink; NTIMES=100;\
               SEED=161647; CLUSTERS=clusters; REPLICATION=reps]\
               fuel_type,aspiration,number_doors,drive_wheels,\
               engine_location,engine_type,number_cylinders,fuel_system,\
               wheel_base,length,width,height,curb_weight,\
               engine_size,bore,stroke,compression_ratio,horsepower,\
               peak_rpm,city_mpg,highway_mpg,price;\
               TEST=8(simplematching),14(euclidean)
" plot the numbers of occurrence on the dendrogram "
DCLUSTERLABELS [WINDOW=3] #clusters; LABEL=#reps

Updated on September 13, 2019
