
HBOOTSTRAP procedure

Performs bootstrap analyses to assess the reliability of clusters from hierarchical cluster analysis (R.W. Payne).

Options

PRINT = string token Controls printed output (clusters, dendrograms); default * i.e. none
METHOD = string token Criterion for forming clusters (singlelink, nearestneighbour, completelink, furthestneighbour, averagelink, mediansort, groupaverage); default sing
CLIMIT = scalar Similarity value below which clusters are not recorded; default 0
UNITS = text or variate Names to label the units of the clusters when they are printed; default *
MINKOWSKI = scalar Index t for use with TEST=minkowski
CLUSTERS = pointer Specifies or saves the clusters
REPLICATION = variate Saves the replication of the clusters in the bootstrap samples
NDATASAMPLE = scalar Number of DATA vectors to take in each sample; default takes the same number as supplied by the DATA parameter
NTIMES = scalar Number of times to resample; default 100
SEED = scalar Seed for random number generator; default continue from previous generation or use system clock

Parameters

DATA = variates or factors The characteristics of the units to be clustered
TEST = string tokens Test type, defining how each DATA variate or factor is treated in the calculation of the similarity between each unit (simplematching, jaccard, russellrao, dice, antidice, sneathsokal, rogerstanimoto, cityblock, manhattan, ecological, euclidean, pythagorean, minkowski, divergence, canberra, braycurtis, soergel); default * ignores that variate or factor
RANGE = scalar Range of possible values of each DATA variate or factor; if omitted, the observed range is taken

Description

HBOOTSTRAP uses bootstrapping to assess the reliability of clusters formed in hierarchical cluster analyses. The characteristics of the units to be clustered are described in a list of variates and factors, specified by the DATA parameter. The TEST parameter defines how each one is to be used when calculating similarities, and the RANGE parameter can specify ranges of their values. These operate as in the FSIMILARITY directive, which is used to form the similarity matrix for each cluster analysis. The MINKOWSKI option specifies the index t for the Minkowski type of test.

For each bootstrap sample, a set of vectors is formed by sampling with replacement from the DATA vectors. HBOOTSTRAP does a cluster analysis with those vectors using the HCLUSTER directive, and obtains the clusters that it forms using the HFCLUSTERS procedure. The METHOD option specifies the criterion for forming the clusters (default singlelink). The NDATASAMPLE option specifies the number of vectors to take in each sample; by default this is the same as the number of vectors supplied by DATA. The NTIMES option specifies the number of bootstrap samples (default 100). The SEED option specifies the seed for the random numbers used to select the samples; the default of zero continues an existing sequence of random numbers or, if there is none, initializes the sequence using the system clock. The CLIMIT option can be used to specify a similarity limit, below which any clusters will be excluded.
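
As a minimal sketch of the sampling options, assuming the variates from the Example section have already been loaded, the call below takes 4 of the 5 DATA variates, with replacement, in each of 5 bootstrap samples and prints the dendrogram from each one. The choice of variates and the settings NDATASAMPLE=4 and NTIMES=5 are illustrative assumptions, chosen only to keep the output short.

```
" sketch only: variates are assumed to come from the Example section "
" the fixed SEED makes the same bootstrap samples be drawn on every run "
HBOOTSTRAP     [PRINT=dendrograms; METHOD=averagelink; NDATASAMPLE=4;\
               NTIMES=5; SEED=161647]\
               wheel_base,length,width,height,curb_weight;\
               TEST=5(euclidean)
```

Leaving SEED unset would instead continue the current random-number sequence, so repeated runs would generally give different samples.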

The CLUSTERS option can supply a pointer containing a list of clusters whose reliability is to be assessed. This would usually have been obtained previously, from a cluster analysis performed with all the DATA vectors. Alternatively, if CLUSTERS is set to a pointer whose number of values has not been defined, or to an undeclared data structure, it will be defined as a pointer containing each distinct cluster that has occurred during the bootstrapping. Each cluster is represented as a variate, containing the number of each unit in that cluster. (This number corresponds to the location of that unit in the DATA vectors.)

The REPLICATION option can save a variate containing the number of times each cluster has occurred during the bootstrapping. These replications can be used by the DCLUSTERLABELS procedure to label the clusters on a dendrogram.
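
The second way of using CLUSTERS might be sketched as follows, again assuming the variates from the Example section are loaded. The names bootclusters and bootreps are hypothetical and have not been declared beforehand, so HBOOTSTRAP should define them itself.

```
" sketch only: bootclusters and bootreps are illustrative, undeclared names "
HBOOTSTRAP     [METHOD=averagelink; NTIMES=100; SEED=161647;\
               CLUSTERS=bootclusters; REPLICATION=bootreps]\
               wheel_base,length,width,height,curb_weight;\
               TEST=5(euclidean)
" bootclusters should now hold one variate of unit numbers per distinct cluster "
" bootreps holds the number of bootstrap samples in which each cluster occurred "
PRINT          bootreps
```

Supplying a previously formed pointer instead, as in the Example section, restricts the assessment to the clusters found in the analysis of the full data.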

The clusters and their replications can be printed by setting option PRINT=clusters. The UNITS option can be set to a text or a variate, to provide textual labels or other numbers to use for the units of the clusters, instead of the numbers in the CLUSTERS variates. The other PRINT setting, dendrograms, prints the dendrogram of the cluster analysis from each bootstrap sample.
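
To illustrate the UNITS option, the sketch below repeats the HBOOTSTRAP call from the Example section with UNITS=car added, so that the printed clusters should list the car labels rather than unit numbers. It assumes that the text car and the pointer clusters have already been formed as in that example.

```
" sketch only: car and clusters are assumed to exist, as in the Example section "
HBOOTSTRAP     [PRINT=clusters; METHOD=averagelink; UNITS=car; NTIMES=100;\
               SEED=161647; CLUSTERS=clusters; REPLICATION=reps]\
               fuel_type,aspiration,number_doors,drive_wheels,\
               engine_location,engine_type,number_cylinders,fuel_system,\
               wheel_base,length,width,height,curb_weight,\
               engine_size,bore,stroke,compression_ratio,horsepower,\
               peak_rpm,city_mpg,highway_mpg,price;\
               TEST=8(simplematching),14(euclidean)
```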

Options: PRINT, METHOD, CLIMIT, UNITS, MINKOWSKI, CLUSTERS, REPLICATION, NDATASAMPLE, NTIMES, SEED.
Parameters: DATA, TEST, RANGE.

Action with RESTRICT

The DATA variates and factors must not be restricted.

See also

Directive: HCLUSTER.
Procedures: BOOTSTRAP, DCLUSTERLABELS, HFCLUSTERS, HPCLUSTERS.
Commands for: Multivariate and cluster analysis.

Example

CAPTION        'HBOOTSTRAP example',\
               !t('Bootstrap assessment of clusters for automobile data',\
               'from UCI Machine Learning Repository',\
               'http://archive.ics.uci.edu/ml/datasets/Automobile');\
               STYLE=meta,plain
SPLOAD         [PRINT=*] '%gendir%/examples/Automobile.gsh'
" select cars with wagon body style "
SUBSET         [body_style.IN.'wagon'] make,\
               fuel_type,aspiration,number_doors,drive_wheels,\
               engine_location,engine_type,number_cylinders,fuel_system,\
               wheel_base,length,width,height,curb_weight,\
               engine_size,bore,stroke,compression_ratio,horsepower,\
               peak_rpm,city_mpg,highway_mpg,price
" form labels for the cars from make and price "
TXCONSTRUCT    [TEXT=car] make,' ',price
" cluster analysis using all the data variables "
FSIMILARITY    [PRINT=*; SIMILARITY=similarity; UNITS=car]\
               fuel_type,aspiration,number_doors,drive_wheels,\
               engine_location,engine_type,number_cylinders,fuel_system,\
               wheel_base,length,width,height,curb_weight,\
               engine_size,bore,stroke,compression_ratio,horsepower,\
               peak_rpm,city_mpg,highway_mpg,price;\
               TEST=8(simplematching),14(euclidean)
HCLUSTER       [METHOD=averagelink] similarity; AMALGAMATIONS=amalg;\
               PERMUTATION=perm
" plot dendrogram "
FRAME          3; XMLOWER=0.2; XMUPPER=0
DDENDROGRAM    [STYLE=average; ORDERING=given; DSIMILARITY=yes] amalg;\
               PERMUTATION=perm; LABELS=car; WINDOW=3
" form the clusters in the dendrogram "
HFCLUSTERS     amalg; CLUSTERS=clusters
" see how often these clusters occur in 100 bootstrap samples of data variables "
HBOOTSTRAP     [PRINT=clusters; METHOD=averagelink; NTIMES=100;\
               SEED=161647; CLUSTERS=clusters; REPLICATION=reps]\
               fuel_type,aspiration,number_doors,drive_wheels,\
               engine_location,engine_type,number_cylinders,fuel_system,\
               wheel_base,length,width,height,curb_weight,\
               engine_size,bore,stroke,compression_ratio,horsepower,\
               peak_rpm,city_mpg,highway_mpg,price;\
               TEST=8(simplematching),14(euclidean)
" plot the numbers of occurrence on the dendrogram "
DCLUSTERLABELS [WINDOW=3] #clusters; LABEL=#reps
Updated on September 13, 2019
