Performs bootstrap analyses to assess the reliability of clusters from hierarchical cluster analysis (R.W. Payne).
|PRINT|Controls printed output (
|METHOD|Criterion for forming clusters (
|CLIMIT|Similarity value below which clusters are not recorded; default
|UNITS|Names to label the units of the clusters when they are printed; default
|MINKOWSKI|Index t for use with the Minkowski type of test|
|CLUSTERS|Specifies or saves the clusters|
|REPLICATION|Saves the replication of the clusters in the bootstrap samples|
|NTIMES|Number of times to resample; default 100|
|SEED|Seed for random number generator; default continue from previous generation or use system clock|
|DATA|The characteristics of the units to be clustered|
|TEST|Test type, defining how each DATA variate or factor is used when calculating similarities|
|RANGE|Range of possible values of each DATA variate or factor|
HBOOTSTRAP uses bootstrapping to assess the reliability of clusters formed in hierarchical cluster analyses. The characteristics of the units to be clustered are described in a list of variates and factors, specified by the DATA parameter. The TEST parameter defines how each variate or factor is to be used when calculating similarities, and the RANGE parameter can specify the range of values of each one. These parameters operate as in the FSIMILARITY directive, which is used to form the similarity matrix for each cluster analysis. The MINKOWSKI option specifies the index t for the Minkowski type of test.
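To illustrate how TEST and RANGE settings could enter a similarity calculation, the following Python sketch averages per-variable contributions for two units. It is not Genstat code and not the FSIMILARITY implementation. The function name `similarity`, the Gower-style averaging, and the contribution formulas (1 for a match under simplematching, 1 - ((a - b)/range)^2 under euclidean) are assumptions made for this example.

```python
def similarity(unit_a, unit_b, tests, ranges):
    """Average the per-variable contributions for two units.

    Assumed conventions (not taken from the Genstat source):
    - "simplematching": contribution 1 if the values match, else 0
    - "euclidean": contribution 1 - ((a - b) / range) ** 2
    """
    total = 0.0
    for a, b, test, value_range in zip(unit_a, unit_b, tests, ranges):
        if test == "simplematching":
            total += 1.0 if a == b else 0.0
        elif test == "euclidean":
            total += 1.0 - ((a - b) / value_range) ** 2
        else:
            raise ValueError(f"unsupported test type: {test}")
    return total / len(tests)


# Two hypothetical units described by two factors and one variate.
car_a = ("gas", 4, 100.0)
car_b = ("gas", 2, 110.0)
tests = ("simplematching", "simplematching", "euclidean")
ranges = (None, None, 20.0)  # a range is only needed for the variate

s = similarity(car_a, car_b, tests, ranges)
print(round(s, 4))
```

The range scales each quantitative difference so that every variable contributes a value on a comparable scale, which is why a range is needed for variates but not for factors in this sketch.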
For each bootstrap sample, a set of vectors is formed by sampling with replacement from the DATA vectors. The NDATASAMPLE option specifies the number of vectors to take; by default this is the same as the number of vectors supplied by DATA. The NTIMES option specifies the number of bootstrap samples; the default is 100. The SEED option specifies the seed for the random numbers used to select the sample; the default of zero continues an existing sequence of random numbers or, if there is none, initializes the sequence using the system clock.
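The resampling step can be sketched in Python as follows. Note that it is the DATA vectors (the variables) that are resampled, not the units. The function `bootstrap_columns` and its argument names are hypothetical, chosen to mirror NTIMES, SEED and NDATASAMPLE. Unlike the SEED default described above, `seed=None` here simply lets Python seed the generator itself; it does not continue an earlier sequence.

```python
import random


def bootstrap_columns(n_vectors, n_times=100, seed=None, n_sample=None):
    """Return n_times lists of vector indices drawn with replacement.

    n_sample defaults to n_vectors, so each bootstrap sample has as many
    vectors as the original data.
    """
    rng = random.Random(seed)
    size = n_vectors if n_sample is None else n_sample
    return [
        [rng.randrange(n_vectors) for _ in range(size)]
        for _ in range(n_times)
    ]


# 22 data vectors, 100 bootstrap samples, fixed seed for reproducibility.
samples = bootstrap_columns(22, n_times=100, seed=161647)
print(len(samples), len(samples[0]))
```

Because sampling is with replacement, some vectors appear several times in a sample and others not at all, so each bootstrap cluster analysis weights the characteristics differently.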
HBOOTSTRAP does a cluster analysis with those vectors using the HCLUSTER directive, and obtains the clusters that it forms using the HFCLUSTERS procedure. The CLIMIT option can be used to specify a similarity limit; any clusters formed below this limit are excluded.
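The idea of recording the clusters formed by an agglomerative analysis, subject to a similarity limit, can be sketched in Python. This is an illustration only and does not reproduce HCLUSTER or HFCLUSTERS. The function `single_link_clusters` is hypothetical, single linkage is used purely for brevity, and recording the final cluster of all units is an assumption.

```python
def single_link_clusters(sim, climit=0.0):
    """Merge groups by single linkage and record each cluster formed.

    sim is a symmetric similarity matrix (list of lists). Clusters are
    returned as frozensets of 1-based unit numbers. A cluster is recorded
    only if it was formed at a similarity of at least climit.
    """
    groups = [frozenset([i]) for i in range(len(sim))]
    found = set()
    while len(groups) > 1:
        best = None
        # Find the pair of groups with the highest single-link similarity.
        for p in range(len(groups)):
            for q in range(p + 1, len(groups)):
                s = max(sim[i][j] for i in groups[p] for j in groups[q])
                if best is None or s > best[0]:
                    best = (s, p, q)
        s, p, q = best
        merged = groups[p] | groups[q]
        groups = [g for k, g in enumerate(groups) if k not in (p, q)]
        groups.append(merged)
        if s >= climit:
            found.add(frozenset(unit + 1 for unit in merged))
    return found


sim = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
]
clusters = single_link_clusters(sim, climit=0.5)
print(sorted(sorted(c) for c in clusters))
```

With the limit at 0.5, the final merge of all four units (formed at similarity 0.3) is not recorded, which mirrors the effect described for CLIMIT.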
The CLUSTERS option can supply a pointer containing a list of clusters whose reliability is to be assessed. This would usually have been obtained previously, from a cluster analysis performed with all the DATA vectors. Alternatively, if CLUSTERS is set to a pointer whose number of values has not been defined, or to an undeclared data structure, it will be defined as a pointer containing one copy of every cluster that has occurred during the bootstrapping. Each cluster is represented as a variate, containing the number of each unit in that cluster. (This number corresponds to the location of that unit in the DATA vectors.)
The REPLICATION option can save a variate containing the number of times each cluster has occurred during the bootstrapping. These replications can be used by the DCLUSTERLABELS procedure to label the clusters on a dendrogram.
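The two uses of CLUSTERS and the REPLICATION counts can be illustrated with a small Python sketch. The functions `replication` and `collect_clusters` and the toy data are hypothetical; they only show the bookkeeping, assuming each bootstrap analysis yields a set of clusters identified by unit numbers.

```python
def replication(reference, per_sample):
    """Count how many bootstrap samples contain each reference cluster."""
    return [
        sum(1 for found in per_sample if frozenset(cluster) in found)
        for cluster in reference
    ]


def collect_clusters(per_sample):
    """Return one copy of every cluster seen, as sorted lists of unit numbers."""
    seen = set()
    for found in per_sample:
        seen.update(found)
    return sorted(sorted(cluster) for cluster in seen)


# Clusters found in three hypothetical bootstrap samples.
per_sample = [
    {frozenset({1, 2}), frozenset({3, 4})},
    {frozenset({1, 2}), frozenset({2, 3})},
    {frozenset({1, 2}), frozenset({3, 4})},
]

# Case 1: clusters supplied in advance, e.g. from an analysis of all the data.
reference = [[1, 2], [3, 4]]
reps = replication(reference, per_sample)

# Case 2: no clusters supplied, so every cluster that occurred is collected.
all_clusters = collect_clusters(per_sample)
all_reps = replication(all_clusters, per_sample)
print(reps, all_clusters, all_reps)
```

A cluster that recurs in most bootstrap samples, such as units 1 and 2 here, is better supported by the data than one that appears only occasionally.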
The clusters and their replications can be printed by setting option PRINT=clusters. The UNITS option can be set to a text or a variate, to provide textual labels or other numbers to use for the units of the clusters, instead of the numbers in the CLUSTERS variates.
The DATA variates and factors must not be restricted.
CAPTION 'HBOOTSTRAP example',\
  !t('Random classification forest for automobile data',\
  'from UCI Machine Learning Repository',\
  'http://archive.ics.uci.edu/ml/datasets/Automobile');\
  STYLE=meta,plain
SPLOAD [PRINT=*] '%gendir%/examples/Automobile.gsh'
" select cars with wagon body style "
SUBSET [body_style.IN.'wagon'] make,\
  fuel_type,aspiration,number_doors,drive_wheels,\
  engine_location,engine_type,number_cylinders,fuel_system,\
  wheel_base,length,width,height,curb_weight,\
  engine_size,bore,stroke,compression_ratio,horsepower,\
  peak_rpm,city_mpg,highway_mpg,price
" form labels for the cars from make and price "
TXCONSTRUCT [TEXT=car] make,' ',price
" cluster analysis using all the data variables "
FSIMILARITY [PRINT=*; SIMILARITY=similarity; UNITS=car]\
  fuel_type,aspiration,number_doors,drive_wheels,\
  engine_location,engine_type,number_cylinders,fuel_system,\
  wheel_base,length,width,height,curb_weight,\
  engine_size,bore,stroke,compression_ratio,horsepower,\
  peak_rpm,city_mpg,highway_mpg,price;\
  TEST=8(simplematching),14(euclidean)
HCLUSTER [METHOD=averagelink] similarity; AMALGAMATIONS=amalg;\
  PERMUTATION=perm
" plot dendrogram "
FRAME 3; XMLOWER=0.2; XMUPPER=0
DDENDROGRAM [STYLE=average; ORDERING=given; DSIMILARITY=yes] amalg;\
  PERMUTATION=perm; LABELS=car; WINDOW=3
" form the clusters in the dendrogram "
HFCLUSTERS amalg; CLUSTERS=clusters
" see how often these clusters occur in 100 bootstrap samples of data variables "
HBOOTSTRAP [PRINT=clusters; METHOD=averagelink; NTIMES=100;\
  SEED=161647; CLUSTERS=clusters; REPLICATION=reps]\
  fuel_type,aspiration,number_doors,drive_wheels,\
  engine_location,engine_type,number_cylinders,fuel_system,\
  wheel_base,length,width,height,curb_weight,\
  engine_size,bore,stroke,compression_ratio,horsepower,\
  peak_rpm,city_mpg,highway_mpg,price;\
  TEST=8(simplematching),14(euclidean)
" plot the numbers of occurrence on the dendrogram "
DCLUSTERLABELS [WINDOW=3] #clusters; LABEL=#reps