Constructs a random classification forest (R.W. Payne).
Options

PRINT = controls printed output (outofbag, importance, monitoring)
NTREES = number of trees in the forest; no default – must be specified
NXTRY = number of x-variables to select at random to use in the construction of each tree; default is the square root of the number of variables
NUNITSTRY = number of units to select at random (with replacement) to use in the construction of each tree; default is two thirds of the number of units
METHOD = selection criterion to use when constructing the trees (gini, mpi); default gini
GROUPS = groupings of the individuals to identify in the trees
NSTOP = number of individuals in a group at which to stop selecting tests; default 5
ANTIENDCUTFACTOR = whether to use an adaptive anti-end-cut factor (yes, no); default no
SEED = seed for the random numbers used to select the x-variables and units; default 0
OWNBSELECT = indicates whether or not your own version of the BSELECT procedure is to be used (yes, no); default no
OUTOFBAGERROR = saves the “out-of-bag” error rate
CONFUSION = saves the confusion matrix
SAVE = saves details of the forest that has been constructed

Parameters

X = x-variables (factors or variates) available for constructing the trees
ORDERED = whether factor levels are ordered (yes, no)
IMPORTANCE = saves the importance of each x-variable
The data for constructing a random classification forest are a sample of individuals from several groups. The characteristics of the individuals are described in Genstat by a set of factors or variates, which are specified by the X parameter of BCFOREST. The GROUPS option of BCFOREST defines the group to which each individual in the sample belongs, and the aim is to be able to identify the groups to which new individuals belong.
A random classification forest is a set of classification trees that are used collectively to identify the group to which an individual specimen belongs (see e.g. Breiman 2001). The identification is obtained by running a new individual through each tree to obtain that tree’s “vote” for the group of the individual. The identification is then taken as the group with most votes.
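The voting step described above can be sketched in Python (an illustration of the idea only, with hypothetical stand-in tree functions, not Genstat's internal representation):

```python
from collections import Counter

def forest_identify(trees, individual):
    """Identify an individual's group by majority vote over the trees.

    `trees` is a list of functions, each mapping an individual's
    x-values to that tree's "vote" for a group.
    """
    votes = Counter(tree(individual) for tree in trees)
    # The identification is the group with the most votes.
    group, _ = votes.most_common(1)[0]
    return group

# Three toy "trees" voting on a single numeric measurement:
trees = [
    lambda x: 'A' if x < 5 else 'B',
    lambda x: 'A' if x < 7 else 'B',
    lambda x: 'B',
]
group = forest_identify(trees, 4)  # two of the three trees vote 'A'
```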
Each classification tree is formed using a random sample of the
X variables in the data set, and a bootstrap random sample of their units (i.e. sampled with replacement). The
NXTRY option defines how many
X variables to select, and the
NUNITSTRY option defines how many units to take. The default for
NXTRY is the square root of the number of variables, and the default for
NUNITSTRY is two thirds of the number of units. The
SEED option specifies a seed for the random numbers that are used to select the variables and to select the units. The default of zero continues an existing sequence of random numbers, if any of the random functions (
GRSELECT etc) has already been used in the current Genstat run. Otherwise a seed is chosen at random.
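The two levels of randomization can be illustrated in Python (a sketch of the sampling scheme under the defaults stated above; the function and variable names are hypothetical, and Genstat's own random-number generator will of course give different draws):

```python
import math
import random

def draw_tree_sample(n_units, x_names, seed, nxtry=None, nunitstry=None):
    """Draw the random ingredients for one tree: a subset of the
    x-variables (without replacement) and a bootstrap sample of the
    units (with replacement).

    Defaults mirror the text: nxtry defaults to the square root of
    the number of variables, nunitstry to two thirds of the units.
    """
    rng = random.Random(seed)  # a fixed seed makes the draws reproducible
    if nxtry is None:
        nxtry = round(math.sqrt(len(x_names)))
    if nunitstry is None:
        nunitstry = round(2 * n_units / 3)
    xs = rng.sample(x_names, nxtry)
    units = [rng.randrange(n_units) for _ in range(nunitstry)]
    return xs, units

# 30 units described by four x-variables:
xs, units = draw_tree_sample(30, ['height', 'weight', 'age', 'dose'], seed=197883)
```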
A classification tree progressively splits the individuals into subsets based on their values for the factors or variates. Construction starts at a node known as the root, which contains all of the individuals. A factor or variate is chosen to use there that “best” divides the individuals into two subsets. Suppose the available
X vectors are all factors with two levels: the first subset will then contain the individuals with level 1 of the factor, and the second will contain those with level 2. Any individual with a missing value for the factor is put into both subsets, so you can use a missing value to denote observations that are variable or unknown. Factors may have either ordered or unordered levels, according to whether the corresponding value of the ORDERED parameter is set to yes or no. For example, a factor called Dose with levels 1, 1.5, 2 and 2.5 would usually be treated as having ordered levels, whereas levels with labels such as 'Pethidine' of a factor called Drug would be regarded as unordered. For unordered factors, all possible ways of dividing the levels into two sets are tried. With variates, or ordered factors with more than two levels, a suitable value p is found to partition the individuals into those with values less than p and those with values greater than p. The tree is then extended to contain two new nodes, one for each of the subsets, and factors or variates are selected for use at each of these nodes to subdivide the subsets further.
The effectiveness of the factor or variate chosen for each node depends on how the groups are split between the resulting subsets – the aim is to form subsets that are each composed of individuals from the same group. By default, this is assessed using Gini information (see Breiman et al. 1984, Chapter 4), but you can set option METHOD=mpi to use the mean posterior improvement criterion devised by Taylor & Silverman (1993). The ANTIENDCUTFACTOR option allows you to request Taylor & Silverman's adaptive anti-end-cut factors (by default these are not used). The process stops when no factor or variate provides any additional information, when the subset contains individuals all from the same group, or when the subset contains fewer individuals than the limit specified by the NSTOP option (default 5). The nodes where construction ends are known as terminal nodes.
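The Gini criterion can be sketched in Python (a minimal illustration of the impurity calculation, not Genstat's implementation; the function names are hypothetical):

```python
from collections import Counter

def gini(groups):
    """Gini impurity of a set of group labels: 1 minus the sum of
    squared group proportions. Zero when all individuals share one group."""
    n = len(groups)
    return 1.0 - sum((c / n) ** 2 for c in Counter(groups).values())

def split_gain(parent, left, right):
    """Decrease in Gini impurity achieved by splitting `parent`
    into the subsets `left` and `right` (weighted by subset size)."""
    n = len(parent)
    return (gini(parent)
            - (len(left) / n) * gini(left)
            - (len(right) / n) * gini(right))

# A split that perfectly separates the two groups gains the full
# impurity of the parent node:
parent = ['A', 'A', 'B', 'B']
gain = split_gain(parent, ['A', 'A'], ['B', 'B'])
```

At each node, the candidate split with the largest gain is the one selected.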
The resulting forest (and its associated information) can be saved using the
SAVE option. This can then be used in the
BCFDISPLAY procedure to produce further output, or in the
BCFIDENTIFY procedure to identify the groups for new values of the x-variables.
The OUTOFBAGERROR option can save the “out-of-bag” error rate. This is calculated using the individuals that were not involved in the construction of each tree, and so gives an independent measure of the reliability of the forest. The idea is to put each individual through all of the trees where it was not used, and accumulate its votes for each of the groups. The individual is then identified by taking the group where it had most votes, and the error rate is calculated by comparing the identifications of the individuals with their true groups (as defined by the GROUPS option).
The CONFUSION option can save the confusion matrix. This is a groups-by-groups matrix that can be calculated at the same time as the out-of-bag error. The rows represent the true groups, and the columns represent the out-of-bag identifications obtained using the forest. The diagonal of the matrix records the number of individuals correctly identified in each group, while the off-diagonal elements show the numbers that have been identified incorrectly (i.e. that have been “confused” with other groups).
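The out-of-bag calculation and the confusion matrix can be sketched together in Python (an illustration of the bookkeeping under the description above; the data structures and names are hypothetical, not Genstat's):

```python
from collections import Counter

def oob_error_and_confusion(trees, in_bag, X, y, groups):
    """Out-of-bag error rate and confusion matrix for a forest.

    trees   -- list of functions, each mapping x-values to a group vote
    in_bag  -- in_bag[t] is the set of unit indices used to build tree t
    X, y    -- the units' x-values and their true groups
    groups  -- the possible group labels
    """
    confusion = {g: {h: 0 for h in groups} for g in groups}
    errors = 0
    for i, (x, true_group) in enumerate(zip(X, y)):
        # Vote only with the trees that did not use unit i.
        votes = Counter(tree(x) for t, tree in enumerate(trees)
                        if i not in in_bag[t])
        predicted = votes.most_common(1)[0][0]
        confusion[true_group][predicted] += 1
        errors += predicted != true_group
    return errors / len(X), confusion

# A toy forest of three identical stump "trees"; in_bag[t] records
# which units were used to build tree t:
trees = [lambda x: 'A' if x < 5 else 'B'] * 3
in_bag = [{0}, {1}, {2}]
rate, confusion = oob_error_and_confusion(trees, in_bag,
                                          X=[1, 2, 8, 9],
                                          y=['A', 'A', 'B', 'B'],
                                          groups=['A', 'B'])
```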
The IMPORTANCE parameter can save a variate giving the “importance” of each
X variate or factor in the forest. This is calculated by accumulating the sum of the values of the selection function (see
METHOD) over the times when the
X variable is used in the forest.
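The accumulation can be sketched in Python (a minimal illustration of summing criterion values per variable; the names and the pair representation are hypothetical):

```python
from collections import defaultdict

def accumulate_importance(splits):
    """Importance of each x-variable: the sum of the selection-criterion
    values over every split in the forest where that variable was used.

    `splits` is a list of (variable_name, criterion_value) pairs,
    one per split in any tree of the forest.
    """
    importance = defaultdict(float)
    for name, value in splits:
        importance[name] += value
    return dict(importance)

# 'height' is used at two splits, so its criterion values accumulate:
importance = accumulate_importance([('height', 0.5), ('dose', 0.2),
                                    ('height', 0.1)])
```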
Printed output is controlled by the PRINT option, with settings to print the out-of-bag error rate, the importance ratings of the x-variables, and monitoring information during the construction process.
BCFOREST calls procedure BCONSTRUCT to form each tree. This uses a special-purpose procedure, BSELECT, which is customized specifically to select splits for use in classification trees. You can use your own method of selection by providing your own BSELECT and setting option OWNBSELECT=yes. In the standard version of BSELECT, the BASSESS directive is used to assess the potential splits.
Restrictions on the
X vectors or
GROUPS factor are ignored.
Breiman, L., Friedman, J.H., Olshen, R.A. & Stone, C.J. (1984). Classification and Regression Trees. Wadsworth, Monterey.
Breiman, L. (2001). Random forests. Machine Learning, 45, 5-32.
Taylor, P.C. & Silverman, B.W. (1993). Block diagrams and splitting criteria for classification trees. Statistics & Computing, 3, 147-161.
Commands for: Multivariate and cluster analysis.
CAPTION 'BCFOREST example',\
        !t('Random classification forest for automobile data',\
        'from UCI Machine Learning Repository',\
        'http://archive.ics.uci.edu/ml/datasets/Automobile');\
        STYLE=meta,plain
SPLOAD FILE='%gendir%/examples/Automobile.gsh'
BCFOREST [GROUPS=symboling; NTREES=8; NXTRY=10; NUNITSTRY=75; SEED=197883]\
         normalized_losses,make,fuel_type,aspiration,number_doors,\
         body_style,drive_wheels,engine_location,wheel_base,\
         length,width,height,curb_weight,engine_type,number_cylinders,\
         engine_size,fuel_system,bore,stroke,compression_ratio,\
         horsepower,peak_rpm,city_mpg,highway_mpg,price