Constructs a random classification forest (R.W. Payne).

### Options

`PRINT` = string tokens | Controls printed output (`outofbagerror`, `confusion`, `importance`, `orderedimportance`, `monitoring`); default `outo`, `conf`, `impo` |
---|---|
`NTREES` = scalar | Number of trees in the forest; no default – must be specified |
`NXTRY` = scalar | Number of `X` variables to select at random at each node from which to choose the `X` variable to use there; default is the square root of the number of `X` variables |
`NUNITSTRY` = scalar | Number of units of the `X` variables to select at random to use in the construction of each tree; default is two thirds of the number of units |
`METHOD` = string token | Selection criterion to use when constructing the trees (`Gini`, `MPI`); default `Gini` |
`GROUPS` = factor | Groupings of the individuals to identify in the trees |
`NSTOP` = scalar | Number of individuals in a group at which to stop selecting tests; default 5 |
`ANTIENDCUTFACTOR` = string token | Adaptive anti-end-cut factor to use (`classnumber`, `reciprocalentropy`); default `*` i.e. none |
`SEED` = scalar | Seed for random numbers to select the `NXTRY` `X` variables and `NUNITSTRY` units; default 0 |
`OWNBSELECT` = string token | Indicates whether or not your own version of the `BSELECT` procedure is to be used, as explained in the Method section (`yes`, `no`); default `no` |
`OUTOFBAGERROR` = scalar | Saves the “out-of-bag” error rate |
`CONFUSION` = matrix | Saves the confusion matrix |
`SAVE` = pointer | Saves details of the forest that has been constructed |

### Parameters

`X` = factors or variates | X-variables available for constructing the tree |
---|---|
`ORDERED` = string tokens | Whether factor levels are ordered (`yes`, `no`); default `no` |
`IMPORTANCE` = scalars | Saves the importance of each x-variable |

### Description

The data to construct a random classification forest is a sample of individuals from several groups. The characteristics of the individuals are described in Genstat by a set of factors or variates which are specified by the `X` parameter of `BCFOREST`. The `GROUPS` option of `BCFOREST` defines the group to which each individual in the sample belongs, and the aim is to be able to identify the groups to which new individuals belong.

A random classification forest is a set of classification trees that are used collectively to identify the group to which an individual specimen belongs (see e.g. Breiman 2001). The identification is obtained by running a new individual through each tree to obtain that tree’s “vote” for the group of the individual. The identification is then taken as the group with most votes.
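The voting step can be illustrated with a minimal Python sketch. This is a conceptual illustration only, not Genstat code: the function name `forest_vote` is hypothetical, and the tie-breaking rule (the group seen first wins) is an assumption, since the text above does not say how ties are resolved.

```python
from collections import Counter

def forest_vote(tree_votes):
    """Return the group with the most votes across the trees.

    Assumption: ties go to the group that was voted for first.
    """
    counts = Counter(tree_votes)
    return counts.most_common(1)[0][0]

# Votes from six hypothetical trees for one new individual.
votes = ["A", "B", "A", "C", "A", "B"]
print(forest_vote(votes))  # A
```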

Each classification tree is formed using a random sample of the `X` variables in the data set, and a bootstrap random sample of their units (i.e. sampled with replacement). The `NXTRY` option defines how many `X` variables to select, and the `NUNITSTRY` option defines how many units to take. The default for `NXTRY` is the square root of the number of variables, and the default for `NUNITSTRY` is two thirds of the number of units. The `SEED` option specifies a seed for the random numbers that are used to select the variables and to select the units. The default of zero continues an existing sequence of random numbers, if any of the random functions (`GRSELECT` etc.) has already been used in the current Genstat run. Otherwise a seed is chosen at random.
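As a rough illustration of the sampling described above, the following Python sketch draws a bootstrap sample of units and a random subset of `X` variables. It is not Genstat code. The function name `draw_tree_sample` is hypothetical, rounding the defaults to the nearest integer is an assumption, and Python's random number generator will not reproduce the sequences that Genstat's `SEED` option gives.

```python
import math
import random

def draw_tree_sample(n_units, n_xvars, nunitstry=None, nxtry=None, seed=None):
    """Select units (with replacement) and X variables (without replacement)."""
    rng = random.Random(seed)
    if nxtry is None:
        # Default described in the text: square root of the number of X variables
        # (rounding to the nearest integer is an assumption).
        nxtry = max(1, round(math.sqrt(n_xvars)))
    if nunitstry is None:
        # Default described in the text: two thirds of the number of units.
        nunitstry = max(1, round(2 * n_units / 3))
    units = [rng.randrange(n_units) for _ in range(nunitstry)]  # bootstrap sample
    xvars = rng.sample(range(n_xvars), nxtry)                   # distinct variables
    return units, xvars

units, xvars = draw_tree_sample(n_units=90, n_xvars=25, seed=197883)
print(len(units), len(xvars))  # 60 5
```

Because the units are drawn with replacement, some units appear more than once in a tree's sample and others not at all; the latter are the units available for the out-of-bag calculations.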

A classification tree progressively splits the individuals into subsets based on their values for the factors or variates. Construction starts at a node known as the *root*, which contains all of the individuals. A factor or variate is chosen to use there that “best” divides the individuals into two subsets. Suppose the available `X` vectors are all factors with two levels: the first subset will then contain the individuals with level 1 of the factor, and the second will contain those with level 2. Also any individual with a missing value for the factor is put into both subsets; so you can use a missing value to denote either variable or unknown observations. Factors may have either ordered or unordered levels, according to whether the corresponding value of the `ORDERED` parameter is set to `yes` or `no`. For example, a factor called `Dose` with levels 1, 1.5, 2 and 2.5 would usually be treated as having ordered levels, whereas levels labelled `'Morphine'`, `'Amidone'`, `'Phenadoxone'` and `'Pethidine'` of a factor called `Drug` would be regarded as unordered. For unordered factors, all possible ways of dividing the levels into two sets are tried. With variates or ordered factors with more than 2 levels, a suitable value *p* is found to partition the individuals into those with values less than or greater than *p*. The tree is then extended to contain two new nodes, one for each of the subsets, and factors or variates are selected for use at each of these nodes to subdivide the subsets further.
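To show how many candidate splits this implies, here is a small Python sketch (not Genstat code) that enumerates the two-set divisions of an unordered factor and the candidate values *p* for an ordered factor or variate. The function names are hypothetical, and taking *p* midway between adjacent distinct values is an assumption; the text only says that a suitable *p* is found.

```python
from itertools import combinations

def unordered_splits(levels):
    """All ways to divide the levels into two non-empty sets, each listed once."""
    levels = list(levels)
    first, rest = levels[0], levels[1:]
    splits = []
    # Keep the first level in the left-hand set so mirror images are not repeated.
    for r in range(len(rest) + 1):
        for extra in combinations(rest, r):
            left = {first, *extra}
            right = set(levels) - left
            if right:
                splits.append((left, right))
    return splits

def ordered_cutpoints(values):
    """Candidate values p, assumed midway between adjacent distinct values."""
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]

drug = ["Morphine", "Amidone", "Phenadoxone", "Pethidine"]
print(len(unordered_splits(drug)))          # 7
print(ordered_cutpoints([1, 1.5, 2, 2.5]))  # [1.25, 1.75, 2.25]
```

An unordered factor with *k* levels has 2^(k-1) - 1 possible divisions, whereas an ordered factor with *k* levels has only *k* - 1, so declaring a factor as ordered reduces the search considerably.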

The effectiveness of the factor or variate to be chosen for each node depends on how the groups are split between the resulting subsets – the aim is to form subsets that are each composed of individuals from the same group. By default, this is assessed using Gini information (see Breiman *et al.* 1984, Chapter 4) but you can set option `METHOD=mpi` to use the mean posterior improvement criterion devised by Taylor & Silverman (1993). The `ANTIENDCUTFACTOR` option allows you to request Taylor & Silverman’s adaptive anti-end-cut factors (by default these are not used). The process stops when either no factor or variate provides any additional information, or the subset contains individuals all from the same group, or the subset contains fewer individuals than a limit specified by the `NSTOP` option (default 5). These nodes where the construction ends are known as *terminal nodes*.
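For readers unfamiliar with the Gini criterion, the sketch below computes the usual textbook Gini impurity and the reduction in impurity achieved by a split, in the spirit of Breiman *et al.* (1984). It is written in Python for illustration. Whether Genstat's selection function is scaled or weighted in exactly this way is an assumption, and the sketch ignores missing values (which the procedure places in both subsets).

```python
from collections import Counter

def gini(groups):
    """Gini impurity: 1 minus the sum of squared group proportions."""
    n = len(groups)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(groups).values())

def gini_improvement(parent, left, right):
    """Impurity of the parent minus the size-weighted impurity of the subsets."""
    n = len(parent)
    return (gini(parent)
            - (len(left) / n) * gini(left)
            - (len(right) / n) * gini(right))

parent = ["a", "a", "b", "b"]
print(gini(parent))                                        # 0.5
print(gini_improvement(parent, ["a", "a"], ["b", "b"]))    # 0.5
print(gini_improvement(parent, ["a", "b"], ["a", "b"]))    # 0.0
```

A split that separates the groups perfectly removes all of the impurity, while a split that leaves the group proportions unchanged in both subsets gives no improvement.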

The resulting forest (and its associated information) can be saved using the `SAVE` option. This can then be used in the `BCFDISPLAY` procedure to produce further output, or in the `BCFIDENTIFY` procedure to identify the groups for new values of the x-variables.

The `OUTOFBAGERROR` option can save the “out-of-bag” error rate. This is calculated using the individuals that were not involved in the construction of each tree. So, it gives an independent measure of the reliability of the forest. The idea is to put each individual through all of the trees where it was not used, and accumulate its votes for each of the groups. The individual is then identified by taking the group where it had most votes, and the error rate is calculated by comparing the identifications of the individuals with their true group (as defined by the `GROUPS` factor).
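The out-of-bag calculation can be sketched in Python as follows. This is a conceptual illustration, not Genstat's implementation. The function name and the data layout (`tree_predictions`, `in_bag`) are hypothetical, and skipping individuals that were used in every tree is an assumption.

```python
from collections import Counter

def out_of_bag_error(true_groups, tree_predictions, in_bag):
    """Out-of-bag error rate.

    true_groups[i]         true group of individual i
    tree_predictions[t][i] group that tree t assigns to individual i
    in_bag[t]              set of individuals used to construct tree t
    """
    errors = 0
    counted = 0
    for i, truth in enumerate(true_groups):
        # Only trees that did not use individual i are allowed to vote.
        votes = Counter(pred[i]
                        for pred, bag in zip(tree_predictions, in_bag)
                        if i not in bag)
        if not votes:
            continue  # assumption: skip individuals used in every tree
        counted += 1
        if votes.most_common(1)[0][0] != truth:
            errors += 1
    return errors / counted if counted else float("nan")

true_groups = ["A", "B", "A"]
tree_predictions = [["A", "A", "B"], ["A", "B", "A"], ["B", "B", "A"]]
in_bag = [{0, 1}, {1, 2}, {0, 2}]
print(out_of_bag_error(true_groups, tree_predictions, in_bag))
```

In this toy example each individual is out-of-bag for exactly one tree, and one of the three is misidentified, so the error rate is one third.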

The `CONFUSION` option can save the confusion matrix. This is a groups-by-groups matrix that can be calculated at the same time as the out-of-bag error. The rows represent the true groups, and the columns represent the out-of-bag identifications obtained using the forest. The diagonal of the matrix records the number of individuals correctly identified in each group, while the off-diagonal elements show the numbers that have been identified incorrectly (i.e. that have been “confused” with other groups).
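A minimal Python sketch of how such a matrix is tallied is shown below, with rows for the true groups and columns for the identifications. The function name and argument layout are hypothetical, and this is an illustration rather than Genstat code.

```python
def confusion_matrix(true_groups, identified, labels):
    """Rows are true groups, columns are identified groups, in the order of labels."""
    index = {group: k for k, group in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for truth, ident in zip(true_groups, identified):
        matrix[index[truth]][index[ident]] += 1
    return matrix

true_groups = ["A", "A", "B", "B", "B"]
identified = ["A", "B", "B", "B", "A"]
for row in confusion_matrix(true_groups, identified, labels=["A", "B"]):
    print(row)
# [1, 1]
# [1, 2]
```

The error rate is the sum of the off-diagonal counts divided by the total, here 2/5.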

The `IMPORTANCE` parameter can save a variate giving the “importance” of each `X` variate or factor in the forest. This is calculated by accumulating the sum of the values of the selection function (see `METHOD`) over the times when the `X` variable is used in the forest.
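The accumulation can be pictured with the following Python sketch, which sums a criterion value for each `X` variable over every node where it is used and then lists the variables in decreasing order. The function name, the input layout and the numbers are all hypothetical and are not output from Genstat.

```python
from collections import defaultdict

def accumulate_importance(splits_used):
    """Sum the selection-function values for each X variable.

    splits_used: iterable of (x_name, criterion_value) pairs, one for each
    node of each tree in the forest where a split was made.
    """
    totals = defaultdict(float)
    for name, value in splits_used:
        totals[name] += value
    return dict(totals)

# Hypothetical values for splits made in a small forest.
splits = [("price", 0.25), ("make", 0.125), ("price", 0.25), ("make", 0.125)]
importance = accumulate_importance(splits)
for name, total in sorted(importance.items(), key=lambda item: -item[1]):
    print(name, total)
# price 0.5
# make 0.25
```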

Printed output is controlled by the `PRINT` option, with settings:

`outofbagerror` | out-of-bag error rate, |
---|---|
`confusion` | confusion matrix, |
`importance` | importance ratings of the `X` variates and factors, |
`orderedimportance` | importance ratings of the `X` variates and factors in decreasing order, and |
`monitoring` | monitoring information during the construction process. |

The default is `PRINT=outofbagerror,confusion,importance`.

Options: `PRINT`, `NTREES`, `NXTRY`, `NUNITSTRY`, `METHOD`, `GROUPS`, `NSTOP`, `ANTIENDCUTFACTOR`, `SEED`, `OWNBSELECT`, `OUTOFBAGERROR`, `CONFUSION`, `SAVE`.

Parameters: `X`, `ORDERED`, `IMPORTANCE`.

### Method

`BCFOREST` calls procedure `BCONSTRUCT` to form the tree. This uses a special-purpose procedure `BSELECT`, which is customized specifically to select splits for use in classification trees. You can use your own method of selection by providing your own `BSELECT` and setting option `OWNBSELECT=yes`. In the standard version of `BSELECT`, the `BASSESS` directive is used to assess the potential splits.

### Action with `RESTRICT`

Restrictions on the `X` vectors or `GROUPS` factor are ignored.

### References

Breiman, L., Friedman, J.H., Olshen, R.A. & Stone, C.J. (1984). *Classification and Regression Trees*. Wadsworth, Monterey.

Breiman, L. (2001). Random forests. *Machine Learning*, 45, 5-32.

Taylor, P.C. & Silverman, B.W. (1993). Block diagrams and splitting criteria for classification trees. *Statistics & Computing*, 3, 147-161.

### See also

Procedures: `BCFDISPLAY`, `BCFIDENTIFY`, `BCLASSIFICATION`, `BKEY`, `BREGRESSION`, `KNEARESTNEIGHBOURS`.

Commands for: Multivariate and cluster analysis.

### Example

    CAPTION 'BCFOREST example',\
            !t('Random classification forest for automobile data',\
            'from UCI Machine Learning Repository',\
            'http://archive.ics.uci.edu/ml/datasets/Automobile');\
            STYLE=meta,plain
    SPLOAD FILE='%gendir%/examples/Automobile.gsh'
    BCFOREST [GROUPS=symboling; NTREES=8; NXTRY=10; NUNITSTRY=75; SEED=197883]\
            normalized_losses,make,fuel_type,aspiration,number_doors,\
            body_style,drive_wheels,engine_location,wheel_base,\
            length,width,height,curb_weight,engine_type,number_cylinders,\
            engine_size,fuel_system,bore,stroke,compression_ratio,\
            horsepower,peak_rpm,city_mpg,highway_mpg,price