Constructs a random regression forest (R.W. Payne).
|Controls printed output (
||Response variate for the regression|
||Number of trees in the forest; no default – must be specified|
||Number of units of the X variables to select at random to use in the construction of each tree; default is two thirds of the number of units|
||Limit on the mean square of the observations at a node at which to stop making splits; default 0|
||Specifies the number of observations at a node at which to stop making splits; default 1|
||Seed for random numbers to select the
||Indicates whether or not your own version of the
||Saves the “out-of-bag” error rate|
||Saves the “out-of-bag” estimates of
||Saves details of the forest that has been constructed|
||X-variables available for constructing the tree|
||Whether factor levels are ordered (
||Saves the importance of each x-variable|
A regression tree is a mechanism for predicting a response variable from a set of independent variables (see Chapter 8 of Breiman et al. 1984). A random regression forest is a set of regression trees that are used collectively to form the prediction, by averaging the predictions from the individual trees (see e.g. Breiman 2001). The number of trees in the forest is specified by the NTREES option. Constructing a large forest can be time-consuming, so it may be best to investigate first with a relatively small number of trees (e.g. 10).
The trees are constructed using data on a set of observations. Their values for the response variable are specified (in a variate) using the Y option, and their values for the independent variables are specified (in a list of variates or factors) using the X parameter. Factors may have either ordered or unordered levels, according to whether the corresponding value of the ORDERED parameter is set to yes or no. For example, a factor called Dose with levels 1, 1.5, 2 and 2.5 would usually be treated as having ordered levels, whereas a factor called Drug, with levels labelled by drug names such as 'Pethidine', would be regarded as unordered.
Each regression tree is formed using a random sample of the X variables in the data set, and a bootstrap random sample of their units (i.e. sampled with replacement). The NXTRY option defines how many X variables to select, and the NUNITSTRY option defines how many units to take. The default for NXTRY is the square root of the number of variables, and the default for NUNITSTRY is two thirds of the number of units. The SEED option specifies a seed for the random numbers that are used to select the variables and to select the units. The default of zero continues an existing sequence of random numbers, if any of the random functions (GRSELECT etc.) has already been used in the current Genstat run. Otherwise, a seed is chosen at random.
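The sampling step described above can be sketched in Python. This is only an illustration, not the Genstat implementation: the function name draw_tree_sample and its arguments are hypothetical, the rounding of the defaults is an assumption, and so is the choice to draw the x-variables without replacement.

```python
import math
import random


def draw_tree_sample(n_units, x_names, nxtry=None, nunitstry=None, seed=None):
    """Pick the x-variables and the bootstrap units for one tree (illustrative)."""
    rng = random.Random(seed)
    # Defaults as described in the text; the rounding rule is an assumption.
    if nxtry is None:
        nxtry = max(1, round(math.sqrt(len(x_names))))
    if nunitstry is None:
        nunitstry = max(1, round(2 * n_units / 3))
    # Assumption: x-variables are drawn without replacement.
    chosen_x = rng.sample(x_names, nxtry)
    # Units form a bootstrap sample, i.e. they are drawn with replacement.
    units = [rng.randrange(n_units) for _ in range(nunitstry)]
    # Units never drawn are "out of bag" for this tree.
    out_of_bag = sorted(set(range(n_units)) - set(units))
    return chosen_x, units, out_of_bag


chosen_x, units, out_of_bag = draw_tree_sample(
    17, ["Employ", "Opdays", "Product", "Temp"], seed=1
)
```

With four x-variables and 17 units, the defaults in this sketch select 2 variables and 11 units for each tree.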
The construction process splits the observations into subsets. With an x-variate or a factor with ordered levels, the subsets are formed by taking the observations with values less than or greater than some split point p. For a factor with unordered levels, all possible ways of dividing its levels into two subsets are tried. The aim is to form subsets that have similar values for the response variate. The predicted value of the response variable for each node of the tree is the mean of its values for the subset of observations at that node. The accuracy of the node is the sum of the squared distances of the values of the response variate from their mean for the observations at the node, divided by the total number of observations. The potential splits at the node are assessed by their effect on the accuracy, that is, the difference between the accuracy of the node and the sum of the accuracies of the two potential successor nodes. The node will become a terminal node if none of the splits provides any improvement in accuracy, or if the mean square of the observations at the node is less than or equal to a limit specified by the MSLIMIT option (default 0), or if the number of observations at the node is less than or equal to the number specified by the NSTOP option (default 1).
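As an illustration of the accuracy criterion, the following Python sketch evaluates the splits of a single node on one x-variate. It is a sketch under stated assumptions rather than the code used by BCONSTRUCT or BSELECT: the function names are hypothetical, and taking candidate split points midway between successive distinct x-values is an assumption.

```python
def node_accuracy(y_node, n_total):
    """Sum of squared deviations from the node mean, divided by the
    total number of observations (not the number at the node)."""
    if not y_node:
        return 0.0
    mean = sum(y_node) / len(y_node)
    return sum((v - mean) ** 2 for v in y_node) / n_total


def best_split(x, y, n_total):
    """Return (improvement, split_point) for the best split on x,
    or None if no split improves the accuracy."""
    parent = node_accuracy(y, n_total)
    best = None
    values = sorted(set(x))
    # Assumption: candidate split points lie midway between distinct values.
    for lo, hi in zip(values, values[1:]):
        p = (lo + hi) / 2
        left = [yi for xi, yi in zip(x, y) if xi < p]
        right = [yi for xi, yi in zip(x, y) if xi > p]
        children = node_accuracy(left, n_total) + node_accuracy(right, n_total)
        improvement = parent - children
        if improvement > 0 and (best is None or improvement > best[0]):
            best = (improvement, p)
    return best


result = best_split([1, 2, 3, 4], [1, 1, 5, 5], n_total=4)
```

In this small example the node accuracy is 16/4 = 4, and splitting at p = 2.5 gives two successor nodes with accuracy 0, so the improvement is 4. A node whose response values are all equal yields no improving split and would become a terminal node.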
The resulting forest (and its associated information) can be saved using the SAVE option. This can then be used in the BRFDISPLAY procedure to produce further output, or in the BRFPREDICT procedure to predict the response for new values of the x-variables.
The OUTOFBAGERROR parameter can save the “out-of-bag” error rate. This is calculated using the individuals that were not involved in the construction of each tree, so it gives an independent measure of the reliability of the forest. The idea is to put the x-values of each observation through all of the trees where it was not used, and predict its y-value by taking the average of the predictions from those trees. The out-of-bag error is the square root of the mean of the squared differences between the predictions and the values in the response variate. The YOUTOFBAGESTIMATES option can save a variate containing the out-of-bag predictions, and the %VARIANCE option can save the percentage of the variance in the y-values that is accounted for by the forest. Note: the out-of-bag prediction will be missing for any observation that has been selected in all the random samples (i.e. that has been used to construct every tree).
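The out-of-bag calculation can be sketched in Python as follows. The function name and the data layout (one list of predictions per tree, one set of sampled unit indices per tree) are hypothetical, and leaving observations with a missing prediction out of the mean is an assumption.

```python
import math


def out_of_bag_error(y, tree_predictions, tree_units):
    """y: observed responses.
    tree_predictions: for each tree, its prediction for every observation.
    tree_units: for each tree, the set of unit indices used to build it.
    Returns (error, estimates); an estimate is None for an observation
    that was used in the construction of every tree."""
    estimates = []
    for i in range(len(y)):
        # Use only the trees that did not see observation i.
        preds = [p[i] for p, used in zip(tree_predictions, tree_units)
                 if i not in used]
        estimates.append(sum(preds) / len(preds) if preds else None)
    # Assumption: observations with no out-of-bag prediction are left out.
    sq = [(e - yi) ** 2 for e, yi in zip(estimates, y) if e is not None]
    error = math.sqrt(sum(sq) / len(sq)) if sq else None
    return error, estimates


error, estimates = out_of_bag_error(
    [1.0, 2.0, 3.0],
    [[1.0, 2.0, 5.0], [3.0, 2.0, 3.0]],
    [{0, 1}, {1, 2}],
)
```

Here the second observation was used by both trees, so it has no out-of-bag prediction; the other two each differ from the observed value by 2, giving an error of 2.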
The IMPORTANCE parameter can save a variate giving the “importance” of each X variate or factor in the forest, calculated as the total amount by which the variable increases the accuracy in the forest.
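A minimal Python sketch of this total, assuming each tree is summarized as a list of (variable name, improvement in accuracy) pairs, one per split. That representation and the function name are hypothetical, not the structure saved by the SAVE option.

```python
from collections import defaultdict


def importance(forest_splits):
    """Sum the accuracy improvements of every split, by x-variable,
    over all the trees in the forest."""
    totals = defaultdict(float)
    for tree in forest_splits:
        for name, improvement in tree:
            totals[name] += improvement
    return dict(totals)


ratings = importance([
    [("Temp", 2.0), ("Employ", 0.5)],  # splits in the first tree
    [("Temp", 1.0)],                   # splits in the second tree
])
```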
Printed output is controlled by the
||out-of-bag error rate,|
||out-of-bag predictions of the y-values,|
||importance ratings of the
||importance ratings of the
||monitoring information during the construction process.|
The default is
BRFOREST calls procedure BCONSTRUCT to form each tree. This uses a special-purpose procedure BSELECT, which is customized specifically to select splits for use in regression trees. You can use your own method of selection by providing your own BSELECT and setting option OWNBSELECT=yes. In the standard version of BSELECT, the BASSESS directive is used to assess the potential splits.
Restrictions on the
Y vectors are ignored.
Breiman, L., Friedman, J.H., Olshen, R.A. & Stone, C.J. (1984). Classification and Regression Trees. Wadsworth, Monterey.
Breiman, L. (2001). Random forests. Machine Learning, 45, 5-32.
CAPTION   'BRFOREST example'; STYLE=meta
SPLOAD    [PRINT=*] '%gendir%/data/water.gsh'
BRFOREST  [PRINT=outofbagerror,youtofbagestimates,importance;\
          Y=Water; NTREES=8; NXTRY=3; NUNITSTRY=10; SEED=185090]\
          Employ,Opdays,Product,Temp
BRFPREDICT [PRINT=*; PREDICTION=Prediction] Employ,Opdays,Product,Temp
PRINT     Water,Prediction