Constructs a random regression forest (R.W. Payne).
Options
PRINT = string tokens | Controls printed output (outofbagerror, youtofbagestimates, importance, orderedimportance, monitoring); default outo, impo |
---|---|
Y = variate | Response variate for the regression |
NTREES = scalar | Number of trees in the forest; there is no default, so this must be specified |
NXTRY = scalar | Number of X variables to select at random at each node from which to choose the X variable to use there; default is the square root of the number of X variables |
NUNITSTRY = scalar | Number of units of the X variables to select at random to use in the construction of each tree; default is two thirds of the number of units |
MSLIMIT = scalar | Limit on the mean square of the observations at a node, at or below which no further splits are made; default 0 |
NSTOP = scalar | Number of observations at a node, at or below which no further splits are made; default 1 |
SEED = scalar | Seed for the random numbers used to select the NXTRY X-variables and NUNITSTRY units; default 0 |
OWNBSELECT = string token | Indicates whether or not your own version of the BSELECT procedure is to be used, as explained in the Method section (yes, no); default no |
OUTOFBAGERROR = scalar | Saves the “out-of-bag” error rate |
YOUTOFBAGESTIMATES = variate | Saves the “out-of-bag” estimates of Y |
SAVE = pointer | Saves details of the forest that has been constructed |
Parameters
X = factors or variates | X-variables available for constructing the tree |
---|---|
ORDERED = string tokens | Whether factor levels are ordered (yes, no); default no |
IMPORTANCE = scalars | Saves the importance of each x-variable |
Description
A regression tree is a mechanism for predicting a response variable from a set of independent variables (see Chapter 8 of Breiman et al. 1984). A random regression forest is a set of regression trees that are used collectively to form the prediction, by averaging the predictions from the individual trees (see e.g. Breiman 2001). The number of trees in the forest is specified by the NTREES option. Constructing a large forest can be time consuming, so it may be best to investigate first with a relatively small number of trees (e.g. 10).
The trees are constructed using data on a set of observations. Their values for the response variable are specified (in a variate) using the Y option, and their values for the independent variables are specified (in a list of variates or factors) using the X parameter. Factors may have either ordered or unordered levels, according to whether the corresponding value of the ORDERED parameter is set to yes or no. For example, a factor called Dose with levels 1, 1.5, 2 and 2.5 would usually be treated as having ordered levels, whereas levels labelled 'Morphine', 'Amidone', 'Phenadoxone' and 'Pethidine' of a factor called Drug would be regarded as unordered.
Each regression tree is formed using a random sample of the X variables in the data set, and a bootstrap random sample of their units (i.e. sampled with replacement). The NXTRY option defines how many X variables to select, and the NUNITSTRY option defines how many units to take. The default for NXTRY is the square root of the number of variables, and the default for NUNITSTRY is two thirds of the number of units. The SEED option specifies a seed for the random numbers that are used to select the variables and to select the units. The default of zero continues an existing sequence of random numbers, if any of the random functions (GRSELECT etc.) has already been used in the current Genstat run. Otherwise, a seed is chosen at random.
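As a rough illustration of the sampling step described above, the following Python sketch (not Genstat code) draws the X variables and the bootstrap units. The function name draw_sample is hypothetical, and rounding the two defaults to the nearest whole number is an assumption, since the text does not say how fractional values are handled.

```python
import math
import random

def draw_sample(n_x, n_units, nxtry=None, nunitstry=None, seed=None):
    """Pick X-variable indices and bootstrap unit indices (illustrative only)."""
    rng = random.Random(seed)
    # Defaults as described in the text; the rounding rule is an assumption.
    if nxtry is None:
        nxtry = max(1, round(math.sqrt(n_x)))
    if nunitstry is None:
        nunitstry = max(1, round(2 * n_units / 3))
    # Distinct X variables (assumed to be drawn without replacement).
    # The NXTRY option description says this draw is repeated at each node.
    x_chosen = rng.sample(range(n_x), nxtry)
    # Units form a bootstrap sample, i.e. they are drawn with replacement.
    units = [rng.randrange(n_units) for _ in range(nunitstry)]
    return x_chosen, units
```

With 16 X variables and 30 units, the defaults in this sketch select 4 variables and 20 units; some units will typically appear more than once, and the units never drawn are the ones available for out-of-bag assessment of that tree.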
The construction process splits the observations into subsets. With an x-variate or a factor with ordered levels, the subsets are formed by taking the observations with values less than or greater than some split point p. For a factor with unordered levels, all possible ways of dividing its levels into two subsets are tried. The aim is to form subsets that have similar values for the response variate. The predicted value of the response variable for each node of the tree is the mean of its values for the subset of observations at that node. The accuracy of the node is the sum of the squared distances of the values of the response variate from their mean for the observations at the node, divided by the total number of observations. The potential splits at the node are assessed by their effect on the accuracy, that is, the difference between the accuracy of the node and the sum of the accuracies of the two potential successor nodes. The node will become a terminal node if none of the splits provides any improvement in accuracy, or if the mean square of the observations at the node is less than or equal to a limit specified by the MSLIMIT option (default 0), or if the number of observations at the node is less than or equal to the number specified by the NSTOP option (default 1).
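The assessment of splits on a single x-variate can be sketched in Python as follows. This is an illustration of the criterion described above, not the code used by BSELECT or BASSESS; the function names are hypothetical, and taking candidate split points midway between adjacent distinct x-values is an assumption.

```python
def node_accuracy(y_node, n_total):
    """Sum of squared deviations from the node mean, divided by the
    total number of observations (not the number at the node)."""
    mean = sum(y_node) / len(y_node)
    return sum((v - mean) ** 2 for v in y_node) / n_total

def best_split(x, y, n_total):
    """Return (split_point, improvement) for one x-variate,
    or (None, 0.0) if no split improves the accuracy."""
    parent = node_accuracy(y, n_total)
    best_point, best_gain = None, 0.0
    values = sorted(set(x))
    # Assumed candidates: midpoints between adjacent distinct values.
    for lo, hi in zip(values, values[1:]):
        p = (lo + hi) / 2
        left = [yi for xi, yi in zip(x, y) if xi < p]
        right = [yi for xi, yi in zip(x, y) if xi > p]
        # Improvement: node accuracy minus the sum for the two successors.
        gain = parent - (node_accuracy(left, n_total)
                         + node_accuracy(right, n_total))
        if gain > best_gain:
            best_point, best_gain = p, gain
    return best_point, best_gain
```

For x = 1, 2, 3, 4 and y = 1, 1, 5, 5 the sketch chooses the split point 2.5, which separates the two groups of equal responses and removes all of the within-node variation. A node whose responses are all equal returns no split, which corresponds to the first of the stopping rules above.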
The resulting forest (and its associated information) can be saved using the SAVE option. This can then be used in the BRFDISPLAY procedure to produce further output, or in the BRFPREDICT procedure to predict the response for new values of the x-variables.
The OUTOFBAGERROR option can save the “out-of-bag” error rate. This is calculated using the individuals that were not involved in the construction of each tree, so it gives an independent measure of the reliability of the forest. The idea is to put the x-values of each observation through all of the trees where it was not used, and predict its y-value by taking the average of the predictions from those trees. The out-of-bag error is the square root of the mean of the squared differences between the predictions and the values in the response variate.
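The out-of-bag calculation can be sketched in Python as below. The data layout (a list of per-tree predictions and a list of sets of in-bag unit indices) and the function name are hypothetical, chosen only for illustration. Taking the mean over just those observations that have an out-of-bag prediction is an assumption; the text does not say how observations without one are treated.

```python
import math

def out_of_bag_error(y, tree_predictions, in_bag):
    """
    y: observed responses, one per observation.
    tree_predictions[t][i]: prediction of tree t for observation i.
    in_bag[t]: set of observation indices used to construct tree t.
    Returns (error, estimates); an estimate is None when the
    observation was used to construct every tree.
    """
    estimates = []
    for i in range(len(y)):
        # Only trees that did not use observation i contribute.
        preds = [tree_predictions[t][i]
                 for t in range(len(tree_predictions))
                 if i not in in_bag[t]]
        estimates.append(sum(preds) / len(preds) if preds else None)
    # Assumption: observations with no out-of-bag estimate are skipped.
    sq = [(est - obs) ** 2
          for est, obs in zip(estimates, y) if est is not None]
    error = math.sqrt(sum(sq) / len(sq)) if sq else None
    return error, estimates
```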
The YOUTOFBAGESTIMATES option can save a variate containing the out-of-bag predictions. Note: the out-of-bag prediction will be missing for any observation that has been selected in all the random samples (i.e. that has been used to construct every tree).
The IMPORTANCE parameter can save the “importance” of each X variate or factor in the forest, calculated as the total amount by which splits on that variable improve the accuracy in the forest.
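One way to picture the importance measure is as a running total of split improvements per variable. The Python sketch below assumes, purely for illustration, that each split in every tree has been recorded as a (variable name, improvement) pair; this record format and the function name are hypothetical and are not part of the saved forest structure.

```python
from collections import defaultdict

def variable_importance(splits):
    """splits: iterable of (x_name, improvement) pairs,
    one pair per split in every tree of the forest."""
    totals = defaultdict(float)
    for x_name, improvement in splits:
        totals[x_name] += improvement
    # Decreasing order, as in an ordered importance listing.
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
```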
Printed output is controlled by the PRINT option, with settings:
outofbagerror | out-of-bag error rate, |
---|---|
youtofbagestimates | out-of-bag predictions of the y-values, |
importance | importance ratings of the X variates and factors, |
orderedimportance | importance ratings of the X variates and factors in decreasing order, and |
monitoring | monitoring information during the construction process. |
The default is PRINT=outofbagerror,importance.
Options: PRINT, Y, NTREES, NXTRY, NUNITSTRY, MSLIMIT, NSTOP, SEED, OWNBSELECT, OUTOFBAGERROR, YOUTOFBAGESTIMATES, SAVE.
Parameters: X, ORDERED, IMPORTANCE.
Method
BRFOREST calls procedure BCONSTRUCT to form each tree. This uses a special-purpose procedure BSELECT, which is customized specifically to select splits for use in regression trees. You can use your own method of selection by providing your own BSELECT and setting option OWNBSELECT=yes. In the standard version of BSELECT, the BASSESS directive is used to assess the potential splits.
Action with RESTRICT
Restrictions on the X or Y vectors are ignored.
References
Breiman, L., Friedman, J.H., Olshen, R.A. & Stone, C.J. (1984). Classification and Regression Trees. Wadsworth, Monterey.
Breiman, L. (2001). Random forests. Machine Learning, 45, 5-32.
See also
Procedures: BRFDISPLAY, BRFPREDICT, BREGRESSION.
Commands for: Regression analysis, Multivariate and cluster analysis.
Example
CAPTION 'BRFOREST example'; STYLE=meta
SPLOAD [PRINT=*] '%gendir%/data/water.gsh'
BRFOREST [PRINT=outofbagerror,youtofbagestimates,importance;\
  Y=Water; NTREES=8; NXTRY=3; NUNITSTRY=10; SEED=185090]\
  Employ,Opdays,Product,Temp
BRFPREDICT [PRINT=*; PREDICTION=Prediction] Employ,Opdays,Product,Temp
PRINT Water,Prediction