Prunes a tree using minimal cost complexity (R.W. Payne).
Options
    PRINT      Controls printed output (table, graph, monitoring)

Parameters
    TREES      Trees to be pruned
    ACCURACY   Accuracy values for the nodes of each tree; default is to
               use those stored with the tree
    NEWTREES   Saves the trees generated during the pruning of each tree
    RTPRUNED   Accuracy of the pruned trees of each tree
    NTERMINAL  Number of terminal nodes in the pruned trees of each tree
The construction of a classification tree or a regression tree generally results in over-fitting: that is, it continues to extend the branches of the tree beyond the point that can be justified statistically. The solution is to prune the tree to remove the uninformative sub-branches.
The tree to be pruned is specified by the TREES parameter.
BPRUNE assumes that there is an accuracy figure R(t) available for each node t of the tree. By default this is assumed to be stored with the tree itself, but you can specify other values using the
ACCURACY parameter. This should be set to a pointer whose suffixes are the same as the numbers of the nodes in the tree, and whose elements are scalars storing the relevant accuracy values.
For a classification tree, the accuracy measures the impurity of the subset of individuals at that node (how far it is from being homogeneous, i.e. from containing individuals from a single group). For a regression tree, it is the average squared distance of the values of the response variate from their mean for the subset of observations at that node. The accuracy R(T) of the whole tree T is the sum of the accuracies of its terminal nodes.
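As a concrete illustration, the regression-tree accuracy measure can be sketched in Python (not Genstat; the function names here are invented for this sketch and are not part of BPRUNE):

```python
def node_accuracy(responses):
    """R(t): average squared distance of the response values in the
    subset of observations at node t from their mean."""
    mean = sum(responses) / len(responses)
    return sum((y - mean) ** 2 for y in responses) / len(responses)

def tree_accuracy(terminal_subsets):
    """R(T): the sum of the accuracies of the tree's terminal nodes,
    given the subset of responses at each terminal node."""
    return sum(node_accuracy(subset) for subset in terminal_subsets)
```

A perfectly homogeneous terminal node contributes zero, so a tree whose terminal nodes each hold near-identical responses has R(T) close to zero.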
BPRUNE uses the principle of minimal cost complexity (Breiman et al. 1984, Chapter 3) to produce a sequence of pruned trees. At each stage it prunes at the node which is the weakest link. Define R(Tt) to be the accuracy of the subtree with root at node t, and nterm(t) to be its number of terminal nodes. The weakest link is then the node for which
(R(t) – R(Tt)) / (nterm(t) – 1)
is a minimum. The pruned trees can be saved, in a pointer, using the
NEWTREES parameter. Their accuracies can be saved (in a variate) using the
RTPRUNED parameter, and their numbers of terminal nodes can be saved (also in a variate) using the NTERMINAL parameter.
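The weakest-link search described above can be sketched in Python (not Genstat; the dictionary-based tree layout and the function names are assumptions made for this illustration):

```python
def subtree_stats(tree, node):
    """Return (R(T_t), nterm(t)) for the subtree rooted at `node`.
    `tree` maps "children" to a node->children dict and "R" to the
    accuracy R(t) stored for each node."""
    kids = tree["children"].get(node, [])
    if not kids:                      # terminal node
        return tree["R"][node], 1
    r, n = 0.0, 0
    for k in kids:
        rk, nk = subtree_stats(tree, k)
        r += rk
        n += nk
    return r, n

def weakest_link(tree, root):
    """Find the internal node minimising
    (R(t) - R(T_t)) / (nterm(t) - 1), the weakest link."""
    best, best_g = None, None
    stack = [root]
    while stack:
        t = stack.pop()
        kids = tree["children"].get(t, [])
        if not kids:
            continue                  # only internal nodes can be pruned
        r_sub, nterm = subtree_stats(tree, t)
        g = (tree["R"][t] - r_sub) / (nterm - 1)
        if best_g is None or g < best_g:
            best, best_g = t, g
        stack.extend(kids)
    return best, best_g
```

Pruning at the returned node (replacing its subtree by a terminal node), and repeating until only the root remains, yields the sequence of pruned trees.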
Printed output is controlled by the PRINT option, with settings:

    table       prints a table with the accuracy and number of terminal
                nodes of each pruned tree;
    graph       plots the accuracy of the pruned trees against their
                number of terminal nodes;
    monitoring  provides monitoring information during the pruning.
The plot of RTPRUNED against NTERMINAL demonstrates the trade-off between accuracy and complexity (number of terminal nodes). It should show an initial rapid decrease, followed by a long flat region, and then often a gradual increase. The aim is to select a tree that is accurate but not over-complex. One possibility is to take the tree at the point where the graph levels off. However,
RTPRUNED contains only an estimate of the accuracy of the trees. So Breiman et al. (1984) recommend taking a tree a little above that (in fact at one standard error of
RTPRUNED above the minimum point in the graph: see Chapters 3 and 11). In practice, though, a small amount of over-fitting should not be a problem, so the exact choice of pruned tree should not be crucial.
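The one-standard-error rule of Breiman et al. can be sketched in Python (not Genstat; the function name and inputs, namely the saved RTPRUNED and NTERMINAL values plus a standard error, are assumptions for this illustration):

```python
def one_se_choice(rtpruned, nterminal, se):
    """Pick the index of the least complex pruned tree whose estimated
    accuracy lies within one standard error of the minimum."""
    threshold = min(rtpruned) + se
    candidates = [i for i, r in enumerate(rtpruned) if r <= threshold]
    # among acceptable trees, prefer the one with fewest terminal nodes
    return min(candidates, key=lambda i: nterminal[i])
```

This selects a slightly smaller tree than the minimum of the graph, guarding against the sampling variability in the accuracy estimates.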
Breiman, L., Friedman, J.H., Olshen, R.A. & Stone, C.J. (1984). Classification and Regression Trees. Wadsworth, Monterey.
CAPTION 'BPRUNE example',!t('Water usage data (Draper & Smith 1981,',\
        'Applied Regression Analysis, Wiley, New York).'); STYLE=meta,plain
READ temp,product,opdays,employ,water
58.8  7.107 21 129 3.067
65.2  6.373 22 141 2.828
70.9  6.796 22 153 2.891
77.4  9.208 20 166 2.994
79.3 14.792 25 193 3.082
81.0 14.564 23 189 3.898
71.9 11.964 20 175 3.502
63.9 13.526 23 186 3.060
54.5 12.656 20 190 3.211
39.5 14.119 20 187 3.286
44.5 16.691 22 195 3.542
43.6 14.571 19 206 3.125
56.0 13.619 22 198 3.022
64.7 14.575 22 192 2.922
73.0 14.556 21 191 3.950
78.9 18.573 21 200 4.488
79.4 15.618 22 200 3.295 :
"form the regression tree"
BREGRESSION [PRINT=indented; Y=water; TREE=tree]\
        employ,opdays,product,temp
"prune the tree"
BPRUNE [PRINT=table,graph] tree; NEWTREES=pruned
"use tree 6 - renumber nodes"
BCUT [RENUMBER=yes] pruned; NEWTREE=tree
"display the tree"
BRDISPLAY [PRINT=indented] tree