1. Home
  2. BPRUNE procedure

BPRUNE procedure

Prunes a tree using minimal cost complexity (R.W. Payne).

Option

PRINT = string tokens Controls printed output (graph, table, monitoring); default tabl

Parameters

TREE = trees Trees to be pruned
ACCURACY = pointers Accuracy values for the nodes of each tree; default is to use those stored with the tree
NEWTREES = pointers Saves the trees generated during the pruning of each tree
RTPRUNED = variates Accuracy of the pruned trees of each tree
NTERMINAL = variates Number of terminal nodes in the pruned trees of each tree

Description

The construction of a classification tree or a regression tree generally results in over fitting, that is it continues to extend the branches of the tree beyond the point that can be justified statistically. The solution is to prune the tree to remove the uninformative sub-branches.

The tree to be pruned is specified by the TREE parameter. BPRUNE assumes that there is an accuracy figure R(t) available for each node t of the tree. By default this is assumed to be stored with the tree itself, but you can specify other values using the ACCURACY parameter. This should be set to a pointer whose suffixes are the same as the numbers of the nodes in the tree, and whose elements are scalars storing the relevant accuracy values.

For a classification tree the accuracy measures the impurity of the subset of individuals at that node (how far it is from being homogeneous i.e. with individuals from a single group). For a regression tree it is the average squared distance of the values of the response variate from their mean for the subset of observations at that node. The accuracy R(T) of the whole tree T is the sum of the accuracies of its terminal nodes.

BPRUNE uses the principle of minimal cost complexity (Breiman et al. 1984, Chapter 3) to produce a sequence of pruned trees. At each stage it prunes at the node which is the weakest link. Define R(Tt) to be the accuracy of the subtree with root at node t, and nterm(t) to be its number of terminal nodes. The weakest link is then the node for which

(R(t) – R(Tt)) / (nterm(t) – 1)

is a minimum. The pruned trees can be saved, in a pointer, using the NEWTREES parameter. Their accuracies can be saved (in a variate) using the RTPRUNED parameter, and their numbers of terminal nodes can be saved (also in a variate) using the NTERMINAL parameter.

Printed output is controlled by the PRINT option, with settings:

    graph plots RTPRUNED against NTERMINAL;
    table prints a table with RTPRUNED and NTERMINAL;
    monitoring provides monitoring information during the pruning.

The plot of RTPRUNED against NTERMINAL demonstates the trade-off between accuracy and complexity (number of terminal nodes). It should show an initial rapid decrease, followed by a long flat region, and then often a gradual increase. The aim is to select a tree that is accurate but not over-complex. One possibility is to take the tree at the point where the graph levels off. However, RTPRUNED contains only an estimate of the accuracy of the trees. So Breiman et al. (1984) recommend taking a tree a little above that (in fact at one standard error of RTPRUNED above the minimium point in the graph: see Chapters 3 and 11). In practice though a small amount of over-fitting should not be a problem, so the exact choice of pruned tree should not be crucial.

Option: PRINT.

Parameter: TREE, ACCURACY, NEWTREES, RTPRUNED, NTERMINAL.

Method

BPRUNE uses the BSCAN function to move around the tree, function BBELOW to obtain the numbers of the nodes below each node, and directive BCUT to perform each pruning operation.

Reference

Breiman, L., Friedman, J.H., Olshen, R.A. & Stone, C.J. (1984). Classification and Regression Trees. Wadsworth, Monterey.

See also

Procedures: BCLASSIFICATION, BCVALUES, BREGRESSION, BRVALUES.

Commands for: Calculations and manipulation, Multivariate and cluster analysis.

Example

CAPTION 'BPRUNE example',!t('Water usage data (Draper & Smith 1981,',\
     'Applied Regression Analysis, Wiley, New York).'); STYLE=meta,plain
READ  temp,product,opdays,employ,water
58.8  7.107 21 129 3.067
65.2  6.373 22 141 2.828
70.9  6.796 22 153 2.891
77.4  9.208 20 166 2.994
79.3 14.792 25 193 3.082
81.0 14.564 23 189 3.898
71.9 11.964 20 175 3.502
63.9 13.526 23 186 3.060
54.5 12.656 20 190 3.211
39.5 14.119 20 187 3.286
44.5 16.691 22 195 3.542
43.6 14.571 19 206 3.125
56.0 13.619 22 198 3.022
64.7 14.575 22 192 2.922
73.0 14.556 21 191 3.950
78.9 18.573 21 200 4.488
79.4 15.618 22 200 3.295 :
"form the regression tree"
BREGRESSION [PRINT=indented; Y=water; TREE=tree]\
            employ,opdays,product,temp
"prune the tree"
BPRUNE      [PRINT=table,graph] tree; NEWTREES=pruned
"use tree 6 - renumber nodes"
BCUT        [RENUMBER=yes] pruned[6]; NEWTREE=tree
"display the tree"
BRDISPLAY   [PRINT=indented] tree
Updated on March 8, 2019

Was this article helpful?