Constructs a classification tree (R.W. Payne).
Options
PRINT = string tokens | Controls printed output (summary, details, indented, bracketed, labelleddiagram, numbereddiagram, graph, monitoring); default * i.e. none |
---|---|
METHOD = string token | Selection criterion to use when constructing the tree (Gini, MPI); default Gini |
GROUPS = factor | Groupings of the individuals in the tree |
TREE = tree | Saves the tree that has been constructed |
NSTOP = scalar | Number of individuals in a group at which to stop selecting tests; default 5 |
ANTIENDCUTFACTOR = string token | Adaptive anti-end-cut factor to use (classnumber, reciprocalentropy); default * i.e. none |
OWNBSELECT = string token | Indicates whether or not your own version of the BSELECT procedure is to be used, as explained in the Method section (yes, no); default no |
Parameters
X = factors or variates | X-variables available for constructing the tree |
---|---|
ORDERED = string tokens | Whether factor levels are ordered (yes, no); default no |
Description
The starting point for a classification tree is a sample of individuals from several groups. The characteristics of the individuals are described in Genstat by a set of factors or variates which are specified by the X parameter of BCLASSIFICATION. The GROUPS option of BCLASSIFICATION defines the group to which each individual in the sample belongs, and the aim is to be able to identify the groups to which new individuals belong.
The tree progressively splits the individuals into subsets based on their values for the factors or variates. Construction starts at a node known as the root, which contains all of the individuals. A factor or variate is chosen there that “best” divides the individuals into two subsets. Suppose the X vectors are all factors with two levels: the first subset will then contain the individuals with level 1 of the factor, and the second will contain those with level 2. Any individual with a missing value for the factor is put into both subsets, so you can use a missing value to denote either variable or unknown observations. Factors may have either ordered or unordered levels, according to whether the corresponding value of the ORDERED parameter is set to yes or no. For example, a factor called Dose with levels 1, 1.5, 2 and 2.5 would usually be treated as having ordered levels, whereas levels labelled 'Morphine', 'Amidone', 'Phenadoxone' and 'Pethidine' of a factor called Drug would be regarded as unordered. For unordered factors, all possible ways of dividing the levels into two sets are tried. With variates or ordered factors with more than 2 levels, a suitable value p is found to partition the individuals into those with values less than or greater than p. The tree is then extended to contain two new nodes, one for each of the subsets, and factors or variates are selected for use at each of these nodes to subdivide the subsets further.
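The candidate splits described above can be sketched in Python. This is an illustration only, not Genstat code and not the BSELECT implementation: the function names are hypothetical, and taking midpoints between adjacent distinct values as the candidate values of p is an assumption of the sketch (the text says only that a suitable p is found).

```python
from itertools import combinations


def unordered_splits(levels):
    """Yield every way of dividing `levels` into two non-empty sets.

    The first level is always kept in the left-hand set, so each
    partition is produced once rather than twice as mirror images.
    """
    first, rest = levels[0], levels[1:]
    for size in range(len(rest) + 1):
        for chosen in combinations(rest, size):
            left = (first,) + chosen
            right = tuple(lv for lv in rest if lv not in chosen)
            if right:  # skip the case that leaves one side empty
                yield left, right


def threshold_candidates(values):
    """Candidate cut points p for a variate or ordered factor.

    Assumption of this sketch: midpoints between adjacent distinct values.
    """
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]


# The unordered Drug factor and the ordered Dose factor from the text.
drug_splits = list(
    unordered_splits(("Morphine", "Amidone", "Phenadoxone", "Pethidine"))
)
dose_cuts = threshold_candidates([1, 1.5, 2, 2.5, 2])
```

A factor with k unordered levels has 2^(k-1) - 1 such divisions, so the four-level Drug factor gives seven candidate splits, whereas the four ordered Dose levels give only three cut points.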
The effectiveness of the factor or variate to be chosen for each node depends on how the groups are split between the resulting subsets: the aim is to form subsets that are each composed of individuals from the same group. By default, this is assessed using Gini information (see Breiman et al. 1984, Chapter 4) but you can set option METHOD=mpi to use the mean posterior improvement criterion devised by Taylor & Silverman (1993). The ANTIENDCUTFACTOR option allows you to request Taylor & Silverman’s adaptive anti-end-cut factors (by default these are not used). The process stops when either no factor or variate provides any additional information, or the subset contains individuals all from the same group, or the subset contains fewer individuals than a limit specified by the NSTOP option (default 5). These nodes where the construction ends are known as terminal nodes.
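As a rough illustration of the default criterion and the stopping rules, the Python sketch below uses the standard Gini index from Breiman et al. (1984), 1 minus the sum of squared group proportions. It is not the code used by BSELECT or BASSESS, whose internal calculations are not described here, and the function names are hypothetical.

```python
from collections import Counter


def gini(groups):
    """Gini impurity, 1 - sum(p_k^2), of a list of group labels."""
    n = len(groups)
    return 1.0 - sum((c / n) ** 2 for c in Counter(groups).values())


def gini_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two subsets."""
    n = len(parent)
    return (gini(parent)
            - (len(left) / n) * gini(left)
            - (len(right) / n) * gini(right))


def is_terminal(groups, nstop=5):
    """Stop if the subset is pure or has fewer than `nstop` individuals."""
    return len(set(groups)) <= 1 or len(groups) < nstop


parent = ["a"] * 4 + ["b"] * 4
# A split that separates the two groups completely ...
perfect = gini_decrease(parent, ["a"] * 4, ["b"] * 4)
# ... and one that leaves the group proportions unchanged.
useless = gini_decrease(parent, ["a", "a", "b", "b"], ["a", "a", "b", "b"])
```

In this sketch the remaining stopping rule, that no factor or variate provides additional information, would correspond to every candidate split giving a decrease of zero, as with `useless` above.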
The resulting tree can be saved using the TREE option. Details of the tree can be printed as selected by the PRINT option, with settings:
summary | prints a summary of the properties of the tree; |
---|---|
details | gives detailed information about the nodes of the tree; |
bracketed | display as used to represent an identification key in “bracketed” form (printed node by node); |
indented | display as used to represent an identification key in “indented” form (printed branch by branch); |
labelleddiagram | diagrammatic display including the node labels; |
numbereddiagram | diagrammatic display with the nodes labelled by their numbers; |
graph | plots the tree using high-resolution graphics; |
monitoring | prints information monitoring the construction process. |
BCLASSIFICATION stores the information required for printing as part of the tree. If the X vectors are all factors with 2 levels, the labels for the labelled diagram are formed as “identifier==n1”, where n1 is the first level of the factor. The lines of the indented and bracketed forms are formed similarly if the factor has no extra text and no labels. Otherwise, the form is “xname lname”, where xname is the extra text if this has been defined (by the EXTRA parameter of the FACTOR command) or else the identifier of the factor, and lname is the label if available or the level if not. If the X vectors include variates or ordered factors with more than two levels and there is no extra text, the labels are formed as “identifier<p” and “identifier>p”, where p is the value chosen to partition the data for the variate concerned. If there is an extra text for a particular factor or variate, the labels are “xname < p” and “xname > p”. The style is similar for unordered factors, but here the labels involve the operators .IN. and .NI. instead of < and >.
Generally the construction will result in over-fitting; that is, it will form a tree that keeps selecting factors or variates to subdivide the individuals beyond the point that can be justified statistically. The solution is to prune the tree to remove the uninformative sub-branches, and this can be performed using the BPRUNE procedure. It is best, if possible, to base the pruning on an independent set of data. The pruning uses accuracy figures, which are stored for each node of the tree. The tree also stores a prediction for each node, which corresponds to the group with most individuals at the node. For each node of a classification tree, the accuracy is the number of misclassified individuals at the node, divided by the total number of individuals in the data set. It thus measures the impurity of the subset at that node (how far it is from being homogeneous, i.e. having individuals all from a single group). The BCVALUES procedure can be used to calculate new accuracy and prediction values from another data set.
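The definitions of the prediction and accuracy values can be written out as a short Python sketch. The function name and the counts in the usage lines are made up for illustration, and the way ties between equally large groups are broken here (the first group encountered wins) is an assumption, since the text does not say how ties are resolved.

```python
from collections import Counter


def node_summary(node_groups, n_total):
    """Return (prediction, accuracy) for one node.

    prediction: the group with most individuals at the node.
    accuracy:   individuals at the node that are not in that group,
                divided by the size of the whole data set.
    """
    counts = Counter(node_groups)
    prediction, n_majority = counts.most_common(1)[0]
    accuracy = (len(node_groups) - n_majority) / n_total
    return prediction, accuracy


# Invented node: 47 individuals of one group and 3 of another,
# in a data set of 150 individuals.
prediction, accuracy = node_summary(
    ["Versicolor"] * 47 + ["Virginica"] * 3, n_total=150
)
```

Here the prediction is Versicolor and the accuracy is 3/150 = 0.02; a node whose individuals all belong to one group would have an accuracy of zero.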
Finally, once the tree has been pruned, the group of a new individual can be identified by supplying their values for the X factors or variates to the BCIDENTIFY procedure. This runs the individual through the tree to see which terminal node it would reach. The group can then be identified using the prediction value stored for that node.
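The idea of running an individual through the tree can be pictured with the Python sketch below. The nested-dictionary tree is a hypothetical stand-in for a Genstat tree structure, and the cut points are invented for illustration; they are not output from BCLASSIFICATION or BCIDENTIFY.

```python
def identify(node, individual):
    """Follow the tests from the root until a terminal node is reached."""
    while "prediction" not in node:
        # Each non-terminal node tests one x-variable against its value p.
        branch = "below" if individual[node["x"]] < node["p"] else "above"
        node = node[branch]
    return node["prediction"]


# Hypothetical tree: variable names come from the iris example,
# but the cut points p are invented for this sketch.
tree = {
    "x": "Petal_Length", "p": 2.5,
    "below": {"prediction": "Setosa"},
    "above": {
        "x": "Petal_Width", "p": 1.75,
        "below": {"prediction": "Versicolor"},
        "above": {"prediction": "Virginica"},
    },
}

group = identify(tree, {"Petal_Length": 4.7, "Petal_Width": 1.4})
```

With these invented cut points the individual passes the root test on Petal_Length, then the test on Petal_Width, and ends at the terminal node whose stored prediction is Versicolor.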
Options: PRINT, METHOD, GROUPS, TREE, NSTOP, ANTIENDCUTFACTOR, OWNBSELECT.
Parameters: X, ORDERED.
Method
BCLASSIFICATION calls procedure BCONSTRUCT to form the tree. This uses a special-purpose procedure BSELECT, which is customized specifically to select splits for use in classification trees. You can use your own method of selection by providing your own BSELECT and setting option OWNBSELECT=yes. In the standard version of BSELECT, the BASSESS directive is used to assess the potential splits.
Action with RESTRICT
Restrictions on the X vectors or GROUPS factor are ignored.
References
Breiman, L., Friedman, J.H., Olshen, R.A. & Stone, C.J. (1984). Classification and Regression Trees. Wadsworth, Monterey.
Taylor, P.C. & Silverman, B.W. (1993). Block diagrams and splitting criteria for classification trees. Statistics & Computing, 3, 147-161.
See also
Procedures: BCDISPLAY, BCIDENTIFY, BCKEEP, BCVALUES, BGRAPH, BPRUNE, BKEY, BCFOREST, BREGRESSION, KNEARESTNEIGHBOURS.
Commands for: Multivariate and cluster analysis.
Example
CAPTION 'BCLASSIFICATION example',\
        'Classification tree for Fisher''s Iris Data.'; STYLE=meta,plain
FACTOR  [NVALUES=150; LABELS=!t(Setosa,Versicolor,Virginica);\
        VALUES=50(1,2,3)] Species
VARIATE [NVALUES=150] Sepal_Length,Sepal_Width,Petal_Length,Petal_Width
READ    Sepal_Length,Sepal_Width,Petal_Length,Petal_Width
5.1 3.5 1.4 0.2  4.9 3.0 1.4 0.2  4.7 3.2 1.3 0.2  4.6 3.1 1.5 0.2  5.0 3.6 1.4 0.2
5.4 3.9 1.7 0.4  4.6 3.4 1.4 0.3  5.0 3.4 1.5 0.2  4.4 2.9 1.4 0.2  4.9 3.1 1.5 0.1
5.4 3.7 1.5 0.2  4.8 3.4 1.6 0.2  4.8 3.0 1.4 0.1  4.3 3.0 1.1 0.1  5.8 4.0 1.2 0.2
5.7 4.4 1.5 0.4  5.4 3.9 1.3 0.4  5.1 3.5 1.4 0.3  5.7 3.8 1.7 0.3  5.1 3.8 1.5 0.3
5.4 3.4 1.7 0.2  5.1 3.7 1.5 0.4  4.6 3.6 1.0 0.2  5.1 3.3 1.7 0.5  4.8 3.4 1.9 0.2
5.0 3.0 1.6 0.2  5.0 3.4 1.6 0.4  5.2 3.5 1.5 0.2  5.2 3.4 1.4 0.2  4.7 3.2 1.6 0.2
4.8 3.1 1.6 0.2  5.4 3.4 1.5 0.4  5.2 4.1 1.5 0.1  5.5 4.2 1.4 0.2  4.9 3.1 1.5 0.2
5.0 3.2 1.2 0.2  5.5 3.5 1.3 0.2  4.9 3.6 1.4 0.1  4.4 3.0 1.3 0.2  5.1 3.4 1.5 0.2
5.0 3.5 1.3 0.3  4.5 2.3 1.3 0.3  4.4 3.2 1.3 0.2  5.0 3.5 1.6 0.6  5.1 3.8 1.9 0.4
4.8 3.0 1.4 0.3  5.1 3.8 1.6 0.2  4.6 3.2 1.4 0.2  5.3 3.7 1.5 0.2  5.0 3.3 1.4 0.2
7.0 3.2 4.7 1.4  6.4 3.2 4.5 1.5  6.9 3.1 4.9 1.5  5.5 2.3 4.0 1.3  6.5 2.8 4.6 1.5
5.7 2.8 4.5 1.3  6.3 3.3 4.7 1.6  4.9 2.4 3.3 1.0  6.6 2.9 4.6 1.3  5.2 2.7 3.9 1.4
5.0 2.0 3.5 1.0  5.9 3.0 4.2 1.5  6.0 2.2 4.0 1.0  6.1 2.9 4.7 1.4  5.6 2.9 3.6 1.3
6.7 3.1 4.4 1.4  5.6 3.0 4.5 1.5  5.8 2.7 4.1 1.0  6.2 2.2 4.5 1.5  5.6 2.5 3.9 1.1
5.9 3.2 4.8 1.8  6.1 2.8 4.0 1.3  6.3 2.5 4.9 1.5  6.1 2.8 4.7 1.2  6.4 2.9 4.3 1.3
6.6 3.0 4.4 1.4  6.8 2.8 4.8 1.4  6.7 3.0 5.0 1.7  6.0 2.9 4.5 1.5  5.7 2.6 3.5 1.0
5.5 2.4 3.8 1.1  5.5 2.4 3.7 1.0  5.8 2.7 3.9 1.2  6.0 2.7 5.1 1.6  5.4 3.0 4.5 1.5
6.0 3.4 4.5 1.6  6.7 3.1 4.7 1.5  6.3 2.3 4.4 1.3  5.6 3.0 4.1 1.3  5.5 2.5 4.0 1.3
5.5 2.6 4.4 1.2  6.1 3.0 4.6 1.4  5.8 2.6 4.0 1.2  5.0 2.3 3.3 1.0  5.6 2.7 4.2 1.3
5.7 3.0 4.2 1.2  5.7 2.9 4.2 1.3  6.2 2.9 4.3 1.3  5.1 2.5 3.0 1.1  5.7 2.8 4.1 1.3
6.3 3.3 6.0 2.5  5.8 2.7 5.1 1.9  7.1 3.0 5.9 2.1  6.3 2.9 5.6 1.8  6.5 3.0 5.8 2.2
7.6 3.0 6.6 2.1  4.9 2.5 4.5 1.7  7.3 2.9 6.3 1.8  6.7 2.5 5.8 1.8  7.2 3.6 6.1 2.5
6.5 3.2 5.1 2.0  6.4 2.7 5.3 1.9  6.8 3.0 5.5 2.1  5.7 2.5 5.0 2.0  5.8 2.8 5.1 2.4
6.4 3.2 5.3 2.3  6.5 3.0 5.5 1.8  7.7 3.8 6.7 2.2  7.7 2.6 6.9 2.3  6.0 2.2 5.0 1.5
6.9 3.2 5.7 2.3  5.6 2.8 4.9 2.0  7.7 2.8 6.7 2.0  6.3 2.7 4.9 1.8  6.7 3.3 5.7 2.1
7.2 3.2 6.0 1.8  6.2 2.8 4.8 1.8  6.1 3.0 4.9 1.8  6.4 2.8 5.6 2.1  7.2 3.0 5.8 1.6
7.4 2.8 6.1 1.9  7.9 3.8 6.4 2.0  6.4 2.8 5.6 2.2  6.3 2.8 5.1 1.5  6.1 2.6 5.6 1.4
7.7 3.0 6.1 2.3  6.3 3.4 5.6 2.4  6.4 3.1 5.5 1.8  6.0 3.0 4.8 1.8  6.9 3.1 5.4 2.1
6.7 3.1 5.6 2.4  6.9 3.1 5.1 2.3  5.8 2.7 5.1 1.9  6.8 3.2 5.9 2.3  6.7 3.3 5.7 2.5
6.7 3.0 5.2 2.3  6.3 2.5 5.0 1.9  6.5 3.0 5.2 2.0  6.2 3.4 5.4 2.3  5.9 3.0 5.1 1.8 :
"form the classification tree"
BCLASSIFICATION [PRINT=indented; GROUPS=Species; TREE=Tree]\
        Sepal_Length,Sepal_Width,Petal_Length,Petal_Width
"prune the tree"
BPRUNE  [PRINT=table,graph] Tree; NEWTREE=Pruned
"use the 4th tree - renumber nodes"
BCUT    [RENUMBER=yes] Pruned[4]; NEWTREE=Tree
"display the tree"
BCDISPLAY [PRINT=summary,indented,graph] Tree
"check how the original data values are classified"
BCIDENTIFY [PRINT=*; TREE=Tree; IDENTIFICATION=Identification]\
        Sepal_Length,Sepal_Width,Petal_Length,Petal_Width
PRINT   Sepal_Length,Sepal_Width,Petal_Length,Petal_Width,\
        Species,Identification; FIELD=4(13),2(12); DECIMALS=1