BCLASSIFICATION procedure

Constructs a classification tree (R.W. Payne).

Options

`PRINT` = string tokens	Controls printed output (`summary`, `details`, `indented`, `bracketed`, `labelleddiagram`, `numbereddiagram`, `graph`, `monitoring`); default `*` i.e. none
`METHOD` = string token	Selection criterion to use when constructing the tree (`Gini`, `MPI`); default `Gini`
`GROUPS` = factor	Groupings of the individuals in the tree
`TREE` = tree	Saves the tree that has been constructed
`NSTOP` = scalar	Number of individuals in a group at which to stop selecting tests; default 5
`ANTIENDCUTFACTOR` = string token	Adaptive anti-end-cut factor to use (`classnumber`, `reciprocalentropy`); default `*` i.e. none
`OWNBSELECT` = string token	Indicates whether or not your own version of the `BSELECT` procedure is to be used, as explained in the Method section (`yes`, `no`); default `no`

Parameters

`X` = factors or variates	X-variables available for constructing the tree
`ORDERED` = string tokens	Whether factor levels are ordered (`yes`, `no`); default `no`

Description

The starting point for a classification tree is a sample of individuals from several groups. The characteristics of the individuals are described in Genstat by a set of factors or variates which are specified by the X parameter of BCLASSIFICATION. The GROUPS option of BCLASSIFICATION defines the group to which each individual in the sample belongs, and the aim is to be able to identify the groups to which new individuals belong.

The tree progressively splits the individuals into subsets based on their values for the factors or variates. Construction starts at a node known as the root, which contains all of the individuals. A factor or variate is chosen there that “best” divides the individuals into two subsets. Suppose the X vectors are all factors with two levels: the first subset will then contain the individuals with level 1 of the factor, and the second will contain those with level 2. Any individual with a missing value for the factor is put into both subsets, so you can use a missing value to denote either variable or unknown observations. Factors may have either ordered or unordered levels, according to whether the corresponding value of the ORDERED parameter is set to yes or no. For example, a factor called Dose with levels 1, 1.5, 2 and 2.5 would usually be treated as having ordered levels, whereas levels labelled 'Morphine', 'Amidone', 'Phenadoxone' and 'Pethidine' of a factor called Drug would be regarded as unordered. For unordered factors, all possible ways of dividing the levels into two sets are tried. With variates or ordered factors with more than 2 levels, a suitable value p is found to partition the individuals into those with values less than p and those with values greater than p. The tree is then extended to contain two new nodes, one for each of the subsets, and factors or variates are selected for use at each of these nodes to subdivide the subsets further.
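The candidate splits described above can be illustrated outside Genstat. The following Python sketch is not the code used by BCLASSIFICATION: the function names are hypothetical, and taking midpoints between adjacent distinct values as the candidate values of p is an assumption (the text above says only that a suitable value p is found).

```python
from itertools import combinations


def unordered_splits(levels):
    """Yield each way of dividing factor levels into two non-empty sets, once."""
    levels = list(levels)
    first, rest = levels[0], levels[1:]
    # Keeping the first level on the left avoids listing mirror images twice.
    for size in range(len(rest) + 1):
        for extra in combinations(rest, size):
            left = {first, *extra}
            right = set(levels) - left
            if right:  # skip the case where every level is on the left
                yield left, right


def threshold_candidates(values):
    """Candidate cut points p (assumed: midpoints of adjacent distinct values)."""
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]


def partition(rows, goes_left):
    """Split (id, value) rows in two; a missing value (None) joins both subsets."""
    left, right = [], []
    for row_id, value in rows:
        if value is None:
            left.append(row_id)
            right.append(row_id)
        elif goes_left(value):
            left.append(row_id)
        else:
            right.append(row_id)
    return left, right
```

For the four-level Drug factor mentioned above, unordered_splits yields 7 distinct two-set divisions, whereas an ordered factor such as Dose has only 3 candidate cut points.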

The effectiveness of the factor or variate to be chosen for each node depends on how the groups are split between the resulting subsets – the aim is to form subsets that are each composed of individuals from the same group. By default, this is assessed using Gini information (see Breiman et al. 1984, Chapter 4) but you can set option METHOD=mpi to use the mean posterior improvement criterion devised by Taylor & Silverman (1993). The ANTIENDCUTFACTOR option allows you to request Taylor & Silverman’s adaptive anti-end-cut factors (by default these are not used). The process stops when either no factor or variate provides any additional information, or the subset contains individuals all from the same group, or the subset contains fewer individuals than a limit specified by the NSTOP option (default 5). These nodes where the construction ends are known as terminal nodes.
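As an illustration of the default criterion, the Python sketch below computes the usual Gini index of Breiman et al. (1984) and the reduction in impurity achieved by a split, together with the three stopping rules listed above. It is a sketch under stated assumptions rather than the procedure's own code: the function names are hypothetical, missing values are assumed absent, and whether Genstat evaluates exactly this expression is not confirmed here.

```python
from collections import Counter


def gini(groups):
    """Gini impurity, 1 - sum of squared group proportions."""
    n = len(groups)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(groups).values())


def gini_gain(parent, left, right):
    """Reduction in impurity when parent is split into left and right."""
    n = len(parent)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted


def should_stop(groups, best_gain, nstop=5):
    """Stop if the subset is too small, is pure, or no split adds information."""
    too_small = len(groups) < nstop
    pure = len(set(groups)) <= 1
    return too_small or pure or best_gain <= 0.0
```

With three groups of 50 individuals each, the impurity at the root is 2/3, and a split that isolates one group completely reduces it by 1/3.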

The resulting tree can be saved using the TREE option. Details of the tree can be printed as selected by the PRINT option, with settings:

`summary`	prints a summary of the properties of the tree;
`details`	gives detailed information about the nodes of the tree;
`bracketed`	display as used to represent an identification key in “bracketed” form (printed node by node);
`indented`	display as used to represent an identification key in “indented” form (printed branch by branch);
`labelleddiagram`	diagrammatic display including the node labels;
`numbereddiagram`	diagrammatic display with the nodes labelled by their numbers;
`graph`	plots the tree using high-resolution graphics;
`monitoring`	prints information monitoring the construction process.

BCLASSIFICATION stores the information required for printing as part of the tree. If the X vectors are all factors with 2 levels, the labels for the labelled diagram are formed as “identifier==n₁”, where n₁ is the first level of the factor. The lines of the indented and bracketed forms are formed similarly if the factor has no extra text and no labels. Otherwise, the form is “xname lname”, where xname is the extra text if this has been defined (by the EXTRA parameter of the FACTOR command) or else the identifier of the factor, and lname is the label if available or the level if not. If the X vectors include variates or ordered factors with more than two levels and there is no extra text, the labels are formed as “identifier<p” and “identifier>p”, where p is the value chosen to partition the data for the variate or factor concerned. If there is an extra text for a particular factor or variate, the labels are “xname < p” and “xname > p”. The style is similar for unordered factors, but here the labels involve the operators .IN. and .NI. instead of < and >.

Generally the construction will result in over-fitting; that is, it will form a tree that keeps selecting factors or variates to subdivide the individuals beyond the point that can be justified statistically. The solution is to prune the tree to remove the uninformative sub-branches, and this can be performed using the BPRUNE procedure. It is best, if possible, to base the pruning on an independent set of data. The pruning uses accuracy figures, which are stored for each node of the tree. The tree also stores a prediction for each node, which corresponds to the group with most individuals at the node. For each node of a classification tree, the accuracy is the number of misclassified individuals at the node, divided by the total number of individuals in the data set. It thus measures the impurity of the subset at that node (how far it is from being homogeneous, i.e. having individuals all from a single group). The BCVALUES procedure can be used to calculate new accuracy and prediction values from another data set.
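The prediction and accuracy definitions above amount to a short calculation, sketched here in Python. The function names are hypothetical and the counts in the usage note are invented for illustration, not output from BCLASSIFICATION. The handling of ties (the group met first wins) is also an assumption, as the text does not say how ties are resolved.

```python
from collections import Counter


def node_prediction(groups):
    """Group with the most individuals at the node (first seen wins a tie)."""
    return Counter(groups).most_common(1)[0][0]


def node_accuracy(groups, n_total):
    """Misclassified individuals at the node divided by the data-set size."""
    predicted = node_prediction(groups)
    misclassified = sum(1 for group in groups if group != predicted)
    return misclassified / n_total
```

For a hypothetical node holding 49 individuals of one group and 5 of another, in a data set of 150 individuals, the prediction is the larger group and the accuracy figure is 5/150. A node whose individuals all come from one group has accuracy 0.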

Finally, once the tree has been pruned, the group of a new individual can be identified by supplying their values for the X factors or variates to the BCIDENTIFY procedure. This runs the individual through the tree to see which terminal node it would reach. The group can then be identified using the prediction value stored for that node.
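The idea of running an individual through the tree can be sketched as follows in Python. This is an illustration only, not how BCIDENTIFY or the Genstat tree structure is implemented: the Node class, its field names and the cut values in the small tree are hypothetical, every test is assumed to be of the form "variable < p", and missing values are not handled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    prediction: str                  # group with most individuals at the node
    variable: Optional[str] = None   # None marks a terminal node
    cut: Optional[float] = None      # value p used to partition the data
    below: Optional["Node"] = None   # subtree for values less than p
    above: Optional["Node"] = None   # subtree for values not less than p


def identify(node, individual):
    """Follow the tests from the root down to a terminal node."""
    while node.variable is not None:
        if individual[node.variable] < node.cut:
            node = node.below
        else:
            node = node.above
    return node.prediction


# Hypothetical tree; the cut values are invented for illustration.
tree = Node(
    prediction="Setosa", variable="Petal_Length", cut=2.45,
    below=Node(prediction="Setosa"),
    above=Node(
        prediction="Versicolor", variable="Petal_Width", cut=1.75,
        below=Node(prediction="Versicolor"),
        above=Node(prediction="Virginica"),
    ),
)

group = identify(tree, {"Petal_Length": 4.7, "Petal_Width": 1.4})
```

Storing a prediction at every node, not only at the terminal ones, means the same structure still gives an answer after sub-branches have been pruned away.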

Options: PRINT, METHOD, GROUPS, TREE, NSTOP, ANTIENDCUTFACTOR, OWNBSELECT.

Parameters: X, ORDERED.

Method

BCLASSIFICATION calls procedure BCONSTRUCT to form the tree. This uses a special-purpose procedure BSELECT, which is customized specifically to select splits for use in classification trees. You can use your own method of selection by providing your own BSELECT and setting option OWNBSELECT=yes. In the standard version of BSELECT, the BASSESS directive is used to assess the potential splits.

Action with `RESTRICT`

Restrictions on the X vectors or GROUPS factor are ignored.

References

Breiman, L., Friedman, J.H., Olshen, R.A. & Stone, C.J. (1984). Classification and Regression Trees. Wadsworth, Monterey.

Taylor, P.C. & Silverman, B.W. (1993). Block diagrams and splitting criteria for classification trees. Statistics & Computing, 3, 147-161.

Example

CAPTION  'BCLASSIFICATION example',\
         'Classification tree for Fisher''s Iris Data.'; STYLE=meta,plain
FACTOR   [NVALUES=150; LABELS=!t(Setosa,Versicolor,Virginica);\ 
         VALUES=50(1,2,3)] Species
VARIATE  [NVALUES=150] Sepal_Length,Sepal_Width,Petal_Length,Petal_Width
READ     Sepal_Length,Sepal_Width,Petal_Length,Petal_Width
 5.1  3.5  1.4  0.2
 4.9  3.0  1.4  0.2
 4.7  3.2  1.3  0.2
 4.6  3.1  1.5  0.2
 5.0  3.6  1.4  0.2
 5.4  3.9  1.7  0.4
 4.6  3.4  1.4  0.3
 5.0  3.4  1.5  0.2
 4.4  2.9  1.4  0.2
 4.9  3.1  1.5  0.1
 5.4  3.7  1.5  0.2
 4.8  3.4  1.6  0.2
 4.8  3.0  1.4  0.1
 4.3  3.0  1.1  0.1
 5.8  4.0  1.2  0.2
 5.7  4.4  1.5  0.4
 5.4  3.9  1.3  0.4
 5.1  3.5  1.4  0.3
 5.7  3.8  1.7  0.3
 5.1  3.8  1.5  0.3
 5.4  3.4  1.7  0.2
 5.1  3.7  1.5  0.4
 4.6  3.6  1.0  0.2
 5.1  3.3  1.7  0.5
 4.8  3.4  1.9  0.2
 5.0  3.0  1.6  0.2
 5.0  3.4  1.6  0.4
 5.2  3.5  1.5  0.2
 5.2  3.4  1.4  0.2
 4.7  3.2  1.6  0.2
 4.8  3.1  1.6  0.2
 5.4  3.4  1.5  0.4
 5.2  4.1  1.5  0.1
 5.5  4.2  1.4  0.2
 4.9  3.1  1.5  0.2
 5.0  3.2  1.2  0.2
 5.5  3.5  1.3  0.2
 4.9  3.6  1.4  0.1
 4.4  3.0  1.3  0.2
 5.1  3.4  1.5  0.2
 5.0  3.5  1.3  0.3
 4.5  2.3  1.3  0.3
 4.4  3.2  1.3  0.2
 5.0  3.5  1.6  0.6
 5.1  3.8  1.9  0.4
 4.8  3.0  1.4  0.3
 5.1  3.8  1.6  0.2
 4.6  3.2  1.4  0.2
 5.3  3.7  1.5  0.2
 5.0  3.3  1.4  0.2
 7.0  3.2  4.7  1.4
 6.4  3.2  4.5  1.5
 6.9  3.1  4.9  1.5
 5.5  2.3  4.0  1.3
 6.5  2.8  4.6  1.5
 5.7  2.8  4.5  1.3
 6.3  3.3  4.7  1.6
 4.9  2.4  3.3  1.0
 6.6  2.9  4.6  1.3
 5.2  2.7  3.9  1.4
 5.0  2.0  3.5  1.0
 5.9  3.0  4.2  1.5
 6.0  2.2  4.0  1.0
 6.1  2.9  4.7  1.4
 5.6  2.9  3.6  1.3
 6.7  3.1  4.4  1.4
 5.6  3.0  4.5  1.5
 5.8  2.7  4.1  1.0
 6.2  2.2  4.5  1.5
 5.6  2.5  3.9  1.1
 5.9  3.2  4.8  1.8
 6.1  2.8  4.0  1.3
 6.3  2.5  4.9  1.5
 6.1  2.8  4.7  1.2
 6.4  2.9  4.3  1.3
 6.6  3.0  4.4  1.4
 6.8  2.8  4.8  1.4
 6.7  3.0  5.0  1.7
 6.0  2.9  4.5  1.5
 5.7  2.6  3.5  1.0
 5.5  2.4  3.8  1.1
 5.5  2.4  3.7  1.0
 5.8  2.7  3.9  1.2
 6.0  2.7  5.1  1.6
 5.4  3.0  4.5  1.5
 6.0  3.4  4.5  1.6
 6.7  3.1  4.7  1.5
 6.3  2.3  4.4  1.3
 5.6  3.0  4.1  1.3
 5.5  2.5  4.0  1.3
 5.5  2.6  4.4  1.2
 6.1  3.0  4.6  1.4
 5.8  2.6  4.0  1.2
 5.0  2.3  3.3  1.0
 5.6  2.7  4.2  1.3
 5.7  3.0  4.2  1.2
 5.7  2.9  4.2  1.3
 6.2  2.9  4.3  1.3
 5.1  2.5  3.0  1.1
 5.7  2.8  4.1  1.3
 6.3  3.3  6.0  2.5
 5.8  2.7  5.1  1.9
 7.1  3.0  5.9  2.1
 6.3  2.9  5.6  1.8
 6.5  3.0  5.8  2.2
 7.6  3.0  6.6  2.1
 4.9  2.5  4.5  1.7
 7.3  2.9  6.3  1.8
 6.7  2.5  5.8  1.8
 7.2  3.6  6.1  2.5
 6.5  3.2  5.1  2.0
 6.4  2.7  5.3  1.9
 6.8  3.0  5.5  2.1
 5.7  2.5  5.0  2.0
 5.8  2.8  5.1  2.4
 6.4  3.2  5.3  2.3
 6.5  3.0  5.5  1.8
 7.7  3.8  6.7  2.2
 7.7  2.6  6.9  2.3
 6.0  2.2  5.0  1.5
 6.9  3.2  5.7  2.3
 5.6  2.8  4.9  2.0
 7.7  2.8  6.7  2.0
 6.3  2.7  4.9  1.8
 6.7  3.3  5.7  2.1
 7.2  3.2  6.0  1.8
 6.2  2.8  4.8  1.8
 6.1  3.0  4.9  1.8
 6.4  2.8  5.6  2.1
 7.2  3.0  5.8  1.6
 7.4  2.8  6.1  1.9
 7.9  3.8  6.4  2.0
 6.4  2.8  5.6  2.2
 6.3  2.8  5.1  1.5
 6.1  2.6  5.6  1.4
 7.7  3.0  6.1  2.3
 6.3  3.4  5.6  2.4
 6.4  3.1  5.5  1.8
 6.0  3.0  4.8  1.8
 6.9  3.1  5.4  2.1
 6.7  3.1  5.6  2.4
 6.9  3.1  5.1  2.3
 5.8  2.7  5.1  1.9
 6.8  3.2  5.9  2.3
 6.7  3.3  5.7  2.5
 6.7  3.0  5.2  2.3
 6.3  2.5  5.0  1.9
 6.5  3.0  5.2  2.0
 6.2  3.4  5.4  2.3
 5.9  3.0  5.1  1.8  :
"form the classification tree"
BCLASSIFICATION [PRINT=indented; GROUPS=Species; TREE=Tree]\
                Sepal_Length,Sepal_Width,Petal_Length,Petal_Width
"prune the tree"
BPRUNE     [PRINT=table,graph] Tree; NEWTREE=Pruned
"use the 4th tree - renumber nodes"
BCUT       [RENUMBER=yes] Pruned[4]; NEWTREE=Tree
"display the tree"
BCDISPLAY  [PRINT=summary,indented,graph] Tree
"check how the original data values are classified"
BCIDENTIFY [PRINT=*; TREE=Tree; IDENTIFICATION=Identification]\
           Sepal_Length,Sepal_Width,Petal_Length,Petal_Width
PRINT      Sepal_Length,Sepal_Width,Petal_Length,Petal_Width,\
           Species,Identification; FIELD=4(13),2(12); DECIMALS=1

Updated on June 20, 2019
