
BASSESS directive

Assesses potential splits for regression and classification trees.

Options

Y = variate or factor
    Response variate for a regression tree, or factor specifying the groupings for a classification tree
SELECTED = dummy
    Returns the identifier of the X variate or factor used in the best split
TESTSPLIT = expression structure
    Logical expression representing the best split
MAXSPLITPOINT = scalar or variate
    When the selected X structure is a variate or a factor with ordered levels, this returns a scalar containing the boundary between the two splits; when it is a factor with unordered levels, it returns a variate containing the levels allocated to the first split
MAXCRITERION = scalar
    Maximum value obtained for the selection criterion
NOSELECTION = scalar
    Returns the value 1 if no split has been selected, otherwise 0
FMETHOD = string token
    Selection method to use when Y is a factor (Gini, MPI); default Gini
ANTIENDCUTFACTOR = string token
    Anti-end-cut factor to use when Y is a factor (classnumber, reciprocalentropy); default * i.e. none
WEIGHTS = variate
    Weights; default * i.e. all weights 1
TOLERANCE = scalar
    Tolerance multiplier used e.g. to check for equality of x-values; default * i.e. set automatically for the implementation concerned

Parameters

X = variates or factors
    Variables available to make the split
ORDERED = string tokens
    Whether factor levels are ordered (yes, no); default no
SPLITPOINT = scalars or variates
    Saves details of the best split found for each X variable; when X is a variate or a factor with ordered levels, this returns a scalar containing the boundary between the two splits; when X is a factor with unordered levels, it returns a variate containing the levels allocated to the first split
CRITERIONVALUE = scalars
    Saves the value of the selection criterion for the best split found for each X variable

 

Description

BASSESS selects splits for use when constructing classification or regression trees. The Y option specifies the factor defining the groupings for a classification tree, or the response variate for a regression tree. The x-variables that are available to make the split are supplied by the X parameter. They can be variates, or factors with either ordered or unordered levels as indicated by the ORDERED parameter. For example, a factor called Dose with levels 1, 1.5, 2 and 2.5 would usually be treated as having ordered levels, whereas a factor called Drug with levels labelled 'Morphine', 'Amidone', 'Phenadoxone' and 'Pethidine' would be regarded as having unordered levels.

In a regression tree, the accuracy of each node is the sum of squared deviations of the response values from their mean, taken over the observations at the node and divided by the total number of observations. Potential splits are assessed by their effect on the accuracy, that is, by the difference between the initial accuracy and the sum of the accuracies of the two successor nodes resulting from the split.
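For concreteness, the sketch below (illustrative Python with made-up data, not Genstat code; the variable names and values are invented for the example) assesses candidate splits of a single x-variate in exactly this way, taking each midpoint between successive distinct x-values as a candidate boundary.

import numpy as np

def node_accuracy(y, total_n):
    # 'Accuracy' as described above: sum of squared deviations of the
    # y-values from their mean, divided by the total number of observations.
    return np.sum((y - y.mean()) ** 2) / total_n

def split_reduction(y, x, boundary):
    # Reduction in accuracy for the split x <= boundary versus x > boundary.
    total_n = len(y)
    left, right = y[x <= boundary], y[x > boundary]
    return (node_accuracy(y, total_n)
            - node_accuracy(left, total_n)
            - node_accuracy(right, total_n))

# Made-up data; candidate boundaries are midpoints between distinct x-values.
y = np.array([4.1, 3.8, 5.6, 6.0, 7.2, 6.9])
x = np.array([1.0, 1.5, 2.0, 2.5, 3.0, 3.5])
xs = np.unique(x)
candidates = (xs[:-1] + xs[1:]) / 2
best = max(candidates, key=lambda b: split_reduction(y, x, b))
print(best, split_reduction(y, x, best))

The split chosen is the one giving the largest reduction, which corresponds to the selection criterion saved by the CRITERIONVALUE and MAXCRITERION structures.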

For a classification tree, the FMETHOD option allows one of two selection criteria to be requested, either Gini information or the MPI (mean posterior improvement) criterion of Taylor & Silverman (1993). The default is to use Gini information. The ANTIENDCUTFACTOR option allows you to request use of adaptive anti-end-cut factors as devised by Taylor & Silverman (1993, Section 5). Further details are given in the Methods section. By default no adaptive factors are used.

The SPLITPOINT parameter can be used to save details of the best split found for each X variable. When X is a variate or a factor with ordered levels, this returns a scalar containing the boundary between the two splits. Alternatively, when X is a factor with unordered levels, it returns a variate containing the levels allocated to the first split. The CRITERIONVALUE parameter saves the value of the selection criterion for the best split found for each X variable.
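To illustrate how the two forms of split point divide the observations, here is a short hypothetical sketch (Python rather than Genstat syntax, reusing the Dose and Drug examples above; the particular boundary and level subset are invented).

import numpy as np

# Scalar boundary (variate or factor with ordered levels): the first successor
# set contains the units on one side of the boundary (taken here as the units
# with values at or below it).
dose = np.array([1.0, 1.5, 2.0, 2.5])
boundary = 1.75                            # hypothetical saved split point
in_first_set = dose <= boundary            # [True, True, False, False]

# Variate of levels (factor with unordered levels): the first successor set
# contains the units whose level is among the levels allocated to it.
drug = np.array(['Morphine', 'Amidone', 'Phenadoxone', 'Pethidine'])
first_levels = ['Morphine', 'Pethidine']   # hypothetical saved split point
in_first_set_factor = np.isin(drug, first_levels)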

The SELECTED option can be set to a dummy to store the identifier of the X variate or factor used in the best split, and the MAXSPLITPOINT option can save details of the best split, similarly to the SPLITPOINT parameter. The MAXCRITERION option saves the maximum value obtained for the selection criterion, and the NOSELECTION option saves a scalar containing the value 0 if a split could be selected or 1 if no further splitting was possible. You can save a logical expression representing the best split using the TESTSPLIT option. So, for example, you can put

BASSESS [Y=Yvar; TESTSPLIT=Test; ...]

RESTRICT Yvar; #Test == 1

PRINT Yvar

to print the y-values of the individuals in the first successor set. BASSESS takes account of restrictions on Y or on any of the X variates or factors. So you could now also use BASSESS to find the best split within that set.

The WEIGHTS option can supply a variate of weights for the observations. This could be used to supply prior probabilities, or to emphasize units that are perceived as being especially important.

Finally, the TOLERANCE option can be used to modify the tolerance multiplier used internally for example to check for equality of x-values. By default this is set automatically to a value appropriate for the Genstat implementation concerned.

Options: Y, SELECTED, TESTSPLIT, MAXSPLITPOINT, MAXCRITERION, NOSELECTION, FMETHOD, ANTIENDCUTFACTOR, WEIGHTS, TOLERANCE.

Parameters: X, ORDERED, SPLITPOINT, CRITERIONVALUE.

Method

Further general information about classification and regression trees can be found in Breiman et al. (1984). The methods used by BASSESS for classification trees are based on Taylor & Silverman (1993). The Gini setting of the FMETHOD option uses the change in Gini information:

G = (1 – ∑k αk²) – (∑k β1k) × (1 – ∑k β1k²) – (∑k β2k) × (1 – ∑k β2k²)

where αk is the proportion of individuals in the original set that are in group k, and βik is the proportion of individuals in successor set i (i = 1 or 2) that are in group k. The aim here is to split the individuals into sets to maximize differences between the within-set group probabilities. An equivalent formula (Taylor & Silverman 1993, Section 4) is

G = (p1 × p2) × { ∑k β1k² + ∑k β2k² – ∑k ( β1k × β2k ) }

where pi = ∑k βik. The alternative MPI (mean posterior improvement) criterion concentrates more on making the group probabilities differ between the successor sets:

MPI = (p1 × p2) × { 1 – ∑k (( β1k × β2k) / ( β1k + β2k)) }
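As a rough illustration, the sketch below (Python with made-up counts, not a description of Genstat's internal code) evaluates both criteria from the group counts in the two successor sets. The change in Gini information is computed here in its usual decrease-in-impurity form, and the MPI term follows the formula quoted above, with βik taken as the within-set group proportions and pi as the proportion of individuals in successor set i.

import numpy as np

def split_criteria(counts1, counts2):
    # counts1, counts2: numbers of individuals of each group in the two successor sets.
    counts1 = np.asarray(counts1, dtype=float)
    counts2 = np.asarray(counts2, dtype=float)
    n1, n2 = counts1.sum(), counts2.sum()
    p1, p2 = n1 / (n1 + n2), n2 / (n1 + n2)
    alpha = (counts1 + counts2) / (n1 + n2)   # group proportions in the original set
    b1, b2 = counts1 / n1, counts2 / n2       # within-set group proportions
    # Change in Gini information, as the usual decrease in Gini impurity.
    gini = ((1 - np.sum(alpha ** 2))
            - p1 * (1 - np.sum(b1 ** 2))
            - p2 * (1 - np.sum(b2 ** 2)))
    # MPI criterion, following the formula quoted above.
    mpi = p1 * p2 * (1 - np.sum(b1 * b2 / (b1 + b2)))
    return gini, mpi

# Made-up counts for three groups.
print(split_criteria([8, 1, 1], [2, 7, 6]))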

Taylor & Silverman (1993) note that the term (p1 × p2) aims to generate successor sets of similar size, and refer to it as the anti-end-cut factor because it aims to avoid sets being produced with only a small number of individuals. They suggest that this should vary according to the complexity of the problem, and instead become

min { p1 × p2, plow × (1 – plow) }

where plow is the reciprocal of the number of groups in the initial set for the classnumber setting of the ANTIENDCUTFACTOR option, and

min { 0.5, 1 / ( ∑k αk² ) }

for the reciprocalentropy setting. The idea is to encourage splits that lead to terminal nodes, and to take account of the fact that these are more likely to be generated as the number of groups becomes small.
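A minimal sketch of the classnumber factor, assuming p1 and p2 are the proportions of individuals in the two successor sets (hypothetical Python, with names chosen for the example):

def classnumber_factor(p1, p2, number_of_groups):
    # Adaptive anti-end-cut factor min{ p1*p2, plow*(1 - plow) }, where plow
    # is the reciprocal of the number of groups in the initial set.
    p_low = 1.0 / number_of_groups
    return min(p1 * p2, p_low * (1 - p_low))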

Action with RESTRICT

You can request that BASSESS operate on only a subset of the units by applying a restriction to the Y variate or factor, or to any of the X variates or factors, or to the WEIGHTS variate.

References

Breiman, L., Friedman, J.H., Olshen, R.A. & Stone, C.J. (1984). Classification and Regression Trees. Wadsworth, Monterey.

Taylor, P.C. & Silverman, B.W. (1993). Block diagrams and splitting criteria for classification trees. Statistics and Computing, 3, 147-161.

See also

Directives: BCUT, BGROW, BIDENTIFY, BJOIN, TREE.

Procedures: BCONSTRUCT, BCLASSIFICATION, BGRAPH, BKEY, BPRINT, BPRUNE.

Commands for: Calculations and manipulation.

