
KNNTRAIN procedure

Evaluates and optimizes the k-nearest-neighbour algorithm using cross-validation (D.B. Baird).


Options

PRINT = string tokens Printed output required (error, confusion, predictions); default erro, conf
SIMILARITY = symmetric matrix Provides the similarities between the observations
METHOD = string token How to calculate the prediction from a DATA variate (mean, median); default medi
MINSIMILARITY = scalar or variate Cut-off minimum value of the similarity for items to be regarded as neighbours; default 0.75
MINNEIGHBOURS = scalar or variate Minimum number of nearest neighbours to use; default 5
MAXNEIGHBOURS = scalar or variate Maximum number of nearest neighbours to use; default 10
NSIMULATIONS = scalar Number of cross-validation sets to use; default 1
NCROSSVALIDATIONGROUPS = scalar Number of groups for cross-validation; default 10
SEED = scalar Seed for the random numbers used to select cross-validation groups; default 0


Parameters

DATA = variate or factor Data values for the items in the data set
PREDICTIONS = variate or factor Saves the predictions using the optimal options
ERROR = scalar Cross-validation error rate for the optimal combination
CONFUSION = matrix Confusion matrix for the cross-validation predictions from the optimal options
OPTIMAL = pointer Pointer to the optimal options


KNNTRAIN uses cross-validation to evaluate and optimize the data-mining technique known as k-nearest-neighbour classification. This can be used to select the best options to use with the KNEARESTNEIGHBOURS procedure, which performs the classification on a single data set.

A range of values to try can be specified for each of the options MINSIMILARITY, MINNEIGHBOURS and MAXNEIGHBOURS (described below). KNNTRAIN evaluates the cross-validation error for every combination of the values, and selects the optimal values as the best of these combinations.
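For example, ranges of values can be supplied as variates. This sketch assumes a symmetric matrix Sim has already been formed (as described below) and that Type is a DATA factor; the numerical values are illustrative only:

KNNTRAIN    [SIMILARITY=Sim; MINSIMILARITY=!(0.65,0.70,0.75,0.80);\
            MINNEIGHBOURS=!(3,5,7); MAXNEIGHBOURS=!(8,10,12)] Type

KNNTRAIN then evaluates the cross-validation error for each of the 4 × 3 × 3 = 36 combinations of these values, and selects the best combination.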

Cross-validation works by splitting the data set randomly into groups that are sized as equally as possible. The number of groups to form is specified by the NCROSSVALIDATIONGROUPS option. The items in each group are then predicted from classifications formed from the data in the other groups. The NSIMULATIONS option defines how often to repeat this process. Larger values than the default of one can be specified to give more precision.
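For example, to use 5-fold cross-validation repeated four times (again assuming Sim and Type as above, with illustrative values for the seed):

KNNTRAIN    [SIMILARITY=Sim; NCROSSVALIDATIONGROUPS=5;\
            NSIMULATIONS=4; SEED=71553] Type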

The SIMILARITY option provides the information for KNNTRAIN to use to determine the nearby items (or nearest neighbours) for each item in the data set. This is a symmetric matrix with a row (and column) for every item. It can be formed in a wide variety of ways, using mixtures of categorical and continuous data, by the FSIMILARITY directive. For example, if we have a factor Sex, and variates Age, Weight and Height, we could form a symmetric matrix Sim by

FSIMILARITY [SIMILARITY=Sim] Sex,Age,Weight,Height

The MINSIMILARITY option sets a minimum value for the similarity between two items if they are to be regarded as neighbours (default 0.75). The MINNEIGHBOURS option specifies the minimum number of neighbours to try to find (default 5), and the MAXNEIGHBOURS option specifies the maximum number (default 10). The search for nearest neighbours to obtain the prediction for a particular item works by finding the most similar item in the training set, and adding this (with any equally-similar training items) to the set of neighbours. If at least MINNEIGHBOURS have been found, the search stops. Otherwise it finds the next most similar items, and adds these to the set of neighbours, continuing until at least MINNEIGHBOURS have been found. If this results in more than MAXNEIGHBOURS neighbours, KNEARESTNEIGHBOURS makes a random selection from those that are least similar to the prediction item, so that the number of neighbours becomes MAXNEIGHBOURS.

The DATA parameter lists variates and/or factors containing values of the variables of interest for the items in the data set. The predictions from the optimal option combination can be saved using the PREDICTIONS parameter (in variates and/or factors to match the settings of the DATA parameter). These predictions are from a jackknife analysis where each item is predicted from the other items in the data set using the optimal options. The cross-validation error can be saved using the ERROR parameter. The OPTIMAL parameter can save a pointer, containing three scalars giving the optimal combination for MINSIMILARITY, MINNEIGHBOURS and MAXNEIGHBOURS respectively. The pointer has labels giving the option names.
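For example, the results might be saved as follows (the identifiers Ptype, CVerror, Cmat and Opt are arbitrary; Sim and Type are assumed as above):

KNNTRAIN    [PRINT=*; SIMILARITY=Sim] Type; PREDICTIONS=Ptype;\
            ERROR=CVerror; CONFUSION=Cmat; OPTIMAL=Opt
PRINT       Opt[]

Because the pointer is labelled with the option names, the individual scalars can be referred to as Opt['MINSIMILARITY'], Opt['MINNEIGHBOURS'] and Opt['MAXNEIGHBOURS'], for example when setting the corresponding options of KNEARESTNEIGHBOURS.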

For a DATA factor, the category predicted for each item in the prediction set is taken to be the factor level that occurs most often amongst its nearest neighbours. If more than one level occurs most often, the choice is narrowed down by seeing which of the levels has the most similar neighbours. If this still leaves more than one level, the choice is narrowed further by seeing which of the levels has neighbours with the highest mean similarity. Then, if even that does not lead to a single level, the final choice is made at random. The confusion matrix for the optimal option combination can be saved with the CONFUSION parameter. This gives the percentage of items allocated from each observed group (in rows) to the predicted groups (in columns).

For a DATA variate, the METHOD option controls whether the prediction is made by the median (default) or the mean of the data values of the nearest neighbours of each prediction item.
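For example, to predict a variate by the mean, rather than the default median, of the neighbours' values (here Weight is an assumed DATA variate and Pweight an arbitrary identifier):

KNNTRAIN    [SIMILARITY=Sim; METHOD=mean] Weight; PREDICTIONS=Pweight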

Printed output is controlled by the PRINT option, with settings:

error to print the cross-validation errors for all combinations,

confusion to print the confusion matrix for the optimal options (only if DATA is a factor), and

predictions to print the predictions from the optimal options.

The default is PRINT=error,confusion.

The SEED option specifies the seed for the random numbers that are used in the selection of tied outcomes and for the selection of cross-validation groups. The default of zero continues an existing sequence of random numbers if any have already been used in this Genstat job, or initializes the seed automatically.


See also

Commands for: Data mining, Multivariate and cluster analysis.


Example

CAPTION     'KNNTRAIN example','Classification of 6 types of glass'; \
            STYLE=meta,plain
SPLOAD      [PRINT=summary] '%DATA%/GlassTypes.gsh'
FSIMILARITY [SIMILARITY=Sim] RI,Na,Mg,Al,Si,K,Ca,Ba,Fe
KNNTRAIN    [SIMILARITY=Sim;\
            MAXNEIGHBOURS=!(1...8); NSIMULATIONS=2; SEED=15243] Type
Updated on February 6, 2023
