Evaluates and optimizes the k-nearest-neighbour algorithm using cross-validation (D.B. Baird).
Options

|PRINT|Printed output required (error, confusion and/or predictions)|
|SIMILARITY|Provides the similarities between the observations|
|METHOD|How to calculate the prediction from a DATA variate (mean or median); default median|
|MINSIMILARITY|Cut-off minimum value of the similarity for items to be regarded as neighbours; default 0.75|
|MINNEIGHBOURS|Minimum number of nearest neighbours to use; default 5|
|MAXNEIGHBOURS|Maximum number of nearest neighbours to use; default 10|
|NSIMULATIONS|Number of cross-validation sets to use; default 1|
|NCROSSVALIDATIONGROUPS|Number of groups for cross-validation; default 10|
|SEED|Seed for the random numbers used to select cross-validation groups; default 0|

Parameters

|DATA|Data values for the items in the data set|
|PREDICTIONS|Saves the predictions using the optimal options|
|ERROR|Cross-validation error rate for the optimal combination|
|CONFUSION|Confusion matrix for the cross-validation predictions from the optimal options|
|OPTIMAL|Pointer to the optimal options|
KNNTRAIN uses cross-validation to evaluate and optimize the data-mining technique known as k-nearest-neighbour classification. This can be used to select the best options to use with the
KNEARESTNEIGHBOURS procedure, which performs the classification on a single data set.
A range of values to try can be specified for each of the options MINSIMILARITY, MINNEIGHBOURS and MAXNEIGHBOURS (described below).
KNNTRAIN evaluates the cross-validation error for every combination of the values, and selects the optimal values as the best of these combinations.
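The search over option combinations amounts to a simple grid search on the cross-validation error. The Python sketch below illustrates the idea only; it is not Genstat code, and the `cv_error` function and candidate value lists are hypothetical stand-ins for the quantities KNNTRAIN actually evaluates.

```python
from itertools import product

def best_combination(candidates, cv_error):
    """Evaluate every combination of MINSIMILARITY, MINNEIGHBOURS and
    MAXNEIGHBOURS candidate values, and keep the one with the lowest
    cross-validation error (a sketch of the idea, not Genstat's code).

    candidates: dict with keys 'minsim', 'minnb', 'maxnb', each a list.
    cv_error:   hypothetical function(minsim, minnb, maxnb) -> error rate.
    """
    best = None
    for minsim, minnb, maxnb in product(candidates['minsim'],
                                        candidates['minnb'],
                                        candidates['maxnb']):
        if minnb > maxnb:          # skip inconsistent combinations
            continue
        err = cv_error(minsim, minnb, maxnb)
        if best is None or err < best[0]:
            best = (err, {'MINSIMILARITY': minsim,
                          'MINNEIGHBOURS': minnb,
                          'MAXNEIGHBOURS': maxnb})
    return best

# Toy error surface: pretend cut-off 0.9 with 3 neighbours is best.
toy = lambda s, lo, hi: abs(s - 0.9) + abs(hi - 3) * 0.1 + lo * 0.01
err, opts = best_combination({'minsim': [0.75, 0.9],
                              'minnb': [1, 2],
                              'maxnb': [2, 3, 4]}, toy)
print(opts['MAXNEIGHBOURS'])  # 3
```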
Cross-validation works by splitting the data set randomly into groups that are sized as equally as possible. The number of groups to form is specified by the
NCROSSVALIDATIONGROUPS option. The items in each group are then predicted from classifications formed from the data in the other groups. The
NSIMULATIONS option defines how often to repeat this process. Larger values than the default of one can be specified to give more precision.
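The splitting into near-equal random groups can be pictured with a short Python sketch (illustrative only; the round-robin allocation after shuffling is an assumption, not necessarily Genstat's actual algorithm):

```python
import random

def cv_groups(n_items, n_groups, rng):
    """Randomly split item indices into n_groups of as-equal-as-possible
    size: shuffle, then deal the items out round-robin so that group
    sizes differ by at most one."""
    idx = list(range(n_items))
    rng.shuffle(idx)
    return [idx[g::n_groups] for g in range(n_groups)]

rng = random.Random(15243)
groups = cv_groups(23, 10, rng)          # e.g. NCROSSVALIDATIONGROUPS=10
sizes = sorted(len(g) for g in groups)
print(sizes)  # [2, 2, 2, 2, 2, 2, 2, 3, 3, 3]
```

Each group is then predicted in turn from the remaining groups; with NSIMULATIONS greater than one, the whole split-and-predict cycle is repeated with fresh random groups.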
The SIMILARITY option provides the information for
KNNTRAIN to use to determine the nearby items (or nearest neighbours) for each item in the data set. This is a symmetric matrix with a row (and column) for every item. It can be formed in a wide variety of ways, using mixtures of categorical and continuous data, by the
FSIMILARITY directive. For example, if we have a factor Sex, and variates Age, Weight and Height, we could form a symmetric matrix Sim by the statement
FSIMILARITY [SIMILARITY=Sim] Sex,Age,Weight,Height;\
The MINSIMILARITY option sets a minimum value for the similarity between two items if they are to be regarded as neighbours (default 0.75). The
MINNEIGHBOURS option specifies the minimum number of neighbours to try to find (default 5), and the
MAXNEIGHBOURS option specifies the maximum number (default 10). The search for nearest neighbours to obtain the prediction for a particular item works by finding the most similar item in the training set, and adding this (with any equally-similar training items) to the set of neighbours. If at least
MINNEIGHBOURS have been found, the search stops. Otherwise it finds the next most similar items, and adds these to the set of neighbours, continuing until at least
MINNEIGHBOURS have been found. If this results in more than
MAXNEIGHBOURS neighbours, KNEARESTNEIGHBOURS makes a random selection from those that are least similar to the prediction item, so that the number of neighbours becomes MAXNEIGHBOURS.
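The level-by-level neighbour search described above can be sketched in Python (an illustration of the stated rules, not Genstat's implementation; it assumes MINNEIGHBOURS is no greater than MAXNEIGHBOURS):

```python
import random

def nearest_neighbours(similarities, min_nb, max_nb, min_sim, rng):
    """Collect neighbours for one prediction item (a sketch).

    similarities: list of (training_item, similarity) pairs.
    Items are added one similarity level at a time (ties come in
    together) until at least min_nb neighbours are found; if that
    overshoots max_nb, the surplus is removed at random from the
    least similar level.  Assumes min_nb <= max_nb.
    """
    # keep only candidates above the cut-off, most similar first
    cands = sorted((c for c in similarities if c[1] >= min_sim),
                   key=lambda c: -c[1])
    chosen, i = [], 0
    while i < len(cands) and len(chosen) < min_nb:
        level = cands[i][1]
        tied = [c for c in cands if c[1] == level]
        chosen.extend(tied)
        i += len(tied)
    if len(chosen) > max_nb:
        # randomly keep some of the least similar (last-added) level
        last = min(c[1] for c in chosen)
        keep = [c for c in chosen if c[1] > last]
        tied = [c for c in chosen if c[1] == last]
        keep.extend(rng.sample(tied, max_nb - len(keep)))
        chosen = keep
    return chosen

sims = [('a', 0.95), ('b', 0.9), ('c', 0.9), ('d', 0.9), ('e', 0.6)]
nb = nearest_neighbours(sims, min_nb=2, max_nb=3, min_sim=0.75,
                        rng=random.Random(1))
print(len(nb))  # 3
```

Here 'e' falls below the 0.75 cut-off, 'a' is taken first, and the three-way tie at 0.9 overshoots MAXNEIGHBOURS, so two of 'b', 'c', 'd' are kept at random.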
The DATA parameter lists variates and/or factors containing values of the variables of interest for the items in the data set. The predictions from the optimal option combination can be saved using the
PREDICTIONS parameter (in variates and/or factors to match the settings of the
DATA parameter). These predictions are from a jackknife analysis where each item is predicted from the other items in the data set using the optimal options. The cross-validation error can be saved using the
ERROR parameter. The
OPTIMAL option can save a pointer, containing three scalars giving the optimal combination for
MINSIMILARITY, MINNEIGHBOURS and MAXNEIGHBOURS respectively. The pointer has labels giving the option names.
When the DATA parameter is a factor, the category predicted for each item in the prediction set is taken to be the factor level that occurs most often amongst its nearest neighbours. If more than one level occurs most often, the choice is narrowed down by seeing which of the levels has the most similar neighbours. If this still leaves more than one level, the choice is narrowed further by seeing which of the levels has neighbours with the highest mean similarity. Then, if even that does not lead to a single level, the final choice is made at random. The confusion matrix for the optimal option combination can be saved with the
CONFUSION parameter. This gives the percentage of items allocated from each observed group (in rows) to the predicted groups (in columns).
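The tie-breaking cascade for a factor described above can be sketched in Python (illustrative only; interpreting "most similar neighbours" as the highest single similarity within a level is an assumption):

```python
import random
from collections import Counter

def predict_category(neighbours, rng):
    """Tie-break cascade for a factor (a sketch of the rules above).

    neighbours: list of (category, similarity) pairs.
    1. level occurring most often; 2. level with the single most
    similar neighbour; 3. level with the highest mean similarity;
    4. final choice at random.
    """
    counts = Counter(cat for cat, _ in neighbours)
    top = max(counts.values())
    tied = [cat for cat, n in counts.items() if n == top]
    if len(tied) > 1:                       # most similar neighbour
        best = {cat: max(s for c, s in neighbours if c == cat)
                for cat in tied}
        tied = [cat for cat in tied if best[cat] == max(best.values())]
    if len(tied) > 1:                       # highest mean similarity
        mean = {cat: sum(s for c, s in neighbours if c == cat) /
                     counts[cat] for cat in tied}
        tied = [cat for cat in tied if mean[cat] == max(mean.values())]
    if len(tied) > 1:                       # final choice at random
        return rng.choice(tied)
    return tied[0]

nb = [('glass1', 0.95), ('glass2', 0.95),
      ('glass2', 0.80), ('glass1', 0.90)]
print(predict_category(nb, random.Random(0)))  # glass1
```

In the example both levels occur twice and share the top similarity 0.95, so the higher mean similarity of 'glass1' (0.925 against 0.875) decides.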
When the DATA parameter is a variate, the
METHOD option controls whether the prediction is made by the median (default) or the mean of the data values of the nearest neighbours of each prediction item.
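As a small illustration of the METHOD choice (a Python sketch, not Genstat code):

```python
from statistics import mean, median

def predict_value(neighbour_values, method='median'):
    """Prediction for a DATA variate from the data values of the
    item's nearest neighbours: the median by default, or the mean."""
    return median(neighbour_values) if method == 'median' \
        else mean(neighbour_values)

vals = [4.0, 5.0, 9.0]
print(predict_value(vals))           # 5.0  (default: median)
print(predict_value(vals, 'mean'))   # 6.0
```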
Printed output is controlled by the PRINT option, with settings:
error to print the cross-validation errors for all combinations,
confusion to print the confusion matrix for the optimal options (only if
DATA is a factor), and
predictions to print the predictions from the optimal options.
The default is error. The SEED option specifies the seed for the random numbers that are used in the selection of tied outcomes and for the selection of cross-validation groups. The default of zero continues an existing sequence of random numbers if any have already been used in this Genstat job, or initializes the seed automatically.
CAPTION 'KNNTRAIN example','Classification of 6 types of glass'; \
        STYLE=major,minor
SPLOAD [PRINT=summary] '%DATA%/GlassTypes.gsh'
FSIMILARITY [SIMILARITY=Sim] RI,Na,Mg,Al,Si,K,Ca,Ba,Fe; TEST=euclidean
KNNTRAIN [SIMILARITY=Sim; MINS=0.9; MINNEIGHBOURS=!(1...5); \
        MAXNEIGHBOURS=!(1...8); NSIMULATIONS=2; SEED=15243] Type