Evaluates and optimizes the k-nearest-neighbour algorithm using cross-validation (D.B. Baird).

### Options

`PRINT` = string tokens
Printed output required (`error`, `confusion`, `predictions`); default `erro`, `conf`

`SIMILARITY` = symmetric matrix
Provides the similarities between the observations

`METHOD` = string token
How to calculate the prediction from a `DATA` variate (`mean`, `median`); default `medi`

`MINSIMILARITY` = scalar or variate
Cut-off minimum value of the similarity for items to be regarded as neighbours; default 0.75

`MINNEIGHBOURS` = scalar or variate
Minimum number of nearest neighbours to use; default 5

`MAXNEIGHBOURS` = scalar or variate
Maximum number of nearest neighbours to use; default 10

`NSIMULATIONS` = variate
Number of cross-validation sets to use; default 1

`NCROSSVALIDATIONGROUPS` = scalar
Number of groups for cross-validation; default 10

`SEED` = scalar
Seed for the random numbers used to select cross-validation groups; default 0

### Parameters

`DATA` = variate or factor
Data values for the items in the data set

`PREDICTIONS` = variate or factor
Saves the predictions using the optimal options

`ERROR` = scalar
Cross-validation error rate for the optimal combination

`CONFUSION` = matrix
Confusion matrix for the cross-validation predictions from the optimal options

`OPTIMAL` = pointer
Pointer to the optimal options

### Description

`KNNTRAIN` uses cross-validation to evaluate and optimize the data-mining technique known as *k-nearest-neighbour* classification. This can be used to select the best options to use with the `KNEARESTNEIGHBOURS` procedure, which performs the classification on a single data set.

A range of values to try can be specified for each of the options `MINSIMILARITY`, `MINNEIGHBOURS` and `MAXNEIGHBOURS` (described below). `KNNTRAIN` evaluates the cross-validation error for every combination of the values, and selects the optimal values as the best of these combinations.
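The exhaustive search over option combinations can be sketched in Python (illustrative only; `cv_error` is a hypothetical stand-in for the cross-validation error that Genstat computes, not part of any real API):

```python
from itertools import product

def pick_optimal(cv_error, min_sims, min_neighbours, max_neighbours):
    """Evaluate cv_error for every combination of the candidate values
    and return (error, options) for the smallest error, mirroring the
    exhaustive search described above.  A sketch, not KNNTRAIN itself."""
    best = None
    for s, lo, hi in product(min_sims, min_neighbours, max_neighbours):
        if lo > hi:                     # skip impossible combinations
            continue
        err = cv_error(s, lo, hi)
        if best is None or err < best[0]:
            best = (err, {'MINSIMILARITY': s,
                          'MINNEIGHBOURS': lo,
                          'MAXNEIGHBOURS': hi})
    return best

# toy error surface, minimized at s=0.8, lo=3, hi=6 (made-up numbers)
toy = lambda s, lo, hi: abs(s - 0.8) + abs(lo - 3) + abs(hi - 6)
err, opts = pick_optimal(toy, [0.75, 0.8], [3, 5], [6, 10])
```

The returned dictionary plays the role of the `OPTIMAL` pointer, whose labels give the option names.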

Cross-validation works by splitting the data set randomly into groups that are sized as equally as possible. The number of groups to form is specified by the `NCROSSVALIDATIONGROUPS` option. The items in each group are then predicted from classifications formed from the data in the other groups. The `NSIMULATIONS` option defines how often to repeat this process. Larger values than the default of one can be specified to give more precision.
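The random split into near-equal groups can be sketched as follows (a minimal illustration of the scheme just described, not Genstat's actual grouping code):

```python
import random

def cv_groups(n_items, n_groups, seed):
    """Randomly split item indices 0..n_items-1 into n_groups groups
    whose sizes differ by at most one."""
    rng = random.Random(seed)
    idx = list(range(n_items))
    rng.shuffle(idx)
    # deal the shuffled indices round-robin into the groups
    return [idx[g::n_groups] for g in range(n_groups)]

groups = cv_groups(23, 10, seed=15243)
sizes = sorted(len(g) for g in groups)
```

With 23 items and 10 groups, three groups get 3 items and seven get 2, so the sizes are as equal as possible; each `NSIMULATIONS` repeat would simply call `cv_groups` again with fresh random numbers.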

The `SIMILARITY` option provides the information for `KNNTRAIN` to use to determine the nearby items (or *nearest neighbours*) for each item in the data set. This is a symmetric matrix with a row (and column) for every item. It can be formed in a wide variety of ways, using mixtures of categorical and continuous data, by the `FSIMILARITY` directive. For example, if we have a factor `Sex`, and variates `Age`, `Weight` and `Height`, we could form a symmetric matrix `Sim` by

```
FSIMILARITY [SIMILARITY=Sim] Sex,Age,Weight,Height;\
            TEST=simplematching,3(euclidean)
```
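The idea behind such a mixed-data similarity can be sketched in Python with a Gower-style coefficient: simple matching for factors, and a range-scaled contribution for variates. This is an assumption-laden illustration; `FSIMILARITY`'s exact `euclidean` test may be defined differently, and the sketch assumes no variate is constant:

```python
def similarity(items, kinds):
    """Pairwise similarity for mixed data: 1/0 simple matching for
    'factor' columns, and 1 - |difference|/range for 'variate' columns,
    averaged over columns.  A Gower-style sketch, not FSIMILARITY."""
    n, p = len(items), len(kinds)
    # ranges of the continuous columns, used to scale each contribution to [0, 1]
    ranges = [max(row[j] for row in items) - min(row[j] for row in items)
              if kinds[j] == 'variate' else None for j in range(p)]
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            total = 0.0
            for j in range(p):
                if kinds[j] == 'factor':
                    total += 1.0 if items[i][j] == items[k][j] else 0.0
                else:
                    total += 1.0 - abs(items[i][j] - items[k][j]) / ranges[j]
            sim[i][k] = total / p
    return sim

data = [['M', 30.0], ['F', 40.0], ['M', 50.0]]
S = similarity(data, ['factor', 'variate'])
```

Items identical on every column get similarity 1; items that differ everywhere, and maximally on every variate, get 0.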

The `MINSIMILARITY` option sets a minimum value for the similarity between two items if they are to be regarded as neighbours (default 0.75). The `MINNEIGHBOURS` option specifies the minimum number of neighbours to try to find (default 5), and the `MAXNEIGHBOURS` option specifies the maximum number (default 10). The search for nearest neighbours to obtain the prediction for a particular item works by finding the most similar item in the training set, and adding this (with any equally-similar training items) to the set of neighbours. If at least `MINNEIGHBOURS` have been found, the search stops. Otherwise it finds the next most similar items, and adds these to the set of neighbours, continuing until at least `MINNEIGHBOURS` have been found. If this results in more than `MAXNEIGHBOURS` neighbours, `KNEARESTNEIGHBOURS` makes a random selection from those that are least similar to the prediction item, so that the number of neighbours becomes `MAXNEIGHBOURS`.
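The stepwise search just described (add the most similar item together with any ties, stop once `MINNEIGHBOURS` is reached, then randomly prune the least-similar ties back to `MAXNEIGHBOURS`) can be sketched as:

```python
import random

def nearest_neighbours(sims, min_sim, min_nb, max_nb, rng=None):
    """Select the neighbours of one prediction item.
    sims: list of (training index, similarity) pairs.
    A sketch of the search described above, not KNNTRAIN's own code."""
    rng = rng or random.Random(0)
    candidates = sorted((s for s in sims if s[1] >= min_sim),
                        key=lambda s: -s[1])
    chosen, i = [], 0
    while i < len(candidates) and len(chosen) < min_nb:
        top = candidates[i][1]
        tied = [c for c in candidates[i:] if c[1] == top]  # equally-similar items
        chosen.extend(tied)
        i += len(tied)
    if len(chosen) > max_nb:
        # randomly drop among the least similar until max_nb remain
        worst = min(c[1] for c in chosen)
        keep = [c for c in chosen if c[1] > worst]
        pool = [c for c in chosen if c[1] == worst]
        keep.extend(rng.sample(pool, max_nb - len(keep)))
        chosen = keep
    return chosen

sims = [(0, 0.9), (1, 0.8), (2, 0.8), (3, 0.7), (4, 0.95)]
nb = nearest_neighbours(sims, min_sim=0.75, min_nb=3, max_nb=3)
```

Here item 3 falls below `MINSIMILARITY`, the tie at 0.8 briefly pushes the set to four neighbours, and one of the two tied items is dropped at random. For a `DATA` variate the prediction would then be, say, the median of the neighbours' data values under the default `METHOD=median`.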

The `DATA` parameter lists variates and/or factors containing values of the variables of interest for the items in the data set. The predictions from the optimal option combination can be saved using the `PREDICTIONS` parameter (in variates and/or factors to match the settings of the `DATA` parameter). These predictions are from a jackknife analysis where each item is predicted from the other items in the data set using the optimal options. The cross-validation error can be saved using the `ERROR` parameter. The `OPTIMAL` parameter can save a pointer containing three scalars, giving the optimal combination for `MINSIMILARITY`, `MINNEIGHBOURS` and `MAXNEIGHBOURS` respectively. The pointer has labels giving the option names.

For a `DATA` factor, the category predicted for each item in the prediction set is taken to be the factor level that occurs most often amongst its nearest neighbours. If more than one level occurs most often, the choice is narrowed down by seeing which of the levels has the most similar neighbours. If this still leaves more than one level, the choice is narrowed further by seeing which of the levels has neighbours with the highest mean similarity. Then, if even that does not lead to a single level, the final choice is made at random. The confusion matrix for the optimal option combination can be saved with the `CONFUSION` parameter. This gives the percentage of items allocated from each observed group (in rows) to the predicted groups (in columns).

For a `DATA` variate, the `METHOD` option controls whether the prediction is made by the median (default) or the mean of the data values of the nearest neighbours of each prediction item.

Printed output is controlled by the `PRINT` option, with settings: `error` to print the cross-validation errors for all combinations, `confusion` to print the confusion matrix for the optimal options (only if `DATA` is a factor), and `predictions` to print the predictions from the optimal options. The default is `PRINT=error,confusion`.

The `SEED` option specifies the seed for the random numbers that are used in the selection of tied outcomes and for the selection of cross-validation groups. The default of zero continues an existing sequence of random numbers if any have already been used in this Genstat job, or initializes the seed automatically.

Options: `PRINT`, `SIMILARITY`, `METHOD`, `MINSIMILARITY`, `MINNEIGHBOURS`, `MAXNEIGHBOURS`, `NSIMULATIONS`, `NCROSSVALIDATIONGROUPS`, `SEED`.

Parameters: `DATA`, `PREDICTIONS`, `ERROR`, `CONFUSION`, `OPTIMAL`.

### See also

Directives: `FSIMILARITY`, `ASRULES`, `NNFIT`, `RBFIT`.

Procedures: `KNEARESTNEIGHBOURS`, `BCLASSIFICATION`, `BCFOREST`, `BREGRESSION`, `SOM`.

Commands for: Data mining, Multivariate and cluster analysis.

### Example

```
CAPTION 'KNNTRAIN example','Classification of 6 types of glass'; \
        STYLE=major,minor
SPLOAD [PRINT=summary] '%DATA%/GlassTypes.gsh'
FSIMILARITY [SIMILARITY=Sim] RI,Na,Mg,Al,Si,K,Ca,Ba,Fe; TEST=euclidean
KNNTRAIN [SIMILARITY=Sim; MINS=0.9; MINNEIGHBOURS=!(1...5); \
        MAXNEIGHBOURS=!(1...8); NSIMULATIONS=2; SEED=15243] Type
```