Classifies items or predicts their responses by examining their *k* nearest neighbours (R.W. Payne).

### Options

`PRINT` = string tokens |
Printed output required (`neighbours` , `predictions` ); default `pred` |
---|---|

`SIMILARITY` = matrix or symmetric matrix |
Provides the similarities between the training and prediction sets of items |

`NEIGHBOURS` = pointer |
Pointer with a variate for each prediction item to save the numbers of its nearest neighbours in the training set |

`GROUPS` = factor |
Defines groupings to identify the training and prediction sets of items when `SIMILARITY` is a symmetric matrix |

`LEVTRAINING` = scalar or text |
Identifies the level of `GROUPS` or dimension of `SIMILARITY` that represents the training set; default 1 |

`LEVPREDICTION` = scalar or text |
Identifies the level of `GROUPS` or dimension of `SIMILARITY` that represents the prediction set; default 2 |

`METHOD` = string token |
How to calculate the prediction from a `DATA` variate (`mean` , `median` ); default `medi` |

`MINSIMILARITY` = scalar |
Cut-off minimum value of the similarity for items to be regarded as neighbours; default 0.75 |

`MINNEIGHBOURS` = scalar |
Minimum number of nearest neighbours to use; default 5 |

`MAXNEIGHBOURS` = scalar |
Maximum number of nearest neighbours to use; default 10 |

`SEED` = scalar |
Seed for the random numbers used to select neighbours when more than `MAXNEIGHBOURS` are available; default 0 |

### Parameters

`DATA` = variates or factors |
Data values for the items in the training set |
---|---|

`PREDICTIONS` = variates or factors |
Saves the predictions |

### Description

`KNEARESTNEIGHBOURS`

provides the data-mining technique known as *k-nearest-neighbour* classification. This allocates unknown items to a category, or it predicts their (continuous) responses, by looking at nearby items in a known data set. The known data set is usually called the *training set*, and we will call the unknown items the *prediction set*.

The `SIMILARITY`

option provides a similarity matrix for `KNEARESTNEIGHBOURS`

to use to determine the nearby items in the training set (or *nearest neighbours*) for each item in the prediction set. This can be a symmetric matrix with a row (and column) for every item in the combined set of training and prediction items. The `GROUPS`

option must then be set to a factor with one level for the training items and another for the prediction items. By default the training set has level 1 and the prediction set has level 2, but these can be changed by the `LEVTRAINING`

and `LEVPREDICTION`

options. Matrices like these can be formed in a wide variety of ways, using mixtures of categorical and continuous data, by the `FSIMILARITY`

directive. For example, if we have a factor `Sex`

, and variates `Age`

, `Weight`

and `Height`

whose values are known for both the training and prediction items, we could form a symmetric matrix `Sim`

by

`FSIMILARITY [SIMILARITY=Sim] Sex,Age,Weight,Height;\`

` TEST=simplematching,3(euclidean)`

However, `Sim`

will contain unnecessary information, as we need the similarities between prediction and training items, but not between training items or between prediction items. So, for large data sets, it will be more efficient to form a (rectangular) between-group similarity matrix by setting the `GROUPS`

option of `FSIMILARITY`

. For example

`FSIMILARITY [SIMILARITY=Gsim; GROUPS=Gfac] Sex,Age,Weight,Height;\`

` TEST=simplematching,3(euclidean)`

where Gfac is a factor with two levels, one for the training set (usually level 1), and the other for the prediction set (usually level 2). You then no longer need to set the `GROUPS`

option of `KNEARESTNEIGHBOUR`

. The `LEVTRAINING`

and `LEVPREDICTION`

options now specify the dimension of the similarity matrix (1 for rows, and 2 for columns) that correspond to the training and prediction data sets, respectively. (They still correspond to group levels though, as they are defined by the numbers of the respective levels of the `GROUPS`

factor in `FSIMILARITY`

.)

The `MINSIMILARITY`

option sets a minimum value on the similarity between two items if they are to be regarded as neighbours (default 0.75). The `MINNEIGHBOURS`

option specifies the minimum number of neighbours to try to find (default 5), and the `MAXNEIGHBOURS`

option specifies the maximum number (default 10). The search for nearest neighbours for a particular prediction item works by finding the most similar item in the training set, and adding this (with any equally-similar training items) to the set of neighbours. If at least `MINNEIGHBOURS`

have been found, the search stops. Otherwise it finds the next most similar items, and adds these to the set of neighbours, continuing until at least `MINNEIGHBOURS`

have been found. If this results in more than `MAXNEIGHBOURS`

neighbours, `KNEARESTNEIGHBOURS`

makes a random selection from those that are least similar to the prediction item, so that the number of neighbours becomes `MAXNEIGHBOURS`

. The `SEED`

option specifies the seed for the random numbers that are used to make that selection. The default of zero continues an existing sequence of random numbers if any have already been used in this Genstat job, or initializes the seed automatically. The `NEIGHBOURS`

option can save a pointer, containing variate for each prediction item storing the numbers of its neighbours within the training set.

Once the neighbours have been found, `KNEARESTNEIGHBOURS`

can use these to form the predictions. The `DATA`

parameter lists variates and/or factors containing values of the variables of interest for the items on the training set. The predictions can be saved using the `PREDICTIONS`

parameter (in variates and/or factors to match the settings of the `DATA`

parameter).

For a `DATA`

factor, the category predicted for each item in the prediction set is taken to be the factor level that occurs most often amongst its nearest neighbours. If more than one level occurs most often, the choice is narrowed down by seeing which of the levels has the the most similar neighbours. If this still leaves more than one level, the choice is narrowed further by seeing which of the levels has neighbours with the highest mean similarity. Then, if even that does not lead to a single level, the final choice is made at random.

For a `DATA`

variate, the `METHOD`

option controls whether the prediction is made by the median (default) or the mean of the data values of the nearest neighbours of each prediction item.

Printed output is controlled by the `PRINT`

option, with settings:

`neighbours` |
to print the nearest neighbours, and |
---|---|

`predictions` |
to print the predictions. |

The default is `PRINT=predictions`

.

So, to print predictions of blood pressure with a variate of training data `Pressure`

, using the similarity matrix `Gsim`

(as above) and default settings for the numbers of neighbours, we simply need to put

`KNEARESTNEIGHBOURS [SIMILARITY=Gsim] Pressure`

Options: `PRINT`

, `SIMILARITY`

, `NEIGHBOURS`

, `GROUPS`

, `LEVTRAINING`

, `LEVPREDICTION`

, `METHOD`

, `MINSIMILARITY`

, `MINNEIGHBOURS`

, `MAXNEIGHBOURS`

, `SEED`

.

Parameters: `DATA`

, `PREDICTIONS`

.

### See also

Directives: `FSIMILARITY`

, `ASRULES`

, `NNFIT`

, `RBFIT`

.

Procedures: `KNNTRAIN`

, `BCLASSIFICATION`

, `BCFOREST`

, `BREGRESSION`

, `SOM`

.

Commands for: Data mining.

### Example

CAPTION 'KNEARESTNEIGHBOURS example',\ !t('Random classification forest for automobile data',\ 'from UCI Machine Learning Repository',\ 'http://archive.ics.uci.edu/ml/datasets/Automobile');\ STYLE=meta,plain SPLOAD FILE='%gendir%/examples/Automobile.gsh' FACTOR [LABELS=!t(yes,no)] loss_known CALCULATE loss_known = 1 + (normalized_losses .EQ. !s(*)) VARIATE known_loss;\ VALUES=ELEMENTS(normalized_losses; WHERE(loss_known.IN.'yes')) FSIMILARITY [METHOD=between; SIMILARITY=Sim; GROUPS=loss_known]\ make,fuel_type,aspiration,number_doors,\ body_style,drive_wheels,engine_location,wheel_base,\ length,width,height,curb_weight,engine_type,number_cylinders,\ engine_size,fuel_system,bore,stroke,compression_ratio,\ horsepower,peak_rpm,city_mpg,highway_mpg,price;\ TEST=7(simplematching),5(euclidean),2(simplematching),\ euclidean,simplematching,8(euclidean) KNEARESTNEIGHBOURS [PRINT=predictions,neighbours; SIMILARITY=Sim]\ known_loss; PREDICTIONS=pred_loss