KNEARESTNEIGHBOURS procedure

Classifies items or predicts their responses by examining their k nearest neighbours (R.W. Payne).

Options

PRINT = string tokens Printed output required (neighbours, predictions); default pred
SIMILARITY = matrix or symmetric matrix Provides the similarities between the training and prediction sets of items
NEIGHBOURS = pointer Pointer with a variate for each prediction item to save the numbers of its nearest neighbours in the training set
GROUPS = factor Defines groupings to identify the training and prediction sets of items when SIMILARITY is a symmetric matrix
LEVTRAINING = scalar or text Identifies the level of GROUPS or dimension of SIMILARITY that represents the training set; default 1
LEVPREDICTION = scalar or text Identifies the level of GROUPS or dimension of SIMILARITY that represents the prediction set; default 2
METHOD = string token How to calculate the prediction from a DATA variate (mean, median); default medi
MINSIMILARITY = scalar Cut-off minimum value of the similarity for items to be regarded as neighbours; default 0.75
MINNEIGHBOURS = scalar Minimum number of nearest neighbours to use; default 5
MAXNEIGHBOURS = scalar Maximum number of nearest neighbours to use; default 10
SEED = scalar Seed for the random numbers used to select neighbours when more than MAXNEIGHBOURS are available; default 0

Parameters

DATA = variates or factors Data values for the items in the training set
PREDICTIONS = variates or factors Saves the predictions

Description

KNEARESTNEIGHBOURS provides the data-mining technique known as k-nearest-neighbour classification. This allocates unknown items to a category, or it predicts their (continuous) responses, by looking at nearby items in a known data set. The known data set is usually called the training set, and we will call the unknown items the prediction set.

The SIMILARITY option provides a similarity matrix for KNEARESTNEIGHBOURS to use to determine the nearby items in the training set (or nearest neighbours) for each item in the prediction set. This can be a symmetric matrix with a row (and column) for every item in the combined set of training and prediction items. The GROUPS option must then be set to a factor with one level for the training items and another for the prediction items. By default the training set has level 1 and the prediction set has level 2, but these can be changed by the LEVTRAINING and LEVPREDICTION options. Matrices like these can be formed in a wide variety of ways, using mixtures of categorical and continuous data, by the FSIMILARITY directive. For example, if we have a factor Sex, and variates Age, Weight and Height whose values are known for both the training and prediction items, we could form a symmetric matrix Sim by

FSIMILARITY [SIMILARITY=Sim] Sex,Age,Weight,Height;\
            TEST=simplematching,3(euclidean)

However, Sim will contain unnecessary information, as we need the similarities between prediction and training items, but not between training items or between prediction items. So, for large data sets, it will be more efficient to form a (rectangular) between-group similarity matrix by setting the GROUPS option of FSIMILARITY. For example

FSIMILARITY [SIMILARITY=Gsim; GROUPS=Gfac] Sex,Age,Weight,Height;\
            TEST=simplematching,3(euclidean)

where Gfac is a factor with two levels, one for the training set (usually level 1), and the other for the prediction set (usually level 2). You then no longer need to set the GROUPS option of KNEARESTNEIGHBOURS. The LEVTRAINING and LEVPREDICTION options now specify the dimensions of the similarity matrix (1 for rows, and 2 for columns) that correspond to the training and prediction data sets, respectively. (They still correspond to group levels, though, as they are defined by the numbers of the respective levels of the GROUPS factor in FSIMILARITY.)
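For illustration, a similarity coefficient for mixed categorical and continuous variables can be sketched in Python. This is a simplified Gower-style sketch only, with an assumed function name; FSIMILARITY's exact per-test formulas (including simplematching and euclidean) are documented with that directive:

```python
def mixed_similarity(x, y, kinds, ranges):
    """Gower-style similarity between two items with mixed variables.

    Categorical variables score 1 for a match and 0 otherwise;
    continuous variables score 1 - |x - y| / range. The average of
    the per-variable scores lies between 0 and 1.
    """
    total = 0.0
    for xi, yi, kind, rng in zip(x, y, kinds, ranges):
        if kind == 'categorical':
            total += 1.0 if xi == yi else 0.0
        else:
            total += 1.0 - abs(xi - yi) / rng
    return total / len(x)
```

Coefficients like this give a value between 0 (entirely dissimilar) and 1 (identical), which is the scale assumed by the MINSIMILARITY option below.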

The MINSIMILARITY option sets a minimum value on the similarity between two items if they are to be regarded as neighbours (default 0.75). The MINNEIGHBOURS option specifies the minimum number of neighbours to try to find (default 5), and the MAXNEIGHBOURS option specifies the maximum number (default 10). The search for nearest neighbours for a particular prediction item works by finding the most similar item in the training set, and adding this (with any equally-similar training items) to the set of neighbours. If at least MINNEIGHBOURS have been found, the search stops. Otherwise it finds the next most similar items, and adds these to the set of neighbours, continuing until at least MINNEIGHBOURS have been found. If this results in more than MAXNEIGHBOURS neighbours, KNEARESTNEIGHBOURS makes a random selection from those that are least similar to the prediction item, so that the number of neighbours becomes MAXNEIGHBOURS. The SEED option specifies the seed for the random numbers that are used to make that selection. The default of zero continues an existing sequence of random numbers if any have already been used in this Genstat job, or initializes the seed automatically. The NEIGHBOURS option can save a pointer, containing a variate for each prediction item that stores the numbers of its neighbours within the training set.
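The search described above can be sketched in Python. This is an illustration of the rules only, not Genstat's implementation, and the function name and data layout are assumptions for the example:

```python
import random

def find_neighbours(similarities, min_similarity=0.75,
                    min_neighbours=5, max_neighbours=10, seed=None):
    """Find the nearest training-set neighbours of one prediction item.

    similarities is a list of (training_item, similarity) pairs.
    """
    rng = random.Random(seed)
    # Only items at or above the similarity cut-off can be neighbours.
    candidates = sorted((pair for pair in similarities
                         if pair[1] >= min_similarity),
                        key=lambda pair: -pair[1])
    neighbours = []
    pos = 0
    # Add the next most similar item, together with any equally-similar
    # items, until at least min_neighbours have been found.
    while pos < len(candidates) and len(neighbours) < min_neighbours:
        s = candidates[pos][1]
        while pos < len(candidates) and candidates[pos][1] == s:
            neighbours.append(candidates[pos])
            pos += 1
    if len(neighbours) > max_neighbours:
        # Keep everything more similar than the last tier added, and make
        # a random selection from that least-similar tier to fill the rest.
        lowest = neighbours[-1][1]
        keep = [n for n in neighbours if n[1] > lowest]
        tier = [n for n in neighbours if n[1] == lowest]
        neighbours = keep + rng.sample(tier, max_neighbours - len(keep))
    return neighbours
```

Because equally-similar items are added together, only the least-similar tier can overshoot MAXNEIGHBOURS, which is why the random selection is made from that tier alone.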

Once the neighbours have been found, KNEARESTNEIGHBOURS can use these to form the predictions. The DATA parameter lists variates and/or factors containing values of the variables of interest for the items in the training set. The predictions can be saved using the PREDICTIONS parameter (in variates and/or factors to match the settings of the DATA parameter).

For a DATA factor, the category predicted for each item in the prediction set is taken to be the factor level that occurs most often amongst its nearest neighbours. If more than one level occurs most often, the choice is narrowed down by seeing which of the levels has the most similar neighbours. If this still leaves more than one level, the choice is narrowed further by seeing which of the levels has neighbours with the highest mean similarity. Then, if even that does not lead to a single level, the final choice is made at random.

For a DATA variate, the METHOD option controls whether the prediction is made by the median (default) or the mean of the data values of the nearest neighbours of each prediction item.
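The prediction rules in the two paragraphs above can be illustrated with a short Python sketch. Again this is an illustration of the rules rather than Genstat's own code, and the function names are assumed for the example:

```python
import random
from collections import Counter
from statistics import mean, median

def predict_factor(levels, sims, seed=None):
    """Predict a factor level from the neighbours' levels and
    similarities, applying the tie-breaking rules in order."""
    counts = Counter(levels)
    top = max(counts.values())
    tied = [lev for lev in counts if counts[lev] == top]
    if len(tied) > 1:
        # Prefer the level with the single most similar neighbour...
        best = {lev: max(s for l, s in zip(levels, sims) if l == lev)
                for lev in tied}
        tied = [lev for lev in tied if best[lev] == max(best.values())]
    if len(tied) > 1:
        # ...then the level whose neighbours have the highest mean
        # similarity...
        avg = {lev: mean(s for l, s in zip(levels, sims) if l == lev)
               for lev in tied}
        tied = [lev for lev in tied if avg[lev] == max(avg.values())]
    # ...and finally choose at random.
    return random.Random(seed).choice(tied)

def predict_variate(values, method='median'):
    """Predict a continuous response from the neighbours' data values,
    by their median (the default) or mean."""
    return median(values) if method == 'median' else mean(values)
```

Here `levels` and `sims` hold, for one prediction item, the factor levels and similarities of its nearest neighbours; `values` holds their variate values.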

Printed output is controlled by the PRINT option, with settings:

    neighbours to print the nearest neighbours, and
    predictions to print the predictions.

The default is PRINT=predictions.

So, to print predictions of blood pressure with a variate of training data Pressure, using the similarity matrix Gsim (as above) and default settings for the numbers of neighbours, we simply need to put

KNEARESTNEIGHBOURS [SIMILARITY=Gsim] Pressure

Options: PRINT, SIMILARITY, NEIGHBOURS, GROUPS, LEVTRAINING, LEVPREDICTION, METHOD, MINSIMILARITY, MINNEIGHBOURS, MAXNEIGHBOURS, SEED.
Parameters: DATA, PREDICTIONS.

See also

Directives: FSIMILARITY, ASRULES, NNFIT, RBFIT.
Procedures: KNNTRAIN, BCLASSIFICATION, BCFOREST, BREGRESSION, SOM.
Commands for: Data mining.

Example

CAPTION     'KNEARESTNEIGHBOURS example',\
            !t('K-nearest-neighbour prediction for automobile data',\
            'from UCI Machine Learning Repository',\
            'http://archive.ics.uci.edu/ml/datasets/Automobile');\
            STYLE=meta,plain
SPLOAD      FILE='%gendir%/examples/Automobile.gsh'
FACTOR      [LABELS=!t(yes,no)] loss_known
CALCULATE   loss_known = 1 + (normalized_losses .EQ. !s(*))
VARIATE     known_loss;\
            VALUES=ELEMENTS(normalized_losses; WHERE(loss_known.IN.'yes'))
FSIMILARITY [METHOD=between; SIMILARITY=Sim; GROUPS=loss_known]\
            make,fuel_type,aspiration,number_doors,\
            body_style,drive_wheels,engine_location,wheel_base,\
            length,width,height,curb_weight,engine_type,number_cylinders,\
            engine_size,fuel_system,bore,stroke,compression_ratio,\
            horsepower,peak_rpm,city_mpg,highway_mpg,price;\
            TEST=7(simplematching),5(euclidean),2(simplematching),\
            euclidean,simplematching,8(euclidean)
KNEARESTNEIGHBOURS [PRINT=predictions,neighbours; SIMILARITY=Sim]\
            known_loss; PREDICTIONS=pred_loss
Updated on February 7, 2023
