Classifies items or predicts their responses by examining their k nearest neighbours (R.W. Payne).
Options
PRINT = string tokens | Printed output required (neighbours, predictions); default pred
---|---
SIMILARITY = matrix or symmetric matrix | Provides the similarities between the training and prediction sets of items
NEIGHBOURS = pointer | Pointer with a variate for each prediction item to save the numbers of its nearest neighbours in the training set
GROUPS = factor | Defines groupings to identify the training and prediction sets of items when SIMILARITY is a symmetric matrix
LEVTRAINING = scalar or text | Identifies the level of GROUPS or dimension of SIMILARITY that represents the training set; default 1
LEVPREDICTION = scalar or text | Identifies the level of GROUPS or dimension of SIMILARITY that represents the prediction set; default 2
METHOD = string token | How to calculate the prediction from a DATA variate (mean, median); default medi
MINSIMILARITY = scalar | Cut-off minimum value of the similarity for items to be regarded as neighbours; default 0.75
MINNEIGHBOURS = scalar | Minimum number of nearest neighbours to use; default 5
MAXNEIGHBOURS = scalar | Maximum number of nearest neighbours to use; default 10
SEED = scalar | Seed for the random numbers used to select neighbours when more than MAXNEIGHBOURS are available; default 0
Parameters
DATA = variates or factors | Data values for the items in the training set
---|---
PREDICTIONS = variates or factors | Saves the predictions
Description
KNEARESTNEIGHBOURS
provides the data-mining technique known as k-nearest-neighbour classification. This allocates unknown items to a category, or it predicts their (continuous) responses, by looking at nearby items in a known data set. The known data set is usually called the training set, and we will call the unknown items the prediction set.
The SIMILARITY option provides a similarity matrix for KNEARESTNEIGHBOURS to use to determine the nearby items in the training set (the nearest neighbours) for each item in the prediction set. This can be a symmetric matrix with a row (and column) for every item in the combined set of training and prediction items. The GROUPS option must then be set to a factor with one level for the training items and another for the prediction items. By default the training set has level 1 and the prediction set has level 2, but these can be changed by the LEVTRAINING and LEVPREDICTION options. Matrices like these can be formed in a wide variety of ways, using mixtures of categorical and continuous data, by the FSIMILARITY directive. For example, if we have a factor Sex, and variates Age, Weight and Height whose values are known for both the training and prediction items, we could form a symmetric matrix Sim by
FSIMILARITY [SIMILARITY=Sim] Sex,Age,Weight,Height;\
TEST=simplematching,3(euclidean)
However, Sim will contain unnecessary information: we need the similarities between prediction and training items, but not those between pairs of training items or between pairs of prediction items. So, for large data sets, it will be more efficient to form a (rectangular) between-group similarity matrix by setting the GROUPS option of FSIMILARITY. For example
FSIMILARITY [METHOD=between; SIMILARITY=Gsim; GROUPS=Gfac] Sex,Age,Weight,Height;\
TEST=simplematching,3(euclidean)
where Gfac is a factor with two levels, one for the training set (usually level 1) and the other for the prediction set (usually level 2). You then no longer need to set the GROUPS option of KNEARESTNEIGHBOURS. The LEVTRAINING and LEVPREDICTION options now specify the dimensions of the similarity matrix (1 for rows, and 2 for columns) that correspond to the training and prediction data sets, respectively. (They still correspond to group levels, though, as the dimensions are defined by the numbers of the respective levels of the GROUPS factor in FSIMILARITY.)
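To see what such a rectangular matrix contains, here is a minimal Python sketch of a between-group similarity of the same general (Gower-style) kind: simple matching for a categorical variable and a range-scaled difference for a continuous one. The function name and the exact scaling are illustrative assumptions, not the precise coefficients that FSIMILARITY's test types use.

```python
def between_similarity(train, pred, kinds, ranges):
    """Similarity of every prediction item to every training item.

    Each item is a tuple of variable values.  A "match" variable scores
    1 when the two values are equal and 0 otherwise (simple matching);
    a "range" variable scores 1 - |x - y| / range, so equal values score
    1 and values a whole range apart score 0.  The overall similarity is
    the mean of the per-variable scores.  (Illustrative only: Genstat's
    euclidean test, for instance, scales differences differently.)
    """
    def sim(a, b):
        scores = []
        for x, y, kind, rng in zip(a, b, kinds, ranges):
            if kind == "match":
                scores.append(1.0 if x == y else 0.0)
            else:
                scores.append(1.0 - abs(x - y) / rng)
        return sum(scores) / len(scores)

    # one row per prediction item, one column per training item
    return [[sim(p, t) for t in train] for p in pred]
```

With one categorical and one continuous variable, a prediction item that matches a training item's category but differs by half the range on the variate scores (1 + 0.5)/2 = 0.75.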
The MINSIMILARITY option sets the minimum similarity that two items must have if they are to be regarded as neighbours (default 0.75). The MINNEIGHBOURS option specifies the minimum number of neighbours to try to find (default 5), and the MAXNEIGHBOURS option specifies the maximum number (default 10). The search for the nearest neighbours of a particular prediction item works by finding the most similar item in the training set, and adding this (together with any equally similar training items) to the set of neighbours. If at least MINNEIGHBOURS have been found, the search stops. Otherwise it finds the next most similar items, and adds these to the set of neighbours, continuing until at least MINNEIGHBOURS have been found. If this results in more than MAXNEIGHBOURS neighbours, KNEARESTNEIGHBOURS makes a random selection from those that are least similar to the prediction item, so that the number of neighbours becomes MAXNEIGHBOURS. The SEED option specifies the seed for the random numbers that are used to make that selection. The default of zero continues an existing sequence of random numbers if any have already been used in this Genstat job, or initializes the seed automatically. The NEIGHBOURS option can save a pointer containing a variate for each prediction item, storing the numbers of its neighbours within the training set.
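The search just described can be sketched in Python. This is an illustrative reimplementation, not Genstat's own code; the function name and the use of Python's random module are assumptions.

```python
import random

def find_neighbours(sims, min_sim=0.75, min_n=5, max_n=10, rng=None):
    """Indices of the nearest training-set neighbours of one prediction item.

    sims[i] is the similarity of the prediction item to training item i.
    Items below min_sim are never neighbours.  Training items are added a
    similarity level at a time (equally similar items come in together)
    until at least min_n have been found; if the last level would push the
    count past max_n, a random subset of that least-similar level is kept
    so that exactly max_n neighbours result.
    """
    rng = rng or random.Random(0)
    candidates = [(s, i) for i, s in enumerate(sims) if s >= min_sim]
    chosen = []
    for level in sorted({s for s, _ in candidates}, reverse=True):
        tied = [i for s, i in candidates if s == level]
        if len(chosen) + len(tied) > max_n:
            # too many: random selection among the least-similar ties
            tied = rng.sample(tied, max_n - len(chosen))
        chosen.extend(tied)
        if len(chosen) >= min_n:
            break
    return chosen
```

For example, with similarities [0.9, 0.8, 0.8, 0.7, 0.95, 0.6] and min_n=2, items 4 and 0 are found and the search stops; raising min_n to 3 with max_n=3 pulls in one of the two items tied at 0.8, chosen at random.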
Once the neighbours have been found, KNEARESTNEIGHBOURS can use these to form the predictions. The DATA parameter lists variates and/or factors containing values of the variables of interest for the items in the training set. The predictions can be saved using the PREDICTIONS parameter (in variates and/or factors to match the settings of the DATA parameter).
For a DATA factor, the category predicted for each item in the prediction set is taken to be the factor level that occurs most often amongst its nearest neighbours. If more than one level occurs most often, the choice is narrowed down by seeing which of the levels has the most similar neighbours. If this still leaves more than one level, the choice is narrowed further by seeing which of the levels has neighbours with the highest mean similarity. Then, if even that does not lead to a single level, the final choice is made at random.
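These tie-breaking rules can be sketched as follows (again an illustrative Python reimplementation; the function name is an assumption):

```python
import random
from collections import defaultdict

def predict_level(levels, sims, rng=None):
    """Predict a factor level from the neighbours' levels and similarities.

    levels[i] is the level of neighbour i and sims[i] its similarity to
    the prediction item.  The most frequent level wins; ties are broken
    by the single most similar neighbour, then by the highest mean
    similarity, and finally at random.
    """
    rng = rng or random.Random(0)
    by_level = defaultdict(list)
    for lev, s in zip(levels, sims):
        by_level[lev].append(s)
    # 1. the level that occurs most often amongst the neighbours
    top = max(len(v) for v in by_level.values())
    tied = [lev for lev, v in by_level.items() if len(v) == top]
    # 2. break ties by the most similar neighbour in each tied level
    if len(tied) > 1:
        best = max(max(by_level[lev]) for lev in tied)
        tied = [lev for lev in tied if max(by_level[lev]) == best]
    # 3. then by the highest mean similarity
    if len(tied) > 1:
        mean = lambda v: sum(v) / len(v)
        best = max(mean(by_level[lev]) for lev in tied)
        tied = [lev for lev in tied if mean(by_level[lev]) == best]
    # 4. finally at random
    return tied[0] if len(tied) == 1 else rng.choice(tied)
```

For instance, with neighbour levels a, b, b, a, c both a and b occur twice, but a wins if its most similar neighbour is closer than b's.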
For a DATA variate, the METHOD option controls whether the prediction is made by the median (default) or the mean of the data values of the nearest neighbours of each prediction item.
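The corresponding prediction step for a variate is straightforward; as a sketch (predict_value is an assumed name):

```python
from statistics import mean, median

def predict_value(neighbour_values, method="median"):
    """Predict a continuous response from the data values of the
    nearest neighbours, by their median (the default) or their mean."""
    return median(neighbour_values) if method == "median" else mean(neighbour_values)
```

The median is the more robust default: a single outlying neighbour shifts the mean but usually not the median.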
Printed output is controlled by the PRINT option, with settings:
neighbours | to print the nearest neighbours, and
---|---
predictions | to print the predictions.
The default is PRINT=predictions.
So, to print predictions of blood pressure with a variate of training data Pressure, using the similarity matrix Gsim (as above) and default settings for the numbers of neighbours, we simply need to put
KNEARESTNEIGHBOURS [SIMILARITY=Gsim] Pressure
Options: PRINT, SIMILARITY, NEIGHBOURS, GROUPS, LEVTRAINING, LEVPREDICTION, METHOD, MINSIMILARITY, MINNEIGHBOURS, MAXNEIGHBOURS, SEED.

Parameters: DATA, PREDICTIONS.
See also
Directives: FSIMILARITY, ASRULES, NNFIT, RBFIT.

Procedures: KNNTRAIN, BCLASSIFICATION, BCFOREST, BREGRESSION, SOM.
Commands for: Data mining.
Example
CAPTION 'KNEARESTNEIGHBOURS example',\
        !t('k-nearest-neighbour classification for automobile data',\
        'from UCI Machine Learning Repository',\
        'http://archive.ics.uci.edu/ml/datasets/Automobile');\
        STYLE=meta,plain
SPLOAD FILE='%gendir%/examples/Automobile.gsh'
FACTOR [LABELS=!t(yes,no)] loss_known
CALCULATE loss_known = 1 + (normalized_losses .EQ. !s(*))
VARIATE known_loss;\
        VALUES=ELEMENTS(normalized_losses; WHERE(loss_known.IN.'yes'))
FSIMILARITY [METHOD=between; SIMILARITY=Sim; GROUPS=loss_known]\
        make,fuel_type,aspiration,number_doors,\
        body_style,drive_wheels,engine_location,wheel_base,\
        length,width,height,curb_weight,engine_type,number_cylinders,\
        engine_size,fuel_system,bore,stroke,compression_ratio,\
        horsepower,peak_rpm,city_mpg,highway_mpg,price;\
        TEST=7(simplematching),5(euclidean),2(simplematching),\
        euclidean,simplematching,8(euclidean)
KNEARESTNEIGHBOURS [PRINT=predictions,neighbours; SIMILARITY=Sim]\
        known_loss; PREDICTIONS=pred_loss