Classifies items or predicts their responses by examining their k nearest neighbours (R.W. Payne).
Options
PRINT = string tokens | Printed output required (neighbours, predictions); default pred
---|---
SIMILARITY = matrix or symmetric matrix | Provides the similarities between the training and prediction sets of items
NEIGHBOURS = pointer | Pointer with a variate for each prediction item to save the numbers of its nearest neighbours in the training set
GROUPS = factor | Defines groupings to identify the training and prediction sets of items when SIMILARITY is a symmetric matrix
LEVTRAINING = scalar or text | Identifies the level of GROUPS or dimension of SIMILARITY that represents the training set; default 1
LEVPREDICTION = scalar or text | Identifies the level of GROUPS or dimension of SIMILARITY that represents the prediction set; default 2
METHOD = string token | How to calculate the prediction from a DATA variate (mean, median); default medi
MINSIMILARITY = scalar | Cut-off minimum value of the similarity for items to be regarded as neighbours; default 0.75
MINNEIGHBOURS = scalar | Minimum number of nearest neighbours to use; default 5
MAXNEIGHBOURS = scalar | Maximum number of nearest neighbours to use; default 10
SEED = scalar | Seed for the random numbers used to select neighbours when more than MAXNEIGHBOURS are available; default 0
Parameters
DATA = variates or factors | Data values for the items in the training set
---|---
PREDICTIONS = variates or factors | Saves the predictions
Description
KNEARESTNEIGHBOURS
provides the data-mining technique known as k-nearest-neighbour classification. This allocates unknown items to a category, or it predicts their (continuous) responses, by looking at nearby items in a known data set. The known data set is usually called the training set, and we will call the unknown items the prediction set.
The SIMILARITY option provides a similarity matrix for KNEARESTNEIGHBOURS to use to determine the nearby items in the training set (the nearest neighbours) for each item in the prediction set. This can be a symmetric matrix with a row (and column) for every item in the combined set of training and prediction items. The GROUPS option must then be set to a factor with one level for the training items and another for the prediction items. By default the training set has level 1 and the prediction set has level 2, but these can be changed by the LEVTRAINING and LEVPREDICTION options. Matrices like these can be formed in a wide variety of ways, using mixtures of categorical and continuous data, by the FSIMILARITY directive. For example, if we have a factor Sex, and variates Age, Weight and Height whose values are known for both the training and prediction items, we could form a symmetric matrix Sim by
FSIMILARITY [SIMILARITY=Sim] Sex,Age,Weight,Height;\
TEST=simplematching,3(euclidean)
However, Sim will contain unnecessary information: we need the similarities between prediction and training items, but not those between pairs of training items or between pairs of prediction items. So, for large data sets, it will be more efficient to form a (rectangular) between-group similarity matrix by setting the GROUPS option of FSIMILARITY. For example
FSIMILARITY [METHOD=between; SIMILARITY=Gsim; GROUPS=Gfac] Sex,Age,Weight,Height;\
TEST=simplematching,3(euclidean)
where Gfac is a factor with two levels, one for the training set (usually level 1) and the other for the prediction set (usually level 2). You then no longer need to set the GROUPS option of KNEARESTNEIGHBOURS. The LEVTRAINING and LEVPREDICTION options now specify the dimensions of the similarity matrix (1 for rows, and 2 for columns) that correspond to the training and prediction data sets, respectively. (They still correspond to group levels, though, as the dimensions are defined by the numbers of the respective levels of the GROUPS factor in FSIMILARITY.)
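To see what such a rectangular matrix contains, here is a minimal Python sketch of a between-group similarity of the same general (Gower-style) kind: simple matching for a categorical variable and a range-scaled difference for a continuous one. The function name and the exact scaling are illustrative assumptions, not the precise coefficients that FSIMILARITY's test types use.

```python
def between_similarity(train, pred, kinds, ranges):
    """Similarity of every prediction item to every training item.

    Each item is a tuple of variable values.  A "match" variable scores
    1 when the two values are equal and 0 otherwise (simple matching);
    a "range" variable scores 1 - |x - y| / range, so equal values score
    1 and values a whole range apart score 0.  The overall similarity is
    the mean of the per-variable scores.  (Illustrative only: Genstat's
    euclidean test, for instance, scales differences differently.)
    """
    def sim(a, b):
        scores = []
        for x, y, kind, rng in zip(a, b, kinds, ranges):
            if kind == "match":
                scores.append(1.0 if x == y else 0.0)
            else:
                scores.append(1.0 - abs(x - y) / rng)
        return sum(scores) / len(scores)

    # one row per prediction item, one column per training item
    return [[sim(p, t) for t in train] for p in pred]
```

With one categorical and one continuous variable, a prediction item that matches a training item's category but differs by half the range on the variate scores (1 + 0.5)/2 = 0.75.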
The MINSIMILARITY option sets the minimum similarity that two items must have if they are to be regarded as neighbours (default 0.75). The MINNEIGHBOURS option specifies the minimum number of neighbours to try to find (default 5), and the MAXNEIGHBOURS option specifies the maximum number (default 10). The search for the nearest neighbours of a particular prediction item works by finding the most similar item in the training set, and adding this (together with any equally similar training items) to the set of neighbours. If at least MINNEIGHBOURS have been found, the search stops. Otherwise it finds the next most similar items, and adds these to the set of neighbours, continuing until at least MINNEIGHBOURS have been found. If this results in more than MAXNEIGHBOURS neighbours, KNEARESTNEIGHBOURS makes a random selection from those that are least similar to the prediction item, so that the number of neighbours becomes MAXNEIGHBOURS. The SEED option specifies the seed for the random numbers that are used to make that selection. The default of zero continues an existing sequence of random numbers if any have already been used in this Genstat job, or initializes the seed automatically. The NEIGHBOURS option can save a pointer containing a variate for each prediction item, storing the numbers of its neighbours within the training set.
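The search just described can be sketched in Python. This is an illustrative reimplementation, not Genstat's own code; the function name and the use of Python's random module are assumptions.

```python
import random

def find_neighbours(sims, min_sim=0.75, min_n=5, max_n=10, rng=None):
    """Indices of the nearest training-set neighbours of one prediction item.

    sims[i] is the similarity of the prediction item to training item i.
    Items below min_sim are never neighbours.  Training items are added a
    similarity level at a time (equally similar items come in together)
    until at least min_n have been found; if the last level would push the
    count past max_n, a random subset of that least-similar level is kept
    so that exactly max_n neighbours result.
    """
    rng = rng or random.Random(0)
    candidates = [(s, i) for i, s in enumerate(sims) if s >= min_sim]
    chosen = []
    for level in sorted({s for s, _ in candidates}, reverse=True):
        tied = [i for s, i in candidates if s == level]
        if len(chosen) + len(tied) > max_n:
            # too many: random selection among the least-similar ties
            tied = rng.sample(tied, max_n - len(chosen))
        chosen.extend(tied)
        if len(chosen) >= min_n:
            break
    return chosen
```

For example, with similarities [0.9, 0.8, 0.8, 0.7, 0.95, 0.6] and min_n=2, items 4 and 0 are found and the search stops; raising min_n to 3 with max_n=3 pulls in one of the two items tied at 0.8, chosen at random.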
Once the neighbours have been found, KNEARESTNEIGHBOURS can use these to form the predictions. The DATA parameter lists variates and/or factors containing values of the variables of interest for the items in the training set. The predictions can be saved using the PREDICTIONS parameter (in variates and/or factors to match the settings of the DATA parameter).
For a DATA factor, the category predicted for each item in the prediction set is taken to be the factor level that occurs most often amongst its nearest neighbours. If more than one level occurs most often, the choice is narrowed down by seeing which of the levels has the most similar neighbours. If this still leaves more than one level, the choice is narrowed further by seeing which of the levels has neighbours with the highest mean similarity. Then, if even that does not lead to a single level, the final choice is made at random.
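These tie-breaking rules can be sketched as follows (again an illustrative Python reimplementation; the function name is an assumption):

```python
import random
from collections import defaultdict

def predict_level(levels, sims, rng=None):
    """Predict a factor level from the neighbours' levels and similarities.

    levels[i] is the level of neighbour i and sims[i] its similarity to
    the prediction item.  The most frequent level wins; ties are broken
    by the single most similar neighbour, then by the highest mean
    similarity, and finally at random.
    """
    rng = rng or random.Random(0)
    by_level = defaultdict(list)
    for lev, s in zip(levels, sims):
        by_level[lev].append(s)
    # 1. the level that occurs most often amongst the neighbours
    top = max(len(v) for v in by_level.values())
    tied = [lev for lev, v in by_level.items() if len(v) == top]
    # 2. break ties by the most similar neighbour in each tied level
    if len(tied) > 1:
        best = max(max(by_level[lev]) for lev in tied)
        tied = [lev for lev in tied if max(by_level[lev]) == best]
    # 3. then by the highest mean similarity
    if len(tied) > 1:
        mean = lambda v: sum(v) / len(v)
        best = max(mean(by_level[lev]) for lev in tied)
        tied = [lev for lev in tied if mean(by_level[lev]) == best]
    # 4. finally at random
    return tied[0] if len(tied) == 1 else rng.choice(tied)
```

For instance, with neighbour levels a, b, b, a, c both a and b occur twice, but a wins if its most similar neighbour is closer than b's.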
For a DATA variate, the METHOD option controls whether the prediction is made by the median (default) or the mean of the data values of the nearest neighbours of each prediction item.
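The corresponding prediction step for a variate is straightforward; as a sketch (predict_value is an assumed name):

```python
from statistics import mean, median

def predict_value(neighbour_values, method="median"):
    """Predict a continuous response from the data values of the
    nearest neighbours, by their median (the default) or their mean."""
    return median(neighbour_values) if method == "median" else mean(neighbour_values)
```

The median is the more robust default: a single outlying neighbour shifts the mean but usually not the median.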
Printed output is controlled by the PRINT option, with settings:
neighbours | to print the nearest neighbours, and
---|---
predictions | to print the predictions.
The default is PRINT=predictions.
So, to print predictions of blood pressure with a variate of training data Pressure, using the similarity matrix Gsim (as above) and default settings for the numbers of neighbours, we simply need to put
KNEARESTNEIGHBOURS [SIMILARITY=Gsim] Pressure
Options: PRINT, SIMILARITY, NEIGHBOURS, GROUPS, LEVTRAINING, LEVPREDICTION, METHOD, MINSIMILARITY, MINNEIGHBOURS, MAXNEIGHBOURS, SEED.

Parameters: DATA, PREDICTIONS.
See also
Directives: FSIMILARITY, ASRULES, NNFIT, RBFIT.

Procedures: KNNTRAIN, BCLASSIFICATION, BCFOREST, BREGRESSION, SOM.
Commands for: Data mining.
Example
CAPTION 'KNEARESTNEIGHBOURS example',\
        !t('k-nearest-neighbour classification for automobile data',\
        'from UCI Machine Learning Repository',\
        'http://archive.ics.uci.edu/ml/datasets/Automobile');\
        STYLE=meta,plain
SPLOAD FILE='%gendir%/examples/Automobile.gsh'
FACTOR [LABELS=!t(yes,no)] loss_known
CALCULATE loss_known = 1 + (normalized_losses .EQ. !s(*))
VARIATE known_loss;\
        VALUES=ELEMENTS(normalized_losses; WHERE(loss_known.IN.'yes'))
FSIMILARITY [METHOD=between; SIMILARITY=Sim; GROUPS=loss_known]\
        make,fuel_type,aspiration,number_doors,\
        body_style,drive_wheels,engine_location,wheel_base,\
        length,width,height,curb_weight,engine_type,number_cylinders,\
        engine_size,fuel_system,bore,stroke,compression_ratio,\
        horsepower,peak_rpm,city_mpg,highway_mpg,price;\
        TEST=7(simplematching),5(euclidean),2(simplematching),\
        euclidean,simplematching,8(euclidean)
KNEARESTNEIGHBOURS [PRINT=predictions,neighbours; SIMILARITY=Sim]\
        known_loss; PREDICTIONS=pred_loss