Evaluates and optimizes the k-nearest-neighbour algorithm using cross-validation (D.B. Baird).
Options
PRINT = string tokens
    Printed output required (error, confusion, predictions); default erro, conf
SIMILARITY = symmetric matrix
    Provides the similarities between the observations
METHOD = string token
    How to calculate the prediction from a DATA variate (mean, median); default medi
MINSIMILARITY = scalar or variate
    Cut-off minimum value of the similarity for items to be regarded as neighbours; default 0.75
MINNEIGHBOURS = scalar or variate
    Minimum number of nearest neighbours to use; default 5
MAXNEIGHBOURS = scalar or variate
    Maximum number of nearest neighbours to use; default 10
NSIMULATIONS = scalar
    Number of cross-validation sets to use; default 1
NCROSSVALIDATIONGROUPS = scalar
    Number of groups for cross-validation; default 10
SEED = scalar
    Seed for the random numbers used to select cross-validation groups; default 0
Parameters
DATA = variate or factor
    Data values for the items in the data set
PREDICTIONS = variate or factor
    Saves the predictions using the optimal options
ERROR = scalar
    Cross-validation error rate for the optimal combination
CONFUSION = matrix
    Confusion matrix for the cross-validation predictions from the optimal options
OPTIMAL = pointer
    Pointer to the optimal options
Description
KNNTRAIN uses cross-validation to evaluate and optimize the data-mining technique known as k-nearest-neighbour classification. This can be used to select the best options to use with the KNEARESTNEIGHBOURS procedure, which performs the classification on a single data set.
A range of values to try can be specified for each of the options MINSIMILARITY, MINNEIGHBOURS and MAXNEIGHBOURS (described below). KNNTRAIN evaluates the cross-validation error for every combination of the values, and selects the optimal values as the best of these combinations.
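As a sketch of how such a grid might be specified, the commands below reuse the glass data and the similarity matrix Sim from the Example section of this page. The particular values placed in the three variates are illustrative assumptions, not recommendations.

```genstat
"Load the data and form the similarity matrix, as in the Example."
SPLOAD [PRINT=summary] '%DATA%/GlassTypes.gsh'
FSIMILARITY [SIMILARITY=Sim] RI,Na,Mg,Al,Si,K,Ca,Ba,Fe; TEST=euclidean
"Try 2 x 2 x 2 = 8 combinations of the three options."
KNNTRAIN [SIMILARITY=Sim; MINSIMILARITY=!(0.8,0.9); MINNEIGHBOURS=!(3,5); \
         MAXNEIGHBOURS=!(5,10); SEED=15243] DATA=Type
```

With the default PRINT settings, this should report the cross-validation error for each combination and the confusion matrix for the best one.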
Cross-validation works by splitting the data set randomly into groups that are as nearly equal in size as possible. The number of groups to form is specified by the NCROSSVALIDATIONGROUPS option (default 10). The items in each group are then predicted from classifications formed from the data in the other groups. The NSIMULATIONS option defines how many times this process is repeated. Values larger than the default of one can be specified to estimate the error more precisely.
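The following sketch shows how the cross-validation settings might be changed, again assuming the glass data from the Example section of this page. The choice of 5 groups and 10 repeats is arbitrary and for illustration only.

```genstat
"Load the data and form the similarity matrix, as in the Example."
SPLOAD [PRINT=summary] '%DATA%/GlassTypes.gsh'
FSIMILARITY [SIMILARITY=Sim] RI,Na,Mg,Al,Si,K,Ca,Ba,Fe; TEST=euclidean
"Split the items into 5 random groups, and repeat the process 10 times."
KNNTRAIN [PRINT=error; SIMILARITY=Sim; NCROSSVALIDATIONGROUPS=5; \
         NSIMULATIONS=10; SEED=15243] DATA=Type
```

The other options are left at their defaults here, so only a single combination of MINSIMILARITY, MINNEIGHBOURS and MAXNEIGHBOURS is evaluated.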
The SIMILARITY option provides the information that KNNTRAIN uses to determine the nearby items (or nearest neighbours) for each item in the data set. This is a symmetric matrix with a row (and column) for every item. It can be formed in a wide variety of ways, using mixtures of categorical and continuous data, by the FSIMILARITY directive. For example, if we have a factor Sex, and variates Age, Weight and Height, we could form a symmetric matrix Sim by
FSIMILARITY [SIMILARITY=Sim] Sex,Age,Weight,Height;\
TEST=simplematching,3(euclidean)
The MINSIMILARITY option sets a minimum value for the similarity between two items if they are to be regarded as neighbours (default 0.75). The MINNEIGHBOURS option specifies the minimum number of neighbours to try to find (default 5), and the MAXNEIGHBOURS option specifies the maximum number (default 10). The search for nearest neighbours to obtain the prediction for a particular item works by finding the most similar item in the training set, and adding this (with any equally-similar training items) to the set of neighbours. If at least MINNEIGHBOURS neighbours have been found, the search stops. Otherwise it finds the next most similar items, and adds these to the set of neighbours, continuing until at least MINNEIGHBOURS have been found. If this results in more than MAXNEIGHBOURS neighbours, KNEARESTNEIGHBOURS makes a random selection from those that are least similar to the prediction item, so that the number of neighbours becomes MAXNEIGHBOURS.
The DATA parameter lists variates and/or factors containing values of the variables of interest for the items in the data set. The predictions from the optimal option combination can be saved using the PREDICTIONS parameter (in variates and/or factors to match the settings of the DATA parameter). These predictions are from a jackknife analysis where each item is predicted from the other items in the data set using the optimal options. The cross-validation error can be saved using the ERROR parameter. The OPTIMAL parameter can save a pointer containing three scalars that give the optimal values of MINSIMILARITY, MINNEIGHBOURS and MAXNEIGHBOURS respectively. The pointer has labels giving the option names.
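A minimal sketch of saving these results, assuming the glass data from the Example section of this page. The identifiers Pred, Err and Opt are hypothetical names chosen for illustration.

```genstat
"Load the data and form the similarity matrix, as in the Example."
SPLOAD [PRINT=summary] '%DATA%/GlassTypes.gsh'
FSIMILARITY [SIMILARITY=Sim] RI,Na,Mg,Al,Si,K,Ca,Ba,Fe; TEST=euclidean
"Suppress the printed output and save the results instead."
KNNTRAIN [PRINT=*; SIMILARITY=Sim; MINSIMILARITY=0.9; MINNEIGHBOURS=!(1...5); \
         MAXNEIGHBOURS=!(1...8); SEED=15243] DATA=Type; PREDICTIONS=Pred; \
         ERROR=Err; OPTIMAL=Opt
"Display the saved error and the three scalars held in the pointer."
PRINT Err
PRINT Opt[]
```

Because Type is a factor, Pred should also be formed as a factor, matching the DATA setting as described above.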
For a DATA factor, the category predicted for each item in the prediction set is taken to be the factor level that occurs most often amongst its nearest neighbours. If more than one level occurs most often, the choice is narrowed down by seeing which of the levels has the most similar neighbours. If this still leaves more than one level, the choice is narrowed further by seeing which of the levels has neighbours with the highest mean similarity. Then, if even that does not lead to a single level, the final choice is made at random. The confusion matrix for the optimal option combination can be saved with the CONFUSION parameter. This gives the percentage of items allocated from each observed group (in rows) to the predicted groups (in columns).
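As an illustration, the confusion matrix might be saved and displayed as sketched below, assuming the glass data from the Example section of this page. Conf is a hypothetical identifier.

```genstat
"Load the data and form the similarity matrix, as in the Example."
SPLOAD [PRINT=summary] '%DATA%/GlassTypes.gsh'
FSIMILARITY [SIMILARITY=Sim] RI,Na,Mg,Al,Si,K,Ca,Ba,Fe; TEST=euclidean
"Save the confusion matrix for the optimal options rather than printing it."
KNNTRAIN [PRINT=*; SIMILARITY=Sim; MINSIMILARITY=0.9; SEED=15243] \
         DATA=Type; CONFUSION=Conf
"Rows are the observed glass types, columns the predicted types."
PRINT Conf
```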
For a DATA variate, the METHOD option controls whether the prediction is made by the median (default) or the mean of the data values of the nearest neighbours of each prediction item.
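The sketch below illustrates a variate response. It assumes the glass data from the Example section of this page and, purely for illustration, predicts the variate RI from a similarity matrix formed from the remaining chemical measurements. ChemSim and RIpred are hypothetical identifiers.

```genstat
SPLOAD [PRINT=summary] '%DATA%/GlassTypes.gsh'
"Form the similarities without RI, which is now the variable to be predicted."
FSIMILARITY [SIMILARITY=ChemSim] Na,Mg,Al,Si,K,Ca,Ba,Fe; TEST=euclidean
"Predict each RI value as the mean of its neighbours' values."
KNNTRAIN [PRINT=error; SIMILARITY=ChemSim; METHOD=mean; SEED=15243] \
         DATA=RI; PREDICTIONS=RIpred
```

Omitting METHOD, or setting METHOD=median, would use the median of the neighbours' values instead.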
Printed output is controlled by the PRINT option, with settings:
    error          to print the cross-validation errors for all combinations,
    confusion      to print the confusion matrix for the optimal options (only if DATA is a factor), and
    predictions    to print the predictions from the optimal options.
The default is PRINT=error,confusion.
The SEED option specifies the seed for the random numbers that are used in the selection of tied outcomes and for the selection of cross-validation groups. The default of zero continues an existing sequence of random numbers if any have already been used in this Genstat job; otherwise it initializes the seed automatically.
Options: PRINT, SIMILARITY, METHOD, MINSIMILARITY, MINNEIGHBOURS, MAXNEIGHBOURS, NSIMULATIONS, NCROSSVALIDATIONGROUPS, SEED.
Parameters: DATA, PREDICTIONS, ERROR, CONFUSION, OPTIMAL.
See also
Directives: FSIMILARITY, ASRULES, NNFIT, RBFIT.
Procedures: KNEARESTNEIGHBOURS, BCLASSIFICATION, BCFOREST, BREGRESSION, SOM.
Commands for: Data mining, Multivariate and cluster analysis.
Example
CAPTION 'KNNTRAIN example','Classification of 6 types of glass'; \
        STYLE=major,minor
SPLOAD [PRINT=summary] '%DATA%/GlassTypes.gsh'
FSIMILARITY [SIMILARITY=Sim] RI,Na,Mg,Al,Si,K,Ca,Ba,Fe; TEST=euclidean
KNNTRAIN [SIMILARITY=Sim; MINS=0.9; MINNEIGHBOURS=!(1...5); \
         MAXNEIGHBOURS=!(1...8); NSIMULATIONS=2; SEED=15243] Type