Evaluates and optimizes the k-nearest-neighbour algorithm using cross-validation (D.B. Baird).
Options

|PRINT|Printed output required (error, confusion and/or predictions)|
|SIMILARITY|Provides the similarities between the observations|
|METHOD|How to calculate the prediction from a DATA variate (mean or median); default median|
|MINSIMILARITY|Cut-off minimum value of the similarity for items to be regarded as neighbours; default 0.75|
|MINNEIGHBOURS|Minimum number of nearest neighbours to use; default 5|
|MAXNEIGHBOURS|Maximum number of nearest neighbours to use; default 10|
|NSIMULATIONS|Number of cross-validation sets to use; default 1|
|NCROSSVALIDATIONGROUPS|Number of groups for cross-validation; default 10|
|SEED|Seed for the random numbers used to select cross-validation groups; default 0|

Parameters

|DATA|Data values for the items in the data set|
|PREDICTIONS|Saves the predictions using the optimal options|
|ERROR|Cross-validation error rate for the optimal combination|
|CONFUSION|Confusion matrix for the cross-validation predictions from the optimal options|
|OPTIMAL|Pointer to the optimal options|
KNNTRAIN uses cross-validation to evaluate and optimize the data-mining technique known as k-nearest-neighbour classification. This can be used to select the best options to use with the
KNEARESTNEIGHBOURS procedure, which performs the classification on a single data set.
A range of values to try can be specified for each of the options MINSIMILARITY, MINNEIGHBOURS and MAXNEIGHBOURS (described below).
KNNTRAIN evaluates the cross-validation error for every combination of the values, and selects the optimal values as the best of these combinations.
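The search over option combinations amounts to a simple grid search on the cross-validation error. The Python sketch below illustrates the idea only; it is not Genstat code, and the `cv_error` function and candidate value lists are hypothetical stand-ins for the quantities KNNTRAIN actually evaluates.

```python
from itertools import product

def best_combination(candidates, cv_error):
    """Evaluate every combination of MINSIMILARITY, MINNEIGHBOURS and
    MAXNEIGHBOURS candidate values, and keep the one with the lowest
    cross-validation error (a sketch of the idea, not Genstat's code).

    candidates: dict with keys 'minsim', 'minnb', 'maxnb', each a list.
    cv_error:   hypothetical function(minsim, minnb, maxnb) -> error rate.
    """
    best = None
    for minsim, minnb, maxnb in product(candidates['minsim'],
                                        candidates['minnb'],
                                        candidates['maxnb']):
        if minnb > maxnb:          # skip inconsistent combinations
            continue
        err = cv_error(minsim, minnb, maxnb)
        if best is None or err < best[0]:
            best = (err, {'MINSIMILARITY': minsim,
                          'MINNEIGHBOURS': minnb,
                          'MAXNEIGHBOURS': maxnb})
    return best

# Toy error surface: pretend cut-off 0.9 with 3 neighbours is best.
toy = lambda s, lo, hi: abs(s - 0.9) + abs(hi - 3) * 0.1 + lo * 0.01
err, opts = best_combination({'minsim': [0.75, 0.9],
                              'minnb': [1, 2],
                              'maxnb': [2, 3, 4]}, toy)
print(opts['MAXNEIGHBOURS'])  # 3
```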
Cross-validation works by splitting the data set randomly into groups that are sized as equally as possible. The number of groups to form is specified by the
NCROSSVALIDATIONGROUPS option. The items in each group are then predicted from classifications formed from the data in the other groups. The
NSIMULATIONS option defines how often to repeat this process. Larger values than the default of one can be specified to give more precision.
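The splitting into near-equal random groups can be pictured with a short Python sketch (illustrative only; the round-robin allocation after shuffling is an assumption, not necessarily Genstat's actual algorithm):

```python
import random

def cv_groups(n_items, n_groups, rng):
    """Randomly split item indices into n_groups of as-equal-as-possible
    size: shuffle, then deal the items out round-robin so that group
    sizes differ by at most one."""
    idx = list(range(n_items))
    rng.shuffle(idx)
    return [idx[g::n_groups] for g in range(n_groups)]

rng = random.Random(15243)
groups = cv_groups(23, 10, rng)          # e.g. NCROSSVALIDATIONGROUPS=10
sizes = sorted(len(g) for g in groups)
print(sizes)  # [2, 2, 2, 2, 2, 2, 2, 3, 3, 3]
```

Each group is then predicted in turn from the remaining groups; with NSIMULATIONS greater than one, the whole split-and-predict cycle is repeated with fresh random groups.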
The SIMILARITY option provides the information for
KNNTRAIN to use to determine the nearby items (or nearest neighbours) for each item in the data set. This is a symmetric matrix with a row (and column) for every item. It can be formed in a wide variety of ways, using mixtures of categorical and continuous data, by the
FSIMILARITY directive. For example, if we have a factor Sex, and variates Age, Weight and Height, we could form a symmetric matrix Sim by the statement
FSIMILARITY [SIMILARITY=Sim] Sex,Age,Weight,Height;\
The MINSIMILARITY option sets a minimum value for the similarity between two items if they are to be regarded as neighbours (default 0.75). The
MINNEIGHBOURS option specifies the minimum number of neighbours to try to find (default 5), and the
MAXNEIGHBOURS option specifies the maximum number (default 10). The search for nearest neighbours to obtain the prediction for a particular item works by finding the most similar item in the training set, and adding this (with any equally-similar training items) to the set of neighbours. If at least
MINNEIGHBOURS have been found, the search stops. Otherwise it finds the next most similar items, and adds these to the set of neighbours, continuing until at least
MINNEIGHBOURS have been found. If this results in more than
MAXNEIGHBOURS neighbours, KNEARESTNEIGHBOURS makes a random selection from those that are least similar to the prediction item, so that the number of neighbours becomes MAXNEIGHBOURS.
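The level-by-level neighbour search described above can be sketched in Python (an illustration of the stated rules, not Genstat's implementation; it assumes MINNEIGHBOURS is no greater than MAXNEIGHBOURS):

```python
import random

def nearest_neighbours(similarities, min_nb, max_nb, min_sim, rng):
    """Collect neighbours for one prediction item (a sketch).

    similarities: list of (training_item, similarity) pairs.
    Items are added one similarity level at a time (ties come in
    together) until at least min_nb neighbours are found; if that
    overshoots max_nb, the surplus is removed at random from the
    least similar level.  Assumes min_nb <= max_nb.
    """
    # keep only candidates above the cut-off, most similar first
    cands = sorted((c for c in similarities if c[1] >= min_sim),
                   key=lambda c: -c[1])
    chosen, i = [], 0
    while i < len(cands) and len(chosen) < min_nb:
        level = cands[i][1]
        tied = [c for c in cands if c[1] == level]
        chosen.extend(tied)
        i += len(tied)
    if len(chosen) > max_nb:
        # randomly keep some of the least similar (last-added) level
        last = min(c[1] for c in chosen)
        keep = [c for c in chosen if c[1] > last]
        tied = [c for c in chosen if c[1] == last]
        keep.extend(rng.sample(tied, max_nb - len(keep)))
        chosen = keep
    return chosen

sims = [('a', 0.95), ('b', 0.9), ('c', 0.9), ('d', 0.9), ('e', 0.6)]
nb = nearest_neighbours(sims, min_nb=2, max_nb=3, min_sim=0.75,
                        rng=random.Random(1))
print(len(nb))  # 3
```

Here 'e' falls below the 0.75 cut-off, 'a' is taken first, and the three-way tie at 0.9 overshoots MAXNEIGHBOURS, so two of 'b', 'c', 'd' are kept at random.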
The DATA parameter lists variates and/or factors containing values of the variables of interest for the items in the data set. The predictions from the optimal option combination can be saved using the
PREDICTIONS parameter (in variates and/or factors to match the settings of the
DATA parameter). These predictions are from a jackknife analysis where each item is predicted from the other items in the data set using the optimal options. The cross-validation error can be saved using the
ERROR parameter. The
OPTIMAL option can save a pointer, containing three scalars giving the optimal combination for
MINSIMILARITY, MINNEIGHBOURS and MAXNEIGHBOURS respectively. The pointer has labels giving the option names.
When the DATA parameter is a factor, the category predicted for each item in the prediction set is taken to be the factor level that occurs most often amongst its nearest neighbours. If more than one level occurs most often, the choice is narrowed down by seeing which of the levels has the most similar neighbours. If this still leaves more than one level, the choice is narrowed further by seeing which of the levels has neighbours with the highest mean similarity. Then, if even that does not lead to a single level, the final choice is made at random. The confusion matrix for the optimal option combination can be saved with the
CONFUSION parameter. This gives the percentage of items allocated from each observed group (in rows) to the predicted groups (in columns).
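The tie-breaking cascade for a factor described above can be sketched in Python (illustrative only; interpreting "most similar neighbours" as the highest single similarity within a level is an assumption):

```python
import random
from collections import Counter

def predict_category(neighbours, rng):
    """Tie-break cascade for a factor (a sketch of the rules above).

    neighbours: list of (category, similarity) pairs.
    1. level occurring most often; 2. level with the single most
    similar neighbour; 3. level with the highest mean similarity;
    4. final choice at random.
    """
    counts = Counter(cat for cat, _ in neighbours)
    top = max(counts.values())
    tied = [cat for cat, n in counts.items() if n == top]
    if len(tied) > 1:                       # most similar neighbour
        best = {cat: max(s for c, s in neighbours if c == cat)
                for cat in tied}
        tied = [cat for cat in tied if best[cat] == max(best.values())]
    if len(tied) > 1:                       # highest mean similarity
        mean = {cat: sum(s for c, s in neighbours if c == cat) /
                     counts[cat] for cat in tied}
        tied = [cat for cat in tied if mean[cat] == max(mean.values())]
    if len(tied) > 1:                       # final choice at random
        return rng.choice(tied)
    return tied[0]

nb = [('glass1', 0.95), ('glass2', 0.95),
      ('glass2', 0.80), ('glass1', 0.90)]
print(predict_category(nb, random.Random(0)))  # glass1
```

In the example both levels occur twice and share the top similarity 0.95, so the higher mean similarity of 'glass1' (0.925 against 0.875) decides.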
When the DATA parameter is a variate, the
METHOD option controls whether the prediction is made by the median (default) or the mean of the data values of the nearest neighbours of each prediction item.
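As a small illustration of the METHOD choice (a Python sketch, not Genstat code):

```python
from statistics import mean, median

def predict_value(neighbour_values, method='median'):
    """Prediction for a DATA variate from the data values of the
    item's nearest neighbours: the median by default, or the mean."""
    return median(neighbour_values) if method == 'median' \
        else mean(neighbour_values)

vals = [4.0, 5.0, 9.0]
print(predict_value(vals))           # 5.0  (default: median)
print(predict_value(vals, 'mean'))   # 6.0
```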
Printed output is controlled by the PRINT option, with settings:
error to print the cross-validation errors for all combinations,
confusion to print the confusion matrix for the optimal options (only if
DATA is a factor), and
predictions to print the predictions from the optimal options.
The default is error. The SEED option specifies the seed for the random numbers that are used in the selection of tied outcomes and for the selection of cross-validation groups. The default of zero continues an existing sequence of random numbers if any have already been used in this Genstat job, or initializes the seed automatically.
CAPTION 'KNNTRAIN example','Classification of 6 types of glass'; \
        STYLE=major,minor
SPLOAD [PRINT=summary] '%DATA%/GlassTypes.gsh'
FSIMILARITY [SIMILARITY=Sim] RI,Na,Mg,Al,Si,K,Ca,Ba,Fe; TEST=euclidean
KNNTRAIN [SIMILARITY=Sim; MINS=0.9; MINNEIGHBOURS=!(1...5); \
        MAXNEIGHBOURS=!(1...8); NSIMULATIONS=2; SEED=15243] Type