
IDENTIFY procedure

Identifies an unknown specimen from a defined set of objects (R.W. Payne).


Options

PRINT = string tokens Controls printed output (identification, transcript); default iden, tran
METHOD = string token Type of run (batch, interactive); if this is not set IDENTIFY checks whether the run of Genstat itself is batch or interactive
TAXA = text or factor Names for the taxa (i.e. the objects); default uses the positive integers 1, 2…
NMISTAKE = scalar Number of mistakes to allow for; default 0
IDENTIFICATION = text Saves the names of the taxa that are identified; default * i.e. not saved
DIFFERENCES = variate Saves the number of differences between the observed character states and those that can be displayed by each taxon; default * i.e. not saved


Parameters

CHARACTER = factors or tables Define the characteristics of the taxa; must be set
OBSERVATION = scalars or texts Can define an observation for each character; default * i.e. none
COST = scalars Costs of observing each character; default 1


Description

IDENTIFY allows you to identify an unknown specimen from a set of possible taxa, for example species of plant, types of machine fault, strains of bacteria, and so on. The specimen is identified by comparing observations that you specify for the specimen against the characteristics that you have defined for the taxa. Each character is assumed to have a set of distinct possible states, which are represented by the levels of a factor.

So, IDENTIFY assumes that the values of the characters are discrete. Often the characters will be binary, representing the presence or absence of some attribute. Alternatively, they may involve counts, for example of numbers of leaves or petals. If you want to use continuous variables, you will need to classify the values into ranges (for example using the GROUPS directive).
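
As a minimal sketch of that classification step, the statements below group a hypothetical variate of typical leaf lengths into three ranges. The variate name, its values and the limits are invented for illustration, and the use of the FACTOR and LIMITS parameters of GROUPS is an assumption that should be checked against the documentation for that directive.

```genstat
" Hypothetical typical leaf lengths (cm), one value per taxon. "
VARIATE  [VALUES=2.5,7,12,4,9.5,15,6,11,3,8,14] Leaf_length
" Form a three-level factor using boundaries at 5 and 10. "
GROUPS   Leaf_length; FACTOR=Leaf_length_class; LIMITS=!(5,10)
```

The resulting factor Leaf_length_class could then be included in the CHARACTER list like any other discrete character.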

Generally, the properties of the taxa with respect to each character can be defined by a factor, whose levels represent the range of values that can occur for the character. If a taxon only ever displays one state of the character (i.e. if it has a fixed response), the unit of the factor corresponding to that taxon should be set to the relevant level. Conversely, if different specimens of the taxon can display different states of the character (i.e. it has a variable response), the unit should contain a missing value.

Representing the properties for a character by a factor assumes that, if a taxon is variable, any of the states of the character may occur. Information will thus be lost for taxa that can show several, but not all, of the states of a character. An alternative representation, therefore, uses a table classified by one factor representing the states of the character, and another representing the taxa. So, there is a row of the table for each taxon, and this contains a zero value for the states that the taxon cannot display, and a non-zero value (usually one) for those that it can display. The table below defines the texture of the bark for the trees in the example for IDENTIFY.

                smooth  rough  corky  scored horizontally  scaling
Ash                  1      1      0                    0        0
Beech                1      0      0                    0        0
Birch                0      0      0                    0        1
Elder                0      0      1                    0        0
Elm                  0      1      0                    0        0
Lime                 1      0      0                    0        0
Oak                  0      1      0                    0        0
Plane                0      0      0                    0        1
Rowan                0      0      0                    1        0
Sweet chestnut       0      1      0                    0        0
Sycamore             1      0      0                    0        1

Most of the trees have fixed responses, for example all Beech trees have smooth bark, and all Elm trees have rough bark. However, Ash trees may have either smooth or rough bark but not, for example, corky bark.

The factors and/or tables defining the properties of the taxa must be listed using the CHARACTER parameter. If any of these is a table, the TAXA option must be set to the factor used to represent the taxa there. The levels of the factor (or its labels if present) then supply names for the taxa that are used in the output. If there are no CHARACTER tables, TAXA can be set to a text containing the taxon names instead. If TAXA is not set, IDENTIFY uses the integers 1, 2… The COST parameter can be used to supply a list of scalars indicating the cost of observing each character; if this is not set, the costs are all assumed to be equal to one.
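
For instance, assuming that the factor Trees and the pointer Data hold the taxa and the seven characters set up in the example for IDENTIFY, unequal costs might be supplied as sketched below. The cost values are invented for illustration.

```genstat
" Costs are listed in the same order as the characters in Data.
  Here the sixth character is assumed to be three times as costly
  to observe as the first. "
IDENTIFY [TAXA=Trees] CHARACTER=Data[]; COST=1,2,1,1,1,3,2
```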

The METHOD option defines whether IDENTIFY operates interactively, or in batch mode. If this is not set, IDENTIFY checks whether Genstat itself is running interactively or in batch. In an interactive run, IDENTIFY displays menus to guide you through to achieving an identification. The main menu allows you to select any one of the following actions.

1)       list potential identifications – IDENTIFY compares the observations that you specify for the specimen against the characteristics that you have defined for the taxa. It then lists the taxa (if any) that can display all of the character states that you have observed, then those that can display all except one, all except two, and so on. The list is displayed in sections, and you can terminate it at any time.

2)       select and observe a character – IDENTIFY assesses the characters, and lists them in order of their effectiveness. Alongside each one it prints an estimate of the number (or cost, if the COST parameter has been set) of the characters that must be observed to complete the identification, assuming that this one is observed next. After you have chosen a character, it displays another menu for you to specify the state that you have observed.

3)       specify an observed character (find in list) – IDENTIFY lists the characters so that you can indicate which one you wish to observe next. After you have chosen a character, it displays another menu for you to specify the state that you have observed.

4)       specify an observed character (type name) – IDENTIFY asks you to type the name of the character that you wish to observe next. If you type just the initial part of the name, IDENTIFY will give you a list of all the characters whose names begin like that. After you have chosen a character, it displays another menu for you to specify the state that you have observed.

5)       modify an observation – IDENTIFY lists the characters that have already been observed to allow you to choose which you want to modify. After you have chosen a character, it displays another menu for you to specify the revised value.

6)       display observations – IDENTIFY displays the characters that have already been observed.

7)       list the characteristics of a taxon – IDENTIFY lists the taxa so that you can indicate the one whose characteristics you wish to display.

8)       show differences between 2 taxa – IDENTIFY lists the taxa so that you can indicate the two that you want to compare. IDENTIFY then lists the characters that differ between them.

9)       set configuration options – IDENTIFY generates a menu allowing you to set various configuration options. Firstly, you can ask IDENTIFY to take account of a specified number of mistakes in your observations. It will then allow up to this number of differences between your observations and the characteristics of each taxon when suggesting which character to observe next, or when making an identification. The initial setting for the number of mistakes is set by the NMISTAKE option, with a default of zero (i.e. none). You can also control whether to produce a transcript of your activities and whether to print the identification obtained at the end of your run. The initial settings for these two aspects are set by the PRINT option; by default both are printed.

10)     start a new identification (clearing observed characters) – IDENTIFY clears the current observations so that you can start again.

11)     save/print identification and then exit – IDENTIFY prints and saves the identification, as requested, and then stops.
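
The initial values of the configuration settings in menu item 9 can be given when the procedure is called. The sketch below, which assumes the structures Trees and Data from the example for IDENTIFY, starts an interactive session that allows for one mistake and prints only the identification.

```genstat
" Start the menus with one mistake allowed and no transcript.
  Both settings can still be changed through menu item 9. "
IDENTIFY [PRINT=identification; METHOD=interactive; TAXA=Trees;\
         NMISTAKE=1] Data[]
```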

The identification is saved by setting the IDENTIFICATION option to a text to contain the names of all the taxa that can display the observed character states, allowing for any requested number of mistakes. You can also set the DIFFERENCES option to a variate to contain the number of differences between the observed character states and those that can be displayed by each taxon.
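
A minimal sketch of saving the results follows, again assuming the structures Trees and Data from the example for IDENTIFY. Found and Ndiff are hypothetical identifiers chosen for this illustration.

```genstat
" Found receives the names of the taxa that match, allowing one mistake.
  Ndiff receives the number of differences for each taxon. "
IDENTIFY [TAXA=Trees; NMISTAKE=1; IDENTIFICATION=Found;\
         DIFFERENCES=Ndiff] Data[]
" The two structures may differ in length, so print them separately. "
PRINT    Found
PRINT    Ndiff
```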

For a batch run, you should use the OBSERVATION parameter to supply values for all the characters that you have observed. These can be either scalars (referring to levels of the factor) or one-line texts (referring to its labels), or a missing value to denote characters that have not been observed. This parameter can also be used in an interactive run, as an alternative to supplying the observations through the menus.
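
The two ways of giving an observation can be sketched as below, assuming the seven characters held in the pointer Data in the example for IDENTIFY, with the form of the leaves first and the texture of the bark sixth. The use of * in the list to mark unobserved characters follows the description above, but the exact form of the call is an assumption.

```genstat
" A tree with pinnate leaves and smooth bark; nothing else observed.
  The first observation is given by label, the sixth by level
  (level 1 of Texture_of_bark is smooth). "
IDENTIFY [TAXA=Trees; METHOD=batch] Data[];\
         OBSERVATION='pinnate',*,*,*,*,1,*
```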




Method

At each stage, IDENTIFY uses the QUESTION procedure to allow you to choose what action to take. The efficiency of the characters is assessed using the selection criterion function CMV′ of Payne (1981).


Reference

Payne, R.W. (1981). Selection criteria for the construction of efficient diagnostic keys. Journal of Statistical Planning and Inference, 5, 27-36.


Example

" Lines missing from this listing have been restored from context. The
  CAPTION statement and its STYLE setting, the factor names
  Arrangement_of_leaves_on_stem and Edges_of_leaves, the table names
  other than Table_for_Pairs_of_leaflets_per_leaf, and the OBSERVATION
  list are assumptions. "
CAPTION  'IDENTIFY example',\
         !t('11 species of common British tree (data from Payne & Preece,',\
         '1980, Identification keys and diagnostic tables: a review,',\
         'Journal of the Royal Statistical Society, Series A, 143,',\
         '253-292). The example operates in batch mode to identify a',\
         'tree with pinnate leaves and smooth bark as an Ash.',\
         'After running the example, you can type'),\
         '  IDENTIFY [TAXA=Trees] Data[]',\
         'to identify a new specimen of any of these trees interactively.';\
         STYLE=meta,plain,plain,plain
TEXT     [VALUES=Ash,Beech,Birch,Elder,Elm,Lime,\
         Oak,Plane,Rowan,'Sweet chestnut',Sycamore] Treelabels
FACTOR   [LABELS=Treelabels] Trees
FACTOR   [NVALUES=11; LEVELS=3; LABELS=!t('not pinnate or lobed',\
         lobed,pinnate)] Form_of_leaves
FACTOR   [NVALUES=11; LEVELS=!(0...7)] Pairs_of_leaflets_per_leaf
FACTOR   [NVALUES=11; LEVELS=!(0...5); LABELS=!t('N/A','pointed oval',\
         triangular,'heart shaped',oblong,'broad lanceolate')]\
         Basic_shape_of_leaves
FACTOR   [NVALUES=11; LEVELS=2; LABELS=!t(opposite,alternate)]\
         Arrangement_of_leaves_on_stem
FACTOR   [NVALUES=11; LEVELS=2; LABELS=!t('not toothed',toothed)]\
         Edges_of_leaves
FACTOR   [NVALUES=11; LEVELS=5; LABELS=!t(smooth,rough,corky,\
         'scored horizontally',scaling)] Texture_of_bark
FACTOR   [NVALUES=11; LEVELS=2; LABELS=!t(unisexual,bisexual)]\
         Sexual_characteristics_of_flowers
POINTER  [VALUES=Form_of_leaves,Pairs_of_leaflets_per_leaf,\
         Basic_shape_of_leaves,Arrangement_of_leaves_on_stem,\
         Edges_of_leaves,Texture_of_bark,\
         Sexual_characteristics_of_flowers] Characters
READ     Characters[]
3 * 0 1 2 * *
1 0 1 2 1 1 1
1 0 * 2 2 5 1
3 * 0 1 2 3 2
1 0 1 2 2 2 2
1 0 3 2 2 1 2
2 0 4 2 1 2 1
2 0 3 2 2 5 1
3 * 0 2 2 4 2
1 0 5 2 2 2 1
2 0 3 1 2 5 1 :
" Some trees are able to display more than one state (i.e. possible value)
  of the characters Pairs_of_leaflets_per_leaf, Basic_shape_of_leaves and
  Texture_of_bark. So the definitions of these characters are coded by
  tables classified by Trees and the factor defining the character concerned.
  In the row of the table for each tree, there is a one in the position for
  every state that the tree is able to display (and a zero everywhere else)."
TABLE    [CLASSIFICATION=Trees,Pairs_of_leaflets_per_leaf; VALUES=\
         0,0,0,1,1,1,1,1, 1,0,0,0,0,0,0,0, 1,0,0,0,0,0,0,0,\
         0,1,1,1,1,0,0,0, 1,0,0,0,0,0,0,0, 1,0,0,0,0,0,0,0,\
         1,0,0,0,0,0,0,0, 1,0,0,0,0,0,0,0, 0,0,0,0,0,1,1,1,\
         1,0,0,0,0,0,0,0, 1,0,0,0,0,0,0,0]\
         Table_for_Pairs_of_leaflets_per_leaf
TABLE    [CLASSIFICATION=Trees,Basic_shape_of_leaves; VALUES=\
         1,0,0,0,0,0, 0,1,0,0,0,0, 0,1,1,0,0,0, 1,0,0,0,0,0,\
         0,1,0,0,0,0, 0,0,0,1,0,0, 0,0,0,0,1,0, 0,0,0,1,0,0,\
         1,0,0,0,0,0, 0,0,0,0,0,1, 0,0,0,1,0,0]\
         Table_for_Basic_shape_of_leaves
TABLE    [CLASSIFICATION=Trees,Texture_of_bark; VALUES=\
         1,1,0,0,0, 1,0,0,0,0, 0,0,0,0,1, 0,0,1,0,0,\
         0,1,0,0,0, 1,0,0,0,0, 0,1,0,0,0, 0,0,0,0,1,\
         0,0,0,1,0, 0,1,0,0,0, 1,0,0,0,1]\
         Table_for_Texture_of_bark
POINTER  [VALUES=Form_of_leaves,Table_for_Pairs_of_leaflets_per_leaf,\
         Table_for_Basic_shape_of_leaves,Arrangement_of_leaves_on_stem,\
         Edges_of_leaves,Table_for_Texture_of_bark,\
         Sexual_characteristics_of_flowers] Data

IDENTIFY [TAXA=Trees; METHOD=batch] Data[];\
         OBSERVATION='pinnate',*,*,*,*,'smooth',*

Updated on June 19, 2019
