Forms irredundant test sets for the efficient identification of a set of objects.
|Controls printed output (
||Saves the best set|
||Saves details of the available sets|
||Saves details of the objects that cannot be distinguished|
||Algorithm to use (
||Defines labels for the objects (or taxa) to be identified; default uses the unit labels vector of the
||Defines groupings of the objects so that the sets are constructed to distinguish only between the objects that belong to different groups; default constructs sets to distinguish between individual objects|
||If this is specified, sets are constructed just to distinguish the specified object (or taxon) from the other objects|
||Number of factors required in each set to distinguish between each pair of objects; default 1|
||Maximum preference of the factors to be included in the sets|
||Limit on number of factors in a set (sets containing more than this are discarded); default
||Number of sets to print (a positive number specifies the number to print, a negative number sets a tolerance on the difference between the sizes of the sets printed and the size of the best set); default
||Number of sets to save in the
||Variate containing two numbers n1 and n2, if this is specified then every time that there are more than n1 sets under construction using the exact method, the sets are arranged in order of increasing size and all sets containing more factors than set n2 are deleted|
||Whether or not to store the distinctions or recalculate them at every stage in the exact algorithm (
||Function to be use to select factors by the sequential method (
||Maximum number of improvement cycles to perform during the sequential method; default 20|
||Value for determining equivalence of the selection criteria of tests selected during the sequential method|
||Factors, and/or tables classified by a single factor, defining the properties of the objects to to be identified|
||Cost associated with each factor; default 1|
||Preference rating for each factor (1 representing most preferred etc.); default 1|
||Factor level used to represent variable information; default is to use a missing value|
||Factor level used to indicate that the information provided by that factor is inapplicable for a particular object|
IRREDUNDANT directive is useful when you have a set of objects (or taxa) whose properties can be described by a set of discrete-valued tests. Many applications are biological. For example, in botanical work, the taxa may be species of plant and the tests may require the observation of characters like the colours of petals or numbers of leaves. Similarly, in microbiology, the tests may involve the ability of an organism to grow in various media.
IRREDUNDANT helps you to select an efficient set of tests that can be applied, in a batch, to identify any unknown specimen of any of the objects. (The batch of tests is then often printed as a diagnostic table; see Payne & Preece 1980.) As all the tests in the set are to be used for every identification, it is best for the set to contain as few tests as possible. So there should thus be no redundant tests: these are tests that can be deleted from the set without causing any object (or taxon) to be no longer identifiable. Sets of tests that contain no redundant tests are known as irredundant.
Consider taxa A, B, C and D, whose responses to tests 1-5 are shown in the table below. The symbol “+”, for example in the entry for taxon A and test 1, indicates that all specimens of taxon A will always give a positive result to test 1, the symbol “-” for taxon D with test 1 indicates a negative result, and the symbol “v” for taxon B with test 3 indicates that some specimens of D will give a positive result to test 3 but others will give a negative result.
The table contains several irredundant sets, one of which contains the tests 1, 3 and 5. (If, for example, test 3 is deleted from this set, taxa A and C can no longer be distinguished). Another set contains tests 2 and 4. So, the irredundant sets can be of different sizes. The optimum set will often be defined to be one containing a minimum number of tests. Alternatively, if the test cost different amounts to apply, the optimum set may be one with minimum total cost. However, whichever of these situations applies, the optimum set will be irredundant, as otherwise a better set could be obtained by deleting a redundant test.
The characteristics of the taxa and tests are specified using the
CHARACTER parameter of
IRREDUNDANT. In the simplest situation, this provides a list of factors, one for each test (or character), as with the
BKEY procedure. The factors contain a unit for each taxon, and the level stored in that unit indicates how the taxon can respond to the test. For example, irredundant sets for the data in the table above could be constructed as follows:
TEXT [VALUES=A,B,C,D] Taxa
FACTOR [NVALUES=4; LEVELS=2] T1,T2,T3,T4,T5
2 2 2 1 2
2 1 * 1 1
2 1 1 2 2
1 2 2 2 1 :
IRREDUNDANT [TAXONNAMES=Taxa] T1,T2,T3,T4,T5
Level 1 of the factors
T5 represents a negative response, and level 2 represents a positive response. The variable response of taxon B with test 3 is represented by a missing value, but you can use the
VARIABLE parameter to use a particular level of the factor instead. There may be tests that are not applicable to some of the taxa. For example, when identifying insects, tests concerning colours of wings are not applicable to those that do not fly! The level to be used to indicate these responses is specified by the
INAPPLICABLE parameter. Costs for the test can be specified by the
COST parameter; by default, these are all taken to be one. Names for the taxa can be supplied, in either a text or a variate or a factor, using the
TAXONNAMES parameter. If this is not set,
IRREDUNDANT uses the unit labels of the
CHARACTER factors if any have been defined (see the
FACTOR directive), or otherwise the integers 1, 2 upwards.
The use of the
VARIABLE option works well with responses that are completely variable i.e. where the specimens of the taxon may give any of the available results to the test. However, when the tests have more than two possible results, there may be taxa that can give some but not all of the available results to a test. The responses to a test like this should be specified by a two-way table classified by one factor with a level for each possible result, and another with a level for each taxon. The table should then contain a positive (e.g. one) whererever the taxon concerned can deliver the result, and zero elsewhere. For example suppose that, with test
T6, taxon A, C and D always give result 1, 2 and 3 respectively, but taxon B can give either or results 2 or 3. The relevant table could then be constructed and used as follows:
FACTOR [LABELS=Taxa] Taxfact
FACTOR [LEVELS=3] T6fact
TABLE [CLASSIFICATION=T6fact,Taxfact; VALUES=\
" level 1:" 1, 0, 0, 0,\
" level 2:" 0, 1, 1, 0,\
" level 3:" 0, 1, 0, 1 ] T6tab
IRREDUNDANT [TAXONNAMES=Taxfact] T1,T2,T3,T4,T5,T6tab
The standard irredundant sets contain at least one test to distinguish each pair of taxa. However, to guard against mistakes in either the original data on during the subsequent use of the set, you can set the
NDISTINCTIONS option to ask for the set to include a larger number of tests able to distinguish each pair. Another refinement is that you can set the
GROUPS option to a factor defining groupings of the taxa. The sets are then formed to distinguish only pairs of taxa that belong to different groups. Alternatively, you may want a set of tests to either confirm whether or not the specimen belongs to one particular taxon. The taxon of interest should then be indicated by setting the
OBJECT option to the number of the taxon or, if textual taxon names have been defined, to the text identifying the taxon. Finally, if you set both
OBJECT, the sets will be constructed to confirm whether or not a specimen belongs to a particular group.
IRREDUNDANT provides two methods for constructing the irredundant sets. The default is to use an exact method (Payne 1991) which constructs all possible sets for the dataset concerned. However, with some datasets, there may be too many sets to construct them all. If you run out of workspace (or time), you can use the
LIMSETS to specify a variate containing two integers n1 and n2. Then whenever there are more than n1 sets under construction, the sets are arranged in order of increasing size and all sets containing more factors than set n2 are deleted. The method then no longer guarantees to find all the irredundant sets containing the fewest number of tests or with the minimum total costs, but in the situations where this modification is needed, it is very unlikely that it will fail to find any of them.
Alternatively, you can set option
METHOD=sequential to use a sequential algorithm (Payne & Preece 1980, Section 6.6). This does not guarantee to find a set with minimum size or cost, but it takes much less computing time and should always should produce a satisfactory set. The sequential method starts with an initial set containing all the essential tests, and then adds additional tests, one at a time, until each pair of taxa can be distinguished. (A test is essential if it is the only test which can distinguish between a particular pair of taxa.) The criterion for selecting the test to add to the set at each stage is usually the number of pairs of taxa that the test distinguishes, of those pairs not distinguished by tests already in the set. If costs have been defined, this number of pairs is divided by the cost of the test concerned.
CRITERION=weighted uses a refinement, suggested by Barnett et al. (1983), which weights each pair of taxa by the reciprocal of the number of tests that can distinguish between them. The criterion is then the maximum weighted number of pairs of taxa (divided by the cost of the test, if defined). This causes tests that distinguish “difficult” pairs of taxa (those with nearly identical characteristics) to be selected earlier during the construction of the set, and thus tends to generate smaller sets. You can set a preference rating for each test using the
PREFERENCE parameter; the most-preferred tests should have ratings of one, and less-preferred tests should have ratings of two and upwards. Then, if at any stage there is then more than one test with the best criterion value, the most-preferred test is selected. If these preferances are especially important, you may also also want to set the
EQUIVALENCE option to a scalar, e say. Then all tests whose criterion values are within e of the current maximum are regarded as equivalent, and the best test is selected from within these tests according to the preferences.
The main disadvantage of most sequential methods is that they produce only a single set of tests. In order to allow a choice of sets and as a way of improving the original set,
IRREDUNDANT can run through a sequence of cycles. In each of these, the tests in the best set are deleted in turn, further tests are selected to separate the pairs of taxa distinguished only by the deleted test, and any redundant tests are deleted. If no improvement is achieved, all the non-essential tests are deleted, and the set is reformed without using those tests. The process can be then repeated until no improvements are being achieved of until the number of cycles exceeds the setting of the
MAXCYCLE option (default 20).
Printed output is controlled by the
||numbers of the tests in the sets,|
||table showing the contents of the sets,|
||lists of pairs of taxa that cannot be distinguished,|
||messages for example when the number of sets has been reduced as requested by the
The default is
The best set can be saved using the
BESTSET option, as a pointer containing the relevant factors. The
SETS option can save a matrix, with a row for each set and a column for each test, representing all the sets that have been formed. In each row the matrix generally stores the number one in the columns corresponding to the tests in that set, and zero elsewhere. However, if the sets have been constructed to confirm the identification of a single taxon, the matrix contains more informative numbers than one. So, down each column wherever one would be stored, it instead stores the level given by the taxon for the factor corresponding to the test concerned. The
NOTDISTINGUISHED option can save information about the pairs of taxa that cannot be distinguished, or that are distinguished by less than
NDISTINCTIONS tests. The matrix has a row for each such pair of taxa, and three columns. Columns 1 and 2 contain the numbers of the taxa in the pair, and column 3 contains the number of tests that can distinguish them.
The exact method uses an extension of the algorithm of Payne (1991). The sequential method is an extension by Barnett et al. (1983) of the algorithm described in Payne & Preece (1980) Section 6.6.
IRREDUNDANT takes account of restrictions on any of the
CHARACTER factors or on
Barnett, J.A., Payne, R.W. & Yarrow, D. (1983). Yeasts: Characteristics and Identification. Cambridge: Cambridge University Press.
Payne, R.W. (1991). Algorithm AS263 Construction of irredundant test sets. Applied Statistics, 40, 213-229.
Payne, R.W. & Preece, D.A. (1980). Identification keys and diagnostic tables: a review (with discussion). Journal of the Royal Statistical Society, Series A, 143, 253-292.
Commands for: Calculations and manipulation,