Gives robust identification of multiple outliers in 2-way tables (J.K.M. Brown).
Options
PRINT = string tokens | Printed output required (graph, table); default grap, tabl |
---|---|
GRAPHICS = string tokens | Type of graph required (highresolution, lineprinter); default high |
SORT = string tokens | Sorting of printed output, in order of absolute value of median tetrad (ascending, descending, none); default none |
Parameters
TABLE = tables | Specifies the two-way table of data |
---|---|
ROWS = factors | Saves the factor classifying the table rows |
COLUMNS = factors | Saves the factor classifying the table columns |
DATA = variates | Saves the data values in the body of the table |
MEDIANTETRADS = variates | Saves median tetrads for each cell in the table |
RANKS = variates | Saves ranks of absolute values of median tetrads |
HALFNORMALSCORES = variates | Saves half-Normal scores of absolute values of median tetrads |
TESTOUTLIERS = scalars | Specifies the number of cells, with the highest absolute median tetrads, to be set to their predicted values before re-running the analysis |
Description
In a table of data cross-classified by two factors, some cells may be outliers, in that they contain values substantially higher or lower than those expected from the means of the relevant rows and columns. Median tetrad analysis is a robust, single-step method of identifying several outliers in a two-way table (Bradu & Hawkins 1982).
A tetrad is calculated from four cells which form a square in the body of the table. For instance, if the cell in row i and column j has a value c_{ij}, the tetrad involving that cell and the cell in row p and column q is defined as
t_{ij;pq} = c_{ij} - c_{iq} - c_{pj} + c_{pq}
In a clean tetrad, none of the values c_{iq}, c_{pj} or c_{pq} is itself an outlier, so the tetrad is an estimate of the amount by which c_{ij} deviates from its expected value. In a contaminated tetrad, one or more of c_{iq}, c_{pj} or c_{pq} are outliers, so the tetrad is not a reliable estimate of the deviation of c_{ij} from its expectation.
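The difference between a clean and a contaminated tetrad can be seen in a small numerical sketch. It is written in Python rather than Genstat, and the function name `tetrad` and the table values are made up for illustration. In a purely additive table with a deviation of 5 planted in one cell, a clean tetrad for that cell recovers the 5, while a tetrad for an unaffected cell that happens to use the deviant cell also returns 5, although that cell's true deviation is zero.

```python
def tetrad(table, i, j, p, q):
    """Tetrad t_{ij;pq} = c_{ij} - c_{iq} - c_{pj} + c_{pq} (0-based indices)."""
    if i == p or j == q:
        raise ValueError("a proper tetrad needs i != p and j != q")
    return table[i][j] - table[i][q] - table[p][j] + table[p][q]


# Hypothetical additive table: each cell = row effect + column effect.
row_effects = [0.0, 10.0, 20.0]
col_effects = [1.0, 2.0, 4.0]
table = [[r + c for c in col_effects] for r in row_effects]
table[0][0] += 5.0  # plant a deviation of 5 in the first cell

# Clean tetrad for the deviant cell: the other three cells are unaffected.
clean = tetrad(table, 0, 0, 1, 1)
# Contaminated tetrad for an unaffected cell: it uses the deviant cell as c_{pq}.
contaminated = tetrad(table, 1, 1, 0, 0)
# Tetrad for an unaffected cell that avoids the deviant cell altogether.
unaffected = tetrad(table, 1, 1, 2, 2)
print(clean, contaminated, unaffected)
```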
MEDIANTETRAD calculates the median of all the tetrads involving each cell of the table (such that i ≠ p and j ≠ q, so the four cells in the tetrad form a square). These median tetrads are robust estimates of the deviations for each cell and therefore indicate which cells may contain outliers. The method is robust because the median will be a clean tetrad (and therefore a reliable estimate of the deviation) so long as fewer than half the tetrads involving that cell are contaminated. Furthermore, the robustness of the method allows several outliers to be detected reliably in a single step; other methods of detecting outliers may detect only a single outlier, or may require several steps, one for each outlier.
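The calculation can be sketched as follows. This is an illustrative Python version, not the Genstat source of the procedure; the function name `median_tetrads` and the example table are assumptions, and the sketch assumes a complete table (it does not handle missing values).

```python
from statistics import median


def median_tetrads(table):
    """Median of all proper tetrads (p != i and q != j) for each cell."""
    n_rows, n_cols = len(table), len(table[0])
    result = []
    for i in range(n_rows):
        row = []
        for j in range(n_cols):
            tetrads = [
                table[i][j] - table[i][q] - table[p][j] + table[p][q]
                for p in range(n_rows) if p != i
                for q in range(n_cols) if q != j
            ]
            row.append(median(tetrads))
        result.append(row)
    return result


# Hypothetical 5 x 5 additive table with two planted outliers.
table = [[10.0 * r + c for c in range(5)] for r in range(5)]
table[0][0] += 8.0
table[3][2] -= 6.0

mt = median_tetrads(table)
for row in mt:
    print(row)
```

In this made-up table each planted cell contaminates only a minority of the tetrads of any other cell, so the two planted deviations are recovered in a single pass and every other median tetrad is zero.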
The options of MEDIANTETRAD control the output. PRINT has two settings. The graph setting produces a plot of half-Normal scores of the median tetrads against the absolute values of the median tetrads. In the half-Normal plot, inliers (values for cells which are not outliers, with low deviations) fall on a straight line passing through the origin, while outliers (with high deviations) fall at the upper end of this line and below the level of the line. A regression line, passing through the origin, of half-Normal scores against absolute values of median tetrads, is also plotted. The setting table prints the factors which classify the table, the data in the body of the table, the median tetrads, the ranks of the absolute values of the median tetrads and the half-Normal scores. The GRAPHICS option controls graphical output, as a high-resolution plot (the default setting) or as a line-printer plot. The SORT option controls whether the output provided by setting PRINT=table is sorted in ascending order (most extreme median tetrad last), descending order, or not at all.
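The quantities behind the half-Normal plot can be sketched in Python as below. The exact scoring formula used by the procedure is not given here, so the plotting position in this sketch (a Blom-type approximation) is an assumption and may differ from the one Genstat uses. The function names and the sample values are also hypothetical.

```python
from statistics import NormalDist


def half_normal_scores(values):
    """Ranks of |values| and approximate half-Normal scores for those ranks."""
    n = len(values)
    order = sorted(range(n), key=lambda k: abs(values[k]))
    ranks = [0] * n
    for rank, k in enumerate(order, start=1):
        ranks[k] = rank  # ties are broken by position, a simplification
    std_normal = NormalDist()
    # Assumed plotting position: p = (rank - 0.375) / (n + 0.25),
    # mapped to the upper half of the standard Normal distribution.
    scores = [
        std_normal.inv_cdf(0.5 + 0.5 * (r - 0.375) / (n + 0.25)) for r in ranks
    ]
    return ranks, scores


def slope_through_origin(x, y):
    """Least-squares slope of y on x for a line with no intercept."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)


median_tetrads = [0.4, -1.1, 0.2, 9.5, -0.7, 0.9]  # hypothetical values
ranks, scores = half_normal_scores(median_tetrads)
abs_values = [abs(v) for v in median_tetrads]
slope = slope_through_origin(abs_values, scores)
print(ranks)
print([round(s, 3) for s in scores], round(slope, 3))
```

Plotting `scores` against `abs_values`, together with the line through the origin with gradient `slope`, gives a picture of the kind described above.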
The TABLE parameter specifies a table, classified by two factors, in which outliers are to be identified. The table may contain missing values, in which case the corresponding median tetrad is returned as a missing value. The TABLE parameter must be set, while the other parameters are optional. The next six parameters save output. ROWS and COLUMNS save the factors which classify the table, DATA saves the numerical body of the table, and MEDIANTETRADS, RANKS and HALFNORMALSCORES save the median tetrads, their ranks and half-Normal scores respectively.
When a table has few rows (or, equivalently, few columns), a large outlier in the cell in row i and column j may cause other cells in column j to appear to be moderately outlying. This is bound to be a problem if the table has only two or three rows, in which case 100% or at least 50%, respectively, of tetrads involving cells in column j will be contaminated, so the median tetrads of those cells will be contaminated. The presence of missing values may also cause this problem to occur in larger tables, by reducing the proportion of clean tetrads. The parameter TESTOUTLIERS can be used to examine the influence of suspected outliers on the deviations of other cells. When TESTOUTLIERS is set to a positive integer (m), the analysis is run twice. In the first run, the data are those supplied in TABLE. In the second run, the m cells with the highest absolute median tetrads are set to values estimated from the remainder of the data (i.e. those not suspected to be outliers). If these m values are indeed the only notable outliers, all the data will now be inliers, so the half-Normal plot of the median tetrads will be a close fit to a straight line passing through the origin. Note that, if TESTOUTLIERS is set, the output saved in the variates set by the DATA, MEDIANTETRADS, RANKS and HALFNORMALSCORES parameters will be from the second analysis, that of the modified table. If the option GRAPHICS=highresolution is set in combination with a non-zero value of TESTOUTLIERS, you may need to set the option “Multiple Windows” in the Windows version of Genstat Graphics in order to see the two graphs, before and after adjustment of the suspected outliers.
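The 100% and 50% figures quoted for tables with two or three rows follow from counting tetrads. For a cell in the same column as a single outlier, the outlier appears in the tetrad whenever p is the outlier's row, which is a fraction 1/(R - 1) of the proper tetrads in a table with R rows. A brute-force check in Python (the function name and its arguments are hypothetical) is:

```python
def contaminated_fraction(n_rows, n_cols, outlier=(0, 0), cell=(1, 0)):
    """Fraction of the proper tetrads of `cell` that include the `outlier` cell."""
    i, j = cell
    total = hits = 0
    for p in range(n_rows):
        for q in range(n_cols):
            if p == i or q == j:
                continue  # not a proper tetrad
            total += 1
            # The tetrad uses cells (i, j), (i, q), (p, j) and (p, q).
            if outlier in {(i, q), (p, j), (p, q)}:
                hits += 1
    return hits / total


# One outlier at (0, 0); the cell examined is (1, 0), in the same column.
for n_rows in (2, 3, 4, 7):
    print(n_rows, contaminated_fraction(n_rows, n_cols=5))
```

With a single outlier the fraction drops below one half as soon as the table has four or more rows; further outliers or missing values raise it again.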
Options: PRINT, GRAPHICS, SORT.
Parameters: TABLE, ROWS, COLUMNS, DATA, MEDIANTETRADS, RANKS, HALFNORMALSCORES, TESTOUTLIERS.
Method
All proper tetrads are calculated for each cell and their median is calculated. The median tetrad for a cell with a missing value is set to a missing value. The absolute values of the median tetrads are then ranked and their half-Normal scores calculated, as described in the Procedure Library Manual for APLOT. If TESTOUTLIERS is set to an integer m>0, the m cells with the highest absolute median tetrads are set to missing values, an analysis of variance (anova) is carried out with treatment structure ROWS + COLUMNS (i.e. no interaction term is fitted), then the m cells with suspected outliers are given the appropriate fitted value saved from that anova.
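The replacement step can be sketched in Python as below. This is not the procedure's own code: the procedure obtains the fitted values from an anova, whereas this sketch substitutes a simple iterative scheme (treat the suspect cells as missing, fit row and column means, refill, repeat), which should converge to the same additive fit for those cells. The function name, the iteration count and the example table are assumptions, and the choice of suspect cells is taken as given rather than derived from the median tetrads.

```python
def replace_with_additive_fit(table, cells, n_iter=200):
    """Replace `cells` by fitted values from a rows + columns (additive) model.

    The listed cells are treated as missing and re-estimated by repeatedly
    fitting row and column means to the completed table.
    """
    n_rows, n_cols = len(table), len(table[0])
    work = [list(row) for row in table]
    cells = set(cells)
    # Start the suspect cells at the mean of the remaining cells.
    kept = [work[i][j] for i in range(n_rows) for j in range(n_cols)
            if (i, j) not in cells]
    start = sum(kept) / len(kept)
    for i, j in cells:
        work[i][j] = start
    for _ in range(n_iter):
        grand = sum(map(sum, work)) / (n_rows * n_cols)
        row_means = [sum(row) / n_cols for row in work]
        col_means = [sum(work[i][j] for i in range(n_rows)) / n_rows
                     for j in range(n_cols)]
        # Fitted value with no interaction term: row + column - grand mean.
        for i, j in cells:
            work[i][j] = row_means[i] + col_means[j] - grand
    return work


# Hypothetical 5 x 5 additive table with two planted outliers.
table = [[10.0 * r + c for c in range(5)] for r in range(5)]
table[0][0] += 8.0
table[3][2] -= 6.0

adjusted = replace_with_additive_fit(table, cells=[(0, 0), (3, 2)])
print(round(adjusted[0][0], 6), round(adjusted[3][2], 6))
```

Because the made-up table is exactly additive apart from the two planted cells, the replaced values return to the additive values for those cells, and a second run on the adjusted table would find no remaining outliers.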
References
Bradu, D. & Hawkins, D.M. (1982). Location of multiple outliers in two-way tables, using tetrads. Technometrics, 24, 103-108.
See also
Directive: TABULATE.
Procedure: DRESIDUALS, RCHECK.
Example
CAPTION 'MEDIANTETRAD example',\
  !t('Data from Bradu & Hawkins 1982, Table 1. Prevalence rates of',\
  'men of various occupations with hearing levels 16 dB or more',\
  'above the audiometric zero at various frequencies. (There are',\
  '3 suspected outliers.)'); STYLE=meta,plain
FACTOR [NVALUES=49; LEVELS=7; LABELS=!t(Professionl,Farm,Clerical,\
  Craftsman,Operative,Service,Labourer)] Occupation
& [LABEL=!t('500 Hz','1000 Hz','2000 Hz','3000 Hz',\
  '4000 Hz','6000 Hz','Nrml speech')] Frequency
GENERATE Frequency,Occupation
TABLE [CLASSIFICATION=Frequency,Occupation] HearTable; VALUES=!(\
  2.1, 6.8, 8.4, 1.4,14.6, 7.9, 4.8, 1.7, 8.1, 8.4, 1.4,12.0, 3.7,\
  4.5,14.4,14.8,27.0,30.9,36.5,36.4,31.4,57.4,62.4,37.4,63.3,65.5,\
  65.6,59.8,66.2,81.7,53.3,80.7,79.7,80.8,82.4,75.2,94.0,74.5,87.9,\
  93.3,87.8,80.5, 4.1,10.2,10.7, 5.5,18.1,11.4, 6.1)
MEDIANTETRAD [PRINT=graph,table; SORT=descending] HearTable; ROWS=Freq;\
  COLUMNS=Occup; DATA=Hearing; TEST=3