Gives robust identification of multiple outliers in 2-way tables (J.K.M. Brown).
|PRINT|Printed output required (
|GRAPHICS|Type of graph required (
|SORT|Sorting of printed output, in order of absolute value of median tetrad (
|TABLE|Specifies the two-way table of data|
|ROWS|Saves the factor classifying the table rows|
|COLUMNS|Saves the factor classifying the table columns|
|DATA|Saves the data values in the body of the table|
||Saves median tetrads for each cell in the table|
||Saves ranks of absolute values of median tetrads|
|HALFNORMALSCORES|Saves half-Normal scores of absolute values of median tetrads|
|TESTOUTLIERS|Specifies the number of cells, with the highest absolute median tetrads, to be set to their predicted values before re-running the analysis|
In a table of data cross-classified by two factors, some cells may be outliers, in that they contain values substantially higher or lower than those expected from the means of the relevant rows and columns. Median tetrad analysis is a robust, single-step method of identifying several outliers in a two-way table (Bradu & Hawkins 1982).
A tetrad is calculated from four cells which form a rectangle in the body of the table. For instance, if the cell in row i and column j has a value c_{ij}, the tetrad involving that cell and the cell in row p and column q is defined as
t_{ij;pq} = c_{ij} - c_{iq} - c_{pj} + c_{pq}
In a clean tetrad, none of the values c_{iq}, c_{pj} or c_{pq} is itself an outlier, so the tetrad is an estimate of the amount by which c_{ij} deviates from its expected value. In a contaminated tetrad, one or more of c_{iq}, c_{pj} or c_{pq} are outliers, so the tetrad is not a reliable estimate of the deviation of c_{ij} from its expectation.
MEDIANTETRAD calculates the median of all the tetrads involving each cell of the table (such that i ≠ p and j ≠ q, so the four cells in the tetrad form a square). These median tetrads are robust estimates of the deviations for each cell and therefore indicate which cells may contain outliers. The method is robust because the median will be a clean tetrad (and therefore a reliable estimate of the deviation) so long as fewer than half the tetrads involving that cell are contaminated. Furthermore, the robustness of the method allows several outliers to be detected reliably in a single step; other methods of detecting outliers may detect only a single outlier, or may require several steps, one for each outlier.
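The calculation described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the code of the procedure itself: the function name `median_tetrads`, the use of `None` for missing cells, and the choice to skip any tetrad that touches a missing cell are all assumptions made for the example.

```python
from statistics import median


def median_tetrads(table):
    """Return the median tetrad for each cell of a two-way table.

    `table` is a list of rows of numbers; None marks a missing cell.
    """
    n_rows, n_cols = len(table), len(table[0])
    result = [[None] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        for j in range(n_cols):
            if table[i][j] is None:
                continue  # a missing cell keeps a missing median tetrad
            tetrads = []
            for p in range(n_rows):
                for q in range(n_cols):
                    if p == i or q == j:
                        continue  # proper tetrads need i != p and j != q
                    corners = (table[i][q], table[p][j], table[p][q])
                    if any(v is None for v in corners):
                        continue  # assumption: drop tetrads with a missing corner
                    tetrads.append(
                        table[i][j] - table[i][q] - table[p][j] + table[p][q]
                    )
            if tetrads:
                result[i][j] = median(tetrads)
    return result


# A purely additive 5 x 5 table (c_ij = 10*i + j) with 50 added to one cell.
clean = [[10.0 * i + j for j in range(5)] for i in range(5)]
data = [row[:] for row in clean]
data[2][3] += 50.0

mt = median_tetrads(data)
```

In this made-up table every tetrad for the modified cell is clean, so its median tetrad recovers the added amount, while every other cell has at most 4 contaminated tetrads out of 16 and keeps a median tetrad of zero.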
The options of MEDIANTETRAD control the output. The PRINT=graph setting produces a plot of half-Normal scores of the median tetrads against the absolute values of the median tetrads. In the half-Normal plot, inliers (values for cells which are not outliers, with low deviations) fall on a straight line passing through the origin, while outliers (with high deviations) fall at the upper end of this line and below the level of the line. A regression line, passing through the origin, of half-Normal scores against absolute values of median tetrads, is also plotted. The setting PRINT=table prints the factors which classify the table, the data in the body of the table, the median tetrads, the ranks of the absolute values of the median tetrads and the half-Normal scores. The GRAPHICS option controls graphical output, as a high-resolution plot (the default setting) or as a line-printer plot. The SORT option controls whether the output provided by setting PRINT=table is sorted in ascending order (most extreme median tetrad last), descending order, or not at all.
The TABLE parameter specifies a table, classified by two factors, in which outliers are to be identified. The table may contain missing values, in which case the corresponding median tetrad is returned as a missing value. The TABLE parameter must be set, while the other parameters are optional. The next six parameters save output. ROWS and COLUMNS save the factors which classify the table, DATA saves the numerical body of the table, and the remaining three (the last of which is HALFNORMALSCORES) save the median tetrads, their ranks and half-Normal scores respectively.
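The ranking, half-Normal scoring and through-the-origin regression line can be illustrated with a short Python sketch. The scoring formula used here, Phi^-1(0.5 + 0.5(r - 0.5)/n) for rank r out of n, is a common approximation chosen for the example and is an assumption; the procedure may use a different formula. The function names and the input values are hypothetical.

```python
from statistics import NormalDist


def half_normal_scores(values):
    """Rank absolute values (1 = smallest) and compute half-Normal scores.

    Assumption: the score for rank r out of n is
    Phi^-1(0.5 + 0.5 * (r - 0.5) / n); ties are ranked in input order.
    """
    abs_vals = [abs(v) for v in values]
    n = len(abs_vals)
    order = sorted(range(n), key=lambda k: abs_vals[k])
    ranks = [0] * n
    for r, k in enumerate(order, start=1):
        ranks[k] = r
    std_normal = NormalDist()
    scores = [std_normal.inv_cdf(0.5 + 0.5 * (r - 0.5) / n) for r in ranks]
    return ranks, scores


def slope_through_origin(x, y):
    """Least-squares slope of y on x for a line forced through the origin."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)


# Made-up median tetrads: four small deviations and one large one.
tetrads = [0.5, -1.0, 0.2, 8.0, -0.1]
ranks, scores = half_normal_scores(tetrads)
slope = slope_through_origin([abs(v) for v in tetrads], scores)
```

Plotting `scores` against the absolute values, together with the line of slope `slope`, gives the kind of display described above: the large value sits at the top of the plot and well below the line.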
When a table has few rows (or, equivalently, few columns), a large outlier in the cell in row i and column j may cause other cells in column j to appear to be moderately outlying. This is bound to be a problem if the table has only two or three rows, in which case 100% or at least 50%, respectively, of tetrads involving cells in column j will be contaminated, so the median tetrads of those cells will be contaminated. The presence of missing values may also cause this problem to occur in larger tables, by reducing the proportion of clean tetrads. The parameter TESTOUTLIERS can be used to examine the influence of suspected outliers on the deviations of other cells. When TESTOUTLIERS is set to a positive integer (m), the analysis is run twice. In the first run, the data used are those supplied in TABLE. In the second run, the cells with the highest m absolute median tetrads are set to values estimated from the remainder of the data (i.e. those not suspected to be outliers). If these m values are indeed the only notable outliers, all the data will now be inliers, so the half-Normal plot of the median tetrads will be a close fit to a straight line passing through the origin. Note that, if TESTOUTLIERS is set, the median tetrads, ranks and half-Normal scores saved by the parameters up to and including HALFNORMALSCORES will be from the second analysis, that of the modified table. If the option GRAPHICS=highresolution is set in combination with a non-zero value of TESTOUTLIERS, you may need to set the option "Multiple Windows" in the Windows version of Genstat Graphics in order to see the two graphs, before and after adjustment of the suspected outliers.
All proper tetrads are calculated for each cell and their median is calculated. The median tetrad for a cell with a missing value is set to a missing value. The absolute values of the median tetrads are then ranked and their half-Normal scores calculated, as described in the Procedure Library Manual for
If TESTOUTLIERS is set to an integer m>0, the m cells with the highest absolute median tetrads are set to missing values, an analysis of variance (anova) is carried out with treatment structure ROWS+COLUMNS (i.e. no interaction term is fitted), and the m cells with suspected outliers are then given the appropriate fitted values saved from that anova.
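The replacement step can be sketched in Python as follows. This is not the Genstat anova itself: as a stand-in, the sketch treats the suspected cells as missing and estimates them by repeatedly refitting a rows-plus-columns model, a classical iterative way of estimating missing values in such a model. The function name `additive_fill`, the fixed iteration count and the assumption that the table has no other missing values are all choices made for the example.

```python
def additive_fill(table, cells, n_iter=200):
    """Replace the listed cells by fitted values from a rows + columns model.

    `cells` is a list of (row, column) index pairs treated as missing.
    Assumes the table contains no other missing values.
    """
    work = [row[:] for row in table]
    n_rows, n_cols = len(work), len(work[0])

    # Start the suspected cells at the mean of the remaining data.
    kept = [
        work[i][j]
        for i in range(n_rows)
        for j in range(n_cols)
        if (i, j) not in cells
    ]
    start = sum(kept) / len(kept)
    for i, j in cells:
        work[i][j] = start

    # Refit the additive model and update the suspected cells until stable.
    for _ in range(n_iter):
        grand = sum(sum(row) for row in work) / (n_rows * n_cols)
        row_means = [sum(row) / n_cols for row in work]
        col_means = [
            sum(work[i][j] for i in range(n_rows)) / n_rows
            for j in range(n_cols)
        ]
        for i, j in cells:
            work[i][j] = row_means[i] + col_means[j] - grand
    return work


# Additive 5 x 5 table (c_ij = 10*i + j) with 50 added to cell (2, 3).
data = [[10.0 * i + j for j in range(5)] for i in range(5)]
data[2][3] += 50.0

adjusted = additive_fill(data, [(2, 3)])
```

Here the suspected cell is pulled back to the value implied by its row and column (23 in this made-up table), and the median tetrads would then be recalculated from `adjusted`.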
Bradu, D. & Hawkins, D.M. (1982). Location of multiple outliers in two-way tables, using tetrads. Technometrics, 24, 103-108.
CAPTION 'MEDIANTETRAD example',\
        !t('Data from Bradu & Hawkins 1982, Table 1. Prevalence rates of',\
        'men of various occupations with hearing levels 16 dB or more',\
        'above the audiometric zero at various frequencies. (There are',\
        '3 suspected outliers.)'); STYLE=meta,plain
FACTOR  [NVALUES=49; LEVELS=7; LABELS=!t(Professionl,Farm,Clerical,\
        Craftsman,Operative,Service,Labourer)] Occupation
&       [LABEL=!t('500 Hz','1000 Hz','2000 Hz','3000 Hz',\
        '4000 Hz','6000 Hz','Nrml speech')] Frequency
GENERATE Frequency,Occupation
TABLE   [CLASSIFICATION=Frequency,Occupation] HearTable; VALUES=!(\
         2.1, 6.8, 8.4, 1.4,14.6, 7.9, 4.8, 1.7, 8.1, 8.4, 1.4,12.0, 3.7,\
         4.5,14.4,14.8,27.0,30.9,36.5,36.4,31.4,57.4,62.4,37.4,63.3,65.5,\
        65.6,59.8,66.2,81.7,53.3,80.7,79.7,80.8,82.4,75.2,94.0,74.5,87.9,\
        93.3,87.8,80.5, 4.1,10.2,10.7, 5.5,18.1,11.4, 6.1)
MEDIANTETRAD [PRINT=graph,table; SORT=descending] HearTable; ROWS=Freq;\
        COLUMNS=Occup; DATA=Hearing; TEST=3