MEDIANTETRAD procedure

Gives robust identification of multiple outliers in 2-way tables (J.K.M. Brown).

Options

`PRINT` = string tokens	Printed output required (`graph`, `table`); default `grap`, `tabl`
`GRAPHICS` = string tokens	Type of graph required (`highresolution`, `lineprinter`); default `high`
`SORT` = string tokens	Sorting of printed output, in order of absolute value of median tetrad (`ascending`, `descending`, `none`); default `none`

Parameters

`TABLE` = tables	Specifies the two-way table of data
`ROWS` = factors	Saves the factor classifying the table rows
`COLUMNS` = factors	Saves the factor classifying the table columns
`DATA` = variates	Saves the data values in the body of the table
`MEDIANTETRADS` = variates	Saves median tetrads for each cell in the table
`RANKS` = variates	Saves ranks of absolute values of median tetrads
`HALFNORMALSCORES` = variates	Saves half-Normal scores of absolute values of median tetrads
`TESTOUTLIERS` = scalars	Specifies the number of cells, with the highest absolute median tetrads, to be set to their predicted values before re-running the analysis

Description

In a table of data cross-classified by two factors, some cells may be outliers, in that they contain values substantially higher or lower than those expected from the means of the relevant rows and columns. Median tetrad analysis is a robust, single-step method of identifying several outliers in a two-way table (Bradu & Hawkins 1982).

A tetrad is calculated from four cells which form a square in the body of the table. For instance, if the cell in row i and column j has a value c_ij, the tetrad involving that cell and the cell in row p and column q is defined as

t_{ij;pq} = c_ij – c_iq – c_pj + c_pq

In a clean tetrad, none of the values c_iq, c_pj or c_pq is itself an outlier, so the tetrad is an estimate of the amount by which c_ij deviates from its expected value. In a contaminated tetrad, one or more of c_iq, c_pj or c_pq are outliers, so the tetrad is not a reliable estimate of the deviation of c_ij from its expectation.
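The difference between a clean and a contaminated tetrad can be seen in a small numerical sketch. The Python below is illustrative only and is not part of the procedure; the function name `tetrad`, the 4 x 4 additive table and the sizes of the two planted outliers are all assumptions made for the example.

```python
def tetrad(table, i, j, p, q):
    """Tetrad t_{ij;pq} = c_ij - c_iq - c_pj + c_pq for a list-of-rows table."""
    if i == p or j == q:
        raise ValueError("a proper tetrad needs i != p and j != q")
    return table[i][j] - table[i][q] - table[p][j] + table[p][q]


# Hypothetical additive table (row effect + column effect): every tetrad is 0.
row_effects = [0.0, 1.0, 2.0, 3.0]
col_effects = [0.0, 10.0, 20.0, 30.0]
table = [[r + c for c in col_effects] for r in row_effects]

table[0][0] += 5.0  # plant an outlier of +5 in cell (0, 0)
table[2][2] -= 4.0  # plant a second outlier of -4 in cell (2, 2)

clean = tetrad(table, 0, 0, 1, 1)         # other three corners are inliers
contaminated = tetrad(table, 0, 0, 2, 2)  # one corner is the outlier (2, 2)

print(clean)         # 5.0, the true deviation of cell (0, 0)
print(contaminated)  # 1.0, distorted by the second outlier
```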

MEDIANTETRAD calculates the median of all the tetrads involving each cell of the table (such that i ≠ p and j ≠ q, so the four cells in the tetrad form a square). These median tetrads are robust estimates of the deviations for each cell and therefore indicate which cells may contain outliers. The method is robust because the median will be a clean tetrad (and therefore a reliable estimate of the deviation) so long as fewer than half the tetrads involving that cell are contaminated. Furthermore, the robustness of the method allows several outliers to be detected reliably in a single step; other methods of detecting outliers may detect only a single outlier, or may require several steps, one for each outlier.
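A minimal sketch of the median-tetrad calculation, written in Python for illustration rather than taken from the procedure's source. The function name `median_tetrads`, the use of `None` for missing values and the decision to skip any tetrad that touches a missing corner are assumptions; the 5 x 5 table with two planted outliers is hypothetical.

```python
from statistics import median


def median_tetrads(table):
    """Median of all proper tetrads for each cell; None marks a missing value."""
    n_rows, n_cols = len(table), len(table[0])
    result = [[None] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        for j in range(n_cols):
            if table[i][j] is None:
                continue  # missing cell gives a missing median tetrad
            tetrads = []
            for p in range(n_rows):
                for q in range(n_cols):
                    if p == i or q == j:
                        continue  # the four cells must form a square
                    corners = (table[i][q], table[p][j], table[p][q])
                    if None in corners:
                        continue  # assumption: drop tetrads with a missing corner
                    tetrads.append(
                        table[i][j] - corners[0] - corners[1] + corners[2]
                    )
            if tetrads:
                result[i][j] = median(tetrads)
    return result


# Hypothetical additive 5 x 5 table with two planted outliers.
row_effects = [0.0, 1.0, 2.0, 3.0, 4.0]
col_effects = [0.0, 10.0, 20.0, 30.0, 40.0]
table = [[r + c for c in col_effects] for r in row_effects]
table[0][0] += 5.0
table[2][3] -= 4.0

mt = median_tetrads(table)
print(mt[0][0], mt[2][3], mt[1][1])  # 5.0 -4.0 0.0
```

Both planted outliers are recovered in a single pass because, for each of them, only one of the sixteen tetrads is contaminated by the other.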

The options of MEDIANTETRAD control the output. PRINT has two settings. The graph setting produces a plot of the half-Normal scores of the median tetrads against their absolute values. In this half-Normal plot, inliers (values for cells which are not outliers, with low deviations) fall on a straight line passing through the origin, while outliers (with high deviations) fall at the upper end of the plot, below the line. A regression line through the origin, of half-Normal scores on absolute values of median tetrads, is also plotted. The table setting prints the factors which classify the table, the data in the body of the table, the median tetrads, the ranks of the absolute values of the median tetrads and the half-Normal scores. The GRAPHICS option controls whether the graph is produced as a high-resolution plot (the default) or as a line-printer plot. The SORT option controls whether the output from PRINT=table is sorted in ascending order of absolute median tetrad (most extreme median tetrad last), in descending order, or not at all.
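The quantities behind the half-Normal plot can be sketched as follows. This Python is illustrative only: the plotting position `(rank - 0.5) / n` is an assumed, commonly used choice and may differ from the formula used by APLOT, and the function name `half_normal_plot_data`, the tie-breaking rule and the input values are hypothetical.

```python
from statistics import NormalDist


def half_normal_plot_data(median_tetrads):
    """Return (abs_values, ranks, scores, slope) for a list of median tetrads."""
    abs_vals = [abs(v) for v in median_tetrads]
    n = len(abs_vals)
    order = sorted(range(n), key=lambda k: abs_vals[k])
    ranks = [0] * n
    for rank, k in enumerate(order, start=1):
        ranks[k] = rank  # rank 1 = smallest; ties broken by position (assumption)
    std_normal = NormalDist()
    # Assumed plotting position: half-Normal quantile at (rank - 0.5) / n.
    scores = [std_normal.inv_cdf(0.5 + 0.5 * (r - 0.5) / n) for r in ranks]
    # Least-squares line through the origin: score = slope * abs_value.
    sxx = sum(a * a for a in abs_vals)
    sxy = sum(s * a for s, a in zip(scores, abs_vals))
    slope = sxy / sxx if sxx > 0 else None
    return abs_vals, ranks, scores, slope


# Hypothetical median tetrads: three small deviations and one large one.
abs_vals, ranks, scores, slope = half_normal_plot_data([0.2, -0.1, 0.3, -5.0])
print(ranks)  # [2, 1, 3, 4]
```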

The TABLE parameter specifies a table, classified by two factors, in which outliers are to be identified. The table may contain missing values, in which case the corresponding median tetrad is returned as a missing value. The TABLE parameter must be set, while the other parameters are optional. The next six parameters save output. ROWS and COLUMNS save the factors which classify the table, DATA saves the numerical body of the table, and MEDIANTETRADS, RANKS and HALFNORMALSCORES save the median tetrads, their ranks and half-Normal scores respectively.

When a table has few rows (or, equivalently, few columns), a large outlier in the cell in row i and column j may cause other cells in column j to appear to be moderately outlying. This is bound to be a problem if the table has only two or three rows, in which case 100% or at least 50%, respectively, of tetrads involving cells in column j will be contaminated, so the median tetrads of those cells will be contaminated. The presence of missing values may also cause this problem to occur in larger tables, by reducing the proportion of clean tetrads. The parameter TESTOUTLIERS can be used to examine the influence of suspected outliers on the deviations of other cells. When TESTOUTLIERS is set to a positive integer (m), the analysis is run twice. In the first run, the data are those supplied in TABLE. In the second run, the m cells with the highest absolute median tetrads are set to values estimated from the remainder of the data (i.e. the cells not suspected to be outliers). If these m values are indeed the only notable outliers, all the data will now be inliers, so the half-Normal plot of the median tetrads will be a close fit to a straight line passing through the origin. Note that, if TESTOUTLIERS is set, the output saved in the variates set by the DATA, MEDIANTETRADS, RANKS and HALFNORMALSCORES parameters will be from the second analysis, that of the modified table. If the option GRAPHICS=highresolution is set in combination with a non-zero value of TESTOUTLIERS, you may need to set the option “Multiple Windows” in the Windows version of Genstat Graphics in order to see the two graphs, before and after adjustment of the suspected outliers.
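The 100% and 50% figures can be checked by counting. With a single outlier, a fraction 1/(R – 1) of the tetrads for any other cell in the same column is contaminated in a table with R rows; further outliers or missing values can only raise that proportion. The Python sketch below is illustrative, and the function name `contaminated_fraction` and the choice of five columns are assumptions.

```python
def contaminated_fraction(n_rows, n_cols, cell, outlier):
    """Fraction of proper tetrads for `cell` that use the single `outlier` cell."""
    i, j = cell
    total = contaminated = 0
    for p in range(n_rows):
        for q in range(n_cols):
            if p == i or q == j:
                continue  # not a proper tetrad
            total += 1
            if outlier in ((i, q), (p, j), (p, q)):
                contaminated += 1
    return contaminated / total


# Outlier in cell (0, 0); examine cell (1, 0), which shares its column.
for n_rows in (2, 3, 4, 10):
    print(n_rows, contaminated_fraction(n_rows, 5, (1, 0), (0, 0)))
```

With two rows every tetrad is contaminated, and with three rows exactly half are, so the median can no longer be relied on to be a clean tetrad.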

Options: PRINT, GRAPHICS, SORT.

Parameters: TABLE, ROWS, COLUMNS, DATA, MEDIANTETRADS, RANKS, HALFNORMALSCORES, TESTOUTLIERS.

Method

All proper tetrads (those with i ≠ p and j ≠ q) are calculated for each cell, and their median is taken. The median tetrad for a cell with a missing value is set to a missing value. The absolute values of the median tetrads are then ranked and their half-Normal scores calculated, as described in the Procedure Library Manual for APLOT. If TESTOUTLIERS is set to an integer m>0, the m cells with the highest absolute median tetrads are set to missing values, an analysis of variance is carried out with treatment structure ROWS + COLUMNS (i.e. no interaction term is fitted), and the m suspected outliers are then replaced by the corresponding fitted values saved from that analysis.
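The replacement step can be sketched as fitting an additive rows-plus-columns model to the cells that remain and reading off fitted values for the suspects. The Python below is a stand-in for the analysis of variance, not Genstat's own algorithm: it assumes a simple alternating (backfitting) least-squares fit, and the function name `replace_suspects`, the fixed number of sweeps and the example table are hypothetical.

```python
def replace_suspects(table, suspects, n_sweeps=200):
    """Copy `table`, replacing cells in `suspects` by additive fitted values."""
    n_rows, n_cols = len(table), len(table[0])
    keep = [(i, j) for i in range(n_rows) for j in range(n_cols)
            if table[i][j] is not None and (i, j) not in suspects]
    if ({i for i, _ in keep} != set(range(n_rows))
            or {j for _, j in keep} != set(range(n_cols))):
        raise ValueError("every row and column needs at least one retained cell")
    row_eff = [0.0] * n_rows
    col_eff = [0.0] * n_cols
    for _ in range(n_sweeps):
        # Alternately re-estimate row and column effects from retained cells.
        for i in range(n_rows):
            res = [table[a][b] - col_eff[b] for a, b in keep if a == i]
            row_eff[i] = sum(res) / len(res)
        for j in range(n_cols):
            res = [table[a][b] - row_eff[a] for a, b in keep if b == j]
            col_eff[j] = sum(res) / len(res)
    cleaned = [row[:] for row in table]
    for i, j in suspects:
        cleaned[i][j] = row_eff[i] + col_eff[j]  # fitted value, no interaction
    return cleaned


# Hypothetical additive 4 x 4 table with one planted outlier of +5 in (0, 0).
row_effects = [0.0, 1.0, 2.0, 3.0]
col_effects = [0.0, 10.0, 20.0, 30.0]
table = [[r + c for c in col_effects] for r in row_effects]
table[0][0] += 5.0

cleaned = replace_suspects(table, {(0, 0)})
print(round(cleaned[0][0], 6))  # close to 0.0, the value implied by the other cells
```

Re-running the median-tetrad calculation on `cleaned` would then correspond to the second analysis described above.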

References

Bradu, D. & Hawkins, D.M. (1982). Location of multiple outliers in two-way tables, using tetrads. Technometrics, 24, 103-108.

Example

CAPTION  'MEDIANTETRAD example',\ 
         !t('Data from Bradu & Hawkins 1982, Table 1. Prevalence rates of',\ 
         'men of various occupations with hearing levels 16 dB or more',\ 
         'above the audiometric zero at various frequencies. (There are',\ 
         '3 suspected outliers.)'); STYLE=meta,plain
FACTOR   [NVALUES=49; LEVELS=7; LABELS=!t(Professionl,Farm,Clerical,\ 
         Craftsman,Operative,Service,Labourer)] Occupation
&        [LABEL=!t('500 Hz','1000 Hz','2000 Hz','3000 Hz',\ 
                   '4000 Hz','6000 Hz','Nrml speech')] Frequency
GENERATE Frequency,Occupation
TABLE    [CLASSIFICATION=Frequency,Occupation] HearTable; VALUES=!(\ 
         2.1, 6.8, 8.4, 1.4,14.6, 7.9, 4.8, 1.7, 8.1, 8.4, 1.4,12.0, 3.7,\
         4.5,14.4,14.8,27.0,30.9,36.5,36.4,31.4,57.4,62.4,37.4,63.3,65.5,\
         65.6,59.8,66.2,81.7,53.3,80.7,79.7,80.8,82.4,75.2,94.0,74.5,87.9,\
         93.3,87.8,80.5, 4.1,10.2,10.7, 5.5,18.1,11.4, 6.1)
MEDIANTETRAD [PRINT=graph,table; SORT=descending] HearTable; ROWS=Freq;\ 
         COLUMNS=Occup; DATA=Hearing; TEST=3

Updated on March 7, 2019
