VSOM procedure

Analyses a simple REML variance components model for outliers using a variance shift outlier model (S.J. Welham, F.N. Gumedze & D.B. Baird).

 

Options

PRINT = string tokens Specifies the output to be produced (fdr, outliers); default fdr, outl
VPRINT = string tokens Controls the output from the REML analysis of the baseline model (model, components, effects, means, stratumvariances, monitoring, vcovariance, deviance, Waldtests, missingvalues, covariancemodels); default mode, comp, Wald, cova
PLOT = string tokens Controls which plots are produced (indexplots, residual); default inde, resi
INDEXPLOT = string tokens Selects the index plots to produce (omega, sigma2, tsquared, lrt, method, all); default meth
TERM = formula Random term to scan for outliers; default is the residual term
METHOD = string token Method for calculating the statistics used to indicate an outlier (full, partial, t); default t
THRMETHOD = string token Method for obtaining the threshold statistics (approximate, bootstrap); default appr for METHOD=full and boot otherwise
NBOOT = scalar Number of bootstrap samples to take to form the threshold statistics; default 99 for METHOD=full and 499 otherwise
FIXED = formula Fixed model terms
RANDOM = formula Random model terms
CONSTANT = string token How to treat the constant term (estimate, omit); default esti
FACTORIAL = scalar Limit on the number of factors or covariates in each fixed term; default 3
VCONSTRAINTS = string token How to constrain the variance components and the residual variance (none, positive, fixrelative, fixabsolute); default posi
INITIAL = variate Initial values for the variance components; default 1
SEED = scalar Seed for random number generation; default 0 continues an existing sequence or, if none, selects a seed automatically
SAVEITEMS = string tokens Selects the items to save (residuals, omega, sigma2, gamma, tsquared, lrt, fdr, approxthresholds, thresholdstats, outliers, method, all); default resi, omeg, sigm, meth, fdr, outl

 

Parameters

Y = variates Response variates
TITLE = texts Specifies the title or titles to use for the plots
SAVE = pointers Saves information from the analysis of each y-variate

 

Description

VSOM uses a mixed-model analysis with a variance shift outlier model (VSOM) to search for potential outliers. By default, the VSOM is used to assess the residuals. However, you can set the TERM option to a random term in the analysis, to assess its effects: i.e. to see whether any of the groups of observations defined by the random term seem to be aberrant. The model defines an extra component of variation for each unit (an individual or a group), in turn, and estimates the extra variance associated with it. The METHOD option specifies how the extra variance is estimated, with the following settings.

    full refits the full model with the added variance term for each unit; this can be very time-consuming.
    partial approximates the change in likelihood by a partial likelihood, where the baseline model parameters are held fixed, and only the extra variance component for each unit is estimated; this is much faster than re-estimating the full model.
    t uses the squared t-statistics (i.e. squared standardized residuals) to approximate the change in likelihood (default); this is the fastest approach.
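
The t setting can be pictured with a short sketch. The Python below is only an illustration of "squared standardized residuals", not the procedure's own code; the residuals and standard errors are made-up values, and `squared_t` is a hypothetical helper name.

```python
def squared_t(residuals, std_errors):
    """Return squared standardized residuals, one per unit."""
    if len(residuals) != len(std_errors):
        raise ValueError("residuals and std_errors must have the same length")
    return [(r / se) ** 2 for r, se in zip(residuals, std_errors)]


# Hypothetical residuals and standard errors for five units
# (assumed to come from an already-fitted baseline model).
residuals = [0.2, -0.5, 3.0, 0.1, -0.4]
std_errors = [0.5, 0.5, 0.5, 0.5, 0.5]

t2 = squared_t(residuals, std_errors)

# 0-based index of the unit with the largest statistic.
most_extreme = max(range(len(t2)), key=t2.__getitem__)
```

A unit whose squared statistic stands well clear of the others is the kind of unit that the thresholds are then used to assess.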

To assess whether a unit is outside its expected distribution, thresholds are calculated at various levels of significance. The THRMETHOD option specifies the method to use:

    approximate uses the asymptotic distribution to calculate the thresholds; and
    bootstrap uses parametric bootstrap samples, with the variance components in the baseline model, to calculate the thresholds from the percentiles of the order statistics.

Each bootstrap sample is formed by taking the sum of the fitted fixed effects from the baseline model, together with simulated effects for the random terms in the model. Each random effect is simulated by Normal random numbers, with a mean of zero and the variance that was estimated for that term in the baseline model. The NBOOT option defines how many random samples to perform; the default is 99 for METHOD=full, and 499 otherwise. The SEED option specifies the seed for the random number generator, used by the GRNORMAL function to make the bootstrap samples. The default of zero continues the sequence of random numbers from a previous generation or, if this is the first use of the generator in this run of Genstat, it initializes the seed automatically from the computer clock. If you repeat the analysis with the same (non-zero) seed, you will get the same random numbers, and hence the same results.
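
The bootstrap scheme described above can be sketched roughly as follows. This is Python rather than Genstat, and it rests on several assumptions: a single random term, a placeholder statistic in place of the statistics that VSOM would obtain by analysing each sample, and a simple empirical percentile rule. All function names and data values are hypothetical. Python's seeding also differs from Genstat's, so the special meaning of SEED=0 is not reproduced.

```python
import math
import random


def simulate_sample(rng, fitted, groups, group_var, resid_var):
    """One parametric bootstrap sample: fitted fixed part + simulated random effects."""
    group_sd = math.sqrt(group_var)
    resid_sd = math.sqrt(resid_var)
    # One Normal(0, group_var) effect per group, shared by its observations.
    effects = {g: rng.gauss(0.0, group_sd) for g in sorted(set(groups))}
    return [f + effects[g] + rng.gauss(0.0, resid_sd)
            for f, g in zip(fitted, groups)]


def placeholder_statistic(sample, fitted, total_var):
    """Stand-in for the per-unit outlier statistic (an assumption, see text)."""
    return [(y - f) ** 2 / total_var for y, f in zip(sample, fitted)]


def bootstrap_thresholds(fitted, groups, group_var, resid_var,
                         nboot=499, level=0.95, seed=7643):
    """Percentile, across samples, of each order statistic (largest first)."""
    rng = random.Random(seed)  # a fixed seed makes the run repeatable
    total_var = group_var + resid_var
    ordered = []
    for _ in range(nboot):
        sample = simulate_sample(rng, fitted, groups, group_var, resid_var)
        stats = placeholder_statistic(sample, fitted, total_var)
        ordered.append(sorted(stats, reverse=True))
    cut = math.ceil(level * nboot) - 1  # index of the empirical percentile
    # zip(*ordered) groups the k-th largest value from every sample.
    return [sorted(column)[cut] for column in zip(*ordered)]


# Hypothetical baseline fit: 3 groups of 4 observations.
fitted = [10.0] * 12
groups = [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]
thresholds = bootstrap_thresholds(fitted, groups, group_var=0.5, resid_var=1.0)
```

Running the function twice with the same seed returns identical thresholds, which mirrors the reproducibility described for a non-zero SEED.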

The FIXED and RANDOM options specify fixed and random terms to be fitted in the analysis; one of these must be specified. The analysis cannot handle covariance models (which would be specified by the VSTRUCTURE directive). The VCONSTRAINTS option specifies constraints on the variance components, using the same settings as the CONSTRAINTS parameter of VCOMPONENTS. The FACTORIAL option sets a limit on the number of factors and variates allowed in each fixed term, and the CONSTANT option allows you to omit the constant.

Printed output is controlled by the PRINT option, with the following settings:

    outliers prints a summary of the potential outliers, as measured against the threshold statistics, at various levels of significance; and
    fdr prints the estimated false discovery rates for the potential outliers.

The false discovery rates (FDR) are estimated from the distribution of p-values calculated with the t-statistics from the asymptotic model. This uses the FDRMIXTURE procedure, or else the FDRBONFERRONI procedure if that fails. The FDR estimates the probability that an apparent outlier was generated by noise. If this is small, the outlier is likely to be genuine. However, if it is larger than 0.5, it is more likely than not that the outlier was generated by noise. The FDR probabilities do not allow for correlations between the estimates. So, if there are only 2-3 replicates of the fixed terms, the FDR values may be too small, and should be interpreted with caution.
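
To give a feel for how a false discovery rate is read against the 0.5 guideline, the sketch below applies a Benjamini-Hochberg adjustment to a set of p-values. This is a deliberately simpler, swapped-in technique written in Python; it is not the algorithm of FDRMIXTURE or FDRBONFERRONI, and the p-values and the `bh_adjust` helper are hypothetical.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values for five units.
pvalues = [0.001, 0.04, 0.03, 0.5, 0.8]
fdr = bh_adjust(pvalues)

# Units whose estimated rate is below 0.5 (0-based positions).
flagged = [i for i, q in enumerate(fdr) if q < 0.5]
```

Here the first three units fall below 0.5 and would be treated as more likely genuine than noise, while the last two would not.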

The VPRINT option controls the output from the REML analysis of the baseline model (as specified by the FIXED and RANDOM options). This has the same settings and default as the PRINT option of REML.

Graphical output is controlled by the PLOT option, with the following settings.

    residual when TERM is set, the DRESIDUALS procedure is used to plot histograms and Normal plots of the specified random effects; when TERM is not set, DRESIDUALS is used to plot histograms and Normal plots of the residuals together with a plot of the residuals against the fitted values.
    indexplots plots the statistics, selected by the INDEXPLOT option, against their index (i.e. their position in the y-variate).

In both the residual and index plots, points are plotted in red if they are greater than their 5% bootstrap threshold, and in purple or green if greater than the 1% or 5% asymptotic thresholds respectively. The index plot also displays reference lines for the order statistics (OS 1, OS 2…) when THRMETHOD=bootstrap, or the 5%, 1%, 0.1% and 0.01% asymptotic thresholds when THRMETHOD=approximate.

The plots that are produced as components of the index plot can be controlled by the INDEXPLOT option, with the following settings:

    omega variance shift as a ratio to the residual variance,
    sigma2 estimated residual variance under VSOM,
    tsquared squared t-statistic,
    lrt likelihood ratio test,
    method the statistic associated with the setting of the METHOD option, i.e. lrt for full or partial, and tsquared for t (default), and
    all all the statistics.

The Y parameter specifies the response variate. The TITLE parameter can supply a text, with either one or three values, to label the graphs. If the text has a single value, this is used to prefix the standard descriptions for the three graphs. If it has three values, these give (in full) the titles for the comparison, index and residual plots, respectively.

The SAVE parameter can save a pointer containing variates, storing the statistics calculated for each group or individual. The labels of the pointer, and the corresponding statistics, are as follows:

    'residuals' the standardized residuals,
    'omega' the variance shift as a ratio to the residual variance,
    'sigma2' the estimated residual variance under VSOM,
    'gamma' the estimated variance component for TERM under VSOM,
    'tsquared' the squared t-statistic,
    'LRT' the partial likelihood ratio test if METHOD=partial or the full likelihood ratio test otherwise,
    'method' the statistic associated with the setting of the METHOD option (lrt for full or partial, and tsquared for t),
    'FDR' the false discovery rate based on the t-statistics,
    'approxthresholds' the approximate thresholds used to indicate significant departures,
    'thresholdstats' the 95th percentiles of the order statistics from the bootstrap samples in decreasing order, and
    'outliers' the unit numbers of outliers above the thresholds.

The SAVEITEMS option controls which of the above items are saved.

Options: PRINT, VPRINT, PLOT, INDEXPLOT, TERM, METHOD, THRMETHOD, NBOOT, FIXED, RANDOM, CONSTANT, FACTORIAL, VCONSTRAINTS, INITIAL, SEED, SAVEITEMS.

Parameters: Y, TITLE, SAVE.

 

Method

VSOM uses the method of Gumedze et al. (2010).

 

Action with RESTRICT

The Y parameter can be restricted. All output estimates will then be based only on the unrestricted units.

 

Reference

Gumedze, F.N., Welham, S.J., Gogel, B.J. & Thompson, R. (2010). A variance shift model for detection of outliers in the linear mixed model. Computational Statistics and Data Analysis, 54, 2128-2144.

 

See also

Directives: REML, VCOMPONENTS, VSTRUCTURE.

Procedures: VCHECK, VRCHECK, VPLOT, VDFIELDRESIDUALS, VFRESIDUALS, DRESIDUALS, FDRBONFERRONI, FDRMIXTURE.

Commands for: REML analysis of linear mixed models.

Example

CAPTION  'VSOM examples',\
   !T('Cambridge Filter data (Wagner & Thaggard 1979):',\
   'Nicotine extracted from pads at 14 laboratories'); STYLE=meta,plain
   
SPLOAD [PRINT=*] '%EXAMPLES%/CambridgeFilterData.gsh'

"Check residual term - individual samples for outliers"
VSOM [METHOD=t; FIXED=Sample; RANDOM=Laboratory; SEED=7643] Nicotine;\ 
   TITLE='Cambridge Filter data'

"Check laboratory term for outliers"
VSOM [METHOD=full; FIXED=Sample; RANDOM=Laboratory; TERM=Laboratory]\ 
   Nicotine; TITLE='Cambridge filter data by laboratory'

CAPTION 'Slate Hall spring wheat trial (Kempton & Fox 1997)'; STYLE=plain
SPLOAD [PRINT=*] '%DATA%/SlateHall.gsh'

"Check residual term - individual plots for outliers"
VSOM [PRINT=; VPRINT=*; PLOT=#; INDEXPLOT=all; FIXED=variety;\
     RANDOM=fieldrow*fieldcolumn; METHOD=Partial; NBOOT=199;\ 
     SAVEITEMS=residuals,omega,fdr] yield; TITLE='Slate Hall'; SAVE=results
     
"Test fieldcolumn effects for outliers"
VSOM [FIXED=variety; RANDOM=fieldrow*fieldcolumn; TERM=fieldcolumn]\ 
     yield; TITLE='Slate Hall by field column'
Updated on January 17, 2018
