FDRMIXTURE procedure

Estimates false discovery rates using mixture distributions (J.W. McNicol & D.B. Baird).

Options

`PRINT` = string token	What to print (`monitoring`, `estimates`); default `esti`
`DISTRIBUTION` = string token	Which distribution to mix with Uniform (`beta`, `gamma`); default `beta`
`INITIAL` = variate	Initial values for mixing proportion (φ) and Beta or Gamma parameters (A and B); default `!(0.90,` `0.30,` `2)`
`LOWER` = variate	Lower limits for parameters; default `!(0.00001,` `0.001,` `0.001)`
`UPPER` = variate	Upper values for parameters; default `!(0.99999,` `5,` `1000)`
`PLOT` = string token	What to plot (`histogram`, `density`, `logdensity`, `inference`, `loginference`); default `hist`, `dens`, `logd`, `infe`, `logi`
`WINDOW` = scalar	Window for graphs; default 1
`KEYWINDOW` = scalar	Key window for Inference plot; default 2
`MAXCYCLE` = scalar	Maximum iteration cycles; default 50
`TOLERANCE` = scalar or variate	Tolerance for convergence of parameters; default 0.01 for Beta, and 0.001 for Gamma

Parameters

`PROBABILITIES` = variates	Significance values, must lie between 0 and 1
`ESTIMATES` = variates	Saves the estimates of mixture parameters φ, A and B
`FDR` = variates	Saves the False Discovery Rates at the p-values in `PROBABILITIES` i.e. q-values
`FRR` = variates	Saves the False Rejection Rates at the p-values in `PROBABILITIES`
`POWER` = variates	Saves the power estimates as a function of the p-values in `PROBABILITIES`
`POSTHA` = variates	Saves the Posterior Probability of H_a at the p-values in `PROBABILITIES`
`LOGLIKELIHOOD` = scalars	Value of the loglikelihood at end of the iteration process
`NCYCLES` = scalars	Number of iterations taken to convergence

Description

FDRMIXTURE estimates the false discovery rate (FDR), false rejection rate (FRR) and power of a test by modelling significance values as a 2-component mixture of Uniform and Beta or Gamma densities, Allison et al. (2002). The context is multiple testing, with data from any situation where the same simple hypothesis, H_o, is tested many times, such as in transcriptomics (microarrays), metabolomics and proteomics. These tests generate a large number of significance values which, under H₀, have a Uniform distribution and, under H_a, can be modelled as a Beta or truncated Gamma density. FDRMIXTURE estimates the parameters of this mixture distribution to derive the False Discovery Rate, Prob(H₀/D_a), the False Rejection Rate, Prob(H_a/D₀) and the Power of the test, Prob(D_a/H_a), each as a function of p_crit. Here Da denotes the event “p<pcrit”. The procedure also calculates the posterior probability of Ha, Prob(Ha/p), (POSTHa) from the mixture distribution. The significance values are provided by the PROBABILITIES parameter and the choice of distribution (Beta or Gamma) by the DISTRIBUTION option. The FDR, FRR, POWER and POSTHA parameters return estimates at the corresponding values of PROBABILITIES. Thus FDR contains the q-values of Storey & Tibshirani (2003). An EM algorithm is used to estimate the mixture parameters which are returned in the parameter ESTIMATES.

The mixture model parameterization takes a proportion φ from the Uniform distribution, and (1 – φ) from either a Beta or Gamma distribution. The Gamma parameterization is

f(x) = (1/b)^A / Gamma(A) × exp(-x/B) × x^(A-1)

truncated at x=1, and the Beta parameterization is

f(x) = x^(A-1) × (1-x)^(B-1) / Beta(A; B).

Details of the estimation process are returned in the parameters NCYCLES and LOGLIKELIHOOD. Initial values, lower and upper mixture parameter limits are set by the INITIAL, LOWER and UPPER options. Convergence can be controlled by a single tolerance for all three parameters or for each parameter separately using the TOLERANCE option, and the number of iterations by the MAXCYCLE option. A warning is printed when the parameter estimates imply a Beta or Gamma density which is unimodal rather than reverse J-shaped. The former would give rise to situations where Pr(H₀/D_a) > Pr(H_a/D_a) for very small p.

Printed output is controlled by the PRINT option with settings:

`estimates`	for estimates of the parameters, and
`monitoring`	for monitoring information from the fitting process.

By default PRINT=estimates.

Graphical output is controlled by the PLOT option with settings:

`histogram`	for a plot of the fitted mixture against the histogram of probabilities,
`density`	for a plot of the fitted mixture against the kernel density estimate of the probabilities on a logit scale (this allows a more detailed comparison at small probability value),
`logdensity`	gives even greater detail, by putting the density on a log scale (note that greater variation is expected around small density values on the log scale),
`inference`	generates a plot of `FDR`, `FRR` and `POWER` against p, and
`loginference`	plots these statistics on log scales, restricted to p<0.5, with a background grid, to enable estimates to be read for specific probability values.

By default all the plots are produced.

The WINDOW option controls where the plots go and the KEYWINDOW option can be used to position the key in the inference plots.

Options: PRINT, DISTRIBUTION, LOWER, UPPER, PLOT, WINDOW, KEYWINDOW, MAXCYCLE, TOLERANCE.

Parameters: PROBABILITIES, ESTIMATES, FDR, FRR, POWER, POSTHA, LOGLIKELIHOOD, NCYCLES.

Method

In the context of hypothesis testing the false discovery rate, FDR, can be defined as the probability of H₀ being true when the result of the statistical test leads us to accept H_a:

FDR = Prob(H₀/D_a).

By Bayes theorem

Prob(H₀/D_a) = Prob(D_a/H₀) × Prob(H₀) / Prob(D_a)

and

Prob(D_a) = Prob(D_a/H₀) × Prob(H₀) + Prob(D_a/H_a) × Prob(H_a).

Further, in the context of multiple testing, where there are many p-values available, all the terms in these expressions can be derived by modelling the p-values as a 2-component mixture distribution. The p-values, under H₀, have a Uniform density and, under H_a, can be modelled as a Beta or truncated Gamma density. The mixing proportions are Prob(H₀) and Prob(H_a) respectively. Prob(D_a/H_a) is CLBETA(p; A; B) or CLGAMMA(p; A; B). The False Rejection Rate,

FRR = Prob(H_a/Do)

is derived similarly. The posterior probability of H_a,

PostH_a = Prob(H_a/p)

= Prob(p/H_a) × Prob(H_a) / (Prob(p/H_a) × Prob(H_a) + Prob(p/H₀) × Prob(H₀))

and each term is estimated by the mixture model parameters.

Action with `RESTRICT`

The PROBABILITIES parameter can be restricted. All output estimates will then be based only on those unrestricted units.

References

Allison, D.B., Gadbury. G.L., Heo, M., Fernandez, J.R., Lee, C.-K., Prolla, T.A., & Weindruch R. (2002). A mixture model approach for the analysis of microarray gene expression data. Computational Statistics & Data Analysis, 39, 1-16.

Storey J.D. & Tibshirani R. (2003). Statistical significance for genomewide studies. Proceedings of the National Academy of Science, 100, 9440-9445.

Example

CAPTION    'FDRMIXTURE example'; STYLE=meta
SCALAR     ntests;   VALUE = 20000
SCALAR     seed;     VALUE =  1231
SCALAR     delta;    VALUE =   0.3
SCALAR     phi;      VALUE =  0.90
SCALAR     sampsize; VALUE =    50
" Create ntest data sets (rows), each of size sampsize (columns)."
GRANDOM    [NVALUES=1; SEED=seed] dum
POINTER    [NVALUES=sampsize] x
CALCULATE  x[] = GRNORMAL(ntests;0;1)
" Add an offset, delta, to a fraction, (1-phi), of the samples.
  Derive test statistic, z, for Test Ho: mean=0 against Ha:mean0
  and significance, p."
CALCULATE  nHa = ROUND(ntests * (1-phi))
&          nHo = ntests - nHa
VARIATE    [VALUES=#nHo(0),#nHa(delta)] vdelta
CALCULATE  xbar = vmean(x) + vdelta
&          t = ABS(xbar * SQRT(sampsize))
&          p = 2 * CUNORMAL(t; 0; 1)
" Derive FDR, FRR, Power, PostHa."
VARIATE    [VALUES=0.85,0.4,3.74] initb
FDRMIXTURE [DISTRIBUTION=beta; INITIAL=initb; TOLERANCE=0.01]\ 
           p; FDR=qvalues; FRR=frr; POWER=power; POSTHA=postha

Updated on March 8, 2019

Was this article helpful?

Yes No