
AFFYMETRIX procedure

Estimates expression values for Affymetrix slides (D.B. Baird).

Options


PRINT = string tokens What to print (estimates, background, monitoring); default para
METHOD = string token Method for calculating probe expression values (mas4, mas5, rma, rma2); default rma
BMETHOD = string token Method to use for background values (mean, quantile, none); default mean for METHOD settings mas4 and mas5, but none for settings rma and rma2
BWEIGHTING = string token Method for weighting background grids (affymetrix, distance); default affy
TRANSFORMATION = string token How to transform the data (log2, none); default log2
NMETHOD = string token Method for normalization i.e. whether to use a mean, median or geometric mean for the averaged normalized distribution (means, medians, geometricmeans, none); default mean
REPLACEDATA = string token Whether to replace the DATA variates with background corrected intensities (yes, no); default no
SPREADSHEET = string token What to save in a spreadsheet (results); default * i.e. nothing
MAXCYCLE = scalar Maximum number of iterations; default 50
TOLERANCE = scalar Tolerance for convergence; default 0.0001

Parameters


DATA = variates Intensities to be analysed
SLIDES = factors Identify the slides (or chips)
PROBES = factors Identify the probes (or genes) within each slide
ATOMS = factors Identify the PM/MM pairs within each probe
PMMM = factors Distinguish between PM and MM values
TYPEPROBES = factors Defines the probe-type corresponding to each intensity
ROWS = factors Identifies rows within each slide (required only if background corrections are to be made)
COLUMNS = factors Identifies columns within each slide (required only if background corrections are to be made)
ESTIMATES = variates Saves the estimated expression values for each slide and probe combination
SE = variates Saves approximate standard errors for the estimates
IDSLIDES = factors Saves factors to identify the slides in the ESTIMATES variates
IDPROBES = factors Saves factors to identify the probes in the ESTIMATES variates

Description


AFFYMETRIX estimates expression values over the perfect match (PM) and mismatch (MM) pairs for each probe on Affymetrix slides (or chips). On Affymetrix chips, each probe has 8-20 pairs of DNA sequences with a central base changed between the perfect match and mismatch sequences. The value for the probe level of expression is taken as an average over the pairs of perfect match (PM) and mismatch (MM) spots. The intensity values are obtained by reading in a series of Affymetrix CEL files, and the chip information from a CDF file.

The METHOD option selects the method to use to summarize over the PM and MM pairs, with settings:

    rma Robust Multi-array Average – the probe level model introduced by Irizarry et al. (2003) which only uses PM information and transforms the values based on a kernel density estimate of the PM distribution;
    rma2 Robust Multi-array Average 2 – an adaptation of the RMA algorithm which fits the kernel density to a truncated distribution of the PM values, with the truncation point based on an initial kernel density estimate;
    mas4 Affymetrix Version 4 – the AvDiff algorithm introduced in the Affymetrix version 4 software; and
    mas5 Affymetrix Version 5 – the Tukey biweight algorithm introduced in the Affymetrix version 5 software.

In the Affymetrix MAS 4 and 5 methods, the differences between the signals (PM – MM) are combined using a robust average. The MAS 4 algorithm uses the AvDiff algorithm, which discards the minimum and maximum differences, and any differences greater than 3 standard deviations from the mean. The MAS 5 algorithm uses the Tukey biweight algorithm, which reweights the values depending on how far they are from the median, and discards any that are more than 5 times the median absolute deviation away. The MAS 5 algorithm also replaces the MM value with a value known as an Ideal Mismatch (IM), which is always less than the PM value.
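As a sketch of how a MAS 5 analysis might be requested: the identifiers below (Intensity, Slide, Probe, Atom, PMtype, Row, Col, Signal, SlideID, ProbeID) are hypothetical and would need to match your own data. Because the MAS methods use the PM – MM differences, the PMMM factor is supplied; and because BMETHOD defaults to mean for mas4 and mas5, the ROWS and COLUMNS factors are supplied too.

```genstat
" Sketch only: all identifiers are hypothetical. "
AFFYMETRIX [PRINT=estimates; METHOD=mas5]\
           DATA=Intensity; SLIDES=Slide; PROBES=Probe; ATOMS=Atom;\
           PMMM=PMtype; ROWS=Row; COLUMNS=Col;\
           ESTIMATES=Signal; IDSLIDES=SlideID; IDPROBES=ProbeID
```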

The standard RMA algorithm normally uses the log2-transformed PM values with no zone-based background correction (BMETHOD=none is the default for rma and rma2), and applies a quantile normalization to them. A Normal function transformation is then applied to the adjusted PM values, with the values for the transformation calculated from a kernel density estimate of the adjusted PM values. Finally, the transformed PM values are summarized by a median polish of the slide-by-atom values for each probe. The log2 transformation can be suppressed by setting option TRANSFORMATION=none.

The RMA model performs a background correction by fitting a two component model to the PM intensities:

Observed intensity = Signal + Noise

where Signal has an exponential distribution with parameter α (the reciprocal of the mean), and Noise has a Normal distribution with parameters μ (the mean) and σ (the standard deviation). The parameters α, μ and σ are estimated, and the expected value of the signal, given the observed intensity, is then estimated.
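For reference, the conditional expectation usually quoted for this exponential-plus-Normal model can be written as below. Here o is the observed intensity, S is the signal, and φ and Φ are the standard Normal density and distribution function. This is the standard published form; whether AFFYMETRIX evaluates it in exactly this way is an assumption.

```latex
E(S \mid O = o) = a + b\,
  \frac{\phi(a/b) - \phi\bigl((o - a)/b\bigr)}
       {\Phi(a/b) + \Phi\bigl((o - a)/b\bigr) - 1},
\qquad a = o - \mu - \sigma^{2}\alpha, \quad b = \sigma .
```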

For all algorithms, the lowest 2% of spots on each slide can be used to estimate a background correction for the intensities. The chip is divided into 16 zones in a 4 × 4 grid, and each spot has a weighted average of these 16 levels removed from it. The levels used are controlled by the BMETHOD option, with settings:

    means the means of the values below the 2% quantile are used as the background levels;
    quantiles the actual 2% quantiles are used as the background levels; and
    none if you want no background correction to be made.

The BWEIGHTING option controls how the background levels are combined before removing them from each spot:

    affymetrix the weights are 1/(squared-distance + 100); and
    distance the weights are 1/min(squared-distance, 100),

where squared-distance = (distance from the spot to the zone centroid)².
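Written out for the affymetrix setting, the background removed from a spot s would take the following form, where B_k is the background level of zone k and d_k(s) is the distance from the spot to the centroid of zone k. The symbols are introduced here for illustration only, and dividing by the sum of the weights is assumed from the phrase "weighted average".

```latex
\hat{b}(s) = \frac{\sum_{k=1}^{16} w_k(s)\, B_k}{\sum_{k=1}^{16} w_k(s)},
\qquad w_k(s) = \frac{1}{d_k^{2}(s) + 100} .
```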

The quantile normalization of the PM/MM values on each slide is controlled by the NMETHOD option. Its settings select the way in which the overall distribution is produced from the cumulative density functions on each slide:

    means takes the means;
    medians takes the medians;
    geometricmeans takes geometric means (i.e. the mean on the log scale, back-transformed to the natural scale); and
    none if you do not want any quantile normalization.
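As a small illustration with made-up numbers, suppose the smallest intensities on three slides are 1, 1 and 8. The value placed at the lowest rank of the overall distribution would then depend on the NMETHOD setting as follows (a sketch of standard quantile normalization, not a statement of the procedure's internals):

```latex
\text{means: } \frac{1 + 1 + 8}{3} = \frac{10}{3} \approx 3.33, \qquad
\text{medians: } \operatorname{median}(1, 1, 8) = 1, \qquad
\text{geometricmeans: } (1 \times 1 \times 8)^{1/3} = 2 .
```

The same calculation is repeated at every rank, and each slide's values are then replaced by the overall values of matching rank.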

The intensity values are specified by the DATA parameter. If these are in a single variate, the SLIDES parameter should supply a factor to index the slides, and the PROBES parameter should supply a factor to index the probes (or genes). Alternatively, you can supply a pointer containing a variate for each slide. The SLIDES factor is then not required; if it is given, it should have just one entry for each slide, in the order of the variates in the pointer. The PROBES factor is then the one for a single slide, and all slides must have a common layout.
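The pointer form might be set up along the following lines. This is a sketch under assumptions: Chip1, Chip2 and Chip3 are hypothetical variates holding one slide each, and Probe and Atom are hypothetical factors describing the layout of a single slide.

```genstat
" Sketch only: all identifiers are hypothetical. "
POINTER    [VALUES=Chip1,Chip2,Chip3] Chips
AFFYMETRIX [METHOD=rma] DATA=Chips; PROBES=Probe; ATOMS=Atom;\
           ESTIMATES=Expression; IDSLIDES=SlideID; IDPROBES=ProbeID
```

No SLIDES factor is given here, since the slides are identified by the order of the variates in the pointer.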

The ATOMS parameter supplies a factor to identify the PM/MM pairs within each probe, and the PMMM parameter supplies a factor, with levels labelled 'PM' and 'MM', to distinguish between PM and MM values. The TYPEPROBES parameter supplies a factor to specify the probe types. The types of probes that can occur on Affymetrix chips are: 'Expression', 'Genotyping', 'CustomSeq', 'Tag', 'Unknown', 'Checkerboard Negative', 'Checkerboard Positive', 'Hybridization Negative', 'Hybridization Positive', 'Text Negative', 'Text Positive', 'Central Negative', 'Central Positive', 'Gene Exp Negative', 'Gene Exp Positive', 'Cycle Fidelity Negative', 'Cycle Fidelity Positive', 'Central Cross Negative', 'Central Cross Positive', 'Cross Hyb Negative' and 'Cross Hyb Positive'.

The ROWS and COLUMNS parameters can supply factors to identify the rows and columns within each slide. These are required only if background corrections are to be made.

The ESTIMATES parameter must supply a variate to save the estimated expression value for each slide and probe combination. The IDPROBES and IDSLIDES parameters must supply factors to identify the probes and slides, respectively, in the ESTIMATES variate. You can also set parameter SPREADSHEET=results to save these in a Genstat spreadsheet. The SE parameter can supply a variate to save approximate standard errors and, if this is set, the standard errors are included in the spreadsheet.

Reference




Irizarry, R.A., Hobbs, B., Collin, F., Beazer-Barclay, Y.D., Antonellis, K.J., Scherf, U. & Speed, T.P. (2003). Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics, 4, Number 2, 249-264.

See also


Commands for: Microarray data.

Example


CAPTION      'AFFYMETRIX example'; STYLE=meta
" Warning, this example takes 1GB of RAM to run! "
" The scalar check is assumed to have been set beforehand: non-zero if
  the example workbook is installed, zero otherwise. "
IF check
  SPLOAD     '%GENDIR%/Data/Microarrays/Hyb-AllData.gwb'
  " Estimate Expression Values from Affymetrix CEL data."
  AFFYMETRIX [PRINT=estimates,background,monitoring; METHOD=rma;\
             BMETHOD=none; TRANSFORMATION=log2; NMETHOD=medians;\
             MAXCYCLE=10; TOLERANCE=0.0001; "SPREADSHEET=results"]\
             DATA=Intensity; SLIDES=Slide; PROBES=Probe; ATOMS=Atom;\
             IDPROBES=ProbeID; IDSLIDES=SlideID; ESTIMATES=Expression; SE=SE
ELSE
  CAPTION    'Microarray example datasets have not been installed.'
ENDIF

Updated on March 11, 2019
