Performs marker-trait association analysis in a genetically diverse population using bi-allelic and multi-allelic markers (M. Malosetti & J.T.N.M. Thissen).
Options
PRINT = string tokens | What to print (summary, progress); default summ |
---|---|
PLOT = string tokens | What to plot (profile, qq, map); default prof, qq |
RELATIONSHIPMODEL = string token | What model to use to account for genetic relatedness (eigenanalysis, kinship, subpopulations, null); default kins |
SCORES = pointer | Provides the scores of significant principal components, obtained from an eigenvalue analysis |
METHOD = string token | What model to use for GWAS (exact, fast); default fast |
ALPHA = scalar | Defines a genome-wide significance level to calculate the threshold; default 0.05 |
THRMETHOD = string token | Method to define the threshold for significance (neffective, bonferroni, given); default neff |
THRESHOLD = scalar | Threshold value for significant LD, on the -log10 scale; default 2 |
DISTANCE = scalar | Minimum distance gap between independent tests (i.e. distance beyond which loci are expected to be in linkage equilibrium) when THRMETHOD=bonferroni; default * |
MINORALLELE = scalar | Frequency below which alleles are considered rare; default 0.05 |
KMATRIX = symmetric matrix | Kinship matrix containing coefficients of coancestries |
KMETHOD = string token | Method to use to estimate kinship matrix if not supplied by KMATRIX (correlation, dice); default dice |
SUBPOPULATIONS = factor | Defines groupings of genotypes into subpopulations |
MODELPART = string token | Defines which part of the model should include SUBPOPULATIONS if RELATIONSHIPMODEL is set to subpopulations, or the principal components scores if RELATIONSHIPMODEL is set to eigenanalysis (fixed, random); default rand |
SCALING = string token | Whether to scale the scores by the square roots of their singular values (singularvalues, none); default none |
STANDARDIZE = string token | Whether to standardize the marker scores according to their frequencies (frequency, none); default freq |
COLOURS = scalar, variate or text | Colours to use for the chromosomes; default * uses the colours of pens 1, 2 up to the number of chromosomes |
TITLE = text | General title for the plots |
YTITLE = text | Title for the y-axis |
XTITLE = text | Title for the x-axis |
Parameters
TRAIT = variates | Phenotypic trait to analyse; must be set |
---|---|
GENOTYPES = factors | Genotype factor |
MKSCORES = pointers | Genotype codes for each marker; must be set |
CHROMOSOMES = factors | Linkage groups for the markers; must be set |
POSITIONS = variates | Positions within the linkage groups of markers; must be set |
MKNAMES = texts | Marker names |
IDMGENOTYPES = texts | Labels for the genotypes corresponding to the markers |
GENFILENAME = texts | Name of a comma-delimited file (*.csv) containing marker scores (with markers in the rows and genotypes in the columns) |
MAPFILENAME = texts | Name of a comma-delimited file (*.csv) with map information |
WALDSTATISTICS = variates | Saves the Wald test statistics |
NDF = variates | Saves the degrees of freedom associated with the Wald test |
MINLOG10P = variates | Saves the associated probability values of the Wald test statistics, on a -log10 scale |
LAMBDA = scalars | Saves the inflation factor i.e. slope of the QQ plot of -log10(P) values |
QSAVE = pointers | Saves a pointer with information and results for the significant effects |
DFILENAME = texts | Name of the graphical file for the plots |
Description
QSASSOCIATION performs a mixed model marker-trait association analysis (also known as linkage disequilibrium mapping) with data from a single-environment trial. The trait data are supplied by the TRAIT parameter. The marker scores can be supplied as a pointer of factors by the MKSCORES parameter. The length of the pointer must be equal to the number of markers. Alternatively, if the fast method is requested by the METHOD option, they can be supplied in a file whose name is specified by the GENFILENAME parameter. The file must be comma-delimited (*.csv), with the markers in the rows and the genotypes in the columns. The first column of the file contains marker names, and the first row of the file contains the names of the genotypes.
The corresponding map information for the markers can be supplied by the CHROMOSOMES and POSITIONS parameters, and the labels for the markers can be supplied by the MKNAMES parameter. The IDMGENOTYPES parameter can be used to give the labels of the genotypes in the marker data. Alternatively, if the fast method is requested by the METHOD option, the map information can be supplied in a file, whose name is specified by the MAPFILENAME parameter. This file must also be comma-delimited (*.csv), and should contain three columns (without headings): marker name, linkage group (chromosome), and position of each marker within its linkage group.
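The two file layouts described above can be checked outside Genstat before running the analysis. The following Python sketch is an illustration only: the file contents, the header cell `marker`, and the function names `read_genotype_file` and `read_map_file` are hypothetical, and it is not how QSASSOCIATION itself reads the files.

```python
import csv
import io

# Hypothetical file contents, made up purely to show the two layouts.
GENO_CSV = """marker,G1,G2,G3
m1,1,2,1
m2,2,2,1
"""
MAP_CSV = """m1,1,0.0
m2,1,12.5
"""


def read_genotype_file(text):
    """Read a markers-by-genotypes CSV: first row = genotype names,
    first column = marker names."""
    rows = [row for row in csv.reader(io.StringIO(text)) if row]
    genotype_names = rows[0][1:]
    scores = {row[0]: row[1:] for row in rows[1:]}
    return genotype_names, scores


def read_map_file(text):
    """Read a headerless CSV with columns: marker name, linkage group, position."""
    result = {}
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # skip blank lines
        name, group, position = row
        result[name] = (group, float(position))
    return result


names, scores = read_genotype_file(GENO_CSV)
marker_map = read_map_file(MAP_CSV)
# Every marker in the genotype file should also appear in the map file.
missing = [m for m in scores if m not in marker_map]
```

A check such as `missing` being empty is a cheap way to catch marker names that differ between the two files.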
To avoid false positives in association mapping studies, some form of control is necessary for the genetic relatedness. The model to use is specified by the RELATIONSHIPMODEL option, with one of the following settings:
eigenanalysis | infers the underlying genetic substructure in the population by retaining the most significant principal components from the molecular marker matrix (Patterson et al. 2006). The scores of the significant axes are used as covariables in the mixed model, which effectively approximates the structuring of the genetic variance-covariance matrix by a coefficient of coancestry matrix (kinship matrix); |
---|---|
kinship | is the default model, and includes a kinship matrix in the mixed model; |
subpopulations | includes a factor supplied by the SUBPOPULATIONS option in the mixed model; and |
null | makes no correction for genetic relatedness. |
When RELATIONSHIPMODEL=kinship, the kinship matrix can be specified by the KMATRIX option. Alternatively, it can be calculated from the MKSCORES using the QKINSHIPMATRIX procedure with the method specified by the KMETHOD option (and can then be stored by KMATRIX).
When RELATIONSHIPMODEL=eigenanalysis, the scores of the significant axes can be supplied using the SCORES option. Otherwise they are calculated by the QEIGENANALYSIS procedure (and can then be stored by SCORES). The STANDARDIZE and SCALING options control whether the MKSCORES factors are standardized and scaled; see QEIGENANALYSIS for more details.
The MODELPART option controls whether the principal components scores (if RELATIONSHIPMODEL=eigenanalysis) or the subpopulations factor (if RELATIONSHIPMODEL=subpopulations) are included as random or fixed terms (default random).
The threshold for significant marker-trait association (on a -log10 scale) is defined by the THRESHOLD option. The default value is 2.
The MINORALLELE option defines the frequency q below which alleles are considered rare. Rare alleles are automatically pooled together. Markers whose most frequent allele has a frequency greater than or equal to 1-q are considered close to fixation and are not used in the analysis.
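The MINORALLELE rule can be illustrated with a short Python sketch for a single marker. This is a hedged illustration, not the procedure's code: the function name `filter_marker`, the pooled label `"rare"`, the use of `None` for missing scores, and the choice to test for fixation before pooling are all assumptions.

```python
from collections import Counter

POOLED_LABEL = "rare"  # hypothetical label for the pooled rare alleles


def filter_marker(scores, q=0.05):
    """Apply the minor-allele rule to one marker.

    scores: allele labels, one per genotype (None = missing; an assumption).
    Returns None if the marker is close to fixation, otherwise the scores
    with all alleles of frequency below q pooled into one class.
    """
    observed = [s for s in scores if s is not None]
    if not observed:
        return None
    counts = Counter(observed)
    freqs = {allele: n / len(observed) for allele, n in counts.items()}
    # Most frequent allele at or above 1-q: marker is dropped.
    if max(freqs.values()) >= 1 - q:
        return None
    rare = {allele for allele, f in freqs.items() if f < q}
    return [POOLED_LABEL if s in rare else s for s in scores]


# 12 A, 7 B, 1 C with q = 0.1: C (frequency 0.05) is pooled as rare.
kept = filter_marker(["A"] * 12 + ["B"] * 7 + ["C"], q=0.1)
# 19 A, 1 B with q = 0.1: A has frequency 0.95 >= 0.9, so the marker is dropped.
dropped = filter_marker(["A"] * 19 + ["B"], q=0.1)
```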
The METHOD option defines the method to use to fit marker-trait association models, either exact or fast. For the exact method, the mixed models are solved for each marker separately. For the fast method, the mixed model is solved only for the genetic background model, without the markers in the model. The estimated variance-covariance matrix from this genetic background model is then used to perform a generalized least squares scan over all the markers. The fast method is implemented only for bi-allelic markers, such as SNPs.
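The idea behind the fast method, estimating the variance-covariance matrix once and reusing it for every marker, can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not Genstat's implementation: the matrix `v` is taken as already estimated and positive definite, each marker is coded as a single numeric column, the fixed part contains only an intercept and the marker, and the helper names `solve`, `dot` and `gls_wald` are hypothetical.

```python
def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting.
    Assumes a is square and non-singular."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def gls_wald(y, marker, v):
    """GLS estimate and 1-df Wald statistic for one marker, given a fixed V.

    Model: y = mu + beta * marker + e, with var(e) = v (symmetric).
    """
    ones = [1.0] * len(y)
    mk = [float(x) for x in marker]
    vinv_ones = solve(v, ones)  # V^-1 1
    vinv_mk = solve(v, mk)      # V^-1 m
    # Elements of X' V^-1 X and X' V^-1 y for X = [1, m].
    a11, a12, a22 = dot(ones, vinv_ones), dot(ones, vinv_mk), dot(mk, vinv_mk)
    b1, b2 = dot(vinv_ones, y), dot(vinv_mk, y)
    det = a11 * a22 - a12 * a12
    beta = (a11 * b2 - a12 * b1) / det  # marker effect
    var_beta = a11 / det                # its sampling variance
    return beta, beta * beta / var_beta


def scan(y, markers, v):
    """Reuse the same V for every marker (the point of the fast method)."""
    return [gls_wald(y, m, v) for m in markers]
```

With `v` equal to the identity matrix the scan reduces to ordinary least squares with known unit variance, which is a convenient sanity check.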
The THRMETHOD option controls how the threshold for significance is defined. The default, THRMETHOD=neffective, first determines the effective number of columns (nC) in the marker matrix data using the estimator given by Patterson et al. (2006), and calculates the threshold as -log10(α/nC). The parameter α is the genome-wide type I error rate, which is defined by the ALPHA option (default 0.05). Alternatively, THRMETHOD=bonferroni calculates the effective number of tests assuming one independent test within blocks of a size specified by the DISTANCE option. If DISTANCE is not set, the default is to take an independent test at every marker, which is very conservative in most cases. Finally, if THRMETHOD=given, a user-defined threshold value (on a -log10 scale) must be specified using the THRESHOLD option. With the other settings of THRMETHOD, THRESHOLD can be used to save the estimated threshold.
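The threshold arithmetic is simple enough to verify by hand. The Python sketch below computes -log10(α/n) for a given number of tests n. The helper `n_blocks` shows one possible reading of "one independent test within blocks of size DISTANCE"; the rounding rule and the use of chromosome lengths are assumptions, and the procedure may count blocks differently.

```python
import math


def threshold(alpha, n_tests):
    """Significance threshold on the -log10 scale for n_tests independent tests."""
    return -math.log10(alpha / n_tests)


def n_blocks(chromosome_lengths, distance):
    """Assumed block count: one test per block of `distance` on each
    chromosome, with at least one test per chromosome."""
    return sum(max(1, math.ceil(length / distance))
               for length in chromosome_lengths)


# One test at every one of 10000 markers (DISTANCE not set), alpha = 0.05:
thr_all = threshold(0.05, 10000)        # about 5.30
# Two hypothetical chromosomes of length 100 and 55, blocks of size 10:
thr_blocks = threshold(0.05, n_blocks([100, 55], 10))  # 16 tests, about 2.51
```

The comparison shows why testing at every marker is conservative: fewer assumed independent tests give a lower threshold.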
The PRINT option controls printed output, with settings:
summary | to print the list of markers with a significant association with the trait, and |
---|---|
progress | to monitor the progress of the analysis. |
The default is PRINT=summary.
The PLOT option controls what graphs are produced, with settings:
profile | plots a genome wide profile of the -log10(P) of the test statistic, |
---|---|
map | plots a map with the location of the detected significant markers, highlighting whether or not the marker showed significant interaction with the environment, and |
qq | makes a QQ plot of the -log10(P) values. |
By default PLOT=profile,qq. The TITLE option can be used to provide a title for the graph, and the YTITLE and XTITLE options can supply titles for the y- and x-axis, respectively. The colours to use for the chromosomes in the upper graph are specified by the COLOURS option using either a text of colour names or a variate of RGB values (see the PEN directive for details). If COLOURS is not set, the default is to use the default colours of the pens 1, 2, onwards, up to the number of chromosomes. By default, the plot is sent to the screen. However, you can supply a file for the plot, using the DFILENAME parameter. You can discover the types of graphics file that are supported by running the command DHELP possible.
The Wald test statistics, their numbers of degrees of freedom and the associated probability values on a -log10 scale can be saved by the WALDSTATISTICS, NDF and MINLOG10P parameters, respectively. The LAMBDA parameter can save the inflation factor, estimated as the slope of the QQ plot of the -log10(P) values. The QSAVE parameter can save a pointer containing information and results for the significant markers. The elements of the pointer are labelled as follows to simplify their subsequent use:
'procedure' | stores the string 'QSASSOCIATION' to indicate the source of the results, |
---|---|
'index' | index numbers of the significant markers, |
'mkname' | marker names, |
'chromosomes' | chromosomes, |
'positions' | positions, |
'minlog10p' | probability values on a -log10 scale, |
'allele' | label of the relevant allele, |
'frequency' | allele frequencies, |
'effects' | effects and |
'seeffects' | standard errors of the effects. |
These are all pointers, with an element for each chromosome. The elements of the chromosome pointers are variates for all components except the standard errors of differences, which are scalars.
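The inflation factor saved by LAMBDA is described above as the slope of the QQ plot of the -log10(P) values. The Python sketch below shows one way such a slope could be computed. It assumes expected quantiles of (i - 0.5)/n and a least-squares line through the origin; both choices, and the function name `inflation_factor`, are illustrative assumptions, not a statement of how the procedure estimates the slope.

```python
import math


def inflation_factor(pvalues):
    """Slope through the origin of observed on expected -log10(P) values."""
    n = len(pvalues)
    observed = sorted(-math.log10(p) for p in pvalues)
    # Expected -log10(P) under the null, using (i - 0.5)/n plotting positions.
    expected = sorted(-math.log10((i - 0.5) / n) for i in range(1, n + 1))
    sxy = sum(e * o for e, o in zip(expected, observed))
    sxx = sum(e * e for e in expected)
    return sxy / sxx


# P-values that match their null expectation exactly give a slope of 1.
null_p = [(i - 0.5) / 100 for i in range(1, 101)]
lam_null = inflation_factor(null_p)
# Squaring every P-value doubles every -log10(P), so the slope becomes 2.
lam_inflated = inflation_factor([p * p for p in null_p])
```

A value close to 1 suggests that the relationship model has controlled the relatedness adequately; values well above 1 point to inflated test statistics.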
Options: PRINT, PLOT, RELATIONSHIPMODEL, SCORES, METHOD, ALPHA, THRMETHOD, THRESHOLD, DISTANCE, MINORALLELE, KMATRIX, KMETHOD, SUBPOPULATIONS, MODELPART, SCALING, STANDARDIZE, COLOURS, TITLE, YTITLE, XTITLE.
Parameters: TRAIT, GENOTYPES, MKSCORES, CHROMOSOMES, POSITIONS, MKNAMES, IDMGENOTYPES, GENFILENAME, MAPFILENAME, WALDSTATISTICS, NDF, MINLOG10P, LAMBDA, QSAVE, DFILENAME.
Method
QSASSOCIATION performs a mixed model marker-trait association analysis, or LD mapping. It takes account of the heterogeneous genetic relatedness between individuals in the population (sometimes referred to as "population structure") using one of three possible models, specified by the RELATIONSHIPMODEL option, as defined below. The model for marker-trait association may include the following terms: an intercept μ, the effects associated with k principal components PCscoreki (fixed or random), the effects of genotype groups Groupk (fixed or random) and the effects of the tested markers MK (fixed).
The RELATIONSHIPMODEL option specifies which of the three possible models to use for the relatedness, and the MODELPART option controls whether these terms are treated as fixed or random.
Model | Fixed | Fixed or random | Fixed | Random |
---|---|---|---|---|
Eigenanalysis | μ + | Σi PCscoreki + | MK + | Gi |
Kinship | μ + | | MK + | Gi with G ~ N(0,2KσG) |
Subpopulations | μ + | Groupk + | MK + | Gi |
Null | μ + | | MK + | Gi |
A Wald test is then used for each marker, individually, to test the null hypothesis that its effect is zero. The most frequent allele is set as the reference level. Marker allele frequencies, effects and standard errors are stored.
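For a bi-allelic marker the Wald test has one degree of freedom, and its statistic can be converted to the -log10 scale by hand. The Python sketch below covers only that 1-df case (an assumption, since multi-allelic markers have more degrees of freedom), using the identity that the upper tail of a chi-square variable with 1 df at w equals erfc(sqrt(w/2)). The function name `minlog10p_1df` is hypothetical.

```python
import math


def minlog10p_1df(wald):
    """-log10 of the chi-square (1 df) upper-tail probability of a Wald statistic."""
    p = math.erfc(math.sqrt(wald / 2.0))
    if p == 0.0:
        return math.inf  # probability underflowed for a very large statistic
    return -math.log10(p)


# A Wald statistic of about 6.635 corresponds to P = 0.01, i.e. 2 on the
# -log10 scale, which is the default THRESHOLD value.
score = minlog10p_1df(6.6348966)
significant = minlog10p_1df(10.0) > 2.0
```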
Action with RESTRICT
Restrictions are not allowed.
Reference
Patterson, N., Price, A.L., Reich, D. (2006). Population structure and eigenanalysis. PLoS Genetics, 2, e190. doi:10.1371/journal.pgen.0020190
See also
Procedures: QEIGENANALYSIS, QKINSHIPMATRIX, QLDDECAY, QMASSOCIATION, QREPORT.
Commands for: Statistical genetics and QTL estimation.
Example
CAPTION   'QSASSOCIATION example 1: a data set with bi-allelic markers';\
          STYLE=meta
QIMPORT   [POPULATION=amp] '%GENDIR%/Examples/QAssociation_geno.txt';\
          MAPFILE='%GENDIR%/Examples/QAssociation_map.txt'; MKSCORES=mk;\
          CHROMOSOMES=mkchr; POSITIONS=mkpos; MKNAMES=mknames;\
          IDMGENOTYPES=geno_id
IMPORT    [PRINT=*] '%GENDIR%/Examples/QAssociation_pheno.csv'; ISAVE=vars
" The relationship model is kinship, with QKINSHIPMATRIX used to
  estimate the K matrix. The threshold is defined as 2.5."
QSASSOCIATION [PRINT=summary,progress; PLOT=profile,map,QQ;\
          RELATIONSHIPMODEL=kinship; METHOD=fast;\
          THRMETHOD=given; THRESHOLD=2.5; DISTANCE=15; MINORALLELE=0.1;\
          KMATRIX=K; KMETHOD=dice] yield; GENOTYPE=genotypes; MKSCORES=mk;\
          CHROMOSOMES=mkchr; POSITIONS=mkpos; IDMGENOTYPES=geno_id;\
          MKNAMES=mknames; MINLOG10P=PvalK; LAMBDA=inf_factorK;\
          QSAVE=outputKFast
PRINT     outputKFast
PRINT     outputKFast[3...6]; FIELD=18
PRINT     outputKFast[3],outputKFast[7,8][1,2],outputKFast[9,10]; FIELD=10

CAPTION   'QSASSOCIATION example 2: a data set with multi-allelic markers';\
          STYLE=meta
DELETE    [REDEFINE=yes] mk,mkchr,mkpos,mknames,geno_id,vars,K
QIMPORT   [POPULATION=amp] '%GENDIR%/Examples/LD_example_geno.txt';\
          MAPFILE='%GENDIR%/Examples/LD_example_map.txt'; MKSCORES=mk;\
          CHROMOSOMES=mkchr; POSITIONS=mkpos; MKNAMES=mknames;\
          IDMGENOTYPES=geno_id
IMPORT    [PRINT=*] '%GENDIR%/Examples/LD_example_pheno.csv'; ISAVE=vars
" The relationship model is eigenanalysis, with QEIGENANALYSIS used to
  calculate the principal component scores. The threshold is calculated
  and saved in scalar Thr2."
QSASSOCIATION [PRINT=summary,progress; PLOT=profile,map,QQ; MINORALLELE=0.05;\
          ALPHA=0.05; THRMETHOD=neff; THRESHOLD=Thr2; METHOD=exact;\
          RELATIONSHIPMODEL=eigenanalysis; SCORES=PCscores; SCALING=none;\
          STANDARDIZE=frequency] gy_th; GENOTYPE=geno; MKSCORES=mk;\
          CHROMOSOMES=mkchr; POSITIONS=mkpos; MKNAMES=mknames;\
          IDMGENOTYPES=geno_id; MINLOG10P=PvalE; LAMBDA=IF_E; QSAVE=outputE
PRINT     outputE
PRINT     outputE[3...6] ; FIELD=18
PRINT     outputE[7],outputE[8,9][1,2],outputE[10,11]; FIELD=10

CAPTION   !t('QSASSOCIATION Example 3: a large data set (10000 markers)',\
          'with bi-allelic markers.'); STYLE=meta
DELETE    [REDEFINE=yes] mk,mkchr,mkpos,mknames,geno_id,vars,PCscores
GET       [WORKINGDIRECTORY=wdir]
%CD       '%GENDIR%/Examples/'
IMPORT    [PRINT=*] 'data10000_pheno.csv'
SPLOAD    [PRINT=*] 'data10000_Kmat.gsh'
QSASSOCIATION [PRINT=summary,progress; PLOT=profile,QQ; METHOD=fast;\
          MINORALLELE=0.05; ALPHA=0.05; THRMETHOD=bonferroni;\
          DISTANCE=1; THRESHOLD=Thr3; RELATIONSHIPMODEL=kinship;\
          KMATRIX=Kmat] y001; GENOTYPE=genotype;\
          GENFILENAME='data10000_geno.csv';\
          MAPFILENAME='data10000_map.csv';\
          MINLOG10P=Pval; LAMBDA=IF; QSAVE=output
%CD       wdir