QSASSOCIATION procedure

Performs marker-trait association analysis in a genetically diverse population using bi-allelic and multi-allelic markers (M. Malosetti & J.T.N.M. Thissen).

Options

`PRINT` = string tokens	What to print (`summary`, `progress`); default `summ`
`PLOT` = string tokens	What to plot (`profile`, `qq`, `map`); default `prof`, `qq`
`RELATIONSHIPMODEL` = string token	What model to use to account for genetic relatedness (`eigenanalysis`, `kinship`, `subpopulations`, `null`); default `kins`
`SCORES` = pointer	Provides the scores of significant principal components, obtained from an eigenvalue analysis
`METHOD` = string token	What model to use for GWAS (`exact`, `fast`); default `fast`
`ALPHA` = scalar	Defines a genome-wide significance level to calculate the threshold; default 0.05
`THRMETHOD` = string token	Method to define the threshold for significance (`neffective`, `bonferroni`, `given`); default `neff`
`THRESHOLD` = scalar	Threshold value for significant LD, on the -log10 scale; default 2
`DISTANCE` = scalar	Minimum distance gap between independent tests (i.e. distance beyond which loci are expected to be in linkage equilibrium) when `THRMETHOD=bonferroni`; default `*`
`MINORALLELE` = scalar	Frequency of minor alleles; default 0.05
`KMATRIX` = symmetric matrix	Kinship matrix containing coefficients of coancestries
`KMETHOD` = string token	Method to use to estimate kinship matrix if not supplied by `KMATRIX` (`correlation`, `dice`); default `dice`
`SUBPOPULATIONS` = factor	Defines groupings of genotypes into subpopulations
`MODELPART` = string token	Defines which part of the model should include `SUBPOPULATIONS` if `RELATIONSHIPMODEL` is set to `subpopulations`, or the principal components scores if `RELATIONSHIPMODEL` is set to `eigenanalysis` (`fixed`, `random`); default `rand`
`SCALING` = string token	Whether to scale the scores by the square roots of their singular values (`singularvalues`, `none`); default `none`
`STANDARDIZE` = string token	Whether to standardize the marker scores according to their frequencies (`frequency`, `none`); default `freq`
`COLOURS` = scalar, variate or text	Colours to use for the chromosomes; default `*` uses the colours of pens 1, 2 up to the number of chromosomes
`TITLE` = text	General title for the plots
`YTITLE` = text	Title for the y-axis
`XTITLE` = text	Title for the x-axis

Parameters

`TRAIT` = variates	Phenotypic trait to analyse; must be set
`GENOTYPES` = factors	Genotype factor
`MKSCORES` = pointers	Genotype codes for each marker; must be set
`CHROMOSOMES` = factors	Linkage groups for the markers; must be set
`POSITIONS` = variates	Positions within the linkage groups of markers; must be set
`MKNAMES` = texts	Marker names
`IDMGENOTYPES` = texts	Labels for the genotypes corresponding to the markers
`GENFILENAME` = texts	Name of a comma-delimited file (`*.csv`) containing marker scores (with markers in the rows and genotypes in the columns)
`MAPFILENAME` = texts	Name of a comma-delimited file (`*.csv`) with map information
`WALDSTATISTICS` = variates	Saves the Wald test statistics
`NDF` = variates	Saves the degrees of freedom associated with the Wald test
`MINLOG10P` = variates	Saves the associated probability values of the Wald test statistics, on a -log10 scale
`LAMBDA` = scalars	Saves the inflation factor, i.e. the slope of the QQ plot of -log10(P) values
`QSAVE` = pointers	Saves a pointer with information and results for the significant effects
`DFILENAME` = texts	Name of the graphical file for the plots

Description

QSASSOCIATION performs a mixed model marker-trait association analysis (also known as linkage disequilibrium mapping) with data from a single-environment trial. The trait data are supplied by the TRAIT parameter. The marker scores can be supplied as a pointer of factors by the MKSCORES parameter. The length of the pointer must be equal to the number of markers. Alternatively, if the fast method is requested by the METHOD option, they can be supplied in a file whose name is specified by the GENFILENAME parameter. The file must be comma-delimited (*.csv), with the markers in the rows and the genotypes in the columns. The first column of the file contains marker names, and the first row of the file contains the names of the genotypes.

The corresponding map information for the markers can be supplied by the CHROMOSOMES and POSITIONS parameters, and the labels for the markers can be supplied by the MKNAMES parameter. The IDMGENOTYPES parameter can be used to supply the labels of the genotypes in the marker data. Alternatively, if the fast method is requested by the METHOD option, the map information can be supplied in a file, whose name is specified by the MAPFILENAME parameter. This file must also be comma-delimited (*.csv), and should contain three columns (without headings): marker name, linkage group (chromosome), and position within linkage group of each marker.
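The two file layouts described above can be sketched outside Genstat. The following Python fragment is only an illustration of the layouts, not part of the procedure: the file contents, the marker and genotype names, the score coding and the label in the top-left cell of the marker file are all hypothetical.

```python
import csv
import io

# Hypothetical file contents that follow the layouts described above.
# Marker file: markers in rows, genotypes in columns, names in first row/column.
GENO_CSV = """marker,G1,G2,G3
mk1,1,2,1
mk2,2,2,1
"""
# Map file: no headings; marker name, linkage group, position.
MAP_CSV = """mk1,1,0.0
mk2,1,12.5
"""

def read_marker_file(handle):
    """Return (genotype_names, {marker_name: [scores as text]})."""
    rows = [row for row in csv.reader(handle) if row]
    header, body = rows[0], rows[1:]
    genotype_names = header[1:]  # the first cell sits above the marker names
    scores = {row[0]: row[1:] for row in body}
    return genotype_names, scores

def read_map_file(handle):
    """Return {marker_name: (linkage_group, position)}."""
    return {row[0]: (row[1], float(row[2]))
            for row in csv.reader(handle) if row}

genotypes, scores = read_marker_file(io.StringIO(GENO_CSV))
marker_map = read_map_file(io.StringIO(MAP_CSV))
print(genotypes)
print(scores)
print(marker_map)
```

A check of this kind can be useful before a long run with METHOD=fast, for example to confirm that every marker in the marker file also appears in the map file.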

To avoid false positives in association mapping studies, some form of control is necessary for the genetic relatedness. The model to use is specified by the RELATIONSHIPMODEL option, with one of the following settings:

`eigenanalysis`	infers the underlying genetic substructure in the population by retaining the most significant principal components from the molecular marker matrix (Patterson et al. 2006) – the scores of the significant axes are used as covariables in the mixed model, which effectively is an approximation to the structuring of the genetic variance covariance matrix by a coefficient of coancestry matrix (kinship matrix);
`kinship`	is the default model, and includes a kinship matrix in the mixed model;
`subpopulations`	includes a factor supplied by the `SUBPOPULATIONS` option in the mixed model; and
`null`	makes no correction for genetic relatedness.

When RELATIONSHIPMODEL=kinship, the kinship matrix can be specified by the KMATRIX option. Alternatively, it can be calculated from the MKSCORES using the QKINSHIPMATRIX procedure with the method specified by the KMETHOD option (and can then be stored by KMATRIX).

When RELATIONSHIPMODEL=eigenanalysis, the scores of the significant axes can be supplied using the SCORES option. Otherwise they are calculated by the QEIGENANALYSIS procedure (and can then be stored by SCORES). The STANDARDIZE and SCALING options control whether the MKSCORES factors are standardized and scaled; see QEIGENANALYSIS for more details.

The MODELPART option controls whether the principal components scores (if RELATIONSHIPMODEL=eigenanalysis) or the subpopulations factor (if RELATIONSHIPMODEL=subpopulations) are included as random or fixed terms (default random).

The threshold for significant marker trait association (on a -log10 scale) is defined by the THRESHOLD option. The default value is 2.

The MINORALLELE option defines the frequency q below which alleles are considered rare. Rare alleles are automatically pooled together. Markers whose major frequency allele is greater than or equal to 1-q are considered close to fixation and are not used in the analysis.
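The effect of MINORALLELE on a single marker can be sketched in Python as follows. This is a minimal illustration of the rule as stated above, not the procedure's own code; the function name, the label used for the pooled class and the treatment of missing scores (ignored when frequencies are calculated) are assumptions.

```python
from collections import Counter

def filter_marker(alleles, q=0.05, pooled_label="rare"):
    """Pool alleles with frequency below q.

    Returns the recoded scores, or None if the marker is close to
    fixation (major allele frequency >= 1 - q) and would be dropped.
    Missing scores are given as None (an assumption of this sketch).
    """
    observed = [a for a in alleles if a is not None]
    if not observed:
        return None
    n = len(observed)
    freqs = {a: c / n for a, c in Counter(observed).items()}
    if max(freqs.values()) >= 1 - q:
        return None
    return [pooled_label if (a is not None and freqs[a] < q) else a
            for a in alleles]

# Hypothetical multi-allelic marker: frequencies 0.50, 0.40, 0.06 and 0.04.
marker = ["A"] * 50 + ["B"] * 40 + ["C"] * 6 + ["D"] * 4
recoded = filter_marker(marker, q=0.05)
print(Counter(recoded))

# Hypothetical bi-allelic marker close to fixation: frequencies 0.97 and 0.03.
print(filter_marker(["A"] * 97 + ["B"] * 3, q=0.05))
```

With q=0.05 only allele D is pooled in the first marker, and the second marker is excluded.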

The METHOD option defines the method to use to fit marker-trait association models, either exact or fast. For the exact method, the mixed models are solved for each marker separately. For the fast method, the mixed model is only solved for the genetic background model, without the markers in the model. The estimated variance-covariance matrix from this genetic background model is used to perform a generalized least squares scan for all the markers. The fast method is implemented only for bi-allelic markers, such as SNPs.

The THRMETHOD option controls how the threshold for significance is defined. The default, THRMETHOD=neffective, first determines the effective number of columns (nC) in the marker matrix data using the estimator given by Patterson et al. (2006), and calculates the threshold as -log10(α/nC). The parameter α is the genome-wide type I error rate, which is defined by the ALPHA option (default 0.05). Alternatively, THRMETHOD=bonferroni calculates the effective number of tests assuming one independent test within blocks of a size specified by the DISTANCE option. If DISTANCE is not set, the default is to take an independent test at every marker, which is very conservative in most cases. Finally, if THRMETHOD=given, a user-defined threshold value (on a -log10 scale) must be specified using the THRESHOLD option. With the other settings of THRMETHOD, THRESHOLD can be used to save the estimated threshold.
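The arithmetic behind the threshold can be sketched in Python. The formula -log10(α/n) is taken from the description above; the way blocks are counted for THRMETHOD=bonferroni is an assumption of this sketch (blocks of width DISTANCE starting at the first marker on each chromosome), and the procedure may define them differently. The function names and the map data are hypothetical.

```python
import math

def threshold(alpha, n_tests):
    """Threshold on the -log10 scale for n_tests independent tests."""
    return -math.log10(alpha / n_tests)

def count_blocks(chromosomes, positions, distance=None):
    """Count independent tests, one per block of width `distance`.

    With distance=None every marker counts as an independent test.
    Block boundaries start at the first marker of each chromosome
    (an assumption of this sketch).
    """
    if distance is None:
        return len(positions)
    start = {}
    for chrom, pos in zip(chromosomes, positions):
        start[chrom] = min(pos, start.get(chrom, pos))
    blocks = {(chrom, int((pos - start[chrom]) // distance))
              for chrom, pos in zip(chromosomes, positions)}
    return len(blocks)

# Hypothetical map: four markers on chromosome 1, two on chromosome 2.
chrom = [1, 1, 1, 1, 2, 2]
pos = [0, 5, 14, 31, 2, 40]

n_all = count_blocks(chrom, pos)                 # one test per marker
n_blk = count_blocks(chrom, pos, distance=15)    # one test per block
print(n_all, round(threshold(0.05, n_all), 3))
print(n_blk, round(threshold(0.05, n_blk), 3))
```

Fewer independent tests give a lower threshold, which is why setting DISTANCE is less conservative than testing at every marker.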

The PRINT option controls printed output, with settings:

`summary`	to print the list of markers with a significant association with the trait, and
`progress`	to monitor the progress of the analysis.

The default is PRINT=summary.

The PLOT option controls what graphs are produced, with settings:

`profile`	plots a genome wide profile of the -log10(P) of the test statistic,
`map`	plots a map with the location of the detected significant markers, highlighting whether or not the marker showed significant interaction with the environment, and
`qq`	makes a QQ plot of the -log10(P) values.

By default PLOT=profile,qq. The TITLE option can be used to provide a title for the graph, and the YTITLE and XTITLE options can supply titles for the y- and x-axis, respectively. The colours to use for the chromosomes in the upper graph are specified by the COLOURS option using either a text of colour names or a variate of RGB values (see the PEN directive for details). If COLOURS is not set, the default is to use the default colours of the pens 1, 2, onwards, up to the number of chromosomes. By default, the plot is sent to the screen. However, you can supply a file for the plot, using the DFILENAME parameter. You can discover the types of graphics file that are supported by running the command DHELP possible.

The Wald test statistics, their numbers of degrees of freedom and the associated probability values on a -log10 scale can be saved by the WALDSTATISTICS, NDF and MINLOG10P parameters, respectively. The LAMBDA parameter can save the inflation factor, estimated as the slope of the QQ plot of the -log10(P) values. The QSAVE parameter can save a pointer containing information and results for the significant markers. The elements of the pointer are labelled as follows to simplify their subsequent use:

`'procedure'`	stores the string `'QSASSOCIATION'` to indicate the source of the results,
`'index'`	index numbers of the significant markers,
`'mkname'`	marker names,
`'chromosomes'`	chromosomes,
`'positions'`	positions,
`'minlog10p'`	probability values on a -log10 scale,
`'allele'`	label of the relevant allele,
`'frequency'`	allele frequencies,
`'effects'`	effects and
`'seeffects'`	standard errors of the effects.

These are all pointers, with an element for each chromosome. The elements of the chromosome pointers are variates for all components except the standard errors of differences, which are scalars.
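The inflation factor saved by LAMBDA is described above as the slope of the QQ plot of the -log10(P) values. A minimal Python sketch of such a slope is given below. The choice of expected quantiles, -log10((i-0.5)/n), and the use of a regression through the origin are assumptions of this sketch; the procedure may use a different plotting position or fitting method.

```python
import math

def qq_slope(minlog10p):
    """Slope (through the origin) of observed against expected -log10(P)."""
    observed = sorted(minlog10p, reverse=True)  # smallest P-value first
    n = len(observed)
    # Expected -log10(P) for n P-values from a uniform distribution.
    expected = [-math.log10((i - 0.5) / n) for i in range(1, n + 1)]
    sxy = sum(e * o for e, o in zip(expected, observed))
    sxx = sum(e * e for e in expected)
    return sxy / sxx

n = 200
# Hypothetical well-behaved scan: P-values exactly on the uniform quantiles.
uniform = [-math.log10((i - 0.5) / n) for i in range(1, n + 1)]
# Hypothetical inflated scan: every -log10(P) is twice its expected value.
inflated = [2 * v for v in uniform]

print(round(qq_slope(uniform), 3))
print(round(qq_slope(inflated), 3))
```

A slope close to 1 suggests that the relationship model has accounted for the genetic relatedness; values well above 1 suggest an excess of small P-values.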

Options: PRINT, PLOT, RELATIONSHIPMODEL, SCORES, METHOD, ALPHA, THRMETHOD, THRESHOLD, DISTANCE, MINORALLELE, KMATRIX, KMETHOD, SUBPOPULATIONS, MODELPART, SCALING, STANDARDIZE, COLOURS, TITLE, YTITLE, XTITLE.

Parameters: TRAIT, GENOTYPES, MKSCORES, CHROMOSOMES, POSITIONS, MKNAMES, IDMGENOTYPES, GENFILENAME, MAPFILENAME, WALDSTATISTICS, NDF, MINLOG10P, LAMBDA, QSAVE, DFILENAME.

Method

QSASSOCIATION performs a mixed model marker-trait association analysis, or LD mapping. It takes account of the heterogeneous genetic relatedness between individuals in the population (sometimes referred to as “population structure”) using one of three possible models, specified by the RELATIONSHIPMODEL option, as defined below. The model for marker-trait association may include the following terms: an intercept μ, the effects associated with k principal components PCscore_ki (fixed or random), the effects of genotype groups Group_k (fixed or random) and the effects of the tested markers MK (fixed).

The RELATIONSHIPMODEL option specifies which of the three possible models to use for the relatedness, and the MODELPART option controls whether these terms are treated as fixed or random.

Model	Fixed	Fixed or random	Fixed	Random
Eigenanalysis	μ +	Σ_i PCscore_ki +	MK +	G_i
Kinship	μ +		MK +	G_i with G ~ N(0,2Kσ_G)
Subpopulations	μ +	Group_k +	MK +	G_i
Null	μ +		MK +	G_i

A Wald test is then used for each marker, individually, to test the null hypothesis that its effect is zero. The most frequent allele is set as the reference level. Marker allele frequencies, effects and standard errors are stored.
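The link between a Wald statistic and the -log10(P) value that is plotted and saved can be sketched in Python for the simplest case. This sketch assumes a bi-allelic marker (one degree of freedom) and a chi-square reference distribution for the Wald statistic; the procedure itself may use a different reference distribution, and multi-allelic markers have more degrees of freedom. The function names and the effect and standard error are hypothetical.

```python
import math

def wald_statistic(effect, se):
    """Wald statistic for a single marker effect (1 degree of freedom)."""
    return (effect / se) ** 2

def minlog10p_1df(wald):
    """-log10 of the chi-square (1 d.f.) upper-tail probability."""
    # For 1 d.f., P(X > w) equals erfc(sqrt(w / 2)).
    p = math.erfc(math.sqrt(wald / 2.0))
    if p == 0.0:
        return math.inf  # P-value underflowed; -log10(P) is too large to represent here
    return -math.log10(p)

# Hypothetical marker effect and standard error.
w = wald_statistic(effect=0.42, se=0.12)
print(round(w, 2), round(minlog10p_1df(w), 2))

# Compare with a threshold on the -log10 scale (the THRESHOLD default of 2).
print(minlog10p_1df(w) > 2)
```

A marker is reported as significant when its -log10(P) value exceeds the threshold defined by the THRMETHOD and THRESHOLD options.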

Action with `RESTRICT`

Restrictions are not allowed.

Reference

Patterson, N., Price, A.L., Reich, D. (2006). Population structure and eigenanalysis. PLoS Genetics, 2, e190. doi:10.1371/journal.pgen.0020190

Example

CAPTION       'QSASSOCIATION example 1: a data set with bi-allelic markers';\
              STYLE=meta
QIMPORT       [POPULATION=amp] '%GENDIR%/Examples/QAssociation_geno.txt';\ 
              MAPFILE='%GENDIR%/Examples/QAssociation_map.txt'; MKSCORES=mk;\
              CHROMOSOMES=mkchr; POSITIONS=mkpos; MKNAMES=mknames;\  
              IDMGENOTYPES=geno_id
IMPORT        [PRINT=*] '%GENDIR%/Examples/QAssociation_pheno.csv'; ISAVE=vars
" The relationship model is kinship, with QKINSHIPMATRIX used to 
  estimate the K matrix. The threshold is defined as 2.5."
QSASSOCIATION [PRINT=summary,progress; PLOT=profile,map,QQ;\ 
              RELATIONSHIPMODEL=kinship; METHOD=fast;\ 
              THRMETHOD=given; THRESHOLD=2.5; DISTANCE=15; MINORALLELE=0.1;\ 
              KMATRIX=K; KMETHOD=dice] yield; GENOTYPE=genotypes; MKSCORES=mk;\
              CHROMOSOMES=mkchr; POSITIONS=mkpos; IDMGENOTYPES=geno_id;\
              MKNAMES=mknames; MINLOG10P=PvalK; LAMBDA=inf_factorK;\ 
              QSAVE=outputKFast
PRINT         outputKFast
PRINT         outputKFast[3...6]; FIELD=18
PRINT         outputKFast[3],outputKFast[7,8][1,2],outputKFast[9,10]; FIELD=10

CAPTION       'QSASSOCIATION example 2: a data set with multi-allelic markers';\
              STYLE=meta
DELETE        [REDEFINE=yes] mk,mkchr,mkpos,mknames,geno_id,vars,K
QIMPORT       [POPULATION=amp] '%GENDIR%/Examples/LD_example_geno.txt';\ 
              MAPFILE='%GENDIR%/Examples/LD_example_map.txt'; MKSCORES=mk;\
              CHROMOSOMES=mkchr; POSITIONS=mkpos; MKNAMES=mknames;\ 
              IDMGENOTYPES=geno_id
IMPORT        [PRINT=*] '%GENDIR%/Examples/LD_example_pheno.csv'; ISAVE=vars
" The relationship model is eigenanalysis, with QEIGENANALYSIS used to
  calculate the principal component scores. The threshold is calculated
  and saved in scalar Thr2."
QSASSOCIATION [PRINT=summary,progress; PLOT=profile,map,QQ; MINORALLELE=0.05;\
              ALPHA=0.05; THRMETHOD=neff; THRESHOLD=Thr2; METHOD=exact;\  
              RELATIONSHIPMODEL=eigenanalysis; SCORES=PCscores; SCALING=none;\
               STANDARDIZE=frequency] gy_th; GENOTYPE=geno; MKSCORES=mk;\ 
              CHROMOSOMES=mkchr; POSITIONS=mkpos; MKNAMES=mknames;\ 
              IDMGENOTYPES=geno_id; MINLOG10P=PvalE; LAMBDA=IF_E; QSAVE=outputE
PRINT         outputE
PRINT         outputE[3...6] ; FIELD=18
PRINT         outputE[7],outputE[8,9][1,2],outputE[10,11]; FIELD=10

CAPTION       !t('QSASSOCIATION Example 3: a large data set (10000 markers)',\
              'with bi-allelic markers.'); STYLE=meta
DELETE        [REDEFINE=yes] mk,mkchr,mkpos,mknames,geno_id,vars,PCscores
GET           [WORKINGDIRECTORY=wdir]
%CD           '%GENDIR%/Examples/'
IMPORT        [PRINT=*] 'data10000_pheno.csv'
SPLOAD        [PRINT=*] 'data10000_Kmat.gsh'
QSASSOCIATION [PRINT=summary,progress; PLOT=profile,QQ; METHOD=fast;\ 
              MINORALLELE=0.05; ALPHA=0.05; THRMETHOD=bonferroni;\
              DISTANCE=1; THRESHOLD=Thr3; RELATIONSHIPMODEL=kinship;\
              KMATRIX=Kmat] y001; GENOTYPE=genotype;\
              GENFILENAME='data10000_geno.csv';\
              MAPFILENAME='data10000_map.csv';\
              MINLOG10P=Pval; LAMBDA=IF; QSAVE=output
%CD           wdir

Updated on June 19, 2019
