QSASSOCIATION procedure

Performs marker-trait association analysis in a genetically diverse population using bi-allelic and multi-allelic markers (M. Malosetti & J.T.N.M. Thissen).

Options

PRINT = string tokens What to print (summary, progress); default summ
PLOT = string tokens What to plot (profile, qq, map); default prof, qq
RELATIONSHIPMODEL = string token What model to use to account for genetic relatedness (eigenanalysis, kinship, subpopulations, null); default kins
SCORES = pointer Provides the scores of significant principal components, obtained from an eigenvalue analysis
METHOD = string token What model to use for GWAS (exact, fast); default fast
ALPHA = scalar Defines a genome-wide significance level to calculate the threshold; default 0.05
THRMETHOD = string token Method to define the threshold for significance (neffective, bonferroni, given); default neff
THRESHOLD = scalar Threshold value for significant LD, on the -log10 scale; default 2
DISTANCE = scalar Minimum distance gap between independent tests (i.e. distance beyond which loci are expected to be in linkage equilibrium) when THRMETHOD=bonferroni; default *
MINORALLELE = scalar Frequency of minor alleles; default 0.05
KMATRIX = symmetric matrix Kinship matrix containing coefficients of coancestries
KMETHOD = string token Method to use to estimate kinship matrix if not supplied by KMATRIX (correlation, dice); default dice
SUBPOPULATIONS = factor Defines groupings of genotypes into subpopulations
MODELPART = string token Defines which part of the model should include SUBPOPULATIONS if RELATIONSHIPMODEL is set to subpopulations, or the principal components scores if RELATIONSHIPMODEL is set to eigenanalysis (fixed, random); default rand
SCALING = string token Whether to scale the scores by the square roots of their singular values (singularvalues, none); default none
STANDARDIZE = string token Whether to standardize the marker scores according to their frequencies (frequency, none); default freq
COLOURS = scalar, variate or text Colours to use for the chromosomes; default * uses the colours of pens 1, 2 up to the number of chromosomes
TITLE = text General title for the plots
YTITLE = text Title for the y-axis
XTITLE = text Title for the x-axis

Parameters

TRAIT = variates Phenotypic trait to analyse; must be set
GENOTYPES = factors Genotype factor
MKSCORES = pointers Genotype codes for each marker; must be set
CHROMOSOMES= factors Linkage groups for the markers; must be set
POSITIONS = variates Positions within the linkage groups of markers; must be set
MKNAMES = texts Marker names
IDMGENOTYPES = texts Labels for the genotypes corresponding to the markers
GENFILENAME = texts Name of a comma-delimited file (*.csv) containing marker scores (with markers in the rows and genotypes in the columns)
MAPFILENAME = texts Name of a comma-delimited file (*.csv) with map information
WALDSTATISTICS = variates Saves the Wald test statistics
NDF = variates Saves the degrees of freedom associated with the Wald test
MINLOG10P = variates Saves the associated probability values of the Wald test statistics, on a -log10 scale
LAMBDA = scalars Saves the inflation factor i.e. slope of the QQ plot of -log10(P) values
QSAVE = pointers Saves a pointer with information and results for the significant effects
DFILENAME = texts Name of the graphical file for the plots

Description

QSASSOCIATION performs a mixed model marker-trait association analysis (also known as linkage disequilibrium mapping) with data from a single-environment trial. The trait data are supplied by the TRAIT parameter. The marker scores can be supplied as a pointer of factors by the MKSCORES parameter. The length of the pointer must be equal to the number of markers. Alternatively, if the fast method is requested by the METHOD option, they can be supplied in a file whose name is specified by the GENFILENAME parameter. The file must be comma-delimited (*.csv), with the markers in the rows and the genotypes in the columns. The first column of the file contains marker names, and the first row of the file contains the names of the genotypes.

The corresponding map information for the markers can be supplied by the CHROMOSOMES and POSITIONS parameters, and the labels for the markers can be supplied by the MKNAMES parameter. The IDMGENOTYPES parameter can be used to supply the genotype labels used in the marker data. Alternatively, if the fast method is requested by the METHOD option, the map information can be supplied in a file, whose name is specified by the MAPFILENAME parameter. This file must also be comma-delimited (*.csv), and should contain three columns (without headings): marker name, linkage group (chromosome), and position of each marker within its linkage group.
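The two file layouts described above can be sketched in Python as follows. This is only an illustration of the layout, not part of the procedure: the file names, the label of the top-left cell, the score coding and the rule that markers appear in the same order in both files are all assumptions made for the sketch.

```python
import csv
import os
import tempfile


def write_marker_files(directory, genotype_names, marker_rows, map_rows):
    """Write a marker-score file and a map file in the layouts described above."""
    geno_path = os.path.join(directory, "example_geno.csv")  # hypothetical name
    map_path = os.path.join(directory, "example_map.csv")    # hypothetical name
    with open(geno_path, "w", newline="") as f:
        writer = csv.writer(f)
        # First row: genotype names. The top-left cell label is an assumption.
        writer.writerow(["marker"] + list(genotype_names))
        # One row per marker: marker name, then one score per genotype.
        for name, scores in marker_rows:
            writer.writerow([name] + list(scores))
    with open(map_path, "w", newline="") as f:
        writer = csv.writer(f)
        # No heading row: marker name, linkage group, position.
        for row in map_rows:
            writer.writerow(row)
    return geno_path, map_path


def check_marker_files(geno_path, map_path):
    """Return the marker names if the files look consistent, else raise ValueError."""
    with open(geno_path, newline="") as f:
        rows = list(csv.reader(f))
    n_genotypes = len(rows[0]) - 1
    geno_markers = []
    for row in rows[1:]:
        if len(row) - 1 != n_genotypes:
            raise ValueError("marker %s has the wrong number of scores" % row[0])
        geno_markers.append(row[0])
    with open(map_path, newline="") as f:
        map_rows = list(csv.reader(f))
    for row in map_rows:
        if len(row) != 3:
            raise ValueError("map rows need exactly three columns")
        float(row[2])  # the position must be numeric
    # Assumption of this sketch: both files list the markers in the same order.
    if geno_markers != [row[0] for row in map_rows]:
        raise ValueError("marker names differ between the two files")
    return geno_markers


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        geno, mapf = write_marker_files(
            tmp,
            ["G1", "G2", "G3"],
            [("mk1", ["1", "2", "1"]), ("mk2", ["2", "2", "1"])],
            [("mk1", "1", "0.0"), ("mk2", "1", "12.5")],
        )
        print(check_marker_files(geno, mapf))
```

A check of this kind before running the analysis can catch a transposed marker file (genotypes in rows) or a map file that still has a heading row.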

To avoid false positives in association mapping studies, some form of control is necessary for the genetic relatedness. The model to use is specified by the RELATIONSHIPMODEL option, with one of the following settings:

    eigenanalysis infers the underlying genetic substructure in the population by retaining the most significant principal components from the molecular marker matrix (Patterson et al. 2006) – the scores of the significant axes are used as covariables in the mixed model, which effectively is an approximation to the structuring of the genetic variance covariance matrix by a coefficient of coancestry matrix (kinship matrix);
    kinship is the default model, and includes a kinship matrix in the mixed model;
    subpopulations includes a factor supplied by the SUBPOPULATIONS option in the mixed model; and
    null makes no correction for genetic relatedness.

When RELATIONSHIPMODEL=kinship, the kinship matrix can be specified by the KMATRIX option. Alternatively, it can be calculated from the MKSCORES using the QKINSHIPMATRIX procedure with the method specified by the KMETHOD option (and can then be stored by KMATRIX).

When RELATIONSHIPMODEL=eigenanalysis, the scores of the significant axes can be supplied using the SCORES option. Otherwise they are calculated by the QEIGENANALYSIS procedure (and can then be stored by SCORES). The STANDARDIZE and SCALING options control whether the MKSCORES factors are standardized and scaled; see QEIGENANALYSIS for more details.

The MODELPART option controls whether the principal components scores (if RELATIONSHIPMODEL=eigenanalysis) or the subpopulations factor (if RELATIONSHIPMODEL=subpopulations) are included as random or fixed terms (default random).

The threshold for significant marker-trait association (on a -log10 scale) is defined by the THRESHOLD option. The default value is 2.

The MINORALLELE option defines the frequency q below which alleles are considered rare. Rare alleles are automatically pooled together. Markers whose most frequent allele has a frequency greater than or equal to 1-q are considered close to fixation and are not used in the analysis.
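The filtering rule above can be illustrated with a short Python sketch. It is not the procedure's code: the function name, the label used for the pooled class and the treatment of missing scores (coded as None and ignored when computing frequencies) are assumptions.

```python
from collections import Counter


def pool_rare_alleles(alleles, q=0.05, pooled_label="pooled"):
    """Pool alleles with frequency below q; flag markers close to fixation.

    Returns (recoded_alleles, keep), where keep is False when the most
    frequent allele has a frequency of at least 1 - q.
    """
    # Assumption: missing scores are None and do not count towards frequencies.
    counts = Counter(a for a in alleles if a is not None)
    total = sum(counts.values())
    if total == 0:
        return list(alleles), False
    freqs = {a: c / total for a, c in counts.items()}
    keep = max(freqs.values()) < 1 - q
    rare = {a for a, f in freqs.items() if f < q}
    recoded = [pooled_label if a in rare else a for a in alleles]
    return recoded, keep


if __name__ == "__main__":
    marker = ["A"] * 60 + ["B"] * 37 + ["C"] * 2 + ["D"] * 1
    recoded, keep = pool_rare_alleles(marker, q=0.05)
    print(Counter(recoded), keep)
```

In this example alleles C and D (frequencies 0.02 and 0.01) are pooled, and the marker is kept because the major allele frequency (0.60) is below 0.95.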

The METHOD option defines the method to use to fit marker-trait association models, either exact or fast. For the exact method, the mixed models are solved for each marker separately. For the fast method, the mixed model is only solved for the genetic background model, without the markers in the model. The estimated variance-covariance matrix from this genetic background model is used to perform a generalized least squares scan for all the markers. The fast method is implemented only for bi-allelic markers, such as SNPs.
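For the kinship model, the fast scan can be written in the usual generalized least squares form, as sketched below. This is the textbook formulation rather than a statement of the procedure's internal algorithm. The symbols X (design matrix of the background fixed terms), m (score vector of the tested marker), γ (marker effect), Z, θ and σ²e (residual variance) are notation introduced here for illustration.

```latex
\begin{aligned}
y &= X\beta + m\gamma + g + e, \qquad
  g \sim N(0,\, 2K\sigma_G^2), \quad e \sim N(0,\, I\sigma_e^2) \\
\hat{V} &= 2K\hat{\sigma}_G^2 + I\hat{\sigma}_e^2
  \quad \text{(estimated once, from the model without } m\gamma\text{)} \\
\hat{\theta} &= \bigl(Z^\top \hat{V}^{-1} Z\bigr)^{-1} Z^\top \hat{V}^{-1} y,
  \qquad Z = [\,X \;\; m\,], \quad \theta = (\beta^\top, \gamma)^\top \\
W &= \frac{\hat{\gamma}^{\,2}}{\widehat{\operatorname{var}}(\hat{\gamma})},
  \qquad \widehat{\operatorname{var}}(\hat{\gamma}) =
  \Bigl[\bigl(Z^\top \hat{V}^{-1} Z\bigr)^{-1}\Bigr]_{\gamma\gamma}
\end{aligned}
```

Under this formulation only the last two lines are recomputed for each marker, which is why the fast method avoids refitting the variance components at every locus.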

The THRMETHOD option controls how the threshold for significance is defined. The default, THRMETHOD=neffective, first determines the effective number of columns (nC) in the marker matrix using the estimator given by Patterson et al. (2006), and then calculates the threshold as -log10(α/nC). The parameter α is the genome-wide type I error rate, which is defined by the ALPHA option (default 0.05). Alternatively, THRMETHOD=bonferroni calculates the number of tests assuming one independent test within each block of the size specified by the DISTANCE option. If DISTANCE is not set, the default is to take an independent test at every marker, which is very conservative in most cases. Finally, if THRMETHOD=given, a user-defined threshold value (on a -log10 scale) must be specified using the THRESHOLD option. With the other settings of THRMETHOD, THRESHOLD can be used to save the estimated threshold.
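The threshold arithmetic can be sketched in Python as below. The formula -log10(α/n) is taken from the description above. The way markers are grouped into blocks for the Bonferroni setting (fixed-width windows of size DISTANCE starting at position 0 on each chromosome) is an assumption of this sketch, as are the function names.

```python
import math


def threshold(alpha, n_tests):
    """Threshold on the -log10 scale for n_tests independent tests."""
    return -math.log10(alpha / n_tests)


def count_block_tests(chromosomes, positions, distance=None):
    """Count one test per occupied block; one test per marker if distance is None."""
    if distance is None:
        return len(positions)
    # Assumption: blocks are fixed windows [0, d), [d, 2d), ... on each chromosome.
    blocks = set()
    for chrom, pos in zip(chromosomes, positions):
        blocks.add((chrom, math.floor(pos / distance)))
    return len(blocks)


if __name__ == "__main__":
    chroms = [1, 1, 1, 1, 2, 2]
    pos = [0.0, 5.0, 12.0, 31.0, 3.0, 14.0]
    n_all = count_block_tests(chroms, pos)                 # every marker is a test
    n_blk = count_block_tests(chroms, pos, distance=15.0)  # one test per block
    print(n_all, threshold(0.05, n_all))
    print(n_blk, threshold(0.05, n_blk))
```

Fewer independent tests give a lower threshold, which is why testing at every marker is the most conservative choice.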

The PRINT option controls printed output, with settings:

    summary to print the list of markers with a significant association with the trait, and
    progress to monitor the progress of the analysis.

The default is PRINT=summary.

The PLOT option controls what graphs are produced, with settings:

    profile plots a genome wide profile of the -log10(P) of the test statistic,
    map plots a map with the location of the detected significant markers, highlighting whether or not the marker showed significant interaction with the environment, and
    qq makes a QQ plot of the -log10(P) values.

By default PLOT=profile,qq. The TITLE option can be used to provide a title for the graph, and the YTITLE and XTITLE options can supply titles for the y- and x-axis, respectively. The colours to use for the chromosomes in the upper graph are specified by the COLOURS option using either a text of colour names or a variate of RGB values (see the PEN directive for details). If COLOURS is not set, the default is to use the default colours of the pens 1, 2, onwards, up to the number of chromosomes. By default, the plot is sent to the screen. However, you can supply a file for the plot, using the DFILENAME parameter. You can discover the types of graphics file that are supported by running the command DHELP possible.

The Wald test statistics, their numbers of degrees of freedom and the associated probability values on a -log10 scale can be saved by the WALDSTATISTICS, NDF and MINLOG10P parameters, respectively. The LAMBDA parameter can save the inflation factor, estimated as the slope of the QQ plot of the -log10(P) values. The QSAVE parameter can save a pointer containing information and results for the significant markers. The elements of the pointer are labelled as follows to simplify their subsequent use:

    'procedure' stores the string 'QSASSOCIATION' to indicate the source of the results,
    'index' index numbers of the significant markers,
    'mkname' marker names,
    'chromosomes' chromosomes,
    'positions' positions,
    'minlog10p' probability values on a -log10 scale,
    'allele' label of the relevant allele,
    'frequency' allele frequencies,
    'effects' effects and
    'seeffects' standard errors of the effects.

These are all pointers, with an element for each chromosome. The elements of the chromosome pointers are variates for all components except the standard errors of differences, which are scalars.
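The inflation factor saved by LAMBDA is described above as the slope of the QQ plot of the -log10(P) values. A minimal Python sketch of one way to compute such a slope follows. The plotting positions (i - 0.5)/n for the expected probabilities and the use of a least squares line through the origin are assumptions; the procedure may use different conventions.

```python
import math


def inflation_factor(minlog10p):
    """Slope of observed against expected -log10(P), fitted through the origin."""
    n = len(minlog10p)
    observed = sorted(minlog10p)
    # Expected values if the P values were uniform on (0, 1).
    # Assumption: plotting positions (i - 0.5) / n.
    expected = sorted(-math.log10((i - 0.5) / n) for i in range(1, n + 1))
    sxy = sum(e * o for e, o in zip(expected, observed))
    sxx = sum(e * e for e in expected)
    return sxy / sxx


if __name__ == "__main__":
    n = 1000
    uniform = [-math.log10((i - 0.5) / n) for i in range(1, n + 1)]
    print(inflation_factor(uniform))                     # close to 1: no inflation
    print(inflation_factor([1.2 * v for v in uniform]))  # close to 1.2
```

A value close to 1 suggests that the relationship model is controlling for relatedness adequately, whereas values well above 1 point to an excess of small P values across the genome.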

Options: PRINT, PLOT, RELATIONSHIPMODEL, SCORES, METHOD, ALPHA, THRMETHOD, THRESHOLD, DISTANCE, MINORALLELE, KMATRIX, KMETHOD, SUBPOPULATIONS, MODELPART, SCALING, STANDARDIZE, COLOURS, TITLE, YTITLE, XTITLE.

Parameters: TRAIT, GENOTYPES, MKSCORES, CHROMOSOMES, POSITIONS, MKNAMES, IDMGENOTYPES, GENFILENAME, MAPFILENAME, WALDSTATISTICS, NDF, MINLOG10P, LAMBDA, QSAVE, DFILENAME.

Method

QSASSOCIATION performs a mixed model marker-trait association analysis, or LD mapping. It takes account of the heterogeneous genetic relatedness between individuals in the population (sometimes referred to as “population structure”) using one of three possible models, specified by the RELATIONSHIPMODEL option, as defined below. The model for marker-trait association may include the following terms: an intercept μ, the effects associated with k principal components PCscore_ki (fixed or random), the effects of genotype groups Group_k (fixed or random) and the effects of the tested markers MK (fixed).

The RELATIONSHIPMODEL option specifies which of the three possible models to use for the relatedness, and the MODELPART option controls whether these terms are treated as fixed or random.

Model            Fixed   Fixed or random     Fixed   Random
Eigenanalysis    μ       + Σi PCscore_ki     + MK    + G_i
Kinship          μ                           + MK    + G_i   with G ~ N(0, 2Kσ²_G)
Subpopulations   μ       + Group_k           + MK    + G_i
Null             μ                           + MK    + G_i

A Wald test is then used for each marker, individually, to test the null hypothesis that its effect is zero. The most frequent allele is set as the reference level. Marker allele frequencies, effects and standard errors are stored.
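For a bi-allelic marker the Wald test has one degree of freedom, and the conversion to the -log10 scale can be sketched in Python as below. The sketch assumes that the Wald statistic is referred to a chi-square distribution; the source does not state which reference distribution the procedure uses, and the function name is hypothetical.

```python
import math


def minlog10p_1df(wald):
    """-log10 of the upper-tail chi-square probability for a 1-d.f. Wald statistic."""
    # For 1 d.f., P(chi-square > w) = P(|Z| > sqrt(w)) = erfc(sqrt(w / 2)).
    p = math.erfc(math.sqrt(wald / 2.0))
    if p == 0.0:
        return math.inf  # the probability underflowed; the marker is beyond any threshold
    return -math.log10(p)


if __name__ == "__main__":
    for w in (3.84, 10.0, 25.0):
        print(w, round(minlog10p_1df(w), 3))
```

The resulting values are the ones compared with the threshold: a marker is declared significant when its -log10(P) exceeds the value set or estimated through THRESHOLD.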

Action with RESTRICT

Restrictions are not allowed.

Reference

Patterson, N., Price, A.L., Reich, D. (2006). Population structure and eigenanalysis. PLoS Genetics, 2, e190. doi:10.1371/journal.pgen.0020190

See also

Procedures: QEIGENANALYSIS, QKINSHIPMATRIX, QLDDECAY, QMASSOCIATION, QREPORT.

Commands for: Statistical genetics and QTL estimation.

Example

CAPTION       'QSASSOCIATION example 1: a data set with bi-allelic markers';\
              STYLE=meta
QIMPORT       [POPULATION=amp] '%GENDIR%/Examples/QAssociation_geno.txt';\ 
              MAPFILE='%GENDIR%/Examples/QAssociation_map.txt'; MKSCORES=mk;\
              CHROMOSOMES=mkchr; POSITIONS=mkpos; MKNAMES=mknames;\  
              IDMGENOTYPES=geno_id
IMPORT        [PRINT=*] '%GENDIR%/Examples/QAssociation_pheno.csv'; ISAVE=vars
" The relationship model is kinship, with QKINSHIPMATRIX used to
  estimate the K matrix. The threshold is given as 2.5."
QSASSOCIATION [PRINT=summary,progress; PLOT=profile,map,QQ;\ 
              RELATIONSHIPMODEL=kinship; METHOD=fast;\ 
              THRMETHOD=given; THRESHOLD=2.5; DISTANCE=15; MINORALLELE=0.1;\ 
              KMATRIX=K; KMETHOD=dice] yield; GENOTYPE=genotypes; MKSCORES=mk;\
              CHROMOSOMES=mkchr; POSITIONS=mkpos; IDMGENOTYPES=geno_id;\
              MKNAMES=mknames; MINLOG10P=PvalK; LAMBDA=inf_factorK;\ 
              QSAVE=outputKFast
PRINT         outputKFast
PRINT         outputKFast[3...6]; FIELD=18
PRINT         outputKFast[3],outputKFast[7,8][1,2],outputKFast[9,10]; FIELD=10

CAPTION       'QSASSOCIATION example 2: a data set with multi-allelic markers';\
              STYLE=meta
DELETE        [REDEFINE=yes] mk,mkchr,mkpos,mknames,geno_id,vars,K
QIMPORT       [POPULATION=amp] '%GENDIR%/Examples/LD_example_geno.txt';\ 
              MAPFILE='%GENDIR%/Examples/LD_example_map.txt'; MKSCORES=mk;\
              CHROMOSOMES=mkchr; POSITIONS=mkpos; MKNAMES=mknames;\ 
              IDMGENOTYPES=geno_id
IMPORT        [PRINT=*] '%GENDIR%/Examples/LD_example_pheno.csv'; ISAVE=vars
" The relationship model is eigenanalysis, with QEIGENANALYSIS used to
  calculate the principal component scores. The threshold is calculated
  and saved in scalar Thr2."
QSASSOCIATION [PRINT=summary,progress; PLOT=profile,map,QQ; MINORALLELE=0.05;\
              ALPHA=0.05; THRMETHOD=neff; THRESHOLD=Thr2; METHOD=exact;\  
              RELATIONSHIPMODEL=eigenanalysis; SCORES=PCscores; SCALING=none;\
               STANDARDIZE=frequency] gy_th; GENOTYPE=geno; MKSCORES=mk;\ 
              CHROMOSOMES=mkchr; POSITIONS=mkpos; MKNAMES=mknames;\ 
              IDMGENOTYPES=geno_id; MINLOG10P=PvalE; LAMBDA=IF_E; QSAVE=outputE
PRINT         outputE
PRINT         outputE[3...6] ; FIELD=18
PRINT         outputE[7],outputE[8,9][1,2],outputE[10,11]; FIELD=10

CAPTION       !t('QSASSOCIATION Example 3: a large data set (10000 markers)',\
              'with bi-allelic markers.'); STYLE=meta
DELETE        [REDEFINE=yes] mk,mkchr,mkpos,mknames,geno_id,vars,PCscores
GET           [WORKINGDIRECTORY=wdir]
%CD           '%GENDIR%/Examples/'
IMPORT        [PRINT=*] 'data10000_pheno.csv'
SPLOAD        [PRINT=*] 'data10000_Kmat.gsh'
QSASSOCIATION [PRINT=summary,progress; PLOT=profile,QQ; METHOD=fast;\ 
              MINORALLELE=0.05; ALPHA=0.05; THRMETHOD=bonferroni;\
              DISTANCE=1; THRESHOLD=Thr3; RELATIONSHIPMODEL=kinship;\
              KMATRIX=Kmat] y001; GENOTYPE=genotype;\
              GENFILENAME='data10000_geno.csv';\
              MAPFILENAME='data10000_map.csv';\
              MINLOG10P=Pval; LAMBDA=IF; QSAVE=output
%CD           wdir
Updated on June 19, 2019
