Produces genomic predictions (breeding values) of tested and untested individuals using phenotypic information of the tested set and the whole population genetic relationships, as inferred from molecular marker information (M. Malosetti, M.P. Boer & S.J. Welham).
Options
PRINT = string token | What to print (summary); default summ |
---|---|
PLOT = string token | What to plot (scatterplot, pco); default scat, pco |
MODELTYPE = string token | Model to use to obtain the predictions (gblup, gaussian, exponential); default gblu |
THETA = variate | Values to use for the tuning parameter θ when the model is Gaussian or exponential |
SIMILARITY = symmetric matrix | Similarity matrix between individuals of the whole population |
Parameters
TRAIT = variates | Quantitative trait to be analysed; must be set |
---|---|
GENOTYPES = factors | Genotype factor; must be set |
MKSCORES = pointers | Marker scores |
IDMGENOTYPES = texts | Labels of the tested and untested genotypes |
PREDICTIONS = variates | Saves the predictions |
NEWGENOTYPES = factors | Factor to index the predictions |
TESTED = factors | Factor that classifies NEWGENOTYPES as part of the tested or the untested set |
SAVE = pointers | Pointer to REML save structures to save details of the analyses |
Description
In genomic prediction (or genomic selection as introduced by Meuwissen et al. 2001), molecular markers of individuals of a population are used in combination with phenotypic information of a subset of that population (tested set) to obtain predictions (breeding values) of all the individuals of the population (i.e. both tested and untested).
GPREDICTION can be used to obtain predictions by one of three different mixed models, according to the setting of the MODELTYPE option. These differ in the way in which the genetic variance-covariance matrix is defined. The default setting, gblup, uses a realised additive relationship matrix calculated from markers, which is equivalent to the inclusion of all the markers as random explanatory variables in the model (with a common variance component). Alternatively, with the gaussian setting, a Gaussian kernel is used to model the genetic variance-covariance, which effectively accounts for non-additive relationships (Gianola & van Kaam 2008, Piepho 2009). Finally, with the exponential setting, an exponential kernel is used. For the Gaussian and exponential models, an extra (tuning) parameter θ is required, which determines how covariance between individuals decays in relation to distance in the genetic space. Values for θ can be supplied, in a variate, using the THETA option. If this is unset, the value suggested by Crossa et al. (2010) is used (see the Method section). The SIMILARITY option can be used either to provide a similarity matrix, or to store the one that is calculated using the markers.
The TRAIT parameter must supply the observations (phenotypes) of the tested genotypes, and the GENOTYPES parameter must supply a factor to identify individuals within the tested set. The MKSCORES parameter supplies the marker scores of all the individuals in the population (tested and untested), and the IDMGENOTYPES parameter provides labels for all the genotypes in the population (tested and untested). MKSCORES must be set unless a relationship matrix has been supplied by the SIMILARITY option. The PREDICTIONS parameter can save the predictions, the NEWGENOTYPES parameter can save a factor identifying each individual in the population, and the TESTED parameter can save a factor classifying individuals as being part of the tested or untested set.
You can set PRINT=summary to print a summary of the analysis. The SAVE parameter can save a pointer containing save structures from the REML analyses that have been done.
The PLOT option controls the graphs that are produced, with settings:
scatterplot | for a scatter plot of predictions versus observed values of the tested set, and |
---|---|
pco | for a plot showing the first three axes of a principal coordinates analysis of the genetic similarities estimated from markers, to enable you to assess the coverage of the genetic space of the population given by the training set |
Options: PRINT, PLOT, MODELTYPE, THETA, SIMILARITY.
Parameters: TRAIT, GENOTYPES, MKSCORES, IDMGENOTYPES, PREDICTIONS, NEWGENOTYPES, TESTED, SAVE.
Method
The prediction model is:
y = X β + Z u + ε
with u a vector of random genetic effects,
u ~ N(0, A σ_{u}^{2}),
and residuals ε with
ε ~ N(0, I σ^{2}).
The relationship matrix A is obtained from molecular marker information and is formed, depending on the model, as:
Model | Relationship matrix | Definitions |
---|---|---|
GBLUP | A = Z Z′ | Z is the genotype by markers matrix |
Gaussian | A = exp(-D^{2} / θ) | D^{2} is the squared Euclidean distance between individuals based on markers, and θ is a tuning parameter |
Exponential | A = exp(-D / θ) | D is the Euclidean distance between individuals based on markers, and θ is a tuning parameter |
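The three definitions in the table can be sketched in a few lines of Python. This is only an illustration of the formulas, not the procedure's own code: the function names, the -1/1 marker coding and the tiny score matrix are hypothetical, and whether GPREDICTION centres or scales the marker scores before forming A is not stated here.

```python
import math

def squared_distances(scores):
    """Squared Euclidean distance between every pair of rows (individuals)."""
    n = len(scores)
    return [[sum((a - b) ** 2 for a, b in zip(scores[i], scores[j]))
             for j in range(n)] for i in range(n)]

def relationship_matrix(scores, model="gblup", theta=1.0):
    """Form A from a genotype-by-markers score matrix, following the table."""
    n = len(scores)
    if model == "gblup":
        # A = Z Z': inner products of the marker-score rows
        return [[sum(a * b for a, b in zip(scores[i], scores[j]))
                 for j in range(n)] for i in range(n)]
    d2 = squared_distances(scores)
    if model == "gaussian":
        # A = exp(-D^2 / theta), applied element by element
        return [[math.exp(-d2[i][j] / theta) for j in range(n)]
                for i in range(n)]
    if model == "exponential":
        # A = exp(-D / theta), with D the (unsquared) Euclidean distance
        return [[math.exp(-math.sqrt(d2[i][j]) / theta) for j in range(n)]
                for i in range(n)]
    raise ValueError(f"unknown model: {model}")

# Hypothetical scores: 3 individuals by 4 bi-allelic markers, coded -1/1
scores = [[1, 1, -1, -1],
          [1, -1, -1, 1],
          [-1, -1, 1, 1]]
A_gblup = relationship_matrix(scores, "gblup")
A_gauss = relationship_matrix(scores, "gaussian", theta=4.0)
A_expo = relationship_matrix(scores, "exponential", theta=4.0)
```

With both kernels the diagonal of A is 1 and the off-diagonal elements shrink towards 0 as individuals become more distant; a smaller θ makes that decay faster.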
Before fitting the mixed model, the matrix A is checked to ensure that it is positive semi-definite. If it is not, procedure POSSEMIDEFINITE is called to produce a positive semi-definite approximation to be used instead. If more than one value is supplied for θ, a mixed model is fitted for each value, and the Akaike Information Criterion is used to select the best one. If no value is given for θ, then
θ = median(D^{2}) / 2
is used, as suggested by Crossa et al. (2010).
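As a rough sketch of the two ways θ can be settled, the Python below computes the default from a squared-distance matrix and picks the smallest-AIC candidate from a set of fitted models. The function names are hypothetical, taking the median over the distinct pairs of individuals is an assumption (the text does not say which elements of D^{2} enter the median), and the AIC values are invented purely to show the selection step.

```python
import statistics

def default_theta(d2):
    """theta = median(D^2) / 2, over distinct pairs i < j (an assumption)."""
    n = len(d2)
    pairs = [d2[i][j] for i in range(n) for j in range(i + 1, n)]
    return statistics.median(pairs) / 2

def best_theta(aic_by_theta):
    """Return the theta whose fitted model has the smallest AIC."""
    return min(aic_by_theta, key=aic_by_theta.get)

# Hypothetical squared distances between three individuals
d2 = [[0, 8, 16],
      [8, 0, 8],
      [16, 8, 0]]
theta0 = default_theta(d2)  # median of 8, 16, 8 is 8, so theta0 is 4

# Invented AIC values, one per candidate theta, for illustration only
aic = {0.05: 412.3, 0.10: 409.8, 0.15: 411.1}
chosen = best_theta(aic)    # 0.10 has the smallest AIC here
```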
After fitting the mixed model, predictions are formed using the VPREDICT directive.
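For orientation, the standard best linear unbiased predictor under the model above can be written as follows. This is the textbook expression, given on the assumption that the variance components are replaced by their REML estimates; it is not a description of how VPREDICT computes its results. Here Z is the design matrix of the prediction model (not the genotype by markers matrix of the GBLUP row), and u covers the whole population, so rows of Z relate the tested individuals to their elements of u.

```latex
\begin{aligned}
V &= \sigma_u^{2}\, Z A Z' + \sigma^{2} I, \\
\hat{\beta} &= (X' V^{-1} X)^{-1} X' V^{-1} y, \\
\hat{u} &= \sigma_u^{2}\, A Z' V^{-1} (y - X \hat{\beta}).
\end{aligned}
```

Untested individuals have no column of Z with non-zero entries, so their elements of û arise entirely through their relationships in A with the tested individuals.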
Action with RESTRICT
Restrictions are not allowed.
References
Crossa, J., De Los Campos, G., Pérez, P., Gianola, D., Burgueño, J., Araus, J.L., Makumbi, D., Singh, R.P., Dreisigacker, S., Yan, J., Arief, V., Banziger, M. & Braun, H.J. (2010). Prediction of genetic values of quantitative traits in plant breeding using pedigree and molecular markers. Genetics, 186, 713-724.
Gianola, D. & van Kaam, J.B.C.H.M. (2008). Reproducing kernel Hilbert spaces regression methods for genomic assisted prediction of quantitative traits. Genetics, 178, 2289-2303.
Meuwissen, T.H.E., Hayes, B.J. & Goddard, M.E. (2001). Prediction of total genetic value using genome-wide dense marker maps. Genetics, 157, 1819-1829.
Piepho, H.P. (2009). Ridge regression and extensions for genome wide selection in maize. Crop Science, 49, 1165-1176.
See also
Directives: REML, VPREDICT, PCO.
Commands for: Statistical genetics and QTL estimation.
Example
CAPTION 'GPREDICTION example','A data set with bi-allelic markers';\
  STYLE=meta,plain
QIMPORT [POPULATION=amp]\
  '%GENDIR%/Examples/dataCrossa_et_al2010_geno.txt';\
  MAPFILE='%GENDIR%/Examples/dataCrossa_et_al2010_map.txt';\
  MKSCORES=mk; CHROMOSOMES=mkchr; POSITIONS=mkpos; MKNAMES=mknames;\
  IDMGENOTYPES=geno_id
IMPORT [PRINT=*]\
  '%GENDIR%/Examples/DataCrossa_et_al2010_Phenotypes.csv';\
  ISAVE=vars
" Model: GBLUP, relationship matrix is calculated and saved."
GPREDICTION [MODEL=gblup; PLOT=scatterplot,pco; SIMILARITY=Kmat] TRAIT=yld;\
  GENOTYPES=Geno; MKSCORES=mk; IDMGENOTYPES=geno_id;\
  PREDICTIONS=p_GBLUP; NEWGENOTYPES=Genopred; TESTED=set; SAVE=res
" Model: Gaussian kernel, with range of values for theta and
  relationship matrix estimated from markers."
VARIATE [VALUES=0.05,0.1...0.5] theta
GPREDICTION [MODEL=gauss; PLOT=scatterplot,pco; SIMILARITY=KmatG; THETA=theta]\
  TRAIT=yld; GENOTYPES=Geno; MKSCORES=mk; IDMGENOTYPES=geno_id;\
  PREDICTIONS=p_Gauss; NEWGENOTYPES=Genopred2; TESTED=set2; SAVE=res2
" Model: EXPONENTIAL kernel, with range of values for theta and
  relationship matrix estimated from markers."
VARIATE [VALUES=0.05,0.1...0.5] theta
GPREDICTION [MODEL=exp; PLOT=scatterplot,pco; SIMILARITY=KmatExp; THETA=theta]\
  TRAIT=yld; GENOTYPES=Geno; MKSCORES=mk; IDMGENOTYPES=geno_id;\
  PREDICTIONS=p_Exp; NEWGENOTYPES=Genopred3; TESTED=set3; SAVE=res3