GPREDICTION procedure

Produces genomic predictions (breeding values) of tested and untested individuals using phenotypic information of the tested set and the whole population genetic relationships, as inferred from molecular marker information (M. Malosetti, M.P. Boer & S.J. Welham).

Options

`PRINT` = string token	What to print (`summary`); default `summ`
`PLOT` = string token	What to plot (`scatterplot`, `pco`); default `scat`, `pco`
`MODELTYPE` = string token	Model to use to obtain the predictions (`gblup`, `gaussian`, `exponential`); default `gblu`
`THETA` = variate	Values to use for the tuning parameter θ when the model is Gaussian or exponential
`SIMILARITY` = symmetric matrix	Similarity matrix between individuals of the whole population

Parameters

`TRAIT` = variates	Quantitative trait to be analysed; must be set
`GENOTYPES` = factors	Genotype factor; must be set
`MKSCORES` = pointers	Marker scores
`IDMGENOTYPES` = texts	Labels of the tested and untested genotypes
`PREDICTIONS` = variates	Saves the predictions
`NEWGENOTYPES` = factors	Factor to index the predictions
`TESTED` = factors	Factor that classifies `NEWGENOTYPES` as part of the tested or the untested set
`SAVE` = pointers	Pointer to `REML` save structures to save details of the analyses

Description

In genomic prediction (or genomic selection as introduced by Meuwissen et al. 2001), molecular markers of individuals of a population are used in combination with phenotypic information of a subset of that population (tested set) to obtain predictions (breeding values) of all the individuals of the population (i.e. both tested and untested).

GPREDICTION can be used to obtain predictions by one of three different mixed models, according to the setting of the MODELTYPE option. These differ in the way the genetic variance-covariance matrix is defined. The default setting, gblup, uses a realised additive relationship matrix calculated from markers, which is equivalent to including all the markers as random explanatory variables in the model (with a common variance component). Alternatively, with the gaussian setting, a Gaussian kernel is used to model the genetic variance-covariance, which effectively accounts for non-additive relationships (Gianola & van Kaam 2008, Piepho 2009). Finally, with the exponential setting, an exponential kernel is used. For the Gaussian and exponential models, an extra (tuning) parameter θ is required, which determines how the covariance between individuals decays with distance in the genetic space. Values for θ can be supplied, in a variate, using the THETA option. If this is unset, the value suggested by Crossa et al. (2010) is used (see the Method section). The SIMILARITY option can be used either to provide a similarity matrix, or to store the one that is calculated using the markers.

The TRAIT parameter must supply the observations (phenotypes) of the tested genotypes, and the GENOTYPES parameter must supply a factor to identify individuals within the tested set. The MKSCORES parameter supplies the marker scores of all the individuals in the population (tested and untested), and the IDMGENOTYPES parameter provides labels for all the genotypes in the population (tested and untested). MKSCORES must be set unless a relationship matrix has been supplied by the SIMILARITY option. The PREDICTIONS parameter can save the predictions, the NEWGENOTYPES parameter can save a factor identifying each individual in the population, and the TESTED parameter can save a factor classifying individuals as being part of the tested or untested set.

You can set PRINT=summary to print a summary of the analysis. The SAVE parameter can save a pointer containing save structures from REML analyses that have been done.

The PLOT option controls the graphs that are produced, with settings:

`scatterplot`	for a scatter plot of predictions versus observed values of the tested set, and
`pco`	for a plot showing the first three axes of a principal coordinates analysis of the genetic similarities estimated from markers, to enable you to assess the coverage of the genetic space of the population given by the training set

Options: PRINT, PLOT, MODELTYPE, THETA, SIMILARITY.

Parameters: TRAIT, GENOTYPES, MKSCORES, IDMGENOTYPES, PREDICTIONS, NEWGENOTYPES, TESTED, SAVE.

Method

The prediction model is:

y = X β + Z u + ε

with u a vector of random genetic effects,

u ~ N(0, A σ_u²),

and residuals ε with

ε ~ N(0, I σ²).
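
To show how this model yields predictions for untested as well as tested individuals, the standard best linear unbiased predictor can be written out. This is textbook mixed-model theory rather than a description of the procedure's internal calculations. It assumes that u holds one effect for every individual in the whole population, so that the columns of Z belonging to untested individuals are zero, and that the variance components are known (in practice they are replaced by their REML estimates).

```latex
\begin{aligned}
V &= \sigma_u^2 \, Z A Z' + \sigma^2 I, \\
\hat{\beta} &= (X' V^{-1} X)^{-1} X' V^{-1} y, \\
\hat{u} &= \sigma_u^2 \, A Z' V^{-1} (y - X \hat{\beta}).
\end{aligned}
```

Under these assumptions an untested individual obtains a non-zero prediction only through its entries in A, i.e. through its marker-based relationships with the tested individuals.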

The relationship matrix A is obtained from molecular marker information and is formed, depending on the model, as:

Model	Relationship matrix	Definitions
GBLUP	A = Z Z′	Z is here the genotype by markers matrix (not the design matrix Z of the prediction model above)
Gaussian	A = exp(-D² / θ)	D² is the squared Euclidean distance between individuals based on markers, and θ is a tuning parameter
Exponential	A = exp(-D / θ)	D is the Euclidean distance between individuals based on markers, and θ is a tuning parameter
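
The three definitions in the table can be illustrated numerically. The sketch below is written in Python rather than Genstat, purely for illustration, and is not the procedure's implementation. The marker scores, the -1/0/1 coding, the value of θ and all function names are hypothetical. No centring or scaling of the markers is applied, since the text does not say whether the procedure does any.

```python
import math

def squared_distances(markers):
    """Squared Euclidean distance D^2 between every pair of individuals (rows)."""
    n = len(markers)
    return [[sum((a - b) ** 2 for a, b in zip(markers[i], markers[j]))
             for j in range(n)] for i in range(n)]

def gblup_matrix(markers):
    """GBLUP: A = Z Z', the cross-products of the marker rows."""
    n = len(markers)
    return [[sum(a * b for a, b in zip(markers[i], markers[j]))
             for j in range(n)] for i in range(n)]

def gaussian_matrix(markers, theta):
    """Gaussian kernel: A = exp(-D^2 / theta), applied element by element."""
    return [[math.exp(-v / theta) for v in row]
            for row in squared_distances(markers)]

def exponential_matrix(markers, theta):
    """Exponential kernel: A = exp(-D / theta), with D = sqrt(D^2)."""
    return [[math.exp(-math.sqrt(v) / theta) for v in row]
            for row in squared_distances(markers)]

# Hypothetical scores: 3 individuals by 4 bi-allelic markers coded -1/0/1
markers = [[1, 0, -1, 1],
           [1, 1, -1, 0],
           [-1, 0, 1, 1]]

A_gblup = gblup_matrix(markers)
A_gauss = gaussian_matrix(markers, theta=2.0)
A_expo = exponential_matrix(markers, theta=2.0)
```

Both kernels give a value of 1 on the diagonal (distance zero) and values that shrink towards 0 as individuals become more distant; a larger θ makes that decay slower.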

Before fitting the mixed model, the matrix A is checked to ensure that it is positive semi-definite. If it is not, procedure POSSEMIDEFINITE is called to produce a positive semi-definite approximation to be used instead. If more than one value is set for θ, a mixed model is fitted for each value, and the Akaike Information Criterion is used to select the best one. If no value is given for θ, then

θ = median(D²) / 2

is used, as suggested by Crossa et al. (2010).
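
As a small worked illustration of this default, the Python sketch below (not Genstat code, and not the procedure's implementation) computes median(D²) / 2 from a squared-distance matrix. It assumes that the median is taken over the distinct pairs of individuals, excluding the zero diagonal; the text does not state which elements are included. The distance values and the function name are hypothetical.

```python
import statistics

def default_theta(d2):
    """Return median(D^2) / 2, using only distinct pairs i < j (an assumption)."""
    n = len(d2)
    pairs = [d2[i][j] for i in range(n) for j in range(i + 1, n)]
    return statistics.median(pairs) / 2

# Hypothetical squared Euclidean distances between 3 individuals
d2 = [[0, 2, 8],
      [2, 0, 10],
      [8, 10, 0]]

theta = default_theta(d2)  # median of (2, 8, 10) is 8, so theta = 4.0
```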

After fitting the mixed model, predictions are formed using the VPREDICT directive.

Action with `RESTRICT`

Restrictions are not allowed.

References

Crossa, J., De Los Campos, G., Pérez, P., Gianola, D., Burgueño, J., Araus, J.L., Makumbi, D., Singh, R.P., Dreisigacker, S., Yan, J., Arief, V., Banziger, M. & Braun, H.J. (2010). Prediction of genetic values of quantitative traits in plant breeding using pedigree and molecular markers. Genetics, 186, 713-724.

Gianola, D. & van Kaam, J.B.C.H.M. (2008). Reproducing kernel Hilbert spaces regression methods for genomic assisted prediction of quantitative traits. Genetics, 178, 2289-2303.

Meuwissen, T.H.E., Hayes, B.J. & Goddard, M.E. (2001). Prediction of total genetic value using genome-wide dense marker maps. Genetics, 157, 1819-1829.

Piepho, H.P. (2009). Ridge regression and extensions for genome wide selection in maize. Crop Science, 49, 1165-1176.

Example

CAPTION     'GPREDICTION example','A data set with bi-allelic markers';\ 
            STYLE=meta,plain
QIMPORT     [POPULATION=amp]\
            '%GENDIR%/Examples/dataCrossa_et_al2010_geno.txt';\
            MAPFILE='%GENDIR%/Examples/dataCrossa_et_al2010_map.txt';\ 
            MKSCORES=mk; CHROMOSOMES=mkchr; POSITIONS=mkpos; MKNAMES=mknames;\
            IDMGENOTYPES=geno_id
IMPORT      [PRINT=*]\
            '%GENDIR%/Examples/DataCrossa_et_al2010_Phenotypes.csv';\
            ISAVE=vars
" Model: GBLUP, relationship matrix is calculated and saved."
GPREDICTION [MODEL=gblup; PLOT=scatterplot,pco; SIMILARITY=Kmat] TRAIT=yld;\
            GENOTYPES=Geno; MKSCORES=mk; IDMGENOTYPES=geno_id;\
            PREDICTIONS=p_GBLUP; NEWGENOTYPES=Genopred; TESTED=set; SAVE=res
" Model: Gaussian kernel, with range of values for theta
  and relationship matrix estimated from markers."
VARIATE     [VALUES=0.25,0.3...0.5] theta
GPREDICTION [MODEL=gaussian; PLOT=scatterplot,pco; SIMILARITY=KmatGauss; THETA=theta]\
            TRAIT=yld; GENOTYPES=Geno; MKSCORES=mk; IDMGENOTYPES=geno_id;\
            PREDICTIONS=p_Gauss; NEWGENOTYPES=Genopred2; TESTED=set2; SAVE=res2
" Model: Exponential kernel, with range of values for theta
  and relationship matrix estimated from markers."
VARIATE     [VALUES=0.05,0.1...0.5] theta
GPREDICTION [MODEL=exp; PLOT=scatterplot,pco; SIMILARITY=KmatExp; THETA=theta]\
            TRAIT=yld; GENOTYPES=Geno; MKSCORES=mk; IDMGENOTYPES=geno_id;\
            PREDICTIONS=p_Exp; NEWGENOTYPES=Genopred3; TESTED=set3; SAVE=res3

Updated on March 29, 2022
