My Account: Log In | Join | Renew
Search
Author
Title
Vol.
Issue
Year
1st Page

Crop Science - Article

 

 

This article in CS

  1. Vol. 52 No. 2, p. 707-719
    OPEN ACCESS
     
    Received: June 3, 2011
    Published: Mar, 2012


    * Corresponding author(s): j.crossa@cgiar.org
 View
 Download
 Alerts
 Permissions
 Share

doi:10.2135/cropsci2011.06.0299

Genomic Prediction of Breeding Values when Modeling Genotype × Environment Interaction using Pedigree and Dense Molecular Markers

  1. Juan Burgueñoa,
  2. Gustavo de los Camposb,
  3. Kent Weigelc and
  4. José Crossa *a
  1. a Biometrics and Statistics Unit, International Maize and Wheat Improvement Center (CIMMYT), Apdo. Postal 6-641, 06600, México D.F., México
    b Dep. of Biostatistics, Univ. of Alabama at Birmingham, Ryals Public Health Bldg. 443, Birmingham, AL 35294
    c Dep. of Dairy Science, Univ. of Wisconsin, Madison, WI 53706

Abstract

Genomic selection (GS) has become an important aid in plant and animal breeding. Multienvironment (multitrait) models allow borrowing of information across environments (traits), which could enhance prediction accuracy. This study presents multienvironment (multitrait) models for GS and compares the predictive accuracy of these models with: (i) multienvironment analysis without pedigree and marker information, and (ii) multienvironment pedigree or/and marker-based models. A statistical framework for incorporating pedigree and molecular marker information in models for multienvironment data is described and applied to data that originate from wheat (Triticum aestivum L.) multienvironment trials. Two prediction problems relevant to plant breeders are considered: (CV1) predicting the performance of untested genotypes (“newly” developed lines), and (CV2) predicting the performance of genotypes that have been evaluated in some environments but not in others. Results confirmed the superiority of models using both marker and pedigree information over those based on pedigree information only. Models with pedigree and/or markers had better predictive accuracy than simple linear mixed models that do not include either of these two sources of information. We concluded that the evaluation of such trials can benefit greatly from using multienvironment GS models.


Abbreviations

    Bayesian LASSO, Bayesian Least Absolute Shrinkage and Selection Operator; CV, cross-validation; D, diagonal; FA, factor analytic model; GE, genotype × environment interaction; GS, genomic selection; I, identity matrix; M, molecular marker; P, pedigree; PM, pedigree + molecular marker; REML, restricted maximum likelihood; UN, unstructured

In plant breeding, multienvironment trials play a fundamental role for assessing the performance of genotypes across different environmental conditions, for studying genotype × environment interaction (GE) and genotype stability, and for predicting the performance of untested genotypes. Genotype × environment interaction can be incorporated into additive infinitesimal models (e.g., Fisher, 1918; Wright, 1921) by means of genetic and environmental covariances (e.g., Piepho, 1997, 1998, 2009; Smith et al., 2001; Crossa et al., 2004, 2006; Oakey et al., 2006; Burgueño et al., 2007, 2008, 2011). Such an approach allows deriving predictions of genetic values that borrow information across individuals through genetic relationships, and within individuals (across environments) through genetic and environmental covariances.

In pedigree-based additive infinitesimal models (e.g., Fisher, 1918; Wright, 1921; Henderson, 1975), genetic relationships are commonly incorporated by the use of the numerator relationship matrix A = {a(i, i′)} . This matrix, whose entries equal twice the kinship coefficient between individuals i and i′, describes a priori, the expected correlations between the genetic values of individuals i and i′, given the information conveyed by the pedigree and under the assumption of an additive infinitesimal model. Expected and realized degrees of genetic similarity differ due to factors such as common ancestry not accounted for by the pedigree data and, more importantly, Mendelian segregation. In additive models and in absence of inbreeding and assortative mating, Mendelian segregation can explain one-half of the additive genetic variance. Dense molecular markers can be used to estimate the realized genetic similarity between individuals (e.g., Ritland, 1996; Lynch and Ritland, 1999; VanRaden, 2007; Habier et al., 2007; de los Campos et al., 2010), and such information can be incorporated into multienvironment models in the same way that pedigree-derived genetic relationships are incorporated into those models.

The availability of thousands of genome-wide molecular markers has made it possible to use genomic selection (GS) for the prediction of genetic values (Meuwissen et al., 2001) in plants (Bernardo and Yu, 2007; de los Campos et al., 2009, 2010; Crossa et al., 2010, 2011; Pérez et al., 2010) and animals (Hayes and Goddard, 2010; de los Campos et al., 2009). Most GS applications use a single-environment model. However, this approach cannot exploit information across environments (or traits), which may limit the predictive power of models for GS.

The importance of exploiting across-environment information has been demonstrated in plant breeding. For example, by using data from multienvironment trials and mixed linear models, Burgueño et al. (2011) showed that, relative to single-environment mixed models, the modeling of GE with a factor analytic (FA) structure can increase predictive power by up to 6%. This study was performed in a context in which pedigree information was not available. Similar results were found by So and Edwards (2011), who compared predictability of the FA structure with the compound symmetric model. On the other hand, empirical evidence with plant breeding data shows that GS can outperform the predictive power of pedigree-based methods by a sizable amount (de los Campos et al., 2009, 2010; Crossa et al., 2010, 2011; Pérez et al., 2010), using single-environment models that did not exploit information across environments (or traits). This body of evidence, together with the relevance of GE in plant breeding, suggests that the development of multienvironment models for GS can boost the predictive power of genomic prediction and, hence, the rate of genetic gain that can be attained with GS.

In this article, we incorporate into a single model elements for modeling GE using FA along with genetic relationship information based on pedigrees or markers for the purpose of genomic prediction. To our knowledge, this is the first demonstration of the combination of complex model components for both genetic relationships and GE interaction. The main objective of this study was to develop multienvironment models for GS and to compare their predictive power with: (i) multienvironment analysis without pedigree and marker information, and (ii) multienvironment pedigree or/and marker-based models. Two different prediction problems that are relevant to plant breeders were considered: (i) predicting performance of newly developed genotypes (i.e., genotypes for which no phenotypic records were available); and (ii) prediction of performance of genotypes across environments, when some genotypes were evaluated in some environments but not in others. A statistical framework for incorporating molecular marker information in models for multienvironment data is described and applied in a plant breeding context using a multienvironment wheat (Triticum aestivum L.) trial.


MATERIALS AND METHODS

We begin this section by introducing a conceptual framework that allows modeling pedigree and molecular marker data using mixed models. Subsequently, we apply this framework to data from CIMMYT's Global Wheat Program.

A Multienvironment Mixed Model for Regression using Molecular Marker and Pedigree Data

In a multienvironment linear mixed model, the phenotypic outcomes of g individuals (i = 1, 2,…,g) evaluated in J environments (j = 1, 2,…,J) are expressed as follows:where yj is the vector of the response variable of phenotypic records (or means) collected in the jth environment; and Xj and Zj are incidence matrices vectors of systematic effects, βj, and random genetic effects, gj, in the jth environment (with elements gij for the ith genotype). In reduced matrix notation,where .

In a standard multienvironment linear mixed model, vectors containing the random effects entering on the right-hand side of Eq. [2] are assumed to follow a multivariate normal density, centered at zero, and with covariance structure Cov (g, g′) = G0A, Cov (ε, ε′) = IgR0, and Cov (g, ε′) = 0, where I is an identity matrix of size g, G0 is a J × J covariance matrix of the genetic effect of genotypes in environments, ⊗ denotes the Kronecker product of matrices, and A = {a(i, i′)} is a g × g numerator relationship matrix, whose entries are twice the kinship coefficients between pairs of lines. The Cov (g, g′) = G0A is usually represented aswhere the jth diagonal element of the J × J matrix G0 is the additive genetic variance within the jth environment, and the jjth element is the additive genetic covariance between sites j′ and j; thus, ρjj is the correlation of the additive genetic effects between sites j′ and j.

The R0 = {Cov (εij, εij)} is a J × J covariance matrix of (within-genotype, across-environments) model residuals that could contain different variance–covariance structures for groups of observations and for field experiments in plant breeding evaluation, where spatial correlation for modeling the plot-to-plot variability is important. Model [2] can be extended to accommodate heterogeneous residual variances by replacing Ig in IgR0 with a diagonal matrix, .

With these assumptions, and following properties of the multivariate normal distribution (MVN), the marginal density of the data is multivariate normalEstimation of model unknowns in the above model can be performed using Bayesian, Likelihood, or Restricted Maximum Likelihood (REML) methods. A standard approach is to first estimate covariance parameters using REML and then obtain estimates of fixed effects and predictions of genetic values by solving the mixed-model equations associated with Eq. [3] with codispersion parameters replaced by REML estimates. The solution to the mixed-model equations are the empirical Best Linear Unbiased Estimates (BLUE) of fixed effects and the empirical Best Linear Unbiased Prediction (BLUP) of random effects (e.g., Henderson, 1975).

In this study, we performed a two-stage analysis consisting of, first, computing the mean of the genotypes in each mega-environment and standardized the data; and second, fitting the different models with the standardized data. For cases where heterogeneity of within-site error variances is evident, a weighted analysis in the second stage for handling the differences in standard errors among genotype estimates within environments is recommended (Welham et al., 2010) for multienvironment trials whose purpose is variety selection. In GS, the aim is to achieve accuracy in the genomic estimate breeding value based on marker or/and pedigree.

It should also be noted that the residuals of Model [2] comprise the confounded effects of nonadditive genetic variance and the error residual; this nonadditive genetic variance represents, in self-pollinated species, the additive × additive epistasis that can be modeled using the relationship matrix A, as shown by Burgueño et al. (2007).

Pedigree and Marker-based Prediction

In the models described by [1–3], A = {a(i, i′)} represents a g × g matrix of additive relationships. Commonly, these are derived from a pedigree (hereinafter denoted as AP). Alternatively, this matrix can be derived from molecular marker information (hereinafter denoted as AM). Unlike standard pedigree-derived genetic relationships, marker-derived estimates of relationships can account for common ancestry not considered in the pedigree,and more importantly, it can account for departures of realized genetic relationships from their expected values given the pedigree due to Mendelian segregation.

There are multiple ways of mapping markers to estimates of genetic relationships (e.g., Ritland, 1996; Lynch and Ritland, 1999; Van Raden, 2007; Habier et al., 2007; de los Campos et al., 2010), and none is considered to be superior (Makowsky et al., 2011). Here, we consider AM = MM′, where M = {mil} is a matrix containing centered and standardized genotypes (i = 1,…,g) at different markers (l = 1,…,L), that is, , where mil counts the number of copies of the minor-frequency allele carried by the ith line at the lth locus, and pl is the estimated minor-frequency allele of the lth marker. In Appendix A we show the equivalence of using AM = MM′ and a certain class of regression with marker covariates.

Finally, pedigree and marker information can be combined in a single model by simply extending Eq. [2] to a model with two random effects, one of which, gpN(0,G0PAP), represents a regression on a pedigree, and the other, a regression on markers, gmN(0,G0MAM), so Eq. [2] and [3] becomeand

Modeling Covariance Structures

In the multienvironment model of Eq. [2–5], the within-line across-environment covariance matrices R0 and G0 are completely unstructured (UN). Each of the matrices involves parameters. Therefore, the number of codispersion parameters grows quadratically with the number of environments. Also, when J is large, and genetic and/or residual effects are highly correlated between environments, estimation of G0 or R0 may be (close to) singular, making the convergence of algorithms slow. This problem may be overcome, and insights about correlations patterns may be gained, by using structured covariance matrices. This is done by representing covariance matrices as functions of a parameter vector. Factor analysis (e.g., Smith et al., 2001; Crossa et al., 2006; de los Campos and Gianola, 2007; Burgueño et al., 2007, 2008, 2011), random regression, principal components (Kirkpatrick and Meyer, 2004; Meyer and Kirkpatrick, 2005), and recursive and simultaneous structural equation models (Gianola and Sorensen, 2004) are alternative ways of structuring covariance matrices. Below we describe a few covariance structures applied to R0 or G0.

The most restrictive and simple covariance structure is obtained by setting , where is a variance parameter that is common across environments. This model, hereinafter denoted as I, imposes independence and homoscedasticity of model residuals across environments. A slightly less restrictive covariance structure is , where (j = 1,…,J) are environment-specific variance parameters. Fitting the model described above with and is equivalent to fitting J single-environments models separately.

The FA model is commonly used for modeling covariance matrices in quantitative genetic models in plants (e.g., Smith et al., 2001; Crossa et al., 2004, 2006; Burgueño et al., 2007, 2008, 2011) and animals (de los Campos and Gianola, 2007). The FA model for Cov (g,g′) = G0A is (ΛΛ′ + Ψ) ⊗ A = FA(k) ⊗ A(for FA(k) (k ≤ s)), where Λ is an J × k matrix where the kth column contains the environmental loadings for the kth latent factor, and Ψ is an J × J diagonal matrix, with different nonnegative parameters on the diagonal. When only one factor, k = 1, is considered, the model is denoted as FA(1); for k = 2, FA(2) has two multiplicative components, and so on. Thus, FA can be interpreted as the linear regression of genotype and GE on environmental covariates (environmental loadings), with each genotype having a separate slope (genotypic scores) but a common intercept (if main effects of genotypes are not distinguished from GE). The slopes of genotypes measure the sensitivity of the genotypes to hypothetical environmental factors represented by the loadings of each environment (Smith et al., 2002).

Parameter identification requires imposing restrictions. Even after imposing these restrictions, loadings are identified up to an orthonormal rotation. Unique estimates can be obtained by either adopting a rotation (e.g., varimax) or by fixing some of the loadings. The number of loadings to fix for a k-common factor model, k > 1, is equal to . The number of parameters of a k-common factor model (J × k + J) must at most equal the number of unknowns in an unstructured model (J × [J + 1]/2); therefore, additional restrictions are required for identification, if k > (J − 1)/2.

Experimental Data

We evaluated models such as those described in Eq. [2–3] and [4–5] using data from CIMMYT's Global Wheat Program, which contains information on 599 wheat lines whose grain yield was evaluated in four environments (E1, low rainfall and irrigated; E2, high rainfall; E3, low rainfall and high temperature; and E4, low humidity and hot) (Braun et al., 1996) with two replicates in each environment. The combined analyses across four environments based on unstandardized data give a heritability for grain yield of 0.50 with a variance component for GE (0.3132) larger than the variance component for entry (0.0889). Individual environment analyses show some heterogeneity of residual errors ranging from 0.0335 for E1 to 0.1121 for E2. It should be pointed out that for pedigree and genomic prediction the data were environment standardized.

A pedigree tracing back many generations was available, and the Browse application of the International Crop Information System, as described in http://cropwiki.irri.org/icis/index.php/TDM_GMS_Browse (verified 20 Nov. 2011) (McLaren et al., 2005), was used for deriving the pedigree relationship matrix (AP). Wheat lines were genotyped using Diversity Array Technology (DArT) markers generated by Triticarte Pty. Ltd. (Canberra, Australia; http://www.triticarte.com.au [verified 20 Nov. 2011]); after editing there were 1279 DArTs available for analysis. This data set is publicly available with the BLR package of R (de los Campos and Pérez, 2010) and was analyzed by Crossa et al. (2010), who used single-trait models for genomic prediction. Further details about the data set and editing procedures applied are described in Crossa et al. (2010).

Models

The main objective of this research was to evaluate the impact of modeling covariance structures in the context of pedigree, marker, and pedigree + marker models. To this end, we defined a sequence of 27 models, fitted them to the entire data set, and evaluated predictive ability using cross-validation (see below). Table 1 describes the models used for analysis. Models 1–9 (P) are pedigree-based models; Models 10–18 (M) are marker-based models. These models are as described by Eq. [2] and [3] with A = AP and A = AM, respectively. Models 19–27 (PM) combine pedigree and marker information, as described by Eq. [4] and [5]. Therefore, the impact of incorporating marker information for prediction can be assessed by comparing models among these three groups.


View Full Table | Close Full ViewTable 1.

Description of the pedigree (P) and genomic (M) models used for the analysis.†

 
Model
Covariance structure
No. Code R0 G0P G0M
Pedigree
1 PI-I I I
2 PD-I I D
3 PFA-I I FA(2)
4 PI-D D I
5 PD-D D D
6 PFA-D D FA(2)
7 PI-UN UN I
8 PD-UN UN D
9 PFA-UN UN FA(2)
Marker
10 MI-I I I
11 MD-I I D
12 MFA-I I FA(2)
13 MI-D D I
14 MD-D D D
15 MFA-D D FA(2)
16 MI-UN UN I
17 MD-UN UN D
18 MFA-UN UN FA(2)
Pedigree + marker
19 PMI-I I I I
20 PMD-I I D D
21 PMFA-I I FA(2) FA(2)
22 PMI-D D I I
23 PMD-D D D D
24 PMFA-D D FA(2) FA(2)
25 PMI-UN UN I I
26 PMD-UN UN D D
27 PMFA-UN UN FA(2) FA(2)
I = Identity (imposes independence and homoscedasticity, one parameter); D = Diagonal (imposes independence but allow heterogeneous variances, four parameters); FA(2) = two-common factor model (imposes heterogeneous variances and correlation between environments, 10 parameters); UN = unstructured (imposes heterogeneous variances and correlation between environments, 10 parameters).

Within each of these groups (P, M, and PM), we evaluated different strategies for structuring the covariance matrices of model residuals and marker effects. Comparing models using different covariance structures can be used to assess the importance of modeling heterogeneous variances (e.g., in P models by comparing Model 1 vs. Model 5) and covariances across environments (e.g., in P models by comparing Model 5 vs. Model 9). As stated earlier, using R0 = D and G0. = D (diagonal covariance structures) is equivalent to fitting J single-environment models separately; therefore, the impact of modeling covariances can be assessed by comparing the performance of models using R0 = D and G0. = D with those that consider genetic or residual covariances.

In models using FA(2), we imposed two restrictions: (i) we set the loading of the first trait on the second factor equal to zero, and (ii) we set one of the specific variances equal to zero. This yielded a model with 10 unknowns per covariance matrix.

Model Validation

Models were compared based on their predictive ability calculated as the correlation between the predictive and observed phenotypic values; the models were fitted to the training set and their predicted values were correlated to the observed values in the validation set. Additionally, we report an estimate of the probability of one change in the ranks produced by predicted and by observed performance. Details about this statistic are given in Appendix B.

Two distinct cross-validation (CV) schemes were used for generating the training and validation sets. A k-fold cross-validation, which consists of randomly dividing the n observations into k nonoverlapping subsets, say, S1, S2,…,Sk was used. Cross-validation was then applied to each partition of the data, that is, k − 1 groups were taken as the training set and the remaining group as the validation set. In this paper, we used k = 10, a 10-fold cross-validation scheme.

The two cross-validation schemes (CV1 and CV2) mimic two possible real situations a breeder might face. The first CV scheme (CV1) was designed to evaluate the predictive ability of models when used for predicting the performance of genotypes that have not undergone field evaluation (i.e., newly developed lines). When analyzing data for the kth fold of CV1, there was a total of 60 genotypes that were missing in all four environments; they were treated as unknown (Table C1 and Appendix C). Thus, predictions derived using CV1 were entirely based on phenotypic records of other lines. In the CV1 scheme, each validation set has 60 genotypes (except Fold 10, which has 59 genotypes).

The other cross-validation scheme (CV2) mimics a situation where lines are evaluated in two environments but missing in another two environments. Here information from relatives is used, and the prediction assessment can benefit from borrowing information between lines within an environment, between lines across environments, and among correlated environments (Table C1 and Appendix C). In the CV2 scheme, for each nonoverlapping validation set, we selected about 240 cells from the 599 × 4 GE matrix to confirm the validation set with the restriction that one genotype is missing only in two random environments and, for the other two environments, the same genotype is set to be missing in another fold. Randomness of the procedure produces a nonperfect balance pattern of environments and genotypes. The number of missing data in the different validation sets in CV2 ranged from 214 (Fold 7) to 268 (Fold 3). Appendix C shows a typical fold for CV1 and CV2.

Software

Models were fitted using the ASReml (2010) package. The REML estimates of variance parameters were obtained using the Average Information algorithm as implemented in ASReml (2010). Subsequently, estimates of fixed effects and predictions of random effects were obtained by solving the corresponding mixed-model equations with (co)dispersion parameters replaced by REML estimates.


RESULTS AND DISCUSSION

Patterns of Genetic and Residual Covariability

The estimates of residual and genetic covariance components from PFA-UN (Model 9), MFA-UN (Model 18), and PMFA-UN (Model 27) in Table 2 show some patterns of heterogeneity of residuals and genetic variances in environments and varying trends of residual and genetic correlations across environments. Although the phenotypic data were standardized, results show that the estimates of residuals and genetic components are heterogeneous across environments for the three models due to differences in heritability. The extent of variance heterogeneity is expected to be even greater when data are not standardized.


View Full Table | Close Full ViewTable 2.

Estimates of residual and genetic covariance matrices (correlation in the upper diagonal), variances and covariances in the diagonal and lower diagonal components of each matrix, respectively, derived from a full-data analysis of models using different covariance structures for four environments (1–4).

 
R0
G0P
G0M
Model† Env.‡ 1 2 3 4 1 2 3 4 1 2 3 4
PFA-UN (Model 9) 1 0.558 0.064 −0.248 0.021 0.573 −0.139 −0.131 −0.316
2 0.035 0.556 0.390 0.250 −0.078 0.553 0.965 0.575
3 −0.131 0.205 0.498 0.163 −0.080 0.583 0.660 0.593
4 0.011 0.134 0.082 0.517 −0.185 0.331 0.373 0.598
MFA-UN (Model 18) 1 0.526 0.153 −0.195 0.116 2.998 −0.258 −0.2080 −0.479
2 0.084 0.573 0.484 0.259 −0.727 2.634 0.925 0.623
3 −0.112 0.292 0.634 0.189 −0.547 2.280 2.305 0.648
4 0.065 0.152 0.117 0.602 −1.290 1.572 1.529 2.414
PMFA-UN (Model 27) 1 0.433 0.146 −0.220 0.183 2.665 −0.306 −0.344 −0.613 0.179 0.139 −0.011 −0.146
2 0.068 0.499 0.390 0.225 −0.695 1.930 0.888 0.686 0.027 0.204 0.968 0.425
3 −0.099 0.189 0.469 0.114 −0.670 1.471 1.422 0.771 −0.003 0.268 0.377 0.441
4 0.082 0.108 0.053 0.462 −1.222 1.165 1.124 1.494 −0.036 0.112 0.159 0.343
Model 9: PFA-UN = factor analytic-pedigree for G0P and unstructured for R0; Model 18: MFA-UN = factor analytic-genomic for G0M and unstructured for R0; Model 27: PMFA-UN = factor analytic-pedigree for G0P, factor analytic-genomic for G0M and unstructured for R0.
Env. = Environment.

The sample phenotypic correlation matrix indicated that there are two groups of environments, E1, which has a low and negative correlation with E2–E4, and the E2–E4 group of environments with moderate to high positive correlations among them. The residual and genetic correlation among environments (Table 2) confirmed these patterns of association among environments (i.e., E1 vs. E2–E4) and also indicated that the correlations induced by both effects are much smaller in R0 than in G0. Furthermore, the genomic-derived models (G0M) produced higher association among environments than pedigree-derived models (G0P). Figure 1 displays the estimated loading (after varimax rotation) of each of the four environments on the first vs. the second common factor of G0 derived from models using pedigree- (PFA-UN) and marker-derived (MFA-UN) genetic relationship matrices. The two groups of environments, E1 and E2–E4, are clearly depicted.

Figure 1.

Estimated loadings (after varimax rotation) of each of the four environments on the first and second common factors of G0P obtained from models using a pedigree-based (PFA-UN) and marker-based (MFA-UN) additive relationship matrix. The xy coordinates are the estimated loadings after varimax rotation, and the numbers in the body of the plot are the environment labels (1–4). In both models, G0 was modeled using a two-common factor model, and the residual covariance matrix, R0, was completely unstructured.

 

The results of the pedigree and genomic patterns of covariance among environments indicated some evidence of heterogeneous genetic and residual variances, as well as clear evidence of genetic association among some environments. The presence of such correlations can be exploited using multienvironment models to derive predictions that borrow information not only across lines, but also across environments. In the next section, we evaluate the impact of modeling across-environment covariances on prediction accuracy.

The residuals modeled using I, D, and UN had two effects that are confounded, the genetic variances that are represented by the additive × additive epistasis effects and the error variance. It is possible to partially overcome this problem by partitioning the total genotypic effects into additive and additive × additive and their interactions with environments (Burgueño et al., 2007). Fitting the additive × additive will increase the complexity of the models and it was not the main objective of the article. The authors are currently investigating modeling additive × additive epistasis and additive × additive × environment using pedigree and genomic relationship matrices in the context of genetic prediction in GS.

Predictive Ability of Models with Pedigree and Genomic Information

The average correlations between predictions and phenotypes in CV1 and CV2 are presented in Table 3 (Models 1–18) and Table 4 (Models 19–27). Summaries of these results are given in Fig. 2 and 3. The CV-correlations ranged from 0.317 to 0.646. In this range, the relationship between correlation and the estimated probability of having a rank change is close to linear (see Appendix B).A correlation of 0.317 (0.646) corresponds to a 0.40 (0.28) probability of having one change in rank.


View Full Table | Close Full ViewTable 3

Mean correlations from 10-fold cross-validation between the predicted and the observed values of genotypes for all environments, for individual Environments 1, 2, 3, and 4, and for Environments 2, 3, 4 together for 18 different models (Models 1–18) for two different cross-validation schemes (CV1 and CV2), each with 10-fold. For CV1 and CV2, the best predicted model among Models 1–9 and among Models 10–18 are underlined.

 
Model no. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
Model code† PI-I PD-I PFA-I PI-D PD-D PFA-D PI-UN PD-UN PFA-UN MI-I MD-I MFA-I MI-D MD-D MFA-D MI-UN MD-UN MFA-UN
CV1
Environment 1 0.444 0.443 0.437 0.444 0.443 0.433 0.428 0.444 0.447 0.534 0.534 0.532 0.534 0.534 0.531 0.536 0.541 0.550
Environment 2 0.416 0.416 0.410 0.416 0.416 0.406 0.365 0.350 0.427 0.496 0.495 0.467 0.496 0.495 0.463 0.466 0.450 0.490
Environment 3 0.387 0.386 0.389 0.387 0.386 0.388 0.331 0.317 0.397 0.387 0.386 0.404 0.387 0.387 0.401 0.338 0.332 0.406
Environment 4 0.441 0.441 0.432 0.441 0.442 0.427 0.402 0.409 0.435 0.438 0.438 0.433 0.438 0.437 0.428 0.420 0.417 0.441
Environments E2–E4 0.414 0.413 0.409 0.414 0.413 0.405 0.365 0.355 0.418 0.439 0.438 0.432 0.439 0.439 0.428 0.408 0.397 0.445
Environments E1–E4 0.420 0.420 0.415 0.420 0.418 0.408 0.378 0.376 0.423 0.464 0.464 0.457 0.465 0.465 0.452 0.442 0.434 0.473
CV2
Environment 1 0.449 0.449 0.460 0.449 0.447 0.454 0.437 0.451 0.458 0.550 0.550 0.552 0.550 0.550 0.553 0.556 0.560 0.558
Environment 2 0.422 0.421 0.623 0.422 0.421 0.626 0.397 0.382 0.619 0.502 0.500 0.609 0.502 0.500 0.611 0.511 0.490 0.606
Environment 3 0.374 0.374 0.633 0.374 0.374 0.635 0.354 0.336 0.607 0.392 0.391 0.581 0.392 0.390 0.585 0.384 0.368 0.556
Environment 4 0.452 0.452 0.533 0.452 0.452 0.529 0.425 0.429 0.527 0.453 0.452 0.513 0.453 0.452 0.510 0.445 0.439 0.506
Environments E2–E4 0.414 0.413 0.596 0.413 0.410 0.596 0.393 0.381 0.584 0.454 0.453 0.571 0.455 0.453 0.571 0.452 0.435 0.559
Environments E1–E4 0.420 0.419 0.565 0.419 0.416 0.564 0.401 0.397 0.555 0.475 0.473 0.564 0.475 0.473 0.565 0.475 0.462 0.556
Model code: Model 1: PI-I = identity-pedigree for G0P and identity for R0; Model 2: PD-I = diagonal-pedigree for G0P and identity for R0; Model 3: PFA-I = factor analytic-pedigree for G0P and identity for R0; Model 4: PI-D = identity-pedigree for G0P and diagonal for R0; Model 5: PD-D = diagonal-pedigree for G0P and diagonal for R0; Model 6: PFA-D = factor analytic-pedigree for G0P and diagonal for R0; Model 7: PI-UN = identity-pedigree for G0P and unstructured for R0; Model 8: PD-UN = diagonal-pedigree for G0P and unstructured for R0; Model 9: PFA-UN = factor analytic-pedigree for G0P and unstructured for R0; Model 10: MI-I = identity-genomic for G0M and identity for R0; Model 11: MD-I = diagonal-genomic for G0M and identity for R0; Model 12: MFA-I = factor analytic-genomic for G0M and identity for R0; Model 13: MI-D = identity-genomic for G0M and diagonal for R0; Model 14; MD-D = diagonal-genomic for G0M and diagonal for R0; Model 15: MFA-D = factor analytic-genomic for G0M and diagonal for R0; Model 16: MI-UN = identity-genomic for G0M and unstructured for R0; Model 17: MD-UN = diagonal-genomic for G0M and unstructured for R0; Model 18: MFA-UN = factor analytic-genomic for G0M and unstructured for R0.

View Full Table | Close Full ViewTable 4.

Mean correlations from 10-fold cross-validation between the predicted and the observed values of genotypes for all environments, for individual Environments 1, 2, 3, and 4, and for Environments 2, 3, 4 together for nine different models including AP and AM simultaneously (Models 19–27) for two different cross-validation schemes (CV1 and CV2), each with 10-fold. For CV1 and CV2, the best predicted model among Models 19–27 are underlined.

 
Model no. 19 20 21 22 23 24 25 26 27
Model code† PMI-I PMD-I PMFA-I PMI-D PMD-D PMFA-D PMI-UN PMD-UN PMFA-UN
CV1
Environment 1 0.562 0.559 0.542 0.563 0.560 0.551 0.560 0.564 0.560
Environment 2 0.502 0.504 0.469 0.502 0.503 0.482 0.469 0.458 0.495
Environment 3 0.432 0.433 0.432 0.433 0.433 0.432 0.380 0.374 0.423
Environment 4 0.479 0.481 0.444 0.479 0.480 0.453 0.457 0.462 0.462
Environments E2–E4 0.471 0.474 0.445 0.471 0.472 0.454 0.437 0.431 0.459
Environments E1–E4 0.494 0.496 0.4722 0.494 0.495 0.480 0.469 0.464 0.486
CV2
Environment 1 0.572 0.570 0.580 0.573 0.570 0.580 0.574 0.577 0.574
Environment 2 0.507 0.505 0.646 0.507 0.502 0.646 0.512 0.497 0.642
Environment 3 0.428 0.424 0.631 0.427 0.424 0.632 0.414 0.400 0.604
Environment 4 0.495 0.496 0.548 0.495 0.496 0.545 0.481 0.487 0.544
Environments E2–E4 0.480 0.479 0.610 0.480 0.477 0.609 0.475 0.465 0.599
Environments E1–E4 0.502 0.500 0.603 0.502 0.498 0.602 0.498 0.4897 0.592
Model code: Model 19: PMI-I = identity-pedigree (G0P), identity-genomic (G0M) with identity for R0; Model 20: PMD-I = diagonal-pedigree (G0P), diagonal-genomic for (G0M) with identity for R0; Model 21: PMFA-I = factor analytic-pedigree (G0P), factor analytic-genomic for (G0M) with identity for R0; Model 22: PMI-D = identity-pedigree (GP), identity-genomic (G0M) with diagonal for R0; Model 23: PMD-D = diagonal-pedigree (G0P), diagonal-genomic for (G0M) with diagonal for R0; Model 24: PMFA-D = factor analytic-pedigree (G0P), factor analytic-genomic for (G0M) with diagonal for R0; Model 25: PMI-UN = identity-pedigree (G0P), identity-genomic (G0M) with unstructured for R0; Model 26: PMD-UN = diagonal-pedigree (G0P), diagonal-genomic for (G0M) with unstructured for R0; Model 27: PMFA-UN = factor analytic-pedigree (G0P), factor analytic-genomic for (G0M) with unstructured for R0.
Figure 2.

Average (across four environments) correlations between predicted and observed performance derived from models using only pedigree (Model 9 PFA-UN), only markers (Model 18 MFA-UN), or pedigree + markers (Model 27 PMFA-UN) for two cross-validation schemes (CV1 and CV2). The right vertical axis shows the probabilities of one rank change for the given correlations.

 
Figure 3.

Correlation between predicted and observed performance in Environment 1 (E1) and average of Environments 2, 3, and 4 (E2, E3, and E4) obtained in CV2 using (a) pedigree, (b) markers, and (c) pedigree and marker–based models with different specifications for the residual and genetic covariance matrices (D = Diagonal, FA = two-common factors model, UN = unstructured). The right vertical axis shows the probabilities of one rank change for the given correlations.

 

Pedigree vs. Marker-based Prediction

The predictive ability of marker-based models was higher than that of pedigree-based models, both in CV1 and in CV2. Combining pedigree and marker information into the same model improved predictive ability as compared with that of pedigree- or marker-based models across covariance structures, environments, and CV schemes. Therefore, our results confirm the superiority of models for GS over pedigree-based predictions or marker-based predictions alone.

Crossa et al. (2010) evaluated the predictive ability of pedigree, marker, and pedigree plus marker single-environment models using the same data set used here. As expected, the estimates of CV-correlations obtained here in CV1 with models PD-D (4), MD-D (14), and PMD-D (23) are similar to those obtained by Crossa et al. (2010) using single-environment models for pedigree, markers (MBL = molecular marker regression using the Bayesian Least Absolute Shrinkage and Selection Operator [Bayesian LASSO]), and pedigree + markers combined (PM-BL = regression on markers and pedigree using the Bayesian LASSO). Differences are small (on the order of 0.4–4%) and the only exception was pedigree-based prediction in E3, where we obtained a predictive correlation that was 7.4% lower than that reported in Crossa et al. (2010). This difference could be due to many factors, such as the fact that Crossa et al. (2010) used a fully Bayesian approach, while we used REML followed by BLUE–BLUP, or differences could be due to the model used to incorporate marker information (we used a model based entirely on Gaussian assumptions, while the Bayesian LASSO uses a prior for marker effects which is a mixture of scaled normal densities). Finally, differences in estimates of CV-correlation may be partially due to variability in estimates originated by sampling of training and validation sets.

Cross-validation Scheme

Predicting the performance of newly developed lines (CV1) (not tested in the field) is more challenging than predicting the performance of lines that have been evaluated in different but correlated environments (CV2). This is reflected in the results depicted in Fig. 2: averaged across environments, CV-correlations obtained from multienvironment models (PFA-UN, MFA-UN, and PMFA-UN) in CV2 were 31, 17.5, and 21.8% greater than those obtained in CV1 (Fig. 2). This highlights the value of having information from correlated environments when predicting performance. Selection of lines without field testing, as mimicked in CV1, allows shortening of the generation interval, but the predictability is poorer, which might compromise the annual rate of genetic progress in a GS breeding program. Ultimately, the decision of how the breeding scheme should be structured depends on the tradeoff between desired prediction accuracy and generation interval.

Prediction Assessment

Modeling genetic and residual covariances improved the predictability in CV2 but not in CV1. The reason for this is clear; in CV2 modeling GE allows borrowing of information within line across environments. This is not possible in CV1 because in this scheme the lines being predicted have not been evaluated in any environment.

The impact of modeling GE in CV2 is marked in E2–E4, but not in E1 (Fig. 3). As shown in Table 2, genetic values in E2–E4 have high genetic correlations, while genetic values in E1 exhibit low genetic correlations with those from E2–E4. Models FA-D and FA-UN capitalize on these genetic correlations. For correlated environments E2–E4, the benefits in predictive ability come from two sources: from borrowing information from correlated environments by means of modeling GE and from using information regarding pedigree and genetic markers. In terms of probability of a rank change between observed and predicted genotypic values in E2, E3, and E4 using FA-D and FA-UN structures for G0P, G0M, and R0 (for CV2 in Fig. 3c), results show a probability of a single rank change of about 0.145; this value is slightly higher than that estimated for E1. When GE is not modeled (D-D and D-UN), the probability of one rank change for genotypes in E2, E3, and E4 is slightly larger (0.17, Fig. 3c) and is the same for rank changes occurring in E1.

Predictive Ability of Models without Pedigree and Genomic Information

It is interesting to examine the predictive accuracy of models without using pedigree and markers. However, if pedigree or markers are not included, not many models can be fitted for prediction; that is, models for CV1 will predict with the mean, and it is not possible to compute correlations for those genotypes that are completely missing in environments. For CV2, two simple linear mixed models (I and II) were fitted to the training sets of the 10-fold cross-validations without including the pedigree and the marker information. Both models can be described from Model [2]. Model I is y = + Zg + ε, where X is the incidence matrix for the fixed effects of environments β, and Z is the incidence matrix for the random effects of the genotypes g with , and the random residual (ε) is . In this model, the residual contains some of the nonadditive genetic effects, the GE effects, and parts of the error. Model II is y = + Zg + ε, with the fixed effects of environments β and the random effects g that contain the genetic effects, the GE effects, and the error term, ε; the random g effects are modeled using the FA such that Cov (g,g′) = (ΛΛ′ + Ψ) ⊗ Ig = FA(k) ⊗ Ig.

The prediction accuracy of Model I for individual environments and for E2–E4 and E1–E4 ranged from −0.21 for E1 to 0.50 for E2; it was always lower than correlations obtained from Model II. The prediction accuracy of Model II was always better than that of Model I and ranged from 0.22 for E1 to 0.607 for E2. As for the previous cases, E2 and E3 are predicted with more accuracy than the other environments because they have higher genetic correlations. These results are in agreement with Burgueño et al. (2011), who demonstrated that modeling GE is a good thing because it always gives better predictability than using simple linear mixed models.

When comparing the correlations of Model II with those of Models 3, 6, and 9 (pedigree with FA) and Models 12, 15, and 18 (markers with FA) (Table 3), it is clear that all the models that use pedigree or markers or both gave better predictive accuracy for individual environments and for E2–E4 and E1–E4 than Model II (which modeled the GE but did not include additional pedigree and marker information). These results show that for predicting the environments that cause a great deal of the GE, such as environments E1 (and E4), modeling GE using information on molecular markers and/or pedigree gives better prediction accuracy than not modeling GE and not using molecular markers and pedigree information.


CONCLUSIONS

This is the first research that incorporates FA models for GE interaction using pedigree or/and marker-based information simultaneously for genetic prediction in the context of GS. The results show that combining pedigree and marker data can yield substantial increases in prediction accuracy relative (i) to traditional pedigree-based prediction and (ii) to single-environment pedigree and genomics prediction models. Furthermore, for predicting the performance of newly developed lines (CV1), single-environment models are expected to perform similarly to multienvironment models. However, multienvironment models can boost predictive power in across-environment prediction, a problem of great interest in most plant breeding programs. Our results suggest that for the germplasm and environmental conditions used here, most of the benefit of the multienvironment model comes from modeling genetic correlations between environments. We also showed that modeling GE using information on molecular markers and/or pedigree gives better prediction accuracy than not using molecular markers and pedigree information. Further research is required to examine the prediction assessment of modeling the nonadditive genetic variances such as dominance and epistasis and their interactions with environments.

APPENDIX A

Equivalence of Regression Models for Genomic Selection

In the model of Eq. [1], consider replacing genetic values, gij, with a regression on marker genotypes, that is,, where mil counts the number of copies of the minor-frequency allele carried by the ith line at the lth locus, and blj represents the effect of the lth marker on phenotypes in the jth environment. In matrix notation this can be expressed as gj = Mb, where M = {mil} is a matrix of marker genotypes, and bj = (b1j,…,bLj) is a vector of marker effects. By stacking the genetic values of all lines in all environments into a single vector, we obtained g = [IJM]b, where and .

Consider assigning a multivariate normal before the joint vector of marker effects, centered at zero and with covariance matrixIn matrix notation this is expressed as Cov (b) = G0IL, where G0 = {blj,blj} is a within-locus, across-environments covariance matrix of marker effects. Therefore,The vector of genetic values, g = [IJM]b, is a linear combination of the vector of marker effects. Therefore, using Eq. [1A] and from properties of the multivariate normal density, it follows that g is also multivariate normal; further, E[g] = [IJM]E[b] = 0 andTherefore, using A = MM′ in the multienvironment linear mixed model of Eq. [1–3] is equivalent to a linear regression model for genomic selection in which marker effects are environment specific and are assigned a multivariate normal prior centered at zero and with covariance structure Cov (b) = G0IL.

APPENDIX B

Probability of Rank Change of Two Correlated Normal Bivariate Random Variables

We derived the probability of a rank change between the observed phenotypes and the predicted genetic values by assuming a bivariate normal distribution of a pair of random variables with a given correlation. Perfectly correlated variables have zero probability of rank change, whereas variables with zero correlations have a rank change probability of 0.50. This is an approximate approach for model selection that, in this case, offers a measure that is easy to interpret.

To calculate the probability of a change of rank, we considered a bivariate normal distribution (Xi,Yi), where Xi (for i = 1,2,…,g) is the observed genotypic value and Yi is the predicted genotypic value. The Kendall correlation coefficient is defined as the probability of concordance minus the probability of discordance:

The probability of concordance can be interpreted as the probability that the rank between any two genotypes is the same using either the observed or the predicted genetic value. On the other hand, the probability of discordance can be interpreted as the probability that the rank between any two genotypes changes depending on whether observed or predicted genotypic values are used.

Therefore, . Given the relationship between Kendall's correlation coefficient and Pearson's correlation coefficient (ρ), , the probability that there will be a rank change between any two genotypes when using the observed or the predicted genotypic value isFor the range of correlation found in this study (from 0.317 to 0.646), the relationship between the given correlation and the estimated probability of having one rank change is close to linear (Fig. B1).

Figure B1.

Relationship between Pearson's correlation coefficient and the probability that there will be a rank change between any two given genotypes if the observed or the predicted genotypic value is used.

 

APPENDIX C

Cross-validation Schemes

Table C1 shows an example of one-fold cross-validation (CV) in both the CV1 and CV2 schemes. In CV1, for each fold, 60 different genotypes are missing in all four environments. In the CV2 scheme, one random genotype was missing in two random environments until approximately 240 missing cells were completed. To have 10 nonoverlapping subsets, the two remaining environments, which have no missing data, will have missing data in another fold. For example, Genotype 2, which is missing in E1 and E2 in the fold represented in Table C1, will be missing in E3 and E4 in another fold.


View Full Table | Close Full ViewTable C1.

Example of one-fold Cross-validation Scheme 1 (CV1) and Cross-validation Scheme 2 (CV2). The numbers in the table represent the actual standardized grain yield data of 599 genotypes in four environments (E1–E4). Missing genotypes are denoted as dots.

 
CV1
CV2
Genotype E1 E2 E3 E4 E1 E2 E3 E4
1 1.7 −1.7 −1.9 0.1 1.7 −1.7 −1.9 0.1
2 . . . . . . 0.3 −1.7
3 0.3 −0.6 −0.8 −1.1 0.3 −0.6 −0.8 −1.1
4 0.8 0.1 0.6 0.6 0.8 0.1 0.6 0.6
5 1.0 −0.3 1.6 −0.1 1.0 −0.3 1.6 −0.1
6 2.3 0.6 0.1 0.7 2.3 0.6 . .
7 . . . . 0.6 −0.3 0.1 0.0
8 0.6 −0.4 −0.7 1.0 0.6 −0.4 −0.7 1.0
9 −1.0 −1.8 −1.9 −1.5 . −1.8 . −1.5
10 −1.1 −1.6 −2.0 −0.6 −1.1 −1.6 −2.0 −0.6
11 1.7 −0.3 −0.2 0.3 1.7 −0.3 −0.2 0.3
12 . . . . 0.9 −0.2 −0.4 0.7
13 1.2 −0.2 0.8 −0.2 . −0.2 0.8 .
14 0.3 −0.7 −0.4 0.0 0.3 −0.7 −0.4 0.0
15 0.8 −0.6 −0.7 −0.2 0.8 −0.6 −0.7 −0.2
589 −0.8 0.7 0.7 2.6 −0.8 0.7 0.7 2.6
590 −0.9 0.1 1.1 2.8 −0.9 0.1 1.1 2.8
591 . . . . −1.1 . 2.0 .
592 −1.0 1.1 1.9 1.7 −1.0 1.1 1.9 1.7
593 −1.1 0.6 2.4 1.2 −1.1 0.6 2.4 1.2
594 . . . . −0.5 . . 1.8
595 −1.2 1.4 1.6 1.7 −1.2 1.4 1.6 1.7
596 −1.1 0.1 2.1 0.6 −1.1 0.1 2.1 0.6
597 −1.2 0.5 2.0 1.8 . 0.5 . 1.8
598 . . . . −0.9 0.5 1.7 2.7
599 0.2 −0.5 −0.3 0.4 0.2 −0.5 −0.3 0.4

Acknowledgments

The authors thank the numerous cooperators in national agricultural research institutes who carried out Elite Spring Wheat Yield Trials (ESWYT), and provided the phenotypic data presented here. We also thank the International Nursery and Seed Distribution Units at CIMMYT-Mexico for preparing and distributing the seed and computerizing the data. Kent Weigel acknowledges financial support from the National Association of Animal Breeders (Columbia, MO).

 

References

Footnotes


Comments
Be the first to comment.



Please log in to post a comment.
*Society members, certified professionals, and authors are permitted to comment.