Bias in genomic predictions for populations under selection

Z. G. VITEZICA; I. AGUILAR; I. MISZTAL; A. LEGARRA

doi:10.1017/S001667231100022X

Published online by Cambridge University Press: 18 July 2011

Z. G. VITEZICA*: Affiliation:
Université de Toulouse, UMR 1289 TANDEM, INRA/INPT-ENSAT/ENVT, F-31326 Castanet-Tolosan, France
I. AGUILAR: Affiliation:
Instituto Nacional de Investigación Agropecuaria, Las Brujas, Canelones 90200, Uruguay
I. MISZTAL: Affiliation:
Department of Animal and Dairy Science, University of Georgia, Athens, Georgia 30602, USA
A. LEGARRA: Affiliation:
INRA, UR 631 SAGA, F-31326 Castanet-Tolosan, France
*Corresponding author: UMR 1289 TANDEM, ENSAT, Avenue de l'Agrobiopole, Postal Box 32607, 31326 Auzeville Tolosane, France. E-mail: zulma.vitezica@ensat.fr

Summary

Prediction of genetic merit or disease risk using genetic marker information is becoming a common practice for selection of livestock and plant species. For the successful application of genome-wide marker-assisted selection (GWMAS), genomic predictions should be accurate and unbiased. The effect of selection on bias and accuracy of genomic predictions was studied in two simulated animal populations under weak or strong selection and with several heritabilities. Prediction of genetic values was by best-linear unbiased prediction (BLUP) using data either from relatives summarized in pseudodata for genotyped individuals (multiple-step method) or using all available data jointly (single-step method). The single-step method combined genomic- and pedigree-based relationship matrices. Predictions by the multiple-step method were biased. Predictions by a single-step method were less biased and more accurate but under strong selection were less accurate. When genomic relationships were shifted by a constant, the single-step method was unbiased and the most accurate. The value of that constant, which adjusts for non-random selection of genotyped individuals, can be derived analytically.

Type: Research Papers
Information: Genetics Research, Volume 93, Issue 5, October 2011, pp. 357–366

Copyright: Copyright © Cambridge University Press 2011

1. Introduction

Selection of economically important quantitative traits in animals and plants is traditionally based on phenotypic records of an individual and its relatives. High-density panels of single-nucleotide polymorphisms (SNPs) have recently been used to map genes of domestic animal species and select desirable livestock (Goddard & Hayes, 2009). The genetic merit of an individual can be estimated by genome-wide marker-assisted selection (GWMAS), also known as genomic selection (Meuwissen et al., 2001). Such selection is expected to improve the precision of genetic merit predictions because some markers from a dense SNP panel will be in linkage disequilibrium with quantitative trait loci (QTLs) (Hayes et al., 2009). Genomic selection will lead to faster genetic gain than that achieved with traditional selection methods based on pedigree and phenotypic data only.

Genome-wide evaluation methods use statistical tools to combine phenotypes with high-density marker data to predict the genetic merit of individuals for complex traits. No agreement currently exists on which genome-wide evaluation method is the most appropriate (Daetwyler et al., 2010; Hill, 2010). Most developments in genome-wide evaluation methods published to date assume that all animals have been genotyped. However, this is rarely the case, and most often non-genotyped close family members exist with phenotypic information (Garrick et al., 2009). Ignoring this information results in less accurate predictions and possible bias because of selection (VanRaden et al., 2009a, b). Unbiased predictions are of paramount importance in selection for accurate estimates of the genetic trend and for comparison of animals across generations (Henderson et al., 1959). The acceptance and wide use of genomic predictions will largely depend on correct statistical modelling and ease of application by breeders. Therefore, a genetic evaluation method that produces accurate and unbiased predictions is critically important.

Combining pedigree, phenotypic and marker information to calculate genomic predictions is a challenge. The number of genotyped individuals is extremely small compared with the total number of individuals. Some genotyped individuals (as in dairy cattle) do not have a phenotype of their own. Thus, most currently proposed methods are based on multiple-step procedures (VanRaden et al., 2009a). First, phenotypic data from relatives are summarized to create pseudodata for genotyped individuals (Garrick et al., 2009); then, in the second step, genomic predictions are computed by a genome-wide method from pseudodata and marker information. For example, in dairy cattle, pseudodata for males are measures of (precorrected) daughter performance called daughter yield deviations (DYD). The use of DYD may involve several problems (VanRaden et al., 2009a, b): loss of information from animals with few progeny, which leads to accuracy loss; heterogeneity caused by different amounts of information in the original dataset; and bias (caused by selection). For other species (e.g. sheep or swine) or traits (e.g. maternal traits), pseudodata are more difficult to estimate or even define. Thus, multiple-step methods of computing genomic predictions are not only complicated but likely suboptimal for GWMAS.

Joint use of pedigree, phenotypic and genomic information should theoretically solve such problems. A single-step method based on a linear mixed model and a pedigree relationship matrix augmented with genomic information has been developed recently (Legarra et al., 2009; Aguilar et al., 2010; Christensen & Lund, 2010). This method combines pedigree and all available phenotypes and genotypes, needs no creation of pseudodata, and has been applied to population sizes in the millions (Aguilar et al., 2010). Further, the method is general, with straightforward extension to other models or species.

In traditional pedigree-based evaluation, all information about selection decisions is included in phenotypes and the relationship matrix, and no bias exists from the selection (Sorensen & Kennedy, 1984; Im et al., 1989). However, how selection is accounted for in GWMAS procedures is unclear, although this is becoming a serious concern (Aguilar et al., 2010; Mäntysaari et al., 2010; Chen et al., 2011). Models for GWMAS implicitly assume an unselected genotyped population (Hayes et al., 2009). However, in practice, genotyped individuals are highly selected, and GWMAS models do not take this selection into account. This is in contrast to classical procedures in which models refer to a base unselected population. Thus, GWMAS models appear to be unable to consider past selection based on pedigree and phenotypes, which might cause bias as well as accuracy loss.

The objective of this study was to investigate how unbiased genetic values can be predicted by GWMAS. The effects of selection and genome-wide evaluation method (single- or multiple-step) on bias and prediction accuracy were examined. The effect of trait heritability was also investigated.

2. Materials and methods

(i) Theory

The single-step genomic prediction approach (Legarra et al., 2009; Aguilar et al., 2010; Christensen & Lund, 2010) is based on the model y=Xb+Zu+e, where y is the phenotype vector, X and Z are incidence matrices, b denotes fixed effects, e is the residual and p(u)~N(0, Hσ_u²) involves the genetic effect for non-genotyped (u₁) and genotyped (u₂) individuals and the genetic variance σ_u². Here H⁻¹ is derived as in Legarra et al. (2009) and Christensen & Lund (2010):

$$
H^{-1} = A^{-1} + \begin{bmatrix} 0 & 0 \\ 0 & G^{-1} - A_{22}^{-1} \end{bmatrix} \tag{1}
$$

where G is a genomic relationship matrix and

$$
A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}
\quad\text{and}\quad
A^{-1} = \begin{bmatrix} A^{11} & A^{12} \\ A^{21} & A^{22} \end{bmatrix}
$$

are the pedigree-based relationship matrix and its inverse, partitioned into non-genotyped (index 1) and genotyped (index 2) individuals, respectively. Creation of H (and H⁻¹) can be seen as a projection of genetic merit (or marker genotypes) from genotyped to non-genotyped individuals using pedigree relationships (Legarra et al., 2009; Christensen & Lund, 2010).
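As a minimal numerical sketch of eqn (1), assuming the form H⁻¹ = A⁻¹ + [0 0; 0 G⁻¹−A₂₂⁻¹] used in the cited single-step papers, the Python snippet below builds H⁻¹ for a toy pedigree. The function name `h_inverse`, the four-animal pedigree and the values in G are hypothetical illustrations and are not data from this study.

```python
import numpy as np

def h_inverse(A, G, genotyped):
    """Return H^-1 = A^-1 + [[0, 0], [0, G^-1 - A22^-1]].

    A         : pedigree relationship matrix for all animals
    G         : genomic relationship matrix for the genotyped animals
    genotyped : positions of the genotyped animals in A (same order as in G)
    """
    A = np.asarray(A, dtype=float)
    G = np.asarray(G, dtype=float)
    idx = np.asarray(genotyped)
    block = np.ix_(idx, idx)
    H_inv = np.linalg.inv(A)
    # Only the genotyped block of A^-1 is modified.
    H_inv[block] += np.linalg.inv(G) - np.linalg.inv(A[block])
    return H_inv

# Toy pedigree (hypothetical): animals 0 and 1 are unrelated founders,
# animals 2 and 3 are their full-sib offspring; only 2 and 3 are genotyped.
A = np.array([[1.0, 0.0, 0.5, 0.5],
              [0.0, 1.0, 0.5, 0.5],
              [0.5, 0.5, 1.0, 0.5],
              [0.5, 0.5, 0.5, 1.0]])
G = np.array([[1.0, 0.6],
              [0.6, 1.0]])  # made-up genomic relationships
genotyped = [2, 3]

H_inv = h_inverse(A, G, genotyped)
H = np.linalg.inv(H_inv)
print(np.round(H, 3))
```

In this sketch the genotyped block of H equals G, and the block linking non-genotyped to genotyped animals equals A₁₂A₂₂⁻¹G, which is the projection through pedigree relationships described above.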

The pedigree relationship matrix A implies that the mean genetic value of the base population is 0. Also, according to VanRaden (2008) and Hayes et al. (2009), the genomic relationship matrix G automatically sets the mean genetic value of the genotyped population to 0 if raw means in the genotyped population are used to estimate current allele frequencies. That is not the case if frequencies for the base population are used, but accurate estimates of those frequencies are difficult to obtain.

If selection occurs, the mean genetic value of genotyped individuals (u₂) on the scale of the whole population (i.e. relative to the base population) may have a value different from 0, say μ. This would be particularly true if genotyped individuals were elite individuals or in the last generations of selection. Let us assume that μ is a random variable; for instance, in (conceptual) repetitions of the selection process, μ will vary due to drift, in other words, because of finite sampling of replacement animals. Thus, p(μ)~N(0, ασ_u²); and, p(u₂|μ)~N(1μ, Gσ_u²), where 1 is a vector of ones. Equivalently, p(u₂|α)~N(0, (G+11′α)σ_u²).
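The equivalence stated above follows from integrating out μ. A short sketch of that step, using only the two distributions given in the text and the laws of total expectation and total variance, is:

$$
E(u_2) = E\left[E(u_2 \mid \mu)\right] = 1\,E(\mu) = 0
$$

$$
\mathrm{Var}(u_2) = E\left[\mathrm{Var}(u_2 \mid \mu)\right] + \mathrm{Var}\left[E(u_2 \mid \mu)\right]
= G\sigma_u^2 + 1\,\mathrm{Var}(\mu)\,1' = (G + 11'\alpha)\,\sigma_u^2
$$

Because u₂ given μ and μ itself are both normal, the marginal distribution of u₂ is normal with this mean and variance.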

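As in Legarra et al. (2009), the distribution of genetic values of non-genotyped individuals conditioned on genetic values of genotyped individuals (using multivariate normality) is p(u₁|u₂)~N(A₁₂ A₂₂⁻¹ u₂, (A¹¹)⁻¹ σ_u²), where (A¹¹)⁻¹=A₁₁−A₁₂ A₂₂⁻¹ A₂₁. Thus, p(u₁|α)~N(0, (A¹¹)⁻¹ σ_u²+A₁₂ A₂₂⁻¹(G+11′α)A₂₂⁻¹ A₂₁σ_u²) and p(u)~N(0, H†σ_u²), where H† is equivalent to H with G replaced by G+11′α. The mixed model equations are

$$
\begin{bmatrix} X'X & X'Z \\ Z'X & Z'Z + H^{\dagger -1}\,\sigma_e^2/\sigma_u^2 \end{bmatrix}
\begin{bmatrix} \hat{b} \\ \hat{u} \end{bmatrix}
=
\begin{bmatrix} X'y \\ Z'y \end{bmatrix}
$$

where σ_e² is the residual variance and

$$
H^{\dagger -1} = A^{-1} + \begin{bmatrix} 0 & 0 \\ 0 & (G + 11'\alpha)^{-1} - A_{22}^{-1} \end{bmatrix} \tag{2}
$$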
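To illustrate how mixed model equations of this kind are solved, the following Python sketch assembles and solves Henderson's equations for y=Xb+Zu+e with a generic inverse relationship matrix. The function name `solve_mme`, the toy matrix K, the records and the variance ratio are all hypothetical; in the single-step setting the argument `K_inv` would be H⁻¹ or H†⁻¹.

```python
import numpy as np

def solve_mme(X, Z, y, K_inv, lam):
    """Solve Henderson's mixed model equations.

    K_inv : inverse relationship matrix (e.g. A^-1, H^-1 or the shifted H-dagger^-1)
    lam   : variance ratio sigma_e^2 / sigma_u^2
    """
    X, Z, y = (np.asarray(a, dtype=float) for a in (X, Z, y))
    n_fixed = X.shape[1]
    lhs = np.block([[X.T @ X, X.T @ Z],
                    [Z.T @ X, Z.T @ Z + lam * np.asarray(K_inv, dtype=float)]])
    rhs = np.concatenate([X.T @ y, Z.T @ y])
    sol = np.linalg.solve(lhs, rhs)
    return sol[:n_fixed], sol[n_fixed:]

# Hypothetical toy data: three animals, one record each, overall mean only.
K = np.array([[1.00, 0.50, 0.25],
              [0.50, 1.00, 0.50],
              [0.25, 0.50, 1.00]])
X = np.ones((3, 1))
Z = np.eye(3)
y = np.array([1.0, 2.0, 4.0])
lam = 2.0  # assumed variance ratio

b_hat, u_hat = solve_mme(X, Z, y, np.linalg.inv(K), lam)
print(b_hat, u_hat)
```

The solutions agree with generalized least squares for b and with the direct BLUP formula KZ′V⁻¹(y−Xb̂) for u, where V=ZKZ′+Iλ on the scale of σ_u²; working with the inverse relationship matrix avoids forming and inverting V, which is what makes the approach practical for large pedigrees.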

An equivalent model for unbiased genomic predictions based on a genetic group model (Quaas, 1988) is shown in the appendix.

To determine why μ can be presumed to be a random variable and to obtain the value of α, traditional best-linear unbiased prediction (BLUP) will be assumed to be unbiased and able to account properly for selection and drift (Sorensen & Kennedy, 1984). We suggest that the mean value μ of genetic effects of genotyped individuals u₂ can be expressed as μ=1′u₂/n, where n is the number of genotyped individuals. Since μ is a function of random variables, it is a random variable in itself. The variable μ can be estimated from either pedigree (μ_p) or single-step (μ_s) procedures. If genetic prediction based on pedigree (and phenotypic data) is unbiased and accounts for selection and drift, the distribution of μ_p accounts properly for bias, selection and drift as well. The prior distributions for genetic values of u₂ are u_2p~N(0, A₂₂σ_u²) and u_2s~N(0, (G+11′α)σ_u²); thus, the distributions of μ are μ_p~N(0, 1′A₂₂1σ_u²/n²) and μ_s~N(0, 1′(G+11′α)1σ_u²/n²).

Since pre- and post-multiplication by 1′ and 1 are simply summations over all elements,

$$
\mathrm{Var}(\mu_p) = \frac{\sigma_u^2}{n^2}\sum_i\sum_j A_{22,ij} = \bar{A}_{22}\,\sigma_u^2 \tag{3}
$$

and

$$
\mathrm{Var}(\mu_s) = \frac{\sigma_u^2}{n^2}\left(\sum_i\sum_j G_{ij} + n^2\alpha\right) = \left(\bar{G} + \alpha\right)\sigma_u^2 \tag{4}
$$

where the bar denotes the mean over all n² elements of the matrix. In order to construct a model with features similar to pedigree-based BLUP, we equate the two variances in eqns (3) and (4), and this gives

$$
\alpha = \bar{A}_{22} - \bar{G} = \frac{1}{n^2}\left(\sum_i\sum_j A_{22,ij} - \sum_i\sum_j G_{ij}\right) \tag{5}
$$

Thus, α is simply the difference between means for A₂₂ and G. The α accounts for the fact that genotyped animals in u₂ are more related through pedigree (in reference to the base population), which is correctly considered in A₂₂. The genomic relationship matrix G does not correctly reflect this fact, especially if current allele frequencies are used, which sets the genomic base as the genotyped individuals (Oliehoek et al., 2006; VanRaden, 2008). For G to be correct, base allele frequencies would be required. In practice, current allele frequencies are used because base allele frequencies are difficult to estimate. For a population of unrelated individuals where A₂₂=I, α=0.
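The computation of α as the difference between the mean of A₂₂ and the mean of G can be sketched in a few lines of Python. The function names and the two small matrices below are hypothetical and serve only to show the arithmetic; they are not values from the simulations.

```python
import numpy as np

def alpha_shift(A22, G):
    """alpha = mean of all elements of A22 minus mean of all elements of G."""
    return float(np.mean(A22) - np.mean(G))

def shifted_G(A22, G):
    """Return G + 11'alpha (a constant added to every element) and alpha."""
    alpha = alpha_shift(A22, G)
    return np.asarray(G, dtype=float) + alpha, alpha

# Hypothetical pedigree and genomic relationships for three genotyped animals.
A22 = np.array([[1.00, 0.50, 0.25],
                [0.50, 1.00, 0.25],
                [0.25, 0.25, 1.00]])
G = np.array([[ 0.9,  0.1, -0.2],
              [ 0.1,  0.8, -0.3],
              [-0.2, -0.3,  1.0]])

G_dagger, alpha = shifted_G(A22, G)
print(round(alpha, 4))            # mean(A22) - mean(G) = 3.1 / 9
print(np.round(G_dagger, 4))
```

After the shift, the mean of all elements of G+11′α equals the mean of A₂₂ by construction, which is the condition obtained by equating the two variances of μ.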

From Wright's F_ST, another interpretation of α is also possible. The F_ST can be defined as the mean relationship between gametes in a recent population with respect to an older base population (Cockerham, 1969, 1973; Powell et al., 2010). Then A₂₂ involves relationships of genotyped individuals with reference to the base population, and G corresponds to relationships within the current population. Consequently, α is equal to twice F_ST; the factor of two is needed because F_ST refers to co-ancestries, which are half the individual additive relationships. The correction suggested by Powell et al. (2010) is F_old=F_new+(1−F_new)F_ST, which on the scale of additive relationships is equivalent to replacing G with (1−α/2)G+11′α and is similar to our suggestion.
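The link between the co-ancestry form of the Powell et al. correction and its form on the relationship scale can be checked numerically. The sketch below assumes, as stated above, that relationships are twice co-ancestries and that α=2F_ST; the function name `powell_corrected_G` and the numerical values are hypothetical.

```python
import numpy as np

def powell_corrected_G(G, alpha):
    """Relationship-scale form of F_old = F_new + (1 - F_new) * F_ST.

    With a = 2F and alpha = 2 * F_ST:
        a_old = 2 * (a_new / 2 + (1 - a_new / 2) * alpha / 2)
              = (1 - alpha / 2) * a_new + alpha
    """
    return (1.0 - alpha / 2.0) * np.asarray(G, dtype=float) + alpha

# Hypothetical check on a single relationship.
a_new, alpha = 0.3, 0.1
F_new, F_ST = a_new / 2.0, alpha / 2.0
F_old = F_new + (1.0 - F_new) * F_ST
a_old = float(powell_corrected_G(a_new, alpha))
print(a_old, 2.0 * F_old)
```

Compared with the additive shift G+11′α, this form also rescales G by (1−α/2), so the two corrections differ only slightly when α is small.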

(ii) Simulations

To evaluate the effectiveness of the proposed approach for genomic prediction, two selection scenarios with different heritabilities were simulated. The simulator QMSim (Sargolzaei & Schenkel, 2009) was used to generate historical (undergoing drift and mutation) and recent (undergoing selection) population structures. In total, 10 chromosomes of equal length (100 cM) were simulated. Biallelic markers (10 000) were distributed at random along the chromosomes with equal frequency in the first generation of the historical population. Potentially, 250 QTLs affected the phenotype; QTL allele effects were sampled from a Gamma distribution with a shape parameter of 0·4. The mutation rate of the markers (recurrent mutation process) and QTLs was assumed to be 2·5×10⁻⁵ per locus per generation (Solberg et al., 2008).

First, a base population of 200 males and 2600 females was generated by mutation and drift over 100 generations (t) in a historical population with an effective population size of 100 (t=1–95) that was gradually expanded to 3000 offspring (t=100). Then, 10 generations (t=101–110) of selection for a sex-limited trait with a phenotypic variance of 1 were simulated. Three heritabilities (0·05, 0·30 and 0·50) were used to examine the effect of heritability on genomic predictions. In each generation, 200 males were mated to 2600 females to produce 2600 offspring following random phenotype (P_Y) or positive assortative (matings among best males and females based on estimated breeding values (EBVs); P_EBV) designs. For the next generation, 80 males and 520 females were selected based on P_Y or P_EBV. For P_EBV, EBVs were computed in each generation with the data accumulated so far by a BLUP procedure that included phenotypes and pedigree, which mimics recent selection procedures for livestock populations. At the end of each simulation, true breeding values (TBVs) and pedigree information were available for all ten generations (28 800 individuals in pedigree); phenotypes were available for all generations except the last one (13 100 records). All males (840 sires of females with records) were genotyped, as well as 260 individuals in generation 110, which were potential candidates for selection. No fixed effects were simulated. For each scenario, 20 replicates were run.

(iii) Prediction methods

The 260 genotyped selection candidates in the last generation were evaluated using genomic information. Genetic merit of the selection candidates was estimated using five methods based on BLUP. The first method was a mixed model based on the pedigree relationship matrix and phenotypes (BLUP_PED). The second method was a two-step procedure, in which DYD were computed using a regular method based on pedigree and phenotypes and then used for genomic prediction (BLUP_DYD). For BLUP_DYD, 840 males were included with DYD information from 14 (±10) daughters on average. The DYD were weighted by their accuracy. For the third method, the full dataset (pedigree, phenotypes and genotypes) was directly used in a single-step procedure that used G (BLUP_1STEP). The fourth method (BLUP_α) was also a single-step procedure, in which genetic differences between genotyped and non-genotyped individuals were corrected simply by using H†⁻¹, which uses G†=G+11′α instead of G. For the fifth method, the correction of G proposed by Powell et al. (2010) was implemented by replacing G with (1−α/2)G+11′α and tested in a single-step procedure (hereafter, the Powell-corrected single-step method).

All single-step methods (BLUP_1STEP, BLUP_α and the Powell-corrected single-step method) and BLUP_DYD used G=ZZ′/(2Σ_i p_i(1−p_i)), where z_i was coded as −p_i or 1−p_i for the first or second allele, respectively, and p_i is the allele frequency of the second allele (VanRaden, 2008) computed as the raw mean from all available genotypes.
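A genomic relationship matrix of the VanRaden (2008) type with raw allele frequencies can be sketched as follows. The sketch assumes that genotypes are stored as counts (0, 1 or 2) of the second allele, so that each centred genotype is the sum of two allele codes of −p_i or 1−p_i; the function name `vanraden_G` and the toy genotypes are hypothetical.

```python
import numpy as np

def vanraden_G(M):
    """Genomic relationship matrix from an n x m matrix of second-allele counts."""
    M = np.asarray(M, dtype=float)
    p = M.mean(axis=0) / 2.0   # raw allele frequencies in the genotyped sample
    Z = M - 2.0 * p            # centred genotypes (sum of two allele codes)
    # Assumes at least one marker is polymorphic, otherwise the divisor is zero.
    return Z @ Z.T / (2.0 * np.sum(p * (1.0 - p)))

# Hypothetical genotypes: four animals, three markers.
M = np.array([[0, 1, 2],
              [1, 1, 0],
              [2, 0, 1],
              [1, 2, 1]])
G = vanraden_G(M)
print(np.round(G, 3))
print(round(float(G.sum()), 12))
```

Because every column of the centred genotype matrix sums to zero when raw frequencies are used, all elements of G sum to zero. This is the numerical counterpart of the statement that such a G sets the mean genetic value of the genotyped population to 0, and it is the reason a shift towards the mean of A₂₂ is needed.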

Prediction quality was checked for all 260 selection candidates (validation data). Bias was measured as the difference between predicted and simulated breeding values of the candidates. Regression of TBV on EBV was used as a measure of the inflation of the prediction method, where a regression coefficient of one denotes no inflation. Accuracy of evaluation methods was computed as the square of the correlation between TBV and EBV. In addition, prediction error variance (PEV), which is a measure of candidate prediction error, and mean-squared error (MSE), which is a measure of overall fit of the model to the data, were computed. Results were the mean of the 20 replicates of each scenario.
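The validation statistics described above can be sketched as below. Bias is taken as the mean of EBV minus TBV, inflation as the slope of the regression of TBV on EBV, and accuracy as the squared correlation. The exact formulas the authors used for PEV are not given in the text, so the error variance returned here is only an assumed stand-in; the function name and toy values are hypothetical.

```python
import numpy as np

def validation_metrics(tbv, ebv):
    """Bias, regression slope of TBV on EBV, squared correlation and MSE."""
    tbv = np.asarray(tbv, dtype=float)
    ebv = np.asarray(ebv, dtype=float)
    err = ebv - tbv
    c = np.cov(tbv, ebv)                     # 2 x 2 sample covariance matrix
    return {
        "bias": float(err.mean()),           # predicted minus simulated
        "b": float(c[0, 1] / c[1, 1]),       # 1 means no inflation
        "r2": float(c[0, 1] ** 2 / (c[0, 0] * c[1, 1])),
        "mse": float(np.mean(err ** 2)),
        "error_var": float(err.var()),       # assumed proxy for PEV
    }

# Hypothetical candidates whose EBVs are shifted and over-dispersed.
tbv = np.array([1.0, 2.0, 3.0, 4.0])
ebv = 2.0 * tbv - 1.0
m = validation_metrics(tbv, ebv)
print(m)
```

In this toy case the correlation is perfect (r2 = 1) even though the predictions are biased (bias = 1·5) and inflated (b = 0·5), and the MSE of 3·5 equals the squared bias plus the error variance. This shows why accuracy alone does not reveal bias or inflation.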

3. Results

(i) Correction factor

Table 1 shows mean α for the three heritabilities under P_Y or P_EBV. The value of α is higher when selection is predominantly on information from relatives (e.g. low-heritability traits under P_EBV). Bias, inflation and accuracy of predictions as a function of α for a given replicate are presented in Fig. 1. In this example, the theoretical value of α according to eqn (5) was 0·07, at which bias was close to zero but not at its minimum. Optimally, high accuracy (R²) and low inflation (b) should be balanced to avoid large increases in bias.

Fig. 1. Bias (dotted curve with solid circles), coefficient of regression of true on EBV (b; solid curve with squares), and squared correlation between true and EBVs (R ²; dashed curve with circles) as functions of the correction factor α (difference between pedigree-based and genome-based relationships for genotyped animals) for a given replicate with heritability of 0·30 under assortative mating selection based on EBV.

Table 1. Mean α for different heritabilities under P_Y or P_EBV selection

α is the difference between pedigree-based and genome-based relationships for genotyped animals; selection was based either on random phenotype (P_Y) or assortative mating using EBV (P_EBV).

(ii) Bias

Table 2 shows TBV and EBV means from the five prediction methods under P_Y or P_EBV for the selection candidates in the last generation. Mean TBV of the last generation was 3·2–5·4 times larger for P_EBV than for P_Y. As expected, even with selection and different heritabilities, BLUP_PED predicted mean TBV correctly. The BLUP_DYD and BLUP_1STEP methods underestimated mean TBV in the last generation, but BLUP_α and the Powell-corrected single-step method were unbiased.

Table 2. Means (SDs) of breeding values from different heritabilities and prediction methods for selection candidates under P_Y and P_EBV

Prediction methods were based on BLUP: a mixed model based on the pedigree relationship matrix and phenotypes (BLUP_PED), a two-step procedure in which DYD were computed from a regular method based on pedigree and phenotypes and then used for genomic prediction (BLUP_DYD), a single-step procedure that used the genomic relationship matrix and the full dataset of pedigree, phenotypes and genotypes (BLUP_1STEP), a single-step procedure with genetic differences among genotyped and non-genotyped individuals corrected by considering the difference between pedigree-based and genome-based relationships for genotyped animals (BLUP_α) and a single-step procedure that corrected the genomic relationship matrix as proposed by Powell et al. (2010). Selection was based either on random phenotype (P_Y) or assortative mating using EBV (P_EBV).

(iii) Inflation

The degree of inflation from the prediction methods is indicated by the coefficient of regression of TBV on EBV (Table 3). The optimal prediction method of genetic merit of young individuals would have a regression coefficient close to 1. For each scenario, the differences between the methods were small, and the approaches achieved very similar inflation. Strong selection under P_EBV increased inflation, with regression coefficients lower than 1. According to Kennedy et al. (1988), the relationship matrix accounts for selection, drift and non-random mating, but it does not account for a wrong definition of the base population or a finite number of loci. Potentially 250 QTLs affected the phenotype, which rules out the second reason. Ideally, the base population should be infinite; this is not the case here, and thus the first assumption is violated. Consider animals 1 and 2 (e.g. father and son). If they are related, the distribution of the genetic value of 2 can be expressed as conditioned on the genetic value of 1. Thus, knowledge of the EBV of 1 would decrease uncertainty (variance) of the EBV of 2. However, if this relationship is unknown, the conditional distribution cannot be written. Thus, not knowing relationships will increase the variance of EBVs and cause inflation; this explanation has not been verified.

Table 3. Coefficients (SDs) for regression of true on EBV for different heritabilities and prediction methods under P_Y and P_EBV

Prediction methods were based on BLUP: a mixed model based on the pedigree relationship matrix and phenotypes (BLUP_PED), a two-step procedure in which DYD were computed from a regular method based on pedigree and phenotypes and then used for genomic prediction (BLUP_DYD), a single-step procedure that used the genomic relationship matrix and the full dataset of pedigree, phenotypes and genotypes (BLUP_1STEP), a single-step procedure with genetic differences among genotyped and non-genotyped individuals corrected by considering the difference between pedigree-based and genome-based relationships for genotyped animals (BLUP_α), and a single-step procedure that corrected the genomic relationship matrix as proposed by Powell et al. (2010). Selection was based either on random phenotype (P_Y) or assortative mating using EBV (P_EBV).

(iv) Accuracy

Squared correlations between TBV and EBV (i.e. reliability) by heritability and prediction method are shown in Table 4. Squared correlations are presented in Fig. 2 for all replicates using a heritability of 0·30 under P_EBV. Compared with BLUP_PED, all genomic prediction methods increased accuracy by about 37 percentage units and 22 percentage units under assortative mating based on EBV and mass selection, respectively.

Fig. 2. Accuracy (squared correlations between true and EBV) of BLUP methods for estimating breeding value of selection candidates across 20 replicates with heritability of 0·30 under assortative mating selection based on EBV. Prediction methods were a mixed model based on the pedigree relationship matrix and phenotypes (BLUP_PED; solid line with triangles), a two-step procedure in which DYD were computed from a regular method based on pedigree and phenotypes and then used for genomic prediction (BLUP_DYD; dashed line with solid circles), a single-step procedure with genetic differences among genotyped and non-genotyped individuals corrected by considering the difference between pedigree-based and genome-based relationships for genotyped animals (BLUP_α; solid line with squares) and a single-step procedure that corrected the genomic relationship matrix as proposed by Powell et al. (2010) (dashed line with x markers).

Table 4. Squared correlations between true and EBVs (SDs) for different heritabilities and prediction methods under P_Y and P_EBV

Squared correlations between true and EBVs were expressed as percentages. Prediction methods were based on BLUP: a mixed model based on the pedigree relationship matrix and phenotypes (BLUP_PED), a two-step procedure in which DYD were computed from a regular method based on pedigree and phenotypes and then used for genomic prediction (BLUP_DYD), a single-step procedure that used the genomic relationship matrix and the full dataset of pedigree, phenotypes and genotypes (BLUP_1STEP), a single-step procedure with genetic differences among genotyped and non-genotyped individuals corrected by considering the difference between pedigree-based and genome-based relationships for genotyped animals (BLUP_α) and a single-step procedure that corrected the genomic relationship matrix as proposed by Powell et al. (2010). Selection was based either on random phenotype (P_Y) or assortative mating using EBV (P_EBV).

With low heritability under P_EBV, accuracy from BLUP_α and the Powell-corrected single-step method was as good as from BLUP_DYD (Table 4). When these accuracies are comparable, the single-step procedures with correction have an advantage over BLUP_DYD because they provide a unified framework that eliminates all the assumptions applied in multiple-step methods. With medium or high heritabilities under P_EBV, BLUP_α and the Powell-corrected single-step method have the highest accuracies (Table 4; Fig. 2). The lower accuracy of BLUP_DYD compared with the corrected single-step methods results from ignoring parent average in BLUP_DYD. Including parent average in predictions increases the accuracy in multiple-step methods, but it is complicated (VanRaden et al., 2009a). Parent averages are automatically included in genomic predictions with single-step methods.

(v) Other comparison measures

In addition to assessing the quality of genetic evaluations through bias and accuracy from the difference and correlation between TBV and EBV, PEV and MSE were also considered (Table 5). All genomic methods had the lowest PEV under P_Y for all heritabilities. Under P_EBV, PEV were also lowest for genomic methods except with low heritability. Compared with MSE for BLUP_DYD predictions, MSE for BLUP_α were 18–66% lower under P_Y and 60–96% lower under P_EBV, with differences increasing with heritability under both selection designs. The greatly decreased MSE for all scenarios indicated that the multiple-step BLUP_DYD was less accurate. Considering the potential genetic gain from selection, BLUP_α and the Powell-corrected single-step method were the most advantageous prediction methods.

Table 5. PEVs (SDs) and MSEs (SDs) for different heritabilities and prediction methods under P_Y and P_EBV

PEV and MSE are shown. Prediction methods were based on BLUP: a mixed model based on the pedigree relationship matrix and phenotypes (BLUP_PED), a two-step procedure in which DYD were computed from a regular method based on pedigree and phenotypes and then used for genomic prediction (BLUP_DYD), a single-step procedure that used the genomic relationship matrix and the full dataset of pedigree, phenotypes and genotypes (BLUP_1STEP), a single-step procedure with genetic differences among genotyped and non-genotyped individuals corrected by considering the difference between pedigree-based and genome-based relationships for genotyped animals (BLUP_α), and a single-step procedure that corrected the genomic relationship matrix as proposed by Powell et al. (2010). Selection was based either on random phenotype (P_Y) or assortative mating using EBV (P_EBV).

4. Discussion

Comparison of the four genomic prediction methods at various heritabilities under two selection intensities demonstrated that a single-step method with correction (either BLUP_α or the Powell-corrected single-step method) was a preferred method of accounting for bias in genomic predictions. Furthermore, two possible corrections of G for prediction of genetic merit were tested. Genomic preselection of animals is now often included in dairy cattle breeding programmes. Thus, bias becomes an important concern with the multiple-step method, because computation of DYD assumes random selection of the Mendelian sampling, which is clearly violated (Patry & Ducrocq, 2011). Because single-step methods with correction do not create pseudodata and are unbiased, they may be an attractive genomic evaluation method.

Bias differences among the methods may be explained by how the different methods account for genotypes of highly selected individuals (i.e. males). Classical theory for modelling covariance among individuals assumes no selection. Thus, covariances among TBVs of selected individuals are no longer described well by any (genomic or pedigree-based) relationship matrix, unless all records used in the selection are accounted for (as in BLUP_PED) (Sorensen & Kennedy, 1984; Kennedy et al., 1988). Although all records are used in BLUP_1STEP, only genotyped (and mostly selected) individuals are included in G. Since genetic values of non-genotyped animals are, a priori, conditioned on genetic values of genotyped animals (Legarra et al., 2009), bias is alleviated but not fully corrected. In a single-step method with correction, such as BLUP_α, bias is eliminated by referencing genomic and pedigree-based relationship matrices to the base population. With correction, BLUP_α (or the Powell-corrected single-step method) is the most accurate for GWMAS.

A pertinent question is which evaluation criterion should be used to maximize genetic gain in selection schemes. According to classical selection theory, best prediction is ideal for selection (Henderson, 1973; Fernando & Gianola, 1986). The best predictor minimizes MSE, and is unbiased and not inflated (b=1) by construction (Henderson, 1973). If we were using the best predictor, accuracy, PEV or MSE should provide the same ranking of methods. The best predictor requires knowing the true model generating the data. This is not the case here because true QTLs were not included in the model for genetic evaluation. Instead, we used a linear model assuming multivariate normality.

In practice, the relevance of the criterion used depends on the selection schemes. If the parents of the next generation come from (and only from) genotyped selection candidates, then accuracy is the criterion to maximize, because selection candidates share a common mean (i.e. they belonged to the same generation) and thus bias is not a concern.

If different candidates have different amounts of information (e.g. comparing progeny-tested males vs. newborn animals), PEV or MSE has to be considered. For example, in the presence of bias, genetic gain is over- (or under-) estimated, so newborns are thought to be better (or worse) than they really are. In this case, MSE is possibly the criterion of choice, because it also includes bias. For instance, Roehe & Kennedy (1993) showed that a wrong model resulted in an artificial overestimation of genetic trend, which raised the estimated merit of young selection candidates. Similarly, inflation (b<1, which is included in PEV or MSE but not in accuracy) results in exaggeration (both over and under) of estimated genetic merit of newborns with respect to progeny-tested animals.
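The two situations above can be contrasted with a small simulated sketch. All values are made up: the two groups, their error variances and the size of the bias are assumptions chosen for illustration, not figures from this study. A constant bias leaves the ranking and the correlation within the young group untouched, but changes how many young animals are selected when they compete with older, well-evaluated animals.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Hypothetical true breeding values (TBV) for two groups of candidates.
tbv_old = rng.normal(0.0, 1.0, n)
tbv_young = rng.normal(0.0, 1.0, n)

# Hypothetical EBVs: older animals are evaluated precisely, young ones less so.
ebv_old = tbv_old + rng.normal(0.0, 0.3, n)
ebv_young = tbv_young + rng.normal(0.0, 0.8, n)

# Assumed constant upward bias in the EBVs of young animals.
bias = 0.5
ebv_young_biased = ebv_young + bias

def share_young_selected(ebv_a, ebv_b, n_selected):
    """Fraction of the top n_selected pooled candidates that come from group b."""
    pooled = np.concatenate([ebv_a, ebv_b])
    is_b = np.concatenate([np.zeros(len(ebv_a), dtype=bool),
                           np.ones(len(ebv_b), dtype=bool)])
    top = np.argsort(pooled)[::-1][:n_selected]
    return float(is_b[top].mean())

# Within the young group, a constant shift changes neither order nor correlation.
same_order = np.array_equal(np.argsort(ebv_young), np.argsort(ebv_young_biased))
r_plain = np.corrcoef(tbv_young, ebv_young)[0, 1]
r_biased = np.corrcoef(tbv_young, ebv_young_biased)[0, 1]

# Across groups, the same shift changes who is selected.
share_plain = share_young_selected(ebv_old, ebv_young, 40)
share_biased = share_young_selected(ebv_old, ebv_young_biased, 40)

print("ranking within young group unchanged:", same_order)
print("share of young among selected, without bias:", share_plain)
print("share of young among selected, with bias:   ", share_biased)
```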

However, accuracy is currently used to assess genomic prediction methods in cross-validation studies (e.g. VanRaden et al., 2009a), although bias is becoming an increasing concern (e.g. Luan et al., 2009; VanRaden et al., 2009b; Mäntysaari et al., 2010). Consideration of bias, accuracy and inflation, possibly through MSE, is strongly recommended for the comparison of future genomic selection strategies.
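The criteria discussed in this section can be computed together from true and estimated breeding values, as in the sketch below. The definitions are assumptions made for illustration: bias is taken as the mean of EBV minus TBV, b as the slope of the regression of true on estimated values, accuracy as the squared correlation, and PEV as the variance of the prediction errors. Under these definitions MSE equals squared bias plus PEV, which is why MSE also captures bias. The function name and the simulated data are hypothetical.

```python
import numpy as np

def prediction_summary(tbv, ebv):
    """Bias, inflation (b), accuracy (R2), PEV and MSE of predictions.

    Assumed definitions (population moments, i.e. ddof=0):
      bias = mean(EBV - TBV)
      b    = cov(TBV, EBV) / var(EBV)   (regression of true on EBV)
      R2   = squared correlation between TBV and EBV
      PEV  = var(EBV - TBV)
      MSE  = mean((EBV - TBV)^2) = bias^2 + PEV
    """
    tbv = np.asarray(tbv, dtype=float)
    ebv = np.asarray(ebv, dtype=float)
    err = ebv - tbv
    cov = np.cov(tbv, ebv, ddof=0)  # 2 x 2 covariance matrix
    return {
        "bias": err.mean(),
        "b": cov[0, 1] / cov[1, 1],
        "R2": cov[0, 1] ** 2 / (cov[0, 0] * cov[1, 1]),
        "PEV": err.var(),
        "MSE": np.mean(err ** 2),
    }

# Simulated example: EBVs that are both shifted upward and over-dispersed.
rng = np.random.default_rng(7)
tbv = rng.normal(0.0, 1.0, 1000)
ebv = 0.4 + 1.25 * tbv + rng.normal(0.0, 0.6, 1000)

summary = prediction_summary(tbv, ebv)
for name, value in summary.items():
    print(f"{name:>4s} = {value:.3f}")
```

In this example the slope b falls below 1 because the EBVs are more spread out than the true values, while the squared correlation alone would not reveal either the shift or the over-dispersion.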

5. Conclusion

Overall, a single-step genomic prediction method with corrected G (BLUP_α or BLUP_FST) was unbiased, similarly inflated and more accurate than other procedures, even in the presence of selection. The corrected G is an appropriate methodological solution that takes into account the effect of non-random genotyping (due to strong selection) on prediction. The results show that a single-step genomic prediction approach is promising for animal and plant breeding.

We are grateful to Paul VanRaden, Daniel Gianola, Daniel Sorensen, Shogo Tsuruta and Ching-Yi Chen for their helpful and constructive comments. This work was partly financed by Holstein Association USA, Smithfield Premium Genetics, the National Beef Cattle Evaluation Consortium and AFRI grants 2009-65205-05665 and 2010-65205-20366 from the USDA NIFA Animal Genome Program (ZV, IA, IM) as well as Apisgene and ANR projects AMASGEN and SheepSNPQTL and POCTEFA project Genomia (AL). The project also was partly supported by the Genopole Toulouse Midi-Pyrénées bioinformatics platform.

Appendix

Genetic group model for unbiased genomic predictions

From the model of unbiased genomic predictions presented in section (i), we have that

p(u₂|μ)~N(1μ, Gσ_u²),

and p(u|μ)~N(Qμ, Hσ_u²), where H is according to eqn (1) and Q is .

The setting is like a genetic group model (Quaas, 1988), and equivalent mixed-model equations, which yield the same solutions for b and u, are derived as in Quaas (1988) as

The above expression is simplified by computing the product Q′H⁻¹ as

Note that and . We can see that and the product , which is simply the sum of its elements.
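The simplification described above can be sketched as follows. This is a reconstruction under stated assumptions, not the authors' displayed equations: individuals are ordered with non-genotyped animals first (subscript 1) and genotyped animals second (subscript 2); Q is taken as the vector of expected genetic values per unit of μ implied by the two distributions given earlier; and the inverse of H is taken in the form given by Aguilar et al. (2010) and Christensen & Lund (2010).

```latex
% Assumed forms (labelled assumptions, see lead-in):
\[
\mathbf{Q} =
\begin{bmatrix}
\mathbf{A}_{12}\mathbf{A}_{22}^{-1}\mathbf{1} \\ \mathbf{1}
\end{bmatrix},
\qquad
\mathbf{H}^{-1} = \mathbf{A}^{-1} +
\begin{bmatrix}
\mathbf{0} & \mathbf{0} \\
\mathbf{0} & \mathbf{G}^{-1} - \mathbf{A}_{22}^{-1}
\end{bmatrix}.
\]

% Because [A_12; A_22] is the block of columns of A for genotyped animals,
% A^{-1} [A_12; A_22] = [0; I], and therefore
\[
\mathbf{Q}'\mathbf{A}^{-1} =
\begin{bmatrix}
\mathbf{0}' & \mathbf{1}'\mathbf{A}_{22}^{-1}
\end{bmatrix}.
\]

% Adding the genomic part of H^{-1}:
\[
\begin{aligned}
\mathbf{Q}'\mathbf{H}^{-1}
 &= \begin{bmatrix} \mathbf{0}' & \mathbf{1}'\mathbf{A}_{22}^{-1} \end{bmatrix}
  + \begin{bmatrix} \mathbf{0}' & \mathbf{1}'\left(\mathbf{G}^{-1} - \mathbf{A}_{22}^{-1}\right) \end{bmatrix}
  = \begin{bmatrix} \mathbf{0}' & \mathbf{1}'\mathbf{G}^{-1} \end{bmatrix}, \\
\mathbf{Q}'\mathbf{H}^{-1}\mathbf{Q}
 &= \mathbf{1}'\mathbf{G}^{-1}\mathbf{1}.
\end{aligned}
\]
```

Under these assumptions, the scalar in the last line is the sum of all elements of the inverse of G, so the equation for μ involves only the genotyped animals.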

Thus, the mixed-model equations can be expressed as

(6)

with Z′Z appropriately expanded with zeros, and setting H*⁻¹ as

Absorption of μ in (6) gives mixed-model equations with H†⁻¹ (eqn (2)) using expression (2) in Henderson & Searle (1981). Computing H*⁻¹ is similar to the creation of A⁻¹ with genetic groups as reported by Quaas (1988).

The expression (6) is of interest because there is an explicit estimate of μ. In addition, mixed-model equations in (6) have a straightforward interpretation. The genetic value of a genotyped individual is conditional on its mean μ and relatives through genomic relationships. The genetic value of a non-genotyped individual is conditional on its relatives through pedigree relationships, including relatives with genotypes.

References

Aguilar, I., Misztal, I., Johnson, D. L., Legarra, A., Tsuruta, S. & Lawlor, T. J. (2010). A unified approach to utilize phenotypic, full pedigree, and genomic information for genetic evaluation of Holstein final score. Journal of Dairy Science 93, 743–752.

Chen, C. Y., Misztal, I., Aguilar, I., Legarra, A. & Muir, W. M. (2011). Effect of different genomic relationship matrices on accuracy and scale. Journal of Animal Science, in press.

Christensen, O. F. & Lund, M. S. (2010). Genomic prediction when some animals are not genotyped. Genetics Selection Evolution 42, 2.

Cockerham, C. C. (1969). Variance of gene frequencies. Evolution 23, 72–84.

Cockerham, C. C. (1973). Analysis of gene frequencies. Genetics 74, 679–700.

Daetwyler, H. D., Pong-Wong, R., Villanueva, B. & Woolliams, J. A. (2010). The impact of genetic architecture on genome-wide evaluation methods. Genetics 185, 1021–1031.

Fernando, R. & Gianola, D. (1986). Optimal properties of the conditional mean as a selection criterion. Theoretical and Applied Genetics 72, 822–825.

Garrick, D. J., Taylor, J. F. & Fernando, R. L. (2009). Deregressing estimated breeding values and weighting information for genomic regression analyses. Genetics Selection Evolution 41, 55.

Goddard, M. E. & Hayes, B. J. (2009). Mapping genes for complex traits in domestic animals and their use in breeding programmes. Nature Reviews Genetics 10, 381–391.

Hayes, B. J., Visscher, P. M. & Goddard, M. E. (2009). Increased accuracy of artificial selection by using the realized relationship matrix. Genetics Research 91, 47–60.

Henderson, C. R. (1973). Sire evaluation and genetic trends. In: Proceedings of the Animal Breeding and Genetics Symposium in Honor of Dr. J. L. Lush. pp. 10–41. Champaign, IL: American Society of Animal Science and American Dairy Science Association.

Henderson, C. R., Kempthorne, O., Searle, S. R. & von Krosigk, C. M. (1959). The estimation of environmental and genetic trends from records subject to culling. Biometrics 15, 192–218.

Henderson, C. R. & Searle, S. R. (1981). On deriving the inverse of a sum of matrices. SIAM Review 23, 53–60.

Hill, W. G. (2010). Understanding and using quantitative genetic variation. Philosophical Transactions of the Royal Society B 365, 73–85.

Im, S., Fernando, R. L. & Gianola, D. (1989). Likelihood inferences in animal breeding under selection: a missing-data theory viewpoint. Genetics Selection Evolution 21, 399–414.

Kennedy, B. W., Schaeffer, L. R. & Sorensen, D. A. (1988). Genetic properties of animal models. Journal of Dairy Science 71, 17–26.

Legarra, A., Aguilar, I. & Misztal, I. (2009). A relationship matrix including full pedigree and genomic information. Journal of Dairy Science 92, 4656–4663.

Luan, T., Woolliams, J. A., Lien, S., Kent, M., Svendsen, M. & Meuwissen, T. H. E. (2009). The accuracy of genomic selection in Norwegian red cattle assessed by cross-validation. Genetics 183, 1119–1126.

Mäntysaari, E., Liu, Z. & VanRaden, P. (2010). Interbull validation test for genomic evaluations. Interbull Bulletin 41, 5 p.

Meuwissen, T. H. E., Hayes, B. J. & Goddard, M. E. (2001). Prediction of total genetic value using genome-wide dense marker maps. Genetics 157, 1819–1829.

Oliehoek, P. A., Windig, J. J., van Arendonk, J. A. M. & Bijma, P. (2006). Estimating relatedness between individuals in general populations with a focus on their use in conservation programs. Genetics 173, 483–496.

Patry, C. & Ducrocq, V. (2011). Evidence of biases in genetic evaluations due to genomic pre-selection in dairy cattle. Journal of Dairy Science 94, 1011–1020.

Powell, J. E., Visscher, P. M. & Goddard, M. E. (2010). Reconciling the analysis of IBD and IBS in complex trait studies. Nature Reviews Genetics 11, 800–805.

Quaas, R. L. (1988). Additive genetic models with groups and relationships. Journal of Dairy Science 71, 1338–1345.

Roehe, R. & Kennedy, B. W. (1993). The influence of maternal effects on accuracy of evaluation of litter size in swine. Journal of Animal Science 71, 2353–2364.

Sargolzaei, M. & Schenkel, F. (2009). QMSim: a large-scale genome simulator for livestock. Bioinformatics 25, 680–681.

Solberg, T. R., Sonesson, A. K., Woolliams, J. A. & Meuwissen, T. H. E. (2008). Genomic selection using different marker types and densities. Journal of Animal Science 86, 2447–2454.

Sorensen, D. A. & Kennedy, B. W. (1984). Estimation of response to selection using least squares and mixed model methodology. Journal of Animal Science 58, 1097–1103.

VanRaden, P. M. (2008). Efficient methods to compute genomic predictions. Journal of Dairy Science 91, 4414–4423.

VanRaden, P. M., Van Tassell, C. P., Wiggans, G. R., Sonstegard, T. S., Schnabel, R. D., Taylor, J. F. & Schenkel, F. S. (2009a). Invited review: Reliability of genomic predictions for North American Holstein bulls. Journal of Dairy Science 92, 16–24.

VanRaden, P. M., Tooker, M. E. & Cole, J. B. (2009b). Can you believe those genomic evaluations for young bulls? Journal of Dairy Science 92(E-Suppl. 1), 314 (Abstr.).

Fig. 1. Bias (dotted curve with solid circles), coefficient of regression of true on EBV (b; solid curve with squares), and squared correlation between true and EBVs (R²; dashed curve with circles) as functions of the correction factor α (difference between pedigree-based and genome-based relationships for genotyped animals) for a given replicate with heritability of 0·30 under assortative mating selection based on EBV.

Table 1. Mean α for different heritabilities under PY or PEBV selection

Table 2. Means (SDs) of breeding values from different heritabilities and prediction methods for selection candidates under PY and PEBV

Table 3. Coefficients (SDs) for regression of true on EBV for different heritabilities and prediction methods under PY and PEBV

Fig. 2. Accuracy (squared correlations between true and EBV) of BLUP methods for estimating breeding value of selection candidates across 20 replicates with heritability of 0·30 under assortative mating selection based on EBV. Prediction methods were a mixed model based on the pedigree relationship matrix and phenotypes (BLUP_PED; solid line with triangles), a two-step procedure in which DYD were computed from a regular method based on pedigree and phenotypes and then used for genomic prediction (BLUP_DYD; dashed line with solid circles), a single-step procedure with genetic differences among genotyped and non-genotyped individuals corrected by considering the difference between pedigree-based and genome-based relationships for genotyped animals (BLUP_α; solid line with squares) and a single-step procedure that corrected the genomic relationship matrix as proposed by Powell et al. (2010) (BLUP_FST; dashed line with x markers).

Table 4. Squared correlations between true and EBVs (SDs) for different heritabilities and prediction methods under PY and PEBV

Table 5. PEVs (SDs) and MSEs (SDs) for different heritabilities and prediction methods under PY and PEBV
