Implications
Tools used in genomic selection range from those based on estimation of SNP effects to animal models using a genomic relationship matrix (GRM). When many animals are ungenotyped, modeling options include multi-step or single-step methods. Computations involving mixed models with SNP effects or the inverse of GRM have good computing properties, whereas those including their combination or plain GRM do not. Recent discovery of efficient inverse of GRM removed computing limits from genomic analyses involving GRM. Same discovery enables GREML for very large data sets.
Introduction
The concept of genomic selection (Meuwissen et al., Reference Meuwissen, Hayes and Goddard2001) generated great excitement in the animal breeding community. With genomic information from SNP panels, one can achieve accuracy from young animals almost as high as from a progeny test (Schaeffer, Reference Schaeffer and Kennedy2006; VanRaden et al., Reference VanRaden, Van Tassell, Wiggans, Sonstegard, Schnabel, Taylor and Schenkel2009) at a greatly reduced cost. Initially the genomic computations used versions of BLUP with SNP fitted as random effects, such as SNP BLUP, BayesB, etc., (see Gianola et al., Reference Gianola, de los Campos, Hill, Manfredi and Fernando2009 for a review). An alternative form of SNP BLUP is genomic BLUP (GBLUP), where the animal effect is fit with a GRM (VanRaden, Reference VanRaden2008). For prediction, SNP BLUP and GBLUP are equivalent models (VanRaden, Reference VanRaden2008; Strandén and Christensen, Reference Strandén and Lidauer2011).
When only a small fraction of the population is genotyped, the information from ungenotyped animals can be summarized in pseudo-observations for genotyped animals. Alternatively, GBLUP can be extended to single-step GBLUP (ssGBLUP) (Aguilar et al., Reference Aguilar, Legarra and Misztal2010; Christensen and Lund, Reference Christensen and Lund2010), where a numerator relationship matrix (NRM) for all individuals and GRM are combined and then applied to BLUP. Benefits of ssGBLUP include simplicity of application (another BLUP), avoidance of double counting, and accounting for pre-selection on Mendelian sampling (Legarra et al., Reference Legarra and Misztal2014). ssGBLUP was extended to be based partially or completely on SNP effects (Fernando et al., Reference Fernando, Dekkers and Garrick2014; Liu et al., Reference Liu, Goddard, Reinhardt and Reents2014).
Large interest in the genomic community exists in identifying an optimal set of SNP and their variances for increased accuracy of evaluation. Such an identification is straightforward with Bayesian SNP models (e.g. Gianola et al., Reference Gianola, de los Campos, Hill, Manfredi and Fernando2009). As SNP BLUP and GBLUP are equivalent, SNP and their variances selected in Bayesian SNP models can be transformed to weighted GRM and subsequently BLUP. Methods exist to generate SNP weights directly in the GBLUP model (VanRaden, Reference VanRaden2008; Zhang et al., Reference Zhang, Liu, Ding, Bijma, de Koning and Zhang2010; Sun et al., Reference Sun, Qu, Garrick, Dekkers and Fernando2012) or ssGBLUP (Wang et al., Reference Wang, Misztal, Aguilar, Legarra and Muir2012).
Initially the extent of genomic selection was limited by costs of genotyping. With these costs dropping dramatically, the number of genotyped animals has increased dramatically, with very high-density SNP chips available. In dairy cattle, over 1 million Holsteins had been genotyped in the United States as of March 2016 (Council on Dairy Cattle Breeding; https://www.cdcb.us/Genotype/cur_density.html). Subsequently, computing costs are becoming more of an issue. The purpose of this paper is to present and discuss computational methods in genomic selection.
Definition of terms and basic models
Let A denote a NRM based on pedigrees. Let Z be a matrix of gene content with z ij being the number of occurrences of the minor allele in SNP j of animal i. The values of ‘raw’ z ij are 0, 1 and 2, whereas that of centered z ij is −2p j , 1−p j and 2−p j , where p j is gene frequency of SNP j. The simplest SNP BLUP model is
where y i is a (pseudo)phenotype of animal i, $${\bf a}\,\sim\,N\left( {0,{\bf I}\sigma _{a}^{2} } \right)$$ the vector of SNP effects, $${\bf e}\,\sim\,N\left( {0,{\bf I}\sigma _{e}^{2} } \right)$$ the vector of residuals, and $$\sigma _{a}^{2} $$ and $$\sigma _{e}^{2} $$ are SNP and residual variances, respectively. GBLUP can be defined as
where $${\bf u}\,\sim\,N\left( {0,{\bf G}\sigma _{u}^{2} } \right)$$ is a vector of animal effects, G the GRM and $$\sigma _{u}^{2} $$ the additive variance. G can be derived by a transformation from SNP BLUP:
possibly with an alternative scaling factor (VanRaden, Reference VanRaden2008):
If SNP BLUP involves heterogeneous variance for SNP: $${\bf a}\,\sim\,N(0,{\bf D}\sigma _{a}^{2} )$$ , where D is a diagonal matrix of weights, the equivalent ‘weighted’ G is
SNP BLUP and GBLUP are equivalent models (Strandén and Christensen, Reference Strandén and Lidauer2011).
When only a fraction of animals is genotyped, the numerator and genomic relationships can be combined (Legarra et al., Reference Legarra, Christensen, Aguilar and Misztal2009; Christensen and Lund, Reference Christensen and Lund2010):
where H is a combined matrix and indices in A refer to ungenotyped (1) and genotyped (2) animals. The inverse of matrix H is (Aguilar et al., Reference Aguilar, Legarra and Misztal2010; Christensen and Lund, Reference Christensen and Lund2010)
BLUP with matrix H is called ssGBLUP.
Computing operations in animals breeding
Common operations in animal breeding include solving of mixed model equations and variance component estimation either via REML or Bayesian methods with Gibbs sampling. Such operations in the animal breeding context are described in Mrode (Reference Misztal and Perez-Enciso2014) and Misztal (Reference Misztal, Legarra and Aguilar2014), and are typically based on Henderson’s mixed model equations (MME). Let this linear system of equations be
where B is left-hand-side (LHS), y the right-hand side, and x the vector of solutions. Solving mixed model equations can be accomplished by finite and iterative methods. In finite methods, exact solutions (except for numerical errors) are obtained in a finite number of steps. Popular methods for general matrices are based on the LU decomposition (B=LU, where L and U are lower and upper triangular, respectively) and for symmetric matrices the Cholesky decomposition (B=LL', where L is lower triangular). As often mixed model equations are not full rank, finite methods used for mixed models need to compute generalized solutions rather than cause numerical exceptions. For example, the Cholesky decomposition can be modified to
where D is a diagonal matrix with zero elements for redundant equations and non-zeros otherwise. Although finite methods usually compute solutions with high precision, their cost is high, usually close to cubic. Finite methods can be used to calculate inverses, which are used in REML or for prediction error variance (PEV) calculations. In general, the cost of the inverse is a few times that of a solution by decomposition.
When matrices are sparse, with nearly all elements of LHS equal to 0, a ‘sparse’ Cholesky decomposition exists that avoids operations with zeros, leading to about quadratic rather than cubic costs (George and Liu, Reference George and Liu1981). Also, an algorithm exists for computing a ‘sparse’ inverse, where elements computed are those corresponding to non-zeroes in the original matrix (Takahashi et al., Reference Takahashi, Fagan and Chen1973). Such an inverse is sufficient to calculate traces in REML (Misztal and Perez-Enciso, Reference Misztal1993) or PEV. However, equations involving SNP effects are usually dense. When the system of equations is sparse but contains dense blocks, a sparse matrix package can be modified for greatly improved performance in REML and other applications (Masuda et al., Reference Masuda, Misztal, Tsuruta, Legarra, Aguilar, Lourenco, Fragomeni and Lawlor2015). This is quite new as most matrix packages are efficient for either dense or sparse matrices, but not for mixtures of both.
Iterative methods are such that progressively more accurate solutions are obtained every next round. Two such methods are popular in animal breeding. The first one is Gauss–Seidel iteration (GS), which is also a backbone of Gibbs sampling. GS is very simple, but requires access to equations by rows. The second one is preconditioned conjugate gradient (PCG), a method difficult to understand, but easy to implement and usually with superior convergence rate to GS (Tsuruta et al., Reference Tsuruta, Misztal and Stranden2001).
In animal breeding application, the size of LHS can be much larger than the size of data to generate LHS. For example, one line of data in a 10 trait model with 10 effects per trait includes 20 numbers (one for each trait and effect) but generates 10 000 contributions to LHS. A special implementation of iteration algorithms is matrix-free or by iteration on data (Schaeffer and Kennedy, Reference Schaeffer1986; Misztal and Gianola, Reference Misztal1987), where coefficients of LHS are regenerated from the data every round of iteration. Although the implementation of iteration on data by GS can be complicated as the coefficients need to be recreated row by row, such an implementation in PCG is trivial as PCG only requires a product of B by a vector (Bq, with q a given vector), with no individual elements necessary (except for diagonals).
Tricks and rules in mixed model and genomic computations
Product of numerator relationship matrix
Computing the NRM A by a tabular method has a quadratic cost. Also, computing relationships for specific animals requires creating a matrix for all ancestors. For large populations, A would not fit into memory. When only a product A by a vector, say q, is needed, Aq can be computed inexpensively by two scans of the pedigree (Colleau, Reference Colleau2002). By selecting zeros in q, one can also compute a product of a section of A by a vector (e.g. A 22 q) (Misztal et al., Reference Misztal and Gianola2009; Aguilar et al., Reference Aguilar, Misztal, Johnson, Legarra, Tsuruta and Lawlor2011).
Sequential multiplication and preconditioned conjugate gradient
The LHS of the mixed model equations contains expressions like Z'Z. The PCG algorithm by iteration on data requires computing Z'Zq, where q is a given vector. Strandén and Lidauer (Reference Strandén and Garrick1999) found that while the cost of (Z'Z)q is very high, the cost of Z'(Zq) is much lower.
Convergence properties with various systems of equations
When mixed models are solved by iteration (GS or PCG), the convergence rate is better with fewer equations per phenotype. Experiences indicate that the convergence rate is very good with sire models, much slower with animal models, and could be especially slow with young animals not tied to phenotypes or progeny.
The regular MME equations involve an inverse of A or G, e.g.
An equivalent system of equation avoids creating an inverse (Strandén and Garrick, Reference Strandén and Christensen2009):
The convergence rate in such models is poor and may not occur for larger systems of equations. This is because A −1 ‘connects’ only parents and progeny resulting in ‘sparse’ equations, whereas all animals in A are connected, resulting in dense equations. As G differs little from A (VanRaden, Reference VanRaden2008; Wang et al., Reference Wang, Misztal and Legarra2014), the same thinking applies to models with G and G −1.
Computations in SNP BLUP (and BayesX)
SNP BLUP and its various forms (including BayesA, BayesB, etc.) are very popular models in genomic computations. When the number of SNP is small, the MME for SNP BLUP can be created explicitly. As the memory and compute requirements are quadratic, explicit storage of SNP BLUP equations is unfeasible with large number of SNP. For example, memory requirements for MME with 1 million SNP would be 1012 elements also with 1012 elements of MME created from each genotyped animal. Various options in such a case have been explored by Legarra and Misztal (Reference Golden, Fernando and Garrick2008) with a data set on about 2 k individuals and 20 k SNP markers. Solution by Cholesky decomposition took 2 h, and computations would increase cubically with the number of SNP, whereas memory would increase quadratically. With solution by GS and straightforward iteration on data, computations took about 4 days, with linear memory and quadratic memory costs. With a special GS implementation called GSRU, where currently computed solutions are used to adjust residuals for all individuals, computing took about 1 min and both computing and memory costs were linear. Finally, with PCG iteration on data, computing took only 20 s, with approximately linear costs for memory and computations. Although GSRU is the only realistic algorithm with millions of SNP for Bayesian models, the PCG approach can be useful for models where SNP variances are derived from SNP solutions (Zhang et al., Reference Zhang, Liu, Ding, Bijma, de Koning and Zhang2010; Sun et al., Reference Sun, Qu, Garrick, Dekkers and Fernando2012; Wang et al., Reference Wang, Misztal, Aguilar, Legarra and Muir2012). A hybrid method would use PCG iteration with inner (non-Monte Carlo-based) updates of weights (VanRaden, Reference VanRaden2008).
Variations of single-step equations
Due to the prohibitive cubic cost of inversion of G with many genotyped animals, alternative equations were proposed for ssGBLUP. Equations in Misztal et al. (Reference Misztal and Gianola2009) used H rather than H −1. Convergence was achieved only with a small number of animals. Equations by Legarra and Ducrocq (Reference Legarra and Ducrocq2012) used G and A 22, which are easy to use, instead of G −1 and A 22 −1. The convergence rate was slow with medium models and not achieved with large models (Aguilar et al., Reference Aguilar, Misztal, Legarra and Tsuruta2013). Liu et al. (Reference Liu, Goddard, Reinhardt and Reents2014) proposed equations with SNP effects for genotyped animals. They reported lack of convergence with real data. All these models illustrate computing advantages with inverses of dispersion matrices (A −1 or G −1). As SNP BLUP models have good computing properties, Fernando et al. (Reference Fernando, Dekkers and Garrick2014) looked into ssGBLUP where ungenotyped animals were imputed and only SNP effects (plus imputation errors for ungenotyped animals) were estimated. Although convergence rates were good, imputations required a supercomputer and imputed genotypes required large memory for big populations. Better computing times were reported using a computer with fast graphic cards (Golden et al., Reference Legarra, Aguilar and Misztal2016). Meuwissen et al. (Reference Meuwissen, Svendsen, Solberg and Ødegård2015) presented a new type of H matrix based on (possibly fractional) imputation of ancestors by segregation analyses, followed by a sophisticated tailoring of G to A to compensate for errors in imputation. Although the cost of segregation analysis is high, it is not yet clear whether such an approach offers accuracy benefits when most ancestors are genotyped, and the process of tailoring is rather complex. All non-traditional ssGBLUP would require new research and investment in code for implementation of more complex models, such as multiple-trait, random regression or maternal models.
Sparse genomic relationship matrix
Computations with ssGBLUP would be easy for large number of animals if G −1 was sparse, like A −1. Past studies suggested that G had a limited rank, as inversion is unstable with >5 to 10 k animals, and Gs as used in genetic evaluations are blended with A 22 (VanRaden, Reference VanRaden2008; Aguilar et al. Reference Aguilar, Legarra and Misztal2010) to achieve full rank. The eigenvalue decomposition of G is:
where U is a matrix of eigenvectors and D the matrix of eigenvalues. If all eigenvalues are positive, the inverse of G is
If some eigenvalues are 0, the inverse does not exist. If, due to noise, many eigenvalues are close to 0 and carry no information, they can cause instability of the inverse through large elements of D −1.
Let D t indicate a fraction of D with non-negligible eigenvalues, and let U t be corresponding eigenvalues. Then
If G −1 is to be created explicitly, the computing cost is cubic with quadratic storage. If G − is used by a PCG algorithm as
and the number of eigenvalues in D t is small, G − q can be calculated and stored at linear cost with respect to the number of genotyped animals. However, computing U and D is expensive and requires storage of G in full.
When G has a limited dimensionality, Misztal et al. (Reference Misztal, Legarra and Aguilar2014) proposed to calculate G −1 using only a fraction of G and without eigenvalue decomposition. Let
where index c denotes core animals and index n non-core animals. In the algorithm for proven and young animals (APY)
where M is a diagonal matrix with elements $m_{i} \,{\equals}\,g_{{ii}} {\minus}{\bf G}_{{ic}} {\bf G}_{{cc}}^{{{\minus}1}} {\bf G}_{{ci}} $ for individual i in the core group. Inversion of G by APY has a cubic cost (and quadratic memory) for core individuals and a linear cost (and linear memory) for non-core individuals. Figure 1 shows an example non-zero structure of a regular inverse (after blending to make it positive-definite) and different APY inverses. The inverses are diagonal for non-core animals. As the regular G is singular for a large population, an infinite number of generalized inverses exist, including those by the APY algorithm with different sets of core animals.
Core animals can be randomly selected (Fragomeni et al., Reference Fragomeni, Lourenco, Tsuruta, Masuda, Aguilar, Legarra, Lawlor and Misztal2015), with an optimal number of about 12 000 for Holsteins (Masuda et al., Reference Masuda, Tsuruta, Aguilar and Misztal2016) and 10 000 for Angus (Lourenco et al., Reference Lourenco, Tsuruta, Fragomeni, Masuda, Aguilar, Legarra, Bertrand, Amen, Wang, Moser and Misztal2015). The number of core animals is linked to effective population size (Misztal, Reference Mrode2016). Optimal number of core animals is equal to the number of the largest eigenvalues in D explaining 98% of the variance in G (Pocrnic et al., Reference Pocrnic, Lourenco, Masuda, Legarra and Misztal2016a). With such a number, prediction of breeding values with APY was more accurate than with a regular inverse. Computation of G −1 for 570 k genotyped Holsteins by APY took only about 2 h, whereas a month computing with a large memory would be required for a regular inverse (Masuda et al., Reference Masuda, Tsuruta, Aguilar and Misztal2016). The APY algorithm removes computing limits from inversion of G and subsequently ssGBLUP.
Computing of A22
ssGBLUP as in Aguilar et al. (Reference Aguilar, Legarra and Misztal2010) requires $${\bf A}_{{22}}^{{{\minus}1}} $$ . Although A −1 is sparse, $${\bf A}_{{22}}^{{{\minus}1}} $$ may be relatively dense (Faux and Gengler, Reference Faux and Gengler2013), and its storage for potentially millions of genotyped animals impossible. This matrix can be calculated indirectly (Strandén and Mäntysaari, Reference Strandén and Mäntysaari2014).
where all submatrices are sparse and can be stored explicitly. With PCG iteration, explicit $$A_{{22}}^{{{\minus}1}} $$ is not required, and its product with a vector q can be calculated every round as follows (Masuda et al., Reference Masuda, Tsuruta, Aguilar and Misztal2016):
In particular
is computed as a solution to:
As all matrices are sparse, the cost of above calculations is small (Masuda et al., Reference Masuda, Tsuruta, Aguilar and Misztal2016).
GREML and APY
GREML is REML using a GRM. Similarly, REML using matrix H can be called single-step GREML (ssGREML). In the past, the success with REML in animal populations was due to sparse inverse of A −1 and sparse matrix factorization/inversion (Misztal and Perez-Enciso, Reference Misztal1993; Pérez-Enciso et al., Reference Pérez-Enciso, Misztal and Elzo1994). If G is dense and large, Masuda et al. (Reference Masuda, Misztal, Tsuruta, Legarra, Aguilar, Lourenco, Fragomeni and Lawlor2015) showed that GREML with the old sparse matrix package is inefficient and unstable. A new package called YAMS recognizes dense blocks in MME, rearranges computations accordingly, allowing also for parallel computations for large dense blocks. With YAMS, GREML was much faster and more reliable.
Additional issues
This paper focused on algorithmic issues relevant to current mixed model computations in genomics in animal breeding. Many other issues exist and many will become relevant. For example, genotyping of animals with small amount of information leads to greater importance of avoidance of double counting in multi-step evaluations, or of more accurate match between genomic and additive relationships in ssGBLUP. With efforts to use sequence data, important operations are aligning short fragments, imputation, genomewide association studies, and incorporations of the detected variants in the evaluation. In dairy, a de facto global breeding scheme for Holsteins creates a need for a comprehensive genomic evaluation by Interbull that has a high accuracy, yet does not compromise privacy of the member countries. In general, while brute-force approaches may be formidable, as or more accurate yet inexpensive approaches may exist that exploit peculiarities of data including a limited effective population in farm populations. For instance, the sequence includes three G base pairs, but farm populations can be described by about 4000 to 20 000 chromosomal segments (Pocrnic et al., Reference Pocrnic, Lourenco, Masuda and Misztal2016b). This means a block size of about 150 to 750 kbases with hard to differentiate SNPs, and ability to reduce costs by working with blocks rather than individual SNP. This also suggests that increases of genomic accuracies past 4000 to 20 000 of high accuracy genotyped animals are limited (ignoring G×E and decays of predictivity over time). Similarly, if Interbull countries are unwilling to share genotypes on individual bulls or SNP effects derived from their population, they can contribute GEBV and genotypes of artificially generated animals (Maantysaari, 2015, personal communication) with their number and accuracies relevant to each country.
Conclusions
In conclusion, common operations in genetic selection are computationally feasible for an almost unlimited number of genotyped individuals using appropriate algorithms. Solving ‘SNP’ equations is best accomplished by iteration on data using either a special form of GS iteration or the PCG algorithm. Solving ssGBLUP is best accomplished by iteration on data using the PCG algorithm. For GREML, a sparse matrix package needs to account for dense blocks. Computations in the two latter cases greatly benefit from efficient (and sparse) inverse of the GRM.
Acknowledgments
Helpful suggestions from the two anonymous reviewers are gratefully acknowledged. This research was primarily supported by grants from Holstein Association USA (Brattleboro, VT), American Angus Association, Zoetis, Cobb-Vantress, Smithfield Premium Genetics, Pig Improvement Company, and the US Department of Agriculture’s National Institute of Food and Agriculture (Agriculture and Food Research Initiative competitive grant 2015-67015-22936). The authors thank Ivan Pocrnic for help with editing.