  • Print publication year: 2016
  • Online publication date: December 2015

3 - Introduction to statistical methods in genome-wide association studies

from Part I - Genome-wide association studies



After the completion of the Human Genome Project (Lander et al., 2001; Venter et al., 2001) and the initiation of the International HapMap Project (Sachidanandam et al., 2001), genome-wide association studies (GWAS) were designed to survey the role of common genetic variation in complex human diseases. Because a GWAS assays a dense set of single-nucleotide polymorphisms (SNPs) across the whole genome, it was expected to have the advantage over “candidate gene” studies (Tabor et al., 2002; Wang et al., 2005) of not relying on prior knowledge of biological pathways, and thus to avoid the bias that incomplete prior knowledge introduces into such studies. GWAS were also expected to have higher power and finer resolution than family-based linkage studies for identifying genetic variants of modest effect (Risch & Merikangas, 1996).

The successful identification of genes for age-related macular degeneration (AMD) under the GWAS paradigm (Klein et al., 2005) convinced the genetics community of the efficiency and feasibility of the GWAS approach for identifying unknown disease-associated variants. This study used a commercial genotyping array to assay about 100,000 SNPs throughout the human genome, and it identified an association between complement factor H (CFH) and AMD. Finding a common risk allele with an odds ratio (OR) of 4.6 in a small sample of 96 cases and 50 controls generated considerable excitement in the genetics community. The p-value of the strongest SNP association passed the genome-wide significance threshold after Bonferroni correction. More importantly, the finding was replicated in follow-up studies (Donoso et al., 2010). This encouraging result raised researchers' confidence that GWAS could detect the genetic variants underlying various complex diseases. In 2007, the Wellcome Trust Case Control Consortium (WTCCC) published the results of GWAS of seven diseases: bipolar disorder, coronary artery disease, Crohn's disease, hypertension, rheumatoid arthritis, type 1 diabetes, and type 2 diabetes (The Wellcome Trust Case Control Consortium, 2007). The WTCCC study is considered the starting point of large-scale GWAS (Visscher et al., 2012). Since then, an increasing number of GWAS have been conducted and over 10,000 loci have been reported to be significantly associated with at least one complex trait (see the web resource of GWAS catalog (Hindorff et al., 2009),

Allen, H.L., Estrada, K., Lettre, G., et al. (2010). Hundreds of variants clustered in genomic loci and biological pathways affect human height. Nature, 467(7317), 832–838.
Andreassen, O.A., Djurovic, S., Thompson, W.K., et al. (2013). Improved detection of common variants associated with schizophrenia by leveraging pleiotropy with cardiovascular-disease risk factors. Am. J. Hum. Genet., 92(2), 97–109.
Asimit, J. and Zeggini, E. (2010). Rare variant association analysis methods for complex traits. Annu. Rev. Genet., 44, 293–308.
Balding, D. (2006). A tutorial on statistical methods for population association studies. Nature Rev. Genet., 7(10), 781–791.
Bansal, V., Libiger, O., Torkamani, A. and Schork, N.J. (2010). Statistical analysis strategies for association studies involving rare variants. Nature Rev. Genet., 11(11), 773–785.
Bishop, C.M. and Nasrabadi, N.M. (2006). Pattern Recognition and Machine Learning (Vol. 1). Springer, New York.
Boyle, A., Hong, E., Hariharan, M., et al. (2012). Annotation of functional variation in personal genomes using RegulomeDB. Genome Res., 22(9), 1790–1797.
Browning, S.R. and Browning, B.L. (2013). Identity-by-descent-based heritability analysis in the northern Finland birth cohort. Hum. Genet., 132(2), 129–138.
Cantor, R., Lange, K. and Sinsheimer, J. (2010). Prioritizing GWAS results: a review of statistical methods and recommendations for their application. Am. J. Hum. Genet., 86(1), 6–22.
Chatterjee, N., Wheeler, B., Sampson, J., et al. (2013). Projecting the performance of risk prediction based on polygenic analyses of genome-wide association studies. Nature Genet., 45(4), 400–405.
Chen, G. and Witte, J. (2007). Enriching the analysis of genomewide association studies with hierarchical modeling. Am. J. Hum. Genet., 81(2), 397–404.
Cordell, H.J. (2009). Detecting gene–gene interactions that underlie human diseases. Nature Rev. Genet., 10, 392–404.
Cowper-Sal-lari, R., Zhang, X., Wright, J., et al. (2012). Breast cancer risk-associated SNPs modulate the affinity of chromatin for FOXA1 and alter gene expression. Nature Genet., 44(11), 1191–1200.
Cross-Disorder Group of the Psychiatric Genomics Consortium. (2013a). Genetic relationship between five psychiatric disorders estimated from genome-wide SNPs. Nature Genet., 45(9), 984–994.
Cross-Disorder Group of the Psychiatric Genomics Consortium. (2013b). Identification of risk loci with shared effects on five major psychiatric disorders: a genome-wide analysis. Lancet, 381(9875), 1371–1379.
de los Campos, G., Gianola, D. and Allison, D.B. (2010). Predicting genetic predisposition in humans: the promise of whole-genome markers. Nature Rev. Genet., 11(12), 880–886.
de los Campos, G., Vazquez, A.I., Fernando, R., Klimentidis, Y.C. and Sorensen, D. (2013). Prediction of complex human traits using the genomic best linear unbiased predictor. PLoS Genet., 9(7), e1003608.
Devlin, B. and Roeder, K. (1999). Genomic control for association studies. Biometrics, 55(4), 997–1004.
Donoso, L.A., Vrabec, T. and Kuivaniemi, H. (2010). The role of complement Factor H in age-related macular degeneration: a review. Surv. Ophthalmol., 55(3), 227–246.
Falconer, D.S. (1996). Introduction to Quantitative Genetics (ed.). Longman, London.
Fisher, R.A. (1918). The correlation between relatives on the supposition of Mendelian inheritance. Trans. R. Soc. Edinb., 52(2), 399–433.
Goddard, M.E., Wray, N.R., Verbyla, K. and Visscher, P.M. (2009). Estimating effects and making predictions from genome-wide marker data. Statist. Sci., 24(4), 517–529.
Golan, D. and Rosset, S. (2013). Narrowing the gap on heritability of common disease by direct estimation in case-control GWAS. arXiv preprint arXiv:1305.5363.
Hartley, S.W. and Sebastiani, P. (2013). PleioGRiP: genetic risk prediction with pleiotropy. Bioinformatics, 29(8), 1086–1088.
Hartley, S.W., Monti, S., Liu, C.-T., Steinberg, M.H. and Sebastiani, P. (2012). Bayesian methods for multivariate modeling of pleiotropic SNP associations and genetic risk prediction. Front. Genet., 3, 176.
Hastie, T. and Tibshirani, R. (2004). Efficient quadratic regularization for expression arrays. Biostatistics, 5(3), 329–340.
Hastie, T., Tibshirani, R. and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction (ed.). Springer, New York.
Hindorff, L., Sethupathy, P., Junkins, H., et al. (2009). Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc. Natl Acad. Sci. USA, 106(23), 9362–9367.
Hopper, J. (1993). Variance components for statistical genetics: applications in medical research to characteristics related to human diseases and health. Statist. Meth. Med. Res., 2(3), 199–223.
Hopper, J. and Mathews, J.D. (1982). Extensions to multivariate normal models for pedigree analysis. Ann. Hum. Genet., 46(4), 373–383.
Hopper, J. and Mathews, J.D. (1983). Extensions to multivariate normal models for pedigree analysis: II. Modeling the effect of shared environment in the analysis of variation in blood lead levels. Am. J. Epidemiol., 117(3), 344–355.
Jiang, J., Li, C., Debashis, P., Yang, C. and Zhao, H. (2013). High dimensional genome-wide association study and mis-specified mixed model analysis. arXiv preprint arXiv:1404.2355 [math.ST].
Kang, H.M., Sul, J.H., Zaitlen, N.A., et al. (2010). Variance component model to account for sample structure in genome-wide association studies. Nature Genet., 42(4), 348–354.
Klein, R., Zeiss, C., Chew, E., et al. (2005). Complement factor H polymorphism in age-related macular degeneration. Science, 308(5720), 385–389.
Korte, A., Vilhjálmsson, B.J., Segura, V., et al. (2012). A mixed-model approach for genome-wide association studies of correlated traits in structured populations. Nature Genet., 44(9), 1066–1071.
Lander, E., Linton, L., Birren, B., et al. (2001). Initial sequencing and analysis of the human genome. Nature, 409(6822), 860–921.
Lange, K., Westlake, J. and Spence, M. (1976). Extensions to pedigree analysis III. Variance components by the scoring method. Ann. Hum. Genet., 39(4), 485–491.
Lee, S.H., Wray, N.R., Goddard, M.E. and Visscher, P.M. (2011). Estimating missing heritability for disease from genome-wide association studies. Am. J. Hum. Genet., 88(3), 294–305.
Lee, S.H., DeCandia, T.R., Ripke, S., et al. (2012). Estimating the proportion of variation in susceptibility to schizophrenia captured by common SNPs. Nature Genet., 44(3), 247–250.
Lee, S.-I., Dudley, A., Drubin, D., et al. (2009). Learning a prior on regulatory potential from eQTL data. PLoS Genet., 5(1), e1000358.
Li, C., Yang, C., Gelernter, J. and Zhao, H. (2013). Improving genetic risk prediction by leveraging pleiotropy. arXiv preprint arXiv:1304.7417.
Li, M., Wang, L., Xia, Z., Sham, P. and Wang, J. (2013). GWAS3D: detecting human regulatory variants by integrative analysis of genome-wide associations, chromosome interactions and histone modifications. Nucl. Acids Res., 41, W150–W158.
Lippert, C., Listgarten, J., Liu, Y., et al. (2011). Fast linear mixed models for genome-wide association studies. Nature Meth., 8(10), 833–835.
Lippert, C., Quon, G., Kang, E.Y., et al. (2013). The benefits of selecting phenotype-specific variants for applications of mixed models in genomics. Sci. Rep., 3.
Lynch, M. and Walsh, B. (1998). Genetics and Analysis of Quantitative Traits. Sinauer Associates, Sunderland, MA.
Maher, B. (2008). Personal genomes: the case of the missing heritability. Nature, 456(7218), 18–21.
Manolio, T. (2010). Genomewide association studies and assessment of the risk of disease. New Engl. J. Med., 363(2), 166–176.
Manolio, T.A., Collins, F.S., Cox, N.J., et al. (2009). Finding the missing heritability of complex diseases. Nature, 461(7265), 747–753.
Marchini, J., Cardon, L.R., Phillips, M.S. and Donnelly, P. (2004). The effects of human population structure on large genetic association studies. Nature Genet., 36(5), 512–517.
Morris, A.P., Voight, B.F., Teslovich, T.M., et al. (2012). Large-scale association analysis provides insights into the genetic architecture and pathophysiology of type 2 diabetes. Nature Genet., 44(9), 981–990.
Price, A.L., Patterson, N.J., Plenge, R.M., et al. (2006). Principal components analysis corrects for stratification in genome-wide association studies. Nature Genet., 38(8), 904–909.
Price, A.L., Zaitlen, N.A., Reich, D. and Patterson, N. (2010). New approaches to population stratification in genome-wide association studies. Nature Rev. Genet., 11(7), 459–463.
Purcell, S., Neale, B., Todd-Brown, K., et al. (2007). PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet., 81(3), 559–575.
Risch, N. and Merikangas, K. (1996). The future of genetic studies of complex human diseases. Science, 273(5281), 1516–1517.
Sachidanandam, R., Weissman, D., Schmidt, S., et al. (2001). A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms. Nature, 409(6822), 928–933.
Sakoda, L.C., Jorgenson, E. and Witte, J.S. (2013). Turning of COGS moves forward findings for hormonally mediated cancers. Nature Genet., 45(4), 345–348.
Schaub, M., Boyle, A., Kundaje, A., Batzoglou, S. and Snyder, M. (2012). Linking disease associations with regulatory information in the human genome. Genome Res., 22, 1748–1759.
Searle, S.R., Casella, G. and McCulloch, C.E. (2006). Variance Components. Wiley-Interscience, New York, NY.
Sivakumaran, S., Agakov, F., Theodoratou, E., et al. (2011). Abundant pleiotropy in human complex diseases and traits. Am. J. Hum. Genet., 89(5), 607–618.
Speed, D., Hemani, G., Johnson, M.R. and Balding, D.J. (2012). Improved heritability estimation from genome-wide SNPs. Am. J. Hum. Genet., 91(6), 1011–1021.
Stahl, E.A., Wegmann, D., Trynka, G., et al. (2012). Bayesian inference analyses of the polygenic architecture of rheumatoid arthritis. Nature Genet., 44(5), 483–489.
Sul, J.H. and Eskin, E. (2013). Mixed models can correct for population structure for genomic regions under selection. Nature Rev. Genet., 14(4), 300.
Svishcheva, G.R., Axenovich, T.I., Belonogova, N.M., Duijn, C.M. and Aulchenko, Y.S. (2012). Rapid variance components-based method for whole-genome association analysis. Nature Genet., 44(10), 1166–1170.
Tabor, H., Risch, N. and Myers, R. (2002). Candidate-gene approaches for studying complex genetic traits: practical considerations. Nature Rev. Genet., 3(5), 391–397.
The ENCODE Project Consortium. (2012). An integrated encyclopedia of DNA elements in the human genome. Nature, 489, 57–74.
The Wellcome Trust Case Control Consortium. (2007). Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature, 447(7145), 661–678.
Thompson, R. (1977a). The estimation of heritability with unbalanced data: II. Data available on more than two generations. Biometrics, 33(3), 497–504.
Thompson, R. (1977b). The estimation of heritability with unbalanced data: I. Observations available on parents and offspring. Biometrics, 33(3), 485–495.
Tipping, M.E. (2001). Sparse Bayesian learning and the relevance vector machine. J. Mach. Learn. Res., 1, 211–244.
Tipping, M.E. and Faul, A.C. (2003). Fast marginal likelihood maximisation for sparse Bayesian models. In Bishop, C.M. and Frey, M. (Eds), Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics (Vol. 1), Jan 3–6, Key West, FL.
Vattikuti, S., Guo, J. and Chow, C.C. (2012). Heritability and genetic correlations explained by common SNPs for metabolic syndrome traits. PLoS Genet., 8(3), e1002637.
Venter, J., Adams, M., Myers, E., et al. (2001). The sequence of the human genome. Science, 291(5507), 1304–1351.
Veyrieras, J.-B., Kudaravalli, S., Kim, S., et al. (2008). High-resolution mapping of expression-QTLs yields insight into human gene regulation. PLoS Genet., 4(10), e1000214.
Visscher, P.M. (2008). Sizing up human height variation. Nature Genet., 40(5), 489–490.
Visscher, P.M., Hill, W.G. and Wray, N.R. (2008). Heritability in the genomics era – concepts and misconceptions. Nature Rev. Genet., 9(4), 255–266.
Visscher, P.M., Brown, M.A., McCarthy, M.I. and Yang, J. (2012). Five years of GWAS discovery. Am. J. Hum. Genet., 90(1), 7–24.
Wan, X., Yang, C., Yang, Q., et al. (2010). Predictive rule inference for epistatic interaction detection in genome-wide association studies. Bioinformatics, 26, 30–37.
Wang, W., Barratt, B., Clayton, D. and Todd, J. (2005). Genome-wide association studies: theoretical and practical concerns. Nature Rev. Genet., 6(2), 109–118.
Ward, L. and Kellis, M. (2012a). HaploReg: a resource for exploring chromatin states, conservation, and regulatory motif alterations within sets of genetically linked variants. Nucl. Acids Res., 40(D1), D930–D934.
Ward, L. and Kellis, M. (2012b). Interpreting noncoding genetic variation in complex traits and human disease. Nature Biotechnol., 30, 1095–1106.
Wray, N.R., Yang, J., Hayes, B.J., et al. (2013). Pitfalls of predicting complex traits from SNPs. Nature Rev. Genet., 14(7), 507–515.
Yang, J., Benyamin, B., McEvoy, B.P., et al. (2010). Common SNPs explain a large proportion of the heritability for human height. Nature Genet., 42(7), 565–569.
Yang, J., Manolio, T.A., Pasquale, L.R., et al. (2011a). Genome partitioning of genetic variation for complex traits using common SNPs. Nature Genet., 43(6), 519–525.
Yang, J., Weedon, M.N., Purcell, S., et al. (2011b). Genomic inflation factors under polygenic inheritance. Eur. J. Hum. Genet., 19(7), 807–812.
Zaitlen, N., Kraft, P., Patterson, N., et al. (2013). Using extended genealogy to estimate components of heritability for 23 quantitative and dichotomous traits. PLoS Genet., 9(5), e1003520.
Zhou, X. and Stephens, M. (2012). Genome-wide efficient mixed-model analysis for association studies. Nature Genet., 44(7), 821–824.
Zhu, X., Zhang, S., Zhao, H. and Cooper, R.S. (2002). Association mapping, using a mixture model for complex traits. Genet. Epidemiol., 23(2), 181–196.