  • Print publication year: 2016
  • Online publication date: December 2015

14 - eQTL mapping

from Part III - Single nucleotide polymorphisms, copy number variants, haplotypes and eQTLs

Summary

Introduction

With an influx of successful genome-wide association studies identifying genetic variants associated with complex diseases, an unprecedented wealth of knowledge has accumulated on SNP–phenotype associations (McCarthy et al., 2008; Witte, 2010; Manolio, 2013). However, many SNP–disease associations do not lend themselves to molecular interpretation, because many of the identified loci lie outside coding regions. Even when a gene can be inferred to be causal, there is often a significant gap in understanding the underlying molecular mechanisms (Schadt et al., 2005; McCarthy et al., 2008). Genome-wide expression quantitative trait locus (eQTL) mapping has been one effective approach to bridging this gap (Mackay et al., 2009). In eQTL studies, gene expression levels measured by high-throughput technologies, such as microarrays and RNA-Seq, are treated as quantitative traits. Marker genotypes are collected from the same set of individuals, and statistical analyses are performed to detect associations between markers and expression traits. By simultaneously capturing many regulatory interactions, eQTLs offer valuable insights into the genetic architecture of expression regulation (Rockman and Kruglyak, 2006). The ultimate goal of eQTL studies is to elucidate how genetic variation affects phenotypes by using gene expression levels as intermediate molecular phenotypes (Nica and Dermitzakis, 2008). In this chapter, we provide an overview of the eQTL analysis workflow (Figure 14.1), introduce publicly available tools for analysis, and discuss remaining challenges and issues.

Data pre-processing

Genome-wide eQTL mapping considers high-density SNP genotype data and gene expression data from the same individuals in a segregating population. Both data types require appropriate pre-processing, as described below, before subsequent analysis.

Genotype data

Three quality control (QC) criteria are often used in pre-processing the genotype data. (1) Missing rate: individuals with a large proportion of missing SNP genotypes (e.g., more than 10%) should be excluded, because their DNA samples may be of poor quality. SNPs with a high missing rate (e.g., more than 5%) should also be filtered out. (2) Hardy–Weinberg equilibrium (HWE): statistically significant deviations from HWE often result from genotyping errors. Therefore, SNPs that fail an exact HWE test (e.g., with a P-value less than 0.001) should be filtered out. This criterion does not apply to haploid organisms, such as yeast. (3) Minor allele frequency (MAF): SNPs with low MAF (e.g., below 0.05) are sometimes filtered out, because studies with a relatively small sample size have insufficient statistical power to detect their effects and because their genotype calls are potentially more error-prone.
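
The three criteria can be sketched as a small filter. This is a minimal illustration in plain Python, not the procedure of any particular package. Several assumptions are ours rather than the chapter's: genotypes are coded as 0, 1 or 2 copies of one allele with `None` for a missing call; all function names are hypothetical; individuals are filtered before SNPs, an ordering the text does not prescribe; and a 1-df chi-square goodness-of-fit test stands in for the exact HWE test, purely to keep the example short.

```python
import math

MISSING = None  # assumed encoding of a missing genotype call


def missing_rate(calls):
    """Fraction of missing calls in a sequence of genotypes."""
    if not calls:
        return 1.0
    return sum(g is MISSING for g in calls) / len(calls)


def minor_allele_frequency(calls):
    """MAF from diploid genotypes coded as 0, 1 or 2 copies of one allele."""
    observed = [g for g in calls if g is not MISSING]
    if not observed:
        return 0.0
    freq = sum(observed) / (2 * len(observed))
    return min(freq, 1 - freq)


def hwe_chisq_pvalue(calls):
    """P-value of a 1-df chi-square HWE test.

    This is an approximation swapped in for the exact test named in the text.
    """
    observed = [g for g in calls if g is not MISSING]
    n = len(observed)
    if n == 0:
        return 1.0
    p = sum(observed) / (2 * n)  # frequency of the counted allele
    q = 1 - p
    if p == 0 or q == 0:
        return 1.0  # monomorphic SNP: HWE cannot be tested
    expected = [n * q * q, 2 * n * p * q, n * p * p]
    counts = [observed.count(k) for k in (0, 1, 2)]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(counts, expected))
    # Survival function of a chi-square variable with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))


def qc_filter(genotypes, max_ind_missing=0.10, max_snp_missing=0.05,
              min_hwe_p=0.001, min_maf=0.05):
    """Return (kept individual indices, kept SNP indices).

    genotypes[i][j] is the call for individual i at SNP j. The default
    thresholds are the example values quoted in the text.
    """
    # Step 1: drop individuals with too many missing calls.
    keep_ind = [i for i, row in enumerate(genotypes)
                if missing_rate(row) <= max_ind_missing]
    if not keep_ind:
        return [], []
    # Step 2: apply the three SNP-level criteria to the remaining individuals.
    keep_snp = []
    for j in range(len(genotypes[0])):
        column = [genotypes[i][j] for i in keep_ind]
        if missing_rate(column) > max_snp_missing:
            continue
        if hwe_chisq_pvalue(column) < min_hwe_p:
            continue
        if minor_allele_frequency(column) < min_maf:
            continue
        keep_snp.append(j)
    return keep_ind, keep_snp
```

Because SNP-level statistics are computed only on the retained individuals, the order of the two steps can change which SNPs survive. For real data, a dedicated tool such as PLINK (Purcell et al., 2007) would normally be preferred, and the HWE step should be skipped for haploid organisms.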

References

Ashburner, M., Ball, C.A., Blake, J.A., et al. (2000). Gene ontology: tool for the unification of biology. Nature Genet., 25, 25–29.
Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Statist. Soc. Ser. B (Method.), 57, 289–300.
Bohnert, R. and Rätsch, G. (2010). rQuant.web: a tool for RNA-Seq-based transcript quantitation. Nucleic Acids Res., 38, W348–W351.
Bolstad, B.M., Irizarry, R.A., Åstrand, M. and Speed, T.P. (2003). A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics, 19, 185–193.
Brem, R.B., Yvert, G., Clinton, R. and Kruglyak, L. (2002). Genetic dissection of transcriptional regulation in budding yeast. Science, 296, 752–755.
Broman, K.W., Wu, H., Sen, Ś. and Churchill, G.A. (2003). R/QTL: QTL mapping in experimental crosses. Bioinformatics, 19, 889–890.
Browning, S.R. and Browning, B.L. (2007). Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering. Am. J. Hum. Genet., 81, 1084–1097.
Cai, T.T., Li, H., Liu, W. and Xie, J. (2013). Covariate-adjusted precision matrix estimation with an application in genetical genomics. Biometrika, 100, 139–156.
Carey, V.J. (2013). GGtools: Genetics of Gene Expression with Bioconductor. R package version 4.6.2.
Chen, L.S., Sangurdekar, D.P. and Storey, J.D. (2011). trigger: Transcriptional Regulatory Inference from Genetics of Gene ExpRession. R package version 1.4.0.
Chen, M., Ren, Z., Zhao, H. and Zhou, H. (2015). Asymptotic normal estimation of covariate-adjusted Gaussian graphical model. J. Am. Statist. Ass. Theory Meth. (in press).
Da Huang, W., Sherman, B.T. and Lempicki, R.A. (2008). Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nature Protocols, 4, 44–57.
Delaneau, O., Zagury, J.-F. and Marchini, J. (2012). Improved whole-chromosome phasing for disease and population genetic studies. Nature Meth., 10, 5–6.
Dillies, M.-A., Rau, A., Aubert, J., et al. (2013). A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis. Brief. Bioinform., 14, 671–683.
Dunning, M.J., Smith, M.L., Ritchie, M.E. and Tavaré, S. (2007). beadarray: R classes and methods for Illumina bead-based data. Bioinformatics, 23, 2183–2184.
Eden, E., Navon, R., Steinfeld, I., Lipson, D. and Yakhini, Z. (2009). GOrilla: a tool for discovery and visualization of enriched GO terms in ranked gene lists. BMC Bioinform., 10, 48.
Fusi, N., Stegle, O. and Lawrence, N.D. (2012). Joint modelling of confounding factors and prominent genetic regulators provides increased accuracy in genetical genomics studies. PLoS Comput. Biol., 8, e1002330.
Gagnon-Bartsch, J.A. and Speed, T.P. (2012). Using control genes to correct for unwanted variation in microarray data. Biostatistics, 13, 539–552.
Gautier, L., Cope, L., Bolstad, B.M. and Irizarry, R.A. (2004). affy – analysis of Affymetrix GeneChip data at the probe level. Bioinformatics, 20, 307–315.
Guttman, M., Garber, M., Levin, J.Z., et al. (2010). Ab initio reconstruction of cell type-specific transcriptomes in mouse reveals the conserved multi-exonic structure of lincRNAs. Nature Biotechnol., 28, 503–510.
Haley, C.S., Knott, S.A. and Elsen, J. (1994). Mapping quantitative trait loci in crosses between outbred lines using least squares. Genetics, 136, 1195–1207.
Hamel, L.-P., Nicole, M.-C., Duplessis, S. and Ellis, B.E. (2012). Mitogen-activated protein kinase signaling in plant-interacting fungi: distinct messages from conserved messengers. Plant Cell Online, 24, 1327–1351.
Johnson, W.E., Li, C. and Rabinovic, A. (2007). Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics, 8, 118–127.
Kanehisa, M. and Goto, S. (2000). KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res., 28, 27–30.
Kang, H.M., Ye, C. and Eskin, E. (2008). Accurate discovery of expression quantitative trait loci under confounding from spurious and genuine regulatory hotspots. Genetics, 180, 1909–1925.
Katz, Y., Wang, E.T., Airoldi, E.M. and Burge, C.B. (2010). Analysis and design of RNA sequencing experiments for identifying isoform regulation. Nature Meth., 7, 1009–1015.
Kim, D., Pertea, G., Trapnell, C., et al. (2013). TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol., 14, R36.
Kim, S. and Xing, E.P. (2012). Tree-guided group lasso for multi-response regression with structured sparsity, with an application to eQTL mapping. Ann. Appl. Statist., 6, 1095–1117.
Lander, E.S. and Botstein, D. (1989). Mapping Mendelian factors underlying quantitative traits using RFLP linkage maps. Genetics, 121, 185–199.
Lee, S., Zhu, J. and Xing, E.P. (2010). Adaptive multi-task lasso: with application to eQTL detection. In Advances in neural information processing systems, pp. 1306–1314.
Leek, J.T., Scharpf, R.B., Bravo, H.C., et al. (2010). Tackling the widespread and critical impact of batch effects in high-throughput data. Nature Rev. Genet., 11, 733–739.
Li, B., Ruotti, V., Stewart, R.M., Thomson, J.A. and Dewey, C.N. (2010a). RNA-Seq gene expression estimation with read mapping uncertainty. Bioinformatics, 26, 493–500.
Li, B., Chun, H. and Zhao, H. (2012a). Sparse estimation of conditional graphical models with application to gene networks. J. Am. Statist. Ass., 107, 152–167.
Li, C. and Wong, W.H. (2001). Model-based analysis of oligonucleotide arrays: model validation, design issues and standard error application. Genome Biol., 2, 1–11.
Li, J.J., Jiang, C.-R., Brown, J.B., Huang, H. and Bickel, P.J. (2011). Sparse linear modeling of next-generation mRNA sequencing (RNA-Seq) data for isoform discovery and abundance estimation. Proc. Natl Acad. Sci. USA, 108, 19867–19872.
Li, L., Zhang, X. and Zhao, H. (2012b). eQTL. In Quantitative Trait Loci (QTL). Springer, pp. 265–279.
Li, Y., Álvarez, O.A., Gutteling, E.W., et al. (2006). Mapping determinants of gene expression plasticity by genetical genomics in C. elegans. PLoS Genet., 2, e222.
Li, Y., Willer, C.J., Ding, J., Scheet, P. and Abecasis, G.R. (2010b). MaCH: using sequence and genotype data to estimate haplotypes and unobserved genotypes. Genet. Epidemiol., 34, 816–834.
Listgarten, J., Kadie, C., Schadt, E.E. and Heckerman, D. (2010). Correction for hidden confounders in the genetic analysis of gene expression. Proc. Natl Acad. Sci. USA, 107, 16465–16470.
Mackay, T.F., Stone, E.A. and Ayroles, J.F. (2009). The genetics of quantitative traits: challenges and prospects. Nature Rev. Genet., 10, 565–577.
Manolio, T.A. (2013). Bringing genome-wide association findings into clinical use. Nature Rev. Genet., 14, 549–558.
McCarthy, M.I., Abecasis, G.R., Cardon, L.R., et al. (2008). Genome-wide association studies for complex traits: consensus, uncertainty and challenges. Nature Rev. Genet., 9, 356–369.
Michaelson, J.J., Loguercio, S. and Beyer, A. (2009). Detection and interpretation of expression quantitative trait loci (eQTL). Methods, 48, 265–276.
Mortazavi, A., Williams, B.A., McCue, K., Schaeffer, L. and Wold, B. (2008). Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nature Meth., 5, 621–628.
Nica, A.C. and Dermitzakis, E.T. (2008). Using gene expression to investigate the genetic basis of complex disorders. Hum. Molec. Genet., 17, R129–R134.
Obozinski, G., Wainwright, M.J. and Jordan, M.I. (2011). Support union recovery in high-dimensional multivariate regression. Ann. Statist., 39, 1–47.
Pastinen, T., Ge, B. and Hudson, T.J. (2006). Influence of human genome polymorphism on gene expression. Hum. Molec. Genet., 15, R9–R16.
Price, A.L., Patterson, N.J., Plenge, R.M., et al. (2006). Principal components analysis corrects for stratification in genome-wide association studies. Nature Genet., 38, 904–909.
Purcell, S., Neale, B., Todd-Brown, K., et al. (2007). PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet., 81, 559–575.
Richard, H., Schulz, M.H., Sultan, M., et al. (2010). Prediction of alternative isoforms from exon expression levels in RNA-Seq experiments. Nucleic Acids Res., 38, e112–e112.
Robertson, G., Schein, J., Chiu, R., et al. (2010). De novo assembly and analysis of RNA-Seq data. Nature Meth., 7, 909–912.
Rockman, M.V. and Kruglyak, L. (2006). Genetics of global gene expression. Nature Rev. Genet., 7, 862–872.
Salzman, J., Jiang, H. and Wong, W.H. (2011). Statistical modeling of RNA-Seq data. Statist. Sci., 26, 62–83.
Schadt, E.E., Lamb, J., Yang, X., et al. (2005). An integrative genomics approach to infer causal associations between gene expression and disease. Nature Genet., 37, 710–717.
Shabalin, A.A. (2012). Matrix eQTL: ultra fast eQTL analysis via large matrix operations. Bioinformatics, 28, 1353–1358.
Simon, N., Friedman, J., Hastie, T. and Tibshirani, R. (2013). A sparse-group lasso. J. Comput. Graph. Statist., 22, 231–245.
Smyth, G.K. (2005). Limma: linear models for microarray data. In Gentleman, R., Carey, V., Dudoit, S., Irizarry, R. and Huber, W. (Eds.), Bioinformatics and Computational Biology Solutions Using R and Bioconductor. Springer, New York, NY, pp. 397–420.
Stojmirović, A. and Yu, Y.-K. (2009). ITM probe: analyzing information flow in protein networks. Bioinformatics, 25, 2447–2449.
Stojmirović, A. and Yu, Y.-K. (2012). Information flow in interaction networks II: channels, path lengths, and potentials. J. Comput. Biol., 19, 379–403.
Sun, T. and Zhang, C.-H. (2012). Scaled sparse linear regression. Biometrika, 99, 879–898.
Sun, W. and Hu, Y. (2013). eQTL mapping using RNA-seq data. Statist. Biosci., 5, 198–219.
Suthram, S., Beyer, A., Karp, R.M., Eldar, Y. and Ideker, T. (2008). eQED: an efficient method for interpreting eQTL associations using protein networks. Molec. Syst. Biol., 4, 162.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. R. Statist. Soc. Ser. B (Method.), 58, 267–288.
Trapnell, C., Williams, B.A., Pertea, G., et al. (2010). Transcript assembly and quantification by RNA-seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature Biotechnol., 28, 511–515.
Tu, Z., Wang, L., Arbeitman, M.N., Chen, T. and Sun, F. (2006). An integrative approach for causal gene identification and gene regulatory pathway inference. Bioinformatics, 22, e489–e496.
Verbeke, L.P., Cloots, L., Demeester, P., Fostier, J. and Marchal, K. (2013). Epsilon: an eQTL prioritization framework using similarity measures derived from local networks. Bioinformatics, 29, 1308–1316.
Voevodski, K., Teng, S.-H. and Xia, Y. (2009). Spectral affinity in protein networks. BMC Syst. Biol., 3, 112.
Wang, X., Qin, L., Zhang, H., et al. (2015). A regularized multivariate regression approach for eQTL analysis. Statist. Biosci., 7, 129–146.
Witte, J.S. (2010). Genome-wide association studies and beyond. Annu. Rev. Publ. Health, 31, 9–20.
Xia, Z., Wen, J., Chang, C.-C. and Zhou, X. (2011). NSMAP: A method for spliced isoforms identification and quantification from RNA-Seq. BMC Bioinform., 12, 162.
Yang, C., Wang, L., Zhang, S. and Zhao, H. (2013). Accounting for non-genetic factors by low-rank representation and sparse regression for eQTL mapping. Bioinformatics, 29, 1026–1034.
Yin, J. and Li, H. (2011). A sparse conditional Gaussian graphical model for analysis of genetical genomics data. Ann. Appl. Statist., 5, 2630.
Zerbino, D.R. and Birney, E. (2008). Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res., 18, 821–829.
Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. J. R. Statist. Soc. Ser. B (Statist. Method.), 67, 301–320.
Zou, W., Aylor, D.L. and Zeng, Z.-B. (2007). eQTL viewer: visualizing how sequence variation affects genome-wide transcription. BMC Bioinform., 8, 7.