  • Print publication year: 2016
  • Online publication date: December 2015

15 - Large-scale correlation mining for biomolecular network discovery

from Part IV - Big data over biological networks


Continuing advances in high-throughput mRNA probing, gene sequencing, and microscopic imaging technology are producing a wealth of biomarker data on many different living organisms and conditions. Scientists hope that increasing amounts of relevant data will eventually lead to a better understanding of the network of interactions between the thousands of molecules that regulate these organisms. Thus progress in understanding the biological science has become increasingly dependent on progress in understanding the data science. Data-mining tools have been of particular relevance since they can sometimes be used to effectively separate the “wheat” from the “chaff”, winnowing the massive amount of data down to a few important data dimensions. Correlation mining is a data-mining tool that is particularly useful for probing statistical correlations between biomarkers and recovering properties of their correlation networks. However, since the number of correlations between biomarkers grows quadratically with the number of biomarkers, the scalability of correlation mining in the big data setting becomes an issue. Furthermore, there are phase transitions that govern correlation mining discoveries, and these must be understood for the discoveries to be reliable and of high confidence. This is especially important at big data scales where the number of samples is fixed and the number of biomarkers becomes unbounded, a sampling regime referred to as the “purely high-dimensional setting”. In this chapter, we will discuss some of the main advances and challenges in correlation mining in the context of large-scale biomolecular networks with a focus on medicine. A new correlation mining application will be introduced: discovery of correlation sign flips between edges in a pair of correlation or partial correlation networks. The pair of networks could respectively correspond to a disease (or treatment) group and a control group.
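
To make the sign-flip application concrete, the following is a minimal Python sketch and not the chapter's algorithm. It assumes each group is stored as a samples-by-biomarkers NumPy array, uses plain sample (Pearson) correlations, and applies a fixed screening threshold `rho`. The function name `correlation_sign_flips`, the threshold value, and the synthetic data are all hypothetical choices made for illustration. The sketch forms the full p × p correlation matrix, so it incurs the quadratic cost in the number of biomarkers noted above and is only practical for small p.

```python
import numpy as np


def correlation_sign_flips(x_case, x_control, rho=0.5):
    """Naive sign-flip screen (illustrative only).

    x_case, x_control: arrays of shape (n_samples, p), same p in both groups.
    Returns the pairs (i, j), i < j, whose sample correlation has magnitude
    at least rho in both groups but opposite sign.
    """
    r_case = np.corrcoef(x_case, rowvar=False)      # p x p sample correlations
    r_ctrl = np.corrcoef(x_control, rowvar=False)
    p = r_case.shape[0]
    rows, cols = np.triu_indices(p, k=1)            # the p(p-1)/2 distinct pairs
    a, b = r_case[rows, cols], r_ctrl[rows, cols]
    flipped = (np.abs(a) >= rho) & (np.abs(b) >= rho) & (a * b < 0)
    return [(int(i), int(j)) for i, j in zip(rows[flipped], cols[flipped])]


# Synthetic example (assumed data): biomarkers 0 and 1 are positively
# correlated in the case group and negatively correlated in the controls.
rng = np.random.default_rng(0)
n, p = 50, 20
x_case = rng.standard_normal((n, p))
x_control = rng.standard_normal((n, p))
x_case[:, 1] = x_case[:, 0] + 0.1 * rng.standard_normal(n)
x_control[:, 1] = -x_control[:, 0] + 0.1 * rng.standard_normal(n)

print(p * (p - 1) // 2)  # number of candidate pairs: 190
print(correlation_sign_flips(x_case, x_control, rho=0.9))
```

A fixed threshold such as `rho=0.9` is used here only to keep the example short. In the purely high-dimensional setting, with few samples and many biomarkers, spuriously large sample correlations become common, so the threshold would have to be chosen with the phase-transition behavior in mind rather than by hand.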


Data mining at a large scale has matured over the past 50 years to a point where, every minute, millions of searches over billions of data dimensions are routinely handled by search engines at Google, Yahoo, LinkedIn, Facebook, Twitter, and other media. Similarly, large ontological databases like GO [1] and DAVID [2] have enabled large-scale text data mining for researchers in the life sciences [3].

[1] Gene Ontology Consortium, “The gene ontology (GO) database and informatics resource,” Nucleic Acids Research, vol. 32, no. suppl 1, pp. D258–D261, 2004.
[2] G. Dennis Jr, B. T. Sherman, D. A. Hosack, et al., “DAVID: database for annotation, visualization, and integrated discovery,” Genome Biol., vol. 4, no. 5, p. P3, 2003.
[3] M. Ashburner, C. A. Ball, J. A. Blake, et al., “Gene ontology: tool for the unification of biology,” Nature Genetics, vol. 25, no. 1, pp. 25–29, 2000.
[4] E. W. Sayers, T. Barrett, D. A. Benson, et al., “Database resources of the National Center for Biotechnology Information,” Nucleic Acids Research, vol. 39, no. suppl 1, pp. D38–D51, 2011.
[5] C. F. Schaefer, K. Anthony, S. Krupa, et al., “PID: the pathway interaction database,” Nucleic Acids Research, vol. 37, no. suppl 1, pp. D674–D679, 2009.
[6] E. G. Cerami, B. E. Gross, E. Demir, et al., “Pathway commons, a web resource for biological pathway data,” Nucleic Acids Research, vol. 39, no. suppl 1, pp. D685–D690, 2011. [Online]. Available:
[7] J. D. Allen, Y. Xie, M. Chen, L. Girard, and G. Xiao, “Comparing statistical methods for constructing large scale gene networks,” PLoS One, vol. 7, no. 1, p. e29348, 2012.
[8] C. Jiang, F. Coenen, and M. Zito, “A survey of frequent subgraph mining algorithms,” The Knowledge Engineering Review, vol. 28, no. 01, pp. 75–105, 2013.
[9] A. Hero and B. Rajaratnam, “Large-scale correlation screening,” Journal of the American Statistical Association, vol. 106, no. 496, pp. 1540–1552, 2011.
[10] A. Hero and B. Rajaratnam, “Hub discovery in partial correlation models,” IEEE Transactions on Information Theory, vol. 58, no. 9, pp. 6064–6078, 2012, available as arXiv preprint arXiv:1109.6846.
[11] P. Bühlmann and S. van de Geer, Statistics for High-Dimensional Data: Methods, Theory and Applications, Springer, 2011.
[12] R. Fisher, “On the mathematical foundations of theoretical statistics,” Philosophical Transactions of the Royal Society of London, Series A, vol. 222, pp. 309–368, 1922.
[13] R. Fisher, “Theory of statistical estimation,” Proceedings of the Cambridge Philosophical Society, vol. 22, pp. 700–725, 1925.
[14] C. Rao, “Large sample tests of statistical hypotheses concerning several parameters with applications to problems of estimation,” Mathematical Proceedings of the Cambridge Philosophical Society, vol. 44, pp. 50–57, 1947.
[15] C. Rao, “Criteria of estimation in large samples,” Sankhyā: The Indian Journal of Statistics, Series A, vol. 25, pp. 189–206, 1963.
[16] J. Neyman and E. Pearson, “On the problem of the most efficient tests of statistical hypotheses,” Philosophical Transactions of the Royal Society of London, Series A, vol. 231, pp. 289–337, 1933.
[17] S. Wilks, “The large-sample distribution of the likelihood ratio for testing composite hypotheses,” Annals of Mathematical Statistics, vol. 9, pp. 60–62, 1938.
[18] A. Wald, “Asymptotically most powerful tests of statistical hypotheses,” Annals of Mathematical Statistics, vol. 12, pp. 1–19, 1941.
[19] A. Wald, “Some examples of asymptotically most powerful tests,” Annals of Mathematical Statistics, vol. 12, pp. 396–408, 1941.
[20] A. Wald, “Tests of statistical hypotheses concerning several parameters when the number of observations is large,” Transactions of the American Mathematical Society, vol. 54, pp. 426–482, 1943.
[21] A. Wald, “Note on the consistency of the maximum likelihood estimate,” Annals of Mathematical Statistics, vol. 20, pp. 595–601, 1949.
[22] H. Cramér, Mathematical Methods of Statistics, Princeton, NJ: Princeton University Press, 1946.
[23] H. Cramér, “A contribution to the theory of statistical estimation,” Scandinavian Actuarial Journal, vol. 29, pp. 85–94, 1946.
[24] L. Le Cam, “On some asymptotic properties of maximum likelihood estimates and related Bayes’ estimates,” University of California Publications in Statistics, vol. 1, pp. 277–330, 1953.
[25] L. Le Cam, Asymptotic Methods in Statistical Decision Theory, New York: Springer-Verlag, 1986.
[26] H. Chernoff, “Large-sample theory: parametric case,” Annals of Mathematical Statistics, vol. 27, pp. 1–22, 1956.
[27] J. Kiefer and J. Wolfowitz, “Consistency of the maximum likelihood estimator in the presence of infinitely many incidental parameters,” Annals of Mathematical Statistics, vol. 27, pp. 887–906, 1956.
[28] R. Bahadur, “Rates of convergence of estimates and test statistics,” Annals of Mathematical Statistics, vol. 38, pp. 303–324, 1967.
[29] B. Efron, “Maximum likelihood and decision theory,” Annals of Statistics, vol. 10, pp. 340–356, 1982.
[30] D. Donoho, “For most large underdetermined systems of linear equations the minimal l1-norm solution is also the sparsest solution,” Communications on Pure and Applied Mathematics, vol. 59, pp. 797–829, 2006.
[31] P. Zhao and B. Yu, “On model selection consistency of Lasso,” Journal of Machine Learning Research, vol. 7, pp. 2541–2563, 2006.
[32] N. Meinshausen and P. Bühlmann, “High-dimensional graphs and variable selection with the lasso,” Annals of Statistics, vol. 34, no. 3, pp. 1436–1462, June 2006.
[33] E. Candès and T. Tao, “The Dantzig selector: statistical estimation when p is much larger than n,” Annals of Statistics, vol. 35, pp. 2313–2351, 2007.
[34] P. Bickel, Y. Ritov, and A. Tsybakov, “Simultaneous analysis of Lasso and Dantzig selector,” Annals of Statistics, vol. 37, pp. 1705–1732, 2009.
[35] J. Peng, P. Wang, N. Zhou, and J. Zhu, “Partial correlation estimation by joint sparse regression models,” Journal of the American Statistical Association, vol. 104, no. 486, 2009.
[36] M. Wainwright, “Information-theoretic limitations on sparsity recovery in the high-dimensional and noisy setting,” IEEE Transactions on Information Theory, vol. 55, pp. 5728–5741, 2009.
[37] M. Wainwright, “Sharp thresholds for high-dimensional and noisy sparsity recovery using l1-constrained quadratic programming (Lasso),” IEEE Transactions on Information Theory, vol. 55, pp. 2183–2202, 2009.
[38] K. Khare, S. Oh, and B. Rajaratnam, “A convex pseudo-likelihood framework for high dimensional partial correlation estimation with convergence guarantees,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), to appear, 2014. [Online]. Available:
[39] H. Firouzi, A. Hero, and B. Rajaratnam, “Variable selection for ultra high dimensional regression,” Technical Report, University of Michigan and Stanford University, 2014.
[40] B. Mole, “The gene sequencing future is here,” Science News, February 6, 2014. [Online]. Available:
[41] K. A. Wetterstrand, “DNA sequencing costs: data from the NHGRI Genome Sequencing Program (GSP),” August 22, 2014. [Online]. Available: gene-sequencing-future-here
[42] A. Zaas, M. Chen, J. Varkey, et al., “Gene expression signatures diagnose influenza and other symptomatic respiratory viral infections in humans,” Cell Host & Microbe, vol. 6, no. 3, pp. 207–217, 2009.
[43] Y. Huang, A. Zaas, A. Rao, et al., “Temporal dynamics of host molecular responses differentiate symptomatic and asymptomatic influenza A infection,” PLoS Genet, vol. 7, no. 8, p. e1002234, 2011.
[44] P. J. Bickel and K. A. Doksum, Mathematical Statistics: Basic Ideas and Selected Topics, Holden-Day, San Francisco, 1977.
[45] H. Jeong, S. P. Mason, A.-L. Barabasi, and Z. N. Oltvai, “Lethality and centrality in protein networks,” Nature, vol. 411, no. 6833, pp. 41–42, May 2001. [Online]. Available: n6833/abs/411041a0.html
[46] M. C. Oldham, S. Horvath, and D. H. Geschwind, “Conservation and evolution of gene coexpression networks in human and chimpanzee brains,” Proceedings of the National Academy of Sciences, vol. 103, no. 47, pp. 17973–17978, November 2006. [Online]. Available:
[47] P. Langfelder and S. Horvath, “WGCNA: an R package for weighted correlation network analysis,” BMC Bioinformatics, vol. 9, no. 1, p. 559, January 2008. [Online]. Available:
[48] A. Li and S. Horvath, “Network neighborhood analysis with the multi-node topological overlap measure,” Bioinformatics (Oxford, England), vol. 23, no. 2, pp. 222–231, January 2007. [Online]. Available:
[49] L. Wu, C. Zhang, and J. Zhang, “HMBOX1 negatively regulates NK cell functions by suppressing the NKG2D/DAP10 signaling pathway,” Cellular & Molecular Immunology, vol. 8, no. 5, pp. 433–440, 2011.
[50] A. Y. Istomin and A. Godzik, “Understanding diversity of human innate immunity receptors: analysis of surface features of leucine-rich repeat domains in NLRs and TLRs,” BMC Immunology, vol. 10, no. 1, p. 48, 2009.
[51] S. L. Lauritzen, Graphical Models, Oxford University Press, 1996.
[52] A. Dempster, “Covariance selection,” Biometrics, vol. 28, no. 1, pp. 157–175, 1972.
[53] J. Friedman, T. Hastie, and R. Tibshirani, “Sparse inverse covariance estimation with the graphical lasso,” Biostatistics, vol. 9, no. 3, pp. 432–441, 2008.
[54] B. Rajaratnam, H. Massam, and C. Carvalho, “Flexible covariance estimation in graphical Gaussian models,” Annals of Statistics, vol. 36, pp. 2818–2849, 2008.
[55] K. Khare and B. Rajaratnam, “Wishart distributions for decomposable covariance graph models,” The Annals of Statistics, vol. 39, no. 1, pp. 514–555, Mar. 2011. [Online]. Available:
[56] A. J. Rothman, P. Bickel, E. Levina, and J. Zhu, “Sparse permutation invariant covariance estimation,” Electronic Journal of Statistics, vol. 2, pp. 494–515, 2008.
[57] P. Bickel and E. Levina, “Covariance regularization via thresholding,” Annals of Statistics, vol. 34, no. 6, pp. 2577–2604, 2008.
[58] O. Banerjee, L. E. Ghaoui, and A. d'Aspremont, “Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data,” Journal of Machine Learning Research, vol. 9, pp. 485–516, March 2008.
[59] C.-J. Hsieh, M. A. Sustik, I. Dhillon, P. Ravikumar, and R. Poldrack, “Big & quick: sparse inverse covariance estimation for a million variables,” in Advances in Neural Information Processing Systems, 2013, pp. 3165–3173.
[60] D. Guillot, B. Rajaratnam, B. T. Rolfs, A. Maleki, and I. Wong, “Iterative thresholding algorithm for sparse inverse covariance estimation,” in Advances in Neural Information Processing Systems 25, 2012. [Online]. Available:
[61] O. Dalal and B. Rajaratnam, “G-AMA: sparse Gaussian graphical model estimation via alternating minimization,” Technical Report, Department of Statistics, Stanford University (in revision), 2014. [Online]. Available:
[62] G. Rocha, P. Zhao, and B. Yu, “A path following algorithm for Sparse Pseudo-Likelihood Inverse Covariance Estimation (SPLICE),” Statistics Department, UC Berkeley, Berkeley, CA, Tech. Rep., 2008. [Online]. Available: pseudo.pdf
[63] S. Oh, O. Dalal, K. Khare, and B. Rajaratnam, “Optimization methods for sparse pseudo-likelihood graphical model selection,” in Advances in Neural Information Processing Systems 27, 2014.
[64] G. Marjanovic and A. O. Hero III, “On lq estimation of sparse inverse covariance,” in Proceedings of IEEE Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, May 2014.
[65] G. Marjanovic and A. O. Hero III, “l0 sparse inverse covariance estimation,” arXiv preprint arXiv:1408.0850, 2014.
[66] T. Tsiligkaridis, A. Hero, and S. Zhou, “Convergence properties of Kronecker Graphical Lasso algorithms,” IEEE Transactions on Signal Processing (also available as arXiv:1204.0585), vol. 61, no. 7, pp. 1743–1755, 2013.
[67] R. Gill, S. Datta, and S. Datta, “A statistical framework for differential network analysis from microarray data,” BMC Bioinformatics, vol. 11, no. 1, p. 95, 2010.
[68] N. Kramer, J. Schafer, and A.-L. Boulesteix, “Regularized estimation of large-scale gene association networks using graphical Gaussian models,” BMC Bioinformatics, vol. 10, no. 384, pp. 1–24, 2009.
[69] V. Pihur, S. Datta, and S. Datta, “Reconstruction of genetic association networks from microarray data: a partial least squares approach,” Bioinformatics, vol. 24, no. 4, p. 561, 2008.
[70] D. Mount and S. Arya, “Approximate nearest neighbor code,” ~mount/ANN.
[71] J. Schäfer and K. Strimmer, “An empirical Bayes approach to inferring large-scale gene association networks,” Bioinformatics, vol. 21, no. 6, pp. 754–764, 2005.
[72] J. Friedman, T. Hastie, and R. Tibshirani, “Applications of the lasso and grouped lasso to the estimation of sparse graphical models,” 2010. [Online]. Available: http://www-stat.
[73] J. Lee and T. Hastie, “Learning the structure of mixed graphical models,” Journal of Computational and Graphical Statistics, vol. 24, pp. 230–253, 2014.
[74] K. Sricharan, A. Hero, and B. Rajaratnam, “A local dependence measure and its application to screening for high correlations in large data sets,” in Information Fusion (FUSION), 2011 Proceedings of the 14th International Conference on, IEEE, 2011, pp. 1–8.
[75] H. Firouzi, D. Wei, and A. Hero, “Spectral correlation hub screening of multivariate time series,” in Excursions in Harmonic Analysis: The February Fourier Talks at the Norbert Wiener Center, R. Balan, M. Begué, J. J. Benedetto, W. Czaja, and K. Okoudjou, Eds., Springer, 2014.
[76] B. He, R. Baird, R. Butera, A. Datta, et al., “Grand challenges in interfacing engineering with life sciences and medicine,” IEEE Transactions on Bio-Medical Engineering (BME), vol. 4, no. 4, 2013.
[77] R. Chen, G. I. Mias, J. Li-Pook-Than, et al., “Personal omics profiling reveals dynamic molecular and medical phenotypes,” Cell, vol. 148, no. 6, pp. 1293–1307, 2012.
[78] J. J. McCarthy, H. L. McLeod, and G. S. Ginsburg, “Genomic medicine: a decade of successes, challenges, and opportunities,” Science Translational Medicine, vol. 5, no. 189, p. 189sr4, 2013.
[79] J. T. Erler and R. Linding, “Network medicine strikes a blow against breast cancer,” Cell, vol. 149, no. 4, pp. 731–733, 2012.
[80] D. B. Boivin, F. O. James, A. Wu, et al., “Circadian clock genes oscillate in human peripheral blood mononuclear cells,” Blood, vol. 102, no. 12, pp. 4143–4145, 2003.
[81] H. Firouzi, D. Wei, and A. Hero, “Spatio-temporal analysis of Gaussian WSS processes via complex correlation and partial correlation screening,” in Proceedings of IEEE GlobalSIP Conference, also available as arXiv:1303.2378, 2013.
[82] J. J. Eady, G. M. Wortley, Y. M. Wormstone, et al., “Variation in gene expression profiles of peripheral blood mononuclear cells from healthy volunteers,” Physiological Genomics, vol. 22, no. 3, pp. 402–411, 2005.
[83] A. R. Whitney, M. Diehn, S. J. Popper, et al., “Individuality and variation in gene expression patterns in human blood,” Proceedings of the National Academy of Sciences, vol. 100, no. 4, pp. 1896–1901, 2003.
[84] H. Firouzi, A. Hero, and B. Rajaratnam, “Predictive correlation screening: application to two-stage predictor design in high dimension,” in Proceedings of AISTATS, also available as arXiv:1303.2378, 2013.
[85] N. Katenka, E. D. Kolaczyk, et al., “Inference and characterization of multi-attribute networks with application to computational biology,” The Annals of Applied Statistics, vol. 6, no. 3, pp. 1068–1094, 2012.
[86] S. Zhou, “Gemini: graph estimation with matrix variate normal instances,” The Annals of Statistics, vol. 42, no. 2, pp. 532–562, 2014.
[87] P. Langfelder and S. Horvath, “WGCNA: an R package for weighted correlation network analysis,” BMC Bioinformatics, vol. 9, no. 1, p. 559, 2008.
[88] D. Zhu, A. Hero, H. Cheng, R. Kanna, and A. Swaroop, “Network constrained clustering for gene microarray data,” Bioinformatics, vol. 21, no. 21, pp. 4014–4021, 2005.
[89] A. Rao and A. O. Hero, “Biological pathway inference using manifold embedding,” in Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, IEEE, 2011, pp. 5992–5995.