With omics data, results generated from single-dataset analysis are often unsatisfactory. Integrative analysis methods jointly analyze data from multiple independent studies or on multiple correlated responses; they can effectively increase power and outperform both single-dataset analysis and meta-analysis. In this chapter, we review penalized integrative analysis methods under both the homogeneity and heterogeneity models. Computation using the coordinate descent approach is described. We also discuss several important extensions. The analysis of a genome-wide association study demonstrates the applicability of the reviewed methods.
In the study of complex diseases such as cancer, cardiovascular diseases, and autoimmune diseases, profiling studies are now routinely conducted, generating “large d, small n” data, where the number d of omics features profiled (genes, SNPs, methylation loci, etc.) is much larger than the sample size n. Many different types of analyses can be conducted. For example, Chapters 3 and 4 focused on identifying meaningful networks. In this chapter, our analysis goal is to identify a small subset of omics measurements that are associated with disease outcomes or phenotypes. Such measurements are also referred to as “markers” in the literature and in this chapter. Statistically, this is a variable selection problem. The development of integrative analysis methods has been partly motivated by the following examples.
8.1.1 Example 1
Consider the analysis of data generated in multiple independent studies with comparable designs. For example, in Ma et al. (2011), four pancreatic cancer data sets were collected and analyzed. The four data sets were generated in four independent studies, all of which had a case-control design, collected mRNA gene expression measurements, and searched for genes associated with the risk of pancreatic cancer. In high-dimensional omics studies, it has been recognized that the results generated in single-data-set analysis often have unsatisfactory properties such as low reproducibility. Among many possible contributing factors, the most important one is perhaps the small sample size n. Multi-data-set analysis can effectively increase sample size and outperform single-data-set analysis (Guerra and Goldstein, 2009). This perspective has been explained in multiple chapters of this book. When the designs of multiple studies are “close enough”, it can be reasonable to expect that they identify the same set of markers.