Regularization Techniques for Highly Correlated Gene Expression Data with Unknown Group Structure

doi:10.1017/CBO9781139226448.020

19 - Regularization Techniques for Highly Correlated Gene Expression Data with Unknown Group Structure

Published online by Cambridge University Press: 05 June 2013

Brent A. Johnson

Edited by Kim-Anh Do, Zhaohui Steve Qin and Marina Vannucci

Brent A. Johnson: Emory University
Kim-Anh Do: University of Texas, MD Anderson Cancer Center
Zhaohui Steve Qin: Emory University, Atlanta
Marina Vannucci: Rice University, Houston

Summary

Introduction

In the analysis of high-dimensional genomic data, the absolute correlation among predictors routinely exceeds 0.9, or even 0.95. In the presence of such high collinearity, special techniques are needed to achieve reliable estimation and variable selection in the linear model because common techniques, such as the Lasso (Tibshirani, 1996), fail in this setting. Although some authors have offered improvements over the Lasso when certain features of the design matrix, such as grouping (Yuan and Lin, 2006) or ordering (Tibshirani et al., 2005), can be exploited, another challenging prediction problem occurs when no additional design structure is known or assumed. Several authors have tackled regularization amidst highly correlated predictors, with the “elastic net” (Zou and Hastie, 2005) being the most popular and most widely propagated among them. In this chapter, we examine deficiencies of the elastic net and argue in favor of a little-known competitor, the “Berhu” penalized least squares estimator (Owen, 2006), for high-dimensional regression analyses of genomic data.

Regularization describes a popular class of computational and statistical methods for estimation, prediction, or regression that can be applied to virtually any statistic. In regression, these methods may be summarized as biased regression tools: they accept some bias in the estimates in order to reduce prediction or classification error. These methods transcend Bayesian and frequentist paradigms and extend naturally to high-dimensional regression. Among all regularized regression estimators, ℓ1- and ℓ2-regularization are the most popular, but only the former yields a sparse solution. Despite its popularity, ℓ1-regularization has drawbacks.
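The contrast between ℓ1- and ℓ2-regularization can be illustrated with a small sketch. It is not taken from the chapter: it assumes the simplest possible setting, a single coefficient with an orthonormal design, where both penalized least squares problems have closed-form solutions. The function names, the threshold `M`, and the example values are all hypothetical. The Berhu penalty is written in the form usually attributed to Owen (2006), linear near zero and quadratic beyond `M`; the chapter's own parametrization may differ.

```python
import numpy as np

def berhu_penalty(b, M=1.0):
    """Berhu ("reverse Huber") penalty: |b| for |b| <= M, quadratic beyond M.

    Assumed form: (b^2 + M^2) / (2M) for |b| > M, which is continuous at |b| = M.
    """
    a = np.abs(b)
    return np.where(a <= M, a, (a ** 2 + M ** 2) / (2.0 * M))

def soft_threshold(z, lam):
    """Minimizer of 0.5 * (z - b)^2 + lam * |b| (the l1 / Lasso case)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def ridge_shrink(z, lam):
    """Minimizer of 0.5 * (z - b)^2 + 0.5 * lam * b^2 (the l2 / ridge case)."""
    return z / (1.0 + lam)

# Hypothetical unpenalized coefficient estimates.
z = np.array([-2.0, -0.3, 0.4, 1.5])
lam = 0.5

lasso_est = soft_threshold(z, lam)   # small coefficients are set exactly to zero
ridge_est = ridge_shrink(z, lam)     # every coefficient shrinks, none reaches zero

print("lasso:", lasso_est, "nonzero:", np.count_nonzero(lasso_est))
print("ridge:", ridge_est, "nonzero:", np.count_nonzero(ridge_est))
print("berhu:", berhu_penalty(z, M=1.0))
```

In this toy setting the ℓ1 solution zeroes out the two coefficients whose magnitude is below `lam`, while the ℓ2 solution only rescales them. The Berhu penalty behaves like ℓ1 for small coefficients and like ℓ2 for large ones.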

Type: Chapter
Information: Advances in Statistical Bioinformatics: Models and Integrative Inference for High-Throughput Data, pp. 382-397

DOI: https://doi.org/10.1017/CBO9781139226448.020

Publisher: Cambridge University Press

Print publication year: 2013
