Let the Data Do the Talking: Hypothesis Discovery from Large-Scale Data Sets in Real Time

Christopher Oehmen; Scott Dowson; Wes Hatley; Justin Almquist; Bobbie-Jo Webb-Robertson; Jason McDermott; Ian Gorton; Lee Ann McCue

doi:10.1017/CBO9780511844409.009

9 - Let the Data Do the Talking: Hypothesis Discovery from Large-Scale Data Sets in Real Time

Published online by Cambridge University Press: 05 December 2012

Edited by Ian Gorton and Deborah K. Gracio

Author affiliations:
Christopher Oehmen: Pacific Northwest National Laboratory
Scott Dowson: Pacific Northwest National Laboratory
Wes Hatley: Future Point Systems
Justin Almquist: Pacific Northwest National Laboratory
Bobbie-Jo Webb-Robertson: Pacific Northwest National Laboratory
Jason McDermott: Pacific Northwest National Laboratory
Ian Gorton: Pacific Northwest National Laboratory
Lee Ann McCue: Pacific Northwest National Laboratory

Editor affiliations:
Ian Gorton: Pacific Northwest National Laboratory, Washington
Deborah K. Gracio: Pacific Northwest National Laboratory, Washington

Summary

Discovering Biological Mechanisms through Exploration

The availability of massive amounts of data in the biological sciences is forcing us to rethink the role of hypothesis-driven investigation in modern research. Soon thousands, if not millions, of whole-genome DNA and protein sequence data sets will be available, thanks to continued improvements in high-throughput sequencing and analysis technologies. At the same time, high-throughput experimental platforms for gene expression, protein and protein fragment measurements, and other data types are driving experimental data sets to extreme scales. As a result, the biological sciences are undergoing a paradigm shift from hypothesis-driven to data-driven scientific exploration. In hypothesis-driven research, one begins with observations, formulates a hypothesis, and then tests that hypothesis in controlled experiments. In a data-rich environment, however, one often begins with only a cursory hypothesis (for example, that some class of molecular components is related to a cellular process), which may require rapidly evaluating hundreds or thousands of specific hypotheses. Testing that many hypotheses in physical experiments is generally intractable. Often, however, data can be brought to bear to rapidly evaluate and refine these candidate hypotheses into a small number of testable ones. The amount of data required to discover and refine a hypothesis in this way often overwhelms conventional analysis software and hardware. Advanced hardware can help in principle, but conventional batch-mode access models for high-performance computing are not amenable to real-time analysis in larger workflows. We present a model for a real-time, data-intensive hypothesis discovery process that unites parallel software applications, high-performance hardware, and visual representation of the output.

Type: Chapter
Information: Data-Intensive Computing: Architectures, Algorithms, and Applications, pp. 235–257

DOI: https://doi.org/10.1017/CBO9780511844409.009

Publisher: Cambridge University Press

Print publication year: 2012
