9 - Let the Data Do the Talking: Hypothesis Discovery from Large-Scale Data Sets in Real Time

Published online by Cambridge University Press:  05 December 2012

Christopher Oehmen
Affiliation:
Pacific Northwest National Laboratory
Scott Dowson
Affiliation:
Pacific Northwest National Laboratory
Wes Hatley
Affiliation:
Future Point Systems
Justin Almquist
Affiliation:
Pacific Northwest National Laboratory
Bobbie-Jo Webb-Robertson
Affiliation:
Pacific Northwest National Laboratory
Jason McDermott
Affiliation:
Pacific Northwest National Laboratory
Ian Gorton
Affiliation:
Pacific Northwest National Laboratory
Lee Ann McCue
Affiliation:
Pacific Northwest National Laboratory
Ian Gorton
Affiliation:
Pacific Northwest National Laboratory, Washington
Deborah K. Gracio
Affiliation:
Pacific Northwest National Laboratory, Washington

Summary

Discovering Biological Mechanisms through Exploration

The availability of massive amounts of data in the biological sciences is forcing us to rethink the role of hypothesis-driven investigation in modern research. Soon thousands, if not millions, of whole-genome DNA and protein sequence data sets will be available thanks to continued improvements in high-throughput sequencing and analysis technologies. At the same time, high-throughput experimental platforms for gene expression, protein and protein fragment measurements, and others are driving experimental data sets to extreme scales. As a result, the biological sciences are undergoing a paradigm shift from hypothesis-driven to data-driven scientific exploration. In hypothesis-driven research, one begins with observations, formulates a hypothesis, then tests that hypothesis in controlled experiments. In a data-rich environment, however, one often begins with only a cursory hypothesis (for example, that some class of molecular components is related to a cellular process), which may require rapidly evaluating hundreds or thousands of specific hypotheses. Testing this many hypotheses in physical experiments is generally intractable. Often, however, data can be brought to bear to rapidly evaluate and refine these candidate hypotheses into a small number of testable ones. The amount of data required to discover and refine a hypothesis in this way often overwhelms conventional analysis software and hardware. Advanced hardware can help, but conventional batch-mode access models for high-performance computing are not amenable to real-time analysis in larger workflows. We present a model for a real-time, data-intensive hypothesis discovery process that unites parallel software applications, high-performance hardware, and visual representation of the output.

Type
Chapter
Information
Data-Intensive Computing: Architectures, Algorithms, and Applications, pp. 235-257
Publisher: Cambridge University Press
Print publication year: 2012
