Book contents
- Frontmatter
- Contents
- Foreword by Steven Salzberg
- Preface
- Acknowledgements
- 1 Introduction
- 2 Mathematical preliminaries
- 3 Overview of computational gene prediction
- 4 Gene finder evaluation
- 5 A toy exon finder
- 6 Hidden Markov models
- 7 Signal and content sensors
- 8 Generalized hidden Markov models
- 9 Comparative gene finding
- 10 Machine-learning methods
- 11 Tips and tricks
- 12 Advanced topics
- Appendix
- References
- Index
3 - Overview of computational gene prediction
Published online by Cambridge University Press: 05 June 2012
Summary
In this chapter we develop a conceptual framework that describes the gene prediction problem from a computational perspective. Our goal is to present the overall problem at a high level, but in a concrete way, so that the necessities and compromises of the computational methods introduced in the chapters ahead can be seen in light of the practical realities of the problem. Comparing the material in this chapter with the description of the underlying biology given in Chapter 1 should highlight the gulf that has yet to be crossed between the goals of genome annotation and the current state of the art in computational gene prediction.
Genes, exons, and coding segments
The common substrate for gene finding is the DNA sequence produced by the genome sequencing and assembly processes. As described in Chapter 1, the raw trace files produced by the sequencing machines are processed by a base-caller, a program that infers the most likely nucleotide at each position in a fragment from the levels of the fluorescent dyes measured by the sequencing machine. The nucleotide sequence fragments produced by the base-caller are then fed to an assembler, a program that combines fragments into longer DNA sequences called contigs. Contigs are generally stored in FASTA files. Figure 3.1 shows an example FASTA file.
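To make the FASTA format concrete, here is a minimal sketch of a reader for it. It assumes the usual FASTA convention: each record starts with a definition line beginning with `>`, followed by one or more lines of sequence. The function name `parse_fasta` and the two contigs in `example` are made up for illustration and are not taken from the book or from Figure 3.1.

```python
from typing import Iterable, Iterator, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (defline, sequence) pairs from the lines of a FASTA file."""
    header = None   # definition line of the record being read
    chunks = []     # sequence lines collected for that record
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new definition line closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            # Sequence may span many lines; normalize to upper case.
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical two-contig input, for illustration only.
example = """>contig1 hypothetical example
ACGTTGCA
ggctta
>contig2
NNACGT
""".splitlines()

contigs = dict(parse_fasta(example))
for name, seq in contigs.items():
    print(name, len(seq))
```

Because the function accepts any iterable of lines, an open file handle can be passed in place of `example`, so a large assembly can be read one line at a time.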
- Type: Chapter
- Information: Methods for Computational Gene Prediction, pp. 83-103. Publisher: Cambridge University Press. Print publication year: 2007