Methods for Computational Gene Prediction

William H. Majoros

doi:10.1017/CBO9780511811135

References

Aha, D. W. and Bankert, R. L. (1996) A comparative evaluation of sequential feature selection algorithms. In Fisher, D. and Lenz, H.-J. (eds.) Learning from Data, pp. 199–206. New York: Springer.

Aho, A. V., Sethi, R., and Ullman, J. D. (1986) Compilers: Principles, Techniques, and Tools. Reading, MA: Addison-Wesley.

Allen, J. E. and Salzberg, S. L. (2005) JIGSAW: integration of multiple sources of evidence for gene prediction. Bioinformatics 21:3596–3603.

Allen, J. E., Pertea, M., and Salzberg, S. L. (2004) Computational gene prediction using multiple sources of evidence. Genome Research 14:142–148.

Allen, J. E., Majoros, W. H., Pertea, M., and Salzberg, S. L. (2006) JIGSAW, GeneZilla, and GlimmerHMM: puzzling out the features of human genes in the ENCODE regions. Genome Biology 7(Suppl. 1):S9.

Alexandersson, M., Cawley, S., and Pachter, L. (2003) SLAM: cross-species gene finding and alignment with a generalized pair hidden Markov model. Genome Research 13:496–502.

Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D. J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research 25:3389–3402.

Anton, H. (1987) Elementary Linear Algebra, 5th edn. New York: John Wiley.

Apweiler, R., Attwood, T. K., Bairoch, A., Bateman, A., Birney, E., Biswas, M., Bucher, P., Cerutti, L., Corpet, F., Croning, M. D. R., Durbin, R., Falquet, L., Fleischmann, W., Gouzy, J., Hermjakob, H., Hulo, N., Jonassen, I., Kahn, D., Kanapin, A., Karavidopoulou, Y., Lopez, R., Marx, B., Mulder, N. J., Oinn, T. M., Pagni, M., Servant, F., Sigrist, C. J. A., and Zdobnov, E. M. (2001) The InterPro database, an integrated documentation resource for protein families, domains and functional sites. Nucleic Acids Research 29:37–40.

Attwood, T. K., Bradley, P., Flower, D. R., Gaulton, A., Maudling, N., Mitchell, A. L., Moulton, G., Nordle, A., Paine, K., Taylor, P., Uddin, A., and Zygouri, C. (2003) PRINTS and its automatic supplement, prePRINTS. Nucleic Acids Research 31:400–402.

Azad, R. K. and Borodovsky, M. (2004) Effects of choice of DNA sequence model structure on gene identification accuracy. Bioinformatics 20:993–1005.

Bafna, V. and Huson, D. H. (2001) The conserved exon method for gene finding. ISMB'2000, 8:3–12.

Bahl, L. R., Brown, P. F., de Souza, P. V., and Mercer, R. L. (1986) Maximum mutual information estimation of hidden Markov model parameters for speech recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 49–52.

Bailey, T. L. and Elkan, C. (1995) Unsupervised learning of multiple motifs in biopolymers using expectation maximization. Machine Learning 21:51–83.

Bairoch, A. and Apweiler, R. (1996) The SWISS-PROT protein sequence data bank and its new supplement TrEMBL. Nucleic Acids Research 24:21–25.

Bajic, V. B., Seah, S. H., Chong, A., Zhang, G., Koh, J. L. Y., and Brusic, V. (2002) Dragon promoter finder: recognition of vertebrate RNA polymerase II promoters. Bioinformatics 18:198–199.

Bajic, V. B., Tan, S. L., Suzuki, Y., and Sugano, S. (2004) Promoter prediction analysis on the whole human genome. Nature Biotechnology 22:1467–1473.

Bajic, V. B., Brent, M. R., Brown, R. H., Frankish, A., Harrow, J., Ohler, U., Solovyev, V. V., and Tan, S. L. (2006) Performance assessment of promoter predictions on ENCODE regions in the EGASP experiment. Genome Biology 7(Suppl. 1):S3.

Baker, J. K. (1979) Trainable grammars for speech recognition. In Proceedings of the Spring Conference of the Acoustical Society of America, Boston, MA, pp. 547–550.

Bartel, D. P. (2004) MicroRNAs: genomics, biogenesis, mechanism, and function. Cell 116:281–297.

Bateman, A., Coin, L., Durbin, R., Finn, R. D., Hollich, V., Griffiths-Jones, S., Khanna, A., Marshall, M., Moxon, S., Sonnhammer, E. L. L., Studholme, D. J., Yeats, C., and Eddy, S. R. (2004) The Pfam protein families database. Nucleic Acids Research 32:D138–D141.

Batzoglou, S., Pachter, L., Mesirov, J. P., Berger, B., and Lander, E. S. (2000) Human and mouse gene structure: comparative analysis and application to exon prediction. Genome Research 7:950–958.

Baum, L. E. (1972) An inequality and associated maximization technique in statistical estimation for probabilistic functions of Markov processes. Inequalities 3:1–8.

Baum, L. E., Petrie, T., Goules, G., and Weiss, N. (1970) A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Annals of Mathematical Statistics 41:164–171.

Beaudoing, E., Freier, S., Wyatt, J. R., Claverie, J.-M., and Gautheret, D. (2000) Patterns of variant polyadenylation signal usage in human genes. Genome Research 10:1001–1010.

Benson, D. A., Karsch-Mizrachi, I., Lipman, D. J., Ostell, J., and Wheeler, D. L. (2005) GenBank. Nucleic Acids Research 33:D34–D38.

Bernardi, G., Olofsson, B., Filipski, J., Zerial, M., Salinas, J., Cuny, G., Meunier-Rotival, M., and Rodier, F. (1985) The mosaic genome of warm-blooded vertebrates. Science 228:953–958.

Besemer, J., Lomsadze, A., and Borodovsky, M. (2001) GeneMarkS: a self-training method for prediction of gene starts in microbial genomes – implications for finding sequence motifs in regulatory regions. Nucleic Acids Research 29:2607–2618.

Birney, E., Andrews, D., Caccamo, M., Chen, Y., Clarke, L., Coates, G., Cox, T., Cunningham, F., Curwen, V., Cutts, T., Down, T., Durbin, R., Fernandez-Suarez, X. M., Flicek, P., Gräf, S., Hammond, M., Herrero, J., Howe, K., Iyer, V., Jekosch, K., Kähäri, A., Kasprzyk, A., Keefe, D., Kokocinski, F., Kulesha, E., London, D., Longden, I., Melsopp, C., Meidl, P., Overduin, B., Parker, A., Proctor, G., Prlic, A., Rae, M., Rios, D., Redmond, S., Schuster, M., Sealy, I., Searle, S., Severin, J., Slater, G., Smedley, D., Smith, J., Stabenau, A., Stalker, J., Trevanion, S., Ureta-Vidal, A., Vogel, J., White, S., Woodwark, C., and Hubbard, T. J. P. (2006) Ensembl 2006. Nucleic Acids Research 34:D556–D561.

Blanco, E., Parra, G., and Guigó, R. (2002) Using GENEID to identify genes. In Baxevanis, A. (ed.) Current Protocols in Bioinformatics, unit 4.3. New York: John Wiley.

Boguski, M. S., Lowe, T. M., and Tolstoshev, C. M. (1993) dbEST: database for expressed sequence tags. Nature Genetics 4:332–333.

Borodovsky, M. and McIninch, J. (1993) GENMARK: parallel gene recognition for both DNA strands. Computers and Chemistry 16:37–43.

Borodovsky, M., Rudd, K. E., and Koonin, E. V. (1994) Intrinsic and extrinsic approaches for detecting genes in a bacterial genome. Nucleic Acids Research 22:4756–4767.

Bouchard, G. and Triggs, B. (2004) The trade-off between generative and discriminative classifiers. In Antoch, J. (ed.) Proceedings of International Symposium on Computational Statistics (COMPSTAT) 2004, pp. 1–9.

Bray, N., Dubchak, I., and Pachter, L. (2003) AVID: a global alignment program. Genome Research 13:97–102.

Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984) Classification and Regression Trees. Monterey, CA: Wadsworth International.

Brejová, B., Brown, D. G., Li, M., and Vinar, T. (2005) ExonHunter: a comprehensive approach to gene finding. Bioinformatics 21(Suppl. 1):i57–i65.

Brendel, V. and Kleffe, J. (1998) Prediction of locally optimal splice sites in plant pre-mRNA with applications to gene identification in Arabidopsis thaliana genomic DNA. Nucleic Acids Research 26:4748–4757.

Brown, R. H., Gross, S. S., and Brent, M. R. (2005) Begin at the beginning: predicting genes with 5′ UTRs. Genome Research 15:742–747.

Bucher, P. and Bairoch, A. (1994) A generalized profile syntax for biomolecular sequence motifs and its function in automatic sequence interpretation. In Proceedings of the 2nd International Conference on Intelligent Systems for Molecular Biology, pp. 53–61.

Buratti, E. and Baralle, F. E. (2004) Influence of RNA secondary structure on the pre-mRNA splicing process. Molecular Cell Biology 24:10505–10514.

Burden, S., Lin, Y. X., and Zhang, R. (2005) Improving promoter prediction for the NNPP2.2 algorithm: a case study using Escherichia coli DNA sequences. Bioinformatics 21:601–607.

Burge, C. (1997) Identification of complete gene structures in human genomic DNA. Ph.D. thesis, Stanford University, Stanford, CA.

Burge, C. (1998) Modeling dependencies in pre-mRNA splicing signals. In Salzberg, S., Searls, D., and Kasif, S. (eds.), Computational Methods in Molecular Biology, pp. 127–163. Amsterdam: Elsevier.

Burge, C. and Karlin, S. (1997) Prediction of complete gene structures in human genomic DNA. Journal of Molecular Biology 268:78–94.

Burge, C. B., Tuschl, T., and Sharp, P. A. (1999) Splicing of precursors to mRNAs by the spliceosomes. In Gesteland, R. F., Cech, T. R., and Atkins, J. F. (eds.) The RNA World, 2nd edn., pp. 525–560. Cold Spring Harbor, NY: Cold Spring Harbor Laboratory Press.

Burges, C. J. C. (1998) A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery 2:121–167.

Burks, C., Cassidy, M., Cinkosky, M. J., Cumella, K. E., Gilna, P., Hayden, J. E.-D., Keen, G. M., Kelley, T. A., Kelly, M., Kristofferson, D., and Ryals, J. (1991) GenBank. Nucleic Acids Research 19:S2221–S2225.

Burset, M. and Guigó, R. (1996) Evaluation of gene structure prediction programs. Genomics 34:357–367.

Burset, M., Seledtsov, I. A., and Solovyev, V. V. (2000) Analysis of canonical and non-canonical splice sites in mammalian genomes. Nucleic Acids Research 28:4364–4375.

Cai, D., Delcher, A., Kao, B., and Kasif, S. (2000) Modeling splice sites with Bayes networks. Bioinformatics 16:152–158.

Carninci, P., Sandelin, A., Lenhard, B., Katayama, S., Shimokawa, K., Ponjavic, J., Semple, C. A., Taylor, M. S., Engstrom, P. G., Frith, M. C., Forrest, A. R., Alkema, W. B., Tan, S. L., Plessy, C., Kodzius, R., Ravasi, T., Kasukawa, T., Fukuda, S., Kanamori-Katayama, M., Kitazume, Y., Kawaji, H., Kai, C., Nakamura, M., Konno, H., Nakano, K., Mottagui-Tabar, S., Arner, P., Chesi, A., Gustincich, S., Persichetti, F., Suzuki, H., Grimmond, S. M., Wells, C. A., Orlando, V., Wahlestedt, C., Liu, E. T., Harbers, M., Kawai, J., Bajic, V. B., Hume, D. A., and Hayashizaki, Y. (2006) Genome-wide analysis of mammalian promoter architecture and evolution. Nature Genetics 38:626–635.

Cawley, S. E., Wirth, A. I., and Speed, T. P. (2001) Phat: a gene finding program for Plasmodium falciparum. Molecular and Biochemical Parasitology 118:167–174.

Choo, K. H., Tong, J. C., and Zhang, L. (2004) Recent applications of hidden Markov models in computational biology. Genomics, Proteomics, and Bioinformatics 2:84–96.

Chou, W., Juang, B. H., and Lee, C. H. (1992) Segmental GPD training of HMM based speech recognizer. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 1, pp. 473–476.

Chow, C. K. and Liu, C. N. (1968) Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory 14:462–467.

Chuang, T.-J., Lin, W. C., Lee, H. C., Wang, C. W., Hsiao, K. L., Wang, Z. H., Shieh, D., Lin, S. C., and Chang, L. Y. (2003) A complexity reduction algorithm for analysis and annotation of large genomic sequences. Genome Research 13:313–322.

Chuang, T.-J., Chen, F.-C., and Chou, M.-Y. (2004) A comparative method for identification of gene structures and alternatively spliced variants. Bioinformatics 20:3064–3079.

Clark, F. and Thanaraj, T. A. (2002) Categorization and characterization of transcript-confirmed constitutively and alternatively spliced introns and exons from human. Human Molecular Genetics 11:451–464.

Clote, P. and Backofen, R. (2000) Computational Molecular Biology. New York: John Wiley.

Cocke, J. and Schwartz, J. T. (1970) Programming Languages and their Compilers: Preliminary Notes, Technical Report. New York: Courant Institute of Mathematical Sciences, New York University.

Corà, D., Herrmann, C., Dieterich, C., Di Cunto, F., Provero, P., and Caselle, M. (2005) Ab initio identification of putative human transcription factor binding sites by comparative genomics. BMC Bioinformatics 6:110.

Cormen, T. H., Leiserson, C. E., and Rivest, R. L. (1992) Introduction to Algorithms. Cambridge, MA: MIT Press.

Cover, T. M. and Hart, P. E. (1967) Nearest neighbor pattern classification. IEEE Transactions on Information Theory 13:57–67.

Culotta, A., Kulp, D., and McCallum, A. (2005) Gene Prediction with Conditional Random Fields, Technical Report UM-CS-2005–028. Amherst, MA: University of Massachusetts.

Darwin, C. (1859) On the Origin of Species by Means of Natural Selection, or the Preservation of Favoured Races in the Struggle for Life. London: John Murray.

Davuluri, V. D., Grosse, I., and Zhang, M. Q. (2001) Computational identification of promoters and first exons in the human genome. Nature Genetics 29:412–417.

Dawkins, R. (1982) The Extended Phenotype: The Long Reach of the Gene. Oxford: Oxford University Press.

Dawkins, R. (1997) Human chauvinism. Evolution 51:1015–1020.

Dayhoff, M., Schwartz, R. M., and Orcutt, B. C. (1978) A model of evolutionary change in proteins. Atlas of Protein Sequence and Structure 5:345–352.

Delcher, A. L., Kasif, S., Fleischmann, R. D., Peterson, J., White, O., and Salzberg, S. L. (1999a) Alignment of whole genomes. Nucleic Acids Research 27:2369–2376.

Delcher, A. L., Harmon, D., Kasif, S., White, O., and Salzberg, S. L. (1999b) Improved microbial gene identification with GLIMMER. Nucleic Acids Research 27:4636–4641.

Delcher, A. L., Phillippy, A., Carlton, J., and Salzberg, S. L. (2002) Fast algorithms for large-scale genome alignment and comparison. Nucleic Acids Research 30:2478–2483.

Delphin, M. E., Stockwell, P. A., Tate, W. P., and Brown, C. M. (1999) Transterm, the translational signal database, extended to include full coding sequence and untranslated regions. Nucleic Acids Research 27:293–294.

Dieterich, C., Grossmann, S., Tanzer, A., Röpcke, S., Arndt, P. F., Stadler, P. F., and Vingron, M. (2005) Comparative promoter region analysis powered by CORG. BMC Genomics 6:24.

Ding, Y. (2006) Statistical and Bayesian approaches to RNA secondary structure prediction. RNA 12:323–331.

Doudna, J. A. and Cech, T. R. (2002) The chemical repertoire of natural ribozymes. Nature 418:222–228.

Down, T. A. and Hubbard, T. J. P. (2002) Computational detection and location of transcription start sites in mammalian genomic DNA. Genome Research 12:458–461.

Dror, G., Sorek, R., and Shamir, R. (2004) Accurate identification of alternatively spliced exons using support vector machines. Bioinformatics 21:897–901.

Duda, R. O., Hart, P. E., and Stork, D. G. (2000) Pattern Classification, 2nd edn. New York: Wiley-Interscience.

Dunteman, G. H. (1989) Principal Components Analysis. London: Sage Publications.

Durbin, R., Eddy, S., Krogh, A., and Mitchison, G. (1998) Biological Sequence Analysis. Cambridge: Cambridge University Press.

Eddy, S. R. (2002) Computational genomics of noncoding RNA genes. Cell 109:137–140.

Eddy, S. R. (2005) A model of the statistical power of comparative genome sequence analysis. PLoS Biology 3:e10.

Eddy, S., Mitchison, G., and Durbin, R. (1995) Maximum discrimination hidden Markov models of sequence consensus. Journal of Computational Biology 2:9–23.

Edwards, A. W. F. (1992) Likelihood. Baltimore, MD: Johns Hopkins University Press.

Eisen, J. A., Coyne, R. S., Wu, M., Wu, D., Thiagarajan, M., Wortman, J. R., Badger, J. H., Ren, Q., Amedeo, P., Jones, K. M., Tallon, L. J., Delcher, A. L., Salzberg, S. L., Silva, J. C., Haas, B. J., Majoros, W. H., Farzad, M., Carlton, J. M., Smith, R. K., Garg, J., Pearlman, R. E., Karrer, K. M., Sun, L., Manning, G., Elde, N. C., Turkewitz, A. P., Asai, D. J., Wilkes, D. E., Wang, Y., Cai, H., Collins, K., Stewart, B. A., Lee, S. R., Wilamowska, K., Weinberg, Z., Ruzzo, W. L., Wloga, D., Gaertig, J., Frankel, J., Tsao, C. C., Gorovsky, M. A., Keeling, P. J., Waller, R. F., Patron, N. J., Cherry, J. M., Stover, N. A., Krieger, C. J., Del Toro, C., Ryder, H. F., Williamson, S. C., Barbeau, R. A., Hamilton, E. P., and Orias, E. (2006) Macronuclear genome sequence of the ciliate Tetrahymena thermophila, a model eukaryote. PLoS Biology 4:e286.

Fairbrother, W. G., Yeh, R. F., Sharp, P. A., and Burge, C. B. (2002) Predictive identification of exonic splicing enhancers in human genes. Science 297:1007–1013.

Falconer, D. S. (1996) Introduction to Quantitative Genetics, 4th edn. Englewood Cliffs, NJ: Prentice-Hall.

Fariselli, P., Martelli, P. L., and Casadio, R. (2005) The posterior-Viterbi: a new decoding algorithm for hidden Markov models. BMC Bioinformatics 6(Suppl. 4):S12.

Fausett, L. V. (1994) Fundamentals of Neural Networks. Englewood Cliffs, NJ: Prentice-Hall.

Felsenstein, J. (1981) Evolutionary trees from DNA sequences. Journal of Molecular Evolution 17:368–376.

Felsenstein, J. (1989) PHYLIP: phylogeny inference package (version 3.2). Cladistics 5:164–166.

Felsenstein, J. (2004) Inferring Phylogenies. Sunderland, MA: Sinauer Associates.

Felsenstein, J. and Churchill, G. A. (1996) A hidden Markov model approach to variation among sites in rate of evolution. Molecular Biology and Evolution 13:93–104.

Fischer, C. N. and LeBlanc, R. J. (1991) Crafting a Compiler with C. Menlo Park, CA: Benjamin/Cummings.

Fisher, R. A. (1936) The use of multiple measurements in taxonomic problems. Annals of Eugenics 7:179–188.

Fletcher, R. (1980) Practical Methods of Optimization, vol. 1, Unconstrained Optimization. New York: John Wiley.

Florea, L., Hartzell, G., Zhang, Z., Rubin, G. M., and Miller, W. (1998) A computer program for aligning a cDNA sequence with a genomic DNA sequence. Genome Research 8:967–974.

Florea, L., Di Francesco, V., Miller, J., Turner, R., Yao, A., Harris, M., Walenz, B., Mobarry, C., Merkulov, G. V., Charlab, R., Dew, I., Deng, Z., Istrail, S., Li, P., and Sutton, G. (2005) Gene and alternative splicing annotation with AIR. Genome Research 15:54–66.

Fogel, D. B. (2005) Evolutionary Computation: Toward a New Philosophy of Machine Intelligence, 3rd edn. New York: Wiley-IEEE Press.

Foissac, S. and Schiex, T. (2005) Integrating alternative splicing detection into gene prediction. BMC Bioinformatics 6:25.

Freund, Y. and Schapire, R. E. (1995) A decision-theoretic generalization of on-line learning and an application to boosting. In Proceedings of the European Conference on Computational Learning Theory, pp. 23–37.

Friedman, J. H. (1989) Regularized discriminant analysis. Journal of the American Statistical Association 84:165–175.

Galassi, M., Davies, J., Theiler, J., Gough, B., Jungman, G., Booth, M., and Rossi, F. (2005) GNU Scientific Library Reference Manual, 2nd edn. Bristol, UK: Network Theory Ltd.

Gangal, R. and Sharma, P. (2005) Human pol II promoter prediction: time series descriptors and machine learning. Nucleic Acids Research 33:1332–1336.

Gene Ontology Consortium (2000) Gene ontology: tool for the unification of biology. Nature Genetics 25:25–29.

Gierasch, L. M. (1989) Signal sequences. Biochemistry 28:923–930.

Goldberg, D. E. (1989) Genetic Algorithms in Search, Optimization and Machine Learning. Reading, MA: Addison-Wesley.

Gould, S. J. (1994) The evolution of life on earth. Scientific American 271:62–69.

Goutte, C., Gaussier, E., Cancedda, N., and Dejean, H. (2004) Generative vs. discriminative approaches to entity recognition from label-deficient data. In 7èmes Journées Internationales d'Analyse Statistique des Données Textuelles (JADT 2004), pp. 1–10.

Gropp, W., Lusk, E., and Skjellum, A. (1994) Using MPI: Portable Parallel Programming with the Message-Passing Interface. Cambridge, MA: MIT Press.

Gross, S. S. and Brent, M. R. (2005) Using multiple alignments to improve gene prediction. In Research in Computational Molecular Biology (RECOMB'05), pp. 374–388.

Guigó, R. (1998) Assembling genes from predicted exons in linear time with dynamic programming. Journal of Computational Biology 5:681–702.

Guigó, R., Flicek, P., Abril, J. F., Reymond, A., Lagarde, J., Denoeud, F., Antonarakis, S., Ashburner, M., Bajic, V. B., Birney, E., Castelo, R., Eyras, E., Gingeras, T. R., Harrow, J., Hubbard, T., Lewis, S., Ucla, C., and Reese, M. G. (2006) EGASP: the human ENCODE genome annotation assessment project. Genome Biology 7(Suppl. 1):S2.

Haas, B. J., Delcher, A. L., Mount, S. M., Wortman, J. R., Smith, R. K., Hannick, L. I., Maiti, R., Ronning, C. M., Rusch, D. B., Town, C. D., Salzberg, S. L., and White, O. (2003) Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Research 31:5654–5666.

Hall, K. B., Green, M. R., and Redfield, A. G. (1988) Structure of a pre-mRNA branch point/3′ splice site region. Proceedings of the National Academy of Sciences of the USA 85:704–708.

Hannenhalli, S. and Levy, S. (2001) Promoter prediction in the human genome. Bioinformatics 17:S90–S96.

Hasegawa, M., Kishino, H., and Yano, T. (1985) Dating of the human–ape splitting by a molecular clock of mitochondrial DNA. Journal of Molecular Evolution 22:160–174.

Hastie, T., Tibshirani, R., and Friedman, J. H. (2003) The Elements of Statistical Learning. New York: Springer.

Heber, S., Alekseyev, M., Sze, S.-H., Tang, H., and Pevzner, P. A. (2002) Splicing graphs and EST assembly problem. Bioinformatics 18:S181–S188.

Heckerman, D., Geiger, D., and Chickering, D. (1994) Learning Bayesian networks: the combination of knowledge and statistical data. In Knowledge Discovery and Data Mining Workshop (KDD '94), pp. 85–96.

Henderson, J., Salzberg, S., and Fasman, K. (1997) Finding genes in human DNA with a hidden Markov model. Journal of Computational Biology 4:127–141.

Henikoff, J. G., Greene, E. A., Pietrokovski, S., and Henikoff, S. (2000) Increased coverage of protein families with the BLOCKS database servers. Nucleic Acids Research 28:228–230.

Henikoff, S. and Henikoff, J. G. (1992) Amino acid substitution matrices from protein blocks. Proceedings of the National Academy of Sciences of the USA 89:10915–10919.

Heylighen, F. (1999) The growth of structural and functional complexity during evolution. In Heylighen, F., Bollen, J., and Riegler, A. (eds.) The Evolution of Complexity, pp. 17–44. New York: Kluwer.

Hirschberg, D. (1975) A linear space algorithm for computing maximal common subsequences. Communications of the Association for Computing Machinery 18:341–343.

Hofacker, I. L., Fontana, W., Stadler, P. F., Bonhoeffer, L. S., Tacker, M., and Schuster, P. (1994) Fast folding and comparison of RNA secondary structures. Chemical Monthly 125:167–188.

Hopcroft, J. E. and Ullman, J. D. (1979) Introduction to Automata Theory, Languages, and Computation. Reading, MA: Addison-Wesley.

Hoskins, R. A., Smith, C. D., Carlson, J. W., Carvalho, A. B., Halpern, A., Kaminker, J. S., Kennedy, C., Mungall, C. J., Sullivan, B. A., Sutton, G. G., Yasuhara, J. C., Wakimoto, B. T., Myers, E. W., Celniker, S. E., Rubin, G. M., and Karpen, G. H. (2002) Heterochromatic sequences in a Drosophila whole-genome shotgun assembly. Genome Biology 3:0085.1–0085.16.

Howe, K. L., Chothia, T., and Durbin, R. (2002) GAZE: a generic framework for the integration of gene-prediction data by dynamic programming. Genome Research 12:1418–1427.

Huang, X., Adams, M. D., Zhou, H., and Kerlavage, A. R. (1997) A tool for analyzing and annotating genomic sequences. Genomics 46:35–45.

International Human Genome Sequencing Consortium (2001) Initial sequencing and analysis of the human genome. Nature 409:860–921.

International Human Genome Sequencing Consortium (2004) Finishing the euchromatic sequence of the human genome. Nature 431:931–945.

Jaakkola, T. S. and Haussler, D. (1999) Exploiting generative models in discriminative classifiers. Advances in Neural Information Processing Systems 11:487–493.

Jeffreys, A. J., Wilson, V., and Thein, S. L. (1985) Individual-specific “fingerprints” of human DNA. Nature 316:76–79.

Jelinek, F. (1997) Statistical Methods for Speech Recognition. Bradford, MA: Bradford Books.

Jelinek, F. and Mercer, R. L. (1980) Interpolated estimation of Markov source parameters. In Proceedings of the Workshop on Pattern Recognition in Practice, May 1980.

Jensen, F. V. (2001) Bayesian Networks and Decision Graphs. New York: Springer.

Jenuwein, T. and Allis, C. D. (2001) Translating the histone code. Science 293:1074–1080.

Johansen, F. T. (1996) A comparison of hybrid HMM architectures using global discriminative training. In Proceedings of the 4th International Conference on Spoken Language Processing, pp. 498–501.

Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., and Saul, L. K. (1999) An introduction to variational methods for graphical models. In Jordan, M. I. (ed.) Learning in Graphical Models, pp. 105–162. Cambridge, MA: MIT Press.

Jukes, T. H. and Cantor, C. R. (1969) Evolution of protein molecules. In Munro, H. N. (ed.) Mammalian Protein Metabolism, pp. 21–132. New York: Academic Press.

Käll, L., Krogh, A., and Sonnhammer, E. L. (2005) An HMM posterior decoder for sequence feature prediction that includes homology information. Bioinformatics 21(Suppl. 1):i251–i257.

Kamal, M., Xie, X., and Lander, E. S. (2006) A large family of ancient repeat elements in the human genome is under strong selection. Proceedings of the National Academy of Sciences of the USA 103:2740–2745.

Karlin, S. and Altschul, S. F. (1990) Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proceedings of the National Academy of Sciences of the USA 87:2264–2268.

Kasami, T. (1965) An Efficient Recognition and Syntax-Analysis Algorithm for Context-Free Languages, Scientific Report AFCRL–65–758. Bedford, MA: Air Force Cambridge Research Laboratory.

Katz, S. M. (1987) Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speech, and Signal Processing 35:400–401.

Kaufman, L. (1998) Solving the quadratic programming problem arising in support vector classification. In Schölkopf, B., Burges, C. J. C., and Smola, A. J. (eds.) Advances in Kernel Methods: Support Vector Learning, pp. 147–167. Cambridge, MA: MIT Press.

Keilson, J. (1979) Markov Chain Models: Rarity and Exponentiality. New York: Springer.

Kent, W. J. (2002) BLAT: the BLAST-like alignment tool. Genome Research 12:656–664.

Kent, W. J., Sugnet, C. W., Furey, T. S., Roskin, K. M., Pringle, T. H., Zahler, A. M., and Haussler, D. (2002) The human genome browser at UCSC. Genome Research 12:996–1006.

Kimura, M. (1980) A simple method for estimating evolutionary rate of base substitutions through comparative studies of nucleotide sequences. Journal of Molecular Evolution 16:111–120.

Kingsbury, N. G. and Rayner, P. J. W. (1971) Digital filtering using logarithmic arithmetic. Electronics Letters 7:56–58.

Kohavi, R. and Sahami, M. (1996) Error-based and entropy-based discretization of continuous features. In Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, pp. 114–119.

Korf, I. (2004) Gene finding in novel genomes. BMC Bioinformatics 5:59.

Korf, I., Flicek, P., Duan, D., and Brent, M. R. (2001) Integrating genomic homology into gene structure prediction. Bioinformatics 17:S140–S148.

Korf, I., Yandell, M., and Bedell, J. (2003) BLAST. Sebastopol, CA: O'Reilly.

Krogh, A. (1994) Hidden Markov models for labeled sequences. In Proceedings of the 12th IAPR International Conference on Pattern Recognition, pp. 140–144.

Krogh, A. (1997) Two methods for improving performance of an HMM and their application for gene finding. In Proceedings of the 5th International Conference on Intelligent Systems for Molecular Biology, pp. 179–186.

Krogh, A. (1998) An introduction to hidden Markov models for biological sequences. In Salzberg, S. L., Searls, D. B., and Kasif, S. (eds.) Computational Methods in Molecular Biology, pp. 45–62. Amsterdam: Elsevier.

Krogh, A. (2000) Using database matches with HMMGene for automated gene detection in Drosophila. Genome Research 10:523–528.

Krogh, A., Mian, I. S., and Haussler, D. (1994) A hidden Markov model that finds genes in E. coli DNA. Nucleic Acids Research 22:4768–4778.

Koza, J. (1992) Genetic Programming: On the Programming of Computers by Means of Natural Selection. Cambridge, MA: MIT Press.

Kullback, S. (1997) Information Theory and Statistics. New York: Dover.

Kulkarni, O. C., Vigneshwar, R., Jayaraman, V. K., and Kulkarni, B. D. (2005) Identification of coding and non-coding sequences using local Hölder exponent formalism. Bioinformatics 21:3818–3823.

Kulp, D. and Haussler, D. (1997) Integrating database homology in a probabilistic gene structure model. Pacific Symposium on Biocomputing 2:232–244.

Kulp, D., Haussler, D., Reese, M., and Eeckman, F. (1996) A generalized hidden Markov model for the recognition of human genes in DNA. In Proceedings of the 4th International Conference on Intelligent Systems for Molecular Biology, pp. 134–142.

Kurtz, S., Phillippy, A., Delcher, A. L., Smoot, M., Shumway, M., Antonescu, C., and Salzberg, S. L. (2004) Versatile and open software for comparing large genomes. Genome Biology 5:R12.1–R12.9.

Lam, F., Alexandersson, M., and Pachter, L. (2003) Picking alignments from (Steiner) trees. Journal of Computational Biology 10:509–520.

Lander, E. S. and Waterman, M. S. (1988) Genomic mapping by fingerprinting random clones: a mathematical analysis. Genomics 2:231–239.

Landry, J. R., Mager, D. L., and Wilhelm, B. T. (2003) Complex controls: the role of alternative promoters in mammalian genomes. Trends in Genetics 19:640–648.

Lapedes, A., Barnes, C., Burks, C., Farber, R., and Sirotkin, K. (1989) Application of neural networks and other machine learning algorithms to DNA sequence analysis. In Bell, G. and Marr, T. (eds.) Computers and DNA: SFI Studies in the Sciences of Complexity, pp. 157–182. Reading, MA: Addison-Wesley.

Larsen, F., Gundersen, G., Lopez, R., and Prydz, H. (1992) CpG islands as gene markers in the human genome. Genomics 13:1095–1107.

Lawrence, C. E., Altschul, S. F., Boguski, M. S., Liu, J. S., Neuwald, A. F., and Wootton, J. C. (1993) Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science 262:208–214.

Lee, C., Grasso, C., and Sharlow, M. F. (2002) Multiple sequence alignment using partial order graphs. Bioinformatics 18:452–464.

Lewin, B. (2003) Genes VIII. New York: Prentice-Hall.

Lian, Y. and Garner, H. R. (2005) Evidence for the regulation of alternative splicing via complementary DNA sequence repeats. Bioinformatics 8:1358–1364.

Lim, L. P. and Burge, C. B. (2001) A computational analysis of sequence features involved in recognition of short introns. Proceedings of the National Academy of Sciences of the USA 98:11193–11198.

Liò, P. and Goldman, N. (1998) Models of molecular evolution and phylogeny. Genome Research 8:1233–1244.

Loots, G. G., Ovcharenko, I., Pachter, L., Dubchak, I., and Rubin, E. M. (2002) rVista for comparative sequence-based discovery of functional transcriptional factor binding sites. Genome Research 12:832–839.

Lowe, T. M. and Eddy, S. R. (1999) A computational screen for methylation guide snoRNAs in yeast. Science 283:1168–1171.

Lukashin, A. V. and Borodovsky, M. (1998) GeneMark.hmm: new solutions for gene finding. Nucleic Acids Research 26:1107–1115.

Mackey, A. (2005) GLEAN: improved eukaryotic gene prediction by statistical consensus of gene evidence. Poster presented at Genome Informatics Conference, October 28, 2005.

Maglott, D. R., Katz, K. S., Sicotte, H., and Pruitt, K. D. (2000) NCBI's LocusLink and RefSeq. Nucleic Acids Research 28:126–128.

Majoros, W. H. and Salzberg, S. L. (2004) An empirical analysis of training protocols for probabilistic gene finders. BMC Bioinformatics 5:206.

Majoros, W. H., Pertea, M., Antonescu, C., and Salzberg, S. L. (2003) GlimmerM, Exonomy and Unveil: three ab initio eukaryotic genefinders. Nucleic Acids Research 31:3601–3604.

Majoros, W. H., Pertea, M., and Salzberg, S. (2004) TIGRscan and GlimmerHMM: two open source ab initio eukaryotic gene finders. Bioinformatics 20:2878–2879.

Majoros, W. H., Pertea, M., Delcher, A. L., and Salzberg, S. L. (2005a) Efficient decoding algorithms for generalized hidden Markov model gene finders. BMC Bioinformatics 6:16.

Majoros, W. H., Pertea, M., and Salzberg, S. L. (2005b) Efficient implementation of a generalized pair hidden Markov model for comparative gene finding. Bioinformatics 21:1782–1788.

Manly, B. F. J. (1994) Multivariate Statistical Methods: A Primer, 2nd edn. New York: Chapman and Hall.

Manning, C. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. Cambridge, MA: MIT Press.

Marashi, S. A., Goodarzi, H., Sadeghi, M., Eslahchi, C., and Pezeshk, H. (2006) Importance of RNA secondary structure information for yeast donor and acceptor splice site predictions by neural networks. Computational Biology and Chemistry 30:50–57.

Markov, K., Nakagawa, S., and Nakamura, S. (2001) Discriminative training of HMM using maximum normalized likelihood algorithm. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, pp. 497–500.

Matlin, A. J., Clark, F., and Smith, C. W. (2005) Understanding alternative splicing: towards a cellular code. Nature Reviews: Molecular Cell Biology 6:386–398.

McAuliffe, J. D., Pachter, L., and Jordan, M. I. (2004) Multiple-sequence functional annotation and the generalized hidden Markov phylogeny. Bioinformatics 20:1850–1860.

McCaskill, J. S. (1990) The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers 29:1105–1119.

Mealy, G. H. (1955) A method for synthesizing sequential circuits. Bell System Technical Journal 34:1045–1079.

Meyer, I. M. and Durbin, R. (2002) Comparative ab initio prediction of gene structures using pair HMMs. Bioinformatics 18:1309–1318.

Meyer, I. M. and Durbin, R. (2004) Gene structure conservation aids similarity based gene prediction. Nucleic Acids Research 32:776–783.

Mitchell, T. (1997) Machine Learning. New York: McGraw-Hill.

Mitrophanov, A. Y. and Borodovsky, M. (2006) Statistical significance in biological sequence analysis. Briefings in Bioinformatics 7:2–24.

Mizrahi, A. and Sullivan, M. (1986) Calculus and Analytic Geometry, 2nd edn. Belmont, CA: Wadsworth.

Moore, E. F. (1956) Gedanken experiments on sequential machines. In Shannon, C. E. and McCarthy, J. (eds.) Automata Studies, pp. 129–153. Princeton, NJ: Princeton University Press.

Mouse Genome Sequencing Consortium (2002) Initial sequencing and comparative analysis of the mouse genome. Nature 420:520–562.

Mural, R. J., Adams, M. D., Myers, E. W., Smith, H. O., Miklos, G. L., Wides, R., Halpern, A., Li, P. W., Sutton, G. G., Nadeau, J., Salzberg, S. L., Holt, R. A., Kodira, C. D., Lu, F., Chen, L., Deng, Z., Evangelista, C. C., Gan, W., Heiman, T. J., Li, J., Li, Z., Merkulov, G. V., Milshina, N. V., Naik, A. K., Qi, R., Shue, B. C., Wang, A., Wang, J., Wang, X., Yan, X., Ye, J., Yooseph, S., Zhao, Q., Zheng, L., Zhu, S. C., Biddick, K., Bolanos, R., Delcher, A. L., Dew, I. M., Fasulo, D., Flanigan, M. J., Huson, D. H., Kravitz, S. A., Miller, J. R., Mobarry, C. M., Reinert, K., Remington, K. A., Zhang, Q., Zheng, X. H., Nusskern, D. R., Lai, Z., Lei, Y., Zhong, W., Yao, A., Guan, P., Ji, R. R., Gu, Z., Wang, Z. Y., Zhong, F., Xiao, C., Chiang, C. C., Yandell, M., Wortman, J. R., Amanatides, P. G., Hladun, S. L., Pratts, E. C., Johnson, J. E., Dodson, K. L., Woodford, K. J., Evans, C. A., Gropman, B., Rusch, D. B., Venter, E., Wang, M., Smith, T. J., Houck, J. T., Tompkins, D. E., Haynes, C., Jacob, D., Chin, S. H., Allen, D. R., Dahlke, C. E., Sanders, R., Li, K., Liu, X., Levitsky, A. A., Majoros, W. H., Chen, Q., Xia, A. C., Lopez, J. R., Donnelly, M. T., Newman, M. H., Glodek, A., Kraft, C. L., Nodell, M., Ali, F., An, H. J., Baldwin-Pitts, D., Beeson, K. Y., Cai, S., Carnes, M., Carver, A., Caulk, P. M., Center, A., Chen, Y. H., Cheng, M. L., Coyne, M. D., Crowder, M., Danaher, S., Davenport, L. B., Desilets, R., Dietz, S. M., Doup, L., Dullaghan, P., Ferriera, S., Fosler, C. R., Gire, H. C., Gluecksmann, A., Gocayne, J. D., Gray, J., Hart, B., Haynes, J., Hoover, J., Howland, T., Ibegwam, C., Jalali, M., Johns, D., Kline, L., Ma, D. S., MacCawley, S., Magoon, A., Mann, F., May, D., McIntosh, T. C., Mehta, S., Moy, L., Moy, M. C., Murphy, B. J., Murphy, S. D., Nelson, K. A., Nuri, Z., Parker, K. A., Prudhomme, A. C., Puri, V. N., Qureshi, H., Raley, J. C., Reardon, M. S., Regier, M. A., Rogers, Y. H., Romblad, D. L., Schutz, J., Scott, J. L., Scott, R., Sitter, C. D., Smallwood, M., Sprague, A. C., Stewart, E., Strong, R. V., Suh, E., Sylvester, K., Thomas, R., Tint, N. N., Tsonis, C., Wang, G., Wang, G., Williams, M. S., Williams, S. M., Windsor, S. M., Wolfe, K., Wu, M. M., Zaveri, J., Chaturvedi, K., Gabrielian, A. E., Ke, Z., Sun, J., Subramanian, G., Venter, J. C., Pfannkoch, C. M., Barnstead, M., and Stephenson, L. D. (2002) A comparison of whole-genome shotgun-derived mouse chromosome 16 and the human genome. Science 296:1661–1671.

Murphy, P. M. and Aha, D. W. (1994) UCI Repository of Machine Learning Databases. Irvine, CA: University of California, Department of Information and Computer Science. Available online at www.ics.uci.edu/~mlearn/MLRepository.html/

Murthy, S. K., Kasif, S., and Salzberg, S. (1994) A system for induction of oblique decision trees. Journal of Artificial Intelligence Research 2:1–32.

Needleman, S. and Wunsch, C. (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology 48:443–453.

Ng, A. Y. and Jordan, M. I. (2002) On discriminative vs. generative classifiers: a comparison of logistic regression and naive Bayes. Advances in Neural Information Processing Systems (NIPS) 14:841–848.

Normandin, Y. (1996) Maximum mutual information estimation of hidden Markov models. In Lee, C.-H., Soong, F. K., and Paliwal, K. K. (eds.) Automatic Speech and Speaker Recognition, pp. 58–81. New York: Kluwer.

Normark, S., Bergstrom, S., Edlund, T., Grundstrom, T., Jaurin, B., Lindberg, F. P., and Olsson, O. (1983) Overlapping genes. Annual Review of Genetics 17:499–525.

Ohler, U., Stemmer, G., Harbeck, S., and Niemann, H. (2000) Stochastic segment models of eukaryotic promoter regions. Proceedings of the Pacific Symposium on Biocomputing 5:377–388.

Ohler, U., Niemann, H., Liao, G., and Rubin, G. M. (2001) Joint modeling of DNA sequence and physical properties to improve eukaryotic promoter recognition. Bioinformatics 17:S199–S206.

Ohler, U., Liao, G., Niemann, H., and Rubin, G. (2002) Computational analysis of core promoters in the Drosophila genome. Genome Biology 3(12):r0087.1–r0087.12.

Ohler, U., Shomron, N., and Burge, C. B. (2005) Recognition of unknown conserved alternatively spliced exons. PLoS Computational Biology 1(2):e15.

Oliver, J. L., Carpena, P., Hackenberg, M., and Bernaola-Galván, P. (2004) IsoFinder: computational prediction of isochores in genome sequences. Nucleic Acids Research 32:W287–W292.

Osuna, E., Freund, R., and Girosi, F. (1997) An improved training algorithm for support vector machines. In Proceedings of the IEEE Workshop on Neural Networks for Signal Processing, pp. 276–285.

Pachter, L., Batzoglou, S., Spitkovsky, V. I., Banks, E., Lander, E. S., Kleitman, D. J., and Berger, B. (1999) A dictionary based approach for gene annotation. Journal of Computational Biology 6:419–430.

Pachter, L., Alexandersson, M., and Cawley, S. (2002) Applications of generalized pair hidden Markov models to alignment and gene finding problems. Journal of Computational Biology 9:389–399.

Parra, G., Agarwal, P., Abril, J. F., Wiehe, T., Fickett, J. W., and Guigó, R. (2003) Comparative gene prediction in human and mouse. Genome Research 13:108–117.

Patterson, D., Yasuhara, K., and Ruzzo, W. L. (2002) Pre-mRNA secondary structure prediction aids splice site prediction. Pacific Symposium on Biocomputing 7:223–234.

Pavesi, A., Iaco, B., Granero, M. I., and Porati, A. (1997) On the informational content of overlapping genes in prokaryotic and eukaryotic viruses. Journal of Molecular Evolution 44:625–631.

Pearl, J. (1991) Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, 2nd edn. Los Altos, CA: Morgan Kaufmann.

Pearson, W. R. and Wood, T. C. (2001) Statistical significance in biological sequence comparison. In Balding, D. J., Bishop, M., and Cannings, C. (eds.) Handbook of Statistical Genetics, pp. 39–65. New York: John Wiley.

Pedersen, J. S. and Hein, J. (2003) Gene finding with a hidden Markov model of genome structure and evolution. Bioinformatics 19:219–227.

Pertea, M. (2005) The Glimmer HMM Home Page. Available online at: www.cbcb.umd.edu/software/GlimmerHMM

Pertea, M., Lin, X., and Salzberg, S. L. (2001) GeneSplicer: a new computational method for splice site prediction. Nucleic Acids Research 29:1185–1190.

Pertea, M. and Salzberg, S. L. (2002) Computational gene finding in plants. Plant Molecular Biology 48:48–49.

Platt, J. (1998) Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines, Microsoft Research Technical Report MSR-TR-98-14. Redmond, WA: Microsoft Corporation.

Pontius, J. U., Wagner, L., and Schuler, G. D. (2003) UniGene: a unified view of the transcriptome. In McEntyre, J. and Ostell, J. (eds.) The NCBI Handbook, pp. 21-1–21-12. Bethesda, MD: National Center for Biotechnology Information.

Pop, M., Salzberg, S. L., and Shumway, M. (2002) Genome sequence assembly: algorithms and issues. IEEE Computer 35:47–54.

Potamianos, G. and Jelinek, F. (1998) A study of n-gram and decision tree letter language modeling methods. Speech Communication 24:171–192.

Powell, M. J. D. (1981) Nonlinear Optimization. New York: Academic Press.

Pozo, R. (1997) Template numerical toolkit for linear algebra: high performance programming with C++ and the Standard Template Library. International Journal of Supercomputer Applications and High Performance Computing 11:251–263.

Press, W. H., Flannery, B. P., Teukolsky, S. A., and Vetterling, W. T. (1992) Numerical Recipes in C: The Art of Scientific Computing, 2nd edn. Cambridge: Cambridge University Press.

Provost, F. J. and Hennessy, D. N. (1994) Distributed machine learning: scaling up with coarse-grained parallelism. In Proceedings of the 2nd International Conference on Intelligent Systems for Molecular Biology, pp. 340–347.

Pruitt, K. D., Tatusova, T., and Maglott, D. R. (2005) NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Research 33:D501–D504.

Quinlan, R. (1993) C4.5: Programs for Machine Learning. Los Altos, CA: Morgan Kaufmann.

Rabiner, L. R. (1989) A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE 77:257–286.

Rätsch, G., Sonnenburg, S., and Schölkopf, B. (2005) RASE: recognition of alternatively spliced exons in C. elegans. Bioinformatics 21(Suppl. 1):i369–i377.

Reese, M. and Eeckman, F. (1995) Novel neural network prediction systems for human promoters and splice sites. In Searls GSD., Fickett, J., and Noordewier, M. (eds.) Proceedings of the Workshop on Gene-Finding and Gene Structure Prediction, Philadelphia, PA, pp. 311–324.

Reese, M. G., Eeckman, F. H., Kulp, D., and Haussler, D. (1997) Improved splice site detection in Genie. Journal of Computational Biology 4:311–323.

Reese, M. G., Hartzell, G., Harris, N. L., Ohler, U., and Lewis, S. E. (2000) Genome annotation assessment in Drosophila melanogaster. Genome Research 10:483–501.

Reichl, W. and Ruske, G. (1995) Discriminative training for continuous speech recognition. In Proceedings of the 4th European Conference on Speech Communication and Technology, pp. 537–540.

Rissanen, J. (1978) Modeling by shortest data description. Automatica 14:465–471.

Ristad, E. S. and Thomas, R. G. (1997) Hierarchical non-emitting Markov models. In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics and 8th Conference of the European Chapter of the Association for Computational Linguistics.

Rivas, E. and Eddy, S. R. (1999) A dynamic programming algorithm for RNA structure prediction including pseudoknots. Journal of Molecular Biology 285:2053–2068.

Rivas, E. and Eddy, S. R. (2000) Secondary structure alone is generally not statistically significant for the detection of noncoding RNAs. Bioinformatics 16:583–605.

Rivas, E. and Eddy, S. R. (2001) Noncoding RNA gene detection using comparative sequence analysis. BMC Bioinformatics 2:8.

Rombauts, S., Florquin, K., Lescot, M., Marchal, K., Rouzé, P., and Van de Peer, Y. (2003) Computational approaches to identify promoters and cis-regulatory elements in plant genomes. Plant Physiology 132:1162–1176.

Rosenblatt, F. (1958) The Perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review 65:386–408.

Roth, V. and Steinhage, V. (1999) Nonlinear discriminant analysis using kernel functions. In Proceedings of the 12th International Conference on Advances in Neural Information Processing Systems, pp. 568–574.

Saetrom, P., Sneve, R., Kristiansen, K. I., Snøve, O. J., Grünfeld, T., Rognes, T., and Seeberg, E. (2005) Predicting non-coding RNA genes in Escherichia coli with boosted genetic programming. Nucleic Acids Research 33:3263–3270.

Saeys, Y. (2004) Feature selection for classification of nucleic acid sequences. Ph.D. thesis, University of Ghent, Belgium.

Saeys, Y., Degroeve, S., Aeyels, D., Rouzé, P., and Van de Peer, Y. (2004) Feature selection for splice site prediction: a new method using EDA-based feature ranking. BMC Bioinformatics 5:64.

Saitou, N. and Nei, M. (1987) The neighbor-joining method: a new method for reconstructing phylogenetic trees. Molecular Biology and Evolution 4:406–425.

Sakai, M., Yoneda, M., and Hase, H. (1998) A new robust quadratic discriminant function. In Proceedings of the 14th International Conference on Pattern Recognition, pp. 99–102.

Salzberg, S. L. (1999) On comparing classifiers: a critique of current research and methods. Data Mining and Knowledge Discovery 1:1–12.

Salzberg, S. L., Delcher, A. L., Kasif, S., and White, O. (1998a) Microbial gene identification using interpolated Markov models. Nucleic Acids Research 26:544–548.

Salzberg, S. L., Pertea, M., Delcher, A. L., Gardner, M. J., and Tettelin, H. (1998b) Interpolated Markov models for eukaryotic gene finding. Genomics 59:24–31.

Schadt, E. and Lange, K. (2002) Codon and rate variation models in molecular phylogeny. Molecular Biology and Evolution 19:1534–1549.

Schlüter, R., Macherey, W., Müller, B., and Ney, H. (2001) Comparison of discriminative training criteria and optimization methods for speech recognition. Speech Communication 34:287–310.

Schultz, J., Milpetz, F., Bork, P., and Ponting, C. P. (1998) SMART, a simple modular architecture research tool: identification of signaling domains. Proceedings of the National Academy of Sciences of the USA 95:5857–5864.

Schwartz, R. and Chow, Y.-L. (1990) The N-best algorithm: an efficient and exact procedure for finding the N most likely hypotheses. In Proceedings of the IEEE Conference on Acoustics, Speech, and Signal Processing, pp. 81–84.

Seneff, S., Wang, C., and Burge, C. B. (2004) Gene structure prediction using an orthologous gene of known exon–intron structure. Applied Bioinformatics 3:81–90.

Servant, F., Bru, C., Carre, S., Courcelle, E., Gouzy, J., Peyruc, D., and Kahn, D. (2002) ProDom: automated clustering of homologous domains. Briefings in Bioinformatics 3:246–251.

Shannon, C. E. (1948) A mathematical theory of communication. Bell System Technical Journal 27:379–423, 623–656.

Shmatkov, A. M., Melikyan, A. A., Chernousko, F. L., and Borodovsky, M. (1999) Finding prokaryotic genes by the “frame-by-frame” algorithm: targeting gene starts and overlapping genes. Bioinformatics 15:874–886.

Siepel, A. and Haussler, D. (2004a) Combining phylogenetic and hidden Markov models in biosequence analysis. Journal of Computational Biology 11:413–428.

Siepel, A. and Haussler, D. (2004b) Computational identification of evolutionarily conserved exons. In Research in Computational Molecular Biology (RECOMB'04), pp. 277–286.

Siepel, A. and Haussler, D. (2004c) Phylogenetic estimation of context-dependent substitution rates by maximum likelihood. Molecular Biology and Evolution 21:468–488.

Siepel, A. and Haussler, D. (2005) Phylogenetic hidden Markov models. In Nielsen, R. (ed.) Statistical Methods in Molecular Evolution, pp. 1034–1050. New York: Springer.

Simonoff, J. S. (1996) Smoothing Methods in Statistics. New York: Springer.

Sinha, S., van Nimwegen, E., and Siggia, E. D. (2003) A probabilistic method to detect regulatory modules. Bioinformatics 19:i292–i301.

Smit, A. F. A. and Green, P. (1996) RepeatMasker. Available online at http://ftp.genome.washington.edu/RM/RepeatMasker.html/

Smith, T. F. and Waterman, M. S. (1981) Identification of common molecular subsequences. Journal of Molecular Biology 147:195–197.

Sneath, P. H. A. and Sokal, R. R. (1973) Numerical Taxonomy. San Francisco, CA: W. H. Freeman.

Snyder, E. E. (1994) Identification of protein coding regions in genomic DNA. Ph.D. thesis, University of Colorado, Boulder, CO.

Snyder, E. E. and Stormo, G. D. (1993) Identification of coding regions in genomic DNA sequences: an application of dynamic programming and neural networks. Nucleic Acids Research 21:607–613.

Sokal, R. R. and Rohlf, F. J. (1995) Biometry: The Principles and Practice of Statistics in Biological Research. New York: W. H. Freeman.

Solovyev, V. V. and Shahmuradov, I. A. (2003) PromH: promoters identification using orthologous genomic sequences. Nucleic Acids Research 31:3540–3545.

Solovyev, V. V., Salamov, A. A., and Lawrence, C. B. (1995) Identification of human gene structure using linear discriminant functions and dynamic programming. In Proceedings of the 3rd International Conference on Intelligent Systems for Molecular Biology, pp. 367–375.

Sonnenburg, S. (2002) New methods for splice site recognition. Diploma thesis, Humboldt University, Berlin, Germany.

Sonnenburg, S., Zien, A., and Rätsch, G. (2006) ARTS: accurate recognition of transcription starts in human. In Proceedings of the 14th International Conference on Intelligent Systems for Molecular Biology, pp. 472–480.

Sorek, R., Ast, G., and Graur, D. (2002) Alu-containing exons are alternatively spliced. Genome Research 12:1060–1067.

Staden, R. (1984) Computer methods to locate signals in nucleic acid sequences. Nucleic Acids Research 12:505–519.

Stajich, J. E., Block, D., Boulez, K., Brenner, S. E., Chervitz, S. A., Dagdigian, C., Fuellen, G., Gilbert, J. G., Korf, I., Lapp, H., Lehvaslaiho, H., Matsalla, C., Mungall, C. J., Osborne, B. I., Pocock, M. R., Schattner, P., Senger, M., Stein, L. D., Stupka, E., Wilkinson, M. D., and Birney, E. (2002) The Bioperl toolkit: perl modules for the life sciences. Genome Research 12:1611–1618.

Stanke, M. and Waack, S. (2003) Gene prediction with a hidden Markov model and a new intron submodel. Bioinformatics 19:II215–II225.

Stein, L. (2001) Genome annotation: from sequence to biology. Nature Reviews: Genetics 2:493–503.

Stormo, G. D. and Haussler, D. (1994) Optimally parsing a sequence into different classes based on multiple types of evidence. In Proceedings of the 2nd International Conference on Intelligent Systems for Molecular Biology, pp. 369–375.

Suzek, B. E., Ermolaeva, M. D., Schreiber, M., and Salzberg, S. L. (2001) A probabilistic method for identifying start codons in bacterial genomes. Bioinformatics 17:1123–1130.

Tikhonov, A. N. (1963) Solution of incorrectly formulated problems and the regularization method. Soviet Mathematics, Doklady 4:1035–1038.

Tipping, M. E. (2001) Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research 1:211–244.

Tong, S. and Koller, D. (2000) Restricted Bayes optimal classifiers. In Proceedings of the 17th National Conference on Artificial Intelligence, pp. 658–664.

Toutanova, K., Mitchell, M., and Manning, C. D. (2003) Optimizing local probability models for statistical parsing. In Proceedings of the 14th European Conference on Machine Learning, pp. 409–420.

Tveter, D. (1998) The Pattern Recognition Basis of Artificial Intelligence. Indianapolis, IN: Wiley-IEEE Computer Society Press.

Uberbacher, E. C. and Mural, R. J. (1991) Locating protein coding regions in human DNA sequences using a multiple-sensor neural network approach. Proceedings of the National Academy of Sciences of the USA 88:11261–11265.

Usuka, J., Zhu, W., and Brendel, V. (2000) Optimal spliced alignment of homologous cDNA to a genomic DNA template. Bioinformatics 16:203–224.

Vapnik, V. (1998) Statistical Learning Theory. New York: John Wiley.

Venter, J. C., Smith, H. O., and Hood, L. (1996) A new strategy for genome sequencing. Nature 381:364–366.

Venter, J. C., Adams, M. D., Myers, E. W., Li, P. W., Mural, R. J., Sutton, G. G., Smith, H. O., Yandell, M., Evans, C. A., Holt, R. A., Gocayne, J. D., Amanatides, P., Ballew, R. M., Huson, D. H., Wortman, J. R., Zhang, Q., Kodira, C. D., Zheng, X. H., Chen, L., Skupski, M., Subramanian, G., Thomas, P. D., Zhang, J., Gabor Miklos, G. L., Nelson, C., Broder, S., Clark, A. G., Nadeau, J., McKusick, V. A., Zinder, N., Levine, A. J., Roberts, R. J., Simon, M., Slayman, C., Hunkapiller, M., Bolanos, R., Delcher, A., Dew, I., Fasulo, D., Flanigan, M., Florea, L., Halpern, A., Hannenhalli, S., Kravitz, S., Levy, S., Mobarry, C., Reinert, K., Remington, K., Abu-Threideh, J., Beasley, E., Biddick, K., Bonazzi, V., Brandon, R., Cargill, M., Chandramouliswaran, I., Charlab, R., Chaturvedi, K., Deng, Z., Di Francesco, V., Dunn, P., Eilbeck, K., Evangelista, C., Gabrielian, A. E., Gan, W., Ge, W., Gong, F., Gu, Z., Guan, P., Heiman, T. J., Higgins, M. E., Ji, R. R., Ke, Z., Ketchum, K. A., Lai, Z., Lei, Y., Li, J., Li, Z., Liang, Y., Lin, X., Lu, F., Merkulov, G. V., Milshina, N., Moore, H. M., Naik, A. K., Narayan, V. A., Neelam, B., Nusskern, D., Rusch, D. B., Salzberg, S., Shao, W., Shue, B., Sun, J., Wang, Z., Wang, A., Wang, X., Wang, J., Wei, M., Wides, R., Xiao, C., Yan, C., Yao, A., Ye, J., Zhan, M., Zhang, W., Zhang, H., Zhao, Q., Zheng, L., Zhong, F., Zhong, W., Zhu, S., Zhao, S., Gilbert, D., Baumhueter, S., Spier, G., Carter, C., Cravchik, A., Woodage, T., Ali, F., An, H., Awe, A., Baldwin, D., Baden, H., Barnstead, M., Barrow, I., Beeson, K., Busam, D., Carver, A., Center, A., Cheng, M. L., Curry, L., Danaher, S., Davenport, L., Desilets, R., Dietz, S., Dodson, K., Doup, L., Ferriera, S., Garg, N., Gluecksmann, A., Hart, B., Haynes, J., Haynes, C., Heiner, C., Hladun, S., Hostin, D., Houck, J., Howland, T., Ibegwam, C., Johnson, J., Kalush, F., Kline, L., Koduru, S., Love, A., Mann, F., May, D., McCawley, S., McIntosh, T., McMullen, I., Moy, L., Moy, M., Murphy, B., Nelson, K., Pfannkoch, C., Pratts, E., Puri, V., Qureshi, H., Reardon, M., Rodriguez, R., Rogers, Y. H., Romblad, D., Ruhfel, B., Scott, R., Sitter, C., Smallwood, M., Stewart, E., Strong, R., Suh, E., Thomas, R., Tint, N. N., Tse, S., Vech, C., Wang, G., Wetter, J., Williams, M., Williams, S., Windsor, S., Winn-Deen, E., Wolfe, K., Zaveri, J., Zaveri, K., Abril, J. F., Guigó, R., Campbell, M. J., Sjolander, K. V., Karlak, B., Kejariwal, A., Mi, H., Lazareva, B., Hatton, T., Narechania, A., Diemer, K., Muruganujan, A., Guo, N., Sato, S., Bafna, V., Istrail, S., Lippert, R., Schwartz, R., Walenz, B., Yooseph, S., Allen, D., Basu, A., Baxendale, J., Blick, L., Caminha, M., Carnes-Stine, J., Caulk, P., Chiang, Y. H., Coyne, M., Dahlke, C., Mays, A., Dombroski, M., Donnelly, M., Ely, D., Esparham, S., Fosler, C., Gire, H., Glanowski, S., Glasser, K., Glodek, A., Gorokhov, M., Graham, K., Gropman, B., Harris, M., Heil, J., Henderson, S., Hoover, J., Jennings, D., Jordan, C., Jordan, J., Kasha, J., Kagan, L., Kraft, C., Levitsky, A., Lewis, M., Liu, X., Lopez, J., Ma, D., Majoros, W. H., McDaniel, J., Murphy, S., Newman, M., Nguyen, N., Nguyen, T., Nodell, M., Pan, S., Peck, J., Peterson, M., Rowe, W., Sanders, R., Scott, J., Simpson, M., Smith, T., Sprague, A., Stockwell, T., Turner, R., Venter, E., Wang, M., Wen, M., Wu, D., Wu, M., Xia, A., Zandieh, A., and Zhu, X. (2001) The sequence of the human genome. Science 291:1304–1351.

Voorhees, E. M. (1986) Implementing agglomerative hierarchical clustering algorithms for use in document retrieval. Information Processing and Management 22:465–476.

Vinson, J., DeCaprio, D., Luoma, S., and Galagan, J. E. (2006) Gene prediction using conditional random fields. In: The Biology of Genomes, Cold Spring Harbor Laboratory, New York, May 10–14, 2006 (abstract).

Viterbi, A. (1967) Error bounds for convolutional codes and an asymptotically optimal decoding algorithm. IEEE Transactions on Information Theory 13:260–269.

von Hippel, P. T. (2005) Mean, median, and skew: correcting a textbook rule. Journal of Statistics Education 13.

Wain, H. M., Lovering, R. C., Bruford, E. A., Lush, M. J., Wright, M. W., and Povey, S. (2002) Guidelines for human gene nomenclature. Genomics 79:464–470.

Watson, J. D. and Crick, F. H. C. (1953) Molecular structure of nucleic acids. Nature 171:737–738.

Wheelan, S. J., Church, D. M., and Ostell, J. M. (2001) Spidey: a tool for mRNA-to-genomic alignments. Genome Research 11:1952–1957.

Wheeler, D. L., Barrett, T., Benson, D. A., Bryant, S. H., Canese, K., Church, D. M., DiCuccio, M., Edgar, R., Federhen, S., Helmberg, W., Kenton, D. L., Khovayko, O., Lipman, D. J., Madden, T. L., Maglott, D. R., Ostell, J., Pontius, J. U., Pruitt, K. D., Schuler, G. D., Schriml, L. M., Sequeira, E., Sherry, S. T., Sirotkin, K., Starchenko, G., Suzek, T. O., Tatusov, R., Tatusova, T. A., Wagner, L., and Yaschenko, E. (2005) Database resources of the National Center for Biotechnology Information. Nucleic Acids Research 33:D39–D45.

Wiehe, T., Gebauer-Jung, S., Mitchell-Olds, T., and Guigó, R. (2001) SGP-1: prediction and validation of homologous genes based on sequence alignments. Genome Research 11:1574–1583.

Wingender, E., Kel, A. E., Kel, O. V., Karas, H., Heinemeyer, T., Dietze, P., Knuppel, R., Romaschenko, A. G., and Kolchanov, N. A. (1997) TRANSFAC, TRRD and COMPEL: towards a federated database system on transcriptional regulation. Nucleic Acids Research 25:265–268.

Wojtowicz, W. M., Flanagan, J. J., Millard, S. S., Zipursky, S. L., and Clemens, J. C. (2004) Alternative splicing of Drosophila Dscam generates axon guidance receptors that exhibit isoform-specific homophilic binding. Cell 118:619–633.

Wortman, J. R., Haas, B. J., Hannick, L. I., Smith, R. K., Maiti, R., Ronning, C. M., Chan, A. P., Yu, C., Ayele, M., Whitelaw, C. A., White, O. R., and Town, C. D. (2003) Annotation of the Arabidopsis genome. Plant Physiology 132:461–468.

Wu, C. H., Yeh, L.-S. L., Guang, H., Arminski, L., Castro-Alvear, J., Chen, Y., Hu, Z.-Z., Ledley, R. S., Kourtesis, P., Suzek, B. E., Vinayaka, C. R., Zhang, J., and Barker, W. C. (2003) The Protein Information Resource. Nucleic Acids Research 31:345–347.

Xu, Y. and Uberbacher, E. C. (1996) Gene prediction by pattern recognition and homology search. In Proceedings of the 4th International Conference on Intelligent Systems for Molecular Biology, pp. 242–256.

Yan, J. and Marr, T. G. (2005) Computational analysis of 3′-ends of ESTs shows four classes of alternative polyadenylation in human, mouse, and rat. Genome Research 15:369–375.

Yang, Z. (1994) Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods. Journal of Molecular Evolution 39:306–314.

Yeh, R.-F., Lim, L. P., and Burge, C. B. (2001) Computational inference of homologous gene structures in the human genome. Genome Research 11:803–809.

Yeo, G. and Burge, C. B. (2004) Maximum entropy modeling of short sequence motifs with applications to RNA splicing signals. Journal of Computational Biology 11:377–394.

Yeo, G. W., Van Nostrand, E., Holste, D., Poggio, T., and Burge, C. B. (2005) Identification and analysis of alternative splicing events conserved in human and mouse. Proceedings of the National Academy of Sciences of the USA 102:2850–2855.

Younger, D. H. (1967) Recognition and parsing of context-free languages in time n³. Information and Control 10:189–208.

Yu, P., Ma, D., and Xu, M. (2005) Nested genes in the human genome. Genomics 86:414–422.

Zar, J. H. (1996) Biostatistical Analysis, 3rd edn. Englewood Cliffs, NJ: Prentice-Hall.

Zhang, M. Q. (1997) Identification of protein coding regions in the human genome by quadratic discriminant analysis. Proceedings of the National Academy of Sciences of the USA 94:565–568.

Zhang, M. Q. (2003) Prediction, annotation, and analysis of human promoters. Cold Spring Harbor Laboratory Symposium in Quantitative Biology 68:217–225.

Zhang, M. Q. and Marr, T. G. (1993) A weight array method for splicing signal analysis. Computer Applications in the Biosciences 9:499–509.

Zhang, H., Hu, J., Recce, M., and Tian, B. (2005) PolyA_DB: a database for mammalian mRNA polyadenylation. Nucleic Acids Research 33:D116–D120.

Zhang, L., Pavlovic, V., Cantor, C. R., and Kasif, S. (2003) Human–mouse gene identification by comparative evidence integration and evolutionary analysis. Genome Research 13:1190–1202.

Zhao, J., Hyman, L., and Moore, C. (1999) Formation of mRNA 3′ ends in eukaryotes: mechanism, regulation, and interrelationships with other steps in mRNA synthesis. Microbiology and Molecular Biology Reviews 63:405–445.

Zien, A., Rätsch, G., Mika, S., Schölkopf, B., Lengauer, T., and Müller, K.-R. (2000) Engineering support vector machine kernels that recognize translation initiation sites. Bioinformatics 16:799–807.

Zuker, M., Mathews, D. H., and Turner, D. H. (1999) Algorithms and thermodynamics for RNA secondary structure prediction: a practical guide. In Barciszewski, J. and Clark, B. F. C. (eds.) RNA Biochemistry and Biotechnology, pp. 11–43. New York: Kluwer.
