INTRODUCTION
Advanced nucleic acid sequencing and bioinformatic technologies allow genomes and transcriptomes to be explored in detail, and thus provide useful tools for investigating the molecular biology, biochemistry and physiology of organisms.
This chapter describes genome-sequencing methodologies, together with the approaches and algorithms used for genome assembly and annotation. In contrast to long-established biochemical methodologies, the methods used to sequence and draft genomes are constantly evolving as new bioinformatic tools become available. Despite technological advances, considerable challenges remain in the sequencing, annotation and analysis of previously uncharacterised eukaryotic genomes. In this chapter, we therefore discuss some bioinformatic workflows that have proven to be efficient for the assembly and annotation of complex eukaryotic genomes, and give a perspective on future research toward improving genomic and transcriptomic methodologies.
A genome project starts with the assembly and annotation of a draft genome from experimentally determined DNA sequences referred to as reads (nucleotide sequence regions). The success of an assembly depends on the sequencing technology and the assembly algorithms used, as well as on the quality of the data. Typically, short-read shotgun assemblies do not yield complete chromosomal assemblies of eukaryotic genomes, mainly because regions containing repeats are difficult to resolve and because read coverage is often non-uniform, which can be linked to suboptimal quality of the genomic DNA or to poor library construction. Clearly, the quality of a genomic assembly is critical for subsequent gene prediction, and the quality of gene prediction is, in turn, crucial for achieving an acceptable annotation.
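To make the difficulty with repeats concrete, the following minimal sketch builds a simple k-mer overlap graph (in the style of a de Bruijn graph) from a toy sequence and reports the nodes at which the graph branches. It is an illustration under stated assumptions, not the method of any particular assembler discussed in this chapter: the toy genome, the repeat "GATTACA", the read lengths and the function names (tile_reads, kmer_successors, branching_nodes) are all hypothetical, and reads are assumed to be error-free and to tile the genome at every position.

```python
from collections import defaultdict


def tile_reads(genome, length):
    """Hypothetical, idealised sequencing: every substring of the given length."""
    return [genome[i:i + length] for i in range(len(genome) - length + 1)]


def kmer_successors(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in any read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def branching_nodes(graph):
    """Return the nodes with more than one successor (points of ambiguity)."""
    return {node: sorted(nexts) for node, nexts in graph.items() if len(nexts) > 1}


# Toy genome (an assumption for illustration): the 7-bp repeat "GATTACA"
# occurs twice, each time with different flanking sequence.
genome = "AACGATTACATTGGATTACACCG"

# Reads shorter than the repeat, analysed with k = 5 (nodes are 4-mers).
short_reads = tile_reads(genome, 6)
short_branches = branching_nodes(kmer_successors(short_reads, 5))

# Reads long enough to span the repeat plus flanks, analysed with k = 9.
long_reads = tile_reads(genome, 12)
long_branches = branching_nodes(kmer_successors(long_reads, 9))

print("short reads, k=5:", short_branches)
print("long reads,  k=9:", long_branches)
```

With the short reads, the graph branches at the end of the repeat, so the reads alone cannot tell which copy of the repeat connects to which downstream flank. With reads and k-mers longer than the repeat, the branch disappears in this toy case. Real data are far less forgiving, because sequencing errors, uneven coverage and much longer repeats all add further branching.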
The annotation of a genome is typically divided into structural and functional phases. For structural annotation, genomic features, such as genes, RNAs and repeats, are predicted, and their composition and location in the genome inferred. For gene prediction, results of both ab initio and evidence-based predictions (from mRNA, cDNA and/or proteomic data) are often combined. Functional annotation (also called functional prediction) assigns a potential function to a gene or genome element. In general, functions are predicted using similarity searches, structural comparisons, phylogenetic approaches, genetic interaction networks and machine-learning approaches. The following sections cover commonly used algorithms and recent methods for genome assembly, the prediction of protein-encoding genes and the functional annotation of such genes and their products.