  • Print publication year: 2016
  • Online publication date: June 2016

1 - Perspective: Challenges in assembling the ‘next generation’ Tree of Life

from Part I - Next Generation Phylogenetics


Make no little plans

Attributed to Daniel Burnham, Chicago architect


Phylogenetic trees are getting large. Trees based on single loci have been constructed for > 100 000 taxa (Price et al. 2010), and trees based on a handful of loci for > 10 000 taxa (Goloboff et al. 2009; Smith et al. 2011a). Basic counting arguments show that the number of loci needed to reconstruct a tree accurately increases with the number of leaves, N, in the tree (Mossel and Steel 2005, p. 400). Whether this scaling follows the conjectured rate of log(N) or is worse than that, the need for genome-scale datasets is likely to increase. Fortunately, the pace at which new sequence data are accumulating is extraordinary, and the revolutionary impact of these data on systematics has been noted many times (e.g. Goldman and Yang 2008). What is perhaps more noteworthy is that taxon sampling has been keeping pace with advances in sequencing technology, so that the size of phylogenetic datasets has been steadily increasing in both dimensions. Figure 1.1 shows the expanding wave front of phylogenetic dataset size, a kind of ‘Moore's Law’ for phylogenomics. This pattern undoubtedly has its limits. Goldman and Yang (2008) documented the exponential growth in the number of sequences in databases, but cautioned that molecular phylogenetic studies are accumulating at a rate that is less than exponential. This is probably due to a combination of the mean number of sequences per study increasing over time (Fig. 1.1) and the inevitably increasing difficulty of obtaining samples of rare taxa. Given the ‘hollow curve’ of distribution, that is, the fact that most species are both geographically restricted and locally uncommon (McGill 2010), it is doubtful that sampling across taxa will be able to keep up with sampling across individual genomes. Nonetheless, today ~ 19% of described biodiversity has at least one sequence in GenBank (355 000 species out of 1.9 million, as of March 2016).
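The flavour of such a counting argument can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not necessarily the argument given by Mossel and Steel (2005): it counts individual characters (sites) rather than loci, assumes r = 4 character states, and uses the standard result that there are (2N − 5)!! unrooted binary trees on N labelled leaves. A dataset of k characters on N taxa can take at most r^(Nk) distinct values, so a method that maps each dataset to one tree can only return every possible tree if r^(Nk) ≥ (2N − 5)!!. The function names are hypothetical.

```python
import math


def log_num_unrooted_binary_trees(n_leaves: int) -> float:
    """Natural log of (2N-5)!!, the number of unrooted binary trees
    on N labelled leaves (defined here for N >= 3)."""
    if n_leaves < 3:
        raise ValueError("need at least 3 leaves")
    # (2N-5)!! = 1 * 3 * 5 * ... * (2N-5); sum logs to avoid huge integers.
    return sum(math.log(k) for k in range(1, 2 * n_leaves - 4, 2))


def min_characters(n_leaves: int, n_states: int = 4) -> int:
    """Smallest k such that n_states**(N*k) >= (2N-5)!!.

    Assumption: a crude information-theoretic lower bound only; real
    data need far more characters than this to resolve a tree.
    """
    bound = log_num_unrooted_binary_trees(n_leaves) / (
        n_leaves * math.log(n_states)
    )
    return max(1, math.ceil(bound))


if __name__ == "__main__":
    for n in (10, 100, 10_000, 100_000):
        print(n, min_characters(n))
```

Because log((2N − 5)!!) grows roughly like N log N, the bound on k grows roughly like log(N), which matches the conjectured rate mentioned above. The bound says nothing about how many characters suffice in practice.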

There are many reasons to add genome-scale data to phylogenetic inference, whether for local problems in the Tree of Life or to solidify its deep backbone with a small number of exemplars, but this paper focuses on the task of building large, species-rich, high-resolution phylogenies.

Ané, C., Larget, B., Baum, D. A., Smith, S. D. and Rokas, A. (2007). Bayesian estimation of concordance among gene trees. Molecular Biology and Evolution, 24, 412–26.
Bansal, M., Burleigh, J. G., Eulenstein, O. and Wehe, A. (2007). Heuristics for the gene-duplication problem, An O(N) speed-up for the local search. In RECOMB 2007, ed. Speed, T. and Huang, H. Heidelberg, Springer; pp. 238–52.
Barker, M. S., Baute, G. J. and Liu, S.-L. (2012). Duplications and turnover in plant genomes. In Plant Genome Diversity, ed. Wendel, J. F. Vienna, Springer; pp. 155–69.
Bininda-Emonds, O. R. P., Brady, S. G., Kim, J. and Sanderson, M. J. (2001). Scaling of accuracy in extremely large phylogenetic trees. Pacific Symposium on Biocomputing, 6, 547–58.
Bremer, B., Bremer, K., Chase, M. W., et al. (2009). An update of the Angiosperm Phylogeny Group classification for the orders and families of flowering plants, APG III. Botanical Journal of the Linnean Society, 161, 105–21.
Burleigh, J. G., Bansal, M. S., Eulenstein, O., Hartmann, S., Wehe, A. and Vision, T. J. (2011). Genome-scale phylogenetics, inferring the plant tree of life from 18,896 gene trees. Systematic Biology, 60, 117–25.
Burleigh, J. G., Bansal, M. S., Wehe, A. and Eulenstein, O. (2009). Locating large-scale gene duplication events through reconciled trees, implications for identifying ancient polyploidy events in plants. Journal of Computational Biology, 16, 1071–83.
Chase, M. W., Soltis, D. E., Olmstead, R. G., et al. (1993). Phylogenetics of seed plants: An analysis of nucleotide sequences from the plastid gene rbcL. Annals of the Missouri Botanical Garden, 80, 528–80.
Cranston, K. A., Hurwitz, B., Sanderson, M. J., Ware, D., Wing, R. A. and Stein, L. (2010). Phylogenomic analysis from deep BAC-end sequence libraries in rice. Systematic Botany, 35, 512–23.
Davidson, R., Vachaspati, P., Mirarab, S. and Warnow, T. (2015). Phylogenomic species tree estimation in the presence of incomplete lineage sorting and horizontal gene transfer. BMC Genomics, 16 (Suppl. 10), S1.
Demuth, J. P. and Hahn, M. W. (2009). The life and death of gene families. BioEssays, 31, 29–39.
Duarte, J. M., Wall, P. K., Edger, P. P., et al. (2010). Identification of shared single copy nuclear genes in Arabidopsis, Populus, Vitis and Oryza and their phylogenetic utility across various taxonomic levels. BMC Evolutionary Biology, 10, 61.
Ebersberger, I., Galgoczy, P., Taudien, S., Taenzer, S., Platzer, M. and Von Haeseler, A. (2007). Mapping human genetic ancestry. Molecular Biology and Evolution, 24, 2266–76.
Edgar, R. C. (2004). MUSCLE, multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Research, 32, 1792–7.
Edwards, E. J. and Smith, S. A. (2010). Phylogenetic analyses reveal the shady history of C-4 grasses. Proceedings of the National Academy of Sciences of the United States of America, 107, 2532–7.
Fan, H. H. and Kubatko, L. S. (2011). Estimating species trees using approximate Bayesian computation. Molecular Phylogenetics and Evolution, 59, 354–63.
Felsenstein, J. (1978). Cases in which parsimony or compatibility methods will be positively misleading. Systematic Zoology, 27, 401–10.
Felsenstein, J. (2004). Inferring Phylogenies. Sunderland, MA, Sinauer Press.
Fletcher, W. and Yang, Z. H. (2009). INDELible, a flexible simulator of biological sequence evolution. Molecular Biology and Evolution, 26, 1879–88.
Goldman, N. and Yang, Z. (2008). Introduction. Statistical and computational challenges in molecular phylogenetics and evolution. Philosophical Transactions of the Royal Society of London B – Biological Sciences, 363, 3889–92.
Goloboff, P. A., Catalano, S. A., Mirande, J. M., et al. (2009). Phylogenetic analysis of 73 060 taxa corroborates major eukaryotic groups. Cladistics, 25, 211–30.
Goodman, M., Czelusniak, J., Moore, G. W., Romeroherrera, A. E. and Matsuda, G. (1979). Fitting the gene lineage into its species lineage, a parsimony strategy illustrated by cladograms constructed from globin sequences. Systematic Zoology, 28, 132–63.
Guindon, S., Dufayard, J. F., Lefort, V., Anisimova, M., Hordijk, W. and Gascuel, O. (2010). New algorithms and methods to estimate maximum-likelihood phylogenies, assessing the performance of PhyML 3.0. Systematic Biology, 59, 307–21.
Hejnol, A., Obst, M., Stamatakis, A., et al. (2009). Assessing the root of bilaterian animals with scalable phylogenomic methods. Proceedings of the Royal Society B – Biological Sciences, 276, 4261–70.
Izquierdo-Carrasco, F., Smith, S. A. and Stamatakis, A. (2011). Algorithms, data structures, and numerics for likelihood-based phylogenetic inference of huge trees. BMC Bioinformatics, 12, 470.
Källersjö, M., Farris, J. S., Chase, M. W., et al. (1998). Simultaneous parsimony jackknife analysis of 2538 rbcL DNA sequences reveals support for major clades of green plants, land plants, seed plants and flowering plants. Plant Systematics and Evolution, 213, 259–87.
Knowles, L. L. (2009). Estimating species trees, methods of phylogenetic analysis when there is incongruence across genes. Systematic Biology, 58, 463–7.
Liu, K., Linder, C. R. and Warnow, T. (2012). RAxML and FastTree, comparing two methods for large-scale maximum likelihood phylogeny estimation. PLoS One, 6, 11.
Liu, K., Warnow, T. J., Holder, M. T., Nelesen, S. M., Yu, J. Y., Stamatakis, A. P. and Linder, C. R. (2011). SATe-II, Very fast and accurate simultaneous estimation of multiple sequence alignments and phylogenetic trees. Systematic Biology, 61, 90–106.
Liu, L. and Pearl, D. K. (2007). Species trees from gene trees: Reconstructing Bayesian posterior distributions of a species phylogeny using estimated gene tree distributions. Systematic Biology, 56, 504–14.
Liu, L., Xi, Z. X., Wu, S. Y., Davis, C. C. and Edwards, S. V. (2015). Estimating phylogenetic trees from genome-scale data. In Year in Evolutionary Biology, ed. Mousseau, T. A. and Fox, C. W. Annals of the New York Academy of Sciences, 1360, 36–53.
Liu, L., Yu, L. L., Kubatko, L., Pearl, D. K. and Edwards, S. V. (2009). Coalescent methods for estimating phylogenetic trees. Molecular Phylogenetics and Evolution, 53, 320–8.
Liu, L., Yu, L. L. and Pearl, D. K. (2010). Maximum tree, a consistent estimator of the species tree. Journal of Mathematical Biology, 60, 95–106.
Löytynoja, A. and Goldman, N. (2008). Phylogeny-aware gap placement prevents errors in sequence alignment and evolutionary analysis. Science, 320, 1632–5.
McGill, B. J. (2010). Towards a unification of unified theories of biodiversity. Ecology Letters, 13, 627–42.
McMahon, M. M. and Sanderson, M. J. (2006). Phylogenetic supermatrix analysis of GenBank sequences from 2228 papilionoid legumes. Systematic Biology, 55, 818–36.
Mossel, E., Roch, S. and Sly, A. (2011). On the inference of large phylogenies with long branches, how long is too long? Bulletin of Mathematical Biology, 73, 1627–44.
Mossel, E. and Steel, M. (2005). How much can evolved characters tell us about the tree that generated them? In Mathematics of Evolution and Phylogeny, ed. Gascuel, O. New York, Oxford University Press.
Page, R. D. M. and Charleston, M. A. (1997). From gene to organismal phylogeny, reconciled trees and the gene tree/species tree problem. Molecular Phylogenetics and Evolution, 7, 231–40.
Philippe, H., Snell, E., Bapteste, E., Lopez, P., Holland, P. and Casane, D. (2004). Phylogenomics of eukaryotes: Impact of missing data on large alignments. Molecular Biology and Evolution, 21, 1740–52.
Piel, W. H., Donoghue, M. J. and Sanderson, M. J. (2002). TreeBASE, a database of phylogenetic knowledge. In To the Interoperable “Catalog of Life”, ed. Shimura, J., Wilson, K. L. and Gordon, D. Tsukuba, Japan, National Institute for Environmental Studies; pp. 41–7.
Pollard, D. A., Iyer, V. N., Moses, A. M. and Eisen, M. B. (2006). Widespread discordance of gene trees with species tree in Drosophila: Evidence for incomplete lineage sorting. PLoS Genetics, 2, 1634–47.
Price, M. N., Dehal, P. S. and Arkin, A. P. (2010). FastTree 2 – approximately maximum-likelihood trees for large alignments. PLoS One, 5, e9490.
Qiu, Y. L., Dombrovska, O., Lee, J., et al. (2005). Phylogenetic analyses of basal angiosperms based on nine plastid, mitochondrial, and nuclear genes. International Journal of Plant Sciences, 166, 815–42.
Reineke, A. R., Bornberg-Bauer, E. and Gu, J. (2011). Evolutionary divergence and limits of conserved non-coding sequence detection in plant genomes. Nucleic Acids Research, 39, 6029–43.
Roch, S. (2006). A short proof that phylogenetic tree reconstruction by maximum likelihood is hard. IEEE-ACM Transactions on Computational Biology and Bioinformatics, 3, 92–4.
Sanderson, M. J. (2008). Phylogenetic signal in the eukaryotic tree of life. Science, 321, 121–3.
Sanderson, M. J., Boss, D., Chen, D., Cranston, K. A. and Wehe, A. (2008). The PhyLoTA browser, processing GenBank for molecular phylogenetics research. Systematic Biology, 57, 335–46.
Sanderson, M. J. and McMahon, M. M. (2007). Inferring angiosperm phylogeny from EST data with widespread gene duplication. BMC Evolutionary Biology, 7 (Suppl. 1), S3.
Sanderson, M. J., McMahon, M. M. and Steel, M. (2010). Phylogenomics with incomplete taxon coverage, the limits to inference. BMC Evolutionary Biology, 10, 155.
Sanderson, M. J., McMahon, M. M. and Steel, M. (2011). Terraces in phylogenetic tree space. Science, 333, 448–50.
Sankoff, D., Zheng, C. F., Munoz, A., et al. (2010). Issues in the reconstruction of gene order evolution. Journal of Computer Science and Technology, 25, 10–25.
Semple, C. and Steel, M. (2003). Phylogenetics. New York, Oxford University Press.
Smith, S. A., Beaulieu, J. M., Stamatakis, A. and Donoghue, M. J. (2011a). Understanding angiosperm diversification using small and large phylogenetic trees. American Journal of Botany, 98, 404–14.
Smith, S. A., Wilson, N. G., Goetz, F. E., et al. (2011b). Resolving the evolutionary relationships of molluscs with phylogenomic tools. Nature, 480, 364–7.
Soltis, D. E., Smith, S. A., Cellinese, N., et al. (2011). Angiosperm phylogeny, 17 genes, 640 taxa. American Journal of Botany, 98, 704–30.
Soltis, D. E., Soltis, P. S., Chase, M. W., et al. (2000). Angiosperm phylogeny inferred from 18S rDNA, rbcL, and atpB sequences. Botanical Journal of the Linnean Society, 133, 381–461.
Stamatakis, A., Hoover, P. and Rougemont, J. (2008). A rapid bootstrap algorithm for the RAxML web servers. Systematic Biology, 57, 758–71.
Stamatakis, A. and Ott, M. (2008). Efficient computation of the phylogenetic likelihood function on multi-gene alignments and multi-core architectures. Philosophical Transactions of the Royal Society of London B – Biological Sciences, 363, 3977–84.
Steel, M. and Sanderson, M. J. (2010). Characterizing phylogenetically decisive taxon coverage. Applied Mathematics Letters, 23, 82–6.
Tautz, D. and Domazet-Loso, T. (2011). The evolutionary origin of orphan genes. Nature Reviews Genetics, 12, 692–702.
White, M. A., Ané, C., Dewey, C. N., Larget, B. R. and Payseur, B. A. (2009). Fine-scale phylogenetic discordance across the house mouse genome. PLoS Genetics, 5, e1000729.
Wiens, J. J. (2003). Missing data, incomplete taxa, and phylogenetic accuracy. Systematic Biology, 52, 528–38.
Wilkinson, M. (2003). Missing entries and multiple trees, instability, relationships and support in parsimony analysis. Journal of Vertebrate Paleontology, 23, 311–23.