NIAES Annual Report : Manuscript (revised: 30 October 2003)


Estimating larger phylogenetic trees for biodiversity studies

Nobuhiro Minaka
National Institute for Agro-Environmental Sciences, Tsukuba, Japan
minaka@affrc.go.jp

Analysis of biological diversity gives fundamental knowledge for various branches of biology and agronomy. A taxonomic or species-based database would be a useful tool for cataloguing the worldfs biota. However, the construction of such a taxonomic database or inventory is not the only task we need to undertake if we are to understand, maintain, and recreate biodiversity. Taxonomic classification has intuitive appeal to us as humans, because we like to catalog living things into groups within groups according to the degrees of similarity of various characters. However, such taxonomic groups or etaxaf--- species, genus, family, and so forth --- do not have explicit historical or evolutionary implications. All organisms on Earth are the products of biological evolution. Current states of biodiversity have their evolutionary origin in the past, and if we are to maintain and control biodiversity, we need to know its evolutionary history. If our research were based on evolutionary biology, we would be able to obtain more definite knowledge of the origin of biodiversity.

Estimation of the history of organisms (ephylogenyf) is one of the most rapidly growing fields of research in evolutionary biology. It is also a flourishing area of interaction between biology, statistics, mathematics, and computer science. In the past the science of phylogeny (ephylogeneticsf) was denigrated as mere speculation driven predominantly by the action of bold imagination on scant empirical data. However, because of the growing accumulation of molecular biological data (DNA or amino acid sequence data) and methodological advances in phylogenetic biology over the past 40 years, we now have techniques for estimating more reliable phylogenetic trees by using high-speed computers. Recent advances in genome informatics have given rise to the new interdisciplinary science of bioinformatics, and comparable advances in phylogenetic reconstruction will open up the promising field of phyloinformatics in the near future.

Phylogenetic reconstruction, whose purpose is to find an optimal graph called phylogenetic tree from character data (e.g., DNA sequences, morphology), has been faced with a serious problem in computer science: gNP-completenessh. A computational problem of class P (polynomial) requires us polynomial computing time proportionate to its size. On the other hand the category of NP-complete problems are known to possess the highest level of computational difficulty in terms of discrete optimization because they require NP (nondeterministic polynomial) computing time. NP-completeness implies exponential time for resolving problems. The problem of finding optimal phylogenetic trees was proven to belong to the NP-complete category in 1982. Its computational difficulty can be understood intuitively from the fact that phylogeneticists must make comparisons among alternative trees whose number N is calculated by the formula: N = 1 x 3 x 5 x ... x (2n-3), where n is the number of organisms under study. The tree space for searching will rapidly magnify as n increases (e.g., N is more than 34 million for n = 10). For this reason, a more efficient heuristic algorithm is required to calculate larger phylogenetic trees in over a shorter time.

BOGEN is a new, more efficient program for calculating the most parsimonious trees from molecular sequence data. The principal characteristic of our program is that it incorporates a new heuristic search strategy for building optimal initial trees using simultaneous subtree-connections. The upper limit of the size of the data matrix is 10000 organisms / 50000 base pairs for the current version of BOGEN. From our benchmark test, the computing time on a Pentium 4 (2.26 GHz) Windows PC is as follows: for 500 organisms / 1000 bp, 50 s (initial tree) and 37 min (branch-swapping); for 1000 organisms / 1000 bp, 220 s (initial tree); for 5000 organisms / 1000 bp, 1 h (initial tree). Calculating a 10 000-organisms tree requires, on the average, at least 30 h for optimal initial tree reconstruction. In our comparisons so far of optimal trees from BOGEN with those from PAUP* 4.0 (currently the most popular software for phylogenetic reconstruction), BOGEN trees have been consistently shorter lengths than PAUP* trees, even without branch-swapping operations. We are now developing a branch-swapping program for larger trees. Several examples are presented in the following references for comparison of BOGEN with other parsimony programs.


References
  • Minaka N., Suemura T., Asano T., Yamamoto H. and Machii K. (2003) BOGEN: A faster parsimony program for computing larger phylogenetic trees. Paper presented at the 22nd Annual Meeting of the Willi Hennig Society held at the New York Botanical Garden, New York, 20-25 July, 2003. [Abstract]

  • Minaka N., Yamamoto H., Asano T., Suemura T. and Machii K. (2003) A more efficient heuristic algorithm for the large phylogenetic Steiner problem. Cladistics 19: 157.