NIAES Annual Report 2003

NIAES Annual Report : Manuscript (28 October 2003)

Estimating larger phylogenetic trees for biodiversity studies

Nobuhiro Minaka
National Institute for Agro-Environmental Sciences, Tsukuba, Japan
minaka@affrc.go.jp

Analyzing biological diversity gives fundamental knowledge for various branches of biology and agronomy. Taxonomic or species-based database might be a useful tool for aiming to catalogue the world’s biota. However, the construction of such a taxonomic database (or inventory) is not the only task we need to undertake if we are to for understanding, maintaining, and recreating biodiversity. Indeed, taxonomic classification has intuitive appeal to us as humans, because we like to us, human beings, for cataloging living things into groups within groups according to degrees of similarity of various characters. But taxonomic groups (or ‘taxa’) --- species, genus, family, etc.--- do not give any explicit historical or evolutionary implications. All organisms on the earth are the products of biological evolution. Current states of biodiversity have their evolutionary origin in the past, and if we are maintaining and controlling biodiversity, will require us to know its evolutionary history. We could get more definite knowledge of the origin of biodiversity if our research is based on evolutionary biology.

Estimating the history of organisms (“phylogeny”) has become one of the most rapidly growing fields of research in evolutionary biology. It is also a flourishing area of interaction between biology, statistics, mathematics, and computer science. In the past the science of phylogeny (“phylogenetics”) had been denigrated as mere speculation driven up by bold imagination on scant empirical data. However, because of the growing accumulation of molecular biological data (DNA or amino acid sequence data) and methodological advances in phylogenetic biology over the past forty years, we now have highly applied techniques for estimating more reliable phylogenetic trees by using high-speed computers. Recent advances in genome informatics gave rise to a newer branch of interdisciplinary science of “bioinformatics”, and comparable advances in phylogenetic reconstruction will open up a promising field of “phyloinformatics” in the near future.

Phylogenetic reconstruction, whose purpose is to find an optimal graph called “phylogenetic tree” from character data (e.g., DNA sequences, morphology, etc.), has been faced with a serious problem in computer science: “NP-completeness”. A computational problem of class P (polynomial) requires us polynomial computing time proportionate to its size. On the other hand the category of NP-complete problems are known to have the highest level of computational difficulty in discrete optimization because they require NP (nondeterministic polynomial) computing time. NP-completeness implies exponential time for resolving problems. The problem of finding optimal phylogenetic trees was proved to belong to this category in 1982. Its computational difficulty can be understood intuitively from the fact that phylogeneticists must compare among alternative trees whose number N is calculated by the formula: N = 1 x 3 x 5 x ... x (2n - 3), where n is the number of organisms under study. The tree space for searching will become rapidly huge as the size (n) of tree increases (e.g., N is more than 34 million for n = 10). For this reason, more efficient heuristic algorithm is required to calculate larger phylogenetic trees in shorter time.

BOGEN is a new, more efficient program for calculating most parsimonious trees from molecular sequence data. The principal characteristic of our program is that it incorporates a new heuristic search strategy for building optimal initial trees using simultaneous subtree-connections. The upper limit of the size of data matrix is 10000 organisms / 50000 base pairs for the current version of BOGEN. From our benchmark test, the computing time on a Pentium 4 (2.26 GHz) Windows PC is as follows: for 500 organisms / 1000 bp, 50 seconds (initial tree) and 37 min (branch-swapping); for 1000 organisms / 1000 bp, 220 seconds (initial tree); for 5000 organisms / 1000 bp, 1 hour (initial tree). Calculating a 10000-organisms tree requires, on the average, at least 30 hours for optimal initial tree reconstruction. As far as we can compare the optimal trees from BOGEN with those from PAUP* 4.0 (currently most famous software for phylogenetic reconstruction), BOGEN trees have consistently shorter lengths than PAUP* trees, even without any branch-swapping operation. We are now developing a branch-swapping program for larger trees. Several examples are presented in the following references for comparing BOGEN with other parsimony programs.

REFERENCES

Nobuhiro Minaka, Takashi Suemura, Takehiro Asano, Haruo Yamamoto and Kouki Machii (2003) BOGEN: A faster parsimony program for computing larger phylogenetic trees. Paper presented at the 22th Annual Meeting of the Willi Hennig Society held at the New York Botanical Garden, New York, 20-25 July, 2003. [Abstract]
Nobuhiro Minaka, Haruo Yamamoto, Takehiro Asano, Takashi Suemura and Kouki Machii (2003) A more efficient heuristic algorithm for the large phylogenetic Steiner problem. Cladistics 19: 157.