
Are phylogenetic tree construction algorithms any different than general clustering algorithms?


I suspect the answer is no, but I don't know enough to be sure. Of course phylogenetic tree construction uses biological knowledge, e.g. special distance metrics, but does it bring anything new at the level of the clustering algorithm (e.g. hierarchical clustering, neighbor-joining, etc.)?


One difference is that phylogenetic tree construction algorithms will typically try to jointly estimate the tree and the transition parameters. A general clustering algorithm might assume that a difference in one feature would make the same contribution to 'distance' between clusters as a difference in another feature. In phylogenetic tree construction, it is explicit that not all differences between taxa represent the same evolutionary distance between taxa, and the parameters governing the distribution of distance/difference may be jointly estimated with the tree.
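To make the contrast concrete, here is a minimal sketch comparing a generic "count the differences" distance with a model-based evolutionary distance. The Jukes-Cantor (JC69) correction is used only as a familiar example of a substitution model; the function names are illustrative, and because JC69 has no free rate parameters this shows the model-based correction only, not the joint estimation of tree and parameters described above.

```python
import math

def p_distance(seq_a: str, seq_b: str) -> float:
    """Fraction of aligned sites that differ; every mismatch counts the same."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to equal length")
    mismatches = sum(a != b for a, b in zip(seq_a, seq_b))
    return mismatches / len(seq_a)

def jukes_cantor_distance(seq_a: str, seq_b: str) -> float:
    """Expected substitutions per site under the Jukes-Cantor (JC69) model."""
    p = p_distance(seq_a, seq_b)
    if p >= 0.75:
        # Saturated: the model cannot estimate a finite distance.
        return math.inf
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

if __name__ == "__main__":
    a, b = "ACGTACGT", "ACGTACGA"
    print(p_distance(a, b), jukes_cantor_distance(a, b))
```

The corrected distance is always at least as large as the raw proportion of differences, because the model accounts for multiple substitutions at the same site that a plain mismatch count cannot see.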


Genetic Algorithm Based Phylogenetic Tree Reconstruction Biology Essay

ABSTRACT−Phylogenetic tree construction is a challenging and widely studied problem in bioinformatics. Because of its NP-complete character, it remains an open problem for researchers. As the number of species grows, the computational cost grows as well, to the point where traditional methods such as the Unweighted Pair Group Method with Arithmetic Mean (UPGMA), Maximum Likelihood and Maximum Parsimony become impractical. To address this problem, several researchers have investigated metaheuristic methods for constructing phylogenetic trees and have reported promising results. This paper gives a brief survey of metaheuristic approaches, namely Ant Colony Optimization (ACO), Particle Swarm Optimization (PSO) and the Genetic Algorithm (GA), that have been used to optimize phylogenetic tree reconstruction.

INDEX TERM− Particle Swarm Optimization, Ant Colony Optimization, Genetic Algorithm, Phylogenetic Tree Reconstruction, Evolutionary computation.

Phylogenetics [19] is the study of the evolutionary history of living organisms, in which the divergence between species is represented by a directed graph or tree known as a phylogeny. The tree is constructed from the molecular sequences of the different species. A tree derived from gene or protein sequences is known as a gene phylogeny, whereas a species phylogeny represents the evolutionary path of the species themselves. A gene phylogeny can be thought of as a local descriptor: it describes the evolution of a gene and its encoded sequence, and so helps reveal how the sequences of different genes are related to one another. There are two main types of tree: (a) rooted trees, in which all nodes descend from a single node, and (b) unrooted trees, which do not descend from one particular node.

The tree must follow standard graph-theory notation, in which nodes represent species and branches (edges) represent the relationships between species. The remainder of the paper is organized as follows: Section 2 gives an introduction to the Genetic Algorithm, followed by the work done on optimizing the phylogenetic tree using GA. Section 3 covers an introduction to Ant Colony Optimization, Section 4 describes methods other than GA and ACO applied to the construction of phylogenetic trees, and Section 5 concludes the paper.

Fig.1.1 (a) Unrooted Tree, (b) Rooted Tree.

A Genetic Algorithm (GA) [2] is a heuristic search algorithm based on techniques inspired by natural evolution, such as inheritance, mutation, selection, and crossover. A GA begins with a population of encoded random solutions. These encoded solutions are usually termed chromosomes, and their ability to solve the problem is measured by a fitness function. The individuals are subjected to selection based on their fitness values. In each generation individuals undergo mutation and recombination, with the mutation and recombination operators defined according to the nature of the problem. The new population is then used in the next iteration of the algorithm. The algorithm usually terminates when a solution is close enough or equal to the desired answer, or when a satisfactory fitness level has been reached for the population. Section 2.2 discusses how the Genetic Algorithm is applied to phylogenetic tree reconstruction, and Section 2.3 describes the work done by several researchers on constructing phylogenetic trees with Genetic Algorithms.
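The generic loop described above can be sketched in a few lines. This is a toy illustration on bit-string chromosomes, not a phylogenetic implementation: the function name `run_ga`, the parameter defaults, the one-point crossover and the bit-flip mutation are all assumptions chosen for brevity.

```python
import random

def run_ga(fitness, genome_length, pop_size=30, generations=60,
           mutation_rate=0.02, rng=None):
    """Toy GA on bit strings; fitness must return a non-negative number."""
    rng = rng or random.Random(0)
    # Start from a population of random encoded solutions (chromosomes).
    population = [[rng.randint(0, 1) for _ in range(genome_length)]
                  for _ in range(pop_size)]
    best_history = []
    for _ in range(generations):
        scores = [fitness(ind) for ind in population]
        best_history.append(max(scores))
        # Selection: fitness-proportional; the offset avoids all-zero weights.
        weights = [s + 1e-9 for s in scores]
        parents = rng.choices(population, weights=weights, k=pop_size)
        next_population = []
        for i in range(0, pop_size, 2):
            mum, dad = parents[i], parents[(i + 1) % pop_size]
            # Recombination: one-point crossover (needs genome_length >= 2).
            cut = rng.randrange(1, genome_length)
            for child in (mum[:cut] + dad[cut:], dad[:cut] + mum[cut:]):
                # Mutation: flip each bit with a small probability.
                child = [1 - g if rng.random() < mutation_rate else g
                         for g in child]
                next_population.append(child)
        population = next_population[:pop_size]
    best = max(population, key=fitness)
    return best, best_history

if __name__ == "__main__":
    best, history = run_ga(sum, genome_length=20)
    print(sum(best), history[0], history[-1])
```

For tree reconstruction the chromosome would encode a tree rather than a bit string, and the crossover and mutation operators would have to be redefined so that they always produce valid trees.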

2.2 How Genetic Algorithm is applied in Phylogenetic Tree Reconstruction

GAs have been applied to a variety of complex problems in engineering for many years, although their use on problems involving biological data has been explored only recently. The ability of GAs to find near-optimal solutions quickly on complex data makes them ideal candidates for the problem of phylogenetic inference, especially when many taxa are included or when complicated evolutionary models (necessitating computer-intensive inference methods such as maximum likelihood) are applied. In the case of phylogeny reconstruction, the single chromosome of each individual can be designed to encode a single phylogenetic tree, along with its branch lengths and the values of the other parameters of the substitution model used. Mutation and recombination operators can be defined for phylogenetic trees, and the fitness of an individual may be equated to its natural log likelihood (lnL) score. Trees with higher values of lnL thus tend to leave more offspring in the next generation, and selection increases the average lnL of the individuals in the simulated population. The tree with the highest lnL after the population fitness ceases to improve is taken to be the best estimate of the maximum-likelihood tree [8].

2.3 Genetic Algorithm based Phylogenetic Tree Reconstruction

Hideo Matsuda [5] (1996) proposed the Construction of Phylogenetic Trees from Amino Acid Sequences using a Genetic Algorithm, which differs from the simple genetic algorithm [2] in its encoding scheme and in its crossover and mutation operators. At the initial stage, a fixed number of trees is selected from the available alternative trees by roulette selection based on their fitness values. Crossover and mutation are then applied from generation to generation to improve the quality of the trees. Because the number of trees in each generation is fixed, the tree with the best score could be lost through these operators, so the algorithm checks that the best-scoring tree survives into each generation. The main advantage of the algorithm is its ability to construct more likely trees from randomly generated trees with the help of the crossover and mutation operators. The experimental results show that the performance of the proposed algorithm is comparable to that of other tree construction methods, such as Maximum Parsimony, Maximum Likelihood and UPGMA, combined with different search algorithms.
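The rule that the best-scoring tree must survive each generation is commonly called elitism. The sketch below shows one generation of roulette selection followed by an elitism check. Individuals are stand-in values rather than trees, scores are assumed to be positive so they can serve directly as roulette weights, and `next_generation_with_elitism`, `score` and `vary` are hypothetical names, not taken from the cited paper.

```python
import random

def next_generation_with_elitism(population, score, vary, rng):
    """One generation: roulette selection, variation, then elitism.

    score(ind) must be positive (it is used as a roulette weight);
    vary(ind, rng) stands in for the crossover and mutation operators.
    """
    scores = [score(ind) for ind in population]
    elite = population[scores.index(max(scores))]
    # Roulette selection: pick parents with probability proportional to score.
    chosen = rng.choices(population, weights=scores, k=len(population))
    offspring = [vary(ind, rng) for ind in chosen]
    # Elitism: if variation lost the best individual, restore it in place
    # of the worst offspring so the population size stays fixed.
    if max(score(ind) for ind in offspring) < score(elite):
        worst = min(range(len(offspring)), key=lambda i: score(offspring[i]))
        offspring[worst] = elite
    return offspring

if __name__ == "__main__":
    rng = random.Random(1)
    # A deliberately harmful variation operator: every offspring gets worse.
    worsen = lambda x, _rng: max(1, x - 1)
    print(next_generation_with_elitism([1, 2, 3, 10], float, worsen, rng))
```

With log-likelihood scores, which are negative, the weights would first have to be shifted or rank-transformed before roulette selection.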

Phylogeny reconstruction is a difficult computational problem because, as more taxa (objects) are included, the number of possible solutions increases, which in turn increases the time spent evaluating non-optimal trees. To overcome this problem, Paul et al. [8] (1998) proposed A Genetic Algorithm for Maximum-Likelihood Phylogeny Inference Using Nucleotide Sequence Data. The work provides a genetic-algorithm-based heuristic search that reduces the time required for maximum-likelihood phylogenetic inference on datasets involving large numbers of taxa. The algorithm works as follows. First, each individual is initialized with a random tree topology in which every branch is assigned a random value. The fitness of each individual is then calculated from its lnL score. The individual with the highest lnL score is used to generate the offspring for the next generation. Finally, a recombination operation is performed; it is this recombination step that distinguishes the GA from traditional search methods and lets it reach a solution in less time. The experimental results show that the GA needed only 6% of the computational effort required by a conventional heuristic search using tree bisection-reconnection (TBR) branch swapping to obtain the same maximum-likelihood topology.

In 2002 Clare et al. [10] proposed "Gaphyl: an evolutionary algorithms approach to investigate the evolutionary relationship among organisms". Existing phylogenetic software packages use heuristic search methods to find the optimum phylogenetic tree, whereas Gaphyl uses evolutionary mechanisms and is reported to find a more complete set of solutions in less time. The GA search process as implemented in Gaphyl represents a gain for phylogenetics in finding more equally plausible trees than Phylip [3] in the same runtime. Furthermore, as datasets grow larger through increases in the number of species and attributes, the advantage of Gaphyl over Phylip appears to increase, because the Gaphyl search process is independent of the number of attributes (and attribute values); the complexity of the search varies with the number of species, which determines the number of leaf nodes in the tree.

Clare et al. [12] (2003) proposed a new version of Gaphyl, extended to work with genetic data. A DNA version of Gaphyl is constructed, and the search processes of Gaphyl and Phylip are compared on DNA data. The experimental results show that Gaphyl performs better than Phylip in some cases.

3. ANT COLONY OPTIMIZATION

Ant Colony Optimization (ACO) is a metaheuristic developed by M. Dorigo et al. [6] (1996), inspired by the foraging behavior of real ants. When searching for food, an ant initially explores the area around the nest at random. When it finds a food source, it evaluates the quality and quantity of the food and carries some of it back to the nest. On the way back, the ant deposits a chemical pheromone trail on the ground, which helps other ants reach the food source. This indirect communication between ants via pheromone trails helps them find the shortest path between the nest and the food source. Artificial ant colonies exploit this property to solve combinatorial optimization (CO) problems. In general, ACO repeats the following two steps to solve an optimization problem:

1) A candidate solution is constructed using a pheromone model, i.e. a parameterized probability distribution over the solution space.

2) The candidate solutions are used to update the pheromone values so that future sampling is biased toward good-quality solutions.
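The two steps above can be sketched as a small loop on a symmetric distance matrix, in the style of ACO for the traveling salesman problem. This is an illustrative sketch only: the function name, the parameters `alpha`, `beta` and `evaporation`, and the deposit rule (1 divided by tour length) are common textbook choices assumed here, not details taken from the surveyed papers.

```python
import random

def aco_shortest_tour(dist, n_ants=20, n_iters=50, alpha=1.0, beta=2.0,
                      evaporation=0.5, rng=None):
    """Toy ACO for a closed tour; dist[i][j] must be > 0 for i != j."""
    rng = rng or random.Random(0)
    n = len(dist)
    pheromone = [[1.0] * n for _ in range(n)]
    best_tour, best_len = None, float("inf")

    def tour_length(tour):
        return sum(dist[tour[i]][tour[(i + 1) % n]] for i in range(n))

    for _ in range(n_iters):
        tours = []
        # Step 1: each ant samples a candidate solution from the pheromone model.
        for _ in range(n_ants):
            tour = [rng.randrange(n)]
            while len(tour) < n:
                here = tour[-1]
                options = [c for c in range(n) if c not in tour]
                weights = [pheromone[here][c] ** alpha
                           * (1.0 / dist[here][c]) ** beta for c in options]
                tour.append(rng.choices(options, weights=weights, k=1)[0])
            tours.append(tour)
        # Step 2: evaporate old pheromone, then reinforce the edges used.
        for row in pheromone:
            for j in range(n):
                row[j] *= 1.0 - evaporation
        for tour in tours:
            length = tour_length(tour)
            if length < best_len:
                best_tour, best_len = tour, length
            for i in range(n):
                a, b = tour[i], tour[(i + 1) % n]
                pheromone[a][b] += 1.0 / length
                pheromone[b][a] += 1.0 / length
    return best_tour, best_len

if __name__ == "__main__":
    d = 2 ** 0.5
    square = [[0, 1, d, 1], [1, 0, 1, d], [d, 1, 0, 1], [1, d, 1, 0]]
    print(aco_shortest_tour(square))
```

Shorter tours deposit more pheromone per edge, so later ants are increasingly likely to follow the edges that good tours have used.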

Section 3.2 discusses how Ant Colony Optimization (ACO) is applied to phylogenetic tree reconstruction, and Section 3.3 describes the work done by several researchers on constructing phylogenetic trees with Ant Colony Optimization.

3.2 How Ant Colony Optimization is applied in Phylogenetic Tree Reconstruction.

The phylogenetic tree construction problem bears a close resemblance to the standard Traveling Salesman Problem (TSP). One can associate an imaginary city with each taxon and define the distance between two cities as the value in the data matrix for the corresponding pair of taxa. This formulation paves the way for heuristic algorithms such as ACO. The ant system selects an intermediary node between the two previously selected ones, and the distances to the remaining nodes (species) are recalculated based on this intermediary node. The procedure is repeated recursively until every node has been added to the list of visited nodes, at which point the path is complete. The score of the path, defined as the sum of the transition probabilities of its adjacent nodes, is used to update the pheromone trail. During an execution cycle, every node that belongs to at least one path contributes to incrementing the pheromone trail. This key point helps to avoid becoming trapped in a local maximum. In this way, following an algorithm very close in spirit to the ant colony algorithm for solving the TSP, phylogenetic trees may be reconstructed efficiently [20].

3.3 Ant Colony Optimization based Phylogenetic Tree Reconstruction

Shin Ando et al. [11] (2002) proposed an Ant Algorithm for Construction of Evolutionary Tree, which hybridizes the ant colony algorithm with stack count. The algorithm applies ACO as a metaheuristic search for NP problems. The authors introduce two new mechanisms, the suffix representation and a vertex-choosing mechanism, which enhance the exploration capability of the ant colony by applying the stack count [7] strategy. From the set of possible trees, the algorithm chooses the one that minimizes the score for a given set of DNA sequences. The proposed algorithm shows satisfactory results in a simulated experiment and in the alignment of protein sequences from 15 species.

Pisist Kumnorkaew et al. [15] (2004) proposed a new Ant Colony based algorithm in which the evolutionary tree is constructed with minimum total branch length. It comprises tree construction, branch length calculation, branch point selection, a new ACO parameter and a distance-weighting parameter. The algorithm starts by placing the ants at different branch points and setting the initial value of the pheromone trail on every edge. Once the branch point of each ant in the branch point selection vector is sorted, each ant selects the city to move to in the next step based on pheromone trail and distance. The algorithm repeats the city selection and movement of each ant until all ants have completed their tours and their branch point selection vectors are filled. The total branch length of each ant is computed and the pheromone values are updated. Before the branch point selection vector is emptied, the shortest total branch length is stored. This process continues until the terminating condition is reached. To further enhance the algorithm, small negative branch lengths are accepted, because for large values of n (the number of species in the evolutionary tree problem) accepting only positive branch lengths limits convergence. The output is the shortest pathway of the ant, including the branch labels and lengths, which is sufficient to construct the evolutionary tree. The reported experimental results indicate that the algorithm greatly reduces the exponential time complexity of the evolutionary tree problem, running in polynomial time.

Mauricio Perretto et al. [16] (2005) proposed a Reconstruction of Phylogenetic Tree using the Ant Colony Optimization Paradigm. In the proposed algorithm, phylogenetic trees are reconstructed from a fully connected graph built from the distance matrix among species. In this graph the nodes represent the species and the edges represent the distances between species. Each ant initially selects a random node; then, at each node, its direction is determined by the transition function. The objective of each ant is to find a path that maximizes the transition probabilities, which yields the sequence of species with the smallest evolutionary distance. The proposed algorithm is compared with the well-known PHYLIP package using the programs NEIGHBOR and FITCH. The comparison is based on an analysis of tree structure and the total distance between nodes. Overall, the experimental results reported in that paper were very promising.

Ling et al. [18] (2006) proposed a novel approach to phylogenetic tree construction using stochastic optimization and clustering, in which the ant colony algorithm is combined with both a clustering method and a global optimization technique so that an optimal tree can be found even from a bad initial tree topology. The proposed method consists of three components: initialization, construction of phylogenetic trees through clustering, and phylogenetic tree optimization. In the initialization phase a weighted digraph is built in which each vertex represents an object to be clustered and each edge represents the acceptance rate between two objects. The ants then travel in the digraph and update the pheromone on their paths. Finally, the ant colony and its pheromone feedback system act as a global optimization technique for deriving the optimal topology of the tree. The proposed algorithm is compared with the Genetic Algorithm, and the results show that it converges much faster and achieves higher solution quality.

Jing Juo et al. [17] (2006) proposed A Self Adaptive Ant Colony Algorithm for Phylogenetic Tree Construction, in which the phylogenetic tree is constructed based on the equilibrium of distribution. The proposed method involves three steps: initialization, construction of phylogenetic trees from the optimal path found by the ants, and optimization. First, a fully connected graph is constructed using the distance matrix among species. To begin the reconstruction, each ant starts at a random node. The ants travel across the graph, and at each node an ant chooses its direction based on the probability function. The algorithm adjusts the trail information, and the probability of each path is determined from the equilibrium of the path selection distribution. Every ant repeats the procedure until all nodes have been traversed once, i.e. all nodes are in the list of already visited nodes. The score of a path is given by the sum of the probability function over the adjacent nodes in the path. To accelerate convergence and to avoid local convergence, the algorithm adjusts the selection probability and the trail information strategy on each path based on the quality and distribution of the solutions obtained. The proposed algorithm is compared with the Neighbor Joining (NJ) programs in the PHYLIP software package and with the TSP approach. The experimental results show that the proposed algorithm is easier to implement and obtains higher quality results than the other algorithms.

Ling Chen et al. [21] (2009) proposed a new algorithm for phylogenetic tree construction based on Ant Colony Partitioning. Initially the root of the tree is defined, corresponding to the full set of gene sequences. The algorithm then bisects the set so that sequences within a subset have similar properties and sequences in different subsets have different properties. This process is repeated recursively until every subset contains only one gene sequence. From these subsets a phylogenetic tree is progressively constructed in which the leaves are the gene sequences. Each level of bisection is based on an extension of ant colony optimization for the traveling salesman problem. The experimental results demonstrate that the proposed algorithm is easy to implement, efficient, converges faster and obtains higher quality results than other methods.

4. PHYLOGENETIC TREE RECONSTRUCTION WITH OTHERS

Shuying et al. [9] (2000) developed a Bayesian phylogenetic reconstruction method based on Markov chains. The method generates a sequence of phylogenetic trees using the Markov chain Monte Carlo (MCMC) technique. The Markov chain is based on the Metropolis algorithm, and its stationary distribution is the conditional distribution of the phylogenetic tree given the observed sequences. The algorithm balances the desire to move globally through the space of phylogenies against the need to make feasible moves in regions of high probability. The proposed algorithm is fast per iteration because the calculation at the target node is kept local, and because large portions of the tree can potentially be swapped, substantial changes to the tree are possible in few moves.
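The acceptance step of a Metropolis sampler of this kind can be written out explicitly. The notation is an assumption made for illustration and is not taken from the cited paper: T is the current tree, T' the proposed tree, X the observed sequences, and the proposal is assumed to be symmetric.

```latex
\[
\alpha(T \to T')
  = \min\left\{ 1,\ \frac{P(T' \mid X)}{P(T \mid X)} \right\}
  = \min\left\{ 1,\ \frac{P(X \mid T')\, P(T')}{P(X \mid T)\, P(T)} \right\}
\]
```

The second form follows from Bayes' rule: the normalizing constant P(X) cancels in the ratio, so only the likelihood and prior of the two trees need to be evaluated at each step.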

Most existing approaches to phylogenetic inference use multiple sequence alignments. But multiple sequence alignment copes poorly with gene rearrangements, inversions, transpositions and translocations at the substring level, and with sequences of unequal length, and it does not work for whole-genome phylogeny. Phylogenetic analysis based on complete genomes is appealing because single gene sequences do not contain enough information to reconstruct the evolutionary history of organisms. To overcome this problem, Hasan et al. [13] (2003) proposed A new sequence distance measure for phylogenetic tree construction, in which a phylogenetic tree is constructed from distances between finite sequences measured using LZ complexity [1]. The LZ complexity of a finite sequence S is defined as the number of steps required by a production process that builds S. The resulting distance matrix is used to construct phylogenetic trees. The main advantage of the proposed approach is that it requires no sequence alignment and is fully automatic. The experimental results show that the proposed algorithm successfully constructed consistent phylogenies for real and simulated data sets.
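As a rough illustration of the idea, the sketch below counts the components in a Lempel-Ziv (1976) style parsing of a sequence and derives a distance from the extra components needed when one sequence is appended to another. The distance formula used here, max(c(SQ) - c(S), c(QS) - c(Q)), is an assumption for illustration; the text above does not give the exact formula, and the function names are hypothetical.

```python
def lz_complexity(s: str) -> int:
    """Number of components in a Lempel-Ziv (1976) style parsing of s."""
    i, count, n = 0, 0, len(s)
    while i < n:
        length = 1
        # Extend the current component while it can still be copied from
        # text that starts earlier (the copy may overlap the component).
        while i + length <= n and s[i:i + length] in s[:i + length - 1]:
            length += 1
        count += 1
        i += length
    return count

def lz_distance(s: str, q: str) -> int:
    """Extra parsing steps needed to build one sequence given the other."""
    return max(lz_complexity(s + q) - lz_complexity(s),
               lz_complexity(q + s) - lz_complexity(q))

if __name__ == "__main__":
    print(lz_complexity("0001101001000101"))
    print(lz_distance("ACGTACGT", "ACGTTTTT"))
```

A sequence that shares long substrings with another adds few new components when appended to it, so similar sequences get small distances without any alignment step.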

Hui-Ying et al. [14] (2004) proposed a novel algorithm for phylogenetic tree reconstruction in which Discrete Particle Swarm Optimization (DPSO) is used to select the best tree from the population. Initially the fitness value of each particle in the population is calculated, and the individual with the maximum fitness value is used to construct the phylogenetic tree. Once the tree is constructed, the population update and branch adjustment are performed. In the population update, position and velocity are updated using the DPSO [4] position and velocity update equations. In the next step, the branches of the tree are adjusted by comparison: if the distance between two nodes is greater than or equal to 2D (where D is the distance between two sequences), the branch is separated; otherwise the branches are combined. The update continues until the phylogenetic tree is optimized. The DPSO algorithm gives optimized results even if the initial population is changed. The algorithm was applied to a 25-sequence problem involving sequences of the chloroplast gene rbcL from a diversity of green plants, and the experimental results were satisfactory when compared with traditional algorithms.

In this paper we have reviewed some recent efforts by several researchers on the construction of phylogenetic trees. After a brief introduction to the problem of phylogenetic tree reconstruction, we discussed the applicability of GA, PSO and ACO to the problem and the work done with them. Several traditional methods have been examined closely for their relevance and acceptance, but they fall short of a near-optimum solution and suffer from extensive computational overhead. To overcome these problems, GA has been combined with several other methods, and the results are satisfactory compared with traditional methods. Swarm intelligence (SI) tools also seem promising: several tasks in bioinformatics involve the optimization of different criteria, which makes SI tools such as ACO and PSO a natural and appropriate fit for the phylogenetic tree reconstruction problem. Compared with GA, the SI-based algorithms proposed by different researchers are reported to give better results. The papers published in this context may be few in number, but they are of great significance to future researchers, because the field is broad and much research remains to be done.

The authors would like to thank the anonymous reviewers for their detailed, valuable comments and constructive suggestions.


Results

Constructing a specific digraph for the objects

Ants release a chemical odour called pheromone when they encounter each other or while seeking their fellows. Guided by this odour, ants naturally attract those with similar features and repel those that are different. In this paper, artificial ants are set to travel on the graph and deposit pheromone on the edges they pass. As shown in Fig. 1 and Fig. 2, in each step the artificial ant selects the next vertex according to the acceptance weight in the digraph and some heuristic information. The pheromone on each edge of the digraph is updated as the artificial ants move adaptively, and some adaptive strategies are also presented to speed up the clustering process.

Strong component analysis

The more similar two objects are, the more pheromone is likely to be deposited on the edge between their vertices. To make full use of the quantity of pheromone on each edge, we omit connections whose pheromone value is less than a certain threshold to obtain a new digraph, and the strongly connected components of the new digraph form the final clusters. In this way, the initial objects are separated into a few clusters by the ant sub-colony. Finally, the clusters obtained by the ants are used to construct the phylogenetic trees progressively.

Optimizing the phylogenetic trees
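The thresholding and component step can be sketched as follows. This is an illustrative sketch only: the pheromone digraph is assumed to be a dictionary of dictionaries in which every vertex appears as a key, and Kosaraju's two-pass algorithm is used as one standard way of finding strongly connected components, not necessarily the method used in the paper.

```python
def strong_components(pheromone, threshold):
    """Drop edges below threshold, then return strongly connected components.

    pheromone maps each vertex to a dict {neighbour: pheromone value};
    every vertex must appear as a key, even if it has no outgoing edges.
    """
    nodes = list(pheromone)
    kept = {u: [v for v, tau in pheromone[u].items() if tau >= threshold]
            for u in nodes}
    reverse = {u: [] for u in nodes}
    for u in nodes:
        for v in kept[u]:
            reverse[v].append(u)

    # First pass: record vertices in order of depth-first finishing time.
    order, seen = [], set()
    def visit(u):
        seen.add(u)
        for v in kept[u]:
            if v not in seen:
                visit(v)
        order.append(u)
    for u in nodes:
        if u not in seen:
            visit(u)

    # Second pass: walk the reversed graph in reverse finishing order.
    clusters, assigned = [], set()
    def collect(u, cluster):
        assigned.add(u)
        cluster.add(u)
        for v in reverse[u]:
            if v not in assigned:
                collect(v, cluster)
    for u in reversed(order):
        if u not in assigned:
            cluster = set()
            collect(u, cluster)
            clusters.append(cluster)
    return clusters

if __name__ == "__main__":
    tau = {"a": {"b": 0.9, "c": 0.1}, "b": {"a": 0.8},
           "c": {"d": 0.7}, "d": {"c": 0.9, "a": 0.05}}
    print(strong_components(tau, threshold=0.5))
```

The recursive traversal is fine for small examples; for digraphs with thousands of vertices an iterative traversal would avoid Python's recursion limit.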

Artificial ants in the same sub-colony try to construct an independent phylogenetic tree as a solution of the problem by their cooperation and different sub-colonies construct different trees so as to maintain diversity of candidates. After optimizing these trees, the performance of these solutions is improved. Meanwhile, the pheromones on the edges of high fitness valued trees are increased to strengthen the ants' clustering process.

The phylogenetic tree construction method presented in this paper was tested against GA. The experimental results show that our algorithm is easier to implement and more efficient: compared with GA, it converges much faster and obtains higher solution quality.



Introduction
Although many biologists believe that reticulate events such as hybridization, horizontal gene transfer, recombination and reassortment play an important role in evolution, most published studies use trees to represent the evolutionary history for the set of species studied. One reason for this is the lack of robust and accepted methods for inferring non-tree histories or phylogenetic networks. A lot of work has been done in recent years to address this problem. Our work in this area was originally focused on developing methods for inferring unrooted phylogenetic networks and our computer program, SplitsTree, is currently the most widely-used software for the construction of phylogenetic networks. More recently, we have focused on developing methods for rooted phylogenetic networks, which we will make available via our tree- and network drawing program Dendroscope.

Phylogenetic networks were one of the main topics of the four month research programme on Phylogenetics held at the Newton Institute of Cambridge University in 2007. There we formed the impression that the field of phylogenetic networks had advanced to a point where there was enough material to warrant a book that gives an introduction to the field and attempts to present the different questions and approaches in a unified manner. Our goal was to write a book that covers the field in a style that is accessible to bioinformaticians, to biologists who are interested in methods and algorithms, and to computer scientists who are interested in evolution. This book is the ideal companion to our SplitsTree4 and Dendroscope programs.

Reviews
'Networks - rather than just trees - are fast becoming the essential tool for making sense of the complexities of evolution, and conflicting signal[s] in genomic data. Phylogenetic Networks provides a long-overdue exposition of network-based methods, their possible uses, and details on practical software. A detailed and unified treatment of the many different types of networks is complemented by a crisp synopsis of the underlying theory. Numerous example[s] and illustrations make the text easy to follow. This book will further transform the way biologists use genomic data to study evolution. The Tubingen group has led the development of phylogenetic network algorithms, and this book delivers a clear exposition for biologists bewildered by a plethora of recent methods, as well as for bioinformaticians aiming to develop the field further. It is essential reading for any scientist or student seeking to understand how genomic data can be used to represent and study the intricate 'web of life'.' Mike Steel, University of Canterbury.

'This textbook, by one of the leaders of the field (Daniel Huson) and his co-authors, provides a mathematically rigorous introduction to one of the most exciting and beautiful research areas in computational biology: phylogenetic networks. The text is clear and provides all the necessary biology background; it should be accessible to graduate students (or upper-division undergraduates) in mathematics, computer science, or statistics.' Tandy Warnow, University of Texas.

'This wonderfully accessible book is by far the most thorough and up-to-date treatment of phylogenetic networks about. Many evolutionary processes in nature do not conform to the simple model of phylogenetic trees; examples are hybridizations, symbioses, and lateral gene transfer. The more we probe nature with genomics, the more significant and numerous these examples become, so there is a real need for using networks in phylogenetics. This volume is a must for researchers working with phylogenetic networks. It is for an advanced college audience. Beautifully organized and clearly written, it really fills a void.' Bill Martin, University of Düsseldorf.

Table of contents
Contents
Preface
Part I Introduction
1 Basics
1.1 Overview
1.2 Undirected and directed graphs
1.3 Trees
1.4 Rooted DAGs
1.5 Traversals of trees and DAGs
1.6 Taxa, clusters, clades and splits
2 Sequence Alignment
2.1 Overview
2.2 Pairwise sequence alignment
2.3 Multiple sequence alignment
3 Phylogenetic Trees
3.1 Overview
3.2 Phylogenetic trees
3.3 The number of phylogenetic trees
3.4 Models of DNA evolution
3.5 The phylogenetic tree reconstruction problem
3.6 Sequence-based methods
3.7 Maximum parsimony
3.8 Branch-swapping methods
3.9 Maximum likelihood estimation
3.10 Bootstrap analysis
3.11 Bayesian methods
3.12 Distance-based methods
3.13 UPGMA
3.14 Neighbor-joining
3.15 Balanced Minimum Evolution
3.16 Comparing trees
3.17 Consensus trees
3.18 The Newick format
4 Introduction to Phylogenetic Networks
4.1 Overview
4.2 What is a phylogenetic network?
4.3 Unrooted phylogenetic networks
4.4 Rooted phylogenetic networks
4.5 The extended Newick format
4.6 Which types of networks are currently used in practice?
Part II Theory
5 Splits and Unrooted Phylogenetic Networks
5.1 Overview
5.2 Splits
5.3 Compatibility and incompatibility
5.4 Splits and clusters
5.5 Split networks
5.6 The canonical split network
5.7 Circular splits and planar split networks
5.8 Weak compatibility
5.9 The split decomposition
5.10 Representing trees in a split network
5.11 Comparing split networks
5.12 T-Theory
6 Clusters and Rooted Phylogenetic Networks
6.1 Overview
6.2 Clusters, compatibility and incompatibility
6.3 Hasse diagrams
6.4 Cluster networks
6.5 Rooted phylogenetic networks
6.6 The lowest stable ancestor
6.7 Representing trees in rooted networks
6.8 Hardwired and softwired clusters
6.9 Minimum rooted phylogenetic networks
6.10 Decomposability
6.11 Topological constraints on rooted networks
6.12 Cluster containment in rooted networks
6.13 Tree containment
6.14 Comparing rooted networks
Part III Algorithms and Applications
7 Phylogenetic Networks from Splits
7.1 The convex hull algorithm
7.2 The circular network algorithm
8 Phylogenetic Networks from Clusters
8.1 Cluster networks
8.2 Divide-and-conquer using decomposition
8.3 Galled trees
8.4 Galled networks
8.5 Level-k networks from clusters
9 Phylogenetic Networks from Sequences
9.1 Condensed alignments
9.2 Binary sequences and splits
9.3 Parsimony splits
9.4 Median networks
9.5 Quasi-median networks
9.6 Median-joining
9.7 Pruned quasi-median network
9.8 Recombination networks
9.9 Galled trees
10 Phylogenetic Networks from Distances
10.1 Distances and splits
10.2 Minimum spanning network
10.3 Split decomposition
10.4 Neighbor-net
10.5 T-rex
11 Phylogenetic Networks from Trees
11.1 Consensus split networks
11.2 Consensus super split networks for unrooted trees
11.3 Distortion-filtered super split networks for unrooted trees
11.4 Consensus cluster networks for rooted trees
11.5 Minimum hybridization networks
11.6 Minimum hybridization networks and galled trees
11.7 Networks from multi-labeled trees
11.8 DLT reconciliation of gene- and species trees
12 Phylogenetic Networks from Triples or Quartets
12.1 Trees from rooted triples
12.2 Level-k networks from rooted triples
12.3 The quartet-net method
13 Drawing Phylogenetic Networks
13.1 Overview
13.2 Cladograms for rooted phylogenetic trees
13.3 Cladograms for rooted phylogenetic networks
13.4 Phylograms for rooted phylogenetic trees
13.5 Phylograms for rooted phylogenetic networks
13.6 Drawing rooted phylogenetic networks with transfer edges
13.7 Radial diagrams for unrooted trees
13.8 Radial diagrams for split networks
14 Software
14.1 SplitsTree
14.2 Network
14.3 TCS
14.4 Dendroscope
14.5 Other programs
Glossary
Bibliography
Index
376 pages

Publisher
Cambridge University Press, book

Datasets
Most of the phylogenetic networks in the book were generated using our programs SplitsTree and Dendroscope. Source files for most of the figures are available here.

Errata (first printing)
Page 4: The condition that every edge is incident to two different nodes excludes parallel edges or multi-graphs (not multi-edges).
Page 6: Clarification: strongly connected requires a path from any one node to any other, so, in particular, two paths between each pair of nodes.
Page 121: Lemma 5.10.1: Philip Gambette pointed out that this result does not hold for general split networks. Indeed, the network shown in Figure 5.7b is a counterexample. We believe that this can be fixed as follows: The result holds for canonical split networks. It also holds for split networks constructed using the circular network algorithm.
Page 130: An incompatible cluster set is a set of clusters in which at least one pair of clusters is pairwise incompatible (and not, as erroneously stated, a set of clusters in which all clusters are pairwise incompatible).
Page 167: In Exercise 6.11.21, we erroneously state that the network displayed in Figure 6.24c is a cluster network. This is not true, since for this network the uniqueness condition of Definition 6.4.1 is not fulfilled. Please use this network instead (each node is shown as a small disk).
Page 168: Exercise 6.12.2 is correct, but not easy. (Hint: use the result that the lca of any two nodes can be computed in constant time, after a linear amount of preprocessing (Harel and Tarjan 1984).)
Page 176: The last tripartition of tree T_2 is: (,emptyset,)
Page 181: In the list of nested labels for the network N_2, remove the curly brackets around taxon a in each of the first three lines (thanks to Benjamin Albrecht for pointing this out).
The following errors were pointed out by Simone Linz:
Page 275: 4th line of section 11.5.1: Script C should be plain C.
Page 278: 21st line: "... then that subtree is a component." should be "... then that subtree is contained in a component."
Page 281: Theorem 11.5.7 should also state that if the acyclic agreement forest is maximum with h components, then any phylogenetic network N on X that represents both T1 and T2 will have at least h-1 reticulations.
Page 282: 4th line: "... problem of finding an acyclic agreement forest." should be "... problem of finding a maximum acyclic agreement forest."
Page 283: 20th line, holds: should be hold:


Solving UOT-RF(+) in linear time

An unrooted tree can be converted into a rooted tree by adding a root node on a chosen edge (thereby splitting the chosen edge into two edges, with the two end points of the chosen edge becoming the two children of the root node). Thus, if the unrooted tree has e edges then there are e ways to root that tree, with each of the e ways resulting in a different rooted tree.
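The rooting step can be illustrated with a minimal Python sketch. It assumes the unrooted tree is stored as an adjacency dictionary; the names `root_on_edge` and `clade_set` and the node label `"root"` are illustrative choices, not notation from the source. The example roots a four-leaf tree on each of its e = 5 edges and checks that the resulting rooted trees are pairwise different.

```python
from typing import Dict, FrozenSet, List, Set

Adj = Dict[str, List[str]]


def root_on_edge(adj: Adj, u: str, v: str, root: str = "root") -> Adj:
    """Subdivide edge (u, v) with a new root node; return child lists per node."""
    if v not in adj[u]:
        raise ValueError(f"({u}, {v}) is not an edge of the tree")
    children: Adj = {root: [u, v]}          # u and v become the root's children
    stack = [(u, v), (v, u)]                # (node, neighbour on the root side)
    while stack:
        node, parent = stack.pop()
        children[node] = [w for w in adj[node] if w != parent]
        stack.extend((w, node) for w in children[node])
    return children


def clade_set(children: Adj, root: str = "root") -> FrozenSet[FrozenSet[str]]:
    """Leaf sets below every node; used here only to tell rooted trees apart."""
    out: Set[FrozenSet[str]] = set()

    def leaves_below(node: str) -> FrozenSet[str]:
        kids = children[node]
        leaves = (frozenset([node]) if not kids
                  else frozenset().union(*map(leaves_below, kids)))
        out.add(leaves)
        return leaves

    leaves_below(root)
    return frozenset(out)


# Unrooted quartet tree ab|cd with internal nodes x and y.
adj = {"a": ["x"], "b": ["x"], "c": ["y"], "d": ["y"],
       "x": ["a", "b", "y"], "y": ["c", "d", "x"]}
edge_list = sorted({tuple(sorted((a, b))) for a in adj for b in adj[a]})
rootings = [root_on_edge(adj, u, v) for u, v in edge_list]
distinct = {clade_set(r) for r in rootings}
print(len(edge_list), len(distinct))  # prints: 5 5
```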

If S and T are unrooted trees then we will show how to compute an optimal completion of T on \(\mathit{Le}(S)\) by using Algorithm OneTreeCompletion on appropriately rooted versions of S and T. The following observation establishes a direct relationship between the RF distance between two unrooted trees on the same leaf set and the RF distance between appropriately rooted versions of the two unrooted trees. This observation is also proved in [14].

Observation 1

Let P and Q be unrooted trees on the same leaf set, and let l be any leaf node (common to P and Q). Let \(\hat{P}\) be obtained by rooting P on the edge connecting l to the rest of P, and \(\hat{Q}\) be obtained by rooting Q on the edge connecting l to the rest of Q. Then \(RF(P, Q) = RF(\hat{P}, \hat{Q})\).

Proof

Consider any edge \((u,v) \in E(P)\). We will use \(P_u\) to denote the subtree containing node u and \(P_v\) to denote the subtree containing node v, obtained when edge (u, v) is removed from P. Edge (u, v) defines the split \(\{\mathit{Le}(P_u), \mathit{Le}(P_v)\}\) in P. We define a bijection \(f: \mathit{Split}(P) \rightarrow \mathit{Clade}(\hat{P}) \setminus \{\{l\}, \mathit{Le}(P)\}\) from splits in P to clades in \(\hat{P}\) as follows. Given any split \(\{\mathit{Le}(P_u), \mathit{Le}(P_v)\}\), without loss of generality, we assume that the leaf l occurs in the \(P_u\) side of this split, i.e., \(l \in \mathit{Le}(P_u)\), and define \(f(\{\mathit{Le}(P_u), \mathit{Le}(P_v)\}) = C_{\hat{P}}(v)\). The image of a split depends only on the split and on l, and Q is rooted at the same leaf, so a split is shared by P and Q exactly when its image is a clade shared by \(\hat{P}\) and \(\hat{Q}\). The two excluded clades, \(\{l\}\) and \(\mathit{Le}(P)\), belong to both rooted trees, so the two symmetric differences have the same size.
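Observation 1 can be checked numerically on a small case. The sketch below is an illustration under stated assumptions, not code from the source: trees are adjacency dictionaries, the RF distance is taken as the size of the symmetric difference of the split sets (unrooted) or clade sets (rooted), and all function names are hypothetical.

```python
def adjacency(edge_list):
    """Build an adjacency dictionary from a list of undirected edges."""
    adj = {}
    for u, v in edge_list:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    return adj


def side_leaves(adj, start, blocked):
    """Leaves reachable from `start` without passing through `blocked`."""
    seen, stack, found = {start, blocked}, [start], set()
    while stack:
        node = stack.pop()
        if len(adj[node]) == 1:          # degree-1 nodes are the leaves
            found.add(node)
        for w in adj[node]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return frozenset(found)


def splits(adj):
    """One split {Le(P_u), Le(P_v)} per edge (u, v) of the unrooted tree."""
    leaves = frozenset(n for n in adj if len(adj[n]) == 1)
    result = set()
    for u in adj:
        for v in adj[u]:
            side = side_leaves(adj, v, u)
            result.add(frozenset([side, leaves - side]))
    return result


def rooted_clades(adj, l):
    """Clades of the tree rooted on the edge joining leaf l to the rest."""
    (x,) = adj[l]                        # the unique neighbour of leaf l
    out = {frozenset([l])}

    def down(node, parent):
        kids = [w for w in adj[node] if w != parent]
        clade = (frozenset([node]) if not kids
                 else frozenset().union(*(down(k, node) for k in kids)))
        out.add(clade)
        return clade

    out.add(down(x, l) | {l})            # clade of the new root: all leaves
    return out


def rf(a, b):
    """Robinson-Foulds distance as the size of the symmetric difference."""
    return len(a ^ b)


# P = ((a,b),c,(d,e)) and Q = ((a,c),b,(d,e)) on the same five leaves.
P = adjacency([("a", "x"), ("b", "x"), ("x", "y"), ("c", "y"),
               ("y", "z"), ("d", "z"), ("e", "z")])
Q = adjacency([("a", "x"), ("c", "x"), ("x", "y"), ("b", "y"),
               ("y", "z"), ("d", "z"), ("e", "z")])

unrooted = rf(splits(P), splits(Q))
rooted = {l: rf(rooted_clades(P, l), rooted_clades(Q, l)) for l in "abcde"}
print(unrooted, rooted)
```

For this pair of trees the unrooted distance and every rooted distance agree, whichever common leaf is used for rooting.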

Lemma 2

Let S and T be unrooted trees such that \(\mathit{Le}(T) \subseteq \mathit{Le}(S)\). Let \(T'\) be an optimal completion of T on \(\mathit{Le}(S)\), such that \(T'\) minimizes \(RF(S, T')\). Let l be any leaf node common to T and S. Let \(\hat{S}\) be obtained by rooting S on the edge connecting l to the rest of S, and \(\hat{T}\) be obtained by rooting T on the edge connecting l to the rest of T. If \(\hat{T}'\) is an optimal completion of \(\hat{T}\) on \(\mathit{Le}(\hat{S})\) then \(RF(S, T') = RF(\hat{S}, \hat{T}')\).

Proof

Observe that S and \(T'\) are on the same leaf set. Let \(T''\) be obtained by rooting \(T'\) on the edge connecting l to the rest of \(T'\). The tree \(T''\) must be a valid (not necessarily optimal) completion of the tree \(\hat{T}\) on \(\mathit{Le}(\hat{S})\). Thus, by Observation 1, \(RF(S, T') = RF(\hat{S}, T'')\).

Likewise, observe that \(\hat{S}\) and \(\hat{T}'\) are on the same leaf set. Let \(\hat{T}''\) be the unrooted tree obtained by suppressing the root node of \(\hat{T}'\). The tree \(\hat{T}''\) must be a valid (not necessarily optimal) completion of the tree T on \(\mathit{Le}(S)\). Thus, by Observation 1, \(RF(\hat{S}, \hat{T}') = RF(S, \hat{T}'')\).

We claim that \(T''\) must be an optimal completion of \(\hat{T}\) on \(\mathit{Le}(\hat{S})\). If not, then \(RF(\hat{S}, \hat{T}') < RF(\hat{S}, T'')\), implying that \(RF(S, \hat{T}'') < RF(S, T')\), which is a contradiction since \(T'\) is an optimal completion of T on \(\mathit{Le}(S)\). Thus, we must have \(RF(\hat{S}, \hat{T}') = RF(\hat{S}, T'')\), implying that \(RF(S, T') = RF(\hat{S}, \hat{T}')\). \(\square\)

Based on the observation above, we solve the UOT-RF(+) problem as follows:

Algorithm for UOT-RF(+) on input trees S and T:

1. Let l be any leaf from \(\mathit{Le}(T)\). Construct \(\hat{S}\) by rooting S on the edge connecting l to the rest of S, and \(\hat{T}\) by rooting T on the edge connecting l to the rest of T.

2. Call Algorithm OneTreeCompletion with trees \(\hat{S}\) and \(\hat{T}\) as input. Let \(\hat{T}'\) be the tree returned.

3. Convert \(\hat{T}'\) into an unrooted tree by suppressing the root node and output the resulting tree.
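The three steps above can be sketched as a thin wrapper in Python. This is only an outline under assumptions: Algorithm OneTreeCompletion is not reproduced in this section, so it appears here as a caller-supplied function `one_tree_completion` (a hypothetical placeholder), and the tree encodings (adjacency dictionaries for unrooted trees, child lists for rooted trees, a node labelled `"root"`) are illustrative choices.

```python
def root_at_leaf(adj, l, root="root"):
    """Root an unrooted tree on the edge joining leaf l to the rest of the tree."""
    (x,) = adj[l]                            # the unique neighbour of leaf l
    children = {root: [l, x], l: []}
    stack = [(x, l)]
    while stack:
        node, parent = stack.pop()
        children[node] = [w for w in adj[node] if w != parent]
        stack.extend((w, node) for w in children[node])
    return children


def unroot(children, root="root"):
    """Suppress the root: join its two children by a single edge."""
    a, b = children[root]
    edges = {frozenset((a, b))}
    for node, kids in children.items():
        if node != root:
            edges.update(frozenset((node, k)) for k in kids)
    return edges


def solve_uot_rf_plus(s_adj, t_adj, one_tree_completion):
    """Steps 1-3; `one_tree_completion` is a stand-in supplied by the caller."""
    l = next(n for n in t_adj if len(t_adj[n]) == 1)      # step 1: a leaf of T
    s_hat = root_at_leaf(s_adj, l)
    t_hat = root_at_leaf(t_adj, l)
    t_hat_prime = one_tree_completion(s_hat, t_hat)       # step 2
    return unroot(t_hat_prime)                            # step 3


# Trivial check: if T already has every leaf of S, completing T changes nothing,
# so an identity function is a valid stand-in for the completion step.
T = {"a": ["x"], "b": ["x"], "c": ["x"], "x": ["a", "b", "c"]}
result = solve_uot_rf_plus(T, T, lambda s_hat, t_hat: t_hat)
print(sorted(sorted(edge) for edge in result))
```

Rooting and unrooting each visit every node once, so the wrapper adds only linear overhead on top of whatever the completion step costs.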

Theorem 2

The UOT-RF(+) problem can be solved in O(|V(S)|) time.

Proof

Let \(T^*\) denote the output of the algorithm described above, and let \(T'\) denote an optimal completion of T on \(\mathit{Le}(S)\). Since \(\hat{S}\) and \(\hat{T}\) are rooted at a common leaf-edge, l, of S and T, and since the tree \(\hat{T}'\) minimizes \(RF(\hat{S}, \hat{T}')\), Lemma 2 implies that \(RF(S, T') = RF(\hat{S}, \hat{T}')\).

Now, observe that S and \(T^*\) have the same leaf set, and that l is a leaf node common to S and \(T^*\). Furthermore, \(\hat{S}\) is obtained by rooting S on the edge connecting l to the rest of S, and \(\hat{T}'\) is obtained by rooting \(T^*\) on the edge connecting l to the rest of \(T^*\). Thus, by Observation 1, we must have \(RF(S, T^*) = RF(\hat{S}, \hat{T}')\). Thus, \(RF(S, T^*)\) must be equal to \(RF(S, T')\), implying that \(T^*\) is an optimal completion of T on \(\mathit{Le}(S)\). \(\square\)

The previous fastest algorithm for solving the UOT-RF(+) problem [28] has quadratic time complexity. Our algorithm is able to find edges on which to graft the missing subtrees more efficiently than the algorithm from [28] because we use appropriately rooted versions of the unrooted input trees and then use simple post-order and pre-order tree traversals of the trees coupled with efficient least common ancestor computations.




Introduction
Although many biologists believe that reticulate events such as hybridization, horizontal gene transfer, recombination and reassortment play an important role in evolution, most published studies use trees to represent the evolutionary history for the set of species studied. One reason for this is the lack of robust and accepted methods for inferring non-tree histories or phylogenetic networks. A lot of work has been done in recent years to address this problem. Our work in this area was originally focused on developing methods for inferring unrooted phylogenetic networks and our computer program, SplitsTree, is currently the most widely-used software for the construction of phylogenetic networks. More recently, we have focused on developing methods for rooted phylogenetic networks, which we will make available via our tree- and network drawing program Dendroscope.

Phylogenetic networks were one of the main topics of the four-month research programme on Phylogenetics held at the Newton Institute of Cambridge University in 2007. There we formed the impression that the field of phylogenetic networks had advanced to a point where there was enough material to warrant a book that gives an introduction to the field and attempts to present the different questions and approaches in a unified manner. Our goal was to write a book that covers the field in a style that is accessible to bioinformaticians, biologists who are interested in methods and algorithms, and computer scientists who are interested in evolution. This book is the ideal companion to our SplitsTree4 and Dendroscope programs.

Reviews
'Networks - rather than just trees - are fast becoming the essential tool for making sense of the complexities of evolution, and conflicting signal[s] in genomic data. Phylogenetic Networks provides a long-overdue exposition of network-based methods, their possible uses, and details on practical software. A detailed and unified treatment of the many different types of networks is complemented by a crisp synopsis of the underlying theory. Numerous example[s] and illustrations make the text easy to follow. This book will further transform the way biologists use genomic data to study evolution. The Tübingen group has led the development of phylogenetic network algorithms, and this book delivers a clear exposition for biologists bewildered by a plethora of recent methods, as well as for bioinformaticians aiming to develop the field further. It is essential reading for any scientist or student seeking to understand how genomic data can be used to represent and study the intricate 'web of life'.' Mike Steel, University of Canterbury.

'This textbook, by one of the leaders of the field (Daniel Huson) and his co-authors, provides a mathematically rigorous introduction to one of the most exciting and beautiful research areas in computational biology: phylogenetic networks. The text is clear and provides all the necessary biology background; it should be accessible to graduate students (or upper-division undergraduates) in mathematics, computer science, or statistics.' Tandy Warnow, University of Texas.


Authors’ contributions

SW and QZ conceived and designed the study. SW performed the experiments and wrote the paper. Prof. reviewed and edited the manuscript. Both authors read and approved the manuscript.

Competing interests

The authors declare that they have no competing interests.

Funding

The work was supported by the Natural Science Foundation of China (No. 61771331).

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


OWA-based linkage method in hierarchical clustering: Application on phylogenetic trees

Linkage methods are widely used in hierarchical clustering. In this paper, we integrate the Ordered Weighted Averaging (OWA) operator with hierarchical clustering to compute distances between clusters. Used this way, OWA generalizes the single linkage, complete linkage, and average linkage methods. To illustrate the proposed method, we examine a phylogenetic tree constructed by hierarchical clustering of protein sequences. To illustrate the efficiency of the method, we use a 2D data set. We obtain graphs demonstrating the relationships among the clusters, and we calculate the root-mean-square standard deviation (RMSSDT) and R-squared (RS) validity indices, which are frequently used to evaluate the results of hierarchical clustering algorithms.
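How an OWA operator can subsume the three classical linkages is easiest to see in a small sketch. The Python code below uses the standard definition of OWA (a weighted sum of the values sorted in descending order); the function names, the Euclidean distance, and the three weight vectors are assumptions for illustration and are not the weighting scheme of the paper.

```python
import math
from itertools import product


def owa(values, weights):
    """Ordered weighted average: weights apply to values sorted descending."""
    if len(values) != len(weights):
        raise ValueError("need exactly one weight per value")
    if not math.isclose(sum(weights), 1.0):
        raise ValueError("weights must sum to 1")
    ordered = sorted(values, reverse=True)
    return sum(w * b for w, b in zip(weights, ordered))


def owa_linkage(cluster_a, cluster_b, make_weights):
    """Cluster distance as the OWA of all pairwise point distances."""
    dists = [math.dist(p, q) for p, q in product(cluster_a, cluster_b)]
    return owa(dists, make_weights(len(dists)))


def complete_weights(n):
    """All weight on the largest distance: complete linkage."""
    return [1.0] + [0.0] * (n - 1)


def single_weights(n):
    """All weight on the smallest distance: single linkage."""
    return [0.0] * (n - 1) + [1.0]


def average_weights(n):
    """Uniform weights: average linkage."""
    return [1.0 / n] * n


A = [(0.0, 0.0), (0.0, 1.0)]
B = [(3.0, 0.0), (3.0, 1.0)]
for name, make in [("complete", complete_weights),
                   ("single", single_weights),
                   ("average", average_weights)]:
    print(name, owa_linkage(A, B, make))
```

Any other weight vector that sums to 1 gives a linkage lying between single and complete linkage, which is the sense in which OWA generalizes them.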

Highlights

► We integrate the Ordered Weighted Averaging (OWA) operator with hierarchical clustering.
► OWA acts as a generalized case of the single, complete and average linkage methods.
► A phylogenetic tree of protein sequences is constructed by OWA-based linkage.
► Cluster validity indices verify the efficiency of the proposed method.

