What is an Upgma Tree

Calculation of phylogenetic trees with distance measurements

Transcript

1 Calculation of phylogenetic trees with distance measurements Seminar: Relationship and Descent in Strings WS 2004/2005 Stephan Klinger Benjamin Großmann

2 Outline Introduction ... 3 History of evolutionary research ... 3 Molecular trees ... 4 Distance determination ... 5 Ultrametrics ... 6 Introduction ... 6 Ultrametric data ... 6 Ultrametric trees ... 6 Construction with UPGMA ... 7 Problems of Ultrametrics ... 8 Additive Trees ... 9 Introduction ... 9 Neighbor Joining ... 10 The Neighbor Joining Algorithm ... 10 Additive Trees ... 12 The Application of Phylogenetic Algorithms ... 12 Problems of the evolutionary model ... 14 Practical suitability of distance-based methods

3 Introduction Phylogenesis is about studying the evolutionary evolution of living beings over the course of the earth's history. The term is derived from the Latin word phylum, which stands for trunk. In tribal development it is assumed that all existing living beings emerged from a common archetype, which founded life on earth. Due to the process of constantly changing environmental conditions (changed atmosphere, temperature fluctuations, tectonic shifts, etc.), this archetypal type developed in many ways over many generations. The goal is now to research possible ways of the development of living beings. The method of research is the reconstruction of evolution trees. The subject of this elaboration are methods that calculate the relationship between species by means of distance measures (differences). The algorithms used are used to evaluate large quantities of strings (DNA or gene sequences) and to determine genetic distances between species. Phylogeny has its main application in the field of biology. In linguistics, language relationships were largely reconstructed in painstaking manual work. It is therefore an interesting question to what extent the phylogenetic methods can also be used to research the origin and development of languages. History of evolutionary research It was not until the 18th century that the scholar Carl Linnaeus first attempted to classify plants and animals on the basis of numerous external characteristics. In this extremely life-filling task, however, he not only limited himself to recognizing the differences between individual species, he also found generic terms for similar groups and created an extensive hierarchical classification scheme that is still in use today. He subdivided into stem, class, order, family, genus and species in descending order. However, Linnaeus was still based on the consistency of the species. He took every difference as God given, he did not include developments in his model. A short time later, however, the French scholar Lamarck announced a new theory in his book Philosophy Zoologique (1809) that foresees changes in species. In his opinion, living beings can develop characteristics if they use the corresponding organ more intensively. As an example, he used the giraffe, whose constantly stretched neck becomes longer in the course of life through grazing on higher leaves. In addition, animals can pass on these acquired traits to their descendants. This theory caused heated discussions. Experiments with mice, among others, which, despite the tail being cut off, always produced offspring with a developed tail, seemed to refute this thesis. 3

4 So it was not long before the Englishman Charles Darwin proclaimed the famous theory of evolution in 1858. This essentially includes two assumptions that have remained valid to this day: 1.) Biological species change due to random influences (today's terms: mutation and variation) 2.) Under given environmental conditions (which are also subject to changes), species survive with beneficial ones Changes are preferred, so there is natural selection. molecular family trees In the 1940s, the term synthetic evolutionary theory was coined. This term is understood to mean the connection from the Darwin model with the findings from cell research and the still young branch of genetics. So far, living beings have only been compared on the basis of their morphological properties. The new finding was that genes are the carriers of these characteristics, and consequently the phenotype is the expression of the entire genetic information of an organism. Random hereditary mutations within the genes can occur at any time and change the phenotype. In sexual reproduction, variations continue to occur due to the mixing of genes, which in turn can cause changes in the phenotype. With these findings, it was possible to explain the evolutionary model at the molecular level, which paved the way for a multitude of new theories and experiments. The method presented here for calculating molecular family trees with distance measurements was one of these new theories and was developed in the 1960s in collaboration between the Austrian physician Emil Zuckerkandl and the American chemist Linus Pauling. The scientists assumed that evolution progressed steadily and only in small steps. They developed the Molecular Clock Theory with the following statement: In every protein the rate of benevolent mutations of the amino acid sequence is constant. Well-meaning mutations are those that are able to survive and are therefore inherited and can continue to mutate. Thus, according to this theory, the changes in living beings are proportional to the duration. Since parts of DNA sequences were to serve as the source for this procedure and these could only be measured from living organisms, these organisms formed the leaves in the emerging tree. The inner nodes corresponding to the extinct species were then calculated from the distances. With this method one does not necessarily get the actual evolution tree, but only a possible one. In the following, the determination of distances will be discussed briefly, as they are the prerequisite for the algorithms presented here. 4th

5 Determining the distance When determining the distance, the aim is to determine the smallest distance between two DNA sequences. The distance is defined by the number of change operations required to convert from one sequence to the other. The following sequences are given as an example: ATGCGGTGCAATG ATGGTGCAT There are many possible alignments for these sequences. The so-called edit distance, which is calculated from the sum of the insert, delete and replace operations, can then be calculated for each alignment. The following are examples of alignments with their edit intervals: Alignment ATGCGGTGCAATG ATGCGGTGCAATG AT GG TGCA_T ATG GTGCA_T_ ATGCGGTGCAATG ATGGTGCAT Edit interval The distance between two sequences is the smallest edit interval from all possible alignments. With the dynamic programming method, the alignment with the smallest edit distance can be calculated with a quadratic effort and a quadratic space requirement. This distance determination must now be carried out for all combinations of the sequences to be examined and a matrix can be created that contains all the distances. For example, be A, B, C, D and E the sequences to be analyzed with the following distance matrix: A B C D E A B C D E Since the matrix is ​​symmetrical to the diagonal (distances between two sequences are the same in both directions), only one half is written in the following for the sake of clarity. 5

6 Ultrametrics Introduction This section is about the construction of a family tree that is supposed to map the evolutionary history of taxa. The nodes within the tree represent points of branching, i.e. the points in time of the events at which different properties were first developed within a species. If the data used is of a sufficiently high quality, you can even read the periods of development based on the distances between the nodes. Ultrametric trees have the following laws: They have a root and are binary. This means that every inner node has exactly two child nodes, and all nodes except for one have a predecessor node. The living species are shown on the lowest level as leaves, the inner nodes form calculated values. Ultrametric Data To create ultrametric trees, one needs ultrametric data. Since the data are distances, the 4 axioms for metrics apply: 1. d (x, y) 0 (condition of positive distances) 2. d (x, y) = 0 x = 0 (point condition) 3. d (x, y) = d (y, x) (symmetry condition) 4.d (x, y) d (x, z) + d (z, y) (triangle condition) For ultrametrics the 4th Axiom more restricted: 4. d (x, y)

7 Then T is a tree with a root and the following properties: there are n leaves, each denoted by a row of D the internal nodes are denoted by an entry from D: D (i, j) denotes the smallest common ancestor of Leaves i and j the internal node values ​​decrease in descending order every internal node has two children With this definition one can prove that an ultrametric tree can be constructed for every ultrametric matrix and vice versa. So an ultrametric tree represents an ultrametric in a compact form. Construction with UPGMA This section deals with the construction of ultrametric trees using the UPGMA method (unweighted pair group method with arithmetic mean) or hierarchical clustering. This method, which can easily be implemented with the computer, generates a tree for any symmetric matrix. Apart from the swapping of two children, this is clear. However, the result is only useful if the data is ultrametric. The resulting tree is binary as required and has one root. The procedure consists of the following steps: 1. Create a sheet for each row of the matrix with the name of the respective row 2. Select a smallest value D (i, j) from the matrix values ​​3. Connect sheets i and j with a new node named D (i, j); the two edge lengths to i and j are each D (i, j) / 2 4. Remove the columns i and j and the rows i and j, add a row ij and a column ij and calculate the new values ​​from the average of the old values ​​Example: D (ij, z): = (D (i, z) + D (j, z)) / 2 5. Repeat steps 2 4 until the matrix contains only one value. Example: A B C D E A B C D 0 5 E 0 AE B C D AE B C 0 8 D 0 Step 2 Step 3 Step 4 Tree after repeat. Applying steps 2-4 7

8 Problems of Ultrametrics The conditions for ultrametric data are quite strict and therefore severely limit the area of ​​application. Real data are rarely ultrametric. The reason for this lies in the different selection pressures that act on the various proteins, which means that the mutations are rarely constant. The assumption of the molecular clock is thus violated. In addition, only living organisms can be taken into account for the calculation. All inner nodes are calculated. The inclusion of the DNA of extinct organisms is therefore impossible. With regard to linguistic research, this fact is likely to be even more decisive, as old text excerpts are often available which, if included, could significantly improve the quality of the family tree. 8th

9 Additive Trees Introduction The assumption of the ultrametric of data is a very strong requirement that the data in reality often do not do justice to. For this reason and because you still want to work with the existing data, the assumption about the quality of the data is weakened. In the case of additive trees, it is assumed that the data are only additive, i.e. the distance measures used represent the distance between the examined units, but it is not possible to determine the extent of the changes over time. Analogous to a distance matrix for ultrametric trees, a distance matrix must meet certain conditions so that an additive tree can be generated from it. The matrix must be symmetrical, i.e. the distance AB must be equal to the distance BA. The matrix must have a 0 diagonal. All other fields in the matrix must be greater than 0. The values ​​of the matrix must meet the 4-point condition. See Figure Matrix 1 for an example of a matrix that meets the stated conditions. A B C D A B C 0 6 D 0 Figure: Matrix 1 While all other conditions are quite easy to understand, the 4-point condition is a bit more complicated. Therefore, they will be discussed in detail below. In the first step, the 4-point condition is presented as a theorem. A matrix D has an additive tree, iff a tree can be constructed for all rows i, j, k, l so that the 4-point condition applies: D (i, k) + D (j, l) D ( i, j) + D (k, l) = D (i, l) + D (k, j) Theorem: 4-point condition Applying this abstract theorem to the matrix in Figure Matrix I, the formula is 4-point condition: D (A, B) + D (D, C) D (A, D) + D (B, C) = D (A, C) + D (B, D). By inserting the corresponding values ​​one obtains = This proves that the 4-point condition is fulfilled for the example matrix and that this matrix therefore has an additive tree. 9

10 Neighbor Joining The Neighbor Joining algorithm is used to generate an additive tree from a given matrix. This method always generates a tree from a given matrix, so it is important that the matrix meets the conditions mentioned in the previous section so that the result is actually an additive tree. The neighbor joining algorithm is similar to the UPGMA algorithm for generating ultrametric trees. Both algorithms are hierarchical clustering methods. The Neighbor Joining method creates a binary tree without a root. These trees without roots can then be given a root in a further step. The differences between Neighbor Joining and UPGMA lie in the grouping of objects according to their proximity to one another. While UPGMA simply uses the given distance measurements, Neighbor Joining also takes into account the distance between the objects and other clusters. From this difference it follows that with UPGMA trees with roots always arise and with Neighbor Joining trees without roots. The Neighbor Joining Algorithm 1. Compute for each species u i = k i D (i, k) n 2 2. Select the i and j for which D ij u i u j gives the smallest value. 3. Merge clusters i and j to form a new cluster (ij), and add a corresponding node to tree T. Calculate the edge lengths of i and j to this new node as follows: d (i, ij) = D (i, j) + uiuj 2 d (j, ij) = D (i, j) + ujui 2 4. Calculate the Distances between the new cluster and all other clusters. D (i, k) + D (j, k) D (i, j) D (k, ij) = 2 5. Delete the clusters i and j from the table and replace them with (ij). 6. Repeat the entire process as long as there is more than one cluster. The algorithm should now be reproduced using an example. The output matrix is ​​shown in the output matrix table. The distance u i to the other clusters is in the bottom 10

11 row added to the matrix. The following equation is obtained for column A when inserting the values ​​in the formula of the first step of the algorithm: () / (4-2) = 15. The values ​​for columns A-C were calculated in the same way. A B C D A B C D u i Table: Matrix 2 B C D A B C -20 Table: Neighbor-Joining- Distances of the Cluster Pairs The table Neighbor-Joining-Distances of the Cluster Pairs contains the Neighbor-Joining-Distances of the cluster pairs of the output matrix. The values ​​in this table were calculated using the formula in the second step of the algorithm. Substituting in, the equation = -20 for A and B is obtained. After the calculation, the lowest values ​​are found in cells AB and CD. For our example cell CD is chosen as the cluster to be merged. According to step 3, the distances between the individual points and the new common cluster are now calculated. The distance from C to CD is () / 2 = 1. The distance from D to CD is / 2 = 5. With these edge lengths, the newly determined cluster can now be entered in the tree. The map of additive tree shows the complete additive tree that results from this example. In the fourth step, the distances between the old clusters and the newly formed cluster are calculated. Table Matrix 2 shows the corresponding matrix. The distance from A to CD results from the formula from the equation / 2 = 9. A B CD A B CD u i Table: Matrix 3 The fifth step is to delete the clusters that have been combined to form the new cluster from the matrix. This has also been done in Table Matrix 3. From this point on, procedure 11

12 repeats as long as there is more than one cluster in the matrix. The resulting additive tree is shown in the additive tree figure below. Additive trees As mentioned above, additive trees do not necessarily have a root.An additive tree has weighted edges as well as labeled and unlabeled nodes. (see Fig. additive tree) Figure: additive tree The path between the nodes corresponds to the distance that the corresponding clusters have in the distance matrix. Evolutionary relationships are represented by additive trees, but no direction of evolution can be read from them. With additive trees it is possible to represent objects of the distance matrix as inner objects. This property is particularly interesting with regard to the use of the method for a historical linguistic question, as Hochmuth (2004) states: old text versions [are] available for languages, which means that a text version has to be moved inside the tree. Since the mentioned work is an investigation that deals with the application of algorithmic processes from bio-informatics to linguistic questions, it is briefly presented in the following section under the aspect of the application of additive trees. The application of phylogenetic algorithms In M. Hochmuth's work, methods from bioinformatics were applied to a linguistic problem. The relation to our work is that distance-based methods were used to calculate the relationship between languages ​​and that additive trees were used to represent the relationships. The Levenshtein distance is a method used in molecular biology to determine the distance between two species. In the study of M. Hochmuth, this method was applied to historical texts, whereby the distance between words with the same meaning was calculated. The Levenshtein distance has been used for a long time in the field of linguistics in the further 12

13 senses are used, for example for automatic spelling checks with correction suggestions, but the method has not yet been used to calculate the relationship between languages. The Levenshtein method produces spacing matrices as a result. The values ​​in these matrices represent the distances between the examined languages. The matrices can be used as input data for phylogenetic algorithms for calculating family trees, such as those presented in this work. Hochmuth used different versions of Our Father from different historical language levels and contemporary dialects as a database for the experiments. In addition, a Latin, two English, a Gothic and an Icelandic version of this text have been added to the database. In the experiments to calculate the distances between the language variants, it was investigated what influence the change in various parameters had on the result. This was done to determine which form of preparation of the data would give the best results. The figure Hochmuth 2004 shows a result of the experiments. It is an additive tree for the texts from New High German (NHD), Early Middle High German (FNHD), Middle High German (MHD), Old High German (AHD) and Old Saxon (AHD). Figure: Hochmuth 2004; Additive tree, for the simple Levenshtein distance between five versions of the Lord's Prayer (Hochmuth 2004) The distribution of the language levels under consideration corresponds to what is known in linguistics about these language levels. The series of experiments showed that the results could be improved with increased processing effort for the input data, but even the simple Levenshtein distance provided good results. In view of the comparatively small amount of data, however, no robust results could be achieved. In addition to this problem, there are also fundamental problems of distance-based methods and evolutionary approaches in general, which we will discuss in the following section. 13th

14 Problems of the Evolutionary Model The evolutionary model presented is an abstraction that ignores certain phenomena. Gene fusion is the fusion of genes in a genome. As a result, it can happen that the genetic information of two ancestors is combined in a single genome. This contradicts the model's assumption that only one clear ancestor is possible. Another assumption of the model is that genetic information is passed on only vertically. Exceptions to this can be found in biology. In the case of viruses and bacteria, there is sometimes a horizontal transfer of genetic data. Complete foreign sequences are introduced into the existing DNA. In the field of languages ​​there are phenomena that can be seen analogously to the horizontal transfer in biology. In the course of the seminar it became quite clear how strong the influence of contact between languages ​​is on the development of languages. It is precisely at this point that it seems questionable what successes can be achieved by applying the evolutionary model to the development of languages. Further problems of the model are that evolutionary events are not exclusively mutations. A counterexample is sexual reproduction. The assumption that similarity equals relationship is also not unreservedly correct. In homoplasia, there are sequences that are similar but not related, for example in the form of convergences. Practical suitability of distance-based methods In addition to distance-based methods, as presented here, there are also other methods for generating phylogenetic trees. All methods have certain fundamental problems. Because these are abstract models, there is always a loss of information when processing the input data. With the distance-based methods, the loss of information due to abstraction is very high, because all genetic information is lost when reducing to numerical distances. In addition, these methods are not very robust, since they cannot tolerate even minor errors in the source data and instead generate unusable trees. Nevertheless, they are used because the procedures can be applied very quickly and easily. Due to the dangers described, it is necessary to carefully check the plausibility of the results. 14th

15 Literature - Gusfield, Dan: "Algorithms on Strings, Trees, and Sequences", Cambridge University Press, 1997, S Ewens, Grant: "Statistical Methods in Bioinformatics", Springer 2002, S Hochmuth, Mirko: String-based algorithms for the reconstruction of language relationships , Student thesis, reader, Ulf: from the lecture Algoritmic Bioinformatics WS2004 / 05, set of slides: Distance-based phylogenetic algorithms 15