SLIDE 1

2/4/09 1

CSCI1950-Z Computational Methods for Biology Lecture 4

Ben Raphael February 2, 2009

http://cs.brown.edu/courses/csci1950-z/

Algorithm Summary

Method                     Input             Output
Sankoff’s & Fitch’s Alg.   Characters, T     A, B
Perfect Phylogeny          Characters        A, B, T
Felsenstein                Characters, T, B  A

T = tree topology, B = branch lengths, A = ancestral states

Categories: Probabilistic; Parsimony

Pairwise Compatibility Test

(Wilson 1965) Binary characters i and j are pairwise compatible if and only if j is homogeneous w.r.t. i0 or i1, where i0 and i1 are the sets of species in which character i has state 0 and state 1.

Equivalently: i1 and j1 are disjoint, or one contains the other.
Equivalently: the four rows (0,0), (0,1), (1,0), (1,1) do not all occur.

     i  j  k
A    0  0  0
B    0  0  0
C    0  1  1
D    1  0  1
E    1  0  0

Here i0 = {A, B, C} and i1 = {D, E}.
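The four-row form of the test is easy to check mechanically. The sketch below is only an illustration: the function name `pairwise_compatible` and the dictionary encoding of characters are assumptions, and the values of i, j, k are read off the slide's table.

```python
def pairwise_compatible(ci, cj):
    """Return True if two binary characters are pairwise compatible.

    Each character is a dict mapping species name -> 0 or 1.
    The pair is incompatible exactly when all four row patterns
    (0,0), (0,1), (1,0), (1,1) occur among the species.
    """
    patterns = {(ci[s], cj[s]) for s in ci}
    return len(patterns) < 4


# Characters i, j, k over species A..E, as in the table above.
i = {"A": 0, "B": 0, "C": 0, "D": 1, "E": 1}
j = {"A": 0, "B": 0, "C": 1, "D": 0, "E": 0}
k = {"A": 0, "B": 0, "C": 1, "D": 1, "E": 0}

print(pairwise_compatible(i, j))  # True: the row (1,1) never occurs
print(pairwise_compatible(i, k))  # False: all four rows occur
```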

Pairwise Compatibility Theorem

(Estabrook et al. 1976)

A set S of binary characters is mutually compatible if and only if all pairs c and c’ of characters in S are pairwise compatible. Pairwise compatibility ⇔ mutual compatibility.

Perfect Phylogeny

A set of mutually compatible binary characters gives a perfect phylogeny:

1. Evolutionary model
   – Binary characters {0,1}
   – Each character changes state only once in evolutionary history (no homoplasy!).
2. Tree in which every mutation is on an edge of the tree.
   – All the species in one sub-tree contain a 0, and all species in the other contain a 1.
   – For simplicity, assume root = (0, 0, 0, 0, 0)

Last time: algorithm to reconstruct a tree.

Species (rows) by traits (columns):

     1  2  3  4  5
A    1  1  0  0  0
B    0  0  1  0  0
C    1  1  0  1  0
D    0  0  1  0  1
E    1  0  0  0  0

Trees and Splits

Given a set X, a split is a partition of X into two non-empty subsets A and B, written X = A | B.

For a phylogenetic tree T with leaves L, each edge e defines a split Le = A | B, where A and B are the leaf sets of the two subtrees obtained by removing e.

In a perfect phylogeny, the edge where binary character i changes state gives the split i0 | i1. We will return to splits in a future lecture.

Splits Equivalence Theorem

A phylogenetic tree T defines a collection of splits Σ(T) = { Le | e is an edge in T }.

Splits A1 | B1 and A2 | B2 are pairwise compatible if at least one of A1∩A2, A1∩B2, B1∩A2, and B1∩B2 is the empty set.

Splits Equivalence Theorem: Let Σ be a collection of splits. There is a phylogenetic tree T such that Σ(T) = Σ if and only if the splits in Σ are pairwise compatible.

The Pairwise Compatibility Theorem (for binary characters) follows from this theorem.
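The compatibility condition for two splits translates directly into set intersections. In this sketch the function name and the encoding of a split as a pair of Python sets are assumptions; the three example splits are the ones induced by the binary characters i, j, k in the pairwise compatibility example.

```python
def splits_compatible(split1, split2):
    """Splits A1|B1 and A2|B2 are compatible if at least one of the
    four pairwise intersections of their sides is empty."""
    a1, b1 = split1
    a2, b2 = split2
    return any(len(p & q) == 0 for p in (a1, b1) for q in (a2, b2))


# Splits of X = {A, B, C, D, E} induced by characters i, j, k (state 0 | state 1
# is the same split as state 1 | state 0, so the order of the sides is arbitrary).
split_i = ({"A", "B", "C"}, {"D", "E"})
split_j = ({"C"}, {"A", "B", "D", "E"})
split_k = ({"C", "D"}, {"A", "B", "E"})

print(splits_compatible(split_i, split_j))  # True: {D, E} and {C} are disjoint
print(splits_compatible(split_i, split_k))  # False: all four intersections are non-empty
```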

Outline

Distance-based methods for phylogenetic tree reconstruction.

Review of distances/metrics.
Tree distances and additive distances
– Small and large phylogeny problems.
Non-additive distances and clustering
– UPGMA and ultrametric distances.

Distances

A distance on a set X is a function d: X × X → R satisfying:

For all x, y ∈ X, d(x, y) ≥ 0, with equality iff x = y.
For all x, y ∈ X, d(x, y) = d(y, x) [symmetry]
For all x, y, z ∈ X, d(x, z) ≤ d(x, y) + d(y, z) [triangle inequality]

Examples:

X = real numbers: d(x, y) = |x – y| is a distance.
X = strings of equal length over some alphabet: dH(s, t) = number of positions where s and t differ is called the Hamming distance.

Distances in Biological Data

String distances (e.g. Hamming distance, edit distance) on DNA/protein sequence data

Substitution model (Jukes-Cantor, Kimura, etc.): scores for particular changes A → T, C → G, etc.

Rat:        ACAGTCACGCCCCACACGT
Mouse:      ACAGTGACGCCACACACGT
Gorilla:    CCTGTGACGTAACAAACGA
Chimpanzee: CCTGTGAGGTAGCAAACGA
Human:      CCTGTGAGGTAGCACACGA

Distance Matrix

For n species, form the n x n distance matrix D_ij.
Example: D_ij = edit distance between a gene in species i and species j.

Mouse:      ACAGTGACGCCACACACGT
Gorilla:    CCTGCGACGTAACAAACGC
Chimpanzee: CCTGCCAGTTAGCAAACGC
Human:      CCTGCCAGTTAGCACACGA

             Mouse  Gorilla  Chimpanzee  Human
Mouse          0       7        11        10
Gorilla        7       0         4         6
Chimpanzee    11       4         0         2
Human         10       6         2         0
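As a rough illustration of how such a matrix can be built, the sketch below uses the Hamming distance as a stand-in for the edit distance named on the slide (an assumption; for these four equal-length sequences it gives the entries 7, 11, 10, 4, 6, 2). The names `hamming`, `seqs` and `D` are illustrative.

```python
def hamming(s, t):
    """Number of positions at which two equal-length strings differ."""
    if len(s) != len(t):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(a != b for a, b in zip(s, t))


seqs = {
    "Mouse":      "ACAGTGACGCCACACACGT",
    "Gorilla":    "CCTGCGACGTAACAAACGC",
    "Chimpanzee": "CCTGCCAGTTAGCAAACGC",
    "Human":      "CCTGCCAGTTAGCACACGA",
}
names = list(seqs)

# n x n distance matrix, rows and columns in the order of `names`.
D = [[hamming(seqs[a], seqs[b]) for b in names] for a in names]

for name, row in zip(names, D):
    print(f"{name:<11}", row)
```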

Alignment vs. Distance Matrix

Sequence a gene of length m in n species → n x m alignment matrix.

Transform into an n x n distance matrix. The reverse transformation is not possible due to loss of information.

Example: the Mouse, Gorilla, Chimpanzee and Human alignment (n = 4, m = 19) transforms into the 4 x 4 distance matrix with off-diagonal entries 7, 11, 10, 4, 6, 2.

Distances in Trees

Given a tree T with a positive weight w(e) on each edge, we define the tree distance d_T on the set L of leaves by:

d_T(i, j) = sum of the weights of the edges on the unique path from i to j.

In evolutionary biology, weights are sometimes called branch lengths.

Distance in Trees: an Example

d_T(1,4) = 12 + 13 + 14 + 17 + 13 = 69
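A tree distance can be computed by walking the unique path between two leaves. The sketch below is a minimal illustration, not the slide's own example: the figure for d_T(1,4) is not recoverable, so the tree used here is the five-leaf tree with internal vertices a, b, c that appears later in these slides, and the edge-dictionary encoding and the function name are assumptions.

```python
def tree_distance(edges, start, goal):
    """Sum of edge weights on the unique path from start to goal.

    `edges` maps an (u, v) pair to the positive weight of that edge.
    """
    adj = {}
    for (u, v), w in edges.items():
        adj.setdefault(u, []).append((v, w))
        adj.setdefault(v, []).append((u, w))
    # Depth-first walk; in a tree, never stepping back to the parent
    # is enough to avoid revisiting vertices.
    stack = [(start, None, 0)]
    while stack:
        node, parent, dist = stack.pop()
        if node == goal:
            return dist
        for nxt, w in adj[node]:
            if nxt != parent:
                stack.append((nxt, node, dist + w))
    raise ValueError("no path between the given vertices")


# Leaves v, w, x, y, z; internal vertices a, b, c.
edges = {
    ("v", "a"): 6, ("w", "a"): 4, ("a", "c"): 3, ("c", "z"): 7,
    ("c", "b"): 3, ("b", "x"): 5, ("b", "y"): 4,
}

print(tree_distance(edges, "v", "x"))  # 6 + 3 + 3 + 5 = 17
```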

Distance vs. Tree Distance

n x n distance matrix for n species.
Note that d_T(i, j), the tree distance between i and j, is not necessarily equal to D_ij as given by the distance matrix.

Rat:        ACAGTGACGCCCCAAACGT
Mouse:      ACAGTGACGCTACAAACGT
Gorilla:    CCTGTGACGTAACAAACGA
Chimpanzee: CCTGTGACGTAGCAAACGA
Human:      CCTGTGACGTAGCAAACGA

Fitting a Distance Matrix

Given n species, we can compute the n x n distance matrix D_ij.

Evolution of these species is described by a tree that we don’t know.

We need an algorithm to construct a tree that best fits the distance matrix D_ij.

Find a tree T such that:

D_ij = d_T(i, j)

where D_ij is the (known) distance between species i and j, and d_T(i, j) is the length of the path between them in the (unknown) tree T.

Distance Based Phylogeny Problem

Goal: Reconstruct an evolutionary tree from a distance matrix.
Input: n x n distance matrix D_ij
Output: weighted tree T with n leaves fitting D

Unknown topology of the tree makes evolutionary tree reconstruction hard!
Number of unrooted binary trees with n leaves: T(n) = (2n-4)! / ((n-2)! 2^(n-2))
n = 24: T(n) ≈ 5.64 x 10^26

If D is additive, this problem has a solution and there is a simple algorithm to solve it.
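To get a feel for how quickly the number of topologies grows, the count can be evaluated exactly with integer arithmetic. This sketch uses the standard closed form for unrooted binary trees with n labeled leaves, (2n-4)! / ((n-2)! 2^(n-2)) = (2n-5)!!; the function name is an assumption.

```python
from math import factorial


def num_unrooted_binary_trees(n):
    """Number of unrooted binary tree topologies on n >= 3 labeled leaves."""
    if n < 3:
        raise ValueError("need at least 3 leaves")
    return factorial(2 * n - 4) // (factorial(n - 2) * 2 ** (n - 2))


for n in (4, 5, 10, 24):
    print(n, num_unrooted_binary_trees(n))
```

For n = 4 this gives the three possible quartet topologies, and for n = 24 the count is already of order 10^26.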

Distance‐based vs. character‐based

Key difference: Distance‐based methods do not reconstruct ancestral states.

     A  B  C  D
A    -  1  2  2
B    1  -  1  1
C    2  1  -
D    2  1     -

(The C-D entry is not shown.) Note that C and D are identical.

Reconstructing a 3 Leaved Tree

Tree reconstruction for a 3x3 matrix is straightforward.

We have 3 leaves i, j, k and a center vertex c.

Observe:
d_ic + d_jc = D_ij
d_ic + d_kc = D_ik
d_jc + d_kc = D_jk

Reconstructing a 3 Leaved Tree (cont’d)

Add the first two equations:

  d_ic + d_jc = D_ij
+ d_ic + d_kc = D_ik
  2d_ic + d_jc + d_kc = D_ij + D_ik

Substitute d_jc + d_kc = D_jk:

  2d_ic + D_jk = D_ij + D_ik

  d_ic = (D_ij + D_ik – D_jk)/2

Similarly,

  d_jc = (D_ij + D_jk – D_ik)/2
  d_kc = (D_ki + D_kj – D_ij)/2
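The three formulas can be packaged in a few lines. The function name and argument order below are assumptions, and the example values (10, 17, 15) are the v-w, v-x and w-x distances from the five-leaf matrix used later in these slides.

```python
def three_leaf_edges(d_ij, d_ik, d_jk):
    """Edge lengths from leaves i, j, k to the center vertex c
    of the unique tree on three leaves."""
    d_ic = (d_ij + d_ik - d_jk) / 2
    d_jc = (d_ij + d_jk - d_ik) / 2
    d_kc = (d_ik + d_jk - d_ij) / 2
    return d_ic, d_jc, d_kc


d_ic, d_jc, d_kc = three_leaf_edges(10, 17, 15)
print(d_ic, d_jc, d_kc)  # 6.0 4.0 11.0

# The edge lengths add back up to the input distances.
assert d_ic + d_jc == 10 and d_ic + d_kc == 17 and d_jc + d_kc == 15
```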

Trees with > 3 Leaves

A binary tree with n leaves has 2n-3 edges.
Fitting a given tree to a distance matrix D requires solving a system with n(n-1)/2 equations and 2n-3 variables.

A solution is not always possible for n > 3.

Additive Distance Matrices

Matrix D is ADDITIVE if there exists a tree T with d_ij(T) = D_ij, and NON-ADDITIVE otherwise.

Additive Distance Phylogeny

Small Additive Distance Phylogeny: Given a phylogenetic tree T and distance matrix D, determine branch lengths such that d_T(i,j) = D_ij.

Large Additive Distance Phylogeny: Given distance matrix D, find T and branch lengths such that d_T(i,j) = D_ij.

Both of these problems can be solved efficiently.

Reconstructing Additive Distances Given T

Figure: tree T with leaves v, w, x, y, z and edge lengths 5, 4, 7, 3, 3, 4, 6.

Distance matrix D:

     v   w   x   y   z
v    0  10  17  16  16
w        0  15  14  14
x            0   9  15
y                0  14
z                    0

If we know T and D, but do not know the length of each edge, we can reconstruct those lengths.

Reconstructing Additive Distances Given T (cont’d)

Find neighbors v, w (common parent a). Replacing v and w by a in D gives D1:

     a   x   y   z
a    0  11  10  10
x        0   9  15
y            0  14
z                0

d_ax = ½ (d_vx + d_wx – d_vw) = ½ (17 + 15 – 10) = 11
d_ay = ½ (d_vy + d_wy – d_vw) = ½ (16 + 14 – 10) = 10
d_az = ½ (d_vz + d_wz – d_vw) = ½ (16 + 14 – 10) = 10

Reconstructing Additive Distances Given T (cont’d)

In D1, find neighbors x, y (common parent b). Replacing x and y by b gives D2:

     a   b   z
a    0   6  10
b        0  10
z            0

D2 has three leaves a, b, z; let c be the center vertex. D3 gives the distance from a to c:

     a   c
a    0   3
c        0

The edge lengths follow by back-substitution:

d(a, c) = 3
d(b, c) = d(a, b) – d(a, c) = 3
d(c, z) = d(a, z) – d(a, c) = 7
d(b, x) = d(a, x) – d(a, b) = 5
d(b, y) = d(a, y) – d(a, b) = 4
d(a, w) = d(z, w) – d(a, z) = 4
d(a, v) = d(z, v) – d(a, z) = 6

Correct!!! These are the edge lengths 5, 4, 7, 3, 3, 4, 6 of T.

Trees and Neighbors

The previous algorithm relied only on finding neighboring leaves:

1. Find neighboring leaves i and j with parent k.
2. Remove the rows and columns of i and j.
3. Add a new row and column corresponding to k, where the distance from k to any other leaf m is computed as: D_km = (D_im + D_jm – D_ij)/2

Compress i and j into k, then iterate the algorithm on the rest of the tree.
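The compression step can be sketched as a function on a distance matrix stored as a dictionary of dictionaries. The helper `make_matrix`, the function name `merge_neighbors`, and the choice of which leaves to merge (taken from the worked example with leaves v, w, x, y, z) are assumptions for illustration; the code does not itself decide which leaves are neighbors.

```python
def make_matrix(pairs):
    """Build a symmetric dict-of-dicts matrix from {(a, b): distance}."""
    D = {}
    for (a, b), d in pairs.items():
        D.setdefault(a, {a: 0})[b] = d
        D.setdefault(b, {b: 0})[a] = d
    return D


def merge_neighbors(D, i, j, k):
    """Return a new matrix in which neighboring leaves i and j are
    replaced by their parent k, with D_km = (D_im + D_jm - D_ij) / 2."""
    others = [m for m in D if m not in (i, j)]
    new = {m: {n: D[m][n] for n in others} for m in others}
    new[k] = {k: 0}
    for m in others:
        d = (D[i][m] + D[j][m] - D[i][j]) / 2
        new[k][m] = d
        new[m][k] = d
    return new


D = make_matrix({
    ("v", "w"): 10, ("v", "x"): 17, ("v", "y"): 16, ("v", "z"): 16,
    ("w", "x"): 15, ("w", "y"): 14, ("w", "z"): 14,
    ("x", "y"): 9, ("x", "z"): 15, ("y", "z"): 14,
})

D1 = merge_neighbors(D, "v", "w", "a")   # a-x = 11, a-y = 10, a-z = 10
D2 = merge_neighbors(D1, "x", "y", "b")  # a-b = 6, b-z = 10
D3 = merge_neighbors(D2, "b", "z", "c")  # a-c = 3
print(D3["a"]["c"])
```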

Finding Neighboring Leaves

To find neighboring leaves we simply select a pair of closest leaves.

WRONG!

     i   j   k   l
i    0  13  21  22
j        0  12  13
k            0  13
l                0

i and j are neighbors, but (d_ij = 13) > (d_jk = 12). Finding a pair of neighboring leaves is a nontrivial problem!

Degenerate Triples

A degenerate triple is a set of three distinct elements 1 ≤ i, j, k ≤ n where D_ij + D_jk = D_ik.

Element j in a degenerate triple i, j, k lies on the evolutionary path from i to k (or is attached to this path by an edge of length 0).

Looking for Degenerate Triples

If distance matrix D has a degenerate triple i, j, k, then j can be “removed” from D, thus reducing the size of the problem.

If distance matrix D does not have a degenerate triple i, j, k, one can “create” a degenerate triple in D by shortening all hanging edges (in the tree).

Shortening Hanging Edges to Produce Degenerate Triples

Shorten all “hanging” edges (edges that connect leaves) until a degenerate triple is found.

Finding Degenerate Triples

If there is no degenerate triple, all hanging edges are reduced by the same amount δ, so that all pairwise distances in the matrix are reduced by 2δ.

Eventually this process collapses one of the leaves (when δ = length of shortest hanging edge), forming a degenerate triple i, j, k and reducing the size of the distance matrix D.

The attachment point for j can be recovered in the reverse transformations by saving D_ij for each collapsed leaf.

Reconstructing Trees for Additive Distance Matrices

Trim(D, δ)
  for all 1 ≤ i ≠ j ≤ n
    D_ij = D_ij – 2δ

AdditivePhylogeny Algorithm

AdditivePhylogeny(D)
  if D is a 2 x 2 matrix
    T = tree of a single edge of length D_1,2
    return T
  if D is non-degenerate
    Compute trimming parameter δ
    Trim(D, δ)
  Find a triple i, j, k in D such that D_ij + D_jk = D_ik
  x = D_ij
  Remove jth row and jth column from D
  T = AdditivePhylogeny(D)
  Traceback:
    Add a new vertex v to T at distance x from i on the path from i to k
    Add j back to T by creating an edge (v, j) of length 0
    for every leaf l in T
      if distance from l to v in the tree ≠ D_l,j
        output “matrix is not additive”
        return
    Extend all “hanging” edges by length δ
  return T

Question: How to compute δ?
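One possible answer, offered as an assumption rather than as the lecture's own: trimming by δ turns a triple i, j, k degenerate exactly when (D_ij – 2δ) + (D_jk – 2δ) = D_ik – 2δ, i.e. δ = (D_ij + D_jk – D_ik)/2, so the smallest such value over all triples is a natural trimming parameter. The sketch below implements that idea together with Trim and a brute-force search for a degenerate triple; all function names and the `make_matrix` helper are illustrative.

```python
from itertools import permutations


def make_matrix(pairs):
    """Build a symmetric dict-of-dicts matrix from {(a, b): distance}."""
    D = {}
    for (a, b), d in pairs.items():
        D.setdefault(a, {a: 0})[b] = d
        D.setdefault(b, {b: 0})[a] = d
    return D


def find_degenerate_triple(D):
    """Return some (i, j, k) with D_ij + D_jk = D_ik, or None."""
    for i, j, k in permutations(D, 3):
        if D[i][j] + D[j][k] == D[i][k]:
            return i, j, k
    return None


def trimming_parameter(D):
    """Smallest delta for which trimming creates a degenerate triple."""
    return min((D[i][j] + D[j][k] - D[i][k]) / 2
               for i, j, k in permutations(D, 3))


def trim(D, delta):
    """Reduce every off-diagonal entry by 2 * delta (returns a new matrix)."""
    return {i: {j: (D[i][j] - 2 * delta if i != j else 0) for j in D[i]}
            for i in D}


D = make_matrix({
    ("v", "w"): 10, ("v", "x"): 17, ("v", "y"): 16, ("v", "z"): 16,
    ("w", "x"): 15, ("w", "y"): 14, ("w", "z"): 14,
    ("x", "y"): 9, ("x", "z"): 15, ("y", "z"): 14,
})

print(find_degenerate_triple(D))   # None: no hanging edge has length 0
delta = trimming_parameter(D)
print(delta)                       # 4.0, the shortest hanging edge in the example tree
print(find_degenerate_triple(trim(D, delta)))
```

The brute-force search over all ordered triples is cubic in n, which is fine for a sketch but not tuned for large matrices.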

Additive Distance

How to tell if D is additive?
AdditivePhylogeny provides a way to check if distance matrix D is additive.

An even more efficient additivity check is the “four-point condition”.

The Four Point Condition

(Zaretskii 1965, Buneman 1971)

Let 1 ≤ i, j, k, l ≤ n be four distinct leaves in a tree, with i, j on one side of the middle edge and k, l on the other.

Compute:
1. D_ij + D_kl
2. D_ik + D_jl
3. D_il + D_jk

2 and 3 represent the same number: (length of all edges) + (length of middle edge), because the middle edge is counted twice.
1 represents a smaller number: (length of all edges) – (length of middle edge).

The Four Point Condition (cont’d)

Four point condition: Every four leaves (quartet) can be labeled as i, j, k, l such that:

D_ij + D_kl ≤ D_ik + D_jl = D_il + D_jk

Theorem: An n x n matrix D is additive if and only if the four point condition holds for every quartet 1 ≤ i, j, k, l ≤ n.

Proof (⇒): Since D is additive, D = d_T. Find a split such that i, j ∈ S1 and k, l ∈ S2. Let λ1, λ2, λ4, λ5 be the weights of the paths leading to i, j, k, l, and λ3 the weight of the middle path. Then

D_ik + D_jl = (λ1 + λ3 + λ4) + (λ2 + λ3 + λ5)
            = (λ1 + λ3 + λ5) + (λ2 + λ3 + λ4) = D_il + D_jk
            ≥ (λ1 + λ2) + (λ4 + λ5) = D_ij + D_kl.
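The condition says that, of the three pairings of a quartet, the two largest sums must be equal. A brute-force check over all quartets might look like the sketch below; the function names, the `make_matrix` helper and the small non-additive matrix `bad` are assumptions made for illustration, while the first two matrices are the five-leaf and four-leaf examples from these slides.

```python
from itertools import combinations


def make_matrix(pairs):
    """Build a symmetric dict-of-dicts matrix from {(a, b): distance}."""
    D = {}
    for (a, b), d in pairs.items():
        D.setdefault(a, {a: 0})[b] = d
        D.setdefault(b, {b: 0})[a] = d
    return D


def four_point_holds(D, i, j, k, l):
    """True if the two largest of the three pair sums are equal."""
    sums = sorted([D[i][j] + D[k][l], D[i][k] + D[j][l], D[i][l] + D[j][k]])
    return sums[1] == sums[2]


def satisfies_four_point(D):
    """Check the four point condition on every quartet of leaves."""
    return all(four_point_holds(D, *q) for q in combinations(D, 4))


five_leaf = make_matrix({
    ("v", "w"): 10, ("v", "x"): 17, ("v", "y"): 16, ("v", "z"): 16,
    ("w", "x"): 15, ("w", "y"): 14, ("w", "z"): 14,
    ("x", "y"): 9, ("x", "z"): 15, ("y", "z"): 14,
})
four_leaf = make_matrix({
    ("i", "j"): 13, ("i", "k"): 21, ("i", "l"): 22,
    ("j", "k"): 12, ("j", "l"): 13, ("k", "l"): 13,
})
# Hypothetical matrix whose three pair sums (4, 6, 7) are all different.
bad = make_matrix({
    ("a", "b"): 2, ("c", "d"): 2, ("a", "c"): 3,
    ("b", "d"): 3, ("a", "d"): 3, ("b", "c"): 4,
})

print(satisfies_four_point(five_leaf))  # True
print(satisfies_four_point(four_leaf))  # True: sums are 26, 34, 34
print(satisfies_four_point(bad))        # False
```

In the four-leaf example the smallest sum is D_ij + D_kl = 26, which identifies i, j as neighbors even though j and k are the closest pair.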

Non-additive Distances

What if there is no tree T such that D_ij = d_T(i,j)?
Approaches:
1. Find a tree that minimizes the “error”.
2. Heuristic approaches: clustering.

Least Squares Distance Phylogeny Problem

If the distance matrix D is NOT additive, then we look for a tree T that approximates D the best:

Squared Error: ∑_{i,j} (d_ij(T) – D_ij)^2

Squared Error is a measure of the quality of the fit between the distance matrix and the tree: we want to minimize it.

Least Squares Distance Phylogeny Problem:
– Find the approximation tree T with minimum squared error for a non-additive matrix D.
– (NP-hard)
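Evaluating the squared error for one candidate tree is simple once its tree distances are known; the hard part is the search over trees. In the sketch below the sum runs over unordered pairs i < j (an assumption, since the slide's index range is not spelled out), and both small matrices are made-up examples.

```python
def squared_error(D, d_tree):
    """Sum over unordered leaf pairs of (d_ij(T) - D_ij)^2.

    Both arguments are dict-of-dicts matrices over the same leaves.
    """
    leaves = list(D)
    return sum((d_tree[a][b] - D[a][b]) ** 2
               for idx, a in enumerate(leaves)
               for b in leaves[idx + 1:])


# Hypothetical observed distances and tree distances on three leaves.
D = {
    "a": {"a": 0, "b": 3, "c": 5},
    "b": {"a": 3, "b": 0, "c": 5},
    "c": {"a": 5, "b": 5, "c": 0},
}
d_tree = {
    "a": {"a": 0, "b": 3, "c": 4},
    "b": {"a": 3, "b": 0, "c": 6},
    "c": {"a": 4, "b": 6, "c": 0},
}

print(squared_error(D, d_tree))  # 0 + 1 + 1 = 2
```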

Tree construction as clustering

Pair Group Methods

Iteratively combine closest leaves/groups into larger groups.

C ← { {1}, …, {n} }
While |C| > 2 do
  [Find closest clusters.] Choose Ci, Cj with d(Ci, Cj) minimal over all pairs of clusters.
  Ck ← Ci ∪ Cj
  [Replace Ci and Cj by Ck.] C ← (C \ {Ci, Cj}) ∪ {Ck}.

Pair Group Methods (cont’d)

What is d? How to define branch lengths?

UPGMA

Unweighted Pair Group Method with Averages

Distance between clusters defined as average pairwise distance.

Assigns a height to every vertex in the tree, effectively dating every vertex.

UPGMA (cont’d)

Given two disjoint clusters Ci, Cj of sequences, the distance between them is the average pairwise distance:

d_ij = (1 / (|Ci| × |Cj|)) ∑_{p ∈ Ci, q ∈ Cj} d_pq

Add a vertex connecting Ci, Cj and place it at height d_ij / 2.

UPGMA Algorithm

Initialization:
  Assign each x_i to its own cluster Ci.
  Define one leaf per sequence, each at height 0.
Iteration:
  Find two clusters Ci and Cj such that d_ij is minimal.
  Let Ck = Ci ∪ Cj.
  Add a vertex connecting Ci, Cj and place it at height d_ij / 2.
  Delete Ci and Cj.
Termination:
  When a single cluster remains.
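The steps above can be sketched compactly if cluster distances are recomputed from the original matrix each time, which is simple but not the most efficient choice. The tree encoding (nested tuples), the function name `upgma`, the `make_matrix` helper and the four-leaf example matrix are all assumptions for illustration.

```python
from itertools import combinations


def make_matrix(pairs):
    """Build a symmetric dict-of-dicts matrix from {(a, b): distance}."""
    D = {}
    for (a, b), d in pairs.items():
        D.setdefault(a, {a: 0})[b] = d
        D.setdefault(b, {b: 0})[a] = d
    return D


def upgma(D):
    """Return (root, height): the tree as nested tuples and a dict
    giving the height of every vertex (leaves are at height 0)."""
    # Each cluster is keyed by the tuple of its leaves and maps to its subtree.
    clusters = {(leaf,): leaf for leaf in D}
    height = {leaf: 0.0 for leaf in D}

    def avg(ci, cj):
        # Average pairwise distance between two disjoint clusters.
        return sum(D[p][q] for p in ci for q in cj) / (len(ci) * len(cj))

    while len(clusters) > 1:
        ci, cj = min(combinations(clusters, 2), key=lambda pair: avg(*pair))
        node = (clusters.pop(ci), clusters.pop(cj))
        height[node] = avg(ci, cj) / 2
        clusters[ci + cj] = node
    (root,) = clusters.values()
    return root, height


# Hypothetical ultrametric example on four leaves.
D = make_matrix({
    ("A", "B"): 2, ("A", "C"): 6, ("B", "C"): 6,
    ("A", "D"): 10, ("B", "D"): 10, ("C", "D"): 10,
})
root, height = upgma(D)
print(root)
print(height[root])  # 5.0: half of the average distance 10 at the last merge
```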

Trees from UPGMA

UPGMA produces an ultrametric tree: the distance from the root to any leaf is the same.

The Molecular Clock: The evolutionary distance between species x and y is twice the Earth time to reach the nearest common ancestor. That is, the molecular clock has constant rate in all species.

The molecular clock results in ultrametric distances.

UPGMA’s Weakness: Example

Figure: the correct tree (leaf order 2, 3, 4, 1) next to the tree produced by UPGMA (leaf order 1, 4, 3, 2).

Ultrametrics

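D_ij is an ultrametric provided that for all species i, j, k (distinct leaves of the tree), two of the distances D_ij, D_jk and D_ik are equal and ≥ the third.

Ex. d(i,k) = d(j,k) ≥ d(i,j)

Proposition: If d is ultrametric, then d is additive.

In the figure, i and j each lie at distance λ1 from their common parent, and k lies at distance λ2 from that parent, so d(i,j) = 2λ1 and d(i,k) = d(j,k) = λ1 + λ2:

λ1 + λ2 ≥ 2λ1. Thus λ2 ≥ λ1.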
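The ultrametric condition on triples can be tested directly: sort the three distances and require the two largest to be equal. The function name, the `make_matrix` helper and the four-leaf ultrametric example are assumptions; the second matrix is the additive five-leaf example from earlier in these slides, which is additive but not ultrametric.

```python
from itertools import combinations


def make_matrix(pairs):
    """Build a symmetric dict-of-dicts matrix from {(a, b): distance}."""
    D = {}
    for (a, b), d in pairs.items():
        D.setdefault(a, {a: 0})[b] = d
        D.setdefault(b, {b: 0})[a] = d
    return D


def is_ultrametric(D):
    """True if, for every triple, the two largest distances are equal."""
    for i, j, k in combinations(D, 3):
        _, second, largest = sorted([D[i][j], D[j][k], D[i][k]])
        if second != largest:
            return False
    return True


# Hypothetical ultrametric matrix.
ultra = make_matrix({
    ("A", "B"): 2, ("A", "C"): 6, ("B", "C"): 6,
    ("A", "D"): 10, ("B", "D"): 10, ("C", "D"): 10,
})
additive = make_matrix({
    ("v", "w"): 10, ("v", "x"): 17, ("v", "y"): 16, ("v", "z"): 16,
    ("w", "x"): 15, ("w", "y"): 14, ("w", "z"): 14,
    ("x", "y"): 9, ("x", "z"): 15, ("y", "z"): 14,
})

print(is_ultrametric(ultra))     # True
print(is_ultrametric(additive))  # False: the triple v, w, x has distances 10, 15, 17
```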

Ultrametrics (cont’d)

Both additive distance phylogeny and perfect phylogeny can be reduced to the ultrametric phylogeny problem.

Let v = row of D containing the largest entry m_v.
Define D’_ij = m_v + (D_ij – D_vi – D_vj) / 2 = m_v – λ3, where λ1, λ2, λ3 are the lengths of the edges joining i, j, v to the center vertex of the three-leaf tree on i, j, v.

Theorem: D is additive if and only if D’ is ultrametric. (See Gusfield, Ch. 17)
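The transformation can be tried out numerically. The sketch below makes two assumptions that the slide does not spell out: D’ is defined only on the leaves other than v, and ties for the largest entry are broken by taking the first such row. The function names, the `make_matrix` helper and the non-additive matrix `bad` are illustrative.

```python
from itertools import combinations


def make_matrix(pairs):
    """Build a symmetric dict-of-dicts matrix from {(a, b): distance}."""
    D = {}
    for (a, b), d in pairs.items():
        D.setdefault(a, {a: 0})[b] = d
        D.setdefault(b, {b: 0})[a] = d
    return D


def transform(D):
    """D'_ij = m_v + (D_ij - D_vi - D_vj) / 2 on the leaves other than v,
    where v is the first row containing the largest entry m_v."""
    leaves = list(D)
    v = max(leaves, key=lambda a: max(D[a].values()))
    m_v = max(D[v].values())
    rest = [a for a in leaves if a != v]
    return {i: {j: (0 if i == j else m_v + (D[i][j] - D[v][i] - D[v][j]) / 2)
                for j in rest}
            for i in rest}


def is_ultrametric(D):
    """True if, for every triple, the two largest distances are equal."""
    return all(sorted([D[i][j], D[j][k], D[i][k]])[1]
               == max(D[i][j], D[j][k], D[i][k])
               for i, j, k in combinations(D, 3))


additive = make_matrix({
    ("v", "w"): 10, ("v", "x"): 17, ("v", "y"): 16, ("v", "z"): 16,
    ("w", "x"): 15, ("w", "y"): 14, ("w", "z"): 14,
    ("x", "y"): 9, ("x", "z"): 15, ("y", "z"): 14,
})
# Hypothetical non-additive matrix (its pair sums 4, 6, 7 are all different).
bad = make_matrix({
    ("a", "b"): 2, ("c", "d"): 2, ("a", "c"): 3,
    ("b", "d"): 3, ("a", "d"): 3, ("b", "c"): 4,
})

print(is_ultrametric(transform(additive)))  # True
print(is_ultrametric(transform(bad)))       # False
```

For the additive example, v is the chosen row (m_v = 17) and the transformed entries are 11 for every pair involving w, 5 for x-y, and 8 for x-z and y-z.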