Algorithm Summary Method Input Output Sankoffs & Fitchs - - PDF document


2/4/09 CSCI1950Z Computational Methods for Biology, Lecture 4. Ben Raphael, February 2, 2009. http://cs.brown.edu/courses/csci1950z/



2/4/09 1

CSCI1950-Z Computational Methods for Biology
Lecture 4

Ben Raphael
February 2, 2009

http://cs.brown.edu/courses/csci1950-z/

Algorithm Summary

Method                     Input              Output
Sankoff’s & Fitch’s Alg.   Characters, T      A, B
Perfect Phylogeny          Characters         A, B, T
Felsenstein                Characters, T, B   A

T = tree topology
B = branch lengths
A = ancestral states

Slide labels: Probabilistic, Parsimony


Pairwise Compatibility Test

(Wilson 1965) Binary characters i and j are pairwise compatible if and only if j is homogeneous w.r.t. i0 or i1, where i0 and i1 are the sets of species with state 0 and state 1 for character i.

Equivalently: i1 and j1 are disjoint, or one contains the other.

Equivalently: the four rows (0,0), (0,1), (1,0), (1,1) do not all occur.

Species   i   j   k
A         0   0   0
B         0   0   0
C         0   1   1
D         1   0   1
E         1   0   0
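The set-based form of the test is easy to try on the example characters i, j, k from this slide. The following is a minimal sketch, assuming each character is stored as a dictionary from species name to state; the function names `ones` and `pairwise_compatible` are illustrative and not from the lecture.

```python
def ones(character):
    """Return the set of species whose state is 1 for this character."""
    return {species for species, state in character.items() if state == 1}


def pairwise_compatible(ci, cj):
    """Set-based test: i1 and j1 are disjoint, or one contains the other."""
    i1, j1 = ones(ci), ones(cj)
    return i1.isdisjoint(j1) or i1 <= j1 or j1 <= i1


# Characters i, j, k as read from the example on this slide.
i = {"A": 0, "B": 0, "C": 0, "D": 1, "E": 1}
j = {"A": 0, "B": 0, "C": 1, "D": 0, "E": 0}
k = {"A": 0, "B": 0, "C": 1, "D": 1, "E": 0}

print(pairwise_compatible(i, j))  # i1 = {D, E} and j1 = {C} are disjoint
print(pairwise_compatible(i, k))  # i1 and k1 overlap in D only
print(pairwise_compatible(j, k))  # j1 = {C} is contained in k1 = {C, D}
```

With these inputs, i and j are compatible, j and k are compatible, and i and k are not, since the pair (i, k) shows all four rows (0,0), (0,1), (1,0), (1,1).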

Pairwise Compatibility Theorem

(Estabrook et al. 1976)

A set S of binary characters is mutually compatible if and only if all pairs c and c’ of characters in S are pairwise compatible.

Pairwise compatibility ⇔ mutual compatibility.


Perfect Phylogeny

A set of mutually compatible binary characters gives a perfect phylogeny:

1. Evolutionary model
   – Binary characters {0,1}
   – Each character changes state only once in evolutionary history (no homoplasy!).
2. Tree in which every mutation is on an edge of the tree.
   – All the species in one sub-tree contain a 0, and all species in the other contain a 1.
   – For simplicity, assume root = (0, 0, 0, 0, 0)

Last time: algorithm to reconstruct a tree.

          Traits
Species   1  2  3  4  5
A         1  1  0  0  0
B         0  0  1  0  0
C         1  1  0  1  0
D         0  0  1  0  1
E         1  0  0  0  0
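One way to see the tree hidden in this matrix is to nest the 1-sets of the traits. The sketch below is not necessarily the algorithm from the previous lecture: it assumes the characters are mutually compatible, that the root is all zeros, and that no two traits have the same 1-set. The name `clade_parents` is hypothetical.

```python
def clade_parents(matrix, traits):
    """Map each trait to the smallest trait whose 1-set strictly contains it.

    Traits whose 1-set has no strict superset attach directly to 'root'.
    Assumes mutually compatible characters with distinct 1-sets.
    """
    clades = {
        trait: {sp for sp, row in matrix.items() if row[idx] == 1}
        for idx, trait in enumerate(traits)
    }
    parents = {}
    for trait, clade in clades.items():
        supersets = [u for u, other in clades.items() if clade < other]
        if supersets:
            parents[trait] = min(supersets, key=lambda u: len(clades[u]))
        else:
            parents[trait] = "root"
    return parents


# Species-by-trait matrix from this slide.
matrix = {
    "A": [1, 1, 0, 0, 0],
    "B": [0, 0, 1, 0, 0],
    "C": [1, 1, 0, 1, 0],
    "D": [0, 0, 1, 0, 1],
    "E": [1, 0, 0, 0, 0],
}
parents = clade_parents(matrix, traits=[1, 2, 3, 4, 5])
print(parents)
```

Each entry reads "the mutation for this trait happens below the mutation for its parent trait", so traits 1 and 3 sit on edges leaving the root, trait 2 sits below trait 1, and so on.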

Trees and Splits

  • Given a set X, a split is a partition of X into two non-empty subsets A and B. X = A | B.
  • For a phylogenetic tree T with leaves L, each edge e defines a split Le = A | B, where A and B are the sets of leaves in the two subtrees obtained by removing e.

In a perfect phylogeny, the edge where binary character i changes state gives the split i0 | i1. We will return to splits in a future lecture.


Splits Equivalence Theorem

A phylogenetic tree T defines a collection of splits Σ(T) = { Le | e is an edge in T }.

Splits A1 | B1 and A2 | B2 are pairwise compatible if at least one of A1∩A2, A1∩B2, B1∩A2, and B1∩B2 is the empty set.

Splits Equivalence Theorem: Let Σ be a collection of splits. There is a phylogenetic tree T such that Σ(T) = Σ if and only if the splits in Σ are pairwise compatible.

The Pairwise Compatibility Theorem (for binary characters) follows from this theorem.
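The compatibility condition for two splits translates directly into a check on four intersections. The following is a small sketch, assuming a split is stored as a pair of Python sets; the splits `s1`, `s2`, `s3` are made-up examples on the leaf set {A, B, C, D, E}.

```python
def splits_compatible(split1, split2):
    """True if at least one of the four pairwise intersections is empty."""
    (a1, b1), (a2, b2) = split1, split2
    return any(not (p & q) for p in (a1, b1) for q in (a2, b2))


# Hypothetical splits of the leaf set {A, B, C, D, E}.
s1 = ({"A", "B"}, {"C", "D", "E"})
s2 = ({"A", "B", "C"}, {"D", "E"})
s3 = ({"A", "C"}, {"B", "D", "E"})

print(splits_compatible(s1, s2))  # {A, B} and {D, E} do not intersect
print(splits_compatible(s1, s3))  # all four intersections are non-empty
print(splits_compatible(s2, s3))  # {D, E} and {A, C} do not intersect
```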

Outline

Distance-based methods for phylogenetic tree reconstruction.

  • Review of distances/metrics.
  • Tree distances and additive distances
    – Small and large phylogeny problems.
  • Non-additive distances and clustering
    – UPGMA and ultrametric distances.


Distances

A distance on a set X is a function d: X × X → R satisfying:

  • For all x, y ∈ X, d(x, y) ≥ 0, with equality iff x = y.
  • For all x, y ∈ X, d(x, y) = d(y, x) [symmetry]
  • For all x, y, z ∈ X, d(x, z) ≤ d(x, y) + d(y, z) [triangle inequality]

Examples:

  • X = real numbers. d(x, y) = |x – y| is a distance.
  • X = strings of equal length over some alphabet. dH(s, t) = number of positions where s and t differ is called the Hamming distance.
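The three axioms can be checked by brute force on any finite sample of points. The sketch below is only an illustration: `is_distance` is a hypothetical helper, and passing the check on a sample does not prove that a function is a distance on all of X. The squared difference is included as an assumed counterexample that fails the triangle inequality.

```python
from itertools import product


def is_distance(points, d):
    """Check the three distance axioms for d on a finite list of points."""
    for x, y in product(points, repeat=2):
        if d(x, y) < 0 or (d(x, y) == 0) != (x == y):
            return False  # violates d(x, y) >= 0 with equality iff x = y
        if d(x, y) != d(y, x):
            return False  # violates symmetry
    return all(
        d(x, z) <= d(x, y) + d(y, z)
        for x, y, z in product(points, repeat=3)
    )


abs_ok = is_distance([0, 2, 5], lambda x, y: abs(x - y))
# Squared difference: d(0, 3) = 9 but d(0, 1) + d(1, 3) = 1 + 4 = 5.
sq_ok = is_distance([0, 1, 3], lambda x, y: (x - y) ** 2)
print(abs_ok, sq_ok)
```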

Distances in Biological Data

  • String distances (e.g. Hamming distance, edit distance) on DNA/protein sequence data
  • Substitution model (Jukes-Cantor, Kimura, etc.): scores for particular changes A → T, C → G, etc.

Rat:        ACAGTCACGCCCCACACGT
Mouse:      ACAGTGACGCCACACACGT
Gorilla:    CCTGTGACGTAACAAACGA
Chimpanzee: CCTGTGAGGTAGCAAACGA
Human:      CCTGTGAGGTAGCACACGA


Distance Matrix

  • For n species, form n x n distance matrix Dij
  • Example: Dij = edit distance between a gene in species i and species j.

             Mouse   Gorilla   Chimpanzee   Human
Mouse          0        7         11          10
Gorilla        7        0          4           6
Chimpanzee    11        4          0           2
Human         10        6          2           0

Mouse:      ACAGTGACGCCACACACGT
Gorilla:    CCTGCGACGTAACAAACGC
Chimpanzee: CCTGCCAGTTAGCAAACGC
Human:      CCTGCCAGTTAGCACACGA
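As a quick check of the numbers on this slide, the sketch below builds a distance matrix from the four sequences. It uses Hamming distance as a stand-in for edit distance (an assumption made for simplicity; the two are different measures in general), which for these equal-length sequences happens to reproduce the entries shown.

```python
def hamming(s, t):
    """Number of positions at which two equal-length strings differ."""
    if len(s) != len(t):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(a != b for a, b in zip(s, t))


# Sequences copied from this slide.
sequences = {
    "Mouse": "ACAGTGACGCCACACACGT",
    "Gorilla": "CCTGCGACGTAACAAACGC",
    "Chimpanzee": "CCTGCCAGTTAGCAAACGC",
    "Human": "CCTGCCAGTTAGCACACGA",
}
names = list(sequences)

# D[i][j] is the distance between species names[i] and names[j].
D = [[hamming(sequences[a], sequences[b]) for b in names] for a in names]
for name, row in zip(names, D):
    print(f"{name:<11}", row)
```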

Alignment vs. Distance Matrix

Sequence a gene of length m in n species → n x m alignment matrix.

Mouse:      ACAGTGACGCCACACACGT
Gorilla:    CCTGCGACGTAACAAACGC
Chimpanzee: CCTGCCAGTTAGCAAACGC
Human:      CCTGCCAGTTAGCACACGA

Transform into…

n x n distance matrix

             Mouse   Gorilla   Chimpanzee   Human
Mouse          0        7         11          10
Gorilla        7        0          4           6
Chimpanzee    11        4          0           2
Human         10        6          2           0

Reverse transformation not possible due to loss of information.


Distances in Trees

Given a tree T with a positive weight w(e) on each edge, we define the tree distance dT on the set L of leaves by:

dT(i, j) = sum of weights of edges on the unique path from i to j.

In evolutionary biology, weights are sometimes called branch lengths.

Distance in Trees: an Example

dT(1,4) = 12 + 13 + 14 + 17 + 13 = 69
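The tree distance can be computed by walking the tree from one leaf until the other is reached, adding up edge weights along the way. The sketch below assumes the tree is stored as a nested dictionary of neighbor weights; the small 4-leaf tree and its weights are hypothetical and are not the tree from the slide's figure.

```python
def add_edge(adj, u, v, w):
    """Add an undirected edge of weight w to the adjacency dictionary."""
    adj.setdefault(u, {})[v] = w
    adj.setdefault(v, {})[u] = w


def tree_distance(adj, source, target):
    """Sum of edge weights on the unique path from source to target."""
    stack = [(source, None, 0)]
    while stack:
        node, parent, dist = stack.pop()
        if node == target:
            return dist
        for nbr, w in adj[node].items():
            if nbr != parent:  # never walk back along the edge we came from
                stack.append((nbr, node, dist + w))
    raise ValueError("target not reachable from source")


# Hypothetical tree: leaves 1, 2 hang off vertex a; leaves 3, 4 off vertex b.
adj = {}
for u, v, w in [("1", "a", 2), ("2", "a", 3), ("a", "b", 4),
                ("b", "3", 1), ("b", "4", 5)]:
    add_edge(adj, u, v, w)

print(tree_distance(adj, "1", "4"))  # path 1-a-b-4: 2 + 4 + 5
```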


Distance vs. Tree Distance

  • n x n distance matrix for n species
  • Note that dT(i, j), the tree distance between i and j, is not necessarily equal to Dij as given by the distance matrix.

Rat:        ACAGTGACGCCCCAAACGT
Mouse:      ACAGTGACGCTACAAACGT
Gorilla:    CCTGTGACGTAACAAACGA
Chimpanzee: CCTGTGACGTAGCAAACGA
Human:      CCTGTGACGTAGCAAACGA

Fitting a Distance Matrix

  • Given n species, we can compute the n x n distance matrix Dij
  • Evolution of these species is described by a tree that we don’t know.
  • We need an algorithm to construct a tree that best fits the distance matrix Dij

Find a tree T such that:

Dij = dT(i,j)

where Dij is the distance between species (known) and dT(i,j) is the length of the path in an (unknown) tree T.


Distance Based Phylogeny Problem

Goal: Reconstruct an evolutionary tree from a distance matrix
Input: n x n distance matrix Dij
Output: weighted tree T with n leaves fitting D

Unknown topology of tree makes evolutionary tree reconstruction hard!

# unrooted binary trees with n leaves: T(n) = (2n-4)! / ((n-2)! 2^(n-2))
n = 24: T(n) ≈ 5.64 x 10^26

If D is additive, this problem has a solution and there is a simple algorithm to solve it.
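To get a feel for how fast the number of topologies grows, the sketch below evaluates the standard count of unrooted binary trees on n labeled leaves, (2n-5)!! = (2n-4)! / ((n-2)! 2^(n-2)). The function name is illustrative.

```python
from math import factorial


def unrooted_binary_trees(n):
    """Number of unrooted binary tree topologies on n >= 3 labeled leaves."""
    if n < 3:
        raise ValueError("need at least 3 leaves")
    return factorial(2 * n - 4) // (factorial(n - 2) * 2 ** (n - 2))


for n in (3, 4, 5, 10, 24):
    print(n, f"{unrooted_binary_trees(n):.3e}")
```

Already at n = 24 the count is on the order of 10^26, which is why exhaustive search over topologies is not an option.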

Distance-based vs. character-based

Key difference: Distance-based methods do not reconstruct ancestral states.

     A   B   C   D
A    0   1   2   2
B    1   0   1   1
C    2   1   0   -
D    2   1   -   0

(The C-D entry is blank on the slide.)

Note that C and D are identical.


Reconstructing a 3 Leaved Tree

  • Tree reconstruction for a 3x3 matrix is straightforward
  • We have 3 leaves i, j, k and a center vertex c

Observe:

dic + djc = Dij
dic + dkc = Dik
djc + dkc = Djk

Reconstructing a 3 Leaved Tree (cont’d)

Adding the first two equations:

  dic + djc = Dij
+ dic + dkc = Dik
------------------------------
2dic + djc + dkc = Dij + Dik

Substituting djc + dkc = Djk:

2dic + Djk = Dij + Dik

dic = (Dij + Dik – Djk)/2

Similarly,

djc = (Dij + Djk – Dik)/2
dkc = (Dki + Dkj – Dij)/2
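The three formulas above can be wrapped in a small function. This is a minimal sketch; the function name and the input values (11, 10, 9) are illustrative.

```python
def three_leaf_branch_lengths(d_ij, d_ik, d_jk):
    """Branch lengths from leaves i, j, k to the center vertex c."""
    d_ic = (d_ij + d_ik - d_jk) / 2
    d_jc = (d_ij + d_jk - d_ik) / 2
    d_kc = (d_ik + d_jk - d_ij) / 2
    return d_ic, d_jc, d_kc


# Illustrative distances: Dij = 11, Dik = 10, Djk = 9.
d_ic, d_jc, d_kc = three_leaf_branch_lengths(11, 10, 9)
print(d_ic, d_jc, d_kc)

# The branch lengths add back up to the pairwise distances.
print(d_ic + d_jc, d_ic + d_kc, d_jc + d_kc)
```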


Trees with > 3 Leaves

  • A binary tree with n leaves has 2n-3 edges
  • Fitting a given tree to a distance matrix D requires solving a system with n(n-1)/2 equations and 2n-3 variables
  • Solution not always possible for n > 3.

Additive Distance Matrices

Matrix D is ADDITIVE if there exists a tree T with dij(T) = Dij, and NON-ADDITIVE otherwise.

Additive Distance Phylogeny

Small Additive Distance Phylogeny: Given phylogenetic tree T and distance matrix D, determine branch lengths such that dT(i,j) = Dij.

Large Additive Distance Phylogeny: Given distance matrix D, find T and branch lengths such that dT(i,j) = Dij.

Both of these problems can be solved efficiently.

Reconstructing Additive Distances Given T

If we know T and D, but do not know the length of each edge, we can reconstruct those lengths.

[Figure: tree T with leaves x, y, z, w, v and edge lengths 5, 4, 7, 3, 3, 4, 6]

D:
     v    w    x    y    z
v    0   10   17   16   16
w         0   15   14   14
x              0    9   15
y                   0   14
z                        0


Reconstructing Additive Distances Given T

Find neighbors v, w (common parent a) in D and replace them by a:

dax = ½ (dvx + dwx – dvw) = ½ (17 + 15 – 10) = 11
day = ½ (dvy + dwy – dvw) = ½ (16 + 14 – 10) = 10
daz = ½ (dvz + dwz – dvw) = ½ (16 + 14 – 10) = 10

D1:
     a    x    y    z
a    0   11   10   10
x         0    9   15
y              0   14
z                   0


Reconstructing Additive Distances Given T

D1:
     a    x    y    z
a    0   11   10   10
x         0    9   15
y              0   14
z                   0

Neighbors x, y (common parent b)

D2:
     a    b    z
a    0    6   10
b         0   10
z              0

D3:
     a    c
a    0    3
c         0

d(a, c) = 3
d(b, c) = d(a, b) – d(a, c) = 3
d(c, z) = d(a, z) – d(a, c) = 7
d(b, x) = d(a, x) – d(a, b) = 5
d(b, y) = d(a, y) – d(a, b) = 4
d(a, w) = d(z, w) – d(a, z) = 4
d(a, v) = d(z, v) – d(a, z) = 6

Correct!!! These match the edge lengths 5, 4, 7, 3, 3, 4, 6 of T.

Trees and Neighbors

Previous algorithm relied only on finding neighboring leaves:

  1. Find neighboring leaves i and j with parent k
  2. Remove the rows and columns of i and j
  3. Add a new row and column corresponding to k, where the distance from k to any other leaf m can be computed as:

     Dkm = (Dim + Djm – Dij)/2

Compress i and j into k, iterate algorithm for rest of tree
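Steps 2 and 3 amount to a single matrix update once the neighbors are known. The sketch below assumes the distance matrix is stored as a symmetric nested dictionary; `symmetric` and `merge_neighbors` are hypothetical helper names. It replays the v, w → a and x, y → b merges from the worked example and assumes the neighbor pairs are given, since finding them is a separate problem.

```python
def symmetric(labels, upper):
    """Build a symmetric nested-dict matrix from upper-triangle entries."""
    D = {a: {b: 0 for b in labels} for a in labels}
    for (a, b), value in upper.items():
        D[a][b] = D[b][a] = value
    return D


def merge_neighbors(D, i, j, k):
    """Return a new matrix in which neighboring leaves i, j are replaced by k."""
    others = [m for m in D if m not in (i, j)]
    new = {m: {n: D[m][n] for n in others} for m in others}
    new[k] = {k: 0}
    for m in others:
        dist = (D[i][m] + D[j][m] - D[i][j]) / 2  # Dkm from step 3
        new[k][m] = dist
        new[m][k] = dist
    return new


# Matrix D from the worked example (leaves v, w, x, y, z).
D = symmetric("vwxyz", {
    ("v", "w"): 10, ("v", "x"): 17, ("v", "y"): 16, ("v", "z"): 16,
    ("w", "x"): 15, ("w", "y"): 14, ("w", "z"): 14,
    ("x", "y"): 9, ("x", "z"): 15, ("y", "z"): 14,
})
D1 = merge_neighbors(D, "v", "w", "a")
D2 = merge_neighbors(D1, "x", "y", "b")
print(D1["a"])
print(D2["a"], D2["b"])
```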


Finding Neighboring Leaves

To find neighboring leaves we simply select a pair of closest leaves.

WRONG!

     i    j    k    l
i    0   13   21   22
j         0   12   13
k              0   13
l                   0

i and j are neighbors, but (dij = 13) > (djk = 12).

Finding a pair of neighboring leaves is a nontrivial problem!

Degenerate Triples

  • A degenerate triple is a set of three distinct elements 1 ≤ i, j, k ≤ n where Dij + Djk = Dik
  • Element j in a degenerate triple i,j,k lies on the evolutionary path from i to k (or is attached to this path by an edge of length 0).


Looking for Degenerate Triples

  • If distance matrix D has a degenerate triple i,j,k then j can be “removed” from D, thus reducing the size of the problem.
  • If distance matrix D does not have a degenerate triple i,j,k, one can “create” a degenerate triple in D by shortening all hanging edges (in the tree).

Shortening Hanging Edges to Produce Degenerate Triples

  • Shorten all “hanging” edges (edges that connect leaves) until a degenerate triple is found


Finding Degenerate Triples

  • If there is no degenerate triple, all hanging edges are reduced by the same amount δ, so that all pair-wise distances in the matrix are reduced by 2δ.
  • Eventually this process collapses one of the leaves (when δ = length of shortest hanging edge), forming a degenerate triple i,j,k and reducing the size of the distance matrix D.
  • The attachment point for j can be recovered in the reverse transformations by saving Dij for each collapsed leaf.

Reconstructing Trees for Additive Distance Matrices

Trim(D, δ)
    for all 1 ≤ i ≠ j ≤ n
        Dij = Dij - 2δ
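The trimming step and the search for a degenerate triple can be sketched together. One common way to obtain δ for an additive matrix, used here as an assumption rather than as the lecture's stated method, is to take the minimum of (Dij + Djk – Dik)/2 over all triples, which equals the length of the shortest hanging edge. The 4 x 4 matrix is the i, j, k, l example from the neighboring-leaves slide; all function names are illustrative.

```python
from itertools import permutations


def trimming_parameter(D):
    """Assumed choice of delta: min over triples of (Dij + Djk - Dik) / 2."""
    n = len(D)
    return min((D[i][j] + D[j][k] - D[i][k]) / 2
               for i, j, k in permutations(range(n), 3))


def trim(D, delta):
    """Return a copy of D with every off-diagonal entry reduced by 2 * delta."""
    n = len(D)
    return [[D[i][j] - 2 * delta if i != j else 0 for j in range(n)]
            for i in range(n)]


def find_degenerate_triple(D):
    """Return (i, j, k) with Dij + Djk == Dik, or None if there is none."""
    n = len(D)
    for i, j, k in permutations(range(n), 3):
        if D[i][j] + D[j][k] == D[i][k]:
            return i, j, k
    return None


# Leaves i, j, k, l are indexed 0, 1, 2, 3.
D = [[0, 13, 21, 22],
     [13, 0, 12, 13],
     [21, 12, 0, 13],
     [22, 13, 13, 0]]

before = find_degenerate_triple(D)   # no degenerate triple yet
delta = trimming_parameter(D)
trimmed = trim(D, delta)
after = find_degenerate_triple(trimmed)
print(before, delta, after)
```

The exact equality test is fine for integer or half-integer distances; with measured data a tolerance would be needed.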


AdditivePhylogeny Algorithm

AdditivePhylogeny(D)
    if D is a 2 x 2 matrix
        T = tree of a single edge of length D1,2
        return T
    if D is non-degenerate
        Compute trimming parameter δ
        Trim(D, δ)
    Find a triple i, j, k in D such that Dij + Djk = Dik
    x = Dij
    Remove jth row and jth column from D
    T = AdditivePhylogeny(D)
    Traceback (see below)

AdditivePhylogeny (cont’d)

Traceback:
    Add a new vertex v to T on the path from i to k, at distance x from i
    Add j back to T by creating an edge (v,j) of length 0
    for every leaf l in T
        if distance from l to v in the tree ≠ Dl,j
            output “matrix is not additive”
            return
    Extend all “hanging” edges by length δ
    return T

Question: How to compute δ?


Additive Distance

  • How to tell if D is additive?
  • AdditivePhylogeny provides a way to check if distance matrix D is additive
  • An even more efficient additivity check is the “four-point condition”

The Four Point Condition

(Zaretskii 1965, Buneman 1971)

Let 1 ≤ i,j,k,l ≤ n be four distinct leaves in a tree, with i, j on one side of the middle edge and k, l on the other.

Compute:
1. Dij + Dkl
2. Dik + Djl
3. Dil + Djk

2 and 3 represent the same number: (length of all edges) + (length of middle edge), since the middle edge is counted twice.

1 represents a smaller number: (length of all edges) – (length of middle edge).


The Four Point Condition

Four point condition: Every four leaves (quartet) can be labeled as i,j,k,l such that:

Dij + Dkl ≤ Dik + Djl = Dil + Djk

Theorem: An n x n matrix D is additive if and only if the four point condition holds for every quartet 1 ≤ i,j,k,l ≤ n.

Proof (⇒): Since D is additive, D = dT. Find a split such that i, j ∈ S1 and k, l ∈ S2. Define λm to be the weights in the tree below. Then

Dik + Djl = (λ1 + λ3 + λ4) + (λ2 + λ3 + λ5) = Dil + Djk ≥ (λ1 + λ2) + (λ4 + λ5) = Dij + Dkl.

[Figure: quartet tree with i, j on one side and k, l on the other; λ1, λ2, λ4, λ5 are the weights of the paths to i, j, k, l, and λ3 is the weight of the middle path]
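The four point condition can be checked quartet by quartet: of the three pairwise sums, the two largest must be equal. The sketch below assumes D is already a symmetric distance matrix given as a list of lists; the function names are illustrative. It is applied to the 4 x 4 matrix on leaves i, j, k, l from the neighboring-leaves slide and to the Mouse/Gorilla/Chimpanzee/Human matrix from the distance-matrix slide.

```python
from itertools import combinations


def four_point_holds(D, i, j, k, l):
    """True if the two largest of the three pairwise sums are equal."""
    sums = sorted([D[i][j] + D[k][l], D[i][k] + D[j][l], D[i][l] + D[j][k]])
    return sums[1] == sums[2]


def is_additive(D):
    """Check the four point condition on every quartet of indices."""
    return all(four_point_holds(D, *quartet)
               for quartet in combinations(range(len(D)), 4))


# Leaves i, j, k, l: the sums are 26, 34, 34.
D_ijkl = [[0, 13, 21, 22],
          [13, 0, 12, 13],
          [21, 12, 0, 13],
          [22, 13, 13, 0]]

# Mouse, Gorilla, Chimpanzee, Human: the sums are 9, 17, 14.
D_species = [[0, 7, 11, 10],
             [7, 0, 4, 6],
             [11, 4, 0, 2],
             [10, 6, 2, 0]]

print(is_additive(D_ijkl), is_additive(D_species))
```

Exact equality is used because the example entries are integers; noisy data would call for a tolerance.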


Non-additive Distances

  • What if there is no tree T such that Dij = dT(i,j)?
  • Approaches:
    1. Find a tree that minimizes “error”
    2. Heuristic approaches: clustering.

Least Squares Distance Phylogeny Problem

  • If the distance matrix D is NOT additive, then we look for a tree T that approximates D the best:

    Squared Error: ∑i,j (dij(T) – Dij)^2

  • Squared Error is a measure of the quality of the fit between distance matrix and the tree: we want to minimize it.
  • Least Squares Distance Phylogeny Problem:
    – Find approximation tree T with minimum squared error for a non-additive matrix D.
    – (NP-hard)
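Evaluating the squared error for one candidate tree is straightforward once its tree distances are known; the hard part is the search over trees. The sketch below sums over unordered pairs i < j, which is one reading of the sum above (summing over ordered pairs would double the value). The matrix `tree_dist` is a hypothetical additive approximation, not a tree given in the lecture.

```python
def squared_error(tree_dist, D):
    """Sum of (dij(T) - Dij)^2 over unordered pairs i < j."""
    n = len(D)
    return sum((tree_dist[i][j] - D[i][j]) ** 2
               for i in range(n) for j in range(i + 1, n))


# Observed (non-additive) matrix: Mouse, Gorilla, Chimpanzee, Human.
D = [[0, 7, 11, 10],
     [7, 0, 4, 6],
     [11, 4, 0, 2],
     [10, 6, 2, 0]]

# Hypothetical tree distances for some candidate tree T.
tree_dist = [[0, 7, 10, 10],
             [7, 0, 5, 5],
             [10, 5, 0, 2],
             [10, 5, 2, 0]]

print(squared_error(tree_dist, D))
```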


Tree construction as clustering

[Figure: clustering of leaves 1-5 and the corresponding tree]

Pair Group Methods

Iteratively combine closest leaves/groups into larger groups.

C ← { {1}, …, {n} }
While |C| > 2 do
    [Find closest clusters.] Find Ci, Cj in C minimizing d(Ci, Cj).
    Ck ← Ci ∪ Cj
    [Replace Ci and Cj by Ck.] C ← (C \ {Ci, Cj}) ∪ {Ck}.


Pair Group Methods

What is d? How to define branch lengths?

UPGMA

Unweighted Pair Group Method with Averages

  • Distance between clusters defined as average pairwise distance
  • Assigns height to every vertex in the tree, effectively dating every vertex


UPGMA

Distance between clusters defined as average pairwise distance

Given two disjoint clusters Ci, Cj of sequences,

dij = (1 / (|Ci| × |Cj|)) Σ{p ∈ Ci, q ∈ Cj} dpq

UPGMA

Assigns height to every vertex in the tree, effectively dating every vertex

Add a vertex connecting Ci, Cj and place it at height dij / 2


UPGMA Algorithm

Initialization:
    Assign each xi to its own cluster Ci
    Define one leaf per sequence, each at height 0

Iteration:
    Find two clusters Ci and Cj such that dij is min
    Let Ck = Ci ∪ Cj
    Add a vertex connecting Ci, Cj and place it at height dij / 2
    Delete Ci and Cj

Termination:
    When a single cluster remains
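The steps above can be sketched in a few lines. This is a simple, unoptimized version that recomputes average distances from the original matrix at every step; the representation of the tree as nested tuples and the name `upgma` are choices made for this illustration. The input is the Mouse/Gorilla/Chimpanzee/Human matrix from the distance-matrix slide.

```python
from itertools import combinations


def upgma(D, labels):
    """Return (root, heights): a nested-tuple tree and each vertex's height."""
    members = {label: [idx] for idx, label in enumerate(labels)}
    heights = {label: 0.0 for label in labels}  # leaves sit at height 0

    def avg(a, b):
        # Average pairwise distance between the leaves of clusters a and b.
        total = sum(D[p][q] for p in members[a] for q in members[b])
        return total / (len(members[a]) * len(members[b]))

    while len(members) > 1:
        a, b = min(combinations(members, 2), key=lambda pair: avg(*pair))
        merged = (a, b)
        heights[merged] = avg(a, b) / 2      # new vertex at height dij / 2
        members[merged] = members.pop(a) + members.pop(b)
    (root,) = members
    return root, heights


labels = ["Mouse", "Gorilla", "Chimpanzee", "Human"]
D = [[0, 7, 11, 10],
     [7, 0, 4, 6],
     [11, 4, 0, 2],
     [10, 6, 2, 0]]

root, heights = upgma(D, labels)
print(root)
print(heights[root])
```

Ties between equally close cluster pairs are broken by iteration order here, which is an arbitrary choice.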

Trees from UPGMA

UPGMA produces an ultrametric tree; distance from the root to any leaf is the same

The Molecular Clock: The evolutionary distance between species x and y is twice the Earth time to reach the nearest common ancestor

That is, the molecular clock has constant rate in all species

The molecular clock results in ultrametric distances

[Figure: ultrametric tree over leaves 1, 4, 2, 3, 5 with a vertical axis in years]


UPGMA’s Weakness: Example

[Figure: correct tree (leaf order 2, 3, 4, 1) next to the tree produced by UPGMA (leaf order 1, 4, 3, 2)]

Ultrametrics

Dij is an ultrametric provided for all species i, j, k (distinct leaves of tree) two of the distances Dij, Djk and Dik are equal and ≥ the third.

  • Ex. d(i,k) = d(j, k) ≥ d(i, j)

Proposition: If d is ultrametric, then d is additive.

[Figure: tree on leaves i, j, k; i and j each hang at distance λ1 from their common vertex, and k is at distance λ2 from that vertex]

λ1 + λ2 ≥ 2 λ1
Thus λ2 ≥ λ1


Ultrametrics

Both additive distance phylogeny and perfect phylogeny can be reduced to the ultrametric phylogeny problem.

Let v = row of D containing largest entry mv.

Define D’ij = mv + (Dij – Dvi – Dvj) / 2 = mv – λ3

Theorem: D is additive if and only if D’ is ultrametric. (See Gusfield, Ch. 17)

[Figure: tree on leaves i, j, v with a center vertex; λ3 is the distance from the center to v, and λ1, λ2 are the distances to the other two leaves]
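The transformation and the three-point ultrametric test can be tried on the additive 5 x 5 matrix from the worked example (leaves v, w, x, y, z). The sketch below assumes D is a symmetric list-of-lists matrix with zero diagonal; the function names are illustrative, and D’ii is set to 0 by convention.

```python
from itertools import combinations


def is_ultrametric(D):
    """For every triple, the two largest of the three distances must be equal."""
    for i, j, k in combinations(range(len(D)), 3):
        a, b, c = sorted([D[i][j], D[j][k], D[i][k]])
        if b != c:
            return False
    return True


def transform(D):
    """D'ij = mv + (Dij - Dvi - Dvj) / 2, with v a row holding the largest entry."""
    n = len(D)
    v = max(range(n), key=lambda r: max(D[r]))
    m_v = max(D[v])
    return [[0 if i == j else m_v + (D[i][j] - D[v][i] - D[v][j]) / 2
             for j in range(n)] for i in range(n)]


# Additive matrix from the worked example, rows/columns in order v, w, x, y, z.
D = [[0, 10, 17, 16, 16],
     [10, 0, 15, 14, 14],
     [17, 15, 0, 9, 15],
     [16, 14, 9, 0, 14],
     [16, 14, 14, 15, 0]]
D[4] = [16, 14, 15, 14, 0]  # row z: distances to v, w, x, y, z

D_prime = transform(D)
print(is_ultrametric(D), is_ultrametric(D_prime))
for row in D_prime:
    print(row)
```

D itself is not ultrametric (for v, w, x the distances are 10, 17, 15), while the transformed matrix is, in line with the theorem.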