Multiple Sequence Alignments (PowerPoint presentation)

SLIDE 1

Multiple Sequence Alignments

SLIDE 2

Multiple alignment

Pairwise alignment

– Infer biological relationships from string similarity

Multiple alignment

– Infer string similarity from biological relationships

SLIDE 3

Biological Motivations

One of the most essential tools in molecular biology

– Finding highly conserved sub-regions or embedded patterns of a set of biological sequences
– Production of a consensus sequence
– Estimation of evolutionary distance between sequences
– Prediction of protein secondary/tertiary structure
– Finding conserved regions

Local multiple alignment reveals conserved regions
Conserved regions are usually key functional regions, and prime targets for drug development

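Practically useful methods have existed only since D. Sankoff (1987), based on phylogenetics

– Before 1987, multiple alignments were constructed by hand
– The basic problem: the dynamic programming approach of pairwise alignment cannot be applied directly, because its cost grows exponentially with the number of sequences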

SLIDE 4

Alignment between globins (human beta globin, horse beta globin, human alpha globin, horse alpha globin, cyanohaemoglobin, whale myoglobin, leghaemoglobin) produced by Clustal. Boxes mark the seven alpha helices composing each globin.


SLIDE 5

SLIDE 6

Definition

Given strings x1, x2, …, xk, a multiple (global) alignment is a matrix with k rows of equal length, where each row represents one sequence with gaps inserted, and each column contains, for every sequence, either one of its symbols or a gap symbol (at least one non-gap per column)
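
The definition can be turned into a small check. The following is a minimal sketch, not part of the original slides; the function name `is_valid_alignment` and the use of `-` as the gap symbol are assumptions made for illustration.

```python
def is_valid_alignment(rows, sequences, gap="-"):
    """Return True if `rows` is a multiple alignment of `sequences`.

    Hypothetical helper: checks the three conditions of the definition.
    """
    if not rows or len(rows) != len(sequences):
        return False
    width = len(rows[0])
    # All rows of the matrix must have the same number of columns.
    if any(len(row) != width for row in rows):
        return False
    # Deleting the gaps from row i must give back sequence i.
    if any(row.replace(gap, "") != seq for row, seq in zip(rows, sequences)):
        return False
    # Every column must contain at least one non-gap symbol.
    return all(any(row[c] != gap for row in rows) for c in range(width))


rows = ["AC-GCGG-C", "AC-GC-GAG", "GCCGC-GAG"]
seqs = ["ACGCGGC", "ACGCGAG", "GCCGCGAG"]
print(is_valid_alignment(rows, seqs))
```

A matrix with an all-gap column, such as `["A-", "A-"]` for the sequences `["A", "A"]`, is rejected by the last condition.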

SLIDE 7

Multiple Sequence Alignment

Matrix: 3 rows, 8 columns

SLIDE 8

Family representations

Outcome of multiple alignment
Three kinds

– Profile representation

Frequencies of symbols in each column
Weight vector
Alignment to a profile

– Consensus sequence representation

Steiner string

– Signature representation

PROSITE, BLOCKS databases
Regular expression
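
As an illustration of the profile representation, the sketch below computes the frequency of each symbol in each column of an alignment. It is a minimal example under stated assumptions: the name `column_profile` is hypothetical, gaps are counted as an ordinary symbol, and no sequence weighting is applied.

```python
from collections import Counter


def column_profile(rows):
    """Return one dict per column mapping symbol -> relative frequency.

    Assumption: unweighted counts, with the gap treated like any symbol.
    """
    k = len(rows)
    profile = []
    for column in zip(*rows):          # iterate over alignment columns
        counts = Counter(column)
        profile.append({sym: n / k for sym, n in counts.items()})
    return profile


for col in column_profile(["AC", "AG", "AC", "TC"]):
    print(col)
```

A weight vector, as mentioned on the slide, would replace the uniform `1 / k` contribution of each row with a per-sequence weight.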

SLIDE 9

Scoring Function

Ideally:

– Find alignment that maximizes probability that sequences evolved from common ancestor

[Figure: sequences x, y, z, w, v and an unknown common ancestor, marked "?"]

SLIDE 10

Multiple Sequence Alignment

Multiple sequence alignment can detect similarities that cannot be detected with pairwise sequence alignment methods.

Detection of family characteristics.

Three questions:

1. Scoring
2. Computation of the multiple sequence alignment
3. Family representation

SLIDE 11

A fragment of multiple alignment of 7 kinases. ClustalW program from SRS server.

SLIDE 12

Scoring: SP (sum of pairs)

SP: the sum of pairwise scores of all pairs of symbols in the column.

SP3(-,A,A) = (-,A) + (-,A) + (A,A), with (-,-) = 0

SP total score = sum over all columns
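
A minimal sketch of the column-wise SP computation follows. The scoring values (match 1, mismatch -1, gap -2) and the function names are assumptions chosen for illustration; the slides fix only that a gap paired with a gap scores 0.

```python
from itertools import combinations


def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score one pair of symbols (illustrative values, not from the slides)."""
    if a == "-" and b == "-":
        return 0                      # (-,-) = 0, as on the slide
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def sp_column(column):
    """Sum of pairwise scores over all pairs of symbols in one column."""
    return sum(pair_score(a, b) for a, b in combinations(column, 2))


def sp_score(rows):
    """SP total score: sum of the column scores over all columns."""
    return sum(sp_column(col) for col in zip(*rows))


print(sp_column("-AA"))                  # (-,A) + (-,A) + (A,A) = -2 - 2 + 1
print(sp_score(["AC-", "A-G", "ACG"]))
```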

SLIDE 13

Induced pairwise alignment

Induced pairwise alignment: the projection of a multiple alignment onto two of its sequences.

a(S1, S2), a(S2, S3), a(S1, S3), with (-,-) = 0

SP total score = Σi<j score[ a(Si, Sj) ]

SLIDE 14

Consensus

AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---

TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
CAG-CTATCAC--GACCGC----TCGATTTGCTCGAC
CAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

Find the optimal consensus string m* that maximizes

S(m) = Σi s(m, mi)

where s(mk, ml) is the score of the pairwise alignment of (k, l)
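
When a multiple alignment is already available, a consensus is often read off column by column. The sketch below uses a simple majority vote; this is an assumed shortcut for illustration, and it is not guaranteed to produce the optimal consensus string m* defined above. The name `majority_consensus` is hypothetical.

```python
from collections import Counter


def majority_consensus(rows, gap="-"):
    """Pick the most frequent symbol of each column; skip gap-majority columns.

    Heuristic sketch only: ties are broken by first occurrence in the column.
    """
    consensus = []
    for column in zip(*rows):
        symbol, _count = Counter(column).most_common(1)[0]
        if symbol != gap:
            consensus.append(symbol)
    return "".join(consensus)


print(majority_consensus(["CAG-T", "TAG-T", "CA--T"]))
```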

SLIDE 15

Optimal solution

Multidimensional Dynamic Programming
Generalization of pair-wise alignment
For simplicity, assume k sequences of length n
The dynamic programming array is a k-dimensional hyperlattice of side length n+1 (including initial gaps)
The entry F(i1, …, ik) represents the score of the optimal alignment of s1[1..i1], …, sk[1..ik]

Initialize values on the faces of the hyperlattice

SLIDE 16

Scoring Function: Sum Of Pairs

Definition: an induced pairwise alignment is the pairwise alignment that a multiple alignment induces on two of its sequences (columns in which both rows hold a gap are dropped)

Example:

x: AC-GCGG-C
y: AC-GC-GAG
z: GCCGC-GAG

Induces:

x: ACGCGG-C
y: ACGC-GAG

x: AC-GCGG-C
z: GCCGC-GAG

y: AC-GCGAG
z: GCCGCGAG
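
The projection in this example can be reproduced with a few lines of code. This is a minimal sketch; the function name `induced_pairwise` is an assumption, and `-` is taken as the gap symbol.

```python
def induced_pairwise(rows, i, j, gap="-"):
    """Project a multiple alignment onto rows i and j.

    Columns in which both selected rows contain a gap are dropped.
    """
    kept = [(a, b) for a, b in zip(rows[i], rows[j])
            if not (a == gap and b == gap)]
    return "".join(a for a, _ in kept), "".join(b for _, b in kept)


alignment = ["AC-GCGG-C",   # x
             "AC-GC-GAG",   # y
             "GCCGC-GAG"]   # z

for i, j in [(0, 1), (0, 2), (1, 2)]:
    print(induced_pairwise(alignment, i, j))
```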

SLIDE 17

Sum Of Pairs

The sum-of-pairs (SP) score of a multiple alignment A is the sum of the scores of all induced pairwise alignments:

S(A) = Σi<j S(Aij)

Aij is the induced alignment of xi, xj
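
The pairwise definition and the column-wise definition of SP give the same value when each pairwise alignment is scored column by column and a gap paired with a gap scores 0. A short derivation, assuming such a column-independent score s (for example a linear gap penalty), with a_{ic} denoting the symbol of row i in column c and ℓ the number of columns (both symbols introduced here for illustration):

```latex
S(A) \;=\; \sum_{i<j} S(A_{ij})
     \;=\; \sum_{i<j} \sum_{c=1}^{\ell} s(a_{ic}, a_{jc})
     \;=\; \sum_{c=1}^{\ell} \sum_{i<j} s(a_{ic}, a_{jc}),
\qquad s(-,-) = 0 .
```

The middle step holds because the columns dropped from an induced alignment are exactly the gap-gap columns, which contribute 0.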

SLIDE 18

Dyn.Prog. Solution

SLIDE 19

[Figure: cube showing the seven predecessor cells of s(NV, NS, NA), where N denotes a prefix and V, S, A are the last symbols of the three sequences]

s(NV, NS, NA) = max {
  s(N, N, N) + δ(V, S, A),
  s(NV, N, N) + δ(−, S, A),
  s(N, NS, N) + δ(V, −, A),
  s(N, N, NA) + δ(V, S, −),
  s(N, NS, NA) + δ(V, −, −),
  s(NV, N, NA) + δ(−, S, −),
  s(NV, NS, N) + δ(−, −, A)
}

k = 3, 2^k − 1 = 7

SLIDE 20

Multidimensional Dynamic Programming

Example: in 3D (three sequences x, y, z): 7 neighbors/cell

F(i,j,k) = max{
  F(i-1,j-1,k-1) + S(xi, yj, zk),
  F(i-1,j-1,k) + S(xi, yj, -),
  F(i-1,j,k-1) + S(xi, -, zk),
  F(i-1,j,k) + S(xi, -, -),
  F(i,j-1,k-1) + S(-, yj, zk),
  F(i,j-1,k) + S(-, yj, -),
  F(i,j,k-1) + S(-, -, zk) }
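
The seven-neighbor recurrence for three sequences can be sketched directly. The code below is a minimal illustration that returns only the optimal score; the SP column score with match 1, mismatch -1, gap -2 and all function names are assumptions, not values taken from the slides.

```python
from itertools import combinations, product

GAP = "-"


def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Illustrative pair score; (-,-) scores 0."""
    if a == GAP and b == GAP:
        return 0
    if a == GAP or b == GAP:
        return gap
    return match if a == b else mismatch


def sp_column(col):
    """Sum-of-pairs score of one alignment column."""
    return sum(pair_score(a, b) for a, b in combinations(col, 2))


def align3_score(x, y, z):
    """Optimal SP score of three sequences via the 7-neighbor recurrence."""
    nx, ny, nz = len(x), len(y), len(z)
    neg_inf = float("-inf")
    F = [[[neg_inf] * (nz + 1) for _ in range(ny + 1)] for _ in range(nx + 1)]
    F[0][0][0] = 0
    # The 2^3 - 1 = 7 moves: 1 consumes a symbol, 0 places a gap.
    moves = [m for m in product((0, 1), repeat=3) if any(m)]
    for i in range(nx + 1):
        for j in range(ny + 1):
            for k in range(nz + 1):
                for di, dj, dk in moves:
                    pi, pj, pk = i - di, j - dj, k - dk
                    if pi < 0 or pj < 0 or pk < 0:
                        continue      # predecessor lies outside the lattice
                    col = (x[pi] if di else GAP,
                           y[pj] if dj else GAP,
                           z[pk] if dk else GAP)
                    cand = F[pi][pj][pk] + sp_column(col)
                    if cand > F[i][j][k]:
                        F[i][j][k] = cand
    return F[nx][ny][nz]


print(align3_score("ACG", "AG", "ACG"))
```

Cells on the faces of the lattice need no separate initialization here, because moves that would leave the lattice are skipped.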

SLIDE 21

Complexity

Space complexity: O(n^k) for k sequences, each of length n
Computing at a cell: O(2^k) × cost of computing δ
Time complexity: O(2^k n^k) × cost of computing δ
Finding the optimal solution is exponential in k
Proven to be NP-complete for a number of cost functions
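
To get a feeling for these bounds, the lattice size and the number of cell updates can be computed for concrete n and k. The helper below is a hypothetical sketch that counts (n+1)^k cells and 2^k − 1 predecessors per cell, ignoring the cost of computing δ.

```python
def dp_cost(n, k):
    """Return (cells, cell_updates) for the full k-dimensional DP.

    cells        = (n + 1) ** k      lattice entries
    cell_updates = cells * (2**k - 1) predecessor lookups
    """
    cells = (n + 1) ** k
    neighbours = 2 ** k - 1
    return cells, cells * neighbours


for k in (2, 3, 6):
    cells, updates = dp_cost(200, k)
    print(f"k={k}: {cells:.3e} cells, {updates:.3e} updates")
```

For six sequences of length 200 the full lattice already has on the order of 10^13 cells, which is why pruning or heuristics are needed in practice.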

SLIDE 22

Algorithms

Faster Dynamic Programming (SP)

– Carrillo and Lipman 88 (MSA)
– Pruning of hyperlattice in DP
– Practical for about 6 sequences of length about 200

Star alignment (SP)
Progressive methods

– CLUSTALW
– PILEUP

Iterative algorithms
Sampling (Gibbs) based methods
Hidden Markov Model (HMM) based methods
Expectation Maximization Algorithm

SLIDE 23

Idea behind MSA algorithm

Find pairwise alignments
Trial multiple alignment produced by a tree, cost = d
This provides a limit to the volume within which optimal alignments are found
Specifics

– Sequences x1, …, xr
– Alignment A, cost = c(A)
– Optimal alignment A*
– Aij = alignment induced by A on xi, xj
– D(xi, xj) = cost of optimal pairwise alignment of xi, xj; D(xi, xj) <= c(Aij)
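
The pruning bound follows from these definitions in two steps. The derivation below assumes that the cost of a multiple alignment is the sum of the costs of its induced pairwise alignments (SP cost, to be minimized) and that d is the cost of some trial alignment, so that c(A*) ≤ d:

```latex
c(A^{*}) \;=\; \sum_{k<l} c(A^{*}_{kl}) \;\le\; d,
\qquad
c(A^{*}_{kl}) \;\ge\; D(x_k, x_l)
\;\;\Longrightarrow\;\;
c(A^{*}_{ij}) \;\le\; d \;-\! \sum_{(k,l) \ne (i,j)} D(x_k, x_l).
```

Any lattice cell whose projection onto the pair (i, j) can only lie on pairwise alignments costlier than the right-hand side cannot be part of an optimal alignment and can be skipped.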

SLIDE 24

Progressive Alignment

Multiple Alignment is NP-complete
Most used heuristic: Progressive Alignment

Algorithm:

– Align two of the sequences xi, xj
– Fix that alignment
– Align a third sequence xk to the alignment xi, xj
– Repeat until all sequences are aligned

Running time: O(N L^2)

SLIDE 25

Star Alignments

Heuristic method for multiple sequence alignments
Select a sequence c as the center of the star
For each sequence xi in x1, …, xk with index i ≠ c, perform a Needleman-Wunsch global alignment of xi with the center
Aggregate the alignments with the principle “once a gap, always a gap.”
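
The steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the scoring values (match 1, mismatch -1, gap -2), the function names, and the choice to let a new sequence reuse an existing gap column are not taken from the slides.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global pairwise alignment; returns the two gapped strings."""
    n, m = len(a), len(b)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s, F[i - 1][j] + gap, F[i][j - 1] + gap)
    ra, rb, i, j = [], [], n, m
    while i > 0 or j > 0:                      # traceback from the corner
        s = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s:
            ra.append(a[i - 1]); rb.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            ra.append(a[i - 1]); rb.append("-"); i -= 1
        else:
            ra.append("-"); rb.append(b[j - 1]); j -= 1
    return "".join(reversed(ra)), "".join(reversed(rb))


def star_align(seqs, center=0):
    """Merge pairwise alignments to the center: once a gap, always a gap."""
    rows = {center: seqs[center]}              # sequence index -> gapped row
    for idx in range(len(seqs)):
        if idx == center:
            continue
        c_aln, s_aln = needleman_wunsch(seqs[center], seqs[idx])
        old = rows[center]
        merged = {key: [] for key in rows}
        new_row = []
        p = q = 0                              # positions in old and c_aln
        while p < len(old) or q < len(c_aln):
            old_gap = p < len(old) and old[p] == "-"
            new_gap = q < len(c_aln) and c_aln[q] == "-"
            if old_gap and not new_gap:        # keep the existing gap column
                for key in rows:
                    merged[key].append(rows[key][p])
                new_row.append("-"); p += 1
            elif new_gap and not old_gap:      # open a new gap in all old rows
                for key in rows:
                    merged[key].append("-")
                new_row.append(s_aln[q]); q += 1
            else:                              # same center position in both
                for key in rows:
                    merged[key].append(rows[key][p])
                new_row.append(s_aln[q]); p += 1; q += 1
        rows = {key: "".join(chars) for key, chars in merged.items()}
        rows[idx] = "".join(new_row)
    return [rows[i] for i in range(len(seqs))]


for row in star_align(["MPE", "MKE", "MSKE", "SKE"], center=1):
    print(row)
```

Gaps already placed in the center row are never removed; each later pairwise alignment can only add columns.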

SLIDE 26

Star Alignments Example

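x1: MPE
x2: MKE
x3: MSKE
x4: SKE

[Figure: star with center s2 joined to s1, s3 and s4; the pairwise alignments MPE/MKE, MSKE/MKE and SKE/MKE are merged one at a time into a multiple alignment of MPE, MKE, MSKE and SKE]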

SLIDE 27

Choosing a center

Try them all and pick the one with the best score
Calculate all O(k^2) pairwise alignments, and pick the sequence xc that minimizes Σi≠c D(xc, xi)
D(xc, xi) = c(Aci), where A is the multiple alignment
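
Choosing the center by minimizing the summed pairwise distance can be sketched in a few lines. The example assumes unit-cost edit distance as the pairwise cost D, which is a stand-in chosen for illustration; the function names are hypothetical.

```python
def edit_distance(a, b):
    """Unit-cost edit distance, used here as an assumed pairwise cost D."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j - 1] + (ca != cb),   # substitute or match
                           prev[j] + 1,                # delete from a
                           cur[j - 1] + 1))            # insert into a
        prev = cur
    return prev[-1]


def choose_center(seqs):
    """Index of the sequence minimizing the sum of distances to all others."""
    totals = [sum(edit_distance(s, t) for t in seqs) for s in seqs]
    return min(range(len(seqs)), key=totals.__getitem__)


seqs = ["MPE", "MKE", "MSKE", "SKE"]
print(choose_center(seqs), seqs[choose_center(seqs)])
```

For these four sequences the distance sums are 5, 3, 4 and 4, so the second sequence (MKE) is selected.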

SLIDE 28

Analysis

Assuming all sequences have length n
O(k^2 n^2) to calculate the center
Step i takes O((i·n)·n) time

– two strings of length n and i·n

O(k^2 n^2) overall cost
Produces multiple sequence alignments whose SP values are at most twice that of the optimal solutions, provided the triangle inequality holds.

SLIDE 29

Aligning to family representations

Profile

– Apply dynamic programming
– Score depends on the profile

Consensus string

– Apply dynamic programming

Signature representations

– Align to regular expressions / CFG / …

SLIDE 30

Progressive alignment (CLUSTALW)

CLUSTALW is the most popular multiple protein alignment program

Algorithm:

1. Find all dij: alignment dist (xi, xj)
2. Construct a tree (neighbor-joining hierarchical clustering)
3. Align nodes in order of decreasing similarity
   – sequence to sequence
   – sequence to profile
   – profile to profile

+ a large number of heuristics

SLIDE 31

[Figure: pipeline for sequences S1, S2, S3, S4: all pairwise alignments → similarity matrix → cluster analysis → dendrogram (distance axis) grouping (S1, S3) and (S2, S4)]

Similarity matrix:

      S1  S2  S3  S4
S1         4   9   4
S2             4   7
S3                 4
S4

Multiple alignment steps:

1. Aligning S1 and S3
2. Aligning S2 and S4
3. Aligning (S1,S3) with (S2,S4)

From Higgins(1991) and Thompson(1994).
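
The merge order in this example can be reproduced from the similarity matrix with a small clustering sketch. Note the assumption: the code uses greedy average-linkage clustering on similarities as a simple stand-in, not the neighbor-joining method that CLUSTALW uses; the function name `merge_order` is hypothetical.

```python
def merge_order(names, sim):
    """Greedy average-linkage clustering on a similarity table.

    `sim` maps frozenset({a, b}) -> similarity of sequences a and b.
    Returns the list of (cluster, cluster) pairs in the order they merge.
    """
    clusters = [(name,) for name in names]
    steps = []

    def average(c1, c2):
        total = sum(sim[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Pick the pair of clusters with the highest average similarity.
        _, i, j = max(((average(c1, c2), i, j)
                       for i, c1 in enumerate(clusters)
                       for j, c2 in enumerate(clusters) if i < j),
                      key=lambda t: t[0])
        steps.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)]
        clusters.append(merged)
    return steps


sim = {frozenset(p): v for p, v in {
    ("S1", "S2"): 4, ("S1", "S3"): 9, ("S1", "S4"): 4,
    ("S2", "S3"): 4, ("S2", "S4"): 7, ("S3", "S4"): 4}.items()}

for step in merge_order(["S1", "S2", "S3", "S4"], sim):
    print(step)
```

With the values of the slide, S1 and S3 merge first (similarity 9), then S2 and S4 (similarity 7), and finally the two pairs.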

SLIDE 32

Problems with Progressive Alignments

Depends on pairwise alignments
If sequences are very distantly related, there is a much higher likelihood of errors
Care must be taken in choosing scoring matrices and penalties

SLIDE 33

Progressive Alignment: CLUSTALW

CLUSTALW: most popular multiple protein alignment program

Algorithm:

Find all dij: alignment dist (xi, xj)
Construct a tree (neighbor-joining hierarchical clustering)

Align nodes in order of decreasing similarity

+ a large number of heuristics

SLIDE 34

Iterative Refinement

One problem of progressive alignment:

Initial alignments are “frozen” even when new evidence comes

Example:

x: GAAGTT
y: GAC-TT (frozen!)
z: GAACTG
w: GTACTG

Once z and w are added, it is clear that the correct alignment is y = GA-CTT

SLIDE 35

Iterative Refinement

Algorithm (Barton-Sternberg):

1. Align the most similar xi, xj
2. Align the xk most similar to (xi xj)
3. Repeat 2 until (x1…xN) are aligned
4. For j = 1 to N, remove xj and realign it to x1…xj-1 xj+1…xN
5. Repeat 4 until convergence

Note: Guaranteed to converge

SLIDE 36

Other methods

MEME (Expectation Maximization)
GibbsDNA (Gibbs Sampling)
HMMER (Hidden Markov Model)
Random projections
CONSENSUS (greedy multiple alignment)
WINNOWER (Clique finding in graphs)