SLIDE 1
Multiple Sequence Alignments
SLIDE 2 Multiple alignment
– Infer biological relationships from string similarity
– Infer string similarity from biological relationships
SLIDE 3 Biological Motivations
- One of the most essential tools in molecular biology
– Finding highly conserved sub-regions or embedded patterns of a set of biological sequences
– Production of consensus sequence
– Estimation of evolutionary distance between sequences
– Prediction of protein secondary/tertiary structure
– To find conserved regions
- Local multiple alignment reveals conserved regions
- Conserved regions usually are key functional regions, prime targets
for drug developments
- Practically useful methods only since D. Sankoff (1987), based on phylogenetics
– Before 1987 they were constructed by hand
– The basic problem: exact dynamic programming is too expensive for more than a few sequences
SLIDE 4
Alignment between globins (human beta globin, horse beta globin, human alpha globin, horse alpha globin, cyanohaemoglobin, whale myoglobin, leghaemoglobin) produced by Clustal. Boxes mark the seven alpha helices composing each globin.
SLIDE 5
SLIDE 6 Definition
- Given strings x1, x2, …, xk, a multiple (global) alignment is a matrix of k rows and A columns in which each row represents one sequence and each column contains one symbol or gap symbol from each sequence (at least one non-gap per column)
SLIDE 7
Multiple Sequence Alignment
Matrix: 3 rows, 8 columns
SLIDE 8 Family representations
- Outcome of multiple alignment
- Three kinds
– Profile representation
- Frequencies of symbols in each column
- Weight vector
- Alignment to a profile
– Consensus sequence representation
– Signature representation
- PROSITE, BLOCKS databases
- Regular expression
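A minimal sketch of the profile and consensus representations listed above, assuming the alignment is given as equal-length strings and that the gap symbol '-' is counted like any other symbol. The function names `profile` and `majority_consensus` are illustrative, and the column-majority consensus is only one simple way to derive a consensus sequence.

```python
from collections import Counter

def profile(rows):
    """Per-column symbol frequencies; the gap '-' is counted as a symbol (an assumption)."""
    if len({len(r) for r in rows}) != 1:
        raise ValueError("all rows of an alignment must have the same length")
    k = len(rows)
    # zip(*rows) walks the alignment column by column.
    return [{sym: cnt / k for sym, cnt in Counter(col).items()} for col in zip(*rows)]

def majority_consensus(rows):
    """Most frequent symbol in each column; ties go to the symbol seen first."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*rows))
```

Each entry of the returned profile is a dictionary for one column, so a new sequence could be scored against it column by column.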
SLIDE 9 Scoring Function
– Find alignment that maximizes probability that sequences evolved from common ancestor
[Figure: sequences x, y, z, w, v and their unknown common ancestor (?)]
SLIDE 10 Multiple Sequence Alignment
- Mult-Seq-Align can detect similarities that Pairwise-Seq-Align methods cannot.
- Detection of family characteristics.
Three questions:
1. Scoring.
2. Computation of Mult-Seq-Align.
3. Family representation.
SLIDE 11
A fragment of multiple alignment of 7 kinases. ClustalW program from SRS server.
SLIDE 12
Scoring: SP (sum of pairs)
SP: the sum of pairwise scores of all pairs of symbols in the column.
SP3(-,A,A) = (-,A) + (-,A) + (A,A)
(-,-) = 0
SP Total Score = sum over all columns
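The column-wise SP score can be sketched as follows. The scoring values (match = 1, mismatch = -1, gap = -1) are assumptions chosen for illustration; only the rule (-,-) = 0 comes from the slide.

```python
from itertools import combinations

def pair_score(a, b, match=1, mismatch=-1, gap=-1):
    """Score of one pair of symbols; two gaps score 0, as on the slide."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch

def sp_column(column):
    """Sum of pairwise scores over all pairs of symbols in one column."""
    return sum(pair_score(a, b) for a, b in combinations(column, 2))

def sp_total(rows):
    """SP score of a multiple alignment: the sum of its column scores."""
    return sum(sp_column(col) for col in zip(*rows))
```

With these assumed values, SP3(-,A,A) evaluates to (-1) + (-1) + 1 = -1.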
SLIDE 13
Induced pairwise alignment
Induced pairwise alignment: the projection of a multiple alignment onto two of its sequences.
a(S1, S2), a(S2, S3), a(S1, S3)
(-,-) = 0
SP Total Score = Σ_{i<j} score[ a(Si, Sj) ]
SLIDE 14 Consensus
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
CAG-CTATCAC--GACCGC----TCGATTTGCTCGAC
CAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
- Find the optimal consensus string m* that maximizes
S(m) = Σi s(m, mi)
s(mk, ml): score of pairwise alignment (k,l)
SLIDE 15 Optimal solution
- Multidimensional Dynamic Programming
- Generalization of pair-wise alignment
- For simplicity, assume k sequences of length n
- The dynamic programming array is a k-dimensional hyperlattice of side length n+1 (including initial gaps)
- The entry F(i1, …, ik) represents the score of an optimal alignment of s1[1..i1], …, sk[1..ik]
- Initialize values on the faces of the hyperlattice
SLIDE 16
Scoring Function: Sum Of Pairs
Definition: Induced pairwise alignment
A pairwise alignment induced by the multiple alignment
Example:
x: AC-GCGG-C
y: AC-GC-GAG
z: GCCGC-GAG
Induces:
x: ACGCGG-C    x: AC-GCGG-C    y: AC-GCGAG
y: ACGC-GAG    z: GCCGC-GAG    z: GCCGCGAG
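The projection in this example can be sketched in a few lines: keep two rows of the multiple alignment and drop every column in which both rows hold a gap. The function name `induced_pairwise` is hypothetical.

```python
def induced_pairwise(rows, i, j, gap="-"):
    """Project a multiple alignment onto rows i and j, dropping all-gap columns."""
    kept = [(a, b) for a, b in zip(rows[i], rows[j]) if not (a == gap and b == gap)]
    return "".join(a for a, _ in kept), "".join(b for _, b in kept)
```

Applied to the x, y, z alignment above, the pair (x, y) loses the third column and the pair (y, z) loses the sixth, while (x, z) keeps all nine columns.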
SLIDE 17 Sum Of Pairs
- The sum-of-pairs (SP) score of a multiple alignment A is the sum of the scores of all induced pairwise alignments
S(A) = Σ_{i<j} S(Aij)
Aij is the induced alignment of xi, xj
SLIDE 18
Dyn.Prog. Solution
SLIDE 19
[Figure: cube of the predecessor cells of s(NV, NS, NA); the three sequences end in V, S, A]
s(NV, NS, NA) = max {
  s(N, N, N) + δ(V, S, A),
  s(NV, N-, N-) + δ(−, S, A),
  s(N-, NS, N-) + δ(V, −, A),
  s(N-, N-, NA) + δ(V, S, −),
  s(N-, NS, NA) + δ(V, −, −),
  s(NV, N-, NA) + δ(−, S, −),
  s(NV, NS, N-) + δ(−, −, A)
}
k = 3, 2^k − 1 = 7
SLIDE 20 Multidimensional Dynamic Programming
Example (three sequences):
F(i,j,k) = max{
  F(i-1,j-1,k-1) + S(xi, xj, xk),
  F(i-1,j-1,k) + S(xi, xj, -),
  F(i-1,j,k-1) + S(xi, -, xk),
  F(i-1,j,k) + S(xi, -, -),
  F(i,j-1,k-1) + S(-, xj, xk),
  F(i,j-1,k) + S(-, xj, -),
  F(i,j,k-1) + S(-, -, xk)
}
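A minimal sketch of this three-sequence recurrence, computing only the optimal score (no traceback). It assumes S is the sum-of-pairs column score with illustrative values match = 1, mismatch = -1, gap = -1 and (-,-) = 0; the function names are hypothetical.

```python
from itertools import combinations, product

GAP = "-"

def pair_score(a, b, match=1, mismatch=-1, gap=-1):
    """Assumed pairwise scores; two gaps score 0."""
    if a == GAP and b == GAP:
        return 0
    if a == GAP or b == GAP:
        return gap
    return match if a == b else mismatch

def column_score(col):
    """Sum-of-pairs score of one alignment column."""
    return sum(pair_score(a, b) for a, b in combinations(col, 2))

def align3_score(x, y, z):
    """Optimal SP score of a global alignment of three sequences."""
    seqs = (x, y, z)
    # The 2^3 - 1 = 7 moves: each sequence either consumes a symbol (1) or takes a gap (0).
    moves = [m for m in product((0, 1), repeat=3) if any(m)]
    F = {(0, 0, 0): 0}
    # product() yields cells in lexicographic order, so every predecessor is filled first.
    for cell in product(*(range(len(s) + 1) for s in seqs)):
        if cell == (0, 0, 0):
            continue
        best = None
        for m in moves:
            prev = tuple(c - d for c, d in zip(cell, m))
            if min(prev) < 0:
                continue  # this move would step outside the lattice
            col = tuple(s[c - 1] if d else GAP for s, c, d in zip(seqs, cell, m))
            cand = F[prev] + column_score(col)
            if best is None or cand > best:
                best = cand
        F[cell] = best
    return F[tuple(len(s) for s in seqs)]
```

The dictionary `F` plays the role of the three-dimensional array; boundary cells need no separate initialization because moves that leave the lattice are skipped.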
SLIDE 21 Complexity
- Space complexity: O(n^k) for k sequences each n long.
- Computing at a cell: O(2^k) × cost of computing δ.
- Time complexity: O(2^k n^k) × cost of computing δ.
- Finding the optimal solution is exponential in k
- Proven to be NP-complete for a number of cost functions
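To make these bounds concrete, the counts can be evaluated directly. This is plain arithmetic under the slide's assumption of k sequences of length n; the function name `dp_cost` is hypothetical.

```python
def dp_cost(n, k):
    """Lattice cells, moves per cell, and total move evaluations for k sequences of length n."""
    cells = (n + 1) ** k   # hyperlattice of side n + 1
    moves = 2 ** k - 1     # every non-empty subset of sequences may advance
    return cells, moves, cells * moves
```

For example, `dp_cost(200, 6)` gives a lattice with on the order of 10^13 cells, which shows why the unpruned lattice quickly becomes impractical as k grows.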
SLIDE 22 Algorithms
- Faster Dynamic Programming (SP)
– Carrillo and Lipman 88 (MSA)
– Pruning of hyperlattice in DP
– Practical for about 6 sequences of length about 200.
- Star alignment (SP)
- Progressive methods
– CLUSTALW
– PILEUP
- Iterative algorithms
- Sampling (Gibbs) based methods
- Hidden Markov Model (HMM) based methods
- Expectation Maximization Algorithm
SLIDE 23 Idea behind MSA algorithm
- Find pairwise alignments
- Trial multiple alignment produced by a tree, cost = d
- This provides a limit to the volume within which optimal alignments are found
- Specifics
– Sequences x1, …, xr
– Alignment A, cost = c(A)
– Optimal alignment A*
– Aij = alignment of xi, xj induced by A
– D(xi, xj) = cost of optimal pairwise alignment of xi, xj ≤ c(Aij)
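The step from these quantities to a bound on the search volume can be written out as follows. This is a sketch assuming c is a sum-of-pairs cost that is being minimized, so that c(A) is the sum of the costs of the induced pairwise alignments.

```latex
\begin{align*}
c(A^*) &= \sum_{p<q} c(A^*_{pq}) \le d
  && \text{(the trial alignment costs $d$, so the optimum costs no more)} \\
c(A^*_{pq}) &\ge D(x_p, x_q)
  && \text{(no induced alignment beats the pairwise optimum)} \\
\Rightarrow\quad c(A^*_{ij}) &\le d - \sum_{\substack{p<q \\ (p,q) \ne (i,j)}} D(x_p, x_q)
\end{align*}
```

Under these assumptions, any lattice cell whose projection onto the pair (i, j) cannot lie on a pairwise alignment within this cost can be discarded.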
SLIDE 24 Progressive Alignment
- Multiple Alignment is NP-complete
- Most used heuristic: Progressive Alignment
Algorithm:
1. Align two of the sequences xi, xj
2. Fix that alignment
3. Align a third sequence xk to the alignment xi, xj
4. Repeat until all sequences are aligned
Running Time: O(N L^2)
SLIDE 25 Star Alignments
- Heuristic method for multiple sequence
alignments
- Select a sequence c as the center of the star
- For each sequence xi with i ≠ c, compute a Needleman-Wunsch global alignment of xi with xc
- Aggregate alignments with the principle
“once a gap, always a gap.”
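The star procedure can be sketched as below. The scoring values (match = 1, mismatch = -1, linear gap = -1), the tie-breaking in the traceback, and the function names are assumptions made for illustration. Gaps that any pairwise alignment opens in the center are kept in every row, which is the "once a gap, always a gap" rule.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Return one optimal global alignment of a and b as two gapped strings."""
    n, m = len(a), len(b)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    sub = lambda i, j: match if a[i - 1] == b[j - 1] else mismatch
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(F[i - 1][j - 1] + sub(i, j), F[i - 1][j] + gap, F[i][j - 1] + gap)
    ra, rb, i, j = [], [], n, m
    while i > 0 or j > 0:  # traceback, preferring diagonal moves on ties
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + sub(i, j):
            ra.append(a[i - 1]); rb.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            ra.append(a[i - 1]); rb.append("-"); i -= 1
        else:
            ra.append("-"); rb.append(b[j - 1]); j -= 1
    return "".join(reversed(ra)), "".join(reversed(rb))

def expand(center_row, other_row, gaps):
    """Re-space one pairwise alignment so it has gaps[p] columns before center position p."""
    out, pos, run = [], 0, ""
    for cc, oc in zip(center_row, other_row):
        if cc == "-":
            run += oc  # symbol aligned against a gap in the center
        else:
            out.append(run.ljust(gaps[pos], "-") + oc)
            run, pos = "", pos + 1
    out.append(run.ljust(gaps[pos], "-"))
    return "".join(out)

def star_align(seqs, c):
    """Merge the pairwise alignments of seqs[c] with every other sequence."""
    center = seqs[c]
    pairs = {i: needleman_wunsch(center, s) for i, s in enumerate(seqs) if i != c}
    gaps = [0] * (len(center) + 1)  # widest gap run required before each center position
    for center_row, _ in pairs.values():
        pos = run = 0
        for ch in center_row:
            if ch == "-":
                run += 1
            else:
                gaps[pos] = max(gaps[pos], run)
                run, pos = 0, pos + 1
        gaps[pos] = max(gaps[pos], run)
    return [expand(center, center, gaps) if i == c else expand(*pairs[i], gaps)
            for i in range(len(seqs))]
```

Calling `star_align(["MPE", "MKE", "MSKE", "SKE"], 1)` uses MKE as the center; every returned row has the same length, and removing the gaps from a row recovers the original sequence.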
SLIDE 26 Star Alignments Example
x1: MPE
x2: MKE
x3: MSKE
x4: SKE
[Figure: star diagram over s1, s2, s3, s4; the pairwise alignments MPE/MKE, MSKE/MKE, and SKE/MKE are merged into one multiple alignment]
SLIDE 27 Choosing a center
- Try them all and pick the one with the best
score
- Calculate all O(k^2) pairwise alignments, and pick the sequence xc that minimizes Σ_{i ≠ c} D(xc, xi)
- D(xc, xi) = c(Aci), where A is the multiple alignment
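A minimal sketch of the second strategy, assuming D is the unit-cost edit distance (one possible alignment cost; the slide does not fix a cost function). The function names are hypothetical.

```python
def edit_distance(a, b):
    """Unit-cost edit distance, used here as a stand-in for the pairwise alignment cost D."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j - 1] + (ca != cb),  # match or substitution
                           prev[j] + 1,               # gap in b
                           cur[j - 1] + 1))           # gap in a
        prev = cur
    return prev[-1]

def choose_center(seqs):
    """Index c minimizing the sum of D(seqs[c], seqs[i]) over all i != c."""
    totals = [sum(edit_distance(s, t) for j, t in enumerate(seqs) if j != i)
              for i, s in enumerate(seqs)]
    return min(range(len(seqs)), key=totals.__getitem__)
```

For the four sequences MPE, MKE, MSKE, SKE the distance sums are 5, 3, 4, 4, so this sketch selects MKE as the center.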
SLIDE 28 Analysis
- Assuming all sequences have length n
- O(k^2 n^2) to calculate center
- Step i takes O((i·n)·n) time
– two strings of length n and i·n
- O(k^2 n^2) overall cost
- Produces multiple sequence alignments whose SP values are at most twice that of the optimal solutions, provided the triangle inequality holds.
SLIDE 29 Aligning to family representations
- Profile representations
– Apply dynamic programming
– Score depends on the profile
- Consensus sequence representations
– Apply dynamic programming
- Signature representations
– Align to regular expressions / CFG / …
SLIDE 30 Progressive alignment (CLUSTALW)
- CLUSTALW is the most popular multiple protein alignment program
Algorithm:
1. Find all dij: alignment dist(xi, xj)
2. Construct a tree (neighbor-joining hierarchical clustering)
3. Align nodes in order of decreasing similarity
- 3. Align nodes in order of decreasing similarity
- sequence to sequence
- sequence to profile
- profile to profile
+ a large number of heuristics
SLIDE 31
[Figure: sequences S1, S2, S3, S4 → All Pairwise Alignments → Similarity Matrix → Cluster Analysis → Dendrogram (leaf order S1, S3, S2, S4; axis: Distance) → Multiple Alignment]
Similarity Matrix:
     S1  S2  S3  S4
S1        4   9   4
S2            4   7
S3                4
S4
Multiple Alignment Step:
1. Aligning S1 and S3
2. Aligning S2 and S4
3. Aligning (S1,S3) with (S2,S4).
From Higgins (1991) and Thompson (1994).
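The alignment order on this slide can be reproduced from the similarity matrix with a small clustering sketch. It uses greedy average-linkage clustering on similarities as a simple stand-in; it is not the neighbor-joining procedure that CLUSTALW itself uses, and the function name `merge_order` is hypothetical.

```python
def merge_order(names, sim):
    """Greedy average-linkage clustering; returns the cluster pairs in merge order.

    sim maps frozenset({a, b}) to the similarity of sequences a and b.
    """
    def avg(c1, c2):
        return sum(sim[frozenset((a, b))] for a in c1 for b in c2) / (len(c1) * len(c2))

    clusters = [(n,) for n in names]
    merges = []
    while len(clusters) > 1:
        # Pick the most similar pair of clusters; ties go to the first pair found.
        _, i, j = max(((avg(clusters[i], clusters[j]), i, j)
                       for i in range(len(clusters))
                       for j in range(i + 1, len(clusters))),
                      key=lambda t: t[0])
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

# Similarity values taken from the matrix on this slide.
SIM = {frozenset(p): v for p, v in {
    ("S1", "S2"): 4, ("S1", "S3"): 9, ("S1", "S4"): 4,
    ("S2", "S3"): 4, ("S2", "S4"): 7, ("S3", "S4"): 4}.items()}
```

With these values, `merge_order(["S1", "S2", "S3", "S4"], SIM)` merges S1 with S3 first, then S2 with S4, then the two pairs, matching the three alignment steps listed above.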
SLIDE 32 Problems with Progressive Alignments
- Depends on pairwise alignments
- If sequences are very distantly related,
much higher likelihood of errors
- Care must be taken in choosing scoring matrices and penalties
SLIDE 33 Progressive Alignment: CLUSTALW
CLUSTALW: most popular multiple protein alignment program
Algorithm:
- Find all dij: alignment dist(xi, xj)
- Construct a tree (neighbor-joining hierarchical clustering)
- Align nodes in order of decreasing similarity
+ a large number of heuristics
SLIDE 34 Iterative Refinement
One problem of progressive alignment:
- Initial alignments are “frozen” even when new evidence comes
Example:
x: GAAGTT
y: GAC-TT
(the x, y alignment is frozen)
z: GAACTG
w: GTACTG
(with z and w added, it is now clear that the correct y = GA-CTT)
SLIDE 35 Iterative Refinement
Algorithm (Barton-Sternberg):
1. Align most similar xi, xj
2. Align xk most similar to (xi xj)
3. Repeat 2 until (x1…xN) are aligned
4. For j = 1 to N, remove xj and realign it to x1…xj-1 xj+1…xN
5. Repeat 4 until convergence
Note: Guaranteed to converge
SLIDE 36 Other methods
- MEME (Expectation Maximization)
- GibbsDNA (Gibbs Sampling)
- HMMER (Hidden Markov Model)
- Random projections
- CONSENSUS (greedy multiple alignment)
- WINNOWER (Clique finding in graphs)