We were talking about similarity, sequence comparison and - - PDF document



slide-1
SLIDE 1

We were talking about similarity, sequence comparison and alignment. HOW DOES IT WORK ?

slide-2
SLIDE 2

The high end solution

Use the most sensible, most powerful, and best trainable tool available ...

slide-3
SLIDE 3

... your eyes

slide-4
SLIDE 4

A T A T T G C A
A T C T T C G C A

slide-5
SLIDE 5

A T C T T C G C A
A T A T T G C A

slide-6
SLIDE 6

T C A T G C A T T G

slide-7
SLIDE 7

T C A T G C A T T G

slide-8
SLIDE 8

T C A T G C A T T G

slide-9
SLIDE 9

DOTPLOTS

...TAGGAAA... ...TAGCACG... TAT... ...CT GCC... ...TA

slide-10
SLIDE 10

A T G C G T C G T T T G C T G C G T A

slide-11
SLIDE 11

A T G C G T C G T T T G C T G C T A C

slide-12
SLIDE 12

A T G C G T C G T T T A G C G C G T T

slide-13
SLIDE 13

ATCCG−CGTC ATCCGCGTC−−− AT−−GCGTCGTT ATGCGTCGTT A T G C G T C G T T G C T A C G T C C

slide-14
SLIDE 14

...TA TATAGCGTCATGCGTACCCCCCTAGGAAAGGATCAGCCCTATATCT GCCTAAACCACTGTGTCTCTTTAGCACGGGGTATCCATA ...TAGGAAA... ...TAGCACG... TAT... ...CT GCC...

slide-15
SLIDE 15
slide-16
SLIDE 16
slide-17
SLIDE 17
slide-18
SLIDE 18
slide-19
SLIDE 19
slide-20
SLIDE 20

Dotplots ...
... detect both global and local similarity
... detect internal repeats
... detect multiple domain structure
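The slides show dotplots only as pictures. As a minimal sketch of how such a plot could be computed, the hypothetical function `dotplot` below marks every position pair whose `window`-long substrings are identical. The function name, the `window` parameter and the text output are assumptions made for illustration, not part of the lecture.

```python
def dotplot(s1: str, s2: str, window: int = 1) -> list[str]:
    """Return one text row per start position in s2.

    Column i of row j is '*' if s1[i:i+window] equals s2[j:j+window],
    otherwise '.'. Diagonal runs of '*' indicate similar regions.
    """
    rows = []
    for j in range(len(s2) - window + 1):
        row = ""
        for i in range(len(s1) - window + 1):
            row += "*" if s1[i:i + window] == s2[j:j + window] else "."
        rows.append(row)
    return rows


if __name__ == "__main__":
    # Two of the short sequences that appear on the dotplot slides.
    for line in dotplot("ATGCGTCGTT", "ATCCGCGTC"):
        print(line)
```

A larger `window` suppresses isolated chance matches, which matters for DNA, where any two positions match by chance about a quarter of the time.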

slide-21
SLIDE 21

Dotplots ...
... rely on the power of human cognition
... are qualitative and not quantitative

slide-22
SLIDE 22

Definition (global alignment)

A global alignment between two sequences S1 and S2 is obtained by first inserting chosen spaces, either into or at the end of S1 and S2, and then placing the resulting strings one above the other so that every space or character in either sequence is opposite a unique character or a unique space in the other string. Matching spaces are not allowed.

slide-23
SLIDE 23

Editing

Given two sequences: Edit the first sequence such that it is identical to the second. Only the first sequence is edited !!!

Edit operations:
(1) Replacements: R(A−>T)
(2) Deletions: D
(3) Insertions: I(T)
(4) Do Nothing: N(A)

SHORT: R D I N

slide-24
SLIDE 24

Example

A T A G C G G A T
A C A C G G T A T

Edit Script:
1: N
2: R(T−>C)
3: N
4: D(G)
5: N
6: N
7: N
8: I(T)
9: N
10: N

slide-25
SLIDE 25

Given the first sequence and an Edit script we can reconstruct the second sequence and the alignment.

First Sequence: C C A T
Script: N D(C) N I(T) N

Alignment:
C C A − T
C − A T T
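The reconstruction described on this slide can be sketched in a few lines. The hypothetical function `apply_edit_script` below is an illustration only: it assumes the script is given as a list of strings in the slide notation, written with ASCII `->` and `-` instead of the typeset symbols, and it accepts a deletion either as `D` or as `D(C)`.

```python
GAP = "-"


def apply_edit_script(first: str, script: list[str]) -> tuple[str, str, str]:
    """Apply an edit script to `first`.

    Returns (second sequence, aligned first row, aligned second row).
    Operations: "N", "R(X->Y)", "D" or "D(X)", "I(Y)".
    """
    pos = 0                      # next unread character of the first sequence
    top, bottom = [], []
    for op in script:
        kind = op[0]
        if kind == "I":          # insertion: gap in the first row
            top.append(GAP)
            bottom.append(op[2])
            continue
        if pos >= len(first):
            raise ValueError("script reads past the end of the first sequence")
        current = first[pos]
        pos += 1
        top.append(current)
        if kind == "N":          # do nothing: copy the character
            bottom.append(current)
        elif kind == "R":        # replacement: take the character after '>'
            bottom.append(op[op.index(">") + 1])
        elif kind == "D":        # deletion: gap in the second row
            bottom.append(GAP)
        else:
            raise ValueError(f"unknown operation: {op}")
    if pos != len(first):
        raise ValueError("script does not consume the whole first sequence")
    second = "".join(c for c in bottom if c != GAP)
    return second, "".join(top), "".join(bottom)


if __name__ == "__main__":
    print(apply_edit_script("CCAT", ["N", "D(C)", "N", "I(T)", "N"]))
```

For the slide's example this yields the second sequence `CATT` together with the two alignment rows `CCA-T` and `C-ATT`.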

slide-26
SLIDE 26

Given both sequences we can reconstruct the alignment and the short version of an edit script.

First Sequence: C C A T
Second Sequence: C A T T

C C A − T
C − A T T

Short Script: N D N I N

Every alignment is equivalent to a string on the alphabet {R D I N}
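The equivalence between an alignment and a string over {R D I N} can be made concrete column by column. The sketch below is an illustration under assumptions: `short_script` and `edit_cost` are hypothetical names, and gaps are written as ASCII `-`.

```python
GAP = "-"


def short_script(top: str, bottom: str) -> str:
    """Translate the two rows of an alignment into a short edit script."""
    if len(top) != len(bottom):
        raise ValueError("both alignment rows must have the same length")
    ops = []
    for a, b in zip(top, bottom):
        if a == GAP and b == GAP:
            raise ValueError("matching spaces are not allowed")
        if a == GAP:
            ops.append("I")      # character only in the second sequence
        elif b == GAP:
            ops.append("D")      # character only in the first sequence
        elif a == b:
            ops.append("N")      # identical characters
        else:
            ops.append("R")      # different characters
    return "".join(ops)


def edit_cost(script: str) -> int:
    """Number of R, I and D operations; N operations are not counted."""
    return sum(1 for op in script if op in "RID")


if __name__ == "__main__":
    script = short_script("CCA-T", "C-ATT")
    print(script, edit_cost(script))
```

For the alignment on this slide the result is `NDNIN` with two counted operations.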

slide-27
SLIDE 27

Definition (Edit Distance)

The edit distance between two sequences is the minimum number of edit operations {R I D} needed to transform the first sequence into the second. Note that {N} operations are not counted.

slide-28
SLIDE 28

In order to calculate the edit distance of two sequences we need to solve an optimization problem:

Given two sequences: What is the shortest edit script that transforms the first sequence into the second? The length of the script is the number of {R I D} in it.

slide-29
SLIDE 29

Let S1 be a sequence of length n1 and let S2 be a sequence of length n2. There are at least

$$\binom{n_1+n_2}{n_1}$$

different global alignments between S1 and S2.

slide-30
SLIDE 30

A PROOF IN RED AND BLUE

[Figure: the characters of the two sequences written as a single string of red (R) and blue (B) letters]

C A − T G C A
C A A G T − −

There are $\binom{n_1+n_2}{n_1}$ ways to place the n1 blue Bs in this string of length n1+n2.

slide-31
SLIDE 31

A T C T T C G C A
A T A T T G C A

24310 different alignments

Two sequences of length 500:

2.7029e+299 different alignments
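Both numbers on this slide are instances of the binomial lower bound, for lengths 9 and 8 and for lengths 500 and 500. A short check, using the hypothetical helper name `alignment_lower_bound` and Python's `math.comb`:

```python
from math import comb


def alignment_lower_bound(n1: int, n2: int) -> int:
    """Lower bound on the number of global alignments: (n1+n2 choose n1)."""
    return comb(n1 + n2, n1)


if __name__ == "__main__":
    # ATCTTCGCA has 9 characters, ATATTGCA has 8.
    print(alignment_lower_bound(9, 8))
    # Two sequences of length 500, printed in scientific notation.
    print(f"{float(alignment_lower_bound(500, 500)):.4e}")
```

Even at this modest length, enumerating all alignments is out of the question, which motivates the search for a more efficient method.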

slide-32
SLIDE 32

Divide and conquer

Subdivide a problem that is too large to be computed into smaller problems that may be efficiently computed. Then assemble the answers to give a solution to the large problem.

slide-33
SLIDE 33

Dynamic Programming

Recursively subdivide a large problem into subproblems of the same type. Subproblems should share subproblems. Calculate the solution of all the subproblems just once. Save the answer in a table, thereby avoiding the work of recomputing the answers every time the subproblem is encountered.

slide-34
SLIDE 34

The 3 Steps of a dynamic programming algorithm.
(1) The recurrence relation
(2) A tabular computation scheme
(3) The traceback

slide-35
SLIDE 35

Notation:

Let S1 and S2 be two sequences. S1[1..i] and S2[1..j] are the first i resp. j characters of the sequences.

S1: TAGGTCAT CCATATAATA
S1[1..8] = TAGGTCAT

D(i,j) denotes the edit distance of S1[1..i] and S2[1..j]

slide-36
SLIDE 36

Problem: Calculate the minimal edit distance of 2 sequences and the corresponding global alignment.

Observation: That is easier for short sequences.

Strategy: Solve the problem for all shorter sequences S1[1..i] and S2[1..j].

slide-37
SLIDE 37

S1: ATCGCT GGCATAC
S2: TTCCTA GCCTAC

An alignment ends either with

(1) a match/mismatch
ATCGC T
TTCCT A
use the opt. alignment of S1[1..5] and S2[1..5].

(2) a gap in the first sequence
ATCGCT −
−TTCCT A
use the opt. alignment of S1[1..6] and S2[1..5].

(3) a gap in the second sequence
ATCGC− T
TTCCTA −
use the opt. alignment of S1[1..5] and S2[1..6].

One of the alignments is optimal !

slide-38
SLIDE 38

The recurrence relation

Edit steps:

ATCGC T
TTCCT A
D(5,5) + 1

ATCGCT −
−TTCCT A
D(6,5) + 1

ATCGC− T
TTCCTA −
D(5,6) + 1

$$D(6,6) = \min \begin{cases} D(5,5) + 1 \\ D(6,5) + 1 \\ D(5,6) + 1 \end{cases}$$

slide-39
SLIDE 39

The general recurrence relation

$$D(i,j) = \min \begin{cases} D(i-1,j-1) + t(i,j) \\ D(i,j-1) + 1 \\ D(i-1,j) + 1 \end{cases}$$

t(i,j) = 0 if S1(i) = S2(j) "match"
t(i,j) = 1 if S1(i) ≠ S2(j) "mismatch"
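The general recurrence translates almost literally into a recursive function. The sketch below is one possible rendering, not the lecture's own code: it assumes the base cases D(i,0) = i and D(0,j) = j, which the recurrence itself does not state, and it uses `functools.lru_cache` so that each subproblem D(i,j) is solved only once.

```python
from functools import lru_cache


def edit_distance(s1: str, s2: str) -> int:
    """Edit distance via the recurrence, with memoised subproblems."""

    @lru_cache(maxsize=None)
    def d(i: int, j: int) -> int:
        # Assumed base cases: aligning against an empty prefix costs
        # one insertion or deletion per remaining character.
        if i == 0:
            return j
        if j == 0:
            return i
        t = 0 if s1[i - 1] == s2[j - 1] else 1   # match / mismatch
        return min(
            d(i - 1, j - 1) + t,   # alignment ends with a match/mismatch
            d(i, j - 1) + 1,       # ends with a gap in the first sequence
            d(i - 1, j) + 1,       # ends with a gap in the second sequence
        )

    return d(len(s1), len(s2))


if __name__ == "__main__":
    print(edit_distance("CCAT", "CATT"))
```

The recursion is only suitable for short sequences because of Python's recursion limit; the tabular scheme avoids that restriction.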

slide-40
SLIDE 40

BOTTOM−UP COMPUTATION

Idea: We start with solving easy problems like "calculate D(1,1)" or even "calculate D(0,0),D(0,1),D(1,0) ..."

"Calculate D(3,4)" is a subproblem of "calculate D(5,5)"

"Calculate D(3,4)" is also a subproblem of "calculate D(12,15)"

We solve "calculate D(3,4)" only once

slide-41
SLIDE 41

INITIALIZATION

S1: WRITERS
S2: VINTNER

Align the first 0 characters of S1 to the first 2 characters of S2:

−−...
VI...

This results in 2 insertions.

        W  R  I  T  E  R  S
     0  1  2  3  4  5  6  7
  V  1
  I  2
  N  3
  T  4
  N  5
  E  6
  R  7

slide-42
SLIDE 42

Tabular calculation

        W  R  I  T  E  R  S
     0  1  2  3  4  5  6  7
  V  1  1  2  3  4  5  6  7
  I  2  2  2  2  3  4  5  6
  N  3  3  3  3  3  4  5  6
  T  4  4  4  4  ?
  N  5
  E  6
  R  7
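The tabular calculation can be sketched as two nested loops that fill the table row by row, so that every cell finds its three neighbours already computed. The function name `edit_distance_table` is hypothetical. Note one assumption about layout: the code indexes the table as D[i][j] for S1[1..i] and S2[1..j], so its rows belong to S1, whereas the slide draws S1 across the columns; the printed table is therefore the transpose of the slide's.

```python
def edit_distance_table(s1: str, s2: str) -> list[list[int]]:
    """Fill the whole table D bottom-up; D[i][j] is the edit distance
    between the first i characters of s1 and the first j of s2."""
    n1, n2 = len(s1), len(s2)
    D = [[0] * (n2 + 1) for _ in range(n1 + 1)]
    # Initialisation: aligning against an empty prefix.
    for i in range(1, n1 + 1):
        D[i][0] = i
    for j in range(1, n2 + 1):
        D[0][j] = j
    # Each cell only needs its upper, left and upper-left neighbours.
    for i in range(1, n1 + 1):
        for j in range(1, n2 + 1):
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            D[i][j] = min(D[i - 1][j - 1] + t,
                          D[i][j - 1] + 1,
                          D[i - 1][j] + 1)
    return D


if __name__ == "__main__":
    for row in edit_distance_table("WRITERS", "VINTNER"):
        print(" ".join(str(value) for value in row))
```

The cell marked "?" on the slide is D[4][4] in this indexing: WRIT against VINT ends in a match, so it takes its value from D[3][3].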

slide-43
SLIDE 43

        W  R  I  T  E  R  S
     0  1  2  3  4  5  6  7
  V  1  1  2  3  4  5  6  7
  I  2  2  2  2  3  4  5  6
  N  3  3  3  3  3  4  5  6
  T  4  4  4  4  3  4  5  6
  N  5  5  5  5  4  4  5  6
  E  6  6  6  6  5  4  5  6
  R  7  7  6  7  6  5  4  5

Edit distance of S1 and S2: 5 (the entry in the bottom right corner)

slide-44
SLIDE 44

THE TRACEBACK

        W  R  I  T  E  R  S
     0  1  2  3  4  5  6  7
  V  1  1  2  3  4  5  6  7
  I  2  2  2  2  3  4  5  6
  N  3  3  3  3  3  4  5  6
  T  4  4  4  4  3  4  5  6
  N  5  5  5  5  4  4  5  6
  E  6  6  6  6  5  4  5  6
  R  7  7  6  7  6  5  4  5
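The traceback starts in the bottom right cell and repeatedly steps to a neighbour from which the current value could have been obtained, emitting one alignment column per step. The sketch below, with the hypothetical name `align`, returns a single optimal alignment. Preferring the diagonal step when several steps are possible is an arbitrary choice made here, not something the slides prescribe; gaps are written as ASCII `-`.

```python
GAP = "-"


def align(s1: str, s2: str) -> tuple[int, str, str]:
    """Return (edit distance, aligned s1, aligned s2) for one optimal
    global alignment, found by table fill followed by traceback."""
    n1, n2 = len(s1), len(s2)
    D = [[0] * (n2 + 1) for _ in range(n1 + 1)]
    for i in range(1, n1 + 1):
        D[i][0] = i
    for j in range(1, n2 + 1):
        D[0][j] = j
    for i in range(1, n1 + 1):
        for j in range(1, n2 + 1):
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            D[i][j] = min(D[i - 1][j - 1] + t, D[i][j - 1] + 1, D[i - 1][j] + 1)

    # Traceback from the bottom right corner to D[0][0].
    i, j = n1, n2
    top, bottom = [], []
    while i > 0 or j > 0:
        t = 1
        if i > 0 and j > 0 and s1[i - 1] == s2[j - 1]:
            t = 0
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + t:
            top.append(s1[i - 1])        # match or mismatch column
            bottom.append(s2[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and D[i][j] == D[i - 1][j] + 1:
            top.append(s1[i - 1])        # gap in the second sequence
            bottom.append(GAP)
            i -= 1
        else:
            top.append(GAP)              # gap in the first sequence
            bottom.append(s2[j - 1])
            j -= 1
    return D[n1][n2], "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    distance, row1, row2 = align("WRITERS", "VINTNER")
    print(distance)
    print(row1)
    print(row2)
```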

slide-45
SLIDE 45

RETRIEVING COOPTIMAL ALIGNMENTS

WRIT−ERS
VINTNER−
*** *  *

WRI−T−ERS
V−INTNER−
** * *  *

WRI−T−ERS
−VINTNER−
** * *  *

        W  R  I  T  E  R  S
     0  1  2  3  4  5  6  7
  V  1  1  2  3  4  5  6  7
  I  2  2  2  2  3  4  5  6
  N  3  3  3  3  3  4  5  6
  T  4  4  4  4  3  4  5  6
  N  5  5  5  5  4  4  5  6
  E  6  6  6  6  5  4  5  6
  R  7  7  6  7  6  5  4  5
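Cooptimal alignments arise whenever more than one neighbour explains the value of a cell, so the traceback has to follow every such branch. The following sketch, with the hypothetical name `cooptimal_alignments`, enumerates all of them recursively. It is meant for small examples only, since the number of cooptimal alignments can grow very quickly with sequence length.

```python
GAP = "-"


def cooptimal_alignments(s1: str, s2: str) -> list[tuple[str, str]]:
    """Return every optimal global alignment as a (row1, row2) pair."""
    n1, n2 = len(s1), len(s2)
    D = [[0] * (n2 + 1) for _ in range(n1 + 1)]
    for i in range(1, n1 + 1):
        D[i][0] = i
    for j in range(1, n2 + 1):
        D[0][j] = j
    for i in range(1, n1 + 1):
        for j in range(1, n2 + 1):
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            D[i][j] = min(D[i - 1][j - 1] + t, D[i][j - 1] + 1, D[i - 1][j] + 1)

    results = []

    def walk(i: int, j: int, top: str, bottom: str) -> None:
        # top and bottom hold the alignment columns in reverse order.
        if i == 0 and j == 0:
            results.append((top[::-1], bottom[::-1]))
            return
        if i > 0 and j > 0:
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            if D[i][j] == D[i - 1][j - 1] + t:
                walk(i - 1, j - 1, top + s1[i - 1], bottom + s2[j - 1])
        if i > 0 and D[i][j] == D[i - 1][j] + 1:
            walk(i - 1, j, top + s1[i - 1], bottom + GAP)
        if j > 0 and D[i][j] == D[i][j - 1] + 1:
            walk(i, j - 1, top + GAP, bottom + s2[j - 1])

    walk(n1, n2, "", "")
    return results


if __name__ == "__main__":
    for row1, row2 in cooptimal_alignments("WRITERS", "VINTNER"):
        print(row1)
        print(row2)
        print()
```

For WRITERS and VINTNER the enumeration finds the three alignments shown on this slide.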