[PPT] - Dot Plot in Python We want to write a python program that create a PowerPoint Presentation

SLIDE 1

Dot Plot in Python

We want to write a python program that create a

dotoplot. The program should take from command

line:

1. a fasta file with two sequences,
2. a scoring matrix
3. a word size
4. a display threshold

SLIDE 2

Text example Id matrix, wordsize=3, threshold=1

\ S A A A S S A A T T T T A * * * A * * * * * A * * * T * T T

SLIDE 3

matplotlib example Id matrix, wordsize=3, threshold=1

SLIDE 4

Building an internal description of the dotplot

def computeDP(s1,s2,sm,ws,th): ''' s1,s2= sequences, sm= score matrix (as dict), ws=word size, th= cutoff threshold ''' dp=set() for i in range(len(s1)-ws+1): for j in range(len(s2)-ws+1): sc=0.0 for k in range(ws): sc=sc+sm.get((s1[i+k],s2[j+k]), 0.0) if scoreij >= th: dp.add((i,j)) return dp

SLIDE 5

Printing in in text format

def printDP(s1,s2,dp,shift=0): ''' s1,s2=sequences, dp=dotplot, shif=shifting print positions''' print "\\", for j in range(len(s2)): print s2[j], print for i in range(len(s1)): print s1[i], for j in range(len(s2)): pc=" " if (i-shift,j-shift) in dp : pc="*" print pc, print

SLIDE 6

Printing using matplotlib

def printMDP(s1,s2,dp,shift=0): import matplotlib.pyplot as plt L1=len(s1) L2=len(s2) plt.axis([0, L1+1, 0, L2+1]) x,y=[],[] for i,j in dp: x.append(i+shift+1) y.append(j+shift+1) plt.plot(x, y, 'ro') plt.show()

SLIDE 7

Computational complexity is quadratic: O(|s1||s2|ws)

def computeDP(s1,s2,sm,ws,th): ''' s1,s2= sequences, sm= score matrix (as dict), ws=word size, th= cutoff threshold ''' dp=set() for i in range(len(s1)-ws+1): for j in range(len(s2)-ws+1): sc=0.0 for k in range(ws): sc=sc+sm.get((s1[i+k],s2[j+k]), 0.0) if scoreij >= th: dp.add((i,j)) return dp

SLIDE 8

Computational complexity is quadratic: O(|s1||s2|ws)

Is it possible to improve the computational complexity?

SLIDE 9

Idea: pre-process one sequence in order to know in advance the possible matches, then scan the

ther sequence.

SLIDE 10

Example: assume you have a function getSimpos

>>> s1="TTTAAATTT" >>> s2="SAAASSAAT" >>> al,sm=readScoreMatrix('id.mat') >>> getSimpos(s1,al,sm,ws=1,th=0) {'A': [3, 4, 5], 'T': [0, 1, 2, 6, 7, 8]} >>> getSimpos(s1,al,sm,ws=3,th=3) {'AAA': [3], 'TTT': [0, 6], 'AAT': [4], 'ATT': [5], 'TAA': [2], 'TTA': [1]} >>>

SLIDE 11

With the preprocess given by getSimpos, it is possible to know for each word in the second sequence if there is match with the first and directly in which positions!

>>> s1="TTTAAATTT" >>> s2="SAAASSAAT" >>> ws,th = 3,2 >>> simpos=getSimpos(s1,al,sm,ws,th) >>> simpos {'AAA': [3], 'TTT': [0, 6], 'AAT': [4], 'ATT': [5], 'TAA': [2], 'TTA': [1]} >>> for j in range(len(s2)-ws+1): ... word=s2[j:j+ws] ... for i in simpos.get(word,[]): ... print i,j,s1[i:i+ws],s2[j:j+ws] ... 3 1 AAA AAA 4 6 AAT AAT >>>

SLIDE 12

getSimpos: simplified version

def getSimpos_simple(s1,sm,ws,th): ''' s1=sequence, ws=score matrix, ws=word size, th=threshold ''' simpos={} for i in range(len(query)-ws+1): word=query[i:i+ws] simpos[word]=simpos.get(word,[]) simpos[word].append(i) return simpos It works only for identity matrix! Why??

SLIDE 13

Blosum62 has more possible matches!

>>> al,sm=readScoreMatrix('id.mat') >>> s1 'TTTAAATTT' >>> getSimpos(s1,al,sm,ws=1,th=1) {'A': [3, 4, 5], 'T': [0, 1, 2, 6, 7, 8]} >>> al,sm=readScoreMatrix('blosum62.mat') >>> getSimpos(s1,al,sm,ws=1,th=1) {'A': [3, 4, 5], 'S': [0, 1, 2, 3, 4, 5, 6, 7, 8], 'T': [0, 1, 2, 6, 7, 8]} >>>

SLIDE 14

Blosum62

A R N D C Q E G H I L K M F P S T W Y V A 4 -1 -2 -2 0 -1 -1 0 -2 -1 -1 -1 -1 -2 -1 1 0 -3 -2 0 R -1 5 0 -2 -3 1 0 -2 0 -3 -2 2 -1 -3 -2 -1 -1 -3 -2 -3 N -2 0 6 1 -3 0 0 0 1 -3 -3 0 -2 -3 -2 1 0 -4 -2 -3 D -2 -2 1 6 -3 0 2 -1 -1 -3 -4 -1 -3 -3 -1 0 -1 -4 -3 -3 C 0 -3 -3 -3 9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 Q -1 1 0 0 -3 5 2 -2 0 -3 -2 1 0 -3 -1 0 -1 -2 -1 -2 E -1 0 0 2 -4 2 5 -2 0 -3 -3 1 -2 -3 -1 0 -1 -3 -2 -2 G 0 -2 0 -1 -3 -2 -2 6 -2 -4 -4 -2 -3 -3 -2 0 -2 -2 -3 -3 H -2 0 1 -1 -3 0 0 -2 8 -3 -3 -1 -2 -1 -2 -1 -2 -2 2 -3 I -1 -3 -3 -3 -1 -3 -3 -4 -3 4 2 -3 1 0 -3 -2 -1 -3 -1 3 L -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4 -2 2 0 -3 -2 -1 -2 -1 1 K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5 -1 -3 -1 0 -1 -3 -2 -2 M -1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5 0 -2 -1 -1 -1 -1 1 F -2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6 -4 -2 -2 1 3 -1 P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7 -1 -1 -4 -3 -2 S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4 1 -3 -2 -2 T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5 -2 -2 0 W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11 2 -3 Y -2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7 -1 V 0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4

SLIDE 15

Idea: add a dictionary that takes care of all the words that we can predict will have a score higher than a threshold for a given matrix and a given word in our sequence.

SLIDE 16

Given a word: find all high scoring similar words: itertools.product

>>> import itertools >>> alphabet='AB' >>> ws=3 >>> for e in itertools.product(alphabet,repeat=ws): ... print "".join(e), e ... AAA ('A', 'A', 'A') AAB ('A', 'A', 'B') ABA ('A', 'B', 'A') ABB ('A', 'B', 'B') BAA ('B', 'A', 'A') BAB ('B', 'A', 'B') BBA ('B', 'B', 'A') BBB ('B', 'B', 'B')

SLIDE 17

Given a word: find all high scoring similar words:

def findSumWords(word,alphabet,sm,th): ''' findSumWords(word,alphabet,sm,th) ''' Similar=[] # list of similar words for w in itertools.product(alphabet,repeat=len(word)): cscore=0.0 for i in range(len(word)): cscore+=sm[word[i],w[i]] if cscore >= th: similar.append("".join(w)) return similar

SLIDE 18

GetSimpos

def getSimpos(query,alphabet,sm,ws,th): simpos={} # stores positions of a given matching word simwords={} # for each used word, store similar words for i in range(len(query)-ws+1): word=query[i:i+ws] # create word # test if we searched for similar words if not simwords.has_key(word): simwords[word]=findSumWords(word,alphabet,sm,th) for k in simwords[word]: simpos[k]=simpos.get(k,[]) simpos[k].append(i) return simpos

SLIDE 19

Finally: find all the dotplot matches!

def lcomputeDP(s1,s2,alphabet,sm,ws,th): ''' s1,s2=sequences, sm=score matrix ws=word size, th=threshold ''' dp=set() # initialize dotplot # process first sequence simpos=getSimpos(s1,alphabet,sm,ws,th) for j in range(len(s2)-ws+1): word=s2[j:j+ws] for i in simpos.get(word,[]): dp.add((i+hw,j+hw)) return dp

SLIDE 20

Computational complexity ??? Blue + Red

def lcomputeDP(s1,s2,alphabet,sm,ws,th): ''' s1,s2=sequences, sm=score matrix ws=word size, th=threshold ''' dp=set() # initialize dotplot # process first sequence simpos=getSimpos(s1,alphabet,sm,ws,th) for j in range(len(s2)-ws+1): # loop 1 word=s2[j:j+ws] for i in simpos.get(word,[]): # loop 2 dp.add((i+hw,j+hw)) return dp

SLIDE 21

Computational complexity:Blue

def getSimpos(query,alphabet,sm,ws,th): simpos={} # stores positions of a given matching word simwords={} # for each used word, store similar words for i in range(len(query)-ws+1): word=query[i:i+ws] # create word # test if we searched for similar words if not simwords.has_key(word): simwords[word]=findSumWords(word,alphabet,sm,th) for k in simwords[word]: simpos[k]=simpos.get(k,[]) simpos[k].append(i) return simpos

O(|s1|dictionary) = O(cn) ~ O(n) with a big constant c!

SLIDE 22

Computational complexity:Red

for j in range(len(s2)-ws+1): # loop 1 word=s2[j:j+ws] for i in simpos.get(word,[]): # loop 2 dp.add((i+hw,j+hw))

O(|s2|*simpos) => worst case O(n2) !!! But In the average case with reasonable threshold end score The number of positions that matches can be considered constant on average, so for average case: O(n) !

SLIDE 23

Computational complexity ??? Blue + Red

def lcomputeDP(s1,s2,alphabet,sm,ws,th): ''' s1,s2=sequences, sm=score matrix ws=word size, th=threshold ''' dp=set() # initialize dotplot # process first sequence simpos=getSimpos(s1,alphabet,sm,ws,th) for j in range(len(s2)-ws+1): # loop 1 word=s2[j:j+ws] for i in simpos.get(word,[]): # loop 2 dp.add((i+hw,j+hw)) return dp

O(n)+O(n) = O(n) !

SLIDE 24

Experimental time performace of the two implementations

* dotplot ^ lindotplot

SLIDE 25

Exercise:

Use the lin_dotplot idea to find local ungapped alignments:

1. use the dotplot pairs and link them together in a single

alignment if the pairs are on diagonals ( eg. (i-1,j-1), (i,j), (i+1,j+1), …)

2. extend the idea to process a sequence and search similar

Dot Plot in Python We want to write a python program that create a - - PowerPoint PPT Presentation

Dot Plot in Python

We want to write a python program that create a

line:

Text example Id matrix, wordsize=3, threshold=1

\ S A A A S S A A T T T T A * * * A * * * * * A * * * T * T T

matplotlib example Id matrix, wordsize=3, threshold=1

Building an internal description of the dotplot

Printing in in text format

Printing using matplotlib

def printMDP(s1,s2,dp,shift=0): import matplotlib.pyplot as plt L1=len(s1) L2=len(s2) plt.axis([0, L1+1, 0, L2+1]) x,y=[],[] for i,j in dp: x.append(i+shift+1) y.append(j+shift+1) plt.plot(x, y, 'ro') plt.show()

Computational complexity is quadratic: O(|s1||s2|ws)

Computational complexity is quadratic: O(|s1||s2|ws)

Is it possible to improve the computational complexity?

Idea: pre-process one sequence in order to know in advance the possible matches, then scan the

Example: assume you have a function getSimpos

With the preprocess given by getSimpos, it is possible to know for each word in the second sequence if there is match with the first and directly in which positions!

getSimpos: simplified version

def getSimpos_simple(s1,sm,ws,th): ''' s1=sequence, ws=score matrix, ws=word size, th=threshold ''' simpos={} for i in range(len(query)-ws+1): word=query[i:i+ws] simpos[word]=simpos.get(word,[]) simpos[word].append(i) return simpos It works only for identity matrix! Why??

Blosum62 has more possible matches!

>>> al,sm=readScoreMatrix('id.mat') >>> s1 'TTTAAATTT' >>> getSimpos(s1,al,sm,ws=1,th=1) {'A': [3, 4, 5], 'T': [0, 1, 2, 6, 7, 8]} >>> al,sm=readScoreMatrix('blosum62.mat') >>> getSimpos(s1,al,sm,ws=1,th=1) {'A': [3, 4, 5], 'S': [0, 1, 2, 3, 4, 5, 6, 7, 8], 'T': [0, 1, 2, 6, 7, 8]} >>>

Blosum62

Idea: add a dictionary that takes care of all the words that we can predict will have a score higher than a threshold for a given matrix and a given word in our sequence.

Given a word: find all high scoring similar words: itertools.product

Given a word: find all high scoring similar words:

GetSimpos

Finally: find all the dotplot matches!

Computational complexity ??? Blue + Red

Computational complexity:Blue

O(|s1|dictionary) = O(cn) ~ O(n) with a big constant c!

Computational complexity:Red

O(|s2|*simpos) => worst case O(n2) !!! But In the average case with reasonable threshold end score The number of positions that matches can be considered constant on average, so for average case: O(n) !

Computational complexity ??? Blue + Red

O(n)+O(n) = O(n) !

Experimental time performace of the two implementations

* dotplot ^ lindotplot

Exercise:

Use the lin_dotplot idea to find local ungapped alignments:

alignment if the pairs are on diagonals ( eg. (i-1,j-1), (i,j), (i+1,j+1), …)

local alignments in a set of sequences.

Dot Plot in Python

We want to write a python program that create a

line:

Text example Id matrix, wordsize=3, threshold=1

\ S A A A S S A A T T T T A * * * A * * * * * A * * * T * T T

matplotlib example Id matrix, wordsize=3, threshold=1

Building an internal description of the dotplot

Printing in in text format

Printing using matplotlib

def printMDP(s1,s2,dp,shift=0): import matplotlib.pyplot as plt L1=len(s1) L2=len(s2) plt.axis([0, L1+1, 0, L2+1]) x,y=[],[] for i,j in dp: x.append(i+shift+1) y.append(j+shift+1) plt.plot(x, y, 'ro') plt.show()

Computational complexity is quadratic: O(|s1|*|s2|*ws)

Computational complexity is quadratic: O(|s1|*|s2|*ws)

Is it possible to improve the computational complexity?

Idea: pre-process one sequence in order to know in advance the possible matches, then scan the

Example: assume you have a function getSimpos

With the preprocess given by getSimpos, it is possible to know for each word in the second sequence if there is match with the first and directly in which positions!

getSimpos: simplified version

def getSimpos_simple(s1,sm,ws,th): ''' s1=sequence, ws=score matrix, ws=word size, th=threshold ''' simpos={} for i in range(len(query)-ws+1): word=query[i:i+ws] simpos[word]=simpos.get(word,[]) simpos[word].append(i) return simpos It works only for identity matrix! Why??

Blosum62 has more possible matches!

>>> al,sm=readScoreMatrix('id.mat') >>> s1 'TTTAAATTT' >>> getSimpos(s1,al,sm,ws=1,th=1) {'A': [3, 4, 5], 'T': [0, 1, 2, 6, 7, 8]} >>> al,sm=readScoreMatrix('blosum62.mat') >>> getSimpos(s1,al,sm,ws=1,th=1) {'A': [3, 4, 5], 'S': [0, 1, 2, 3, 4, 5, 6, 7, 8], 'T': [0, 1, 2, 6, 7, 8]} >>>

Blosum62

Idea: add a dictionary that takes care of all the words that we can predict will have a score higher than a threshold for a given matrix and a given word in our sequence.

Given a word: find all high scoring similar words: itertools.product

Given a word: find all high scoring similar words:

GetSimpos

Finally: find all the dotplot matches!

Computational complexity ??? Blue + Red

Computational complexity:Blue

O(|s1|*dictionary) = O(c*n) ~ O(n) with a big constant c!

Computational complexity:Red

O(|s2|*simpos) => worst case O(n2) !!! But In the average case with reasonable threshold end score The number of positions that matches can be considered constant on average, so for average case: O(n) !

Computational complexity ??? Blue + Red

O(n)+O(n) = O(n) !

Experimental time performace of the two implementations

* dotplot ^ lindotplot

Exercise:

Use the lin_dotplot idea to find local ungapped alignments:

alignment if the pairs are on diagonals ( eg. (i-1,j-1), (i,j), (i+1,j+1), …)

local alignments in a set of sequences.

Computational complexity is quadratic: O(|s1||s2|ws)

Computational complexity is quadratic: O(|s1||s2|ws)

O(|s1|dictionary) = O(cn) ~ O(n) with a big constant c!