Lecture 4 Sequence alignment: how to discover similarities between biological sequences



slide-1
SLIDE 1

Lecture 4 Sequence alignment: how to discover similarities between biological sequences

Chapter 6 in Jones and Pevzner Fall 2019 September 10, 2019

slide-2
SLIDE 2

Evolution as a tool for biological insight

  • “Nothing in biology makes sense except in the light of evolution” - Theodosius Dobzhansky.
  • The function of many genes is virtually the same across many organisms, so we can understand biology by studying organisms simpler than ourselves (“model organisms”).

slide-3
SLIDE 3

Homology

  • Genes in organisms A and B that have evolved from the same ancestral gene are said to be homologs.
  • Homology between genes typically indicates conserved function.
  • Sequence similarity is used to infer homology.
slide-4
SLIDE 4

Sequence Comparison: Early Success Story

  • In 1983 Russell Doolittle and colleagues found similarities between a cancer-causing gene from the Simian Sarcoma virus and a normal growth factor gene (PDGF).
  • Finding sequence similarities with genes of known function is a common approach to inferring the function of a newly sequenced gene.

slide-5
SLIDE 5

The Drosophila “eyeless” gene

  • W. Gehring discovered that turning on the “eyeless” gene in Drosophila leads to the growth of ectopic eyes.
  • “eyeless” is a master control gene for eye formation (a transcription factor).

slide-6
SLIDE 6

A similar gene in humans

  • The aniridia gene in humans has a sequence that is similar to the Drosophila eyeless gene.
  • Eye morphogenesis is under similar genetic control in vertebrates and insects.

slide-7
SLIDE 7

  5 HSGVNQLGGVFVNGRPLPDSTRQKIVELAHSGARPCDISRILQVSNGCVS 54
    ||||||||||||.|||||||||||||||||||||||||||||||||||||
 57 HSGVNQLGGVFVGGRPLPDSTRQKIVELAHSGARPCDISRILQVSNGCVS 106

 55 KILGRYYETGSIRPRAIGGSKPRVATPEVVSKIAQYKRECPSIFAWEIRD 104
    ||||||||||||||||||||||||||.||||||:||||||||||||||||
107 KILGRYYETGSIRPRAIGGSKPRVATAEVVSKISQYKRECPSIFAWEIRD 156

105 RLLSEGVCTNDNIPSVSSINRVLRNLASEKQQMGA--------------- 139
    |||.|.|||||||||||||||||||||::|:|...
157 RLLQENVCTNDNIPSVSSINRVLRNLAAQKEQQSTGSGSSSTSAGNSISA 206

155 -----------SWGTR---PGWYPGTSVPGQPTQ---------------- 174
               ||..|   ..||| ||:...|..
307 NHQALQQHQQQSWPPRHYSGSWYP-TSLSEIPISSAPNIASVTAYASGPS 355

175 ------------------------------------DGCQQQE---GGGE 185
                                        ||.|..|   |.||
356 LAHSLSPPNDIESLASIGHQRNCPVATEDIHLKKELDGHQSDETGSGEGE 405

186 NTNSISSNGEDSDEAQMRLQLKRKLQRNRTSFTQEQIEALEKEFERTHYP 235
    |:|..:||..::::.|.||.|||||||||||||.:||::|||||||||||
406 NSNGGASNIGNTEDDQARLILKRKLQRNRTSFTNDQIDSLEKEFERTHYP 455


PAX6_HUMAN aligned against PAX6_DRO

slide-8
SLIDE 8

Sequence alignment

Two sequences:

AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC

One possible alignment:

-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

Definition: Given two strings $v = v_1 v_2 \dots v_m$ and $w = w_1 w_2 \dots w_n$, an alignment is an assignment of gaps to positions $0, \dots, m$ in $v$ and $0, \dots, n$ in $w$, so as to line up each letter in one sequence with either a letter or a gap in the other sequence.
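The definition can be checked mechanically. The sketch below is illustrative only: the helper name `is_valid_alignment` is hypothetical, and it assumes the first aligned row starts with a gap character, which is what makes both rows the same length (37 columns).

```python
def is_valid_alignment(row_v, row_w, v, w, gap="-"):
    """Check that (row_v, row_w) is an alignment of the strings v and w."""
    # Both rows must have the same number of columns.
    if len(row_v) != len(row_w):
        return False
    # No column may line up a gap with a gap.
    if any(a == gap and b == gap for a, b in zip(row_v, row_w)):
        return False
    # Deleting the gaps must give back the original strings.
    return row_v.replace(gap, "") == v and row_w.replace(gap, "") == w


v = "AGGCTATCACCTGACCTCCAGGCCGATGCCC"
w = "TAGCTATCACGACCGCGGTCGATTTGCCCGAC"
# Assumption: the first row begins with a gap, so both rows have 37 columns.
row_v = "-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---"
row_w = "TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC"
print(is_valid_alignment(row_v, row_w, v, w))
```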

slide-9
SLIDE 9

Mutations at the DNA level

…ACGGTGCAGTTACCA…
…AC----CAGTCCACCA…

  • Sequence edits: substitution, deletion
  • Rearrangements: inversion, translocation, duplication

slide-10
SLIDE 10

Scoring an alignment

  • A simple scoring scheme:
    • Penalize each mismatch by μ
    • Penalize each indel by σ
    • Reward each match with +1
  • Resulting score:

#matches - μ (#mismatches) - σ (#indels)

  • Objective: find the best-scoring alignment
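A minimal sketch of this scoring scheme, assuming an alignment is given as two equal-length rows with "-" as the gap character. The function name `score_alignment` and the example rows are illustrative choices, not part of the lecture.

```python
def score_alignment(row_v, row_w, mu, sigma, gap="-"):
    """Return #matches - mu * #mismatches - sigma * #indels."""
    matches = mismatches = indels = 0
    for a, b in zip(row_v, row_w):
        if a == gap or b == gap:
            indels += 1          # a letter lined up with a gap
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    return matches - mu * mismatches - sigma * indels


# 5 matches, 0 mismatches, 4 indels: 5 - 0 - 4 = 1 when mu = sigma = 1.
print(score_alignment("AT-GTTAT-", "ATCGT-A-C", mu=1, sigma=1))
```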
slide-11
SLIDE 11

Number of pairwise alignments

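  • Given sequences of length m and n, the number of alignments is:

$$\sum_{k=0}^{\min(m,n)} \binom{m}{k} \binom{n}{k} = \binom{n+m}{n}$$

  • For two sequences of length n:

$$\binom{2n}{n} = \frac{(2n)!}{(n!)^2} \approx \frac{2^{2n}}{\sqrt{\pi n}}$$

Derived using Stirling’s approximation:

$$n! \approx \sqrt{2 \pi n} \left( \frac{n}{e} \right)^n$$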
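As a quick numerical check of the counting formula, the sketch below compares the sum over k with the closed form and with the Stirling-based estimate. The function names are illustrative.

```python
from math import comb, pi, sqrt


def count_alignments(m, n):
    """Sum over k of C(m, k) * C(n, k), for k = 0 .. min(m, n)."""
    return sum(comb(m, k) * comb(n, k) for k in range(min(m, n) + 1))


def stirling_estimate(n):
    """Approximate C(2n, n) by 2^(2n) / sqrt(pi * n)."""
    return 2 ** (2 * n) / sqrt(pi * n)


for n in (5, 10, 20):
    exact = count_alignments(n, n)
    assert exact == comb(2 * n, n)   # the sum equals the closed form
    print(n, exact, round(stirling_estimate(n)))
```

Even for n = 20 the count exceeds 10^11, which is why enumerating all alignments is not an option.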

slide-12
SLIDE 12

Substrings and subsequences

Definition: A string x' is a substring of a string x if x = ux'v for some prefix string u and suffix string v ($x' = x_i \dots x_j$ for some $1 \le i \le j \le |x|$).

A string x' is a subsequence of a string x if x' can be obtained from x by deleting 0 or more letters ($x' = x_{i_1} \dots x_{i_k}$ for some $1 \le i_1 < \dots < i_k \le |x|$).

Note: a substring is always a subsequence.

Example: x = abracadabra
  • y = cadabr: substring
  • z = brcdbr: subsequence, not substring
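The two definitions translate directly into short checks. This is a sketch with illustrative function names; the subsequence test scans x once from left to right.

```python
def is_substring(x_prime, x):
    """True if x = u + x_prime + v for some strings u and v."""
    return x_prime in x


def is_subsequence(x_prime, x):
    """True if x_prime can be obtained from x by deleting 0 or more letters."""
    letters = iter(x)
    # "c in letters" consumes the iterator up to the first occurrence of c,
    # so successive letters of x_prime are matched in left-to-right order.
    return all(c in letters for c in x_prime)


x = "abracadabra"
print(is_substring("cadabr", x), is_subsequence("cadabr", x))
print(is_substring("brcdbr", x), is_subsequence("brcdbr", x))
```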

slide-13
SLIDE 13

Encoding alignment as a path in a 2-d grid

elements of v: A T - C - T G A T C
elements of w: - T G C A T - A - C

i coords: 0 1 2 2 3 3 4 5 6 7 8
j coords: 0 0 1 2 3 4 5 5 6 6 7

Every alignment is a path in a 2-D grid:

(0,0) → (1,0) → (2,1) → (2,2) → (3,3) → (3,4) → (4,5) → (5,5) → (6,6) → (7,6) → (8,7)
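The encoding can be sketched as follows: walk the alignment column by column, advancing i when the column uses a letter of v and j when it uses a letter of w. The function name is illustrative, and the two gapped rows are an assumption reconstructed from the path listed on this slide, with v = ATCTGATC and w = TGCATAC.

```python
def alignment_to_path(row_v, row_w, gap="-"):
    """Return the grid coordinates (i, j) visited by an alignment."""
    i = j = 0
    path = [(0, 0)]
    for a, b in zip(row_v, row_w):
        if a != gap:
            i += 1   # the column consumes a letter of v
        if b != gap:
            j += 1   # the column consumes a letter of w
        path.append((i, j))
    return path


# Assumed rows, reconstructed from the path on the slide.
row_v = "AT-C-TGATC"
row_w = "-TGCAT-A-C"
print(alignment_to_path(row_v, row_w))
```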

slide-14
SLIDE 14

Alignment as a path

[Figure: alignment grid with the sequences T G C A T A C (positions 1 to 7) and A T C T G A T C (positions 1 to 8) along the axes labeled i and j.]

slide-15
SLIDE 15

Alignment as a Path in the Edit Graph

i:  0 1 2 2 3 4 5 6 7 7
v:    A T - G T T A T -
w:    A T C G T - A - C
j:  0 1 2 3 4 5 5 6 6 7

  • Corresponding path:

(0,0), (1,1), (2,2), (2,3), (3,4), (4,5), (5,5), (6,6), (7,6), (7,7)
slide-16
SLIDE 16

Alignment as a Path in the Edit Graph

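Horizontal (→) and vertical (↓) edges represent indels in v and w, with score -1. Diagonal (↘) edges represent matches, with score 1. The alignment with path (0,0), (1,1), (2,2), (2,3), (3,4), (4,5), (5,5), (6,6), (7,6), (7,7) has 5 matches and 4 indels, so its score is 5 - 4 = 1.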
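Going the other way, a path can be turned back into an alignment and scored edge by edge. This is a sketch with illustrative function names, using v = ATGTTAT and w = ATCGTAC and the path listed on the earlier slide. Scoring a diagonal mismatch edge as -mu is an assumption, since the slide only covers matches and indels.

```python
def path_to_alignment(path, v, w, gap="-"):
    """Rebuild the two gapped rows from a list of (i, j) grid coordinates."""
    row_v, row_w = [], []
    for (i0, j0), (i1, j1) in zip(path, path[1:]):
        step = (i1 - i0, j1 - j0)
        if step == (1, 1):        # diagonal edge: letter against letter
            row_v.append(v[i0])
            row_w.append(w[j0])
        elif step == (1, 0):      # letter of v against a gap
            row_v.append(v[i0])
            row_w.append(gap)
        elif step == (0, 1):      # letter of w against a gap
            row_v.append(gap)
            row_w.append(w[j0])
        else:
            raise ValueError(f"not an edge of the edit graph: {step}")
    return "".join(row_v), "".join(row_w)


def score_path(path, v, w, mu=1, sigma=1):
    """Sum edge scores along a valid path: +1 match, -mu mismatch, -sigma indel."""
    score = 0
    for (i0, j0), (i1, j1) in zip(path, path[1:]):
        if (i1 - i0, j1 - j0) == (1, 1):
            score += 1 if v[i0] == w[j0] else -mu   # -mu is an assumption
        else:
            score -= sigma
    return score


v, w = "ATGTTAT", "ATCGTAC"
path = [(0, 0), (1, 1), (2, 2), (2, 3), (3, 4), (4, 5), (5, 5), (6, 6), (7, 6), (7, 7)]
print(path_to_alignment(path, v, w))
print(score_path(path, v, w))
```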

slide-17
SLIDE 17

Alignment as a Path in the Edit Graph

Every path in the edit graph corresponds to an alignment:

slide-18
SLIDE 18

Alignment algorithms we will cover

  • Global alignment
  • Local alignment
  • Alignment with affine gap penalties
  • Scoring matrices
slide-19
SLIDE 19

Our simple scoring scheme

  • The score when each mismatch is penalized by μ, each indel is penalized by σ, and each match is rewarded by +1:

#matches - μ (#mismatches) - σ (#indels)

slide-20
SLIDE 20

Global Alignment: The Needleman-Wunsch algorithm [1]

Find the best alignment between two strings under our scoring scheme.

Input: Strings v and w and a scoring scheme
Output: Maximum scoring alignment

$$s_{i,j} = \max \begin{cases} s_{i-1,j-1} + 1 & \text{if } v_i = w_j \\ s_{i-1,j-1} - \mu & \text{if } v_i \neq w_j \\ s_{i-1,j} - \sigma \\ s_{i,j-1} - \sigma \end{cases}$$

$s_{i,j}$: the score of the best alignment of the length-i prefix of v and the length-j prefix of w

μ: mismatch penalty

σ: indel penalty

[1] A general method applicable to the search for similarities in the amino acid sequence of two proteins, J Mol Biol. 48(3):443-53, 1970.

slide-21
SLIDE 21

Needleman Wunsch (cont)

  • What about the base case?
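
One natural answer, consistent with the initialization used in the DP pseudocode of this lecture: a length-i prefix aligned against the empty string consists of i indels, so

$$s_{0,0} = 0, \qquad s_{i,0} = -\sigma \, i \quad (1 \le i \le m), \qquad s_{0,j} = -\sigma \, j \quad (1 \le j \le n)$$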
slide-22
SLIDE 22

NW as a DP algorithm

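def NW(v, w, sigma, mu):
    m, n = len(v), len(w)
    # s[i][j]: best score for the length-i prefix of v and the length-j prefix of w
    s = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(0, m + 1):
        s[i][0] = -sigma * i
    for j in range(0, n + 1):
        s[0][j] = -sigma * j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diag = s[i-1][j-1] + (1 if v[i-1] == w[j-1] else -mu)
            s[i][j] = max(diag, s[i-1][j] - sigma, s[i][j-1] - sigma)
    return s[m][n]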

Runtime: O(nm) Memory: O(nm)

slide-23
SLIDE 23

Now What?

  • The DP algorithm created the alignment grid.
  • To read the best alignment: follow the pointers back from the sink.
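A minimal sketch of the traceback, assuming each cell stores a pointer to the predecessor that produced its maximum. The function name, the pointer labels, and the default penalties mu = sigma = 1 are illustrative choices. Ties between equally good predecessors are broken arbitrarily here, so other optimal alignments may exist.

```python
def needleman_wunsch(v, w, mu=1, sigma=1):
    """Return (best score, gapped v, gapped w) for a global alignment."""
    m, n = len(v), len(w)
    s = [[0] * (n + 1) for _ in range(m + 1)]
    ptr = [[None] * (n + 1) for _ in range(m + 1)]
    # Base cases: a prefix aligned against the empty string is all indels.
    for i in range(1, m + 1):
        s[i][0], ptr[i][0] = -sigma * i, "up"
    for j in range(1, n + 1):
        s[0][j], ptr[0][j] = -sigma * j, "left"
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diag = s[i - 1][j - 1] + (1 if v[i - 1] == w[j - 1] else -mu)
            # max over (score, pointer) pairs keeps the pointer of the winner.
            s[i][j], ptr[i][j] = max(
                (diag, "diag"),
                (s[i - 1][j] - sigma, "up"),
                (s[i][j - 1] - sigma, "left"),
            )
    # Follow the pointers from the sink (m, n) back to the source (0, 0).
    row_v, row_w = [], []
    i, j = m, n
    while ptr[i][j] is not None:
        move = ptr[i][j]
        if move == "diag":
            i, j = i - 1, j - 1
            row_v.append(v[i])
            row_w.append(w[j])
        elif move == "up":
            i -= 1
            row_v.append(v[i])
            row_w.append("-")
        else:
            j -= 1
            row_v.append("-")
            row_w.append(w[j])
    return s[m][n], "".join(reversed(row_v)), "".join(reversed(row_w))


score, aligned_v, aligned_w = needleman_wunsch("ATGTTAT", "ATCGTAC")
print(score)
print(aligned_v)
print(aligned_w)
```

The rows are built in reverse because the traceback starts at the sink, which is why they are reversed before being returned.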

slide-24
SLIDE 24

Scoring Matrices

To generalize scoring, we use a scoring matrix δ.

Size of the matrix:
  • Alignment of DNA sequences: (4+1) x (4+1)
  • Alignment of amino acids: (20+1) x (20+1)

The additional row/column holds the scores for the gap character “-”.

$$s_{i,j} = \max \begin{cases} s_{i-1,j-1} + \delta(v_i, w_j) \\ s_{i-1,j} + \delta(v_i, -) \\ s_{i,j-1} + \delta(-, w_j) \end{cases}$$
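A sketch of the generalized recurrence, assuming δ is stored as a dictionary keyed by pairs of characters, with "-" as the gap character. The names `global_alignment_score` and `simple_delta` are illustrative. `simple_delta` builds a DNA matrix that reproduces the simple scheme (+1 match, -μ mismatch, -σ indel) and leaves out the unused gap-versus-gap entry.

```python
def simple_delta(alphabet="ACGT", mu=1, sigma=1, gap="-"):
    """Scoring matrix equivalent to the simple scheme, as a dict of pairs."""
    delta = {}
    for a in alphabet:
        delta[(a, gap)] = -sigma
        delta[(gap, a)] = -sigma
        for b in alphabet:
            delta[(a, b)] = 1 if a == b else -mu
    return delta


def global_alignment_score(v, w, delta, gap="-"):
    """Best global alignment score of v and w under the scoring matrix delta."""
    m, n = len(v), len(w)
    s = [[0] * (n + 1) for _ in range(m + 1)]
    # Base cases: accumulate the gap scores along the first column and row.
    for i in range(1, m + 1):
        s[i][0] = s[i - 1][0] + delta[(v[i - 1], gap)]
    for j in range(1, n + 1):
        s[0][j] = s[0][j - 1] + delta[(gap, w[j - 1])]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s[i][j] = max(
                s[i - 1][j - 1] + delta[(v[i - 1], w[j - 1])],
                s[i - 1][j] + delta[(v[i - 1], gap)],
                s[i][j - 1] + delta[(gap, w[j - 1])],
            )
    return s[m][n]


print(global_alignment_score("ATGTTAT", "ATCGTAC", simple_delta()))
```

Swapping in a different dictionary, for example one built from a published amino acid substitution matrix, changes the scoring without touching the recurrence.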