
SLIDE 1

Lecture 4 Sequence alignment: how to discover similarities between biological sequences

Chapter 6 in Jones and Pevzner Fall 2019 September 10, 2019

SLIDE 2

Evolution as a tool for biological insight

“Nothing in biology makes sense except in the light of evolution” - Theodosius Dobzhansky.

The functionality of many genes is virtually the same among many organisms: we can understand biology by studying organisms simpler than ourselves (“model organisms”).

SLIDE 3

Homology

Genes in organisms A and B that have evolved from the same ancestral gene are said to be homologs.

Homology between genes typically indicates conserved function.

Sequence similarity is used to infer homology.

SLIDE 4

Sequence Comparison: Early Success Story

In 1983 Russell Doolittle and colleagues found similarities between a cancer-causing gene from the Simian Sarcoma virus and a normal growth factor gene (PDGF).

Finding sequence similarities with genes of known function is a common approach to infer a newly sequenced gene’s function.

SLIDE 5

The Drosophila “eyeless” gene

W. Gehring discovered that turning on the “eyeless” gene in Drosophila leads to the growth of ectopic eyes.

“eyeless” is a master control gene for eye formation (it encodes a transcription factor).

SLIDE 6

A similar gene in humans

The aniridia gene in humans has a sequence that is similar to the Drosophila eyeless gene.

Eye morphogenesis is under similar genetic control in vertebrates and insects.

SLIDE 7

  5 HSGVNQLGGVFVNGRPLPDSTRQKIVELAHSGARPCDISRILQVSNGCVS 54
    ||||||||||||.|||||||||||||||||||||||||||||||||||||
 57 HSGVNQLGGVFVGGRPLPDSTRQKIVELAHSGARPCDISRILQVSNGCVS 106

 55 KILGRYYETGSIRPRAIGGSKPRVATPEVVSKIAQYKRECPSIFAWEIRD 104
    ||||||||||||||||||||||||||.||||||:||||||||||||||||
107 KILGRYYETGSIRPRAIGGSKPRVATAEVVSKISQYKRECPSIFAWEIRD 156

105 RLLSEGVCTNDNIPSVSSINRVLRNLASEKQQMGA--------------- 139
    |||.|.|||||||||||||||||||||::|:|...
157 RLLQENVCTNDNIPSVSSINRVLRNLAAQKEQQSTGSGSSSTSAGNSISA 206

155 -----------SWGTR---PGWYPGTSVPGQPTQ---------------- 174
               ||..|   ..||| ||:...|..
307 NHQALQQHQQQSWPPRHYSGSWYP-TSLSEIPISSAPNIASVTAYASGPS 355

175 ------------------------------------DGCQQQE---GGGE 185
                                        ||.|..|   |.||
356 LAHSLSPPNDIESLASIGHQRNCPVATEDIHLKKELDGHQSDETGSGEGE 405

186 NTNSISSNGEDSDEAQMRLQLKRKLQRNRTSFTQEQIEALEKEFERTHYP 235
    |:|..:||..::::.|.||.|||||||||||||.:||::|||||||||||
406 NSNGGASNIGNTEDDQARLILKRKLQRNRTSFTNDQIDSLEKEFERTHYP 455

PAX6_HUMAN aligned against PAX6_DRO

SLIDE 8

Sequence alignment

AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC

-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

Definition: Given two strings v = v1v2…vm and w = w1w2…wn, an alignment is an assignment of gaps to positions 0,…,m in v and 0,…,n in w, so as to line up each letter in one sequence with either a letter or a gap in the other sequence.

SLIDE 9

Mutations at the DNA level

…ACGGTGCAGTTACCA…
…AC----CAGTCCACCA…

Sequence edits: substitution, deletion
Rearrangements: inversion, translocation, duplication

SLIDE 10

Scoring an alignment

A simple scoring scheme:
Penalize mismatches by –μ
Penalize indels by –σ
Reward matches with +1
Resulting score:

#matches – μ (#mismatches) – σ (#indels)

Objective: find the best scoring alignment
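
The scoring scheme above can be sketched in a few lines of Python. This is only an illustration: the function name `score_alignment` and the convention of passing the alignment as two equal-length rows with "-" for gaps are assumptions, not something the slides define.

```python
def score_alignment(row_v, row_w, mu, sigma):
    """Score an alignment given as two equal-length rows ('-' marks a gap).

    Matches earn +1, mismatches cost mu, indels cost sigma.
    """
    if len(row_v) != len(row_w):
        raise ValueError("alignment rows must have equal length")
    score = 0
    for a, b in zip(row_v, row_w):
        if a == "-" or b == "-":
            score -= sigma   # indel column
        elif a == b:
            score += 1       # match column
        else:
            score -= mu      # mismatch column
    return score


# 5 matches and 4 indels with mu = sigma = 1 give 5 - 4 = 1.
print(score_alignment("AT-GTTAT-", "ATCGT-A-C", mu=1, sigma=1))
```
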

SLIDE 11

Number of pairwise alignments

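Given sequences of length m and n, the number of alignments is:

$$\sum_{k=0}^{\min(m,n)} \binom{m}{k}\binom{n}{k} = \binom{n+m}{n}$$

For two sequences of length n:

$$\binom{2n}{n} = \frac{(2n)!}{(n!)^2} \approx \frac{2^{2n}}{\sqrt{\pi n}}$$

Derived using Stirling’s approximation:

$$n! \approx \sqrt{2\pi n}\left(\frac{n}{e}\right)^{n}$$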
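
A quick numerical check of the counting formula and its Stirling estimate, as a sketch. The helper names `count_alignments` and `stirling_estimate` are made up for this example; `math.comb` is the standard-library binomial coefficient.

```python
from math import comb, pi, sqrt


def count_alignments(m, n):
    """Sum over k of C(m, k) * C(n, k), the count given on the slide."""
    return sum(comb(m, k) * comb(n, k) for k in range(min(m, n) + 1))


def stirling_estimate(n):
    """Approximate C(2n, n) by 2^(2n) / sqrt(pi * n)."""
    return 2 ** (2 * n) / sqrt(pi * n)


for n in (5, 10, 20):
    exact = count_alignments(n, n)
    assert exact == comb(2 * n, n)  # the closed form from the slide
    print(n, exact, round(stirling_estimate(n)))
```

The count grows roughly like 4^n, which is why enumerating all alignments is not practical and a dynamic programming approach is used instead.
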

SLIDE 12

Substrings and subsequences

Definition:

A string x’ is a substring of a string x if x = ux’v for some prefix string u and suffix string v (x’ = xi…xj, for some 1 ≤ i ≤ j ≤ |x|).

A string x’ is a subsequence of a string x if x’ can be obtained from x by deleting 0 or more letters (x’ = xi1…xik, for some 1 ≤ i1 < … < ik ≤ |x|).

Note: a substring is always a subsequence.

Example: x = abracadabra
y = cadabr: substring
z = brcdbr: subsequence, not a substring
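
The two definitions can be checked with a short Python sketch. The function names are illustrative only; the subsequence test scans x once from left to right and consumes letters as they are matched.

```python
def is_substring(x_prime, x):
    """True if x_prime occurs contiguously inside x."""
    return x_prime in x


def is_subsequence(x_prime, x):
    """True if x_prime can be obtained from x by deleting zero or more letters."""
    remaining = iter(x)
    # 'ch in remaining' advances the iterator past the first occurrence of ch,
    # so later letters must be found further to the right.
    return all(ch in remaining for ch in x_prime)


x = "abracadabra"
print(is_substring("cadabr", x), is_subsequence("cadabr", x))
print(is_substring("brcdbr", x), is_subsequence("brcdbr", x))
```
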

SLIDE 13

Encoding alignment as a path in a 2-d grid

elements of v:   A T - C - T G A T C
elements of w:   - T G C A T - A - C

i coords:      0 1 2 2 3 3 4 5 6 7 8
j coords:      0 0 1 2 3 4 5 5 6 6 7

Every alignment is a path in a 2-D grid:

(0,0) → (1,0) → (2,1) → (2,2) → (3,3) → (3,4) → (4,5) → (5,5) → (6,6) → (7,6) → (8,7)

SLIDE 14

Alignment as a path

[Figure: edit graph grid. One axis is labeled T G C A T A C with coordinates i = 1…7; the other is labeled A T C T G A T C with coordinates j = 1…8.]

SLIDE 15

Alignment as a Path in the Edit Graph

i:  0 1 2 2 3 4 5 6 7 7
v:    A T - G T T A T -
w:    A T C G T - A - C
j:  0 1 2 3 4 5 5 6 6 7

Corresponding path:

(0,0), (1,1), (2,2), (2,3), (3,4), (4,5), (5,5), (6,6), (7,6), (7,7)
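
The mapping from an alignment to its path can be sketched as follows. The function name `alignment_to_path` and the two-row input format with "-" for gaps are assumptions made for this example: each column advances i when it consumes a letter of v and advances j when it consumes a letter of w.

```python
def alignment_to_path(row_v, row_w):
    """Convert two alignment rows ('-' marks a gap) into grid coordinates (i, j)."""
    if len(row_v) != len(row_w):
        raise ValueError("alignment rows must have equal length")
    i = j = 0
    path = [(0, 0)]
    for a, b in zip(row_v, row_w):
        if a != "-":
            i += 1  # this column uses the next letter of v
        if b != "-":
            j += 1  # this column uses the next letter of w
        path.append((i, j))
    return path


print(alignment_to_path("AT-GTTAT-", "ATCGT-A-C"))
```
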

SLIDE 16

Alignment as a Path in the Edit Graph

Horizontal and vertical edges represent indels in v and w with score -1. Diagonal edges represent matches with score 1. The score of the alignment is 1.

SLIDE 17

Alignment as a Path in the Edit Graph

Every path in the edit graph corresponds to an alignment:

SLIDE 18

Alignment algorithms we will cover

Global alignment
Local alignment
Alignment with affine gap penalties
Scoring matrices

SLIDE 19

Our simple scoring scheme

The score when mismatches are penalized by –μ, indels are penalized by –σ, and matches are rewarded by +1:

#matches – μ (#mismatches) – σ (#indels)

SLIDE 20

Global Alignment: The Needleman-Wunsch algorithm [1]

Find the best alignment between two strings under our scoring scheme.
Input: Strings v and w and a scoring scheme
Output: Maximum scoring alignment

$$s_{i,j} = \max \begin{cases} s_{i-1,j-1} + 1 & \text{if } v_i = w_j \\ s_{i-1,j-1} - \mu & \text{if } v_i \neq w_j \\ s_{i-1,j} - \sigma \\ s_{i,j-1} - \sigma \end{cases}$$

s_{i,j}: the score for the best alignment of a length-i prefix of v and a length-j prefix of w

µ: mismatch penalty

σ: indel penalty

[1] A general method applicable to the search for similarities in the amino acid sequence of two proteins, J Mol Biol. 48(3):443-53, 1970.

SLIDE 21

Needleman Wunsch (cont)

What about the base case?

SLIDE 22

NW as a DP algorithm

NW(v, w, sigma, mu):
    m, n = len(v), len(w)
    for i in range(0, m + 1):
        s[i][0] = -sigma * i
    for j in range(0, n + 1):
        s[0][j] = -sigma * j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            fill in s[i][j] using the recurrence
    return s[m][n]

Runtime: O(nm) Memory: O(nm)
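
A runnable version of the pseudocode might look like the sketch below. It assumes the simple scoring scheme from the earlier slides (+1 match, –μ mismatch, –σ indel); the function name `needleman_wunsch_score` and the list-of-lists table are choices made for this example. It returns only the optimal score, not the alignment itself.

```python
def needleman_wunsch_score(v, w, sigma, mu):
    """Return the optimal global alignment score of v and w.

    Scoring: +1 per match, -mu per mismatch, -sigma per indel.
    """
    m, n = len(v), len(w)
    # s[i][j] = best score for aligning v[:i] with w[:j]
    s = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        s[i][0] = -sigma * i  # prefix of v aligned against gaps only
    for j in range(1, n + 1):
        s[0][j] = -sigma * j  # prefix of w aligned against gaps only
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diagonal = s[i - 1][j - 1] + (1 if v[i - 1] == w[j - 1] else -mu)
            s[i][j] = max(diagonal, s[i - 1][j] - sigma, s[i][j - 1] - sigma)
    return s[m][n]


print(needleman_wunsch_score("ATGTTAT", "ATCGTAC", sigma=1, mu=1))
```

Both loops run over the full (m+1) × (n+1) table, which gives the O(nm) time and memory stated on the slide.
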

SLIDE 23

Now What?

The DP algorithm created the alignment grid.

To read the best alignment: follow the pointers back from the sink.
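
One way to follow the pointers is sketched below. It assumes the simple +1 / –μ / –σ scoring scheme; the function name `needleman_wunsch`, the `back` pointer table, and the tie-breaking order (diagonal first) are choices made for this illustration rather than anything the slides prescribe.

```python
def needleman_wunsch(v, w, sigma, mu):
    """Return (score, aligned_v, aligned_w) for an optimal global alignment."""
    m, n = len(v), len(w)
    s = [[0] * (n + 1) for _ in range(m + 1)]
    back = [[None] * (n + 1) for _ in range(m + 1)]  # pointer to the predecessor cell
    for i in range(1, m + 1):
        s[i][0], back[i][0] = -sigma * i, "up"
    for j in range(1, n + 1):
        s[0][j], back[0][j] = -sigma * j, "left"
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diagonal = s[i - 1][j - 1] + (1 if v[i - 1] == w[j - 1] else -mu)
            choices = [
                (diagonal, "diag"),
                (s[i - 1][j] - sigma, "up"),
                (s[i][j - 1] - sigma, "left"),
            ]
            # max returns the first best choice, so ties prefer the diagonal
            s[i][j], back[i][j] = max(choices, key=lambda c: c[0])

    # Walk from the sink (m, n) back to the source (0, 0).
    row_v, row_w = [], []
    i, j = m, n
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "diag":
            row_v.append(v[i - 1]); row_w.append(w[j - 1])
            i, j = i - 1, j - 1
        elif move == "up":
            row_v.append(v[i - 1]); row_w.append("-")
            i -= 1
        else:
            row_v.append("-"); row_w.append(w[j - 1])
            j -= 1
    return s[m][n], "".join(reversed(row_v)), "".join(reversed(row_w))


score, aligned_v, aligned_w = needleman_wunsch("ATGTTAT", "ATCGTAC", sigma=1, mu=1)
print(score)
print(aligned_v)
print(aligned_w)
```

When several alignments share the best score, a different tie-breaking order would return a different but equally good alignment.
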

SLIDE 24

Scoring Matrices

To generalize scoring, we use a scoring matrix δ.

Size of the matrix:
Alignment of DNA sequences: (4+1) x (4+1)
Alignment of amino acids: (20+1) x (20+1)

The additional row/column includes scores for the gap character “-”.

$$s_{i,j} = \max \begin{cases} s_{i-1,j-1} + \delta(v_i, w_j) \\ s_{i-1,j} + \delta(v_i, -) \\ s_{i,j-1} + \delta(-, w_j) \end{cases}$$
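
A sketch of the generalized recurrence, with δ stored as a Python dictionary keyed by pairs of symbols. The helper `make_delta`, the function `align_score`, and the numeric scores are illustrative assumptions; a real application would load a published matrix instead. The base cases (summing gap scores along the first row and column) are also an assumption, chosen as the natural extension of the simple scheme.

```python
def make_delta(alphabet, match, mismatch, gap):
    """Build a toy scoring matrix over alphabet + '-' as a dict of symbol pairs."""
    symbols = alphabet + "-"
    delta = {}
    for a in symbols:
        for b in symbols:
            if a == "-" and b == "-":
                continue  # a gap is never aligned with a gap
            if a == "-" or b == "-":
                delta[a, b] = gap
            elif a == b:
                delta[a, b] = match
            else:
                delta[a, b] = mismatch
    return delta


def align_score(v, w, delta):
    """Optimal global alignment score under the scoring matrix delta."""
    m, n = len(v), len(w)
    s = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        s[i][0] = s[i - 1][0] + delta[v[i - 1], "-"]
    for j in range(1, n + 1):
        s[0][j] = s[0][j - 1] + delta["-", w[j - 1]]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s[i][j] = max(
                s[i - 1][j - 1] + delta[v[i - 1], w[j - 1]],
                s[i - 1][j] + delta[v[i - 1], "-"],
                s[i][j - 1] + delta["-", w[j - 1]],
            )
    return s[m][n]


dna_delta = make_delta("ACGT", match=1, mismatch=-1, gap=-1)
print(align_score("ATGTTAT", "ATCGTAC", dna_delta))
```

With match = 1, mismatch = –μ and gap = –σ, this reduces to the simple scoring scheme used earlier.
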