[PPT] - Sequence Alignment: Mo1va1on and Algorithms Mo1va1on and PowerPoint Presentation

SLIDE 1

Sequence ¡Alignment: ¡Mo1va1on ¡ and ¡Algorithms ¡

SLIDE 2

Mo1va1on ¡and ¡Introduc1on ¡

SLIDE 3

Importance ¡of ¡Sequence ¡Alignment ¡

For ¡DNA, ¡RNA ¡and ¡amino ¡acid ¡sequences, ¡high ¡

sequence ¡similarity ¡usually ¡implies ¡significant ¡ func1onal ¡or ¡structural ¡similarity. ¡

– imply ¡structure ¡from ¡the ¡sequence ¡similarity ¡ – suggest ¡gene ¡func1on ¡

Frequently, ¡we ¡aren’t ¡interested ¡in ¡exact ¡

matches ¡of ¡the ¡sequences ¡but ¡approximate ¡

matches. ¡

SLIDE 4

Inexact ¡Alignment ¡

Where ¡is ¡GATTACA ¡approximately ¡in ¡the ¡human ¡ genome? ¡And ¡how ¡do ¡we ¡efficiently ¡find ¡them? ¡ Answers ¡to ¡these ¡ques1ons ¡depend ¡on: ¡

1. The ¡defini1on ¡of ¡“approximately” ¡ ¡

– Hamming ¡Distance, ¡Edit ¡distance, ¡Sequence ¡ Similarity ¡

2. Efficiency ¡depends ¡on ¡the ¡data ¡characteris1cs ¡

and ¡goals ¡

SLIDE 5

In ¡this ¡lecture… ¡

Sequence ¡alignment ¡and ¡alignment ¡soTware ¡

are ¡used ¡all ¡over ¡bioinforma1cs ¡for ¡different ¡

purposes. ¡ ¡
The ¡goal ¡of ¡this ¡lecture ¡is ¡to ¡understand ¡

sequence ¡alignment, ¡mo1vate ¡it ¡in ¡the ¡ context ¡of ¡assembly, ¡and ¡seeing ¡what ¡ soTware ¡is ¡out ¡there. ¡ ¡

SLIDE 6

Two ¡examples ¡where ¡you ¡would ¡ use ¡sequence ¡similarity ¡

1. Use ¡a ¡known ¡genome ¡to ¡help ¡assemble ¡an ¡

unknown ¡genome. ¡

2. Annota1ng ¡a ¡genome. ¡

SLIDE 7

Two ¡examples ¡where ¡you ¡would ¡ use ¡sequence ¡similarity ¡

1. Use ¡a ¡known ¡genome ¡to ¡help ¡assemble ¡an ¡

unknown ¡genome. ¡

2. Annota1ng ¡a ¡genome. ¡

SLIDE 8

What ¡is ¡a ¡reference ¡genome? ¡

A ¡reference ¡genome ¡is ¡a ¡representa1ve ¡

example ¡of ¡a ¡species’ ¡set ¡of ¡genes. ¡ ¡

As ¡they ¡are ¡oTen ¡assembled ¡from ¡the ¡

sequencing ¡of ¡DNA ¡from ¡a ¡number ¡of ¡donors, ¡ reference ¡genomes ¡do ¡not ¡accurately ¡ represent ¡the ¡set ¡of ¡genes ¡of ¡any ¡individual. ¡

BUT ¡the ¡reference ¡is ¡very, ¡very, ¡VERY ¡similar ¡in

¡ many ¡loca1ons ¡to ¡the ¡individual. ¡ ¡ ¡

SLIDE 9

SLIDE 10

Sample ¡Prepara*on ¡

10

SLIDE 11

Sample ¡Prepara*on ¡

Fragments

11

SLIDE 12

Sample ¡Prepara*on ¡ Sequencing ¡

ACGTAGAATCGACCATG GGGACGTAGAATACGAC ACGTAGAATACGTAGAA

Reads Fragments Next Generation Sequencing (NGS)

12

SLIDE 13

Sample ¡Prepara*on ¡ Sequencing ¡ Assembly ¡

ACGTAGAATACGTAGAAACAGATTAGAGAG…

Contigs Fragments Reads

ACGTAGAATCGACCATG GGGACGTAGAATACGAC ACGTAGAATACGTAGAA

13

SLIDE 14

Sample ¡Prepara*on ¡ Sequencing ¡ Assembly ¡

ACGTAGAATACGTAGAAACAGATTAGAGAG…

Contigs Fragments Reads

ACGTAGAATCGACCATG GGGACGTAGAATACGAC ACGTAGAATACGTAGAA

14

BUT WE DON’T HAVE TO BEGIN FROM SCRATCH!!!

SLIDE 15

How ¡do ¡we ¡use ¡the ¡reference? ¡

The ¡reads ¡from ¡the ¡individual ¡will ¡be ¡very ¡

similar ¡but ¡not ¡iden1cal ¡to ¡the ¡reference ¡

genome. ¡
Ideally, ¡we ¡would ¡like ¡to ¡use ¡the ¡informa1on ¡

from ¡the ¡reference! ¡Otherwise, ¡we ¡are ¡ throwing ¡up ¡valuable ¡informa1on. ¡

Hence, ¡the ¡first ¡step ¡is ¡to ¡find ¡where ¡in ¡the ¡

genome ¡the ¡sequences ¡are ¡similar ¡to. ¡ ¡

SLIDE 16

Two ¡examples ¡where ¡you ¡would ¡ use ¡sequence ¡similarity ¡

1. Use ¡a ¡known ¡genome ¡to ¡help ¡assemble ¡an ¡

unknown ¡genome. ¡

2. Annota9ng ¡a ¡genome. ¡

SLIDE 17

SLIDE 18

Sequence ¡Alignment ¡

Before ¡we ¡can ¡make ¡compara1ve ¡statements ¡

about ¡two ¡sequences, ¡we ¡have ¡to ¡produce ¡a ¡ pairwise ¡sequence ¡alignment ¡

What ¡is ¡the ¡op1mal ¡alignment ¡between ¡two ¡

sequences? ¡

Match/mismatch? ¡Gaps/extensions? ¡Is ¡an ¡
p1mal ¡alignment ¡always ¡significant? ¡

SLIDE 19

Edit ¡Distance ¡

Levenshtein (1966) introduced edit distance between two strings as the minimum number

f elementary operations (insertions,

deletions, and substitutions) to transform one string into the other d(v, w): the minimum number of operations required to transform v into w.

SLIDE 20

Edit ¡Distance ¡vs ¡Hamming ¡Distance ¡

V = ATATATAT W = TATATATA d(v, w)=8 d(v, w)=2 computing Hamming computing edit distance distance is a trivial task is a non-trivial task W = TATATATA- V = -ATATATAT Hamming ¡distance ¡always ¡ compares ¡the ¡ith ¡leêr ¡of ¡v ¡ ¡ with ¡the ¡jth ¡leêr ¡of ¡w ¡ Edit ¡distance ¡may ¡compare ¡ ¡ the ¡ith ¡leêr ¡of ¡v ¡ ¡with ¡the ¡jth ¡ leêr ¡of ¡w ¡

SLIDE 21

Edit ¡Distance ¡vs ¡Hamming ¡Distance ¡

V = ATATATAT W = TATATATA d(v, w)=8 d(v, w)=2

(one insertion and one deletion)

W = TATATATA- V = -ATATATAT Hamming ¡distance ¡always ¡ compares ¡the ¡ith ¡leêr ¡of ¡v ¡ ¡ with ¡the ¡jth ¡leêr ¡of ¡w ¡ Edit ¡distance ¡may ¡compare ¡ ¡ the ¡ith ¡leêr ¡of ¡v ¡ ¡with ¡the ¡jth ¡ leêr ¡of ¡w ¡

How ¡to ¡find ¡what ¡j ¡goes ¡with ¡what ¡i ¡such ¡that ¡ the ¡“score” ¡of ¡the ¡alignment ¡is ¡op1mized? ¡

SLIDE 22

Edit ¡Distance: ¡Example ¡

¡ TGCATAT ¡ ¡ATCCGAT ¡in ¡5 ¡steps ¡ ¡

TGCATAT (delete ¡last ¡T) ¡ TGCATA (delete ¡last ¡A) ¡ TGCAT (insert ¡A ¡at ¡front) ¡ ATGCAT (subs1tute ¡C ¡for ¡3rd ¡G) ¡ ATCCAT (insert ¡G ¡before ¡last ¡A) ¡ ¡ ATCCGAT (Done) ¡ ¡ ¡ ¡

SLIDE 23

Compu1ng ¡Edit ¡Distance ¡

SLIDE 24

Dynamic ¡Programming ¡Algorithm ¡

Dynamic ¡Programming ¡is ¡a ¡method ¡for ¡solving ¡

complex ¡problems ¡by ¡breaking ¡them ¡down ¡ into ¡simpler ¡subproblems. ¡

Provides ¡the ¡best ¡(op1mal) ¡alignment ¡

between ¡two ¡sequences. ¡

Includes ¡matches, ¡mismatches, ¡and ¡gaps ¡to ¡

maximize ¡the ¡number ¡of ¡matched ¡characters. ¡

Score: ¡match, ¡mismatch, ¡gap ¡(affine ¡vs. ¡non-‑

affine) ¡

SLIDE 25

Dynamic ¡Programming ¡Visualized ¡

0 1 2 2 3 4 5 6 7 7 A T _ G T T A T _ A T C G T _ A _ C 0 1 2 3 4 5 5 6 6 7

(0,0), (1,1), (2,2), (2,3), (3,4), (4,5), (5,5), (6,6), (7,6), (7,7)

Corresponding ¡path ¡

SLIDE 26

and represent indels in v and w with score 0. represent matches with score 1. The score of the alignment path is 5.

Dynamic ¡Programming ¡Visualized ¡

SLIDE 27

Every path in the edit graph corresponds to an alignment:

Dynamic ¡Programming ¡Visualized ¡

SLIDE 28

Define ¡Rules ¡

Score ¡for ¡match: ¡+1 ¡
Score ¡for ¡mismatch: ¡0 ¡
Score ¡for ¡gap: ¡0 ¡

¡ Let ¡si,j ¡be ¡the ¡total ¡score ¡at ¡posi1on ¡i, ¡j ¡then ¡we ¡ have ¡that ¡it ¡is ¡equal ¡to ¡the ¡maximum ¡of ¡{si-‑1, ¡j-‑1 ¡+ ¡1 ¡ (if ¡vi ¡= ¡wj), ¡si-‑1, ¡j ¡(otherwise: ¡gap ¡or ¡mismatch)} ¡ ¡

SLIDE 29

Dynamic ¡Programming ¡Example ¡

Ini1alize ¡1st ¡row ¡and ¡1st ¡ ¡ column ¡to ¡be ¡all ¡zeroes. ¡ ¡ Or, ¡to ¡be ¡more ¡precise, ¡ ini1alize ¡0th ¡row ¡and ¡0th ¡ column ¡to ¡be ¡all ¡zeroes. ¡

SLIDE 30

Dynamic ¡Programming ¡Example ¡

Si,j = maximum of : Si-1, j-1

Si-1, j

Si, j-1

value from NW +1, if vi = wj value from North (top) value from West (left)

SLIDE 31

Alignment: ¡Backtracking ¡

Arrows ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡show ¡where ¡the ¡score ¡originated. ¡ ¡ ¡ ¡

if ¡from ¡the ¡top ¡
if ¡from ¡the ¡leT ¡
if ¡vi ¡= ¡wj ¡ ¡

SLIDE 32

Backtracking ¡Example ¡

Find a match in row and column 2. i=2, j=2,5 is a match (T). j=2, i=4,5,7 is a match (T). Since vi = wj, si,j = si-1,j-1 +1 s2,2 = [s1,1 = 1] + 1 s2,5 = [s1,4 = 1] + 1 s4,2 = [s3,1 = 1] + 1 s5,2 = [s4,1 = 1] + 1 s7,2 = [s6,1 = 1] + 1

SLIDE 33

Backtracking ¡Example ¡

Con1nuing ¡with ¡the ¡ dynamic ¡programming ¡ ¡ algorithm ¡gives ¡this ¡

result. ¡

SLIDE 34

Scoring ¡Matrices ¡

The ¡alignment ¡score ¡represent ¡odds ¡of ¡
btaining ¡that ¡score ¡between ¡sequences ¡

known ¡to ¡be ¡related ¡to ¡that ¡obtained ¡by ¡ chance ¡alignment ¡between ¡unrelated ¡ sequences ¡

When ¡the ¡correct ¡scoring ¡matrix ¡is ¡used, ¡

alignment ¡sta1s1cs ¡are ¡meaningful ¡

Different ¡scoring ¡schemes ¡for ¡DNA ¡and ¡protein ¡

sequences ¡

SLIDE 35

Examples ¡of ¡Scoring ¡Matrices ¡

Amino ¡acid ¡subs1tu1on ¡matrices: ¡

– PAM ¡ – BLOSUM ¡ ¡

DNA ¡subs1tu1on ¡matrices: ¡

– DNA ¡is ¡less ¡conserved ¡than ¡protein ¡sequences. ¡ – Less ¡effec1ve ¡to ¡compare ¡coding ¡regions ¡at ¡ nucleo1de ¡level. ¡

SLIDE 36

Scoring ¡Indels: ¡Naive ¡Approach ¡

A ¡fixed ¡penalty ¡σ ¡is ¡given ¡to ¡every ¡indel: ¡

– ¡-‑σ ¡for ¡1 ¡indel, ¡ ¡ – -‑2σ ¡for ¡2 ¡consecu1ve ¡ ¡indels ¡ – -‑3σ ¡for ¡3 ¡consecu1ve ¡ ¡indels, ¡etc. ¡

Can ¡be ¡too ¡severe ¡penalty ¡for ¡a ¡series ¡of ¡100 ¡ consecu1ve ¡indels ¡

SLIDE 37

Affine ¡Gap ¡Penal1es ¡

A ¡series ¡of ¡k ¡indels ¡oTen ¡come ¡as ¡a ¡single ¡

event ¡rather ¡than ¡a ¡series ¡of ¡k ¡single ¡ nucleo1de ¡events: ¡

Normal scoring would give the same score for both alignments

This is more likely. This is less likely.

SLIDE 38

Accoun1ng ¡for ¡Gaps ¡

Score ¡for ¡a ¡gap ¡of ¡length ¡x ¡is ¡-‑(ρ ¡+ ¡σx), ¡where ¡ρ ¡ >0 ¡is ¡the ¡penalty ¡for ¡introducing ¡a ¡gap: ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡gap ¡opening ¡penalty ¡ ¡ ¡ ¡ ¡ ¡ρ ¡will ¡be ¡large ¡rela1ve ¡to ¡σ: ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡gap ¡extension ¡penalty ¡ ¡ ¡ ¡ ¡ ¡because ¡you ¡do ¡not ¡want ¡to ¡add ¡too ¡much ¡of ¡ a ¡penalty ¡for ¡extending ¡the ¡gap. ¡

SLIDE 39

Affine ¡Gap ¡Penal1es ¡

Gap ¡penal1es: ¡

– -‑ρ-‑σ ¡ ¡when ¡there ¡is ¡1 ¡indel ¡ – -‑ρ-‑2σ ¡ ¡when ¡there ¡are ¡2 ¡indels ¡ – -‑ρ-‑3σ ¡ ¡when ¡there ¡are ¡3 ¡indels, ¡etc. ¡ ¡ – -‑ρ-‑ ¡x·√σ ¡(-‑gap ¡opening ¡-‑ ¡x ¡gap ¡extensions) ¡

Somehow ¡reduced ¡penal1es ¡(as ¡compared ¡to ¡

naïve ¡scoring) ¡are ¡given ¡to ¡runs ¡of ¡horizontal ¡ and ¡ver1cal ¡edges ¡

SLIDE 40

Heuris1c ¡Algorithms ¡

Make ¡reasonable ¡assump1ons ¡about ¡nature ¡of

¡ sequence ¡alignments ¡and ¡try ¡out ¡only ¡“most ¡

Make ¡reasonable ¡assump1ons ¡about ¡nature ¡of

¡ likely” ¡alignments ¡ sequence ¡alignments ¡and ¡try ¡out ¡only ¡“most ¡ likely” ¡alignments ¡

Full ¡perfect ¡match ¡(word, ¡k-‑tuple) ¡
Much ¡faster ¡than ¡DP ¡algorithms ¡but ¡less ¡

sensi1ve ¡

Extend ¡alignment ¡un1l: ¡ ¡

SLIDE 41

Exis1ng ¡Tools ¡

SLIDE 42

Different ¡Sequence ¡Alignment ¡

Database ¡Search: ¡ ¡

– BLAST, ¡FASTA, ¡HMMER ¡

Mul1ple ¡Sequence ¡Alignment: ¡ ¡

– ClustalW, ¡FSA ¡

Genomic ¡Analysis: ¡ ¡

– BLAT ¡

Short ¡Read ¡Sequence ¡Alignment: ¡ ¡

– BWA, ¡Bow9e, ¡drFAST, ¡GSNAP, ¡SHRiMP, ¡SOAP, ¡ MAQ ¡

SLIDE 43

BLAST ¡

Basic ¡Local ¡Alignment ¡Search ¡Tool ¡
BLAST ¡is ¡faster ¡than ¡Smith-‑Waterman ¡(DP ¡

algorithm), ¡it ¡cannot ¡“guarantee ¡the ¡op1mal ¡ alignments ¡of ¡the ¡query ¡and ¡database ¡ sequences”. ¡

A ¡BLAST ¡search ¡enables ¡a ¡researcher ¡to ¡compare ¡

a ¡query ¡sequence ¡with ¡a ¡library ¡or ¡database ¡of ¡ sequences ¡and ¡iden1fy ¡library ¡sequences ¡that ¡ resemble ¡the ¡query ¡sequence ¡above ¡a ¡certain ¡

threshold. ¡
Different ¡types ¡of ¡BLASTs ¡are ¡available ¡according ¡

to ¡the ¡query ¡sequences. ¡

SLIDE 44

Blast ¡Versions ¡

BLASTN: ¡Compares ¡a ¡nucleo1de ¡query ¡to ¡a ¡

nucleo1de ¡database ¡

BLASTP: ¡Compares ¡a ¡protein ¡query ¡to ¡a ¡protein ¡

database ¡

BLASTX: ¡Compares ¡a ¡translated ¡nucleo1de ¡query ¡

to ¡protein ¡database ¡

TBLASTN: ¡Compares ¡a ¡protein ¡query ¡to ¡a ¡

translated ¡nucleo1de ¡database ¡

TBLASTX: ¡Compares ¡a ¡translated ¡nucleo1de ¡

query ¡to ¡a ¡translated ¡nucleo1de ¡database ¡

SLIDE 45

hIp://blast.ncbi.nlm.nih.gov/ ¡

SLIDE 46

Typical ¡Uses ¡of ¡BLAST ¡

Which ¡bacterial ¡species ¡have ¡a ¡protein ¡that ¡is ¡

related ¡in ¡lineage ¡to ¡a ¡certain ¡protein ¡with ¡ known ¡amino-‑acid ¡sequence? ¡ ¡

Where ¡does ¡a ¡certain ¡sequence ¡of ¡DNA ¡
riginate? ¡
What ¡other ¡genes ¡encode ¡proteins ¡that ¡

exhibit ¡structures ¡or ¡mo1fs ¡such ¡as ¡ones ¡thtat ¡ have ¡been ¡determined? ¡ ¡

SLIDE 47

Mul1ple ¡Alignment ¡

A ¡mul1ple ¡sequence ¡alignment ¡is ¡a ¡sequence ¡

alignment ¡of ¡three ¡or ¡more ¡biological ¡ sequences ¡(protein, ¡DNA, ¡RNA) ¡ ¡ ¡

The ¡query ¡sequences ¡are ¡assumed ¡to ¡have ¡

some ¡sort ¡of ¡evolu1onary ¡rela1onship ¡by ¡ which ¡they ¡share ¡some ¡sort ¡of ¡lineage ¡

Many ¡tools ¡are ¡available ¡but ¡this ¡is ¡a ¡really ¡

complex ¡computa1onal ¡problem ¡

Used ¡in ¡phylogene1cs ¡ ¡

SLIDE 48

Short ¡Read ¡Alignment ¡SW ¡

Bow9e: ¡memory-‑efficient ¡short ¡read ¡aligner. ¡It ¡ aligns ¡short ¡DNA ¡sequences ¡(reads) ¡to ¡the ¡human ¡ genome ¡at ¡a ¡rate ¡of ¡over ¡25 ¡million ¡35-‑bp ¡reads ¡per ¡ hours ¡ ¡ Burrows-‑Wheeler ¡Aligner ¡(BWA): ¡an ¡aliger ¡that ¡ implements ¡two ¡algorithms: ¡bwa-‑short ¡and ¡BWA-‑

SW. ¡The ¡former ¡works ¡for ¡query ¡sequences ¡shorter ¡

than ¡200bp ¡and ¡the ¡la^er ¡for ¡longer ¡sequences ¡up ¡ to ¡around ¡100kbp. ¡

SLIDE 49

Sequence ¡Alignment/Map ¡Format ¡

Sequence ¡Reads ¡ Alignment ¡SoMware ¡ SAM ¡File ¡ Resequencing ¡ RNA ¡Seq ¡ SNPs ¡

SLIDE 50

SAM ¡format ¡

“A ¡tab-‑delimited ¡text ¡format ¡consis1ng ¡of ¡a ¡header ¡sec1on, ¡ which ¡is ¡op1onal, ¡and ¡an ¡alignment ¡sec1on” ¡

@HD VN:1.0 SO:coordinate @SQ SN:1 LN:249250621 AS:NCBI37 UR:file:/data/local/ref/GATK/human_g1k_v37.fasta M5:1b22b98cdeb4a9304cb5d48026a85128 @SQ SN:2 LN:243199373 AS:NCBI37 UR:file:/data/local/ref/GATK/human_g1k_v37.fasta M5:a0d9851da00400dec1098a9255ac712e @SQ SN:3 LN:198022430 AS:NCBI37 UR:file:/data/local/ref/GATK/human_g1k_v37.fasta M5:fdfd811849cc2fadebc929bb925902e5 @RG ID:UM0098:1 PL:ILLUMINA PU:HWUSI-EAS1707-615LHAAXX-L001 LB:80 DT:2010-05-05T20:00:00-0400 SM:SD377

Example Headers:

1:497:R:-272+13M17D24M 113 1 497 37 37M 15 100338662 CGGGTCTGACCTGAGGAGAACTGTGCTCCGCCTTCAG 0;==-==9;>>>>>=>>>>>>>>>>>=>>>>>>>>>> XT:A:U NM:i:0 SM:i:37 AM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:37 19:20389:F:275+18M2D19M 99 1 17644 37M = 17919 314 TATGACTGCTAATAATACCTACACATGTTAGAACCAT >>>>>>>>>>>>>>>>>>>><<>>><<>>4::>>:<9 RG:Z:UM0098:1 XT:A:R NM:i:0 SM:i:0 AM:i:0 X0:i:4 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:37 19:20389:F:275+18M2D19M 147 1 17919 18M2D19M = 17644

314

GTAGTACCAACTGTAAGTCCTTATCTTCATACTTTGT ;44999;499<8<8<<<8<<><<<<><7<;<<<>><< XT:A:R NM:i:2 SM:i:0 AM:i:0 X0:i:4 X1:i:0 XM:i:0 XO:i:1 XG:i:2 MD:Z:18^CA19 9:21597+10M2I25M:R:-209 83 1 21678 8M2I27M = 21469

244

CACCACATCACATATACCAAGCCTGGCTGTGTCTTCT <;9<<5><<<<><<<>><<><>><9>><>>>9>>><> XT:A:R NM:i:2 SM:i:0 AM:i:0 X0:i:5 X1:i:0 XM:i:0 XO:i:1 XG:i:2 MD:Z:35

Example Alignments:

SLIDE 51

The ¡Alignment ¡Column ¡

SLIDE 52

Bitwise ¡Flags ¡

FLAG: ¡bitwise ¡FLAG. ¡ ¡Each ¡bit ¡is ¡explained ¡in ¡the ¡following ¡table: ¡

SLIDE 53

Bitwise ¡Representa1on ¡

1 ¡= ¡00000001 ¡ ¡paired-‑end ¡read ¡ 2 ¡= ¡00000010 ¡ ¡mapped ¡as ¡proper ¡pair ¡ 4 ¡= ¡00000100 ¡ ¡unmapped ¡read ¡ 8 ¡= ¡00001000 ¡ ¡read ¡mate ¡unmapped ¡ 16 ¡= ¡00010000 ¡ ¡read ¡mapped ¡on ¡reverse ¡strand ¡ Example: ¡ ¡ ¡ ¡The ¡flag ¡11 ¡ ¡1 ¡+ ¡2 ¡+ ¡8 ¡= ¡00001011 ¡(condi1ons ¡1, ¡2, ¡8) ¡ ¡

Flags ¡0, ¡4, ¡and ¡16 ¡are ¡the ¡flags ¡most ¡commonly ¡used. ¡

SLIDE 54

Mapping ¡Quality ¡

Phred ¡score, ¡iden1cal ¡to ¡the ¡quality ¡measure ¡

in ¡the ¡fastq ¡file. ¡quality ¡Q, ¡probability ¡P: ¡ ¡ P ¡= ¡10 ¡^ ¡(-‑Q ¡/ ¡10.0) ¡ ¡

If ¡Q=30, ¡P=1/1000on ¡average, ¡one ¡of ¡out ¡

1000 ¡alignments ¡will ¡be ¡wrong ¡

As ¡good ¡as ¡this ¡sounds ¡it ¡is ¡not ¡easy ¡to ¡

compute ¡such ¡a ¡quality. ¡

SLIDE 55

Mapping ¡Quality ¡

The ¡repeat ¡structure ¡of ¡the ¡reference. ¡Reads ¡falling ¡

in ¡repe11ve ¡regions ¡usually ¡get ¡very ¡low ¡mapping ¡ quality ¡

The ¡base ¡quality ¡of ¡the ¡read. ¡Low ¡quality ¡means ¡the ¡
bserved ¡read ¡sequence ¡is ¡possibly ¡wrong, ¡and ¡

wrong ¡sequence ¡may ¡lead ¡to ¡a ¡wrong ¡alignment ¡

The ¡sensi1vity ¡of ¡the ¡alignment ¡algorithm. ¡The ¡true ¡

hit ¡is ¡more ¡likely ¡to ¡be ¡missed ¡by ¡an ¡algorithm ¡with ¡ low ¡sensi1vity, ¡which ¡also ¡causes ¡mapping ¡errors ¡

Paired ¡end ¡or ¡not. ¡Reads ¡mapped ¡in ¡pairs ¡are ¡more ¡

likely ¡to ¡be ¡correct ¡ ¡

SLIDE 56

BWA ¡Specific ¡High ¡Scores ¡

A ¡read ¡alignment ¡with ¡a ¡mapping ¡quality ¡30 ¡or ¡above ¡ usually ¡implies: ¡ ¡ – The ¡overall ¡base ¡quality ¡of ¡the ¡read ¡is ¡good. ¡ ¡ – The ¡best ¡alignment ¡has ¡few ¡mismatches. ¡ ¡ – The ¡read ¡has ¡few ¡or ¡just ¡one ¡“good” ¡hit ¡on ¡the ¡ reference, ¡which ¡means ¡the ¡current ¡alignment ¡is ¡ s1ll ¡the ¡best ¡even ¡if ¡one ¡or ¡two ¡bases ¡are ¡actually ¡ muta1ons ¡or ¡sequencing ¡errors. ¡ ¡

SLIDE 57

BWA ¡Specific ¡Low ¡Scores ¡

Surprisingly ¡difficult ¡to ¡track ¡down ¡the ¡exact ¡ behavior ¡

Q=0 ¡if ¡a ¡read ¡can ¡be ¡aligned ¡equally ¡well ¡to ¡

mul1ple ¡posi1ons, ¡BWA ¡will ¡randomly ¡pick ¡

ne ¡posi1on ¡and ¡give ¡it ¡a ¡mapping ¡quality ¡
zero. ¡
Q=25 ¡the ¡edit ¡distance ¡equals ¡mismatches ¡

and ¡is ¡greater ¡than ¡zero ¡ ¡

SLIDE 58

What ¡to ¡do ¡with ¡low ¡quality ¡scores? ¡

Find ¡repeat ¡structures ¡in ¡the ¡genome/con1g. ¡
Determine ¡if ¡there ¡is ¡a ¡problem ¡with ¡your ¡

alignment ¡or ¡data ¡(i.e. ¡all ¡the ¡reads ¡mapped ¡ with ¡low ¡quality ¡scores). ¡

Filter ¡them ¡out. ¡ ¡Very ¡common ¡to ¡write ¡a ¡perl/

python ¡script ¡to ¡filter ¡out ¡poorly ¡aligned ¡

reads. ¡
Many, ¡many, ¡many ¡other ¡possibili1es. ¡

SLIDE 59

The ¡Alignment ¡Column ¡

SLIDE 60

CIGAR ¡String ¡

CIGAR ¡string ¡is ¡a ¡compact ¡representa1on ¡of ¡

how ¡the ¡read ¡aligned ¡to ¡the ¡reference ¡ genome ¡at ¡that ¡exact ¡posi1on. ¡

More ¡specifically, ¡the ¡CIGAR ¡string ¡is ¡a ¡

sequence ¡of ¡of ¡base ¡lengths ¡and ¡the ¡ associated ¡opera1on. ¡ ¡

– match/mismatch ¡with ¡the ¡reference. ¡ – deleted/inserted ¡from ¡the ¡from ¡the ¡reference. ¡

SLIDE 61

Example ¡of ¡CIGAR ¡

RefPos: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 Reference: C C A T A C T G A A C T G A C T A A C Read: A C T A G A A T G G C T

In ¡the ¡SAM ¡file ¡you ¡will ¡have ¡the ¡following ¡fields: ¡ ¡

POS: ¡5 ¡
CIGAR: ¡3M1I3M1D5M

SLIDE 62

Final ¡Comments ¡

BAM ¡is ¡a ¡compressed ¡version ¡of ¡the ¡SAM ¡file ¡
format. ¡ ¡There ¡are ¡mul1ple ¡programs ¡that ¡

convert ¡BAM ¡files ¡to ¡SAM ¡files ¡and ¡vice ¡versa. ¡

¡Tablet ¡(h^p://bioinf.scri.ac.uk/tablet/) ¡is ¡an ¡

Sequence ¡Alignment: ¡Mo1va1on ¡ and ¡Algorithms ¡

Mo1va1on ¡and ¡Introduc1on ¡

Importance ¡of ¡Sequence ¡Alignment ¡

sequence ¡similarity ¡usually ¡implies ¡significant ¡ func1onal ¡or ¡structural ¡similarity. ¡

– imply ¡structure ¡from ¡the ¡sequence ¡similarity ¡ – suggest ¡gene ¡func1on ¡

matches ¡of ¡the ¡sequences ¡but ¡approximate ¡

Inexact ¡Alignment ¡

Where ¡is ¡GATTACA ¡approximately ¡in ¡the ¡human ¡ genome? ¡And ¡how ¡do ¡we ¡efficiently ¡find ¡them? ¡ Answers ¡to ¡these ¡ques1ons ¡depend ¡on: ¡

– Hamming ¡Distance, ¡Edit ¡distance, ¡Sequence ¡ Similarity ¡

and ¡goals ¡

In ¡this ¡lecture… ¡

are ¡used ¡all ¡over ¡bioinforma1cs ¡for ¡different ¡

sequence ¡alignment, ¡mo1vate ¡it ¡in ¡the ¡ context ¡of ¡assembly, ¡and ¡seeing ¡what ¡ soTware ¡is ¡out ¡there. ¡ ¡

Two ¡examples ¡where ¡you ¡would ¡ use ¡sequence ¡similarity ¡

unknown ¡genome. ¡

Two ¡examples ¡where ¡you ¡would ¡ use ¡sequence ¡similarity ¡

unknown ¡genome. ¡

What ¡is ¡a ¡reference ¡genome? ¡

example ¡of ¡a ¡species’ ¡set ¡of ¡genes. ¡ ¡

sequencing ¡of ¡DNA ¡from ¡a ¡number ¡of ¡donors, ¡ reference ¡genomes ¡do ¡not ¡accurately ¡ represent ¡the ¡set ¡of ¡genes ¡of ¡any ¡individual. ¡

¡ many ¡loca1ons ¡to ¡the ¡individual. ¡ ¡ ¡

Sample ¡Prepara*on ¡

Sample ¡Prepara*on ¡

Fragments

Sample ¡Prepara*on ¡ Sequencing ¡

Reads Fragments Next Generation Sequencing (NGS)

Sample ¡Prepara*on ¡ Sequencing ¡ Assembly ¡

Contigs Fragments Reads

Sample ¡Prepara*on ¡ Sequencing ¡ Assembly ¡

Contigs Fragments Reads

BUT WE DON’T HAVE TO BEGIN FROM SCRATCH!!!

How ¡do ¡we ¡use ¡the ¡reference? ¡

similar ¡but ¡not ¡iden1cal ¡to ¡the ¡reference ¡

from ¡the ¡reference! ¡Otherwise, ¡we ¡are ¡ throwing ¡up ¡valuable ¡informa1on. ¡

genome ¡the ¡sequences ¡are ¡similar ¡to. ¡ ¡

Two ¡examples ¡where ¡you ¡would ¡ use ¡sequence ¡similarity ¡

unknown ¡genome. ¡

Sequence ¡Alignment ¡

about ¡two ¡sequences, ¡we ¡have ¡to ¡produce ¡a ¡ pairwise ¡sequence ¡alignment ¡

sequences? ¡

Edit ¡Distance ¡

Levenshtein (1966) introduced edit distance between two strings as the minimum number

deletions, and substitutions) to transform one string into the other d(v, w): the minimum number of operations required to transform v into w.

Edit ¡Distance ¡vs ¡Hamming ¡Distance ¡

Edit ¡Distance ¡vs ¡Hamming ¡Distance ¡

V = ATATATAT W = TATATATA d(v, w)=8 d(v, w)=2

W = TATATATA- V = -ATATATAT Hamming ¡distance ¡always ¡ compares ¡the ¡ith ¡leêr ¡of ¡v ¡ ¡ with ¡the ¡jth ¡leêr ¡of ¡w ¡ Edit ¡distance ¡may ¡compare ¡ ¡ the ¡ith ¡leêr ¡of ¡v ¡ ¡with ¡the ¡jth ¡ leêr ¡of ¡w ¡

How ¡to ¡find ¡what ¡j ¡goes ¡with ¡what ¡i ¡such ¡that ¡ the ¡“score” ¡of ¡the ¡alignment ¡is ¡op1mized? ¡

Edit ¡Distance: ¡Example ¡

¡ TGCATAT ¡ ¡ATCCGAT ¡in ¡5 ¡steps ¡ ¡

TGCATAT (delete ¡last ¡T) ¡ TGCATA (delete ¡last ¡A) ¡ TGCAT (insert ¡A ¡at ¡front) ¡ ATGCAT (subs1tute ¡C ¡for ¡3rd ¡G) ¡ ATCCAT (insert ¡G ¡before ¡last ¡A) ¡ ¡ ATCCGAT (Done) ¡ ¡ ¡ ¡

Compu1ng ¡Edit ¡Distance ¡

Dynamic ¡Programming ¡Algorithm ¡

complex ¡problems ¡by ¡breaking ¡them ¡down ¡ into ¡simpler ¡subproblems. ¡

between ¡two ¡sequences. ¡

maximize ¡the ¡number ¡of ¡matched ¡characters. ¡

affine) ¡

Dynamic ¡Programming ¡Visualized ¡

0 1 2 2 3 4 5 6 7 7 A T _ G T T A T _ A T C G T _ A _ C 0 1 2 3 4 5 5 6 6 7

Corresponding ¡path ¡

and represent indels in v and w with score 0. represent matches with score 1. The score of the alignment path is 5.

Dynamic ¡Programming ¡Visualized ¡

Every path in the edit graph corresponds to an alignment:

Dynamic ¡Programming ¡Visualized ¡

Define ¡Rules ¡

¡ Let ¡si,j ¡be ¡the ¡total ¡score ¡at ¡posi1on ¡i, ¡j ¡then ¡we ¡ have ¡that ¡it ¡is ¡equal ¡to ¡the ¡maximum ¡of ¡{si-­‑1, ¡j-­‑1 ¡+ ¡1 ¡ (if ¡vi ¡= ¡wj), ¡si-­‑1, ¡j ¡(otherwise: ¡gap ¡or ¡mismatch)} ¡ ¡

Dynamic ¡Programming ¡Example ¡

Ini1alize ¡1st ¡row ¡and ¡1st ¡ ¡ column ¡to ¡be ¡all ¡zeroes. ¡ ¡ Or, ¡to ¡be ¡more ¡precise, ¡ ini1alize ¡0th ¡row ¡and ¡0th ¡ column ¡to ¡be ¡all ¡zeroes. ¡

Dynamic ¡Programming ¡Example ¡

Alignment: ¡Backtracking ¡

Arrows ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡show ¡where ¡the ¡score ¡originated. ¡ ¡ ¡ ¡

Backtracking ¡Example ¡

Find a match in row and column 2. i=2, j=2,5 is a match (T). j=2, i=4,5,7 is a match (T). Since vi = wj, si,j = si-1,j-1 +1 s2,2 = [s1,1 = 1] + 1 s2,5 = [s1,4 = 1] + 1 s4,2 = [s3,1 = 1] + 1 s5,2 = [s4,1 = 1] + 1 s7,2 = [s6,1 = 1] + 1

Backtracking ¡Example ¡

Con1nuing ¡with ¡the ¡ dynamic ¡programming ¡ ¡ algorithm ¡gives ¡this ¡

Scoring ¡Matrices ¡

known ¡to ¡be ¡related ¡to ¡that ¡obtained ¡by ¡ chance ¡alignment ¡between ¡unrelated ¡ sequences ¡

alignment ¡sta1s1cs ¡are ¡meaningful ¡

sequences ¡

Examples ¡of ¡Scoring ¡Matrices ¡

¡ Let ¡si,j ¡be ¡the ¡total ¡score ¡at ¡posi1on ¡i, ¡j ¡then ¡we ¡ have ¡that ¡it ¡is ¡equal ¡to ¡the ¡maximum ¡of ¡{si-‑1, ¡j-‑1 ¡+ ¡1 ¡ (if ¡vi ¡= ¡wj), ¡si-‑1, ¡j ¡(otherwise: ¡gap ¡or ¡mismatch)} ¡ ¡

– ¡-‑σ ¡for ¡1 ¡indel, ¡ ¡ – -‑2σ ¡for ¡2 ¡consecu1ve ¡ ¡indels ¡ – -‑3σ ¡for ¡3 ¡consecu1ve ¡ ¡indels, ¡etc. ¡

– -‑ρ-‑σ ¡ ¡when ¡there ¡is ¡1 ¡indel ¡ – -‑ρ-‑2σ ¡ ¡when ¡there ¡are ¡2 ¡indels ¡ – -‑ρ-‑3σ ¡ ¡when ¡there ¡are ¡3 ¡indels, ¡etc. ¡ ¡ – -‑ρ-‑ ¡x·√σ ¡(-‑gap ¡opening ¡-‑ ¡x ¡gap ¡extensions) ¡

related ¡in ¡lineage ¡to ¡a ¡certain ¡protein ¡with ¡ known ¡amino-‑acid ¡sequence? ¡ ¡

“A ¡tab-‑delimited ¡text ¡format ¡consis1ng ¡of ¡a ¡header ¡sec1on, ¡ which ¡is ¡op1onal, ¡and ¡an ¡alignment ¡sec1on” ¡

in ¡the ¡fastq ¡file. ¡quality ¡Q, ¡probability ¡P: ¡ ¡ P ¡= ¡10 ¡^ ¡(-‑Q ¡/ ¡10.0) ¡ ¡