
Computational and Mathematical Biology in the Genomics Age: Predicting protein structures. Ron Elber, Cornell. 11/13/2006


SLIDE 1

Computational and Mathematical Biology in the Genomics Age: Predicting protein structures

Ron Elber, Cornell

SLIDE 2

Crash course on proteins

  • Proteins are one-dimensional polymers
  • Made of 20 types of monomers (amino acids) with different side chains (ACDEFG…) but the same backbone
  • Fold into a well-defined 3D shape that includes secondary structure elements (helices, sheets)
  • They are the machines of the smallest living entities (cells)

SLIDE 3

Why protein structures? Sequence determines 3D shape. Shape determines function.

ACDEFGHIJKLMNPQ

Active site!

Drug design…

SLIDE 4

Approaches to determine protein structure

  • Experiment (X-ray, NMR): months
  • Modeling the chemical physics: weeks
  • Homology-based modeling: hours

SLIDE 5

Structures Are Evolutionary Templates

Oxygen Transport Proteins: Leghemoglobin in Plants, Myoglobin in Mammals

A high degree of structural similarity is often observed in proteins with diverse sequences and in different species (below the noise level of 15 percent sequence identity).

SLIDE 6

Three steps in homology modeling

  • Identify a structural template for the unknown sequence
  • Align the unknown sequence to the structural template
  • Build an atomic model based on the template

ACEFGH….

SLIDE 7

Measures of tertiary structure fitness

Instead of direct sequence comparison

1BIN:A   2/3      AFTEKQDALVSSSFEAFKANIPQYSVVFYTSILEKAPAAKDLFSFLANG-----VDPTNP
1MBC:_   1/2      VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASE-

1BIN:A   57/58    KLTGHAEKLFALVRDSAGQLKASGTVV--ADAALGSVHAQKAVTDPQFVVVKEALLKTIK
1MBC:_   60/61    DLKKHGVTVLTALGAILKKK---GHHEAELKPLAQSHATKHKIPIKYLEFISEAIIHVLH

1BIN:A   115/116  AAVGDKWSDELSRAWEVAYDELAAAIKKA
1MBC:_   117/118  SRHPGDFGADAQGAMNKALELFRKDIAAK

AFTEKQDALVSSSFEAFKANIPQYSVVFYTSILEKAPAAKDLFSFLANGVDPTNPKLTGHAEKLFALVRDSAGQLKASGTVVADAALGSVHAQKAVTDPQFVVVKEALLKTIKAAVGDKWSDELSRAWEVAYDELAAAIKKA

Match unknown sequence to a known structure of a sequence
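For contrast with structure-based fitness, the "direct sequence comparison" that this slide sets aside can be sketched as a percent-identity count over a gapped alignment. This is only an illustration: the function name `percent_identity` and the choice of denominator (columns where neither row has a gap) are assumptions, not something stated in the talk. The two rows are the first alignment block shown for 1BIN:A and 1MBC.

```python
def percent_identity(row_a, row_b, gap="-"):
    """Percent of gap-free alignment columns that hold identical residues.

    Assumption: identity is normalised by the number of columns in which
    neither row has a gap; other conventions exist.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    # Keep only the columns where both sequences contribute a residue.
    aligned = [(x, y) for x, y in zip(row_a, row_b) if x != gap and y != gap]
    if not aligned:
        return 0.0
    identical = sum(1 for x, y in aligned if x == y)
    return 100.0 * identical / len(aligned)


# First alignment block from the slide (leghemoglobin 1BIN:A vs myoglobin 1MBC).
row_1bin = "AFTEKQDALVSSSFEAFKANIPQYSVVFYTSILEKAPAAKDLFSFLANG-----VDPTNP"
row_1mbc = "VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASE-"
print(percent_identity(row_1bin, row_1mbc))
```

A low value here is the situation the talk describes: sequence identity alone gives a weak signal, which motivates scoring the sequence against the known structure instead.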

SLIDE 8

Sequence, structure, function

  • Testing folds: ISTHISMYSHAPE
  • Find homologs: ANYRELATIVES, PERHAPSIAM

SLIDE 9

A Machine Learning Algorithm to Match a Protein Sequence to a Homolog Structure

  • Potential design: Formulation and application
  • Generating and learning alignments
  • Applications

SLIDE 10

Potential design

Pair or contact potential: $E = \sum_{i>j} u_{ij}(r_{ij}; P)$

Profile potential: $E = \sum_{i} u_{i}(x_{i}; P)$
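The two functional forms on this slide can be sketched in a few lines. Everything concrete below is an assumption made for illustration: the distance cutoff that defines a contact, the dictionary layout of the parameters, and the "buried"/"exposed" site labels are hypothetical and are not taken from the talk.

```python
from itertools import combinations
from math import dist


def contact_energy(seq, coords, pair_params, cutoff=6.5):
    """Pair (contact) potential: sum u_ij over residue pairs i > j in contact.

    Assumption: two residues are "in contact" when their representative
    points are closer than `cutoff`; the cutoff value is illustrative only.
    """
    energy = 0.0
    for j, i in combinations(range(len(seq)), 2):  # every pair with i > j
        if dist(coords[i], coords[j]) < cutoff:
            key = tuple(sorted((seq[i], seq[j])))  # unordered residue pair
            energy += pair_params[key]
    return energy


def profile_energy(seq, site_types, profile_params):
    """Profile potential: each residue is scored against its own site only."""
    return sum(profile_params[(aa, site)] for aa, site in zip(seq, site_types))


# Toy three-residue chain with hypothetical parameters.
seq = "ACA"
coords = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0), (8.0, 0.0, 0.0)]
pair_params = {("A", "A"): -1.0, ("A", "C"): 0.5, ("C", "C"): 0.0}
site_types = ["buried", "exposed", "buried"]
profile_params = {("A", "buried"): -0.5, ("C", "exposed"): 0.2}

print(contact_energy(seq, coords, pair_params))
print(profile_energy(seq, site_types, profile_params))
```

The pair form needs a double loop over residues, while the profile form is a single pass over sites; that difference in cost is the efficiency trade-off the later THOM2 slides refer to.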

SLIDE 11

SLIDE 12

Learning the fold that matches a sequence from the set of all known structures

$E(S_n, X_i; P) - E(S_n, X_n; P) > 0$

$a_1 a_2 a_3 \ldots a_n$

SLIDE 13

Learning folds: Find a potential that recognizes the native fold

$E(S_n, X_i; P) - E(S_n, X_n; P) > 0$

$E(X) = \sum_{i} p_i f_i(X)$

$E_{contact} = \sum_{\alpha} n_{\alpha} p_{\alpha}$

SLIDE 14

Mathematical Programming approach to potential design (contact energies)

$E = \sum_{i>j} p_{ij} = \sum_{\alpha} n_{\alpha} p_{\alpha}$

$\Delta E_{i,nat} = E_i - E_{nat} > 0$

$\min |p|^2$ subject to $\Delta E_{i,nat} = \sum_{\alpha} (n_{\alpha}^{i} - n_{\alpha}^{nat}) p_{\alpha} = \Delta n \cdot p > 1$

$p$ is the unknown

Interior point, SVM
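To make the feasibility problem concrete: each wrong structure contributes one vector of contact-count differences (wrong minus native), and the parameters p must give every such vector a positive energy gap. The sketch below is not the interior-point or SVM solver named on the slide; it swaps in a plain margin perceptron, chosen only because it fits in a few lines of standard-library Python. The function name, the `margin` default, and the sweep budget are assumptions.

```python
def learn_potential(delta_n, margin=1.0, max_sweeps=1000):
    """Search for p such that dot(row, p) >= margin for every row of delta_n.

    Stand-in technique: margin perceptron updates (not an LP or SVM solver).
    Returns the parameter list, or None if no feasible p was found within
    the sweep budget (which is what happens for an infeasible set).
    """
    p = [0.0] * len(delta_n[0])
    for _ in range(max_sweeps):
        violated = False
        for row in delta_n:
            gap = sum(x * w for x, w in zip(row, p))
            if gap < margin:
                # Push p toward satisfying the violated inequality.
                p = [w + x for w, x in zip(p, row)]
                violated = True
        if not violated:
            return p
    return None


# Hypothetical contact-count differences for three wrong structures,
# two contact types each.
feasible = [[1, -1], [2, 0], [0, -1]]
print(learn_potential(feasible))

# Two contradictory requirements on a single parameter: no solution exists.
infeasible = [[1], [-1]]
print(learn_potential(infeasible))
```

The second call illustrates the failure mode the talk reports for general pairwise potentials: when the inequalities contradict each other, no parameter set separates native from wrong structures.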

SLIDE 15

$\Delta E_{i,nat} = \sum_{\alpha} (n_{\alpha}^{i} - n_{\alpha}^{nat}) p_{\alpha} = \Delta n \cdot p > 1$

SLIDE 16

Learning the correct fold using 60 million comparisons between native and wrong structures

$E(S_n, X_i) - E(S_n, X_n) > 0, \quad i = 1, \ldots, 60000000$

$a_1 a_2 a_3 \ldots a_n$

SLIDE 17

General pairwise potentials are insufficient to recognize the correct protein fold for a large set of protein-like structures (13 steps optimized independently lead to infeasibility): Tobi & Elber, Proteins 41, 40-46 (2000).

Pairwise potentials are better than profile models (to be shown) but still not good enough. Need statistical enhancements of the signal.

SLIDE 18

Threading Onion Model (THOM2)

An improved profile model that mixes the accuracy of pairwise energies and the efficiency of profile energies. Defining effective pair energies in terms of structural fingerprints of sites in contact…

SLIDE 19

SLIDE 20

THOM2 yields effective pair interactions, maintaining the efficiency of profile models.

  • Comparable performance to contact potentials (with 300 parameters) in terms of self-recognition
  • LP derived optimal parameters (interior point algorithms!)
  • Optimal alignments with gaps found using dynamic programming
  • Need for gap penalties for family recognition…

SLIDE 21

Alignment

Even if we identify a homolog, the problem of structural modeling is not solved. An accurate alignment is crucial for successful modeling. Also, the presence of gaps can make the identification more difficult.

[Alignment figure: $a_1 a_2 a_3 a_4$ aligned against $x_1 x_2 x_3 x_4 x_5$ with gaps (−)]

If we need gaps, we call the fitness function a score (instead of an energy) and denote it by $T$.

SLIDE 22

An alignment is a path in a dynamic programming table

        -     a1    a2    a3    a4    a5
  -           g     2g    3g    4g    5g
  x1    g
  x2    2g
  x3    3g
  x4    4g
  x5    5g

Each cell is entered by a horizontal (→), vertical (↓), or diagonal (\) step.

Finding the optimal alignment is quadratic in the protein length using Dynamic Programming

SLIDE 23

Dynamic programming: Find optimal alignment for a given set of parameters

$T(n, m)$: the optimal score for aligning a sequence of length n against a sequence of length m.

If we had the optimal scores for the following earlier alignments:

$T(n-1, m-1)$, $T(n-1, m)$, $T(n, m-1)$

can we construct the score $T(n, m)$?

Yes…

SLIDE 24

Dynamic programming: Continue

We consider three possibilities to obtain an alignment of n against m amino acids.

Option A: align n-1 against m-1 amino acids with a score $T(n-1, m-1)$; extend the alignment by a(n)/b(m) with a score $S(a_n, b_m)$.

Option B: align n amino acids against m-1 amino acids with a score $T(n, m-1)$; extend the alignment by -/b(m) with a score g for a gap.

Option C: align n-1 amino acids against m amino acids with a score $T(n-1, m)$; extend the alignment by a(n)/- with a corresponding score of g.

To decide which of the three options is optimal we need to compare the scores of the three options A, B, C:

$T(n-1, m-1) + S(a_n, b_m)$

$T(n, m-1) + g$

$T(n-1, m) + g$

SLIDE 25

Dynamic programming: Decision

$T(n, m) = \max \left[ \begin{array}{l} T(n-1, m-1) + S(a_n, b_m) \\ T(n, m-1) + g \\ T(n-1, m) + g \end{array} \right]$

SLIDE 26

How to start??

$T(1, -) = T(-, 1) = g$

And continue (for example…) by

$T(a_1, b_1) = \max \left[ \begin{array}{l} T(a_1, -) + g \\ T(-, b_1) + g \\ T(0, 0) + S(a_1, b_1) \end{array} \right] = \max \left[ \begin{array}{l} 2g \\ 2g \\ S(a_1, b_1) \end{array} \right]$
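The recursion and its starting values translate directly into a table fill followed by a walk back from the last cell. The sketch below assumes a single linear gap score g per gap position, as on the slides; the function names, the toy match/mismatch scores, and the use of "-" for a gap are illustrative choices, not part of the talk.

```python
def fill_table(a, b, score, gap):
    """Fill T[i][j], the optimal score of aligning a[:i] against b[:j]."""
    n, m = len(a), len(b)
    T = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # first column: a[:i] against gaps only
        T[i][0] = i * gap
    for j in range(1, m + 1):          # first row: b[:j] against gaps only
        T[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            T[i][j] = max(
                T[i - 1][j - 1] + score(a[i - 1], b[j - 1]),  # option A
                T[i][j - 1] + gap,                            # option B
                T[i - 1][j] + gap,                            # option C
            )
    return T


def traceback(T, a, b, score, gap):
    """Recover one optimal alignment by walking back from T[n][m]."""
    i, j = len(a), len(b)
    top, bottom = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and T[i][j] == T[i - 1][j - 1] + score(a[i - 1], b[j - 1]):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif j > 0 and T[i][j] == T[i][j - 1] + gap:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
        else:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
    return "".join(reversed(top)), "".join(reversed(bottom))


def toy_score(x, y):
    """Hypothetical substitution score: +2 for a match, -1 otherwise."""
    return 2 if x == y else -1


T = fill_table("ACD", "AD", toy_score, gap=-2)
print(T[3][2])
print(traceback(T, "ACD", "AD", toy_score, gap=-2))
```

Integer scores are used so that the equality tests in the traceback are exact; with floating-point scores one would typically store the chosen option per cell instead of recomputing it. The two nested loops make the quadratic cost in sequence length explicit.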

SLIDE 27

Here we start…

        -     a1    a2    a3    a4    a5
  -           g     2g    3g    4g    5g
  b1    g
  b2    2g
  b3    3g
  b4    4g
  b5    5g

Each cell is entered by a horizontal (→), vertical (↓), or diagonal (\) step.

SLIDE 28

The 13 step potential is one of the best around (tested on the Baker’s set, 65 sets; Tamara Galor)

  Potential   aver. pos.   # correct   Z score
  TE13        27           40          4.3
  MJ          150          23          2.1
  HL          163          15          2.0
  SK          158          11          1.8
  BT          148          15          2.0
  THOM2       106          15          2.0

SLIDE 29

Need for statistical verification of predictions:

  • Scoring according to an energy may be insufficient (good matches by similar length or composition)
  • Z-score: a convenient measure of the strength of a match in terms of distribution of energies for random alignments
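A Z-score of this kind can be sketched as the distance of the observed energy from the mean of a random background, in units of the background's standard deviation. Two things below are assumptions rather than statements from the talk: the sign convention (lower energy is better, so a strong match gets a large positive Z) and the use of shuffled sequences as the source of random energies. The names `z_score` and `shuffled_energies` are hypothetical.

```python
import random
from statistics import mean, pstdev


def z_score(observed_energy, random_energies):
    """Z = (mean of random energies - observed energy) / standard deviation.

    Assumption: lower energy means a better match, so a good match has Z > 0.
    """
    sigma = pstdev(random_energies)
    if sigma == 0:
        raise ValueError("random energies have zero spread")
    return (mean(random_energies) - observed_energy) / sigma


def shuffled_energies(seq, energy_fn, n_samples=100, seed=0):
    """Energies of randomly shuffled copies of seq (same length, same composition)."""
    rng = random.Random(seed)
    residues = list(seq)
    energies = []
    for _ in range(n_samples):
        rng.shuffle(residues)
        energies.append(energy_fn("".join(residues)))
    return energies


# Toy background distribution with mean 0.
background = [0.0, 2.0, -2.0, 4.0, -4.0]
print(z_score(-10.0, background))
```

Shuffling keeps the length and composition of the sequence fixed, which addresses the concern in the first bullet: a match that scores well only because of length or composition will also score well for the shuffled copies and so receive a low Z.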

SLIDE 30

Joint Z-score (global and local threading) distribution:

[Figure labels: FALSE POSITIVES, TRUE POSITIVES, ORFX]

SLIDE 31

Family recognition: POU-like domains

SLIDE 32

Family recognition: immunoglobulins

SLIDE 33

Sample LOOPP Predictions

SLIDE 34

Predictions for difficult targets, CAFASP & CASP

T102 (70 res): Model 1: 1bo9, 34 res with 2.5 Å, 44 res with 3.1 Å, 12th best (1st) model (M. Sippl); 1nkl among best matches as well

T116_2 (121 res): Model 1: 1a0cA, 50 res with 2.9 Å, 2nd best (1st) model (M. Sippl)

SLIDE 35

Predictions for difficult targets: T097 (104 res).

Model 1: 2hfh, 39 res with 3.3 Å

Model 2: 3itr, 54 res with 3.2 Å

Matching into complementary sub-domains: model 1 - “good for that target” (A. Lesk), model 2 - 11th best (among 1st and 2nd models, M. Sippl)

SLIDE 36

CASP prediction: Target T0280

SLIDE 37

  Target   Best Loopp RMSD   Best other RMSD   Is the best hit chosen?   RMSD of best hit   If best hit is not chosen, is one of the chosen hits a true hit?
  288      2.0               1.3               Best hit not chosen       1.01               Yes
  290      0.53              0.48              Best hit chosen           0.47
  291      1.6               0.7               Best hit chosen           0.86
  292      3.1               2.9               Best hit not chosen       2.68               Yes
  293      5.4               4.6               Best hit not chosen       2.42               Yes
  294      2.7               2.1               Best hit chosen           1.55
  295      2.4               1.8               Best hit chosen           2.12
  297      4.3               2.8               Best hit chosen           2.37
  298      5.7               2.3               No good hit#              -
  302      1.7               1.5               No good hit               -
  303      2.8               2.2               Best hit chosen           2.16
  305      1.3               1.0               Best hit not chosen       1.17               Yes
  308      2.0               1.4               Best hit not chosen       1.18               Yes

SLIDE 38

Sometimes we do really badly…

  CASP7 Target   Best Loopp RMSD   Best other RMSD   Reason for Loopp going wrong
  283            8.4               5.8               Hit present in DB but wrong parent chosen
  289            7.4               6.2               Hit present in DB but wrong parent chosen
  296            22.3              5.1               No true hit in database
  299            15.7              5.0               Hit present in DB but wrong parent chosen
  300            11.0              1.2               Hit present in DB but wrong parent chosen
  301            22.4              8.0               No true hit in database
  304            10.0              4.9               Hit present in DB but wrong parent chosen
  306            12.9              5.3               No true hit in database
  307            13.0              6.6               Hit present in DB but wrong parent chosen
  309            12.0              7.0               No true hit in database

SLIDE 39

Structure prediction for a tomato fruit-weight protein

  • ORFX gene, controlling the size of a tomato fruit, has been predicted to share structural similarity with human Ras p21 (work in collaboration with Tanksley’s group, Cornell, Science 289, 85-89 (2000))

SLIDE 40

Phylogeny of Lycopersicon

  • L. esculentum var cerasiforme
  • L. cheesmanii
  • L. pimpinellifolium
  • L. chmielewskii
  • L. parviflorum
  • L. chilense
  • L. pennellii
  • L. hirsutum
  • L. peruvianum
  • L. esculentum var esculentum

red fruit self-compatible

SLIDE 41

Chromosome 2

[Map figure labels: stuffer, ovate, se2.1, fw2.1, 2.2, 2.3]

TG463 TG266 TG554 TG469 CT9 TG337 TG48 TG493 TG189 TG608 CT205 TG167 TG151 CT59 TG154 TG34 TG91

SLIDE 42

Human Ras p21

  • Molecular switch based on GTP hydrolysis
  • Cellular growth control and cancer
  • Ras oncogene: single point mutations at positions Gly12 or Gln61

SLIDE 43

SLIDE 44

Yet bigger tomatoes…

SLIDE 45

Some references to LOOPP

  • Dror Tobi, Gil Shafran, Nathan Linial and Ron Elber, "On the design and analysis of protein folding potentials", Proteins, Structure, Function and Genetics, 40, 71-85 (2000).
  • Dror Tobi and Ron Elber, "Distance dependent, pair potential for protein folding: Results from linear optimization", Proteins, Structure, Function and Genetics, 41, 40-46 (2000).
  • Jaroslaw Meller and Ron Elber, "Linear Optimization and a double Statistical Filter for protein threading protocols", Proteins, Structure, Function and Genetics, 45, 241-261 (2001).
  • Jian Qiu and Ron Elber, "Atomically detailed potentials to recognize native and approximate protein structures", Proteins, Structure, Function, and Bioinformatics, 61, 44-55 (2005).