Computational and Mathematical Biology in the Genomics Age: Predicting protein structures
Ron Elber, Cornell
11/13/2006

Proteins are one-dimensional polymers
Made of 20 types of monomers (amino acids) with different side chains (ACDEFG…) but the same backbone
Fold into a well-defined 3D shape that includes secondary structure elements (helices, sheets)
They are the machines of the smallest living entities (cells)

Why protein structures? Sequence determines 3D shape. Shape determines function.
ACDEFGHIJKLMNPQ Active site!
Drug design…

Approaches to determine protein structure
Experiment (X-ray, NMR): months
Modeling the chemical physics: weeks
Homology-based modeling: hours

Oxygen Transport Proteins
Leghemoglobin in Plants, Myoglobin in Mammals
A high degree of structural similarity is often observed in proteins with diverse sequences from different species (15 percent sequence identity, which is below the noise level).

Identify a structural template for the unknown sequence
Align the unknown sequence to the structural template
Build an atomic model based on the template
[Figure: the sequence ACEFGH… placed on a structural template]

Instead of direct sequence comparison
1BIN:A   2/3   AFTEKQDALVSSSFEAFKANIPQYSVVFYTSILEKAPAAKDLFSFLANG-----VDPTNP
1MBC:_   1/2   VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASE-
1BIN:A  57/58  KLTGHAEKLFALVRDSAGQLKASGTVV--ADAALGSVHAQKAVTDPQFVVVKEALLKTIK
1MBC:_  60/61  DLKKHGVTVLTALGAILKKK---GHHEAELKPLAQSHATKHKIPIKYLEFISEAIIHVLH
1BIN:A 115/116 AAVGDKWSDELSRAWEVAYDELAAAIKKA
1MBC:_ 117/118 SRHPGDFGADAQGAMNKALELFRKDIAAK
Match unknown sequence to a known structure of a sequence
AFTEKQDALVSSSFEAFKANIPQYSVVFYTSILEKAPAAKDLFSFLANGVDPTNPKLTGHAEKLFALVRDSAGQLKASGTVVADAALGSVHAQKAVTDPQFVVVKEALLKTIKAAVGDKWSDELSRAWEVAYDELAAAIKKA
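Direct sequence comparison is usually summarized as percent identity over an alignment. The sketch below is illustrative only: the function name `percent_identity` is hypothetical, and it assumes identity is counted over columns where both sequences have a residue (other conventions for the denominator exist). It is applied to the first block of the 1BIN:A / 1MBC:_ alignment from the slide.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of identical residues over columns where neither row has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    identical = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            continue  # skip gapped columns (an assumed convention)
        compared += 1
        if x == y:
            identical += 1
    if compared == 0:
        return 0.0
    return 100.0 * identical / compared


# First block of the alignment shown on the slide.
seq_1bin = "AFTEKQDALVSSSFEAFKANIPQYSVVFYTSILEKAPAAKDLFSFLANG-----VDPTNP"
seq_1mbc = "VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASE-"
print(f"{percent_identity(seq_1bin, seq_1mbc):.1f}% identity over this block")
```

A low value here, despite the shared fold, is the motivation for matching the sequence against a structure instead.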

Testing folds: ISTHISMYSHAPE
Find homologs: ANYRELATIVES, PERHAPSIAM

Potential design: Formulation and application
Generating and learning alignments
Applications

Pair or contact potential: $E = \sum_{i>j} u_{ij}(r_{ij}; P)$
Profile potential: $E = \sum_{i} u_{i}(x_{i}; P)$
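To make the difference between the two energy forms concrete, here is a minimal sketch. Everything specific in it is an assumption for illustration: the helper names, the 6.5 Å contact cutoff, the minimum sequence separation, a step-function pair term looked up by residue-type pair, and the use of the number of contacts as the site descriptor x_i in the profile term.

```python
import math
from itertools import combinations


def contacts(coords, cutoff=6.5, min_separation=3):
    """Index pairs (i, j), i < j, of sites closer than `cutoff` (assumed contact rule)."""
    pairs = []
    for i, j in combinations(range(len(coords)), 2):
        if j - i >= min_separation and math.dist(coords[i], coords[j]) < cutoff:
            pairs.append((i, j))
    return pairs


def pair_energy(seq, coords, pair_params, cutoff=6.5):
    """Pair form: sum over contacting pairs of a parameter indexed by the two residue types."""
    total = 0.0
    for i, j in contacts(coords, cutoff):
        key = tuple(sorted((seq[i], seq[j])))
        total += pair_params.get(key, 0.0)
    return total


def profile_energy(seq, coords, site_params, cutoff=6.5):
    """Profile form: sum over sites of a parameter indexed by residue type and contact count."""
    n_neighbors = [0] * len(coords)
    for i, j in contacts(coords, cutoff):
        n_neighbors[i] += 1
        n_neighbors[j] += 1
    return sum(site_params.get((seq[i], n_neighbors[i]), 0.0) for i in range(len(seq)))
```

The profile form scores each site independently of the identity of its partners, which is what makes it cheap to use inside an alignment.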

$E(S_n, X_i; P) - E(S_n, X_n; P) > 0$
[Figure: items labeled 1, 2, 3, …, n]

Learning folds: Find a potential that recognizes the native fold
$E(S_n, X_i; P) - E(S_n, X_n; P) > 0$
$E = \sum_{i} p_i f_i(X)$
$E_{contact} = \sum_{\alpha} n_\alpha p_\alpha$

Mathematical Programming approach to potential design (contact energies)
$E = \sum_{i>j} p_{ij} = \sum_{\alpha} n_\alpha p_\alpha$
$\Delta E_{i,nat} = E_i - E_{nat} > 0$
$\Delta E_{i,nat} = \sum_{\alpha} (n_\alpha^{i} - n_\alpha^{nat}) p_\alpha = \Delta n \cdot p > 1$
$\min |p|^2$ subject to the inequalities above
Interior point, SVM
[Figure: residues i and j in contact]
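Because the energy is linear in the parameters, requiring every decoy to lie above the native structure is a set of linear inequalities in p. The sketch below illustrates that feasibility problem only. It deliberately swaps in a simple margin perceptron rather than the interior point or SVM solvers named on the slide, and the function name, learning rate, margin, and toy contact-count vectors are all assumptions.

```python
def train_potential(native_counts, decoy_counts, margin=1.0, rate=0.1, max_epochs=1000):
    """Search for p with (n_decoy - n_native) . p >= margin for every decoy.

    Returns the parameter vector, or None if no solution is found within
    `max_epochs` passes (for example when the inequalities are infeasible).
    """
    # One difference vector per decoy: delta_n = n_decoy - n_native.
    deltas = [[d - n for d, n in zip(decoy, native_counts)] for decoy in decoy_counts]
    p = [0.0] * len(native_counts)
    for _ in range(max_epochs):
        violated = False
        for dn in deltas:
            if sum(x * w for x, w in zip(dn, p)) < margin:
                # Push p in the direction that raises this decoy's energy gap.
                p = [w + rate * x for w, x in zip(p, dn)]
                violated = True
        if not violated:
            return p
    return None
```

For realistic problem sizes (the slides mention tens of millions of inequalities) a dedicated LP or quadratic programming solver would be the appropriate tool; this loop only shows what "solving the inequalities" means.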

$\Delta E_{i,nat} = \sum_{\alpha} (n_\alpha^{i} - n_\alpha^{nat}) p_\alpha = \Delta n \cdot p > 1$

$E(S_n, X_i) - E(S_n, X_n) > 0, \quad i = 1, \ldots, 60000000$
[Figure: items labeled 1, 2, 3, …, n]

General pairwise potentials are insufficient to recognize the correct protein fold for a large set of protein-like structures (13 steps
Tobi & Elber, Proteins 41, 40-46 (2000)
Pairwise potentials are better than profile models (to be shown) but still not good enough. Need statistical enhancements of the signal.

An improved profile model that mixes the accuracy of pairwise energies and the efficiency of profile energies.
Defining effective pair energies in terms of structural fingerprints of sites in contact…

THOM2 yields effective pair interactions, maintaining the efficiency of profile models.
Comparable performance to contact potentials (with 300 parameters) in terms of self-recognition
LP derived optimal parameters (interior point algorithms!)
Optimal alignments with gaps found using dynamic programming
Need for gap penalties for family recognition…

Even if we identify a homolog, the problem of structural modeling is not solved. An accurate alignment is crucial for successful modeling. Also the presence of gaps can make the identification more difficult.
[Figure: example alignment of a1 a2 a3 a4 against x1 x2 x3 x4 x5 with gaps (−)]
If we need gaps we call the fitness function a score (instead of energy) and denote it by

[Figure: dynamic programming grid with a1 … a5 along one axis and x1 … x5 along the other; the boundary cells hold the gap scores g, 2g, 3g, 4g, 5g; cells are connected by horizontal (→), vertical (↓), and diagonal (\) steps]
Finding the optimal alignment is quadratic in the protein length using Dynamic Programming

$T(n, m)$: the optimal score for aligning a sequence of length n against a sequence of length m.
If we had the optimal scores for the following earlier alignments: $T(n-1, m-1)$, $T(n-1, m)$, $T(n, m-1)$, can we construct the score $T(n, m)$?
Yes…

We consider three possibilities to obtain an alignment of n against m amino acids.
Option A: align n-1 against m-1 amino acids with a score $T(n-1, m-1)$; extend the alignment by a(n)/b(m) with a score $S(a_n, b_m)$.
Option B: align n amino acids against m-1 amino acids with a score $T(n, m-1)$; extend the alignment by -/b(m) with a score g for a gap.
Option C: align n-1 amino acids against m amino acids with a score $T(n-1, m)$; extend the alignment by a(n)/- with a corresponding score of g.
To decide which of the three options is optimal we need to compare the scores
$T(n-1, m-1) + S(a_n, b_m)$
$T(n, m-1) + g$
$T(n-1, m) + g$

$T(n, m) = \max \begin{bmatrix} T(n-1, m-1) + S(a_n, b_m) \\ T(n, m-1) + g \\ T(n-1, m) + g \end{bmatrix}$

$T(1, -) = T(-, 1) = g$
$T(a_1, b_1) = \max \begin{bmatrix} T(a_1, -) + g \\ T(-, b_1) + g \\ T(0, 0) + S(a_1, b_1) \end{bmatrix} = \max \begin{bmatrix} 2g \\ 2g \\ S(a_1, b_1) \end{bmatrix}$
And continue (for example…) by
[Figure: dynamic programming grid with a1 … a5 along one axis and b1 … b5 along the other; the boundary cells hold the gap scores g, 2g, 3g, 4g, 5g; cells are connected by horizontal (→), vertical (↓), and diagonal (\) steps]
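The three-way recurrence and the gap-scored boundary described on the preceding slides can be sketched as a short global alignment routine. This is an illustration under assumptions, not the LOOPP implementation: the function name `align`, the identity-based score function in the usage note, and the gap value are hypothetical, and a single constant gap score g is assumed.

```python
def align(a, b, score, gap):
    """Global alignment of a against b with a constant gap score.

    T[i][j] is the optimal score for aligning the first i letters of a
    against the first j letters of b. Returns (score, aligned_a, aligned_b).
    """
    n, m = len(a), len(b)
    T = [[0.0] * (m + 1) for _ in range(n + 1)]
    # Boundary: aligning k letters against nothing costs k gaps.
    for i in range(1, n + 1):
        T[i][0] = i * gap
    for j in range(1, m + 1):
        T[0][j] = j * gap
    # Fill the table: quadratic in the sequence lengths.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            T[i][j] = max(
                T[i - 1][j - 1] + score(a[i - 1], b[j - 1]),  # option A: a(i)/b(j)
                T[i][j - 1] + gap,                            # option B: -/b(j)
                T[i - 1][j] + gap,                            # option C: a(i)/-
            )
    # Trace back one optimal path from the last cell.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and T[i][j] == T[i - 1][j - 1] + score(a[i - 1], b[j - 1]):
            top.append(a[i - 1]); bottom.append(b[j - 1])
            i -= 1; j -= 1
        elif i == 0 or (j > 0 and T[i][j] == T[i][j - 1] + gap):
            top.append("-"); bottom.append(b[j - 1])
            j -= 1
        else:
            top.append(a[i - 1]); bottom.append("-")
            i -= 1
    return T[n][m], "".join(reversed(top)), "".join(reversed(bottom))
```

For example, with a toy score of +1 for identical letters and -1 otherwise, `align("ACDE", "ADE", lambda x, y: 1 if x == y else -1, -1)` places one gap opposite the C. In threading, the second argument would be a list of structural sites and `score` the site energy, but the table is filled the same way.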

13-step potential: one of the best around (tested on the Baker’s set, 65 sets; Tamara Galor)
TE13  | 27  | 40 | 4.3
MJ    | 150 | 23 | 2.1
HL    | 163 | 15 | 2.0
SK    | 158 | 11 | 1.8
BT    | 148 | 15 | 2.0
THOM2 | 106 | 15 | 2.0

Scoring according to an energy may be insufficient (good matches by similar length or composition)
Z-score: a convenient measure of the strength of a match in terms of the distribution of energies for random alignments
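A minimal sketch of such a Z-score follows. The sign convention (higher score is better), the use of shuffled sequences as the random reference, and the names `z_score` and `shuffled_scores` are assumptions for illustration. Shuffling keeps length and composition fixed, which is one way to discount matches that are good only for those reasons.

```python
import random
import statistics


def z_score(native_score, random_scores):
    """Number of standard deviations separating a score from the random distribution."""
    mean = statistics.mean(random_scores)
    sd = statistics.pstdev(random_scores)
    if sd == 0:
        raise ValueError("random scores have zero spread; Z-score is undefined")
    return (native_score - mean) / sd


def shuffled_scores(sequence, score_fn, n_shuffles=100, seed=0):
    """Score `n_shuffles` random permutations of the sequence with `score_fn`."""
    rng = random.Random(seed)
    residues = list(sequence)
    scores = []
    for _ in range(n_shuffles):
        rng.shuffle(residues)
        scores.append(score_fn("".join(residues)))
    return scores
```

Here `score_fn` stands for whatever alignment score is being assessed, for example the optimal score of the shuffled sequence against the candidate template.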

Joint Z-score (global and local threading) distribution:
[Figure: regions labeled FALSE POSITIVES and TRUE POSITIVES; ORFX marked]

Family recognition: POU-like domains

Family recognition: immunoglobulins

Predictions for difficult targets, CAFASP & CASP:
T102 (70 res). Model 1: 1bo9, 34 res with 2.5 Å, 44 res with 3.1 Å, 12th best (1st) model (M. Sippl); 1nkl among best matches as well
T116_2 (121 res). Model 1: 1a0cA, 50 res with 2.9 Å, 2nd best (1st) model (M. Sippl)

Predictions for difficult targets: T097 (104 res).
Model 1: 2hfh, 39 res with 3.3 Å
Model 2: 3itr, 54 res with 3.2 Å
Matching into complementary sub-domains: model 1: “good for that target” (A. Lesk); model 2: 11th best (among 1st and 2nd models, M. Sippl)

Target | Best Loopp RMSD | Best other RMSD | Is the best hit chosen? | RMSD hit | If best hit is not chosen, is one of the chosen hits a true hit?
288 | 2.0  | 1.3  | Best hit not chosen | 1.01 | Yes
290 | 0.53 | 0.48 | Best hit chosen     | 0.47 |
291 | 1.6  | 0.7  | Best hit chosen     | 0.86 |
292 | 3.1  | 2.9  | Best hit not chosen | 2.68 | Yes
293 | 5.4  | 4.6  | Best hit not chosen | 2.42 | Yes
294 | 2.7  | 2.1  | Best hit chosen     | 1.55 |
295 | 2.4  | 1.8  | Best hit chosen     | 2.12 |
297 | 4.3  | 2.8  | Best hit chosen     | 2.37 |
298 | 5.7  | 2.3  |                     |      |
302 | 1.7  | 1.5  |                     |      |
303 | 2.8  | 2.2  | Best hit chosen     | 2.16 |
305 | 1.3  | 1.0  | Best hit not chosen | 1.17 | Yes
308 | 2.0  | 1.4  | Best hit not chosen | 1.18 | Yes

Sometimes we do really badly…
CASP7 Target | Best Loopp RMSD | Best other RMSD | Reason for Loopp going wrong
283 | 8.4  | 5.8 | Hit present in DB but wrong parent chosen
289 | 7.4  | 6.2 | Hit present in DB but wrong parent chosen
296 | 22.3 | 5.1 | No true hit in database
299 | 15.7 | 5.0 | Hit present in DB but wrong parent chosen
300 | 11.0 | 1.2 | Hit present in DB but wrong parent chosen
301 | 22.4 | 8.0 | No true hit in database
304 | 10.0 | 4.9 | Hit present in DB but wrong parent chosen
306 | 12.9 | 5.3 | No true hit in database
307 | 13.0 | 6.6 | Hit present in DB but wrong parent chosen
309 | 12.0 | 7.0 | No true hit in database

Structure prediction for a tomato fruit-weight protein
ORFX gene, controlling the size of a tomato fruit, has been predicted to share structural similarity with human Ras p21 (work in collaboration with Tanksley’s group, Cornell, Science 289, 85-89 (2000))

Phylogeny of Lycopersicon
[Figure: phylogenetic tree; labels: red fruit, self-compatible]

Chromosome 2
[Figure: map of chromosome 2 with markers TG463, TG266, TG554, TG469, CT9, TG337, TG48, TG493, TG189, TG608, CT205, TG167, TG151, CT59, TG154, TG34, TG91; loci se2.1 and fw2.1, 2.2, 2.3; label: stuffer]

Human Ras p21
Molecular switch based on GTP hydrolysis
Cellular growth control and cancer
Ras oncogene: single point mutations at positions Gly12 or Gln61

Dror Tobi, Gil Shafran, Nathan Linial and Ron Elber, "On the design and analysis of protein folding potentials", Proteins, Structure, Function and Genetics, 40, 71-85 (2000).
Dror Tobi and Ron Elber, "Distance dependent, pair potential for protein folding: Results from linear optimization", Proteins, Structure, Function and Genetics, 41, 40-46 (2000).
Jaroslaw Meller and Ron Elber, "Linear Optimization and a double Statistical Filter for protein threading protocols", Proteins, Structure, Function and Genetics, 45, 241-261 (2001).
Jian Qiu and Ron Elber, "Atomically detailed potentials to recognize native and approximate protein structures", Proteins, Structure, Function, and Bioinformatics, 61, 44-55 (2005).