Predicting Secondary Structure of All-Helical Proteins Using Hidden Markov Support Vector Machines


Predicting Secondary Structure of All-Helical Proteins Using Hidden Markov Support Vector Machines Blaise Gassend, Charles W. O'Donnell, William Thies, Andrew Lee, Marten van Dijk, and Srinivas Devadas Computer Science and Artificial


slide-1
SLIDE 1

Predicting Secondary Structure of All-Helical Proteins Using Hidden Markov Support Vector Machines

Blaise Gassend, Charles W. O'Donnell, William Thies, Andrew Lee, Marten van Dijk, and Srinivas Devadas

Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology

Workshop on Pattern Recognition in Bioinformatics – August 20, 2006

slide-2
SLIDE 2

Protein Structure Prediction

  • Classical problem: given sequence, predict structure
  • High-level approaches
    1. Energy-minimization (ab initio) techniques
       – Elegant, but often lack correct parameters
    2. Homology-based techniques
       – Useful, but hard to predict new proteins

[Diagram: Sequence → Structure]

Our approach: Use energy minimization, but learn parameters from existing proteins

slide-3
SLIDE 3

Our Framework (Training)

[Flowchart, training loop: an amino-acid sequence and the energy parameters feed the prediction algorithm, which outputs a predicted structure. The prediction is compared with the correct structure from the Protein Data Bank. If it is correct: done. If it is incorrect: add the constraint energy(incorrect) > energy(correct), and the learning algorithm updates the energy parameters.]

slide-4
SLIDE 4

Our Framework (Testing)

[Flowchart, testing: the amino-acid sequence and the energy parameters feed the prediction algorithm, which outputs the predicted structure.]

slide-5
SLIDE 5

Initial Focus: Secondary Structure

  • Classify each residue as alpha helix, beta strand, or coil
    – In this paper, restrict to all-alpha proteins
  • Applications:
    – Informing tertiary structure predictors
    – Identification of homologous proteins
    – Identification of active sites (coils)

slide-6
SLIDE 6

Secondary Structure Predictors

[Plot: prediction accuracy (Q3, 50%-100%) versus year (1975-2010)]

slide-7
SLIDE 7

Secondary Structure Predictors

[Plot: prediction accuracy (Q3, 50%-100%) versus year (1975-2010). Legend: Sequence Only, Sequence + Alignment.]

  • Statistical Methods: Chou/Fasman, GOR, Zvelebil et al., DSC

slide-8
SLIDE 8

Secondary Structure Predictors

[Plot: prediction accuracy (Q3, 50%-100%) versus year (1975-2010). Legend: Sequence Only, Sequence + Alignment.]

  • Statistical Methods: Chou/Fasman, GOR, Zvelebil et al., DSC
  • Neural Networks: Qian/Sejnowski, PHD, Riis/Krogh, PSIPred, Peterson, PSIPred, Porter, SSPro4, SSPro

slide-9
SLIDE 9

Secondary Structure Predictors

[Plot: prediction accuracy (Q3, 50%-100%) versus year (1975-2010). Legend: Sequence Only, Sequence + Alignment.]

  • Statistical Methods: Chou/Fasman, GOR, Zvelebil et al., DSC
  • Neural Networks: Qian/Sejnowski, PHD, Riis/Krogh, PSIPred, Peterson, PSIPred, Porter, SSPro4, SSPro
  • SVMs: Hua/Sun, Casbon, Ceroni, Ward, Kim, Nguyen, Hu

slide-10
SLIDE 10

Secondary Structure Predictors

[Plot: prediction accuracy (Q3, 50%-100%) versus year (1975-2010). Legend: Sequence Only, Sequence + Alignment.]

  • Statistical Methods: Chou/Fasman, GOR, Zvelebil et al., DSC
  • Neural Networks: Qian/Sejnowski, PHD, Riis/Krogh, PSIPred, Peterson, PSIPred, Porter, SSPro4, SSPro
  • SVMs: Hua/Sun, Casbon, Ceroni, Ward, Kim, Nguyen, Hu
  • HMMs: Schmidler et al., HMMSTR, Nguyen, Martin, Won, Martin

slide-11
SLIDE 11

Secondary Structure Predictors

[Plot: prediction accuracy (Q3, 50%-100%) versus year (1975-2010). Legend: Sequence Only, Sequence + Alignment.]

  • Statistical Methods: Chou/Fasman, GOR, Zvelebil et al., DSC
  • Neural Networks: Qian/Sejnowski, PHD, Riis/Krogh, PSIPred, Peterson, PSIPred, Porter, SSPro4, SSPro
  • SVMs: Hua/Sun, Casbon, Ceroni, Ward, Kim, Nguyen, Hu
  • HMMs: Schmidler et al., HMMSTR, Nguyen, Martin, Won, Martin

Plot annotations: 1400-2900 parameters; 680 MB of support vectors; 471 parameters

  • Exploits biochemical models
  • Offers biological insight

slide-12
SLIDE 12

Secondary Structure Predictors

[Plot: prediction accuracy (Q3, 50%-100%) versus year (1975-2010). Legend: Sequence Only, Sequence + Alignment.]

  • Statistical Methods: Chou/Fasman, GOR, Zvelebil et al., DSC
  • Neural Networks: Qian/Sejnowski, PHD, Riis/Krogh, PSIPred, Peterson, PSIPred, Porter, SSPro4, SSPro
  • SVMs: Hua/Sun, Casbon, Ceroni, Ward, Kim, Nguyen, Hu
  • HMMs: Schmidler et al., HMMSTR, Nguyen, Martin, Won, Martin
  • THIS PAPER: 302 params

Plot annotations: 1400-2900 parameters; 680 MB of support vectors; 471 parameters

  • Exploits biochemical models
  • Offers biological insight

slide-13
SLIDE 13

Our Framework Applied to Helix Prediction

[Flowchart: the training loop applied to helix prediction. Prediction algorithm: Hidden Markov Model. Learning algorithm: Support Vector Machines. Structures: alpha helices. Other elements as before: amino-acid sequence, correct structure from the Protein Data Bank, energy parameters, predicted structure (correct: done; incorrect: add the constraint energy(incorrect) > energy(correct)).]

Example labels in the figure:
Sequence:  MNIFEMLRIDEGL
Structure: HHHHHHHHH

slide-14
SLIDE 14

Energy Parameters

Name        Description of Energy Parameters                     Number of Parameters
$H_R$       Energy of residue R in a helix                       20
$N_{R,i}$   Energy of residue R at offset i (-3…3) from N-cap    140
$C_{R,i}$   Energy of residue R at offset i (-3…3) from C-cap    140
            Penalty for coils of length 1 or 2                   2
Total                                                            302
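
As a quick check on the total, the parameter families in the table can be enumerated directly. This is only a sketch: it assumes the 20 standard amino acids and the seven cap offsets -3 to 3, and the string labels (including the two coil-penalty names) are illustrative rather than taken from the paper.

```python
# Enumerate the energy parameters listed in the table (labels are illustrative).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # assumed alphabet: 20 standard residues
OFFSETS = range(-3, 4)                 # cap offsets -3..3, seven in total

helix_params = [f"H_{r}" for r in AMINO_ACIDS]                        # 20
n_cap_params = [f"N_{r},{i}" for r in AMINO_ACIDS for i in OFFSETS]   # 140
c_cap_params = [f"C_{r},{i}" for r in AMINO_ACIDS for i in OFFSETS]   # 140
coil_penalties = ["coil_len_1", "coil_len_2"]                         # 2 (hypothetical names)

all_params = helix_params + n_cap_params + c_cap_params + coil_penalties
print(len(all_params))  # 302
```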

slide-15
SLIDE 15

Energy Parameters

Name        Description of Energy Parameters                     Number of Parameters
$H_R$       Energy of residue R in a helix                       20
$N_{R,i}$   Energy of residue R at offset i (-3…3) from N-cap    140
$C_{R,i}$   Energy of residue R at offset i (-3…3) from C-cap    140
            Penalty for coils of length 1 or 2                   2
Total                                                            302

  • Example:

Sequence:  MNIFELRIDEGL
Structure:    HHHHHH
Energy =

slide-16
SLIDE 16

Energy Parameters

Name        Description of Energy Parameters                     Number of Parameters
$H_R$       Energy of residue R in a helix                       20
$N_{R,i}$   Energy of residue R at offset i (-3…3) from N-cap    140
$C_{R,i}$   Energy of residue R at offset i (-3…3) from C-cap    140
            Penalty for coils of length 1 or 2                   2
Total                                                            302

  • Example:

Sequence:  MNIFELRIDEGL
Structure:    HHHHHH
Energy = $H_F + H_E + H_L + H_R + H_I + H_D$   (Helix)

slide-17
SLIDE 17

Energy Parameters

Name        Description of Energy Parameters                     Number of Parameters
$H_R$       Energy of residue R in a helix                       20
$N_{R,i}$   Energy of residue R at offset i (-3…3) from N-cap    140
$C_{R,i}$   Energy of residue R at offset i (-3…3) from C-cap    140
            Penalty for coils of length 1 or 2                   2
Total                                                            302

  • Example:

Sequence:  MNIFELRIDEGL
Structure:    HHHHHH
Energy = $H_F + H_E + H_L + H_R + H_I + H_D$   (Helix)
       + $N_{M,-3} + N_{N,-2} + N_{I,-1} + N_{F,0} + N_{E,1} + N_{L,2} + N_{R,3}$   (N-cap)

slide-18
SLIDE 18

Energy Parameters

Name        Description of Energy Parameters                     Number of Parameters
$H_R$       Energy of residue R in a helix                       20
$N_{R,i}$   Energy of residue R at offset i (-3…3) from N-cap    140
$C_{R,i}$   Energy of residue R at offset i (-3…3) from C-cap    140
            Penalty for coils of length 1 or 2                   2
Total                                                            302

  • Example:

Sequence:  MNIFELRIDEGL
Structure:    HHHHHH
Energy = $H_F + H_E + H_L + H_R + H_I + H_D$   (Helix)
       + $N_{M,-3} + N_{N,-2} + N_{I,-1} + N_{F,0} + N_{E,1} + N_{L,2} + N_{R,3}$   (N-cap)
       + $C_{L,-3} + C_{R,-2} + C_{I,-1} + C_{D,0} + C_{E,1} + C_{G,2} + C_{L,3}$   (C-cap)

slide-19
SLIDE 19

Energy Parameters

Name        Description of Energy Parameters                     Number of Parameters
$H_R$       Energy of residue R in a helix                       20
$N_{R,i}$   Energy of residue R at offset i (-3…3) from N-cap    140
$C_{R,i}$   Energy of residue R at offset i (-3…3) from C-cap    140
            Penalty for coils of length 1 or 2                   2
Total                                                            302

  • Example:

Sequence:  MNIFELRIDEGL
Structure:    HHHHHH
Energy = $H_F + H_E + H_L + H_R + H_I + H_D$   (Helix)
       + $N_{M,-3} + N_{N,-2} + N_{I,-1} + N_{F,0} + N_{E,1} + N_{L,2} + N_{R,3}$   (N-cap)
       + $C_{L,-3} + C_{R,-2} + C_{I,-1} + C_{D,0} + C_{E,1} + C_{G,2} + C_{L,3}$   (C-cap)
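
The term-by-term expansion in this example can be reproduced mechanically. The sketch below is an illustration, not the authors' code: the function name `energy_terms`, the `'-'` coil symbol, and the decision to skip cap offsets that fall off the chain ends are assumptions. The helix placement (residues F through D) is inferred from the terms listed on the slide, and the short-coil penalty is left out.

```python
def energy_terms(sequence, structure, reach=3):
    """List the symbolic helix, N-cap and C-cap terms for every helix.

    `structure` marks helix residues with 'H'; any other character is coil.
    """
    terms = []
    n = len(sequence)
    pos = 0
    while pos < n:
        if structure[pos] != "H":
            pos += 1
            continue
        start = pos                      # first helix residue (N-cap, offset 0)
        while pos < n and structure[pos] == "H":
            pos += 1
        end = pos - 1                    # last helix residue (C-cap, offset 0)
        # One helix term per residue inside the helix.
        terms += [f"H_{sequence[k]}" for k in range(start, end + 1)]
        # Cap terms at offsets -reach..reach; positions off the chain are skipped.
        for name, cap in (("N", start), ("C", end)):
            for off in range(-reach, reach + 1):
                if 0 <= cap + off < n:
                    terms.append(f"{name}_{sequence[cap + off]},{off}")
    return terms

terms = energy_terms("MNIFELRIDEGL", "---HHHHHH---")
print(" + ".join(terms))
```

With the example sequence this yields the same 6 helix terms, 7 N-cap terms, and 7 C-cap terms as the slide.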

slide-20
SLIDE 20

Learning the Parameters

[Plot: feature space with axes A (# of alanines in helices) and G (# of glycines in helices). Shown: legal structures, correct structure. Plot label: highest energy in direction of energy parameters w.]

$\mathrm{Energy}(\text{structure}) = H_A \cdot A + H_G \cdot G = w \cdot [A \;\; G]$,
where $w$ represents the energy parameters $[H_A \;\; H_G]$
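
To make the dot-product view concrete, here is a small sketch that counts the two features and evaluates the energy. The parameter values, the toy sequence, and the function names are made up for illustration; only the form Energy = w · [A G] comes from the slide.

```python
def count_features(sequence, structure):
    """Count alanines (A) and glycines (G) that fall inside helices ('H')."""
    in_helix = [res for res, state in zip(sequence, structure) if state == "H"]
    return {"A": in_helix.count("A"), "G": in_helix.count("G")}

def energy(w, features):
    """Dot product w · [A G], with both vectors stored as dicts."""
    return sum(w[name] * features[name] for name in w)

w = {"A": -1.5, "G": 2.0}                      # made-up values for [H_A, H_G]
features = count_features("AAGAKG", "HHHH--")  # {'A': 3, 'G': 1}
print(energy(w, features))                     # -2.5
```

Because the energy is linear in w, every structure is fully described by its feature vector, which is what lets the learning step treat structures as points in feature space.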

slide-21
SLIDE 21

Learning the Parameters

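[Plot: feature space with axes A (# of alanines in helices) and G (# of glycines in helices). Shown: legal structures, correct structure, parameter vector w. Plot label: highest energy in direction of energy parameters w.]

$\mathrm{Energy}(\text{structure}) = H_A \cdot A + H_G \cdot G = w \cdot [A \;\; G]$,
where $w$ represents the energy parameters $[H_A \;\; H_G]$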

slide-22
SLIDE 22

Learning the Parameters

[Plot: feature space, A (# of alanines in helices) vs. G (# of glycines in helices). Shown: legal structures, correct structure, predicted structure, parameter vector w.]

  1. Predict structure
slide-23
SLIDE 23

Learning the Parameters

[Plot: feature space, A (# of alanines in helices) vs. G (# of glycines in helices). Shown: legal structures, correct structure, predicted structure.]

  1. Predict structure
slide-24
SLIDE 24

Learning the Parameters

[Plot: feature space, A (# of alanines in helices) vs. G (# of glycines in helices). Shown: legal structures, correct structure, predicted structure, separating hyperplane.]

  1. Predict structure
  2. Refine parameters
slide-25
SLIDE 25

Learning the Parameters

[Plot: feature space, A (# of alanines in helices) vs. G (# of glycines in helices). Shown: legal structures, correct structure, predicted structure, parameter vector w, separating hyperplane.]

  1. Predict structure
  2. Refine parameters
slide-26
SLIDE 26

Learning the Parameters

[Plot: feature space, A (# of alanines in helices) vs. G (# of glycines in helices). Shown: legal structures, correct structure, predicted structure, parameter vector w.]

  1. Predict structure
  2. Refine parameters
slide-27
SLIDE 27

Learning the Parameters

[Plot: feature space, A (# of alanines in helices) vs. G (# of glycines in helices). Shown: legal structures, correct structure, predicted structure, parameter vector w.]

  1. Predict structure
  2. Refine parameters
  3. Predict structure
slide-28
SLIDE 28

Learning the Parameters

[Plot: feature space, A (# of alanines in helices) vs. G (# of glycines in helices). Shown: legal structures, correct structure, predicted structure.]

  1. Predict structure
  2. Refine parameters
  3. Predict structure
slide-29
SLIDE 29

Learning the Parameters

[Plot: feature space, A (# of alanines in helices) vs. G (# of glycines in helices). Shown: legal structures, correct structure, predicted structure.]

  1. Predict structure
  2. Refine parameters
  3. Predict structure
  4. Refine parameters
slide-30
SLIDE 30

Learning the Parameters

[Plot: feature space, A (# of alanines in helices) vs. G (# of glycines in helices). Shown: legal structures, correct structure, predicted structure, parameter vector w.]

  1. Predict structure
  2. Refine parameters
  3. Predict structure
  4. Refine parameters
slide-31
SLIDE 31

Learning the Parameters

[Plot: feature space, A (# of alanines in helices) vs. G (# of glycines in helices). Shown: legal structures, correct structure, predicted structure, parameter vector w.]

  1. Predict structure
  2. Refine parameters
  3. Predict structure
  4. Refine parameters
slide-32
SLIDE 32

Learning the Parameters

[Plot: feature space, A (# of alanines in helices) vs. G (# of glycines in helices). Shown: legal structures, correct structure, predicted structure, parameter vector w.]

  1. Predict structure
  2. Refine parameters
  3. Predict structure
  4. Refine parameters
  5. Predict structure
slide-33
SLIDE 33

Learning the Parameters

[Plot: feature space, A (# of alanines in helices) vs. G (# of glycines in helices). Shown: legal structures, correct structure, predicted structure.]

  1. Predict structure
  2. Refine parameters
  3. Predict structure
  4. Refine parameters
  5. Predict structure
slide-34
SLIDE 34

Learning the Parameters

[Plot: feature space, A (# of alanines in helices) vs. G (# of glycines in helices). Shown: legal structures, correct structure, predicted structure.]

  1. Predict structure
  2. Refine parameters
  3. Predict structure
  4. Refine parameters
  5. Predict structure
  6. Refine parameters
slide-35
SLIDE 35

Learning the Parameters

[Plot: feature space, A (# of alanines in helices) vs. G (# of glycines in helices). Shown: legal structures, correct structure, predicted structure, parameter vector w.]

  1. Predict structure
  2. Refine parameters
  3. Predict structure
  4. Refine parameters
  5. Predict structure
  6. Refine parameters
slide-36
SLIDE 36

Learning the Parameters

[Plot: feature space, A (# of alanines in helices) vs. G (# of glycines in helices). Shown: legal structures, correct structure, predicted structure, parameter vector w.]

  1. Predict structure
  2. Refine parameters
  3. Predict structure
  4. Refine parameters
  5. Predict structure
  6. Refine parameters
slide-37
SLIDE 37

Learning the Parameters

[Plot: feature space, A (# of alanines in helices) vs. G (# of glycines in helices). Shown: legal structures, correct structure, predicted structure, parameter vector w. Plot label: structure already predicted.]

  1. Predict structure
  2. Refine parameters
  3. Predict structure
  4. Refine parameters
  5. Predict structure
  6. Refine parameters
  7. Predict structure
slide-38
SLIDE 38

Learning the Parameters

[Plot: feature space, A (# of alanines in helices) vs. G (# of glycines in helices). Shown: legal structures, correct structure, predicted structure, parameter vector w. Plot label: structure already predicted.]

  1. Predict structure
  2. Refine parameters
  3. Predict structure
  4. Refine parameters
  5. Predict structure
  6. Refine parameters
  7. Predict structure
  8. Terminate
slide-39
SLIDE 39

Learning the Parameters

[Plot: feature space, A (# of alanines in helices) vs. G (# of glycines in helices). Shown: legal structures, correct structure, predicted structure, parameter vector w. Plot label: structure already predicted.]

  1. Predict structure
  2. Refine parameters
  3. Predict structure
  4. Refine parameters
  5. Predict structure
  6. Refine parameters
  7. Predict structure
  8. Terminate

Details in paper:

  • How to converge faster
  • Early termination condition
  • [Tsochantaridis et al., ICML’02]
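
The predict/refine loop walked through on these slides can be sketched on a toy problem. Several things here are assumptions made for illustration: the five feature vectors are invented, the legal structures are enumerated explicitly instead of being searched with an HMM, prediction picks the minimal-energy structure, and `refine` uses a simple perceptron-style update with a margin as a stand-in for the SVM optimization described in the talk.

```python
def energy(w, phi):
    """Energy of a structure with feature vector phi under parameters w."""
    return sum(wi * xi for wi, xi in zip(w, phi))

def predict(w, legal):
    """Return the legal structure with minimal energy (first one wins ties)."""
    return min(legal, key=lambda phi: energy(w, phi))

def refine(w, correct, violators, margin=1.0, step=0.1, max_passes=10_000):
    """Perceptron-style stand-in for the SVM step: nudge w until every stored
    incorrect structure has energy at least `margin` above the correct one."""
    w = list(w)
    for _ in range(max_passes):
        changed = False
        for phi in violators:
            diff = [a - b for a, b in zip(phi, correct)]
            if energy(w, diff) < margin:          # constraint violated
                w = [wi + step * di for wi, di in zip(w, diff)]
                changed = True
        if not changed:
            break
    return w

def learn(legal, correct, max_rounds=100):
    """Alternate prediction and refinement until nothing new is predicted."""
    w = [0.0] * len(correct)
    violators = []                                # incorrect predictions so far
    for _ in range(max_rounds):
        predicted = predict(w, legal)
        if predicted == correct or predicted in violators:
            break                                 # correct, or already predicted
        violators.append(predicted)               # new constraint
        w = refine(w, correct, violators)
    return w, violators

# Toy feature vectors [A, G]; the values are invented for this example.
legal = [(2, 3), (1, 2), (4, 0), (4, 1), (0, 0)]
correct = (4, 1)
w, violators = learn(legal, correct)
print(predict(w, legal) == correct)  # True
```

Constraints are added only for structures the predictor actually proposes, which is the demand-driven aspect of the procedure: most legal structures never need an explicit constraint.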
slide-40
SLIDE 40

Experimental Methodology

  • Data set: 300 non-homologous all-alpha proteins
    – From EVA’s sequence-unique subset of the PDB, July 2005
    – Only consider alpha helices (“H” symbol in DSSP)
  • Randomly split into 150 training, 150 test proteins
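
A minimal sketch of the two preprocessing steps named above: reducing a DSSP string to helix versus non-helix, and splitting the proteins at random. The protein identifiers, the seed, and the `'-'` symbol for non-helix residues are placeholders chosen for this example; the actual data handling in the paper may differ.

```python
import random

def helix_labels(dssp):
    """Keep only DSSP 'H' (alpha helix); every other code becomes '-'."""
    return "".join("H" if code == "H" else "-" for code in dssp)

def split_half(ids, seed=0):
    """Shuffle the identifiers and split them into two equal halves."""
    ids = list(ids)
    random.Random(seed).shuffle(ids)
    half = len(ids) // 2
    return ids[:half], ids[half:]

# 3/10 helices ('G'), turns ('T'), bends ('S') and strands ('E') all map to '-'.
print(helix_labels("TTHHHHGGGSSEEEHHHT"))  # --HHHH--------HHH-

protein_ids = [f"protein_{i}" for i in range(300)]  # hypothetical identifiers
train, test = split_half(protein_ids)
print(len(train), len(test))  # 150 150
```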
slide-41
SLIDE 41

Results

Metric   Explanation                                Value
Qα       percent of residues correctly predicted    77.6%
SOVα     segment overlap measure [Zemla’99]         73.4%

  • Comparison to others
    – Best HMM method to date that does not utilize alignment info
      • Improves on the previous best by 3.5% (Qα) and 0.2% (SOVα)
    – Lags behind neural networks; e.g., Porter overall SOV = 76.6%
    – However, we could likely gain 6-8% from alignment profiles
  • Caveats
    – Moving beyond all-alpha proteins could cost us 3%
    – Considering 3/10 helices could cost us 2%

[Nguyen02] [Rost93] [Jones99]
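
For reference, the Qα definition quoted above (percent of residues correctly predicted) can be written in a few lines. This sketch assumes a two-state comparison, helix versus non-helix, on a single protein; the function name and the example strings are invented, and the paper may aggregate over proteins differently. SOVα is more involved and is not sketched here.

```python
def q_alpha(predicted, actual):
    """Percent of residues whose helix / non-helix state is predicted correctly."""
    if len(predicted) != len(actual):
        raise ValueError("structures must have the same length")
    matches = sum((p == "H") == (a == "H") for p, a in zip(predicted, actual))
    return 100.0 * matches / len(actual)

# The predicted helix is shifted one residue to the left: 8 of 10 residues agree.
print(q_alpha("-HHHHHH---", "--HHHHHH--"))  # 80.0
```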

slide-42
SLIDE 42

Conclusions

  • Represents first step toward learning biophysical parameters for energy minimization techniques
    – Iterative, demand-driven learning process using SVMs
  • Promising results on alpha-helix prediction
    – 77.6% is among the best Qα for methods without alignment info
  • Future work: super-secondary structure
    – Will predict full “contact maps” rather than 3-state labels
    – For beta sheets, replace HMMs by multi-tape grammars

http://protein.csail.mit.edu/

slide-43
SLIDE 43

Extra Slides

slide-44
SLIDE 44

Prediction Algorithm

  • Parameters represent energetic benefit of a given feature in a protein structure
    – Features are fixed, chosen by designer
    – Example features:
      • Number of prolines in an alpha helix
      • Number of coils shorter than 2 residues
  • $\mathrm{Energy}(\text{structure}) = \sum_{\text{feature} \in \text{structure}} \mathrm{Energy}(\text{feature})$
  • Minimal-energy structure found with dynamic programming
    – Idea: consider all structures, exploiting overlapping subproblems
    – Implemented as HMM using Viterbi algorithm

[Diagram: the amino-acid sequence and the energy parameters feed the prediction algorithm, which outputs the predicted structure, i.e. the structure with minimal energy.]
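
The dynamic-programming idea can be illustrated with a deliberately reduced energy model. This is a sketch under stated assumptions, not the model from the talk: it uses only two states (helix 'H' and coil '-'), a per-residue helix energy, and a single fixed cost for starting a helix in place of the cap terms and coil penalties. The energy values are made up.

```python
def predict_structure(sequence, helix_energy, helix_start_cost=1.0):
    """Viterbi-style search for the minimal-energy helix/coil labelling.

    best[s] holds (energy, labelling) for the cheapest labelling of the
    prefix processed so far that ends in state s.
    """
    states = "H-"
    best = {
        "H": (helix_energy[sequence[0]] + helix_start_cost, "H"),
        "-": (0.0, "-"),
    }
    for residue in sequence[1:]:
        new_best = {}
        for s in states:
            emit = helix_energy[residue] if s == "H" else 0.0
            candidates = []
            for prev in states:
                # Entering a helix from coil pays the start cost.
                trans = helix_start_cost if (prev == "-" and s == "H") else 0.0
                prev_energy, prev_path = best[prev]
                candidates.append((prev_energy + trans + emit, prev_path + s))
            new_best[s] = min(candidates)
        best = new_best
    return min(best.values())

# Made-up helix energies: negative values favour helix, positive ones oppose it.
helix_energy = {"A": -1.0, "L": -0.8, "E": -0.5, "G": 1.0, "P": 2.0}
total, path = predict_structure("GALEAP", helix_energy)
print(path)  # -HHHH-
```

Each step keeps only the best labelling per ending state, so the work grows linearly with sequence length even though the number of possible labellings grows exponentially.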

slide-45
SLIDE 45

Learning Algorithm

  • Constraints have the form:
    For all incorrectly predicted structures $S_i$, in future selection of the parameters $w$:
      $\mathrm{Energy}_w(S_i) > \mathrm{Energy}_w(\text{correct structure})$
    Constraints are linear in the energy parameters.
  • If feasible, could solve with linear programming
  • In general, solve with Support Vector Machines (SVMs)
    – $\mathrm{Energy}_w(S_i) \ge \mathrm{Energy}_w(\text{correct structure}) + 1 - \xi_i \quad (\xi_i \ge 0)$
    – Find parameters $w$ minimizing $\tfrac{1}{2}\|w\|^2 + \frac{C}{n}\sum_{i=1}^{n} \xi_i$

[Diagram: constraints of the form energy(incorrect) > energy(correct) feed the learning algorithm, which outputs the energy parameters.]

Provides general solution using soft-margin criterion
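
To see how the slack variables enter the objective, the sketch below evaluates it for a fixed w. It does not solve the optimization; it only computes the smallest slack each constraint needs and the resulting objective value. The feature vectors, the value of C, and the function name are assumptions for illustration.

```python
def soft_margin_objective(w, correct, incorrect, C=1.0):
    """Return (objective, slacks) for 1/2 ||w||^2 + (C/n) * sum of slacks.

    Each slack is the smallest xi >= 0 satisfying
    Energy(S_i) >= Energy(correct) + 1 - xi.
    """
    def energy(phi):
        return sum(wi * xi for wi, xi in zip(w, phi))

    e_correct = energy(correct)
    slacks = [max(0.0, e_correct + 1.0 - energy(phi)) for phi in incorrect]
    norm_sq = sum(wi * wi for wi in w)
    objective = 0.5 * norm_sq + (C / len(incorrect)) * sum(slacks)
    return objective, slacks

# Made-up feature vectors: (2, 3) clears the margin, (4, 0) ties with the
# correct structure and therefore needs a slack of 1.
objective, slacks = soft_margin_objective(
    w=(-1.0, 0.0), correct=(4, 1), incorrect=[(2, 3), (4, 0)]
)
print(slacks)     # [0.0, 1.0]
print(objective)  # 1.0
```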