Machine Learning: HMM applications in computational biology



slide-1
SLIDE 1

HMM applications in computational biology

10-701 Machine Learning

slide-2
SLIDE 2


Central dogma

DNA → (transcription) → mRNA → (translation) → Protein

DNA:     CCTGAGCCAACTATTGATGAA
mRNA:    CCUGAGCCAACUAUUGAUGAA
Protein: PEPTIDE
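The two steps of the central dogma can be traced on the slide's own sequence. The sketch below is illustrative only: the function names are made up here, and the codon table is deliberately partial, covering just the seven codons of the standard genetic code that this example needs.

```python
# Partial codon table (assumption: only the codons needed for this example;
# the full standard genetic code has 64 entries).
CODON_TABLE = {
    "CCU": "P", "CCA": "P",  # proline
    "GAG": "E", "GAA": "E",  # glutamate
    "ACU": "T",              # threonine
    "AUU": "I",              # isoleucine
    "GAU": "D",              # aspartate
}


def transcribe(dna: str) -> str:
    """Coding-strand DNA to mRNA: thymine (T) is replaced by uracil (U)."""
    return dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Read the mRNA one codon (three letters) at a time."""
    usable = len(mrna) - len(mrna) % 3  # ignore a trailing partial codon
    codons = [mrna[i:i + 3] for i in range(0, usable, 3)]
    return "".join(CODON_TABLE[codon] for codon in codons)


dna = "CCTGAGCCAACTATTGATGAA"
mrna = transcribe(dna)
protein = translate(mrna)
print(mrna)     # CCUGAGCCAACUAUUGAUGAA
print(protein)  # PEPTIDE
```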

slide-3
SLIDE 3

Biological data is rapidly accumulating

DNA → (transcription) → RNA → (translation) → Proteins; transcription factors

Next generation sequencing

slide-4
SLIDE 4
slide-5
SLIDE 5
slide-6
SLIDE 6

Biological data is rapidly accumulating

DNA → (transcription) → RNA → (translation) → Proteins; transcription factors

Array / sequencing technology

slide-7
SLIDE 7

Biological data is rapidly accumulating

DNA → (transcription) → RNA → (translation) → Proteins; transcription factors

Protein interactions

  • 38,000 identified interactions
  • Hundreds of thousands of predictions

slide-8
SLIDE 8


slide-9
SLIDE 9

FDA Approves Gene-Based Breast Cancer Test*

“MammaPrint is a DNA microarray-based test that measures the activity of 70 genes in a sample of a woman's breast-cancer tumor and then uses a specific formula to determine whether the patient is deemed low risk or high risk for the spread of the cancer to another site.”

*Washington Post, 2/06/2007

slide-10
SLIDE 10


slide-11
SLIDE 11

Active Learning


slide-12
SLIDE 12

Sequencing DNA

Due to accumulated errors, we could only reliably read at most 300-500 nucleotides.

First human genome draft in 2001

slide-13
SLIDE 13

Shotgun Sequencing

Wikipedia

slide-14
SLIDE 14

Caveats

  • Errors in reading
  • Non-trivial assembly task: repeats in the genome

MacCallum et al., GB 2009

slide-15
SLIDE 15

Error Correction in DNA sequencing

  • The fragmentation happens at random locations of the molecules, so we expect all positions in the genome to be covered by roughly the same number of reads.
  • K-mers are substrings of length K of the reads.
  • Errors create error k-mers.

Kelley et al., GB 2010
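One way to see why k-mers help: with even coverage, a genuine k-mer should show up in many reads, while a k-mer created by a read error tends to be rare. The sketch below counts k-mers and flags the rare ones. It is a simplified illustration, not the method of the cited paper; the function names, the toy reads, and the `min_count` threshold are all assumptions.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every substring of length k across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def rare_kmers(counts, min_count):
    """K-mers seen fewer than min_count times are candidate error k-mers."""
    return {kmer for kmer, n in counts.items() if n < min_count}


# Toy data: three identical reads and one read with a single misread base (G -> C).
reads = ["ACGTAC", "ACGTAC", "ACGTAC", "ACCTAC"]
counts = count_kmers(reads, k=3)
suspects = rare_kmers(counts, min_count=2)
print(sorted(suspects))  # ['ACC', 'CCT', 'CTA']
```

A single misread base produces up to K error k-mers, which is why one substitution yields three suspects here.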

slide-16
SLIDE 16

Transcriptome Shotgun Sequencing (RNA-Seq)

Sequencing RNA molecule transcripts. Reminder:

  • (mRNA) Transcripts are “expression products” of genes.
  • Different genes have different expression levels, so some transcripts are more abundant than others.

@Friedrich Miescher Laboratory

slide-17
SLIDE 17

Challenges

  • Large datasets: 10-100 million reads of 75-150 bp.
  • Memory efficiency: out-of-memory processing of the data is too time-consuming.
  • The challenges of DNA sequencing, plus others: alternative splicing, RNA editing, post-transcriptional modification.

slide-18
SLIDE 18

Errors are non-uniformly distributed

  • Some transcripts are more prone to errors
  • Errors are harder to correct in reads from lowly expressed transcripts

slide-19
SLIDE 19

SEECER: error correction + consensus sequence estimation for RNA-Seq data

slide-20
SLIDE 20

Key idea: HMM model

The way sequencers work:

  • Read letter by letter sequentially
  • Possible errors: insertion, deletion, or misread of a nucleotide

Salmela et al., Bioinformatics 2011

slide-21
SLIDE 21
slide-22
SLIDE 22

Building (Learning) the HMMs and Making Corrections (Inference)

  • Learning = Expectation-Maximization
  • Inference = Viterbi algorithm
  • Seeding: Guessing possible reads using k-mer overlaps. Constructing the HMM from these reads.
  • Speed-up: The k-mer overlaps yield approximate multiple alignments of reads. We can learn HMM parameters from this directly.
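To make the inference step concrete, here is a minimal Viterbi sketch in log space. It is a toy under stated assumptions, not SEECER's model: hidden states are the true bases, the only error type is a misread (no insertions or deletions), and the transition and emission probabilities are invented so that the underlying sequence cycles A → C → G → T.

```python
import math


def viterbi(obs, states, log_start, log_trans, log_emit):
    """Return the most probable hidden-state path for the observed sequence."""
    # best[t][s]: best log-probability of any path that ends in state s at time t
    best = [{s: log_start[s] + log_emit[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda p: best[t - 1][p] + log_trans[p][s])
            best[t][s] = best[t - 1][prev] + log_trans[prev][s] + log_emit[s][obs[t]]
            back[t][s] = prev
    # Trace back from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]


# Toy parameters (assumptions for illustration only).
BASES = "ACGT"
NEXT = {"A": "C", "C": "G", "G": "T", "T": "A"}
log_start = {s: math.log(0.25) for s in BASES}
log_trans = {p: {s: math.log(0.97 if NEXT[p] == s else 0.01) for s in BASES}
             for p in BASES}
log_emit = {s: {o: math.log(0.9 if o == s else 0.1 / 3) for o in BASES}
            for s in BASES}

# The read "ACTT" has a misread at the third position.
corrected = "".join(viterbi("ACTT", BASES, log_start, log_trans, log_emit))
print(corrected)  # ACGT
```

In this toy, explaining the third letter as a misread of G is far more probable than taking two unlikely transitions, so the decoded path restores the G.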

slide-23
SLIDE 23

Clustering to improve seeding

Real biological differences should be supported by a set of reads with similar mismatches to the consensus

slide-24
SLIDE 24

  1. Cluster positions with mismatches to identify clusters of correlated positions.
  2. Build a similarity matrix between these positions.
  3. Use spectral clustering to find clusters of correlated positions.
  4. Filter reads that have mismatches in these clusters.
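The steps above can be sketched roughly as follows. Two loud assumptions: the similarity between two positions is taken to be the Jaccard overlap of the reads that mismatch there, and spectral clustering is replaced by a much simpler stand-in (connected components of the thresholded similarity graph) to keep the example dependency-free. All names and the threshold are hypothetical.

```python
from itertools import combinations


def position_similarity(mismatches):
    """mismatches: one set of mismatch positions per read (relative to the consensus).
    Returns the reads mismatching at each position, and pairwise Jaccard similarities."""
    readsets = {}
    for read_id, positions in enumerate(mismatches):
        for pos in positions:
            readsets.setdefault(pos, set()).add(read_id)
    sim = {}
    for p, q in combinations(sorted(readsets), 2):
        a, b = readsets[p], readsets[q]
        sim[(p, q)] = len(a & b) / len(a | b)
    return readsets, sim


def correlated_clusters(readsets, sim, threshold):
    """Stand-in for spectral clustering: connected components of the graph
    that links positions whose similarity is at least `threshold`."""
    parent = {p: p for p in readsets}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for (p, q), s in sim.items():
        if s >= threshold:
            parent[find(p)] = find(q)
    groups = {}
    for p in readsets:
        groups.setdefault(find(p), set()).add(p)
    return [g for g in groups.values() if len(g) > 1]


def filter_reads(mismatches, clusters):
    """Keep only reads with no mismatch at any clustered position."""
    flagged = set().union(*clusters)
    return [i for i, positions in enumerate(mismatches) if not (positions & flagged)]


# Reads 0-2 share mismatches at positions 5 and 12 (a likely real difference);
# reads 3 and 5 carry isolated mismatches (more likely sequencing errors).
mismatches = [{5, 12}, {5, 12}, {5, 12}, {8}, set(), {20}, set()]
readsets, sim = position_similarity(mismatches)
clusters = correlated_clusters(readsets, sim, threshold=0.5)
kept = filter_reads(mismatches, clusters)
print(clusters)  # [{5, 12}]
print(kept)      # [3, 4, 5, 6]
```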
slide-25
SLIDE 25

Comparison to other methods

slide-26
SLIDE 26

Using the corrected reads, the assembler can recover more transcripts

slide-27
SLIDE 27

Analysis of sea cucumber data


slide-28
SLIDE 28

Data integration in biology

slide-29
SLIDE 29

Key problem: Most high-throughput data is static

[Figure: data sources (sequencing, motif, ChIP-chip, PPI, microarray) grouped into static data sources and time-series measurements, with a time axis]

slide-30
SLIDE 30

DREM: Dynamic Regulatory Events Miner

slide-31
SLIDE 31

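[Figure: DREM overview. Inputs: time-series expression data and static TF-DNA binding data. Output: an IOHMM model whose structure is drawn as expression level against time, with paths that split at points annotated with transcription factors (TF A, TF B, TF C, TF D) and transition probabilities (values shown: 1, 0.1, 0.9, 0.95, 0.05).]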

slide-32
SLIDE 32

Things are a bit more complicated: Real data

slide-33
SLIDE 33

A Hidden Markov Model

             

  

\[
L(H, O; \theta) = \prod_{i=1}^{n} \left[ \prod_{t=1}^{T} p\big(O_t(i) \mid H_t(i)\big) \prod_{t=2}^{T} p\big(H_t(i) \mid H_{t-1}(i)\big) \right]
\]

[Figure: hidden states H0, H1, H2, H3 at t=0, t=1, t=2, t=3, each emitting an observed output (expression level) O0, O1, O2, O3]

Schliep et al., Bioinformatics 2003
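As a rough illustration of the likelihood on this slide, the sketch below evaluates its logarithm for a given assignment of hidden states: one emission term per time point and one transition term per step, summed over genes. The Gaussian emission density, the state names, and all numbers are assumptions chosen for the example.

```python
import math


def gaussian_logpdf(x, mean, sd):
    """Log density of a normal distribution with the given mean and standard deviation."""
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mean) ** 2 / (2 * sd * sd)


def complete_log_likelihood(observations, paths, log_trans, emission):
    """Sum over genes i and time points t of
    log p(O_t(i) | H_t(i)), plus log p(H_t(i) | H_{t-1}(i)) for every step after the first."""
    total = 0.0
    for obs, path in zip(observations, paths):  # one (series, state path) pair per gene
        for t, (o, h) in enumerate(zip(obs, path)):
            mean, sd = emission[h]
            total += gaussian_logpdf(o, mean, sd)   # emission term
            if t > 0:
                total += log_trans[path[t - 1]][h]  # transition term
    return total


# Toy example: one gene, two time points, two hidden states.
emission = {"low": (0.0, 1.0), "high": (1.0, 1.0)}  # (mean, sd) per state
log_trans = {"low": {"low": math.log(0.5), "high": math.log(0.5)},
             "high": {"low": math.log(0.5), "high": math.log(0.5)}}
ll = complete_log_likelihood([[0.0, 1.0]], [["low", "high"]], log_trans, emission)
print(round(ll, 4))  # -2.531
```

Working in log space turns the products into sums and avoids numerical underflow on long series.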

slide-34
SLIDE 34

Input-Output Hidden Markov Model

  • Input (static TF-gene interactions): Ig
  • Hidden states (transitions between states form a tree structure): H0, H1, H2, H3 at t=0, t=1, t=2, t=3
  • Emissions (distribution of expression values): O0, O1, O2, O3

Log likelihood, term by term:

  • Sum over all genes
  • Sum over all paths Q
  • Product over all Gaussian emission density values on path
  • Product over all transition probabilities on path
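Read together, the annotations on this slide suggest a log likelihood of roughly the following shape. The notation is assumed for illustration and is not taken from the slide: $o_g(t)$ is the expression of gene $g$ at time $t$, $Q = (q_0, \dots, q_{T})$ is a path through the hidden states, $f_{q}$ is the Gaussian emission density of state $q$, and $I_g$ is the static TF-gene input for gene $g$.

```latex
\log L \;=\; \sum_{g} \log \sum_{Q}
  \left[ \prod_{t=0}^{T} f_{q_t}\big(o_g(t)\big) \right]
  \left[ \prod_{t=1}^{T} P\big(q_t \mid q_{t-1}, I_g\big) \right]
```

The dependence of the transition probabilities on $I_g$ is what distinguishes the input-output model from the plain HMM: which branch a gene follows at a split can depend on which transcription factors bind it.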

slide-35
SLIDE 35

  • E. coli response

PLoS Comp. Bio. 2008 Nature MSB 2011

IRF7

Fly development

Science 2010

Genome Research 2010, PLoS ONE 2011

Mouse Immune response Stem cells differentiation

slide-36
SLIDE 36

Things that work

  • Approximate learning to speed up on large datasets.
  • In the real world, one technique is not enough. A solution involves using many techniques.
  • Precision and recall trade off against each other.