Machine Learning: HMM applications in computational biology

Aug 02, 2023


SLIDE 1

HMM applications in computational biology

10-701 Machine Learning

SLIDE 2

Central dogma

DNA → (transcription) → mRNA → (translation) → Protein

DNA: CCTGAGCCAACTATTGATGAA
mRNA: CCUGAGCCAACUAUUGAUGAA
Protein: PEPTIDE
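The transcription step on this slide can be sketched in a few lines. This is a minimal illustration that assumes the DNA string is the coding strand, so the mRNA is the same sequence with T replaced by U; the function name `transcribe` is hypothetical.

```python
def transcribe(dna: str) -> str:
    """Return the mRNA for a coding-strand DNA string (T becomes U)."""
    dna = dna.upper()
    unexpected = set(dna) - set("ACGT")
    if unexpected:
        raise ValueError(f"unexpected characters in DNA: {sorted(unexpected)}")
    return dna.replace("T", "U")


# The example sequence from the slide.
mrna = transcribe("CCTGAGCCAACTATTGATGAA")
print(mrna)
```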

SLIDE 3

Biological data is rapidly accumulating

DNA → (transcription) → RNA → (translation) → Proteins, with transcription factors

Next generation sequencing

SLIDE 4

SLIDE 5

SLIDE 6

Biological data is rapidly accumulating

DNA → (transcription) → RNA → (translation) → Proteins, with transcription factors

Array / sequencing technology

SLIDE 7

Biological data is rapidly accumulating

DNA → (transcription) → RNA → (translation) → Proteins, with transcription factors

Protein interactions

38,000 identified interactions
Hundreds of thousands of predictions

SLIDE 8

SLIDE 9

FDA Approves Gene-Based Breast Cancer Test*

“MammaPrint is a DNA microarray-based test that measures the activity of 70 genes in a sample of a woman's breast-cancer tumor and then uses a specific formula to determine whether the patient is deemed low risk or high risk for the spread of the cancer to another site.”

*Washington Post, 2/06/2007

SLIDE 10

SLIDE 11

Active Learning

SLIDE 12

Sequencing DNA

Due to accumulated errors, we could only reliably read at most 300-500 nucleotides.

First human genome draft in 2001

SLIDE 13

Shotgun Sequencing

Wikipedia

SLIDE 14

Caveats

Errors in reading
Non-trivial assembly task: repeats in the genome

MacCallum et al., GB 2009

SLIDE 15

Error Correction in DNA sequencing

The fragmentation happens at random locations of the molecules.

We expect all positions in the genome to be covered by roughly the same number of reads.

k-mers = substrings of length k of the reads. Sequencing errors create erroneous k-mers, which occur far less often than correct ones.

Kelley et al., GB 2010
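The k-mer idea can be sketched as follows: count every k-mer across all reads and flag the rare ones as likely errors. This is a minimal sketch on made-up toy reads; the helper names and the `min_count` threshold are assumptions for illustration, not the method of the cited paper.

```python
from collections import Counter


def kmers(read, k):
    """All substrings of length k of a read, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def count_kmers(reads, k):
    """Count how often each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def suspicious_kmers(counts, min_count):
    """k-mers seen fewer than min_count times are likely error k-mers."""
    return {kmer for kmer, c in counts.items() if c < min_count}


# Toy data (hypothetical): five clean reads and one read with a misread
# at index 4 (T read as A).
reads = ["ACGTTGCA"] * 5 + ["ACGTAGCA"]
counts = count_kmers(reads, k=4)
weak = suspicious_kmers(counts, min_count=2)
print(sorted(weak))
```

Every k-mer that overlaps the misread base is rare, while k-mers from the clean part of the same read stay well supported.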

SLIDE 16

Transcriptome Shotgun Sequencing (RNA-Seq)

Sequencing RNA molecule transcripts. Reminder:

(mRNA) transcripts are “expression products” of genes.
Different genes have different expression levels, so some transcripts are more abundant than others.

@Friedrich Miescher Laboratory

SLIDE 17

Challenges

Large datasets: 10-100 million reads of 75-150 bp each.
Memory efficiency: out-of-memory processing of the data is too time consuming.
All the challenges of DNA sequencing, plus others: alternative splicing, RNA editing, post-transcriptional modification.

SLIDE 18

Errors are non-uniformly distributed

Some transcripts are more prone to errors
Errors are harder to correct in reads from lowly expressed transcripts

SLIDE 19

SEECER: Error Correction + Consensus sequence estimation for RNA-Seq data

SLIDE 20

Key idea: HMM model

The way sequencers work:

Read letter by letter sequentially
Possible errors: insertion, deletion, or misread of a nucleotide

Salmela et al., Bioinformatics 2011
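The error model on this slide can be simulated directly: walk the true sequence letter by letter and, at each step, possibly delete, insert, or misread a nucleotide. This is a rough sketch; the function name `noisy_read` and the default error rates are made-up assumptions, not measured sequencer rates.

```python
import random

NUCLEOTIDES = "ACGT"


def noisy_read(template, p_sub=0.01, p_ins=0.005, p_del=0.005, rng=None):
    """Read `template` letter by letter, injecting the three error types."""
    rng = rng or random.Random()
    out = []
    for base in template:
        r = rng.random()
        if r < p_del:
            continue  # deletion: the true base is skipped
        if r < p_del + p_ins:
            out.append(rng.choice(NUCLEOTIDES))  # insertion: spurious base
            out.append(base)
        elif r < p_del + p_ins + p_sub:
            # misread: report one of the three other nucleotides
            out.append(rng.choice([n for n in NUCLEOTIDES if n != base]))
        else:
            out.append(base)
    return "".join(out)


print(noisy_read("CCTGAGCCAACTATTGATGAA", rng=random.Random(0)))
```

Because each output letter depends only on the current position in the true sequence, this process maps naturally onto an HMM whose hidden states track positions in the true sequence.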

SLIDE 21

SLIDE 22

Building (Learning) the HMMs and Making Corrections (Inference)

Learning = Expectation-Maximization
Inference = Viterbi algorithm

Seeding: use k-mer overlaps to guess which reads belong together, then construct the HMM from these reads.
Speed-up: the k-mer overlaps yield approximate multiple alignments of the reads, so HMM parameters can be learned from these alignments directly.
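The inference step can be illustrated with a generic Viterbi decoder. This is a minimal sketch on a toy two-state HMM with made-up probabilities, not the SEECER model itself; the state names `X` and `Y` and all parameter values are assumptions, and the observation sequence is assumed to be non-empty.

```python
import math


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable hidden state path for a non-empty obs."""
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    # best[t][s]: log-probability of the best path that ends in s at time t
    best = [{s: log(start_p[s]) + log(emit_p[s][obs[0]]) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((r, best[t - 1][r] + log(trans_p[r][s])) for r in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + log(emit_p[s][obs[t]])
            back[t][s] = prev
    # Trace back from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]


# Toy HMM (hypothetical): two sticky states with noisy emissions.
states = ("X", "Y")
start_p = {"X": 0.5, "Y": 0.5}
trans_p = {"X": {"X": 0.9, "Y": 0.1}, "Y": {"X": 0.1, "Y": 0.9}}
emit_p = {"X": {"a": 0.9, "b": 0.1}, "Y": {"a": 0.1, "b": 0.9}}

print(viterbi("aaabbb", states, start_p, trans_p, emit_p))
print(viterbi("aabaa", states, start_p, trans_p, emit_p))
```

In the second call the isolated `b` is explained as emission noise rather than a state change, which is the same kind of decision an error-correcting decoder makes about a single unexpected nucleotide.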

SLIDE 23

Clustering to improve seeding

Real biological differences should be supported by a set of reads with similar mismatches to the consensus

SLIDE 24

1. Cluster positions with mismatches to identify clusters of correlated positions.
2. Build a similarity matrix between these positions.
3. Use spectral clustering to find clusters of correlated positions.
4. Filter out reads that have mismatches in these clusters.
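The flow of these steps can be sketched on toy data. Note the substitution: to stay dependency-free, this sketch replaces spectral clustering with thresholded connected components on the similarity graph, a much simpler stand-in. The Jaccard similarity, the threshold, and all function names are assumptions for illustration.

```python
from itertools import combinations


def position_support(read_mismatches):
    """Map each mismatch position to the set of reads that show it."""
    support = {}
    for read_id, positions in enumerate(read_mismatches):
        for pos in positions:
            support.setdefault(pos, set()).add(read_id)
    return support


def jaccard(a, b):
    """Similarity between two positions: overlap of their supporting reads."""
    return len(a & b) / len(a | b)


def correlated_clusters(support, threshold=0.5):
    """Group positions linked by similarity >= threshold (stand-in for
    spectral clustering); singleton positions are not reported."""
    neighbours = {pos: set() for pos in support}
    for p, q in combinations(support, 2):
        if jaccard(support[p], support[q]) >= threshold:
            neighbours[p].add(q)
            neighbours[q].add(p)
    clusters, seen = [], set()
    for start in support:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            pos = stack.pop()
            if pos not in component:
                component.add(pos)
                stack.extend(neighbours[pos] - component)
        seen |= component
        if len(component) > 1:
            clusters.append(component)
    return clusters


def filter_reads(read_mismatches, clusters):
    """Keep only reads without mismatches at clustered positions."""
    flagged = set().union(*clusters) if clusters else set()
    return [i for i, pos in enumerate(read_mismatches) if not set(pos) & flagged]


# Toy input (hypothetical): mismatch positions of six reads vs. the consensus.
read_mismatches = [set(), {3}, {10, 25}, {10, 25}, {10, 25, 40}, set()]
clusters = correlated_clusters(position_support(read_mismatches))
kept = filter_reads(read_mismatches, clusters)
print(clusters, kept)
```

Positions 10 and 25 always mismatch together, so they look like a real biological difference and the reads carrying them are set aside; the lone mismatch at position 3 looks like a sequencing error and its read is kept.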

SLIDE 25

Comparison to other methods

SLIDE 26

Using the corrected reads, the assembler can recover more transcripts

SLIDE 27

Analysis of sea cucumber data

SLIDE 28

Data integration in biology

SLIDE 29

Key problem: Most high-throughput data is static

[Figure: static data sources versus time-series measurements along a time axis. Labels: sequencing, motif, ChIP-chip, PPI, microarray]

SLIDE 30

DREM: Dynamic Regulatory Events Miner

SLIDE 31

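[Figure: DREM overview. Inputs: time series expression data (expression level vs. time) and static TF-DNA binding data. Output: IOHMM model (expression level vs. time) whose model structure shows split probabilities such as 0.1 / 0.9 and 0.95 / 0.05, annotated with TF A, TF B, TF C, TF D]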

SLIDE 32

Things are a bit more complicated: Real data

SLIDE 33

A Hidden Markov Model

$$L(H, O; \theta) = \prod_{i=1}^{n} \left[ \prod_{t=1}^{T} p\big(O_t(i) \mid H_t(i)\big) \prod_{t=2}^{T} p\big(H_t(i) \mid H_{t-1}(i)\big) \right]$$

[Figure: hidden states H0-H3 and observed outputs O0-O3 (expression levels) at t=0 to t=3]

Schliep et al., Bioinformatics 2003
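The likelihood on this slide, a product of emission and transition terms over all genes i and time points t, can be evaluated in log space. This is a minimal sketch under two assumptions: emissions are Gaussian (as on the following IOHMM slide) and the hidden path of each gene is given. The state names and parameter values are hypothetical.

```python
import math


def gaussian_logpdf(x, mean, std):
    """Log density of a Gaussian emission."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)


def path_log_likelihood(observations, path, emission, log_trans):
    """log of prod_t p(O_t | H_t) * prod_{t>=2} p(H_t | H_{t-1}) for one gene."""
    total = 0.0
    for t, (o, h) in enumerate(zip(observations, path)):
        mean, std = emission[h]
        total += gaussian_logpdf(o, mean, std)
        if t > 0:  # transitions start at the second time point
            total += log_trans[(path[t - 1], h)]
    return total


def log_likelihood(all_observations, all_paths, emission, log_trans):
    """Sum over genes; the product over i becomes a sum of logs."""
    return sum(
        path_log_likelihood(o, h, emission, log_trans)
        for o, h in zip(all_observations, all_paths)
    )


# Hypothetical parameters: state -> (mean, std), and log transition probabilities.
emission = {"low": (0.0, 1.0), "high": (2.0, 1.0)}
log_trans = {
    ("low", "low"): math.log(0.9),
    ("low", "high"): math.log(0.1),
    ("high", "high"): 0.0,
}

value = log_likelihood([[0.0, 2.0]], [["low", "high"]], emission, log_trans)
print(value)
```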

SLIDE 34

Input-Output Hidden Markov Model

Input (static TF-gene interactions)
Hidden states (transitions between states form a tree structure)
Emissions (distribution of expression values)

[Figure: input Ig feeding hidden states H0-H3, with observed outputs O0-O3, at t=0 to t=3]

Log likelihood, as annotated on the slide:
Sum over all genes
Sum over all paths Q
Product over all Gaussian emission density values on the path
Product over all transition probabilities on the path
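Read together, the four annotations on this slide (sum over genes, sum over paths Q, product of Gaussian emission densities, product of transition probabilities) suggest a log likelihood of roughly the following form. The formula itself did not survive in this transcript, so this is a reconstruction under assumed notation: $o_g(t)$ is the expression of gene $g$ at time $t$, $I_g$ is its static TF input, $q_t$ is the state that path $Q$ visits at time $t$, and $f_{q_t}$ is the Gaussian emission density of that state.

```latex
\log L = \sum_{g} \log \sum_{Q}
  \left[ \prod_{t=1}^{T} f_{q_t}\big(o_g(t)\big)
         \prod_{t=2}^{T} P\big(q_t \mid q_{t-1}, I_g\big) \right]
```

The only structural difference from the plain HMM likelihood is that the transition probabilities are conditioned on the input $I_g$, which is how the static TF-gene interactions enter the model.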

SLIDE 35

[Figure: applications of DREM. Panels: E. coli response, fly development, mouse immune response, stem cell differentiation; IRF7 is highlighted. Publications shown: PLoS Comp. Bio. 2008; Nature MSB 2011; Science 2010; Genome Research 2010; PLoS ONE 2011]

SLIDE 36

Approximate learning speeds up training on large datasets.
In the real world, one technique is not enough; a solution usually combines many techniques.

Precision and recall trade off against each other.
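The precision/recall trade-off can be made concrete with an error-detection threshold. In this small sketch, k-mers with a count below a threshold are called errors, and raising the threshold finds more true errors (higher recall) at the cost of more false alarms (lower precision). All counts, labels, and names are made up for illustration.

```python
def precision_recall(predicted, actual):
    """Precision and recall of a predicted set against the true set."""
    tp = len(predicted & actual)
    precision = tp / len(predicted) if predicted else 1.0
    recall = tp / len(actual) if actual else 1.0
    return precision, recall


# Hypothetical k-mer counts and hypothetical ground truth.
kmer_counts = {"AAAA": 1, "CCCC": 2, "GGGG": 3, "TTTT": 9, "ACGT": 10}
true_errors = {"AAAA", "GGGG"}

results = {}
for threshold in (2, 4):
    called = {kmer for kmer, c in kmer_counts.items() if c < threshold}
    results[threshold] = precision_recall(called, true_errors)
    print(threshold, results[threshold])
```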

Things that work