
Introduction to Probability and Hidden Markov Models

Durbin et al. Ch3 EECS 458 CWRU Fall 2004

Random Variable

  • Ω: a non-empty set of outcomes, e.g. {up, down}
  • An event A is a subset of Ω, e.g. {up}
  • A random variable X is a function defined on events, with range in a subset of R: X({up}) = 1, X({down}) = 0.

  • The probability of an event A, denoted P(A), is the likelihood that A occurs. It is a function mapping events to [0,1], and the probabilities of the elementary outcomes sum to 1. For example: P({up}) = P({down}) = 0.5.
  • The distribution function of a r.v. X is defined through these probabilities, P(X = x).

Examples

  • A die with 4 sides labeled A, C, G, T. The probability that letter x ∈ {A, C, G, T} occurs in a roll is px.
  • The probability that the sequence “AAGC” is generated by this simple model is pA·pA·pG·pC.
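As a quick illustration of the die model above, a minimal Python sketch of the sequence probability under a Bernoulli source (the uniform probabilities used here are just placeholders):

```python
# Probability of a sequence under a simple Bernoulli ("four-sided die") model.
p = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}   # illustrative values

def seq_prob(seq, p):
    """P(seq) = product of per-letter probabilities (independent rolls)."""
    prob = 1.0
    for x in seq:
        prob *= p[x]
    return prob

print(seq_prob("AAGC", p))   # p_A * p_A * p_G * p_C = 0.25**4 = 0.00390625
```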


Conditional probability

  • The joint probability of two events A and B, written P(A,B), is the probability that A and B occur at the same time.
  • The conditional probability P(B|A) is the probability that B occurs given that A has occurred.
  • Facts:

– P(B|A) = P(AB)/P(A)
– P(A) = ∑i P(A,Bi) = ∑i P(A|Bi)P(Bi), where the events Bi partition Ω

An example

  • A family has two children; sample space: Ω = {GG, BB, BG, GB} (the first letter denotes the older child)
  • P(GG) = P(BB) = P(BG) = P(GB) = 0.25
  • P(BB | at least one boy) = ?
  • P(BB | the older child is a boy) = ?
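The two questions above can be checked by brute-force enumeration of the sample space; the sketch below assumes the first letter denotes the older child:

```python
# Enumerate the uniform sample space and count.
omega = ["GG", "BB", "BG", "GB"]

at_least_one_boy = [w for w in omega if "B" in w]
older_is_boy = [w for w in omega if w[0] == "B"]

# P(BB | at least one boy) = |{BB}| / |{BB, BG, GB}| = 1/3
print(at_least_one_boy.count("BB") / len(at_least_one_boy))
# P(BB | older is a boy) = |{BB}| / |{BB, BG}| = 1/2
print(older_is_boy.count("BB") / len(older_is_boy))
```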

Bayes Theorem

Bayes Theorem: P(B|A) = P(AB)/P(A) = P(A|B)P(B) / ∑i P(A|Ci)P(Ci), where the events Ci partition Ω.
Two events are independent if P(AB) = P(A)P(B), i.e. P(B|A) = P(B).
Example: a playing card's suit and rank are independent: P(King) = 4/52 = 1/13 and P(King|Spade) = 1/13.


Expectation of a r.v.

  • E(X) := ∑x x·P(X=x)
  • E(g(X)) = ∑x g(x)·P(X=x)
  • E(aX+bY) = aE(X) + bE(Y)
  • E(XY) = E(X)E(Y) if X and Y are independent
  • Var(X) := E((X−EX)²) = E(X²) − (EX)²
  • Var(aX) = a²Var(X)
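A small numerical sanity check of these definitions (the discrete distribution used here is made up for illustration):

```python
# Pairs (x, P(X=x)) for a toy discrete random variable.
xs = [(0, 0.5), (1, 0.3), (2, 0.2)]

E  = sum(x * p for x, p in xs)      # E(X)  = sum over x of x * P(X=x)
E2 = sum(x * x * p for x, p in xs)  # E(X^2)
var = E2 - E * E                    # Var(X) = E(X^2) - (EX)^2

print(E, var)
```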

Roadmap

  • Model selection
  • Model inference
  • Markov models
  • Hidden Markov models
  • Applications

Probabilistic models

  • A model is a mathematical formulation that explains/simulates the data under consideration.
  • Linear models:

– Y = aX + b
– Y = aX + b + ε, where ε is a r.v.

  • Generating models for bio sequences

Model selection

  • Everything else being equal, one should prefer the simpler model.

  • Should take into consideration

background/prior knowledge of the problem.

  • Some explicit measure can be used,

such as minimum description length (MDL).

Minimum Description Length

  • Connections to information theory,

communication channels and data compression

  • Prefer models that minimize the length of the message required to describe the data.

Minimum Description Length

  • The most probable model is also the one with the shortest description.
  • The “better” the model, the shorter the message describing the observation x, and the higher the compression (and vice versa).
  • Keep in mind that if the model is too faithful to the observations, it can overfit and generalize poorly.


Model inference

  • Once a model is fixed, parameters need

to be estimated from data.

  • Two views:

– The true parameters are fixed values in a parameter space, and we need to find them: classical/frequentist.
– The parameters themselves are r.v.s, and we need to estimate their distributions: Bayesian.

Maximum likelihood

  • Given a model M with some unknown

parameters θ, and some training data set x, the maximum likelihood estimate of θ is

θML = argmaxθP(x|M, θ) where the P(x|M, θ) is the probability that the dataset x has been generated by the model M with parameters θ.

  • MLE is a consistent estimator: if we have enough data, the MLE is close to the true value.
  • With a small dataset, MLE is prone to overfitting.

Example

  • Two dice with four faces: {A,C,G,T}
  • One has the distribution of

pA=pC=pG=pT=0.25

  • The other has the distribution: pA=0.20,

pC=0.28, pG=0.30, pT=0.22


Example: generating model

  • The source generates a string as

follows:

  • 1. Randomly select one die with the

probability distribution of p1=0.9, p2=0.1.

  • 2. Roll it, append the symbol to the string.
  • 3. Repeat 1-2, until all symbols have been

generated.

Parameter θ: {0.9×{0.25, 0.25, 0.25, 0.25}, 0.1×{0.20, 0.28, 0.30, 0.22}}

Example

  • Suppose we know neither the model nor its parameters.

  • Given a string generated by the

unknown model, say x=“GATTCCAA…”

  • Questions:

– How to choose a “good” model?
– How to infer the parameters?
– How much data do we need?

Bernoulli and Markov models

  • There are two typical hypotheses about the source:

  • Bernoulli: symbols are generated

independently and they are identically distributed (iid)

  • Markov: the probability distribution for

the “next” symbol depends on the previous h symbols (h>0 is the order of Markov chain).


Example

  • X1 = CCACCCTTGT
  • X2 = TTGTTCTTCC
  • X3 = TTCAACCGGT
  • X4 = AATAACCCCA
  • MLE for the parameter set {pA, pC, pG,

pT} for Bernoulli model is: pA=8/40=0.20, pC=15/40=0.375, pG=4/40=0.10, pT=13/40=0.325
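A sketch of this count-based ML estimate in Python, using the four training strings above:

```python
# Pool all positions (Bernoulli model: letters are i.i.d.) and count.
from collections import Counter

seqs = ["CCACCCTTGT", "TTGTTCTTCC", "TTCAACCGGT", "AATAACCCCA"]
counts = Counter("".join(seqs))
total = sum(counts.values())                      # 40 symbols

p_hat = {b: counts[b] / total for b in "ACGT"}
print(p_hat)   # {'A': 0.2, 'C': 0.375, 'G': 0.1, 'T': 0.325}
```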

Issues

  • How can we incorporate prior

knowledge in the analysis process?

  • Overfitting
  • How can we compare different models?

Bayesian inference

  • Useful to get better estimates of the

parameters using prior knowledge

  • Useful to compare different models
  • Parameters themselves are r.v.s. Prior

knowledge represented by prior distribution of parameters. We need to estimate the posterior distribution of parameters given data.


Bayesian inference

P(θ|x) = P(x|θ)P(θ) / ∫ P(x|θ′)P(θ′) dθ′

Point estimation

  • We can define the “best” model as the one that maximizes P(θ|x) (maximum a posteriori estimation, or MAP).
  • Equivalent to minimizing:
  • −logP(θ|x) = −logP(x|θ) − logP(θ) + logP(x)
  • Can be used for model comparisons
  • Note that logP(x) can be regarded as a constant

ML and MAP

  • ML estimation:

θML = argmaxθ P(x|θ) = argminθ {−logP(x|θ)}

  • MAP estimation:

θMAP = argminθ {−logP(x|θ) − logP(θ)}
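One concrete, hedged illustration of how the −logP(θ) term acts as a regularizer (not from the slides): for a multinomial model with a Dirichlet prior, the MAP estimate has a closed form that shifts the ML counts by pseudocounts. The prior values below are purely illustrative:

```python
import numpy as np

counts = np.array([8, 15, 4, 13], dtype=float)   # observed A, C, G, T counts
alpha  = np.array([2.0, 2.0, 2.0, 2.0])          # illustrative Dirichlet prior

theta_ml  = counts / counts.sum()                                          # argmax P(x|theta)
theta_map = (counts + alpha - 1) / (counts.sum() + alpha.sum() - len(counts))  # argmax P(theta|x)

print(theta_ml, theta_map)
```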


Occam’s Razor

  • Everything else being equal, one should prefer a simple model over a complex one.
  • One can incorporate this principle in the Bayesian framework by using priors which penalize complex models.

Example

  • Two dice with four faces: {A,C,G,T}
  • One has the distribution of

pA=pC=pG=pT=0.25. M1

  • The other has the distribution: pA=0.20,

pC=0.28, pG=0.30, pT=0.22. M2

Example

  • The source generates a string as

follows:

  • 1. Randomly select one die.
  • 2. Roll it, append the symbol to the string.
  • 3. Repeat 2, until all symbols have been

generated.

– Given a string say x=“GATTCCAA…”


Example (ML)

  • What is the probability that the data was generated by M1? By M2?

  • P(x|M1)
  • P(x|M2)
  • They can be easily calculated since this is a

Bernoulli model.

  • Log Likelihood Ratio: log(P(x|M1))-

log(P(x|M2))
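A sketch of this comparison under the two Bernoulli dice; the string x below is just an illustrative prefix of the observed data:

```python
import math

M1 = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
M2 = {"A": 0.20, "C": 0.28, "G": 0.30, "T": 0.22}

def log_lik(x, p):
    """log P(x | model) under a Bernoulli model with letter probabilities p."""
    return sum(math.log(p[b]) for b in x)

x = "GATTCCAA"                               # illustrative prefix of the data
llr = log_lik(x, M1) - log_lik(x, M2)        # > 0 favours M1
print(llr)
```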

Example

  • Which die (model) was selected in step 1? What is the probability distribution over model 1 and model 2?

  • Prior: if we know nothing, it is better to assume they are equally likely to be selected. But in some cases we know something; say P(M1) = 0.4, P(M2) = 0.6.

  • Given the data (x=“GATTCCAA…”), what can

we say about the question?

Example (Bayesian)

  • P(M1|x) = P(x|M1)P(M1)/P(x)
  • P(M2|x) = P(x|M2)P(M2)/P(x)
  • This is the posterior distribution. We can

choose the model using MAP criterion.
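The corresponding Bayesian calculation, using the illustrative priors P(M1) = 0.4, P(M2) = 0.6 from the previous slide (the die parameters are repeated so the sketch is self-contained):

```python
M1 = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
M2 = {"A": 0.20, "C": 0.28, "G": 0.30, "T": 0.22}

def lik(x, p):
    """P(x | model) under a Bernoulli model with letter probabilities p."""
    prob = 1.0
    for b in x:
        prob *= p[b]
    return prob

x = "GATTCCAA"                           # illustrative prefix of the data
prior = {"M1": 0.4, "M2": 0.6}
like  = {"M1": lik(x, M1), "M2": lik(x, M2)}

evidence  = sum(prior[m] * like[m] for m in prior)               # P(x)
posterior = {m: prior[m] * like[m] / evidence for m in prior}    # P(M|x)
print(posterior)                         # pick the argmax for the MAP model
```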


Markov models

  • Recall that in general we assume a

sequence is generated by a source/model.

  • For Bernoulli models, each letter is generated independently.

  • For Markov models, the generation of

the current letter depends on the previous letters, i.e., what we are going to see depends on what we have seen.

Markov models (Markov chain)

  • Order h: how many previous symbols the next symbol depends on
  • States: correspond to letters (the current symbol)

  • Transition probabilities: px,y the

probability of generating y given we have seen x.

  • Stationary probability: the limit

distribution of the state probabilities.

First order MM

P(am | a1 … am−1) = P(am | am−1) = pam−1,am

P(a1 a2 … am) = P(am | a1 … am−1) P(a1 … am−1)
             = P(am | am−1) P(am−1 | am−2) … P(a2 | a1) P(a1)
             = P(a1) ∏i=2..m pai−1,ai


First order MM

[Table: transition probabilities of a first-order Markov model with states b (begin), A, C, G, T, e (end); rows index the current state and columns the next state. The individual entries are not reliably recoverable from the slide.]

Example: CpG islands

  • CpG islands are stretches of C and G

bases repeating over and over

  • Often occur adjacent to gene-rich areas,

forming a barrier between the genes and “junk” DNA

  • It is important to identify CpG islands for

gene recognition.


Learning from examples

  • We are given a set of regions/sequences that are labeled as CpG islands, denoted S+ (the positive training set).
  • The remaining sequences are the “sea”, denoted S− (the negative training set).

  • The problem: how can we learn our model

from the training data, so that we can predict a new sequence as a CpG island or not, based on the model.

  • This is also known as classification problem.

First order MM for CpG islands

  • Build two Markov models of order one, for S+

and S-, respectively.

  • Train the models: the transition probabilities

can be estimated by ML.

p+a,b = f+(a,b) / ∑c f+(a,c)          p−a,b = f−(a,b) / ∑c f−(a,c)

where f+(a,b) and f−(a,b) are the counts of the transition a→b in S+ and S−, respectively.

Example

  • For any given sequence y, we can use the log likelihood ratio test to predict whether the string is a CpG island or not:

LLR(y) = log[ P(y|S+) / P(y|S−) ] = ∑i=2..|y| { log p+y[i−1],y[i] − log p−y[i−1],y[i] }
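A hedged sketch of the whole two-model procedure: train p+ and p− from labelled sequences and score a query with the LLR. The training strings below are invented, and a small pseudocount is added (a slight departure from the pure ML estimate above) so that unseen transitions do not produce zero probabilities:

```python
import math

ALPHABET = "ACGT"

def train_markov(seqs, pseudo=1.0):
    """First-order transition probabilities p[a][b] = f(a,b) / sum_c f(a,c),
    with a pseudocount so every transition keeps nonzero probability."""
    f = {a: {b: pseudo for b in ALPHABET} for a in ALPHABET}
    for s in seqs:
        for a, b in zip(s, s[1:]):
            f[a][b] += 1
    return {a: {b: f[a][b] / sum(f[a].values()) for b in ALPHABET} for a in ALPHABET}

def llr(y, p_plus, p_minus):
    """LLR(y) = sum_i [ log p+_{y[i-1],y[i]} - log p-_{y[i-1],y[i]} ]."""
    return sum(math.log(p_plus[a][b]) - math.log(p_minus[a][b])
               for a, b in zip(y, y[1:]))

# Toy labelled training data (hypothetical):
p_plus  = train_markov(["CGCGCGGC", "GCGCCGCG"])
p_minus = train_markov(["ATATTACA", "TTAACATA"])
print(llr("CGCGT", p_plus, p_minus))   # > 0 suggests a CpG island
```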


Example

  • How can we find a CpG island in a long

string?

  • We could use a sliding window, but how

to choose the size of the window?

  • Can we combine the two models into one?
  • The answer to all these questions is to use an HMM.

Hidden Markov Models

  • HMMs were introduced in the ’70s for

speech recognition

  • HMMs have been shown to be good models for biosequences.

  • They are mainly used in

– Sequence alignment
– Sequence analysis
– Many, many other applications

HMM for CpG islands

[Figure: HMM with eight states A+, C+, G+, T+ and A−, C−, G−, T−.]


HMM for CpG islands

  • Now we have two states for each

symbol.

  • Within each group of states (+, -), each

group behaves as the original MM.

  • There is also a small probability of switching to the other group of states.
  • Since we expect CpG islands to be smaller than the “sea”, P(S−|S+) > P(S+|S−).

HMM state paths

  • It is no longer possible to tell what state

the system is by looking at the symbol.

  • Let π denote the sequence of states

(integers)

  • The chain of states follows the transition probabilities:

– ps,t = P(πi=t | πi-1=s)

Emission probabilities

  • We can formalize the decoupling of

symbols from states by introducing the emission probabilities.

  • A state can emit any symbol based on a probability distribution:

– es(b) = P(y[i]=b| πi=s)

  • The states are hidden since we cannot tell the state path directly from the sequence information!


CpG islands with emission prob

[Figure: two states, + and −, each emitting A, C, G, T with its own emission probabilities pA, pC, pG, pT.]
Example

  • Two dice with four faces: {A,C,G,T}
  • One has the distribution of

pA=pC=pG=pT=0.25

  • The other has the distribution: pA=0.20,

pC=0.28, pG=0.30, pT=0.22

Example: generating model

  • The source generates a string as

follows:

  • 1. Randomly select one die with the

probability distribution of p1=0.9, p2=0.1.

  • 2. Roll it, append the symbol to the string.
  • 3. Repeat 1-2, until all symbols have been

generated.


Example

[Figure: two-state HMM for the two-die source. State 1 emission probabilities: pA1 = pC1 = pG1 = pT1 = 0.25; state 2: pA2 = 0.20, pC2 = 0.28, pG2 = 0.30, pT2 = 0.22. Transition probabilities are 0.9 and 0.1, matching the die-selection probabilities p1 = 0.9, p2 = 0.1 of the generating model.]

HMM as a generating model

  • How do we generate a string?

– Choose a new state according to the transition probabilities
– Choose a letter according to the emission probabilities of that state

  • Usually we add a start state and an end state. Sometimes the two are combined and denoted as state 0.

Roadmap

  • Most probable path: Viterbi algorithm
  • Posterior distribution: forward and

backward algorithm

  • Parameter estimation: Baum-Welch

algorithm (Expectation-Maximization)

  • HMM structure
  • HMM applications

HMMs

  • Given a string y of size m generated by an HMM along the state path π, we have

P(y, π) = p0,π1 ∏i=1..m eπi(y[i]) pπi,πi+1

(with πm+1 = 0 denoting the end state)

  • In general, we do not know the path, like in

the example of CpG islands.

Most probable path

  • Problem: given a string y generated by

a given HMM, find the most probable state path, that is

π* = argmaxπ P(y, π)

  • π* can be computed recursively, using the Viterbi algorithm.

Viterbi algorithm

  • In order to know the state in the most probable path at the last position of the string, we need to know all the paths that lead to that state via valid transitions.


Viterbi algorithm

  • Suppose the probability vs(i) of the most

probable path ending in state s at position i for y is known for all the states s, then we can compute vt(i+1) as follows

vt(i+1) = et(y[i+1])*maxs(vs(i)ps,t)

  • Use dynamic programming to calculate

the values of the matrix v.

Viterbi algorithm

  • Initialization:

– v0(0) = 1, vs(0) = 0 for s > 0

  • Recursive relationship:

– vt(i) = et(y[i])*maxs(vs(i-1)ps,t)

  • Termination:

– P(y, π*) = maxs(vs(m)ps,0)
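A sketch of the Viterbi recursion above in Python, using dictionaries p[s][t] and e[s][b] and a shared begin/end state 0; real implementations usually work in log space to avoid underflow, which is omitted here. The demo parameters at the bottom are made up:

```python
def viterbi(y, states, p, e):
    """Most probable state path.
    p[s][t]: transition probability s -> t (state 0 = begin/end state),
    e[s][b]: probability that state s emits symbol b.
    Returns (P(y, pi*), pi*)."""
    m = len(y)
    v = [{s: 0.0 for s in states} for _ in range(m + 1)]
    ptr = [{s: 0 for s in states} for _ in range(m + 1)]
    for t in states:                                   # initialization: v_t(1)
        v[1][t] = e[t][y[0]] * p[0][t]
    for i in range(2, m + 1):                          # recursion
        for t in states:
            best = max(states, key=lambda s: v[i - 1][s] * p[s][t])
            v[i][t] = e[t][y[i - 1]] * v[i - 1][best] * p[best][t]
            ptr[i][t] = best
    last = max(states, key=lambda s: v[m][s] * p[s][0])    # termination
    prob = v[m][last] * p[last][0]
    path = [last]
    for i in range(m, 1, -1):                          # traceback
        path.append(ptr[i][path[-1]])
    return prob, list(reversed(path))

if __name__ == "__main__":
    states = [1, 2]
    # Made-up toy parameters (state 0 is the shared begin/end state).
    p = {0: {1: 0.5, 2: 0.5},
         1: {0: 0.1, 1: 0.8, 2: 0.1},
         2: {0: 0.1, 1: 0.1, 2: 0.8}}
    e = {1: {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
         2: {"A": 0.20, "C": 0.28, "G": 0.30, "T": 0.22}}
    print(viterbi("GATTCCAA", states, p, e))
```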

Computing P(y)

  • Problem: given a string y and an HMM M, find the probability that y has been generated by M.

  • Recall the expression for the first order MM:

P(a1 a2 … am) = P(a1) ∏i=2..m pai−1,ai

  • For an HMM, we don’t know the path, and we need to sum over all possible paths.


Computing P(y)

P(y) = ∑π P(y, π) = ∑π { p0,π1 ∏i eπi(y[i]) pπi,πi+1 }

  • But there may be an exponential number of paths in terms of the length of the string!
  • One option is to use the most probable path π* as an approximation. But what if there are multiple paths with similar probability?

  • One can use the forward algorithm, which is a recursive algorithm similar to Viterbi.

Forward algorithm

  • Suppose the probabilities fs(i) = P(y[1,i], πi = s) are known for all states s. Then we can compute ft(i+1) as follows:

ft(i+1) = et(y[i+1]) ∑s fs(i) ps,t

Again, dynamic programming is used to fill in the matrix.
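A matching sketch of the forward recursion, with the same p/e conventions as the Viterbi sketch above; practical implementations use scaling or log-sums, which are omitted here:

```python
def forward(y, states, p, e):
    """f_s(i) = P(y[1..i], pi_i = s).  Returns (f, P(y)), assuming an
    explicit end transition p[s][0] as in the Viterbi sketch."""
    m = len(y)
    f = [{s: 0.0 for s in states} for _ in range(m + 1)]
    for s in states:                                   # initialization, i = 1
        f[1][s] = e[s][y[0]] * p[0][s]
    for i in range(2, m + 1):                          # recursion
        for t in states:
            f[i][t] = e[t][y[i - 1]] * sum(f[i - 1][s] * p[s][t] for s in states)
    prob_y = sum(f[m][s] * p[s][0] for s in states)    # termination: P(y)
    return f, prob_y
```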

Posterior probabilities

  • Problem: Given a string y and a hidden Markov model M, find P(πi = s|y), the probability that y[i] was emitted from state s given the observed sequence y.

  • In other words, P(πi = s|y) is the

posterior probability of state s when the emitted sequence y is known


Computing the posterior probs

  • In order to compute P(πi = s|y), we first

compute the probability of generating the entire y with the ith symbol of y emitted by moving through state s, that is P(y ,πi = s).

P(πi = s | y) = P(y, πi = s) / P(y)

Computing the posterior probs

  • We have:

P(y, πi = s) = P(y[1,i], y[i+1,m], πi = s)
             = P(y[1,i], πi = s) P(y[i+1,m] | y[1,i], πi = s)
             = P(y[1,i], πi = s) P(y[i+1,m] | πi = s)
             = fs(i) bs(i)

Where bs(i) := P(y[i+1,m]|πi=s)

  • Problem: how to compute bs(i)

Backward algorithm

  • Suppose the probabilities bs(i) = P(y[i+1,m] | πi = s) are known for all states s. Then we can compute bt(i−1) as follows:

bt(i−1) = ∑s pt,s es(y[i]) bs(i)

Again using dynamic programming, but this time starting from the end of the sequence.
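The corresponding backward sketch, filled from the right end of the sequence (same conventions as the forward sketch):

```python
def backward(y, states, p, e):
    """b_s(i) = P(y[i+1..m] | pi_i = s)."""
    m = len(y)
    b = [{s: 0.0 for s in states} for _ in range(m + 1)]
    for s in states:
        b[m][s] = p[s][0]        # end transition; use 1.0 if no end state is modeled
    for i in range(m - 1, 0, -1):
        for s in states:
            # b_s(i) = sum_t p_{s,t} * e_t(y[i+1]) * b_t(i+1); y[i] is the (i+1)-th symbol.
            b[i][s] = sum(p[s][t] * e[t][y[i]] * b[i + 1][t] for t in states)
    return b
```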


Posterior probs

  • Finally:

P(πi = s | y) = P(y, πi = s) / P(y) = fs(i) bs(i) / P(y)

  • We already know how to compute fs(i)

and P(y).
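Putting the pieces together, the posterior state probabilities are just fs(i)·bs(i)/P(y); this short sketch reuses the forward() and backward() functions defined above:

```python
def posterior(y, states, p, e):
    """List of dicts: entry i-1 gives P(pi_i = s | y) for each state s."""
    f, prob_y = forward(y, states, p, e)
    b = backward(y, states, p, e)
    return [{s: f[i][s] * b[i][s] / prob_y for s in states}
            for i in range(1, len(y) + 1)]
```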

Uses of posterior probs

  • Alternative decoding: instead of computing the most probable state sequence π* = argmaxπ P(y, π), we can compute the state sequence π′ that chooses, at each position i, the state with maximum posterior probability P(πi|y).

  • The latter may be more appropriate in cases when many different paths have almost the same probability as the most probable path.

Uses of posterior probs

  • Suppose we have a function g(s) associated with the states of the HMM (e.g. a labeling). The posterior expectation of the label at position i of the sequence can be expressed as G(i|y) = ∑s P(πi = s|y) g(s).


Uses of posterior probs

  • In the CpG example, one can set g(s)=1

for s∈{A+,C+,G+,T+} and g(s)=0 for s∈{A-,C-,G-,T-}

  • Then G(i|y) = ∑s∈{A+,C+,G+,T+} P(πi = s|y) is precisely the posterior probability that base i lies in a CpG island.

  • The most probable labeling does not necessarily correspond to the labeling of the most probable path.

Parameter estimation

  • Two main issues:
  • 1. The design of the structure: order,

states, connections

  • 2. The assignment of the parameters:

transition probabilities and emission probabilities.

If the state paths are known

  • When all the paths

are known, estimation is simple: count the number of times a particular transition or emission is used in the training set

  • The max likelihood

estimations are

ps,t = f(st) / ∑t′ f(st′)          es(b) = fs(b) / ∑c fs(c)

where f(st) counts the s→t transitions and fs(b) counts the emissions of symbol b from state s in the training set.


If the paths are unknown

  • A closed-form equation to estimate the

parameters is not available anymore.

  • Some iterative procedure must be used
  • Several algorithms for the optimization of continuous functions can be used.

  • Baum-Welch algorithm is the “standard”

for HMMs.

Baum-Welch

  • Observations:
  • If we knew the paths, we could compute

transition and emission probabilities

  • If we knew the transition and emission

probabilities, we could compute the paths (e.g. the most probable paths)

Baum-Welch

  • Iterative process involving repeating

1&2:

  • 1. Compute the most probable paths

from the current values of ps,t and es(b)

  • 2. Estimate the new parameters using

ps,t = f(st) / ∑t′ f(st′)          es(b) = fs(b) / ∑c fs(c)


Baum-Welch

  • It can be proved that at each iteration,

the overall likelihood of the model is increased.

  • Hopefully it will converge to the global

maximum

  • Unfortunately, there are many local

maxima and the one you end up with depends on the initial assignments of the parameters.

Baum-Welch

  • Since we don’t know the path, the

actual counts f(st) and fs(b) are not available.

  • One method is to find the most probable

path first and use the counts from the most probable path.

  • BW actually computes the expected

counts of transitions and emissions based on current values of parameters.

Baum-Welch

  • The probability that we see a transition from state s

to state t at position i in sequence y is:


P(πi = s, πi+1 = t | y) = P(y, πi = s, πi+1 = t) / P(y)
                        = P(y[1,i], πi = s) ps,t et(y[i+1]) P(y[i+2,m] | πi+1 = t) / P(y)
                        = fs(i) ps,t et(y[i+1]) bt(i+1) / P(y)

Baum-Welch

  • Then the expected number of times that we see a transition from state s to state t can be obtained by summing over all positions and over all k sequences {x1, x2, …, xk}:

As,t = ∑j=1..k (1 / P(xj)) ∑i fsj(i) ps,t et(xj[i+1]) btj(i+1)

  • Where fsj is the forward variable for the jth

sequence and btj is the corresponding backward variable.

Baum-Welch

  • Similarly, the expected number of times that symbol b is emitted in state s over all k sequences {x1, x2, …, xk} is given below, where the inner sum is only over the positions of xj at which the emitted symbol is b.

Es(b) = ∑j=1..k (1 / P(xj)) ∑{i: xj[i]=b} fsj(i) bsj(i)


Baum-Welch

  • New parameters are calculated using the usual formulas, but this time based on the expected counts instead of the actual counts:

ps,t = As,t / ∑t′ As,t′          es(b) = Es(b) / ∑c Es(c)

  • Then we can compute the As,t and Es(b)

again based on the new parameters and iterate

Baum-Welch

  • Initialize the model with some random

parameters

  • Repeat:

– Set all As,t and Es(b) to zero
– For each sequence j = 1..k:

  • Compute ft(i) for sequence j using forward alg
  • Compute bt(i) for seq j using backward alg
  • Add the contribution of seq j to A and E.

– Compute the new parameters
– Compute the new log likelihood of the model

  • Termination: stop when the change in the log

likelihood is less than some predefined threshold or max iterations reached.
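A compact sketch of one Baum-Welch iteration, reusing the forward() and backward() sketches above; it omits scaling, pseudocounts, and re-estimation of the begin/end transitions, all of which a real implementation would need:

```python
def baum_welch_step(seqs, states, alphabet, p, e):
    """One EM iteration: accumulate expected counts A, E via forward/backward,
    then renormalize.  Begin/end transitions (state 0) are left unchanged."""
    A = {s: {t: 0.0 for t in states} for s in states}
    E = {s: {b: 0.0 for b in alphabet} for s in states}
    for x in seqs:
        m = len(x)
        f, Px = forward(x, states, p, e)
        bw = backward(x, states, p, e)
        for i in range(1, m):                          # expected transition counts
            for s in states:
                for t in states:
                    # f_s(i) * p_{s,t} * e_t(x[i+1]) * b_t(i+1) / P(x)
                    A[s][t] += f[i][s] * p[s][t] * e[t][x[i]] * bw[i + 1][t] / Px
        for i in range(1, m + 1):                      # expected emission counts
            for s in states:
                E[s][x[i - 1]] += f[i][s] * bw[i][s] / Px
    new_p = {s: {t: A[s][t] / sum(A[s].values()) for t in states} for s in states}
    new_e = {s: {b: E[s][b] / sum(E[s].values()) for b in alphabet} for s in states}
    return new_p, new_e
```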

HMM structure

  • So far we have assumed a fully

connected model

  • Although it is tempting to “let the model find out for itself” which transitions to use, this almost never works in practice.

  • For real-size problems the fully connected model leads to very bad models, even with plenty of training data.


HMM structure

  • The problem is not overfitting, but local

maxima

  • The less constrained the model is, the more severe the local maxima problem becomes.

  • Although automatic methods to

add/remove transitions and states have been proposed, the most successful HMMs are constructed by carefully deciding which transitions are allowed.

HMM structure

  • It is easy to “disable” a transition in the

BW algorithm

  • Just set it to zero and never change it

during the iterative process

HMM structure

  • The model topology should have an interpretation in terms of our knowledge of the problem.
  • There are some standard topologies to use in typical situations.


Duration modeling

  • A typical scenario is to model a stretch of DNA over which the distribution does not change for a certain length.
  • The simplest model is a single state with self-transition probability p and exit probability 1−p, which implies that P(m symbols) = (1−p)p^(m−1).
  • The geometric distribution is not always appropriate to model the length.
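A quick empirical check of the geometric-length claim, simulating a single state with self-transition probability p (the value of p is arbitrary):

```python
import random

p = 0.8

def sample_length():
    """Number of symbols emitted before leaving the self-looping state."""
    m = 1
    while random.random() < p:   # stay in the state with probability p
        m += 1
    return m

lengths = [sample_length() for _ in range(100_000)]
# Empirical P(m = 3) vs. the formula (1 - p) * p**(m - 1) = 0.128
print(sum(l == 3 for l in lengths) / len(lengths), (1 - p) * p**2)
```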

Duration modeling

  • More complex distributions can be

modeled by introducing several states with the same distribution

  • Example: a model that generates sequences of at least 5 symbols, with an exponentially decaying distribution over longer sequences.

Duration modeling

  • This example will generate strings of

lengths between 2 and 6 symbols.


Silent States

  • Silent states are states that do not emit

symbols (e.g., the state 0 in our previous examples)

  • Silent states can be introduced in

HMMs to reduce the number of transitions

Silent States

  • Suppose we want to model a sequence

in which arbitrary deletions are allowed

  • In that case we need a completely forward-connected HMM (O(m²) edges).

Silent States

  • If we use silent states, we use only

O(m) edges


Silent States

  • A price is paid for the reduction in the

number of parameters.

  • Suppose we want to assign high probability to 1→5 and 2→4; then there is no way to also have low probability on 1→4 and 2→5.

[Figure: states 1–5 connected through silent states.]

Silent States

  • As long as there are no cycles composed solely of silent states, Baum-Welch can be modified to also train the transition probabilities of silent states.

HMM Applications

  • Profile HMM
  • HMM for gene predictions
  • From A. Krogh “An introduction to

HMM…” 1998


From alignment to HMM

[Table: per-position emission probabilities for A, C, G, T derived from a multiple alignment; the individual values are not reliably recoverable from the slide.]

From alignment to HMM

Log-odds scores assume the null model is a Bernoulli model with equal base frequencies, and are therefore defined as logP(S) − m·log(0.25).



Profile HMM

[Figure: profile HMM with insertion states and deletion states.]

Example


Example

HMM for gene finding


[Figures: HMMs for a coding region, for unspliced genes, and for genes with splicing.]


GENSCAN

More applications

  • Alignment and searching
  • Gene finding
  • Promoter finding
  • Translation initiation sites finding
  • Splicing sites finding
  • Haplotype structure finding
  • ……

Bayesian networks

  • A generalization of HMM
  • Directed graph model: the probability of a node depends only on the probabilities of its parent nodes.
  • The full probability of a model can be calculated by decomposing it into many conditional probabilities.