
Introduction to Probability and Hidden Markov Models

Durbin et al. Ch3 EECS 458 CWRU Fall 2004

Random Variable

  • Ω: a non-empty set of outcomes, e.g. {up, down}
  • An event A is a subset of Ω, e.g. {up}
  • A random variable X is a function defined on events, with range in a subset of R: X({up}) = 1, X({down}) = 0.

  • The probability of an event A, denoted P(A), is the likelihood that A occurs. It is a function mapping events to [0,1], and the probabilities of the elementary outcomes sum to 1. For example: P({up}) = P({down}) = 0.5.
  • The distribution function of a r.v. X is defined through these probabilities, P(X = x).

Examples

  • A die with 4 sides labeled A, C, G, T. The probability that letter x ∈ {A, C, G, T} occurs in a roll is px.
  • The probability that the sequence “AAGC” is generated by this simple model is pA·pA·pG·pC.
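As a quick illustration of the die model above, a minimal Python sketch of the sequence probability under a Bernoulli source (the uniform probabilities used here are just placeholders):

```python
# Probability of a sequence under a simple Bernoulli ("four-sided die") model.
p = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}   # illustrative values

def seq_prob(seq, p):
    """P(seq) = product of per-letter probabilities (independent rolls)."""
    prob = 1.0
    for x in seq:
        prob *= p[x]
    return prob

print(seq_prob("AAGC", p))   # p_A * p_A * p_G * p_C = 0.25**4 = 0.00390625
```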


Conditional probability

  • The joint probability of two events A and B, written P(A,B), is the probability that A and B occur at the same time.
  • The conditional probability P(B|A) is the probability that B occurs given that A has occurred.
  • Facts:

– P(B|A) = P(AB)/P(A)
– P(A) = ∑i P(A,Bi) = ∑i P(A|Bi)P(Bi), where the events Bi partition Ω

An example

  • A family has two children; sample space: Ω = {GG, BB, BG, GB} (the first letter denotes the older child)
  • P(GG) = P(BB) = P(BG) = P(GB) = 0.25
  • P(BB | at least one boy) = ?
  • P(BB | the older child is a boy) = ?
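The two questions above can be checked by brute-force enumeration of the sample space; the sketch below assumes the first letter denotes the older child:

```python
# Enumerate the uniform sample space and count.
omega = ["GG", "BB", "BG", "GB"]

at_least_one_boy = [w for w in omega if "B" in w]
older_is_boy = [w for w in omega if w[0] == "B"]

# P(BB | at least one boy) = |{BB}| / |{BB, BG, GB}| = 1/3
print(at_least_one_boy.count("BB") / len(at_least_one_boy))
# P(BB | older is a boy) = |{BB}| / |{BB, BG}| = 1/2
print(older_is_boy.count("BB") / len(older_is_boy))
```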

Bayes Theorem

Bayes Theorem: P(B|A) = P(AB)/P(A) = P(A|B)P(B) / ∑i P(A|Ci)P(Ci), where the events Ci partition Ω.
Two events are independent if P(AB) = P(A)P(B), i.e. P(B|A) = P(B).
Example: a playing card's suit and rank are independent: P(King) = 4/52 = 1/13 and P(King|Spade) = 1/13.


Expectation of a r.v.

  • E(X) := ∑x x·P(X=x)
  • E(g(X)) = ∑x g(x)·P(X=x)
  • E(aX+bY) = aE(X) + bE(Y)
  • E(XY) = E(X)E(Y) if X and Y are independent
  • Var(X) := E((X−EX)²) = E(X²) − (EX)²
  • Var(aX) = a²Var(X)
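A small numerical sanity check of these definitions (the discrete distribution used here is made up for illustration):

```python
# Pairs (x, P(X=x)) for a toy discrete random variable.
xs = [(0, 0.5), (1, 0.3), (2, 0.2)]

E  = sum(x * p for x, p in xs)      # E(X)  = sum over x of x * P(X=x)
E2 = sum(x * x * p for x, p in xs)  # E(X^2)
var = E2 - E * E                    # Var(X) = E(X^2) - (EX)^2

print(E, var)
```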

Roadmap

  • Model selection
  • Model inference
  • Markov models
  • Hidden Markov models
  • Applications

Probabilistic models

  • A model is a mathematical formulation that explains/simulates the data under consideration.
  • Linear models:

– Y = aX + b
– Y = aX + b + ε, where ε is a r.v.

  • Generating models for bio sequences

Model selection

  • Everything else being equal, one should prefer the simpler model.

  • Should take into consideration

background/prior knowledge of the problem.

  • Some explicit measure can be used,

such as minimum description length (MDL).

Minimum Description Length

  • Connections to information theory,

communication channels and data compression

  • Prefer models that minimize the length of the message required to describe the data.

Minimum Description Length

  • The most probable model is also the one with the shortest description.
  • The “better” the model, the shorter the message describing the observation x, and the higher the compression (and vice versa).
  • Keep in mind that if the model is too faithful to the observations, it can overfit and generalize poorly.


Model inference

  • Once a model is fixed, parameters need

to be estimated from data.

  • Two views:

– The true parameters are fixed values in a parameter space, and we need to find them: classical/frequentist.
– The parameters themselves are r.v.s, and we need to estimate their distributions: Bayesian.

Maximum likelihood

  • Given a model M with some unknown

parameters θ, and some training data set x, the maximum likelihood estimate of θ is

θML = argmaxθP(x|M, θ) where the P(x|M, θ) is the probability that the dataset x has been generated by the model M with parameters θ.

  • MLE is a consistent estimator: if we have enough data, the MLE is close to the true value.
  • With a small dataset, MLE is prone to overfitting.

Example

  • Two dice with four faces: {A,C,G,T}
  • One has the distribution of

pA=pC=pG=pT=0.25

  • The other has the distribution: pA=0.20,

pC=0.28, pG=0.30, pT=0.22


Example: generating model

  • The source generates a string as

follows:

  • 1. Randomly select one die with the

probability distribution of p1=0.9, p2=0.1.

  • 2. Roll it, append the symbol to the string.
  • 3. Repeat 1-2, until all symbols have been

generated.

Parameter θ: {0.9×{0.25, 0.25, 0.25, 0.25}, 0.1×{0.20, 0.28, 0.30, 0.22}}

Example

  • Suppose we know neither the model nor its parameters.

  • Given a string generated by the

unknown model, say x=“GATTCCAA…”

  • Questions:

– How to choose a “good” model?
– How to infer the parameters?
– How much data do we need?

Bernoulli and Markov models

  • There are two typical hypotheses about the source:

  • Bernoulli: symbols are generated

independently and they are identically distributed (iid)

  • Markov: the probability distribution for

the “next” symbol depends on the previous h symbols (h>0 is the order of Markov chain).


Example

  • X1 = CCACCCTTGT
  • X2 = TTGTTCTTCC
  • X3 = TTCAACCGGT
  • X4 = AATAACCCCA
  • MLE for the parameter set {pA, pC, pG,

pT} for Bernoulli model is: pA=8/40=0.20, pC=15/40=0.375, pG=4/40=0.10, pT=13/40=0.325
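A sketch of this count-based ML estimate in Python, using the four training strings above:

```python
# Pool all positions (Bernoulli model: letters are i.i.d.) and count.
from collections import Counter

seqs = ["CCACCCTTGT", "TTGTTCTTCC", "TTCAACCGGT", "AATAACCCCA"]
counts = Counter("".join(seqs))
total = sum(counts.values())                      # 40 symbols

p_hat = {b: counts[b] / total for b in "ACGT"}
print(p_hat)   # {'A': 0.2, 'C': 0.375, 'G': 0.1, 'T': 0.325}
```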

Issues

  • How can we incorporate prior

knowledge in the analysis process?

  • Overfitting
  • How can we compare different models?

Bayesian inference

  • Useful to get better estimates of the

parameters using prior knowledge

  • Useful to compare different models
  • Parameters themselves are r.v.s. Prior

knowledge represented by prior distribution of parameters. We need to estimate the posterior distribution of parameters given data.


Bayesian inference

P(θ|x) = P(x|θ)P(θ) / ∫ P(x|θ′)P(θ′) dθ′

Point estimation

  • We can define the “best” model as the one that maximizes P(θ|x) (maximum a posteriori estimation, or MAP).
  • Equivalent to minimizing:
  • −logP(θ|x) = −logP(x|θ) − logP(θ) + logP(x)
  • Can be used for model comparisons
  • Note that logP(x) can be regarded as a constant

ML and MAP

  • ML estimation:

θML = argmaxθ P(x|θ) = argminθ {−logP(x|θ)}

  • MAP estimation:

θMAP = argminθ {−logP(x|θ) − logP(θ)}
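One concrete, hedged illustration of how the −logP(θ) term acts as a regularizer (not from the slides): for a multinomial model with a Dirichlet prior, the MAP estimate has a closed form that shifts the ML counts by pseudocounts. The prior values below are purely illustrative:

```python
import numpy as np

counts = np.array([8, 15, 4, 13], dtype=float)   # observed A, C, G, T counts
alpha  = np.array([2.0, 2.0, 2.0, 2.0])          # illustrative Dirichlet prior

theta_ml  = counts / counts.sum()                                          # argmax P(x|theta)
theta_map = (counts + alpha - 1) / (counts.sum() + alpha.sum() - len(counts))  # argmax P(theta|x)

print(theta_ml, theta_map)
```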


Occam’s Razor

  • Everything else being equal, one should prefer a simple model over a complex one.
  • One can incorporate this principle in the Bayesian framework by using priors which penalize complex models.

Example

  • Two dice with four faces: {A,C,G,T}
  • One has the distribution of

pA=pC=pG=pT=0.25. M1

  • The other has the distribution: pA=0.20,

pC=0.28, pG=0.30, pT=0.22. M2

Example

  • The source generates a string as

follows:

  • 1. Randomly select one die.
  • 2. Roll it, append the symbol to the string.
  • 3. Repeat 2, until all symbols have been

generated.

– Given a string say x=“GATTCCAA…”


Example (ML)

  • What is the probability that the data was generated by M1? By M2?

  • P(x|M1)
  • P(x|M2)
  • They can be easily calculated since this is a

Bernoulli model.

  • Log Likelihood Ratio: log(P(x|M1))-

log(P(x|M2))
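A sketch of this comparison under the two Bernoulli dice; the string x below is just an illustrative prefix of the observed data:

```python
import math

M1 = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
M2 = {"A": 0.20, "C": 0.28, "G": 0.30, "T": 0.22}

def log_lik(x, p):
    """log P(x | model) under a Bernoulli model with letter probabilities p."""
    return sum(math.log(p[b]) for b in x)

x = "GATTCCAA"                               # illustrative prefix of the data
llr = log_lik(x, M1) - log_lik(x, M2)        # > 0 favours M1
print(llr)
```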

Example

  • Which die (model) was selected in step 1? What is the probability distribution over model 1 and model 2?

  • Prior: if we know nothing, it is better to assume they are equally likely to be selected. But in some cases we know something; say P(M1) = 0.4, P(M2) = 0.6.

  • Given the data (x=“GATTCCAA…”), what can

we say about the question?

Example (Bayesian)

  • P(M1|x) = P(x|M1)P(M1)/P(x)
  • P(M2|x) = P(x|M2)P(M2)/P(x)
  • This is the posterior distribution. We can

choose the model using MAP criterion.
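The corresponding Bayesian calculation, using the illustrative priors P(M1) = 0.4, P(M2) = 0.6 from the previous slide (the die parameters are repeated so the sketch is self-contained):

```python
M1 = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
M2 = {"A": 0.20, "C": 0.28, "G": 0.30, "T": 0.22}

def lik(x, p):
    """P(x | model) under a Bernoulli model with letter probabilities p."""
    prob = 1.0
    for b in x:
        prob *= p[b]
    return prob

x = "GATTCCAA"                           # illustrative prefix of the data
prior = {"M1": 0.4, "M2": 0.6}
like  = {"M1": lik(x, M1), "M2": lik(x, M2)}

evidence  = sum(prior[m] * like[m] for m in prior)               # P(x)
posterior = {m: prior[m] * like[m] / evidence for m in prior}    # P(M|x)
print(posterior)                         # pick the argmax for the MAP model
```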


Markov models

  • Recall that in general we assume a

sequence is generated by a source/model.

  • For Bernoulli models, each letter is generated independently.

  • For Markov models, the generation of

the current letter depends on the previous letters, i.e., what we are going to see depends on what we have seen.

Markov models (Markov chain)

  • Order h: how many previous symbols the next symbol depends on
  • States: correspond to letters (the current symbol)

  • Transition probabilities: px,y the

probability of generating y given we have seen x.

  • Stationary probability: the limit

distribution of the state probabilities.

First order MM

P(am | a1 … am−1) = P(am | am−1) = pam−1,am

P(a1 a2 … am) = P(am | a1 … am−1) P(a1 … am−1)
             = P(am | am−1) P(am−1 | am−2) … P(a2 | a1) P(a1)
             = P(a1) ∏i=2..m pai−1,ai


First order MM

[Table: transition probabilities of a first-order Markov model with states b (begin), A, C, G, T, e (end); rows index the current state and columns the next state. The individual entries are not reliably recoverable from the slide.]

Example: CpG islands

  • CpG islands are stretches of C and G

bases repeating over and over

  • Often occur adjacent to gene-rich areas,

forming a barrier between the genes and “junk” DNA

  • It is important to identify CpG islands for

gene recognition.


Learning from examples

  • We are given a set of regions/sequences that are labeled as CpG islands, denoted S+ (the positive training set).
  • The remaining sequences are the “sea”, denoted S− (the negative training set).

  • The problem: how can we learn our model

from the training data, so that we can predict a new sequence as a CpG island or not, based on the model.

  • This is also known as classification problem.

First order MM for CpG islands

  • Build two Markov models of order one, for S+

and S-, respectively.

  • Train the models: the transition probabilities

can be estimated by ML.

p+a,b = f+(a,b) / ∑c f+(a,c)          p−a,b = f−(a,b) / ∑c f−(a,c)

where f+(a,b) and f−(a,b) are the counts of the transition a→b in S+ and S−, respectively.

Example

  • For any given sequence y, we can use the log likelihood ratio test to predict whether the string is a CpG island or not:

LLR(y) = log[ P(y|S+) / P(y|S−) ] = ∑i=2..|y| { log p+y[i−1],y[i] − log p−y[i−1],y[i] }
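A hedged sketch of the whole two-model procedure: train p+ and p− from labelled sequences and score a query with the LLR. The training strings below are invented, and a small pseudocount is added (a slight departure from the pure ML estimate above) so that unseen transitions do not produce zero probabilities:

```python
import math

ALPHABET = "ACGT"

def train_markov(seqs, pseudo=1.0):
    """First-order transition probabilities p[a][b] = f(a,b) / sum_c f(a,c),
    with a pseudocount so every transition keeps nonzero probability."""
    f = {a: {b: pseudo for b in ALPHABET} for a in ALPHABET}
    for s in seqs:
        for a, b in zip(s, s[1:]):
            f[a][b] += 1
    return {a: {b: f[a][b] / sum(f[a].values()) for b in ALPHABET} for a in ALPHABET}

def llr(y, p_plus, p_minus):
    """LLR(y) = sum_i [ log p+_{y[i-1],y[i]} - log p-_{y[i-1],y[i]} ]."""
    return sum(math.log(p_plus[a][b]) - math.log(p_minus[a][b])
               for a, b in zip(y, y[1:]))

# Toy labelled training data (hypothetical):
p_plus  = train_markov(["CGCGCGGC", "GCGCCGCG"])
p_minus = train_markov(["ATATTACA", "TTAACATA"])
print(llr("CGCGT", p_plus, p_minus))   # > 0 suggests a CpG island
```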


Example

  • How can we find a CpG island in a long

string?

  • We could use a sliding window, but how

to choose the size of the window?

  • Can we combine the two models into one?
  • The answer to all these questions is to use an HMM.

Hidden Markov Models

  • HMMs were introduced in the ’70s for

speech recognition

  • HMMs have been shown to be good models for biosequences.

  • They are mainly used in

– Sequence alignment
– Sequence analysis
– Many, many other applications

HMM for CpG islands

[Figure: HMM with eight states A+, C+, G+, T+ and A−, C−, G−, T−.]


HMM for CpG islands

  • Now we have two states for each

symbol.

  • Within each group of states (+, -), each

group behaves as the original MM.

  • There is also a small probability of switching to the other group of states.
  • Since we expect CpG islands to be smaller than the “sea”, P(S−|S+) > P(S+|S−).

HMM state paths

  • It is no longer possible to tell what state

the system is by looking at the symbol.

  • Let π denote the sequence of states

(integers)

  • The chain of states follows the transition probabilities:

– ps,t = P(πi=t | πi-1=s)

Emission probabilities

  • We can formalize the decoupling of

symbols from states by introducing the emission probabilities.

  • A state can emit any symbol based on a probability distribution:

– es(b) = P(y[i]=b| πi=s)

  • The states are hidden since we cannot tell the state path directly from the sequence information!


CpG islands with emission prob

[Figure: two states, + and −, each emitting A, C, G, T with its own emission probabilities pA, pC, pG, pT.]
Example

  • Two dice with four faces: {A,C,G,T}
  • One has the distribution of

pA=pC=pG=pT=0.25

  • The other has the distribution: pA=0.20,

pC=0.28, pG=0.30, pT=0.22

Example: generating model

  • The source generates a string as

follows:

  • 1. Randomly select one die with the

probability distribution of p1=0.9, p2=0.1.

  • 2. Roll it, append the symbol to the string.
  • 3. Repeat 1-2, until all symbols have been

generated.


Example

[Figure: two-state HMM for the two-die source. State 1 emission probabilities: pA1 = pC1 = pG1 = pT1 = 0.25; state 2: pA2 = 0.20, pC2 = 0.28, pG2 = 0.30, pT2 = 0.22. Transition probabilities are 0.9 and 0.1, matching the die-selection probabilities p1 = 0.9, p2 = 0.1 of the generating model.]

HMM as a generating model

  • How do we generate a string?

– Choose a new state according to the transition probabilities
– Choose a letter according to the emission probabilities of that state

  • Usually we add a start state and an end state. Sometimes the two are combined and denoted as state 0.

Roadmap

  • Most probable path: Viterbi algorithm
  • Posterior distribution: forward and

backward algorithm

  • Parameter estimation: Baum-Welch

algorithm (Expectation-Maximization)

  • HMM structure
  • HMM applications

HMMs

  • Given a string y of size m generated by an HMM along the state path π, we have

P(y, π) = p0,π1 ∏i=1..m eπi(y[i]) pπi,πi+1

(with πm+1 = 0 denoting the end state)

  • In general, we do not know the path, like in

the example of CpG islands.

Most probable path

  • Problem: given a string y generated by

a given HMM, find the most probable state path, that is

π* = argmaxπ P(y, π)

  • π* can be computed recursively, using the Viterbi algorithm.

Viterbi algorithm

  • In order to know the state in the most probable path at the last position of the string, we need to know all the paths that lead to that state via valid transitions.


Viterbi algorithm

  • Suppose the probability vs(i) of the most

probable path ending in state s at position i for y is known for all the states s, then we can compute vt(i+1) as follows

vt(i+1) = et(y[i+1])*maxs(vs(i)ps,t)

  • Use dynamic programming to calculate

the values of the matrix v.

Viterbi algorithm

  • Initialization:

– v0(0) = 1, vs(0) = 0 for s > 0

  • Recursive relationship:

– vt(i) = et(y[i])*maxs(vs(i-1)ps,t)

  • Termination:

– P(y, π*) = maxs(vs(m)ps,0)
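A sketch of the Viterbi recursion above in Python, using dictionaries p[s][t] and e[s][b] and a shared begin/end state 0; real implementations usually work in log space to avoid underflow, which is omitted here. The demo parameters at the bottom are made up:

```python
def viterbi(y, states, p, e):
    """Most probable state path.
    p[s][t]: transition probability s -> t (state 0 = begin/end state),
    e[s][b]: probability that state s emits symbol b.
    Returns (P(y, pi*), pi*)."""
    m = len(y)
    v = [{s: 0.0 for s in states} for _ in range(m + 1)]
    ptr = [{s: 0 for s in states} for _ in range(m + 1)]
    for t in states:                                   # initialization: v_t(1)
        v[1][t] = e[t][y[0]] * p[0][t]
    for i in range(2, m + 1):                          # recursion
        for t in states:
            best = max(states, key=lambda s: v[i - 1][s] * p[s][t])
            v[i][t] = e[t][y[i - 1]] * v[i - 1][best] * p[best][t]
            ptr[i][t] = best
    last = max(states, key=lambda s: v[m][s] * p[s][0])    # termination
    prob = v[m][last] * p[last][0]
    path = [last]
    for i in range(m, 1, -1):                          # traceback
        path.append(ptr[i][path[-1]])
    return prob, list(reversed(path))

if __name__ == "__main__":
    states = [1, 2]
    # Made-up toy parameters (state 0 is the shared begin/end state).
    p = {0: {1: 0.5, 2: 0.5},
         1: {0: 0.1, 1: 0.8, 2: 0.1},
         2: {0: 0.1, 1: 0.1, 2: 0.8}}
    e = {1: {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
         2: {"A": 0.20, "C": 0.28, "G": 0.30, "T": 0.22}}
    print(viterbi("GATTCCAA", states, p, e))
```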

Computing P(y)

  • Problem: given a string y and an HMM M, find the probability that y has been generated by M.

  • Recall the expression for the first order MM:

P(a1 a2 … am) = P(a1) ∏i=2..m pai−1,ai

  • For an HMM, we don’t know the path, and we need to sum over all possible paths.


Computing P(y)

P(y) = ∑π P(y, π) = ∑π { p0,π1 ∏i eπi(y[i]) pπi,πi+1 }

  • But there may be an exponential number of paths in terms of the length of the string!
  • One option is to use the most probable path π* as an approximation. But what if there are multiple paths with similar probability?

  • One can use the forward algorithm, which is a recursive algorithm similar to Viterbi.

Forward algorithm

  • Suppose the probabilities fs(i) = P(y[1,i], πi = s) are known for all states s. Then we can compute ft(i+1) as follows:

ft(i+1) = et(y[i+1]) ∑s fs(i) ps,t

Again, dynamic programming is used to fill in the matrix.
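A matching sketch of the forward recursion, with the same p/e conventions as the Viterbi sketch above; practical implementations use scaling or log-sums, which are omitted here:

```python
def forward(y, states, p, e):
    """f_s(i) = P(y[1..i], pi_i = s).  Returns (f, P(y)), assuming an
    explicit end transition p[s][0] as in the Viterbi sketch."""
    m = len(y)
    f = [{s: 0.0 for s in states} for _ in range(m + 1)]
    for s in states:                                   # initialization, i = 1
        f[1][s] = e[s][y[0]] * p[0][s]
    for i in range(2, m + 1):                          # recursion
        for t in states:
            f[i][t] = e[t][y[i - 1]] * sum(f[i - 1][s] * p[s][t] for s in states)
    prob_y = sum(f[m][s] * p[s][0] for s in states)    # termination: P(y)
    return f, prob_y
```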

Posterior probabilities

  • Problem: Given a string y and a hidden Markov model M, find P(πi = s|y), the probability that y[i] was emitted from state s given the observed sequence y.

  • In other words, P(πi = s|y) is the

posterior probability of state s when the emitted sequence y is known


Computing the posterior probs

  • In order to compute P(πi = s|y), we first

compute the probability of generating the entire y with the ith symbol of y emitted by moving through state s, that is P(y ,πi = s).

P(πi = s | y) = P(y, πi = s) / P(y)

Computing the posterior probs

  • We have:

P(y, πi = s) = P(y[1,i], y[i+1,m], πi = s)
             = P(y[1,i], πi = s) P(y[i+1,m] | y[1,i], πi = s)
             = P(y[1,i], πi = s) P(y[i+1,m] | πi = s)
             = fs(i) bs(i)

Where bs(i) := P(y[i+1,m]|πi=s)

  • Problem: how to compute bs(i)

Backward algorithm

  • Suppose the probabilities bs(i) = P(y[i+1,m] | πi = s) are known for all states s. Then we can compute bt(i−1) as follows:

bt(i−1) = ∑s pt,s es(y[i]) bs(i)

Again using dynamic programming, but this time starting from the end of the sequence.
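The corresponding backward sketch, filled from the right end of the sequence (same conventions as the forward sketch):

```python
def backward(y, states, p, e):
    """b_s(i) = P(y[i+1..m] | pi_i = s)."""
    m = len(y)
    b = [{s: 0.0 for s in states} for _ in range(m + 1)]
    for s in states:
        b[m][s] = p[s][0]        # end transition; use 1.0 if no end state is modeled
    for i in range(m - 1, 0, -1):
        for s in states:
            # b_s(i) = sum_t p_{s,t} * e_t(y[i+1]) * b_t(i+1); y[i] is the (i+1)-th symbol.
            b[i][s] = sum(p[s][t] * e[t][y[i]] * b[i + 1][t] for t in states)
    return b
```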


Posterior probs

  • Finally:

P(πi = s | y) = P(y, πi = s) / P(y) = fs(i) bs(i) / P(y)

  • We already know how to compute fs(i)

and P(y).
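Putting the pieces together, the posterior state probabilities are just fs(i)·bs(i)/P(y); this short sketch reuses the forward() and backward() functions defined above:

```python
def posterior(y, states, p, e):
    """List of dicts: entry i-1 gives P(pi_i = s | y) for each state s."""
    f, prob_y = forward(y, states, p, e)
    b = backward(y, states, p, e)
    return [{s: f[i][s] * b[i][s] / prob_y for s in states}
            for i in range(1, len(y) + 1)]
```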

Uses of posterior probs

  • Alternative decoding: instead of computing the most probable state sequence π* = argmaxπ P(y, π), we can compute the state sequence π′ that chooses, at each position i, the state with maximum posterior probability P(πi|y).

  • The latter may be more appropriate in cases when many different paths have almost the same probability as the most probable path.

Uses of posterior probs

  • Suppose we have a function g(s) associated with the states of the HMM (e.g. a labeling). The posterior expectation of the label at position i of the sequence can be expressed as G(i|y) = ∑s P(πi = s|y) g(s).


Uses of posterior probs

  • In the CpG example, one can set g(s)=1

for s∈{A+,C+,G+,T+} and g(s)=0 for s∈{A-,C-,G-,T-}

  • Then G(i|y) = ∑s∈{A+,C+,G+,T+} P(πi = s|y) is precisely the posterior probability that base i lies in a CpG island.

  • The most probable labeling does not necessarily correspond to the labeling of the most probable path.

Parameter estimation

  • Two main issues:
  • 1. The design of the structure: order,

states, connections

  • 2. The assignment of the parameters:

transition probabilities and emission probabilities.

If the state paths are known

  • When all the paths

are known, estimation is simple: count the number of times a particular transition or emission is used in the training set

  • The max likelihood

estimations are

ps,t = f(st) / ∑t′ f(st′)          es(b) = fs(b) / ∑c fs(c)

where f(st) counts the s→t transitions and fs(b) counts the emissions of symbol b from state s in the training set.


If the paths are unknown

  • A closed-form equation to estimate the

parameters is not available anymore.

  • Some iterative procedure must be used
  • Several algorithms for the optimization of continuous functions can be used.

  • Baum-Welch algorithm is the “standard”

for HMMs.

Baum-Welch

  • Observations:
  • If we knew the paths, we could compute

transition and emission probabilities

  • If we knew the transition and emission

probabilities, we could compute the paths (e.g. the most probable paths)

Baum-Welch

  • Iterative process involving repeating

1&2:

  • 1. Compute the most probable paths

from the current values of ps,t and es(b)

  • 2. Estimate the new parameters using

ps,t = f(st) / ∑t′ f(st′)          es(b) = fs(b) / ∑c fs(c)


Baum-Welch

  • It can be proved that at each iteration,

the overall likelihood of the model is increased.

  • Hopefully it will converge to the global

maximum

  • Unfortunately, there are many local

maxima and the one you end up with depends on the initial assignments of the parameters.

Baum-Welch

  • Since we don’t know the path, the

actual counts f(st) and fs(b) are not available.

  • One method is to find the most probable

path first and use the counts from the most probable path.

  • BW actually computes the expected

counts of transitions and emissions based on current values of parameters.

Baum-Welch

  • The probability that we see a transition from state s

to state t at position i in sequence y is:


P(πi = s, πi+1 = t | y) = P(y, πi = s, πi+1 = t) / P(y)
                        = P(y[1,i], πi = s) ps,t et(y[i+1]) P(y[i+2,m] | πi+1 = t) / P(y)
                        = fs(i) ps,t et(y[i+1]) bt(i+1) / P(y)

Baum-Welch

  • Then the expected number of times that we see a transition from state s to state t can be obtained by summing over all positions and over all k sequences {x1, x2, …, xk}:

As,t = ∑j=1..k (1 / P(xj)) ∑i fsj(i) ps,t et(xj[i+1]) btj(i+1)

  • Where fsj is the forward variable for the jth

sequence and btj is the corresponding backward variable.

Baum-Welch

  • Similarly, the expected number of times that symbol b is emitted in state s over all k sequences {x1, x2, …, xk} is given below, where the inner sum is only over the positions of xj at which the emitted symbol is b.

Es(b) = ∑j=1..k (1 / P(xj)) ∑{i: xj[i]=b} fsj(i) bsj(i)


Baum-Welch

  • New parameters are calculated using the usual formulas, but this time based on the expected counts instead of the actual counts:

ps,t = As,t / ∑t′ As,t′          es(b) = Es(b) / ∑c Es(c)

  • Then we can compute the As,t and Es(b)

again based on the new parameters and iterate

Baum-Welch

  • Initialize the model with some random

parameters

  • Repeat:

– Set all As,t and Es(b) to zero
– For each sequence j = 1..k:

  • Compute ft(i) for sequence j using forward alg
  • Compute bt(i) for seq j using backward alg
  • Add the contribution of seq j to A and E.

– Compute the new parameters
– Compute the new log likelihood of the model

  • Termination: stop when the change in the log

likelihood is less than some predefined threshold or max iterations reached.
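A compact sketch of one Baum-Welch iteration, reusing the forward() and backward() sketches above; it omits scaling, pseudocounts, and re-estimation of the begin/end transitions, all of which a real implementation would need:

```python
def baum_welch_step(seqs, states, alphabet, p, e):
    """One EM iteration: accumulate expected counts A, E via forward/backward,
    then renormalize.  Begin/end transitions (state 0) are left unchanged."""
    A = {s: {t: 0.0 for t in states} for s in states}
    E = {s: {b: 0.0 for b in alphabet} for s in states}
    for x in seqs:
        m = len(x)
        f, Px = forward(x, states, p, e)
        bw = backward(x, states, p, e)
        for i in range(1, m):                          # expected transition counts
            for s in states:
                for t in states:
                    # f_s(i) * p_{s,t} * e_t(x[i+1]) * b_t(i+1) / P(x)
                    A[s][t] += f[i][s] * p[s][t] * e[t][x[i]] * bw[i + 1][t] / Px
        for i in range(1, m + 1):                      # expected emission counts
            for s in states:
                E[s][x[i - 1]] += f[i][s] * bw[i][s] / Px
    new_p = {s: {t: A[s][t] / sum(A[s].values()) for t in states} for s in states}
    new_e = {s: {b: E[s][b] / sum(E[s].values()) for b in alphabet} for s in states}
    return new_p, new_e
```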

HMM structure

  • So far we have assumed a fully

connected model

  • Although it is tempting to “let the model find out for itself” which transitions to use, this almost never works in practice.

  • For real-size problems the fully connected model leads to very bad models, even with plenty of training data.


HMM structure

  • The problem is not overfitting, but local

maxima

  • The less constrained the model is, the more severe the local maxima problem becomes.

  • Although automatic methods to

add/remove transitions and states have been proposed, the most successful HMMs are constructed by carefully deciding which transitions are allowed.

HMM structure

  • It is easy to “disable” a transition in the

BW algorithm

  • Just set it to zero and never change it

during the iterative process

HMM structure

  • The model topology should have an interpretation in terms of our knowledge of the problem.
  • There are some standard topologies to use in typical situations.


Duration modeling

  • A typical scenario is to model a stretch of DNA over which the distribution does not change for a certain length.
  • The simplest model is a single state with self-transition probability p and exit probability 1−p, which implies that P(m symbols) = (1−p)p^(m−1).
  • The geometric distribution is not always appropriate to model the length.
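A quick empirical check of the geometric-length claim, simulating a single state with self-transition probability p (the value of p is arbitrary):

```python
import random

p = 0.8

def sample_length():
    """Number of symbols emitted before leaving the self-looping state."""
    m = 1
    while random.random() < p:   # stay in the state with probability p
        m += 1
    return m

lengths = [sample_length() for _ in range(100_000)]
# Empirical P(m = 3) vs. the formula (1 - p) * p**(m - 1) = 0.128
print(sum(l == 3 for l in lengths) / len(lengths), (1 - p) * p**2)
```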

Duration modeling

  • More complex distributions can be

modeled by introducing several states with the same distribution

  • Example: a model that generates sequences of at least 5 symbols, with an exponentially decaying distribution over longer sequences.

Duration modeling

  • This example will generate strings of

lengths between 2 and 6 symbols.


Silent States

  • Silent states are states that do not emit

symbols (e.g., the state 0 in our previous examples)

  • Silent states can be introduced in

HMMs to reduce the number of transitions

Silent States

  • Suppose we want to model a sequence

in which arbitrary deletions are allowed

  • In that case we need a completely forward-connected HMM (O(m²) edges).

Silent States

  • If we use silent states, we use only

O(m) edges


Silent States

  • A price is paid for the reduction in the

number of parameters.

  • Suppose we want to assign high probability to 1→5 and 2→4; then there is no way to also have low probability on 1→4 and 2→5.

[Figure: states 1–5 connected through silent states.]

Silent States

  • As long as there are no cycles composed solely of silent states, Baum-Welch can be modified to also train the transition probabilities of silent states.

HMM Applications

  • Profile HMM
  • HMM for gene predictions
  • From A. Krogh “An introduction to

HMM…” 1998


From alignment to HMM

[Table: per-position emission probabilities for A, C, G, T derived from a multiple alignment; the individual values are not reliably recoverable from the slide.]

From alignment to HMM

Log-odds scores assume the null model is a Bernoulli model with equal base frequencies, and are therefore defined as logP(S) − m·log(0.25).



Profile HMM

[Figure: profile HMM with insertion states and deletion states.]

Example


Example

HMM for gene finding


[Figures: HMMs for a coding region, for unspliced genes, and for genes with splicing.]


GENSCAN

More applications

  • Alignment and searching
  • Gene finding
  • Promoter finding
  • Translation initiation sites finding
  • Splicing sites finding
  • Haplotype structure finding
  • ……

Bayesian networks

  • A generalization of HMM
  • Directed graph model: the probability of a node depends only on the probabilities of its parent nodes.
  • The full probability of a model can be calculated by decomposing it into many conditional probabilities.