


Hidden Markov Models

Steven J Zeil

Old Dominion Univ.

Fall 2010

Outline

1. Discrete Markov Processes
2. Hidden Markov Models
3. Inferences from HMMs
   - Evaluation
   - Decoding
4. Training an HMM
   - Model Selection

Introduction

Sequences of inputs that are not i.i.d.

- Sequences in time: phonemes in a word, words in a sentence, pen movements in handwriting
- Sequences in space: base pairs in DNA

Discrete Markov Processes

N states: S1, S2, . . . , SN

State at "time" t: qt = Si

First-order Markov: the probability of entering a state depends only on the most recent prior state:

  P(qt+1 = Sj | qt = Si, qt−1 = Sk, . . .) = P(qt+1 = Sj | qt = Si)

Transition probabilities are independent of time:

  aij ≡ P(qt+1 = Sj | qt = Si),  aij ≥ 0 ∧ Σj=1..N aij = 1

Initial probabilities:

  πi ≡ P(q1 = Si),  Σi=1..N πi = 1


Stochastic Automaton

[Figure: state-transition diagram]

Example: Balls & Urns

Three urns, each full of balls of one color. A "genie" moves randomly from urn to urn, selecting balls. S1: red, S2: blue, S3: green.

π = [0.5, 0.2, 0.3]^T

A = ⎡ 0.4  0.3  0.3 ⎤
    ⎢ 0.2  0.6  0.2 ⎥
    ⎣ 0.1  0.1  0.8 ⎦

Suppose we observe O = [red, red, green, green].

P(O | A, π) = P(S1) P(S1 | S1) P(S3 | S1) P(S3 | S3)
            = π1 a11 a13 a33
            = 0.5 × 0.4 × 0.3 × 0.8
            = 0.048
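This computation can be sketched in Python (a minimal illustration using NumPy, with the π and A given above):

```python
import numpy as np

# Transition matrix and initial probabilities from the urn example above.
A = np.array([[0.4, 0.3, 0.3],
              [0.2, 0.6, 0.2],
              [0.1, 0.1, 0.8]])
pi = np.array([0.5, 0.2, 0.3])

def sequence_prob(states, A, pi):
    """P(q1, ..., qT | A, pi) for a fully observed state sequence."""
    p = pi[states[0]]
    for prev, nxt in zip(states, states[1:]):
        p *= A[prev, nxt]
    return p

# O = [red, red, green, green] corresponds to S1, S1, S3, S3
# (0-based indices 0, 0, 2, 2).
print(sequence_prob([0, 0, 2, 2], A, pi))  # ≈ 0.048
```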


Hiding the Model

Now suppose that:

- The urns and the genie are hidden behind a screen.
- The urns start with different mixtures of all three colors.
- (If we're really unlucky) we don't even know how many urns there are.

Suppose we observe O = [red, red, green, green]. Can we say anything at all?


Hidden Markov Models

- States are not observable.
- Discrete observations [v1, v2, . . . , vM] are recorded; each is a probabilistic function of the state.
- Emission probabilities: bj(m) ≡ P(Ot = vm | qt = Sj)
- For any given sequence of observations, there may be multiple possible state sequences.
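A generative sketch of such a model in Python (NumPy): the transitions come from the urn example, but the emission matrix B here is hypothetical, chosen so that each urn holds a mixture of colors.

```python
import numpy as np

rng = np.random.default_rng(0)

A = np.array([[0.4, 0.3, 0.3],      # transitions, as in the urn example
              [0.2, 0.6, 0.2],
              [0.1, 0.1, 0.8]])
pi = np.array([0.5, 0.2, 0.3])
B = np.array([[0.6, 0.2, 0.2],      # b_j(m): hypothetical color mixtures;
              [0.2, 0.6, 0.2],      # rows = urns, columns = red/blue/green
              [0.2, 0.2, 0.6]])

def sample(T, A, B, pi, rng):
    """Generate a hidden state sequence q and observations O of length T."""
    q, O = [], []
    state = rng.choice(len(pi), p=pi)
    for _ in range(T):
        q.append(int(state))
        O.append(int(rng.choice(B.shape[1], p=B[state])))  # emit from b_state
        state = rng.choice(A.shape[1], p=A[state])          # next state
    return q, O

q, O = sample(5, A, B, pi, rng)  # an observer sees only O, never q
```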


HMM Unfolded in Time

[Figure: HMM unfolded in time]

Elements of an HMM

An HMM is λ = (A, B, π):

- A = [aij]: N × N state transition probability matrix (N is the number of hidden states)
- B = [bj(m)]: N × M emission probability matrix (M is the number of observation symbols)
- π = [πi]: N × 1 initial state probability vector

Making Inferences from an HMM

Evaluation: Given λ and O, calculate P(O | λ). Example: given several HMMs, each trained to recognize a different handwritten character, and given a sequence of pen strokes, which character is most likely denoted by that sequence?

Decoding: Given λ and O, what is the most probable sequence of states leading to that observation? Example: given an HMM trained on sentences and a sequence of words, some of which can belong to multiple syntactic classes (e.g., "green" can be an adjective, a noun, or a verb), determine the most likely syntactic class from the surrounding context.

Related problems: most likely starting or ending state.

Decoding Example

What’s the weather been? States can be “labeled” even though “hidden”.



Evaluation

Given λ and O, calculate P(O | λ).

If we knew the state sequence q, we could compute

  P(O | λ, q) = ∏t=1..T P(Ot | qt, λ) = ∏t=1..T bqt(Ot)

The probability of a state sequence is

  P(q | λ) = πq1 ∏t=1..T−1 aqt qt+1

so

  P(O, q | λ) = πq1 bq1(O1) ∏t=1..T−1 aqt qt+1 bqt+1(Ot+1)

and

  P(O | λ) = Σ over all possible q of P(O, q | λ)

which is totally impractical: there are N^T possible state sequences.

Forward Variable

αt(i) ≡ P(O1 . . . Ot, qt = Si | λ)

P(O | λ) = Σi=1..N αT(i)

Computed recursively:

- Initial: α1(i) = πi bi(O1)
- Recursion: αt+1(j) = [Σi=1..N αt(i) aij] bj(Ot+1)
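The forward recursion can be sketched as follows — a minimal NumPy implementation without scaling, so it will underflow on long sequences; the 2-state model parameters at the bottom are hypothetical:

```python
import numpy as np

def forward(O, A, B, pi):
    """alpha[t, i]: forward variables, 0-based t (alpha[0] is alpha_1)."""
    T, N = len(O), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, O[0]]                   # alpha_1(i) = pi_i b_i(O_1)
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, O[t]]
    return alpha

def evaluate(O, A, B, pi):
    """P(O | lambda) = sum_i alpha_T(i)."""
    return forward(O, A, B, pi)[-1].sum()

# Hypothetical 2-state, 2-symbol model.
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.6, 0.4])
p = evaluate([0, 1, 0], A, B, pi)
```

This costs O(N²T) instead of the O(N^T) brute-force sum over all state sequences.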

Decoding

Given λ and O, what is the most probable sequence of states leading to that observation?

Start by introducing a backward variable:

  βt(i) ≡ P(Ot+1 . . . OT | qt = Si, λ)

- Initial: βT(i) = 1
- Recursion: βt(i) = Σj=1..N aij bj(Ot+1) βt+1(j)

Viterbi’s Algorithm

A constrained optimizer for state-graph traversal. A dynamic programming algorithm:

- Assign a cost to each edge.
- Update path metrics by addition from shorter paths.
- Discard suboptimal cases.
- Starting from the final state, trace back the optimal path.


The HMM Trellis

[Figure: the HMM trellis]

Viterbi’s Algorithm for HMMs

δt(i) ≡ max over q1 q2 . . . qt−1 of P(q1 q2 . . . qt−1, qt = Si, O1 . . . Ot | λ)

- Initial: δ1(i) = πi bi(O1),  ψ1(i) = 0
- Iterate: δt(j) = maxi δt−1(i) aij bj(Ot),  ψt(j) = arg maxi δt−1(i) aij
- Optimum: p∗ = maxi δT(i),  q∗T = arg maxi δT(i)
- Backtrack: q∗t = ψt+1(q∗t+1),  t = T − 1, . . . , 1

Examples: numeric sequence (fixed problem); coin flipping (customizable); spelling correction as a decoding problem.
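The Viterbi recurrences can be sketched in NumPy as below. Probabilities are multiplied directly, so a log-space version would be needed for long sequences; the model at the bottom is hypothetical:

```python
import numpy as np

def viterbi(O, A, B, pi):
    """Return (p*, q*): the best path probability and state sequence."""
    T, N = len(O), len(pi)
    delta = np.zeros((T, N))              # delta_t(i)
    psi = np.zeros((T, N), dtype=int)     # psi_t(i)
    delta[0] = pi * B[:, O[0]]
    for t in range(1, T):
        cand = delta[t - 1][:, None] * A  # cand[i, j] = delta_{t-1}(i) a_ij
        psi[t] = cand.argmax(axis=0)
        delta[t] = cand.max(axis=0) * B[:, O[t]]
    path = [int(delta[-1].argmax())]      # q*_T
    for t in range(T - 1, 0, -1):         # backtrack through psi
        path.append(int(psi[t, path[-1]]))
    return float(delta[-1].max()), path[::-1]

# Hypothetical 2-state, 2-symbol model.
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.6, 0.4])
p_star, q_star = viterbi([0, 1, 0], A, B, pi)
```

Note the structural similarity to the forward algorithm: the sum over predecessor states is simply replaced by a max, plus bookkeeping for the backtrace.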

Training an HMM

Need to estimate the aij, πi, bj(m) that maximize the likelihood of observing a set of training instances X = {O^k : k = 1, . . . , K}.

Baum-Welch Algorithm - Overview

An E-M style algorithm. Repeatedly apply the steps:

E: Use the current λ = (A, B, π) to compute, for each training instance,
- the probability of being in Si at time t
- the probability of making the transition from Si to Sj at time t + 1

M: Update the values of λ = (A, B, π) to maximize the likelihood of matching those probabilities.


From the Training Data

Define indicator variables over the training set:

  z^t_i = 1 if qt = Si, 0 otherwise

Note that

  Σk z^{t,k}_i / K = P̂(qt = Si | λ)

  z^t_ij = 1 if qt = Si ∧ qt+1 = Sj, 0 otherwise

Note that

  Σk z^{t,k}_ij / K = P̂(qt = Si, qt+1 = Sj | λ)

From the HMM

  γt(i) ≡ P(qt = Si | O, λ) = αt(i) βt(i) / Σj=1..N αt(j) βt(j)

During Baum-Welch, we estimate  γ^k_t(i) ≈ E[z^t_i]

From the HMM

  ξt(i, j) ≡ P(qt = Si, qt+1 = Sj | O, λ)
           = αt(i) aij bj(Ot+1) βt+1(j) / Σk Σm αt(k) akm bm(Ot+1) βt+1(m)

During Baum-Welch, we estimate  ξ^k_t(i, j) ≈ E[z^t_ij]

Baum-Welch Algorithm - E

Repeatedly apply the steps:

E: For each O^k,

  γ^k_t(i) ← E[z^t_i]
  ξ^k_t(i, j) ← E[z^t_ij]

Then average over all observations:

  γt(i) ← Σk=1..K γ^k_t(i) / K
  ξt(i, j) ← Σk=1..K ξ^k_t(i, j) / K

M: Update the values of λ = (A, B, π) to maximize the likelihood of matching those probabilities.


Updating A

Expected number of transitions from Si to Sj: Σt ξt(i, j)

Expected number of times to be in Si: Σt γt(i)

Therefore the probability of the transition from Si to Sj is

  âij = Σt ξt(i, j) / Σt γt(i)

Updating B

Expected number of times we see vm when the system is in Sj: Σt=1..T γt(j) 1(Ot = vm)

Expected number of times to be in Sj: Σt γt(j)

Therefore the probability of emitting vm from Sj is

  b̂j(m) = Σk Σt γ^k_t(j) 1(O^k_t = vm) / Σk Σt γ^k_t(j)

Updating π

Probability of starting in Si:

  π̂i = Σk γ^k_1(i) / K

Baum-Welch Algorithm - EM

Repeatedly apply the steps:

E: For each O^k,

  γ^k_t(i) ← E[z^t_i]
  ξ^k_t(i, j) ← E[z^t_ij]
  γt(i) ← Σk=1..K γ^k_t(i) / K
  ξt(i, j) ← Σk=1..K ξ^k_t(i, j) / K

M:

  âij ← Σt ξt(i, j) / Σt γt(i)
  b̂j(m) ← Σk Σt γ^k_t(j) 1(O^k_t = vm) / Σk Σt γ^k_t(j)
  π̂i ← Σk γ^k_1(i) / K

(Practical implementations often require careful scaling.)
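One iteration of these E and M steps can be sketched in NumPy as below. This is a minimal version without the scaling just mentioned, so it is suitable only for short sequences; the 2-state model and training data at the bottom are hypothetical:

```python
import numpy as np

def forward(O, A, B, pi):
    alpha = np.zeros((len(O), len(pi)))
    alpha[0] = pi * B[:, O[0]]
    for t in range(1, len(O)):
        alpha[t] = (alpha[t - 1] @ A) * B[:, O[t]]
    return alpha

def backward(O, A, B):
    beta = np.ones((len(O), A.shape[0]))
    for t in range(len(O) - 2, -1, -1):
        beta[t] = A @ (B[:, O[t + 1]] * beta[t + 1])
    return beta

def baum_welch_step(Os, A, B, pi):
    """One E-M iteration over the training sequences Os (no scaling)."""
    N, M = B.shape
    gamma_trans = np.zeros(N)      # sum_k sum_{t<T} gamma_t^k(i)
    xi_sum = np.zeros((N, N))      # sum_k sum_{t<T} xi_t^k(i, j)
    emit = np.zeros((N, M))        # sum_k sum_t gamma_t^k(j) 1(O_t^k = v_m)
    pi_new = np.zeros(N)
    for O in Os:
        alpha, beta = forward(O, A, B, pi), backward(O, A, B)
        gamma = alpha * beta
        gamma /= gamma.sum(axis=1, keepdims=True)           # gamma_t(i)
        pi_new += gamma[0]
        for t in range(len(O) - 1):
            xi = alpha[t][:, None] * A * (B[:, O[t + 1]] * beta[t + 1])
            xi_sum += xi / xi.sum()                         # xi_t(i, j)
            gamma_trans += gamma[t]
        for t, o in enumerate(O):
            emit[:, o] += gamma[t]
    A_new = xi_sum / gamma_trans[:, None]                   # a_ij update
    B_new = emit / emit.sum(axis=1, keepdims=True)          # b_j(m) update
    return A_new, B_new, pi_new / len(Os)                   # pi_i update

# Hypothetical 2-state, 2-symbol model and training data.
A0 = np.array([[0.6, 0.4], [0.3, 0.7]])
B0 = np.array([[0.8, 0.2], [0.3, 0.7]])
pi0 = np.array([0.5, 0.5])
Os = [[0, 0, 1, 1], [0, 1, 1, 1], [0, 0, 0, 1]]
A1, B1, pi1 = baum_welch_step(Os, A0, B0, pi0)
```

As with any E-M procedure, each such iteration cannot decrease the likelihood of the training data.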



Model Selection

Classification: train a separate HMM for each class and apply Bayes' rule:

  P(λi | O) = P(O | λi) P(λi) / Σj P(O | λj) P(λj)

For many problems, we encode prior knowledge as known values for transitions, emissions, and/or starting points. For others, we may know something about the "shape" of the HMM.

Left-to-Right HMMs

A = ⎡ a11 a12 a13  0  ⎤
    ⎢  0  a22 a23 a24 ⎥
    ⎢  0   0  a33 a34 ⎥
    ⎣  0   0   0  a44 ⎦

Useful for modeling signals whose properties change over time.

- Sometimes large jumps are prohibited, e.g., no jumps of more than k states: a band-diagonal matrix.
- No change is required to training: initially zero transitions never become positive.
- Example: face recognition.
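One way to impose such a shape is to initialize A with zeros outside the allowed band; the helper below is a hypothetical sketch in NumPy:

```python
import numpy as np

def left_to_right_A(N, k, rng):
    """Random row-stochastic N x N matrix allowing jumps of at most k states."""
    mask = np.triu(np.ones((N, N))) - np.triu(np.ones((N, N)), k + 1)
    A = mask * rng.random((N, N))    # zero outside the band i <= j <= i + k
    A[-1, -1] = 1.0                  # final state can only loop on itself
    return A / A.sum(axis=1, keepdims=True)

A = left_to_right_A(4, 2, np.random.default_rng(1))
# Entries below the diagonal or beyond the band start at zero and stay
# zero during Baum-Welch: a_ij = 0 forces every xi_t(i, j) = 0.
```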

Layered HMMs

- Lower layers recognize sequences (e.g., phonemes to words).
- The best sequence is fed to a higher layer (e.g., words to sentences).


Other Variants

- Edge emitters: in some variants, outputs are associated with transition edges, not with state nodes. Some edges may emit an empty or null signal.
- Tied states: some model states may be known to be isomorphic; their parameters are forced to be equal.
- Hierarchical HMM (HHMM): each state is itself an HMM.