February 25, 2020

Markov Models

Yanbing Xue

Outline

▪ Introduction ▪ Markov chains ▪ Dynamic belief networks ▪ Hidden Markov models (HMMs)


Outline

▪ Introduction

▪ Time series ▪ Probabilistic graphical models

▪ Markov chains ▪ Dynamic belief networks ▪ Hidden Markov models (HMMs)

What is time series?

▪ A time series is a sequence of data instances listed in time order.

▪ In other words, data instances are totally ordered.
▪ Example: weather forecasting
▪ Notice: we care about the ordering rather than the exact time.


Different kinds of time series

▪ Two properties:
  ▪ Time space: discrete or continuous?
  ▪ Task: classification or regression?

▪ Examples: weather (discrete & classification), min/max temperature (discrete & regression), temperature or probability of rain (continuous & regression)

Probabilistic graphical models (PGMs)

▪ A PGM uses a graph-based representation of the conditional distributions over variables.
  ▪ Directed acyclic graphs (DAGs)
  ▪ Undirected graphs

▪ Markov models are a sub-family of PGMs on DAGs.


Outline

▪ Introduction ▪ Markov chains

▪ Intuition ▪ Inference ▪ Learning

▪ Dynamic belief networks ▪ Hidden Markov models (HMMs)

Modeling time series

▪ Assume a sequence of four weather observations: z_1, z_2, z_3, z_4
▪ Possible dependencies: z_4 depends on the previous weather observation(s)

[Diagram: two candidate dependency structures over z_1 … z_4]


Modeling time series

In general, observations z_1, z_2, z_3, z_4 can be:

[Diagram: a fully dependent structure vs. independent nodes over z_1 … z_4]

▪ Fully dependent: e.g., z_4 depends on all previous observations
▪ Independent: e.g., z_4 does not depend on any previous observation
▪ A lot of middle ground lies between the two extremes

Modeling time series

▪ Are there intuitive and convenient dependency models?

[Diagram: fully dependent vs. independent structures over z_1 … z_4]

▪ Fully dependent: think of the last observation, Q(z_4 | z_1 z_2 z_3). What if we have U observations? The number of parameters is exponential in the number of observations.
▪ Independent: totally drops the time information.


Markov chains

▪ Markov assumption: future predictions are independent of all but the most recent observations

[Diagram: independent model, fully dependent model, and a first-order Markov chain z_1 → z_2 → z_3 → z_4]

Markov chains

▪ Markov assumption: future predictions are independent of all but the most recent observations

[Diagram: independent model, fully dependent model, and a second-order Markov chain in which each z_u depends on z_{u−1} and z_{u−2}]


A formal representation

▪ Using conditional probabilities to model z_1, z_2, z_3, z_4
▪ Fully dependent:
  ▪ Q(z_1 z_2 z_3 z_4) = Q(z_1) Q(z_2 | z_1) Q(z_3 | z_1 z_2) Q(z_4 | z_1 z_2 z_3)
▪ Fully independent:
  ▪ Q(z_1 z_2 z_3 z_4) = Q(z_1) Q(z_2) Q(z_3) Q(z_4)
▪ First-order Markov chain (most recent 1 observation):
  ▪ Q(z_1 z_2 z_3 z_4) = Q(z_1) Q(z_2 | z_1) Q(z_3 | z_2) Q(z_4 | z_3)
▪ Second-order Markov chain (most recent 2 observations):
  ▪ Q(z_1 z_2 z_3 z_4) = Q(z_1) Q(z_2 | z_1) Q(z_3 | z_1 z_2) Q(z_4 | z_2 z_3)

A more formal representation

▪ Generalizes to U observations
▪ First-order Markov chain (most recent 1 observation):
  ▪ Q(z_1 z_2 … z_U) = Q(z_1) ∏_{u=2}^{U} Q(z_u | z_{u−1})
▪ Second-order Markov chain (most recent 2 observations):
  ▪ Q(z_1 z_2 … z_U) = Q(z_1) Q(z_2 | z_1) ∏_{u=3}^{U} Q(z_u | z_{u−1} z_{u−2})
▪ k-th order Markov chain (most recent k observations):
  ▪ Q(z_1 z_2 … z_U) = Q(z_1) Q(z_2 | z_1) … Q(z_k | z_1 … z_{k−1}) ∏_{u=k+1}^{U} Q(z_u | z_{u−k} … z_{u−1})


Stationarity

▪ Do all time steps share an identical conditional distribution?
  ▪ Q(z_u = k | z_{u−1} = j) = Q(z_{u−1} = k | z_{u−2} = j) for all u, j, k
  ▪ Typically assumed to hold
▪ A transition table B represents the conditional distribution:
  ▪ B_jk = Q(z_u = k | z_{u−1} = j) for all u = 2, …, U
  ▪ e: the number of possible values of z_u
▪ A vector ρ represents the initial distribution:
  ▪ ρ_j = Q(z_1 = j) for all j = 1, 2, …, e

B = [ B_11 … B_1e ; … ; B_e1 … B_ee ]   (an e × e matrix)

Inference on a Markov chain

▪ Probability of a given sequence:
  ▪ Q(z_1 = j_1, …, z_U = j_U) = ρ_{j_1} ∏_{u=2}^{U} B_{j_{u−1} j_u}
▪ Probability of a given state:
  ▪ Forward iteration: Q(z_u = j_u) = Σ_{j_{u−1}} Q(z_{u−1} = j_{u−1}) B_{j_{u−1} j_u}
  ▪ Can be calculated iteratively
▪ Both inferences are efficient
▪ Probability of a suffix: Q(z_k = j_k, …, z_U = j_U) = Q(z_k = j_k) ∏_{u=k+1}^{U} B_{j_{u−1} j_u}


Learning a Markov chain

▪ The MLE of the conditional probabilities can be estimated directly:
  ▪ B_jk^MLE = Q(z_u = k | z_{u−1} = j) = Q(z_u = k, z_{u−1} = j) / Q(z_{u−1} = j) = O_jk / Σ_k O_jk
  ▪ O_jk: # of observed transitions with z_{u−1} = j and z_u = k
▪ Bayesian parameter estimation:
  ▪ Prior: Dir(α_1, α_2, …)
  ▪ Posterior: Dir(α_1 + O_j1, α_2 + O_j2, …)
  ▪ B_jk^MAP = (O_jk + α_k − 1) / Σ_k (O_jk + α_k − 1)
  ▪ B_jk^EV = (O_jk + α_k) / Σ_k (O_jk + α_k)

A toy example – weather forecast

▪ State 1: rainy, state 2: cloudy, state 3: sunny
▪ Given "sun-sun-sun-rain-rain-sun-cloud-sun", find B_33
▪ B_33^MLE = O_33 / Σ_k O_3k = 2 / (1 + 1 + 2) = 0.5
▪ Prior: Dir(2, 2, 2); posterior: Dir(2 + 1, 2 + 1, 2 + 2)
  ▪ B_33^MAP = (O_33 + α_3 − 1) / Σ_k (O_3k + α_k − 1) = 3/7
  ▪ B_33^EV = (O_33 + α_3) / Σ_k (O_3k + α_k) = 4/10
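These counts are easy to check in code. A quick sketch in plain Python (states encoded 1 = rainy, 2 = cloudy, 3 = sunny, as on the slide):

```python
from fractions import Fraction

# Observation sequence from the slide: 1 = rainy, 2 = cloudy, 3 = sunny
seq = [3, 3, 3, 1, 1, 3, 2, 3]   # sun-sun-sun-rain-rain-sun-cloud-sun

# Count the transitions out of state 3 (sunny): O_3k
counts = {1: 0, 2: 0, 3: 0}
for prev, cur in zip(seq, seq[1:]):
    if prev == 3:
        counts[cur] += 1

alpha = {1: 2, 2: 2, 3: 2}        # Dirichlet prior Dir(2, 2, 2)
total = sum(counts.values())

B33_mle = Fraction(counts[3], total)
B33_map = Fraction(counts[3] + alpha[3] - 1,
                   sum(counts[k] + alpha[k] - 1 for k in counts))
B33_ev = Fraction(counts[3] + alpha[3],
                  sum(counts[k] + alpha[k] for k in counts))

print(B33_mle, B33_map, B33_ev)   # 1/2 3/7 2/5
```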


A toy example – weather forecast

▪ Given B = [0.4 0.3 0.3 ; 0.2 0.6 0.2 ; 0.1 0.1 0.8] (rows and columns ordered rainy, cloudy, sunny) and that day 1 is sunny
▪ Find the probability that days 2-8 will be "sun-sun-rain-rain-sun-cloud-sun"
▪ Q(z_1 z_2 … z_8) = Q(z_1 = sun) Q(z_2 = sun | z_1 = sun) Q(z_3 = sun | z_2 = sun) Q(z_4 = rain | z_3 = sun) Q(z_5 = rain | z_4 = rain) Q(z_6 = sun | z_5 = rain) Q(z_7 = cloud | z_6 = sun) Q(z_8 = sun | z_7 = cloud)
  = 1 · B_33 · B_33 · B_31 · B_11 · B_13 · B_32 · B_23
  = 1 · 0.8 · 0.8 · 0.1 · 0.4 · 0.3 · 0.1 · 0.2 = 1.536 × 10^−4

A toy example – weather forecast

▪ Given B = [0.4 0.3 0.3 ; 0.2 0.6 0.2 ; 0.1 0.1 0.8] and that day 1 is sunny
▪ Find the probability that day 3 will be sunny
▪ Q(z_2 = sun) = Σ_j Q(z_1 = j) Q(z_2 = sun | z_1 = j) = 0 · 0.3 + 0 · 0.2 + 1 · 0.8 = 0.8
▪ Similarly, Q(z_2 = rain) = Σ_j Q(z_1 = j) Q(z_2 = rain | z_1 = j) = 0 · 0.4 + 0 · 0.2 + 1 · 0.1 = 0.1
▪ Q(z_2 = cloud) = Σ_j Q(z_1 = j) Q(z_2 = cloud | z_1 = j) = 0 · 0.3 + 0 · 0.6 + 1 · 0.1 = 0.1
▪ Q(z_3 = sun) = Σ_j Q(z_2 = j) Q(z_3 = sun | z_2 = j) = 0.1 · 0.3 + 0.1 · 0.2 + 0.8 · 0.8 = 0.69
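Both results can be verified numerically. A small sketch in plain Python (states indexed 0 = rainy, 1 = cloudy, 2 = sunny to match the rows of B):

```python
# Transition matrix from the slide; row = today's state, column = tomorrow's.
# States: 0 = rainy, 1 = cloudy, 2 = sunny
B = [[0.4, 0.3, 0.3],
     [0.2, 0.6, 0.2],
     [0.1, 0.1, 0.8]]

# Probability of the sequence sun-sun-sun-rain-rain-sun-cloud-sun (days 1-8)
days = [2, 2, 2, 0, 0, 2, 1, 2]
p_seq = 1.0                       # day 1 is sunny with probability 1
for prev, cur in zip(days, days[1:]):
    p_seq *= B[prev][cur]
print(p_seq)                      # ≈ 1.536e-4

# Forward iteration for the marginal: q_next[j] = sum_k q[k] * B[k][j]
q = [0.0, 0.0, 1.0]               # day 1 distribution (sunny for sure)
for _ in range(2):                # advance to day 3
    q = [sum(q[k] * B[k][j] for k in range(3)) for j in range(3)]
print(q[2])                       # ≈ 0.69
```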


Limitation of Markov chain

▪ Each state is represented by one variable ▪ What if each state consists of multiple variables?

Outline

▪ Introduction ▪ Markov chains ▪ Dynamic belief networks

▪ Intuition ▪ Inference ▪ Learning

▪ Hidden Markov models (HMMs)


Modeling multiple variables

▪ What if each state consists of multiple variables?
  ▪ E.g., monitoring a robot: location L, GPS reading G, speed S
▪ Model all variables in each state jointly?
▪ Is this a good solution?

[Diagram: joint state (L_t, G_t, S_t) depending on (L_{t−1}, G_{t−1}, S_{t−1})]

Modeling multiple variables

▪ Each variable only depends on some of the previous or current observations
▪ Factorization

[Diagram: factorized network over S_{t−1}, L_{t−1}, G_{t−1} and S_t, L_t, G_t]


Dynamic belief networks

▪ Also known as dynamic Bayesian networks

[Diagram: S_{u−1}, L_{u−1}, G_{u−1} → S_u, L_u, G_u]

▪ Y_u = {S_u, L_u}: the transition states, dependent only on previous observations
  ▪ Q(Y_u | Y_{u−1}) = {Q(S_u | S_{u−1}), Q(L_u | S_{u−1} L_{u−1})}: the transition model
▪ Z_u = {G_u}: the emission states / evidence, dependent only on current observations
  ▪ Q(Z_u | Y_u) = {Q(G_u | L_u)}: the emission model / sensor model

Inference on a dynamic BN

▪ Filtering: given z_{1…u}, find Q(Y_u | z_{1…u})
▪ Exact inference, using Bayes' rule and the structure of the dynamic BN:
  ▪ Q(Y_u | z_{1…u}) ∝ Q(Y_u, z_u | z_{1…u−1})
    = Q(z_u | Y_u, z_{1…u−1}) Q(Y_u | z_{1…u−1})
    = Q(z_u | Y_u, z_{1…u−1}) Σ_{y_{u−1}} Q(Y_u | y_{u−1}, z_{1…u−1}) Q(y_{u−1} | z_{1…u−1})
  ▪ By the network structure, Q(z_u | Y_u, z_{1…u−1}) = Q(z_u | Y_u) (the emission model) and Q(Y_u | y_{u−1}, z_{1…u−1}) = Q(Y_u | y_{u−1}) (the transition model)
  ▪ So the filter can be inferred iteratively


Approximate inference on a dynamic BN

▪ Is exact inference practical?
  ▪ Q(Y_u | z_{1…u}) ∝ Q(z_u | Y_u) Σ_{y_{u−1}} Q(Y_u | y_{u−1}) Q(y_{u−1} | z_{1…u−1})
  ▪ Needs to enumerate y_{u−1}, which is exponential in the # of transition variables

▪ Use approximate inference instead: particle filtering

Particle filtering – a toy example

▪ Y_u = {S_u, L_u}, Z_u = {G_u}
▪ S_u and L_u each have only 2 outcomes:
  ▪ S_u ∈ {fast, slow}, L_u ∈ {left, right}
▪ Q(Y_1) = Q(S_1 L_1): a 2 × 2 table
▪ O = 10: # of samples (particles) in each iteration
▪ The u-th iteration corresponds to time step u

[Diagram: the DBN S_{u−1}, L_{u−1}, G_{u−1} → S_u, L_u, G_u]


Particle filtering – a toy example

▪ Step 1: draw samples b_1 … b_O from the prior Q(Y_{u−1} | z_{1…u−1})
  ▪ When u = 1, sample from Q(Y_1)
▪ Step 2: update b_j ← a sample from Q(Y_u | Y_{u−1} = b_j), for all j
  ▪ Each b_j randomly transitions based on the transition model

[Diagram: particles (speed, location) before and after the transition step]

Particle filtering – a toy example

▪ Step 3: given z_u and b_j, define the weight x_j = Q(z_u | Y_u = b_j)
▪ In step 1 of the next iteration, we resample from b_1 … b_O, where the weight of b_j is x_j
  ▪ This should be the same as sampling from Q(Y_u | z_{1…u})
  ▪ Is this true?

[Diagram: weighted particles, e.g. weights 0.3, 0.6, 0.5, 0.1]
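The three steps can be sketched as a tiny particle filter. All numbers below (the transition table, emission table, and observation sequence) are made up purely to exercise the loop; only the transition / weight / resample structure comes from the slides:

```python
import random

random.seed(0)

# Hypothetical toy model: one hidden variable with 2 outcomes (0 = slow, 1 = fast)
TRANS = {0: [0.7, 0.3], 1: [0.4, 0.6]}   # Q(Y_u | Y_{u-1}), rows indexed by Y_{u-1}
EMIT = {0: [0.9, 0.1], 1: [0.2, 0.8]}    # Q(Z_u | Y_u), Z also binary
N = 1000                                  # particles per iteration

particles = [random.choice([0, 1]) for _ in range(N)]  # samples from a uniform Q(Y_1)

def step(particles, z):
    # Step 2: each particle transitions according to the transition model
    moved = [random.choices([0, 1], weights=TRANS[b])[0] for b in particles]
    # Step 3: weight each particle by the emission probability of the observation z
    weights = [EMIT[b][z] for b in moved]
    # Step 1 (of the next iteration): resample proportionally to the weights
    return random.choices(moved, weights=weights, k=N)

for z in [1, 1, 0]:                       # a made-up observation sequence
    particles = step(particles, z)

# The fraction of particles in state 1 approximates Q(Y_u = 1 | z_{1..u})
print(sum(particles) / N)
```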


Correctness of particle filtering

▪ Can be proved by induction
▪ Let O(y_{u−1} | z_{1…u−1}) denote the population (particle count) of y_{u−1} given z_{1…u−1}
▪ After step 1: O(y_{u−1} | z_{1…u−1}) / O = Q(y_{u−1} | z_{1…u−1})
▪ After step 2, we have a population of y_u:
  ▪ O(y_u | z_{1…u−1}) = Σ_{y_{u−1}} Q(y_u | y_{u−1}) O(y_{u−1} | z_{1…u−1})

Correctness of particle filtering

▪ After step 3, the population of y_u is weighted by Q(z_u | y_u):
  ▪ Q(z_u | y_u) O(y_u | z_{1…u−1})
    = Q(z_u | y_u) Σ_{y_{u−1}} Q(y_u | y_{u−1}) O(y_{u−1} | z_{1…u−1})
    = O · Q(z_u | y_u) Σ_{y_{u−1}} Q(y_u | y_{u−1}) Q(y_{u−1} | z_{1…u−1})
    = O · Q(z_u | y_u) Q(y_u | z_{1…u−1})
    = O · Q(z_u, y_u | z_{1…u−1})
    ∝ Q(y_u | z_{1…u})


Learning a dynamic BN

▪ Given the structure of the dynamic BN…

▪ Learning the transition and emission models is the same as for a Markov chain

▪ How to learn the structure?
  ▪ For Q(Y_u | Y_{u−1}), take each Y_u^(j) ∈ Y_u as the label and Y_{u−1} as the features
  ▪ For Q(Z_u | Y_u), take each Z_u^(j) ∈ Z_u as the label and Y_u as the features
  ▪ This converts structure learning into feature reduction

Limitation

▪ Current assumption: all states are observable, which is unrealistic
▪ The actual location L of the robot may never be observed
▪ What if some variables are hidden?

[Diagram: the DBN S_{u−1}, L_{u−1}, G_{u−1} → S_u, L_u, G_u]


Outline

▪ Introduction ▪ Markov chains ▪ Dynamic belief networks ▪ Hidden Markov models (HMMs)

▪ Intuition ▪ Inference ▪ Learning ▪ Applications & APIs

Hidden variables

▪ Some variables in the dynamic BN can be hidden
▪ Transition variables can be hidden
▪ HMM: think of only one transition variable and one emission variable

[Diagram: the DBN S_{u−1}, L_{u−1}, G_{u−1} → S_u, L_u, G_u]


Hidden Markov models (HMMs)

▪ Overview

  ▪ A sequence of length U
  ▪ Evidence / emission variables {z_u}: categorical or continuous
  ▪ Hidden variables {y_u}: categorical
▪ Q(z_1 … z_U, y_1 … y_U) = Q(y_1) ∏_{u=2}^{U} Q(y_u | y_{u−1}) ∏_{u=1}^{U} Q(z_u | y_u)

[Diagram: hidden chain y_{u−1} → y_u with emissions z_{u−1}, z_u]

Transition table

▪ Let e be the number of possible values of y_u
▪ The transition table B is an e × e matrix
▪ B_jk = Q(y_u = k | y_{u−1} = j)
▪ Clearly, Σ_{k=1}^{e} B_jk = 1 for all j

B = [ B_11 … B_1e ; … ; B_e1 … B_ee ]


Emission function

▪ When z_u is categorical, let L be the number of possible values of z_u
▪ The emission function C can be represented as an e × L matrix
▪ C_jk = Q(z_u = k | y_u = j)
▪ Clearly, Σ_{k=1}^{L} C_jk = 1 for all j

C = [ C_11 … C_1L ; … ; C_e1 … C_eL ]

Emission function

▪ When z_u is continuous, q(z_u | y_u) is a PDF
▪ The emission function C is then the set of parameters of e different PDFs
▪ When q(z_u | y_u) is Gaussian:
  ▪ C = {μ_1 … μ_e, Σ_1 … Σ_e}


Inference on an HMM

▪ Given the HMM, what can we do?
  ▪ Given an observation sequence, find its probability
  ▪ Filtering: find the distribution of the last hidden variable
  ▪ Smoothing: find the distribution of a hidden variable in the middle
  ▪ Given an observation sequence, find the most likely (ML) hidden variable sequence

Probability of an observed sequence

▪ Q(z_1 … z_U) = Σ_{j=1}^{e} Q(z_1 … z_U, y_U = j)
▪ Let's expand one step more:
  ▪ Q(z_1 … z_U, y_U = j) = Σ_{k=1}^{e} Q(z_1 … z_U, y_U = j, y_{U−1} = k)
    = Σ_{k=1}^{e} Q(z_1 … z_{U−1}, y_{U−1} = k) Q(y_U = j | y_{U−1} = k) Q(z_U | y_U = j)
▪ Can be calculated iteratively


Forward algorithm

▪ Let β_u(j) = Q(z_1 … z_u, y_u = j)
▪ Iteration: β_u(j) = Σ_{k=1}^{e} β_{u−1}(k) B_kj Q(z_u | y_u = j)
▪ Base: β_1(j) = Q(z_1, y_1 = j) = ρ_j Q(z_1 | y_1 = j)
▪ Output: Σ_{j=1}^{e} β_U(j)
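The forward recursion translates directly into code. A minimal sketch in plain Python; the 2-state model (the values of ρ, B, and C) is made up for illustration:

```python
# Hypothetical 2-state HMM: hidden states {0, 1}, observations {0, 1}
rho = [0.6, 0.4]                     # initial distribution rho_j
B = [[0.7, 0.3], [0.4, 0.6]]         # B[j][k] = Q(y_u = k | y_{u-1} = j)
C = [[0.9, 0.1], [0.2, 0.8]]         # C[j][z] = Q(z_u = z | y_u = j)

def forward(obs):
    """Return the list of forward vectors beta_u(j) = Q(z_1..z_u, y_u = j)."""
    e = len(rho)
    beta = [[rho[j] * C[j][obs[0]] for j in range(e)]]   # base case
    for z in obs[1:]:
        prev = beta[-1]
        beta.append([sum(prev[k] * B[k][j] for k in range(e)) * C[j][z]
                     for j in range(e)])
    return beta

obs = [0, 1, 1]
beta = forward(obs)
print(sum(beta[-1]))                 # Q(z_1 z_2 z_3), summed over the last hidden state
```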

Forward algorithm

▪ β_u(j) = Σ_{k=1}^{e} β_{u−1}(k) B_kj Q(z_u | y_u = j)
▪ β_{u−1}(k) = Q(z_1 … z_{u−1}, y_{u−1} = k)
  ▪ ⇓ bringing in y_u (multiply by B_kj)
▪ β_{u−1}(k) B_kj = Q(z_1 … z_{u−1}, y_{u−1} = k, y_u = j)
  ▪ ⇓ bringing in z_u (multiply by the emission probability)
▪ β_{u−1}(k) B_kj Q(z_u | y_u = j) = Q(z_1 … z_u, y_{u−1} = k, y_u = j)
  ▪ ⇓ summing y_{u−1} out
▪ β_u(j) = Σ_{k=1}^{e} β_{u−1}(k) B_kj Q(z_u | y_u = j) = Q(z_1 … z_u, y_u = j)


Backward algorithm

▪ Iterates in reverse
▪ Let γ_u(j) = Q(z_{u+1} … z_U | y_u = j)
▪ Iteration: γ_u(j) = Σ_{k=1}^{e} γ_{u+1}(k) B_jk Q(z_{u+1} | y_{u+1} = k)
▪ Base: γ_U(j) = 1
▪ Output: Σ_{j=1}^{e} ρ_j Q(z_1 | y_1 = j) γ_1(j)

Filtering and smoothing

▪ Filtering: find Q(y_U = j | z_1 … z_U)
  ▪ Q(y_U = j | z_1 … z_U) ∝ Q(z_1 … z_U, y_U = j) = β_U(j)
  ▪ Directly applies the forward algorithm
▪ Smoothing: find Q(y_u = j | z_1 … z_U) where u < U
  ▪ Q(y_u = j | z_1 … z_U) ∝ Q(z_1 … z_U, y_u = j) = Q(z_1 … z_u, y_u = j) Q(z_{u+1} … z_U | y_u = j) = β_u(j) γ_u(j)
  ▪ Uses both the forward and backward algorithms
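Smoothing combines the two recursions. A self-contained sketch in plain Python, with a made-up 2-state model (all parameter values are purely illustrative):

```python
# Hypothetical 2-state HMM; all numbers are illustrative, not from the slides
rho = [0.6, 0.4]
B = [[0.7, 0.3], [0.4, 0.6]]          # transition table B[j][k]
C = [[0.9, 0.1], [0.2, 0.8]]          # emission table C[j][z]
obs = [0, 1, 1]
U, e = len(obs), len(rho)

# Forward: beta[u][j] = Q(z_1..z_u, y_u = j)
beta = [[rho[j] * C[j][obs[0]] for j in range(e)]]
for u in range(1, U):
    beta.append([sum(beta[-1][k] * B[k][j] for k in range(e)) * C[j][obs[u]]
                 for j in range(e)])

# Backward: gamma[u][j] = Q(z_{u+1}..z_U | y_u = j), base gamma[U-1][j] = 1
gamma = [[1.0] * e for _ in range(U)]
for u in range(U - 2, -1, -1):
    gamma[u] = [sum(B[j][k] * C[k][obs[u + 1]] * gamma[u + 1][k] for k in range(e))
                for j in range(e)]

# Smoothing at a middle step: Q(y_u = j | z_1..z_U) ∝ beta[u][j] * gamma[u][j]
u = 1
post = [beta[u][j] * gamma[u][j] for j in range(e)]
post = [p / sum(post) for p in post]
print(post)                            # normalized posterior, sums to 1
```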


Viterbi algorithm

▪ Find argmax_{y_1 … y_U} Q(y_1 … y_U | z_1 … z_U)
▪ argmax_{y_1 … y_U} Q(y_1 … y_U | z_1 … z_U) = argmax_{y_1 … y_U} Q(z_1 … z_U, y_1 … y_U)
▪ Let ε_u(j) = max_{y_1 … y_{u−1}} Q(z_1 … z_u, y_1 … y_{u−1}, y_u = j)
  ▪ Represents the highest probability of a hidden-variable sequence y_1 … y_u ending with y_u = j
▪ Iteration: ε_u(j) = Q(z_u | y_u = j) max_k ε_{u−1}(k) B_kj
  ▪ B_kj and Q(z_u | y_u = j) are independent of z_1 … z_{u−1} and y_1 … y_{u−2}
▪ Base: ε_1(j) = Q(z_1, y_1 = j) = ρ_j Q(z_1 | y_1 = j)
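The Viterbi recursion, plus backpointers to recover the argmax sequence, can be sketched as follows (plain Python; the 2-state model values are made up for illustration):

```python
# Hypothetical 2-state HMM; all numbers are illustrative, not from the slides
rho = [0.6, 0.4]
B = [[0.7, 0.3], [0.4, 0.6]]          # B[k][j] = Q(y_u = j | y_{u-1} = k)
C = [[0.9, 0.1], [0.2, 0.8]]          # C[j][z] = Q(z_u = z | y_u = j)
obs = [0, 1, 1]
e = len(rho)

# Base: eps_1(j) = rho_j * Q(z_1 | y_1 = j)
eps = [rho[j] * C[j][obs[0]] for j in range(e)]
back = []                              # backpointers for path recovery

# Iteration: eps_u(j) = Q(z_u | y_u = j) * max_k eps_{u-1}(k) * B[k][j]
for z in obs[1:]:
    best_k = [max(range(e), key=lambda k: eps[k] * B[k][j]) for j in range(e)]
    eps = [C[j][z] * eps[best_k[j]] * B[best_k[j]][j] for j in range(e)]
    back.append(best_k)

# Recover the most likely hidden sequence by following the backpointers
path = [max(range(e), key=lambda j: eps[j])]
for bk in reversed(back):
    path.append(bk[path[-1]])
path.reverse()
print(path)                            # most likely hidden sequence, here [0, 1, 1]
```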

Correctness of Viterbi

▪ Can be proved by induction
▪ ε_{u−1}(k) = max_{y_1 … y_{u−2}} Q(z_1 … z_{u−1}, y_1 … y_{u−2}, y_{u−1} = k)
▪ ε_u(j) = Q(z_u | y_u = j) max_k ε_{u−1}(k) B_kj
  = Q(z_u | y_u = j) max_k max_{y_1 … y_{u−2}} Q(z_1 … z_{u−1}, y_1 … y_{u−2}, y_{u−1} = k) Q(y_u = j | y_{u−1} = k)
  = Q(z_u | y_u = j) max_{y_1 … y_{u−1}} Q(z_1 … z_{u−1}, y_1 … y_{u−2}, y_{u−1}, y_u = j)
  = max_{y_1 … y_{u−1}} Q(z_1 … z_u, y_1 … y_{u−2}, y_{u−1}, y_u = j)


Learning an HMM

▪ Given z_1 … z_U, find the MLE of ρ, B, C
▪ Some notation (for simplicity):
  ▪ y = {y_1 … y_U}, z = {z_1 … z_U}
  ▪ y_uj: a binary variable, 1 if y_u = j and 0 otherwise
  ▪ δ(y_uj) = Q(y_u = j | z)
  ▪ θ(y_{u−1,k} y_uj) = Q(y_{u−1} = k, y_u = j | z)

▪ Use the Baum-Welch algorithm (EM)

Q function

▪ max_{ρ,B,C} E_{y|z}[log Q(z, y)]
▪ Σ_y Q(y | z) log Q(z, y)
  = Σ_y Q(y | z) [ log Q(y_1) + Σ_{u=2}^{U} log Q(y_u | y_{u−1}) + Σ_{u=1}^{U} log Q(z_u | y_u) ]
  = Σ_{y_1} Q(y_1 | z) log Q(y_1) + Σ_{u=2}^{U} Σ_{y_{u−1} y_u} Q(y_{u−1} y_u | z) log Q(y_u | y_{u−1}) + Σ_{u=1}^{U} Σ_{y_u} Q(y_u | z) log Q(z_u | y_u)
  = Σ_{l=1}^{e} δ(y_1l) log ρ_l + Σ_{u=2}^{U} Σ_{k=1}^{e} Σ_{l=1}^{e} θ(y_{u−1,k} y_ul) log B_kl + Σ_{u=1}^{U} Σ_{l=1}^{e} δ(y_ul) log Q(z_u | y_u = l)

slide-26
SLIDE 26

2020年2月25日 26

M-step

Q = Σ_{l=1}^{e} δ(y_1l) log ρ_l + Σ_{u=2}^{U} Σ_{k=1}^{e} Σ_{l=1}^{e} θ(y_{u−1,k} y_ul) log B_kl + Σ_{u=1}^{U} Σ_{l=1}^{e} δ(y_ul) log Q(z_u | y_u = l)

▪ We can maximize Q with respect to ρ, B, C separately
▪ This can be achieved using Lagrange multipliers

Maximize Q regarding 𝛒

▪ For ρ = {ρ_1 … ρ_e}, we always have Σ_{l=1}^{e} ρ_l = 1
▪ We incorporate this constraint and set the derivative to 0:
  ▪ ∂/∂ρ_l [ Σ_{l=1}^{e} δ(y_1l) log ρ_l + λ (Σ_{l=1}^{e} ρ_l − 1) ] = δ(y_1l) / ρ_l + λ = 0
▪ In other words, δ(y_1l) + λ ρ_l = 0 holds for all l; their sum is also 0:
  ▪ Σ_{l=1}^{e} δ(y_1l) + λ Σ_{l=1}^{e} ρ_l = Σ_{l=1}^{e} δ(y_1l) + λ = 0
▪ Substituting λ back into the derivative for each ρ_l, we obtain
  ▪ ρ_l = δ(y_1l) / Σ_{k=1}^{e} δ(y_1k)


Maximize Q regarding 𝑩, 𝑪

▪ Using a similar technique, B and C can also be optimized:
▪ B_kl = Σ_{u=2}^{U} θ(y_{u−1,k} y_ul) / Σ_{m=1}^{e} Σ_{u=2}^{U} θ(y_{u−1,k} y_um)
▪ When z_u is categorical:
  ▪ Q(z_u | y_u = l) = ∏_{j=1}^{L} μ_jl^(z_uj y_ul), where μ_jl = Σ_{u=1}^{U} δ(y_ul) z_uj / Σ_{u=1}^{U} δ(y_ul)
▪ When z_u is continuous: Q(z_u | y_u = l) ~ N(μ_l, Σ_l)
  ▪ μ_l = Σ_{u=1}^{U} δ(y_ul) z_u / Σ_{u=1}^{U} δ(y_ul)
  ▪ Σ_l = Σ_{u=1}^{U} δ(y_ul) (z_u − μ_l)(z_u − μ_l)^T / Σ_{u=1}^{U} δ(y_ul)

E-step

▪ Compute δ(y_ul) and θ(y_{u−1,k} y_ul) for all u, k, l
▪ Remember:
  ▪ δ(y_ul) = Q(y_u = l | z)
  ▪ θ(y_{u−1,k} y_ul) = Q(y_{u−1} = k, y_u = l | z)
  ▪ Similar to smoothing!
▪ δ(y_ul) ∝ Q(y_u = l, z) = β_u(l) γ_u(l)
▪ θ(y_{u−1,k} y_ul) ∝ Q(y_{u−1} = k, y_u = l, z) = β_{u−1}(k) γ_u(l) B_kl Q(z_u | y_u = l)
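One full E-step (and the resulting M-step updates for ρ and B) can be sketched on top of the forward and backward tables; all model numbers below are made up for illustration:

```python
# Hypothetical 2-state HMM; all numbers are illustrative, not from the slides
rho = [0.6, 0.4]
B = [[0.7, 0.3], [0.4, 0.6]]          # transition table
C = [[0.9, 0.1], [0.2, 0.8]]          # emission table
obs = [0, 1, 1]
U, e = len(obs), len(rho)

# Forward and backward tables
beta = [[rho[j] * C[j][obs[0]] for j in range(e)]]
for u in range(1, U):
    beta.append([sum(beta[-1][k] * B[k][j] for k in range(e)) * C[j][obs[u]]
                 for j in range(e)])
gamma = [[1.0] * e for _ in range(U)]
for u in range(U - 2, -1, -1):
    gamma[u] = [sum(B[j][k] * C[k][obs[u + 1]] * gamma[u + 1][k] for k in range(e))
                for j in range(e)]

pz = sum(beta[-1])                    # Q(z), the normalizer

# E-step: delta(y_ul) = beta_u(l) * gamma_u(l) / Q(z)
delta = [[beta[u][l] * gamma[u][l] / pz for l in range(e)] for u in range(U)]

# E-step: theta(y_{u-1,k} y_ul) = beta_{u-1}(k) * B_kl * Q(z_u|y_u=l) * gamma_u(l) / Q(z)
theta = [[[beta[u - 1][k] * B[k][l] * C[l][obs[u]] * gamma[u][l] / pz
           for l in range(e)] for k in range(e)] for u in range(1, U)]

# M-step examples: rho_l = delta(y_1l); B_kl from the expected transition counts
new_rho = delta[0]
new_B = [[sum(theta[u][k][l] for u in range(U - 1)) /
          sum(theta[u][k][m] for u in range(U - 1) for m in range(e))
          for l in range(e)] for k in range(e)]
print(new_rho, new_B)
```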


Applications

▪ Speech recognition ▪ Natural language processing ▪ Bio-sequence analysis

APIs

▪ Python: hmmlearn (compatible with scikit-learn)

▪ https://github.com/hmmlearn/hmmlearn (or pip install hmmlearn)

▪ Matlab (integrated)

▪ https://www.mathworks.com/help/stats/hidden-markov-models-hmm.html

▪ C++: HTK3

▪ http://htk.eng.cam.ac.uk/


Thank You!
