February 25, 2020
Markov Models
Yanbing Xue
Outline
▪ Introduction ▪ Markov chains ▪ Dynamic belief networks ▪ Hidden Markov models (HMMs)
Outline
▪ Introduction
▪ Time series ▪ Probabilistic graphical models
▪ Markov chains ▪ Dynamic belief networks ▪ Hidden Markov models (HMMs)
What is time series?
▪ A time series is a sequence of data instances listed in time order.
▪ In other words, the data instances are totally ordered.
▪ Example: weather forecasting
▪ Notice: we care about the ordering rather than the exact time.
Different kinds of time series
▪ Two properties:
▪ Time space: discrete or continuous?
▪ Task: classification or regression?
▪ Examples: weather (discrete & classification), min/max temperature (discrete & regression), temperature and probability of rain (continuous & regression)
Probabilistic graphical models (PGMs)
▪ A PGM uses a graph-based representation for the conditional distributions over variables. ▪ Directed acyclic graphs (DAGs) ▪ Undirected graphs
Markov models are a subfamily of PGMs on DAGs
Outline
▪ Introduction ▪ Markov chains
▪ Intuition ▪ Inference ▪ Learning
▪ Dynamic belief networks ▪ Hidden Markov models (HMMs)
Modeling time series
Assume a sequence of four weather observations: y1, y2, y3, y4 ▪ Possible dependencies: y4 may depend on the previous weather observation(s)
(diagrams: two dependency structures over y1 … y4)
Modeling time series
In general, the dependencies among observations y1, y2, y3, y4 can vary:
(diagrams: fully dependent vs. independent structures over y1 … y4)
Fully dependent: e.g., y4 depends on all previous observations
Independent: e.g., y4 does not depend on any previous observation
A lot of middle ground lies between the two extremes
Modeling time series
▪ Are there intuitive and convenient dependency models?
(diagrams: fully dependent vs. independent structures over y1 … y4)
Fully dependent: think of the last term P(y4 | y1 y2 y3). What if we have T observations? The number of parameters is exponential in the number of observations.
Independent: totally drops the time information.
Markov chains
▪ Markov assumption: future predictions are independent of all but the most recent observation(s)
(diagrams: independent; fully dependent; and a first-order Markov chain y1 → y2 → y3 → y4)
Markov chains
▪ Markov assumption: future predictions are independent of all but the most recent observation(s)
(diagrams: independent; fully dependent; and a second-order Markov chain, where each y_t depends on y_{t−1} and y_{t−2})
A formal representation
▪ Using conditional probabilities to model y1, y2, y3, y4
▪ Fully dependent:
▪ P(y1 y2 y3 y4) = P(y1) P(y2 | y1) P(y3 | y1 y2) P(y4 | y1 y2 y3)
▪ Fully independent:
▪ P(y1 y2 y3 y4) = P(y1) P(y2) P(y3) P(y4)
▪ First-order Markov chain (most recent 1 observation):
▪ P(y1 y2 y3 y4) = P(y1) P(y2 | y1) P(y3 | y2) P(y4 | y3)
▪ Second-order Markov chain (most recent 2 observations):
▪ P(y1 y2 y3 y4) = P(y1) P(y2 | y1) P(y3 | y1 y2) P(y4 | y2 y3)
A more formal representation
▪ Generalizes to T observations
▪ First-order Markov chain (most recent 1 observation):
▪ P(y1 y2 … yT) = P(y1) ∏_{t=2}^{T} P(y_t | y_{t−1})
▪ Second-order Markov chain (most recent 2 observations):
▪ P(y1 y2 … yT) = P(y1) P(y2 | y1) ∏_{t=3}^{T} P(y_t | y_{t−1} y_{t−2})
▪ k-th order Markov chain (most recent k observations):
▪ P(y1 y2 … yT) = P(y1) P(y2 | y1) … P(y_k | y1 … y_{k−1}) ∏_{t=k+1}^{T} P(y_t | y_{t−k} … y_{t−1})
Stationarity
▪ Do all time steps yield the identical conditional distribution?
▪ P(y_t = k | y_{t−1} = j) = P(y_{t−1} = k | y_{t−2} = j) for all t, j, k
▪ Typically assumed to hold
▪ A transition table A represents the conditional distribution
▪ A_{jk} = P(y_t = k | y_{t−1} = j) for all t = 2, 3, …, T
▪ d: the dimension (number of possible values) of y_t
▪ A vector π represents the initial distribution
▪ π_j = P(y1 = j) for all j = 1, 2, …, d
A = [ A_{11} … A_{1d} ; ⋮ ⋱ ⋮ ; A_{d1} … A_{dd} ]
Inference on a Markov chain
▪ Probability of a given sequence
▪ P(y1 = j1, …, yT = jT) = π_{j1} ∏_{t=2}^{T} A_{j_{t−1} j_t}
▪ Probability of a given state
▪ Forward iteration: P(y_t = j_t) = Σ_{j_{t−1}} P(y_{t−1} = j_{t−1}) A_{j_{t−1} j_t}
▪ Can be calculated iteratively
▪ Both inferences are efficient
▪ P(y_k = j_k, …, yT = jT) = P(y_k = j_k) ∏_{t=k+1}^{T} A_{j_{t−1} j_t}
Learning a Markov chain
▪ The MLE of the conditional probabilities can be estimated directly.
▪ A_{jk}^{MLE} = P(y_t = k | y_{t−1} = j) = P(y_t = k, y_{t−1} = j) / P(y_{t−1} = j) = N_{jk} / Σ_k N_{jk}
▪ N_{jk}: # of observed transitions with y_{t−1} = j, y_t = k
▪ Bayesian parameter estimation
▪ Prior: Dir(θ1, θ2, …) ▪ Posterior: Dir(θ1 + N_{j1}, θ2 + N_{j2}, …)
▪ A_{jk}^{MAP} = (N_{jk} + θ_k − 1) / Σ_k (N_{jk} + θ_k − 1)
▪ A_{jk}^{EV} = (N_{jk} + θ_k) / Σ_k (N_{jk} + θ_k)
A toy example – weather forecast
▪ State 1: rainy, state 2: cloudy, state 3: sunny
▪ Given "sun-sun-sun-rain-rain-sun-cloud-sun", find A_{33}
▪ A_{33}^{MLE} = N_{33} / Σ_k N_{3k} = 2 / (1 + 1 + 2) = 1/2
▪ Prior: Dir(2, 2, 2) ▪ Posterior: Dir(2 + 1, 2 + 1, 2 + 2)
▪ A_{33}^{MAP} = (N_{33} + θ_3 − 1) / Σ_k (N_{3k} + θ_k − 1) = 3/7
▪ A_{33}^{EV} = (N_{33} + θ_3) / Σ_k (N_{3k} + θ_k) = 4/10
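The counting in this example can be reproduced in a few lines of Python. The sequence and the Dir(2, 2, 2) prior are the ones above; the variable names are mine:

```python
from collections import Counter

# Weather toy example: estimate the sun -> sun transition A_33.
seq = ["sun", "sun", "sun", "rain", "rain", "sun", "cloud", "sun"]

# Count consecutive transitions; only the row for "sun" is needed here.
counts = Counter(zip(seq, seq[1:]))
N3 = [counts[("sun", k)] for k in ["rain", "cloud", "sun"]]  # [1, 1, 2]

theta = [2, 2, 2]  # Dirichlet prior Dir(2, 2, 2)
mle = N3[2] / sum(N3)
map_est = (N3[2] + theta[2] - 1) / sum(n + t - 1 for n, t in zip(N3, theta))
ev = (N3[2] + theta[2]) / sum(n + t for n, t in zip(N3, theta))
print(mle, map_est, ev)  # 0.5, 3/7, 0.4
```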
A toy example – weather forecast
▪ Given A = [0.4 0.3 0.3; 0.2 0.6 0.2; 0.1 0.1 0.8] and that day 1 is sunny
▪ Find the probability that days 2 through 8 will be "sun-sun-rain-rain-sun-cloud-sun"
▪ P(y1 y2 … y8) = P(y1 = sun) P(y2 = sun | y1 = sun) P(y3 = sun | y2 = sun) P(y4 = rain | y3 = sun) P(y5 = rain | y4 = rain) P(y6 = sun | y5 = rain) P(y7 = cloud | y6 = sun) P(y8 = sun | y7 = cloud) = 1 · A_{33} · A_{33} · A_{31} · A_{11} · A_{13} · A_{32} · A_{23} = 1 · 0.8 · 0.8 · 0.1 · 0.4 · 0.3 · 0.1 · 0.2 = 1.536 × 10⁻⁴
A toy example – weather forecast
▪ Given the same A and that day 1 is sunny, find the probability that day 3 will be sunny
▪ P(y2 = sun) = Σ_j P(y1 = j) P(y2 = sun | y1 = j) = 0 · 0.3 + 0 · 0.2 + 1 · 0.8 = 0.8
▪ Similarly, P(y2 = rain) = Σ_j P(y1 = j) P(y2 = rain | y1 = j) = 0 · 0.4 + 0 · 0.2 + 1 · 0.1 = 0.1
▪ P(y2 = cloud) = Σ_j P(y1 = j) P(y2 = cloud | y1 = j) = 0 · 0.3 + 0 · 0.6 + 1 · 0.1 = 0.1
▪ P(y3 = sun) = Σ_j P(y2 = j) P(y3 = sun | y2 = j) = 0.1 · 0.3 + 0.1 · 0.2 + 0.8 · 0.8 = 0.69
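Both toy computations can be sketched in Python (state indices 0 = rainy, 1 = cloudy, 2 = sunny; the helper functions are illustrative, not a library API):

```python
A = [[0.4, 0.3, 0.3],
     [0.2, 0.6, 0.2],
     [0.1, 0.1, 0.8]]

def sequence_prob(pi, A, states):
    """P(y_1 = states[0], ..., y_T = states[-1]) = pi[j1] * prod of A[j_{t-1}][j_t]."""
    p = pi[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= A[prev][cur]
    return p

def marginal(pi, A, t):
    """P(y_t) via the forward iteration P(y_t = j) = sum_i P(y_{t-1} = i) A[i][j]."""
    dist = pi
    for _ in range(t - 1):
        dist = [sum(dist[i] * A[i][j] for i in range(len(A))) for j in range(len(A))]
    return dist

pi = [0.0, 0.0, 1.0]             # day 1 is sunny
days = [2, 2, 2, 0, 0, 2, 1, 2]  # sun-sun-sun-rain-rain-sun-cloud-sun
print(sequence_prob(pi, A, days))  # ~1.536e-4
print(marginal(pi, A, 3)[2])       # ~0.69
```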
Limitation of Markov chains
▪ Each state is represented by one variable
▪ What if each state consists of multiple variables?
Outline
▪ Introduction ▪ Markov chains ▪ Dynamic belief networks
▪ Intuition ▪ Inference ▪ Learning
▪ Hidden Markov models (HMMs)
Modeling multiple variables
▪ What if each state consists of multiple variables?
▪ E.g., monitoring a robot: location, GPS reading, speed
▪ Model all variables in each state jointly
▪ Is this a good solution?
(diagram: joint state nodes {L, G, S}_{t−1} → {L, G, S}_t)
Modeling multiple variables
▪ Each variable depends only on some of the previous or current variables
▪ Factorization
(diagram: factored network over S_{t−1}, L_{t−1}, G_{t−1} and S_t, L_t, G_t)
Dynamic belief networks
▪ Also known as dynamic Bayesian networks
(diagram: S_{t−1}, L_{t−1}, G_{t−1} → S_t, L_t, G_t)
X_t = {S_t, L_t}: transition states, dependent only on the previous time step
P(X_t | X_{t−1}) = {P(S_t | S_{t−1}), P(L_t | S_{t−1} L_{t−1})}: transition model
Y_t = {G_t}: emission states / evidence, dependent only on the current time step
P(Y_t | X_t) = {P(G_t | L_t)}: emission model / sensor model
Inference on a dynamic BN
▪ Filtering: given y_{1…t}, find P(X_t | y_{1…t})
▪ Exact inference
▪ Using Bayes' rule and the structure of the dynamic BN:
▪ P(X_t | y_{1…t}) ∝ P(X_t, y_t | y_{1…t−1}) = P(y_t | X_t, y_{1…t−1}) P(X_t | y_{1…t−1}) = P(y_t | X_t) Σ_{x_{t−1}} P(X_t | x_{t−1}) P(x_{t−1} | y_{1…t−1})
▪ The first factor is the emission model, the second the transition model; both follow from the structure of the dynamic BN
▪ Can be inferred iteratively
Approximate inference on a dynamic BN
▪ Is exact inference useful?
▪ P(X_t | y_{1…t}) ∝ P(y_t | X_t) Σ_{x_{t−1}} P(X_t | x_{t−1}) P(x_{t−1} | y_{1…t−1})
▪ Needs to enumerate x_{t−1}, which is exponential in the # of transition variables
▪ Use approximate inference instead
▪ Particle filtering
Particle filtering – a toy example
▪ X_t = {S_t, L_t}, Y_t = {G_t}
▪ S_t and L_t each have only 2 outcomes
▪ S_t ∈ {fast, slow}, L_t ∈ {left, right}
▪ P(X_1) = P(S_1 L_1), a 2×2 table
▪ N = 10: # of samples in each iteration
▪ t-th iteration = time step t
(diagram: S_{t−1}, L_{t−1}, G_{t−1} → S_t, L_t, G_t)
Particle filtering – a toy example
▪ Step 1: draw samples a_1 … a_N from the prior P(X_{t−1} | y_{1…t−1})
▪ When t = 1, sample from P(X_1)
▪ Step 2: update a_j ← a sample from P(X_t | X_{t−1} = a_j) for all j
▪ Each a_j transitions randomly according to the transition model
(diagram: particles over (speed, location) before and after the transition)
Particle filtering – a toy example
▪ Step 3: given y_t and a_j, define the weight w_j = P(y_t | X_t = a_j)
▪ In step 1 of the next iteration, we resample from a_1 … a_N, where the weight of a_j is w_j
▪ This should be the same as sampling from P(X_t | y_{1…t})
▪ Is this true?
(diagram: particles weighted by w_j)
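The three steps can be sketched as a minimal particle filter. The transition and emission probabilities below (0.8, 0.7, 0.9) are invented for illustration; the slides only fix the state spaces and N = 10:

```python
import random

random.seed(0)
SPEEDS, LOCS = ["fast", "slow"], ["left", "right"]

def transition(x):
    """Sample X_t given X_{t-1} = x (made-up transition model)."""
    speed, loc = x
    speed = speed if random.random() < 0.8 else ("slow" if speed == "fast" else "fast")
    loc = loc if random.random() < 0.7 else ("right" if loc == "left" else "left")
    return (speed, loc)

def emission_prob(y, x):
    """P(y_t = y | X_t = x): GPS agrees with the true location 90% of the time."""
    return 0.9 if y == x[1] else 0.1

def particle_filter_step(particles, y):
    # Step 2: propagate each particle through the transition model.
    moved = [transition(p) for p in particles]
    # Step 3: weight each particle by the emission probability of the observation.
    weights = [emission_prob(y, p) for p in moved]
    # Step 1 of the next iteration: resample in proportion to the weights.
    return random.choices(moved, weights=weights, k=len(particles))

# N = 10 particles, initially drawn uniformly as a stand-in for P(X_1).
particles = [(random.choice(SPEEDS), random.choice(LOCS)) for _ in range(10)]
for y in ["left", "left", "right"]:  # observed GPS readings
    particles = particle_filter_step(particles, y)
print(particles)
```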
Correctness of particle filtering
▪ Can be proved by induction
▪ Let N(x_{t−1} | y_{1…t−1}) denote the population of x_{t−1} given y_{1…t−1}
▪ After step 1: N(x_{t−1} | y_{1…t−1}) / N = P(x_{t−1} | y_{1…t−1})
▪ After step 2, we have the population of x_t:
▪ N(x_t | y_{1…t−1}) = Σ_{x_{t−1}} P(x_t | x_{t−1}) N(x_{t−1} | y_{1…t−1})
Correctness of particle filtering
▪ After step 3, the population of x_t is weighted by P(y_t | x_t):
▪ P(y_t | x_t) N(x_t | y_{1…t−1}) = P(y_t | x_t) Σ_{x_{t−1}} P(x_t | x_{t−1}) N(x_{t−1} | y_{1…t−1}) = N P(y_t | x_t) Σ_{x_{t−1}} P(x_t | x_{t−1}) P(x_{t−1} | y_{1…t−1}) = N P(y_t | x_t) P(x_t | y_{1…t−1}) = N P(y_t, x_t | y_{1…t−1}) ∝ P(x_t | y_{1…t})
Learning a dynamic BN
▪ Given the structure of the dynamic BN…
▪ Learning the transition and emission models is the same as in a Markov chain
▪ How to learn the structure?
▪ For P(X_t | X_{t−1}), take each X_t^{(j)} ∈ X_t as the label and X_{t−1} as the features
▪ For P(Y_t | X_t), take each Y_t^{(j)} ∈ Y_t as the label and X_t as the features
▪ This converts structure learning to feature selection
Limitation
▪ Current assumption: all states are observable, which is unrealistic
▪ The actual location L of the robot may never be observed
▪ What if some variables are hidden?
(diagram: S_{t−1}, L_{t−1}, G_{t−1} → S_t, L_t, G_t, with L unobserved)
Outline
▪ Introduction ▪ Markov chains ▪ Dynamic belief networks ▪ Hidden Markov models (HMMs)
▪ Intuition ▪ Inference ▪ Learning ▪ Applications & APIs
Hidden variables
▪ Some variables in the dynamic BN can be hidden
▪ Transition variables can be hidden
▪ HMM: think of only one transition variable and one emission variable
(diagram: dynamic BN with hidden transition variables)
Hidden Markov models (HMMs)
▪ Overview
▪ A sequence of length T
▪ Evidence / emission variable y_t: categorical or continuous
▪ Hidden variable x_t: categorical
▪ P(y1 … yT, x1 … xT) = P(x1) ∏_{t=2}^{T} P(x_t | x_{t−1}) ∏_{t=1}^{T} P(y_t | x_t)
(diagram: hidden chain x_{t−1} → x_t with emissions x_t → y_t)
Transition table
▪ Let d be the dimension (number of possible values) of x_t
▪ The transition table A is a d×d matrix
▪ A_{jk} = P(x_t = k | x_{t−1} = j)
▪ Clearly, Σ_{k=1}^{d} A_{jk} = 1 for all j
A = [ A_{11} … A_{1d} ; ⋮ ⋱ ⋮ ; A_{d1} … A_{dd} ]
Emission function
▪ When y_t is categorical, let K be the dimension of y_t
▪ The emission function B can be represented as a d×K matrix
▪ B_{jk} = P(y_t = k | x_t = j)
▪ Clearly, Σ_{k=1}^{K} B_{jk} = 1 for all j
B = [ B_{11} … B_{1K} ; ⋮ ⋱ ⋮ ; B_{d1} … B_{dK} ]
Emission function
▪ When y_t is continuous, p(y_t | x_t) is a PDF
▪ The emission function B is then the set of parameters of d different PDFs
▪ When p(y_t | x_t) is Gaussian:
▪ B = {μ_1 … μ_d, Σ_1 … Σ_d}
Inference on an HMM
▪ Given the HMM, what can we do?
▪ Given an observation sequence, find its probability
▪ Filtering: find the distribution of the last hidden variable
▪ Smoothing: find the distribution of a hidden variable in the middle
▪ Given an observation sequence, find the most likely (ML) hidden variable sequence
Probability of an observed sequence
▪ P(y1 … yT) = Σ_{j=1}^{d} P(y1 … yT, x_T = j)
▪ Let's expand one step more:
▪ P(y1 … yT, x_T = j) = Σ_{k=1}^{d} P(y1 … yT, x_T = j, x_{T−1} = k) = Σ_{k=1}^{d} P(y1 … y_{T−1}, x_{T−1} = k) P(x_T = j | x_{T−1} = k) P(y_T | x_T = j)
▪ Can be calculated iteratively
2020年2月25日 22
Forward algorithm
▪ Let α_t(j) = P(y1 … y_t, x_t = j)
▪ Iteration: α_t(j) = Σ_{k=1}^{d} α_{t−1}(k) A_{kj} P(y_t | x_t = j)
▪ Base: α_1(j) = P(y1, x_1 = j) = π_j P(y1 | x_1 = j)
▪ Output: Σ_{j=1}^{d} α_T(j)
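The recursion can be sketched directly in Python; the toy parameters (pi, A, B) are made-up numbers, not from the slides:

```python
def forward(pi, A, B, obs):
    """alpha[t][j] = P(y_1..y_t, x_t = j); A[k][j] = P(x_t = j | x_{t-1} = k)."""
    d = len(pi)
    alpha = [[pi[j] * B[j][obs[0]] for j in range(d)]]
    for t in range(1, len(obs)):
        prev = alpha[-1]
        alpha.append([B[j][obs[t]] * sum(prev[k] * A[k][j] for k in range(d))
                      for j in range(d)])
    return alpha

def sequence_likelihood(pi, A, B, obs):
    """P(y_1..y_T) = sum_j alpha_T(j)."""
    return sum(forward(pi, A, B, obs)[-1])

# Toy HMM: 2 hidden states, 2 observation symbols.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
print(sequence_likelihood(pi, A, B, [0, 1, 0]))
```

For a chain this short, the result can be cross-checked by summing the joint over all 2³ hidden sequences; the forward pass does the same sum in O(T d²) instead of O(dᵀ).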
Forward algorithm
▪ α_t(j) = Σ_{k=1}^{d} α_{t−1}(k) A_{kj} P(y_t | x_t = j)
▪ α_{t−1}(k) = P(y1 … y_{t−1}, x_{t−1} = k)
▪ ⇓ incorporating x_t
▪ α_{t−1}(k) A_{kj} = P(y1 … y_{t−1}, x_{t−1} = k, x_t = j)
▪ ⇓ incorporating y_t
▪ α_{t−1}(k) A_{kj} P(y_t | x_t = j) = P(y1 … y_t, x_{t−1} = k, x_t = j)
▪ ⇓ summing x_{t−1} out
▪ α_t(j) = Σ_{k=1}^{d} α_{t−1}(k) A_{kj} P(y_t | x_t = j) = P(y1 … y_t, x_t = j)
Backward algorithm
▪ Iterates in reverse
▪ Let β_t(j) = P(y_{t+1} … yT | x_t = j)
▪ Iteration: β_t(j) = Σ_{k=1}^{d} β_{t+1}(k) A_{jk} P(y_{t+1} | x_{t+1} = k)
▪ Base: β_T(j) = 1
▪ Output: Σ_{j=1}^{d} π_j P(y1 | x_1 = j) β_1(j)
Filtering and smoothing
▪ Filtering: find P(x_T = j | y1 … yT)
▪ P(x_T = j | y1 … yT) ∝ P(y1 … yT, x_T = j) = α_T(j)
▪ Directly applies the forward algorithm
▪ Smoothing: find P(x_t = j | y1 … yT) where t < T
▪ P(x_t = j | y1 … yT) ∝ P(y1 … yT, x_t = j) = P(y1 … y_t, x_t = j) P(y_{t+1} … yT | x_t = j) = α_t(j) β_t(j)
▪ Uses both the forward and backward algorithms
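Smoothing via α_t(j) β_t(j) can be sketched end to end; the toy parameters are invented, and `smooth` is an illustrative helper (t is 0-based here):

```python
def forward(pi, A, B, obs):
    d = len(pi)
    alpha = [[pi[j] * B[j][obs[0]] for j in range(d)]]
    for t in range(1, len(obs)):
        prev = alpha[-1]
        alpha.append([B[j][obs[t]] * sum(prev[k] * A[k][j] for k in range(d))
                      for j in range(d)])
    return alpha

def backward(A, B, obs, d):
    beta = [[1.0] * d]  # base case: beta_T(j) = 1
    for t in range(len(obs) - 2, -1, -1):
        nxt = beta[0]
        beta.insert(0, [sum(A[j][k] * B[k][obs[t + 1]] * nxt[k] for k in range(d))
                        for j in range(d)])
    return beta

def smooth(pi, A, B, obs, t):
    """P(x_t = j | y_1..y_T), proportional to alpha_t(j) * beta_t(j)."""
    alpha, beta = forward(pi, A, B, obs), backward(A, B, obs, len(pi))
    g = [a * b for a, b in zip(alpha[t], beta[t])]
    z = sum(g)
    return [v / z for v in g]

pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
print(smooth(pi, A, B, [0, 1, 0], 1))  # posterior over the middle hidden state
```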
Viterbi algorithm
▪ Find argmax_{x1…xT} P(x1 … xT | y1 … yT)
▪ argmax_{x1…xT} P(x1 … xT | y1 … yT) = argmax_{x1…xT} P(y1 … yT, x1 … xT)
▪ Let δ_t(j) = max_{x1…x_{t−1}} P(y1 … y_t, x1 … x_{t−1}, x_t = j)
▪ Represents the highest probability of a hidden variable sequence x1 … x_t ending with x_t = j
▪ Iteration: δ_t(j) = P(y_t | x_t = j) max_k δ_{t−1}(k) A_{kj}
▪ A_{kj} and P(y_t | x_t = j) are independent of y1 … y_{t−1}, x1 … x_{t−2}
▪ Base: δ_1(j) = P(y1, x_1 = j) = π_j P(y1 | x_1 = j)
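The iteration plus a backpointer trace can be sketched as follows; the toy HMM numbers are invented for illustration:

```python
def viterbi(pi, A, B, obs):
    """Most likely hidden sequence: delta_t(j) = P(y_t|x_t=j) * max_k delta_{t-1}(k) A[k][j]."""
    d = len(pi)
    delta = [pi[j] * B[j][obs[0]] for j in range(d)]
    back = []  # back[t][j]: best predecessor of state j at step t+1
    for t in range(1, len(obs)):
        step, new = [], []
        for j in range(d):
            k_best = max(range(d), key=lambda k: delta[k] * A[k][j])
            step.append(k_best)
            new.append(delta[k_best] * A[k_best][j] * B[j][obs[t]])
        back.append(step)
        delta = new
    # Trace back from the best final state.
    j = max(range(d), key=lambda j: delta[j])
    path = [j]
    for step in reversed(back):
        j = step[j]
        path.insert(0, j)
    return path

pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
print(viterbi(pi, A, B, [0, 0, 1, 1]))  # → [0, 0, 1, 1]
```

On a chain this small, the answer can be verified by enumerating all 2⁴ hidden sequences and taking the argmax of the joint probability.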
Correctness of Viterbi
▪ Can be proved by induction
▪ δ_{t−1}(k) = max_{x1…x_{t−2}} P(y1 … y_{t−1}, x1 … x_{t−2}, x_{t−1} = k)
▪ δ_t(j) = P(y_t | x_t = j) max_k δ_{t−1}(k) A_{kj}
= P(y_t | x_t = j) max_k max_{x1…x_{t−2}} P(y1 … y_{t−1}, x1 … x_{t−2}, x_{t−1} = k) P(x_t = j | x_{t−1} = k)
= P(y_t | x_t = j) max_{x1…x_{t−1}} P(y1 … y_{t−1}, x1 … x_{t−2}, x_{t−1}, x_t = j)
= max_{x1…x_{t−1}} P(y1 … y_t, x1 … x_{t−2}, x_{t−1}, x_t = j)
Learning an HMM
▪ Given y1 … yT, find the MLE of π, A, B
▪ Some notation (for simplicity):
▪ x = {x1 … xT}, y = {y1 … yT}
▪ x_{tj}: binary variable, 1 if x_t = j and 0 otherwise
▪ γ(x_{tj}) = P(x_t = j | y)
▪ ξ(x_{t−1,k}, x_{tj}) = P(x_{t−1} = k, x_t = j | y)
▪ Using the Baum-Welch algorithm (EM)
Q function
▪ max_{π,A,B} E_{x|y}[log P(y, x)]
▪ Σ_x P(x | y) log P(y, x)
= Σ_x P(x | y) [log P(x1) + Σ_{t=2}^{T} log P(x_t | x_{t−1}) + Σ_{t=1}^{T} log P(y_t | x_t)]
= Σ_{x1} P(x1 | y) log P(x1) + Σ_{t=2}^{T} Σ_{x_{t−1} x_t} P(x_{t−1} x_t | y) log P(x_t | x_{t−1}) + Σ_{t=1}^{T} Σ_{x_t} P(x_t | y) log P(y_t | x_t)
= Σ_{k=1}^{d} γ(x_{1k}) log π_k + Σ_{t=2}^{T} Σ_{j=1}^{d} Σ_{k=1}^{d} ξ(x_{t−1,j}, x_{tk}) log A_{jk} + Σ_{t=1}^{T} Σ_{k=1}^{d} γ(x_{tk}) log P(y_t | x_t = k)
M-step
▪ Q(π, A, B) = Σ_{k=1}^{d} γ(x_{1k}) log π_k + Σ_{t=2}^{T} Σ_{j=1}^{d} Σ_{k=1}^{d} ξ(x_{t−1,j}, x_{tk}) log A_{jk} + Σ_{t=1}^{T} Σ_{k=1}^{d} γ(x_{tk}) log P(y_t | x_t = k)
▪ We can maximize Q with respect to π, A, B separately
▪ Can be achieved using Lagrange multipliers
Maximize Q with respect to π
▪ For π = {π_1 … π_d}, we always have Σ_{k=1}^{d} π_k = 1
▪ We incorporate this constraint and set the derivative to 0:
∂/∂π_k [ Σ_{l=1}^{d} γ(x_{1l}) log π_l + λ (Σ_{l=1}^{d} π_l − 1) ] = γ(x_{1k}) / π_k + λ = 0
▪ In other words, γ(x_{1k}) + λ π_k = 0 holds for all k. Their sum is also 0:
Σ_{k=1}^{d} γ(x_{1k}) + λ Σ_{k=1}^{d} π_k = Σ_{k=1}^{d} γ(x_{1k}) + λ = 0
▪ Taking λ back into the derivative for each π_k, we obtain π_k = γ(x_{1k}) / Σ_{l=1}^{d} γ(x_{1l})
Maximize Q with respect to A, B
▪ Using a similar technique, A and B can also be optimized
▪ A_{jk} = Σ_{t=2}^{T} ξ(x_{t−1,j}, x_{tk}) / Σ_{m=1}^{d} Σ_{t=2}^{T} ξ(x_{t−1,j}, x_{tm})
▪ When y_t is categorical:
▪ P(y_t | x_t = k) = ∏_{j=1}^{K} B_{kj}^{y_{tj}}, where B_{kj} = Σ_{t=1}^{T} γ(x_{tk}) y_{tj} / Σ_{t=1}^{T} γ(x_{tk})
▪ When y_t is continuous: P(y_t | x_t = k) ~ 𝒩(μ_k, Σ_k)
▪ μ_k = Σ_{t=1}^{T} γ(x_{tk}) y_t / Σ_{t=1}^{T} γ(x_{tk})
▪ Σ_k = Σ_{t=1}^{T} γ(x_{tk}) (y_t − μ_k)(y_t − μ_k)ᵀ / Σ_{t=1}^{T} γ(x_{tk})
E-step
▪ Compute γ(x_{tj}) and ξ(x_{t−1,k}, x_{tj}) for all t, j, k
▪ Remember:
▪ γ(x_{tj}) = P(x_t = j | y)
▪ ξ(x_{t−1,k}, x_{tj}) = P(x_{t−1} = k, x_t = j | y)
▪ Similar to smoothing!
▪ γ(x_{tj}) ∝ P(x_t = j, y) = α_t(j) β_t(j)
▪ ξ(x_{t−1,k}, x_{tj}) ∝ P(x_{t−1} = k, x_t = j, y) = α_{t−1}(k) β_t(j) A_{kj} P(y_t | x_t = j)
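The E-step quantities can be sketched on a toy HMM (parameters invented); both γ and ξ are normalized by the same constant P(y_1…y_T) = Σ_j α_T(j):

```python
def forward(pi, A, B, obs):
    d = len(pi)
    alpha = [[pi[j] * B[j][obs[0]] for j in range(d)]]
    for t in range(1, len(obs)):
        prev = alpha[-1]
        alpha.append([B[j][obs[t]] * sum(prev[k] * A[k][j] for k in range(d))
                      for j in range(d)])
    return alpha

def backward(A, B, obs, d):
    beta = [[1.0] * d]
    for t in range(len(obs) - 2, -1, -1):
        nxt = beta[0]
        beta.insert(0, [sum(A[j][k] * B[k][obs[t + 1]] * nxt[k] for k in range(d))
                        for j in range(d)])
    return beta

def e_step(pi, A, B, obs):
    """Return gamma[t][j] = P(x_t=j | y) and xi[t][k][j] = P(x_t=k, x_{t+1}=j | y)."""
    d, T = len(pi), len(obs)
    alpha, beta = forward(pi, A, B, obs), backward(A, B, obs, d)
    py = sum(alpha[-1])  # P(y_1..y_T), the common normalizer
    gamma = [[alpha[t][j] * beta[t][j] / py for j in range(d)] for t in range(T)]
    xi = [[[alpha[t][k] * A[k][j] * B[j][obs[t + 1]] * beta[t + 1][j] / py
            for j in range(d)] for k in range(d)] for t in range(T - 1)]
    return gamma, xi

pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
gamma, xi = e_step(pi, A, B, [0, 1, 1, 0])
```

A useful sanity check: each γ_t sums to 1 over states, and marginalizing ξ over either of its two states recovers the corresponding γ.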
Applications
▪ Speech recognition ▪ Natural language processing ▪ Bio-sequence analysis
APIs
▪ Python: hmmlearn (compatible with scikit-learn)
▪ https://github.com/hmmlearn/hmmlearn (or pip install hmmlearn)
▪ Matlab (integrated)
▪ https://www.mathworks.com/help/stats/hidden-markov-models- hmm.html
▪ C++: HTK3
▪ http://htk.eng.cam.ac.uk/