Expectation Maximization
CMSC 691, UMBC
Outline
EM (Expectation Maximization)
- Basic idea
- Three coins example
- Why EM works
Expectation Maximization (EM)
- 0. Assume some value for your parameters
Two-step, iterative algorithm (sketched in code below):
- 1. E-step: count under uncertainty (compute expectations)
- 2. M-step: maximize log-likelihood, assuming these uncertain counts
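In code, the loop structure is tiny. Below is a minimal sketch of that skeleton (the names run_em, e_step, and m_step are illustrative placeholders, not anything defined in these slides):

```python
def run_em(data, theta, e_step, m_step, n_iters=50):
    """Generic EM skeleton: alternate soft counting and re-estimation.

    e_step(data, theta) should return expected ("uncertain") counts,
    weighting each possible latent value by how probable the current
    parameters say it is; m_step(counts) should return the parameters
    that maximize the log-likelihood as if those counts were observed.
    """
    for _ in range(n_iters):
        counts = e_step(data, theta)   # 1. E-step: count under uncertainty
        theta = m_step(counts)         # 2. M-step: maximize log-likelihood
    return theta
```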
Expectation Maximization (EM): E-step
- 0. Assume some value for your parameters
Two-step, iterative algorithm:
- 1. E-step: count under uncertainty, assuming these parameters
- 2. M-step: maximize log-likelihood, assuming these uncertain counts
E-step quantity: the counts count(z_i, w_i), weighted by how probable the current model thinks each z_i is.
We've already seen this type of counting, when computing the gradient in maxent models.
Expectation Maximization (EM): M-step
- 0. Assume some value for your parameters
Two-step, iterative algorithm:
- 1. E-step: count under uncertainty, assuming these parameters
- 2. M-step: maximize log-likelihood, assuming these uncertain counts
M-step update: the estimated (expected) counts under p^(t)(z) give the new distribution p^(t+1)(z).
EM Math
\max_{\theta} \; \mathbb{E}_{z \sim p_{\theta^{(t)}}(\cdot \mid w)}\left[\log p_{\theta}(z, w)\right]
Here p_{\theta^{(t)}}(\cdot \mid w) is the posterior distribution under the current parameters \theta^{(t)}, and the maximization is over the new parameters \theta.
E-step: count under uncertainty. M-step: maximize log-likelihood.
Maximize the average log-likelihood of our complete data (z, w), averaged across all z according to how likely our current model thinks z is.
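Writing that expectation out as an explicit sum over latent values makes the two steps visible (a sketch of the same objective; this sum form reappears later in the "Why does EM work?" derivation):

```latex
% EM objective at iteration t, expanded as a sum over latent values z.
% E-step: compute the posterior weights p_{\theta^{(t)}}(z \mid w_i).
% M-step: maximize the weighted sum over the new parameters \theta.
Q(\theta, \theta^{(t)})
  = \mathbb{E}_{z \sim p_{\theta^{(t)}}(\cdot \mid w)}\!\left[\log p_{\theta}(z, w)\right]
  = \sum_i \sum_{z} p_{\theta^{(t)}}(z \mid w_i)\,\log p_{\theta}(z, w_i),
\qquad
\theta^{(t+1)} = \arg\max_{\theta}\; Q(\theta, \theta^{(t)}).
```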
Why EM? Unsupervised Learning
(figure: a collection of unlabeled examples, each shown as "?")
NO labeled data:
- human annotated
- relatively small/few examples
unlabeled data:
- raw; not annotated
- plentiful
EM/generative models in this case can be seen as a type of clustering.
Why EM? Semi-Supervised Learning
(figure: a few labeled examples, shown as checkmarks, alongside many unlabeled examples shown as "?"; EM is applied to both)
labeled data:
- human annotated
- relatively small/few examples
unlabeled data:
- raw; not annotated
- plentiful
Outline
EM (Expectation Maximization)
- Basic idea
- Three coins example
- Why EM works
Three Coins Example
Imagine three coins.
- Flip the 1st coin (penny)
- If heads: flip the 2nd coin (dollar coin)
- If tails: flip the 3rd coin (dime)
We only observe the outcome of the second or third flip (record heads vs. tails); we don't observe the penny flip.
Observed: a, b, e, etc.; "We run the code" vs. "The run failed". Unobserved: part of speech? genre?
Penny: p(heads) = λ, p(tails) = 1 − λ
Dollar coin: p(heads) = γ, p(tails) = 1 − γ
Dime: p(heads) = ψ, p(tails) = 1 − ψ
Three parameters to estimate: λ, γ, and ψ
Generative Story for Three Coins
Modeling the observations alone (no latent variables):
p(w_1, w_2, \ldots, w_N) = p(w_1)\, p(w_2) \cdots p(w_N) = \prod_i p(w_i)
Add complexity (a latent variable z_i per item) to better explain what we see:
p(z_1, w_1, z_2, w_2, \ldots, z_N, w_N) = p(z_1)\, p(w_1 \mid z_1) \cdots p(z_N)\, p(w_N \mid z_N) = \prod_i p(w_i \mid z_i)\, p(z_i)

Generative Story
λ = distribution over the penny, γ = distribution for the dollar coin, ψ = distribution over the dime
for item i = 1 to N:
    z_i ~ Bernoulli(λ)
    if z_i = heads: w_i ~ Bernoulli(γ)
    else: w_i ~ Bernoulli(ψ)
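A small simulation of this story (a sketch; the function and variable names are mine, and any parameter values passed in are purely illustrative):

```python
import random

def sample_three_coins(n, lam, gamma, psi, seed=0):
    """Sample n items from the three-coins generative story.

    lam   = p(heads) for the penny (the hidden flip z_i)
    gamma = p(heads) for the dollar coin (used when z_i is heads)
    psi   = p(heads) for the dime (used when z_i is tails)
    Returns (hidden penny flips, observed flips), each a list of 'H'/'T'.
    """
    rng = random.Random(seed)
    z, w = [], []
    for _ in range(n):
        penny = 'H' if rng.random() < lam else 'T'
        if penny == 'H':
            obs = 'H' if rng.random() < gamma else 'T'   # flip the dollar coin
        else:
            obs = 'H' if rng.random() < psi else 'T'     # flip the dime
        z.append(penny)
        w.append(obs)
    return z, w

# e.g. sample_three_coins(6, lam=0.6, gamma=0.8, psi=0.6)
```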
Three Coins Example
If all flips were observed:
Penny (coin 1):                     H H T H T H
Observed flip (dollar coin or dime): H T H T T T
Then the maximum-likelihood estimates are just relative counts:
Penny: p(heads) = λ = 4/6, p(tails) = 2/6
Dollar coin: p(heads) = γ = 1/4, p(tails) = 3/4
Dime: p(heads) = ψ = 1/2, p(tails) = 1/2
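The same relative-count computation as a quick sketch in code, using the six flips above:

```python
# Fully observed case: maximum-likelihood estimates are relative counts.
penny = ['H', 'H', 'T', 'H', 'T', 'H']   # hidden coin, here assumed visible
obs   = ['H', 'T', 'H', 'T', 'T', 'T']   # dollar-coin or dime outcome

lam = penny.count('H') / len(penny)                         # 4/6
dollar_flips = [w for z, w in zip(penny, obs) if z == 'H']  # penny came up heads
dime_flips   = [w for z, w in zip(penny, obs) if z == 'T']  # penny came up tails
gamma = dollar_flips.count('H') / len(dollar_flips)         # 1/4
psi   = dime_flips.count('H') / len(dime_flips)             # 1/2
print(lam, gamma, psi)   # 0.666..., 0.25, 0.5
```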
Three Coins Example
But not all flips are observed, so set parameter values:
Penny: p(heads) = λ = .6, p(tails) = .4
Dollar coin: p(heads) = γ = .8, p(tails) = .2
Dime: p(heads) = ψ = .6, p(tails) = .4
Observed flips: H T H T T T (the penny flips H H T H T H are not observed)
Use these values to compute posteriors over the hidden penny flip:
p(heads | observed item H) = p(heads & H) / p(H)
p(heads | observed item T) = p(heads & T) / p(T)
Rewrite the joint using Bayes' rule; the denominator is the marginal likelihood:
p(heads | observed item H) = p(H | heads) p(heads) / p(H)
p(H | heads) = .8, p(T | heads) = .2
p(H) = p(H | heads) · p(heads) + p(H | tails) · p(tails) = .8 · .6 + .6 · .4
Three Coins Example
Use posteriors to update parameters:
p(heads | obs. H) = p(H | heads) p(heads) / p(H) = (.8 · .6) / (.8 · .6 + .6 · .4) ≈ 0.667
p(heads | obs. T) = p(T | heads) p(heads) / p(T) = (.2 · .6) / (.2 · .6 + .4 · .4) ≈ 0.429
Q: Is p(heads | obs. H) + p(heads | obs. T) = 1?
A: No.
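The same posterior arithmetic as a short sketch in code (parameter values from the slide; the variable names are mine):

```python
# Current guesses for the parameters.
lam   = 0.6   # penny:  p(heads)
gamma = 0.8   # dollar: p(heads), flipped when the penny is heads
psi   = 0.6   # dime:   p(heads), flipped when the penny is tails

# Posterior that the hidden penny flip was heads, given the observed outcome,
# via Bayes' rule: p(heads | obs) = p(obs | heads) p(heads) / p(obs).
p_heads_given_H = (gamma * lam) / (gamma * lam + psi * (1 - lam))
p_heads_given_T = ((1 - gamma) * lam) / ((1 - gamma) * lam + (1 - psi) * (1 - lam))

print(round(p_heads_given_H, 3))   # 0.667
print(round(p_heads_given_T, 3))   # 0.429
# Note: these two posteriors need not sum to 1; they condition on
# different observations.
```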
Three Coins Example
Use posteriors to update parameters.
Fully observed setting: p(heads) = (# heads from penny) / (# total flips of penny)
Our setting is partially observed, so use expected counts:
p^(t+1)(heads) = (# expected heads from penny) / (# total flips of penny) = E_{p^(t)}[# heads from penny] / (# total flips of penny)
(In general, p(heads | obs. H) and p(heads | obs. T) do NOT sum to 1.)
With p(heads | obs. H) ≈ 0.667 and p(heads | obs. T) ≈ 0.429 from above:
λ^(t+1) = (2 · p(heads | obs. H) + 4 · p(heads | obs. T)) / 6 ≈ 0.508
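And the update itself, continuing the sketch above (the observed sequence H T H T T T has 2 heads and 4 tails):

```python
# Posteriors from the E-step above.
p_heads_given_H = 0.667
p_heads_given_T = 0.429

# Observed outcomes: H T H T T T  ->  2 heads, 4 tails.
n_H, n_T = 2, 4

# Expected number of penny-heads: each observed H contributes
# p(heads | obs. H), each observed T contributes p(heads | obs. T).
expected_penny_heads = n_H * p_heads_given_H + n_T * p_heads_given_T

# M-step: treat the expected count as if it were an observed count.
lam_next = expected_penny_heads / (n_H + n_T)
print(round(lam_next, 3))   # ~0.508
```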
Expectation Maximization (EM)
- 0. Assume some value for your parameters
Two-step, iterative algorithm (the three-coins example below puts both steps together in code):
- 1. E-step: count under uncertainty (compute expectations)
- 2. M-step: maximize log-likelihood, assuming these uncertain counts
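Putting the pieces together for the three coins, here is a sketch of the full EM loop. The slides only walk through the update for λ; re-estimating γ and ψ from their expected counts, as done here, follows the same pattern, so treat those two updates as my own extension of the example:

```python
def em_three_coins(obs, lam, gamma, psi, n_iters=20):
    """EM for the three-coins model on observed flips obs (list of 'H'/'T')."""
    n_H = obs.count('H')
    n_T = obs.count('T')
    for _ in range(n_iters):
        # E-step: posterior that the hidden penny flip was heads.
        q_H = gamma * lam / (gamma * lam + psi * (1 - lam))                    # given obs 'H'
        q_T = (1 - gamma) * lam / ((1 - gamma) * lam + (1 - psi) * (1 - lam))  # given obs 'T'
        # Expected counts.
        exp_penny_heads  = n_H * q_H + n_T * q_T
        exp_dollar_flips = exp_penny_heads        # dollar coin flipped when penny = heads
        exp_dollar_heads = n_H * q_H              # ...and the observed flip was heads
        exp_dime_flips   = (n_H + n_T) - exp_penny_heads
        exp_dime_heads   = n_H * (1 - q_H)
        # M-step: re-estimate each parameter from its expected counts.
        lam   = exp_penny_heads / (n_H + n_T)
        gamma = exp_dollar_heads / exp_dollar_flips
        psi   = exp_dime_heads / exp_dime_flips
    return lam, gamma, psi

# On the slide's data and starting values this settles quickly,
# to roughly (0.508, 0.438, 0.226).
print(em_three_coins(list('HTHTTT'), lam=0.6, gamma=0.8, psi=0.6))
```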
Outline
EM (Expectation Maximization)
- Basic idea
- Three coins example
- Why EM works
Why does EM work?
X: observed data; Y: unobserved data
ℒ(θ) = log-likelihood of complete data (X, Y)
ℰ(θ) = posterior log-likelihood of incomplete data Y
ℳ(θ) = marginal log-likelihood of observed data X
What do ℒ, ℳ, ℰ look like?
\mathcal{L}(\theta) = \sum_i \log p(x_i, y_i)
\mathcal{M}(\theta) = \sum_i \log p(x_i) = \sum_i \log \sum_{k} p(x_i, y = k)
\mathcal{E}(\theta) = \sum_i \log p(y_i \mid x_i)
From the definition of conditional probability, p_θ(Y | X) = p_θ(X, Y) / p_θ(X); rearranging (algebra), p_θ(X) = p_θ(X, Y) / p_θ(Y | X).
Taking logs and summing over the data points:
\mathcal{M}(\theta) = \mathcal{L}(\theta) - \mathcal{E}(\theta)
Take a conditional expectation with respect to Y ∼ θ^(t) (why? we'll cover this more in variational inference):
\mathbb{E}_{Y \sim \theta^{(t)}}[\mathcal{M}(\theta) \mid X] = \mathbb{E}_{Y \sim \theta^{(t)}}[\mathcal{L}(\theta) \mid X] - \mathbb{E}_{Y \sim \theta^{(t)}}[\mathcal{E}(\theta) \mid X]
ℳ already sums over Y, so its expectation is just ℳ(θ) itself:
\mathcal{M}(\theta) = \mathbb{E}_{Y \sim \theta^{(t)}}[\mathcal{L}(\theta) \mid X] - \mathbb{E}_{Y \sim \theta^{(t)}}[\mathcal{E}(\theta) \mid X]
The first term is the familiar EM objective:
\mathbb{E}_{Y \sim \theta^{(t)}}[\mathcal{L}(\theta) \mid X] = \sum_i \sum_{k} p_{\theta^{(t)}}(y = k \mid x_i) \, \log p_{\theta}(x_i, y = k)
Call the two terms Q(θ, θ^(t)) = 𝔼_{Y∼θ^(t)}[ℒ(θ) | X] and R(θ, θ^(t)) = 𝔼_{Y∼θ^(t)}[ℰ(θ) | X], so
\mathcal{M}(\theta) = Q(\theta, \theta^{(t)}) - R(\theta, \theta^{(t)})
Let θ* be the value that maximizes Q(θ, θ^(t)).
Comparing θ* to the current parameters θ^(t):
\mathcal{M}(\theta^*) - \mathcal{M}(\theta^{(t)}) = \left[ Q(\theta^*, \theta^{(t)}) - Q(\theta^{(t)}, \theta^{(t)}) \right] - \left[ R(\theta^*, \theta^{(t)}) - R(\theta^{(t)}, \theta^{(t)}) \right]
The first bracket is ≥ 0 (θ* maximizes Q); the second bracket is ≤ 0 (we'll see why with Jensen's inequality, in variational inference).
Therefore ℳ(θ*) − ℳ(θ^(t)) ≥ 0: EM does not decrease the marginal log-likelihood.
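A quick numerical sanity check of that claim on the three-coins example (a sketch under the same setup as earlier: it runs the EM updates and asserts that the marginal log-likelihood ℳ never drops):

```python
import math

def marginal_log_likelihood(obs, lam, gamma, psi):
    """M(theta): sum_i log p(w_i), marginalizing out the hidden penny flip."""
    p_H = gamma * lam + psi * (1 - lam)   # p(observed flip = H)
    return sum(math.log(p_H if w == 'H' else 1.0 - p_H) for w in obs)

obs = list('HTHTTT')
n_H, n_T = obs.count('H'), obs.count('T')
lam, gamma, psi = 0.6, 0.8, 0.6

prev = marginal_log_likelihood(obs, lam, gamma, psi)
for _ in range(10):
    # One EM iteration (same E- and M-steps as in the earlier sketch).
    q_H = gamma * lam / (gamma * lam + psi * (1 - lam))                    # p(penny = H | obs H)
    q_T = (1 - gamma) * lam / ((1 - gamma) * lam + (1 - psi) * (1 - lam))  # p(penny = H | obs T)
    lam   = (n_H * q_H + n_T * q_T) / (n_H + n_T)
    gamma = n_H * q_H / (n_H * q_H + n_T * q_T)
    psi   = n_H * (1 - q_H) / (n_H * (1 - q_H) + n_T * (1 - q_T))
    cur = marginal_log_likelihood(obs, lam, gamma, psi)
    assert cur >= prev - 1e-12   # the marginal log-likelihood never decreases
    prev = cur

print(round(prev, 4))   # roughly -3.8191 for this tiny dataset
```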
Generalized EM
- Partial M-step: find a θ that simply increases, rather than maximizes, Q
- Partial E-step: only consider some of the variables (an online learning algorithm)
EM has its pitfalls
- The objective is not convex: EM can converge to a bad local optimum
- Computing expectations can be hard: the E-step could require clever algorithms
- How well does log-likelihood correlate with an end task?
A Maximization-Maximization Procedure
G(q, \theta) = \mathbb{E}_{q}[\mathcal{L}(\theta)] - \mathbb{E}_{q}[\log q(Z)]
Here ℒ(θ) is the complete-data log-likelihood and q is any distribution over the unobserved data Z; G is a lower bound on the observed-data log-likelihood.
We'll see this again with variational inference.