Expectation Maximization
CMSC 691, UMBC
Outline
EM (Expectation Maximization)
- Basic idea
- Three coins example
- Why EM works
Expectation Maximization (EM)
- 0. Assume some value for your parameters
Two-step, iterative algorithm (sketched in code below):
- 1. E-step: count under uncertainty (compute expectations)
- 2. M-step: maximize log-likelihood, assuming these uncertain counts
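In code, the loop structure is tiny. Below is a minimal sketch of that skeleton (the names run_em, e_step, and m_step are illustrative placeholders, not anything defined in these slides):

```python
def run_em(data, theta, e_step, m_step, n_iters=50):
    """Generic EM skeleton: alternate soft counting and re-estimation.

    e_step(data, theta) should return expected ("uncertain") counts,
    weighting each possible latent value by how probable the current
    parameters say it is; m_step(counts) should return the parameters
    that maximize the log-likelihood as if those counts were observed.
    """
    for _ in range(n_iters):
        counts = e_step(data, theta)   # 1. E-step: count under uncertainty
        theta = m_step(counts)         # 2. M-step: maximize log-likelihood
    return theta
```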
Expectation Maximization (EM): E-step
- 0. Assume some value for your parameters
Two-step, iterative algorithm:
- 1. E-step: count under uncertainty, assuming these parameters
- 2. M-step: maximize log-likelihood, assuming these uncertain counts
E-step quantity: the counts count(z_i, w_i), weighted by how probable the current model thinks each z_i is.
We've already seen this type of counting, when computing the gradient in maxent models.
Expectation Maximization (EM): M-step
- 0. Assume some value for your parameters
Two-step, iterative algorithm:
- 1. E-step: count under uncertainty, assuming these parameters
- 2. M-step: maximize log-likelihood, assuming these uncertain counts
M-step update: the estimated (expected) counts under p^(t)(z) give the new distribution p^(t+1)(z).
EM Math
\max_{\theta} \; \mathbb{E}_{z \sim p_{\theta^{(t)}}(\cdot \mid w)}\left[\log p_{\theta}(z, w)\right]
Here p_{\theta^{(t)}}(\cdot \mid w) is the posterior distribution under the current parameters \theta^{(t)}, and the maximization is over the new parameters \theta.
E-step: count under uncertainty. M-step: maximize log-likelihood.
Maximize the average log-likelihood of our complete data (z, w), averaged across all z according to how likely our current model thinks z is.
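Writing that expectation out as an explicit sum over latent values makes the two steps visible (a sketch of the same objective; this sum form reappears later in the "Why does EM work?" derivation):

```latex
% EM objective at iteration t, expanded as a sum over latent values z.
% E-step: compute the posterior weights p_{\theta^{(t)}}(z \mid w_i).
% M-step: maximize the weighted sum over the new parameters \theta.
Q(\theta, \theta^{(t)})
  = \mathbb{E}_{z \sim p_{\theta^{(t)}}(\cdot \mid w)}\!\left[\log p_{\theta}(z, w)\right]
  = \sum_i \sum_{z} p_{\theta^{(t)}}(z \mid w_i)\,\log p_{\theta}(z, w_i),
\qquad
\theta^{(t+1)} = \arg\max_{\theta}\; Q(\theta, \theta^{(t)}).
```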
Why EM? Unsupervised Learning
(figure: a collection of unlabeled examples, each shown as "?")
NO labeled data:
- human annotated
- relatively small/few examples
unlabeled data:
- raw; not annotated
- plentiful
EM/generative models in this case can be seen as a type of clustering.
Why EM? Semi-Supervised Learning
(figure: a few labeled examples, shown as checkmarks, alongside many unlabeled examples shown as "?"; EM is applied to both)
labeled data:
- human annotated
- relatively small/few examples
unlabeled data:
- raw; not annotated
- plentiful
Outline
EM (Expectation Maximization)
- Basic idea
- Three coins example
- Why EM works
Three Coins Example
Imagine three coins.
- Flip the 1st coin (penny)
- If heads: flip the 2nd coin (dollar coin)
- If tails: flip the 3rd coin (dime)
We only observe the outcome of the second or third flip (record heads vs. tails); we don't observe the penny flip.
Observed: a, b, e, etc.; "We run the code" vs. "The run failed". Unobserved: part of speech? genre?
Penny: p(heads) = λ, p(tails) = 1 − λ
Dollar coin: p(heads) = γ, p(tails) = 1 − γ
Dime: p(heads) = ψ, p(tails) = 1 − ψ
Three parameters to estimate: λ, γ, and ψ
Generative Story for Three Coins
Modeling the observations alone (no latent variables):
p(w_1, w_2, \ldots, w_N) = p(w_1)\, p(w_2) \cdots p(w_N) = \prod_i p(w_i)
Add complexity (a latent variable z_i per item) to better explain what we see:
p(z_1, w_1, z_2, w_2, \ldots, z_N, w_N) = p(z_1)\, p(w_1 \mid z_1) \cdots p(z_N)\, p(w_N \mid z_N) = \prod_i p(w_i \mid z_i)\, p(z_i)

Generative Story
λ = distribution over the penny, γ = distribution for the dollar coin, ψ = distribution over the dime
for item i = 1 to N:
    z_i ~ Bernoulli(λ)
    if z_i = heads: w_i ~ Bernoulli(γ)
    else: w_i ~ Bernoulli(ψ)
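A small simulation of this story (a sketch; the function and variable names are mine, and any parameter values passed in are purely illustrative):

```python
import random

def sample_three_coins(n, lam, gamma, psi, seed=0):
    """Sample n items from the three-coins generative story.

    lam   = p(heads) for the penny (the hidden flip z_i)
    gamma = p(heads) for the dollar coin (used when z_i is heads)
    psi   = p(heads) for the dime (used when z_i is tails)
    Returns (hidden penny flips, observed flips), each a list of 'H'/'T'.
    """
    rng = random.Random(seed)
    z, w = [], []
    for _ in range(n):
        penny = 'H' if rng.random() < lam else 'T'
        if penny == 'H':
            obs = 'H' if rng.random() < gamma else 'T'   # flip the dollar coin
        else:
            obs = 'H' if rng.random() < psi else 'T'     # flip the dime
        z.append(penny)
        w.append(obs)
    return z, w

# e.g. sample_three_coins(6, lam=0.6, gamma=0.8, psi=0.6)
```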
Three Coins Example
If all flips were observed:
Penny (coin 1):                     H H T H T H
Observed flip (dollar coin or dime): H T H T T T
Then the maximum-likelihood estimates are just relative counts:
Penny: p(heads) = λ = 4/6, p(tails) = 2/6
Dollar coin: p(heads) = γ = 1/4, p(tails) = 3/4
Dime: p(heads) = ψ = 1/2, p(tails) = 1/2
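The same relative-count computation as a quick sketch in code, using the six flips above:

```python
# Fully observed case: maximum-likelihood estimates are relative counts.
penny = ['H', 'H', 'T', 'H', 'T', 'H']   # hidden coin, here assumed visible
obs   = ['H', 'T', 'H', 'T', 'T', 'T']   # dollar-coin or dime outcome

lam = penny.count('H') / len(penny)                         # 4/6
dollar_flips = [w for z, w in zip(penny, obs) if z == 'H']  # penny came up heads
dime_flips   = [w for z, w in zip(penny, obs) if z == 'T']  # penny came up tails
gamma = dollar_flips.count('H') / len(dollar_flips)         # 1/4
psi   = dime_flips.count('H') / len(dime_flips)             # 1/2
print(lam, gamma, psi)   # 0.666..., 0.25, 0.5
```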
Three Coins Example
But not all flips are observed, so set parameter values:
Penny: p(heads) = λ = .6, p(tails) = .4
Dollar coin: p(heads) = γ = .8, p(tails) = .2
Dime: p(heads) = ψ = .6, p(tails) = .4
Observed flips: H T H T T T (the penny flips H H T H T H are not observed)
Use these values to compute posteriors over the hidden penny flip:
p(heads | observed item H) = p(heads & H) / p(H)
p(heads | observed item T) = p(heads & T) / p(T)
Rewrite the joint using Bayes' rule; the denominator is the marginal likelihood:
p(heads | observed item H) = p(H | heads) p(heads) / p(H)
p(H | heads) = .8, p(T | heads) = .2
p(H) = p(H | heads) · p(heads) + p(H | tails) · p(tails) = .8 · .6 + .6 · .4
Three Coins Example
Use posteriors to update parameters:
p(heads | obs. H) = p(H | heads) p(heads) / p(H) = (.8 · .6) / (.8 · .6 + .6 · .4) ≈ 0.667
p(heads | obs. T) = p(T | heads) p(heads) / p(T) = (.2 · .6) / (.2 · .6 + .4 · .4) ≈ 0.429
Q: Is p(heads | obs. H) + p(heads | obs. T) = 1?
A: No.
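The same posterior arithmetic as a short sketch in code (parameter values from the slide; the variable names are mine):

```python
# Current guesses for the parameters.
lam   = 0.6   # penny:  p(heads)
gamma = 0.8   # dollar: p(heads), flipped when the penny is heads
psi   = 0.6   # dime:   p(heads), flipped when the penny is tails

# Posterior that the hidden penny flip was heads, given the observed outcome,
# via Bayes' rule: p(heads | obs) = p(obs | heads) p(heads) / p(obs).
p_heads_given_H = (gamma * lam) / (gamma * lam + psi * (1 - lam))
p_heads_given_T = ((1 - gamma) * lam) / ((1 - gamma) * lam + (1 - psi) * (1 - lam))

print(round(p_heads_given_H, 3))   # 0.667
print(round(p_heads_given_T, 3))   # 0.429
# Note: these two posteriors need not sum to 1; they condition on
# different observations.
```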
Three Coins Example
Use posteriors to update parameters.
Fully observed setting: p(heads) = (# heads from penny) / (# total flips of penny)
Our setting is partially observed, so use expected counts:
p^(t+1)(heads) = (# expected heads from penny) / (# total flips of penny) = E_{p^(t)}[# heads from penny] / (# total flips of penny)
(In general, p(heads | obs. H) and p(heads | obs. T) do NOT sum to 1.)
With p(heads | obs. H) ≈ 0.667 and p(heads | obs. T) ≈ 0.429 from above:
λ^(t+1) = (2 · p(heads | obs. H) + 4 · p(heads | obs. T)) / 6 ≈ 0.508
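And the update itself, continuing the sketch above (the observed sequence H T H T T T has 2 heads and 4 tails):

```python
# Posteriors from the E-step above.
p_heads_given_H = 0.667
p_heads_given_T = 0.429

# Observed outcomes: H T H T T T  ->  2 heads, 4 tails.
n_H, n_T = 2, 4

# Expected number of penny-heads: each observed H contributes
# p(heads | obs. H), each observed T contributes p(heads | obs. T).
expected_penny_heads = n_H * p_heads_given_H + n_T * p_heads_given_T

# M-step: treat the expected count as if it were an observed count.
lam_next = expected_penny_heads / (n_H + n_T)
print(round(lam_next, 3))   # ~0.508
```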
Expectation Maximization (EM)
- 0. Assume some value for your parameters
Two-step, iterative algorithm (the three-coins example below puts both steps together in code):
- 1. E-step: count under uncertainty (compute expectations)
- 2. M-step: maximize log-likelihood, assuming these uncertain counts
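Putting the pieces together for the three coins, here is a sketch of the full EM loop. The slides only walk through the update for λ; re-estimating γ and ψ from their expected counts, as done here, follows the same pattern, so treat those two updates as my own extension of the example:

```python
def em_three_coins(obs, lam, gamma, psi, n_iters=20):
    """EM for the three-coins model on observed flips obs (list of 'H'/'T')."""
    n_H = obs.count('H')
    n_T = obs.count('T')
    for _ in range(n_iters):
        # E-step: posterior that the hidden penny flip was heads.
        q_H = gamma * lam / (gamma * lam + psi * (1 - lam))                    # given obs 'H'
        q_T = (1 - gamma) * lam / ((1 - gamma) * lam + (1 - psi) * (1 - lam))  # given obs 'T'
        # Expected counts.
        exp_penny_heads  = n_H * q_H + n_T * q_T
        exp_dollar_flips = exp_penny_heads        # dollar coin flipped when penny = heads
        exp_dollar_heads = n_H * q_H              # ...and the observed flip was heads
        exp_dime_flips   = (n_H + n_T) - exp_penny_heads
        exp_dime_heads   = n_H * (1 - q_H)
        # M-step: re-estimate each parameter from its expected counts.
        lam   = exp_penny_heads / (n_H + n_T)
        gamma = exp_dollar_heads / exp_dollar_flips
        psi   = exp_dime_heads / exp_dime_flips
    return lam, gamma, psi

# On the slide's data and starting values this settles quickly,
# to roughly (0.508, 0.438, 0.226).
print(em_three_coins(list('HTHTTT'), lam=0.6, gamma=0.8, psi=0.6))
```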
Outline
EM (Expectation Maximization)
- Basic idea
- Three coins example
- Why EM works
Why does EM work?
X: observed data; Y: unobserved data
ℒ(θ) = log-likelihood of complete data (X, Y)
ℰ(θ) = posterior log-likelihood of incomplete data Y
ℳ(θ) = marginal log-likelihood of observed data X
What do ℒ, ℳ, ℰ look like?
\mathcal{L}(\theta) = \sum_i \log p(x_i, y_i)
\mathcal{M}(\theta) = \sum_i \log p(x_i) = \sum_i \log \sum_{k} p(x_i, y = k)
\mathcal{E}(\theta) = \sum_i \log p(y_i \mid x_i)
From the definition of conditional probability, p_θ(Y | X) = p_θ(X, Y) / p_θ(X); rearranging (algebra), p_θ(X) = p_θ(X, Y) / p_θ(Y | X).
Taking logs and summing over the data points:
\mathcal{M}(\theta) = \mathcal{L}(\theta) - \mathcal{E}(\theta)
Take a conditional expectation with respect to Y ∼ θ^(t) (why? we'll cover this more in variational inference):
\mathbb{E}_{Y \sim \theta^{(t)}}[\mathcal{M}(\theta) \mid X] = \mathbb{E}_{Y \sim \theta^{(t)}}[\mathcal{L}(\theta) \mid X] - \mathbb{E}_{Y \sim \theta^{(t)}}[\mathcal{E}(\theta) \mid X]
ℳ already sums over Y, so its expectation is just ℳ(θ) itself:
\mathcal{M}(\theta) = \mathbb{E}_{Y \sim \theta^{(t)}}[\mathcal{L}(\theta) \mid X] - \mathbb{E}_{Y \sim \theta^{(t)}}[\mathcal{E}(\theta) \mid X]
The first term is the familiar EM objective:
\mathbb{E}_{Y \sim \theta^{(t)}}[\mathcal{L}(\theta) \mid X] = \sum_i \sum_{k} p_{\theta^{(t)}}(y = k \mid x_i) \, \log p_{\theta}(x_i, y = k)
Call the two terms Q(θ, θ^(t)) = 𝔼_{Y∼θ^(t)}[ℒ(θ) | X] and R(θ, θ^(t)) = 𝔼_{Y∼θ^(t)}[ℰ(θ) | X], so
\mathcal{M}(\theta) = Q(\theta, \theta^{(t)}) - R(\theta, \theta^{(t)})
Let θ* be the value that maximizes Q(θ, θ^(t)).
Comparing θ* to the current parameters θ^(t):
\mathcal{M}(\theta^*) - \mathcal{M}(\theta^{(t)}) = \left[ Q(\theta^*, \theta^{(t)}) - Q(\theta^{(t)}, \theta^{(t)}) \right] - \left[ R(\theta^*, \theta^{(t)}) - R(\theta^{(t)}, \theta^{(t)}) \right]
The first bracket is ≥ 0 (θ* maximizes Q); the second bracket is ≤ 0 (we'll see why with Jensen's inequality, in variational inference).
Therefore ℳ(θ*) − ℳ(θ^(t)) ≥ 0: EM does not decrease the marginal log-likelihood.
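A quick numerical sanity check of that claim on the three-coins example (a sketch under the same setup as earlier: it runs the EM updates and asserts that the marginal log-likelihood ℳ never drops):

```python
import math

def marginal_log_likelihood(obs, lam, gamma, psi):
    """M(theta): sum_i log p(w_i), marginalizing out the hidden penny flip."""
    p_H = gamma * lam + psi * (1 - lam)   # p(observed flip = H)
    return sum(math.log(p_H if w == 'H' else 1.0 - p_H) for w in obs)

obs = list('HTHTTT')
n_H, n_T = obs.count('H'), obs.count('T')
lam, gamma, psi = 0.6, 0.8, 0.6

prev = marginal_log_likelihood(obs, lam, gamma, psi)
for _ in range(10):
    # One EM iteration (same E- and M-steps as in the earlier sketch).
    q_H = gamma * lam / (gamma * lam + psi * (1 - lam))                    # p(penny = H | obs H)
    q_T = (1 - gamma) * lam / ((1 - gamma) * lam + (1 - psi) * (1 - lam))  # p(penny = H | obs T)
    lam   = (n_H * q_H + n_T * q_T) / (n_H + n_T)
    gamma = n_H * q_H / (n_H * q_H + n_T * q_T)
    psi   = n_H * (1 - q_H) / (n_H * (1 - q_H) + n_T * (1 - q_T))
    cur = marginal_log_likelihood(obs, lam, gamma, psi)
    assert cur >= prev - 1e-12   # the marginal log-likelihood never decreases
    prev = cur

print(round(prev, 4))   # roughly -3.8191 for this tiny dataset
```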
Generalized EM
- Partial M-step: find a θ that simply increases, rather than maximizes, Q
- Partial E-step: only consider some of the variables (an online learning algorithm)
EM has its pitfalls
- The objective is not convex: EM can converge to a bad local optimum
- Computing expectations can be hard: the E-step could require clever algorithms
- How well does log-likelihood correlate with an end task?
A Maximization-Maximization Procedure
G(q, \theta) = \mathbb{E}_{q}[\mathcal{L}(\theta)] - \mathbb{E}_{q}[\log q(Z)]
Here ℒ(θ) is the complete-data log-likelihood and q is any distribution over the unobserved data Z; G is a lower bound on the observed-data log-likelihood.
We'll see this again with variational inference.