SLIDE 1

Expectation Maximization

CMSC 691 UMBC

SLIDE 2

Outline

EM (Expectation Maximization)
  β€’ Basic idea
  β€’ Three coins example
  β€’ Why EM works

SLIDE 3

Expectation Maximization (EM)

  • 0. Assume some value for your parameters

Two step, iterative algorithm

  • 1. E-step: count under uncertainty (compute expectations)
  • 2. M-step: maximize log-likelihood, assuming these

uncertain counts

SLIDE 4

Expectation Maximization (EM): E-step

  • 0. Assume some value for your parameters

Two step, iterative algorithm

  • 1. E-step: count under uncertainty, assuming these

parameters

  • 2. M-step: maximize log-likelihood, assuming these

uncertain counts

count(𝑨𝑗, π‘₯𝑗) π‘ž(𝑨𝑗)

SLIDE 5

Expectation Maximization (EM): E-step

  • 0. Assume some value for your parameters

Two step, iterative algorithm

  • 1. E-step: count under uncertainty, assuming these

parameters

  • 2. M-step: maximize log-likelihood, assuming these

uncertain counts

count(𝑨𝑗, π‘₯𝑗) π‘ž(𝑨𝑗)

We’ve already seen this type of counting, when computing the gradient in maxent models.
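To make β€œcounting under uncertainty” concrete, here is a minimal sketch (the variable names and posterior numbers are illustrative assumptions, not from the slides): instead of adding 1 for an observed latent value, we add each value's posterior probability, the same kind of expected count used in maxent gradients.

```python
# Minimal sketch of a "soft" (expected) count for one item with a discrete latent z.
# The posterior values below are illustrative assumptions.
posterior = {"H": 0.7, "T": 0.3}   # p(z_j = value | w_j, current parameters)

# Fully observed: count(z_j = "H") would be 0 or 1.
# Under uncertainty: accumulate the posterior probability mass instead.
expected_counts = {value: 0.0 for value in posterior}
for value, prob in posterior.items():
    expected_counts[value] += prob   # fractional counts

print(expected_counts)   # {'H': 0.7, 'T': 0.3}
```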

SLIDE 6

Expectation Maximization (EM): M-step

  • 0. Assume some value for your parameters

Two step, iterative algorithm

  • 1. E-step: count under uncertainty, assuming these

parameters

  • 2. M-step: maximize log-likelihood, assuming these

uncertain counts

π‘ž 𝑒+1 (𝑨) π‘ž(𝑒)(𝑨)

estimated counts

SLIDE 7

EM Math

max  𝔼_{z ∼ p_{ΞΈ^(t)}(Β·|w)} [ log p_ΞΈ(z, w) ]

the average log-likelihood of our complete data (z, w), averaged across all z and according to how likely our current model thinks z is

SLIDE 8

EM Math

max_ΞΈ  𝔼_{z ∼ p_{ΞΈ^(t)}(Β·|w)} [ log p_ΞΈ(z, w) ]

maximize the average log-likelihood of our complete data (z, w), averaged across all z and according to how likely our current model thinks z is

SLIDE 9

EM Math

max_ΞΈ  𝔼_{z ∼ p_{ΞΈ^(t)}(Β·|w)} [ log p_ΞΈ(z, w) ]

maximize the average log-likelihood of our complete data (z, w), averaged across all z and according to how likely our current model thinks z is

SLIDE 10

EM Math

max_ΞΈ  𝔼_{z ∼ p_{ΞΈ^(t)}(Β·|w)} [ log p_ΞΈ(z, w) ]

p_{ΞΈ^(t)}(Β·|w) is the posterior distribution under the current parameters ΞΈ^(t)

maximize the average log-likelihood of our complete data (z, w), averaged across all z and according to how likely our current model thinks z is

SLIDE 11

EM Math

max_ΞΈ  𝔼_{z ∼ p_{ΞΈ^(t)}(Β·|w)} [ log p_ΞΈ(z, w) ]

ΞΈ^(t): current parameters;  ΞΈ: new parameters;  p_{ΞΈ^(t)}(Β·|w): posterior distribution

maximize the average log-likelihood of our complete data (z, w), averaged across all z and according to how likely our current model thinks z is

SLIDE 12

EM Math

max_ΞΈ  𝔼_{z ∼ p_{ΞΈ^(t)}(Β·|w)} [ log p_ΞΈ(z, w) ]

E-step: count under uncertainty (the expectation, under the posterior p_{ΞΈ^(t)}(Β·|w))
M-step: maximize log-likelihood (the max over the new parameters ΞΈ)

ΞΈ^(t): current parameters;  ΞΈ: new parameters;  p_{ΞΈ^(t)}(Β·|w): posterior distribution

maximize the average log-likelihood of our complete data (z, w), averaged across all z and according to how likely our current model thinks z is
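As a small illustration of this objective (a sketch with my own function names; the parameter values mirror the three-coins numbers used later), the expectation can be evaluated directly for a single observation w with a binary latent z:

```python
import math

# Sketch: evaluate E_{z ~ p_{theta^(t)}(.|w)} [ log p_theta(z, w) ] for one observation.
# theta = (prior, emit): prior = p(z = "H"); emit[z] = p(w = "H" | z). Names are mine.
def log_joint(z, w, theta):
    prior, emit = theta
    p_z = prior if z == "H" else 1 - prior
    p_w = emit[z] if w == "H" else 1 - emit[z]
    return math.log(p_z) + math.log(p_w)

def posterior(w, theta_t):
    # p_{theta^(t)}(z | w), by normalizing the joint over the two values of z
    joint = {z: math.exp(log_joint(z, w, theta_t)) for z in ("H", "T")}
    total = sum(joint.values())
    return {z: v / total for z, v in joint.items()}

def em_objective(w, theta, theta_t):
    q = posterior(w, theta_t)
    return sum(q[z] * log_joint(z, w, theta) for z in q)

theta_t = (0.6, {"H": 0.8, "T": 0.6})       # current parameters (illustrative values)
print(em_objective("H", theta_t, theta_t))  # expected complete-data log-likelihood at theta = theta^(t)
```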

SLIDE 13

Why EM? Unsupervised Learning

NO labeled data (labeled data = human annotated; relatively small/few examples)

unlabeled data:
  β€’ raw; not annotated
  β€’ plentiful

EM/generative models in this case can be seen as a type of clustering

SLIDE 14

Why EM? Semi-Supervised Learning

labeled data:
  β€’ human annotated
  β€’ relatively small/few examples

unlabeled data:
  β€’ raw; not annotated
  β€’ plentiful
SLIDE 15

Why EM? Semi-Supervised Learning

labeled data:
  β€’ human annotated
  β€’ relatively small/few examples

unlabeled data:
  β€’ raw; not annotated
  β€’ plentiful

EM

SLIDE 16

Why EM? Semi-Supervised Learning

labeled data:
  β€’ human annotated
  β€’ relatively small/few examples

unlabeled data:
  β€’ raw; not annotated
  β€’ plentiful
SLIDE 17

Why EM? Semi-Supervised Learning

EM

SLIDE 18

Outline

EM (Expectation Maximization)
  β€’ Basic idea
  β€’ Three coins example
  β€’ Why EM works

SLIDE 19

Three Coins Example

Imagine three coins:
  β€’ Flip 1st coin (penny)
  β€’ If heads: flip 2nd coin (dollar coin)
  β€’ If tails: flip 3rd coin (dime)

SLIDE 20

Three Coins Example

Imagine three coins:
  β€’ Flip 1st coin (penny)
  β€’ If heads: flip 2nd coin (dollar coin)
  β€’ If tails: flip 3rd coin (dime)

We only observe the second flips (record the heads vs. tails outcome); we don't observe the penny flip.

SLIDE 21

Three Coins Example

Imagine three coins:
  β€’ Flip 1st coin (penny)
  β€’ If heads: flip 2nd coin (dollar coin)
  β€’ If tails: flip 3rd coin (dime)

observed: a, b, e, etc.; β€œWe run the code,” vs. β€œThe run failed”
unobserved: part of speech? genre?

SLIDE 22

Three Coins Example

Imagine three coins:
  β€’ Flip 1st coin (penny)
  β€’ If heads: flip 2nd coin (dollar coin)
  β€’ If tails: flip 3rd coin (dime)

penny:        p(heads) = Ξ»,  p(tails) = 1 βˆ’ Ξ»
dollar coin:  p(heads) = Ξ³,  p(tails) = 1 βˆ’ Ξ³
dime:         p(heads) = ψ,  p(tails) = 1 βˆ’ ψ

SLIDE 23

Three Coins Example

Imagine three coins

penny:        p(heads) = Ξ»,  p(tails) = 1 βˆ’ Ξ»
dollar coin:  p(heads) = Ξ³,  p(tails) = 1 βˆ’ Ξ³
dime:         p(heads) = ψ,  p(tails) = 1 βˆ’ ψ

Three parameters to estimate: Ξ», Ξ³, and ψ

SLIDE 24

Generative Story for Three Coins

π‘ž π‘₯1, π‘₯2, … , π‘₯𝑂 = π‘ž π‘₯1 π‘ž π‘₯2 β‹― π‘ž π‘₯𝑂 = ΰ·‘

𝑗

π‘ž π‘₯𝑗 π‘ž 𝑨1, π‘₯1, 𝑨2, π‘₯2, … , 𝑨𝑂, π‘₯𝑂 = π‘ž 𝑨1 π‘ž π‘₯1|𝑨1 β‹― π‘ž 𝑨𝑂 π‘ž π‘₯𝑂|𝑨𝑂 = ΰ·‘

𝑗

π‘ž π‘₯𝑗|𝑨𝑗 π‘ž 𝑨𝑗

add complexity to better explain what we see

π‘ž heads = πœ‡ π‘ž tails = 1 βˆ’ πœ‡ π‘ž heads = 𝛿 π‘ž heads = πœ” π‘ž tails = 1 βˆ’ 𝛿 π‘ž tails = 1 βˆ’ πœ” for item 𝑗 = 1 to 𝑂: 𝑨𝑗 ~ Bernoulli πœ‡

Generative Story

πœ‡ = distribution over penny 𝛿 = distribution for dollar coin πœ” = distribution over dime if 𝑨𝑗 = 𝐼: π‘₯𝑗 ~ Bernoulli 𝛿 else: π‘₯𝑗 ~ Bernoulli πœ”

SLIDE 25

Three Coins Example

If all flips were observed:

penny:                H H T H T H
dollar coin or dime:  H T H T T T

penny:        p(heads) = Ξ»,  p(tails) = 1 βˆ’ Ξ»
dollar coin:  p(heads) = Ξ³,  p(tails) = 1 βˆ’ Ξ³
dime:         p(heads) = ψ,  p(tails) = 1 βˆ’ ψ

SLIDE 26

Three Coins Example

If all flips were observed:

penny:                H H T H T H
dollar coin or dime:  H T H T T T

penny:        p(heads) = Ξ» = 4/6,  p(tails) = 2/6
dollar coin:  p(heads) = Ξ³ = 1/4,  p(tails) = 3/4
dime:         p(heads) = ψ = 1/2,  p(tails) = 1/2

SLIDE 27

Three Coins Example

But not all flips are observed β†’ set parameter values

penny (unobserved):  H H T H T H
observed flips:      H T H T T T

penny:        p(heads) = Ξ» = .6,  p(tails) = .4
dollar coin:  p(heads) = Ξ³ = .8,  p(tails) = .2
dime:         p(heads) = ψ = .6,  p(tails) = .4

SLIDE 28

Three Coins Example

But not all flips are observed β†’ set parameter values

penny (unobserved):  H H T H T H
observed flips:      H T H T T T

penny:        p(heads) = Ξ» = .6,  p(tails) = .4
dollar coin:  p(heads) = Ξ³ = .8,  p(tails) = .2
dime:         p(heads) = ψ = .6,  p(tails) = .4

Use these values to compute posteriors:

p(heads | observed item H) = p(heads & H) / p(H)
p(heads | observed item T) = p(heads & T) / p(T)

SLIDE 29

Three Coins Example

But not all flips are observed β†’ set parameter values

penny (unobserved):  H H T H T H
observed flips:      H T H T T T

penny:        p(heads) = Ξ» = .6,  p(tails) = .4
dollar coin:  p(heads) = Ξ³ = .8,  p(tails) = .2
dime:         p(heads) = ψ = .6,  p(tails) = .4

Use these values to compute posteriors:

p(heads | observed item H) = p(H | heads) p(heads) / p(H)

rewrite the joint using Bayes' rule (numerator); p(H) is the marginal likelihood (denominator)

SLIDE 30

Three Coins Example

But not all flips are observed β†’ set parameter values

penny (unobserved):  H H T H T H
observed flips:      H T H T T T

penny:        p(heads) = Ξ» = .6,  p(tails) = .4
dollar coin:  p(heads) = Ξ³ = .8,  p(tails) = .2
dime:         p(heads) = ψ = .6,  p(tails) = .4

Use these values to compute posteriors:

p(heads | observed item H) = p(H | heads) p(heads) / p(H)

p(H | heads) = .8,  p(T | heads) = .2

SLIDE 31

Three Coins Example

But not all flips are observed β†’ set parameter values

penny (unobserved):  H H T H T H
observed flips:      H T H T T T

penny:        p(heads) = Ξ» = .6,  p(tails) = .4
dollar coin:  p(heads) = Ξ³ = .8,  p(tails) = .2
dime:         p(heads) = ψ = .6,  p(tails) = .4

Use these values to compute posteriors:

p(heads | observed item H) = p(H | heads) p(heads) / p(H)
p(H | heads) = .8,  p(T | heads) = .2
p(H) = p(H | heads) p(heads) + p(H | tails) p(tails) = .8 Γ— .6 + .6 Γ— .4 = .72

SLIDE 32

Three Coins Example

penny (unobserved):  H H T H T H
observed flips:      H T H T T T

Use posteriors to update parameters

p(heads | obs. H) = p(H | heads) p(heads) / p(H) = (.8 Γ— .6) / (.8 Γ— .6 + .6 Γ— .4) β‰ˆ 0.667
p(heads | obs. T) = p(T | heads) p(heads) / p(T) = (.2 Γ— .6) / (.2 Γ— .6 + .4 Γ— .4) β‰ˆ 0.429

Q: Is p(heads | obs. H) + p(heads | obs. T) = 1?

SLIDE 33

Three Coins Example

penny (unobserved):  H H T H T H
observed flips:      H T H T T T

Use posteriors to update parameters

p(heads | obs. H) = p(H | heads) p(heads) / p(H) = (.8 Γ— .6) / (.8 Γ— .6 + .6 Γ— .4) β‰ˆ 0.667
p(heads | obs. T) = p(T | heads) p(heads) / p(T) = (.2 Γ— .6) / (.2 Γ— .6 + .4 Γ— .4) β‰ˆ 0.429

Q: Is p(heads | obs. H) + p(heads | obs. T) = 1?  A: No.
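The same computation in code (a sketch; the function name is mine, the parameter values are the slides' settings Ξ» = .6, Ξ³ = .8, ψ = .6):

```python
# Posterior over the hidden penny flip given one observed outcome, via Bayes' rule.
lam, gamma, psi = 0.6, 0.8, 0.6   # the parameter values set on the earlier slides

def penny_posterior(obs):
    """p(penny = heads | observed flip obs)."""
    p_obs_given_heads = gamma if obs == "H" else 1 - gamma   # dollar coin emits obs
    p_obs_given_tails = psi if obs == "H" else 1 - psi       # dime emits obs
    joint_heads = p_obs_given_heads * lam
    joint_tails = p_obs_given_tails * (1 - lam)
    return joint_heads / (joint_heads + joint_tails)          # divide by the marginal p(obs)

print(round(penny_posterior("H"), 3))   # ~0.667
print(round(penny_posterior("T"), 3))   # ~0.429; the two posteriors need not sum to 1
```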

SLIDE 34

Three Coins Example

penny (unobserved):  H H T H T H
observed flips:      H T H T T T

Use posteriors to update parameters

fully observed setting:  p(heads) = (# heads from penny) / (# total flips of penny)

our setting (partially observed):  p(heads) = (# expected heads from penny) / (# total flips of penny)

p(heads | obs. H) = (.8 Γ— .6) / (.8 Γ— .6 + .6 Γ— .4) β‰ˆ 0.667
p(heads | obs. T) = (.2 Γ— .6) / (.2 Γ— .6 + .4 Γ— .4) β‰ˆ 0.429

(in general, p(heads | obs. H) and p(heads | obs. T) do NOT sum to 1)

SLIDE 35

Three Coins Example

penny (unobserved):  H H T H T H
observed flips:      H T H T T T

Use posteriors to update parameters

our setting (partially observed):

p^(t+1)(heads) = (# expected heads from penny) / (# total flips of penny)
             = 𝔼_{p^(t)}[# heads from penny] / (# total flips of penny)

p(heads | obs. H) = (.8 Γ— .6) / (.8 Γ— .6 + .6 Γ— .4) β‰ˆ 0.667
p(heads | obs. T) = (.2 Γ— .6) / (.2 Γ— .6 + .4 Γ— .4) β‰ˆ 0.429

SLIDE 36

Three Coins Example

penny (unobserved):  H H T H T H
observed flips:      H T H T T T

Use posteriors to update parameters

our setting (partially observed):

p^(t+1)(heads) = 𝔼_{p^(t)}[# heads from penny] / (# total flips of penny)
             = (2 Γ— p(heads | obs. H) + 4 Γ— p(heads | obs. T)) / 6
             = (2 Γ— 0.667 + 4 Γ— 0.429) / 6 β‰ˆ 0.508

p(heads | obs. H) = (.8 Γ— .6) / (.8 Γ— .6 + .6 Γ— .4) β‰ˆ 0.667
p(heads | obs. T) = (.2 Γ— .6) / (.2 Γ— .6 + .4 Γ— .4) β‰ˆ 0.429

SLIDE 37

Expectation Maximization (EM)

  • 0. Assume some value for your parameters

Two step, iterative algorithm:

  • 1. E-step: count under uncertainty (compute expectations)
  • 2. M-step: maximize log-likelihood, assuming these

uncertain counts

SLIDE 38

Outline

EM (Expectation Maximization)
  β€’ Basic idea
  β€’ Three coins example
  β€’ Why EM works

SLIDE 39

Why does EM work?

π‘Œ: observed data 𝑍: unobserved data π’Ÿ πœ„ = log-likelihood of complete data (X,Y) 𝒬 πœ„ = posterior log-likelihood of incomplete data Y β„³ πœ„ = marginal log-likelihood of

  • bserved data X

what do π’Ÿ, β„³, 𝒬 look like?

SLIDE 40

Why does EM work?

π‘Œ: observed data 𝑍: unobserved data π’Ÿ πœ„ = log-likelihood of complete data (X,Y) 𝒬 πœ„ = posterior log-likelihood of incomplete data Y β„³ πœ„ = marginal log-likelihood of

  • bserved data X

π’Ÿ πœ„ = ෍

𝑗

log π‘ž(𝑦𝑗, 𝑧𝑗)

SLIDE 41

Why does EM work?

π‘Œ: observed data 𝑍: unobserved data π’Ÿ πœ„ = log-likelihood of complete data (X,Y) 𝒬 πœ„ = posterior log-likelihood of incomplete data Y β„³ πœ„ = marginal log-likelihood of

  • bserved data X

π’Ÿ πœ„ = ෍

𝑗

log π‘ž(𝑦𝑗, 𝑧𝑗) β„³ πœ„ = ෍

𝑗

log π‘ž(𝑦𝑗) = ෍

𝑗

log ෍

𝑙

π‘ž(𝑦𝑗, 𝑧 = 𝑙)

SLIDE 42

Why does EM work?

π‘Œ: observed data 𝑍: unobserved data π’Ÿ πœ„ = log-likelihood of complete data (X,Y) 𝒬 πœ„ = posterior log-likelihood of incomplete data Y β„³ πœ„ = marginal log-likelihood of

  • bserved data X

π’Ÿ πœ„ = ෍

𝑗

log π‘ž(𝑦𝑗, 𝑧𝑗) β„³ πœ„ = ෍

𝑗

log π‘ž(𝑦𝑗) = ෍

𝑗

log ෍

𝑙

π‘ž(𝑦𝑗, 𝑧 = 𝑙) 𝒬 πœ„ = ෍

𝑗

log π‘ž 𝑧𝑗 𝑦𝑗)

SLIDE 43

Why does EM work?

π‘Œ: observed data 𝑍: unobserved data π’Ÿ πœ„ = log-likelihood of complete data (X,Y)

π‘žπœ„ 𝑍 π‘Œ) = π‘žπœ„(π‘Œ, 𝑍) π‘žπœ„(π‘Œ) π‘žπœ„(π‘Œ) = π‘žπœ„(π‘Œ, 𝑍) π‘žπœ„ 𝑍 π‘Œ)

𝒬 πœ„ = posterior log-likelihood of incomplete data Y definition of conditional probability algebra β„³ πœ„ = marginal log-likelihood of

  • bserved data X
SLIDE 44

Why does EM work?

π‘Œ: observed data 𝑍: unobserved data π’Ÿ πœ„ = log-likelihood of complete data (X,Y)

π‘žπœ„ 𝑍 π‘Œ) = π‘žπœ„(π‘Œ, 𝑍) π‘žπœ„(π‘Œ) π‘žπœ„(π‘Œ) = π‘žπœ„(π‘Œ, 𝑍) π‘žπœ„ 𝑍 π‘Œ)

𝒬 πœ„ = posterior log-likelihood of incomplete data Y β„³ πœ„ = marginal log-likelihood of

  • bserved data X

β„³ πœ„ = π’Ÿ πœ„ βˆ’ 𝒬 πœ„

π’Ÿ πœ„ = ෍

𝑗

log π‘ž(𝑦𝑗, 𝑧𝑗) β„³ πœ„ = ෍

𝑗

log π‘ž(𝑦𝑗) = ෍

𝑗

log ෍

𝑙

π‘ž(𝑦𝑗, 𝑧 = 𝑙) 𝒬 πœ„ = ෍

𝑗

log π‘ž 𝑧𝑗 𝑦𝑗)

SLIDE 45

Why does EM work?

π‘Œ: observed data 𝑍: unobserved data π’Ÿ πœ„ = log-likelihood of complete data (X,Y)

π‘žπœ„ 𝑍 π‘Œ) = π‘žπœ„(π‘Œ, 𝑍) π‘žπœ„(π‘Œ) π‘žπœ„(π‘Œ) = π‘žπœ„(π‘Œ, 𝑍) π‘žπœ„ 𝑍 π‘Œ)

𝒬 πœ„ = posterior log-likelihood of incomplete data Y β„³ πœ„ = marginal log-likelihood of

  • bserved data X

β„³ πœ„ = π’Ÿ πœ„ βˆ’ 𝒬 πœ„ π”½π‘βˆΌπœ„(𝑒)[β„³ πœ„ |π‘Œ] = π”½π‘βˆΌπœ„(𝑒)[π’Ÿ πœ„ |π‘Œ] βˆ’ π”½π‘βˆΌπœ„(𝑒)[𝒬 πœ„ |π‘Œ]

take a conditional expectation (why? we’ll cover this more in variational inference)

SLIDE 46

Why does EM work?

π‘Œ: observed data 𝑍: unobserved data π’Ÿ πœ„ = log-likelihood of complete data (X,Y)

π‘žπœ„ 𝑍 π‘Œ) = π‘žπœ„(π‘Œ, 𝑍) π‘žπœ„(π‘Œ) π‘žπœ„(π‘Œ) = π‘žπœ„(π‘Œ, 𝑍) π‘žπœ„ 𝑍 π‘Œ)

𝒬 πœ„ = posterior log-likelihood of incomplete data Y β„³ πœ„ = marginal log-likelihood of

  • bserved data X

β„³ πœ„ = π’Ÿ πœ„ βˆ’ 𝒬 πœ„ π”½π‘βˆΌπœ„(𝑒)[β„³ πœ„ |π‘Œ] = π”½π‘βˆΌπœ„(𝑒)[π’Ÿ πœ„ |π‘Œ] βˆ’ π”½π‘βˆΌπœ„(𝑒)[𝒬 πœ„ |π‘Œ] β„³ πœ„ = π”½π‘βˆΌπœ„(𝑒)[π’Ÿ πœ„ |π‘Œ] βˆ’ π”½π‘βˆΌπœ„(𝑒)[𝒬 πœ„ |π‘Œ]

β„³ already sums over Y

β„³ πœ„ = ෍

𝑗

logπ‘ž(𝑦𝑗) = ෍

𝑗

log෍

𝑙

π‘ž(𝑦𝑗, 𝑧 = 𝑙)

SLIDE 47

Why does EM work?

π‘Œ: observed data 𝑍: unobserved data π’Ÿ πœ„ = log-likelihood of complete data (X,Y) 𝒬 πœ„ = posterior log-likelihood of incomplete data Y β„³ πœ„ = marginal log-likelihood of

  • bserved data X

β„³ πœ„ = π”½π‘βˆΌπœ„(𝑒)[π’Ÿ πœ„ |π‘Œ] βˆ’ π”½π‘βˆΌπœ„(𝑒)[𝒬 πœ„ |π‘Œ] π”½π‘βˆΌπœ„(𝑒) π’Ÿ πœ„ π‘Œ = ෍

𝑗

෍

𝑙

π‘žπœ„(𝑒) 𝑧 = 𝑙 𝑦𝑗) log π‘ž(𝑦𝑗, 𝑧 = 𝑙)

SLIDE 48

Why does EM work?

π‘Œ: observed data 𝑍: unobserved data π’Ÿ πœ„ = log-likelihood of complete data (X,Y) 𝒬 πœ„ = posterior log-likelihood of incomplete data Y β„³ πœ„ = marginal log-likelihood of

  • bserved data X

β„³ πœ„ = π”½π‘βˆΌπœ„(𝑒)[π’Ÿ πœ„ |π‘Œ] βˆ’ π”½π‘βˆΌπœ„(𝑒)[𝒬 πœ„ |π‘Œ]

𝑅(πœ„, πœ„(𝑒)) 𝑆(πœ„, πœ„(𝑒))

Let πœ„βˆ— be the value that maximizes 𝑅(πœ„, πœ„(𝑒))

SLIDE 49

Why does EM work?

π‘Œ: observed data 𝑍: unobserved data π’Ÿ πœ„ = log-likelihood of complete data (X,Y) 𝒬 πœ„ = posterior log-likelihood of incomplete data Y β„³ πœ„ = marginal log-likelihood of

  • bserved data X

β„³ πœ„ = π”½π‘βˆΌπœ„(𝑒)[π’Ÿ πœ„ |π‘Œ] βˆ’ π”½π‘βˆΌπœ„(𝑒)[𝒬 πœ„ |π‘Œ]

𝑅(πœ„, πœ„(𝑒)) 𝑆(πœ„, πœ„(𝑒))

β„³ πœ„βˆ— βˆ’ β„³ πœ„ 𝑒 = 𝑅 πœ„βˆ—, πœ„(𝑒) βˆ’ 𝑅(πœ„(𝑒), πœ„(𝑒)) βˆ’ 𝑆 πœ„βˆ—, πœ„(𝑒) βˆ’ 𝑆(πœ„(𝑒), πœ„(𝑒))

Let πœ„βˆ— be the value that maximizes 𝑅(πœ„, πœ„(𝑒))

SLIDE 50

Why does EM work?

π‘Œ: observed data 𝑍: unobserved data π’Ÿ πœ„ = log-likelihood of complete data (X,Y) 𝒬 πœ„ = posterior log-likelihood of incomplete data Y β„³ πœ„ = marginal log-likelihood of

  • bserved data X

β„³ πœ„ = π”½π‘βˆΌπœ„(𝑒)[π’Ÿ πœ„ |π‘Œ] βˆ’ π”½π‘βˆΌπœ„(𝑒)[𝒬 πœ„ |π‘Œ]

𝑅(πœ„, πœ„(𝑒)) 𝑆(πœ„, πœ„(𝑒))

β„³ πœ„βˆ— βˆ’ β„³ πœ„ 𝑒 = 𝑅 πœ„βˆ—, πœ„(𝑒) βˆ’ 𝑅(πœ„(𝑒), πœ„(𝑒)) βˆ’ 𝑆 πœ„βˆ—, πœ„(𝑒) βˆ’ 𝑆(πœ„(𝑒), πœ„(𝑒))

Let πœ„βˆ— be the value that maximizes 𝑅(πœ„, πœ„(𝑒))

β‰₯ 0 ≀ 0 (we’ll see why with Jensen’s

inequality, in variational inference)

SLIDE 51

Why does EM work?

π‘Œ: observed data 𝑍: unobserved data π’Ÿ πœ„ = log-likelihood of complete data (X,Y) 𝒬 πœ„ = posterior log-likelihood of incomplete data Y β„³ πœ„ = marginal log-likelihood of

  • bserved data X

β„³ πœ„ = π”½π‘βˆΌπœ„(𝑒)[π’Ÿ πœ„ |π‘Œ] βˆ’ π”½π‘βˆΌπœ„(𝑒)[𝒬 πœ„ |π‘Œ]

𝑅(πœ„, πœ„(𝑒)) 𝑆(πœ„, πœ„(𝑒))

β„³ πœ„βˆ— βˆ’ β„³ πœ„ 𝑒 = 𝑅 πœ„βˆ—, πœ„(𝑒) βˆ’ 𝑅(πœ„(𝑒), πœ„(𝑒)) βˆ’ 𝑆 πœ„βˆ—, πœ„(𝑒) βˆ’ 𝑆(πœ„(𝑒), πœ„(𝑒))

Let πœ„βˆ— be the value that maximizes 𝑅(πœ„, πœ„(𝑒))

β„³ πœ„βˆ— βˆ’ β„³ πœ„ 𝑒 β‰₯ 0 EM does not decrease the marginal log-likelihood

SLIDE 52

Generalized EM

Partial M-step: find a ΞΈ that simply increases, rather than maximizes, Q
Partial E-step: only consider some of the variables (an online learning algorithm)

SLIDE 53

EM has its pitfalls

  β€’ Objective is not convex β†’ converge to a bad local optimum
  β€’ Computing expectations can be hard: the E-step could require clever algorithms
  β€’ How well does log-likelihood correlate with an end task?

SLIDE 54

A Maximization-Maximization Procedure

𝐺 πœ„, π‘Ÿ = 𝔽 π’Ÿ(πœ„) βˆ’π”½ log π‘Ÿ(π‘Ž)

  • bserved data

log-likelihood any distribution

  • ver Z

we’ll see this again with variational inference

SLIDE 55

Outline

EM (Expectation Maximization)
  β€’ Basic idea
  β€’ Three coins example
  β€’ Why EM works