SLIDE 1

EM and GMM

Jia-Bin Huang Virginia Tech

Spring 2019

ECE-5424G / CS-5824

SLIDE 2

Administrative

  • HW 3 due March 27.
  • Final project discussion: Link
  • Final exam date/time
  • Exam Section: 14M
  • https://banweb.banner.vt.edu/ssb/prod/hzskexam.P_DispExamInfo
  • 2:05PM to 4:05PM May 13
SLIDE 3
  • J. Mark Sowers Distinguished Lecture
  • Michael Jordan
  • Pehong Chen Distinguished Professor

Department of Statistics and Electrical Engineering and Computer Sciences

  • University of California, Berkeley
  • 3/28/19
  • 7:30 PM, McBryde 100
SLIDE 4

K-means algorithm

  • Input:
  • 𝐿 (number of clusters)
  • Training set 𝑦(1), 𝑦(2), 𝑦(3), β‹― , 𝑦(𝑛)
  • 𝑦(𝑗) ∈ β„π‘œ (note: drop 𝑦0 = 1 convention)

Slide credit: Andrew Ng

SLIDE 5

K-means algorithm

  • Randomly initialize 𝐿 cluster centroids 𝜈1, 𝜈2, β‹― , 𝜈𝐿 ∈ β„π‘œ

Repeat{ for 𝑗 = 1 to 𝑛 𝑑(𝑗) ≔ index (from 1 to 𝐿) of cluster centroid closest to 𝑦(𝑗) for 𝑙 = 1 to 𝐿 πœˆπ‘™ ≔ average (mean) of points assigned to cluster 𝑙 } Cluster assignment step Centroid update step

Slide credit: Andrew Ng
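A minimal MATLAB sketch of this loop, for concreteness (illustrative only: the function name, variable names, and the use of pdist2 from the Statistics and Machine Learning Toolbox are assumptions, not part of the slides):

function [c, mu] = simple_kmeans(X, K, n_iters)
% X: m-by-d data matrix, K: number of clusters, n_iters: fixed iteration budget
m = size(X, 1);
mu = X(randperm(m, K), :);             % randomly initialize centroids to K data points
for it = 1:n_iters
    % Cluster assignment step: index of the closest centroid for each point
    [~, c] = min(pdist2(X, mu), [], 2);
    % Centroid update step: mean of the points assigned to each cluster
    for k = 1:K
        if any(c == k)
            mu(k, :) = mean(X(c == k, :), 1);
        end
    end
end
end
% Example usage: [c, mu] = simple_kmeans(randn(200, 2), 3, 20);

Initializing centroids at randomly chosen data points and skipping empty clusters are common conventions; the slides do not prescribe either.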

SLIDE 6

K-means optimization objective

  β€’ $c^{(i)}$ = index of the cluster (1, 2, …, $K$) to which example $x^{(i)}$ is currently assigned
  β€’ $\mu_k$ = cluster centroid $k$ ($\mu_k \in \mathbb{R}^n$)
  β€’ $\mu_{c^{(i)}}$ = cluster centroid of the cluster to which example $x^{(i)}$ has been assigned
  β€’ Optimization objective:

$$J\big(c^{(1)}, \cdots, c^{(m)}, \mu_1, \cdots, \mu_K\big) = \frac{1}{m} \sum_{i=1}^{m} \left\| x^{(i)} - \mu_{c^{(i)}} \right\|^2$$

$$\min_{\substack{c^{(1)}, \cdots, c^{(m)} \\ \mu_1, \cdots, \mu_K}} J\big(c^{(1)}, \cdots, c^{(m)}, \mu_1, \cdots, \mu_K\big)$$

Example: if example $x^{(i)}$ is assigned to cluster 5, then $c^{(i)} = 5$ and $\mu_{c^{(i)}} = \mu_5$.

Slide credit: Andrew Ng

SLIDE 7

K-means algorithm

Randomly initialize $K$ cluster centroids $\mu_1, \mu_2, \cdots, \mu_K \in \mathbb{R}^n$

Repeat {
  for $i$ = 1 to $m$:
    $c^{(i)}$ := index (from 1 to $K$) of the cluster centroid closest to $x^{(i)}$
  for $k$ = 1 to $K$:
    $\mu_k$ := average (mean) of the points assigned to cluster $k$
}

Cluster assignment step: minimizes

$$J\big(c^{(1)}, \cdots, c^{(m)}, \mu_1, \cdots, \mu_K\big) = \frac{1}{m} \sum_{i=1}^{m} \left\| x^{(i)} - \mu_{c^{(i)}} \right\|^2$$

with respect to $c^{(1)}, \cdots, c^{(m)}$, holding the centroids fixed.

Centroid update step: minimizes the same objective $J$ with respect to $\mu_1, \cdots, \mu_K$, holding the assignments fixed.

Slide credit: Andrew Ng

SLIDE 8

Hierarchical Clustering

  • A hierarchy might be more nature
  • Different users might care about different levels of granularity or even

prunings.

Slide credit: Maria-Florina Balcan

SLIDE 9

Hierarchical Clustering

  • Top-down (divisive)
  • Partition data into 2-groups (e.g., 2-means)
  • Recursively cluster each group
  • Bottom-up (agglomerative)
  • Start with every point in its own cluster.
  • Repeatedly merge the β€œclosest” two clusters
  • Different definitions of β€œclosest” give different algorithms.

Slide credit: Maria-Florina Balcan

SLIDE 10

Bottom-up (agglomerative)

  • Have a distance measure on pairs of objects.
  • 𝑒 𝑦, 𝑧 : Distance between 𝑦 and 𝑧
  • Single linkage: dist A, B =

min

x∈𝐡,π‘¦β€²βˆˆπΆ d(x, xβ€²)

  • Complete linkage: dist A, B =

max

x∈𝐡,π‘¦β€²βˆˆπΆ d(x, xβ€²)

  • Average linkage: dist A, B = average

x∈𝐡,π‘¦β€²βˆˆπΆ

d(x, xβ€²)

  • Ward’s method dist A, B =

𝐡 |𝐢| 𝐡 +|𝐢| mean 𝐡 βˆ’ mean 𝐢 2

Slide credit: Maria-Florina Balcan
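For reference, MATLAB's Statistics and Machine Learning Toolbox exposes these linkage choices directly; a minimal sketch (the toy data and variable names are illustrative, not from the slides):

% Bottom-up (agglomerative) clustering under different definitions of "closest".
X = rand(100, 2);                       % toy 2-D data (assumed)
Z_single   = linkage(X, 'single');      % min pairwise distance between clusters
Z_complete = linkage(X, 'complete');    % max pairwise distance
Z_average  = linkage(X, 'average');     % average pairwise distance
Z_ward     = linkage(X, 'ward');        % Ward's method (smallest increase in within-cluster variance)
T = cluster(Z_single, 'maxclust', 3);   % prune the single-linkage tree into 3 clusters
dendrogram(Z_single);                   % visualize the hierarchy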

SLIDE 11

Bottom-up (agglomerative)

  • Single linkage: dist A, B =

min

x∈𝐡,π‘¦β€²βˆˆπΆ d(x, xβ€²)

  • At any time, distance between any two points in a connected components < r.
  • Complete linkage: dist A, B =

max

x∈𝐡,π‘¦β€²βˆˆπΆ d(x, xβ€²)

  • Keep max diameter as small as possible at any level
  • Ward’s method dist A, B =

𝐡 |𝐢| 𝐡 +|𝐢| mean 𝐡 βˆ’ mean 𝐢 2

  • Merge the two clusters such that the increase in k-means cost is as small as

possible.

  • Works well in practice

Slide credit: Maria-Florina Balcan

SLIDE 12

Things to remember

  • Intro to unsupervised learning
  • K-means algorithm
  • Optimization objective
  • Initialization and the number of clusters
  • Hierarchical clustering
SLIDE 13

Today’s Class

  • Examples of Missing Data Problems
  • Detecting outliers
  • Latent topic models
  • Segmentation
  • Background
  • Maximum Likelihood Estimation
  • Probabilistic Inference
  • Dealing with β€œHidden” Variables
  • EM algorithm, Mixture of Gaussians
  • Hard EM
SLIDE 14

Today’s Class

  • Examples of Missing Data Problems
  • Detecting outliers
  • Latent topic models
  • Segmentation
  • Background
  • Maximum Likelihood Estimation
  • Probabilistic Inference
  • Dealing with β€œHidden” Variables
  • EM algorithm, Mixture of Gaussians
  • Hard EM
SLIDE 15

Missing Data Problems: Outliers

You want to train an algorithm to predict whether a photograph is attractive. You collect annotations from Mechanical Turk. Some annotators try to give accurate ratings, but others answer randomly.

Challenge: Determine which people to trust and the average rating by accurate annotators.

Photo: Jam343 (Flickr)

Annotator ratings: 10, 8, 9, 2, 8

SLIDE 16

Missing Data Problems: Object Discovery

You have a collection of images and have extracted regions from them. Each is represented by a histogram of β€œvisual words”.

Challenge: Discover frequently occurring object categories, without pre-trained appearance models.

http://www.robots.ox.ac.uk/~vgg/publications/papers/russell06.pdf

SLIDE 17

Missing Data Problems: Segmentation

You are given an image and want to assign foreground/background pixels. Challenge: Segment the image into figure and ground without knowing what the foreground looks like in advance.

Foreground Background

SLIDE 18

Missing Data Problems: Segmentation

Challenge: Segment the image into figure and ground without knowing what the foreground looks like in advance.

Three steps:

1. If we had labels, how could we model the appearance of foreground and background?
  β€’ Maximum Likelihood Estimation

2. Once we have modeled the fg/bg appearance, how do we compute the likelihood that a pixel is foreground?
  β€’ Probabilistic Inference

3. How can we get both labels and appearance models at once?
  β€’ Expectation-Maximization (EM) Algorithm
SLIDE 19

Maximum Likelihood Estimation

  • 1. If we had labels, how could we model the appearance of

foreground and background?

Foreground Background

SLIDE 20

Maximum Likelihood Estimation

$$\hat{\theta} = \operatorname*{argmax}_{\theta}\; p(\mathbf{x} \mid \theta) = \operatorname*{argmax}_{\theta}\; \prod_{n} p(x_n \mid \theta), \qquad \mathbf{x} = x_{1..N}$$

($\mathbf{x}$: data, $\theta$: parameters)

SLIDE 21

Maximum Likelihood Estimation

$$\hat{\theta} = \operatorname*{argmax}_{\theta}\; p(\mathbf{x} \mid \theta) = \operatorname*{argmax}_{\theta}\; \prod_{n} p(x_n \mid \theta), \qquad \mathbf{x} = x_{1..N}$$

Gaussian distribution:

$$p(x_n \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left( -\frac{(x_n - \mu)^2}{2\sigma^2} \right)$$

SLIDE 22

Maximum Likelihood Estimation

$$\hat{\theta} = \operatorname*{argmax}_{\theta}\; p(\mathbf{x} \mid \theta) = \operatorname*{argmax}_{\theta}\; \log p(\mathbf{x} \mid \theta)$$

$$\hat{\theta} = \operatorname*{argmax}_{\theta}\; \sum_{n} \log p(x_n \mid \theta) = \operatorname*{argmax}_{\theta}\; L(\theta)$$

Log-likelihood of the Gaussian distribution $p(x_n \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left( -\frac{(x_n - \mu)^2}{2\sigma^2} \right)$:

$$L(\theta) = -\frac{N}{2} \log 2\pi - \frac{N}{2} \log \sigma^2 - \frac{1}{2\sigma^2} \sum_{n} (x_n - \mu)^2$$

$$\frac{\partial L(\theta)}{\partial \mu} = \frac{1}{\sigma^2} \sum_{n} (x_n - \mu) = 0 \;\Rightarrow\; \hat{\mu} = \frac{1}{N} \sum_{n} x_n$$

$$\frac{\partial L(\theta)}{\partial \sigma} = -\frac{N}{\sigma} + \frac{1}{\sigma^3} \sum_{n} (x_n - \mu)^2 = 0 \;\Rightarrow\; \hat{\sigma}^2 = \frac{1}{N} \sum_{n} (x_n - \hat{\mu})^2$$

SLIDE 23

Maximum Likelihood Estimation

$$\hat{\theta} = \operatorname*{argmax}_{\theta}\; p(\mathbf{x} \mid \theta) = \operatorname*{argmax}_{\theta}\; \prod_{n} p(x_n \mid \theta), \qquad \mathbf{x} = x_{1..N}$$

Gaussian distribution:

$$p(x_n \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left( -\frac{(x_n - \mu)^2}{2\sigma^2} \right)$$

Maximum likelihood estimates:

$$\hat{\mu} = \frac{1}{N} \sum_{n} x_n, \qquad \hat{\sigma}^2 = \frac{1}{N} \sum_{n} (x_n - \hat{\mu})^2$$

SLIDE 24

Example: MLE

>> mu_fg = mean(im(labels))
mu_fg = 0.6012
>> sigma_fg = sqrt(mean((im(labels)-mu_fg).^2))
sigma_fg = 0.1007
>> mu_bg = mean(im(~labels))
mu_bg = 0.4007
>> sigma_bg = sqrt(mean((im(~labels)-mu_bg).^2))
sigma_bg = 0.1007
>> pfg = mean(labels(:));

(Figure: the label mask β€œlabels” and the image β€œim”; parameters used to generate them: fg: mu=0.6, sigma=0.1; bg: mu=0.4, sigma=0.1.)
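For context, a minimal sketch of how a toy image and label mask with exactly these generating parameters could be created (this is an assumption for illustration; the actual β€œim” and β€œlabels” used on the slide are not included in this transcript):

% Hypothetical toy data for the MLE example (assumed, not from the slide):
% left half background, right half foreground, Gaussian pixel intensities.
labels = false(100, 100);
labels(:, 51:end) = true;                          % foreground mask
im = 0.4 + 0.1*randn(100, 100);                    % background: mu = 0.4, sigma = 0.1
im(labels) = 0.6 + 0.1*randn(nnz(labels), 1);      % foreground: mu = 0.6, sigma = 0.1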

SLIDE 25

Probabilistic Inference

  • 2. Once we have modeled the fg/bg appearance, how do

we compute the likelihood that a pixel is foreground?

Foreground Background

SLIDE 26

Probabilistic Inference

Compute the likelihood that a particular model (component or label) generated a sample:

$$p(z_n = m \mid x_n, \theta)$$

SLIDE 27

Probabilistic Inference

Compute the likelihood that a particular model (component or label) generated a sample:

$$p(z_n = m \mid x_n, \theta) = \frac{p(z_n = m, x_n \mid \theta)}{p(x_n \mid \theta)}$$

Conditional probability: $P(A \mid B) = \dfrac{P(A, B)}{P(B)}$

SLIDE 28

Probabilistic Inference

Compute the likelihood that a particular model (component or label) generated a sample:

$$p(z_n = m \mid x_n, \theta) = \frac{p(z_n = m, x_n \mid \theta)}{p(x_n \mid \theta)} = \frac{p(z_n = m, x_n \mid \theta)}{\sum_{k} p(z_n = k, x_n \mid \theta)}$$

Marginalization: $P(A) = \sum_{k} P(A, B = k)$

SLIDE 29

Probabilistic Inference

Compute the likelihood that a particular model (component or label) generated a sample:

$$p(z_n = m \mid x_n, \theta) = \frac{p(z_n = m, x_n \mid \theta)}{\sum_{k} p(z_n = k, x_n \mid \theta)} = \frac{p(x_n \mid z_n = m, \theta_m)\, p(z_n = m)}{\sum_{k} p(x_n \mid z_n = k, \theta_k)\, p(z_n = k)}$$

Joint distribution: $P(A, B) = P(B)\, P(A \mid B)$

SLIDE 30

Example: Inference

>> pfg = 0.5;
>> px_fg = normpdf(im, mu_fg, sigma_fg);
>> px_bg = normpdf(im, mu_bg, sigma_bg);
>> pfg_x = px_fg*pfg ./ (px_fg*pfg + px_bg*(1-pfg));

(Figure: the image β€œim”, the learned parameters (fg: mu=0.6, sigma=0.1; bg: mu=0.4, sigma=0.1), and the resulting p(fg | im).)

SLIDE 31

Dealing with Hidden Variables

  • 3. How can we get both labels and appearance parameters

at once?

Foreground Background

SLIDE 32

Mixture of Gaussians

$$p(x_n \mid \boldsymbol{\mu}, \boldsymbol{\sigma}^2, \boldsymbol{\pi}) = \sum_{m} p(x_n, z_n = m \mid \boldsymbol{\mu}, \boldsymbol{\sigma}^2, \boldsymbol{\pi})$$

where $z_n$ is the mixture component for sample $x_n$, and

$$p(x_n, z_n = m \mid \boldsymbol{\mu}, \boldsymbol{\sigma}^2, \boldsymbol{\pi}) = p(x_n \mid \mu_m, \sigma_m^2)\, p(z_n = m) = \pi_m \cdot \frac{1}{\sqrt{2\pi\sigma_m^2}} \exp\!\left( -\frac{(x_n - \mu_m)^2}{2\sigma_m^2} \right)$$

with component prior $\pi_m$ and component model parameters $\mu_m, \sigma_m^2$.
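A small MATLAB sketch of evaluating such a mixture density (illustrative; the component priors, means, and standard deviations below are assumptions chosen to echo the fg/bg example):

% Mixture-of-Gaussians density: weighted sum of Gaussian components.
pis    = [0.5 0.5];                    % component priors (assumed)
mus    = [0.4 0.6];                    % component means
sigmas = [0.1 0.1];                    % component standard deviations
x  = linspace(0, 1, 200);
px = zeros(size(x));
for m = 1:numel(pis)
    px = px + pis(m) * normpdf(x, mus(m), sigmas(m));
end
plot(x, px); xlabel('x'); ylabel('p(x)');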

SLIDE 33

Mixture of Gaussians

With enough components, a mixture of Gaussians can approximate essentially any probability density function.

  β€’ Widely used as a general-purpose pdf estimator
SLIDE 34

Segmentation with Mixture of Gaussians

Pixels come from one of several Gaussian components.

  β€’ We don’t know which pixels come from which components
  β€’ We don’t know the parameters for the components

Problem: Estimate the parameters of the Gaussian Mixture Model. What would you do?

SLIDE 35

Simple solution

  • 1. Initialize parameters
  • 2. Compute the probability of each hidden variable given the current

parameters

  • 3. Compute new parameters for each model, weighted by likelihood of

hidden variables

  • 4. Repeat 2-3 until convergence
SLIDE 36

Mixture of Gaussians: Simple Solution

  • 1. Initialize parameters
  • 2. Compute likelihood of hidden variables for current parameters
  • 3. Estimate new parameters for each model, weighted by likelihood

$$\alpha_{nm} = p(z_n = m \mid x_n, \boldsymbol{\mu}^{(t)}, \boldsymbol{\sigma}^{2\,(t)}, \boldsymbol{\pi}^{(t)})$$

$$\hat{\mu}_m^{(t+1)} = \frac{\sum_{n} \alpha_{nm}\, x_n}{\sum_{n} \alpha_{nm}}, \qquad \hat{\sigma}_m^{2\,(t+1)} = \frac{\sum_{n} \alpha_{nm}\, \big(x_n - \hat{\mu}_m^{(t+1)}\big)^2}{\sum_{n} \alpha_{nm}}, \qquad \hat{\pi}_m^{(t+1)} = \frac{\sum_{n} \alpha_{nm}}{N}$$

SLIDE 37

Expectation Maximization (EM) Algorithm

Goal:

$$\hat{\theta} = \operatorname*{argmax}_{\theta}\; \log \sum_{\mathbf{z}} p(\mathbf{x}, \mathbf{z} \mid \theta)$$

The log of a sum is intractable.

Jensen’s inequality: $f(\mathrm{E}[X]) \ge \mathrm{E}[f(X)]$ for concave functions $f(x)$ (so we maximize the lower bound!)

See here for a proof: www.stanford.edu/class/cs229/notes/cs229-notes8.ps
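To spell out how Jensen's inequality is used here (a standard derivation step, not reproduced in this transcript): for any distribution $q(\mathbf{z})$ over the hidden variables,

$$\log \sum_{\mathbf{z}} p(\mathbf{x}, \mathbf{z} \mid \theta) = \log \sum_{\mathbf{z}} q(\mathbf{z}) \frac{p(\mathbf{x}, \mathbf{z} \mid \theta)}{q(\mathbf{z})} \;\ge\; \sum_{\mathbf{z}} q(\mathbf{z}) \log \frac{p(\mathbf{x}, \mathbf{z} \mid \theta)}{q(\mathbf{z})}$$

since $\log$ is concave. EM takes $q(\mathbf{z}) = p(\mathbf{z} \mid \mathbf{x}, \theta^{(t)})$, which makes the bound tight at $\theta = \theta^{(t)}$; maximizing the bound over $\theta$ then amounts to maximizing $\sum_{\mathbf{z}} p(\mathbf{z} \mid \mathbf{x}, \theta^{(t)}) \log p(\mathbf{x}, \mathbf{z} \mid \theta)$, which is exactly the M-step objective on the next slide.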

SLIDE 38

Expectation Maximization (EM) Algorithm

  β€’ 1. E-step: compute

$$\mathrm{E}_{\mathbf{z} \mid \mathbf{x}, \theta^{(t)}}\!\left[ \log p(\mathbf{x}, \mathbf{z} \mid \theta) \right] = \sum_{\mathbf{z}} p(\mathbf{z} \mid \mathbf{x}, \theta^{(t)}) \log p(\mathbf{x}, \mathbf{z} \mid \theta)$$

  β€’ 2. M-step: solve

$$\theta^{(t+1)} = \operatorname*{argmax}_{\theta}\; \sum_{\mathbf{z}} p(\mathbf{z} \mid \mathbf{x}, \theta^{(t)}) \log p(\mathbf{x}, \mathbf{z} \mid \theta)$$

Goal:

$$\hat{\theta} = \operatorname*{argmax}_{\theta}\; \log \sum_{\mathbf{z}} p(\mathbf{x}, \mathbf{z} \mid \theta)$$

SLIDE 39
  • 1. E-step: compute
  • 2. M-step: solve

   

 

    



) ( , |

, | | , log | , log E

) (

t x z

p p p

t

  



x z z x z x

z

οƒ₯

ο€½

    



) ( ) 1 (

, | | , log argmax

t t

p p   



x z z x

z

οƒ₯

ο€½



 

οƒΈ οƒΆ    ο€½

οƒ₯

z

z x  



| , log argmax Λ† p

Goal:

 

   

 

X f X f E E ο‚³

log of expectation of P(x|z) expectation of log of P(x|z)

SLIDE 40

EM for Mixture of Gaussians - derivation

$$p(x_n \mid \boldsymbol{\mu}, \boldsymbol{\sigma}^2, \boldsymbol{\pi}) = \sum_{m} p(x_n, z_n = m \mid \boldsymbol{\mu}, \boldsymbol{\sigma}^2, \boldsymbol{\pi}) = \sum_{m} \pi_m \cdot \frac{1}{\sqrt{2\pi\sigma_m^2}} \exp\!\left( -\frac{(x_n - \mu_m)^2}{2\sigma_m^2} \right)$$

1. E-step: compute

$$\mathrm{E}_{\mathbf{z} \mid \mathbf{x}, \theta^{(t)}}\!\left[ \log p(\mathbf{x}, \mathbf{z} \mid \theta) \right] = \sum_{\mathbf{z}} p(\mathbf{z} \mid \mathbf{x}, \theta^{(t)}) \log p(\mathbf{x}, \mathbf{z} \mid \theta)$$

2. M-step: solve

$$\theta^{(t+1)} = \operatorname*{argmax}_{\theta}\; \sum_{\mathbf{z}} p(\mathbf{z} \mid \mathbf{x}, \theta^{(t)}) \log p(\mathbf{x}, \mathbf{z} \mid \theta)$$
SLIDE 41

EM for Mixture of Gaussians

$$p(x_n \mid \boldsymbol{\mu}, \boldsymbol{\sigma}^2, \boldsymbol{\pi}) = \sum_{m} p(x_n, z_n = m \mid \boldsymbol{\mu}, \boldsymbol{\sigma}^2, \boldsymbol{\pi}) = \sum_{m} \pi_m \cdot \frac{1}{\sqrt{2\pi\sigma_m^2}} \exp\!\left( -\frac{(x_n - \mu_m)^2}{2\sigma_m^2} \right)$$

1. E-step: compute the posterior over components

$$\alpha_{nm} = p(z_n = m \mid x_n, \boldsymbol{\mu}^{(t)}, \boldsymbol{\sigma}^{2\,(t)}, \boldsymbol{\pi}^{(t)})$$

2. M-step: re-estimate the parameters, weighted by the posteriors

$$\hat{\mu}_m^{(t+1)} = \frac{\sum_{n} \alpha_{nm}\, x_n}{\sum_{n} \alpha_{nm}}, \qquad \hat{\sigma}_m^{2\,(t+1)} = \frac{\sum_{n} \alpha_{nm}\, \big(x_n - \hat{\mu}_m^{(t+1)}\big)^2}{\sum_{n} \alpha_{nm}}, \qquad \hat{\pi}_m^{(t+1)} = \frac{\sum_{n} \alpha_{nm}}{N}$$

SLIDE 42

EM algorithm - derivation

http://lasa.epfl.ch/teaching/lectures/ML_Phd/Notes/GP-GMM.pdf

SLIDE 43

EM algorithm – E-Step

SLIDE 44

EM algorithm – E-Step

SLIDE 45

EM algorithm – M-Step

SLIDE 46

EM algorithm – M-Step

Take the derivative with respect to $\mu_m$
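The equations for this step are not reproduced in this transcript; for reference, a sketch of the standard result, with $\alpha_{nm} = p(z_n = m \mid x_n, \theta^{(t)})$ as on Slide 36, consistent with the updates on Slides 36 and 41:

$$\frac{\partial}{\partial \mu_m} \sum_{n} \sum_{k} \alpha_{nk} \left[ \log \pi_k - \tfrac{1}{2}\log(2\pi\sigma_k^2) - \frac{(x_n - \mu_k)^2}{2\sigma_k^2} \right] = \sum_{n} \alpha_{nm} \frac{x_n - \mu_m}{\sigma_m^2} = 0 \;\Rightarrow\; \hat{\mu}_m = \frac{\sum_{n} \alpha_{nm}\, x_n}{\sum_{n} \alpha_{nm}}$$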

SLIDE 47

EM algorithm – M-Step

Take the derivative with respect to $\sigma_m^{-1}$
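Again, the slide's equations are not in this transcript; a sketch of the step, differentiating with respect to $\sigma_m^{-1}$ as the slide title indicates and keeping only the terms that depend on it:

$$\frac{\partial}{\partial \sigma_m^{-1}} \sum_{n} \alpha_{nm} \left[ \log \sigma_m^{-1} - \frac{(x_n - \mu_m)^2}{2\sigma_m^{2}} \right] = \sum_{n} \alpha_{nm} \left[ \sigma_m - \sigma_m^{-1} (x_n - \mu_m)^2 \right] = 0 \;\Rightarrow\; \hat{\sigma}_m^2 = \frac{\sum_{n} \alpha_{nm} (x_n - \mu_m)^2}{\sum_{n} \alpha_{nm}}$$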

SLIDE 48

EM Algorithm for GMM

SLIDE 49

EM Algorithm

  • Maximizes a lower bound on the data likelihood at each iteration
  • Each step increases the data likelihood
  • Converges to local maximum
  • Common tricks to derivation
  • Find terms that sum or integrate to 1
  • Lagrange multiplier to deal with constraints
SLIDE 50

Convergence of EM Algorithm

SLIDE 51

β€œHard EM”

  • Same as EM except compute z* as most likely values for hidden

variables

  • K-means is an example
  • Advantages
  • Simpler: can be applied when cannot derive EM
  • Sometimes works better if you want to make hard predictions at the end
  • But
  • Generally, pdf parameters are not as accurate as EM