SLIDE 1

EM and GMM

Jia-Bin Huang Virginia Tech

Spring 2019

ECE-5424G / CS-5824

SLIDE 2

Administrative

  • HW 3 due March 27.
  • Final project discussion: Link
  • Final exam date/time
  • Exam Section: 14M
  • https://banweb.banner.vt.edu/ssb/prod/hzskexam.P_DispExamInfo
  • 2:05PM to 4:05PM May 13
SLIDE 3
  • J. Mark Sowers Distinguished Lecture
  • Michael Jordan
  • Pehong Chen Distinguished Professor

Department of Statistics and Electrical Engineering and Computer Sciences

  • University of California, Berkeley
  • 3/28/19
  • 7:30 PM, McBryde 100
SLIDE 4

K-means algorithm

  • Input:
  • 𝐿 (number of clusters)
  • Training set 𝑦(1), 𝑦(2), 𝑦(3), β‹― , 𝑦(𝑛)
  • 𝑦(𝑗) ∈ β„π‘œ (note: drop 𝑦0 = 1 convention)

Slide credit: Andrew Ng

SLIDE 5

K-means algorithm

  • Randomly initialize 𝐿 cluster centroids 𝜈1, 𝜈2, β‹― , 𝜈𝐿 ∈ β„π‘œ

Repeat{ for 𝑗 = 1 to 𝑛 𝑑(𝑗) ≔ index (from 1 to 𝐿) of cluster centroid closest to 𝑦(𝑗) for 𝑙 = 1 to 𝐿 πœˆπ‘™ ≔ average (mean) of points assigned to cluster 𝑙 } Cluster assignment step Centroid update step

Slide credit: Andrew Ng
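A minimal MATLAB sketch of this loop, for concreteness (illustrative only: the function name, variable names, and the use of pdist2 from the Statistics and Machine Learning Toolbox are assumptions, not part of the slides):

function [c, mu] = simple_kmeans(X, K, n_iters)
% X: m-by-d data matrix, K: number of clusters, n_iters: fixed iteration budget
m = size(X, 1);
mu = X(randperm(m, K), :);             % randomly initialize centroids to K data points
for it = 1:n_iters
    % Cluster assignment step: index of the closest centroid for each point
    [~, c] = min(pdist2(X, mu), [], 2);
    % Centroid update step: mean of the points assigned to each cluster
    for k = 1:K
        if any(c == k)
            mu(k, :) = mean(X(c == k, :), 1);
        end
    end
end
end
% Example usage: [c, mu] = simple_kmeans(randn(200, 2), 3, 20);

Initializing centroids at randomly chosen data points and skipping empty clusters are common conventions; the slides do not prescribe either.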

SLIDE 6

K-means optimization objective

  β€’ $c^{(i)}$ = index of the cluster (1, 2, …, $K$) to which example $x^{(i)}$ is currently assigned
  β€’ $\mu_k$ = cluster centroid $k$ ($\mu_k \in \mathbb{R}^n$)
  β€’ $\mu_{c^{(i)}}$ = cluster centroid of the cluster to which example $x^{(i)}$ has been assigned
  β€’ Optimization objective:

$$J\big(c^{(1)}, \cdots, c^{(m)}, \mu_1, \cdots, \mu_K\big) = \frac{1}{m} \sum_{i=1}^{m} \left\| x^{(i)} - \mu_{c^{(i)}} \right\|^2$$

$$\min_{\substack{c^{(1)}, \cdots, c^{(m)} \\ \mu_1, \cdots, \mu_K}} J\big(c^{(1)}, \cdots, c^{(m)}, \mu_1, \cdots, \mu_K\big)$$

Example: if example $x^{(i)}$ is assigned to cluster 5, then $c^{(i)} = 5$ and $\mu_{c^{(i)}} = \mu_5$.

Slide credit: Andrew Ng

SLIDE 7

K-means algorithm

Randomly initialize $K$ cluster centroids $\mu_1, \mu_2, \cdots, \mu_K \in \mathbb{R}^n$

Repeat {
  for $i$ = 1 to $m$:
    $c^{(i)}$ := index (from 1 to $K$) of the cluster centroid closest to $x^{(i)}$
  for $k$ = 1 to $K$:
    $\mu_k$ := average (mean) of the points assigned to cluster $k$
}

Cluster assignment step: minimizes

$$J\big(c^{(1)}, \cdots, c^{(m)}, \mu_1, \cdots, \mu_K\big) = \frac{1}{m} \sum_{i=1}^{m} \left\| x^{(i)} - \mu_{c^{(i)}} \right\|^2$$

with respect to $c^{(1)}, \cdots, c^{(m)}$, holding the centroids fixed.

Centroid update step: minimizes the same objective $J$ with respect to $\mu_1, \cdots, \mu_K$, holding the assignments fixed.

Slide credit: Andrew Ng

SLIDE 8

Hierarchical Clustering

  • A hierarchy might be more nature
  • Different users might care about different levels of granularity or even

prunings.

Slide credit: Maria-Florina Balcan

SLIDE 9

Hierarchical Clustering

  • Top-down (divisive)
  • Partition data into 2-groups (e.g., 2-means)
  • Recursively cluster each group
  • Bottom-up (agglomerative)
  • Start with every point in its own cluster.
  • Repeatedly merge the β€œclosest” two clusters
  • Different definitions of β€œclosest” give different algorithms.

Slide credit: Maria-Florina Balcan

SLIDE 10

Bottom-up (agglomerative)

  • Have a distance measure on pairs of objects.
  • 𝑒 𝑦, 𝑧 : Distance between 𝑦 and 𝑧
  • Single linkage: dist A, B =

min

x∈𝐡,π‘¦β€²βˆˆπΆ d(x, xβ€²)

  • Complete linkage: dist A, B =

max

x∈𝐡,π‘¦β€²βˆˆπΆ d(x, xβ€²)

  • Average linkage: dist A, B = average

x∈𝐡,π‘¦β€²βˆˆπΆ

d(x, xβ€²)

  • Ward’s method dist A, B =

𝐡 |𝐢| 𝐡 +|𝐢| mean 𝐡 βˆ’ mean 𝐢 2

Slide credit: Maria-Florina Balcan
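For reference, MATLAB's Statistics and Machine Learning Toolbox exposes these linkage choices directly; a minimal sketch (the toy data and variable names are illustrative, not from the slides):

% Bottom-up (agglomerative) clustering under different definitions of "closest".
X = rand(100, 2);                       % toy 2-D data (assumed)
Z_single   = linkage(X, 'single');      % min pairwise distance between clusters
Z_complete = linkage(X, 'complete');    % max pairwise distance
Z_average  = linkage(X, 'average');     % average pairwise distance
Z_ward     = linkage(X, 'ward');        % Ward's method (smallest increase in within-cluster variance)
T = cluster(Z_single, 'maxclust', 3);   % prune the single-linkage tree into 3 clusters
dendrogram(Z_single);                   % visualize the hierarchy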

SLIDE 11

Bottom-up (agglomerative)

  • Single linkage: dist A, B =

min

x∈𝐡,π‘¦β€²βˆˆπΆ d(x, xβ€²)

  • At any time, distance between any two points in a connected components < r.
  • Complete linkage: dist A, B =

max

x∈𝐡,π‘¦β€²βˆˆπΆ d(x, xβ€²)

  • Keep max diameter as small as possible at any level
  • Ward’s method dist A, B =

𝐡 |𝐢| 𝐡 +|𝐢| mean 𝐡 βˆ’ mean 𝐢 2

  • Merge the two clusters such that the increase in k-means cost is as small as

possible.

  • Works well in practice

Slide credit: Maria-Florina Balcan

SLIDE 12

Things to remember

  • Intro to unsupervised learning
  • K-means algorithm
  • Optimization objective
  • Initialization and the number of clusters
  • Hierarchical clustering
SLIDE 13

Today’s Class

  • Examples of Missing Data Problems
  • Detecting outliers
  • Latent topic models
  • Segmentation
  • Background
  • Maximum Likelihood Estimation
  • Probabilistic Inference
  • Dealing with β€œHidden” Variables
  • EM algorithm, Mixture of Gaussians
  • Hard EM
SLIDE 14

Today’s Class

  • Examples of Missing Data Problems
  • Detecting outliers
  • Latent topic models
  • Segmentation
  • Background
  • Maximum Likelihood Estimation
  • Probabilistic Inference
  • Dealing with β€œHidden” Variables
  • EM algorithm, Mixture of Gaussians
  • Hard EM
SLIDE 15

Missing Data Problems: Outliers

You want to train an algorithm to predict whether a photograph is attractive. You collect annotations from Mechanical Turk. Some annotators try to give accurate ratings, but others answer randomly.

Challenge: Determine which people to trust and the average rating by accurate annotators.

Photo: Jam343 (Flickr)

Annotator ratings: 10, 8, 9, 2, 8

SLIDE 16

Missing Data Problems: Object Discovery

You have a collection of images and have extracted regions from them. Each is represented by a histogram of β€œvisual words”.

Challenge: Discover frequently occurring object categories, without pre-trained appearance models.

http://www.robots.ox.ac.uk/~vgg/publications/papers/russell06.pdf

SLIDE 17

Missing Data Problems: Segmentation

You are given an image and want to assign foreground/background pixels. Challenge: Segment the image into figure and ground without knowing what the foreground looks like in advance.

Foreground Background

SLIDE 18

Missing Data Problems: Segmentation

Challenge: Segment the image into figure and ground without knowing what the foreground looks like in advance.

Three steps:

1. If we had labels, how could we model the appearance of foreground and background?
  β€’ Maximum Likelihood Estimation

2. Once we have modeled the fg/bg appearance, how do we compute the likelihood that a pixel is foreground?
  β€’ Probabilistic Inference

3. How can we get both labels and appearance models at once?
  β€’ Expectation-Maximization (EM) Algorithm
SLIDE 19

Maximum Likelihood Estimation

  • 1. If we had labels, how could we model the appearance of

foreground and background?

Foreground Background

SLIDE 20

Maximum Likelihood Estimation

$$\hat{\theta} = \operatorname*{argmax}_{\theta}\; p(\mathbf{x} \mid \theta) = \operatorname*{argmax}_{\theta}\; \prod_{n} p(x_n \mid \theta), \qquad \mathbf{x} = x_{1..N}$$

($\mathbf{x}$: data, $\theta$: parameters)

SLIDE 21

Maximum Likelihood Estimation

$$\hat{\theta} = \operatorname*{argmax}_{\theta}\; p(\mathbf{x} \mid \theta) = \operatorname*{argmax}_{\theta}\; \prod_{n} p(x_n \mid \theta), \qquad \mathbf{x} = x_{1..N}$$

Gaussian distribution:

$$p(x_n \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left( -\frac{(x_n - \mu)^2}{2\sigma^2} \right)$$

SLIDE 22

Maximum Likelihood Estimation

$$\hat{\theta} = \operatorname*{argmax}_{\theta}\; p(\mathbf{x} \mid \theta) = \operatorname*{argmax}_{\theta}\; \log p(\mathbf{x} \mid \theta)$$

$$\hat{\theta} = \operatorname*{argmax}_{\theta}\; \sum_{n} \log p(x_n \mid \theta) = \operatorname*{argmax}_{\theta}\; L(\theta)$$

Log-likelihood of the Gaussian distribution $p(x_n \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left( -\frac{(x_n - \mu)^2}{2\sigma^2} \right)$:

$$L(\theta) = -\frac{N}{2} \log 2\pi - \frac{N}{2} \log \sigma^2 - \frac{1}{2\sigma^2} \sum_{n} (x_n - \mu)^2$$

$$\frac{\partial L(\theta)}{\partial \mu} = \frac{1}{\sigma^2} \sum_{n} (x_n - \mu) = 0 \;\Rightarrow\; \hat{\mu} = \frac{1}{N} \sum_{n} x_n$$

$$\frac{\partial L(\theta)}{\partial \sigma} = -\frac{N}{\sigma} + \frac{1}{\sigma^3} \sum_{n} (x_n - \mu)^2 = 0 \;\Rightarrow\; \hat{\sigma}^2 = \frac{1}{N} \sum_{n} (x_n - \hat{\mu})^2$$

SLIDE 23

Maximum Likelihood Estimation

$$\hat{\theta} = \operatorname*{argmax}_{\theta}\; p(\mathbf{x} \mid \theta) = \operatorname*{argmax}_{\theta}\; \prod_{n} p(x_n \mid \theta), \qquad \mathbf{x} = x_{1..N}$$

Gaussian distribution:

$$p(x_n \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left( -\frac{(x_n - \mu)^2}{2\sigma^2} \right)$$

Maximum likelihood estimates:

$$\hat{\mu} = \frac{1}{N} \sum_{n} x_n, \qquad \hat{\sigma}^2 = \frac{1}{N} \sum_{n} (x_n - \hat{\mu})^2$$

SLIDE 24

Example: MLE

>> mu_fg = mean(im(labels))
mu_fg = 0.6012
>> sigma_fg = sqrt(mean((im(labels)-mu_fg).^2))
sigma_fg = 0.1007
>> mu_bg = mean(im(~labels))
mu_bg = 0.4007
>> sigma_bg = sqrt(mean((im(~labels)-mu_bg).^2))
sigma_bg = 0.1007
>> pfg = mean(labels(:));

(Figure: the label mask β€œlabels” and the image β€œim”; parameters used to generate them: fg: mu=0.6, sigma=0.1; bg: mu=0.4, sigma=0.1.)
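For context, a minimal sketch of how a toy image and label mask with exactly these generating parameters could be created (this is an assumption for illustration; the actual β€œim” and β€œlabels” used on the slide are not included in this transcript):

% Hypothetical toy data for the MLE example (assumed, not from the slide):
% left half background, right half foreground, Gaussian pixel intensities.
labels = false(100, 100);
labels(:, 51:end) = true;                          % foreground mask
im = 0.4 + 0.1*randn(100, 100);                    % background: mu = 0.4, sigma = 0.1
im(labels) = 0.6 + 0.1*randn(nnz(labels), 1);      % foreground: mu = 0.6, sigma = 0.1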

SLIDE 25

Probabilistic Inference

  • 2. Once we have modeled the fg/bg appearance, how do

we compute the likelihood that a pixel is foreground?

Foreground Background

SLIDE 26

Probabilistic Inference

Compute the likelihood that a particular model (component or label) generated a sample:

$$p(z_n = m \mid x_n, \theta)$$

SLIDE 27

Probabilistic Inference

Compute the likelihood that a particular model (component or label) generated a sample:

$$p(z_n = m \mid x_n, \theta) = \frac{p(z_n = m, x_n \mid \theta)}{p(x_n \mid \theta)}$$

Conditional probability: $P(A \mid B) = \dfrac{P(A, B)}{P(B)}$

SLIDE 28

Probabilistic Inference

Compute the likelihood that a particular model (component or label) generated a sample:

$$p(z_n = m \mid x_n, \theta) = \frac{p(z_n = m, x_n \mid \theta)}{p(x_n \mid \theta)} = \frac{p(z_n = m, x_n \mid \theta)}{\sum_{k} p(z_n = k, x_n \mid \theta)}$$

Marginalization: $P(A) = \sum_{k} P(A, B = k)$

SLIDE 29

Probabilistic Inference

Compute the likelihood that a particular model (component or label) generated a sample:

$$p(z_n = m \mid x_n, \theta) = \frac{p(z_n = m, x_n \mid \theta)}{\sum_{k} p(z_n = k, x_n \mid \theta)} = \frac{p(x_n \mid z_n = m, \theta_m)\, p(z_n = m)}{\sum_{k} p(x_n \mid z_n = k, \theta_k)\, p(z_n = k)}$$

Joint distribution: $P(A, B) = P(B)\, P(A \mid B)$

SLIDE 30

Example: Inference

>> pfg = 0.5;
>> px_fg = normpdf(im, mu_fg, sigma_fg);
>> px_bg = normpdf(im, mu_bg, sigma_bg);
>> pfg_x = px_fg*pfg ./ (px_fg*pfg + px_bg*(1-pfg));

(Figure: the image β€œim”, the learned parameters (fg: mu=0.6, sigma=0.1; bg: mu=0.4, sigma=0.1), and the resulting p(fg | im).)

SLIDE 31

Dealing with Hidden Variables

  • 3. How can we get both labels and appearance parameters

at once?

Foreground Background

SLIDE 32

Mixture of Gaussians

$$p(x_n \mid \boldsymbol{\mu}, \boldsymbol{\sigma}^2, \boldsymbol{\pi}) = \sum_{m} p(x_n, z_n = m \mid \boldsymbol{\mu}, \boldsymbol{\sigma}^2, \boldsymbol{\pi})$$

where $z_n$ is the mixture component for sample $x_n$, and

$$p(x_n, z_n = m \mid \boldsymbol{\mu}, \boldsymbol{\sigma}^2, \boldsymbol{\pi}) = p(x_n \mid \mu_m, \sigma_m^2)\, p(z_n = m) = \pi_m \cdot \frac{1}{\sqrt{2\pi\sigma_m^2}} \exp\!\left( -\frac{(x_n - \mu_m)^2}{2\sigma_m^2} \right)$$

with component prior $\pi_m$ and component model parameters $\mu_m, \sigma_m^2$.
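A small MATLAB sketch of evaluating such a mixture density (illustrative; the component priors, means, and standard deviations below are assumptions chosen to echo the fg/bg example):

% Mixture-of-Gaussians density: weighted sum of Gaussian components.
pis    = [0.5 0.5];                    % component priors (assumed)
mus    = [0.4 0.6];                    % component means
sigmas = [0.1 0.1];                    % component standard deviations
x  = linspace(0, 1, 200);
px = zeros(size(x));
for m = 1:numel(pis)
    px = px + pis(m) * normpdf(x, mus(m), sigmas(m));
end
plot(x, px); xlabel('x'); ylabel('p(x)');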

SLIDE 33

Mixture of Gaussians

With enough components, a mixture of Gaussians can approximate essentially any probability density function.

  β€’ Widely used as a general-purpose pdf estimator
SLIDE 34

Segmentation with Mixture of Gaussians

Pixels come from one of several Gaussian components.

  β€’ We don’t know which pixels come from which components
  β€’ We don’t know the parameters for the components

Problem: Estimate the parameters of the Gaussian Mixture Model. What would you do?

SLIDE 35

Simple solution

  • 1. Initialize parameters
  • 2. Compute the probability of each hidden variable given the current

parameters

  • 3. Compute new parameters for each model, weighted by likelihood of

hidden variables

  • 4. Repeat 2-3 until convergence
SLIDE 36

Mixture of Gaussians: Simple Solution

  • 1. Initialize parameters
  • 2. Compute likelihood of hidden variables for current parameters
  • 3. Estimate new parameters for each model, weighted by likelihood

$$\alpha_{nm} = p(z_n = m \mid x_n, \boldsymbol{\mu}^{(t)}, \boldsymbol{\sigma}^{2\,(t)}, \boldsymbol{\pi}^{(t)})$$

$$\hat{\mu}_m^{(t+1)} = \frac{\sum_{n} \alpha_{nm}\, x_n}{\sum_{n} \alpha_{nm}}, \qquad \hat{\sigma}_m^{2\,(t+1)} = \frac{\sum_{n} \alpha_{nm}\, \big(x_n - \hat{\mu}_m^{(t+1)}\big)^2}{\sum_{n} \alpha_{nm}}, \qquad \hat{\pi}_m^{(t+1)} = \frac{\sum_{n} \alpha_{nm}}{N}$$

SLIDE 37

Expectation Maximization (EM) Algorithm

Goal:

$$\hat{\theta} = \operatorname*{argmax}_{\theta}\; \log \sum_{\mathbf{z}} p(\mathbf{x}, \mathbf{z} \mid \theta)$$

The log of a sum is intractable.

Jensen’s inequality: $f(\mathrm{E}[X]) \ge \mathrm{E}[f(X)]$ for concave functions $f(x)$ (so we maximize the lower bound!)

See here for a proof: www.stanford.edu/class/cs229/notes/cs229-notes8.ps
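To spell out how Jensen's inequality is used here (a standard derivation step, not reproduced in this transcript): for any distribution $q(\mathbf{z})$ over the hidden variables,

$$\log \sum_{\mathbf{z}} p(\mathbf{x}, \mathbf{z} \mid \theta) = \log \sum_{\mathbf{z}} q(\mathbf{z}) \frac{p(\mathbf{x}, \mathbf{z} \mid \theta)}{q(\mathbf{z})} \;\ge\; \sum_{\mathbf{z}} q(\mathbf{z}) \log \frac{p(\mathbf{x}, \mathbf{z} \mid \theta)}{q(\mathbf{z})}$$

since $\log$ is concave. EM takes $q(\mathbf{z}) = p(\mathbf{z} \mid \mathbf{x}, \theta^{(t)})$, which makes the bound tight at $\theta = \theta^{(t)}$; maximizing the bound over $\theta$ then amounts to maximizing $\sum_{\mathbf{z}} p(\mathbf{z} \mid \mathbf{x}, \theta^{(t)}) \log p(\mathbf{x}, \mathbf{z} \mid \theta)$, which is exactly the M-step objective on the next slide.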

SLIDE 38

Expectation Maximization (EM) Algorithm

  β€’ 1. E-step: compute

$$\mathrm{E}_{\mathbf{z} \mid \mathbf{x}, \theta^{(t)}}\!\left[ \log p(\mathbf{x}, \mathbf{z} \mid \theta) \right] = \sum_{\mathbf{z}} p(\mathbf{z} \mid \mathbf{x}, \theta^{(t)}) \log p(\mathbf{x}, \mathbf{z} \mid \theta)$$

  β€’ 2. M-step: solve

$$\theta^{(t+1)} = \operatorname*{argmax}_{\theta}\; \sum_{\mathbf{z}} p(\mathbf{z} \mid \mathbf{x}, \theta^{(t)}) \log p(\mathbf{x}, \mathbf{z} \mid \theta)$$

Goal:

$$\hat{\theta} = \operatorname*{argmax}_{\theta}\; \log \sum_{\mathbf{z}} p(\mathbf{x}, \mathbf{z} \mid \theta)$$

SLIDE 39
  • 1. E-step: compute
  • 2. M-step: solve

   

 

    



) ( , |

, | | , log | , log E

) (

t x z

p p p

t

  



x z z x z x

z

οƒ₯

ο€½

    



) ( ) 1 (

, | | , log argmax

t t

p p   



x z z x

z

οƒ₯

ο€½



 

οƒΈ οƒΆ    ο€½

οƒ₯

z

z x  



| , log argmax Λ† p

Goal:

 

   

 

X f X f E E ο‚³

log of expectation of P(x|z) expectation of log of P(x|z)

SLIDE 40

EM for Mixture of Gaussians - derivation

$$p(x_n \mid \boldsymbol{\mu}, \boldsymbol{\sigma}^2, \boldsymbol{\pi}) = \sum_{m} p(x_n, z_n = m \mid \boldsymbol{\mu}, \boldsymbol{\sigma}^2, \boldsymbol{\pi}) = \sum_{m} \pi_m \cdot \frac{1}{\sqrt{2\pi\sigma_m^2}} \exp\!\left( -\frac{(x_n - \mu_m)^2}{2\sigma_m^2} \right)$$

1. E-step: compute

$$\mathrm{E}_{\mathbf{z} \mid \mathbf{x}, \theta^{(t)}}\!\left[ \log p(\mathbf{x}, \mathbf{z} \mid \theta) \right] = \sum_{\mathbf{z}} p(\mathbf{z} \mid \mathbf{x}, \theta^{(t)}) \log p(\mathbf{x}, \mathbf{z} \mid \theta)$$

2. M-step: solve

$$\theta^{(t+1)} = \operatorname*{argmax}_{\theta}\; \sum_{\mathbf{z}} p(\mathbf{z} \mid \mathbf{x}, \theta^{(t)}) \log p(\mathbf{x}, \mathbf{z} \mid \theta)$$
SLIDE 41

EM for Mixture of Gaussians

$$p(x_n \mid \boldsymbol{\mu}, \boldsymbol{\sigma}^2, \boldsymbol{\pi}) = \sum_{m} p(x_n, z_n = m \mid \boldsymbol{\mu}, \boldsymbol{\sigma}^2, \boldsymbol{\pi}) = \sum_{m} \pi_m \cdot \frac{1}{\sqrt{2\pi\sigma_m^2}} \exp\!\left( -\frac{(x_n - \mu_m)^2}{2\sigma_m^2} \right)$$

1. E-step: compute the posterior over components

$$\alpha_{nm} = p(z_n = m \mid x_n, \boldsymbol{\mu}^{(t)}, \boldsymbol{\sigma}^{2\,(t)}, \boldsymbol{\pi}^{(t)})$$

2. M-step: re-estimate the parameters, weighted by the posteriors

$$\hat{\mu}_m^{(t+1)} = \frac{\sum_{n} \alpha_{nm}\, x_n}{\sum_{n} \alpha_{nm}}, \qquad \hat{\sigma}_m^{2\,(t+1)} = \frac{\sum_{n} \alpha_{nm}\, \big(x_n - \hat{\mu}_m^{(t+1)}\big)^2}{\sum_{n} \alpha_{nm}}, \qquad \hat{\pi}_m^{(t+1)} = \frac{\sum_{n} \alpha_{nm}}{N}$$

SLIDE 42

EM algorithm - derivation

http://lasa.epfl.ch/teaching/lectures/ML_Phd/Notes/GP-GMM.pdf

SLIDE 43

EM algorithm – E-Step

SLIDE 44

EM algorithm – E-Step

SLIDE 45

EM algorithm – M-Step

SLIDE 46

EM algorithm – M-Step

Take the derivative with respect to $\mu_m$
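The equations for this step are not reproduced in this transcript; for reference, a sketch of the standard result, with $\alpha_{nm} = p(z_n = m \mid x_n, \theta^{(t)})$ as on Slide 36, consistent with the updates on Slides 36 and 41:

$$\frac{\partial}{\partial \mu_m} \sum_{n} \sum_{k} \alpha_{nk} \left[ \log \pi_k - \tfrac{1}{2}\log(2\pi\sigma_k^2) - \frac{(x_n - \mu_k)^2}{2\sigma_k^2} \right] = \sum_{n} \alpha_{nm} \frac{x_n - \mu_m}{\sigma_m^2} = 0 \;\Rightarrow\; \hat{\mu}_m = \frac{\sum_{n} \alpha_{nm}\, x_n}{\sum_{n} \alpha_{nm}}$$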

SLIDE 47

EM algorithm – M-Step

Take the derivative with respect to $\sigma_m^{-1}$
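Again, the slide's equations are not in this transcript; a sketch of the step, differentiating with respect to $\sigma_m^{-1}$ as the slide title indicates and keeping only the terms that depend on it:

$$\frac{\partial}{\partial \sigma_m^{-1}} \sum_{n} \alpha_{nm} \left[ \log \sigma_m^{-1} - \frac{(x_n - \mu_m)^2}{2\sigma_m^{2}} \right] = \sum_{n} \alpha_{nm} \left[ \sigma_m - \sigma_m^{-1} (x_n - \mu_m)^2 \right] = 0 \;\Rightarrow\; \hat{\sigma}_m^2 = \frac{\sum_{n} \alpha_{nm} (x_n - \mu_m)^2}{\sum_{n} \alpha_{nm}}$$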

SLIDE 48

EM Algorithm for GMM

SLIDE 49

EM Algorithm

  • Maximizes a lower bound on the data likelihood at each iteration
  • Each step increases the data likelihood
  • Converges to local maximum
  • Common tricks to derivation
  • Find terms that sum or integrate to 1
  • Lagrange multiplier to deal with constraints
SLIDE 50

Convergence of EM Algorithm

SLIDE 51

β€œHard EM”

  • Same as EM except compute z* as most likely values for hidden

variables

  • K-means is an example
  • Advantages
  • Simpler: can be applied when cannot derive EM
  • Sometimes works better if you want to make hard predictions at the end
  • But
  • Generally, pdf parameters are not as accurate as EM