SLIDE 1

Accelerating the EM Algorithm for Mixture Density Estimation

Homer Walker, Mathematical Sciences Department, Worcester Polytechnic Institute

Joint work with Josh Plasse (WPI/Imperial College). Research supported in part by DOE Grant DE-SC0004880 and NSF Grant DMS-1337943.

SLIDE 2

Mixture Densities

Consider a (finite) mixture density

$$p(x \mid \Phi) = \sum_{i=1}^{m} \alpha_i\, p_i(x \mid \phi_i).$$

Problem: Estimate Φ = (α_1, . . . , α_m, φ_1, . . . , φ_m) using an "unlabeled" sample {x_k}_{k=1}^{N} on the mixture.

Maximum-Likelihood Estimate (MLE): Determine Φ* = arg max_Φ L(Φ), where

$$L(\Phi) \equiv \sum_{k=1}^{N} \log p(x_k \mid \Phi).$$
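As a concrete illustration (mine, not from the slides), L(Φ) for a univariate normal mixture can be evaluated as follows; the function name and the univariate parameterization are assumptions for the sketch.

```python
# Sketch (not from the talk): evaluate L(Phi) = sum_k log p(x_k | Phi)
# for a univariate normal mixture. Names and parameterization are mine.
import numpy as np

def mixture_log_likelihood(x, alphas, mus, sigma2s):
    """x: sample of shape (N,); alphas, mus, sigma2s: shape (m,)."""
    x = np.asarray(x, dtype=float)[:, None]                  # (N, 1)
    dens = (np.exp(-(x - mus) ** 2 / (2 * sigma2s))
            / np.sqrt(2 * np.pi * sigma2s))                  # p_i(x_k), (N, m)
    return float(np.sum(np.log(dens @ alphas)))              # sum_k log sum_i alpha_i p_i
```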

SLIDE 3

The EM (Expectation-Maximization) Algorithm

The general formulation and the name were given in . . .

  • A. P. Dempster, N. M. Laird, and D. B. Rubin (1977), Maximum likelihood from incomplete data via the EM algorithm, J. Royal Statist. Soc. Ser. B (Methodological), 39, pp. 1–38.

General idea: Determine the next approximate MLE to maximize the expectation of the complete-data log-likelihood function, given the observed incomplete data and the current approximate MLE.

Marvelous property: The log-likelihood function increases at each iteration.

SLIDE 4

The EM Algorithm for Mixture Densities

For a mixture density, an EM iteration is . . .

$$\alpha_i^{+} = \frac{1}{N}\sum_{k=1}^{N} \frac{\alpha_i^{c}\, p_i(x_k \mid \phi_i^{c})}{p(x_k \mid \Phi^{c})}, \qquad \phi_i^{+} = \arg\max_{\phi_i} \sum_{k=1}^{N} \log p_i(x_k \mid \phi_i)\, \frac{\alpha_i^{c}\, p_i(x_k \mid \phi_i^{c})}{p(x_k \mid \Phi^{c})}.$$

For a derivation, convergence analysis, history, etc., see . . .

  • R. A. Redner and H. F. Walker (1984), Mixture densities, maximum likelihood, and the EM algorithm, SIAM Review, 26, pp. 195–239.

SLIDE 5

Particular Example: Normal (Gaussian) Mixtures

Assume (multivariate) normal densities. For each i, φ_i = (μ_i, Σ_i) and

$$p_i(x \mid \phi_i) = \frac{1}{(2\pi)^{n/2}(\det \Sigma_i)^{1/2}}\, e^{-(x-\mu_i)^{T}\Sigma_i^{-1}(x-\mu_i)/2}.$$

EM iteration: For i = 1, . . . , m,

$$\alpha_i^{+} = \frac{1}{N}\sum_{k=1}^{N} \frac{\alpha_i^{c}\, p_i(x_k \mid \phi_i^{c})}{p(x_k \mid \Phi^{c})}, \qquad \mu_i^{+} = \frac{\sum_{k=1}^{N} x_k\, \alpha_i^{c}\, p_i(x_k \mid \phi_i^{c})/p(x_k \mid \Phi^{c})}{\sum_{k=1}^{N} \alpha_i^{c}\, p_i(x_k \mid \phi_i^{c})/p(x_k \mid \Phi^{c})},$$

$$\Sigma_i^{+} = \frac{\sum_{k=1}^{N} (x_k - \mu_i^{+})(x_k - \mu_i^{+})^{T}\, \alpha_i^{c}\, p_i(x_k \mid \phi_i^{c})/p(x_k \mid \Phi^{c})}{\sum_{k=1}^{N} \alpha_i^{c}\, p_i(x_k \mid \phi_i^{c})/p(x_k \mid \Phi^{c})}.$$
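A minimal sketch of one such EM step in Python, assuming SciPy for the component densities; this renders the formulas above and is not the author's MATLAB code, and all names are mine.

```python
# One EM update for a multivariate normal mixture, following the slide's
# formulas. A sketch under stated assumptions, not the talk's code.
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, alphas, mus, Sigmas):
    """One EM update. X: (N, n); alphas: (m,); mus: (m, n); Sigmas: (m, n, n)."""
    N, n = X.shape
    m = len(alphas)
    # W[k, i] = alpha_i^c p_i(x_k | phi_i^c) / p(x_k | Phi^c)
    W = np.column_stack([alphas[i] * multivariate_normal.pdf(X, mus[i], Sigmas[i])
                         for i in range(m)])
    W /= W.sum(axis=1, keepdims=True)
    s = W.sum(axis=0)                           # shared denominators, shape (m,)
    alphas_new = s / N                          # alpha_i^+
    mus_new = (W.T @ X) / s[:, None]            # mu_i^+
    Sigmas_new = np.empty((m, n, n))
    for i in range(m):
        D = X - mus_new[i]
        Sigmas_new[i] = (W[:, i, None] * D).T @ D / s[i]   # Sigma_i^+
    return alphas_new, mus_new, Sigmas_new
```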

SLIDE 6

EM Iterations Demo

A Univariate Normal Mixture.

$$p_i(x \mid \phi_i) = \frac{1}{\sqrt{2\pi\sigma_i^{2}}}\, e^{-(x-\mu_i)^{2}/(2\sigma_i^{2})} \quad \text{for } i = 1, \ldots, 5.$$

Sample of 100,000 observations.
— [α_1, . . . , α_5] = [.2, .3, .3, .1, .1],
— [μ_1, . . . , μ_5] = [0, 1, 2, 3, 4],
— [σ_1^2, . . . , σ_5^2] = [.2, 2, .5, .1, .1].

EM iterations on the means:

$$\mu_i^{+} = \frac{\sum_{k=1}^{N} x_k\, \alpha_i\, p_i(x_k \mid \phi_i)/p(x_k \mid \Phi)}{\sum_{k=1}^{N} \alpha_i\, p_i(x_k \mid \phi_i)/p(x_k \mid \Phi)}.$$

[Figure: plot of the mixture density.]
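The demo can be reproduced along these lines (a sketch under the stated parameters; the random seed and the perturbed starting means are my choices, since the slides do not give them):

```python
# Sketch of the demo: sample the 5-component mixture and iterate EM on
# the means only, with alphas and variances fixed at their true values.
import numpy as np

rng = np.random.default_rng(0)                    # seed is my choice
alphas = np.array([.2, .3, .3, .1, .1])
mus_true = np.array([0., 1., 2., 3., 4.])
sigma2 = np.array([.2, 2., .5, .1, .1])

N = 100_000
comp = rng.choice(5, size=N, p=alphas)            # latent component labels
x = rng.normal(mus_true[comp], np.sqrt(sigma2[comp]))

def em_means(mus):
    """One EM update of the means; returns mu_i^+ for i = 1..5."""
    w = (alphas / np.sqrt(2 * np.pi * sigma2)
         * np.exp(-(x[:, None] - mus) ** 2 / (2 * sigma2)))  # alpha_i p_i(x_k)
    w /= w.sum(axis=1, keepdims=True)             # divide by p(x_k | Phi)
    return (w * x[:, None]).sum(axis=0) / w.sum(axis=0)

mus = mus_true + 0.5 * rng.standard_normal(5)     # perturbed start (my choice)
for _ in range(100):
    mus = em_means(mus)
```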

SLIDE 7

EM Iterations Demo (cont.)

Same mixture, sample, and EM iterations on the means as on the preceding slide.

[Figure: the mixture density, and the log residual norm vs. iteration number for the EM iterations.]

SLIDE 8

Anderson Acceleration

Derived from a method of D. G. Anderson, Iterative procedures for nonlinear integral equations, J. Assoc. Comput. Machinery, 12 (1965), pp. 547–560.

Consider a fixed-point iteration x^+ = g(x), g : R^n → R^n.

Anderson Acceleration: Given x_0 and mMax ≥ 1, set x_1 = g(x_0).
Iterate: For k = 1, 2, . . .
— Set m_k = min{mMax, k}.
— Set F_k = (f_{k−m_k}, . . . , f_k), where f_i = g(x_i) − x_i.
— Solve $\min_{\alpha \in \mathbb{R}^{m_k+1}} \|F_k \alpha\|_2$ subject to $\sum_{i=0}^{m_k} \alpha_i = 1$.
— Set $x_{k+1} = \sum_{i=0}^{m_k} \alpha_i\, g(x_{k-m_k+i})$.

SLIDE 9

EM Iterations Demo (cont.)

Same mixture, sample, and EM iterations on the means as in the earlier demo.

[Figure: plot of the mixture density.]

SLIDE 10

EM Iterations Demo (cont.)

Same mixture, sample, and EM iterations on the means as in the earlier demo.

[Figure: log residual norm vs. iteration number.]

SLIDE 11

EM Convergence and “Separation”

Redner–Walker (1984): For mixture densities, the convergence is linear and depends on the "separation" of the component populations:

"well-separated" (fast convergence) if, whenever i ≠ j,

$$\frac{p_i(x \mid \phi_i^{*})}{p(x \mid \Phi^{*})} \cdot \frac{p_j(x \mid \phi_j^{*})}{p(x \mid \Phi^{*})} \approx 0 \quad \text{for all } x \in \mathbb{R}^{n};$$

"poorly separated" (slow convergence) if, for some i ≠ j,

$$\frac{p_i(x \mid \phi_i^{*})}{p(x \mid \Phi^{*})} \approx \frac{p_j(x \mid \phi_j^{*})}{p(x \mid \Phi^{*})} \quad \text{for all } x \in \mathbb{R}^{n}.$$
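One crude way to quantify this (my own summary, not from the talk) is the average pairwise overlap of the posterior weights w_i(x) = α_i p_i(x|φ_i)/p(x|Φ) under the mixture: it is near 0 for well-separated components and grows as they merge. The sketch below evaluates it for the three mean configurations used on the next slide.

```python
# Rough numerical illustration (mine) of "separation" via the averaged
# pairwise products of posterior weights; larger values mean poorer
# separation and, per Redner-Walker, slower EM convergence.
import numpy as np

x = np.linspace(-8, 12, 20001)
dx = x[1] - x[0]
alphas = np.array([.3, .3, .4])                  # sigma_i^2 = 1 throughout

for mus in ([0, 2, 4], [0, 1, 2], [0, .5, 1]):
    dens = (np.exp(-(x[:, None] - np.asarray(mus, float)) ** 2 / 2)
            / np.sqrt(2 * np.pi))                # component densities p_i(x)
    p = dens @ alphas                            # mixture density p(x | Phi)
    w = alphas * dens / p[:, None]               # posterior weights; rows sum to 1
    overlap = sum(float(np.sum(w[:, i] * w[:, j] * p) * dx)
                  for i in range(3) for j in range(i + 1, 3))
    print(mus, round(overlap, 3))                # grows as the means cluster
```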

SLIDE 12

Example: EM Convergence and “Separation”

A Univariate Normal Mixture.

$$p_i(x \mid \phi_i) = \frac{1}{\sqrt{2\pi\sigma_i^{2}}}\, e^{-(x-\mu_i)^{2}/(2\sigma_i^{2})} \quad \text{for } i = 1, \ldots, 3.$$

EM iterations on the means:

$$\mu_i^{+} = \frac{\sum_{k=1}^{N} x_k\, \alpha_i\, p_i(x_k \mid \phi_i)/p(x_k \mid \Phi)}{\sum_{k=1}^{N} \alpha_i\, p_i(x_k \mid \phi_i)/p(x_k \mid \Phi)}.$$

Sample of 100,000 observations.
— [α_1, α_2, α_3] = [.3, .3, .4], [σ_1^2, σ_2^2, σ_3^2] = [1, 1, 1].
— [μ_1, μ_2, μ_3] = [0, 2, 4], [0, 1, 2], [0, .5, 1] (progressively poorer separation).

[Figure: log residual norm vs. iteration number.]

SLIDE 13

Example: EM Convergence and "Separation" (cont.)

Same mixture, sample, and EM iterations on the means as on the preceding slide.

[Figure: log residual norm vs. iteration number.]

SLIDE 14

Experiments with Multivariate Normal Mixtures

Experiment with Anderson acceleration applied to the EM iteration for multivariate normal mixtures: for i = 1, . . . , m, the updates α_i^+, μ_i^+, Σ_i^+ given on Slide 5.

Assume m is known. Ultimate interest: very large N.

SLIDE 15

Experiments with Multivariate Normal Mixtures (cont.)

Two issues:

Good initial guess? Use K-means.
— Fast clustering algorithm; usually gives good results.
— Apply it several times to random subsets of the sample; choose the clustering with the minimal sum of within-class distances.
— Use the proportions, means, and covariance matrices of the resulting clusters as the initial guess.

Preserving constraints? Iterate on . . .
— √α_i, i = 1, . . . , m;
— Cholesky factors of each Σ_i.
A sketch of this parameterization follows below.
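The following sketch packs and unpacks the parameters in this form so that a generic accelerator can iterate on one unconstrained vector; the flattening layout and the renormalization of the proportions are my choices, not the talk's code.

```python
# Feasibility-preserving parameterization: iterate on sqrt(alpha_i) and
# on Cholesky factors of each Sigma_i. Layout and names are mine.
import numpy as np

def pack(alphas, mus, Ls):
    """Map (alphas, mus, lower-triangular Cholesky factors) to one vector."""
    return np.concatenate([np.sqrt(alphas), mus.ravel(),
                           np.concatenate([L[np.tril_indices_from(L)] for L in Ls])])

def unpack(v, m, n):
    r, v = v[:m], v[m:]
    alphas = r ** 2                        # nonnegative by construction
    alphas = alphas / alphas.sum()         # re-impose sum-to-one (my choice)
    mus, v = v[:m * n].reshape(m, n), v[m * n:]
    t = n * (n + 1) // 2
    Ls = []
    for i in range(m):
        L = np.zeros((n, n))
        L[np.tril_indices(n)] = v[i * t:(i + 1) * t]
        Ls.append(L)
    Sigmas = np.array([L @ L.T for L in Ls])   # symmetric PSD by construction
    return alphas, mus, Sigmas
```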

SLIDE 16

Experiments with Generated Data

All computing in MATLAB.

Mixtures with m = 5 subpopulations.

Generated data in R^d for d = 2, 5, 10, 15, 20:
— For each d, randomly generated 100 "true" parameter sets {α_i, μ_i, Σ_i}_{i=1}^{5}.
— For each {α_i, μ_i, Σ_i}_{i=1}^{5}, randomly generated a sample of size N = 1,000,000.

Compared (unaccelerated) EM with EM+AA with mMax = 5, 10, 15, 20, 25, 30.
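The slides do not specify how the "true" parameters were drawn; one plausible construction, purely as an illustration (the distributions and scales below are my assumptions), is:

```python
# Sketch: draw random "true" mixture parameters in R^d. The sampling
# scheme is my assumption; the talk does not specify it.
import numpy as np

def random_mixture_params(m, d, rng):
    a = rng.random(m)
    alphas = a / a.sum()                      # proportions on the simplex
    mus = rng.uniform(-5, 5, size=(m, d))     # spread-out means (my choice)
    Sigmas = np.empty((m, d, d))
    for i in range(m):
        A = rng.standard_normal((d, d))
        Sigmas[i] = A @ A.T + d * np.eye(d)   # SPD by construction
    return alphas, mus, Sigmas

rng = np.random.default_rng(1)
alphas, mus, Sigmas = random_mixture_params(5, 10, rng)
```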

SLIDE 17

Experiments with Generated Data (cont.)

A look at failures (the first column is unaccelerated EM, i.e., mMax = 0).

mMax      0 (EM)    5    10    15    20    25    30
⋆             75   66    52    52    51    51    51
⋆⋆             0    4    19    23    28    29    29
Totals        75   70    71    75    79    80    80

⋆ ⇒ failure to converge within 300 iterations.
⋆⋆ ⇒ $\sum_{k=1}^{N} \alpha_i\, p_i(x_k)/p(x_k) = 0$ for some i (the denominator of the μ_i^+ and Σ_i^+ updates vanishes).

There were . . .
— 49 trials in which all methods failed,
— 26 trials in which EM failed and EM+AA succeeded for at least one mMax,
— 15 trials in which EM failed and EM+AA succeeded for all mMax,
— 20 trials in which EM succeeded and EM+AA failed for all mMax,
— 21 trials in which EM succeeded and EM+AA failed for at least one mMax.

SLIDE 18

Experiments with Generated Data (cont.)

Performance profiles (Dolan and Moré, 2002) for (unaccelerated) EM and for EM+AA with mMax = 5, over all trials:

[Figure: performance profiles for mMax = 0 (EM) and mMax = 5 (EM+AA). Left: iteration numbers. Right: run times.]
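For reference, a performance profile plots, for each solver, the fraction of problems it solves within a factor τ of the best solver's cost. A minimal sketch of the computation (my own rendering of the Dolan and Moré construction; the cost matrix below is made up):

```python
# Sketch: Dolan-More performance profile from a cost matrix.
import numpy as np

def performance_profile(T, taus):
    """T[p, s]: cost (iterations or run time) of solver s on problem p,
    with np.inf marking a failure. Returns R with R[j, s] = fraction of
    problems on which solver s is within a factor taus[j] of the best."""
    best = T.min(axis=1, keepdims=True)          # best cost on each problem
    ratios = T / best                            # failures stay at inf
    return np.array([(ratios <= t).mean(axis=0) for t in taus])

# Tiny illustration with made-up costs: two solvers on three problems.
T = np.array([[10., 12.],
              [np.inf, 30.],                     # solver 0 failed here
              [8., 8.]])
print(performance_profile(T, [1., 1.5, 4.]))
```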

SLIDE 19

An Experiment with Real Data

Remotely sensed data from near Tollhouse, CA. (Thanks to Brett Bader, Digital Globe.)

N = 3285 × 959 = 3,150,315 observations of 16-dimensional multispectral data.

Modeled with a mixture of m = 3 multivariate normals.

Applied (unaccelerated) EM and EM+AA with mMax = 5, 10, 15, 20, 25, 30.

SLIDE 20

An Experiment with Real Data (cont.)

[Figure: left, log residual norms vs. iteration numbers; right, Bayes classification of the data based on the MLE.]

SLIDE 21

In Conclusion . . .

Anderson acceleration is a promising tool for accelerating the EM algorithm that may improve both robustness and efficiency.

Future work:
— Expand the generated-data experiments to include more trials, larger data sets, well-controlled "separation" experiments, "partially labeled" samples, and other parametric PDF forms.
— Look for more data from real applications.
