

SLIDE 1

On Casting Importance Weighted Autoencoder to an EM Algorithm to Learn Deep Generative Models

D. Kim¹, J. Hwang², and Y. Kim¹

Speaker: Dongha Kim

¹ Department of Statistics, Seoul National University, South Korea
² SK Telecom, South Korea

XAIENCE 2019, November 07, 2019

SLIDE 2

Outline

1 Introduction
2 Proposed methods
  • IWAE as EM algorithm
  • IWEM
  • miss-IWEM
3 Empirical analysis
4 Summary
5 References


SLIDE 3

Deep generative model with latent variable

  • X : observable variable
  • Z : latent variable

Z ∼ p(z)  (e.g., N(0, I))
X | Z = z ∼ p(x|z; θ)
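As a concrete illustration of this two-stage sampling process, here is a minimal NumPy sketch. The one-layer sigmoid decoder (`decoder_probs`, `W`, `b`) and the dimensions are illustrative assumptions, not the architecture used in the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

def decoder_probs(z, W, b):
    # Hypothetical one-layer decoder mapping z to Bernoulli pixel probabilities.
    return 1.0 / (1.0 + np.exp(-(z @ W + b)))  # sigmoid

d_z, d_x = 40, 784  # latent / observable dimensions (illustrative)
W = rng.normal(scale=0.1, size=(d_z, d_x))
b = np.zeros(d_x)

z = rng.standard_normal(d_z)                 # Z ~ N(0, I)
x = rng.binomial(1, decoder_probs(z, W, b))  # X | Z = z ~ p(x|z; theta)
```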


SLIDE 4

Deep generative model with latent variable

  • The log-likelihood of the observable vector x:

log p(x; θ) = log ∫ p(x|z; θ) p(z) dz.

  • The marginalization is problematic → hard to compute the MLE directly.
  • An alternative approach: maximize a lower bound that is easy to compute.
    • VAE (Kingma and Welling, 2013; Rezende et al., 2014)
    • IWAE (Burda et al., 2015)


SLIDE 5

Variational autoencoders (VAE)

  • Employ a variational posterior distribution q(z|x; φ):

L_VAE(x; θ, φ) := E_{z∼q} [ log ( p(x, z; θ) / q(z|x; φ) ) ]

  • In practice, we use the Monte Carlo method:

L̂_VAE(x; θ, φ) := (1/L) Σ_{l=1}^{L} log ( p(x, z_l; θ) / q(z_l|x; φ) ),

where z_1, ..., z_L ∼ q(z|x; φ).

  • Maximize L̂_VAE w.r.t. (θ, φ).
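A minimal sketch of this estimator, assuming the caller supplies the log-joint, the sampler, and the log-density of q; the callables `log_joint`, `sample_q`, and `log_q` are placeholders, not an API from the paper.

```python
import numpy as np

def elbo_hat(x, log_joint, sample_q, log_q, L=8, rng=None):
    # L_VAE_hat: (1/L) * sum_l [ log p(x, z_l; theta) - log q(z_l | x; phi) ],
    # with z_1, ..., z_L drawn i.i.d. from q(z|x; phi).
    rng = rng or np.random.default_rng()
    zs = [sample_q(x, rng) for _ in range(L)]
    return np.mean([log_joint(x, z) - log_q(z, x) for z in zs])
```

In practice both θ and φ are updated by stochastic gradient ascent on this estimate, with the reparameterization trick making the sampling step differentiable.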


SLIDE 6

Importance weighted autoencoders (IWAE)

  • Use multiple samples from q(z|x; φ):

L_IWAE(x; θ, φ) := E_{z_1,...,z_K∼q} [ log ( (1/K) Σ_{k=1}^{K} p(x, z_k; θ) / q(z_k|x; φ) ) ]

  • A tighter lower bound than the VAE bound.
  • Use the Monte Carlo method:

L̂_IWAE(x; θ, φ) := log ( (1/K) Σ_{k=1}^{K} p(x, z_k; θ) / q(z_k|x; φ) ),

where z_1, ..., z_K ∼ q(z|x; φ).

  • Maximize L̂_IWAE w.r.t. (θ, φ).
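The only change from the VAE estimator is taking the log of the average weight rather than the average of the logs; a sketch under the same placeholder-callable assumptions as before, using log-sum-exp for numerical stability:

```python
import numpy as np
from scipy.special import logsumexp

def iwae_hat(x, log_joint, sample_q, log_q, K=8, rng=None):
    # L_IWAE_hat: log( (1/K) * sum_k p(x, z_k; theta) / q(z_k | x; phi) ),
    # evaluated stably in log space via log-sum-exp.
    rng = rng or np.random.default_rng()
    zs = [sample_q(x, rng) for _ in range(K)]
    log_w = np.array([log_joint(x, z) - log_q(z, x) for z in zs])
    return logsumexp(log_w) - np.log(K)
```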


SLIDE 7

Contents

  • Interpret IWAE as an EM algorithm with importance sampling (IS).
  • Improve IWAE by
    1 learning the proposal distribution carefully, and
    2 devising an annealing strategy.
    → IWEM (importance weighted EM algorithm)
  • Generalize IWEM to missing-data problems.
    → miss-IWEM


SLIDE 8

Proposed methods

SLIDE 9

Proposed methods: IWAE as EM algorithm

SLIDE 10

EM algorithm

1 E-step
  • θc : the current estimate of θ.
  • Calculate the expected value of the complete log-likelihood:

Q(θ|θc; x) := E_{z∼p(z|x;θc)} [log p(x, z; θ)].

2 M-step
  • Update the current estimate by maximizing Q(θ|θc; x) over θ.


SLIDE 11

EM algorithm with IS

1 E-step
  • Approximate Q by employing a proposal distribution q(z|x; φ) (see the sketch after this list):

Q̂(θ|θc, φ; x) := Σ_{k=1}^{K} ( w_k / Σ_{k'=1}^{K} w_{k'} ) · log p(x, z_k; θ),

where z_k ∼ q(z|x; φ) and w_k = p(x, z_k; θc) / q(z_k|x; φ) for k = 1, ..., K.

2 M-step
  • Update θ by maximizing Q̂(θ|θc, φ; x).

3 P-step (if necessary)
  • Update φ by encouraging q(z|x; φ) to be a good proposal distribution.
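A minimal sketch of the self-normalized importance-sampling estimate above, with weights normalized in log space to avoid overflow; `log_joint_cur` (evaluated at the frozen θc), `log_joint` (at the θ being optimized), and `log_q` are assumed callables:

```python
import numpy as np
from scipy.special import logsumexp

def q_hat(x, zs, log_joint_cur, log_joint, log_q):
    # Q_hat(theta | theta_c, phi; x) via self-normalized importance sampling:
    # w_k = p(x, z_k; theta_c) / q(z_k | x; phi), normalized to sum to one.
    log_w = np.array([log_joint_cur(x, z) - log_q(z, x) for z in zs])
    w = np.exp(log_w - logsumexp(log_w))  # normalized weights
    return np.sum(w * np.array([log_joint(x, z) for z in zs]))
```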


SLIDE 12

IWAE = EM algorithm

Proposition 1. The following equality holds for any θc:

∇_θ L̂_IWAE(x; θ, φ)|_{θ=θc} = ∇_θ Q̂(θ|θc, φ; x)|_{θ=θc}

  • Hence IWAE = EM algorithm if we use a gradient-descent-based optimization method.
  • Updating φ in IWAE can be understood as the P-step:

max_φ L̂_IWAE(x; θc, φ)


SLIDE 13

IWAE = EM algorithm (cont.)

IWAE as an EM algorithm:

1 E-step
  • Calculate Q̂(θ|θc, φ; x).
2 M-step
  • Update θ by maximizing Q̂(θ|θc, φ; x).
3 P-step
  • Update φ by maximizing L̂_IWAE(x; θc, φ).


SLIDE 14

Proposed methods: IWEM

SLIDE 15

Optimal P-step

  • Using Q̂ inevitably introduces variance due to IS.
  • Small variance → stable learning procedure.
  • The optimal proposal distribution (Owen, 2013):

q_opt(z) ∝ |log p(x, z; θc)| · p(x, z; θc).

  • IWAE uses p(x, z; θc) instead.
  • New P-step: replace p(x, z; θc) in the IWAE objective with q_opt(z), as sketched below:

L̂_opt(θc, φ; x) := log ( (1/K) Σ_{k=1}^{K} q_opt(z_k) / q(z_k|x; φ) ).
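A sketch of evaluating this P-step objective with the unnormalized q_opt; since its unknown normalizing constant only shifts the objective by a constant, the maximizer in φ is unaffected. The callables are the same placeholders as before, and the sketch assumes log p(x, z; θc) ≠ 0.

```python
import numpy as np
from scipy.special import logsumexp

def l_opt_hat(x, zs, log_joint_cur, log_q):
    # L_opt_hat: log( (1/K) * sum_k q_opt(z_k) / q(z_k | x; phi) ), with the
    # unnormalized density q_opt(z) = |log p(x, z; theta_c)| * p(x, z; theta_c).
    lj = np.array([log_joint_cur(x, z) for z in zs])
    log_q_opt = np.log(np.abs(lj)) + lj  # log|log p| + log p (assumes lj != 0)
    log_ratio = log_q_opt - np.array([log_q(z, x) for z in zs])
    return logsumexp(log_ratio) - np.log(len(zs))
```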


SLIDE 16

Annealing strategy

  • In general,

Var[ L̂_VAE(x; θ, φ) ] ≪ Var[ Q̂(θ|θ, φ; x) ].

  • Using VAE at early steps → small variance.
  • New E-step: take a convex combination with the VAE bound (see the sketch after this list):

Q̂_α(θ|θc, φ; x) := α · Q̂(θ|θc, φ; x) + (1 − α) · L̂_VAE(θ, φ; x).

  • α ∈ [0, 1] : annealing controller
    • start from zero, and
    • increase it incrementally up to one as the iterations proceed.
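A minimal sketch of the annealed objective. The slides only say α increases incrementally, so the linear warm-up schedule here is an assumption:

```python
def q_alpha_hat(q_hat_val, l_vae_val, step, warmup_steps):
    # Q_alpha_hat = alpha * Q_hat + (1 - alpha) * L_VAE_hat, with alpha
    # annealed from 0 to 1; a linear warm-up schedule is assumed.
    alpha = min(1.0, step / float(warmup_steps))
    return alpha * q_hat_val + (1.0 - alpha) * l_vae_val
```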


SLIDE 17

IWAE vs. IWEM

IWAE
1 E-step: calculate Q̂(θ|θc, φ; x).
2 M-step: update θ by maximizing Q̂.
3 P-step: update φ by maximizing L̂_IWAE(x; θc, φ).

IWEM
1 E-step: calculate Q̂_α(θ|θc, φ; x).
2 M-step: update θ by maximizing Q̂_α.
3 P-step: update φ by maximizing L̂_opt(θc, φ; x).
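Putting the steps together, one IWEM iteration can be sketched as a pair of gradient-ascent updates. Framing the M- and P-steps as single gradient steps, and the gradient callables `grad_theta_q_alpha` and `grad_phi_l_opt` (in practice obtained by automatic differentiation), are assumptions of this sketch, not the paper's stated implementation:

```python
def iwem_update(x, theta, phi, step, warmup_steps,
                grad_theta_q_alpha, grad_phi_l_opt, lr=1e-3):
    # One IWEM iteration; theta_c is the incoming theta, frozen inside
    # Q_alpha_hat while theta itself is updated.
    alpha = min(1.0, step / float(warmup_steps))
    theta = theta + lr * grad_theta_q_alpha(x, theta, phi, alpha)  # E/M-step
    phi = phi + lr * grad_phi_l_opt(x, theta, phi)                 # P-step
    return theta, phi
```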


SLIDE 18

Proposed methods: miss-IWEM

SLIDE 19

Missing data problem

  • x = (x(o), x(m)); we only observe x(o).
  • The log-likelihood is

log p(x(o); θ) = log ∫∫ p(x(o), x(m), z; θ) dz dx(m).

  • Need to formulate a proposal distribution for (x(m), z).


SLIDE 20

Formulation of proposal distribution

  • We use the following proposal distribution:

q(x(m), z|x(o); θ, φ) := p(x(m)|z; θ) · q(z|x̆; φ)

  • q(z|x; φ) : the same distribution as q in IWEM.
  • x̆ = (x(o), x̆(m)), where x̆(m) is an imputed value of x(m):
    • draw z̆ from q(z|(x(o), 0); φ),
    • then draw x̆(m) from p(x(m)|z̆; θ).
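A sketch of one draw from this proposal, following the imputation steps above. The samplers `sample_q_z(x, rng)` ~ q(z|x; φ) and `sample_p_xm(z, rng)` ~ p(x(m)|z; θ) are assumed callables, and `obs_mask` marks the observed entries of x:

```python
import numpy as np

def sample_proposal(x_obs, obs_mask, sample_q_z, sample_p_xm, rng):
    # One draw from q(x_m, z | x_o; theta, phi) = p(x_m | z; theta) q(z | x_breve; phi).
    x_zero = np.where(obs_mask, x_obs, 0.0)        # missing part filled with 0
    z_breve = sample_q_z(x_zero, rng)              # z_breve ~ q(z | (x_o, 0); phi)
    x_breve = np.where(obs_mask, x_obs,
                       sample_p_xm(z_breve, rng))  # impute x_breve_m
    z = sample_q_z(x_breve, rng)                   # z ~ q(z | x_breve; phi)
    x_m = sample_p_xm(z, rng)                      # x_m ~ p(x_m | z; theta)
    return np.where(obs_mask, x_obs, x_m), z
```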


SLIDE 21

miss-IWEM

Simply replace q(z|x; φ) in IWEM with q(x(m), z|x(o); θ, φ).

1 E-step
  • Calculate

Q̂_α,m(θ|θc, φ; x(o)) := α · Q̂_m(θ|θc, φ; x(o)) + (1 − α) · L̂_VAE,m(θ, φ; x(o)).

2 M-step
  • Update θ by maximizing Q̂_α,m(θ|θc, φ; x(o)).

3 P-step
  • Update φ by maximizing L̂_opt,m(θc, φ; x(o)).


SLIDE 22

Empirical analysis

SLIDE 23

Experimental setup

  • Model
    • p(z): N(0_40, I_40)
    • (p(x|z; θ), q(z|x; φ)): (MLP, MLP) or (DeConv, Conv)
  • Optimization algorithm
    • Adam (Kingma and Ba, 2014)
  • Performance measure
    • Approximated test log-likelihood
  • Datasets
    • Static biMNIST, Dynamic biMNIST, Omniglot, Caltech 101 Silhouettes


SLIDE 24

Complete data analysis

Performance results

MLP:

| Dataset     | VAE    | IWAE   | IWEM-woa¹ | IWEM   |
|-------------|--------|--------|-----------|--------|
| sta. MNIST  | 88.21  | 87.68  | 87.00     | 87.11  |
| dyn. MNIST  | 85.31  | 84.30  | 84.10     | 84.16  |
| Omniglot    | 108.46 | 106.80 | 106.50    | 106.38 |
| Caltech 101 | 119.67 | 118.06 | 116.92    | 116.54 |

CNN:

| Dataset     | VAE    | IWAE   | IWEM-woa¹ | IWEM   |
|-------------|--------|--------|-----------|--------|
| sta. MNIST  | 84.63  | 83.54  | 83.32     | 83.77  |
| dyn. MNIST  | 84.08  | 81.56  | 81.07     | 81.28  |
| Omniglot    | 101.63 | 100.27 | 100.15    | 100.39 |
| Caltech 101 | 109.24 | 106.94 | 106.19    | 106.05 |

¹ IWEM without the annealing strategy.

SLIDE 25

Incomplete data analysis

Generation of missing samples (see the masking sketch below):
1 Divide an image into 9 equal patches.
2 Generate an incomplete image by randomly removing a predefined number of patches.
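A minimal sketch of this masking procedure. The 3x3 grid uses integer division; how the slides handle image sizes not divisible by 3 (e.g. 28x28) is an assumption of this sketch:

```python
import numpy as np

def drop_patches(img, n_drop, rng):
    # Split an image into a 3x3 grid of patches and zero out n_drop randomly
    # chosen patches; return the incomplete image and the observed-entry mask.
    h, w = img.shape
    ph, pw = h // 3, w // 3
    out = img.copy()
    obs_mask = np.ones_like(img, dtype=bool)
    for p in rng.choice(9, size=n_drop, replace=False):
        r, c = divmod(p, 3)
        out[r * ph:(r + 1) * ph, c * pw:(c + 1) * pw] = 0
        obs_mask[r * ph:(r + 1) * ph, c * pw:(c + 1) * pw] = False
    return out, obs_mask
```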


SLIDE 26

Incomplete data analysis (cont.)

Performance results

  • Static biMNIST + (MLP, MLP)
  • Missing rate ↑ ⇒ margin ↑

| # of cropped patches | missIWAE² | miss-IWEM-woa³ | miss-IWEM |
|----------------------|-----------|----------------|-----------|
| 3                    | 90.29     | 89.79          | 89.71     |
| 4                    | 92.07     | 90.97          | 90.76     |
| 5                    | 95.54     | 93.33          | 92.23     |
| 6                    | 102.26    | 97.66          | 95.18     |

² Mattei and Frellsen (2018).
³ miss-IWEM without the annealing strategy.

SLIDE 27

Summary

SLIDE 28

Summary

1 Proposed a new learning algorithm called IWEM.
  • Showed that IWAE can be understood as an EM algorithm.
  • Devised two new techniques to reduce the variance due to IS in the E-step.
2 Modified IWEM for missing data, called miss-IWEM.



SLIDE 30

References

SLIDE 31

References I

Burda, Y., Grosse, R., and Salakhutdinov, R. (2015). Importance weighted autoencoders. arXiv preprint arXiv:1509.00519.
Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
Mattei, P.-A. and Frellsen, J. (2018). MIWAE: Deep generative modelling and imputation of incomplete data sets. arXiv preprint arXiv:1812.02633.
Owen, A. B. (2013). Monte Carlo Theory, Methods and Examples.
Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082.
