Variational Hamiltonian Monte Carlo via Score Matching

Cheng Zhang (Joint work with Prof. Shahbaba and Prof. Zhao)

Department of Mathematics, University of California, Irvine

Jan 6, 2017

Outline

1. Background: Bayesian Inference
2. Markov chain Monte Carlo: Metropolis-Hastings Algorithm, Hamiltonian Monte Carlo, Scalable MCMC
3. Fixed-Form Variational Bayes: Lower Bounds and Free Energy, Variational Bayes as Linear Regression
4. Variational Hamiltonian Monte Carlo: Approximation with Random Bases, Variational HMC, Experiments
5. Conclusion

Bayesian Inference

Bayesian inference model

- D = {y_1, . . . , y_N}: observed data
- θ ∈ R^d: model parameter
- p(D|θ): model density
- p(θ): prior

Goal: learn the parameter θ from data,
  p(θ|D) = p(D|θ) · p(θ) / p(D) ∝ p(D|θ) · p(θ)

Difficulty: p(D) is unknown ⇒ the posterior distribution p(θ|D) is intractable (e.g., probabilistic graphical models, Bayesian hierarchical models).

Two popular approximations:
- Markov chain Monte Carlo: sample by running a Markov chain; asymptotically unbiased but computationally slow.
- Variational Bayes: approximate via tractable distributions; computationally fast but may result in a poor approximation.

Markov chain Monte Carlo

Intuitive idea: evolve a Markov chain to sample from a target distribution π(θ) (Metropolis et al. 1953).

Conditions for the transition kernel T(·|·):
- Irreducibility: any state has positive probability of reaching any other state.
- Aperiodicity: the chain should not get trapped in cycles.
- Detailed balance (sufficient): π(θ)T(θ′|θ) = π(θ′)T(θ|θ′)

Metropolis-Hastings algorithm (one iteration):
1. Sample θ′ ∼ q(θ′|θ).
2. Update the current state to θ′ with probability α(θ, θ′) = min[1, π(θ′)q(θ|θ′) / (π(θ)q(θ′|θ))].

Pros & cons of simple MCMC methods (e.g., RWM and Gibbs sampling):
- Pro: easy to implement and computationally cheap.
- Con: slow mixing due to random walk behavior, especially in complicated, high-dimensional models.
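For illustration, a minimal sketch of one Metropolis-Hastings iteration, assuming a symmetric Gaussian random-walk proposal; log_target (log π up to a constant) is a user-supplied placeholder:

```python
import numpy as np

def mh_step(theta, log_target, step_size=0.5, rng=np.random.default_rng()):
    """One Metropolis-Hastings iteration with a symmetric Gaussian proposal.

    For a symmetric proposal, q(theta'|theta) = q(theta|theta'), so the
    acceptance ratio reduces to pi(theta') / pi(theta).
    """
    proposal = theta + step_size * rng.standard_normal(theta.shape)
    log_alpha = log_target(proposal) - log_target(theta)  # log acceptance ratio
    if np.log(rng.uniform()) < log_alpha:
        return proposal  # accept
    return theta         # reject: stay at the current state
```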

Hamiltonian Monte Carlo

Intuition: leverage a Hamiltonian dynamical system to generate trial moves in MCMC samplers (Duane et al. 1987; Neal 2011).

Model-based energy function, the Hamiltonian: H(θ, r) = U(θ) + K(r)
- Potential: U(θ) = −log p(θ, D) = −[log p(θ) + log p(D|θ)]
- Kinetic: K(r) = ½ r⊺M⁻¹r ⇒ π(r) ∼ N(0, M)

The joint density of z = (θ, r) is π(z) ∝ exp(−U(θ) − K(r)) ∝ p(θ|D) · π(r).

Hamilton's equations:
  dθ/dt = ∇_r H = ∇_r K(r),  dr/dt = −∇_θ H = −∇_θ U(θ)

The Hamiltonian flow φ_s^H : R^{2d} → R^{2d}, z(0) = z ↦ z∗ = z(s).

Properties: reversibility, volume preservation, and constant Hamiltonian over time t.

Numerical integrator: the leap-frog scheme
  r^(t+1/2) = r^(t) − (ε/2) ∇θU(θ^(t))
  θ^(t+1) = θ^(t) + ε M⁻¹ r^(t+1/2)   (1)
  r^(t+1) = r^(t+1/2) − (ε/2) ∇θU(θ^(t+1))
The leap-frog scheme (1) remains time reversible and volume preserving, but the Hamiltonian is no longer exactly constant.

Metropolis correction:
  α_hmc(z, z∗) = min{1, exp(−H(z∗) + H(z))}   (2)

Some variants:
- Automatic tuning of the hyper-parameters, e.g., step size ε and number of leap-frog steps L (Hoffman and Gelman 2011; Wang, Mohamed, and de Freitas 2013)
- Riemannian Manifold HMC (Girolami and Calderhead 2011)
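A minimal sketch of one HMC iteration implementing the leap-frog scheme (1) and the Metropolis correction (2), assuming M = I; U and grad_U are user-supplied placeholders:

```python
import numpy as np

def hmc_step(theta, U, grad_U, eps=0.1, L=20, rng=np.random.default_rng()):
    """One HMC iteration: leap-frog scheme (1) plus correction (2), with M = I."""
    r = rng.standard_normal(theta.shape)             # resample momentum r ~ N(0, I)
    H_old = U(theta) + 0.5 * r @ r                   # H(z) = U(theta) + K(r)
    theta_new = theta.copy()
    r_new = r - 0.5 * eps * grad_U(theta)            # initial half-step for momentum
    for i in range(L):                               # L leap-frog steps
        theta_new = theta_new + eps * r_new
        if i < L - 1:
            r_new = r_new - eps * grad_U(theta_new)
    r_new = r_new - 0.5 * eps * grad_U(theta_new)    # final half-step for momentum
    H_new = U(theta_new) + 0.5 * r_new @ r_new
    if np.log(rng.uniform()) < H_old - H_new:        # accept w.p. min{1, exp(H - H*)}
        return theta_new
    return theta
```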

Stochastic Gradient MCMC

Challenge from massive data:
  ∇θU(θ) = −∑_{y∈D} ∇θ log p(y|θ) − ∇θ log p(θ) ∼ O(N)

Stochastic gradient MCMC: use stochastic gradients
  ∇θŨ(θ) = −(|D|/|D̃|) ∑_{y∈D̃} ∇θ log p(y|θ) − ∇θ log p(θ),  D̃ ⊂ D
e.g., SGLD (Welling and Teh 2011), SGHMC (Chen, Fox, and Guestrin 2014), SGNHT (Ding et al. 2014).

Properties of stochastic gradient MCMC:
- Convergence relies on a stochastic differential equation of the form dz = f(z)dt + √(2D(z)) dW(t) with appropriate f(z) and D(z); see the complete recipe of Ma, Chen, and Fox (2015).
- Needs an annealed (or small) step size, sacrificing exploration efficiency for scalability (Betancourt 2015).
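A sketch of one SGLD update under these definitions; grad_log_prior and grad_log_lik (the latter returning per-datum gradients) are assumed placeholders, and the step size is kept fixed for brevity:

```python
import numpy as np

def sgld_step(theta, data, grad_log_prior, grad_log_lik,
              batch_size=32, eps=1e-4, rng=np.random.default_rng()):
    """One SGLD update (Welling and Teh 2011) with a minibatch stochastic gradient."""
    batch = data[rng.choice(len(data), size=batch_size, replace=False)]
    # stochastic gradient of U: rescale the minibatch term by |D| / |D~|
    grad_U = -(len(data) / batch_size) * grad_log_lik(theta, batch).sum(axis=0) \
             - grad_log_prior(theta)
    noise = np.sqrt(eps) * rng.standard_normal(theta.shape)  # injected Gaussian noise
    return theta - 0.5 * eps * grad_U + noise
```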

Surrogate Method

Find a cheap surrogate function U∗(θ) and a transition kernel Ts(·|·) that leaves q∗(θ) ∝ exp(−U∗(θ)) invariant:
  q∗(θ)Ts(θ′|θ) = q∗(θ′)Ts(θ|θ′)
Use Ts to generate proposals.

Proposition (Liu 2001). The target distribution p(θ|D) ∝ exp(−U(θ)) is the stationary distribution of a Markov chain simulated according to the following procedure: given the current state θ, let ϑ_0 = θ and recursively sample ϑ_i ∼ Ts(·|ϑ_{i−1}) for i = 1, . . . , k. Then accept the proposal θ∗ = ϑ_k with probability
  αs(θ, θ∗) = min{1, [p(θ∗|D)q∗(θ)] / [p(θ|D)q∗(θ∗)]}

Some existing surrogate methods: Gaussian process surrogates (Rasmussen 2003; Lan et al. 2015), reproducing kernel Hilbert space surrogates (Strathmann et al. 2015), and random network surrogates (Zhang, Shahbaba, and Zhao 2015).

Lower Bounds and Free Energy

Fixed-form variational Bayes (Honkela et al. 2010; Saul and Jordan 1996) uses a parametrized distribution
  qη(θ) = exp[T(θ)η − A(η)]   (3)
to approximate the target posterior p(θ|D).

Distance between distributions:
  D_KL(qη(θ) ‖ p(θ|D)) = ∫ qη(θ) log [qη(θ) / p(θ|D)] dθ = log p(D) − ∫ qη(θ) log [p(θ, D) / qη(θ)] dθ
where the second integral is the free energy, a lower bound on log p(D).

An optimization problem:
  η̂ = arg max_η E_{qη}[log p(θ, D) − log qη(θ)]   (4)
This is more accurate than using mean-field assumptions, but requires analytic evaluation of E_{qη} log qη(θ), E_{qη} log p(θ, D), and their derivatives.

Stochastic Linear Regression

The optimization in (4) can be solved using stochastic linear regression (Salimans and Knowles 2013). Rewrite (3) in the unnormalized form
  q̃_η̃(θ) = exp[T̃(θ)η̃],  T̃(θ) = (1, T(θ)),  η̃ = (η_0, η⊺)⊺

Unnormalized KL divergence:
  D̃_KL(q̃_η̃ ‖ p(θ, D)) = ∫ q̃_η̃(θ) log [q̃_η̃(θ) / p(θ, D)] dθ − ∫ q̃_η̃(θ) dθ
                        = ∫ exp[T̃(θ)η̃][T̃(θ)η̃ − log p(θ, D)] dθ − ∫ exp[T̃(θ)η̃] dθ   (5)

Find the minimum by differentiation:
  ∇_η̃ D̃_KL(q̃_η̃ ‖ p(θ, D)) = ∫ q̃_η̃(θ)[T̃(θ)⊺T̃(θ)η̃ − T̃(θ)⊺ log p(θ, D)] dθ = 0

The minimum:
  η̃ = E_q[T̃(θ)⊺T̃(θ)]⁻¹ E_q[T̃(θ)⊺ log p(θ, D)]   (6)

Monte Carlo Estimation

(6) is not yet a solution (both expectations depend on η̃), but it can be used to derive a fixed-point iteration. Let C = E_q[T̃(θ)⊺T̃(θ)] and g = E_q[T̃(θ)⊺ log p(θ, D)]; then η̃ = C⁻¹g.

Stochastic Optimization for Fixed-Form VB
1: Initialize η̃_1, C_1, g_1 = C_1η̃_1; set step size w ∈ [0, 1]
2: for t = 1 to T do
3:   Draw a single sample θ∗_t from the current approximation q_{η_t}(θ)
4:   Update C_{t+1}, g_{t+1} as follows:
5:     g_{t+1} = (1 − w)g_t + w ĝ_t,  ĝ_t = T̃(θ∗_t)⊺ log p(θ∗_t, D)
6:     C_{t+1} = (1 − w)C_t + w Ĉ_t,  Ĉ_t = T̃(θ∗_t)⊺ T̃(θ∗_t)
7:   Update the parameters:
8:     η̃_{t+1} = C⁻¹_{t+1} g_{t+1}
9: end for

See Salimans and Knowles (2013) for more variants.
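A compact sketch of this fixed-point scheme; T_tilde, sample_q, and log_joint are assumed placeholders for the sufficient statistics, a sampler for q_η, and log p(θ, D):

```python
import numpy as np

def fixed_form_vb(T_tilde, sample_q, log_joint, eta0, T=5000, w=0.05,
                  rng=np.random.default_rng()):
    """Fixed-point stochastic optimization for fixed-form VB (a sketch of the
    Salimans-Knowles scheme; the helper names are placeholders).

    T_tilde(theta): row vector (1, T(theta)) of sufficient statistics, shape (1, s)
    sample_q(eta):  draw one sample from the current approximation q_eta
    log_joint(theta): log p(theta, D)
    """
    eta = eta0.copy()
    C = np.eye(len(eta))
    g = C @ eta
    for t in range(T):
        theta = sample_q(eta)                      # single Monte Carlo sample
        Tt = T_tilde(theta)                        # shape (1, s)
        g = (1 - w) * g + w * (Tt.T * log_joint(theta)).ravel()
        C = (1 - w) * C + w * (Tt.T @ Tt)
        eta = np.linalg.solve(C, g)                # eta = C^{-1} g
    return eta
```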

Combine VB and HMC via Surrogate

Surrogate method revisited:
  Target: p(θ|D) ∝ exp(−U(θ)),  Surrogate: q∗(θ) ∝ exp(−U∗(θ))
Acceptance probability:
  αs(θ, θ∗) = min{1, [p(θ∗|D)q∗(θ)] / [p(θ|D)q∗(θ∗)]}
High approximation quality ⇒ high acceptance rate.

Flexible and efficient surrogate based on a random network:
  U∗(θ) = z(θ) = ∑_{i=1}^{s} v_i a(θ; γ_i) ∼ O(s)   (7)
where {γ_i}_{i=1}^{s} are random samples from some distribution.

Use variational Bayes to improve the approximation:
  q̂(θ) = arg min_{q∗} D(q∗(θ), p(θ, D))
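A sketch of the random network surrogate (7); the cosine basis a(θ; γ) = cos(w⊺θ + b) with γ = (w, b) drawn at random is an assumption here (a common random-feature choice; the slides leave the basis family open):

```python
import numpy as np

class RandomBasesSurrogate:
    """Random network surrogate z(theta) = sum_i v_i a(theta; gamma_i), eq. (7)."""

    def __init__(self, dim, s=100, scale=1.0, rng=np.random.default_rng()):
        self.W = rng.normal(scale=scale, size=(s, dim))  # random frequencies w_i
        self.b = rng.uniform(0, 2 * np.pi, size=s)       # random phases b_i
        self.v = np.zeros(s)                             # trainable weights v_i

    def features(self, theta):
        return np.cos(self.W @ theta + self.b)           # a(theta; gamma_i), i = 1..s

    def z(self, theta):
        return self.v @ self.features(theta)             # surrogate potential U*(theta)

    def grad_z(self, theta):
        # grad of sum_i v_i cos(w_i' theta + b_i) is -sum_i v_i sin(.) w_i
        return -(self.v * np.sin(self.W @ theta + self.b)) @ self.W
```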

A Rich Class of Functions: Fp

Bases: {a(θ; γ) : γ ∈ Γ}, Θ ⊂ R^d, and a distribution p(γ) on Γ. Consider functions
  f(θ) = ∫_Γ α(γ)a(θ; γ) dγ   (8)
where |α(γ)| ≤ C|p(γ)|. Define the norm ‖f‖_p = sup_γ |α(γ)/p(γ)| and the set
  Fp ≡ { f(θ) = ∫_Γ α(γ)a(θ; γ) dγ : ‖f‖_p < ∞ }

Theorem (Rahimi and Recht 2008)
Let µ be any probability measure on Θ and ‖f‖²_µ = ∫_Θ f²(θ)µ(dθ). Suppose sup_{θ,γ} |a(θ; γ)| ≤ 1 and fix f ∈ Fp. For every δ > 0, with probability at least 1 − δ over γ_i ~iid p(γ), there exist v_1, . . . , v_s such that z(θ) = ∑_{i=1}^{s} v_i a(θ; γ_i) satisfies
  ‖z − f‖_µ < (‖f‖_p / √s) (1 + √(2 log(1/δ)))

Fp is dense in H

A reproducing kernel Hilbert space (RKHS) H with kernel k on Θ × Θ,
  k(θ, θ′) = ∫_Γ p(γ)a(θ; γ)a(θ′; γ) dγ,
can be constructed from functions of the form (8).

Proposition (Rahimi and Recht 2008)
Let Ĥ be the completion of the set of functions of the form (8) such that
  ∫_Γ α(γ)²/p(γ) dγ < ∞   (9)
with the inner product
  ⟨f, g⟩ = ∫_Γ α(γ)β(γ)/p(γ) dγ,  where g(θ) = ∫_Γ β(γ)a(θ; γ) dγ.
Then Ĥ = H.

Note that every f ∈ Fp satisfies (9), so Fp is a subset of H; in fact, Fp is dense in H. See Rahimi and Recht (2008) for detailed proofs.

Free-Form Variational Bayes

Random bases surrogate induced distribution:
  q_v(θ) ∝ exp(−z(θ)) = exp[−∑_{i=1}^{s} v_i a(θ; γ_i) − Φ(v)]
where the γ_i are drawn i.i.d. from some distribution. The natural parameters v of interest belong to the set Ω := {v ∈ R^s | Φ(v) < +∞}.

A distance measure D : (q, p) → [0, +∞), where q and p are unnormalized densities.

Free-form variational inference:
  v̂ = arg min_{v∈Ω} D(q_v(θ) ‖ p(θ, D))   (10)
- q_v(θ) does not have to be tractable
- free-style construction of the random network surrogate

Distance Between Unnormalized Densities

Potential matching distance:
  D_PM(q_v(θ) ‖ p(θ, D)) = ½ min_b ∫ q_v(θ)[z(θ) − U(θ) − b]² dθ = ½ Var_{q_v}(z(θ) − U(θ))   (11)

"Score matching" distance:
  D̃_SM(q_v(θ) ‖ p(θ, D)) = ½ ∫ q_v(θ)‖∇θz(θ) − ∇θU(θ)‖² dθ   (12)

The above distances are well defined:
  D_PM(q_v(θ) ‖ p(θ, D)) = 0 or D̃_SM(q_v(θ) ‖ p(θ, D)) = 0 ⇒ z(θ) = U(θ) + constant ⇒ q_v(θ) = p(θ|D)

In practice, (12) is usually intractable; use the empirical version instead.
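A sketch of fitting the surrogate weights by the empirical version of (12) with a ridge penalty (anticipating (14)); basis_grad, returning the d × s matrix A(θ) of basis gradients, is an assumed helper:

```python
import numpy as np

def fit_surrogate_score_matching(thetas, grad_U_vals, basis_grad, s, lam=1e-3):
    """Fit surrogate weights v by the empirical score matching objective:
        min_v 1/2 sum_n ||A(theta_n) v - grad U(theta_n)||^2 + lam/2 ||v||^2
    where A(theta) has columns grad_theta a(theta; gamma_i). Since grad z = A v,
    this is ridge regression solved via its normal equations."""
    lhs = lam * np.eye(s)
    rhs = np.zeros(s)
    for theta, g in zip(thetas, grad_U_vals):
        A = basis_grad(theta)          # shape (d, s)
        lhs += A.T @ A
        rhs += A.T @ g
    return np.linalg.solve(lhs, rhs)   # ridge solution for v
```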

Surrogate Induced Hamiltonian Flow

Define H̃(θ, r) = z(θ) + K(r). Simulating the surrogate induced Hamiltonian dynamics
  dθ/dt = M⁻¹r,  dr/dt = −∇θz(θ)   (13)
defines a mapping φ_s^H̃ : R^{2d} → R^{2d},  (θ, r) ↦ (θ∗, r∗).

Variational Hamiltonian Monte Carlo

1. Simulate the surrogate induced Hamiltonian flow to generate (θ∗, r∗) and accept with probability
   α_vhmc = min{1, exp(H̃(θ, r) − H̃(θ∗, r∗))}
2. Add the new point to the training data set:
   T_s^(t) := T_s^(t−1) ∪ {(θ_t, ∇θU(θ_t))}
3. Update the surrogate by minimizing the empirical squared distance plus regularization:
   v̂_t = arg min_v ½ ∑_{n=1}^{t} ‖∇θz(θ_n) − ∇θU(θ_n)‖² + (λ/2)‖v‖²   (14)

Regularized surrogate approximation used to simulate the Hamiltonian flow:
   V_t(θ) = µ_t z_t(θ) + ½(1 − µ_t)(θ − θ_L)⊺ ∇²θU(θ_L) (θ − θ_L),  µ_t : 0 ↑ 1

Online Updating

Let A(θ) = (A_1(θ), A_2(θ), . . . , A_s(θ)), where A_i(θ) = ∇θa(θ; γ_i), i = 1, . . . , s.

Variational Hamiltonian Monte Carlo
1: Set λ, µ_t, s and HMC parameters ε, L. Initialize θ^(0) to a first guess, v^(0) = 0, C^(0) = (1/λ)I_s. Find θ_L and compute ∇²θU(θ_L).
2: for t = 1 to T do
3:   Perform one HMC iteration for the regularized surrogate induced distribution q_v^(t)(θ) ∝ exp(−V_t(θ)) to draw (θ^(t+1), r^(t+1))
4:   Acquire ∇θU(θ^(t+1)) and A_{t+1} = A(θ^(t+1))
5:   Compute W^(t+1) = C^(t)A⊺_{t+1}[I_d + A_{t+1}C^(t)A⊺_{t+1}]⁻¹
6:   Update v^(t+1), C^(t+1) as follows:
7:     v^(t+1) = v^(t) + W^(t+1)(∇θU(θ^(t+1)) − A_{t+1}v^(t))
8:     C^(t+1) = C^(t) − W^(t+1)A_{t+1}C^(t)
9: end for
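A sketch of the recursive least-squares update in lines 5-8, which keeps the regularized solution of (14) current as each new gradient pair arrives, without refitting from scratch:

```python
import numpy as np

def online_surrogate_update(v, C, A_new, grad_U_new):
    """One recursive least-squares update (algorithm lines 5-8).

    v: current weights, shape (s,); C: current matrix, shape (s, s)
    A_new: basis gradient matrix A(theta), shape (d, s); grad_U_new: shape (d,)
    """
    d = A_new.shape[0]
    # W = C A' [I_d + A C A']^{-1}  (Woodbury-style gain matrix)
    W = C @ A_new.T @ np.linalg.inv(np.eye(d) + A_new @ C @ A_new.T)
    v_next = v + W @ (grad_U_new - A_new @ v)   # line 7
    C_next = C - W @ A_new @ C                  # line 8
    return v_next, C_next
```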

Stopping time and Connections to Existing Work

Updating becomes inefficient once the surrogate is well trained ⇒ set a stopping time t_0:
- t < t_0: perform free-form variational Bayes to improve the approximation quality of the surrogate.
- t > t_0: perform standard HMC that samples from the surrogate induced distribution.

Connections to some existing work:
- Stochastic linear regression for fixed-form variational Bayes (Salimans and Knowles 2013): we allow free-form, intractable approximate distributions and use HMC to draw samples.
- Random bases surrogate HMC (Zhang, Shahbaba, and Zhao 2015): we further reduce the computation by using the fast surrogate in the Metropolis correction step, trading off approximation accuracy against computational efficiency.

A Beta-binomial Model for Overdispersion

Figure – Left: approximate posteriors (over logit m and log K) for a varying number of hidden neurons (6 to 27), with the exact posterior for comparison. Right: KL divergence and score matching squared distance between the surrogate approximation and the exact posterior density as the number of hidden neurons increases.

Bayesian Probit Regression

Figure – RMSE of the approximate posterior mean as a function of the number of likelihood evaluations for different variational Bayesian approaches (VBEM, FF-Minibatch) and the VHMC algorithm.

Independent Component Analysis

Figure – Convergence of the Amari distance (vs. wall-clock time in seconds) on the MEG data for HMC, SGLD, and our variational HMC algorithm.

Summary and Discussion

Combine variational Bayes and MCMC via a random bases surrogate:
- Construct an efficient surrogate to accelerate HMC:
  dθ/dt = M⁻¹r,  dr/dt = −∇θz(θ)
- Find a good surrogate via free-form variational Bayes:
  v̂ = arg min_{v∈Ω} D̃_SM(q_v(θ) ‖ p(θ, D))

Open future directions:
- The random bases surrogate is most effective for problems with a costly likelihood and a moderate number of parameters. What about really high-dimensional problems? Would more sophisticated structures (e.g., deep networks) work?
- Evaluating the full-data gradient ∇θU(θ) to collect training data is still expensive; what about stochastic gradients? Be careful about overfitting!

Reference I

Betancourt, M. (2015). "The fundamental incompatibility of scalable Hamiltonian Monte Carlo and naive data subsampling". In: Proceedings of the 32nd International Conference on Machine Learning (ICML 2015).

Chen, T., E. B. Fox, and C. Guestrin (2014). "Stochastic Gradient Hamiltonian Monte Carlo". In: Proceedings of the 31st International Conference on Machine Learning (ICML 2014).

Ding, N., et al. (2014). "Bayesian Sampling Using Stochastic Gradient Thermostats". In: Advances in Neural Information Processing Systems 27 (NIPS 2014).

Duane, S., et al. (1987). "Hybrid Monte Carlo". In: Physics Letters B 195.2, pp. 216–222.

Girolami, M., and B. Calderhead (2011). "Riemann manifold Langevin and Hamiltonian Monte Carlo methods". In: Journal of the Royal Statistical Society (with discussion) 73.2, pp. 123–214.

Hoffman, M. D., and A. Gelman (2011). The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo. arxiv.org/abs/1111.4246.

Honkela, A., et al. (2010). "Approximate Riemannian conjugate gradient learning for fixed-form variational Bayes". In: Journal of Machine Learning Research 11, pp. 3235–3268.

Reference II

Lan, S., et al. (2015). Emulation of higher-order tensors in manifold Monte Carlo methods for Bayesian inverse problems. arxiv.org/abs/1507.06244.

Ma, Y. A., T. Chen, and E. Fox (2015). "A complete recipe for stochastic gradient MCMC". In: Advances in Neural Information Processing Systems 28 (NIPS 2015).

Metropolis, N., et al. (1953). "Equation of State Calculations by Fast Computing Machines". In: The Journal of Chemical Physics 21.6, pp. 1087–1092.

Neal, R. M. (2011). "MCMC using Hamiltonian dynamics". In: Handbook of Markov Chain Monte Carlo. Ed. by S. Brooks et al. Chapman and Hall/CRC, pp. 113–162.

Rahimi, A., and B. Recht (2008). "Uniform approximation of functions with random bases". In: Proc. 46th Ann. Allerton Conf. Commun., Contr. Comput.

Rasmussen, C. E. (2003). "Gaussian Processes to Speed up Hybrid Monte Carlo for Expensive Bayesian Integrals". In: Bayesian Statistics 7, pp. 651–659.

Salimans, T., and D. A. Knowles (2013). "Fixed-form variational posterior approximation through stochastic linear regression". In: Bayesian Analysis 8.4, pp. 837–882.

Reference III

Saul, L., and M. I. Jordan (1996). "Exploiting tractable substructures in intractable networks". In: Advances in Neural Information Processing Systems 7 (NIPS 1996). Ed. by G. Tesauro, D. S. Touretzky, and T. K. Leen. Cambridge, MA: MIT Press, pp. 486–492.

Strathmann, H., et al. (2015). "Gradient-free Hamiltonian Monte Carlo with efficient kernel exponential families". In: Advances in Neural Information Processing Systems. Cambridge, MA: MIT Press.

Wang, Z., S. Mohamed, and N. de Freitas (2013). "Adaptive Hamiltonian and Riemann manifold Monte Carlo". In: Proceedings of the 30th International Conference on Machine Learning (ICML 2013), pp. 1462–1470.

Welling, M., and Y. W. Teh (2011). "Bayesian Learning via Stochastic Gradient Langevin Dynamics". In: Proceedings of the International Conference on Machine Learning.

Zhang, C., B. Shahbaba, and H. K. Zhao (2015). Hamiltonian Monte Carlo Acceleration Using Surrogate Functions with Random Bases. arxiv.org/abs/1506.05555.