

SLIDE 1

Variational inference

Probabilistic Graphical Models Sharif University of Technology Soleymani Spring 2018

Some slides are adapted from Xing’s slides

SLIDE 2

Exact methods for inference

2

 Variable elimination  Message Passing: shared terms

 Sum-product (belief propagation)  Max-product  Junction Tree

SLIDE 3

Junction tree

3

 General algorithm on graphs with cycles  Message passing on junction trees

[Figure: message passing between neighboring cliques C_i and C_j over their separator S_ij; the messages m_ij(S_ij) and m_ji(S_ij) are functions of the separator variables]

SLIDE 4

Why approximate inference

4

 The computational complexity of the junction tree algorithm is O(K^|C|), where C is the largest elimination clique (the largest clique in the triangulated graph) and K is the number of states per variable

 For a distribution P associated with a complex graph, computing the marginal (or conditional) probability of arbitrary random variable(s) is intractable

 Tree-width of an n × n grid is n

SLIDE 5

Learning and inference

5

 Learning usually needs inference

 For Bayesian inference, which is one of the principal foundations of machine learning, learning is itself just an inference problem

 For the Maximum Likelihood approach we also need inference when we have incomplete data or when we work with an undirected model

SLIDE 6

Approximate inference

6

 Approximate inference techniques

 Variational algorithms

 Loopy belief propagation  Mean field approximation  Expectation propagation

 Stochastic simulation / sampling methods

SLIDE 7

Variational methods

7

 "variational": a general term for optimization-based formulations

 Many problems can be expressed in terms of an optimization problem in which the quantity being optimized is a functional

 Variational inference is a deterministic framework that is widely used for approximate inference

SLIDE 8

Variational inference methods

8

 Constructing an approximation to the target distribution P, where this approximation takes a simpler form for inference:

 We define a target class of distributions 𝒬  Search for an instance Q* in 𝒬 that is the best approximation to P  Queries will be answered using Q* rather than P

 𝒬: a given family of distributions

 Simpler families for which solving the optimization problem will be computationally tractable

 However, the family may not be sufficiently expressive to encode P

Constrained optimization
SLIDE 9

Setup

9

 Assume that we are interested in the posterior distribution

P(Z | X, α) = P(Z, X | α) / ∫ P(Z, X | α) dZ

 The problem of computing the posterior is an instance of a more general problem that variational inference solves

 Main idea:

 We pick a family of distributions over the latent variables with its own variational parameters

 Then, find the setting of the parameters that makes Q close to the posterior of interest

 Use Q with the fitted parameters as an approximation for the posterior

X = {x_1, …, x_n}: observed variables   Z = {z_1, …, z_m}: hidden variables

SLIDE 10

Approximation

10

 Goal: approximate a difficult distribution P(Z|X) with a new distribution Q(Z) such that:

 P(Z|X) and Q(Z) are close  Computation on Q(Z) is easy

 Typically, the true posterior is not in the variational family.  How should we measure the distance between distributions?

 The Kullback-Leibler divergence (KL-divergence) between two distributions P and Q

SLIDE 11

KL divergence

11

 Kullback-Leibler divergence between P and Q:

KL(P‖Q) = ∫ P(x) log [P(x)/Q(x)] dx

 A result from information theory: for any P and Q

KL(P‖Q) ≥ 0

 KL(P‖Q) = 0 if and only if P ≡ Q  The KL divergence is asymmetric
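As a quick sanity check of these properties, the following sketch (assuming Python with NumPy, a choice made here since the slides contain no code) evaluates both directions of the KL divergence for two small discrete distributions, showing non-negativity and asymmetry.

    import numpy as np

    def kl(p, q):
        # KL(p || q) = sum_x p(x) log(p(x)/q(x)) for discrete distributions
        p, q = np.asarray(p, float), np.asarray(q, float)
        return float(np.sum(np.where(p > 0, p * np.log(p / q), 0.0)))

    p = np.array([0.7, 0.2, 0.1])
    q = np.array([0.4, 0.4, 0.2])

    print(kl(p, q))   # >= 0
    print(kl(q, p))   # >= 0, but different from kl(p, q): KL is asymmetric
    print(kl(p, p))   # 0 exactly when the two distributions coincide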

SLIDE 12

How to measure the distance between P and Q?

12

 We wish to find a distribution Q such that Q is a "good" approximation to P

 We can therefore use the KL divergence as a scoring function to choose a good Q

 But KL(P(Z|X)‖Q(Z)) ≠ KL(Q(Z)‖P(Z|X))

SLIDE 13

M-projection vs. I-projection

13

 M-projection of Q onto P

Q* = argmin_{Q∈𝒬} KL(P‖Q)

 I-projection of Q onto P

Q* = argmin_{Q∈𝒬} KL(Q‖P)

 These two differ only when Q is minimized over a restricted set of probability distributions (i.e., when P ∉ 𝒬, the set of possible Q distributions)
SLIDE 14

KL divergence: M-projection vs. I-projection

14

 Let P be a 2D Gaussian and Q be a Gaussian distribution with a diagonal covariance matrix:

[Figure from Bishop: P shown in green, Q* in red]

M-projection: Q* = argmin_Q ∫ P(𝒛) log [P(𝒛)/Q(𝒛)] d𝒛
I-projection: Q* = argmin_Q ∫ Q(𝒛) log [Q(𝒛)/P(𝒛)] d𝒛

In both cases the means match: E_P[𝒛] = E_Q[𝒛]

SLIDE 15

KL divergence: M-projection vs. I-projection

15

 Let P be a mixture of two 2D Gaussians and Q be a 2D Gaussian distribution with an arbitrary covariance matrix:

[Figure from Bishop: P shown in blue, Q* in red]

M-projection: Q* = argmin_Q ∫ P(𝒛) log [P(𝒛)/Q(𝒛)] d𝒛, which matches the moments E_P[𝒛] = E_Q[𝒛] and Cov_P[𝒛] = Cov_Q[𝒛]
I-projection: Q* = argmin_Q ∫ Q(𝒛) log [Q(𝒛)/P(𝒛)] d𝒛, for which there are two good solutions (one per mode)!

SLIDE 16

M-projection

16

 Computing KL(P‖Q) requires inference on P

KL(P‖Q) = Σ_𝒛 P(𝒛) log [P(𝒛)/Q(𝒛)] = −H(P) − E_P[log Q(𝒛)]

 When Q is in the exponential family:

KL(P‖Q) is minimized  ⇔  E_P[T(𝒛)] = E_Q[T(𝒛)]

 Expectation Propagation methods are based on minimizing KL(P‖Q)

Moment projection: inference on P (which is difficult) is required!

SLIDE 17

I-projection

17

 KL(Q‖P) can be computed without performing inference on P

KL(Q‖P) = ∫ Q(𝒛) log [Q(𝒛)/P(𝒛)] d𝒛 = −H(Q) − E_Q[log P(𝒛)]

 Most variational inference algorithms make use of KL(Q‖P)  Computing expectations w.r.t. Q is tractable (by choosing a suitable class of distributions for Q)

 We choose a restricted family of distributions such that the expectations can be evaluated and optimized efficiently,

 and yet which is still sufficiently flexible to give a good approximation
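To make the contrast concrete, here is a small numerical sketch (Python/NumPy assumed, as before): the target p is a two-component 1D Gaussian mixture, the approximating family is a single Gaussian, the M-projection is obtained in closed form by moment matching, and the I-projection is found by a crude grid search over (μ, σ) minimizing KL(q‖p). The I-projection locks onto one mode while the M-projection covers both.

    import numpy as np

    x = np.linspace(-10, 10, 4001)          # dense grid for numerical integrals
    dx = x[1] - x[0]

    def normal(x, mu, sigma):
        return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

    # Target p: mixture of two well-separated Gaussians
    p = 0.5 * normal(x, -3.0, 1.0) + 0.5 * normal(x, 3.0, 1.0)

    def kl(a, b):
        # KL(a || b) on the grid, ignoring points where a is ~0
        mask = a > 1e-12
        return np.sum(a[mask] * np.log(a[mask] / (b[mask] + 1e-300))) * dx

    # M-projection: moment matching (mean and variance of p)
    m_mu = np.sum(x * p) * dx
    m_var = np.sum((x - m_mu) ** 2 * p) * dx

    # I-projection: grid search over a single Gaussian minimizing KL(q || p)
    best = None
    for mu in np.linspace(-5, 5, 101):
        for sigma in np.linspace(0.5, 5, 50):
            score = kl(normal(x, mu, sigma), p)
            if best is None or score < best[0]:
                best = (score, mu, sigma)

    print("M-projection:", m_mu, np.sqrt(m_var))   # roughly mu = 0, sigma ~ 3.2 (covers both modes)
    print("I-projection:", best[1], best[2])       # roughly mu = +/-3, sigma ~ 1 (locks onto one mode)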

SLIDE 18

Example of variational approximation

18

[Figure from Bishop: a variational approximation compared with the Laplace approximation]

SLIDE 19

Evidence Lower Bound (ELBO)

19

ln P(X) = ℒ(Q) + KL(Q‖P)

ℒ(Q) = ∫ Q(Z) ln [P(X, Z)/Q(Z)] dZ

KL(Q‖P) = −∫ Q(Z) ln [P(Z|X)/Q(Z)] dZ

 We can maximize the lower bound ℒ(Q)

 equivalent to minimizing the KL divergence.  If we allow any possible choice for Q(Z), then the maximum of the lower bound occurs when the KL divergence vanishes

 which occurs when Q(Z) equals the posterior distribution P(Z|X).

 The difference between the ELBO and the KL divergence is ln P(X), which is exactly what the ELBO bounds

We will also refer to ℒ(Q) as F[P, Q] later.   X = {x_1, …, x_n}   Z = {z_1, …, z_m}
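The decomposition ln P(X) = ℒ(Q) + KL(Q‖P) can be verified numerically on a toy discrete model; the sketch below (Python/NumPy assumed) uses a joint P(X, Z) over one observed and one latent binary variable and an arbitrary Q(Z).

    import numpy as np

    # Joint P(x, z) over a binary observed x and binary latent z (rows: z, cols: x)
    joint = np.array([[0.30, 0.10],
                      [0.15, 0.45]])
    x_obs = 0                                   # the observed value of x

    p_x = joint[:, x_obs].sum()                 # evidence P(x)
    post = joint[:, x_obs] / p_x                # posterior P(z | x)

    q = np.array([0.6, 0.4])                    # an arbitrary variational distribution Q(z)

    elbo = np.sum(q * (np.log(joint[:, x_obs]) - np.log(q)))   # E_Q[ln P(x, z) - ln Q(z)]
    kl = np.sum(q * (np.log(q) - np.log(post)))                # KL(Q || P(z | x))

    print(np.log(p_x), elbo + kl)               # the two numbers agree: ln P(x) = ELBO + KL
    print(elbo <= np.log(p_x))                  # True: the ELBO lower-bounds the evidence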

SLIDE 20

Evidence Lower Bound (ELBO)

20

 Lower bound on the marginal likelihood  This quantity should increase monotonically with each iteration

 We maximize the ELBO to find the parameters that give as tight a bound as possible on the marginal likelihood

 The ELBO converges to a local maximum.

 Variational inference is closely related to EM

SLIDE 21

Factorized distributions Q (mean-field variational inference)

21

Q(Z) = ∏_j Q_j(Z_j)

ℒ(Q) = ∫ ∏_j Q_j [ ln P(X, Z) − Σ_j ln Q_j ] dZ

Coordinate ascent to optimize ℒ(Q) (we first isolate ℒ_j(Q), a functional of Q_j):

ℒ_j(Q) = ∫ Q_j [ ∫ ln P(X, Z) ∏_{i≠j} Q_i dZ_i ] dZ_j − ∫ Q_j ln Q_j dZ_j + const

⇒ ℒ_j(Q) = ∫ Q_j E_{−j}[ln P(X, Z)] dZ_j − ∫ Q_j ln Q_j dZ_j + const

The restriction on the distributions is only in the form of the factorization assumption:

E_{−j}[ln P(X, Z)] = ∫ ln P(X, Z) ∏_{i≠j} Q_i dZ_i

SLIDE 22

Factorized distributions Q: optimization

22

L(Q_j, λ) = ℒ_j(Q) + λ ( Σ_{z_j} Q(z_j) − 1 )

dL/dQ(z_j) = E_{−j}[log P(Z, X)] − log Q(z_j) − 1 + λ = 0

⇒ Q*(z_j) ∝ exp( E_{−j}[ln P(X, Z)] )

Q*(z_j) ∝ exp( E_{−j}[ln P(z_j | Z_{−j}, X)] )

 The above formula determines the form of the optimal Q(z_j). We did not specify the form in advance; only the factorization has been assumed.

 Depending on that form, the optimal Q(z_j) might not be easy to work with. Nonetheless, for many models it is easy.

 Since we are replacing the neighboring values by their mean value, the method is known as mean field
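A minimal sketch of this coordinate-ascent (CAVI) update on a toy model (Python/NumPy assumed): P(z_1, z_2) is an arbitrary joint over two binary variables, Q(z) = Q_1(z_1) Q_2(z_2), and each factor is refreshed as Q_j(z_j) ∝ exp(E_{−j}[ln P(z)]).

    import numpy as np

    # Arbitrary joint P(z1, z2) over two binary variables (playing the role of P(X, Z))
    P = np.array([[0.40, 0.10],
                  [0.05, 0.45]])
    logP = np.log(P)

    q1 = np.array([0.5, 0.5])     # Q_1(z1), initialized uniformly
    q2 = np.array([0.5, 0.5])     # Q_2(z2)

    for _ in range(50):
        # Update Q_1: log Q_1(z1) = E_{Q_2}[log P(z1, z2)] + const
        log_q1 = logP @ q2
        q1 = np.exp(log_q1 - log_q1.max()); q1 /= q1.sum()
        # Update Q_2: log Q_2(z2) = E_{Q_1}[log P(z1, z2)] + const
        log_q2 = logP.T @ q1
        q2 = np.exp(log_q2 - log_q2.max()); q2 /= q2.sum()

    print(q1, q2)                        # factorized approximation
    print(P.sum(axis=1), P.sum(axis=0))  # true marginals, for comparison (mean field need not match them)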

SLIDE 23

Example: Gaussian factorized distribution

23

SLIDE 24

Example: Gaussian factorized distribution

24

Solution:
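A standard instance of this example, following Bishop §10.1.3 (which these slides track elsewhere), infers the mean μ and precision τ of a univariate Gaussian with the factorized posterior Q(μ, τ) = Q(μ) Q(τ). The sketch below (Python/NumPy assumed; the prior values are hypothetical) iterates the usual closed-form updates for this model and is only a plausible instantiation of the slide's example.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(loc=2.0, scale=1.5, size=200)      # synthetic data
    N, xbar = len(x), x.mean()

    # Priors (hypothetical values): mu ~ N(mu0, (lam0*tau)^-1),  tau ~ Gamma(a0, b0)
    mu0, lam0, a0, b0 = 0.0, 1e-3, 1e-3, 1e-3

    E_tau = 1.0                                        # initial guess for E[tau]
    for _ in range(100):
        # Q(mu) = N(mu_N, lam_N^-1)
        mu_N = (lam0 * mu0 + N * xbar) / (lam0 + N)
        lam_N = (lam0 + N) * E_tau
        # Q(tau) = Gamma(a_N, b_N), using E_mu[(x_n - mu)^2] = (x_n - mu_N)^2 + 1/lam_N
        a_N = a0 + (N + 1) / 2
        b_N = b0 + 0.5 * (np.sum((x - mu_N) ** 2) + N / lam_N
                          + lam0 * ((mu_N - mu0) ** 2 + 1 / lam_N))
        E_tau = a_N / b_N

    print(mu_N, 1 / np.sqrt(E_tau))    # posterior mean of mu and implied std, close to 2.0 and 1.5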

SLIDE 25

Example: Bayesian mixtures of Gaussians

25

 For simplicity, assume that the data-generating variance is one (identity covariance).

 P(𝝁) = ∏_{k=1}^{K} N(𝝁_k | 𝒎_0, 𝚲_0^{-1})

 P(z_k^(n) = 1 | 𝝅) = π_k

 P(𝒙^(n) | z_k^(n) = 1, 𝝁) = N(𝒙^(n) | 𝝁_k, I)

Generative process:
For k = 1, …, K: draw 𝝁_k ~ N(𝒎_0, 𝚲_0^{-1})
For n = 1, …, N: draw 𝒛^(n) ~ Mult(𝝅), then draw 𝒙^(n) ~ ∏_{k=1}^{K} N(𝝁_k, I)^{z_k^(n)}

[Figure: plate diagram with 𝒛^(n) → 𝒙^(n) inside a plate of size N, and parameters 𝝅 and 𝝁 outside the plate]

SLIDE 26

Example: Bayesian mixtures of Gaussians

26

Z = {𝒛^(1), …, 𝒛^(N), 𝝁_1, …, 𝝁_K}   X = {𝒙^(1), …, 𝒙^(N)}

P(𝒛^(1), …, 𝒛^(N), 𝝁_1, …, 𝝁_K | 𝒙^(1), …, 𝒙^(N)) =
  [ ∏_{k=1}^{K} P(𝝁_k) ∏_{n=1}^{N} P(𝒛^(n)) P(𝒙^(n) | 𝒛^(n), 𝝁_1, …, 𝝁_K) ]
  / [ ∫_{𝝁_1,…,𝝁_K} Σ_{𝒛^(1),…,𝒛^(N)} ∏_{k=1}^{K} P(𝝁_k) ∏_{n=1}^{N} P(𝒛^(n)) P(𝒙^(n) | 𝒛^(n), 𝝁_1, …, 𝝁_K) ]

The denominator is difficult to compute

SLIDE 27

Example: Bayesian mixtures of Gaussians

27

 Consider a variational distribution which factorizes between the latent variables and the parameters:

Q(𝒛^(1), …, 𝒛^(N), 𝝁_1, …, 𝝁_K) = Q(𝒛^(1), …, 𝒛^(N)) Q(𝝁_1, …, 𝝁_K)

 This is the only assumption required in order to obtain a tractable practical solution
SLIDE 28

Example: Bayesian mixtures of Gaussians

28

ln Q(𝒛^(1), …, 𝒛^(N)) = E_{𝝁_1,…,𝝁_K}[ ln P(Z, X) ] + const
  = E_{𝝁_1,…,𝝁_K}[ ln P(𝒛^(1), …, 𝒛^(N), 𝝁_1, …, 𝝁_K, 𝒙^(1), …, 𝒙^(N)) ] + const
  = E_{𝝁_1,…,𝝁_K}[ ln ( ∏_{k=1}^{K} P(𝝁_k) ∏_{n=1}^{N} P(𝒛^(n)) P(𝒙^(n) | 𝒛^(n), 𝝁_1, …, 𝝁_K) ) ] + const
  = E_{𝝁_1,…,𝝁_K}[ Σ_{k=1}^{K} ln P(𝝁_k) + Σ_{n=1}^{N} ln P(𝒛^(n)) + Σ_{n=1}^{N} ln P(𝒙^(n) | 𝒛^(n), 𝝁_1, …, 𝝁_K) ] + const

ln P(𝒙^(n) | 𝒛^(n), 𝝁_1, …, 𝝁_K) = Σ_{k=1}^{K} z_k^(n) ln N(𝒙^(n) | 𝝁_k, I)
  = −(d/2) ln 2π − (1/2) Σ_{k=1}^{K} z_k^(n) (𝒙^(n) − 𝝁_k)^T (𝒙^(n) − 𝝁_k)

ln P(𝒛^(n)) = Σ_{k=1}^{K} z_k^(n) ln π_k

SLIDE 29

Example: Bayesian mixtures of Gaussians

29

ln Q(𝒛^(1), …, 𝒛^(N)) = Σ_{n=1}^{N} ln Q(𝒛^(n))   ⇒   Q(𝒛^(1), …, 𝒛^(N)) = ∏_{n=1}^{N} Q(𝒛^(n))

ln Q(𝒛^(n)) = E_{𝝁_1,…,𝝁_K}[ Σ_{k=1}^{K} z_k^(n) ln π_k − (1/2) Σ_{k=1}^{K} z_k^(n) (𝒙^(n) − 𝝁_k)^T (𝒙^(n) − 𝝁_k) ] + const

SLIDE 30

Example: Bayesian mixtures of Gaussians

30

ln Q(𝒛^(n)) = E_{𝝁_1,…,𝝁_K}[ Σ_{k=1}^{K} z_k^(n) ln π_k − (1/2) Σ_{k=1}^{K} z_k^(n) (𝒙^(n) − 𝝁_k)^T (𝒙^(n) − 𝝁_k) ] + const

ln Q(𝒛^(n)) = Σ_{k=1}^{K} z_k^(n) [ ln π_k + 𝒙^(n)T E[𝝁_k] − (1/2) E[𝝁_k^T 𝝁_k] − (1/2) 𝒙^(n)T 𝒙^(n) ] + const

⇒ Q(𝒛^(n)) = Mult(r_{n1}, …, r_{nK}),   E[z_k^(n)] = r_{nk}

r_{nk} = exp( ln π_k + 𝒙^(n)T E[𝝁_k] − (1/2) E[𝝁_k^T 𝝁_k] − (1/2) 𝒙^(n)T 𝒙^(n) )
         / Σ_{k'=1}^{K} exp( ln π_{k'} + 𝒙^(n)T E[𝝁_{k'}] − (1/2) E[𝝁_{k'}^T 𝝁_{k'}] − (1/2) 𝒙^(n)T 𝒙^(n) )

SLIDE 31

Example: Bayesian mixtures of Gaussians

31

ln Q(𝝁_1, …, 𝝁_K) = E_{𝒛^(1),…,𝒛^(N)}[ ln P(𝒛^(1), …, 𝒛^(N), 𝝁_1, …, 𝝁_K, 𝒙^(1), …, 𝒙^(N)) ] + const
  = E_{𝒛^(1),…,𝒛^(N)}[ ln ( ∏_{k=1}^{K} P(𝝁_k) ∏_{n=1}^{N} P(𝒛^(n)) P(𝒙^(n) | 𝒛^(n), 𝝁_1, …, 𝝁_K) ) ] + const
  = ln ∏_{k=1}^{K} P(𝝁_k) + E_{𝒛^(1),…,𝒛^(N)}[ Σ_{n=1}^{N} ln P(𝒙^(n) | 𝒛^(n), 𝝁_1, …, 𝝁_K) ] + const

ln P(𝒙^(n) | 𝒛^(n), 𝝁_1, …, 𝝁_K) = Σ_{k=1}^{K} z_k^(n) ln N(𝒙^(n) | 𝝁_k, I)

SLIDE 32

Example: Bayesian mixtures of Gaussians

32

= Σ_{k=1}^{K} ln P(𝝁_k) + Σ_{n=1}^{N} Σ_{k=1}^{K} E[z_k^(n)] ln N(𝒙^(n) | 𝝁_k, I) + const

⇒ Q(𝝁_1, …, 𝝁_K) = ∏_{k=1}^{K} Q(𝝁_k)

Q(𝝁_k) ∝ exp( ln P(𝝁_k) + Σ_{n=1}^{N} E[z_k^(n)] ln N(𝒙^(n) | 𝝁_k, I) )   ⇒   Q(𝝁_k) = N(𝝁_k | 𝒎_k, 𝚲_k^{-1})

𝚲_k = 𝚲_0 + ( Σ_{n=1}^{N} E[z_k^(n)] ) I

𝒎_k = 𝚲_k^{-1} ( 𝚲_0 𝒎_0 + Σ_{n=1}^{N} E[z_k^(n)] 𝒙^(n) )
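The two blocks of updates above (responsibilities r_{nk}, then the Gaussian factor over each 𝝁_k) translate directly into a coordinate-ascent loop. A compact sketch under this slide's assumptions (unit covariance, known mixing weights π, isotropic prior 𝚲_0 = λ0·I, 2D synthetic data; Python/NumPy assumed):

    import numpy as np

    rng = np.random.default_rng(1)
    # Synthetic 2D data from two well-separated clusters
    X = np.vstack([rng.normal([-3, 0], 1.0, size=(150, 2)),
                   rng.normal([+3, 0], 1.0, size=(150, 2))])
    N, d = X.shape
    K = 2
    pi = np.full(K, 1.0 / K)            # known mixing weights (as assumed on the slide)
    m0, lam0 = np.zeros(d), 1e-2        # prior: mu_k ~ N(m0, (lam0*I)^-1)

    m = rng.normal(size=(K, d))         # E[mu_k], initialized randomly
    lam = np.full(K, lam0)              # precision of Q(mu_k) = N(m_k, (lam_k*I)^-1)

    for _ in range(50):
        # r_nk ∝ exp(ln pi_k + x^T E[mu_k] - 0.5 E[mu_k^T mu_k]); the -0.5 x^T x term cancels in the normalization
        E_mu_sq = np.sum(m ** 2, axis=1) + d / lam          # E[mu_k^T mu_k]
        log_r = np.log(pi) + X @ m.T - 0.5 * E_mu_sq
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r); r /= r.sum(axis=1, keepdims=True)
        # Q(mu_k): Lambda_k = lam0*I + (sum_n r_nk)*I,  m_k = Lambda_k^-1 (lam0*m0 + sum_n r_nk x_n)
        Nk = r.sum(axis=0)
        lam = lam0 + Nk
        m = (lam0 * m0 + r.T @ X) / lam[:, None]

    print(m)          # posterior means of the two components, near (-3, 0) and (3, 0)
    print(Nk)         # expected counts per component, roughly 150 each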

SLIDE 33

Variational posterior distribution

33

 In this example, the variational posterior distributions have the same functional form as the corresponding factors in the joint distribution

 This is a general result and is a consequence of the choice of conjugate distributions.

 The form of the posteriors is determined by the form of the likelihood and the prior

 There are general results for the class of conjugate-exponential models

 The additional factorizations of the variational posterior distributions are a consequence of the interaction between the assumed factorization and the conditional independencies in P

SLIDE 34

Mean field for exponential family

34

P(z_j | 𝒛_{−j}, 𝒙) = h(z_j) exp( η(𝒛_{−j}, 𝒙)^T T(z_j) − A(η(𝒛_{−j}, 𝒙)) )

ln P(z_j | 𝒛_{−j}, 𝒙) = ln h(z_j) + η(𝒛_{−j}, 𝒙)^T T(z_j) − A(η(𝒛_{−j}, 𝒙))

 Mean field variational inference is straightforward:

ln Q(z_j) = E_{Q_{−j}}[ log P(z_j | 𝒛_{−j}, 𝒙) ] + const
  = ln h(z_j) + E_{Q_{−j}}[ η(𝒛_{−j}, 𝒙) ]^T T(z_j) − E_{Q_{−j}}[ A(η(𝒛_{−j}, 𝒙)) ]

Q(z_j) ∝ h(z_j) exp( E_{Q_{−j}}[ η(𝒛_{−j}, 𝒙) ]^T T(z_j) )

 Q(z_j) is in the same exponential family as the conditional.

SLIDE 35

Mean field for exponential family

35

 Give each hidden variable z_j a variational parameter ν_j, and put it in the same exponential family as its model conditional:

Q(Z) = ∏_j Q(z_j | ν_j)

 In each iteration of coordinate ascent:

 set each natural variational parameter ν_j to the expectation of the natural conditional parameter for variable z_j:

ν_j* = E_{Q_{−j}}[ η(𝒛_{−j}, 𝒙) ]

SLIDE 36

Conjugate exponential model in learning problems

36

 When the complete-data likelihood is in the exponential family with natural parameters 𝜼:

P(X, Z | 𝜼) = ∏_{n=1}^{N} h(𝒙^(n), 𝒛^(n)) exp( 𝜼^T T(𝒙^(n), 𝒛^(n)) − A(𝜼) )

 We shall also use a conjugate prior for 𝜼:

 P(𝜼 | ν_0, 𝝌_0) = f(ν_0, 𝝌_0) exp( ν_0 𝜼^T 𝝌_0 − ν_0 A(𝜼) )

Z = {𝒛^(1), …, 𝒛^(N)}   X = {𝒙^(1), …, 𝒙^(N)}

SLIDE 37

Mean field for conjugate exponential model in learning problems

37

 Suppose Q(Z, 𝜼) = Q(Z) Q(𝜼):

⇒ Q(Z) = ∏_{n=1}^{N} Q(𝒛^(n))

Q*(𝒛^(n)) = h(𝒙^(n), 𝒛^(n)) exp( E_𝜼[𝜼]^T T(𝒙^(n), 𝒛^(n)) − A(E_𝜼[𝜼]) )

Q*(𝜼) = f(ν_N, 𝝌_N) exp( 𝜼^T 𝝌_N − ν_N A(𝜼) )

ν_N = ν_0 + N    𝝌_N = 𝝌_0 + Σ_{n=1}^{N} E_{𝒛^(n)}[ T(𝒙^(n), 𝒛^(n)) ]

SLIDE 38

Variational Bayes

38

 Bayesian inference with incomplete data

 For complete data, we could derive closed-form solutions to this inference problem under some assumptions.

 In the case of incomplete data, these solutions do not exist, and so we need to resort to approximate inference.

 Variational Bayes EM (VBEM) provides a way to model uncertainty in the parameters as well as in the latent variables

 Bayesian estimation at a computational cost that is essentially the same as EM.

 Thus, it often gives us the speed benefits of ML or MAP estimation together with the statistical benefits of the Bayesian approach

SLIDE 39

Variational Bayes learning

39

ln P(𝒟) = ln ∫ Σ_ℋ P(𝒟, ℋ | 𝜽) P(𝜽) d𝜽

ln P(𝒟) ≥ ∫ Σ_ℋ Q(ℋ, 𝜽) ln [ P(𝒟, ℋ, 𝜽) / Q(ℋ, 𝜽) ] d𝜽

ln P(𝒟) ≥ ∫ Σ_ℋ Q_ℋ(ℋ) Q_𝜽(𝜽) ln [ P(𝒟, ℋ, 𝜽) / ( Q_ℋ(ℋ) Q_𝜽(𝜽) ) ] d𝜽

ln P(𝒟) ≥ ∫ Σ_ℋ Q_ℋ(ℋ) Q_𝜽(𝜽) ln P(𝒟, ℋ, 𝜽) d𝜽 + H[Q_ℋ] + H[Q_𝜽]

Mean field: Q(ℋ, 𝜽) = Q_ℋ(ℋ) Q_𝜽(𝜽)   The bound is F_𝒟[P, Q]   Z = ℋ ∪ 𝜽   X = 𝒟

SLIDE 40

Variational Bayes learning

40

ln P(𝒟) = F_𝒟[P, Q] + KL( Q(ℋ, 𝜽) ‖ P(ℋ, 𝜽 | 𝒟) )

 We want to find Q* = argmax_Q F_𝒟[P, Q]

 We assume the factorization Q(ℋ, 𝜽) = Q_ℋ(ℋ) Q_𝜽(𝜽) and use block coordinate ascent to optimize the above problem

SLIDE 41

Mean Field VB

41

 Initialization: randomly select a starting distribution Q_𝜽^(1)  Repeat

 E-Step: given parameters, find the posterior over the hidden data

Q_ℋ^(t+1) = argmax_{Q_ℋ} F_𝒟[P, Q_ℋ, Q_𝜽^(t)]

 M-Step: given posterior distributions, find likely parameters

Q_𝜽^(t+1) = argmax_{Q_𝜽} F_𝒟[P, Q_ℋ^(t+1), Q_𝜽]

 Until convergence

F[P, Q_ℋ, Q_𝜽] = ∫ Σ_ℋ Q_ℋ(ℋ) Q_𝜽(𝜽) ln P(𝒟, ℋ, 𝜽) d𝜽 + H[Q_ℋ] + H[Q_𝜽]
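Structurally this is just block coordinate ascent on F. The skeleton below (Python assumed; update_QH, update_Qtheta and elbo are hypothetical placeholders for the model-specific closed-form updates, such as those derived for the mixture of Gaussians above) shows the shape of the loop.

    def vbem(data, init_Qtheta, update_QH, update_Qtheta, elbo, max_iter=100, tol=1e-6):
        """Generic mean-field VB (VBEM) loop: alternate the two block updates until F stops improving."""
        Qtheta = init_Qtheta
        prev = -float("inf")
        for _ in range(max_iter):
            QH = update_QH(data, Qtheta)          # E-step: argmax over Q_H with Q_theta fixed
            Qtheta = update_Qtheta(data, QH)      # M-step: argmax over Q_theta with Q_H fixed
            bound = elbo(data, QH, Qtheta)        # F_D[P, Q_H, Q_theta]; must not decrease
            if bound - prev < tol:
                break
            prev = bound
        return QH, Qtheta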

SLIDE 42

Local computation of ELBO for factorized P

42

P(𝒙) = (1/Z) ∏_{a∈ℱ} f_a(𝒙_a)

KL(Q‖P) = −H(Q) − E_Q[ log (1/Z) ∏_{a∈ℱ} f_a(𝒙_a) ]

  = −H(Q) − log (1/Z) − Σ_{a∈ℱ} E_Q[ log f_a(𝒙_a) ]

  = log Z − ( H(Q) + Σ_{a∈ℱ} E_Q[ log f_a(𝒙_a) ] )

ℒ(Q) = H(Q) + Σ_{a∈ℱ} E_Q[ log f_a(𝒙_a) ]

SLIDE 43

Naïve mean field for factorized P

43

Naïve mean field (i.e., a fully factored distribution Q):

Q(𝒙) = ∏_{i=1}^{n} Q_i(x_i)

ℒ(Q) = Σ_{a∈ℱ} E_Q[ log f_a(𝒙_a) ] + H(Q)

E_Q[ log f_a(𝒙_a) ] = Σ_{𝒙_a ∈ Val(X_a)} ( ∏_{i∈N(a)} Q_i(x_i) ) log f_a(𝒙_a)

H(Q) = Σ_{i=1}^{n} H[Q_i]

 Thus, ℒ[Q] can be rewritten simply as a sum of expectations, each one over a small set of variables

N(a) = { i | x_i ∈ scope(f_a) }

SLIDE 44

Stationary point (fixed-point equations)

44

Q_i(x_i) = (1/Z_i) exp{ Σ_{a: i∈N(a)} Σ_{𝒙_a ∈ Val(X_a)} Q(𝒙_a | x_i) log f_a(𝒙_a) }

(here Q(𝒙_a | x_i) = ∏_{j∈N(a), j≠i} Q_j(x_j), since Q is fully factorized)

 Proof:

ℒ(Q) = Σ_{i=1}^{n} ℒ_i[Q]

ℒ_i(Q) = Σ_{a: i∈N(a)} Σ_{𝒙_a} ( ∏_{j∈N(a)} Q_j(x_j) ) log f_a(𝒙_a) + H[Q_i]

L_i(Q, λ) = ℒ_i(Q) + λ_i ( Σ_{x_i ∈ Val(X_i)} Q_i(x_i) − 1 )

∂L_i / ∂Q_i(x_i) = 0   ⇒   Q_i(x_i) = (1/e^{1−λ_i}) exp{ Σ_{a: X_i∈N(a)} Σ_{𝒙_a} Q(𝒙_a | x_i) log f_a(𝒙_a) }

Update rule: we can optimize each Q_i given the values of the other marginals

SLIDE 45

Optimization by coordinate ascent for factorized P

45

Q_i(x_i) = (1/Z_i) exp{ Σ_{a: X_i∈N(a)} Σ_{𝒙_a} Q(𝒙_a | x_i) log f_a(𝒙_a) }

 The coordinate ascent algorithm repeatedly optimizes a single marginal at a time, given fixed choices for all of the others.

While not converged:
  Iterate over each of the variables i ∈ 𝒱
    Maximize the objective function with respect to Q_i(x_i), ∀x_i ∈ Val(X_i), using the above formula.

All these terms involve expectations of variables other than X_i and do not depend on the choice of Q_i(X_i).   (This is block coordinate ascent.)
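A small sketch of these fixed-point updates on an explicit factor graph (Python/NumPy assumed; the factor tables below are a made-up toy instance over three binary variables):

    import numpy as np
    from itertools import product

    # Each factor is (scope, table), where table[x_i, x_j, ...] follows the order of the scope.
    factors = [((0, 1), np.array([[4.0, 1.0], [1.0, 4.0]])),     # f(x0, x1): prefers agreement
               ((1, 2), np.array([[4.0, 1.0], [1.0, 4.0]])),     # f(x1, x2): prefers agreement
               ((0,),   np.array([2.0, 1.0]))]                   # f(x0): prefers x0 = 0

    card = [2, 2, 2]
    Q = [np.full(c, 1.0 / c) for c in card]                      # fully factorized Q, initialized uniform

    for _ in range(100):
        for i in range(len(card)):
            log_qi = np.zeros(card[i])
            for scope, table in factors:
                if i not in scope:
                    continue
                pos = scope.index(i)
                for assignment in product(*(range(card[v]) for v in scope)):
                    # probability of the other variables in the factor under Q, with x_i fixed
                    w = np.prod([Q[v][assignment[k]] for k, v in enumerate(scope) if k != pos])
                    log_qi[assignment[pos]] += w * np.log(table[assignment])
            qi = np.exp(log_qi - log_qi.max())
            Q[i] = qi / qi.sum()

    print([q.round(3) for q in Q])   # approximate singleton marginals; x0's bias propagates to x1 and x2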

SLIDE 46

Convergence properties

46

 ℒ_i is concave in Q_i(X_i)

 The update of Q_i is guaranteed to increase (or at least not decrease) ℒ

 Mean field iterations are guaranteed to converge.

 Each step of the coordinate ascent procedure is monotonically non-decreasing in ℒ.

 Because ℒ is bounded, the sequence of distributions represented by successive iterations of mean field must converge.

 At the convergence point, the fixed-point equations hold for all variables.

 As a consequence, the convergence point is a stationary point of the energy functional subject to the constraints

 The result of the mean field approximation is a local maximum, and not necessarily a global one

SLIDE 47

Local computation in naïve mean field

47

 When updating Q_j, we only need to reason about the variables which share a factor with X_j

 the expectations required to evaluate Q_j involve only those variables lying in the Markov blanket of the node j

 the other terms get absorbed into the constant term.

 The optimization of Q_j can therefore be expressed as a local computation at the node

SLIDE 48

Variational methods: two perspectives

48

 Each algorithm can be explained from two perspectives:

 Message-passing algorithm

 as one way of solving the optimization problem

 Constrained optimization

SLIDE 49

Example: Mean field for pairwise MRFs

49

P(𝒙) = (1/Z) exp{ Σ_{(i,j)∈ℰ} θ_ij(x_i, x_j) + Σ_{i∈𝒱} θ_i(x_i) }

Q* = argmax_{Q∈𝒬} ℒ(Q)

Q(𝒙) = ∏_{i=1}^{n} Q_i(x_i)

Subject to:   Σ_{x_i ∈ Val(X_i)} Q_i(x_i) = 1

SLIDE 50

Example: Mean field for pairwise MRFs

50

 P: pairwise MRF

P(𝒙) = (1/Z) ∏_{(i,j)∈ℰ} φ_ij(x_i, x_j) ∏_{i∈𝒱} φ_i(x_i) = (1/Z) exp{ Σ_{(i,j)∈ℰ} θ_ij(x_i, x_j) + Σ_{i∈𝒱} θ_i(x_i) }

Q_i(x_i) = (1/Z_i) exp{ θ_i(x_i) + Σ_{j∈N(i)} Σ_{x_j} Q_j(x_j) θ_ij(x_i, x_j) }

⇒ Q_i(x_i) ∝ φ_i(x_i) ∏_{j∈N(i)} m_{j→i}(x_i)

m_{j→i}(x_i) ∝ exp{ Σ_{x_j} Q_j(x_j) θ_ij(x_i, x_j) }

θ_i = ln φ_i   θ_ij = ln φ_ij
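These updates are easy to run on a concrete pairwise model; the sketch below (Python/NumPy assumed) applies them to a small Ising-style chain with agreement potentials and a biased end node.

    import numpy as np

    # A 4-node chain x0 - x1 - x2 - x3 of binary variables.
    # theta_ij(x_i, x_j) rewards agreement; theta_i biases node 0 toward state 0.
    n = 4
    edges = [(0, 1), (1, 2), (2, 3)]
    theta_pair = {e: np.array([[1.0, -1.0], [-1.0, 1.0]]) for e in edges}   # ln(phi_ij)
    theta_node = np.zeros((n, 2)); theta_node[0] = [1.0, -1.0]              # ln(phi_i)

    Q = np.full((n, 2), 0.5)                    # fully factorized Q, uniform start
    neighbors = {i: [j for e in edges for j in e if i in e and j != i] for i in range(n)}

    def theta(i, j):
        # theta_ij as a matrix indexed [x_i, x_j], regardless of edge orientation
        return theta_pair[(i, j)] if (i, j) in theta_pair else theta_pair[(j, i)].T

    for _ in range(50):
        for i in range(n):
            # log Q_i(x_i) = theta_i(x_i) + sum_{j in N(i)} sum_{x_j} Q_j(x_j) theta_ij(x_i, x_j)
            log_qi = theta_node[i].copy()
            for j in neighbors[i]:
                log_qi += theta(i, j) @ Q[j]    # the message m_{j->i}, in log space
            qi = np.exp(log_qi - log_qi.max())
            Q[i] = qi / qi.sum()

    print(Q.round(3))   # the bias at node 0 propagates: every node ends up favoring state 0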

SLIDE 51

Message passing: Mean field vs. BP for pairwise MRF

51

P(𝒙) = (1/Z) ∏_{(i,j)∈ℰ} φ_ij(x_i, x_j) ∏_{i∈𝒱} φ_i(x_i)

 Mean field:

Q_i(x_i) ∝ φ_i(x_i) ∏_{j∈N(i)} m_{j→i}(x_i)

m_{j→i}(x_i) ∝ exp{ Σ_{x_j} Q_j(x_j) θ_ij(x_i, x_j) }

 Belief propagation (sum-product):

b_i(x_i) ∝ φ_i(x_i) ∏_{j∈N(i)} m_{j→i}(x_i)

m_{i→j}(x_j) ∝ Σ_{x_i} φ_i(x_i) φ_ij(x_i, x_j) ∏_{k∈N(i)\j} m_{k→i}(x_i)

θ_ij(x_i, x_j) = ln φ_ij(x_i, x_j)

SLIDE 52

Variational message passing

52

 Mean field methods are all very similar

 just compute each node's full conditional, and average out the neighbors

P(𝒙) = ∏_i P(x_i | Pa_{x_i})

ln Q(x_j) = E_{Q_{−j}}[ Σ_{i ∈ {j} ∪ ch(j)} ln P(x_i | Pa_i) ] + const

 It is possible to derive a general-purpose set of update equations that work for any DGM for which all CPDs are in the exponential family, and for which all parent nodes have conjugate distributions  Updating nodes one at a time

 updating posterior beliefs using local operations at each node.  each update increases a lower bound on the log evidence (unless it is already at a local maximum)

SLIDE 53

VMP

53

 can be carried out, given that the model is conjugate-exponential

ln Q(x_j) = E_{Q_{−j}}[ Σ_{i=1}^{M} ln P(x_i | Pa_i) ] + const

ln Q(x_j) = E_{Q_{−j}}[ ln P(x_j | Pa_j) ] + Σ_{k∈ch(j)} E_{Q_{−j}}[ ln P(x_k | Pa_k) ] + const

ln Q(x_j) = E_{Q_{−j}}[ η(Pa_j) ]^T T(x_j) + ln h(x_j) + Σ_{k∈ch(j)} E_{Q_{−j}}[ η(x_k, cp_k) ]^T T(x_j) + const

Winn and Bishop, Variational Message Passing, JMLR 2005.

SLIDE 54

Structured variational

54

 Mean Field

 Naïve mean field  Structured mean field

SLIDE 55

Structured Mean Field

55

 Naïve mean field can lead to very poor approximations

 we must use a richer class of distributions 𝒬, which has greater expressive power (by capturing some of the dependencies in P)

 use network structures of different complexity

 a subgraph of the network of P over which exact computation of H[Q] is feasible

 Example: for a grid network, a collection of independent chain structures.

 Exact inference with such structures is linear

SLIDE 56

Structured stationary point

56

P(𝒙) = (1/Z) ∏_{k=1}^{K} φ_k(𝒙_k)

Q(𝒙) = (1/Z_Q) ∏_{j=1}^{J} ψ_j(𝒙_j)

F[P, Q] = Σ_{k=1}^{K} E_Q[ ln φ_k(𝒙_k) ] − E_Q[ ln Q ]

F[P, Q] = Σ_{k=1}^{K} E_Q[ ln φ_k(𝒙_k) ] − Σ_{j=1}^{J} E_Q[ ln ψ_j(𝒙_j) ] + ln Z_Q

SLIDE 57

Structured stationary point

57

 ψ_j is a stationary point of the energy functional iff:

ψ_j(𝒙_j) ∝ exp{ E_Q[ log P(𝒙) | 𝒙_j ] − Σ_{l≠j} E_Q[ log ψ_l(𝒙_l) | 𝒙_j ] }

ψ_j(𝒙_j) ∝ exp{ Σ_k E_Q[ log φ_k(𝒙_k) | 𝒙_j ] − Σ_{l≠j} E_Q[ log ψ_l(𝒙_l) | 𝒙_j ] }

 We need to perform inference (in Q) after each update step

 ψ_j(𝒙_j) does not affect the right-hand side of the fixed-point equations defining its value

SLIDE 58

Structured mean-field quality

58

 Both the quality and the computational complexity of the variational approximation depend on the structures of P and Q.

 We want to be able to perform efficient inference in the approximating network.

 we often select our network so that the resulting factorization leads to a tractable network (that is, one of low tree-width)

SLIDE 59

Example: Factorial HMM

59

Ghahramani and Jordan, Factorial Hidden Markov Models, Machine Learning 1997.

SLIDE 60

Factorial HMM: Exact Inference

60

 Exact inference costs O(T M K^{M+1}) (for M chains of length T with K states each)  We can use variational inference instead

Ghahramani and Jordan, Factorial Hidden Markov Models, Machine Learning 1997.

SLIDE 61

Factorial HMM: Structured Variational Inference

61

Ghahramani and Jordan, Factorial Hidden Markov Models, Machine Learning 1997.

SLIDE 62

Loopy Belief Propagation (LBP)

62

 A fixed-point iteration procedure that tries to optimize an approximation of F[P, Q]

 Start with an initialization of all messages to one  While not converged do

 Compute (i.e., update) the messages on all the edges

 LBP optimizes approximate versions of the energy functional

 it approximates F[P, Q]  it works directly with pseudomarginals, which may not be consistent with any joint distribution

 LBP does not always converge, and even when it does, it may converge to the wrong answers

SLIDE 63

LBP

63

 If BP is used on graphs with loops, messages may circulate indefinitely

 But we can run it anyway and hope for the best  Stop message passing when

 a fixed number of iterations is reached  or when no significant change in the beliefs occurs

 Empirically, a good approximation is often achievable

 If the solution is not oscillatory but converges, it usually is a good approximation
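The "run it anyway" recipe is easy to state in code; the sketch below (Python/NumPy assumed) runs synchronous sum-product message passing on a pairwise MRF with a cycle, stopping when the beliefs stop changing or a maximum number of iterations is hit.

    import numpy as np

    # Pairwise MRF on a 3-cycle of binary variables, with agreement potentials and one biased node.
    n = 3
    edges = [(0, 1), (1, 2), (2, 0)]
    phi_pair = {e: np.array([[2.0, 1.0], [1.0, 2.0]]) for e in edges}   # phi_ij(x_i, x_j)
    phi_node = np.ones((n, 2)); phi_node[0] = [3.0, 1.0]                # phi_i(x_i)

    def pair(i, j):
        return phi_pair[(i, j)] if (i, j) in phi_pair else phi_pair[(j, i)].T

    neighbors = {i: [j for e in edges for j in e if i in e and j != i] for i in range(n)}
    msg = {(i, j): np.ones(2) for i in range(n) for j in neighbors[i]}  # messages m_{i->j}(x_j), init to 1

    def beliefs():
        b = phi_node.copy()
        for i in range(n):
            for j in neighbors[i]:
                b[i] *= msg[(j, i)]
        return b / b.sum(axis=1, keepdims=True)

    old = beliefs()
    for it in range(100):
        new_msg = {}
        for i in range(n):
            for j in neighbors[i]:
                # m_{i->j}(x_j) = sum_{x_i} phi_i(x_i) phi_ij(x_i, x_j) prod_{k in N(i)\j} m_{k->i}(x_i)
                incoming = np.prod([msg[(k, i)] for k in neighbors[i] if k != j], axis=0)
                m = (phi_node[i] * incoming) @ pair(i, j)
                new_msg[(i, j)] = m / m.sum()
        msg = new_msg
        b = beliefs()
        if np.max(np.abs(b - old)) < 1e-6:      # stop when beliefs no longer change significantly
            break
        old = b

    print(it, b.round(3))   # converged iteration count and approximate marginals (node 0's bias spreads)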

SLIDE 64

Reference

64

 C.M. Bishop, “Pattern Recognition and Machine Learning”,

Chapter 10.1-10.7.

 D. Koller and N. Friedman, “Probabilistic Graphical Models:

Principles and Techniques”, Chapter 11.1-11.3, 11.5, 11.6.

SLIDE 65

65

LDA: Latent Dirichlet Allocation

[D.M. Blei, A.Y. Ng, M.I. Jordan, 2003]

SLIDE 66

Summarizing the data using topics

66

SLIDE 67

Summarizing the data using topics

67

SLIDE 68

68

SLIDE 69

Generative process for doc d

69

SLIDE 70

LDA model

70

SLIDE 71

Probability distribution of docs

71

p(𝒘 | α, β) under LDA, for three words and four topics

SLIDE 72

Key inference problems

72

SLIDE 73

Mean field variational inference for LDA

73

 The problematic coupling between θ and β arises due to the edges between θ, 𝒛, and 𝒘.

SLIDE 74

Mean field variational inference

74

SLIDE 75

A variational inference algorithm for LDA

75

SLIDE 76

Parameter estimation

76

 Variational EM procedure

 E step: uses variational inference to approximate p(θ, 𝒛 | 𝒘, α, β)

 M step: then, for fixed values of the obtained variational parameters, maximizes the lower bound w.r.t. the model parameters.
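For a single document, the E step reduces to the coordinate-ascent updates on the variational parameters from Blei et al. (2003): φ_{nk} ∝ β_{k,w_n} exp(Ψ(γ_k)) for the per-word topic responsibilities and γ_k = α_k + Σ_n φ_{nk} for the topic proportions. A minimal sketch (Python/NumPy/SciPy assumed; the α and β values are made up):

    import numpy as np
    from scipy.special import digamma

    K, V = 3, 6                               # topics, vocabulary size
    alpha = np.full(K, 0.5)                   # Dirichlet prior on topic proportions (hypothetical value)
    beta = np.random.default_rng(0).dirichlet(np.ones(V), size=K)   # topic-word distributions (hypothetical)
    doc = np.array([0, 0, 2, 3, 5])           # word indices of one document

    gamma = alpha + len(doc) / K              # standard initialization
    for _ in range(100):
        # phi_{nk} ∝ beta_{k, w_n} * exp(digamma(gamma_k))
        log_phi = np.log(beta[:, doc]).T + digamma(gamma)      # shape (num_words, K)
        log_phi -= log_phi.max(axis=1, keepdims=True)
        phi = np.exp(log_phi); phi /= phi.sum(axis=1, keepdims=True)
        # gamma_k = alpha_k + sum_n phi_{nk}
        new_gamma = alpha + phi.sum(axis=0)
        if np.max(np.abs(new_gamma - gamma)) < 1e-6:
            break
        gamma = new_gamma

    print(gamma)            # variational Dirichlet parameters: approximate posterior topic proportions
    print(phi.round(3))     # per-word responsibilities over topics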

SLIDE 77

Summarizing the data using topics

77

SLIDE 78

78

SLIDE 79

Graphical model representation of fully Bayesian LDA

79

 We assume that each row of β is independently drawn from an exchangeable Dirichlet distribution

SLIDE 80

Recall: Mean Field VB

80

 Initialization: randomly select a starting distribution Q_𝜽^(1)  Repeat

 E-Step: given parameters, find the posterior over the hidden data

Q_ℋ^(t+1) = argmax_{Q_ℋ} F_𝒟[P, Q_ℋ, Q_𝜽^(t)]

 M-Step: given posterior distributions, find likely parameters

Q_𝜽^(t+1) = argmax_{Q_𝜽} F_𝒟[P, Q_ℋ^(t+1), Q_𝜽]

 Until convergence

SLIDE 81

Variational Bayes

81

q(β_{1:K}, 𝒛_{1:D}, θ_{1:D}) = q(β_{1:K}) q(𝒛_{1:D}, θ_{1:D})

⇒ q(β_{1:K}, 𝒛_{1:D}, θ_{1:D}) = ∏_{k=1}^{K} q(β_k) ∏_{d=1}^{D} q(θ_d, 𝒛_d)

SLIDE 82

Variational Bayes

82

SLIDE 83

83

SLIDE 84

Reference

84

 D.M. Blei, A.Y. Ng, M.I. Jordan, “Latent Dirichlet Allocation”,

JMLR 3 (2003) 993-1022.