CS480/680 Machine Learning Lecture 11: February 11th, 2020

SLIDE 1

CS480/680 Machine Learning
Lecture 11: February 11th, 2020

Variational Inference
Zahra Sheikhbahaee
University of Waterloo

References: Variational Algorithms for Approximate Bayesian Inference (Beal 2003, Chapter 2); Variational Inference: A Review for Statisticians (Blei et al. 2016)

SLIDE 2

  • Variational lower bound derivation
  • Variational mean field approximation


SLIDE 3

Full Bayesian Inference

  • Training stage

𝑄 πœ„ π‘Œ$%, 𝑍

$% =

𝑄 𝑍

$% π‘Œ$%, πœ„ 𝑄(πœ„)

∫ 𝑄 𝑍

$% π‘Œ$%, πœ„ 𝑄 πœ„ π‘’πœ„

  • Testing stage

𝑄 𝑧 𝑦, π‘Œ$%, 𝑍

$% = / 𝑄(𝑧|𝑦, πœ„)𝑄 πœ„ π‘Œ$%, 𝑍 $% π‘’πœ„


SLIDE 4

Full Bayesian Inference

(Same training and testing equations as Slide 3.)

Maybe intractable: posterior distributions can be calculated analytically only for simple conjugate models!
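Where conjugacy does hold, both stages have closed forms. Below is a minimal sketch (my own illustration, not from the slides) for a Beta-Bernoulli model; the data and hyperparameters are invented for the example.

```python
# Full Bayesian inference for a Beta-Bernoulli model, one of the simple
# conjugate cases where both integrals are tractable.
import numpy as np

z_train = np.array([1, 0, 1, 1, 0, 1, 1, 1])   # observed coin flips

# Prior Q(theta) = Beta(a0, b0)
a0, b0 = 1.0, 1.0

# Training stage: Q(theta | Z_tr) = Beta(a0 + #heads, b0 + #tails);
# the normalizing integral in the denominator is just the Beta function.
a_post = a0 + z_train.sum()
b_post = b0 + len(z_train) - z_train.sum()

# Testing stage: Q(z=1 | Z_tr) = integral of Q(z=1 | theta) Q(theta | Z_tr) dtheta
#              = E[theta | Z_tr] = a_post / (a_post + b_post)
predictive = a_post / (a_post + b_post)
print(f"posterior Beta({a_post:.0f}, {b_post:.0f}), P(next = 1) = {predictive:.3f}")
```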

SLIDE 5

Choice Of Priors

  • In any Bayesian inference model what is essential is which type of prior

knowledge (if any) is conveyed in prior.

  • Subjective priors: The prior encapsulates information as fully as possible by using

previous experimental data or expert knowledge. Conjugate priors in the exponential family are subjective priors. 𝑔 πœ„ 2 𝜈 = π‘ž πœ„ 𝑧 ∝ 𝑔 πœ„ 𝜈 π‘ž 𝑧 πœ„ The definition of a likelihood function in an exponential family model is given as follow where we assume that n data points arrive independent and identically distributed π‘ž 𝑧6 πœ„ = 𝑕 πœ„ 𝑔(𝑧6)𝑓9 : ;<(=>) 𝑕(πœ„): a normalization constant 𝜚(πœ„): a vector of natural parameters 𝑣(𝑧6): The sufficient statistics 𝑄 πœ„ πœƒ, πœ‰ = β„Ž(πœƒ, πœ‰)𝑕(πœ„)D𝑓9(:);E


SLIDE 6

Choice Of Priors

  • In any Bayesian inference model what is essential is which type of prior

knowledge (if any) is conveyed in prior.

  • Subjective priors: The prior encapsulates information as fully as possible by using

previous experimental data or expert knowledge. Conjugate priors in the exponential family are subjective priors. 𝑔 πœ„ 2 𝜈 = π‘ž πœ„ 𝑧 ∝ 𝑔 πœ„ 𝜈 π‘ž 𝑧 πœ„ The posterior distribution 𝑄 πœ„ 𝑧 = 𝑄 πœ„ πœƒ, πœ‰ F

6GH I

π‘ž 𝑧6 πœ„ ∝ 𝑕 πœ„ DJI𝑓9(:);E F

6GH I

𝑔(𝑧6) 𝑓9 : ;<(=>) ∝ 𝑄(πœ„|2 πœƒ, 2 πœ‰) 2 πœƒ = πœƒ + π‘œ 2 πœ‰ = πœ‰ + M

6GH I

𝑣(𝑧6)
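As a concrete instance of this update, here is a small sketch (my own, with illustrative names, not code from the lecture) for a Poisson likelihood, whose sufficient statistic is $v(z) = z$ and whose conjugate prior is a Gamma distribution.

```python
# Generic conjugate update: (eta, nu) -> (eta + n, nu + sum_i v(z_i)).
# For the Poisson, the prior h(lam)^eta * exp(nu * log lam) is proportional
# to a Gamma(shape = nu + 1, rate = eta) density.
import numpy as np

def conjugate_update(eta, nu, z, v):
    """Exponential-family conjugate update of the two hyperparameters."""
    return eta + len(z), nu + sum(v(z_i) for z_i in z)

z = np.array([3, 1, 4, 1, 5])                       # observed Poisson counts
eta_new, nu_new = conjugate_update(eta=2.0, nu=1.0, z=z, v=lambda z_i: z_i)
# Posterior is Gamma(shape = nu_new + 1, rate = eta_new): the usual
# Gamma-Poisson update (shape + sum of counts, rate + number of observations).
print(eta_new, nu_new)
```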


SLIDE 7

Choice Of Priors

  • Objective Priors: Instead of attempting to encapsulate rich knowledge

into the prior, the objective Bayesian tries to impart as little information as possible in an attempt to allow the data to carry as much weight as possible in the posterior distribution. One class of noninformative priors are reference priors.

  • Hierarchical priors: Utilize hierarchical modeling to transfer the

reference prior problem to a β€˜higher level’ of the model. Hierarchical models allow a more β€œobjective” approach to inference by estimating the parameters of prior distributions from data rather than requiring them to be specified using subjective information.


SLIDE 8

Approximate Inference

Probabilistic model: $Q(y, \theta) = Q(y \mid \theta)\, Q(\theta)$

Variational Inference: approximate $q(\theta \mid y) \approx r(\theta) \in \mathcal{R}$
  • Biased
  • Faster and more scalable

Markov Chain Monte Carlo: sample from the unnormalized $q(\theta \mid y)$
  • Unbiased
  • Needs a lot of samples

SLIDE 9

Mathematical magic

Consider a model with hidden variables $\boldsymbol{y} = (y_1, \ldots, y_n)$, observed variables $\boldsymbol{z} = (z_1, z_2, \ldots, z_n)$, and parameters $\theta$ governing the stochastic dependencies between the variables. The log likelihood decomposes as

$$\mathcal{L}(\theta) \equiv \ln q(\boldsymbol{z} \mid \theta) = \sum_{i=1}^{n} \ln q(z_i \mid \theta) = \sum_{i=1}^{n} \ln \int dy_i\, q(y_i, z_i \mid \theta)$$

SLIDE 10

Mathematical magic

Continuing from Slide 9, multiply and divide by an arbitrary distribution $r_{y_i}(y_i)$ inside each integral:

$$\mathcal{L}(\theta) = \sum_{i=1}^{n} \ln \int dy_i\, r_{y_i}(y_i)\, \frac{q(y_i, z_i \mid \theta)}{r_{y_i}(y_i)} = \sum_{i=1}^{n} \ln \mathbb{E}_{r_{y_i}}\!\left[\frac{q(y_i, z_i \mid \theta)}{r_{y_i}(y_i)}\right]$$

Jensen's inequality for a concave function $g$ is $g(\mathbb{E}_r[y]) \geq \mathbb{E}_r[g(y)]$.

SLIDE 11

Mathematical magic

Applying Jensen's inequality (ln is concave):

$$\mathcal{L}(\theta) = \sum_{i=1}^{n} \ln \mathbb{E}_{r_{y_i}}\!\left[\frac{q(y_i, z_i \mid \theta)}{r_{y_i}(y_i)}\right] \geq \sum_{i=1}^{n} \mathbb{E}_{r_{y_i}}\!\left[\ln \frac{q(y_i, z_i \mid \theta)}{r_{y_i}(y_i)}\right]$$

SLIDE 12

Mathematical magic

Writing the expectation out as an integral:

$$\mathcal{L}(\theta) \geq \sum_{i=1}^{n} \mathbb{E}_{r_{y_i}}\!\left[\ln \frac{q(y_i, z_i \mid \theta)}{r_{y_i}(y_i)}\right] = \sum_{i=1}^{n} \int dy_i\, r_{y_i}(y_i) \ln \frac{q(y_i, z_i \mid \theta)}{r_{y_i}(y_i)}$$

SLIDE 13

Mathematical magic

Splitting the logarithm yields the functional we will optimize:

$$\mathcal{L}(\theta) \geq \sum_{i=1}^{n} \left[ \int dy_i\, r_{y_i}(y_i) \ln q(y_i, z_i \mid \theta) - \int dy_i\, r_{y_i}(y_i) \ln r_{y_i}(y_i) \right] \equiv \mathcal{F}(r_{y_1}(y_1), \ldots, r_{y_n}(y_n), \theta)$$

SLIDE 14

The Variational Lower Bound

  • The (negative) variational free energy (β„±(π‘Ÿ[ 𝑦 , πœ„)) or the evidence lower bound

(ELBO): the expected energy under π‘Ÿ[(𝑦) minus the entropy of π‘Ÿ[(𝑦) β„± π‘Ÿ[ 𝑦 , πœ„ = M

6GH I

/ 𝑒𝑦6π‘Ÿ[>(𝑦6) ln π‘ž 𝑦6, 𝑧6 πœ„ π‘Ÿ[>(𝑦6) = M

6

/ 𝑒𝑦6π‘Ÿ[>(𝑦6) ln π‘ž 𝑧6 πœ„ + M

6

/ 𝑒𝑦6π‘Ÿ[>(𝑦6) ln π‘ž(𝑦6|𝑧6, πœ„) π‘Ÿ[>(𝑦6) M

6

ln π‘ž 𝑧6 πœ„ βˆ’ M

6

/ 𝑒𝑦6π‘Ÿ[>(𝑦6) ln π‘Ÿ[>(𝑦6) π‘ž(𝑦6|𝑧6, πœ„) M

6

ln π‘ž 𝑧6 πœ„ βˆ’ 𝐸hi[π‘Ÿ[>(𝑦6) βˆ₯ π‘ž(𝑦6|𝑧6, πœ„)]


This is the KL divergence that we need for VI.
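To make the decomposition concrete, the following sketch (my own toy example, not from the lecture) estimates the bound by Monte Carlo for a one-dimensional conjugate Gaussian model, where the exact log evidence is available for comparison; the bound is tight exactly when $r(y)$ is the true posterior.

```python
# Model: y ~ N(0, 1), z | y ~ N(y, 1), so the exact evidence is z ~ N(0, 2)
# and the exact posterior is y | z ~ N(z/2, 1/2).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
z = 1.3                                        # a single observation

log_evidence = norm.logpdf(z, 0.0, np.sqrt(2.0))

def elbo(m, s, n_samples=200_000):
    """Monte Carlo estimate of F(r, theta) with r(y) = N(m, s^2)."""
    y = rng.normal(m, s, n_samples)            # samples y ~ r(y)
    log_joint = norm.logpdf(z, y, 1.0) + norm.logpdf(y, 0.0, 1.0)
    return np.mean(log_joint - norm.logpdf(y, m, s))

print(log_evidence)                            # about -1.69
print(elbo(0.0, 1.0))                          # a loose lower bound
print(elbo(z / 2, np.sqrt(0.5)))               # tight: r(y) is the true posterior
```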

SLIDE 15

ELBO = Evidence Lower BOund

ln π‘ž 𝑧|πœ„ = β„’ πœ„ + 𝐸hi(π‘Ÿ(𝑦) βˆ₯ π‘ž(𝑦|𝑧, πœ„)) Evidence π‘ž 𝑦 𝑧, πœ„ = π‘ž 𝑧 𝑦, πœ„ π‘ž(𝑦|πœ„) π‘ž(𝑧|πœ„) = π‘ž 𝑧 𝑦, πœ„ π‘ž(𝑦|πœ„) ∫ π‘ž 𝑧 𝑦, πœ„ π‘ž(𝑦|πœ„) 𝑒𝑦 = LikelihoodΓ—Prior Evidence Evidence of the probabilistic model shows the total probability of observing the data. Lower Bound:

𝐸hi β‰₯ 0 β†’ ln π‘ž(𝑧|πœ„) β‰₯ β„’(πœ„)


SLIDE 16

Kullback-Leibler Divergence

  • Properties
  • 𝐸hi(π‘ž||π‘Ÿ) = 0 if and only if (iff) π‘ž = π‘Ÿ

(they may be different on sets of probability zero)

  • 𝐸hi(π‘ž||π‘Ÿ) β‰  𝐸hi(π‘Ÿ||π‘ž)
  • 𝐸hi(π‘ž||π‘Ÿ) β‰₯ 0

βˆ’πΈhi π‘Ÿ βˆ₯ π‘ž = 𝔽] βˆ’ log π‘Ÿ π‘ž = 𝔽] log π‘ž π‘Ÿ ≀ log 𝔽](π‘ž π‘Ÿ) = log / π‘Ÿ 𝑦 π‘ž 𝑦 π‘Ÿ 𝑦 𝑒𝑦 = log / π‘ž 𝑦 𝑒𝑦 = 0


[Figure] Blue: mixture of Gaussians q(y) (fixed). Green: (unimodal) Gaussian r that minimises KL(r βˆ₯ q). Red: (unimodal) Gaussian r that minimises KL(q βˆ₯ r).
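The asymmetry behind the two different fits in the figure is easy to verify numerically. A small sketch (assumptions mine: a bimodal mixture q and a single broad Gaussian r, discretized on a grid):

```python
import numpy as np
from scipy.stats import norm

y = np.linspace(-10, 10, 4001)
dy = y[1] - y[0]
q = 0.5 * norm.pdf(y, -2, 0.7) + 0.5 * norm.pdf(y, 2, 0.7)   # mixture (blue)
r = norm.pdf(y, 0, 2.2)                                       # unimodal Gaussian

def kl(p1, p2):
    """Grid approximation of KL(p1 || p2) = int p1 log(p1/p2)."""
    return np.sum(p1 * (np.log(p1 + 1e-300) - np.log(p2 + 1e-300))) * dy

print(kl(q, r), kl(r, q))    # two different values: KL(q||r) != KL(r||q)
```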

SLIDE 17

Variational Inference

  • Optimization problem with intractable posterior distribution:

π‘Ÿβˆ— = argmin

]([)βˆˆπ’­

𝐸hi(π‘Ÿ(𝑦) βˆ₯ π‘ž(𝑦|𝑧, πœ„))

[Figure: the intractable posterior Q(y | z, ΞΈ) alongside a candidate approximation r(y).]

SLIDE 18

Variational Inference

[Figure: the posterior Q(y | z, ΞΈ) with the optimized r*(y) close to it.]

  • VB practical success: point estimates and prediction; fast, streaming, distributed (3.6M Wikipedia, 350K Nature) [Broderick, Boyd, Wibisono, Wilson, Jordan 2013]
  • Optimization problem with intractable posterior distribution (as on Slide 17):

$$r^{*} = \underset{r(y) \in \mathcal{R}}{\operatorname{argmin}}\; \mathrm{KL}(r(y) \parallel q(y \mid z, \theta))$$

SLIDE 19

Variational Inference

β„’ πœ„ = / 𝑒𝑦 π‘Ÿ(𝑦) log π‘ž 𝑦, 𝑧 πœ„ π‘Ÿ[(𝑦) β†’ max

]([)βˆˆπ’­[βˆ’πΈhi π‘Ÿ 𝑦 βˆ₯ π‘ž 𝑦 𝑧, πœ„

+ log π‘ž(𝑧|πœ„)] ≀ log π‘ž(𝒠)


SLIDE 20

Variational Inference

β„’ πœ„ = / 𝑒𝑦 π‘Ÿ(𝑦) log π‘ž 𝑦, 𝑧 πœ„ π‘Ÿ[(𝑦) = / 𝑒𝑦 π‘Ÿ(𝑦) log π‘ž 𝑧 𝑦, πœ„ π‘ž(𝑦|πœ„) π‘Ÿ(𝑦) = / π‘Ÿ(𝑦) log π‘ž 𝑧 𝑦, πœ„ 𝑒𝑦 + / π‘Ÿ(𝑦) log π‘ž(𝑦|πœ„) π‘Ÿ(𝑦) 𝑒𝑦 𝔽]([) log π‘ž 𝑧 𝑦, πœ„ βˆ’ 𝐸hi(π‘Ÿ(𝑦) βˆ₯ π‘ž(𝑦|πœ„)) The first term encourages maximum likelihood estimation. The second term encourages converging to the prior. Optimizing β„’ πœ„ will lead to generalization of Bayes theorem.


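For the toy Gaussian model used earlier (y ~ N(0, 1), z | y ~ N(y, 1)), both terms are available in closed form. A sketch under those assumptions, with r(y) = N(m, sΒ²) and parameter names of my own choosing:

```python
import numpy as np

def data_term(m, s, z):
    """E_r[log q(z | y, theta)] for the unit-variance Gaussian likelihood."""
    return -0.5 * np.log(2 * np.pi) - 0.5 * ((z - m) ** 2 + s ** 2)

def regularizer(m, s):
    """KL(N(m, s^2) || N(0, 1)): the penalty for straying from the prior."""
    return 0.5 * (m ** 2 + s ** 2 - 1.0) - np.log(s)

def elbo(m, s, z):
    return data_term(m, s, z) - regularizer(m, s)

z = 1.3
print(elbo(z / 2, np.sqrt(0.5), z))   # maximal at the exact posterior, = log evidence
```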

SLIDE 21

Variational Inference

β„’ πœ„ = / π‘’π‘¦π‘Ÿ[(𝑦) log π‘ž 𝑦, 𝑧 πœ„ π‘Ÿ[(𝑦) β†’ max

]^([)βˆˆπ’­

University of Waterloo

21

How to perform an

  • ptimization w.r.t. a

distribution?

The Mean Field Method We assume the posterior is a fully factorized approximation and 𝑦6β€ž are independent π‘Ÿ[> = F

β€žGH |[>|

π‘Ÿ[>…(𝑦6β€ž) Parametric Approximation Parametric family π‘Ÿ[> 𝑦6 = π‘Ÿ[> 𝑦6|πœ‡ e.g. π‘Ÿ(𝑦6)~π’ͺ(𝑦6; 𝜈6, 𝜏6

U)

SLIDE 22

The Mean Field Approximation

β„± π‘Ÿ[ 𝑦 , πœ„ = M

6

/ 𝑒𝑦6π‘Ÿ[>(𝑦6) ln π‘ž(𝑦6, 𝑧6|πœ„) π‘Ÿ[>(𝑦6) = M

6

/ 𝑒𝑦6 F

β€žGH |[>|

π‘Ÿ[>…(𝑦6β€ž) ln π‘ž 𝑦6, 𝑧6 πœ„ βˆ’ F

β€žGH |[>|

π‘Ÿ[>…(𝑦6β€ž) ln F

β€žGH |[>|

π‘Ÿ[>…(𝑦6β€ž) = M

6

/ 𝑒𝑦6 F

β€žGH |[>|

π‘Ÿ[>…(𝑦6β€ž) ln π‘ž 𝑦6, 𝑧6 πœ„ βˆ’ M

β€žGH |[>|

π‘Ÿ[>…(𝑦6β€ž)lnπ‘Ÿ[>…(𝑦6β€ž) = M

6

/ 𝑒𝑦6β€ž π‘Ÿ[>… / ln π‘ž 𝑦6, 𝑧6 πœ„ F

β€ΉΕ’β€ž |[>|

π‘Ÿ[>β€’(𝑦6β€Ή) 𝑒𝑦6β€Ή βˆ’ ln π‘Ÿ[>… + const

University of Waterloo

22

SLIDE 23

The Mean Field Approximation

β„± π‘Ÿ[ 𝑦 , πœ„ = M

6

/ 𝑒𝑦6π‘Ÿ[>(𝑦6) ln π‘ž(𝑦6, 𝑧6|πœ„) π‘Ÿ[>(𝑦6) = M

6

/ 𝑒𝑦6β€ž π‘Ÿ[>… / ln π‘ž 𝑦6, 𝑧6 πœ„ F

β€ΉΕ’β€ž |[>|

π‘Ÿ[>β€’(𝑦6β€Ή) 𝑒𝑦6β€Ή βˆ’ ln π‘Ÿ[>… + const = M

6

/ 𝑒𝑦6β€ž π‘Ÿ[>… π”½β€ΉΕ’β€ž ln π‘ž(𝑦6, 𝑧6|πœ„) βˆ’ M

6

/ 𝑒𝑦6β€ž π‘Ÿ[>… ln π‘Ÿ[>… + const π”½β€ΉΕ’β€ž[… ]: an expectation with respect to the π‘Ÿ distributions over all variables 𝑦6β€Ή for 𝑙 β‰  π‘˜

University of Waterloo

23

SLIDE 24

The Mean Field Approximation

  • The optimal solution after maximizing β„’ πœ„

ln π‘Ÿ[>…

βˆ—

= π”½β€ΉΕ’β€ž ln π‘ž(𝑦6, 𝑧6|πœ„) + const The log of the optimal solution for factor π‘Ÿ[>… is obtained by the log of the joint distribution over all hidden and observed variables and taking the expectation with respect to all of the other factors π‘Ÿ[>β€’ for 𝑙 β‰  π‘˜. π‘Ÿ[>…

βˆ—

= 1 𝒢6β€ž exp(π”½β€ΉΕ’β€ž ln π‘ž(𝑦6, 𝑧6|πœ„) )


SLIDE 25

Mean Field Variational Inference

Algorithm (coordinate ascent variational inference, CAVI):

Initialize $r_{y_i} = \prod_{j=1}^{|y_i|} r_{y_{ij}}(y_{ij})$

Iterate:
  • Update each factor $r_{y_{i1}}, \ldots, r_{y_{iK}}$:

$$r_{y_{ik}}^{*} = \frac{1}{\mathcal{Z}_{ik}} \exp\!\left(\mathbb{E}_{l \neq k}[\ln q(y_i, z_i \mid \theta)]\right)$$

  • Compute the ELBO $\mathcal{L}(\theta)$

Repeat until the ELBO converges.


Assumption: We can compute the update analytically
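As a worked instance where the updates are analytic, the sketch below runs CAVI for the classic Gaussian model with unknown mean and precision (the example of Bishop, PRML Section 10.1.3, not code from the lecture; variable names are mine). The factorization is r(ΞΌ, Ο„) = r(ΞΌ) r(Ο„) with r(ΞΌ) Gaussian and r(Ο„) Gamma.

```python
# CAVI for x_i ~ N(mu, 1/tau) with prior mu | tau ~ N(mu0, 1/(lam0*tau)),
# tau ~ Gamma(a0, b0). Factors: r(mu) = N(mu_n, 1/lam_n), r(tau) = Gamma(a_n, b_n).
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(2.0, 1.5, size=100)          # observed data
n, xbar = len(x), x.mean()

mu0, lam0, a0, b0 = 0.0, 1.0, 1.0, 1.0      # prior hyperparameters
a_n = a0 + (n + 1) / 2                      # fixed by the factorization
b_n = 1.0                                   # arbitrary initialization

for _ in range(50):                         # iterate the two updates
    e_tau = a_n / b_n                       # E[tau] under current r(tau)
    # Update r(mu) = exp(E_tau[ln q(x, mu, tau)]) / Z: a Gaussian
    mu_n = (lam0 * mu0 + n * xbar) / (lam0 + n)
    lam_n = (lam0 + n) * e_tau
    # Update r(tau) = exp(E_mu[ln q(x, mu, tau)]) / Z: a Gamma
    e_sq = np.sum((x - mu_n) ** 2) + n / lam_n       # E_mu[sum_i (x_i - mu)^2]
    e_sq0 = (mu_n - mu0) ** 2 + 1.0 / lam_n          # E_mu[(mu - mu0)^2]
    b_n = b0 + 0.5 * (e_sq + lam0 * e_sq0)

print(mu_n, a_n / b_n)   # posterior mean of mu ~ 2.0, E[tau] ~ 1 / 1.5**2
```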

SLIDE 26

Mean Field Variational Inference

  • Probabilistic model: π‘ž 𝑦6, 𝑧6 πœ„ = π‘ž 𝑧6 𝑦6, πœ„ π‘ž 𝑦6 πœ„ , 𝑦6 = 𝑦6H, … , 𝑦6β€”

When applicable? Conditional conjugacy of likelihood and prior on each 𝑦6β€žconditioned on all

  • ther 𝑦6β€Ή β€ΉΕ’β€ž:

π‘ž 𝑦6β€ž 𝑦6β€ΉΕ’β€ž ∈ 𝒝 𝛽 , π‘ž 𝑧6 𝑦6β€ž, 𝑦6β€ΉΕ’β€ž, πœ„ ∈ ℬ(𝑦6β€ž) β†’ π‘ž(𝑦6β€ž|𝑧6, 𝑦6β€ΉΕ’β€ž, πœ„) ∈ 𝒝(𝛽′)

How to check in practice? For each 𝑦6β€ž:

  • Fix all other 𝑦6β€Ή β€ΉΕ’β€ž (consider them as fixed or constant values)
  • Check whether π‘ž 𝑧6 𝑦6, πœ„ and π‘ž 𝑦6 πœ„ are conjugate w.r.t. 𝑦6β€ž


SLIDE 27

Parametric Approximation

Parametric family of variational distributions: $r_{y_i}(y_i) = r_{y_i}(y_i \mid \lambda)$, where $\lambda$ are the variational parameters.

Why is this a restriction? We choose a family of some fixed form:
  • It may be too simple and insufficient to model the data
  • If it is complex enough, then there is no guarantee we can train it well to fit the data

$$\mathcal{L}(\theta) = \int dy\, r_y(y \mid \lambda) \log \frac{q(y, z \mid \theta)}{r_y(y \mid \lambda)} \;\rightarrow\; \max_{\lambda}$$

If we are able to calculate derivatives of the ELBO with respect to $\lambda$, we can solve this problem using a numerical optimization solver.
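A hedged sketch of that route, reusing the earlier toy Gaussian model: parameterize r(y | Ξ») = N(m, sΒ²) with Ξ» = (m, log s) (my own parameterization) and hand the closed-form negative ELBO to a generic numerical optimizer.

```python
import numpy as np
from scipy.optimize import minimize

z = 1.3                                     # single observation, as before

def neg_elbo(params):
    m, log_s = params
    s2 = np.exp(2.0 * log_s)
    data_term = -0.5 * np.log(2 * np.pi) - 0.5 * ((z - m) ** 2 + s2)
    kl = 0.5 * (m ** 2 + s2 - 1.0) - log_s  # KL(N(m, s^2) || N(0, 1))
    return -(data_term - kl)

res = minimize(neg_elbo, x0=np.array([0.0, 0.0]))
m_opt, s_opt = res.x[0], np.exp(res.x[1])
print(m_opt, s_opt)   # -> z/2 = 0.65 and sqrt(0.5) ~ 0.707: the exact posterior
```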


SLIDE 28

Overlapping Communities

(Gopalan & Blei 2013)

  • A community detection algorithm that

discovers overlapping communities in massive real-world networks.

  • Nodes are allowed to participate in

multiple communities.

  • Many community detection algorithms run

in time squared in the number of nodes, which makes analyzing massive networks computationally intractable.

  • Here the community memberships will be

encoded as hidden random variables.


[Figure] 575,000 scientific articles on the arXiv preprint server. Each link denotes that an article cites or is cited by another article.