CS480/680 Machine Learning Lecture 11: February 11th, 2020

SLIDE 1

CS480/680 Machine Learning
Lecture 11: February 11th, 2020

Variational Inference
Zahra Sheikhbahaee
University of Waterloo

References: Variational Algorithms for Approximate Bayesian Inference (Beal 2003, Chapter 2); Variational Inference: A Review for Statisticians (Blei et al. 2016)

SLIDE 2

  • Variational lower bound derivation
  • Variational mean field approximation


SLIDE 3

Full Bayesian Inference

  • Training stage

𝑄 πœ„ π‘Œ$%, 𝑍

$% =

𝑄 𝑍

$% π‘Œ$%, πœ„ 𝑄(πœ„)

∫ 𝑄 𝑍

$% π‘Œ$%, πœ„ 𝑄 πœ„ π‘’πœ„

  • Testing stage

𝑄 𝑧 𝑦, π‘Œ$%, 𝑍

$% = / 𝑄(𝑧|𝑦, πœ„)𝑄 πœ„ π‘Œ$%, 𝑍 $% π‘’πœ„


SLIDE 4

Full Bayesian Inference

(Same training and testing equations as Slide 3.)

Maybe intractable: posterior distributions can be calculated analytically only for simple conjugate models!
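Where conjugacy does hold, both stages have closed forms. Below is a minimal sketch (my own illustration, not from the slides) for a Beta-Bernoulli model; the data and hyperparameters are invented for the example.

```python
# Full Bayesian inference for a Beta-Bernoulli model, one of the simple
# conjugate cases where both integrals are tractable.
import numpy as np

z_train = np.array([1, 0, 1, 1, 0, 1, 1, 1])   # observed coin flips

# Prior Q(theta) = Beta(a0, b0)
a0, b0 = 1.0, 1.0

# Training stage: Q(theta | Z_tr) = Beta(a0 + #heads, b0 + #tails);
# the normalizing integral in the denominator is just the Beta function.
a_post = a0 + z_train.sum()
b_post = b0 + len(z_train) - z_train.sum()

# Testing stage: Q(z=1 | Z_tr) = integral of Q(z=1 | theta) Q(theta | Z_tr) dtheta
#              = E[theta | Z_tr] = a_post / (a_post + b_post)
predictive = a_post / (a_post + b_post)
print(f"posterior Beta({a_post:.0f}, {b_post:.0f}), P(next = 1) = {predictive:.3f}")
```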

SLIDE 5

Choice Of Priors

  • In any Bayesian inference model what is essential is which type of prior

knowledge (if any) is conveyed in prior.

  • Subjective priors: The prior encapsulates information as fully as possible by using

previous experimental data or expert knowledge. Conjugate priors in the exponential family are subjective priors. 𝑔 πœ„ 2 𝜈 = π‘ž πœ„ 𝑧 ∝ 𝑔 πœ„ 𝜈 π‘ž 𝑧 πœ„ The definition of a likelihood function in an exponential family model is given as follow where we assume that n data points arrive independent and identically distributed π‘ž 𝑧6 πœ„ = 𝑕 πœ„ 𝑔(𝑧6)𝑓9 : ;<(=>) 𝑕(πœ„): a normalization constant 𝜚(πœ„): a vector of natural parameters 𝑣(𝑧6): The sufficient statistics 𝑄 πœ„ πœƒ, πœ‰ = β„Ž(πœƒ, πœ‰)𝑕(πœ„)D𝑓9(:);E


SLIDE 6

Choice Of Priors

  • In any Bayesian inference model what is essential is which type of prior

knowledge (if any) is conveyed in prior.

  • Subjective priors: The prior encapsulates information as fully as possible by using

previous experimental data or expert knowledge. Conjugate priors in the exponential family are subjective priors. 𝑔 πœ„ 2 𝜈 = π‘ž πœ„ 𝑧 ∝ 𝑔 πœ„ 𝜈 π‘ž 𝑧 πœ„ The posterior distribution 𝑄 πœ„ 𝑧 = 𝑄 πœ„ πœƒ, πœ‰ F

6GH I

π‘ž 𝑧6 πœ„ ∝ 𝑕 πœ„ DJI𝑓9(:);E F

6GH I

𝑔(𝑧6) 𝑓9 : ;<(=>) ∝ 𝑄(πœ„|2 πœƒ, 2 πœ‰) 2 πœƒ = πœƒ + π‘œ 2 πœ‰ = πœ‰ + M

6GH I

𝑣(𝑧6)
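As a concrete instance of this update, here is a small sketch (my own, with illustrative names, not code from the lecture) for a Poisson likelihood, whose sufficient statistic is $v(z) = z$ and whose conjugate prior is a Gamma distribution.

```python
# Generic conjugate update: (eta, nu) -> (eta + n, nu + sum_i v(z_i)).
# For the Poisson, the prior h(lam)^eta * exp(nu * log lam) is proportional
# to a Gamma(shape = nu + 1, rate = eta) density.
import numpy as np

def conjugate_update(eta, nu, z, v):
    """Exponential-family conjugate update of the two hyperparameters."""
    return eta + len(z), nu + sum(v(z_i) for z_i in z)

z = np.array([3, 1, 4, 1, 5])                       # observed Poisson counts
eta_new, nu_new = conjugate_update(eta=2.0, nu=1.0, z=z, v=lambda z_i: z_i)
# Posterior is Gamma(shape = nu_new + 1, rate = eta_new): the usual
# Gamma-Poisson update (shape + sum of counts, rate + number of observations).
print(eta_new, nu_new)
```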


SLIDE 7

Choice Of Priors

  • Objective Priors: Instead of attempting to encapsulate rich knowledge

into the prior, the objective Bayesian tries to impart as little information as possible in an attempt to allow the data to carry as much weight as possible in the posterior distribution. One class of noninformative priors are reference priors.

  • Hierarchical priors: Utilize hierarchical modeling to transfer the

reference prior problem to a β€˜higher level’ of the model. Hierarchical models allow a more β€œobjective” approach to inference by estimating the parameters of prior distributions from data rather than requiring them to be specified using subjective information.


SLIDE 8

Approximate Inference

Probabilistic model: $Q(y, \theta) = Q(y \mid \theta)\, Q(\theta)$

Variational Inference: approximate $q(\theta \mid y) \approx r(\theta) \in \mathcal{R}$
  • Biased
  • Faster and more scalable

Markov Chain Monte Carlo: sample from the unnormalized $q(\theta \mid y)$
  • Unbiased
  • Needs a lot of samples

SLIDE 9

Mathematical magic

Consider a model with hidden variables $\boldsymbol{y} = (y_1, \ldots, y_n)$, observed variables $\boldsymbol{z} = (z_1, z_2, \ldots, z_n)$, and parameters $\theta$ governing the stochastic dependencies between the variables. The log likelihood decomposes as

$$\mathcal{L}(\theta) \equiv \ln q(\boldsymbol{z} \mid \theta) = \sum_{i=1}^{n} \ln q(z_i \mid \theta) = \sum_{i=1}^{n} \ln \int dy_i\, q(y_i, z_i \mid \theta)$$

SLIDE 10

Mathematical magic

Continuing from Slide 9, multiply and divide by an arbitrary distribution $r_{y_i}(y_i)$ inside each integral:

$$\mathcal{L}(\theta) = \sum_{i=1}^{n} \ln \int dy_i\, r_{y_i}(y_i)\, \frac{q(y_i, z_i \mid \theta)}{r_{y_i}(y_i)} = \sum_{i=1}^{n} \ln \mathbb{E}_{r_{y_i}}\!\left[\frac{q(y_i, z_i \mid \theta)}{r_{y_i}(y_i)}\right]$$

Jensen's inequality for a concave function $g$ is $g(\mathbb{E}_r[y]) \geq \mathbb{E}_r[g(y)]$.

SLIDE 11

Mathematical magic

Applying Jensen's inequality (ln is concave):

$$\mathcal{L}(\theta) = \sum_{i=1}^{n} \ln \mathbb{E}_{r_{y_i}}\!\left[\frac{q(y_i, z_i \mid \theta)}{r_{y_i}(y_i)}\right] \geq \sum_{i=1}^{n} \mathbb{E}_{r_{y_i}}\!\left[\ln \frac{q(y_i, z_i \mid \theta)}{r_{y_i}(y_i)}\right]$$

SLIDE 12

Mathematical magic

Writing the expectation out as an integral:

$$\mathcal{L}(\theta) \geq \sum_{i=1}^{n} \mathbb{E}_{r_{y_i}}\!\left[\ln \frac{q(y_i, z_i \mid \theta)}{r_{y_i}(y_i)}\right] = \sum_{i=1}^{n} \int dy_i\, r_{y_i}(y_i) \ln \frac{q(y_i, z_i \mid \theta)}{r_{y_i}(y_i)}$$

SLIDE 13

Mathematical magic

Splitting the logarithm yields the functional we will optimize:

$$\mathcal{L}(\theta) \geq \sum_{i=1}^{n} \left[ \int dy_i\, r_{y_i}(y_i) \ln q(y_i, z_i \mid \theta) - \int dy_i\, r_{y_i}(y_i) \ln r_{y_i}(y_i) \right] \equiv \mathcal{F}(r_{y_1}(y_1), \ldots, r_{y_n}(y_n), \theta)$$

SLIDE 14

The Variational Lower Bound

  • The (negative) variational free energy (β„±(π‘Ÿ[ 𝑦 , πœ„)) or the evidence lower bound

(ELBO): the expected energy under π‘Ÿ[(𝑦) minus the entropy of π‘Ÿ[(𝑦) β„± π‘Ÿ[ 𝑦 , πœ„ = M

6GH I

/ 𝑒𝑦6π‘Ÿ[>(𝑦6) ln π‘ž 𝑦6, 𝑧6 πœ„ π‘Ÿ[>(𝑦6) = M

6

/ 𝑒𝑦6π‘Ÿ[>(𝑦6) ln π‘ž 𝑧6 πœ„ + M

6

/ 𝑒𝑦6π‘Ÿ[>(𝑦6) ln π‘ž(𝑦6|𝑧6, πœ„) π‘Ÿ[>(𝑦6) M

6

ln π‘ž 𝑧6 πœ„ βˆ’ M

6

/ 𝑒𝑦6π‘Ÿ[>(𝑦6) ln π‘Ÿ[>(𝑦6) π‘ž(𝑦6|𝑧6, πœ„) M

6

ln π‘ž 𝑧6 πœ„ βˆ’ 𝐸hi[π‘Ÿ[>(𝑦6) βˆ₯ π‘ž(𝑦6|𝑧6, πœ„)]


This is the KL divergence that we need for VI.
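To make the decomposition concrete, the following sketch (my own toy example, not from the lecture) estimates the bound by Monte Carlo for a one-dimensional conjugate Gaussian model, where the exact log evidence is available for comparison; the bound is tight exactly when $r(y)$ is the true posterior.

```python
# Model: y ~ N(0, 1), z | y ~ N(y, 1), so the exact evidence is z ~ N(0, 2)
# and the exact posterior is y | z ~ N(z/2, 1/2).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
z = 1.3                                        # a single observation

log_evidence = norm.logpdf(z, 0.0, np.sqrt(2.0))

def elbo(m, s, n_samples=200_000):
    """Monte Carlo estimate of F(r, theta) with r(y) = N(m, s^2)."""
    y = rng.normal(m, s, n_samples)            # samples y ~ r(y)
    log_joint = norm.logpdf(z, y, 1.0) + norm.logpdf(y, 0.0, 1.0)
    return np.mean(log_joint - norm.logpdf(y, m, s))

print(log_evidence)                            # about -1.69
print(elbo(0.0, 1.0))                          # a loose lower bound
print(elbo(z / 2, np.sqrt(0.5)))               # tight: r(y) is the true posterior
```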

SLIDE 15

ELBO = Evidence Lower BOund

ln π‘ž 𝑧|πœ„ = β„’ πœ„ + 𝐸hi(π‘Ÿ(𝑦) βˆ₯ π‘ž(𝑦|𝑧, πœ„)) Evidence π‘ž 𝑦 𝑧, πœ„ = π‘ž 𝑧 𝑦, πœ„ π‘ž(𝑦|πœ„) π‘ž(𝑧|πœ„) = π‘ž 𝑧 𝑦, πœ„ π‘ž(𝑦|πœ„) ∫ π‘ž 𝑧 𝑦, πœ„ π‘ž(𝑦|πœ„) 𝑒𝑦 = LikelihoodΓ—Prior Evidence Evidence of the probabilistic model shows the total probability of observing the data. Lower Bound:

𝐸hi β‰₯ 0 β†’ ln π‘ž(𝑧|πœ„) β‰₯ β„’(πœ„)


SLIDE 16

Kullback-Leibler Divergence

  • Properties
  • 𝐸hi(π‘ž||π‘Ÿ) = 0 if and only if (iff) π‘ž = π‘Ÿ

(they may be different on sets of probability zero)

  • 𝐸hi(π‘ž||π‘Ÿ) β‰  𝐸hi(π‘Ÿ||π‘ž)
  • 𝐸hi(π‘ž||π‘Ÿ) β‰₯ 0

βˆ’πΈhi π‘Ÿ βˆ₯ π‘ž = 𝔽] βˆ’ log π‘Ÿ π‘ž = 𝔽] log π‘ž π‘Ÿ ≀ log 𝔽](π‘ž π‘Ÿ) = log / π‘Ÿ 𝑦 π‘ž 𝑦 π‘Ÿ 𝑦 𝑒𝑦 = log / π‘ž 𝑦 𝑒𝑦 = 0


[Figure] Blue: mixture of Gaussians q(y) (fixed). Green: (unimodal) Gaussian r that minimises KL(r βˆ₯ q). Red: (unimodal) Gaussian r that minimises KL(q βˆ₯ r).
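The asymmetry behind the two different fits in the figure is easy to verify numerically. A small sketch (assumptions mine: a bimodal mixture q and a single broad Gaussian r, discretized on a grid):

```python
import numpy as np
from scipy.stats import norm

y = np.linspace(-10, 10, 4001)
dy = y[1] - y[0]
q = 0.5 * norm.pdf(y, -2, 0.7) + 0.5 * norm.pdf(y, 2, 0.7)   # mixture (blue)
r = norm.pdf(y, 0, 2.2)                                       # unimodal Gaussian

def kl(p1, p2):
    """Grid approximation of KL(p1 || p2) = int p1 log(p1/p2)."""
    return np.sum(p1 * (np.log(p1 + 1e-300) - np.log(p2 + 1e-300))) * dy

print(kl(q, r), kl(r, q))    # two different values: KL(q||r) != KL(r||q)
```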

SLIDE 17

Variational Inference

  • Optimization problem with intractable posterior distribution:

π‘Ÿβˆ— = argmin

]([)βˆˆπ’­

𝐸hi(π‘Ÿ(𝑦) βˆ₯ π‘ž(𝑦|𝑧, πœ„))

[Figure: the intractable posterior Q(y | z, ΞΈ) alongside a candidate approximation r(y).]

SLIDE 18

Variational Inference

[Figure: the posterior Q(y | z, ΞΈ) with the optimized r*(y) close to it.]

  • VB practical success: point estimates and prediction; fast, streaming, distributed (3.6M Wikipedia, 350K Nature) [Broderick, Boyd, Wibisono, Wilson, Jordan 2013]
  • Optimization problem with intractable posterior distribution (as on Slide 17):

$$r^{*} = \underset{r(y) \in \mathcal{R}}{\operatorname{argmin}}\; \mathrm{KL}(r(y) \parallel q(y \mid z, \theta))$$

SLIDE 19

Variational Inference

β„’ πœ„ = / 𝑒𝑦 π‘Ÿ(𝑦) log π‘ž 𝑦, 𝑧 πœ„ π‘Ÿ[(𝑦) β†’ max

]([)βˆˆπ’­[βˆ’πΈhi π‘Ÿ 𝑦 βˆ₯ π‘ž 𝑦 𝑧, πœ„

+ log π‘ž(𝑧|πœ„)] ≀ log π‘ž(𝒠)


SLIDE 20

Variational Inference

β„’ πœ„ = / 𝑒𝑦 π‘Ÿ(𝑦) log π‘ž 𝑦, 𝑧 πœ„ π‘Ÿ[(𝑦) = / 𝑒𝑦 π‘Ÿ(𝑦) log π‘ž 𝑧 𝑦, πœ„ π‘ž(𝑦|πœ„) π‘Ÿ(𝑦) = / π‘Ÿ(𝑦) log π‘ž 𝑧 𝑦, πœ„ 𝑒𝑦 + / π‘Ÿ(𝑦) log π‘ž(𝑦|πœ„) π‘Ÿ(𝑦) 𝑒𝑦 𝔽]([) log π‘ž 𝑧 𝑦, πœ„ βˆ’ 𝐸hi(π‘Ÿ(𝑦) βˆ₯ π‘ž(𝑦|πœ„)) The first term encourages maximum likelihood estimation. The second term encourages converging to the prior. Optimizing β„’ πœ„ will lead to generalization of Bayes theorem.


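For the toy Gaussian model used earlier (y ~ N(0, 1), z | y ~ N(y, 1)), both terms are available in closed form. A sketch under those assumptions, with r(y) = N(m, sΒ²) and parameter names of my own choosing:

```python
import numpy as np

def data_term(m, s, z):
    """E_r[log q(z | y, theta)] for the unit-variance Gaussian likelihood."""
    return -0.5 * np.log(2 * np.pi) - 0.5 * ((z - m) ** 2 + s ** 2)

def regularizer(m, s):
    """KL(N(m, s^2) || N(0, 1)): the penalty for straying from the prior."""
    return 0.5 * (m ** 2 + s ** 2 - 1.0) - np.log(s)

def elbo(m, s, z):
    return data_term(m, s, z) - regularizer(m, s)

z = 1.3
print(elbo(z / 2, np.sqrt(0.5), z))   # maximal at the exact posterior, = log evidence
```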

SLIDE 21

Variational Inference

β„’ πœ„ = / π‘’π‘¦π‘Ÿ[(𝑦) log π‘ž 𝑦, 𝑧 πœ„ π‘Ÿ[(𝑦) β†’ max

]^([)βˆˆπ’­

University of Waterloo

21

How to perform an

  • ptimization w.r.t. a

distribution?

The Mean Field Method We assume the posterior is a fully factorized approximation and 𝑦6β€ž are independent π‘Ÿ[> = F

β€žGH |[>|

π‘Ÿ[>…(𝑦6β€ž) Parametric Approximation Parametric family π‘Ÿ[> 𝑦6 = π‘Ÿ[> 𝑦6|πœ‡ e.g. π‘Ÿ(𝑦6)~π’ͺ(𝑦6; 𝜈6, 𝜏6

U)

SLIDE 22

The Mean Field Approximation

β„± π‘Ÿ[ 𝑦 , πœ„ = M

6

/ 𝑒𝑦6π‘Ÿ[>(𝑦6) ln π‘ž(𝑦6, 𝑧6|πœ„) π‘Ÿ[>(𝑦6) = M

6

/ 𝑒𝑦6 F

β€žGH |[>|

π‘Ÿ[>…(𝑦6β€ž) ln π‘ž 𝑦6, 𝑧6 πœ„ βˆ’ F

β€žGH |[>|

π‘Ÿ[>…(𝑦6β€ž) ln F

β€žGH |[>|

π‘Ÿ[>…(𝑦6β€ž) = M

6

/ 𝑒𝑦6 F

β€žGH |[>|

π‘Ÿ[>…(𝑦6β€ž) ln π‘ž 𝑦6, 𝑧6 πœ„ βˆ’ M

β€žGH |[>|

π‘Ÿ[>…(𝑦6β€ž)lnπ‘Ÿ[>…(𝑦6β€ž) = M

6

/ 𝑒𝑦6β€ž π‘Ÿ[>… / ln π‘ž 𝑦6, 𝑧6 πœ„ F

β€ΉΕ’β€ž |[>|

π‘Ÿ[>β€’(𝑦6β€Ή) 𝑒𝑦6β€Ή βˆ’ ln π‘Ÿ[>… + const

University of Waterloo

22

SLIDE 23

The Mean Field Approximation

β„± π‘Ÿ[ 𝑦 , πœ„ = M

6

/ 𝑒𝑦6π‘Ÿ[>(𝑦6) ln π‘ž(𝑦6, 𝑧6|πœ„) π‘Ÿ[>(𝑦6) = M

6

/ 𝑒𝑦6β€ž π‘Ÿ[>… / ln π‘ž 𝑦6, 𝑧6 πœ„ F

β€ΉΕ’β€ž |[>|

π‘Ÿ[>β€’(𝑦6β€Ή) 𝑒𝑦6β€Ή βˆ’ ln π‘Ÿ[>… + const = M

6

/ 𝑒𝑦6β€ž π‘Ÿ[>… π”½β€ΉΕ’β€ž ln π‘ž(𝑦6, 𝑧6|πœ„) βˆ’ M

6

/ 𝑒𝑦6β€ž π‘Ÿ[>… ln π‘Ÿ[>… + const π”½β€ΉΕ’β€ž[… ]: an expectation with respect to the π‘Ÿ distributions over all variables 𝑦6β€Ή for 𝑙 β‰  π‘˜

University of Waterloo

23

SLIDE 24

The Mean Field Approximation

  • The optimal solution after maximizing β„’ πœ„

ln π‘Ÿ[>…

βˆ—

= π”½β€ΉΕ’β€ž ln π‘ž(𝑦6, 𝑧6|πœ„) + const The log of the optimal solution for factor π‘Ÿ[>… is obtained by the log of the joint distribution over all hidden and observed variables and taking the expectation with respect to all of the other factors π‘Ÿ[>β€’ for 𝑙 β‰  π‘˜. π‘Ÿ[>…

βˆ—

= 1 𝒢6β€ž exp(π”½β€ΉΕ’β€ž ln π‘ž(𝑦6, 𝑧6|πœ„) )


SLIDE 25

Mean Field Variational Inference

Algorithm (coordinate ascent variational inference, CAVI):

Initialize $r_{y_i} = \prod_{j=1}^{|y_i|} r_{y_{ij}}(y_{ij})$

Iterate:
  • Update each factor $r_{y_{i1}}, \ldots, r_{y_{iK}}$:

$$r_{y_{ik}}^{*} = \frac{1}{\mathcal{Z}_{ik}} \exp\!\left(\mathbb{E}_{l \neq k}[\ln q(y_i, z_i \mid \theta)]\right)$$

  • Compute the ELBO $\mathcal{L}(\theta)$

Repeat until the ELBO converges.


Assumption: We can compute the update analytically
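As a worked instance where the updates are analytic, the sketch below runs CAVI for the classic Gaussian model with unknown mean and precision (the example of Bishop, PRML Section 10.1.3, not code from the lecture; variable names are mine). The factorization is r(ΞΌ, Ο„) = r(ΞΌ) r(Ο„) with r(ΞΌ) Gaussian and r(Ο„) Gamma.

```python
# CAVI for x_i ~ N(mu, 1/tau) with prior mu | tau ~ N(mu0, 1/(lam0*tau)),
# tau ~ Gamma(a0, b0). Factors: r(mu) = N(mu_n, 1/lam_n), r(tau) = Gamma(a_n, b_n).
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(2.0, 1.5, size=100)          # observed data
n, xbar = len(x), x.mean()

mu0, lam0, a0, b0 = 0.0, 1.0, 1.0, 1.0      # prior hyperparameters
a_n = a0 + (n + 1) / 2                      # fixed by the factorization
b_n = 1.0                                   # arbitrary initialization

for _ in range(50):                         # iterate the two updates
    e_tau = a_n / b_n                       # E[tau] under current r(tau)
    # Update r(mu) = exp(E_tau[ln q(x, mu, tau)]) / Z: a Gaussian
    mu_n = (lam0 * mu0 + n * xbar) / (lam0 + n)
    lam_n = (lam0 + n) * e_tau
    # Update r(tau) = exp(E_mu[ln q(x, mu, tau)]) / Z: a Gamma
    e_sq = np.sum((x - mu_n) ** 2) + n / lam_n       # E_mu[sum_i (x_i - mu)^2]
    e_sq0 = (mu_n - mu0) ** 2 + 1.0 / lam_n          # E_mu[(mu - mu0)^2]
    b_n = b0 + 0.5 * (e_sq + lam0 * e_sq0)

print(mu_n, a_n / b_n)   # posterior mean of mu ~ 2.0, E[tau] ~ 1 / 1.5**2
```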

SLIDE 26

Mean Field Variational Inference

  • Probabilistic model: π‘ž 𝑦6, 𝑧6 πœ„ = π‘ž 𝑧6 𝑦6, πœ„ π‘ž 𝑦6 πœ„ , 𝑦6 = 𝑦6H, … , 𝑦6β€”

When applicable? Conditional conjugacy of likelihood and prior on each 𝑦6β€žconditioned on all

  • ther 𝑦6β€Ή β€ΉΕ’β€ž:

π‘ž 𝑦6β€ž 𝑦6β€ΉΕ’β€ž ∈ 𝒝 𝛽 , π‘ž 𝑧6 𝑦6β€ž, 𝑦6β€ΉΕ’β€ž, πœ„ ∈ ℬ(𝑦6β€ž) β†’ π‘ž(𝑦6β€ž|𝑧6, 𝑦6β€ΉΕ’β€ž, πœ„) ∈ 𝒝(𝛽′)

How to check in practice? For each 𝑦6β€ž:

  • Fix all other 𝑦6β€Ή β€ΉΕ’β€ž (consider them as fixed or constant values)
  • Check whether π‘ž 𝑧6 𝑦6, πœ„ and π‘ž 𝑦6 πœ„ are conjugate w.r.t. 𝑦6β€ž


SLIDE 27

Parametric Approximation

Parametric family of variational distributions: $r_{y_i}(y_i) = r_{y_i}(y_i \mid \lambda)$, where $\lambda$ are the variational parameters.

Why is this a restriction? We choose a family of some fixed form:
  • It may be too simple and insufficient to model the data
  • If it is complex enough, then there is no guarantee we can train it well to fit the data

$$\mathcal{L}(\theta) = \int dy\, r_y(y \mid \lambda) \log \frac{q(y, z \mid \theta)}{r_y(y \mid \lambda)} \;\rightarrow\; \max_{\lambda}$$

If we are able to calculate derivatives of the ELBO with respect to $\lambda$, we can solve this problem using a numerical optimization solver.
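A hedged sketch of that route, reusing the earlier toy Gaussian model: parameterize r(y | Ξ») = N(m, sΒ²) with Ξ» = (m, log s) (my own parameterization) and hand the closed-form negative ELBO to a generic numerical optimizer.

```python
import numpy as np
from scipy.optimize import minimize

z = 1.3                                     # single observation, as before

def neg_elbo(params):
    m, log_s = params
    s2 = np.exp(2.0 * log_s)
    data_term = -0.5 * np.log(2 * np.pi) - 0.5 * ((z - m) ** 2 + s2)
    kl = 0.5 * (m ** 2 + s2 - 1.0) - log_s  # KL(N(m, s^2) || N(0, 1))
    return -(data_term - kl)

res = minimize(neg_elbo, x0=np.array([0.0, 0.0]))
m_opt, s_opt = res.x[0], np.exp(res.x[1])
print(m_opt, s_opt)   # -> z/2 = 0.65 and sqrt(0.5) ~ 0.707: the exact posterior
```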


SLIDE 28

Overlapping Communities

(Gopalan & Blei 2013)

  • A community detection algorithm that

discovers overlapping communities in massive real-world networks.

  • Nodes are allowed to participate in

multiple communities.

  • Many community detection algorithms run

in time squared in the number of nodes, which makes analyzing massive networks computationally intractable.

  • Here the community memberships will be

encoded as hidden random variables.


[Figure] 575,000 scientific articles on the arXiv preprint server. Each link denotes that an article cites or is cited by another article.