Sampling Methods, CMSC 691 UMBC



SLIDE 1

Approximate Inference: Sampling Methods

CMSC 691 UMBC

SLIDE 2

(Some) Learning Techniques

  • MAP/MLE: point estimation, basic EM
  • Variational inference: functional optimization
  • Sampling / Monte Carlo  ← today

SLIDE 3

Outline

Monte Carlo methods

Sampling techniques
  • Uniform sampling
  • Importance sampling
  • Rejection sampling
  • Metropolis-Hastings
  • Gibbs sampling

Example: Collapsed Gibbs Sampler for Topic Models

SLIDES 4-9

Two Problems for Sampling Methods to Solve

1. Generate samples x^(1), x^(2), …, x^(S) from p, where p(x) = p̃(x)/Z, x ∈ ℝ^N
2. Estimate the expectation of a function φ:

   Φ = ⟨φ(x)⟩_p = 𝔼_{x∼p}[φ(x)] = ∫ p(x) φ(x) dx

Q: Why might sampling from p(x) be hard?
A1: Can we evaluate the normalizer Z?
A2: Can we sample without enumerating the whole space? (Correct samples should fall where p is big.)

Example unnormalized density (ITILA, Fig 29.1): p̃(x) = exp(0.4(x − 0.4)² − 0.08x⁴)

If we could sample from p, the estimator

   Φ̂ = (1/S) Σ_s φ(x^(s))

would be unbiased (𝔼[Φ̂] = Φ) and consistent.
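As a concrete sketch of this estimator, suppose we can sample from p directly. The setup below is an illustrative assumption, not from the slides: p is a standard normal and φ(x) = x², so the true value is Φ = 𝔼[x²] = 1.

```python
# Monte Carlo estimation when we CAN sample from p directly.
# Toy assumption: p = N(0, 1), phi(x) = x^2, so the true Phi = 1.
import random

random.seed(0)

def mc_estimate(phi, sampler, S):
    # Phi_hat = (1/S) * sum_s phi(x^(s)): unbiased and consistent.
    return sum(phi(sampler()) for _ in range(S)) / S

phi_hat = mc_estimate(lambda x: x * x, lambda: random.gauss(0.0, 1.0), 50_000)
```

With S = 50,000 the estimate lands close to 1; the rest of the lecture is about what to do when drawing x ∼ p is not this easy.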

SLIDE 10

Outline

Monte Carlo methods

Sampling techniques
  • Uniform sampling
  • Importance sampling
  • Rejection sampling
  • Metropolis-Hastings
  • Gibbs sampling

Example: Collapsed Gibbs Sampler for Topic Models

SLIDES 11-15

Uniform Sampling

Goal: estimate Φ = ⟨φ(x)⟩_p = 𝔼_{x∼p}[φ(x)]

Sample uniformly: x^(1), x^(2), …, x^(S), then normalize with the empirical normalizer

   Z* = Σ_s p̃(x^(s)),   p*(x^(s)) = p̃(x^(s)) / Z*

   Φ̂ = Σ_s φ(x^(s)) p*(x^(s))

This might work if S (the number of samples) sufficiently hits the high-probability regions.

Ising model example:
  • 2^H states of high probability
  • 2^N states total
  • chance of a uniform sample landing in the high-probability region: 2^H / 2^N
  • minimum samples needed: ∼ 2^(N−H)
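A minimal sketch of uniform sampling on the one-dimensional ITILA density. The bounded window [−5, 5] is an assumption chosen so that essentially all of the mass is covered; in one dimension this is workable, which is exactly what the Ising example says fails in high dimensions.

```python
# Uniform sampling: draw x^(s) uniformly, then weight each sample by
# p*(x^(s)) = p_tilde(x^(s)) / Z*, where Z* = sum_s p_tilde(x^(s)).
import math
import random

random.seed(0)

def p_tilde(x):
    # unnormalized density from ITILA Fig 29.1
    return math.exp(0.4 * (x - 0.4) ** 2 - 0.08 * x ** 4)

def uniform_sampling_estimate(phi, lo, hi, S):
    xs = [random.uniform(lo, hi) for _ in range(S)]
    z_star = sum(p_tilde(x) for x in xs)          # empirical normalizer Z*
    return sum(phi(x) * p_tilde(x) / z_star for x in xs)

phi_hat = uniform_sampling_estimate(lambda x: x, -5.0, 5.0, 20_000)
```

Note the weights p̃(x^(s))/Z* sum to one by construction, so the estimator for the constant function φ(x) = 1 returns exactly 1.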
SLIDE 16

Outline

Monte Carlo methods

Sampling techniques
  • Uniform sampling
  • Importance sampling
  • Rejection sampling
  • Metropolis-Hastings
  • Gibbs sampling

Example: Collapsed Gibbs Sampler for Topic Models

SLIDES 17-23

Importance Sampling

Goal: estimate Φ = ⟨φ(x)⟩_p = 𝔼_{x∼p}[φ(x)]

Sample from a simpler approximating distribution Q(x) ∝ Q̃(x): x^(1), x^(2), …, x^(S)

(ITILA, Fig 29.5: Q(x) overlaid on p(x))

Where Q(x) > p(x), x is over-represented; where Q(x) < p(x), x is under-represented. Correct for this with importance weights:

   w(x^(s)) = p̃(x^(s)) / Q̃(x^(s))

   Φ̂ = Σ_s φ(x^(s)) w(x^(s)) / Σ_s w(x^(s))

Q: How reliable will this estimator be?
A: In practice, difficult to say; the empirical weights w(x^(s)) may not be a good indicator.

Q: How do you choose a good approximating distribution?
A: Task/domain specific.
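The self-normalized estimator can be sketched as follows. The Gaussian proposal with σ = 3 is an illustrative assumption, not from the slides; only unnormalized p̃ and Q̃ are needed.

```python
# Importance sampling: draw from Q, correct with weights w = p_tilde / Q_tilde.
import math
import random

random.seed(0)

def p_tilde(x):
    # unnormalized target (ITILA Fig 29.1)
    return math.exp(0.4 * (x - 0.4) ** 2 - 0.08 * x ** 4)

SIGMA = 3.0
def q_tilde(x):
    # unnormalized Gaussian proposal (illustrative choice)
    return math.exp(-x * x / (2.0 * SIGMA * SIGMA))

def importance_estimate(phi, S):
    xs = [random.gauss(0.0, SIGMA) for _ in range(S)]
    ws = [p_tilde(x) / q_tilde(x) for x in xs]
    # self-normalized: neither normalizer Z nor Q's needs to be known
    return sum(phi(x) * w for x, w in zip(xs, ws)) / sum(ws)

phi_hat = importance_estimate(lambda x: x * x, 50_000)
```

Because the estimator is self-normalized, φ(x) = 1 again yields exactly 1, regardless of how badly Q matches p.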

SLIDE 24

Importance Sampling: Variance (the estimator may vary)

(ITILA, Fig 29.6: importance-sampling estimates over iterations, plotted against the true value, for a Gaussian Q(x) and for a Cauchy Q(x).)

SLIDE 25

Outline

Monte Carlo methods

Sampling techniques
  • Uniform sampling
  • Importance sampling
  • Rejection sampling
  • Metropolis-Hastings
  • Gibbs sampling

Example: Collapsed Gibbs Sampler for Topic Models

SLIDES 26-34

Rejection Sampling

Goal: estimate Φ = ⟨φ(x)⟩_p = 𝔼_{x∼p}[φ(x)]

Approximating distribution Q(x) ∝ Q̃(x), with a constant c chosen so that c·Q̃(x) > p̃(x) everywhere. (ITILA, Fig 29.8: c·Q̃(x) enveloping p̃(x))

1. Sample from Q: x^(1), x^(2), …, x^(S*)
2. Sample heights uniformly: u^(s) ∼ Unif(0, c·Q̃(x^(s)))
3. Select tuples:
     • if u^(s) ≤ p̃(x^(s)): add x^(s) to the sample set
     • otherwise: reject it

This produces exact samples from the p-distribution, so over the S accepted points

   Φ̂ = (1/S) Σ_s φ(x^(s))

Q: How reliable will this estimator be?
A: Depends on how well Q approximates p.

Q: How do you choose a good approximating distribution?
A: Task/domain specific.

Rejection sampling can be difficult to use in high-dimensional spaces. ☹
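A sketch of the accept/reject loop on the same one-dimensional density. The Gaussian proposal and the envelope constant c = 5 are assumptions for illustration, c being chosen by inspection so that c·Q̃(x) ≥ p̃(x) over the region carrying the mass.

```python
# Rejection sampling: accept x ~ Q when a uniform height u falls under p_tilde(x).
import math
import random

random.seed(0)

def p_tilde(x):
    return math.exp(0.4 * (x - 0.4) ** 2 - 0.08 * x ** 4)

SIGMA = 3.0
C = 5.0   # assumed envelope constant: C * q_tilde(x) >= p_tilde(x)

def q_tilde(x):
    return math.exp(-x * x / (2.0 * SIGMA * SIGMA))

def rejection_sample(n):
    samples = []
    while len(samples) < n:
        x = random.gauss(0.0, SIGMA)              # sample from Q
        u = random.uniform(0.0, C * q_tilde(x))   # sample a height
        if u <= p_tilde(x):                       # height falls under p_tilde: accept
            samples.append(x)
    return samples

xs = rejection_sample(2_000)
phi_hat = sum(x * x for x in xs) / len(xs)        # plain average over accepted points
```

The accepted points are exact draws from p, so the plain average (no weights) is the right estimator; the price is the rejection rate, which grows with c.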

SLIDE 35

Outline

Monte Carlo methods

Sampling techniques
  • Uniform sampling
  • Importance sampling
  • Rejection sampling
  • Metropolis-Hastings
  • Gibbs sampling

Example: Collapsed Gibbs Sampler for Topic Models

SLIDE 36

Markov Chain Monte Carlo

Construct a Markov chain whose transition kernel leaves the target distribution p invariant; run the chain and treat its states as (dependent) samples.

SLIDES 37-41

Metropolis-Hastings

Goal: estimate Φ = ⟨φ(x)⟩_p = 𝔼_{x∼p}[φ(x)]

Importance and rejection sampling use a single proposal distribution Q(x) ∝ Q̃(x). Metropolis-Hastings (and Gibbs) instead create a proposal distribution based on the current state:

   Q(x′ | x^(t)) ∝ Q̃(x′ | x^(t))

Q does not need to look similar to p. (ITILA, Fig 29.10)

At each step t:
1. Sample a proposal x′ from the transition kernel/distribution Q(· | x^(t)).
2. Compute the acceptance ratio

      a = [p̃(x′) / p̃(x^(t))] · [Q̃(x^(t) | x′) / Q̃(x′ | x^(t))]

3. If a ≥ 1: accept x′; otherwise accept with probability a.
4. If accepted: x^(t+1) = x′; otherwise x^(t+1) = x^(t).

Then Φ̂ = (1/S) Σ_s φ(x^(s)).

Samples are not independent, but Metropolis-Hastings can be used effectively in high-dimensional spaces. ☺
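A sketch with a symmetric Gaussian random-walk proposal (the step size 1.0 is an illustrative assumption). Symmetry makes the Q̃ ratio cancel, so the acceptance ratio reduces to p̃(x′)/p̃(x^(t)).

```python
# Metropolis-Hastings with a symmetric random-walk proposal Q(x'|x) = N(x, step^2).
import math
import random

random.seed(0)

def p_tilde(x):
    return math.exp(0.4 * (x - 0.4) ** 2 - 0.08 * x ** 4)

def metropolis(S, step=1.0, x0=0.0):
    x, chain = x0, []
    for _ in range(S):
        x_prop = x + random.gauss(0.0, step)   # propose from Q(. | x)
        a = p_tilde(x_prop) / p_tilde(x)       # Q-ratio cancels for a symmetric proposal
        if a >= 1.0 or random.random() < a:
            x = x_prop                         # accept
        chain.append(x)                        # on rejection, x^(t+1) = x^(t)
    return chain

chain = metropolis(20_000)
```

Note the rejected steps still append the current state: the repeats are part of the chain, which is why the samples are not independent.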

SLIDE 42

Outline

Monte Carlo methods

Sampling techniques
  • Uniform sampling
  • Importance sampling
  • Rejection sampling
  • Metropolis-Hastings
  • Gibbs sampling

Example: Collapsed Gibbs Sampler for Topic Models

SLIDE 43

Gibbs Sampling

Goal: estimate Φ = ⟨φ(x)⟩_p = 𝔼_{x∼p}[φ(x)]

Transition kernel/distribution: Q(x′ | x^(t)) = p(x_j | all other variables)

Next sampled value of the current variable, given the values of all other variables (both new and old):

   x_j^(t+1) ∼ p(· | x_1^(t+1), …, x_{j−1}^(t+1), x_{j+1}^(t), …, x_N^(t))

SLIDE 44

Remember: Markov Blanket

The Markov blanket of a node x is its parents, children, and children's parents: the set of nodes needed to form the complete conditional for a variable x_j.

   p(x_j | x_{k≠j}) = p(x_1, …, x_N) / ∫ p(x_1, …, x_N) dx_j
                    = Π_i p(x_i | pa(x_i)) / ∫ Π_i p(x_i | pa(x_i)) dx_j      (factorization of graph)
                    = Π_{i: i=j or j∈pa(x_i)} p(x_i | pa(x_i)) / ∫ Π_{i: i=j or j∈pa(x_i)} p(x_i | pa(x_i)) dx_j
                      (factor out terms not dependent on x_j)

SLIDE 45

Gibbs Sampling

Goal: estimate Φ = ⟨φ(x)⟩_p = 𝔼_{x∼p}[φ(x)]

Transition kernel/distribution: Q(x′ | x^(t)) = p(x_j | Markov blanket variables)

Next sampled value of the current variable, given just the Markov blanket variables (both new and old):

   x_5^(t+1) ∼ p(· | x_2^(t+1), x_3^(t), x_4^(t+1), x_6^(t+1), x_7^(t), x_8^(t+1))

(graph over x_2, …, x_8, with x_5 the current variable)

SLIDE 46

Gibbs Sampling

Goal: estimate Φ = ⟨φ(x)⟩_p = 𝔼_{x∼p}[φ(x)]

Sample (always accept) from the transition kernel Q(x′ | x^(t)) = p(x_j | MB(x^(t))), where MB is the Markov blanket; set x^(t+1) = x′.

   Φ̂ = (1/S) Σ_s φ(x^(s))

Samples are not independent, but Gibbs sampling can be used effectively in high-dimensional spaces. ☺
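Gibbs sampling needs the complete conditionals in closed form. A classic toy case (an illustrative assumption, not from the slides) is a bivariate standard normal with correlation ρ, whose conditionals are themselves Gaussian, so every "proposal" is accepted by construction.

```python
# Gibbs sampling for a bivariate standard normal with correlation RHO:
# each update samples one coordinate exactly from its complete conditional.
import random

random.seed(0)
RHO = 0.8

def gibbs(S, x0=0.0, y0=0.0):
    x, y, pairs = x0, y0, []
    for _ in range(S):
        x = random.gauss(RHO * y, (1.0 - RHO ** 2) ** 0.5)  # x | y ~ N(rho*y, 1-rho^2)
        y = random.gauss(RHO * x, (1.0 - RHO ** 2) ** 0.5)  # y | x uses the NEW x
        pairs.append((x, y))
    return pairs

pairs = gibbs(20_000)
mean_x = sum(x for x, _ in pairs) / len(pairs)
corr = sum(x * y for x, y in pairs) / len(pairs)  # ~ RHO, since both marginals are N(0, 1)
```

The y-update conditioning on the new x is exactly the "both new and old" pattern from the slide.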

SLIDES 47-48

Collapsed Gibbs Sampling (CGS)

Goal: estimate Φ = ⟨φ(x)⟩_p = 𝔼_{x∼p}[φ(x)]

Integrate out some of the Markov blanket (the collapsed variables z):

   Q(x′ | x^(t)) = ∫ p(x_j | MB(x^(t))) dz = p(x_j | MB−z(x^(t)))

Sample (always accept) from Q(x′ | x^(t)); set x^(t+1) = x′.

   Φ̂ = (1/S) Σ_s φ(x^(s))

Samples are not independent, but collapsed Gibbs can be used effectively in high-dimensional spaces. ☺

Warning: collapsing changes the Markov blanket.

SLIDES 49-51

Collapsed Gibbs Sampling

Transition kernel/distribution: Q(x′ | x^(t)) = p(x_j | select variables in MB)

(graph over x_2, …, x_9; let's integrate out x_4)

Next sampled value of the current variable, given some of the Markov blanket variables (both new and old):

   x_5^(t+1) ∼ p(· | x_2^(t+1), x_3^(t), x_9^(t+1), x_6^(t+1), x_7^(t), x_8^(t+1))

SLIDES 52-53

What Are the Trade-offs of CGS?

Benefits
  • Collapsing removes variables from the model via integration/marginalization
      – Depending on which variables are marginalized out, this can be a drastic reduction
      – The priors/hyperparameters of the collapsed variables still impact the result
  • The "steps" are less incremental

Drawbacks
  • Collapsing removes conditional independences in the model
  • The math may not be easy
  • You may be restricted by conjugacy/other statistical properties
  • You still have the drawbacks of sampling
SLIDE 54

Outline

Monte Carlo methods

Sampling techniques
  • Uniform sampling
  • Importance sampling
  • Rejection sampling
  • Metropolis-Hastings
  • Gibbs sampling

Example: Collapsed Gibbs Sampler for Topic Models

SLIDE 55

Latent Dirichlet Allocation (Blei et al., 2003)

Per-document (latent) topic usage θ_d, per-document (unigram) word counts, per-topic word usage ψ_k.

SLIDE 56

Gibbs Sampler for LDirA

for each document d:
    resample θd | zd,1, …, zd,Nd
    for each token i in d:
        resample zd,i | wd,i, {ψk}, θd
for each topic k:
    resample ψk

SLIDE 57

Latent Dirichlet Allocation (Blei et al., 2003)

Per-document (latent) topic usage θ_d, per-document (unigram) word counts, per-topic word usage ψ_k. Integrate these out.

SLIDES 58-59

Collapsed Gibbs Sampler for LDirA

for each document d:
    resample θd | zd,1, …, zd,Nd
    for each token i in d:
        resample zd,i | wd,i, {ψk}, {z*,−i}
for each topic k:
    resample ψk

Collapsed Gibbs sampling goal:

   p(zd,i | z*,−i) = p(z*,*) / p(z*,−i)

SLIDES 60-68

Sampling: Discrete Observations

Griffiths and Steyvers (PNAS, 2004)

Collapsed Gibbs sampling goal: p(zd,i | z*,−i) = p(z*,*) / p(z*,−i)

Gamma function fact: Γ(x + 1) = x·Γ(x); maintain count tables.

Writing c(d, k) for the document-topic counts (including the current assignment) and α_k for the Dirichlet hyperparameters, the document-topic part of the ratio of Dirichlet-multinomial marginals is

   p(zd,i | z*,−i) =   [Γ(Σ_k α_k) / Γ(Σ_k c(d,k) + α_k)] · Π_k [Γ(c(d,k) + α_k) / Γ(α_k)]
                     ÷ [Γ(Σ_k α_k) / Γ(Σ_k c(d,k) − 1 + α_k)] · Π_k [Γ(c(d,k) − 1 + α_k) / Γ(α_k)]

Applying Γ(x + 1) = x·Γ(x) and cancelling the matching Γ terms leaves

   p(zd,i = k | z*,−i) ∝ c(d, k) − 1 + α_k

SLIDE 69

Collapsed Gibbs Sampler for LDirA

for each document d:
    for each token i in d:
        resample zd,i | wd,i, {ψk}, {z*,−i}

   p(zd,i = k | z*,−i) ∝ (c(d, k) − 1 + α_k) · (topic-word counts term)

SLIDE 70

Collapsed Gibbs Sampler for LDirA

randomly assign z*,*
maintain count tables:
    c(d, k): document-topic counts
    c(k, v): topic-word counts
for each document d:
    for each token i in d:
        unassign topic zd,i            (decrease counts)
        resample zd,i | wd,i, {ψk}, {z*,−i}
        reassign topic zd,i            (increase counts)
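The count-table loop above can be sketched end-to-end on a tiny made-up corpus. The corpus, K = 2 topics, and the symmetric hyperparameters are all illustrative assumptions; this sketch collapses both θ and ψ and uses the standard Dirichlet-multinomial complete conditional.

```python
# Collapsed Gibbs sweeps for a toy LDA-style model, maintaining both count tables.
import random

random.seed(0)
K, V, ALPHA, BETA = 2, 4, 0.5, 0.5
docs = [[0, 0, 1, 1], [2, 2, 3, 3], [0, 1, 2, 3]]   # word ids per document (toy corpus)

# randomly assign z_{*,*} and build the count tables
z = [[random.randrange(K) for _ in doc] for doc in docs]
cdk = [[0] * K for _ in docs]        # c(d, k): document-topic counts
ckv = [[0] * V for _ in range(K)]    # c(k, v): topic-word counts
ck = [0] * K                         # tokens per topic
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        k = z[d][i]
        cdk[d][k] += 1; ckv[k][w] += 1; ck[k] += 1

for sweep in range(200):
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            cdk[d][k] -= 1; ckv[k][w] -= 1; ck[k] -= 1   # unassign: decrease counts
            # p(z = k | z_{*,-i}) proportional to
            #   (c(d,k)+alpha) * (c(k,w)+beta) / (c(k,.)+V*beta)
            weights = [(cdk[d][j] + ALPHA) * (ckv[j][w] + BETA) / (ck[j] + V * BETA)
                       for j in range(K)]
            k = random.choices(range(K), weights=weights)[0]
            z[d][i] = k
            cdk[d][k] += 1; ckv[k][w] += 1; ck[k] += 1   # reassign: increase counts
```

Because every update decrements then increments the tables, the counts stay consistent with the assignments throughout: each document's row of c(d, k) always sums to its length.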

SLIDE 71

Outline

Monte Carlo methods

Sampling techniques
  • Uniform sampling
  • Importance sampling
  • Rejection sampling
  • Metropolis-Hastings
  • Gibbs sampling

Example: Collapsed Gibbs Sampler for Topic Models