Sampling Methods, CMSC 691 UMBC



SLIDE 1

Approximate Inference: Sampling Methods

CMSC 691 UMBC

SLIDE 2

(Some) Learning Techniques

  • MAP/MLE: point estimation, basic EM
  • Variational inference: functional optimization
  • Sampling / Monte Carlo  ← today

SLIDE 3

Outline

Monte Carlo methods

Sampling techniques
  • Uniform sampling
  • Importance sampling
  • Rejection sampling
  • Metropolis-Hastings
  • Gibbs sampling

Example: Collapsed Gibbs Sampler for Topic Models

SLIDES 4-9

Two Problems for Sampling Methods to Solve

1. Generate samples x^(1), x^(2), …, x^(S) from p, where p(x) = p̃(x)/Z, x ∈ ℝ^N
2. Estimate the expectation of a function φ:

   Φ = ⟨φ(x)⟩_p = 𝔼_{x∼p}[φ(x)] = ∫ p(x) φ(x) dx

Q: Why might sampling from p(x) be hard?
A1: Can we evaluate the normalizer Z?
A2: Can we sample without enumerating the whole space? (Correct samples should fall where p is big.)

Example unnormalized density (ITILA, Fig 29.1): p̃(x) = exp(0.4(x − 0.4)² − 0.08x⁴)

If we could sample from p, the estimator

   Φ̂ = (1/S) Σ_s φ(x^(s))

would be unbiased (𝔼[Φ̂] = Φ) and consistent.
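As a concrete sketch of this estimator, suppose we can sample from p directly. The setup below is an illustrative assumption, not from the slides: p is a standard normal and φ(x) = x², so the true value is Φ = 𝔼[x²] = 1.

```python
# Monte Carlo estimation when we CAN sample from p directly.
# Toy assumption: p = N(0, 1), phi(x) = x^2, so the true Phi = 1.
import random

random.seed(0)

def mc_estimate(phi, sampler, S):
    # Phi_hat = (1/S) * sum_s phi(x^(s)): unbiased and consistent.
    return sum(phi(sampler()) for _ in range(S)) / S

phi_hat = mc_estimate(lambda x: x * x, lambda: random.gauss(0.0, 1.0), 50_000)
```

With S = 50,000 the estimate lands close to 1; the rest of the lecture is about what to do when drawing x ∼ p is not this easy.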

SLIDE 10

Outline

Monte Carlo methods

Sampling techniques
  • Uniform sampling
  • Importance sampling
  • Rejection sampling
  • Metropolis-Hastings
  • Gibbs sampling

Example: Collapsed Gibbs Sampler for Topic Models

SLIDES 11-15

Uniform Sampling

Goal: estimate Φ = ⟨φ(x)⟩_p = 𝔼_{x∼p}[φ(x)]

Sample uniformly: x^(1), x^(2), …, x^(S), then normalize with the empirical normalizer

   Z* = Σ_s p̃(x^(s)),   p*(x^(s)) = p̃(x^(s)) / Z*

   Φ̂ = Σ_s φ(x^(s)) p*(x^(s))

This might work if S (the number of samples) sufficiently hits the high-probability regions.

Ising model example:
  • 2^H states of high probability
  • 2^N states total
  • chance of a uniform sample landing in the high-probability region: 2^H / 2^N
  • minimum samples needed: ∼ 2^(N−H)
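A minimal sketch of uniform sampling on the one-dimensional ITILA density. The bounded window [−5, 5] is an assumption chosen so that essentially all of the mass is covered; in one dimension this is workable, which is exactly what the Ising example says fails in high dimensions.

```python
# Uniform sampling: draw x^(s) uniformly, then weight each sample by
# p*(x^(s)) = p_tilde(x^(s)) / Z*, where Z* = sum_s p_tilde(x^(s)).
import math
import random

random.seed(0)

def p_tilde(x):
    # unnormalized density from ITILA Fig 29.1
    return math.exp(0.4 * (x - 0.4) ** 2 - 0.08 * x ** 4)

def uniform_sampling_estimate(phi, lo, hi, S):
    xs = [random.uniform(lo, hi) for _ in range(S)]
    z_star = sum(p_tilde(x) for x in xs)          # empirical normalizer Z*
    return sum(phi(x) * p_tilde(x) / z_star for x in xs)

phi_hat = uniform_sampling_estimate(lambda x: x, -5.0, 5.0, 20_000)
```

Note the weights p̃(x^(s))/Z* sum to one by construction, so the estimator for the constant function φ(x) = 1 returns exactly 1.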
SLIDE 16

Outline

Monte Carlo methods

Sampling techniques
  • Uniform sampling
  • Importance sampling
  • Rejection sampling
  • Metropolis-Hastings
  • Gibbs sampling

Example: Collapsed Gibbs Sampler for Topic Models

SLIDES 17-23

Importance Sampling

Goal: estimate Φ = ⟨φ(x)⟩_p = 𝔼_{x∼p}[φ(x)]

Sample from a simpler approximating distribution Q(x) ∝ Q̃(x): x^(1), x^(2), …, x^(S)

(ITILA, Fig 29.5: Q(x) overlaid on p(x))

Where Q(x) > p(x), x is over-represented; where Q(x) < p(x), x is under-represented. Correct for this with importance weights:

   w(x^(s)) = p̃(x^(s)) / Q̃(x^(s))

   Φ̂ = Σ_s φ(x^(s)) w(x^(s)) / Σ_s w(x^(s))

Q: How reliable will this estimator be?
A: In practice, difficult to say; the empirical weights w(x^(s)) may not be a good indicator.

Q: How do you choose a good approximating distribution?
A: Task/domain specific.
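The self-normalized estimator can be sketched as follows. The Gaussian proposal with σ = 3 is an illustrative assumption, not from the slides; only unnormalized p̃ and Q̃ are needed.

```python
# Importance sampling: draw from Q, correct with weights w = p_tilde / Q_tilde.
import math
import random

random.seed(0)

def p_tilde(x):
    # unnormalized target (ITILA Fig 29.1)
    return math.exp(0.4 * (x - 0.4) ** 2 - 0.08 * x ** 4)

SIGMA = 3.0
def q_tilde(x):
    # unnormalized Gaussian proposal (illustrative choice)
    return math.exp(-x * x / (2.0 * SIGMA * SIGMA))

def importance_estimate(phi, S):
    xs = [random.gauss(0.0, SIGMA) for _ in range(S)]
    ws = [p_tilde(x) / q_tilde(x) for x in xs]
    # self-normalized: neither normalizer Z nor Q's needs to be known
    return sum(phi(x) * w for x, w in zip(xs, ws)) / sum(ws)

phi_hat = importance_estimate(lambda x: x * x, 50_000)
```

Because the estimator is self-normalized, φ(x) = 1 again yields exactly 1, regardless of how badly Q matches p.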

SLIDE 24

Importance Sampling: Variance (the estimator may vary)

(ITILA, Fig 29.6: importance-sampling estimates over iterations, plotted against the true value, for a Gaussian Q(x) and for a Cauchy Q(x).)

SLIDE 25

Outline

Monte Carlo methods

Sampling techniques
  • Uniform sampling
  • Importance sampling
  • Rejection sampling
  • Metropolis-Hastings
  • Gibbs sampling

Example: Collapsed Gibbs Sampler for Topic Models

SLIDES 26-34

Rejection Sampling

Goal: estimate Φ = ⟨φ(x)⟩_p = 𝔼_{x∼p}[φ(x)]

Approximating distribution Q(x) ∝ Q̃(x), with a constant c chosen so that c·Q̃(x) > p̃(x) everywhere. (ITILA, Fig 29.8: c·Q̃(x) enveloping p̃(x))

1. Sample from Q: x^(1), x^(2), …, x^(S*)
2. Sample heights uniformly: u^(s) ∼ Unif(0, c·Q̃(x^(s)))
3. Select tuples:
     • if u^(s) ≤ p̃(x^(s)): add x^(s) to the sample set
     • otherwise: reject it

This produces exact samples from the p-distribution, so over the S accepted points

   Φ̂ = (1/S) Σ_s φ(x^(s))

Q: How reliable will this estimator be?
A: Depends on how well Q approximates p.

Q: How do you choose a good approximating distribution?
A: Task/domain specific.

Rejection sampling can be difficult to use in high-dimensional spaces. ☹
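A sketch of the accept/reject loop on the same one-dimensional density. The Gaussian proposal and the envelope constant c = 5 are assumptions for illustration, c being chosen by inspection so that c·Q̃(x) ≥ p̃(x) over the region carrying the mass.

```python
# Rejection sampling: accept x ~ Q when a uniform height u falls under p_tilde(x).
import math
import random

random.seed(0)

def p_tilde(x):
    return math.exp(0.4 * (x - 0.4) ** 2 - 0.08 * x ** 4)

SIGMA = 3.0
C = 5.0   # assumed envelope constant: C * q_tilde(x) >= p_tilde(x)

def q_tilde(x):
    return math.exp(-x * x / (2.0 * SIGMA * SIGMA))

def rejection_sample(n):
    samples = []
    while len(samples) < n:
        x = random.gauss(0.0, SIGMA)              # sample from Q
        u = random.uniform(0.0, C * q_tilde(x))   # sample a height
        if u <= p_tilde(x):                       # height falls under p_tilde: accept
            samples.append(x)
    return samples

xs = rejection_sample(2_000)
phi_hat = sum(x * x for x in xs) / len(xs)        # plain average over accepted points
```

The accepted points are exact draws from p, so the plain average (no weights) is the right estimator; the price is the rejection rate, which grows with c.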

SLIDE 35

Outline

Monte Carlo methods

Sampling techniques
  • Uniform sampling
  • Importance sampling
  • Rejection sampling
  • Metropolis-Hastings
  • Gibbs sampling

Example: Collapsed Gibbs Sampler for Topic Models

SLIDE 36

Markov Chain Monte Carlo

Construct a Markov chain whose transition kernel leaves the target distribution p invariant; run the chain and treat its states as (dependent) samples.

SLIDES 37-41

Metropolis-Hastings

Goal: estimate Φ = ⟨φ(x)⟩_p = 𝔼_{x∼p}[φ(x)]

Importance and rejection sampling use a single proposal distribution Q(x) ∝ Q̃(x). Metropolis-Hastings (and Gibbs) instead create a proposal distribution based on the current state:

   Q(x′ | x^(t)) ∝ Q̃(x′ | x^(t))

Q does not need to look similar to p. (ITILA, Fig 29.10)

At each step t:
1. Sample a proposal x′ from the transition kernel/distribution Q(· | x^(t)).
2. Compute the acceptance ratio

      a = [p̃(x′) / p̃(x^(t))] · [Q̃(x^(t) | x′) / Q̃(x′ | x^(t))]

3. If a ≥ 1: accept x′; otherwise accept with probability a.
4. If accepted: x^(t+1) = x′; otherwise x^(t+1) = x^(t).

Then Φ̂ = (1/S) Σ_s φ(x^(s)).

Samples are not independent, but Metropolis-Hastings can be used effectively in high-dimensional spaces. ☺
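A sketch with a symmetric Gaussian random-walk proposal (the step size 1.0 is an illustrative assumption). Symmetry makes the Q̃ ratio cancel, so the acceptance ratio reduces to p̃(x′)/p̃(x^(t)).

```python
# Metropolis-Hastings with a symmetric random-walk proposal Q(x'|x) = N(x, step^2).
import math
import random

random.seed(0)

def p_tilde(x):
    return math.exp(0.4 * (x - 0.4) ** 2 - 0.08 * x ** 4)

def metropolis(S, step=1.0, x0=0.0):
    x, chain = x0, []
    for _ in range(S):
        x_prop = x + random.gauss(0.0, step)   # propose from Q(. | x)
        a = p_tilde(x_prop) / p_tilde(x)       # Q-ratio cancels for a symmetric proposal
        if a >= 1.0 or random.random() < a:
            x = x_prop                         # accept
        chain.append(x)                        # on rejection, x^(t+1) = x^(t)
    return chain

chain = metropolis(20_000)
```

Note the rejected steps still append the current state: the repeats are part of the chain, which is why the samples are not independent.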

SLIDE 42

Outline

Monte Carlo methods

Sampling techniques
  • Uniform sampling
  • Importance sampling
  • Rejection sampling
  • Metropolis-Hastings
  • Gibbs sampling

Example: Collapsed Gibbs Sampler for Topic Models

SLIDE 43

Gibbs Sampling

Goal: estimate Φ = ⟨φ(x)⟩_p = 𝔼_{x∼p}[φ(x)]

Transition kernel/distribution: Q(x′ | x^(t)) = p(x_j | all other variables)

Next sampled value of the current variable, given the values of all other variables (both new and old):

   x_j^(t+1) ∼ p(· | x_1^(t+1), …, x_{j−1}^(t+1), x_{j+1}^(t), …, x_N^(t))

SLIDE 44

Remember: Markov Blanket

The Markov blanket of a node x is its parents, children, and children's parents: the set of nodes needed to form the complete conditional for a variable x_j.

   p(x_j | x_{k≠j}) = p(x_1, …, x_N) / ∫ p(x_1, …, x_N) dx_j
                    = Π_i p(x_i | pa(x_i)) / ∫ Π_i p(x_i | pa(x_i)) dx_j      (factorization of graph)
                    = Π_{i: i=j or j∈pa(x_i)} p(x_i | pa(x_i)) / ∫ Π_{i: i=j or j∈pa(x_i)} p(x_i | pa(x_i)) dx_j
                      (factor out terms not dependent on x_j)

SLIDE 45

Gibbs Sampling

Goal: estimate Φ = ⟨φ(x)⟩_p = 𝔼_{x∼p}[φ(x)]

Transition kernel/distribution: Q(x′ | x^(t)) = p(x_j | Markov blanket variables)

Next sampled value of the current variable, given just the Markov blanket variables (both new and old):

   x_5^(t+1) ∼ p(· | x_2^(t+1), x_3^(t), x_4^(t+1), x_6^(t+1), x_7^(t), x_8^(t+1))

(graph over x_2, …, x_8, with x_5 the current variable)

SLIDE 46

Gibbs Sampling

Goal: estimate Φ = ⟨φ(x)⟩_p = 𝔼_{x∼p}[φ(x)]

Sample (always accept) from the transition kernel Q(x′ | x^(t)) = p(x_j | MB(x^(t))), where MB is the Markov blanket; set x^(t+1) = x′.

   Φ̂ = (1/S) Σ_s φ(x^(s))

Samples are not independent, but Gibbs sampling can be used effectively in high-dimensional spaces. ☺
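Gibbs sampling needs the complete conditionals in closed form. A classic toy case (an illustrative assumption, not from the slides) is a bivariate standard normal with correlation ρ, whose conditionals are themselves Gaussian, so every "proposal" is accepted by construction.

```python
# Gibbs sampling for a bivariate standard normal with correlation RHO:
# each update samples one coordinate exactly from its complete conditional.
import random

random.seed(0)
RHO = 0.8

def gibbs(S, x0=0.0, y0=0.0):
    x, y, pairs = x0, y0, []
    for _ in range(S):
        x = random.gauss(RHO * y, (1.0 - RHO ** 2) ** 0.5)  # x | y ~ N(rho*y, 1-rho^2)
        y = random.gauss(RHO * x, (1.0 - RHO ** 2) ** 0.5)  # y | x uses the NEW x
        pairs.append((x, y))
    return pairs

pairs = gibbs(20_000)
mean_x = sum(x for x, _ in pairs) / len(pairs)
corr = sum(x * y for x, y in pairs) / len(pairs)  # ~ RHO, since both marginals are N(0, 1)
```

The y-update conditioning on the new x is exactly the "both new and old" pattern from the slide.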

SLIDES 47-48

Collapsed Gibbs Sampling (CGS)

Goal: estimate Φ = ⟨φ(x)⟩_p = 𝔼_{x∼p}[φ(x)]

Integrate out some of the Markov blanket (the collapsed variables z):

   Q(x′ | x^(t)) = ∫ p(x_j | MB(x^(t))) dz = p(x_j | MB−z(x^(t)))

Sample (always accept) from Q(x′ | x^(t)); set x^(t+1) = x′.

   Φ̂ = (1/S) Σ_s φ(x^(s))

Samples are not independent, but collapsed Gibbs can be used effectively in high-dimensional spaces. ☺

Warning: collapsing changes the Markov blanket.

SLIDES 49-51

Collapsed Gibbs Sampling

Transition kernel/distribution: Q(x′ | x^(t)) = p(x_j | select variables in MB)

(graph over x_2, …, x_9; let's integrate out x_4)

Next sampled value of the current variable, given some of the Markov blanket variables (both new and old):

   x_5^(t+1) ∼ p(· | x_2^(t+1), x_3^(t), x_9^(t+1), x_6^(t+1), x_7^(t), x_8^(t+1))

SLIDES 52-53

What Are the Trade-offs of CGS?

Benefits
  • Collapsing removes variables from the model via integration/marginalization
      – Depending on which variables are marginalized out, this can be a drastic reduction
      – The priors/hyperparameters of the collapsed variables still impact the result
  • The "steps" are less incremental

Drawbacks
  • Collapsing removes conditional independences in the model
  • The math may not be easy
  • You may be restricted by conjugacy/other statistical properties
  • You still have the drawbacks of sampling
SLIDE 54

Outline

Monte Carlo methods

Sampling techniques
  • Uniform sampling
  • Importance sampling
  • Rejection sampling
  • Metropolis-Hastings
  • Gibbs sampling

Example: Collapsed Gibbs Sampler for Topic Models

SLIDE 55

Latent Dirichlet Allocation (Blei et al., 2003)

Per-document (latent) topic usage θ_d, per-document (unigram) word counts, per-topic word usage ψ_k.

SLIDE 56

Gibbs Sampler for LDirA

for each document d:
    resample θd | zd,1, …, zd,Nd
    for each token i in d:
        resample zd,i | wd,i, {ψk}, θd
for each topic k:
    resample ψk

SLIDE 57

Latent Dirichlet Allocation (Blei et al., 2003)

Per-document (latent) topic usage θ_d, per-document (unigram) word counts, per-topic word usage ψ_k. Integrate these out.

SLIDES 58-59

Collapsed Gibbs Sampler for LDirA

for each document d:
    resample θd | zd,1, …, zd,Nd
    for each token i in d:
        resample zd,i | wd,i, {ψk}, {z*,−i}
for each topic k:
    resample ψk

Collapsed Gibbs sampling goal:

   p(zd,i | z*,−i) = p(z*,*) / p(z*,−i)

SLIDES 60-68

Sampling: Discrete Observations

Griffiths and Steyvers (PNAS, 2004)

Collapsed Gibbs sampling goal: p(zd,i | z*,−i) = p(z*,*) / p(z*,−i)

Gamma function fact: Γ(x + 1) = x·Γ(x); maintain count tables.

Writing c(d, k) for the document-topic counts (including the current assignment) and α_k for the Dirichlet hyperparameters, the document-topic part of the ratio of Dirichlet-multinomial marginals is

   p(zd,i | z*,−i) =   [Γ(Σ_k α_k) / Γ(Σ_k c(d,k) + α_k)] · Π_k [Γ(c(d,k) + α_k) / Γ(α_k)]
                     ÷ [Γ(Σ_k α_k) / Γ(Σ_k c(d,k) − 1 + α_k)] · Π_k [Γ(c(d,k) − 1 + α_k) / Γ(α_k)]

Applying Γ(x + 1) = x·Γ(x) and cancelling the matching Γ terms leaves

   p(zd,i = k | z*,−i) ∝ c(d, k) − 1 + α_k

SLIDE 69

Collapsed Gibbs Sampler for LDirA

for each document d:
    for each token i in d:
        resample zd,i | wd,i, {ψk}, {z*,−i}

   p(zd,i = k | z*,−i) ∝ (c(d, k) − 1 + α_k) · (topic-word counts term)

SLIDE 70

Collapsed Gibbs Sampler for LDirA

randomly assign z*,*
maintain count tables:
    c(d, k): document-topic counts
    c(k, v): topic-word counts
for each document d:
    for each token i in d:
        unassign topic zd,i            (decrease counts)
        resample zd,i | wd,i, {ψk}, {z*,−i}
        reassign topic zd,i            (increase counts)
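The count-table loop above can be sketched end-to-end on a tiny made-up corpus. The corpus, K = 2 topics, and the symmetric hyperparameters are all illustrative assumptions; this sketch collapses both θ and ψ and uses the standard Dirichlet-multinomial complete conditional.

```python
# Collapsed Gibbs sweeps for a toy LDA-style model, maintaining both count tables.
import random

random.seed(0)
K, V, ALPHA, BETA = 2, 4, 0.5, 0.5
docs = [[0, 0, 1, 1], [2, 2, 3, 3], [0, 1, 2, 3]]   # word ids per document (toy corpus)

# randomly assign z_{*,*} and build the count tables
z = [[random.randrange(K) for _ in doc] for doc in docs]
cdk = [[0] * K for _ in docs]        # c(d, k): document-topic counts
ckv = [[0] * V for _ in range(K)]    # c(k, v): topic-word counts
ck = [0] * K                         # tokens per topic
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        k = z[d][i]
        cdk[d][k] += 1; ckv[k][w] += 1; ck[k] += 1

for sweep in range(200):
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            cdk[d][k] -= 1; ckv[k][w] -= 1; ck[k] -= 1   # unassign: decrease counts
            # p(z = k | z_{*,-i}) proportional to
            #   (c(d,k)+alpha) * (c(k,w)+beta) / (c(k,.)+V*beta)
            weights = [(cdk[d][j] + ALPHA) * (ckv[j][w] + BETA) / (ck[j] + V * BETA)
                       for j in range(K)]
            k = random.choices(range(K), weights=weights)[0]
            z[d][i] = k
            cdk[d][k] += 1; ckv[k][w] += 1; ck[k] += 1   # reassign: increase counts
```

Because every update decrements then increments the tables, the counts stay consistent with the assignments throughout: each document's row of c(d, k) always sums to its length.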

SLIDE 71

Outline

Monte Carlo methods

Sampling techniques
  • Uniform sampling
  • Importance sampling
  • Rejection sampling
  • Metropolis-Hastings
  • Gibbs sampling

Example: Collapsed Gibbs Sampler for Topic Models