CSci 8980: Advanced Topics in Graphical Models
MCMC, Gibbs Sampling

Instructor: Arindam Banerjee
September 27, 2007

Outline: Basics, MCMC, Gibbs Sampling, Auxiliary Variable Samplers

Problems

Primarily of two types: integration and optimization.

Bayesian inference and learning:
  Computing normalization in Bayesian methods: p(y|x) = p(y)p(x|y) / ∫ p(y′)p(x|y′) dy′
  Marginalization: p(y|x) = ∫ p(y, z|x) dz
  Expectation: E_{y|x}[f(y)] = ∫ f(y) p(y|x) dy
Statistical mechanics: computing the partition function Z = Σ_s exp(−E(s)/kT)
Optimization, model selection, etc.
Monte Carlo Principle

Target density p(x) on a high-dimensional space.
Draw i.i.d. samples {x_i}, i = 1, …, n, from p(x).
Construct the empirical point-mass function p_n(x) = (1/n) Σ_{i=1}^n δ_{x_i}(x).
Integrals/sums can then be approximated by I_n(f) = (1/n) Σ_{i=1}^n f(x_i), which converges almost surely as n → ∞ to I(f) = ∫ f(x) p(x) dx.
The unbiased estimate I_n(f) converges by the strong law of large numbers.
For finite σ_f², the central limit theorem implies √n (I_n(f) − I(f)) ⇒ N(0, σ_f²) as n → ∞.
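The estimate above can be sketched in a few lines; the target N(0, 1) and test function f(x) = x² below are hypothetical choices for illustration (the true value is E[x²] = 1):

```python
# Hypothetical toy example (not from the slides): estimate I(f) = ∫ f(x) p(x) dx
# for p = N(0, 1) and f(x) = x², whose true value is E[x²] = 1.
import random

def mc_estimate(f, draw, n=100_000, seed=0):
    rng = random.Random(seed)
    return sum(f(draw(rng)) for _ in range(n)) / n   # I_n(f) = (1/n) Σ f(x_i)

est_mc = mc_estimate(lambda x: x * x, lambda rng: rng.gauss(0.0, 1.0))
print(round(est_mc, 2))  # close to the true value 1.0
```

By the CLT on the slide, the error of this estimate shrinks like σ_f/√n, here roughly √(2/n) ≈ 0.004.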

Rejection Sampling

Target density p(x) is known, but hard to sample from.
Use an easy-to-sample proposal distribution q(x).
q(x) satisfies p(x) ≤ M q(x), with M < ∞.
Algorithm: for i = 1, …, n
  1. Sample x_i ∼ q(x) and u ∼ U(0, 1)
  2. If u < p(x_i) / (M q(x_i)), accept x_i; else reject
Issues:
  Tricky to bound p(x)/q(x) with a reasonable constant M
  If M is too large, the acceptance probability is small
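The algorithm above can be sketched for a hypothetical target (not from the slides): p(x) = 6x(1 − x) on [0, 1] (the Beta(2, 2) density) with a uniform proposal, where M = 1.5 works because max_x p(x) = 1.5:

```python
# Rejection sampling sketch under a hypothetical target (not from the slides):
# p(x) = 6x(1 − x) on [0, 1], proposal q = U(0, 1), and M = 1.5,
# since max_x p(x) = 1.5 ensures p(x) ≤ M q(x) everywhere.
import random

def rejection_sample(n, seed=0):
    rng, M = random.Random(seed), 1.5
    samples = []
    while len(samples) < n:
        x = rng.random()                 # x ~ q = U(0, 1)
        u = rng.random()                 # u ~ U(0, 1)
        if u < 6.0 * x * (1.0 - x) / M:  # accept if u < p(x) / (M q(x))
            samples.append(x)
    return samples

xs = rejection_sample(50_000)
mean_x = sum(xs) / len(xs)
print(round(mean_x, 2))  # mean of Beta(2, 2) is 0.5
```

Note the cost of M: the overall acceptance rate is 1/M (here 2/3), which is why a loose bound M makes the sampler slow.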

Importance Sampling

For a proposal distribution q(x), with w(x) = p(x)/q(x): I(f) = ∫ f(x) w(x) q(x) dx.
w(x) is the importance weight.
Monte Carlo estimate of I(f) based on samples x_i ∼ q(x): Î_n(f) = (1/n) Σ_{i=1}^n f(x_i) w(x_i).
The estimator is unbiased, and converges to I(f) almost surely.
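The weighted estimator can be sketched for a hypothetical example (not from the slides): estimating E_p[x²] = 1 for p = N(0, 1) by sampling from a wider proposal q = N(0, 2):

```python
# Importance sampling sketch (hypothetical example, not from the slides):
# estimate E_p[x²] = 1 for p = N(0, 1) using a wider proposal q = N(0, 2),
# weighting each draw by w(x) = p(x)/q(x).
import math, random

def normal_pdf(x, sigma):
    return math.exp(-0.5 * (x / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def importance_estimate(n=100_000, seed=0):
    rng, total = random.Random(seed), 0.0
    for _ in range(n):
        x = rng.gauss(0.0, 2.0)                      # x ~ q
        w = normal_pdf(x, 1.0) / normal_pdf(x, 2.0)  # importance weight
        total += (x * x) * w
    return total / n                                 # (1/n) Σ f(x_i) w(x_i)

est_is = importance_estimate()
print(round(est_is, 1))  # close to 1.0
```

A proposal wider than the target keeps the weights w(x) bounded; the reverse choice (q narrower than p) can make the weight variance blow up.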

Importance Sampling (Contd.)

Choose the q(x) that minimizes the variance of Î_n(f): var_q(f(x)w(x)) = E_q[f²(x)w²(x)] − I²(f).
Applying Jensen's inequality and optimizing, we get q*(x) = |f(x)| p(x) / ∫ |f(x)| p(x) dx.
Efficient sampling focuses on regions where |f(x)| p(x) is high.
Super-efficient sampling: the variance can be lower than even with q(x) = p(x).
This is exploited to evaluate probabilities of rare events, with q(x) ∝ I_E(x) p(x).

Markov Chains

Use a Markov chain to explore the state space.
A Markov chain on a discrete space is a process with p(x_i | x_{i−1}, …, x_1) = T(x_i | x_{i−1}).
A chain is homogeneous if T is invariant for all i.
The chain will stabilize into an invariant distribution if it is:
  1. Irreducible: the transition graph is connected
  2. Aperiodic: it does not get trapped in cycles
A sufficient condition ensuring p(x) is the invariant distribution is detailed balance: p(x_i) T(x_{i−1} | x_i) = p(x_{i−1}) T(x_i | x_{i−1}).
In MCMC samplers, invariant distribution = target distribution.
Samplers are designed for fast convergence.
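Stabilization into the invariant distribution can be seen directly with a hypothetical 3-state chain (not from the slides): power iteration with an irreducible, aperiodic T drives any starting distribution to the p satisfying p T = p:

```python
# Sketch with a hypothetical 3-state chain (not from the slides): repeatedly
# applying an irreducible, aperiodic transition matrix T drives any initial
# distribution to the invariant distribution p satisfying p T = p.
T = [                      # T[i][j] = P(next state = j | current state = i)
    [0.50, 0.25, 0.25],
    [0.20, 0.60, 0.20],
    [0.25, 0.25, 0.50],
]

def step(dist, T):
    n = len(T)
    return [sum(dist[i] * T[i][j] for i in range(n)) for j in range(n)]

dist = [1.0, 0.0, 0.0]     # start deterministically in state 0
for _ in range(200):       # power iteration: dist ← dist · T
    dist = step(dist, T)

print([round(p, 3) for p in dist])  # invariant distribution ≈ [0.308, 0.385, 0.308]
```

The memory of the starting state decays geometrically at the rate of T's second-largest eigenvalue, which is what "fast convergence" refers to in sampler design.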

Markov Chains (Contd.)

Example: a random walker on the web.
  Irreducibility: the walker should be able to reach all pages
  Aperiodicity: the walker should not get stuck in a loop
PageRank uses T = L + E, where
  L = link matrix for the web graph
  E = uniform random matrix, ensuring irreducibility and aperiodicity
The invariant distribution p(x) represents the rank of webpage x.
In continuous spaces, T becomes an integral kernel K with ∫ p(x_i) K(x_{i+1} | x_i) dx_i = p(x_{i+1}), and p(x) is the corresponding eigenfunction.
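The slides write T = L + E loosely; a common concrete form mixes L with the uniform matrix E through a damping factor d, T = (1 − d)L + dE, which keeps T stochastic. A sketch on a hypothetical 3-page graph (the graph and d = 0.15 are illustrative assumptions, not from the slides):

```python
# PageRank sketch on a hypothetical 3-page web graph (not from the slides).
# The link matrix L is mixed with a uniform matrix E through a damping
# factor d, making the chain irreducible and aperiodic.
d = 0.15
L = [                 # row-stochastic links: page 0 → {1, 2}, 1 → {0}, 2 → {1}
    [0.0, 0.5, 0.5],
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
]
n = len(L)
T = [[(1.0 - d) * L[i][j] + d / n for j in range(n)] for i in range(n)]

rank = [1.0 / n] * n  # start uniform
for _ in range(100):  # power iteration toward the invariant distribution
    rank = [sum(rank[i] * T[i][j] for i in range(n)) for j in range(n)]

print([round(r, 3) for r in rank])  # ranks sum to 1; higher = more "important"
```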

The Metropolis-Hastings Algorithm

The most popular MCMC method.
Based on a proposal distribution q(x*|x).
Algorithm: for i = 0, …, n − 1
  Sample u ∼ U(0, 1)
  Sample x* ∼ q(x*|x_i)
  Then set
    x_{i+1} = x*,  if u < A(x_i, x*) = min{1, [p(x*) q(x_i|x*)] / [p(x_i) q(x*|x_i)]}
    x_{i+1} = x_i, otherwise
The transition kernel is K_MH(x_{i+1}|x_i) = q(x_{i+1}|x_i) A(x_i, x_{i+1}) + δ_{x_i}(x_{i+1}) r(x_i), where r(x_i) is the term associated with rejection: r(x_i) = ∫ q(x|x_i)(1 − A(x_i, x)) dx.
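A minimal random-walk sketch of the algorithm, assuming a standard-normal target known only up to its normalizer (a hypothetical choice, not from the slides); the Gaussian proposal is symmetric, so the q-ratio in A cancels:

```python
# Random-walk Metropolis-Hastings sketch (hypothetical example): sample from
# an unnormalized target p̃(x) ∝ exp(−x²/2). The proposal q(x*|x) = N(x, 1)
# is symmetric, so q(x_i|x*)/q(x*|x_i) cancels in the acceptance probability.
import math, random

def metropolis_hastings(n, seed=0):
    rng = random.Random(seed)
    p_tilde = lambda x: math.exp(-0.5 * x * x)      # unnormalized target
    x, out = 0.0, []
    for _ in range(n):
        x_star = x + rng.gauss(0.0, 1.0)            # x* ~ q(·|x_i)
        a = min(1.0, p_tilde(x_star) / p_tilde(x))  # A(x_i, x*)
        if rng.random() < a:                        # u ~ U(0, 1)
            x = x_star                              # accept
        out.append(x)                               # on rejection, keep x_i
    return out

chain = metropolis_hastings(100_000)
mean_mh = sum(chain) / len(chain)
var_mh = sum((v - mean_mh) ** 2 for v in chain) / len(chain)
print(round(mean_mh, 1), round(var_mh, 1))  # near 0.0 and 1.0
```

Note that only the ratio p(x*)/p(x_i) is ever needed, which is why MH works with unnormalized densities.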

The Metropolis-Hastings Algorithm (Contd.)

By construction, p(x_i) K_MH(x_{i+1}|x_i) = p(x_{i+1}) K_MH(x_i|x_{i+1}).
This detailed balance implies p(x) is the invariant distribution.
Basic properties:
  Irreducibility: ensure the support of q contains the support of p
  Aperiodicity: ensured, since rejection is always a possibility
Independent sampler: q(x*|x_i) = q(x*), so that A(x_i, x*) = min{1, [p(x*) q(x_i)] / [q(x*) p(x_i)]}.
Metropolis sampler: symmetric q(x*|x_i) = q(x_i|x*), so that A(x_i, x*) = min{1, p(x*)/p(x_i)}.

Simulated Annealing

Problem: find the global maximum of p(x).
Initial idea: run MCMC, estimate p̂(x), compute the max.
Issue: the Markov chain may not come close to the mode(s).
Instead, simulate a non-homogeneous Markov chain whose invariant distribution at iteration i is p_i(x) ∝ p^{1/T_i}(x).
The sample update follows
  x_{i+1} = x*,  if u < A(x_i, x*) = min{1, [p^{1/T_i}(x*) q(x_i|x*)] / [p^{1/T_i}(x_i) q(x*|x_i)]}
  x_{i+1} = x_i, otherwise
T_i decreases following a cooling schedule, with lim_{i→∞} T_i = 0.
The cooling schedule needs proper choice, e.g., T_i = 1 / (C log(i + T_0)).
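The annealed update can be sketched on a hypothetical bimodal target (not from the slides), p̃(x) ∝ exp(−(x² − 1)²) with global modes at x = ±1, using a symmetric proposal and a logarithmic cooling schedule:

```python
# Simulated annealing sketch (hypothetical example, not from the slides):
# maximize p̃(x) ∝ exp(−(x² − 1)²), which has two global modes at x = ±1.
# Symmetric Gaussian proposal, logarithmic cooling T_i = 1 / log(i + 2).
import math, random

def anneal(n_iter=20_000, seed=0):
    rng = random.Random(seed)
    log_p = lambda x: -((x * x - 1.0) ** 2)  # log of the unnormalized target
    x = 5.0                                  # start far from both modes
    for i in range(n_iter):
        T = 1.0 / math.log(i + 2)            # cooling schedule, T_i → 0
        x_star = x + rng.gauss(0.0, 0.5)
        # accept if u < (p̃(x*)/p̃(x))^(1/T); compare logs for stability
        if math.log(rng.random()) < (log_p(x_star) - log_p(x)) / T:
            x = x_star
    return x

x_final = anneal()
print(round(x_final, 2))  # |x_final| should end up near a mode at ±1
```

Early on T is large and downhill moves are often accepted (exploration); as T → 0 the chain concentrates on the modes of p(x).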

Monte Carlo EM

The E-step involves computing an expectation over the latent variables: Q(θ, θ_n) = ∫ log p(x, z|θ) p(z|x, θ_n) dz.
Estimate the expectation using MCMC.
Draw samples using MH with acceptance probability A(z, z*) = min{1, [p(x|z*, θ_n) p(z*|θ_n) q(z|z*)] / [p(x|z, θ_n) p(z|θ_n) q(z*|z)]}.
Several variants:
  Stochastic EM: draw one sample
  Monte Carlo EM: draw multiple samples

Mixtures of MCMC Kernels

A powerful property of MCMC: samplers can be combined.
Let K_1, K_2 be kernels with invariant distribution p:
  The mixture kernel αK_1 + (1 − α)K_2, α ∈ [0, 1], converges to p
  The cycle kernel K_1 K_2 converges to p
Mixtures can use global and local proposals:
  Global proposals explore the entire space (with probability α)
  Local proposals discover finer details (with probability 1 − α)
Example: the target has many narrow peaks.
  The global proposal finds the peaks
  Local proposals explore the neighborhood of each peak (random walk)
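The narrow-peaks example can be sketched with a hypothetical two-mode target (not from the slides): with probability α take a global random-walk step (large sd) that can jump between modes, otherwise a local step (small sd); each kernel is a Metropolis kernel leaving p invariant, so the mixture αK_1 + (1 − α)K_2 does too:

```python
# Mixture-of-kernels sketch (hypothetical example, not from the slides).
# Target: two narrow modes at x = ±4. Global kernel: random walk with sd 8
# (chosen with probability alpha); local kernel: random walk with sd 0.3.
import math, random

def mixture_sampler(n, alpha=0.2, seed=0):
    rng = random.Random(seed)
    p_tilde = lambda x: (math.exp(-((x - 4.0) ** 2) / 0.5) +
                         math.exp(-((x + 4.0) ** 2) / 0.5))
    x, out = 4.0, []
    for _ in range(n):
        sd = 8.0 if rng.random() < alpha else 0.3  # global vs local kernel
        x_star = x + rng.gauss(0.0, sd)            # symmetric proposal
        if rng.random() < min(1.0, p_tilde(x_star) / p_tilde(x)):
            x = x_star
        out.append(x)
    return out

draws = mixture_sampler(200_000)
frac_right = sum(1 for v in draws if v > 0.0) / len(draws)
print(round(frac_right, 1))  # both modes visited, roughly half the mass each
```

With only the local kernel the chain would almost never cross the gap between the modes; the occasional global step is what lets it visit both.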

Basics MCMC Gibbs Sampling Auxiliary Variable Samplers

Cycles of MCMC Kernels

Split a multi-variate state into blocks

slide-96
SLIDE 96

Basics MCMC Gibbs Sampling Auxiliary Variable Samplers

Cycles of MCMC Kernels

Split a multi-variate state into blocks Each block can be updated separately

slide-97
SLIDE 97

Basics MCMC Gibbs Sampling Auxiliary Variable Samplers

Cycles of MCMC Kernels

Split a multi-variate state into blocks Each block can be updated separately Convergence is faster if correlated variables are blocked

slide-98
SLIDE 98

Basics MCMC Gibbs Sampling Auxiliary Variable Samplers

Cycles of MCMC Kernels

Split a multi-variate state into blocks Each block can be updated separately Convergence is faster if correlated variables are blocked Transition kernel is given by KMHCycle(x(i+1)|x(i)) =

nb

  • j=1

KMH(j)(x(i+1)

bj

|x(i)

bj , x(i+1) −[bj] )

slide-99
SLIDE 99

Basics MCMC Gibbs Sampling Auxiliary Variable Samplers

Cycles of MCMC Kernels

Split a multi-variate state into blocks Each block can be updated separately Convergence is faster if correlated variables are blocked Transition kernel is given by KMHCycle(x(i+1)|x(i)) =

nb

  • j=1

KMH(j)(x(i+1)

bj

|x(i)

bj , x(i+1) −[bj] )

Trade-off on block size

slide-100
SLIDE 100

Basics MCMC Gibbs Sampling Auxiliary Variable Samplers

Cycles of MCMC Kernels

Split a multi-variate state into blocks Each block can be updated separately Convergence is faster if correlated variables are blocked Transition kernel is given by KMHCycle(x(i+1)|x(i)) =

nb

  • j=1

KMH(j)(x(i+1)

bj

|x(i)

bj , x(i+1) −[bj] )

Trade-off on block size

If block size is small, chain takes long time to explore the space

slide-101
SLIDE 101

Basics MCMC Gibbs Sampling Auxiliary Variable Samplers

Cycles of MCMC Kernels

Split a multi-variate state into blocks Each block can be updated separately Convergence is faster if correlated variables are blocked Transition kernel is given by KMHCycle(x(i+1)|x(i)) =

nb

  • j=1

KMH(j)(x(i+1)

bj

|x(i)

bj , x(i+1) −[bj] )

Trade-off on block size

If block size is small, chain takes long time to explore the space If block size is large, acceptance probability is low

slide-102
SLIDE 102

Basics MCMC Gibbs Sampling Auxiliary Variable Samplers

Cycles of MCMC Kernels

Split a multi-variate state into blocks Each block can be updated separately Convergence is faster if correlated variables are blocked Transition kernel is given by KMHCycle(x(i+1)|x(i)) =

nb

  • j=1

KMH(j)(x(i+1)

bj

|x(i)

bj , x(i+1) −[bj] )

Trade-off on block size

If block size is small, chain takes long time to explore the space If block size is large, acceptance probability is low

Gibbs sampling effectively uses block size of 1
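A cycle of MH kernels can be sketched as follows: each coordinate of a bivariate Gaussian target is treated as its own block and updated in a fixed sweep by a random-walk MH step, holding the other block fixed. The target, its correlation, and the step size are illustrative choices, not from the slides.

```python
import math
import random

RHO = 0.9  # correlation of the illustrative bivariate Gaussian target

def log_target(x, y):
    # Unnormalized log-density of a bivariate Gaussian with correlation RHO.
    return -(x * x - 2.0 * RHO * x * y + y * y) / (2.0 * (1.0 - RHO * RHO))

def mh_block_step(state, block, rng, step=0.5):
    """One MH update of a single block (here one coordinate), holding the
    other block fixed -- one factor K_MH(j) of the cycle kernel."""
    prop = list(state)
    prop[block] += rng.gauss(0.0, step)
    log_a = log_target(*prop) - log_target(*state)
    if log_a >= 0.0 or rng.random() < math.exp(log_a):
        return tuple(prop)
    return state

def cycle_mh(n_iter, seed=0):
    rng = random.Random(seed)
    state = (0.0, 0.0)
    chain = []
    for _ in range(n_iter):
        for block in (0, 1):       # deterministic sweep over the blocks
            state = mh_block_step(state, block, rng)
        chain.append(state)
    return chain
```

With RHO this high, updating the two correlated coordinates in one block (or via Gibbs) would mix faster than this per-coordinate sweep, which is exactly the trade-off the slide describes.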

slide-103
SLIDE 103

The Gibbs Sampler

For a d-dimensional vector x, assume we know
p(x_j | x_{−j}) = p(x_j | x_1, …, x_{j−1}, x_{j+1}, …, x_d)

Gibbs sampler uses the following proposal distribution:

q(x* | x^{(i)}) = p(x*_j | x^{(i)}_{−j}) if x*_{−j} = x^{(i)}_{−j}, and 0 otherwise

The acceptance probability

A(x^{(i)}, x*) = min{ 1, [p(x*) q(x^{(i)} | x*)] / [p(x^{(i)}) q(x* | x^{(i)})] } = 1

Deterministic scan: all samples are accepted

slide-107
SLIDE 107

The Gibbs Sampler (Contd.)

Initialize x^{(0)}. For i = 0, …, (N − 1):

Sample x_1^{(i+1)} ∼ p(x_1 | x_2^{(i)}, x_3^{(i)}, …, x_d^{(i)})
Sample x_2^{(i+1)} ∼ p(x_2 | x_1^{(i+1)}, x_3^{(i)}, …, x_d^{(i)})
⋯
Sample x_d^{(i+1)} ∼ p(x_d | x_1^{(i+1)}, …, x_{d−1}^{(i+1)})

Possible to have MH steps inside a Gibbs sampler
For d = 2, the Gibbs sampler is the data augmentation algorithm
For Bayes nets, the conditioning is on the Markov blanket:

p(x_j | x_{−j}) ∝ p(x_j | x_{pa(j)}) ∏_{k∈ch(j)} p(x_k | x_{pa(k)})
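The deterministic-scan algorithm above can be sketched for the classic bivariate Gaussian example, where both full conditionals are univariate Gaussians. The target (unit variances, correlation rho) and its parameters are illustrative, not from the slides.

```python
import random

def gibbs_bivariate_normal(n_iter, rho=0.8, seed=0):
    """Deterministic-scan Gibbs sampler for a bivariate Gaussian with unit
    variances and correlation rho. The full conditionals are
    x1 | x2 ~ N(rho*x2, 1 - rho^2) and x2 | x1 ~ N(rho*x1, 1 - rho^2),
    and every proposal is accepted."""
    rng = random.Random(seed)
    sd = (1.0 - rho * rho) ** 0.5
    x1, x2 = 0.0, 0.0
    samples = []
    for _ in range(n_iter):
        x1 = rng.gauss(rho * x2, sd)   # sample x1^{(i+1)} ~ p(x1 | x2^{(i)})
        x2 = rng.gauss(rho * x1, sd)   # sample x2^{(i+1)} ~ p(x2 | x1^{(i+1)})
        samples.append((x1, x2))
    return samples
```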

slide-115
SLIDE 115


Bayesian LDA

slide-116
SLIDE 116

Gibbs Sampler for Bayesian LDA

The conditional distribution
p(z_ℓ = h | z_{−ℓ}, w) ∝ p(z_ℓ = h | z_{−ℓ}) p(w_ℓ | z_ℓ = h, z_{−ℓ}, w_{−ℓ})

Notation:
C^{DT}_{d_{−ℓ},h} = number of words from document d assigned to topic h, excluding the current word
C^{WT}_{w_{−ℓ},h} = number of times word w_ℓ is assigned to topic h, excluding the current word

Then, the first term
p(z_ℓ = h | z_{−ℓ}) = (C^{DT}_{d_{−ℓ},h} + α) / (∑_{t=1}^{T} C^{DT}_{d_{−ℓ},t} + Tα)

The second term
p(w_ℓ | z_ℓ = h, z_{−ℓ}, w_{−ℓ}) = (C^{WT}_{w_{−ℓ},h} + β) / (∑_{w=1}^{W} C^{WT}_{w_{−ℓ},h} + Wβ)
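Putting the two terms together gives a collapsed Gibbs update for LDA. The sketch below is a minimal implementation under stated assumptions: `docs`, `T`, `W`, `alpha`, `beta` are hypothetical names, and the doc-side denominator is dropped from the sampling weights because it is constant in h.

```python
import random

def lda_gibbs(docs, T, W, alpha=0.5, beta=0.1, n_iter=100, seed=0):
    """Minimal collapsed Gibbs sampler for LDA. docs is a list of documents,
    each a list of word ids in 0..W-1; T is the number of topics.
    Returns the doc-topic and word-topic count matrices."""
    rng = random.Random(seed)
    C_DT = [[0] * T for _ in docs]        # C^{DT}: doc-topic counts
    C_WT = [[0] * T for _ in range(W)]    # C^{WT}: word-topic counts
    C_T = [0] * T                         # per-topic totals (column sums of C^{WT})
    z = []                                # topic assignment for every token
    for d, doc in enumerate(docs):
        zd = []
        for w in doc:
            t = rng.randrange(T)          # random initial assignment
            zd.append(t)
            C_DT[d][t] += 1; C_WT[w][t] += 1; C_T[t] += 1
        z.append(zd)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                t = z[d][n]               # remove the current token from counts
                C_DT[d][t] -= 1; C_WT[w][t] -= 1; C_T[t] -= 1
                # p(z = h | rest) ∝ (C_DT + alpha) * (C_WT + beta) / (C_T + W*beta);
                # the doc-side denominator is omitted as it does not depend on h.
                probs = [(C_DT[d][h] + alpha) * (C_WT[w][h] + beta) /
                         (C_T[h] + W * beta) for h in range(T)]
                r = rng.random() * sum(probs)
                acc, t = 0.0, T - 1
                for h, p in enumerate(probs):
                    acc += p
                    if r < acc:
                        t = h
                        break
                z[d][n] = t               # reassign and restore counts
                C_DT[d][t] += 1; C_WT[w][t] += 1; C_T[t] += 1
    return C_DT, C_WT
```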

slide-122
SLIDE 122

Basic Idea

Sometimes easier to sample from p(x, u) rather than p(x)
Sample (x_i, u_i), and then ignore u_i
Consider two well-known examples:

Hybrid Monte Carlo
Slice sampling

slide-127
SLIDE 127

Hybrid Monte Carlo

Uses gradient of the target distribution
Improves “mixing” in high dimensions
Effectively, take steps based on the gradient of p(x)
Introduce auxiliary momentum variables u ∈ R^d with p(x, u) = p(x) N(u; 0, I_d)
Gradient vector ∆(x) = ∂ log p(x)/∂x, step-size ρ
Take L gradient-based steps to get the proposal candidate
When L = 1, one obtains the Langevin algorithm:

x* = x_0 + ρu_0 = x^{(i)} + ρ(u* + ρ∆(x^{(i)})/2)

slide-134
SLIDE 134

Hybrid Monte Carlo (Contd.)

Initialize x^{(0)}. For i = 0, …, (N − 1):

Sample v ∼ U[0, 1], u* ∼ N(0, I_d)
Let x_0 = x^{(i)}, u_0 = u* + ρ∆(x_0)/2
For ℓ = 1, …, L, with ρ_ℓ = ρ for ℓ < L and ρ_L = ρ/2:
x_ℓ = x_{ℓ−1} + ρu_{ℓ−1}
u_ℓ = u_{ℓ−1} + ρ_ℓ∆(x_ℓ)
Set (x^{(i+1)}, u^{(i+1)}) = (x_L, u_L) if v < A = min{ 1, [p(x_L)/p(x^{(i)})] exp(−(‖u_L‖² − ‖u*‖²)/2) }, and (x^{(i)}, u^{(i)}) otherwise

Tradeoffs for ρ, L:
Large ρ gives low acceptance, small ρ needs many steps
Large L gives candidates far from x_0, but is expensive
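The algorithm above can be sketched for a 1-D standard normal target, where ∆(x) = −x. The target and the values of ρ and L are illustrative choices, not from the slides.

```python
import math
import random

def hmc(n_iter, rho=0.2, L=20, seed=0):
    """Hybrid Monte Carlo for a 1-D standard normal target: a half-step on
    the momentum, L position/momentum updates with a final half-step, then
    an accept/reject test on the change in energy."""
    rng = random.Random(seed)
    grad = lambda x: -x              # ∆(x) = d log p(x)/dx for p(x) ∝ exp(-x²/2)
    log_p = lambda x: -0.5 * x * x   # unnormalized log target
    x = 0.0
    chain = []
    for _ in range(n_iter):
        u_star = rng.gauss(0.0, 1.0)            # u* ~ N(0, 1)
        xl = x
        u = u_star + 0.5 * rho * grad(xl)       # u_0 = u* + ρ∆(x_0)/2
        for l in range(1, L + 1):
            xl = xl + rho * u                   # x_ℓ = x_{ℓ-1} + ρ u_{ℓ-1}
            rl = 0.5 * rho if l == L else rho   # ρ_ℓ = ρ, except ρ_L = ρ/2
            u = u + rl * grad(xl)               # u_ℓ = u_{ℓ-1} + ρ_ℓ ∆(x_ℓ)
        # log A = log p(x_L) - log p(x^{(i)}) - (‖u_L‖² - ‖u*‖²)/2
        log_A = (log_p(xl) - log_p(x)) - 0.5 * (u * u - u_star * u_star)
        if log_A >= 0.0 or rng.random() < math.exp(log_A):
            x = xl
        chain.append(x)
    return chain
```

Note that the half-step at the start and the half-step at ℓ = L together make the trajectory a standard leapfrog integration, which is what keeps the acceptance probability high.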

slide-142
SLIDE 142

The Slice Sampler

Construct extended target distribution

p*(x, u) = 1 if 0 ≤ u ≤ p(x), and 0 otherwise

It follows that ∫ p*(x, u) du = ∫_0^{p(x)} du = p(x)

From the Gibbs sampling perspective:
p(u | x) = U[0, p(x)]
p(x | u) = U_A, where A = {x : p(x) ≥ u}

Algorithm is easy if A is easy to figure out
Otherwise, several auxiliary variables need to be introduced
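For a target where A is available in closed form, both Gibbs steps are exact. A minimal sketch for the unnormalized standard normal p(x) = exp(−x²/2), where A is the interval [−√(−2 log u), √(−2 log u)] (an illustrative target, not from the slides):

```python
import math
import random

def slice_sampler(n_iter, seed=0):
    """Slice sampler for the unnormalized standard normal p(x) = exp(-x²/2).
    The slice A = {x : p(x) >= u} is the interval
    [-sqrt(-2 log u), sqrt(-2 log u)], so both Gibbs steps are exact."""
    rng = random.Random(seed)
    x = 0.0
    samples = []
    for _ in range(n_iter):
        u = rng.uniform(0.0, math.exp(-0.5 * x * x))  # u | x ~ U[0, p(x)]
        u = max(u, 1e-300)                            # guard against log(0)
        half = math.sqrt(-2.0 * math.log(u))          # boundary of the slice A
        x = rng.uniform(-half, half)                  # x | u ~ U_A
        samples.append(x)
    return samples
```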

slide-147
SLIDE 147


The Slice Sampler (Contd.)