CSci 8980: Advanced Topics in Graphical Models
MCMC, Gibbs Sampling

Instructor: Arindam Banerjee
September 27, 2007

Outline: Basics, MCMC, Gibbs Sampling, Auxiliary Variable Samplers

Problems

Primarily of two types: integration and optimization.

Bayesian inference and learning:
  Computing normalization in Bayesian methods: p(y|x) = p(y)p(x|y) / ∫ p(y′)p(x|y′) dy′
  Marginalization: p(y|x) = ∫ p(y, z|x) dz
  Expectation: E_{y|x}[f(y)] = ∫ f(y) p(y|x) dy
Statistical mechanics: computing the partition function Z = Σ_s exp(−E(s)/kT)
Optimization, model selection, etc.
Monte Carlo Principle

Target density p(x) on a high-dimensional space.
Draw i.i.d. samples {x_i}, i = 1, …, n, from p(x).
Construct the empirical point-mass function p_n(x) = (1/n) Σ_{i=1}^n δ_{x_i}(x).
Integrals/sums can then be approximated by I_n(f) = (1/n) Σ_{i=1}^n f(x_i), which converges almost surely as n → ∞ to I(f) = ∫ f(x) p(x) dx.
The unbiased estimate I_n(f) converges by the strong law of large numbers.
For finite σ_f², the central limit theorem implies √n (I_n(f) − I(f)) ⇒ N(0, σ_f²) as n → ∞.
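The estimate above can be sketched in a few lines; the target N(0, 1) and test function f(x) = x² below are hypothetical choices for illustration (the true value is E[x²] = 1):

```python
# Hypothetical toy example (not from the slides): estimate I(f) = ∫ f(x) p(x) dx
# for p = N(0, 1) and f(x) = x², whose true value is E[x²] = 1.
import random

def mc_estimate(f, draw, n=100_000, seed=0):
    rng = random.Random(seed)
    return sum(f(draw(rng)) for _ in range(n)) / n   # I_n(f) = (1/n) Σ f(x_i)

est_mc = mc_estimate(lambda x: x * x, lambda rng: rng.gauss(0.0, 1.0))
print(round(est_mc, 2))  # close to the true value 1.0
```

By the CLT on the slide, the error of this estimate shrinks like σ_f/√n, here roughly √(2/n) ≈ 0.004.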

Rejection Sampling

Target density p(x) is known, but hard to sample from.
Use an easy-to-sample proposal distribution q(x).
q(x) satisfies p(x) ≤ M q(x), with M < ∞.
Algorithm: for i = 1, …, n
  1. Sample x_i ∼ q(x) and u ∼ U(0, 1)
  2. If u < p(x_i) / (M q(x_i)), accept x_i; else reject
Issues:
  Tricky to bound p(x)/q(x) with a reasonable constant M
  If M is too large, the acceptance probability is small
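The algorithm above can be sketched for a hypothetical target (not from the slides): p(x) = 6x(1 − x) on [0, 1] (the Beta(2, 2) density) with a uniform proposal, where M = 1.5 works because max_x p(x) = 1.5:

```python
# Rejection sampling sketch under a hypothetical target (not from the slides):
# p(x) = 6x(1 − x) on [0, 1], proposal q = U(0, 1), and M = 1.5,
# since max_x p(x) = 1.5 ensures p(x) ≤ M q(x) everywhere.
import random

def rejection_sample(n, seed=0):
    rng, M = random.Random(seed), 1.5
    samples = []
    while len(samples) < n:
        x = rng.random()                 # x ~ q = U(0, 1)
        u = rng.random()                 # u ~ U(0, 1)
        if u < 6.0 * x * (1.0 - x) / M:  # accept if u < p(x) / (M q(x))
            samples.append(x)
    return samples

xs = rejection_sample(50_000)
mean_x = sum(xs) / len(xs)
print(round(mean_x, 2))  # mean of Beta(2, 2) is 0.5
```

Note the cost of M: the overall acceptance rate is 1/M (here 2/3), which is why a loose bound M makes the sampler slow.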

Importance Sampling

For a proposal distribution q(x), with w(x) = p(x)/q(x): I(f) = ∫ f(x) w(x) q(x) dx.
w(x) is the importance weight.
Monte Carlo estimate of I(f) based on samples x_i ∼ q(x): Î_n(f) = (1/n) Σ_{i=1}^n f(x_i) w(x_i).
The estimator is unbiased, and converges to I(f) almost surely.
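The weighted estimator can be sketched for a hypothetical example (not from the slides): estimating E_p[x²] = 1 for p = N(0, 1) by sampling from a wider proposal q = N(0, 2):

```python
# Importance sampling sketch (hypothetical example, not from the slides):
# estimate E_p[x²] = 1 for p = N(0, 1) using a wider proposal q = N(0, 2),
# weighting each draw by w(x) = p(x)/q(x).
import math, random

def normal_pdf(x, sigma):
    return math.exp(-0.5 * (x / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def importance_estimate(n=100_000, seed=0):
    rng, total = random.Random(seed), 0.0
    for _ in range(n):
        x = rng.gauss(0.0, 2.0)                      # x ~ q
        w = normal_pdf(x, 1.0) / normal_pdf(x, 2.0)  # importance weight
        total += (x * x) * w
    return total / n                                 # (1/n) Σ f(x_i) w(x_i)

est_is = importance_estimate()
print(round(est_is, 1))  # close to 1.0
```

A proposal wider than the target keeps the weights w(x) bounded; the reverse choice (q narrower than p) can make the weight variance blow up.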

Importance Sampling (Contd.)

Choose the q(x) that minimizes the variance of Î_n(f): var_q(f(x)w(x)) = E_q[f²(x)w²(x)] − I²(f).
Applying Jensen's inequality and optimizing, we get q*(x) = |f(x)| p(x) / ∫ |f(x)| p(x) dx.
Efficient sampling focuses on regions where |f(x)| p(x) is high.
Super-efficient sampling: the variance can be lower than even with q(x) = p(x).
This is exploited to evaluate probabilities of rare events, with q(x) ∝ I_E(x) p(x).

Markov Chains

Use a Markov chain to explore the state space.
A Markov chain on a discrete space is a process with p(x_i | x_{i−1}, …, x_1) = T(x_i | x_{i−1}).
A chain is homogeneous if T is invariant for all i.
The chain will stabilize into an invariant distribution if it is:
  1. Irreducible: the transition graph is connected
  2. Aperiodic: it does not get trapped in cycles
A sufficient condition ensuring p(x) is the invariant distribution is detailed balance: p(x_i) T(x_{i−1} | x_i) = p(x_{i−1}) T(x_i | x_{i−1}).
In MCMC samplers, invariant distribution = target distribution.
Samplers are designed for fast convergence.
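Stabilization into the invariant distribution can be seen directly with a hypothetical 3-state chain (not from the slides): power iteration with an irreducible, aperiodic T drives any starting distribution to the p satisfying p T = p:

```python
# Sketch with a hypothetical 3-state chain (not from the slides): repeatedly
# applying an irreducible, aperiodic transition matrix T drives any initial
# distribution to the invariant distribution p satisfying p T = p.
T = [                      # T[i][j] = P(next state = j | current state = i)
    [0.50, 0.25, 0.25],
    [0.20, 0.60, 0.20],
    [0.25, 0.25, 0.50],
]

def step(dist, T):
    n = len(T)
    return [sum(dist[i] * T[i][j] for i in range(n)) for j in range(n)]

dist = [1.0, 0.0, 0.0]     # start deterministically in state 0
for _ in range(200):       # power iteration: dist ← dist · T
    dist = step(dist, T)

print([round(p, 3) for p in dist])  # invariant distribution ≈ [0.308, 0.385, 0.308]
```

The memory of the starting state decays geometrically at the rate of T's second-largest eigenvalue, which is what "fast convergence" refers to in sampler design.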

Markov Chains (Contd.)

Example: a random walker on the web.
  Irreducibility: the walker should be able to reach all pages
  Aperiodicity: the walker should not get stuck in a loop
PageRank uses T = L + E, where
  L = link matrix for the web graph
  E = uniform random matrix, ensuring irreducibility and aperiodicity
The invariant distribution p(x) represents the rank of webpage x.
In continuous spaces, T becomes an integral kernel K with ∫ p(x_i) K(x_{i+1} | x_i) dx_i = p(x_{i+1}), and p(x) is the corresponding eigenfunction.
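The slides write T = L + E loosely; a common concrete form mixes L with the uniform matrix E through a damping factor d, T = (1 − d)L + dE, which keeps T stochastic. A sketch on a hypothetical 3-page graph (the graph and d = 0.15 are illustrative assumptions, not from the slides):

```python
# PageRank sketch on a hypothetical 3-page web graph (not from the slides).
# The link matrix L is mixed with a uniform matrix E through a damping
# factor d, making the chain irreducible and aperiodic.
d = 0.15
L = [                 # row-stochastic links: page 0 → {1, 2}, 1 → {0}, 2 → {1}
    [0.0, 0.5, 0.5],
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
]
n = len(L)
T = [[(1.0 - d) * L[i][j] + d / n for j in range(n)] for i in range(n)]

rank = [1.0 / n] * n  # start uniform
for _ in range(100):  # power iteration toward the invariant distribution
    rank = [sum(rank[i] * T[i][j] for i in range(n)) for j in range(n)]

print([round(r, 3) for r in rank])  # ranks sum to 1; higher = more "important"
```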

The Metropolis-Hastings Algorithm

The most popular MCMC method.
Based on a proposal distribution q(x*|x).
Algorithm: for i = 0, …, n − 1
  Sample u ∼ U(0, 1)
  Sample x* ∼ q(x*|x_i)
  Then set
    x_{i+1} = x*,  if u < A(x_i, x*) = min{1, [p(x*) q(x_i|x*)] / [p(x_i) q(x*|x_i)]}
    x_{i+1} = x_i, otherwise
The transition kernel is K_MH(x_{i+1}|x_i) = q(x_{i+1}|x_i) A(x_i, x_{i+1}) + δ_{x_i}(x_{i+1}) r(x_i), where r(x_i) is the term associated with rejection: r(x_i) = ∫ q(x|x_i)(1 − A(x_i, x)) dx.
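A minimal random-walk sketch of the algorithm, assuming a standard-normal target known only up to its normalizer (a hypothetical choice, not from the slides); the Gaussian proposal is symmetric, so the q-ratio in A cancels:

```python
# Random-walk Metropolis-Hastings sketch (hypothetical example): sample from
# an unnormalized target p̃(x) ∝ exp(−x²/2). The proposal q(x*|x) = N(x, 1)
# is symmetric, so q(x_i|x*)/q(x*|x_i) cancels in the acceptance probability.
import math, random

def metropolis_hastings(n, seed=0):
    rng = random.Random(seed)
    p_tilde = lambda x: math.exp(-0.5 * x * x)      # unnormalized target
    x, out = 0.0, []
    for _ in range(n):
        x_star = x + rng.gauss(0.0, 1.0)            # x* ~ q(·|x_i)
        a = min(1.0, p_tilde(x_star) / p_tilde(x))  # A(x_i, x*)
        if rng.random() < a:                        # u ~ U(0, 1)
            x = x_star                              # accept
        out.append(x)                               # on rejection, keep x_i
    return out

chain = metropolis_hastings(100_000)
mean_mh = sum(chain) / len(chain)
var_mh = sum((v - mean_mh) ** 2 for v in chain) / len(chain)
print(round(mean_mh, 1), round(var_mh, 1))  # near 0.0 and 1.0
```

Note that only the ratio p(x*)/p(x_i) is ever needed, which is why MH works with unnormalized densities.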

The Metropolis-Hastings Algorithm (Contd.)

By construction, p(x_i) K_MH(x_{i+1}|x_i) = p(x_{i+1}) K_MH(x_i|x_{i+1}).
This detailed balance implies p(x) is the invariant distribution.
Basic properties:
  Irreducibility: ensure the support of q contains the support of p
  Aperiodicity: ensured, since rejection is always a possibility
Independent sampler: q(x*|x_i) = q(x*), so that A(x_i, x*) = min{1, [p(x*) q(x_i)] / [q(x*) p(x_i)]}.
Metropolis sampler: symmetric q(x*|x_i) = q(x_i|x*), so that A(x_i, x*) = min{1, p(x*)/p(x_i)}.

Simulated Annealing

Problem: find the global maximum of p(x).
Initial idea: run MCMC, estimate p̂(x), compute the max.
Issue: the Markov chain may not come close to the mode(s).
Instead, simulate a non-homogeneous Markov chain whose invariant distribution at iteration i is p_i(x) ∝ p^{1/T_i}(x).
The sample update follows
  x_{i+1} = x*,  if u < A(x_i, x*) = min{1, [p^{1/T_i}(x*) q(x_i|x*)] / [p^{1/T_i}(x_i) q(x*|x_i)]}
  x_{i+1} = x_i, otherwise
T_i decreases following a cooling schedule, with lim_{i→∞} T_i = 0.
The cooling schedule needs proper choice, e.g., T_i = 1 / (C log(i + T_0)).
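The annealed update can be sketched on a hypothetical bimodal target (not from the slides), p̃(x) ∝ exp(−(x² − 1)²) with global modes at x = ±1, using a symmetric proposal and a logarithmic cooling schedule:

```python
# Simulated annealing sketch (hypothetical example, not from the slides):
# maximize p̃(x) ∝ exp(−(x² − 1)²), which has two global modes at x = ±1.
# Symmetric Gaussian proposal, logarithmic cooling T_i = 1 / log(i + 2).
import math, random

def anneal(n_iter=20_000, seed=0):
    rng = random.Random(seed)
    log_p = lambda x: -((x * x - 1.0) ** 2)  # log of the unnormalized target
    x = 5.0                                  # start far from both modes
    for i in range(n_iter):
        T = 1.0 / math.log(i + 2)            # cooling schedule, T_i → 0
        x_star = x + rng.gauss(0.0, 0.5)
        # accept if u < (p̃(x*)/p̃(x))^(1/T); compare logs for stability
        if math.log(rng.random()) < (log_p(x_star) - log_p(x)) / T:
            x = x_star
    return x

x_final = anneal()
print(round(x_final, 2))  # |x_final| should end up near a mode at ±1
```

Early on T is large and downhill moves are often accepted (exploration); as T → 0 the chain concentrates on the modes of p(x).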

Monte Carlo EM

The E-step involves computing an expectation over the latent variables: Q(θ, θ_n) = ∫ log p(x, z|θ) p(z|x, θ_n) dz.
Estimate the expectation using MCMC.
Draw samples using MH with acceptance probability A(z, z*) = min{1, [p(x|z*, θ_n) p(z*|θ_n) q(z|z*)] / [p(x|z, θ_n) p(z|θ_n) q(z*|z)]}.
Several variants:
  Stochastic EM: draw one sample
  Monte Carlo EM: draw multiple samples

Mixtures of MCMC Kernels

A powerful property of MCMC: samplers can be combined.
Let K_1, K_2 be kernels with invariant distribution p:
  The mixture kernel αK_1 + (1 − α)K_2, α ∈ [0, 1], converges to p
  The cycle kernel K_1 K_2 converges to p
Mixtures can use global and local proposals:
  Global proposals explore the entire space (with probability α)
  Local proposals discover finer details (with probability 1 − α)
Example: the target has many narrow peaks.
  The global proposal finds the peaks
  Local proposals explore the neighborhood of each peak (random walk)
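The narrow-peaks example can be sketched with a hypothetical two-mode target (not from the slides): with probability α take a global random-walk step (large sd) that can jump between modes, otherwise a local step (small sd); each kernel is a Metropolis kernel leaving p invariant, so the mixture αK_1 + (1 − α)K_2 does too:

```python
# Mixture-of-kernels sketch (hypothetical example, not from the slides).
# Target: two narrow modes at x = ±4. Global kernel: random walk with sd 8
# (chosen with probability alpha); local kernel: random walk with sd 0.3.
import math, random

def mixture_sampler(n, alpha=0.2, seed=0):
    rng = random.Random(seed)
    p_tilde = lambda x: (math.exp(-((x - 4.0) ** 2) / 0.5) +
                         math.exp(-((x + 4.0) ** 2) / 0.5))
    x, out = 4.0, []
    for _ in range(n):
        sd = 8.0 if rng.random() < alpha else 0.3  # global vs local kernel
        x_star = x + rng.gauss(0.0, sd)            # symmetric proposal
        if rng.random() < min(1.0, p_tilde(x_star) / p_tilde(x)):
            x = x_star
        out.append(x)
    return out

draws = mixture_sampler(200_000)
frac_right = sum(1 for v in draws if v > 0.0) / len(draws)
print(round(frac_right, 1))  # both modes visited, roughly half the mass each
```

With only the local kernel the chain would almost never cross the gap between the modes; the occasional global step is what lets it visit both.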

Basics MCMC Gibbs Sampling Auxiliary Variable Samplers

Cycles of MCMC Kernels

Split a multi-variate state into blocks

slide-96
SLIDE 96

Basics MCMC Gibbs Sampling Auxiliary Variable Samplers

Cycles of MCMC Kernels

Split a multi-variate state into blocks Each block can be updated separately

slide-97
SLIDE 97

Basics MCMC Gibbs Sampling Auxiliary Variable Samplers

Cycles of MCMC Kernels

Split a multi-variate state into blocks Each block can be updated separately Convergence is faster if correlated variables are blocked

slide-98
SLIDE 98

Basics MCMC Gibbs Sampling Auxiliary Variable Samplers

Cycles of MCMC Kernels

Split a multi-variate state into blocks Each block can be updated separately Convergence is faster if correlated variables are blocked Transition kernel is given by KMHCycle(x(i+1)|x(i)) =

nb

  • j=1

KMH(j)(x(i+1)

bj

|x(i)

bj , x(i+1) −[bj] )

slide-99
SLIDE 99

Basics MCMC Gibbs Sampling Auxiliary Variable Samplers

Cycles of MCMC Kernels

Split a multi-variate state into blocks Each block can be updated separately Convergence is faster if correlated variables are blocked Transition kernel is given by KMHCycle(x(i+1)|x(i)) =

nb

  • j=1

KMH(j)(x(i+1)

bj

|x(i)

bj , x(i+1) −[bj] )

Trade-off on block size

slide-100
SLIDE 100

Basics MCMC Gibbs Sampling Auxiliary Variable Samplers

Cycles of MCMC Kernels

Split a multi-variate state into blocks Each block can be updated separately Convergence is faster if correlated variables are blocked Transition kernel is given by KMHCycle(x(i+1)|x(i)) =

nb

  • j=1

KMH(j)(x(i+1)

bj

|x(i)

bj , x(i+1) −[bj] )

Trade-off on block size

If block size is small, chain takes long time to explore the space

slide-101
SLIDE 101

Basics MCMC Gibbs Sampling Auxiliary Variable Samplers

Cycles of MCMC Kernels

Split a multi-variate state into blocks Each block can be updated separately Convergence is faster if correlated variables are blocked Transition kernel is given by KMHCycle(x(i+1)|x(i)) =

nb

  • j=1

KMH(j)(x(i+1)

bj

|x(i)

bj , x(i+1) −[bj] )

Trade-off on block size

If block size is small, chain takes long time to explore the space If block size is large, acceptance probability is low

slide-102
SLIDE 102

Basics MCMC Gibbs Sampling Auxiliary Variable Samplers

Cycles of MCMC Kernels

Split a multi-variate state into blocks Each block can be updated separately Convergence is faster if correlated variables are blocked Transition kernel is given by KMHCycle(x(i+1)|x(i)) =

nb

  • j=1

KMH(j)(x(i+1)

bj

|x(i)

bj , x(i+1) −[bj] )

Trade-off on block size

If block size is small, chain takes long time to explore the space If block size is large, acceptance probability is low

Gibbs sampling effectively uses block size of 1
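A cycle of MH kernels can be sketched as follows: each coordinate of a bivariate Gaussian target is treated as its own block and updated in a fixed sweep by a random-walk MH step, holding the other block fixed. The target, its correlation, and the step size are illustrative choices, not from the slides.

```python
import math
import random

RHO = 0.9  # correlation of the illustrative bivariate Gaussian target

def log_target(x, y):
    # Unnormalized log-density of a bivariate Gaussian with correlation RHO.
    return -(x * x - 2.0 * RHO * x * y + y * y) / (2.0 * (1.0 - RHO * RHO))

def mh_block_step(state, block, rng, step=0.5):
    """One MH update of a single block (here one coordinate), holding the
    other block fixed -- one factor K_MH(j) of the cycle kernel."""
    prop = list(state)
    prop[block] += rng.gauss(0.0, step)
    log_a = log_target(*prop) - log_target(*state)
    if log_a >= 0.0 or rng.random() < math.exp(log_a):
        return tuple(prop)
    return state

def cycle_mh(n_iter, seed=0):
    rng = random.Random(seed)
    state = (0.0, 0.0)
    chain = []
    for _ in range(n_iter):
        for block in (0, 1):       # deterministic sweep over the blocks
            state = mh_block_step(state, block, rng)
        chain.append(state)
    return chain
```

With RHO this high, updating the two correlated coordinates in one block (or via Gibbs) would mix faster than this per-coordinate sweep, which is exactly the trade-off the slide describes.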

slide-103
SLIDE 103

The Gibbs Sampler

For a d-dimensional vector x, assume we know
p(x_j | x_{−j}) = p(x_j | x_1, …, x_{j−1}, x_{j+1}, …, x_d)

Gibbs sampler uses the following proposal distribution:

q(x* | x^{(i)}) = p(x*_j | x^{(i)}_{−j}) if x*_{−j} = x^{(i)}_{−j}, and 0 otherwise

The acceptance probability

A(x^{(i)}, x*) = min{ 1, [p(x*) q(x^{(i)} | x*)] / [p(x^{(i)}) q(x* | x^{(i)})] } = 1

Deterministic scan: all samples are accepted

slide-107
SLIDE 107

The Gibbs Sampler (Contd.)

Initialize x^{(0)}. For i = 0, …, (N − 1):

Sample x_1^{(i+1)} ∼ p(x_1 | x_2^{(i)}, x_3^{(i)}, …, x_d^{(i)})
Sample x_2^{(i+1)} ∼ p(x_2 | x_1^{(i+1)}, x_3^{(i)}, …, x_d^{(i)})
⋯
Sample x_d^{(i+1)} ∼ p(x_d | x_1^{(i+1)}, …, x_{d−1}^{(i+1)})

Possible to have MH steps inside a Gibbs sampler
For d = 2, the Gibbs sampler is the data augmentation algorithm
For Bayes nets, the conditioning is on the Markov blanket:

p(x_j | x_{−j}) ∝ p(x_j | x_{pa(j)}) ∏_{k∈ch(j)} p(x_k | x_{pa(k)})
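The deterministic-scan algorithm above can be sketched for the classic bivariate Gaussian example, where both full conditionals are univariate Gaussians. The target (unit variances, correlation rho) and its parameters are illustrative, not from the slides.

```python
import random

def gibbs_bivariate_normal(n_iter, rho=0.8, seed=0):
    """Deterministic-scan Gibbs sampler for a bivariate Gaussian with unit
    variances and correlation rho. The full conditionals are
    x1 | x2 ~ N(rho*x2, 1 - rho^2) and x2 | x1 ~ N(rho*x1, 1 - rho^2),
    and every proposal is accepted."""
    rng = random.Random(seed)
    sd = (1.0 - rho * rho) ** 0.5
    x1, x2 = 0.0, 0.0
    samples = []
    for _ in range(n_iter):
        x1 = rng.gauss(rho * x2, sd)   # sample x1^{(i+1)} ~ p(x1 | x2^{(i)})
        x2 = rng.gauss(rho * x1, sd)   # sample x2^{(i+1)} ~ p(x2 | x1^{(i+1)})
        samples.append((x1, x2))
    return samples
```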

slide-115
SLIDE 115


Bayesian LDA

slide-116
SLIDE 116

Gibbs Sampler for Bayesian LDA

The conditional distribution
p(z_ℓ = h | z_{−ℓ}, w) ∝ p(z_ℓ = h | z_{−ℓ}) p(w_ℓ | z_ℓ = h, z_{−ℓ}, w_{−ℓ})

Notation:
C^{DT}_{d_{−ℓ},h} = number of words from document d assigned to topic h, excluding the current word
C^{WT}_{w_{−ℓ},h} = number of times word w_ℓ is assigned to topic h, excluding the current word

Then, the first term
p(z_ℓ = h | z_{−ℓ}) = (C^{DT}_{d_{−ℓ},h} + α) / (∑_{t=1}^{T} C^{DT}_{d_{−ℓ},t} + Tα)

The second term
p(w_ℓ | z_ℓ = h, z_{−ℓ}, w_{−ℓ}) = (C^{WT}_{w_{−ℓ},h} + β) / (∑_{w=1}^{W} C^{WT}_{w_{−ℓ},h} + Wβ)
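Putting the two terms together gives a collapsed Gibbs update for LDA. The sketch below is a minimal implementation under stated assumptions: `docs`, `T`, `W`, `alpha`, `beta` are hypothetical names, and the doc-side denominator is dropped from the sampling weights because it is constant in h.

```python
import random

def lda_gibbs(docs, T, W, alpha=0.5, beta=0.1, n_iter=100, seed=0):
    """Minimal collapsed Gibbs sampler for LDA. docs is a list of documents,
    each a list of word ids in 0..W-1; T is the number of topics.
    Returns the doc-topic and word-topic count matrices."""
    rng = random.Random(seed)
    C_DT = [[0] * T for _ in docs]        # C^{DT}: doc-topic counts
    C_WT = [[0] * T for _ in range(W)]    # C^{WT}: word-topic counts
    C_T = [0] * T                         # per-topic totals (column sums of C^{WT})
    z = []                                # topic assignment for every token
    for d, doc in enumerate(docs):
        zd = []
        for w in doc:
            t = rng.randrange(T)          # random initial assignment
            zd.append(t)
            C_DT[d][t] += 1; C_WT[w][t] += 1; C_T[t] += 1
        z.append(zd)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                t = z[d][n]               # remove the current token from counts
                C_DT[d][t] -= 1; C_WT[w][t] -= 1; C_T[t] -= 1
                # p(z = h | rest) ∝ (C_DT + alpha) * (C_WT + beta) / (C_T + W*beta);
                # the doc-side denominator is omitted as it does not depend on h.
                probs = [(C_DT[d][h] + alpha) * (C_WT[w][h] + beta) /
                         (C_T[h] + W * beta) for h in range(T)]
                r = rng.random() * sum(probs)
                acc, t = 0.0, T - 1
                for h, p in enumerate(probs):
                    acc += p
                    if r < acc:
                        t = h
                        break
                z[d][n] = t               # reassign and restore counts
                C_DT[d][t] += 1; C_WT[w][t] += 1; C_T[t] += 1
    return C_DT, C_WT
```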

slide-122
SLIDE 122

Basic Idea

Sometimes easier to sample from p(x, u) rather than p(x)
Sample (x_i, u_i), and then ignore u_i
Consider two well-known examples:

Hybrid Monte Carlo
Slice sampling

slide-127
SLIDE 127

Hybrid Monte Carlo

Uses gradient of the target distribution
Improves “mixing” in high dimensions
Effectively, take steps based on the gradient of p(x)
Introduce auxiliary momentum variables u ∈ R^d with p(x, u) = p(x) N(u; 0, I_d)
Gradient vector ∆(x) = ∂ log p(x)/∂x, step-size ρ
Take L gradient-based steps to get the proposal candidate
When L = 1, one obtains the Langevin algorithm:

x* = x_0 + ρu_0 = x^{(i)} + ρ(u* + ρ∆(x^{(i)})/2)

slide-134
SLIDE 134

Hybrid Monte Carlo (Contd.)

Initialize x^{(0)}. For i = 0, …, (N − 1):

Sample v ∼ U[0, 1], u* ∼ N(0, I_d)
Let x_0 = x^{(i)}, u_0 = u* + ρ∆(x_0)/2
For ℓ = 1, …, L, with ρ_ℓ = ρ for ℓ < L and ρ_L = ρ/2:
x_ℓ = x_{ℓ−1} + ρu_{ℓ−1}
u_ℓ = u_{ℓ−1} + ρ_ℓ∆(x_ℓ)
Set (x^{(i+1)}, u^{(i+1)}) = (x_L, u_L) if v < A = min{ 1, [p(x_L)/p(x^{(i)})] exp(−(‖u_L‖² − ‖u*‖²)/2) }, and (x^{(i)}, u^{(i)}) otherwise

Tradeoffs for ρ, L:
Large ρ gives low acceptance, small ρ needs many steps
Large L gives candidates far from x_0, but is expensive
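The algorithm above can be sketched for a 1-D standard normal target, where ∆(x) = −x. The target and the values of ρ and L are illustrative choices, not from the slides.

```python
import math
import random

def hmc(n_iter, rho=0.2, L=20, seed=0):
    """Hybrid Monte Carlo for a 1-D standard normal target: a half-step on
    the momentum, L position/momentum updates with a final half-step, then
    an accept/reject test on the change in energy."""
    rng = random.Random(seed)
    grad = lambda x: -x              # ∆(x) = d log p(x)/dx for p(x) ∝ exp(-x²/2)
    log_p = lambda x: -0.5 * x * x   # unnormalized log target
    x = 0.0
    chain = []
    for _ in range(n_iter):
        u_star = rng.gauss(0.0, 1.0)            # u* ~ N(0, 1)
        xl = x
        u = u_star + 0.5 * rho * grad(xl)       # u_0 = u* + ρ∆(x_0)/2
        for l in range(1, L + 1):
            xl = xl + rho * u                   # x_ℓ = x_{ℓ-1} + ρ u_{ℓ-1}
            rl = 0.5 * rho if l == L else rho   # ρ_ℓ = ρ, except ρ_L = ρ/2
            u = u + rl * grad(xl)               # u_ℓ = u_{ℓ-1} + ρ_ℓ ∆(x_ℓ)
        # log A = log p(x_L) - log p(x^{(i)}) - (‖u_L‖² - ‖u*‖²)/2
        log_A = (log_p(xl) - log_p(x)) - 0.5 * (u * u - u_star * u_star)
        if log_A >= 0.0 or rng.random() < math.exp(log_A):
            x = xl
        chain.append(x)
    return chain
```

Note that the half-step at the start and the half-step at ℓ = L together make the trajectory a standard leapfrog integration, which is what keeps the acceptance probability high.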

slide-142
SLIDE 142

The Slice Sampler

Construct extended target distribution

p*(x, u) = 1 if 0 ≤ u ≤ p(x), and 0 otherwise

It follows that ∫ p*(x, u) du = ∫_0^{p(x)} du = p(x)

From the Gibbs sampling perspective:
p(u | x) = U[0, p(x)]
p(x | u) = U_A, where A = {x : p(x) ≥ u}

Algorithm is easy if A is easy to figure out
Otherwise, several auxiliary variables need to be introduced
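For a target where A is available in closed form, both Gibbs steps are exact. A minimal sketch for the unnormalized standard normal p(x) = exp(−x²/2), where A is the interval [−√(−2 log u), √(−2 log u)] (an illustrative target, not from the slides):

```python
import math
import random

def slice_sampler(n_iter, seed=0):
    """Slice sampler for the unnormalized standard normal p(x) = exp(-x²/2).
    The slice A = {x : p(x) >= u} is the interval
    [-sqrt(-2 log u), sqrt(-2 log u)], so both Gibbs steps are exact."""
    rng = random.Random(seed)
    x = 0.0
    samples = []
    for _ in range(n_iter):
        u = rng.uniform(0.0, math.exp(-0.5 * x * x))  # u | x ~ U[0, p(x)]
        u = max(u, 1e-300)                            # guard against log(0)
        half = math.sqrt(-2.0 * math.log(u))          # boundary of the slice A
        x = rng.uniform(-half, half)                  # x | u ~ U_A
        samples.append(x)
    return samples
```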

slide-147
SLIDE 147


The Slice Sampler (Contd.)