

SLIDE 1

Probabilistic Graphical Models

Markov Chain Monte Carlo Inference

Siamak Ravanbakhsh

Fall 2019

SLIDE 2

Learning objectives

  • Markov chains
  • the idea behind Markov Chain Monte Carlo (MCMC)
  • two important examples: Gibbs sampling and the Metropolis-Hastings algorithm

SLIDES 3-4

Problem with likelihood weighting

Recap:
  • use a topological ordering
  • sample conditioned on the parents
  • if observed: keep the observed value and update the weight

Issues:
  • observing the child does not affect the parent's assignment
  • only applies to Bayes-nets

SLIDES 5-6

Gibbs sampling

Idea:
  • iteratively sample each variable conditioned on its Markov blanket: X_i ∼ p(x_i ∣ X_MB(i))
  • if X_i is observed: keep the observed value

This is equivalent to first simplifying the model by removing the observed variables, then sampling from the simplified Gibbs distribution.

After many Gibbs sampling iterations, X ∼ P.

SLIDES 7-10

Example: Ising model

Recall the Ising model over x_i ∈ {−1, +1}:

  p(x) ∝ exp( ∑_i x_i h_i + ∑_{i,j∈E} x_i x_j J_ij )

Sample each node i conditioned on its Markov blanket:

  p(x_i = +1 ∣ X_MB(i))
    = exp(h_i + ∑_{j∈MB(i)} J_ij X_j) / [ exp(h_i + ∑_{j∈MB(i)} J_ij X_j) + exp(−h_i − ∑_{j∈MB(i)} J_ij X_j) ]
    = σ( 2 h_i + 2 ∑_{j∈MB(i)} J_ij X_j )

Compare with the mean-field update σ( 2 h_i + 2 ∑_{j∈MB(i)} J_ij μ_j ), which uses the means μ_j in place of the sampled values X_j.

SLIDE 11

Markov chain

A sequence of random variables with the Markov property:

  P(X^(t) ∣ X^(1), …, X^(t−1)) = P(X^(t) ∣ X^(t−1))

Its graphical model is the chain X^(1) → X^(2) → … → X^(T−1) → X^(T).

Many applications:
  • language modeling: X is a word or a character
  • physics: with the correct choice of X, the world is Markov

SLIDES 12-14

Transition model

  P(X^(t) = x ∣ X^(t−1) = x′) = T(x′, x)

is called the transition model; the notation is just a conditional probability, and we can think of it as a matrix T. We assume a homogeneous chain:

  P(X^(t) ∣ X^(t−1)) = P(X^(t+1) ∣ X^(t))   ∀t

  • conditional probabilities remain the same across time-steps

A state-transition diagram and its transition matrix, e.g.:

  T = [ .25   0   .75
         0   .7   .3
        .5   .5    0  ]

Evolving the distribution:

  P(X^(t+1) = x) = ∑_{x′∈Val(X)} P(X^(t) = x′) T(x′, x)
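The update above is a vector-matrix product. A small sketch using the three-state matrix from the slide:

```python
# Transition matrix from the slide: T[i][j] = P(next = j | current = i)
T = [[0.25, 0.0, 0.75],
     [0.0,  0.7, 0.3],
     [0.5,  0.5, 0.0]]

def evolve(p, T):
    """One step: P^(t+1)(x) = sum_x' P^(t)(x') T(x', x)."""
    n = len(p)
    return [sum(p[i] * T[i][j] for i in range(n)) for j in range(n)]

p = [1.0, 0.0, 0.0]        # start deterministically in state x1
for _ in range(3):
    p = evolve(p, T)
    print([round(v, 4) for v in p])
```

Iterating this update many times drives p toward the stationary distribution discussed on the following slides.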
SLIDES 15-19

Markov Chain Monte Carlo (MCMC)

Example: a state-transition diagram for a grasshopper random walk, with initial distribution P(X^(0) = 0) = 1.

After t = 50 steps, the distribution is almost uniform: P_t(x) ≈ 1/9 ∀x. So we can use the chain to sample from the (approximately) uniform distribution P_t(X) ≈ 1/9. Why is it uniform? (mixing image: Murphy's book)

MCMC generalizes this idea beyond the uniform distribution: given a target P*(X) that we want to sample from, pick the transition model such that the limiting distribution of the chain is P*(X).

SLIDES 20-22

Stationary distribution

Given a transition model T(x, x′), if the chain converges then

  P^(t)(x) ≈ P^(t+1)(x) = ∑_{x′} P^(t)(x′) T(x′, x)

This condition defines the stationary distribution π (the global balance equation):

  π(X = x) = ∑_{x′∈Val(X)} π(X = x′) T(x′, x)

Example: finding the stationary distribution of the three-state chain above:

  π(x1) = .25 π(x1) + .5 π(x3)
  π(x2) = .7 π(x2) + .5 π(x3)
  π(x3) = .75 π(x1) + .3 π(x2)
  π(x1) + π(x2) + π(x3) = 1

Solution: π(x1) = .2, π(x2) = .5, π(x3) = .3

SLIDES 23-26

Stationary distribution as an eigenvector

Viewing T(·, ·) as a matrix and P_t(x) as a vector, the evolution of the distribution is

  P^(t+1) = T^T P^(t),   and over multiple steps   P^(t+m) = (T^T)^m P^(t)

For the stationary distribution: π = T^T π, i.e., π is an eigenvector of T^T with eigenvalue 1. For the example above:

  T^T [.2  .5  .3]^T = [.2  .5  .3]^T

(we can produce π by running the chain, which is exactly power iteration).
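Power iteration is just repeated application of T^T to any initial distribution. A sketch, reusing the three-state T from the example:

```python
# Transition matrix from the earlier example: T[i][j] = P(next = j | current = i)
T = [[0.25, 0.0, 0.75],
     [0.0,  0.7, 0.3],
     [0.5,  0.5, 0.0]]

def power_iteration(T, steps=500):
    """Run the chain forward: pi <- T^T pi, starting from uniform."""
    n = len(T)
    pi = [1.0 / n] * n
    for _ in range(steps):
        pi = [sum(pi[i] * T[i][j] for i in range(n)) for j in range(n)]
    return pi

pi = power_iteration(T)
print([round(v, 3) for v in pi])   # converges to the solved values .2, .5, .3
```

Because the chain is irreducible and aperiodic (state x1 has a self-loop), the iteration converges from any starting distribution.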

SLIDES 27-30

Stationary distribution: existence & uniqueness

Irreducible: we should be able to reach any x′ from any x
  • otherwise, π is not unique

Aperiodic: the chain should not have a fixed cyclic behavior
  • otherwise, the chain does not converge (it oscillates)

Every aperiodic and irreducible chain (with a finite domain) has a unique limiting distribution π such that

  π(X = x) = ∑_{x′∈Val(X)} π(X = x′) T(x′, x)

A sufficient condition (a regular chain): there exists a K such that the probability of reaching any destination from any source in K steps is positive. This applies to discrete & continuous domains.

SLIDES 31-34

MCMC in graphical models

Distinguishing the "graphical models" involved:

1: the Markov chain X^(1) → X^(2) → … → X^(T), with limiting distribution π(X)
2: the state-transition diagram (not shown), which has exponentially many nodes: #nodes = ∣Val(X)∣
3: the graphical model P(X) from which we want to sample, e.g. X = [C, D, I, G, S, L, J, H]

  • objective: design the Markov chain transition so that π(X) = P(X)

SLIDES 35-36

Multiple transition models

Idea: have multiple transition models T_1(x, x′), T_2(x, x′), …, T_n(x, x′) (aka kernels), each making local changes to x. For example, with x = (x1, x2), T_1 only updates x1 and T_2 only updates x2.

Using a single kernel we may not be able to visit all the states, while their combination is "ergodic".

If each kernel leaves π stationary,

  π(X = x) = ∑_{x′∈Val(X)} π(X = x′) T_k(x′, x)   ∀k

then we can combine the kernels by
  • mixing them: T(x′, x) = ∑_k p(k) T_k(x′, x)
  • cycling them: T(x′, x) = ∫ T_1(x′, x^[1]) T_2(x^[1], x^[2]) ⋯ T_n(x^[n−1], x) dx^[1] ⋯ dx^[n−1]
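A tiny sketch of why a single local kernel can be reducible while the cycled combination is not: for two binary variables, a kernel that resamples only x1 can never change x2, but cycling the two coordinate kernels reaches all four states. The uniform target over {0,1}^2 is an illustrative choice:

```python
import itertools
import random

def kernel(update_idx, state, rng):
    """Coordinate kernel T_i: resample one coordinate uniformly,
    leaving the other coordinate fixed (target = uniform over {0,1}^2)."""
    new = list(state)
    new[update_idx] = rng.randrange(2)
    return tuple(new)

rng = random.Random(0)
state = (0, 0)
visited = {state}
for _ in range(100):
    for i in (0, 1):          # cycle the kernels T_1, T_2
        state = kernel(i, state, rng)
        visited.add(state)

print(sorted(visited))        # all four states are reachable
```

Using kernel T_1 alone, the second coordinate would stay at its initial value forever, so that chain alone is not irreducible.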

SLIDES 37-41

Revisiting Gibbs sampling

  • one kernel for each variable: perform local, conditional updates and cycle the local kernels

  T_i(x^(t), x^(t+1)) = P*(x_i^(t+1) ∣ x_−i^(t)) I(x_−i^(t+1) = x_−i^(t))

where, by the Markov blanket property,

  P*(x_i^(t+1) ∣ x_−i^(t)) = P*(x_i^(t+1) ∣ x_MB(i)^(t))

π(X) = P*(X) is the stationary distribution for this Markov chain. If P*(x) > 0 ∀x, then the chain is regular, i.e., it converges to its unique stationary distribution.

SLIDES 42-45

Some variations

Local moves can get stuck in modes of P*(X): updates using P(x1 ∣ x2), P(x2 ∣ x1) will have trouble exploring these modes.

Block Gibbs sampling
  • idea: each kernel updates a block of variables

Collapsed Gibbs sampling
  • marginalize out some variables
  • ordinary case: p(X ∣ Y, Z), P(Y ∣ X, Z), P(Z ∣ X, Y)
  • marginalize over Y: P(X ∣ Z), P(Z ∣ X, Y), or P(X ∣ Z), P(Z ∣ X)
  • involves analytical derivation of the collapsed updates

SLIDES 46-51

Detailed balance

A Markov chain is reversible if, for a unique π,

  π(x) T(x, x′) = π(x′) T(x′, x)   ∀x, x′   (detailed balance)

i.e., transitions occur with the same frequency in both directions.

Detailed balance implies global balance: integrating the left-hand side over x′,

  ∫ π(x′) T(x′, x) dx′ = ∫ π(x) T(x, x′) dx′ = π(x) ∫ T(x, x′) dx′ = π(x)

Detailed balance is a stronger condition: a chain can satisfy global balance with π = [.4, .4, .2] while violating detailed balance (example: Murphy's book).

If a Markov chain is regular and satisfies detailed balance with respect to π, then π is its unique stationary distribution. This is analogous to the theorem for global balance, and checking detailed balance is sometimes easier.

What happens if T is symmetric?
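Both balance conditions are easy to check numerically. A sketch, reusing the three-state chain from the earlier slides (which, with π = (.2, .5, .3), happens to satisfy detailed balance); the deterministic 3-cycle below is a simpler counterexample than the book's, chosen to show global balance without detailed balance:

```python
T = [[0.25, 0.0, 0.75],
     [0.0,  0.7, 0.3],
     [0.5,  0.5, 0.0]]
pi = [0.2, 0.5, 0.3]

def detailed_balance(pi, T, tol=1e-12):
    """Check pi(x) T(x, x') == pi(x') T(x', x) for all pairs."""
    n = len(pi)
    return all(abs(pi[i] * T[i][j] - pi[j] * T[j][i]) < tol
               for i in range(n) for j in range(n))

def global_balance(pi, T, tol=1e-12):
    """Check pi(x) == sum_x' pi(x') T(x', x) for all x."""
    n = len(pi)
    return all(abs(pi[j] - sum(pi[i] * T[i][j] for i in range(n))) < tol
               for j in range(n))

print(detailed_balance(pi, T), global_balance(pi, T))   # True True

# A deterministic cycle with uniform pi satisfies global but not detailed balance:
C = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]
u = [1/3, 1/3, 1/3]
print(detailed_balance(u, C), global_balance(u, C))     # False True
```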

SLIDES 52-56

Using a proposal for the chain

Given P*, design a chain to sample from P*.

Idea:
  • use a proposal transition T_q(x, x′) that we can sample from, where T_q(x, ·) is a regular chain (reaching every state in K steps has a non-zero probability)
  • accept the proposed move with probability A(x, x′), chosen to achieve detailed balance for the desired P*

Metropolis algorithm: if the proposal is symmetric, T_q(x, x′) = T_q(x′, x), use

  A(x, x′) ≜ min(1, P*(x′) / P*(x))

  • this always accepts a move that increases P*, and may accept it otherwise (image: Wikipedia)

Metropolis-Hastings algorithm: if the proposal is NOT symmetric, then

  A(x, x′) ≜ min(1, [P*(x′) T_q(x′, x)] / [P*(x) T_q(x, x′)])
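A minimal sketch of the Metropolis algorithm with a symmetric Gaussian random-walk proposal; the standard-normal target and the step size are illustrative choices:

```python
import math
import random

def p_star(x):
    """Unnormalized target: standard normal up to a constant."""
    return math.exp(-0.5 * x * x)

def metropolis(n_samples, step=1.0, seed=0):
    rng = random.Random(seed)
    x = 0.0
    samples = []
    for _ in range(n_samples):
        x_new = x + rng.gauss(0.0, step)                 # symmetric proposal
        if rng.random() < min(1.0, p_star(x_new) / p_star(x)):
            x = x_new                                    # accept; otherwise keep x
        samples.append(x)
    return samples

samples = metropolis(20000)
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(round(mean, 2), round(var, 2))
```

For an asymmetric proposal, multiply the ratio by T_q(x′, x) / T_q(x, x′), which is exactly the Hastings correction above.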

SLIDES 57-61

Metropolis-Hastings: why does it sample from P*?

Derive the transition kernel. A move to a different state is accepted with probability A, so

  T(x, x′) = T_q(x, x′) A(x, x′)   ∀x′ ≠ x

A proposal to stay is always accepted, and a proposed move to a new state may be rejected, so

  T(x, x) = T_q(x, x) + ∑_{x′≠x} (1 − A(x, x′)) T_q(x, x′)

Substitute this into detailed balance (for x′ ≠ x only; does it hold?):

  π(x) T_q(x, x′) A(x, x′) =? π(x′) T_q(x′, x) A(x′, x)

  π(x) T_q(x, x′) min(1, [π(x′) T_q(x′, x)] / [π(x) T_q(x, x′)]) = π(x′) T_q(x′, x) min(1, [π(x) T_q(x, x′)] / [π(x′) T_q(x′, x)])

Both sides equal min(π(x) T_q(x, x′), π(x′) T_q(x′, x)), so detailed balance holds.

Gibbs sampling is a special case, with A(x, x′) = 1 all the time!

SLIDES 62-64

Sampling from the chain

In the limit T → ∞, P_T = π = P*. How long should we wait for D(P_T, π) < ϵ?

Mixing time: O( (1 / (1 − λ2)) log(N / ϵ) ), where N = #states (exponential) and λ2 is the second-largest eigenvalue of T.

In practice:
  • run the chain for a burn-in period (T steps)
  • collect samples (a few more steps)
  • multiple restarts can ensure better coverage

Example: a Potts model with ∣Val(X)∣ = 5 different colors on a 128×128 grid,

  p(x) ∝ exp( ∑_i h(x_i) + ∑_{i,j∈E} .66 I(x_i = x_j) )

sampled with Gibbs sampling; samples shown after 200 iterations vs. 10,000 iterations (image: Murphy's book).

SLIDES 65-68

Diagnosing convergence

Diagnosing non-convergence is a difficult problem; some heuristics:
  • run multiple chains (compare sample statistics)
  • auto-correlation within each chain

Example: sampling from a mixture of two 1D Gaussians (3 chains, shown as colors) with Metropolis-Hastings (MH), using increasing step sizes for the proposal. In the trace plots, small step sizes show high auto-correlation, while a step size that is too large gives a high rejection rate. (images: Andrieu et al. '03 and Murphy's book)
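Lag-k auto-correlation of a chain's samples is a simple numerical version of this diagnostic. A sketch; the i.i.d. and AR(1) "chains" below are illustrative stand-ins for a well-mixing and a sticky sampler:

```python
import random

def autocorr(samples, lag):
    """Sample auto-correlation at a given lag."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((s - mean) ** 2 for s in samples) / n
    cov = sum((samples[t] - mean) * (samples[t + lag] - mean)
              for t in range(n - lag)) / n
    return cov / var

rng = random.Random(0)
# A perfectly mixing chain (i.i.d. draws) has near-zero auto-correlation,
# while a sticky chain stays close to 1 at small lags.
iid = [rng.gauss(0, 1) for _ in range(5000)]
sticky = [0.0]
for _ in range(4999):
    sticky.append(0.99 * sticky[-1] + rng.gauss(0, 1))   # AR(1), rho = .99

print(round(autocorr(iid, 10), 2), round(autocorr(sticky, 10), 2))
```

A slowly decaying auto-correlation curve is the numerical counterpart of the "high auto-correlation" trace plots above.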

slide-69
SLIDE 69

Summary Summary

Markov Chain: can model the "evolution" of an initial distribution converges to a stationary distribution

slide-70
SLIDE 70

Summary Summary

Markov Chain: can model the "evolution" of an initial distribution converges to a stationary distribution Markov Chain Monte Carlo: design a Markov chain: stationary dist. is what we want to sample run the chain to produce samples

slide-71
SLIDE 71

Summary Summary

Markov Chain: can model the "evolution" of an initial distribution converges to a stationary distribution Markov Chain Monte Carlo: design a Markov chain: stationary dist. is what we want to sample run the chain to produce samples Two MCMC methods: Gibbs sampling Metropolis-Hastings