SLIDE 1

Approximate inference: Sampling methods

Probabilistic Graphical Models, Sharif University of Technology, Spring 2018, Soleymani

SLIDE 2

Approximate inference

 Approximate inference techniques
 Deterministic approximation
 Variational algorithms
 Stochastic simulation / sampling methods

SLIDE 3

Sampling-based estimation

 Assume that 𝒟 = {x^(1), …, x^(M)} is a set of i.i.d. samples drawn from the desired distribution p
 For any distribution p and function f, we can estimate E_p[f] by the empirical expectation:
E_p[f] ≈ (1/M) ∑_{m=1}^{M} f(x^(m))
 Expectations reveal interesting properties of the distribution p
 Means and variances of p
 Probabilities of events: e.g., we can find p(x = k) by estimating E_p[f] with f(x) = 1{x = k}
 We can thus use a stochastic representation of a complex distribution
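The empirical expectation above can be sketched in a few lines. This is a minimal illustration, not part of the slides: the target p is chosen as a standard normal so both estimates can be checked against their closed-form values.

```python
# A minimal sketch of sample-based estimation (the choice of p is illustrative).
# We estimate E_p[f] and an event probability for a known distribution,
# so the Monte Carlo answers can be compared with closed forms.
import random

random.seed(0)
M = 200_000
samples = [random.gauss(0.0, 1.0) for _ in range(M)]  # x^(m) ~ p = N(0, 1)

# Empirical expectation: E_p[f] ~= (1/M) sum f(x^(m)), here f(x) = x^2
mean_sq = sum(x * x for x in samples) / M             # true value: 1

# Event probability via an indicator function f(x) = 1{x > 0}
p_positive = sum(1 for x in samples if x > 0) / M     # true value: 0.5

print(round(mean_sq, 2), round(p_positive, 2))
```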

SLIDE 4

Bounds on error

 Let p̂ = (1/M) ∑_{m=1}^{M} f(x^(m)) with x^(m) ~ p(x), and let p = E_p[f(x)]
 Hoeffding bound (additive bound on the error):
P_𝒟( p̂ ∉ [p − ε, p + ε] ) ≤ 2e^(−2Mε²)
 Chernoff bound (multiplicative bound on the error):
P_𝒟( p̂ ∉ [p(1 − ε), p(1 + ε)] ) ≤ 2e^(−Mpε²/3)

SLIDE 5

The mean and variance of the estimator

 For samples drawn independently from the distribution p:
f̂ = (1/M) ∑_{m=1}^{M} f(x^(m))
E[f̂] = E[f]
var[f̂] = (1/M) E[(f − E[f])²]

SLIDE 6

Monte Carlo methods

 Use a set of samples to answer an inference query
 expectations can be approximated using sample-based averages
 Asymptotically exact and easy to apply to arbitrary problems
 Challenges:
 drawing samples from many distributions is not trivial
 are the gathered samples enough?
 are all samples useful, or equally useful?

SLIDE 7

Generating samples from a distribution

 Assume that we have an algorithm that generates (pseudo-)random numbers distributed uniformly over (0,1)
 How do we generate samples from other distributions? First, we consider simple cases:
 Bernoulli
 Multinomial
 Other standard distributions

SLIDE 8

Transformation technique

 We intend to generate samples from standard distributions
 map the values produced by a uniform random number generator such that the mapped samples have the desired distribution
 Choose a function f(.) such that the values y = f(x) have a specific desired distribution p(y):
p(y) = p(x) |dx/dy|
 Since p(x) = 1 on (0,1), we have:
x = ∫_{−∞}^{y} p(y′) dy′
 If we define h(y) ≡ ∫_{−∞}^{y} p(y′) dy′, then y = h^(−1)(x)

SLIDE 9

Transformation technique

 Inverse CDF sampling:
 if x ~ U(0,1) and h(.) is the CDF of p, then h^(−1)(x) ~ p
 since we need to calculate and then invert the indefinite integral of p, this is feasible only for a limited number of simple distributions
 Thus, we next consider rejection sampling and importance sampling, which can be used as important components in more general sampling techniques
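Inverse CDF sampling works out in closed form for the exponential distribution, one of the "simple distributions" for which the CDF is invertible. A minimal sketch (the rate value is illustrative):

```python
# Inverse-CDF (transformation) sampling sketch for Exponential(rate):
# h(y) = 1 - exp(-rate*y), so h^{-1}(x) = -ln(1 - x)/rate for x ~ U(0,1).
import math
import random

def sample_exponential(rate, rng):
    x = rng.random()                   # x ~ Uniform(0, 1)
    return -math.log(1.0 - x) / rate   # y = h^{-1}(x) ~ Exponential(rate)

rng = random.Random(0)
rate = 2.0
ys = [sample_exponential(rate, rng) for _ in range(100_000)]
print(round(sum(ys) / len(ys), 2))  # should be near the true mean 1/rate = 0.5
```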

SLIDE 10

Rejection sampling

 Suppose we wish to sample from p(x) = p̃(x)/Z
 p(x) is difficult to sample from, but p̃(x) is easy to evaluate
 We choose a simpler (proposal) distribution q(x) that we can sample from more easily
 where ∃k such that kq(x) ≥ p̃(x) for all x
 Sample from q(x): x* ~ q(x)
 Accept x* with probability p̃(x*) / (kq(x*))

[Figure: envelope kq(x) lying above p̃(x), with a proposed point x*]

SLIDE 11

Rejection sampling

 Correctness: the density of accepted samples is
[p̃(x)/(kq(x))] q(x) / ∫ [p̃(x)/(kq(x))] q(x) dx = p̃(x) / ∫ p̃(x) dx = p(x)
 Probability of acceptance:
P(accept) = ∫ [p̃(x)/(kq(x))] q(x) dx = (1/k) ∫ p̃(x) dx
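The procedure and the acceptance-rate formula can be checked numerically. In this sketch the target p̃(x) = x(1 − x) on [0,1] (a Beta(2,2) up to a constant) and the uniform proposal are illustrative choices; with k = 1/4 the theoretical acceptance rate is Z/k = (1/6)/(1/4) = 2/3.

```python
# Rejection sampling sketch: target p(x) proportional to x(1-x) on [0,1]
# (i.e. Beta(2,2)), proposal q = Uniform(0,1), constant k = 1/4 so that
# k*q(x) >= p~(x) everywhere.
import random

def p_tilde(x):
    return x * (1.0 - x)   # unnormalized target, max value 1/4 at x = 1/2

rng = random.Random(0)
k = 0.25
accepted = []
proposed = 0
while len(accepted) < 50_000:
    x = rng.random()                           # x* ~ q
    proposed += 1
    if rng.random() < p_tilde(x) / (k * 1.0):  # accept w.p. p~(x*)/(k q(x*))
        accepted.append(x)

mean = sum(accepted) / len(accepted)     # Beta(2,2) has mean 0.5
accept_rate = len(accepted) / proposed   # theory: Z/k = (1/6)/(1/4) = 2/3
print(round(mean, 2), round(accept_rate, 2))
```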

SLIDE 12

Adaptive rejection sampling

 It can be difficult to determine a suitable analytic form for q
 We can use envelope functions to define q when p(x) is log-concave
 Intersections of tangent lines to ln p(x) are used to construct q
 initially, gradients are evaluated at an initial set of grid points and the corresponding tangent lines are found
 in each iteration, a sample is drawn from the envelope distribution
 the envelope distribution is piecewise exponential, so drawing a sample from it is straightforward
 if the sample is rejected, it is incorporated into the set of grid points, a new tangent line is computed, and q is thereby refined

[Figure: tangent lines to ln p(x) at grid points x1, x2, x3]

SLIDE 13

High-dimensional rejection sampling

 Problem: rejection sampling has a low acceptance rate in high-dimensional spaces
 the acceptance rate decreases exponentially with dimensionality
 Example:
 using q = N(μ, σ_q² I) to sample from p = N(μ, σ_p² I)
 if σ_q exceeds σ_p by just 1% and d = 1000, then (σ_q/σ_p)^d ≈ 20,000, so the optimal acceptance rate is 1/20,000, which is far too small

SLIDE 14

Importance sampling

 Suppose sampling from p is hard
 A simpler proposal distribution q is used instead
 If q dominates p (i.e., q(x) > 0 whenever p(x) > 0), we can sample from q and reweight the obtained samples:
E_p[f(x)] = ∫ f(x) p(x) dx = ∫ f(x) (p(x)/q(x)) q(x) dx
E_p[f(x)] ≈ (1/M) ∑_{m=1}^{M} f(x^(m)) p(x^(m))/q(x^(m)),  x^(m) ~ q(x)
E_p[f(x)] ≈ (1/M) ∑_{m=1}^{M} f(x^(m)) w^(m),  where w^(m) = p(x^(m))/q(x^(m))
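The reweighting formula can be sketched directly. The target N(0,1), the wider proposal N(0,2²), and the test function f(x) = x² are illustrative choices; the proposal dominates the target, as required.

```python
# Importance sampling sketch: estimate E_p[x^2] for p = N(0,1)
# using samples from a wider proposal q = N(0, 2^2) and weights w = p/q.
import math
import random

def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

rng = random.Random(0)
M = 100_000
est = 0.0
for _ in range(M):
    x = rng.gauss(0.0, 2.0)                                  # x^(m) ~ q
    w = normal_pdf(x, 0.0, 1.0) / normal_pdf(x, 0.0, 2.0)    # w^(m) = p/q
    est += (x * x) * w
est /= M
print(round(est, 2))  # true value E_p[x^2] = 1
```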

SLIDE 15

Normalized importance sampling

 Suppose that we can only evaluate p̃(x), where p(x) = p̃(x)/Z_p. With r(x) = p̃(x)/q(x):
E_p[f(x)] = ∫ f(x) p(x) dx = (1/Z_p) ∫ f(x) (p̃(x)/q(x)) q(x) dx
Z_p = ∫ p̃(x) dx = ∫ (p̃(x)/q(x)) q(x) dx = ∫ r(x) q(x) dx
 Therefore:
E_p[f(x)] = ∫ f(x) r(x) q(x) dx / ∫ r(x) q(x) dx
E_p[f(x)] ≈ [(1/M) ∑_{m=1}^{M} f(x^(m)) r^(m)] / [(1/M) ∑_{l=1}^{M} r^(l)] = ∑_{m=1}^{M} f(x^(m)) w^(m)
where r^(m) = p̃(x^(m))/q(x^(m)), w^(m) = r^(m) / ∑_{l=1}^{M} r^(l), and x^(m) ~ q(x)
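The self-normalized estimator above only ever evaluates p̃. A minimal sketch, where the unnormalized target exp(−x²/2) (an N(0,1) missing its constant) and the proposal N(0,2²) are illustrative:

```python
# Normalized importance sampling sketch: the target is known only up to a
# constant, p~(x) = exp(-x^2/2); the proposal q = N(0, 2^2) is normalized.
import math
import random

def p_tilde(x):
    return math.exp(-0.5 * x * x)     # unnormalized N(0,1)

def q_pdf(x):
    s = 2.0
    return math.exp(-0.5 * (x / s) ** 2) / (s * math.sqrt(2.0 * math.pi))

rng = random.Random(0)
xs = [rng.gauss(0.0, 2.0) for _ in range(100_000)]      # x^(m) ~ q
rs = [p_tilde(x) / q_pdf(x) for x in xs]                # r^(m) = p~/q
total = sum(rs)
# sum_m f(x^(m)) w^(m), with self-normalized weights w^(m) = r^(m)/sum_l r^(l)
est = sum((x * x) * r for x, r in zip(xs, rs)) / total
print(round(est, 2))  # true value E_p[x^2] = 1
```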

SLIDE 16

Importance sampling: problem

 Importance sampling depends on how well q matches p
 for mismatched distributions, the weights may be dominated by a few samples with large weights, the remaining weights being relatively insignificant
 It is common that p(x)f(x) is strongly varying and has a significant proportion of its mass concentrated in a small region
 The problem is severe if none of the samples falls in the regions where p(x)f(x) is large
 the estimate of the expectation may be severely wrong while the variance of the r^(m) appears small

A key requirement for q(x) is that it should not be small or zero in regions where p(x) may be significant. [Bishop book]

SLIDE 17

Sampling methods for graphical models

 DGMs:
 forward (or ancestral) sampling
 likelihood-weighted sampling
 For UGMs, there is no one-pass sampling strategy that can sample even from the prior distribution with no observed variables
 instead, computationally more expensive techniques such as Gibbs sampling exist, introduced in the following slides

SLIDE 18

Sampling the joint distribution represented by a BN

 Sample the joint distribution by ancestral sampling
 Example:
 sample from P(D) ⇒ D = d1
 sample from P(I) ⇒ I = i0
 sample from P(G | i0, d1) ⇒ G = g3
 sample from P(S | i0) ⇒ S = s0
 sample from P(L | g3) ⇒ L = l0
 one sample (d1, i0, g3, s0, l0) was generated

SLIDE 19

Forward sampling in a BN

 Given a BN and a number of samples M
 Choose a topological ordering of the variables, e.g., X_1, …, X_n
 For m = 1 to M
 for i = 1 to n
 sample x_i^(m) from the distribution P(X_i | x_{Pa(X_i)}^(m))
 add {x_1^(m), …, x_n^(m)} to the sample set
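The forward-sampling loop can be sketched on a hypothetical two-variable network Cloudy → Rain with made-up CPT values (the network and its numbers are illustrative, not from the slides): each variable is sampled in topological order, conditioned on its already-sampled parents.

```python
# Forward (ancestral) sampling sketch for a tiny hypothetical BN
# Cloudy -> Rain, with made-up CPTs.
import random

rng = random.Random(0)

def bernoulli(p):
    return 1 if rng.random() < p else 0

P_cloudy = 0.5                             # P(Cloudy = 1)
P_rain_given_cloudy = {1: 0.8, 0: 0.1}     # P(Rain = 1 | Cloudy)

M = 100_000
samples = []
for _ in range(M):
    c = bernoulli(P_cloudy)                # sample the root first
    r = bernoulli(P_rain_given_cloudy[c])  # then its child, given the parent
    samples.append((c, r))

# P(Rain = 1) = 0.5*0.8 + 0.5*0.1 = 0.45
p_rain = sum(r for _, r in samples) / M
print(round(p_rain, 2))
```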

SLIDE 20

Sampling for a conditional probability query

 P(i1 | l0, s0) = ?
 Looking at the samples, we can count:
 M: # of samples
 M_e: # of samples in which the evidence holds (L = l0, S = s0)
 M_j: # of samples in which the joint event holds (L = l0, S = s0, I = i1)
 For a large enough M:
 M_e/M ≈ P(l0, s0)
 M_j/M ≈ P(i1, l0, s0)
 And so we can set:
 P(i1 | l0, s0) = P(i1, l0, s0)/P(l0, s0) ≈ M_j/M_e

SLIDE 21

Using rejection sampling to compute P(Y | e)

 Given a BN, a query P(Y | e), and a number of samples M
 Choose a topological ordering of the variables, e.g., X_1, …, X_n
 m = 1
 While m ≤ M
 for i = 1 to n
 sample x_i^(m) from the distribution P(X_i | x_{Pa(X_i)}^(m))
 if {x_1^(m), …, x_n^(m)} is consistent with the evidence e, add it to the sample set and set m = m + 1
 Use the samples to compute P(Y | e)

SLIDE 22

Forward sampling: problem

 Problem: when the evidence rarely happens, we need lots of samples, and most are wasted
 the overall probability of accepting a sample rapidly decreases as the number of observed variables increases and as the number of states those variables can take increases
 This approach is very slow and rarely used in practice

SLIDE 23

Likelihood weighting

 Let X_1, …, X_n be a topological ordering of the variables
 w = 1
 For i = 1 … n
 if X_i ∉ E: draw x_i from P(X_i | x_{Pa(X_i)})
 else: set x_i = e⟨X_i⟩ and w = w × P(x_i | x_{Pa(X_i)})

The above procedure generates a sample x = (x_1, …, x_n) and assigns it a weight w:
w(x) = ∏_{X_i∉E} [P(x_i | x_{Pa(X_i)}) / P(x_i | x_{Pa(X_i)})] × ∏_{X_i∈E} [P(x_i | x_{Pa(X_i)}) / 1] = ∏_{X_i∈E} P(x_i | x_{Pa(X_i)})
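The procedure above can be sketched on the same hypothetical Cloudy → Rain network used earlier (the network and CPT values are illustrative): the evidence variable is clamped and contributes its CPT probability to the weight, and the weighted samples answer a conditional query.

```python
# Likelihood-weighting sketch on a hypothetical Cloudy -> Rain network:
# estimate P(Cloudy = 1 | Rain = 1). Evidence variables are clamped and
# multiply the sample weight by their CPT probability.
import random

rng = random.Random(0)
P_cloudy = 0.5
P_rain_given_cloudy = {1: 0.8, 0: 0.1}

num, den = 0.0, 0.0
for _ in range(100_000):
    w = 1.0
    c = 1 if rng.random() < P_cloudy else 0  # Cloudy is unobserved: sample it
    w *= P_rain_given_cloudy[c]              # Rain = 1 is evidence: weight it
    num += w * (c == 1)
    den += w

# true posterior: (0.5*0.8) / (0.5*0.8 + 0.5*0.1) = 0.4/0.45
print(round(num / den, 2))
```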

SLIDE 24

Likelihood weighting: example

P(D | g2, i1)

SLIDE 25

Efficiency of likelihood weighting

 The efficiency of importance sampling depends on how close the proposal q is to the target p
 Two extreme cases:
 when all the evidence is at the roots, q(x) = P(x | e) and all samples have the same weight
 when all the evidence is at the leaves, q(x) = P(x), the prior
 this works reasonably only if P(x) is similar to P(x | e)
 otherwise, most of our samples are irrelevant, i.e., their weights are very low

SLIDE 26

Limitations of Monte Carlo

 Direct sampling: only possible when we can sample from p(x)
 can be wasteful for rare events
 Rejection and importance sampling use a proposal distribution q(x) and can be used even when we cannot sample from p(x) directly
 in rejection sampling, when the proposal q(x) is very different from p(x), most samples are rejected
 in importance sampling, when the proposal q(x) is very different from p(x), most samples have very low weights

Problem: finding a good proposal q(x) that is similar to p(x) usually requires knowledge of the analytic form of p(x), which is not available

SLIDE 27

Markov chain Monte Carlo (MCMC)

 Instead of using a fixed proposal q(x), we can use an adaptive proposal q(x | x^(t)) that depends on the last sample x^(t)
 the proposal distribution is adapted as a function of the last accepted sample
 MCMC methods:
 Metropolis-Hastings
 Gibbs sampling

SLIDE 28

MCMC

 MCMC algorithms feature adaptive proposals
 instead of q(x′), they use q(x′ | x), where x′ is the new state being sampled and x is the previous sample
 as x changes, q(x′ | x) also changes (as a function of x′)

[This slide has been adapted from Xing's slides]

SLIDE 29

Markov chains

 A Markov chain is a sequence of random variables x^(1), …, x^(t) with the Markov property:
P(x^(t) | x^(1), …, x^(t−1)) = P(x^(t) | x^(t−1))
 We focus on homogeneous Markov chains, in which P(x^(t) | x^(t−1)) is fixed over time
 letting x be the previous state and x′ the next state, we write P(x^(t) = x′ | x^(t−1) = x) as the transition distribution T(x′ | x)
 Thus, at each time point, x^(t) is a state describing the configuration of all the variables in the model

SLIDE 30

Markov chains: invariant or stationary distribution

 π_t(x): probability distribution over states x at time t
 The transition probability T(x′ | x) redistributes the mass at state x to other states x′:
π_{t+1}(x′) = ∑_x π_t(x) T(x′ | x)
 π(x) is invariant or stationary if it does not change under the transitions:
π(x′) = ∑_x π(x) T(x′ | x)  ∀x′

Invariant distributions are of great importance in MCMC methods.
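The update π_{t+1}(x′) = ∑_x π_t(x) T(x′ | x) can be iterated directly on a small chain. The 2-state transition matrix here is an illustrative example, chosen so the stationary distribution has the closed form (5/6, 1/6):

```python
# Stationary-distribution sketch for a 2-state chain: repeatedly apply
# pi_{t+1}(x') = sum_x pi_t(x) T(x'|x) until pi stops changing.
# T[i][j] = T(x' = j | x = i); each row sums to 1.
T = [[0.9, 0.1],
     [0.5, 0.5]]

pi = [1.0, 0.0]   # arbitrary starting distribution pi_0
for _ in range(200):
    pi = [sum(pi[i] * T[i][j] for i in range(2)) for j in range(2)]

# Solving pi = pi T for this chain gives pi = (5/6, 1/6)
print([round(p, 3) for p in pi])
```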

SLIDE 31

More specific than stationary distribution

 There is no guarantee that the stationary distribution is unique
 in some chains, the stationary distribution reached depends on the starting distribution π_0(x)
 We want to restrict our attention to Markov chains that have a unique stationary distribution, reached from any starting distribution π_0(x)
 There are various conditions that suffice to guarantee this property
 the most commonly used condition is ergodicity

SLIDE 32

Equilibrium distribution

 A stationary distribution π of the transition probabilities T is called the equilibrium distribution when lim_{t→∞} π_t = π independent of the initial distribution π_0
 Ergodicity implies that the stationary distribution π(x) is reached no matter the initial distribution π_0(x)
 Clearly, an ergodic Markov chain can have only one equilibrium distribution

SLIDE 33

Ergodic Markov chain

 A homogeneous Markov chain is ergodic subject only to the following conditions:
 Irreducible: an MC is irreducible if we can get from any state x to any other state x′ with probability > 0 in a finite number of steps
 Aperiodic: an MC is aperiodic if we can return to any state x at any time
 periodic MCs have states that can only be revisited in cycles of length ≥ 2
 Ergodic: an MC is ergodic if it is irreducible and aperiodic
 An ergodic MC converges to a unique stationary distribution regardless of the start state

SLIDE 34

Finite state space MC

 A finite-state Markov chain is ergodic or regular if there exists k such that, for every pair of states x, x′, the probability of getting from x to x′ in exactly k steps is > 0
 Sufficient conditions for regularity:
 every two states are connected
 for every state, there is a self-transition

SLIDE 35

Detailed balance

 A sufficient (but not necessary) condition for ensuring that π(x) is a stationary distribution of an MC is the detailed balance condition:
π(x) T(x′ | x) = π(x′) T(x | x′)
 Reversible chain: an MC is reversible if there exists a distribution π(x) such that the detailed balance condition is satisfied
 Enforcing detailed balance is easy
 detailed balance is local (it only involves isolated pairs of states)

Detailed balance means the sequences (x, x′) and (x′, x) are equally probable (although the probabilities of x′ → x and x → x′ can differ)

SLIDE 36

Reversible chains

 Theorem: detailed balance implies stationarity of π
 Proof:
π(x) T(x′ | x) = π(x′) T(x | x′)
⇒ ∑_x π(x) T(x′ | x) = ∑_x π(x′) T(x | x′) = π(x′) ∑_x T(x | x′) = π(x′)
 Theorem: if detailed balance holds and T is ergodic, then T has a unique stationary distribution

SLIDE 37

Stationary distribution: summary

 π_t(x): probability distribution over states x at time t
 The transition probability T(x′ | x) redistributes the mass at state x to other states x′:
π_{t+1}(x′) = ∑_x π_t(x) T(x′ | x)
 π*(x) is invariant or stationary if it does not change under the transitions:
π*(x′) = ∑_x π*(x) T(x′ | x)  ∀x′
 A sufficient condition for ensuring that π*(x) is a stationary distribution of an MC is the detailed balance condition:
π*(x) T(x′ | x) = π*(x′) T(x | x′)

SLIDE 38

How to use Markov chains for sampling from p(x)?

 Our goal is to use Markov chains to sample from a given distribution
 We can achieve this if we set up a Markov chain whose unique stationary distribution is p
 We design the transition distribution T(x′ | x) so that the chain has p(x) as its unique stationary distribution, regardless of the initial distribution
 ergodicity is a sufficient condition

Sample x^(0) randomly
For t = 0, 1, 2, …
  Sample x^(t+1) from T(x′ | x^(t))

SLIDE 39

Metropolis-Hastings

 Draw a sample x′ from q(x′ | x), where x is the previous sample
 The new sample x′ is accepted with probability A(x′ | x):
A(x′ | x) = min(1, [p̃(x′) q(x | x′)] / [p̃(x) q(x′ | x)])
 A(x′ | x) is like a ratio of importance sampling weights
 p̃(x′)/q(x′ | x) is the importance weight for x′, and p̃(x)/q(x | x′) is the importance weight for x
 We use A(x′ | x) to ensure that, after sufficiently many draws, our samples come from the true distribution p(x)
 We will show in the next slides that we only need to compute the ratio p̃(x′)/p̃(x) rather than p̃(x′) or p̃(x) separately

SLIDE 40

Metropolis-Hastings algorithm: example

 Let q(x′ | x) be a Gaussian centered on x
 We are trying to sample from a bimodal distribution p(x)
A(x′ | x) = min(1, [p(x′) q(x | x′)] / [p(x) q(x′ | x)])

[Figure: proposal q(x′ | x^(0))] [This slide has been adapted from Xing's slides, CMU]

SLIDE 41

Metropolis-Hastings algorithm: example

 The proposal is now centered on the accepted sample: q(x′ | x^(1))

[This slide has been adapted from Xing's slides, CMU]

SLIDE 42

Metropolis-Hastings algorithm: example

 The proposal x′ drawn from q(x′ | x^(2)) is rejected: p(x′) is small where p(x^(2)) is large, hence A(x′ | x^(2)) is small

[This slide has been adapted from Xing's slides, CMU]

SLIDE 43

Metropolis-Hastings algorithm: example

 The proposal is now q(x′ | x^(3))

[This slide has been adapted from Xing's slides, CMU]

SLIDE 44

Metropolis-Hastings algorithm: example

 The adaptive proposal q(x′ | x) allows us to sample both modes of p(x)!

[This slide has been adapted from Xing's slides, CMU]

SLIDE 45

Metropolis-Hastings algorithm

 Initialize the starting state x^(0), set t = 0
 Burn-in: while the samples have "not converged":
  • x = x^(t)
  • sample x* ~ q(x* | x)
  • A(x* | x) = min(1, [p̃(x*) q(x | x*)] / [p̃(x) q(x* | x)])
  • sample u ~ Uniform(0,1)
  • if u < A(x* | x): x^(t+1) = x*
  • else: x^(t+1) = x^(t)
  • t = t + 1
 After burn-in, for n = 1 … N:
  • draw a sample from q(x | x^(n−1)), accepted with probability A(x | x^(n−1))

x* is accepted with probability A(x* | x)
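The algorithm above can be sketched end to end. The bimodal unnormalized target p̃(x) = exp(−(x² − 4)²/4), with modes near ±2, and the unit-variance Gaussian random walk are illustrative choices; since the proposal is symmetric, the Hastings correction cancels and only the ratio p̃(x*)/p̃(x) is needed.

```python
# Metropolis-Hastings sketch: Gaussian random-walk proposal q(x'|x) = N(x, 1),
# illustrative bimodal unnormalized target p~(x) = exp(-(x^2 - 4)^2 / 4).
import math
import random

def p_tilde(x):
    return math.exp(-((x * x - 4.0) ** 2) / 4.0)

rng = random.Random(0)
x = 0.0
burn_in, n_samples = 1000, 50_000
chain = []
for t in range(burn_in + n_samples):
    x_star = rng.gauss(x, 1.0)                     # propose from q(.|x)
    a = min(1.0, p_tilde(x_star) / p_tilde(x))     # symmetric q: ratio cancels
    if rng.random() < a:
        x = x_star                                 # accept; else keep old state
    if t >= burn_in:
        chain.append(x)

# the target is symmetric about 0, and both modes should be visited
mean = sum(chain) / len(chain)
print(round(mean, 1))
```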

SLIDE 46

Metropolis algorithm: example

Let q(x′ | x) be a Gaussian centered on x: q(x′ | x) = N(x′ | x, σ²I). Then
A(x′ | x) = min(1, p̃(x′)/p̃(x))

A biased random walk that explores the target distribution p(x) [Bishop book]

SLIDE 47

Metropolis algorithm: example

 q(x′ | x) = N(x′ | x, σ²I)
 large σ² ⇒ many rejections
 small σ² ⇒ slow exploration
 on the order of (L/σ)² iterations are required to explore the distribution, where L is its length scale
 In general, finding a good proposal distribution is not always easy

SLIDE 48

Proposal distribution

 Low-variance proposals:
 high probability of acceptance
 many iterations are required to explore p(x)
 result in more correlated samples
 High-variance proposals:
 low probability of acceptance
 have the potential to explore much of p(x)
 result in less correlated samples

SLIDE 49

Why does Metropolis-Hastings work?

 MH is a general construction that allows us to build a reversible Markov chain with a particular stationary distribution p(x)
 If we draw a sample x′ according to q(x′ | x) and then accept/reject according to A(x′ | x), we have the transition kernel:
T(x′ | x) = A(x′ | x) q(x′ | x)  if x′ ≠ x
T(x | x) = q(x | x) + ∑_{x′≠x} q(x′ | x)(1 − A(x′ | x))

SLIDE 50

MH satisfies detailed balance

 Theorem: MH satisfies detailed balance
 Proof:
A(x′ | x) = min(1, [p(x′) q(x | x′)] / [p(x) q(x′ | x)])
If A(x′ | x) ≤ 1, then [p(x) q(x′ | x)] / [p(x′) q(x | x′)] ≥ 1, so A(x | x′) = 1
Suppose that A(x′ | x) < 1. Then
A(x′ | x) = [p(x′) q(x | x′)] / [p(x) q(x′ | x)] ⇒ p(x) q(x′ | x) A(x′ | x) = p(x′) q(x | x′)
Since A(x | x′) = 1:
p(x) q(x′ | x) A(x′ | x) = p(x′) q(x | x′) A(x | x′) ⇒ p(x) T(x′ | x) = p(x′) T(x | x′)
using T(x′ | x) = A(x′ | x) q(x′ | x)

SLIDE 51

MH properties

 The MH algorithm eventually converges to the stationary distribution p(x), which is the true distribution
 However, we have no guarantees as to when this will occur
 the burn-in period is a way to ignore the un-converged part of the Markov chain
 but deciding when to halt burn-in is an art that requires experimentation
 q must be chosen to fulfill the technical requirements (e.g., ergodicity)

SLIDE 52

Gibbs sampling algorithm

Suppose the graphical model contains variables x_1, …, x_n

 Initialize starting values for x_1, …, x_n
 Do until convergence:
 pick an ordering of the n variables (can be fixed or random)
 for each variable x_i in order:
 sample x from P(X_i | x_1, …, x_{i−1}, x_{i+1}, …, x_n), i.e., conditioned on the current values of all other variables
 update x_i ← x

When we update x_i, we immediately use its new value for sampling the other variables x_j

SLIDE 53

Gibbs sampling algorithm

Suppose the graphical model contains variables x_1, …, x_n. If the current sample is x = (x_1, …, x_n), the next sample x′ = (x_1′, …, x_n′) is drawn as:

 Sample x_1′ from P(X_1 | x_2, …, x_n)
 Sample x_2′ from P(X_2 | x_1′, x_3, …, x_n)
 …
 Sample x_i′ from P(X_i | x_1′, …, x_{i−1}′, x_{i+1}, …, x_n)
 …
 Sample x_n′ from P(X_n | x_1′, …, x_{n−1}′)
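The sweep above can be sketched on a bivariate Gaussian, an illustrative target (not from the slides) whose full conditionals are themselves Gaussian: for correlation ρ, x_i | x_j ~ N(ρ x_j, 1 − ρ²). Note that each update immediately uses the freshly sampled value of the other variable.

```python
# Gibbs sampling sketch for a bivariate Gaussian with correlation rho = 0.8:
# each full conditional is univariate Gaussian, x_i | x_j ~ N(rho*x_j, 1-rho^2).
import math
import random

rng = random.Random(0)
rho = 0.8
sd = math.sqrt(1.0 - rho * rho)

x1, x2 = 3.0, -3.0                    # deliberately bad starting point
burn_in, n_samples = 1000, 50_000
samples = []
for t in range(burn_in + n_samples):
    x1 = rng.gauss(rho * x2, sd)      # sample x1 | x2
    x2 = rng.gauss(rho * x1, sd)      # sample x2 | x1, using the new x1
    if t >= burn_in:
        samples.append((x1, x2))

m = len(samples)
corr_num = sum(a * b for a, b in samples) / m
print(round(corr_num, 2))  # E[x1*x2] should be near rho = 0.8
```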

SLIDE 54

Gibbs sampling

 Gibbs sampling is an MCMC algorithm that samples one random variable of a graphical model at a time
 We will see that Gibbs sampling is a special case of MH
 Gibbs sampling needs reasonable time and memory
 Complete conditionals are fairly easy to derive for many graphical models (e.g., mixture models, LDA)
 for conjugate-exponential models, we can use standard sampling techniques
 if the conditionals are log-concave, we can use adaptive rejection sampling
 Blocked Gibbs sampling samples a subset of variables at a time

SLIDE 55

Gibbs sampling: example

Target: P(D, I, G | s1, l0)
Initialization: (d0, i0, g1)
d′ ~ P(D | i0, g1, s1, l0)
i′ ~ P(I | d0, g1, s1, l0)
g′ ~ P(G | d0, i1, s1, l0)
x^(t) = (d0, i0, g1) → x^(t+1) = (d0, i1, g1)

SLIDE 56

Gibbs sampling is a special case of MH

 Gibbs sampling is an MH algorithm with the proposal distribution:
q_i(x′ | x) = P(x_i′ | x_{−i})
where x_{−i} denotes all variables except x_i, and the proposal leaves them unchanged: x′_{−i} = x_{−i}
 Using this proposal distribution in the MH algorithm, we find that samples are always accepted:
A(x′ | x) = min(1, [P(x′) q_i(x | x′)] / [P(x) q_i(x′ | x)])
         = min(1, [P(x_i′, x_{−i}) P(x_i | x_{−i})] / [P(x_i, x_{−i}) P(x_i′ | x_{−i})])
         = min(1, [P(x_i′ | x_{−i}) P(x_{−i}) P(x_i | x_{−i})] / [P(x_i | x_{−i}) P(x_{−i}) P(x_i′ | x_{−i})]) = 1
 Gibbs sampling is simply MH with a proposal that is always accepted!

SLIDE 57

Gibbs sampling: complete conditional distributions

 P(x_i | x_{−i}) can often be evaluated quickly, because it depends only on the neighboring nodes
 Let MB(X_i) be the Markov blanket of X_i:
 for a BN, the Markov blanket of a variable is the set containing its parents, children, and co-parents
 for an MRF, the Markov blanket of a variable is its immediate neighbors
P(X_i | X_1, …, X_{i−1}, X_{i+1}, …, X_n) = P(X_i | MB(X_i))
 A sufficient condition for ergodicity of Gibbs sampling is that none of the conditional distributions is anywhere zero
 all conditions are satisfied when all probabilities are positive

SLIDE 58

Gibbs sampling implementation

 Conditionals with a few discrete settings can be explicitly normalized:
P(x_i | x_{−i}) = P(x_i, x_{−i}) / ∑_{x_i′} P(x_i′, x_{−i})
             = ∏_{k: X_i ∈ Scope(φ_k)} φ_k(x_i, x_k \ x_i) / ∑_{x_i′} ∏_{k: X_i ∈ Scope(φ_k)} φ_k(x_i′, x_k \ x_i)
 this depends only on the instantiation of the variables in X_i's Markov blanket
 Continuous univariate conditionals can be sampled using standard sampling methods
 e.g., inverse CDF sampling, rejection sampling, adaptive rejection sampling
SLIDE 59

Sampling and the EM algorithm

 EM algorithm:
t ← 1, choose initial parameters θ_1
Iterate until convergence:
  E step: compute P(ℋ | 𝒟, θ_t), i.e., P(z^(m) | x^(m), θ_t) for m = 1, …, M
  M step: θ_{t+1} = argmax_θ E_{P(ℋ|𝒟,θ_t)}[log P(𝒟, ℋ | θ)] = argmax_θ ∑_{m=1}^{M} E_{P(z^(m)|x^(m),θ_t)}[log P(x^(m), z | θ)]
  t ← t + 1

 Monte Carlo EM algorithm:
t ← 1, choose initial parameters θ_1
Iterate until convergence:
  E step: draw samples z^(m,l) ~ P(z^(m) | x^(m), θ_t), m = 1, …, M, l = 1, …, L
  M step: θ_{t+1} = argmax_θ ∑_{m=1}^{M} (1/L) ∑_{l=1}^{L} log P(x^(m), z^(m,l) | θ)
  t ← t + 1

SLIDE 60

Sampling and Bayesian Learning

SLIDE 61

Example: Bayesian mixture model

 To find the posterior, generate samples from it
 using Gibbs sampling

SLIDE 62

Example: Bayesian mixture model

 Gibbs sampling:
 the above complete conditionals can be computed easily

SLIDE 63

Gibbs sampling: summary

 Converts the hard problem of inference into a sequence of sampling steps from (univariate) conditional distributions
 Pros:
 perhaps the simplest Markov chain for graphical models
 computationally efficient: the conditionals can often be evaluated quickly
 Cons:
 only applies when we can sample from the product of factors
 for overly complex cases, Metropolis-Hastings can provide an effective alternative
 often slow to mix, especially when probabilities are peaked
 the acceptance rate is always 1, and high acceptance usually entails slow exploration

SLIDE 64

MH & MCMC: summary

 To sample from the true distribution p(x), MH methods use adaptive proposals q(x | x^(t)) rather than a fixed q(x)
 MH allows specifying any proposal q(x′ | x)
 but choosing a good q(x′ | x) requires technical care
 Gibbs sampling sets the proposal q(x′ | x) to the conditional distribution P(x_i′ | x_{−i}), and the acceptance rate is then always one
 Gibbs sampling is useful at least as a "first try"
 better choices of MH transition probabilities than the univariate conditionals used in Gibbs may lead to faster convergence

SLIDE 65

Gibbs sampling: extensions

 Methods to improve convergence:
 Blocking
 Collapsing (Rao-Blackwellised)

SLIDE 66

Blocked Gibbs sampling

 We can partition the variables into several disjoint blocks
 such that we can efficiently sample from each block given all other variables
 Blocked Gibbs sampling algorithm: iteratively sample blocks of variables rather than individual variables
 thereby taking much "longer-range" transitions in the state space in a single sampling step
 Here, as in Gibbs sampling, the algorithm produces a new sample only once all blocks have been resampled

SLIDE 67

Blocked Gibbs sampling: example

 P(I_1, I_2, I_3, I_4, D_1, D_2 | G_{1,1}, …, G_{4,2})

SLIDE 68

Rao-Blackwellised sampling

 Sampling in high-dimensional spaces causes high variance in the estimate
 Rao-Blackwellisation idea: sample some variables z, and, conditioned on them, compute the expected value over the remaining variables x analytically

SLIDE 69

Rao-Blackwellised sampling

 Instead of basic Monte Carlo estimation for the joint distribution of x, z:
(x^(m), z^(m)) ~ p(x, z), m = 1, 2, …, M
E_p[f(x, z)] = ∫∫ f(x, z) p(x, z) dx dz ≈ (1/M) ∑_{m=1}^{M} f(x^(m), z^(m))
 When the expectation under p(x | z) is tractable, we can estimate:
E_p[f(x, z)] = ∫∫ f(x, z) p(x | z) p(z) dx dz = ∫ [∫ f(x, z) p(x | z) dx] p(z) dz
           ≈ (1/M) ∑_{m=1}^{M} ∫ f(x, z^(m)) p(x | z^(m)) dx
 and expect this estimator to be more accurate

SLIDE 70

Rao-Blackwellised estimation

 It is classically used to reduce the variance of estimators:
 analytically marginalize out (collapse) some variables from the model and derive a Gibbs sampler for the collapsed representation

SLIDE 71

Collapsed Gibbs sampling: Bayesian mixture models

SLIDE 72

Topic models: collapsed Gibbs

 A popular inference algorithm for topic models:
 integrate out the topic proportions π and the topics B
 only the word-topic assignments z need to be sampled

[This slide has been adapted from Xing's slides, CMU]

SLIDE 73

Topic models: collapsed Gibbs

 What is P(z_i | z_{−i}, w)?
 It is a product of two Dirichlet-multinomial conditional distributions

[This slide has been adapted from Xing's slides, CMU]

SLIDE 74

Topic models: collapsed Gibbs (continued)

[This slide has been adapted from Xing's slides, CMU]

SLIDE 75

Collapsed Gibbs illustration

[Slides 75-78: step-by-step figure illustration of collapsed Gibbs sampling for topic models; figures adapted from Xing's slides, CMU]
SLIDE 79

Practical aspects of MCMC

 Using the samples:
 only once the chain mixes are all samples x^(t) drawn from the stationary distribution p(x)
 so we can use all x^(t) for t > t_mix
 However, nearby samples are correlated!
 so we should not overestimate the quality of our estimate by simply counting samples
 Selecting the proposal q(x′ | x): a good q proposes distant samples with a sufficiently high acceptance rate

SLIDE 80

Stop burn-in

 Stop burn-in when the state distribution is reasonably close to p(x)
 start collecting samples after the chain has run long enough to mix
 How do you know whether a chain has mixed?
 in general, you can never prove that a chain has mixed
 no general-purpose theoretical analysis exists for the mixing time of graphical models
 but in many cases you can show that it has not mixed

SLIDE 81

Poor mixing

 Poor mixing (or slow convergence):
 the MC stays in small regions of the parameter space for long periods of time
 occurs when the state space consists of several regions connected only by low-probability transitions
 The most straightforward remedy is to use multiple highly dispersed initial values to start several different chains
 For a multimodal target distribution in which the choice of starting values traps us near one of the modes, straightforward MCMC methods are unlikely to work

SLIDE 82

Stop burn-in

 There are approaches that tell us whether a chain has not converged:
 compare chain statistics in different windows within a single run of the chain, or across different chains initialized differently
 Simple approaches:
 plot the sample values vs. time
 plot the log-likelihood vs. time

SLIDE 83

Mixing?

 Each dot is a statistic (e.g., P(x ∈ S))
 its x-position is the estimated value from chain 1
 its y-position is the estimated value from chain 2
[Koller & Friedman book]

SLIDE 84

Sample values vs. time

 Monitor convergence by plotting samples (of random variables) from multiple MH runs (chains)
 if the chains are well mixed (left), they have probably converged
 if the chains are poorly mixed (right), we should continue burn-in
[This slide has been adapted from Xing's slides, CMU]

SLIDE 85

Log-likelihood vs. time

 Many graphical models are high-dimensional
 it is hard to visualize all random-variable chains at once
 Instead, plot the complete log-likelihood vs. time
 the complete log-likelihood is a random variable that depends on all model variables
 generally, the log-likelihood will climb and then eventually plateau
[This slide has been adapted from Xing's slides, CMU]

SLIDE 86

Autocorrelation coefficient

Autocorrelation function of a random variable x:
R(k) = cov[x^(t), x^(t+k)] / var[x^(t)]
R_x(k) = ∑_{t=1}^{N−k} (x^(t) − x̄)(x^(t+k) − x̄) / ∑_{t=1}^{N} (x^(t) − x̄)²,  where x̄ = (1/N) ∑_{t=1}^{N} x^(t)

 One way to diagnose a poorly mixing chain is to observe high autocorrelation at distant lags
 High autocorrelation leads to a smaller effective sample size!
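The estimator R_x(k) above, and the SSIF-based effective sample size discussed a few slides later, can be sketched on an AR(1) chain x^(t+1) = a·x^(t) + noise, an illustrative process whose true lag-k autocorrelation is a^k:

```python
# Autocorrelation sketch: estimate R_x(k) for an AR(1) chain with a = 0.9
# (true lag-1 autocorrelation 0.9), then the sample size inflation factor
# s_x = (1 + R_x(1)) / (1 - R_x(1)) and the effective sample size n / s_x.
import random

rng = random.Random(0)
a = 0.9
xs = [0.0]
for _ in range(200_000):
    xs.append(a * xs[-1] + rng.gauss(0.0, 1.0))

def autocorr(xs, k):
    n = len(xs)
    mean = sum(xs) / n
    num = sum((xs[t] - mean) * (xs[t + k] - mean) for t in range(n - k))
    den = sum((x - mean) ** 2 for x in xs)
    return num / den

r1 = autocorr(xs, 1)              # should be close to a = 0.9
ssif = (1.0 + r1) / (1.0 - r1)    # sample size inflation factor, roughly 19
ess = len(xs) / ssif              # effective sample size
print(round(r1, 2))
```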

SLIDE 87

Proposal distribution

 How do we know if our proposal q(x′ | x) is any good?
 monitor the acceptance rate
 plot the autocorrelation function

SLIDE 88

Acceptance rate

 Choosing the proposal q(x′ | x) is a trade-off:
 "narrow", low-variance proposals have high acceptance, but take many iterations to explore p(x) fully
 "wide", high-variance proposals have the potential to explore much of p(x), but many proposals are rejected (which slows down the sampler)
 A good q(x′ | x) proposes distant samples x′ with a sufficiently high acceptance rate
[This slide has been adapted from Xing's slides, CMU]

SLIDE 89

Autocorrelation function

 MCMC chains always show autocorrelation (AC)
 AC means that samples adjacent in time are highly correlated
[This slide has been adapted from Xing's slides, CMU]

SLIDE 90

Autocorrelation

 We want proposals q(x′ | x) with low autocorrelation
 if the variance of q is too small, moves are accepted with high probability, but they are also small, generating high autocorrelation and poor mixing
 High autocorrelation leads to a smaller effective sample size!

SLIDE 91

Sample size inflation factor

 The first-order autocorrelation R_x(1) can be used to estimate the Sample Size Inflation Factor (SSIF):
s_x = (1 + R_x(1)) / (1 − R_x(1))
 If we took n samples with SSIF s_x, then the effective sample size is n / s_x
[This slide has been adapted from Xing's slides, CMU]

SLIDE 92

Thinning

 One strategy for reducing autocorrelation is thinning the output
 storing only every nth sample after the burn-in period
 the autocorrelation can be reduced by increasing the thinning ratio
 Approximately independent samples can be obtained by thinning all samples after burn-in

SLIDE 93

Thinning or not?

 Using all of the samples produces a provably better estimator than using just those remaining after thinning
 Thinning is useful primarily in settings where there is a significant cost associated with using each sample (for example, when the evaluation of f is costly)
 so that we might want to reduce the overall number of particles used

SLIDE 94

One long chain or many smaller chains?

 If long burn-in periods are required, or if the chains have very high autocorrelation:
 using a large number of small chains may result in each chain not being long enough to be of any value
 A good strategy: run a small number of chains in parallel for a reasonably long time
 using their behavior to evaluate mixing

SLIDE 95

MCMC implementation details

 Tunable parameters and design choices in MCMC:
 the proposal distribution
 the metrics for evaluating mixing
 the number of chains to run
 techniques for determining the thinning ratio

SLIDE 96

MCMC sampling: summary

 Theoretical guarantees of convergence to exact values
 However, it can be quite slow to converge
 we may need ridiculously many samples
 it is only useful if the chain mixes reasonably quickly
 And it is difficult to tell whether the chain has converged
 Many options need to be specified
 the application of Markov chains is more of an art than a science
 Unlike forward sampling methods, MCMC methods:
 do not degrade when the probability of the evidence is low, or when the posterior is very different from the prior
 apply to undirected models as well as to directed models

SLIDE 97

Resources

 D. Koller and N. Friedman, "Probabilistic Graphical Models: Principles and Techniques", MIT Press, Chapter 12.
 C. M. Bishop, "Pattern Recognition and Machine Learning", Springer, Chapter 11.