Approximate inference: Sampling methods
Probabilistic Graphical Models, Sharif University of Technology, Spring 2018, Soleymani
Approximate inference
Approximate inference techniques:
Deterministic approximation: variational algorithms
Stochastic simulation / sampling methods
Sampling-based estimation
Assume D = {x^(1), …, x^(N)} is a set of i.i.d. samples drawn from the desired distribution p.
For any distribution p and function f, we can estimate E_p[f]:
E_p[f] ≈ (1/N) Σ_{n=1}^{N} f(x^(n))
Expectations reveal interesting properties of the distribution p:
means and variances of p; probabilities of events.
E.g., we can find p(x = a) by estimating E_p[f] with f(x) = I(x = a).
Thus we can use a stochastic representation of a complex distribution: the empirical expectation.
Bounds on error
Let θ̂ = (1/N) Σ_{n=1}^{N} f(x^(n)), where x^(n) ~ p(x), and θ = E_p[f(x)].
Hoeffding bound (additive bound on the error):
P(θ̂ ∉ [θ − ε, θ + ε]) ≤ 2e^(−2Nε²)
Chernoff bound (multiplicative bound on the error):
P(θ̂ ∉ [θ(1 − ε), θ(1 + ε)]) ≤ 2e^(−Nθε²/3)
The mean and variance of the estimator
For samples drawn independently from the distribution p:
f̂ = (1/N) Σ_{n=1}^{N} f(x^(n))
E[f̂] = E[f]   (the estimator is unbiased)
var[f̂] = (1/N) var[f] = (1/N) E[(f − E[f])²]
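A minimal sketch of these properties (all function and variable names here are illustrative, not from the slides): a Monte Carlo estimate of E[f(x)] for f(x) = x² with x ~ Uniform(0,1), whose exact value is 1/3, and a check that the estimator's empirical variance shrinks roughly by 10x when N grows by 10x.

```python
import random

# Monte Carlo estimate of E[f(x)] using N i.i.d. samples from `sampler`.
def mc_estimate(f, sampler, n, rng):
    return sum(f(sampler(rng)) for _ in range(n)) / n

# Empirical variance of the estimator over repeated runs.
def estimator_variance(f, sampler, n, runs, rng):
    ests = [mc_estimate(f, sampler, n, rng) for _ in range(runs)]
    mean = sum(ests) / runs
    return sum((e - mean) ** 2 for e in ests) / runs

rng = random.Random(0)
f = lambda x: x ** 2
u = lambda r: r.random()                       # x ~ Uniform(0, 1)

est = mc_estimate(f, u, 100_000, rng)          # close to the exact value 1/3
v_small = estimator_variance(f, u, 100, 200, rng)
v_large = estimator_variance(f, u, 1000, 200, rng)  # roughly 10x smaller
```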
Monte Carlo methods
Use a set of samples to answer an inference query:
expectations can be approximated using sample-based averages.
Asymptotically exact and easy to apply to arbitrary problems.
Challenges:
Drawing samples from many distributions is not trivial.
Are the gathered samples enough?
Are all samples useful, or equally useful?
Generating samples from a distribution
Assume that we have an algorithm that generates (pseudo-)random numbers distributed uniformly over (0,1).
How do we generate samples from other distributions?
First, we consider simple cases:
Bernoulli, multinomial, and other standard distributions.
Transformation technique
We intend to generate samples from standard distributions:
map the values generated by the uniform random number generator such that the resulting mapped samples have the desired distribution.
Choose a function f(·) such that the resulting values z = f(x) have the desired distribution p(z):
p(z) = p(x) |dx/dz|
Since p(x) = 1 on (0,1), we have:
x = ∫_{−∞}^{z} p(z′) dz′
If we define h(z) ≡ ∫_{−∞}^{z} p(z′) dz′, then z = h^{−1}(x).
Transformation technique
Inverse CDF sampling:
if x ~ U(0,1) and h(·) is the CDF of p, then h^{−1}(x) ~ p.
Since we need to calculate and then invert the indefinite integral of p, this is feasible only for a limited number of simple distributions.
Thus, we will first see rejection sampling and importance sampling (in the next slides), which can be used as important components in more general sampling techniques.
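A minimal worked instance of inverse CDF sampling, for one of the simple distributions where the inversion is closed-form: for Exponential(rate), the CDF is h(z) = 1 − e^(−rate·z), so h^{−1}(u) = −ln(1 − u)/rate (the function name is illustrative).

```python
import math
import random

# Inverse-CDF sampling for Exponential(rate): map u ~ Uniform(0,1)
# through the inverse CDF h^{-1}(u) = -ln(1 - u) / rate.
def sample_exponential(rate, rng):
    u = rng.random()
    return -math.log(1.0 - u) / rate

rng = random.Random(0)
samples = [sample_exponential(2.0, rng) for _ in range(100_000)]
mean = sum(samples) / len(samples)   # should be close to 1/rate = 0.5
```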
Rejection sampling
Suppose we wish to sample from p(x) = p̃(x)/Z:
p(x) is difficult to sample from, but p̃(x) is easy to evaluate.
We choose a simpler (proposal) distribution q(x) that we can sample from more easily,
where ∃k such that kq(x) ≥ p̃(x) for all x.
Sample from q(x): x* ~ q(x); accept x* with probability p̃(x*) / (k q(x*)).
[Figure: the envelope kq(x) lying above p̃(x), with a proposed point x*]
Rejection sampling
Correctness: the density of accepted samples is
[p̃(x)/(kq(x))] q(x) / ∫ [p̃(x)/(kq(x))] q(x) dx = p̃(x) / ∫ p̃(x) dx = p(x)
Probability of acceptance:
P(accept) = ∫ [p̃(x)/(kq(x))] q(x) dx = (1/k) ∫ p̃(x) dx
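A minimal sketch of the scheme above, with illustrative choices not taken from the slides: the unnormalized standard-normal target p̃(x) = e^(−x²/2), a Uniform(−5,5) proposal q, and envelope constant k = 10, so that kq(x) = 1 ≥ p̃(x) everywhere on the support.

```python
import math
import random

# Rejection sampling: propose x* ~ q, accept with probability p_tilde/(k*q).
def rejection_sample(rng):
    while True:
        x = rng.uniform(-5.0, 5.0)            # x* ~ q = Uniform(-5, 5)
        p_tilde = math.exp(-x * x / 2.0)      # unnormalized target
        if rng.random() < p_tilde / 1.0:      # k*q(x) = 10 * (1/10) = 1
            return x

rng = random.Random(0)
xs = [rejection_sample(rng) for _ in range(20_000)]
mean = sum(xs) / len(xs)                      # close to 0
var = sum((x - mean) ** 2 for x in xs) / len(xs)  # close to 1
```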
Adaptive rejection sampling
It is difficult to determine a suitable analytic form for q. We can use envelope functions to define q when p(x) is log-concave:
intersections of tangent lines to ln p(x) are used to construct q.
Initially, gradients are evaluated at an initial set of grid points, and the corresponding tangent lines are found.
In each iteration, a sample is drawn from the envelope distribution:
the envelope distribution is piecewise exponential, so drawing a sample from it is straightforward.
If the sample is rejected, it is incorporated into the set of grid points, a new tangent line is computed, and q is thereby refined.
[Figure: tangent lines to ln p(x) at grid points x1, x2, x3]
High-dimensional rejection sampling
Problem: rejection sampling has a low acceptance rate in high-dimensional spaces:
the acceptance rate decreases exponentially with dimensionality.
Example: using q = N(μ, σ_q² I) to sample from p = N(μ, σ_p² I).
If σ_q exceeds σ_p by just 1% and d = 1000, then k = (σ_q/σ_p)^d ≈ 20,000, so the optimal acceptance rate is 1/20,000, which is far too small.
Importance sampling
Suppose sampling from p is hard, and a simpler proposal distribution q is used instead. If q dominates p (i.e., q(x) > 0 whenever p(x) > 0), we can sample from q and reweight the obtained samples:
E_p[f(x)] = ∫ f(x) p(x) dx = ∫ f(x) [p(x)/q(x)] q(x) dx
E_p[f(x)] ≈ (1/N) Σ_{n=1}^{N} f(x^(n)) w^(n)
where x^(n) ~ q(x) and w^(n) = p(x^(n)) / q(x^(n)).
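A minimal sketch of the reweighting step, with illustrative choices: estimate E_p[x²] = 1 for the target p = N(0,1) using the wider proposal q = N(0, 2²), which dominates p.

```python
import math
import random

def normal_pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

rng = random.Random(0)
N = 100_000
total = 0.0
for _ in range(N):
    x = rng.gauss(0.0, 2.0)                          # x ~ q
    w = normal_pdf(x, 0, 1) / normal_pdf(x, 0, 2)    # importance weight p(x)/q(x)
    total += (x ** 2) * w
estimate = total / N                                 # E_p[x^2] = 1
```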
Normalized importance sampling
Suppose that we can only evaluate p̃(x), where p(x) = p̃(x)/Z:
E_p[f(x)] = ∫ f(x) p(x) dx = (1/Z) ∫ f(x) [p̃(x)/q(x)] q(x) dx
Z = ∫ p̃(x) dx = ∫ [p̃(x)/q(x)] q(x) dx = ∫ r(x) q(x) dx, where r(x) = p̃(x)/q(x)
Therefore:
E_p[f(x)] = ∫ f(x) r(x) q(x) dx / ∫ r(x) q(x) dx
E_p[f(x)] ≈ [(1/N) Σ_{n=1}^{N} f(x^(n)) r^(n)] / [(1/N) Σ_{m=1}^{N} r^(m)] = Σ_{n=1}^{N} f(x^(n)) w^(n)
where x^(n) ~ q(x) and w^(n) = r^(n) / Σ_{m=1}^{N} r^(m).
Importance sampling: problem
Importance sampling depends on how well q matches p:
when the distributions are mismatched, the weights may be dominated by a few samples with large weights, with the remaining weights being relatively insignificant.
It is common that p(x)f(x) is strongly varying and has a significant proportion of its mass concentrated in a small region.
The problem is severe if none of the samples falls in the regions where p(x)f(x) is large:
the estimate of the expectation may be severely wrong while the empirical variance of the weights r^(n) remains small, so the failure can go undetected.
"A key requirement for q(x) is that it should not be small or zero in regions where p(x) may be significant." [Bishop book]
Sampling methods for graphical models
DGMs: forward (or ancestral) sampling; likelihood-weighted sampling.
For UGMs, there is no one-pass sampling strategy that samples even from the prior distribution with no observed variables.
Instead, computationally more expensive techniques such as Gibbs sampling, introduced in the next slides, are used.
Sampling the joint distribution represented by a BN
Sample the joint distribution by ancestral sampling. Example (student network):
Sample from P(D) ⇒ D = d1
Sample from P(I) ⇒ I = i0
Sample from P(G | i0, d1) ⇒ G = g3
Sample from P(S | i0) ⇒ S = s0
Sample from P(L | g3) ⇒ L = l0
One sample (d1, i0, g3, s0, l0) was generated.
Forward sampling in a BN
Given a BN and a number of samples M:
Choose a topological ordering of the variables, e.g., X1, …, XN
For k = 1 to M:
  For j = 1 to N:
    Sample x_j^(k) from the distribution P(X_j | x_Pa(X_j)^(k))
  Add {x_1^(k), …, x_N^(k)} to the sample set
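A minimal sketch of forward sampling on a hypothetical two-variable BN C → R (not the network from the slides), with P(c1) = 0.5, P(r1 | c1) = 0.8, P(r1 | c0) = 0.1: each variable is sampled in topological order given its already-sampled parents.

```python
import random

# Ancestral sampling for the hypothetical BN C -> R.
def forward_sample(rng):
    c = 1 if rng.random() < 0.5 else 0        # sample the root from P(C)
    p_r1 = 0.8 if c == 1 else 0.1
    r = 1 if rng.random() < p_r1 else 0       # sample R from P(R | c)
    return c, r

rng = random.Random(0)
samples = [forward_sample(rng) for _ in range(50_000)]
p_r1 = sum(r for _, r in samples) / len(samples)
# exact marginal: P(r1) = 0.5*0.8 + 0.5*0.1 = 0.45
```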
Sampling for a conditional probability query
P(i1 | l0, s0) = ? Looking at the samples, we can count:
N: total number of samples
N_e: number of samples in which the evidence holds (L = l0, S = s0)
N_i: number of samples in which the joint event is true (L = l0, S = s0, I = i1)
For a large enough N:
N_e/N ≈ P(l0, s0) and N_i/N ≈ P(i1, l0, s0)
And so we can set:
P(i1 | l0, s0) = P(i1, l0, s0) / P(l0, s0) ≈ N_i / N_e
Using rejection sampling to compute P(Y | e)
Given a BN, a query P(Y | e), and a number of samples M:
Choose a topological ordering of the variables, e.g., X1, …, XN
k = 1
While k ≤ M:
  For j = 1 to N:
    Sample x_j^(k) from the distribution P(X_j | x_Pa(X_j)^(k))
  If {x_1^(k), …, x_N^(k)} is consistent with the evidence e, add it to the sample set and set k = k + 1
Use the samples to compute P(Y | e)
Forward sampling: problem
Problem: when the evidence rarely happens, we need lots of samples, and most are wasted:
the overall probability of accepting a sample rapidly decreases as the number of observed variables increases and as the number of states those variables can take increases.
This approach is very slow and rarely used in practice.
Likelihood weighting
Let X1, …, XN be a topological ordering of the variables.
w ← 1
For j = 1 … N:
  if X_j ∉ E: draw x_j from P(X_j | x_Pa(X_j))
  else: x_j ← e[X_j]; w ← w × P(x_j | x_Pa(X_j))
The above procedure generates a sample x = (x1, …, xN) and assigns it a weight w:
w(x) = Π_{X_j ∉ E} [P(x_j | x_Pa(X_j)) / P(x_j | x_Pa(X_j))] × Π_{X_j ∈ E} [P(x_j | x_Pa(X_j)) / 1] = Π_{X_j ∈ E} P(x_j | x_Pa(X_j))
(the factors for non-evidence variables cancel, since those variables are sampled from exactly P(X_j | x_Pa(X_j))).
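A minimal sketch of likelihood weighting on the same hypothetical two-variable BN C → R used earlier (P(c1) = 0.5, P(r1 | c1) = 0.8, P(r1 | c0) = 0.1), with evidence R = r1: C is sampled as usual, R is clamped to its observed value, and the weight picks up P(r1 | c).

```python
import random

# Likelihood weighting with evidence R = r1 on the hypothetical BN C -> R.
def weighted_sample(rng):
    w = 1.0
    c = 1 if rng.random() < 0.5 else 0     # C is not evidence: sample it
    w *= 0.8 if c == 1 else 0.1            # R is evidence: clamp r1, multiply weight
    return c, w

rng = random.Random(0)
num = den = 0.0
for _ in range(50_000):
    c, w = weighted_sample(rng)
    den += w
    num += w * (c == 1)
p_c1_given_r1 = num / den   # exact value: 0.4/0.45 = 0.888...
```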
Likelihood weighting: example
Query: P(D | g2, i1)
Efficiency of likelihood weighting
The efficiency of importance sampling depends on how close the proposal q is to the target p. Two extreme cases:
When all the evidence is at the roots: q(x) = p(x | e), and all samples have the same weight.
When all the evidence is at the leaves: q(x) = p(x).
This works reasonably only if p(x) is similar to p(x | e). Otherwise, most of our samples are irrelevant, i.e., their weights are very low.
Limitations of Monte Carlo
Direct sampling: possible only when we can sample from p(x);
can be wasteful for rare events.
Rejection and importance sampling use a proposal distribution q(x) and can also be used when we cannot sample from p(x) directly:
in rejection sampling, when the proposal q(x) is very different from p(x), most samples are rejected;
in importance sampling, when the proposal q(x) is very different from p(x), most samples have very low weights.
Problem: finding a good proposal q(x) that is similar to p(x) usually requires knowledge of the analytic form of p(x), which is not available.
Markov chain Monte Carlo (MCMC)
Instead of using a fixed proposal q(x), we can use an adaptive proposal q(x | x^(t)) that depends on the last sample x^(t):
the proposal distribution is adapted as a function of the last accepted sample.
MCMC methods: Metropolis-Hastings, Gibbs.
MCMC
MCMC algorithms feature adaptive proposals:
instead of q(x′), they use q(x′ | x), where x′ is the new state being sampled and x is the previous sample.
As x changes, q(x′ | x) also changes (as a function of x′).
[This slide has been adapted from Xing's slides]
Markov chains
A Markov chain is a sequence of random variables x^(1), …, x^(t) with the Markov property:
P(x^(t) | x^(1), …, x^(t−1)) = P(x^(t) | x^(t−1))
We focus on homogeneous Markov chains, in which P(x^(t) | x^(t−1)) is fixed over time.
Letting x be the previous state and x′ the next state, we write P(x^(t) | x^(t−1)) as T(x′ | x).
Thus, at each time point, x^(t) is a state describing the configuration of all the variables in the model.
Markov chains: invariant or stationary distributions
π_t(x): probability distribution over states x at time t.
The transition probability T(x′ | x) = P(x^(t) = x′ | x^(t−1) = x) redistributes the mass in state x to other states x′:
π_{t+1}(x′) = Σ_x π_t(x) T(x′ | x)
π(x) is invariant or stationary if it does not change under the transitions:
π(x′) = Σ_x π(x) T(x′ | x)   ∀x′
Invariant distributions are of great importance in MCMC methods.
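A minimal sketch of the update π_{t+1}(x′) = Σ_x π_t(x) T(x′ | x) on a small illustrative 3-state chain: iterating the update from an arbitrary start converges to a distribution that is invariant under T.

```python
# T[x][xp] = probability of moving from state x to state x' (rows sum to 1).
T = [[0.5, 0.3, 0.2],
     [0.2, 0.6, 0.2],
     [0.1, 0.3, 0.6]]

pi = [1.0, 0.0, 0.0]                 # arbitrary starting distribution pi_0
for _ in range(1000):                # repeatedly apply pi <- pi T
    pi = [sum(pi[x] * T[x][xp] for x in range(3)) for xp in range(3)]

# invariance check: applying T once more should leave pi unchanged
pi_next = [sum(pi[x] * T[x][xp] for x in range(3)) for xp in range(3)]
```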
More specific than stationary distribution
There is also no guarantee that the stationary distribution is unique:
in some chains, the stationary distribution reached depends on the starting distribution π_0(x).
We want to restrict our attention to Markov chains that have a unique stationary distribution, which is reached from any starting distribution π_0(x).
There are various conditions that suffice to guarantee this property. The most commonly used condition is ergodicity.
Equilibrium distribution
A stationary distribution π of the transition probabilities T is called the equilibrium distribution when lim_{t→∞} π_t = π independently of the initial distribution π_0:
ergodicity implies you can reach the stationary distribution π(x) no matter the initial distribution π_0(x).
Clearly, an ergodic Markov chain can have only one equilibrium distribution.
Ergodic Markov chain
A homogeneous Markov chain will be ergodic, subject only to the following conditions:
Irreducible: a chain is irreducible if we can get from any state x to any other state x′ with probability > 0 in a finite number of steps.
Aperiodic: a chain is aperiodic if you can return to any state x at any time;
periodic chains have states that can only be revisited in cycles of length ≥ 2.
Ergodic: a chain is ergodic if it is irreducible and aperiodic.
An ergodic Markov chain converges to a unique stationary distribution regardless of the start state.
Finite state space MC
A Markov chain is ergodic or regular if there exists k such that, for every pair of states x, x′, the probability of getting from x to x′ in exactly k steps is > 0.
A sufficient set of conditions for regularity:
every two states are connected, and for every state there is a self-transition.
Detailed balance
A sufficient (but not necessary) condition for ensuring that π(x) is a stationary distribution of a Markov chain is the detailed balance condition:
π(x) T(x′ | x) = π(x′) T(x | x′)
Reversible chain: a chain is reversible if there exists a distribution π(x) such that the detailed balance condition is satisfied.
Enforcing detailed balance is easy, since it is a local condition (it only involves isolated pairs of states).
Detailed balance means the sequences (x, x′) and (x′, x) are equally probable (although the transition probabilities of x′ → x and x → x′ can be different).
Reversible chains
Theorem: detailed balance implies stationarity.
Proof:
π(x) T(x′ | x) = π(x′) T(x | x′)
⇒ Σ_x π(x) T(x′ | x) = Σ_x π(x′) T(x | x′) = π(x′) Σ_x T(x | x′) = π(x′)
Theorem: if detailed balance holds and T is ergodic, then T has a unique stationary distribution.
Stationary distribution: summary
π_t(x): probability distribution over states x at time t.
The transition probability T(x′ | x) redistributes the mass in state x to other states x′:
π_{t+1}(x′) = Σ_x π_t(x) T(x′ | x)
π*(x) is invariant or stationary if it does not change under the transitions:
π*(x′) = Σ_x π*(x) T(x′ | x)   ∀x′
A sufficient condition for ensuring that π*(x) is a stationary distribution is the detailed balance condition:
π*(x) T(x′ | x) = π*(x′) T(x | x′)
How to use Markov chains for sampling from p(x)?
Our goal is to use Markov chains to sample from a given distribution.
We can achieve this if we set up a Markov chain whose unique stationary distribution is p:
we design the transition distribution T(x′ | x) so that the chain has the unique stationary distribution p(x), independent of the starting distribution.
Ergodicity is a sufficient condition.
Sample x^(0) randomly
For t = 0, 1, 2, …: sample x^(t+1) from T(x′ | x^(t))
Metropolis-Hastings
Draw a sample x′ from q(x′ | x), where x is the previous sample.
The new sample x′ is accepted with probability A(x′ | x):
A(x′ | x) = min(1, [p(x′) q(x | x′)] / [p(x) q(x′ | x)])
A(x′ | x) is like a ratio of importance sampling weights:
p(x′)/q(x′ | x) is the importance weight for x′; p(x)/q(x | x′) is the importance weight for x.
We use A(x′ | x) to ensure that, after sufficiently many draws, our samples come from the true distribution p(x).
As we will show in the next slides, we only need to compute the ratio p(x′)/p(x) rather than p(x′) or p(x) separately (so an unnormalized p̃ suffices).
Metropolis-Hastings algorithm: example
Let q(x′ | x) be a Gaussian centered on x; we are trying to sample from a bimodal distribution p(x).
A(x′ | x) = min(1, [p(x′) q(x | x′)] / [p(x) q(x′ | x)])
The chain proposes from q(x′ | x^(t)) at each step t = 0, 1, 2, …
A proposal x′ far into a low-probability region is usually rejected, since p(x′) ≪ p(x^(t)) makes A(x′ | x^(t)) small.
The adaptive proposal q(x′ | x) allows us to sample both modes of p(x)!
[This slide has been adapted from Xing's slides, CMU]
Metropolis-Hastings algorithm
Initialize starting state x^(0); set t = 0
Burn-in: while samples have "not converged":
  x = x^(t)
  sample x* ~ q(x* | x)
  A(x* | x) = min(1, [p(x*) q(x | x*)] / [p(x) q(x* | x)])
  sample u ~ Uniform(0, 1)
  if u < A(x* | x): x^(t+1) = x*
  else: x^(t+1) = x^(t)
  t = t + 1
After burn-in, for n = 1 … N: draw a sample from q(x | x^(n−1)) with acceptance probability A(x | x^(n−1)).
Thus x* is accepted with probability A(x* | x).
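A minimal sketch of this loop, with an illustrative bimodal unnormalized target p̃ (an equal mixture of normals at ±3) and a symmetric Gaussian random-walk proposal, so the q terms cancel in the acceptance ratio.

```python
import math
import random

# Unnormalized bimodal target: mixture of N(-3, 1) and N(3, 1).
def p_tilde(x):
    return math.exp(-(x - 3) ** 2 / 2) + math.exp(-(x + 3) ** 2 / 2)

rng = random.Random(0)
x, chain = 0.0, []
for t in range(200_000):
    x_star = rng.gauss(x, 1.0)                    # propose from q(.|x) = N(x, 1)
    a = min(1.0, p_tilde(x_star) / p_tilde(x))    # symmetric q cancels in the ratio
    if rng.random() < a:
        x = x_star                                # accept
    chain.append(x)                               # else keep the previous sample

burned = chain[10_000:]                           # discard burn-in
frac_right_mode = sum(1 for v in burned if v > 0) / len(burned)  # roughly 0.5
```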
Metropolis algorithm: example
Let q(x′ | x) be a Gaussian centered on x: q(x′ | x) = N(x′ | x, σ²I). Since the proposal is symmetric:
A(x′ | x) = min(1, p(x′)/p(x))
This gives a biased random walk that explores the target distribution p(x). [Bishop book]
Metropolis algorithm: example
q(x′ | x) = N(x′ | x, σ²I)
Large σ² ⇒ many rejections; small σ² ⇒ slow exploration.
On the order of (L/σ)² iterations are required to obtain roughly independent states, where L is the length scale of the target distribution.
In general, finding a good proposal distribution is not always easy.
Proposal distribution
Low-variance proposals:
high probability of acceptance, but many iterations are required to explore p(x), resulting in more correlated samples.
High-variance proposals:
low probability of acceptance, but the potential to explore much of p(x), resulting in less correlated samples.
Why does Metropolis-Hastings work?
MH is a general construction that allows us to build a reversible Markov chain with a particular stationary distribution p(x).
If we draw a sample x′ according to q(x′ | x) and then accept/reject according to A(x′ | x), we have the transition kernel:
T(x′ | x) = A(x′ | x) q(x′ | x)   if x′ ≠ x
T(x | x) = q(x | x) + Σ_{x′ ≠ x} q(x′ | x) [1 − A(x′ | x)]
MH satisfies detailed balance
Theorem: MH satisfies detailed balance.
Proof: recall A(x′ | x) = min(1, [p(x′) q(x | x′)] / [p(x) q(x′ | x)]).
If A(x′ | x) < 1, then [p(x) q(x′ | x)] / [p(x′) q(x | x′)] > 1, so A(x | x′) = 1.
Suppose A(x′ | x) < 1. Then:
A(x′ | x) = [p(x′) q(x | x′)] / [p(x) q(x′ | x)] ⇒ p(x) q(x′ | x) A(x′ | x) = p(x′) q(x | x′)
Since A(x | x′) = 1:
p(x) q(x′ | x) A(x′ | x) = p(x′) q(x | x′) A(x | x′) ⇒ p(x) T(x′ | x) = p(x′) T(x | x′)
using T(x′ | x) = A(x′ | x) q(x′ | x).
MH properties
The MH algorithm eventually converges to a stationary distribution p(x) that is the true distribution.
However, we have no guarantees as to when this will occur:
the burn-in period is a way to discard the un-converged part of the Markov chain,
but deciding when to halt burn-in is an art that needs experimentation.
q must be chosen to fulfill the technical requirements (so that the resulting chain is ergodic).
Gibbs sampling algorithm
Suppose the graphical model contains variables x1, …, xN.
Initialize starting values for x1, …, xN.
Do until convergence:
  Pick an ordering of the N variables (can be fixed or random)
  For each variable x_j in order:
    Sample x from P(X_j | x1, …, x_{j−1}, x_{j+1}, …, xN), i.e., conditioned on the current values of all other variables
    Update x_j ← x
When we update x_j, we immediately use its new value for sampling the other variables x_k.
Gibbs sampling algorithm
Suppose the graphical model contains variables x1, …, xN. If the current sample is x = (x1, …, xN), the next sample x′ = (x1′, …, xN′) is drawn as:
Sample x1′ from P(X1 | x2, …, xN)
Sample x2′ from P(X2 | x1′, x3, …, xN)
…
Sample x_j′ from P(X_j | x1′, …, x_{j−1}′, x_{j+1}, …, xN)
…
Sample xN′ from P(XN | x1′, …, x_{N−1}′)
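A minimal sketch of one such sweep on an illustrative model (not from the slides): a bivariate normal with correlation ρ = 0.8, whose complete conditionals are known in closed form, x1 | x2 ~ N(ρx2, 1 − ρ²) and symmetrically for x2.

```python
import math
import random

rho = 0.8
sd = math.sqrt(1 - rho ** 2)       # conditional std dev of x1|x2 and x2|x1

rng = random.Random(0)
x1, x2 = 0.0, 0.0
samples = []
for t in range(100_000):
    x1 = rng.gauss(rho * x2, sd)   # sample x1 from p(x1 | x2)
    x2 = rng.gauss(rho * x1, sd)   # immediately use the new x1 for p(x2 | x1)
    samples.append((x1, x2))

burned = samples[1000:]
m1 = sum(a for a, _ in burned) / len(burned)          # close to 0
e_x1x2 = sum(a * b for a, b in burned) / len(burned)  # close to rho = 0.8
```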
Gibbs Sampling
Gibbs sampling is an MCMC algorithm that samples one random variable of a graphical model at a time.
We will see that Gibbs sampling is a special case of MH.
Gibbs sampling needs reasonable time and memory:
complete conditionals are fairly easy to derive for many graphical models (e.g., mixture models, LDA);
for conjugate-exponential models, we can use standard sampling techniques;
if the conditionals are log-concave, we can use adaptive rejection sampling.
Blocked Gibbs sampling samples a subset of variables at a time.
Gibbs sampling: example
Target: P(D, I, G | s1, l0) in the student network.
Starting from an initialization such as (d0, i0, g1), each step resamples one variable from its complete conditional given the evidence and the current values of the others, e.g., d′ ~ P(D | i0, g1, s1, l0), then i′ ~ P(I | d0, g1, s1, l0), then g′ ~ P(G | d0, i1, s1, l0), producing a chain of samples such as x^(t) = (d0, i0, g1) → x^(t+1) = (d0, i1, g1).
Gibbs sampling is a special case of MH
Gibbs sampling is an MH algorithm with the proposal distribution:
q_j(x′ | x) = P(x_j′ | x_{−j}), with x_{−j}′ = x_{−j}
(x_{−j} denotes all variables except x_j).
Using this proposal distribution in the MH algorithm, we find that samples are always accepted:
A(x′ | x) = min(1, [P(x′) q_j(x | x′)] / [P(x) q_j(x′ | x)])
= min(1, [P(x_j′, x_{−j}) P(x_j | x_{−j})] / [P(x_j, x_{−j}) P(x_j′ | x_{−j})])
= min(1, [P(x_j′ | x_{−j}) P(x_{−j}) P(x_j | x_{−j})] / [P(x_j | x_{−j}) P(x_{−j}) P(x_j′ | x_{−j})]) = 1
Gibbs sampling is simply MH with a proposal that is always accepted!
Gibbs sampling: complete conditional distributions
P(x_j | x_{−j}) can often be evaluated quickly, because it depends only on the neighboring nodes.
Let MB(X_j) be the Markov blanket of X_j:
for a BN, the Markov blanket of a variable is the set of its parents, children, and co-parents;
for an MRF, the Markov blanket of a variable is its immediate neighbors.
P(X_j | X1, …, X_{j−1}, X_{j+1}, …, XN) = P(X_j | MB(X_j))
A sufficient condition for ergodicity of Gibbs sampling is that none of the conditional distributions be anywhere zero:
all conditions are satisfied when all probabilities are positive.
Gibbs sampling implementation
Conditionals with a few discrete settings can be explicitly normalized:
P(x_j | x_{−j}) = P(x_j, x_{−j}) / Σ_{x_j′} P(x_j′, x_{−j})
= Π_{k: X_j ∈ Scope(φ_k)} φ_k(x_j, x_k \ x_j) / Σ_{x_j′} Π_{k: X_j ∈ Scope(φ_k)} φ_k(x_j′, x_k \ x_j)
This depends only on the instantiation of the variables in X_j's Markov blanket.
Continuous univariate conditionals can be sampled using standard sampling methods,
e.g., inverse CDF sampling, rejection sampling, adaptive rejection sampling.
Sampling and the EM algorithm
EM algorithm:
t ← 1; choose initial parameters θ_1
Iterate until convergence:
  E step: calculate P(z^(n) | x^(n), θ_t) for n = 1, …, N
  M step: θ_{t+1} = argmax_θ Σ_{n=1}^{N} E_{P(z^(n)|x^(n),θ_t)}[log P(x^(n), z | θ)]
  t ← t + 1
Monte Carlo EM algorithm:
t ← 1; choose initial parameters θ_1
Iterate until convergence:
  E step: draw samples z^(n,m) ~ P(z^(n) | x^(n), θ_t) for n = 1, …, N and m = 1, …, M
  M step: θ_{t+1} = argmax_θ Σ_{n=1}^{N} (1/M) Σ_{m=1}^{M} log P(x^(n), z^(n,m) | θ)
  t ← t + 1
Sampling and Bayesian Learning
Example: Bayesian mixture model
To find the posterior, we generate samples from it using Gibbs sampling: each step resamples one latent variable or parameter from its complete conditional, and these complete conditionals can be computed easily.
[The model equations and complete conditionals were shown on the original slides.]
Gibbs sampling: summary
Converts the hard problem of inference into a sequence of sampling steps from (univariate) conditional distributions.
Pros:
perhaps the simplest Markov chain for graphical models;
computationally efficient: the conditionals can often be evaluated quickly.
Cons:
only applies if we can sample from the product of factors (for overly complex cases, Metropolis-Hastings can provide an effective alternative);
often slow to mix, especially when the probabilities are peaked;
the acceptance rate is always 1, and high acceptance usually entails slow exploration.
MH & MCMC: summary
To sample from the true distribution p(x), MH methods use adaptive proposals q(x | x^(t)) rather than a fixed q(x).
MH allows specifying any proposal q(x′ | x),
but choosing a good q(x′ | x) requires technical care.
Gibbs sampling sets the proposal q(x′ | x) to the conditional distribution P(x_j′ | x_{−j}), and the acceptance rate is then always one.
Gibbs sampling is useful at least as a "first try";
better choices of MH transition probabilities than the univariate conditionals used in Gibbs may lead to faster convergence.
Gibbs Sampling: Extensions
Methods to improve convergence:
blocking; collapsing (Rao-Blackwellisation)
Blocked Gibbs Sampling
We can partition the variables into several disjoint blocks such that we can efficiently sample from each block given all other variables.
Blocked Gibbs sampling algorithm: iteratively sample blocks of variables, rather than individual variables,
thereby taking much "longer-range" transitions in the state space in a single sampling step.
Here, as in Gibbs sampling, we define the blocked Gibbs algorithm to produce a new sample only once all blocks have been resampled.
Blocked Gibbs Sampling: Example
P(I1, I2, I3, I4, D1, D2 | G1,1 … G4,2)
Rao-Blackwellised Sampling
Sampling in high-dimensional spaces causes high variance in the estimate.
RB idea: sample some variables z and, conditioned on them, compute the expected value over the rest, x, analytically.
Rao-Blackwellised Sampling
Basic Monte Carlo estimation for the joint distribution of x, z:
(x^(m), z^(m)) ~ p(x, z),   m = 1, 2, …, M
E_p[f(x, z)] = ∫∫ f(x, z) p(x, z) dx dz ≈ (1/M) Σ_{m=1}^{M} f(x^(m), z^(m))
When the expectation under p(x | z) is tractable, we can instead estimate:
E_p[f(x, z)] = ∫∫ f(x, z) p(x | z) p(z) dx dz = ∫ [∫ f(x, z) p(x | z) dx] p(z) dz ≈ (1/M) Σ_{m=1}^{M} ∫ f(x, z^(m)) p(x | z^(m)) dx
and we expect this estimator to be more accurate.
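A minimal sketch of the variance reduction, on an illustrative model (not from the slides): z ~ Bernoulli(0.5), x | z ~ N(μ_z, 1) with μ_0 = −2 and μ_1 = 2, estimating E[x] = 0 with f(x, z) = x. The RB estimator replaces the sampled x with the analytic E[x | z] = μ_z, removing the within-component noise from the estimator.

```python
import random

mu = {0: -2.0, 1: 2.0}   # component means of the illustrative mixture

def estimates(m, rng):
    raw = rb = 0.0
    for _ in range(m):
        z = 1 if rng.random() < 0.5 else 0
        x = rng.gauss(mu[z], 1.0)
        raw += x          # plain Monte Carlo term f(x, z) = x
        rb += mu[z]       # Rao-Blackwellised term E[x | z]
    return raw / m, rb / m

rng = random.Random(0)
runs = [estimates(500, rng) for _ in range(400)]
var_raw = sum(r * r for r, _ in runs) / len(runs)   # approx (4 + 1)/500
var_rb = sum(b * b for _, b in runs) / len(runs)    # approx 4/500, smaller
```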
Rao-Blackwellized Estimation
It is classically used to reduce the variance of estimators: analytically marginalize, or collapse, some variables from the model, and derive a Gibbs sampler for the collapsed representation.
Collapsed Gibbs Sampling: Bayesian Mixture Models
Topic Models: Collapsed Gibbs
A popular inference algorithm for topic models:
integrate out the topic proportions π and topics B;
only the word-topic assignments z need to be sampled.
[This slide has been adapted from Xing's slides, CMU]
Topic Models: Collapsed Gibbs
What is P(z_i | z_{−i}, w)?
It is a product of two Dirichlet-multinomial conditional distributions.
[This slide has been adapted from Xing's slides, CMU]
Collapsed Gibbs illustration
[This slide has been adapted from Xing's slides, CMU]
Practical aspects of MCMC
Using the samples:
only once the chain mixes are all samples x^(t) from the stationary distribution p(x),
so we can use all x^(t) for t > T_mix.
However, nearby samples are correlated,
so we should not overestimate the quality of our estimate by simply counting samples.
Selecting the proposal q(x′ | x): a good q proposes distant samples with a sufficiently high acceptance rate.
Stop burn-in
Stop burn-in when the state distribution is reasonably close to p(x):
start collecting samples after the chain has run long enough to mix.
How do you know if a chain has mixed or not?
In general, you can never prove that a chain has mixed:
no general-purpose theoretical analysis exists for the mixing time of graphical models.
But in many cases you can show that it has not mixed.
Poor mixing
Poor mixing (slow convergence): the chain stays in small regions of the parameter space for long periods of time,
e.g., when the state space consists of several regions that are connected only by low-probability transitions.
For a multimodal target distribution, a choice of starting values that traps us near one of the modes means straightforward MCMC methods are unlikely to work.
The most straightforward diagnostic approach is to use multiple highly dispersed initial values to start several different chains.
Stop burn-in
There are approaches that tell us if a chain has not converged:
compare chain statistics in different windows within a single run of the chain, or across different chains initialized differently.
Simple approaches:
plot the sample values vs. time; plot the log-likelihood vs. time.
Mixing?
Each dot is a statistic (e.g., P(x ∈ S)):
its x-position is the estimated value from chain 1; its y-position is the estimated value from chain 2. [Koller & Friedman book]
Sample Values vs Time
Monitor convergence by plotting samples (of random variables) from multiple MH runs (chains):
if the chains are well-mixed (left), they have probably converged;
if the chains are poorly mixed (right), we should continue burn-in.
[This slide has been adapted from Xing's slides, CMU]
Log-likelihood vs Time
Many graphical models are high-dimensional, so it is hard to visualize all the random-variable chains at once.
Instead, plot the complete log-likelihood vs. time:
the complete log-likelihood is a random variable that depends on all model variables.
Generally, the log-likelihood will climb and then eventually plateau.
[This slide has been adapted from Xing's slides, CMU]
Autocorrelation coefficient
Autocorrelation function of a random variable x:
R(k) = cov[x^(t), x^(t+k)] / var[x^(t)]
Sample estimate:
R̂_x(k) = Σ_{t=1}^{N−k} (x^(t) − x̄)(x^(t+k) − x̄) / Σ_{t=1}^{N} (x^(t) − x̄)²,   where x̄ = (1/N) Σ_{t=1}^{N} x^(t)
One way to diagnose a poorly mixing chain is to observe high autocorrelation at distant lags.
High autocorrelation leads to a smaller effective sample size!
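A minimal sketch of the sample estimate R̂_x(k) above, applied to an illustrative AR(1) chain x_t = φx_{t−1} + noise, whose true lag-1 autocorrelation is φ; it also computes the SSIF-based effective sample size discussed a few slides later.

```python
import random

# Sample autocorrelation at lag k, as in the formula above.
def autocorr(xs, k):
    n = len(xs)
    m = sum(xs) / n
    num = sum((xs[t] - m) * (xs[t + k] - m) for t in range(n - k))
    den = sum((x - m) ** 2 for x in xs)
    return num / den

rng = random.Random(0)
phi, x = 0.7, 0.0
xs = []
for _ in range(200_000):
    x = phi * x + rng.gauss(0.0, 1.0)   # AR(1) chain with autocorrelation phi
    xs.append(x)

r1 = autocorr(xs, 1)                    # close to phi = 0.7
ssif = (1 + r1) / (1 - r1)              # sample size inflation factor
ess = len(xs) / ssif                    # effective sample size
```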
Proposal distribution
How do we know if our proposal q(x′|x) is any good?
Monitor the acceptance rate; plot the autocorrelation function.
Acceptance Rate
Choosing the proposal q(x′|x) is a tradeoff:
"narrow", low-variance proposals have high acceptance, but take many iterations to explore p(x) fully;
"wide", high-variance proposals have the potential to explore much of p(x), but many proposals are rejected (which slows down the sampler).
A good q(x′|x) proposes distant samples x′ with a sufficiently high acceptance rate.
[This slide has been adapted from Xing's slides, CMU]
Auto-correlation Function
MCMC chains always show autocorrelation (AC):
AC means that samples adjacent in time are highly correlated.
[This slide has been adapted from Xing's slides, CMU]
Auto-correlation
We want proposals q(x′|x) with low autocorrelation:
if the variance of q is too small, moves are accepted with high probability, but they are also small, generating high autocorrelation and poor mixing.
High autocorrelation leads to a smaller effective sample size!
Acceptance Rate
The first-order autocorrelation R_x(1) can be used to estimate the sample size inflation factor (SSIF):
s_x = (1 + R_x(1)) / (1 − R_x(1))
If we took n samples with SSIF s_x, then the effective sample size is n / s_x.
[This slide has been adapted from Xing's slides, CMU]
Thinning
One strategy for reducing autocorrelation is thinning the output:
store only every m-th sample after the burn-in period;
the autocorrelation can be reduced by increasing the thinning ratio.
Approximately independent samples can be obtained by thinning all samples after burn-in.
Thinning or not?
Using all of the samples produces a provably better estimator than using just those remaining after thinning.
Thinning is useful primarily in settings where there is a significant cost associated with using each sample (for example, when evaluating f is costly),
so that we might want to reduce the overall number of particles used.
One long chain or many smaller chains?
If long burn-in periods are required, or if the chains have very high autocorrelation, using a large number of smaller chains may result in each one not being long enough to be of any value.
A good strategy: run a small number of chains in parallel for a reasonably long time, using their behavior to evaluate mixing.
MCMC implementation details
Tunable parameters and design choices in MCMC:
the proposal distribution; the metrics for evaluating mixing; the number of chains to run; techniques for determining the thinning ratio.
MCMC sampling: summary
Theoretical guarantees of convergence to exact values; however, convergence can be quite slow:
we may need ridiculously many samples, so MCMC is only useful if the chain mixes reasonably quickly;
it is difficult to tell whether the chain has converged, and many options need to be specified:
the application of Markov chains is more of an art than a science.
Unlike forward sampling methods, MCMC methods:
do not degrade when the probability of the evidence is low, or when the posterior is very different from the prior;
apply to undirected models as well as to directed models.
Resources