SLIDE 1 Centre for Research in Statistical Methodology http://go.warwick.ac.uk/crism
- Conferences and workshops (including general calls for workshops to be organised primarily outside Warwick; calls every 6 months, next in Summer 2014)
- Research Fellow positions: next advertising 2 positions around February 2014
- PhD studentships
- Academic visitor programme.
SLIDE 2
From Peskun Ordering to Optimal Simulated Tempering
Gareth Roberts, University of Warwick
MCMSki, Chamonix, January 2014
Mainly joint work with Jeffrey Rosenthal, but with aspects of joint work with Yves Atchadé.
SLIDE 3 Plan for talk
- 1. Why 0.234 is natural in many problems
- 2. Comparisons of algorithms based on their diffusion limits, and links to Peskun ordering
- 3. A heterogeneous scaling problem: spacing of temperatures in simulated tempering
- 4. A local 0.234 story for simulated tempering
- 5. Conclusions
SLIDE 4
Metropolis-Hastings algorithm
Given a target density π(·) that we wish to sample from, and a Markov chain transition kernel density q(·, ·), we construct a Markov chain as follows. Given X_n, generate Y_{n+1} from q(X_n, ·). Now set X_{n+1} = Y_{n+1} with probability
α(X_n, Y_{n+1}) = 1 ∧ [π(Y_{n+1}) q(Y_{n+1}, X_n)] / [π(X_n) q(X_n, Y_{n+1})].
Otherwise set X_{n+1} = X_n.
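The accept/reject recipe above can be sketched in a few lines of Python. This is a generic illustration, not code from the talk; the standard normal target, random-walk proposal and scaling below are made-up examples:

```python
import math, random

def mh_step(x, log_pi, propose, log_q):
    """One Metropolis-Hastings update.
    log_pi : log target density (up to an additive constant)
    propose: draws y ~ q(x, .)
    log_q  : log q(x, y), up to a constant that cancels in the ratio
    """
    y = propose(x)
    log_alpha = min(0.0, log_pi(y) + log_q(y, x) - log_pi(x) - log_q(x, y))
    if random.random() < math.exp(log_alpha):
        return y   # accept the proposal
    return x       # reject: chain stays where it is

# Usage: random-walk proposal targeting a standard normal.
log_pi = lambda x: -0.5 * x * x
sigma = 2.4                                   # a proposal scale chosen for illustration
propose = lambda x: x + sigma * random.gauss(0.0, 1.0)
log_q = lambda x, y: 0.0                      # symmetric proposal, so q cancels

random.seed(1)
x, chain = 0.0, []
for _ in range(5000):
    x = mh_step(x, log_pi, propose, log_q)
    chain.append(x)
```

For a symmetric proposal the `log_q` terms cancel, recovering the simplified acceptance probability on the next slide.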
SLIDE 5 Two first scaling problems
1. Random walk Metropolis (RWM): q(x, y) = q(|y − x|). The acceptance probability simplifies to
α(x, y) = 1 ∧ π(y)/π(x).
For example q(x, ·) ∼ MVN_d(x, σ²I_d), but also more generally.
2. Metropolis-adjusted Langevin algorithm (MALA):
Y ∼ MVN(x^(k) + hV ∇log π(x^(k))/2, hV).
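The MALA proposal above can be sketched in one dimension (taking V = 1 in the slide's notation). A minimal illustration, not code from the talk; the target and step size h are made-up examples:

```python
import math, random

def mala_step(x, log_pi, grad_log_pi, h):
    """One MALA update in one dimension: propose from the slide's
    Gaussian Y ~ N(x + h*grad_log_pi(x)/2, h), then Metropolis-correct
    with the asymmetric proposal densities."""
    mu_x = x + 0.5 * h * grad_log_pi(x)
    y = mu_x + math.sqrt(h) * random.gauss(0.0, 1.0)
    mu_y = y + 0.5 * h * grad_log_pi(y)
    # log q(x, y) and log q(y, x), up to the same additive constant
    log_q_xy = -((y - mu_x) ** 2) / (2.0 * h)
    log_q_yx = -((x - mu_y) ** 2) / (2.0 * h)
    log_alpha = min(0.0, log_pi(y) - log_pi(x) + log_q_yx - log_q_xy)
    return y if random.random() < math.exp(log_alpha) else x

# Usage on a standard normal target.
random.seed(2)
log_pi = lambda x: -0.5 * x * x
grad = lambda x: -x
x, chain = 0.0, []
for _ in range(5000):
    x = mala_step(x, log_pi, grad, h=1.0)
    chain.append(x)
```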
SLIDE 6 200 600 1000 −0.25 −0.05
0.01 0.9869
200 600 1000 −1.0 −0.4
0.02 0.996
200 600 1000 0.0 1.0
0.03 0.9912
200 600 1000 −0.5 0.5
0.04 0.9895
200 600 1000 −0.5 0.5
0.05 0.9939
200 600 1000 −1.5 0.0 1.5
0.1 0.9905
200 600 1000 −2 2
0.2 0.9715
200 600 1000 −2 2
0.3 0.954
200 600 1000 −2 2
0.4 0.9443
200 600 1000 −3 2 4
0.5 0.9178
200 600 1000 −3 −1 1 3
0.7 0.8368
200 600 1000 −2 2
0.9 0.7732
200 600 1000 −3 −1 1
1 0.7579
200 600 1000 −3 −1 1 3
1.2 0.7474
200 600 1000 −2 2
1.5 0.6192
200 600 1000 −2 2
1.8 0.6529
200 600 1000 −2 2
2.1 0.6067
200 600 1000 −2 2
2.38 0.5542
200 600 1000 −3 −1 1 3
2.6 0.6277
200 600 1000 −3 −1 1
2.9 0.6159
200 600 1000 −2 2
3.2 0.672
200 600 1000 −2 2
3.5 0.6982
200 600 1000 −3 2
4 0.6446
200 600 1000 −3 −1 1
5 0.6962
200 600 1000 −3 2
6 0.7411
200 600 1000 −3 −1 1 3
8 0.829
200 600 1000 −3 2
10 0.7974
200 600 1000 −2 2
12 0.8651
200 600 1000 −2 2
15 0.9036
200 600 1000 −3 −1 1 3
18 0.9089
200 600 1000 −3 2
22 0.9426
200 600 1000 −3 −1 1 3
27 0.9325
200 600 1000 −2 1
33 0.933
200 600 1000 −3 −1 1
40 0.9564
200 600 1000 −2 1 2
50 0.962
200 600 1000 −2 1
100 0.989
The Goldilocks dilemma
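A sketch of the experiment behind these panels, under two assumptions not stated on the slide: that the target is a standard normal, and that the number beside each σ is the chain's lag-one autocorrelation:

```python
import math, random

def rwm_lag1_autocorr(sigma, n=20000, seed=0):
    """Run random walk Metropolis on a standard normal target and
    return the empirical lag-one autocorrelation of the chain."""
    rng = random.Random(seed)
    x, xs = 0.0, []
    for _ in range(n):
        y = x + sigma * rng.gauss(0.0, 1.0)
        if rng.random() < math.exp(min(0.0, -0.5 * (y * y - x * x))):
            x = y
        xs.append(x)
    m = sum(xs) / n
    var = sum((v - m) ** 2 for v in xs) / n
    cov = sum((xs[i] - m) * (xs[i + 1] - m) for i in range(n - 1)) / n
    return cov / var

# Tiny steps and huge steps both give autocorrelation near 1;
# the minimum is attained near sigma = 2.38.
acs = {s: rwm_lag1_autocorr(s) for s in (0.1, 0.5, 2.38, 10.0, 50.0)}
```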
SLIDE 7 Scaling problems and diffusion limits
Choosing σ in the above algorithms to optimise efficiency. For 'appropriate choices' the d-dimensional algorithm has a limit which is a diffusion. The faster the diffusion the better!
- How should σd depend on d for large d?
- What does this tell us about the efficiency of the algorithm?
- Can we optimise σd in some sensible way?
- Can we characterise optimal (or close to optimal) values of σd in terms of observable properties of the Markov chain?
For RWM and MALA (and some other local algorithms) and for some simple classes of target distributions, a solution to the above can be obtained by considering a diffusion limit (for high-dimensional problems).
SLIDE 8
Simulated tempering
Consider a d-dimensional target density fd, and suppose it is possible to construct MCMC on fd,β = fd^β, 0 ≤ χ ≤ β ≤ 1.
This typically mixes better for small β. However we are interested in fd,1.
Problem: choose a finite collection of inverse temperatures B = {βi} such that we can construct a Markov chain on R^d × B which "optimally" permits the exploration of fd,1. This is also a scaling problem: choosing how large to make βi − βi−1 for each i.
SLIDE 9 What is "efficiency"?
Let X be a Markov chain. Then for a π-integrable function g, efficiency can be described by the asymptotic variance
σ²(g, P) = lim_{n→∞} n Var( n⁻¹ Σ_{i=1}^n g(X_i) ).
Under weak(ish) regularity conditions,
σ²(g, P) = Varπ(g) + 2 Σ_{i=1}^∞ Covπ(g(X_0), g(X_i)).
In general the relative efficiency between two possible Markov chains varies depending on what function of interest g is being considered. As d → ∞ the dependence on g disappears, at least in cases where we have a diffusion limit, as we will see....
SLIDE 10 How do we measure "efficiency" efficiently?
It is well-established that estimating the limiting variance is hard.
"It's easy, just measure ESJD instead!" (Andrew Gelman, 1993)
ESJD = E((X_{t+1} − X_t)²)
Why is this a good idea? Optimising this is just like considering only linear functions g and ignoring all but the first term in
Σ_{i=1}^∞ Covπ(g(X_0), g(X_i)).
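ESJD is trivial to estimate from a single run, which is the point of the quote. A minimal sketch (the helper name is my own):

```python
def esjd(chain):
    """Expected squared jump distance: the average of (X_{t+1} - X_t)^2
    along the realised path. Rejected moves contribute zero, so ESJD
    couples the proposal step size with the acceptance rate."""
    return sum((b - a) ** 2 for a, b in zip(chain, chain[1:])) / (len(chain) - 1)

# For a stationary chain, ESJD = 2 Var(X) (1 - rho_1), so maximising
# ESJD is the same as minimising the lag-one autocorrelation rho_1,
# i.e. keeping only the i = 1 term of the covariance sum above.
```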
SLIDE 11 Diffusion limits: a framework for studying algorithm optimality
Many MCMC algorithms are well-approximated by diffusions (usually in the sense of diffusion limit results for high-dimensional algorithms, or other limiting regimes). Examples include random walk Metropolis, various versions of the Langevin algorithm, Gibbs samplers and simulated tempering. This provides a natural framework for studying algorithm complexity and optimisation. But why is the comparison of diffusion limits any easier than comparing the finite-dimensional algorithms?
SLIDE 12 [Figure: an MCMC sample path over 10,000 iterations, with values ranging roughly from 0.8 to 1.8.]
MCMC sample paths and diffusions. Here the ESJD corresponds to the quadratic variation
lim_{ε→0} Σ_{i=1}^{[tε⁻¹]} (X_{iε} − X_{(i−1)ε})².
SLIDE 13 Diffusions
A d-dimensional diffusion is a continuous-time strong Markov process with continuous sample paths. We can define a diffusion as the solution of the stochastic differential equation (SDE)
dX_t = µ(X_t)dt + σ(X_t)dB_t,
where B denotes d-dimensional Brownian motion, σ is a d × d matrix and µ is a d-vector. A diffusion is often understood intuitively and constructively via its dynamics over small time intervals. Approximately, for small h,
X_{t+h} | X_t = x_t ∼ x_t + hµ(x_t) + h^{1/2} σ(x_t) Z,
where Z is a d-dimensional standard normal random variable.
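The small-h dynamics above are exactly the Euler-Maruyama simulation scheme. A one-dimensional sketch, with an Ornstein-Uhlenbeck process as a made-up example (its stationary law is N(0, 1)):

```python
import math, random

def euler_maruyama(mu, sigma, x0, h, n, seed=0):
    """Simulate dX = mu(X)dt + sigma(X)dB by iterating the slide's
    small-h approximation: X <- X + h*mu(X) + sqrt(h)*sigma(X)*Z."""
    rng = random.Random(seed)
    x, path = x0, [x0]
    for _ in range(n):
        x = x + h * mu(x) + math.sqrt(h) * sigma(x) * rng.gauss(0.0, 1.0)
        path.append(x)
    return path

# Ornstein-Uhlenbeck example: dX = -X dt + sqrt(2) dB, stationary N(0, 1).
path = euler_maruyama(lambda x: -x, lambda x: math.sqrt(2.0), 0.0, 0.01, 200000)
```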
SLIDE 14 "Efficiency" for diffusions
Consider two Langevin diffusions, both with π invariant: for h1 < h2,
dX^i_t = h_i^{1/2} dB_t + h_i ∇log π(X^i_t)/2 dt,  i = 1, 2.
[Figure: simulated sample paths of the two diffusions over the same time window.]
X² is a "speeded-up" version of X¹. But how can we compare diffusions which have non-constant diffusion coefficient?
SLIDE 15 Peskun ordering
Peskun (1973). Many uses in MCMC theory (e.g. see work by Mira, Tierney and co-workers).
P1 and P2 are two Markov chain kernels with invariant distribution π. We say P2 dominates P1 in the Peskun sense, and write P2 ⪰ P1, if for all x and all sets A not containing x,
P1(x, A) ≤ P2(x, A).
Peskun ordering implies an ordering on asymptotic variances of ergodic estimates:
lim_{n→∞} n Var( n⁻¹ Σ_{i=1}^n g(X^(1)_i) ) ≥ lim_{n→∞} n Var( n⁻¹ Σ_{i=1}^n g(X^(2)_i) ),
where X^(i) moves according to P_i.
Surprisingly many applications in MCMC! But are proposal scaling problems completely incompatible with Peskun??
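The variance ordering can be checked exactly on a toy example. A sketch assuming numpy is available; the three-state kernels are invented for illustration: a "lazy" chain is Peskun-dominated by i.i.d. sampling from π, and the spectral formula for the asymptotic variance of a reversible chain confirms it pays a strictly larger variance:

```python
import numpy as np

def asymptotic_variance(P, pi, g):
    """Exact sigma^2(g, P) = Var_pi(g) + 2*sum_i Cov(g(X_0), g(X_i))
    for a pi-reversible kernel P, via the eigendecomposition of the
    symmetrised kernel: sum over eigenvalues lam < 1 of
    (1 + lam)/(1 - lam) times the squared coefficient of g."""
    d = np.sqrt(pi)
    S = d[:, None] * P / d[None, :]      # D^{1/2} P D^{-1/2}, symmetric
    lam, U = np.linalg.eigh(S)
    c = U.T @ (d * g)                    # coordinates of g in the eigenbasis
    keep = lam < 1.0 - 1e-10             # drop the stationary eigenvalue 1
    return float(np.sum((1 + lam[keep]) / (1 - lam[keep]) * c[keep] ** 2))

pi = np.ones(3) / 3
P_fast = np.full((3, 3), 1 / 3)          # i.i.d. sampling from pi
P_slow = 0.5 * np.eye(3) + 0.5 * P_fast  # lazy version: off-diagonal mass halved

# Peskun: every off-diagonal entry of P_slow is <= the one in P_fast,
# so P_fast dominates P_slow, and its asymptotic variances are smaller.
g = np.array([0.0, 1.0, 2.0])
v_fast = asymptotic_variance(P_fast, pi, g)
v_slow = asymptotic_variance(P_slow, pi, g)
```

Here v_fast = Var_π(g) = 2/3, while the lazy chain (eigenvalues 1/2) inflates this by the factor (1 + 1/2)/(1 − 1/2) = 3.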
SLIDE 16 Peskun ordering in continuous time
Peskun ordering of P1 and P2 does not imply, nor is implied by, ordering of the 2-step transition kernels. How about continuous time? (See work of Leisen and Mira, 2008.) But diffusions satisfy P(x, {x}) = 0, so there can never be any interesting Peskun orderings in the original sense.
HOWEVER, MANY discrete-time processes possess the SAME diffusion limit. Maybe we can consider the limit along sequences of chains which do satisfy Peskun ordering.
SLIDE 17 A more powerful diffusion comparison result
Consider two Langevin diffusions, both with stationary distribution π:
dX^i_t = h_i(X^i_t)^{1/2} dB_t + V_i(X^i_t) dt,  i = 1, 2,
with h1(x) ≤ h2(x) for all x. (Here V_i(x) = (h_i(x) ∇log π(x) + h_i′(x))/2.)
Under regularity conditions on the tails of π (which have to decay exponentially, or π needs to have bounded support), X² dominates X¹ in the covariance ordering sense:
lim_{t→∞} t Var( t⁻¹ ∫_0^t g(X^1_s) ds ) ≥ lim_{t→∞} t Var( t⁻¹ ∫_0^t g(X^2_s) ds ).
SLIDE 18
Diffusion comparison result (ctd) Proof by finding suitable Peskun ordered birth and death processes with the given diffusion limit. A dominated convergence argument is needed to extend the covariance ordering to the limiting diffusions. Can extend argument to give approximate covariance ordering for ANY processes with these respective limits.
SLIDE 19
The first diffusion comparison result (Roberts, Gelman and Gilks, 1997)
Consider the Metropolis case. Suppose
π ∼ Π_{i=1}^d f(x_i),  q(x, ·) ∼ N(x, σd² I_d),  X_0 ∼ π.
Set σd² = ℓ²/d, and consider the first coordinate with time speeded up by a factor of d:
Z^d_t = X^(1)_{[td]}.
Z^d is not a Markov chain; however, in the limit as d goes to ∞, it is Markov: Z^d ⇒ Z where Z satisfies the SDE
dZ_t = h(ℓ)^{1/2} dB_t + h(ℓ) ∇log f(Z_t)/2 dt,
for some function h(ℓ).
SLIDE 20 [Figure: plot of the speed function h(ℓ).]
nh(ℓ)/d: how much diffusion path do we get for our n iterations?
SLIDE 21
h(ℓ) = ℓ² × 2Φ(−√I ℓ/2),
where I = E_f[((log f(X))′)²]. So
h(ℓ) = ℓ² × A(ℓ),
where A(ℓ) is the limiting overall acceptance rate of the algorithm, i.e. the proportion of proposed Metropolis moves ultimately accepted. So
h(ℓ) = (4/I) [Φ⁻¹(A(ℓ)/2)]² A(ℓ),
and so the maximisation problem can be written entirely in terms of the algorithm's acceptance rate.
SLIDE 22
[Figure, two panels: efficiency as a function of the proposal scaling σ (left) and as a function of the acceptance rate (right).]
Efficiency as a function of scaling and acceptance rate.
SLIDE 23 When can we 'solve' the scaling problem for Metropolis?
We need a sequence of target densities πd which are sufficiently regular as d → ∞ in order that meaningful (and optimisable) limiting distributions exist. E.g.:
- 1. π ∼ Π_{i=1}^d f(x_i). (NB for discontinuous f, mixing is O(d²) and the optimal rate is 0.13; Peter Neal.)
- 2. π ∼ Π_{i=1}^d f(c_i x_i), q(x, ·) ∼ N(x, σd² I_d), for some inverse scales c_i (Bedard, Rosenthal, Voss).
- 3. Elliptically symmetric target densities (Sherlock, Bedard).
- 4. The components form a homogeneous Markov chain.
- 5. π is a Gibbs random field with finite range interactions (Breyer).
- 6. Discretisations of an infinite-dimensional system absolutely continuous with respect to a Gaussian measure (e.g. Pillai, Stuart, Thiery).
- 7. Purely discrete product form distributions.
SLIDE 24 A basic analysis of Metropolis
Write Y(d) = X(d) + h^{1/2} Z(d). Then
α(X(d), Y(d)) = 1 ∧ β(X(d), Y(d)) = 1 ∧ π(d)(Y(d))/π(d)(X(d)),
and a Taylor expansion gives
log β(X(d), X(d) + h^{1/2} Z(d)) ≈ h^{1/2} ∇log π(X(d)) · Z(d) + (1/2) h Z(d)′ ∇∇′ log π(X(d)) Z(d),
so that
β ≈ exp{h^{1/2} G(d) − h V(d)/2},
where G(d) is Gaussian. If V(d) converges in probability to a constant, then the log acceptance ratio is asymptotically of the Gaussian form N(−σ²/2, σ²).
In fact that is all we need for the 0.234 framework to hold!
SLIDE 25 Why?
Suppose now G is a standard Gaussian and ℓ incorporates the scaling choice:
ESJD ≈ ℓ² E[1 ∧ exp(ℓG − ℓ²/2)] = ℓ² × 2Φ(−√I ℓ/2),
so that
ESJD = (4/I) [Φ⁻¹(A(ℓ)/2)]² A(ℓ),
which is maximised by taking A(ℓ) = 0.234.
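Dropping the constant 4/I, the quantity to maximise over acceptance rates a ∈ (0, 1) is a [Φ⁻¹(a/2)]². A one-line grid search using only the standard library recovers 0.234 (the helper names are my own):

```python
from statistics import NormalDist

inv_cdf = NormalDist().inv_cdf   # Phi^{-1} for the standard normal

def esjd_per_unit(a):
    """ESJD as a function of the acceptance rate a, up to the
    constant 4/I from the slide: a * Phi^{-1}(a/2)^2."""
    return a * inv_cdf(a / 2.0) ** 2

# Grid search over acceptance rates in (0, 1).
grid = [i / 10000.0 for i in range(1, 10000)]
a_star = max(grid, key=esjd_per_unit)   # approximately 0.234
```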
SLIDE 26 Simulated tempering
Consider a d-dimensional target density
fd(x) = e^{dK} Π_{i=1}^d f(x_i),
for some unnormalised one-dimensional density function f : R → [0, ∞), where K = −log(∫ f(x)dx).
Consider simulated tempering in d dimensions, with inverse temperatures chosen as follows: β(d)_0 = 1, and
β(d)_{i+1} = β(d)_i − ℓ(β(d)_i)/d^{1/2},
for some fixed C¹ function ℓ : [0, 1] → R.
To stop adding new temperature values, we fix some χ ∈ (0, 1) and keep going until the inverse temperatures drop below χ, i.e. we stop at inverse temperature β(d)_{k(d)}, where
k(d) = sup{i : β(d)_i ≥ χ}.
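The ladder construction above is mechanical to implement. A sketch (the function names and the constant choice of ℓ are illustrative only); note the recursion produces O(d^{1/2}) inverse temperatures:

```python
import math

def build_ladder(d, chi, ell):
    """Inverse-temperature ladder from the slide:
    beta_0 = 1, beta_{i+1} = beta_i - ell(beta_i)/sqrt(d),
    stopping as soon as the next value would drop below chi."""
    betas = [1.0]
    while True:
        nxt = betas[-1] - ell(betas[-1]) / math.sqrt(d)
        if nxt < chi:
            break
        betas.append(nxt)
    return betas

# Example with a constant spacing function ell(beta) = 1 and d = 100,
# giving spacings of 1/sqrt(100) = 0.1 between consecutive levels.
ladder = build_ladder(d=100, chi=0.095, ell=lambda b: 1.0)
```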
SLIDE 27
The optimal temperature spacing problem asks what is the optimal choice of the function ℓ. We shall consider a joint process (y(d)_n, X_n), with X_n ∈ R^d, and with y(d)_n taking values in {β(d)_i ; 0 ≤ i ≤ k(d)}, defined as follows.
Given y(d)_{n−1} = β(d)_i, choose X_{n−1} ∼ f^β with β = β(d)_i, then propose Z_n to be β(d)_{i+1} or β(d)_{i−1} with probability 1/2 each, and accept Z_n with the usual Metropolis acceptance probability. We assume (unrealistically!) that the chain then immediately jumps to stationarity at the new temperature, i.e. that mixing within a temperature is infinitely more efficient than mixing between temperatures.
The process (y(d)_n, X_n) is thus a Markov chain with stationary density
fd(β, x) = e^{dK(β)} Π_{i=1}^d f^β(x_i),
where K(β) = −log ∫ f^β(x)dx is the normalising constant.
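This idealised joint chain can be simulated directly when K(β) is available in closed form. A sketch under assumptions of my own: f(x) = exp(−x²/2), so that K(β) = ½ log(β/(2π)) and "instant mixing within a temperature" means redrawing X as d i.i.d. N(0, 1/β) coordinates before each proposed temperature move. The geometric ladder below (ℓ(β) proportional to β) is also an illustrative choice:

```python
import math, random

def simulated_tempering(betas, d, n_steps, seed=3):
    """Random walk on the inverse-temperature ladder for the unnormalised
    density f(x) = exp(-x^2/2), with instantaneous within-temperature
    mixing. Accepts beta -> beta' with the usual Metropolis probability
    for the joint density f_d(beta, x) = e^{dK(beta)} prod f^beta(x_i)."""
    rng = random.Random(seed)
    K = lambda b: 0.5 * math.log(b / (2.0 * math.pi))
    i, visits, accepted, proposed = 0, [0] * len(betas), 0, 0
    for _ in range(n_steps):
        j = i + rng.choice((-1, 1))
        if 0 <= j < len(betas):
            proposed += 1
            b, bp = betas[i], betas[j]
            # instant mixing: redraw X ~ f^b, only sum(x_i^2) matters here
            s = sum(rng.gauss(0.0, 1.0 / math.sqrt(b)) ** 2 for _ in range(d))
            log_a = d * (K(bp) - K(b)) - 0.5 * (bp - b) * s
            if rng.random() < math.exp(min(0.0, log_a)):
                i, accepted = j, accepted + 1
        visits[i] += 1
    return visits, accepted / proposed

# Geometric ladder (ell(beta) proportional to beta), 11 levels from 1.0 down
# to about 0.107, with d = 100 coordinates.
betas = [0.8 ** k for k in range(11)]
visits, acc = simulated_tempering(betas, d=100, n_steps=20000)
```

For this Gaussian-type target a geometric ladder gives a roughly constant acceptance rate across temperatures, which is the shape of ladder the optimality theory on the next slide points towards.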
SLIDE 28 A diffusion limit for inverse temperature
Theorem: {y(d)_n}, speeded up by a factor of d, converges weakly as d → ∞ to a diffusion limit {Y_t}_{t≥0} satisfying
dY_t = [2ℓ²(Y_t) Φ(−ℓ(Y_t)I^{1/2}(Y_t)/2)]^{1/2} dB_t + [2ℓ(Y_t)ℓ′(Y_t) Φ(−ℓ(Y_t)I^{1/2}(Y_t)/2) − ℓ²(Y_t) (ℓ(Y_t)I^{1/2}(Y_t)/2)′ φ(−ℓ(Y_t)I^{1/2}(Y_t)/2)] dt,
for Y_t in (χ, 1), with reflecting boundaries at both χ and 1.
Corollary: The speed of this diffusion is maximised, and the asymptotic variance of all L² functionals is minimised, when ℓ(y) is chosen so that the asymptotic temperature acceptance probability at each and every inverse temperature y is equal to 0.234.
SLIDE 29 Final comments
- The simulated tempering temperature spacing problem can be considered as an example of a state-dependent proposal variance problem.
- Peskun ordering can be used in a quite general way for high-dimensional processes which exhibit limiting diffusion behaviour (very many do!).
- Interesting extensions of this work will consider whether the temperature scaling problem really interacts with the within-temperature mixing of the algorithm.
- The 0.234 story is very natural in many situations. It is closely linked to the question of a concentration of measure phenomenon in high-dimensional problems.
- 0.234 can be a local phenomenon, not just an average over the whole state space.
- There are lots of things happening in scaling problems for MCMC (e.g. my session tomorrow), but there are still many rich and interesting scaling problems out there.
SLIDE 30
Conditions for diffusion comparison result
Suppose that the following hold:
(A1) π is log-Lipschitz, i.e. there is L < ∞ such that for all x, y,
|log π(y) − log π(x)| ≤ L|y − x|.
(A2) Either (a) the support of π is a bounded interval [a, b], and the diffusions X^σ have reflecting boundaries at a and b; or (b) the support of π is R, and π has exponentially bounded tails, i.e. there are 0 < K < ∞ and r > 0 such that
π(x + y) ≤ π(x)e^{−ry} for x > K, y > 0, and π(x − y) ≤ π(x)e^{−ry} for x < −K, y > 0.