SLIDE 1 Centre for Research in Statistical Methodology http://go.warwick.ac.uk/crism
- Conferences and workshops (including general calls for workshops to be organised primarily outside Warwick; calls every 6 months, next in Summer 2014)
- Research Fellow positions: next advertising 2 positions around February 2014
- PhD studentships
- Academic visitor programme.
SLIDE 2
From Peskun Ordering to Optimal Simulated Tempering
Gareth Roberts, University of Warwick
MCMSki, Chamonix, January 2014
Mainly joint work with Jeffrey Rosenthal, but with aspects of joint work with Yves Atchadé.
SLIDE 3 Plan for talk
- 1. Why 0.234 is natural in many problems
- 2. Comparisons of algorithms based on their diffusion limits, and links to Peskun ordering
- 3. A heterogeneous scaling problem: spacing of temperatures in simulated tempering
- 4. A local 0.234 story for simulated tempering
- 5. Conclusions
SLIDE 4
Metropolis-Hastings algorithm
Given a target density π(·) that we wish to sample from, and a Markov chain transition kernel density q(·, ·), we construct a Markov chain as follows. Given X_n, generate Y_{n+1} from q(X_n, ·). Now set X_{n+1} = Y_{n+1} with probability
α(X_n, Y_{n+1}) = 1 ∧ [π(Y_{n+1}) q(Y_{n+1}, X_n)] / [π(X_n) q(X_n, Y_{n+1})].
Otherwise set X_{n+1} = X_n.
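The accept/reject recipe above can be sketched in a few lines of Python. This is a generic illustration, not code from the talk; the standard normal target, random-walk proposal and scaling below are made-up examples:

```python
import math, random

def mh_step(x, log_pi, propose, log_q):
    """One Metropolis-Hastings update.
    log_pi : log target density (up to an additive constant)
    propose: draws y ~ q(x, .)
    log_q  : log q(x, y), up to a constant that cancels in the ratio
    """
    y = propose(x)
    log_alpha = min(0.0, log_pi(y) + log_q(y, x) - log_pi(x) - log_q(x, y))
    if random.random() < math.exp(log_alpha):
        return y   # accept the proposal
    return x       # reject: chain stays where it is

# Usage: random-walk proposal targeting a standard normal.
log_pi = lambda x: -0.5 * x * x
sigma = 2.4                                   # a proposal scale chosen for illustration
propose = lambda x: x + sigma * random.gauss(0.0, 1.0)
log_q = lambda x, y: 0.0                      # symmetric proposal, so q cancels

random.seed(1)
x, chain = 0.0, []
for _ in range(5000):
    x = mh_step(x, log_pi, propose, log_q)
    chain.append(x)
```

For a symmetric proposal the `log_q` terms cancel, recovering the simplified acceptance probability on the next slide.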
SLIDE 5 Two first scaling problems
1. Random walk Metropolis (RWM): q(x, y) = q(|y − x|). The acceptance probability simplifies to
α(x, y) = 1 ∧ π(y)/π(x).
For example q(x, ·) ∼ MVN_d(x, σ²I_d), but also more generally.
2. Metropolis-adjusted Langevin algorithm (MALA):
Y ∼ MVN(x^(k) + hV ∇log π(x^(k))/2, hV).
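The MALA proposal above can be sketched in one dimension (taking V = 1 in the slide's notation). A minimal illustration, not code from the talk; the target and step size h are made-up examples:

```python
import math, random

def mala_step(x, log_pi, grad_log_pi, h):
    """One MALA update in one dimension: propose from the slide's
    Gaussian Y ~ N(x + h*grad_log_pi(x)/2, h), then Metropolis-correct
    with the asymmetric proposal densities."""
    mu_x = x + 0.5 * h * grad_log_pi(x)
    y = mu_x + math.sqrt(h) * random.gauss(0.0, 1.0)
    mu_y = y + 0.5 * h * grad_log_pi(y)
    # log q(x, y) and log q(y, x), up to the same additive constant
    log_q_xy = -((y - mu_x) ** 2) / (2.0 * h)
    log_q_yx = -((x - mu_y) ** 2) / (2.0 * h)
    log_alpha = min(0.0, log_pi(y) - log_pi(x) + log_q_yx - log_q_xy)
    return y if random.random() < math.exp(log_alpha) else x

# Usage on a standard normal target.
random.seed(2)
log_pi = lambda x: -0.5 * x * x
grad = lambda x: -x
x, chain = 0.0, []
for _ in range(5000):
    x = mala_step(x, log_pi, grad, h=1.0)
    chain.append(x)
```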
SLIDE 6 200 600 1000 −0.25 −0.05
0.01 0.9869
200 600 1000 −1.0 −0.4
0.02 0.996
200 600 1000 0.0 1.0
0.03 0.9912
200 600 1000 −0.5 0.5
0.04 0.9895
200 600 1000 −0.5 0.5
0.05 0.9939
200 600 1000 −1.5 0.0 1.5
0.1 0.9905
200 600 1000 −2 2
0.2 0.9715
200 600 1000 −2 2
0.3 0.954
200 600 1000 −2 2
0.4 0.9443
200 600 1000 −3 2 4
0.5 0.9178
200 600 1000 −3 −1 1 3
0.7 0.8368
200 600 1000 −2 2
0.9 0.7732
200 600 1000 −3 −1 1
1 0.7579
200 600 1000 −3 −1 1 3
1.2 0.7474
200 600 1000 −2 2
1.5 0.6192
200 600 1000 −2 2
1.8 0.6529
200 600 1000 −2 2
2.1 0.6067
200 600 1000 −2 2
2.38 0.5542
200 600 1000 −3 −1 1 3
2.6 0.6277
200 600 1000 −3 −1 1
2.9 0.6159
200 600 1000 −2 2
3.2 0.672
200 600 1000 −2 2
3.5 0.6982
200 600 1000 −3 2
4 0.6446
200 600 1000 −3 −1 1
5 0.6962
200 600 1000 −3 2
6 0.7411
200 600 1000 −3 −1 1 3
8 0.829
200 600 1000 −3 2
10 0.7974
200 600 1000 −2 2
12 0.8651
200 600 1000 −2 2
15 0.9036
200 600 1000 −3 −1 1 3
18 0.9089
200 600 1000 −3 2
22 0.9426
200 600 1000 −3 −1 1 3
27 0.9325
200 600 1000 −2 1
33 0.933
200 600 1000 −3 −1 1
40 0.9564
200 600 1000 −2 1 2
50 0.962
200 600 1000 −2 1
100 0.989
The Goldilocks dilemma
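A sketch of the experiment behind these panels, under two assumptions not stated on the slide: that the target is a standard normal, and that the number beside each σ is the chain's lag-one autocorrelation:

```python
import math, random

def rwm_lag1_autocorr(sigma, n=20000, seed=0):
    """Run random walk Metropolis on a standard normal target and
    return the empirical lag-one autocorrelation of the chain."""
    rng = random.Random(seed)
    x, xs = 0.0, []
    for _ in range(n):
        y = x + sigma * rng.gauss(0.0, 1.0)
        if rng.random() < math.exp(min(0.0, -0.5 * (y * y - x * x))):
            x = y
        xs.append(x)
    m = sum(xs) / n
    var = sum((v - m) ** 2 for v in xs) / n
    cov = sum((xs[i] - m) * (xs[i + 1] - m) for i in range(n - 1)) / n
    return cov / var

# Tiny steps and huge steps both give autocorrelation near 1;
# the minimum is attained near sigma = 2.38.
acs = {s: rwm_lag1_autocorr(s) for s in (0.1, 0.5, 2.38, 10.0, 50.0)}
```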
SLIDE 7 Scaling problems and diffusion limits
Choosing σ in the above algorithms to optimise efficiency. For 'appropriate choices' the d-dimensional algorithm has a limit which is a diffusion. The faster the diffusion the better!
- How should σd depend on d for large d?
- What does this tell us about the efficiency of the algorithm?
- Can we optimise σd in some sensible way?
- Can we characterise optimal (or close to optimal) values of σd in terms of observable properties of the Markov chain?
For RWM and MALA (and some other local algorithms) and for some simple classes of target distributions, a solution to the above can be obtained by considering a diffusion limit (for high-dimensional problems).
SLIDE 8
Simulated tempering
Consider a d-dimensional target density fd, and suppose it is possible to construct MCMC on fd,β = fd^β, 0 ≤ χ ≤ β ≤ 1.
This typically mixes better for small β. However we are interested in fd,1.
Problem: choose a finite collection of inverse temperatures B = {βi} such that we can construct a Markov chain on R^d × B which "optimally" permits the exploration of fd,1. This is also a scaling problem: choosing how large to make βi − βi−1 for each i.
SLIDE 9 What is "efficiency"?
Let X be a Markov chain. Then for a π-integrable function g, efficiency can be described by the asymptotic variance
σ²(g, P) = lim_{n→∞} n Var( n⁻¹ Σ_{i=1}^n g(X_i) ).
Under weak(ish) regularity conditions,
σ²(g, P) = Varπ(g) + 2 Σ_{i=1}^∞ Covπ(g(X_0), g(X_i)).
In general the relative efficiency between two possible Markov chains varies depending on what function of interest g is being considered. As d → ∞ the dependence on g disappears, at least in cases where we have a diffusion limit, as we will see....
SLIDE 10 How do we measure "efficiency" efficiently?
It is well-established that estimating the limiting variance is hard.
"It's easy, just measure ESJD instead!" (Andrew Gelman, 1993)
ESJD = E((X_{t+1} − X_t)²)
Why is this a good idea? Optimising this is just like considering only linear functions g and ignoring all but the first term in
Σ_{i=1}^∞ Covπ(g(X_0), g(X_i)).
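ESJD is trivial to estimate from a single run, which is the point of the quote. A minimal sketch (the helper name is my own):

```python
def esjd(chain):
    """Expected squared jump distance: the average of (X_{t+1} - X_t)^2
    along the realised path. Rejected moves contribute zero, so ESJD
    couples the proposal step size with the acceptance rate."""
    return sum((b - a) ** 2 for a, b in zip(chain, chain[1:])) / (len(chain) - 1)

# For a stationary chain, ESJD = 2 Var(X) (1 - rho_1), so maximising
# ESJD is the same as minimising the lag-one autocorrelation rho_1,
# i.e. keeping only the i = 1 term of the covariance sum above.
```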
SLIDE 11 Diffusion limits: a framework for studying algorithm optimality
Many MCMC algorithms are well-approximated by diffusions (usually in the sense of diffusion limit results for high-dimensional algorithms, or other limiting regimes). Examples include random walk Metropolis, various versions of the Langevin algorithm, Gibbs samplers and simulated tempering. This provides a natural framework for studying algorithm complexity and optimisation. But why is the comparison of diffusion limits any easier than comparing the finite-dimensional algorithms?
SLIDE 12 [Figure: an MCMC sample path over 10,000 iterations, with values ranging roughly from 0.8 to 1.8.]
MCMC sample paths and diffusions. Here the ESJD corresponds to the quadratic variation
lim_{ε→0} Σ_{i=1}^{[tε⁻¹]} (X_{iε} − X_{(i−1)ε})².
SLIDE 13 Diffusions
A d-dimensional diffusion is a continuous-time strong Markov process with continuous sample paths. We can define a diffusion as the solution of the stochastic differential equation (SDE)
dX_t = µ(X_t)dt + σ(X_t)dB_t,
where B denotes d-dimensional Brownian motion, σ is a d × d matrix and µ is a d-vector. A diffusion is often understood intuitively and constructively via its dynamics over small time intervals. Approximately, for small h,
X_{t+h} | X_t = x_t ∼ x_t + hµ(x_t) + h^{1/2} σ(x_t) Z,
where Z is a d-dimensional standard normal random variable.
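The small-h dynamics above are exactly the Euler-Maruyama simulation scheme. A one-dimensional sketch, with an Ornstein-Uhlenbeck process as a made-up example (its stationary law is N(0, 1)):

```python
import math, random

def euler_maruyama(mu, sigma, x0, h, n, seed=0):
    """Simulate dX = mu(X)dt + sigma(X)dB by iterating the slide's
    small-h approximation: X <- X + h*mu(X) + sqrt(h)*sigma(X)*Z."""
    rng = random.Random(seed)
    x, path = x0, [x0]
    for _ in range(n):
        x = x + h * mu(x) + math.sqrt(h) * sigma(x) * rng.gauss(0.0, 1.0)
        path.append(x)
    return path

# Ornstein-Uhlenbeck example: dX = -X dt + sqrt(2) dB, stationary N(0, 1).
path = euler_maruyama(lambda x: -x, lambda x: math.sqrt(2.0), 0.0, 0.01, 200000)
```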
SLIDE 14 "Efficiency" for diffusions
Consider two Langevin diffusions, both with π invariant: for h1 < h2,
dX^i_t = h_i^{1/2} dB_t + h_i ∇log π(X^i_t)/2 dt,  i = 1, 2.
[Figure: simulated sample paths of the two diffusions over the same time window.]
X² is a "speeded-up" version of X¹. But how can we compare diffusions which have non-constant diffusion coefficient?
SLIDE 15 Peskun ordering
Peskun (1973). Many uses in MCMC theory (e.g. see work by Mira, Tierney and co-workers).
P1 and P2 are two Markov chain kernels with invariant distribution π. We say P2 dominates P1 in the Peskun sense, and write P2 ⪰ P1, if for all x and all sets A not containing x,
P1(x, A) ≤ P2(x, A).
Peskun ordering implies an ordering on asymptotic variances of ergodic estimates:
lim_{n→∞} n Var( n⁻¹ Σ_{i=1}^n g(X^(1)_i) ) ≥ lim_{n→∞} n Var( n⁻¹ Σ_{i=1}^n g(X^(2)_i) ),
where X^(i) moves according to P_i.
Surprisingly many applications in MCMC! But are proposal scaling problems completely incompatible with Peskun??
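The variance ordering can be checked exactly on a toy example. A sketch assuming numpy is available; the three-state kernels are invented for illustration: a "lazy" chain is Peskun-dominated by i.i.d. sampling from π, and the spectral formula for the asymptotic variance of a reversible chain confirms it pays a strictly larger variance:

```python
import numpy as np

def asymptotic_variance(P, pi, g):
    """Exact sigma^2(g, P) = Var_pi(g) + 2*sum_i Cov(g(X_0), g(X_i))
    for a pi-reversible kernel P, via the eigendecomposition of the
    symmetrised kernel: sum over eigenvalues lam < 1 of
    (1 + lam)/(1 - lam) times the squared coefficient of g."""
    d = np.sqrt(pi)
    S = d[:, None] * P / d[None, :]      # D^{1/2} P D^{-1/2}, symmetric
    lam, U = np.linalg.eigh(S)
    c = U.T @ (d * g)                    # coordinates of g in the eigenbasis
    keep = lam < 1.0 - 1e-10             # drop the stationary eigenvalue 1
    return float(np.sum((1 + lam[keep]) / (1 - lam[keep]) * c[keep] ** 2))

pi = np.ones(3) / 3
P_fast = np.full((3, 3), 1 / 3)          # i.i.d. sampling from pi
P_slow = 0.5 * np.eye(3) + 0.5 * P_fast  # lazy version: off-diagonal mass halved

# Peskun: every off-diagonal entry of P_slow is <= the one in P_fast,
# so P_fast dominates P_slow, and its asymptotic variances are smaller.
g = np.array([0.0, 1.0, 2.0])
v_fast = asymptotic_variance(P_fast, pi, g)
v_slow = asymptotic_variance(P_slow, pi, g)
```

Here v_fast = Var_π(g) = 2/3, while the lazy chain (eigenvalues 1/2) inflates this by the factor (1 + 1/2)/(1 − 1/2) = 3.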
SLIDE 16 Peskun ordering in continuous time
Peskun ordering of P1 and P2 does not imply, nor is implied by, ordering of the 2-step transition kernels. How about continuous time? (See work of Leisen and Mira, 2008.) But diffusions satisfy P(x, {x}) = 0, so there can never be any interesting Peskun orderings in the original sense.
HOWEVER, MANY discrete-time processes possess the SAME diffusion limit. Maybe we can consider the limit along sequences of chains which do satisfy Peskun ordering.
SLIDE 17 A more powerful diffusion comparison result
Consider two Langevin diffusions, both with stationary distribution π:
dX^i_t = h_i(X^i_t)^{1/2} dB_t + V_i(X^i_t) dt,  i = 1, 2,
with h1(x) ≤ h2(x) for all x. (Here V_i(x) = (h_i(x) ∇log π(x) + h_i′(x))/2.)
Under regularity conditions on the tails of π (which have to decay exponentially, or π needs to have bounded support), X² dominates X¹ in the covariance ordering sense:
lim_{t→∞} t Var( t⁻¹ ∫_0^t g(X^1_s) ds ) ≥ lim_{t→∞} t Var( t⁻¹ ∫_0^t g(X^2_s) ds ).
SLIDE 18
Diffusion comparison result (ctd) Proof by finding suitable Peskun ordered birth and death processes with the given diffusion limit. A dominated convergence argument is needed to extend the covariance ordering to the limiting diffusions. Can extend argument to give approximate covariance ordering for ANY processes with these respective limits.
SLIDE 19
The first diffusion comparison result (Roberts, Gelman and Gilks, 1997)
Consider the Metropolis case. Suppose
π ∼ Π_{i=1}^d f(x_i),  q(x, ·) ∼ N(x, σd² I_d),  X_0 ∼ π.
Set σd² = ℓ²/d, and consider the first coordinate with time speeded up by a factor of d:
Z^d_t = X^(1)_{[td]}.
Z^d is not a Markov chain; however, in the limit as d goes to ∞, it is Markov: Z^d ⇒ Z where Z satisfies the SDE
dZ_t = h(ℓ)^{1/2} dB_t + h(ℓ) ∇log f(Z_t)/2 dt,
for some function h(ℓ).
SLIDE 20 [Figure: plot of the speed function h(ℓ).]
nh(ℓ)/d: how much diffusion path do we get for our n iterations?
SLIDE 21
h(ℓ) = ℓ² × 2Φ(−√I ℓ/2),
where I = E_f[((log f(X))′)²]. So
h(ℓ) = ℓ² × A(ℓ),
where A(ℓ) is the limiting overall acceptance rate of the algorithm, i.e. the proportion of proposed Metropolis moves ultimately accepted. So
h(ℓ) = (4/I) [Φ⁻¹(A(ℓ)/2)]² A(ℓ),
and so the maximisation problem can be written entirely in terms of the algorithm's acceptance rate.
SLIDE 22
[Figure, two panels: efficiency as a function of the proposal scaling σ (left) and as a function of the acceptance rate (right).]
Efficiency as a function of scaling and acceptance rate.
SLIDE 23 When can we 'solve' the scaling problem for Metropolis?
We need a sequence of target densities πd which are sufficiently regular as d → ∞ in order that meaningful (and optimisable) limiting distributions exist. E.g.:
- 1. π ∼ Π_{i=1}^d f(x_i). (NB for discontinuous f, mixing is O(d²) and the optimal rate is 0.13; Peter Neal.)
- 2. π ∼ Π_{i=1}^d f(c_i x_i), q(x, ·) ∼ N(x, σd² I_d), for some inverse scales c_i (Bedard, Rosenthal, Voss).
- 3. Elliptically symmetric target densities (Sherlock, Bedard).
- 4. The components form a homogeneous Markov chain.
- 5. π is a Gibbs random field with finite range interactions (Breyer).
- 6. Discretisations of an infinite-dimensional system absolutely continuous with respect to a Gaussian measure (e.g. Pillai, Stuart, Thiery).
- 7. Purely discrete product form distributions.
SLIDE 24 A basic analysis of Metropolis
Write Y(d) = X(d) + h^{1/2} Z(d). Then
α(X(d), Y(d)) = 1 ∧ β(X(d), Y(d)) = 1 ∧ π(d)(Y(d))/π(d)(X(d)),
and a Taylor expansion gives
log β(X(d), X(d) + h^{1/2} Z(d)) ≈ h^{1/2} ∇log π(X(d)) · Z(d) + (1/2) h Z(d)′ ∇∇′ log π(X(d)) Z(d),
so that
β ≈ exp{h^{1/2} G(d) − h V(d)/2},
where G(d) is Gaussian. If V(d) converges in probability to a constant, then the log acceptance ratio is asymptotically of the Gaussian form N(−σ²/2, σ²).
In fact that is all we need for the 0.234 framework to hold!
SLIDE 25 Why?
Suppose now G is a standard Gaussian and ℓ incorporates the scaling choice:
ESJD ≈ ℓ² E[1 ∧ exp(ℓG − ℓ²/2)] = ℓ² × 2Φ(−√I ℓ/2),
so that
ESJD = (4/I) [Φ⁻¹(A(ℓ)/2)]² A(ℓ),
which is maximised by taking A(ℓ) = 0.234.
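Dropping the constant 4/I, the quantity to maximise over acceptance rates a ∈ (0, 1) is a [Φ⁻¹(a/2)]². A one-line grid search using only the standard library recovers 0.234 (the helper names are my own):

```python
from statistics import NormalDist

inv_cdf = NormalDist().inv_cdf   # Phi^{-1} for the standard normal

def esjd_per_unit(a):
    """ESJD as a function of the acceptance rate a, up to the
    constant 4/I from the slide: a * Phi^{-1}(a/2)^2."""
    return a * inv_cdf(a / 2.0) ** 2

# Grid search over acceptance rates in (0, 1).
grid = [i / 10000.0 for i in range(1, 10000)]
a_star = max(grid, key=esjd_per_unit)   # approximately 0.234
```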
SLIDE 26 Simulated tempering
Consider a d-dimensional target density
fd(x) = e^{dK} Π_{i=1}^d f(x_i),
for some unnormalised one-dimensional density function f : R → [0, ∞), where K = −log(∫ f(x)dx).
Consider simulated tempering in d dimensions, with inverse temperatures chosen as follows: β(d)_0 = 1, and
β(d)_{i+1} = β(d)_i − ℓ(β(d)_i)/d^{1/2},
for some fixed C¹ function ℓ : [0, 1] → R.
To stop adding new temperature values, we fix some χ ∈ (0, 1) and keep going until the inverse temperatures drop below χ, i.e. we stop at inverse temperature β(d)_{k(d)}, where
k(d) = sup{i : β(d)_i ≥ χ}.
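The ladder construction above is mechanical to implement. A sketch (the function names and the constant choice of ℓ are illustrative only); note the recursion produces O(d^{1/2}) inverse temperatures:

```python
import math

def build_ladder(d, chi, ell):
    """Inverse-temperature ladder from the slide:
    beta_0 = 1, beta_{i+1} = beta_i - ell(beta_i)/sqrt(d),
    stopping as soon as the next value would drop below chi."""
    betas = [1.0]
    while True:
        nxt = betas[-1] - ell(betas[-1]) / math.sqrt(d)
        if nxt < chi:
            break
        betas.append(nxt)
    return betas

# Example with a constant spacing function ell(beta) = 1 and d = 100,
# giving spacings of 1/sqrt(100) = 0.1 between consecutive levels.
ladder = build_ladder(d=100, chi=0.095, ell=lambda b: 1.0)
```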
SLIDE 27
The optimal temperature spacing problem asks what is the optimal choice of the function ℓ. We shall consider a joint process (y(d)_n, X_n), with X_n ∈ R^d, and with y(d)_n taking values in {β(d)_i ; 0 ≤ i ≤ k(d)}, defined as follows.
Given y(d)_{n−1} = β(d)_i, choose X_{n−1} ∼ f^β with β = β(d)_i, then propose Z_n to be β(d)_{i+1} or β(d)_{i−1} with probability 1/2 each, and accept Z_n with the usual Metropolis acceptance probability. We assume (unrealistically!) that the chain then immediately jumps to stationarity at the new temperature, i.e. that mixing within a temperature is infinitely more efficient than mixing between temperatures.
The process (y(d)_n, X_n) is thus a Markov chain with stationary density
fd(β, x) = e^{dK(β)} Π_{i=1}^d f^β(x_i),
where K(β) = −log ∫ f^β(x)dx is the normalising constant.
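This idealised joint chain can be simulated directly when K(β) is available in closed form. A sketch under assumptions of my own: f(x) = exp(−x²/2), so that K(β) = ½ log(β/(2π)) and "instant mixing within a temperature" means redrawing X as d i.i.d. N(0, 1/β) coordinates before each proposed temperature move. The geometric ladder below (ℓ(β) proportional to β) is also an illustrative choice:

```python
import math, random

def simulated_tempering(betas, d, n_steps, seed=3):
    """Random walk on the inverse-temperature ladder for the unnormalised
    density f(x) = exp(-x^2/2), with instantaneous within-temperature
    mixing. Accepts beta -> beta' with the usual Metropolis probability
    for the joint density f_d(beta, x) = e^{dK(beta)} prod f^beta(x_i)."""
    rng = random.Random(seed)
    K = lambda b: 0.5 * math.log(b / (2.0 * math.pi))
    i, visits, accepted, proposed = 0, [0] * len(betas), 0, 0
    for _ in range(n_steps):
        j = i + rng.choice((-1, 1))
        if 0 <= j < len(betas):
            proposed += 1
            b, bp = betas[i], betas[j]
            # instant mixing: redraw X ~ f^b, only sum(x_i^2) matters here
            s = sum(rng.gauss(0.0, 1.0 / math.sqrt(b)) ** 2 for _ in range(d))
            log_a = d * (K(bp) - K(b)) - 0.5 * (bp - b) * s
            if rng.random() < math.exp(min(0.0, log_a)):
                i, accepted = j, accepted + 1
        visits[i] += 1
    return visits, accepted / proposed

# Geometric ladder (ell(beta) proportional to beta), 11 levels from 1.0 down
# to about 0.107, with d = 100 coordinates.
betas = [0.8 ** k for k in range(11)]
visits, acc = simulated_tempering(betas, d=100, n_steps=20000)
```

For this Gaussian-type target a geometric ladder gives a roughly constant acceptance rate across temperatures, which is the shape of ladder the optimality theory on the next slide points towards.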
SLIDE 28 A diffusion limit for inverse temperature
Theorem: {y(d)_n}, speeded up by a factor of d, converges weakly as d → ∞ to a diffusion limit {Y_t}_{t≥0} satisfying
dY_t = [2ℓ²(Y_t) Φ(−ℓ(Y_t)I^{1/2}(Y_t)/2)]^{1/2} dB_t + [2ℓ(Y_t)ℓ′(Y_t) Φ(−ℓ(Y_t)I^{1/2}(Y_t)/2) − ℓ²(Y_t) (ℓ(Y_t)I^{1/2}(Y_t)/2)′ φ(−ℓ(Y_t)I^{1/2}(Y_t)/2)] dt,
for Y_t in (χ, 1), with reflecting boundaries at both χ and 1.
Corollary: The speed of this diffusion is maximised, and the asymptotic variance of all L² functionals is minimised, when ℓ(y) is chosen so that the asymptotic temperature acceptance probability at each and every inverse temperature y is equal to 0.234.
SLIDE 29 Final comments
- The simulated tempering temperature spacing problem can be considered as an example of a state-dependent proposal variance problem.
- Peskun ordering can be used in a quite general way for high-dimensional processes which exhibit limiting diffusion behaviour (very many do!).
- Interesting extensions of this work will consider whether the temperature scaling problem really interacts with the within-temperature mixing of the algorithm.
- The 0.234 story is very natural in many situations. It is closely linked to the question of a concentration of measure phenomenon in high-dimensional problems.
- 0.234 can be a local phenomenon, not just an average over the whole state space.
- There are lots of things happening in scaling problems for MCMC (e.g. my session tomorrow), but there are still many rich and interesting scaling problems out there.
SLIDE 30
Conditions for diffusion comparison result
Suppose that the following hold:
(A1) π is log-Lipschitz, i.e. there is L < ∞ such that for all x, y,
|log π(y) − log π(x)| ≤ L|y − x|.
(A2) Either (a) the support of π is a bounded interval [a, b], and the diffusions X^σ have reflecting boundaries at a and b; or (b) the support of π is R, and π has exponentially bounded tails, i.e. there are 0 < K < ∞ and r > 0 such that
π(x + y) ≤ π(x)e^{−ry} for x > K, y > 0, and π(x − y) ≤ π(x)e^{−ry} for x < −K, y > 0.