SLIDE 1 Tutorial on quasi-Monte Carlo methods
Josef Dick
School of Mathematics and Statistics, UNSW, Sydney, Australia josef.dick@unsw.edu.au
SLIDE 2 Comparison: MCMC, MC, QMC
Roughly speaking:
- Markov chain Monte Carlo and quasi-Monte Carlo are for
different types of problems;
- If you have a problem where Monte Carlo does not work, then
chances are quasi-Monte Carlo will not work either;
- If Monte Carlo works, but you want a faster method ⇒ try
(randomized) quasi-Monte Carlo (some tweaking might be necessary);
- Quasi-Monte Carlo is an "experimental design" approach to
Monte Carlo simulation. In this talk we shall discuss how quasi-Monte Carlo can be faster than Monte Carlo under certain assumptions.
SLIDE 3
Outline
- DISCREPANCY, REPRODUCING KERNEL HILBERT SPACES AND WORST-CASE ERROR
- QUASI-MONTE CARLO POINT SETS
- RANDOMIZATIONS
- WEIGHTED FUNCTION SPACES AND TRACTABILITY
SLIDE 4 The task
The task is to approximate an integral
I_s(f) = ∫_{[0,1]^s} f(x) dx
for some integrand f by a quadrature rule
Q_{N,s}(f) = (1/N) ∑_{n=0}^{N−1} f(x_n)
at some sample points x_0, . . . , x_{N−1} ∈ [0, 1]^s.
SLIDE 5
∫_{[0,1]^s} f(x) dx ≈ (1/N) ∑_{n=0}^{N−1} f(x_n)
In other words: Area under curve = Volume of cube × average function value.
- If x_0, . . . , x_{N−1} ∈ [0, 1]^s are chosen randomly ⇒ Monte Carlo
- If x_0, . . . , x_{N−1} ∈ [0, 1]^s are chosen deterministically ⇒ Quasi-Monte Carlo
SLIDE 6 Smooth integrands
- Integrand f : [0, 1] → R; say continuously differentiable;
- We want to study the integration error:
∫_0^1 f(x) dx − (1/N) ∑_{n=0}^{N−1} f(x_n).
- Representation:
f(x) = f(1) − ∫_x^1 f′(t) dt = f(1) − ∫_0^1 1_{[0,t]}(x) f′(t) dt,
where 1_{[0,t]}(x) = 1 if x ∈ [0, t] and 0 otherwise.
SLIDE 7 Integration error
∫_0^1 f(x) dx = ∫_0^1 ( f(1) − ∫_0^1 1_{[0,t]}(x) f′(t) dt ) dx
= f(1) − ∫_0^1 ∫_0^1 1_{[0,t]}(x) f′(t) dx dt,
(1/N) ∑_{n=0}^{N−1} f(x_n) = f(1) − (1/N) ∑_{n=0}^{N−1} ∫_0^1 1_{[0,t]}(x_n) f′(t) dt
= f(1) − ∫_0^1 ( (1/N) ∑_{n=0}^{N−1} 1_{[0,t]}(x_n) ) f′(t) dt.
Subtracting:
∫_0^1 f(x) dx − (1/N) ∑_{n=0}^{N−1} f(x_n) = ∫_0^1 ( (1/N) ∑_{n=0}^{N−1} 1_{[0,t]}(x_n) − ∫_0^1 1_{[0,t]}(x) dx ) f′(t) dt.
SLIDE 8 Local discrepancy
Integration error:
∫_0^1 f(x) dx − (1/N) ∑_{n=0}^{N−1} f(x_n) = ∫_0^1 ( (1/N) ∑_{n=0}^{N−1} 1_{[0,t]}(x_n) − t ) f′(t) dt.
Let P = {x_0, . . . , x_{N−1}} ⊂ [0, 1]. Define the local discrepancy by
Δ_P(t) = (1/N) ∑_{n=0}^{N−1} 1_{[0,t]}(x_n) − t,  t ∈ [0, 1].
Then
∫_0^1 f(x) dx − (1/N) ∑_{n=0}^{N−1} f(x_n) = ∫_0^1 Δ_P(t) f′(t) dt.
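As a quick illustration (not from the original slides), here is a minimal Python sketch of the local discrepancy in one dimension; the left-endpoint grid is an arbitrary example point set:

```python
import numpy as np

def local_discrepancy(points, t):
    """Delta_P(t) = (1/N) * #{n : x_n <= t} - t for a 1D point set."""
    return np.mean(np.asarray(points) <= t) - t

# Example: the left-endpoint grid {0, 1/N, ..., (N-1)/N}.
N = 16
P = np.arange(N) / N
ts = np.linspace(0.0, 1.0, 1001)
print(max(abs(local_discrepancy(P, t)) for t in ts))  # roughly 1/N
```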
SLIDE 9 Koksma-Hlawka inequality
| ∫_0^1 f(x) dx − (1/N) ∑_{n=0}^{N−1} f(x_n) | = | ∫_0^1 Δ_P(t) f′(t) dt |
≤ ( ∫_0^1 |Δ_P(t)|^p dt )^{1/p} ( ∫_0^1 |f′(t)|^q dt )^{1/q} = ‖Δ_P‖_{L_p} ‖f′‖_{L_q},
where 1/p + 1/q = 1 and ‖g‖_{L_p} = ( ∫_0^1 |g(t)|^p dt )^{1/p}.
SLIDE 10 Interpretation of discrepancy
Let P = {x_0, . . . , x_{N−1}} ⊂ [0, 1]. Recall the definition of the local discrepancy:
Δ_P(t) = (1/N) ∑_{n=0}^{N−1} 1_{[0,t]}(x_n) − t,  t ∈ [0, 1].
The local discrepancy measures the difference between the uniform distribution and the empirical distribution of the quadrature points P = {x_0, . . . , x_{N−1}}. Its supremum over t is the Kolmogorov-Smirnov statistic for the difference between the empirical distribution of {x_0, . . . , x_{N−1}} and the uniform distribution.
SLIDE 11 Function space
Representation: f(x) = f(1) − ∫_x^1 f′(t) dt.
Define the inner product
⟨f, g⟩ = f(1)g(1) + ∫_0^1 f′(t) g′(t) dt
and the induced norm ‖f‖ = ( f(1)² + ∫_0^1 |f′(t)|² dt )^{1/2}.
Function space: H = {f : [0, 1] → R : f absolutely continuous and ‖f‖ < ∞}.
SLIDE 12 Worst-case error
The worst-case error is defined by
e(H, P) = sup_{f∈H, ‖f‖≤1} | ∫_0^1 f(x) dx − (1/N) ∑_{n=0}^{N−1} f(x_n) |.
SLIDE 13 Reproducing kernel Hilbert space
Recall: f(y) = f(1) · 1 + ∫_0^1 f′(x) (−1_{[0,x]}(y)) dx and ⟨f, g⟩ = f(1)g(1) + ∫_0^1 f′(x) g′(x) dx.
Goal: Find a set of functions g_y ∈ H for each y ∈ [0, 1] such that ⟨f, g_y⟩ = f(y) for all f ∈ H.
Conclusions:
- g_y(1) = 1 for all y ∈ [0, 1];
- g_y′(x) = −1_{[0,x]}(y) = −1 if y ≤ x, and 0 otherwise;
- Make g_y continuous such that g_y ∈ H.
SLIDE 14
Reproducing kernel Hilbert space
It follows that g_y(x) = 1 + min{1 − x, 1 − y}. The function K(x, y) := g_y(x) = 1 + min{1 − x, 1 − y}, x, y ∈ [0, 1], is called the reproducing kernel. The space H is called a reproducing kernel Hilbert space (with reproducing kernel K).
SLIDE 15 Numerical integration in reproducing kernel Hilbert spaces
Using the reproducing property f(y) = ⟨f, K(·, y)⟩:
∫_0^1 f(z) dz = ∫_0^1 ⟨f, K(·, z)⟩ dz = ⟨ f, ∫_0^1 K(·, z) dz ⟩,
(1/N) ∑_{n=0}^{N−1} f(x_n) = (1/N) ∑_{n=0}^{N−1} ⟨f, K(·, x_n)⟩ = ⟨ f, (1/N) ∑_{n=0}^{N−1} K(·, x_n) ⟩.
Hence
∫_0^1 f(z) dz − (1/N) ∑_{n=0}^{N−1} f(x_n) = ⟨f, h⟩,
where h(x) = ∫_0^1 K(x, z) dz − (1/N) ∑_{n=0}^{N−1} K(x, x_n).
SLIDE 16 Worst-case error in reproducing kernel Hilbert spaces
Thus
e(H, P) = sup_{f∈H, ‖f‖≤1} | ∫_0^1 f(z) dz − (1/N) ∑_{n=0}^{N−1} f(x_n) |
= sup_{f∈H, ‖f‖=1} |⟨f, h⟩| = sup_{f∈H, f≠0} |⟨f, h⟩| / ‖f‖ = ‖h‖,
since the supremum is attained when choosing f = h/‖h‖ ∈ H.
SLIDE 17 Worst-case error in reproducing kernel Hilbert spaces
e²(H, P) = ∫_0^1 ∫_0^1 K(x, y) dx dy − (2/N) ∑_{n=0}^{N−1} ∫_0^1 K(x, x_n) dx + (1/N²) ∑_{n,m=0}^{N−1} K(x_n, x_m).
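For the kernel K(x, y) = 1 + min{1 − x, 1 − y} of Slide 14, direct integration gives ∫_0^1 K(x, y) dx = 1 + (1 − y²)/2 and ∫_0^1 ∫_0^1 K(x, y) dx dy = 4/3, so the worst-case error can be evaluated exactly. A minimal Python sketch (the two test point sets are arbitrary examples):

```python
import numpy as np

def wce_squared(P):
    """e^2(H, P) for the kernel K(x, y) = 1 + min(1 - x, 1 - y) on [0, 1]."""
    P = np.asarray(P, dtype=float)
    term1 = 4.0 / 3.0                                      # int int K(x, y) dx dy
    term2 = np.mean(1.0 + (1.0 - P**2) / 2.0)              # mean of int K(x, x_n) dx
    term3 = np.mean(1.0 + np.minimum.outer(1 - P, 1 - P))  # mean of K(x_n, x_m)
    return term1 - 2.0 * term2 + term3

print(wce_squared(np.arange(8) / 8))          # left-endpoint grid
print(wce_squared((np.arange(8) + 0.5) / 8))  # midpoint grid: smaller error
```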
SLIDE 18 Numerical integration in higher dimensions
- Tensor product space H_s = H ⊗ · · · ⊗ H.
- Reproducing kernel
K(x, y) = ∏_{i=1}^{s} [1 + min{1 − x_i, 1 − y_i}],
where x = (x_1, . . . , x_s), y = (y_1, . . . , y_s) ∈ [0, 1]^s.
- Functions f ∈ H_s have mixed partial derivatives up to order 1 in each variable: ∂^{|u|} f/∂x_u ∈ L_2([0, 1]^s), where u ⊆ {1, . . . , s}, x_u = (x_i)_{i∈u} and |u| denotes the cardinality of u.
- Notation: (x_u, 1) denotes the point whose i-th component is x_i for i ∈ u and 1 otherwise; below we also use the boundary conditions (∂^{|u|} f/∂x_u)(x_u, 1) = 0 for all u ⊂ {1, . . . , s}.
SLIDE 19 Worst-case error
Again
e²(H_s, P) = ∫_{[0,1]^s} ∫_{[0,1]^s} K(x, y) dx dy − (2/N) ∑_{n=0}^{N−1} ∫_{[0,1]^s} K(x, x_n) dx + (1/N²) ∑_{n,m=0}^{N−1} K(x_n, x_m),
and, under the boundary conditions of the previous slide,
e²(H_s, P) = ∫_{[0,1]^s} Δ_P(x)² dx, where Δ_P(x) = (1/N) ∑_{n=0}^{N−1} 1_{[0,x]}(x_n) − ∏_{i=1}^{s} x_i.
SLIDE 20 Discrepancy in higher dimensions
Point set P = {x_0, . . . , x_{N−1}} ⊂ [0, 1]^s, t = (t_1, . . . , t_s) ∈ [0, 1]^s.
Δ_P(t) = (1/N) ∑_{n=0}^{N−1} 1_{[0,t]}(x_n) − ∏_{i=1}^{s} t_i,
where [0, t] = ∏_{i=1}^{s} [0, t_i].
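The same quantity is easy to evaluate pointwise in any dimension; a small Python sketch (the random point set is only for illustration):

```python
import numpy as np

def local_discrepancy(P, t):
    """Delta_P(t) = (1/N) * #{n : x_n in [0, t]} - prod_i t_i for P in [0,1]^s."""
    P, t = np.asarray(P), np.asarray(t)
    return np.mean(np.all(P <= t, axis=1)) - np.prod(t)

rng = np.random.default_rng(1)
P = rng.random((64, 2))                       # random points, for comparison
print(local_discrepancy(P, np.array([0.5, 0.5])))
```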
SLIDE 21 Koksma-Hlawka inequality
Let f : [0, 1]^s → R with
‖f‖ = ( ∫_{[0,1]^s} | (∂^s f/∂x_1 · · · ∂x_s)(x) |^q dx )^{1/q},
and where (∂^{|u|} f/∂x_u)(x_u, 1) = 0 for all u ⊂ {1, . . . , s}.
Then
| ∫_{[0,1]^s} f(x) dx − (1/N) ∑_{n=0}^{N−1} f(x_n) | ≤ ‖Δ_P‖_{L_p} ‖f‖,
where 1/p + 1/q = 1.
⇒ Construct points P = {x_0, . . . , x_{N−1}} ⊂ [0, 1]^s with small discrepancy L_p(P) = ‖Δ_P‖_{L_p}. It is often useful to consider different reproducing kernels, yielding different worst-case errors and discrepancies. We will see further examples later.
SLIDE 22 Constructions of low-discrepancy sequences
- Lattice rules (Bilyk, Brauchart, Cools, D., Hellekalek, Hickernell,
Hlawka, Joe, Keller, Korobov, Kritzer, Kuo, Larcher, L'Ecuyer, Lemieux, Leobacher, Niederreiter, Nuyens, Pillichshammer, Sinescu, Sloan, Temlyakov, Wang, Woźniakowski, ...)
- Digital nets and sequences (Baldeaux, Bierbrauer, Brauchart,
Chen, D., Edel, Faure, Hellekalek, Hofer, Keller, Kritzer, Kuo, Larcher, Leobacher, Niederreiter, Owen, Özbudak, Pillichshammer, Pirsic, Schmid, Skriganov, Sobol’, Wang, Xing, Yue, ...)
- Hammersley-Halton sequences (Atanassov, De Clerck, Faure,
Halton, Hammersley, Kritzer, Larcher, Lemieux, Pillichshammer, Pirsic, White, ...)
- Kronecker sequences (Beck, Hellekalek, Larcher, Niederreiter,
Schoissengeier, ...)
SLIDE 23 Lattice rules
Let N ∈ N and let g = (g_1, . . . , g_s) ∈ {1, . . . , N − 1}^s. Choose the quadrature points as
x_n = {n g / N} for n = 0, . . . , N − 1,
where {z} = z − ⌊z⌋ denotes the fractional part, applied componentwise, for z ≥ 0.
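A rank-1 lattice point set is a one-liner; a minimal Python sketch using the Fibonacci parameters of the next slide:

```python
import numpy as np

def lattice_points(N, g):
    """Rank-1 lattice x_n = {n g / N} for n = 0, ..., N - 1."""
    n = np.arange(N).reshape(-1, 1)
    return (n * np.asarray(g) / N) % 1.0      # fractional part, componentwise

pts = lattice_points(55, (1, 34))             # Fibonacci lattice, see Slide 24
print(pts[:3])
```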
SLIDE 24
Fibonacci lattice rules
Lattice rule with N = 55 points and generating vector g = (1, 34).
SLIDE 25
Fibonacci lattice rules
Lattice rule with N = 89 points and generating vector g = (1, 55).
SLIDE 26
In comparison: Random point set
Random set of 64 points generated by Matlab using the Mersenne Twister.
SLIDE 27 Lattice rules
How to find the generating vector g? Reproducing kernel:
K(x, y) = ∏_{i=1}^{s} (1 + 2π B_α({x_i − y_i})),
where {x_i − y_i} = (x_i − y_i) − ⌊x_i − y_i⌋ is the fractional part of x_i − y_i. This is the reproducing kernel of a Hilbert space of Fourier series f(x) = ∑_{h∈Z^s} f̂(h) e^{2πi h·x} with polynomially decaying Fourier coefficients.
SLIDE 28 Worst-case error
e²_α(g) = −1 + (1/N) ∑_{n=0}^{N−1} ∏_{i=1}^{s} (1 + 2π B_α({n g_i / N})),
where B_α is the Bernoulli polynomial of order α. For instance, B_2(x) = x² − x + 1/6.
SLIDE 29 Component-by-component construction (Korobov, Sloan-Reztsov, Sloan-Kuo-Joe, Nuyens-Cools)
- Set g1 = 1.
- For d = 2, . . . , s assume that we have found g_1, . . . , g_{d−1}. Then find g_d ∈ {1, . . . , N − 1} which minimizes e²_α(g_1, . . . , g_{d−1}, g) as a function of g, i.e.
g_d = argmin_{g ∈ {1,...,N−1}} e²_α(g_1, . . . , g_{d−1}, g).
Using the fast Fourier transform (Nuyens, Cools), a good generating vector g ∈ {1, . . . , N − 1}^s can be found in O(sN log N) operations.
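As a sketch of the idea (assuming the worst-case error formula of Slide 28 with α = 2), here is a naive O(sN²) Python implementation of the CBC search; the fast Nuyens-Cools algorithm obtains the same vector in O(sN log N) operations by reordering the candidates and using the FFT:

```python
import numpy as np

def B2(x):
    """Bernoulli polynomial B_2, evaluated on fractional parts."""
    return x * x - x + 1.0 / 6.0

def cbc(N, s):
    """Naive component-by-component search for a generating vector."""
    n = np.arange(N)
    prod = 1.0 + 2.0 * np.pi * B2(n % N / N)  # kernel factor for g_1 = 1
    g = [1]
    for d in range(2, s + 1):
        # try every candidate c for the d-th component of g
        errs = [np.mean(prod * (1.0 + 2.0 * np.pi * B2((n * c) % N / N))) - 1.0
                for c in range(1, N)]
        best = 1 + int(np.argmin(errs))
        g.append(best)
        prod = prod * (1.0 + 2.0 * np.pi * B2((n * best) % N / N))
    return g

print(cbc(127, 4))
```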
SLIDE 30 Hammersley-Halton sequence
Radical inverse function in base b:
- Let n ∈ N_0 have base b expansion n = n_0 + n_1 b + · · · + n_{a−1} b^{a−1}.
- Define φ_b(n) = n_0/b + n_1/b² + · · · + n_{a−1}/b^a ∈ [0, 1).
Hammersley-Halton sequence:
- Let P = {p_1, p_2, . . .} be the set of prime numbers in increasing order, i.e. p_1 = 2, p_2 = 3, p_3 = 5, p_4 = 7, . . ..
- Define quadrature points x_0, x_1, . . . by
x_n = (φ_{p_1}(n), φ_{p_2}(n), . . . , φ_{p_s}(n)) for n = 0, 1, 2, . . . .
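A minimal Python sketch of the radical inverse and the resulting Halton points (the hard-coded prime list is for illustration only, so s ≤ 10 here):

```python
import numpy as np

def radical_inverse(n, b):
    """phi_b(n): reflect the base-b digits of n about the decimal point."""
    x, base = 0.0, 1.0 / b
    while n > 0:
        n, digit = divmod(n, b)
        x += digit * base
        base /= b
    return x

def halton(N, s):
    primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29][:s]  # first s primes
    return np.array([[radical_inverse(n, p) for p in primes] for n in range(N)])

print(halton(4, 2))  # [[0, 0], [1/2, 1/3], [1/4, 2/3], [3/4, 1/9]]
```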
SLIDE 31
Hammersley-Halton sequence
Hammersley-Halton point set with 64 points.
SLIDE 32 Digital nets
- Choose a prime number b and the finite field Z_b = {0, 1, . . . , b − 1} of order b.
- Choose generating matrices C_1, . . . , C_s ∈ Z_b^{m×m}.
- Let n = n_0 + n_1 b + · · · + n_{m−1} b^{m−1} and set n = (n_0, . . . , n_{m−1})^⊤ ∈ Z_b^m.
- Compute y_{n,i} = C_i n ∈ Z_b^m for 1 ≤ i ≤ s, 0 ≤ n < b^m.
- For y_{n,i} = (y_{n,i,1}, . . . , y_{n,i,m})^⊤ ∈ Z_b^m let
x_{n,i} = y_{n,i,1}/b + · · · + y_{n,i,m}/b^m.
- Set x_n = (x_{n,1}, . . . , x_{n,s}) for 0 ≤ n < b^m.
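A minimal Python sketch of the digital construction; the choice C_1 = identity and C_2 = reversed identity is an example that yields the two-dimensional Hammersley net, a (0, m, 2)-net in base 2:

```python
import numpy as np

def digital_net(C, m, b=2):
    """Digital net in base b from generating matrices C_1, ..., C_s."""
    s = len(C)
    points = np.empty((b**m, s))
    for n in range(b**m):
        # base-b digits of n: n = n_0 + n_1 b + ... + n_{m-1} b^{m-1}
        digits = np.array([(n // b**k) % b for k in range(m)])
        for i in range(s):
            y = np.asarray(C[i]).dot(digits) % b           # y_{n,i} = C_i n
            points[n, i] = y.dot(b ** -np.arange(1.0, m + 1.0))
    return points

m = 4
I = np.eye(m, dtype=int)   # C_1 = identity: van der Corput sequence in base 2
J = np.fliplr(I)           # C_2 = reversed identity
print(digital_net([I, J], m)[:4])  # 2D Hammersley net
```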
SLIDE 33 (t, m, s)-net property
Let m, s ≥ 1 and b ≥ 2 be integers. A point set P = {x_0, . . . , x_{b^m−1}} is called a (t, m, s)-net in base b if for all integers d_1, . . . , d_s ≥ 0 with d_1 + · · · + d_s = m − t, the number of points in each elementary interval
∏_{i=1}^{s} [ a_i/b^{d_i}, (a_i + 1)/b^{d_i} ),  0 ≤ a_i < b^{d_i},
is b^t.
SLIDE 34 If P = {x_0, . . . , x_{b^m−1}} ⊂ [0, 1]^s is a (t, m, s)-net, then the local discrepancy function satisfies
Δ_P( a_1/b^{d_1}, . . . , a_s/b^{d_s} ) = 0
for all 0 ≤ a_i < b^{d_i}, d_i ≥ 0, d_1 + · · · + d_s = m − t.
SLIDE 47
Sobol point set
The first 64 points of a Sobol sequence which form a (0, 6, 2)-net in base 2.
SLIDE 48
Niederreiter-Xing point set
The first 64 points of a Niederreiter-Xing sequence.
SLIDE 49
Niederreiter-Xing point set
The first 1024 points of a Niederreiter-Xing sequence.
SLIDE 50
Kronecker sequence
Let α_1, . . . , α_s ∈ R be such that 1, α_1, . . . , α_s are linearly independent over Q. Let
x_n = ({n α_1}, . . . , {n α_s}) for n = 0, 1, 2, . . . .
For instance, one can choose α_i = √p_i, where p_i is the i-th prime number.
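A minimal Python sketch with the choice α_i = √p_i (prime list hard-coded for illustration):

```python
import numpy as np

def kronecker(N, s):
    """x_n = ({n sqrt(p_1)}, ..., {n sqrt(p_s)}) for the first s primes."""
    primes = np.array([2, 3, 5, 7, 11, 13, 17, 19, 23, 29][:s])
    n = np.arange(N).reshape(-1, 1)
    return (n * np.sqrt(primes)) % 1.0

print(kronecker(5, 2))
```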
SLIDE 51
Kronecker sequence
The first 64 points of a Kronecker sequence.
SLIDE 52 Discrepancy bounds
- It is known that for all the constructions above one has
L_p(P) ≤ C_s (log N)^{c(s)} / N for all N ≥ 1, 1 ≤ p ≤ ∞,
where C_s > 0 and c(s) ≈ s.
- Lower bound: For all point sets P consisting of N points we have
L_p(P) ≥ C′_s (log N)^{(s−1)/2} / N for all N ≥ 1, 1 < p ≤ ∞.
- In comparison, for a random point set: E(L_2(P)) = O(N^{−1/2}).
SLIDE 53 For 1 < p < ∞, the exact order of convergence is known to be (log N)^{(s−1)/2} / N. Explicit constructions of such point sets were achieved by Chen & Skriganov for p = 2 and by Skriganov for 1 < p < ∞.
Great open problem: What is the correct order of convergence of
min_{P ⊂ [0,1]^s, |P| = N} L_∞(P)?
SLIDE 54 Randomized quasi-Monte Carlo
Introduce random element to deterministic point set. Has several advantages:
- yields an unbiased estimator;
- there is a statistical error estimate;
- better rate of convergence of the random-case error compared to the
worst-case error;
- performs similarly to Monte Carlo for L_2 functions but has a better
rate of convergence for smooth functions.
SLIDE 55 Randomly shifted lattice rules
Lattice rules can be randomized using a random shift:
- Lattice rule: {n g / N}, n = 0, 1, . . . , N − 1.
- Shifted lattice rule: choose σ ∈ [0, 1]^s uniformly distributed; then the shifted lattice point set is given by {n g / N + σ}, n = 0, 1, . . . , N − 1.
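A minimal Python sketch of the resulting randomized estimator: R independent shifts of one lattice rule give R unbiased estimates, whose sample mean and sample standard deviation provide the statistical error estimate of Slide 54 (the test integrand is an arbitrary example with exact integral 1):

```python
import numpy as np

def shifted_lattice_estimates(f, N, g, R, rng):
    """R i.i.d. randomly shifted versions of one rank-1 lattice rule."""
    n = np.arange(N).reshape(-1, 1)
    base = (n * np.asarray(g) / N) % 1.0
    ests = []
    for _ in range(R):
        shift = rng.random(len(g))
        ests.append(np.mean(f((base + shift) % 1.0)))
    return np.array(ests)

rng = np.random.default_rng(7)
f = lambda x: np.prod(1.0 + (x - 0.5), axis=1)   # exact integral is 1
est = shifted_lattice_estimates(f, 55, (1, 34), 10, rng)
print(est.mean(), est.std(ddof=1) / np.sqrt(10))  # estimate, statistical error
```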
SLIDE 56
Lattice point set
SLIDE 57
Shifted lattice point set
SLIDE 58 Owen’s scrambling
- Let Z_b = {0, 1, . . . , b − 1} and let
x = x_1/b + x_2/b² + · · · , with x_1, x_2, . . . ∈ Z_b.
- Randomly choose permutations σ, σ_{x_1}, σ_{x_1,x_2}, . . . : Z_b → Z_b.
- Let
y_1 = σ(x_1), y_2 = σ_{x_1}(x_2), y_3 = σ_{x_1,x_2}(x_3), . . .
and set y = y_1/b + y_2/b² + · · · .
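A minimal Python sketch of nested uniform scrambling for a single coordinate, truncated to m digits; all points share one randomly drawn permutation tree, and each coordinate of a point set would get its own independent tree:

```python
import numpy as np

def owen_scramble_1d(xs, m, rng, b=2):
    """Nested uniform scrambling (base b, m digits) of one coordinate."""
    perms = {}  # preceding digits -> random permutation of {0, ..., b-1}
    out = np.zeros(len(xs))
    for j, x in enumerate(xs):
        digits = []
        for k in range(1, m + 1):
            d = int(x * b)              # k-th base-b digit of x
            x = x * b - d
            key = tuple(digits)         # permutation depends on digits 1..k-1
            if key not in perms:
                perms[key] = rng.permutation(b)
            out[j] += perms[key][d] / float(b) ** k
            digits.append(d)
    return out

rng = np.random.default_rng(3)
print(owen_scramble_1d(np.arange(8) / 8, 8, rng))
```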
SLIDE 59
Scrambled Sobol point set
The first 1024 points of a Sobol sequence (left) and a scrambled Sobol sequence (right).
SLIDE 60 Scrambled net variance
Let y_0, . . . , y_{b^m−1} be a scrambled (t, m, s)-net and let
e(R) = ∫_{[0,1]^s} f(x) dx − (1/b^m) ∑_{n=0}^{b^m−1} f(y_n).
Then
- E(e(R)) = 0;
- Var(e(R)) = O( (log N)^s / N^{2α+1} ) with N = b^m, for integrands with smoothness 0 ≤ α ≤ 1.
SLIDE 61 Dependence on the dimension
Although the convergence rate is better for quasi-Monte Carlo, there is a stronger dependence on the dimension:
- Monte Carlo: error = O(N−1/2);
- Quasi-Monte Carlo: error = O( (log N)^{s−1} / N ).
Notice that g(N) := N^{−1}(log N)^{s−1} is an increasing function for N ≤ e^{s−1}, since g′(N) = N^{−2}(log N)^{s−2}(s − 1 − log N). So if s is large, say s ≈ 30, then N^{−1}(log N)^{29} increases for N ≤ 10^{12}.
SLIDE 62 Intractable
For the integration error in H with reproducing kernel
K(x, y) = ∏_{i=1}^{s} min{1 − x_i, 1 − y_i}
it is known that e²(H, P) ≥ (1 − N (8/9)^s) e²(H, ∅).
⇒ The error can only decrease once N > constant · (9/8)^s.
SLIDE 63 Weighted function spaces (Sloan and Woźniakowski)
Study weighted function spaces: Introduce weights γ_1, γ_2, . . . > 0 and define
K(x, y) = ∏_{i=1}^{s} (1 + γ_i min{1 − x_i, 1 − y_i}).
Then if ∑_{i=1}^{∞} γ_i < ∞, there are point sets P with
e(H_{s,γ}, P) ≤ C N^{−δ},
where C > 0 is independent of the dimension s and δ < 1.
SLIDE 64 Tractability
- Minimal error over all methods using N points:
e*_N(H) = inf_{P : |P|=N} e(H, P);
- Information complexity:
N*(s, ε) = min{N ∈ N : e*_N(H) ≤ ε e*_0(H)} for ε > 0;
- Strong tractability: N*(s, ε) ≤ C ε^{−β} for some constants C, β > 0 independent of s.
Sloan and Woźniakowski: For the function space considered above, we get strong tractability if ∑_{i=1}^{∞} γ_i < ∞.
SLIDE 65 Tractability of the star-discrepancy
Recall the local discrepancy function
Δ_P(t) = (1/N) ∑_{n=0}^{N−1} 1_{[0,t)}(x_n) − t_1 · · · t_s,
where P = {x_0, . . . , x_{N−1}}. The star-discrepancy is given by
D*_N(P) = sup_{t ∈ [0,1]^s} |Δ_P(t)|.
Result by Heinrich, Novak, Wasilkowski and Woźniakowski: the minimal star-discrepancy satisfies
D*(s, N) = inf_P D*_N(P) ≤ C √(s/N) for all s, N ∈ N.
This implies that N*(s, ε) ≤ C s ε^{−2}.
SLIDE 66 Extensions, open problems and future research
- Quasi-Monte Carlo methods which achieve convergence rates of order
N^{−α}(log N)^{αs}, α > 1, for sufficiently smooth integrands;
- Construction of point sets whose star-discrepancy is tractable;
- Completely uniformly distributed point sets to speed up
convergence in Markov chain Monte Carlo algorithms;
- Infinite dimensional integration: Integrands have infinitely many
variables;
- Connections to codes, orthogonal arrays and experimental
designs;
- Quasi-Monte Carlo for function approximation;
- Choosing the weights in applications;
- Quasi-Monte Carlo on the sphere;
SLIDE 67
Book
SLIDE 68 Thank You!
Special thanks to
- Dirk Nuyens for creating many of the pictures in this talk;
- Friedrich Pillichshammer for letting me use a picture of his
daughters;
- Fred Hickernell, Peter Kritzer, Art Owen, and Friedrich
Pillichshammer for helpful comments on the slides;
- All colleagues who worked on quasi-Monte Carlo methods over
the years.