
slide-1
SLIDE 1

Probabilistic Numerics – Part I – Integration and Differential Equations

Philipp Hennig, MLSS 2015, 18/07/2015

Emmy Noether Group on Probabilistic Numerics, Department of Empirical Inference, Max Planck Institute for Intelligent Systems, Tübingen, Germany

slide-2
SLIDE 2

Information Content of Partial Computations

division with remainder

23736 ÷ 736 = XX.XX

slide-3
SLIDE 3

Information Content of Partial Computations

division with remainder

23736 ÷ 736 = 3X.XX    (3 × 736 = 2208; 2373 − 2208 = 165, bring down the 6: 1656)

slide-4
SLIDE 4

Information Content of Partial Computations

division with remainder

23736 ÷ 736 = 32.XX    (2 × 736 = 1472; 1656 − 1472 = 184)

slide-5
SLIDE 5

Information Content of Partial Computations

division with remainder

23736 ÷ 736 = 32.2X    (2 × 73.6 = 147.2; 184.0 − 147.2 = 36.8)

slide-6
SLIDE 6

Information Content of Partial Computations

division with remainder

23736 ÷ 736 = 32.25    (5 × 7.36 = 36.8; 36.8 − 36.8 = 0, no remainder)
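The narrowing above can be made concrete in a few lines of code. The following sketch (my own illustration, not part of the slides) performs the division digit by digit; after every step the true quotient is pinned down to an interval whose width shrinks by a factor of ten, which is exactly the information content of the partial computation.

```python
def long_division_steps(numerator, denominator, n_digits=4, frac_digits=2):
    """Digit-by-digit long division: after each computed digit, yield the
    partial quotient and the width of the interval that is still consistent
    with the computation so far."""
    scaled_num = numerator * 10 ** frac_digits      # work in fixed point
    quotient_scaled = 0
    place = 10 ** (n_digits - 1)                    # leading digit of the scaled quotient
    for _ in range(n_digits):
        digit = (scaled_num - quotient_scaled * denominator) // (denominator * place)
        quotient_scaled += digit * place
        # the true quotient lies in [quotient_scaled, quotient_scaled + place) / 10**frac_digits
        yield quotient_scaled / 10 ** frac_digits, place / 10 ** frac_digits
        place //= 10

for estimate, width in long_division_steps(23736, 736):
    print(f"partial result {estimate:>6}  (true value within +{width})")
```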


slide-8
SLIDE 8

What about ML computations?

Contemporary computational tasks are more challenging

What happens with

▸ a neural network if we stop its “training” after four steps of SGD?
▸ … or train on only 1% of the data set?
▸ a GP regressor if we stop the Gauss-Jordan elimination after three steps?
▸ a DP mixture model if we only run MCMC for ten samples?
▸ a robotic controller built using all these methods?

As data sets become infinite, ML models increasingly complex, and their applications permeate our lives, we need to model the effects of approximations more explicitly to achieve fast, reliable AI.

slide-9
SLIDE 9

Machine learning methods are chains of numerical computations

▸ linear algebra (least-squares)
▸ optimization (training & fitting)
▸ integration (MCMC, marginalization)
▸ solving differential equations (RL, control)

Are these methods just black boxes on your shelf?

slide-10
SLIDE 10

Numerical methods perform inference

an old observation [Poincaré 1896, Diaconis 1988, O’Hagan 1992]

A numerical method estimates a function’s latent property given the result of computations.

▸ integration estimates ∫_a^b f(x) dx given {f(x_i)}
▸ linear algebra estimates x s.t. Ax = b given {As = y}
▸ optimization estimates x s.t. ∇f(x) = 0 given {∇f(x_i)}
▸ analysis estimates x(t) s.t. x′ = f(x, t) given {f(x_i, t_i)}

▸ computations yield “data” / “observations”
▸ non-analytic quantities are “latent”
▸ even deterministic quantities can be uncertain.

slide-11
SLIDE 11

If computation is inference, it should be possible to build probabilistic numerical methods that take in probability measures over inputs, and return probability measures over outputs, which quantify uncertainty arising from the uncertain input and the finite information content of the computation.

slide-12
SLIDE 12

Classic methods identified as maximum a-posteriori

probabilistic numerics is anchored in established theory

▸ quadrature: Gaussian quadrature ↔ Gaussian process regression [Diaconis 1988]
▸ linear algebra: conjugate gradients ↔ Gaussian conditioning [Hennig 2014]
▸ nonlinear optimization: BFGS ↔ autoregressive filtering [Hennig & Kiefel 2013]
▸ ordinary differential equations: Runge-Kutta ↔ Gauss-Markov extrapolation [Schober, Duvenaud & Hennig 2014]

slide-13
SLIDE 13

Integration

F = ∫_a^b f(x) dx

[diagram: the integrand f is passed through ∫ to produce the number F]


slide-15
SLIDE 15

Integration

a toy problem

[plots: the integrand f(x) on x ∈ [−3, 3], and the integration error |F − F̂| against the number of samples]

f(x) = exp(−sin²(3x) − x²)
F = ∫_{−3}^{3} f(x) dx = ?

f(x) ≤ exp(−x²),   ∫ exp(−x²) dx = √π


slide-17
SLIDE 17

Monte Carlo

(almost) no assumptions, stochastic convergence

F = ∫ exp(−sin²(3x) − x²) dx = Z ∫ (f(x)/g(x)) · (g(x)/Z) dx,   Z = ∫ g(x) dx

F ≈ (Z/N) Σ_{i=1}^{N} f(x_i)/g(x_i) = F̂,   x_i ∼ g(x)/Z

var(F̂) = var_g(f/g) / N

▸ adding randomness enforces stochastic convergence
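A minimal sketch of the estimator above (my own code, not from the slides), using the Gaussian envelope g(x) = exp(−x²) from the previous slide as the proposal, so Z = √π and samples from g/Z are N(0, 1/2):

```python
import numpy as np

def f(x):
    return np.exp(-np.sin(3 * x) ** 2 - x ** 2)     # the toy integrand

def g(x):
    return np.exp(-x ** 2)                          # unnormalised proposal, Z = sqrt(pi)

def mc_estimate(n, rng):
    Z = np.sqrt(np.pi)
    x = rng.normal(0.0, np.sqrt(0.5), size=n)       # draws from g(x)/Z = N(0, 1/2)
    return Z * np.mean(f(x) / g(x))                 # F_hat = (Z/N) sum f(x_i)/g(x_i)

rng = np.random.default_rng(0)
for n in [10, 100, 1000, 10000]:
    print(n, mc_estimate(n, rng))                   # error shrinks at the O(1/sqrt(n)) rate
```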

slide-18
SLIDE 18

The probabilistic approach

integration as nonparametric inference [Diaconis 1988, O’Hagan 1991]

p(f) = GP(f; 0, k),   k(x, x′) = min(x, x′) + b

p(z) = N(z; µ, Σ)  ⇒  p(Az) = N(Az; Aµ, AΣA⊺)

p(∫_a^b f(x) dx) = N[∫_a^b f(x) dx; ∫_a^b m(x) dx, ∬_a^b k(x, x′) dx dx′]
                 = N(F; 0, −1/6 (b³ − a³) + 1/2 [b³ − 2a²b + a³] − (b − a)² c)
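As a concrete sketch of this construction (my own code; note that I anchor the Wiener kernel at the left endpoint a so that the Gram matrix stays positive definite on [−3, 3], a detail the slide does not spell out), the posterior over F follows in closed form from the once- and twice-integrated kernel:

```python
import numpy as np

def bayes_quadrature(func, nodes, a, b, c=1.0):
    """Posterior mean and variance of F = int_a^b func(x) dx under a zero-mean
    GP prior with kernel k(x, x') = min(x - a, x' - a) + c."""
    x = np.asarray(nodes, dtype=float)
    y = func(x)
    u, L = x - a, b - a
    K = np.minimum.outer(u, u) + c                 # Gram matrix k(x_i, x_j)
    q = u * L - u ** 2 / 2 + c * L                 # q_i = int_a^b k(x, x_i) dx
    Q = L ** 3 / 3 + c * L ** 2                    # int_a^b int_a^b k(x, x') dx dx'
    alpha = np.linalg.solve(K, y)
    return q @ alpha, Q - q @ np.linalg.solve(K, q)   # E[F | y], var[F | y]

f = lambda x: np.exp(-np.sin(3 * x) ** 2 - x ** 2)
mean, var = bayes_quadrature(f, np.linspace(-3, 3, 10), a=-3.0, b=3.0)
print(mean, np.sqrt(var))                          # estimate of F and its standard deviation
```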

slide-19
SLIDE 19

Active Collection of Information

choice of evaluation nodes [T. Minka, 2000]

x_t = arg min [ var_{p(F ∣ x_1, …, x_{t−1})}(F) ]


slide-21
SLIDE 21

Active Collection of Information

choice of evaluation nodes [T. Minka, 2000]

x_t = arg min [ var_{p(F ∣ x_1, …, x_{t−1})}(F) ]

Active node placement for maximum expected error reduction gives a regular grid.
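A brute-force sketch of this selection rule (mine, reusing bayes_quadrature() from the earlier code block): because the posterior variance of F depends only on where the nodes sit, not on the observed values, the next node can be picked by scanning a candidate grid.

```python
import numpy as np

def next_node(nodes, a, b, candidates, c=1.0):
    """Return the candidate x_t that minimises var[F | x_1, ..., x_{t-1}, x_t]."""
    dummy = lambda x: np.zeros_like(x)             # the variance does not depend on y
    fresh = [x for x in candidates if np.min(np.abs(nodes - x)) > 1e-9]
    def var_with(x_new):
        _, v = bayes_quadrature(dummy, np.append(nodes, x_new), a, b, c)
        return v
    return min(fresh, key=var_with)

nodes = np.array([0.0])
candidates = np.linspace(-3, 3, 121)
for _ in range(6):
    nodes = np.append(nodes, next_node(nodes, -3.0, 3.0, candidates))
print(np.sort(nodes))                              # nodes spread out over the interval
```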


slide-30
SLIDE 30

Posterior Mean is a Linear Spline …

A classic numerical method, derived as a learning machine [Diaconis 1988, O’Hagan 1991]

p(f ∣ y, X) = GP(f; m_y, k_y)
k(x, x_i) = min(x, x_i) + b
m_y(x) = Σ_i k(x, x_i) α_i,   α = k_{XX}⁻¹ y

slide-31
SLIDE 31

We just re-discovered the Trapezoid Rule!

A classic numerical method, derived as a learning machine [Diaconis 1988, O’Hagan 1991]

p(f ∣ y, X) = GP(f; m_y, k_y)
k(x, x_i) = min(x, x_i) + b
m_y(x) = Σ_i k(x, x_i) α_i,   α = k_{XX}⁻¹ y

E_{p(f ∣ y)}[F] = ∫ m_y(x) dx = Σ_{i=1}^{N−1} (x_{i+1} − x_i) · ½ (f(x_{i+1}) + f(x_i))
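A quick numerical check of this identity (again my own code, reusing bayes_quadrature() from the earlier sketch): with nodes that include both endpoints, the posterior-mean integral under the Wiener-process prior should match the trapezoid sum up to round-off.

```python
import numpy as np

f = lambda x: np.exp(-np.sin(3 * x) ** 2 - x ** 2)
x = np.linspace(-3, 3, 10)
y = f(x)

gp_mean, _ = bayes_quadrature(f, x, a=-3.0, b=3.0)
trapezoid = np.sum(np.diff(x) * (y[1:] + y[:-1]) / 2)
print(gp_mean, trapezoid)                          # the two numbers should agree
```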

slide-32
SLIDE 32

▸ The trapezoid rule is the maximum a posteriori estimate for F under a Wiener process prior on f.
▸ Node placement by information maximization under this prior.

slide-33
SLIDE 33

▸ introducing random numbers into a deterministic problem may not be a good strategy
▸ recipe for a probabilistic numerical method estimating x from y(t):
  ▸ choose a model p(x, y(t))
  ▸ choose an action rule / policy / strategy [t_1, …, t_{i−1}, y(t_1), …, y(t_{i−1})] ↦ t_i
▸ some classic numerical methods can be derived entirely from an inference perspective, using classic statistical methods
▸ the trapezoid rule is a MAP estimate under a Wiener process prior on the integrand; regular node placement arises from information-greediness
▸ the probabilistic interpretation as such does not ensure the posterior distribution is well calibrated

slide-34
SLIDE 34

Customized Numerics

machine learning providing new numerics [Hennig, Osborne, Girolami, RSPA 2015]

k(x, x′) = exp(−(x − x′)² / (2λ²))

Encodes
▸ smooth (infinitely differentiable) f
▸ exponentially decaying Fourier power spectrum

But ignores
▸ non-stationarity
▸ positivity


slide-36
SLIDE 36

No Such Thing as a Free Lunch!

incorrect assumptions give arbitrarily bad performance [Hennig, Osborne, Girolami, RSPA 2015]


Computations collect information about a latent quantity. The more valid prior information is available, the cheaper the computation—if the algorithm uses the prior information!



slide-39
SLIDE 39

Model Mismatch can be Detected at Runtime

“a numerical conscience”

[plots: two test integrands, the corresponding integration errors |F − F̂|, and the statistic r against the number of samples]

r = log [ E_f̃[p(f̃(x))] / p(f(x)) ] = (f(x) − µ(x))⊺ K⁻¹ (f(x) − µ(x)) − N
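A sketch of how such a check can be run alongside the computation (my own reading of the statistic; the exact normalisation on the slide may differ): under a well-specified zero-mean GP prior the standardised residual yᵀK⁻¹y has expectation N, so the quantity below should hover around zero, and a drift far from zero flags mismatch at runtime.

```python
import numpy as np

def mismatch_statistic(y, K):
    """r = y^T K^{-1} y - N for a zero-mean Gaussian model with Gram matrix K."""
    return float(y @ np.linalg.solve(K, y)) - len(y)

def sqexp(x, lam=1.0):
    return np.exp(-np.subtract.outer(x, x) ** 2 / (2 * lam ** 2))

x = np.linspace(-3, 3, 30)
K = sqexp(x) + 1e-6 * np.eye(len(x))               # jitter for numerical stability
matched = np.random.default_rng(1).multivariate_normal(np.zeros(len(x)), K)
mismatched = np.exp(-np.sin(3 * x) ** 2 - x ** 2)  # the toy integrand as "data"
print(mismatch_statistic(matched, K))              # fluctuates around zero
print(mismatch_statistic(mismatched, K))           # a value far from zero would signal mismatch
```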

slide-40
SLIDE 40

▸ including tangible prior information in the prior can give tailored numerics that drastically improve computational efficiency
▸ in contrast to physical data sources, in numerical problems prior assumptions can be rigorously verified, because the problem is stated in a formal (programming) language!
▸ incorrect prior assumptions can catastrophically affect performance
▸ model assumptions can be adapted at runtime, using established statistical techniques, e.g. “type-II maximum likelihood”

slide-41
SLIDE 41

So why is everyone (in ML) using MCMC?


slide-42
SLIDE 42

Warped Sequential Active Bayesian Integration (WSABI)

[Gunter, Osborne, Garnett, Hennig, Roberts, NIPS 2014]

√(f(x) · exp(−x²/2)) ∼ GP[0, k],   k = exp(−½ (x − x′)⊺ Λ⁻¹ (x − x′))

▸ encodes positivity
▸ encodes nonstationarity

slide-43
SLIDE 43

Warped Sequential Active Bayesian Integration (WSABI)

[Gunter, Osborne, Garnett, Hennig, Roberts, NIPS 2014]

▸ select evaluation nodes at arg max var[f(x) · exp(−x²)]
▸ this scales to higher input dimensionality

more formal analysis in arXiv 1410.2392 [Oates et al.] & arXiv 1506.02681 [Briol et al.]
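A heavily reduced sketch of the warping idea (mine; real WSABI linearises the square to obtain closed-form integrals and active node selection, neither of which is attempted here): model the square root of the non-negative integrand with a GP, then square the posterior mean so the implied integrand is positive by construction.

```python
import numpy as np

def sqexp(a, b, lam=0.6):
    return np.exp(-np.subtract.outer(a, b) ** 2 / (2 * lam ** 2))

f = lambda x: np.exp(-np.sin(3 * x) ** 2 - x ** 2)
nodes = np.linspace(-3, 3, 8)
z = np.sqrt(f(nodes))                              # observe the square root of f

grid = np.linspace(-3, 3, 601)
K = sqexp(nodes, nodes) + 1e-10 * np.eye(len(nodes))
mean_sqrt = sqexp(grid, nodes) @ np.linalg.solve(K, z)   # GP posterior mean of sqrt(f)
squared = mean_sqrt ** 2                           # non-negative model of f itself
estimate = np.sum(np.diff(grid) * (squared[1:] + squared[:-1]) / 2)
print(estimate)                                    # estimate of int f(x) dx on [-3, 3]
```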


slide-50
SLIDE 50

Probabilistic Numerics Need Not Be Expensive

WSABI is time-competitive with MCMC [Gunter, Osborne, Garnett, Hennig, Roberts, NIPS 2014]

[plots: error |F_est − F_true| against wall-clock time for Monte Carlo vs. WSABI on four problems: GP classification (graph), synthetic mixture of Gaussians, yacht hydrodynamics, GP classification (synthetic)]

slide-51
SLIDE 51

end of Integration part

▸ computation is inference
▸ there is a deep formal connection between basic numerical methods and basic statistical models
▸ ignoring salient prior information causes a drastic increase in computational cost. Black boxes may be convenient, but they are not efficient
▸ machine learning can help numerics, and vice versa

slide-52
SLIDE 52

Ordinary Differential Equations

x′(t) = f(x(t), t),   x(t₀) = x₀,   x : R → Rᴺ

[diagram: f and x₀ enter the ODE solver, which returns the solution x]


slide-55
SLIDE 55

Runge-Kutta Methods

iterative linear extrapolation

[plot: linear extrapolation steps at t₀, t₀ + c₁, t₀ + c₂, t₀ + h]

weights:
c₁ │ 1  w₁₁
c₂ │ 1  w₂₁  w₂₂
h  │ 1  b₁  b₂  b₃

y₁ = f(1x₀, t₀)
y₂ = f(1x₀ + w₁₁ y₁, t₀ + c₁)
y_{s+1} = f(1x₀ + Σᵢ₌₁ˢ w_{si} yᵢ, t₀ + c_s)

x̂(t₀ + h) = 1x₀ + Σᵢ bᵢ yᵢ

▸ RK methods choose (c, w, b) such that ‖x̂(t₀ + h) − x(t₀ + h)‖ = O(hᵖ)
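For reference, here is one explicit Runge-Kutta step in code (standard textbook material rather than anything specific to these slides), written in the conventional form where the step size h multiplies the stage and output weights explicitly:

```python
import math

def rk_step(f, x0, t0, h, c, w, b):
    """One explicit Runge-Kutta step: stage values y_i are evaluations of f at
    linear combinations of x0 and earlier stages; the result is x0 + h*sum(b_i*y_i)."""
    y = []
    for i, ci in enumerate(c):
        xi = x0 + h * sum(w[i][j] * y[j] for j in range(i))
        y.append(f(xi, t0 + ci * h))
    return x0 + h * sum(bi * yi for bi, yi in zip(b, y))

# Heun's method (2nd order): c = (0, 1), w21 = 1, b = (1/2, 1/2)
f = lambda x, t: -x                                # test problem x' = -x, x(0) = 1
x, h = 1.0, 0.1
for k in range(10):
    x = rk_step(f, x, k * h, h, c=[0.0, 1.0], w=[[], [1.0]], b=[0.5, 0.5])
print(x, math.exp(-1.0))                           # Heun estimate vs exact solution at t = 1
```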

slide-56
SLIDE 56

Gauss-Markov inference on ODEs

a probabilistic model matching Runge-Kutta [Schober, Duvenaud, Hennig, NIPS 2014]

y₁ = f(µ_{∣x₀}(t₀), t₀)
y₂ = f(µ_{∣x₀,y₁}(c₁), t₀ + c₁)
y_{s+1} = f(µ_{∣x₀,y₁,⋯,y_s}(c_s), t₀ + c_s)

x̂(t₀ + h) = µ_{∣x₀,y_i}(t₀ + h)

with µ_{∣y}(t) := cov(x(t), y) cov(y, y)⁻¹ y = 1x₀ + Σᵢ w_{si} yᵢ

slide-57
SLIDE 57

Gauss-Markov inference on ODEs

a probabilistic model matching Runge-Kutta [Schober, Duvenaud, Hennig, NIPS 2014]

▸ Linear extrapolation suggests Gaussian process model
▸ polynomial form suggests Wiener state-space model

dz = F z dt + L dω,   z(t) = [x(t), x′(t), …, (hᵏ/k!) x⁽ᵏ⁾(t)]⊺

F = (1/h) · superdiag(1, 2, …, k)  (zeros elsewhere),   L = σ [0, …, 0, 1]⊺

▸ inference through filtering
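The following is a compact sketch of that filtering construction in the simplest case I can write down (a once-integrated Wiener-process prior with state (x, x′), constant diffusion q, and ODE evaluations treated as noise-free observations of x′; the actual method uses higher-order states and careful calibration):

```python
import numpy as np

def ode_filter(f, x0, t0, t1, h, q=1.0):
    """Gauss-Markov ODE solver: Kalman predict/update with the ODE evaluated
    at the predicted mean serving as the 'datum' at every step."""
    A = np.array([[1.0, h], [0.0, 1.0]])                           # transition of (x, x')
    Q = q * np.array([[h ** 3 / 3, h ** 2 / 2], [h ** 2 / 2, h]])  # process noise
    H = np.array([[0.0, 1.0]])                                     # we observe x'
    m, P, t = np.array([x0, f(x0, t0)]), np.zeros((2, 2)), t0
    while t < t1 - 1e-12:
        m, P = A @ m, A @ P @ A.T + Q                              # predict
        y = f(m[0], t + h)                                         # evaluate the ODE at the mean
        S = H @ P @ H.T                                            # innovation variance
        K = P @ H.T / S
        m = m + (K * (y - H @ m)).ravel()                          # update mean
        P = P - K @ S @ K.T                                        # update covariance
        t += h
    return m[0], np.sqrt(max(P[0, 0], 0.0))                        # posterior mean and std of x(t1)

f = lambda x, t: -x
mean, std = ode_filter(f, 1.0, 0.0, 1.0, h=0.2)
print(mean, std, np.exp(-1.0))                                     # estimate, its width, exact value
```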

slide-58
SLIDE 58

Calibrating Uncertainty

within the parametrized class

▸ posterior mean µ_{∣y} = k K⁻¹ y is invariant under k ↦ θ²k
▸ posterior covariance k_{∣y} = k − k K⁻¹ k is scaled by θ²
▸ connection to local error estimation in existing methods

slide-59
SLIDE 59

▸ as in integration, classic families of ODE solvers can be interpreted as MAP inference
▸ for each classic solver, a whole family of probabilistic solvers with the same point estimate, different uncertainties
▸ probabilistic solvers need not be expensive. They can even have exactly the same cost as classic methods

slide-60
SLIDE 60

Visualizing Computational Uncertainty

Neural Pathways [M. Schober, N. Kasenburg, A. Feragen, P.H., S. Hauberg, MICCAI 2014]


slide-62
SLIDE 62

Propagating Uncertainty

geodesics on an uncertain Riemannian metric [Hauberg, Schober, Liptrot, P.H., Feragen, MICCAI 2015]

[figure panels: GP density, std(Γ₁, Γ₂, Γ₃)]

▸ Shortest paths (geodesics) on a Riemannian manifold with metric M obey
  x″(t) = −½ M⁻¹(x(t)) [∂M⃗(x(t)) / ∂x(t)] (x′(t) ⊗ x′(t)) = f(x(t), x′(t), t)
▸ what if M ∼ N(m, V) is uncertain (inferred from data)?

slide-63
SLIDE 63

Uncertainty Across Composite Computations

interacting information requirements [Hennig, Osborne, Girolami, Proc. Royal Society A 2015]

[diagram: a machine interacting with its environment; data D, variables x_t, parameters θ, predictions x_{t+δt}, actions a; inference by quadrature, estimation by optimization, prediction by analysis, action by control]

▸ numerical methods able to deal with uncertain (probabilistic) inputs, and returning uncertain (probabilistic) outputs, allow control of computational effort
▸ (this is not the same as probabilistic programming!)

slide-64
SLIDE 64

Summary

Probabilistic Numerics – Part I [Hennig, Osborne, Girolami, Proc. Royal Society A 2015]

The probabilistic view of computation

▸ computation is (active) inference
▸ several classic methods can be interpreted precisely as MAP inference
  ▸ Gaussian Quadrature ↔ Gaussian process regression
  ▸ Runge-Kutta Methods ↔ Autoregressive Filtering
  ▸ [more to come on Monday]
▸ correct prior information can reduce runtime
▸ prior assumptions can be tested at runtime
▸ probabilistic uncertainty can be propagated between different numerical computations, allowing control of cost

“[PN is] a re-emerging area of very active research”
– Z. Ghahramani, Nature 521, 452–459 (28 May 2015)

http://probabilistic-numerics.org


slide-66
SLIDE 66

— Backup —


slide-67
SLIDE 67

Monte Carlo is not all that Different

MC as a ‘cautious’ limit

p(f) = GP(f; 0, k^{OU}_λ + c),   k^{OU}_λ(x, x′) = exp(−|x − x′| / λ)

λ → ∞  ⇒  k(x, x′) → const − |x − x′|   (trapezoid rule)
λ → 0   ⇒  k(x, x′) → δ(x − x′)   (simple averaging)


slide-72
SLIDE 72

Monte Carlo is not all that Different

MC as a ‘cautious’ limit

MC is optimal for totally unstructured (but integrable) f. But

▸ random node placement is not necessary!

Introducing randomness to a deterministic problem is generally a bad idea.

▸ no integrand is totally unstructured. Prior knowledge is available!

Computations should use as much available prior information as possible.




slide-80
SLIDE 80

What is a Random Number?

What is a sequence of random numbers?

▸ 662244111144334455553366666666

dice, doubled

▸ 169399375105820974944592307816

41-70th digits of π

▸ 712904263472610590208336044895

von Neumann method, seed 908344

▸ 100011111101111111100101000001

bits from G. Marsaglia’s ‘diehard CD’

▸ 01110000011100100110111101100011

deterministic sequence, corrupted by horizontal coin drop from unknown height

▸ for use in Monte Carlo, the important property is freedom from patterns (because it implies anytime unbiasedness)
▸ in fact, use for MC only really requires the right density; disorder is just helpful for the argument
▸ for use in cryptography, the important property is unpredictability
▸ randomness is a philosophically dodgy concept
▸ uncertainty is a much clearer idea

http://www.stat.fsu.edu/pub/diehard/

slide-81
SLIDE 81

How Expensive is a Computation?

Ex: Riemannian Statistics [Schober, Hauberg, Lawrence, Hennig; in prep.]

▸ Shortest paths (geodesics) on a Riemannian manifold with metric M obey
  x″(t) = −½ M⁻¹(x(t)) [∂M⃗(x(t)) / ∂x(t)] (x′(t) ⊗ x′(t)) = f(x(t), x′(t), t)
▸ the Karcher mean µ of a data set {x_i}_{i=1,…,N} is the point minimizing Σ_{i=1}^{N} Distance(µ, x_i)

slide-82
SLIDE 82

How Expensive is a Computation?

Ex: Riemannian Statistics [Schober, Hauberg, Lawrence, Hennig; in prep.]

▸ to find the Karcher mean, do gradient descent from an initial guess µ₀:
  µ_{k+1} = µ_k − α ∇_µ Σ_{i=1}^{N} Distance(µ, x_i)
▸ requires solving N initial value problems, over and over.
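To make the structure of this loop concrete, here is a toy sketch (mine) on the simplest manifold I can use, the unit circle, where the log map and geodesic distance are available in closed form; on a learned Riemannian metric each of the N terms would instead require solving an initial value problem, which is exactly the cost discussed above.

```python
import numpy as np

def karcher_mean_circle(angles, alpha=1.0, iters=20):
    """Gradient descent for the Karcher mean of points on the unit circle."""
    mu = angles[0]                                       # initialise at a data point
    for _ in range(iters):
        offsets = np.angle(np.exp(1j * (angles - mu)))   # log map: signed geodesic offsets
        mu = mu + alpha * offsets.mean()                 # descent step on sum of squared distances
    return np.angle(np.exp(1j * mu))

data = np.array([3.0, 3.2, -3.1, 2.9])                   # angles clustered around +/- pi
print(karcher_mean_circle(data))                         # near pi, unlike the raw arithmetic mean
```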

slide-83
SLIDE 83

How Much Information is Needed?

Ex: Riemannian Statistics [Schober, Hauberg,Lawrence, Hennig; in prep.]


