An Introduction to Bayesian Optimisation and (Potential) Applications in Materials Science

Kirthevasan Kandasamy
Machine Learning Dept, CMU
Electrochemical Energy Symposium, Pittsburgh, PA, November 2017

Designing Electrolytes in Batteries
1/19
Black-box Optimisation in Computational Astrophysics

Cosmological Simulator → Observation → Likelihood computation → Likelihood score
Simulator inputs, e.g.: Hubble constant, baryonic density
Black-box Optimisation

Expensive black-box function. Other examples:
◮ Pre-clinical drug discovery
◮ Optimal policy in autonomous driving
◮ Synthetic gene design
Black-box Optimisation

f : X → R is an expensive, black-box function, accessible only via noisy evaluations. Let x⋆ = argmaxx f(x).

[Figure: f plotted against x, with the maximiser x⋆ and the maximum value f(x⋆) marked.]

2/19
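As a toy illustration of "accessible only via noisy evaluations", here is a minimal Python stand-in for such a black box. The quadratic inside is a hypothetical choice for illustration, not any function from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_eval(x, noise_sd=0.1):
    # The optimiser may only query f(x) + Gaussian noise; it never sees
    # the formula below (an illustrative quadratic with maximiser x* = 0.7).
    f = -(x - 0.7) ** 2
    return f + noise_sd * rng.normal()
```

Repeated evaluations at the same x average out to f(x); it is exactly this noise that the Bayesian models on the following slides are designed to handle.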
Outline

◮ Part I: Bayesian Optimisation
  ◮ Bayesian models for f
  ◮ Two algorithms: upper confidence bounds & Thompson sampling
◮ Part II: Some Modern Challenges
  ◮ Multi-fidelity optimisation
  ◮ Parallelisation

3/19
Bayesian Models for f

e.g. Gaussian Processes (GP). GP: a distribution over functions from X to R.

[Figure sequence: candidate functions with no observations; the prior GP; observations; the posterior GP given the observations.]

After t observations, f(x) ∼ N( µt(x), σ²t(x) ).

4/19
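The posterior update behind this slide can be sketched in a few lines of numpy. The squared-exponential kernel, lengthscale, and noise level below are illustrative assumptions:

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    # Squared-exponential kernel k(a, b) = exp(-(a - b)^2 / (2 l^2)).
    d = A[:, None] - B[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def gp_posterior(x_train, y_train, x_query, noise=1e-3, lengthscale=1.0):
    # Standard zero-mean GP regression equations:
    #   mu_t(x)      = k(x, X) (K + s^2 I)^{-1} y
    #   sigma_t^2(x) = k(x, x) - k(x, X) (K + s^2 I)^{-1} k(X, x)
    K = rbf_kernel(x_train, x_train, lengthscale) + noise * np.eye(len(x_train))
    Ks = rbf_kernel(x_query, x_train, lengthscale)
    mu = Ks @ np.linalg.solve(K, y_train)
    v = np.linalg.solve(K, Ks.T)
    var = 1.0 - np.sum(Ks * v.T, axis=1)  # k(x, x) = 1 for this kernel
    return mu, var
```

Near an observation the posterior mean tracks the observed value and the variance collapses; far from all observations the posterior reverts to the prior, which is what the shrinking and widening bands in the figure depict.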
Bayesian Optimisation with Upper Confidence Bounds

Model f ∼ GP. Gaussian Process Upper Confidence Bound (GP-UCB) (Srinivas et al. 2010)

1) Construct the posterior GP.
2) ϕt = µt−1 + βt^{1/2} σt−1 is a UCB.
3) Choose xt = argmaxx ϕt(x).
4) Evaluate f at xt.

5/19
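Steps 2–3 reduce to one line of code once the posterior mean and width are available on a grid of candidate points. The β value and the discretised inputs below are illustrative assumptions:

```python
import numpy as np

def gp_ucb_acquisition(mu, sigma, beta=4.0):
    # phi_t(x) = mu_{t-1}(x) + beta_t^{1/2} * sigma_{t-1}(x)  (step 2);
    # return the index of the grid point maximising the UCB (step 3).
    phi = mu + np.sqrt(beta) * sigma
    return int(np.argmax(phi))
```

With posterior means [0.0, 0.2, 0.1] and widths [0.0, 0.1, 0.5], the rule prefers the third point: its moderate mean is outweighed by its large uncertainty, which is how GP-UCB trades off exploitation against exploration.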
GP-UCB (Srinivas et al. 2010)

[Figure sequence: GP-UCB iterations at t = 1–7, 11, and 25.]

6/19
Bayesian Optimisation with Thompson Sampling

Model f ∼ GP(0, κ). Thompson Sampling (TS) (Thompson, 1933)

1) Construct the posterior GP.
2) Draw a sample g from the posterior.
3) Choose xt = argmaxx g(x).
4) Evaluate f at xt.

7/19
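Steps 2–3 differ from GP-UCB only in that a random function sample replaces the deterministic UCB; over a finite grid of candidate points this is a multivariate-normal draw from the posterior. A minimal sketch (the grid discretisation is an assumption):

```python
import numpy as np

def thompson_sample_argmax(mu, cov, rng):
    # Step 2: draw one sample g of the function values over the grid
    # from the posterior N(mu, cov); step 3: return its maximiser's index.
    g = rng.multivariate_normal(mu, cov)
    return int(np.argmax(g))
```

The randomness in g is what makes TS explore: points with high posterior uncertainty occasionally produce the largest sampled value and so get queried, without any explicit exploration bonus.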
More on Bayesian Optimisation

Theoretical results: both UCB and TS will eventually find the optimum under certain smoothness assumptions on f.

Other criteria for selecting xt:
◮ Expected improvement (Jones et al. 1998)
◮ Probability of improvement (Kushner et al. 1964)
◮ Predictive entropy search (Hernández-Lobato et al. 2014)
◮ Information-directed sampling (Russo & Van Roy 2014)

Other Bayesian models for f:
◮ Neural networks (Snoek et al. 2015)
◮ Random forests (Hutter 2009)

8/19
Some Modern Challenges/Opportunities

1. Multi-fidelity Optimisation (Kandasamy et al. NIPS 2016 a&b, Kandasamy et al. ICML 2017)
2. Parallelisation (Kandasamy et al. Arxiv 2017)

9/19
1. Multi-fidelity Optimisation (Kandasamy et al. NIPS 2016 a&b, Kandasamy et al. ICML 2017)

The desired function f is very expensive, but we have access to cheap approximations: f1, f2, f3 ≈ f which are cheaper to evaluate.

E.g. f: a real-world battery experiment; f2: lab experiment; f1: computer simulation.

10/19
MF-GP-UCB (Kandasamy et al. NIPS 2016b)

Multi-fidelity Gaussian Process Upper Confidence Bound. With 2 fidelities (1 approximation):

[Figure: f(1) and f(2) with the optimum x⋆ and the query xt at t = 14.]

Theorem: MF-GP-UCB finds the optimum x⋆ with fewer resources than GP-UCB on f(2). Can be extended to multiple approximations and to continuous approximations.

11/19
Experiment: Cosmological Maximum Likelihood Inference

◮ Type Ia supernovae data
◮ Maximum likelihood inference for 3 cosmological parameters:
  ◮ Hubble constant H0
  ◮ Dark energy fraction ΩΛ
  ◮ Dark matter fraction ΩM
◮ Likelihood: Robertson–Walker metric (Robertson 1936); requires numerical integration for each point in the dataset.

12/19
Experiment: Cosmological Maximum Likelihood Inference

3 cosmological parameters (d = 3). Fidelities: integration on grids of size (10², 10⁴, 10⁶) (M = 3).

[Figure: results plot; axis data not recoverable.]

13/19
Experiment: Hartmann-3D

2 approximations (3 fidelities). We want to optimise the m = 3rd fidelity f(3)(x), which is the most expensive; the m = 1st fidelity is the cheapest.

[Figure: query frequencies for Hartmann-3D at fidelities m = 1, 2, 3.]

14/19
2. Parallelising Function Evaluations

Parallelisation with M workers: we can evaluate f at M different points at the same time. E.g.: test M different battery solvents at the same time.

Settings: sequential evaluations with one worker; parallel evaluations with M workers (asynchronous); parallel evaluations with M workers (synchronous).

15/19
Parallel Thompson Sampling (Kandasamy et al. Arxiv 2017)

Asynchronous: asyTS. At any given time,
1. (x′, y′) ← wait for a worker to finish.
2. Compute the posterior GP.
3. Draw a sample g ∼ GP.
4. Re-deploy the worker at argmax g.

Synchronous: synTS. At any given time,
1. {(x′m, y′m)}m=1..M ← wait for all workers to finish.
2. Compute the posterior GP.
3. Draw M samples gm ∼ GP, ∀m.
4. Re-deploy worker m at argmax gm, ∀m.

16/19
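The asynchronous loop above can be sketched in Python with a thread pool. Everything below is an illustrative assumption rather than the paper's implementation: the 1-D grid, the kernel and its hyperparameters, and the toy `experiment` function standing in for an expensive evaluation.

```python
import time
import numpy as np
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

def rbf(A, B, ls=0.2):
    # Squared-exponential kernel on 1-D inputs.
    return np.exp(-0.5 * (np.subtract.outer(A, B) / ls) ** 2)

def posterior(X, Y, grid, noise=1e-4):
    # Zero-mean GP posterior mean and covariance over a finite grid.
    if not X:
        return np.zeros(len(grid)), rbf(grid, grid)
    Xa, Ya = np.asarray(X), np.asarray(Y)
    K = rbf(Xa, Xa) + noise * np.eye(len(Xa))
    Ks = rbf(grid, Xa)
    mu = Ks @ np.linalg.solve(K, Ya)
    cov = rbf(grid, grid) - Ks @ np.linalg.solve(K, Ks.T)
    return mu, cov

def experiment(x):
    # Hypothetical expensive evaluation: sleeps briefly, returns (x, y).
    time.sleep(0.01)
    return x, -(x - 0.3) ** 2

def asy_ts(f, grid, n_evals=10, n_workers=3, seed=0):
    rng = np.random.default_rng(seed)
    X, Y = [], []

    def next_point():
        # Steps 2-3: refit the posterior, draw one sample g, maximise it.
        mu, cov = posterior(X, Y, grid)
        g = rng.multivariate_normal(mu, cov + 1e-8 * np.eye(len(grid)))
        return grid[int(np.argmax(g))]

    with ThreadPoolExecutor(n_workers) as ex:
        pending = {ex.submit(f, next_point())
                   for _ in range(min(n_workers, n_evals))}
        while len(Y) < n_evals:
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            for fut in done:                        # step 1: a worker finished
                x, y = fut.result()
                X.append(x); Y.append(y)
            while len(pending) < n_workers and len(Y) + len(pending) < n_evals:
                pending.add(ex.submit(f, next_point()))  # step 4: re-deploy
    return X, Y
```

The synchronous variant would instead wait with `ALL_COMPLETED` and re-deploy all M workers at once; the asynchronous version never leaves a worker idle while its peers are still running, which is where its advantage under variable evaluation times comes from.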
Experiment: Branin-2D, M = 4

Evaluation time sampled from a uniform distribution.

[Figure: results comparing synRAND, synHUCB, synUCBPE, synTS, asyRAND, asyUCB, asyHUCB, asyEI, asyHTS, asyTS.]

17/19
Experiment: Hartmann-18D, M = 25

Evaluation time sampled from an exponential distribution.

[Figure: results comparing synRAND, synHUCB, synUCBPE, synTS, asyRAND, asyUCB, asyHUCB, asyEI, asyHTS, asyTS.]

18/19
Summary

◮ Black-box optimisation methods are used in several scientific and engineering applications.
◮ Bayesian optimisation: a method for black-box optimisation which uses Bayesian uncertainty estimates for f.
◮ Some modern challenges:
  ◮ Multi-fidelity optimisation
  ◮ Parallel evaluations
  ◮ and several more . . .

Thank you.
Slides are up on my website: www.cs.cmu.edu/∼kkandasa

19/19