Parallelised Bayesian Optimisation via Thompson Sampling - PowerPoint PPT Presentation



slide-1
SLIDE 1

Parallelised Bayesian Optimisation via Thompson Sampling

Kirthevasan Kandasamy, Carnegie Mellon University
Google Research, Mountain View, CA. Sep 27, 2017

Slides: www.cs.cmu.edu/~kkandasa/talks/google-ts-slides.pdf

slide-2
SLIDE 2

Slides are up on my website: www.cs.cmu.edu/~kkandasa

slide-3
SLIDE 3

Black-box Optimisation

Neural Network

[Diagram: hyper-parameters → neural network → cross-validation accuracy]

  • Train the NN using the given hyper-parameters
  • Compute accuracy on the validation set

1/31
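To make the black-box view concrete, here is a minimal Python sketch: the optimiser only ever sees input/output pairs. `cv_accuracy` is a hypothetical stand-in (a cheap synthetic surrogate) for the expensive train-and-validate step; none of it comes from the talk's actual experiments.

```python
import math

def cv_accuracy(log_lr, log_reg):
    # Hypothetical stand-in for the expensive black box: in practice this
    # would train the NN with the given hyper-parameters and return its
    # cross-validation accuracy. Here: a cheap synthetic surrogate with a
    # peak of 0.9 at (log_lr, log_reg) = (-3, -2).
    return 0.9 * math.exp(-((log_lr + 3.0) ** 2 + (log_reg + 2.0) ** 2) / 8.0)

# The optimiser only ever sees pairs (x, f(x)): no gradients, no internals.
x = (-3.0, -2.0)
y = cv_accuracy(*x)
```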


slide-5
SLIDE 5

Black-box Optimisation

Expensive Blackbox Function

Other Examples:

  • ML estimation in astrophysics
  • Pre-clinical drug discovery
  • Optimal policy in autonomous driving

1/31

slide-6
SLIDE 6

Black-box Optimisation

f : X → R is an expensive, black-box, noisy function, accessible only via noisy evaluations.

2/31


slide-8
SLIDE 8

Black-box Optimisation

f : X → R is an expensive, black-box, noisy function, accessible only via noisy evaluations.

Let x⋆ = argmax_x f(x).

2/31

slide-9
SLIDE 9

Black-box Optimisation

f : X → R is an expensive, black-box, noisy function, accessible only via noisy evaluations.

Let x⋆ = argmax_x f(x).

Simple regret after n evaluations: SR(n) = f(x⋆) − max_{t=1,…,n} f(x_t).

2/31

slide-10
SLIDE 10

Black-box Optimisation

f : X → R is an expensive, black-box, noisy function, accessible only via noisy evaluations.

Let x⋆ = argmax_x f(x).

Cumulative regret after n evaluations: CR(n) = Σ_{t=1}^{n} ( f(x⋆) − f(x_t) ).

2/31
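Both regret definitions can be computed directly from the list of observed values. A small sketch with made-up numbers, taking f(x⋆) = 1.0:

```python
def simple_regret(f_opt, values):
    # SR(n) = f(x*) - max_{t=1..n} f(x_t)
    return f_opt - max(values)

def cumulative_regret(f_opt, values):
    # CR(n) = sum_{t=1..n} ( f(x*) - f(x_t) )
    return sum(f_opt - y for y in values)

values = [0.2, 0.5, 0.4, 0.8]            # observed f(x_1), ..., f(x_4)
sr = simple_regret(1.0, values)          # 1.0 - 0.8 = 0.2
cr = cumulative_regret(1.0, values)      # 0.8 + 0.5 + 0.6 + 0.2 = 2.1
```

Simple regret only rewards the single best point found; cumulative regret also penalises every poor evaluation along the way.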

slide-12
SLIDE 12

A walk-through of Bayesian Optimisation (BO) with Gaussian Processes:

◮ A review of Gaussian Processes (GPs)
◮ Thompson Sampling (TS): an algorithm for BO
◮ Other methods and models for BO

3/31

slide-13
SLIDE 13

Gaussian Processes (GP)

GP(µ, κ): A distribution over functions from X to R.

[Figures: functions with no observations; the prior GP; observations; the posterior GP given the observations]

4/31

slide-18
SLIDE 18

Gaussian Processes (GP)

GP(µ, κ): A distribution over functions from X to R. Posterior GP given observations.

Completely characterised by a mean function µ : X → R and a covariance kernel κ : X × X → R. After t observations, f(x) ∼ N( µ_t(x), σ_t²(x) ).

4/31
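The posterior mean µ_t and standard deviation σ_t above follow from the standard Gaussian conditioning formulas. A minimal NumPy sketch, assuming a squared-exponential kernel and a small observation-noise term (both illustrative choices, not from the slides):

```python
import numpy as np

def rbf(A, B, ls=1.0):
    # Squared-exponential kernel kappa(x, x') = exp(-(x - x')^2 / (2 ls^2)).
    return np.exp(-(A[:, None] - B[None, :]) ** 2 / (2 * ls ** 2))

def gp_posterior(X_obs, y_obs, X_query, noise=1e-3):
    # Posterior mean mu_t(x) and std sigma_t(x) of f ~ GP(0, kappa)
    # after t observations, via Gaussian conditioning.
    K = rbf(X_obs, X_obs) + noise * np.eye(len(X_obs))
    Ks = rbf(X_query, X_obs)
    mu = Ks @ np.linalg.solve(K, y_obs)
    cov = rbf(X_query, X_query) - Ks @ np.linalg.solve(K, Ks.T)
    return mu, np.sqrt(np.clip(np.diag(cov), 0.0, None))

X = np.array([0.0, 1.0])          # two (made-up) observation locations
y = np.array([0.0, 1.0])          # their observed values
mu, sigma = gp_posterior(X, y, np.array([0.0, 0.5, 1.0]))
# sigma is near 0 at the observed points and larger in between.
```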


slide-23
SLIDE 23

Gaussian Process (Bayesian) Optimisation

Model f ∼ GP(0, κ). Thompson Sampling (TS) (Thompson, 1933):

1) Construct the posterior GP.
2) Draw a sample g from the posterior.
3) Choose x_t = argmax_x g(x).
4) Evaluate f at x_t.

5/31
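Steps 1-4 can be sketched end-to-end on a toy one-dimensional problem. This is an illustrative discretised version: the sample g is drawn jointly on a finite grid and maximised there; the objective, kernel length-scale, and seed observations are all made up.

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf(A, B, ls=0.2):
    # Squared-exponential kernel on 1-D inputs.
    return np.exp(-(A[:, None] - B[None, :]) ** 2 / (2 * ls ** 2))

def posterior(Xo, yo, Xq, noise=1e-4):
    # Standard GP conditioning: mean vector and covariance matrix on Xq.
    K = rbf(Xo, Xo) + noise * np.eye(len(Xo))
    Ks = rbf(Xq, Xo)
    mu = Ks @ np.linalg.solve(K, yo)
    cov = rbf(Xq, Xq) - Ks @ np.linalg.solve(K, Ks.T)
    return mu, cov

def f(x):
    # Toy black-box (made up); its maximum is at x = 0.7.
    return -(x - 0.7) ** 2

grid = np.linspace(0.0, 1.0, 101)
Xo, yo = [0.0, 1.0], [f(0.0), f(1.0)]          # two seed observations

for t in range(20):
    mu, cov = posterior(np.array(Xo), np.array(yo), grid)   # 1) posterior GP
    g = rng.multivariate_normal(mu, cov + 1e-6 * np.eye(len(grid)))  # 2) sample
    x_t = grid[np.argmax(g)]                                # 3) maximise sample
    Xo.append(float(x_t))
    yo.append(f(x_t))                                       # 4) evaluate f

best = max(yo)   # typically close to f(0.7) = 0
```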

slide-24
SLIDE 24

Thompson Sampling (TS) in GPs

(Thompson, 1933)

[Animation over t = 1, 2, …, 25: the posterior GP, the sampled function, and the chosen point x_t; evaluations increasingly concentrate near the maximum of f]

6/31


slide-37
SLIDE 37

Some Theoretical Results for TS

Simple Regret: SR(n) = f(x⋆) − max_{t=1,…,n} f(x_t)

Theorem (Russo & van Roy 2014, Srinivas et al. 2010): For Thompson sampling, ignoring constants,

E[SR(n)] ≲ √( Ψ_n log(n) / n ).

Ψ_n ← Maximum Information Gain. When X ⊂ R^d:
  • SE (Gaussian) kernel: Ψ_n ≍ d^d log(n)^d
  • Matérn kernel: Ψ_n ≍ n^{1 − c/d²}

Several other results: (Agrawal et al. 2012, Kaufmann et al. 2012, Russo & van Roy 2016, Chowdhury & Gopalan 2017, and more…)

7/31
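For a GP, the information gained by observing f at a set A of points is I(f; y_A) = ½ log det(I + σ⁻² K_A), and Ψ_n is its maximum over sets of size n. A small sketch of the quantity itself; the kernel and noise level are arbitrary illustrative choices:

```python
import numpy as np

def rbf(A, B, ls=1.0):
    # Squared-exponential kernel for points given as rows.
    d2 = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2 * ls ** 2))

def information_gain(X, noise_var=0.01):
    # I(f; y_A) = 1/2 log det(I + sigma^-2 K_A): the mutual information
    # between f and noisy observations at the set A of points X.
    K = rbf(X, X)
    _, logdet = np.linalg.slogdet(np.eye(len(X)) + K / noise_var)
    return 0.5 * logdet

# Spread-out points are more informative than near-duplicate ones, which is
# why Psi_n (the maximum over n-point sets) reflects the domain's richness.
spread = np.array([[0.0], [2.0], [4.0]])
clustered = np.array([[0.0], [0.01], [0.02]])
```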


slide-45
SLIDE 45

Other methods for BO

Other criteria for selecting x_t:

◮ Upper Confidence Bounds (Srinivas et al. 2010): ϕ_t = µ_{t−1} + β_t^{1/2} σ_{t−1}
◮ Expected Improvement (Jones et al. 1998)
◮ Probability of Improvement (Kushner et al. 1964)
◮ Entropy Search (Hernández-Lobato et al. 2014)

All are deterministic methods: they choose the next point for evaluation by maximising a deterministic acquisition function, i.e. x_t = argmax_{x∈X} ϕ_t(x).

Other models for f: neural networks (Snoek et al. 2015), random forests (Hutter 2009).

8/31
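The GP-UCB rule above is one line once the posterior is available. A sketch with made-up posterior values:

```python
import numpy as np

def ucb(mu, sigma, beta=4.0):
    # phi_t(x) = mu_{t-1}(x) + beta_t^{1/2} sigma_{t-1}(x)
    return mu + np.sqrt(beta) * sigma

mu = np.array([0.5, 0.2, 0.4])        # posterior means at 3 candidates (made up)
sigma = np.array([0.05, 0.40, 0.10])  # posterior stds at the same candidates
phi = ucb(mu, sigma)
x_t = int(np.argmax(phi))  # the highly uncertain middle point wins here
```

Note how the bonus term trades off exploitation (high µ) against exploration (high σ).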


slide-49
SLIDE 49

Big picture: scaling up black-box optimisation

◮ Optimising in high-dimensional spaces, e.g. tuning models with several hyper-parameters. Additive models for f lead to statistically and computationally tractable algorithms. (Kandasamy et al. ICML 2015)

◮ Multi-fidelity optimisation: what if we have cheap approximations to f? E.g. train an ML model with N• data and T• iterations, but use N < N• data and T < T• iterations to approximate the cross-validation performance at (N•, T•). (Kandasamy et al. NIPS 2016a&b, Kandasamy et al. ICML 2017)

Extends beyond GPs.

9/31


slide-51
SLIDE 51

This work: Parallel Evaluations

(Kandasamy et al. Arxiv 2017)

Parallelisation with M workers: we can evaluate f at M different points at the same time, e.g. train M models with different hyper-parameter values in parallel. The inability to parallelise is a real bottleneck in practice!

Some desiderata:

◮ Statistically, achieve a ×M improvement.
◮ Methodologically, be scalable to a very large number of workers:
  • the method remains computationally tractable as M increases;
  • the method is conceptually simple, for robustness in practice.

10/31

slide-52
SLIDE 52

Outline

(Kandasamy et al. Arxiv 2017)

  • 1. Set up & definitions
  • 2. Prior work & challenges
  • 3. Algorithms synTS, asyTS: direct application of TS to synchronous and asynchronous parallel settings
  • 4. Experiments
  • 5. Theoretical Results
  • 6. Open questions/challenges

11/31

slide-53
SLIDE 53

Outline

(Kandasamy et al. Arxiv 2017)

  • 1. Set up & definitions
  • 2. Prior work & challenges
  • 3. Algorithms synTS, asyTS: direct application of TS to synchronous and asynchronous parallel settings
  • 4. Experiments
  • 5. Theoretical Results
    ◮ synTS and asyTS perform essentially the same as seqTS in terms of the number of evaluations.
    ◮ When we factor in time as a resource, asyTS outperforms synTS and seqTS … with some caveats.
  • 6. Open questions/challenges

11/31


slide-58
SLIDE 58

Parallel Evaluations: set up

Sequential evaluations with one worker: the jth job has feedback from all previous j − 1 jobs.

Parallel evaluations with M workers (asynchronous): the jth job is missing feedback from exactly M − 1 jobs.

Parallel evaluations with M workers (synchronous): the jth job is missing feedback from at most M − 1 jobs.

12/31


slide-60
SLIDE 60

Simple Regret in Parallel Settings

(Kandasamy et al. Arxiv 2017)

Simple regret after n evaluations: SR(n) = f(x⋆) − max_{t=1,…,n} f(x_t).

n ← number of completed evaluations by all M workers.

Simple regret with time as a resource (asynchronous or synchronous): SR′(T) = f(x⋆) − max_{t=1,…,N} f(x_t).

N ← (possibly random) number of completed evaluations by all M workers within time T.

13/31


slide-65
SLIDE 65

Prior work in Parallel BO

[Table comparing prior parallel-BO methods on asynchronicity, theoretical guarantees, and conceptual simplicity*: (Ginsbourger et al. 2011), (Janusevkis et al. 2012), (Contal et al. 2013), (Desautels et al. 2014), (Gonzalez et al. 2015), (Shah & Ghahramani 2015), (Wang et al. 2016), (Kathuria et al. 2016), (Wu & Frazier 2017), (Wang et al. 2017), (Kandasamy et al. Arxiv 2017)]

* A straightforward extension of the sequential algorithm works.

14/31


slide-70
SLIDE 70

Why are deterministic algorithms not “simple”? Need to encourage diversity in parallel evaluations

Direct application of GP-UCB in the synchronous setting:

  • First worker: maximise the acquisition, x_{t1} = argmax ϕ_t(x), where ϕ_t = µ_{t−1} + β_t^{1/2} σ_{t−1}.
  • Second worker: the acquisition is the same! x_{t2} = x_{t1}.
  • Hence x_{t1} = x_{t2} = · · · = x_{tM}.

Direct application of the sequential algorithm does not work. We need to “encourage diversity”.

15/31


slide-77
SLIDE 77

Why are deterministic algorithms not “simple”? Need to encourage diversity in parallel evaluations.

◮ Add hallucinated observations.
◮ Optimise an acquisition over X^M.
◮ Resort to heuristics; typically requires additional hyper-parameters and/or computational routines.

Take-home message: a straightforward application of the sequential algorithm works for TS. Its inherent randomness takes care of the exploration vs. exploitation trade-off when managing M workers.

16/31
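The contrast can be seen numerically: with a fixed posterior, a deterministic acquisition hands every idle worker the same point, while TS's random samples naturally spread out (here over repeated draws). A toy sketch on a three-point grid; all numbers are invented:

```python
import numpy as np

rng = np.random.default_rng(1)

mu = np.array([0.1, 0.5, 0.3])   # posterior mean on a 3-point grid (made up)
cov = 0.05 * np.eye(3)           # a (diagonal) posterior covariance
sigma = np.sqrt(np.diag(cov))

# Deterministic acquisition (UCB-style): every idle worker gets the SAME point.
phi = mu + 2.0 * sigma
picks_ucb = [int(np.argmax(phi)) for _ in range(4)]

# TS: each worker maximises its own random posterior sample, so picks vary.
picks_ts = [int(np.argmax(rng.multivariate_normal(mu, cov)))
            for _ in range(200)]
```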


slide-80
SLIDE 80

Parallel Thompson Sampling

(Kandasamy et al. Arxiv 2017)

Asynchronous: asyTS. At any given time:

  1. (x′, y′) ← wait for a worker to finish.
  2. Compute the posterior GP.
  3. Draw a sample g ∼ GP.
  4. Re-deploy the worker at argmax g.

Synchronous: synTS. At any given time:

  1. {(x′_m, y′_m)}_{m=1}^{M} ← wait for all workers to finish.
  2. Compute the posterior GP.
  3. Draw M samples g_m ∼ GP, ∀m.
  4. Re-deploy worker m at argmax g_m, ∀m.

Variants in prior work: (Osband et al. 2016, Israelsen et al. 2016, Hernandez-Lobato et al. 2017)

17/31
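The asyTS loop can be sketched as a discrete-event simulation: a heap of (finish-time, worker) events stands in for real workers, and a stub replaces steps 2-3 (the GP posterior and sample) to keep the sketch short. Everything here (`f`, `ts_next_point`, the timing model) is illustrative, not the paper's implementation.

```python
import heapq
import random

random.seed(0)

def f(x):
    # Toy black-box (made up); maximum at x = 0.7.
    return -(x - 0.7) ** 2

def ts_next_point(data):
    # Stub for steps 2-3: real asyTS fits the posterior GP to `data`,
    # draws a sample g, and returns argmax g. Here: perturb the incumbent.
    best_x = max(data, key=lambda xy: xy[1])[0]
    return min(1.0, max(0.0, best_x + random.gauss(0.0, 0.2)))

def asy_ts(M=4, budget=20):
    data, events = [], []
    for m in range(M):                      # initial dispatch of all M workers
        heapq.heappush(events, (random.uniform(1, 3), m, random.random()))
    while len(data) < budget:
        clock, m, x = heapq.heappop(events) # 1) wait for ONE worker to finish
        data.append((x, f(x)))              #    record (x', y')
        x_new = ts_next_point(data)         # 2-3) posterior + sample (stubbed)
        heapq.heappush(events, (clock + random.uniform(1, 3), m, x_new))  # 4)
    return data

data = asy_ts()
```

synTS differs only in popping all M events, refitting once, and pushing M fresh jobs.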


slide-82
SLIDE 82

Experiment: Park1-4D, M = 10

Comparison in terms of the number of evaluations.

[Plot: simple regret vs number of evaluations for seqTS, synTS, asyTS]

19/31


slide-85
SLIDE 85

Experiment: Branin-2D, M = 4

Evaluation time sampled from a uniform distribution.

[Plot: SR′(T) vs time for synRAND, synHUCB, synUCBPE, synTS, asyRAND, asyUCB, asyHUCB, asyEI, asyHTS, asyTS]

20/31

slide-86
SLIDE 86

Experiment: Hartmann-6D, M = 12

Evaluation time sampled from a half-normal distribution.

[Plot: SR′(T) vs time for synRAND, synHUCB, synUCBPE, synTS, asyRAND, asyUCB, asyHUCB, asyEI, asyHTS, asyTS]

21/31

slide-87
SLIDE 87

Experiment: Hartmann-18D, M = 25

Evaluation time sampled from an exponential distribution.

[Plot: SR′(T) vs time for synRAND, synHUCB, synUCBPE, synTS, asyRAND, asyUCB, asyHUCB, asyEI, asyHTS, asyTS]

22/31

slide-88
SLIDE 88

Experiment: Currin-Exponential-14D, M = 35

Evaluation time sampled from a Pareto-3 distribution.

[Plot: SR′(T) vs time for synRAND, synHUCB, synUCBPE, synTS, asyRAND, asyUCB, asyHUCB, asyEI, asyHTS, asyTS]

23/31

slide-89
SLIDE 89

Experiment: Model Selection in Cifar10, M = 4

Tune the number of filters (in the range (32, 256)) for each layer of a 6-layer CNN. Time taken per evaluation: 4-16 minutes.

[Plot: validation accuracy (about 0.68-0.72) vs wall-clock time for synTS, synHUCB, asyRAND, asyHUCB, asyEI, asyTS]

24/31



slide-92
SLIDE 92

Bounds for SR(n), synTS

seqTS (Russo & van Roy 2014):

E[SR(n)] ≲ √( Ψ_n log(n) / n ),   Ψ_n ← maximum information gain.

Theorem: synTS (Kandasamy et al. Arxiv 2017):

E[SR(n)] ≲ √( M log(M) / n ) + √( Ψ_{n+M} log(n+M) / n ).

The leading constant is also the same.

25/31


slide-94
SLIDE 94

Bounds for SR(n), asyTS

seqTS (Russo & van Roy 2014):

E[SR(n)] ≲ √( Ψ_n log(n) / n ).

Theorem: asyTS (Kandasamy et al. Arxiv 2017):

E[SR(n)] ≲ √( ξ_M Ψ_n log(n) / n ),   where ξ_M = sup_{D_n, n≥1} max_{A⊂X, |A|≤M} exp( I(f; A | D_n) ).

26/31


slide-97
SLIDE 97

Bounds for SR(n), asyTS

seqTS (Russo & van Roy 2014):

E[SR(n)] ≲ √( Ψ_n log(n) / n ).

Theorem: asyTS, arbitrary X (Kandasamy et al. Arxiv 2017):

E[SR(n)] ≲ M polylog(M) / n + √( C Ψ_n log(n) / n ),   where ξ_M = sup_{D_n, n≥1} max_{A⊂X, |A|≤M} exp( I(f; A | D_n) ).

Theorem (Krause et al. 2008, Desautels et al. 2012): There exists an asynchronously parallelisable initialisation scheme requiring O(M polylog(M)) evaluations of f such that ξ_M ≤ C.

* We do not believe this initialisation is necessary.

26/31


slide-99
SLIDE 99

Bounds for asyTS without the initialisation scheme

Theorem: synTS, arbitrary X (Kandasamy et al. Arxiv 2017):

E[SR(n)] ≲ √( M log(M) / n ) + √( Ψ_{n+M} log(n + M) / n ).

Theorem: asyTS, X ⊂ R^d (ongoing work):

E[SR(n)] ≲ … + √( M log(n) ) / n^{1/O(d)}.

27/31


slide-102
SLIDE 102

Theoretical Results for SR′(T)

Model evaluation time as an independent random variable:

◮ Uniform unif(a, b): bounded
◮ Half-normal HN(τ²): sub-Gaussian
◮ Exponential exp(λ): sub-exponential

Theorem (informal): If evaluation times are all the same, synTS ≈ asyTS. Otherwise, the bounds for asyTS are better than those for synTS; the more variable the evaluation times, the bigger the difference:

  • Uniform: constant factor
  • Half-normal: √log(M) factor
  • Exponential: log(M) factor

28/31
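The synchronous penalty is easy to see in simulation: a synchronous batch ends only when its slowest worker finishes, so within a time budget T the asynchronous scheme completes more evaluations. A sketch with i.i.d. exponential evaluation times (all parameters invented):

```python
import random

random.seed(0)

def completed(T=100.0, M=8, mode="async"):
    # Count evaluations finished by time T with M workers, when each
    # evaluation's duration is i.i.d. exponential(1) (illustrative choice).
    n = 0
    if mode == "async":
        # Each worker re-deploys the moment it finishes its own job.
        for _ in range(M):
            t = 0.0
            while True:
                t += random.expovariate(1.0)
                if t > T:
                    break
                n += 1
    else:
        # Synchronous: a batch of M ends only when its SLOWEST worker finishes.
        t = 0.0
        while True:
            t += max(random.expovariate(1.0) for _ in range(M))
            if t > T:
                break
            n += M
    return n

n_async = completed(mode="async")
n_sync = completed(mode="sync")   # noticeably fewer completed evaluations
```

The heavier the tail of the duration distribution, the larger the max-of-M batch time, matching the constant / √log(M) / log(M) ordering above.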



slide-105
SLIDE 105

Open Challenges for Parallelised TS

  • 1. Bounds for asynchronous TS without initialisation.
  • 2. Other models for evaluation times, e.g. evaluation time depends on x ∈ X.
  • 3. In the asynchronous setting:
    ◮ Should you wait for another job to finish without immediately re-deploying?
    ◮ Do you kill an on-going job depending on the result of a completed job?

29/31


slide-107
SLIDE 107

Open Challenges for Parallelised TS

  • 4. Optimising the sample when X = [0, 1]^d:

    x_t = argmax_{x∈X} g(x), where g ∼ posterior GP.

◮ Global optimisation of a non-convex function! … a common challenge in most BO methods. But additionally for TS:
  ◮ As g is not deterministic, draw samples at a fixed set of points and pick the maximum.
  ◮ Or, if using an adaptive method, it scales as O((N + S)³), where N ← # of evaluations of f and S ← # of evaluations of g.

30/31
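The fixed-set approach mentioned above, sketched in NumPy: draw g jointly at S grid points from the posterior (one S × S Cholesky, hence the cubic cost in S) and return the maximising grid point. The kernel, noise level, and data are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf(A, B, ls=0.2):
    # Squared-exponential kernel on 1-D inputs.
    return np.exp(-(A[:, None] - B[None, :]) ** 2 / (2 * ls ** 2))

def sample_max_on_grid(X_obs, y_obs, S=200, noise=1e-4):
    # Draw one posterior sample g jointly at S grid points and maximise it.
    # The S x S Cholesky makes each draw O(S^3) on top of the O(N^3) posterior.
    grid = np.linspace(0.0, 1.0, S)
    K = rbf(X_obs, X_obs) + noise * np.eye(len(X_obs))
    Ks = rbf(grid, X_obs)
    mu = Ks @ np.linalg.solve(K, y_obs)
    cov = rbf(grid, grid) - Ks @ np.linalg.solve(K, Ks.T)
    L = np.linalg.cholesky(cov + 1e-6 * np.eye(S))   # jitter for stability
    g = mu + L @ rng.standard_normal(S)              # one sample path of g
    return float(grid[np.argmax(g)])

x_t = sample_max_on_grid(np.array([0.1, 0.9]), np.array([0.0, 1.0]))
```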


slide-110
SLIDE 110

Summary

◮ synTS, asyTS: direct application of TS to the synchronous and asynchronous parallel settings.

◮ Take-aways (theory):
  • Both perform essentially the same as seqTS in terms of the number of evaluations.
  • When we factor in time as a resource, asyTS performs best.

◮ Take-aways (practice):
  • Conceptually simple, and scales better with the number of workers than other methods.

31/31

slide-111
SLIDE 111

Akshay Krishnamurthy, Jeff Schneider, Barnabás Póczos

Code: github.com/kirthevasank/gp-parallel-ts
Slides: www.cs.cmu.edu/~kkandasa/talks/google-ts-slides.pdf

Thank you.