Parallelised Bayesian Optimisation via Thompson Sampling - PowerPoint PPT Presentation



slide-1
SLIDE 1

Parallelised Bayesian Optimisation via Thompson Sampling

Kirthevasan Kandasamy, Carnegie Mellon University
Google Research, Mountain View, CA. Sep 27, 2017

Slides: www.cs.cmu.edu/~kkandasa/talks/google-ts-slides.pdf

slide-2
SLIDE 2

Slides are up on my website: www.cs.cmu.edu/~kkandasa

slide-3
SLIDE 3

Black-box Optimisation

Neural Network

[Diagram: hyper-parameters → neural network → cross-validation accuracy]

  • Train the NN using the given hyper-parameters
  • Compute accuracy on the validation set

1/31
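To make the black-box view concrete, here is a minimal Python sketch: the optimiser only ever sees input/output pairs. `cv_accuracy` is a hypothetical stand-in (a cheap synthetic surrogate) for the expensive train-and-validate step; none of it comes from the talk's actual experiments.

```python
import math

def cv_accuracy(log_lr, log_reg):
    # Hypothetical stand-in for the expensive black box: in practice this
    # would train the NN with the given hyper-parameters and return its
    # cross-validation accuracy. Here: a cheap synthetic surrogate with a
    # peak of 0.9 at (log_lr, log_reg) = (-3, -2).
    return 0.9 * math.exp(-((log_lr + 3.0) ** 2 + (log_reg + 2.0) ** 2) / 8.0)

# The optimiser only ever sees pairs (x, f(x)): no gradients, no internals.
x = (-3.0, -2.0)
y = cv_accuracy(*x)
```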


slide-5
SLIDE 5

Black-box Optimisation

Expensive Blackbox Function

Other Examples:

  • ML estimation in astrophysics
  • Pre-clinical drug discovery
  • Optimal policy in autonomous driving

1/31

slide-6
SLIDE 6

Black-box Optimisation

f : X → R is an expensive, black-box, noisy function, accessible only via noisy evaluations.

2/31


slide-8
SLIDE 8

Black-box Optimisation

f : X → R is an expensive, black-box, noisy function, accessible only via noisy evaluations.

Let x⋆ = argmax_x f(x).

2/31

slide-9
SLIDE 9

Black-box Optimisation

f : X → R is an expensive, black-box, noisy function, accessible only via noisy evaluations.

Let x⋆ = argmax_x f(x).

Simple regret after n evaluations: SR(n) = f(x⋆) − max_{t=1,…,n} f(x_t).

2/31

slide-10
SLIDE 10

Black-box Optimisation

f : X → R is an expensive, black-box, noisy function, accessible only via noisy evaluations.

Let x⋆ = argmax_x f(x).

Cumulative regret after n evaluations: CR(n) = Σ_{t=1}^{n} ( f(x⋆) − f(x_t) ).

2/31
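Both regret definitions can be computed directly from the list of observed values. A small sketch with made-up numbers, taking f(x⋆) = 1.0:

```python
def simple_regret(f_opt, values):
    # SR(n) = f(x*) - max_{t=1..n} f(x_t)
    return f_opt - max(values)

def cumulative_regret(f_opt, values):
    # CR(n) = sum_{t=1..n} ( f(x*) - f(x_t) )
    return sum(f_opt - y for y in values)

values = [0.2, 0.5, 0.4, 0.8]            # observed f(x_1), ..., f(x_4)
sr = simple_regret(1.0, values)          # 1.0 - 0.8 = 0.2
cr = cumulative_regret(1.0, values)      # 0.8 + 0.5 + 0.6 + 0.2 = 2.1
```

Simple regret only rewards the single best point found; cumulative regret also penalises every poor evaluation along the way.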

slide-12
SLIDE 12

A walk-through of Bayesian Optimisation (BO) with Gaussian Processes:

◮ A review of Gaussian Processes (GPs)
◮ Thompson Sampling (TS): an algorithm for BO
◮ Other methods and models for BO

3/31

slide-13
SLIDE 13

Gaussian Processes (GP)

GP(µ, κ): A distribution over functions from X to R.

[Figures: functions with no observations; the prior GP; observations; the posterior GP given the observations]

4/31

slide-18
SLIDE 18

Gaussian Processes (GP)

GP(µ, κ): A distribution over functions from X to R. Posterior GP given observations.

Completely characterised by a mean function µ : X → R and a covariance kernel κ : X × X → R. After t observations, f(x) ∼ N( µ_t(x), σ_t²(x) ).

4/31
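The posterior mean µ_t and standard deviation σ_t above follow from the standard Gaussian conditioning formulas. A minimal NumPy sketch, assuming a squared-exponential kernel and a small observation-noise term (both illustrative choices, not from the slides):

```python
import numpy as np

def rbf(A, B, ls=1.0):
    # Squared-exponential kernel kappa(x, x') = exp(-(x - x')^2 / (2 ls^2)).
    return np.exp(-(A[:, None] - B[None, :]) ** 2 / (2 * ls ** 2))

def gp_posterior(X_obs, y_obs, X_query, noise=1e-3):
    # Posterior mean mu_t(x) and std sigma_t(x) of f ~ GP(0, kappa)
    # after t observations, via Gaussian conditioning.
    K = rbf(X_obs, X_obs) + noise * np.eye(len(X_obs))
    Ks = rbf(X_query, X_obs)
    mu = Ks @ np.linalg.solve(K, y_obs)
    cov = rbf(X_query, X_query) - Ks @ np.linalg.solve(K, Ks.T)
    return mu, np.sqrt(np.clip(np.diag(cov), 0.0, None))

X = np.array([0.0, 1.0])          # two (made-up) observation locations
y = np.array([0.0, 1.0])          # their observed values
mu, sigma = gp_posterior(X, y, np.array([0.0, 0.5, 1.0]))
# sigma is near 0 at the observed points and larger in between.
```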


slide-23
SLIDE 23

Gaussian Process (Bayesian) Optimisation

Model f ∼ GP(0, κ). Thompson Sampling (TS) (Thompson, 1933):

1) Construct the posterior GP.
2) Draw a sample g from the posterior.
3) Choose x_t = argmax_x g(x).
4) Evaluate f at x_t.

5/31
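Steps 1-4 can be sketched end-to-end on a toy one-dimensional problem. This is an illustrative discretised version: the sample g is drawn jointly on a finite grid and maximised there; the objective, kernel length-scale, and seed observations are all made up.

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf(A, B, ls=0.2):
    # Squared-exponential kernel on 1-D inputs.
    return np.exp(-(A[:, None] - B[None, :]) ** 2 / (2 * ls ** 2))

def posterior(Xo, yo, Xq, noise=1e-4):
    # Standard GP conditioning: mean vector and covariance matrix on Xq.
    K = rbf(Xo, Xo) + noise * np.eye(len(Xo))
    Ks = rbf(Xq, Xo)
    mu = Ks @ np.linalg.solve(K, yo)
    cov = rbf(Xq, Xq) - Ks @ np.linalg.solve(K, Ks.T)
    return mu, cov

def f(x):
    # Toy black-box (made up); its maximum is at x = 0.7.
    return -(x - 0.7) ** 2

grid = np.linspace(0.0, 1.0, 101)
Xo, yo = [0.0, 1.0], [f(0.0), f(1.0)]          # two seed observations

for t in range(20):
    mu, cov = posterior(np.array(Xo), np.array(yo), grid)   # 1) posterior GP
    g = rng.multivariate_normal(mu, cov + 1e-6 * np.eye(len(grid)))  # 2) sample
    x_t = grid[np.argmax(g)]                                # 3) maximise sample
    Xo.append(float(x_t))
    yo.append(f(x_t))                                       # 4) evaluate f

best = max(yo)   # typically close to f(0.7) = 0
```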

slide-24
SLIDE 24

Thompson Sampling (TS) in GPs

(Thompson, 1933)

[Animation over t = 1, 2, …, 25: the posterior GP, the sampled function, and the chosen point x_t; evaluations increasingly concentrate near the maximum of f]

6/31


slide-37
SLIDE 37

Some Theoretical Results for TS

Simple Regret: SR(n) = f(x⋆) − max_{t=1,…,n} f(x_t)

Theorem (Russo & van Roy 2014, Srinivas et al. 2010): For Thompson sampling, ignoring constants,

E[SR(n)] ≲ √( Ψ_n log(n) / n ).

Ψ_n ← Maximum Information Gain. When X ⊂ R^d:
  • SE (Gaussian) kernel: Ψ_n ≍ d^d log(n)^d
  • Matérn kernel: Ψ_n ≍ n^{1 − c/d²}

Several other results: (Agrawal et al. 2012, Kaufmann et al. 2012, Russo & van Roy 2016, Chowdhury & Gopalan 2017, and more…)

7/31
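For a GP, the information gained by observing f at a set A of points is I(f; y_A) = ½ log det(I + σ⁻² K_A), and Ψ_n is its maximum over sets of size n. A small sketch of the quantity itself; the kernel and noise level are arbitrary illustrative choices:

```python
import numpy as np

def rbf(A, B, ls=1.0):
    # Squared-exponential kernel for points given as rows.
    d2 = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2 * ls ** 2))

def information_gain(X, noise_var=0.01):
    # I(f; y_A) = 1/2 log det(I + sigma^-2 K_A): the mutual information
    # between f and noisy observations at the set A of points X.
    K = rbf(X, X)
    _, logdet = np.linalg.slogdet(np.eye(len(X)) + K / noise_var)
    return 0.5 * logdet

# Spread-out points are more informative than near-duplicate ones, which is
# why Psi_n (the maximum over n-point sets) reflects the domain's richness.
spread = np.array([[0.0], [2.0], [4.0]])
clustered = np.array([[0.0], [0.01], [0.02]])
```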


slide-45
SLIDE 45

Other methods for BO

Other criteria for selecting x_t:

◮ Upper Confidence Bounds (Srinivas et al. 2010): ϕ_t = µ_{t−1} + β_t^{1/2} σ_{t−1}
◮ Expected Improvement (Jones et al. 1998)
◮ Probability of Improvement (Kushner et al. 1964)
◮ Entropy Search (Hernández-Lobato et al. 2014)

All are deterministic methods: they choose the next point for evaluation by maximising a deterministic acquisition function, i.e. x_t = argmax_{x∈X} ϕ_t(x).

Other models for f: neural networks (Snoek et al. 2015), random forests (Hutter 2009).

8/31
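The GP-UCB rule above is one line once the posterior is available. A sketch with made-up posterior values:

```python
import numpy as np

def ucb(mu, sigma, beta=4.0):
    # phi_t(x) = mu_{t-1}(x) + beta_t^{1/2} sigma_{t-1}(x)
    return mu + np.sqrt(beta) * sigma

mu = np.array([0.5, 0.2, 0.4])        # posterior means at 3 candidates (made up)
sigma = np.array([0.05, 0.40, 0.10])  # posterior stds at the same candidates
phi = ucb(mu, sigma)
x_t = int(np.argmax(phi))  # the highly uncertain middle point wins here
```

Note how the bonus term trades off exploitation (high µ) against exploration (high σ).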


slide-49
SLIDE 49

Big picture: scaling up black-box optimisation

◮ Optimising in high-dimensional spaces, e.g. tuning models with several hyper-parameters. Additive models for f lead to statistically and computationally tractable algorithms. (Kandasamy et al. ICML 2015)

◮ Multi-fidelity optimisation: what if we have cheap approximations to f? E.g. train an ML model with N• data and T• iterations, but use N < N• data and T < T• iterations to approximate the cross-validation performance at (N•, T•). (Kandasamy et al. NIPS 2016a&b, Kandasamy et al. ICML 2017)

Extends beyond GPs.

9/31


slide-51
SLIDE 51

This work: Parallel Evaluations

(Kandasamy et al. Arxiv 2017)

Parallelisation with M workers: we can evaluate f at M different points at the same time, e.g. train M models with different hyper-parameter values in parallel. The inability to parallelise is a real bottleneck in practice!

Some desiderata:

◮ Statistically, achieve a ×M improvement.
◮ Methodologically, be scalable to a very large number of workers:
  • the method remains computationally tractable as M increases;
  • the method is conceptually simple, for robustness in practice.

10/31

slide-52
SLIDE 52

Outline

(Kandasamy et al. Arxiv 2017)

  • 1. Set up & definitions
  • 2. Prior work & challenges
  • 3. Algorithms synTS, asyTS: direct application of TS to synchronous and asynchronous parallel settings
  • 4. Experiments
  • 5. Theoretical Results
  • 6. Open questions/challenges

11/31

slide-53
SLIDE 53

Outline

(Kandasamy et al. Arxiv 2017)

  • 1. Set up & definitions
  • 2. Prior work & challenges
  • 3. Algorithms synTS, asyTS: direct application of TS to synchronous and asynchronous parallel settings
  • 4. Experiments
  • 5. Theoretical Results
    ◮ synTS and asyTS perform essentially the same as seqTS in terms of the number of evaluations.
    ◮ When we factor in time as a resource, asyTS outperforms synTS and seqTS … with some caveats.
  • 6. Open questions/challenges

11/31


slide-58
SLIDE 58

Parallel Evaluations: set up

Sequential evaluations with one worker: the jth job has feedback from all previous j − 1 jobs.

Parallel evaluations with M workers (asynchronous): the jth job is missing feedback from exactly M − 1 jobs.

Parallel evaluations with M workers (synchronous): the jth job is missing feedback from at most M − 1 jobs.

12/31


slide-60
SLIDE 60

Simple Regret in Parallel Settings

(Kandasamy et al. Arxiv 2017)

Simple regret after n evaluations: SR(n) = f(x⋆) − max_{t=1,…,n} f(x_t).

n ← number of completed evaluations by all M workers.

Simple regret with time as a resource (asynchronous or synchronous): SR′(T) = f(x⋆) − max_{t=1,…,N} f(x_t).

N ← (possibly random) number of completed evaluations by all M workers within time T.

13/31


slide-65
SLIDE 65

Prior work in Parallel BO

[Table comparing prior parallel-BO methods on asynchronicity, theoretical guarantees, and conceptual simplicity*: (Ginsbourger et al. 2011), (Janusevkis et al. 2012), (Contal et al. 2013), (Desautels et al. 2014), (Gonzalez et al. 2015), (Shah & Ghahramani 2015), (Wang et al. 2016), (Kathuria et al. 2016), (Wu & Frazier 2017), (Wang et al. 2017), (Kandasamy et al. Arxiv 2017)]

* A straightforward extension of the sequential algorithm works.

14/31


slide-70
SLIDE 70

Why are deterministic algorithms not “simple”? Need to encourage diversity in parallel evaluations

Direct application of GP-UCB in the synchronous setting:

  • First worker: maximise the acquisition, x_{t1} = argmax ϕ_t(x), where ϕ_t = µ_{t−1} + β_t^{1/2} σ_{t−1}.
  • Second worker: the acquisition is the same! x_{t2} = x_{t1}.
  • Hence x_{t1} = x_{t2} = · · · = x_{tM}.

Direct application of the sequential algorithm does not work. We need to “encourage diversity”.

15/31


slide-77
SLIDE 77

Why are deterministic algorithms not “simple”? Need to encourage diversity in parallel evaluations.

◮ Add hallucinated observations.
◮ Optimise an acquisition over X^M.
◮ Resort to heuristics; typically requires additional hyper-parameters and/or computational routines.

Take-home message: a straightforward application of the sequential algorithm works for TS. Its inherent randomness takes care of the exploration vs. exploitation trade-off when managing M workers.

16/31
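The contrast can be seen numerically: with a fixed posterior, a deterministic acquisition hands every idle worker the same point, while TS's random samples naturally spread out (here over repeated draws). A toy sketch on a three-point grid; all numbers are invented:

```python
import numpy as np

rng = np.random.default_rng(1)

mu = np.array([0.1, 0.5, 0.3])   # posterior mean on a 3-point grid (made up)
cov = 0.05 * np.eye(3)           # a (diagonal) posterior covariance
sigma = np.sqrt(np.diag(cov))

# Deterministic acquisition (UCB-style): every idle worker gets the SAME point.
phi = mu + 2.0 * sigma
picks_ucb = [int(np.argmax(phi)) for _ in range(4)]

# TS: each worker maximises its own random posterior sample, so picks vary.
picks_ts = [int(np.argmax(rng.multivariate_normal(mu, cov)))
            for _ in range(200)]
```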


slide-80
SLIDE 80

Parallel Thompson Sampling

(Kandasamy et al. Arxiv 2017)

Asynchronous: asyTS. At any given time:

  1. (x′, y′) ← wait for a worker to finish.
  2. Compute the posterior GP.
  3. Draw a sample g ∼ GP.
  4. Re-deploy the worker at argmax g.

Synchronous: synTS. At any given time:

  1. {(x′_m, y′_m)}_{m=1}^{M} ← wait for all workers to finish.
  2. Compute the posterior GP.
  3. Draw M samples g_m ∼ GP, ∀m.
  4. Re-deploy worker m at argmax g_m, ∀m.

Variants in prior work: (Osband et al. 2016, Israelsen et al. 2016, Hernandez-Lobato et al. 2017)

17/31
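The asyTS loop can be sketched as a discrete-event simulation: a heap of (finish-time, worker) events stands in for real workers, and a stub replaces steps 2-3 (the GP posterior and sample) to keep the sketch short. Everything here (`f`, `ts_next_point`, the timing model) is illustrative, not the paper's implementation.

```python
import heapq
import random

random.seed(0)

def f(x):
    # Toy black-box (made up); maximum at x = 0.7.
    return -(x - 0.7) ** 2

def ts_next_point(data):
    # Stub for steps 2-3: real asyTS fits the posterior GP to `data`,
    # draws a sample g, and returns argmax g. Here: perturb the incumbent.
    best_x = max(data, key=lambda xy: xy[1])[0]
    return min(1.0, max(0.0, best_x + random.gauss(0.0, 0.2)))

def asy_ts(M=4, budget=20):
    data, events = [], []
    for m in range(M):                      # initial dispatch of all M workers
        heapq.heappush(events, (random.uniform(1, 3), m, random.random()))
    while len(data) < budget:
        clock, m, x = heapq.heappop(events) # 1) wait for ONE worker to finish
        data.append((x, f(x)))              #    record (x', y')
        x_new = ts_next_point(data)         # 2-3) posterior + sample (stubbed)
        heapq.heappush(events, (clock + random.uniform(1, 3), m, x_new))  # 4)
    return data

data = asy_ts()
```

synTS differs only in popping all M events, refitting once, and pushing M fresh jobs.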


slide-82
SLIDE 82

Experiment: Park1-4D, M = 10

Comparison in terms of the number of evaluations.

[Plot: simple regret vs number of evaluations for seqTS, synTS, asyTS]

19/31


slide-85
SLIDE 85

Experiment: Branin-2D, M = 4

Evaluation time sampled from a uniform distribution.

[Plot: SR′(T) vs time for synRAND, synHUCB, synUCBPE, synTS, asyRAND, asyUCB, asyHUCB, asyEI, asyHTS, asyTS]

20/31

slide-86
SLIDE 86

Experiment: Hartmann-6D, M = 12

Evaluation time sampled from a half-normal distribution.

[Plot: SR′(T) vs time for synRAND, synHUCB, synUCBPE, synTS, asyRAND, asyUCB, asyHUCB, asyEI, asyHTS, asyTS]

21/31

slide-87
SLIDE 87

Experiment: Hartmann-18D, M = 25

Evaluation time sampled from an exponential distribution.

[Plot: SR′(T) vs time for synRAND, synHUCB, synUCBPE, synTS, asyRAND, asyUCB, asyHUCB, asyEI, asyHTS, asyTS]

22/31

slide-88
SLIDE 88

Experiment: Currin-Exponential-14D, M = 35

Evaluation time sampled from a Pareto-3 distribution.

[Plot: SR′(T) vs time for synRAND, synHUCB, synUCBPE, synTS, asyRAND, asyUCB, asyHUCB, asyEI, asyHTS, asyTS]

23/31

slide-89
SLIDE 89

Experiment: Model Selection in Cifar10, M = 4

Tune the number of filters (in the range (32, 256)) for each layer of a 6-layer CNN. Time taken per evaluation: 4-16 minutes.

[Plot: validation accuracy (about 0.68-0.72) vs wall-clock time for synTS, synHUCB, asyRAND, asyHUCB, asyEI, asyTS]

24/31



slide-92
SLIDE 92

Bounds for SR(n), synTS

seqTS (Russo & van Roy 2014):

E[SR(n)] ≲ √( Ψ_n log(n) / n ),   Ψ_n ← maximum information gain.

Theorem: synTS (Kandasamy et al. Arxiv 2017):

E[SR(n)] ≲ √( M log(M) / n ) + √( Ψ_{n+M} log(n+M) / n ).

The leading constant is also the same.

25/31


slide-94
SLIDE 94

Bounds for SR(n), asyTS

seqTS (Russo & van Roy 2014):

E[SR(n)] ≲ √( Ψ_n log(n) / n ).

Theorem: asyTS (Kandasamy et al. Arxiv 2017):

E[SR(n)] ≲ √( ξ_M Ψ_n log(n) / n ),   where ξ_M = sup_{D_n, n≥1} max_{A⊂X, |A|≤M} exp( I(f; A | D_n) ).

26/31


slide-97
SLIDE 97

Bounds for SR(n), asyTS

seqTS (Russo & van Roy 2014):

E[SR(n)] ≲ √( Ψ_n log(n) / n ).

Theorem: asyTS, arbitrary X (Kandasamy et al. Arxiv 2017):

E[SR(n)] ≲ M polylog(M) / n + √( C Ψ_n log(n) / n ),   where ξ_M = sup_{D_n, n≥1} max_{A⊂X, |A|≤M} exp( I(f; A | D_n) ).

Theorem (Krause et al. 2008, Desautels et al. 2012): There exists an asynchronously parallelisable initialisation scheme requiring O(M polylog(M)) evaluations of f such that ξ_M ≤ C.

* We do not believe this initialisation is necessary.

26/31


slide-99
SLIDE 99

Bounds for asyTS without the initialisation scheme

Theorem: synTS, arbitrary X (Kandasamy et al. Arxiv 2017):

E[SR(n)] ≲ √( M log(M) / n ) + √( Ψ_{n+M} log(n + M) / n ).

Theorem: asyTS, X ⊂ R^d (ongoing work):

E[SR(n)] ≲ … + √( M log(n) ) / n^{1/O(d)}.

27/31


slide-102
SLIDE 102

Theoretical Results for SR′(T)

Model evaluation time as an independent random variable:

◮ Uniform unif(a, b): bounded
◮ Half-normal HN(τ²): sub-Gaussian
◮ Exponential exp(λ): sub-exponential

Theorem (informal): If evaluation times are all the same, synTS ≈ asyTS. Otherwise, the bounds for asyTS are better than those for synTS; the more variable the evaluation times, the bigger the difference:

  • Uniform: constant factor
  • Half-normal: √log(M) factor
  • Exponential: log(M) factor

28/31
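The synchronous penalty is easy to see in simulation: a synchronous batch ends only when its slowest worker finishes, so within a time budget T the asynchronous scheme completes more evaluations. A sketch with i.i.d. exponential evaluation times (all parameters invented):

```python
import random

random.seed(0)

def completed(T=100.0, M=8, mode="async"):
    # Count evaluations finished by time T with M workers, when each
    # evaluation's duration is i.i.d. exponential(1) (illustrative choice).
    n = 0
    if mode == "async":
        # Each worker re-deploys the moment it finishes its own job.
        for _ in range(M):
            t = 0.0
            while True:
                t += random.expovariate(1.0)
                if t > T:
                    break
                n += 1
    else:
        # Synchronous: a batch of M ends only when its SLOWEST worker finishes.
        t = 0.0
        while True:
            t += max(random.expovariate(1.0) for _ in range(M))
            if t > T:
                break
            n += M
    return n

n_async = completed(mode="async")
n_sync = completed(mode="sync")   # noticeably fewer completed evaluations
```

The heavier the tail of the duration distribution, the larger the max-of-M batch time, matching the constant / √log(M) / log(M) ordering above.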



slide-105
SLIDE 105

Open Challenges for Parallelised TS

  • 1. Bounds for asynchronous TS without initialisation.
  • 2. Other models for evaluation times, e.g. evaluation time depends on x ∈ X.
  • 3. In the asynchronous setting:
    ◮ Should you wait for another job to finish without immediately re-deploying?
    ◮ Do you kill an on-going job depending on the result of a completed job?

29/31


slide-107
SLIDE 107

Open Challenges for Parallelised TS

  • 4. Optimising the sample when X = [0, 1]^d:

    x_t = argmax_{x∈X} g(x), where g ∼ posterior GP.

◮ Global optimisation of a non-convex function! … a common challenge in most BO methods. But additionally for TS:
  ◮ As g is not deterministic, draw samples at a fixed set of points and pick the maximum.
  ◮ Or, if using an adaptive method, it scales as O((N + S)³), where N ← # of evaluations of f and S ← # of evaluations of g.

30/31
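The fixed-set approach mentioned above, sketched in NumPy: draw g jointly at S grid points from the posterior (one S × S Cholesky, hence the cubic cost in S) and return the maximising grid point. The kernel, noise level, and data are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf(A, B, ls=0.2):
    # Squared-exponential kernel on 1-D inputs.
    return np.exp(-(A[:, None] - B[None, :]) ** 2 / (2 * ls ** 2))

def sample_max_on_grid(X_obs, y_obs, S=200, noise=1e-4):
    # Draw one posterior sample g jointly at S grid points and maximise it.
    # The S x S Cholesky makes each draw O(S^3) on top of the O(N^3) posterior.
    grid = np.linspace(0.0, 1.0, S)
    K = rbf(X_obs, X_obs) + noise * np.eye(len(X_obs))
    Ks = rbf(grid, X_obs)
    mu = Ks @ np.linalg.solve(K, y_obs)
    cov = rbf(grid, grid) - Ks @ np.linalg.solve(K, Ks.T)
    L = np.linalg.cholesky(cov + 1e-6 * np.eye(S))   # jitter for stability
    g = mu + L @ rng.standard_normal(S)              # one sample path of g
    return float(grid[np.argmax(g)])

x_t = sample_max_on_grid(np.array([0.1, 0.9]), np.array([0.0, 1.0]))
```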


slide-110
SLIDE 110

Summary

◮ synTS, asyTS: direct application of TS to the synchronous and asynchronous parallel settings.

◮ Take-aways (theory):
  • Both perform essentially the same as seqTS in terms of the number of evaluations.
  • When we factor in time as a resource, asyTS performs best.

◮ Take-aways (practice):
  • Conceptually simple, and scales better with the number of workers than other methods.

31/31

slide-111
SLIDE 111

Akshay Krishnamurthy, Jeff Schneider, Barnabás Póczos

Code: github.com/kirthevasank/gp-parallel-ts
Slides: www.cs.cmu.edu/~kkandasa/talks/google-ts-slides.pdf

Thank you.