SLIDE 1

Scale-free adaptive planning for deterministic dynamics & discounted rewards

Peter Bartlett, Victor Gabillon, Jennifer Healey, Michal Valko

ICML - June 13th, 2019

1/11

SLIDE 2

An MCTS setting

MDP with starting state x0 ∈ X, action space A, and a budget of n interactions. At time t, playing at in xt leads to:

  • Deterministic dynamics g: xt+1 = g(xt, at)
  • Reward observation: r(xt, at) + εt, with εt being the noise

Objective: recommend an action a(n) that minimizes the simple regret

  rn = max_{a ∈ A} Q⋆(x0, a) − Q⋆(x0, a(n)),

where Q⋆(x, a) = r(x, a) + sup_π Σ_{t≥1} γ^t r(xt, π(xt)).

Assumption: rt ∈ [0, Rmax] and |εt| ≤ b.
Approach: explore without knowing the parameters Rmax and b.

2/11
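To make the setting concrete, here is a minimal Python sketch of the interaction protocol and the simple regret. Everything here (the toy dynamics g, the reward shape, the action space, GAMMA and B) is an illustrative assumption, not taken from the paper.

import random

GAMMA = 0.9         # discount factor (assumed value)
B = 0.1             # noise bound: |eps_t| <= B (assumed)
ACTIONS = range(3)  # toy action space A (assumed)

def g(x, a):
    # Deterministic dynamics: x_{t+1} = g(x_t, a_t); toy example.
    return (31 * x + a) % 97

def r(x, a):
    # Mean reward r(x, a), here in [0, Rmax] with Rmax = 1.
    return ((x + a) % 10) / 10.0

def observe(x, a):
    # What the planner actually sees: r(x, a) + eps_t, bounded noise.
    return r(x, a) + random.uniform(-B, B)

def q_star(x, a, depth=8):
    # Q*(x, a) = r(x, a) + sup_pi sum_{t>=1} gamma^t r(x_t, pi(x_t)),
    # approximated by exhaustive search truncated at a finite depth.
    if depth == 0:
        return r(x, a)
    y = g(x, a)
    return r(x, a) + GAMMA * max(q_star(y, a2, depth - 1) for a2 in ACTIONS)

def simple_regret(x0, a_n):
    # r_n = max_a Q*(x0, a) - Q*(x0, a(n))
    return max(q_star(x0, a) for a in ACTIONS) - q_star(x0, a_n)

print(simple_regret(x0=1, a_n=0))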

SLIDE 5

OLOP (Bubeck and Munos, 2010)

OLOP implements Optimistic Planning using an Upper Confidence Bound (UCB) on the Q value of a sequence of q actions a1, . . . , aq:

  QUCB_t(a1:q) = Σ_{h=1}^{q} γ^h ( r̂h(t) + b·√(1/T_{ah}(t)) ) + Rmax·γ^{q+1}/(1 − γ)

  • first term: optimistic estimate of the observed rewards
  • second term: bound on the unseen rewards beyond depth q

In optimization under a fixed budget n, good strategies should allocate samples to actions without knowing Rmax or b.

3/11
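As a reading aid, here is a sketch of computing this bound for a single action sequence; the exact shape of the confidence width is an assumption on my part, and all names are illustrative.

import math

def q_ucb(gamma, r_hat, counts, r_max, b):
    # r_hat[h-1]  : empirical mean reward observed at depth h
    # counts[h-1] : number of samples T_{a_h}(t) of the depth-h action
    q = len(r_hat)
    observed = sum(                     # optimistic value of observed rewards
        gamma ** h * (r_hat[h - 1] + b * math.sqrt(1.0 / counts[h - 1]))
        for h in range(1, q + 1)
    )
    unseen = r_max * gamma ** (q + 1) / (1.0 - gamma)  # bound on unseen rewards
    return observed + unseen

print(q_ucb(gamma=0.9, r_hat=[0.5, 0.7, 0.2], counts=[4, 2, 1], r_max=1.0, b=0.1))

Note that both Rmax and b enter the bound, which is exactly the knowledge the scale-free approach wants to avoid.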


SLIDE 8

Tree Search

[Figure: search tree over depths h = 0, …, 5, rooted at x0, with nodes x2, …, x7 and edge rewards r02, r03, r04, r35, r56; the path to leaf x6 is highlighted.]

Q(x6) = r03 + γ·r35 + γ²·r56

This is zero-order optimization!

4/11
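The evaluation on this slide is just a discounted sum along one root-to-leaf path, so each path yields a single number, which is what makes it a zero-order (black-box) objective. A one-line sketch with illustrative reward values:

def path_value(edge_rewards, gamma):
    # Discounted sum of rewards along a root-to-leaf path.
    return sum(gamma ** h * r for h, r in enumerate(edge_rewards))

r03, r35, r56 = 0.8, 0.5, 0.3  # illustrative values
print(path_value([r03, r35, r56], gamma=0.9))  # = r03 + 0.9*r35 + 0.81*r56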

SLIDE 12

Black-box optimization: use the partitioning to explore f (uniformly)

[Figure: a function f over its domain, with the hierarchical partition refined uniformly over depths h = 0, 1, 2.]

5/11

SLIDE 13

Zipf exploration: open the best ⌊n/h⌋ cells at depth h

[Figure: partition tree in which depth h receives n/h openings, so the number of opened cells shrinks with depth.]

6/11
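The opening schedule itself is one line of code. A sketch (names illustrative): depth h gets n/h openings, so the budget per depth decays like a Zipf law:

def zipf_schedule(n, max_depth):
    # Number of cells opened at each depth h = 1..max_depth.
    return {h: n // h for h in range(1, max_depth + 1)}

print(zipf_schedule(n=100, max_depth=5))
# -> {1: 100, 2: 50, 3: 33, 4: 25, 5: 20}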

SLIDE 14

Noisy case

  • need to pull each x more times to limit the uncertainty
  • tradeoff: the more you pull each x, the shallower you can explore

7/11

SLIDE 15

Noisy case: StroquOOL (Bartlett et al. 2019)

At depth h:

  • order the cells by decreasing value, and
  • open the i-th best cell with m = ⌊n/(h·i)⌋ estimations (sketched below)

8/11
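A sketch of this allocation at one depth (illustrative names, not the full StroquOOL algorithm): both the depth h and the rank i discount the per-cell budget, so the best-ranked cells are estimated most accurately:

def rank_allocation(cell_values, n, h):
    # The i-th best cell at depth h gets floor(n / (h * i)) pulls.
    ranked = sorted(cell_values, key=cell_values.get, reverse=True)
    return [(cell, n // (h * i)) for i, cell in enumerate(ranked, start=1)]

print(rank_allocation({"A": 0.9, "B": 0.4, "C": 0.7}, n=120, h=2))
# -> [('A', 60), ('C', 30), ('B', 20)]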

SLIDE 19

Black-box optimization vs planning: Reuse of samples and γ

[Figure: side-by-side trees, Optimization (left) and Planning (right), over depths h = 0, …, 5, rooted at x0 with nodes x2, …, x7; successive rollouts observe noisy edge rewards r'1, …, r'4 along the traversed path.]

How many samples near the root? K^H samples near the root.

Lower regret for planning! (Bubeck & Munos 2010)

9/11
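The reuse argument can be made concrete: in planning, one rollout of an action sequence reveals a reward for every prefix of that sequence, so statistics near the root accumulate from every rollout passing through them. A minimal sketch (illustrative code):

from collections import defaultdict

prefix_counts = defaultdict(int)

def record_rollout(actions):
    # One rollout updates the statistics of all of its prefixes.
    for depth in range(1, len(actions) + 1):
        prefix_counts[tuple(actions[:depth])] += 1

for seq in ([0, 1, 1], [0, 1, 0], [0, 0, 1]):
    record_rollout(seq)

print(prefix_counts[(0,)])    # 3: the root action was observed by every rollout
print(prefix_counts[(0, 1)])  # 2: shared by two rollouts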

SLIDE 21

Black-box optimization vs. planning: Reuse samples and take advantage of γ

[Figure: two panels. Left, "Uniform exploration" (not sharing information): the search tree over depths h = 0, …, 5 rooted at x0, where the edge-reward sample r04 is repeated for every sequence that uses it. Right, "Zipf exploration" (sharing information): the partition tree opening n/h cells at depth h.]

Bubeck & Munos: only for uniform strategies… We figured out the amount of samples needed!

10/11

SLIDE 22

PlaTγPOOS

The power of PlaTγPOOS

  • implements Zipf exploration (as in StroquOOL) for MCTS,
  • explicitly pulls an action at depth h + 1 γ times less than an action at depth h, since
    Q⋆(x, a) = r(x, a) + sup_π Σ_{t≥1} γ^t r(xt, π(xt)) discounts a depth-t reward by γ^t (sketched below),
  • does not use UCB, and needs neither Rmax nor b,
  • improves over OLOP, with adaptation to low noise and additional unknown smoothness,
  • gets exponential speedups when no noise is present!

11/11
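A sketch of the γ-scaled pull schedule from the second bullet (illustrative code, not the full PlaTγPOOS algorithm): because a depth-h reward is discounted by γ^h, depth h + 1 can afford γ times fewer pulls than depth h:

def gamma_scaled_pulls(n, gamma, max_depth):
    # Depth h + 1 gets gamma times the pulls of depth h.
    return {h: max(1, int(n * gamma ** (h - 1))) for h in range(1, max_depth + 1)}

print(gamma_scaled_pulls(n=100, gamma=0.5, max_depth=5))
# -> {1: 100, 2: 50, 3: 25, 4: 12, 5: 6}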