SLIDE 1

Scale-free adaptive planning for deterministic dynamics & discounted rewards

Peter Bartlett, Victor Gabillon, Jennifer Healey, Michal Valko

ICML - June 13th, 2019

1/11

SLIDE 2

An MCTS setting

MDP with starting state x0 ∈ X, action space A, and a budget of n interactions. At time t, playing at in xt leads to:

  • Deterministic dynamics g: xt+1 = g(xt, at)
  • Reward observation: r(xt, at) + εt, with εt being the noise

Objective: recommend an action a(n) that minimizes the simple regret

  rn = max_{a ∈ A} Q⋆(x0, a) − Q⋆(x0, a(n)),

where Q⋆(x, a) = r(x, a) + sup_π Σ_{t≥1} γ^t r(xt, π(xt)).

Assumption: rt ∈ [0, Rmax] and |εt| ≤ b.
Approach: explore without knowing the parameters Rmax and b.

2/11
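To make the setting concrete, here is a minimal Python sketch of the interaction protocol and the simple regret. Everything here (the toy dynamics g, the reward shape, the action space, GAMMA and B) is an illustrative assumption, not taken from the paper.

import random

GAMMA = 0.9         # discount factor (assumed value)
B = 0.1             # noise bound: |eps_t| <= B (assumed)
ACTIONS = range(3)  # toy action space A (assumed)

def g(x, a):
    # Deterministic dynamics: x_{t+1} = g(x_t, a_t); toy example.
    return (31 * x + a) % 97

def r(x, a):
    # Mean reward r(x, a), here in [0, Rmax] with Rmax = 1.
    return ((x + a) % 10) / 10.0

def observe(x, a):
    # What the planner actually sees: r(x, a) + eps_t, bounded noise.
    return r(x, a) + random.uniform(-B, B)

def q_star(x, a, depth=8):
    # Q*(x, a) = r(x, a) + sup_pi sum_{t>=1} gamma^t r(x_t, pi(x_t)),
    # approximated by exhaustive search truncated at a finite depth.
    if depth == 0:
        return r(x, a)
    y = g(x, a)
    return r(x, a) + GAMMA * max(q_star(y, a2, depth - 1) for a2 in ACTIONS)

def simple_regret(x0, a_n):
    # r_n = max_a Q*(x0, a) - Q*(x0, a(n))
    return max(q_star(x0, a) for a in ACTIONS) - q_star(x0, a_n)

print(simple_regret(x0=1, a_n=0))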

SLIDE 5

OLOP (Bubeck and Munos, 2010)

OLOP implements Optimistic Planning using an Upper Confidence Bound (UCB) on the Q value of a sequence of q actions a1, . . . , aq:

  QUCB_t(a1:q) = Σ_{h=1}^{q} γ^h ( r̂h(t) + b·√(1/T_{ah}(t)) ) + Rmax·γ^{q+1}/(1 − γ)

  • first term: optimistic estimate of the observed rewards
  • second term: bound on the unseen rewards beyond depth q

In optimization under a fixed budget n, good strategies should allocate samples to actions without knowing Rmax or b.

3/11
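As a reading aid, here is a sketch of computing this bound for a single action sequence; the exact shape of the confidence width is an assumption on my part, and all names are illustrative.

import math

def q_ucb(gamma, r_hat, counts, r_max, b):
    # r_hat[h-1]  : empirical mean reward observed at depth h
    # counts[h-1] : number of samples T_{a_h}(t) of the depth-h action
    q = len(r_hat)
    observed = sum(                     # optimistic value of observed rewards
        gamma ** h * (r_hat[h - 1] + b * math.sqrt(1.0 / counts[h - 1]))
        for h in range(1, q + 1)
    )
    unseen = r_max * gamma ** (q + 1) / (1.0 - gamma)  # bound on unseen rewards
    return observed + unseen

print(q_ucb(gamma=0.9, r_hat=[0.5, 0.7, 0.2], counts=[4, 2, 1], r_max=1.0, b=0.1))

Note that both Rmax and b enter the bound, which is exactly the knowledge the scale-free approach wants to avoid.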


SLIDE 8

Tree Search

[Figure: search tree over depths h = 0, …, 5, rooted at x0, with nodes x2, …, x7 and edge rewards r02, r03, r04, r35, r56; the path to leaf x6 is highlighted.]

Q(x6) = r03 + γ·r35 + γ²·r56

This is zero-order optimization!

4/11
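The evaluation on this slide is just a discounted sum along one root-to-leaf path, so each path yields a single number, which is what makes it a zero-order (black-box) objective. A one-line sketch with illustrative reward values:

def path_value(edge_rewards, gamma):
    # Discounted sum of rewards along a root-to-leaf path.
    return sum(gamma ** h * r for h, r in enumerate(edge_rewards))

r03, r35, r56 = 0.8, 0.5, 0.3  # illustrative values
print(path_value([r03, r35, r56], gamma=0.9))  # = r03 + 0.9*r35 + 0.81*r56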

SLIDE 12

Black-box optimization: use the partitioning to explore f (uniformly)

[Figure: a function f over its domain, with the hierarchical partition refined uniformly over depths h = 0, 1, 2.]

5/11

SLIDE 13

Zipf exploration: open the best ⌊n/h⌋ cells at depth h

[Figure: partition tree in which depth h receives n/h openings, so the number of opened cells shrinks with depth.]

6/11
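The opening schedule itself is one line of code. A sketch (names illustrative): depth h gets n/h openings, so the budget per depth decays like a Zipf law:

def zipf_schedule(n, max_depth):
    # Number of cells opened at each depth h = 1..max_depth.
    return {h: n // h for h in range(1, max_depth + 1)}

print(zipf_schedule(n=100, max_depth=5))
# -> {1: 100, 2: 50, 3: 33, 4: 25, 5: 20}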

SLIDE 14

Noisy case

  • need to pull each x more times to limit the uncertainty
  • tradeoff: the more you pull each x, the shallower you can explore

7/11

SLIDE 15

Noisy case: StroquOOL (Bartlett et al. 2019)

At depth h:

  • order the cells by decreasing value, and
  • open the i-th best cell with m = ⌊n/(h·i)⌋ estimations (sketched below)

8/11
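A sketch of this allocation at one depth (illustrative names, not the full StroquOOL algorithm): both the depth h and the rank i discount the per-cell budget, so the best-ranked cells are estimated most accurately:

def rank_allocation(cell_values, n, h):
    # The i-th best cell at depth h gets floor(n / (h * i)) pulls.
    ranked = sorted(cell_values, key=cell_values.get, reverse=True)
    return [(cell, n // (h * i)) for i, cell in enumerate(ranked, start=1)]

print(rank_allocation({"A": 0.9, "B": 0.4, "C": 0.7}, n=120, h=2))
# -> [('A', 60), ('C', 30), ('B', 20)]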

SLIDE 19

Black-box optimization vs planning: Reuse of samples and γ

[Figure: side-by-side trees, Optimization (left) and Planning (right), over depths h = 0, …, 5, rooted at x0 with nodes x2, …, x7; successive rollouts observe noisy edge rewards r'1, …, r'4 along the traversed path.]

How many samples near the root? K^H samples near the root.

Lower regret for planning! (Bubeck & Munos 2010)

9/11
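The reuse argument can be made concrete: in planning, one rollout of an action sequence reveals a reward for every prefix of that sequence, so statistics near the root accumulate from every rollout passing through them. A minimal sketch (illustrative code):

from collections import defaultdict

prefix_counts = defaultdict(int)

def record_rollout(actions):
    # One rollout updates the statistics of all of its prefixes.
    for depth in range(1, len(actions) + 1):
        prefix_counts[tuple(actions[:depth])] += 1

for seq in ([0, 1, 1], [0, 1, 0], [0, 0, 1]):
    record_rollout(seq)

print(prefix_counts[(0,)])    # 3: the root action was observed by every rollout
print(prefix_counts[(0, 1)])  # 2: shared by two rollouts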

SLIDE 21

Black-box optimization vs. planning: Reuse samples and take advantage of γ

[Figure: two panels. Left, "Uniform exploration" (not sharing information): the search tree over depths h = 0, …, 5 rooted at x0, where the edge-reward sample r04 is repeated for every sequence that uses it. Right, "Zipf exploration" (sharing information): the partition tree opening n/h cells at depth h.]

Bubeck & Munos: only for uniform strategies… We figured out the amount of samples needed!

10/11

SLIDE 22

PlaTγPOOS

The power of PlaTγPOOS

  • implements Zipf exploration (as in StroquOOL) for MCTS,
  • explicitly pulls an action at depth h + 1 γ times less than an action at depth h, since
    Q⋆(x, a) = r(x, a) + sup_π Σ_{t≥1} γ^t r(xt, π(xt)) discounts a depth-t reward by γ^t (sketched below),
  • does not use UCB, and needs neither Rmax nor b,
  • improves over OLOP, with adaptation to low noise and additional unknown smoothness,
  • gets exponential speedups when no noise is present!

11/11
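A sketch of the γ-scaled pull schedule from the second bullet (illustrative code, not the full PlaTγPOOS algorithm): because a depth-h reward is discounted by γ^h, depth h + 1 can afford γ times fewer pulls than depth h:

def gamma_scaled_pulls(n, gamma, max_depth):
    # Depth h + 1 gets gamma times the pulls of depth h.
    return {h: max(1, int(n * gamma ** (h - 1))) for h in range(1, max_depth + 1)}

print(gamma_scaled_pulls(n=100, gamma=0.5, max_depth=5))
# -> {1: 100, 2: 50, 3: 25, 4: 12, 5: 6}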