SLIDE 1

Lille, April 2019

Safe Reinforcement Learning for Decision-Making in Autonomous Driving

Edouard Leurent, Odalric-Ambrym Maillard, Denis Efimov, Wilfrid Perruquetti, Yann Blanco
SequeL, Inria Lille – Nord Europe · Valse, Inria Lille – Nord Europe · Renault Group

SLIDE 2

Motivation

Classic Autonomous Driving Pipeline

SLIDE 3

Motivation

Classic Autonomous Driving Pipeline

In practice,

◮ The behavioural layer is a hand-crafted rule-based system (e.g. an FSM).

SLIDE 4

Motivation

Classic Autonomous Driving Pipeline

In practice,

◮ The behavioural layer is a hand-crafted rule-based system (e.g. an FSM).
◮ It won't scale to complex scenes, nor handle negotiation and aggressiveness.

SLIDE 5

Reinforcement Learning: why?

Search for an optimal policy $\pi(a|s)$:

$$\max_\pi\ \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\right], \qquad a_t \sim \pi(s_t),\quad s_{t+1} \sim T(s_t, a_t)$$

the maximised quantity being the policy return $R^T_\pi$.

The dynamics $T(s_{t+1}\,|\,s_t, a_t)$ are unknown: the agent learns by interacting with the environment. Challenges:

◮ exploration–exploitation
◮ credit assignment
◮ partial observability
◮ safety

SLIDE 6

Reinforcement Learning: how?

Model-free

1. Directly optimise $\pi(a|s)$ through policy evaluation and policy improvement.

SLIDE 7

Reinforcement Learning: how?

Model-free

1. Directly optimise $\pi(a|s)$ through policy evaluation and policy improvement.

Model-based

1. Learn a model of the dynamics $\hat{T}(s_{t+1}|s_t, a_t)$,
2. (Planning) Leverage it to compute
$$\max_\pi\ \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\right], \qquad a_t \sim \pi(s_t),\quad s_{t+1} \sim \hat{T}(s_t, a_t)$$

+ Better sample efficiency, interpretability, priors.

SLIDE 8

A first benchmark

The highway-env environment

◮ Vehicle kinematics: Kinematic Bicycle Model
◮ Low-level longitudinal and lateral controllers
◮ Behavioural models: IDM and MOBIL
◮ Graphical road network and route planning

A few baseline agents — Setup

◮ Model-free: DQN
◮ Model-based (planning): Value Iteration and MCTS
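To make the setup concrete, here is a minimal rollout sketch, assuming the highway-env package (https://github.com/eleurent/highway-env) and its gymnasium interface; the random policy is only a stand-in for the DQN / Value Iteration / MCTS agents of the benchmark.

```python
# A minimal sketch of one episode in highway-env (assumes the highway-env
# package, which registers the "highway-v0" environment on import).
import gymnasium as gym
import highway_env  # noqa: F401 -- registers "highway-v0"

env = gym.make("highway-v0")
obs, info = env.reset(seed=0)

done = truncated = False
total_reward = 0.0
while not (done or truncated):
    action = env.action_space.sample()  # stand-in for a learned/planned policy
    obs, reward, done, truncated, info = env.step(action)
    total_reward += reward
print(f"Episode return: {total_reward:.2f}")
```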

SLIDE 9

A first benchmark — Results

[Figure: histograms of rewards and of episode lengths for the VI, DQN and MCTS agents.]

Videos available on

SLIDE 10

The safety / performance trade-off

Let us look at the performances of DQN:

SLIDE 11

The safety / performance trade-off

Let us look at the performances of DQN:

Uncertainty and risk

◮ High return variance, many collisions
◮ In RL, we only maximise the return in expectation

SLIDE 12

The safety / performance trade-off

Let us look at the performances of DQN:

Uncertainty and risk

◮ High return variance, many collisions
◮ In RL, we only maximise the return in expectation

Conflicting objectives

◮ Reward: $r_t = \omega_v \cdot \text{velocity} - \omega_c \cdot \text{collision}$
◮ We only control the return $R^T_\pi = \sum_t \gamma^t r_t$.
◮ For any fixed $\omega$, there can be many optimal policies with different velocity/collision ratios → the Pareto-optimal curve

SLIDE 13

A first formalisation of risk

Constrained Reinforcement Learning

◮ Augment the MDP with a cost function $c : S \times A \times S \to \mathbb{R}$, a cost discount $\gamma_c$, and a budget $\beta$.
◮ Optimise the reward while keeping the cost under the budget:

$$\max_\pi\ \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r_t\right] \quad \text{s.t.} \quad \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma_c^t c_t\right] \leq \beta$$

Budgeted Reinforcement Learning

Find a single budget-dependent policy π(a|s, β) that solves all the corresponding CMDPs

SLIDE 14

A BMDP algorithm

Lagrangian Relaxation

Consider the dual problem and replace the hard constraint by a soft constraint penalised by a Lagrangian multiplier $\lambda$:

$$\max_\pi\ \mathbb{E}\left[\sum_t \gamma^t r_t - \lambda \gamma_c^t c_t\right]$$

◮ Train many policies $\pi_k$ with penalties $\lambda_k$ and recover the corresponding cost budgets $\beta_k$
◮ Very data- and memory-heavy
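As a minimal sketch of this relaxation (assuming a gymnasium-style environment that reports its cost signal in the info dict; the "cost" key is an assumption, not part of the slides), the hard constraint becomes a reward penalty:

```python
# Sketch: Lagrangian relaxation as a reward-shaping wrapper.
import gymnasium as gym

class LagrangianPenalty(gym.Wrapper):
    def __init__(self, env, lam: float):
        super().__init__(env)
        self.lam = lam  # the Lagrangian multiplier lambda_k

    def step(self, action):
        obs, reward, done, truncated, info = self.env.step(action)
        # Soft constraint: penalise the cost instead of enforcing the budget.
        shaped_reward = reward - self.lam * info.get("cost", 0.0)
        return obs, shaped_reward, done, truncated, info
```

Training one agent per multiplier $\lambda_k$ is what makes this approach data- and memory-heavy.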

SLIDE 15

Our BMDP algorithm

Budgeted Fitted-Q [Carrara et al. 2019]

A model-free, value-based, fixed-point iteration procedure.

$$Q^r_{n+1}(s_i, a_i, \beta_i) \xleftarrow{\text{regression}} r'_i + \gamma \sum_{a' \in A} \pi^n_A(s'_i, a', \beta_i)\, Q^r_n\big(s'_i, a', \pi^n_B(s'_i, a', \beta_i)\big)$$

$$Q^c_{n+1}(s_i, a_i, \beta_i) \xleftarrow{\text{regression}} c'_i + \gamma_c \sum_{a' \in A} \pi^n_A(s'_i, a', \beta_i)\, Q^c_n\big(s'_i, a', \pi^n_B(s'_i, a', \beta_i)\big)$$

$$(\pi^n_A, \pi^n_B) \leftarrow \arg\max_{(\pi_A, \pi_B) \in \Psi_n} \sum_{a \in A} \pi_A(s, a, \beta)\, Q^r_n\big(s, a, \pi_B(s, a, \beta)\big)$$

$$\Psi_n = \left\{ \pi_A \in \mathcal{M}(A)^{S \times \mathbb{R}},\ \pi_B \in \mathbb{R}^{S \times A \times \mathbb{R}} \;\middle|\; \forall s \in S,\ \forall \beta \in \mathbb{R},\ \sum_{a \in A} \pi_A(s, a, \beta)\, Q^c_n\big(s, a, \pi_B(s, a, \beta)\big) \leq \beta \right\}$$

SLIDE 16

From dynamic programming to RL

Continuous Reinforcement Learning

1. Risk-sensitive exploration
2. Scalable function approximation
3. Parallel computing of the targets and experiences

SLIDE 17

Risk-sensitive exploration

Algorithm 1: Risk-sensitive exploration

1   Initialise an empty batch D.
2   for each intermediate batch do
3       for each episode in the batch do
4           Sample initial budget β ∼ U(B).
5           while episode not done do
6               Update ε from schedule.
7               Sample z ∼ U([0, 1]).
8               if z > ε then
9                   Sample (a, β′) from (πA, πB).    // Exploit
10              else
11                  Sample (a, β′) from U(ΔAB).      // Explore
12              Append transition (s, β, a, r′, c′, s′) to batch D.
13              Update episode budget β ← β′.
14      (πA, πB) ← BFTQ(D).
15  return the batch of transitions D

See example on
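One step of this exploration rule (lines 6–11 of Algorithm 1) could be sketched as follows, where pi_A and pi_B are assumed callables for the learned action and budget policies, and budgets are assumed normalised to [0, 1]:

```python
# Sketch: one epsilon-greedy step over action-budget pairs.
import random

def explore_step(state, beta, epsilon, pi_A, pi_B, actions):
    if random.random() > epsilon:          # exploit (pi_A, pi_B)
        action = pi_A(state, beta)
        next_beta = pi_B(state, action, beta)
    else:                                  # explore: uniform over Delta_AB
        action = random.choice(actions)
        next_beta = random.uniform(0.0, 1.0)
    return action, next_beta
```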

SLIDE 18

Function approximation

[Figure: Neural network for Q-function approximation when the state dimension is 2 and there are 2 actions. The input (s, β) = (s0, s1, β) passes through an encoder and two hidden layers to produce the outputs Qr(a0), Qr(a1), Qc(a0), Qc(a1).]
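A sketch of this architecture in PyTorch (an assumed implementation matching the figure; the hidden sizes are arbitrary choices):

```python
# Budgeted Q-network: encode (s, beta), pass through two hidden layers,
# then one Qr head and one Qc head, each with one output per action.
import torch
import torch.nn as nn

class BudgetedQNetwork(nn.Module):
    def __init__(self, state_dim: int = 2, n_actions: int = 2, hidden: int = 64):
        super().__init__()
        self.encoder = nn.Linear(state_dim + 1, hidden)  # input: (s, beta)
        self.body = nn.Sequential(nn.ReLU(), nn.Linear(hidden, hidden), nn.ReLU())
        self.q_reward = nn.Linear(hidden, n_actions)  # Qr(s, beta, .)
        self.q_cost = nn.Linear(hidden, n_actions)    # Qc(s, beta, .)

    def forward(self, state: torch.Tensor, beta: torch.Tensor):
        # state: (..., state_dim), beta: (..., 1)
        x = self.body(self.encoder(torch.cat([state, beta], dim=-1)))
        return self.q_reward(x), self.q_cost(x)
```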

SLIDE 19

Parallel computing of the targets

Algorithm 2: BFTQ

1   In: D, B, γc, γr, fit_r, fit_c (regression algorithms)
2   Out: Qr, Qc
3   X = {(si, ai, βi)}, i ∈ [0, |D|]
4   Initialise Qr = Qc = (s, a, β) ↦ 0
5   repeat
6       Y r, Y c = compute_targets(D, Qr, Qc, B, γc, γr)
7       Qr, Qc = fit_r(X, Y r), fit_c(X, Y c)
8   until convergence or timeout

Algorithm 3: compute_targets (parallel)

1   Qr, Qc = Q(D)                          // perform a single forward pass
2   Split D among the workers: D = ∪_{w∈W} Dw
    // Run the following loop on each worker in parallel
3   for w ∈ W do
4       (Y c_w, Y r_w) ← compute_targets(Dw, Qr, Qc, B, γc, γr)
5   Join the results: Y c = ∪_{w∈W} Y c_w and Y r = ∪_{w∈W} Y r_w
6   return (Y c, Y r)
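The worker split of Algorithm 3 could be sketched as follows, assuming compute_targets is a pure function of a chunk of transitions and the current Q-functions:

```python
# Sketch: split the batch D among workers and join the targets.
from concurrent.futures import ProcessPoolExecutor

def compute_all_targets(batch, q_r, q_c, compute_targets, workers=4):
    chunks = [batch[i::workers] for i in range(workers)]  # D = union of D_w
    with ProcessPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(compute_targets, c, q_r, q_c) for c in chunks]
        results = [f.result() for f in futures]
    y_r = [t for yr, _ in results for t in yr]  # join the Y^r targets
    y_c = [t for _, yc in results for t in yc]  # join the Y^c targets
    return y_r, y_c
```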

SLIDE 20

Experiments

Video available on

[Figure: rewards and costs of the learned budgeted policies, grouped by cost budget (β ∈ [0.01, 0.09], β ∈ [0.11, 0.19], β ∈ [0.21, 0.29], β ∈ [0.31, 1.00]) and compared with Lagrangian baselines (λ ∈ {15, 20}).]

SLIDE 21

Looking back

[Figure: histograms of rewards and of episode lengths for the VI, DQN and MCTS agents.]

Compared to DQN, MCTS performed very well in terms of safety; VI, much less so.

SLIDE 22

Model bias

Model-free

1. Directly optimise $\pi(a|s)$ through policy evaluation and policy improvement.

Model-based

1. Learn a model of the dynamics $\hat{T}(s_{t+1}|s_t, a_t)$,
2. (Planning) Leverage it to compute
$$\max_\pi\ \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\right], \qquad a_t \sim \pi(s_t),\quad s_{t+1} \sim \hat{T}(s_t, a_t)$$

+ Better sample efficiency, interpretability, priors.

SLIDE 23

Model bias

Model-free

1. Directly optimise $\pi(a|s)$ through policy evaluation and policy improvement.

Model-based

1. Learn a model of the dynamics $\hat{T}(s_{t+1}|s_t, a_t)$,
2. (Planning) Leverage it to compute
$$\max_\pi\ \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\right], \qquad a_t \sim \pi(s_t),\quad s_{t+1} \sim \hat{T}(s_t, a_t)$$

+ Better sample efficiency, interpretability, priors.
− Model bias: $T \neq \hat{T}$; see example at

SLIDE 24

Robust RL — How to handle model uncertainty?

◮ Build a confidence region $\mathcal{T}$ around the estimated dynamics $\hat{T}$:
$$\forall T' \in \mathcal{T}, \quad P(\|T - T'\| > \varepsilon) < \delta$$
◮ Plan robustly with respect to this ambiguity:
$$\max_\pi\ \underbrace{\min_{T \in \mathcal{T}} \sum_{t=0}^{\infty} \gamma^t r_t}_{v^r(\pi)}$$
◮ How to optimise this objective?
    ◮ Linear system: H∞ control, robust LQ
    ◮ Finite state-space: Robust Dynamic Programming
    ◮ Non-linear continuous system: ?

SLIDE 25

Discrete Ambiguity Set [Leurent, Blanco, et al. 2018]

Assumption (Discrete structure)

T = {T1, · · · , Tm}

◮ Optimistic evaluation of paths at the leaves, for all dynamics
◮ Worst-case aggregation over the M dynamics: $\min_m$
◮ Optimal planning of action sequences: $\max_a$
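As a sketch, the worst-case evaluation of a fixed action sequence over the discrete ambiguity set could read as follows (the models, reward function and rollout interface are assumptions for illustration):

```python
# Sketch: min over candidate models of the discounted return of a sequence.
def robust_value(sequence, models, reward, s0, gamma=0.9):
    worst = float("inf")
    for T in models:                      # ambiguity set {T_1, ..., T_m}
        s, value = s0, 0.0
        for t, a in enumerate(sequence):
            value += gamma ** t * reward(s, a)
            s = T(s, a)                   # deterministic candidate dynamics
        worst = min(worst, value)         # worst-case aggregation
    return worst
```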

SLIDE 26

A robust extension of action-values

Definition (Robust values)

Given a node $i \in \mathcal{T}$, define:

The robust value:
$$v^r_i \overset{\text{def}}{=} \max_{\pi \in i A^\infty} \min_{m \in [1,M]} R^{T_m}_\pi$$

The robust u-value:
$$u^r_i(n) \overset{\text{def}}{=} \begin{cases} \min\limits_{m \in [1,M]} \sum_{t=0}^{d-1} \gamma^t r_t & \text{if } i \in \mathcal{L}_n \\ \max\limits_{a \in A} u^r_{ia}(n) & \text{if } i \in \mathcal{T}_n \setminus \mathcal{L}_n \end{cases}$$

The robust b-value:
$$b^r_i(n) \overset{\text{def}}{=} \begin{cases} \min\limits_{m \in [1,M]} \sum_{t=0}^{d-1} \gamma^t r_t + \frac{\gamma^d}{1-\gamma} & \text{if } i \in \mathcal{L}_n \\ \max\limits_{a \in A} b^r_{ia}(n) & \text{if } i \in \mathcal{T}_n \setminus \mathcal{L}_n \end{cases}$$

SLIDE 27

Discrete Robust Optimistic Planning

Remark (Ordering of min and max)

Naive comparison of action values between the different models does not recover the robust policy.


Algorithm 4: Deterministic Robust Optimistic Planning

1   Initialise T to a root node and expand it. Set n = 1.
2   while numerical resources available do
3       Compute the robust u-values u^r_i(n) and robust b-values b^r_i(n).
4       Expand arg max_{i ∈ L_n} b^r_i(n).
5       n ← n + 1
6   return a(n) = arg max_{a ∈ A} u^r_a(n)

SLIDE 28

Main result

Variables

◮ computational budget n
◮ near-optimal branching factor κ
◮ simple regret $R_n = v^r - v^r_{a(n)}$

Theorem (Regret bound)

Algorithm 4 enjoys a simple regret of:

$$\text{If } \kappa > 1: \quad R_n = O\!\left(n^{-\frac{\log 1/\gamma}{\log \kappa}}\right) \tag{1}$$

$$\text{If } \kappa = 1: \quad R_n = O\!\left(\gamma^{\frac{(1-\gamma)\beta}{c}\, n}\right) \tag{2}$$

SLIDE 29

Experiment

Ambiguity | Agent       | Worst-case | Mean ± std
----------+-------------+------------+-------------
None      | Oracle      | 9.83       | 10.84 ± 0.16
Discrete  | Nominal     | 2.09       | 8.85 ± 3.53
Discrete  | Algorithm 4 | 8.99       | 10.78 ± 0.34

SLIDE 30

Continuous Ambiguity Set

Approximate the robust objective by a tractable surrogate. Given a policy π and current state s0,

Definition (Reachability set S)
$$S(t, s_0, \pi) \overset{\text{def}}{=} \{s_t : \exists T \in \mathcal{T} \text{ s.t. } s_{k+1} = T(s_k, \pi(s_k))\}$$

Definition (Interval hull $\underline{S} = (\underline{s}, \overline{s})$)
$$\underline{s}(t, s_0, \pi) \overset{\text{def}}{=} \min S(t, s_0, \pi), \qquad \overline{s}(t, s_0, \pi) \overset{\text{def}}{=} \max S(t, s_0, \pi)$$

SLIDE 31

Approximate Robust Evaluation

Definition (Surrogate objective $\hat{v}^r$)
$$\hat{v}^r(\pi) \overset{\text{def}}{=} \sum_{t=0}^{H} \gamma^t \min_{s \in S(t, s_0, \pi)} r(s, \pi(s)) \tag{3}$$

Algorithm 5: Interval-based Robust Control

1   Algorithm robust_control(s0)
2       Initialise a set Π of policies
3       while resources available do
4           evaluate() each policy π ∈ Π at the current state s0
5           Update Π by policy search
6       end
7       return arg max_{π ∈ Π} v̂^r(π)

1   Procedure evaluate(π, s0)
2       Compute the state interval S(t, s0, π) over the horizon t ∈ [0, H]
3       Minimise r over the intervals S(t, s0, π) for all t ∈ [0, H]
4       return v̂^r(π)
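A sketch of the evaluate() procedure, assuming an interval predictor returning per-step state bounds and a reward that can be minimised over a box (both are assumed helpers, not part of the slides):

```python
# Sketch: surrogate robust objective (3) from interval predictions.
def surrogate_value(policy, s0, predict_interval, min_reward_over_box,
                    gamma=0.9, horizon=10):
    value = 0.0
    for t in range(horizon + 1):
        s_low, s_high = predict_interval(t, s0, policy)  # interval hull of S(t)
        value += gamma ** t * min_reward_over_box(s_low, s_high)
    return value
```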

SLIDE 32

Results

The approximate performance of a policy is guaranteed on the true environment.

Proposition (Lower bound)

The surrogate objective $\hat{v}^r$ is a lower bound of the true objective $v^r$:

$$\forall \pi, \quad \hat{v}^r(\pi) \leq v^r(\pi) \tag{4}$$

SLIDE 33

Results

The approximate performance of a policy is guaranteed on the true environment.

Proposition (Lower bound)

The surrogate objective $\hat{v}^r$ is a lower bound of the true objective $v^r$:

$$\forall \pi, \quad \hat{v}^r(\pi) \leq v^r(\pi) \tag{4}$$

But how can we compute S?

SLIDE 34

Interval prediction by sampling

Sample M different models $T_m$ and the corresponding trajectories $\{s^m_{t+1} = T_m(s^m_t, \pi(s^m_t))\}$.

Definition (Sampling-based interval predictor)
$$\underline{s}(t, s_0, \pi) \overset{\text{def}}{=} \min_{m \in [1,M]} s^m_t, \qquad \overline{s}(t, s_0, \pi) \overset{\text{def}}{=} \max_{m \in [1,M]} s^m_t$$

◮ Generic form
◮ Subset of S ⟹ no guarantee
◮ Heavy computational load

SLIDE 35

Interval arithmetic

Consider an LPV system: $\dot{x}(t) = A(\theta(t))x(t) + Bd(t),\ t \geq 0$.

Definition (Interval arithmetic predictor)
$$\dot{\underline{x}}(t) = \underline{A}^+\underline{x}^+(t) - \overline{A}^+\underline{x}^-(t) - \underline{A}^-\overline{x}^+(t) + \overline{A}^-\overline{x}^-(t) + B^+\underline{d}(t) - B^-\overline{d}(t),$$
$$\dot{\overline{x}}(t) = \overline{A}^+\overline{x}^+(t) - \underline{A}^+\overline{x}^-(t) - \overline{A}^-\underline{x}^+(t) + \underline{A}^-\underline{x}^-(t) + B^+\overline{d}(t) - B^-\underline{d}(t),$$
$$\underline{x}(0) = \underline{x}_0, \quad \overline{x}(0) = \overline{x}_0$$

◮ Fast computation
◮ Overset of S ⟹ inclusion property
◮ Unstable dynamics, even for stable systems

SLIDE 36

A simple example

$$\dot{x}(t) = -\theta(t)x(t) + d(t), \qquad \theta(t) \in [0.5, 1.5], \quad d(t) \in [-0.1, 0.1]$$
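For illustration, the sampling-based predictor of the previous slide can be run on this system. This is a sketch only: forward Euler integration, with θ held constant per sampled model (a simplification of the time-varying case); the step size and sample count are arbitrary choices.

```python
# Sketch: sampling-based interval prediction for dx/dt = -theta*x + d.
import numpy as np

rng = np.random.default_rng(0)
dt, horizon, n_models = 0.01, 2.0, 100
steps = int(horizon / dt)

x = np.full((n_models, steps + 1), 1.0)        # x0 = 1 for every sample
thetas = rng.uniform(0.5, 1.5, size=n_models)  # one sampled model per row
for k in range(steps):
    d = rng.uniform(-0.1, 0.1, size=n_models)  # sampled disturbance
    x[:, k + 1] = x[:, k] + dt * (-thetas * x[:, k] + d)

# Sampled interval: a subset of the true reachable set, hence no guarantee.
lower, upper = x.min(axis=0), x.max(axis=0)
```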

SLIDE 37

Novel interval predictor [Leurent, Efimov, et al. 2019]

Under polytopic uncertainty: $A(\theta) = A_0 + \sum_{i=1}^{N} \lambda_i(\theta)\,\Delta A_i$, with $A_0$ Hurwitz and Metzler.

Definition (Polytopic interval predictor)
$$\dot{\underline{x}}(t) = A_0\underline{x}(t) - \Delta A^+\underline{x}^-(t) - \Delta A^-\overline{x}^+(t) + B^+\underline{d}(t) - B^-\overline{d}(t), \tag{5}$$
$$\dot{\overline{x}}(t) = A_0\overline{x}(t) + \Delta A^+\overline{x}^+(t) + \Delta A^-\underline{x}^-(t) + B^+\overline{d}(t) - B^-\underline{d}(t),$$
$$\underline{x}(0) = \underline{x}_0, \quad \overline{x}(0) = \overline{x}_0$$

◮ Fast computation
◮ Overset of S ⟹ inclusion property
◮ Sufficient conditions for stability, in terms of LMIs

SLIDE 38

Novel interval predictor — Results

Theorem (Stability)

If there exist diagonal matrices $P, Q, Q_+, Q_-, Z_+, Z_-, \Psi_+, \Psi_-, \Psi, \Gamma \in \mathbb{R}^{2n \times 2n}$ such that the following LMIs are satisfied:

$$P + \min\{Z_+, Z_-\} > 0, \quad \Upsilon \preceq 0, \quad \Gamma > 0, \quad Q + \min\{Q_+, Q_-\} + 2\min\{\Psi_+, \Psi_-\} > 0,$$

where

$$\Upsilon = \begin{pmatrix} \Upsilon_{11} & \Upsilon_{12} & \Upsilon_{13} & P \\ \Upsilon_{12}^\top & \Upsilon_{22} & \Upsilon_{23} & Z_+ \\ \Upsilon_{13}^\top & \Upsilon_{23}^\top & \Upsilon_{33} & Z_- \\ P & Z_+ & Z_- & -\Gamma \end{pmatrix},$$

$$\Upsilon_{11} = A^\top P + PA + Q, \quad \Upsilon_{12} = A^\top Z_+ + PR_+ + \Psi_+, \quad \Upsilon_{13} = A^\top Z_- + PR_- + \Psi_-,$$
$$\Upsilon_{22} = Z_+ R_+ + R_+^\top Z_+ + Q_+, \quad \Upsilon_{23} = Z_+ R_- + R_+^\top Z_- + \Psi, \quad \Upsilon_{33} = Z_- R_- + R_-^\top Z_- + Q_-,$$

with $A$, $R_+$, $R_-$ the block matrices assembled from $A_0$ and the positive/negative parts $\Delta A^+$, $\Delta A^-$, then the predictor (5) is input-to-state stable.

SLIDE 39

A simple example (cont’)

$$\dot{x}(t) = -\theta(t)x(t) + d(t), \qquad \theta(t) \in [0.5, 1.5], \quad d(t) \in [-0.1, 0.1]$$

SLIDE 40

Experiment

◮ Interval prediction for vehicles
◮ Application to interval-based robust planning

Ambiguity  | Agent       | Worst-case | Mean ± std
-----------+-------------+------------+-------------
None       | Oracle      | 9.83       | 10.84 ± 0.16
Discrete   | Nominal     | 2.09       | 8.85 ± 3.53
Discrete   | Algorithm 4 | 8.99       | 10.78 ± 0.34
Continuous | Nominal     | 1.99       | 9.95 ± 2.38
Continuous | Algorithm 5 | 7.88       | 10.73 ± 0.61

SLIDE 41

Efficient Planning

Let us look back at the performance of MCTS (actually UCT):

[Figure: histograms of rewards and of episode lengths for the VI, DQN and MCTS agents.]

It is quite good, but clearly sub-optimal. Should we just increase the budget? What are some failing cases?

SLIDE 42

Failing cases of UCT

a.k.a. the mousehole problem

SLIDE 43

Failing cases of UCT

This was analysed in [Coquelin and Munos 2007]: the sample complexity of UCT is lower-bounded by Ω(exp(exp(D))).

SLIDE 44

A Benchmark of Planning Algorithms

Algorithm        | Complexity                                 | Does it run? | Does it work?
-----------------+--------------------------------------------+--------------+--------------
MCTS             | ?                                          | YES          | ?
SparseSampling¹  | (1/ε)^{log 1/ε}                            | YES          | NO
UCT              | exp(exp(D))                                | YES          | KIND OF
OPD²             | n^{−(log 1/γ)/(log κ)}                     | YES          | YES
OLOP             | n^{−min(1/2, (log 1/γ)/(log κ))}           | KIND OF      | NO
StOP¹            | (1/ε)^{2 + (log κ′)/(log 1/γ) + o(1)}      | NO           | ? (NO)
TrailBlazer¹     | (1/ε)^{(log Nκ)/(log 1/γ)} (log 1/(δε))^α  | YES          | NO
PlatYPOOs²       | ≤ OLOP                                     | YES          | YES

¹ In the PAC framework. ² With deterministic dynamics.

SLIDE 45

Practical Open Loop Optimistic Planning

The idea behind OLOP

Algorithm 6: General structure for Open-Loop Optimistic Planning

1 for each episode m = 1, · · · , M do 2

Compute Ua(m − 1) from (7) for all a ∈ T

3

Compute Ba(m − 1) from (8) for all a ∈ AL

4

Sample a sequence with highest B-value: am ∈ arg maxa∈AL Ba(m − 1).

5 return the most played sequence a(n) ∈ arg maxa∈AL Ta(M)

SLIDE 47

Practical Open Loop Optimistic Planning

What’s wrong with OLOP?

Overly pessimistic, especially in the low-budget regime.

$$U^\mu_a(m) = \hat{\mu}_a(m) + \sqrt{\frac{2\log M}{T_a(m)}} \tag{6}$$

$$U_a(m) \overset{\text{def}}{=} \sum_{t=1}^{h} \gamma^t U^\mu_{a_{1:t}}(m) + \frac{\gamma^{h+1}}{1-\gamma} \tag{7}$$

$$B_a(m) \overset{\text{def}}{=} \inf_{1 \leq t \leq L} U_{a_{1:t}}(m) \tag{8}$$

Intuitive explanation:

◮ Unintended behaviour happens when $U^\mu_a(m) > 1$ for all $a$.
◮ Then the sequence $(U_{a_{1:t}}(m))_t$ is non-decreasing.
◮ Then $B_a(m) = U_{a_{1:1}}(m)$.
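A sketch of these bounds (node statistics are assumed to be stored as (empirical mean, visit count) pairs along each sequence):

```python
# Sketch: OLOP upper bounds (6)-(8).
import math

def hoeffding_ucb(mu_hat, visits, M):
    return mu_hat + math.sqrt(2 * math.log(M) / visits)          # (6)

def sequence_ucb(stats, gamma, M):
    """stats: list of (mu_hat, visits) pairs along the prefix a_{1:h}."""
    h = len(stats)
    u = sum(gamma ** (t + 1) * hoeffding_ucb(mu, n, M)
            for t, (mu, n) in enumerate(stats))
    return u + gamma ** (h + 1) / (1 - gamma)                    # (7)

def b_value(prefix_ucbs):
    return min(prefix_ucbs)                                      # (8)
```

When every $U^\mu_a(m) > 1$, each extra step adds a positive term to (7), so the infimum in (8) is attained at the very first prefix, which is exactly the failure mode described above.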

SLIDE 48

Practical Open Loop Optimistic Planning

What we were promised

SLIDE 49

Practical Open Loop Optimistic Planning

What we actually got

OLOP behaves as uniform planning!

SLIDE 50

Kullback-Leibler Open Loop Optimistic Planning

We summon the upper-confidence bound from kl-UCB [Cappé et al. 2013]:

$$U^\mu_a(m) \overset{\text{def}}{=} \max\{q \in I : T_a(m)\, d(\hat{\mu}_a(m), q) \leq f(m)\}$$

              | OLOP      | KL-OLOP
--------------+-----------+------------------------
Interval I    | ℝ         | [0, 1]
Divergence d  | d_QUAD    | d_BER
f(m)          | 4 log M   | 2 log M + 2 log log M

$$d_{\text{QUAD}}(p, q) \overset{\text{def}}{=} 2(p - q)^2, \qquad d_{\text{BER}}(p, q) \overset{\text{def}}{=} p \log\frac{p}{q} + (1-p)\log\frac{1-p}{1-q}$$
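The kl-UCB bound has no closed form for $d_{\text{BER}}$, but since $d_{\text{BER}}(\hat{\mu}_a, \cdot)$ is increasing on $[\hat{\mu}_a, 1]$ it can be found by bisection (a sketch; the iteration count is an arbitrary choice):

```python
# Sketch: kl-UCB upper bound, the largest q in [0, 1] such that
# T_a(m) * d_BER(mu_hat, q) <= f(m).
import math

def d_ber(p, q, eps=1e-12):
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb(mu_hat, visits, f_m, iters=50):
    lo, hi = mu_hat, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if visits * d_ber(mu_hat, mid) <= f_m:
            lo = mid          # mid is still inside the confidence set
        else:
            hi = mid
    return lo
```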

SLIDE 51

Kullback-Leibler Open Loop Optimistic Planning

[Figure: the curve $q \mapsto d_{\text{BER}}(\hat{\mu}_a, q)$ over [0, 1], intersected with the threshold $f(m)/T_a$ to obtain the confidence bounds $L^\mu_a$ and $U^\mu_a$ around $\hat{\mu}_a$.]

Conversely,

◮ $U^\mu_a(m) \in I = [0, 1]$ for all $a$.
◮ The sequence $(U_{a_{1:t}}(m))_t$ is non-increasing.
◮ $B_a(m) = U_a(m)$: the bound sharpening step is superfluous.

Lille - 44/54

slide-52
SLIDE 52

Sample complexity

KL-OLOP was introduced and analysed in [Leurent and Maillard 2019].

Theorem (Sample complexity)

KL-OLOP enjoys the same asymptotic regret bounds as OLOP. More precisely, KL-OLOP satisfies:

$$\mathbb{E}\, r_n = \begin{cases} \tilde{O}\!\left(n^{-\frac{\log 1/\gamma}{\log \kappa'}}\right) & \text{if } \gamma\sqrt{\kappa'} > 1, \\[4pt] \tilde{O}\!\left(n^{-\frac{1}{2}}\right) & \text{if } \gamma\sqrt{\kappa'} \leq 1. \end{cases}$$

SLIDE 53

Time complexity

Original KL-OLOP

    Compute $B_a(m - 1)$ from (8) for all $a \in A^L$

Lazy KL-OLOP

Property (Time and memory complexity)

$$\frac{C(\text{Lazy KL-OLOP})}{C(\text{KL-OLOP})} = \frac{nK}{K^L}$$

SLIDE 54

Experiments — Expanded Trees

SLIDE 55

Experiments — Expanded Trees

SLIDE 56

Experiments — Expanded Trees

SLIDE 57

Experiments — Highway

SLIDE 58

Experiments — Gridworld

SLIDE 59

Experiments — Stochastic Gridworld

SLIDE 60

References I

Olivier Cappé, Aurélien Garivier, Odalric-Ambrym Maillard, Rémi Munos, and Gilles Stoltz. "Kullback-Leibler Upper Confidence Bounds for Optimal Sequential Allocation". In: The Annals of Statistics 41.3 (2013), pp. 1516–1541.

Nicolas Carrara, Edouard Leurent, Romain Laroche, Tanguy Urvoy, Odalric-Ambrym Maillard, and Olivier Pietquin. "Scaling up Budgeted Reinforcement Learning". In: CoRR abs/1903.01004 (2019). arXiv: 1903.01004. URL: http://arxiv.org/abs/1903.01004.

Pierre-Arnaud Coquelin and Rémi Munos. "Bandit Algorithms for Tree Search". In: Uncertainty in Artificial Intelligence (2007). arXiv: cs/0703062. URL: http://arxiv.org/abs/cs/0703062.

SLIDE 61

References II

Edouard Leurent, Yann Blanco, Denis Efimov, and Odalric-Ambrym Maillard. "Approximate Robust Control of Uncertain Dynamical Systems". In: NeurIPS Machine Learning for Intelligent Transportation Systems Workshop (2018).

Edouard Leurent, Denis Efimov, Tarek Raïssi, and Wilfrid Perruquetti. "Interval Prediction for Continuous-Time Systems with Parametric Uncertainties". In: (2019).

Edouard Leurent and Odalric-Ambrym Maillard. "Practical Open-Loop Optimistic Planning". In: (2019).
