Safe Reinforcement Learning for Decision-Making in Autonomous Driving
Edouard Leurent, Odalric-Ambrym Maillard, Denis Efimov, Wilfrid Perruquetti, Yann Blanco
SequeL, Inria Lille Nord Europe; Valse, Inria Lille Nord Europe; Renault Group
Motivation
Classic Autonomous Driving Pipeline
In practice,
◮ the behavioural layer is a hand-crafted rule-based system (e.g. an FSM);
◮ it won't scale to complex scenes, nor handle negotiation and aggressiveness.
Reinforcement Learning: why?
Search for an optimal policy π(a|s):
$$\max_\pi \; \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\right], \qquad a_t \sim \pi(s_t),\; s_{t+1} \sim T(s_t, a_t),$$
where the maximised quantity is the policy return $R^T_\pi$.
The dynamics $T(s_{t+1} \mid s_t, a_t)$ are unknown: the agent learns by interacting with the environment.
Challenges:
◮ exploration-exploitation
◮ credit assignment
◮ partial observability
◮ safety
Reinforcement Learning: how?
Model-free
1. Directly optimise π(a|s) through policy evaluation and policy improvement.
Model-based
1. Learn a model of the dynamics $\hat{T}(s_{t+1} \mid s_t, a_t)$.
2. (Planning) Leverage it to compute
$$\max_\pi \; \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\right], \qquad a_t \sim \pi(s_t),\; s_{t+1} \sim \hat{T}(s_t, a_t).$$
+ Better sample efficiency, interpretability, priors.
A first benchmark
The highway-env environment
◮ Vehicle kinematics: Kinematic Bicycle Model
◮ Low-level longitudinal and lateral controllers
◮ Behavioural models: IDM and MOBIL
◮ Graphical road network and route planning
A few baseline agents — Setup
◮ Model-free: DQN
◮ Model-based (planning): Value Iteration and MCTS
A minimal environment-setup sketch is given below.
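A minimal sketch of such a setup, assuming the standard Gym interface of highway-env (environment id and API details may differ across versions); the random policy is only a stand-in for the DQN / Value Iteration / MCTS agents:

```python
import gym
import highway_env  # noqa: F401  -- registers the highway environments

env = gym.make("highway-v0")
returns, lengths = [], []
for episode in range(100):
    obs, done, total_reward, steps = env.reset(), False, 0.0, 0
    while not done:
        action = env.action_space.sample()  # placeholder for a DQN / VI / MCTS agent
        obs, reward, done, info = env.step(action)
        total_reward, steps = total_reward + reward, steps + 1
    returns.append(total_reward)
    lengths.append(steps)
print("mean return:", sum(returns) / len(returns), "mean length:", sum(lengths) / len(lengths))
```

The collected returns and episode lengths are exactly the quantities shown in the histograms of the next slide.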
A first benchmark — Results
Figure: histograms of episode rewards and episode lengths for the VI, DQN and MCTS agents.
Videos available on
The safety / performance trade-off
Let us look at the performance of DQN:
Uncertainty and risk
◮ High return variance, many collisions
◮ In RL, we only maximise the return in expectation
Conflicting objectives
◮ Reward: $r_t = \omega_v \cdot \text{velocity} - \omega_c \cdot \text{collision}$
◮ We only control the return $R^T_\pi = \sum_t \gamma^t r_t$.
◮ For any fixed ω, there can be many optimal policies with different velocity/collision ratios → the Pareto-optimal curve (see the toy example below).
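A toy numerical illustration of why a fixed weight vector ω does not pin down the safety level (the weights and values below are chosen for this example only):

```latex
% Two policies with identical weighted returns but different safety profiles,
% assuming illustrative weights \omega_v = 1 and \omega_c = 10:
\pi_1:\ \mathbb{E}[\mathrm{velocity}] = 30,\ \mathbb{P}(\mathrm{collision}) = 1
      \;\Rightarrow\; \mathbb{E}[r] = 1 \times 30 - 10 \times 1 = 20
\pi_2:\ \mathbb{E}[\mathrm{velocity}] = 20,\ \mathbb{P}(\mathrm{collision}) = 0
      \;\Rightarrow\; \mathbb{E}[r] = 1 \times 20 - 10 \times 0 = 20
```

Both policies are indistinguishable under this ω, yet only the second one is safe; varying ω traces out the Pareto front between velocity and collisions.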
A first formalisation of risk
Constrained Reinforcement Learning
◮ Augment the MDP with a cost function c : S × A × S → R, a cost discount γc, and a budget β.
◮ Optimise the reward while keeping the cost under a budget
$$\max_\pi \; \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r_t\right] \quad \text{s.t.} \quad \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma_c^t c_t\right] \leq \beta$$
Budgeted Reinforcement Learning
Find a single budget-dependent policy π(a|s, β) that solves all the corresponding CMDPs. (A Monte Carlo check of the constraint is sketched below.)
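As a side illustration, the CMDP constraint can be estimated by Monte Carlo rollouts. A minimal sketch, assuming a Gym-style `env` that reports a cost signal in `info` and an arbitrary policy `pi`; both are assumptions for illustration, not part of the talk:

```python
import numpy as np

def discounted_cost(env, pi, gamma_c, horizon=100):
    """One rollout of the discounted cost sum_t gamma_c^t * c_t."""
    obs, total, discount = env.reset(), 0.0, 1.0
    for _ in range(horizon):
        obs, reward, done, info = env.step(pi(obs))
        total += discount * info.get("cost", 0.0)  # assumes the env exposes a cost signal
        discount *= gamma_c
        if done:
            break
    return total

def satisfies_budget(env, pi, gamma_c, beta, n_rollouts=200):
    """Estimate E_pi[sum_t gamma_c^t c_t] and compare it to the budget beta."""
    costs = [discounted_cost(env, pi, gamma_c) for _ in range(n_rollouts)]
    return np.mean(costs) <= beta
```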
A BMDP algorithm
Lagrangian Relaxation
Consider the dual problem and replace the hard constraint by a soft constraint penalised by a Lagrangian multiplier λ:
$$\max_\pi \; \mathbb{E}\left[\sum_t \gamma^t r_t - \lambda\, \gamma_c^t c_t\right]$$
◮ Train many policies πk with penalties λk and recover the corresponding cost budgets βk
◮ Very data- and memory-heavy
Our BMDP algorithm
Budgeted Fitted-Q [Carrara et al. 2019]
A model-free, value-based, fixed-point iteration procedure.
$$Q^r_{n+1}(s_i, a_i, \beta_i) \xleftarrow{\text{regression}} r'_i + \gamma \sum_{a' \in A} \pi^n_A(s'_i, a', \beta_i)\, Q^r_n\big(s'_i, a', \pi^n_B(s'_i, a', \beta_i)\big)$$
$$Q^c_{n+1}(s_i, a_i, \beta_i) \xleftarrow{\text{regression}} c'_i + \gamma_c \sum_{a' \in A} \pi^n_A(s'_i, a', \beta_i)\, Q^c_n\big(s'_i, a', \pi^n_B(s'_i, a', \beta_i)\big)$$
$$(\pi^n_A, \pi^n_B) \leftarrow \arg\max_{(\pi_A, \pi_B) \in \Psi_n} \sum_{a \in A} \pi_A(s, a, \beta)\, Q^r_n\big(s, a, \pi_B(s, a, \beta)\big)$$
$$\Psi_n = \Big\{\pi_A \in \mathcal{M}(A)^{S \times \mathbb{R}},\; \pi_B \in \mathbb{R}^{S \times A \times \mathbb{R}} \;\text{such that}\; \forall s \in S, \forall \beta \in \mathbb{R},\; \sum_{a \in A} \pi_A(s, a, \beta)\, Q^c_n\big(s, a, \pi_B(s, a, \beta)\big) \leq \beta\Big\}$$
A sketch of the corresponding target computation is given below.
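A rough sketch of the target computation behind these fixed-point updates. The constrained arg max over (π_A, π_B) is delegated to a hypothetical helper `solve_budgeted_policy` (in BFTQ it is derived from the concave hull of the (Q^c, Q^r) points; that part is not reproduced here):

```python
import numpy as np

def bftq_targets(batch, Qr, Qc, gamma, gamma_c, solve_budgeted_policy):
    """Regression targets for Q^r and Q^c on a batch of (s, beta, a, r, c, s') transitions.

    Qr, Qc: callables (s, a, beta) -> float.
    solve_budgeted_policy: hypothetical helper returning, for (s', beta), a distribution
    pi_A over actions and the budget pi_B allocated to each action.
    """
    y_r, y_c = [], []
    for (s, beta, a, r, c, s_next) in batch:
        pi_A, pi_B = solve_budgeted_policy(Qr, Qc, s_next, beta)
        next_r = sum(p * Qr(s_next, a2, pi_B[a2]) for a2, p in pi_A.items())
        next_c = sum(p * Qc(s_next, a2, pi_B[a2]) for a2, p in pi_A.items())
        y_r.append(r + gamma * next_r)
        y_c.append(c + gamma_c * next_c)
    return np.array(y_r), np.array(y_c)
```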
From dynamic programming to RL
Continuous Reinforcement Learning
1. Risk-sensitive exploration.
2. Scalable function approximation.
3. Parallel computing of the targets and experiences.
Risk-sensitive exploration
Algorithm 1: Risk-sensitive exploration
  Initialise an empty batch D
  for each intermediate batch do
      for each episode in the batch do
          Sample an initial budget β ∼ U(B)
          while the episode is not done do
              Update ε from its schedule
              Sample z ∼ U([0, 1])
              if z > ε then
                  Sample (a, β′) from (πA, πB)      // Exploit
              else
                  Sample (a, β′) from U(∆AB)        // Explore
              Append the transition (s, β, a, r′, c′, s′) to the batch D
              Update the episode budget β ← β′
      (πA, πB) ← BFTQ(D)
  return the batch of transitions D
See example on
Function approximation
Figure: neural network for Q-function approximation when the state dimension is 2 and there are 2 actions: an encoder takes (s, β) = (s0, s1, β), followed by two hidden layers and output heads Qr(a0), Qr(a1), Qc(a0), Qc(a1).
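A possible PyTorch sketch of this architecture; the layer sizes and activations are illustrative assumptions, not the values used in the experiments:

```python
import torch
import torch.nn as nn

class BudgetedQNetwork(nn.Module):
    """(s, beta) -> (Q^r(s, ., beta), Q^c(s, ., beta)), one value per action and head."""
    def __init__(self, state_dim=2, n_actions=2, hidden=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(state_dim + 1, hidden), nn.ReLU())
        self.body = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU())
        self.q_r = nn.Linear(hidden, n_actions)  # reward head
        self.q_c = nn.Linear(hidden, n_actions)  # cost head

    def forward(self, state, beta):
        x = torch.cat([state, beta.unsqueeze(-1)], dim=-1)  # append the budget to the state
        h = self.body(self.encoder(x))
        return self.q_r(h), self.q_c(h)

# Usage: q_r, q_c = BudgetedQNetwork()(torch.randn(32, 2), torch.rand(32))
```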
Parallel computing of the targets
Algorithm 2: BFTQ
  In: D, B, γc, γr, fit_r, fit_c (regression algorithms)
  Out: Q^r, Q^c
  X = {s_i, a_i, β_i}, i ∈ [0, |D|]
  Initialise Q^r = Q^c = (s, a, β) → 0
  repeat
      Y^r, Y^c = compute_targets(D, Q^r, Q^c, B, γc, γr)
      Q^r, Q^c = fit_r(X, Y^r), fit_c(X, Y^c)
  until convergence or timeout

Algorithm 3: Compute targets (parallel)
  Q^r, Q^c = Q(D)                                  // perform a single forward pass
  Split D among workers: D = ∪_{w∈W} D_w
  for w ∈ W do                                     // run on each worker in parallel
      (Y^c_w, Y^r_w) ← compute_targets(D_w, Q^r, Q^c, B, γc, γr)
  Join the results: Y^c = ∪_{w∈W} Y^c_w and Y^r = ∪_{w∈W} Y^r_w
  return (Y^c, Y^r)

A Python sketch of the worker split is given below.
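A minimal sketch of the parallel split of Algorithm 3, assuming a picklable `compute_targets` function that maps a sub-batch to a pair of target arrays (Y^r, Y^c):

```python
from concurrent.futures import ProcessPoolExecutor
import numpy as np

def compute_targets_parallel(batch, compute_targets, n_workers=4):
    """Split the batch among workers, compute targets in parallel, then join the results."""
    chunks = np.array_split(np.arange(len(batch)), n_workers)
    sub_batches = [[batch[i] for i in idx] for idx in chunks]
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        results = list(pool.map(compute_targets, sub_batches))
    y_r = np.concatenate([r for r, c in results])
    y_c = np.concatenate([c for r, c in results])
    return y_r, y_c
```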
Experiments
Video available on
Figure: results for Lagrangian penalties λ ∈ {15, 20} and for budgets β in [0.01, 0.09], [0.11, 0.19], [0.21, 0.29], [0.31, 1.00].
Looking back
Figure: histograms of episode rewards and episode lengths for the VI, DQN and MCTS agents.
Compared to DQN, MCTS performed really well in terms of safety; VI, not so much.
Model bias
Model-free
1. Directly optimise π(a|s) through policy evaluation and policy improvement.
Model-based
1. Learn a model of the dynamics $\hat{T}(s_{t+1} \mid s_t, a_t)$.
2. (Planning) Leverage it to compute
$$\max_\pi \; \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\right], \qquad a_t \sim \pi(s_t),\; s_{t+1} \sim \hat{T}(s_t, a_t).$$
+ Better sample efficiency, interpretability, priors.
− Model bias: $T \neq \hat{T}$, see example at
Robust RL — How to handle model uncertainty?
◮ Build a confidence region 𝒯 around the estimated dynamics $\hat{T}$:
$$\forall T' \in \mathcal{T}, \quad \mathbb{P}(\|T - T'\| > \varepsilon) < \delta$$
◮ Plan robustly with respect to this ambiguity:
$$\max_\pi \underbrace{\min_{T \in \mathcal{T}} \; \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]}_{v^r(\pi)}$$
◮ How to optimise this objective?
    ◮ Linear system: H∞ control, robust LQ
    ◮ Finite state-space: Robust Dynamic Programming
    ◮ Non-linear continuous system: ?
Discrete Ambiguity Set [Leurent, Blanco, et al. 2018]
Assumption (Discrete structure)
$$\mathcal{T} = \{T_1, \cdots, T_M\}$$
◮ Optimistic evaluation of paths at the leaves, for all dynamics
◮ Worst-case aggregation over the M dynamics (min over m)
◮ Optimal planning of action sequences (max over a)
A brute-force sketch of this min-max planning is given below.
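A brute-force sketch of the resulting min-max planning for a small discrete set of generative models; the `models[m](s, a) -> (s_next, reward)` interface is an assumption for illustration only:

```python
import itertools
import numpy as np

def robust_action(s0, models, actions, horizon=3, gamma=0.9):
    """Return the first action of the sequence maximising the worst-case discounted return."""
    best_value, best_sequence = -np.inf, None
    for sequence in itertools.product(actions, repeat=horizon):  # max over action sequences
        worst = np.inf
        for step in models:                                      # min over the M dynamics
            s, value = s0, 0.0
            for t, a in enumerate(sequence):
                s, r = step(s, a)
                value += gamma ** t * r
            worst = min(worst, value)
        if worst > best_value:
            best_value, best_sequence = worst, sequence
    return best_sequence[0]
```

Exhaustive enumeration is only for illustration; the optimistic planning of Algorithm 4 below expands the search tree adaptively instead.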
A robust extension of action-values
Definition (Robust values)
Given a node i ∈ T, define:
The robust value:
$$v^r_i \overset{\text{def}}{=} \max_{\pi \in iA^\infty} \; \min_{m \in [1,M]} R^{T_m}_\pi$$
The robust u-value:
$$u^r_i(n) \overset{\text{def}}{=} \begin{cases} \min_{m \in [1,M]} \sum_{t=0}^{d-1} \gamma^t r_t & \text{if } i \in L_n \\ \max_{a \in A} u^r_{ia}(n) & \text{if } i \in T_n \setminus L_n \end{cases}$$
The robust b-value:
$$b^r_i(n) \overset{\text{def}}{=} \begin{cases} \min_{m \in [1,M]} \sum_{t=0}^{d-1} \gamma^t r_t + \frac{\gamma^d}{1-\gamma} & \text{if } i \in L_n \\ \max_{a \in A} b^r_{ia}(n) & \text{if } i \in T_n \setminus L_n \end{cases}$$
Discrete Robust Optimistic Planning
Remark (Ordering of min and max)
A naive comparison of action values across the different models does not recover the robust policy.
Algorithm 4: Deterministic Robust Optimistic Planning
  Initialise T to a root node and expand it. Set n = 1.
  while numerical resources are available do
      Compute the robust u-values u^r_i(n) and robust b-values b^r_i(n).
      Expand arg max_{i ∈ L_n} b^r_i(n).
      n ← n + 1
  return a(n) = arg max_{a ∈ A} u^r_a(n)
Main result
Variables
◮ computational budget n
◮ near-optimal branching factor κ
◮ simple regret $R_n = v^r - v^r_{a(n)}$
Theorem (Regret bound)
Algorithm 4 enjoys a simple regret of:
$$\text{If } \kappa > 1: \quad R_n = O\left(n^{-\frac{\log 1/\gamma}{\log \kappa}}\right) \quad (1)$$
$$\text{If } \kappa = 1: \quad R_n = O\left(\gamma^{\frac{(1-\gamma)\beta}{c}\,n}\right) \quad (2)$$
Experiment
Ambiguity | Agent       | Worst-case | Mean ± std
None      | Oracle      | 9.83       | 10.84 ± 0.16
Discrete  | Nominal     | 2.09       | 8.85 ± 3.53
Discrete  | Algorithm 4 | 8.99       | 10.78 ± 0.34
Continuous Ambiguity Set
Approximate the robust objective by a tractable surrogate. Given a policy π and a current state s0:
Definition (Reachability set S)
$$S(t, s_0, \pi) \overset{\text{def}}{=} \left\{s_t : \exists T \in \mathcal{T} \text{ s.t. } s_{k+1} = T(s_k, \pi(s_k))\right\}$$
Definition (Interval hull $[\underline{s}, \overline{s}]$)
$$\underline{s}(t, s_0, \pi) \overset{\text{def}}{=} \min S(t, s_0, \pi), \qquad \overline{s}(t, s_0, \pi) \overset{\text{def}}{=} \max S(t, s_0, \pi)$$
Approximate Robust Evaluation
Definition (Surrogate objective v̂^r)
$$\hat{v}^r(\pi) \overset{\text{def}}{=} \sum_{t=0}^{H} \gamma^t \min_{s \in S(t, s_0, \pi)} r(s, \pi(s)) \quad (3)$$
Algorithm 5: Interval-based Robust Control
  Algorithm robust_control(s0):
      Initialise a set Π of policies
      while resources are available do
          evaluate() each policy π ∈ Π at the current state s0
          Update Π by policy search
      return arg max_{π ∈ Π} v̂^r(π)
  Procedure evaluate(π, s0):
      Compute the state intervals S(t, s0, π) over the horizon t ∈ [0, H]
      Minimise r over the intervals S(t, s0, π) for all t ∈ [0, H]
      return v̂^r(π)
A sketch of this evaluation procedure is given below.
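A sketch of the evaluate() procedure, assuming an interval predictor `predict_interval(s0, pi, H)` returning per-step lower/upper state bounds and a hypothetical `reward_lower_bound` giving the minimum of r over a state box:

```python
def surrogate_value(s0, pi, predict_interval, reward_lower_bound, horizon=10, gamma=0.9):
    """Pessimistic return: discounted sum of worst-case rewards over the predicted intervals."""
    lower, upper = predict_interval(s0, pi, horizon)  # each of shape (horizon + 1, state_dim)
    return sum(gamma ** t * reward_lower_bound(lower[t], upper[t], pi)
               for t in range(horizon + 1))
```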
Results
The approximate performance of a policy is guaranteed on the true environment.
Proposition (Lower bound)
The surrogate objective v̂^r is a lower bound of the true objective v^r:
$$\forall \pi, \quad \hat{v}^r(\pi) \leq v^r(\pi) \quad (4)$$
But how can we compute S?
Interval prediction by sampling
Sample M different models T^m and the corresponding trajectories $s^m_{t+1} = T^m(s_t, \pi(s_t))$.
Definition (Sampling-based interval predictor)
$$\underline{s}(t, s_0, \pi) \overset{\text{def}}{=} \min_{m \in [1,M]} s^m_t, \qquad \overline{s}(t, s_0, \pi) \overset{\text{def}}{=} \max_{m \in [1,M]} s^m_t$$
◮ Generic form
◮ Subset of S ⇒ no guarantee
◮ Heavy computational load
Interval arithmetic
Consider an LPV system: $\dot{x}(t) = A(\theta(t))\,x(t) + B\,d(t)$, $t \geq 0$.
Definition (Interval arithmetic predictor)
$$\dot{\underline{x}}(t) = \underline{A}^+\underline{x}^+(t) - \overline{A}^+\underline{x}^-(t) - \underline{A}^-\overline{x}^+(t) + \overline{A}^-\overline{x}^-(t) + B^+\underline{d}(t) - B^-\overline{d}(t),$$
$$\dot{\overline{x}}(t) = \overline{A}^+\overline{x}^+(t) - \underline{A}^+\overline{x}^-(t) - \overline{A}^-\underline{x}^+(t) + \underline{A}^-\underline{x}^-(t) + B^+\overline{d}(t) - B^-\underline{d}(t),$$
$$\underline{x}(0) = \underline{x}_0, \quad \overline{x}(0) = \overline{x}_0.$$
◮ Fast computation
◮ Overset of S ⇒ inclusion property
◮ Unstable dynamics, even for stable systems
A simple example
$$\dot{x}(t) = -\theta(t)\,x(t) + d(t), \qquad \theta(t) \in [0.5, 1.5], \quad d(t) \in [-0.1, 0.1]$$
A numerical sketch comparing the sampling-based and interval-arithmetic predictors on this system follows.
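A numerical sketch (Euler integration, illustrative step size and model sampling): sampling gives tight but unguaranteed bounds, while the naive interval-arithmetic predictor is guaranteed yet its bounds diverge, even though every realisation of the system is stable.

```python
import numpy as np

dt, steps, x0 = 0.01, 500, 1.0   # integrate over 5 seconds
rng = np.random.default_rng(0)

# 1) Sampling-based bounds: min/max over M sampled realisations (a subset of S).
M = 100
x = np.full(M, x0)
theta = rng.uniform(0.5, 1.5, M)
for _ in range(steps):
    x = x + dt * (-theta * x + rng.uniform(-0.1, 0.1, M))
sample_lo, sample_hi = x.min(), x.max()

# 2) Naive interval-arithmetic predictor for xdot = a x + d with a in [-1.5, -0.5]:
#    here a_lo^+ = a_hi^+ = 0, a_lo^- = 1.5, a_hi^- = 0.5.
pos = lambda v: max(v, 0.0)
neg = lambda v: max(-v, 0.0)
x_lo, x_hi = x0, x0
for _ in range(steps):
    dx_lo = -1.5 * pos(x_hi) + 0.5 * neg(x_hi) - 0.1
    dx_hi = -0.5 * pos(x_lo) + 1.5 * neg(x_lo) + 0.1
    x_lo, x_hi = x_lo + dt * dx_lo, x_hi + dt * dx_hi

print(f"sampled bounds at t=5s:             [{sample_lo:.3f}, {sample_hi:.3f}]")
print(f"interval arithmetic bounds at t=5s: [{x_lo:.3f}, {x_hi:.3f}]")  # blows up
```

This motivates the dedicated predictor of the next slides, which keeps the inclusion property while remaining stable.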
Novel interval predictor [Leurent, Efimov, et al. 2019]
Under polytopic uncertainty: $A(\theta) = A_0 + \sum_{i=1}^{N} \lambda_i(\theta)\,\Delta A_i$, with $A_0$ Hurwitz and Metzler.
Definition (Polytopic interval predictor)
$$\dot{\underline{x}}(t) = A_0\underline{x}(t) - \Delta A_+\underline{x}^-(t) - \Delta A_-\overline{x}^+(t) + B^+\underline{d}(t) - B^-\overline{d}(t), \quad (5)$$
$$\dot{\overline{x}}(t) = A_0\overline{x}(t) + \Delta A_+\overline{x}^+(t) + \Delta A_-\underline{x}^-(t) + B^+\overline{d}(t) - B^-\underline{d}(t),$$
$$\underline{x}(0) = \underline{x}_0, \quad \overline{x}(0) = \overline{x}_0.$$
◮ Fast computation
◮ Overset of S ⇒ inclusion property
◮ Sufficient conditions for stability, in terms of LMIs
Novel interval predictor — Results
Theorem (Stability)
If there exist diagonal matrices P, Q, Q_+, Q_-, Z_+, Z_-, Ψ_+, Ψ_-, Ψ, Γ ∈ R^{2n×2n} such that the following LMIs are satisfied:
$$P + \min\{Z_+, Z_-\} > 0, \quad \Upsilon \preceq 0, \quad \Gamma > 0, \quad Q + \min\{Q_+, Q_-\} + 2\min\{\Psi_+, \Psi_-\} > 0,$$
where
$$\Upsilon = \begin{bmatrix} \Upsilon_{11} & \Upsilon_{12} & \Upsilon_{13} & P \\ \Upsilon_{12}^\top & \Upsilon_{22} & \Upsilon_{23} & Z_+ \\ \Upsilon_{13}^\top & \Upsilon_{23}^\top & \Upsilon_{33} & Z_- \\ P & Z_+ & Z_- & -\Gamma \end{bmatrix},$$
$$\Upsilon_{11} = A^\top P + PA + Q, \quad \Upsilon_{12} = A^\top Z_+ + PR_+ + \Psi_+, \quad \Upsilon_{13} = A^\top Z_- + PR_- + \Psi_-,$$
$$\Upsilon_{22} = Z_+R_+ + R_+^\top Z_+ + Q_+, \quad \Upsilon_{23} = Z_+R_- + R_+^\top Z_- + \Psi, \quad \Upsilon_{33} = Z_-R_- + R_-^\top Z_- + Q_-,$$
and where the block matrices A, R_+ and R_- are assembled from A_0 and from the positive/negative parts of the uncertainty (A from A_0; R_+ from −ΔA_- and ΔA_+; R_- from ΔA_+ and −ΔA_-),
then the predictor (5) is input-to-state stable.
A simple example (cont'd)
$$\dot{x}(t) = -\theta(t)\,x(t) + d(t), \qquad \theta(t) \in [0.5, 1.5], \quad d(t) \in [-0.1, 0.1]$$
Experiment
◮ Interval prediction for vehicles
◮ Application to interval-based robust planning

Ambiguity  | Agent       | Worst-case | Mean ± std
None       | Oracle      | 9.83       | 10.84 ± 0.16
Discrete   | Nominal     | 2.09       | 8.85 ± 3.53
Discrete   | Algorithm 4 | 8.99       | 10.78 ± 0.34
Continuous | Nominal     | 1.99       | 9.95 ± 2.38
Continuous | Algorithm 5 | 7.88       | 10.73 ± 0.61
Efficient Planning
Let us look back at the performance of MCTS (actually UCT):
Figure: histograms of episode rewards and episode lengths for the VI, DQN and MCTS agents.
It is quite good, but clearly sub-optimal. Should we just increase the budget? What are some failing cases?
Failing cases of UCT
a.k.a. the mousehole problem
Failing cases of UCT
It was analysed in [Coquelin and Munos 2007]: the sample complexity of UCT is lower-bounded by Ω(exp(exp(D))).
A Benchmark of Planning Algorithms
Algorithm        | Complexity                                 | Does it run? | Does it work?
MCTS             | ?                                          | YES          | ?
SparseSampling¹  | (1/ε)^{log 1/ε}                            | YES          | NO
UCT              | exp(exp(D))                                | YES          | KIND OF
OPD²             | n^{−log(1/γ)/log κ}                        | YES          | YES
OLOP             | n^{−min(1/2, log(1/γ)/log κ)}              | KIND OF      | NO
StOP¹            | (1/ε)^{2 + log κ′/log(1/γ) + o(1)}         | NO           | ? (NO)
TrailBlazer¹     | (1/ε)^{log Nκ/log(1/γ)} (log 1/(δε))^α     | YES          | NO
PlaTγPOOS²       | ≤ OLOP                                     | YES          | YES

¹ In the PAC framework. ² With deterministic dynamics.
Practical Open Loop Optimistic Planning
The idea behind OLOP
Algorithm 6: General structure for Open-Loop Optimistic Planning
  for each episode m = 1, ..., M do
      Compute U_a(m − 1) from (7) for all a ∈ T
      Compute B_a(m − 1) from (8) for all a ∈ A^L
      Sample a sequence with the highest B-value: a^m ∈ arg max_{a ∈ A^L} B_a(m − 1)
  return the most played sequence a(n) ∈ arg max_{a ∈ A^L} T_a(M)
Practical Open Loop Optimistic Planning
What’s wrong with OLOP?
It is overly pessimistic, especially in the low-budget regime.
$$U^\mu_a(m) = \hat{\mu}_a(m) + \sqrt{\frac{2 \log M}{T_a(m)}} \quad (6)$$
$$U_a(m) \overset{\text{def}}{=} \sum_{t=1}^{h} \gamma^t U^\mu_{a_{1:t}}(m) + \frac{\gamma^{h+1}}{1-\gamma} \quad (7)$$
$$B_a(m) \overset{\text{def}}{=} \inf_{1 \leq t \leq L} U_{a_{1:t}}(m) \quad (8)$$
Intuitive explanation:
◮ Unintended behaviour happens when $U^\mu_a(m) > 1$ for all a.
◮ Then the sequence $(U_{a_{1:t}}(m))_t$ is non-decreasing.
◮ Then $B_a(m) = U_{a_{1:1}}(m)$.
Practical Open Loop Optimistic Planning
What we were promised
Practical Open Loop Optimistic Planning
What we actually got
OLOP behaves as uniform planning!
Kullback-Leibler Open Loop Optimistic Planning
We summon the upper-confidence bound from kl-UCB [Cappé et al. 2013]:
$$U^\mu_a(m) \overset{\text{def}}{=} \max\left\{q \in I : T_a(m)\, d(\hat{\mu}_a(m), q) \leq f(m)\right\}$$
Algorithm    | OLOP     | KL-OLOP
Interval I   | R        | [0, 1]
Divergence d | d_QUAD   | d_BER
f(m)         | 4 log M  | 2 log M + 2 log log M
$$d_{\text{QUAD}}(p, q) \overset{\text{def}}{=} 2(p - q)^2, \qquad d_{\text{BER}}(p, q) \overset{\text{def}}{=} p \log\frac{p}{q} + (1 - p)\log\frac{1 - p}{1 - q}$$
A sketch of the kl-UCB index computation is given below.
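A small sketch of how the kl-UCB index can be computed in practice, by bisection on the Bernoulli divergence (tolerance and example values are illustrative):

```python
import math

def d_bernoulli(p, q, eps=1e-12):
    """Bernoulli divergence d_BER(p, q)."""
    p, q = min(max(p, eps), 1 - eps), min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb(mu_hat, count, f_m, tol=1e-6):
    """Largest q in [0, 1] such that count * d_BER(mu_hat, q) <= f_m (bisection search)."""
    lo, hi = mu_hat, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if count * d_bernoulli(mu_hat, mid) <= f_m:
            lo = mid
        else:
            hi = mid
    return lo

# Example: empirical mean 0.5 after 3 plays, with f(m) = 2 log m + 2 log log m at m = 20.
m = 20
print(kl_ucb(0.5, 3, 2 * math.log(m) + 2 * math.log(math.log(m))))
```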
Kullback-Leibler Open Loop Optimistic Planning
Figure: the Bernoulli divergence $d_{\text{BER}}(\hat{\mu}_a, q)$ as a function of q ∈ [0, 1], with the bounds $L^\mu_a$ and $U^\mu_a$ located where it crosses the level $f(m)/T_a(m)$.
Conversely,
◮ $U^\mu_a(m) \in I = [0, 1]$ for all a.
◮ The sequence $(U_{a_{1:t}}(m))_t$ is non-increasing.
◮ $B_a(m) = U_a(m)$: the bound sharpening step is superfluous.
Sample complexity
KL-OLOP was introduced and analysed in [Leurent and Maillard 2019].
Theorem (Sample complexity)
KL-OLOP enjoys the same asymptotic regret bounds as OLOP. More precisely, KL-OLOP satisfies:
$$\mathbb{E}[r_n] = \begin{cases} \tilde{O}\left(n^{-\frac{\log 1/\gamma}{\log \kappa'}}\right) & \text{if } \gamma\sqrt{\kappa'} > 1, \\ \tilde{O}\left(n^{-\frac{1}{2}}\right) & \text{if } \gamma\sqrt{\kappa'} \leq 1. \end{cases}$$
Time complexity
Original KL-OLOP: compute B_a(m − 1) from (8) for all a ∈ A^L.
Lazy KL-OLOP
Property (Time and memory complexity)
$$\frac{\mathcal{C}(\text{Lazy KL-OLOP})}{\mathcal{C}(\text{KL-OLOP})} = \frac{nK}{K^L}$$
Experiments — Expanded Trees
Experiments — Highway
Experiments — Gridworld
Experiments — Stochastic Gridworld
References I
Olivier Cappé, Aurélien Garivier, Odalric-Ambrym Maillard, Rémi Munos, and Gilles Stoltz. "Kullback-Leibler Upper Confidence Bounds for Optimal Sequential Allocation". In: The Annals of Statistics 41.3 (2013), pp. 1516–1541.
Nicolas Carrara, Edouard Leurent, Romain Laroche, Tanguy Urvoy, Odalric-Ambrym Maillard, and Olivier Pietquin. "Scaling up Budgeted Reinforcement Learning". In: CoRR abs/1903.01004 (2019). arXiv: 1903.01004. URL: http://arxiv.org/abs/1903.01004.
Pierre-Arnaud Coquelin and Rémi Munos. "Bandit Algorithms for Tree Search". In: Uncertainty in Artificial Intelligence (2007). arXiv: 0703062 [cs]. URL: http://arxiv.org/abs/cs/0703062.
References II
Edouard Leurent, Yann Blanco, Denis Efimov, and Odalric-Ambrym Maillard. "Approximate Robust Control of Uncertain Dynamical Systems". In: NeurIPS Machine Learning for Intelligent Transportation Systems Workshop (2018).
Edouard Leurent, Denis Efimov, Tarek Raïssi, and Wilfrid Perruquetti. "Interval Prediction for Continuous-Time Systems with Parametric Uncertainties". In: (2019).
Edouard Leurent and Odalric-Ambrym Maillard. "Practical Open-Loop Optimistic Planning". In: (2019).