Safe Policy Improvement with Baseline Bootstrapping
Romain Laroche, Paul Trichelair, Rémi Tachet des Combes
Problem setting
Batch setting:
- Fixed dataset, no direct interaction with the environment.
- Access to the behavioural policy, called the baseline.
- Objective: improve the baseline with high probability.
- Commonly encountered in real-world applications, e.g. distributed systems and long trajectories.
Contributions
Novel batch RL algorithm: SPIBB
- SPIBB comes with reliability guarantees in finite MDPs.
- SPIBB is as computationally efficient as classic RL.
Finite MDPs benchmark
- Extensive benchmark of existing algorithms.
- Empirical analysis on random MDPs and baselines.
Infinite MDPs benchmark
- Model-free SPIBB for use with function approximation.
- First deep RL algorithm reliable in the batch setting.
Robust Markov Decision Processes [Iyengar, 2005; Nilim and El Ghaoui, 2005]
- True environment M∗ = ⟨X, A, P∗, R∗, γ⟩ is unknown.
- Maximum Likelihood Estimation (MLE) MDP built from the dataset counts: M̂ = ⟨X, A, P̂, R̂, γ⟩.
- Robust MDP set Ξ(M̂, e): M∗ ∈ Ξ(M̂, e) with probability at least 1 − δ.
- Error function e(x, a) derived from concentration bounds.
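As a concrete illustration (not part of the original slides), here is a minimal sketch of how the MLE MDP could be built from a batch dataset; the flat (x, a, r, x′) encoding of D and the function name are assumptions:

```python
import numpy as np

def mle_mdp(transitions, n_states, n_actions):
    """Build the MLE MDP (P-hat, R-hat) and the counts N_D from a batch.

    `transitions` is a list of (x, a, r, x_next) tuples with integer
    state/action indices -- an assumed encoding of the dataset D.
    """
    counts = np.zeros((n_states, n_actions, n_states))
    reward_sums = np.zeros((n_states, n_actions))
    for x, a, r, x_next in transitions:
        counts[x, a, x_next] += 1
        reward_sums[x, a] += r

    n_d = counts.sum(axis=2)              # state-action counts N_D(x, a)
    visited = np.maximum(n_d, 1)          # guard against division by zero
    p_hat = counts / visited[:, :, None]  # empirical transition kernel
    r_hat = reward_sums / visited         # empirical mean reward
    return p_hat, r_hat, n_d
```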
Existing algorithms
[Petrik et al., 2016]: SPI by robust baseline regret minimization
- Robust MDPs consider the maximin of the value over Ξ → favors over-conservative policies.
- They also consider the maximin of the value improvement → NP-hard problem.
- RaMDP hacks the reward to account for uncertainty: R̂(x, a) ← R̂(x, a) − κadj/√ND(x, a) → not theoretically grounded.
[Thomas, 2015]: High-Confidence Policy Improvement
- HCPI searches for the best regularization hyperparameter allowing safe policy improvement.
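A minimal sketch of the RaMDP reward adjustment, reusing the arrays from the MLE sketch above; the zero-count guard and the default κadj value (picked from the range in the sensitivity study later in the talk) are assumptions:

```python
import numpy as np

def ramdp_reward(r_hat, n_d, kappa_adj=0.003):
    """Pessimistic reward adjustment used by RaMDP (sketch).

    Subtracts an uncertainty penalty that shrinks as the
    state-action count N_D(x, a) grows.
    """
    return r_hat - kappa_adj / np.sqrt(np.maximum(n_d, 1))
```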
Safe Policy Improvement with Baseline Bootstrapping (SPIBB)
- Tractable approximate solution to the robust policy improvement formulation.
- SPIBB allows policy updates only with sufficient evidence.
- Sufficient evidence = state-action count exceeding some threshold hyperparameter N∧.
SPIBB algorithm
- Construction of the bootstrapped set: B = {(x, a) ∈ X × A : ND(x, a) < N∧}.
- Optimization over a constrained policy set: π⊙spibb = argmaxπ∈Πb ρ(π, M̂), where Πb = {π s.t. π(a|x) = πb(a|x) if (x, a) ∈ B}.
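To make the constrained optimization concrete, here is a minimal sketch (not part of the original slides) of one SPIBB policy-improvement step at a single state, assuming tabular numpy arrays; the example on the next slide runs through exactly this computation:

```python
import numpy as np

def spibb_policy_improvement(q, pi_b, bootstrapped):
    """One SPIBB policy-improvement step at a single state (sketch).

    `q`, `pi_b`, `bootstrapped` are per-action arrays: current Q-values,
    baseline probabilities, and membership of (x, a) in the set B.
    Baseline probabilities are copied on bootstrapped actions; the
    remaining mass goes to the best non-bootstrapped action.
    """
    pi = np.where(bootstrapped, pi_b, 0.0)   # keep the baseline on B
    free = ~bootstrapped
    if free.any():
        free_mass = pi_b[free].sum()         # mass of pi_b outside B
        best = np.flatnonzero(free)[np.argmax(q[free])]
        pi[best] = free_mass                 # greedy step on free actions
    return pi
```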
SPIBB policy iteration
Policy improvement step example

Q-value             Baseline policy    Bootstrapping    SPIBB policy update
Q(i)M̂(x, a1) = 1    πb(a1|x) = 0.1     (x, a1) ∈ B      π(i+1)(a1|x) = 0.1
Q(i)M̂(x, a2) = 2    πb(a2|x) = 0.4     (x, a2) ∉ B      π(i+1)(a2|x) = 0.0
Q(i)M̂(x, a3) = 3    πb(a3|x) = 0.3     (x, a3) ∉ B      π(i+1)(a3|x) = 0.7
Q(i)M̂(x, a4) = 4    πb(a4|x) = 0.2     (x, a4) ∈ B      π(i+1)(a4|x) = 0.2

The baseline probabilities are copied on the bootstrapped pairs (a1, a4); the mass of the non-bootstrapped actions (0.4 + 0.3 = 0.7) all moves to a3, the non-bootstrapped action with the highest Q-value.
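Feeding the slide's numbers through the sketch above reproduces the update:

```python
q = np.array([1.0, 2.0, 3.0, 4.0])
pi_b = np.array([0.1, 0.4, 0.3, 0.2])
in_b = np.array([True, False, False, True])      # (x, a1), (x, a4) in B
print(spibb_policy_improvement(q, pi_b, in_b))   # -> [0.1, 0.0, 0.7, 0.2]
```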
Theoretical analysis

Theorem (Convergence). Policy iteration converges to a policy π⊙spibb that is Πb-optimal in the MLE MDP M̂.

Theorem (Safe policy improvement). With high probability 1 − δ:

ρ(π⊙spibb, M∗) − ρ(πb, M∗) ≥ ρ(π⊙spibb, M̂) − ρ(πb, M̂) − (4Vmax/(1 − γ)) √((2/N∧) log(2|X||A|2^|X|/δ))
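As a quick illustration (not part of the original slides), the penalty term of the bound can be evaluated numerically; every constant below is a made-up example value:

```python
import math

def spibb_penalty(v_max, gamma, n_wedge, n_states, n_actions, delta):
    """Error term of the safe-policy-improvement bound (sketch)."""
    log_term = math.log(2 * n_states * n_actions * 2**n_states / delta)
    return 4 * v_max / (1 - gamma) * math.sqrt(2 / n_wedge * log_term)

# Larger N-wedge tightens the guarantee, at the price of a more
# conservative policy update.
print(spibb_penalty(v_max=1.0, gamma=0.9, n_wedge=20,
                    n_states=25, n_actions=4, delta=0.05))
```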
Model-free formulation
SPIBB algorithm
- It may be formulated in a model-free manner by setting the targets:

y(i)j = rj + γ Σ_{a′ : (x′j, a′) ∈ B} πb(a′|x′j) Q(i)(x′j, a′)
           + γ [Σ_{a′ : (x′j, a′) ∉ B} πb(a′|x′j)] max_{a′ : (x′j, a′) ∉ B} Q(i)(x′j, a′).

Theorem (Model-free formulation equivalence). In finite MDPs, the model-free formulation admits a unique fixed point that coincides with the Q-value of π⊙spibb.
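A minimal sketch (not part of the original slides) of this target for a single transition (xj, aj, rj, x′j), assuming tabular numpy arrays indexed by state then action:

```python
import numpy as np

def spibb_target(r, x_next, q, pi_b, bootstrapped, gamma=0.99):
    """Model-free SPIBB regression target for one transition (sketch).

    Bootstrapped next-state actions contribute their baseline-weighted
    Q-values; the baseline mass of the non-bootstrapped actions goes to
    the best such action.
    """
    b = bootstrapped[x_next]              # mask over actions at x'
    q_next, pi = q[x_next], pi_b[x_next]
    target = r + gamma * np.sum(pi[b] * q_next[b])
    if (~b).any():
        target += gamma * pi[~b].sum() * q_next[~b].max()
    return target
```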
25-state stochastic gridworld – mean
[Figure: mean performance vs. number of trajectories in dataset D (10¹ to 10⁴), for the baseline, the optimal policy, basic RL, RaMDP, Robust MDP, HCPI doubly robust, Πb-SPIBB, and Π≤b-SPIBB.]
25-state stochastic gridworld – 1%-CVaR
[Figure: 1%-CVaR performance vs. number of trajectories in dataset D, same algorithms as above.]
Random MDPs, random baseline – 1%-CVaR
[Figure: 1%-CVaR of the normalized performance ρ̄ = (ρ(π, M∗) − ρb)/(ρ∗ − ρb) vs. number of trajectories in dataset D, same algorithms as above.]
Gridworld – RaMDP hyperparameter sensitivity
[Figure: heatmap of the normalized performance ρ̄ = (ρ(π, M∗) − ρb)/(ρ∗ − ρb) of the target policy π, over the number of trajectories (10 to 10000) and κ (0.001 to 0.005).]
Gridworld – SPIBB hyperparameter sensitivity
[Figure: heatmap of the normalized performance ρ̄ = (ρ(π, M∗) − ρb)/(ρ∗ − ρb) of the target policy π, over the number of trajectories (10 to 10000) and N∧ (5 to 100).]
Helicopter domain (continuous task)
[Figure: illustration of the helicopter domain.]
Helicopter domain – benchmark (improved results)
[Figure: mean and 10%-CVaR performance vs. hyperparameter (ε, 1/√N∧, or 1/κ) for the baseline, RaMDP-DQN, Πb-SPIBB-DQN, and Approx-Soft-SPIBB-DQN.]
Vanilla DQN is off the chart: mean = 0.22, 10%-CVaR = −1 (minimal score).
Conclusion
SPIBB
- Assumes a fixed dataset and a known behavioural policy.
- Tractable, provably reliable, sample-efficient algorithm.
- Successfully transferred to DQN architectures.
Follow-up work
- Factored SPIBB [Simão and Spaan, 2019a].
- Structure learning coupled with SPIBB [Simão and Spaan, 2019b].
- Soft SPIBB [Nadjahi et al., 2019].
Still to do
- Improve the pseudo-count/error estimates.
- Investigate an online SPIBB-inspired algorithm.
Thanks for your attention (POSTER #101)
References
- Iyengar, G. N. (2005). Robust dynamic programming. Mathematics of Operations Research.
- Nadjahi, K., Laroche, R., and Tachet des Combes, R. (2019). Safe policy improvement with soft baseline bootstrapping. In Proceedings of the 17th European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD).
- Nilim, A. and El Ghaoui, L. (2005). Robust control of Markov decision processes with uncertain transition matrices. Operations Research.
- Petrik, M., Ghavamzadeh, M., and Chow, Y. (2016). Safe policy improvement by minimizing robust baseline regret. In Proceedings of the 29th Advances in Neural Information Processing Systems (NIPS).
- Simão, T. D. and Spaan, M. T. J. (2019a). Safe policy improvement with baseline bootstrapping in factored environments. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence.
- Simão, T. D. and Spaan, M. T. J. (2019b). Structure learning for safe policy improvement. In Proceedings of the 28th International Joint Conference on Artificial Intelligence (IJCAI).
- Thomas, P. S. (2015). Safe reinforcement learning. PhD thesis, University of Massachusetts Amherst.