SLIDE 1 Safe Reinforcement Learning
Philip S. Thomas. Stanford CS234: Reinforcement Learning, Guest Lecture. May 24, 2017
SLIDE 2 Lecture overview
- What makes a reinforcement learning algorithm safe?
- Notation
- Creating a safe reinforcement learning algorithm
- Off-policy policy evaluation (OPE)
- High-confidence off-policy policy evaluation (HCOPE)
- Safe policy improvement (SPI)
- Empirical results
- Research directions
SLIDE 3
What does it mean for a reinforcement learning algorithm to be safe?
SLIDE 4
SLIDE 5
SLIDE 6 Changing the objective
[Figure: per-step rewards (+0 and +20) along the paths taken by Policy 1 and Policy 2.]
SLIDE 7 Changing the objective
- Policy 1:
- Reward = 0 with probability 0.999999
- Reward = $10^9$ with probability $1 - 0.999999 = 10^{-6}$
- Expected reward = 1000
- Policy 2:
- Reward = 999 with probability 0.5
- Reward = 1000 with probability 0.5
- Expected reward = 999.5
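A quick numeric check of the slide's arithmetic (the numbers are from the slide; the snippet itself is only illustrative):

```python
# Expected reward of each policy, per the numbers on this slide.
p_rare = 1 - 0.999999                # about 1e-6
ev_policy1 = p_rare * 1e9            # reward is 0 otherwise -> about 1000.0
ev_policy2 = 0.5 * 999 + 0.5 * 1000  # -> 999.5
print(ev_policy1, ev_policy2)
# Policy 1 maximizes expected reward, yet it pays nothing on 99.9999% of
# runs -- the usual objective can prefer "risky" policies like this one.
```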
SLIDE 8
Another notion of safety
SLIDE 9
Another notion of safety (Munos et al.)
SLIDE 10
Another notion of safety
SLIDE 11
SLIDE 12 The Problem
- If you apply an existing method, do you have confidence that it will
work?
SLIDE 13
Reinforcement learning successes
SLIDE 14 A property of many real applications
- Deploying “bad” policies can be costly or dangerous.
SLIDE 15
Deploying bad policies can be costly
SLIDE 16
Deploying bad policies can be dangerous
SLIDE 17 What property should a safe algorithm have?
- Guaranteed to work on the first try
- “I guarantee that with probability at least $1 - \delta$, I will not change your policy to one that is worse than the current policy.”
- You get to choose $\delta$
- This guarantee is not contingent on the tuning of any hyperparameters
SLIDE 18 Lecture overview
- What makes a reinforcement learning algorithm safe?
- Notation
- Creating a safe reinforcement learning algorithm
- Off-policy policy evaluation (OPE)
- High-confidence off-policy policy evaluation (HCOPE)
- Safe policy improvement (SPI)
- Empirical results
- Research directions
SLIDE 19 Notation
- Policy: $\pi(a \mid s) = \Pr(A_t = a \mid S_t = s)$
- History: $H = (s_1, a_1, r_1, s_2, a_2, r_2, \dots, s_L, a_L, r_L)$
- Historical data: $D = (H_1, H_2, \dots, H_n)$, generated by a behavior policy, $\pi_b$
- Objective: $J(\pi) = \mathbf{E}\!\left[\sum_{t=1}^{L} \gamma^t R_t \,\middle|\, \pi\right]$
[Diagram: agent-environment loop; the agent sends an action, $a$, and receives a state, $s$, and a reward, $r$.]
SLIDE 20 Safe reinforcement learning algorithm
- Reinforcement learning algorithm, $a$
- Historical data, $D$, which is a random variable
- Policy produced by the algorithm, $a(D)$, which is a random variable
- A safe reinforcement learning algorithm, $a$, satisfies:
$$\Pr\big(J(a(D)) \ge J(\pi_b)\big) \ge 1 - \delta \quad \text{or} \quad \Pr\big(J(a(D)) \ge J_{\min}\big) \ge 1 - \delta$$
SLIDE 21 Lecture overview
- What makes a reinforcement learning algorithm safe?
- Notation
- Creating a safe reinforcement learning algorithm
- Off-policy policy evaluation (OPE)
- High-confidence off-policy policy evaluation (HCOPE)
- Safe policy improvement (SPI)
- Empirical results
- Research directions
SLIDE 22 Creating a safe reinforcement learning algorithm
- Off-policy policy evaluation (OPE)
- For any evaluation policy, $\pi_e$, convert historical data, $D$, into $n$ independent and unbiased estimates of $J(\pi_e)$
- High-confidence off-policy policy evaluation (HCOPE)
- Use a concentration inequality to convert the $n$ independent and unbiased estimates of $J(\pi_e)$ into a $1 - \delta$ confidence lower bound on $J(\pi_e)$
- Safe policy improvement (SPI)
- Use the HCOPE method to create a safe reinforcement learning algorithm, $a$
SLIDE 23 Off-policy policy evaluation (OPE)
Historical Data, $D$ + Proposed Policy, $\pi_e$ → Estimate of $J(\pi_e)$
SLIDE 24 Importance Sampling (Intuition)
Math Slide 2/3
[Figure: probability of each history under the evaluation policy, $\pi_e$, vs. under the behavior policy, $\pi_b$.]
- If the $n$ histories had been generated by $\pi_e$, a Monte Carlo estimate would work:
$$J(\pi_e) \approx \frac{1}{n} \sum_{i=1}^{n} \sum_{t=1}^{L} \gamma^t R_t^i$$
- Since they were generated by $\pi_b$, reweight each history by how relatively likely it is under $\pi_e$:
$$J(\pi_e) \approx \frac{1}{n} \sum_{i=1}^{n} w_i \sum_{t=1}^{L} \gamma^t R_t^i, \qquad w_i = \frac{\Pr(H_i \mid \pi_e)}{\Pr(H_i \mid \pi_b)}$$
- Reminder: history, $H = (s_1, a_1, r_1, s_2, a_2, r_2, \dots, s_L, a_L, r_L)$, so the weight factors as
$$w_i = \prod_{t=1}^{L} \frac{\pi_e(a_t^i \mid s_t^i)}{\pi_b(a_t^i \mid s_t^i)}$$
SLIDE 25 Importance sampling (History)
- Kahn, H., Marshall, A. W. (1953). Methods of reducing sample size in Monte Carlo computations. Journal of the Operations Research Society of America, 1(5):263–278
- Let $X = 0$ with probability $1 - 10^{-10}$ and $X = 10^{10}$ with probability $10^{-10}$
- $\mathbf{E}[X] = 1$
- A Monte Carlo estimate from $n \ll 10^{10}$ samples of $X$ is almost always zero
- Idea: sample $X$ from some other distribution and use importance sampling to “correct” the estimate
- Can produce lower variance estimates.
- Josiah Hanna et al., ICML 2017 (to appear).
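A minimal simulation of the idea, assuming a proposal distribution that draws the rare outcome half the time (that choice is ours, not the slide's):

```python
import numpy as np

rng = np.random.default_rng(0)
p_rare, value = 1e-10, 1e10          # X = value w.p. p_rare, else 0; E[X] = 1
n = 1_000_000

# Naive Monte Carlo: the rare outcome essentially never appears.
mc = np.where(rng.random(n) < p_rare, value, 0.0).mean()

# Importance sampling: draw the rare outcome half the time under q,
# then reweight each sample by p(x) / q(x).
rare = rng.random(n) < 0.5
w = np.where(rare, p_rare / 0.5, (1 - p_rare) / 0.5)
x = np.where(rare, value, 0.0)
is_est = (w * x).mean()

print(mc, is_est)                    # ~0.0 vs ~1.0
```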
SLIDE 26 Importance sampling (History, continued)
- Precup, D., Sutton, R. S., Singh, S. (2000). Eligibility traces for off-
policy policy evaluation. In Proceedings of the 17th International Conference on Machine Learning, pp. 759–766. Morgan Kaufmann
SLIDE 27 Importance sampling (Proof)
- Estimate $\mathbf{E}_p[f(X)]$ given a sample of $X \sim q$
- Let $P = \operatorname{supp}(p)$, $Q = \operatorname{supp}(q)$, and $F = \operatorname{supp}(f)$
- Importance sampling estimate: $\dfrac{p(X)}{q(X)} f(X)$
$$\mathbf{E}_q\!\left[\frac{p(X)}{q(X)} f(X)\right] = \sum_{x \in Q} q(x) \frac{p(x)}{q(x)} f(x) = \sum_{x \in P \cap Q} p(x) f(x) = \sum_{x \in P} p(x) f(x) - \sum_{x \in P \cap \bar{Q}} p(x) f(x)$$
(The last step uses $\sum_{x \in P \cap Q} = \sum_{x \in P} - \sum_{x \in P \cap \bar{Q}}$.)
SLIDE 28 Importance sampling (Proof)
- Assume $P \subseteq Q$ (the assumption can be relaxed to $P \subseteq Q \cup \bar{F}$, i.e., $f$ vanishes wherever $p$ has support but $q$ does not)
- Importance sampling is an unbiased estimator of $\mathbf{E}_p[f(X)]$:
$$\mathbf{E}_q\!\left[\frac{p(X)}{q(X)} f(X)\right] = \sum_{x \in P} p(x) f(x) - \underbrace{\sum_{x \in P \cap \bar{Q}} p(x) f(x)}_{= 0 \text{ since } P \cap \bar{Q} = \emptyset} = \sum_{x \in P} p(x) f(x) = \mathbf{E}_p[f(X)]$$
SLIDE 29 Importance sampling (Proof)
- Assume instead only that $f(x) \ge 0$ for all $x$
- Importance sampling is then a negative-bias estimator of $\mathbf{E}_p[f(X)]$:
$$\mathbf{E}_q\!\left[\frac{p(X)}{q(X)} f(X)\right] = \sum_{x \in P} p(x) f(x) - \sum_{x \in P \cap \bar{Q}} p(x) f(x) \le \sum_{x \in P} p(x) f(x) = \mathbf{E}_p[f(X)]$$
SLIDE 30 Importance sampling (reminder)
$$\mathrm{IS}(D) = \frac{1}{n} \sum_{i=1}^{n} \left( \prod_{t=1}^{L} \frac{\pi_e(a_t^i \mid s_t^i)}{\pi_b(a_t^i \mid s_t^i)} \right) \sum_{t=1}^{L} \gamma^t R_t^i$$
$$\mathbf{E}[\mathrm{IS}(D)] = J(\pi_e)$$
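As a concrete sketch, here is the estimator in Python; the trajectory format and the `pi_e`/`pi_b` callables are assumptions of this sketch, not code from the lecture:

```python
import numpy as np

def is_estimate(trajectories, pi_e, pi_b, gamma=1.0):
    """Ordinary importance sampling: one cumulative weight per trajectory.

    trajectories: list of [(s, a, r), ...]
    pi_e, pi_b:   callables mapping (a, s) -> action probability
    """
    estimates = []
    for traj in trajectories:
        w, g = 1.0, 0.0
        for t, (s, a, r) in enumerate(traj, start=1):
            w *= pi_e(a, s) / pi_b(a, s)  # likelihood ratio of the whole history
            g += gamma ** t * r           # discounted return, as on the slide
        estimates.append(w * g)           # unbiased estimate of J(pi_e)
    return float(np.mean(estimates))
```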
SLIDE 31 Creating a safe reinforcement learning algorithm
- Off-policy policy evaluation (OPE)
- For any evaluation policy, $\pi_e$, convert historical data, $D$, into $n$ independent and unbiased estimates of $J(\pi_e)$
- High-confidence off-policy policy evaluation (HCOPE)
- Use a concentration inequality to convert the $n$ independent and unbiased estimates of $J(\pi_e)$ into a $1 - \delta$ confidence lower bound on $J(\pi_e)$
- Safe policy improvement (SPI)
- Use the HCOPE method to create a safe reinforcement learning algorithm, $a$
SLIDE 32 High confidence off-policy policy evaluation (HCOPE)
Historical Data, $D$ + Proposed Policy, $\pi_e$ + Probability, $1 - \delta$ → $1 - \delta$ confidence lower bound on $J(\pi_e)$
SLIDE 33 Hoeffding’s inequality
Math Slide 3/3
- Let $X_1, \dots, X_n$ be $n$ independent and identically distributed random variables such that $X_i \in [0, b]$
- Then with probability at least $1 - \delta$:
$$\mathbf{E}[X_i] \ge \frac{1}{n} \sum_{i=1}^{n} X_i - b \sqrt{\frac{\ln(1/\delta)}{2n}}$$
- Here each $X_i$ is an importance-weighted return: $X_i = w_i \sum_{t=1}^{L} \gamma^t R_t^i$
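In code, the bound is one line (a sketch; here `y` would hold the $n$ importance-weighted returns):

```python
import numpy as np

def hoeffding_lower_bound(y, b, delta=0.05):
    """1 - delta confidence lower bound on E[Y] for i.i.d. samples in [0, b]."""
    y = np.asarray(y, dtype=float)
    return y.mean() - b * np.sqrt(np.log(1.0 / delta) / (2.0 * len(y)))
```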
SLIDE 34 Creating a safe reinforcement learning algorithm
- Off-policy policy evaluation (OPE)
- For any evaluation policy, $\pi_e$, convert historical data, $D$, into $n$ independent and unbiased estimates of $J(\pi_e)$
- High-confidence off-policy policy evaluation (HCOPE)
- Use a concentration inequality to convert the $n$ independent and unbiased estimates of $J(\pi_e)$ into a $1 - \delta$ confidence lower bound on $J(\pi_e)$
- Safe policy improvement (SPI)
- Use the HCOPE method to create a safe reinforcement learning algorithm, $a$
SLIDE 35 Safe policy improvement (SPI)
Historical Data, $D$ + Probability, $1 - \delta$ → New policy, $\pi$, or No Solution Found
SLIDE 36 Safe policy improvement (SPI)
[Diagram: Historical Data is split into a Training Set (20%), used to select a Candidate Policy, $\pi$, and a Testing Set (80%), used for the Safety Test.]
Safety Test: is the $1 - \delta$ confidence lower bound on $J(\pi)$ larger than $J(\pi_{\mathrm{cur}})$?
SLIDE 37 Creating a safe reinforcement learning algorithm
- Off-policy policy evaluation (OPE)
- For any evaluation policy, $\pi_e$, convert historical data, $D$, into $n$ independent and unbiased estimates of $J(\pi_e)$
- High-confidence off-policy policy evaluation (HCOPE)
- Use a concentration inequality to convert the $n$ independent and unbiased estimates of $J(\pi_e)$ into a $1 - \delta$ confidence lower bound on $J(\pi_e)$
- Safe policy improvement (SPI)
- Use the HCOPE method to create a safe reinforcement learning algorithm, $a$
WON’T WORK
SLIDE 38 Off-policy policy evaluation (revisited)
- Importance sampling (IS):
$$\mathrm{IS}(D) = \frac{1}{n} \sum_{i=1}^{n} \left( \prod_{t=1}^{L} \frac{\pi_e(a_t^i \mid s_t^i)}{\pi_b(a_t^i \mid s_t^i)} \right) \sum_{t=1}^{L} \gamma^t R_t^i$$
- Per-decision importance sampling (PDIS):
$$\mathrm{PDIS}(D) = \frac{1}{n} \sum_{i=1}^{n} \sum_{t=1}^{L} \gamma^t \left( \prod_{\tau=1}^{t} \frac{\pi_e(a_\tau^i \mid s_\tau^i)}{\pi_b(a_\tau^i \mid s_\tau^i)} \right) R_t^i$$
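A sketch of PDIS with the same assumed interface as the IS sketch above; the only change is that each reward is multiplied by the likelihood ratios of the steps up to it:

```python
import numpy as np

def pdis_estimate(trajectories, pi_e, pi_b, gamma=1.0):
    """Per-decision importance sampling."""
    estimates = []
    for traj in trajectories:
        w, g = 1.0, 0.0
        for t, (s, a, r) in enumerate(traj, start=1):
            w *= pi_e(a, s) / pi_b(a, s)  # product over tau = 1..t only
            g += gamma ** t * w * r       # later actions cannot affect earlier rewards
        estimates.append(g)
    return float(np.mean(estimates))
```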
SLIDE 39 Off-policy policy evaluation (revisited)
- Importance sampling (IS):
$$\mathrm{IS}(D) = \frac{1}{n} \sum_{i=1}^{n} w_i \sum_{t=1}^{L} \gamma^t R_t^i$$
- Weighted importance sampling (WIS):
$$\mathrm{WIS}(D) = \frac{1}{\sum_{i=1}^{n} w_i} \sum_{i=1}^{n} w_i \sum_{t=1}^{L} \gamma^t R_t^i$$
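And a matching WIS sketch, again with the assumed interface from earlier:

```python
import numpy as np

def wis_estimate(trajectories, pi_e, pi_b, gamma=1.0):
    """Weighted importance sampling: normalize by the summed weights, not n."""
    weights, returns = [], []
    for traj in trajectories:
        w, g = 1.0, 0.0
        for t, (s, a, r) in enumerate(traj, start=1):
            w *= pi_e(a, s) / pi_b(a, s)
            g += gamma ** t * r
        weights.append(w)
        returns.append(g)
    weights, returns = np.asarray(weights), np.asarray(returns)
    return float((weights * returns).sum() / weights.sum())  # biased but consistent
```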
SLIDE 40 Off-policy policy evaluation (revisited)
- Weighted importance sampling (WIS):
$$\mathrm{WIS}(D) = \frac{1}{\sum_{i=1}^{n} w_i} \sum_{i=1}^{n} w_i \sum_{t=1}^{L} \gamma^t R_t^i$$
- Not unbiased. When $n = 1$, $\mathbf{E}[\mathrm{WIS}(D)] = J(\pi_b)$
- Strongly consistent estimator of $J(\pi_e)$, i.e.,
$$\Pr\left( \lim_{n \to \infty} \mathrm{WIS}(D) = J(\pi_e) \right) = 1$$
- If:
- Finite horizon
- One behavior policy, or bounded rewards
SLIDE 41 Off-policy policy evaluation (revisited)
- Weighted per-decision importance sampling
- Also called consistent weighted per-decision importance sampling
- A fun exercise!
SLIDE 42 Control variates
- Given: $X$
- Estimate: $\mu = \mathbf{E}[X]$
- Estimator: $\hat{\mu} = X$
- $\mathbf{E}[\hat{\mu}] = \mathbf{E}[X] = \mu$
- $\mathrm{Var}(\hat{\mu}) = \mathrm{Var}(X)$
SLIDE 43 Control variates
- Given: $X$, $Y$, and $\mathbf{E}[Y]$
- Estimate: $\mu = \mathbf{E}[X]$
- Estimator: $\hat{\mu} = X - Y + \mathbf{E}[Y]$
$$\mathbf{E}[\hat{\mu}] = \mathbf{E}\big[X - Y + \mathbf{E}[Y]\big] = \mathbf{E}[X] - \mathbf{E}[Y] + \mathbf{E}[Y] = \mathbf{E}[X] = \mu$$
$$\mathrm{Var}(\hat{\mu}) = \mathrm{Var}\big(X - Y + \mathbf{E}[Y]\big) = \mathrm{Var}(X - Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) - 2\,\mathrm{Cov}(X, Y)$$
- Lower variance if $2\,\mathrm{Cov}(X, Y) > \mathrm{Var}(Y)$
- We call $Y$ a control variate.
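A small simulation of this variance reduction (the distributions are arbitrary choices for the demo):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
y = rng.normal(0.0, 1.0, n)            # Y: control variate with known E[Y] = 0
x = 3.0 + y + rng.normal(0.0, 0.3, n)  # X: correlated with Y; mu = E[X] = 3

plain = x                              # estimator X
cv = x - y + 0.0                       # estimator X - Y + E[Y]
print(plain.mean(), cv.mean())         # both ~3.0 (unbiased)
print(plain.var(), cv.var())           # ~1.09 vs ~0.09: 2 Cov(X, Y) > Var(Y)
```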
SLIDE 44 Off-policy policy evaluation (revisited)
- Idea: add a control variate to importance sampling estimators
- $X$ is the importance sampling estimator
- $Y$ is a control variate built from an approximate model of the MDP
- $\mathbf{E}[Y] = 0$ in this case
- $\mathrm{PDIS}^{\mathrm{CV}}(D) = \mathrm{PDIS}(D) - \mathrm{CV}(D)$
- Called the doubly robust estimator (Jiang and Li, 2015)
- Robust to 1) a poor approximate model, and 2) errors in the estimate of $\pi_b$
- If the model is poor, the estimates are still unbiased
- If the sampling policy is unknown, but the model is good, the MSE will still be low
- $\mathrm{DR}(D) = \mathrm{PDIS}^{\mathrm{CV}}(D)$
- Non-recursive and weighted forms, as well as the control variate view, are provided by Thomas and Brunskill (2016)
SLIDE 45 Off-policy policy evaluation (revisited)
$$\mathrm{DR}(D) = \frac{1}{n} \sum_{i=1}^{n} \sum_{t=0}^{\infty} \gamma^t \left[ w_t^i \left( R_t^i - \hat{q}^{\pi_e}(S_t^i, A_t^i) \right) + w_{t-1}^i \, \hat{v}^{\pi_e}(S_t^i) \right]$$
- Recall: we want the control variate, $Y$, to cancel with $X$; each per-step term has the familiar temporal-difference form
$$R - \hat{q}(S, A) + \gamma \hat{v}(S')$$
- Per-decision importance sampling (PDIS) weights:
$$w_t^i = \prod_{\tau=1}^{t} \frac{\pi_e(a_\tau^i \mid s_\tau^i)}{\pi_b(a_\tau^i \mid s_\tau^i)}$$
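A sketch of the DR estimator under the same assumed trajectory interface; `q_hat` and `v_hat` stand in for whatever approximate model is available:

```python
import numpy as np

def dr_estimate(trajectories, pi_e, pi_b, q_hat, v_hat, gamma=1.0):
    """Doubly robust OPE: PDIS plus a model-based control variate.

    q_hat(s, a) and v_hat(s) are assumed approximations of the evaluation
    policy's action values and state values (e.g., from a learned model).
    """
    estimates = []
    for traj in trajectories:
        w_prev, w, g = 1.0, 1.0, 0.0        # w_{-1} = 1 by convention
        for t, (s, a, r) in enumerate(traj):  # slide 45 sums from t = 0
            w *= pi_e(a, s) / pi_b(a, s)
            g += gamma ** t * (w * (r - q_hat(s, a)) + w_prev * v_hat(s))
            w_prev = w
        estimates.append(g)
    return float(np.mean(estimates))
```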
SLIDE 46 Empirical Results (Gridworld)
[Figure: log-log plot of mean squared error (0.001 to 10,000) vs. number of episodes, $n$ (2 to 2,000), for IS and AM (approximate model).]
The approximate model is the direct method of Dudík et al. (2011); Sutton and Barto (1998) would call it an indirect method.
SLIDE 47 Empirical Results (Gridworld)
[Figure: the same MSE plot with PDIS added to IS and AM.]
SLIDE 48 Empirical Results (Gridworld)
[Figure: the same MSE plot with DR added.]
SLIDE 49 Empirical Results (Gridworld)
[Figure: the same MSE plot with WIS and CWPDIS added.]
SLIDE 50 Empirical Results (Gridworld)
[Figure: the same MSE plot with WDR added.]
SLIDE 51 Off-policy policy evaluation (revisited)
- What if $\operatorname{supp}(\pi_e) \subset \operatorname{supp}(\pi_b)$?
- Then there is a state-action pair, $(s, a)$, such that $\pi_e(a \mid s) = 0$ but $\pi_b(a \mid s) \ne 0$
- If we see a history where $(s, a)$ occurs, what weight should we give it? The zero numerator in
$$\mathrm{IS}(D) = \frac{1}{n} \sum_{i=1}^{n} \left( \prod_{t=1}^{L} \frac{\pi_e(a_t^i \mid s_t^i)}{\pi_b(a_t^i \mid s_t^i)} \right) \sum_{t=1}^{L} \gamma^t R_t^i$$
makes the whole weight zero.
SLIDE 52 Off-policy policy evaluation (revisited)
- What if there are zero samples ($n = 0$)?
- The importance sampling estimate is undefined
- What if no samples are in $\operatorname{supp}(\pi_e)$ (or $\operatorname{supp}(p)$ in general)?
- Importance sampling says: the estimate is zero
- Alternate approach: undefined
- The importance sampling estimator is unbiased if $n > 0$
- The alternate approach is unbiased given that at least one sample is in the support of $p$
- The alternate approach is detailed in Importance Sampling with Unequal Support (Thomas and Brunskill, AAAI 2017)
SLIDE 53
Off-policy policy evaluation (revisited)
SLIDE 54 Off-policy policy evaluation (revisited)
- Thomas et al. Predictive Off-Policy Policy Evaluation for
Nonstationary Decision Problems, with Applications to Digital Marketing (AAAI 2017)
SLIDE 55 Off-policy policy evaluation (revisited)
- Thomas and Brunskill. Data-Efficient Off-Policy Policy Evaluation for
Reinforcement Learning (ICML 2016)
[Figure: the gridworld MSE plot with MAGIC added to IS, PDIS, WIS, CWPDIS, DR, AM, and WDR.]
[Figure: a second plot of mean squared error (0.01 to 10) vs. number of episodes, $n$ (1 to 10,000), for IS, DR, AM, WDR, and MAGIC.]
SLIDE 56 Creating a safe reinforcement learning algorithm
- Off-policy policy evaluation (OPE)
- For any evaluation policy, 𝜌e, Convert historical data, 𝐸, into 𝑜 independent
and unbiased estimates of 𝐾 𝜌e
- High-confidence off-policy policy evaluation (HCOPE)
- Use a concentration inequality to convert the 𝑜 independent and unbiased
estimates of 𝐾 𝜌e into a 1 − 𝜀 confidence lower bound on 𝐾 𝜌e
- Safe policy improvement (SPI)
- Use HCOPE method to create a safe reinforcement learning algorithm, 𝑏
SLIDE 57 High-confidence off-policy policy evaluation (revisited)
- Consider using IS + Hoeffding’s inequality for HCOPE on mountain car
Natural Temporal Difference Learning, Dabney and Thomas, 2014
SLIDE 58 High-confidence off-policy policy evaluation (revisited)
- Using 100,000 trajectories
- Evaluation policy’s true performance is 0.19 (returns normalized to $[0, 1]$).
- We get a 95% confidence lower bound of:
−5,831,000
SLIDE 59 What went wrong?
- The importance weights are products of $L$ per-step ratios, so they can be astronomically large:
$$w_i = \prod_{t=1}^{L} \frac{\pi_e(a_t \mid s_t)}{\pi_b(a_t \mid s_t)}$$
SLIDE 60 What went wrong?
$$\mathbf{E}[X_i] \ge \frac{1}{n} \sum_{i=1}^{n} X_i - b \sqrt{\frac{\ln(1/\delta)}{2n}}$$
- Here the range parameter is $b \approx 10^{9.4}$, while the largest observed importance-weighted return is 316.
SLIDE 61 High-confidence off-policy policy evaluation (revisited)
- Removing the upper tail only decreases the expected value.
SLIDE 62 High-confidence off-policy policy evaluation (revisited)
- Thomas et al. High confidence off-policy evaluation, AAAI 2015
SLIDE 63
High-confidence off-policy policy evaluation (revisited)
SLIDE 64 High-confidence off-policy policy evaluation (revisited)
- Use 20% of the data to optimize the truncation level, $c$.
- Use 80% of the data to compute the lower bound with the optimized $c$.
- Mountain car results:

Inequality            95% confidence lower bound on the mean
CUT                   0.145
Chernoff-Hoeffding    -5,831,000
Maurer                -129,703
Anderson              0.055
Bubeck et al.         -0.046
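The CUT inequality itself is beyond a slide sketch, but the two-step recipe can be illustrated with simple truncation plus Hoeffding, which is valid for nonnegative returns because removing the upper tail only decreases the expected value (see slide 61):

```python
import numpy as np

def truncated_lower_bound(y_train, y_test, delta=0.05):
    """Sketch only -- not the CUT inequality of Thomas et al. (2015).

    Assumes y >= 0, so min(y, c) lies in [0, c] and its mean lower-bounds E[Y].
    """
    def bound(y, c):
        return np.minimum(y, c).mean() - c * np.sqrt(np.log(1 / delta) / (2 * len(y)))
    candidates = np.quantile(y_train, [0.5, 0.75, 0.9, 0.95, 0.99, 1.0])
    c = max(candidates, key=lambda cand: bound(y_train, cand))  # optimize c on 20%
    return bound(y_test, c)                                     # bound on the 80%
```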
SLIDE 65 High-confidence off-policy policy evaluation (revisited)
SLIDE 66 High-confidence off-policy policy evaluation (revisited)
𝐅 𝑌𝑗 ≥ 1 𝑜 𝑌𝑗 −
𝑜 𝑗=1
𝑐 ln 1 𝜀 2𝑜
SLIDE 67 High-confidence off-policy policy evaluation (revisited)
- Student’s $t$-test
- Assumes that $\mathrm{IS}(D)$ is normally distributed
- By the central limit theorem, it is (as $n \to \infty$)
- Efron’s Bootstrap methods (e.g., BCa)
- Also, without importance sampling: Hanna, Stone, and Niekum, AAMAS 2017
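A sketch of the $t$-based lower bound (SciPy's `stats.t.ppf` supplies the quantile):

```python
import numpy as np
from scipy import stats

def t_lower_bound(y, delta=0.05):
    """One-sided Student's t lower bound on the mean.

    Only approximate for importance-sampled returns: their heavy right skew
    violates the normality assumption at practical sample sizes.
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    return y.mean() - stats.t.ppf(1 - delta, n - 1) * y.std(ddof=1) / np.sqrt(n)
```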
SLIDE 68 High-confidence off-policy policy evaluation (revisited)
- P. S. Thomas. Safe reinforcement learning (PhD Thesis, 2015)
SLIDE 69 Creating a safe reinforcement learning algorithm
- Off-policy policy evaluation (OPE)
- For any evaluation policy, $\pi_e$, convert historical data, $D$, into $n$ independent and unbiased estimates of $J(\pi_e)$
- High-confidence off-policy policy evaluation (HCOPE)
- Use a concentration inequality to convert the $n$ independent and unbiased estimates of $J(\pi_e)$ into a $1 - \delta$ confidence lower bound on $J(\pi_e)$
- Safe policy improvement (SPI)
- Use the HCOPE method to create a safe reinforcement learning algorithm, $a$
SLIDE 70 Safe policy improvement (revisited)
[Diagram: Historical Data is split into a Training Set (20%), used to select a Candidate Policy, $\pi$, and a Testing Set (80%), used for the Safety Test.]
Safety Test: is the $1 - \delta$ confidence lower bound on $J(\pi)$ larger than $J(\pi_{\mathrm{cur}})$?
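Putting the template into pseudocode-like Python; `propose_candidate` and `hcope_lower_bound` are hypothetical stand-ins for the candidate search and for any of the HCOPE bounds above:

```python
def safe_policy_improvement(data, pi_b, j_cur, delta=0.05):
    """Skeleton of the SPI template above (names are illustrative)."""
    n_train = len(data) // 5                       # 20% / 80% split
    train, test = data[:n_train], data[n_train:]
    pi_c = propose_candidate(train, pi_b)          # e.g., maximize a predicted bound
    if hcope_lower_bound(test, pi_c, pi_b, delta) > j_cur:
        return pi_c                                # safe to deploy (w.p. >= 1 - delta)
    return None                                    # "No Solution Found"
```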
SLIDE 71 Lecture overview
- What makes a reinforcement learning algorithm safe?
- Notation
- Creating a safe reinforcement learning algorithm
- Off-policy policy evaluation (OPE)
- High-confidence off-policy policy evaluation (HCOPE)
- Safe policy improvement (SPI)
- Empirical results
- Research directions
SLIDE 72 Empirical Results
- What to look for:
- Data efficiency
- Error rates
SLIDE 73
Empirical Results: Mountain Car
SLIDE 74
Empirical Results: Mountain Car
SLIDE 75
Empirical Results: Mountain Car
SLIDE 76 Empirical Results: Digital Marketing
[Diagram: agent-environment loop; the agent sends an action, $a$, and receives a state, $s$, and a reward, $r$.]
SLIDE 77 Empirical Results: Digital Marketing
[Figure: expected normalized return (roughly 0.0027 to 0.0038) vs. number of episodes, $n$ = 10,000 to 100,000, for four variants: None + CUT, None + BCa, k-Fold + CUT, k-Fold + BCa.]
SLIDE 78
Empirical Results: Digital Marketing
SLIDE 79
Empirical Results: Digital Marketing
SLIDE 80 Example Results: Diabetes Treatment
[Diagram: blood glucose (sugar) rises when the person eats carbohydrates and falls when insulin is released.]
SLIDE 81 Example Results: Diabetes Treatment
[Diagram: same loop; blood glucose that stays too high is hyperglycemia.]
SLIDE 82 Example Results: Diabetes Treatment
[Diagram: same loop; blood glucose that goes too low is hypoglycemia, too high is hyperglycemia.]
SLIDE 83 Example Results: Diabetes Treatment
$$\text{injection} = \frac{\text{blood glucose} - \text{target blood glucose}}{CF} + \frac{\text{meal size}}{CR}$$
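Read as code (a sketch: $CF$ and $CR$ are interpreted here as the usual correction factor and carb ratio, the two patient-specific parameters being tuned):

```python
def bolus_injection(blood_glucose, target, meal_size, cf, cr):
    """The slide's parameterized treatment policy: a correction term plus a
    meal term. Safe RL searches over cf and cr without risking a policy
    that is worse than the patient's current one."""
    return (blood_glucose - target) / cf + meal_size / cr
```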
SLIDE 84 Example Results: Diabetes Treatment
Intelligent Diabetes Management
SLIDE 85 Example Results: Diabetes Treatment
[Figure: two panels, the probability that the policy was changed and the probability that the policy was made worse.]
SLIDE 86 Future Directions
- How to deal with long horizons?
- How to deal with importance sampling being “unfair”?
- What to do when the behavior policy is not known?
- What to do when the behavior policy is deterministic?
SLIDE 87 Summary
- Safe reinforcement learning
- Risk-sensitive
- Learning from demonstration
- Asymptotic convergence even if data is off-policy
- Guaranteed (with probability $1 - \delta$) not to make the policy worse
- Designing a safe reinforcement learning algorithm:
- Off-policy policy evaluation (OPE)
- IS, PDIS, WIS, WPDIS, DR, WDR, US, TSP, MAGIC
- High confidence off-policy policy evaluation (HCOPE)
- Hoeffding, the CUT inequality, Student’s $t$-test, BCa
- Safe policy improvement (SPI)
- Selecting the candidate policy
SLIDE 88 Takeaway
- Safe reinforcement learning is tractable!
- Not just polynomial amounts of data – practical amounts of data