SLIDE 1

Safe Reinforcement Learning

Philip S. Thomas Stanford CS234: Reinforcement Learning, Guest Lecture May 24, 2017

SLIDE 2

Lecture overview

  • What makes a reinforcement learning algorithm safe?
  • Notation
  • Creating a safe reinforcement learning algorithm
  • Off-policy policy evaluation (OPE)
  • High-confidence off-policy policy evaluation (HCOPE)
  • Safe policy improvement (SPI)
  • Empirical results
  • Research directions
SLIDE 3

What does it mean for a reinforcement learning algorithm to be safe?

SLIDE 4

SLIDE 5

SLIDE 6

Changing the objective

[Figure: Policy 1 and Policy 2 compared, with per-step rewards of +0 and +20 and a value of 50.]

SLIDE 7

Changing the objective

  • Policy 1:
    • Reward = 0 with probability 0.999999
    • Reward = 10^9 with probability 1 − 0.999999
    • Expected reward ≈ 1000
  • Policy 2:
    • Reward = 999 with probability 0.5
    • Reward = 1000 with probability 0.5
    • Expected reward = 999.5
SLIDE 8

Another notion of safety

SLIDE 9

Another notion of safety (Munos et al.)

SLIDE 10

Another notion of safety

SLIDE 11

SLIDE 12

The Problem

  • If you apply an existing method, do you have confidence that it will work?

SLIDE 13

Reinforcement learning successes

SLIDE 14

A property of many real applications

  • Deploying “bad” policies can be costly or dangerous.
SLIDE 15

Deploying bad policies can be costly

SLIDE 16

Deploying bad policies can be dangerous

SLIDE 17

What property should a safe algorithm have?

  • Guaranteed to work on the first try
  • “I guarantee that with probability at least 1 − δ, I will not change your policy to one that is worse than the current policy.”
  • You get to choose δ
  • This guarantee is not contingent on the tuning of any hyperparameters
SLIDE 18

Lecture overview

  • What makes a reinforcement learning algorithm safe?
  • Notation
  • Creating a safe reinforcement learning algorithm
  • Off-policy policy evaluation (OPE)
  • High-confidence off-policy policy evaluation (HCOPE)
  • Safe policy improvement (SPI)
  • Empirical results
  • Research directions
SLIDE 19

Notation

  • Policy, π:

    π(a|s) = Pr(A_t = a | S_t = s)

  • History:

    H = (s_1, a_1, r_1, s_2, a_2, r_2, …, s_L, a_L, r_L)

  • Historical data:

    D = {H_1, H_2, …, H_n}

  • Historical data from behavior policy, π_b
  • Objective:

    J(π) = E[ Σ_{t=1}^{L} γ^t R_t | π ]

[Diagram: agent–environment loop; the agent sends an action, a, to the environment and receives a state, s, and a reward, r.]

SLIDE 20

Safe reinforcement learning algorithm

  • Reinforcement learning algorithm, a
  • Historical data, D, which is a random variable
  • Policy produced by the algorithm, a(D), which is a random variable
  • A safe reinforcement learning algorithm, a, satisfies:

    Pr( J(a(D)) ≥ J(π_b) ) ≥ 1 − δ

  • Or, in general:

    Pr( J(a(D)) ≥ J_min ) ≥ 1 − δ

SLIDE 21

Lecture overview

  • What makes a reinforcement learning algorithm safe?
  • Notation
  • Creating a safe reinforcement learning algorithm
  • Off-policy policy evaluation (OPE)
  • High-confidence off-policy policy evaluation (HCOPE)
  • Safe policy improvement (SPI)
  • Empirical results
  • Research directions
SLIDE 22

Creating a safe reinforcement learning algorithm

  • Off-policy policy evaluation (OPE)
    • For any evaluation policy, π_e, convert historical data, D, into n independent and unbiased estimates of J(π_e)
  • High-confidence off-policy policy evaluation (HCOPE)
    • Use a concentration inequality to convert the n independent and unbiased estimates of J(π_e) into a 1 − δ confidence lower bound on J(π_e)
  • Safe policy improvement (SPI)
    • Use the HCOPE method to create a safe reinforcement learning algorithm, a
SLIDE 23

Off-policy policy evaluation (OPE)

[Diagram: OPE takes the historical data, D, and a proposed policy, π_e, and produces an estimate of J(π_e).]

SLIDE 24

Importance Sampling (Intuition)

Math Slide 2/3

[Figure: the probability of each history under the evaluation policy, π_e, versus under the behavior policy, π_b.]

  • Naive Monte Carlo estimate (only valid if the data had come from π_e):

    J(π_e) ≈ (1/n) Σ_{j=1}^{n} Σ_{t=1}^{L} γ^t R_t^j

  • Importance sampling estimate, which reweights each history by how likely it is under π_e relative to π_b:

    J(π_e) ≈ (1/n) Σ_{j=1}^{n} w_j Σ_{t=1}^{L} γ^t R_t^j

    where w_j = Pr(H_j | π_e) / Pr(H_j | π_b) = Π_{t=1}^{L} π_e(a_t^j | s_t^j) / π_b(a_t^j | s_t^j)

  • Reminder:
    • History, H = (s_1, a_1, r_1, s_2, a_2, r_2, …, s_L, a_L, r_L)
    • Objective, J(π_e) = E[ Σ_{t=1}^{L} γ^t R_t | π_e ]
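A minimal code sketch of the ordinary importance sampling estimator above, assuming trajectories are stored as lists of (s, a, r) tuples and that pi_e and pi_b are callables returning action probabilities (these names and the data layout are illustrative, not from the lecture):

```python
import numpy as np

def is_estimate(trajectories, pi_e, pi_b, gamma=1.0):
    """Ordinary importance sampling estimate of J(pi_e) from data generated by pi_b."""
    per_trajectory = []
    for traj in trajectories:
        weight = 1.0   # w_j = prod_t pi_e(a_t | s_t) / pi_b(a_t | s_t)
        ret = 0.0      # discounted return of this trajectory
        for t, (s, a, r) in enumerate(traj):
            weight *= pi_e(a, s) / pi_b(a, s)
            ret += (gamma ** t) * r
        per_trajectory.append(weight * ret)
    # Each element is an independent, unbiased estimate of J(pi_e);
    # the IS estimate is their mean.
    return np.mean(per_trajectory), np.array(per_trajectory)
```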

SLIDE 25

Importance sampling (History)

  • Kahn, H., Marshall, A. W. (1953). Methods of reducing sample size in Monte Carlo computations. Journal of the Operations Research Society of America, 1(5):263–278
  • Let X = 0 with probability 1 − 10^−10 and X = 10^10 with probability 10^−10
    • E[X] = 1
    • A Monte Carlo estimate from n ≪ 10^10 samples of X is almost always zero
  • Idea: sample X from some other distribution and use importance sampling to “correct” the estimate
    • Can produce lower-variance estimates
    • Josiah Hanna et al., ICML 2017 (to appear)
SLIDE 26

Importance sampling (History, continued)

  • Precup, D., Sutton, R. S., Singh, S. (2000). Eligibility traces for off-policy policy evaluation. In Proceedings of the 17th International Conference on Machine Learning, pp. 759–766. Morgan Kaufmann

SLIDE 27

Importance sampling (Proof)

  • Estimate E_p[f(X)] given a sample of X ~ q
  • Let P = supp(p), Q = supp(q), and F = supp(f)
  • Importance sampling estimate: (p(X)/q(X)) f(X)

    E_q[ (p(X)/q(X)) f(X) ] = Σ_{x∈Q} q(x) (p(x)/q(x)) f(x)
                            = Σ_{x∈Q} p(x) f(x)
                            = Σ_{x∈P} p(x) f(x) − Σ_{x∈P\Q} p(x) f(x)

SLIDE 28

Importance sampling (Proof)

  • Assume P ⊆ Q (the assumption can be relaxed to P ⊆ Q ∪ F̄, where F̄ denotes the complement of F)
  • Importance sampling is an unbiased estimator of E_p[f(X)]:

    E_q[ (p(X)/q(X)) f(X) ] = Σ_{x∈P} p(x) f(x) − Σ_{x∈P\Q} p(x) f(x)
                            = Σ_{x∈P} p(x) f(x)
                            = E_p[ f(X) ]

SLIDE 29

Importance sampling (proof)

  • Assume f(x) ≥ 0 for all x
  • Importance sampling is a negative-bias estimator of E_p[f(X)]:

    E_q[ (p(X)/q(X)) f(X) ] = Σ_{x∈P} p(x) f(x) − Σ_{x∈P\Q} p(x) f(x)
                            ≤ Σ_{x∈P} p(x) f(x)
                            = E_p[ f(X) ]

SLIDE 30

Importance sampling (reminder)

    IS(D) = (1/n) Σ_{j=1}^{n} [ Π_{t=1}^{L} π_e(a_t^j | s_t^j) / π_b(a_t^j | s_t^j) ] Σ_{t=1}^{L} γ^t R_t^j

    E[ IS(D) ] = J(π_e)

SLIDE 31

Creating a safe reinforcement learning algorithm

  • Off-policy policy evaluation (OPE)
    • For any evaluation policy, π_e, convert historical data, D, into n independent and unbiased estimates of J(π_e)
  • High-confidence off-policy policy evaluation (HCOPE)
    • Use a concentration inequality to convert the n independent and unbiased estimates of J(π_e) into a 1 − δ confidence lower bound on J(π_e)
  • Safe policy improvement (SPI)
    • Use the HCOPE method to create a safe reinforcement learning algorithm, a
SLIDE 32

High confidence off-policy policy evaluation (HCOPE)

[Diagram: HCOPE takes the historical data, D, a proposed policy, π_e, and a probability, 1 − δ, and produces a 1 − δ confidence lower bound on J(π_e).]

SLIDE 33
Hoeffding’s inequality

Math Slide 3/3

  • Let X_1, …, X_n be n independent and identically distributed random variables such that X_i ∈ [0, b]
  • Then with probability at least 1 − δ:

    E[X_j] ≥ (1/n) Σ_{j=1}^{n} X_j − b √( ln(1/δ) / (2n) )

  • Applied to HCOPE, each X_j is an importance-weighted return, so the sample mean is the IS estimate:

    (1/n) Σ_{j=1}^{n} w_j Σ_{t=1}^{L} γ^t R_t^j
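A minimal code sketch of this Hoeffding-based lower bound, meant to be applied to the per-trajectory importance-weighted returns from the sketch above (the function name and arguments are illustrative):

```python
import numpy as np

def hoeffding_lower_bound(samples, b, delta):
    """1 - delta confidence lower bound on the mean of i.i.d. samples in [0, b]."""
    samples = np.asarray(samples, dtype=float)
    n = len(samples)
    return samples.mean() - b * np.sqrt(np.log(1.0 / delta) / (2.0 * n))

# Example (illustrative): lower-bound J(pi_e) from importance-weighted returns,
# where b must upper-bound every possible importance-weighted return.
# lower = hoeffding_lower_bound(per_trajectory_estimates, b=1000.0, delta=0.05)
```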

SLIDE 34

Creating a safe reinforcement learning algorithm

  • Off-policy policy evaluation (OPE)
    • For any evaluation policy, π_e, convert historical data, D, into n independent and unbiased estimates of J(π_e)
  • High-confidence off-policy policy evaluation (HCOPE)
    • Use a concentration inequality to convert the n independent and unbiased estimates of J(π_e) into a 1 − δ confidence lower bound on J(π_e)
  • Safe policy improvement (SPI)
    • Use the HCOPE method to create a safe reinforcement learning algorithm, a
SLIDE 35

Safe policy improvement (SPI)

[Diagram: SPI takes the historical data, D, and a probability, 1 − δ, and returns either a new policy, π, or No Solution Found.]

SLIDE 36

Safe policy improvement (SPI)

[Diagram: the historical data is split into a training set (20%), used to select a candidate policy, π, and a testing set (80%), used for a safety test.]

Safety test: is the 1 − δ confidence lower bound on J(π) larger than J(π_cur)?
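A minimal code sketch of this train/test structure, reusing the is_estimate and hoeffding_lower_bound sketches above; candidate_fn stands in for any policy-selection routine run on the training split and is an assumption of this sketch, not part of the lecture:

```python
def safe_policy_improvement(trajectories, pi_b, j_cur, candidate_fn, b, delta,
                            gamma=1.0, train_frac=0.2):
    """Return a candidate policy only if it passes the high-confidence safety test."""
    n_train = int(train_frac * len(trajectories))
    train, test = trajectories[:n_train], trajectories[n_train:]

    # candidate_fn is any policy-improvement routine run on the training split only
    # (e.g., one that maximizes a predicted lower bound over a family of policies).
    pi_c = candidate_fn(train, pi_b)

    # Safety test: lower-bound J(pi_c) using only the held-out testing split.
    _, is_returns = is_estimate(test, pi_c, pi_b, gamma)
    lower_bound = hoeffding_lower_bound(is_returns, b, delta)

    if lower_bound > j_cur:
        return pi_c      # confident (with probability 1 - delta) it is not worse
    return None          # "No Solution Found"
```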

SLIDE 37

Creating a safe reinforcement learning algorithm

  • Off-policy policy evaluation (OPE)
    • For any evaluation policy, π_e, convert historical data, D, into n independent and unbiased estimates of J(π_e)
  • High-confidence off-policy policy evaluation (HCOPE)
    • Use a concentration inequality to convert the n independent and unbiased estimates of J(π_e) into a 1 − δ confidence lower bound on J(π_e)
  • Safe policy improvement (SPI)
    • Use the HCOPE method to create a safe reinforcement learning algorithm, a

WON’T WORK

SLIDE 38

Off-policy policy evaluation (revisited)

  • Importance sampling (IS):

    IS(D) = (1/n) Σ_{j=1}^{n} [ Π_{t=1}^{L} π_e(a_t^j | s_t^j) / π_b(a_t^j | s_t^j) ] Σ_{t=1}^{L} γ^t R_t^j

  • Per-decision importance sampling (PDIS):

    PDIS(D) = Σ_{t=1}^{L} γ^t (1/n) Σ_{j=1}^{n} [ Π_{τ=1}^{t} π_e(a_τ^j | s_τ^j) / π_b(a_τ^j | s_τ^j) ] R_t^j

SLIDE 39

Off-policy policy evaluation (revisited)

  • Importance sampling (IS):

    IS(D) = (1/n) Σ_{j=1}^{n} w_j Σ_{t=1}^{L} γ^t R_t^j

  • Weighted importance sampling (WIS):

    WIS(D) = [ 1 / Σ_{j=1}^{n} w_j ] Σ_{j=1}^{n} w_j Σ_{t=1}^{L} γ^t R_t^j
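A minimal code sketch of WIS under the same assumed data layout as the earlier IS sketch; the only change from ordinary IS is normalizing by the sum of the importance weights rather than by n:

```python
import numpy as np

def wis_estimate(trajectories, pi_e, pi_b, gamma=1.0):
    """Weighted importance sampling estimate of J(pi_e)."""
    weights, returns = [], []
    for traj in trajectories:
        w, ret = 1.0, 0.0
        for t, (s, a, r) in enumerate(traj):
            w *= pi_e(a, s) / pi_b(a, s)
            ret += (gamma ** t) * r
        weights.append(w)
        returns.append(ret)
    weights = np.array(weights)
    # Normalizing by the total weight instead of n makes the estimator biased,
    # but it remains consistent and typically has far lower variance than IS.
    return np.sum(weights * np.array(returns)) / np.sum(weights)
```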

SLIDE 40

Off-policy policy evaluation (revisited)

  • Weighted importance sampling (WIS):

    WIS(D) = [ 1 / Σ_{j=1}^{n} w_j ] Σ_{j=1}^{n} w_j Σ_{t=1}^{L} γ^t R_t^j

  • Not unbiased. When n = 1, E[WIS(D)] = J(π_b)
  • Strongly consistent estimator of J(π_e)
    • i.e., Pr( lim_{n→∞} WIS(D) = J(π_e) ) = 1
  • If:
    • Finite horizon
    • One behavior policy, or bounded rewards
SLIDE 41

Off-policy policy evaluation (revisited)

  • Weighted per-decision importance sampling
  • Also called consistent weighted per-decision importance sampling
  • A fun exercise!
SLIDE 42

Control variates

  • Given: X
  • Estimate: μ = E[X]
  • Estimator: μ̂ = X
    • Unbiased: E[μ̂] = E[X] = μ
    • Variance: Var(μ̂) = Var(X)

SLIDE 43

Control variates

  • Given: X, Y, and E[Y]
  • Estimate: μ = E[X]
  • Estimator: μ̂ = X − Y + E[Y]
  • Unbiased:

    E[μ̂] = E[X − Y + E[Y]] = E[X] − E[Y] + E[Y] = E[X] = μ

  • Variance:

    Var(μ̂) = Var(X − Y + E[Y]) = Var(X − Y) = Var(X) + Var(Y) − 2 Cov(X, Y)

  • Lower variance if 2 Cov(X, Y) > Var(Y)
  • We call Y a control variate.
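A small simulation sketch of this variance-reduction argument, using an arbitrary correlated pair (X, Y) chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# X is the quantity whose mean we want; Y is correlated with X and has a known mean.
z = rng.normal(size=100_000)
x = 3.0 + z + 0.5 * rng.normal(size=100_000)   # E[X] = 3
y = z                                          # E[Y] = 0, known exactly

plain_estimator = x            # mu_hat = X
cv_estimator = x - y + 0.0     # mu_hat = X - Y + E[Y]

# Both sample means are close to 3 (both estimators are unbiased),
# but the control-variate estimator has much smaller variance.
print(plain_estimator.mean(), plain_estimator.var())
print(cv_estimator.mean(), cv_estimator.var())
```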
SLIDE 44

Off-policy policy evaluation (revisited)

  • Idea: add a control variate to the importance sampling estimators
    • X is the importance sampling estimator
    • Y is a control variate built from an approximate model of the MDP
    • E[Y] = 0 in this case
  • PDISCV(D) = PDIS(D) − CV(D)
  • Called the doubly robust estimator (Jiang and Li, 2015)
    • Robust to 1) a poor approximate model, and 2) error in estimates of π_b
    • If the model is poor, the estimates are still unbiased
    • If the sampling policy is unknown, but the model is good, the MSE will still be low
    • DR(D) = PDISCV(D)
  • Non-recursive and weighted forms, as well as the control variate view, are provided by Thomas and Brunskill (2016)

SLIDE 45

Off-policy policy evaluation (revisited)

    DR(D) = (1/n) Σ_{j=1}^{n} Σ_{t=0}^{∞} γ^t [ w_t^j ( R_t^j − q̂^{π_e}(S_t^j, A_t^j) ) + w_{t−1}^j v̂^{π_e}(S_t^j) ]

    where w_t^j = Π_{τ=0}^{t} π_e(a_τ^j | s_τ^j) / π_b(a_τ^j | s_τ^j) is the per-decision importance weight (as in PDIS), and w_{−1}^j = 1

  • Recall: we want the control variate, Y, to cancel with X:

    R − q̂(S, A) + γ v̂(S′)
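A minimal code sketch of this doubly robust estimator, assuming hypothetical callables q_hat(s, a) and v_hat(s) that return approximate action-value and state-value estimates for π_e (e.g., computed from an approximate model):

```python
import numpy as np

def dr_estimate(trajectories, pi_e, pi_b, q_hat, v_hat, gamma=1.0):
    """Doubly robust off-policy estimate of J(pi_e)."""
    per_trajectory = []
    for traj in trajectories:
        total, w_prev, w = 0.0, 1.0, 1.0   # w_prev plays the role of w_{t-1}, with w_{-1} = 1
        for t, (s, a, r) in enumerate(traj):
            w *= pi_e(a, s) / pi_b(a, s)   # w_t: cumulative importance weight up to time t
            total += (gamma ** t) * (w * (r - q_hat(s, a)) + w_prev * v_hat(s))
            w_prev = w
        per_trajectory.append(total)
    return np.mean(per_trajectory)
```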

SLIDE 46

Empirical Results (Gridworld)

[Plot: mean squared error vs. number of episodes, n, for IS and AM. AM is the approximate model, also referred to as the direct method (Dudik, 2011) or an indirect method (Sutton and Barto, 1998).]

SLIDE 47

Empirical Results (Gridworld)

[Plot: mean squared error vs. number of episodes, n, for IS, PDIS, and AM.]

SLIDE 48

Empirical Results (Gridworld)

[Plot: mean squared error vs. number of episodes, n, for IS, PDIS, DR, and AM.]

SLIDE 49

Empirical Results (Gridworld)

[Plot: mean squared error vs. number of episodes, n, for IS, PDIS, WIS, CWPDIS, DR, and AM.]

SLIDE 50

Empirical Results (Gridworld)

[Plot: mean squared error vs. number of episodes, n, for IS, PDIS, WIS, CWPDIS, DR, AM, and WDR.]

SLIDE 51

Off-policy policy evaluation (revisited)

  • What if supp(π_e) ⊂ supp(π_b)?
    • There is a state–action pair, (s, a), such that π_e(a|s) = 0, but π_b(a|s) ≠ 0
  • If we see a history where (s, a) occurs, what weight should we give it?

    IS(D) = (1/n) Σ_{j=1}^{n} [ Π_{t=1}^{L} π_e(a_t^j | s_t^j) / π_b(a_t^j | s_t^j) ] Σ_{t=1}^{L} γ^t R_t^j

SLIDE 52

Off-policy policy evaluation (revisited)

  • What if there are zero samples (n = 0)?
    • The importance sampling estimate is undefined
  • What if no samples are in supp(π_e) (or supp(p) in general)?
    • Importance sampling says: the estimate is zero
    • Alternate approach: undefined
  • The importance sampling estimator is unbiased if n > 0
  • The alternate approach is unbiased given that at least one sample is in the support of p
  • The alternate approach is detailed in Importance Sampling with Unequal Support (Thomas and Brunskill, AAAI 2017)

SLIDE 53

Off-policy policy evaluation (revisited)

SLIDE 54

Off-policy policy evaluation (revisited)

  • Thomas et al. Predictive Off-Policy Policy Evaluation for Nonstationary Decision Problems, with Applications to Digital Marketing (AAAI 2017)

SLIDE 55

Off-policy policy evaluation (revisited)

  • Thomas and Brunskill. Data-Efficient Off-Policy Policy Evaluation for Reinforcement Learning (ICML 2016)

[Plots: mean squared error vs. number of episodes, n. Left: IS, PDIS, WIS, CWPDIS, DR, AM, WDR, and MAGIC. Right: IS, DR, AM, WDR, and MAGIC.]

SLIDE 56

Creating a safe reinforcement learning algorithm

  • Off-policy policy evaluation (OPE)
    • For any evaluation policy, π_e, convert historical data, D, into n independent and unbiased estimates of J(π_e)
  • High-confidence off-policy policy evaluation (HCOPE)
    • Use a concentration inequality to convert the n independent and unbiased estimates of J(π_e) into a 1 − δ confidence lower bound on J(π_e)
  • Safe policy improvement (SPI)
    • Use the HCOPE method to create a safe reinforcement learning algorithm, a
SLIDE 57

High-confidence off-policy policy evaluation (revisited)

  • Consider using IS + Hoeffding’s inequality for HCOPE on mountain car

Natural Temporal Difference Learning, Dabney and Thomas, 2014

SLIDE 58

High-confidence off-policy policy evaluation (revisited)

  • Using 100,000 trajectories
  • Evaluation policy’s true performance is 0.19 ∈ [0,1].
  • We get a 95% confidence lower bound of −5,831,000

SLIDE 59

What went wrong?

    w_j = Π_{t=1}^{L} π_e(a_t^j | s_t^j) / π_b(a_t^j | s_t^j)

SLIDE 60

What went wrong?

    E[X_j] ≥ (1/n) Σ_{j=1}^{n} X_j − b √( ln(1/δ) / (2n) )

  • Here b ≈ 10^9.4
  • Largest observed importance-weighted return: 316

SLIDE 61

High-confidence off-policy policy evaluation (revisited)

  • Removing the upper tail only decreases the expected value.
SLIDE 62

High-confidence off-policy policy evaluation (revisited)

  • Thomas et al., High confidence off-policy evaluation, AAAI 2015
SLIDE 63

High-confidence off-policy policy evaluation (revisited)

SLIDE 64

High-confidence off-policy policy evaluation (revisited)

  • Use 20% of the data to optimize c.
  • Use 80% to compute the lower bound with the optimized c.
  • Mountain car results (95% confidence lower bound on the mean):
    • CUT: 0.145
    • Chernoff-Hoeffding: −5,831,000
    • Maurer: −129,703
    • Anderson: 0.055
    • Bubeck et al.: −0.046

SLIDE 65

High-confidence off-policy policy evaluation (revisited)

  • Digital Marketing:
SLIDE 66

High-confidence off-policy policy evaluation (revisited)

  • Cognitive dissonance

    E[X_j] ≥ (1/n) Σ_{j=1}^{n} X_j − b √( ln(1/δ) / (2n) )

SLIDE 67

High-confidence off-policy policy evaluation (revisited)

  • Student’s t-test
    • Assumes that IS(D) is normally distributed
    • By the central limit theorem, it is (as n → ∞)
  • Efron’s bootstrap methods (e.g., BCa)
  • Also, without importance sampling: Hanna, Stone, and Niekum, AAMAS 2017
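A minimal code sketch of a one-sided Student's t lower bound on the mean of the importance-weighted returns, under the normality approximation discussed above (uses scipy; names are illustrative):

```python
import numpy as np
from scipy import stats

def ttest_lower_bound(samples, delta):
    """Approximate 1 - delta lower bound on the mean, assuming approximate normality."""
    samples = np.asarray(samples, dtype=float)
    n = len(samples)
    mean = samples.mean()
    std = samples.std(ddof=1)                      # sample standard deviation
    t_crit = stats.t.ppf(1.0 - delta, df=n - 1)    # one-sided critical value
    return mean - t_crit * std / np.sqrt(n)
```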
SLIDE 68

High-confidence off-policy policy evaluation (revisited)

  • P. S. Thomas. Safe reinforcement learning (PhD Thesis, 2015)
SLIDE 69

Creating a safe reinforcement learning algorithm

  • Off-policy policy evaluation (OPE)
    • For any evaluation policy, π_e, convert historical data, D, into n independent and unbiased estimates of J(π_e)
  • High-confidence off-policy policy evaluation (HCOPE)
    • Use a concentration inequality to convert the n independent and unbiased estimates of J(π_e) into a 1 − δ confidence lower bound on J(π_e)
  • Safe policy improvement (SPI)
    • Use the HCOPE method to create a safe reinforcement learning algorithm, a
SLIDE 70

Safe policy improvement (revisited)

[Diagram: the historical data is split into a training set (20%), used to select a candidate policy, π, and a testing set (80%), used for a safety test.]

Safety test: is the 1 − δ confidence lower bound on J(π) larger than J(π_cur)?

  • Thomas et al., ICML 2015
SLIDE 71

Lecture overview

  • What makes a reinforcement learning algorithm safe?
  • Notation
  • Creating a safe reinforcement learning algorithm
  • Off-policy policy evaluation (OPE)
  • High-confidence off-policy policy evaluation (HCOPE)
  • Safe policy improvement (SPI)
  • Empirical results
  • Research directions
SLIDE 72

Empirical Results

  • What to look for:
  • Data efficiency
  • Error rates
SLIDE 73

Empirical Results: Mountain Car

SLIDE 74

Empirical Results: Mountain Car

SLIDE 75

Empirical Results: Mountain Car

SLIDE 76

Empirical Results: Digital Marketing

[Diagram: agent–environment loop with action, a, state, s, and reward, r.]

SLIDE 77

Empirical Results: Digital Marketing

[Plot: expected normalized return (approximately 0.002715 to 0.003832) for n = 10,000, 30,000, 60,000, and 100,000, comparing None/CUT, None/BCa, k-Fold/CUT, and k-Fold/BCa.]

SLIDE 78

Empirical Results: Digital Marketing

SLIDE 79

Empirical Results: Digital Marketing

SLIDE 80

Example Results: Diabetes Treatment

[Diagram: blood glucose (sugar) rises after eating carbohydrates and falls when insulin is released.]

SLIDE 81

Example Results: Diabetes Treatment

[Diagram: the blood-glucose loop, with hyperglycemia (high blood glucose) labeled.]

SLIDE 82

Example Results: Diabetes Treatment

[Diagram: the blood-glucose loop, with hypoglycemia (low blood glucose) and hyperglycemia (high blood glucose) labeled.]

SLIDE 83

Example Results: Diabetes Treatment

    injection = (blood glucose − target blood glucose) / CF + (meal size) / CR
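A tiny code sketch of this fixed-form injection policy; CF (correction factor) and CR (carb ratio) are the two parameters a policy-improvement algorithm would tune, and the variable names are illustrative:

```python
def insulin_injection(blood_glucose, target_blood_glucose, meal_size, cf, cr):
    """Bolus calculation: a correction term plus a meal term."""
    correction = (blood_glucose - target_blood_glucose) / cf
    meal = meal_size / cr
    # Clamp at zero (added in this sketch): never recommend a negative injection.
    return max(0.0, correction + meal)
```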

SLIDE 84

Example Results: Diabetes Treatment

Intelligent Diabetes Management

SLIDE 85

Example Results: Diabetes Treatment

[Plot: probability the policy was changed, and probability the policy was made worse.]

SLIDE 86

Future Directions

  • How to deal with long horizons?
  • How to deal with importance sampling being “unfair”?
  • What to do when the behavior policy is not known?
  • What to do when the behavior policy is deterministic?
SLIDE 87

Summary

  • Safe reinforcement learning
    • Risk-sensitive
    • Learning from demonstration
    • Asymptotic convergence even if data is off-policy
    • Guaranteed (with probability 1 − δ) not to make the policy worse
  • Designing a safe reinforcement learning algorithm:
    • Off-policy policy evaluation (OPE)
      • IS, PDIS, WIS, WPDIS, DR, WDR, US, TSP, MAGIC
    • High confidence off-policy policy evaluation (HCOPE)
      • Hoeffding, CUT inequality, Student’s t-test, BCa
    • Safe policy improvement (SPI)
      • Selecting the candidate policy
SLIDE 88

Takeaway

  • Safe reinforcement learning is tractable!
  • Not just polynomial amounts of data – practical amounts of data