SLIDE 1

Safe Reinforcement Learning

Philip S. Thomas Stanford CS234: Reinforcement Learning, Guest Lecture May 24, 2017

SLIDE 2

Lecture overview

  • What makes a reinforcement learning algorithm safe?
  • Notation
  • Creating a safe reinforcement learning algorithm
  • Off-policy policy evaluation (OPE)
  • High-confidence off-policy policy evaluation (HCOPE)
  • Safe policy improvement (SPI)
  • Empirical results
  • Research directions
SLIDE 3

What does it mean for a reinforcement learning algorithm to be safe?

SLIDE 4

SLIDE 5

SLIDE 6

Changing the objective

[Figure: Policy 1 and Policy 2 compared, with per-step rewards of +0 and +20 and a value of 50.]

SLIDE 7

Changing the objective

  • Policy 1:
    • Reward = 0 with probability 0.999999
    • Reward = 10^9 with probability 1 − 0.999999
    • Expected reward ≈ 1000
  • Policy 2:
    • Reward = 999 with probability 0.5
    • Reward = 1000 with probability 0.5
    • Expected reward = 999.5
SLIDE 8

Another notion of safety

SLIDE 9

Another notion of safety (Munos et al.)

SLIDE 10

Another notion of safety

SLIDE 11

SLIDE 12

The Problem

  • If you apply an existing method, do you have confidence that it will work?

SLIDE 13

Reinforcement learning successes

SLIDE 14

A property of many real applications

  • Deploying “bad” policies can be costly or dangerous.
SLIDE 15

Deploying bad policies can be costly

SLIDE 16

Deploying bad policies can be dangerous

SLIDE 17

What property should a safe algorithm have?

  • Guaranteed to work on the first try
  • “I guarantee that with probability at least 1 − δ, I will not change your policy to one that is worse than the current policy.”
  • You get to choose δ
  • This guarantee is not contingent on the tuning of any hyperparameters
SLIDE 18

Lecture overview

  • What makes a reinforcement learning algorithm safe?
  • Notation
  • Creating a safe reinforcement learning algorithm
  • Off-policy policy evaluation (OPE)
  • High-confidence off-policy policy evaluation (HCOPE)
  • Safe policy improvement (SPI)
  • Empirical results
  • Research directions
SLIDE 19

Notation

  • Policy, π:

    π(a|s) = Pr(A_t = a | S_t = s)

  • History:

    H = (s_1, a_1, r_1, s_2, a_2, r_2, …, s_L, a_L, r_L)

  • Historical data:

    D = {H_1, H_2, …, H_n}

  • Historical data from behavior policy, π_b
  • Objective:

    J(π) = E[ Σ_{t=1}^{L} γ^t R_t | π ]

[Diagram: agent–environment loop; the agent sends an action, a, to the environment and receives a state, s, and a reward, r.]

SLIDE 20

Safe reinforcement learning algorithm

  • Reinforcement learning algorithm, a
  • Historical data, D, which is a random variable
  • Policy produced by the algorithm, a(D), which is a random variable
  • A safe reinforcement learning algorithm, a, satisfies:

    Pr( J(a(D)) ≥ J(π_b) ) ≥ 1 − δ

  • Or, in general:

    Pr( J(a(D)) ≥ J_min ) ≥ 1 − δ

SLIDE 21

Lecture overview

  • What makes a reinforcement learning algorithm safe?
  • Notation
  • Creating a safe reinforcement learning algorithm
  • Off-policy policy evaluation (OPE)
  • High-confidence off-policy policy evaluation (HCOPE)
  • Safe policy improvement (SPI)
  • Empirical results
  • Research directions
SLIDE 22

Creating a safe reinforcement learning algorithm

  • Off-policy policy evaluation (OPE)
    • For any evaluation policy, π_e, convert historical data, D, into n independent and unbiased estimates of J(π_e)
  • High-confidence off-policy policy evaluation (HCOPE)
    • Use a concentration inequality to convert the n independent and unbiased estimates of J(π_e) into a 1 − δ confidence lower bound on J(π_e)
  • Safe policy improvement (SPI)
    • Use the HCOPE method to create a safe reinforcement learning algorithm, a
SLIDE 23

Off-policy policy evaluation (OPE)

[Diagram: OPE takes the historical data, D, and a proposed policy, π_e, and produces an estimate of J(π_e).]

SLIDE 24

Importance Sampling (Intuition)

Math Slide 2/3

[Figure: the probability of each history under the evaluation policy, π_e, versus under the behavior policy, π_b.]

  • Naive Monte Carlo estimate (only valid if the data had come from π_e):

    J(π_e) ≈ (1/n) Σ_{j=1}^{n} Σ_{t=1}^{L} γ^t R_t^j

  • Importance sampling estimate, which reweights each history by how likely it is under π_e relative to π_b:

    J(π_e) ≈ (1/n) Σ_{j=1}^{n} w_j Σ_{t=1}^{L} γ^t R_t^j

    where w_j = Pr(H_j | π_e) / Pr(H_j | π_b) = Π_{t=1}^{L} π_e(a_t^j | s_t^j) / π_b(a_t^j | s_t^j)

  • Reminder:
    • History, H = (s_1, a_1, r_1, s_2, a_2, r_2, …, s_L, a_L, r_L)
    • Objective, J(π_e) = E[ Σ_{t=1}^{L} γ^t R_t | π_e ]
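A minimal code sketch of the ordinary importance sampling estimator above, assuming trajectories are stored as lists of (s, a, r) tuples and that pi_e and pi_b are callables returning action probabilities (these names and the data layout are illustrative, not from the lecture):

```python
import numpy as np

def is_estimate(trajectories, pi_e, pi_b, gamma=1.0):
    """Ordinary importance sampling estimate of J(pi_e) from data generated by pi_b."""
    per_trajectory = []
    for traj in trajectories:
        weight = 1.0   # w_j = prod_t pi_e(a_t | s_t) / pi_b(a_t | s_t)
        ret = 0.0      # discounted return of this trajectory
        for t, (s, a, r) in enumerate(traj):
            weight *= pi_e(a, s) / pi_b(a, s)
            ret += (gamma ** t) * r
        per_trajectory.append(weight * ret)
    # Each element is an independent, unbiased estimate of J(pi_e);
    # the IS estimate is their mean.
    return np.mean(per_trajectory), np.array(per_trajectory)
```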

SLIDE 25

Importance sampling (History)

  • Kahn, H., Marshall, A. W. (1953). Methods of reducing sample size in Monte Carlo computations. Journal of the Operations Research Society of America, 1(5):263–278
  • Let X = 0 with probability 1 − 10^−10 and X = 10^10 with probability 10^−10
    • E[X] = 1
    • A Monte Carlo estimate from n ≪ 10^10 samples of X is almost always zero
  • Idea: sample X from some other distribution and use importance sampling to “correct” the estimate
    • Can produce lower-variance estimates
    • Josiah Hanna et al., ICML 2017 (to appear)
SLIDE 26

Importance sampling (History, continued)

  • Precup, D., Sutton, R. S., Singh, S. (2000). Eligibility traces for off-policy policy evaluation. In Proceedings of the 17th International Conference on Machine Learning, pp. 759–766. Morgan Kaufmann

SLIDE 27

Importance sampling (Proof)

  • Estimate E_p[f(X)] given a sample of X ~ q
  • Let P = supp(p), Q = supp(q), and F = supp(f)
  • Importance sampling estimate: (p(X)/q(X)) f(X)

    E_q[ (p(X)/q(X)) f(X) ] = Σ_{x∈Q} q(x) (p(x)/q(x)) f(x)
                            = Σ_{x∈Q} p(x) f(x)
                            = Σ_{x∈P} p(x) f(x) − Σ_{x∈P\Q} p(x) f(x)

SLIDE 28

Importance sampling (Proof)

  • Assume P ⊆ Q (the assumption can be relaxed to P ⊆ Q ∪ F̄, where F̄ denotes the complement of F)
  • Importance sampling is an unbiased estimator of E_p[f(X)]:

    E_q[ (p(X)/q(X)) f(X) ] = Σ_{x∈P} p(x) f(x) − Σ_{x∈P\Q} p(x) f(x)
                            = Σ_{x∈P} p(x) f(x)
                            = E_p[ f(X) ]

SLIDE 29

Importance sampling (proof)

  • Assume f(x) ≥ 0 for all x
  • Importance sampling is a negative-bias estimator of E_p[f(X)]:

    E_q[ (p(X)/q(X)) f(X) ] = Σ_{x∈P} p(x) f(x) − Σ_{x∈P\Q} p(x) f(x)
                            ≤ Σ_{x∈P} p(x) f(x)
                            = E_p[ f(X) ]

SLIDE 30

Importance sampling (reminder)

    IS(D) = (1/n) Σ_{j=1}^{n} [ Π_{t=1}^{L} π_e(a_t^j | s_t^j) / π_b(a_t^j | s_t^j) ] Σ_{t=1}^{L} γ^t R_t^j

    E[ IS(D) ] = J(π_e)

SLIDE 31

Creating a safe reinforcement learning algorithm

  • Off-policy policy evaluation (OPE)
    • For any evaluation policy, π_e, convert historical data, D, into n independent and unbiased estimates of J(π_e)
  • High-confidence off-policy policy evaluation (HCOPE)
    • Use a concentration inequality to convert the n independent and unbiased estimates of J(π_e) into a 1 − δ confidence lower bound on J(π_e)
  • Safe policy improvement (SPI)
    • Use the HCOPE method to create a safe reinforcement learning algorithm, a
SLIDE 32

High confidence off-policy policy evaluation (HCOPE)

[Diagram: HCOPE takes the historical data, D, a proposed policy, π_e, and a probability, 1 − δ, and produces a 1 − δ confidence lower bound on J(π_e).]

SLIDE 33
Hoeffding’s inequality

Math Slide 3/3

  • Let X_1, …, X_n be n independent and identically distributed random variables such that X_i ∈ [0, b]
  • Then with probability at least 1 − δ:

    E[X_j] ≥ (1/n) Σ_{j=1}^{n} X_j − b √( ln(1/δ) / (2n) )

  • Applied to HCOPE, each X_j is an importance-weighted return, so the sample mean is the IS estimate:

    (1/n) Σ_{j=1}^{n} w_j Σ_{t=1}^{L} γ^t R_t^j
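A minimal code sketch of this Hoeffding-based lower bound, meant to be applied to the per-trajectory importance-weighted returns from the sketch above (the function name and arguments are illustrative):

```python
import numpy as np

def hoeffding_lower_bound(samples, b, delta):
    """1 - delta confidence lower bound on the mean of i.i.d. samples in [0, b]."""
    samples = np.asarray(samples, dtype=float)
    n = len(samples)
    return samples.mean() - b * np.sqrt(np.log(1.0 / delta) / (2.0 * n))

# Example (illustrative): lower-bound J(pi_e) from importance-weighted returns,
# where b must upper-bound every possible importance-weighted return.
# lower = hoeffding_lower_bound(per_trajectory_estimates, b=1000.0, delta=0.05)
```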

SLIDE 34

Creating a safe reinforcement learning algorithm

  • Off-policy policy evaluation (OPE)
    • For any evaluation policy, π_e, convert historical data, D, into n independent and unbiased estimates of J(π_e)
  • High-confidence off-policy policy evaluation (HCOPE)
    • Use a concentration inequality to convert the n independent and unbiased estimates of J(π_e) into a 1 − δ confidence lower bound on J(π_e)
  • Safe policy improvement (SPI)
    • Use the HCOPE method to create a safe reinforcement learning algorithm, a
SLIDE 35

Safe policy improvement (SPI)

[Diagram: SPI takes the historical data, D, and a probability, 1 − δ, and returns either a new policy, π, or No Solution Found.]

SLIDE 36

Safe policy improvement (SPI)

[Diagram: the historical data is split into a training set (20%), used to select a candidate policy, π, and a testing set (80%), used for a safety test.]

Safety test: is the 1 − δ confidence lower bound on J(π) larger than J(π_cur)?
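A minimal code sketch of this train/test structure, reusing the is_estimate and hoeffding_lower_bound sketches above; candidate_fn stands in for any policy-selection routine run on the training split and is an assumption of this sketch, not part of the lecture:

```python
def safe_policy_improvement(trajectories, pi_b, j_cur, candidate_fn, b, delta,
                            gamma=1.0, train_frac=0.2):
    """Return a candidate policy only if it passes the high-confidence safety test."""
    n_train = int(train_frac * len(trajectories))
    train, test = trajectories[:n_train], trajectories[n_train:]

    # candidate_fn is any policy-improvement routine run on the training split only
    # (e.g., one that maximizes a predicted lower bound over a family of policies).
    pi_c = candidate_fn(train, pi_b)

    # Safety test: lower-bound J(pi_c) using only the held-out testing split.
    _, is_returns = is_estimate(test, pi_c, pi_b, gamma)
    lower_bound = hoeffding_lower_bound(is_returns, b, delta)

    if lower_bound > j_cur:
        return pi_c      # confident (with probability 1 - delta) it is not worse
    return None          # "No Solution Found"
```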

SLIDE 37

Creating a safe reinforcement learning algorithm

  • Off-policy policy evaluation (OPE)
    • For any evaluation policy, π_e, convert historical data, D, into n independent and unbiased estimates of J(π_e)
  • High-confidence off-policy policy evaluation (HCOPE)
    • Use a concentration inequality to convert the n independent and unbiased estimates of J(π_e) into a 1 − δ confidence lower bound on J(π_e)
  • Safe policy improvement (SPI)
    • Use the HCOPE method to create a safe reinforcement learning algorithm, a

WON’T WORK

SLIDE 38

Off-policy policy evaluation (revisited)

  • Importance sampling (IS):

    IS(D) = (1/n) Σ_{j=1}^{n} [ Π_{t=1}^{L} π_e(a_t^j | s_t^j) / π_b(a_t^j | s_t^j) ] Σ_{t=1}^{L} γ^t R_t^j

  • Per-decision importance sampling (PDIS):

    PDIS(D) = Σ_{t=1}^{L} γ^t (1/n) Σ_{j=1}^{n} [ Π_{τ=1}^{t} π_e(a_τ^j | s_τ^j) / π_b(a_τ^j | s_τ^j) ] R_t^j

SLIDE 39

Off-policy policy evaluation (revisited)

  • Importance sampling (IS):

    IS(D) = (1/n) Σ_{j=1}^{n} w_j Σ_{t=1}^{L} γ^t R_t^j

  • Weighted importance sampling (WIS):

    WIS(D) = [ 1 / Σ_{j=1}^{n} w_j ] Σ_{j=1}^{n} w_j Σ_{t=1}^{L} γ^t R_t^j
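A minimal code sketch of WIS under the same assumed data layout as the earlier IS sketch; the only change from ordinary IS is normalizing by the sum of the importance weights rather than by n:

```python
import numpy as np

def wis_estimate(trajectories, pi_e, pi_b, gamma=1.0):
    """Weighted importance sampling estimate of J(pi_e)."""
    weights, returns = [], []
    for traj in trajectories:
        w, ret = 1.0, 0.0
        for t, (s, a, r) in enumerate(traj):
            w *= pi_e(a, s) / pi_b(a, s)
            ret += (gamma ** t) * r
        weights.append(w)
        returns.append(ret)
    weights = np.array(weights)
    # Normalizing by the total weight instead of n makes the estimator biased,
    # but it remains consistent and typically has far lower variance than IS.
    return np.sum(weights * np.array(returns)) / np.sum(weights)
```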

SLIDE 40

Off-policy policy evaluation (revisited)

  • Weighted importance sampling (WIS):

    WIS(D) = [ 1 / Σ_{j=1}^{n} w_j ] Σ_{j=1}^{n} w_j Σ_{t=1}^{L} γ^t R_t^j

  • Not unbiased. When n = 1, E[WIS(D)] = J(π_b)
  • Strongly consistent estimator of J(π_e)
    • i.e., Pr( lim_{n→∞} WIS(D) = J(π_e) ) = 1
  • If:
    • Finite horizon
    • One behavior policy, or bounded rewards
SLIDE 41

Off-policy policy evaluation (revisited)

  • Weighted per-decision importance sampling
  • Also called consistent weighted per-decision importance sampling
  • A fun exercise!
SLIDE 42

Control variates

  • Given: X
  • Estimate: μ = E[X]
  • Estimator: μ̂ = X
    • Unbiased: E[μ̂] = E[X] = μ
    • Variance: Var(μ̂) = Var(X)

SLIDE 43

Control variates

  • Given: X, Y, and E[Y]
  • Estimate: μ = E[X]
  • Estimator: μ̂ = X − Y + E[Y]
  • Unbiased:

    E[μ̂] = E[X − Y + E[Y]] = E[X] − E[Y] + E[Y] = E[X] = μ

  • Variance:

    Var(μ̂) = Var(X − Y + E[Y]) = Var(X − Y) = Var(X) + Var(Y) − 2 Cov(X, Y)

  • Lower variance if 2 Cov(X, Y) > Var(Y)
  • We call Y a control variate.
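A small simulation sketch of this variance-reduction argument, using an arbitrary correlated pair (X, Y) chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# X is the quantity whose mean we want; Y is correlated with X and has a known mean.
z = rng.normal(size=100_000)
x = 3.0 + z + 0.5 * rng.normal(size=100_000)   # E[X] = 3
y = z                                          # E[Y] = 0, known exactly

plain_estimator = x            # mu_hat = X
cv_estimator = x - y + 0.0     # mu_hat = X - Y + E[Y]

# Both sample means are close to 3 (both estimators are unbiased),
# but the control-variate estimator has much smaller variance.
print(plain_estimator.mean(), plain_estimator.var())
print(cv_estimator.mean(), cv_estimator.var())
```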
SLIDE 44

Off-policy policy evaluation (revisited)

  • Idea: add a control variate to the importance sampling estimators
    • X is the importance sampling estimator
    • Y is a control variate built from an approximate model of the MDP
    • E[Y] = 0 in this case
  • PDISCV(D) = PDIS(D) − CV(D)
  • Called the doubly robust estimator (Jiang and Li, 2015)
    • Robust to 1) a poor approximate model, and 2) error in estimates of π_b
    • If the model is poor, the estimates are still unbiased
    • If the sampling policy is unknown, but the model is good, the MSE will still be low
    • DR(D) = PDISCV(D)
  • Non-recursive and weighted forms, as well as the control variate view, are provided by Thomas and Brunskill (2016)

SLIDE 45

Off-policy policy evaluation (revisited)

    DR(D) = (1/n) Σ_{j=1}^{n} Σ_{t=0}^{∞} γ^t [ w_t^j ( R_t^j − q̂^{π_e}(S_t^j, A_t^j) ) + w_{t−1}^j v̂^{π_e}(S_t^j) ]

    where w_t^j = Π_{τ=0}^{t} π_e(a_τ^j | s_τ^j) / π_b(a_τ^j | s_τ^j) is the per-decision importance weight (as in PDIS), and w_{−1}^j = 1

  • Recall: we want the control variate, Y, to cancel with X:

    R − q̂(S, A) + γ v̂(S′)
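A minimal code sketch of this doubly robust estimator, assuming hypothetical callables q_hat(s, a) and v_hat(s) that return approximate action-value and state-value estimates for π_e (e.g., computed from an approximate model):

```python
import numpy as np

def dr_estimate(trajectories, pi_e, pi_b, q_hat, v_hat, gamma=1.0):
    """Doubly robust off-policy estimate of J(pi_e)."""
    per_trajectory = []
    for traj in trajectories:
        total, w_prev, w = 0.0, 1.0, 1.0   # w_prev plays the role of w_{t-1}, with w_{-1} = 1
        for t, (s, a, r) in enumerate(traj):
            w *= pi_e(a, s) / pi_b(a, s)   # w_t: cumulative importance weight up to time t
            total += (gamma ** t) * (w * (r - q_hat(s, a)) + w_prev * v_hat(s))
            w_prev = w
        per_trajectory.append(total)
    return np.mean(per_trajectory)
```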

SLIDE 46

Empirical Results (Gridworld)

[Plot: mean squared error vs. number of episodes, n, for IS and AM. AM is the approximate model, also referred to as the direct method (Dudik, 2011) or an indirect method (Sutton and Barto, 1998).]

SLIDE 47

Empirical Results (Gridworld)

[Plot: mean squared error vs. number of episodes, n, for IS, PDIS, and AM.]

SLIDE 48

Empirical Results (Gridworld)

[Plot: mean squared error vs. number of episodes, n, for IS, PDIS, DR, and AM.]

SLIDE 49

Empirical Results (Gridworld)

[Plot: mean squared error vs. number of episodes, n, for IS, PDIS, WIS, CWPDIS, DR, and AM.]

SLIDE 50

Empirical Results (Gridworld)

[Plot: mean squared error vs. number of episodes, n, for IS, PDIS, WIS, CWPDIS, DR, AM, and WDR.]

SLIDE 51

Off-policy policy evaluation (revisited)

  • What if supp(π_e) ⊂ supp(π_b)?
    • There is a state–action pair, (s, a), such that π_e(a|s) = 0, but π_b(a|s) ≠ 0
  • If we see a history where (s, a) occurs, what weight should we give it?

    IS(D) = (1/n) Σ_{j=1}^{n} [ Π_{t=1}^{L} π_e(a_t^j | s_t^j) / π_b(a_t^j | s_t^j) ] Σ_{t=1}^{L} γ^t R_t^j

SLIDE 52

Off-policy policy evaluation (revisited)

  • What if there are zero samples (n = 0)?
    • The importance sampling estimate is undefined
  • What if no samples are in supp(π_e) (or supp(p) in general)?
    • Importance sampling says: the estimate is zero
    • Alternate approach: undefined
  • The importance sampling estimator is unbiased if n > 0
  • The alternate approach is unbiased given that at least one sample is in the support of p
  • The alternate approach is detailed in Importance Sampling with Unequal Support (Thomas and Brunskill, AAAI 2017)

SLIDE 53

Off-policy policy evaluation (revisited)

SLIDE 54

Off-policy policy evaluation (revisited)

  • Thomas et al. Predictive Off-Policy Policy Evaluation for Nonstationary Decision Problems, with Applications to Digital Marketing (AAAI 2017)

SLIDE 55

Off-policy policy evaluation (revisited)

  • Thomas and Brunskill. Data-Efficient Off-Policy Policy Evaluation for Reinforcement Learning (ICML 2016)

[Plots: mean squared error vs. number of episodes, n. Left: IS, PDIS, WIS, CWPDIS, DR, AM, WDR, and MAGIC. Right: IS, DR, AM, WDR, and MAGIC.]

SLIDE 56

Creating a safe reinforcement learning algorithm

  • Off-policy policy evaluation (OPE)
    • For any evaluation policy, π_e, convert historical data, D, into n independent and unbiased estimates of J(π_e)
  • High-confidence off-policy policy evaluation (HCOPE)
    • Use a concentration inequality to convert the n independent and unbiased estimates of J(π_e) into a 1 − δ confidence lower bound on J(π_e)
  • Safe policy improvement (SPI)
    • Use the HCOPE method to create a safe reinforcement learning algorithm, a
SLIDE 57

High-confidence off-policy policy evaluation (revisited)

  • Consider using IS + Hoeffding’s inequality for HCOPE on mountain car

Natural Temporal Difference Learning, Dabney and Thomas, 2014

SLIDE 58

High-confidence off-policy policy evaluation (revisited)

  • Using 100,000 trajectories
  • Evaluation policy’s true performance is 0.19 ∈ [0,1].
  • We get a 95% confidence lower bound of −5,831,000

SLIDE 59

What went wrong?

    w_j = Π_{t=1}^{L} π_e(a_t^j | s_t^j) / π_b(a_t^j | s_t^j)

SLIDE 60

What went wrong?

    E[X_j] ≥ (1/n) Σ_{j=1}^{n} X_j − b √( ln(1/δ) / (2n) )

  • Here b ≈ 10^9.4
  • Largest observed importance-weighted return: 316

SLIDE 61

High-confidence off-policy policy evaluation (revisited)

  • Removing the upper tail only decreases the expected value.
SLIDE 62

High-confidence off-policy policy evaluation (revisited)

  • Thomas et al., High confidence off-policy evaluation, AAAI 2015
SLIDE 63

High-confidence off-policy policy evaluation (revisited)

SLIDE 64

High-confidence off-policy policy evaluation (revisited)

  • Use 20% of the data to optimize c.
  • Use 80% to compute the lower bound with the optimized c.
  • Mountain car results (95% confidence lower bound on the mean):
    • CUT: 0.145
    • Chernoff-Hoeffding: −5,831,000
    • Maurer: −129,703
    • Anderson: 0.055
    • Bubeck et al.: −0.046

SLIDE 65

High-confidence off-policy policy evaluation (revisited)

  • Digital Marketing:
SLIDE 66

High-confidence off-policy policy evaluation (revisited)

  • Cognitive dissonance

    E[X_j] ≥ (1/n) Σ_{j=1}^{n} X_j − b √( ln(1/δ) / (2n) )

SLIDE 67

High-confidence off-policy policy evaluation (revisited)

  • Student’s t-test
    • Assumes that IS(D) is normally distributed
    • By the central limit theorem, it is (as n → ∞)
  • Efron’s bootstrap methods (e.g., BCa)
  • Also, without importance sampling: Hanna, Stone, and Niekum, AAMAS 2017
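A minimal code sketch of a one-sided Student's t lower bound on the mean of the importance-weighted returns, under the normality approximation discussed above (uses scipy; names are illustrative):

```python
import numpy as np
from scipy import stats

def ttest_lower_bound(samples, delta):
    """Approximate 1 - delta lower bound on the mean, assuming approximate normality."""
    samples = np.asarray(samples, dtype=float)
    n = len(samples)
    mean = samples.mean()
    std = samples.std(ddof=1)                      # sample standard deviation
    t_crit = stats.t.ppf(1.0 - delta, df=n - 1)    # one-sided critical value
    return mean - t_crit * std / np.sqrt(n)
```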
SLIDE 68

High-confidence off-policy policy evaluation (revisited)

  • P. S. Thomas. Safe reinforcement learning (PhD Thesis, 2015)
SLIDE 69

Creating a safe reinforcement learning algorithm

  • Off-policy policy evaluation (OPE)
    • For any evaluation policy, π_e, convert historical data, D, into n independent and unbiased estimates of J(π_e)
  • High-confidence off-policy policy evaluation (HCOPE)
    • Use a concentration inequality to convert the n independent and unbiased estimates of J(π_e) into a 1 − δ confidence lower bound on J(π_e)
  • Safe policy improvement (SPI)
    • Use the HCOPE method to create a safe reinforcement learning algorithm, a
SLIDE 70

Safe policy improvement (revisited)

[Diagram: the historical data is split into a training set (20%), used to select a candidate policy, π, and a testing set (80%), used for a safety test.]

Safety test: is the 1 − δ confidence lower bound on J(π) larger than J(π_cur)?

  • Thomas et al., ICML 2015
SLIDE 71

Lecture overview

  • What makes a reinforcement learning algorithm safe?
  • Notation
  • Creating a safe reinforcement learning algorithm
  • Off-policy policy evaluation (OPE)
  • High-confidence off-policy policy evaluation (HCOPE)
  • Safe policy improvement (SPI)
  • Empirical results
  • Research directions
SLIDE 72

Empirical Results

  • What to look for:
  • Data efficiency
  • Error rates
SLIDE 73

Empirical Results: Mountain Car

SLIDE 74

Empirical Results: Mountain Car

SLIDE 75

Empirical Results: Mountain Car

SLIDE 76

Empirical Results: Digital Marketing

[Diagram: agent–environment loop with action, a, state, s, and reward, r.]

SLIDE 77

Empirical Results: Digital Marketing

[Plot: expected normalized return (approximately 0.002715 to 0.003832) for n = 10,000, 30,000, 60,000, and 100,000, comparing None/CUT, None/BCa, k-Fold/CUT, and k-Fold/BCa.]

SLIDE 78

Empirical Results: Digital Marketing

SLIDE 79

Empirical Results: Digital Marketing

SLIDE 80

Example Results: Diabetes Treatment

[Diagram: blood glucose (sugar) rises after eating carbohydrates and falls when insulin is released.]

SLIDE 81

Example Results: Diabetes Treatment

[Diagram: the blood-glucose loop, with hyperglycemia (high blood glucose) labeled.]

SLIDE 82

Example Results: Diabetes Treatment

[Diagram: the blood-glucose loop, with hypoglycemia (low blood glucose) and hyperglycemia (high blood glucose) labeled.]

SLIDE 83

Example Results: Diabetes Treatment

    injection = (blood glucose − target blood glucose) / CF + (meal size) / CR
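A tiny code sketch of this fixed-form injection policy; CF (correction factor) and CR (carb ratio) are the two parameters a policy-improvement algorithm would tune, and the variable names are illustrative:

```python
def insulin_injection(blood_glucose, target_blood_glucose, meal_size, cf, cr):
    """Bolus calculation: a correction term plus a meal term."""
    correction = (blood_glucose - target_blood_glucose) / cf
    meal = meal_size / cr
    # Clamp at zero (added in this sketch): never recommend a negative injection.
    return max(0.0, correction + meal)
```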

SLIDE 84

Example Results: Diabetes Treatment

Intelligent Diabetes Management

SLIDE 85

Example Results: Diabetes Treatment

[Plot: probability the policy was changed, and probability the policy was made worse.]

SLIDE 86

Future Directions

  • How to deal with long horizons?
  • How to deal with importance sampling being “unfair”?
  • What to do when the behavior policy is not known?
  • What to do when the behavior policy is deterministic?
SLIDE 87

Summary

  • Safe reinforcement learning
    • Risk-sensitive
    • Learning from demonstration
    • Asymptotic convergence even if data is off-policy
    • Guaranteed (with probability 1 − δ) not to make the policy worse
  • Designing a safe reinforcement learning algorithm:
    • Off-policy policy evaluation (OPE)
      • IS, PDIS, WIS, WPDIS, DR, WDR, US, TSP, MAGIC
    • High confidence off-policy policy evaluation (HCOPE)
      • Hoeffding, CUT inequality, Student’s t-test, BCa
    • Safe policy improvement (SPI)
      • Selecting the candidate policy
SLIDE 88

Takeaway

  • Safe reinforcement learning is tractable!
  • Not just polynomial amounts of data – practical amounts of data