[PPT] - Reinforcement Learning++ Emma Brunskill (today) Ariel PowerPoint Presentation

SLIDE 1

Reinforcement Learning++

Emma Brunskill (today), Ariel Procaccia

SLIDE 2

Recall MDPs: What You Should Know

Definition
How to define for a problem
Value iteration and policy iteration

– How to implement
– Convergence guarantees
– Computational complexity

SLIDE 3

Reinforcement Learning

[Figure: the Agent sends an Action and receives a State and Reward; the transition model and reward model are unknown]

Goal: Maximize expected sum of future rewards

SLIDE 4

Recap of Last Time

Model-based RL when selecting actions randomly

– Estimate a model of the dynamics and rewards from data (e.g. T(s1|s2,a2) ~ 0.3)
– Do MDP planning given those estimated models

Q-learning

– No model of dynamics and rewards
– Directly estimate state-action value function

SLIDE 5

Q-Learning

At each step, for current state s and action a taken

– Observe r and s'
– Update Q(s,a):

$\text{sampleQ}(s,a) = R(s,a,s') + \gamma \max_{a'} Q(s',a')$
$Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha \cdot \text{sampleQ}(s,a)$

Intuition: using samples to approximate

– Future rewards
– Expectation over next states due to transition model uncertainty
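The update on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the course's reference code: the tabular dictionary `Q`, the function name `q_learning_update`, and the default values of `alpha` and `gamma` are all assumptions made for the example.

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Apply one Q-learning update in place and return the new Q(s, a)."""
    # sampleQ: observed reward plus discounted value of the best next action.
    sample = r + gamma * max(Q[(s_next, a_next)] for a_next in actions)
    # Blend the old estimate with the new sample using learning rate alpha.
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample
    return Q[(s, a)]

# Hypothetical states and actions; unseen (state, action) pairs start at 0.0.
Q = defaultdict(float)
new_value = q_learning_update(
    Q, s="s1", a="left", r=1.0, s_next="s2", actions=["left", "right"]
)
```

Note that the update uses the max over next actions regardless of which action the agent actually takes next, which is what makes Q-learning off-policy.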

SLIDE 6

Q-Learning Properties

If acting randomly, Q-learning converges* to the optimal state-action values, and therefore also finds the optimal policy

Off-policy learning

– Can act in one way
– But learn the values of another policy (the optimal one!)

SLIDE 7

Towards Gathering High Reward

Fortunately, acting randomly is sufficient, but not necessary, to learn the optimal values and policy

SLIDE 8

How to Act?

Initialize s to a starting state
Initialize Q(s,a) values
For t=1,2,…

– Choose a = argmax_a Q(s,a)
– Observe s', r(s,a,s')
– Update/Compute Q values

SLIDE 9

Is this Approach Guaranteed to Learn Optimal Policy?

Initialize s to a starting state
Initialize Q(s,a) values
For t=1,2,…

– Choose a = argmax_a Q(s,a)
– Observe s', r(s,a,s')
– Update/Compute Q values (using model-based or Q-learning approach)

1. Yes    2. No    3. Not sure

SLIDE 10

To Explore or Exploit?

Slide adapted from Klein and Abbeel. Drawing by Ketrina Yim.

SLIDE 11

Simple Approach: ε-greedy

With probability 1-ε

– Choose argmax_a Q(s,a)

With probability ε

– Select random action

Guaranteed to compute optimal policy
But even after millions of steps still won't always be following the policy computed (the argmax_a Q(s,a))
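As a rough sketch of the selection rule on this slide, the function below picks a random action with probability ε and the greedy action otherwise. The names `epsilon_greedy`, `Q`, and the example states and actions are hypothetical, and missing Q entries are assumed to default to 0.

```python
import random

def epsilon_greedy(Q, s, actions, epsilon, rng=random):
    """Pick a random action with probability epsilon, else a greedy one."""
    if rng.random() < epsilon:
        # Explore: uniform random action.
        return rng.choice(actions)
    # Exploit: action with the highest current Q estimate (unseen pairs count as 0).
    return max(actions, key=lambda a: Q.get((s, a), 0.0))

Q = {("s1", "left"): 0.2, ("s1", "right"): 0.7}
greedy_action = epsilon_greedy(Q, "s1", ["left", "right"], epsilon=0.0)
```

With a fixed ε > 0 the agent keeps taking random actions forever, which is why it never fully follows the policy it has computed.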

SLIDE 12

Greedy in Limit of Infinite Exploration (GLIE)

ε-greedy approach
But decay epsilon over time
Eventually will be following optimal policy almost all the time
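The slide does not say how to decay epsilon. One commonly cited choice is ε_t = 1/t, shown here purely as an assumed example schedule; the function name `glie_epsilon` is hypothetical.

```python
def glie_epsilon(t):
    """Exploration rate at step t (t >= 1) under the assumed 1/t schedule."""
    return 1.0 / t

# Exploration rate early on, after ten steps, and after a thousand steps.
schedule = [glie_epsilon(t) for t in (1, 10, 1000)]
```

The rate goes to zero, so the agent becomes greedy in the limit, yet it decays slowly enough that exploration never stops entirely.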

SLIDE 13

How should we evaluate the performance of an algorithm?

SLIDE 14

How should we evaluate the performance of an algorithm?

– Computational efficiency
– How much reward gathered under algorithm?

SLIDE 15

The Speed of Learning and Speeding Learning

SLIDE 16

Objectives for an RL Algorithm

Asymptotic guarantees

– In the limit, converge to a policy identical to the optimal policy we would compute if we knew the unknown model parameters

SLIDE 17

Objectives for an RL Algorithm

Asymptotic guarantees

– In the limit, converge to a policy identical to the optimal policy we would compute if we knew the unknown model parameters
– Q-learning! (under what conditions?)

Probably Approximately Correct

– On all but a finite number of samples, choose an action whose expected reward is close to the expected reward of the action we would take if we knew the model parameters
– E3 (Kearns & Singh), R-MAX (Brafman & Tennenholtz)

SLIDE 18

Model-Based RL

Given data seen so far
Build an explicit model of the MDP
Compute policy for it
Select action for current state given policy, observe next state and reward
Repeat

SLIDE 19

R-max (Brafman & Tennenholtz)

[Figure: states S1, S2, …]

SLIDE 20

R-max is Model-based RL

[Figure: loop between "Act in world" and "Think hard: estimate models & compute policies"]

Rmax leverages optimism under uncertainty!

SLIDE 21

R-max Algorithm: Initialize: Define "Known" MDP

Known/Unknown

S1 S2 S3 S4 …
U  U  U  U
U  U  U  U
U  U  U  U
U  U  U  U

Transition Counts

S1 S2 S3 S4 …
(all entries empty)

Reward

S1   S2   S3   S4   …
Rmax Rmax Rmax Rmax
Rmax Rmax Rmax Rmax
Rmax Rmax Rmax Rmax
Rmax Rmax Rmax Rmax

In the "known" MDP, any unknown (s,a) pair has its dynamics set as a self loop & reward = Rmax

SLIDE 22

R-max Algorithm

Plan in known MDP

SLIDE 23

R-max: Planning

Compute optimal policy π_known for "known" MDP

SLIDE 24

Exercise: What Will Initial Value of Q(s,a) be for each (s,a) Pair in the Known MDP? What is the Policy?

Known/Unknown

S1 S2 S3 S4 …
U  U  U  U
U  U  U  U
U  U  U  U
U  U  U  U

Transition Counts

S1 S2 S3 S4 …
(all entries empty)

Reward

S1   S2   S3   S4   …
Rmax Rmax Rmax Rmax
Rmax Rmax Rmax Rmax
Rmax Rmax Rmax Rmax
Rmax Rmax Rmax Rmax

In the "known" MDP, any unknown (s,a) pair has its dynamics set as a self loop & reward = Rmax
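One way to work the exercise, assuming the discounted infinite-horizon setting with discount factor γ < 1 used elsewhere in the deck: initially every (s,a) pair is unknown, so every action is a self loop with reward Rmax, and by symmetry every pair has the same value V.

```latex
\begin{align*}
Q(s,a) &= R_{\max} + \gamma \max_{a'} Q(s,a') \\
V &= R_{\max} + \gamma V \\
V &= \frac{R_{\max}}{1-\gamma}
\end{align*}
```

Under this assumption all Q(s,a) values are equal, so every action ties and any policy is optimal for the initial known MDP; which action gets taken depends on how ties are broken.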

SLIDE 25

R-max Algorithm

Plan in known MDP → Act using policy → (repeat)

Given optimal policy π_known for "known" MDP
Take best action for current state π_known(s), transition to new state s' and get reward r

SLIDE 26

R-max Algorithm

Plan in known MDP → Act using policy → Update state-action counts → (repeat)

SLIDE 27

Update Known MDP

Known/Unknown

S2 S2 S3 S4 …
U  U  U  U
U  U  U  U
U  U  U  U
U  U  U  U

Transition Counts

S2 S2 S3 S4 …
1 (a single entry; all other entries empty)

Reward

S2   S2   S3   S4   …
Rmax Rmax Rmax Rmax
Rmax Rmax Rmax Rmax
Rmax Rmax Rmax Rmax
Rmax Rmax Rmax Rmax

Increment counts for state-action tuple

SLIDE 28

Update Known MDP

Known/Unknown

S2 S2 S3 S4 …
U  U  U  U
U  U  K  U
U  U  U  U
U  U  U  U

Transition Counts

S2 S2 S3 S4 …
3 3 4 3 2 4 5 4 4 4 2 2 4 1

Reward

S2   S2   S3   S4   …
Rmax Rmax Rmax Rmax
Rmax Rmax R    Rmax
Rmax Rmax Rmax Rmax
Rmax Rmax Rmax Rmax

If counts for (s,a) > N, (s,a) becomes known: use observed data to estimate transition & reward model for (s,a) when planning

SLIDE 29

Estimating MDP Model for a (s,a) Pair Given Data

Transition model estimation
Reward model estimation
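The slide's formulas did not survive extraction. A plausible reading, assumed here rather than taken from the slide, is the usual empirical estimate: transition probabilities as observed frequencies and the reward as the sample mean. The function name `estimate_model` and the toy data are hypothetical.

```python
from collections import Counter

def estimate_model(transitions):
    """Estimate T(.|s,a) and R(s,a) for one (s, a) pair.

    transitions: list of (next_state, reward) tuples observed after taking
    action a in state s.
    """
    n = len(transitions)
    counts = Counter(s_next for s_next, _ in transitions)
    # Empirical frequency of each observed next state.
    T_hat = {s_next: c / n for s_next, c in counts.items()}
    # Sample mean of the observed rewards.
    R_hat = sum(r for _, r in transitions) / n
    return T_hat, R_hat

data = [("s1", 0.0), ("s2", 1.0), ("s2", 1.0), ("s1", 0.0)]
T_hat, R_hat = estimate_model(data)
```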

SLIDE 30

R-max Algorithm

Plan in known MDP → Act using policy → Update state-action counts → Update known MDP dynamics & reward models → (repeat)
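The planning half of this loop can be sketched as follows. This is a simplified illustration under stated assumptions, not the published R-max algorithm in full: the data layout (`counts`, `reward_sums`), the function names, the fixed number of value-iteration sweeps, and the toy problem are all made up for the example.

```python
def build_known_mdp(states, actions, counts, reward_sums, r_max, n_known):
    """Return (T, R) for the optimistic "known" MDP.

    counts[(s, a)][s2] is how often s2 followed (s, a);
    reward_sums[(s, a)] is the total reward observed for (s, a).
    """
    T, R = {}, {}
    for s in states:
        for a in actions:
            n = sum(counts.get((s, a), {}).values())
            if n > n_known:
                # Known pair: use empirical estimates.
                T[(s, a)] = {s2: c / n for s2, c in counts[(s, a)].items()}
                R[(s, a)] = reward_sums[(s, a)] / n
            else:
                # Unknown pair: optimistic self loop with reward Rmax.
                T[(s, a)] = {s: 1.0}
                R[(s, a)] = r_max
    return T, R

def value_iteration(states, actions, T, R, gamma, sweeps=1000):
    """Plan in the known MDP with a fixed number of synchronous sweeps."""
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(sweeps):
        V = {s: max(Q[(s, a)] for a in actions) for s in states}
        Q = {
            (s, a): R[(s, a)] + gamma * sum(p * V[s2] for s2, p in T[(s, a)].items())
            for s in states
            for a in actions
        }
    return Q

# With no data yet, every pair is unknown and therefore optimistic.
states, actions = ["s1", "s2"], ["a1", "a2"]
T, R = build_known_mdp(states, actions, counts={}, reward_sums={}, r_max=1.0, n_known=5)
Q = value_iteration(states, actions, T, R, gamma=0.9)
```

Acting greedily with respect to `Q` pulls the agent toward unknown pairs because they look maximally rewarding, which is how the optimism drives exploration.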

SLIDE 31

R-max Behavior

SLIDE 32

Sample Complexity of R-max

$\tilde{O}\left( \frac{SA}{\epsilon(1-\gamma)^2} \cdot \frac{S}{\epsilon^2(1-\gamma)^4} \right)$

The second factor, $\frac{S}{\epsilon^2(1-\gamma)^4}$, is the # of samples needed per (s,a) pair

On all but the above number of steps, chooses an action whose expected reward is close to the expected reward of the action we would take if we knew the model parameters, with high probability

SLIDE 33

Common Idea of Model-Based PAC RL Algorithms: Quantify how many samples we need to build a good model that we can use to act well in the world

SLIDE 34

"Good" RL Models

Estimate model parameters from experience
More experience means our estimated model parameters will be closer to the true unknown parameters, with high probability

SLIDE 35

Acting Well in the World

$p(s'|s,a)$ known → Compute ε-optimal policy

Bound $(\hat{p}(s'|s,a) - p(s'|s,a))$ → Bound error in policy calculated using $\hat{p}(s'|s,a)$

SLIDE 36

How many samples do we need to build a good model that we can use to act well in the world?

Sample complexity = # steps on which may not act well (could be far from optimal) = Poly(# of states) (R-MAX and E3)

SLIDE 37

Is this a small number of steps?

$\tilde{O}\left( \frac{SA}{\epsilon(1-\gamma)^2} \cdot \frac{S}{\epsilon^2(1-\gamma)^4} \right)$

γ=.99, ε=.05
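To get a feel for the scale, the two factors of the bound can be evaluated directly. This is only an order-of-magnitude sketch: the bound omits constants and lower-order factors, and the problem size (10 states, 2 actions) and the function name `rmax_bound_scale` are assumptions chosen for illustration.

```python
def rmax_bound_scale(num_states, num_actions, epsilon, gamma):
    """Evaluate the two factors of the bound, ignoring hidden constants."""
    # Samples needed per (s, a) pair.
    per_pair = num_states / (epsilon**2 * (1 - gamma) ** 4)
    # Remaining factor, which scales with the number of (s, a) pairs.
    pairs_term = num_states * num_actions / (epsilon * (1 - gamma) ** 2)
    return pairs_term * per_pair

# Hypothetical tiny problem with the slide's gamma and epsilon.
scale = rmax_bound_scale(num_states=10, num_actions=2, epsilon=0.05, gamma=0.99)
```

For this tiny hypothetical problem the expression evaluates to roughly 1.6 × 10^18 steps, so the answer to the slide's question is no: the guarantee is polynomial but far from small in practice.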

SLIDE 38

Concurrent RL

(Guo and Brunskill 2015)

Other examples: all customers using Amazon, or patients in a hospital, …

SLIDE 39

What You Should Know About RL

Define

– Exploration vs exploitation tradeoff
– Model free and model based RL

Implement Q-learning and R-max
Describe evaluation criteria for RL algorithms

– Empirical performance, computational complexity, sample/data efficiency
– Contrast strengths & weaknesses of Q-learning vs R-max in terms of these criteria. Give examples of where each might be preferred

SLIDE 40

AI in the real world: Flappy Bird
http://sarvagyavaish.github.io/FlappyBirdRL/

SLIDE 41

Real Objective
