Reinforcement Learning: Read Chapter, Exercises

slide-1
SLIDE 1
  • Reinforcement Learning: Read Chapter, Exercises
  • Control learning
  • Control policies that choose optimal actions
  • Q learning
  • Convergence

lecture slides for textbook Machine Learning, T. Mitchell, McGraw Hill
slide-2
SLIDE 2 Control Learning

Consider learning to choose actions, e.g.,
  • Robot learning to dock on battery charger
  • Learning to choose actions to optimize factory output
  • Learning to play Backgammon

Note several problem characteristics:
  • Delayed reward
  • Opportunity for active exploration
  • Possibility that state only partially observable
  • Possible need to learn multiple tasks with same sensors/effectors
slide-3
SLIDE 3 One Example: TD-Gammon [Tesauro, 1995]

Learn to play Backgammon.

Immediate reward:
  • +100 if win
  • −100 if lose
  • 0 for all other states

Trained by playing 1.5 million games against itself.

Now approximately equal to best human player.
slide-4
SLIDE 4 Reinforcement Learning Problem

[Figure: agent–environment interaction loop — the agent observes a state and a reward from the environment and emits an action.]

    s0 —(a0, r0)→ s1 —(a1, r1)→ s2 —(a2, r2)→ …

Goal: learn to choose actions that maximize

    r0 + γ r1 + γ² r2 + …, where 0 ≤ γ < 1
slide-5
SLIDE 5 Markov Decision Processes

Assume
  • finite set of states S
  • set of actions A
  • at each discrete time, agent observes state s_t ∈ S and chooses action a_t ∈ A
  • then receives immediate reward r_t, and state changes to s_{t+1}
  • Markov assumption: s_{t+1} = δ(s_t, a_t) and r_t = r(s_t, a_t)
      – i.e., r_t and s_{t+1} depend only on current state and action
      – functions δ and r may be nondeterministic
      – functions δ and r not necessarily known to agent
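A deterministic MDP of this kind can be sketched as two lookup tables, one for δ and one for r. The two-state world below (states `s0`/`goal`, actions `go`/`stay`) is purely illustrative, not from the slides:

```python
# Minimal sketch of a deterministic MDP as two tables (hypothetical world).
# delta[(s, a)] -> next state s_{t+1};  reward[(s, a)] -> immediate reward r_t
delta = {
    ("s0", "go"): "goal",
    ("s0", "stay"): "s0",
    ("goal", "stay"): "goal",
}
reward = {
    ("s0", "go"): 100,      # entering the absorbing goal state
    ("s0", "stay"): 0,
    ("goal", "stay"): 0,
}

def step(s, a):
    """One discrete time step: observe s, choose a, get (r_t, s_{t+1})."""
    return reward[(s, a)], delta[(s, a)]

r, s_next = step("s0", "go")
```

Representing δ and r as explicit tables matches the finite-state assumption above; the agent, of course, is not given these tables.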
slide-6
SLIDE 6 Agent's Learning Task

Execute actions in environment, observe results, and
  • learn action policy π : S → A that maximizes
        E[r_t + γ r_{t+1} + γ² r_{t+2} + …]
    from any starting state in S
  • here 0 ≤ γ < 1 is the discount factor for future rewards

Note something new:
  • target function is π : S → A
  • but we have no training examples of form ⟨s, a⟩
  • training examples are of form ⟨⟨s, a⟩, r⟩
slide-7
SLIDE 7 Value Function

To begin, consider deterministic worlds.

For each possible policy π the agent might adopt, we can define an evaluation function over states

    V^π(s) ≡ r_t + γ r_{t+1} + γ² r_{t+2} + …
           ≡ Σ_{i=0}^∞ γ^i r_{t+i}

where r_t, r_{t+1}, … are generated by following policy π starting at state s.

Restated, the task is to learn the optimal policy π*

    π* ≡ argmax_π V^π(s), (∀s)
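For a deterministic world, V^π(s) can be computed directly by rolling out the policy and summing discounted rewards. The three-state chain below (`s1 → s2 → G`, reward 100 on entering the absorbing goal `G`, γ = 0.9) is a hypothetical stand-in for the grid world on the next slide:

```python
# Sketch: V^pi(s) = sum_i gamma^i r_{t+i}, evaluated by truncated rollout.
gamma = 0.9

delta  = {("s1", "right"): "s2", ("s2", "right"): "G", ("G", "right"): "G"}
reward = {("s1", "right"): 0,    ("s2", "right"): 100, ("G", "right"): 0}

def V(policy, s, horizon=50):
    """Discounted return from s under `policy` (geometric sum, truncated)."""
    total = 0.0
    for i in range(horizon):
        a = policy(s)
        total += gamma**i * reward[(s, a)]
        s = delta[(s, a)]
    return total

def always_right(s):
    return "right"

# From s2 the reward arrives immediately; from s1 one step later, discounted.
v_s2 = V(always_right, "s2")   # 100
v_s1 = V(always_right, "s1")   # 0.9 * 100 = 90
```

The truncation at `horizon` steps is harmless here because the goal state is absorbing with zero reward.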
slide-8
SLIDE 8

[Figure: a six-cell grid world with absorbing goal state G, drawn four times:
  • r(s, a) immediate reward values — 100 for actions entering G, 0 otherwise
  • Q(s, a) values — 100, 90, 81, 72, … propagating back from G
  • V*(s) values — 100, 90, 81 for states one, two, and three moves from G
  • one optimal policy — arrows along shortest paths to G]
slide-9
SLIDE 9 What to Learn

We might try to have agent learn the evaluation function V^{π*} (which we write as V*).

It could then do a lookahead search to choose best action from any state s because

    π*(s) = argmax_a [r(s, a) + γ V*(δ(s, a))]

A problem:
  • This works well if agent knows δ : S × A → S, and r : S × A → ℜ
  • But when it doesn't, it can't choose actions this way
slide-10
SLIDE 10 Q Function

Define new function very similar to V*

    Q(s, a) ≡ r(s, a) + γ V*(δ(s, a))

If agent learns Q, it can choose optimal action even without knowing δ!

    π*(s) = argmax_a [r(s, a) + γ V*(δ(s, a))]
    π*(s) = argmax_a Q(s, a)

Q is the evaluation function the agent will learn.
slide-11
SLIDE 11 Training Rule to Learn Q

Note Q and V* closely related:

    V*(s) = max_{a'} Q(s, a')

Which allows us to write Q recursively as

    Q(s_t, a_t) = r(s_t, a_t) + γ V*(δ(s_t, a_t))
                = r(s_t, a_t) + γ max_{a'} Q(s_{t+1}, a')

Nice! Let Q̂ denote learner's current approximation to Q. Consider training rule

    Q̂(s, a) ← r + γ max_{a'} Q̂(s', a')

where s' is the state resulting from applying action a in state s.
slide-12
SLIDE 12 Q Learning for Deterministic Worlds

For each s, a initialize table entry Q̂(s, a) ← 0.

Observe current state s.

Do forever:
  • Select an action a and execute it
  • Receive immediate reward r
  • Observe the new state s'
  • Update the table entry for Q̂(s, a) as follows:
        Q̂(s, a) ← r + γ max_{a'} Q̂(s', a')
  • s ← s'
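The loop above can be sketched end to end. A hypothetical three-state chain (`s1 → s2 → G`, reward 100 on entering the absorbing goal `G`, γ = 0.9, random action selection, episodes restarting at `s1`) stands in for the grid world; these choices are illustrative, not prescribed by the slides:

```python
import random

# Sketch of the deterministic Q-learning loop on a tiny chain world.
gamma = 0.9
states, actions = ["s1", "s2", "G"], ["right", "stay"]
delta  = {("s1", "right"): "s2", ("s1", "stay"): "s1",
          ("s2", "right"): "G",  ("s2", "stay"): "s2",
          ("G", "right"):  "G",  ("G", "stay"):  "G"}
reward = {(s, a): (100 if delta[(s, a)] == "G" and s != "G" else 0)
          for s in states for a in actions}

# For each s, a initialize table entry Q(s, a) <- 0
Q = {(s, a): 0.0 for s in states for a in actions}

random.seed(0)
s = "s1"
for _ in range(2000):                # "do forever", truncated
    a = random.choice(actions)       # select an action a and execute it
    r, s_next = reward[(s, a)], delta[(s, a)]
    # Q(s, a) <- r + gamma * max_a' Q(s', a')
    Q[(s, a)] = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    s = s_next if s_next != "G" else "s1"   # restart episode at the goal
```

After enough visits the table settles into the pattern of the earlier figure: Q̂(s2, right) = 100 and Q̂(s1, right) = 0.9 · 100 = 90.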
slide-13
SLIDE 13 Updating Q̂

[Figure: the grid world before and after the agent moves right. Initial state s1: surrounding Q̂ entries 63, 72, 81, 100. Next state s2: entries 63, 81, 100, with the right-arrow entry updated to 90.]

    Q̂(s1, a_right) ← r + γ max_{a'} Q̂(s2, a')
                    ← 0 + 0.9 max{63, 81, 100}
                    ← 90

Notice if rewards non-negative, then

    (∀ s, a, n)  Q̂_{n+1}(s, a) ≥ Q̂_n(s, a)

and

    (∀ s, a, n)  0 ≤ Q̂_n(s, a) ≤ Q(s, a)
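The single table update on this slide can be reproduced in a few lines, assuming γ = 0.9 and the illustrative s2 entries shown in the figure:

```python
# Reproducing one Q-hat update: move right from s1, receive r = 0,
# back up the best entry available in s2 (entries are illustrative).
gamma = 0.9
r = 0
Q_s2 = {"left": 63.0, "up": 81.0, "right": 100.0}

Q_s1_right = r + gamma * max(Q_s2.values())   # 0 + 0.9 * 100 = 90
```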
slide-14
SLIDE 14 Q̂ converges to Q

Consider case of deterministic world where each ⟨s, a⟩ is visited infinitely often.

Proof: Define a full interval to be an interval during which each ⟨s, a⟩ is visited. During each full interval the largest error in the Q̂ table is reduced by factor of γ.

Let Q̂_n be the table after n updates, and Δ_n be the maximum error in Q̂_n; that is,

    Δ_n = max_{s,a} |Q̂_n(s, a) − Q(s, a)|

For any table entry Q̂_n(s, a) updated on iteration n + 1, the error in the revised estimate Q̂_{n+1}(s, a) is

    |Q̂_{n+1}(s, a) − Q(s, a)| = |(r + γ max_{a'} Q̂_n(s', a')) − (r + γ max_{a'} Q(s', a'))|
                              = γ |max_{a'} Q̂_n(s', a') − max_{a'} Q(s', a')|
                              ≤ γ max_{a'} |Q̂_n(s', a') − Q(s', a')|
                              ≤ γ max_{s'', a'} |Q̂_n(s'', a') − Q(s'', a')|

    |Q̂_{n+1}(s, a) − Q(s, a)| ≤ γ Δ_n
slide-15
SLIDE 15

Note we used the general fact that

    |max_a f1(a) − max_a f2(a)| ≤ max_a |f1(a) − f2(a)|
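A quick numeric check of this fact, on arbitrary made-up values (the functions `f1`, `f2` below are illustrative only):

```python
# |max_a f1(a) - max_a f2(a)| <= max_a |f1(a) - f2(a)|
def lhs(f1, f2):
    return abs(max(f1.values()) - max(f2.values()))

def rhs(f1, f2):
    return max(abs(f1[a] - f2[a]) for a in f1)

f1 = {"a1": 3.0, "a2": 7.0, "a3": -1.0}
f2 = {"a1": 6.0, "a2": 2.0, "a3": 0.5}
# lhs = |7 - 6| = 1, while rhs = max{3, 5, 1.5} = 5
```

The inequality can be strict, as here: the maxima may be attained at different actions, which is exactly why the proof needs this bound rather than an equality.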
slide-16
SLIDE 16 Nondeterministic Case

What if reward and next state are nondeterministic?

We redefine V, Q by taking expected values:

    V^π(s) ≡ E[r_t + γ r_{t+1} + γ² r_{t+2} + …]
           ≡ E[Σ_{i=0}^∞ γ^i r_{t+i}]

    Q(s, a) ≡ E[r(s, a) + γ V*(δ(s, a))]
slide-17
SLIDE 17 Nondeterministic Case

Q learning generalizes to nondeterministic worlds.

Alter training rule to

    Q̂_n(s, a) ← (1 − α_n) Q̂_{n−1}(s, a) + α_n [r + γ max_{a'} Q̂_{n−1}(s', a')]

where

    α_n = 1 / (1 + visits_n(s, a))

Can still prove convergence of Q̂ to Q [Watkins and Dayan, 1992]
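With α_n = 1/(1 + visits_n(s, a)), each update is a running average over the sampled targets. The sketch below applies the rule to a single (s, a) entry fed alternating rewards of 100 and 0 with no successor value, a hypothetical stand-in for a stochastic reward; the estimate settles on the expected reward, 50:

```python
# Nondeterministic training rule for one fixed (s, a) entry.
gamma = 0.9
Q, visits = 0.0, 0   # current estimate Q_{n-1}(s, a) and visit count

def update(Q, visits, r, max_next=0.0):
    """One application of the rule; alpha_n = 1 / (1 + visits_n(s, a))."""
    alpha = 1.0 / (1 + visits)
    return (1 - alpha) * Q + alpha * (r + gamma * max_next), visits + 1

for i in range(1000):
    Q, visits = update(Q, visits, 100 if i % 2 == 0 else 0)
```

With this decaying α the estimate equals the sample mean of the targets seen so far, which is what lets the noise average out instead of causing the estimate to oscillate as the deterministic rule would.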
slide-18
SLIDE 18 Temporal Difference Learning

Q learning: reduce discrepancy between successive Q estimates.

One step time difference:

    Q^{(1)}(s_t, a_t) ≡ r_t + γ max_a Q̂(s_{t+1}, a)

Why not two steps?

    Q^{(2)}(s_t, a_t) ≡ r_t + γ r_{t+1} + γ² max_a Q̂(s_{t+2}, a)

Or n?

    Q^{(n)}(s_t, a_t) ≡ r_t + γ r_{t+1} + ⋯ + γ^{n−1} r_{t+n−1} + γ^n max_a Q̂(s_{t+n}, a)

Blend all of these:

    Q^λ(s_t, a_t) ≡ (1 − λ) [Q^{(1)}(s_t, a_t) + λ Q^{(2)}(s_t, a_t) + λ² Q^{(3)}(s_t, a_t) + ⋯]
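The n-step estimates and their λ-blend can be sketched on a fixed reward trace. The trace `[1, 2, 3]`, the constant bootstrap value 10 standing in for max_a Q̂, and γ = 0.9 are all illustrative; the sum over n is truncated, so this is a sketch of the definitions, not a full TD(λ) implementation:

```python
gamma = 0.9

def Q_n(rewards, n, bootstrap):
    """Q^(n): r_t + gamma r_{t+1} + ... + gamma^{n-1} r_{t+n-1} + gamma^n * bootstrap."""
    return sum(gamma**i * rewards[i] for i in range(n)) + gamma**n * bootstrap

def Q_lambda(rewards, lam, bootstrap, N=3):
    """Q^lambda: (1 - lam) * sum_n lam^(n-1) Q^(n), truncated at N steps."""
    return (1 - lam) * sum(lam**(n - 1) * Q_n(rewards, n, bootstrap)
                           for n in range(1, N + 1))

rewards = [1.0, 2.0, 3.0]
one_step = Q_n(rewards, 1, 10.0)   # 1 + 0.9 * 10           = 10.0
two_step = Q_n(rewards, 2, 10.0)   # 1 + 0.9*2 + 0.81 * 10  = 10.9
```

Setting λ = 0 recovers the one-step estimate exactly, which matches the remark on the next slide that TD(λ) interpolates between one-step Q learning and full Monte Carlo backups.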
slide-19
SLIDE 19 Temporal Difference Learning

    Q^λ(s_t, a_t) ≡ (1 − λ) [Q^{(1)}(s_t, a_t) + λ Q^{(2)}(s_t, a_t) + λ² Q^{(3)}(s_t, a_t) + ⋯]

Equivalent expression:

    Q^λ(s_t, a_t) = r_t + γ [(1 − λ) max_a Q̂(s_{t+1}, a) + λ Q^λ(s_{t+1}, a_{t+1})]

TD(λ) algorithm uses above training rule:
  • Sometimes converges faster than Q learning
  • converges for learning V* for any 0 ≤ λ ≤ 1 [Dayan, 1992]
  • Tesauro's TD-Gammon uses this algorithm
slide-20
SLIDE 20 Subtleties and Ongoing Research

  • Replace Q̂ table with neural net or other generalizer
  • Handle case where state only partially observable
  • Design optimal exploration strategies
  • Extend to continuous action, state
  • Learn and use δ̂ : S × A → S
  • Relationship to dynamic programming