Reinforcement Learning: Read Chapter, Exercises

slide-1
SLIDE 1
  • Reinforcement Learning: Read Chapter, Exercises
  • Control learning
  • Control policies that choose optimal actions
  • Q learning
  • Convergence

lecture slides for textbook Machine Learning, T. Mitchell, McGraw Hill
slide-2
SLIDE 2 Control Learning

Consider learning to choose actions, e.g.,
  • Robot learning to dock on battery charger
  • Learning to choose actions to optimize factory output
  • Learning to play Backgammon

Note several problem characteristics:
  • Delayed reward
  • Opportunity for active exploration
  • Possibility that state only partially observable
  • Possible need to learn multiple tasks with same sensors/effectors
slide-3
SLIDE 3 One Example: TD-Gammon [Tesauro, 1995]

Learn to play Backgammon.

Immediate reward:
  • +100 if win
  • −100 if lose
  • 0 for all other states

Trained by playing 1.5 million games against itself.

Now approximately equal to best human player.
slide-4
SLIDE 4 Reinforcement Learning Problem

[Figure: agent–environment interaction loop — the agent observes a state and a reward from the environment and emits an action.]

    s0 —(a0, r0)→ s1 —(a1, r1)→ s2 —(a2, r2)→ …

Goal: learn to choose actions that maximize

    r0 + γ r1 + γ² r2 + …, where 0 ≤ γ < 1
slide-5
SLIDE 5 Markov Decision Processes

Assume
  • finite set of states S
  • set of actions A
  • at each discrete time, agent observes state s_t ∈ S and chooses action a_t ∈ A
  • then receives immediate reward r_t, and state changes to s_{t+1}
  • Markov assumption: s_{t+1} = δ(s_t, a_t) and r_t = r(s_t, a_t)
      – i.e., r_t and s_{t+1} depend only on current state and action
      – functions δ and r may be nondeterministic
      – functions δ and r not necessarily known to agent
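A deterministic MDP of this kind can be sketched as two lookup tables, one for δ and one for r. The two-state world below (states `s0`/`goal`, actions `go`/`stay`) is purely illustrative, not from the slides:

```python
# Minimal sketch of a deterministic MDP as two tables (hypothetical world).
# delta[(s, a)] -> next state s_{t+1};  reward[(s, a)] -> immediate reward r_t
delta = {
    ("s0", "go"): "goal",
    ("s0", "stay"): "s0",
    ("goal", "stay"): "goal",
}
reward = {
    ("s0", "go"): 100,      # entering the absorbing goal state
    ("s0", "stay"): 0,
    ("goal", "stay"): 0,
}

def step(s, a):
    """One discrete time step: observe s, choose a, get (r_t, s_{t+1})."""
    return reward[(s, a)], delta[(s, a)]

r, s_next = step("s0", "go")
```

Representing δ and r as explicit tables matches the finite-state assumption above; the agent, of course, is not given these tables.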
slide-6
SLIDE 6 Agent's Learning Task

Execute actions in environment, observe results, and
  • learn action policy π : S → A that maximizes
        E[r_t + γ r_{t+1} + γ² r_{t+2} + …]
    from any starting state in S
  • here 0 ≤ γ < 1 is the discount factor for future rewards

Note something new:
  • target function is π : S → A
  • but we have no training examples of form ⟨s, a⟩
  • training examples are of form ⟨⟨s, a⟩, r⟩
slide-7
SLIDE 7 Value Function

To begin, consider deterministic worlds.

For each possible policy π the agent might adopt, we can define an evaluation function over states

    V^π(s) ≡ r_t + γ r_{t+1} + γ² r_{t+2} + …
           ≡ Σ_{i=0}^∞ γ^i r_{t+i}

where r_t, r_{t+1}, … are generated by following policy π starting at state s.

Restated, the task is to learn the optimal policy π*

    π* ≡ argmax_π V^π(s), (∀s)
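For a deterministic world, V^π(s) can be computed directly by rolling out the policy and summing discounted rewards. The three-state chain below (`s1 → s2 → G`, reward 100 on entering the absorbing goal `G`, γ = 0.9) is a hypothetical stand-in for the grid world on the next slide:

```python
# Sketch: V^pi(s) = sum_i gamma^i r_{t+i}, evaluated by truncated rollout.
gamma = 0.9

delta  = {("s1", "right"): "s2", ("s2", "right"): "G", ("G", "right"): "G"}
reward = {("s1", "right"): 0,    ("s2", "right"): 100, ("G", "right"): 0}

def V(policy, s, horizon=50):
    """Discounted return from s under `policy` (geometric sum, truncated)."""
    total = 0.0
    for i in range(horizon):
        a = policy(s)
        total += gamma**i * reward[(s, a)]
        s = delta[(s, a)]
    return total

def always_right(s):
    return "right"

# From s2 the reward arrives immediately; from s1 one step later, discounted.
v_s2 = V(always_right, "s2")   # 100
v_s1 = V(always_right, "s1")   # 0.9 * 100 = 90
```

The truncation at `horizon` steps is harmless here because the goal state is absorbing with zero reward.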
slide-8
SLIDE 8

[Figure: a six-cell grid world with absorbing goal state G, drawn four times:
  • r(s, a) immediate reward values — 100 for actions entering G, 0 otherwise
  • Q(s, a) values — 100, 90, 81, 72, … propagating back from G
  • V*(s) values — 100, 90, 81 for states one, two, and three moves from G
  • one optimal policy — arrows along shortest paths to G]
slide-9
SLIDE 9 What to Learn

We might try to have agent learn the evaluation function V^{π*} (which we write as V*).

It could then do a lookahead search to choose best action from any state s because

    π*(s) = argmax_a [r(s, a) + γ V*(δ(s, a))]

A problem:
  • This works well if agent knows δ : S × A → S, and r : S × A → ℜ
  • But when it doesn't, it can't choose actions this way
slide-10
SLIDE 10 Q Function

Define new function very similar to V*

    Q(s, a) ≡ r(s, a) + γ V*(δ(s, a))

If agent learns Q, it can choose optimal action even without knowing δ!

    π*(s) = argmax_a [r(s, a) + γ V*(δ(s, a))]
    π*(s) = argmax_a Q(s, a)

Q is the evaluation function the agent will learn.
slide-11
SLIDE 11 Training Rule to Learn Q

Note Q and V* closely related:

    V*(s) = max_{a'} Q(s, a')

Which allows us to write Q recursively as

    Q(s_t, a_t) = r(s_t, a_t) + γ V*(δ(s_t, a_t))
                = r(s_t, a_t) + γ max_{a'} Q(s_{t+1}, a')

Nice! Let Q̂ denote learner's current approximation to Q. Consider training rule

    Q̂(s, a) ← r + γ max_{a'} Q̂(s', a')

where s' is the state resulting from applying action a in state s.
slide-12
SLIDE 12 Q Learning for Deterministic Worlds

For each s, a initialize table entry Q̂(s, a) ← 0.

Observe current state s.

Do forever:
  • Select an action a and execute it
  • Receive immediate reward r
  • Observe the new state s'
  • Update the table entry for Q̂(s, a) as follows:
        Q̂(s, a) ← r + γ max_{a'} Q̂(s', a')
  • s ← s'
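The loop above can be sketched end to end. A hypothetical three-state chain (`s1 → s2 → G`, reward 100 on entering the absorbing goal `G`, γ = 0.9, random action selection, episodes restarting at `s1`) stands in for the grid world; these choices are illustrative, not prescribed by the slides:

```python
import random

# Sketch of the deterministic Q-learning loop on a tiny chain world.
gamma = 0.9
states, actions = ["s1", "s2", "G"], ["right", "stay"]
delta  = {("s1", "right"): "s2", ("s1", "stay"): "s1",
          ("s2", "right"): "G",  ("s2", "stay"): "s2",
          ("G", "right"):  "G",  ("G", "stay"):  "G"}
reward = {(s, a): (100 if delta[(s, a)] == "G" and s != "G" else 0)
          for s in states for a in actions}

# For each s, a initialize table entry Q(s, a) <- 0
Q = {(s, a): 0.0 for s in states for a in actions}

random.seed(0)
s = "s1"
for _ in range(2000):                # "do forever", truncated
    a = random.choice(actions)       # select an action a and execute it
    r, s_next = reward[(s, a)], delta[(s, a)]
    # Q(s, a) <- r + gamma * max_a' Q(s', a')
    Q[(s, a)] = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    s = s_next if s_next != "G" else "s1"   # restart episode at the goal
```

After enough visits the table settles into the pattern of the earlier figure: Q̂(s2, right) = 100 and Q̂(s1, right) = 0.9 · 100 = 90.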
slide-13
SLIDE 13 Updating Q̂

[Figure: the grid world before and after the agent moves right. Initial state s1: surrounding Q̂ entries 63, 72, 81, 100. Next state s2: entries 63, 81, 100, with the right-arrow entry updated to 90.]

    Q̂(s1, a_right) ← r + γ max_{a'} Q̂(s2, a')
                    ← 0 + 0.9 max{63, 81, 100}
                    ← 90

Notice if rewards non-negative, then

    (∀ s, a, n)  Q̂_{n+1}(s, a) ≥ Q̂_n(s, a)

and

    (∀ s, a, n)  0 ≤ Q̂_n(s, a) ≤ Q(s, a)
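The single table update on this slide can be reproduced in a few lines, assuming γ = 0.9 and the illustrative s2 entries shown in the figure:

```python
# Reproducing one Q-hat update: move right from s1, receive r = 0,
# back up the best entry available in s2 (entries are illustrative).
gamma = 0.9
r = 0
Q_s2 = {"left": 63.0, "up": 81.0, "right": 100.0}

Q_s1_right = r + gamma * max(Q_s2.values())   # 0 + 0.9 * 100 = 90
```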
slide-14
SLIDE 14 Q̂ converges to Q

Consider case of deterministic world where each ⟨s, a⟩ is visited infinitely often.

Proof: Define a full interval to be an interval during which each ⟨s, a⟩ is visited. During each full interval the largest error in the Q̂ table is reduced by factor of γ.

Let Q̂_n be the table after n updates, and Δ_n be the maximum error in Q̂_n; that is,

    Δ_n = max_{s,a} |Q̂_n(s, a) − Q(s, a)|

For any table entry Q̂_n(s, a) updated on iteration n + 1, the error in the revised estimate Q̂_{n+1}(s, a) is

    |Q̂_{n+1}(s, a) − Q(s, a)| = |(r + γ max_{a'} Q̂_n(s', a')) − (r + γ max_{a'} Q(s', a'))|
                              = γ |max_{a'} Q̂_n(s', a') − max_{a'} Q(s', a')|
                              ≤ γ max_{a'} |Q̂_n(s', a') − Q(s', a')|
                              ≤ γ max_{s'', a'} |Q̂_n(s'', a') − Q(s'', a')|

    |Q̂_{n+1}(s, a) − Q(s, a)| ≤ γ Δ_n
slide-15
SLIDE 15

Note we used the general fact that

    |max_a f1(a) − max_a f2(a)| ≤ max_a |f1(a) − f2(a)|
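A quick numeric check of this fact, on arbitrary made-up values (the functions `f1`, `f2` below are illustrative only):

```python
# |max_a f1(a) - max_a f2(a)| <= max_a |f1(a) - f2(a)|
def lhs(f1, f2):
    return abs(max(f1.values()) - max(f2.values()))

def rhs(f1, f2):
    return max(abs(f1[a] - f2[a]) for a in f1)

f1 = {"a1": 3.0, "a2": 7.0, "a3": -1.0}
f2 = {"a1": 6.0, "a2": 2.0, "a3": 0.5}
# lhs = |7 - 6| = 1, while rhs = max{3, 5, 1.5} = 5
```

The inequality can be strict, as here: the maxima may be attained at different actions, which is exactly why the proof needs this bound rather than an equality.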
slide-16
SLIDE 16 Nondeterministic Case

What if reward and next state are nondeterministic?

We redefine V, Q by taking expected values:

    V^π(s) ≡ E[r_t + γ r_{t+1} + γ² r_{t+2} + …]
           ≡ E[Σ_{i=0}^∞ γ^i r_{t+i}]

    Q(s, a) ≡ E[r(s, a) + γ V*(δ(s, a))]
slide-17
SLIDE 17 Nondeterministic Case

Q learning generalizes to nondeterministic worlds.

Alter training rule to

    Q̂_n(s, a) ← (1 − α_n) Q̂_{n−1}(s, a) + α_n [r + γ max_{a'} Q̂_{n−1}(s', a')]

where

    α_n = 1 / (1 + visits_n(s, a))

Can still prove convergence of Q̂ to Q [Watkins and Dayan, 1992]
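With α_n = 1/(1 + visits_n(s, a)), each update is a running average over the sampled targets. The sketch below applies the rule to a single (s, a) entry fed alternating rewards of 100 and 0 with no successor value, a hypothetical stand-in for a stochastic reward; the estimate settles on the expected reward, 50:

```python
# Nondeterministic training rule for one fixed (s, a) entry.
gamma = 0.9
Q, visits = 0.0, 0   # current estimate Q_{n-1}(s, a) and visit count

def update(Q, visits, r, max_next=0.0):
    """One application of the rule; alpha_n = 1 / (1 + visits_n(s, a))."""
    alpha = 1.0 / (1 + visits)
    return (1 - alpha) * Q + alpha * (r + gamma * max_next), visits + 1

for i in range(1000):
    Q, visits = update(Q, visits, 100 if i % 2 == 0 else 0)
```

With this decaying α the estimate equals the sample mean of the targets seen so far, which is what lets the noise average out instead of causing the estimate to oscillate as the deterministic rule would.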
slide-18
SLIDE 18 Temporal Difference Learning

Q learning: reduce discrepancy between successive Q estimates.

One step time difference:

    Q^{(1)}(s_t, a_t) ≡ r_t + γ max_a Q̂(s_{t+1}, a)

Why not two steps?

    Q^{(2)}(s_t, a_t) ≡ r_t + γ r_{t+1} + γ² max_a Q̂(s_{t+2}, a)

Or n?

    Q^{(n)}(s_t, a_t) ≡ r_t + γ r_{t+1} + ⋯ + γ^{n−1} r_{t+n−1} + γ^n max_a Q̂(s_{t+n}, a)

Blend all of these:

    Q^λ(s_t, a_t) ≡ (1 − λ) [Q^{(1)}(s_t, a_t) + λ Q^{(2)}(s_t, a_t) + λ² Q^{(3)}(s_t, a_t) + ⋯]
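The n-step estimates and their λ-blend can be sketched on a fixed reward trace. The trace `[1, 2, 3]`, the constant bootstrap value 10 standing in for max_a Q̂, and γ = 0.9 are all illustrative; the sum over n is truncated, so this is a sketch of the definitions, not a full TD(λ) implementation:

```python
gamma = 0.9

def Q_n(rewards, n, bootstrap):
    """Q^(n): r_t + gamma r_{t+1} + ... + gamma^{n-1} r_{t+n-1} + gamma^n * bootstrap."""
    return sum(gamma**i * rewards[i] for i in range(n)) + gamma**n * bootstrap

def Q_lambda(rewards, lam, bootstrap, N=3):
    """Q^lambda: (1 - lam) * sum_n lam^(n-1) Q^(n), truncated at N steps."""
    return (1 - lam) * sum(lam**(n - 1) * Q_n(rewards, n, bootstrap)
                           for n in range(1, N + 1))

rewards = [1.0, 2.0, 3.0]
one_step = Q_n(rewards, 1, 10.0)   # 1 + 0.9 * 10           = 10.0
two_step = Q_n(rewards, 2, 10.0)   # 1 + 0.9*2 + 0.81 * 10  = 10.9
```

Setting λ = 0 recovers the one-step estimate exactly, which matches the remark on the next slide that TD(λ) interpolates between one-step Q learning and full Monte Carlo backups.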
slide-19
SLIDE 19 Temporal Difference Learning

    Q^λ(s_t, a_t) ≡ (1 − λ) [Q^{(1)}(s_t, a_t) + λ Q^{(2)}(s_t, a_t) + λ² Q^{(3)}(s_t, a_t) + ⋯]

Equivalent expression:

    Q^λ(s_t, a_t) = r_t + γ [(1 − λ) max_a Q̂(s_{t+1}, a) + λ Q^λ(s_{t+1}, a_{t+1})]

TD(λ) algorithm uses above training rule:
  • Sometimes converges faster than Q learning
  • converges for learning V* for any 0 ≤ λ ≤ 1 [Dayan, 1992]
  • Tesauro's TD-Gammon uses this algorithm
slide-20
SLIDE 20 Subtleties and Ongoing Research

  • Replace Q̂ table with neural net or other generalizer
  • Handle case where state only partially observable
  • Design optimal exploration strategies
  • Extend to continuous action, state
  • Learn and use δ̂ : S × A → S
  • Relationship to dynamic programming