Reinforcement Learning
Part 1
Yingyu Liang yliang@cs.wisc.edu Computer Sciences Department University of Wisconsin, Madison
[Based on slides from David Page, Mark Craven]
Goals for the lecture
you should understand the following concepts
Task: an agent embedded in an environment repeats forever:
1) sense world
2) reason
3) choose an action to perform
4) get feedback (usually reward = 0)
5) learn
the environment may be the physical world or an artificial one
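As a minimal sketch of this loop (the `env` and `agent` objects and their method names are hypothetical, invented here for illustration, not from any specific library):

```python
# A minimal sketch of the sense-reason-act-learn loop. `env` senses and
# steps the world; `agent` picks actions and learns from feedback.
# Both interfaces are invented for illustration.

def run_episode(env, agent, num_steps=1000):
    state = env.observe()                          # 1) sense world
    for _ in range(num_steps):
        action = agent.choose_action(state)        # 2) reason, 3) choose an action
        next_state, reward = env.step(action)      # 4) get feedback (often reward = 0)
        agent.learn(state, action, reward, next_state)  # 5) learn
        state = next_state
```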
Example: backgammon (Tesauro's TD-Gammon)
state: board configuration
– 30 pieces, 24 locations
actions: given a dice roll, e.g. 2 and 5
– move one piece 2 positions
– move one piece 5 positions
rewards:
– win, lose
results:
– trained against itself (300,000 games)
– as good as the best previous backgammon computer program (also by Tesauro)
– beat the human champion
Example: Go (AlphaGo)
state: board configuration
– 19x19 locations
actions:
– put one stone on some empty location
rewards:
– win, lose
results:
– beat Lee Sedol by 4-1
– shows performance superior to human players
– a key technique: reinforcement learning
[Figure: the agent senses the state of the environment, performs an action, and receives a reward in return]
interaction trace: s0, a0, r0, s1, a1, r1, s2, a2, r2, …
– at each time step t, the agent senses state st ∈ S, then chooses action at ∈ A
– performing action at yields reward rt and moves the agent to state st+1
Goal: learn a policy π : S → A for choosing actions that maximizes the expected discounted return E[r0 + γr1 + γ²r2 + …] for every possible starting state s0
Markov assumption: the next state and the reward depend only on the current state and action, not on earlier history
P(st+1 | st, at, st−1, at−1, …) = P(st+1 | st, at)
P(rt | st, at, st−1, at−1, …) = P(rt | st, at)
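To make the Markov assumption concrete, here is a small sketch with a made-up two-state transition table: the next state is sampled from a distribution indexed only by the current state and action, so earlier history cannot influence it.

```python
import random

# A toy two-state Markov transition model, invented for illustration.
# P[(s, a)] is the distribution over next states given the CURRENT state
# and action only -- earlier states and actions never enter the lookup,
# which is exactly the Markov assumption.
P = {
    ("s0", "a"): {"s0": 0.2, "s1": 0.8},
    ("s1", "a"): {"s0": 0.5, "s1": 0.5},
}

def sample_next_state(state, action):
    dist = P[(state, action)]
    next_states = list(dist)
    return random.choices(next_states, weights=[dist[s] for s in next_states])[0]
```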
the task: learn a policy that maximizes the expected discounted return
E[rt + γrt+1 + γ²rt+2 + …], where 0 ≤ γ < 1 is the discount factor,
from every state s ∈ S
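For a single realized trajectory, the discounted return reduces to a plain discounted sum of its rewards; a minimal sketch (γ = 0.9 is just an example value):

```python
def discounted_return(rewards, gamma=0.9):
    # r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for one realized
    # trajectory is a discounted sum of its rewards.
    return sum(gamma**t * r for t, r in enumerate(rewards))

# A reward of 100 received three steps in the future is worth
# 0.9**3 * 100 = 72.9 now:
print(discounted_return([0, 0, 0, 100]))  # 72.9 (up to float rounding)
```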
[Figure: a simple deterministic grid world with absorbing goal state G; the two arrows entering G carry reward 100, all other actions reward 0]
each arrow represents an action a and the associated number represents deterministic reward r(s, a)
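Since the figure itself is not reproducible here, a deterministic grid world of this flavor can be written out directly. The 2x3 layout below, with an absorbing goal G in the top-right corner and reward 100 for entering G, is an assumption chosen to match the values quoted on the later slides:

```python
# A hypothetical 2x3 deterministic grid world in the spirit of the figure:
# the goal G sits in the top-right corner and is absorbing; any action
# that moves the agent into G earns reward 100, every other action earns 0.
# (The exact layout on the slide may differ -- this is an assumption.)
ROWS, COLS, GOAL = 2, 3, (0, 2)
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
STATES = [(i, j) for i in range(ROWS) for j in range(COLS)]

def delta(s, a):
    """Deterministic successor state delta(s, a)."""
    if s == GOAL:                       # G is absorbing
        return s
    i, j = s[0] + MOVES[a][0], s[1] + MOVES[a][1]
    return (i, j) if 0 <= i < ROWS and 0 <= j < COLS else s

def r(s, a):
    """Deterministic reward r(s, a): 100 for entering G, else 0."""
    return 100 if s != GOAL and delta(s, a) == GOAL else 0
```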
value function: Vπ(s) = E[ Σ_{t≥0} γ^t rt ]
assuming the action sequence is chosen according to π starting at state s
the optimal policy π* maximizes Vπ(s) for every state s
we'll denote the value function for this optimal policy as V*(s)
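For a deterministic world and a deterministic policy, Vπ(s) can be evaluated by simply rolling the policy forward and discounting; a sketch, where `policy`, `delta`, and `r` are assumed callables like those in the hypothetical grid world above:

```python
def evaluate_policy(s, policy, delta, r, gamma=0.9, horizon=100):
    # V_pi(s) for a deterministic world and deterministic policy: roll the
    # policy forward from s and accumulate discounted rewards. Truncating
    # at `horizon` steps is exact up to a gamma**horizon error term.
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        a = policy(s)
        total += discount * r(s, a)
        discount *= gamma
        s = delta(s, a)
    return total
```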
[Figure: the grid world with goal G; for one particular policy π, the Vπ(s) values are shown in red: 100, 90, 81, 73, 66]
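These red values are consistent with reaching the +100 reward in 1 through 5 steps under a discount factor of γ = 0.9 (the value of γ is an assumption; the extracted slide does not state it):

```python
# 100 * 0.9**k for k = 0..4, rounded: the values shown in the figure
print([round(100 * 0.9**k) for k in range(5)])  # [100, 90, 81, 73, 66]
```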
[Figure: the same grid world; the optimal values V*(s) are shown in red: 100, 90, 100, 90, 81]
value iteration computes V* by iterating the recurrence (for all s ∈ S):
V*(s) = max_{a∈A} [ r(s, a) + γ V*(δ(s, a)) ]
where δ(s, a) denotes the state reached by taking action a in state s
initialize V(s) arbitrarily
loop until policy good enough
{
  loop for s ∈ S
  {
    loop for a ∈ A
    {
      Q(s, a) := r(s, a) + γ V(δ(s, a))
    }
    V(s) := max_a Q(s, a)
  }
}
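A runnable version of this loop on the hypothetical 2x3 grid world sketched earlier (the layout and γ = 0.9 remain assumptions, not taken from the slides):

```python
# Value iteration on the hypothetical 2x3 deterministic grid world:
# absorbing goal G = (0, 2); reward 100 for entering G, 0 otherwise.
ROWS, COLS, GOAL, GAMMA = 2, 3, (0, 2), 0.9
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
STATES = [(i, j) for i in range(ROWS) for j in range(COLS)]

def delta(s, a):
    if s == GOAL:                       # G is absorbing
        return s
    i, j = s[0] + MOVES[a][0], s[1] + MOVES[a][1]
    return (i, j) if 0 <= i < ROWS and 0 <= j < COLS else s

def r(s, a):
    return 100 if s != GOAL and delta(s, a) == GOAL else 0

V = {s: 0.0 for s in STATES}            # initialize V(s) arbitrarily
for _ in range(100):                    # "until policy good enough"
    for s in STATES:
        # Bellman backup: V(s) <- max_a [ r(s,a) + gamma * V(delta(s,a)) ]
        V[s] = max(r(s, a) + GAMMA * V[delta(s, a)] for a in MOVES)

print(V)
```

The fixed point matches the V*(s) values quoted on the earlier slide: 100 one step from G, 90 two steps away, 81 three steps away.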
note: we don't have to perform the updates by looping through each state and action methodically
– but we must visit each state infinitely often
– the updates can even be performed as the agent moves around its environment
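A sketch of such an asynchronous variant, backing up one randomly chosen state at a time (random choice ensures every state keeps being visited; `STATES`, `MOVES`, `delta`, and `r` are assumed to be the grid-world definitions from the earlier sketch):

```python
import random

# Asynchronous value iteration: back up one state at a time, in any order.
# Convergence only requires that every state keeps being visited, so the
# order can be random -- or driven by wherever the agent happens to be.

def async_value_iteration(STATES, MOVES, delta, r, gamma=0.9, updates=5000):
    V = {s: 0.0 for s in STATES}
    for _ in range(updates):
        s = random.choice(STATES)       # e.g. the agent's current state
        V[s] = max(r(s, a) + gamma * V[delta(s, a)] for a in MOVES)
    return V
```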