Reinforcement Learning
Part 1
Yingyu Liang yliang@cs.wisc.edu Computer Sciences Department University of Wisconsin, Madison
[Based on slides from David Page, Mark Craven]
Goals for the lecture
you should understand the following concepts
Task: an agent embedded in an environment repeats forever:
1) sense world
2) reason
3) choose an action to perform
4) get feedback (usually reward = 0)
5) learn
the environment may be the physical world or an artificial one
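As a minimal sketch of this loop (the `env` and `agent` objects and their method names are hypothetical, invented here for illustration, not from any specific library):

```python
# A minimal sketch of the sense-reason-act-learn loop. `env` senses and
# steps the world; `agent` picks actions and learns from feedback.
# Both interfaces are invented for illustration.

def run_episode(env, agent, num_steps=1000):
    state = env.observe()                          # 1) sense world
    for _ in range(num_steps):
        action = agent.choose_action(state)        # 2) reason, 3) choose an action
        next_state, reward = env.step(action)      # 4) get feedback (often reward = 0)
        agent.learn(state, action, reward, next_state)  # 5) learn
        state = next_state
```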
Example: backgammon (Tesauro's TD-Gammon)
state: board configuration
– 30 pieces, 24 locations
actions: given a dice roll, e.g. 2 and 5
– move one piece 2 positions
– move one piece 5 positions
rewards:
– win, lose
results:
– trained against itself (300,000 games)
– as good as the best previous backgammon computer program (also by Tesauro)
– beat the human champion
Example: Go (AlphaGo)
state: board configuration
– 19x19 locations
actions:
– put one stone on some empty location
rewards:
– win, lose
results:
– beat Lee Sedol by 4-1
– shows performance superior to human players
– a key technique: reinforcement learning
[Figure: the agent senses the state of the environment, performs an action, and receives a reward in return]
interaction trace: s0, a0, r0, s1, a1, r1, s2, a2, r2, …
– at each time step t, the agent senses state st ∈ S, then chooses action at ∈ A
– performing action at yields reward rt and moves the agent to state st+1
Goal: learn a policy π : S → A for choosing actions that maximizes the expected discounted return E[r0 + γr1 + γ²r2 + …] for every possible starting state s0
Markov assumption: the next state and the reward depend only on the current state and action, not on earlier history
P(st+1 | st, at, st−1, at−1, …) = P(st+1 | st, at)
P(rt | st, at, st−1, at−1, …) = P(rt | st, at)
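To make the Markov assumption concrete, here is a small sketch with a made-up two-state transition table: the next state is sampled from a distribution indexed only by the current state and action, so earlier history cannot influence it.

```python
import random

# A toy two-state Markov transition model, invented for illustration.
# P[(s, a)] is the distribution over next states given the CURRENT state
# and action only -- earlier states and actions never enter the lookup,
# which is exactly the Markov assumption.
P = {
    ("s0", "a"): {"s0": 0.2, "s1": 0.8},
    ("s1", "a"): {"s0": 0.5, "s1": 0.5},
}

def sample_next_state(state, action):
    dist = P[(state, action)]
    next_states = list(dist)
    return random.choices(next_states, weights=[dist[s] for s in next_states])[0]
```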
the task: learn a policy that maximizes the expected discounted return
E[rt + γrt+1 + γ²rt+2 + …], where 0 ≤ γ < 1 is the discount factor,
from every state s ∈ S
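For a single realized trajectory, the discounted return reduces to a plain discounted sum of its rewards; a minimal sketch (γ = 0.9 is just an example value):

```python
def discounted_return(rewards, gamma=0.9):
    # r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for one realized
    # trajectory is a discounted sum of its rewards.
    return sum(gamma**t * r for t, r in enumerate(rewards))

# A reward of 100 received three steps in the future is worth
# 0.9**3 * 100 = 72.9 now:
print(discounted_return([0, 0, 0, 100]))  # 72.9 (up to float rounding)
```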
[Figure: a simple deterministic grid world with absorbing goal state G; the two arrows entering G carry reward 100, all other actions reward 0]
each arrow represents an action a and the associated number represents deterministic reward r(s, a)
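Since the figure itself is not reproducible here, a deterministic grid world of this flavor can be written out directly. The 2x3 layout below, with an absorbing goal G in the top-right corner and reward 100 for entering G, is an assumption chosen to match the values quoted on the later slides:

```python
# A hypothetical 2x3 deterministic grid world in the spirit of the figure:
# the goal G sits in the top-right corner and is absorbing; any action
# that moves the agent into G earns reward 100, every other action earns 0.
# (The exact layout on the slide may differ -- this is an assumption.)
ROWS, COLS, GOAL = 2, 3, (0, 2)
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
STATES = [(i, j) for i in range(ROWS) for j in range(COLS)]

def delta(s, a):
    """Deterministic successor state delta(s, a)."""
    if s == GOAL:                       # G is absorbing
        return s
    i, j = s[0] + MOVES[a][0], s[1] + MOVES[a][1]
    return (i, j) if 0 <= i < ROWS and 0 <= j < COLS else s

def r(s, a):
    """Deterministic reward r(s, a): 100 for entering G, else 0."""
    return 100 if s != GOAL and delta(s, a) == GOAL else 0
```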
value function: Vπ(s) = E[ Σ_{t≥0} γ^t rt ]
assuming the action sequence is chosen according to π starting at state s
the optimal policy π* maximizes Vπ(s) for every state s
we'll denote the value function for this optimal policy as V*(s)
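For a deterministic world and a deterministic policy, Vπ(s) can be evaluated by simply rolling the policy forward and discounting; a sketch, where `policy`, `delta`, and `r` are assumed callables like those in the hypothetical grid world above:

```python
def evaluate_policy(s, policy, delta, r, gamma=0.9, horizon=100):
    # V_pi(s) for a deterministic world and deterministic policy: roll the
    # policy forward from s and accumulate discounted rewards. Truncating
    # at `horizon` steps is exact up to a gamma**horizon error term.
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        a = policy(s)
        total += discount * r(s, a)
        discount *= gamma
        s = delta(s, a)
    return total
```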
[Figure: the grid world with goal G; for one particular policy π, the Vπ(s) values are shown in red: 100, 90, 81, 73, 66]
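These red values are consistent with reaching the +100 reward in 1 through 5 steps under a discount factor of γ = 0.9 (the value of γ is an assumption; the extracted slide does not state it):

```python
# 100 * 0.9**k for k = 0..4, rounded: the values shown in the figure
print([round(100 * 0.9**k) for k in range(5)])  # [100, 90, 81, 73, 66]
```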
[Figure: the same grid world; the optimal values V*(s) are shown in red: 100, 90, 100, 90, 81]
value iteration computes V* by iterating the recurrence (for all s ∈ S):
V*(s) = max_{a∈A} [ r(s, a) + γ V*(δ(s, a)) ]
where δ(s, a) denotes the state reached by taking action a in state s
initialize V(s) arbitrarily
loop until policy good enough
{
  loop for s ∈ S
  {
    loop for a ∈ A
    {
      Q(s, a) := r(s, a) + γ V(δ(s, a))
    }
    V(s) := max_a Q(s, a)
  }
}
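A runnable version of this loop on the hypothetical 2x3 grid world sketched earlier (the layout and γ = 0.9 remain assumptions, not taken from the slides):

```python
# Value iteration on the hypothetical 2x3 deterministic grid world:
# absorbing goal G = (0, 2); reward 100 for entering G, 0 otherwise.
ROWS, COLS, GOAL, GAMMA = 2, 3, (0, 2), 0.9
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
STATES = [(i, j) for i in range(ROWS) for j in range(COLS)]

def delta(s, a):
    if s == GOAL:                       # G is absorbing
        return s
    i, j = s[0] + MOVES[a][0], s[1] + MOVES[a][1]
    return (i, j) if 0 <= i < ROWS and 0 <= j < COLS else s

def r(s, a):
    return 100 if s != GOAL and delta(s, a) == GOAL else 0

V = {s: 0.0 for s in STATES}            # initialize V(s) arbitrarily
for _ in range(100):                    # "until policy good enough"
    for s in STATES:
        # Bellman backup: V(s) <- max_a [ r(s,a) + gamma * V(delta(s,a)) ]
        V[s] = max(r(s, a) + GAMMA * V[delta(s, a)] for a in MOVES)

print(V)
```

The fixed point matches the V*(s) values quoted on the earlier slide: 100 one step from G, 90 two steps away, 81 three steps away.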
note: we don't have to perform the updates by looping through each state and action methodically
– but we must visit each state infinitely often
– the updates can even be performed as the agent moves around its environment
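A sketch of such an asynchronous variant, backing up one randomly chosen state at a time (random choice ensures every state keeps being visited; `STATES`, `MOVES`, `delta`, and `r` are assumed to be the grid-world definitions from the earlier sketch):

```python
import random

# Asynchronous value iteration: back up one state at a time, in any order.
# Convergence only requires that every state keeps being visited, so the
# order can be random -- or driven by wherever the agent happens to be.

def async_value_iteration(STATES, MOVES, delta, r, gamma=0.9, updates=5000):
    V = {s: 0.0 for s in STATES}
    for _ in range(updates):
        s = random.choice(STATES)       # e.g. the agent's current state
        V[s] = max(r(s, a) + gamma * V[delta(s, a)] for a in MOVES)
    return V
```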