CSE 573
Markov Decision Processes: Heuristic Search & Real-Time Dynamic Programming
Slides adapted from Andrey Kolobov and Mausam
1
Stochastic Shortest-Path MDPs: Motivation
Assume the agent pays a cost to achieve a goal. Examples:
– Controlling a Mars rover
“How to collect scientific data without damaging the rover?”
– Navigation
“What’s the fastest way to get to a destination, taking into account the traffic jams?”
8
– Every improper policy incurs infinite cost from every state from which it does not reach the goal with P=1
9
[Bertsekas, 1995]
– If a policy is optimal, it will take a finite, but a priori unknown, time to reach the goal
10
11
[Figure: example MDP with states s1, s2, s3 and goal sG, actions a1 and a2, annotated with costs C(s, a, s') and transition probabilities T(s, a, s'); both of s3's actions lead back to s3, so s3 is a dead end.]
No dead ends allowed!
12
[Figure: example MDP with states s1, s2 and goal sG, actions a1 and a2; some loops (e.g., s1→s2→s1 with costs C(s1, a1, s2) = 1 and C(s2, a1, s1) = −1) accumulate zero or negative cost.]
No cost-free “loops” allowed!
13
[Figure: the example MDP again, now with non-negative loop costs (e.g., C(s2, a1, s1) = 0, C(s2, a2, s2) = 1).]
– Vπ(h) = Eh[C1 + C2 + …] for all histories h
– V* exists and is stationary Markovian; π* exists and is stationary deterministic Markovian
– For all s:
  V*(s) = min_{a ∈ A} ∑_{s' ∈ S} T(s, a, s') [ C(s, a, s') + V*(s') ]
  π*(s) = argmin_{a ∈ A} ∑_{s' ∈ S} T(s, a, s') [ C(s, a, s') + V*(s') ]
14
Every policy π either takes a finite expected number of steps to reach a goal, or has infinite cost. So for every history, the value of a policy is well-defined!
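To make the optimality equations above concrete, here is a minimal Python sketch of a single Bellman backup for an SSP MDP. The dictionary-based encoding (actions, T, C, goals) and the function name are illustrative assumptions, not notation from the slides.

```python
def bellman_backup(s, V, actions, T, C, goals):
    """Return (backed-up value, greedy action) for state s.

    V             -- dict: current value estimate for every state
    actions(s)    -- iterable of actions applicable in s
    T[(s, a)]     -- list of (s2, prob) outcomes
    C[(s, a, s2)] -- cost of the transition
    """
    if s in goals:
        return 0.0, None                      # goal states cost nothing
    best_q, best_a = float("inf"), None
    for a in actions(s):
        # One-step lookahead: expected cost plus value of the successor.
        q = sum(p * (C[(s, a, s2)] + V[s2]) for s2, p in T[(s, a)])
        if q < best_q:
            best_q, best_a = q, a
    return best_q, best_a
```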
15
16
17
18
19
20
– Solving IHDR and SSP MDPs in flat representation is P-complete
– Solving FH MDPs in flat representation is P-hard
– That is, they don’t benefit from parallelization, but they are solvable in polynomial time!
22
– Solving FH, IHDR, and SSP MDPs in factored representation is EXPTIME-complete!
– Factored representation doesn’t make MDPs harder to solve; it makes big ones easier to describe.
23
26
[Figure: example MDP with actions a00, a01, a1, a20, a21, a3, a40, a41; a41 has two outcomes with Pr = 0.6 and Pr = 0.4; C = 5 for a40 and C = 2 for a41.]
All costs 1 unless otherwise marked
V1 = 0,  V1 = 2   (successor values from iteration 1)
Q2(s4, a40) = 5 + 0 = 5
Q2(s4, a41) = 2 + 0.6×0 + 0.4×2 = 2.8
V2(s4) = min(5, 2.8) = 2.8,  a_greedy = a41
28
– Terminate at iteration n when ε-consistency holds (termination condition)
– No restriction on the initial value function
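A minimal sketch of value iteration with the ε-consistency termination condition just described (stop once the largest Bellman residual drops below ε). The flat MDP encoding is the same illustrative assumption as in the backup sketch earlier.

```python
def value_iteration(S, actions, T, C, goals, V0=None, eps=1e-3):
    # Any initial value function works (per the slide); default to all zeros.
    V = dict(V0) if V0 is not None else {s: 0.0 for s in S}
    for g in goals:
        V[g] = 0.0                              # goal states cost nothing
    while True:
        residual = 0.0
        for s in S:
            if s in goals:
                continue
            # Bellman backup of state s.
            q = min(sum(p * (C[(s, a, s2)] + V[s2]) for s2, p in T[(s, a)])
                    for a in actions(s))
            residual = max(residual, abs(q - V[s]))
            V[s] = q
        if residual < eps:                      # ε-consistency for every state
            return V
```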
29
[Figure: the example MDP from the previous slides.]
n    Vn(s0)    Vn(s1)    Vn(s2)    Vn(s3)    Vn(s4)
0    3         3         2         2         1
1    3         3         2         2         2.8
2    3         3         3.8       3.8       2.8
3    4         4.8       3.8       3.8       3.52
4    4.8       4.8       4.52      4.52      3.52
5    5.52      5.52      4.52      4.52      3.808
20   5.99921   5.99921   4.99969   4.99969   3.99969
30
35
36
ResV < ε (ε-consistency)
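The algorithms later in the deck (LRTDP, BRTDP) are phrased in terms of FIND and REVISE steps. This is a rough sketch of that generic find-and-revise loop; greedy_graph, residual, and backup are placeholder callables assumed for illustration, not an API from the slides.

```python
def find_and_revise(s0, V, greedy_graph, residual, backup, eps=1e-3):
    """Repeat until the greedy partial policy rooted at s0 is eps-consistent."""
    while True:
        # FIND: a state reachable from s0 under the greedy policy whose
        # Bellman residual is still at least eps.
        inconsistent = [s for s in greedy_graph(s0, V) if residual(s, V) >= eps]
        if not inconsistent:
            return V
        # REVISE: perform a Bellman backup on one such state.
        s = inconsistent[0]
        V[s] = backup(s, V)
```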
37
38
39
[Figure: search graph with start state s0, goal Sg, intermediate states s1…s9, and actions A1, A2.]
~ (all-sources shortest path → single-source shortest path)
~ (DFS/BFS → A*)
40
41
42
43
[Figure: the search graph with start state s0, goal Sg, and states s1…s9.]
πs0(s0) = a1,  πs0(s1) = a2,  πs0(s2) = a1
Is this policy closed wrt s0?
(a1 is the left action, a2 is the right action)
44
[Figure: the same search graph.]
πs0(s0) = a1,  πs0(s1) = a2,  πs0(s2) = a1,  πs0(s6) = a1
Is this policy closed wrt s0?
(a1 is the left action, a2 is the right action)
45
[Figure: the same search graph.]
πs0(s0) = a1,  πs0(s1) = a2,  πs0(s2) = a1,  πs0(s6) = a1
(a1 is the left action, a2 is the right action)
46
[Figure: value function V^s0 for the start state s0.]
47
– Relax the probabilistic domain to a deterministic domain
– Use heuristics from classical planning
– For each outcome, create a different action
– Cheapest-cost solution for the determinized domain
– Classical heuristics over the determinized domain
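A sketch of the determinization idea in the bullets above: each probabilistic outcome of an action becomes its own deterministic action with the same cost, and the cheapest-cost solution of the determinized domain (computed here by Dijkstra from the goals) can serve as a heuristic. The graph encoding and function names are illustrative assumptions.

```python
import heapq

def determinize(T, C):
    """All-outcomes determinization: one deterministic edge per outcome."""
    edges = {}
    for (s, a), outcomes in T.items():
        for s2, p in outcomes:
            edges.setdefault(s, []).append((s2, C[(s, a, s2)]))
    return edges

def h_determinized(edges, goals):
    """Cheapest cost to a goal in the determinized graph (Dijkstra over
    reversed edges); a lower bound on V*, so usable as an admissible h."""
    rev = {}
    for s, succs in edges.items():
        for s2, cost in succs:
            rev.setdefault(s2, []).append((s, cost))
    h = {g: 0.0 for g in goals}
    frontier = [(0.0, g) for g in goals]
    heapq.heapify(frontier)
    while frontier:
        d, s = heapq.heappop(frontier)
        if d > h.get(s, float("inf")):
            continue                      # stale heap entry
        for s_prev, cost in rev.get(s, []):
            if d + cost < h.get(s_prev, float("inf")):
                h[s_prev] = d + cost
                heapq.heappush(frontier, (d + cost, s_prev))
    return h   # states missing from h cannot reach a goal in the relaxation
```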
48
[Figure: a probabilistic action a from state s with outcomes s1 and s2 becomes two deterministic actions a1 and a2, one per outcome.]
49
50
51
(perform Bellman backups)
52
[Figure: state s with Q(s, a1) = 5 and Q(s, a2) = 10, contrasting the case where all values are < V*, Q* with the case where all values equal V*, Q*.]
[Barto et al 95]
– agent acting in the real world
– simulate greedy policy starting from the start state
– perform Bellman backups on visited states
– stop when you hit the goal
– Converges in the limit as #trials → ∞
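A minimal sketch of the trial loop just described: start at the start state, back up each visited state, and follow a sampled outcome of the greedy action until a goal is reached. The MDP encoding is the same illustrative assumption as in the earlier sketches; V is assumed to hold an initial value (e.g., a heuristic) for every state.

```python
import random

def rtdp_trial(s0, V, actions, T, C, goals):
    s = s0
    while s not in goals:
        # Bellman backup on the current state, remembering the greedy action.
        best_q, best_a = float("inf"), None
        for a in actions(s):
            q = sum(p * (C[(s, a, s2)] + V[s2]) for s2, p in T[(s, a)])
            if q < best_q:
                best_q, best_a = q, a
        V[s] = best_q
        # Simulate the greedy action to obtain the next state.
        outcomes = T[(s, best_a)]
        s = random.choices([s2 for s2, _ in outcomes],
                           weights=[p for _, p in outcomes])[0]

def rtdp(s0, V, actions, T, C, goals, n_trials=1000):
    # Plain RTDP has no termination test; just run a fixed number of trials.
    for _ in range(n_trials):
        rtdp_trial(s0, V, actions, T, C, goals)
    return V
```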
84
85
[Figure: example graph with start state s0, goal Sg, and states s1…s8.]
86
[Figure: the trial starts at s0; unvisited states hold heuristic values h, backed-up states hold updated values V.]
start at start state
repeat
    perform a Bellman backup
    simulate greedy action
87
[Figure: the trial continues; the next state on the greedy path is backed up.]
88
[Figure: the trial continues; another state on the greedy path is backed up.]
89
[Figure: the trial continues along the greedy path.]
90
[Figure: the trial continues; more states now hold backed-up values V.]
91
[Figure: the trial reaches the goal.]
start at start state
repeat
    perform a Bellman backup
    simulate greedy action
until hit the goal
92
[Figure: the completed trial from s0 to the goal Sg.]
Backup all states
RTDP repeat forever
[Barto et al 95]
– agent acting in the real world
– simulate greedy policy starting from the start state
– perform Bellman backups on visited states
– stop when you hit the goal
– Converges in the limit as #trials → ∞
93
No termination condition!
94
– Admissible initial values ⇒ V(s) ≤ V*(s) ⇒ Q(s,a) ≤ Q*(s,a)
– if V(s) has converged:
    ResV(s) < ε ⇒ V(s) won’t change!
    label s as solved
[Figure: state s whose best action leads to the goal sg.]
96
ResV(s) < ε and s' already solved ⇒ V(s) won’t change!
label s as solved
[Figure: s’s best action leads to the goal sg and to an already-solved state s'.]
97
ResV(s) < ε and ResV(s') < ε ⇒ V(s), V(s') won’t change!
label s and s' as solved
[Figure: s and s' along the best-action path to the goal sg.]
repeat
    s ← s0
    label all goal states as solved
    repeat  // trials
        REVISE s; identify a_greedy
        FIND: sample s' from T(s, a_greedy, s')
        s ← s'
    until s is solved
    for all states s in the trial
        try to label s as solved
until s0 is solved
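A rough Python rendering of the loop above. check_solved is a simplified stand-in for the CheckSolved procedure of Bonet and Geffner's LRTDP: it explores the greedy graph from s and labels the visited states as solved only if they are all ε-consistent, otherwise it backs them up again. The MDP encoding, helper names, and the assumption that V holds an initial value for every state are all illustrative.

```python
import random

def greedy_q(s, V, actions, T, C):
    best_q, best_a = float("inf"), None
    for a in actions(s):
        q = sum(p * (C[(s, a, s2)] + V[s2]) for s2, p in T[(s, a)])
        if q < best_q:
            best_q, best_a = q, a
    return best_q, best_a

def check_solved(s, V, actions, T, C, solved, eps):
    consistent, stack, closed = True, [s], set()
    while stack:
        u = stack.pop()
        if u in closed or u in solved:
            continue
        closed.add(u)
        q, a = greedy_q(u, V, actions, T, C)
        if abs(q - V[u]) >= eps:
            consistent = False               # residual too large somewhere
            continue                         # don't expand past it
        stack.extend(s2 for s2, p in T[(u, a)] if p > 0)
    if consistent:
        solved.update(closed)                # everything reachable has converged
    else:
        for u in closed:                     # extra backups speed convergence
            V[u], _ = greedy_q(u, V, actions, T, C)
    return consistent

def lrtdp(s0, V, actions, T, C, goals, eps=1e-3):
    solved = set(goals)
    for g in goals:
        V.setdefault(g, 0.0)
    while s0 not in solved:
        s, trial = s0, []
        while s not in solved:               # one trial
            trial.append(s)
            V[s], a = greedy_q(s, V, actions, T, C)          # REVISE
            outcomes = T[(s, a)]
            s = random.choices([s2 for s2, _ in outcomes],   # FIND
                               weights=[p for _, p in outcomes])[0]
        for s in reversed(trial):            # try to label visited states
            if not check_solved(s, V, actions, T, C, solved, eps):
                break
    return V
```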
98
99
100
[Figure: the Racetrack domain, with Start and Goal cells.]
101
[Plots: solution cost vs. elapsed computation time on large-ring and large-square, comparing RTDP, VI, LAO*, and LRTDP.]
algorithm      small-b  large-b  h-track  small-r  large-r  small-s  large-s  small-y  large-y
VI(h = 0)        1.101    4.045   15.451    0.662    5.435    5.896   78.720   16.418   61.773
ILAO*(h = 0)     2.568   11.794   43.591    1.114   11.166   12.212  250.739   57.488  182.649
LRTDP(h = 0)     0.885    7.116   15.591    0.431    4.275    3.238   49.312    9.393   34.100

Table 2: Convergence time in seconds for the different algorithms with initial value function h = 0 and ε = 10^−3. Times for RTDP not shown as they exceed the cutoff time for convergence (10 minutes). Faster times are shown in bold font.
algorithm      small-b  large-b  h-track  small-r  large-r  small-s  large-s  small-y  large-y
VI(hmin)         1.317    4.093   12.693    0.737    5.932    6.855  102.946   17.636   66.253
ILAO*(hmin)      1.161    2.910   11.401    0.309    3.514    0.387    1.055    0.692    1.367
LRTDP(hmin)      0.521    2.660    7.944    0.187    1.599    0.259    0.653    0.336    0.749

Table 3: Convergence time in seconds for the different algorithms with initial value function h = hmin and ε = 10^−3. Times for RTDP not shown as they exceed the cutoff time for convergence (10 minutes). Faster times are shown in bold font.
102
103
– Vl continues to be a lower bound
– Vu continues to be an upper bound
– Vl will increase and converge to V*; Vu will decrease and converge to V*
104
repeat
    s ← s0
    repeat  // trials
        identify a_greedy based on Vl
        FIND: sample s' ∝ T(s, a_greedy, s') · gap(s')
        s ← s'
    until gap(s) < ε
    for all states s in the trial, in reverse order:
        REVISE s
until gap(s0) < ε
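A rough sketch of the trial above: act greedily with respect to the lower bound Vl and sample the next state in proportion to T(s, a, s')·gap(s'), where gap(s) = Vu(s) − Vl(s). Vl and Vu are assumed to hold lower/upper-bound initial values for every state; the encoding and helper names are illustrative, and details such as exactly when a trial ends differ in the published BRTDP algorithm.

```python
import random

def greedy_q(s, V, actions, T, C):
    best_q, best_a = float("inf"), None
    for a in actions(s):
        q = sum(p * (C[(s, a, s2)] + V[s2]) for s2, p in T[(s, a)])
        if q < best_q:
            best_q, best_a = q, a
    return best_q, best_a

def brtdp(s0, Vl, Vu, actions, T, C, goals, eps=1e-3):
    gap = lambda s: Vu[s] - Vl[s]
    for g in goals:
        Vl[g], Vu[g] = 0.0, 0.0
    while gap(s0) >= eps:
        s, trial = s0, []
        while s not in goals and gap(s) >= eps:              # one trial
            trial.append(s)
            _, a = greedy_q(s, Vl, actions, T, C)            # greedy w.r.t. Vl
            outcomes = T[(s, a)]
            weights = [p * gap(s2) for s2, p in outcomes]
            if sum(weights) <= 0:
                break                                        # successors converged
            s = random.choices([s2 for s2, _ in outcomes], weights=weights)[0]
        for s in reversed(trial):                            # REVISE in reverse order
            Vl[s], _ = greedy_q(s, Vl, actions, T, C)
            Vu[s], _ = greedy_q(s, Vu, actions, T, C)
    return Vl, Vu
```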
105
106
[Plots: convergence time of BRTDP, LRTDP, and HDP, shown as a fraction of the longest time per problem, with informed initialization (A: 0.94 s, B: 2.43 s, C: 1.81 s, D: 43.98 s) and uninformed initialization (A: 2.11 s, B: 10.13 s, C: 3.41 s, D: 45.42 s).]
A, B – racetrack; C, D – gridworld. A, C have sparse noise; B, D much noise.
107
[Slide adapted from Scott Sanner] 108
[Plots: solution cost over time on problems A, B, C, D.]
109
110
Algorithm   large-b        large-b-3      large-b-w        large-ring     large-ring-3   large-ring-w
RTDP        5.30 (5.19)    10.27 (9.12)   149.07 (190.55)  3.39 (4.81)    8.05 (8.56)    16.44 (91.67)
LRTDP       1.21 (3.52)    1.63 (4.08)    1.96 (14.38)     1.74 (5.19)    2.14 (5.71)    3.13 (22.15)
HDP         1.29 (3.43)    1.86 (4.12)    2.87 (15.99)     1.27 (4.35)    2.74 (6.41)    2.92 (20.14)
HDP+L       1.29 (3.75)    1.86 (4.55)    2.87 (16.88)     1.27 (4.70)    2.74 (7.02)    2.92 (21.12)
FRTDP       0.29 (2.10)    0.49 (2.38)    0.84 (10.71)     0.22 (2.60)    0.43 (3.04)    0.99 (14.73)

Figure 1: Millions of backups before convergence with ε = 10^−3. Each entry gives the number of millions of backups, with the corresponding wallclock time (seconds) in parentheses. The fastest time for each problem is shown in bold.
[Plots on large-ring-w and large-ring-3: curves for HDP, HDP+L, and FRTDP over 10^2–10^7 backups.]
Figure 2: Anytime performance comparison: solution quality vs. number of backups.