SLIDE 1

Module 6 Value Iteration

CS 886: Sequential Decision Making and Reinforcement Learning, University of Waterloo

SLIDE 2

Markov Decision Process

  • Definition

– Set of states: $S$
– Set of actions (i.e., decisions): $A$
– Transition model: $\Pr(s_t \mid s_{t-1}, a_{t-1})$
– Reward model (i.e., utility): $R(s_t, a_t)$
– Discount factor: $0 \le \gamma \le 1$
– Horizon (i.e., # of time steps): $h$

  • Goal: find an optimal policy $\pi$
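These ingredients fit naturally into a small data structure. Below is a minimal Python sketch (the class name MDP and the field names T, R, gamma, h are illustrative, not from the slides), with a tiny two-state, two-action example using arbitrary numbers:

import numpy as np
from dataclasses import dataclass

@dataclass
class MDP:
    """Finite MDP container (illustrative)."""
    T: np.ndarray      # transitions, shape (|A|, |S|, |S|); T[a, s, s2] = Pr(s2 | s, a)
    R: np.ndarray      # rewards, shape (|A|, |S|); R[a, s] = R(s, a)
    gamma: float       # discount factor, 0 <= gamma <= 1
    h: int             # horizon (number of time steps)

# Tiny two-state, two-action example with arbitrary numbers.
mdp = MDP(
    T=np.array([[[0.9, 0.1], [0.2, 0.8]],
                [[0.5, 0.5], [0.0, 1.0]]]),
    R=np.array([[1.0, 0.0],
                [0.5, 2.0]]),
    gamma=0.9,
    h=10,
)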
SLIDE 3

Finite Horizon

  • Policy evaluation

$$V^\pi_h(s) = \sum_{t=0}^{h} \gamma^t \sum_{s'} \Pr(S_t = s' \mid S_0 = s, \pi)\, R(s', \pi_t(s'))$$

  • Recursive form (dynamic programming)

$$V^\pi_0(s) = R(s, \pi_0(s))$$
$$V^\pi_t(s) = R(s, \pi_t(s)) + \gamma \sum_{s'} \Pr(s' \mid s, \pi_t(s))\, V^\pi_{t-1}(s')$$

SLIDE 4

Finite Horizon

  • Optimal policy $\pi^*$

$$V^{\pi^*}_h(s) \ge V^\pi_h(s) \quad \forall \pi, s$$

  • Optimal value function $V^*$ (shorthand for $V^{\pi^*}$)

$$V^*_0(s) = \max_a R(s, a)$$
$$V^*_t(s) = \max_a R(s, a) + \gamma \sum_{s'} \Pr(s' \mid s, a)\, V^*_{t-1}(s')$$

Bellman's equation

SLIDE 5

Value Iteration Algorithm

Optimal policy $\pi^*$:

$t = 0$: $\pi^*_0(s) \leftarrow \operatorname{argmax}_a R(s, a) \quad \forall s$
$t > 0$: $\pi^*_t(s) \leftarrow \operatorname{argmax}_a R(s, a) + \gamma \sum_{s'} \Pr(s' \mid s, a)\, V^*_{t-1}(s') \quad \forall s$

NB: $\pi^*$ is non-stationary (i.e., time-dependent)

valueIteration(MDP)
  $V^*_0(s) \leftarrow \max_a R(s, a) \quad \forall s$
  For $t = 1$ to $h$ do
    $V^*_t(s) \leftarrow \max_a R(s, a) + \gamma \sum_{s'} \Pr(s' \mid s, a)\, V^*_{t-1}(s') \quad \forall s$
  Return $V^*$

SLIDE 6

Value Iteration

  • Matrix form:

$R_a$: $|S| \times 1$ column vector of rewards for action $a$
$V^*_t$: $|S| \times 1$ column vector of state values
$T_a$: $|S| \times |S|$ matrix of transition probabilities for action $a$

valueIteration(MDP)
  $V^*_0 \leftarrow \max_a R_a$
  For $t = 1$ to $h$ do
    $V^*_t \leftarrow \max_a R_a + \gamma T_a V^*_{t-1}$
  Return $V^*$

SLIDE 7

Infinite Horizon

  • Let $h \to \infty$
  • Then $V^\pi_h \to V^\pi_\infty$ and $V^\pi_{h-1} \to V^\pi_\infty$
  • Policy evaluation:

$$V^\pi_\infty(s) = R(s, \pi_\infty(s)) + \gamma \sum_{s'} \Pr(s' \mid s, \pi_\infty(s))\, V^\pi_\infty(s') \quad \forall s$$

  • Bellman's equation:

$$V^*_\infty(s) = \max_a R(s, a) + \gamma \sum_{s'} \Pr(s' \mid s, a)\, V^*_\infty(s') \quad \forall s$$

SLIDE 8

Policy evaluation

  • Linear system of equations

$$V^\pi_\infty(s) = R(s, \pi_\infty(s)) + \gamma \sum_{s'} \Pr(s' \mid s, \pi_\infty(s))\, V^\pi_\infty(s') \quad \forall s$$

  • Matrix form:

$R$: $|S| \times 1$ column vector of state rewards for $\pi$
$V$: $|S| \times 1$ column vector of state values for $\pi$
$T$: $|S| \times |S|$ matrix of transition probabilities for $\pi$

$$V = R + \gamma T V$$

SLIDE 9

Solving linear equations

  • Linear system: $V = R + \gamma T V$
  • Gaussian elimination: $(I - \gamma T) V = R$
  • Compute inverse: $V = (I - \gamma T)^{-1} R$
  • Iterative methods
  • Value iteration (a.k.a. Richardson iteration): repeat $V \leftarrow R + \gamma T V$ (see the sketch below)
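Both solution strategies take only a few lines of numpy. A sketch, assuming R and T are the policy-induced reward vector and transition matrix (R[s] = R(s, Ο€(s)), T[s, s'] = Pr(s' | s, Ο€(s))) and Ξ³ < 1:

import numpy as np

def policy_value_direct(R, T, gamma):
    """Exact policy evaluation: solve the linear system (I - gamma T) V = R."""
    return np.linalg.solve(np.eye(len(R)) - gamma * T, R)

def policy_value_richardson(R, T, gamma, eps=1e-8):
    """Iterative policy evaluation: repeat V <- R + gamma T V until convergence."""
    V = np.zeros_like(R)
    while True:
        V_new = R + gamma * T @ V
        if np.max(np.abs(V_new - V)) <= eps:   # successive iterates agree to eps
            return V_new
        V = V_new

The direct solve costs $O(|S|^3)$; the Richardson iteration costs $O(|S|^2)$ per sweep, which is why it scales better when $|S|$ is large.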
SLIDE 10

Contraction

  • Let $H(V) \overset{\text{def}}{=} R + \gamma T V$ be the policy evaluation operator
  • Lemma 1: $H$ is a contraction mapping.

$$\|H(V) - H(\tilde V)\|_\infty \le \gamma \|V - \tilde V\|_\infty$$

  • Proof

$\|H(V) - H(\tilde V)\|_\infty$
$= \|R + \gamma T V - R - \gamma T \tilde V\|_\infty$ (by definition)
$= \|\gamma T (V - \tilde V)\|_\infty$ (simplification)
$\le \gamma \|T\|_\infty \|V - \tilde V\|_\infty$ (since $\|AB\| \le \|A\|\,\|B\|$)
$= \gamma \|V - \tilde V\|_\infty$ (since $\max_s \sum_{s'} T(s, s') = 1$)
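Lemma 1 is easy to check numerically. A quick sanity check on a random instance (illustrative, not from the slides):

import numpy as np

rng = np.random.default_rng(0)
n, gamma = 5, 0.9
T = rng.random((n, n))
T /= T.sum(axis=1, keepdims=True)        # make T row-stochastic
R = rng.random(n)
H = lambda V: R + gamma * T @ V          # policy evaluation operator

V1, V2 = rng.random(n), rng.random(n)
lhs = np.max(np.abs(H(V1) - H(V2)))      # ||H(V) - H(V~)||_inf
rhs = gamma * np.max(np.abs(V1 - V2))    # gamma ||V - V~||_inf
assert lhs <= rhs + 1e-12                # Lemma 1 holds on this instance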

SLIDE 11

Convergence

  • Theorem 2: Policy evaluation converges to $V^\pi$ for any initial estimate $V$:

$$\lim_{n \to \infty} H^{(n)}(V) = V^\pi \quad \forall V$$

  • Proof
  • By definition, $V^\pi = H^{(\infty)}(0)$, but policy evaluation computes $H^{(\infty)}(V)$ for any initial $V$
  • By Lemma 1, $\|H^{(n)}(V) - H^{(n)}(\tilde V)\|_\infty \le \gamma^n \|V - \tilde V\|_\infty$
  • Hence, when $n \to \infty$, $\|H^{(n)}(V) - H^{(n)}(0)\|_\infty \to 0$ and $H^{(\infty)}(V) = V^\pi \ \ \forall V$

SLIDE 12

Approximate Policy Evaluation

  • In practice, we can't perform an infinite number of iterations.
  • Suppose that we perform value iteration for $k$ steps and $\|H^{(k)}(V) - H^{(k-1)}(V)\|_\infty = \epsilon$; how far is $H^{(k)}(V)$ from $V^\pi$?

SLIDE 13

Approximate Policy Evaluation

  • Theorem 3: If $\|H^{(k)}(V) - H^{(k-1)}(V)\|_\infty \le \epsilon$ then

$$\|V^\pi - H^{(k)}(V)\|_\infty \le \frac{\epsilon}{1 - \gamma}$$

  • Proof

$\|V^\pi - H^{(k)}(V)\|_\infty$
$= \|H^{(\infty)}(V) - H^{(k)}(V)\|_\infty$ (by Theorem 2)
$= \left\| \sum_{t=1}^{\infty} H^{(t+k)}(V) - H^{(t+k-1)}(V) \right\|_\infty$ (telescoping sum)
$\le \sum_{t=1}^{\infty} \|H^{(t+k)}(V) - H^{(t+k-1)}(V)\|_\infty$ (since $\|A + B\| \le \|A\| + \|B\|$)
$\le \sum_{t=1}^{\infty} \gamma^t \epsilon$ (by Lemma 1)
$= \frac{\gamma \epsilon}{1 - \gamma} \le \frac{\epsilon}{1 - \gamma}$ (geometric series)

SLIDE 14

Optimal Value Function

  • Non-linear system of equations

$$V^*_\infty(s) = \max_a R(s, a) + \gamma \sum_{s'} \Pr(s' \mid s, a)\, V^*_\infty(s') \quad \forall s$$

  • Matrix form:

$R_a$: $|S| \times 1$ column vector of rewards for action $a$
$V^*$: $|S| \times 1$ column vector of optimal values
$T_a$: $|S| \times |S|$ matrix of transition probabilities for action $a$

$$V^* = \max_a R_a + \gamma T_a V^*$$

SLIDE 15

Contraction

  • Let $H^*(V) \overset{\text{def}}{=} \max_a R_a + \gamma T_a V$ be the operator in value iteration
  • Lemma 3: $H^*$ is a contraction mapping.

$$\|H^*(V) - H^*(\tilde V)\|_\infty \le \gamma \|V - \tilde V\|_\infty$$

  • Proof: without loss of generality, let $H^*(V)(s) \ge H^*(\tilde V)(s)$, and let

$$a^*_s = \operatorname{argmax}_a R(s, a) + \gamma \sum_{s'} \Pr(s' \mid s, a)\, V(s')$$

SLIDE 16

Contraction

  • Proof continued:
  • Then

$0 \le H^*(V)(s) - H^*(\tilde V)(s)$ (by assumption)
$\le R(s, a^*_s) + \gamma \sum_{s'} \Pr(s' \mid s, a^*_s)\, V(s') - R(s, a^*_s) - \gamma \sum_{s'} \Pr(s' \mid s, a^*_s)\, \tilde V(s')$ (by definition)
$= \gamma \sum_{s'} \Pr(s' \mid s, a^*_s) \left[ V(s') - \tilde V(s') \right]$
$\le \gamma \sum_{s'} \Pr(s' \mid s, a^*_s)\, \|V - \tilde V\|_\infty$ (max-norm upper bound)
$= \gamma \|V - \tilde V\|_\infty$ (since $\sum_{s'} \Pr(s' \mid s, a^*_s) = 1$)

  • Repeat the same argument for $H^*(\tilde V)(s) \ge H^*(V)(s)$, and for each $s$

SLIDE 17

Convergence

  • Theorem 4: Value iteration converges to $V^*$ for any initial estimate $V$:

$$\lim_{n \to \infty} H^{*(n)}(V) = V^* \quad \forall V$$

  • Proof
  • By definition, $V^* = H^{*(\infty)}(0)$, but value iteration computes $H^{*(\infty)}(V)$ for some initial $V$
  • By Lemma 3, $\|H^{*(n)}(V) - H^{*(n)}(\tilde V)\|_\infty \le \gamma^n \|V - \tilde V\|_\infty$
  • Hence, when $n \to \infty$, $\|H^{*(n)}(V) - H^{*(n)}(0)\|_\infty \to 0$ and $H^{*(\infty)}(V) = V^* \ \ \forall V$

SLIDE 18

Value Iteration

  • Even when the horizon is infinite, we perform only finitely many iterations
  • Stop when $\|V_n - V_{n-1}\|_\infty \le \epsilon$

valueIteration(MDP)
  $V_0 \leftarrow \max_a R_a$;  $n \leftarrow 0$
  Repeat
    $n \leftarrow n + 1$
    $V_n \leftarrow \max_a R_a + \gamma T_a V_{n-1}$
  Until $\|V_n - V_{n-1}\|_\infty \le \epsilon$
  Return $V_n$

SLIDE 19

Induced Policy

  • Since $\|V_n - V_{n-1}\|_\infty \le \epsilon$, by the argument of Theorem 3 (applied to $H^*$) we know that $\|V_n - V^*\|_\infty \le \frac{\epsilon}{1-\gamma}$
  • But how good is the stationary policy $\pi_n(s)$ extracted based on $V_n$?

$$\pi_n(s) = \operatorname{argmax}_a R(s, a) + \gamma \sum_{s'} \Pr(s' \mid s, a)\, V_n(s')$$

  • How far is $V^{\pi_n}$ from $V^*$?
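Extracting the induced greedy policy is a one-line argmax over the Q-values. A sketch, assuming the same illustrative container:

import numpy as np

def extract_policy(mdp, V):
    """Greedy (stationary) policy induced by a value estimate V (sketch)."""
    # Q[a, s] = R(s, a) + gamma * sum_s' Pr(s'|s, a) V(s')
    Q = mdp.R + mdp.gamma * (mdp.T @ V)
    return Q.argmax(axis=0)

Combined with value_iteration above, pi = extract_policy(mdp, value_iteration(mdp)) yields the stationary policy whose quality Theorem 5 bounds.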
SLIDE 20

Induced Policy

  • Theorem 5: $\|V^{\pi_n} - V^*\|_\infty \le \frac{2\epsilon}{1-\gamma}$
  • Proof

$\|V^{\pi_n} - V^*\|_\infty$
$= \|V^{\pi_n} - V_n + V_n - V^*\|_\infty$
$\le \|V^{\pi_n} - V_n\|_\infty + \|V_n - V^*\|_\infty$ (since $\|A + B\| \le \|A\| + \|B\|$)
$= \|H_{\pi_n}^{(\infty)}(V_n) - V_n\|_\infty + \|V_n - H^{*(\infty)}(V_n)\|_\infty$ (by Theorems 2 and 4)
$\le \frac{\epsilon}{1-\gamma} + \frac{\epsilon}{1-\gamma}$ (by the argument of Theorem 3)
$= \frac{2\epsilon}{1-\gamma}$

SLIDE 21

Summary

  • Value iteration

– Simple dynamic programming algorithm
– Complexity: $O(n |A| |S|^2)$, where $n$ is the number of iterations

  • Can we optimize the policy directly instead of optimizing the value function and then inducing a policy?

– Yes: by policy iteration