Module 6: Value Iteration
CS 886 Sequential Decision Making and Reinforcement Learning, University of Waterloo
Markov Decision Process
- Definition
  - Set of states: $S$
  - Set of actions (i.e., decisions): $A$
  - Transition model: $\Pr(s_t \mid s_{t-1}, a_{t-1})$
  - Reward model (i.e., utility): $R(s_t, a_t)$
  - Discount factor: $0 \le \gamma \le 1$
  - Horizon (i.e., # of time steps): $h$
- Goal: find an optimal policy $\pi^*$
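As a concrete illustration (not part of the original slides), one way to store a small MDP is with NumPy arrays. The tensor layout `T[a, s, s2]` and all the numbers below are hypothetical toy values, reused by the later sketches.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP used for illustration only.
# T[a, s, s2] = Pr(s2 | s, a); R[s, a] = immediate reward.
T = np.array([[[0.9, 0.1],
               [0.2, 0.8]],   # transitions under action 0
              [[0.5, 0.5],
               [0.0, 1.0]]])  # transitions under action 1
R = np.array([[1.0, 0.0],     # rewards in state 0
              [0.0, 2.0]])    # rewards in state 1
gamma = 0.9                   # discount factor
h = 10                        # horizon (number of time steps)

# Sanity check: each row of each T[a] must be a probability distribution.
assert np.allclose(T.sum(axis=2), 1.0)
```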
Finite Horizon
- Policy evaluation
  $V^\pi_h(s) = \sum_{t=0}^{h} \gamma^t \sum_{s'} \Pr(S_t = s' \mid S_0 = s, \pi)\, R(s', \pi_t(s'))$
- Recursive form (dynamic programming)
  $V^\pi_0(s) = R(s, \pi_0(s))$
  $V^\pi_t(s) = R(s, \pi_t(s)) + \gamma \sum_{s'} \Pr(s' \mid s, \pi_t(s))\, V^\pi_{t-1}(s')$
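A minimal sketch of the recursion above in Python, assuming NumPy and the hypothetical toy MDP from the first sketch; `evaluate_policy` and the always-action-0 example policy are illustrative names, not part of the slides.

```python
import numpy as np

# Toy MDP (same hypothetical arrays as the earlier sketch).
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])  # T[a, s, s2]
R = np.array([[1.0, 0.0], [0.0, 2.0]])    # R[s, a]
gamma, h = 0.9, 10

def evaluate_policy(T, R, gamma, policy, h):
    """Finite-horizon policy evaluation by the recursion above.

    policy[t][s] is the action taken at step t (non-stationary).
    Returns V with V[t][s] = value of following the policy for t
    more steps starting from state s.
    """
    n_states = R.shape[0]
    idx = np.arange(n_states)
    V = np.zeros((h + 1, n_states))
    V[0] = R[idx, policy[0]]              # V_0(s) = R(s, pi_0(s))
    for t in range(1, h + 1):
        a = policy[t]
        # V_t(s) = R(s, pi_t(s)) + gamma * sum_s2 Pr(s2|s,pi_t(s)) V_{t-1}(s2)
        V[t] = R[idx, a] + gamma * T[a, idx] @ V[t - 1]
    return V

# Example: evaluate the policy that always takes action 0.
policy = np.zeros((h + 1, 2), dtype=int)
print(evaluate_policy(T, R, gamma, policy, h)[h])
```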
Finite Horizon
- Optimal policy $\pi^*$
  $V^{\pi^*}_h(s) \ge V^\pi_h(s) \quad \forall \pi, s$
- Optimal value function $V^*$ (shorthand for $V^{\pi^*}$)
  $V^*_0(s) = \max_a R(s, a)$
  $V^*_t(s) = \max_a R(s, a) + \gamma \sum_{s'} \Pr(s' \mid s, a)\, V^*_{t-1}(s')$  (Bellman's equation)
Value Iteration Algorithm
- Optimal policy $\pi^*$
  $t = 0$: $\pi^*_0(s) \leftarrow \operatorname{argmax}_a R(s, a) \quad \forall s$
  $t > 0$: $\pi^*_t(s) \leftarrow \operatorname{argmax}_a R(s, a) + \gamma \sum_{s'} \Pr(s' \mid s, a)\, V^*_{t-1}(s') \quad \forall s$
  NB: $\pi^*$ is non-stationary (i.e., time dependent)
- valueIteration(MDP)
    $V^*_0(s) \leftarrow \max_a R(s, a) \quad \forall s$
    For $t = 1$ to $h$ do
      $V^*_t(s) \leftarrow \max_a R(s, a) + \gamma \sum_{s'} \Pr(s' \mid s, a)\, V^*_{t-1}(s') \quad \forall s$
    Return $V^*$
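A sketch of this finite-horizon algorithm, reusing the hypothetical toy MDP; the `einsum` computes the expected next-step value for every (state, action) pair at once.

```python
import numpy as np

# Toy MDP (hypothetical arrays): T[a, s, s2], R[s, a].
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0], [0.0, 2.0]])
gamma, h = 0.9, 10

def value_iteration_finite(T, R, gamma, h):
    """Returns V[t][s] = V*_t(s) and pi[t][s] = pi*_t(s) for t = 0..h."""
    n_states = R.shape[0]
    V = np.zeros((h + 1, n_states))
    pi = np.zeros((h + 1, n_states), dtype=int)
    V[0], pi[0] = R.max(axis=1), R.argmax(axis=1)   # t = 0 base case
    for t in range(1, h + 1):
        # Q[s, a] = R(s, a) + gamma * sum_s2 Pr(s2|s,a) V*_{t-1}(s2)
        Q = R + gamma * np.einsum('asn,n->sa', T, V[t - 1])
        V[t], pi[t] = Q.max(axis=1), Q.argmax(axis=1)
    return V, pi

V, pi = value_iteration_finite(T, R, gamma, h)
print(V[h], pi[h])  # optimal h-step values and the non-stationary policy at t = h
```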
Value Iteration
- Matrix form:
  $R^a$: $|S| \times 1$ column vector of rewards for action $a$
  $V^*_t$: $|S| \times 1$ column vector of state values
  $T^a$: $|S| \times |S|$ matrix of transition probabilities for action $a$
- valueIteration(MDP)
    $V^*_0 \leftarrow \max_a R^a$
    For $t = 1$ to $h$ do
      $V^*_t \leftarrow \max_a R^a + \gamma T^a V^*_{t-1}$
    Return $V^*$
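The matrix form maps directly onto array operations. In the sketch below (same hypothetical toy MDP), `T @ V` stacks the products $T^a V$ for all actions at once, and row $a$ of `R.T` is $R^a$.

```python
import numpy as np

T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])  # T[a] is the |S|x|S| matrix T^a
R = np.array([[1.0, 0.0], [0.0, 2.0]])    # row a of R.T is R^a
gamma, h = 0.9, 10

V = R.max(axis=1)                          # V*_0 = max_a R^a
for t in range(1, h + 1):
    # R.T + gamma * (T @ V) stacks R^a + gamma T^a V, one row per action;
    # the elementwise max over the action axis is the new value vector.
    V = (R.T + gamma * T @ V).max(axis=0)
print(V)
```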
Infinite Horizon
- Let $h \to \infty$
- Then $V^\pi_h \to V^\pi_\infty$ and $\pi^*_{h-1} \to \pi^*_\infty$ (the optimal policy becomes stationary)
- Policy evaluation:
  $V^\pi_\infty(s) = R(s, \pi_\infty(s)) + \gamma \sum_{s'} \Pr(s' \mid s, \pi_\infty(s))\, V^\pi_\infty(s') \quad \forall s$
- Bellman's equation:
  $V^*_\infty(s) = \max_a R(s, a) + \gamma \sum_{s'} \Pr(s' \mid s, a)\, V^*_\infty(s')$
Policy evaluation
- Linear system of equations
  $V^\pi_\infty(s) = R(s, \pi_\infty(s)) + \gamma \sum_{s'} \Pr(s' \mid s, \pi_\infty(s))\, V^\pi_\infty(s') \quad \forall s$
- Matrix form:
  $R$: $|S| \times 1$ column vector of state rewards for $\pi$
  $V$: $|S| \times 1$ column vector of state values for $\pi$
  $T$: $|S| \times |S|$ matrix of transition probabilities for $\pi$
  $V = R + \gamma T V$
Solving linear equations
- Linear system: $V = R + \gamma T V$
- Gaussian elimination: $(I - \gamma T) V = R$
- Compute inverse: $V = (I - \gamma T)^{-1} R$
- Iterative methods
- Value iteration (a.k.a. Richardson iteration)
- Repeat $V \leftarrow R + \gamma T V$
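Both solution methods side by side, as a sketch; here `T` and `R` are hypothetical quantities already marginalized under a fixed policy $\pi$.

```python
import numpy as np

# Hypothetical policy-conditioned model: T[s, s2] = Pr(s2 | s, pi(s)),
# R[s] = R(s, pi(s)).
T = np.array([[0.9, 0.1],
              [0.2, 0.8]])
R = np.array([1.0, 0.0])
gamma = 0.9

# Direct method: solve (I - gamma T) V = R by Gaussian elimination.
V_direct = np.linalg.solve(np.eye(2) - gamma * T, R)

# Iterative method (Richardson / value iteration): V <- R + gamma T V.
V = np.zeros(2)
for _ in range(1000):
    V = R + gamma * T @ V

print(V_direct, V)  # both converge to the same fixed point
```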
Contraction
- Let $H(V) \triangleq R + \gamma T V$ be the policy evaluation operator
- Lemma 1: $H$ is a contraction mapping:
  $\|H(V) - H(\tilde{V})\|_\infty \le \gamma \|V - \tilde{V}\|_\infty$
- Proof
  $\|H(V) - H(\tilde{V})\|_\infty = \|R + \gamma T V - R - \gamma T \tilde{V}\|_\infty$ (by definition)
  $= \gamma \|T (V - \tilde{V})\|_\infty$ (simplification)
  $\le \gamma \|T\|_\infty \|V - \tilde{V}\|_\infty$ (since $\|AB\| \le \|A\| \|B\|$)
  $= \gamma \|V - \tilde{V}\|_\infty$ (since $\max_s \sum_{s'} T(s, s') = 1$)
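Lemma 1 is easy to check numerically; a small sketch with the hypothetical policy-conditioned `T` and `R` from the previous example and random pairs of value vectors.

```python
import numpy as np

T = np.array([[0.9, 0.1], [0.2, 0.8]])  # hypothetical stochastic matrix
R = np.array([1.0, 0.0])
gamma = 0.9
H = lambda V: R + gamma * T @ V         # policy evaluation operator

rng = np.random.default_rng(0)
for _ in range(1000):
    V1, V2 = rng.normal(size=2), rng.normal(size=2)
    # Contraction: ||H(V1) - H(V2)||_inf <= gamma ||V1 - V2||_inf
    assert np.abs(H(V1) - H(V2)).max() <= gamma * np.abs(V1 - V2).max() + 1e-12
```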
Convergence
- Theorem 2: Policy evaluation converges to $V^\pi$ for any initial estimate $V$:
  $\lim_{n \to \infty} H^{(n)}(V) = V^\pi \quad \forall V$
- Proof
  - By definition $V^\pi = H^{(\infty)}(0)$, but policy evaluation computes $H^{(\infty)}(V)$ for any initial $V$
  - By Lemma 1, $\|H^{(n)}(V) - H^{(n)}(\tilde{V})\|_\infty \le \gamma^n \|V - \tilde{V}\|_\infty$
  - Hence, when $n \to \infty$, $\|H^{(n)}(V) - H^{(n)}(0)\|_\infty \to 0$ and $H^{(\infty)}(V) = V^\pi \quad \forall V$
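The proof also gives a geometric convergence rate, which the following sketch verifies on the toy numbers: since $H^{(n)}(V^\pi) = V^\pi$, the error to the true $V^\pi$ shrinks by at least a factor $\gamma$ per application of $H$, even from a deliberately bad initial estimate.

```python
import numpy as np

T = np.array([[0.9, 0.1], [0.2, 0.8]])
R = np.array([1.0, 0.0])
gamma = 0.9

V_pi = np.linalg.solve(np.eye(2) - gamma * T, R)  # exact fixed point
V = np.array([100.0, -100.0])                     # bad initial estimate
err0 = np.abs(V - V_pi).max()
for n in range(1, 30):
    V = R + gamma * T @ V                         # one application of H
    # Lemma 1 with V_tilde = V_pi: error <= gamma^n * initial error.
    assert np.abs(V - V_pi).max() <= gamma**n * err0 + 1e-9
```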
Approximate Policy Evaluation
- In practice, we can't perform an infinite number of iterations.
- Suppose that we perform value iteration for $n$ steps and $\|H^{(n)}(V) - H^{(n-1)}(V)\|_\infty = \epsilon$; how far is $H^{(n)}(V)$ from $V^\pi$?
Approximate Policy Evaluation
- Theorem 3: If $\|H^{(n)}(V) - H^{(n-1)}(V)\|_\infty \le \epsilon$ then
  $\|V^\pi - H^{(n)}(V)\|_\infty \le \dfrac{\epsilon}{1 - \gamma}$
- Proof
  $\|V^\pi - H^{(n)}(V)\|_\infty$
  $= \|H^{(\infty)}(V) - H^{(n)}(V)\|_\infty$ (by Theorem 2)
  $= \left\| \sum_{t=1}^{\infty} H^{(t+n)}(V) - H^{(t+n-1)}(V) \right\|_\infty$
  $\le \sum_{t=1}^{\infty} \|H^{(t+n)}(V) - H^{(t+n-1)}(V)\|_\infty$ (since $\|A + B\| \le \|A\| + \|B\|$)
  $\le \sum_{t=1}^{\infty} \gamma^t \epsilon$ (by Lemma 1)
  $= \dfrac{\gamma \epsilon}{1 - \gamma} \le \dfrac{\epsilon}{1 - \gamma}$
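A numeric check of Theorem 3 on the toy problem: at every iteration, the distance to the true fixed point stays within $\epsilon / (1 - \gamma)$, where $\epsilon$ is the successive-difference norm actually observed.

```python
import numpy as np

T = np.array([[0.9, 0.1], [0.2, 0.8]])
R = np.array([1.0, 0.0])
gamma = 0.9

V_pi = np.linalg.solve(np.eye(2) - gamma * T, R)  # true V^pi
V_prev = np.zeros(2)
for n in range(1, 100):
    V = R + gamma * T @ V_prev                    # H^(n) applied to 0
    eps = np.abs(V - V_prev).max()                # ||H^(n) - H^(n-1)||_inf
    # Theorem 3: ||V^pi - H^(n)(V)||_inf <= eps / (1 - gamma)
    assert np.abs(V_pi - V).max() <= eps / (1 - gamma) + 1e-12
    V_prev = V
```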
Optimal Value Function
- Non-linear system of equations
  $V^*_\infty(s) = \max_a R(s, a) + \gamma \sum_{s'} \Pr(s' \mid s, a)\, V^*_\infty(s') \quad \forall s$
- Matrix form:
  $R^a$: $|S| \times 1$ column vector of rewards for action $a$
  $V^*$: $|S| \times 1$ column vector of optimal values
  $T^a$: $|S| \times |S|$ matrix of transition probabilities for action $a$
  $V^* = \max_a R^a + \gamma T^a V^*$
Contraction
- Let $H^*(V) \triangleq \max_a R^a + \gamma T^a V$ be the operator in value iteration
- Lemma 3: $H^*$ is a contraction mapping:
  $\|H^*(V) - H^*(\tilde{V})\|_\infty \le \gamma \|V - \tilde{V}\|_\infty$
- Proof: without loss of generality, let $H^*(V)(s) \ge H^*(\tilde{V})(s)$ and let
  $a^*_s = \operatorname{argmax}_a R(s, a) + \gamma \sum_{s'} \Pr(s' \mid s, a)\, V(s')$
Contraction
- Proof continued:
- Then
  $0 \le H^*(V)(s) - H^*(\tilde{V})(s)$ (by assumption)
  $\le R(s, a^*_s) + \gamma \sum_{s'} \Pr(s' \mid s, a^*_s)\, V(s') - R(s, a^*_s) - \gamma \sum_{s'} \Pr(s' \mid s, a^*_s)\, \tilde{V}(s')$ (by definition)
  $= \gamma \sum_{s'} \Pr(s' \mid s, a^*_s) \left[ V(s') - \tilde{V}(s') \right]$
  $\le \gamma \sum_{s'} \Pr(s' \mid s, a^*_s) \|V - \tilde{V}\|_\infty$ (max-norm upper bound)
  $= \gamma \|V - \tilde{V}\|_\infty$ (since $\sum_{s'} \Pr(s' \mid s, a^*_s) = 1$)
- Repeat the same argument for $H^*(\tilde{V})(s) \ge H^*(V)(s)$ and for each $s$
Convergence
- Theorem 4: Value iteration converges to $V^*$ for any initial estimate $V$:
  $\lim_{n \to \infty} H^{*(n)}(V) = V^* \quad \forall V$
- Proof
  - By definition $V^* = H^{*(\infty)}(0)$, but value iteration computes $H^{*(\infty)}(V)$ for some initial $V$
  - By Lemma 3, $\|H^{*(n)}(V) - H^{*(n)}(\tilde{V})\|_\infty \le \gamma^n \|V - \tilde{V}\|_\infty$
  - Hence, when $n \to \infty$, $\|H^{*(n)}(V) - H^{*(n)}(0)\|_\infty \to 0$ and $H^{*(\infty)}(V) = V^* \quad \forall V$
Value Iteration
- Even when the horizon is infinite, perform finitely many iterations
- Stop when $\|V_n - V_{n-1}\|_\infty \le \epsilon$
- valueIteration(MDP)
    $V_0 \leftarrow \max_a R^a$; $n \leftarrow 0$
    Repeat
      $n \leftarrow n + 1$
      $V_n \leftarrow \max_a R^a + \gamma T^a V_{n-1}$
    Until $\|V_n - V_{n-1}\|_\infty \le \epsilon$
    Return $V_n$
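A Python sketch of this stopping rule, using the hypothetical two-state MDP from the earlier examples.

```python
import numpy as np

T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])  # T[a, s, s2]
R = np.array([[1.0, 0.0], [0.0, 2.0]])    # R[s, a]
gamma, eps = 0.9, 1e-6

V = R.max(axis=1)                              # V_0 <- max_a R^a
while True:
    V_new = (R.T + gamma * T @ V).max(axis=0)  # max_a R^a + gamma T^a V
    if np.abs(V_new - V).max() <= eps:         # ||V_n - V_{n-1}||_inf <= eps
        V = V_new
        break
    V = V_new
print(V)  # approximately V* (see the bound on the next slide)
```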
Induced Policy
- Since $\|V_n - V_{n-1}\|_\infty \le \epsilon$, by Theorem 4 we know that
  $\|V_n - V^*\|_\infty \le \dfrac{\epsilon}{1 - \gamma}$
- But how good is the stationary policy $\pi_n(s)$ extracted based on $V_n$?
  $\pi_n(s) = \operatorname{argmax}_a R(s, a) + \gamma \sum_{s'} \Pr(s' \mid s, a)\, V_n(s')$
- How far is $V^{\pi_n}$ from $V^*$?
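Extracting the induced stationary policy is a single greedy one-step lookahead on the converged values; a sketch continuing the previous example.

```python
import numpy as np

T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0], [0.0, 2.0]])
gamma = 0.9

# Run value iteration to near-convergence (as in the previous sketch).
V = R.max(axis=1)
for _ in range(500):
    V = (R.T + gamma * T @ V).max(axis=0)

# Induced stationary policy:
# pi_n(s) = argmax_a R(s,a) + gamma * sum_s2 Pr(s2|s,a) V_n(s2)
Q = R + gamma * np.einsum('asn,n->sa', T, V)
pi = Q.argmax(axis=1)
print(pi)  # one action per state, held fixed over time
```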
Induced Policy
- Theorem 5: $\|V^{\pi_n} - V^*\|_\infty \le \dfrac{2\epsilon}{1 - \gamma}$
- Proof
  $\|V^{\pi_n} - V^*\|_\infty = \|V^{\pi_n} - V_n + V_n - V^*\|_\infty$
  $\le \|V^{\pi_n} - V_n\|_\infty + \|V_n - V^*\|_\infty$ (since $\|A + B\| \le \|A\| + \|B\|$)
  $= \|H^{(\infty)}_{\pi_n}(V_n) - V_n\|_\infty + \|V_n - H^{*(\infty)}(V_n)\|_\infty$
  $\le \dfrac{\epsilon}{1 - \gamma} + \dfrac{\epsilon}{1 - \gamma}$ (by Theorems 2 and 4)
  $= \dfrac{2\epsilon}{1 - \gamma}$
Summary
- Value iteration
  - Simple dynamic programming algorithm
  - Complexity: $O(n |A| |S|^2)$, where $n$ is the number of iterations
- Can we optimize the policy directly instead of optimizing the value function and then inducing a policy?