Module 6: Value Iteration
CS 886 Sequential Decision Making and Reinforcement Learning, University of Waterloo
Markov Decision Process
- Definition
  - Set of states: $S$
  - Set of actions (i.e., decisions): $A$
  - Transition model: $\Pr(s_t \mid s_{t-1}, a_{t-1})$
  - Reward model (i.e., utility): $R(s_t, a_t)$
  - Discount factor: $0 \le \gamma \le 1$
  - Horizon (i.e., # of time steps): $h$
- Goal: find an optimal policy $\pi^*$
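As a concrete illustration (not part of the original slides), one way to store a small MDP is with NumPy arrays. The tensor layout `T[a, s, s2]` and all the numbers below are hypothetical toy values, reused by the later sketches.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP used for illustration only.
# T[a, s, s2] = Pr(s2 | s, a); R[s, a] = immediate reward.
T = np.array([[[0.9, 0.1],
               [0.2, 0.8]],   # transitions under action 0
              [[0.5, 0.5],
               [0.0, 1.0]]])  # transitions under action 1
R = np.array([[1.0, 0.0],     # rewards in state 0
              [0.0, 2.0]])    # rewards in state 1
gamma = 0.9                   # discount factor
h = 10                        # horizon (number of time steps)

# Sanity check: each row of each T[a] must be a probability distribution.
assert np.allclose(T.sum(axis=2), 1.0)
```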
Finite Horizon
- Policy evaluation
  $V^\pi_h(s) = \sum_{t=0}^{h} \gamma^t \sum_{s'} \Pr(S_t = s' \mid S_0 = s, \pi)\, R(s', \pi_t(s'))$
- Recursive form (dynamic programming)
  $V^\pi_0(s) = R(s, \pi_0(s))$
  $V^\pi_t(s) = R(s, \pi_t(s)) + \gamma \sum_{s'} \Pr(s' \mid s, \pi_t(s))\, V^\pi_{t-1}(s')$
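A minimal sketch of the recursion above in Python, assuming NumPy and the hypothetical toy MDP from the first sketch; `evaluate_policy` and the always-action-0 example policy are illustrative names, not part of the slides.

```python
import numpy as np

# Toy MDP (same hypothetical arrays as the earlier sketch).
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])  # T[a, s, s2]
R = np.array([[1.0, 0.0], [0.0, 2.0]])    # R[s, a]
gamma, h = 0.9, 10

def evaluate_policy(T, R, gamma, policy, h):
    """Finite-horizon policy evaluation by the recursion above.

    policy[t][s] is the action taken at step t (non-stationary).
    Returns V with V[t][s] = value of following the policy for t
    more steps starting from state s.
    """
    n_states = R.shape[0]
    idx = np.arange(n_states)
    V = np.zeros((h + 1, n_states))
    V[0] = R[idx, policy[0]]              # V_0(s) = R(s, pi_0(s))
    for t in range(1, h + 1):
        a = policy[t]
        # V_t(s) = R(s, pi_t(s)) + gamma * sum_s2 Pr(s2|s,pi_t(s)) V_{t-1}(s2)
        V[t] = R[idx, a] + gamma * T[a, idx] @ V[t - 1]
    return V

# Example: evaluate the policy that always takes action 0.
policy = np.zeros((h + 1, 2), dtype=int)
print(evaluate_policy(T, R, gamma, policy, h)[h])
```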
Finite Horizon
- Optimal policy $\pi^*$
  $V^{\pi^*}_h(s) \ge V^\pi_h(s) \quad \forall \pi, s$
- Optimal value function $V^*$ (shorthand for $V^{\pi^*}$)
  $V^*_0(s) = \max_a R(s, a)$
  $V^*_t(s) = \max_a R(s, a) + \gamma \sum_{s'} \Pr(s' \mid s, a)\, V^*_{t-1}(s')$  (Bellman's equation)
Value Iteration Algorithm
- Optimal policy $\pi^*$
  $t = 0$: $\pi^*_0(s) \leftarrow \operatorname{argmax}_a R(s, a) \quad \forall s$
  $t > 0$: $\pi^*_t(s) \leftarrow \operatorname{argmax}_a R(s, a) + \gamma \sum_{s'} \Pr(s' \mid s, a)\, V^*_{t-1}(s') \quad \forall s$
  NB: $\pi^*$ is non-stationary (i.e., time dependent)
- valueIteration(MDP)
    $V^*_0(s) \leftarrow \max_a R(s, a) \quad \forall s$
    For $t = 1$ to $h$ do
      $V^*_t(s) \leftarrow \max_a R(s, a) + \gamma \sum_{s'} \Pr(s' \mid s, a)\, V^*_{t-1}(s') \quad \forall s$
    Return $V^*$
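A sketch of this finite-horizon algorithm, reusing the hypothetical toy MDP; the `einsum` computes the expected next-step value for every (state, action) pair at once.

```python
import numpy as np

# Toy MDP (hypothetical arrays): T[a, s, s2], R[s, a].
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0], [0.0, 2.0]])
gamma, h = 0.9, 10

def value_iteration_finite(T, R, gamma, h):
    """Returns V[t][s] = V*_t(s) and pi[t][s] = pi*_t(s) for t = 0..h."""
    n_states = R.shape[0]
    V = np.zeros((h + 1, n_states))
    pi = np.zeros((h + 1, n_states), dtype=int)
    V[0], pi[0] = R.max(axis=1), R.argmax(axis=1)   # t = 0 base case
    for t in range(1, h + 1):
        # Q[s, a] = R(s, a) + gamma * sum_s2 Pr(s2|s,a) V*_{t-1}(s2)
        Q = R + gamma * np.einsum('asn,n->sa', T, V[t - 1])
        V[t], pi[t] = Q.max(axis=1), Q.argmax(axis=1)
    return V, pi

V, pi = value_iteration_finite(T, R, gamma, h)
print(V[h], pi[h])  # optimal h-step values and the non-stationary policy at t = h
```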
Value Iteration
- Matrix form:
  $R^a$: $|S| \times 1$ column vector of rewards for action $a$
  $V^*_t$: $|S| \times 1$ column vector of state values
  $T^a$: $|S| \times |S|$ matrix of transition probabilities for action $a$
- valueIteration(MDP)
    $V^*_0 \leftarrow \max_a R^a$
    For $t = 1$ to $h$ do
      $V^*_t \leftarrow \max_a R^a + \gamma T^a V^*_{t-1}$
    Return $V^*$
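The matrix form maps directly onto array operations. In the sketch below (same hypothetical toy MDP), `T @ V` stacks the products $T^a V$ for all actions at once, and row $a$ of `R.T` is $R^a$.

```python
import numpy as np

T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])  # T[a] is the |S|x|S| matrix T^a
R = np.array([[1.0, 0.0], [0.0, 2.0]])    # row a of R.T is R^a
gamma, h = 0.9, 10

V = R.max(axis=1)                          # V*_0 = max_a R^a
for t in range(1, h + 1):
    # R.T + gamma * (T @ V) stacks R^a + gamma T^a V, one row per action;
    # the elementwise max over the action axis is the new value vector.
    V = (R.T + gamma * T @ V).max(axis=0)
print(V)
```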
Infinite Horizon
- Let $h \to \infty$
- Then $V^\pi_h \to V^\pi_\infty$ and $\pi^*_{h-1} \to \pi^*_\infty$ (the optimal policy becomes stationary)
- Policy evaluation:
  $V^\pi_\infty(s) = R(s, \pi_\infty(s)) + \gamma \sum_{s'} \Pr(s' \mid s, \pi_\infty(s))\, V^\pi_\infty(s') \quad \forall s$
- Bellman's equation:
  $V^*_\infty(s) = \max_a R(s, a) + \gamma \sum_{s'} \Pr(s' \mid s, a)\, V^*_\infty(s')$
Policy evaluation
- Linear system of equations
  $V^\pi_\infty(s) = R(s, \pi_\infty(s)) + \gamma \sum_{s'} \Pr(s' \mid s, \pi_\infty(s))\, V^\pi_\infty(s') \quad \forall s$
- Matrix form:
  $R$: $|S| \times 1$ column vector of state rewards for $\pi$
  $V$: $|S| \times 1$ column vector of state values for $\pi$
  $T$: $|S| \times |S|$ matrix of transition probabilities for $\pi$
  $V = R + \gamma T V$
Solving linear equations
- Linear system: $V = R + \gamma T V$
- Gaussian elimination: $(I - \gamma T) V = R$
- Compute inverse: $V = (I - \gamma T)^{-1} R$
- Iterative methods
- Value iteration (a.k.a. Richardson iteration)
- Repeat $V \leftarrow R + \gamma T V$
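Both solution methods side by side, as a sketch; here `T` and `R` are hypothetical quantities already marginalized under a fixed policy $\pi$.

```python
import numpy as np

# Hypothetical policy-conditioned model: T[s, s2] = Pr(s2 | s, pi(s)),
# R[s] = R(s, pi(s)).
T = np.array([[0.9, 0.1],
              [0.2, 0.8]])
R = np.array([1.0, 0.0])
gamma = 0.9

# Direct method: solve (I - gamma T) V = R by Gaussian elimination.
V_direct = np.linalg.solve(np.eye(2) - gamma * T, R)

# Iterative method (Richardson / value iteration): V <- R + gamma T V.
V = np.zeros(2)
for _ in range(1000):
    V = R + gamma * T @ V

print(V_direct, V)  # both converge to the same fixed point
```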
Contraction
- Let $H(V) \triangleq R + \gamma T V$ be the policy evaluation operator
- Lemma 1: $H$ is a contraction mapping:
  $\|H(V) - H(\tilde{V})\|_\infty \le \gamma \|V - \tilde{V}\|_\infty$
- Proof
  $\|H(V) - H(\tilde{V})\|_\infty = \|R + \gamma T V - R - \gamma T \tilde{V}\|_\infty$ (by definition)
  $= \gamma \|T (V - \tilde{V})\|_\infty$ (simplification)
  $\le \gamma \|T\|_\infty \|V - \tilde{V}\|_\infty$ (since $\|AB\| \le \|A\| \|B\|$)
  $= \gamma \|V - \tilde{V}\|_\infty$ (since $\max_s \sum_{s'} T(s, s') = 1$)
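Lemma 1 is easy to check numerically; a small sketch with the hypothetical policy-conditioned `T` and `R` from the previous example and random pairs of value vectors.

```python
import numpy as np

T = np.array([[0.9, 0.1], [0.2, 0.8]])  # hypothetical stochastic matrix
R = np.array([1.0, 0.0])
gamma = 0.9
H = lambda V: R + gamma * T @ V         # policy evaluation operator

rng = np.random.default_rng(0)
for _ in range(1000):
    V1, V2 = rng.normal(size=2), rng.normal(size=2)
    # Contraction: ||H(V1) - H(V2)||_inf <= gamma ||V1 - V2||_inf
    assert np.abs(H(V1) - H(V2)).max() <= gamma * np.abs(V1 - V2).max() + 1e-12
```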
Convergence
- Theorem 2: Policy evaluation converges to $V^\pi$ for any initial estimate $V$:
  $\lim_{n \to \infty} H^{(n)}(V) = V^\pi \quad \forall V$
- Proof
  - By definition $V^\pi = H^{(\infty)}(0)$, but policy evaluation computes $H^{(\infty)}(V)$ for any initial $V$
  - By Lemma 1, $\|H^{(n)}(V) - H^{(n)}(\tilde{V})\|_\infty \le \gamma^n \|V - \tilde{V}\|_\infty$
  - Hence, when $n \to \infty$, $\|H^{(n)}(V) - H^{(n)}(0)\|_\infty \to 0$ and $H^{(\infty)}(V) = V^\pi \quad \forall V$
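The proof also gives a geometric convergence rate, which the following sketch verifies on the toy numbers: since $H^{(n)}(V^\pi) = V^\pi$, the error to the true $V^\pi$ shrinks by at least a factor $\gamma$ per application of $H$, even from a deliberately bad initial estimate.

```python
import numpy as np

T = np.array([[0.9, 0.1], [0.2, 0.8]])
R = np.array([1.0, 0.0])
gamma = 0.9

V_pi = np.linalg.solve(np.eye(2) - gamma * T, R)  # exact fixed point
V = np.array([100.0, -100.0])                     # bad initial estimate
err0 = np.abs(V - V_pi).max()
for n in range(1, 30):
    V = R + gamma * T @ V                         # one application of H
    # Lemma 1 with V_tilde = V_pi: error <= gamma^n * initial error.
    assert np.abs(V - V_pi).max() <= gamma**n * err0 + 1e-9
```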
Approximate Policy Evaluation
- In practice, we can't perform an infinite number of iterations.
- Suppose that we perform value iteration for $n$ steps and $\|H^{(n)}(V) - H^{(n-1)}(V)\|_\infty = \epsilon$; how far is $H^{(n)}(V)$ from $V^\pi$?
Approximate Policy Evaluation
- Theorem 3: If $\|H^{(n)}(V) - H^{(n-1)}(V)\|_\infty \le \epsilon$ then
  $\|V^\pi - H^{(n)}(V)\|_\infty \le \dfrac{\epsilon}{1 - \gamma}$
- Proof
  $\|V^\pi - H^{(n)}(V)\|_\infty$
  $= \|H^{(\infty)}(V) - H^{(n)}(V)\|_\infty$ (by Theorem 2)
  $= \left\| \sum_{t=1}^{\infty} H^{(t+n)}(V) - H^{(t+n-1)}(V) \right\|_\infty$
  $\le \sum_{t=1}^{\infty} \|H^{(t+n)}(V) - H^{(t+n-1)}(V)\|_\infty$ (since $\|A + B\| \le \|A\| + \|B\|$)
  $\le \sum_{t=1}^{\infty} \gamma^t \epsilon$ (by Lemma 1)
  $= \dfrac{\gamma \epsilon}{1 - \gamma} \le \dfrac{\epsilon}{1 - \gamma}$
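A numeric check of Theorem 3 on the toy problem: at every iteration, the distance to the true fixed point stays within $\epsilon / (1 - \gamma)$, where $\epsilon$ is the successive-difference norm actually observed.

```python
import numpy as np

T = np.array([[0.9, 0.1], [0.2, 0.8]])
R = np.array([1.0, 0.0])
gamma = 0.9

V_pi = np.linalg.solve(np.eye(2) - gamma * T, R)  # true V^pi
V_prev = np.zeros(2)
for n in range(1, 100):
    V = R + gamma * T @ V_prev                    # H^(n) applied to 0
    eps = np.abs(V - V_prev).max()                # ||H^(n) - H^(n-1)||_inf
    # Theorem 3: ||V^pi - H^(n)(V)||_inf <= eps / (1 - gamma)
    assert np.abs(V_pi - V).max() <= eps / (1 - gamma) + 1e-12
    V_prev = V
```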
Optimal Value Function
- Non-linear system of equations
  $V^*_\infty(s) = \max_a R(s, a) + \gamma \sum_{s'} \Pr(s' \mid s, a)\, V^*_\infty(s') \quad \forall s$
- Matrix form:
  $R^a$: $|S| \times 1$ column vector of rewards for action $a$
  $V^*$: $|S| \times 1$ column vector of optimal values
  $T^a$: $|S| \times |S|$ matrix of transition probabilities for action $a$
  $V^* = \max_a R^a + \gamma T^a V^*$
Contraction
- Let $H^*(V) \triangleq \max_a R^a + \gamma T^a V$ be the operator in value iteration
- Lemma 3: $H^*$ is a contraction mapping:
  $\|H^*(V) - H^*(\tilde{V})\|_\infty \le \gamma \|V - \tilde{V}\|_\infty$
- Proof: without loss of generality, let $H^*(V)(s) \ge H^*(\tilde{V})(s)$ and let
  $a^*_s = \operatorname{argmax}_a R(s, a) + \gamma \sum_{s'} \Pr(s' \mid s, a)\, V(s')$
Contraction
- Proof continued:
- Then
  $0 \le H^*(V)(s) - H^*(\tilde{V})(s)$ (by assumption)
  $\le R(s, a^*_s) + \gamma \sum_{s'} \Pr(s' \mid s, a^*_s)\, V(s') - R(s, a^*_s) - \gamma \sum_{s'} \Pr(s' \mid s, a^*_s)\, \tilde{V}(s')$ (by definition)
  $= \gamma \sum_{s'} \Pr(s' \mid s, a^*_s) \left[ V(s') - \tilde{V}(s') \right]$
  $\le \gamma \sum_{s'} \Pr(s' \mid s, a^*_s) \|V - \tilde{V}\|_\infty$ (max-norm upper bound)
  $= \gamma \|V - \tilde{V}\|_\infty$ (since $\sum_{s'} \Pr(s' \mid s, a^*_s) = 1$)
- Repeat the same argument for $H^*(\tilde{V})(s) \ge H^*(V)(s)$ and for each $s$
Convergence
- Theorem 4: Value iteration converges to $V^*$ for any initial estimate $V$:
  $\lim_{n \to \infty} H^{*(n)}(V) = V^* \quad \forall V$
- Proof
  - By definition $V^* = H^{*(\infty)}(0)$, but value iteration computes $H^{*(\infty)}(V)$ for some initial $V$
  - By Lemma 3, $\|H^{*(n)}(V) - H^{*(n)}(\tilde{V})\|_\infty \le \gamma^n \|V - \tilde{V}\|_\infty$
  - Hence, when $n \to \infty$, $\|H^{*(n)}(V) - H^{*(n)}(0)\|_\infty \to 0$ and $H^{*(\infty)}(V) = V^* \quad \forall V$
Value Iteration
- Even when the horizon is infinite, perform finitely many iterations
- Stop when $\|V_n - V_{n-1}\|_\infty \le \epsilon$
- valueIteration(MDP)
    $V_0 \leftarrow \max_a R^a$; $n \leftarrow 0$
    Repeat
      $n \leftarrow n + 1$
      $V_n \leftarrow \max_a R^a + \gamma T^a V_{n-1}$
    Until $\|V_n - V_{n-1}\|_\infty \le \epsilon$
    Return $V_n$
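A Python sketch of this stopping rule, using the hypothetical two-state MDP from the earlier examples.

```python
import numpy as np

T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])  # T[a, s, s2]
R = np.array([[1.0, 0.0], [0.0, 2.0]])    # R[s, a]
gamma, eps = 0.9, 1e-6

V = R.max(axis=1)                              # V_0 <- max_a R^a
while True:
    V_new = (R.T + gamma * T @ V).max(axis=0)  # max_a R^a + gamma T^a V
    if np.abs(V_new - V).max() <= eps:         # ||V_n - V_{n-1}||_inf <= eps
        V = V_new
        break
    V = V_new
print(V)  # approximately V* (see the bound on the next slide)
```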
Induced Policy
- Since $\|V_n - V_{n-1}\|_\infty \le \epsilon$, by Theorem 4 we know that
  $\|V_n - V^*\|_\infty \le \dfrac{\epsilon}{1 - \gamma}$
- But how good is the stationary policy $\pi_n(s)$ extracted based on $V_n$?
  $\pi_n(s) = \operatorname{argmax}_a R(s, a) + \gamma \sum_{s'} \Pr(s' \mid s, a)\, V_n(s')$
- How far is $V^{\pi_n}$ from $V^*$?
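Extracting the induced stationary policy is a single greedy one-step lookahead on the converged values; a sketch continuing the previous example.

```python
import numpy as np

T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0], [0.0, 2.0]])
gamma = 0.9

# Run value iteration to near-convergence (as in the previous sketch).
V = R.max(axis=1)
for _ in range(500):
    V = (R.T + gamma * T @ V).max(axis=0)

# Induced stationary policy:
# pi_n(s) = argmax_a R(s,a) + gamma * sum_s2 Pr(s2|s,a) V_n(s2)
Q = R + gamma * np.einsum('asn,n->sa', T, V)
pi = Q.argmax(axis=1)
print(pi)  # one action per state, held fixed over time
```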
Induced Policy
- Theorem 5: $\|V^{\pi_n} - V^*\|_\infty \le \dfrac{2\epsilon}{1 - \gamma}$
- Proof
  $\|V^{\pi_n} - V^*\|_\infty = \|V^{\pi_n} - V_n + V_n - V^*\|_\infty$
  $\le \|V^{\pi_n} - V_n\|_\infty + \|V_n - V^*\|_\infty$ (since $\|A + B\| \le \|A\| + \|B\|$)
  $= \|H^{(\infty)}_{\pi_n}(V_n) - V_n\|_\infty + \|V_n - H^{*(\infty)}(V_n)\|_\infty$
  $\le \dfrac{\epsilon}{1 - \gamma} + \dfrac{\epsilon}{1 - \gamma}$ (by Theorems 2 and 4)
  $= \dfrac{2\epsilon}{1 - \gamma}$
Summary
- Value iteration
  - Simple dynamic programming algorithm
  - Complexity: $O(n |A| |S|^2)$, where $n$ is the number of iterations
- Can we optimize the policy directly instead of optimizing the value function and then inducing a policy?