

SLIDE 1

Dynamic Programming

  • Prof. Kuan-Ting Lai

2020/4/10

SLIDE 2

Dynamic Programming

  • Dynamic Programming is for problems with two properties:
  • 1. Optimal substructure
  • Optimal solution can be decomposed into subproblems
  • 2. Overlapping subproblems
  • Subproblems recur many times
  • Solutions can be cached and reused
  • Examples:

− Shortest path, Tower of Hanoi, …
− Markov Decision Process (MDP)
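Both properties appear in the classic examples above. As a minimal sketch (not from the slides), a memoized Fibonacci in Python exhibits optimal substructure and overlapping subproblems; the function and cache names are illustrative:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def fib(n: int) -> int:
    # Optimal substructure: fib(n) decomposes into the
    # subproblems fib(n-1) and fib(n-2).
    if n < 2:
        return n
    # Overlapping subproblems: the same fib(k) recurs many times;
    # lru_cache stores each answer so it is computed only once.
    return fib(n - 1) + fib(n - 2)

print(fib(50))  # 12586269025; instant with caching, exponential without
```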

SLIDE 3

Sutton, Richard S. and Barto, Andrew G., Reinforcement Learning: An Introduction (Adaptive Computation and Machine Learning series), p. 189.

SLIDE 4

Dynamic Programming for MDP

  • Bellman equation gives recursive decomposition
  • Value function stores and reuses solutions
  • Dynamic programming assumes full knowledge of the MDP
  • Used for Model-based Planning
SLIDE 5

Policy Evaluation (Prediction)

  • Calculate the state-value function $v_\pi$ for an arbitrary policy $\pi$
  • Can be solved iteratively by repeatedly applying the Bellman expectation backup:

$v_{k+1}(s) \leftarrow \mathbb{E}_\pi\left[R_{t+1} + \gamma \, v_k(S_{t+1}) \mid S_t = s\right]$
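A minimal Python sketch of this backup; the MDP encoding (nested lists P, R and a stochastic policy pi) is an assumption for illustration, not the lecture's notation:

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma=0.9, theta=1e-8):
    """Iterative policy evaluation via the Bellman expectation backup.

    Assumed (hypothetical) MDP encoding:
      P[s][a]  -> list of (probability, next_state) pairs
      R[s][a]  -> expected immediate reward for taking a in s
      pi[s][a] -> probability that the policy picks a in s
    """
    v = np.zeros(len(P))
    while True:
        delta = 0.0
        for s in range(len(P)):
            # Bellman expectation backup for state s
            backup = sum(
                pi[s][a] * (R[s][a] + gamma * sum(p * v[s2] for p, s2 in P[s][a]))
                for a in range(len(P[s]))
            )
            delta = max(delta, abs(backup - v[s]))
            v[s] = backup  # in-place (Gauss-Seidel style) update
        if delta < theta:  # stop when the largest change is tiny
            return v
```

The sweep here updates v in place, reusing values already updated within the same sweep; this also converges to $v_\pi$ and is how DP is typically implemented.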

SLIDE 6

Policy Evaluation in Small Grid World

  • One terminal state (shown twice, as the shaded squares)
  • Actions leading out of the grid leave the state unchanged
  • Reward is -1 until the terminal state is reached
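Below is a self-contained sketch of this evaluation for the uniform random policy (probability 0.25 per action), treating both shaded corners as terminal cells and using γ = 1, as in Sutton & Barto's Example 4.1:

```python
import numpy as np

# Small grid world: 4x4, reward -1 per step, undiscounted (gamma = 1).
# The single terminal state is drawn twice; here both shaded corners
# (cells 0 and 15) are treated as terminal.
N = 4
TERMINAL = {0, N * N - 1}
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def step(s, a):
    r, c = divmod(s, N)
    nr, nc = r + a[0], c + a[1]
    # actions leading out of the grid leave the state unchanged
    return s if not (0 <= nr < N and 0 <= nc < N) else nr * N + nc

v = np.zeros(N * N)
for _ in range(500):  # plenty of sweeps for convergence
    v = np.array([0.0 if s in TERMINAL else
                  sum(0.25 * (-1.0 + v[step(s, a)]) for a in ACTIONS)
                  for s in range(N * N)])
print(v.reshape(N, N).round(0))  # matches the book: 0, -14, -20, -22, ...
```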
SLIDE 7
SLIDE 8
SLIDE 9

How to Improve a Policy

  • 1. Evaluate the policy:

− $v_\pi(s) = \mathbb{E}[R_{t+1} + \gamma R_{t+2} + \cdots \mid S_t = s]$

  • 2. Improve the policy by acting greedily with respect to $v_\pi$:

− $\pi' = \mathrm{greedy}(v_\pi)$

  • This process of policy iteration always converges to the optimal policy $\pi^*$
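A sketch of the greedy improvement step under the same assumed MDP encoding as the evaluation sketch earlier (one-step lookahead q-values, then an argmax per state):

```python
import numpy as np

def greedy_improvement(v, P, R, gamma=1.0):
    """Act greedily with respect to v: pick argmax_a q(s, a) per state.

    Uses the same hypothetical MDP encoding as the evaluation sketch:
    P[s][a] -> list of (probability, next_state), R[s][a] -> reward.
    """
    pi = []
    for s in range(len(P)):
        q = [R[s][a] + gamma * sum(p * v[s2] for p, s2 in P[s][a])
             for a in range(len(P[s]))]   # one-step lookahead q-values
        pi.append(int(np.argmax(q)))      # deterministic greedy action
    return pi
```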
SLIDE 10

Policy Iteration

  • Policy evaluation: estimate $v_\pi$
  • Policy improvement: generate $\pi' \ge \pi$ by acting greedily with respect to $v_\pi$
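Putting the two steps together; this sketch reuses the hypothetical policy_evaluation and greedy_improvement functions from the earlier sketches:

```python
def policy_iteration(P, R, gamma=0.9):
    """Alternate evaluation and greedy improvement until the policy is stable."""
    pi = [0] * len(P)  # arbitrary initial deterministic policy
    while True:
        # express the deterministic policy as action probabilities
        pi_probs = [[1.0 if a == pi[s] else 0.0 for a in range(len(P[s]))]
                    for s in range(len(P))]
        v = policy_evaluation(P, R, pi_probs, gamma)   # evaluation
        new_pi = greedy_improvement(v, P, R, gamma)    # improvement
        if new_pi == pi:   # greedy policy unchanged => optimal
            return pi, v
        pi = new_pi
```

Stopping when the greedy policy no longer changes is sound: a policy that is greedy with respect to its own value function satisfies the Bellman optimality equation.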
SLIDE 11

Jack’s Car Rental

  • Example 4.2 from Sutton & Barto: two rental locations with Poisson-distributed rentals and returns; up to five cars can be moved between the locations overnight, and policy iteration finds how many cars to move given the number at each location

SLIDE 12
SLIDE 13

Policy Improvement (1)

SLIDE 14

Policy Improvement (2)

SLIDE 15

Modified Policy Iteration

  • Do we need to iterate evaluation all the way to convergence of $v_\pi$?
  • Can we simply stop after k iterations?

− Example: the small grid world achieves the optimal policy after k = 3 iterations

  • What if we update the policy after every iteration? => Value Iteration (sketch below)
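A sketch of the truncated variant, reusing the hypothetical greedy_improvement from before; with k = 1 each improvement follows a single backup, which collapses into value iteration, while large k recovers full policy iteration:

```python
import numpy as np

def modified_policy_iteration(P, R, gamma=0.9, k=3):
    """Policy iteration with evaluation truncated to k sweeps.

    Instead of iterating policy evaluation to convergence, perform
    only k Bellman expectation backups before each improvement.
    (A production version would also test value convergence.)
    """
    v = np.zeros(len(P))
    pi = [0] * len(P)
    while True:
        for _ in range(k):  # truncated evaluation: k sweeps, no test
            v = np.array([R[s][pi[s]] +
                          gamma * sum(p * v[s2] for p, s2 in P[s][pi[s]])
                          for s in range(len(P))])
        new_pi = greedy_improvement(v, P, R, gamma)
        if new_pi == pi:
            return pi, v
        pi = new_pi
```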
SLIDE 16

Value Iteration

  • Update the value function $v$ only; no explicit policy function $\pi$ is computed during the sweeps
  • The policy is implicit: it can be built from $v$ by acting greedily
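A sketch of value iteration under the same assumed MDP encoding as the earlier sketches; note there is no policy anywhere in the main loop, only a max over actions, and the greedy policy is extracted once at the end:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, theta=1e-8):
    """Back up with a max over actions (Bellman optimality backup)."""
    v = np.zeros(len(P))
    while True:
        delta = 0.0
        for s in range(len(P)):
            best = max(R[s][a] + gamma * sum(p * v[s2] for p, s2 in P[s][a])
                       for a in range(len(P[s])))
            delta = max(delta, abs(best - v[s]))
            v[s] = best
        if delta < theta:
            break
    # the policy is implicit: extract it greedily from the final v
    pi = [int(np.argmax([R[s][a] + gamma * sum(p * v[s2] for p, s2 in P[s][a])
                         for a in range(len(P[s]))]))
          for s in range(len(P))]
    return pi, v
```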
SLIDE 17

Shortest Path Example
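As a concrete illustration (assumed setup, not taken from the slide: goal in the top-left cell, reward -1 per move, γ = 1), value iteration on a 4x4 grid converges to v(s) = minus the number of steps to the goal:

```python
import numpy as np

N, GOAL = 4, 0
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]

def step(s, a):
    r, c = divmod(s, N)
    nr, nc = r + a[0], c + a[1]
    return s if not (0 <= nr < N and 0 <= nc < N) else nr * N + nc

v = np.zeros(N * N)
for _ in range(2 * N):  # enough sweeps to propagate values from the goal
    v = np.array([0.0 if s == GOAL else
                  max(-1.0 + v[step(s, a)] for a in ACTIONS)
                  for s in range(N * N)])
print(v.reshape(N, N))  # row 0: 0, -1, -2, -3; each sweep extends one step
```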

SLIDE 18

Policy Iteration vs. Value Iteration

  • Policy iteration: alternate policy evaluation (possibly truncated) with greedy policy improvement; an explicit policy is maintained throughout
  • Value iteration: fold the improvement into every backup with a max over actions; the policy is extracted only at the end
SLIDE 19

Reference

  • David Silver, Lecture 3: Planning by Dynamic Programming
    (https://www.youtube.com/watch?v=Nd1-UUMVfz4&list=PLqYmG7hTraZDM-OYHWgPebj2MfCFzFObQ&index=3)
  • Richard S. Sutton and Andrew G. Barto, “Reinforcement Learning: An Introduction,” 2nd edition, Nov. 2018, Chapter 4