

SLIDE 1

Dynamic Programming

  • Prof. Kuan-Ting Lai

2020/4/10

SLIDE 2

Dynamic Programming

  • Dynamic Programming is for problems with two properties:
  • 1. Optimal substructure
  • Optimal solution can be decomposed into subproblems
  • 2. Overlapping subproblems
  • Subproblems recur many times
  • Solutions can be cached and reused
  • Examples:

− Shortest path, Tower of Hanoi, …
− Markov Decision Process (MDP)
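Both properties appear in the classic examples above. As a minimal sketch (not from the slides), a memoized Fibonacci in Python exhibits optimal substructure and overlapping subproblems; the function and cache names are illustrative:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def fib(n: int) -> int:
    # Optimal substructure: fib(n) decomposes into the
    # subproblems fib(n-1) and fib(n-2).
    if n < 2:
        return n
    # Overlapping subproblems: the same fib(k) recurs many times;
    # lru_cache stores each answer so it is computed only once.
    return fib(n - 1) + fib(n - 2)

print(fib(50))  # 12586269025; instant with caching, exponential without
```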

SLIDE 3

Sutton, Richard S. and Barto, Andrew G., Reinforcement Learning: An Introduction (Adaptive Computation and Machine Learning series), p. 189.

SLIDE 4

Dynamic Programming for MDP

  • Bellman equation gives recursive decomposition
  • Value function stores and reuses solutions
  • Dynamic programming assumes full knowledge of the MDP
  • Used for Model-based Planning
SLIDE 5

Policy Evaluation (Prediction)

  • Calculate the state-value function $v_\pi$ for an arbitrary policy $\pi$
  • Can be solved iteratively by repeatedly applying the Bellman expectation backup:

$v_{k+1}(s) \leftarrow \mathbb{E}_\pi\left[R_{t+1} + \gamma \, v_k(S_{t+1}) \mid S_t = s\right]$
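A minimal Python sketch of this backup; the MDP encoding (nested lists P, R and a stochastic policy pi) is an assumption for illustration, not the lecture's notation:

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma=0.9, theta=1e-8):
    """Iterative policy evaluation via the Bellman expectation backup.

    Assumed (hypothetical) MDP encoding:
      P[s][a]  -> list of (probability, next_state) pairs
      R[s][a]  -> expected immediate reward for taking a in s
      pi[s][a] -> probability that the policy picks a in s
    """
    v = np.zeros(len(P))
    while True:
        delta = 0.0
        for s in range(len(P)):
            # Bellman expectation backup for state s
            backup = sum(
                pi[s][a] * (R[s][a] + gamma * sum(p * v[s2] for p, s2 in P[s][a]))
                for a in range(len(P[s]))
            )
            delta = max(delta, abs(backup - v[s]))
            v[s] = backup  # in-place (Gauss-Seidel style) update
        if delta < theta:  # stop when the largest change is tiny
            return v
```

The sweep here updates v in place, reusing values already updated within the same sweep; this also converges to $v_\pi$ and is how DP is typically implemented.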

SLIDE 6

Policy Evaluation in Small Grid World

  • One terminal state (shown twice, as the shaded squares)
  • Actions leading out of the grid leave the state unchanged
  • Reward is -1 until the terminal state is reached
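Below is a self-contained sketch of this evaluation for the uniform random policy (probability 0.25 per action), treating both shaded corners as terminal cells and using γ = 1, as in Sutton & Barto's Example 4.1:

```python
import numpy as np

# Small grid world: 4x4, reward -1 per step, undiscounted (gamma = 1).
# The single terminal state is drawn twice; here both shaded corners
# (cells 0 and 15) are treated as terminal.
N = 4
TERMINAL = {0, N * N - 1}
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def step(s, a):
    r, c = divmod(s, N)
    nr, nc = r + a[0], c + a[1]
    # actions leading out of the grid leave the state unchanged
    return s if not (0 <= nr < N and 0 <= nc < N) else nr * N + nc

v = np.zeros(N * N)
for _ in range(500):  # plenty of sweeps for convergence
    v = np.array([0.0 if s in TERMINAL else
                  sum(0.25 * (-1.0 + v[step(s, a)]) for a in ACTIONS)
                  for s in range(N * N)])
print(v.reshape(N, N).round(0))  # matches the book: 0, -14, -20, -22, ...
```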
SLIDE 7
SLIDE 8
SLIDE 9

How to Improve a Policy

  • 1. Evaluate the policy:

− $v_\pi(s) = \mathbb{E}[R_{t+1} + \gamma R_{t+2} + \cdots \mid S_t = s]$

  • 2. Improve the policy by acting greedily with respect to $v_\pi$:

− $\pi' = \mathrm{greedy}(v_\pi)$

  • This process of policy iteration always converges to the optimal policy $\pi^*$
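A sketch of the greedy improvement step under the same assumed MDP encoding as the evaluation sketch earlier (one-step lookahead q-values, then an argmax per state):

```python
import numpy as np

def greedy_improvement(v, P, R, gamma=1.0):
    """Act greedily with respect to v: pick argmax_a q(s, a) per state.

    Uses the same hypothetical MDP encoding as the evaluation sketch:
    P[s][a] -> list of (probability, next_state), R[s][a] -> reward.
    """
    pi = []
    for s in range(len(P)):
        q = [R[s][a] + gamma * sum(p * v[s2] for p, s2 in P[s][a])
             for a in range(len(P[s]))]   # one-step lookahead q-values
        pi.append(int(np.argmax(q)))      # deterministic greedy action
    return pi
```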
SLIDE 10

Policy Iteration

  • Policy evaluation: estimate $v_\pi$
  • Policy improvement: generate $\pi' \ge \pi$ by acting greedily with respect to $v_\pi$
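Putting the two steps together; this sketch reuses the hypothetical policy_evaluation and greedy_improvement functions from the earlier sketches:

```python
def policy_iteration(P, R, gamma=0.9):
    """Alternate evaluation and greedy improvement until the policy is stable."""
    pi = [0] * len(P)  # arbitrary initial deterministic policy
    while True:
        # express the deterministic policy as action probabilities
        pi_probs = [[1.0 if a == pi[s] else 0.0 for a in range(len(P[s]))]
                    for s in range(len(P))]
        v = policy_evaluation(P, R, pi_probs, gamma)   # evaluation
        new_pi = greedy_improvement(v, P, R, gamma)    # improvement
        if new_pi == pi:   # greedy policy unchanged => optimal
            return pi, v
        pi = new_pi
```

Stopping when the greedy policy no longer changes is sound: a policy that is greedy with respect to its own value function satisfies the Bellman optimality equation.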
SLIDE 11

Jack’s Car Rental

  • Example 4.2 from Sutton & Barto: two rental locations with Poisson-distributed rentals and returns; up to five cars can be moved between the locations overnight, and policy iteration finds how many cars to move given the number at each location

SLIDE 12
SLIDE 13

Policy Improvement (1)

SLIDE 14

Policy Improvement (2)

SLIDE 15

Modified Policy Iteration

  • Do we need to iterate evaluation all the way to convergence of $v_\pi$?
  • Can we simply stop after k iterations?

− Example: the small grid world achieves the optimal policy after k = 3 iterations

  • What if we update the policy after every iteration? => Value Iteration (sketch below)
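A sketch of the truncated variant, reusing the hypothetical greedy_improvement from before; with k = 1 each improvement follows a single backup, which collapses into value iteration, while large k recovers full policy iteration:

```python
import numpy as np

def modified_policy_iteration(P, R, gamma=0.9, k=3):
    """Policy iteration with evaluation truncated to k sweeps.

    Instead of iterating policy evaluation to convergence, perform
    only k Bellman expectation backups before each improvement.
    (A production version would also test value convergence.)
    """
    v = np.zeros(len(P))
    pi = [0] * len(P)
    while True:
        for _ in range(k):  # truncated evaluation: k sweeps, no test
            v = np.array([R[s][pi[s]] +
                          gamma * sum(p * v[s2] for p, s2 in P[s][pi[s]])
                          for s in range(len(P))])
        new_pi = greedy_improvement(v, P, R, gamma)
        if new_pi == pi:
            return pi, v
        pi = new_pi
```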
SLIDE 16

Value Iteration

  • Update the value function $v$ only; no explicit policy function $\pi$ is computed during the sweeps
  • The policy is implicit: it can be built from $v$ by acting greedily
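A sketch of value iteration under the same assumed MDP encoding as the earlier sketches; note there is no policy anywhere in the main loop, only a max over actions, and the greedy policy is extracted once at the end:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, theta=1e-8):
    """Back up with a max over actions (Bellman optimality backup)."""
    v = np.zeros(len(P))
    while True:
        delta = 0.0
        for s in range(len(P)):
            best = max(R[s][a] + gamma * sum(p * v[s2] for p, s2 in P[s][a])
                       for a in range(len(P[s])))
            delta = max(delta, abs(best - v[s]))
            v[s] = best
        if delta < theta:
            break
    # the policy is implicit: extract it greedily from the final v
    pi = [int(np.argmax([R[s][a] + gamma * sum(p * v[s2] for p, s2 in P[s][a])
                         for a in range(len(P[s]))]))
          for s in range(len(P))]
    return pi, v
```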
SLIDE 17

Shortest Path Example
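As a concrete illustration (assumed setup, not taken from the slide: goal in the top-left cell, reward -1 per move, γ = 1), value iteration on a 4x4 grid converges to v(s) = minus the number of steps to the goal:

```python
import numpy as np

N, GOAL = 4, 0
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]

def step(s, a):
    r, c = divmod(s, N)
    nr, nc = r + a[0], c + a[1]
    return s if not (0 <= nr < N and 0 <= nc < N) else nr * N + nc

v = np.zeros(N * N)
for _ in range(2 * N):  # enough sweeps to propagate values from the goal
    v = np.array([0.0 if s == GOAL else
                  max(-1.0 + v[step(s, a)] for a in ACTIONS)
                  for s in range(N * N)])
print(v.reshape(N, N))  # row 0: 0, -1, -2, -3; each sweep extends one step
```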

SLIDE 18

Policy Iteration vs. Value Iteration

  • Policy iteration: alternate policy evaluation (possibly truncated) with greedy policy improvement; an explicit policy is maintained throughout
  • Value iteration: fold the improvement into every backup with a max over actions; the policy is extracted only at the end
SLIDE 19

Reference

  • David Silver, Lecture 3: Planning by Dynamic Programming
    (https://www.youtube.com/watch?v=Nd1-UUMVfz4&list=PLqYmG7hTraZDM-OYHWgPebj2MfCFzFObQ&index=3)
  • Richard S. Sutton and Andrew G. Barto, “Reinforcement Learning: An Introduction,” 2nd edition, Nov. 2018, Chapter 4