SLIDE 1

Value Function Methods

CS 285

Instructor: Sergey Levine UC Berkeley

SLIDE 2

Recap: actor-critic

generate samples (i.e. run the policy) → fit a model to estimate return → improve the policy

SLIDE 3

Can we omit policy gradient completely?

forget policies, let’s just do this!

generate samples (i.e. run the policy) → fit a model to estimate return → improve the policy

SLIDE 4

Policy iteration

High level idea:

generate samples (i.e. run the policy) → fit a model to estimate return → improve the policy

how to do this?
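A minimal tabular sketch of the high-level loop, assuming a known dynamics model. The arrays `P` (transition probabilities) and `R` (rewards) are illustrative stand-ins, not from the lecture's code:

```python
import numpy as np

def policy_iteration(P, R, gamma=0.99, eval_iters=100):
    """Tabular policy iteration.

    P: (S, A, S) array, P[s, a, s'] = transition probability
    R: (S, A) array of rewards
    """
    S, A, _ = P.shape
    pi = np.zeros(S, dtype=int)                     # arbitrary initial policy
    while True:
        # Step 1: policy evaluation -- iterate the Bellman backup for pi
        V = np.zeros(S)
        for _ in range(eval_iters):
            V = R[np.arange(S), pi] + gamma * (P[np.arange(S), pi] @ V)
        # Step 2: policy improvement -- act greedily w.r.t. Q(s, a)
        Q = R + gamma * (P @ V)                     # shape (S, A)
        new_pi = Q.argmax(axis=1)
        if np.array_equal(new_pi, pi):              # policy stopped changing
            return pi, V
        pi = new_pi
```

The evaluation step here is exactly the dynamic programming backup the next slide shows per state.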

SLIDE 5

Dynamic programming

[Figure: tabular gridworld showing the current value estimate for each state]

just use the current estimate here
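Written out, the backup this slide applies, bootstrapping with the current estimate on the right-hand side:

```latex
V^{\pi}(s) \leftarrow r(s, \pi(s)) + \gamma \, \mathbb{E}_{s' \sim p(s' \mid s, \pi(s))}\!\left[ V^{\pi}(s') \right]
```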

SLIDE 6

Policy iteration with dynamic programming

generate samples (i.e. run the policy) → fit a model to estimate return → improve the policy

[Figure: tabular gridworld showing the current value estimate for each state]

SLIDE 7

Even simpler dynamic programming

approximates the new value!

generate samples (i.e. run the policy) → fit a model to estimate return → improve the policy
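The "even simpler" scheme is value iteration: skip the explicit policy entirely and back up the max directly. A tabular sketch under the same illustrative `P`, `R` arrays as before:

```python
import numpy as np

def value_iteration(P, R, gamma=0.99, tol=1e-6):
    """Tabular value iteration: the max over actions replaces
    the explicit policy improvement step."""
    S, A, _ = P.shape
    V = np.zeros(S)
    while True:
        Q = R + gamma * (P @ V)      # Q(s, a) = r(s, a) + gamma * E[V(s')]
        V_new = Q.max(axis=1)        # V(s) <- max_a Q(s, a)
        if np.abs(V_new - V).max() < tol:
            return V_new
        V = V_new
```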

SLIDE 8

Fitted Value Iteration & Q-Iteration

SLIDE 9

Fitted value iteration

curse of dimensionality

generate samples (i.e. run the policy) → fit a model to estimate return → improve the policy
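A minimal fitted value iteration sketch in PyTorch, replacing the table with a neural network to sidestep the curse of dimensionality. The toy dynamics model `env_model`, the network sizes, and the sampled states are all illustrative assumptions; note the algorithm still needs a known model to evaluate the max over actions:

```python
import torch
import torch.nn as nn

state_dim, num_actions, gamma = 4, 2, 0.99

def env_model(s, a):
    # Hypothetical known dynamics: deterministic transition and reward.
    s_next = 0.9 * s + 0.1 * a
    reward = -s.abs().sum(dim=-1)
    return s_next, reward

value_net = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(), nn.Linear(64, 1))
opt = torch.optim.Adam(value_net.parameters(), lr=1e-3)

states = torch.randn(256, state_dim)       # sampled states, not a full sweep
for _ in range(200):
    with torch.no_grad():                  # targets are held fixed
        backups = []
        for a in range(num_actions):
            s_next, reward = env_model(states, a)
            backups.append(reward + gamma * value_net(s_next).squeeze(-1))
        y = torch.stack(backups, dim=1).max(dim=1).values   # max over actions
    # Supervised regression of V_phi(s) onto the backed-up targets
    loss = ((value_net(states).squeeze(-1) - y) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```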

SLIDE 10

What if we don’t know the transition dynamics?

Need to know outcomes for different actions!

Back to policy iteration… can fit this using samples

SLIDE 11

Can we do the “max” trick again?

doesn’t require simulation of actions!

  + works even for off-policy samples (unlike actor-critic)
  + only one network, no high-variance policy gradient
  − no convergence guarantees for non-linear function approximation (more on this later)

Forget the policy, compute the value directly. Can we do this with Q-values also, without knowing the transitions?

SLIDE 12

Fitted Q-iteration
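A sketch of the full fitted Q-iteration loop for discrete actions, under the same kind of illustrative assumptions (random placeholder tensors stand in for an off-policy transition dataset):

```python
import torch
import torch.nn as nn

state_dim, num_actions, gamma = 4, 2, 0.99
q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(),
                      nn.Linear(64, num_actions))
opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)

# A fixed batch of (s, a, r, s', done) transitions from *any* behavior policy.
s      = torch.randn(512, state_dim)
a      = torch.randint(num_actions, (512,))
r      = torch.randn(512)
s_next = torch.randn(512, state_dim)
done   = torch.zeros(512)

for _ in range(50):                      # outer loop: recompute targets
    with torch.no_grad():
        y = r + gamma * (1 - done) * q_net(s_next).max(dim=1).values
    for _ in range(10):                  # inner loop: regress Q(s, a) onto y
        q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
        loss = ((q_sa - y) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
```

No transition model appears anywhere: the max in the target is taken over the Q-network's own outputs, which is what makes the "max trick" work without simulating actions.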

SLIDE 13

Review

generate samples (i.e. run the policy) → fit a model to estimate return → improve the policy

  • Value-based methods
  • Don’t learn a policy explicitly
  • Just learn value or Q-function
  • If we have a value function, we have a policy
  • Fitted Q-iteration
SLIDE 14

From Q-Iteration to Q-Learning

SLIDE 15

Why is this algorithm off-policy?

Fitted Q-iteration learns from a dataset of transitions: the max over actions in the target is computed by the Q-function itself, so the transitions can come from any policy.

SLIDE 16

What is fitted Q-iteration optimizing?

most guarantees are lost when we leave the tabular case (e.g., use neural networks)
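The objective in question is the Bellman error, written here in its standard form, with β denoting the distribution of the off-policy data:

```latex
\mathcal{E} \;=\; \frac{1}{2}\,\mathbb{E}_{(s,a)\sim \beta}\!\left[\left( Q_\phi(s,a) \;-\; \left[\, r(s,a) + \gamma \max_{a'} Q_\phi(s',a') \,\right] \right)^{2}\right]
```

If this error is zero, Q_φ is the optimal Q-function; but, as noted above, that guarantee holds only in the tabular case.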

SLIDE 17

Online Q-learning algorithms

generate samples (i.e. run the policy) → fit a model to estimate return → improve the policy

  • off-policy, so many choices here!
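A minimal online (tabular) Q-learning sketch: one action, one observed transition, one update per step. The toy chain MDP and the epsilon-greedy collection rule are illustrative choices; collection is off-policy, so many rules work:

```python
import numpy as np

S, A, gamma, alpha, eps = 5, 2, 0.99, 0.1, 0.1
Q = np.zeros((S, A))
rng = np.random.default_rng(0)

def step(s, a):
    # Toy chain MDP: action 1 moves right, 0 moves left; reward at the far end.
    s_next = min(s + 1, S - 1) if a == 1 else max(s - 1, 0)
    return s_next, float(s_next == S - 1)

s = 0
for _ in range(10_000):
    # 1. take one action (epsilon-greedy here; any exploring policy works)
    a = int(rng.integers(A)) if rng.random() < eps else int(Q[s].argmax())
    s_next, r = step(s, a)
    # 2. compute one target, 3. take one update step toward it
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
    s = 0 if s_next == S - 1 else s_next   # reset at the terminal state
```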
SLIDE 18

Exploration with Q-learning

We’ll discuss exploration in detail in a later lecture!

Final (greedy) policy: why is this a bad idea for step 1? Two standard fixes, sketched below: “epsilon-greedy” and “Boltzmann exploration”.
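The two rules as action-sampling functions over the Q-values of one state (a sketch; the epsilon and temperature values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_values, eps=0.1):
    if rng.random() < eps:
        return int(rng.integers(len(q_values)))   # random action w.p. eps
    return int(np.argmax(q_values))               # greedy otherwise

def boltzmann(q_values, temperature=1.0):
    logits = np.asarray(q_values) / temperature
    probs = np.exp(logits - logits.max())         # numerically stable softmax
    probs /= probs.sum()
    return int(rng.choice(len(q_values), p=probs))
```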

SLIDE 19

Review

generate samples (i.e. run the policy) → fit a model to estimate return → improve the policy

  • Value-based methods
  • Don’t learn a policy explicitly
  • Just learn value or Q-function
  • If we have a value function, we have a policy
  • Fitted Q-iteration
  • Batch mode, off-policy method
  • Q-learning
  • Online analogue of fitted Q-iteration

SLIDE 20

Value Functions in Theory

SLIDE 21

Value function learning theory

[Figure: tabular gridworld showing the current value estimate for each state]
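The standard statements behind these slides, written out (B is the Bellman backup operator, V* its fixed point):

```latex
(\mathcal{B}V)(s) = \max_{a}\left[ r(s,a) + \gamma\,\mathbb{E}_{s' \sim p(\cdot\mid s,a)}\big[V(s')\big] \right],
\qquad
\lVert \mathcal{B}V - \mathcal{B}\bar{V} \rVert_{\infty} \le \gamma\, \lVert V - \bar{V} \rVert_{\infty}
```

Because B is a γ-contraction in the infinity norm, repeated backups converge to the unique fixed point V* = BV*; that is the tabular value iteration convergence result.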

SLIDE 22

Value function learning theory

[Figure: tabular gridworld showing the current value estimate for each state]

SLIDE 23

Non-tabular value function learning

SLIDE 24

Non-tabular value function learning

Conclusions:

  • value iteration converges (tabular case)
  • fitted value iteration does not converge: not in general, and often not in practice
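Written out, fitted value iteration composes the backup with a projection onto the function class Ω (the least-squares fit):

```latex
V \leftarrow \Pi\mathcal{B}V,
\qquad
\Pi V = \arg\min_{V' \in \Omega} \lVert V' - V \rVert_{2}^{2}
```

B is a contraction in the infinity norm and Π is a contraction in the ℓ2 norm, but their composition ΠB need not be a contraction in any norm, which is why convergence fails.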
SLIDE 25

What about fitted Q-iteration?

Applies also to online Q-learning

SLIDE 26

But… it’s just regression!

Q-learning is not gradient descent! no gradient through target value
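A minimal PyTorch illustration of that point (shapes and sizes here are arbitrary): the target is built from the same network but detached, so the quantity being reduced moves every time the network changes, and the update is not the gradient of any fixed objective:

```python
import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 2))
s, s_next = torch.randn(64, 4), torch.randn(64, 4)
a, r, gamma = torch.randint(2, (64,)), torch.randn(64), 0.99

q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
# detach(): no gradient flows through the target value
y = (r + gamma * q_net(s_next).max(dim=1).values).detach()
loss = ((q_sa - y) ** 2).mean()
loss.backward()          # looks like regression, but the "label" y moves
```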

SLIDE 27

A sad corollary

An aside regarding terminology

SLIDE 28

Review

generate samples (i.e. run the policy) → fit a model to estimate return → improve the policy

  • Value iteration theory
  • Operator for backup
  • Operator for projection
  • Backup is a contraction
  • Value iteration converges
  • Convergence with function approximation
  • Projection is also a contraction
  • Projection + backup is not a contraction
  • Fitted value iteration does not converge in general
  • Implications for Q-learning
  • Q-learning, fitted Q-iteration, etc. do not converge with function approximation
  • But we can make it work in practice!
  • Sometimes – tune in next time