SLIDE 1

Advanced Policy Gradients

CS 285

Instructor: Sergey Levine UC Berkeley

SLIDE 2

Recap: policy gradients

generate samples (i.e. run the policy) → fit a model to estimate return → improve the policy

“reward to go”: the sum of rewards from time t onward; can also use function approximation (a learned value/critic function) here
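As a concrete illustration of this recap (a minimal sketch, not code from the lecture), here is a Monte Carlo policy gradient with reward-to-go for a linear-softmax policy over discrete actions; the policy parameterization and all names are illustrative assumptions:

```python
# Minimal sketch of the recap: Monte Carlo policy gradient with reward-to-go.
# Assumes a linear-softmax policy over discrete actions; names are illustrative.
import numpy as np

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def reward_to_go(rewards, gamma=0.99):
    """Reward to go: Q_hat_t = sum_{t' >= t} gamma^(t'-t) * r_{t'}."""
    rewards = np.asarray(rewards, dtype=float)
    q = np.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        q[t] = running
    return q

def policy_gradient(theta, trajectories, gamma=0.99):
    """grad J ~= (1/N) * sum_i sum_t grad log pi(a_t|s_t) * Q_hat_t."""
    grad = np.zeros_like(theta)
    for states, actions, rewards in trajectories:
        q_hat = reward_to_go(rewards, gamma)   # could instead come from a learned critic
        for s, a, q in zip(states, actions, q_hat):
            probs = softmax(theta @ s)         # theta has shape (num_actions, state_dim)
            # gradient of log pi(a|s) w.r.t. theta for a linear-softmax policy
            grad_log_pi = -np.outer(probs, s)
            grad_log_pi[a] += s
            grad += grad_log_pi * q
    return grad / len(trajectories)
```

The “improve the policy” step is then a plain gradient ascent update, e.g. theta = theta + learning_rate * policy_gradient(theta, trajectories).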

SLIDE 3

Why does policy gradient work?

generate samples (i.e. run the policy) → fit a model to estimate return → improve the policy

look familiar?

SLIDE 4

Policy gradient as policy iteration
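The equations on this slide did not survive extraction; reconstructed in standard notation (assuming $J(\theta)$ is the expected return, $p_\theta(\tau)$ the trajectory distribution of $\pi_\theta$, and $A^{\pi_\theta}$ the advantage of the old policy), the claim the slide establishes is

$$J(\theta') - J(\theta) \;=\; \mathbb{E}_{\tau \sim p_{\theta'}(\tau)}\left[\sum_t \gamma^t A^{\pi_\theta}(\mathbf{s}_t, \mathbf{a}_t)\right]$$

Maximizing the right-hand side over $\theta'$ therefore improves the return, which is the policy iteration pattern: evaluate the advantage of the current policy, then improve the policy against it.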

SLIDE 5

Policy gradient as policy iteration

importance sampling
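A reconstructed sketch of the importance-sampling step, in the same notation as above: write the trajectory expectation as nested expectations over states and actions, then re-express the action expectation under the old policy with an importance weight,

$$\mathbb{E}_{\tau \sim p_{\theta'}(\tau)}\left[\sum_t \gamma^t A^{\pi_\theta}(\mathbf{s}_t, \mathbf{a}_t)\right] \;=\; \sum_t \mathbb{E}_{\mathbf{s}_t \sim p_{\theta'}(\mathbf{s}_t)}\left[\mathbb{E}_{\mathbf{a}_t \sim \pi_\theta(\mathbf{a}_t \mid \mathbf{s}_t)}\left[\frac{\pi_{\theta'}(\mathbf{a}_t \mid \mathbf{s}_t)}{\pi_\theta(\mathbf{a}_t \mid \mathbf{s}_t)}\, \gamma^t A^{\pi_\theta}(\mathbf{s}_t, \mathbf{a}_t)\right]\right]$$

The remaining difficulty is the outer expectation, which is still taken under the new policy's state marginal $p_{\theta'}(\mathbf{s}_t)$.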

SLIDE 6

Ignoring distribution mismatch?


why do we want this to be true? is it true? and when?
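The approximation being questioned (reconstructed; $\bar{A}(\theta')$ is a label used only in these notes) swaps the new policy's state distribution for the old one:

$$\bar{A}(\theta') \;=\; \sum_t \mathbb{E}_{\mathbf{s}_t \sim p_{\theta}(\mathbf{s}_t)}\left[\mathbb{E}_{\mathbf{a}_t \sim \pi_\theta(\mathbf{a}_t \mid \mathbf{s}_t)}\left[\frac{\pi_{\theta'}(\mathbf{a}_t \mid \mathbf{s}_t)}{\pi_\theta(\mathbf{a}_t \mid \mathbf{s}_t)}\, \gamma^t A^{\pi_\theta}(\mathbf{s}_t, \mathbf{a}_t)\right]\right] \;\approx\; J(\theta') - J(\theta)$$

We want this to be true because $\bar{A}(\theta')$ can be estimated and differentiated entirely from samples of the old policy; the following slides argue it holds when $\pi_{\theta'}$ stays close to $\pi_\theta$.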

SLIDE 7

Bounding the Distribution Change

SLIDE 8

Ignoring distribution mismatch?


why do we want this to be true? is it true? and when?

SLIDE 9

Bounding the distribution change

Seem familiar? Not a great bound, but it is a bound!

SLIDE 10

Bounding the distribution change

Proof based on: Schulman, Levine, Moritz, Jordan, Abbeel. “Trust Region Policy Optimization.”
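The bound proved here, reconstructed in the TRPO-style form the citation refers to: if the new policy stays close to the old one in total variation, the state marginals cannot drift too far,

$$\text{if } \;|\pi_{\theta'}(\mathbf{a}_t \mid \mathbf{s}_t) - \pi_\theta(\mathbf{a}_t \mid \mathbf{s}_t)| \le \epsilon \;\text{ for all } \mathbf{s}_t, \quad \text{then} \quad |p_{\theta'}(\mathbf{s}_t) - p_\theta(\mathbf{s}_t)| \le 2\epsilon t,$$

where $|\cdot|$ denotes total variation distance.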

SLIDE 11

Bounding the objective value
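Reconstructed statement of the step this slide carries out: a bound on the state marginals turns into a bound on expectations under them. For any bounded $f$,

$$\mathbb{E}_{p_{\theta'}(\mathbf{s}_t)}\!\left[f(\mathbf{s}_t)\right] \;\ge\; \mathbb{E}_{p_{\theta}(\mathbf{s}_t)}\!\left[f(\mathbf{s}_t)\right] - 2\epsilon t \max_{\mathbf{s}_t} f(\mathbf{s}_t)$$

so optimizing the surrogate under the old state distribution optimizes a lower bound on the true objective, with an error term that vanishes as $\epsilon \to 0$.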

SLIDE 12

Where are we at so far?

SLIDE 13

Policy Gradients with Constraints

SLIDE 14

A more convenient bound

KL divergence has some very convenient properties that make it much easier to approximate!
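The convenient property alluded to here is the standard inequality relating KL divergence to total variation:

$$|\pi_{\theta'}(\mathbf{a}_t \mid \mathbf{s}_t) - \pi_\theta(\mathbf{a}_t \mid \mathbf{s}_t)| \;\le\; \sqrt{\tfrac{1}{2}\, D_{\mathrm{KL}}\!\big(\pi_{\theta'}(\cdot \mid \mathbf{s}_t)\,\|\,\pi_\theta(\cdot \mid \mathbf{s}_t)\big)}$$

so constraining $D_{\mathrm{KL}}(\pi_{\theta'}\,\|\,\pi_\theta) \le \epsilon$ also controls the total variation, and the KL divergence is easy to estimate and differentiate from samples.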

SLIDE 15

How do we optimize the objective?

SLIDE 16

How do we enforce the constraint?

can do this incompletely (for a few grad steps)
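One reading of "do this incompletely" is the dual/penalty formulation: maximize the Lagrangian of the constrained problem for a few gradient steps, then update the multiplier. The sketch below is a minimal illustration assuming PyTorch and hypothetical differentiable callables surrogate(theta) (the importance-sampled objective, to be maximized) and kl(theta) (the estimated KL to the old policy); it is not the lecture's code.

```python
# Sketch of a dual / penalty update: a few inner gradient steps on the Lagrangian,
# then one dual ascent step on the multiplier. `theta` is assumed to be a leaf tensor
# with requires_grad=True; `surrogate` and `kl` are differentiable functions of theta.
import torch

def dual_gradient_step(theta, surrogate, kl, epsilon, lam, n_inner=5, lr=1e-2, lam_lr=1e-2):
    """Maximize surrogate(theta) - lam * (kl(theta) - epsilon), then update lam."""
    opt = torch.optim.Adam([theta], lr=lr)
    for _ in range(n_inner):                 # "incomplete" maximization: a few grad steps
        opt.zero_grad()
        loss = -(surrogate(theta) - lam * (kl(theta) - epsilon))
        loss.backward()
        opt.step()
    with torch.no_grad():                    # dual ascent on the Lagrange multiplier
        lam = max(0.0, lam + lam_lr * (kl(theta).item() - epsilon))
    return theta, lam
```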

SLIDE 17

Natural Gradient

SLIDE 18

How (else) do we optimize the objective?

Use a first-order Taylor approximation for the objective (a.k.a. linearization)

SLIDE 19

How do we optimize the objective?

(see the policy gradient lecture for the derivation); evaluated at θ' = θ, this is exactly the normal policy gradient!
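Reconstructed version of the identity the annotation refers to: differentiating the importance-weighted objective and evaluating at $\theta' = \theta$ makes the importance ratio equal to one, leaving

$$\nabla_{\theta'} \bar{A}(\theta')\Big|_{\theta'=\theta} \;=\; \sum_t \mathbb{E}_{\mathbf{s}_t \sim p_{\theta}(\mathbf{s}_t)}\left[\mathbb{E}_{\mathbf{a}_t \sim \pi_\theta(\mathbf{a}_t \mid \mathbf{s}_t)}\left[\gamma^t\, \nabla_\theta \log \pi_\theta(\mathbf{a}_t \mid \mathbf{s}_t)\, A^{\pi_\theta}(\mathbf{s}_t, \mathbf{a}_t)\right]\right] \;=\; \nabla_\theta J(\theta)$$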

SLIDE 20

Can we just use the gradient then?

SLIDE 21

Can we just use the gradient then?

Not the same constraint! Instead, take a second-order Taylor expansion of the KL divergence.

SLIDE 22

Can we just use the gradient then?

natural gradient
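A small-scale sketch of the natural gradient update (not the lecture's code), assuming the parameter vector is small enough that the full Fisher matrix fits in memory; in practice (e.g. TRPO) the matrix is never formed and Fisher-vector products with conjugate gradient are used instead.

```python
# Natural gradient step: solve F x = grad_j rather than inverting F explicitly.
# grad_log_pi_samples are per-sample gradients of log pi(a|s) under the current policy.
import numpy as np

def natural_gradient_step(theta, grad_j, grad_log_pi_samples, epsilon=0.01):
    """theta' = theta + alpha * F^{-1} grad_j, with F the sample Fisher matrix."""
    # F ~= E[ grad log pi (grad log pi)^T ] under the current policy
    F = np.mean([np.outer(g, g) for g in grad_log_pi_samples], axis=0)
    F += 1e-6 * np.eye(len(theta))            # damping for numerical stability
    nat_grad = np.linalg.solve(F, grad_j)     # F^{-1} grad_j without an explicit inverse
    # step size chosen so the second-order approximation of the KL change is ~ epsilon
    alpha = np.sqrt(2.0 * epsilon / (grad_j @ nat_grad))
    return theta + alpha * nat_grad
```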

SLIDE 23

Is this even a problem in practice?

Essentially the same problem as this: (figures from Peters & Schaal 2008)

SLIDE 24

Practical methods and notes

  • Natural policy gradient
    • Generally a good choice to stabilize policy gradient training
    • See this paper for details: Peters, Schaal. Reinforcement learning of motor skills with policy gradients.
    • Practical implementation: requires efficient Fisher-vector products; a bit non-trivial to do without computing the full matrix
    • See: Schulman et al. Trust region policy optimization
  • Trust region policy optimization
  • Just use the IS objective directly
    • Use regularization to stay close to old policy
    • See: Proximal policy optimization (a minimal sketch of its clipped objective follows this list)
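Referenced in the last item above: one common way to use the IS objective directly while staying close to the old policy is the clipped surrogate from proximal policy optimization. A minimal sketch, assuming PyTorch tensors of per-timestep log-probabilities and advantage estimates (names are illustrative):

```python
# PPO-style clipped importance-sampling objective (the "clip" variant; PPO also has
# a KL-penalty variant closer to the Lagrangian approach discussed earlier).
import torch

def ppo_clip_objective(new_log_prob, old_log_prob, advantages, clip_eps=0.2):
    """Mean of min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A); maximize this."""
    ratio = torch.exp(new_log_prob - old_log_prob.detach())     # pi_theta' / pi_theta
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return torch.min(unclipped, clipped).mean()
```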
SLIDE 25

Review

generate samples (i.e. run the policy) → fit a model to estimate return → improve the policy

  • Policy gradient = policy iteration
  • Optimize advantage under new policy state distribution
  • Using old policy state distribution optimizes a bound, if the policies are close enough
  • Results in constrained optimization problem
  • First order approximation to objective = gradient ascent
  • Regular gradient ascent has the wrong constraint, use natural gradient
  • Practical algorithms
    • Natural policy gradient
    • Trust region policy optimization