SLIDE 1

Reinforcement Learning – Policy Optimization

Pieter Abbeel UC Berkeley EECS

SLIDE 2

Policy Optimization

- Consider a control policy parameterized by a parameter vector θ:

  max_θ E[ Σ_{t=0}^H R(s_t) | π_θ ]

- Often a stochastic policy class is used (smooths out the problem):

  π_θ(u|s): probability of taking action u in state s
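To make the stochastic policy class concrete, here is a minimal sketch of one common choice, a Gaussian policy whose mean is linear in the state. The linear parameterization and the fixed noise level sigma are illustrative assumptions, not something the slides specify.

```python
import numpy as np

def gaussian_policy(theta, s, action_dim, sigma=0.1):
    """Stochastic policy pi_theta(u|s) = N(W s, sigma^2 I), with mean linear in the state.

    theta is the flattened weight matrix W of shape (action_dim, state_dim);
    the linear mean and fixed sigma are illustrative assumptions.
    """
    W = theta.reshape(action_dim, s.shape[0])
    mean = W @ s
    u = mean + sigma * np.random.randn(action_dim)        # sample an action
    log_prob = -0.5 * np.sum(((u - mean) / sigma) ** 2)   # log pi_theta(u|s), up to a constant
    return u, log_prob
```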

SLIDE 3

Learning to Trot/Run

Before learning (hand-tuned) vs. after learning. [Policy search was done through trials on the actual robot.]

Kohl and Stone, ICRA 2004

SLIDE 4

Learning to Trot/Run

- 12 parameters define the Aibo's gait:
  - The front locus (3 parameters: height, x-pos., y-pos.)
  - The rear locus (3 parameters)
  - Locus length
  - Locus skew multiplier in the x-y plane (for turning)
  - The height of the front of the body
  - The height of the rear of the body
  - The time each foot takes to move through its locus
  - The fraction of time each foot spends on the ground

Kohl and Stone, ICRA 2004

SLIDE 5

[Ng et al., ISER 2004] [Policy search was done in simulation.]

SLIDE 6

Learning to Hover

SLIDE 7

Ball-in-a-Cup

[Kober and Peters, 2009]

SLIDE 8

Learning to Walk in 20 Minutes

[Tedrake, Zhang, Seung 2005]

SLIDE 9

Learning to Walk in 20 Minutes

[Tedrake, Zhang, Seung 2005]

Robot details (figure annotations):
- Passive hip joint [1 DOF]
- 2 x 2 (roll, pitch) position-controlled servo motors [4 DOF]
- 44 cm tall
- Freely swinging load [1 DOF]
- Arms coupled to the opposite leg to reduce yaw moment
- Natural gait down a 0.03 radian ramp: 0.8 Hz, 6.5 cm steps
- 9 DOFs: 6 internal DOFs + 3 DOFs for the robot's orientation (always assumed in contact with the ground at a single point; absolute (x,y) ignored)

SLIDE 10

Learning to Walk in 20 Minutes

[Tedrake, Zhang, Seung 2005]

SLIDE 11

Gradient-Free Methods

max_θ U(θ) = max_θ E[ Σ_{t=0}^H R(s_t) | π_θ ]

- Cross-Entropy Method (CEM) (sketched below)
- Covariance Matrix Adaptation (CMA)

- Dynamics model: stochastic: OK; unknown: OK
- Policy class: stochastic: OK

- Downside: gradient-free methods are slower than gradient-based methods
  → in practice OK if θ is low-dimensional and one is willing to do many runs
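For concreteness, here is a minimal cross-entropy method sketch for maximizing U(θ). The rollout interface evaluate_policy(θ) is a hypothetical stand-in for running π_θ and returning an estimate of the expected sum of rewards.

```python
import numpy as np

def cem_optimize(evaluate_policy, dim, iters=50, pop_size=100, elite_frac=0.2):
    """Cross-Entropy Method: iteratively refit a Gaussian over theta to the elite samples.

    evaluate_policy(theta) is assumed to return an estimate of U(theta),
    e.g. the average return over a few rollouts of pi_theta.
    """
    mean = np.zeros(dim)
    std = np.ones(dim)
    n_elite = int(pop_size * elite_frac)
    for _ in range(iters):
        # Sample a population of candidate parameter vectors.
        thetas = mean + std * np.random.randn(pop_size, dim)
        returns = np.array([evaluate_policy(th) for th in thetas])
        # Keep the elite fraction with the highest estimated return.
        elite = thetas[np.argsort(returns)[-n_elite:]]
        # Refit the sampling distribution to the elites.
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mean
```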

SLIDE 12

Gradient-Based Policy Optimization

max_θ U(θ) = max_θ E[ Σ_{t=0}^H R(s_t) | π_θ ]

[Figure: computation graph of a rollout — states s0, s1, s2, ... produced by the dynamics f; actions u0, u1, u2, ... produced by the policy π_θ; rewards r0, r1, r2, ... produced by R.]

SLIDE 13

Overview of Methods / Settings

        Dynamics                      |  Policy
        D+K  D+U  S+K+R  S+K  S+U     |  D    S+R   S
  PD     +          +                 |  +     +
  LR     +    +     +     +    +      |        +    +

D: deterministic; S: stochastic; K: known; U: unknown; R: reparameterizable
PD: path derivative (= perturbation analysis); LR: likelihood ratio (= score function)

SLIDE 14

Questions

- When more than one method is applicable, which one is best?
- When the dynamics is only available as a black box and derivatives aren't available, should we use finite-differences-based derivatives?
  - Vs. applying finite differences / gradient-free methods directly to the policy
  - Note: finite differences are tricky (impractical?) when we can't control the random seed
- What if the model is unknown, but an estimate is available?

SLIDE 15

Gradient Computation – Unknown Model – Finite Differences

SLIDE 16

Noise Can Dominate

SLIDE 17

Finite Differences and Noise

- Solution 1: Average over many samples
- Solution 2: Fix the randomness (if possible)
  - Intuition by example: the wind influence on a helicopter is stochastic, but if we assume the same wind pattern across trials, the different choices of θ become more readily comparable
  - General instantiation: fix the random seed, and the result is a deterministic system
  - Ng & Jordan, 2000 provide a theoretical analysis of the gains from fixing randomness
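As a concrete illustration, here is a minimal finite-difference gradient sketch with a fixed seed. The rollout_return(θ, seed) interface is an assumption for illustration; the slides do not define it.

```python
import numpy as np

def finite_difference_gradient(rollout_return, theta, eps=1e-2, seed=0):
    """Central-difference estimate of the gradient of U(theta) with a fixed random seed.

    rollout_return(theta, seed) is assumed to return the empirical return of one
    rollout of pi_theta; passing the same seed fixes the environment's randomness,
    so perturbed parameter vectors are compared on the same noise realization.
    """
    grad = np.zeros_like(theta)
    for i in range(len(theta)):
        d = np.zeros_like(theta)
        d[i] = eps
        grad[i] = (rollout_return(theta + d, seed) -
                   rollout_return(theta - d, seed)) / (2 * eps)
    return grad
```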

SLIDE 18

Path Derivative for Dynamics: D+K; Policy: D

- Reminder of the optimization objective:

  max_θ U(θ) = max_θ E[ Σ_{t=0}^H R(s_t) | π_θ ]

- Can compute a gradient estimate along the current rollout:

  ∂U/∂θ_i = Σ_{t=0}^H (∂R/∂s)(s_t) · ∂s_t/∂θ_i

  ∂s_t/∂θ_i = (∂f/∂s)(s_{t−1}, u_{t−1}) · ∂s_{t−1}/∂θ_i + (∂f/∂u)(s_{t−1}, u_{t−1}) · ∂u_{t−1}/∂θ_i

  ∂u_t/∂θ_i = (∂π_θ/∂θ_i)(s_t, θ) + (∂π_θ/∂s)(s_t, θ) · ∂s_t/∂θ_i
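A minimal sketch of this recursion for deterministic, known dynamics and a deterministic policy. The Jacobian callables (df_ds, df_du, dR_ds, dpi_dtheta, dpi_ds) are assumed interfaces for illustration; the recursion mirrors the chain rule above.

```python
import numpy as np

def path_derivative_gradient(f, df_ds, df_du, dR_ds, policy, dpi_dtheta, dpi_ds,
                             s0, theta, H):
    """Path-derivative (perturbation-analysis) gradient along one rollout."""
    s = s0
    ds_dtheta = np.zeros((len(s0), len(theta)))   # ds_t/dtheta, starts at zero
    grad = np.zeros(len(theta))
    for t in range(H + 1):
        grad += dR_ds(s) @ ds_dtheta              # accumulate dR/ds(s_t) . ds_t/dtheta
        u = policy(s, theta)
        du_dtheta = dpi_dtheta(s, theta) + dpi_ds(s, theta) @ ds_dtheta
        # ds_{t+1}/dtheta = df/ds . ds_t/dtheta + df/du . du_t/dtheta
        ds_dtheta = df_ds(s, u) @ ds_dtheta + df_du(s, u) @ du_dtheta
        s = f(s, u)
    return grad
```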

SLIDE 19

Path Derivative for Dynamics: S+K+R; Policy: S+R

- Reminder of the optimization objective:

  max_θ U(θ) = max_θ E[ Σ_{t=0}^H R(s_t) | π_θ ]

- (Draw the reparameterized graph on the board.)

- Then average over multiple samples.
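A hedged sketch of what reparameterization means here, assuming Gaussian action and transition noise (a form the slides do not specify): the randomness is pulled out into exogenous noise variables that are held fixed per rollout, after which the deterministic path-derivative recursion above applies.

```latex
u_t = \pi_\theta(s_t) + \sigma_u\, \xi_t, \qquad
s_{t+1} = f(s_t, u_t) + \sigma_s\, \zeta_t, \qquad
\xi_t, \zeta_t \sim \mathcal{N}(0, I)\ \text{fixed per rollout}
```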

SLIDE 20

Overview of Methods / Settings

        Dynamics                      |  Policy
        D+K  D+U  S+K+R  S+K  S+U     |  D    S+R   S
  PD     +          +                 |  +     +
  LR     +    +     +     +    +      |        +    +

D: deterministic; S: stochastic; K: known; U: unknown; R: reparameterizable
PD: path derivative (= perturbation analysis); LR: likelihood ratio (= score function)

SLIDE 21

Gradient Computation – Unknown Model – Likelihood Ratio

SLIDE 22

Likelihood Ratio Gradient

[Note: can also be derived/generalized through an importance sampling derivation – Tang and Abbeel, 2011]
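The derivation itself appears as an image on the slide; for readability, here is the standard score-function (likelihood ratio) identity it refers to, written in the notation used later in the deck:

```latex
\nabla_\theta U(\theta)
  = \nabla_\theta \sum_\tau P(\tau;\theta)\, R(\tau)
  = \sum_\tau P(\tau;\theta)\, \nabla_\theta \log P(\tau;\theta)\, R(\tau)
  \approx \frac{1}{m} \sum_{i=1}^m \nabla_\theta \log P(\tau^{(i)};\theta)\, R(\tau^{(i)})
```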

SLIDE 23

Importance Sampling

- On board.

SLIDE 24

Likelihood Ratio Gradient Estimate

SLIDE 25

Likelihood Ratio Gradient Estimate
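These slides show the estimate as an image; a hedged reconstruction of the key step, consistent with the per-timestep formula on the later slides: the transition terms of log P(τ;θ) do not depend on θ, so only the policy terms survive.

```latex
\nabla_\theta \log P(\tau^{(i)};\theta)
  = \nabla_\theta \Big[ \sum_{t=0}^{H-1} \log P\big(s^{(i)}_{t+1} \mid s^{(i)}_t, u^{(i)}_t\big)
                       + \sum_{t=0}^{H-1} \log \pi_\theta\big(u^{(i)}_t \mid s^{(i)}_t\big) \Big]
  = \sum_{t=0}^{H-1} \nabla_\theta \log \pi_\theta\big(u^{(i)}_t \mid s^{(i)}_t\big)
```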

SLIDE 26

Likelihood Ratio Gradient Estimate

- As formulated thus far: unbiased but very noisy
- Fixes that lead to real-world practicality:
  - Baseline
  - Temporal structure
  - Also: KL-divergence trust region / natural gradient (a general trick, equally applicable to perturbation analysis and finite differences)

SLIDE 27

Likelihood Ratio with Baseline

- Gradient estimate with baseline:

  ĝ = (1/m) Σ_{i=1}^m ∇_θ log P(τ^(i); θ) (R(τ^(i)) − b)

- Crudely: this increases the log-likelihood of paths with higher-than-baseline reward and decreases the log-likelihood of paths with lower-than-baseline reward
- Still unbiased? Yes!

  E[ (1/m) Σ_{i=1}^m ∇_θ log P(τ^(i); θ) b ] = 0
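A minimal sketch of this estimator, assuming the per-trajectory gradients and returns have already been collected, and using the batch-average return as the baseline (one common choice; the slide does not fix b).

```python
import numpy as np

def lr_gradient_with_baseline(grad_logps, returns):
    """Likelihood-ratio gradient estimate: (1/m) sum_i grad log P(tau_i; theta) * (R(tau_i) - b).

    grad_logps: shape (m, dim_theta), per-trajectory grad_theta log P(tau_i; theta)
                (in practice the sum of grad_theta log pi_theta(u_t|s_t) along trajectory i).
    returns:    shape (m,), per-trajectory total rewards R(tau_i).
    """
    b = returns.mean()                        # baseline: average return over the batch
    weights = returns - b                     # R(tau_i) - b
    return (grad_logps * weights[:, None]).mean(axis=0)
```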

SLIDE 28

Likelihood Ratio and Temporal Structure

- Current estimate:

  ĝ = (1/m) Σ_{i=1}^m ∇_θ log P(τ^(i); θ) (R(τ^(i)) − b)
    = (1/m) Σ_{i=1}^m [ Σ_{t=0}^{H−1} ∇_θ log π_θ(u_t^(i) | s_t^(i)) ] · [ Σ_{t=0}^{H−1} R(s_t^(i), u_t^(i)) − b ]

- Future actions do not depend on past rewards, hence we can lower the variance by instead using:

  (1/m) Σ_{i=1}^m Σ_{t=0}^{H−1} ∇_θ log π_θ(u_t^(i) | s_t^(i)) · [ Σ_{k=t}^{H−1} R(s_k^(i), u_k^(i)) − b ]
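A minimal sketch of the lower-variance ("reward-to-go") estimator, assuming per-step gradients and rewards have already been collected and using a constant baseline for simplicity.

```python
import numpy as np

def lr_gradient_temporal(grad_logpis, rewards, b=0.0):
    """Likelihood-ratio gradient with temporal structure (reward-to-go weighting).

    grad_logpis: shape (m, H, dim_theta), grad_theta log pi_theta(u_t|s_t) per trajectory and step.
    rewards:     shape (m, H), R(s_t, u_t) per trajectory and step.
    b:           scalar baseline (a constant here for simplicity).
    """
    # Reward-to-go sum_{k=t}^{H-1} R(s_k, u_k), via a reversed cumulative sum.
    reward_to_go = np.flip(np.cumsum(np.flip(rewards, axis=1), axis=1), axis=1)
    weights = reward_to_go - b                                 # shape (m, H)
    # Weight each grad log pi term by its reward-to-go, then average over trajectories.
    return (grad_logpis * weights[:, :, None]).sum(axis=1).mean(axis=0)
```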

SLIDE 29

Step-sizing and Trust Regions

- Naïve step-sizing: line search
  - Step-sizing is necessary because the gradient is only a first-order approximation
  - Line search in the direction of the gradient
  - Simple, but expensive (evaluations along the line)
  - Naïve: ignores where the first-order approximation is good or poor

SLIDE 30

Step-sizing and Trust Regions

- Advanced step-sizing: trust regions
- The first-order approximation from the gradient is good within a "trust region" → solve for the best point within the trust region:

  max_{δθ} ĝᵀ δθ   s.t.  KL(P(τ; θ) || P(τ; θ + δθ)) ≤ ε

SLIDE 31

KL Trust Region (a.k.a. Natural Gradient)

- Solve for the best point within the trust region:

  max_{δθ} ĝᵀ δθ   s.t.  KL(P(τ; θ) || P(τ; θ + δθ)) ≤ ε

- The KL can be approximated efficiently with a second-order expansion (G: Fisher information matrix)
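The expansion is shown as an image on the slide; a hedged reconstruction of the standard algebra, with G the Fisher information matrix as labeled, and the resulting natural-gradient step:

```latex
KL\big(P(\tau;\theta)\,\|\,P(\tau;\theta+\delta\theta)\big) \approx \tfrac{1}{2}\,\delta\theta^\top G\,\delta\theta,
\qquad G = \mathbb{E}\big[\nabla_\theta \log P(\tau;\theta)\,\nabla_\theta \log P(\tau;\theta)^\top\big]

\max_{\delta\theta}\ \hat g^\top \delta\theta \ \ \text{s.t.}\ \ \tfrac{1}{2}\,\delta\theta^\top G\,\delta\theta \le \varepsilon
\quad\Longrightarrow\quad
\delta\theta = \sqrt{\frac{2\varepsilon}{\hat g^\top G^{-1}\hat g}}\; G^{-1}\hat g
```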

SLIDE 32

Experiments in Locomotion

[Schulman, Levine, Abbeel, 2014]

SLIDE 33

Actor-Critic Variant

- Current estimate:

  (1/m) Σ_{i=1}^m Σ_{t=0}^{H−1} ∇_θ log π_θ(u_t^(i) | s_t^(i)) · [ Σ_{k=t}^{H−1} R(s_k^(i), u_k^(i)) − b ]

  where Σ_{k=t}^{H−1} R(s_k^(i), u_k^(i)) is a sample-based estimate of Q(s_t^(i), u_t^(i))

- Actor-critic algorithms run an estimator for the Q-function in parallel and substitute in the estimated Q value
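A minimal sketch of the substitution, assuming a separately trained critic with a q_estimator(s, u) interface (hypothetical; the slides do not fix how the critic is represented or trained).

```python
import numpy as np

def actor_critic_gradient(grad_logpis, states, actions, q_estimator):
    """Actor-critic variant: replace the sampled reward-to-go with the critic's Q estimate.

    grad_logpis: shape (m, H, dim_theta), grad_theta log pi_theta(u_t|s_t).
    states, actions: shape (m, H, ...), the visited states and actions.
    q_estimator(s, u): returns an estimate of Q(s, u).
    """
    m, H = grad_logpis.shape[:2]
    q_values = np.array([[q_estimator(states[i, t], actions[i, t])
                          for t in range(H)] for i in range(m)])   # shape (m, H)
    # Weight each grad log pi term by the critic's Q estimate, average over trajectories.
    return (grad_logpis * q_values[:, :, None]).sum(axis=1).mean(axis=0)
```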

SLIDE 34

Learning Locomotion

[Schulman, Moritz, Levine, Jordan, Abbeel, 2015]

SLIDE 35

In Contrast: DARPA Robotics Challenge

SLIDE 36

Thank you