Reinforcement Learning: Policy Optimization
Pieter Abbeel, UC Berkeley EECS

Policy Optimization

- Consider a control policy parameterized by a parameter vector θ:

  \max_\theta \; \mathbb{E}\Big[\sum_{t=0}^{H} R(s_t) \,\Big|\, \pi_\theta\Big]

- Often a stochastic policy class (smooths out the problem):

  \pi_\theta(u \mid s): probability of taking action u in state s
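To make the objective concrete, a minimal Monte Carlo sketch of estimating U(θ) by roll-outs. The environment interface (`reset`, `reward`, `step`) and the linear-Gaussian policy class are illustrative assumptions, not something given on the slides:

```python
import numpy as np

STATE_DIM, ACTION_DIM = 3, 2   # illustrative dimensions

def gaussian_policy(theta, s, rng):
    """Stochastic policy pi_theta(u|s): Gaussian with mean linear in the state."""
    K = theta.reshape(ACTION_DIM, STATE_DIM)        # theta packs a feedback gain matrix
    return K @ s + rng.standard_normal(ACTION_DIM)  # unit-variance exploration noise

def estimate_U(theta, env, H, num_rollouts=20, seed=0):
    """Monte Carlo estimate of U(theta) = E[ sum_{t=0}^H R(s_t) | pi_theta ]."""
    rng = np.random.default_rng(seed)
    returns = []
    for _ in range(num_rollouts):
        s = env.reset()                    # assumed env interface: reset/reward/step
        total = 0.0
        for t in range(H + 1):
            total += env.reward(s)         # accumulate R(s_t)
            if t < H:
                s = env.step(gaussian_policy(theta, s, rng))
        returns.append(total)
    return float(np.mean(returns))
```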
Learning to Trot/Run
Before learning (hand-tuned) / After learning. [Policy search was done through trials on the actual robot.]
Kohl and Stone, ICRA2004
12 parameters define the Aibo's gait:

- The front locus (3 parameters: height, x-pos., y-pos.)
- The rear locus (3 parameters)
- Locus length
- Locus skew multiplier in the x-y plane (for turning)
- The height of the front of the body
- The height of the rear of the body
- The time each foot takes to move through its locus
- The fraction of time each foot spends on the ground
Learning to Trot/Run
Kohl and Stone, ICRA2004
[Ng et al., ISER 2004] [Policy search was done in simulation.]
Learning to Hover
[Kober and Peters, 2009]
Ball-In-A-Cup
Learning to Walk in 20 Minutes
[Tedrake, Zhang, Seung 2005]
Learning to Walk in 20 Minutes
[Tedrake, Zhang, Seung 2005]
Robot overview (44 cm tall):
- Passive hip joint [1 DOF]
- 2 x 2 (roll, pitch) position-controlled servo motors [4 DOF]
- Arms: freely swinging load [1 DOF], coupled to the opposite leg to reduce yaw moment
- Natural gait down a 0.03-radian ramp: 0.8 Hz, 6.5 cm steps
- 9 DOFs: 6 internal DOFs + 3 DOFs for the robot's orientation (always assumed in contact with the ground at a single point; absolute (x, y) ignored)
Learning to Walk in 20 Minutes
[Tedrake, Zhang, Seung 2005]
Gradient-Free Methods

- Cross-Entropy Method (CEM) (a minimal sketch follows the objective below)
- Covariance Matrix Adaptation (CMA)
- Dynamics model: stochastic: OK; unknown: OK
- Policy class: stochastic: OK
- Downside: gradient-free methods are slower than gradient-based methods
  → in practice OK if θ is low-dimensional and one is willing to do many runs
- Optimization objective:

  \max_\theta U(\theta) = \max_\theta \mathbb{E}\Big[\sum_{t=0}^{H} R(s_t) \,\Big|\, \pi_\theta\Big]
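As referenced above, a minimal Cross-Entropy Method sketch. It treats `estimate_U` as any black-box (possibly noisy) evaluation of U(θ); population size and elite fraction are illustrative choices:

```python
import numpy as np

def cem(estimate_U, theta_dim, iterations=50, pop_size=100, elite_frac=0.2, seed=0):
    """Cross-Entropy Method: repeatedly refit a Gaussian over theta to the top samples."""
    rng = np.random.default_rng(seed)
    mean, std = np.zeros(theta_dim), np.ones(theta_dim)
    n_elite = int(pop_size * elite_frac)
    for _ in range(iterations):
        thetas = mean + std * rng.standard_normal((pop_size, theta_dim))
        scores = np.array([estimate_U(th) for th in thetas])   # noisy black-box evaluations
        elites = thetas[np.argsort(scores)[-n_elite:]]         # keep the highest-return samples
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mean
```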
Gradient-Based Policy Optimization

  \max_\theta U(\theta) = \max_\theta \mathbb{E}\Big[\sum_{t=0}^{H} R(s_t) \,\Big|\, \pi_\theta\Big]

[Figure: roll-out computation graph with states s_0, s_1, s_2 (produced by dynamics f), actions u_0, u_1, u_2 (produced by policy π_θ), and rewards r_0, r_1, r_2 (produced by reward function R)]
Overview of Methods / Settings

         Dynamics                              Policy
         D+K   D+U   S+K+R   S+K   S+U         D     S+R   S
  PD      +            +                       +      +
  LR      +     +      +      +     +                 +     +

D: deterministic; S: stochastic; K: known; U: unknown; R: reparameterizable
PD: path derivative (= perturbation analysis); LR: likelihood ratio (= score function)
Questions

- When more than one method is applicable, which one is best?
- When the dynamics is only available as a black box and derivatives are not available: finite-difference-based derivatives of the model?
  - Vs. directly applying finite differences / gradient-free methods on the policy
  - Note: finite differences are tricky (impractical?) when one cannot control the random seed
- What if the model is unknown, but an estimate is available?
Gradient Computation – Unknown Model – Finite Differences
Noise Can Dominate
- Solution 1: Average over many samples
- Solution 2: Fix the randomness (if possible)
  - Intuition by example: wind influence on a helicopter is stochastic, but if we assume the same wind pattern across trials, this makes the different choices of θ more readily comparable
  - General instantiation: fix the random seed, and the result is a deterministic system
  - Ng & Jordan, 2000 provide a theoretical analysis of the gains from fixing the randomness
Finite Differences and Noise
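A minimal sketch of finite-difference gradient estimation with fixed randomness. It assumes a hypothetical `rollout_return(theta, seed)` whose only source of randomness is the seed, so each pair of perturbed evaluations sees the same noise realization (common random numbers):

```python
import numpy as np

def fd_gradient(rollout_return, theta, eps=1e-2, num_seeds=10):
    """Central finite-difference gradient of U(theta) using common random seeds."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        e_i = np.zeros_like(theta)
        e_i[i] = eps
        diffs = [
            # Same seed for both evaluations: both roll-outs see the same noise.
            rollout_return(theta + e_i, seed) - rollout_return(theta - e_i, seed)
            for seed in range(num_seeds)
        ]
        grad[i] = np.mean(diffs) / (2.0 * eps)
    return grad
```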
Path Derivative for Dynamics: D+K; Policy: D

- Reminder of optimization objective:

  \max_\theta U(\theta) = \max_\theta \mathbb{E}\Big[\sum_{t=0}^{H} R(s_t) \,\Big|\, \pi_\theta\Big]

- Can compute a gradient estimate along the current roll-out:

  \frac{\partial U}{\partial \theta_i} = \sum_{t=0}^{H} \frac{\partial R}{\partial s}(s_t)\, \frac{\partial s_t}{\partial \theta_i}

  \frac{\partial s_t}{\partial \theta_i} = \frac{\partial f}{\partial s}(s_{t-1}, u_{t-1})\, \frac{\partial s_{t-1}}{\partial \theta_i} + \frac{\partial f}{\partial u}(s_{t-1}, u_{t-1})\, \frac{\partial u_{t-1}}{\partial \theta_i}

  \frac{\partial u_t}{\partial \theta_i} = \frac{\partial \pi_\theta}{\partial \theta_i}(s_t, \theta) + \frac{\partial \pi_\theta}{\partial s}(s_t, \theta)\, \frac{\partial s_t}{\partial \theta_i}
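A minimal sketch of this forward-accumulation recursion. All of the callables (`f`, `f_s`, `f_u`, `dR_ds`, `pi`, `dpi_dtheta`, `dpi_ds`) are assumed user-supplied model, reward, and policy functions together with their Jacobians, since the slide assumes a known, deterministic, differentiable system and policy:

```python
import numpy as np

def path_derivative(theta, s0, H, f, f_s, f_u, dR_ds, pi, dpi_dtheta, dpi_ds):
    """Forward accumulation of dU/dtheta along one roll-out (deterministic f and pi)."""
    s = s0
    ds_dth = np.zeros((s0.size, theta.size))   # ds_t/dtheta; s_0 does not depend on theta
    grad = dR_ds(s) @ ds_dth                   # t = 0 term (zero here, kept for clarity)
    for _ in range(H):
        u = pi(theta, s)
        du_dth = dpi_dtheta(theta, s) + dpi_ds(theta, s) @ ds_dth   # du_t/dtheta
        s_next = f(s, u)
        ds_dth = f_s(s, u) @ ds_dth + f_u(s, u) @ du_dth            # ds_{t+1}/dtheta
        s = s_next
        grad = grad + dR_ds(s) @ ds_dth                             # add dR/ds * ds_t/dtheta
    return grad
```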
Path Derivative for Dynamics: S+K+R; Policy: S+R

- Reminder of optimization objective:

  \max_\theta U(\theta) = \max_\theta \mathbb{E}\Big[\sum_{t=0}^{H} R(s_t) \,\Big|\, \pi_\theta\Big]

- (draw reparameterized graph on board)
- + average over multiple samples
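- Since the reparameterized graph is drawn on the board, a sketch of the construction it refers to: write the stochastic dynamics and stochastic policy as deterministic functions of θ and of noise variables whose distributions do not depend on θ, so the gradient can be pushed inside the expectation:

  s_{t+1} = f(s_t, u_t, \zeta_t), \qquad u_t = \pi_\theta(s_t, \xi_t), \qquad \zeta_t, \xi_t \sim \text{fixed noise distributions}

  U(\theta) = \mathbb{E}_{\zeta, \xi}\Big[\sum_{t=0}^{H} R\big(s_t(\theta; \zeta, \xi)\big)\Big]
  \;\Rightarrow\;
  \nabla_\theta U(\theta) = \mathbb{E}_{\zeta, \xi}\Big[\sum_{t=0}^{H} \frac{\partial R}{\partial s}(s_t)\, \frac{\partial s_t}{\partial \theta}\Big]

  estimated by averaging the path derivative over multiple sampled noise realizations.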
Overview of Methods / Settings

         Dynamics                              Policy
         D+K   D+U   S+K+R   S+K   S+U         D     S+R   S
  PD      +            +                       +      +
  LR      +     +      +      +     +                 +     +

D: deterministic; S: stochastic; K: known; U: unknown; R: reparameterizable
PD: path derivative (= perturbation analysis); LR: likelihood ratio (= score function)
Gradient Computation – Unknown Model – Likelihood Ratio
Likelihood Ra>o Gradient
[Note: Can also be derived/generalized through an importance sampling derivation – Tang and Abbeel, 2011]
- On board.
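- For reference (since the derivation is done on the board), the standard likelihood-ratio / score-function identity it arrives at:

  \nabla_\theta U(\theta)
  = \nabla_\theta \sum_\tau P(\tau; \theta)\, R(\tau)
  = \sum_\tau P(\tau; \theta)\, \nabla_\theta \log P(\tau; \theta)\, R(\tau)
  = \mathbb{E}\big[\nabla_\theta \log P(\tau; \theta)\, R(\tau)\big]

  with \nabla_\theta \log P(\tau; \theta) = \sum_{t=0}^{H-1} \nabla_\theta \log \pi_\theta(u_t \mid s_t), since the dynamics terms do not depend on θ.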
Importance Sampling
Likelihood Ra>o Gradient Es>mate
Likelihood Ra>o Gradient Es>mate
n As formulated thus far: unbiased but very noisy n Fixes that lead to real-world prac>cality
n Baseline n Temporal structure n Also: KL-divergence trust region / natural gradient (= general trick,
equally applicable to perturba>on analysis and finite differences)
Likelihood Ra>o Gradient Es>mate
Likelihood Ratio with Baseline

- Gradient estimate with baseline:

  \hat{g} = \frac{1}{m} \sum_{i=1}^{m} \nabla_\theta \log P(\tau^{(i)}; \theta)\, \big(R(\tau^{(i)}) - b\big)

- Crudely: increasing the log-likelihood of paths with higher-than-baseline reward, and decreasing the log-likelihood of paths with lower-than-baseline reward
- Still unbiased? Yes!

  \mathbb{E}\Big[\frac{1}{m} \sum_{i=1}^{m} \nabla_\theta \log P(\tau^{(i)}; \theta)\, b\Big] = 0
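- Why the baseline term has zero expectation (filling in the step asserted above; this assumes b does not depend on τ):

  \mathbb{E}\big[\nabla_\theta \log P(\tau; \theta)\, b\big]
  = b \sum_\tau P(\tau; \theta)\, \nabla_\theta \log P(\tau; \theta)
  = b \sum_\tau \nabla_\theta P(\tau; \theta)
  = b\, \nabla_\theta \sum_\tau P(\tau; \theta)
  = b\, \nabla_\theta 1 = 0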
Likelihood Ratio and Temporal Structure

- Current estimate:

  \hat{g} = \frac{1}{m} \sum_{i=1}^{m} \nabla_\theta \log P(\tau^{(i)}; \theta)\, \big(R(\tau^{(i)}) - b\big)
          = \frac{1}{m} \sum_{i=1}^{m} \Big(\sum_{t=0}^{H-1} \nabla_\theta \log \pi_\theta\big(u_t^{(i)} \mid s_t^{(i)}\big)\Big) \Big(\sum_{t=0}^{H-1} R\big(s_t^{(i)}, u_t^{(i)}\big) - b\Big)

- Future actions do not depend on past rewards, hence we can lower the variance by instead using:

  \hat{g} = \frac{1}{m} \sum_{i=1}^{m} \sum_{t=0}^{H-1} \nabla_\theta \log \pi_\theta\big(u_t^{(i)} \mid s_t^{(i)}\big) \Big(\sum_{k=t}^{H-1} R\big(s_k^{(i)}, u_k^{(i)}\big) - b\Big)
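A minimal sketch of the resulting reward-to-go estimator. The function `grad_log_pi(theta, s, u)`, returning ∇_θ log π_θ(u|s) as an array shaped like θ, and the trajectory format (lists of (s, u, r) tuples) are illustrative assumptions:

```python
import numpy as np

def lr_gradient_reward_to_go(theta, trajectories, grad_log_pi, baseline=0.0):
    """Likelihood-ratio gradient estimate with temporal structure and a constant baseline."""
    grad = np.zeros_like(theta)
    for traj in trajectories:                    # traj: list of (s_t, u_t, r_t), t = 0..H-1
        rewards = np.array([r for (_, _, r) in traj])
        for t, (s, u, _) in enumerate(traj):
            reward_to_go = rewards[t:].sum()     # sum_{k=t}^{H-1} R(s_k, u_k)
            grad += grad_log_pi(theta, s, u) * (reward_to_go - baseline)
    return grad / len(trajectories)
```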
Step-sizing and Trust Regions

- Naïve step-sizing: line search
  - Step-sizing is necessary because the gradient is only a first-order approximation
  - Line search in the direction of the gradient
  - Simple, but expensive (evaluations along the line)
  - Naïve: ignores where the first-order approximation is good/poor

- Advanced step-sizing: trust regions
  - The first-order approximation from the gradient is a good approximation within a "trust region" → solve for the best point within the trust region:
  \max_{\delta\theta} \; \hat{g}^\top \delta\theta \quad \text{s.t.} \quad KL\big(P(\tau; \theta)\,\|\,P(\tau; \theta + \delta\theta)\big) \le \varepsilon
KL Trust Region (a.k.a. Natural Gradient)

- Solve for the best point within the trust region:

  \max_{\delta\theta} \; \hat{g}^\top \delta\theta \quad \text{s.t.} \quad KL\big(P(\tau; \theta)\,\|\,P(\tau; \theta + \delta\theta)\big) \le \varepsilon

- The KL can be approximated efficiently with a 2nd-order expansion:

  KL\big(P(\tau; \theta)\,\|\,P(\tau; \theta + \delta\theta)\big) \approx \tfrac{1}{2}\, \delta\theta^\top G\, \delta\theta, \qquad G: \text{Fisher Information Matrix}
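A minimal sketch of the resulting step under the second-order KL approximation above. How `g_hat` and `G` are estimated is left out, and the damping term is an illustrative numerical safeguard, not part of the slides:

```python
import numpy as np

def natural_gradient_step(g_hat, G, epsilon, damping=1e-8):
    """Maximize g^T dtheta subject to (1/2) dtheta^T G dtheta <= epsilon."""
    G = G + damping * np.eye(G.shape[0])     # keep the (approximate) Fisher matrix invertible
    direction = np.linalg.solve(G, g_hat)    # natural gradient direction G^{-1} g
    scale = np.sqrt(2.0 * epsilon / (g_hat @ direction))   # step to the KL boundary
    return scale * direction
```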
[Schulman, Levine, Abbeel, 2014]
Experiments in Locomotion
Actor-Critic Variant

- Current estimate:

  \hat{g} = \frac{1}{m} \sum_{i=1}^{m} \sum_{t=0}^{H-1} \nabla_\theta \log \pi_\theta\big(u_t^{(i)} \mid s_t^{(i)}\big) \Big(\sum_{k=t}^{H-1} R\big(s_k^{(i)}, u_k^{(i)}\big) - b\Big)

  where \sum_{k=t}^{H-1} R\big(s_k^{(i)}, u_k^{(i)}\big) is a sample-based estimate of Q\big(s_t^{(i)}, u_t^{(i)}\big)

- Actor-critic algorithms run, in parallel, an estimator for the Q-function, and substitute in the estimated Q value
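A minimal sketch of the substitution, mirroring the reward-to-go estimator above but with an assumed, separately trained critic `Q_hat(s, u)` in place of the sampled return:

```python
import numpy as np

def actor_critic_gradient(theta, trajectories, grad_log_pi, Q_hat, baseline=0.0):
    """Reward-to-go replaced by an estimated Q value from a separately trained critic."""
    grad = np.zeros_like(theta)
    for traj in trajectories:                 # traj: list of (s_t, u_t, r_t)
        for (s, u, _) in traj:
            grad += grad_log_pi(theta, s, u) * (Q_hat(s, u) - baseline)
    return grad / len(trajectories)
```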
Learning Locomotion
[Schulman, Moritz, Levine, Jordan, Abbeel, 2015]