CS 287 Lecture 19 (Fall 2019) Off-Policy, Model-Free RL: DQN, SoftQ, DDPG, SAC

SLIDE 1

CS 287 Lecture 19 (Fall 2019) Off-Policy, Model-Free RL: DQN, SoftQ, DDPG, SAC

Pieter Abbeel, UC Berkeley EECS

SLIDE 2

Outline
  • Motivation
  • Q-learning
  • DQN + variants
  • Q-learning with continuous action spaces (SoftQ)
  • Deep Deterministic Policy Gradient (DDPG)
  • Soft Actor Critic (SAC)

SLIDE 3

Story-line
  • TRPO, PPO: the importance-sampling surrogate loss allows more than a single gradient step, but learning is still very local
  • Could we re-use samples more? Could we learn more globally / off-policy?
  • Yes! By leveraging the dynamic programming structure of the problem, breaking it down into 1-step pieces
  • Q-learning, DQN: 1-step (sampled) off-policy Bellman back-ups → more sample re-use → more data-efficient learning directly about the optimal policy
  • Why not always Q-learning/DQN?
    • Often less stable
    • The data doesn't always support learning about the optimal policy (even if, in principle, it can learn fully off-policy)
  • DDPG, SAC: like Q-learning, but do off-policy learning about the current policy and how to locally improve it (vs. directly learning about the optimal policy)

SLIDE 4

Outline
  • Motivation
  • Q-learning
  • DQN + variants
  • Q-learning with continuous action spaces (SoftQ)
  • Deep Deterministic Policy Gradient (DDPG)
  • Soft Actor Critic (SAC)

SLIDE 5

Recap Q-Values

Q^*(s, a) = expected utility starting in s, taking action a, and (thereafter) acting optimally.

Bellman Equation:  Q^*(s, a) = \sum_{s'} P(s'|s, a) \left[ R(s, a, s') + \gamma \max_{a'} Q^*(s', a') \right]

Q-Value Iteration:  Q_{k+1}(s, a) \leftarrow \sum_{s'} P(s'|s, a) \left[ R(s, a, s') + \gamma \max_{a'} Q_k(s', a') \right]
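To make the recap concrete, here is a minimal Python/NumPy sketch of tabular Q-value iteration with a known model; the function name, array layout, and toy dimensions are illustrative assumptions, not from the slides.

```python
import numpy as np

def q_value_iteration(P, R, gamma=0.9, num_iters=100):
    """Tabular Q-value iteration with a known model.

    P: transition probabilities, shape (S, A, S) with P[s, a, s'] = P(s'|s, a)
    R: rewards, shape (S, A, S) with R[s, a, s'] = R(s, a, s')
    """
    S, A, _ = P.shape
    Q = np.zeros((S, A))
    for _ in range(num_iters):
        V = Q.max(axis=1)                          # V_k(s') = max_a' Q_k(s', a')
        # Q_{k+1}(s, a) = sum_{s'} P(s'|s, a) [ R(s, a, s') + gamma * V_k(s') ]
        Q = (P * (R + gamma * V[None, None, :])).sum(axis=2)
    return Q
```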

SLIDE 6

(Tabular) Q-Learning
  • Q-value iteration:  Q_{k+1}(s, a) \leftarrow \sum_{s'} P(s'|s, a) \left[ R(s, a, s') + \gamma \max_{a'} Q_k(s', a') \right]
  • Rewrite as expectation:  Q_{k+1}(s, a) \leftarrow E_{s' \sim P(s'|s, a)} \left[ R(s, a, s') + \gamma \max_{a'} Q_k(s', a') \right]
  • (Tabular) Q-Learning: replace the expectation by samples
    • For a state-action pair (s, a), receive a sample next state:  s' \sim P(s'|s, a)
    • Consider your old estimate:  Q_k(s, a)
    • Consider your new sample estimate:  target(s') = R(s, a, s') + \gamma \max_{a'} Q_k(s', a')
    • Incorporate the new estimate into a running average:  Q_{k+1}(s, a) \leftarrow (1 - \alpha) Q_k(s, a) + \alpha \, target(s')

SLIDE 7

(Tabular) Q-Learning

Algorithm:
  Start with Q_0(s, a) for all s, a.
  Get initial state s
  For k = 1, 2, … till convergence:
    Sample action a, get next state s'
    If s' is terminal:
      target = R(s, a, s')
      Sample new initial state s'
    else:
      target = R(s, a, s') + \gamma \max_{a'} Q_k(s', a')
    Q_{k+1}(s, a) \leftarrow (1 - \alpha) Q_k(s, a) + \alpha \, target
    s \leftarrow s'
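Below is a minimal Python sketch of this loop, assuming a Gymnasium-style environment with discrete observation and action spaces; the ε-greedy behavior policy anticipates the next slide, and all names and hyperparameter values are illustrative.

```python
import numpy as np

def tabular_q_learning(env, num_steps=100_000, gamma=0.99, alpha=0.1, epsilon=0.1):
    """Tabular Q-learning; assumes a Gymnasium-style env with discrete states/actions."""
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    s, _ = env.reset()
    for _ in range(num_steps):
        # epsilon-greedy behavior policy (see next slide)
        if np.random.rand() < epsilon:
            a = env.action_space.sample()
        else:
            a = int(np.argmax(Q[s]))
        s_next, r, terminated, truncated, _ = env.step(a)
        if terminated:
            target = r                              # no bootstrap from a terminal state
        else:
            target = r + gamma * np.max(Q[s_next])  # 1-step sampled Bellman backup
        Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target
        s = s_next if not (terminated or truncated) else env.reset()[0]
    return Q
```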

SLIDE 8

How to sample actions?
  • Choose random actions?
  • Choose the action that maximizes Q_k(s, a) (i.e., act greedily)?
  • ε-Greedy: choose a random action with probability ε, otherwise choose the action greedily

SLIDE 9

Q-Learning Properties
  • Amazing result: Q-learning converges to the optimal policy -- even if you're acting suboptimally!
  • This is called off-policy learning
  • Caveats:
    • You have to explore enough
    • You have to eventually make the learning rate small enough
    • … but not decrease it too quickly

SLIDE 10

Q-Learning Properties
  • Technical requirements:
    • All states and actions are visited infinitely often
      • Basically, in the limit, it doesn't matter how you select actions (!)
    • Learning-rate schedule such that for all state-action pairs (s, a):
      \sum_{t=0}^{\infty} \alpha_t(s, a) = \infty  \quad\text{and}\quad  \sum_{t=0}^{\infty} \alpha_t^2(s, a) < \infty

For details, see Tommi Jaakkola, Michael I. Jordan, and Satinder P. Singh. On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation, 6(6), November 1994.
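As a concrete example (not from the slides) of a schedule satisfying both conditions, one standard choice lets the learning rate for (s, a) decay with the number of visits to that pair:

```latex
% alpha is nonzero only on steps where (s,a) is updated; on the k-th visit it equals 1/k.
% The harmonic series diverges while the sum of 1/k^2 converges, so both conditions hold.
\alpha_t(s, a) = \frac{1}{N_t(s, a)}
\quad\Longrightarrow\quad
\sum_{t} \alpha_t(s, a) = \sum_{k \ge 1} \frac{1}{k} = \infty,
\qquad
\sum_{t} \alpha_t^2(s, a) = \sum_{k \ge 1} \frac{1}{k^2} = \frac{\pi^2}{6} < \infty
```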

SLIDE 11

Q-Learning Demo: Crawler
  • States: discretized values of the 2D state (arm angle, hand angle)
  • Actions: Cartesian product of {arm up, arm down} and {hand up, hand down}
  • Reward: speed in the forward direction
SLIDE 12

Video of Demo Crawler Bot

SLIDE 13

Video of Demo Q-Learning -- Crawler

SLIDE 14

Outline
  • Motivation
  • Q-learning
  • DQN + variants
  • Q-learning with continuous action spaces (SoftQ)
  • Deep Deterministic Policy Gradient (DDPG)
  • Soft Actor Critic (SAC)

SLIDE 15

Can tabular methods scale?
  • Discrete environments:
    • Gridworld: 10^1 states
    • Tetris: 10^60 states
    • Atari: 10^308 states (RAM), 10^16992 states (pixels)

SLIDE 16

Can tabular methods scale?
  • Continuous environments (by crude discretization):
    • Crawler: 10^2 states
    • Hopper: 10^10 states
    • Humanoid: 10^100 states

SLIDE 17

Generalizing Across States
  • Basic Q-Learning keeps a table of all Q-values
  • In realistic situations, we cannot possibly learn about every single state!
    • Too many states to visit them all in training
    • Too many states to hold the Q-table in memory
  • Instead, we want to generalize:
    • Learn about some small number of training states from experience
    • Generalize that experience to new, similar situations
    • This is a fundamental idea in machine learning

SLIDE 18

Approximate Q-Learning
  • Instead of a table, we have a parametrized Q-function: Q_\theta(s, a)
    • Can be a linear function in features:  Q_\theta(s, a) = \theta_0 f_0(s, a) + \theta_1 f_1(s, a) + \cdots + \theta_n f_n(s, a)
    • Or a neural net, decision tree, etc.
  • Learning rule:
    • Remember:  target(s') = R(s, a, s') + \gamma \max_{a'} Q_{\theta_k}(s', a')
    • Update:  \theta_{k+1} \leftarrow \theta_k - \alpha \nabla_\theta \left[ \tfrac{1}{2} \left( Q_\theta(s, a) - target(s') \right)^2 \right] \Big|_{\theta = \theta_k}
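To make the update concrete, here is a minimal Python sketch of one approximate Q-learning step with a linear-in-features Q-function; the function names (`features`, `approx_q_update`) and hyperparameter values are illustrative assumptions.

```python
import numpy as np

def approx_q_update(theta, features, s, a, r, s_next, actions, gamma=0.99, alpha=0.01):
    """One approximate Q-learning step for a linear Q-function.

    Q_theta(s, a) = theta . features(s, a); `features` returns a NumPy vector.
    """
    q_sa = theta @ features(s, a)
    # target(s') = R(s, a, s') + gamma * max_a' Q_theta(s', a')
    target = r + gamma * max(theta @ features(s_next, a_next) for a_next in actions)
    # Gradient of 1/2 (Q_theta(s, a) - target)^2 w.r.t. theta, holding the target fixed
    grad = (q_sa - target) * features(s, a)
    return theta - alpha * grad
```
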
SLIDE 19

Recall Approximate Q-Learning
  • Instead of a table, we have a parametrized Q-function Q_\theta(s, a)
    • E.g. a neural net
  • Learning rule:
    • Compute target:  target(s') = R(s, a, s') + \gamma \max_{a'} Q_{\theta_k}(s', a')
    • Update Q-network:  \theta_{k+1} \leftarrow \theta_k - \alpha \nabla_\theta \left[ \tfrac{1}{2} \left( Q_\theta(s, a) - target(s') \right)^2 \right] \Big|_{\theta = \theta_k}
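The same update with a neural-network Q-function can be sketched in PyTorch as below; the architecture, dimensions, and batch handling are illustrative assumptions. This shows the bare gradient step only; full DQN additionally uses a replay buffer and a target network, which are not shown here.

```python
import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))  # illustrative sizes
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma = 0.99

def q_update(s, a, r, s_next, done):
    """One gradient step on 1/2 (Q_theta(s,a) - target)^2 for a minibatch.

    s, s_next: float tensors (B, obs_dim); a: long tensor (B,); r, done: float tensors (B,).
    """
    with torch.no_grad():                      # the target is held fixed (no gradient through it)
        target = r + gamma * (1 - done) * q_net(s_next).max(dim=1).values
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = 0.5 * (q_sa - target).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```
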
SLIDES 20-26 (figure-only; no text content)
SLIDE 27

See also
  • “Rainbow: Combining Improvements in Deep Reinforcement Learning,” Matteo Hessel et al., 2017
    • Double DQN (DDQN)
    • Prioritized Replay DDQN
    • Dueling DQN
    • Distributional DQN
    • Noisy DQN

SLIDE 28

Outline
  • Motivation
  • Q-learning
  • DQN + variants
  • Q-learning with continuous action spaces (SoftQ)
  • Deep Deterministic Policy Gradient (DDPG)
  • Soft Actor Critic (SAC)

SLIDE 29

Soft Q-Learning

→ Supervised learning → Use a sample estimate → Stein variational gradient descent

SLIDE 30

Stein Variational Gradient Descent: Intuition
  • Implicit density model
  • (diagram: Q-function → policy sampling network)
  • D. Wang et al., Learning to draw samples: With application to amortized MLE for generative adversarial learning, 2016.

SLIDE 31

Training time: 0 min, 12 min, 30 min, 2 hours
sites.google.com/view/composing-real-world-policies/

SLIDE 32

After 2 hours of training

sites.google.com/view/composing-real-world-policies/

SLIDE 33

Outline
  • Motivation
  • Q-learning
  • DQN + variants
  • Q-learning with continuous action spaces (SoftQ)
  • Deep Deterministic Policy Gradient (DDPG)
  • Soft Actor Critic (SAC)

SLIDE 34

Deep Deterministic Policy Gradient (DDPG): Basic (= SVG(0))
  • for iter = 1, 2, …
    • Roll-outs: execute roll-outs under the current policy (+ some noise for exploration)
    • Q-function update:  g \propto \nabla_\phi \sum_t \left( Q_\phi(s_t, u_t) - \hat{Q}(s_t, u_t) \right)^2  with  \hat{Q}(s_t, u_t) = r_t + \gamma Q_\phi(s_{t+1}, u_{t+1})
    • Policy update: backprop through Q to compute gradient estimates for all t:  g \propto \sum_t \nabla_\theta Q_\phi(s_t, \pi_\theta(s_t, v_t))
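A minimal PyTorch sketch of these two updates on a batch of transitions follows; the network sizes and optimizer settings are illustrative, a deterministic actor π_θ(s) replaces the slide's noise-conditioned π_θ(s_t, v_t), and the replay buffer and lagged target networks of the "complete" version are omitted.

```python
import torch
import torch.nn as nn

obs_dim, act_dim = 8, 2                              # illustrative dimensions
q_net = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))
pi_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim), nn.Tanh())
q_opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)
pi_opt = torch.optim.Adam(pi_net.parameters(), lr=1e-4)
gamma = 0.99

def ddpg_basic_update(s, u, r, s_next, u_next):
    """One critic and one actor gradient step on a batch (s, u, r, s_next, u_next)."""
    # Critic: regress Q_phi(s_t, u_t) toward Q_hat = r_t + gamma * Q_phi(s_{t+1}, u_{t+1})
    with torch.no_grad():
        q_hat = r + gamma * q_net(torch.cat([s_next, u_next], dim=1)).squeeze(1)
    q_sa = q_net(torch.cat([s, u], dim=1)).squeeze(1)
    q_loss = (q_sa - q_hat).pow(2).mean()
    q_opt.zero_grad(); q_loss.backward(); q_opt.step()

    # Actor: backprop through Q, ascend on Q_phi(s_t, pi_theta(s_t))
    pi_loss = -q_net(torch.cat([s, pi_net(s)], dim=1)).mean()
    pi_opt.zero_grad(); pi_loss.backward(); pi_opt.step()
```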

SLIDE 35

SVG(k)
  • Applied to 2-D robotics tasks
  • Different gradient estimators behave similarly

SLIDE 36

SVG(k)

SLIDE 37

Deep Deterministic Policy Gradient (DDPG): Complete
  • Add noise to π_θ for exploration
  • Incorporate a replay buffer for off-policy learning
  • For increased stability, use lagged (Polyak-averaged) versions of π_θ and Q_φ for the target values:
    \hat{Q}_t = r_t + \gamma Q_{\phi'}(s_{t+1}, \pi_{\theta'}(s_{t+1}))
    → off-policy!
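A minimal sketch of the lagged (Polyak-averaged) target-network update referred to above; the coefficient name `tau` and its value are illustrative.

```python
import torch

def polyak_update(net, target_net, tau=0.005):
    """Lagged (Polyak-averaged) target network: theta' <- tau * theta + (1 - tau) * theta'."""
    with torch.no_grad():
        for p, p_targ in zip(net.parameters(), target_net.parameters()):
            p_targ.mul_(1.0 - tau).add_(tau * p)
```
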
SLIDE 38

DDPG
  • Applied to 2D and 3D robotics tasks and driving with pixel input

SLIDE 39

DDPG

SLIDE 40

DDPG
  + very sample efficient thanks to off-policy updates
  − often unstable
  → Soft Actor Critic (SAC), which adds the entropy of the policy to the objective, ensuring better exploration and less overfitting of the policy to any quirks in the Q-function

SLIDE 41

Outline
  • Motivation
  • Q-learning
  • DQN + variants
  • Q-learning with continuous action spaces (SoftQ)
  • Deep Deterministic Policy Gradient (DDPG)
  • Soft Actor Critic (SAC)

SLIDE 42

Soft Actor-Critic

Soft Policy Iteration:
  1. Soft policy evaluation: fix the policy, apply the soft Bellman backup until it converges
  2. Soft policy improvement: update the policy through information projection. This converges, and the new policy's soft Q-values are guaranteed to be at least as high as the old policy's.
  3. Repeat until convergence

Practical algorithm (Soft Actor-Critic):
  1. Take one stochastic gradient step to minimize the soft Bellman residual
  2. Take one stochastic gradient step to minimize the KL divergence
  3. Execute one action in the environment and repeat

Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. ICML, 2018.
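The three practical steps map onto two gradient updates plus environment interaction. Below is a minimal PyTorch sketch of the two gradient steps under simplifying assumptions (a single Q-function, no target network, a plain Gaussian policy without tanh squashing, fixed temperature α); every name and hyperparameter here is illustrative, not the paper's exact setup.

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

obs_dim, act_dim, alpha, gamma = 8, 2, 0.2, 0.99      # illustrative values
q_net = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, 2 * act_dim))
q_opt = torch.optim.Adam(q_net.parameters(), lr=3e-4)
pi_opt = torch.optim.Adam(policy.parameters(), lr=3e-4)

def sample_action(s):
    """Reparameterized Gaussian policy (tanh squashing and its log-prob correction omitted)."""
    mean, log_std = policy(s).chunk(2, dim=1)
    dist = Normal(mean, log_std.clamp(-5, 2).exp())
    a = dist.rsample()
    return a, dist.log_prob(a).sum(dim=1)

def sac_update(s, a, r, s_next):
    # Step 1: one SGD step on the soft Bellman residual (critic)
    with torch.no_grad():
        a_next, logp_next = sample_action(s_next)
        q_next = q_net(torch.cat([s_next, a_next], dim=1)).squeeze(1)
        target = r + gamma * (q_next - alpha * logp_next)   # soft value of the next state
    q_loss = (q_net(torch.cat([s, a], dim=1)).squeeze(1) - target).pow(2).mean()
    q_opt.zero_grad(); q_loss.backward(); q_opt.step()

    # Step 2: one SGD step on E[ alpha * log pi(a|s) - Q(s, a) ] (the KL projection, actor)
    a_new, logp = sample_action(s)
    pi_loss = (alpha * logp - q_net(torch.cat([s, a_new], dim=1)).squeeze(1)).mean()
    pi_opt.zero_grad(); pi_loss.backward(); pi_opt.step()
```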

SLIDE 43

Soft Actor Critic
  • Objective: maximize expected return plus the entropy of the policy (maximum-entropy RL; see below)
  • Iterate:
    • Perform roll-outs from π, add data to the replay buffer
    • Learn V, Q, π

[see also: https://towardsdatascience.com/soft-actor-critic-demystified-b8427df61665]
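For reference, the maximum-entropy objective that Soft Actor-Critic optimizes (as in Haarnoja et al., 2018), with a temperature α weighting the entropy bonus:

```latex
% Maximum-entropy RL objective: expected reward plus an entropy bonus at every visited state.
J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}
         \Big[ r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \Big]
```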

SLIDE 44

Algorithms compared: Soft Actor-Critic (SAC), Deep Deterministic Policy Gradient (DDPG), Proximal Policy Optimization (PPO), Soft Q-Learning (SQL)
sites.google.com/view/soft-actor-critic

SLIDE 45

sites.google.com/view/soft-actor-critic

SLIDE 46

SLIDE 47

Real Robot Results

SLIDE 48

Real Robot Results

SLIDE 49

Real Robot Results