CS 287 Lecture 19 (Fall 2019) Off-Policy, Model-Free RL: DQN, SoftQ, DDPG, SAC

SLIDE 1

CS 287 Lecture 19 (Fall 2019) Off-Policy, Model-Free RL: DQN, SoftQ, DDPG, SAC

Pieter Abbeel, UC Berkeley EECS

SLIDE 2

Outline
  • Motivation
  • Q-learning
  • DQN + variants
  • Q-learning with continuous action spaces (SoftQ)
  • Deep Deterministic Policy Gradient (DDPG)
  • Soft Actor Critic (SAC)

SLIDE 3

Story-line
  • TRPO, PPO: the importance-sampling surrogate loss allows more than a single gradient step, but learning is still very local
  • Could we re-use samples more? Could we learn more globally / off-policy?
  • Yes! By leveraging the dynamic programming structure of the problem, breaking it down into 1-step pieces
  • Q-learning, DQN: 1-step (sampled) off-policy Bellman back-ups → more sample re-use → more data-efficient learning directly about the optimal policy
  • Why not always Q-learning/DQN?
    • Often less stable
    • The data doesn't always support learning about the optimal policy (even if, in principle, it can learn fully off-policy)
  • DDPG, SAC: like Q-learning, but do off-policy learning about the current policy and how to locally improve it (vs. directly learning about the optimal policy)

SLIDE 4

Outline
  • Motivation
  • Q-learning
  • DQN + variants
  • Q-learning with continuous action spaces (SoftQ)
  • Deep Deterministic Policy Gradient (DDPG)
  • Soft Actor Critic (SAC)

SLIDE 5

Recap Q-Values

Q^*(s, a) = expected utility starting in s, taking action a, and (thereafter) acting optimally.

Bellman Equation:  Q^*(s, a) = \sum_{s'} P(s'|s, a) \left[ R(s, a, s') + \gamma \max_{a'} Q^*(s', a') \right]

Q-Value Iteration:  Q_{k+1}(s, a) \leftarrow \sum_{s'} P(s'|s, a) \left[ R(s, a, s') + \gamma \max_{a'} Q_k(s', a') \right]
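To make the recap concrete, here is a minimal Python/NumPy sketch of tabular Q-value iteration with a known model; the function name, array layout, and toy dimensions are illustrative assumptions, not from the slides.

```python
import numpy as np

def q_value_iteration(P, R, gamma=0.9, num_iters=100):
    """Tabular Q-value iteration with a known model.

    P: transition probabilities, shape (S, A, S) with P[s, a, s'] = P(s'|s, a)
    R: rewards, shape (S, A, S) with R[s, a, s'] = R(s, a, s')
    """
    S, A, _ = P.shape
    Q = np.zeros((S, A))
    for _ in range(num_iters):
        V = Q.max(axis=1)                          # V_k(s') = max_a' Q_k(s', a')
        # Q_{k+1}(s, a) = sum_{s'} P(s'|s, a) [ R(s, a, s') + gamma * V_k(s') ]
        Q = (P * (R + gamma * V[None, None, :])).sum(axis=2)
    return Q
```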

SLIDE 6

(Tabular) Q-Learning
  • Q-value iteration:  Q_{k+1}(s, a) \leftarrow \sum_{s'} P(s'|s, a) \left[ R(s, a, s') + \gamma \max_{a'} Q_k(s', a') \right]
  • Rewrite as expectation:  Q_{k+1}(s, a) \leftarrow E_{s' \sim P(s'|s, a)} \left[ R(s, a, s') + \gamma \max_{a'} Q_k(s', a') \right]
  • (Tabular) Q-Learning: replace the expectation by samples
    • For a state-action pair (s, a), receive a sample next state:  s' \sim P(s'|s, a)
    • Consider your old estimate:  Q_k(s, a)
    • Consider your new sample estimate:  target(s') = R(s, a, s') + \gamma \max_{a'} Q_k(s', a')
    • Incorporate the new estimate into a running average:  Q_{k+1}(s, a) \leftarrow (1 - \alpha) Q_k(s, a) + \alpha \, target(s')

SLIDE 7

(Tabular) Q-Learning

Algorithm:
  Start with Q_0(s, a) for all s, a.
  Get initial state s
  For k = 1, 2, … till convergence:
    Sample action a, get next state s'
    If s' is terminal:
      target = R(s, a, s')
      Sample new initial state s'
    else:
      target = R(s, a, s') + \gamma \max_{a'} Q_k(s', a')
    Q_{k+1}(s, a) \leftarrow (1 - \alpha) Q_k(s, a) + \alpha \, target
    s \leftarrow s'
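Below is a minimal Python sketch of this loop, assuming a Gymnasium-style environment with discrete observation and action spaces; the ε-greedy behavior policy anticipates the next slide, and all names and hyperparameter values are illustrative.

```python
import numpy as np

def tabular_q_learning(env, num_steps=100_000, gamma=0.99, alpha=0.1, epsilon=0.1):
    """Tabular Q-learning; assumes a Gymnasium-style env with discrete states/actions."""
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    s, _ = env.reset()
    for _ in range(num_steps):
        # epsilon-greedy behavior policy (see next slide)
        if np.random.rand() < epsilon:
            a = env.action_space.sample()
        else:
            a = int(np.argmax(Q[s]))
        s_next, r, terminated, truncated, _ = env.step(a)
        if terminated:
            target = r                              # no bootstrap from a terminal state
        else:
            target = r + gamma * np.max(Q[s_next])  # 1-step sampled Bellman backup
        Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target
        s = s_next if not (terminated or truncated) else env.reset()[0]
    return Q
```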

SLIDE 8

How to sample actions?
  • Choose random actions?
  • Choose the action that maximizes Q_k(s, a) (i.e., act greedily)?
  • ε-Greedy: choose a random action with probability ε, otherwise choose the action greedily

SLIDE 9

Q-Learning Properties
  • Amazing result: Q-learning converges to the optimal policy -- even if you're acting suboptimally!
  • This is called off-policy learning
  • Caveats:
    • You have to explore enough
    • You have to eventually make the learning rate small enough
    • … but not decrease it too quickly

SLIDE 10

Q-Learning Properties
  • Technical requirements:
    • All states and actions are visited infinitely often
      • Basically, in the limit, it doesn't matter how you select actions (!)
    • Learning-rate schedule such that for all state-action pairs (s, a):
      \sum_{t=0}^{\infty} \alpha_t(s, a) = \infty  \quad\text{and}\quad  \sum_{t=0}^{\infty} \alpha_t^2(s, a) < \infty

For details, see Tommi Jaakkola, Michael I. Jordan, and Satinder P. Singh. On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation, 6(6), November 1994.
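As a concrete example (not from the slides) of a schedule satisfying both conditions, one standard choice lets the learning rate for (s, a) decay with the number of visits to that pair:

```latex
% alpha is nonzero only on steps where (s,a) is updated; on the k-th visit it equals 1/k.
% The harmonic series diverges while the sum of 1/k^2 converges, so both conditions hold.
\alpha_t(s, a) = \frac{1}{N_t(s, a)}
\quad\Longrightarrow\quad
\sum_{t} \alpha_t(s, a) = \sum_{k \ge 1} \frac{1}{k} = \infty,
\qquad
\sum_{t} \alpha_t^2(s, a) = \sum_{k \ge 1} \frac{1}{k^2} = \frac{\pi^2}{6} < \infty
```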

SLIDE 11

Q-Learning Demo: Crawler
  • States: discretized values of the 2D state (arm angle, hand angle)
  • Actions: Cartesian product of {arm up, arm down} and {hand up, hand down}
  • Reward: speed in the forward direction
SLIDE 12

Video of Demo Crawler Bot

SLIDE 13

Video of Demo Q-Learning -- Crawler

SLIDE 14

Outline
  • Motivation
  • Q-learning
  • DQN + variants
  • Q-learning with continuous action spaces (SoftQ)
  • Deep Deterministic Policy Gradient (DDPG)
  • Soft Actor Critic (SAC)

SLIDE 15

Can tabular methods scale?
  • Discrete environments:
    • Gridworld: 10^1 states
    • Tetris: 10^60 states
    • Atari: 10^308 states (RAM), 10^16992 states (pixels)

SLIDE 16

Can tabular methods scale?
  • Continuous environments (by crude discretization):
    • Crawler: 10^2 states
    • Hopper: 10^10 states
    • Humanoid: 10^100 states

SLIDE 17

Generalizing Across States
  • Basic Q-Learning keeps a table of all Q-values
  • In realistic situations, we cannot possibly learn about every single state!
    • Too many states to visit them all in training
    • Too many states to hold the Q-table in memory
  • Instead, we want to generalize:
    • Learn about some small number of training states from experience
    • Generalize that experience to new, similar situations
    • This is a fundamental idea in machine learning

SLIDE 18

Approximate Q-Learning
  • Instead of a table, we have a parametrized Q-function: Q_\theta(s, a)
    • Can be a linear function in features:  Q_\theta(s, a) = \theta_0 f_0(s, a) + \theta_1 f_1(s, a) + \cdots + \theta_n f_n(s, a)
    • Or a neural net, decision tree, etc.
  • Learning rule:
    • Remember:  target(s') = R(s, a, s') + \gamma \max_{a'} Q_{\theta_k}(s', a')
    • Update:  \theta_{k+1} \leftarrow \theta_k - \alpha \nabla_\theta \left[ \tfrac{1}{2} \left( Q_\theta(s, a) - target(s') \right)^2 \right] \Big|_{\theta = \theta_k}
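To make the update concrete, here is a minimal Python sketch of one approximate Q-learning step with a linear-in-features Q-function; the function names (`features`, `approx_q_update`) and hyperparameter values are illustrative assumptions.

```python
import numpy as np

def approx_q_update(theta, features, s, a, r, s_next, actions, gamma=0.99, alpha=0.01):
    """One approximate Q-learning step for a linear Q-function.

    Q_theta(s, a) = theta . features(s, a); `features` returns a NumPy vector.
    """
    q_sa = theta @ features(s, a)
    # target(s') = R(s, a, s') + gamma * max_a' Q_theta(s', a')
    target = r + gamma * max(theta @ features(s_next, a_next) for a_next in actions)
    # Gradient of 1/2 (Q_theta(s, a) - target)^2 w.r.t. theta, holding the target fixed
    grad = (q_sa - target) * features(s, a)
    return theta - alpha * grad
```
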
SLIDE 19

Recall Approximate Q-Learning
  • Instead of a table, we have a parametrized Q-function Q_\theta(s, a)
    • E.g. a neural net
  • Learning rule:
    • Compute target:  target(s') = R(s, a, s') + \gamma \max_{a'} Q_{\theta_k}(s', a')
    • Update Q-network:  \theta_{k+1} \leftarrow \theta_k - \alpha \nabla_\theta \left[ \tfrac{1}{2} \left( Q_\theta(s, a) - target(s') \right)^2 \right] \Big|_{\theta = \theta_k}
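The same update with a neural-network Q-function can be sketched in PyTorch as below; the architecture, dimensions, and batch handling are illustrative assumptions. This shows the bare gradient step only; full DQN additionally uses a replay buffer and a target network, which are not shown here.

```python
import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))  # illustrative sizes
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma = 0.99

def q_update(s, a, r, s_next, done):
    """One gradient step on 1/2 (Q_theta(s,a) - target)^2 for a minibatch.

    s, s_next: float tensors (B, obs_dim); a: long tensor (B,); r, done: float tensors (B,).
    """
    with torch.no_grad():                      # the target is held fixed (no gradient through it)
        target = r + gamma * (1 - done) * q_net(s_next).max(dim=1).values
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = 0.5 * (q_sa - target).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```
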
SLIDES 20-26 (figure-only; no text content)
SLIDE 27

See also
  • “Rainbow: Combining Improvements in Deep Reinforcement Learning,” Matteo Hessel et al., 2017
    • Double DQN (DDQN)
    • Prioritized Replay DDQN
    • Dueling DQN
    • Distributional DQN
    • Noisy DQN

SLIDE 28

Outline
  • Motivation
  • Q-learning
  • DQN + variants
  • Q-learning with continuous action spaces (SoftQ)
  • Deep Deterministic Policy Gradient (DDPG)
  • Soft Actor Critic (SAC)

SLIDE 29

Soft Q-Learning

→ Supervised learning → Use a sample estimate → Stein variational gradient descent

SLIDE 30

Stein Variational Gradient Descent: Intuition
  • Implicit density model
  • (diagram: Q-function → policy sampling network)
  • D. Wang et al., Learning to draw samples: With application to amortized MLE for generative adversarial learning, 2016.

SLIDE 31

Training time: 0 min, 12 min, 30 min, 2 hours
sites.google.com/view/composing-real-world-policies/

SLIDE 32

After 2 hours of training

sites.google.com/view/composing-real-world-policies/

SLIDE 33

Outline
  • Motivation
  • Q-learning
  • DQN + variants
  • Q-learning with continuous action spaces (SoftQ)
  • Deep Deterministic Policy Gradient (DDPG)
  • Soft Actor Critic (SAC)

SLIDE 34

Deep Deterministic Policy Gradient (DDPG): Basic (= SVG(0))
  • for iter = 1, 2, …
    • Roll-outs: execute roll-outs under the current policy (+ some noise for exploration)
    • Q-function update:  g \propto \nabla_\phi \sum_t \left( Q_\phi(s_t, u_t) - \hat{Q}(s_t, u_t) \right)^2  with  \hat{Q}(s_t, u_t) = r_t + \gamma Q_\phi(s_{t+1}, u_{t+1})
    • Policy update: backprop through Q to compute gradient estimates for all t:  g \propto \sum_t \nabla_\theta Q_\phi(s_t, \pi_\theta(s_t, v_t))
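A minimal PyTorch sketch of these two updates on a batch of transitions follows; the network sizes and optimizer settings are illustrative, a deterministic actor π_θ(s) replaces the slide's noise-conditioned π_θ(s_t, v_t), and the replay buffer and lagged target networks of the "complete" version are omitted.

```python
import torch
import torch.nn as nn

obs_dim, act_dim = 8, 2                              # illustrative dimensions
q_net = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))
pi_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim), nn.Tanh())
q_opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)
pi_opt = torch.optim.Adam(pi_net.parameters(), lr=1e-4)
gamma = 0.99

def ddpg_basic_update(s, u, r, s_next, u_next):
    """One critic and one actor gradient step on a batch (s, u, r, s_next, u_next)."""
    # Critic: regress Q_phi(s_t, u_t) toward Q_hat = r_t + gamma * Q_phi(s_{t+1}, u_{t+1})
    with torch.no_grad():
        q_hat = r + gamma * q_net(torch.cat([s_next, u_next], dim=1)).squeeze(1)
    q_sa = q_net(torch.cat([s, u], dim=1)).squeeze(1)
    q_loss = (q_sa - q_hat).pow(2).mean()
    q_opt.zero_grad(); q_loss.backward(); q_opt.step()

    # Actor: backprop through Q, ascend on Q_phi(s_t, pi_theta(s_t))
    pi_loss = -q_net(torch.cat([s, pi_net(s)], dim=1)).mean()
    pi_opt.zero_grad(); pi_loss.backward(); pi_opt.step()
```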

SLIDE 35

SVG(k)
  • Applied to 2-D robotics tasks
  • Different gradient estimators behave similarly

SLIDE 36

SVG(k)

SLIDE 37

Deep Deterministic Policy Gradient (DDPG): Complete
  • Add noise to π_θ for exploration
  • Incorporate a replay buffer for off-policy learning
  • For increased stability, use lagged (Polyak-averaged) versions of π_θ and Q_φ for the target values:
    \hat{Q}_t = r_t + \gamma Q_{\phi'}(s_{t+1}, \pi_{\theta'}(s_{t+1}))
    → off-policy!
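A minimal sketch of the lagged (Polyak-averaged) target-network update referred to above; the coefficient name `tau` and its value are illustrative.

```python
import torch

def polyak_update(net, target_net, tau=0.005):
    """Lagged (Polyak-averaged) target network: theta' <- tau * theta + (1 - tau) * theta'."""
    with torch.no_grad():
        for p, p_targ in zip(net.parameters(), target_net.parameters()):
            p_targ.mul_(1.0 - tau).add_(tau * p)
```
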
SLIDE 38

DDPG
  • Applied to 2D and 3D robotics tasks and driving with pixel input

SLIDE 39

DDPG

SLIDE 40

DDPG
  + very sample efficient thanks to off-policy updates
  − often unstable
  → Soft Actor Critic (SAC), which adds the entropy of the policy to the objective, ensuring better exploration and less overfitting of the policy to any quirks in the Q-function

SLIDE 41

Outline
  • Motivation
  • Q-learning
  • DQN + variants
  • Q-learning with continuous action spaces (SoftQ)
  • Deep Deterministic Policy Gradient (DDPG)
  • Soft Actor Critic (SAC)

SLIDE 42

Soft Actor-Critic

Soft Policy Iteration:
  1. Soft policy evaluation: fix the policy, apply the soft Bellman backup until it converges
  2. Soft policy improvement: update the policy through information projection. This converges, and the new policy's soft Q-values are guaranteed to be at least as high as the old policy's.
  3. Repeat until convergence

Practical algorithm (Soft Actor-Critic):
  1. Take one stochastic gradient step to minimize the soft Bellman residual
  2. Take one stochastic gradient step to minimize the KL divergence
  3. Execute one action in the environment and repeat

Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. ICML, 2018.
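The three practical steps map onto two gradient updates plus environment interaction. Below is a minimal PyTorch sketch of the two gradient steps under simplifying assumptions (a single Q-function, no target network, a plain Gaussian policy without tanh squashing, fixed temperature α); every name and hyperparameter here is illustrative, not the paper's exact setup.

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

obs_dim, act_dim, alpha, gamma = 8, 2, 0.2, 0.99      # illustrative values
q_net = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, 2 * act_dim))
q_opt = torch.optim.Adam(q_net.parameters(), lr=3e-4)
pi_opt = torch.optim.Adam(policy.parameters(), lr=3e-4)

def sample_action(s):
    """Reparameterized Gaussian policy (tanh squashing and its log-prob correction omitted)."""
    mean, log_std = policy(s).chunk(2, dim=1)
    dist = Normal(mean, log_std.clamp(-5, 2).exp())
    a = dist.rsample()
    return a, dist.log_prob(a).sum(dim=1)

def sac_update(s, a, r, s_next):
    # Step 1: one SGD step on the soft Bellman residual (critic)
    with torch.no_grad():
        a_next, logp_next = sample_action(s_next)
        q_next = q_net(torch.cat([s_next, a_next], dim=1)).squeeze(1)
        target = r + gamma * (q_next - alpha * logp_next)   # soft value of the next state
    q_loss = (q_net(torch.cat([s, a], dim=1)).squeeze(1) - target).pow(2).mean()
    q_opt.zero_grad(); q_loss.backward(); q_opt.step()

    # Step 2: one SGD step on E[ alpha * log pi(a|s) - Q(s, a) ] (the KL projection, actor)
    a_new, logp = sample_action(s)
    pi_loss = (alpha * logp - q_net(torch.cat([s, a_new], dim=1)).squeeze(1)).mean()
    pi_opt.zero_grad(); pi_loss.backward(); pi_opt.step()
```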

SLIDE 43

Soft Actor Critic
  • Objective: maximize expected return plus the entropy of the policy (maximum-entropy RL; see below)
  • Iterate:
    • Perform roll-outs from π, add data to the replay buffer
    • Learn V, Q, π

[see also: https://towardsdatascience.com/soft-actor-critic-demystified-b8427df61665]
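For reference, the maximum-entropy objective that Soft Actor-Critic optimizes (as in Haarnoja et al., 2018), with a temperature α weighting the entropy bonus:

```latex
% Maximum-entropy RL objective: expected reward plus an entropy bonus at every visited state.
J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}
         \Big[ r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \Big]
```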

SLIDE 44

Algorithms compared: Soft Actor-Critic (SAC), Deep Deterministic Policy Gradient (DDPG), Proximal Policy Optimization (PPO), Soft Q-Learning (SQL)
sites.google.com/view/soft-actor-critic

SLIDE 45

sites.google.com/view/soft-actor-critic

SLIDE 46

SLIDE 47

Real Robot Results

SLIDE 48

Real Robot Results

SLIDE 49

Real Robot Results