Class notes
Real-World Robot Learning: Safety and Flexibility
CS294-112: Deep Reinforcement Learning, Gregory Kahn


SLIDE 1

Class notes

  • Homework 5 due Tuesday, November 13th, 11:59pm

SLIDE 2

Real-World Robot Learning: Safety and Flexibility

CS294-112: Deep Reinforcement Learning
Gregory Kahn

SLIDE 3

Why should you care?

  • Safety
  • Flexibility

SLIDE 4

Outline

Topics
  • Safety
  • Flexibility

Algorithms
  • Imitation learning
  • Model-free
  • Model-based

(2×3 grid: {Safety, Flexibility} × {Imitation learning, Model-free, Model-based})

2 × 3 = 6 papers we’ll cover. These are by no means the best or only papers on these topics.

SLIDE 5

SLIDE 6

Goal

Learn a control policy that maps observations to controls (observation → policy → control)

SLIDE 7

Assumption

  • Able to generate good trajectories using an expert policy
  • The expert can be a human expert or trajectory optimization
  • Trajectory optimization needs a cost function, an optimizer, and full state information, but only during training

SLIDE 8

Supervised Learning

  • Pipeline: trajectory optimization → gather expert trajectories → supervised learning on the training trajectories
  • Problem: training and test distributions differ
  • The learned policy reaches states not in the training set! [Ross et al 2010]

SLIDE 9

Dataset Aggregation (DAgger) [Ross et al 2011]

  • Problem: training and test distributions differ
  • Solution: execute the policy during training
  • Repeat: gather expert trajectories → supervised learning
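The DAgger loop described above can be sketched in a few lines. This is a toy illustration, not the paper's implementation; `expert`, `step`, `reset`, and `fit` are hypothetical stand-ins for the expert policy, environment dynamics, episode reset, and supervised learner.

```python
import numpy as np

def dagger(expert, step, reset, fit, policy, n_iters=5, horizon=20):
    """Minimal DAgger: roll out the current policy, relabel every visited
    state with the expert's action, aggregate, and retrain."""
    dataset = []  # aggregated (state, expert action) pairs
    for _ in range(n_iters):
        s = reset()
        for _ in range(horizon):
            dataset.append((s, expert(s)))  # expert labels the learner's states
            s = step(s, policy(s))          # but the learner picks the action
        policy = fit(dataset)               # supervised learning on all data
    return policy

# Toy 1-D example: the expert drives the state toward zero.
rng = np.random.default_rng(0)
expert = lambda s: -0.5 * s
step = lambda s, a: s + a + 0.01 * rng.standard_normal()
reset = lambda: 1.0

def fit(dataset):
    S = np.array([x for x, _ in dataset])
    A = np.array([a for _, a in dataset])
    k = (S @ A) / (S @ S)        # least-squares fit of a = k * s
    return lambda s: k * s

policy = dagger(expert, step, reset, fit, policy=lambda s: 0.0)
```

Because the expert relabels the learner's own states, the aggregated dataset covers the distribution the learned policy actually visits, which is the point of DAgger.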

SLIDE 10

Safety during training

  • DAgger mixes the actions

SLIDE 11

Policy Learning using Adaptive Trajectory Optimization (PLATO)

  • DAgger mixes the actions
  • PLATO mixes the objectives: including the task cost J → avoids high cost
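The objective mixing can be illustrated with a toy sketch: the teacher picks actions that trade off the task cost against staying close to the learner's policy. Here a squared penalty stands in for PLATO's KL term, and all names (`plato_teacher_action`, `task_cost`, `lam`) are hypothetical.

```python
import numpy as np

def plato_teacher_action(s, policy, task_cost, lam=1.0,
                         candidates=np.linspace(-1.0, 1.0, 201)):
    """PLATO-style teacher sketch: minimize the task cost J plus a
    penalty for deviating from the learner's action, so supervision is
    both low-cost (safe) and close to what the learner will do."""
    mixed = task_cost(s, candidates) + lam * (candidates - policy(s)) ** 2
    return candidates[int(np.argmin(mixed))]

# Toy: the task wants a = -s; the learner currently outputs 0.
task_cost = lambda s, a: (s + a) ** 2
a_teacher = plato_teacher_action(1.0, policy=lambda s: 0.0, task_cost=task_cost)
```

The teacher's action lands between the task optimum (-1.0) and the learner's action (0.0), which keeps the training-state distribution close to what the learner induces while still avoiding high cost.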

SLIDE 12

Algorithm comparisons

[table comparing supervised learning, DAgger, and PLATO on two criteria: is the sampling policy safe, and are the training and test distributions similar]

SLIDE 13

Experiments: final neural network policies

[videos: Canyon, Forest]

SLIDE 14

Experiments: metrics

[plots: Canyon, Forest]

SLIDE 15

Experiments: metrics

[plots: Canyon, Forest]

SLIDE 16

SLIDE 17

Goal

NOT SAFE

SLIDE 18

Shielding

  • Pre-emptive shielding: like learning in a transformed MDP
  • Post-posed shielding: the shield can be used at test time
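A post-posed shield can be sketched as an action filter: simulate the proposed action one step through conservative dynamics, and override it with a known-safe fallback if the successor state violates the safety spec. This is a minimal illustration; all names are hypothetical.

```python
def shielded_action(s, proposed, dynamics, is_safe, fallback):
    """Post-posed shielding sketch: the learner proposes freely, but the
    shield overrides any action whose (conservatively predicted)
    successor state is unsafe. The learner effectively trains in a
    transformed, safe MDP, and the shield can stay on at test time."""
    if is_safe(dynamics(s, proposed)):
        return proposed
    return fallback(s)

# Toy 1-D corridor: states with |s| <= 1 are safe.
dynamics = lambda s, a: s + a          # conservative one-step model
is_safe = lambda s: abs(s) <= 1.0
fallback = lambda s: 0.0               # stopping is always safe here
```

Note the assumption this encodes: the shield only works because the (conservative) dynamics and the safety predicate are known, which matches the assumption on the next slide.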

SLIDE 19

How to shield: linear temporal logic

  • Encode safety with temporal logic
  • Assumption: known approximate/conservative transition dynamics

SLIDE 20

Experiments

Safety criteria

  • Don’t crash

SLIDE 21

Experiments

Safety criteria

  • Don’t run out of oxygen
  • If enough oxygen, don’t surface without divers

SLIDE 22

SLIDE 23

Goal

How to do reinforcement learning without destroying the robot during training, using only onboard images, in an unknown environment?

SLIDE 24

Approach

In an unknown environment, learn a collision prediction model: a neural network maps the raw image and command velocities to a collision prediction.

SLIDE 25

Collision prediction model

SLIDE 26

Model-based RL using collision prediction model

  • Gather trajectories using an MPC controller (may experience collisions)
  • Train an uncertainty-aware collision prediction model: a deep neural network with uncertainty estimates from bootstrapping and dropout
  • Form a speed-dependent, uncertainty-aware collision cost
  • Encourage safe, low-speed collisions by reasoning about the model’s uncertainty
  • The robot increases speed as the model becomes more confident

SLIDE 27

Collision cost

high speed + predicted collision + large uncertainty → large cost
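One illustrative form of such a cost (a sketch, not the paper's exact expression): penalize the mean predicted collision probability plus a multiple of its spread across the ensemble/dropout samples, scaled by speed.

```python
import numpy as np

def collision_cost(speed, p_collision_samples, k=1.0):
    """Speed-dependent, uncertainty-aware collision cost sketch: the
    samples come from the bootstrapped/dropout model, so disagreement
    (std) raises the cost, and high speed amplifies it."""
    p = np.asarray(p_collision_samples, dtype=float)
    return speed * (p.mean() + k * p.std())
```

With this shape, a confident low-probability prediction at low speed is cheap, while an uncertain prediction at high speed is expensive, which is exactly the behavior the slide describes.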

SLIDE 28

Estimating neural network output uncertainty: bootstrapping

  • Training time: resample the data D with replacement into D1, D2, D3; train models M1, M2, M3, one per resampled set
  • Test time: run the input through all of M1, M2, M3; the spread of their outputs estimates uncertainty
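The bootstrapping procedure above can be sketched directly; the linear `fit` and toy data are hypothetical stand-ins for the deep network and real dataset.

```python
import numpy as np

def bootstrap_ensemble(X, y, fit, n_models=3, seed=0):
    """Train each model on a dataset resampled with replacement
    (D1, D2, D3 -> M1, M2, M3)."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, len(X), size=len(X))  # resample with replacement
        models.append(fit(X[idx], y[idx]))
    return models

def predict(models, x):
    """Test time: all models see the input; their spread is uncertainty."""
    preds = np.array([m(x) for m in models])
    return preds.mean(), preds.std()

# Toy: noiseless line y = 2x, fitting a slope through the origin.
X = np.array([1.0, 2.0, 3.0])
y = 2.0 * X
fit = lambda Xb, yb: (lambda x, k=(Xb @ yb) / (Xb @ Xb): k * x)
mean, std = predict(bootstrap_ensemble(X, y, fit), x=2.0)
```

On noiseless data every bootstrap model agrees, so the std collapses to zero; with noisy or sparse data the resampled fits diverge and the std grows, which is the uncertainty signal the collision cost consumes.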

SLIDE 29

Estimating neural network output uncertainty: dropout

  • Training time: train the model with dropout
  • Test time: keep dropout active; repeated stochastic forward passes on the same input give a distribution of outputs, whose spread estimates uncertainty
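Monte Carlo dropout at test time can be sketched for a single linear layer (a toy stand-in for the full network; the names and constants are illustrative):

```python
import numpy as np

def mc_dropout_predict(w, x, p_drop=0.5, n_samples=1000, seed=0):
    """MC dropout sketch: keep dropout active at test time; the mean
    over stochastic forward passes is the prediction and their spread
    is the uncertainty estimate."""
    rng = np.random.default_rng(seed)
    outs = []
    for _ in range(n_samples):
        keep = rng.random(w.shape) >= p_drop            # drop units at random
        outs.append((w * keep) @ x / (1.0 - p_drop))    # inverted-dropout scaling
    outs = np.array(outs)
    return outs.mean(), outs.std()

mean, std = mc_dropout_predict(np.array([1.0, 1.0]), np.array([1.0, 1.0]))
```

Unlike bootstrapping, this needs only one trained model, at the price of several forward passes per query.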

SLIDE 30

Preliminary real-world experiments

Not accounting for uncertainty (higher-speed collisions)

SLIDE 31

SLIDE 32

Preliminary real-world experiments

Accounting for uncertainty (lower-speed collisions)

SLIDE 33

SLIDE 34

Preliminary real-world experiments

Successful flight past obstacle

SLIDE 35

SLIDE 36

Safety takeaways

  • Tradeoff between safety and exploration
  • Safety guarantees require expert oversight, or a known environment and dynamics
  • Uncertainty can play a key role

SLIDE 37

SLIDE 38

Goal

User-specified command

SLIDE 39

Approach

  • Option A: input the command to the network
  • Option B: branch on the command (empirically better, but only works for discrete commands)
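Option B can be sketched as command-conditional branching: a shared feature vector feeds several output heads, and the discrete command selects which head's output is executed. The head weights below are hypothetical toy values.

```python
import numpy as np

def branched_policy(features, heads, command):
    """Option B sketch: one head per discrete command; the command
    selects the head. This is why branching only works for discrete
    commands: you need a finite set of heads."""
    return heads[command] @ features

# Toy: two commands with hand-set head weights.
features = np.array([1.0, 0.0])
heads = {"left": np.array([1.0, 0.0]), "right": np.array([0.0, 1.0])}
```

Option A would instead concatenate the command to `features` and use a single head, which also handles continuous commands but was reported as empirically weaker.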

SLIDE 40

Approach: important details

  • Data augmentation
      • Contrast
      • Brightness
      • Tone
      • Gaussian blur
      • Salt-and-pepper noise
      • Region dropout
  • Adding noise to the expert
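Several of the listed augmentations can be sketched with plain NumPy on a float image in [0, 1]; this is an illustrative subset (blur and tone shifts omitted), not the paper's pipeline.

```python
import numpy as np

def augment(img, rng=None):
    """Augmentation sketch: random contrast, brightness, salt-and-pepper
    noise, and region dropout on a 2-D float image in [0, 1]."""
    rng = np.random.default_rng(0) if rng is None else rng
    out = img * rng.uniform(0.8, 1.2)        # contrast
    out = out + rng.uniform(-0.1, 0.1)       # brightness
    noise = rng.random(out.shape)
    out[noise < 0.01] = 0.0                  # pepper
    out[noise > 0.99] = 1.0                  # salt
    h, w = out.shape
    y, x = rng.integers(0, h), rng.integers(0, w)
    out[y:y + h // 8, x:x + w // 8] = 0.0    # region dropout
    return np.clip(out, 0.0, 1.0)

aug = augment(np.full((32, 32), 0.5))
```

Region dropout in particular forces the policy not to rely on any single image patch, which complements the photometric perturbations.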

SLIDE 41

SLIDE 42

SLIDE 43

SLIDE 44

[slides adapted from Tuomas Haarnoja]

SLIDE 45

Goal

  • Task 1: Reach (reaching skill)
  • Task 2: Avoid (avoidance skill)
  • Combine them in the space of trajectories into a reaching-while-avoiding skill

SLIDE 46

Policy Composition

  • Task 1: Reach + Task 2: Avoid → Task 1+2: Reach and avoid
  • Reusability!
  • Related to the divergence between … and …
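The composition idea can be sketched for a discrete action set: summing the Q-functions of two maximum-entropy skills approximates the Q-function of the combined task, so a composed policy can act on their sum (here greedily, for simplicity; the actual soft policy samples from a softmax). The toy Q-functions below are hypothetical.

```python
import numpy as np

def composed_action(q_reach, q_avoid, actions):
    """Soft Q composition sketch: act on the sum of the two skills'
    Q-functions; the approximation error relates to the divergence
    between the composed skills' policies."""
    scores = np.array([q_reach(a) + q_avoid(a) for a in actions])
    return actions[int(np.argmax(scores))]

# Toy 1-D action space: "reach" prefers a = 0.5, "avoid" prefers a = -0.1.
actions = np.linspace(-1.0, 1.0, 21)
q_reach = lambda a: -(a - 0.5) ** 2
q_avoid = lambda a: -(a + 0.1) ** 2
a_star = composed_action(q_reach, q_avoid, actions)
```

The composed optimum (0.2) is a compromise between the two skills' preferred actions, which is the reusability payoff: neither skill is retrained for the joint task.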

SLIDE 47

Task 1 Task 2 Task 1 + 2

SLIDE 48

Avoidance policy Stacking policy

SLIDE 49

Avoidance policy Stacking policy Combined policy

SLIDE 50

SLIDE 51

Standard Reinforcement Learning

Train: data → policy, repeated per task. Test: data → policy.

  • Data inefficient
  • Expert in the loop
  • Inflexible

SLIDE 52

CAPs Approach

Train: data → CAPs. Test: event cues detector → CAPs.

  • Data efficient
  • Detector in the loop
  • Flexible

SLIDE 53

Detect → Predict → Control
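The detect → predict → control loop can be sketched as planning over predicted event cues: a learned, action-conditioned model predicts future events for each candidate action sequence, and a user-specified cost over those events selects the plan. This is a toy illustration with hypothetical names; the flexibility comes from the fact that changing the task only means changing `event_cost`.

```python
import numpy as np

def caps_plan(state, predict_events, event_cost, candidate_plans):
    """CAPs-style control sketch: predict event cues (e.g. collision,
    lane, speed) for each candidate action sequence, then pick the plan
    whose predicted events minimize a user-specified cost."""
    costs = [event_cost(predict_events(state, plan)) for plan in candidate_plans]
    return candidate_plans[int(np.argmin(costs))]

# Toy: the only event is "collision" when the state leaves [-1, 1].
predict_events = lambda s, plan: {"collision": float(abs(s + sum(plan)) > 1.0)}
event_cost = lambda e: e["collision"]
best = caps_plan(0.9, predict_events, event_cost, [[0.5], [-0.5]])
```

Because the predictor is trained once on event labels from cheap detectors, new tasks reuse it with a new cost, rather than collecting new reward-labeled data.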

SLIDE 54

Detect → Predict → Control (event cues detector)

SLIDE 55

Detect → Predict → Control

SLIDE 56

Detect → Predict → Control

SLIDE 57

Detect → Predict → Control

SLIDE 58

[videos at 8x speed]

SLIDE 59

[video at 8x speed]

SLIDE 60

  • Drive in right lane
  • Drive in either lane
  • Drive at 7 m/s
  • Avoid collisions

[videos at 6x speed]

SLIDE 61

CAPs

[video at 6x speed]

SLIDE 62

SLIDE 63

SLIDE 64

Collision avoidance: CAPs vs. DQL

SLIDE 65

  • Avoid collisions
  • Follow goal heading
  • Move towards doors

SLIDE 66

Flexibility takeaways

  • Carefully construct how your policy / model deals with goals
  • Model-free methods require extra care to reuse
  • Model-based methods are flexible by construction