CS885 Reinforcement Learning Module 3: July 5, 2020 - Imitation Learning



SLIDE 1

CS885 Reinforcement Learning Module 3: July 5, 2020

Imitation Learning

Torabi, F., Warnell, G., & Stone, P. (2018). Behavioral cloning from observation. In IJCAI (pp. 4950-4957).

Ho, J., & Ermon, S. (2016). Generative adversarial imitation learning. In NeurIPS (pp. 4565-4573).

CS885 Spring 2020 Pascal Poupart 1 University of Waterloo

SLIDE 2

Imitation Learning

  • Behavioural cloning (supervised learning)
  • Generative adversarial imitation learning (GAIL)
  • Imitation learning from observations
  • Inverse reinforcement learning


SLIDE 3

Motivation

  • Learn from expert demonstrations

  – No reward function needed
  – Faster learning


[Images: autonomous driving, chatbots, robotics]

SLIDE 4

Behavioural Cloning

  • Simplest form of imitation learning
  • Assumption: state-action pairs observable

Imitation learning:

  • Observe expert trajectories: (s_1, a_1), (s_2, a_2), (s_3, a_3), …, (s_n, a_n)
  • Create training set: states S as inputs, actions A as targets
  • Train by supervised learning
    – Classification (discrete actions) or regression (continuous actions)
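The supervised-learning step above can be sketched as follows. The 1-nearest-neighbour classifier and the toy driving actions are illustrative stand-ins for whatever learner and action space one would actually use:

```python
# Minimal behavioural-cloning sketch: treat expert (state, action) pairs as a
# supervised dataset and fit a classifier to imitate the expert.

expert_demos = [
    # (state, action): a hypothetical expert that brakes at high speed
    ((0.2,), "accelerate"),
    ((0.4,), "accelerate"),
    ((0.8,), "brake"),
    ((0.9,), "brake"),
]

def cloned_policy(state):
    """Return the action of the nearest expert state (1-NN classification)."""
    nearest = min(expert_demos,
                  key=lambda sa: sum((x - y) ** 2 for x, y in zip(sa[0], state)))
    return nearest[1]
```

In practice the classifier would be a neural network trained by gradient descent, but the interface is the same: states in, expert actions out.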


SLIDE 5

Case Study I: Autonomous driving

  • Bojarski et al. (2016) End-to-end learning for self-driving cars
  • On road tests:

  – Holmdel to Atlantic Highlands (NJ): autonomous ~98% of the time
  – Garden State Parkway (10 miles): no human intervention


SLIDE 6

Case study II: conversational agents

Objective: max_θ Pr(b | t) = ∏_i Pr(w_i | w_{i−1}, …, w_1, t)

where the state t is the input message, the action b = (w_1, w_2, …) is the generated response, and w_i are its words.


[Seq2seq diagram: encoder encodes the state t ("How are you doing ?"), decoder generates the action b ("I am fine")] (Sordoni et al., 2015)
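The factorized objective can be illustrated with a toy next-word model. Here `cond_prob` is a hypothetical stand-in for a trained encoder-decoder, simply uniform over a tiny vocabulary so the example runs:

```python
import math

# Toy illustration of the factorized chatbot objective (not the Sordoni et al.
# model): Pr(b | t) = prod_i Pr(w_i | w_1..w_{i-1}, t).
def cond_prob(word, prefix, context):
    # Stand-in conditional model: uniform over a tiny vocabulary.
    vocab = ["i", "am", "fine", "<eos>"]
    return 1.0 / len(vocab) if word in vocab else 0.0

def log_likelihood(response, context):
    """Sum of log Pr(w_i | w_1..w_{i-1}, t) over the response words."""
    total = 0.0
    for i, word in enumerate(response):
        total += math.log(cond_prob(word, response[:i], context))
    return total

# 3 words, each with probability 1/4 under the toy model
ll = log_likelihood(["i", "am", "fine"], ["how", "are", "you", "doing", "?"])
```

Training maximizes this log-likelihood over θ; at test time the decoder generates the response word by word from the same conditionals.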

SLIDE 7

Generative adversarial imitation learning (GAIL)

  • Common approach: train the generator to maximize the likelihood of expert actions
  • Alternative: train the generator to fool a discriminator into believing that the generated actions come from the expert
    – Leverages GANs (generative adversarial networks)
    – Ho & Ermon, 2016


SLIDE 8

Generative adversarial networks (GANs)

min_θ max_w  log Pr(y labelled real; w) + log Pr(h_θ(z) labelled fake; w)
= min_θ max_w  log d_w(y) + log(1 − d_w(h_θ(z)))

where z is a random vector, h_θ is the generator and d_w is the discriminator.


[Diagram: random vector z → generator h_θ → fake sample; real data y → discriminator d_w, which labels inputs real or fake. Sample face images: StyleGAN2 (Karras et al., 2020), CelebA (Liu et al., 2015)]

SLIDE 9

GAIL Pseudocode

Input: expert trajectories τ_E ∼ π_expert, where τ_E = (s_1, a_1, s_2, a_2, …)
Initialize parameters θ of policy π_θ and parameters w of discriminator d_w
Repeat until stopping criterion:
  Update discriminator parameters (gradient ascent):
    g_w = Σ_{(s,a)∈τ_E} ∇_w log d_w(s, a) + Σ_{(s,a)∼π_θ} ∇_w log(1 − d_w(s, a))
    w ← w + α_w g_w
  Update policy parameters with TRPO (gradient descent):
    Q(s̄, ā) = Σ_{(s,a)|s_1=s̄, a_1=ā, π_θ} log(1 − d_w(s, a))
    g_θ = Σ_{(s,a)|π_θ} ∇_θ log π_θ(a | s) Q(s, a) − λ ∇_θ H(π_θ)
    θ ← θ − α_θ g_θ
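The loop above can be sketched on a tabular toy problem. REINFORCE replaces TRPO here for brevity, and the 2-state environment with an expert that always plays a = s is an illustrative assumption:

```python
import numpy as np

# Tabular GAIL toy: 2 states, 2 actions; the expert always plays a = s.
# D and P hold logits for the discriminator d_w and the policy pi_theta.
rng = np.random.default_rng(0)
N_S, N_A = 2, 2
D = np.zeros((N_S, N_A))          # d_w(s, a) = sigmoid(D[s, a])
P = np.zeros((N_S, N_A))          # pi_theta(a | s) = softmax(P[s])

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def pi(s):
    e = np.exp(P[s] - P[s].max())
    return e / e.sum()

lr, batch = 0.1, 32
for _ in range(500):
    states = rng.integers(N_S, size=batch)
    pol_acts = np.array([rng.choice(N_A, p=pi(s)) for s in states])
    # Discriminator ascent: push d_w up on expert pairs (s, s) ...
    for s in states:
        D[s, s] += lr * (1.0 - sigmoid(D[s, s]))
    # ... and down on policy-generated pairs.
    for s, a in zip(states, pol_acts):
        D[s, a] -= lr * sigmoid(D[s, a])
    # Policy step: REINFORCE on reward -log(1 - d_w(s, a))
    # (equivalently, descend the cost log(1 - d_w(s, a))).
    for s, a in zip(states, pol_acts):
        r = -np.log(1.0 - sigmoid(D[s, a]))
        grad = -pi(s)
        grad[a] += 1.0                 # grad of log pi(a|s) w.r.t. logits
        P[s] += lr * r * grad / batch
```

After training, pi(s) should put most of its mass on the expert action in each state, even though the policy never sees an explicit reward function.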


SLIDE 10

Robotics Experiments

  • Robot imitating expert policy (Ho & Ermon, 2016)

[Video: robot behaviour learned with GAIL]

SLIDE 11

Imitation Learning from Observations

  • Consider imitation learning from a human expert:
    – Actions (e.g., forces) unobservable
    – Only states/observations (e.g., joint positions) observable
  • Problem: infer actions from state/observation sequences


Schaal et al., 2003

SLIDE 12

Inverse Dynamics

Two steps:

1. Learn inverse dynamics
   – Learn Pr(a | s, s') by supervised learning
   – From (s, a, s') samples obtained by executing random actions

2. Behavioural cloning
   – Learn π(â | s) by supervised learning
   – From (s, s') samples taken from expert trajectories, with â ∼ Pr(a | s, s') sampled from the inverse dynamics model


SLIDE 13

Pseudocode: Imitation Learning from Observations

Input: expert trajectories τ_E ∼ π_expert, where τ_E = (s_1, s_2, s_3, …)
Initialize agent policy π_θ at random
Repeat:
  Learn inverse dynamics model with parameters w:
    Sample (s_i^π, a_i^π, s_{i+1}^π) by executing π_θ
    w ← argmax_w Σ_i log Pr_w(a_i^π | s_i^π, s_{i+1}^π)
  Learn policy parameters θ:
    For each (s_i^E, s_{i+1}^E) from expert trajectories τ_E do:
      sample â_i^E ∼ Pr_w(â_i^E | s_i^E, s_{i+1}^E)
    θ ← argmax_θ Σ_i log π_θ(â_i^E | s_i^E)
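The two-phase loop can be sketched on a deterministic 1-D grid. The environment, the count-based inverse dynamics model, and the right-moving expert are all illustrative assumptions, not the setup of Torabi et al.:

```python
import random

# Toy behavioural cloning from observation: learn Pr(a | s, s') from
# self-generated experience, then clone the expert from states alone.
N_STATES = 5
ACTIONS = [-1, +1]

def step(s, a):
    """Deterministic dynamics: move left/right, clipped to the grid."""
    return min(max(s + a, 0), N_STATES - 1)

# Phase 1: inverse dynamics from random-action rollouts.
# Counts of which actions produced each transition suffice in this toy.
inv_counts = {}
rng = random.Random(0)
for _ in range(2000):
    s = rng.randrange(N_STATES)
    a = rng.choice(ACTIONS)
    inv_counts.setdefault((s, step(s, a)), []).append(a)

def infer_action(s, s_next):
    """Most likely action for the observed transition (s, s')."""
    samples = inv_counts[(s, s_next)]
    return max(set(samples), key=samples.count)

# Phase 2: behavioural cloning on expert *state* sequences only.
# The expert always moves right; its actions are never observed.
expert_states = list(range(N_STATES))  # 0, 1, 2, 3, 4
policy = {}
for s, s_next in zip(expert_states, expert_states[1:]):
    policy[s] = infer_action(s, s_next)  # label states with inferred actions
```

The learned policy should select +1 (move right) in every non-terminal state, recovering the expert behaviour without ever seeing an expert action.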


SLIDE 14

Robotics Experiments

Torabi et al., 2018
