CS885 Reinforcement Learning Module 3: July 5, 2020 - Imitation Learning



SLIDE 1

CS885 Reinforcement Learning Module 3: July 5, 2020

Imitation Learning

Torabi, F., Warnell, G., & Stone, P. (2018). Behavioral cloning from observation. In IJCAI (pp. 4950-4957).

Ho, J., & Ermon, S. (2016). Generative adversarial imitation learning. In NeurIPS (pp. 4565-4573).

CS885 Spring 2020 Pascal Poupart 1 University of Waterloo

SLIDE 2

Imitation Learning

  • Behavioural cloning (supervised learning)
  • Generative adversarial imitation learning (GAIL)
  • Imitation learning from observations
  • Inverse reinforcement learning


SLIDE 3

Motivation

  • Learn from expert demonstrations

  – No reward function needed
  – Faster learning


[Images: autonomous driving, chatbots, robotics]

SLIDE 4

Behavioural Cloning

  • Simplest form of imitation learning
  • Assumption: state-action pairs observable

Imitation learning:

  • Observe expert trajectories: (s_1, a_1), (s_2, a_2), (s_3, a_3), …, (s_n, a_n)
  • Create training set: states S as inputs, actions A as targets
  • Train by supervised learning
    – Classification (discrete actions) or regression (continuous actions)
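The supervised-learning step above can be sketched as follows. The 1-nearest-neighbour classifier and the toy driving actions are illustrative stand-ins for whatever learner and action space one would actually use:

```python
# Minimal behavioural-cloning sketch: treat expert (state, action) pairs as a
# supervised dataset and fit a classifier to imitate the expert.

expert_demos = [
    # (state, action): a hypothetical expert that brakes at high speed
    ((0.2,), "accelerate"),
    ((0.4,), "accelerate"),
    ((0.8,), "brake"),
    ((0.9,), "brake"),
]

def cloned_policy(state):
    """Return the action of the nearest expert state (1-NN classification)."""
    nearest = min(expert_demos,
                  key=lambda sa: sum((x - y) ** 2 for x, y in zip(sa[0], state)))
    return nearest[1]
```

In practice the classifier would be a neural network trained by gradient descent, but the interface is the same: states in, expert actions out.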


SLIDE 5

Case Study I: Autonomous driving

  • Bojarski et al. (2016) End-to-end learning for self-driving cars
  • On road tests:

  – Holmdel to Atlantic Highlands (NJ): autonomous ~98% of the time
  – Garden State Parkway (10 miles): no human intervention


SLIDE 6

Case study II: conversational agents

Objective: max_θ Pr(b | t) = ∏_i Pr(w_i | w_{i−1}, …, w_1, t)

where the state t is the input message, the action b = (w_1, w_2, …) is the generated response, and w_i are its words.


[Seq2seq diagram: encoder encodes the state t ("How are you doing ?"), decoder generates the action b ("I am fine")] (Sordoni et al., 2015)
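The factorized objective can be illustrated with a toy next-word model. Here `cond_prob` is a hypothetical stand-in for a trained encoder-decoder, simply uniform over a tiny vocabulary so the example runs:

```python
import math

# Toy illustration of the factorized chatbot objective (not the Sordoni et al.
# model): Pr(b | t) = prod_i Pr(w_i | w_1..w_{i-1}, t).
def cond_prob(word, prefix, context):
    # Stand-in conditional model: uniform over a tiny vocabulary.
    vocab = ["i", "am", "fine", "<eos>"]
    return 1.0 / len(vocab) if word in vocab else 0.0

def log_likelihood(response, context):
    """Sum of log Pr(w_i | w_1..w_{i-1}, t) over the response words."""
    total = 0.0
    for i, word in enumerate(response):
        total += math.log(cond_prob(word, response[:i], context))
    return total

# 3 words, each with probability 1/4 under the toy model
ll = log_likelihood(["i", "am", "fine"], ["how", "are", "you", "doing", "?"])
```

Training maximizes this log-likelihood over θ; at test time the decoder generates the response word by word from the same conditionals.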

SLIDE 7

Generative adversarial imitation learning (GAIL)

  • Common approach: train the generator to maximize the likelihood of expert actions
  • Alternative: train the generator to fool a discriminator into believing that the generated actions come from the expert
    – Leverages GANs (generative adversarial networks)
    – Ho & Ermon, 2016


SLIDE 8

Generative adversarial networks (GANs)

min_θ max_w  log Pr(y labelled real; w) + log Pr(h_θ(z) labelled fake; w)
= min_θ max_w  log d_w(y) + log(1 − d_w(h_θ(z)))

where z is a random vector, h_θ is the generator and d_w is the discriminator.


[Diagram: random vector z → generator h_θ → fake sample; real data y → discriminator d_w, which labels inputs real or fake. Sample face images: StyleGAN2 (Karras et al., 2020), CelebA (Liu et al., 2015)]

SLIDE 9

GAIL Pseudocode

Input: expert trajectories τ_E ∼ π_expert, where τ_E = (s_1, a_1, s_2, a_2, …)
Initialize parameters θ of policy π_θ and parameters w of discriminator d_w
Repeat until stopping criterion:
  Update discriminator parameters (gradient ascent):
    g_w = Σ_{(s,a)∈τ_E} ∇_w log d_w(s, a) + Σ_{(s,a)∼π_θ} ∇_w log(1 − d_w(s, a))
    w ← w + α_w g_w
  Update policy parameters with TRPO (gradient descent):
    Q(s̄, ā) = Σ_{(s,a)|s_1=s̄, a_1=ā, π_θ} log(1 − d_w(s, a))
    g_θ = Σ_{(s,a)|π_θ} ∇_θ log π_θ(a | s) Q(s, a) − λ ∇_θ H(π_θ)
    θ ← θ − α_θ g_θ
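The loop above can be sketched on a tabular toy problem. REINFORCE replaces TRPO here for brevity, and the 2-state environment with an expert that always plays a = s is an illustrative assumption:

```python
import numpy as np

# Tabular GAIL toy: 2 states, 2 actions; the expert always plays a = s.
# D and P hold logits for the discriminator d_w and the policy pi_theta.
rng = np.random.default_rng(0)
N_S, N_A = 2, 2
D = np.zeros((N_S, N_A))          # d_w(s, a) = sigmoid(D[s, a])
P = np.zeros((N_S, N_A))          # pi_theta(a | s) = softmax(P[s])

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def pi(s):
    e = np.exp(P[s] - P[s].max())
    return e / e.sum()

lr, batch = 0.1, 32
for _ in range(500):
    states = rng.integers(N_S, size=batch)
    pol_acts = np.array([rng.choice(N_A, p=pi(s)) for s in states])
    # Discriminator ascent: push d_w up on expert pairs (s, s) ...
    for s in states:
        D[s, s] += lr * (1.0 - sigmoid(D[s, s]))
    # ... and down on policy-generated pairs.
    for s, a in zip(states, pol_acts):
        D[s, a] -= lr * sigmoid(D[s, a])
    # Policy step: REINFORCE on reward -log(1 - d_w(s, a))
    # (equivalently, descend the cost log(1 - d_w(s, a))).
    for s, a in zip(states, pol_acts):
        r = -np.log(1.0 - sigmoid(D[s, a]))
        grad = -pi(s)
        grad[a] += 1.0                 # grad of log pi(a|s) w.r.t. logits
        P[s] += lr * r * grad / batch
```

After training, pi(s) should put most of its mass on the expert action in each state, even though the policy never sees an explicit reward function.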


SLIDE 10

Robotics Experiments

  • Robot imitating expert policy (Ho & Ermon, 2016)

[Video: robot behaviour learned with GAIL]

SLIDE 11

Imitation Learning from Observations

  • Consider imitation learning from a human expert:
    – Actions (e.g., forces) unobservable
    – Only states/observations (e.g., joint positions) observable
  • Problem: infer actions from state/observation sequences


Schaal et al., 2003

SLIDE 12

Inverse Dynamics

Two steps:

1. Learn inverse dynamics
   – Learn Pr(a | s, s') by supervised learning
   – From (s, a, s') samples obtained by executing random actions

2. Behavioural cloning
   – Learn π(â | s) by supervised learning
   – From (s, s') samples taken from expert trajectories, with â ∼ Pr(a | s, s') sampled from the inverse dynamics model


SLIDE 13

Pseudocode: Imitation Learning from Observations

Input: expert trajectories τ_E ∼ π_expert, where τ_E = (s_1, s_2, s_3, …)
Initialize agent policy π_θ at random
Repeat:
  Learn inverse dynamics model with parameters w:
    Sample (s_i^π, a_i^π, s_{i+1}^π) by executing π_θ
    w ← argmax_w Σ_i log Pr_w(a_i^π | s_i^π, s_{i+1}^π)
  Learn policy parameters θ:
    For each (s_i^E, s_{i+1}^E) from expert trajectories τ_E do:
      sample â_i^E ∼ Pr_w(â_i^E | s_i^E, s_{i+1}^E)
    θ ← argmax_θ Σ_i log π_θ(â_i^E | s_i^E)
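The two-phase loop can be sketched on a deterministic 1-D grid. The environment, the count-based inverse dynamics model, and the right-moving expert are all illustrative assumptions, not the setup of Torabi et al.:

```python
import random

# Toy behavioural cloning from observation: learn Pr(a | s, s') from
# self-generated experience, then clone the expert from states alone.
N_STATES = 5
ACTIONS = [-1, +1]

def step(s, a):
    """Deterministic dynamics: move left/right, clipped to the grid."""
    return min(max(s + a, 0), N_STATES - 1)

# Phase 1: inverse dynamics from random-action rollouts.
# Counts of which actions produced each transition suffice in this toy.
inv_counts = {}
rng = random.Random(0)
for _ in range(2000):
    s = rng.randrange(N_STATES)
    a = rng.choice(ACTIONS)
    inv_counts.setdefault((s, step(s, a)), []).append(a)

def infer_action(s, s_next):
    """Most likely action for the observed transition (s, s')."""
    samples = inv_counts[(s, s_next)]
    return max(set(samples), key=samples.count)

# Phase 2: behavioural cloning on expert *state* sequences only.
# The expert always moves right; its actions are never observed.
expert_states = list(range(N_STATES))  # 0, 1, 2, 3, 4
policy = {}
for s, s_next in zip(expert_states, expert_states[1:]):
    policy[s] = infer_action(s, s_next)  # label states with inferred actions
```

The learned policy should select +1 (move right) in every non-terminal state, recovering the expert behaviour without ever seeing an expert action.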


SLIDE 14

Robotics Experiments

Torabi et al., 2018
