Habitat: A Platform for Embodied AI Research
Dhruv Batra
Long-term Goal
Is there smoke in any room around you? Yes, in one room Go there and look for people …
Physical agent capable of taking actions in the world and talking to humans in natural language
Image Credit: Lockheed Martin; DARPA Robotics Challenge Slide credit: Abhishek Das
Internet AI → Embodied AI
Image Credit: ImageNet. Video Credit: Lee et al., 2012
Egocentric vision
No access to well-composed, curated images
Internet AI → Embodied AI
Active perception: action → observation loop
Agent controls the incoming data distribution
Internet AI → Embodied AI
Sparse rewards
Internet AI → Embodied AI
Language understanding
Internet AI → Embodied AI
Problems with reality:
- Slow
- Dangerous
- Expensive
- Difficult to control
- Not easily reproducible
Our Approach: Sim2Real
Resurrection of Embodied AI
All from 2017 or later (!)
Datasets: SUNCG (Song et al., 2017), Matterport3D (Chang et al., 2017), Stanford 2D-3D-S (Armeni et al., 2017)
Simulators: AI2-THOR (Kolve et al., 2017), MINOS (Savva et al., 2017), Gibson (Zamir et al., 2018), CHALET (Yan et al., 2018), House3D (Wu et al., 2017), HoME (Brodeur et al., 2018), VirtualHome (Puig et al., 2018), AdobeIndoorNav (Mo et al., 2018), Matterport3DSim (Anderson et al., 2018)
Tasks: EmbodiedQA (Das et al., 2018), Interactive QA (Gordon et al., 2018), Vision-Language Navigation (Anderson et al., 2018), Language grounding (Chaplot et al., 2017; Hermann & Hill et al., 2017), Visual Navigation (Zhu & Gordon et al., 2017; Savva et al., 2017; Wu et al., 2017)
Slide credit: Abhishek Das
Embodied Question Answering [CVPR ’18 Oral]
Abhishek Das (Georgia Tech), Samyak Datta (Georgia Tech), Georgia Gkioxari (FAIR), Stefan Lee (Georgia Tech), Devi Parikh (FAIR/Georgia Tech), Dhruv Batra (FAIR/Georgia Tech)
EmbodiedQA
Q: What is to the left of the shower? A: Cabinet
EmbodiedQA
Dataset: SUNCG (Song et al., 2017)
Simulator: House3D (Wu et al., 2017)
Task: EmbodiedQA
Datasets: Matterport3D [Chang et al., 3DV 2017]
10,800 panoramic views; 194,400 RGB-D images of 90 building-scale scenes
Example: House3D [Wu et al. 2017]
Slide credit: Manolis Savva
Example: MINOS [Savva et al. 2017]
Example: Gibson [Xia et al. 2018]
Example: AI2-THOR [Kolve et al. 2017]
Example: DeepMind Lab [Beattie et al. 2016]
Vision ∩ Language ∩ Robotics/RL
- Visual Navigation
- V&L Navigation, Embodied QA
- Language Grounding
Slide credit: Dhruv Batra
Our Vision
- Create the ImageNet/COCO/VQA of Embodied AI
- Dataset → Simulator → Task → Benchmark Challenge
Standardizing the Embodied Agent Stack
Datasets: SUNCG (Song et al., 2017), Matterport3D (Chang et al., 2017), 2D-3D-S (Armeni et al., 2017)
Simulators: AI2-THOR (Kolve et al., 2017), MINOS (Savva et al., 2017), Gibson (Zamir et al., 2018), CHALET (Yan et al., 2018), House3D (Wu et al., 2017)
Tasks: EmbodiedQA (Das et al., 2018), Interactive QA (Gordon et al., 2018), Vision-Language Navigation (Anderson et al., 2018), Language grounding (Hill et al., 2017), Visual Navigation (Zhu et al., 2017; Gupta et al., 2017)
Habitat Platform = Habitat Sim (generic dataset support) + Habitat API
Julian Straub (FRL), Richard Newcombe (FRL)
The Replica Dataset: A Digital Replica of Indoor Spaces [Straub et al. 2019]
FRL Surreal Team: high-quality 3D reconstructions
Manolis Savva (FAIR) Yili Zhao (FAIR)
Challenge: human vs. machine needs
1080p @ 60 Hz (humans) vs. 256×256 @ 1000+ Hz (training agents)
Habitat-Sim
- Photorealistic 3D simulator (C++ with pybind11)
- Generic 3D dataset support (Replica, Gibson, MP3D, +more)
- Fast: over 1,000 FPS single-threaded; 10,000 FPS multi-process (single GPU)
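The FPS figures above come from timing the render/step loop. A minimal sketch of such a measurement, with a trivial stand-in for the simulator step (the real benchmark renders 256×256 RGB frames):

```python
import time

def benchmark_fps(step_fn, num_frames=10_000):
    """Return frames per second for a simulator step callable."""
    start = time.perf_counter()
    for _ in range(num_frames):
        step_fn()
    elapsed = time.perf_counter() - start
    return num_frames / elapsed

# Stand-in step: a real benchmark would call the simulator's
# step/render here and read back an observation.
fps = benchmark_fps(lambda: None)
```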
Habitat-Sim: Dataset agnostic!
Benchmark: Frames Per Second (Gibson, AI2-THOR, MINOS, House3D vs. Habitat-Sim)
- Single-process: over 2x faster than the next-fastest simulator
- Multi-process: over 50x faster; 1.2 million frames in 180 seconds (~7,000 FPS), up to ~22,000 FPS
Why does speed matter?
Because you can now run experiments you couldn’t before.
PointGoal Navigation
Agent and Model Design
- 1.25m tall cylinder with 0.1m radius
- Actions:
  - <stop>: indicates the agent believes it has completed the task
  - <forward>: moves 0.25m forward
  - <left>, <right>: turn 10 degrees
- How do we train this agent?
  - Both the actions (they are discrete) and the simulation are non-differentiable
  - Use reinforcement learning!
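To make the bullets concrete, here is a toy, self-contained stand-in (not the Habitat simulator; everything except the step size and turn angle is made up) that shows the two training difficulties directly: the action branch is discrete, and the success reward is sparse.

```python
import math

FORWARD_STEP = 0.25  # meters, as above
TURN_ANGLE = 10.0    # degrees, as above

class ToyPointGoalEnv:
    """Minimal 2D stand-in for the navigation episode described above."""

    def __init__(self, goal=(2.0, 0.0), success_radius=0.2):
        self.goal = goal
        self.success_radius = success_radius
        self.reset()

    def reset(self):
        self.x, self.y, self.heading = 0.0, 0.0, 0.0  # heading in degrees
        self.done = False
        return (self.x, self.y, self.heading)

    def step(self, action):
        # Discrete branch: no gradient flows through this choice, which is
        # why the policy is trained with RL rather than by backpropagating
        # through the simulator.
        if action == "stop":
            self.done = True
        elif action == "forward":
            rad = math.radians(self.heading)
            self.x += FORWARD_STEP * math.cos(rad)
            self.y += FORWARD_STEP * math.sin(rad)
        elif action == "left":
            self.heading += TURN_ANGLE
        elif action == "right":
            self.heading -= TURN_ANGLE
        dist = math.hypot(self.x - self.goal[0], self.y - self.goal[1])
        # Sparse reward: nonzero only when the agent stops near the goal.
        reward = 1.0 if (self.done and dist < self.success_radius) else 0.0
        return (self.x, self.y, self.heading), reward, self.done

env = ToyPointGoalEnv()
env.reset()
for _ in range(8):                    # 8 x 0.25m = 2.0m straight ahead
    obs, reward, done = env.step("forward")
obs, reward, done = env.step("stop")  # stops at the goal: reward = 1.0
```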
Learning vs SLAM
Blind Agent (RL) vs. Depth Agent (RL)
The agent must decide between left, right, and straight at the end of the kitchen. The Goal Sensor (GPS+Compass) indicates straight. However, the agent can see there is a wall straight ahead and a wall on the left. It correctly predicts that right is the direction to pursue.
Backtracking
Standardizing the Embodied Agent Stack
Abhishek Kadian (FAIR) Oleksandr Maksymets (FAIR)
Habitat-API
- Modular high-level Python library
- Easy to define virtual robot configurations
- Multiple Embodied AI tasks: PointGoal, ObjectGoal, VLN, EmbodiedQA
- Baselines: classical robotics (SLAM), imitation and reinforcement learning
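The episode loop that such a modular, high-level library exposes can be sketched with stand-in classes. The names below are illustrative only, not the actual Habitat-API interface:

```python
class MockEnv:
    """Illustrative stand-in for a task environment with a
    reset / step / episode_over interface."""

    def __init__(self, max_steps=5):
        self.max_steps = max_steps
        self._steps = 0

    def reset(self):
        self._steps = 0
        # Observations are a dict of named sensor readings.
        return {"pointgoal": (1.0, 0.0)}

    @property
    def episode_over(self):
        return self._steps >= self.max_steps

    def step(self, action):
        self._steps += 1
        return {"pointgoal": (1.0, 0.0)}

class ForwardAgent:
    """A trivial baseline agent: always move forward."""

    def act(self, observations):
        return "forward"

def run_episode(env, agent):
    """The canonical embodied-AI evaluation loop."""
    observations = env.reset()
    steps = 0
    while not env.episode_over:
        observations = env.step(agent.act(observations))
        steps += 1
    return steps
```

Swapping in a different task or agent only changes which objects are constructed; the loop itself stays fixed, which is the point of standardizing the stack.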
PointGoal Navigation: Go to (x,y)
Slide credit: Abhishek Kadian
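The (x, y) goal used by PointGoal navigation is given to the agent relative to its own pose (this is what an idealized GPS+Compass sensor provides). A minimal sketch of that transformation, using a hypothetical helper that is just a 2D rotation:

```python
import math

def pointgoal_in_agent_frame(agent_xy, agent_heading_rad, goal_xy):
    """Rotate the world-frame offset to the goal by the negative of the
    agent's heading, yielding the goal in the agent's egocentric frame
    (+x forward, +y to the agent's left)."""
    dx = goal_xy[0] - agent_xy[0]
    dy = goal_xy[1] - agent_xy[1]
    cos_h = math.cos(-agent_heading_rad)
    sin_h = math.sin(-agent_heading_rad)
    return (dx * cos_h - dy * sin_h, dx * sin_h + dy * cos_h)

# Agent at the origin facing +y (heading pi/2); a goal at (1, 0) in
# world coordinates is therefore to the agent's right: (0, -1).
rel = pointgoal_in_agent_frame((0.0, 0.0), math.pi / 2, (1.0, 0.0))
```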
Habitat Challenge and Workshop @ CVPR ‘19
Standardizing the Embodied AI Stack
Datasets: Replica (Straub et al., 2019), Matterport3D (Chang et al., 2017), 2D-3D-S (Armeni et al., 2017)
Simulators: AI2-THOR, MINOS, Gibson, CHALET, House3D
Tasks: EmbodiedQA, Interactive QA, Vision-Language Navigation, Language grounding, Visual Navigation
Habitat Platform = Habitat Sim (generic dataset support) + Habitat API
Our Vision
- Create the ImageNet/COCO/VQA of Embodied AI
- Dataset → Simulator → Task → Benchmark Challenge
ICCV ‘19 [Best Paper Award Nominee]
Decentralized Distributed PPO: Mastering PointGoal Navigation
Erik Wijmans, Abhishek Kadian, Ari Morcos, Stefan Lee, Irfan Essa, Devi Parikh, Manolis Savva, Dhruv Batra
Decentralized Distributed PPO
(Diagram: four workers, each running simulation + RL, exchanging gradients with one another)
We use DD-PPO to train an agent for 2.5 billion steps of experience: over 180 days of GPU-time of training in under 3 days of wall-clock time.
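The decentralized, synchronous update that makes this scaling possible can be sketched in plain Python: every worker computes a gradient from its own rollouts, gradients are averaged by an all-reduce (here a stand-in function in place of torch.distributed), and every worker applies the same averaged gradient, so all policy replicas stay in sync without a central parameter server.

```python
def allreduce_mean(per_worker_grads):
    """Average a list of per-worker gradient vectors (the all-reduce)."""
    n = len(per_worker_grads)
    dim = len(per_worker_grads[0])
    return [sum(g[i] for g in per_worker_grads) / n for i in range(dim)]

def decentralized_step(params, per_worker_grads, lr=0.1):
    """Each worker applies the identical averaged gradient, so every
    replica of the policy ends up with the same parameters."""
    avg = allreduce_mean(per_worker_grads)
    return [p - lr * g for p, g in zip(params, avg)]

params = [1.0, 2.0]
grads = [[1.0, 0.0], [3.0, 2.0]]        # gradients from two workers
params = decentralized_step(params, grads)  # -> [0.8, 1.9]
```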
Visual Turing Test
Option 1 vs. Option 2: Learned Agent vs. Shortest-Path Oracle
Are We Making Real Progress in Simulated Environments? Measuring the Sim2Real Gap in Embodied Visual Navigation
Abhishek Kadian*, Erik Wijmans, Dhruv Batra, Joanne Truong*, Aaron Gokaslan, Alex Clegg, Manolis Savva, Stefan Lee, Sonia Chernova
Does progress in simulation translate to progress on real robots?
(In the context of embodied navigation)
Georgia Tech CODA Building Scans
- https://my.matterport.com/show/?m=yZVvKaJZghh
Sim-vs-Real Correlation Coefficient (SRCC)
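A coefficient of this kind can be computed as a correlation over paired sim/real scores, e.g. a Pearson correlation; the scores below are made up for illustration:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical paired scores: each entry is one agent's navigation
# performance in simulation vs. on the real robot. A coefficient near 1
# means ranking agents in sim predicts their ranking in reality.
sim_scores = [0.2, 0.5, 0.7, 0.9]
real_scores = [0.1, 0.4, 0.65, 0.8]
srcc = pearson(sim_scores, real_scores)
```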
Cheating by Sliding
(Figure: agent state s_t → s_t+1; 0.28m vs. 0.43m displacement; SPL path vs. sliding path)
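The SPL metric referenced here (Success weighted by Path Length, Anderson et al., 2018) is simple to compute, which makes the distortion from sliding easy to see: an inflated or shortened agent path changes the per-episode ratio. A direct implementation:

```python
def spl(successes, shortest_dists, actual_dists):
    """SPL = (1/N) * sum_i  S_i * l_i / max(p_i, l_i), where S_i is
    binary success, l_i the shortest-path distance to the goal, and
    p_i the length of the path the agent actually took."""
    total = 0.0
    for s, l, p in zip(successes, shortest_dists, actual_dists):
        total += s * (l / max(p, l))
    return total / len(successes)

# Two hypothetical episodes: a success along the optimal path, and a
# success after a 2x detour (discounted to 0.5); SPL = 0.75.
value = spl([1, 1], [5.0, 5.0], [5.0, 10.0])
```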
Sim-vs-Real Correlation Coefficient (SRCC)
Import Objects
- Why?
  - Egocentric CV
  - Domain randomization

Physics
- Why?
  - Intuitive physics
  - Robotics, sim2real
  - Egocentric CV

Habitat in Browser
- Why?
  - Grounded dialog via 2-player data collection
- Demo: https://aihabitat.org/iccv2019-demo/
Plans
- Full support for object interaction + physics
- Physics is slow! Need to spend time optimizing.
- Articulated robot integration (URDF)
- Humans-as-agents (Web + VR)
- CVPR20 Challenge
- PointGoal Navigation w/ GPS+Compass
- ObjectGoal Navigation
Manolis Savva*, Abhishek Kadian*, Oleksandr Maksymets*, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, Devi Parikh, Dhruv Batra (* denotes equal contribution)
Habitat Core Team
Marcus Rohrbach, Georgia Gkioxari, Xinlei Chen, Amanpreet Singh, Saurabh Gupta, Leo Guibas, Or Litany, Richard Newcombe, Steven Lovegrove, James Hillis, Michael Shvartsman, Naga Venkata Medathati