SLIDE 1

Self-Supervised Exploration via Disagreement

Deepak Pathak* (UC Berkeley), Dhiraj Gandhi* (CMU), Abhinav Gupta (CMU, FAIR)

* equal contribution

ICML 2019

SLIDE 2

Exploration – a major challenge!

SLIDE 3

Exploration – a major challenge!

  • Schmidhuber, J. "A possibility for implementing curiosity and boredom in model-building neural controllers." 1991.
  • Schmidhuber, J. "Formal theory of creativity, fun, and intrinsic motivation (1990–2010)." 2010.
  • Oudeyer, P.-Y. and Kaplan, F. "What is intrinsic motivation? A typology of computational approaches." Frontiers in Neurorobotics, 2009.
  • Poupart et al. "An analytic solution to discrete Bayesian reinforcement learning." ICML, 2006.
  • Lopes et al. "Exploration in model-based reinforcement learning by empirically estimating learning progress." NIPS, 2012.
  • Bellemare et al. "Unifying count-based exploration and intrinsic motivation." NIPS, 2016.
  • Mohamed et al. "Variational information maximisation for intrinsically motivated reinforcement learning." NIPS, 2015.
  • Houthooft et al. "VIME: Variational information maximizing exploration." NIPS, 2016.
  • Gregor et al. "Variational intrinsic control." ICLR Workshop, 2017.
  • Pathak et al. "Curiosity-driven exploration by self-supervised prediction." ICML, 2017.
  • Ostrovski et al. "Count-based exploration with neural density models." ICML, 2017.
  • Burda*, Edwards*, Pathak* et al. "Large-scale study of curiosity-driven learning." ICLR, 2019.
  • Eysenbach et al. "Diversity is all you need: Learning skills without a reward function." ICLR, 2019.
  • Savinov et al. "Episodic curiosity through reachability." ICLR, 2019.


SLIDES 7–10

Real Robots & Simulation

Sample Inefficient [millions of samples]; "Stuck" in Stochastic Envs

Curiosity Exploration w/ Noisy TV & Remote
[Burda*, Edwards*, Pathak* et al., ICLR 2019] [Juliani et al., arXiv 2019]

SLIDE 11

Why inefficient?

SLIDES 12–26

The curiosity formulation [Pathak et al. ICML, 2017]:

current image $x_t$ → policy network $\pi_\theta(x_t)$ → action $a_t$ → next image $x_{t+1}$

Prediction Model $f(x_t, a_t)$ → predicted next image $\hat{x}_{t+1}$

Intrinsic Reward: $r^i_t = \| \hat{x}_{t+1} - x_{t+1} \|_2^2$

Policy objective: $\max_\theta \, \mathbb{E}_\pi\big[ \sum_{t=1}^{T} r^i_t \big]$

Environment is a "black box" → hard optimization → REINFORCE
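As a concrete illustration, here is a minimal PyTorch sketch of the curiosity reward above. The `ForwardModel` architecture, the feature/action dimensions, and the function names are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class ForwardModel(nn.Module):
    """Hypothetical forward-dynamics model: predicts next-state features
    from current features and action (architecture is an assumption)."""
    def __init__(self, feat_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, feat_dim),
        )

    def forward(self, feat, act):
        return self.net(torch.cat([feat, act], dim=-1))

def curiosity_reward(model, feat_t, act_t, feat_tp1):
    """r_t^i = ||f(x_t, a_t) - x_{t+1}||^2, averaged over feature dims."""
    with torch.no_grad():
        pred = model(feat_t, act_t)
    return ((pred - feat_tp1) ** 2).mean(dim=-1)
```

Note that this reward depends on the true next image returned by the black-box environment, which is why the policy gradient has to be estimated with REINFORCE rather than computed by backpropagation.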

SLIDES 27–35

Exploration via disagreement:

current image $x_t$ → policy network $\pi_\theta(x_t)$ → action $a_t$ → next image $x_{t+1}$

Instead of a single prediction model, train an ensemble $f^1, f^2, \ldots, f^k$: each model predicts $\hat{x}^k_{t+1} = f^k(x_t, a_t)$ and is trained to minimize its own error $\| \hat{x}^k_{t+1} - x_{t+1} \|^2$ against the true next image.

Intrinsic Reward = Disagreement: $r^i_t = \sigma^2\big(\{ \hat{x}^k_{t+1} \}\big)$, the variance across the ensemble's predictions.
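A minimal sketch of the disagreement reward, reusing the hypothetical `ForwardModel` from the earlier snippet: the reward is the variance across the ensemble's predictions and never touches the true next state.

```python
def disagreement_reward(models, feat_t, act_t):
    """r_t^i = variance of the ensemble's next-state predictions.
    The true next state x_{t+1} is NOT used here; it enters only the
    separate supervised training of each ensemble member."""
    with torch.no_grad():
        preds = torch.stack([m(feat_t, act_t) for m in models])  # (k, B, D)
    return preds.var(dim=0).mean(dim=-1)  # mean variance over feature dims
```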

SLIDES 36–39

Deterministic Environments: performs as well as state-of-the-art methods.

[Figure: extrinsic reward (not used for training) vs. number of frames (in millions), on Atari games.]

SLIDES 40–43

Stochastic Environments: every model's prediction converges to the mean of the stochastic outcome → the ensemble variance drops → the agent does not get stuck.

[Figure: 3D navigation; reward (not used for training) vs. number of frames (in millions), with and without the noisy TV + remote.]
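To see why the variance drops, here is a tiny NumPy thought experiment (all numbers illustrative): when the "next state" is pure noise, independently trained ensemble members all regress to its mean under a squared loss, so their disagreement vanishes even though each member's prediction error stays large.

```python
import numpy as np

rng = np.random.default_rng(0)
targets = rng.normal(size=10_000)  # "noisy TV": next state is pure noise

# Each member, fit with a squared loss on its own bootstrap sample,
# converges to (roughly) the mean of the noise.
preds = np.array([rng.choice(targets, 5_000).mean() for _ in range(5)])

print(preds.var())                               # disagreement ~ 0
print(((targets[:, None] - preds) ** 2).mean())  # per-member error stays ~ 1
```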

SLIDES 44–48

Disagreement (same formulation, stated over states):

current state $x_t$ → policy network $\pi_\theta(x_t)$ → action $a_t$ → next state $x_{t+1}$; ensemble predictions $\hat{x}^1_{t+1}, \ldots, \hat{x}^k_{t+1}$; disagreement reward $r^i_t = \sigma^2\big(\{ \hat{x}^k_{t+1} \}\big)$.

Unlike the curiosity reward, the disagreement reward has no dependency on the environment!

→ Differentiable Exploration

SLIDES 49–51

Differentiable Exploration

Pathak*, Gandhi*, Gupta. "Self-Supervised Exploration via Disagreement." ICML, 2019.

Model Optimization:

$$\min_{\theta_1, \ldots, \theta_k} \; \sum_{i=1}^{k} \big\| f_{\theta_i}\big(x_t, \pi(x_t; \theta_P)\big) - x_{t+1} \big\|^2$$

Policy Optimization:

$$\max_{\theta_P} \; \sum_{i=1}^{k} \Big\| f_{\theta_i}\big(x_t, \pi(x_t; \theta_P)\big) - \frac{1}{k} \sum_{j=1}^{k} f_{\theta_j}\big(x_t, \pi(x_t; \theta_P)\big) \Big\|^2$$

Because the disagreement reward is a differentiable function of the action, the policy parameters $\theta_P$ can be updated by backpropagating through the frozen ensemble instead of relying on REINFORCE.
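Below is a hedged PyTorch sketch of one alternating update for the two objectives above, again reusing the hypothetical `ForwardModel` ensemble and assuming a deterministic policy network with continuous actions; it illustrates the idea rather than reproducing the paper's exact training loop.

```python
def differentiable_exploration_step(models, policy, feat_t, feat_tp1,
                                    model_opt, policy_opt):
    act_t = policy(feat_t)  # differentiable w.r.t. policy parameters

    # Model optimization: each ensemble member regresses the true next
    # state; the action is detached so this step trains only the models.
    preds = torch.stack([m(feat_t, act_t.detach()) for m in models])
    model_loss = ((preds - feat_tp1) ** 2).mean()
    model_opt.zero_grad()
    model_loss.backward()
    model_opt.step()

    # Policy optimization: ascend the ensemble's disagreement directly,
    # backpropagating through the models into the policy. policy_opt
    # holds only the policy's parameters, so the models stay frozen.
    preds = torch.stack([m(feat_t, act_t) for m in models])
    policy_loss = -((preds - preds.mean(dim=0, keepdim=True)) ** 2).mean()
    policy_opt.zero_grad()
    policy_loss.backward()
    policy_opt.step()
```

Because gradients flow from the disagreement straight into $\theta_P$, this sidesteps the high-variance REINFORCE estimator, which is the efficiency gain the later slides quantify.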

SLIDES 52–55

Differentiable Exploration on a real robot.

Position Control action space: 1. Position, 2. Direction, 3. Gripper Angle, 4. Gripper Distance.

Pathak*, Gandhi*, Gupta. "Self-Supervised Exploration via Disagreement." ICML, 2019.

SLIDES 56–59

Efficiency over REINFORCE

[Figure: object interaction rate vs. number of training samples.]

Learned skills: pushing, picking.

Pathak*, Gandhi*, Gupta. "Self-Supervised Exploration via Disagreement." ICML, 2019.

SLIDES 60–64

Summary: Exploration via Disagreement

  • Similar to state-of-the-art in deterministic envs (Atari games)
  • Does not get stuck in stochastic scenarios (Stochastic Atari; Unity-TV)
  • Differentiable reformulation for real robots (Sawyer robot)

SLIDE 65

Code Available

https://pathak22.github.io/exploration-by-disagreement/

SLIDE 66

Thank you!

Poster # 39 (today)

Pathak*, Gandhi*, Gupta. "Self-Supervised Exploration via Disagreement." ICML, 2019.