Self-Supervised Exploration via Disagreement
Deepak Pathak* (UC Berkeley), Dhiraj Gandhi* (CMU), Abhinav Gupta (CMU, FAIR)
* equal contribution
ICML 2019
Exploration: a major challenge!
[figure: Mohamed et al.]
References:
Schmidhuber. "A possibility for implementing curiosity and boredom in model-building neural controllers". 1991.
Schmidhuber. "Formal theory of creativity, fun, and intrinsic motivation (1990–2010)". 2010.
Oudeyer, Kaplan. "What is intrinsic motivation? A typology of computational approaches". 2007.
Poupart et al. "An analytic solution to discrete Bayesian reinforcement learning". ICML, 2006.
Lopes et al. "Exploration in model-based reinforcement learning by empirically estimating learning progress". NIPS, 2012.
Bellemare et al. "Unifying count-based exploration and intrinsic motivation". NIPS, 2016.
Mohamed, Rezende. "Variational information maximisation for intrinsically motivated reinforcement learning". NIPS, 2015.
Houthooft et al. "VIME: Variational information maximizing exploration". NIPS, 2016.
Workshop, 2017.
Pathak et al. "Curiosity-driven exploration by self-supervised prediction". ICML, 2017.
Ostrovski et al. "Count-based exploration with neural density models". ICML, 2017.
Burda et al. "Large-scale study of curiosity-driven learning". ICLR, 2019.
Eysenbach et al. "Diversity is all you need: Learning skills without a reward function". ICLR, 2019.
Savinov et al. "Episodic curiosity through reachability". ICLR, 2019.
Simulation / Real Robots
Curiosity Exploration w/ Noisy TV & Remote
[Burda*, Edwards*, Pathak* et al., ICLR'19] [Juliani et al., arXiv'19]
Curiosity via prediction error [Pathak et al., ICML 2017]:
- The policy network π_θ maps the current image x_t to an action a_t; the environment returns the next image x_{t+1}.
- A prediction (forward) model f takes (x_t, a_t) and outputs the predicted next image x̂_{t+1} = f(x_t, a_t).
- The intrinsic reward is the prediction error: r^i_t = ‖ x̂_{t+1} − x_{t+1} ‖²₂.
- The agent maximizes the expected intrinsic return, max_θ E[ Σ_{t=1}^{T} r^i_t ], via REINFORCE.
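A minimal numpy sketch of this prediction-error reward, with a fixed random linear map standing in for the learned forward model f (the dimensions and the linear form are illustrative assumptions, not the paper's architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (assumed for illustration): 8-dim observations, 2-dim actions.
OBS_DIM, ACT_DIM = 8, 2
F = rng.normal(size=(OBS_DIM, OBS_DIM + ACT_DIM))  # stand-in for the learned model f

def predict_next(x_t, a_t):
    """Forward model x_hat_{t+1} = f(x_t, a_t): a linear stand-in for a neural net."""
    return F @ np.concatenate([x_t, a_t])

def curiosity_reward(x_t, a_t, x_next):
    """Intrinsic reward r^i_t = || f(x_t, a_t) - x_{t+1} ||_2^2 (prediction error)."""
    err = predict_next(x_t, a_t) - x_next
    return float(err @ err)
```

The reward is zero exactly when the model predicts the next observation perfectly, and grows with the squared error, so the policy is pushed toward transitions the model has not yet learned.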
Exploration via disagreement: instead of a single forward model, train an ensemble f¹, …, f^k. Each model predicts the next image from the same input, x̂^i_{t+1} = f^i(x_t, a_t), and is trained on its own prediction error ‖ x_{t+1} − x̂^i_{t+1} ‖². The intrinsic reward is the disagreement across the ensemble, r^i_t = σ²( x̂¹_{t+1}, …, x̂^k_{t+1} ): it is large where the models have seen little data, and it shrinks as the models converge toward a common (mean) prediction, even when the transition itself is stochastic.
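The disagreement signal can be sketched in a few lines of numpy; the (k, d) prediction shape and the mean-over-dimensions reduction are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def disagreement_reward(preds):
    """Intrinsic reward = variance across the ensemble's predictions.

    preds: array-like of shape (k, d) -- k models' predicted next-state features.
    Returns the mean over feature dimensions of the per-dimension ensemble
    variance, i.e. the average squared deviation from the ensemble mean.
    """
    preds = np.asarray(preds, dtype=float)
    centered = preds - preds.mean(axis=0)  # deviation of each model from the mean
    return float((centered ** 2).mean())
```

When all models agree the reward is exactly zero, which is why this signal, unlike raw prediction error, stops rewarding irreducibly stochastic transitions once the models have fit them.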
[Plot: extrinsic reward (not used for training) vs. number of frames (in millions); conditions: w/o TV, Noisy TV w/ Remote]
[The same diagram repeated with states x_t in place of images, contrasting the curiosity (prediction-error) reward with the disagreement reward]
Pathak*, Gandhi*, Gupta. “Self-Supervised Exploration via Disagreement“. ICML, 2019.
Train the ensemble by minimizing prediction error, and train the policy to maximize the ensemble's disagreement. Because a_t = π(x_t; θ_P) feeds directly into the differentiable models, the policy gradient can be taken through the intrinsic reward itself:

for each model i:  min_{θ_i}  Σ_{t=1}^{T} ‖ f_{θ_i}(x_t, π(x_t; θ_P)) − x_{t+1} ‖²

policy:  max_{θ_P}  Σ_{t=1}^{T} (1/k) Σ_{i=1}^{k} ‖ f_{θ_i}(x_t, π(x_t; θ_P)) − (1/k) Σ_{j=1}^{k} f_{θ_j}(x_t, π(x_t; θ_P)) ‖²
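As a sanity check on these two objectives, here is a toy numpy sketch with k linear forward models and hand-written gradient steps (the linear form, dimensions, and learning rate are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (assumption): k linear forward models f_i(x, a) = W_i @ [x; a].
K, DX, DA = 5, 3, 2
Ws = [rng.normal(size=(DX, DX + DA)) for _ in range(K)]

def ensemble_step(Ws, x, a, x_next, lr=0.1):
    """One SGD step of the model objective: min_i || f_i(x, a) - x_next ||^2."""
    inp = np.concatenate([x, a])
    for W in Ws:
        err = W @ inp - x_next              # per-model prediction error
        W -= lr * 2.0 * np.outer(err, inp)  # gradient of the squared error
    return Ws

def disagreement(Ws, x, a):
    """The quantity the policy maximizes: variance of the ensemble's predictions."""
    inp = np.concatenate([x, a])
    preds = np.stack([W @ inp for W in Ws])
    return float(((preds - preds.mean(axis=0)) ** 2).mean())
```

In this toy, every model trains on the same transition, so the predictions converge and the disagreement on that transition decays toward zero; in practice the ensemble members start from different initializations (and may see different data orderings), which keeps the signal informative early in training.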
Efficiency over REINFORCE
[Plot: object interaction rate vs. training samples, for the pushing and picking skills]
Experiments: Atari games; stochastic Atari and Unity-TV; Sawyer robot.