SLIDE 1

Efficient Exploration by Novelty Pursuit

Ziniu Li
ziniuli@link.cuhk.edu.cn
The Chinese University of Hong Kong, Shenzhen & Polixir

Joint work with Xiong-Hui Chen, Nanjing University

International Conference on Distributed Artificial Intelligence (DAI), 2020

October 12, 2020

SLIDE 2

Overview

Introduction
Proposed Method
Experiment
Conclusion

SLIDE 3

Outline

Introduction
Proposed Method
Experiment
Conclusion

SLIDE 4

Reinforcement Learning

◮ RL is a learning paradigm in which an agent interacts with an unknown environment to find optimal decisions.

[Figure 1: The agent-environment interaction loop: the agent sends an action to the environment; the environment returns a reward and an observation.]

◮ RL learns directly from stochastic feedback tuples (s, a, r, s′).
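To make this concrete, here is a minimal sketch of the interaction loop in the style of the OpenAI Gym API; the `policy` callable and the `horizon` cap are illustrative assumptions, not part of the slides.

```python
def rollout(env, policy, horizon=1000):
    """Collect stochastic feedback tuples (s, a, r, s') from one episode
    of interaction with a Gym-style environment."""
    transitions = []
    state = env.reset()
    for _ in range(horizon):
        action = policy(state)                          # agent acts
        next_state, reward, done, _ = env.step(action)  # environment responds
        transitions.append((state, action, reward, next_state))
        state = next_state
        if done:
            break
    return transitions
```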

SLIDE 5

Exploration vs. Exploitation

◮ In an unknown environment, the agent is uncertain about the possible outcomes.
◮ Exploration: investigate unknown actions, which may bring large returns or unexpected losses.
◮ Exploitation: implement the well-known but possibly sub-optimal actions.

SLIDE 6

Towards Efficient Exploration

◮ Simple exploration strategies based on “dithering” are inefficient.

  • ε-greedy and Boltzmann strategies require almost O(2^N) samples to make progress on deep-sea [Osband et al., 2019].

Figure 2: Deep-sea. Figure from [Osband et al., 2019]. There are only two actions, and the reward is released only at the bottom-right corner.
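For reference, a minimal sketch of the ε-greedy rule discussed above; `q_values` (a per-action value estimate) is an assumed input.

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """Dithering exploration: with probability epsilon take a uniformly
    random action, otherwise act greedily on the current value estimates.
    On deep-sea, such undirected noise makes progress only after
    roughly O(2^N) samples."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```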

SLIDE 7

Towards Efficient Exploration

◮ Theoretically, efficient exploration requires writing off the known, inferior actions.
◮ There are two general principles borrowed from bandits: optimism in the face of uncertainty (OFU) and Thompson sampling (TS).

  • OFU: add a “reward bonus” by constructing upper confidence intervals [Stadie et al., 2015, Pathak et al., 2017, Burda et al., 2019b].
  • TS: sample plausible actions from an iteratively updated posterior distribution [Osband et al., 2016a,b, O’Donoghue et al., 2018].
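As one concrete, count-based instance of the OFU principle, rewards can be augmented with a bonus that shrinks with visitation; this is a sketch, and `beta` is an assumed coefficient rather than a value from the cited papers.

```python
import math
from collections import defaultdict

visit_counts = defaultdict(int)  # N(s): visitation count per (hashable) state

def ofu_reward(env_reward, state, beta=0.1):
    """Optimism via a reward bonus: beta / sqrt(N(s)) plays the role of an
    upper-confidence-interval width that decays as uncertainty shrinks."""
    visit_counts[state] += 1
    return env_reward + beta / math.sqrt(visit_counts[state])
```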

SLIDE 8

Efficient Exploration for Large-scale MDPs

◮ Large-scale MDPs (e.g., Montezuma’s Revenge, SuperMarioBros, and StarCraft) are very challenging.

  • State space (and action space) is huge.
  • Planning horizon is long.
  • Rewards are sparse.

◮ OFU may lead to unexpected results, and the weights that balance the bonus against the environment reward cannot be learned from the environment [Fortunato et al., 2018, Burda et al., 2019a].
◮ Computing the posterior distribution required for TS is intractable [Osband et al., 2019, Russo et al., 2018].

SLIDE 9

Outline

Introduction
Proposed Method
Experiment
Conclusion

SLIDE 10

Novelty-Pursuit: Intuition

◮ Large-scale MDPs require efficiently visiting the whole state space.
◮ A goal-conditioned policy is proficient at reaching a desired state.
◮ We can learn a (near-)optimal policy from the diverse experience generated by a class of goal-conditioned policies that explore the whole state space.

SLIDE 11

Novelty-Pursuit: Method

◮ (Exploration) We mark seldom-visited states as targets for the goal-conditioned policy.
◮ (Exploration) Once the target is achieved, the agent performs random actions to discover new states.
◮ (Exploitation) A deployment (evaluation) policy learns from the collected experience by off-policy learning. A minimal sketch of the resulting exploration loop follows.
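This sketch illustrates the two phases of one exploration episode; `select_novel_goal`, `goal_policy`, the `reached` predicate, and the buffer interface are assumed placeholders, not the paper's actual implementation.

```python
def novelty_pursuit_episode(env, goal_policy, select_novel_goal, buffer, reached):
    """Phase 1: steer toward a seldom-visited goal on the exploration
    boundary; phase 2: take random actions to discover new states."""
    goal = select_novel_goal(buffer)              # a seldom-visited state
    state, done = env.reset(), False
    while not done and not reached(state, goal):  # phase 1: reach the boundary
        action = goal_policy(state, goal)
        state, _, done, _ = env.step(action)
        buffer.add(state)
    while not done:                               # phase 2: random discovery
        action = env.action_space.sample()
        state, _, done, _ = env.step(action)
        buffer.add(state)
```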

SLIDE 12

Novelty-Pursuit: Method

Figure 3: Illustration for novelty-pursuit. First, a goal-conditioned policy plans to reach the exploration boundary; then, it performs random actions to discover new states.

SLIDE 14

Novelty-Pursuit: Select Goals

◮ We select novel states from the buffer B to set goals.
◮ There are many approximation methods to measure novelty.

  • To name a few: pseudo-counts [Bellemare et al., 2016, Ostrovski et al., 2017], prediction error [Stadie et al., 2015, Pathak et al., 2017], random network distillation (RND) [Burda et al., 2019b], etc.

◮ In practice, we maintain a small priority buffer to store the recent novel states, as sketched below.
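A minimal sketch of such a priority buffer; the novelty score (e.g., an RND prediction error) is computed externally, and the capacity and tie-breaking counter are illustrative choices.

```python
import heapq
import itertools

class NoveltyBuffer:
    """Keep the `capacity` most novel states seen so far. A min-heap puts
    the least novel stored state on top, so it is evicted first."""

    def __init__(self, capacity=64):
        self.capacity = capacity
        self.heap = []                     # entries: (novelty, tiebreak, state)
        self.tiebreak = itertools.count()  # avoids comparing states on ties

    def add(self, state, novelty):
        entry = (novelty, next(self.tiebreak), state)
        if len(self.heap) < self.capacity:
            heapq.heappush(self.heap, entry)
        elif novelty > self.heap[0][0]:
            heapq.heapreplace(self.heap, entry)

    def select_goal(self):
        # the most novel stored state becomes the next exploration goal
        return max(self.heap)[2]
```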

SLIDE 15

Novelty-Pursuit: Train Goal Policy

◮ Training the goal-conditioned policy with sparse rewards (e.g., 1 for success and 0 for failure) is difficult.
◮ We use hindsight experience replay (HER) [Andrychowicz et al., 2017] (or reward shaping) to accelerate learning.

  • HER: relabel each episode with a goal the agent actually achieved rather than the one it was trying to achieve (see the sketch below).
  • Reward shaping: introduce additional training rewards with mild conditions to guide the agent.
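A minimal sketch of hindsight relabeling with the "final" strategy; the transition format and `reward_fn` are assumed interfaces.

```python
def her_relabel(episode, reward_fn):
    """Relabel an episode with the goal it actually achieved (its final
    state) and recompute the sparse rewards accordingly.
    `episode` is a list of (state, action, next_state) transitions."""
    achieved_goal = episode[-1][2]            # the state the agent ended in
    relabeled = []
    for state, action, next_state in episode:
        reward = reward_fn(next_state, achieved_goal)  # e.g., 1 if reached, else 0
        relabeled.append((state, action, reward, next_state, achieved_goal))
    return relabeled
```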

SLIDE 16

Novelty-Pursuit: Train Deployment Policy

◮ We use off-policy learning algorithms to learn the deployment (evaluation) policy from the experience buffer.

  • ACER [Wang et al., 2017] for environments with discrete action spaces.
  • DDPG [Lillicrap et al., 2016] for environments with continuous action spaces.

◮ To remedy extrapolation error [Fujimoto et al., 2019] in off-policy learning, we allow this policy to interact with the environment occasionally, as sketched below.
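A minimal sketch of this data-collection scheme; the mixing probability `deploy_prob` is an assumed knob, not a value from the paper.

```python
import random

def collect_episode(env, exploration_policy, deployment_policy, buffer,
                    deploy_prob=0.1):
    """Mostly roll out the goal-conditioned exploration policy, but
    occasionally act with the deployment policy itself, so that its own
    state-action distribution appears in the replay buffer; this mitigates
    extrapolation error in off-policy learning."""
    policy = deployment_policy if random.random() < deploy_prob else exploration_policy
    state, done = env.reset(), False
    while not done:
        action = policy(state)
        next_state, reward, done, _ = env.step(action)
        buffer.append((state, action, reward, next_state, done))
        state = next_state
```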

SLIDE 17

Outline

Introduction
Proposed Method
Experiment
Conclusion

SLIDE 18

E1: The Efficiency of Exploration

◮ We use the entropy of the visited-state distribution to measure the efficiency of exploration.

  • A larger entropy implies a flatter distribution and hence better coverage (see the sketch below).
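A minimal sketch of this metric over a log of visited states; states are assumed hashable (e.g., grid cells).

```python
import math
from collections import Counter

def visitation_entropy(visited_states):
    """Shannon entropy of the empirical state-visitation distribution.
    Higher entropy means flatter, more uniform coverage of the state space."""
    counts = Counter(visited_states)
    total = sum(counts.values())
    return -sum((n / total) * math.log(n / total) for n in counts.values())
```

On the 17 × 17 maze, the uniform distribution over 289 cells attains the maximum ln(289) ≈ 5.666, which matches the "maximum" row in Table 1.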

◮ We consider two baselines: a random walk and the exploration method based on reward bonus [Burda et al., 2019b].
◮ We also consider variants of our method with oracles.

  • planning-oracle: the oracle provides the shortest path to reach goals.
  • counts-oracle: the oracle provides exact counts to select the seldom-visited states.

SLIDE 19

E1: The Efficiency of Exploration

Table 1: Average entropy of visited state distribution over 5 random seeds on the 17 × 17 maze environment.

Method                             Entropy
random                             5.129 ± 0.021
bonus                              5.138 ± 0.085
novelty-pursuit                    5.285 ± 0.073
novelty-pursuit-planning-oracle    5.513 ± 0.077
novelty-pursuit-counts-oracle      5.409 ± 0.059
novelty-pursuit-oracles            5.627 ± 0.001
maximum                            5.666

SLIDE 20

E2: Training of Goal Policy

◮ We compare techniques for training the goal-conditioned policy on the 17 × 17 maze environment.

[Figure: episode return vs. training steps (up to 200k), comparing reward shaping, HER, and a distance-based reward.]

◮ HER and reward shaping work well.

SLIDE 21

E3: Performance on Complicated Tasks

◮ Baselines: bonus [Burda et al., 2019b] and the vanilla methods described below.

  • ACER [Wang et al., 2017] (with entropy regularization) for Empty Room, Four Rooms, and SuperMarioBros.
  • DDPG [Lillicrap et al., 2016] (with Gaussian noise) for Fetch Reach [Andrychowicz et al., 2017].

◮ Environments: Empty Room, Four Rooms, Fetch Reach and SuperMarioBros.

SLIDE 22

E3: Performance on Complicated Tasks

[Figure 5: Training curves (episode return vs. steps) of learned policies over 5 random seeds on the Empty Room, Four Rooms, and Fetch Reach environments, comparing novelty-pursuit, bonus, and vanilla.]

[Figure 6: Training curves (episode return vs. steps) of learned policies over 3 random seeds on SuperMarioBros-1-1, -1-2, and -1-3, comparing novelty-pursuit, bonus, and vanilla.]

SLIDE 23

E3: Performance on Complicated Tasks

Figure 7: Trajectory visualization on SuperMarioBros-1-3. Trajectories are plotted as green circles using the same number of training samples (18 million). The agent starts at the far left and must fetch the flag at the far right. Top row: vanilla ACER; middle row: ACER + exploration bonus; bottom row: novelty-pursuit (ours).

Video is available at http://www.liziniu.org/images/mario-video.mp4.

SLIDE 24

Outline

Introduction
Proposed Method
Experiment
Conclusion

SLIDE 25

Conclusion

◮ The proposed method, novelty-pursuit, efficiently handles large-scale MDPs such as SuperMarioBros.

  • First, a goal-conditioned policy plans to reach the exploration boundary;
  • Subsequently, it performs random actions to discover new states.

◮ We will consider extending this idea to more challenging tasks like Montezuma’s Revenge and Sonic the Hedgehog.

SLIDE 26

Bibliography

M. Andrychowicz, D. Crow, A. Ray, J. Schneider, R. Fong, P. Welinder, B. McGrew, J. Tobin, P. Abbeel, and W. Zaremba. Hindsight experience replay. In Advances in Neural Information Processing Systems 30, pages 5048–5058, 2017.

M. G. Bellemare, S. Srinivasan, G. Ostrovski, T. Schaul, D. Saxton, and R. Munos. Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems 29, pages 1471–1479, 2016.

Y. Burda, H. Edwards, D. Pathak, A. J. Storkey, T. Darrell, and A. A. Efros. Large-scale study of curiosity-driven learning. In Proceedings of the 7th International Conference on Learning Representations, 2019a.

Y. Burda, H. Edwards, A. J. Storkey, and O. Klimov. Exploration by random network distillation. In Proceedings of the 7th International Conference on Learning Representations, 2019b.

SLIDE 27

Bibliography (cont.)

M. Fortunato, M. G. Azar, B. Piot, J. Menick, M. Hessel, I. Osband, A. Graves, V. Mnih, R. Munos, D. Hassabis, O. Pietquin, C. Blundell, and S. Legg. Noisy networks for exploration. In Proceedings of the 6th International Conference on Learning Representations, 2018.

S. Fujimoto, D. Meger, and D. Precup. Off-policy deep reinforcement learning without exploration. In Proceedings of the 36th International Conference on Machine Learning, pages 2052–2062, 2019.

T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. In Proceedings of the 4th International Conference on Learning Representations, 2016.

B. O’Donoghue, I. Osband, R. Munos, and V. Mnih. The uncertainty Bellman equation and exploration. In Proceedings of the 35th International Conference on Machine Learning, pages 3836–3845, 2018.

SLIDE 28

Bibliography (cont.)

I. Osband, C. Blundell, A. Pritzel, and B. V. Roy. Deep exploration via bootstrapped DQN. In Advances in Neural Information Processing Systems 29, pages 4026–4034, 2016a.

I. Osband, B. V. Roy, and Z. Wen. Generalization and exploration via randomized value functions. In Proceedings of the 33rd International Conference on Machine Learning, pages 2377–2386, 2016b.

I. Osband, B. V. Roy, D. J. Russo, and Z. Wen. Deep exploration via randomized value functions. Journal of Machine Learning Research, 20(124):1–62, 2019.

G. Ostrovski, M. G. Bellemare, A. van den Oord, and R. Munos. Count-based exploration with neural density models. In Proceedings of the 34th International Conference on Machine Learning, pages 2721–2730, 2017.

D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell. Curiosity-driven exploration by self-supervised prediction. In Proceedings of the 34th International Conference on Machine Learning, pages 2778–2787, 2017.

SLIDE 29

Bibliography (cont.)

D. Russo, B. V. Roy, A. Kazerouni, I. Osband, and Z. Wen. A tutorial on Thompson sampling. Foundations and Trends in Machine Learning, 11(1):1–96, 2018.

B. C. Stadie, S. Levine, and P. Abbeel. Incentivizing exploration in reinforcement learning with deep predictive models. arXiv preprint arXiv:1507.00814, 2015.

Z. Wang, V. Bapst, N. Heess, V. Mnih, R. Munos, K. Kavukcuoglu, and N. de Freitas. Sample efficient actor-critic with experience replay. In Proceedings of the 5th International Conference on Learning Representations, 2017.

SLIDE 30

Acknowledgements

We thank Fabio Pardo for sharing ideas on visualizing trajectories for SuperMarioBros. In addition, we appreciate the helpful guidance from Dr. Yang Yu and the insightful discussions with Tian Xu and Xianghan Kong.
