

SLIDE 1

Overview of Robot Decision Making

Prof. Yuke Zhu
CS391R: Robot Learning (Fall 2020)

SLIDE 2

Today’s Agenda

  • What is Robot Decision Making?
  • Mathematical Framework of Sequential Decision Making
  • Learning for Decision Making
    ○ reinforcement learning (model-free vs. model-based)
    ○ imitation learning (behavior cloning, DAgger, IRL, and adversarial learning)
  • Research Frontiers
    ○ compositionality, learning to learn, …

SLIDE 3

Robot Learning is to close the perception-action loop.

[Figure: perception-action loops; sources: Levine et al. JMLR 2016, Bohg et al. ICRA 2018, Sa et al. IROS 2014]

SLIDE 4

What is Robot Decision Making?

Choosing the action a robot should perform in the physical world…

[Examples: Assistive Robots (Companions), Outer Space (Explorers), Autonomous Driving (Transporters)]

SLIDE 5

What is Robot Decision Making?

Choosing the action a robot should perform in the physical world…

  • Behaviors can’t be easily programmed
  • Safety and robustness under uncertainty
  • Imperfect sensing and actuation

[Source: Boston Dynamics]

SLIDE 6

Robot Decision Making vs. Playing Games

Robot decision making is embodied, active, and environmentally situated.

[Source: Boston Dynamics] [Source: DeepMind’s AlphaGo]

SLIDE 7

Before We Dive In…

  • This lecture is intended to provide a high-level, bird's-eye view of (robot) decision making.
  • The goal is not to go through all technical details: we will revisit them through paper reading in the following weeks. Study the parts you are less familiar with from online resources.
  • Take related courses and read textbooks to learn this subject in depth (see the last slide).

SLIDE 8

Mathematical Framework: Markov Decision Processes

A Markov Decision Process is defined by a tuple $M = \langle S, A, P, R, \gamma \rangle$

  • $S$: state space $(s_t \in S)$
  • $A$: action space $(a_t \in A)$
  • $P$: transition probability, $P^a_{ss'} = \Pr[s_{t+1} = s' \mid s_t = s, a_t = a]$
  • $R$: reward function, $r(s, a) = \mathbb{E}[r_{t+1} \mid s_t = s, a_t = a]$
  • $\gamma$: a discount factor, $\gamma \in [0, 1]$
SLIDE 9

Mathematical Framework: Markov Decision Processes

A Markov Decision Process is defined by a tuple $M = \langle S, A, P, R, \gamma \rangle$

A policy maps states to actions: $\pi : S \rightarrow A$

Goal of (robot) decision making: choose the policy that maximizes cumulative reward

$$\pi^* = \arg\max_\pi \mathbb{E}\Big[\sum_{t \ge 0} \gamma^t r(s_t, \pi(s_t))\Big]$$
SLIDE 10

Mathematical Framework: Markov Decision Processes

We define two functions given a policy $\pi$

Value function: the expected cumulative discounted reward when acting according to the policy from a given state

$$V^\pi(s) = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t r(s_t, \pi(s_t)) \,\Big|\, s_0 = s\Big]$$

Q function: the expected cumulative discounted reward when acting according to the policy from a given state and taking a given action

$$Q^\pi(s, a) = r(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\, V^\pi(s')$$

From the optimal Q function we recover the optimal policy and value: $\pi^*(s) = \arg\max_a Q^*(s, a)$ and $V^*(s) = \max_a Q^*(s, a)$.

SLIDE 11

Solving MDPs with Known Models

When we know the model of the MDP $M = \langle S, A, P, R, \gamma \rangle$

Value Iteration (using the model)
  • 1. Estimate the optimal value function
  • 2. Compute the optimal policy from the optimal value function

Policy Iteration (using the model)
  • 1. Start with a random policy
  • 2. Iteratively improve it until convergence to the optimal policy

Both use ideas from Dynamic Programming.
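A minimal sketch of value iteration on a tabular MDP like the one above (assuming the `P`, `r`, and `gamma` arrays from the earlier snippet): it repeats the Bellman optimality backup until convergence, then reads off the greedy policy.

```python
import numpy as np

def value_iteration(P, r, gamma, tol=1e-8):
    """Returns (V*, pi*) for a tabular MDP with P[s, a, s'] and r[s, a]."""
    V = np.zeros(P.shape[0])
    while True:
        # Bellman optimality backup: Q[s,a] = r(s,a) + gamma * sum_s' P(s'|s,a) V(s')
        Q = r + gamma * (P @ V)
        V_new = Q.max(axis=1)
        if np.abs(V_new - V).max() < tol:
            return V_new, Q.argmax(axis=1)  # V*(s) and pi*(s) = argmax_a Q*(s, a)
        V = V_new
```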

SLIDE 12

Solving MDPs with Known Models

When we know the model of the MDP $M = \langle S, A, P, R, \gamma \rangle$

Optimal Control (LQR)

A special case where an exact solution is easy to compute: assume linear transitions and quadratic reward functions.

  • Linear transition: $s_{t+1} = A_t s_t + B_t a_t$
  • Quadratic reward (always negative): $r(s_t, a_t) = -s_t^\top U_t s_t - a_t^\top W_t a_t$
  • Extensions: LQG (Gaussian noise), iLQR (non-linear transition)

Sampling-based Planning

Evaluate outcomes of sampled actions with models, and choose the action that leads to the best (predicted) outcome. Example: Monte-Carlo Tree Search (MCTS) for Tic-Tac-Toe.
SLIDE 13

Solving MDPs with Learned Models

The model is known in restricted domains: games, simulated robots, simple mechanics. When the model is not known, we can learn the model from data.

A key role of learning in model-based approaches:

agent's experience $\tau = \{(s_i, a_i, r_i) \mid i = 0, \ldots, H\}$ → learned model $\hat{P}$ → model-based RL

$\hat{P}$ can be represented by Gaussian Processes, Neural Networks, GMMs, etc.

Then use the planning and optimization methods for known models (previous two slides).
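As a toy instance of model learning, the sketch below fits a linear dynamics model $s_{t+1} \approx \hat{A} s_t + \hat{B} a_t$ to logged transitions by least squares; it is a stand-in for the Gaussian Process or neural-network models named above, and all names are ours.

```python
import numpy as np

def fit_linear_dynamics(states, actions, next_states):
    """Least-squares fit of s_{t+1} ~ A s_t + B a_t from transitions.
    states: (N, ds), actions: (N, da), next_states: (N, ds)."""
    X = np.hstack([states, actions])                     # (N, ds + da)
    M, *_ = np.linalg.lstsq(X, next_states, rcond=None)  # X @ M ~ next_states
    ds = states.shape[1]
    A_hat, B_hat = M[:ds].T, M[ds:].T                    # s' ~ A_hat @ s + B_hat @ a
    return A_hat, B_hat
```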

SLIDE 14

Solving MDPs with Learned Models

System Identification (Week 12 Tue) [Ramos et al. RSS’19]
Model structure is known (e.g., simulator); we tune some model parameters (e.g., mass and friction): $\Pr[\mu \mid D]$

Sensor-Space Model (Week 8 Tue) [Finn et al. ICRA’17]
Predicting future raw sensory data: $f(s_{t+1} \mid s_t, a_t)$

Latent-Space Model (Week 8 Tue) [Hafner et al. ICLR’20]
Predicting future latent state: $h_t = g(s_t)$, $f(h_{t+1} \mid h_t, a_t)$
SLIDE 15

Examples of Model-Based Reinforcement Learning

“Dynamics Learning with Cascaded Variational Inference for Multi-Step Manipulation.” Fang, Zhu, Garg, Savarese, Fei-Fei, CoRL 2019

SLIDE 16

Solving MDPs without Models

When the model is unknown and hard to estimate, we can learn the policy directly from the agent's trajectories $\tau$ from interacting with an MDP.

agent's experience $\tau = \{(s_i, a_i, r_i) \mid i = 0, \ldots, H\}$ → model-free RL → optimal policy

$$\pi^* = \arg\max_\pi \mathbb{E}\Big[\sum_{t \ge 0} \gamma^t r(s_t, \pi(s_t))\Big]$$

SLIDE 17

Solving MDPs without Models

Model-free Value-based RL (Week 7 Thu)

Optimality condition (Bellman equation):

$$Q^*(s, a) = r(s, a) + \gamma\, \mathbb{E}_{s' \mid s, a}\big[\max_{a'} Q^*(s', a')\big]$$

The Q-learning rule (temporal difference learning) updates Q toward this condition; the optimal policy then acts greedily:

$$\pi^*(s) = \arg\max_{a'} Q^*(s, a')$$

Deep Q-Network (DQN): represent Q with neural networks
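The slide names the Q-learning rule without reproducing it; its standard tabular form is sketched below (the learning rate `alpha` and the function name are ours). `Q` is an (n_states, n_actions) NumPy array; DQN replaces this table with a neural network.

```python
def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step on a transition (s, a, r, s_next):
    Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]."""
    td_target = r + gamma * Q[s_next].max()  # bootstrapped one-step target
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q
```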

SLIDE 18

Solving MDPs without Models

Model-free Policy-Gradient RL (Week 7 Thu)

Objective function, for a policy parameterized by $\theta$, trajectories $\tau$ drawn under policy $\pi_\theta$, and total trajectory reward $r(\tau)$:

$$J(\theta) = \mathbb{E}_{\tau \sim p(\tau \mid \theta)}[r(\tau)]$$

Policy gradient theorem:

$$\nabla_\theta J(\theta) = \mathbb{E}_\pi\big[Q^\pi(s, a)\, \nabla_\theta \log \pi_\theta(a \mid s)\big]$$

Different ways of computing the Q values lead to different PG variants: Monte-Carlo estimates (REINFORCE), learning value functions (Actor-Critic).
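A minimal sketch of the REINFORCE variant for a tabular softmax policy, where the sampled return-to-go $G_t$ stands in for $Q^\pi(s, a)$ in the policy gradient theorem; all names are ours.

```python
import numpy as np

def reinforce_gradient(theta, trajectories, gamma=0.99):
    """Monte-Carlo policy-gradient estimate for a softmax policy
    pi(a|s) = softmax(theta[s]); trajectories: lists of (s, a, r) tuples."""
    grad = np.zeros_like(theta)
    for traj in trajectories:
        G = 0.0
        for s, a, r in reversed(traj):   # accumulate returns-to-go backwards
            G = r + gamma * G
            pi = np.exp(theta[s] - theta[s].max())
            pi /= pi.sum()
            g_logp = -pi                 # grad_theta[s] log pi(a|s) = onehot(a) - pi
            g_logp[a] += 1.0
            grad[s] += G * g_logp        # sampled G_t in place of Q^pi(s_t, a_t)
    return grad / len(trajectories)
```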

SLIDE 19

Solving MDPs without Models

Model-free Policy-Gradient RL (Week 7 Thu)

SLIDE 20

Examples of Model-Free Reinforcement Learning

“QT-Opt: Scalable Deep Reinforcement Learning for Vision-Based Robotic Manipulation.” Kalashnikov et al. CoRL 2018

Week 13 Thu, Nov 19

SLIDE 21

Solving MDPs without Models

Model-free Policy-Gradient RL
  • Learns the policy directly – often more stable
  • Works for continuous action spaces
  • Needs data from the current policy to compute the policy gradient (“on-policy” algorithm) – data inefficient
  • Gradient estimates can be very noisy

Model-free Value-based RL
  • Can learn the Q function from any interaction data, not just trajectories gathered using the current policy (“off-policy” algorithm)
  • Relatively data-efficient (can reuse old interaction data)
  • Needs to optimize over actions: hard to apply to continuous action spaces
  • The optimal Q function can be complicated, hard to learn

SLIDE 22

Mathematical Framework: Markov Decision Processes

Reinforcement learning optimizes the policy by trial and error in an MDP. Goal: to maximize the long-term rewards.

  • $S$: state space $(s_t \in S)$
  • $A$: action space $(a_t \in A)$
  • $P$: transition probability, $P^a_{ss'} = \Pr[s_{t+1} = s' \mid s_t = s, a_t = a]$
  • $R$: reward function, $r(s, a) = \mathbb{E}[r_{t+1} \mid s_t = s, a_t = a]$ (the fundamental assumption of RL)
  • $\gamma$: a discount factor, $\gamma \in [0, 1]$
SLIDE 23

Mathematical Framework: Markov Decision Processes

Imitation learning optimizes the policy by imitating the expert in an MDP. Goal: to match the behavioral distributions.

  • $S$: state space $(s_t \in S)$
  • $A$: action space $(a_t \in A)$
  • $P$: transition probability, $P^a_{ss'} = \Pr[s_{t+1} = s' \mid s_t = s, a_t = a]$
  • $R$: reward function, $r(s, a) = \mathbb{E}[r_{t+1} \mid s_t = s, a_t = a]$
  • $\gamma$: a discount factor, $\gamma \in [0, 1]$
  • $D$: set of demonstrations drawn from the expert policy $\pi_E$
SLIDE 24

Mathematical Framework: Markov Decision Processes

Imitation learning optimizes the policy by imitating the expert in an MDP. Goal: to match the behavioral distributions.

Two basic ideas:
  • Direct estimation of the expert policy from expert data (behavioral cloning)
  • Reconstruct a reward function (inverse RL) and then learn a policy from the reward (RL)

SLIDE 25

Imitation as Supervised Learning (Week 8 Thu)

Idea 1: Direct estimation of the expert policy from expert data. This can be cast as a supervised learning problem, called behavioral cloning:

$$\pi^* = \arg\min_\pi \sum_{s_t \in D} L\big(\pi(s_t), \pi_E(s_t)\big)$$

Here $\pi_E(s_t)$ is the action from the expert policy, and $L$ is a distance metric that measures the discrepancy between the expert action and the policy action (e.g., KL-divergence).
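A minimal sketch of behavioral cloning with a linear policy and a squared-error distance $L$ (a stand-in for richer policy classes and for the KL-divergence mentioned above; all names are ours):

```python
import numpy as np

def behavioral_cloning(demo_states, demo_actions):
    """Fit a linear policy a = W @ s by least squares on expert pairs
    (s_t, pi_E(s_t)). demo_states: (N, ds), demo_actions: (N, da)."""
    M, *_ = np.linalg.lstsq(demo_states, demo_actions, rcond=None)
    return M.T  # the cloned policy acts via a = W @ s with W = M.T

# Usage: W = behavioral_cloning(S, A_expert); act with a = W @ s
```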

SLIDE 26

Imitation as Supervised Learning (Week 8 Thu)

Idea 1: Direct estimation of the expert policy from expert data. This can be cast as a supervised learning problem, called behavioral cloning.

What can go wrong? Compounding errors: small prediction mistakes drive the policy into states the expert never visited, where its errors grow further.

How to fix: ask the expert for more data on the states the learned policy actually visits (DAgger); see the sketch below.
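A hedged pseudocode sketch of the DAgger loop; `env`, `expert`, and `train` are hypothetical placeholders for the simulator, the expert policy, and the supervised learner (e.g., the behavioral cloning fit from the previous slide).

```python
def dagger(env, expert, train, n_iters=10, horizon=200):
    """Roll out the current policy, label the visited states with the expert,
    aggregate the data, and retrain - so the training distribution matches
    the states the learned policy actually visits."""
    dataset = []                 # aggregated (state, expert action) pairs
    policy = expert              # first iteration rolls out the expert itself
    for _ in range(n_iters):
        s = env.reset()
        for _ in range(horizon):
            dataset.append((s, expert(s)))   # expert labels the visited state
            s, done = env.step(policy(s))    # but the *policy* picks the action
            if done:
                break
        policy = train(dataset)  # supervised learning on the aggregated dataset
    return policy
```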

SLIDE 27

Examples of Supervised Imitation Learning

"Agile Autonomous Driving using End-to-End Deep Imitation Learning" Pan, Cheng, Saigol, Lee, Yan, Theodorou, Boots. RSS 2018

SLIDE 28

Inverse Reinforcement Learning (Week 9 Tue)

Idea 2: Reconstruct a reward function and then learn a policy from the reward. This requires solving full-fledged RL in the inner loop.

To solve efficiently, IRL methods often assume:
  • Known dynamics (for comparing $\pi$ and $\pi^*$ efficiently)
  • Linear reward function: $r(s, a) = w^\top \phi(s)$

Problem: IRL is generally ill-posed – there are many reward functions under which the expert policy is optimal. How can we address it?

SLIDE 29

Examples of Inverse Reinforcement Learning

SLIDE 30

Adversarial Imitation Learning (Week 9 Thu)

[Goodfellow et al. 2014; Ho & Ermon, 2016]

The policy generates trajectories in the environment, and a discriminator $D_\psi$ is trained to separate them from demonstration trajectories (it predicts 0 if a state comes from the policy and 1 if it comes from a demo). The discriminator's score on policy states defines the IL reward:

$$r_{IL}(s_t, a_t) = -\log\big(1 - D_\psi(s_t)\big)$$

A strong discriminator signals a bad policy; a weak discriminator signals a good policy.

SLIDE 31

Adversarial Imitation Learning (Week 9 Thu)

[Goodfellow et al. 2014; Ho & Ermon, 2016]

IL reward, as before: $r_{IL}(s_t, a_t) = -\log\big(1 - D_\psi(s_t)\big)$, where the discriminator $D_\psi$ predicts 0 for policy states and 1 for demo states.

Compared with classical IRL:
  • Represents a complex reward function by neural networks
  • More iterative: updates reward and policy alternately (no need to run full RL before updating the reward function)
  • The dynamics are unknown, but we have access to a simulator to compare $\pi$ and $\pi^*$.
SLIDE 32

Examples of Adversarial Imitation Learning

“Reinforcement and Imitation Learning for Diverse Visuomotor Skills.” Zhu et al. RSS 2018
SLIDE 33

Robotics and Decision Making: Landscape

Learning source: a reward function, or expert demonstrations?

Reward function → have a model?
  • Yes → optimal control & planning
  • No → learn a model?
    ○ Yes → model-based reinforcement learning (Week 8 Tue)
    ○ No → model-free reinforcement learning (Week 7 Thu)

Expert demonstrations → estimate a reward?
  • No → imitation as supervised learning (Week 8 Thu)
  • Yes → known dynamics?
    ○ Yes → inverse reinforcement learning (Week 9 Tue)
    ○ No → adversarial imitation learning (Week 9 Thu)

SLIDE 34

Robotics and Decision Making: Frontiers

Learning from rich data sources
  • Language, preferences, instructions, videos. Suboptimal demonstrations.
  • Object variations and long-horizon tasks.

Efficient learning of new tasks
  • Fast learning from limited experience.
  • Representing and transferring past knowledge.

Safety and robustness
  • Probabilistic and formal guarantees of the robot behaviors during learning and inference.

Week 10 (Tue, Thu): Learning to Learn
Week 11 (Tue, Thu): Compositionality

SLIDE 35

Resources

Related courses at UTCS

  • CS342: Neural Networks
  • CS394R: Reinforcement Learning: Theory and Practice

Other Course Materials and Textbooks

  • UCL Course on RL by David Silver
  • Berkeley CS 294: Deep Reinforcement Learning
  • Reinforcement Learning: An Introduction, Sutton and Barto
  • Reinforcement Learning and Optimal Control, Bertsekas