Bayesian Counterfactual Risk Minimization, Ben London (blondon@)



slide-1
SLIDE 1

Bayesian Counterfactual Risk Minimization

International Conference on Machine Learning Long Beach, CA, June 11, 2019

Ben London (blondon@), Amazon Music
Ted Sandler (sandler@), Amazon Music

slide-2
SLIDE 2

Learning from Logged Data

The deployment cycle:

  • Pull log data (e.g., user i listened to item j)
  • Design/train new rec policy
  • A/B test new policy vs. old policy
  • If successful, launch the new policy; if unsuccessful, iterate

slide-3
SLIDE 3
Problem 1: Bandit Feedback

“Alexa, play music” … “Here’s a station you might like”

  • Only observe outcomes from actions taken
  • e.g., only get feedback on recommendations

slide-4
SLIDE 4

Problem 2: Bias

low support → who knows? high support → better estimate

  • Logged data is biased
  • Policy typically not uniform distribution
  • User typically doesn’t see everything
  • Bias affects inferences
  • Self-fulfilling prophecies; “rich get richer”
  • Miss key insights due to insufficient support
slide-5
SLIDE 5

IPS Policy Optimization

  • Use inverse propensity score (IPS) estimator
  • IPS is an unbiased estimator of expected reward
  • Caveat: logging policy must have full support

logged propensity: $p_i = \pi_0(a_i \mid x_i)$

$$\mathbb{E}_{(x,\rho)\sim D}\,\mathbb{E}_{a\sim\pi(x)}\left[\rho(x,a)\right] \;\approx\; \frac{1}{n}\sum_{i=1}^{n} r_i \,\frac{\pi(a_i \mid x_i)}{p_i}$$

$$\arg\min_{\pi}\; \frac{1}{n}\sum_{i=1}^{n} -r_i \,\frac{\pi(a_i \mid x_i)}{p_i}$$
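The IPS estimate above is a simple weighted average, which can be sketched in a few lines of numpy. This is an illustrative sketch, not code from the talk; the function and variable names are ours.

```python
import numpy as np

def ips_estimate(rewards, logged_propensities, new_policy_probs):
    """Inverse propensity score (IPS) estimate of a new policy's
    expected reward, computed from logged bandit feedback.

    rewards[i]             = r_i, reward observed for logged action a_i
    logged_propensities[i] = p_i = pi_0(a_i | x_i)
    new_policy_probs[i]    = pi(a_i | x_i), new policy's probability
    """
    weights = new_policy_probs / logged_propensities  # importance weights
    return np.mean(rewards * weights)

# Toy example: logging policy was uniform over 2 actions (p_i = 0.5);
# the new policy puts probability 0.9 on the actions that earned reward 1.
r = np.array([1.0, 0.0, 1.0, 1.0])
p0 = np.array([0.5, 0.5, 0.5, 0.5])
pi = np.array([0.9, 0.1, 0.9, 0.9])
print(ips_estimate(r, p0, pi))  # 1.35
```

Note that a single-sample estimate can exceed the maximum reward, a symptom of the variance problem discussed on the next slide.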
slide-6
SLIDE 6

IPS Policy Optimization

  • Use inverse propensity score (IPS) estimator
  • Problem: IPS has high variance
  • The importance weights $\pi(a_i \mid x_i) / \pi_0(a_i \mid x_i)$ can be very large when the logged propensity is small

logged propensity: $p_i = \pi_0(a_i \mid x_i)$

$$\arg\min_{\pi}\; \frac{1}{n}\sum_{i=1}^{n} -r_i \,\frac{\pi(a_i \mid x_i)}{p_i}$$
slide-7
SLIDE 7

CRM Principle

  • Counterfactual Risk Minimization (CRM) principle
  • Motivated by PAC risk analysis
  • Stochastic optimization of variance regularizer is tricky
  • Policy optimization for exponential models (POEM) algorithm

[Swaminathan & Joachims, ICML 2015]

variance regularization:

$$\arg\min_{\pi}\; \frac{1}{n}\sum_{i=1}^{n} -r_i \,\frac{\pi(a_i \mid x_i)}{p_i} \;+\; \lambda \sqrt{\widehat{\mathrm{Var}}(\pi, S)}$$
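The CRM objective adds a penalty on the sample variance of the importance-weighted losses. A minimal sketch of evaluating this objective (our own names, not the POEM implementation):

```python
import numpy as np

def crm_objective(rewards, propensities, policy_probs, lam=0.1):
    """CRM-style objective: negative IPS estimate plus a penalty of
    lambda * sqrt(sample variance of the importance-weighted losses)."""
    weighted = -rewards * policy_probs / propensities
    return np.mean(weighted) + lam * np.sqrt(np.var(weighted, ddof=1))

# Two policies with the same IPS estimate (-1.0) but different variance;
# the variance penalty prefers the lower-variance one.
r = np.array([1.0, 1.0])
p0 = np.array([0.5, 0.5])
low_var = crm_objective(r, p0, np.array([0.5, 0.5]))   # -1.0
high_var = crm_objective(r, p0, np.array([0.9, 0.1]))  # about -0.887
```

As the slide notes, stochastic optimization of this objective is tricky because the variance term couples all the samples; this sketch only evaluates it.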
slide-8
SLIDE 8

Bayesian CRM Principle

  • Bayesian Counterfactual Risk Minimization (CRM) principle
  • Bayesian policy:
  • Motivated by PAC-Bayes risk analysis
  • Takeaway: posterior should stay close to the prior
  • What should the prior be? How about the logging policy!

[London & Sandler, ICML 2019]

KL divergence from prior to posterior:

$$\arg\min_{Q}\; \frac{1}{n}\sum_{i=1}^{n} -r_i \,\frac{\pi_Q(a_i \mid x_i)}{p_i} \;+\; \lambda \, D_{\mathrm{KL}}(Q \,\|\, P)$$

$$\pi_Q(a \mid x) = \Pr_{h \sim Q}\{h(x) = a\}, \quad h : \mathcal{X} \to \mathcal{A}$$
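For Gaussian posteriors and priors (as in the mixed-logit case on the next slide), the KL term has a closed form. A sketch of evaluating the Bayesian CRM objective under that assumption, with diagonal covariances and our own function names:

```python
import numpy as np

def gaussian_kl(mu_q, sigma_q, mu_p, sigma_p):
    """KL(Q || P) for diagonal Gaussians Q = N(mu_q, diag(sigma_q^2))
    and P = N(mu_p, diag(sigma_p^2)), summed over dimensions."""
    return np.sum(
        np.log(sigma_p / sigma_q)
        + (sigma_q**2 + (mu_q - mu_p) ** 2) / (2 * sigma_p**2)
        - 0.5
    )

def bayesian_crm_objective(rewards, propensities, policy_probs,
                           mu_q, sigma_q, mu_p, sigma_p, lam=0.1):
    """Negative IPS estimate plus lambda * KL(Q || P), where the prior P
    is centered on the logging policy's parameters (a sketch)."""
    ips_loss = np.mean(-rewards * policy_probs / propensities)
    return ips_loss + lam * gaussian_kl(mu_q, sigma_q, mu_p, sigma_p)
```

With the posterior equal to the prior the KL term vanishes, so the objective reduces to the plain IPS loss; moving the posterior away from the logging policy is charged quadratically in the mean shift.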
slide-9
SLIDE 9

Application to Mixed Logits

  • Mixed logit model
  • Like a softmax with normally distributed parameters
  • Risk bound motivates logging policy regularization

[London & Sandler, ICML 2019]

L2 distance to logging policy parameters:

$$h_{w,\gamma}(x) = \arg\max_{a}\; w \cdot \phi(x, a) + \gamma_a$$

$$\gamma \sim \mathrm{Gumbel}(0, 1)^k, \qquad w \sim \mathcal{N}(\mu, \Sigma)$$

$$\pi_Q(a \mid x) = \mathbb{E}_{(w,\gamma)\sim Q}\left[\mathbb{1}\{h_{w,\gamma}(x) = a\}\right] = \mathbb{E}_{w}\!\left[\frac{\exp(w \cdot \phi(x, a))}{\sum_{a'} \exp(w \cdot \phi(x, a'))}\right]$$

$$D_{\mathrm{KL}}(Q \,\|\, P) = O\!\left(\|\mu - \mu_0\|^2\right)$$
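The expectation over $w$ in the mixed-logit policy generally has no closed form, but it can be approximated by Monte Carlo: sample weight vectors from the Gaussian and average the resulting softmaxes. An illustrative sketch (names and interface are ours, not from the talk):

```python
import numpy as np

def mixed_logit_probs(features, mu, cov, n_samples=10000, rng=None):
    """Monte Carlo estimate of pi_Q(a | x) for a mixed logit:
    sample w ~ N(mu, cov) and average the softmax over samples.

    features: (k, d) array whose row a is phi(x, a) for action a.
    Returns a length-k vector of action probabilities.
    """
    rng = np.random.default_rng(rng)
    w = rng.multivariate_normal(mu, cov, size=n_samples)  # (n, d)
    scores = w @ features.T                               # (n, k)
    scores -= scores.max(axis=1, keepdims=True)           # numerical stability
    probs = np.exp(scores)
    probs /= probs.sum(axis=1, keepdims=True)
    return probs.mean(axis=0)
```

With a zero covariance every sample equals the mean, so the estimate collapses to an ordinary softmax; widening the covariance smooths the policy toward uniform.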
slide-10
SLIDE 10

Contributions

  • PAC Risk bounds for Bayesian policies
  • Application to mixed logits
  • Logging policy prior motivates new regularizer
  • Two new learning objectives (one convex) that minimize bounds
  • Experiments show the proposed methods gain up to 76% more reward than POEM, while also being simpler and more efficient

Visit poster #113 in Pacific Ballroom