How Good is the Bayes Posterior in Deep Neural Networks Really?


SLIDE 1

How Good is the Bayes Posterior in Deep Neural Networks Really?

Joint first authors: Kevin Roth, Bas Veeling; with Jakub Swiatkowski, Linh Tran, Stephan Mandt, Jasper Snoek, Tim Salimans, Rodolphe Jenatton, and Sebastian Nowozin

Florian Wenzel, 15 June 2020

Florian Wenzel (Google Research Berlin)

Code: github.com/google-research/google-research/tree/master/cold_posterior_bnn

SLIDE 2


Bayesian Deep Learning

Goal: enable Bayesian inference for deep networks to improve the robustness of predictions.

This is an active research field where most work focuses on improving approximate inference to get closer to the Bayes posterior.

SLIDE 3


But is the Bayes posterior actually good?

SLIDE 4

Bayesian Neural Networks (BNNs)

Neural Network (Input → Hidden → Output)

Likelihood: p(D | θ) = ∏_i p(y_i | x_i, θ)

Different models are obtained for different θ.

SLIDE 5

Bayesian Neural Networks (BNNs)

Bayesian Neural Network (Input → Hidden → Output)

Joint: p(θ, D) = ∏_i p(y_i | x_i, θ) p(θ)

Posterior p(θ | D): distribution over likely models given the data.

SLIDE 6

BNNs: Predictions

BNNs use samples from the posterior (an ensemble of models):

θ_1, θ_2, θ_3, ... ∼ p(θ | D) ∝ exp(−U(θ))

In standard deep learning we instead optimize a single point estimate θ_SGD (MAP).

SLIDE 7

BNNs: Predictions

In this talk: a model is good if it predicts well (e.g., low cross-entropy loss).

Predict by using an average of models:

p(y | x, D) ≈ (1/S) Σ_s p(y | x, θ_s),   θ_1, θ_2, θ_3, ... ∼ p(θ | D)

(in contrast to predicting with the single point estimate θ_SGD (MAP))

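The posterior-predictive average above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the paper's code: `logits_fn` (a forward pass returning class logits) and `theta_samples` (draws from the posterior) are hypothetical stand-ins for a trained BNN.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def ensemble_predict(logits_fn, theta_samples, x):
    """p(y | x, D) ~= (1/S) * sum_s p(y | x, theta_s).

    logits_fn(theta, x) is a hypothetical forward pass returning class
    logits; theta_samples are draws theta_1, ..., theta_S ~ p(theta | D).
    """
    probs = [softmax(logits_fn(theta, x)) for theta in theta_samples]
    return np.mean(probs, axis=0)
```

Calling this with a single sample recovers the point-estimate prediction p(y | x, θ_SGD); with many posterior samples it is the Bayesian model average.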
SLIDE 8

Bayesian Neural Networks (BNNs)

Promises of BNNs*:

  • Robustness in generalization
  • Better uncertainty quantification (calibration)
  • Enables new deep learning applications (continual learning, sequential decision making, …)

* [e.g., Neal 1995, Gal et al. 2016, Wilson 2019, Ovadia et al. 2019]

SLIDE 9

Bayesian Neural Networks (BNNs)

But in practice BNNs are rarely used!

SLIDE 10

Bayesian Neural Networks (BNNs)

In practice:

  • Often, the Bayes posterior is worse than SGD point estimates
  • But Bayes predictions can be improved by the use of the cold posterior*

Cold posterior: for temperature T < 1 we sharpen the posterior (over-count the evidence).

* Explicitly (or implicitly) used by most recent Bayesian DL papers [e.g., Li et al. 2016, Zhang et al. 2020, Ashukha et al. 2020].

SLIDE 11

Bayesian Neural Networks (BNNs)

Cold posterior over θ: for temperature T < 1 we sharpen the posterior (over-count the evidence),

tempered posterior ∝ exp(−U(θ) / T)    (T = 1 recovers the Bayes posterior)

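Tempering is mechanically simple: divide the posterior energy U(θ) by T. The sketch below is an assumption-laden toy, not the paper's implementation; it evaluates the tempered log density on a discrete grid of candidate parameters, where the sharpening for T < 1 is easy to verify.

```python
import numpy as np

def cold_log_posterior(log_lik, log_prior, T):
    """Unnormalized log density of the tempered posterior exp(-U(theta) / T),
    with energy U(theta) = -log p(D | theta) - log p(theta).

    T = 1 recovers the Bayes posterior; T < 1 sharpens it
    (the evidence is over-counted).
    """
    U = -(np.asarray(log_lik) + np.asarray(log_prior))
    return -U / T
```

For example, on two candidate models with log-likelihoods -1 and -2 under a flat prior, cooling from T = 1 to T = 0.5 moves posterior mass toward the better model (from about 0.73 to about 0.88).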
SLIDE 12

[Figure: test performance vs. temperature T for ResNet-20 / CIFAR-10 and CNN-LSTM / IMDB; the optimal cold posterior (T < 1) outperforms the true Bayes posterior (T = 1).]

SLIDE 13

The cold posterior sharply deviates from the Bayesian paradigm.

What is the use of more accurate posterior approximations if the posterior itself is poor?

SLIDE 14

Our paper: hypotheses for the origin of the improved performance of cold posteriors

Inference:
  • Inaccurate SDE simulation?
  • Bias of SG-MCMC?
  • Minibatch noise (which is not Gaussian)?
  • Bias-variance tradeoff induced by the cold posterior?

Likelihood:
  • Dirty likelihoods? (batch normalization, dropout, data augmentation)

Prior:
  • Are current priors used for BNN parameters poor?
  • Does the effect become stronger with increasing model depth and capacity?

SLIDE 17
Inference: Is it accurate?

  • 1. How to compute the posterior (inference)?
  • 2. Does inaccurate inference lead to the cold posterior effect?

We sample from the posterior using SG-MCMC methods. Not covered: approximating the posterior using variational inference.

SLIDE 18

SG-MCMC: Stochastic Gradient Markov Chain Monte Carlo

SGD: the goal is optimization (a single point estimate).

SLIDE 19

SG-MCMC: Stochastic Gradient Markov Chain Monte Carlo

SG-MCMC: the goal is convergence in distribution and integration (averaging over samples).

SLIDE 20

Stochastic Gradient Markov Chain Monte Carlo

Langevin dynamics: one-slide refresher

  • Parameters θ, moments m, mass matrix M > 0, friction γ > 0
  • "Solving the SDE" ⇔ obtaining one random continuous-time path
  • Simulating the SDE has a stationary distribution proportional to exp(−U(θ) / T)

dθ = M^(−1) m dt
dm = −∇θ U(θ) dt − γ m dt + √(2γT) M^(1/2) dW

[Langevin, 1908], [Leimkuhler and Matthews, "Molecular Dynamics", 2016]

SLIDE 21

Stochastic Gradient Markov Chain Monte Carlo

Symplectic Euler (discretized version of the SDE):

m^(t) = (1 − hγ) m^(t−1) − h n ∇θ G̃(θ^(t−1)) + √(2γhT) M^(1/2) R^(t),   R^(t) ∼ N(0, I)
θ^(t) = θ^(t−1) + h M^(−1) m^(t)

The momentum update is SGD with momentum plus Gaussian noise scaled by the temperature T.

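This update can be checked on a toy target. The sketch below (scalar θ, mass M = I, and a full-batch gradient `grad_U` standing in for the minibatch term n∇θG̃) runs the symplectic Euler chain; for U(θ) = θ²/2 the stationary distribution exp(−U(θ)/T) is N(0, T), so the empirical variance of the samples should track the temperature. All names and parameter values here are illustrative choices, not the paper's setup.

```python
import numpy as np

def sgmcmc_chain(grad_U, theta0=0.0, h=0.05, gamma=1.0, T=1.0,
                 steps=10_000, seed=0):
    """Symplectic Euler discretization of the Langevin SDE (scalar theta,
    mass M = I; full-batch grad_U stands in for the minibatch n * grad G~).

    m_t     = (1 - h*gamma) * m_{t-1} - h * grad_U(theta_{t-1})
              + sqrt(2*gamma*h*T) * R_t,   R_t ~ N(0, 1)
    theta_t = theta_{t-1} + h * m_t
    """
    rng = np.random.default_rng(seed)
    theta, m = theta0, 0.0
    noise_scale = np.sqrt(2.0 * gamma * h * T)
    samples = np.empty(steps)
    for t in range(steps):
        m = (1.0 - h * gamma) * m - h * grad_U(theta) + noise_scale * rng.normal()
        theta = theta + h * m
        samples[t] = theta
    return samples
```

Lowering T shrinks the stationary variance of θ; this "noise scaled by temperature" term is exactly the knob that cold posteriors turn.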
SLIDE 22

Stochastic Gradient Markov Chain Monte Carlo

The discretization scheme leads to approximate samples from the posterior: θ^(1), θ^(2), θ^(3), ...

Is this accurate enough?

SLIDE 23


Novel diagnostics for SG-MCMC

Diagnostics check out!

SLIDE 24


Novel diagnostics for SG-MCMC

Diagnostics fail.

SLIDE 25

SG-MCMC works well enough!

[Figure: predictive posteriors on synthetic data generated from an MLP; SG-MCMC closely matches the gold standard (HMC).]

SLIDE 26

Inference:
  • Inaccurate SDE simulation?
  • Bias of SG-MCMC?
  • Minibatch noise (which is not Gaussian)?
  • Bias-variance tradeoff induced by the cold posterior?

SG-MCMC inference works well enough!

SLIDE 27

Inference:
  • Inaccurate SDE simulation?
  • Bias of SG-MCMC?
  • Minibatch noise (which is not Gaussian)?
  • Bias-variance tradeoff induced by the cold posterior?

SG-MCMC inference works well enough!

If the model is well-specified, T = 1 is optimal. But for real-world data T < 1 is better!

SLIDE 28

The cold posterior effect

Why does the cold posterior perform better than the true Bayes posterior?

SLIDE 29

Problems with the prior?

Prior:
  • Are current priors used for BNN parameters poor? (typically θ ∼ p(θ) = N(0, I))
  • Does the effect become stronger with increasing model depth and capacity?

SLIDE 30

Prior Predictive Experiment

Draw model parameters from the prior:  θ^(i) ∼ p(θ) = N(0, I)

Induced predictive distribution (class probabilities):  E_{x∼p(x)}[ p(y | x, θ^(i)) ]

SLIDE 34

Prior Predictive Experiment: ResNet-20 / CIFAR-10

For draws θ^(1), θ^(2), θ^(3) ∼ N(0, I): each network drawn from the prior maps all images to one class!

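A rough way to reproduce the flavor of this finding without a ResNet: draw every weight of a small ReLU MLP from the "standard" prior N(0, I) and inspect the induced class probabilities. The architecture and all sizes below are arbitrary choices for the sketch, not the paper's setup; the point is that unscaled N(0, I) weights make the logits huge, so the softmax saturates.

```python
import numpy as np

def prior_draw_predictive(n_inputs=100, in_dim=10, width=128, n_classes=10,
                          depth=3, seed=0):
    """Push random inputs through one MLP whose weights are all drawn from
    the prior N(0, I), and return the induced softmax class probabilities."""
    rng = np.random.default_rng(seed)
    h = rng.normal(size=(n_inputs, in_dim))  # x ~ p(x), here standard normal
    dims = [in_dim] + [width] * (depth - 1) + [n_classes]
    for i, (d_in, d_out) in enumerate(zip(dims[:-1], dims[1:])):
        W = rng.normal(size=(d_in, d_out))   # theta^(i) ~ N(0, I), no scaling
        h = h @ W
        if i < depth - 1:
            h = np.maximum(h, 0.0)           # ReLU on hidden layers only
    h = h - h.max(axis=1, keepdims=True)     # stable softmax
    p = np.exp(h)
    return p / p.sum(axis=1, keepdims=True)
```

In such draws the predictive distribution is close to one-hot for every input, and typically a single class dominates across inputs, mirroring the prior-predictive pathology observed for ResNet-20 / CIFAR-10.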
SLIDE 35

There is no “easy” fix of the prior

[Figure: final performance on ResNet-20 / CIFAR-10 for different prior variances σ, with θ ∼ N(0, σ).]

SLIDE 36

The cold posterior effect becomes stronger with increasing capacity

[Figure: MLP / CIFAR-10, varying width at fixed depth = 3 and varying depth at fixed width = 128.]

SLIDE 37

Summary

  • SG-MCMC is accurate enough.
  • Cold posteriors work.
  • More work on priors for deep nets is needed.

[Figure: ResNet-20 / CIFAR-10, true Bayes posterior vs. optimal cold posterior.]

SLIDE 38

Code: github.com/google-research/google-research/tree/master/cold_posterior_bnn

More info/feedback: www.florianwenzel.com, florianwenzel@google.com