
SLIDE 1

CS 103: Representation Learning, Information Theory and Control

Lecture 8, Mar 1, 2019

SLIDE 2


Recap

Group nuisances

  • Group convolutions
  • Canonical reference frames
  • SIFT descriptors

General nuisances

  • Minimal information in the activations ⇒ invariance to nuisances
  • Information Bottleneck
  • The IB loss can be upper-bounded by introducing an auxiliary variable (see the variational bound sketched after this list)
  • Aside: Variational Auto-Encoder can be seen as a particular case
  • Aside: Disentanglement in VAE
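
For reference, the bound recalled above (a sketch, with the auxiliary decoder $q(y|z)$ and marginal $q(z)$ as the auxiliary variables): for any encoder $p(z|x)$,

$$H(y\,|\,z) + \beta\, I(z;\, x) \;\le\; \mathbb{E}\big[-\log q(y\,|\,z)\big] + \beta\, \mathbb{E}_x\big[\mathrm{KL}(\,p(z\,|\,x)\,\|\,q(z)\,)\big],$$

valid for any choice of $q(y|z)$ and $q(z)$; taking $y = x$ (reconstruction) recovers the (β-)VAE objective, which is the particular case mentioned above.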

How does this relate to standard deep learning?

SLIDE 3


The Kolmogorov Structure of a Task

How can we define the structure of a task? Define the Kolmogorov Structure Function:

$$S_{\mathcal{D}}(t) = \min_{K(M)\,\le\, t} L(\mathcal{D};\, M)$$

where $L(\mathcal{D}; M)$ is the training loss and $K(M)$ is the Kolmogorov complexity of the model.

At first, increasing the complexity of the model leads to big gains in accuracy: we are learning the structure of the problem. After learning all the structure, we can only memorize: an inefficient asymptotic phase.

Information Complexity of Tasks, their Structure and their Distance, Achille et al., 2018

In the asymptotic phase the curve has tangent slope 1: we need to store one extra bit in the model to decrease the loss by one bit. The point where the asymptote begins is the Kolmogorov minimal sufficient statistic.

Kolmogorov's Structure Functions and Model Selection, Vereshchagin and Vitanyi, 2002

SLIDE 4


Optimizing using Deep Neural Networks

How do we find the optimal solution?

$$S_{\mathcal{D}}(t) = \min_{K(M)\,\le\, t} L(\mathcal{D};\, M)$$

Consider the corresponding Lagrangian:

$$\mathcal{L}(M) = L(\mathcal{D};\, M) + \lambda\, K(M)$$

Let $w$ be the parameters of the model, and use the bound $K(M) \le \mathrm{KL}(\,q(w\,|\,\mathcal{D})\,\|\,p(w)\,)$:

$$\mathcal{L}(M) = L(\mathcal{D};\, M) + \lambda\, \mathrm{KL}(\,q(w\,|\,\mathcal{D})\,\|\,p(w)\,)$$

This loss can be implemented using a DNN and the local reparametrization trick.*

* Variational Dropout and the Local Reparameterization Trick, Kingma et al., 2015
Information Complexity of Tasks, their Structure and their Distance, Achille et al., 2018
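
As a concrete illustration, here is a minimal sketch of this objective in PyTorch, assuming a mean-field Gaussian posterior $q(w|\mathcal{D})$ and a standard normal prior $p(w)$; the names `LocalReparamLinear` and `ib_weight_loss` are illustrative, not from the cited papers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalReparamLinear(nn.Module):
    """Linear layer with a factorized Gaussian posterior over its weights."""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.mu = nn.Parameter(0.01 * torch.randn(out_features, in_features))
        self.log_var = nn.Parameter(torch.full((out_features, in_features), -6.0))

    def forward(self, x):
        # Local reparameterization trick: sample the pre-activations instead of
        # the weights; same distribution, much lower gradient variance.
        act_mu = F.linear(x, self.mu)
        act_var = F.linear(x.pow(2), self.log_var.exp())
        return act_mu + act_var.sqrt() * torch.randn_like(act_mu)

    def kl(self):
        # KL(q(w) || p(w)) for q = N(mu, sigma^2), p = N(0, 1), summed over weights.
        return 0.5 * (self.mu.pow(2) + self.log_var.exp() - self.log_var - 1.0).sum()

def ib_weight_loss(model, logits, targets, lam):
    """L(D; M) + lambda * KL(q(w|D) || p(w)) for one mini-batch."""
    nll = F.cross_entropy(logits, targets)
    kl = sum(m.kl() for m in model.modules() if isinstance(m, LocalReparamLinear))
    return nll + lam * kl
```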

SLIDE 5


Let’s rewrite it using Information Theory

We used an upper bound; what is the best value it can assume? Recall that

$$I(w;\, \mathcal{D}) \;\le\; \mathbb{E}_{\mathcal{D}}\big[\mathrm{KL}(\,q(w\,|\,\mathcal{D})\,\|\,p(w)\,)\big],$$

with equality when $p(w) = \mathbb{E}_{\mathcal{D}}[q(w\,|\,\mathcal{D})]$, the marginal over datasets. Hence, in expectation over datasets, the best loss function to use to recover the task structure is:

$$\mathcal{L}(M) = \mathbb{E}_{\mathcal{D}}\big[H(\mathcal{D}\,|\,w)\big] + \lambda\, I(w;\, \mathcal{D})$$

This is the IB Lagrangian for the weights. In practice we minimize its variational upper bound:

$$\mathcal{L}(M) = \mathbb{E}_{w \sim q(w|\mathcal{D})}\big[H_{p,q}(\mathcal{D}\,|\,w)\big] + \lambda\, \mathrm{KL}(\,q(w\,|\,\mathcal{D})\,\|\,p(w)\,)$$
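
To see why the marginal is the best prior (a standard identity, not spelled out on the slide): for any $p(w)$,

$$\mathbb{E}_{\mathcal{D}}\big[\mathrm{KL}(\,q(w\,|\,\mathcal{D})\,\|\,p(w)\,)\big] \;=\; I(w;\,\mathcal{D}) + \mathrm{KL}(\,q(w)\,\|\,p(w)\,) \;\ge\; I(w;\,\mathcal{D}),$$

where $q(w) = \mathbb{E}_{\mathcal{D}}[q(w\,|\,\mathcal{D})]$ is the marginal; the gap vanishes exactly when $p(w) = q(w)$.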

SLIDE 6

A new Information Bottleneck

[Diagram: the dataset $\mathcal{D}$ determines the weights $w$, which approximate the real distribution $p(y|x)$; the data $x$ is mapped to activations $z$, which predict the label $y$.]

IB for the weights (controls overfitting):

$$\min_{w}\; \mathcal{L} = H_{p,q_w}(y\,|\,z) + \beta\, I(\mathcal{D};\, w)$$

IB for the activations (controls invariance):

$$\min_{q(z|x)}\; \mathcal{L} = H_{p,q}(y\,|\,z) + \beta\, I(z;\, x)$$
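
The activations IB can be optimized the same way. A minimal sketch, assuming a Gaussian encoder $q(z|x)$ and a standard normal variational marginal, so that $I(z;x)$ is replaced by its upper bound $\mathbb{E}_x[\mathrm{KL}(q(z|x)\,\|\,\mathcal{N}(0,I))]$; the class name `VIBClassifier` is made up for this example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VIBClassifier(nn.Module):
    """Variational information bottleneck on the activations z."""
    def __init__(self, dim_x, dim_z, n_classes):
        super().__init__()
        self.encoder = nn.Linear(dim_x, 2 * dim_z)  # outputs (mu, log_var)
        self.decoder = nn.Linear(dim_z, n_classes)

    def forward(self, x):
        mu, log_var = self.encoder(x).chunk(2, dim=-1)
        z = mu + (0.5 * log_var).exp() * torch.randn_like(mu)  # reparameterize
        # Upper bound on I(z; x): KL(q(z|x) || N(0, I)), averaged over the batch.
        kl = 0.5 * (mu.pow(2) + log_var.exp() - log_var - 1.0).sum(-1).mean()
        return self.decoder(z), kl

def activations_ib_loss(model, x, y, beta):
    logits, kl = model(x)
    return F.cross_entropy(logits, y) + beta * kl
```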
SLIDE 7


The PAC-Bayes generalization bound (Catoni, 2007; McAllester, 2013)

  • Corollary. Minimizing the IB Lagrangian for the weights minimizes an upper bound on the test error. This gives non-vacuous generalization bounds! (Dziugaite and Roy, 2017)
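
The slide does not reproduce the bound itself; one standard form (McAllester's, for a loss in $[0,1]$) is: with probability at least $1-\delta$ over a training set of $N$ samples, for any prior $p$ fixed in advance and any posterior $q$,

$$\mathbb{E}_{w \sim q}\big[L_{\mathrm{test}}(w)\big] \;\le\; \mathbb{E}_{w \sim q}\big[\hat{L}_{\mathrm{train}}(w)\big] + \sqrt{\frac{\mathrm{KL}(\,q\,\|\,p\,) + \log(N/\delta)}{2(N-1)}}.$$

Note that the same $\mathrm{KL}(q(w\,|\,\mathcal{D})\,\|\,p(w))$ term penalized by the IB Lagrangian for the weights controls the complexity term of the bound.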

SLIDE 8


Can we really minimize the IBL for the weights?

We are making an approximation by assuming that both $q(w|\mathcal{D})$ and $p(w)$ are Gaussian. Let $w^*$ be a local minimum. The optimal amount of Gaussian noise to add is

$$\Sigma^* = \big(I + 2\lambda^2 \beta\, F(w^*)\big)^{-1},$$

where $F(w^*)$ is the Fisher Information Matrix (equivalent to the Hessian at a minimum) computed at $w^*$. This yields the bound

$$I(w;\, \mathcal{D}) \;\lesssim\; \frac{\|w^*\|^2}{\lambda^2} + \log\big|\,2\lambda^2 N\, F(w^*) + I\,\big|.$$

Flat minima have low information in the weights: weight information is bounded by the geometry of the loss landscape.
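
A hedged sketch of how one might estimate the diagonal of $F(w^*)$ for a trained PyTorch classifier (the function name and Monte Carlo scheme are illustrative assumptions, not the authors' code):

```python
import torch
import torch.nn.functional as F

def diagonal_fisher(model, dataset, n_samples=1000):
    """Monte Carlo estimate of the diagonal of the Fisher Information Matrix."""
    fisher = {name: torch.zeros_like(p) for name, p in model.named_parameters()}
    model.eval()
    n = min(n_samples, len(dataset))
    for i in range(n):
        x, _ = dataset[i]  # the true label is not used
        logits = model(x.unsqueeze(0))
        # Sample the label from the model's own predictive distribution:
        # this estimates the true Fisher rather than the empirical one.
        y = torch.distributions.Categorical(logits=logits).sample()
        model.zero_grad()
        F.cross_entropy(logits, y).backward()
        for name, p in model.named_parameters():
            if p.grad is not None:
                fisher[name] += p.grad.pow(2)
    return {name: f / n for name, f in fisher.items()}
```

A flat minimum corresponds to small entries of $F(w^*)$, hence a small value of the bound above.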

SLIDE 9


Is this approximation good? Phase transition

β < 1 ⇒ overfitting (memorization)    β ≫ 1 ⇒ underfitting

Consider a dataset with random labels. There is no structure, so for β > 1 we should not fit anything, and for β < 1 we should memorize everything (see the structure function).

[Plot: train error (0% to 100%) vs. value of β (10⁻² to 10²) on random labels for All-CNN, ResNet, and Small AlexNet.]

Achille and Soatto, Emergence of Invariance and Disentanglement in Deep Representations, JMLR 2018

Phase transition: even with the local approximation, we can observe this behavior in real deep networks. For real labels, we have a "Goldilocks zone" where we fit without overfitting, for β > 1.

SLIDE 10

Information is a better measure of complexity than the number of parameters


Bias-variance tradeoff

[Plots: error vs. model complexity (bias², variance, total error, with the optimal model and the optimal DNN marked) and the same curves vs. information complexity.]

Parametrizing the complexity with the information in the weights, we recover the classical bias-variance trade-off trend.

Achille and Soatto, Emergence of Invariance and Disentanglement in Deep Representations, JMLR 2018
Arora et al., Stronger generalization bounds for deep nets via a compression approach, ICML 2018

SLIDE 11

A new Information Bottleneck

(This slide repeats the diagram and the two IB objectives from Slide 6.)
SLIDE 12


The Emergence Bound

Minimality of the weights (representation of the training set) induces minimality (hence invariance) and disentanglement of the activations.

$$g\big(I(w;\,\mathcal{D})\big) \;\le\; I(z;\,x) + TC(z) \;\le\; g\big(I(w;\,\mathcal{D})\big) + O\big(1/\dim(x)\big)$$

(The outer terms bound the information in the activations, $I(z;x)$ plus their total correlation $TC(z)$, by the information in the weights.)

The bound is tight for one layer; for more layers we have:

$$I(z_L;\, x) \;\le\; \min_{k<L}\Big\{ \dim(z_k) \times \Big[\, g\Big(\frac{I(W_k;\,\mathcal{D})}{\dim(W_k)}\Big) + 1 \,\Big] \Big\}$$

Achille and Soatto, Emergence of Invariance and Disentanglement in Deep Representations, JMLR 2018

SLIDE 13


How do we minimize the information in the weights?

  • 1. Explicitly minimize the IBL for the weights using the local reparametrization trick.
  • 2. Let SGD do it for you:
      - Empirically, we know that SGD tends to find flatter minima.
      - We know from the local information bound that flat minima have less information in the weights.
      - Hence, SGD implicitly minimizes the information in the weights.
  • 3. Modify SGD to reduce information more aggressively (one possible sketch follows this list).
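
The lecture does not specify how to modify SGD; one plausible sketch (an assumption, not the lecture's prescribed method) is Langevin-style noise injection, which biases training toward flatter, lower-information minima:

```python
import torch

def noisy_sgd_step(params, lr=0.1, noise_scale=1e-3):
    """One SGD step followed by Gaussian noise injection on every parameter,
    as in stochastic gradient Langevin dynamics (Welling and Teh, 2011)."""
    with torch.no_grad():
        for p in params:
            if p.grad is not None:
                p.add_(p.grad, alpha=-lr)                        # gradient step
                p.add_(torch.randn_like(p), alpha=noise_scale)   # noise injection
```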

Warning: These are still open questions and the claims are not proved.

Counterpoint: Are flat minima the whole story?

SLIDE 14


Information in Weights during training

What should we expect from the information in the weights during training? Maybe something like this?

[Sketch: information in the weights vs. training epoch.]

SLIDE 15


Information during training

Achille, Rovere, Soatto, Critical Learning Periods in Deep Networks, 2018

[Plot: the information in the weights first grows (information extraction), then decreases (information consolidation).]

SLIDE 16


A snag: Critical periods in Deep Networks

Achille, Rovere, Soatto, Critical Learning Periods in Deep Networks, 2018

[Training protocol: a deficit for the first N epochs, followed by normal training up to epoch 160 + N.]

We show the network blurred images to simulate a cataract. The network does not classify correctly if the deficit is removed too late: a short deficit around epoch 40 is enough to permanently damage the network.

SLIDE 17


Critical periods

Critical periods: a time window in early development during which sensory deficits can permanently impair the acquisition of a skill. Examples: monocular deprivation, cataracts, imprinting, language acquisition, …

[Protocol, after Hubel and Wiesel: monocular deprivation (deficit) followed by normal rearing up to 180 + N; the kitten does not recover vision in the covered eye. Image from Cnops et al., 2008.]

SLIDE 18


Critical learning periods and Information in Weights

Achille, Rovere, Soatto, Critical Learning Periods in Deep Networks, 2018

Sensitivity to deficits peaks when the network is absorbing information, and is minimal when the network is consolidating information.

SLIDE 19


Are flat minima an epiphenomenon?

Final sharpness correlates with generalization… but generalization quality is decided early in training, far from convergence to a minimum.