Measurements of Three-Level Hierarchical Structure in the Outliers in the Spectrum of Deepnet Hessians



slide-1
SLIDE 1

Measurements of Three-Level Hierarchical Structure in the Outliers in the Spectrum of Deepnet Hessians

Vardan Papyan

Department of Statistics Stanford University

June 13, 2019

slide-5
SLIDE 5

Setting

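◮ C-class classification problem
◮ Loss:

$$L(\theta) = \mathrm{Ave}_{i,c}\{\ell(f(x_{i,c};\theta), y_c)\}$$

◮ Hessian:

$$\mathrm{Hess}(\theta) = \mathrm{Ave}_{i,c}\left\{\frac{\partial^2 \ell(f(x_{i,c};\theta), y_c)}{\partial\theta^2}\right\}$$

◮ Gauss-Newton decomposition:

$$\mathrm{Hess} = G + H$$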
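
The slide does not spell out the two terms. As a sketch, the standard chain-rule split of the per-example Hessian is written below, assuming G denotes the Gauss-Newton term and H the remainder. The index k over the C outputs of f is shorthand introduced here and does not appear on the slide.

```latex
\begin{align*}
\frac{\partial^2 \ell}{\partial\theta^2}
  &= \left(\frac{\partial f}{\partial\theta}\right)^T
     \frac{\partial^2 \ell}{\partial f^2}
     \frac{\partial f}{\partial\theta}
   + \sum_{k=1}^{C} \frac{\partial \ell}{\partial f_k}
     \frac{\partial^2 f_k}{\partial\theta^2}, \\
G &= \mathrm{Ave}_{i,c}\left\{
     \left(\frac{\partial f}{\partial\theta}\right)^T
     \frac{\partial^2 \ell}{\partial f^2}
     \frac{\partial f}{\partial\theta}\right\}, \qquad
H = \mathrm{Ave}_{i,c}\left\{
     \sum_{k=1}^{C} \frac{\partial \ell}{\partial f_k}
     \frac{\partial^2 f_k}{\partial\theta^2}\right\}.
\end{align*}
```

Here every derivative is evaluated at $f = f(x_{i,c};\theta)$ and label $y_c$. When ℓ is convex in f, the first term is positive semidefinite.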

slide-6
SLIDE 6

Previous work: LeCun et al. (1998)

slide-7
SLIDE 7

Previous work: Dauphin et al. (2014)

slide-10
SLIDE 10

Previous work: Sagun et al. (2017)

◮ Noticed that the spectrum can be decomposed into:

  ◮ Bulk + outliers
  ◮ Number of outliers ≈ number of classes

slide-11
SLIDE 11

This work

What is causing the outliers in the spectrum?

slide-19
SLIDE 19

G is a second moment of gradients with structure on indices

◮ Define the gradient:

$$\delta_{i,c,c'}^T = \sqrt{p_{c'}(x_{i,c};\theta)}\,\bigl(y_{c'} - p(x_{i,c};\theta)\bigr)^T \frac{\partial f(x_{i,c};\theta)}{\partial\theta}$$

◮ $\delta_{i,c,c}$: gradient of i-th example in c-th class (up to a scalar)
◮ $\delta_{i,c,c'}$: gradient of i-th example in c-th class, if it belonged to class c′ instead (up to a scalar)
◮ These gradients can be indexed by three numbers:

  ◮ i: observation
  ◮ c: true class
  ◮ c′: potential class

◮ G is a second moment (not covariance) of these gradients:

$$G = \mathrm{Ave}_{i,c,c'}\{\delta_{i,c,c'}\delta_{i,c,c'}^T\}$$
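
A minimal numerical sketch of why these gradients reproduce G, assuming a softmax cross-entropy loss (the slide leaves ℓ generic). For a single example it checks that the sum over c′ of the outer products of $\delta_{i,c,c'}$ equals the Gauss-Newton term $J^T(\mathrm{diag}(p) - pp^T)J$. The Jacobian `J`, the logits, and the toy sizes are random stand-ins chosen for illustration, not values from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
C, P = 4, 7                          # number of classes, number of parameters (toy sizes)

# Hypothetical stand-ins for a single example x_{i,c}
J = rng.normal(size=(C, P))          # Jacobian df/dtheta, shape (C, P)
logits = rng.normal(size=C)          # f(x_{i,c}; theta)
p = np.exp(logits - logits.max())
p /= p.sum()                         # softmax probabilities p(x_{i,c}; theta)
Y = np.eye(C)                        # row c' is the one-hot vector y_{c'}

# delta[c'] = sqrt(p_{c'}) * (y_{c'} - p)^T J, one row per potential class c'
delta = np.sqrt(p)[:, None] * ((Y - p) @ J)          # shape (C, P)

# Sum over c' of the outer products delta delta^T
second_moment = delta.T @ delta                      # shape (P, P)

# Gauss-Newton term for softmax cross-entropy on the same example
gauss_newton = J.T @ (np.diag(p) - np.outer(p, p)) @ J

# For the true class c, delta[c] is a rescaled (negative) loss gradient
c = 2
loss_grad = (p - Y[c]) @ J                           # d(cross-entropy)/d(theta)

print(np.allclose(second_moment, gauss_newton))
print(np.allclose(delta[c], -np.sqrt(p[c]) * loss_grad))
```

The code sums over c′, whereas the slide averages over c′; the two differ only by the constant factor 1/C. The second check illustrates the "up to a scalar" remark: $\delta_{i,c,c}$ is the loss gradient scaled by $-\sqrt{p_c}$.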

slide-22
SLIDE 22

Three-level hierarchical structure in gradients

◮ Averaging over the index i
◮ Averaging over the index c′
◮ Averaging over the index c

[Diagram: hierarchy of gradient averages, from $\delta_{i,c,c'}$ to $\delta_{c,c'}$ to $\delta_c$ to $\mathrm{Ave}_c\{\delta_c\delta_c^T\}$]
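
The three averaging steps can be sketched in NumPy as follows. The gradients are random stand-ins, and the array layout `delta[i, c, c', :]` and the names `G_class`, `G_cross`, `G_within` are assumptions made for this illustration. With equal numbers of examples per class, the second moment splits exactly into three terms, one per level. Whether the talk treats the case c′ = c separately cannot be recovered from this transcript, so the sketch averages over all c′.

```python
import numpy as np

rng = np.random.default_rng(0)
N, C, P = 5, 3, 6                    # examples per class, classes, parameters (toy sizes)

# Hypothetical gradients delta[i, c, c'] in R^P (random, not real network gradients)
delta = rng.normal(size=(N, C, C, P))

delta_cc = delta.mean(axis=0)        # average over i   -> delta_{c,c'}, shape (C, C, P)
delta_c = delta_cc.mean(axis=1)      # average over c'  -> delta_c,      shape (C, P)

def second_moment(v):
    """Average of the outer products v v^T over all leading indices."""
    flat = v.reshape(-1, v.shape[-1])
    return flat.T @ flat / flat.shape[0]

G = second_moment(delta)                                  # Ave_{i,c,c'}{delta delta^T}
G_class = second_moment(delta_c)                          # Ave_c{delta_c delta_c^T}
G_cross = second_moment(delta_cc - delta_c[:, None, :])   # spread of delta_{c,c'} around delta_c
G_within = second_moment(delta - delta_cc[None])          # spread of delta_{i,c,c'} around delta_{c,c'}

print(np.allclose(G, G_class + G_cross + G_within))
```

The cross terms vanish because each level's deviations average to zero over the index being averaged out, which is why the split is exact in the balanced case.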

slide-23
SLIDE 23

Visualization of three-level hierarchical structure in gradients

Figure: ResNet50 trained on ImageNet. Large circles: $\delta_c$. Small circles: $\delta_{c,c'}$.

slide-24
SLIDE 24

Visualization of three-level hierarchical structure in gradients

[Figure panels: MNIST, Fashion, and CIFAR10, each at 13, 702, and 5000 examples per class]

slide-25
SLIDE 25

$\mathrm{Ave}_c\{\delta_c\delta_c^T\}$ causes outliers in G

Figure: ResNet18 trained on CIFAR10, 1351 examples per class. Orange: eigenvalues of $\mathrm{Ave}_c\{\delta_c\delta_c^T\}$.

slide-26
SLIDE 26

$\mathrm{Ave}_c\{\delta_c\delta_c^T\}$ causes outliers in G

[Figure panels: MNIST, Fashion, and CIFAR10, each at 136, 365, and 702 examples per class; MNIST and Fashion at 2599 examples per class; CIFAR10 at 1351 examples per class]
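
A toy sketch of the mechanism, not a reproduction of the experiments in the figures: the gradients here are synthetic, built as a large class-dependent mean plus small noise, which is an assumption made purely for illustration. Because $\mathrm{Ave}_c\{\delta_c\delta_c^T\}$ is an average of C rank-one matrices, it has at most C nonzero eigenvalues. Weyl's inequality then bounds how far each eigenvalue of G can sit from the matching eigenvalue of that matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
N, C, P = 50, 4, 30                  # examples per class, classes, parameters (toy sizes)

# Synthetic gradients: large class-dependent mean plus small noise (illustrative assumption)
class_means = 5.0 * rng.normal(size=(C, P))
delta = class_means[None, :, None, :] + 0.3 * rng.normal(size=(N, C, C, P))

flat = delta.reshape(-1, P)
G = flat.T @ flat / flat.shape[0]                # Ave_{i,c,c'}{delta delta^T}

delta_c = delta.mean(axis=(0, 2))                # average over i and c' -> shape (C, P)
G_class = delta_c.T @ delta_c / C                # Ave_c{delta_c delta_c^T}, rank at most C

eig_G = np.linalg.eigvalsh(G)[::-1]              # eigenvalues, largest first
eig_class = np.linalg.eigvalsh(G_class)[::-1]

# Weyl's inequality: |lambda_k(G) - lambda_k(G_class)| <= ||G - G_class||_2 for every k
gap = np.linalg.norm(G - G_class, ord=2)

print("top eigenvalues of G:      ", np.round(eig_G[:C + 2], 3))
print("top eigenvalues of G_class:", np.round(eig_class[:C + 2], 3))
print("spectral norm of G - G_class:", round(float(gap), 3))
```

In this synthetic setup the C largest eigenvalues of G track those of `G_class`, while the remaining eigenvalues stay within `gap` of zero and form the bulk.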