

SLIDE 1

Loss Surfaces, Mode Connectivity, and Fast Ensembling of DNNs

Timur Garipov [1,2], Pavel Izmailov [3], Dmitrii Podoprikhin [4], Dmitry Vetrov [5], Andrew Gordon Wilson [3]

[1] Samsung AI Center in Moscow, [2] Skolkovo Institute of Science and Technology, [3] Cornell University, [4] Samsung-HSE Laboratory, [5] National Research University Higher School of Economics

Neural Information Processing Systems, Montreal, Canada, December 4, 2018

SLIDE 2

Loss Surfaces

ResNet-164, CIFAR-100

SLIDE 3

Loss Surfaces

ResNet-164, CIFAR-100

SLIDE 4

Finding Paths between Modes

Weights of pretrained networks: $w_1, w_2 \in \mathbb{R}^{|\mathrm{net}|}$

Define a parametric curve $\phi_\theta(\cdot)\colon [0, 1] \to \mathbb{R}^{|\mathrm{net}|}$ with $\phi_\theta(0) = w_1$, $\phi_\theta(1) = w_2$

DNN loss function: $L(w)$

Minimize the loss averaged along the curve with respect to $\theta$:

$$\ell(\theta) = \int_0^1 L(\phi_\theta(t))\, dt = \mathbb{E}_{t \sim U(0,1)}\, L(\phi_\theta(t))$$
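This objective can be illustrated on a toy 2D loss with two modes separated by a barrier. The quadratic Bezier parametrization of the curve (a single trainable control point θ), the synthetic loss, and every hyperparameter below are illustrative assumptions, not the paper's actual setup:

```python
import numpy as np

# Toy loss: two quadratic wells at a and b plus a Gaussian barrier at the
# origin, so the straight segment between the modes crosses a high-loss region.
def loss(w, a, b, h=5.0, sigma=0.3):
    quad = min(np.sum((w - a) ** 2), np.sum((w - b) ** 2))
    return quad + h * np.exp(-np.sum(w ** 2) / sigma ** 2)

def grad_loss(w, a, b, h=5.0, sigma=0.3):
    near = a if np.sum((w - a) ** 2) <= np.sum((w - b) ** 2) else b
    barrier = h * np.exp(-np.sum(w ** 2) / sigma ** 2)
    return 2.0 * (w - near) - barrier * 2.0 * w / sigma ** 2

# Quadratic Bezier curve: endpoints are the fixed modes, theta is trained.
def phi(t, a, theta, b):
    return (1 - t) ** 2 * a + 2 * t * (1 - t) * theta + t ** 2 * b

a, b = np.array([-1.0, 0.0]), np.array([1.0, 0.0])
theta = (a + b) / 2 + np.array([0.0, 0.1])   # midpoint init, symmetry broken
rng = np.random.default_rng(0)
for _ in range(5000):                        # SGD on E_{t~U(0,1)} L(phi_theta(t))
    t = rng.uniform()
    g = grad_loss(phi(t, a, theta, b), a, b)
    theta -= 0.01 * 2 * t * (1 - t) * g      # chain rule: dphi/dtheta = 2t(1-t)

ts = np.linspace(0, 1, 201)
curve_max = max(loss(phi(t, a, theta, b), a, b) for t in ts)
line_max = max(loss((1 - t) * a + t * b, a, b) for t in ts)
print(curve_max, line_max)
```

On this toy problem the control point drifts around the barrier, so the maximum loss along the trained curve ends up far below the barrier height that straight-line interpolation between the modes must cross.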
SLIDE 5

SLIDE 6

Loss Surfaces

VGG-16, CIFAR-10

[Contour plots omitted. Top row: train loss, with contour levels ranging from ≈0.026 to >3. Bottom row: test error (%), with levels ranging from ≈6.6 to >40.]

SLIDE 7

SLIDE 8

Fast Geometric Ensembles (FGE)

[Figures omitted. Left: the learning rate over epochs ≈25–35 follows a cyclical schedule alternating between α1 and α2 with cycle length c, started after 75% of the standard training budget; one ensemble member is collected per cycle. Right: test error (%) and the distance from the previous snapshot, plotted against the FGE iteration number (0.5c–3.5c).]
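The cyclical-schedule idea can be sketched as follows; the triangular shape matches the schedule described above, but the cycle length, learning-rate values, and snapshot bookkeeping are illustrative placeholders, not the released implementation:

```python
def fge_lr(it, cycle, a1, a2):
    # Triangular cyclical schedule: within each cycle the learning rate decays
    # linearly from a1 down to a2, then climbs linearly back up to a1.
    t = ((it % cycle) + 1) / cycle
    if t <= 0.5:
        return (1 - 2 * t) * a1 + 2 * t * a2
    return (2 - 2 * t) * a2 + (2 * t - 1) * a1

snapshot_iters = []
for it in range(3 * 40):              # three cycles of (hypothetical) length c = 40
    lr = fge_lr(it, 40, a1=0.05, a2=0.0005)
    # ... one SGD step with learning rate `lr` on the real model would go here ...
    if (it % 40) + 1 == 20:           # mid-cycle: lr == a2, weights sit in a low-loss region
        snapshot_iters.append(it)     # save a copy of the weights as an ensemble member

print(snapshot_iters)                 # one snapshot collected per cycle
```

Collecting a snapshot exactly when the learning rate bottoms out at α2 is what lets FGE gather diverse, individually accurate ensemble members within a single training run.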

SLIDE 9

Ensembling Results

[Figure omitted: test accuracy (%) (74–82) versus training budget (0.5B–2B) for SSE and FGE, showing both the separate snapshot models and their ensembles, with a single model trained for budget B as a baseline.]

SSE: Huang et al., "Snapshot Ensembles: Train 1, Get M for Free", ICLR 2017.
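Both SSE and FGE form their ensemble by averaging the predicted class probabilities of the collected snapshots. A minimal sketch of that averaging step (the toy probability arrays are invented for illustration):

```python
import numpy as np

def ensemble_predict(prob_list):
    # Average per-model class probabilities, then take the argmax class per input.
    return np.mean(np.stack(prob_list), axis=0).argmax(axis=1)

# Softmax outputs of three hypothetical snapshots for two inputs, three classes.
p1 = np.array([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]])
p2 = np.array([[0.4, 0.4, 0.2], [0.1, 0.7, 0.2]])
p3 = np.array([[0.5, 0.2, 0.3], [0.3, 0.3, 0.4]])
pred = ensemble_predict([p1, p2, p3])
print(pred)  # class 0 for the first input, class 1 for the second
```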

SLIDE 10

Summary

Local optima of DNN loss surfaces are connected by simple curves along which the loss stays low. To find such a curve, we minimize the loss in expectation over a uniform distribution on a path from one mode to the other. Inspired by these insights, we propose Fast Geometric Ensembling (FGE), a fast ensembling algorithm. PyTorch code is released for both mode connectivity and FGE.

Come to our poster #162!