SLIDE 1

Gradient Descent Finds Global Minima of Deep Neural Networks

Simon S. Du, Jason D. Lee, Haochuan Li, Liwei Wang, Xiyu Zhai

SLIDE 2

Empirical Observations on Empirical Risk


  • Zhang et al., 2017, Understanding Deep Learning Requires Rethinking Generalization.
  • Randomization test: replace the true labels with random labels.
  • Observation: the empirical risk goes to 0 for both true labels and random labels.
  • Conjecture: this happens because neural networks are over-parameterized.
  • Open problem: explain why gradient descent can find a neural network that fits all the labels.

SLIDE 3

Setup

  • Training data: $\{(x_i, y_i)\}_{i=1}^{n}$, with $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$.
  • A model. Fully connected neural network:
    $f(\theta, x) = W_L\, \sigma(W_{L-1} \cdots W_2\, \sigma(W_1 x) \cdots)$
  • A loss function. Quadratic loss:
    $R(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \big(f(\theta, x_i) - y_i\big)^2$
  • An optimization algorithm. Gradient descent:
    $\theta(t+1) \leftarrow \theta(t) - \eta\, \frac{\partial R(\theta(t))}{\partial \theta(t)}$
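A minimal runnable sketch of this setup, assuming a tanh activation, Gaussian initialization scaled by the fan-in, and JAX for automatic differentiation; the names init_params, f, empirical_risk, and gd_step are illustrative, not the paper's code.

# Minimal sketch of the setup above (assumed activation and init; not the paper's code).
import jax
import jax.numpy as jnp

def init_params(key, d, m, L):
    # L weight matrices: W1 in R^{m x d}, W2..W_{L-1} in R^{m x m}, W_L in R^{1 x m}.
    sizes = [(m, d)] + [(m, m)] * (L - 2) + [(1, m)]
    keys = jax.random.split(key, L)
    return [jax.random.normal(k, s) / jnp.sqrt(s[1]) for k, s in zip(keys, sizes)]

def f(params, x):
    # f(theta, x) = W_L sigma(W_{L-1} ... sigma(W_1 x) ...), with sigma = tanh (smooth).
    h = x
    for W in params[:-1]:
        h = jnp.tanh(W @ h)
    return (params[-1] @ h)[0]

def empirical_risk(params, X, y):
    # R(theta) = (1/2n) * sum_i (f(theta, x_i) - y_i)^2.
    preds = jax.vmap(lambda x: f(params, x))(X)
    return 0.5 * jnp.mean((preds - y) ** 2)

def gd_step(params, X, y, eta):
    # theta(t+1) = theta(t) - eta * dR(theta(t))/dtheta(t).
    grads = jax.grad(empirical_risk)(params, X, y)
    return [W - eta * g for W, g in zip(params, grads)]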

SLIDE 4

Trajectory-based Analysis


  • Trajectory of parameters under gradient descent $\theta(t+1) \leftarrow \theta(t) - \eta\, \frac{\partial R(\theta(t))}{\partial \theta(t)}$: $\theta(0), \theta(1), \theta(2), \ldots$
  • Predictions: $u_i(t) \triangleq f(\theta(t), x_i)$, collected into $u(t) \triangleq (u_1(t), \ldots, u_n(t))^{\top} \in \mathbb{R}^n$.
  • Trajectory of predictions: $u(0), u(1), u(2), \ldots$
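Continuing the setup sketch above (it reuses the assumed f and gd_step defined there), a short way to record the trajectory of predictions alongside the gradient-descent trajectory of parameters:

# Record u(t) = (f(theta(t), x_1), ..., f(theta(t), x_n)) after each GD update (illustrative only).
import jax
import jax.numpy as jnp

def predictions(params, X):
    # u(t) in R^n, stacking f(theta(t), x_i) over the training inputs.
    return jax.vmap(lambda x: f(params, x))(X)

def run_gd(params, X, y, eta, steps):
    # Returns (u(0), u(1), ..., u(steps)) as a (steps+1, n) array, plus the final parameters.
    traj = [predictions(params, X)]
    for _ in range(steps):
        params = gd_step(params, X, y, eta)   # theta(t+1) <- theta(t) - eta * dR/dtheta
        traj.append(predictions(params, X))
    return jnp.stack(traj), params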

SLIDE 5

Proof Sketch

  • Simplified form (continuous time):
    $\frac{du(t)}{dt} = -\sum_{\ell=1}^{L} H^{\ell}(t)\,\big(u(t) - y\big)$, where $H^{\ell}_{ij}(t) = \frac{1}{n}\left\langle \frac{\partial u_i(t)}{\partial W^{\ell}(t)}, \frac{\partial u_j(t)}{\partial W^{\ell}(t)} \right\rangle$.
  • Random initialization + concentration + perturbation analysis:
    $\sum_{\ell=1}^{L} H^{\ell}(0) \to H^{\infty}$ as $m \to \infty$, and $\sum_{\ell=1}^{L} H^{\ell}(t) \to \sum_{\ell=1}^{L} H^{\ell}(0)$ for all $t \geq 0$ as $m \to \infty$.
  • Linear ODE theory:
    $\|u(t) - y\|_2^2 \leq \exp(-\lambda_0 t)\, \|u(0) - y\|_2^2$, where $\lambda_0 = \lambda_{\min}(H^{\infty})$.
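Continuing the same sketch, the per-layer Gram matrices $H^{\ell}(t)$ can be formed numerically from the Jacobian of the prediction map; the $1/n$ normalization follows the definition above, and the helper names are illustrative rather than the paper's code.

# Sketch: per-layer Gram matrices H^l(t) of the prediction map, and the smallest
# eigenvalue of their sum, which plays the role of lambda_0 = lambda_min(H_infinity).
# Reuses f from the setup sketch.
import jax
import jax.numpy as jnp

def gram_matrices(params, X):
    # H^l_{ij}(t) = (1/n) < du_i(t)/dW^l(t), du_j(t)/dW^l(t) >, one n x n matrix per layer.
    n = X.shape[0]
    u = lambda ps: jax.vmap(lambda x: f(ps, x))(X)   # u(t) as a function of the parameters
    jacs = jax.jacobian(u)(params)                   # one Jacobian block per weight matrix
    return [J.reshape(n, -1) @ J.reshape(n, -1).T / n for J in jacs]

def smallest_eigenvalue(params, X):
    # lambda_min of sum_l H^l(t); at a wide random initialization this approximates lambda_0.
    H = sum(gram_matrices(params, X))
    return jnp.linalg.eigvalsh(H)[0]

Under the linear-ODE heuristic, as long as $\sum_{\ell} H^{\ell}(t)$ stays close to its value at initialization, the residual $u(t) - y$ contracts in every eigendirection at rate at least $\lambda_0$, which is exactly the $\exp(-\lambda_0 t)$ bound above.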

SLIDE 6

Main Results


Theorem 1: For a fully connected neural network with smooth activation, if the width $m = \mathrm{poly}\big(n,\, 2^{O(L)},\, 1/\lambda_0\big)$ and the step size $\eta = O\!\left(\frac{\lambda_0}{\mathrm{poly}(n)\, 2^{O(L)}}\right)$, then with high probability over the random initialization we have, for $k = 1, 2, \ldots$:

$R(\theta(k)) \leq (1 - \eta \lambda_0)^{k}\, R(\theta(0))$.

  • First global linear convergence guarantee for deep neural networks.
  • Exponential dependence on the depth $L$ due to error propagation.
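For intuition only, a toy numerical check of a Theorem 1-style bound on synthetic data, reusing the sketches above; the problem sizes, the step-size heuristic $\eta = 1/\lambda_{\max}$, and the use of the Gram matrix at initialization in place of $H^{\infty}$ are assumptions of this illustration, not the theorem's conditions.

# Toy check (not a proof): compare R(theta(k)) on random labels against the geometric
# envelope (1 - eta*lambda_0)^k * R(theta(0)); reuses init_params, empirical_risk,
# gd_step, and gram_matrices from the sketches above.
import jax
import jax.numpy as jnp

key = jax.random.PRNGKey(0)
kx, ky, kp = jax.random.split(key, 3)
n, d, m, L = 8, 5, 512, 3                         # small sample, wide network (arbitrary sizes)
X = jax.random.normal(kx, (n, d)) / jnp.sqrt(d)
y = jax.random.normal(ky, (n,))                   # random labels, as in the randomization test
params = init_params(kp, d, m, L)

evals = jnp.linalg.eigvalsh(sum(gram_matrices(params, X)))
lam0, lam_max = evals[0], evals[-1]               # lam0 stands in for lambda_min(H_infinity)
eta = 1.0 / lam_max                               # heuristic step size, not the theorem's choice
R0 = empirical_risk(params, X, y)
for k in range(1, 51):
    params = gd_step(params, X, y, eta)
    if k % 10 == 0:
        envelope = (1.0 - eta * lam0) ** k * R0
        print(k, float(empirical_risk(params, X, y)), float(envelope))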
SLIDE 7

Main Results (Cont’d)


  • The ResNet architecture makes the error propagation more stable, giving an exponential improvement over fully connected neural networks.

Theorem 2: For a ResNet or a convolutional ResNet with smooth activation, if the width $m = \mathrm{poly}(n,\, L,\, 1/\lambda_0)$ and the step size $\eta = O\!\left(\frac{\lambda_0}{\mathrm{poly}(n,\, L)}\right)$, then with high probability over the random initialization we have, for $k = 1, 2, \ldots$:

$R(\theta(k)) \leq (1 - \eta \lambda_0)^{k}\, R(\theta(0))$.

SLIDE 8


Learn more @ Pacific Ballroom #80