SLIDE 1

Gradient Descent Finds Global Minima of Deep Neural Networks

Simon S. Du, Jason D. Lee, Haochuan Li, Liwei Wang, Xiyu Zhai

SLIDE 2

Empirical Observations on Empirical Risk


  • Zhang et al., 2017, Understanding Deep Learning Requires Rethinking Generalization.
  • Randomization test: replace the true labels with random labels.
  • Observation: the empirical risk goes to 0 for both true labels and random labels.
  • Conjecture: this happens because neural networks are over-parameterized.
  • Open problem: explain why gradient descent can find a neural network that fits all the labels.

SLIDE 3

Setup

  • Training data: $\{(x_i, y_i)\}_{i=1}^{n}$, with $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$.
  • A model. Fully connected neural network:
    $f(\theta, x) = W_L\, \sigma(W_{L-1} \cdots W_2\, \sigma(W_1 x) \cdots)$
  • A loss function. Quadratic loss:
    $R(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \big(f(\theta, x_i) - y_i\big)^2$
  • An optimization algorithm. Gradient descent:
    $\theta(t+1) \leftarrow \theta(t) - \eta\, \frac{\partial R(\theta(t))}{\partial \theta(t)}$
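A minimal runnable sketch of this setup, assuming a tanh activation, Gaussian initialization scaled by the fan-in, and JAX for automatic differentiation; the names init_params, f, empirical_risk, and gd_step are illustrative, not the paper's code.

# Minimal sketch of the setup above (assumed activation and init; not the paper's code).
import jax
import jax.numpy as jnp

def init_params(key, d, m, L):
    # L weight matrices: W1 in R^{m x d}, W2..W_{L-1} in R^{m x m}, W_L in R^{1 x m}.
    sizes = [(m, d)] + [(m, m)] * (L - 2) + [(1, m)]
    keys = jax.random.split(key, L)
    return [jax.random.normal(k, s) / jnp.sqrt(s[1]) for k, s in zip(keys, sizes)]

def f(params, x):
    # f(theta, x) = W_L sigma(W_{L-1} ... sigma(W_1 x) ...), with sigma = tanh (smooth).
    h = x
    for W in params[:-1]:
        h = jnp.tanh(W @ h)
    return (params[-1] @ h)[0]

def empirical_risk(params, X, y):
    # R(theta) = (1/2n) * sum_i (f(theta, x_i) - y_i)^2.
    preds = jax.vmap(lambda x: f(params, x))(X)
    return 0.5 * jnp.mean((preds - y) ** 2)

def gd_step(params, X, y, eta):
    # theta(t+1) = theta(t) - eta * dR(theta(t))/dtheta(t).
    grads = jax.grad(empirical_risk)(params, X, y)
    return [W - eta * g for W, g in zip(params, grads)]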

SLIDE 4

Trajectory-based Analysis


  • Trajectory of parameters under gradient descent $\theta(t+1) \leftarrow \theta(t) - \eta\, \frac{\partial R(\theta(t))}{\partial \theta(t)}$: $\theta(0), \theta(1), \theta(2), \ldots$
  • Predictions: $u_i(t) \triangleq f(\theta(t), x_i)$, collected into $u(t) \triangleq (u_1(t), \ldots, u_n(t))^{\top} \in \mathbb{R}^n$.
  • Trajectory of predictions: $u(0), u(1), u(2), \ldots$
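Continuing the setup sketch above (it reuses the assumed f and gd_step defined there), a short way to record the trajectory of predictions alongside the gradient-descent trajectory of parameters:

# Record u(t) = (f(theta(t), x_1), ..., f(theta(t), x_n)) after each GD update (illustrative only).
import jax
import jax.numpy as jnp

def predictions(params, X):
    # u(t) in R^n, stacking f(theta(t), x_i) over the training inputs.
    return jax.vmap(lambda x: f(params, x))(X)

def run_gd(params, X, y, eta, steps):
    # Returns (u(0), u(1), ..., u(steps)) as a (steps+1, n) array, plus the final parameters.
    traj = [predictions(params, X)]
    for _ in range(steps):
        params = gd_step(params, X, y, eta)   # theta(t+1) <- theta(t) - eta * dR/dtheta
        traj.append(predictions(params, X))
    return jnp.stack(traj), params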

SLIDE 5

Proof Sketch

  • Simplified form (continuous time):
    $\frac{du(t)}{dt} = -\sum_{\ell=1}^{L} H^{\ell}(t)\,\big(u(t) - y\big)$, where $H^{\ell}_{ij}(t) = \frac{1}{n}\left\langle \frac{\partial u_i(t)}{\partial W^{\ell}(t)}, \frac{\partial u_j(t)}{\partial W^{\ell}(t)} \right\rangle$.
  • Random initialization + concentration + perturbation analysis:
    $\sum_{\ell=1}^{L} H^{\ell}(0) \to H^{\infty}$ as $m \to \infty$, and $\sum_{\ell=1}^{L} H^{\ell}(t) \to \sum_{\ell=1}^{L} H^{\ell}(0)$ for all $t \geq 0$ as $m \to \infty$.
  • Linear ODE theory:
    $\|u(t) - y\|_2^2 \leq \exp(-\lambda_0 t)\, \|u(0) - y\|_2^2$, where $\lambda_0 = \lambda_{\min}(H^{\infty})$.
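Continuing the same sketch, the per-layer Gram matrices $H^{\ell}(t)$ can be formed numerically from the Jacobian of the prediction map; the $1/n$ normalization follows the definition above, and the helper names are illustrative rather than the paper's code.

# Sketch: per-layer Gram matrices H^l(t) of the prediction map, and the smallest
# eigenvalue of their sum, which plays the role of lambda_0 = lambda_min(H_infinity).
# Reuses f from the setup sketch.
import jax
import jax.numpy as jnp

def gram_matrices(params, X):
    # H^l_{ij}(t) = (1/n) < du_i(t)/dW^l(t), du_j(t)/dW^l(t) >, one n x n matrix per layer.
    n = X.shape[0]
    u = lambda ps: jax.vmap(lambda x: f(ps, x))(X)   # u(t) as a function of the parameters
    jacs = jax.jacobian(u)(params)                   # one Jacobian block per weight matrix
    return [J.reshape(n, -1) @ J.reshape(n, -1).T / n for J in jacs]

def smallest_eigenvalue(params, X):
    # lambda_min of sum_l H^l(t); at a wide random initialization this approximates lambda_0.
    H = sum(gram_matrices(params, X))
    return jnp.linalg.eigvalsh(H)[0]

Under the linear-ODE heuristic, as long as $\sum_{\ell} H^{\ell}(t)$ stays close to its value at initialization, the residual $u(t) - y$ contracts in every eigendirection at rate at least $\lambda_0$, which is exactly the $\exp(-\lambda_0 t)$ bound above.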

SLIDE 6

Main Results


Theorem 1: For a fully connected neural network with smooth activation, if the width $m = \mathrm{poly}\big(n,\, 2^{O(L)},\, 1/\lambda_0\big)$ and the step size $\eta = O\!\left(\frac{\lambda_0}{\mathrm{poly}(n)\, 2^{O(L)}}\right)$, then with high probability over the random initialization we have, for $k = 1, 2, \ldots$:

$R(\theta(k)) \leq (1 - \eta \lambda_0)^{k}\, R(\theta(0))$.

  • First global linear convergence guarantee for deep neural networks.
  • Exponential dependence on the depth $L$ due to error propagation.
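For intuition only, a toy numerical check of a Theorem 1-style bound on synthetic data, reusing the sketches above; the problem sizes, the step-size heuristic $\eta = 1/\lambda_{\max}$, and the use of the Gram matrix at initialization in place of $H^{\infty}$ are assumptions of this illustration, not the theorem's conditions.

# Toy check (not a proof): compare R(theta(k)) on random labels against the geometric
# envelope (1 - eta*lambda_0)^k * R(theta(0)); reuses init_params, empirical_risk,
# gd_step, and gram_matrices from the sketches above.
import jax
import jax.numpy as jnp

key = jax.random.PRNGKey(0)
kx, ky, kp = jax.random.split(key, 3)
n, d, m, L = 8, 5, 512, 3                         # small sample, wide network (arbitrary sizes)
X = jax.random.normal(kx, (n, d)) / jnp.sqrt(d)
y = jax.random.normal(ky, (n,))                   # random labels, as in the randomization test
params = init_params(kp, d, m, L)

evals = jnp.linalg.eigvalsh(sum(gram_matrices(params, X)))
lam0, lam_max = evals[0], evals[-1]               # lam0 stands in for lambda_min(H_infinity)
eta = 1.0 / lam_max                               # heuristic step size, not the theorem's choice
R0 = empirical_risk(params, X, y)
for k in range(1, 51):
    params = gd_step(params, X, y, eta)
    if k % 10 == 0:
        envelope = (1.0 - eta * lam0) ** k * R0
        print(k, float(empirical_risk(params, X, y)), float(envelope))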
SLIDE 7

Main Results (Cont’d)


  • The ResNet architecture makes the error propagation more stable, giving an exponential improvement over fully connected neural networks.

Theorem 2: For a ResNet or a convolutional ResNet with smooth activation, if the width $m = \mathrm{poly}(n,\, L,\, 1/\lambda_0)$ and the step size $\eta = O\!\left(\frac{\lambda_0}{\mathrm{poly}(n,\, L)}\right)$, then with high probability over the random initialization we have, for $k = 1, 2, \ldots$:

$R(\theta(k)) \leq (1 - \eta \lambda_0)^{k}\, R(\theta(0))$.

SLIDE 8


Learn more @ Pacific Ballroom #80