slide-1
SLIDE 1

Principled Deep Neural Network Training through Linear Programming

Daniel Bienstock¹, Gonzalo Muñoz², Sebastian Pokutta³
January 9, 2019

¹ IEOR, Columbia University   ² IVADO, Polytechnique Montréal   ³ ISyE, Georgia Tech

1

slide-2
SLIDE 2

“...I’m starting to look at machine learning problems” Oktay Günlük’s research interests, Aussois 2019

2

slide-3
SLIDE 3

Goal of this talk

slide-4
SLIDE 4

Goal of this talk

  • Deep Learning is receiving significant attention due to its impressive performance.
  • Unfortunately, results regarding the complexity of training deep neural networks have only been obtained recently.
  • Our goal: to show that large classes of Neural Networks can be trained to near optimality using linear programs whose size is linear in the data.

3


slide-9
SLIDE 9

Empirical Risk Minimization problem

Given:

  • D data points (x̂_i, ŷ_i), i = 1, …, D
  • x̂_i ∈ R^n, ŷ_i ∈ R^m
  • A loss function ℓ : R^m × R^m → R (not necessarily convex)

Compute f : R^n → R^m to solve

    min_{f ∈ F (some class)}  (1/D) ∑_{i=1}^{D} ℓ(f(x̂_i), ŷ_i)   (+ optional regularizer Φ(f))

4

slide-10
SLIDE 10

Empirical Risk Minimization problem

    min_{f ∈ F (some class)}  (1/D) ∑_{i=1}^{D} ℓ(f(x̂_i), ŷ_i)   (+ optional regularizer Φ(f))

Examples:

  • Linear Regression. f(x) = Ax + b with ℓ2-loss.
  • Binary Classification. Varying f architectures and cross-entropy loss:
    ℓ(p, y) = −y log(p) − (1 − y) log(1 − p)
  • Neural Networks with k layers.
    f(x) = T_{k+1} ∘ σ ∘ T_k ∘ σ ∘ … ∘ σ ∘ T_1(x), each T_j affine.

5
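As a concrete illustration of the ERM objective above, here is a minimal sketch (an editorial addition, not from the slides) that evaluates the empirical risk (1/D) ∑_i ℓ(f(x̂_i), ŷ_i) for the linear-regression example with squared ℓ2-loss; the data and parameters below are made up for the example.

```python
import numpy as np

def empirical_risk(f, loss, X, Y):
    """Average loss (1/D) * sum_i loss(f(x_i), y_i) over a dataset."""
    return np.mean([loss(f(x), y) for x, y in zip(X, Y)])

# Linear-regression hypothesis f(x) = Ax + b with squared l2-loss.
rng = np.random.default_rng(0)
n, m, D = 3, 2, 10                      # input dim, output dim, number of data points
A = rng.uniform(-1, 1, size=(m, n))     # parameters (A, b), entries in [-1, 1]
b = rng.uniform(-1, 1, size=m)

f = lambda x: A @ x + b
sq_loss = lambda p, y: np.sum((p - y) ** 2)

X = rng.uniform(-1, 1, size=(D, n))     # synthetic data points in [-1, 1]^n
Y = rng.uniform(-1, 1, size=(D, m))

print(empirical_risk(f, sq_loss, X, Y))
```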


slide-12
SLIDE 12

Function parameterization

We assume family F (statisticians’ hypothesis) is parameterized: there exists f such that F = {f(x, θ) : θ ∈ Θ ⊆ [−1, 1]^N}. Thus, THE problem becomes

    min_{θ ∈ Θ}  (1/D) ∑_{i=1}^{D} ℓ(f(x̂_i, θ), ŷ_i)

6
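Since Θ is a compact box [−1, 1]^N, one crude way to approximate this parameterized problem is a grid search over θ. The sketch below is an editorial illustration only (exponential in N, and not the LP approach of this talk); it simply makes the parameterized objective concrete on a toy instance with made-up data.

```python
import itertools
import numpy as np

def grid_search_erm(f, loss, X, Y, N, step=0.25):
    """Brute-force approximation of min over theta in [-1, 1]^N of (1/D) sum_i loss(f(x_i, theta), y_i)."""
    grid = np.arange(-1.0, 1.0 + 1e-9, step)
    best_val, best_theta = np.inf, None
    for theta in itertools.product(grid, repeat=N):
        theta = np.array(theta)
        val = np.mean([loss(f(x, theta), y) for x, y in zip(X, Y)])
        if val < best_val:
            best_val, best_theta = val, theta
    return best_theta, best_val

# Toy instance: f(x, theta) = theta . x (N = n = 2), squared loss, made-up data.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(20, 2))
Y = X @ np.array([0.5, -0.25])          # planted parameters lie on the grid
theta, value = grid_search_erm(lambda x, t: t @ x, lambda p, y: (p - y) ** 2, X, Y, N=2)
print(theta, value)
```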

slide-13
SLIDE 13

What we know for Neural Nets


slide-17
SLIDE 17

Neural Networks

  • D data points (x̂_i, ŷ_i), 1 ≤ i ≤ D, x̂_i ∈ R^n, ŷ_i ∈ R^m
  • f = T_{k+1} ∘ σ ∘ T_k ∘ σ ∘ … ∘ σ ∘ T_1
  • Each T_i affine: T_i(y) = A_i y + b_i
  • A_1 is n × w, A_{k+1} is w × m, A_i is w × w otherwise.

[Figure: fully connected network with input dimension n, hidden layers of width w, and output dimension m]

7
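To make the architecture concrete, here is a minimal sketch (an editorial addition, not from the slides) of the forward pass f = T_{k+1} ∘ σ ∘ T_k ∘ … ∘ σ ∘ T_1 with ReLU activation σ; the weights are placeholders drawn from [−1, 1], and matrices are stored as (output dim × input dim), i.e., transposed relative to the slide's statement, so that T_i(y) = A_i y + b_i type-checks.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def forward(x, weights, biases):
    """f(x) = T_{k+1}( sigma( T_k( ... sigma(T_1(x)) ... ) ) ), each T_i affine."""
    h = x
    for i, (A, b) in enumerate(zip(weights, biases)):
        h = A @ h + b                 # affine map T_i
        if i < len(weights) - 1:      # ReLU on all but the output layer
            h = relu(h)
    return h

# Example dimensions: input n, k hidden layers of width w, output m.
rng = np.random.default_rng(0)
n, w, m, k = 4, 5, 2, 3
dims = [n] + [w] * k + [m]
# A_i maps R^{dims[i]} -> R^{dims[i+1]} (stored as dims[i+1] x dims[i] matrices here).
weights = [rng.uniform(-1, 1, size=(dims[i + 1], dims[i])) for i in range(k + 1)]
biases  = [rng.uniform(-1, 1, size=dims[i + 1]) for i in range(k + 1)]

x = rng.uniform(-1, 1, size=n)
print(forward(x, weights, biases))    # a point in R^m
```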

slide-18
SLIDE 18

Hardness Results

Theorem (Blum and Rivest 1992). Let x̂_i ∈ R^n, ŷ_i ∈ {0, 1}, ℓ ∈ {absolute value, 2-norm squared} and σ a threshold function. Then training is NP-hard even in this simple network: [Figure: a small one-hidden-layer network]

Theorem (Boob, Dey and Lan 2018). Let x̂_i ∈ R^n, ŷ_i ∈ {0, 1}, ℓ a norm and σ(t) = max{0, t} a ReLU activation. Then training is NP-hard in the same network.

8


slide-20
SLIDE 20

Exact Training Complexity

Theorem (Arora, Basu, Mianjy and Mukherjee 2018). If k = 1 (one “hidden layer”), m = 1 and ℓ is convex, there is an exact training algorithm of complexity O(2^w D^{nw} poly(D, n, w)). Polynomial in the size of the data set, for fixed n, w. Also in that paper:

“we are not aware of any complexity results which would rule out the possibility of an algorithm which trains to global optimality in time that is polynomial in the data size” “Perhaps an even better breakthrough would be to get optimal training algorithms for DNNs with two or more hidden layers and this seems like a substantially harder nut to crack”

9


slide-24
SLIDE 24

What we’ll prove

There exists a polytope:

  • whose size depends linearly on D
  • that encodes approximately all possible training problems coming from (x̂_i, ŷ_i)_{i=1}^{D} ⊆ [−1, 1]^{(n+m)D}.

Spoiler: Theory-only results

10


slide-26
SLIDE 26

Our Hammer


slide-29
SLIDE 29

Treewidth

Treewidth is a parameter that measures how tree-like a graph is. Definition: given a chordal graph G, we say its treewidth is ω if its clique number is ω + 1.

  • Trees have treewidth 1
  • Cycles have treewidth 2
  • Kn has treewidth n − 1

11
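As a quick sanity check of these three examples, the sketch below (an editorial addition, not from the slides) uses networkx's treewidth heuristic, which is an upper bound in general but happens to be exact on these simple graph families.

```python
import networkx as nx
from networkx.algorithms.approximation import treewidth_min_degree

tree  = nx.path_graph(10)       # a path is a tree: treewidth 1
cycle = nx.cycle_graph(10)      # cycle: treewidth 2
k6    = nx.complete_graph(6)    # K_6: treewidth 5 = n - 1

for name, G in [("tree", tree), ("cycle", cycle), ("K_6", k6)]:
    width, _ = treewidth_min_degree(G)   # returns (width, tree decomposition)
    print(name, width)
```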


slide-34
SLIDE 34

Approximate optimization of well-behaved functions

Prototype problem:  min c^T x  s.t.  f_i(x) ≤ 0, i = 1, …, m,  x ∈ [0, 1]^n

Toolset:

  • Each f_i is “well-behaved”: Lipschitz constant L_i over [0, 1]^n
  • Intersection graph: an edge whenever two variables appear in the same f_i.

For example:  x_1 + x_2 + x_3 ≤ 1,  x_3 + x_4 ≥ 1,  x_4 · x_5 + x_6 ≤ 2.

[Figure: intersection graph on vertices 1–6, with cliques {1, 2, 3} and {4, 5, 6} joined by the edge {3, 4}]

12
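For illustration (an editorial addition, not from the slides), the following sketch builds the intersection graph of the example above from the variable index sets of the three constraints, then reports its edges and its (heuristic) treewidth.

```python
import itertools
import networkx as nx
from networkx.algorithms.approximation import treewidth_min_degree

def intersection_graph(constraint_vars):
    """One vertex per variable; an edge whenever two variables appear in the same constraint."""
    G = nx.Graph()
    for variables in constraint_vars:
        G.add_nodes_from(variables)
        G.add_edges_from(itertools.combinations(variables, 2))
    return G

# Variable index sets of:  x1+x2+x3 <= 1,  x3+x4 >= 1,  x4*x5+x6 <= 2
constraints = [{1, 2, 3}, {3, 4}, {4, 5, 6}]
G = intersection_graph(constraints)

print(sorted(G.edges()))            # [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]
print(treewidth_min_degree(G)[0])   # 2 for this small chordal graph
```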


slide-36
SLIDE 36

Approximate optimization of well-behaved functions

Prototype problem:  min c^T x  s.t.  f_i(x) ≤ 0, i = 1, …, m,  x ∈ [0, 1]^n

An extension of a result by Bienstock and M. 2018:

Theorem. Suppose the intersection graph has treewidth ω and let L = max_i L_i. Then, for every ϵ > 0 there is an LP relaxation of size O((L/ϵ)^{ω+1} n) that guarantees ϵ optimality and feasibility errors.

13
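To make the size bound concrete, here is a worked instantiation (an editorial addition with made-up numbers) on the six-variable example above, whose intersection graph is chordal with clique number 3 and hence treewidth ω = 2 by the definition given earlier.

```latex
% Illustrative numbers (not from the talk): n = 6 variables, treewidth omega = 2,
% L = max_i L_i = 10, target accuracy epsilon = 0.1.
O\!\left((L/\epsilon)^{\omega+1}\, n\right)
  = O\!\left((10/0.1)^{3} \cdot 6\right)
  = O\!\left(100^{3} \cdot 6\right)
  = O\!\left(6 \times 10^{6}\right)
```

So the relaxation has on the order of a few million rows and columns, and it guarantees 0.1-optimality and 0.1-feasibility errors.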


slide-38
SLIDE 38

Application to ERM problem

We now apply the LP approximation result to

    min_{θ ∈ Θ}  (1/D) ∑_{i=1}^{D} ℓ(f(x̂_i, θ), ŷ_i)

with Θ ⊆ [−1, 1]^N, x̂_i ∈ [−1, 1]^n and ŷ_i ∈ [−1, 1]^m. We use the epigraph formulation:

    min_{θ ∈ Θ}  (1/D) ∑_{i=1}^{D} L_i    s.t.    L_i ≥ ℓ(f(x̂_i, θ), ŷ_i),  1 ≤ i ≤ D.

Let L be the Lipschitz constant of g(x, y, θ) := ℓ(f(x, θ), y) over [−1, 1]^{n+m+N}.

14


slide-41
SLIDE 41

Application to ERM problem

Theorem. For every ϵ > 0, ℓ, Θ ⊆ [−1, 1]^N and D, there is a polytope of size

    O( (2L/ϵ)^{N+n+m} D )

such that for every data set (X̂, Ŷ) = (x̂_i, ŷ_i)_{i=1}^{D} ⊆ [−1, 1]^{(n+m)D}, there is a face F_{X̂,Ŷ} such that optimizing (1/D) ∑_{i=1}^{D} L_i over F_{X̂,Ŷ} provides an ϵ-approximation to ERM with data X̂, Ŷ.

15


slide-43
SLIDE 43

Proof Sketch

Every system of constraints of the type L_i ≥ ℓ(f(x̂_i, θ), ŷ_i), 1 ≤ i ≤ D, has an intersection graph with the following structure:

[Figure: a central clique on θ_1, …, θ_N, with each group {L_i, x_i, y_i} attached only to that central clique, one group per data point i = 1, …, D]

and has treewidth at most N + n + m

16
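The sketch below (an editorial addition, not from the slides) builds exactly this intersection graph for a few values of D and checks, via networkx's treewidth heuristic (an upper bound, which is tight for this clique-star structure), that the width stays at N + n + m while the graph itself grows linearly with D.

```python
import itertools
import networkx as nx
from networkx.algorithms.approximation import treewidth_min_degree

def erm_intersection_graph(N, n, m, D):
    """Intersection graph of the constraints L_i >= loss(f(x_i, theta), y_i), i = 1..D.

    Constraint i touches all of theta_1..theta_N plus its own L_i, x_i (n coords),
    y_i (m coords), so those variables form a clique; only theta is shared across i.
    """
    theta = [f"theta_{j}" for j in range(N)]
    G = nx.Graph()
    for i in range(D):
        group = ([f"L_{i}"]
                 + [f"x_{i}_{j}" for j in range(n)]
                 + [f"y_{i}_{j}" for j in range(m)])
        G.add_edges_from(itertools.combinations(theta + group, 2))
    return G

N, n, m = 4, 3, 2
for D in (5, 50, 500):
    G = erm_intersection_graph(N, n, m, D)
    width, _ = treewidth_min_degree(G)       # heuristic upper bound; tight here
    print(f"D={D}: {G.number_of_nodes()} vertices, treewidth <= {width} (N+n+m = {N+n+m})")
```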


slide-46
SLIDE 46

LP size details

Thus the LP size given by the treewidth, O((L/ϵ)^{ω+1} n), becomes O((2L/ϵ)^{N+n+m} D). The key lies in the fact that D does not add to the treewidth. Different architectures → different N and L.

17

slide-47
SLIDE 47

Architecture-Specific Consequences

slide-48
SLIDE 48

Fully connected DNN, ReLU activations, quadratic loss

For any k, n, m, w, ϵ there is a uniform LP of size O ( (2k+1mnwk2/ϵ)N+n+m D ) with the same guarantees: ϵ-approximation and data-dependent faces Core of the proof: In a DNN with k hidden layers and quadratic loss the Lipschitz constant of g x y

  • ver

1 1 n

m N is O mnwk2 . 18

slide-49
SLIDE 49

Fully connected DNN, ReLU activations, quadratic loss

For any k, n, m, w, ϵ there is a uniform LP of size O ( (2k+1mnwk2/ϵ)N+n+m D ) with the same guarantees: ϵ-approximation and data-dependent faces Core of the proof: In a DNN with k hidden layers and quadratic loss the Lipschitz constant of g(x, y, θ) over [−1, 1]n+m+N is O(mnwk2).

18
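The bound above is a worst-case statement. Purely as an illustration (an editorial addition, not from the paper), the sketch below numerically probes the Lipschitz behaviour of g(x, y, θ) = ‖f(x, θ) − y‖² for a tiny ReLU network by recording the largest finite-difference ratio observed over random pairs of points in [−1, 1]^{n+m+N}; this only ever yields a lower bound on the true Lipschitz constant.

```python
import numpy as np

rng = np.random.default_rng(0)
n, w, m, k = 3, 4, 2, 2                    # tiny network: k hidden layers of width w
dims = [n] + [w] * k + [m]
sizes = [(dims[i + 1], dims[i]) for i in range(k + 1)]
N = sum(r * c + r for r, c in sizes)       # number of parameters theta

def unpack(theta):
    """Split the flat parameter vector theta into per-layer (A_i, b_i)."""
    layers, pos = [], 0
    for r, c in sizes:
        A = theta[pos:pos + r * c].reshape(r, c); pos += r * c
        b = theta[pos:pos + r];                   pos += r
        layers.append((A, b))
    return layers

def g(x, y, theta):
    """g(x, y, theta) = || f(x, theta) - y ||_2^2 with ReLU activations."""
    h = x
    for i, (A, b) in enumerate(unpack(theta)):
        h = A @ h + b
        if i < k:                              # ReLU on the k hidden layers only
            h = np.maximum(0.0, h)
    return np.sum((h - y) ** 2)

# Largest observed |g(z) - g(z')| / ||z - z'||_inf over random pairs in [-1, 1]^{n+m+N}.
best = 0.0
for _ in range(2000):
    z1, z2 = rng.uniform(-1, 1, size=(2, n + m + N))
    v1 = g(z1[:n], z1[n:n + m], z1[n + m:])
    v2 = g(z2[:n], z2[n:n + m], z2[n + m:])
    best = max(best, abs(v1 - v2) / np.max(np.abs(z1 - z2)))
print("empirical lower bound on the Lipschitz constant of g:", best)
```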

slide-50
SLIDE 50

Comparison with Arora et al.

In the Arora, Basu, Mianjy and Mukherjee setting: k = 1, m = 1 and N ≈ nw.

  • Arora et al. running time: O(2^w D^{nw} poly(D, n, w))
  • Uniform LP size: O((4nw/ϵ)^{(n+1)(w+1)} D)

Other differences: exactness, boundedness, convexity vs. Lipschitz-ness, uniformness.

19

slide-51
SLIDE 51

Last comments

  • The results can be improved by considering the sparsity of the network itself.
  • One can obtain previously unknown complexity results (ResNet, Convolutional NN, etc.).
  • Training using this approach generalizes: using enough¹ i.i.d. data points we get an approximation to the “true” Risk Minimization problem. Our results improve on the best approximations to this problem as well.

¹ depends on L and ϵ

20

slide-52
SLIDE 52

Still Open and Future Work

  • It is unknown if the dependency on w or k can be improved.
  • A better LP size can be obtained by assuming more about the input data or the nature of the problem.
  • We would like to combine these ideas with empirically efficient methods.

21

slide-53
SLIDE 53

Thank you!

21


slide-55
SLIDE 55

One other improvement

If we denote by G the graph underlying the Neural Network, we can improve the exponent in O((nw/ϵ)^{poly(n,k,w,m)} D) using the treewidth tw(G) and the maximum degree ∆(G) of G. More specifically, one can obtain a uniform LP of size O((nw/ϵ)^{O(k·tw(G)·∆(G))} (|E(G)| + D)).

22