Deep Neural Networks and Partial Differential Equations: Approximation Theory and Structural Properties
Philipp Christian Petersen
Joint work with:
◮ Helmut Bölcskei (ETH Zürich)
◮ Philipp Grohs (University of Vienna)
◮ Joost Opschoor (ETH Zürich)
◮ Gitta Kutyniok (TU Berlin)
◮ Mones Raslan (TU Berlin)
◮ Christoph Schwab (ETH Zürich)
◮ Felix Voigtlaender (KU Eichstätt-Ingolstadt)
Goal of this talk: Discuss the suitability of neural networks as an ansatz system for the solution of PDEs. Two threads:

Approximation theory:
◮ universal approximation
◮ optimal approximation rates for all classical function spaces
◮ reduced curse of dimension

Structural properties:
◮ non-convex, non-closed ansatz spaces
◮ parametrization not stable
◮ very hard to optimize over
Outline:
◮ Neural networks: introduction to neural networks, approaches to solve PDEs
◮ Approximation theory of neural networks: classical results, optimality, high-dimensional approximation
◮ Structural results: convexity, closedness, stable parametrization
We consider neural networks as a special kind of functions:
◮ d = N_0 ∈ N: input dimension,
◮ L: number of layers,
◮ ̺ : R → R: activation function,
◮ T_ℓ : R^{N_{ℓ-1}} → R^{N_ℓ}, ℓ = 1, . . . , L: affine-linear maps.

Then Φ̺ : R^d → R^{N_L} given by Φ̺(x) = T_L(̺(T_{L-1}(̺(. . . ̺(T_1(x)))))), x ∈ R^d, is called a neural network (NN). The sequence (d, N_1, . . . , N_L) is called the architecture of Φ̺.
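To make the definition concrete, here is a minimal NumPy sketch (not part of the original slides) of evaluating such a network; the ReLU activation, the architecture, and the random weights are purely illustrative choices.

```python
import numpy as np

def relu(x):
    # One possible activation function: rho(x) = max(x, 0)
    return np.maximum(x, 0.0)

def network(x, weights, rho=relu):
    """Evaluate Phi_rho(x) = T_L(rho(T_{L-1}(... rho(T_1(x)) ...))).

    `weights` is a list of pairs (A_l, b_l) defining the affine-linear maps
    T_l(x) = A_l x + b_l; rho acts componentwise after every map but the last.
    """
    for A, b in weights[:-1]:
        x = rho(A @ x + b)
    A_last, b_last = weights[-1]
    return A_last @ x + b_last

# Illustrative architecture (d, N_1, N_2, N_3) = (2, 8, 8, 1) with random weights.
rng = np.random.default_rng(0)
arch = [2, 8, 8, 1]
weights = [(rng.standard_normal((n, m)), rng.standard_normal(n))
           for m, n in zip(arch[:-1], arch[1:])]
print(network(np.array([0.5, -1.0]), weights))
```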
Deep Learning: Deep learning describes a variety of techniques based on data-driven adaptation of the affine-linear maps in a neural network. Overwhelming success:
◮ Image classification
◮ Text understanding
◮ Game intelligence
Hardware design of the future!
(Figure: Ren, He, Girshick, Sun; 2015)
Expressibility: Neural networks constitute a very powerful architecture.
Theorem (Cybenko; 1989, Hornik; 1991, Pinkus; 1999)
Let d ∈ N, K ⊂ R^d compact, f : K → R continuous, and ̺ : R → R continuous and not a polynomial. Then for every ε > 0 there exists a two-layer NN Φ̺ with ‖f − Φ̺‖_∞ ≤ ε.

Efficient expressibility: R^M ∋ θ ↦ (T_1, . . . , T_L) ↦ Φ̺_θ yields a parametrized system of functions. In a sense, this parametrization is very efficient.
PDE problem: For D ⊂ R^d, d ∈ N, find u such that
G(x, u(x), ∇u(x), ∇²u(x)) = 0 for all x ∈ D.

Approach of [Lagaris, Likas, Fotiadis; 1998]: Let (x_i)_{i∈I} ⊂ D and find a NN Φ̺_θ such that
G(x_i, Φ̺_θ(x_i), ∇Φ̺_θ(x_i), ∇²Φ̺_θ(x_i)) = 0 for all i ∈ I.

Standard methods can be used to find the parameters θ.
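The following is a minimal, self-contained sketch of this collocation idea for a one-dimensional boundary value problem; it is not the implementation of the cited paper. A small tanh network is fitted by minimizing the squared PDE residual at the collocation points, with u'' approximated by finite differences and the boundary conditions enforced by a penalty. The problem data, network width, step size h, and penalty weight are illustrative choices.

```python
import numpy as np
from scipy.optimize import minimize

# Toy problem: u''(x) = -pi^2 sin(pi x) on (0, 1), u(0) = u(1) = 0.
# The exact solution is u(x) = sin(pi x).
f = lambda x: -np.pi**2 * np.sin(np.pi * x)
xs = np.linspace(0.0, 1.0, 21)      # collocation points x_i
n_hidden = 10                       # width of the single hidden layer

def net(x, theta):
    # One-hidden-layer tanh network R -> R with parameter vector theta.
    w1 = theta[:n_hidden]
    b1 = theta[n_hidden:2*n_hidden]
    w2 = theta[2*n_hidden:3*n_hidden]
    b2 = theta[-1]
    return np.tanh(np.outer(x, w1) + b1) @ w2 + b2

def loss(theta, h=1e-3):
    # Squared PDE residual at the collocation points (u'' via central differences)
    u_xx = (net(xs + h, theta) - 2*net(xs, theta) + net(xs - h, theta)) / h**2
    residual = np.mean((u_xx - f(xs))**2)
    # Penalty enforcing the boundary conditions u(0) = u(1) = 0
    boundary = np.sum(net(np.array([0.0, 1.0]), theta)**2)
    return residual + 100.0 * boundary

theta0 = 0.1 * np.random.default_rng(0).standard_normal(3*n_hidden + 1)
result = minimize(loss, theta0, method="L-BFGS-B")
print("max error at collocation points:",
      np.max(np.abs(net(xs, result.x) - np.sin(np.pi * xs))))
```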
General framework:

Deep Ritz method [E, Yu; 2017]: NNs as trial functions; SGD naturally replaces quadrature (see the sketch below).

High-dimensional PDEs [Sirignano, Spiliopoulos; 2017]: Let D ⊂ R^d, d ≥ 100, and find u such that
∂u/∂t(t, x) + H(u)(t, x) = 0, (t, x) ∈ [0, T] × Ω, plus boundary and initial conditions.
As the number of parameters of the NN increases, the minimizer of the associated energy approaches the true solution. No mesh generation required!

[Berner, Grohs, Hornung, Jentzen, von Wurstemberger; 2017]: Phrasing the problem as empirical risk minimization gives provably no curse of dimension, neither in the approximation problem nor in the number of samples.
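A sketch of the mechanism that makes the Deep Ritz method mesh-free, under simplifying assumptions (a placeholder one-hidden-layer network, D = (0, 1)^d, f ≡ 1, boundary terms omitted): the Ritz energy ∫_D (½|∇u|² − f u) dx is estimated by a Monte Carlo average over uniformly sampled points, which is exactly the kind of noisy estimate an SGD step can consume, so no quadrature rule or mesh is needed.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 5                                      # spatial dimension
f = lambda x: np.ones(x.shape[0])          # illustrative right-hand side f = 1

def phi(x, theta):
    # Placeholder trial function: one-hidden-layer tanh network R^d -> R.
    W, b, v = theta
    return np.tanh(x @ W + b) @ v

def grad_phi(x, theta):
    # Gradient of phi with respect to x (chain rule), shape (n_samples, d).
    W, b, v = theta
    return ((1.0 - np.tanh(x @ W + b)**2) * v) @ W.T

def monte_carlo_energy(theta, n_samples=4096):
    # Monte Carlo estimate of the Ritz energy over D = (0, 1)^d.
    x = rng.uniform(0.0, 1.0, size=(n_samples, d))
    g = grad_phi(x, theta)
    return np.mean(0.5 * np.sum(g**2, axis=1) - f(x) * phi(x, theta))

theta = (rng.standard_normal((d, 16)), rng.standard_normal(16),
         rng.standard_normal(16))
print(monte_carlo_energy(theta))
```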
Deep learning and PDEs: Both approaches above are based on two ideas.
◮ Neural networks are highly efficient in representing solutions of PDEs, hence the complexity of the problem can be greatly reduced.
◮ There exist black-box methods from machine learning that solve the optimization problem.

This talk:
◮ We will show exactly how efficient the representations are.
◮ Raise doubt that the black box can produce reliable results in general.
Recall: Φ̺(x) = T_L(̺(T_{L-1}(̺(. . . ̺(T_1(x)))))), x ∈ R^d. Each affine-linear map T_ℓ is defined by a matrix A_ℓ ∈ R^{N_ℓ×N_{ℓ-1}} and a translation b_ℓ ∈ R^{N_ℓ} via T_ℓ(x) = A_ℓ x + b_ℓ. The number of weights W(Φ̺) and the number of neurons N(Φ̺) are
W(Φ̺) = Σ_{ℓ=1}^{L} (‖A_ℓ‖_0 + ‖b_ℓ‖_0) and N(Φ̺) = Σ_{j=1}^{L} N_j.
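A tiny sketch instantiating these two counts on a list of affine maps (the helper names and the example weights are illustrative): W counts the non-zero entries of all A_l and b_l, while N sums the output dimensions of the affine maps.

```python
import numpy as np

def count_weights(weights):
    # W(Phi): number of non-zero entries of all A_l and b_l
    return sum(np.count_nonzero(A) + np.count_nonzero(b) for A, b in weights)

def count_neurons(weights):
    # N(Phi): N_1 + ... + N_L, the output dimension of each affine map
    return sum(A.shape[0] for A, _ in weights)

# Example with architecture (d, N_1, N_2) = (3, 4, 1):
weights = [(np.ones((4, 3)), np.zeros(4)), (np.ones((1, 4)), np.ones(1))]
print(count_weights(weights), count_neurons(weights))   # prints 17 5
```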
Given f from some class of functions, how many weights/neurons does an ε-approximating NN need to have? Not so many...
Theorem (Maiorov, Pinkus; 1999)
There exists an activation function ̺weird : R → R that
◮ is analytic and strictly increasing,
◮ satisfies lim_{x→−∞} ̺weird(x) = 0 and lim_{x→∞} ̺weird(x) = 1,
such that for any d ∈ N, any f ∈ C([0, 1]^d), and any ε > 0, there is a 3-layer ̺weird-network Φ̺weird_ε with ‖f − Φ̺weird_ε‖_{L^∞} ≤ ε and N(Φ̺weird_ε) = 9d + 3.
◮ Barron; 1993: Approximation rate for functions with one finite Fourier moment using shallow networks with activation function ̺ sigmoidal of order zero.
◮ Mhaskar; 1993: Let ̺ be a sigmoidal function of order k ≥ 2. For f ∈ C^s([0, 1]^d), we have ‖f − Φ̺_n‖_{L^∞} ≲ N(Φ̺_n)^{−s/d} and L(Φ̺_n) = L(d, s, k).
◮ Yarotsky; 2017: For f ∈ C^s([0, 1]^d) and ̺(x) = x_+ (called ReLU), we have ‖f − Φ̺_n‖_{L^∞} ≲ W(Φ̺_n)^{−s/d} and L(Φ̺_n) ≍ log(n) (see the sketch after this list).
◮ Shaham, Cloninger, Coifman; 2015: One can implement certain wavelets using 4-layer NNs.
◮ He, Li, Xu, Zheng; 2018, Opschoor, Schwab, P.; 2019: ReLU NNs reproduce approximation rates of h-, p- and hp-FEM.
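A key building block behind the ReLU rates of Yarotsky is that x ↦ x² on [0, 1] can be approximated with error 4^{−(m+1)} by subtracting scaled sawtooth functions from the identity, where the s-th sawtooth is the s-fold composition of a hat function that three ReLU neurons can express. The following numerical sketch of that construction (not Yarotsky's code) checks the error decay.

```python
import numpy as np

relu = lambda x: np.maximum(x, 0.0)

def hat(x):
    # The "tooth" g, written with three ReLU neurons:
    # g(x) = 2x on [0, 1/2] and g(x) = 2(1 - x) on [1/2, 1].
    return 2*relu(x) - 4*relu(x - 0.5) + 2*relu(x - 1.0)

def approx_square(x, m):
    # Depth-O(m) ReLU approximation of x**2 on [0, 1]:
    # x**2 = x - sum_{s >= 1} g_s(x) / 4**s, with g_s the s-fold composition of g.
    x = np.asarray(x, dtype=float)
    out = x.copy()
    g = x.copy()
    for s in range(1, m + 1):
        g = hat(g)
        out -= g / 4**s
    return out

x = np.linspace(0.0, 1.0, 10001)
for m in (2, 4, 6, 8):
    err = np.max(np.abs(approx_square(x, m) - x**2))
    print(m, err)        # the error decays like 4**(-(m+1))
```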
Optimal approximation rates: Lower bounds on the required network size only exist under additional assumptions (recall the networks based on ̺weird). Options:
(A) Place restrictions on the activation function (e.g. only consider the ReLU), thereby excluding pathological examples like ̺weird. (→ VC-dimension bounds)
(B) Place restrictions on the weights. (→ Information-theoretic bounds, entropy arguments)
(C) Use still other concepts like continuous N-widths.
Encoders: Let C ⊂ L²(R^d) and ℓ ∈ N. Write
E^ℓ := {E : C → {0, 1}^ℓ} (binary encoders), D^ℓ := {D : {0, 1}^ℓ → L²(R^d)} (binary decoders),
so that every f ∈ C is mapped to a bit string such as {0, 1, 0, 0, 1, 1, 1} and decoded again.
Min-max code length: L(ε, C) := min{ℓ ∈ N : there exist (E, D) ∈ E^ℓ × D^ℓ with sup_{f∈C} ‖D(E(f)) − f‖_2 < ε}.
Optimal exponent: γ*(C) := inf{γ > 0 : L(ε, C) = O(ε^{−γ}) as ε → 0}.
Theorem (Bölcskei, Grohs, Kutyniok, P.; 2017)
Let C ⊂ L²(R^d) and ̺ : R → R. Then for all ε > 0:
sup_{f∈C} inf { W(Φ̺) : Φ̺ NN with quantised weights, ‖Φ̺ − f‖_2 ≤ ε } ≳ ε^{−γ*(C)}. (1)

Optimal approximation/parametrization: If for C ⊂ L²(R^d) one also has the matching upper bound "≲" in (1), then NNs approximate the function class optimally.
Versatility: It turns out that NNs achieve optimal approximation rates for many practically-used function classes.
◮ Mhaskar; 1993: Let ̺ be a sigmoidal function of order k ≥ 2. For f ∈ C^s([0, 1]^d), we have ‖f − Φ̺_n‖_{L^∞} ≲ N(Φ̺_n)^{−s/d}. We have γ*({f ∈ C^s([0, 1]^d) : ‖f‖_{C^s} ≤ 1}) = d/s.
◮ Shaham, Cloninger, Coifman; 2015: One can implement certain wavelets using 4-layer ReLU NNs. Optimal, when wavelets are optimal.
◮ Bölcskei, Grohs, Kutyniok, P.; 2017: optimal approximation rates for cartoon-like functions.
Piecewise smooth functions: E^{β,d} denotes the d-dimensional C^β-piecewise smooth functions.

Theorem (P., Voigtlaender; 2018)
Let d ∈ N, β ≥ 0, and ̺(x) = x_+. Then
sup_{f∈E^{β,d}} inf { W(Φ̺) : Φ̺ NN with quantised weights, ‖Φ̺ − f‖_2 ≤ ε } ∼ ε^{−γ*(E^{β,d})} = ε^{−2(d−1)/β}.
The optimal depth of the networks is ∼ β/d.
Curse of dimension: To guarantee approximation with error ≤ ε on E^{β,d}, one needs on the order of ε^{−2(d−1)/β} weights; for fixed ε < 1 this grows exponentially in d.

Symmetries and invariances: Image classifiers are often:
◮ translation, dilation, and rotation invariant,
◮ invariant to small deformations,
◮ invariant to small changes in brightness, contrast, color.
Two-step setup: f = χ ∘ τ
◮ τ : R^D → R^d is a smooth dimension-reducing feature map.
◮ χ ∈ E^{β,d} performs classification on the low-dimensional space.

Theorem (P., Voigtlaender; 2017)
Let ̺(x) = x_+. There are constants c > 0, L ∈ N such that for any f = χ ∘ τ and any ε ∈ (0, 1/2), there is a NN Φ̺_ε with at most L layers and at most c · ε^{−2(d−1)/β} non-zero weights such that ‖Φ̺_ε − f‖_{L²} < ε.

The asymptotic approximation rate depends only on d, not on D.
Compositional functions: [Mhaskar, Poggio; 2016] High-dimensional functions as a dyadic composition of 2-dimensional functions, e.g.
R^8 ∋ x ↦ h^3_1(h^2_1(h^1_1(x_1, x_2), h^1_2(x_3, x_4)), h^2_2(h^1_3(x_5, x_6), h^1_4(x_7, x_8))),
i.e. a binary tree over the inputs x_1, . . . , x_8.
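A toy sketch of such a dyadic composition (the building block h below is an arbitrary illustrative choice): an 8-dimensional function assembled entirely from 2-dimensional constituents arranged as a binary tree over the inputs.

```python
import numpy as np

# Illustrative 2-dimensional building block h(a, b); any smooth bivariate
# function could take its place.
h = lambda a, b: np.tanh(a + 2.0 * b)

def compositional(x):
    # R^8 -> R via a binary tree of 2-dimensional functions.
    x1, x2, x3, x4, x5, x6, x7, x8 = x
    return h(h(h(x1, x2), h(x3, x4)), h(h(x5, x6), h(x7, x8)))

print(compositional(np.arange(8.0)))
```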
Approximation with respect to Sobolev norms: ReLU NNs Φ are Lipschitz continuous. Hence, for s ∈ [0, 1], p ≥ 1 and f ∈ W^{s,p}(Ω), we can measure ‖f − Φ‖_{W^{s,p}(Ω)}. ReLU networks achieve the same approximation rates as h-, p-, and hp-FEM [Opschoor, P., Schwab; 2019].

Convolutional neural networks: Direct correspondence between approximation by CNNs (without pooling) and approximation by fully-connected networks [P., Voigtlaender; 2018].
Optimal parametrization:
◮ Neural networks yield optimal representations of many function classes relevant in PDE applications.
◮ Approximation is flexible, and quality is improved if low-dimensional structure is present.

PDE discretization:
◮ Problem complexity drastically reduced.
◮ No design of an ansatz system necessary, since NNs approximate almost every function class well.

Can neural networks really be this good?
Goal: Fix a space of networks with prescribed shape and understand the associated set of functions.

Fixed-architecture networks: Let d, L ∈ N, N_1, . . . , N_{L−1} ∈ N, and ̺ : R → R. Then we denote by NN̺(d, N_1, . . . , N_{L−1}, 1) the set of NNs with architecture (d, N_1, . . . , N_{L−1}, 1).
(Figure: example architecture with d = 8, N_1 = N_2 = N_3 = 12, N_4 = 8.)
Topological properties: Is NN̺(d, N_1, . . . , N_{L−1}, 1)
◮ star-shaped?
◮ convex? approximately convex?
◮ closed?
Is the map (T_1, . . . , T_L) ↦ Φ open?

Implications for optimization: If we do not have the properties above, then we can have
◮ terrible local minima,
◮ exploding weights,
◮ very slow convergence.
Star-shapedness: NN̺(d, N_1, . . . , N_{L−1}, 1) is trivially star-shaped with center 0. ...but...

Proposition (P., Raslan, Voigtlaender; 2018)
Let d, L ∈ N, N_1, . . . , N_{L−1} ∈ N and let ̺ : R → R be locally Lipschitz continuous. Then the number of linearly independent centers of NN̺(d, N_1, . . . , N_{L−1}, 1) is at most Σ_{ℓ=1}^{L} (N_{ℓ−1} + 1) N_ℓ, where N_0 = d.
Corollary (P., Raslan, Voigtlaender; 2018)
Let d, L ∈ N, N_1, . . . , N_{L−1} ∈ N, N_0 = d, and let ̺ : R → R be locally Lipschitz continuous. If NN̺(d, N_1, . . . , N_{L−1}, 1) contains more than Σ_{ℓ=1}^{L} (N_{ℓ−1} + 1) N_ℓ linearly independent functions, then NN̺(d, N_1, . . . , N_{L−1}, 1) is not convex.

From translation invariance: If NN̺(d, N_1, . . . , N_{L−1}, 1) contains only finitely many linearly independent functions, then ̺ is a finite sum of complex exponentials multiplied by polynomials.
Weak convexity: NN̺(d, N_1, . . . , N_{L−1}, 1) is almost never convex, but what about NN̺(d, N_1, . . . , N_{L−1}, 1) + B_ε^{‖·‖_∞}(0) for a hopefully small ε > 0?

Theorem (P., Raslan, Voigtlaender; 2018)
Let d, L ∈ N, N_1, . . . , N_{L−1} ∈ N, N_0 = d. For all commonly-used activation functions there does not exist an ε > 0 such that NN̺(d, N_1, . . . , N_{L−1}, 1) + B_ε^{‖·‖_∞}(0) is convex.

As a corollary, we also get that NN̺(d, N_1, . . . , N_{L−1}, 1) is usually nowhere dense.
Illustration: The set NN ̺(d, N1, . . . , NL−1, 1) has very few centers, it is scaling invariant, not approximately convex, and nowhere dense.
Compact weights: If the activation function ̺ is continuous, then a compactness argument shows that the set of networks with weights from a compact parameter set is closed.

Theorem (P., Raslan, Voigtlaender; 2018)
Let d, L ∈ N, N_1, . . . , N_{L−1} ∈ N, N_0 = d. If ̺ has one of the properties below, then NN̺(d, N_1, . . . , N_{L−1}, 1) is not closed in L^p, p ∈ (0, ∞):
◮ analytic, bounded, and not constant,
◮ C^1 but not C^∞,
◮ continuous, monotone, bounded, and ̺′(x_0) exists and is non-zero in at least one point x_0 ∈ R,
◮ continuous, monotone, continuously differentiable outside a compact set, and lim_{x→∞} ̺′(x), lim_{x→−∞} ̺′(x) exist and do not coincide.
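A numerical illustration of the mechanism behind such non-closedness results, using a smooth bounded activation (here tanh; this is a toy demonstration under those assumptions, not a proof): the difference quotients n(σ(x + 1/n) − σ(x)) are two-neuron networks that converge uniformly on compact sets to σ′, while the output weights ±n blow up.

```python
import numpy as np

sigma = np.tanh
dsigma = lambda x: 1.0 - np.tanh(x)**2     # the limit function sigma'

x = np.linspace(-3.0, 3.0, 2001)
for n in (10, 100, 1000, 10000):
    # A two-neuron network with output weights +n and -n:
    f_n = n * (sigma(x + 1.0/n) - sigma(x))
    # The approximation error shrinks while the weights grow without bound.
    print(n, np.max(np.abs(f_n - dsigma(x))))
```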
Theorem (P., Raslan, Voigtlaender; 2018)
Let d, L ∈ N, N_1, . . . , N_{L−1} ∈ N, N_0 = d. If ̺ has one of the properties below, then NN̺(d, N_1, . . . , N_{L−1}, 1) is not closed in L^∞:
◮ analytic, bounded, and not constant,
◮ C^1 but not C^∞,
◮ ̺ ∈ C^p and |̺(x) − x_+^p| is bounded, for p ≥ 1.

ReLU: The set of two-layer ReLU NNs is closed in L^∞!
Illustration: For most activation functions ̺ (the ReLU being an exception), the set NN̺(d, N_1, . . . , N_{L−1}, 1) is star-shaped with center 0, not approximately convex, and not closed.
Continuous parametrization: It is not hard to see that if ̺ is continuous, then so is the map R̺ : (T_1, . . . , T_L) ↦ Φ.

Quotient map: We can also ask whether R̺ is a quotient map, i.e., if Φ_1, Φ_2 are NNs which are close (w.r.t. ‖·‖_sup), do there exist (T^1_1, . . . , T^1_L) and (T^2_1, . . . , T^2_L) which are close in some norm and satisfy R̺((T^1_1, . . . , T^1_L)) = Φ_1 and R̺((T^2_1, . . . , T^2_L)) = Φ_2?

Proposition (P., Raslan, Voigtlaender; 2018)
Let ̺ be Lipschitz continuous and not affine-linear. Then R̺ is not a quotient map.
No convexity:
◮ Want to solve ∇J(Φ) = 0 for an energy J and a NN Φ.
◮ Not only J can be non-convex, but also the set we optimize over.
◮ Similar to N-term approximation by dictionaries.

No closedness:
◮ Exploding coefficients (if P_NN(f) ∉ NN).
◮ No low-neuron approximation.

No inverse-stable parametrization:
◮ Error term can be very small while the parametrization is far from optimal.
◮ Potentially very slow convergence.
Different networks:
◮ Special types of networks could be more robust.
◮ Convolutional neural networks are probably still too large a class.

Stronger norms:
◮ Stronger norms naturally help with closedness and inverse stability.
◮ An example is Sobolev training [Czarnecki, Osindero, Jaderberg, Swirszcz, Pascanu; 2017]; a sketch follows after this list.
◮ Many arguments of our results break down if the W^{1,∞} norm is used.
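A minimal sketch of the Sobolev-training idea under simplifying assumptions (a one-dimensional target, derivatives of the network by finite differences, a plain least-squares fit with SciPy; this is not the setup of the cited paper): the loss penalizes the mismatch of derivatives in addition to function values.

```python
import numpy as np
from scipy.optimize import minimize

# Illustrative target function and its derivative on [0, 1].
f  = lambda x: np.sin(2*np.pi*x)
df = lambda x: 2*np.pi*np.cos(2*np.pi*x)

xs = np.linspace(0.0, 1.0, 50)
n_hidden = 8

def net(x, theta):
    # One-hidden-layer tanh network R -> R with parameter vector theta.
    w1 = theta[:n_hidden]
    b1 = theta[n_hidden:2*n_hidden]
    w2 = theta[2*n_hidden:]
    return np.tanh(np.outer(x, w1) + b1) @ w2

def sobolev_loss(theta, h=1e-4, lam=1.0):
    # Value mismatch plus (finite-difference) derivative mismatch.
    dnet = (net(xs + h, theta) - net(xs - h, theta)) / (2*h)
    return (np.mean((net(xs, theta) - f(xs))**2)
            + lam * np.mean((dnet - df(xs))**2))

theta0 = 0.1 * np.random.default_rng(0).standard_normal(3*n_hidden)
res = minimize(sobolev_loss, theta0, method="L-BFGS-B")
print("max value error:", np.max(np.abs(net(xs, res.x) - f(xs))))
```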
Approximation: NNs are a very powerful approximation tool:
◮ often optimally efficient parametrization,
◮ overcome the curse of dimension,
◮ surprisingly efficient black-box optimization.

Topological structure: NNs form an impractical set:
◮ non-convex,
◮ non-closed,
◮ no inverse-stable parametrization.
References:
◮ H. Andrade-Loarca, G. Kutyniok, O. Öktem, P. Petersen, Extraction of digital wavefront sets using applied harmonic analysis and deep neural networks, arXiv:1901.01388.
◮ H. Bölcskei, P. Grohs, G. Kutyniok, P. Petersen, Optimal approximation with sparsely connected deep neural networks, arXiv:1705.01714.
◮ J. A. A. Opschoor, P. Petersen, Ch. Schwab, Deep ReLU networks and high-order finite element methods, SAM Report, ETH Zürich, 2019.
◮ P. Petersen, F. Voigtlaender, Optimal approximation of piecewise smooth functions using deep ReLU neural networks, Neural Networks, 2018.
◮ P. Petersen, M. Raslan, F. Voigtlaender, Topological properties of the set of functions generated by neural networks of fixed size, arXiv:1806.08459.
◮ P. Petersen, F. Voigtlaender, Equivalence of approximation by convolutional neural networks and fully-connected networks, arXiv:1809.00973.