Deep Neural Networks and Partial Differential Equations: Approximation Theory and Structural Properties
Philipp Christian Petersen
Joint work with:
◮ Helmut Bölcskei (ETH Zürich)
◮ Philipp Grohs (University of Vienna)
◮ Joost Opschoor (ETH Zürich)
◮ Gitta Kutyniok (TU Berlin)
◮ Mones Raslan (TU Berlin)
◮ Christoph Schwab (ETH Zürich)
◮ Felix Voigtlaender (KU Eichstätt-Ingolstadt)
Goal of this talk: Discuss the suitability of neural networks as an ansatz system for the solution of PDEs. Two threads:

Approximation theory:
◮ universal approximation
◮ optimal approximation rates for all classical function spaces
◮ reduced curse of dimension

Structural properties:
◮ non-convex, non-closed ansatz spaces
◮ parametrization not stable
◮ very hard to optimize over
Outline:
◮ Neural networks: introduction to neural networks, approaches to solve PDEs
◮ Approximation theory of neural networks: classical results, optimality, high-dimensional approximation
◮ Structural results: convexity, closedness, stable parametrization
We consider neural networks as a special kind of functions:
◮ d = N_0 ∈ N: input dimension,
◮ L: number of layers,
◮ ̺ : R → R: activation function,
◮ T_ℓ : R^{N_{ℓ-1}} → R^{N_ℓ}, ℓ = 1, . . . , L: affine-linear maps.

Then Φ̺ : R^d → R^{N_L} given by Φ̺(x) = T_L(̺(T_{L-1}(̺(. . . ̺(T_1(x)))))), x ∈ R^d, is called a neural network (NN). The sequence (d, N_1, . . . , N_L) is called the architecture of Φ̺.
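To make the definition concrete, here is a minimal NumPy sketch (not part of the original slides) of evaluating such a network; the ReLU activation, the architecture, and the random weights are purely illustrative choices.

```python
import numpy as np

def relu(x):
    # One possible activation function: rho(x) = max(x, 0)
    return np.maximum(x, 0.0)

def network(x, weights, rho=relu):
    """Evaluate Phi_rho(x) = T_L(rho(T_{L-1}(... rho(T_1(x)) ...))).

    `weights` is a list of pairs (A_l, b_l) defining the affine-linear maps
    T_l(x) = A_l x + b_l; rho acts componentwise after every map but the last.
    """
    for A, b in weights[:-1]:
        x = rho(A @ x + b)
    A_last, b_last = weights[-1]
    return A_last @ x + b_last

# Illustrative architecture (d, N_1, N_2, N_3) = (2, 8, 8, 1) with random weights.
rng = np.random.default_rng(0)
arch = [2, 8, 8, 1]
weights = [(rng.standard_normal((n, m)), rng.standard_normal(n))
           for m, n in zip(arch[:-1], arch[1:])]
print(network(np.array([0.5, -1.0]), weights))
```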
Deep Learning: Deep learning describes a variety of techniques based on data-driven adaptation of the affine-linear maps in a neural network. Overwhelming success:
◮ Image classification
◮ Text understanding
◮ Game intelligence
Hardware design of the future!
(Figure: Ren, He, Girshick, Sun; 2015)
Expressibility: Neural networks constitute a very powerful architecture.
Theorem (Cybenko; 1989, Hornik; 1991, Pinkus; 1999)
Let d ∈ N, K ⊂ R^d compact, f : K → R continuous, and ̺ : R → R continuous and not a polynomial. Then for every ε > 0 there exists a two-layer NN Φ̺ with ‖f − Φ̺‖_∞ ≤ ε.

Efficient expressibility: R^M ∋ θ ↦ (T_1, . . . , T_L) ↦ Φ̺_θ yields a parametrized system of functions. In a sense, this parametrization is very efficient.
PDE problem: For D ⊂ R^d, d ∈ N, find u such that
G(x, u(x), ∇u(x), ∇²u(x)) = 0 for all x ∈ D.

Approach of [Lagaris, Likas, Fotiadis; 1998]: Let (x_i)_{i∈I} ⊂ D and find a NN Φ̺_θ such that
G(x_i, Φ̺_θ(x_i), ∇Φ̺_θ(x_i), ∇²Φ̺_θ(x_i)) = 0 for all i ∈ I.

Standard methods can be used to find the parameters θ.
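The following is a minimal, self-contained sketch of this collocation idea for a one-dimensional boundary value problem; it is not the implementation of the cited paper. A small tanh network is fitted by minimizing the squared PDE residual at the collocation points, with u'' approximated by finite differences and the boundary conditions enforced by a penalty. The problem data, network width, step size h, and penalty weight are illustrative choices.

```python
import numpy as np
from scipy.optimize import minimize

# Toy problem: u''(x) = -pi^2 sin(pi x) on (0, 1), u(0) = u(1) = 0.
# The exact solution is u(x) = sin(pi x).
f = lambda x: -np.pi**2 * np.sin(np.pi * x)
xs = np.linspace(0.0, 1.0, 21)      # collocation points x_i
n_hidden = 10                       # width of the single hidden layer

def net(x, theta):
    # One-hidden-layer tanh network R -> R with parameter vector theta.
    w1 = theta[:n_hidden]
    b1 = theta[n_hidden:2*n_hidden]
    w2 = theta[2*n_hidden:3*n_hidden]
    b2 = theta[-1]
    return np.tanh(np.outer(x, w1) + b1) @ w2 + b2

def loss(theta, h=1e-3):
    # Squared PDE residual at the collocation points (u'' via central differences)
    u_xx = (net(xs + h, theta) - 2*net(xs, theta) + net(xs - h, theta)) / h**2
    residual = np.mean((u_xx - f(xs))**2)
    # Penalty enforcing the boundary conditions u(0) = u(1) = 0
    boundary = np.sum(net(np.array([0.0, 1.0]), theta)**2)
    return residual + 100.0 * boundary

theta0 = 0.1 * np.random.default_rng(0).standard_normal(3*n_hidden + 1)
result = minimize(loss, theta0, method="L-BFGS-B")
print("max error at collocation points:",
      np.max(np.abs(net(xs, result.x) - np.sin(np.pi * xs))))
```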
General framework:

Deep Ritz method [E, Yu; 2017]: NNs as trial functions; SGD naturally replaces quadrature (see the sketch below).

High-dimensional PDEs [Sirignano, Spiliopoulos; 2017]: Let D ⊂ R^d, d ≥ 100, and find u such that
∂u/∂t(t, x) + H(u)(t, x) = 0, (t, x) ∈ [0, T] × Ω, plus boundary and initial conditions.
As the number of parameters of the NN increases, the minimizer of the associated energy approaches the true solution. No mesh generation required!

[Berner, Grohs, Hornung, Jentzen, von Wurstemberger; 2017]: Phrasing the problem as empirical risk minimization gives provably no curse of dimension, neither in the approximation problem nor in the number of samples.
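A sketch of the mechanism that makes the Deep Ritz method mesh-free, under simplifying assumptions (a placeholder one-hidden-layer network, D = (0, 1)^d, f ≡ 1, boundary terms omitted): the Ritz energy ∫_D (½|∇u|² − f u) dx is estimated by a Monte Carlo average over uniformly sampled points, which is exactly the kind of noisy estimate an SGD step can consume, so no quadrature rule or mesh is needed.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 5                                      # spatial dimension
f = lambda x: np.ones(x.shape[0])          # illustrative right-hand side f = 1

def phi(x, theta):
    # Placeholder trial function: one-hidden-layer tanh network R^d -> R.
    W, b, v = theta
    return np.tanh(x @ W + b) @ v

def grad_phi(x, theta):
    # Gradient of phi with respect to x (chain rule), shape (n_samples, d).
    W, b, v = theta
    return ((1.0 - np.tanh(x @ W + b)**2) * v) @ W.T

def monte_carlo_energy(theta, n_samples=4096):
    # Monte Carlo estimate of the Ritz energy over D = (0, 1)^d.
    x = rng.uniform(0.0, 1.0, size=(n_samples, d))
    g = grad_phi(x, theta)
    return np.mean(0.5 * np.sum(g**2, axis=1) - f(x) * phi(x, theta))

theta = (rng.standard_normal((d, 16)), rng.standard_normal(16),
         rng.standard_normal(16))
print(monte_carlo_energy(theta))
```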
Deep learning and PDEs: Both approaches above are based on two ideas.
◮ Neural networks are highly efficient in representing solutions of PDEs, hence the complexity of the problem can be greatly reduced.
◮ There exist black-box methods from machine learning that solve the optimization problem.

This talk:
◮ We will show exactly how efficient the representations are.
◮ Raise doubt that the black box can produce reliable results in general.
Recall: Φ̺(x) = T_L(̺(T_{L-1}(̺(. . . ̺(T_1(x)))))), x ∈ R^d. Each affine-linear map T_ℓ is defined by a matrix A_ℓ ∈ R^{N_ℓ×N_{ℓ-1}} and a translation b_ℓ ∈ R^{N_ℓ} via T_ℓ(x) = A_ℓ x + b_ℓ. The number of weights W(Φ̺) and the number of neurons N(Φ̺) are
W(Φ̺) = Σ_{ℓ=1}^{L} (‖A_ℓ‖_0 + ‖b_ℓ‖_0) and N(Φ̺) = Σ_{j=1}^{L} N_j.
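A tiny sketch instantiating these two counts on a list of affine maps (the helper names and the example weights are illustrative): W counts the non-zero entries of all A_l and b_l, while N sums the output dimensions of the affine maps.

```python
import numpy as np

def count_weights(weights):
    # W(Phi): number of non-zero entries of all A_l and b_l
    return sum(np.count_nonzero(A) + np.count_nonzero(b) for A, b in weights)

def count_neurons(weights):
    # N(Phi): N_1 + ... + N_L, the output dimension of each affine map
    return sum(A.shape[0] for A, _ in weights)

# Example with architecture (d, N_1, N_2) = (3, 4, 1):
weights = [(np.ones((4, 3)), np.zeros(4)), (np.ones((1, 4)), np.ones(1))]
print(count_weights(weights), count_neurons(weights))   # prints 17 5
```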
Given f from some class of functions, how many weights/neurons does an ε-approximating NN need to have? Not so many...
Theorem (Maiorov, Pinkus; 1999)
There exists an activation function ̺weird : R → R that
◮ is analytic and strictly increasing,
◮ satisfies lim_{x→−∞} ̺weird(x) = 0 and lim_{x→∞} ̺weird(x) = 1,
such that for any d ∈ N, any f ∈ C([0, 1]^d), and any ε > 0, there is a 3-layer ̺weird-network Φ̺weird_ε with ‖f − Φ̺weird_ε‖_{L^∞} ≤ ε and N(Φ̺weird_ε) = 9d + 3.
◮ Barron; 1993: Approximation rate for functions with one finite Fourier moment using shallow networks with activation function ̺ sigmoidal of order zero.
◮ Mhaskar; 1993: Let ̺ be a sigmoidal function of order k ≥ 2. For f ∈ C^s([0, 1]^d), we have ‖f − Φ̺_n‖_{L^∞} ≲ N(Φ̺_n)^{−s/d} and L(Φ̺_n) = L(d, s, k).
◮ Yarotsky; 2017: For f ∈ C^s([0, 1]^d) and ̺(x) = x_+ (called ReLU), we have ‖f − Φ̺_n‖_{L^∞} ≲ W(Φ̺_n)^{−s/d} and L(Φ̺_n) ≍ log(n) (see the sketch after this list).
◮ Shaham, Cloninger, Coifman; 2015: One can implement certain wavelets using 4-layer NNs.
◮ He, Li, Xu, Zheng; 2018, Opschoor, Schwab, P.; 2019: ReLU NNs reproduce approximation rates of h-, p- and hp-FEM.
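A key building block behind the ReLU rates of Yarotsky is that x ↦ x² on [0, 1] can be approximated with error 4^{−(m+1)} by subtracting scaled sawtooth functions from the identity, where the s-th sawtooth is the s-fold composition of a hat function that three ReLU neurons can express. The following numerical sketch of that construction (not Yarotsky's code) checks the error decay.

```python
import numpy as np

relu = lambda x: np.maximum(x, 0.0)

def hat(x):
    # The "tooth" g, written with three ReLU neurons:
    # g(x) = 2x on [0, 1/2] and g(x) = 2(1 - x) on [1/2, 1].
    return 2*relu(x) - 4*relu(x - 0.5) + 2*relu(x - 1.0)

def approx_square(x, m):
    # Depth-O(m) ReLU approximation of x**2 on [0, 1]:
    # x**2 = x - sum_{s >= 1} g_s(x) / 4**s, with g_s the s-fold composition of g.
    x = np.asarray(x, dtype=float)
    out = x.copy()
    g = x.copy()
    for s in range(1, m + 1):
        g = hat(g)
        out -= g / 4**s
    return out

x = np.linspace(0.0, 1.0, 10001)
for m in (2, 4, 6, 8):
    err = np.max(np.abs(approx_square(x, m) - x**2))
    print(m, err)        # the error decays like 4**(-(m+1))
```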
Optimal approximation rates: Lower bounds on the required network size only exist under additional assumptions (recall the networks based on ̺weird). Options:
(A) Place restrictions on the activation function (e.g. only consider the ReLU), thereby excluding pathological examples like ̺weird. (→ VC-dimension bounds)
(B) Place restrictions on the weights. (→ Information-theoretic bounds, entropy arguments)
(C) Use still other concepts like continuous N-widths.
Encoders: Let C ⊂ L²(R^d) and ℓ ∈ N. Write
E^ℓ := {E : C → {0, 1}^ℓ} (binary encoders), D^ℓ := {D : {0, 1}^ℓ → L²(R^d)} (binary decoders),
so that every f ∈ C is mapped to a bit string such as {0, 1, 0, 0, 1, 1, 1} and decoded again.
Min-max code length: L(ε, C) := min{ℓ ∈ N : there exist (E, D) ∈ E^ℓ × D^ℓ with sup_{f∈C} ‖D(E(f)) − f‖_2 < ε}.
Optimal exponent: γ*(C) := inf{γ > 0 : L(ε, C) = O(ε^{−γ}) as ε → 0}.
Theorem (Bölcskei, Grohs, Kutyniok, P.; 2017)
Let C ⊂ L²(R^d) and ̺ : R → R. Then for all ε > 0:
sup_{f∈C} inf { W(Φ̺) : Φ̺ NN with quantised weights, ‖Φ̺ − f‖_2 ≤ ε } ≳ ε^{−γ*(C)}. (1)

Optimal approximation/parametrization: If for C ⊂ L²(R^d) one also has the matching upper bound "≲" in (1), then NNs approximate the function class optimally.
Versatility: It turns out that NNs achieve optimal approximation rates for many practically-used function classes.
◮ Mhaskar; 1993: Let ̺ be a sigmoidal function of order k ≥ 2. For f ∈ C^s([0, 1]^d), we have ‖f − Φ̺_n‖_{L^∞} ≲ N(Φ̺_n)^{−s/d}. We have γ*({f ∈ C^s([0, 1]^d) : ‖f‖_{C^s} ≤ 1}) = d/s.
◮ Shaham, Cloninger, Coifman; 2015: One can implement certain wavelets using 4-layer ReLU NNs. Optimal, when wavelets are optimal.
◮ Bölcskei, Grohs, Kutyniok, P.; 2017: optimal approximation rates for cartoon-like functions.
Piecewise smooth functions: E^{β,d} denotes the d-dimensional C^β-piecewise smooth functions.

Theorem (P., Voigtlaender; 2018)
Let d ∈ N, β ≥ 0, and ̺(x) = x_+. Then
sup_{f∈E^{β,d}} inf { W(Φ̺) : Φ̺ NN with quantised weights, ‖Φ̺ − f‖_2 ≤ ε } ∼ ε^{−γ*(E^{β,d})} = ε^{−2(d−1)/β}.
The optimal depth of the networks is ∼ β/d.
Curse of dimension: To guarantee approximation with error ≤ ε on E^{β,d}, one needs on the order of ε^{−2(d−1)/β} weights; for fixed ε < 1 this grows exponentially in d.

Symmetries and invariances: Image classifiers are often:
◮ translation, dilation, and rotation invariant,
◮ invariant to small deformations,
◮ invariant to small changes in brightness, contrast, color.
Two-step setup: f = χ ∘ τ
◮ τ : R^D → R^d is a smooth dimension-reducing feature map.
◮ χ ∈ E^{β,d} performs classification on the low-dimensional space.

Theorem (P., Voigtlaender; 2017)
Let ̺(x) = x_+. There are constants c > 0, L ∈ N such that for any f = χ ∘ τ and any ε ∈ (0, 1/2), there is a NN Φ̺_ε with at most L layers and at most c · ε^{−2(d−1)/β} non-zero weights such that ‖Φ̺_ε − f‖_{L²} < ε.

The asymptotic approximation rate depends only on d, not on D.
Compositional functions: [Mhaskar, Poggio; 2016] High-dimensional functions as a dyadic composition of 2-dimensional functions, e.g.
R^8 ∋ x ↦ h^3_1(h^2_1(h^1_1(x_1, x_2), h^1_2(x_3, x_4)), h^2_2(h^1_3(x_5, x_6), h^1_4(x_7, x_8))),
i.e. a binary tree over the inputs x_1, . . . , x_8.
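A toy sketch of such a dyadic composition (the building block h below is an arbitrary illustrative choice): an 8-dimensional function assembled entirely from 2-dimensional constituents arranged as a binary tree over the inputs.

```python
import numpy as np

# Illustrative 2-dimensional building block h(a, b); any smooth bivariate
# function could take its place.
h = lambda a, b: np.tanh(a + 2.0 * b)

def compositional(x):
    # R^8 -> R via a binary tree of 2-dimensional functions.
    x1, x2, x3, x4, x5, x6, x7, x8 = x
    return h(h(h(x1, x2), h(x3, x4)), h(h(x5, x6), h(x7, x8)))

print(compositional(np.arange(8.0)))
```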
Approximation with respect to Sobolev norms: ReLU NNs Φ are Lipschitz continuous. Hence, for s ∈ [0, 1], p ≥ 1 and f ∈ W^{s,p}(Ω), we can measure ‖f − Φ‖_{W^{s,p}(Ω)}. ReLU networks achieve the same approximation rates as h-, p-, and hp-FEM [Opschoor, P., Schwab; 2019].

Convolutional neural networks: Direct correspondence between approximation by CNNs (without pooling) and approximation by fully-connected networks [P., Voigtlaender; 2018].
Optimal parametrization:
◮ Neural networks yield optimal representations of many function classes relevant in PDE applications.
◮ Approximation is flexible, and quality is improved if low-dimensional structure is present.

PDE discretization:
◮ Problem complexity drastically reduced.
◮ No design of an ansatz system necessary, since NNs approximate almost every function class well.

Can neural networks really be this good?
Goal: Fix a space of networks with prescribed shape and understand the associated set of functions.

Fixed-architecture networks: Let d, L ∈ N, N_1, . . . , N_{L−1} ∈ N, and ̺ : R → R. Then we denote by NN̺(d, N_1, . . . , N_{L−1}, 1) the set of NNs with architecture (d, N_1, . . . , N_{L−1}, 1).
(Figure: example architecture with d = 8, N_1 = N_2 = N_3 = 12, N_4 = 8.)
Topological properties: Is NN̺(d, N_1, . . . , N_{L−1}, 1)
◮ star-shaped?
◮ convex? approximately convex?
◮ closed?
Is the map (T_1, . . . , T_L) ↦ Φ open?

Implications for optimization: If we do not have the properties above, then we can have
◮ terrible local minima,
◮ exploding weights,
◮ very slow convergence.
Star-shapedness: NN̺(d, N_1, . . . , N_{L−1}, 1) is trivially star-shaped with center 0. ...but...

Proposition (P., Raslan, Voigtlaender; 2018)
Let d, L ∈ N, N_1, . . . , N_{L−1} ∈ N and let ̺ : R → R be locally Lipschitz continuous. Then the number of linearly independent centers of NN̺(d, N_1, . . . , N_{L−1}, 1) is at most Σ_{ℓ=1}^{L} (N_{ℓ−1} + 1) N_ℓ, where N_0 = d.
Corollary (P., Raslan, Voigtlaender; 2018)
Let d, L ∈ N, N_1, . . . , N_{L−1} ∈ N, N_0 = d, and let ̺ : R → R be locally Lipschitz continuous. If NN̺(d, N_1, . . . , N_{L−1}, 1) contains more than Σ_{ℓ=1}^{L} (N_{ℓ−1} + 1) N_ℓ linearly independent functions, then NN̺(d, N_1, . . . , N_{L−1}, 1) is not convex.

From translation invariance: If NN̺(d, N_1, . . . , N_{L−1}, 1) contains only finitely many linearly independent functions, then ̺ is a finite sum of complex exponentials multiplied by polynomials.
Weak convexity: NN̺(d, N_1, . . . , N_{L−1}, 1) is almost never convex, but what about NN̺(d, N_1, . . . , N_{L−1}, 1) + B_ε^{‖·‖_∞}(0) for a hopefully small ε > 0?

Theorem (P., Raslan, Voigtlaender; 2018)
Let d, L ∈ N, N_1, . . . , N_{L−1} ∈ N, N_0 = d. For all commonly-used activation functions there does not exist an ε > 0 such that NN̺(d, N_1, . . . , N_{L−1}, 1) + B_ε^{‖·‖_∞}(0) is convex.

As a corollary, we also get that NN̺(d, N_1, . . . , N_{L−1}, 1) is usually nowhere dense.
Illustration: The set NN ̺(d, N1, . . . , NL−1, 1) has very few centers, it is scaling invariant, not approximately convex, and nowhere dense.
Compact weights: If the activation function ̺ is continuous, then a compactness argument shows that the set of networks with weights from a compact parameter set is closed.

Theorem (P., Raslan, Voigtlaender; 2018)
Let d, L ∈ N, N_1, . . . , N_{L−1} ∈ N, N_0 = d. If ̺ has one of the properties below, then NN̺(d, N_1, . . . , N_{L−1}, 1) is not closed in L^p, p ∈ (0, ∞):
◮ analytic, bounded, and not constant,
◮ C^1 but not C^∞,
◮ continuous, monotone, bounded, and ̺′(x_0) exists and is non-zero in at least one point x_0 ∈ R,
◮ continuous, monotone, continuously differentiable outside a compact set, and lim_{x→∞} ̺′(x), lim_{x→−∞} ̺′(x) exist and do not coincide.
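A numerical illustration of the mechanism behind such non-closedness results, using a smooth bounded activation (here tanh; this is a toy demonstration under those assumptions, not a proof): the difference quotients n(σ(x + 1/n) − σ(x)) are two-neuron networks that converge uniformly on compact sets to σ′, while the output weights ±n blow up.

```python
import numpy as np

sigma = np.tanh
dsigma = lambda x: 1.0 - np.tanh(x)**2     # the limit function sigma'

x = np.linspace(-3.0, 3.0, 2001)
for n in (10, 100, 1000, 10000):
    # A two-neuron network with output weights +n and -n:
    f_n = n * (sigma(x + 1.0/n) - sigma(x))
    # The approximation error shrinks while the weights grow without bound.
    print(n, np.max(np.abs(f_n - dsigma(x))))
```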
Theorem (P., Raslan, Voigtlaender; 2018)
Let d, L ∈ N, N_1, . . . , N_{L−1} ∈ N, N_0 = d. If ̺ has one of the properties below, then NN̺(d, N_1, . . . , N_{L−1}, 1) is not closed in L^∞:
◮ analytic, bounded, and not constant,
◮ C^1 but not C^∞,
◮ ̺ ∈ C^p and |̺(x) − x_+^p| is bounded, for p ≥ 1.

ReLU: The set of two-layer ReLU NNs is closed in L^∞!
Illustration: For most activation functions ̺ (the ReLU being an exception), the set NN̺(d, N_1, . . . , N_{L−1}, 1) is star-shaped with center 0, not approximately convex, and not closed.
Continuous parametrization: It is not hard to see that if ̺ is continuous, then so is the map R̺ : (T_1, . . . , T_L) ↦ Φ.

Quotient map: We can also ask whether R̺ is a quotient map, i.e., if Φ_1, Φ_2 are NNs which are close (w.r.t. ‖·‖_sup), do there exist (T^1_1, . . . , T^1_L) and (T^2_1, . . . , T^2_L) which are close in some norm and satisfy R̺((T^1_1, . . . , T^1_L)) = Φ_1 and R̺((T^2_1, . . . , T^2_L)) = Φ_2?

Proposition (P., Raslan, Voigtlaender; 2018)
Let ̺ be Lipschitz continuous and not affine-linear. Then R̺ is not a quotient map.
No convexity:
◮ Want to solve ∇J(Φ) = 0 for an energy J and a NN Φ.
◮ Not only J can be non-convex, but also the set we optimize over.
◮ Similar to N-term approximation by dictionaries.

No closedness:
◮ Exploding coefficients (if P_NN(f) ∉ NN).
◮ No low-neuron approximation.

No inverse-stable parametrization:
◮ Error term can be very small while the parametrization is far from optimal.
◮ Potentially very slow convergence.
Different networks:
◮ Special types of networks could be more robust.
◮ Convolutional neural networks are probably still too large a class.

Stronger norms:
◮ Stronger norms naturally help with closedness and inverse stability.
◮ An example is Sobolev training [Czarnecki, Osindero, Jaderberg, Swirszcz, Pascanu; 2017]; a sketch follows after this list.
◮ Many arguments of our results break down if the W^{1,∞} norm is used.
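A minimal sketch of the Sobolev-training idea under simplifying assumptions (a one-dimensional target, derivatives of the network by finite differences, a plain least-squares fit with SciPy; this is not the setup of the cited paper): the loss penalizes the mismatch of derivatives in addition to function values.

```python
import numpy as np
from scipy.optimize import minimize

# Illustrative target function and its derivative on [0, 1].
f  = lambda x: np.sin(2*np.pi*x)
df = lambda x: 2*np.pi*np.cos(2*np.pi*x)

xs = np.linspace(0.0, 1.0, 50)
n_hidden = 8

def net(x, theta):
    # One-hidden-layer tanh network R -> R with parameter vector theta.
    w1 = theta[:n_hidden]
    b1 = theta[n_hidden:2*n_hidden]
    w2 = theta[2*n_hidden:]
    return np.tanh(np.outer(x, w1) + b1) @ w2

def sobolev_loss(theta, h=1e-4, lam=1.0):
    # Value mismatch plus (finite-difference) derivative mismatch.
    dnet = (net(xs + h, theta) - net(xs - h, theta)) / (2*h)
    return (np.mean((net(xs, theta) - f(xs))**2)
            + lam * np.mean((dnet - df(xs))**2))

theta0 = 0.1 * np.random.default_rng(0).standard_normal(3*n_hidden)
res = minimize(sobolev_loss, theta0, method="L-BFGS-B")
print("max value error:", np.max(np.abs(net(xs, res.x) - f(xs))))
```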
Approximation: NNs are a very powerful approximation tool:
◮ often optimally efficient parametrization,
◮ overcome the curse of dimension,
◮ surprisingly efficient black-box optimization.

Topological structure: NNs form an impractical set:
◮ non-convex,
◮ non-closed,
◮ no inverse-stable parametrization.
References:
◮ H. Andrade-Loarca, G. Kutyniok, O. Öktem, P. Petersen, Extraction of digital wavefront sets using applied harmonic analysis and deep neural networks, arXiv:1901.01388.
◮ H. Bölcskei, P. Grohs, G. Kutyniok, P. Petersen, Optimal approximation with sparsely connected deep neural networks, arXiv:1705.01714.
◮ J. A. A. Opschoor, P. Petersen, Ch. Schwab, Deep ReLU networks and high-order finite element methods, SAM Report, ETH Zürich, 2019.
◮ P. Petersen, F. Voigtlaender, Optimal approximation of piecewise smooth functions using deep ReLU neural networks, Neural Networks, 2018.
◮ P. Petersen, M. Raslan, F. Voigtlaender, Topological properties of the set of functions generated by neural networks of fixed size, arXiv:1806.08459.
◮ P. Petersen, F. Voigtlaender, Equivalence of approximation by convolutional neural networks and fully-connected networks, arXiv:1809.00973.