SLIDE 1

Adaptivity of deep ReLU network and its generalization error analysis

Taiji Suzuki†‡

†The University of Tokyo

Department of Mathematical Informatics

‡AIP-RIKEN

22nd/Feb/2019 The 2nd Korea-Japan Machine Learning Workshop

SLIDE 2

Deep learning model

$f(x) = \eta(W_L \eta(W_{L-1} \cdots W_2 \eta(W_1 x + b_1) + b_2 \cdots))$

- High-performance learning system
- Many applications: DeepMind, Google, Facebook, OpenAI, Baidu, ...

SLIDE 3

Deep learning model

$f(x) = \eta(W_L \eta(W_{L-1} \cdots W_2 \eta(W_1 x + b_1) + b_2 \cdots))$

- High-performance learning system
- Many applications: DeepMind, Google, Facebook, OpenAI, Baidu, ...

We need theories.
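
As a concrete reference point, here is a minimal NumPy sketch of the forward map above (the widths, depth, and random weights are arbitrary illustrative choices, not taken from the talk):

```python
import numpy as np

def relu(u):
    return np.maximum(u, 0.0)

def forward(x, weights, biases):
    # f(x) = eta(W_L eta(... eta(W_1 x + b_1) ...) + b_L), with eta = ReLU
    h = x
    for W, b in zip(weights, biases):
        h = relu(W @ h + b)
    return h

# toy instantiation: input dimension 4, two hidden layers, scalar output
rng = np.random.default_rng(0)
dims = [4, 16, 16, 1]
weights = [rng.standard_normal((dims[i + 1], dims[i])) for i in range(len(dims) - 1)]
biases = [np.zeros(dims[i + 1]) for i in range(len(dims) - 1)]
print(forward(rng.standard_normal(4), weights, biases))
```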

SLIDE 4

Outline of this talk

Why does deep learning perform so well? “Adaptivity” of deep neural network:

- Adaptivity to the shape of the target function.
- Adaptivity to the dimensionality of the input data.
→ sparsity, non-convexity

SLIDE 5

Outline of this talk

Why does deep learning perform so well? “Adaptivity” of deep neural network:

- Adaptivity to the shape of the target function.
- Adaptivity to the dimensionality of the input data.
→ sparsity, non-convexity

Approach: estimation error analysis on a Besov space.
- spatial inhomogeneity of smoothness
- avoiding the curse of dimensionality

We will show that any linear estimator, such as a kernel method, is outperformed by deep learning.

SLIDE 6

Reference

Taiji Suzuki: Adaptivity of deep ReLU network for learning in Besov and mixed smooth Besov spaces: optimal rate and curse of dimensionality. ICLR2019, to appear. (arXiv:1810.08033).

SLIDE 7

Outline

1. Literature overview
2. Approximating and estimating functions in Besov space and related spaces
   - Deep NN representation for Besov space
   - Function class with more explicit sparsity
   - Deep NN representation for "mixed smooth" Besov space

SLIDE 8

Universal approximator

Two-layer neural network: $f(x) = \sum_{j=1}^{m} v_j \eta(w_j^\top x + b_j)$.

As $m \to \infty$, the two-layer network can approximate an arbitrary function to arbitrary precision:
$\hat f(x) = \sum_{j=1}^{m} v_j \eta(w_j^\top x + b_j) \simeq f^o(x) = \int h^o(w, b)\, \eta(w^\top x + b)\, \mathrm{d}w\, \mathrm{d}b$  (Sonoda & Murata, 2015)

Year | Authors              | Basis function       | Space
1987 | Hecht-Nielsen        | –                    | C(R^d)
1988 | Gallant & White      | cos                  | L2(K)
1988 | Irie & Miyake        | integrable           | L2(R^d)
1989 | Carroll & Dickinson  | continuous sigmoidal | L2(K)
1989 | Cybenko              | continuous sigmoidal | C(K)
1989 | Funahashi            | monotone & bounded   | C(K)
1993 | Mhaskar & Micchelli  | polynomial growth    | C(K)
2015 | Sonoda & Murata      | admissible           | L1, L2

SLIDE 9

Universal approximator

Two-layer neural network: $f(x) = \sum_{j=1}^{m} v_j \eta(w_j^\top x + b_j)$.

As $m \to \infty$, the two-layer network can approximate an arbitrary function to arbitrary precision:
$\hat f(x) = \sum_{j=1}^{m} v_j \eta(w_j^\top x + b_j) \simeq f^o(x) = \int h^o(w, b)\, \eta(w^\top x + b)\, \mathrm{d}w\, \mathrm{d}b$  (Sonoda & Murata, 2015)

Activation functions:
- ReLU: $\eta(u) = \max\{u, 0\}$
- Sigmoid: $\eta(u) = \frac{1}{1+\exp(-u)}$
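
A toy illustration of the finite-width model (my own sketch, not from the slides): fix $m$ random ReLU units and fit only the outer weights $v_j$ by least squares; the approximation error shrinks as $m$ grows.

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda u: np.maximum(u, 0.0)

f_target = lambda x: np.sin(2 * np.pi * x)        # function to approximate
x = np.linspace(0.0, 1.0, 200)

m = 50                                            # number of hidden units
w = rng.normal(scale=10.0, size=m)                # random inner weights w_j
b = rng.uniform(-10.0, 10.0, size=m)              # random biases b_j
Phi = relu(np.outer(x, w) + b)                    # features eta(w_j x + b_j), shape (200, m)

v, *_ = np.linalg.lstsq(Phi, f_target(x), rcond=None)  # outer weights v_j
print("max error:", np.max(np.abs(Phi @ v - f_target(x))))
```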

SLIDE 10

Expressive power of deep neural network

- Combinatorics / hyperplane arrangements (Montufar et al., 2014): number of linear regions (ReLU)
- Polynomial expansions, tensor analysis (Cohen et al., 2016; Cohen & Shashua, 2016): number of monomials (sum-product)
- Algebraic topology (Bianchini & Scarselli, 2014): Betti numbers (Pfaffian)
- Riemannian geometry + dynamic mean field theory (Poole et al., 2016): extrinsic curvature

A deep neural network's expressive power grows exponentially in the number of layers.

SLIDE 11

Depth separation between 2 and 3 layers

A 2-layer NN is already a universal approximator. When is a deeper network useful?

There is a function of the form
$f^o(x) = g(\|x\|^2) = g(x_1^2 + \cdots + x_d^2)$
that can be approximated much better by a 3-layer NN than by a 2-layer NN (cf. Eldan and Shamir (2016)). Let $d_x$ denote the dimension of the input $x$:

- 3 layers: $O(\mathrm{poly}(d_x, \epsilon^{-1}))$ internal nodes are sufficient.
- 2 layers: at least $\Omega(1/\epsilon^{d_x})$ internal nodes are required.

→ Depth can avoid the curse of dimensionality.

SLIDE 12

Non-smooth function

For estimating a non-smooth function, deep is better (Imaizumi & Fukumizu, 2018):
$f^o(x) = \sum_{k=1}^{K} 1_{R_k}(x)\, h_k(x)$,
where each $R_k$ is a region with a smooth boundary and each $h_k$ is a smooth function.

SLIDE 13

Depth separation

What makes the difference between deep and shallow methods?

SLIDE 14

Depth separation

What makes the difference between deep and shallow methods? → Non-convexity of the model (sparseness)

SLIDE 15

Easy example: Linear activation

Reduced rank regression: $Y_i = UVX_i + \xi_i$ ($i = 1, \dots, n$), where $U \in \mathbb{R}^{M \times r}$, $V \in \mathbb{R}^{r \times N}$ ($r \ll M, N$), and $Y_i \in \mathbb{R}^M$, $X_i \in \mathbb{R}^N$.

- Linear estimator: $\hat f(x) = \sum_{i=1}^{n} Y_i \varphi(X_1, \dots, X_n, x)$
- Deep learning: $\hat f(x) = \hat U \hat V x$

Estimation error comparison: $\underbrace{r(M+N)/n}_{\text{Deep}} \ll \underbrace{MN/n}_{\text{Shallow}}$

Non-convexity is essential. → sparsity.
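
The parameter-count gap behind this comparison can be spelled out with illustrative numbers (a sketch; the values of M, N, r, n below are arbitrary):

```python
# Reduced rank regression Y = U V X: the factorized (non-convex) model has
# r(M+N) parameters, while a linear estimator effectively faces all MN entries.
M, N, r, n = 100, 100, 5, 10_000

print("MN / n     =", M * N / n)          # error scale of the best linear estimator
print("r(M+N) / n =", r * (M + N) / n)    # error scale of the factorized estimator
```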

SLIDE 16

Nonlinear regression problem

Nonlinear regression problem: $y_i = f^o(x_i) + \xi_i$ ($i = 1, \dots, n$), where $\xi_i \sim N(0, \sigma^2)$ and $x_i \sim P_X([0,1]^d)$ (i.i.d.).

We want to estimate $f^o$ from the data $(x_i, y_i)_{i=1}^n$.

Least squares estimator:
$\hat f = \operatorname*{argmin}_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} (y_i - f(x_i))^2$,
where $\mathcal{F}$ is a neural network model.

SLIDE 17

Bias and variance trade-off

[Figure: true function, model, and estimator, with approximation error (bias) and sample deviation (variance).]

$\underbrace{\|f^o - \hat f\|_{L^2(P)}}_{\text{estimation error}} \le \underbrace{\|f^o - \check f\|_{L^2(P)}}_{\text{approximation error (bias)}} + \underbrace{\|\check f - \hat f\|_{L^2(P)}}_{\text{sample deviation (variance)}}$

- Large model: small approximation error, large sample deviation.
- Small model: large approximation error, small sample deviation.

→ Bias and variance trade-off.

SLIDE 18

Outline

1. Literature overview
2. Approximating and estimating functions in Besov space and related spaces
   - Deep NN representation for Besov space
   - Function class with more explicit sparsity
   - Deep NN representation for "mixed smooth" Besov space

SLIDE 19

Agenda of this talk

Deep learning can make use of sparsity.

Appropriate function class with non-convexity:
Q: A typical setting is the Hölder space. Can we generalize it?
A: Besov space and mixed-smooth Besov space (tensor product space).

Curse of dimensionality:
Q: Deep learning can suffer from the curse of dimensionality. Can we ease the effect of dimensionality under a suitable condition?
A: Yes, if the true function is included in the mixed-smooth Besov space.

SLIDE 20

Outline

1. Literature overview
2. Approximating and estimating functions in Besov space and related spaces
   - Deep NN representation for Besov space
   - Function class with more explicit sparsity
   - Deep NN representation for "mixed smooth" Besov space

SLIDE 21

Minimax optimal framework

What is a "good" estimator? Minimax optimal rate:
$\inf_{\hat f: \text{estimator}} \sup_{f^o \in \mathcal{F}} \mathbb{E}[\|\hat f - f^o\|_{L^2(P)}^2] \le n^{-?}$

→ If an estimator $\hat f$ achieves the minimax optimal rate, it can be regarded as a "good" estimator.

Which class $\mathcal{F}$ should we consider?

SLIDE 22

Hölder, Sobolev, Besov

Let $\Omega = [0,1]^d \subset \mathbb{R}^d$.

Hölder space ($C^\beta(\Omega)$):
$\|f\|_{C^\beta} = \max_{|\alpha| \le m} \|\partial^\alpha f\|_\infty + \max_{|\alpha| = m} \sup_{x, y \in \Omega,\, x \ne y} \frac{|\partial^\alpha f(x) - \partial^\alpha f(y)|}{|x - y|^{\beta - m}}$

Sobolev space ($W_p^k(\Omega)$):
$\|f\|_{W_p^k} = \Big( \sum_{|\alpha| \le k} \|D^\alpha f\|_{L^p(\Omega)}^p \Big)^{1/p}$

Besov space ($B_{p,q}^s(\Omega)$) ($0 < p, q \le \infty$, $0 < s \le m$): with the modulus of smoothness
$\omega_m(f, t)_p := \sup_{\|h\| \le t} \Big\| \sum_{j=0}^{m} (-1)^{m-j} \binom{m}{j} f(\cdot + jh) \Big\|_{L^p(\Omega)}$,
$\|f\|_{B_{p,q}^s(\Omega)} = \|f\|_{L^p(\Omega)} + \Big( \int_0^\infty \big[ t^{-s} \omega_m(f, t)_p \big]^q \frac{dt}{t} \Big)^{1/q}$.

SLIDE 23

Relation between the spaces

Suppose $\Omega = [0,1]^d \subset \mathbb{R}^d$. For $m \in \mathbb{N}$,
$B_{p,1}^m \hookrightarrow W_p^m \hookrightarrow B_{p,\infty}^m$, and $B_{2,2}^m = W_2^m$.

For $0 < s < \infty$ with $s \notin \mathbb{N}$, $C^s = B_{\infty,\infty}^s$.

SLIDE 24

Continuous regime: if $s > d/p$, then $B_{p,q}^s \hookrightarrow C^0$.
$L^r$-integrability: if $s > d(1/p - 1/r)_+$, then $B_{p,q}^s \hookrightarrow L^r$ (if $d/p \ge s$, the elements are not necessarily continuous).

[Figure: the smoothness axis $s$, split into a continuous regime and a discontinuous regime.]

Example: $B_{1,1}^1([0,1]) \subset \{\text{bounded total variation}\} \subset B_{1,\infty}^1([0,1])$

SLIDE 25

Properties of Besov space

- Discontinuity: $d/p > s$
- Spatial inhomogeneity of smoothness: small $p$

[Figure: a spatially inhomogeneous function, rough in some regions and smooth in others.]

Question: Can deep learning capture these properties?

SLIDE 26

Connection to sparsity

[Figure: curves for p = 0.5, 1, 2 on [0,1]^2 (small p corresponds to sparse coefficients).]

Multiresolution expansion:
$f = \sum_{k \in \mathbb{N}_+} \sum_{j \in J(k)} \alpha_{k,j}\, \psi(2^k x - j)$, with
$\|f\|_{B_{p,q}^s} \simeq \Big[ \sum_{k=0}^{\infty} \Big\{ 2^{sk} \Big( 2^{-kd} \sum_{j \in J(k)} |\alpha_{k,j}|^p \Big)^{1/p} \Big\}^q \Big]^{1/q}$

Sparse coefficients → spatial inhomogeneity of smoothness.

SLIDE 27

Deep learning model

$f(x) = (W^{(L)}\eta(\cdot) + b^{(L)}) \circ (W^{(L-1)}\eta(\cdot) + b^{(L-1)}) \circ \cdots \circ (W^{(1)}x + b^{(1)})$

$\mathcal{F}(L, W, S, B)$: deep networks with depth $L$, width $W$, sparsity $S$, norm bound $B$.
$\eta$ is the ReLU activation: $\eta(u) = \max\{u, 0\}$ (currently the most popular choice).

SLIDE 28

Approximation by deep NN in Besov space

F(L, W , S, B) : deep networks with depth L, width W , sparsity S, norm bound B.

Proposition (Approximation ability for Besov space)

Suppose that $0 < p, q, r \le \infty$ and $0 < s < \infty$ satisfy $m > 2s$ and $s > d(1/p - 1/r)_+$. For $N \in \mathbb{N}$, by setting
$L = 3\big\lceil \log_2\!\big(3^{d \vee m} N^{s/d} / c_{(d,m)}\big) + 5 \big\rceil \big\lceil \log_2(d \vee m) \big\rceil$, $W = 6N(d \vee m^2)$, $S = 6(L-1)(d \vee m^2) + N$, $B = O(N^{(d/p - s)_+})$,
it holds that
$\sup_{f^o \in U(B_{p,q}^s([0,1]^d))} \inf_{\check f \in \mathcal{F}(L, W, S, B)} \|f^o - \check f\|_{L^r([0,1]^d)} \lesssim N^{-s/d}$.

Remark: Shallow network cannot achieve this rate.

SLIDE 29

Approximation by deep NN in Besov space

F(L, W , S, B) : deep networks with depth L, width W , sparsity S, norm bound B.

Proposition (Approximation ability for Besov space)

Suppose that $0 < p, q, r \le \infty$ and $0 < s < \infty$ satisfy $m > 2s$ and $s > d(1/p - 1/r)_+$. For $N \in \mathbb{N}$, by setting $L = O(\log(N))$, $W = O(N)$, $S = O(N \log(N))$, $B = O(N^{(d/p - s)_+})$, it holds that
$\sup_{f^o \in U(B_{p,q}^s([0,1]^d))} \inf_{\check f \in \mathcal{F}(L, W, S, B)} \|f^o - \check f\|_{L^r([0,1]^d)} \lesssim N^{-s/d}$.

Remark: Shallow network cannot achieve this rate.

SLIDE 30

B-spline

$N(x) = \begin{cases} 1 & (x \in [0,1]), \\ 0 & (\text{otherwise}). \end{cases}$

Cardinal B-spline of order $m$: $N_m(x) = (\underbrace{N * N * \cdots * N}_{m+1 \text{ times}})(x)$ → a piecewise polynomial of order $m$.

[Figure: the cardinal B-splines $N_0, N_1, N_2, N_3$.]

Tensor-product basis: $N_{k,j}^{(d)}(x_1, \dots, x_d) = \prod_{i=1}^{d} N_m(2^k x_i - j_i)$
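
The cardinal B-spline can be sampled numerically by repeating the convolution in its definition; a quick sketch (the grid size and the Riemann-sum convolution are my own illustrative choices):

```python
import numpy as np

def cardinal_bspline(m, grid=1000):
    """Sample N_m = N * N * ... * N (m+1 copies of the indicator of [0,1)) on a fine grid."""
    dx = 1.0 / grid
    N0 = np.ones(grid)                      # indicator of [0, 1), sampled
    Nm = N0
    for _ in range(m):
        Nm = np.convolve(Nm, N0) * dx       # each convolution extends the support by 1
    x = np.arange(Nm.size) * dx
    return x, Nm

for m in range(4):                          # N_0, N_1, N_2, N_3 as in the figure
    x, Nm = cardinal_bspline(m)
    integral = Nm.sum() * (x[1] - x[0])
    print(f"N_{m}: support [0, {m + 1}], integral ~ {integral:.3f}")   # integral stays ~ 1
```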

SLIDE 31

Cardinal B-spline interpolation (DeVore & Popov, 1988)

Atomic decomposition: $f \in L^p$ is in $B_{p,q}^s$ if and only if $f$ can be decomposed as
$f = \sum_{k \in \mathbb{N}_+} \sum_{j \in J(k)} \alpha_{k,j} N_{k,j}^{(d)}$
(where $J(k) = \{ j \in \mathbb{Z}^d \mid -m < j_i < 2^k + 1 + m \}$) such that
$\mathcal{N}(f) := \Big[ \sum_{k=0}^{\infty} \Big\{ 2^{sk} \Big( 2^{-kd} \sum_{j \in J(k)} |\alpha_{k,j}|^p \Big)^{1/p} \Big\}^q \Big]^{1/q} < \infty$
($\alpha_{k,j}$ is determined in a certain way).

Norm equivalence: $\|f\|_{B_{p,q}^s} \simeq \mathcal{N}(f)$.

Basic strategy: approximate each basis function $N_{k,j}^{(d)}$ by a deep NN "efficiently".

※ The cardinal B-spline is not a wavelet basis.

SLIDE 32

Cardinal B-spline expansion (m = 1)

SLIDE 33

Under the condition $s > d(1/p - 1/r)_+$, it holds that
$\sup_{f^o \in U(B_{p,q}^s([0,1]^d))} \inf_{\check f \in \mathcal{F}(L,W,S,B)} \|f^o - \check f\|_{L^r([0,1]^d)} \lesssim N^{-s/d}$.

Setting $p = q = \infty$ and $r = \infty$, we have $B_{p,q}^s(\Omega) = C^s(\Omega)$:
⇒ the result by Yarotsky (2016) is recovered as a special case.

[Figure: a spatially inhomogeneous function, rough in some regions and smooth in others.]

SLIDE 34

Under the condition $s > d(1/p - 1/r)_+$, it holds that
$\sup_{f^o \in U(B_{p,q}^s([0,1]^d))} \inf_{\check f \in \mathcal{F}(L,W,S,B)} \|f^o - \check f\|_{L^r([0,1]^d)} \lesssim N^{-s/d}$.

Setting $p = q = \infty$ and $r = \infty$, we have $B_{p,q}^s(\Omega) = C^s(\Omega)$:
⇒ the result by Yarotsky (2016) is recovered as a special case.

Nonlinear adaptive sampling recovery is required (Dũng, 2011b).
A "non-adaptive method" only achieves $N^{-(s/d - (1/p - 1/r)_+)}$ for $1 < p < r \le 2$, $s > d(1/p - 1/r)_+$, which is not optimal if $p < r$.
(Non-adaptive method: it uses $N$ "fixed" bases to approximate the target function by $\sum_{i=1}^{N} \alpha_i \psi_i(x)$.)

→ Methods with fixed bases cannot achieve the optimal rate!

[Figure: a spatially inhomogeneous function (small-$p$ situation), rough in some regions and smooth in others.]

SLIDE 35

Empirical risk minimization and estimation error

[Figure: true function, model, and estimator, with approximation error (bias) and sample deviation (variance).]

We have already obtained the approximation error. Next, we derive the estimation error of the least squares estimator:
$\hat f = \operatorname*{argmin}_{f \in \mathcal{F}(L,W,S,B)} \sum_{i=1}^{n} (y_i - f(x_i))^2$.

SLIDE 36

Bias and variance decomposition

A standard covering number argument gives
$\mathbb{E}[\|f^o - \hat f\|_{L^2(P_X)}^2] \lesssim \underbrace{\frac{S[L \log(BW) + \log(Ln)]}{n}}_{\text{variance}} + \underbrace{\inf_{f \in \mathcal{F}(L,W,S,B)} \|f - f^o\|_{L^2(P_X)}^2}_{\text{bias}}$

SLIDE 37

Bias and variance decomposition

A standard covering number argument gives
$\mathbb{E}[\|f^o - \hat f\|_{L^2(P_X)}^2] \lesssim \underbrace{\frac{S[L \log(BW) + \log(Ln)]}{n}}_{\text{variance}} + \underbrace{\inf_{f \in \mathcal{F}(L,W,S,B)} \|f - f^o\|_{L^2(P_X)}^2}_{\text{bias}}$

If $f^o \in B_{p,q}^s(\Omega)$, we know that the bias is $N^{-s/d}$ (the approximation error) for $L = O(\log(N))$, $W = O(N)$, $S = O(N\log(N))$, $B = O(N^{(d/p-s)_+})$.
⇒ Balance the bias and variance terms.
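
Filling in the balancing step behind the next slide (constants and the exact powers of $\log n$ are suppressed in this sketch): with $L = O(\log N)$, $W = O(N)$, $S = O(N\log N)$, the variance term is of order $N \cdot \mathrm{polylog}(n)/n$ and the squared bias is $N^{-2s/d}$, so setting $\frac{N}{n} \asymp N^{-2s/d}$ gives $N \asymp n^{\frac{d}{2s+d}}$ and hence $\mathbb{E}[\|f^o - \hat f\|_{L^2(P_X)}^2] \lesssim n^{-\frac{2s}{2s+d}} \cdot \mathrm{polylog}(n)$.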

SLIDE 38

Estimation error analysis

$y_i = f^o(x_i) + \xi_i$ ($i = 1, \dots, n$), where $x_i \sim P_X$ with density $p \in L^{r/(r-2)}([0,1]^d)$ for $r < (1/p - s/d)_+^{-1}$.

$\mathcal{F}(L, W, S, B)$: ReLU-NN with width $W$, depth $L$, and sparsity $S$, with parameters bounded by $B$.

$\hat f = \operatorname*{argmin}_{f \in \mathcal{F}(L,W,S,B)} \sum_{i=1}^{n} (y_i - \bar f(x_i))^2$
($\bar f$ is the clipping of $f$: $\bar f = \min\{\max\{f, -R\}, R\}$; realizable by ReLU.)

Proposition
For $f^o$ such that $\|f^o\|_{B_{p,q}^s([0,1]^d)} \le 1$ and $\|f^o\|_\infty \le R$, and $0 < p, q \le \infty$ with $s > d(\tfrac{1}{p} - \tfrac{1}{2})_+$, by letting $N \asymp n^{\frac{d}{2s+d}}$,
$\mathbb{E}[\|f^o - \hat f\|_{L^2(P_X)}^2] \lesssim n^{-\frac{2s}{2s+d}} \log(n)^3$.

Setting $p = q = \infty$, the result of Schmidt-Hieber (2017) is recovered as a special case.
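
The remark that the clipping $\bar f = \min\{\max\{f, -R\}, R\}$ is realizable by ReLU can be checked directly; one possible two-unit construction (my own illustration, not spelled out on the slide) is $\bar f = \eta(f + R) - \eta(f - R) - R$:

```python
import numpy as np

relu = lambda u: np.maximum(u, 0.0)

def clip_via_relu(u, R=1.0):
    # min{max{u, -R}, R} written with two ReLU units
    return relu(u + R) - relu(u - R) - R

u = np.linspace(-3.0, 3.0, 13)
print(np.allclose(clip_via_relu(u), np.clip(u, -1.0, 1.0)))   # True
```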

SLIDE 39

Estimation error analysis

$y_i = f^o(x_i) + \xi_i$ ($i = 1, \dots, n$), where $x_i \sim P_X$ with density $p \in L^{r/(r-2)}([0,1]^d)$ for $r < (1/p - s/d)_+^{-1}$.

$\mathcal{F}(L, W, S, B)$: ReLU-NN with width $W$, depth $L$, and sparsity $S$, with parameters bounded by $B$.

$\hat f = \operatorname*{argmin}_{f \in \mathcal{F}(L,W,S,B)} \sum_{i=1}^{n} (y_i - \bar f(x_i))^2$
($\bar f$ is the clipping of $f$: $\bar f = \min\{\max\{f, -R\}, R\}$; realizable by ReLU.)

Proposition
For $f^o$ such that $\|f^o\|_{B_{p,q}^s([0,1]^d)} \le 1$ and $\|f^o\|_\infty \le R$, and $0 < p, q \le \infty$ with $s > d(\tfrac{1}{p} - \tfrac{1}{2})_+$, by letting $N \asymp n^{\frac{d}{2s+d}}$,
$\mathbb{E}[\|f^o - \hat f\|_{L^2(P_X)}^2] \lesssim n^{-\frac{2s}{2s+d}} \log(n)^3$.

Minimax optimal rate.

SLIDE 40

Best linear estimator vs. deep learning

Linear estimator (Donoho & Johnstone, 1998; Zhang et al., 2002): $\hat f(x) = \sum_{i=1}^{n} y_i \varphi(x_1, \dots, x_n; x)$
Examples: kernel ridge estimator, sieve method, Nadaraya-Watson estimator, ... (e.g., $\hat f(x) = K_{x,X}(K_{X,X} + \lambda I)^{-1} Y$).

For $s > 1/p$, the best linear estimator is limited to the rate
$n^{-\frac{2s - 2(1/p - 1/2)_+}{2s + 1 - 2(1/p - 1/2)_+}}$,
while deep learning (our bound) achieves the faster rate
$n^{-\frac{2s}{2s+1}}$ for $s > (1/p - 1/2)_+$.
(A sparse estimator achieves this rate for $s > \max\{1/p, 1/2\}$ (Donoho & Johnstone, 1998).)

The difference appears when $p < 2$; $p < 2$ corresponds to spatial inhomogeneity of smoothness.

[Figure: a spatially inhomogeneous function, rough in some regions and smooth in others.]
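
To see the size of the gap, one can plug a few values into the two exponents above (a sketch with arbitrary $s$ and $p$; $d = 1$ as on this slide):

```python
# Error-rate exponents n^{-e}: larger e means a faster rate.
def deep_exponent(s):
    return 2 * s / (2 * s + 1)

def linear_exponent(s, p):
    a = max(1.0 / p - 0.5, 0.0)                  # (1/p - 1/2)_+
    return (2 * s - 2 * a) / (2 * s + 1 - 2 * a)

s = 2.0
for p in (0.6, 1.0, 2.0):
    print(f"p = {p}: deep {deep_exponent(s):.3f}  vs  linear {linear_exponent(s, p):.3f}")
# The two coincide for p >= 2; the gap opens up as p decreases below 2.
```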

SLIDE 41

Why does this difference happen?

[Figure: the deep-net model class, its convex hull (shallow net), and the Q-hull.]

$\inf_{\hat f: \text{linear}} \sup_{f^o \in \mathcal{F}} \mathbb{E}[\|\hat f - f^o\|_{L^2(P)}^2] = \inf_{\hat f: \text{linear}} \sup_{f^o \in \mathrm{conv}(\mathcal{F})} \mathbb{E}[\|\hat f - f^o\|_{L^2(P)}^2]$.

(More precisely, this can be extended to the "Q-hull.")

SLIDE 42

Outline

1. Literature overview
2. Approximating and estimating functions in Besov space and related spaces
   - Deep NN representation for Besov space
   - Function class with more explicit sparsity
   - Deep NN representation for "mixed smooth" Besov space

SLIDE 43

Functions with jumps

$\mathcal{J}_K = \Big\{ a_0 + \sum_{i=1}^{K} a_i 1_{[t_i, 1]} \;\Big|\; t_i \in (0, 1],\ |a_0|, \sum_{i=1}^{K} |a_i| \le 1 \Big\}$
→ Its convex hull includes the functions of bounded variation.

Theorem
$\inf_{\hat f: \text{linear}} \sup_{f^o \in \mathcal{J}_K} \mathbb{E}\big[\|\hat f - f^o\|_{L^2(P)}^2\big] \ge \Omega\big(\tfrac{1}{\sqrt{n}}\big)$.
But, for a deep learning estimator $\hat f$, we obtain
$\sup_{f^o \in \mathcal{J}_K} \mathbb{E}\big[\|\hat f - f^o\|_{L^2(P)}^2\big] \le O\big(\tfrac{1}{n} \log(n)^3\big)$.

SLIDE 44

Function class with sparse parameter

Weak $\ell^p$-norm of the coefficients: $\|\alpha\|_{w\ell^p} := \sup_{i \in \mathbb{Z}_+} i^{1/p} |\alpha|_{(i)}$, where $|\alpha|_{(i)}$ is the $i$-th largest absolute value.

Function class with sparse coefficients:
$\mathcal{J}^p := \Big\{ \sum_{(k,\ell)} \alpha_{k,\ell} \psi_{k,\ell} \;\Big|\; \|\alpha\|_{w\ell^p} \le C,\ \sum_{k > m} |\alpha_{k,\ell}|^2 \le C 2^{-\beta m} \Big\}$,
where $\psi_{k,\ell}(x) = 2^{k/2}\psi(2^k x - \ell)$; $\psi$ could be the Haar wavelet.

Finite combinations of $\mathcal{J}^p$:
$\mathcal{K}^p := \Big\{ \sum_{i=1}^{S} c_i f_i(A_i \cdot - b_i) \;\Big|\; |c_i|, |\det A_i|^{-1}, \|A_i\|_\infty, \|b_i\|_\infty \le C,\ f_i \in \mathcal{J}^p \Big\}$.
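
For intuition, the weak $\ell^p$-norm can be evaluated numerically; a quick sketch (the two coefficient sequences below are arbitrary examples):

```python
import numpy as np

def weak_lp_norm(alpha, p):
    """sup_i i^(1/p) |alpha|_(i), where |alpha|_(i) is the i-th largest absolute value."""
    a = np.sort(np.abs(alpha))[::-1]
    i = np.arange(1, a.size + 1)
    return np.max(i ** (1.0 / p) * a)

fast = 2.0 ** -np.arange(20)            # geometrically decaying (sparse-like) coefficients
slow = 1.0 / np.sqrt(np.arange(1, 21))  # slowly decaying coefficients
print(weak_lp_norm(fast, 0.5), weak_lp_norm(slow, 0.5))   # small vs. large
```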

SLIDE 45

Convergence rate of deep NN

Theorem

Class | Minimax rate                                | Deep learning
J_K   | Ω(n^{-1})                                   | O(n^{-1} log(n)^3)
K^p   | Ω(n^{-2α/(2α+1)} (log n)^{-4α²/(2α+1)})     | O(n^{-2α/(2α+1)} log(n)^3)

where $0 < p < 2$, $\alpha = 1/p - 1/2$.

For $0 < p < 1$ (the sparse situation), DL is better than the linear estimator:
$\underbrace{n^{-1}\log(n)^3,\ n^{-\frac{2\alpha}{2\alpha+1}}\log(n)^3}_{\text{Deep}} \;\ll\; \underbrace{n^{-1/2}}_{\text{Shallow (linear)}}$

SLIDE 46

Outline

1. Literature overview
2. Approximating and estimating functions in Besov space and related spaces
   - Deep NN representation for Besov space
   - Function class with more explicit sparsity
   - Deep NN representation for "mixed smooth" Besov space

SLIDE 47

Difficulty

In $n^{-\frac{2s}{2s+d}}$, the dimension $d$ influences the exponent of the convergence rate.

→ Curse of dimensionality

SLIDE 48

Relation to existing work

Besov space with dominating mixed smoothness (tensor product space):
$MB_{p,p}^{\mathbf r} = B_{p,p}^{r_1} \otimes \cdots \otimes B_{p,p}^{r_d}$

The estimation accuracy is measured by $\|\hat f - f^o\|_{L^2(P)}^2$.

Space           | Hölder (∀β)                               | Barron class  | m-Sobolev (β ≤ 2)        | m-Besov (∀β)
Approximation   | Yarotsky (2016), Liang and Srikant (2016) | Barron (1993) | Montanelli and Du (2017) | This work
Approx. rate    | Õ(m^{-β/d})                               | Õ(m^{-1/2})   | Õ(m^{-β})                | Õ(m^{-β})
Estimation      | Schmidt-Hieber (2017)                     | Barron (1993) | —                        | This work
Estimation rate | Õ(n^{-2β/(2β+d)})                         | Õ(n^{-1/2})   | —                        | Õ(n^{-2β/(2β+1+log₂ e)})

SLIDE 49

Tensor product space

Tensor product of Besov spaces (dominating mixed smoothness):
$MB_{p,p}^\beta = B_{p,p}^\beta(\mathbb{R}) \otimes_p \cdots \otimes_p B_{p,p}^\beta(\mathbb{R})$

$f(x_1, \dots, x_d) \in \overline{\mathrm{span}}\{f_1(x_1) \times \cdots \times f_d(x_d)\}$, i.e., $f = \lim_{R \to \infty} \sum_{r=1}^{R} f_r^{(1)}(x_1) f_r^{(2)}(x_2) \cdots f_r^{(d)}(x_d)$.

This can be extended to $p \neq q$, giving $MB_{p,q}^\beta$ (see, for example, Sickel and Ullrich (2009); Dũng (2011a)).

SLIDE 50

Tensor product space

Tensor product of Besov spaces (dominating mixed smoothness):
$MB_{p,p}^\beta = B_{p,p}^\beta(\mathbb{R}) \otimes_p \cdots \otimes_p B_{p,p}^\beta(\mathbb{R})$

$f(x_1, \dots, x_d) \in \overline{\mathrm{span}}\{f_1(x_1) \times \cdots \times f_d(x_d)\}$, i.e., $f = \lim_{R \to \infty} \sum_{r=1}^{R} f_r^{(1)}(x_1) f_r^{(2)}(x_2) \cdots f_r^{(d)}(x_d)$. This can be extended to $p \neq q$, giving $MB_{p,q}^\beta$ (see, for example, Sickel and Ullrich (2009); Dũng (2011a)).

When $p \ge 1$, the norm of $B_{p,p}^\beta \otimes_p G$ for a Banach space $G$ is defined by
$\|f\|_{B_{p,p}^\beta \otimes_p G} := \inf \Big\{ \Big( \sum_{r=1}^{R} \|f_r^{(1)}\|_{B_{p,p}^\beta}^p \Big)^{1/p} \sup\Big\{ \Big\| \sum_{r=1}^{R} \lambda_r g_r^{(2)} \Big\|_G : \Big( \sum_{r=1}^{R} |\lambda_r|^p \Big)^{1/p} \le 1 \Big\} \Big\}$,
where the infimum is over representations $f = \sum_{r=1}^{R} f_r^{(1)}(x_1) g_r^{(2)}(x_2)$ with $f_r^{(1)} \in B_{p,p}^\beta$ and $g_r^{(2)} \in G$. $B_{p,p}^\beta \otimes_p G$ is obtained by completing the finite sums with respect to this norm, and
$MB_{p,p}^\beta := B_{p,p}^\beta \otimes_p (\cdots B_{p,p}^\beta \otimes_p (B_{p,p}^\beta \otimes_p B_{p,p}^\beta))$.

For $p < 1$ and $p = \infty$, a different norm is induced (see Light and Cheney (1985)).

SLIDE 51

Tensor product space

Tensor product of Besov spaces (dominating mixed smoothness):
$MB_{p,p}^\beta = B_{p,p}^\beta(\mathbb{R}) \otimes_p \cdots \otimes_p B_{p,p}^\beta(\mathbb{R})$

$f(x_1, \dots, x_d) \in \overline{\mathrm{span}}\{f_1(x_1) \times \cdots \times f_d(x_d)\}$, i.e., $f = \lim_{R \to \infty} \sum_{r=1}^{R} f_r^{(1)}(x_1) f_r^{(2)}(x_2) \cdots f_r^{(d)}(x_d)$. This can be extended to $p \neq q$, giving $MB_{p,q}^\beta$ (see, for example, Sickel and Ullrich (2009); Dũng (2011a)).

Derivatives controlled by the mixed-smooth Besov norm ($MB_{p,q}^2(\mathbb{R}^2)$; e.g., the Korobov space):
$\frac{\partial f}{\partial x_1},\ \frac{\partial f}{\partial x_2},\ \frac{\partial^2 f}{\partial x_1^2},\ \frac{\partial^2 f}{\partial x_2^2},\ \frac{\partial^2 f}{\partial x_1 \partial x_2},\ \frac{\partial^3 f}{\partial x_1 \partial x_2^2},\ \frac{\partial^3 f}{\partial x_1^2 \partial x_2},\ \frac{\partial^4 f}{\partial x_1^2 \partial x_2^2}$

Derivatives controlled by the Sobolev norm ($W_p^2(\mathbb{R}^2)$):
$\frac{\partial f}{\partial x_1},\ \frac{\partial f}{\partial x_2},\ \frac{\partial^2 f}{\partial x_1^2},\ \frac{\partial^2 f}{\partial x_2^2},\ \frac{\partial^2 f}{\partial x_1 \partial x_2}$

SLIDE 52

Examples

- $f(g_1(x_1), g_2(x_2), \dots, g_d(x_d))$, where $g_k \in B_{p,q}^s(\mathbb{R})$ and $f$ is sufficiently smooth.
- Additive model: $f(x) = \sum_{r=1}^{d} f_r(x_r)$
- Tensor model: $f(x) = \sum_{r=1}^{R} \prod_{k=1}^{d} f_{r,k}(x_k)$

SLIDE 53

Approximation by NN

Theorem

Suppose that $0 < p, q, r \le \infty$ and $\beta > (1/p - 1/r)_+$. For every $f^o \in MB_{p,q}^\beta([0,1]^d)$ with $\|f^o\|_{MB_{p,q}^\beta([0,1]^d)} \le 1$ and every $N \ge 1$, there exists a ReLU-NN $\check f$ with width $W = O(N C_{N,d})$, depth $L = O(\log(N))$, sparsity $S = O(W \times L \times \log(N))$, and parameters bounded by $\|W^{(\ell)}\|_\infty, \|b^{(\ell)}\|_\infty \le O(N^{(1/p - \beta)_+})$, such that
$\|f^o - \check f\|_{L^r([0,1]^d)} \lesssim \begin{cases} N^{-\beta} C_{d,N}^{(1/\min(r,1) - 1/q)_+} & (p \ge r), \\ N^{-\beta} C_{d,N}^{(1/r - 1/q)_+} & (p < r,\ r < \infty), \\ N^{-\beta} C_{d,N}^{(1 - 1/q)_+} & (r = \infty), \end{cases}$
where $C_{d,N} := \big(1 + \frac{d-1}{\log(N)}\big)^{\log(N)} \big(1 + \frac{\log(N)}{d-1}\big)^{d-1} \ (\lesssim d^{\log(N)} \wedge \log(N)^{d-1})$.

For the ordinary Besov space $B_{p,q}^\beta([0,1]^d)$, the corresponding rate is $N^{-\beta/d}$.

Proof idea: the sparse grid technique (Dũng, 2011a; Smolyak, 1963) combined with adaptive nonlinear interpolation.

SLIDE 54

Estimation error bound

$y_i = f^o(x_i) + \xi_i$ ($i = 1, \dots, n$), where $x_i \sim P_X$ with density $p(x) < G$ on $[0,1]^d$.

$\mathcal{F}(L, W, S, B)$: ReLU-NN with width $W$, depth $L$, and sparsity $S$, with parameters bounded by $B$.

$\hat f = \operatorname*{argmin}_{f \in \mathcal{F}(L,W,S,B)} \sum_{i=1}^{n} (y_i - \bar f(x_i))^2$
($\bar f$ is the clipping of $f$: $\bar f = \min\{\max\{f, -R\}, R\}$; realizable by ReLU.)

Theorem
Suppose that $0 < p, q \le \infty$ and $\beta > (1/p - 1/2)_+$. For every $f^o \in MB_{p,q}^\beta([0,1]^d)$ with $\|f^o\|_{MB_{p,q}^\beta([0,1]^d)} \le 1$, by letting $u = (1 - \tfrac{1}{q})_+$ for $p \ge 2$ and $u = (\tfrac12 - \tfrac1q)_+$ for $p < 2$,
$\|f^o - \hat f\|_{L^2(P)}^2 \lesssim \begin{cases} n^{-\frac{2\beta}{2\beta+1}} \log(n)^{\frac{2\beta+2u}{1+2\beta}(d-1)} \log(n)^3 & (\text{always}), \\ n^{-\frac{2\beta}{2\beta+1+\log_2(e)}} \log(n)^3 & (u = 0). \end{cases}$

For the ordinary Besov space $B_{p,q}^\beta([0,1]^d)$, the rate is $\tilde O(n^{-\frac{2\beta}{2\beta+d}})$.

→ The effect of dimensionality is eased.

SLIDE 55

Sparse grid

[Figure: sparse grid vs. dense grid; figure borrowed from Montanelli & Du (2017).]

Number of points in a sparse grid: $N = 2^M M^{d-1}$. Dense grid: $N = 2^{Md}$.
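
For concreteness (M and d below are arbitrary illustrative values):

```python
# Grid sizes from the slide: sparse grid N = 2^M * M^(d-1)  vs.  dense grid N = 2^(M*d).
M, d = 5, 4
print("sparse:", 2 ** M * M ** (d - 1))    # 4,000 points
print("dense :", 2 ** (M * d))             # 1,048,576 points
```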

SLIDE 56

NN-structure

[Figure: network structure taking the input (x1, x2, ..., xd).]

SLIDE 57

Applications

Additive model: $f(x_1, \dots, x_d) = \sum_{r=1}^{d} f_r(x_r)$.

Tensor product form: $f(x_1, \dots, x_d) = \sum_{r=1}^{R} \prod_{k=1}^{d} f_{r,k}(x_k)$.

Dimensionality reduction: $f^o = g \circ F$, where $F: \mathbb{R}^d \to \mathbb{R}^D$ with $D \ll d$, $F_i \in MB_{p,q}^s$, and $g \in B_{p,q}^\gamma(\mathbb{R}^D)$:
$\tilde O\big(n^{-\frac{2s}{2s+1+\log_2(e)}} + n^{-\frac{2\gamma}{2\gamma+D}}\big)$.
($F$ is a nonlinear dimensionality reduction into a low-dimensional space, e.g., a low-dimensional manifold embedding.) (See also Bölcskei et al. (2017).)

SLIDE 58

Sparse input

The input $x$ is sparse (its number of non-zero elements is small):
$\|x\|_0 \le k \;\Rightarrow\; n^{-\frac{2\gamma}{2\gamma+k}}$

SLIDE 59

Low dimensional manifold

$f(x)$ depends only on a $D$-dimensional quotient manifold:
$n^{-\frac{2\gamma}{2\gamma+D}}$

SLIDE 60

Conclusion

Adaptivity of deep learning:
- It was shown that the ReLU-DNN has high adaptivity to the shape of the target function (discontinuity and spatially inhomogeneous smoothness):
  $\|\hat f - f^o\|_{L^2(P)}^2 = \tilde O(n^{-2s/(2s+d)})$
- The DNN outperforms non-adaptive methods:
  (DNN) $n^{-2s/(2s+d)} \ll n^{-\frac{2(s - d(1/p - 1/2))}{2s + d - 2d(1/p - 1/2)}}$ (linear method)
- The ReLU-DNN can ease the curse of dimensionality when estimating functions in mixed-smooth Besov spaces:
  (Besov) $\tilde O(n^{-2s/(2s+d)})$ → (m-Besov) $\tilde O\big(n^{-2s/(2s+1)} \log(n)^{\frac{2\beta+2u}{1+2\beta}(d-1)}\big)$
- Better than fixed-basis methods: high adaptivity to sparsity.

SLIDE 61

References

Arora, S., Ge, R., Neyshabur, B., & Zhang, Y. (2018). Stronger generalization bounds for deep nets via a compression approach. Proceedings of the 35th International Conference on Machine Learning (pp. 254–263). PMLR.
Barron, A. R. (1993). Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39, 930–945.
Bianchini, M., & Scarselli, F. (2014). On the complexity of neural network classifiers: A comparison between shallow and deep architectures. IEEE Transactions on Neural Networks and Learning Systems, 25, 1553–1565.
Bölcskei, H., Grohs, P., Kutyniok, G., & Petersen, P. (2017). Optimal approximation with sparsely connected deep neural networks. arXiv preprint arXiv:1705.01714.
Cohen, N., Sharir, O., & Shashua, A. (2016). On the expressive power of deep learning: A tensor analysis. The 29th Annual Conference on Learning Theory (pp. 698–728).
Cohen, N., & Shashua, A. (2016). Convolutional rectifier networks as generalized tensor decompositions. Proceedings of the 33rd International Conference on Machine Learning (pp. 955–963).
DeVore, R. A., & Popov, V. A. (1988). Interpolation of Besov spaces. Transactions of the American Mathematical Society, 305, 397–414.
Donoho, D. L., & Johnstone, I. M. (1998). Minimax estimation via wavelet shrinkage. The Annals of Statistics, 26, 879–921.

Dũng, D. (2011a). B-spline quasi-interpolant representations and sampling recovery of functions with mixed smoothness. Journal of Complexity, 27, 541–567.
Dũng, D. (2011b). Optimal adaptive sampling recovery. Advances in Computational Mathematics, 34, 1–41.
Eldan, R., & Shamir, O. (2016). The power of depth for feedforward neural networks. Proceedings of the 29th Annual Conference on Learning Theory (pp. 907–940).
Imaizumi, M., & Fukumizu, K. (2018). Deep neural networks learn non-smooth functions effectively. arXiv preprint arXiv:1802.04474.
Liang, S., & Srikant, R. (2016). Why deep neural networks for function approximation? arXiv preprint arXiv:1610.04161. ICLR 2017.
Light, W., & Cheney, E. (1985). Approximation Theory in Tensor Product Spaces. Lecture Notes in Mathematics. Springer-Verlag.
Montanelli, H., & Du, Q. (2017). Deep ReLU networks lessen the curse of dimensionality. arXiv preprint arXiv:1712.08688.
Montufar, G. F., Pascanu, R., Cho, K., & Bengio, Y. (2014). On the number of linear regions of deep neural networks. Advances in Neural Information Processing Systems 27, 2924–2932. Curran Associates, Inc.

Poole, B., Lahiri, S., Raghu, M., Sohl-Dickstein, J., & Ganguli, S. (2016). Exponential expressivity in deep neural networks through transient chaos. Advances in Neural Information Processing Systems 29, 3360–3368. Curran Associates, Inc.
Schmidt-Hieber, J. (2017). Nonparametric regression using deep neural networks with ReLU activation function. arXiv e-prints.
Sickel, W., & Ullrich, T. (2009). Tensor products of Sobolev–Besov spaces and applications to approximation from the hyperbolic cross. Journal of Approximation Theory, 161, 748–786.
Smolyak, S. (1963). Quadrature and interpolation formulas for tensor products of certain classes of functions. Soviet Math. Dokl. (pp. 240–243).
Sonoda, S., & Murata, N. (2015). Neural network with unbounded activation functions is universal approximator. Applied and Computational Harmonic Analysis.
Suzuki, T., Abe, H., Murata, T., Horiuchi, S., Ito, K., Wachi, T., Hirai, S., Yukishima, M., & Nishimura, T. (2018). Spectral-Pruning: Compressing deep neural network via spectral analysis. arXiv e-prints, arXiv:1808.08558.
Yarotsky, D. (2016). Error bounds for approximations with deep ReLU networks. CoRR, abs/1610.01145.

Zhang, S., Wong, M.-Y., & Zheng, Z. (2002). Wavelet threshold estimation of a regression function with random design. Journal of Multivariate Analysis, 80, 256–284.
