
SLIDE 1

EE-559 – Deep learning

6. Going deeper

François Fleuret
https://fleuret.org/dlc/

[version of: June 11, 2018]

ÉCOLE POLYTECHNIQUE FÉDÉRALE DE LAUSANNE

SLIDE 2

Benefits and challenges of greater depth

SLIDES 3-5

For image classification, for instance, there has been a trend toward deeper architectures to improve performance.

Network                               Nb. layers
LeNet5 (LeCun et al., 1998)           5
AlexNet (Krizhevsky et al., 2012)     8
VGG (Simonyan and Zisserman, 2014)    11–19
GoogLeNet (Szegedy et al., 2015)      22
Inception v4 (Szegedy et al., 2016)   76
ResNet (He et al., 2015)              34–152
ResNet (He et al., 2016)              1001
ResNet (Huang et al., 2016)           1202

"Notably, we did not depart from the classical ConvNet architecture of LeCun et al. (1989), but improved it by substantially increasing the depth." (Simonyan and Zisserman, 2014)

A theoretical analysis provides an intuition of how a network's output "irregularity" grows linearly with its width and exponentially with its depth.

SLIDES 6-9

Let F be the set of piece-wise linear mappings on [0, 1], and ∀f ∈ F, let κ(f) be the minimum number of linear pieces needed to represent f.

Let σ be the ReLU function σ : ℝ → ℝ, x ↦ max(0, x).

If we compose σ and f ∈ F, any linear piece that does not cross 0 remains a single piece or disappears, and one that does cross 0 breaks into two, hence

∀f ∈ F, κ(σ(f)) ≤ 2 κ(f),

and we also have

∀(f, g) ∈ F², κ(f + g) ≤ κ(f) + κ(g).

SLIDES 10-13

Consider a MLP with ReLU, a single input unit, and a single output unit:

x^0_1 = x,

∀d = 1, ..., D, ∀i,  s^d_i = Σ_{j=1}^{W_{d−1}} w^d_{i,j} x^{d−1}_j + b^d_i,  x^d_i = σ(s^d_i),

y = x^D_1.

All the s^d_i's and x^d_i's are piece-wise linear functions of x, with ∀i, κ(s^1_i) = 1, and

∀l, i, κ(x^l_i) = κ(σ(s^l_i)) ≤ 2 κ(s^l_i) ≤ 2 Σ_{j=1}^{W_{l−1}} κ(x^{l−1}_j),

from which

∀l, max_i κ(x^l_i) ≤ 2 W_{l−1} max_j κ(x^{l−1}_j),

and we get the following bound for any ReLU MLP:

κ(y) ≤ 2^D ∏_{d=1}^{D} W_d.
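To make the bound concrete, here is a minimal sketch (assuming PyTorch; the depth, width, grid resolution, and slope-change threshold are arbitrary choices of mine, not from the course) that estimates the number of linear pieces of a random ReLU MLP on [0, 1] by counting changes of the discrete slope:

import torch
from torch import nn

# Build a random ReLU MLP with one input and one output
torch.manual_seed(0)
D, W = 4, 8  # depth and width, arbitrary
layers, d_in = [], 1
for _ in range(D):
    layers += [nn.Linear(d_in, W), nn.ReLU()]
    d_in = W
layers.append(nn.Linear(d_in, 1))
model = nn.Sequential(*layers).double()

# Evaluate it on a fine grid of [0, 1] and count slope changes
x = torch.linspace(0, 1, 100001, dtype=torch.float64).unsqueeze(1)
with torch.no_grad():
    y = model(x).squeeze(1)
slopes = (y[1:] - y[:-1]) * (x.size(0) - 1)   # discrete derivative
nb_pieces = 1 + int(((slopes[1:] - slopes[:-1]).abs() > 1e-6).sum())
print(nb_pieces, 'pieces; bound 2^D prod(W_d) =', 2 ** D * W ** D)

A randomly initialized network typically uses far fewer pieces than the bound allows; the next slides construct one that does reach an exponential number.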

SLIDES 14-23

Although this seems quite a pessimistic bound, we can hand-design a network that [almost] reaches it:

[Figure: a network built layer by layer (Layer 1, Layer 2, Layer 3), each layer made of two ReLU units σ.]

SLIDE 24

So for any D, there is a network with D hidden layers and 2D hidden units which computes an f : [0, 1] → [0, 1] of period 1/2^D.

[Figure: a sawtooth function of period 1/2^D on [0, 1].]
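A minimal sketch of such a hand-made construction (assuming PyTorch; the "tent map" trick with two ReLU units per layer is my reading of the figure, not code from the course):

import torch
from torch import nn

# Each layer computes the tent map x -> 2*relu(x) - 4*relu(x - 0.5) on [0, 1],
# using exactly two ReLU units; composing D of them gives a rapidly oscillating sawtooth.
def tent_layer():
    lin = nn.Linear(1, 2)
    lin.weight.data = torch.tensor([[1.0], [1.0]])
    lin.bias.data = torch.tensor([0.0, -0.5])
    out = nn.Linear(2, 1, bias=False)
    out.weight.data = torch.tensor([[2.0, -4.0]])
    return [lin, nn.ReLU(), out]

D = 4
modules = []
for _ in range(D):
    modules += tent_layer()
f = nn.Sequential(*modules)

x = torch.linspace(0, 1, 10001).unsqueeze(1)
with torch.no_grad():
    y = f(x).squeeze(1)

# The number of times f crosses 1/2 doubles with every extra layer
print(((y[:-1] - 0.5) * (y[1:] - 0.5) < 0).sum().item())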

SLIDES 25-30

[Figure: the sawtooth f of period 1/2^D on [0, 1], together with a piece-wise linear g.]

Given g ∈ F, it crosses 1/2 at most κ(g) times, which means that on at least 2^D − κ(g) segments of length 1/2^D, it is on one side of 1/2, so that on part of each such segment |f(x) − g(x)| ≥ |f(x) − 1/2|, and

∫_0^1 |f(x) − g(x)| dx ≥ (2^D − κ(g)) · (1/2) · (1/2^D) · (1/8) = (1/16) (1 − κ(g)/2^D).

And we multiply f by 16 to get our final result.

SLIDES 31-33

So, considering ReLU MLPs with a single input/output: there exists a network f with D* layers and 2D* internal units, such that, for any network g with D layers of sizes {W_1, ..., W_D}:

‖f − g‖_1 ≥ 1 − (2^D / 2^{D*}) ∏_{d=1}^{D} W_d.

In particular, with g a single hidden layer network,

‖f − g‖_1 ≥ 1 − 2 W_1 / 2^{D*}.

To approximate f properly, the width W_1 of g's hidden layer has to increase exponentially with f's depth D*.

This is a simplified variant of results by Telgarsky (2015, 2016).

SLIDES 34-35

So we have good reasons to increase depth, but we saw that an important issue then is to control the amplitude of the gradient, which is tightly related to controlling activations.

In particular we have to ensure that

  • the gradient does not "vanish" (Bengio et al., 1994; Hochreiter et al., 2001),
  • the gradient amplitude is homogeneous so that all parts of the network train at the same rate (Glorot and Bengio, 2010),
  • the gradient does not vary too unpredictably when the weights change (Balduzzi et al., 2017).

SLIDES 36-38

Modern techniques change the functional itself instead of trying to improve training "from the outside" through penalty terms or better optimizers.

Our main concern is to make the gradient descent work, even at the cost of engineering substantially the class of functions.

An additional issue for training very large architectures is the computational cost, which often turns out to be the main practical problem.

SLIDE 39

Rectifiers

SLIDE 40

The use of the ReLU activation function was a great improvement compared to the historical tanh.

[Figure: graph of the activation function, with axis marks at 1 and −1.]

SLIDE 41

This can be explained by the derivative of ReLU itself not vanishing, and by the resulting coding being sparse (Glorot et al., 2011).

[Figure: graph of the activation function, with axis marks at 1 and −1.]

SLIDE 42

The steeper slope in the loss surface speeds up the training.

Figure 1: A four-layer convolutional neural network with ReLUs (solid line) reaches a 25% training error rate on CIFAR-10 six times faster than an equivalent network with tanh neurons (dashed line). The learning rates for each network were chosen independently to make training as fast as possible. No regularization of any kind was employed. The magnitude of the effect demonstrated here varies with network architecture, but networks with ReLUs consistently learn several times faster than equivalents with saturating neurons.

(Krizhevsky et al., 2012)

SLIDES 43-44

A first variant of ReLU is Leaky-ReLU

ℝ → ℝ, x ↦ max(ax, x),

with 0 ≤ a < 1, usually small.

The parameter a can be either fixed or optimized during training.
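Usage sketch (the slope values below are arbitrary): PyTorch provides nn.LeakyReLU for a fixed a, and nn.PReLU for a trainable one.

import torch
from torch import nn

x = torch.linspace(-2, 2, 5)

leaky = nn.LeakyReLU(negative_slope=0.1)  # fixed a = 0.1
prelu = nn.PReLU(init=0.1)                # a is a trainable parameter

print(leaky(x))  # negative inputs are scaled by 0.1
print(prelu(x))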

SLIDES 45-47

The "maxout" layer proposed by Goodfellow et al. (2013) takes the max of several linear units. This is not an activation function in the usual sense, since it has trainable parameters.

h : ℝ^D → ℝ^M
x ↦ ( max_{j=1,...,K} x^T W_{1,j} + b_{1,j}, ..., max_{j=1,...,K} x^T W_{M,j} + b_{M,j} )

It can in particular encode ReLU and absolute value, but can also approximate any convex function.
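A minimal sketch of a maxout layer (the module name and the single-Linear-plus-reshape layout are mine; only the max-over-K-linear-units structure comes from the slide):

import torch
from torch import nn

class Maxout(nn.Module):
    def __init__(self, d_in, d_out, k):
        super().__init__()
        self.d_out, self.k = d_out, k
        self.linear = nn.Linear(d_in, d_out * k)

    def forward(self, x):
        z = self.linear(x)                         # (B, M*K)
        z = z.view(x.size(0), self.d_out, self.k)  # (B, M, K)
        return z.max(dim=2)[0]                     # max over the K candidates

layer = Maxout(d_in=10, d_out=4, k=3)
print(layer(torch.randn(8, 10)).shape)  # torch.Size([8, 4])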

SLIDE 48

A more recent proposal is the "Concatenated Rectified Linear Unit" (CReLU) proposed by Shang et al. (2016):

ℝ → ℝ², x ↦ (max(0, x), max(0, −x)).

This activation function doubles the number of activations but keeps the norm of the signal intact during both the forward and the backward passes.
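A minimal sketch of CReLU as a module, applied here along the feature dimension of a batch of vectors (illustrative, not the authors' code):

import torch
from torch import nn

class CReLU(nn.Module):
    def forward(self, x):
        # concatenate relu(x) and relu(-x), doubling the number of activations
        return torch.cat((torch.relu(x), torch.relu(-x)), dim=1)

x = torch.randn(4, 3)
print(CReLU()(x).shape)  # torch.Size([4, 6])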

SLIDE 49

Dropout

SLIDE 50

A first "deep" regularization technique is dropout (Srivastava et al., 2014). It consists of removing units at random during the forward pass on each sample, and putting them all back during test.

(a) Standard Neural Net. (b) After applying dropout.

Figure 1: Dropout Neural Net Model. Left: A standard neural net with 2 hidden layers. Right: An example of a thinned net produced by applying dropout to the network on the left. Crossed units have been dropped.

(Srivastava et al., 2014)

SLIDE 51

This method increases independence between units, and distributes the representation. It generally improves performance.

"In a standard neural network, the derivative received by each parameter tells it how it should change so the final loss function is reduced, given what all other units are doing. Therefore, units may change in a way that they fix up the mistakes of the other units. This may lead to complex co-adaptations. This in turn leads to overfitting because these co-adaptations do not generalize to unseen data. We hypothesize that for each hidden unit, dropout prevents co-adaptation by making the presence of other hidden units unreliable. Therefore, a hidden unit cannot rely on other specific units to correct its mistakes. It must perform well in a wide variety of different contexts provided by the other hidden units." (Srivastava et al., 2014)

SLIDES 52-53

(a) Without dropout. (b) Dropout with p = 0.5.

Figure 7: Features learned on MNIST with one hidden layer autoencoders having 256 rectified linear units.

(Srivastava et al., 2014)

A network with dropout can be interpreted as an ensemble of 2^N models with heavy weight sharing (Goodfellow et al., 2013).

SLIDES 54-57

One has to decide on which units/layers to use dropout, and with what probability p units are dropped.

During training, for each sample, as many Bernoulli variables as units are sampled independently to select units to remove.

To keep the means of the inputs to layers unchanged, the initial version of dropout was multiplying activations by p during test.

The standard variant in use is the "inverted dropout". It multiplies activations by 1/(1 − p) during train and keeps the network untouched during test.
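A minimal hand-written sketch of inverted dropout (in practice one simply uses nn.Dropout; this function is only there to make the 1/(1 − p) scaling explicit):

import torch

def inverted_dropout(x, p=0.5, train=True):
    # During training, zero each activation with probability p and scale the
    # survivors by 1/(1-p); at test time the input is passed through unchanged.
    if not train or p == 0:
        return x
    mask = (torch.rand_like(x) > p).float()  # Bernoulli(1-p) keep-mask
    return x * mask / (1 - p)

x = torch.ones(2, 8)
print(inverted_dropout(x, p=0.75))               # surviving entries are scaled to 4
print(inverted_dropout(x, p=0.75, train=False))  # identity at test time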

SLIDES 58-63

Dropout is not implemented by actually switching off units, but equivalently as a module that drops activations at random on each sample.

[Figure: a dropout module inserted between two layers Φ; each component x^(l)_i of the input x^(l) is multiplied by (1/(1 − p)) · B(1 − p) to produce the output u^(l)_i.]

SLIDES 64-65

Dropout is implemented in PyTorch as torch.nn.Dropout, which is a torch.nn.Module. In the forward pass, it samples a Boolean variable for each component of the Variable it gets as input, and zeroes entries accordingly.

The default probability to drop is p = 0.5, but other values can be specified.

SLIDE 66

>>> x = Variable(Tensor(3, 9).fill_(1.0), requires_grad = True)
>>> x.data
 1  1  1  1  1  1  1  1  1
 1  1  1  1  1  1  1  1  1
 1  1  1  1  1  1  1  1  1
[torch.FloatTensor of size 3x9]
>>> dropout = nn.Dropout(p = 0.75)
>>> y = dropout(x)
>>> y.data
 4  0  4  4  4  0  4  0  0
 4  0  0  0  0  0  0  0  0
 0  0  0  0  4  0  4  0  4
[torch.FloatTensor of size 3x9]
>>> l = y.norm(2, 1).sum()
>>> l.backward()
>>> x.grad.data
 1.7889  0.0000  1.7889  1.7889  1.7889  0.0000  1.7889  0.0000  0.0000
 4.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000
 0.0000  0.0000  0.0000  0.0000  2.3094  0.0000  2.3094  0.0000  2.3094
[torch.FloatTensor of size 3x9]

SLIDES 67-68

If we have a network

model = nn.Sequential(nn.Linear(10, 100), nn.ReLU(),
                      nn.Linear(100, 50), nn.ReLU(),
                      nn.Linear(50, 2))

we can simply add dropout layers

model = nn.Sequential(nn.Linear(10, 100), nn.ReLU(), nn.Dropout(),
                      nn.Linear(100, 50), nn.ReLU(), nn.Dropout(),
                      nn.Linear(50, 2))

SLIDES 69-70

  • A model using dropout has to be set in "train" or "test" mode.

The method nn.Module.train(mode) recursively sets the training flag of all sub-modules.

>>> dropout = nn.Dropout()
>>> model = nn.Sequential(nn.Linear(3, 10), dropout, nn.Linear(10, 3))
>>> dropout.training
True
>>> model.train(False)
Sequential (
  (0): Linear (3 -> 10)
  (1): Dropout (p = 0.5)
  (2): Linear (10 -> 3)
)
>>> dropout.training
False

SLIDES 71-72

As pointed out by Tompson et al. (2015), units in a 2d activation map are generally locally correlated, and dropout has virtually no effect. They proposed SpatialDropout, which drops channels instead of individual units.

>>> dropout2d = nn.Dropout2d()
>>> x = Variable(Tensor(2, 3, 2, 2).fill_(1.0))
>>> dropout2d(x)
Variable containing:
(0 ,0 ,.,.) =
  0  0
  0  0

(0 ,1 ,.,.) =
  0  0
  0  0

(0 ,2 ,.,.) =
  2  2
  2  2

(1 ,0 ,.,.) =
  2  2
  2  2

(1 ,1 ,.,.) =
  0  0
  0  0

(1 ,2 ,.,.) =
  2  2
  2  2
[torch.FloatTensor of size 2x3x2x2]

SLIDES 73-74

Another variant is dropconnect, which drops connections instead of units.

Figure 1. (a): An example model layout for a single DropConnect layer. After running feature extractor g() on input x, a random instantiation of the mask M (e.g. (b)), masks out the weight matrix W. The masked weights are multiplied with this feature vector to produce u which is the input to an activation function a and a softmax layer s. For comparison, (c) shows an effective weight mask for elements that Dropout uses when applied to the previous layer's output (red columns) and this layer's output (green rows). Note the lack of structure in (b) compared to (c).

(Wan et al., 2013)

It cannot be implemented as a separate layer and is computationally intensive.

SLIDE 75

crop   rotation    model         error (%)     5 network voting
       scaling                                 error (%)
no     no          No-Drop       0.77±0.051    0.67
                   Dropout       0.59±0.039    0.52
                   DropConnect   0.63±0.035    0.57
yes    no          No-Drop       0.50±0.098    0.38
                   Dropout       0.39±0.039    0.35
                   DropConnect   0.39±0.047    0.32
yes    yes         No-Drop       0.30±0.035    0.21
                   Dropout       0.28±0.016    0.27
                   DropConnect   0.28±0.032    0.21

Table 3. MNIST classification error. Previous state of the art is 0.47% (Zeiler and Fergus, 2013) for a single model without elastic distortions and 0.23% with elastic distortions and voting (Ciresan et al., 2012).

(Wan et al., 2013)

SLIDE 76

Activation normalization

SLIDES 77-79

We saw that maintaining proper statistics of the activations and derivatives was a critical issue to allow the training of deep architectures.

It was the main motivation behind Xavier's weight initialization rule.

A different approach consists of explicitly forcing the activation statistics during the forward pass by re-normalizing them. Batch normalization proposed by Ioffe and Szegedy (2015) was the first method introducing this idea.

SLIDES 80-81

"Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization /.../" (Ioffe and Szegedy, 2015)

Batch normalization can be done anywhere in a deep architecture, and forces the activations' first and second order moments, so that the following layers do not need to adapt to their drift.

SLIDES 82-83

During training batch normalization shifts and rescales according to the mean and variance estimated on the batch.

  • Processing a batch jointly is unusual. Operations used in deep models can virtually always be formalized per-sample.

During test, it simply shifts and rescales according to the empirical moments estimated during training.

SLIDES 84-85

If x_b ∈ ℝ^D, b = 1, ..., B are the samples in the batch, we first compute the empirical per-component mean and variance on the batch

m̂_batch = (1/B) Σ_{b=1}^{B} x_b,
v̂_batch = (1/B) Σ_{b=1}^{B} (x_b − m̂_batch)²,

from which we compute normalized z_b ∈ ℝ^D, and outputs y_b ∈ ℝ^D:

∀b = 1, ..., B, z_b = (x_b − m̂_batch) / √(v̂_batch + ε),  y_b = γ ⊙ z_b + β,

where γ ∈ ℝ^D and β ∈ ℝ^D are parameters to optimize.
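The training-time forward pass can be written out by hand and checked against nn.BatchNorm1d (a sketch; the sizes and ε are arbitrary):

import torch

B, D, eps = 16, 5, 1e-5
x = torch.randn(B, D) * 3.0 + 1.0
gamma, beta = torch.ones(D), torch.zeros(D)

# Per-component batch mean and (biased, 1/B) variance, as in the formulas above
m_batch = x.mean(dim=0)
v_batch = x.var(dim=0, unbiased=False)
z = (x - m_batch) / torch.sqrt(v_batch + eps)
y = gamma * z + beta

bn = torch.nn.BatchNorm1d(D, eps=eps)
bn.train()
print(torch.allclose(y, bn(x), atol=1e-5))  # True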

SLIDES 86-87

During inference, batch normalization shifts and rescales independently each component of the input x according to statistics estimated during training:

y = γ ⊙ (x − m̂) / √(v̂ + ε) + β,

where ⊙ is the Hadamard component-wise product. Hence, during inference, batch normalization performs a component-wise affine transformation.

  • As for dropout, the model behaves differently during train and test.

SLIDE 88

As for dropout, batch normalization is implemented as a separate module torch.nn.BatchNorm1d that processes the input components separately.

>>> x = Tensor(10000, 3).normal_()
>>> x = x * Tensor([2, 5, 10]) + Tensor([-10, 25, 3])
>>> x = Variable(x)
>>> x.data.mean(0)
 -9.9952
 25.0467
  2.9453
[torch.FloatTensor of size 3]
>>> x.data.std(0)
  1.9780
  5.0530
 10.0587
[torch.FloatTensor of size 3]

SLIDE 89

Since the module has internal variables to keep statistics, it must be provided with the sample dimension at creation.

>>> bn = nn.BatchNorm1d(3)
>>> bn.bias.data = Tensor([2, 4, 8])
>>> bn.weight.data = Tensor([1, 2, 3])
>>> y = bn(x)
>>> y.data.mean(0)
 2.0000
 4.0000
 8.0000
[torch.FloatTensor of size 3]
>>> y.data.std(0)
 1.0000
 2.0001
 3.0001
[torch.FloatTensor of size 3]

SLIDE 90

As for any other module, we have to compute the derivatives of the loss L with respect to the input values and the parameters.

For clarity, since components are processed independently, in what follows we consider a single dimension and do not index it.

SLIDE 91

We have

m̂_batch = (1/B) Σ_{b=1}^{B} x_b,
v̂_batch = (1/B) Σ_{b=1}^{B} (x_b − m̂_batch)²,
∀b = 1, ..., B, z_b = (x_b − m̂_batch) / √(v̂_batch + ε),  y_b = γ z_b + β.

From which

∂L/∂γ = Σ_b (∂L/∂y_b) (∂y_b/∂γ) = Σ_b (∂L/∂y_b) z_b,
∂L/∂β = Σ_b (∂L/∂y_b) (∂y_b/∂β) = Σ_b ∂L/∂y_b.

SLIDES 92-96

Since each input in the batch impacts all the outputs of the batch, the derivative of the loss with respect to an input is quite complicated.

∀b = 1, ..., B, ∂L/∂z_b = γ ∂L/∂y_b

∂L/∂v̂_batch = −(1/2) (v̂_batch + ε)^{−3/2} Σ_{b=1}^{B} (∂L/∂z_b) (x_b − m̂_batch)

∂L/∂m̂_batch = −(1/√(v̂_batch + ε)) Σ_{b=1}^{B} ∂L/∂z_b

∀b = 1, ..., B, ∂L/∂x_b = (∂L/∂z_b) (1/√(v̂_batch + ε)) + (2/B) (∂L/∂v̂_batch) (x_b − m̂_batch) + (1/B) ∂L/∂m̂_batch

In the standard implementation, m̂ and v̂ for test are estimated with a moving average during train, so that it can be implemented as a module which does not need an additional pass through the training samples.

SLIDES 97-98

Results on ImageNet's LSVRC2012:

Figure 2: Single crop validation accuracy of Inception and its batch-normalized variants, vs. the number of training steps.

Model           Steps to 72.2%   Max accuracy
Inception       31.0 · 10^6      72.2%
BN-Baseline     13.3 · 10^6      72.7%
BN-x5            2.1 · 10^6      73.0%
BN-x30           2.7 · 10^6      74.8%
BN-x5-Sigmoid                    69.8%

Figure 3: For Inception and the batch-normalized variants, the number of training steps required to reach the maximum accuracy of Inception (72.2%), and the maximum accuracy achieved by the network.

(Ioffe and Szegedy, 2015)

The authors state that with batch normalization

  • samples have to be shuffled carefully,
  • the learning rate can be greater,
  • dropout and local normalization are not necessary,
  • L2 regularization influence should be reduced.

SLIDE 99

Deep MLP on a 2d "disc" toy example, with naive Gaussian weight initialization, cross-entropy, standard SGD, η = 0.1.

def create_model(with_batchnorm, nc = 32, depth = 16):
    modules = []
    modules.append(nn.Linear(2, nc))
    if with_batchnorm: modules.append(nn.BatchNorm1d(nc))
    modules.append(nn.ReLU())
    for d in range(depth):
        modules.append(nn.Linear(nc, nc))
        if with_batchnorm: modules.append(nn.BatchNorm1d(nc))
        modules.append(nn.ReLU())
    modules.append(nn.Linear(nc, 2))
    return nn.Sequential(*modules)

We try different standard deviations for the weights

for p in model.parameters():
    p.data.normal_(0, std)

SLIDE 100

[Figure: test error (%) as a function of the weight std (0.001 to 10), for the baseline and with batch normalization.]

SLIDES 101-102

The position of batch normalization relative to the non-linearity is not clear.

"We add the BN transform immediately before the nonlinearity, by normalizing x = Wu + b. We could have also normalized the layer inputs u, but since u is likely the output of another nonlinearity, the shape of its distribution is likely to change during training, and constraining its first and second moments would not eliminate the covariate shift. In contrast, Wu + b is more likely to have a symmetric, non-sparse distribution, that is 'more Gaussian' (Hyvärinen and Oja, 2000); normalizing it is likely to produce activations with a stable distribution." (Ioffe and Szegedy, 2015)

. . . → Linear → BN → ReLU → . . .

However, this argument goes both ways: activations after the non-linearity are less "naturally normalized" and benefit more from batch normalization. Experiments are generally in favor of this solution, which is the current default.

. . . → Linear → ReLU → BN → . . .

SLIDE 103

As for dropout, properly using batch normalization on a convolutional map requires parameter-sharing.

The module torch.nn.BatchNorm2d (respectively torch.nn.BatchNorm3d) processes samples as multi-channel 2d maps (respectively multi-channel 3d maps) and normalizes each channel separately, with a γ and a β for each.
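Usage sketch (sizes are arbitrary): for a batch of shape (B, C, H, W), nn.BatchNorm2d(C) keeps one γ and one β per channel.

import torch
from torch import nn

bn = nn.BatchNorm2d(8)
x = torch.randn(4, 8, 16, 16)
y = bn(x)
print(y.shape, bn.weight.shape, bn.bias.shape)  # (4, 8, 16, 16), (8,), (8,)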

SLIDES 104-105

A more recent variant in the same spirit is the layer normalization proposed by Ba et al. (2016). Given a single sample x ∈ ℝ^D, it normalizes the components of x, hence normalizing activations across the layer instead of doing it across the batch:

μ = (1/D) Σ_{d=1}^{D} x_d,
σ = √( (1/D) Σ_{d=1}^{D} (x_d − μ)² ),
∀d, y_d = (x_d − μ) / σ.

Although it gives slightly worse improvements than BN, it has the advantage of behaving similarly during train and test, and of processing samples individually.
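A minimal sketch of this normalization for a batch of vectors (no learnable gain and bias here, unlike what nn.LayerNorm adds by default):

import torch

def layer_norm(x, eps=1e-5):
    # Normalize each sample across its D components, independently of the batch
    mu = x.mean(dim=1, keepdim=True)
    sigma = x.std(dim=1, unbiased=False, keepdim=True)
    return (x - mu) / (sigma + eps)

x = torch.randn(4, 10) * 3 + 2
y = layer_norm(x)
print(y.mean(dim=1), y.std(dim=1, unbiased=False))  # ~0 and ~1 per sample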

SLIDE 106

Residual networks

SLIDE 107

The "Highway networks" by Srivastava et al. (2015) use the idea of gating developed for recurrent units. They replace a standard non-linear layer y = H(x; W_H) with a layer that includes a "gated" pass-through

y = T(x; W_T) H(x; W_H) + (1 − T(x; W_T)) x,

where T(x; W_T) ∈ [0, 1] modulates how much the signal should be transformed.

[Figure: the input goes through H and is weighted by T, while a pass-through branch is weighted by 1 − T; the two are summed.]

This technique allowed them to train networks with up to 100 layers.
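A minimal sketch of one highway layer (the ReLU for H, the sigmoid gate, and the negative gate-bias initialization are common practical choices, shown here for illustration):

import torch
from torch import nn

class Highway(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.H = nn.Linear(dim, dim)
        self.T = nn.Linear(dim, dim)
        self.T.bias.data.fill_(-2.0)  # favor the pass-through at initialization

    def forward(self, x):
        t = torch.sigmoid(self.T(x))
        # output mixes the transformed signal (weight t) and the input (weight 1-t)
        return t * torch.relu(self.H(x)) + (1 - t) * x

x = torch.randn(5, 16)
print(Highway(16)(x).shape)  # torch.Size([5, 16])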

SLIDES 108-111

The residual networks proposed by He et al. (2015) simplify the idea and use a building block with a pass-through identity mapping.

[Figure: a residual block Linear - BN - ReLU - Linear - BN, whose output is added to the block's input before a final ReLU.]

Thanks to this structure, the parameters are optimized to learn a residual, that is the difference between the value before the block and the one needed after.
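A minimal sketch of such a block, written with 3×3 convolutions as in He et al.'s networks (channel count and sizes are arbitrary; no resampling, so input and output have the same shape):

import torch
from torch import nn

class ResBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        y = torch.relu(self.bn1(self.conv1(x)))
        y = self.bn2(self.conv2(y))
        return torch.relu(y + x)  # identity pass-through

x = torch.randn(2, 16, 8, 8)
print(ResBlock(16)(x).shape)  # torch.Size([2, 16, 8, 8])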

SLIDES 112-114

A technical point is to deal with convolution layers that change the activation map sizes or numbers of channels.

He et al. (2015) only consider:

  • reducing the activation map size by a factor 2,
  • increasing the number of channels.

SLIDES 115-116

To reduce the activation map size by a factor 2, the identity pass-through extracts 1/4 of the activations over a regular grid (i.e. with a stride of 2).

[Figure: the residual branch φ reduces the map size, and the identity branch is sub-sampled with a stride of 2 before the addition.]

SLIDE 117

To increase the number of channels from C to C′, they propose to either:

  • pad the original value with C′ − C zeros, which amounts to adding as many zeroed channels, or
  • use C′ convolutions with a 1 × 1 × C filter, which corresponds to applying the same fully-connected linear model ℝ^C → ℝ^{C′} at every location.

SLIDE 118

Finally, He et al.'s residual networks are fully convolutional, which means they have no fully connected layers. We will come back to this.

Their second-to-last layer is a per-channel global average pooling that outputs a 1d tensor, fed into a single fully-connected layer.

SLIDE 119

[Figure 3: example network architectures for ImageNet, listed layer by layer (7x7 and 3x3 convolutions, poolings, global average pooling, and fully-connected layers), with output sizes from 224 down to 1.]

(He et al., 2015)

SLIDE 120

Performance on ImageNet.

[Figure: error (%) vs. training iterations.]

Figure 4. Training on ImageNet. Thin curves denote training error, and bold curves denote validation error of the center crops. Left: plain networks of 18 and 34 layers. Right: ResNets of 18 and 34 layers. In this plot, the residual networks have no extra parameter compared to their plain counterparts.

(He et al., 2015)

SLIDE 121

Veit et al. (2016) interpret a residual network as an ensemble, which explains in part its stability. E.g., with three blocks we have

x_1 = x_0 + f_1(x_0)
x_2 = x_1 + f_2(x_1)
x_3 = x_2 + f_3(x_2)

hence there are four "paths":

x_3 = x_2 + f_3(x_2)
    = x_1 + f_2(x_1) + f_3(x_1 + f_2(x_1))
    = x_0 + f_1(x_0) + f_2(x_0 + f_1(x_0)) + f_3(x_0 + f_1(x_0) + f_2(x_0 + f_1(x_0))).

Veit et al. show that (1) performance reduction correlates with the number of paths removed from the ensemble, not with the number of blocks removed, and (2) only gradients through shallow paths matter during train.

SLIDES 122-123

An extension of the residual network is the stochastic depth network.

"Stochastic depth aims to shrink the depth of a network during training, while keeping it unchanged during testing. We can achieve this goal by randomly dropping entire ResBlocks during training and bypassing their transformations through skip connections." (Huang et al., 2016)

[Figure: a chain of residual blocks Φ; during training, the output of each block is multiplied by a Bernoulli variable B(p_1), B(p_2), B(p_3), ... before being added to its identity pass-through.]
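A minimal sketch of a residual block with stochastic depth (the wrapper class is mine; dropping the block output during training and scaling it by p at test time follow Huang et al.'s description):

import torch
from torch import nn

class StochasticDepthBlock(nn.Module):
    def __init__(self, block, p=0.8):
        super().__init__()
        self.block, self.p = block, p

    def forward(self, x):
        if self.training:
            # keep the block with probability p, otherwise only the skip connection
            keep = (torch.rand(1, device=x.device) < self.p).float()
            return x + keep * self.block(x)
        return x + self.p * self.block(x)  # expected contribution at test time

block = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 16))
layer = StochasticDepthBlock(block, p=0.5)
print(layer(torch.randn(4, 16)).shape)  # torch.Size([4, 16])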

SLIDE 124

The current state of the art on CIFAR10 and CIFAR100 (respectively 2.86% and 15.85% as of 22.08.2017) was obtained with a quite standard residual network using the "Shake-shake regularization".

Figure 1: Left: Forward training pass. Center: Backward training pass. Right: At test time.

(Gastaldi, 2017)

SLIDE 125

Smart initialization

SLIDES 126-127

We saw that proper initialization is key, and taking into account the structure of the network helps normalize the weights adequately.

To go one step further, some techniques initialize the weights explicitly so that the empirical moments of the activations are as desired. As such, they take into account the statistics of the network activations induced by the statistics of the data.

SLIDE 128

An example of such a class of techniques is the Layer-Sequential Unit-Variance (LSUV) initialization (Mishkin and Matas, 2015). It consists of:

  1. initializing the weights of all layers with orthonormal matrices,
  2. re-scaling layers one after another in a forward direction, so that the empirical activation variance is 1.0.
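A minimal single-pass sketch of this procedure for an MLP (assuming PyTorch; the real method iterates until the variance is within a tolerance, which is omitted here):

import torch
from torch import nn

def lsuv_init(model, x):
    for m in model:
        if isinstance(m, nn.Linear):
            # orthonormal weights, then rescale so the output variance is ~1
            nn.init.orthogonal_(m.weight)
            nn.init.zeros_(m.bias)
            with torch.no_grad():
                m.weight.data /= m(x).std() + 1e-8
        with torch.no_grad():
            x = m(x)  # propagate the data to initialize the next layer
    return model

model = nn.Sequential(nn.Linear(50, 100), nn.ReLU(),
                      nn.Linear(100, 100), nn.ReLU(),
                      nn.Linear(100, 100))
lsuv_init(model, torch.randn(256, 50))
print(model(torch.randn(256, 50)).std())  # close to 1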

SLIDES 129-131

Balduzzi et al. (2017) point out that depth "shatters" the relation between the input and the gradient wrt the input, and that Resnets mitigate this effect.

[Figure: gradient wrt the input, as a function of the input, for (a) a 1-layer feedforward network, (b) a 24-layer feedforward network and (c) a 50-layer resnet, compared to (d) brown noise and (e) white noise.]

(Balduzzi et al., 2017)

Since linear networks avoid this problem, they suggest to combine CReLU with a "Looks Linear" initialization that makes the network linear initially.

slide-132
SLIDE 132

Let σ(x) = max(0, x), and Φ : R^D → R^{2D} be the CReLU non-linearity, i.e. ∀x ∈ R^D, q = 1, …, D,

Φ(x)_{2q−1} = σ(x_q),   Φ(x)_{2q} = σ(−x_q).

Fran¸ cois Fleuret EE-559 – Deep learning / 6. Going deeper 65 / 83

slide-133
SLIDE 133

Let σ(x) = max(0, x), and Φ : R^D → R^{2D} be the CReLU non-linearity, i.e. ∀x ∈ R^D, q = 1, …, D,

Φ(x)_{2q−1} = σ(x_q),   Φ(x)_{2q} = σ(−x_q),

and let W̃ ∈ R^{D′×2D} be a weight matrix such that ∀j = 1, …, D′, q = 1, …, D,

W̃_{j,2q−1} = −W̃_{j,2q} = W_{j,q}.

Fran¸ cois Fleuret EE-559 – Deep learning / 6. Going deeper 65 / 83

slide-134
SLIDE 134

Let σ(x) = max(0, x), and Φ : R^D → R^{2D} be the CReLU non-linearity, i.e. ∀x ∈ R^D, q = 1, …, D,

Φ(x)_{2q−1} = σ(x_q),   Φ(x)_{2q} = σ(−x_q),

and let W̃ ∈ R^{D′×2D} be a weight matrix such that ∀j = 1, …, D′, q = 1, …, D,

W̃_{j,2q−1} = −W̃_{j,2q} = W_{j,q}.

So two neighboring entries of Φ(x) are the σ(·) and σ(−·) of an entry of x, and two neighboring columns of W̃ are a column of W and its opposite.

Fran¸ cois Fleuret EE-559 – Deep learning / 6. Going deeper 65 / 83

slide-135
SLIDE 135

From this we get, ∀j = 1, …, D′:

(W̃ Φ(x))_j = Σ_{k=1}^{2D} W̃_{j,k} Φ(x)_k
            = Σ_{q=1}^{D} ( W̃_{j,2q−1} Φ(x)_{2q−1} + W̃_{j,2q} Φ(x)_{2q} )
            = Σ_{q=1}^{D} ( W_{j,q} σ(x_q) − W_{j,q} σ(−x_q) )
            = Σ_{q=1}^{D} W_{j,q} x_q
            = (W x)_j .

Fran¸ cois Fleuret EE-559 – Deep learning / 6. Going deeper 66 / 83

slide-136
SLIDE 136

From this we get, ∀j = 1, …, D′:

(W̃ Φ(x))_j = Σ_{k=1}^{2D} W̃_{j,k} Φ(x)_k
            = Σ_{q=1}^{D} ( W̃_{j,2q−1} Φ(x)_{2q−1} + W̃_{j,2q} Φ(x)_{2q} )
            = Σ_{q=1}^{D} ( W_{j,q} σ(x_q) − W_{j,q} σ(−x_q) )
            = Σ_{q=1}^{D} W_{j,q} x_q
            = (W x)_j .

Hence ∀x, W̃ Φ(x) = W x, and doing this in every layer results in a linear network.
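A minimal sketch, not from the lecture, that builds Φ and W̃ as defined above and checks numerically that W̃ Φ(x) = W x; the helper names crelu and looks_linear are hypothetical.

import torch

def crelu(x):
    # Phi : R^D -> R^2D, interleaving sigma(x_q) and sigma(-x_q).
    return torch.stack((x.clamp(min=0), (-x).clamp(min=0)), dim=1).view(-1)

def looks_linear(W):
    # W~ in R^{D' x 2D}: columns 2q-1 and 2q are W[:, q] and -W[:, q].
    return torch.stack((W, -W), dim=2).view(W.size(0), -1)

D, D_prime = 5, 3
W = torch.FloatTensor(D_prime, D).normal_()
x = torch.FloatTensor(D).normal_()

# W~ Phi(x) = W x, so a layer initialized this way computes a linear map.
print(torch.dist(torch.mv(looks_linear(W), crelu(x)), torch.mv(W, x)))  # ~ 0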

Fran¸ cois Fleuret EE-559 – Deep learning / 6. Going deeper 66 / 83

slide-137
SLIDE 137

[Plot: CIFAR-10 test accuracy as a function of depth (6 to 198 layers) for CReLU w/ LL, Resnet, CReLU w/o LL, ReLU, and Linear networks.]

Figure 6: CIFAR-10 test accuracy. Comparison of test accuracy between networks of different depths with and without LL initialization.

(Balduzzi et al., 2017)

Fran¸ cois Fleuret EE-559 – Deep learning / 6. Going deeper 67 / 83

slide-138
SLIDE 138

We can summarize the techniques which have enabled the training of very deep architectures (a minimal residual block combining several of them is sketched after the list):

  • rectifiers to prevent the gradient from vanishing during the backward pass,
  • dropout to force a distributed representation,
  • batch normalization to dynamically maintain the statistics of activations,
  • identity pass-through to keep a structured gradient and distribute the representation,
  • smart initialization to put the gradient in a good regime.
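As an illustration, here is a minimal sketch, not from the lecture, of a residual block that combines three of these ingredients: rectifiers, batch normalization, and an identity pass-through. The class name and sizes are arbitrary.

from torch import nn
from torch.nn import functional as F

class BasicResBlock(nn.Module):
    def __init__(self, nb_channels):
        super(BasicResBlock, self).__init__()
        self.conv1 = nn.Conv2d(nb_channels, nb_channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(nb_channels)
        self.conv2 = nn.Conv2d(nb_channels, nb_channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(nb_channels)

    def forward(self, x):
        y = F.relu(self.bn1(self.conv1(x)))   # rectifier + batch normalization
        y = self.bn2(self.conv2(y))
        return F.relu(y + x)                  # identity pass-through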

Fran¸ cois Fleuret EE-559 – Deep learning / 6. Going deeper 68 / 83

slide-139
SLIDE 139

Using GPUs

Fran¸ cois Fleuret EE-559 – Deep learning / 6. Going deeper 69 / 83

slide-140
SLIDE 140

The size of current state-of-the-art networks makes computation a critical issue, in particular for training and optimizing meta-parameters.

Fran¸ cois Fleuret EE-559 – Deep learning / 6. Going deeper 70 / 83

slide-141
SLIDE 141

The size of current state-of-the-art networks makes computation a critical issue, in particular for training and optimizing meta-parameters.

Although GPUs were historically developed for mass-market real-time CGI, their massively parallel architecture is extremely well suited to signal processing and high-dimensional linear algebra. Their use has been instrumental in the success of deep learning.

Fran¸ cois Fleuret EE-559 – Deep learning / 6. Going deeper 70 / 83

slide-142
SLIDE 142

[Diagram: CPU cores and CPU RAM.]

Fran¸ cois Fleuret EE-559 – Deep learning / 6. Going deeper 71 / 83

slide-144
SLIDE 144

[Diagram: CPU cores, CPU RAM, disk and network.]

Fran¸ cois Fleuret EE-559 – Deep learning / 6. Going deeper 71 / 83

slide-145
SLIDE 145

[Diagram: CPU cores, CPU RAM, disk and network, GPU1 cores and GPU1 RAM.]

A standard NVIDIA GTX 1080 has 2,560 single-precision computing cores clocked at 1.6GHz, and delivers a peak performance of ≃ 9 TFlops.
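A rough sanity check of this figure, assuming one fused multiply-add (counted as two flops) per core and per cycle:

2,560 cores × 2 flops/cycle × 1.6 · 10⁹ cycles/s ≈ 8.2 · 10¹² flops/s,

which is consistent with the quoted ≃ 9 TFlops once the somewhat higher boost clock is taken into account.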

Fran¸ cois Fleuret EE-559 – Deep learning / 6. Going deeper 71 / 83

slide-148
SLIDE 148

[Diagram: CPU cores, CPU RAM, disk and network, GPU1 and GPU2 cores and RAM.]

A standard NVIDIA GTX 1080 has 2,560 single-precision computing cores clocked at 1.6GHz, and delivers a peak performance of ≃ 9 TFlops.

Fran¸ cois Fleuret EE-559 – Deep learning / 6. Going deeper 71 / 83

slide-149
SLIDE 149

[Diagram: CPU cores, CPU RAM, disk and network, GPU1 and GPU2 cores and RAM.]

A standard NVIDIA GTX 1080 has 2,560 single-precision computing cores clocked at 1.6GHz, and delivers a peak performance of ≃ 9 TFlops.

The precise structure of a GPU’s memory and how its cores communicate with it is a complicated topic that we will not cover here.

Fran¸ cois Fleuret EE-559 – Deep learning / 6. Going deeper 71 / 83

slide-150
SLIDE 150

TABLE 7. COMPARATIVE EXPERIMENT RESULTS (TIME PER MINI-BATCH IN SECONDS)

[Table 7 from Shi et al. (2016): time per mini-batch, in seconds, for Caffe, CNTK, TensorFlow, MXNet and Torch, running FCN-S, AlexNet-S, ResNet-50, FCN-R, AlexNet-R, ResNet-56 and an LSTM, on a desktop CPU (1–8 threads), a server CPU (1–32 threads), and single GPUs (GTX 980, GTX 1080 and Tesla K80). The GPU timings are roughly one to two orders of magnitude smaller than the CPU timings.]

Note: the mini-batch sizes for FCN-S, AlexNet-S, ResNet-50, FCN-R, AlexNet-R, ResNet-56 and LSTM are 64, 16, 16, 1024, 1024, 128 and 128 respectively.

(Shi et al., 2016)

Fran¸ cois Fleuret EE-559 – Deep learning / 6. Going deeper 72 / 83

slide-151
SLIDE 151

The current standard to program a GPU is through the CUDA (“Compute Unified Device Architecture”) model, defined by NVIDIA.

Fran¸ cois Fleuret EE-559 – Deep learning / 6. Going deeper 73 / 83

slide-152
SLIDE 152

The current standard to program a GPU is through the CUDA (“Compute Unified Device Architecture”) model, defined by NVIDIA. Alternatives are OpenCL, backed by many CPU/GPU manufacturers, and more recently AMD’s HIP (“Heterogeneous-compute Interface for Portability”).

Fran¸ cois Fleuret EE-559 – Deep learning / 6. Going deeper 73 / 83

slide-153
SLIDE 153

The current standard to program a GPU is through the CUDA (“Compute Unified Device Architecture”) model, defined by NVIDIA. Alternatives are OpenCL, backed by many CPU/GPU manufacturers, and more recently AMD’s HIP (“Heterogeneous-compute Interface for Portability”). Google developed its own processor for deep learning dubbed TPU (“Tensor Processing Unit”) for in-house use. It is targeted at TensorFlow and offers excellent flops/watt performance.

Fran¸ cois Fleuret EE-559 – Deep learning / 6. Going deeper 73 / 83

slide-154
SLIDE 154

The current standard to program a GPU is through the CUDA (“Compute Unified Device Architecture”) model, defined by NVIDIA. Alternatives are OpenCL, backed by many CPU/GPU manufacturers, and more recently AMD’s HIP (“Heterogeneous-compute Interface for Portability”). Google developed its own processor for deep learning dubbed TPU (“Tensor Processing Unit”) for in-house use. It is targeted at TensorFlow and offers excellent flops/watt performance. In practice, as of today (16.08.2017), NVIDIA hardware remains the default choice for deep learning, and CUDA is the reference framework in use.

Fran¸ cois Fleuret EE-559 – Deep learning / 6. Going deeper 73 / 83

slide-155
SLIDE 155

From a practical perspective, libraries interface the framework (e.g. PyTorch) with the “computational backend” (e.g. CPU or GPU).

Fran¸ cois Fleuret EE-559 – Deep learning / 6. Going deeper 74 / 83

slide-156
SLIDE 156

From a practical perspective, libraries interface the framework (e.g. PyTorch) with the “computational backend” (e.g. CPU or GPU); a small illustration of this dispatch follows the list:

  • BLAS (“Basic Linear Algebra Subprograms”): vector/matrix products, and the cuBLAS implementation for NVIDIA GPUs,
  • LAPACK (“Linear Algebra Package”): linear system solving, eigen-decomposition, etc.,
  • cuDNN (“NVIDIA CUDA Deep Neural Network library”): computations specific to deep learning on NVIDIA GPUs.
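For instance, the following sketch, not from the lecture, issues the same matrix product on the CPU, where it is dispatched to a BLAS implementation, and on the GPU, where it is dispatched to cuBLAS, with no change to the calling code.

import torch

a = torch.FloatTensor(256, 256).normal_()
b = torch.FloatTensor(256, 256).normal_()

c_cpu = torch.mm(a, b)                       # executed by the CPU BLAS

if torch.cuda.is_available():
    c_gpu = torch.mm(a.cuda(), b.cuda())     # executed by cuBLAS on the GPU
    print(torch.dist(c_cpu, c_gpu.cpu()))    # small numerical difference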

Fran¸ cois Fleuret EE-559 – Deep learning / 6. Going deeper 74 / 83

slide-157
SLIDE 157

The use of GPUs in PyTorch is done through the appropriate tensor types. Tensors of the torch.cuda types reside in the GPU memory. Operations on them are performed by the GPU, and the resulting tensors are stored in its memory.

Fran¸ cois Fleuret EE-559 – Deep learning / 6. Going deeper 75 / 83

slide-158
SLIDE 158

The use of GPUs in PyTorch is done through the appropriate tensor types. Tensors of the torch.cuda types reside in the GPU memory. Operations on them are performed by the GPU, and the resulting tensors are stored in its memory.

Data type              CPU tensor            GPU tensor
8-bit int (unsigned)   torch.ByteTensor      torch.cuda.ByteTensor
64-bit int (signed)    torch.LongTensor      torch.cuda.LongTensor
16-bit float           torch.HalfTensor      torch.cuda.HalfTensor
32-bit float           torch.FloatTensor     torch.cuda.FloatTensor
64-bit float           torch.DoubleTensor    torch.cuda.DoubleTensor
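A minimal sketch, not from the lecture, showing how a tensor of one of the types above can be converted to the others; the variable names are arbitrary.

import torch

x = torch.FloatTensor(2, 3).normal_()
print(x.type())             # torch.FloatTensor
print(x.double().type())    # torch.DoubleTensor
print(x.long().type())      # torch.LongTensor

if torch.cuda.is_available():
    print(x.cuda().type())  # torch.cuda.FloatTensor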

Fran¸ cois Fleuret EE-559 – Deep learning / 6. Going deeper 75 / 83

slide-159
SLIDE 159

Apart from copy_(), operations cannot mix different tensor types (CPU vs. GPU, or different numerical types):

>>> x = torch.FloatTensor(3, 5).normal_()
>>> y = torch.cuda.FloatTensor(3, 5).normal_()
>>> x.copy_(y)

-0.6817 -0.1927 -0.9117 -0.9456 -0.1488
-0.2441  0.5881  0.3959  0.7421 -0.5713
 0.8148 -0.7252  0.3839 -0.9684 -0.3364
[torch.FloatTensor of size 3x5]

Fran¸ cois Fleuret EE-559 – Deep learning / 6. Going deeper 76 / 83

slide-160
SLIDE 160

Apart from copy_(), operations cannot mix different tensor types (CPU vs. GPU, or different numerical types):

>>> x = torch.FloatTensor(3, 5).normal_()
>>> y = torch.cuda.FloatTensor(3, 5).normal_()
>>> x.copy_(y)

-0.6817 -0.1927 -0.9117 -0.9456 -0.1488
-0.2441  0.5881  0.3959  0.7421 -0.5713
 0.8148 -0.7252  0.3839 -0.9684 -0.3364
[torch.FloatTensor of size 3x5]

>>> x + y
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/fleuret/misc/anaconda3/lib/python3.5/site-packages/torch/tensor.py", line 293, in __add__
    return self.add(other)
TypeError: add received an invalid combination of arguments - got (torch.cuda.FloatTensor), but expected one of:
 * (float value)
      didn't match because some of the arguments have invalid types: (torch.cuda.FloatTensor)
 * (torch.FloatTensor other)
      didn't match because some of the arguments have invalid types: (torch.cuda.FloatTensor)
 * (torch.SparseFloatTensor other)
      didn't match because some of the arguments have invalid types: (torch.cuda.FloatTensor)
 * (float value, torch.FloatTensor other)
 * (float value, torch.SparseFloatTensor other)

Fran¸ cois Fleuret EE-559 – Deep learning / 6. Going deeper 76 / 83

slide-161
SLIDE 161

Operations maintain the type of the tensors, so you generally do not need to worry about making your code generic regarding the tensor types. However, if you have to explicitly create a new tensor, the best is to use an existing tensor’s new() method.

>>> def the_same_full_of_zeros_please(x):
...     return x.new(x.size()).zero_()
...
>>> u = torch.FloatTensor(3, 5).normal_()
>>> the_same_full_of_zeros_please(u)
[torch.FloatTensor of size 3x5]
>>> v = torch.cuda.DoubleTensor(5, 2).fill_(1.0)
>>> the_same_full_of_zeros_please(v)
[torch.cuda.DoubleTensor of size 5x2 (GPU 0)]

Fran¸ cois Fleuret EE-559 – Deep learning / 6. Going deeper 77 / 83

slide-162
SLIDE 162

Operations maintain the type of the tensors, so you generally do not need to worry about making your code generic regarding the tensor types. However, if you have to explicitly create a new tensor, the best is to use an existing tensor’s new() method.

>>> def the_same_full_of_zeros_please(x):
...     return x.new(x.size()).zero_()
...
>>> u = torch.FloatTensor(3, 5).normal_()
>>> the_same_full_of_zeros_please(u)
[torch.FloatTensor of size 3x5]
>>> v = torch.cuda.DoubleTensor(5, 2).fill_(1.0)
>>> the_same_full_of_zeros_please(v)
[torch.cuda.DoubleTensor of size 5x2 (GPU 0)]

Moving data between the CPU and the GPU memories is far slower than moving it inside the GPU memory.

Fran¸ cois Fleuret EE-559 – Deep learning / 6. Going deeper 77 / 83

slide-163
SLIDE 163

The function torch.cuda.is_available() returns a Boolean value indicating whether a GPU is available.

Fran¸ cois Fleuret EE-559 – Deep learning / 6. Going deeper 78 / 83

slide-164
SLIDE 164

The function torch.cuda.is_available() returns a Boolean value indicating whether a GPU is available.

The tensors’ method cuda() returns a clone on the GPU if the tensor is not already there, or returns the tensor itself if it is already there, keeping the same bit precision. Conversely, the method cpu() makes a clone on the CPU if needed.

They both keep the original tensor unchanged.
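A minimal sketch, not from the lecture, illustrating this behaviour; it assumes a GPU is available and the variable names are arbitrary.

import torch

x = torch.FloatTensor(2, 3).normal_()
y = x.cuda()       # clone on the GPU; x itself is unchanged and stays on the CPU
z = y.cuda()       # y is already on the GPU, so this returns y itself
w = y.cpu()        # clone back on the CPU

print(x.is_cuda, y.is_cuda, z is y, w.is_cuda)   # False True True False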

Fran¸ cois Fleuret EE-559 – Deep learning / 6. Going deeper 78 / 83

slide-165
SLIDE 165

The method torch.nn.Module.cuda() moves all the parameters and buffers of the module (and of its registered sub-modules, recursively) to the GPU, and conversely, torch.nn.Module.cpu() moves them to the CPU.

Although they do not have a “_” in their names, these Module operations make changes in place.

Fran¸ cois Fleuret EE-559 – Deep learning / 6. Going deeper 79 / 83

slide-166
SLIDE 166

The method torch.nn.Module.cuda() moves all the parameters and buffers of the module (and of its registered sub-modules, recursively) to the GPU, and conversely, torch.nn.Module.cpu() moves them to the CPU.

Although they do not have a “_” in their names, these Module operations make changes in place.

A typical snippet of code to use the GPU would be

if torch.cuda.is_available():
    model.cuda()
    criterion.cuda()
    train_input, train_target = train_input.cuda(), train_target.cuda()
    test_input, test_target = test_input.cuda(), test_target.cuda()

Fran¸ cois Fleuret EE-559 – Deep learning / 6. Going deeper 79 / 83

slide-167
SLIDE 167
If multiple GPUs are available, cross-GPU operations are not allowed by default, with the exception of copy_().

An operation between tensors located on the same GPU produces a result on that same GPU.

Fran¸ cois Fleuret EE-559 – Deep learning / 6. Going deeper 80 / 83

slide-168
SLIDE 168
If multiple GPUs are available, cross-GPU operations are not allowed by default, with the exception of copy_().

An operation between tensors located on the same GPU produces a result on that same GPU.

Each GPU has a numerical id, and torch.cuda.set_device(id) allows one to specify where GPU tensors should be moved by cuda(). An explicit id can also be provided to the latter.

torch.cuda.device_of(obj) selects the device to be that of the specified tensor or storage.
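A minimal sketch, not from the lecture, assuming at least two GPUs are available; the variable names are arbitrary.

import torch

torch.cuda.set_device(1)
x = torch.FloatTensor(3).normal_().cuda()     # moved to GPU 1, the current device
y = torch.FloatTensor(3).normal_().cuda(0)    # explicit id: moved to GPU 0

with torch.cuda.device_of(y):
    z = torch.cuda.FloatTensor(3).zero_()     # created on GPU 0, the device of y

print(x.get_device(), y.get_device(), z.get_device())   # 1 0 0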

Fran¸ cois Fleuret EE-559 – Deep learning / 6. Going deeper 80 / 83

slide-169
SLIDE 169

A very simple way to leverage multiple GPUs is to use torch.nn.DataParallel(module)

Fran¸ cois Fleuret EE-559 – Deep learning / 6. Going deeper 81 / 83

slide-170
SLIDE 170

A very simple way to leverage multiple GPUs is to use torch.nn.DataParallel(module).

The forward of the resulting module will

  1. split the input mini-batch along the first dimension into as many mini-batches as there are GPUs,
  2. send them to the forwards of clones of module located on each GPU,
  3. concatenate the results.

And it is (of course!) autograd-compliant.

Fran¸ cois Fleuret EE-559 – Deep learning / 6. Going deeper 81 / 83

slide-171
SLIDE 171

For instance, on a machine with two GPUs

class Dummy(nn.Module):
    def __init__(self, m):
        super(Dummy, self).__init__()
        self.m = m

    def forward(self, x):
        print('Dummy.forward', x.size(), torch.cuda.current_device())
        return self.m(x)

Fran¸ cois Fleuret EE-559 – Deep learning / 6. Going deeper 82 / 83

slide-172
SLIDE 172

For instance, on a machine with two GPUs

class Dummy(nn.Module):
    def __init__(self, m):
        super(Dummy, self).__init__()
        self.m = m

    def forward(self, x):
        print('Dummy.forward', x.size(), torch.cuda.current_device())
        return self.m(x)

x = Variable(Tensor(50, 10).normal_())
m = Dummy(nn.Linear(10, 5))

x = x.cuda()
m = m.cuda()

print('Without data_parallel')
y = m(x)
print()

mp = nn.DataParallel(m)
print('With data_parallel')
y = mp(x)

prints

Without data_parallel
Dummy.forward torch.Size([50, 10]) 0

With data_parallel
Dummy.forward torch.Size([25, 10]) 0
Dummy.forward torch.Size([25, 10]) 1

Fran¸ cois Fleuret EE-559 – Deep learning / 6. Going deeper 82 / 83

slide-173
SLIDE 173

We are starting the mini-projects this week: https://fleuret.org/dlc/#mini-projects

Fran¸ cois Fleuret EE-559 – Deep learning / 6. Going deeper 83 / 83

slide-174
SLIDE 174

The end

slide-175
SLIDE 175

References

  • J. L. Ba, J. R. Kiros, and G. E. Hinton. Layer normalization. CoRR, abs/1607.06450, 2016.
  • D. Balduzzi, M. Frean, L. Leary, J. Lewis, K. Wan-Duo Ma, and B. McWilliams. The shattered gradients problem: If resnets are the answer, then what is the question? CoRR, abs/1702.08591, 2017.
  • Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166, Mar. 1994.
  • X. Gastaldi. Shake-shake regularization. CoRR, abs/1705.07485, 2017.
  • X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2010.
  • X. Glorot, A. Bordes, and Y. Bengio. Deep sparse rectifier neural networks. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2011.
  • I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio. Maxout networks. In International Conference on Machine Learning (ICML), 2013.
  • K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.
  • K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. CoRR, abs/1603.05027, 2016.

slide-176
SLIDE 176
  • S. Hochreiter, Y. Bengio, P. Frasconi, and J. Schmidhuber. Gradient Flow in Recurrent Nets: the Difficulty of Learning Long-Term Dependencies, pages 237–243. IEEE Press, 2001.
  • G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Q. Weinberger. Deep networks with stochastic depth. CoRR, abs/1603.09382, 2016.
  • A. Hyvärinen and E. Oja. Independent component analysis: Algorithms and applications. Neural Networks, 13(4-5):411–430, 2000.
  • S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning (ICML), 2015.
  • A. Krizhevsky, I. Sutskever, and G. Hinton. Imagenet classification with deep convolutional neural networks. In Neural Information Processing Systems (NIPS), 2012.
  • Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, 1989.
  • Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • D. Mishkin and J. Matas. All you need is a good init. CoRR, abs/1511.06422, 2015.
  • W. Shang, K. Sohn, D. Almeida, and H. Lee. Understanding and improving convolutional neural networks via concatenated rectified linear units. CoRR, abs/1603.05201, 2016.
  • S. Shi, Q. Wang, P. Xu, and X. Chu. Benchmarking state-of-the-art deep learning software tools. CoRR, abs/1608.07249, 2016.
  • K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
slide-177
SLIDE 177
  • N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research (JMLR), 15:1929–1958, 2014.
  • R. Srivastava, K. Greff, and J. Schmidhuber. Highway networks. CoRR, abs/1505.00387, 2015.
  • C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
  • C. Szegedy, S. Ioffe, and V. Vanhoucke. Inception-v4, inception-resnet and the impact of residual connections on learning. CoRR, abs/1602.07261, 2016.
  • M. Telgarsky. Representation benefits of deep feedforward networks. CoRR, abs/1509.08101, 2015.
  • M. Telgarsky. Benefits of depth in neural networks. CoRR, abs/1602.04485, 2016.
  • J. Tompson, R. Goroshin, A. Jain, Y. LeCun, and C. Bregler. Efficient object localization using convolutional networks. In Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
  • A. Veit, M. Wilber, and S. Belongie. Residual networks behave like ensembles of relatively shallow networks. CoRR, abs/1605.06431, 2016.
  • L. Wan, M. D. Zeiler, S. Zhang, Y. LeCun, and R. Fergus. Regularization of neural network using dropconnect. In International Conference on Machine Learning (ICML), 2013.