

SLIDE 1

Lecture 7 Recap

I2DL: Prof. Niessner, Prof. Leal-Taixé

SLIDE 2

Naïve Losses: L2 vs L1

  • L2 Loss:
    – $L_2 = \sum_{i=1}^{n} \big(y_i - f(x_i)\big)^2$
    – Sum of squared differences (SSD)
    – Prone to outliers
    – Compute-efficient (optimization)
    – Optimum is the mean

  • L1 Loss:
    – $L_1 = \sum_{i=1}^{n} \big|y_i - f(x_i)\big|$
    – Sum of absolute differences
    – Robust
    – Costly to compute
    – Optimum is the median
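To make the "optimum is the mean / median" point concrete, here is a minimal NumPy sketch (the data and helper names are my own, not from the slides): it grid-searches a single constant prediction under each loss.

```python
import numpy as np

def l2_loss(y, pred):
    # Sum of squared differences (SSD)
    return np.sum((y - pred) ** 2)

def l1_loss(y, pred):
    # Sum of absolute differences
    return np.sum(np.abs(y - pred))

y = np.array([1.0, 2.0, 3.0, 100.0])          # one outlier
candidates = np.linspace(0.0, 100.0, 10001)    # candidate constant predictions

best_l2 = candidates[np.argmin([l2_loss(y, c) for c in candidates])]
best_l1 = candidates[np.argmin([l1_loss(y, c) for c in candidates])]

print(best_l2, np.mean(y))    # 26.5 -> the L2 optimum is the mean (dragged by the outlier)
print(best_l1, np.median(y))  # 2.5  -> the L1 optimum is the median (robust)
```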

SLIDE 3

Binary Classification

  • Sigmoid: $\sigma(\mathbf{x}, \mathbf{w}) = \frac{1}{1 + e^{-\sum_i w_i x_i}}$
  • Can be interpreted as a probability: $p(y = 1 \mid \mathbf{x}, \mathbf{w}) = \sigma(s) = \frac{1}{1 + e^{-s}}$

[Diagram: inputs $x_1, x_2, x_3$ weighted by $w_1, w_2, w_3$ and summed into the score $s$, which is passed through the sigmoid.]

SLIDE 4

Softmax Formulation

  • What if we have multiple classes?

Softmax:

$p(y = 1 \mid \mathbf{x}, \mathbf{w}) = \frac{e^{s_1}}{e^{s_1} + e^{s_2} + e^{s_3}}, \qquad
p(y = 2 \mid \mathbf{x}, \mathbf{w}) = \frac{e^{s_2}}{e^{s_1} + e^{s_2} + e^{s_3}}, \qquad
p(y = 3 \mid \mathbf{x}, \mathbf{w}) = \frac{e^{s_3}}{e^{s_1} + e^{s_2} + e^{s_3}}$

[Diagram: inputs $x_1, x_2, x_3$ feed three summation units producing the scores $s_1, s_2, s_3$ for each class; the softmax turns these scores into probabilities for each class.]
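A minimal sketch of the softmax mapping from raw class scores to probabilities (the example scores are the Model 1 scores used on the next slide; the max-subtraction is a standard numerical-stability trick, not something stated on the slide):

```python
import numpy as np

def softmax(s):
    # Subtracting the max does not change the result but avoids overflow in exp
    e = np.exp(s - np.max(s))
    return e / e.sum()

scores = np.array([5.0, -3.0, 2.0])   # raw scores s_1, s_2, s_3 for three classes
probs = softmax(scores)
print(probs)        # ~[0.952, 0.0003, 0.047] -> probabilities for each class
print(probs.sum())  # 1.0
```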

SLIDE 5

Example: Hinge vs Cross-Entropy

  • Given the following scores for $\mathbf{x}_i$ (true class $y_i = 0$):
    – Model 1: $s = [5, -3, 2]$
    – Model 2: $s = [5, 10, 10]$
    – Model 3: $s = [5, -20, -20]$

  • Hinge loss: $L_i = \sum_{j \neq y_i} \max(0,\; s_j - s_{y_i} + 1)$
    – Model 1: $\max(0, -3 - 5 + 1) + \max(0, 2 - 5 + 1) = 0$
    – Model 2: $\max(0, 10 - 5 + 1) + \max(0, 10 - 5 + 1) = 12$
    – Model 3: $\max(0, -20 - 5 + 1) + \max(0, -20 - 5 + 1) = 0$

  • Cross-entropy loss: $L_i = -\log\!\left(\frac{e^{s_{y_i}}}{\sum_j e^{s_j}}\right)$
    – Model 1: $-\ln\frac{e^{5}}{e^{5} + e^{-3} + e^{2}} = 0.05$
    – Model 2: $-\ln\frac{e^{5}}{e^{5} + e^{10} + e^{10}} = 5.70$
    – Model 3: $-\ln\frac{e^{5}}{e^{5} + e^{-20} + e^{-20}} \approx 2 \cdot 10^{-11}$

  • Cross-entropy *always* wants to improve (the loss is never exactly 0); the hinge loss saturates.
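The three comparisons above can be reproduced with a few lines of NumPy (helper names are mine):

```python
import numpy as np

def hinge_loss(scores, true_class, margin=1.0):
    # Multiclass hinge (SVM) loss: sum over wrong classes of max(0, s_j - s_true + margin)
    losses = np.maximum(0.0, scores - scores[true_class] + margin)
    losses[true_class] = 0.0
    return losses.sum()

def cross_entropy_loss(scores, true_class):
    # Negative log of the softmax probability of the true class
    e = np.exp(scores - np.max(scores))
    return -np.log(e[true_class] / e.sum())

models = {
    "Model 1": np.array([5.0, -3.0, 2.0]),
    "Model 2": np.array([5.0, 10.0, 10.0]),
    "Model 3": np.array([5.0, -20.0, -20.0]),
}
for name, s in models.items():
    print(name, hinge_loss(s, 0), cross_entropy_loss(s, 0))
# Model 1: hinge 0.0,  CE ~0.05
# Model 2: hinge 12.0, CE ~5.70
# Model 3: hinge 0.0,  CE ~2.8e-11 (essentially zero, but never exactly zero)
```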

SLIDE 6

Sigmoid Activation

Saturated neurons kill the gradient flow.

$\sigma(s) = \frac{1}{1 + e^{-s}}$

Forward: $x \rightarrow s \rightarrow \sigma(s)$. Backward (chain rule): $\frac{\partial L}{\partial x} = \frac{\partial s}{\partial x}\,\frac{\partial L}{\partial s}$ with $\frac{\partial L}{\partial s} = \frac{\partial \sigma}{\partial s}\,\frac{\partial L}{\partial \sigma}$; when $\sigma$ saturates, $\frac{\partial \sigma}{\partial s} \approx 0$ and the upstream gradient is killed.
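To make the saturation concrete, a quick sketch (example values are my own, not from the slide) of the local gradient $\sigma'(s) = \sigma(s)\,(1 - \sigma(s))$ at a few pre-activations:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def sigmoid_grad(s):
    # d(sigma)/ds = sigma(s) * (1 - sigma(s)); the local gradient used in backprop
    sig = sigmoid(s)
    return sig * (1.0 - sig)

for s in [0.0, 2.0, 5.0, 10.0]:
    print(s, sigmoid_grad(s))
# 0.0  -> 0.25       (largest possible local gradient)
# 2.0  -> ~0.105
# 5.0  -> ~0.0066
# 10.0 -> ~4.5e-05   (saturated: almost no gradient flows back)
```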

SLIDE 7

TanH Activation

  • Zero-centered
  • Still saturates

[LeCun et al. 1991] Improving Generalization Performance in Character Recognition

SLIDE 8

Rectified Linear Units (ReLU)

  • Large and consistent gradients
  • Does not saturate
  • Fast convergence
  • What happens if a ReLU outputs zero? Dead ReLU

[Krizhevsky et al. NeurIPS 2012] ImageNet Classification with Deep Convolutional Neural Networks

SLIDE 9

Quick Guide

  • Sigmoid is not really used.
  • ReLU is the standard choice.
  • Second choice: variants of ReLU or Maxout.
  • Recurrent nets will require TanH or similar.


SLIDE 10

Initialization is Extremely Important

$x^* = \arg\min_x f(x)$

[Figure: a non-convex loss surface with the optimum marked; depending on the initialization, gradient descent is not guaranteed to reach the optimum.]

SLIDE 11

Xavier Initialization

  • How to ensure the variance of the output is the same as the input?

$n \cdot \mathrm{Var}(w)\,\mathrm{Var}(x) = 1 \;\Rightarrow\; \mathrm{Var}(w) = \frac{1}{n}$

SLIDE 12

ReLU Kills Half of the Data

It makes a huge difference!

[He et al., ICCV’15] He Initialization

$\mathrm{Var}(w) = \frac{2}{n}$
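The two rules side by side in a quick NumPy sketch (layer sizes, the Gaussian form of the initializer, and the 10-layer test are my own choices): pushing unit-variance data through stacked ReLU layers shows why the extra factor of 2 matters.

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_init(n_in, n_out):
    # Var(w) = 1/n_in: output variance matches input variance for a linear layer
    return rng.normal(0.0, np.sqrt(1.0 / n_in), size=(n_in, n_out))

def he_init(n_in, n_out):
    # Var(w) = 2/n_in: the extra factor 2 compensates for ReLU zeroing half the activations
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))

# Push unit-variance data through a stack of ReLU layers and watch the signal scale
x = rng.normal(size=(1000, 512))
h_xavier, h_he = x, x
for _ in range(10):
    h_xavier = np.maximum(0.0, h_xavier @ xavier_init(512, 512))
    h_he = np.maximum(0.0, h_he @ he_init(512, 512))

print(np.std(h_xavier))  # shrinks towards zero layer by layer (signal dies out)
print(np.std(h_he))      # stays roughly at the input scale
```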

SLIDE 13

Lecture 8

SLIDE 14

Data Augmentation

SLIDE 15

Data Augmentation

  • A classifier has to be invariant to a wide variety of transformations

SLIDE 16

[Example images: variations in pose, appearance, and illumination.]

SLIDE 17

Data Augmentation

  • A classifier has to be invariant to a wide variety of transformations
  • Helping the classifier: synthesize data simulating plausible transformations

SLIDE 18

Data Augmentation

[Krizhevsky et al., NIPS’12] ImageNet

SLIDE 19

Data Augmentation: Brightness

  • Random brightness and contrast changes

[Krizhevsky et al., NIPS’12] ImageNet

SLIDE 20

Data Augmentation: Random Crops

  • Training: random crops
    – Pick a random L in [256, 480]
    – Resize training image, short side L
    – Randomly sample crops of 224×224
  • Testing: fixed set of crops
    – Resize image at N scales
    – 10 fixed crops of 224×224: (4 corners + 1 center) × 2 flips


[Krizhevsky et al., NIPS’12] ImageNet
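A sketch of the training-time procedure above in plain NumPy (function name, the nearest-neighbour resize, and the added random flip are my own simplifications of a real pipeline):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_resized_crop(image, crop=224, scale_range=(256, 480)):
    """Resize so the short side is a random L in [256, 480], then take a random
    224x224 crop (nearest-neighbour resize for brevity)."""
    h, w = image.shape[:2]
    L = rng.integers(scale_range[0], scale_range[1] + 1)
    s = L / min(h, w)
    new_h, new_w = int(round(h * s)), int(round(w * s))
    # Nearest-neighbour resize via index sampling (a real pipeline would interpolate)
    rows = (np.arange(new_h) / s).astype(int).clip(0, h - 1)
    cols = (np.arange(new_w) / s).astype(int).clip(0, w - 1)
    resized = image[rows][:, cols]
    # Random crop location
    top = rng.integers(0, new_h - crop + 1)
    left = rng.integers(0, new_w - crop + 1)
    out = resized[top:top + crop, left:left + crop]
    # Random horizontal flip (the slide uses flips in the test-time crop set)
    return out[:, ::-1] if rng.random() < 0.5 else out

img = rng.integers(0, 256, size=(300, 500, 3), dtype=np.uint8)
print(random_resized_crop(img).shape)  # (224, 224, 3)
```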

SLIDE 21

Data Augmentation

  • When comparing two networks, make sure to use the same data augmentation!
  • Consider data augmentation a part of your network design

SLIDE 22

Advanced Regularization

SLIDE 23

Weight Decay

  • L2 regularization
  • Penalizes large weights
  • Improves generalization


$\theta_{k+1} = \theta_k - \epsilon \, \nabla_\theta L(\theta_k, x, y) - \lambda \, \theta_k$

(with $\epsilon$ the learning rate, $\nabla_\theta L$ the gradient of the loss, and $\lambda \theta_k$ the gradient of the L2 regularization term)
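As a sketch, the update written out in code (toy numbers; whether the decay term is additionally multiplied by the learning rate varies between implementations, here I follow the slide's formula):

```python
import numpy as np

def sgd_weight_decay_step(theta, grad, lr=0.1, weight_decay=0.01):
    # theta_{k+1} = theta_k - lr * grad_L(theta_k, x, y) - weight_decay * theta_k
    # The last term is the gradient of the L2 penalty: it shrinks every weight towards zero.
    return theta - lr * grad - weight_decay * theta

theta = np.array([2.0, -3.0, 0.5])
grad = np.array([0.1, -0.2, 0.0])
print(sgd_weight_decay_step(theta, grad))   # [1.97, -2.95, 0.495]
```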

SLIDE 24

Early Stopping

[Plot: training and validation error over epochs; the region where the validation error rises again is marked "Overfitting".]

SLIDE 25

Early Stopping

  • Easy form of regularization

[Diagram: the sequence of parameter iterates $\theta_1, \theta_2, \theta_3, \dots$ approaching $\theta^*$; the later part of the trajectory is marked as overfitting.]
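A minimal patience-based loop (the patience mechanism and all names here are my additions; the slides only state the idea of stopping before overfitting):

```python
import numpy as np

def train_with_early_stopping(train_step, evaluate, max_epochs=100, patience=5):
    """Stop when the validation error has not improved for `patience` epochs
    and return the parameters from the best epoch."""
    best_val, best_params, epochs_without_improvement = np.inf, None, 0
    for epoch in range(max_epochs):
        params = train_step(epoch)        # one epoch of training, returns current parameters
        val_error = evaluate(params)      # error on the held-out validation set
        if val_error < best_val:
            best_val, best_params = val_error, params
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                     # validation error stopped improving: stop training
    return best_params

# Toy usage: validation error falls, then rises again (overfitting) after epoch 10
errors = [1.0 / (e + 1) + max(0, e - 10) * 0.05 for e in range(100)]
best = train_with_early_stopping(train_step=lambda e: e, evaluate=lambda e: errors[e])
print(best)  # 10 -> the parameters from just before overfitting set in
```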

SLIDE 26

Bagging and Ensemble Methods

  • Train multiple models and average their results
  • E.g., use a different algorithm for optimization or change the objective function / loss function.
  • If errors are uncorrelated, the expected combined error will decrease linearly with the ensemble size

SLIDE 27

Bagging and Ensemble Methods

  • Bagging: uses k different datasets

[Diagram: Training Set 1, Training Set 2, Training Set 3.]

Image Source: [Srivastava et al., JMLR’14] Dropout

SLIDE 28

Dropout

SLIDE 29

Dropout

  • Disable a random set of neurons (typically 50%)

[Srivastava et al., JMLR’14] Dropout


SLIDE 30

Dropout: Intuition

  • Using half the network = half capacity

[Diagram: a score built from features such as "furry", "has two eyes", "has a tail", "has paws", "has two ears"; label: redundant representations.]

[Srivastava et al., JMLR’14] Dropout

SLIDE 31

Dropout: Intuition

  • Using half the network = half capacity
    – Redundant representations
    – Base your scores on more features
  • Consider it as a model ensemble

[Srivastava et al., JMLR’14] Dropout

SLIDE 32

Dropout: Intuition

  • Two models in one

[Diagram: the same network with two different dropout masks, labeled Model 1 and Model 2.]

[Srivastava et al., JMLR’14] Dropout

SLIDE 33

Dropout: Intuition

  • Using half the network = half capacity
    – Redundant representations
    – Base your scores on more features
  • Consider it as two models in one
    – Training a large ensemble of models, each on a different set of data (mini-batch) and with SHARED parameters

Reducing co-adaptation between neurons

[Srivastava et al., JMLR’14] Dropout

SLIDE 34

Dropout: Test Time

  • All neurons are “turned on” – no dropout

[Srivastava et al., JMLR’14] Dropout

Conditions at train and test time are not the same

SLIDE 35

Dropout: Test Time

  • Dropout probability $q = 0.5$; weight scaling inference rule
  • Test: $a = q \cdot (w_1 x_1 + w_2 x_2)$
  • Train: $\mathbb{E}[a] = \frac{1}{4}\big(w_1 \cdot 0 + w_2 \cdot 0 \;+\; w_1 x_1 + w_2 \cdot 0 \;+\; w_1 \cdot 0 + w_2 x_2 \;+\; w_1 x_1 + w_2 x_2\big) = \frac{1}{2}(w_1 x_1 + w_2 x_2)$

[Diagram: a single unit $a$ with inputs $x_1, x_2$ and weights $w_1, w_2$; the expectation averages over the four possible dropout masks.]

[Srivastava et al., JMLR’14] Dropout
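A sketch of the two regimes in NumPy (names are mine; the closing comment notes the "inverted dropout" variant that most libraries use, which differs from the slide's scaling rule):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_train(a, p=0.5):
    # Training: each unit is kept with probability p (here p = 0.5), otherwise zeroed
    mask = rng.random(a.shape) < p
    return a * mask

def dropout_test(a, p=0.5):
    # Weight scaling inference rule: all units stay on, activations are scaled by p
    # so the test-time output matches the expected train-time output
    return a * p

a = rng.random(100000) + 1.0            # some positive activations, mean ~1.5
print(dropout_train(a).mean())          # ~0.75: expectation under random masks
print(dropout_test(a).mean())           # ~0.75: same expectation, no randomness
# Most frameworks implement "inverted dropout" instead: divide by p at train time,
# so nothing needs to be rescaled at test time.
```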

SLIDE 36

Dropout: Verdict

  • Efficient bagging method with parameter sharing
  • Try it!
  • Dropout reduces the effective capacity of a model → larger models, more training time

[Srivastava et al., JMLR’14] Dropout

SLIDE 37

Batch Normalization

SLIDE 38

Our Goal

  • All we want is that our activations do not die out

SLIDE 39

Batch Normalization

  • Wish: Unit Gaussian activations (in our example)
  • Solution: let’s do it

$\hat{\mathbf{x}}^{(k)} = \frac{\mathbf{x}^{(k)} - E[\mathbf{x}^{(k)}]}{\sqrt{\mathrm{Var}[\mathbf{x}^{(k)}]}}$

(N = mini-batch size, D = number of features; $E[\mathbf{x}^{(k)}]$ is the mean of your mini-batch examples over feature k)

[Ioffe and Szegedy, PMLR’15] Batch Normalization

SLIDE 40

Batch Normalization

  • In each dimension of the features, you have a unit Gaussian (in our example)

$\hat{\mathbf{x}}^{(k)} = \frac{\mathbf{x}^{(k)} - E[\mathbf{x}^{(k)}]}{\sqrt{\mathrm{Var}[\mathbf{x}^{(k)}]}}$

(N = mini-batch size, D = number of features; each feature k is normalized to a unit Gaussian)

[Ioffe and Szegedy, PMLR’15] Batch Normalization

SLIDE 41

Batch Normalization

  • In each dimension of the features, you have a unit Gaussian (in our example)
  • For NNs in general → BN normalizes the mean and variance of the inputs to your activation functions

[Ioffe and Szegedy, PMLR’15] Batch Normalization

SLIDE 42

BN Layer

  • A layer to be applied after Fully Connected (or Convolutional) layers and before non-linear activation functions

[Ioffe and Szegedy, PMLR’15] Batch Normalization

SLIDE 43

Batch Normalization

  • 1. Normalize
  • 2. Allow the network to change the range

These parameters will be optimized during backprop (a differentiable function, so we can backprop through it):

$\hat{\mathbf{x}}^{(k)} = \frac{\mathbf{x}^{(k)} - E[\mathbf{x}^{(k)}]}{\sqrt{\mathrm{Var}[\mathbf{x}^{(k)}]}}, \qquad \mathbf{y}^{(k)} = \gamma^{(k)} \hat{\mathbf{x}}^{(k)} + \beta^{(k)}$

[Ioffe and Szegedy, PMLR’15] Batch Normalization

SLIDE 44

Batch Normalization

  • 1. Normalize
  • 2. Allow the network to change the range

The network can learn to undo the normalization via backprop:

$\gamma^{(k)} = \sqrt{\mathrm{Var}[\mathbf{x}^{(k)}]}, \qquad \beta^{(k)} = E[\mathbf{x}^{(k)}]$

[Ioffe and Szegedy, PMLR’15] Batch Normalization

$\hat{\mathbf{x}}^{(k)} = \frac{\mathbf{x}^{(k)} - E[\mathbf{x}^{(k)}]}{\sqrt{\mathrm{Var}[\mathbf{x}^{(k)}]}}, \qquad \mathbf{y}^{(k)} = \gamma^{(k)} \hat{\mathbf{x}}^{(k)} + \beta^{(k)}$
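A training-time forward pass of these two steps in NumPy (a sketch; the epsilon in the denominator is the standard numerical-stability term, not shown on the slide):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """x: (N, D) mini-batch; gamma, beta: (D,) learnable scale and shift."""
    mean = x.mean(axis=0)                      # E[x^(k)] per feature k, over the mini-batch
    var = x.var(axis=0)                        # Var[x^(k)] per feature k
    x_hat = (x - mean) / np.sqrt(var + eps)    # 1. normalize to ~unit Gaussian
    return gamma * x_hat + beta                # 2. let the network change the range

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=5.0, size=(32, 4))   # N = 32 examples, D = 4 features
y = batchnorm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))  # ~0 mean, ~1 std per feature
```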

SLIDE 45

Batch Normalization

  • OK to treat dimensions separately? Shown empirically that even if features are not correlated, convergence is still faster with this method
  • You can set all biases of the layers before BN to zero, because they will be cancelled out by BN anyway

[Ioffe and Szegedy, PMLR’15] Batch Normalization

SLIDE 46

BN: Train vs Test

  • Train time: mean and variance are taken over the mini-batch
  • Test time: what happens if we can just process one image at a time?
    – No chance to compute a meaningful mean and variance

$\hat{\mathbf{x}}^{(k)} = \frac{\mathbf{x}^{(k)} - E[\mathbf{x}^{(k)}]}{\sqrt{\mathrm{Var}[\mathbf{x}^{(k)}]}}$

[Ioffe and Szegedy, PMLR’15] Batch Normalization

SLIDE 47

BN: Train vs Test

  • Training: compute mean and variance from mini-batch 1, 2, 3, …
  • Testing: compute mean and variance by running an exponentially weighted average across training mini-batches. Use them as $\mu_{test}$ and $\sigma_{test}$.

$\mathrm{Var}_{running} = \beta_m \cdot \mathrm{Var}_{running} + (1 - \beta_m) \cdot \mathrm{Var}_{minibatch}$
$\mu_{running} = \beta_m \cdot \mu_{running} + (1 - \beta_m) \cdot \mu_{minibatch}$

$\beta_m$: momentum (hyperparameter)

[Ioffe and Szegedy, PMLR’15] Batch Normalization
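A sketch of how the running statistics can be maintained and then used when only a single image is available at test time (class layout, names, and the momentum value 0.9 are my choices):

```python
import numpy as np

class BatchNorm1D:
    """Minimal batch norm keeping running statistics for test time."""
    def __init__(self, num_features, momentum=0.9, eps=1e-5):
        self.gamma = np.ones(num_features)
        self.beta = np.zeros(num_features)
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)
        self.momentum, self.eps = momentum, eps

    def forward(self, x, training=True):
        if training:
            mean, var = x.mean(axis=0), x.var(axis=0)
            # Exponentially weighted average across training mini-batches
            self.running_mean = self.momentum * self.running_mean + (1 - self.momentum) * mean
            self.running_var = self.momentum * self.running_var + (1 - self.momentum) * var
        else:
            # Test time: a single image has no meaningful batch statistics,
            # so use the running estimates instead
            mean, var = self.running_mean, self.running_var
        x_hat = (x - mean) / np.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta

bn = BatchNorm1D(4)
rng = np.random.default_rng(0)
for _ in range(100):                                              # "training"
    bn.forward(rng.normal(3.0, 5.0, size=(32, 4)), training=True)
print(bn.forward(rng.normal(3.0, 5.0, size=(1, 4)), training=False))  # single test image
```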

SLIDE 48

BN: What do you get?

  • Very deep nets are much easier to train → more stable gradients
  • A much larger range of hyperparameters works similarly when using BN

[Ioffe and Szegedy, PMLR’15] Batch Normalization

SLIDE 49

BN: A Milestone

[Wu and He, ECCV’18] Group Normalization

SLIDE 50

BN: Drawbacks

[Figure: validation error curves (from the Group Normalization paper).]

[Wu and He, ECCV’18] Group Normalization

SLIDE 51

Other Normalizations

[Figure: validation error curves (from the Group Normalization paper).]

[Wu and He, ECCV’18] Group Normalization

SLIDE 52

Other Normalizations

[Figure: four (N, C, H×W) tensors showing which axes each method normalizes over: Batch Norm, Layer Norm, Instance Norm, Group Norm. H, W: image size; C: number of channels; N: number of elements in the batch.]

[Wu and He, ECCV’18] Group Normalization
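In code, the four methods differ only in which axes of an (N, C, H, W) activation tensor the mean and variance are computed over (a sketch; the group size of 2 is an arbitrary choice):

```python
import numpy as np

def normalize(x, axes, eps=1e-5):
    mean = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

x = np.random.default_rng(0).normal(size=(8, 6, 4, 4))  # (N, C, H, W)

bn = normalize(x, axes=(0, 2, 3))   # Batch Norm: over the batch and spatial dims, per channel
ln = normalize(x, axes=(1, 2, 3))   # Layer Norm: over channels and spatial dims, per example
inorm = normalize(x, axes=(2, 3))   # Instance Norm: over spatial dims, per example and channel

# Group Norm: split channels into groups (here 3 groups of 2) and normalize within each group
g = x.reshape(8, 3, 2, 4, 4)
gn = normalize(g, axes=(2, 3, 4)).reshape(x.shape)
```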

SLIDE 53

What We Know

SLIDE 54

What do we know so far?

[Diagram: a network architecture, annotated with its width and depth.]

SLIDE 55

What do we know so far?

Concept of a ‘Neuron’

$\sigma(s) = \frac{1}{1 + e^{-s}}$

[Diagram: inputs $x_1, x_2, x_3$ weighted by $w_1, w_2, w_3$, summed into the score $s$, followed by the sigmoid non-linearity.]

SLIDE 56

What do we know so far?

Activation Functions (non-linearities)

  • Sigmoid: $\sigma(x) = \frac{1}{1 + e^{-x}}$
  • TanH: $\tanh(x)$
  • ReLU: $\max(0, x)$
  • Leaky ReLU: $\max(0.1x, x)$
SLIDE 57

What do we know so far?

Backpropagation

[Diagram: a small computational graph with forward values and backpropagated gradients annotated on each edge.]

SLIDE 58

What do we know so far?

SGD Variations (Momentum, etc.)

SLIDE 59

What do we know so far?

Dropout, Batch-Norm, Weight Regularization, Data Augmentation, Weight Initialization (e.g., Xavier/2)

e.g., $L_1$-reg: $R_1(\mathbf{x}) = \sum_{i=1}^{n} |x_i|$

$\hat{\mathbf{x}}^{(k)} = \frac{\mathbf{x}^{(k)} - E[\mathbf{x}^{(k)}]}{\sqrt{\mathrm{Var}[\mathbf{x}^{(k)}]}}$

SLIDE 60

Why not simply more Layers?

  • We cannot make networks arbitrarily complex
  • Why not just go deeper and get better?
    – No structure!!
    – It is just brute force!
    – Optimization becomes hard
    – Performance plateaus / drops!

SLIDE 61

See you next week!

SLIDE 62

References

  • Goodfellow et al., “Deep Learning” (2016)
    – Chapter 6: Deep Feedforward Networks
  • Bishop, “Pattern Recognition and Machine Learning” (2006)
    – Chapter 5.5: Regularization in Neural Networks

  • http://cs231n.github.io/neural-networks-1/
  • http://cs231n.github.io/neural-networks-2/
  • http://cs231n.github.io/neural-networks-3/
