

SLIDE 1

Lecture 7 Recap

I2DL: Prof. Niessner, Prof. Leal-Taixé

SLIDE 2

Naïve Losses: L2 vs L1

  • L2 Loss:
    – $L_2 = \sum_{i=1}^{n} \big(y_i - f(x_i)\big)^2$
    – Sum of squared differences (SSD)
    – Prone to outliers
    – Compute-efficient (optimization)
    – Optimum is the mean

  • L1 Loss:
    – $L_1 = \sum_{i=1}^{n} \big|y_i - f(x_i)\big|$
    – Sum of absolute differences
    – Robust
    – Costly to compute
    – Optimum is the median
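To make the "optimum is the mean / median" point concrete, here is a minimal NumPy sketch (the data and helper names are my own, not from the slides): it grid-searches a single constant prediction under each loss.

```python
import numpy as np

def l2_loss(y, pred):
    # Sum of squared differences (SSD)
    return np.sum((y - pred) ** 2)

def l1_loss(y, pred):
    # Sum of absolute differences
    return np.sum(np.abs(y - pred))

y = np.array([1.0, 2.0, 3.0, 100.0])          # one outlier
candidates = np.linspace(0.0, 100.0, 10001)    # candidate constant predictions

best_l2 = candidates[np.argmin([l2_loss(y, c) for c in candidates])]
best_l1 = candidates[np.argmin([l1_loss(y, c) for c in candidates])]

print(best_l2, np.mean(y))    # 26.5 -> the L2 optimum is the mean (dragged by the outlier)
print(best_l1, np.median(y))  # 2.5  -> the L1 optimum is the median (robust)
```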

SLIDE 3

Binary Classification

  • Sigmoid: $\sigma(\mathbf{x}, \mathbf{w}) = \frac{1}{1 + e^{-\sum_i w_i x_i}}$
  • Can be interpreted as a probability: $p(y = 1 \mid \mathbf{x}, \mathbf{w}) = \sigma(s) = \frac{1}{1 + e^{-s}}$

[Diagram: inputs $x_1, x_2, x_3$ weighted by $w_1, w_2, w_3$ and summed into the score $s$, which is passed through the sigmoid.]

SLIDE 4

Softmax Formulation

  • What if we have multiple classes?

Softmax:

$p(y = 1 \mid \mathbf{x}, \mathbf{w}) = \frac{e^{s_1}}{e^{s_1} + e^{s_2} + e^{s_3}}, \qquad
p(y = 2 \mid \mathbf{x}, \mathbf{w}) = \frac{e^{s_2}}{e^{s_1} + e^{s_2} + e^{s_3}}, \qquad
p(y = 3 \mid \mathbf{x}, \mathbf{w}) = \frac{e^{s_3}}{e^{s_1} + e^{s_2} + e^{s_3}}$

[Diagram: inputs $x_1, x_2, x_3$ feed three summation units producing the scores $s_1, s_2, s_3$ for each class; the softmax turns these scores into probabilities for each class.]
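A minimal sketch of the softmax mapping from raw class scores to probabilities (the example scores are the Model 1 scores used on the next slide; the max-subtraction is a standard numerical-stability trick, not something stated on the slide):

```python
import numpy as np

def softmax(s):
    # Subtracting the max does not change the result but avoids overflow in exp
    e = np.exp(s - np.max(s))
    return e / e.sum()

scores = np.array([5.0, -3.0, 2.0])   # raw scores s_1, s_2, s_3 for three classes
probs = softmax(scores)
print(probs)        # ~[0.952, 0.0003, 0.047] -> probabilities for each class
print(probs.sum())  # 1.0
```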

SLIDE 5

Example: Hinge vs Cross-Entropy

  • Given the following scores for $\mathbf{x}_i$ (true class $y_i = 0$):
    – Model 1: $s = [5, -3, 2]$
    – Model 2: $s = [5, 10, 10]$
    – Model 3: $s = [5, -20, -20]$

  • Hinge loss: $L_i = \sum_{j \neq y_i} \max(0,\; s_j - s_{y_i} + 1)$
    – Model 1: $\max(0, -3 - 5 + 1) + \max(0, 2 - 5 + 1) = 0$
    – Model 2: $\max(0, 10 - 5 + 1) + \max(0, 10 - 5 + 1) = 12$
    – Model 3: $\max(0, -20 - 5 + 1) + \max(0, -20 - 5 + 1) = 0$

  • Cross-entropy loss: $L_i = -\log\!\left(\frac{e^{s_{y_i}}}{\sum_j e^{s_j}}\right)$
    – Model 1: $-\ln\frac{e^{5}}{e^{5} + e^{-3} + e^{2}} = 0.05$
    – Model 2: $-\ln\frac{e^{5}}{e^{5} + e^{10} + e^{10}} = 5.70$
    – Model 3: $-\ln\frac{e^{5}}{e^{5} + e^{-20} + e^{-20}} \approx 2 \cdot 10^{-11}$

  • Cross-entropy *always* wants to improve (the loss is never exactly 0); the hinge loss saturates.
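The three comparisons above can be reproduced with a few lines of NumPy (helper names are mine):

```python
import numpy as np

def hinge_loss(scores, true_class, margin=1.0):
    # Multiclass hinge (SVM) loss: sum over wrong classes of max(0, s_j - s_true + margin)
    losses = np.maximum(0.0, scores - scores[true_class] + margin)
    losses[true_class] = 0.0
    return losses.sum()

def cross_entropy_loss(scores, true_class):
    # Negative log of the softmax probability of the true class
    e = np.exp(scores - np.max(scores))
    return -np.log(e[true_class] / e.sum())

models = {
    "Model 1": np.array([5.0, -3.0, 2.0]),
    "Model 2": np.array([5.0, 10.0, 10.0]),
    "Model 3": np.array([5.0, -20.0, -20.0]),
}
for name, s in models.items():
    print(name, hinge_loss(s, 0), cross_entropy_loss(s, 0))
# Model 1: hinge 0.0,  CE ~0.05
# Model 2: hinge 12.0, CE ~5.70
# Model 3: hinge 0.0,  CE ~2.8e-11 (essentially zero, but never exactly zero)
```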

SLIDE 6

Sigmoid Activation

Saturated neurons kill the gradient flow.

$\sigma(s) = \frac{1}{1 + e^{-s}}$

Forward: $x \rightarrow s \rightarrow \sigma(s)$. Backward (chain rule): $\frac{\partial L}{\partial x} = \frac{\partial s}{\partial x}\,\frac{\partial L}{\partial s}$ with $\frac{\partial L}{\partial s} = \frac{\partial \sigma}{\partial s}\,\frac{\partial L}{\partial \sigma}$; when $\sigma$ saturates, $\frac{\partial \sigma}{\partial s} \approx 0$ and the upstream gradient is killed.
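To make the saturation concrete, a quick sketch (example values are my own, not from the slide) of the local gradient $\sigma'(s) = \sigma(s)\,(1 - \sigma(s))$ at a few pre-activations:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def sigmoid_grad(s):
    # d(sigma)/ds = sigma(s) * (1 - sigma(s)); the local gradient used in backprop
    sig = sigmoid(s)
    return sig * (1.0 - sig)

for s in [0.0, 2.0, 5.0, 10.0]:
    print(s, sigmoid_grad(s))
# 0.0  -> 0.25       (largest possible local gradient)
# 2.0  -> ~0.105
# 5.0  -> ~0.0066
# 10.0 -> ~4.5e-05   (saturated: almost no gradient flows back)
```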

SLIDE 7

TanH Activation

  • Zero-centered
  • Still saturates

[LeCun et al. 1991] Improving Generalization Performance in Character Recognition

SLIDE 8

Rectified Linear Units (ReLU)

  • Large and consistent gradients
  • Does not saturate
  • Fast convergence
  • What happens if a ReLU outputs zero? Dead ReLU

[Krizhevsky et al. NeurIPS 2012] ImageNet Classification with Deep Convolutional Neural Networks

SLIDE 9

Quick Guide

  • Sigmoid is not really used.
  • ReLU is the standard choice.
  • Second choice: variants of ReLU or Maxout.
  • Recurrent nets will require TanH or similar.


SLIDE 10

Initialization is Extremely Important

$x^* = \arg\min_x f(x)$

[Figure: a non-convex loss surface with the optimum marked; depending on the initialization, gradient descent is not guaranteed to reach the optimum.]

SLIDE 11

Xavier Initialization

  • How to ensure the variance of the output is the same as the input?

$n \cdot \mathrm{Var}(w)\,\mathrm{Var}(x) = 1 \;\Rightarrow\; \mathrm{Var}(w) = \frac{1}{n}$

SLIDE 12

ReLU Kills Half of the Data

It makes a huge difference!

[He et al., ICCV’15] He Initialization

$\mathrm{Var}(w) = \frac{2}{n}$
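The two rules side by side in a quick NumPy sketch (layer sizes, the Gaussian form of the initializer, and the 10-layer test are my own choices): pushing unit-variance data through stacked ReLU layers shows why the extra factor of 2 matters.

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_init(n_in, n_out):
    # Var(w) = 1/n_in: output variance matches input variance for a linear layer
    return rng.normal(0.0, np.sqrt(1.0 / n_in), size=(n_in, n_out))

def he_init(n_in, n_out):
    # Var(w) = 2/n_in: the extra factor 2 compensates for ReLU zeroing half the activations
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))

# Push unit-variance data through a stack of ReLU layers and watch the signal scale
x = rng.normal(size=(1000, 512))
h_xavier, h_he = x, x
for _ in range(10):
    h_xavier = np.maximum(0.0, h_xavier @ xavier_init(512, 512))
    h_he = np.maximum(0.0, h_he @ he_init(512, 512))

print(np.std(h_xavier))  # shrinks towards zero layer by layer (signal dies out)
print(np.std(h_he))      # stays roughly at the input scale
```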

SLIDE 13

Lecture 8

SLIDE 14

Data Augmentation

SLIDE 15

Data Augmentation

  • A classifier has to be invariant to a wide variety of transformations

SLIDE 16

[Example images: variations in pose, appearance, and illumination.]

SLIDE 17

Data Augmentation

  • A classifier has to be invariant to a wide variety of transformations
  • Helping the classifier: synthesize data simulating plausible transformations

SLIDE 18

Data Augmentation

[Krizhevsky et al., NIPS’12] ImageNet

SLIDE 19

Data Augmentation: Brightness

  • Random brightness and contrast changes

[Krizhevsky et al., NIPS’12] ImageNet

SLIDE 20

Data Augmentation: Random Crops

  • Training: random crops
    – Pick a random L in [256, 480]
    – Resize training image, short side L
    – Randomly sample crops of 224×224
  • Testing: fixed set of crops
    – Resize image at N scales
    – 10 fixed crops of 224×224: (4 corners + 1 center) × 2 flips


[Krizhevsky et al., NIPS’12] ImageNet
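A sketch of the training-time procedure above in plain NumPy (function name, the nearest-neighbour resize, and the added random flip are my own simplifications of a real pipeline):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_resized_crop(image, crop=224, scale_range=(256, 480)):
    """Resize so the short side is a random L in [256, 480], then take a random
    224x224 crop (nearest-neighbour resize for brevity)."""
    h, w = image.shape[:2]
    L = rng.integers(scale_range[0], scale_range[1] + 1)
    s = L / min(h, w)
    new_h, new_w = int(round(h * s)), int(round(w * s))
    # Nearest-neighbour resize via index sampling (a real pipeline would interpolate)
    rows = (np.arange(new_h) / s).astype(int).clip(0, h - 1)
    cols = (np.arange(new_w) / s).astype(int).clip(0, w - 1)
    resized = image[rows][:, cols]
    # Random crop location
    top = rng.integers(0, new_h - crop + 1)
    left = rng.integers(0, new_w - crop + 1)
    out = resized[top:top + crop, left:left + crop]
    # Random horizontal flip (the slide uses flips in the test-time crop set)
    return out[:, ::-1] if rng.random() < 0.5 else out

img = rng.integers(0, 256, size=(300, 500, 3), dtype=np.uint8)
print(random_resized_crop(img).shape)  # (224, 224, 3)
```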

SLIDE 21

Data Augmentation

  • When comparing two networks, make sure to use the same data augmentation!
  • Consider data augmentation a part of your network design

SLIDE 22

Advanced Regularization

SLIDE 23

Weight Decay

  • L2 regularization
  • Penalizes large weights
  • Improves generalization


$\theta_{k+1} = \theta_k - \epsilon \, \nabla_\theta L(\theta_k, x, y) - \lambda \, \theta_k$

(with $\epsilon$ the learning rate, $\nabla_\theta L$ the gradient of the loss, and $\lambda \theta_k$ the gradient of the L2 regularization term)
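As a sketch, the update written out in code (toy numbers; whether the decay term is additionally multiplied by the learning rate varies between implementations, here I follow the slide's formula):

```python
import numpy as np

def sgd_weight_decay_step(theta, grad, lr=0.1, weight_decay=0.01):
    # theta_{k+1} = theta_k - lr * grad_L(theta_k, x, y) - weight_decay * theta_k
    # The last term is the gradient of the L2 penalty: it shrinks every weight towards zero.
    return theta - lr * grad - weight_decay * theta

theta = np.array([2.0, -3.0, 0.5])
grad = np.array([0.1, -0.2, 0.0])
print(sgd_weight_decay_step(theta, grad))   # [1.97, -2.95, 0.495]
```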

SLIDE 24

Early Stopping

[Plot: training and validation error over epochs; the region where the validation error rises again is marked "Overfitting".]

SLIDE 25

Early Stopping

  • Easy form of regularization

[Diagram: the sequence of parameter iterates $\theta_1, \theta_2, \theta_3, \dots$ approaching $\theta^*$; the later part of the trajectory is marked as overfitting.]
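A minimal patience-based loop (the patience mechanism and all names here are my additions; the slides only state the idea of stopping before overfitting):

```python
import numpy as np

def train_with_early_stopping(train_step, evaluate, max_epochs=100, patience=5):
    """Stop when the validation error has not improved for `patience` epochs
    and return the parameters from the best epoch."""
    best_val, best_params, epochs_without_improvement = np.inf, None, 0
    for epoch in range(max_epochs):
        params = train_step(epoch)        # one epoch of training, returns current parameters
        val_error = evaluate(params)      # error on the held-out validation set
        if val_error < best_val:
            best_val, best_params = val_error, params
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                     # validation error stopped improving: stop training
    return best_params

# Toy usage: validation error falls, then rises again (overfitting) after epoch 10
errors = [1.0 / (e + 1) + max(0, e - 10) * 0.05 for e in range(100)]
best = train_with_early_stopping(train_step=lambda e: e, evaluate=lambda e: errors[e])
print(best)  # 10 -> the parameters from just before overfitting set in
```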

SLIDE 26

Bagging and Ensemble Methods

  • Train multiple models and average their results
  • E.g., use a different algorithm for optimization or change the objective function / loss function.
  • If errors are uncorrelated, the expected combined error will decrease linearly with the ensemble size

SLIDE 27

Bagging and Ensemble Methods

  • Bagging: uses k different datasets

[Diagram: Training Set 1, Training Set 2, Training Set 3.]

Image Source: [Srivastava et al., JMLR’14] Dropout

SLIDE 28

Dropout

SLIDE 29

Dropout

  • Disable a random set of neurons (typically 50%)

[Srivastava et al., JMLR’14] Dropout


SLIDE 30

Dropout: Intuition

  • Using half the network = half capacity

[Diagram: a score built from features such as "furry", "has two eyes", "has a tail", "has paws", "has two ears"; label: redundant representations.]

[Srivastava et al., JMLR’14] Dropout

SLIDE 31

Dropout: Intuition

  • Using half the network = half capacity
    – Redundant representations
    – Base your scores on more features
  • Consider it as a model ensemble

[Srivastava et al., JMLR’14] Dropout

SLIDE 32

Dropout: Intuition

  • Two models in one

[Diagram: the same network with two different dropout masks, labeled Model 1 and Model 2.]

[Srivastava et al., JMLR’14] Dropout

SLIDE 33

Dropout: Intuition

  • Using half the network = half capacity
    – Redundant representations
    – Base your scores on more features
  • Consider it as two models in one
    – Training a large ensemble of models, each on a different set of data (mini-batch) and with SHARED parameters

Reducing co-adaptation between neurons

[Srivastava et al., JMLR’14] Dropout

SLIDE 34

Dropout: Test Time

  • All neurons are “turned on” – no dropout

[Srivastava et al., JMLR’14] Dropout

Conditions at train and test time are not the same

SLIDE 35

Dropout: Test Time

  • Dropout probability $q = 0.5$; weight scaling inference rule
  • Test: $a = q \cdot (w_1 x_1 + w_2 x_2)$
  • Train: $\mathbb{E}[a] = \frac{1}{4}\big(w_1 \cdot 0 + w_2 \cdot 0 \;+\; w_1 x_1 + w_2 \cdot 0 \;+\; w_1 \cdot 0 + w_2 x_2 \;+\; w_1 x_1 + w_2 x_2\big) = \frac{1}{2}(w_1 x_1 + w_2 x_2)$

[Diagram: a single unit $a$ with inputs $x_1, x_2$ and weights $w_1, w_2$; the expectation averages over the four possible dropout masks.]

[Srivastava et al., JMLR’14] Dropout
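A sketch of the two regimes in NumPy (names are mine; the closing comment notes the "inverted dropout" variant that most libraries use, which differs from the slide's scaling rule):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_train(a, p=0.5):
    # Training: each unit is kept with probability p (here p = 0.5), otherwise zeroed
    mask = rng.random(a.shape) < p
    return a * mask

def dropout_test(a, p=0.5):
    # Weight scaling inference rule: all units stay on, activations are scaled by p
    # so the test-time output matches the expected train-time output
    return a * p

a = rng.random(100000) + 1.0            # some positive activations, mean ~1.5
print(dropout_train(a).mean())          # ~0.75: expectation under random masks
print(dropout_test(a).mean())           # ~0.75: same expectation, no randomness
# Most frameworks implement "inverted dropout" instead: divide by p at train time,
# so nothing needs to be rescaled at test time.
```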

SLIDE 36

Dropout: Verdict

  • Efficient bagging method with parameter sharing
  • Try it!
  • Dropout reduces the effective capacity of a model → larger models, more training time

[Srivastava et al., JMLR’14] Dropout

SLIDE 37

Batch Normalization

SLIDE 38

Our Goal

  • All we want is that our activations do not die out

SLIDE 39

Batch Normalization

  • Wish: Unit Gaussian activations (in our example)
  • Solution: let’s do it

$\hat{\mathbf{x}}^{(k)} = \frac{\mathbf{x}^{(k)} - E[\mathbf{x}^{(k)}]}{\sqrt{\mathrm{Var}[\mathbf{x}^{(k)}]}}$

(N = mini-batch size, D = number of features; $E[\mathbf{x}^{(k)}]$ is the mean of your mini-batch examples over feature k)

[Ioffe and Szegedy, PMLR’15] Batch Normalization

SLIDE 40

Batch Normalization

  • In each dimension of the features, you have a unit Gaussian (in our example)

$\hat{\mathbf{x}}^{(k)} = \frac{\mathbf{x}^{(k)} - E[\mathbf{x}^{(k)}]}{\sqrt{\mathrm{Var}[\mathbf{x}^{(k)}]}}$

(N = mini-batch size, D = number of features; each feature k is normalized to a unit Gaussian)

[Ioffe and Szegedy, PMLR’15] Batch Normalization

SLIDE 41

Batch Normalization

  • In each dimension of the features, you have a unit Gaussian (in our example)
  • For NNs in general → BN normalizes the mean and variance of the inputs to your activation functions

[Ioffe and Szegedy, PMLR’15] Batch Normalization

SLIDE 42

BN Layer

  • A layer to be applied after Fully Connected (or Convolutional) layers and before non-linear activation functions

[Ioffe and Szegedy, PMLR’15] Batch Normalization

SLIDE 43

Batch Normalization

  • 1. Normalize
  • 2. Allow the network to change the range

These parameters will be optimized during backprop (a differentiable function, so we can backprop through it):

$\hat{\mathbf{x}}^{(k)} = \frac{\mathbf{x}^{(k)} - E[\mathbf{x}^{(k)}]}{\sqrt{\mathrm{Var}[\mathbf{x}^{(k)}]}}, \qquad \mathbf{y}^{(k)} = \gamma^{(k)} \hat{\mathbf{x}}^{(k)} + \beta^{(k)}$

[Ioffe and Szegedy, PMLR’15] Batch Normalization

SLIDE 44

Batch Normalization

  • 1. Normalize
  • 2. Allow the network to change the range

The network can learn to undo the normalization via backprop:

$\gamma^{(k)} = \sqrt{\mathrm{Var}[\mathbf{x}^{(k)}]}, \qquad \beta^{(k)} = E[\mathbf{x}^{(k)}]$

[Ioffe and Szegedy, PMLR’15] Batch Normalization

$\hat{\mathbf{x}}^{(k)} = \frac{\mathbf{x}^{(k)} - E[\mathbf{x}^{(k)}]}{\sqrt{\mathrm{Var}[\mathbf{x}^{(k)}]}}, \qquad \mathbf{y}^{(k)} = \gamma^{(k)} \hat{\mathbf{x}}^{(k)} + \beta^{(k)}$
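A training-time forward pass of these two steps in NumPy (a sketch; the epsilon in the denominator is the standard numerical-stability term, not shown on the slide):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """x: (N, D) mini-batch; gamma, beta: (D,) learnable scale and shift."""
    mean = x.mean(axis=0)                      # E[x^(k)] per feature k, over the mini-batch
    var = x.var(axis=0)                        # Var[x^(k)] per feature k
    x_hat = (x - mean) / np.sqrt(var + eps)    # 1. normalize to ~unit Gaussian
    return gamma * x_hat + beta                # 2. let the network change the range

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=5.0, size=(32, 4))   # N = 32 examples, D = 4 features
y = batchnorm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))  # ~0 mean, ~1 std per feature
```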

SLIDE 45

Batch Normalization

  • OK to treat dimensions separately? Shown empirically that even if features are not correlated, convergence is still faster with this method
  • You can set all biases of the layers before BN to zero, because they will be cancelled out by BN anyway

[Ioffe and Szegedy, PMLR’15] Batch Normalization

SLIDE 46

BN: Train vs Test

  • Train time: mean and variance are taken over the mini-batch
  • Test time: what happens if we can just process one image at a time?
    – No chance to compute a meaningful mean and variance

$\hat{\mathbf{x}}^{(k)} = \frac{\mathbf{x}^{(k)} - E[\mathbf{x}^{(k)}]}{\sqrt{\mathrm{Var}[\mathbf{x}^{(k)}]}}$

[Ioffe and Szegedy, PMLR’15] Batch Normalization

SLIDE 47

BN: Train vs Test

  • Training: compute mean and variance from mini-batch 1, 2, 3, …
  • Testing: compute mean and variance by running an exponentially weighted average across training mini-batches. Use them as $\mu_{test}$ and $\sigma_{test}$.

$\mathrm{Var}_{running} = \beta_m \cdot \mathrm{Var}_{running} + (1 - \beta_m) \cdot \mathrm{Var}_{minibatch}$
$\mu_{running} = \beta_m \cdot \mu_{running} + (1 - \beta_m) \cdot \mu_{minibatch}$

$\beta_m$: momentum (hyperparameter)

[Ioffe and Szegedy, PMLR’15] Batch Normalization
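A sketch of how the running statistics can be maintained and then used when only a single image is available at test time (class layout, names, and the momentum value 0.9 are my choices):

```python
import numpy as np

class BatchNorm1D:
    """Minimal batch norm keeping running statistics for test time."""
    def __init__(self, num_features, momentum=0.9, eps=1e-5):
        self.gamma = np.ones(num_features)
        self.beta = np.zeros(num_features)
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)
        self.momentum, self.eps = momentum, eps

    def forward(self, x, training=True):
        if training:
            mean, var = x.mean(axis=0), x.var(axis=0)
            # Exponentially weighted average across training mini-batches
            self.running_mean = self.momentum * self.running_mean + (1 - self.momentum) * mean
            self.running_var = self.momentum * self.running_var + (1 - self.momentum) * var
        else:
            # Test time: a single image has no meaningful batch statistics,
            # so use the running estimates instead
            mean, var = self.running_mean, self.running_var
        x_hat = (x - mean) / np.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta

bn = BatchNorm1D(4)
rng = np.random.default_rng(0)
for _ in range(100):                                              # "training"
    bn.forward(rng.normal(3.0, 5.0, size=(32, 4)), training=True)
print(bn.forward(rng.normal(3.0, 5.0, size=(1, 4)), training=False))  # single test image
```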

SLIDE 48

BN: What do you get?

  • Very deep nets are much easier to train → more stable gradients
  • A much larger range of hyperparameters works similarly when using BN

[Ioffe and Szegedy, PMLR’15] Batch Normalization

SLIDE 49

BN: A Milestone

[Wu and He, ECCV’18] Group Normalization

SLIDE 50

BN: Drawbacks

[Figure: validation error curves (from the Group Normalization paper).]

[Wu and He, ECCV’18] Group Normalization

SLIDE 51

Other Normalizations

[Figure: validation error curves (from the Group Normalization paper).]

[Wu and He, ECCV’18] Group Normalization

SLIDE 52

Other Normalizations

[Figure: four (N, C, H×W) tensors showing which axes each method normalizes over: Batch Norm, Layer Norm, Instance Norm, Group Norm. H, W: image size; C: number of channels; N: number of elements in the batch.]

[Wu and He, ECCV’18] Group Normalization
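In code, the four methods differ only in which axes of an (N, C, H, W) activation tensor the mean and variance are computed over (a sketch; the group size of 2 is an arbitrary choice):

```python
import numpy as np

def normalize(x, axes, eps=1e-5):
    mean = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

x = np.random.default_rng(0).normal(size=(8, 6, 4, 4))  # (N, C, H, W)

bn = normalize(x, axes=(0, 2, 3))   # Batch Norm: over the batch and spatial dims, per channel
ln = normalize(x, axes=(1, 2, 3))   # Layer Norm: over channels and spatial dims, per example
inorm = normalize(x, axes=(2, 3))   # Instance Norm: over spatial dims, per example and channel

# Group Norm: split channels into groups (here 3 groups of 2) and normalize within each group
g = x.reshape(8, 3, 2, 4, 4)
gn = normalize(g, axes=(2, 3, 4)).reshape(x.shape)
```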

SLIDE 53

What We Know

SLIDE 54

What do we know so far?

[Diagram: a network architecture, annotated with its width and depth.]

SLIDE 55

What do we know so far?

Concept of a ‘Neuron’

$\sigma(s) = \frac{1}{1 + e^{-s}}$

[Diagram: inputs $x_1, x_2, x_3$ weighted by $w_1, w_2, w_3$, summed into the score $s$, followed by the sigmoid non-linearity.]

SLIDE 56

What do we know so far?

Activation Functions (non-linearities)

  • Sigmoid: $\sigma(x) = \frac{1}{1 + e^{-x}}$
  • TanH: $\tanh(x)$
  • ReLU: $\max(0, x)$
  • Leaky ReLU: $\max(0.1x, x)$
SLIDE 57

What do we know so far?

Backpropagation

[Diagram: a small computational graph with forward values and backpropagated gradients annotated on each edge.]

SLIDE 58

What do we know so far?

SGD Variations (Momentum, etc.)

SLIDE 59

What do we know so far?

Dropout, Batch-Norm, Weight Regularization, Data Augmentation, Weight Initialization (e.g., Xavier/2)

e.g., $L_1$-reg: $R_1(\mathbf{x}) = \sum_{i=1}^{n} |x_i|$

$\hat{\mathbf{x}}^{(k)} = \frac{\mathbf{x}^{(k)} - E[\mathbf{x}^{(k)}]}{\sqrt{\mathrm{Var}[\mathbf{x}^{(k)}]}}$

SLIDE 60

Why not simply more Layers?

  • We cannot make networks arbitrarily complex
  • Why not just go deeper and get better?
    – No structure!!
    – It is just brute force!
    – Optimization becomes hard
    – Performance plateaus / drops!

SLIDE 61

See you next week!

SLIDE 62

References

  • Goodfellow et al., “Deep Learning” (2016)
    – Chapter 6: Deep Feedforward Networks
  • Bishop, “Pattern Recognition and Machine Learning” (2006)
    – Chapter 5.5: Regularization in Neural Networks

  • http://cs231n.github.io/neural-networks-1/
  • http://cs231n.github.io/neural-networks-2/
  • http://cs231n.github.io/neural-networks-3/
