Lecture 7 Recap
I2DL: Prof. Niessner, Prof. Leal-Taixé
Naive Losses: L2 vs L1
L2 Loss:
$$L_2 = \sum_{i=1}^{n} \big(z_i - g(y_i)\big)^2$$
– Sum of squared differences (SSD)
– Prone to outliers
– Compute-efficient (optimization)
– Optimum is the mean
L1 Loss:
$$L_1 = \sum_{i=1}^{n} \big|z_i - g(y_i)\big|$$
– Sum of absolute differences
– Robust
– Costly to compute
– Optimum is the median
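To make the last two bullet points concrete, here is a minimal NumPy sketch (not from the slides; the target values are made up) that finds the best constant prediction under each loss: the L2 optimum lands on the mean, the L1 optimum on the median.

```python
import numpy as np

z = np.array([1.0, 2.0, 3.0, 4.0, 100.0])   # targets, with one outlier
g = np.linspace(0.0, 100.0, 100001)         # candidate constant predictions

l2 = ((z[None, :] - g[:, None]) ** 2).sum(axis=1)   # L2 loss for every candidate
l1 = np.abs(z[None, :] - g[:, None]).sum(axis=1)    # L1 loss for every candidate

print(g[l2.argmin()], z.mean())       # ~22.0 -> pulled towards the outlier (the mean)
print(g[l1.argmin()], np.median(z))   # ~3.0  -> robust to the outlier (the median)
```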
Sigmoid: the output can be interpreted as a probability.
$$\sigma(\mathbf{y}, \boldsymbol{\theta}) = \frac{1}{1 + e^{-\sum_i \theta_i y_i}} = q(z = 1 \mid \mathbf{y}, \boldsymbol{\theta}), \qquad \sigma(t) = \frac{1}{1 + e^{-t}}$$
(Figure: a single neuron with inputs $y_1, y_2, y_3$, weights $\theta_1, \theta_2, \theta_3$ and pre-activation $t$, followed by the sigmoid.)
Softmax
$$q(z = 1 \mid \mathbf{y}, \boldsymbol{\theta}) = \frac{e^{t_1}}{e^{t_1} + e^{t_2} + e^{t_3}}, \quad q(z = 2 \mid \mathbf{y}, \boldsymbol{\theta}) = \frac{e^{t_2}}{e^{t_1} + e^{t_2} + e^{t_3}}, \quad q(z = 3 \mid \mathbf{y}, \boldsymbol{\theta}) = \frac{e^{t_3}}{e^{t_1} + e^{t_2} + e^{t_3}}$$
The softmax turns the scores $t_1, t_2, t_3$ for each class into probabilities for each class.
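A small NumPy sketch of this mapping (the scores are made up; subtracting the max is only for numerical stability and does not change the result):

```python
import numpy as np

def softmax(t):
    e = np.exp(t - t.max())   # subtract the max for numerical stability
    return e / e.sum()

t = np.array([2.0, -1.0, 0.5])   # scores t_1, t_2, t_3 for three classes
q = softmax(t)                   # probabilities for each class, q.sum() == 1
print(q)
```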
Hinge Loss vs Cross-Entropy. Given the following scores for $\mathbf{y}_i$ (correct class $z_i = 0$):

Model 1: t = [5, −3, 2]
  Hinge loss:         max(0, −3 − 5 + 1) + max(0, 2 − 5 + 1) = 0
  Cross-entropy loss: −ln(e^5 / (e^5 + e^{−3} + e^{2})) ≈ 0.05
Model 2: t = [5, 10, 10]
  Hinge loss:         max(0, 10 − 5 + 1) + max(0, 10 − 5 + 1) = 12
  Cross-entropy loss: −ln(e^5 / (e^5 + e^{10} + e^{10})) ≈ 5.70
Model 3: t = [5, −20, −20]
  Hinge loss:         max(0, −20 − 5 + 1) + max(0, −20 − 5 + 1) = 0
  Cross-entropy loss: −ln(e^5 / (e^5 + e^{−20} + e^{−20})) ≈ 2 · 10^{−11}

Hinge loss: $L_i = \sum_{j \neq z_i} \max(0, t_j - t_{z_i} + 1)$
Cross-entropy loss: $L_i = -\log\Big(\frac{e^{t_{z_i}}}{\sum_k e^{t_k}}\Big)$

– Cross-entropy *always* wants to improve! (the loss never reaches 0)
– Hinge loss saturates.
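The three examples above can be checked with a short NumPy sketch (correct class index 0, as on the slide):

```python
import numpy as np

def hinge_loss(t, correct):
    margins = np.maximum(0.0, t - t[correct] + 1.0)
    margins[correct] = 0.0            # do not count the correct class itself
    return margins.sum()

def cross_entropy_loss(t, correct):
    return -np.log(np.exp(t[correct]) / np.exp(t).sum())

for scores in ([5.0, -3.0, 2.0], [5.0, 10.0, 10.0], [5.0, -20.0, -20.0]):
    t = np.array(scores)
    print(hinge_loss(t, 0), cross_entropy_loss(t, 0))
# -> 0.0, ~0.05  |  12.0, ~5.70  |  0.0, ~1e-11 (essentially zero, but never exactly 0)
```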
Sigmoid: saturated neurons kill the gradient flow.
$$\sigma(t) = \frac{1}{1 + e^{-t}}$$
Forward pass: compute $t$ and $\sigma(t)$. Backward pass:
$$\frac{\partial L}{\partial x} = \frac{\partial t}{\partial x}\,\frac{\partial L}{\partial t}, \qquad \frac{\partial L}{\partial t} = \frac{\partial \sigma}{\partial t}\,\frac{\partial L}{\partial \sigma}$$
When the sigmoid saturates, $\frac{\partial \sigma}{\partial t} \approx 0$, so hardly any gradient flows back.
tanh: zero-centered, but still saturates.
[LeCun et al. 1991] Improving Generalization Performance in Character Recognition
ReLU: large and consistent gradients; does not saturate; fast convergence. What happens if a ReLU always outputs zero? A dead ReLU: no gradient flows back.
[Krizhevsky et al. NeurIPS 2012] ImageNet Classification with Deep Convolutional Neural Networks
Not guaranteed to reach the optimum
$$y^* = \arg\min g(y)$$
Initialization: how can we ensure that the output of a layer has the same variance as the input?
Xavier initialization: choose the weight variance such that $n \cdot \mathrm{Var}(w) = 1$, i.e.
$$\mathrm{Var}(w) = \frac{1}{n}$$
(with $n$ the number of inputs to the layer).
It makes a huge difference!
[He et al., ICCV’15] He Initialization
$$\mathrm{Var}(w) = \frac{2}{n}$$
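A minimal NumPy sketch of both rules (the layer sizes are made up; $n$ is the number of inputs, the "fan-in"):

```python
import numpy as np

n_in, n_out = 512, 256    # made-up layer sizes; n_in is the fan-in "n" above

# Xavier: Var(w) = 1/n  ->  std = sqrt(1/n)
W_xavier = np.random.randn(n_in, n_out) * np.sqrt(1.0 / n_in)

# He (for ReLU): Var(w) = 2/n  ->  std = sqrt(2/n)
W_he = np.random.randn(n_in, n_out) * np.sqrt(2.0 / n_in)

print(W_xavier.var(), W_he.var())   # ~1/512 and ~2/512
```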
Data augmentation: a classifier has to be invariant to a wide variety of transformations.
e.g., pose, appearance, illumination
Idea: augment the training data with plausible transformations.
[Krizhevsky et al., NIPS’12] ImageNet
Random crops (training):
– Pick a random L in [256, 480]
– Resize the training image so its short side is L
– Randomly sample crops of 224x224
Fixed crops (testing):
– Resize the image at N scales
– 10 fixed crops of 224x224: (4 corners + 1 center) × 2 flips
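A sketch of the training-time recipe using Pillow (PIL is an assumption, not mentioned on the slides; the function name is hypothetical):

```python
import random
from PIL import Image

def random_resized_crop(img: Image.Image) -> Image.Image:
    # Pick a random L in [256, 480] and resize so the short side equals L.
    L = random.randint(256, 480)
    w, h = img.size
    scale = L / min(w, h)
    img = img.resize((round(w * scale), round(h * scale)))
    # Randomly sample a 224x224 crop.
    w, h = img.size
    left = random.randint(0, w - 224)
    top = random.randint(0, h - 224)
    return img.crop((left, top, left + 224, top + 224))
```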
When comparing models, make sure they use the same data augmentation! Consider data augmentation part of your network design.
Gradient descent with L2-regularization (weight decay):
$$\Theta^{k+1} = \Theta^{k} - \epsilon\, \nabla_{\Theta} L(\Theta^{k}, y, z) - \lambda\, \Theta^{k}$$
Here $\epsilon$ is the learning rate, $\nabla_{\Theta} L$ is the gradient of the loss, and $\lambda\,\Theta^{k}$ is the gradient of the L2-regularization term.
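A minimal NumPy sketch of this update (names and values are made up; $\lambda$ multiplies $\Theta$ directly, exactly as in the formula above):

```python
import numpy as np

def sgd_step_weight_decay(theta, grad, lr=0.01, lam=1e-4):
    # theta^{k+1} = theta^k - lr * grad(theta^k) - lam * theta^k
    return theta - lr * grad - lam * theta

theta = np.random.randn(10)
grad = np.random.randn(10)          # stand-in for the gradient of the loss
theta = sgd_step_weight_decay(theta, grad)
```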
Overfitting
(Figure: sequence of parameter updates $\Theta^1, \Theta^2, \Theta^3, \dots$ approaching $\Theta^*$ with step size $\epsilon$; the point where overfitting sets in is marked.)
change the objective function / loss function.
Ensembles: the error will decrease linearly with the ensemble size.
(Figure: an ensemble where each model is trained on a different training set: Training Set 1, Training Set 2, Training Set 3.)
Image Source: [Srivastava et al., JMLR’14] Dropout
[Srivastava et al., JMLR’14] Dropout
(Figure: forward pass through the network with a random subset of neurons dropped.)
Redundant representations: e.g., furry, has two eyes, has a tail, has paws, has two ears.
– Redundant representations
– Base your scores on more features
(Figure: two different dropout masks give two different sub-networks, Model 1 and Model 2.)
– Training a large ensemble of models, each on a different set of data (mini-batch) and with SHARED parameters
Reducing co-adaptation between neurons
Conditions at train and test time are not the same
Dropout probability $q = 0.5$. Weight scaling inference rule: at test time, scale the activations by $q$:
$$a_{\text{test}} = (\theta_1 y_1 + \theta_2 y_2) \cdot q$$
This matches the expected output during training, averaging over the four possible dropout masks:
$$E[a] = \tfrac{1}{4}\big(\theta_1 \cdot 0 + \theta_2 \cdot 0 \;+\; \theta_1 y_1 + \theta_2 \cdot 0 \;+\; \theta_1 \cdot 0 + \theta_2 y_2 \;+\; \theta_1 y_1 + \theta_2 y_2\big) = \tfrac{1}{2}\,(\theta_1 y_1 + \theta_2 y_2)$$
(Figure: a single output unit $a$ with inputs $y_1, y_2$ and weights $\theta_1, \theta_2$.)
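A minimal NumPy sketch of this train/test behaviour (a sketch, not the exact slide code; here q is used as the keep probability, which coincides with the drop probability for q = 0.5):

```python
import numpy as np

q = 0.5   # keep probability (equal to the drop probability for 0.5)

def dropout_forward(a, train=True):
    if train:
        mask = (np.random.rand(*a.shape) < q)   # keep each unit with probability q
        return a * mask
    # Weight scaling inference rule: scale by q instead of dropping units.
    return a * q

a = np.random.randn(4, 8)
print(dropout_forward(a, train=True).mean(), dropout_forward(a, train=False).mean())
```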
Dropout typically requires larger models and more training time.
Batch Normalization [Ioffe and Szegedy, PMLR’15]
D = number of features, N = mini-batch size. Compute the mean and variance of your mini-batch examples over each feature k and normalize:
$$\hat{\mathbf{y}}^{(k)} = \frac{\mathbf{y}^{(k)} - E\big[\mathbf{y}^{(k)}\big]}{\sqrt{\mathrm{Var}\big[\mathbf{y}^{(k)}\big]}}$$
Each feature (gaussian in our example) becomes a unit gaussian.
This gives control over the variance of the inputs to your activation functions.
Insert batch normalization after Fully Connected (or Convolutional) layers and before non-linear activation functions.
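A minimal PyTorch sketch of this placement (the layer sizes are made up):

```python
import torch.nn as nn

# Fully Connected -> BatchNorm -> non-linearity
block = nn.Sequential(
    nn.Linear(128, 64),
    nn.BatchNorm1d(64),   # one mean/variance per feature, computed over the batch
    nn.ReLU(),
)
```

For convolutional layers, nn.BatchNorm2d would take the place of nn.BatchNorm1d.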
Scale and shift: allow the network to change the range of the normalized values. These parameters will be learned during training; batch normalization is a differentiable function, so we can backprop through it.
$$\hat{\mathbf{y}}^{(k)} = \frac{\mathbf{y}^{(k)} - E\big[\mathbf{y}^{(k)}\big]}{\sqrt{\mathrm{Var}\big[\mathbf{y}^{(k)}\big]}}, \qquad \mathbf{z}^{(k)} = \gamma^{(k)}\, \hat{\mathbf{y}}^{(k)} + \beta^{(k)}$$
The network can learn to undo the normalization:
$$\gamma^{(k)} = \sqrt{\mathrm{Var}\big[\mathbf{y}^{(k)}\big]}, \qquad \beta^{(k)} = E\big[\mathbf{y}^{(k)}\big]$$
OK to treat the dimensions separately? It has been shown empirically that, even when the features are not decorrelated, convergence is still faster with this method.
(The biases of the preceding layer can be omitted, because they will be cancelled out by BN anyway.)
What if we can only afford a batch of one image at a time?
– No chance to compute a meaningful mean and variance
$$\hat{\mathbf{y}}^{(k)} = \frac{\mathbf{y}^{(k)} - E\big[\mathbf{y}^{(k)}\big]}{\sqrt{\mathrm{Var}\big[\mathbf{y}^{(k)}\big]}}$$
Training: compute mean and variance from the mini-batch (mini-batch 1, 2, 3, …).
Testing: compute mean and variance by running an exponentially weighted average across the training mini-batches:
$$\mathrm{Var}_{\text{running}} = \beta_m \cdot \mathrm{Var}_{\text{running}} + (1 - \beta_m) \cdot \mathrm{Var}_{\text{minibatch}}$$
$$\mu_{\text{running}} = \beta_m \cdot \mu_{\text{running}} + (1 - \beta_m) \cdot \mu_{\text{minibatch}}$$
$\beta_m$: momentum (hyperparameter). At test time, $\mu_{\text{running}}$ and $\mathrm{Var}_{\text{running}}$ are used in the normalization.
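A NumPy sketch of a batch-norm layer that follows these update rules (a simplified sketch; the eps term and the class interface are my additions, and some frameworks define the momentum the other way around):

```python
import numpy as np

class BatchNorm1D:
    def __init__(self, num_features, momentum=0.9, eps=1e-5):
        self.gamma = np.ones(num_features)    # scale, learned via backprop
        self.beta = np.zeros(num_features)    # shift, learned via backprop
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)
        self.momentum, self.eps = momentum, eps

    def forward(self, y, train=True):
        if train:
            mean, var = y.mean(axis=0), y.var(axis=0)
            # Exponentially weighted averages, as in the update rules above.
            self.running_mean = self.momentum * self.running_mean + (1 - self.momentum) * mean
            self.running_var = self.momentum * self.running_var + (1 - self.momentum) * var
        else:   # testing: use the running statistics
            mean, var = self.running_mean, self.running_var
        y_hat = (y - mean) / np.sqrt(var + self.eps)
        return self.gamma * y_hat + self.beta

bn = BatchNorm1D(num_features=4)
out = bn.forward(np.random.randn(32, 4), train=True)
```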
Benefits of batch normalization: more stable gradients, and a larger range of hyperparameters works similarly when using BN.
[Wu and He, ECCV’18] Group Normalization
(Plots: validation error of batch norm vs. group norm for different batch sizes.)
(Figure: Batch Norm, Layer Norm, Instance Norm and Group Norm differ in which axes of the feature tensor they normalize over: H, W = image size, C = number of channels, N = number of elements in the batch.)
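The difference between the four variants boils down to which axes the mean and variance are computed over. A NumPy sketch (tensor sizes and the number of groups are made up):

```python
import numpy as np

N, C, H, W = 8, 32, 16, 16              # batch size, channels, image size (made up)
G = 4                                   # number of groups for Group Norm
x = np.random.randn(N, C, H, W)

def normalize(t, axes):
    mu = t.mean(axis=axes, keepdims=True)
    var = t.var(axis=axes, keepdims=True)
    return (t - mu) / np.sqrt(var + 1e-5)

bn = normalize(x, (0, 2, 3))            # Batch Norm:    over N, H, W (per channel)
ln = normalize(x, (1, 2, 3))            # Layer Norm:    over C, H, W (per sample)
inorm = normalize(x, (2, 3))            # Instance Norm: over H, W (per sample and channel)
gn = normalize(x.reshape(N, G, C // G, H, W), (2, 3, 4)).reshape(N, C, H, W)  # Group Norm
```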
Width and depth of the network
Concept of a ‘Neuron’
$$\sigma(t) = \frac{1}{1 + e^{-t}}$$
(Figure: a neuron with inputs $y_1, y_2, y_3$, weights $\theta_1, \theta_2, \theta_3$ and output $t$.)
Activation Functions (non-linearities)
e.g., sigmoid: $\frac{1}{1 + e^{-x}}$
Backpropagation
(Figure: computational graph of $f(\mathbf{x}, \mathbf{y}) = \frac{1}{1 + e^{-(x_1 y_1 + x_2 y_2 + c)}}$ with the forward values and the backpropagated gradients at every node.)
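The numbers in that graph can be reproduced with a few lines of NumPy (the assignment of the example values 2, −1, −3, −2, −3 to the individual variables is my reading of the figure):

```python
import numpy as np

# Forward pass: f = 1 / (1 + exp(-(x1*y1 + x2*y2 + c)))
x1, y1, x2, y2, c = 2.0, -1.0, -3.0, -2.0, -3.0
t = x1 * y1 + x2 * y2 + c            # = 1.0
f = 1.0 / (1.0 + np.exp(-t))         # ~ 0.73

# Backward pass: the local gradient of the sigmoid is f * (1 - f)
df_dt = f * (1.0 - f)                # ~ 0.20
grads = {"x1": df_dt * y1,           # ~ -0.20
         "y1": df_dt * x1,           # ~  0.39
         "x2": df_dt * y2,           # ~ -0.39
         "y2": df_dt * x2,           # ~ -0.59
         "c":  df_dt}                # ~  0.20
print(f, grads)
```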
SGD Variations (Momentum, etc.)
Regularization:
– Dropout
– Batch-Norm: $\hat{\mathbf{y}}^{(k)} = \frac{\mathbf{y}^{(k)} - E[\mathbf{y}^{(k)}]}{\sqrt{\mathrm{Var}[\mathbf{y}^{(k)}]}}$
– Weight Regularization, e.g., L1-reg: $R_1(\mathbf{x}) = \sum_{i=1}^{N} |x_i|$
– Data Augmentation
– Weight Initialization (e.g., Xavier/2)
– No structure!!
– It is just brute force!
– Optimization becomes hard
– Performance plateaus / drops!
– Chapter 6: Deep Feedforward Networks
– Chapter 5.5: Regularization in Neural Networks