CS 4803 / 7643: Deep Learning
Zsolt Kira, Georgia Tech
Topics:
– Regularization
– Neural Networks
– Optimization
– Computing Gradients
Administrivia:
– HW0 Reminder: Due 01/18, 11:55pm
– Plagiarism: No Tolerance
– CCB 222 for instructor; CCB 345 for TAs
(C) Dhruv Batra
– GPUs
– Google Colaboratory allows free TPU access!
– Google Cloud Credits
– PACE-ICE
(C) Dhruv Batra and Zsolt Kira
[Figure: linear classifier f(x, W) = Wx + b. Input image: an array of 32x32x3 numbers (3072 numbers total), stretched into a 3072x1 column x; parameters W (10x3072) and b (10x1); output: a 10x1 vector of 10 numbers giving class scores.]
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
[Diagram "Reality": an HxWx3 input passes through a fully-connected (FC) layer and a Softmax, i.e. multi-class logistic regression.]
Example with an image with 4 pixels, and 3 classes (cat/dog/ship):
[Figure: the input image's pixel values 56, 231, 24, 2 are stretched into a column vector.]
Example with an image with 4 pixels, and 3 classes (cat/dog/ship):
[Figure: the same example with numbers filled in: a 3x4 weight matrix W (visible entries include 0.2, 0.1, 2.0, 1.5, 1.3, 2.1, 0.0, 0.25, 0.2) and bias entries 1.1 and 3.2 multiply the pixel column 56, 231, 24, 2 to give the cat, dog, and ship scores, e.g. a dog score of 437.9 and a ship score of 61.95.]
f(x, W) = Wx
– Algebraic viewpoint: a matrix-vector product
– Visual viewpoint: one template per class
– Geometric viewpoint: hyperplanes cutting up space
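The 4-pixel example above can be sketched in NumPy. The full weight matrix is only partially legible in this transcript, so the values below are assumptions filled in from the commonly cited CS 231n version of this example:

```python
import numpy as np

# Pixel values from the slide, stretched into a column
x = np.array([56.0, 231.0, 24.0, 2.0])

# 3x4 weight matrix, one row (template) per class. Entries are assumed,
# borrowed from the standard CS 231n version of this example.
W = np.array([[0.2, -0.5, 0.1,  2.0],   # cat
              [1.5,  1.3, 2.1,  0.0],   # dog
              [0.0, 0.25, 0.2, -0.3]])  # ship
b = np.array([1.1, 3.2, -1.2])

scores = W @ x + b
for cls, s in zip(["cat", "dog", "ship"], scores):
    print(f"{cls} score: {s:.2f}")
```

The dog score reproduces the 437.9 shown on the slide; since the bias entries are assumptions, the other scores may not match the slide exactly.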
TODO:
1. Define a loss function that quantifies our unhappiness with the scores across the training data.
2. Come up with a way of efficiently finding the parameters that minimize the loss function. (optimization)
Cat image by Nikita is licensed under CC-BY 2.0; Car image is CC0 1.0 public domain; Frog image is in the public domain
Suppose: 3 training examples (cat, frog, car) and 3 classes; with some W the scores are as shown on the slide.
Multiclass SVM loss: given an example (x_i, y_i), where x_i is the image and y_i is the (integer) label, and using the shorthand s = f(x_i, W) for the scores vector, the SVM loss has the form:
L_i = Σ_{j ≠ y_i} max(0, s_j - s_{y_i} + 1)
"Hinge loss"
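The hinge loss above is a few lines of NumPy. The scores here are assumed values for one example, with the correct class ("cat") at index 0:

```python
import numpy as np

def svm_loss_one(scores, y, margin=1.0):
    """Multiclass SVM (hinge) loss for a single example.

    scores: 1D array of class scores s = f(x_i, W)
    y: integer index of the correct class y_i
    """
    margins = np.maximum(0, scores - scores[y] + margin)
    margins[y] = 0  # the correct class contributes nothing
    return margins.sum()

scores = np.array([3.2, 5.1, -1.7])  # assumed scores; correct class is index 0
loss = svm_loss_one(scores, y=0)
print(loss)  # max(0, 5.1-3.2+1) + max(0, -1.7-3.2+1) = 2.9 + 0
```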
Softmax Classifier (Multinomial Logistic Regression)
Want to interpret raw classifier scores (for cat, frog, car) as probabilities.
Softmax Function: P(Y = k | X = x_i) = exp(s_k) / Σ_j exp(s_j), with s = f(x_i, W)
– The raw scores are unnormalized log-probabilities / logits
– exp: probabilities must be >= 0 (giving unnormalized probabilities)
– normalize: probabilities must sum to 1 (giving probabilities)
For the cat example: Li = -log(0.13) = 2.04
Maximum Likelihood Estimation: choose probabilities to maximize the likelihood of the observed data.
Compare the predicted probabilities against the correct probabilities (1 for the true class, 0 otherwise): their Kullback–Leibler divergence, which for a one-hot target is exactly the Cross Entropy loss Li = -log P(Y = y_i | X = x_i).
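The softmax pipeline above, sketched in NumPy for assumed scores (3.2, 5.1, -1.7) with the correct class at index 0; the slide's -log(0.13) = 2.04 is the loss on the correct-class probability:

```python
import numpy as np

def softmax_cross_entropy(scores, y):
    """Softmax probabilities and the cross-entropy loss -log P(correct class)."""
    shifted = scores - scores.max()   # stability: exp of large scores overflows
    unnorm = np.exp(shifted)          # exp: probabilities must be >= 0
    probs = unnorm / unnorm.sum()     # normalize: probabilities must sum to 1
    return probs, -np.log(probs[y])

scores = np.array([3.2, 5.1, -1.7])   # assumed logits; correct class is index 0
probs, loss = softmax_cross_entropy(scores, y=0)
print(np.round(probs, 2))             # approximately [0.13 0.87 0.  ]
print(round(loss, 2))                 # approximately 2.04
```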
Data loss: model predictions should match training data.
Regularization: prevent the model from doing too well on training data.
λ = regularization strength (hyperparameter):
L(W) = (1/N) Σ_i L_i(f(x_i, W), y_i) + λ R(W)
[Figure: training points (x, y) fit by two candidate functions f1 and f2.]
Regularization pushes against fitting the data too well so we don't fit noise in the data.
– https://arachnoid.com/polysolve/
– Data points (x, y): (10, 6), (15, 9), (20, 11), (25, 12), (29, 13), (40, 11), (50, 10), (60, 9)
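A quick sketch of the overfitting point using the data above: fit a low-degree and a high-degree polynomial (as the polysolve page does) and compare training error. The degree choices here are arbitrary illustrations:

```python
import numpy as np

# Data points from the slide
x = np.array([10, 15, 20, 25, 29, 40, 50, 60], dtype=float)
y = np.array([6, 9, 11, 12, 13, 11, 10, 9], dtype=float)
xs = x / x.max()  # rescale x for numerical stability of the fit

errs = {}
for deg in (2, 7):
    coeffs = np.polyfit(xs, y, deg)
    errs[deg] = np.sum((y - np.polyval(coeffs, xs)) ** 2)
    print(f"degree {deg}: training error {errs[deg]:.6f}")
```

A degree-7 polynomial through 8 points drives training error to (near) zero, i.e. it fits the noise; regularization, or a simpler model, pushes back against that.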
Simple examples:
– L2 regularization: R(W) = Σ_k Σ_l (W_{k,l})^2
– L1 regularization: R(W) = Σ_k Σ_l |W_{k,l}|
– Elastic net (L1 + L2): R(W) = Σ_k Σ_l (β (W_{k,l})^2 + |W_{k,l}|)
More complex:
– Dropout
– Batch normalization
– Stochastic depth, fractional pooling, etc.
Why regularize?
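The three simple regularizers are one-liners in NumPy (β in the elastic net is a hyperparameter; the default here is arbitrary):

```python
import numpy as np

def l2_reg(W):
    return np.sum(W ** 2)         # sum over k, l of W[k,l]^2

def l1_reg(W):
    return np.sum(np.abs(W))      # sum over k, l of |W[k,l]|

def elastic_net(W, beta=0.5):
    return np.sum(beta * W ** 2 + np.abs(W))

W = np.array([[1.0, -2.0],
              [0.5,  0.0]])
print(l2_reg(W))   # 1 + 4 + 0.25 + 0 = 5.25
print(l1_reg(W))   # 1 + 2 + 0.5 + 0 = 3.5
```

L1 tends to drive weights exactly to zero (sparsity), while L2 spreads weight magnitudes out; the elastic net blends the two.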
e.g. Softmax or SVM for the data loss; the full loss adds the regularization term.
[Diagram "Reality" (recap): an HxWx3 input passes through a fully-connected (FC) layer and a Softmax, i.e. multi-class logistic regression.]
(Before) Linear score function: f = Wx
(Now) 2-layer Neural Network: f = W2 max(0, W1 x)
[Diagram: x (3072) → h (100) via W1 → s (10) via W2]
Full implementation of training a 2-layer Neural Network needs ~20 lines:
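The slide's ~20 lines of code are not reproduced in this transcript; the following is a sketch in the same spirit as the well-known CS 231n example: a 2-layer net with a sigmoid hidden layer, trained with squared-error loss on random data. Sizes and learning rate are that example's illustrative choices, not anything canonical:

```python
import numpy as np

np.random.seed(0)
N, D_in, H, D_out = 64, 1000, 100, 10      # batch, input, hidden, output sizes
x, y = np.random.randn(N, D_in), np.random.randn(N, D_out)
w1, w2 = np.random.randn(D_in, H), np.random.randn(H, D_out)

losses = []
for t in range(300):
    h = 1 / (1 + np.exp(-x.dot(w1)))       # hidden layer: sigmoid(x W1)
    y_pred = h.dot(w2)                     # output scores: h W2
    losses.append(np.square(y_pred - y).sum())

    # Backprop: gradients of the squared-error loss w.r.t. w2 and w1
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h.T.dot(grad_y_pred)
    grad_h = grad_y_pred.dot(w2.T)
    grad_w1 = x.T.dot(grad_h * h * (1 - h))

    w1 -= 1e-4 * grad_w1                   # gradient descent step
    w2 -= 1e-4 * grad_w2

print(losses[0] > losses[-1])              # loss decreases over training
```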
This image by Fotis Bobolas is licensed under CC-BY 2.0
[Diagram: a biological neuron: dendrites carry impulses toward the cell body; the axon carries impulses away from the cell body to the presynaptic terminal. This image by Felipe Perucho is licensed under CC-BY 3.0.]
The artificial analogue applies an activation function to a weighted sum of its inputs, e.g. the sigmoid activation function σ(x) = 1 / (1 + e^(-x)).
Biological Neurons:
– Synapses are not a single weight but a complex non-linear dynamical system
[Dendritic Computation. London and Hausser]
Activation functions: Sigmoid, tanh, ReLU, Leaky ReLU, Maxout, ELU
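Most of these activation functions are one-liners in NumPy (the α values are the usual defaults, but they are hyperparameters; tanh is just np.tanh, and Maxout takes a max over several linear functions so it doesn't fit this one-argument shape):

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def relu(x):
    return np.maximum(0, x)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))

x = np.array([-2.0, 0.0, 2.0])
print(relu(x))       # [0. 0. 2.]
print(sigmoid(0.0))  # 0.5
```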
Image Credit: LeCun et al. '98
[Krizhevsky et al., NIPS12]
Image Credit: Andrej Karpathy, CS231n
"Fully-connected" layers:
– "2-layer Neural Net", or "1-hidden-layer Neural Net"
– "3-layer Neural Net", or "2-hidden-layer Neural Net"
We can efficiently evaluate an entire layer of neurons.
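Every neuron in a layer is one row of a weight matrix, so a whole layer is a single matrix-vector product. A sketch with assumed sizes (a 3-4-4-1 network, in the spirit of the CS 231n forward-pass example):

```python
import numpy as np

f = lambda z: 1.0 / (1.0 + np.exp(-z))   # sigmoid activation, elementwise

rng = np.random.default_rng(1)
x = rng.standard_normal((3, 1))          # input vector
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal((4, 1))
W2, b2 = rng.standard_normal((4, 4)), rng.standard_normal((4, 1))
W3, b3 = rng.standard_normal((1, 4)), rng.standard_normal((1, 1))

h1 = f(W1 @ x + b1)   # first hidden layer: all 4 neurons evaluated at once
h2 = f(W2 @ h1 + b2)  # second hidden layer
out = W3 @ h2 + b3    # output neuron
print(out.shape)      # (1, 1)
```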
[Computational graph: x and W feed the score function s (scores); s feeds the hinge loss; W also feeds a regularizer R; the hinge loss and R are added (+) to give the total loss L.]
Given a library of simple functions, compose them into a complicated function.
Slide Credit: Marc'Aurelio Ranzato, Yann LeCun
(Before) Linear score function: f = Wx
(Now) 2-layer Neural Network: f = W2 max(0, W1 x)
Now: arbitrary composition of linear & non-linear functions
[Diagram: x (3072) → h (100) via W1 → s (10) via W2]
e.g. Softmax or SVM data loss plus regularization gives the full loss.
How do we find the best W?
This image is CC0 1.0 public domain
Strategy 1: Random Search
Let's see how well this works on the test set...
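Strategy 1 can be sketched as follows, with fake data and an assumed multiclass SVM loss standing in for the slide's CIFAR-10 setup (3073 = 3072 pixels plus the bias trick):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3073))   # fake data; bias trick: 3072 pixels + 1
y = rng.integers(0, 10, size=100)      # fake labels for 10 classes

def loss_fn(W):
    """Placeholder data loss: average multiclass SVM loss (assumed choice)."""
    scores = X @ W                                   # (N, 10)
    correct = scores[np.arange(len(y)), y][:, None]
    margins = np.maximum(0, scores - correct + 1)
    margins[np.arange(len(y)), y] = 0
    return margins.sum() / len(y)

bestloss, bestW = float("inf"), None
for _ in range(100):                   # try random parameters, keep the best
    W = rng.standard_normal((3073, 10)) * 0.0001
    loss = loss_fn(W)
    if loss < bestloss:
        bestloss, bestW = loss, W
print(f"best loss found: {bestloss:.4f}")
```

Random search barely improves on the starting loss, which motivates the next strategy.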
Strategy 2: Follow the slope
In 1-dimension, the derivative of a function:
df(x)/dx = lim_{h -> 0} (f(x + h) - f(x)) / h
In multiple dimensions, the gradient is the vector of partial derivatives along each dimension. The slope in any direction is the dot product of the direction with the gradient. The direction of steepest descent is the negative gradient.
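The 1-D definition extends directly to a numerical gradient estimate; a sketch using centered differences on a toy function whose gradient we know analytically:

```python
import numpy as np

def numerical_gradient(f, x, h=1e-5):
    """Estimate the gradient of f at x with centered finite differences."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e.flat[i] = h
        grad.flat[i] = (f(x + e) - f(x - e)) / (2 * h)
    return grad

f = lambda x: np.sum(x ** 2)      # analytic gradient: 2x
x = np.array([1.0, -2.0, 3.0])
print(numerical_gradient(f, x))   # approximately [ 2. -4.  6.]
```

This is slow (one or two function evaluations per dimension) but useful as a check on analytic gradients.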
The full sum is expensive when N is large! Approximate the sum using a minibatch of examples; sizes of 32 / 64 / 128 are common.
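A sketch of minibatch SGD on a toy least-squares problem; the data, model, and learning rate are all stand-ins, and the point is only the sampling of a minibatch per step:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 10_000, 5
X = rng.standard_normal((N, D))
true_w = rng.standard_normal(D)
y = X @ true_w + 0.01 * rng.standard_normal(N)    # noisy linear targets

w = np.zeros(D)
lr, batch_size = 0.05, 64                         # 32 / 64 / 128 are common
for t in range(500):
    idx = rng.integers(0, N, size=batch_size)     # sample a minibatch
    Xb, yb = X[idx], y[idx]
    grad = 2 * Xb.T @ (Xb @ w - yb) / batch_size  # minibatch gradient of MSE
    w -= lr * grad                                # negative gradient step
print(np.round(np.abs(w - true_w).max(), 3))      # distance to true_w
```

Each step uses a noisy but cheap gradient estimate from 64 examples instead of all 10,000, yet w still converges close to true_w.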
[Figure: loss contours over weights (W_1, W_2); the arrow points in the negative gradient direction.]
Vanilla Gradient Descent (better variants to come)
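The vanilla loop is just "step along the negative gradient" with a fixed step size; here evaluate_gradient is a stand-in for backprop on the real loss, applied to a toy quadratic:

```python
import numpy as np

target = np.array([3.0, -1.0])

def evaluate_gradient(w):
    """Gradient of a toy loss L(w) = ||w - target||^2 (stand-in for the real loss)."""
    return 2 * (w - target)

w = np.zeros(2)                  # weight initialization
step_size = 0.1                  # learning rate (hyperparameter)
for _ in range(200):             # the slide's "while True", bounded here
    w -= step_size * evaluate_gradient(w)
print(np.round(w, 3))            # converges to the minimizer [ 3. -1.]
```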