

SLIDE 1

CS 4803 / 7643: Deep Learning

Zsolt Kira Georgia Tech

Topics:

– Regularization
– Neural Networks
– Optimization
– Computing Gradients

SLIDE 2

Administrivia

  • HW0 Reminder

– Due: 01/18, 11:55pm

  • Plagiarism – No Tolerance
  • Office hours have started (one for every day!)

– CCB 222 for instructor
– CCB 345 for TAs

  • Sign up for Piazza if you haven’t!

(C) Dhruv Batra 2

SLIDE 3

Computing

  • Major bottleneck

– GPUs

  • Options

– Google Colaboratory allows free TPU access!

  • https://colab.research.google.com/notebooks/welcome.ipynb

– Google Cloud Credits

  • courtesy of Google – details forthcoming for the next HW

– PACE-ICE

  • https://pace.gatech.edu/sites/default/files/pace-ice_orientation_1.pdf

(C) Dhruv Batra and Zsolt Kira 3

SLIDE 4

Recap from last time

(C) Dhruv Batra and Zsolt Kira 4

SLIDE 5

Parametric Approach: Linear Classifier

Image: array of 32x32x3 numbers (3072 numbers total)
W: parameters, or weights

f(x,W) = Wx + b
(10x1)  (10x3072)(3072x1)  (10x1)

Output: 10 numbers giving class scores

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
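To make the shapes concrete, here is a minimal numpy sketch of the score function above (the random weights and the seed are placeholders, not trained values):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal((3072, 1))           # flattened 32x32x3 image
    W = 0.01 * rng.standard_normal((10, 3072))   # one row of weights per class
    b = np.zeros((10, 1))                        # one bias per class

    scores = W @ x + b           # shapes: (10,3072)@(3072,1)+(10,1) -> (10,1)
    print(scores.shape)          # (10, 1): one score per class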

SLIDE 6

Error Decomposition

(C) Dhruv Batra and Zsolt Kira 6

[Diagram: error decomposition, the gap between Reality and the model; the model here is multi-class logistic regression: Input (HxWx3) -> FC -> Softmax]

SLIDE 7

Example with an image with 4 pixels, and 3 classes (cat/dog/ship)

Input image (4 pixels): 56, 231, 24, 2

Stretch pixels into column: x = [56, 231, 24, 2]^T

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 8

Example with an image with 4 pixels, and 3 classes (cat/dog/ship)

Stretch pixels into column: x = [56, 231, 24, 2]^T

f(x,W) = Wx + b:

W =                            b =
   0.2  -0.5   0.1   2.0          1.1
   1.5   1.3   2.1   0.0          3.2
   0.0   0.25  0.2  -0.3         -1.2

Scores: -96.8 (cat), 437.9 (dog), 61.95 (ship)

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 9

Linear Classifier: Three Viewpoints

Algebraic Viewpoint: f(x,W) = Wx
Visual Viewpoint: one template per class
Geometric Viewpoint: hyperplanes cutting up space

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 10

Recall from last time: Linear Classifier

TODO:
1. Define a loss function that quantifies our unhappiness with the scores across the training data.
2. Come up with a way of efficiently finding the parameters that minimize the loss function. (optimization)

Cat image by Nikita is licensed under CC-BY 2.0; Car image is CC0 1.0 public domain; Frog image is in the public domain

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 11

Softmax vs. SVM

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 12

Multiclass SVM loss

Suppose: 3 training examples, 3 classes. With some W the scores f(x,W) = Wx are:

              cat image   car image   frog image
cat score        3.2         1.3         2.2
car score        5.1         4.9         2.5
frog score      -1.7         2.0        -3.1

Given an example (x_i, y_i), where x_i is the image and y_i is the (integer) label, and using the shorthand s = f(x_i, W) for the scores vector, the SVM loss has the form:

L_i = sum_{j != y_i} max(0, s_j - s_{y_i} + 1)

“Hinge loss”

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
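As a sanity check, the loss above can be computed for all three examples at once; a small numpy sketch (the per-example losses work out to 2.9, 0.0, and 12.9):

    import numpy as np

    # scores[class, example]; rows are the cat/car/frog scores from the table
    scores = np.array([[ 3.2, 1.3,  2.2],
                       [ 5.1, 4.9,  2.5],
                       [-1.7, 2.0, -3.1]])
    y = np.array([0, 1, 2])   # correct class index for each example

    cols = np.arange(scores.shape[1])
    margins = np.maximum(0, scores - scores[y, cols] + 1.0)  # max(0, s_j - s_{y_i} + 1)
    margins[y, cols] = 0                                     # skip the j == y_i term
    print(margins.sum(axis=0))                               # [ 2.9  0.  12.9]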



SLIDE 15

Softmax Classifier (Multinomial Logistic Regression)

Want to interpret raw classifier scores as probabilities

Scores (cat, car, frog): 3.2, 5.1, -1.7

Softmax Function: P(Y = k | X = x_i) = e^{s_k} / sum_j e^{s_j}, with s = f(x_i; W)

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 16

Softmax Classifier (Multinomial Logistic Regression)

Want to interpret raw classifier scores as probabilities

Softmax Function: P(Y = k | X = x_i) = e^{s_k} / sum_j e^{s_j}

Scores (cat, car, frog): 3.2, 5.1, -1.7
exp -> unnormalized probabilities: 24.5, 164.0, 0.18 (probabilities must be >= 0)

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 17

Softmax Classifier (Multinomial Logistic Regression)

Want to interpret raw classifier scores as probabilities

Softmax Function: P(Y = k | X = x_i) = e^{s_k} / sum_j e^{s_j}

Scores (cat, car, frog): 3.2, 5.1, -1.7
exp -> unnormalized probabilities: 24.5, 164.0, 0.18 (probabilities must be >= 0)
normalize -> probabilities: 0.13, 0.87, 0.00 (probabilities must sum to 1)

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 18

Softmax Classifier (Multinomial Logistic Regression)

Want to interpret raw classifier scores as probabilities

Softmax Function: P(Y = k | X = x_i) = e^{s_k} / sum_j e^{s_j}

Scores (cat, car, frog): 3.2, 5.1, -1.7 (the raw scores are unnormalized log-probabilities / logits)
exp -> unnormalized probabilities: 24.5, 164.0, 0.18 (probabilities must be >= 0)
normalize -> probabilities: 0.13, 0.87, 0.00 (probabilities must sum to 1)

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 19

Softmax Classifier (Multinomial Logistic Regression)

Want to interpret raw classifier scores as probabilities

Softmax Function: P(Y = k | X = x_i) = e^{s_k} / sum_j e^{s_j}

Scores (cat, car, frog): 3.2, 5.1, -1.7 (unnormalized log-probabilities / logits)
exp -> unnormalized probabilities: 24.5, 164.0, 0.18 (probabilities must be >= 0)
normalize -> probabilities: 0.13, 0.87, 0.00 (probabilities must sum to 1)

L_i = -log P(Y = y_i | X = x_i)
Li = -log(0.13) = 2.04

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 20

Softmax Classifier (Multinomial Logistic Regression)

Want to interpret raw classifier scores as probabilities

Softmax Function: P(Y = k | X = x_i) = e^{s_k} / sum_j e^{s_j}

Scores (cat, car, frog): 3.2, 5.1, -1.7 (unnormalized log-probabilities / logits)
exp -> unnormalized probabilities: 24.5, 164.0, 0.18 (probabilities must be >= 0)
normalize -> probabilities: 0.13, 0.87, 0.00 (probabilities must sum to 1)

L_i = -log P(Y = y_i | X = x_i)
Li = -log(0.13) = 2.04

Maximum Likelihood Estimation: choose probabilities to maximize the likelihood of the observed data

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
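A numpy sketch of the full softmax pipeline above, reproducing the slide's numbers (subtracting the max before exponentiating is a standard numerical-stability trick, not something the slide shows):

    import numpy as np

    scores = np.array([3.2, 5.1, -1.7])     # cat, car, frog
    unnorm = np.exp(scores - scores.max())  # unnormalized probabilities
    probs = unnorm / unnorm.sum()           # normalize so they sum to 1

    Li = -np.log(probs[0])                  # correct class is cat
    print(probs.round(2), Li.round(2))      # [0.13 0.87 0.  ] 2.04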

SLIDE 21

Log-Likelihood / KL-Divergence / Cross-Entropy

(C) Dhruv Batra and Zsolt Kira 21
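The slide body was not preserved, but the relationship the title refers to can be checked numerically: with a one-hot target P, cross-entropy H(P,Q) = H(P) + D_KL(P||Q) collapses to the negative log-likelihood of the correct class. A small sketch (my own illustration, not the slide's content):

    import numpy as np

    P = np.array([1.0, 0.0, 0.0])        # correct (one-hot) distribution
    Q = np.array([0.13, 0.87, 0.00097])  # softmax output from the earlier slides

    eps = 1e-12                          # guard against log(0)
    H_P = -(P * np.log(P + eps)).sum()   # entropy of P (0 for a one-hot target)
    KL = (P * np.log((P + eps) / (Q + eps))).sum()
    CE = -(P * np.log(Q + eps)).sum()

    print(CE, H_P + KL)                  # both ~2.04 = -log(0.13)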

SLIDE 22

Softmax Classifier (Multinomial Logistic Regression)

Want to interpret raw classifier scores as probabilities

Scores (cat, car, frog): 3.2, 5.1, -1.7 (unnormalized log-probabilities / logits)
exp -> unnormalized probabilities: 24.5, 164.0, 0.18
normalize -> computed probabilities: 0.13, 0.87, 0.00

Compare to the correct probabilities: 1.00, 0.00, 0.00

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 23

Softmax Classifier (Multinomial Logistic Regression)

Want to interpret raw classifier scores as probabilities

Scores (cat, car, frog): 3.2, 5.1, -1.7 (unnormalized log-probabilities / logits)
exp -> unnormalized probabilities: 24.5, 164.0, 0.18
normalize -> computed probabilities: 0.13, 0.87, 0.00

Compare to the correct probabilities (1.00, 0.00, 0.00) with the Kullback–Leibler divergence:
D_KL(P || Q) = sum_y P(y) log ( P(y) / Q(y) )

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 24

Softmax Classifier (Multinomial Logistic Regression)

Want to interpret raw classifier scores as probabilities

Scores (cat, car, frog): 3.2, 5.1, -1.7 (unnormalized log-probabilities / logits)
exp -> unnormalized probabilities: 24.5, 164.0, 0.18
normalize -> computed probabilities: 0.13, 0.87, 0.00

Compare to the correct probabilities (1.00, 0.00, 0.00) with the Cross Entropy:
H(P, Q) = H(P) + D_KL(P || Q)

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 25

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 26

Plan for Today

  • Regularization
  • Neural Networks
  • Optimization
  • Computing Gradients

(C) Dhruv Batra and Zsolt Kira 26

SLIDE 27

Regularization

L(W) = (1/N) sum_{i=1}^{N} L_i(f(x_i, W), y_i)

Data loss: model predictions should match training data

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 28

Regularization

L(W) = (1/N) sum_{i=1}^{N} L_i(f(x_i, W), y_i) + lambda R(W)

Data loss: model predictions should match training data
Regularization: prevent the model from doing too well on training data

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 29

Regularization

L(W) = (1/N) sum_{i=1}^{N} L_i(f(x_i, W), y_i) + lambda R(W)

Data loss: model predictions should match training data
Regularization: prevent the model from doing too well on training data
lambda = regularization strength (hyperparameter)

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 30

Model Complexity

[Plot: training data points, y vs. x]

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 31

Polynomial Regression

[Plot: data points y vs. x with a polynomial fit f]

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 32

Regularization: Prefer Simpler Models

[Plot: data points y vs. x with a complex fit f1 and a simpler fit f2]

Regularization pushes against fitting the data too well so we don’t fit noise in the data

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 33

Polynomial Regression

(C) Dhruv Batra and Zsolt Kira 33

SLIDE 34

Polynomial Regression

  • Demo:

– https://arachnoid.com/polysolve/

  • Data, as (x, y) pairs (a fitting sketch follows below):

– (10, 6)
– (15, 9)
– (20, 11)
– (25, 12)
– (29, 13)
– (40, 11)
– (50, 10)
– (60, 9)

(C) Dhruv Batra and Zsolt Kira 35
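To reproduce what the demo shows, here is a small numpy sketch fitting the data above with polynomials of increasing degree; training error keeps shrinking as the degree grows, even though the higher-degree fits are chasing noise (the degrees chosen here are illustrative):

    import numpy as np

    x = np.array([10, 15, 20, 25, 29, 40, 50, 60], dtype=float)
    y = np.array([ 6,  9, 11, 12, 13, 11, 10,  9], dtype=float)

    for deg in (1, 2, 5, 7):
        coeffs = np.polyfit(x, y, deg)                   # least-squares fit
        mse = np.mean((np.polyval(coeffs, x) - y) ** 2)  # training error
        print(f"degree {deg}: train MSE = {mse:.4f}")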

SLIDE 35

Regularization

L(W) = (1/N) sum_{i=1}^{N} L_i(f(x_i, W), y_i) + lambda R(W)

Data loss: model predictions should match training data
Regularization: prevent the model from doing too well on training data
lambda = regularization strength (hyperparameter)

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 36

Regularization

L(W) = (1/N) sum_{i=1}^{N} L_i(f(x_i, W), y_i) + lambda R(W)
lambda = regularization strength (hyperparameter)

Simple examples
L2 regularization: R(W) = sum_k sum_l W_{k,l}^2
L1 regularization: R(W) = sum_k sum_l |W_{k,l}|
Elastic net (L1 + L2): R(W) = sum_k sum_l (beta W_{k,l}^2 + |W_{k,l}|)

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
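A short numpy sketch of the three penalties (beta is the elastic-net mixing hyperparameter from the formula above; the weight matrix is a random stand-in):

    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.standard_normal((10, 3072))

    l2 = np.sum(W ** 2)           # sum_k sum_l W[k,l]^2
    l1 = np.sum(np.abs(W))        # sum_k sum_l |W[k,l]|
    beta = 0.5                    # elastic-net mixing weight (hyperparameter)
    elastic = np.sum(beta * W ** 2 + np.abs(W))

    print(l2, l1, elastic)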

SLIDE 37

Regularization

L(W) = (1/N) sum_{i=1}^{N} L_i(f(x_i, W), y_i) + lambda R(W)
lambda = regularization strength (hyperparameter)

Simple examples: L2 regularization, L1 regularization, elastic net (L1 + L2)
More complex: dropout, batch normalization, stochastic depth, fractional pooling, etc.

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 38

Regularization

L(W) = (1/N) sum_{i=1}^{N} L_i(f(x_i, W), y_i) + lambda R(W)
lambda = regularization strength (hyperparameter)

Why regularize?

  • Express preferences over weights
  • Make the model simple so it works on test data
  • Improve optimization by adding curvature

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 39

Recap

  • We have some dataset of (x, y)
  • We have a score function: s = f(x; W) = Wx
  • We have a loss function, e.g.:

Softmax: L_i = -log( e^{s_{y_i}} / sum_j e^{s_j} )
SVM: L_i = sum_{j != y_i} max(0, s_j - s_{y_i} + 1)
Full loss: L = (1/N) sum_i L_i + R(W)

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n


SLIDE 41

Next: Neural Networks

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 42

Why? Image Features

SLIDE 43

SLIDE 44

SLIDE 45

Histogram of Oriented Gradients (HOG)

SLIDE 46

Bag of Words

SLIDE 47

SLIDE 48

Error Decomposition

(C) Dhruv Batra and Zsolt Kira 53

[Diagram: error decomposition, the gap between Reality and the model; the model here is multi-class logistic regression: Input (HxWx3) -> FC -> Softmax]

SLIDE 49

Neural networks: without the brain stuff

(Before) Linear score function: f = Wx

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 50

Neural networks: without the brain stuff

(Before) Linear score function: f = Wx
(Now) 2-layer Neural Network: f = W2 max(0, W1 x)

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 51

Neural networks: without the brain stuff

(Before) Linear score function: f = Wx
(Now) 2-layer Neural Network: f = W2 max(0, W1 x)

x (3072-dim input) --W1--> h (100-dim hidden) --W2--> s (10 class scores)

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
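In code the 2-layer score function is one matrix product, an elementwise max, and another matrix product (sizes follow the slide; the random weights are placeholders):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal((3072, 1))            # input image, flattened
    W1 = 0.01 * rng.standard_normal((100, 3072))  # first-layer weights
    W2 = 0.01 * rng.standard_normal((10, 100))    # second-layer weights

    h = np.maximum(0, W1 @ x)   # hidden layer: max(0, W1 x)
    s = W2 @ h                  # 10 class scores
    print(s.shape)              # (10, 1)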


SLIDE 53

Neural networks: without the brain stuff

(Before) Linear score function: f = Wx
(Now) 2-layer Neural Network: f = W2 max(0, W1 x)
or 3-layer Neural Network: f = W3 max(0, W2 max(0, W1 x))

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 54

Full implementation of training a 2-layer Neural Network needs ~20 lines:

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
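The twenty lines themselves were an image in the original slide; below is a sketch in the same spirit (random data, a sigmoid hidden layer, squared-error loss, and manual backprop; all sizes and the step size are illustrative):

    import numpy as np
    from numpy.random import randn

    N, D_in, H, D_out = 64, 1000, 100, 10
    x, y = randn(N, D_in), randn(N, D_out)    # random "training data"
    w1, w2 = randn(D_in, H), randn(H, D_out)  # random initialization

    for t in range(2000):
        h = 1.0 / (1.0 + np.exp(-x.dot(w1)))  # sigmoid hidden layer
        y_pred = h.dot(w2)
        loss = np.square(y_pred - y).sum()    # squared-error loss

        # backprop: chain rule through w2, then h, then w1
        grad_y_pred = 2.0 * (y_pred - y)
        grad_w2 = h.T.dot(grad_y_pred)
        grad_h = grad_y_pred.dot(w2.T)
        grad_w1 = x.T.dot(grad_h * h * (1 - h))  # sigmoid'(z) = h(1 - h)

        w1 -= 1e-4 * grad_w1                  # gradient descent step
        w2 -= 1e-4 * grad_w2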

SLIDE 55

This image by Fotis Bobolas is licensed under CC-BY 2.0

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 56

Impulses are carried toward the cell body (via dendrites) and away from the cell body (via the axon).
Labels: dendrite, cell body, axon, presynaptic terminal

This image by Felipe Perucho is licensed under CC-BY 3.0

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n


SLIDE 58

sigmoid activation function: sigma(x) = 1 / (1 + e^{-x})

Impulses are carried toward the cell body (via dendrites) and away from the cell body (via the axon).
Labels: dendrite, cell body, axon, presynaptic terminal

This image by Felipe Perucho is licensed under CC-BY 3.0

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n


SLIDE 60

Be very careful with your brain analogies!

Biological Neurons:

  • Many different types
  • Dendrites can perform complex non-linear computations
  • Synapses are not a single weight but a complex non-linear dynamical system
  • Rate code may not be adequate

[Dendritic Computation. London and Hausser]

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 61

Activation functions: Sigmoid, tanh, ReLU, Leaky ReLU, Maxout, ELU

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
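For reference, common closed forms for these activations; a sketch, since the leak constant and the ELU alpha vary across papers:

    import numpy as np

    def sigmoid(x):             # squashes to (0, 1)
        return 1.0 / (1.0 + np.exp(-x))

    def tanh(x):                # zero-centered; tanh(x) = 2*sigmoid(2x) - 1
        return np.tanh(x)

    def relu(x):                # max(0, x)
        return np.maximum(0, x)

    def leaky_relu(x, a=0.1):   # small negative slope instead of 0
        return np.maximum(a * x, x)

    def elu(x, alpha=1.0):      # smooth saturation for x < 0
        return np.where(x > 0, x, alpha * (np.exp(x) - 1))

    # Maxout takes the max over several linear functions of x, so unlike the
    # others it needs extra weight vectors and is not a fixed elementwise map.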

SLIDE 62

Activation Functions

  • sigmoid vs tanh

(C) Dhruv Batra and Zsolt Kira 69

SLIDE 63

A quick note

(C) Dhruv Batra and Zsolt Kira 70 Image Credit: LeCun et al. ‘98

SLIDE 64

Rectified Linear Units (ReLU): f(x) = max(0, x)

(C) Dhruv Batra and Zsolt Kira 71

[Krizhevsky et al., NIPS12]

SLIDE 65

Limitation

  • A single “neuron” is still a linear decision boundary
  • What to do?
  • Idea: Stack a bunch of them together!

(C) Dhruv Batra and Zsolt Kira 72

SLIDE 66

Multilayer Networks

  • Cascade neurons together
  • The output from one layer is the input to the next
  • Each layer has its own set of weights

(C) Dhruv Batra and Zsolt Kira 73

Image Credit: Andrej Karpathy, CS231n

SLIDE 67

Neural networks: Architectures

“Fully-connected” layers
“2-layer Neural Net”, or “1-hidden-layer Neural Net”
“3-layer Neural Net”, or “2-hidden-layer Neural Net”

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 68

Example feed-forward computation of a neural network

We can efficiently evaluate an entire layer of neurons.

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
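The code on the slide was an image; here is a sketch of the same idea for a 3-layer network with sigmoid activations, where each line evaluates an entire layer at once (layer sizes are illustrative):

    import numpy as np

    f = lambda z: 1.0 / (1.0 + np.exp(-z))    # activation function (sigmoid)

    rng = np.random.default_rng(0)
    x = rng.standard_normal((3, 1))           # input vector
    W1, b1 = rng.standard_normal((4, 3)), np.zeros((4, 1))
    W2, b2 = rng.standard_normal((4, 4)), np.zeros((4, 1))
    W3, b3 = rng.standard_normal((1, 4)), np.zeros((1, 1))

    h1 = f(W1 @ x + b1)    # first hidden layer
    h2 = f(W2 @ h1 + b2)   # second hidden layer
    out = W3 @ h2 + b3     # output neuron
    print(out)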

SLIDE 69

Example feed-forward computation of a neural network

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 70

Computational Graph

[Graph: x and W feed a multiply node (*) producing scores s; s feeds the hinge loss; a regularization term R(W) joins the hinge loss at an add node (+) to produce the total loss L]

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 71

Before: Logistic Regression as Cascade

(C) Dhruv Batra 78

Given a library of simple functions, compose them into a complicated function

Slide Credit: Marc'Aurelio Ranzato, Yann LeCun

SLIDE 72

(C) Dhruv Batra and Zsolt Kira 79

(Before) Linear score function: f = Wx
(Now) 2-layer Neural Network: f = W2 max(0, W1 x)

x (3072-dim input) --W1--> h (100-dim hidden) --W2--> s (10 class scores)

Now: arbitrary composition of linear and non-linear functions

SLIDE 73

Demo Time

  • https://playground.tensorflow.org
SLIDE 74

Optimization

SLIDE 75

Recap

  • We have some dataset of (x, y)
  • We have a score function: s = f(x; W) = Wx
  • We have a loss function, e.g.:

Softmax: L_i = -log( e^{s_{y_i}} / sum_j e^{s_j} )
SVM: L_i = sum_{j != y_i} max(0, s_j - s_{y_i} + 1)
Full loss: L = (1/N) sum_i L_i + R(W)

How do we find the best W?

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 76

This image is CC0 1.0 public domain

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 77

Strategy 1: Random Search
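The slide's code was an image; a sketch of the guess-and-check idea on a toy softmax loss (the data, sizes, and number of trials are stand-ins):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.standard_normal((100, 3073))        # toy data (bias trick: +1 column)
    y = rng.integers(0, 10, 100)                # toy labels

    def loss_fn(W):                             # mean softmax loss
        s = X @ W.T
        s -= s.max(axis=1, keepdims=True)
        p = np.exp(s) / np.exp(s).sum(axis=1, keepdims=True)
        return -np.log(p[np.arange(len(y)), y]).mean()

    best_loss, best_W = float("inf"), None
    for _ in range(1000):                       # generate random parameters
        W = 1e-4 * rng.standard_normal((10, 3073))
        l = loss_fn(W)
        if l < best_loss:                       # keep the best guess so far
            best_loss, best_W = l, W
    print(best_loss)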

SLIDE 78

Let's see how well this works on the test set...

What other methods can we use?

SLIDE 79

Strategy 2: Follow the slope

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 80

Strategy: Follow the slope

In 1 dimension, the derivative of a function:
df(x)/dx = lim_{h -> 0} ( f(x + h) - f(x) ) / h

In multiple dimensions, the gradient is the vector of partial derivatives along each dimension. The slope in any direction is the dot product of the direction with the gradient. The direction of steepest descent is the negative gradient.

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
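The 1-D definition translates directly into a numerical gradient check; a sketch using the centered-difference variant (h is a small illustrative step):

    import numpy as np

    def numerical_gradient(f, x, h=1e-5):
        grad = np.zeros_like(x)
        for i in range(x.size):                # wiggle one dimension at a time
            e = np.zeros_like(x)
            e[i] = h
            grad[i] = (f(x + e) - f(x - e)) / (2 * h)   # centered difference
        return grad

    f = lambda x: np.sum(x ** 2)               # toy function, true gradient 2x
    x0 = np.array([1.0, -3.0, 2.0])
    print(numerical_gradient(f, x0))           # ~[ 2. -6.  4.]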

SLIDE 81

Gradient Descent

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
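The update rule itself is two lines; here it is on a toy quadratic loss so the sketch runs end to end (the loss, its analytic gradient, and the step size are illustrative):

    import numpy as np

    target = np.array([1.0, -2.0])        # minimum of the toy loss

    def loss(w):
        return np.sum((w - target) ** 2)

    def evaluate_gradient(w):             # analytic gradient of the toy loss
        return 2.0 * (w - target)

    w = np.zeros(2)
    step_size = 0.1                       # learning rate (hyperparameter)
    for _ in range(100):
        w -= step_size * evaluate_gradient(w)  # step along the negative gradient
    print(w, loss(w))                     # w -> [1, -2], loss -> ~0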

SLIDE 82

Stochastic Gradient Descent (SGD)

Full loss: L(W) = (1/N) sum_{i=1}^{N} L_i(x_i, y_i, W) + lambda R(W)

The full sum is expensive when N is large! Approximate the sum using a minibatch of examples; 32 / 64 / 128 are common.

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
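A sketch of the minibatch loop on a toy least-squares problem (the dataset, batch size of 64, and step size are stand-ins for whatever loss is being trained):

    import numpy as np

    rng = np.random.default_rng(0)
    N = 10_000
    X = rng.standard_normal((N, 20))
    w_true = rng.standard_normal(20)
    y = X @ w_true + 0.1 * rng.standard_normal(N)

    w, step_size, batch = np.zeros(20), 0.05, 64
    for t in range(2000):
        idx = rng.integers(0, N, batch)              # sample a minibatch
        Xb, yb = X[idx], y[idx]
        grad = (2.0 / batch) * Xb.T @ (Xb @ w - yb)  # minibatch gradient of MSE
        w -= step_size * grad                        # SGD update
    print(np.linalg.norm(w - w_true))                # small: w approaches w_true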

SLIDE 83

[Figure: loss contours over weights W_1 and W_2; starting from the original W, each update steps in the negative gradient direction]

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 84

Vanilla Gradient Descent vs. Better Variants

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n