

SLIDE 1

CS 4803 / 7643: Deep Learning

Zsolt Kira Georgia Tech

Topics:

– Regularization
– Neural Networks
– Optimization
– Computing Gradients

SLIDE 2

Administrivia

  • HW0 Reminder

– Due: 01/18, 11:55pm

  • Plagiarism – No Tolerance
  • Office hours have started (one for every day!)

– CCB 222 for instructor
– CCB 345 for TAs

  • Sign up for Piazza if you haven’t!

(C) Dhruv Batra 2

SLIDE 3

Computing

  • Major bottleneck

– GPUs

  • Options

– Google Colaboratory allows free TPU access!

  • https://colab.research.google.com/notebooks/welcome.ipynb

– Google Cloud Credits

  • courtesy of Google – details forthcoming for the next HW

– PACE-ICE

  • https://pace.gatech.edu/sites/default/files/pace-ice_orientation_1.pdf

(C) Dhruv Batra and Zsolt Kira 3

SLIDE 4

Recap from last time

(C) Dhruv Batra and Zsolt Kira 4

SLIDE 5

Parametric Approach: Linear Classifier

Image: array of 32x32x3 numbers (3072 numbers total)
W: parameters, or weights

f(x,W) = Wx + b
(10x1)  (10x3072)(3072x1)  (10x1)

Output: 10 numbers giving class scores

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
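To make the shapes concrete, here is a minimal numpy sketch of the score function above (the random weights and the seed are placeholders, not trained values):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal((3072, 1))           # flattened 32x32x3 image
    W = 0.01 * rng.standard_normal((10, 3072))   # one row of weights per class
    b = np.zeros((10, 1))                        # one bias per class

    scores = W @ x + b           # shapes: (10,3072)@(3072,1)+(10,1) -> (10,1)
    print(scores.shape)          # (10, 1): one score per class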

SLIDE 6

Error Decomposition

(C) Dhruv Batra and Zsolt Kira 6

[Diagram: error decomposition, the gap between Reality and the model; the model here is multi-class logistic regression: Input (HxWx3) -> FC -> Softmax]

SLIDE 7

Example with an image with 4 pixels, and 3 classes (cat/dog/ship)

Input image (4 pixels): 56, 231, 24, 2

Stretch pixels into column: x = [56, 231, 24, 2]^T

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 8

Example with an image with 4 pixels, and 3 classes (cat/dog/ship)

Stretch pixels into column: x = [56, 231, 24, 2]^T

f(x,W) = Wx + b:

W =                            b =
   0.2  -0.5   0.1   2.0          1.1
   1.5   1.3   2.1   0.0          3.2
   0.0   0.25  0.2  -0.3         -1.2

Scores: -96.8 (cat), 437.9 (dog), 61.95 (ship)

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 9

Linear Classifier: Three Viewpoints

Algebraic Viewpoint: f(x,W) = Wx
Visual Viewpoint: one template per class
Geometric Viewpoint: hyperplanes cutting up space

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 10

Recall from last time: Linear Classifier

TODO:
1. Define a loss function that quantifies our unhappiness with the scores across the training data.
2. Come up with a way of efficiently finding the parameters that minimize the loss function. (optimization)

Cat image by Nikita is licensed under CC-BY 2.0; Car image is CC0 1.0 public domain; Frog image is in the public domain

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 11

Softmax vs. SVM

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 12

Multiclass SVM loss

Suppose: 3 training examples, 3 classes. With some W the scores f(x,W) = Wx are:

              cat image   car image   frog image
cat score        3.2         1.3         2.2
car score        5.1         4.9         2.5
frog score      -1.7         2.0        -3.1

Given an example (x_i, y_i), where x_i is the image and y_i is the (integer) label, and using the shorthand s = f(x_i, W) for the scores vector, the SVM loss has the form:

L_i = sum_{j != y_i} max(0, s_j - s_{y_i} + 1)

“Hinge loss”

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
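As a sanity check, the loss above can be computed for all three examples at once; a small numpy sketch (the per-example losses work out to 2.9, 0.0, and 12.9):

    import numpy as np

    # scores[class, example]; rows are the cat/car/frog scores from the table
    scores = np.array([[ 3.2, 1.3,  2.2],
                       [ 5.1, 4.9,  2.5],
                       [-1.7, 2.0, -3.1]])
    y = np.array([0, 1, 2])   # correct class index for each example

    cols = np.arange(scores.shape[1])
    margins = np.maximum(0, scores - scores[y, cols] + 1.0)  # max(0, s_j - s_{y_i} + 1)
    margins[y, cols] = 0                                     # skip the j == y_i term
    print(margins.sum(axis=0))                               # [ 2.9  0.  12.9]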



SLIDE 15

Softmax Classifier (Multinomial Logistic Regression)

Want to interpret raw classifier scores as probabilities

Scores (cat, car, frog): 3.2, 5.1, -1.7

Softmax Function: P(Y = k | X = x_i) = e^{s_k} / sum_j e^{s_j}, with s = f(x_i; W)

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 16

Softmax Classifier (Multinomial Logistic Regression)

Want to interpret raw classifier scores as probabilities

Softmax Function: P(Y = k | X = x_i) = e^{s_k} / sum_j e^{s_j}

Scores (cat, car, frog): 3.2, 5.1, -1.7
exp -> unnormalized probabilities: 24.5, 164.0, 0.18 (probabilities must be >= 0)

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 17

Softmax Classifier (Multinomial Logistic Regression)

Want to interpret raw classifier scores as probabilities

Softmax Function: P(Y = k | X = x_i) = e^{s_k} / sum_j e^{s_j}

Scores (cat, car, frog): 3.2, 5.1, -1.7
exp -> unnormalized probabilities: 24.5, 164.0, 0.18 (probabilities must be >= 0)
normalize -> probabilities: 0.13, 0.87, 0.00 (probabilities must sum to 1)

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 18

Softmax Classifier (Multinomial Logistic Regression)

Want to interpret raw classifier scores as probabilities

Softmax Function: P(Y = k | X = x_i) = e^{s_k} / sum_j e^{s_j}

Scores (cat, car, frog): 3.2, 5.1, -1.7 (the raw scores are unnormalized log-probabilities / logits)
exp -> unnormalized probabilities: 24.5, 164.0, 0.18 (probabilities must be >= 0)
normalize -> probabilities: 0.13, 0.87, 0.00 (probabilities must sum to 1)

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 19

Softmax Classifier (Multinomial Logistic Regression)

Want to interpret raw classifier scores as probabilities

Softmax Function: P(Y = k | X = x_i) = e^{s_k} / sum_j e^{s_j}

Scores (cat, car, frog): 3.2, 5.1, -1.7 (unnormalized log-probabilities / logits)
exp -> unnormalized probabilities: 24.5, 164.0, 0.18 (probabilities must be >= 0)
normalize -> probabilities: 0.13, 0.87, 0.00 (probabilities must sum to 1)

L_i = -log P(Y = y_i | X = x_i)
Li = -log(0.13) = 2.04

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 20

Softmax Classifier (Multinomial Logistic Regression)

Want to interpret raw classifier scores as probabilities

Softmax Function: P(Y = k | X = x_i) = e^{s_k} / sum_j e^{s_j}

Scores (cat, car, frog): 3.2, 5.1, -1.7 (unnormalized log-probabilities / logits)
exp -> unnormalized probabilities: 24.5, 164.0, 0.18 (probabilities must be >= 0)
normalize -> probabilities: 0.13, 0.87, 0.00 (probabilities must sum to 1)

L_i = -log P(Y = y_i | X = x_i)
Li = -log(0.13) = 2.04

Maximum Likelihood Estimation: choose probabilities to maximize the likelihood of the observed data

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
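A numpy sketch of the full softmax pipeline above, reproducing the slide's numbers (subtracting the max before exponentiating is a standard numerical-stability trick, not something the slide shows):

    import numpy as np

    scores = np.array([3.2, 5.1, -1.7])     # cat, car, frog
    unnorm = np.exp(scores - scores.max())  # unnormalized probabilities
    probs = unnorm / unnorm.sum()           # normalize so they sum to 1

    Li = -np.log(probs[0])                  # correct class is cat
    print(probs.round(2), Li.round(2))      # [0.13 0.87 0.  ] 2.04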

SLIDE 21

Log-Likelihood / KL-Divergence / Cross-Entropy

(C) Dhruv Batra and Zsolt Kira 21
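The slide body was not preserved, but the relationship the title refers to can be checked numerically: with a one-hot target P, cross-entropy H(P,Q) = H(P) + D_KL(P||Q) collapses to the negative log-likelihood of the correct class. A small sketch (my own illustration, not the slide's content):

    import numpy as np

    P = np.array([1.0, 0.0, 0.0])        # correct (one-hot) distribution
    Q = np.array([0.13, 0.87, 0.00097])  # softmax output from the earlier slides

    eps = 1e-12                          # guard against log(0)
    H_P = -(P * np.log(P + eps)).sum()   # entropy of P (0 for a one-hot target)
    KL = (P * np.log((P + eps) / (Q + eps))).sum()
    CE = -(P * np.log(Q + eps)).sum()

    print(CE, H_P + KL)                  # both ~2.04 = -log(0.13)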

SLIDE 22

Softmax Classifier (Multinomial Logistic Regression)

Want to interpret raw classifier scores as probabilities

Scores (cat, car, frog): 3.2, 5.1, -1.7 (unnormalized log-probabilities / logits)
exp -> unnormalized probabilities: 24.5, 164.0, 0.18
normalize -> computed probabilities: 0.13, 0.87, 0.00

Compare to the correct probabilities: 1.00, 0.00, 0.00

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 23

Softmax Classifier (Multinomial Logistic Regression)

Want to interpret raw classifier scores as probabilities

Scores (cat, car, frog): 3.2, 5.1, -1.7 (unnormalized log-probabilities / logits)
exp -> unnormalized probabilities: 24.5, 164.0, 0.18
normalize -> computed probabilities: 0.13, 0.87, 0.00

Compare to the correct probabilities (1.00, 0.00, 0.00) with the Kullback–Leibler divergence:
D_KL(P || Q) = sum_y P(y) log ( P(y) / Q(y) )

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 24

Softmax Classifier (Multinomial Logistic Regression)

Want to interpret raw classifier scores as probabilities

Scores (cat, car, frog): 3.2, 5.1, -1.7 (unnormalized log-probabilities / logits)
exp -> unnormalized probabilities: 24.5, 164.0, 0.18
normalize -> computed probabilities: 0.13, 0.87, 0.00

Compare to the correct probabilities (1.00, 0.00, 0.00) with the Cross Entropy:
H(P, Q) = H(P) + D_KL(P || Q)

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 25

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 26

Plan for Today

  • Regularization
  • Neural Networks
  • Optimization
  • Computing Gradients

(C) Dhruv Batra and Zsolt Kira 26

SLIDE 27

Regularization

L(W) = (1/N) sum_{i=1}^{N} L_i(f(x_i, W), y_i)

Data loss: model predictions should match training data

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 28

Regularization

L(W) = (1/N) sum_{i=1}^{N} L_i(f(x_i, W), y_i) + lambda R(W)

Data loss: model predictions should match training data
Regularization: prevent the model from doing too well on training data

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 29

Regularization

L(W) = (1/N) sum_{i=1}^{N} L_i(f(x_i, W), y_i) + lambda R(W)

Data loss: model predictions should match training data
Regularization: prevent the model from doing too well on training data
lambda = regularization strength (hyperparameter)

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 30

Model Complexity

[Plot: training data points, y vs. x]

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 31

Polynomial Regression

[Plot: data points y vs. x with a polynomial fit f]

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 32

Regularization: Prefer Simpler Models

[Plot: data points y vs. x with a complex fit f1 and a simpler fit f2]

Regularization pushes against fitting the data too well so we don’t fit noise in the data

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 33

Polynomial Regression

(C) Dhruv Batra and Zsolt Kira 33

SLIDE 34

Polynomial Regression

  • Demo:

– https://arachnoid.com/polysolve/

  • Data, as (x, y) pairs (a fitting sketch follows below):

– (10, 6)
– (15, 9)
– (20, 11)
– (25, 12)
– (29, 13)
– (40, 11)
– (50, 10)
– (60, 9)

(C) Dhruv Batra and Zsolt Kira 35
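To reproduce what the demo shows, here is a small numpy sketch fitting the data above with polynomials of increasing degree; training error keeps shrinking as the degree grows, even though the higher-degree fits are chasing noise (the degrees chosen here are illustrative):

    import numpy as np

    x = np.array([10, 15, 20, 25, 29, 40, 50, 60], dtype=float)
    y = np.array([ 6,  9, 11, 12, 13, 11, 10,  9], dtype=float)

    for deg in (1, 2, 5, 7):
        coeffs = np.polyfit(x, y, deg)                   # least-squares fit
        mse = np.mean((np.polyval(coeffs, x) - y) ** 2)  # training error
        print(f"degree {deg}: train MSE = {mse:.4f}")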

SLIDE 35

Regularization

L(W) = (1/N) sum_{i=1}^{N} L_i(f(x_i, W), y_i) + lambda R(W)

Data loss: model predictions should match training data
Regularization: prevent the model from doing too well on training data
lambda = regularization strength (hyperparameter)

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 36

Regularization

L(W) = (1/N) sum_{i=1}^{N} L_i(f(x_i, W), y_i) + lambda R(W)
lambda = regularization strength (hyperparameter)

Simple examples
L2 regularization: R(W) = sum_k sum_l W_{k,l}^2
L1 regularization: R(W) = sum_k sum_l |W_{k,l}|
Elastic net (L1 + L2): R(W) = sum_k sum_l (beta W_{k,l}^2 + |W_{k,l}|)

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
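A short numpy sketch of the three penalties (beta is the elastic-net mixing hyperparameter from the formula above; the weight matrix is a random stand-in):

    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.standard_normal((10, 3072))

    l2 = np.sum(W ** 2)           # sum_k sum_l W[k,l]^2
    l1 = np.sum(np.abs(W))        # sum_k sum_l |W[k,l]|
    beta = 0.5                    # elastic-net mixing weight (hyperparameter)
    elastic = np.sum(beta * W ** 2 + np.abs(W))

    print(l2, l1, elastic)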

SLIDE 37

Regularization

L(W) = (1/N) sum_{i=1}^{N} L_i(f(x_i, W), y_i) + lambda R(W)
lambda = regularization strength (hyperparameter)

Simple examples: L2 regularization, L1 regularization, elastic net (L1 + L2)
More complex: dropout, batch normalization, stochastic depth, fractional pooling, etc.

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 38

Regularization

L(W) = (1/N) sum_{i=1}^{N} L_i(f(x_i, W), y_i) + lambda R(W)
lambda = regularization strength (hyperparameter)

Why regularize?

  • Express preferences over weights
  • Make the model simple so it works on test data
  • Improve optimization by adding curvature

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 39

Recap

  • We have some dataset of (x, y)
  • We have a score function: s = f(x; W) = Wx
  • We have a loss function, e.g.:

Softmax: L_i = -log( e^{s_{y_i}} / sum_j e^{s_j} )
SVM: L_i = sum_{j != y_i} max(0, s_j - s_{y_i} + 1)
Full loss: L = (1/N) sum_i L_i + R(W)

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n


SLIDE 41

Next: Neural Networks

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 42

Why? Image Features

SLIDE 43

SLIDE 44

SLIDE 45

Histogram of Oriented Gradients (HOG)

SLIDE 46

Bag of Words

SLIDE 47

SLIDE 48

Error Decomposition

(C) Dhruv Batra and Zsolt Kira 53

[Diagram: error decomposition, the gap between Reality and the model; the model here is multi-class logistic regression: Input (HxWx3) -> FC -> Softmax]

SLIDE 49

Neural networks: without the brain stuff

(Before) Linear score function: f = Wx

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 50

Neural networks: without the brain stuff

(Before) Linear score function: f = Wx
(Now) 2-layer Neural Network: f = W2 max(0, W1 x)

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 51

Neural networks: without the brain stuff

(Before) Linear score function: f = Wx
(Now) 2-layer Neural Network: f = W2 max(0, W1 x)

x (3072-dim input) --W1--> h (100-dim hidden) --W2--> s (10 class scores)

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
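In code the 2-layer score function is one matrix product, an elementwise max, and another matrix product (sizes follow the slide; the random weights are placeholders):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal((3072, 1))            # input image, flattened
    W1 = 0.01 * rng.standard_normal((100, 3072))  # first-layer weights
    W2 = 0.01 * rng.standard_normal((10, 100))    # second-layer weights

    h = np.maximum(0, W1 @ x)   # hidden layer: max(0, W1 x)
    s = W2 @ h                  # 10 class scores
    print(s.shape)              # (10, 1)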


SLIDE 53

Neural networks: without the brain stuff

(Before) Linear score function: f = Wx
(Now) 2-layer Neural Network: f = W2 max(0, W1 x)
or 3-layer Neural Network: f = W3 max(0, W2 max(0, W1 x))

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 54

Full implementation of training a 2-layer Neural Network needs ~20 lines:

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
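The twenty lines themselves were an image in the original slide; below is a sketch in the same spirit (random data, a sigmoid hidden layer, squared-error loss, and manual backprop; all sizes and the step size are illustrative):

    import numpy as np
    from numpy.random import randn

    N, D_in, H, D_out = 64, 1000, 100, 10
    x, y = randn(N, D_in), randn(N, D_out)    # random "training data"
    w1, w2 = randn(D_in, H), randn(H, D_out)  # random initialization

    for t in range(2000):
        h = 1.0 / (1.0 + np.exp(-x.dot(w1)))  # sigmoid hidden layer
        y_pred = h.dot(w2)
        loss = np.square(y_pred - y).sum()    # squared-error loss

        # backprop: chain rule through w2, then h, then w1
        grad_y_pred = 2.0 * (y_pred - y)
        grad_w2 = h.T.dot(grad_y_pred)
        grad_h = grad_y_pred.dot(w2.T)
        grad_w1 = x.T.dot(grad_h * h * (1 - h))  # sigmoid'(z) = h(1 - h)

        w1 -= 1e-4 * grad_w1                  # gradient descent step
        w2 -= 1e-4 * grad_w2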

SLIDE 55

This image by Fotis Bobolas is licensed under CC-BY 2.0

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 56

Impulses are carried toward the cell body (via dendrites) and away from the cell body (via the axon).
Labels: dendrite, cell body, axon, presynaptic terminal

This image by Felipe Perucho is licensed under CC-BY 3.0

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n


SLIDE 58

sigmoid activation function: sigma(x) = 1 / (1 + e^{-x})

Impulses are carried toward the cell body (via dendrites) and away from the cell body (via the axon).
Labels: dendrite, cell body, axon, presynaptic terminal

This image by Felipe Perucho is licensed under CC-BY 3.0

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n


SLIDE 60

Be very careful with your brain analogies!

Biological Neurons:

  • Many different types
  • Dendrites can perform complex non-linear computations
  • Synapses are not a single weight but a complex non-linear dynamical system
  • Rate code may not be adequate

[Dendritic Computation. London and Hausser]

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 61

Activation functions: Sigmoid, tanh, ReLU, Leaky ReLU, Maxout, ELU

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
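For reference, common closed forms for these activations; a sketch, since the leak constant and the ELU alpha vary across papers:

    import numpy as np

    def sigmoid(x):             # squashes to (0, 1)
        return 1.0 / (1.0 + np.exp(-x))

    def tanh(x):                # zero-centered; tanh(x) = 2*sigmoid(2x) - 1
        return np.tanh(x)

    def relu(x):                # max(0, x)
        return np.maximum(0, x)

    def leaky_relu(x, a=0.1):   # small negative slope instead of 0
        return np.maximum(a * x, x)

    def elu(x, alpha=1.0):      # smooth saturation for x < 0
        return np.where(x > 0, x, alpha * (np.exp(x) - 1))

    # Maxout takes the max over several linear functions of x, so unlike the
    # others it needs extra weight vectors and is not a fixed elementwise map.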

SLIDE 62

Activation Functions

  • sigmoid vs tanh

(C) Dhruv Batra and Zsolt Kira 69

SLIDE 63

A quick note

(C) Dhruv Batra and Zsolt Kira 70 Image Credit: LeCun et al. ‘98

SLIDE 64

Rectified Linear Units (ReLU): f(x) = max(0, x)

(C) Dhruv Batra and Zsolt Kira 71

[Krizhevsky et al., NIPS12]

SLIDE 65

Limitation

  • A single “neuron” is still a linear decision boundary
  • What to do?
  • Idea: Stack a bunch of them together!

(C) Dhruv Batra and Zsolt Kira 72

SLIDE 66

Multilayer Networks

  • Cascade neurons together
  • The output from one layer is the input to the next
  • Each layer has its own set of weights

(C) Dhruv Batra and Zsolt Kira 73

Image Credit: Andrej Karpathy, CS231n

SLIDE 67

Neural networks: Architectures

“Fully-connected” layers
“2-layer Neural Net”, or “1-hidden-layer Neural Net”
“3-layer Neural Net”, or “2-hidden-layer Neural Net”

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 68

Example feed-forward computation of a neural network

We can efficiently evaluate an entire layer of neurons.

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
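The code on the slide was an image; here is a sketch of the same idea for a 3-layer network with sigmoid activations, where each line evaluates an entire layer at once (layer sizes are illustrative):

    import numpy as np

    f = lambda z: 1.0 / (1.0 + np.exp(-z))    # activation function (sigmoid)

    rng = np.random.default_rng(0)
    x = rng.standard_normal((3, 1))           # input vector
    W1, b1 = rng.standard_normal((4, 3)), np.zeros((4, 1))
    W2, b2 = rng.standard_normal((4, 4)), np.zeros((4, 1))
    W3, b3 = rng.standard_normal((1, 4)), np.zeros((1, 1))

    h1 = f(W1 @ x + b1)    # first hidden layer
    h2 = f(W2 @ h1 + b2)   # second hidden layer
    out = W3 @ h2 + b3     # output neuron
    print(out)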

SLIDE 69

Example feed-forward computation of a neural network

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 70

Computational Graph

[Graph: x and W feed a multiply node (*) producing scores s; s feeds the hinge loss; a regularization term R(W) joins the hinge loss at an add node (+) to produce the total loss L]

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 71

Before: Logistic Regression as Cascade

(C) Dhruv Batra 78

Given a library of simple functions, compose them into a complicated function

Slide Credit: Marc'Aurelio Ranzato, Yann LeCun

SLIDE 72

(C) Dhruv Batra and Zsolt Kira 79

(Before) Linear score function: f = Wx
(Now) 2-layer Neural Network: f = W2 max(0, W1 x)

x (3072-dim input) --W1--> h (100-dim hidden) --W2--> s (10 class scores)

Now: arbitrary composition of linear and non-linear functions

SLIDE 73

Demo Time

  • https://playground.tensorflow.org
SLIDE 74

Optimization

SLIDE 75

Recap

  • We have some dataset of (x, y)
  • We have a score function: s = f(x; W) = Wx
  • We have a loss function, e.g.:

Softmax: L_i = -log( e^{s_{y_i}} / sum_j e^{s_j} )
SVM: L_i = sum_{j != y_i} max(0, s_j - s_{y_i} + 1)
Full loss: L = (1/N) sum_i L_i + R(W)

How do we find the best W?

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 76

This image is CC0 1.0 public domain

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 77

Strategy 1: Random Search
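The slide's code was an image; a sketch of the guess-and-check idea on a toy softmax loss (the data, sizes, and number of trials are stand-ins):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.standard_normal((100, 3073))        # toy data (bias trick: +1 column)
    y = rng.integers(0, 10, 100)                # toy labels

    def loss_fn(W):                             # mean softmax loss
        s = X @ W.T
        s -= s.max(axis=1, keepdims=True)
        p = np.exp(s) / np.exp(s).sum(axis=1, keepdims=True)
        return -np.log(p[np.arange(len(y)), y]).mean()

    best_loss, best_W = float("inf"), None
    for _ in range(1000):                       # generate random parameters
        W = 1e-4 * rng.standard_normal((10, 3073))
        l = loss_fn(W)
        if l < best_loss:                       # keep the best guess so far
            best_loss, best_W = l, W
    print(best_loss)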

SLIDE 78

Let's see how well this works on the test set...

What other methods can we use?

SLIDE 79

Strategy 2: Follow the slope

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 80

Strategy: Follow the slope

In 1 dimension, the derivative of a function:
df(x)/dx = lim_{h -> 0} ( f(x + h) - f(x) ) / h

In multiple dimensions, the gradient is the vector of partial derivatives along each dimension. The slope in any direction is the dot product of the direction with the gradient. The direction of steepest descent is the negative gradient.

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
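The 1-D definition translates directly into a numerical gradient check; a sketch using the centered-difference variant (h is a small illustrative step):

    import numpy as np

    def numerical_gradient(f, x, h=1e-5):
        grad = np.zeros_like(x)
        for i in range(x.size):                # wiggle one dimension at a time
            e = np.zeros_like(x)
            e[i] = h
            grad[i] = (f(x + e) - f(x - e)) / (2 * h)   # centered difference
        return grad

    f = lambda x: np.sum(x ** 2)               # toy function, true gradient 2x
    x0 = np.array([1.0, -3.0, 2.0])
    print(numerical_gradient(f, x0))           # ~[ 2. -6.  4.]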

SLIDE 81

Gradient Descent

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
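The update rule itself is two lines; here it is on a toy quadratic loss so the sketch runs end to end (the loss, its analytic gradient, and the step size are illustrative):

    import numpy as np

    target = np.array([1.0, -2.0])        # minimum of the toy loss

    def loss(w):
        return np.sum((w - target) ** 2)

    def evaluate_gradient(w):             # analytic gradient of the toy loss
        return 2.0 * (w - target)

    w = np.zeros(2)
    step_size = 0.1                       # learning rate (hyperparameter)
    for _ in range(100):
        w -= step_size * evaluate_gradient(w)  # step along the negative gradient
    print(w, loss(w))                     # w -> [1, -2], loss -> ~0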

SLIDE 82

Stochastic Gradient Descent (SGD)

Full loss: L(W) = (1/N) sum_{i=1}^{N} L_i(x_i, y_i, W) + lambda R(W)

The full sum is expensive when N is large! Approximate the sum using a minibatch of examples; 32 / 64 / 128 are common.

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
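A sketch of the minibatch loop on a toy least-squares problem (the dataset, batch size of 64, and step size are stand-ins for whatever loss is being trained):

    import numpy as np

    rng = np.random.default_rng(0)
    N = 10_000
    X = rng.standard_normal((N, 20))
    w_true = rng.standard_normal(20)
    y = X @ w_true + 0.1 * rng.standard_normal(N)

    w, step_size, batch = np.zeros(20), 0.05, 64
    for t in range(2000):
        idx = rng.integers(0, N, batch)              # sample a minibatch
        Xb, yb = X[idx], y[idx]
        grad = (2.0 / batch) * Xb.T @ (Xb @ w - yb)  # minibatch gradient of MSE
        w -= step_size * grad                        # SGD update
    print(np.linalg.norm(w - w_true))                # small: w approaches w_true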

SLIDE 83

[Figure: loss contours over weights W_1 and W_2; starting from the original W, each update steps in the negative gradient direction]

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 84

Vanilla Gradient Descent vs. Better Variants

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n