BBM413 Fundamentals of Image Processing: Introduction to Deep Learning (PowerPoint PPT Presentation)
SLIDE 1

BBM413 Fundamentals of Image Processing
Introduction to Deep Learning

Erkut Erdem
Hacettepe University
Computer Vision Lab (HUCVL)


SLIDE 2

What is deep learning?

“Deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction.”

− Yann LeCun, Yoshua Bengio and Geoff Hinton

  • Y. LeCun, Y. Bengio, G. Hinton, "Deep Learning", Nature, Vol. 521, 28 May 2015

2

SLIDE 3

1943 – 2006: A Prehistory of Deep Learning

3

SLIDE 4

1943: Warren McCulloch and Walter Pitts

  • First computational model
  • Neurons as logic gates (AND, OR, NOT)
  • A neuron model that sums binary inputs and outputs 1 if the sum exceeds a certain threshold value, and otherwise outputs 0

4

SLIDE 5

1958: Frank Rosenblatt’s Perceptron

  • A computational model of a single neuron
  • Solves a binary classification problem
  • Simple training algorithm
  • Built using specialized hardware

  • F. Rosenblatt, “The perceptron: A probabilistic model for information storage and organization in the brain”, Psychological Review, Vol. 65, 1958

5

SLIDE 6

1969: Marvin Minsky and Seymour Papert

“No machine can learn to recognize X unless it possesses, at least potentially, some scheme for representing X.” (p. xiii)

  • Perceptrons can only represent linearly separable functions
  • They cannot solve problems such as the XOR problem
  • Wrongly attributed as the reason behind the AI winter, a period of reduced funding and interest in AI research

6

SLIDE 7

1990s

  • Multi-layer perceptrons can theoretically learn any function (Cybenko, 1989; Hornik, 1991)
  • Training multi-layer perceptrons
    • Back-propagation (Rumelhart, Hinton, Williams, 1986)
    • Back-propagation through time (BPTT) (Werbos, 1988)
  • New neural architectures
    • Convolutional neural nets (LeCun et al., 1989)
    • Long short-term memory networks (LSTM) (Hochreiter and Schmidhuber, 1997)

7

SLIDE 8

Why it failed then

  • Too many parameters to learn from few labeled examples
  • “I know my features are better for this task”
  • Non-convex optimization? No, thanks.
  • Black-box model, no interpretability
  • Very slow and inefficient
  • Overshadowed by the success of SVMs (Cortes and Vapnik, 1995)

8

Adapted from Joan Bruna

SLIDE 9

A major breakthrough in 2006

9

SLIDE 10

2006 Breakthrough: Hinton and Salakhutdinov

  • The first solution to the vanishing gradient problem
  • Build the model in a layer-by-layer fashion using unsupervised learning
  • The features in early layers are already initialized or “pretrained” with some suitable features (weights)
  • Pretrained features in early layers only need to be adjusted slightly during supervised learning to achieve good results

  • G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks”, Science, Vol. 313, 28 July 2006

10

SLIDE 11

The 2012 revolution

11

SLIDE 12

ImageNet Challenge

  • ImageNet Large Scale Visual Recognition Challenge (ILSVRC)
  • Image classification: 1.2M training images with 1K categories
  • Measure top-5 classification error

[Figure: easiest vs. hardest classes, with example classification outputs over candidate labels such as scale, T-shirt, steel drum, giant panda, drumstick, mud turtle]

  • J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li and L. Fei-Fei, “ImageNet: A Large-Scale Hierarchical Image Database”, CVPR 2009
  • O. Russakovsky et al., “ImageNet Large Scale Visual Recognition Challenge”, Int. J. Comput. Vis., Vol. 115, Issue 3, pp. 211-252, 2015

12

SLIDE 13

ILSVRC 2012 Competition

Team (2012)           | % Error
Supervision (Toronto) | 15.3
ISI (Tokyo)           | 26.1
VGG (Oxford)          | 26.9
XRCE/INRIA            | 27.0
UvA (Amsterdam)       | 29.6
INRIA/LEAR            | 33.4

(CNN based vs. non-CNN based entries)

  • The success of AlexNet, a deep convolutional network
  • 7 hidden layers (not counting some max pooling layers)
  • 60M parameters
  • Combined several tricks: ReLU activation function, data augmentation, dropout

  • A. Krizhevsky, I. Sutskever, G. E. Hinton, “ImageNet Classification with Deep Convolutional Neural Networks”, NIPS 2012

13

SLIDE 14

2012 – now: A Cambrian explosion in deep learning

14

SLIDE 15

Applications: Speech Recognition, Machine Translation, Self-Driving Cars, Game Playing, Robotics, Genomics, Audio Generation, and many more…

  • Speech recognition: Amodei et al., "Deep Speech 2: End-to-End Speech Recognition in English and Mandarin", In CoRR 2015
  • Machine translation: M.-T. Luong et al., "Effective Approaches to Attention-based Neural Machine Translation", EMNLP 2015
  • Self-driving cars: M. Bojarski et al., “End to End Learning for Self-Driving Cars”, In CoRR 2016
  • Game playing: D. Silver et al., "Mastering the game of Go with deep neural networks and tree search", Nature 529, 2016
  • Robotics: L. Pinto and A. Gupta, “Supersizing Self-supervision: Learning to Grasp from 50K Tries and 700 Robot Hours”, ICRA 2015
  • Genomics: H. Y. Xiong et al., "The human splicing code reveals new insights into the genetic determinants of disease", Science 347, 2015
  • Audio generation: M. Ramona et al., "Capturing a Musician's Groove: Generation of Realistic Accompaniments from Single Song Recordings", In IJCAI 2015

15

SLIDE 16

Why now?

16

SLIDE 17

Slide credit: Neil Lawrence

17

SLIDE 18

Datasets vs. Algorithms

18

Year | Breakthrough in AI | Dataset (First Available) | Algorithm (First Proposed)
1994 | Human-level spontaneous speech recognition | Spoken Wall Street Journal articles and other texts (1991) | Hidden Markov Model (1984)
1997 | IBM Deep Blue defeated Garry Kasparov | 700,000 Grandmaster chess games, aka “The Extended Book” (1991) | Negascout planning algorithm (1983)
2005 | Google’s Arabic- and Chinese-to-English translation | 1.8 trillion tokens from Google Web and News pages (collected in 2005) | Statistical machine translation algorithm (1988)
2011 | IBM Watson became the world Jeopardy! champion | 8.6 million documents from Wikipedia, Wiktionary, and Project Gutenberg (updated in 2010) | Mixture-of-Experts (1991)
2014 | Google’s GoogLeNet object classification at near-human performance | ImageNet corpus of 1.5 million labeled images and 1,000 object categories (2010) | Convolutional Neural Networks (1989)
2015 | Google’s DeepMind achieved human parity in playing 29 Atari games by learning general control from video | Arcade Learning Environment dataset of over 50 Atari games (2013) | Q-learning (1992)

Average number of years to breakthrough: 3 years from dataset availability, 18 years from algorithm proposal

Table credit: Quant Quanto

SLIDE 19

Powerful Hardware

  • GPU vs. CPU

19

slide-20
SLIDE 20

20 Slide credit:

20

SLIDE 21

Working ideas on how to train deep architectures

  • Better learning regularization (e.g., Dropout)

  • N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, “Dropout: A Simple Way to Prevent Neural Networks from Overfitting”, JMLR, Vol. 15, No. 1

21
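To make the idea concrete, here is a minimal sketch (not from the slides) of inverted dropout applied to a layer's activations during training; the keep probability, NumPy implementation, and example sizes are illustrative assumptions.

```python
import numpy as np

def dropout(h, p_keep=0.5, training=True, rng=np.random.default_rng(0)):
    """Inverted dropout: randomly zero units and rescale the survivors.

    h        : activations of a layer, shape (batch, units)
    p_keep   : probability of keeping a unit (illustrative choice)
    training : at test time dropout is disabled and h is returned unchanged
    """
    if not training:
        return h
    mask = (rng.random(h.shape) < p_keep).astype(h.dtype)
    return h * mask / p_keep  # rescale so the expected activation is unchanged

# Example: a batch of 4 examples with 6 hidden units
h = np.ones((4, 6))
print(dropout(h, p_keep=0.5))   # roughly half the entries are zeroed, the rest become 2.0
```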

SLIDE 22

Working ideas on how to train deep architectures

  • Better optimization conditioning (e.g., Batch Normalization)

  • S. Ioffe, C. Szegedy, “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift”, In ICML 2015

22
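As a rough illustration (not part of the original slides), batch normalization standardizes each feature over the mini-batch and then applies a learnable scale gamma and shift beta; the epsilon value and shapes below are assumptions.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Batch normalization, training-time forward pass.

    x     : mini-batch of pre-activations, shape (batch, features)
    gamma : learnable scale, shape (features,)
    beta  : learnable shift, shape (features,)
    """
    mean = x.mean(axis=0)                    # per-feature mean over the batch
    var = x.var(axis=0)                      # per-feature variance over the batch
    x_hat = (x - mean) / np.sqrt(var + eps)  # standardize each feature
    return gamma * x_hat + beta              # restore representational freedom

x = np.random.default_rng(0).normal(5.0, 3.0, size=(8, 4))
y = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0), y.std(axis=0))         # approximately 0 and 1 per feature
```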

SLIDE 23

Working ideas on how to train deep architectures

  • Better neural architectures (e.g., Residual Nets)

  • K. He, X. Zhang, S. Ren, J. Sun, “Deep Residual Learning for Image Recognition”, In CVPR 2016

23
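The core idea of a residual block is to learn a correction F(x) on top of an identity shortcut, y = x + F(x). The sketch below is illustrative only; the two-layer fully connected form and the weight shapes are assumptions, not the convolutional blocks used in the paper.

```python
import numpy as np

def relu(a):
    return np.maximum(0.0, a)

def residual_block(x, W1, b1, W2, b2):
    """A toy residual block: y = x + F(x), with F a small two-layer network.

    The identity shortcut lets gradients flow directly to earlier layers,
    which is what makes very deep networks easier to train in practice.
    """
    f = relu(x @ W1 + b1) @ W2 + b2   # residual branch F(x)
    return relu(x + f)                # add the shortcut, then the nonlinearity

rng = np.random.default_rng(0)
d = 16
x = rng.normal(size=(4, d))
W1, b1 = rng.normal(scale=0.1, size=(d, d)), np.zeros(d)
W2, b2 = rng.normal(scale=0.1, size=(d, d)), np.zeros(d)
print(residual_block(x, W1, b1, W2, b2).shape)  # (4, 16)
```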

SLIDE 24

Let’s make a review of neural networks

24

SLIDE 25

The Perceptron

25

[Diagram: inputs x0, x1, x2, …, xn with weights w0, w1, w2, …, wn and bias b (input 1) feeding a sum ∑ followed by a non-linearity]

SLIDE 26

Perceptron Forward Pass

  • Neuron pre-activation (or input activation): $a(x) = b + \sum_i w_i x_i = b + \mathbf{w}^\top \mathbf{x}$
  • Neuron output activation: $h(x) = g(a(x)) = g(b + \sum_i w_i x_i)$

where
  • w are the weights (parameters)
  • b is the bias term
  • g(·) is called the activation function

26

[Diagram: inputs x0, …, xn, weights w0, …, wn, bias b, sum ∑, non-linearity]
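A minimal sketch of this forward pass in NumPy (not from the slides); the sigmoid choice for g and the example values are assumptions.

```python
import numpy as np

def perceptron_forward(x, w, b, g):
    """Single-neuron forward pass: h(x) = g(b + w·x)."""
    a = b + np.dot(w, x)   # pre-activation a(x) = b + sum_i w_i x_i
    return g(a)            # output activation h(x) = g(a(x))

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

x = np.array([0.5, -1.0, 2.0])   # inputs x_1..x_n
w = np.array([0.1, 0.4, -0.3])   # weights
b = 0.2                          # bias
print(perceptron_forward(x, w, b, sigmoid))
```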

SLIDE 27

Output Activation of The Neuron

  • $h(x) = g(a(x)) = g(b + \sum_i w_i x_i)$

  • Range is determined by g(·)
  • Bias only changes the position of the ridge

27

[Diagram: inputs x0, …, xn, weights w0, …, wn, bias b, sum ∑, non-linearity]

(from Pascal Vincent’s slides)
Image credit: Pascal Vincent

SLIDE 28

Linear Activation Function

  • $h(x) = g(a(x)) = g(b + \sum_i w_i x_i)$
  • $g(a) = a$

  • No nonlinear transformation
  • No input squashing

28

[Diagram: inputs x0, …, xn, weights w0, …, wn, bias b, sum ∑, non-linearity]

SLIDE 29

Sigmoid Activation Function

  • $h(x) = g(a(x)) = g(b + \sum_i w_i x_i)$
  • $g(a) = \mathrm{sigm}(a) = \frac{1}{1+\exp(-a)}$

  • Squashes the neuron’s output between 0 and 1
  • Always positive
  • Bounded
  • Strictly increasing

29

[Diagram: inputs x0, …, xn, weights w0, …, wn, bias b, sum ∑, non-linearity]
SLIDE 30

Hyperbolic Tangent (tanh) Activation Function

  • $h(x) = g(a(x)) = g(b + \sum_i w_i x_i)$
  • $g(a) = \tanh(a) = \frac{\exp(a)-\exp(-a)}{\exp(a)+\exp(-a)} = \frac{\exp(2a)-1}{\exp(2a)+1}$

  • Squashes the neuron’s output between -1 and 1
  • Can be positive or negative
  • Bounded
  • Strictly increasing

30

[Diagram: inputs x0, …, xn, weights w0, …, wn, bias b, sum ∑, non-linearity]
SLIDE 31

Rectified Linear (ReLU) Activation Function

  • $h(x) = g(a(x)) = g(b + \sum_i w_i x_i)$
  • $g(a) = \mathrm{reclin}(a) = \max(0, a)$

  • Bounded below by 0 (always non-negative)
  • Not upper bounded
  • Monotonically increasing
  • Tends to produce units with sparse activities

31

[Diagram: inputs x0, …, xn, weights w0, …, wn, bias b, sum ∑, non-linearity]
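For reference, here is a small sketch (not from the slides) implementing the four activation functions above with NumPy so they can be compared on the same pre-activations.

```python
import numpy as np

def linear(a):
    return a                                  # no squashing

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))           # output in (0, 1)

def tanh(a):
    return np.tanh(a)                         # output in (-1, 1)

def relu(a):
    return np.maximum(0.0, a)                 # output in [0, inf), sparse

a = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])     # example pre-activations
for g in (linear, sigmoid, tanh, relu):
    print(g.__name__, g(a))
```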
SLIDE 32

Multi-Output Perceptron

  • We need multiple outputs (1 output per class)
  • We need to estimate the conditional probability p(y = c | x)
  • Discriminative learning

  • Softmax activation function at the output
    • Strictly positive
    • Sums to one
  • Predict the class with the highest estimated class conditional probability

$o(\mathbf{a}) = \mathrm{softmax}(\mathbf{a}) = \left[ \frac{\exp(a_1)}{\sum_c \exp(a_c)}, \ldots, \frac{\exp(a_C)}{\sum_c \exp(a_c)} \right]^\top$

32

[Diagram: inputs x0, x1, x2, …, xn feeding an output layer]
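A short sketch (not from the slides) of a numerically stable softmax over class scores; subtracting the maximum score is a standard trick and an assumption here, not something shown on the slide.

```python
import numpy as np

def softmax(a):
    """Map a vector of class scores to strictly positive values that sum to one."""
    a = a - np.max(a)               # stability: softmax is unchanged by shifting a
    e = np.exp(a)
    return e / e.sum()

scores = np.array([2.0, 1.0, -1.0, 0.5])   # one pre-activation per class
p = softmax(scores)
print(p, p.sum(), np.argmax(p))            # probabilities, 1.0, predicted class
```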

SLIDE 33

Single Hidden Layer Neural Network

  • Hidden layer pre-activation: $a(x) = b^{(1)} + W^{(1)} x$, i.e. $a(x)_i = b^{(1)}_i + \sum_j W^{(1)}_{i,j} x_j$
  • Hidden layer activation: $h(x) = g(a(x))$
  • Output layer activation: $f(x) = o\left( b^{(2)} + \mathbf{w}^{(2)\top} h^{(1)}(x) \right)$

33

[Diagram: inputs x0, x1, …, xn, a hidden layer h0, h1, h2, …, hn, and an output layer]
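Below is a minimal sketch (not from the slides) of the forward pass for this one-hidden-layer network; the tanh hidden activation, softmax output, and layer sizes are assumptions for illustration.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - np.max(a))
    return e / e.sum()

def forward(x, W1, b1, W2, b2):
    """Single-hidden-layer network: f(x) = o(b2 + W2 h(x)), with h(x) = g(b1 + W1 x)."""
    a1 = b1 + W1 @ x          # hidden pre-activation a(x)
    h1 = np.tanh(a1)          # hidden activation h(x) = g(a(x))
    a2 = b2 + W2 @ h1         # output pre-activation
    return softmax(a2)        # output activation o(·)

rng = np.random.default_rng(0)
n_in, n_hidden, n_classes = 4, 8, 3
W1, b1 = rng.normal(scale=0.5, size=(n_hidden, n_in)), np.zeros(n_hidden)
W2, b2 = rng.normal(scale=0.5, size=(n_classes, n_hidden)), np.zeros(n_classes)
print(forward(rng.normal(size=n_in), W1, b1, W2, b2))   # class probabilities
```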

SLIDE 34

Multi-Layer Perceptron (MLP)

Consider a network with L hidden layers.

  • layer pre-activation for k > 0: $a^{(k)}(x) = b^{(k)} + W^{(k)} h^{(k-1)}(x)$ (with $h^{(0)}(x) = x$)
  • hidden layer activation from 1 to L: $h^{(k)}(x) = g(a^{(k)}(x))$
  • output layer activation (k = L+1): $h^{(L+1)}(x) = o(a^{(L+1)}(x)) = f(x)$

34

[Diagram: inputs x0, x1, …, xn, hidden layers h0, h1, h2, …, hn, and an output layer]
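A compact sketch (not from the slides) of the same forward recursion for an arbitrary number of hidden layers; the ReLU hidden activations, softmax output, and layer sizes are illustrative assumptions.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - np.max(a))
    return e / e.sum()

def mlp_forward(x, weights, biases):
    """Forward pass through L hidden layers plus an output layer.

    weights/biases: lists of length L+1, one (W, b) pair per layer.
    Implements h^(k) = g(b^(k) + W^(k) h^(k-1)), with h^(0) = x.
    """
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.maximum(0.0, b + W @ h)               # hidden layers with ReLU
    return softmax(biases[-1] + weights[-1] @ h)     # output layer o(a^(L+1))

rng = np.random.default_rng(0)
sizes = [4, 16, 16, 3]                               # input, two hidden layers, classes
weights = [rng.normal(scale=0.3, size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]
print(mlp_forward(rng.normal(size=4), weights, biases))
```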
SLIDE 35

Training Neural Networks: Objective

  • Learning is cast as optimization.
  • For classification problems, we would like to minimize classification error.
  • The loss function can sometimes be viewed as a surrogate for what we want to optimize (e.g., an upper bound).

$\arg\min_{\theta} \; \frac{1}{T} \sum_t l(f(x^{(t)}; \theta), y^{(t)}) + \lambda \Omega(\theta)$

(the first term is the loss function, the second the regularizer)

35

MIT 6.S191 | Intro to Deep Learning | IAP 2017
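As an illustration (not from the slides), the sketch below evaluates this objective for a tiny linear softmax classifier, using cross-entropy as the surrogate loss and an L2 penalty as the regularizer Ω(θ); all sizes and the value of λ are assumptions.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def objective(W, b, X, y, lam=1e-2):
    """(1/T) sum_t cross_entropy(f(x_t), y_t) + lam * ||W||^2."""
    probs = softmax(X @ W + b)                       # f(x; theta) for all T examples
    nll = -np.log(probs[np.arange(len(y)), y])       # per-example loss l(f(x), y)
    reg = lam * np.sum(W ** 2)                       # regularizer Omega(theta)
    return nll.mean() + reg

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))                         # T = 20 examples, 5 features
y = rng.integers(0, 3, size=20)                      # 3 classes
W, b = rng.normal(scale=0.1, size=(5, 3)), np.zeros(3)
print(objective(W, b, X, y))
```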

SLIDE 36

Loss is a function of the model’s parameters

36

MIT 6.S191 | Intro to Deep Learning | IAP 2017

SLIDE 37

How to minimize loss?

37

SLIDE 38

How to minimize loss?

  • Compute the gradient of the loss at the current point

38

MIT 6.S191 | Intro to Deep Learning | IAP 2017

SLIDE 39

How to minimize loss?

  • Move in direction opposite of the gradient to a new point

39

MIT 6.S191 | Intro to Deep Learning | IAP 2017
SLIDE 40

How to minimize loss?

  • Move in direction opposite of the gradient to a new point

40

MIT 6.S191 | Intro to Deep Learning | IAP 2017
SLIDE 41

How to minimize loss?

  • Move in direction opposite of the gradient to a new point
  • Repeat!

41

MIT 6.S191 | Intro to Deep Learning | IAP 2017
SLIDE 42

This is called Stochastic Gradient Descent (SGD)

  • Move in direction opposite of the gradient to a new point
  • Repeat!

42

MIT 6.S191 | Intro to Deep Learning | IAP 2017
SLIDE 43

Stochastic Gradient Descent (SGD)

  • Initialize θ randomly
  • For N epochs
    • For each training example (x, y):
      • Compute the loss gradient: $\nabla_\theta \mathcal{L}$
      • Update θ with the update rule: $\theta^{(t+1)} = \theta^{(t)} - \eta_t \nabla_\theta \mathcal{L}$

43

MIT 6.S191 | Intro to Deep Learning | IAP 2017
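The sketch below (not from the slides) runs this per-example SGD loop on a tiny linear softmax classifier; the learning rate, number of epochs, and the analytic gradient of the cross-entropy loss are assumptions for illustration.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def sgd(X, y, n_classes, epochs=20, lr=0.1, seed=0):
    """Per-example SGD for a linear softmax classifier f(x) = softmax(W x + b)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.01, size=(n_classes, X.shape[1]))   # initialize theta randomly
    b = np.zeros(n_classes)
    for _ in range(epochs):                                    # for N epochs
        for x, t in zip(X, y):                                 # for each training example
            p = softmax(W @ x + b)
            p[t] -= 1.0                       # gradient of cross-entropy w.r.t. pre-activations
            W -= lr * np.outer(p, x)          # theta <- theta - eta * grad_theta L
            b -= lr * p
    return W, b

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1, 0.5, (30, 2)), rng.normal(1, 0.5, (30, 2))])
y = np.array([0] * 30 + [1] * 30)
W, b = sgd(X, y, n_classes=2)
pred = np.argmax(X @ W.T + b, axis=1)
print("training accuracy:", (pred == y).mean())
```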

SLIDE 44

Why is it Stochastic Gradient Descent?

  • Initialize θ randomly
  • For N epochs
    • For each training example (x, y):
      • Compute the loss gradient: $\nabla_\theta \mathcal{L}$
      • Update θ with the update rule: $\theta^{(t+1)} = \theta^{(t)} - \eta_t \nabla_\theta \mathcal{L}$

  • The per-example gradient is only an estimate of the true gradient!

Advantages (of minibatch gradient estimates):
  • More accurate estimation of the gradient
    ⎯ Smoother convergence
    ⎯ Allows for larger learning rates
  • Minibatches lead to fast training!
    ⎯ Can parallelize computation + achieve significant speed increases on GPUs

44

MIT 6.S191 | Intro to Deep Learning | IAP 2017

SLIDE 45

Minibatches Reduce Gradient Variance

  • Initialize θ randomly
  • For N epochs
    • For each training batch {(x0, y0), …, (xB, yB)}:
      • Compute the loss gradient: $\nabla_\theta \mathcal{L}$
      • Update θ with the update rule: $\theta^{(t+1)} = \theta^{(t)} - \eta_t \nabla_\theta \mathcal{L}$

  • More accurate estimate!

45

MIT 6.S191 | Intro to Deep Learning | IAP 2017
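A sketch (not from the slides) of the minibatch variant: the only change from the per-example loop is that gradients are averaged over a small batch before each update. The batch size, learning rate, and the same toy softmax model are assumptions.

```python
import numpy as np

def softmax_rows(A):
    E = np.exp(A - A.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def minibatch_sgd(X, y, n_classes, batch_size=8, epochs=20, lr=0.5, seed=0):
    """Minibatch SGD for a linear softmax classifier."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.01, size=(X.shape[1], n_classes))
    b = np.zeros(n_classes)
    for _ in range(epochs):
        order = rng.permutation(len(X))                 # shuffle each epoch
        for start in range(0, len(X), batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            P = softmax_rows(Xb @ W + b)
            P[np.arange(len(yb)), yb] -= 1.0            # dL/dlogits for the batch
            W -= lr * Xb.T @ P / len(yb)                # average gradient over the batch
            b -= lr * P.mean(axis=0)
    return W, b

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1, 0.5, (40, 2)), rng.normal(1, 0.5, (40, 2))])
y = np.array([0] * 40 + [1] * 40)
W, b = minibatch_sgd(X, y, n_classes=2)
print("training accuracy:", (np.argmax(X @ W + b, axis=1) == y).mean())
```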

SLIDE 46

Stochastic Gradient Descent (SGD)

  • Algorithm that performs updates after each example
    • initialize $\theta \equiv \{W^{(1)}, b^{(1)}, \ldots, W^{(L+1)}, b^{(L+1)}\}$
    • for N iterations
      • for each training example $(x^{(t)}, y^{(t)})$ (or batch)
        • $\Delta = -\nabla_\theta \, l(f(x^{(t)}; \theta), y^{(t)}) - \lambda \nabla_\theta \Omega(\theta)$
        • $\theta \leftarrow \theta + \alpha \Delta$

  • To apply this algorithm to neural network training, we need:
    • the loss function $l(f(x^{(t)}; \theta), y^{(t)})$
    • a procedure to compute the parameter gradients $\nabla_\theta \, l(f(x^{(t)}; \theta), y^{(t)})$
    • the regularizer $\Omega(\theta)$ (and its gradient $\nabla_\theta \Omega(\theta)$)

  • Training epoch = iteration over all examples

46

SLIDE 47

What is a neural network again?

  • A family of parametric, non-linear and hierarchical representation learning functions:
    $a_L(x; \theta_{1,\ldots,L}) = h_L(h_{L-1}(\ldots h_1(x, \theta_1), \theta_{L-1}), \theta_L)$
  • x: input, $\theta_l$: parameters for layer l, $a_l = h_l(x, \theta_l)$: (non-)linear function
  • Given training corpus {X, Y}, find the optimal parameters:
    $\theta^{*} \leftarrow \arg\min_{\theta} \sum_{(x,y) \in (X,Y)} \ell(y, a_L(x; \theta_{1,\ldots,L}))$

47

SLIDE 48

Neural network models

  • A neural network model is a series of hierarchically connected functions
  • The hierarchy can be very, very complex

[Diagram: Input → h1(x_i; θ) → h2(x_i; θ) → h3(x_i; θ) → h4(x_i; θ) → h5(x_i; θ) → Loss; forward connections (feedforward architecture)]

48

SLIDE 49

Neural network models

  • A neural network model is a series of hierarchically connected functions
  • The hierarchy can be very, very complex

[Diagram: Input feeding modules h1(x_i; θ) … h5(x_i; θ) with interweaved connections (Directed Acyclic Graph architecture – DAGNN) into a Loss]

49

SLIDE 50

Again, what is a neural network again?

  • x: input, $\theta_l$: parameters for layer l, $a_l = h_l(x, \theta_l)$: (non-)linear function
    $a_L(x; \theta_{1,\ldots,L}) = h_L(h_{L-1}(\ldots h_1(x, \theta_1), \theta_{L-1}), \theta_L)$
  • Given training corpus {X, Y}, find the optimal parameters:
    $\theta^{*} \leftarrow \arg\min_{\theta} \sum_{(x,y) \in (X,Y)} \ell(y, a_L(x; \theta_{1,\ldots,L}))$
  • To use gradient descent optimization $\left( \theta^{t+1} = \theta^{t} - \eta_t \frac{\partial \mathcal{L}}{\partial \theta^{t}} \right)$ we need the gradients $\frac{\partial \mathcal{L}}{\partial \theta_l}$, $l = 1, \ldots, L$
  • How to compute the gradients for such a complicated function enclosing other functions, like $a_L(\ldots)$?

50

SLIDE 51

Chain rule

  • Assume a nested function, $z = g(y)$ and $y = h(x)$
  • Chain rule for scalars $x, y, z$: $\frac{dz}{dx} = \frac{dz}{dy}\frac{dy}{dx}$
  • When $x \in \mathbb{R}^m$, $y \in \mathbb{R}^n$, $z \in \mathbb{R}$: $\frac{dz}{dx_i} = \sum_j \frac{dz}{dy_j}\frac{dy_j}{dx_i}$
    ⟶ gradients from all possible paths

[Diagram: z at the top connected to intermediate nodes $y^{(1)}, y^{(2)}$, which connect to inputs $x^{(1)}, x^{(2)}, x^{(3)}$]

51

SLIDE 52

Chain rule

  • Same setting as before: $z = g(y)$, $y = h(x)$, with gradients summed over all possible paths
  • Example, for the first input $x_1$:
    $\frac{dz}{dx_1} = \frac{dz}{dy_1}\frac{dy_1}{dx_1} + \frac{dz}{dy_2}\frac{dy_2}{dx_1}$

[Diagram: z connected to $y_1, y_2$, which connect to $x_1, x_2, x_3$]

52

SLIDE 53

Chain rule

  • Example, for the first input $x_1$ (summing over both paths, through $y_1$ and $y_2$):
    $\frac{dz}{dx_1} = \frac{dz}{dy_1}\frac{dy_1}{dx_1} + \frac{dz}{dy_2}\frac{dy_2}{dx_1}$

53

SLIDE 54

Chain rule

  • Similarly, for the third input $x_3$:
    $\frac{dz}{dx_3} = \frac{dz}{dy_1}\frac{dy_1}{dx_3} + \frac{dz}{dy_2}\frac{dy_2}{dx_3}$

54
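To make the path-sum concrete, here is a small sketch (not from the slides) that checks the multivariate chain rule numerically for an assumed pair of functions y = h(x) and z = g(y) with two intermediate variables.

```python
import numpy as np

# Assumed example: y = h(x) with y1 = x1*x2, y2 = x2 + x3, and z = g(y) = y1 * y2.
def h(x):
    return np.array([x[0] * x[1], x[1] + x[2]])

def g(y):
    return y[0] * y[1]

x = np.array([1.5, -2.0, 0.5])
y = h(x)

# Analytic chain rule: dz/dx_i = sum_j (dz/dy_j)(dy_j/dx_i)
dz_dy = np.array([y[1], y[0]])                     # dz/dy1, dz/dy2
dy_dx = np.array([[x[1], x[0], 0.0],               # dy1/dx1, dy1/dx2, dy1/dx3
                  [0.0,  1.0, 1.0]])               # dy2/dx1, dy2/dx2, dy2/dx3
grad_analytic = dz_dy @ dy_dx                      # sum over all paths j

# Numerical check with finite differences
eps = 1e-6
grad_numeric = np.array([
    (g(h(x + eps * e)) - g(h(x - eps * e))) / (2 * eps)
    for e in np.eye(3)
])
print(grad_analytic, grad_numeric)                 # the two should agree closely
```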

SLIDE 55

Backpropagation ⟺ Chain rule!!!

  • The loss function $\mathcal{L}(y, a_L)$ depends on $a_L$, which depends on $a_{L-1}$, …, which depends on $a_l$:
    $a_L(x; \theta_{1,\ldots,L}) = h_L(h_{L-1}(\ldots h_1(x, \theta_1), \theta_{L-1}), \theta_L)$
  • Gradients of parameters of layer l → chain rule:
    $\frac{\partial \mathcal{L}}{\partial \theta_l} = \frac{\partial \mathcal{L}}{\partial a_L} \cdot \frac{\partial a_L}{\partial a_{L-1}} \cdot \frac{\partial a_{L-1}}{\partial a_{L-2}} \cdots \frac{\partial a_l}{\partial \theta_l}$
  • When shortened, we need two quantities:
    $\frac{\partial \mathcal{L}}{\partial \theta_l} = \left( \frac{\partial a_l}{\partial \theta_l} \right)^\top \cdot \frac{\partial \mathcal{L}}{\partial a_l}$
    ⟶ the gradient of a module w.r.t. its parameters, and the gradient of the loss w.r.t. the module output

55

SLIDE 56

Backpropagation ⟺ Chain rule!!!

  • For $\frac{\partial \mathcal{L}}{\partial a_l}$ we apply the chain rule again (recursive rule):
    $\frac{\partial \mathcal{L}}{\partial a_l} = \left( \frac{\partial a_{l+1}}{\partial a_l} \right)^\top \cdot \frac{\partial \mathcal{L}}{\partial a_{l+1}}$
  • We can rewrite $\frac{\partial a_{l+1}}{\partial a_l}$ as the gradient of a module w.r.t. its input:
    $\frac{\partial \mathcal{L}}{\partial a_l} = \left( \frac{\partial a_{l+1}}{\partial x_{l+1}} \right)^\top \cdot \frac{\partial \mathcal{L}}{\partial a_{l+1}}$
  • Remember, the output of a module is the input for the next one:
    $a_l = h_l(x_l; \theta_l)$,  $a_{l+1} = h_{l+1}(x_{l+1}; \theta_{l+1})$,  $x_{l+1} = a_l$

56
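The sketch below (not from the slides) applies exactly this recursion to a two-layer network with a sigmoid hidden layer and squared-error loss: the backward pass alternates between the gradient of the loss w.r.t. a module's output and the gradient of a module w.r.t. its parameters and input. The architecture and loss are assumptions, and the result is checked against a finite-difference estimate.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, W1, b1, W2, b2):
    a1 = W1 @ x + b1          # layer 1 pre-activation
    h1 = sigmoid(a1)          # layer 1 output (= input of layer 2)
    a2 = W2 @ h1 + b2         # layer 2 (output) pre-activation
    return a1, h1, a2

def backward(x, y, W1, b1, W2, b2):
    """Backprop for L = 0.5 * ||a2 - y||^2 using the recursive chain rule."""
    a1, h1, a2 = forward(x, W1, b1, W2, b2)
    dL_da2 = a2 - y                            # gradient of loss w.r.t. output module
    dW2 = np.outer(dL_da2, h1)                 # gradient of module w.r.t. its parameters
    db2 = dL_da2
    dL_dh1 = W2.T @ dL_da2                     # pass gradient back to the module's input
    dL_da1 = dL_dh1 * h1 * (1 - h1)            # through the sigmoid nonlinearity
    dW1 = np.outer(dL_da1, x)
    db1 = dL_da1
    return dW1, db1, dW2, db2

rng = np.random.default_rng(0)
x, y = rng.normal(size=3), rng.normal(size=2)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)

dW1, db1, dW2, db2 = backward(x, y, W1, b1, W2, b2)

# Finite-difference check of one entry of W1
eps = 1e-6
loss = lambda W: 0.5 * np.sum((forward(x, W, b1, W2, b2)[2] - y) ** 2)
Wp, Wm = W1.copy(), W1.copy()
Wp[0, 0] += eps
Wm[0, 0] -= eps
print(dW1[0, 0], (loss(Wp) - loss(Wm)) / (2 * eps))   # should match closely
```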

SLIDE 57

So what is deep learning?

57

SLIDE 58

Three key ideas

  • (Hierarchical) Compositionality
  • Cascade of non-linear transformations
  • Multiple layers of representations
  • End-to-End Learning
  • Learning (goal-driven) representations
  • Learning to feature extract
  • Distributed Representations
  • No single neuron “encodes” everything
  • Groups of neurons work together

58

slide by Dhruv Batra

SLIDE 59

Three key ideas

  • (Hierarchical) Compositionality
  • Cascade of non-linear transformations
  • Multiple layers of representations
  • End-to-End Learning
  • Learning (goal-driven) representations
  • Learning to feature extract
  • Distributed Representations
  • No single neuron “encodes” everything
  • Groups of neurons work together

59

slide by Dhruv Batra

SLIDE 60

Traditional Machine Learning

  • VISION: image → hand-crafted features (SIFT/HOG) [fixed] → your favorite classifier [learned] → “car”
  • SPEECH: audio → hand-crafted features (MFCC) [fixed] → your favorite classifier [learned] → \ˈd ē p\
  • NLP: “This burrito place is yummy and fun!” → hand-crafted features (Bag-of-words) [fixed] → your favorite classifier [learned] → “+”

slide by Marc’Aurelio Ranzato, Yann LeCun

60

SLIDE 61

It’s an old paradigm

  • The first learning machine: the Perceptron, built at Cornell in 1960
  • The Perceptron was a linear classifier on top of a simple feature extractor:
    $y = \mathrm{sign}\left( \sum_{i=1}^{N} W_i F_i(X) + b \right)$
  • The vast majority of practical applications of ML today use glorified linear classifiers or glorified template matching
  • Designing a feature extractor requires considerable efforts by experts

61

slide by Marc’Aurelio Ranzato, Yann LeCun

SLIDE 62

Hierarchical Compositionality

  • VISION: pixels → edge → texton → motif → part → object
  • SPEECH: sample → spectral band → formant → motif → phone → word
  • NLP: character → word → NP/VP/.. → clause → sentence → story

slide by Marc’Aurelio Ranzato, Yann LeCun

62

SLIDE 63

Building A Complicated Function

  • Given a library of simple functions
  • Compose into a complicated function

slide by Marc’Aurelio Ranzato, Yann LeCun

63

SLIDE 64

Building A Complicated Function

  • Given a library of simple functions
  • Compose into a complicated function

Idea 1: Linear Combinations
  • Boosting
  • Kernels

slide by Marc’Aurelio Ranzato, Yann LeCun

64

SLIDE 65

Building A Complicated Function

  • Given a library of simple functions
  • Compose into a complicated function

Idea 2: Compositions
  • Deep Learning
  • Grammar models
  • Scattering transforms…

slide by Marc’Aurelio Ranzato, Yann LeCun

65

SLIDE 66

Building A Complicated Function

  • Given a library of simple functions
  • Compose into a complicated function

Idea 2: Compositions
  • Deep Learning
  • Grammar models
  • Scattering transforms…

slide by Marc’Aurelio Ranzato, Yann LeCun

66

SLIDE 67

Deep Learning = Hierarchical Compositionality

[Figure: an image mapped through a hierarchy of learned features to the label “car”]

slide by Marc’Aurelio Ranzato, Yann LeCun

67

SLIDE 68

Deep Learning = Hierarchical Compositionality

  • Low-Level Feature → Mid-Level Feature → High-Level Feature → Trainable Classifier → “car”

Feature visualization of convolutional net trained on ImageNet from [Zeiler & Fergus 2013]

slide by Marc’Aurelio Ranzato, Yann LeCun

68

SLIDE 69

Sparse DBNs [Lee et al. ICML ’09]

Figure courtesy: Quoc Le

69

slide by Dhruv Batra

SLIDE 70

Three key ideas

  • (Hierarchical) Compositionality
  • Cascade of non-linear transformations
  • Multiple layers of representations
  • End-to-End Learning
  • Learning (goal-driven) representations
  • Learning to feature extract
  • Distributed Representations
  • No single neuron “encodes” everything
  • Groups of neurons work together

70

slide by Dhruv Batra

SLIDE 71

Traditional Machine Learning

  • VISION: image → hand-crafted features (SIFT/HOG) [fixed] → your favorite classifier [learned] → “car”
  • SPEECH: audio → hand-crafted features (MFCC) [fixed] → your favorite classifier [learned] → \ˈd ē p\
  • NLP: “This burrito place is yummy and fun!” → hand-crafted features (Bag-of-words) [fixed] → your favorite classifier [learned] → “+”

slide by Marc’Aurelio Ranzato, Yann LeCun

71

SLIDE 72

Traditional Machine Learning (more accurately)

  • VISION: image → SIFT/HOG [fixed] → K-Means/pooling [unsupervised, “learned”] → classifier [supervised] → “car”
  • SPEECH: audio → MFCC [fixed] → Mixture of Gaussians [unsupervised, “learned”] → classifier [supervised] → \ˈd ē p\
  • NLP: “This burrito place is yummy and fun!” → Parse Tree Syntactic [fixed] → n-grams [unsupervised, “learned”] → classifier [supervised] → “+”

slide by Marc’Aurelio Ranzato, Yann LeCun

72

SLIDE 73

Deep Learning = End-to-End Learning

  • VISION: image → SIFT/HOG [fixed] → K-Means/pooling [unsupervised, “learned”] → classifier [supervised] → “car”
  • SPEECH: audio → MFCC [fixed] → Mixture of Gaussians [unsupervised, “learned”] → classifier [supervised] → \ˈd ē p\
  • NLP: “This burrito place is yummy and fun!” → Parse Tree Syntactic [fixed] → n-grams [unsupervised, “learned”] → classifier [supervised] → “+”

slide by Marc’Aurelio Ranzato, Yann LeCun

73

SLIDE 74

Deep Learning = End-to-End Learning

  • A hierarchy of trainable feature transforms
  • Each module transforms its input representation into a higher-level one
  • High-level features are more global and more invariant
  • Low-level features are shared among categories

[Diagram: Trainable Feature-Transform / Classifier → Trainable Feature-Transform / Classifier → Trainable Feature-Transform / Classifier, with Learned Internal Representations]

slide by Marc’Aurelio Ranzato, Yann LeCun

74

SLIDE 75

“Shallow” vs Deep Learning

  • “Shallow” models: hand-crafted feature extractor [fixed] → “simple” trainable classifier [learned]
  • Deep models: Trainable Feature-Transform / Classifier → Trainable Feature-Transform / Classifier → Trainable Feature-Transform / Classifier, with Learned Internal Representations

slide by Marc’Aurelio Ranzato, Yann LeCun

75

SLIDE 76

Three key ideas

  • (Hierarchical) Compositionality
  • Cascade of non-linear transformations
  • Multiple layers of representations
  • End-to-End Learning
  • Learning (goal-driven) representations
  • Learning to feature extract
  • Distributed Representations
  • No single neuron “encodes” everything
  • Groups of neurons work together

76

slide by Dhruv Batra

SLIDE 77

Localist representations

  • The simplest way to represent things with neural networks is to dedicate one neuron to each thing
    • Easy to understand
    • Easy to code by hand: often used to represent inputs to a net
    • Easy to learn: this is what mixture models do; each cluster corresponds to one neuron
    • Easy to associate with other representations or responses
  • But localist models are very inefficient whenever the data has componential structure

77

Image credit: Moontae Lee

slide by Geoff Hinton

SLIDE 78

Distributed Representations

  • Each neuron must represent something, so this must be a local representation.
  • Distributed representation means a many-to-many relationship between two types of representation (such as concepts and neurons).
    • Each concept is represented by many neurons
    • Each neuron participates in the representation of many concepts

[Figure: local vs. distributed representations]

78

slide by Geoff Hinton

Image credit: Moontae Lee

SLIDE 79

Power of distributed representations!

Scene Classification (e.g., bedroom, mountain)

  • Possible internal representations:
    • Objects
    • Scene attributes
    • Object parts
    • Textures

  • B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, “Object Detectors Emerge in Deep Scene CNNs”, ICLR 2015

79

slide by Bolei Zhou

SLIDE 80

Deep Convolutional Neural Networks

80

SLIDE 81

Convolutions

slide by Yisong Yue

81

SLIDE 82

Convolution Filters

82

slide by Yisong Yue
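Since this and the following slides show filters as images only, here is a small sketch (not from the slides) of the underlying operation: a 2D convolution of an image with a filter kernel, written out with explicit loops. "Valid" padding and the example kernel are assumptions.

```python
import numpy as np

def conv2d(image, kernel):
    """2D convolution (valid padding): slide the flipped kernel over the image."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    k = np.flipud(np.fliplr(kernel))          # true convolution flips the kernel
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * k)
    return out

image = np.random.default_rng(0).random((8, 8))
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)  # horizontal-gradient (edge) filter
print(conv2d(image, sobel_x).shape)            # (6, 6) response map
```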

SLIDE 83

Gabor Filters

83

slide by Yisong Yue
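As with the convolution sketch above, this is an illustrative snippet (not from the slides) that builds a Gabor filter kernel, a Gaussian envelope multiplied by an oriented sinusoid; all parameter values below are assumptions.

```python
import numpy as np

def gabor_kernel(size=21, sigma=4.0, theta=0.0, wavelength=8.0, psi=0.0, gamma=0.5):
    """Real part of a Gabor filter: Gaussian envelope times an oriented cosine wave."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    x_t = x * np.cos(theta) + y * np.sin(theta)       # rotate coordinates by theta
    y_t = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(x_t ** 2 + (gamma * y_t) ** 2) / (2 * sigma ** 2))
    carrier = np.cos(2 * np.pi * x_t / wavelength + psi)
    return envelope * carrier

# A small bank of orientations, like the filters pictured on the slide
bank = [gabor_kernel(theta=t) for t in (0, np.pi / 4, np.pi / 2, 3 * np.pi / 4)]
print(bank[0].shape, len(bank))   # (21, 21) kernels at 4 orientations
```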

SLIDE 84

Gaussian Blur Filters

84

slide by Yisong Yue
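Finally, a short sketch (not from the slides) that constructs a normalized 2D Gaussian blur kernel, which can be passed to the conv2d function above; the kernel size and sigma are assumptions.

```python
import numpy as np

def gaussian_kernel(size=5, sigma=1.0):
    """Normalized 2D Gaussian kernel for blurring (entries sum to 1)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    k = np.exp(-(x ** 2 + y ** 2) / (2 * sigma ** 2))
    return k / k.sum()

k = gaussian_kernel(5, sigma=1.0)
print(np.round(k, 4), k.sum())   # smooth, center-peaked weights that sum to 1.0
```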