SLIDE 1

Image Classification with Deep Networks

Ronan Collobert

Facebook AI Research

Feb 11, 2015

SLIDE 2

Overview

  • Origins of Deep Learning
  • Shallow vs Deep
  • Perceptron
  • Multi Layer Perceptrons
  • Going Deeper
  • Why?
  • Issues (and fix)?
  • Convolutional Neural Networks
  • Fancier Architectures
  • Applications

SLIDE 3

Acknowledgement

Some of these slides have been cut-and-pasted from Marc’Aurelio Ranzato’s original presentation

SLIDE 4

Shallow vs Deep

SLIDE 5

Shallow Learning (1/2)

SLIDE 6

Shallow Learning (2/2)

Typical example

SLIDE 7

Deep Learning (1/2)

SLIDE 8

Deep Learning (2/2)

SLIDE 9

Deep Learning (2/2)

SLIDE 10

Perceptrons

(shallow)

SLIDE 11

Biological Neuron

  • Dendrites connected to other neurons through synapses
  • Excitatory and inhibitory signals are integrated
  • If stimulus reaches a threshold, neuron fires along the axon

SLIDE 12

McCulloch and Pitts (1943)

  • Neurons as linear threshold units
  • Binary inputs x ∈ {0, 1}^d, binary output, vector of weights w ∈ R^d

    f(x) = 1 if w · x > T, 0 otherwise
  • A unit can perform OR and AND operations
  • Combine these units to represent any boolean function
  • How to train them?
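To make the unit concrete, here is a minimal sketch (Python/NumPy, not from the slides; the weights and thresholds are illustrative choices) showing that a single threshold unit can realize OR and AND on binary inputs.

```python
import numpy as np

def threshold_unit(x, w, T):
    """McCulloch-Pitts unit: fires (outputs 1) if w . x exceeds the threshold T."""
    return 1 if np.dot(w, x) > T else 0

# Illustrative weights/thresholds (our choice, not from the slides):
# with w = [1, 1], T = 0.5 the unit computes OR; with T = 1.5 it computes AND.
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    x = np.array(x)
    print(x, "OR:", threshold_unit(x, np.array([1, 1]), 0.5),
             "AND:", threshold_unit(x, np.array([1, 1]), 1.5))
```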

SLIDE 13

Perceptron: Rosenblatt (1957)

  • Input: retina x ∈ R^n
  • Associative area: any kind of (fixed) function ϕ(x) ∈ R^d
  • Decision function:

    f(x) = 1 if w · ϕ(x) > 0, −1 otherwise

SLIDE 14

Perceptron: Rosenblatt (1957)

(Figure: separating hyperplane w · x + b = 0)

  • Training update rule: given (x_t, y_t) ∈ R^d × {−1, 1}

    w_{t+1} = w_t + y_t ϕ(x_t) if y_t w · ϕ(x_t) ≤ 0, w_t otherwise

  • Note that

    w_{t+1} · ϕ(x_t) = w_t · ϕ(x_t) + y_t ||ϕ(x_t)||²    with ||ϕ(x_t)||² > 0

  • Corresponds to minimizing

    w → Σ_t max(0, −y_t w · ϕ(x_t))
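A minimal sketch of the update rule above (Python/NumPy, not from the slides), taking ϕ(x) = x and a toy linearly separable dataset; all names and constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linearly separable data: the label is the sign of the first coordinate.
X = rng.normal(size=(100, 2))
y = np.where(X[:, 0] > 0, 1, -1)

w = np.zeros(2)                          # perceptron weights
for epoch in range(10):
    for x_t, y_t in zip(X, y):
        if y_t * np.dot(w, x_t) <= 0:    # mistake: w <- w + y_t * phi(x_t)
            w = w + y_t * x_t

print("training mistakes after 10 epochs:", np.sum(np.sign(X @ w) != y))
```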

SLIDE 15

Multi Layer Perceptrons

(deeper)

SLIDE 16

Going Non-Linear

  • How to train a “good” ϕ(·) in w · ϕ(x)?
  • Many approaches have been tried!
  • Neocognitron (Fukushima, 1980)

SLIDE 17

Going Non-Linear

  • Madaline: Winter & Widrow, 1988
  • Multi Layer Perceptron

    x → W_1 × · → tanh(·) → W_2 × · → score

  • Matrix-vector multiplications interleaved with non-linearities
  • Each row of W_1 corresponds to a hidden unit
  • The number of hidden units must be chosen carefully
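As a sketch of the diagram above (Python/NumPy, not from the slides; sizes are illustrative and biases are omitted as in the diagram), the forward pass is just two matrix-vector products with a tanh in between.

```python
import numpy as np

def mlp_score(x, W1, W2):
    """Two matrix-vector products interleaved with a tanh non-linearity."""
    h = np.tanh(W1 @ x)        # each row of W1 is one hidden unit
    return W2 @ h              # output score(s)

rng = np.random.default_rng(0)
d, n_hidden = 5, 10                       # illustrative sizes
W1 = rng.normal(size=(n_hidden, d))
W2 = rng.normal(size=(1, n_hidden))
print(mlp_score(rng.normal(size=d), W1, W2))
```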

SLIDE 18

Universal Approximator (Cybenko, 1989)

  • Any function g : R^d → R can be approximated (on a compact set) by a two-layer neural network

    x → W_1 × · → tanh(·) → W_2 × · → score

  • Note:
  • It does not say how to train it
  • It does not say anything about the generalization capabilities

SLIDE 19

Training a Neural Network

  • Given a network f_W(·) with parameters W, “input” examples x_t and “targets” y_t, we want to minimize a loss

    W → Σ_{(x_t, y_t)} C(f_W(x_t), y_t)

  • View the network + loss as a “stack” of layers

    x → f_1(·) → f_2(·) → f_3(·) → f_4(·)

    f(x) = f_L(f_{L−1}(. . . f_1(x)))

  • Optimization problem: use some sort of gradient descent

    w_l ← w_l − λ ∂f/∂w_l   ∀l   →   How to compute ∂f/∂w_l, ∀l?

SLIDE 20

Gradient Backpropagation (1/2)

  • In the neural network field: (Rumelhart et al., 1986)
  • However, earlier possible references exist, including (Leibniz, 1675) and (Newton, 1687)

  • E.g., in the Adaline (L = 2):

    x → w_1 × · → ½ (y − ·)²

  • f_1(x) = w_1 · x
  • f_2(f_1) = ½ (y − f_1)²

    ∂f/∂w_1 = (∂f_2/∂f_1) (∂f_1/∂w_1),   with ∂f_2/∂f_1 = −(y − f_1) and ∂f_1/∂w_1 = x

SLIDE 21

Gradient Backpropagation (2/2)

    x → f_1(·) → f_2(·) → f_3(·) → f_4(·)

  • Chain rule:

    ∂f/∂w_l = (∂f_L/∂f_{L−1}) (∂f_{L−1}/∂f_{L−2}) · · · (∂f_{l+1}/∂f_l) (∂f_l/∂w_l) = (∂f/∂f_l) (∂f_l/∂w_l)

  • In the backprop way, each module f_l(·)
  • Receives the gradient w.r.t. its own output f_l
  • Computes the gradient w.r.t. its own input f_{l−1} (backward)
  • Computes the gradient w.r.t. its own parameters w_l (if any)

    ∂f/∂f_{l−1} = (∂f/∂f_l) (∂f_l/∂f_{l−1})        ∂f/∂w_l = (∂f/∂f_l) (∂f_l/∂w_l)

SLIDE 22

Examples Of Modules

  • We denote
  • x the input of a module
  • z the target of a loss module
  • y the output of a module f_l(x)
  • ỹ the gradient w.r.t. the output of each module

    Module            Forward                 Backward (w.r.t. input)    Gradient (w.r.t. parameters)
    Linear            y = W x                 Wᵀ ỹ                       ỹ xᵀ
    Tanh              y = tanh(x)             ỹ (1 − y²)
    Sigmoid           y = 1/(1 + exp(−x))     ỹ (1 − y) y
    ReLU              y = max(0, x)           ỹ 1_{x ≥ 0}
    Perceptron Loss   y = max(0, −z x)        −z 1_{z·x ≤ 0}
    MSE Loss          y = ½ (x − z)²          x − z
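The table maps almost line for line onto code. Below is a hedged sketch (Python/NumPy; the class names and sizes are ours, not from the slides) of three of the modules, chained as on the previous slide: each backward call receives ỹ, stores the gradient w.r.t. its parameters when it has any, and returns the gradient w.r.t. its input.

```python
import numpy as np

class Linear:
    def __init__(self, n_out, n_in, rng):
        self.W = rng.normal(scale=0.1, size=(n_out, n_in))
    def forward(self, x):
        self.x = x
        return self.W @ x
    def backward(self, grad_y):                  # table: backward = W^T ytilde, gradient = ytilde x^T
        self.grad_W = np.outer(grad_y, self.x)
        return self.W.T @ grad_y

class Tanh:
    def forward(self, x):
        self.y = np.tanh(x)
        return self.y
    def backward(self, grad_y):                  # table: ytilde (1 - y^2)
        return grad_y * (1 - self.y ** 2)

class MSELoss:
    def forward(self, x, z):
        self.x, self.z = x, z
        return 0.5 * np.sum((x - z) ** 2)
    def backward(self):                          # table: x - z
        return self.x - self.z

rng = np.random.default_rng(0)
net = [Linear(3, 4, rng), Tanh(), Linear(2, 3, rng)]
loss = MSELoss()

x, z = rng.normal(size=4), rng.normal(size=2)
h = x
for module in net:                               # forward pass through the stack
    h = module.forward(h)
print("loss:", loss.forward(h, z))

grad = loss.backward()
for module in reversed(net):                     # backward pass: propagate ytilde
    grad = module.backward(grad)
```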

SLIDE 23

Typical Classification Loss (er, Likelihood)

  • Given a set of examples (x_t, y_t) ∈ R^d × N, t = 1 . . . T, we want to maximize the (log-)likelihood

    log Π_{t=1..T} p(y_t|x_t) = Σ_{t=1..T} log p(y_t|x_t)

  • The network outputs a score f_y(x) per class y
  • Interpret scores as conditional probabilities using a softmax:

    p(y|x) = exp(f_y(x)) / Σ_i exp(f_i(x))

  • In practice we consider only log-probabilities:

    log p(y|x) = f_y(x) − log Σ_i exp(f_i(x))
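A small numerical sketch of the softmax and its log form (Python/NumPy, not from the slides); the max-subtraction used for numerical stability is our addition, not something the slide mentions.

```python
import numpy as np

def log_softmax(f):
    """log p(y|x) = f_y(x) - log sum_i exp(f_i(x)), computed stably."""
    m = np.max(f)                                # subtract the max for numerical stability
    return f - (m + np.log(np.sum(np.exp(f - m))))

scores = np.array([2.0, 1.0, -1.0])              # one score f_y(x) per class
log_p = log_softmax(scores)
print("p(y|x):", np.exp(log_p), "sum:", np.exp(log_p).sum())
print("negative log-likelihood of class 0:", -log_p[0])
```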

SLIDE 24

Optimization Techniques

Minimize

    W → Σ_{(x_t, y_t)} C(f_W(x_t), y_t)

  • Gradient descent (“batch”)

    W ← W − λ Σ_{(x_t, y_t)} ∂C(f_W(x_t), y_t)/∂W

  • Stochastic gradient descent

    W ← W − λ ∂C(f_W(x_t), y_t)/∂W

  • Many variants, including second-order techniques (where the Hessian is approximated)
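A hedged sketch contrasting the two updates (Python/NumPy, not from the slides) on a linear least-squares model; the data, the loss C, and the learning rate are illustrative, and the batch gradient is averaged here (the slide writes a plain sum) to keep the step size comparable.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
w_true = rng.normal(size=5)
y = X @ w_true + 0.01 * rng.normal(size=200)

def grad_single(W, x_t, y_t):
    """Gradient of C(f_W(x_t), y_t) = 1/2 (W.x_t - y_t)^2 with respect to W."""
    return (W @ x_t - y_t) * x_t

lam = 0.05
W_batch, W_sgd = np.zeros(5), np.zeros(5)
for epoch in range(50):
    # "batch" gradient descent: accumulate over all (x_t, y_t) before updating
    W_batch -= lam * sum(grad_single(W_batch, x_t, y_t) for x_t, y_t in zip(X, y)) / len(X)
    # stochastic gradient descent: update after each example
    for x_t, y_t in zip(X, y):
        W_sgd -= lam * grad_single(W_sgd, x_t, y_t)

print("batch error:", np.linalg.norm(W_batch - w_true))
print("SGD error:  ", np.linalg.norm(W_sgd - w_true))
```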

SLIDE 25

Going Deeper

SLIDE 26

Deeper: What is the Point? (1/3)

    x → f_1(·) → f_2(·) → f_3(·) → f_4(·)

  • Share features across the “deep” hierarchy
  • Compose these features
  • Efficiency: intermediate computations are re-used

[0 0 1 0 0 0 0 1 0 0 1 0 1 0 0 . . . ] truck feature

SLIDE 27

Deeper: What is the Point? (2/3)

Sharing

    [1 0 0 0 0 1 0 1 0 0 0 0 0 1 0 . . . ]  motorbike
    [0 0 1 0 0 0 0 1 0 0 1 0 1 0 0 . . . ]  truck

SLIDE 28

Deeper: What is the Point? (3/3)

Composing

(Lee et al., 2009)

SLIDE 29

Deeper: What are the Issues? (1/2)

Vanishing Gradients

  • Chain rule:

    ∂f/∂w_l = (∂f_L/∂f_{L−1}) (∂f_{L−1}/∂f_{L−2}) · · · (∂f_{l+1}/∂f_l) (∂f_l/∂w_l)

  • Because of the transfer-function non-linearities, some ∂f_{l+1}/∂f_l will be very small, or zero, when back-propagating
  • E.g. with ReLU:

    y = max(0, x)        ∂y/∂x = 1_{x ≥ 0}

SLIDE 30

Deeper: What are the Issues? (2/2)

Number of Parameters

  • A fully connected layer on a 200 × 200 image with 1000 hidden units leads to 40M parameters (200 · 200 · 1000 = 4 × 10⁷ weights)
  • We would need a lot of training examples
  • Spatial correlation is local anyway

SLIDE 31

Fix Vanishing Gradient Issue with Unsupervised Training (1/2)

  • Leverage unlabeled data (when there is no y)?
  • Popular way to pretrain each layer
  • “Auto-encoder/bottleneck” network

    x → W_1 × · → tanh(·) → W_2 × · → tanh(·) → W_3 × ·

  • Learn to reconstruct the input: minimize ||f(x) − x||²
  • Caveats:
  • Equivalent to PCA if there is no W_2 layer (Bourlard & Kamp, 1988)
  • Projected intermediate space must be of lower dimension

SLIDE 32

Fix Vanishing Gradient Issue with Unsupervised Training (2/2)

    x → W_1 × · → tanh(·) → W_2 × · → tanh(·) → W_3 × ·

  • Possible improvements:
  • No W_2 layer, W_3 = W_1ᵀ (Bengio et al., 2006)
  • Noise injection in x, reconstruct the true x (Bengio et al., 2008)
  • Impose sparsity constraints on the projection (Kavukcuoglu et al., 2008)
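A minimal sketch (Python/NumPy, not from the slides) combining ideas from this slide and the previous one: a tied-weight (W_3 = W_1ᵀ) autoencoder trained with noise injection to reconstruct the clean input; sizes, noise level, and learning rate are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 20))   # unlabeled data with low-dimensional structure

d, k, lam = 20, 5, 0.01                          # bottleneck k < d; illustrative sizes and rate
W = rng.normal(scale=0.1, size=(k, d))           # tied weights: the decoder reuses W.T

for epoch in range(20):
    for x in X:
        x_noisy = x + 0.1 * rng.normal(size=d)   # noise injection: still reconstruct the clean x
        h = np.tanh(W @ x_noisy)                 # encoder
        x_hat = W.T @ h                          # decoder with tied weights (W_3 = W_1^T)
        err = x_hat - x                          # gradient of 1/2 ||x_hat - x||^2 w.r.t. x_hat
        grad_pre = (W @ err) * (1 - h ** 2)      # backprop through the decoder and the tanh
        W -= lam * (np.outer(grad_pre, x_noisy) + np.outer(h, err))

H = np.tanh(X @ W.T)                             # encode the clean data
print("mean reconstruction error:", np.mean(np.sum((H @ W - X) ** 2, axis=1)))
```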

SLIDE 33

Fix Number of Parameters Issue by Generating Examples (1/2)

  • Capacity h is too large? Find more training examples L!

SLIDE 34

Fix Number of Parameters Issue by Generating Examples (2/2)

  • Concrete example: digit recognition
  • Add an (infinite) number of random deformations (Simard et al., 2003)
  • State-of-the-art with 9 layers of 1000 hidden units and... a GPU (Ciresan et al., 2010)

  • In general, data augmentation includes
  • random translation or rotation
  • random left/right flipping
  • random scaling
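A hedged sketch of such augmentations (Python/NumPy only, not from the slides); the shift range, flip probability, and scale range are illustrative choices, and scaling is done by nearest-neighbour resampling.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(img):
    """Random shift, left/right flip, and rescale of a 2D image (illustrative ranges)."""
    # random translation: shift by up to +/-2 pixels in each direction
    dy, dx = rng.integers(-2, 3, size=2)
    img = np.roll(np.roll(img, dy, axis=0), dx, axis=1)
    # random left/right flip
    if rng.random() < 0.5:
        img = img[:, ::-1]
    # random geometric scaling via nearest-neighbour resampling
    s = rng.uniform(0.9, 1.1)
    h, w = img.shape
    ys = np.clip((np.arange(h) / s).astype(int), 0, h - 1)
    xs = np.clip((np.arange(w) / s).astype(int), 0, w - 1)
    return img[np.ix_(ys, xs)]

image = rng.random((28, 28))                      # toy "digit"
augmented = [augment(image) for _ in range(10)]   # an (in principle infinite) stream of deformations
print(len(augmented), augmented[0].shape)
```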

SLIDE 35

Convolutional Neural Networks

SLIDE 36

2D Convolutions (1/4)

  • Share parameters across different locations

(Fukushima, 1980) (LeCun, 1987)

SLIDE 37

2D Convolutions (1/4)

  • Share parameters across different locations

(Fukushima, 1980) (LeCun, 1987)

SLIDE 38

2D Convolutions (1/4)

  • Share parameters across different locations

(Fukushima, 1980) (LeCun, 1987)

SLIDE 39

2D Convolutions (2/4)

  • It is like applying a filter to the image...
  • ...but the filter is trained

    (Figure: image ⋆ trained filter = filtered output, shown for two filters)

SLIDE 40

2D Convolutions (3/4)

  • It is again a matrix-vector operation, but where weights are spatially “shared”

    (Figure: the same weights W applied at locations 1, 2, 3 across the input)

  • As for normal linear layers, convolutions can be stacked for higher-level representations
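A minimal sketch (Python/NumPy, not from the slides) of a single 2D filter applied with shared weights at every location (a "valid" cross-correlation); in a CNN the kernel values would be trained rather than fixed.

```python
import numpy as np

def conv2d(image, kernel):
    """Apply one filter at every spatial location (shared weights), 'valid' mode."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

rng = np.random.default_rng(0)
image = rng.random((8, 8))
kernel = rng.normal(size=(3, 3))    # in a CNN this filter is trained, not fixed
print(conv2d(image, kernel).shape)  # (6, 6)
```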

SLIDE 41

2D Convolutions (4/4)

SLIDE 42

Spatial Pooling (1/2)

  • “Pooling” (e.g. with a max() operation) increases robustness w.r.t. spatial location
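A minimal sketch of non-overlapping max pooling (Python/NumPy, not from the slides; the 2 × 2 block size is an illustrative choice).

```python
import numpy as np

def max_pool(feature_map, size=2):
    """Non-overlapping max pooling: keep the strongest response in each size x size block."""
    H, W = feature_map.shape
    H, W = H - H % size, W - W % size            # drop any leftover border
    blocks = feature_map[:H, :W].reshape(H // size, size, W // size, size)
    return blocks.max(axis=(1, 3))

x = np.arange(16).reshape(4, 4)
print(max_pool(x))   # small shifts of a strong response inside a block do not change the output
```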

SLIDE 43

Spatial Pooling (2/2)

Controls the capacity

  • A unit will see “more” of the image, for the same number of parameters
  • Adding pooling decreases the size of subsequent fully connected layers!

SLIDE 44

Spatial Pooling (2/2)

Controls the capacity

  • A unit will see “more” of the image, for the same number of parameters
  • Adding pooling decreases the size of subsequent fully connected layers!

SLIDE 45

Fully Connected Layers

Fully connected layers are a particular type of convolution (with a w × h kernel covering the whole input)

SLIDE 46

Training vs Testing

  • Training Phase
  • Testing Phase

SLIDE 47

Training vs Testing

  • Training Phase
  • Testing Phase
  • convolutions are naturally applied to larger input images
  • much faster than sliding windows

SLIDE 48

Fancier Architectures

SLIDE 49

Multi-Scale

(Farabet et al., 2013) “Learning hierarchical features for scene labeling”

SLIDE 50

Multi-Modal

(Frome et al., 2013) “DeViSE: a deep visual-semantic embedding model”

SLIDE 51

Multi-Task

(Zhang et al., 2014) “PANDA”

SLIDE 52

Recurrent Neural Networks (1/3)

  • Leverage the previous output (label scores):

    I(t), O(t − 1) → H(t) → O(t)

  • Leverage the previous hidden representation:

    I(t), H(t − 1) → H(t) → O(t)

  • Both:

    I(t), O(t − 1), H(t − 1) → H(t) → O(t)

(Jordan, 1986) (Elman, 1990)

SLIDE 53

Recurrent Neural Networks (2/3)

  • Training: unfold network through time
  • Weights are shared through time
  • Standard backpropagation applies

    (Figure: the recurrent network I(t), O(t − 1) → H(t) → O(t), unrolled from O(0) over I(1), H(1), O(1) . . . I(4), H(4), O(4))

  • Note: the loss might include all O(t), ∀t

  • Must consider the full sequence 1..T (not real-time)
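A minimal sketch (Python/NumPy, not from the slides; sizes and names are illustrative) of an Elman-style recurrence unrolled over a full sequence, with the weights shared across time; training would backpropagate through this unrolled graph.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid, d_out, T = 3, 5, 2, 4               # illustrative sizes, sequence length T

W_i = rng.normal(scale=0.1, size=(d_hid, d_in))  # input-to-hidden (shared across time)
W_h = rng.normal(scale=0.1, size=(d_hid, d_hid)) # hidden-to-hidden (shared across time)
W_o = rng.normal(scale=0.1, size=(d_out, d_hid)) # hidden-to-output (shared across time)

inputs = [rng.normal(size=d_in) for _ in range(T)]   # I(1) ... I(T)
h = np.zeros(d_hid)                                  # H(0)
outputs = []
for I_t in inputs:                                   # unfold the network through time
    h = np.tanh(W_i @ I_t + W_h @ h)                 # H(t) from I(t) and H(t-1)
    outputs.append(W_o @ h)                          # O(t); the loss may involve all O(t)

print([o.round(2) for o in outputs])
```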

SLIDE 54

Recurrent Neural Networks (3/3)

(Pinheiro et al., 2014) “Recurrent Convolutional Neural Networks for Scene Labeling”

SLIDE 55

Graph Transformer Networks

(Bottou et al., 1997) (LeCun et al., 1998)

SLIDE 56

Applications

SLIDE 57

Digit Recognition (1/2)

    Model                      Err. rate (%)
    Gaussian SVM               1.4
    1000 HU NN (MSE)           4.5
    800 HU NN                  1.6
    CNN                        0.8
    CNN + distortions          0.4
    9-layer NN + distortions   0.4

SLIDE 58

Digit Recognition (2/2)

(LeCun et al., 1998)

SLIDE 59

ImageNet (1/2)

(Deng et al., 2009) “Imagenet: a large scale hierarchical image database”

SLIDE 60

ImageNet (2/2)

(Krizhevsky et al., 2012) “ImageNet Classification with deep CNNs”

SLIDE 61

Texture Classification

(Sifre et al., 2013) “Rotation, scaling and deformation invariant scattering for texture discrimination”

SLIDE 62

Object Segmentation

(Farabet et al., 2013) “Learning hierarchical features for scene labeling” (Pinheiro et al., 2014) “Recurrent CNN for scene parsing”

SLIDE 63

Action Recognition in Videos

(Taylor et al., 2010) “Convolutional learning of spatio-temporal features” (Karpathy et al, 2014) “Large-scale video classification with CNNs”

SLIDE 64

Denoising

(Burger et al., 2012) “Can plain NNs compete with BM3D?”

SLIDE 65

Toolboxes

  • Torch7

http://torch7.org

  • Theano

http://deeplearning.net/software/theano

  • Cuda Convnet

http://code.google.com/p/cuda-convnet

  • Caffe

http://caffe.berkeleyvision.org

  • NVIDIA Kernels

https://developer.nvidia.com/cuDNN
