[PPT] - Convolutional Neural Networks 08, 10 & 17 Nov, 2016 J. Ezequiel PowerPoint Presentation

SLIDE 1

Convolutional Neural Networks 1

Convolutional Neural Networks

08, 10 & 17 Nov, 2016

J. Ezequiel Soto S.

Image Processing 2016

Prof. Luiz Velho

SLIDE 2

Summary & References

08/11 ImageNet Classification with Deep Convolutional Neural Networks

2012, Krizhevsky et al. [source]

10/11 Going Deeper with Convolutions

2015, Szegedy et al. [source]

17/11 Painting Style Transfer for Head Portraits using Convolutional Neural Networks

2016, Selim & Elgharib [source]

+ CS231n: Convolutional Neural Networks for Visual Recognition

Stanford University Course Notes

+ Very Deep Convolutional Networks for Large-Scale Image Recognition

2015, Simonyan & Zisserman [source]

SLIDE 3

ImageNet Classification with Deep Convolutional Neural Networks

Krizhevsky et al. 2012

SLIDE 4

Outline

Motivation
Data
Architecture

– ReLU Nonlinearity – Parallel GPU training – Local Response Normalization – Overlapping Pooling

Reducing Overfitting

– Data augmentation – Dropout

Learning details
Results
Discussion

SLIDE 5

Motivation

Object recognition → Machine Learning Methods

Improved performance:

– Larger datasets – Powerful learning methods – Better techniques vs. overfitting

MNIST digit recognition: error < 0.3%, close to human performance
Evolution of labeled large image datasets:

– NORB, CIFAR – LabelMe: ~100k segmented & labeled images – ImageNet: >15M labeled hi-res images in 22k categories

Still not enough to specify such a complex problem: we need prior knowledge...

SLIDE 6

Motivation

Models with large learning capacity

→ Convolutional Neural Networks.

CNN assumptions (strong & correct):

– Stationarity of statistics – Locality of pixel dependencies

CNNs pros:

– Variable capacity (depth and breadth) – Far fewer connections and parameters than a fully connected network with similarly sized layers, but still a lot... – Easier to train

CNNs cons:

– Prohibitively expensive to apply at large scale to high-resolution images

Applicability

→ GPUs with optimized 2D convolutions
→ Large enough datasets like ImageNet for training without overfitting

SLIDE 7

Motivation

It was one of the largest CNNs trained on the ImageNet data for the ILSVRC Challenges, and the results set a new state of the art for the task.

Highly optimized GPU implementation of 2D convolutions, with publicly available code.

Reduction of training time and strategies to control overfitting.
Specific architecture: 5 Conv + 3 FC layers.
Network limits established by existing hardware.

SLIDE 8

ImageNet Large-Scale Visual Recognition Challenge (ILSVRC)

ILSVRC Challenges

2010

– Classification with 1000 categories

2011

– Classification
– Classification + localization

2012

– Classification
– Classification + localization
– Fine-grained classification (100+ categories of dogs)
* WINNER: Krizhevsky et al. 2012

2013

– PASCAL-style detection on fully labeled data for 200 categories
– Classification with 1000 categories
– Classification + localization with 1000 categories

2014

– PASCAL-style detection on fully labeled data for 200 categories
– Classification + localization with 1000 categories
* WINNERS: GoogLeNet (Szegedy et al. 2015), VGG (Simonyan & Zisserman 2015)

2015

– Object detection for 200 fully labeled categories
– Object localization for 1000 categories
– Object detection from video for 30 fully labeled categories
– Scene classification for 401 categories

2016

– Object localization for 1000 categories
– Object detection for 200 fully labeled categories
– Object detection from video for 30 fully labeled categories
– Scene classification for 365 scene categories
– Scene parsing for 150 stuff and discrete object categories

Source: http://image-net.org/challenges/LSVRC/

SLIDE 9

Data

ImageNet: >15M labeled hi-res images in ~22k categories.
ILSVRC uses a subset of about 1.2M in 1000 categories.
Used labels of the 2010 set for training.
Error reporting: top-1 and top-5.
Variable-resolution images → down-sampled to a fixed 256 × 256.
Input: raw RGB pixel values, centered by subtracting the training-set mean.
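
Top-1 and top-5 error can be sketched as follows. This is a minimal illustration, not the challenge's evaluation code: the function name `top_k_error` and the toy scores are made up for the example.

```python
def top_k_error(scores_per_image, true_labels, k):
    """Fraction of images whose true label is NOT among the k best-scoring classes."""
    misses = 0
    for scores, label in zip(scores_per_image, true_labels):
        # Class indices sorted by score, best first; keep the first k.
        top_k = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)[:k]
        if label not in top_k:
            misses += 1
    return misses / len(true_labels)


# Toy data (assumption): 3 images, 6 classes, made-up class scores.
scores = [
    [0.10, 0.50, 0.20, 0.05, 0.10, 0.05],  # true class 1: best score
    [0.30, 0.10, 0.25, 0.20, 0.10, 0.05],  # true class 2: second best
    [0.40, 0.30, 0.10, 0.10, 0.06, 0.04],  # true class 5: worst score
]
labels = [1, 2, 5]

top1 = top_k_error(scores, labels, k=1)
top5 = top_k_error(scores, labels, k=5)
```

With k = 1 only the single best guess counts; with k = 5 an image is counted as correct if the true label is anywhere among the five best guesses.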

SLIDE 10

Architecture

8 learned layers:

– 5 convolutional (Conv) – 3 fully connected (FC)

Conv1 → Conv2 → Conv3 → Conv4 → Conv5 → FC1 → FC2 → FC3

SLIDE 11

ReLU Nonlinearity

Non-saturating activation function: max(0, x), instead of the saturating tanh(x)
Neurons: Rectified Linear Units (ReLU)
→ Faster training

Figure: test on a 4-layer CNN on CIFAR-10, no regularization, with the learning rate chosen separately for each activation function.

Is this the best? PReLU, ELU? Open debate...
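
Why "non-saturating" matters can be seen from the derivatives. The sketch below is only an illustration with hypothetical helper names: the gradient of tanh shrinks towards zero for large inputs, while the gradient of max(0, x) stays at 1 for any positive input.

```python
import math


def relu(x):
    """Rectified linear unit: max(0, x)."""
    return max(0.0, x)


def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2, which vanishes as |x| grows."""
    return 1.0 - math.tanh(x) ** 2


for x in (0.5, 2.0, 5.0):
    print(f"x={x}: tanh'={tanh_grad(x):.6f}  relu'={relu_grad(x):.1f}")
```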

SLIDE 12

Parallel GPU training

GTX 580 GPU (3GB) limits training capability
1.2M images for training
CNN: Spread across 2 GPU units
Communication only in certain layers: Conv3 (which reads kernel maps from both GPUs) and FC

– Easy with modern GPUs: common access to memory

Communication reduces error with respect to completely independent columns by 1.7% (top-1) and 1.2% (top-5)

GPU 1 GPU 2

SLIDE 13

Local Response Normalization

ReLUs don’t require input normalization
Local normalization → better generalization
Each activation is divided by a term built from the summed squared activities of n neighboring kernels at the same spatial position (x,y)
Lateral inhibition inspired by real neurons (biology)
“Brightness normalization”
Reduces error by 1.4% (top-1) and 1.2% (top-5)
Hyper-parameters chosen using a validation set:

k = 2, n = 5, α = 10^-4, β = 0.75
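
A minimal sketch of the normalization at a single spatial position, following the formula in Krizhevsky et al.: b_i = a_i / (k + α Σ_j a_j²)^β, with j running over the n kernels adjacent to kernel i. The function name and the list-based representation are assumptions made for illustration; a real implementation would work on whole feature maps.

```python
def local_response_norm(a, k=2.0, n=5, alpha=1e-4, beta=0.75):
    """Normalize activations `a` (one value per kernel, same (x, y) position).

    Each a[i] is divided by (k + alpha * sum of a[j]^2) ** beta, where j
    ranges over up to n neighboring kernels centered on i (clipped at the ends).
    """
    num_kernels = len(a)
    half = n // 2
    out = []
    for i in range(num_kernels):
        lo = max(0, i - half)
        hi = min(num_kernels - 1, i + half)
        squared_sum = sum(a[j] ** 2 for j in range(lo, hi + 1))
        out.append(a[i] / (k + alpha * squared_sum) ** beta)
    return out


# A strongly active neighbor damps the response of the kernels around it.
quiet = local_response_norm([1.0, 0.0, 0.0])
loud = local_response_norm([1.0, 100.0, 0.0])
```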

SLIDE 14

Overlapping Pooling

Pooling summarizes the output of neighboring groups of neurons in the same kernel map
Grid of pooling units spaced s pixels apart, each summarizing a z×z neighborhood
Common pooling in CNNs: s = z
Overlapping pooling: s < z
This implementation has s = 2, z = 3
Reduction of error by 0.4% (top-1) and 0.3% (top-5)
Observed during training: models with overlapping pooling are slightly harder to overfit
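
The difference between s = z and s < z is easy to see in one dimension. This is only a sketch: `pool_1d` is a hypothetical helper, and max is used as the summary function by assumption (any other reduction can be passed in).

```python
def pool_1d(values, z, s, summarize=max):
    """Slide a window of width z with stride s and summarize each window."""
    return [summarize(values[i:i + z]) for i in range(0, len(values) - z + 1, s)]


row = [1, 3, 2, 5, 4, 0, 6]

# Non-overlapping: windows [1,3] [2,5] [4,0] share no elements.
non_overlapping = pool_1d(row, z=2, s=2)

# Overlapping (s=2, z=3): windows [1,3,2] [2,5,4] [4,0,6] share their edge elements.
overlapping = pool_1d(row, z=3, s=2)
```

Both settings produce the same number of outputs here; with overlap, each output simply sees a slightly larger neighborhood.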

SLIDE 15

LRN Pooling

SLIDE 16

Architecture

Model:

Maximize the multinomial logistic regression objective
= maximize the average across training cases of the log-probability of the correct label under the prediction distribution
~60 million parameters

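Conv1 → Conv2 → Conv3 → Conv4 → Conv5 → FC1 → FC2 → FC3

Layer   Size                     Notes
Conv1   96 kernels, 11×11×3      stride 4, followed by LRN
Conv2   256 kernels, 5×5×48      followed by LRN
Conv3   384 kernels, 3×3×256
Conv4   384 kernels, 3×3×192
Conv5   256 kernels, 3×3×192
FC1     4096 neurons             dropout
FC2     4096 neurons             dropout
FC3     1000 outputs             one per class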
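
The "~60 million parameters" figure can be checked from the kernel sizes above. The sketch below assumes two values that come from the paper rather than from this slide: FC1 reads a 6×6×256 volume, and FC3 has 1000 outputs. It also assumes one bias per kernel or neuron.

```python
# (name, number of kernels, height, width, input depth seen by each kernel)
CONV_LAYERS = [
    ("Conv1", 96, 11, 11, 3),
    ("Conv2", 256, 5, 5, 48),
    ("Conv3", 384, 3, 3, 256),
    ("Conv4", 384, 3, 3, 192),
    ("Conv5", 256, 3, 3, 192),
]

# (name, inputs, outputs); 6*6*256 and 1000 are assumptions taken from the paper.
FC_LAYERS = [
    ("FC1", 6 * 6 * 256, 4096),
    ("FC2", 4096, 4096),
    ("FC3", 4096, 1000),
]


def count_parameters():
    """Sum weights plus one bias per kernel (Conv) or per neuron (FC)."""
    total = 0
    for _name, kernels, height, width, depth in CONV_LAYERS:
        total += kernels * (height * width * depth + 1)
    for _name, n_in, n_out in FC_LAYERS:
        total += n_out * (n_in + 1)
    return total


print(f"{count_parameters():,} parameters")
```

Under these assumptions the three FC layers hold the large majority of the parameters, which is consistent with dropout being applied there.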

SLIDE 17

Fitting filters and neurons

W: input volume size
F: receptive field size (filter / kernel)
S: stride
P: zero padding
Neurons = (W – F + 2P)/S + 1
In this CNN:
(224 – 11 + 0)/4 + 1 = 54.25 !!! (not an integer)
(224 – 11 + 3)/4 + 1 = 55 → OK (3 pixels of total padding)

CS231n: attributes the mismatch to an error in the paper or to unreported zero-padding
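
A small sketch of the output-size formula, useful for checking whether a filter "fits". The function name is hypothetical, and the padding is passed as the total over both sides (the 2P term), so that odd totals such as 3 can be expressed.

```python
def conv_output_size(w, f, s, total_padding=0):
    """Neurons along one axis: (W - F + 2P) / S + 1, with total_padding = 2P."""
    return (w - f + total_padding) / s + 1


no_padding = conv_output_size(224, 11, 4)                     # not an integer
padded = conv_output_size(224, 11, 4, total_padding=3)        # fits
larger_input = conv_output_size(227, 11, 4)                   # fits without padding

for label, size in [("224, no padding", no_padding),
                    ("224, 3 px padding", padded),
                    ("227, no padding", larger_input)]:
    fits = "fits" if size == int(size) else "does NOT fit"
    print(f"{label}: {size} ({fits})")
```

A 227 × 227 input gives the same integer result without any padding, which is the alternative reading suggested in the CS231n notes.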

SLIDE 18

Reducing Overfitting

60 M parameters vs. 1.2 M training images: with 1000 classes, each training image imposes only about 10 bits (log2 1000) of constraint on the mapping from image to label

→ Not enough to prevent overfitting

Data augmentation = artificially enlarge the training set with label-preserving transformations

– Image translation and horizontal reflection

Training on random 224 × 224 patches and their reflections → 2048x training set size
Test on the four corner patches and the central patch, plus their reflections → 10 patches per test image

– Changes in the intensity and color of illumination: alter color intensities using PCA of the 3×3 RGB covariance matrix

I'_{xy} = [I^R_{xy}, I^G_{xy}, I^B_{xy}]^T + [p_1, p_2, p_3] [\alpha_1 \lambda_1, \alpha_2 \lambda_2, \alpha_3 \lambda_3]^T

p_i and λ_i are the eigenvectors and eigenvalues of the covariance matrix. Each α_i is a Gaussian random value, redrawn each time the image is used for training
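
The augmentation arithmetic and the color offset can be sketched as below. All function names are hypothetical. The standard deviation 0.1 for the α values comes from the paper, not from this slide, and the eigenvectors and eigenvalues are taken as inputs here (computing them from the training-set RGB covariance is not shown).

```python
import random


def count_training_patches(image_size=256, patch_size=224):
    """Translations times reflections: 32 * 32 * 2 for 256 -> 224 crops."""
    offsets = image_size - patch_size  # 32 offsets per axis, matching the 2048x factor
    return offsets * offsets * 2


def ten_test_patches(image_size=256, patch_size=224):
    """(top, left, flipped) for the four corners and the center, plain and mirrored."""
    margin = image_size - patch_size
    center = margin // 2
    positions = [(0, 0), (0, margin), (margin, 0), (margin, margin), (center, center)]
    return [(top, left, flipped)
            for flipped in (False, True)
            for (top, left) in positions]


def pca_color_offset(eigvecs, eigvals, sigma=0.1, rng=random):
    """RGB offset [p1 p2 p3] [a1*l1, a2*l2, a3*l3]^T with a_i ~ N(0, sigma).

    eigvecs[i] is the i-th eigenvector p_i (length 3); eigvals[i] is lambda_i.
    The same offset is added to every pixel of one image.
    """
    alphas = [rng.gauss(0.0, sigma) for _ in range(3)]
    return [sum(eigvecs[i][c] * alphas[i] * eigvals[i] for i in range(3))
            for c in range(3)]


print(count_training_patches())   # size factor of the augmented training set
print(len(ten_test_patches()))    # patches evaluated per test image
```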

SLIDE 19

Reducing Overfitting

SLIDE 20

Reducing Overfitting

Dropout:

Zero the output of each neuron during training with a probability of 0.5 (turned off during both the forward pass and back-propagation)

– Combining the predictions of many models is effective, but too expensive
– Dropout gives similar results at about 2x the training cost
– Reduces co-adaptation of neighboring neurons
– Neurons are forced to learn more robust features
– Test time: use all neurons and multiply their outputs by 0.5!
– Dropout prevents substantial overfitting
– Roughly doubles the time needed to converge
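
A minimal sketch of the two modes described above, on a plain list of activations. The function names are hypothetical; the scheme shown is the one from the slide (drop during training, scale by 0.5 at test time), not the "inverted dropout" variant common in later libraries.

```python
import random


def dropout_train(activations, p_drop=0.5, rng=random):
    """Training: zero each output independently with probability p_drop."""
    return [0.0 if rng.random() < p_drop else a for a in activations]


def dropout_test(activations, p_drop=0.5):
    """Test: keep every neuron, scale outputs by the keep probability (0.5)."""
    return [a * (1.0 - p_drop) for a in activations]


random.seed(0)  # fixed seed only so the example is repeatable
hidden = [1.0] * 1000
train_out = dropout_train(hidden)
test_out = dropout_test(hidden)

# On average, the training-time sum matches the scaled test-time sum.
print(sum(train_out), sum(test_out))
```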

SLIDE 21

Learning details

Stochastic gradient descent (L: loss function)

– Batch (Di) size: 128 images
– Momentum: 0.9
– Weight decay: 0.0005
– Learning rate: ϵ

Initialization:

– Weights ~ N(0, 0.01) (zero mean, standard deviation 0.01)
– Biases = 1 for Conv2, Conv4, Conv5 and the hidden FC layers; = 0 everywhere else
→ accelerates early learning by giving the ReLUs positive inputs
– Learning rate 0.01, divided by 10 when the validation error rate stops improving (3 times until termination)

Training: 90 cycles through all 1.2 M images (6 days on 2 NVIDIA GTX 580 GPUs)
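
The update rule behind these settings, as given in Krizhevsky et al., is v ← 0.9·v − 0.0005·ϵ·w − ϵ·g followed by w ← w + v, where g is the gradient of L averaged over the batch Di. The sketch below applies one such step to plain lists; the function name and the toy numbers are assumptions for illustration.

```python
def sgd_momentum_step(w, v, grad, lr=0.01, momentum=0.9, weight_decay=0.0005):
    """One update: v <- m*v - wd*lr*w - lr*grad, then w <- w + v.

    `grad` is assumed to be the loss gradient already averaged over the batch.
    """
    v_new = [momentum * vi - weight_decay * lr * wi - lr * gi
             for wi, vi, gi in zip(w, v, grad)]
    w_new = [wi + vi for wi, vi in zip(w, v_new)]
    return w_new, v_new


# Toy step: one weight, zero initial velocity, made-up gradient.
weights, velocity = sgd_momentum_step(w=[1.0], v=[0.0], grad=[2.0])
print(weights, velocity)
```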

SLIDE 22

Results

ILSVRC 2010
ILSVRC 2012

(*Pre-training Conv6: ImageNet 2011 Fall, 15M images in 22k categories)

SLIDE 23

Results

Kernels learned by the data-connected layer Conv1:

– Frequency / orientation selective filters – GPU specialization (independent of initialization)

SLIDE 24

Results

Examples of classified images: correct even with off-center objects

Similar images are grouped by the Euclidean distance between the 4096-dimensional feature vectors of the last hidden layer (not the same as L2 distance on pixels) → Train auto-encoders to compress these vectors?
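
Retrieval by feature-vector distance can be sketched as below. The names are hypothetical, and the 2-dimensional vectors are stand-ins for the 4096-dimensional activations, which would come from a trained network.

```python
import math


def euclidean(u, v):
    """L2 distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def nearest_images(query, gallery, top_n=3):
    """Names of the top_n gallery images closest to the query in feature space."""
    ranked = sorted(gallery.items(), key=lambda item: euclidean(query, item[1]))
    return [name for name, _features in ranked[:top_n]]


# Toy gallery (assumption): image name -> feature vector.
gallery = {"a": [0.0, 0.0], "b": [1.0, 1.0], "c": [5.0, 5.0]}
closest = nearest_images([0.9, 1.0], gallery, top_n=2)
```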

SLIDE 25

Discussion

Depth is really important!!!

→ As we will see with GoogLeNet: 22 layers deep
→ Hyper-parameters: depth, breadth, filter size!

How to increase the size of the network without needing much more data? Faster?

Apply CNNs on video, use temporal structure to improve results!

SLIDE 26

It will continue...