Lecture 18: Concluding Convolutional Neural Networks, Graphical Models as Foundation for Recurrent Neural Networks and Bayesian Networks

SLIDE 1

Lecture 18: Concluding Convolutional Neural Networks, Graphical Models as Foundation for Recurrent Neural Networks and Bayesian Networks

Reference: We will be referring to various sections of ‘Deep Learning’ by Ian Goodfellow, Yoshua Bengio and Aaron Courville

https://youtu.be/4PtaZVUbilI?list=PLyo3HAXSZD3zfv9O-y9DJhvrWQPscqATa&t=1187
SLIDE 2

The Lego Blocks in Modern Deep Learning

1. Depth/Feature Map [e.g., Red, Green and Blue feature maps]
2. Patches/Filters (provide for spatial interpolations)
3. Non-linear Activation unit (provides for detection/classification)
4. Strides (enable downsampling)
5. Padding (controls shrinking across layers)
6. Pooling (non-linear downsampling)
7. Inception [Optional: Extra slides]
8. RNN, Attention and LSTM (Backpropagation through time and Memory cell) [Optional: Extra slides]
9. Embeddings (Unsupervised learning) [Optional: Extra slides]

SLIDE 3

Convolution: Sparse Interactions through Filters K(.) (for Single Feature Map)

[Figure: inputs x1, …, x5 in the input/(l − 1)th layer connected to units h1, …, h5 in the lth layer; each h_i receives edges, with weights w^l_{mi}, only from neighbouring inputs.]

SLIDE 4

Convolution: Sparse Interactions through Filters K(.) (for Single Feature Map)

[Figure: as above, inputs x1, …, x5 in the input/(l − 1)th layer sparsely connected to units h1, …, h5 in the lth layer through weights w^l_{mi}.]

h_i = Σ_m x_m w_{mi} K(i − m), where, on the RHS, K(i − m) = 1 iff |m − i| ≤ 1.

For 2-D inputs (such as images):

SLIDE 5

Convolution: Sparse Interactions through Filters K(.) (for Single Feature Map)

[Figure: as above, sparse connections between the input/(l − 1)th layer and the lth layer.]

h_i = Σ_m x_m w_{mi} K(i − m), where, on the RHS, K(i − m) = 1 iff |m − i| ≤ 1.

For 2-D inputs (such as images):

h_{ij} = Σ_m Σ_n x_{mn} w_{ij,mn} K(i − m, j − n)

Intuition: Neighbouring signals x_m (or pixels x_{mn}) are more relevant than ones further away; this reduces prediction time.
The operation can be viewed as multiplication with a Toeplitz matrix K (in which each row is the row above shifted by one element).
Further, K is sparse with respect to the parameter θ (e.g., K(i − m) = 1 iff |m − i| ≤ θ).
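As an illustration of the sparse-interaction view (a minimal NumPy sketch with made-up input and weights, not from the slides), the 1-D case h_i = Σ_m x_m w_{mi} K(i − m) with K(i − m) = 1 iff |m − i| ≤ 1 amounts to multiplying x by a banded matrix:

```python
import numpy as np

# Made-up 1-D input (the x_m's) and a weight for each allowed offset (i - m).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
w = {-1: 0.5, 0: 1.0, 1: 0.5}

n = len(x)
W = np.zeros((n, n))            # banded (Toeplitz-like) matrix: W[i, m] = w_{mi} K(i - m)
for i in range(n):
    for m in range(n):
        if abs(i - m) <= 1:     # K(i - m) = 1 iff |m - i| <= 1, zero everywhere else
            W[i, m] = w[i - m]

h = W @ x                       # h_i = sum_m x_m w_{mi} K(i - m)
print(W)                        # only the three central diagonals are non-zero
print(h)                        # [2. 4. 6. 8. 7.]
```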

SLIDE 6

Convolution: Shared parameters and Patches (for Single Feature Map)

[Figure: inputs x1, …, x5 in the input/(l − 1)th layer connected to units h1, …, h5 in the lth layer; the same weights w^l_{−1}, w^l, w^l_{+1} are shared across all positions.]

SLIDE 7

Convolution: Shared parameters and Patches (for Single Feature Map)

[Figure: as above, the shared weights w^l_{−1}, w^l, w^l_{+1} are used at every position between the (l − 1)th and lth layers.]

h_i = Σ_m x_m w_{i−m} K(i − m), where, on the RHS, K(i − m) = 1 iff |m − i| ≤ 1.

For 2-D inputs (such as images):

SLIDE 8

Convolution: Shared parameters and Patches (for Single Feature Map)

[Figure: as above, shared weights between the (l − 1)th and lth layers.]

h_i = Σ_m x_m w_{i−m} K(i − m), where, on the RHS, K(i − m) = 1 iff |m − i| ≤ 1.

For 2-D inputs (such as images):

h_{ij} = Σ_m Σ_n x_{mn} w_{i−m,j−n} K(i − m, j − n)

Intuition: Neighbouring signals x_m (or pixels x_{mn}) affect the output in a similar way irrespective of location (i.e., of the value of m or n).
More intuition: This corresponds to moving patches (filters) around the image.
Parameter sharing further reduces the storage requirement; it does not affect prediction time.
Further, K is often sparse (e.g., K(i − m) = 1 iff |m − i| ≤ θ).
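A minimal sketch of the shared-parameter version (made-up input and weight values, not from the slides): the same three weights are slid across the input, which is exactly a 1-D convolution.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # made-up input
w = np.array([0.5, 1.0, 0.5])              # one shared patch of weights

# h_i = sum_m x_m w_{i-m} with K(i - m) = 1 iff |m - i| <= 1: slide the same patch everywhere.
windows = np.lib.stride_tricks.sliding_window_view(x, 3)
h = windows @ w
print(h)                                   # [4. 6. 8.] -- outputs at fully-overlapping positions

# Equivalent library call ('valid' keeps only fully-overlapping positions; w is symmetric,
# so the kernel flip performed by np.convolve makes no difference here).
print(np.convolve(x, w, mode='valid'))     # [4. 6. 8.]
```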

SLIDE 9

Convolution: Strides and Padding (for Single Feature Map)

[Figure: as above, shared weights between the input/(l − 1)th layer and the lth layer.]

SLIDE 10

Convolution: Strides and Padding (for Single Feature Map)

[Figure: as above, shared weights between the input/(l − 1)th layer and the lth layer.]

With a stride of s, consider only the h_i's where i is a multiple of s.
Intuition: A stride of s corresponds to moving the patch by s steps at a time.
More intuition: A stride of s corresponds to downsampling by a factor of s.
What to do at the corners?

SLIDE 11

Convolution: Strides and Padding (for Single Feature Map)

[Figure: as above, shared weights between the input/(l − 1)th layer and the lth layer.]

With a stride of s, consider only the h_i's where i is a multiple of s.
Intuition: A stride of s corresponds to moving the patch by s steps at a time.
More intuition: A stride of s corresponds to downsampling by a factor of s.
What to do at the corners? Answer: Pad with 0's at the edges to create an output of the same size as the input ('same' padding), or use no padding at all and let the next layer have fewer nodes ('valid').
Striding reduces the storage requirement as well as the prediction time.
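A small sketch (same made-up input and filter as before) contrasting 'valid' and 'same' padding, plus a stride of 2, to make the output sizes concrete:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
w = np.array([0.5, 1.0, 0.5])            # shared 3-tap filter (symmetric)

valid = np.convolve(x, w, mode='valid')  # no padding: length shrinks to len(x) - len(w) + 1 = 3
same = np.convolve(x, w, mode='same')    # zero-pad the edges: length stays 5
strided = same[::2]                      # stride of 2 = keep every 2nd output (downsampling)

print(valid)    # [4. 6. 8.]
print(same)     # [2. 4. 6. 8. 7.]
print(strided)  # [2. 6. 7.]
```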

SLIDE 12

Examples of Convolutional Filters: Guess what each does

+1   0  −1
+2   0  −2
+1   0  −1
SLIDE 13

Examples of Convolutional Filters: Guess what each does

+1   0  −1
+2   0  −2
+1   0  −1

Sobel vertical edge detector

+1  +2  +1
 0   0   0
−1  −2  −1
SLIDE 14

Examples of Convolutional Filters: Guess what each does

+1   0  −1
+2   0  −2
+1   0  −1

Sobel vertical edge detector

+1  +2  +1
 0   0   0
−1  −2  −1

Sobel horizontal edge detector

1/9  1/9  1/9
1/9  1/9  1/9
1/9  1/9  1/9

Image blurring filter

−1  −1   3  −1  −1   (centre weight 3 with −1 neighbours, values as on the slide)

Image sharpening filter

Illustration at https://www.saama.com/blog/different-kinds-convolutional-filters/

In CNNs, these filters[5] (i.e., the weights w_{i−m,j−n}) are generally learnt from the data. Filter size ⇒ strong prior; filter values ⇒ posterior.

[5] Also referred to as kernels, but not to be confused with the positive semi-definite kernel.
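A short sketch (standard Sobel kernel and a tiny made-up image; SciPy assumed available) showing how such a fixed filter is applied by 2-D convolution:

```python
import numpy as np
from scipy.signal import convolve2d

# Tiny made-up grayscale image with a vertical edge down the middle.
img = np.array([[0, 0, 0, 10, 10, 10]] * 6, dtype=float)

sobel_vertical = np.array([[+1, 0, -1],
                           [+2, 0, -2],
                           [+1, 0, -1]], dtype=float)

# 'same' keeps the output the same size as the input (zero padding at the borders).
edges = convolve2d(img, sobel_vertical, mode='same')
print(edges)   # large magnitudes only around the column where the intensity jumps
```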
SLIDE 15

The Convolutional Filter

SLIDE 16

The Convolutional Filter

SLIDE 17

The Convolutional Filter

SLIDE 18

Question: MLP Vs CNN

Convolution leverages three important ideas that can help improve a machine learning system: (a) sparse interactions, (b) parameter sharing, and (c) equivariant representations: f(g(x)) = g(f(x)) when f is convolution and g is a shift function. We just saw these in action:

SLIDE 19

Question: MLP Vs CNN

Convolution leverages three important ideas that can help improve a machine learning system: (a) sparse interactions, (b) parameter sharing, and (c) equivariant representations: f(g(x)) = g(f(x)) when f is convolution and g is a shift function. We just saw these in action:

Input image size: 200 × 200 × 3.
MLP: A hidden layer with 40k neurons results in 4.8 billion parameters.
CNN: Say the hidden layer has 20 feature maps, each of size 5 × 5 × 3, with stride = 1 and zero padding of 4 on each side, i.e., maximum overlapping of convolution windows. A feature map corresponds to one set of weights w^l_{ij}; F feature maps ⇒ F times the number of weight parameters.
Question: How many parameters? Answer:
Question: How many neurons (location specific)? Answer:

SLIDE 20

Answer: MLP Vs CNN

MLP: A hidden layer with 40k neurons has 4.8 billion parameters (200 × 200 × 3 inputs × 40,000 neurons).
CNN: The hidden layer has 20 feature maps, each of size 5 × 5 × 3, with stride = 1 and zero padding of 4 on each side, i.e., maximum overlapping of convolution windows.
Question: How many parameters? Answer: Just 1500 (= 20 × 5 × 5 × 3).
Question: How many neurons (location specific)? Let M × N × 3 be the dimension of the image and P × Q × 3 the dimension of the filter for convolution. Let D be the number of zero paddings and s the stride length.
Answer: Output size =

SLIDE 21

Answer: MLP Vs CNN

MLP: A hidden layer with 40k neurons has 4.8 billion parameters.
CNN: The hidden layer has 20 feature maps, each of size 5 × 5 × 3, with stride = 1 and zero padding of 4 on each side, i.e., maximum overlapping of convolution windows.
Question: How many parameters? Answer: Just 1500.
Question: How many neurons (location specific)? Let M × N × 3 be the dimension of the image and P × Q × 3 the dimension of the filter for convolution. Let D be the number of zero paddings and s the stride length.
Answer: Output size = ((M − P + 2D)/s + 1) × ((N − Q + 2D)/s + 1).
In the current case, D = P − 1 ⇒ Output size = ((M + P)/s − 1) × ((N + Q)/s − 1) (for s = 1).
So 20 × ((200 + 5)/s − 1) × ((200 + 5)/s − 1) = 20 × 204 × 204 = 832,320 for s = 1 (around 830 thousand, which can increase with max-pooling).
If D = (P − 1)/2 and s = 1, …

SLIDE 22

Answer: MLP Vs CNN

MLP: A hidden layer with 40k neurons has 4.8 billion parameters.
CNN: The hidden layer has 20 feature maps, each of size 5 × 5 × 3, with stride = 1 and zero padding of 4 on each side, i.e., maximum overlapping of convolution windows.
Question: How many parameters? Answer: Just 1500.
Question: How many neurons (location specific)? Let M × N × 3 be the dimension of the image and P × Q × 3 the dimension of the filter for convolution. Let D be the number of zero paddings and s the stride length.
Answer: Output size = ((M − P + 2D)/s + 1) × ((N − Q + 2D)/s + 1).
In the current case, D = P − 1 ⇒ Output size = ((M + P)/s − 1) × ((N + Q)/s − 1) (for s = 1).
So 20 × ((200 + 5)/s − 1) × ((200 + 5)/s − 1) = 20 × 204 × 204 = 832,320 for s = 1 (around 830 thousand, which can increase with max-pooling).
If D = (P − 1)/2 and s = 1, the output will be of the same size as the input!
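A quick sketch (plain Python; formulas as on the slide) to reproduce the parameter and neuron counts above:

```python
def conv_output_size(M, N, P, Q, D, s):
    # Output size = ((M - P + 2D)/s + 1) x ((N - Q + 2D)/s + 1)
    return (M - P + 2 * D) // s + 1, (N - Q + 2 * D) // s + 1

M = N = 200                                 # 200 x 200 x 3 input image
P = Q = 5                                   # 5 x 5 x 3 filters
F = 20                                      # 20 feature maps

mlp_params = (M * N * 3) * 40_000           # 120,000 inputs fully connected to 40k neurons
cnn_params = F * (P * Q * 3)                # shared weights only (biases ignored, as on the slide)
H, W = conv_output_size(M, N, P, Q, D=P - 1, s=1)   # zero padding of 4 on each side
cnn_neurons = F * H * W

print(mlp_params)    # 4,800,000,000
print(cnn_params)    # 1,500
print((H, W))        # (204, 204)
print(cnn_neurons)   # 832,320
```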

SLIDE 23

How about Deconvolution? (Illustrated)

[OPTIONAL]

SLIDE 24

How about Deconvolution? (Illustrated)

[OPTIONAL]

SLIDE 25

The Convolutional Filter

SLIDE 26

The Lego Blocks in Modern Deep Learning

1. Depth/Feature Map
2. Patches/Filters (provide for spatial interpolations) - Filter
3. Non-linear Activation unit (provides for detection/classification)
4. Strides (linear downsampling)
5. Padding (controls shrinking across layers)
6. Pooling (non-linear downsampling) - Filter
7. Inception
8. RNN and LSTM (Backpropagation through time and Memory cell)
9. Embeddings (Unsupervised learning)

SLIDE 27

Two Typical Nomenclatures/Architectures [Optional]

SLIDE 28

The Max Pooling Filter

A non-linear downsampling filter that selects the maximum value from its patch; it is a sample-based discretization process. The objective is dimensionality reduction through downsampling of the input representation (e.g., an image) and translation invariance of the internal representation (only the important data is sent to the next layers). It helps avoid overfitting and reduces the number of parameters to learn.

SLIDE 29

Max pooling (with downsampling) for a Single Feature Map (1-d example)

SLIDE 30

Max pooling (with downsampling) for a Single Feature Map (1-d example)

SLIDE 31

Max pooling (with downsampling) for a Single Feature Map (1-d example)

SLIDE 32

Max pooling (with downsampling) for a Single Feature Map (1-d example)

What will be the output if the input and the max pooling filter remain the same but the stride changes to 2?

SLIDE 33

Max pooling (with downsampling) for a Single Feature Map (1-d example)

What will be the output if the input and the max pooling filter remain the same but the stride changes to 2? [6, 8]

SLIDE 34

Max pooling in 2-D for a Single Feature Map

Let M × N × 3 be the dimension of the image and P × Q × 3 the dimension of the patch for the (filter) convolution. Let s be the stride length.
Max pooling takes every P × Q × 3 patch from the input and sets the output to the maximum value in that patch.
Output size = ((M − P)/s + 1) × ((N − Q)/s + 1).
For example: Input: a 3-D image with M = N = 5, P = Q = 3 and the (default) stride of 1. The output size will be 3 × 3 × 1.
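A minimal NumPy sketch (made-up input) of max pooling over P × Q patches with stride s, matching the output-size formula above:

```python
import numpy as np

def max_pool_2d(x, P, Q, s):
    # Output size = ((M - P)/s + 1) x ((N - Q)/s + 1)
    M, N = x.shape
    H, W = (M - P) // s + 1, (N - Q) // s + 1
    out = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            out[i, j] = x[i * s:i * s + P, j * s:j * s + Q].max()
    return out

x = np.arange(25, dtype=float).reshape(5, 5)   # made-up 5 x 5 single-channel input
print(max_pool_2d(x, P=3, Q=3, s=1))           # 3 x 3 output, as in the slide's example
print(max_pool_2d(x, P=3, Q=3, s=2))           # a stride of 2 shrinks the output to 2 x 2
```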

SLIDE 35

Tutorial 7, Problem 10

ConvNetJS (http://cs.stanford.edu/people/karpathy/convnetjs/) is a Javascript library for training Deep Learning models (Neural Networks) entirely in your browser. Try different choices of network configurations which include the choice of the stack of convolution, pooling, activation units, number of parallel networks, position of fully connected layers and so on. You can also save some network snapshots as JSON objects. What does the network visualization of the different layers reveal? Also try out the demo at http://places.csail.mit.edu/demo.html to understand the heat maps and their correlations with the structure of the neural network.

SLIDE 36

Tutorial 7, Problem 11

Discuss the advantages and disadvantages of different activation functions: tanh, sigmoid, ReLU, softmax. Explain and illustrate when you would choose one activation function in lieu of another in a Neural Network. You can also include any experiences from Problem 5 in your answer.
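To accompany the discussion (a minimal NumPy sketch, not part of the tutorial statement), the four activation functions mentioned can be written as:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))    # squashes to (0, 1); saturates for large |z|

def tanh(z):
    return np.tanh(z)                  # zero-centred, range (-1, 1); also saturates

def relu(z):
    return np.maximum(0.0, z)          # cheap, no saturation for z > 0, but can "die" for z < 0

def softmax(z):
    e = np.exp(z - np.max(z))          # subtract the max for numerical stability
    return e / e.sum()                 # turns a score vector into a probability distribution

z = np.array([-2.0, 0.0, 3.0])
print(sigmoid(z), tanh(z), relu(z), softmax(z), sep="\n")
```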

SLIDE 37

Alex-net [NIPS 2012]

Stack of two types of parallel networks.
First 5 convolution layers:
The first convolution layer takes an input of size 224 × 224 × 3 and has 48 (×2) features, each with a filter of size 11 × 11 × 3 and a stride of 4.
Thus, ((224 + 11)/4 − 1) × ((224 + 11)/4 − 1) = 57 × 57.
Max-pooling (3 × 3 × 1 with a stride of 1) at the end reduces the size to 55 × 55 for each filter.
Fully connected last 3 layers.

Image reference: "ImageNet Classification with Deep Convolutional Neural Networks", NIPS 2012.
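A tiny sketch (plain Python; the formulas are the slide's, which use the D = P − 1 padding convention from earlier) reproducing the 57 × 57 and 55 × 55 figures:

```python
# First convolution layer of Alex-net, per the slide's convention.
M, P, s = 224, 11, 4
conv_out = (M + P) // s - 1           # ((224 + 11)/4 - 1), integer part
print(conv_out)                       # 57 -> 57 x 57 feature maps

# A 3 x 3 max pooling with stride 1 (no padding) then shrinks each map from 57 to 55.
pool_P, pool_s = 3, 1
pool_out = (conv_out - pool_P) // pool_s + 1
print(pool_out)                       # 55 -> 55 x 55 per filter
```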
SLIDE 38

The Lego Blocks in Modern Deep Learning

1. Depth/Feature Map
2. Patches/Filters (provide for spatial interpolations) - Filter
3. Strides (linear downsampling)
4. Padding (controls shrinking across layers)
5. Pooling (non-linear downsampling) - Filter
6. Inception
7. RNN, Attention and LSTM (Backpropagation through time and Memory cell)
8. Embeddings and Unsupervised Learning

SLIDE 39

[Figure: RNN-Based Model 1, RNN-Based Model 2]

Imagine that, instead of image classification using a CNN, you needed to generate a "meaningful caption" (as a sequence of words) for each input image. How can we ensure that the model outputs a "meaningful sequence of words" as a caption?

SLIDE 40

[Figure: RNN-Based Model 1, RNN-Based Model 2, RNN-Based Model 3]

To output a "meaningful sequence of words", we need to model the structure in the output space. We begin with (a) logistic regression intuitively extended (to Conditional Random Fields), then move on to (b) a basic overview of graphical models, and then (c) deep RNNs.

SLIDE 41

Extending Neural Networks to sequential output

When the neural network is a simple logistic regression model, the extension is called a Conditional Random Field (CRF). Detailed slides on graphical models, including CRFs, are at

https://www.cse.iitb.ac.in/~cs725/notes/classNotes/graphicalModels.ppt

[Figure: chain CRF with inputs x1, x2, …, xi, …, xn, output classes y1, y2, …, yi, …, yn, x-potentials φ1,x, φ2,x, …, φn,x and y-potentials φi,y.]

SLIDE 42

PRIMER ON GRAPHICAL MODELS LEADING TO RECURRENT NEURAL NETWORKS

https://iitbacin-my.sharepoint.com/:p:/g/personal/ganramkr_iitb_ac_in/EbmsI4FN0DRNryz-JpaYbgMB0n79VdO7jaJMtnfAWS8jKQ?rtime=2UGoIY5R10g

SLIDE 43

23rd December, 2013

Probabilistic Graphical Models

Ganesh Ramakrishnan CS337, Artificial Intelligence and Machine Learning

SLIDE 44

Probabilistic Graphical Models

• Graphical representations of probability distributions
  – new insights into existing models
  – motivation for new models
  – graph-based algorithms for calculation and computation

SLIDE 45

Components of a Graphical Model

• Each node in a graphical model represents a random variable
  – or, in general, a set or vector of random variables
• There are edges between nodes
• It is the absence of certain edges in a graph that encodes independencies
  – The information provided by the presence of edges is, in some sense, vacuous
  – E.g., the degenerate case of a completely connected graph, which describes all possible distributions

SLIDE 46

Types of graphical models

• Directed: when all edges are directed
  – Hidden Markov models, Kalman filters, Factor analysis, Probabilistic principal component analysis, Independent component analysis, Mixtures of Gaussians, Transformed component analysis, Probabilistic expert systems, Sigmoid belief networks, Hierarchical mixtures of experts, Bayesian Networks, etc.
• Undirected: when all edges are undirected
  – Markov random field, Conditional random field, etc.
• Chain graphs
  – have directed as well as undirected edges

SLIDE 47

Factorization and Conditional Independence properties of graphical models

• Two equivalent ways of specifying a graphical model (Hammersley-Clifford Theorem)
  – Factorization property
    • How to factorize the joint distribution, given the graph
  – Markov or conditional independence property
    • Can we determine the conditional independence properties of a distribution directly from its graph?
      – undirected graphs: easy
      – directed graphs: one subtlety

SLIDE 48

Factorization Properties

• Directed graphs
  – conditional independence from the d-separation test
• Undirected graphs
  – conditional independence from graph separation

Directed: Pr(x_1, x_2, …, x_N) = Π_i Pr(x_i | pa[x_i])
Undirected: Pr(x_1, x_2, …, x_N) ∝ Π_C ψ_C(x_C), a product of potential functions over cliques C
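As a concrete illustration of the directed factorization (a minimal sketch with an assumed three-node chain a → b → c and made-up probability tables):

```python
import itertools

# Made-up CPTs for a chain a -> b -> c over binary variables.
P_a = {0: 0.6, 1: 0.4}
P_b_given_a = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}   # P_b_given_a[a][b]
P_c_given_b = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.5, 1: 0.5}}   # P_c_given_b[b][c]

def joint(a, b, c):
    # Pr(a, b, c) = Pr(a) * Pr(b | pa[b]) * Pr(c | pa[c]) = Pr(a) Pr(b | a) Pr(c | b)
    return P_a[a] * P_b_given_a[a][b] * P_c_given_b[b][c]

# The factorized joint is a valid distribution: it sums to 1 over all assignments.
print(sum(joint(a, b, c) for a, b, c in itertools.product([0, 1], repeat=3)))  # 1.0
```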

SLIDE 49

Markov Properties: Undirected Graphs

• Conditional independence is given by graph separation
• Two sets of nodes "a" and "b" are conditionally independent given a set of nodes "c" iff every path from a node in "a" to a node in "b" is blocked by a node in "c"
• Here, by "blocking" we mean that a node from "c" occurs on that path

x ⊥ y | z

SLIDE 50

Markov Properties: Directed Graphs

a ⊥ b | c

• We say that node c blocks the path from node a to node b iff a ⊥̸ b | ∅ and a ⊥ b | c
• Identify three types of nodes that block/unblock paths
• Head-to-tail node: blocks the path (a ⊥̸ b | ∅, but a ⊥ b | c)
SLIDE 51

Markov Properties: Directed Graphs (contd)

• Tail-to-tail node: blocks the path (a ⊥̸ b | ∅, but a ⊥ b | c)
• Head-to-head node: unblocks the path (a ⊥ b | ∅, but a ⊥̸ b | c)

SLIDE 52

More formally…

SLIDE 53

Markov Properties: Directed Graphs (contd)

• Conditional independence is given by the d-separation test
• Two sets of nodes "a" and "b" are conditionally independent given a set of nodes "c" iff
  – every path from a node in "a" to a node in "b" is blocked by a node in "c"

SLIDE 54

Markov Properties: Directed Graphs (contd)

• What about the following?

[Figure: an example directed graph over nodes a, b, c, f.]

Is a ⊥ b | c?  NO
Is a ⊥ b | f?  YES
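To make the head-to-head (explaining away) behaviour concrete, here is a minimal sketch on an assumed collider a → c ← b with made-up probability tables (the graph in the slide's figure is not recoverable, so this example is purely illustrative): marginally a and b are independent, but they become dependent once the head-to-head node c is observed.

```python
import itertools

# Assumed collider a -> c <- b over binary variables, with made-up tables.
P_a = {0: 0.7, 1: 0.3}
P_b = {0: 0.6, 1: 0.4}
P_c_given_ab = {}
for a, b in itertools.product([0, 1], repeat=2):
    p0 = 0.9 if (a == 0 and b == 0) else 0.2      # made-up values for Pr(c = 0 | a, b)
    P_c_given_ab[(a, b)] = {0: p0, 1: 1.0 - p0}

def joint(a, b, c):
    # Directed factorization: Pr(a, b, c) = Pr(a) Pr(b) Pr(c | a, b)
    return P_a[a] * P_b[b] * P_c_given_ab[(a, b)][c]

def prob(query, given):
    # Pr(query | given) by brute-force enumeration over all assignments of (a, b, c).
    def mass(constraints):
        return sum(joint(a, b, c)
                   for a, b, c in itertools.product([0, 1], repeat=3)
                   if all({'a': a, 'b': b, 'c': c}[k] == v for k, v in constraints.items()))
    return mass({**query, **given}) / mass(given)

# Marginally, a and b are independent: Pr(a=1 | b=1) equals Pr(a=1).
print(prob({'a': 1}, {'b': 1}), prob({'a': 1}, {}))                  # 0.3 0.3
# Conditioning on the head-to-head node c makes them dependent (explaining away).
print(prob({'a': 1}, {'b': 1, 'c': 1}), prob({'a': 1}, {'c': 1}))    # 0.3 vs ~0.474
```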