

slide-1
SLIDE 1

CONVOLUTIONAL AND RECURRENT NEURAL NETWORKS

slide-2
SLIDE 2

Neural networks

  • Fully connected networks
  • Neuron
  • Non-linearity
  • Softmax layer
  • DNN training
  • Loss function and regularization
  • SGD and backprop
  • Learning rate
  • Overfitting – dropout, batchnorm
  • CNN, RNN, LSTM, GRU <- This class
slide-3
SLIDE 3

Notes on non-linearity

  • Sigmoid

https://www.analyticsvidhya.com/blog/2017/10/fundamentals-deep-learning-activation-functions-when-to-use-them/

Gradients saturate, so the model gets stuck if activations fall far from 0. The output is always positive.

slide-4
SLIDE 4

Notes on non-linearity

  • Tanh

https://www.analyticsvidhya.com/blog/2017/10/fundamentals-deep-learning-activation-functions-when-to-use-them/

The output can be positive or negative. Gradients still saturate, so the model gets stuck if activations fall far from 0.

slide-5
SLIDE 5

Notes on non-linearity

  • ReLU

https://www.analyticsvidhya.com/blog/2017/10/fundamentals-deep-learning-activation-functions-when-to-use-them/

High gradient in the positive region and fast to compute, but no gradient flows in the negative region.

slide-6
SLIDE 6

Notes on non-linearity

  • Leaky ReLU

https://www.analyticsvidhya.com/blog/2017/10/fundamentals-deep-learning-activation-functions-when-to-use-them/

The negative part now has some gradient. On real tasks it does not seem much better than ReLU. (See the sketch below.)
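A minimal NumPy sketch (not from the slides) of the four activations above, illustrating why sigmoid/tanh saturate far from 0 and why ReLU passes no gradient for negative inputs:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))        # output in (0, 1), always positive

def tanh(x):
    return np.tanh(x)                       # output in (-1, 1)

def relu(x):
    return np.maximum(0.0, x)               # zero gradient for x < 0

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)    # small gradient for x < 0

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(sigmoid(x))   # values pinned near 0 or 1 far from 0 -> tiny gradients (saturation)
print(relu(x))      # negative inputs are clipped to 0
```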

slide-7
SLIDE 7

So far we have learned about bases and projections.

slide-8
SLIDE 8

Projections and Neural network weights

  • wᵀx

[Diagram: data x projected onto direction w]

slide-9
SLIDE 9

Projections and neural network weights

  • Wᵀx

[Diagram: data x projected onto directions w1 and w2]

slide-10
SLIDE 10

Projections and neural network weights

  • Wᵀx

[Diagram: y = Wᵀx (projections onto w1, w2), followed by LDA/Fisher projections onto v1, v2: VᵀWᵀx = (WV)ᵀx]

slide-11
SLIDE 11

Projections and neural network weights

  • Wᵀx

[Diagram: y = Wᵀx, then LDA/Fisher projections v1, v2; the composition VᵀWᵀx = (WV)ᵀx corresponds to directions Wv1, Wv2 in the input space]

slide-12
SLIDE 12

Projections and neural network weights

  • Wᵀx

[Diagram: y = Wᵀx, then LDA projections v1, v2; combined directions Wv1, Wv2]

  • Neural network layers as feature transform
  • Non-linearity prevents merging of layers (see the sketch below)

Fisher projection: VᵀWᵀx = (WV)ᵀx
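A minimal NumPy sketch of the point above: two stacked linear layers collapse into a single projection (WV)ᵀx, while a non-linearity in between prevents the merge. The sizes are arbitrary illustration values:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=4)
W = rng.normal(size=(4, 3))     # first layer: y = W^T x
V = rng.normal(size=(3, 2))     # second layer: z = V^T y

z_two_layers = V.T @ (W.T @ x)
z_one_layer  = (W @ V).T @ x                     # identical: the two layers merge
print(np.allclose(z_two_layers, z_one_layer))    # True

z_nonlinear = V.T @ np.tanh(W.T @ x)             # with tanh in between, no single matrix reproduces this
```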

slide-13
SLIDE 13

Shift in feature space

  • Wᵀx

[Diagram: y = Wᵀx, then LDA projections v1, v2; combined directions Wv1, Wv2]

  • What happens if I have a person that is off-frame (shifted)?
  • Need another filter that is shifted
  • Can we do better?

Fisher projection: VᵀWᵀx = (WV)ᵀx

slide-14
SLIDE 14

Convolution

  • Continuous convolution
  • Meaning of (t − τ): flip g, then shift
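For reference, the standard definition the slide refers to (the (t − τ) term is what produces the flip-then-shift):

```latex
(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau
```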
slide-15
SLIDE 15

Convolution visually

https://en.wikipedia.org/wiki/Convolution

Demo

slide-16
SLIDE 16

Convolution discrete

  • Discrete convolution
  • Continuous convolution
  • Same concept as continuous version
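A minimal NumPy sketch of the discrete version: flip the kernel, shift it, and take dot products (np.convolve does the same thing):

```python
import numpy as np

f = np.array([1.0, 2.0, 3.0, 4.0])
g = np.array([1.0, 0.5])

# (f * g)[n] = sum_k f[k] * g[n - k]  -- the n - k index is the flip-then-shift
manual = np.array([sum(f[k] * g[n - k]
                       for k in range(len(f)) if 0 <= n - k < len(g))
                   for n in range(len(f) + len(g) - 1)])

print(manual)
print(np.convolve(f, g))   # same result
```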
slide-17
SLIDE 17

Matched filters

  • We can use convolution to detect things that match our pattern
  • Convolution can be considered as a filter (Why? Take ASR next semester)
  • If the filter detects our pattern, it will show up as a nice peak even if there is noise

  • Demo
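A minimal sketch of the matched-filter demo's idea (NumPy assumed; the pattern and noise level are made up): cross-correlating a noisy signal with the known pattern gives a peak at the location of the match.

```python
import numpy as np

rng = np.random.default_rng(0)
pattern = np.array([0.0, 1.0, 2.0, 3.0, 2.0, 1.0, 0.0])

signal = rng.normal(scale=0.3, size=100)      # background noise
signal[40:40 + len(pattern)] += pattern       # hide the pattern at position 40

corr = np.correlate(signal, pattern, mode="valid")
print(np.argmax(corr))                        # ~40: the match shows up as the peak
```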
slide-18
SLIDE 18

Matched filters

[Plot: blue = signal, red = matched-filter output; the match shows up as a peak]

slide-19
SLIDE 19

Convolution and Cross-Correlation

  • Convolution
  • (Cross)-Correlation

Convolution and cross-correlation are the same if g(t) is symmetric (even function). For some unknown reason, people use convolution in CNN to mean cross-correlation. From this point onwards, when we say convolution we mean cross-correlation.
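A small NumPy sketch of the statement above: convolution flips the kernel and cross-correlation does not, so the two coincide only for a symmetric (even) kernel.

```python
import numpy as np

f = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
g_asym = np.array([1.0, 0.0, -1.0])
g_sym  = np.array([1.0, 2.0, 1.0])

print(np.convolve(f, g_asym, mode="valid"), np.correlate(f, g_asym, mode="valid"))  # differ
print(np.convolve(f, g_sym,  mode="valid"), np.correlate(f, g_sym,  mode="valid"))  # equal
```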

slide-20
SLIDE 20

2D convolution

  • Flip and shift in 2D
  • But in CNNs we no longer flip

[Diagram: our matched filter produces a peak at the matching location]

slide-21
SLIDE 21

Shift in feature space

  • Wᵀx

[Diagram: y = Wᵀx, then LDA projections v1, v2; combined directions Wv1, Wv2]

  • What happens if I have a person that is off-frame (shifted)?
  • Need another filter that is shifted

Fisher projection: VᵀWᵀx = (WV)ᵀx

slide-22
SLIDE 22

Shift in feature space

  • Wᵀx

[Diagram: y = Wᵀx, then LDA projections v1, v2; combined directions Wv1, Wv2]

  • What happens if I have a person that is off-frame (shifted)?
  • Ans: Convolution with W as filter

Fisher projection: VᵀWᵀx = (WV)ᵀx

slide-23
SLIDE 23

Convolutional Neural networks

  • A neural network with convolutions! (cross-correlation, to be precise)
  • But we have peaks at different locations

From the point of view of a network, these are two different things.

slide-24
SLIDE 24

Pooling layers/Subsampling layers

  • Combine different locations into one
  • One possible method is to use a max
  • Interpretation: Yes, I found a cat somewhere

[Diagram: a max filter over the peaks outputs 1]

slide-25
SLIDE 25

Convolutional filters

  • Convolutional layer consists of
  • Small filter patches
  • Pooling to remove variation

[Diagram: a 100×100 input image is convolved with three 3×3 filters (filter1, filter2, filter3), producing Output1, Output2, Output3, each 98×98]
slide-26
SLIDE 26

Convolutional filters

  • Convolutional layer consists of
  • Small filter patches
  • Pooling to remove variation

[Diagram: the same 100×100 input convolved with three 3×3 filters, producing three 98×98 outputs]

Example dot product for one filter position: patch [4, 5, 6] · filter [1, 2, 3] = 4·1 + 5·2 + 6·3 = 32
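A minimal NumPy sketch of the diagram above: a "valid" 3×3 convolution (really cross-correlation) of a 100×100 image with three filters, giving three 98×98 outputs (100 − 3 + 1 = 98). The loops are written for clarity, not speed:

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.normal(size=(100, 100))
filters = rng.normal(size=(3, 3, 3))          # three 3x3 filters

outputs = np.zeros((3, 98, 98))
for f in range(3):
    for i in range(98):
        for j in range(98):
            patch = image[i:i + 3, j:j + 3]               # 3x3 patch under the filter
            outputs[f, i, j] = np.sum(patch * filters[f]) # elementwise product, then sum

print(outputs.shape)   # (3, 98, 98)
```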

slide-27
SLIDE 27

Convolutional filters

  • Convolutional layer consists of
  • Small filter patches
  • Pooling to remove variation

[Diagram repeated: 3×3 convolutions over the 100×100 input produce three 98×98 outputs]
slide-28
SLIDE 28

Convolutional filters

  • Convolutional layer consists of
  • Small filter patches
  • Pooling to remove variation

[Diagram repeated: 3×3 convolutions over the 100×100 input produce three 98×98 outputs]
slide-29
SLIDE 29

Convolutional filters

  • Convolutional layer consists of
  • Small filter patches
  • Pooling to remove variation

[Diagram repeated: 3×3 convolutions over the 100×100 input produce three 98×98 outputs]
slide-30
SLIDE 30

Pooling/subsampling

  • Convolutional layer consists of
  • Small filter patches
  • Pooling to remove variation

[Diagram: each 98×98 convolution output is passed through a 3×3 max filter with no overlap, giving a 33×33 layer output]

slide-31
SLIDE 31

Pooling/subsampling

  • Convolutional layer consists of
  • Small filter patches
  • Pooling to remove variation

[Diagram: 98×98 convolution outputs → 3×3 max filter with no overlap → 33×33 layer outputs]

Example: max(4, 5, 6) = 6
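A minimal NumPy sketch of 3×3 max pooling with no overlap (stride 3). 98 is not divisible by 3, so the sketch simply drops the leftover border and yields 32×32; the slide's 33×33 corresponds to padding the border instead:

```python
import numpy as np

rng = np.random.default_rng(0)
conv_out = rng.normal(size=(98, 98))

# group the map into non-overlapping 3x3 blocks and take the max of each block
pooled = conv_out[:96, :96].reshape(32, 3, 32, 3).max(axis=(1, 3))
print(pooled.shape)   # (32, 32)
```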

slide-32
SLIDE 32

Pooling/subsampling

  • Convolutional layer consists of
  • Small filter patches
  • Pooling to remove variation

[Diagram repeated: 3×3 non-overlapping max pooling of the 98×98 convolution outputs gives 33×33 layer outputs]

slide-33
SLIDE 33

Pooling/subsampling

  • Convolutional layer consists of
  • Small filter patches
  • Pooling to remove variation

[Diagram repeated: 3×3 non-overlapping max pooling of the 98×98 convolution outputs gives 33×33 layer outputs]

slide-34
SLIDE 34

CNN overview

  • Filter size, number of filters, filter shifts (strides), and pooling rate are all parameters

  • Usually followed by a fully connected network at the end
  • CNN is good at learning low level features
  • DNN combines the features into high level features and classify

https://en.wikipedia.org/wiki/Convolutional_neural_network#/media/File:Typical_cnn.png

slide-35
SLIDE 35

Parameter sharing in convolution neural networks

  • Wᵀx

[Diagram: data x and two filter directions w1, w2]

  • Cats at different locations might need two different neurons in a fully connected NN

  • CNN shares the parameters in 1 filter
  • The network is no longer fully connected

[Diagram: convolutional connections from layer n−1 to layer n, then pooling from layer n to layer n+1]

slide-36
SLIDE 36

Pooling/subsampling

  • Max filter -> Maxout
  • Backward pass?
  • Gradient passes through the maximum location, and is 0 otherwise
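A minimal NumPy sketch of that backward pass: the upstream gradient is routed to the arg-max position and every other position receives zero:

```python
import numpy as np

window = np.array([4.0, 5.0, 6.0])       # values covered by the max filter
grad_out = 1.0                            # gradient arriving from the layer above

grad_in = np.zeros_like(window)
grad_in[np.argmax(window)] = grad_out     # only the max location gets the gradient
print(grad_in)                            # [0. 0. 1.]
```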
slide-37
SLIDE 37

Norms (p-norm or Lp-norm)

  • For any real number p ≥ 1
  • For p = ∞
  • We’ll see more of p-norms when we get to neural networks

https://en.wikipedia.org/wiki/Lp_space
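For reference, the definitions from the linked page:

```latex
\|x\|_p = \Bigl(\sum_i |x_i|^p\Bigr)^{1/p} \quad (p \ge 1), \qquad\quad \|x\|_\infty = \max_i |x_i|
```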

slide-38
SLIDE 38

Pooling/subsampling

  • Max filter -> Maxout
  • Backward pass?
  • Gradient passes through the maximum location, and is 0 otherwise
  • P-norm filter
  • Fully connected layer – (1x1 convolutions)
  • Recently, people care less about the meaning of pooling as a way to introduce shift invariance, and more about dimension reduction (since conv layers usually have a higher dimension than their input)

slide-39
SLIDE 39

1x1 Convolutions

  • Convolutional layer consists of
  • Small filter patches
  • Pooling to remove variation

[Diagram: 100×100 input convolved with three 3×3 filters, producing three 98×98 outputs]
slide-40
SLIDE 40

1x1 Convolutions

[Diagram: the three 98×98 convolution outputs form a 98×98×K volume; 1×1 filters (in space) are 1×1×K vectors applied across the previous outputs]

slide-41
SLIDE 41

1x1 Convolutions

[Diagram: a 1×1×K filter applied to the stacked 98×98 outputs sums over channels at each spatial position]

slide-42
SLIDE 42

1x1 Convolutions

[Diagram repeated: 1×1×K filters sum over channels at each spatial position]

slide-43
SLIDE 43


1x1 Convolutions

[Diagram repeated: 1×1×K filters produce a new 98×98 output that sums over channels]

slide-44
SLIDE 44


1x1 Convolutions

[Diagram: 1×1×K filters applied across the 98×98 outputs, summing over channels]

If we have fewer 1×1 filters than the previous layer has channels, we perform dimensionality reduction.
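A minimal NumPy sketch of a 1×1 convolution: at each spatial position the K-channel vector is mixed by a learned weight vector (a sum over channels), and using fewer filters than input channels reduces the dimension. The channel counts are made-up illustration values:

```python
import numpy as np

rng = np.random.default_rng(0)
feature_map = rng.normal(size=(98, 98, 64))   # H x W x K from the previous layer
filters_1x1 = rng.normal(size=(64, 16))       # 64 input channels -> 16 output channels

out = feature_map @ filters_1x1               # per-pixel channel mixing = 1x1 convolution
print(out.shape)                              # (98, 98, 16)
```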

slide-45
SLIDE 45

Common schemes

  • INPUT -> [CONV -> RELU -> POOL]*N -> [FC -> RELU]*M -> FC
  • INPUT -> [CONV -> RELU -> CONV -> RELU -> POOL]*N -> [FC -> RELU]*M -> FC
  • If you are working with images, just use a winning architecture
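A minimal PyTorch-style sketch (PyTorch assumed, not part of the slides) of the first scheme with N = 2 and M = 1, for 3×32×32 inputs; the channel and layer sizes are arbitrary:

```python
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # -> 16 x 16 x 16
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # -> 32 x 8 x 8
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 128), nn.ReLU(),   # [FC -> RELU]
    nn.Linear(128, 10),                      # final FC producing class scores
)
```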

slide-46
SLIDE 46

Wider and deeper networks

  • ImageNet task

[Chart: ImageNet ILSVRC error rate vs. year (2010–2015), from the SVM era to the deep learning era; AlexNet (8 layers), ZFNet (8), VGG (19), GoogLeNet (22), ResNet (152), with human performance marked. Source: Olga Russakovsky et al., ImageNet Large Scale Visual Recognition Challenge, 2014, https://arxiv.org/abs/1409.0575]

slide-47
SLIDE 47

ImageNET prize winner

  • AlexNet – first deep model, based on LeNet (a simple CNN from the 1990s)
  • ZFNet – AlexNet, just deeper and with better tuned hyperparameters
  • GoogLeNet (Inception model)
slide-48
SLIDE 48

ImageNET prize winner 2

  • VGGNet – just a bigger network. Notable mention because the model is publicly available
  • ResNet – next class
slide-49
SLIDE 49

Other models

  • YOLO: frames object detection as a regression problem
  • Regression targets: box coordinates + class probabilities
  • Traditional object detectors segment the original image into patches and scan them – each patch may be scanned many times

https://pjreddie.com/media/files/papers/yolo_1.pdf

slide-50
SLIDE 50

YOLO

  • YOLO: frames object detection as a regression problem
  • Regression targets: box coordinates + class probabilities
  • You Only Look Once: outputs possible bounding boxes and class probabilities in a single pass
  • A post-processing step is used to merge the bounding boxes
  • Fast model for real-time object detection
  • The backbone model is just VGG

https://pjreddie.com/media/files/papers/yolo_1.pdf

slide-51
SLIDE 51

Faster R-CNN

  • Another model used for object detection
  • Current state of the art
  • Details – next class
slide-52
SLIDE 52

De-convolution layers

  • Yet another abuse of notation by the vision folks
  • Conventional meaning:
  • A method to reverse the effect of a filter
  • Blurred image -> de-convolution -> good image
  • Neural network meaning:
  • Something that reverses the order of the convolution computation
  • The backward pass of a convolutional layer
  • Used for upsampling
slide-53
SLIDE 53

Convolution

  • 3x3 filter, stride 1, pad 1
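For reference, the standard output-size formula (not on the slide) for input size n, filter size k, padding p, and stride s; a 3×3 filter with stride 1 and pad 1 preserves the size, while stride 2 and pad 1 roughly halves it:

```latex
n_{\text{out}} = \left\lfloor \frac{n + 2p - k}{s} \right\rfloor + 1
```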
slide-54
SLIDE 54

Convolution

  • 3x3 filter, stride 2, pad 1
slide-55
SLIDE 55

Convolution

  • 3x3 filter, stride 2, pad 1
slide-56
SLIDE 56

De-convolution

  • 3x3 de-convolution filter, stride 2, pad 1

[Diagram: each input value gives the weight for a "rubber stamp" copy of the filter placed on the output]

slide-57
SLIDE 57

Visualizing convolutional layers

  • Just like PCA, we can visualize the weights of a transform
  • “Matched filters”

http://cs231n.github.io/understanding-cnn/

slide-58
SLIDE 58

De-convolution

  • 3x3 deconvolution filter, stride 2, pad 1

[Diagram: inputs a and b each weight a rubber-stamped copy of the filter; where stamps overlap the contributions add, e.g. a·F[1][2] + b·F[1][0]]

Other names (because this name sucks, for me):

  • Convolution transpose, upconvolution, backward strided convolution
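A minimal 1-D NumPy sketch of the "rubber stamp" view above: each input value stamps a weighted copy of the filter onto the output (stride 2 here), and overlapping stamps add. The filter values are made up:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])               # input
F = np.array([1.0, 4.0, 6.0, 4.0, 1.0])     # filter
stride = 2

out = np.zeros((len(x) - 1) * stride + len(F))
for i, a in enumerate(x):
    out[i * stride:i * stride + len(F)] += a * F   # stamp a weighted copy of F

print(out)   # overlapping stamps have summed contributions
```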
slide-59
SLIDE 59

De-convolution for segmentation

https://people.eecs.berkeley.edu/~jonlong/long_shelhamer_fcn.pdf

slide-60
SLIDE 60

De-convolution for segmentation

https://people.eecs.berkeley.edu/~jonlong/long_shelhamer_fcn.pdf

slide-61
SLIDE 61

De-convolution for segmentation

https://people.eecs.berkeley.edu/~jonlong/long_shelhamer_fcn.pdf

slide-62
SLIDE 62

CNN

  • Convolutional layer
  • Subsampling
  • Sharing of parameters in space
  • Sharing of parameters in time?
slide-63
SLIDE 63

Recurrent neural network (RNN)

  • DNN framework

[Diagram: independent DNN layers map each word of "They light the fire" to Subject, Adjective, Article, Object – "light" is mis-tagged because each word is processed in isolation]

Problem: need a way to remember the past

63

slide-64
SLIDE 64

Recurrent neural network (RNN)

  • RNN framework

[Diagram: RNN layers map "They light the fire" to Subject, Verb, Article, Object]

64

Output of the layer encodes something meaningful about the past

slide-65
SLIDE 65

Recurrent neural network (RNN)

  • RNN framework

[Diagram: RNN layers map "They light the fire" to Subject, Verb, Article, Object]

65

The recurrence starts from an initial value. New input feature = [original input feature, output of the layer at the previous time step]
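A minimal NumPy sketch of that recurrence: the layer's weights act on the concatenation [original input feature, previous output]. Sizes are arbitrary illustration values:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 3 + 8)) * 0.1   # weights act on [x_t ; h_{t-1}]
b = np.zeros(8)

def rnn_step(x_t, h_prev):
    concat = np.concatenate([x_t, h_prev])
    return np.tanh(W @ concat + b)      # new layer output / hidden state

h = np.zeros(8)                          # [initial value]
for x_t in rng.normal(size=(4, 3)):      # e.g. 4 time steps ("They light the fire")
    h = rnn_step(x_t, h)
print(h.shape)                           # (8,)
```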

slide-66
SLIDE 66

Training a recurrent neural network

  • Computation graph

[Diagram: the same weights W1 are applied at every time step of "They light the fire" → Subject, Verb, Article, Object]

66

[initial value] Parameter sharing

slide-67
SLIDE 67

Training a recurrent neural network

  • Backward Computation graph

[Diagram: backward pass through the same shared weights W1 at every time step]

67

[initial value] Backpropagation through time (BPTT)

slide-68
SLIDE 68

BPTT

  • Backward Computation graph

[Diagram: backward pass through the shared weights W1 at every time step]

68

Starting from the initial value, the gradient contributions G11, G12, G13, G14 from every time step are accumulated into the shared weights: W1 <- W1 + G11 + G12 + G13 + G14. This cannot deal with infinitely long recurrences: gradients explode or vanish.

slide-69
SLIDE 69

Truncated BPTT

  • Backward Computation graph

[Diagram: backward pass through the shared weights W1 at every time step]

69

Truncation: W1 <- W1 + G13 + G14. Pick a maximum number of time steps and only go backwards that far.
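A minimal PyTorch-style sketch (PyTorch assumed) of truncated BPTT: run the RNN in chunks of k time steps, backpropagate within the chunk, then detach the hidden state so gradients stop at the chunk boundary. The loss here is a stand-in:

```python
import torch, torch.nn as nn

rnn = nn.RNN(input_size=3, hidden_size=8, batch_first=True)
opt = torch.optim.SGD(rnn.parameters(), lr=0.01)

x = torch.randn(1, 20, 3)     # one sequence of 20 time steps
h = torch.zeros(1, 1, 8)      # [initial value]
k = 5                         # truncation length

for t in range(0, 20, k):
    out, h = rnn(x[:, t:t + k], h)
    loss = out.pow(2).mean()  # stand-in loss for the sketch
    opt.zero_grad()
    loss.backward()
    opt.step()
    h = h.detach()            # truncate: no gradient flows past the chunk boundary
```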

slide-70
SLIDE 70

Truncated BPTT

  • Backward Computation graph

[Diagram: step 1 – "They" → Subject]

W1 <- W1 + G11. Pick a maximum number of time steps and only go backwards that far.

slide-71
SLIDE 71

Truncated BPTT

  • Backward Computation graph

[Diagram: steps 1–2 – "They light" → Subject, Verb]

W1 <- W1 + G11 + G12. Pick a maximum number of time steps and only go backwards that far.

slide-72
SLIDE 72

Truncated BPTT

  • Backward Computation graph

[Diagram: steps 1–3 – "They light the" → Subject, Verb, Article]

W1 <- W1 + G12 + G13. Pick a maximum number of time steps and only go backwards that far.

slide-73
SLIDE 73

Truncated BPTT

  • Backward Computation graph

[Diagram: steps 1–4 – "They light the fire" → Subject, Verb, Article, Object]

W1 <- W1 + G13 + G14. Pick a maximum number of time steps and only go backwards that far.

slide-74
SLIDE 74

Recurrent neural network (RNN)

[Diagram: RNN layers map "They light the fire" to Subject, Verb, Article, Object; then a new sentence starts with "George" → Object]

Problem 2: needs a way to stop remembering (e.g. when a new sentence starts). Can the network learn when to start and stop remembering things?

slide-75
SLIDE 75

Gated Recurrent Unit (GRU)

  • Forms a Gated Recurrent Neural Network (GRNN)
  • Add gates that can choose to reset (r) or update (z)

[Diagram: a plain RNN unit vs. a GRU with update gate (z) and reset gate (r)]

75

slide-76
SLIDE 76

Gated Recurrent Unit (GRU)

[Diagram: GRU with update gate (z) and reset gate (r)]

76

The GRU maps x_t and h_{t-1} to h_t. Note: the non-linearities of GRU and LSTM are usually sigmoid/tanh – not ReLU.
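For reference, a sketch of a common form of the GRU equations (W and U are input and recurrent weights, σ is the sigmoid, ⊙ is elementwise product; conventions differ on whether z or 1 − z gates the previous state):

```latex
z_t = \sigma(W_z x_t + U_z h_{t-1}) \\
r_t = \sigma(W_r x_t + U_r h_{t-1}) \\
\tilde{h}_t = \tanh\bigl(W_h x_t + U_h (r_t \odot h_{t-1})\bigr) \\
h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
```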

slide-77
SLIDE 77

Gated Recurrent Unit (GRU) layer

77

[Diagram: a GRU layer is a bank of RNN units; the layer output (size = number of RNN units) feeds the next layer and is also fed back, time delayed, as part of the input]

slide-78
SLIDE 78

Long Short-Term Memory (LSTM)

  • Has 3 gates: forget (f), input (i), output (o)
  • Has an explicit memory cell (c)

78

[Diagram: GRU with update gate (z) and reset gate (r) vs. LSTM with memory cell (c), input gate (i), output gate (o), and forget gate (f)]

Both work for data with time dependency. Try GRU first.

slide-79
SLIDE 79

Long Short-Term Memory (LSTM)

79

[Diagram: LSTM with memory cell (c), input gate (i), output gate (o), and forget gate (f)]

j is the index of the LSTM cell. Note that V is diagonal (the output of a cell depends only on the memory of its own cell).
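For reference, a sketch of the standard LSTM equations (without the diagonal peephole term V the slide mentions; W and U are input and recurrent weights, ⊙ is elementwise product):

```latex
f_t = \sigma(W_f x_t + U_f h_{t-1}) \\
i_t = \sigma(W_i x_t + U_i h_{t-1}) \\
o_t = \sigma(W_o x_t + U_o h_{t-1}) \\
c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1}) \\
h_t = o_t \odot \tanh(c_t)
```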

slide-80
SLIDE 80

Bi-directional LSTM

  • The previous GRU/LSTM only looks backward in time, i.e. at the past (uni-directional)
  • Most of the time, information from the future is also useful for predicting the current output

http://tsong.me/blog/google-nmt/

slide-81
SLIDE 81

Real time bi-directional LSTM

  • For real-time applications, only “look ahead” a certain number of time steps
  • Still helpful
slide-82
SLIDE 82

LSTM remembers meaningful things

Andrej Karpathy, Visualizing and Understanding Recurrent Networks, 2015, https://arxiv.org/abs/1506.02078

82

slide-83
SLIDE 83

LSTM/GRU summary

  • Sharing of parameters across time
  • Remembering the past (and future)

[Diagram: GRU with update (z) and reset (r) gates; LSTM with memory cell (c), input (i), output (o), and forget (f) gates]

slide-84
SLIDE 84

DNN Legos

  • Typical models now consist of all 3 types
  • CNN: local structure in the features. Used for feature learning.
  • LSTM: remembering longer-term structure across time
  • DNN: good at mapping features for classification. Usually used in the final layers

[Diagram: CNN → LSTM → DNN]
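A minimal PyTorch-style sketch (assumed, not from the slides) of this stack for a sequence of small images: the CNN extracts per-frame features, the LSTM carries them across time, and a final DNN classifies from the last time step. All sizes are made-up illustration values:

```python
import torch, torch.nn as nn

class CnnLstmDnn(nn.Module):
    def __init__(self, n_classes=10):
        super().__init__()
        self.cnn = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
                                 nn.MaxPool2d(2), nn.Flatten())        # per-frame features
        self.lstm = nn.LSTM(input_size=8 * 16 * 16, hidden_size=64, batch_first=True)
        self.dnn = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, n_classes))

    def forward(self, x):                      # x: (batch, time, 1, 32, 32)
        b, t = x.shape[:2]
        feats = self.cnn(x.reshape(b * t, *x.shape[2:])).reshape(b, t, -1)
        out, _ = self.lstm(feats)              # carry features across time
        return self.dnn(out[:, -1])            # classify from the last time step

print(CnnLstmDnn()(torch.randn(2, 5, 1, 32, 32)).shape)   # (2, 10)
```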

slide-85
SLIDE 85

Tensorboard demo

slide-86
SLIDE 86

Project doodle

  • Build your group on courseville (under project)
  • Submit a proposal. Due next Tuesday.
  • Bullet points
  • What do you want to do?
  • What is the data?
  • How to evaluate the task?
  • Sign up for time slots (next week’s office hours)
  • Sign up sheet TBA.