15-780 Graduate Artificial Intelligence: Convolutional and recurrent networks - PowerPoint PPT Presentation



SLIDE 1

15-780 – Graduate Artificial Intelligence: Convolutional and recurrent networks

  • J. Zico Kolter (this lecture) and Ariel Procaccia

Carnegie Mellon University Spring 2017

1

SLIDE 2

Outline

  • Convolutional neural networks
  • Applications of convolutional networks
  • Recurrent networks
  • Applications of recurrent networks

2

SLIDE 3

Outline

  • Convolutional neural networks
  • Applications of convolutional networks
  • Recurrent networks
  • Applications of recurrent networks

3

SLIDE 4

The problem with fully-connected networks

A 256x256 (RGB) image ⟹ ~200K dimensional input

A fully connected network would need a very large number of parameters, and would be very likely to overfit the data

A generic deep network also does not capture the "natural" invariances we expect in images (translation, scale)

4

[Figure: fully connected weights (W_i)_1, (W_i)_2, … between layers z_i and z_{i+1}]
SLIDE 5

Convolutional neural networks

To create architectures that can handle large images, restrict the weights in two ways:
1. Require that activations between layers only occur in a "local" manner
2. Require that all activations share the same weights
These lead to an architecture known as a convolutional neural network

5

[Figure: local connections with shared weights W_i between layers z_i and z_{i+1}]
SLIDE 6

Convolutions

Convolutions are a basic primitive in many computer vision and image processing algorithms

The idea is to "slide" the weights w (called a filter) over the image z to produce a new image, written y = z ∗ w

6

[Figure: a 3x3 filter W = (w11 … w33) slides over a 5x5 image Z = (z11 … z55), producing a 3x3 output Y = (y11 … y33), one entry at a time:]

y11 = z11 w11 + z12 w12 + z13 w13 + z21 w21 + …
y12 = z12 w11 + z13 w12 + z14 w13 + z22 w21 + …
y13 = z13 w11 + z14 w12 + z15 w13 + z23 w21 + …
y21 = z21 w11 + z22 w12 + z23 w13 + z31 w21 + …
y22 = z22 w11 + z23 w12 + z24 w13 + z32 w21 + …
y23 = z23 w11 + z24 w12 + z25 w13 + z33 w21 + …
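As a concrete numpy sketch of this sliding-window computation (illustrative, not from the lecture), assuming a single-channel image and the "valid" output size shown in the figure:

import numpy as np

def conv2d(z, w):
    # "Valid" 2D correlation: slide the filter w over the image z,
    # taking an elementwise product and sum at each location.
    H, W = z.shape
    k, _ = w.shape
    y = np.zeros((H - k + 1, W - k + 1))
    for i in range(y.shape[0]):
        for j in range(y.shape[1]):
            y[i, j] = np.sum(z[i:i+k, j:j+k] * w)
    return y

z = np.arange(25, dtype=float).reshape(5, 5)   # 5x5 "image"
w = np.ones((3, 3)) / 9.0                      # 3x3 averaging filter
print(conv2d(z, w).shape)                      # (3, 3), matching the figure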

SLIDE 7

Convolutions in image processing

Convolutions (typically with prespecified filters) are a common operation in many computer vision applications

7

Original image z

Gaussian blur:
z ∗ (1/273) · [ 1  4  7  4  1
                4 16 26 16  4
                7 26 41 26  7
                4 16 26 16  4
                1  4  7  4  1 ]

Image gradient:
( (z ∗ [−1 0 1; −2 0 2; −1 0 1])^2 + (z ∗ [−1 −2 −1; 0 0 0; 1 2 1])^2 )^(1/2)
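For reference, both filters can be applied in a few lines with scipy; a minimal sketch using the standard 5x5/273 Gaussian kernel and Sobel kernels shown above (the random image is just a stand-in):

import numpy as np
from scipy.signal import correlate2d

z = np.random.rand(64, 64)            # stand-in grayscale image

# 5x5 Gaussian blur kernel, normalized by 273
g = np.array([[1,  4,  7,  4, 1],
              [4, 16, 26, 16, 4],
              [7, 26, 41, 26, 7],
              [4, 16, 26, 16, 4],
              [1,  4,  7,  4, 1]]) / 273.0
blurred = correlate2d(z, g, mode="same")

# Sobel kernels for the horizontal / vertical image gradient
sx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]])
sy = sx.T
grad = np.sqrt(correlate2d(z, sx, mode="same") ** 2 +
               correlate2d(z, sy, mode="same") ** 2)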

SLIDE 8

Convolutional neural networks

The idea of a convolutional neural network, in some sense, is to let the network "learn" the right filters for the specified task

In practice, we actually use "3D" convolutions, which apply a separate convolution to multiple layers (channels) of the image, then add the results together

8

[Figure: a 3D convolution applies separate filters (W_i)_1, (W_i)_2, … to the channels of z_i and sums the results to form z_{i+1}]
SLIDE 9

Additional notes on convolutions

For anyone with a signal processing background: this is actually not what you would call a convolution, but rather a correlation (convolution with the filter flipped upside-down and left-right)

It's common to "zero pad" the input image so that the resulting image is the same size

Also common to use a max-pooling operation that shrinks images by taking the max over a region (also common: strided convolutions)

9

[Figure: max-pooling takes the max over a region of z_i to produce z_{i+1}]
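A minimal sketch of 2x2 max-pooling (illustrative, assuming the height and width are divisible by 2):

import numpy as np

def maxpool2x2(z):
    # Take the max over non-overlapping 2x2 regions, halving height and width.
    H, W = z.shape
    return z.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

z = np.random.rand(4, 4)
print(maxpool2x2(z))   # 2x2 output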

SLIDE 10

Number of parameters

Consider a convolutional network that takes as input color (RGB) 32x32 images, and uses the following layers (all convolutional layers use zero-padding):
1. 5x5x64 convolution
2. 2x2 max-pooling
3. 3x3x128 convolution
4. 2x2 max-pooling
5. Fully-connected to 10-dimensional output

How many parameters does this network have?
1. O(10^3)
2. O(10^4)
3. O(10^5)
4. O(10^6)

10
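As a back-of-the-envelope check (not from the slides), one can count the parameters directly, assuming each filter spans all input channels and has a bias term, and that the fully-connected layer sees the 8x8x128 activation left after two rounds of 2x2 pooling:

conv1 = 5 * 5 * 3 * 64 + 64        # 5x5 filters over 3 input channels, 64 outputs ->  4,864
conv2 = 3 * 3 * 64 * 128 + 128     # 3x3 filters over 64 channels, 128 outputs     -> 73,856
fc    = 8 * 8 * 128 * 10 + 10      # 32 -> 16 -> 8 spatially after two poolings    -> 81,930
print(conv1 + conv2 + fc)          # roughly 160,000 parameters in total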

SLIDE 11

Learning with convolutions

How do we apply backpropagation to neural networks with convolutions?

z_{i+1} = g_i(z_i ∗ w_i + b_i)

Remember that for a dense layer z_{i+1} = g_i(W_i z_i + b_i), the forward pass required multiplication by W_i and the backward pass required multiplication by W_i^T

We're going to show that convolution is a type of (highly structured) matrix multiplication, and show how to compute the multiplication by its transpose

11

SLIDE 12

Convolutions as matrix multiplication

Consider initially a 1D convolution z_i ∗ w_i for a filter w_i ∈ R^3 and input z_i ∈ R^6

Then z_i ∗ w_i = W_i z_i for the banded matrix

W_i = [ w1 w2 w3  0  0  0
         0 w1 w2 w3  0  0
         0  0 w1 w2 w3  0
         0  0  0 w1 w2 w3 ]

So how do we multiply by W_i^T?

12
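A quick numerical check of this claim, as a sketch (numpy's np.correlate implements exactly the un-flipped sliding product used on these slides):

import numpy as np

w = np.array([1.0, 2.0, 3.0])      # filter w_i in R^3
z = np.arange(6, dtype=float)      # input z_i in R^6

# Banded matrix from the slide: each row is w shifted one position right
W = np.zeros((4, 6))
for i in range(4):
    W[i, i:i+3] = w

print(W @ z)                             # matrix-vector form
print(np.correlate(z, w, mode="valid"))  # same values: the sliding-window form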

SLIDE 13

Convolutions as matrix multiplication, cont

Multiplication by the transpose is just

W_i^T h_{i+1} = [ w1  0  0  0
                  w2 w1  0  0
                  w3 w2 w1  0
                   0 w3 w2 w1
                   0  0 w3 w2
                   0  0  0 w3 ] h_{i+1} = h_{i+1} ∗ w̃_i

where w̃_i is just the flipped version of w_i

In other words, the transpose of convolution is just (zero-padded) convolution by the flipped filter (correlations for signal processing people)

The same property holds for 2D convolutions, so backprop just flips the convolutions

13
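Continuing the sketch from the previous slide, the transpose relationship can also be checked numerically (zero-pad h by the filter size minus one, then apply the slide's "convolution" with the flipped filter):

import numpy as np

w = np.array([1.0, 2.0, 3.0])
W = np.zeros((4, 6))
for i in range(4):
    W[i, i:i+3] = w

h = np.array([1.0, -2.0, 0.5, 3.0])               # stand-in backward signal h_{i+1}

lhs = W.T @ h                                      # multiplication by the transpose
padded = np.pad(h, 2)                              # zero-pad by (filter size - 1) on each side
rhs = np.correlate(padded, w[::-1], mode="valid")  # sliding product with the flipped filter
print(np.allclose(lhs, rhs))                       # True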

SLIDE 14

Outline

  • Convolutional neural networks
  • Applications of convolutional networks
  • Recurrent networks
  • Applications of recurrent networks

14

SLIDE 15

LeNet network, digit classification

The network that started it all (and then stopped for ~14 years)

15

[Figure: LeNet-5 architecture. INPUT 32x32 → (convolutions) C1: feature maps 6@28x28 → (subsampling) S2: feature maps 6@14x14 → (convolutions) C3: feature maps 16@10x10 → (subsampling) S4: feature maps 16@5x5 → C5: layer 120 → (full connection) F6: layer 84 → (Gaussian connections) OUTPUT: 10]

LeNet-5 (LeCun et al., 1998) architecture, achieves 1% error in MNIST digit classification

SLIDE 16

Image classification

Recent ImageNet classification challenges

16

SLIDE 17

Using intermediate layers as features

Increasingly common to use later-stage layers of pre-trained image classification networks as features for image classification tasks

Classify dogs/cats based upon 2000 images (1000 of each class):
  Approach 1: Convolutional network from scratch: 80%
  Approach 2: Final layer from VGG network -> dense net: 90%
  Approach 3: Also fine-tune last convolution features: 94%

17

https://blog.keras.io/building-powerful-image-classification-models-using-very-little-data.html
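A minimal sketch of "Approach 2", loosely following the linked Keras tutorial and assuming tensorflow.keras with ImageNet weights available; the 150x150 input size, 256-unit dense layer, and optimizer are illustrative choices, not from the slides:

from tensorflow.keras.applications import VGG16
from tensorflow.keras import layers, models

base = VGG16(weights="imagenet", include_top=False, input_shape=(150, 150, 3))
base.trainable = False                      # freeze the pre-trained convolutional features

model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(256, activation="relu"),   # small dense classifier on top of VGG features
    layers.Dense(1, activation="sigmoid"),  # dogs vs. cats
])
model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["accuracy"])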

SLIDE 18

Playing Atari games

18

SLIDE 19

Neural style

Adjust the input image to make its feature activations (really, inner products of feature activations) match those of target (art) images (Gatys et al., 2016)

19

SLIDE 20

Detecting cancerous cells in images

20

https://research.googleblog.com/2017/03/assisting-pathologists-in-detecting.html

SLIDE 21

Outline

  • Convolutional neural networks
  • Applications of convolutional networks
  • Recurrent networks
  • Applications of recurrent networks

21

SLIDE 22

Predicting temporal data

So far, the models we have discussed are applicable to independent inputs x(1), …, x(m)

In practice, we often want to predict a sequence of outputs given a sequence of inputs (predicting each output independently would miss correlations)

Examples: time series forecasting, sentence labeling, speech to text, etc.

22

[Figure: a sequence of inputs x(1), x(2), x(3), … with corresponding outputs y(1), y(2), y(3), …]

SLIDE 23

Recurrent neural networks

Maintain a hidden state over time; the hidden state is a function of the current input and the previous hidden state

23

[Figure: recurrent network unrolled over time, with inputs x(1), x(2), x(3), …, hidden states z(1), z(2), z(3), …, and outputs ŷ(1), ŷ(2), ŷ(3), …; the weights Wxz, Wzz, Wzy are shared across all time steps]

z(t) = g_z(Wxz x(t) + Wzz z(t−1) + b_z)
ŷ(t) = g_y(Wzy z(t) + b_y)
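A minimal numpy sketch of this recurrence (not from the slides), assuming tanh for g_z and a linear g_y; the argument names mirror Wxz, Wzz, Wzy above:

import numpy as np

def rnn_forward(xs, Wxz, Wzz, Wzy, bz, by):
    # z(t) = tanh(Wxz x(t) + Wzz z(t-1) + bz),  yhat(t) = Wzy z(t) + by
    z = np.zeros(Wzz.shape[0])          # initial hidden state, all zeros
    yhats = []
    for x in xs:
        z = np.tanh(Wxz @ x + Wzz @ z + bz)
        yhats.append(Wzy @ z + by)
    return yhats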

SLIDE 24

Training recurrent networks

Most common training approach is to "unroll" the RNN on some dataset, and minimize the loss function

minimize_{Wxz, Wzz, Wzy}  Σ_t ℓ(ŷ(t), y(t))

Note that the network will have the "same" parameters in a lot of places in the unrolled network (e.g., the same Wzz matrix occurs in each step); an advantage of the computation graph approach is that it's easy to compute these complex gradients

Some issues: initializing the first hidden state (just set it to all zeros), and how long a sequence to unroll (pick something big, like >100)

24
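Continuing the rnn_forward sketch from the previous slide, the unrolled objective might look as follows (a sketch assuming a squared-error loss; the slide leaves ℓ unspecified):

import numpy as np

def unrolled_loss(xs, ys, Wxz, Wzz, Wzy, bz, by):
    # Run the same shared parameters over every time step, then sum the per-step losses.
    yhats = rnn_forward(xs, Wxz, Wzz, Wzy, bz, by)   # rnn_forward as sketched above
    return sum(np.sum((yhat - y) ** 2) for yhat, y in zip(yhats, ys))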

SLIDE 25

LSTM networks

Trouble with plain RNNs is that it is difficult to capture long-term dependencies (e.g., if we see a "(" character, we expect a ")" to follow at some point)

The problem has to do with vanishing gradients: for many activations like sigmoid and tanh, gradients get smaller and smaller over subsequent layers (and ReLUs have their own problems)

One solution, the long short-term memory (LSTM) network (Hochreiter and Schmidhuber, 1997), has a more complex structure that specifically encodes memory and pass-through features, and is able to model long-term dependencies

25

i_t = tanh(Wxi x_t + Whi h_{t−1} + b_i)
j_t = sigm(Wxj x_t + Whj h_{t−1} + b_j)
f_t = sigm(Wxf x_t + Whf h_{t−1} + b_f)
o_t = tanh(Wxo x_t + Who h_{t−1} + b_o)
c_t = c_{t−1} ⊙ f_t + i_t ⊙ j_t
h_t = tanh(c_t) ⊙ o_t

Figure from (Jozefowicz et al., 2015)
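A direct numpy transcription of these update equations (a sketch; the parameter dictionary p and its key names are illustrative):

import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, c_prev, p):
    # One step of the update equations in the figure; p holds the weight
    # matrices and biases, keyed by the same names (Wxi, Whi, bi, ...).
    i = np.tanh(p["Wxi"] @ x + p["Whi"] @ h_prev + p["bi"])
    j = sigm(p["Wxj"] @ x + p["Whj"] @ h_prev + p["bj"])
    f = sigm(p["Wxf"] @ x + p["Whf"] @ h_prev + p["bf"])
    o = np.tanh(p["Wxo"] @ x + p["Who"] @ h_prev + p["bo"])
    c = c_prev * f + i * j      # memory cell: forget-gated pass-through plus gated new input
    h = np.tanh(c) * o          # hidden state read out through the output gate
    return h, c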

SLIDE 26

Outline

  • Convolutional neural networks
  • Applications of convolutional networks
  • Recurrent networks
  • Applications of recurrent networks

26

SLIDE 27

Char-RNN

Excellent tutorial available at: http://karpathy.github.io/2015/05/21/rnn-effectiveness/

Basic idea is to build an RNN (using stacked LSTMs) that predicts the next character of some text given the previous characters

27

SLIDE 28

Sample code from Char-RNN

Char-RNN trained on the source code of the Linux kernel

28

/*
 * Increment the size file of the new incorrect UI_FILTER group information
 * of the size generatively.
 */
static int indicate_policy(void)
{
    int error;
    if (fd == MARN_EPT) {
        /*
         * The kernel blank will coeld it to userspace.
         */
        if (ss->segment < mem_total)
            unblock_graph_and_set_blocked();
        else
            ret = 1;
        goto bail;
    }
    segaddr = in_SB(in.addr);
    selector = seg / 16;
    setup_works = true;
    …

SLIDE 29

Sample LaTeX from Char-RNN

Char-RNN trained on the LaTeX source of a textbook on algebraic geometry

29

SLIDE 30

Sequence to sequence models

Idea: use an LSTM without outputs on the "input" sequence, then an auto-regressive LSTM on the output sequence (Sutskever et al., 2014)

30

SLIDE 31

Machine translation

A scale-up of sequence to sequence learning, now underlying much of Google's machine translation methods (Wu et al., 2016)

31

SLIDE 32

Combining RNNs and CNNs

Take a convolutional network and feed it into the first hidden layer of a recurrent neural network

32