

slide-1
SLIDE 1

Deep learning

Deep learning

Recurrent neural networks

Hamid Beigy

Sharif university of technology

November 10, 2019

Hamid Beigy | Sharif university of technology | November 10, 2019 1 / 96

slide-2
SLIDE 2

Deep learning

Table of contents

1. Introduction
2. Recurrent neural networks
3. Training recurrent neural networks
4. Design patterns of RNN
5. Long-term dependencies
6. Attention models
7. Reading

Hamid Beigy | Sharif university of technology | November 10, 2019 2 / 96

slide-3
SLIDE 3

Deep learning | Introduction

Table of contents

1. Introduction
2. Recurrent neural networks
3. Training recurrent neural networks
4. Design patterns of RNN
5. Long-term dependencies
6. Attention models
7. Reading

Hamid Beigy | Sharif university of technology | November 10, 2019 2 / 96

slide-4
SLIDE 4

Deep learning | Introduction

Introduction

1. In previous sessions, we considered deep learning models with the following characteristics.
   • Input layer: a (possibly vectorized) quantitative representation of the input.
   • Hidden layer(s): apply transformations with nonlinearities.
   • Output layer: produces the result for classification, regression, translation, segmentation, etc.

2. These models were used for supervised learning.

Hamid Beigy | Sharif university of technology | November 10, 2019 3 / 96

slide-5
SLIDE 5

Deep learning | Introduction

Sequence learning

1. Sequence learning is the study of machine learning algorithms designed for sequential data.
2. Language modeling is one of the most interesting topics that uses sequence labeling.
3. For example, consider the machine translation task: we have a sentence in the source language and must translate it into the destination language.
4. How do we use feed-forward networks to solve this machine translation problem?
5. Consider other solutions such as autoregressive models, linear dynamical systems, and hidden Markov models as an exercise.

Hamid Beigy | Sharif university of technology | November 10, 2019 4 / 96

slide-6
SLIDE 6

Deep learning | Introduction

Sequence learning (application)

1. Consider the stock market¹.
2. We must consider the series of stock values over the past several days to decide whether it is wise to invest today (ideally, consider all of history).

[Figure: stock values on the days 7/03 through 15/03; to invest or not to invest?]

3. Inputs are vectors.
4. The output may be a scalar (should I invest?) or a vector (should I invest in X?).

1From Bhiksha Raj slides Hamid Beigy | Sharif university of technology | November 10, 2019 5 / 96

slide-7
SLIDE 7

Deep learning | Introduction

Sequence learning (application)

1. We need a network that accepts the previous days' values and decides.

[Figure: stock vectors X(t), X(t+1), ..., X(t+7) over time, producing a decision such as Y(t+4).]

Credit: Bhiksha Raj Hamid Beigy | Sharif university of technology | November 10, 2019 6 / 96

slide-8
SLIDE 8

Deep learning | Introduction

Sequence learning (Applications)

1. Although all problems in sequence learning can be converted into ones with fixed-length inputs and outputs, they may involve a variable time horizon².
2. Sequence classification: sentiment analysis, activity/action recognition, DNA sequence classification, action selection.
3. Sequence synthesis: text synthesis, music synthesis, motion synthesis.
4. Sequence-to-sequence translation: speech recognition, text translation, part-of-speech tagging.

2From Francois Fleuret slides Hamid Beigy | Sharif university of technology | November 10, 2019 7 / 96

slide-9
SLIDE 9

Deep learning | Introduction

Processing sequences

1. Processing sequences³:
   • "Vanilla" neural networks (fixed-size input and output).
   • Image captioning: image → sequence of words.
   • Sentiment classification: sequence of words → sentiment.
   • Machine translation: sequence of words → sequence of words.
   • Video classification at the frame level.
3From Fei-Fei Li et al. slides Hamid Beigy | Sharif university of technology | November 10, 2019 8 / 96

slide-10
SLIDE 10

Deep learning | Introduction

Can we process non-sequential data sequentially?

1. We have non-sequential data and we are processing it sequentially.
2. Is it possible?
3. Please read the following papers:
   • Ba, Mnih, and Kavukcuoglu, "Multiple Object Recognition with Visual Attention", ICLR 2015.
   • Gregor et al., "DRAW: A Recurrent Neural Network For Image Generation", ICML 2015.

Hamid Beigy | Sharif university of technology | November 10, 2019 9 / 96

slide-11
SLIDE 11

Deep learning | Introduction

Why is an MLP not appropriate for sequence learning?

1. MLPs only accept an input of fixed dimensionality and map it to an output of fixed dimensionality.
2. In traditional MLPs, we assume that all inputs (and outputs) are independent of each other. Why is this problematic?
3. We would need to re-learn the rules from scratch each time.
4. We need to reuse knowledge about previous events to help in classifying the current one.

Hamid Beigy | Sharif university of technology | November 10, 2019 10 / 96

slide-12
SLIDE 12

Deep learning | Recurrent neural networks

Table of contents

1. Introduction
2. Recurrent neural networks
3. Training recurrent neural networks
4. Design patterns of RNN
5. Long-term dependencies
6. Attention models
7. Reading

Hamid Beigy | Sharif university of technology | November 10, 2019 10 / 96

slide-13
SLIDE 13

Deep learning | Recurrent neural networks

Introduction

1. A recurrent model maintains a recurrent state that is updated at each time step⁴.
2. Given an input sequence x ∈ S(R^D) and an initial state h_0 ∈ R^Q, the recurrent model computes the following sequence of recurrent states iteratively:
   h_t = φ(x_t, h_{t−1}),   φ : R^D × R^Q → R^Q
3. A prediction can be computed at any time step from the recurrent state:
   y_t = ψ(h_t),   ψ : R^Q → R^C
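To make the recurrence concrete, here is a minimal NumPy sketch of the generic model above, assuming an illustrative tanh-based φ and a linear ψ; the dimensions D, Q, C and the random weights U, W, V are made up for the example.

```python
# A sketch of the generic recurrence h_t = phi(x_t, h_{t-1}), y_t = psi(h_t).
import numpy as np

D, Q, C = 3, 5, 2                      # input, state and output sizes (assumed)
rng = np.random.default_rng(0)
U = rng.normal(size=(Q, D))            # input-to-state weights (illustrative)
W = rng.normal(size=(Q, Q))            # state-to-state weights
V = rng.normal(size=(C, Q))            # state-to-output weights

def phi(x_t, h_prev):
    """One recurrent state update: R^D x R^Q -> R^Q."""
    return np.tanh(U @ x_t + W @ h_prev)

def psi(h_t):
    """Prediction from the current state: R^Q -> R^C."""
    return V @ h_t

xs = [rng.normal(size=D) for _ in range(4)]   # a toy input sequence
h = np.zeros(Q)                               # initial state h_0
for x_t in xs:
    h = phi(x_t, h)
    y_t = psi(h)                              # a prediction is available at every step
print(y_t)
```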

4From Francois Fleuret slides Hamid Beigy | Sharif university of technology | November 10, 2019 11 / 96

slide-14
SLIDE 14

Deep learning | Recurrent neural networks

Recurrent neural networks

1. Recurrent neural networks are networks for handling sequential data, such as a pair of sentences with different lengths or two speech signals.
2. Are the weights dependent on the time instant?
3. RNNs share parameters across different parts of the model.
4. Why do we share weights?
5. When there is no parameter sharing, it would not be possible to share statistical strength and generalize to lengths of sequences not seen during training.

Hamid Beigy | Sharif university of technology | November 10, 2019 12 / 96

slide-15
SLIDE 15

Deep learning | Recurrent neural networks

Recurrent neural network architecture5

[Figure: unrolled recurrent network. Starting from h_0, each application of Φ combines the input x_t with h_{t−1} to produce h_t, up to h_T; Ψ maps any recurrent state h_t to a prediction y_t; the weights w are shared across all time steps.]

5From Francois Fleuret slides Hamid Beigy | Sharif university of technology | November 10, 2019 13 / 96

slide-16
SLIDE 16

Deep learning | Recurrent neural networks

Recurrent neural network (Machine translation)

1. Machine translation is similar to language modeling in that our input is a sequence of words in the source language.
2. We want to output a sequence of words in our target language.
3. A key difference is that the output only starts after we have seen the complete input, because the first word of our translated sentence may require information captured from the complete input sequence.

Credit: Denny Britz Hamid Beigy | Sharif university of technology | November 10, 2019 14 / 96

slide-17
SLIDE 17

Deep learning | Recurrent neural networks

Recurrent neural network (shortcut)

1. We often use the following shortcut to represent the recurrent neural network.

Credit: Bhiksha Raj Hamid Beigy | Sharif university of technology | November 10, 2019 15 / 96

slide-18
SLIDE 18

Deep learning | Recurrent neural networks

Recurrent neural network (shortcut)

1. We often use the following shortcut to represent the recurrent neural network.

Credit: Bhiksha Raj Hamid Beigy | Sharif university of technology | November 10, 2019 16 / 96

slide-19
SLIDE 19

Deep learning | Recurrent neural networks

Recurrent neural network (example application)6

1. We consider the following simple binary sequence classification problem:
   • Label 1: the sequence is the concatenation of two identical halves; label 0: otherwise.

   x                            y
   (1, 2, 3, 4, 5, 6)           0
   (3, 9, 9, 3)                 0
   (7, 4, 5, 7, 5, 4)           0
   (7, 7)                       1
   (1, 2, 3, 1, 2, 3)           1
   (5, 1, 1, 2, 5, 1, 1, 2)     1

6From Francois Fleuret slides Hamid Beigy | Sharif university of technology | November 10, 2019 17 / 96

slide-20
SLIDE 20

Deep learning | Recurrent neural networks

Recurrent neural networks

1. Usually, we want to predict a vector at some time steps⁷.

[Figure: recurrent neural network; a sequence of vectors x is processed by applying a recurrence formula at every time step. The new state is a function, with parameters W, of the old state and the input vector at the current time step.]

2. We can process the input x by applying the following recurrence equation:
   h_t = f_W(x_t, h_{t−1})
3. Assume that the activation function is tanh:
   h_t = tanh(U x_t + W h_{t−1})
   y_t = V h_t
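A small NumPy sketch of this concrete recurrence, assuming illustrative sizes and randomly initialized (untrained) weights U, W, V:

```python
# h_t = tanh(U x_t + W h_{t-1}), y_t = V h_t over a whole toy sequence.
import numpy as np

D, H, C, T = 3, 5, 2, 7
rng = np.random.default_rng(7)
U = rng.normal(scale=0.5, size=(H, D))
W = rng.normal(scale=0.5, size=(H, H))
V = rng.normal(scale=0.5, size=(C, H))

def rnn_forward(xs, h0=None):
    """Return all hidden states and outputs for an input sequence xs."""
    h = np.zeros(H) if h0 is None else h0
    hs, ys = [], []
    for x_t in xs:
        h = np.tanh(U @ x_t + W @ h)       # the same U, W are shared at every step
        hs.append(h)
        ys.append(V @ h)                   # y_t = V h_t
    return np.array(hs), np.array(ys)

xs = rng.normal(size=(T, D))
hs, ys = rnn_forward(xs)
print(hs.shape, ys.shape)                  # (7, 5) and (7, 2)
```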

7From Fei-Fei Li et al. slides Hamid Beigy | Sharif university of technology | November 10, 2019 18 / 96

slide-21
SLIDE 21

Deep learning | Recurrent neural networks

Recurrent neural networks

1. The computational graph is⁸:

[Figure: RNN computational graph (one-to-many); the shared weights W and the input x feed f_W at each step, producing hidden states h_0, h_1, h_2, h_3, ..., h_T and outputs y_1, y_2, y_3, ..., y_T.]

8From Fei-Fei Li et al. slides Hamid Beigy | Sharif university of technology | November 10, 2019 19 / 96

slide-22
SLIDE 22

Deep learning | Recurrent neural networks

Recurrent neural networks (character level language model)

1. Assume that the vocabulary is [h, e, l, o]⁹.
2. Example training sequence: "hello".
3. At the output layer, we use softmax.

[Figure: character-level language model; vocabulary [h, e, l, o], training sequence "hello".]
9From Fei-Fei Li et al. slides Hamid Beigy | Sharif university of technology | November 10, 2019 20 / 96

slide-23
SLIDE 23

Deep learning | Recurrent neural networks

Recurrent neural networks (character level language model)

1. Assume that the vocabulary is [h, e, l, o]¹⁰.
2. Example training sequence: "hello".
3. At the output layer, we use softmax.

[Figure: character-level language model sampling; at test time, characters are sampled from the softmax output one at a time and fed back into the model, producing "e", "l", "l", "o".]
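A sketch of the test-time sampling loop described above, on the vocabulary [h, e, l, o]; the weights Wxh, Whh, Why are untrained random placeholders, so the output is not meaningful, only the mechanism is.

```python
# Sample one character at a time and feed it back as the next input.
import numpy as np

vocab = ['h', 'e', 'l', 'o']
V_size, H = len(vocab), 8
rng = np.random.default_rng(1)
Wxh = rng.normal(scale=0.1, size=(H, V_size))
Whh = rng.normal(scale=0.1, size=(H, H))
Why = rng.normal(scale=0.1, size=(V_size, H))

def one_hot(i):
    v = np.zeros(V_size); v[i] = 1.0
    return v

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

h = np.zeros(H)
idx = vocab.index('h')              # seed character
generated = ['h']
for _ in range(5):
    h = np.tanh(Wxh @ one_hot(idx) + Whh @ h)
    p = softmax(Why @ h)            # distribution over the next character
    idx = rng.choice(V_size, p=p)   # sample, then feed back
    generated.append(vocab[idx])
print(''.join(generated))
```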

10From Fei-Fei Li et al. slides Hamid Beigy | Sharif university of technology | November 10, 2019 21 / 96

slide-24
SLIDE 24

Deep learning | Recurrent neural networks

Comparing two models

[Figure: the two character-level language models on the vocabulary [h, e, l, o] being compared: a model trained on the sequence "hello" (left) and a test-time model that samples characters one at a time and feeds them back (right).]

First model:
1. This is more powerful because a very high-dimensional hidden vector can be considered.
2. The training will be harder: it must be sequential.

Second model:
1. This is less powerful unless a very high-dimensional and rich output vector is considered.
2. The training will be easier: it allows parallelization.

Hamid Beigy | Sharif university of technology | November 10, 2019 22 / 96

slide-25
SLIDE 25

Deep learning | Training recurrent neural networks

Table of contents

1. Introduction
2. Recurrent neural networks
3. Training recurrent neural networks
4. Design patterns of RNN
5. Long-term dependencies
6. Attention models
7. Reading

Hamid Beigy | Sharif university of technology | November 10, 2019 22 / 96

slide-26
SLIDE 26

Deep learning | Training recurrent neural networks

Backpropagation through time

1. We have a collection of labeled samples S = {(x_1, y_1), (x_2, y_2), . . . , (x_m, y_m)}, where x_i = (x_{i,0}, x_{i,1}, . . . , x_{i,T}) is the input sequence and y_i = (y_{i,0}, y_{i,1}, . . . , y_{i,T}) is the output sequence.
2. The goal is to find the weights of the network that minimize the error between ŷ_i = (ŷ_{i,0}, ŷ_{i,1}, . . . , ŷ_{i,T}) and y_i = (y_{i,0}, y_{i,1}, . . . , y_{i,T}).
3. In the forward phase, the input is given to the network and the output is calculated.
4. In the backward phase, the gradients of the cost function with respect to the weights are calculated and the weights are updated.

Hamid Beigy | Sharif university of technology | November 10, 2019 23 / 96

slide-27
SLIDE 27

Deep learning | Training recurrent neural networks

Forward phase

1. The input is given to the network and the output is calculated.
2. Consider two hidden layers denoted by h^(1) and h^(2).
3. The output of the first hidden layer equals
   h_i^(1)(t) = σ_1( Σ_j u_{ij} x_j(t) + Σ_j w_{ji}^(1) h_j^(1)(t − 1) + b_i^(1) )
4. The output of the second hidden layer equals
   h_i^(2)(t) = σ_2( Σ_j u_{ij} h_j^(1)(t) + Σ_j w_{ji}^(2) h_j^(2)(t − 1) + b_i^(2) )
5. The output equals
   ŷ_i(t) = σ_3( Σ_j v_{ij} h_j^(2)(t) + c_i )
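A sketch of this two-hidden-layer forward phase, assuming σ_1 = σ_2 = tanh and σ_3 = identity; the parameter names U1, W1, U2, W2, V and the toy sizes are illustrative, not from the slides.

```python
# Forward phase of a two-hidden-layer recurrent network (toy sizes).
import numpy as np

D, H1, H2, C, T = 3, 4, 4, 2, 6
rng = np.random.default_rng(2)
U1, W1, b1 = rng.normal(size=(H1, D)),  rng.normal(size=(H1, H1)), np.zeros(H1)
U2, W2, b2 = rng.normal(size=(H2, H1)), rng.normal(size=(H2, H2)), np.zeros(H2)
V,  c      = rng.normal(size=(C, H2)),  np.zeros(C)

x = rng.normal(size=(T, D))            # toy input sequence
h1 = np.zeros(H1)                      # h^(1)(0)
h2 = np.zeros(H2)                      # h^(2)(0)
y_hat = np.zeros((T, C))
for t in range(T):
    h1 = np.tanh(U1 @ x[t] + W1 @ h1 + b1)   # first hidden layer at time t
    h2 = np.tanh(U2 @ h1  + W2 @ h2 + b2)    # second hidden layer at time t
    y_hat[t] = V @ h2 + c                    # output at time t
print(y_hat.shape)
```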

Hamid Beigy | Sharif university of technology | November 10, 2019 24 / 96

slide-28
SLIDE 28

Deep learning | Training recurrent neural networks

Backpropagation through time

1. Forward through the entire sequence to compute the loss, then backward through the entire sequence to compute the gradient¹¹.

[Figure: backpropagation through time; the loss is computed from the full unrolled sequence, and gradients flow backward through every time step.]
11From Fei-Fei Li et al. slides Hamid Beigy | Sharif university of technology | November 10, 2019 25 / 96

slide-29
SLIDE 29

Deep learning | Training recurrent neural networks

Backpropagation through time

1. We must find
   ∇_V L(θ), ∇_W L(θ), ∇_U L(θ), ∇_b L(θ), ∇_c L(θ)
2. Then we treat the network as a usual multi-layer network and apply backpropagation to the unrolled network.

Hamid Beigy | Sharif university of technology | November 10, 2019 26 / 96

slide-30
SLIDE 30

Deep learning | Training recurrent neural networks

Backpropagation through time

1. Let us consider a network with one hidden layer.
2. We use the softmax activation function for the output layer.
3. We use tanh for the hidden layer.
4. Suppose that L(t) is the loss at time t.
5. If L(t) is the negative log-likelihood of y(t) given x(1), x(2), . . . , x(T), then
   L({x(1), . . . , x(T)}, {y(1), . . . , y(T)}) = Σ_t L(t) = − Σ_t log p_model( y(t) | {x(1), x(2), . . . , x(T)} )

Hamid Beigy | Sharif university of technology | November 10, 2019 27 / 96

slide-31
SLIDE 31

Deep learning | Training recurrent neural networks

Backpropagation through time

1. We backpropagate the gradient in the following way (L is the loss function).

[Figure: an RNN unrolled over five steps; each step t has input x_t, hidden state h_t and output ŷ_t, with shared weights U, W, V; the gradients ∂L/∂U, ∂L/∂W and ∂L/∂V accumulate contributions from every time step.]

Credit: Trivedi & Kondor Hamid Beigy | Sharif university of technology | November 10, 2019 28 / 96

slide-32
SLIDE 32

Deep learning | Training recurrent neural networks

Backpropagation through time

1. In the forward phase, the hidden layer computes
   h_t = tanh( W⊤ h_{t−1} + U⊤ x_t + b )
2. In the forward phase, the output layer computes
   o_t = V⊤ h_t + c
   ŷ_t = softmax(o_t)

Hamid Beigy | Sharif university of technology | November 10, 2019 29 / 96

slide-33
SLIDE 33

Deep learning | Training recurrent neural networks

Backpropagation through time

1. Calculating the gradient:
   ∂L/∂L(t) = 1
   (∇_{o(t)} L)_i = ∂L/∂o_i(t) = (∂L/∂L(t)) (∂L(t)/∂o_i(t))
2. By using softmax in the output layer, we have
   ŷ_i(t) = e^{o_i(t)} / Σ_k e^{o_k(t)}
   ∂ŷ_i(t)/∂o_j(t) = ŷ_i(t)(1 − ŷ_j(t))   if i = j
   ∂ŷ_i(t)/∂o_j(t) = −ŷ_j(t) ŷ_i(t)        if i ≠ j
Hamid Beigy | Sharif university of technology | November 10, 2019 30 / 96

slide-34
SLIDE 34

Deep learning | Training recurrent neural networks

Backpropagation through time

1. We compute the gradient of the loss function with respect to o_i(t):
   L(t) = − Σ_k y_k(t) log(ŷ_k(t))
   ∂L(t)/∂o_i(t) = − Σ_k y_k(t) ∂log(ŷ_k(t))/∂o_i(t)
                 = − Σ_k y_k(t) (∂log(ŷ_k(t))/∂ŷ_k(t)) × (∂ŷ_k(t)/∂o_i(t))
                 = − Σ_k y_k(t) (1/ŷ_k(t)) × ∂ŷ_k(t)/∂o_i(t)
Hamid Beigy | Sharif university of technology | November 10, 2019 31 / 96

slide-35
SLIDE 35

Deep learning | Training recurrent neural networks

Backpropagation through time

1. We compute the gradient of the loss function with respect to o_i(t):
   ∂L(t)/∂o_i(t) = −y_i(t)(1 − ŷ_i(t)) − Σ_{k≠i} y_k(t) (1/ŷ_k(t)) (−ŷ_k(t) ŷ_i(t))
                 = −y_i(t)(1 − ŷ_i(t)) + Σ_{k≠i} y_k(t) ŷ_i(t)
                 = −y_i(t) + y_i(t) ŷ_i(t) + Σ_{k≠i} y_k(t) ŷ_i(t)
                 = ŷ_i(t) ( y_i(t) + Σ_{k≠i} y_k(t) ) − y_i(t)
Hamid Beigy | Sharif university of technology | November 10, 2019 32 / 96

slide-36
SLIDE 36

Deep learning | Training recurrent neural networks

Backpropagation through time

1. We had
   ∂L(t)/∂o_i(t) = ŷ_i(t) ( y_i(t) + Σ_{k≠i} y_k(t) ) − y_i(t)
2. Since y(t) is a one-hot encoded vector for the labels, Σ_k y_k(t) = 1 and y_i(t) + Σ_{k≠i} y_k(t) = 1. So we have
   ∂L(t)/∂o_i(t) = ŷ_i(t) − y_i(t)
3. This is a very simple and elegant expression.
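A quick numerical check of this expression, with made-up numbers: the analytic gradient ŷ − y matches a finite-difference gradient of the cross-entropy loss with respect to o.

```python
# Verify that dL/do_i = yhat_i - y_i for softmax + cross-entropy.
import numpy as np

def softmax(o):
    e = np.exp(o - o.max())
    return e / e.sum()

o = np.array([0.5, -1.2, 2.0])            # toy pre-softmax outputs
y = np.array([0.0, 0.0, 1.0])             # one-hot label
loss = lambda o: -np.sum(y * np.log(softmax(o)))

analytic = softmax(o) - y                 # the expression derived above
eps = 1e-6
numeric = np.array([(loss(o + eps * np.eye(3)[i]) - loss(o - eps * np.eye(3)[i])) / (2 * eps)
                    for i in range(3)])
print(np.allclose(analytic, numeric, atol=1e-6))   # True
```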

Hamid Beigy | Sharif university of technology | November 10, 2019 33 / 96

slide-37
SLIDE 37

Deep learning | Training recurrent neural networks

Backpropagation through time

1. At the final time step T, h(T) has only o(T) as a descendant, so
   ∇_{h(T)} L = V⊤ ∇_{o(T)} L
2. We can then iterate backward in time to back-propagate gradients through time, from t = T − 1 down to t = 1:
   ∇_{h(t)} L = (∂h(t+1)/∂h(t))⊤ (∇_{h(t+1)} L) + (∂o(t)/∂h(t))⊤ (∇_{o(t)} L)
              = W⊤ diag(1 − (h(t+1))²) (∇_{h(t+1)} L) + V⊤ (∇_{o(t)} L)

Hamid Beigy | Sharif university of technology | November 10, 2019 34 / 96

slide-38
SLIDE 38

Deep learning | Training recurrent neural networks

Backpropagation through time

1. The gradients with respect to the remaining parameters are given by
   ∇_c L = Σ_t (∂o(t)/∂c)⊤ ∇_{o(t)} L = Σ_t ∇_{o(t)} L
   ∇_b L = Σ_t (∂h(t)/∂b)⊤ ∇_{h(t)} L = Σ_t diag(1 − (h(t))²) ∇_{h(t)} L
   ∇_V L = Σ_t Σ_i (∂L/∂o_i(t)) ∇_V o_i(t) = Σ_t (∇_{o(t)} L) h(t)⊤
   ∇_W L = Σ_t Σ_i (∂L/∂h_i(t)) ∇_{W(t)} h_i(t) = Σ_t diag(1 − (h(t))²) (∇_{h(t)} L) h(t−1)⊤
   ∇_U L = Σ_t Σ_i (∂L/∂h_i(t)) ∇_{U(t)} h_i(t) = Σ_t diag(1 − (h(t))²) (∇_{h(t)} L) x(t)⊤

Hamid Beigy | Sharif university of technology | November 10, 2019 35 / 96

slide-39
SLIDE 39

Deep learning | Training recurrent neural networks

Backpropagation through time

1. Finally, bearing in mind that the weights to and from each unit in the hidden layer are the same at every time step, we sum over the whole sequence to get the derivatives with respect to each of the network weights:
   ΔU = −η Σ_{t=1}^{T} ∇_U L(t)
   Δb = −η Σ_{t=1}^{T} ∇_b L(t)
   ΔW = −η Σ_{t=1}^{T} ∇_W L(t)
   ΔV = −η Σ_{t=1}^{T} ∇_V L(t)
   Δc = −η Σ_{t=1}^{T} ∇_c L(t)

Hamid Beigy | Sharif university of technology | November 10, 2019 36 / 96

slide-40
SLIDE 40

Deep learning | Training recurrent neural networks

Truncated Backpropagation through time

1. Run forward and backward through chunks of the sequence instead of the whole sequence¹².

[Figure: truncated backpropagation through time; hidden states are carried forward in time forever, but backpropagation is run only for some smaller number of steps.]
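A minimal sketch of truncated BPTT, written here with PyTorch as an assumed framework choice (the model, sizes and chunk length are illustrative): the hidden state is carried forward across chunks, but it is detached so that gradients flow back only within each chunk.

```python
# Truncated backpropagation through time over chunks of a long sequence.
import torch
import torch.nn as nn

T, chunk, D, H = 100, 20, 8, 16
rnn = nn.RNN(input_size=D, hidden_size=H, batch_first=True)
readout = nn.Linear(H, 1)
opt = torch.optim.SGD(list(rnn.parameters()) + list(readout.parameters()), lr=0.01)

x = torch.randn(1, T, D)            # one long toy sequence
target = torch.randn(1, T, 1)
h = torch.zeros(1, 1, H)            # initial hidden state

for start in range(0, T, chunk):
    xs = x[:, start:start + chunk]
    ys = target[:, start:start + chunk]
    out, h = rnn(xs, h)             # forward through one chunk only
    loss = ((readout(out) - ys) ** 2).mean()
    opt.zero_grad()
    loss.backward()                 # gradients flow back at most `chunk` steps
    opt.step()
    h = h.detach()                  # carry the state forward, but cut the graph
```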

12From Fei-Fei Li et al. slides Hamid Beigy | Sharif university of technology | November 10, 2019 37 / 96

slide-41
SLIDE 41

Deep learning | Training recurrent neural networks

Truncated Backpropagation through time

1. Run forward and backward through chunks of the sequence instead of the whole sequence¹³.

[Figure: truncated backpropagation through time, continued over the next chunks of the sequence; each chunk's loss is backpropagated only within that chunk.]

13From Fei-Fei Li et al. slides Hamid Beigy | Sharif university of technology | November 10, 2019 38 / 96

slide-42
SLIDE 42

Deep learning | Design patterns of RNN

Table of contents

1. Introduction
2. Recurrent neural networks
3. Training recurrent neural networks
4. Design patterns of RNN
5. Long-term dependencies
6. Attention models
7. Reading

Hamid Beigy | Sharif university of technology | November 10, 2019 38 / 96

slide-43
SLIDE 43

Deep learning | Design patterns of RNN

Design patterns of RNN (summarization)

1. Producing a single output, with recurrent connections between hidden units.
2. This is useful for summarizing a sequence, such as in sentiment analysis.
Hamid Beigy | Sharif university of technology | November 10, 2019 39 / 96

slide-44
SLIDE 44

Deep learning | Design patterns of RNN

Design patterns of RNN (fixed vector as input)

1. Sometimes we are interested in taking only a single, fixed-size vector x as input, which generates the sequence y.
2. The most common way is to provide x as an extra input at each time step.
3. Other solutions? (Please consider them.)
4. Application: image caption generation.
Hamid Beigy | Sharif university of technology | November 10, 2019 40 / 96

slide-45
SLIDE 45

Deep learning | Design patterns of RNN

Bidirectional RNNs

1. We considered RNNs in the context of a sequence x(t), t = 1, . . . , T.
2. In many applications, y(t) may depend on the whole input sequence.
3. Bidirectional RNNs were introduced to address this need.

Hamid Beigy | Sharif university of technology | November 10, 2019 41 / 96

slide-46
SLIDE 46

Deep learning | Design patterns of RNN

Encoder-Decoder

1. How do we map input sequences to output sequences that are not necessarily of the same length?
2. The input to the RNN is called the context; we want to find a representation of the context, C.
3. C may be a vector or a sequence that summarizes x = {x(1), . . . , x(n_x)}.

Hamid Beigy | Sharif university of technology | November 10, 2019 42 / 96

slide-47
SLIDE 47

Deep learning | Design patterns of RNN

Encoder-Decoder

Hamid Beigy | Sharif university of technology | November 10, 2019 43 / 96

slide-48
SLIDE 48

Deep learning | Design patterns of RNN

Deep recurrent networks

1. The computations in RNNs can be decomposed into three blocks of parameters and transformations:
   • input to hidden state,
   • previous hidden state to the next hidden state,
   • hidden state to the output.
2. Each of these transformations is a learned affine transformation followed by a fixed nonlinearity.
3. Introducing depth in each of these operations is advantageous.
4. The intuition for why depth should be more useful is quite similar to that in deep feed-forward networks.
5. Optimization becomes much harder, but this can be mitigated by tricks such as introducing skip connections.

Hamid Beigy | Sharif university of technology | November 10, 2019 44 / 96

slide-49
SLIDE 49

Deep learning | Design patterns of RNN

Deep recurrent networks

Hamid Beigy | Sharif university of technology | November 10, 2019 45 / 96

slide-50
SLIDE 50

Deep learning | Long-term dependencies

Table of contents

1. Introduction
2. Recurrent neural networks
3. Training recurrent neural networks
4. Design patterns of RNN
5. Long-term dependencies
6. Attention models
7. Reading

Hamid Beigy | Sharif university of technology | November 10, 2019 45 / 96

slide-51
SLIDE 51

Deep learning | Long-term dependencies

Long-term dependencies

1. RNNs involve the composition of a function multiple times, once per time step.
2. This function composition resembles matrix multiplication.
3. Consider the recurrence relationship h(t+1) = W⊤ h(t).
4. This is a very simple RNN, without a nonlinear activation and without x.
5. This recurrence can be written as h(t) = (Wᵗ)⊤ h(0).
6. Suppose W has an eigendecomposition of the form W = QΛQ⊤ with orthogonal Q.
7. The recurrence becomes h(t) = (Wᵗ)⊤ h(0) = Q⊤ Λᵗ Q h(0), where
   • Q is the matrix composed of the eigenvectors of W,
   • Λ is a diagonal matrix with the eigenvalues on its diagonal.

Hamid Beigy | Sharif university of technology | November 10, 2019 46 / 96

slide-52
SLIDE 52

Deep learning | Long-term dependencies

Long-term dependencies

1. The recurrence becomes h(t) = (Wᵗ)⊤ h(0) = Q⊤ Λᵗ Q h(0), where
   • Q is the matrix composed of the eigenvectors of W,
   • Λ is a diagonal matrix with the eigenvalues on its diagonal.
2. The eigenvalues are raised to the power t: they quickly decay to zero or explode. Consider:
   • vanishing gradients,
   • exploding gradients.
3. Problem: gradients propagated over many stages tend to vanish (most of the time) or explode (relatively rarely).

Hamid Beigy | Sharif university of technology | November 10, 2019 47 / 96

slide-53
SLIDE 53

Deep learning | Long-term dependencies

Why do gradients vanish or explode?

1. The expression for h(t) was h(t) = tanh(W h(t−1) + U x(t)).
2. The partial derivative of the loss with respect to the hidden states equals
   ∂L/∂h(t) = ∂L/∂h(T) · ∂h(T)/∂h(t) = ∂L/∂h(T) ∏_{k=t}^{T−1} ∂h(k+1)/∂h(k) = ∂L/∂h(T) ∏_{k=t}^{T−1} D_{k+1} W⊤
   where D_{k+1} = diag( 1 − tanh²(W h(k) + U x(k+1)) ).

Hamid Beigy | Sharif university of technology | November 10, 2019 48 / 96

slide-54
SLIDE 54

Deep learning | Long-term dependencies

Why do gradients vanish or explode?

1. For any matrices A, B, we have ‖AB‖ ≤ ‖A‖ ‖B‖, so
   ‖∂L/∂h(t)‖ = ‖ ∂L/∂h(T) ∏_{k=t}^{T−1} D_{k+1} W⊤ ‖ ≤ ‖∂L/∂h(T)‖ ∏_{k=t}^{T−1} ‖D_{k+1} W⊤‖
2. Since ‖A‖ equals the largest singular value of A, σ_max(A), we have
   ‖∂L/∂h(t)‖ ≤ ‖∂L/∂h(T)‖ ∏_{k=t}^{T−1} σ_max(D_{k+1}) σ_max(W)
3. Hence the gradient norm can shrink to zero or grow exponentially fast, depending on σ_max.

Hamid Beigy | Sharif university of technology | November 10, 2019 49 / 96

slide-55
SLIDE 55

Deep learning | Long-term dependencies

Echo state networks

1. Set the recurrent weights such that they do a good job of capturing past history, and learn only the output weights.
2. Methods: Echo State Machines and Liquid State Machines.
3. The general methodology is called reservoir computing.
4. How to choose the recurrent weights?
5. In Echo State Machines, choose recurrent weights such that the hidden-to-hidden transition Jacobian has eigenvalues close to 1.

Hamid Beigy | Sharif university of technology | November 10, 2019 50 / 96

slide-56
SLIDE 56

Deep learning | Long-term dependencies

Other solutions

1. Adding skip connections through time: adding direct connections from variables in the distant past to variables in the present.
2. Leaky units: having units with self-connections.
3. Removing connections: removing length-one connections and replacing them with longer connections.

Hamid Beigy | Sharif university of technology | November 10, 2019 51 / 96

slide-57
SLIDE 57

Deep learning | Long-term dependencies

Gated units

1. RNNs can accumulate information, but it might be useful to forget.
2. Create paths through time where derivatives can flow.
3. Learn when to forget!
4. Gates allow learning how to read, write and forget.
5. We consider two gated units:
   • Long Short-Term Memory (LSTM)
   • Gated Recurrent Unit (GRU)

Hamid Beigy | Sharif university of technology | November 10, 2019 52 / 96

slide-58
SLIDE 58

Deep learning | Long-term dependencies

Long short term memory

1. LSTMs are explicitly designed to avoid the long-term dependency problem.
2. All recurrent neural networks have the form of a chain of repeating modules of neural network.
3. In standard RNNs, this repeating module has a very simple structure, such as a single tanh layer.

Hamid Beigy | Sharif university of technology | November 10, 2019 53 / 96

slide-59
SLIDE 59

Deep learning | Long-term dependencies

Long short term memory

1. LSTMs also have this chain-like structure, but the repeating module has a different structure.
2. Instead of having a single neural network layer, there are four, interacting in a very special way.

Hamid Beigy | Sharif university of technology | November 10, 2019 54 / 96

slide-60
SLIDE 60

Deep learning | Long-term dependencies

Long short term memory

1. Let us define the following notation.
2. In the figure, each line carries an entire vector, from the output of one node to the inputs of others.
   • The pink circles represent point-wise operations, like vector addition.
   • The yellow boxes are learned neural network layers.
   • Lines merging denote concatenation.
   • A line forking denotes its content being copied, with the copies going to different locations.

Hamid Beigy | Sharif university of technology | November 10, 2019 55 / 96

slide-61
SLIDE 61

Deep learning | Long-term dependencies

Long short term memory

1. The key to LSTMs is the cell state, the horizontal line running through the top of the diagram.
2. The cell state is kind of like a conveyor belt.
3. It runs straight down the entire chain, with only some minor linear interactions.
4. It is very easy for information to just flow along it unchanged.
5. The LSTM does have the ability to remove or add information to the cell state, carefully regulated by structures called gates.

Hamid Beigy | Sharif university of technology | November 10, 2019 56 / 96

slide-62
SLIDE 62

Deep learning | Long-term dependencies

Long short term memory

1. Gates are a way to optionally let information through.
2. They are composed of a sigmoid neural net layer and a pointwise multiplication operation.
3. The sigmoid layer outputs numbers between zero and one, describing how much of each component should be let through. A value of zero means "let nothing through", while a value of one means "let everything through".
4. An LSTM has three of these gates, to protect and control the cell state.

Hamid Beigy | Sharif university of technology | November 10, 2019 57 / 96

slide-63
SLIDE 63

Deep learning | Long-term dependencies

Long short term memory

1. The first decision is what information to throw away from the cell state; it is made by a sigmoid layer called the forget gate layer.
2. It looks at h_{t−1} and x_t, and outputs a number between 0 and 1 for each entry of the cell state C_{t−1}.

Hamid Beigy | Sharif university of technology | November 10, 2019 58 / 96

slide-64
SLIDE 64

Deep learning | Long-term dependencies

Long short term memory

1. The next step is to decide what new information we are going to store in the cell state.
2. This has the following two parts that could be added to the state:
   • A sigmoid layer, called the input gate layer, decides which values we will update.
   • A tanh layer creates a vector of new candidate values, C̃_t.
3. Then we combine these two to create an update to the state.

Hamid Beigy | Sharif university of technology | November 10, 2019 59 / 96

slide-65
SLIDE 65

Deep learning | Long-term dependencies

Long short term memory

1. Now we must update the old cell state C_{t−1} into the new cell state C_t.
2. We multiply the old state by f_t, forgetting the things we decided to forget earlier.
3. Then we add i_t × C̃_t. These are the new candidate values, scaled by how much we decided to update each state value.

Hamid Beigy | Sharif university of technology | November 10, 2019 60 / 96

slide-66
SLIDE 66

Deep learning | Long-term dependencies

Long short term memory

1 Finally, we need to decide what were going to output. 2 This output will be based on our cell state, but will be a filtered

version.

First, we run a sigmoid layer which decides what parts of the cell state were going to output. Then, we put the cell state through tanh and multiply it by the output

  • f the sigmoid gate.

3 So that we only output the parts we decided to.
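Putting these steps together, here is a sketch of a single LSTM step in NumPy; the parameter names (W_f, W_i, W_C, W_o and the biases) follow the common write-up of these equations and the toy sizes are illustrative, not from the slides.

```python
# One LSTM step: forget gate, input gate, candidate values, cell update, output.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W_f, W_i, W_C, W_o, b_f, b_i, b_C, b_o):
    z = np.concatenate([h_prev, x_t])          # [h_{t-1}; x_t]
    f_t = sigmoid(W_f @ z + b_f)               # what to forget from C_{t-1}
    i_t = sigmoid(W_i @ z + b_i)               # which values to update
    C_tilde = np.tanh(W_C @ z + b_C)           # new candidate values
    C_t = f_t * C_prev + i_t * C_tilde         # new cell state
    o_t = sigmoid(W_o @ z + b_o)               # which parts of the cell to output
    h_t = o_t * np.tanh(C_t)                   # filtered output
    return h_t, C_t

# toy sizes and random (untrained) parameters
D, H = 3, 4
rng = np.random.default_rng(3)
params = [rng.normal(size=(H, H + D)) for _ in range(4)] + [np.zeros(H)] * 4
h, C = np.zeros(H), np.zeros(H)
h, C = lstm_step(rng.normal(size=D), h, C, *params)
print(h.shape, C.shape)
```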

Hamid Beigy | Sharif university of technology | November 10, 2019 61 / 96

slide-67
SLIDE 67

Deep learning | Long-term dependencies

Long short term memory

Hamid Beigy | Sharif university of technology | November 10, 2019 62 / 96

slide-68
SLIDE 68

Deep learning | Long-term dependencies

Variants on Long short term memory

1. One popular LSTM variant adds peephole connections. This means that we let the gate layers look at the cell state¹⁴.

14F. A. Gers and J. Schmidhuber, "Recurrent nets that time and count", Proceedings of the IEEE International Joint Conference on Neural Networks, 2000.

Hamid Beigy | Sharif university of technology | November 10, 2019 63 / 96

slide-69
SLIDE 69

Deep learning | Long-term dependencies

Variants on Long short term memory

1. Another variation is to use coupled forget and input gates.
2. Instead of separately deciding what to forget and where to add new information, we make those decisions together.
3. We only forget when we are going to input something in its place. We only input new values to the state when we forget something older.

Hamid Beigy | Sharif university of technology | November 10, 2019 64 / 96

slide-70
SLIDE 70

Deep learning | Long-term dependencies

Variants on Long short term memory (GRU)

1. The GRU combines the forget and input gates into a single update gate¹⁵.
2. It merges the cell and hidden states, and makes some other changes.
3. The resulting model is simpler than standard LSTM models, and has been growing increasingly popular.
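A sketch of one GRU step, following the standard formulation of Cho et al. (2014) cited below; the parameter names and toy sizes are illustrative.

```python
# One GRU step: update gate z_t, reset gate r_t, and a merged hidden state.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, Wz, Uz, Wr, Ur, Wh, Uh):
    z_t = sigmoid(Wz @ x_t + Uz @ h_prev)               # update gate (merged forget + input)
    r_t = sigmoid(Wr @ x_t + Ur @ h_prev)               # reset gate
    h_tilde = np.tanh(Wh @ x_t + Uh @ (r_t * h_prev))   # candidate state
    return (1.0 - z_t) * h_prev + z_t * h_tilde         # interpolate old and new state

D, H = 3, 4
rng = np.random.default_rng(4)
Wz, Wr, Wh = (rng.normal(size=(H, D)) for _ in range(3))
Uz, Ur, Uh = (rng.normal(size=(H, H)) for _ in range(3))
h = np.zeros(H)
h = gru_step(rng.normal(size=D), h, Wz, Uz, Wr, Ur, Wh, Uh)
print(h)
```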

15Kyunghyun Cho and et al. ”Learning Phrase Representations using RNN

Encoder-Decoder for Statistical Machine Translation”, Proc. of Conference on Empirical Methods in Natural Language Processing, pages 1724-1734, 2014.

Hamid Beigy | Sharif university of technology | November 10, 2019 65 / 96

slide-71
SLIDE 71

Deep learning | Long-term dependencies

Gated recurrent units (GRU)

Hamid Beigy | Sharif university of technology | November 10, 2019 66 / 96

slide-72
SLIDE 72

Deep learning | Long-term dependencies

Variants on Long short term memory

1. There are also other LSTM variants, such as:
   • Jan Koutnik et al., "A Clockwork RNN", Proceedings of the 31st International Conference on Machine Learning, 2014.
   • Kaisheng Yao et al., "Depth-Gated Recurrent Neural Networks", https://arxiv.org/pdf/1508.03790v2.pdf.

Hamid Beigy | Sharif university of technology | November 10, 2019 67 / 96

slide-73
SLIDE 73

Deep learning | Long-term dependencies

Optimization for long-term dependencies

1. Two simple solutions for gradient vanishing / exploding:
   • For vanishing gradients: initialization + ReLUs.
   • Trick for exploding gradients: the clipping trick (see the sketch after this list): when ‖g‖ > v, set g ← v g / ‖g‖.
2. Vanishing gradients happen when the optimization gets stuck in a saddle point and the gradient becomes too small for the optimization to progress. This can also be fixed by using gradient descent with momentum or other methods.
3. Exploding gradients happen when the gradient becomes too big and you get numerical overflow. This can be easily fixed by initializing the network's weights to smaller values.
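The clipping trick from item 1 as a short sketch; the threshold v = 5 is an arbitrary illustrative value.

```python
# Rescale the gradient to norm v whenever its norm exceeds v.
import numpy as np

def clip_gradient(g, v=5.0):
    norm = np.linalg.norm(g)
    if norm > v:
        g = v * g / norm          # g <- v g / ||g||
    return g

g = np.array([30.0, -40.0])       # an exploding toy gradient, ||g|| = 50
print(clip_gradient(g))           # rescaled to norm 5: [ 3. -4.]
```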

Hamid Beigy | Sharif university of technology | November 10, 2019 68 / 96

slide-74
SLIDE 74

Deep learning | Long-term dependencies

Optimization for long-term dependencies

1. Gradient clipping helps to deal with exploding gradients, but it does not help with vanishing gradients.
2. Another way is to encourage creating paths in the unfolded recurrent architecture along which the product of gradients is near 1.
3. Solution: regularize to encourage information flow.
4. We want (∇_{h(t)} L) ∂h(t)/∂h(t−1) to be as large as ∇_{h(t)} L.
5. One regularizer is
   Σ_t ( ‖(∇_{h(t)} L) ∂h(t)/∂h(t−1)‖ / ‖∇_{h(t)} L‖ − 1 )²
6. Computing this regularizer is difficult, so an approximation of it is used.

Hamid Beigy | Sharif university of technology | November 10, 2019 69 / 96

slide-75
SLIDE 75

Deep learning | Attention models

Table of contents

1. Introduction
2. Recurrent neural networks
3. Training recurrent neural networks
4. Design patterns of RNN
5. Long-term dependencies
6. Attention models
7. Reading

Hamid Beigy | Sharif university of technology | November 10, 2019 69 / 96

slide-76
SLIDE 76

Deep learning | Attention models

Sequence to sequence modeling

1. Sequence-to-sequence modeling transforms an input sequence (source) to a new one (target), and both sequences can be of arbitrary lengths.
2. Examples of transformation tasks include:
   • machine translation between multiple languages, in either text or audio,
   • question-answer dialog generation,
   • parsing sentences into grammar trees.
3. The sequence-to-sequence model normally has an encoder-decoder architecture, composed of:
   • An encoder, which processes the input sequence and compresses the information into a context vector of fixed length. This representation is expected to be a good summary of the meaning of the whole source sequence.
   • A decoder, which is initialized with the context vector to emit the transformed output. The early work only used the last state of the encoder network as the decoder's initial state.

Hamid Beigy | Sharif university of technology | November 10, 2019 70 / 96

slide-77
SLIDE 77

Deep learning | Attention models

Attention models

1. Both the encoder and decoder are recurrent neural networks, e.g. using LSTM or GRU units.
2. A critical disadvantage of this fixed-length context vector design is the incapability of remembering long sentences.

Hamid Beigy | Sharif university of technology | November 10, 2019 71 / 96

slide-78
SLIDE 78

Deep learning | Attention models

Attention models

1. The attention mechanism was born to help memorize long source sentences in neural machine translation (NMT)¹⁶.
2. Instead of building a single context vector out of the encoder's last hidden state, the goal of attention is to create shortcuts between the context vector and the entire source input.
3. The weights of these shortcut connections are customizable for each output element.
4. The alignment between the source and target is learned and controlled by the context vector.
5. Essentially, the context vector consumes three pieces of information:
   • encoder hidden states,
   • decoder hidden states,
   • alignment between source and target.

16Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. ”Neural machine

translation by jointly learning to align and translate.” ICLR 2015.

Hamid Beigy | Sharif university of technology | November 10, 2019 72 / 96

slide-79
SLIDE 79

Deep learning | Attention models

Attention models

1. Assume that we have a source sequence x of length n and try to output a target sequence y of length m:
   x = [x_1, x_2, . . . , x_n]
   y = [y_1, y_2, . . . , y_m]
2. The encoder is a bidirectional RNN with a forward hidden state →h_i and a backward one ←h_i.
3. A simple concatenation of the two represents the encoder state.
4. The motivation is to include both the preceding and following words in the annotation of one word:
   h_i = [→h_i⊤ ; ←h_i⊤]⊤,   i = 1, 2, . . . , n

Hamid Beigy | Sharif university of technology | November 10, 2019 73 / 96

slide-80
SLIDE 80

Deep learning | Attention models

Attention models

1 Model of attention

Hamid Beigy | Sharif university of technology | November 10, 2019 74 / 96

slide-81
SLIDE 81

Deep learning | Attention models

Attention models

1. The decoder network has hidden state s_t = f(s_{t−1}, y_{t−1}, c_t) at position t = 1, 2, . . . , m.
2. The context vector c_t is a sum of the hidden states of the input sequence, weighted by alignment scores:
   c_t = Σ_{i=1}^{n} α_{t,i} h_i                      (context vector for output y_t)
   α_{t,i} = align(y_t, x_i)                          (how well the two words y_t and x_i are aligned)
           = exp(score(s_{t−1}, h_i)) / Σ_{j=1}^{n} exp(score(s_{t−1}, h_j))   (softmax of a predefined alignment score)
3. The alignment model assigns a score α_{t,i} to the pair (y_t, x_i) based on how well they match.
4. The set {α_{t,i}} are weights defining how much of each source hidden state should be considered for each output.
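A sketch of this attention computation in NumPy, using the additive score introduced on the next slide; the matrices W_a, v_a and the toy encoder/decoder states are illustrative.

```python
# Context vector c_t as a softmax-weighted sum of encoder states.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention_context(s_prev, H_enc, W_a, v_a):
    # score(s_{t-1}, h_i) = v_a^T tanh(W_a [s_{t-1}; h_i])
    scores = np.array([v_a @ np.tanh(W_a @ np.concatenate([s_prev, h_i]))
                       for h_i in H_enc])
    alpha = softmax(scores)              # alignment weights alpha_{t,i}
    c_t = alpha @ H_enc                  # weighted sum of encoder states
    return c_t, alpha

n, d_enc, d_dec, d_att = 5, 4, 4, 8
rng = np.random.default_rng(5)
H_enc = rng.normal(size=(n, d_enc))      # encoder hidden states h_1..h_n
s_prev = rng.normal(size=d_dec)          # previous decoder state s_{t-1}
W_a = rng.normal(size=(d_att, d_dec + d_enc))
v_a = rng.normal(size=d_att)
c_t, alpha = attention_context(s_prev, H_enc, W_a, v_a)
print(alpha.round(2), c_t.shape)
```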

Hamid Beigy | Sharif university of technology | November 10, 2019 75 / 96

slide-82
SLIDE 82

Deep learning | Attention models

Attention models

1. The alignment score α is parametrized by a feed-forward network with a single hidden layer¹⁷.
2. This network is jointly trained with the other parts of the model.
3. The score function is therefore of the following form, given that tanh is used as the activation function:
   score(s_t, h_i) = v_a⊤ tanh(W_a [s_t; h_i])
   where both v_a and W_a are weight matrices to be learned in the alignment model.

17Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. ”Neural machine

translation by jointly learning to align and translate.” ICLR 2015.

Hamid Beigy | Sharif university of technology | November 10, 2019 76 / 96

slide-83
SLIDE 83

Deep learning | Attention models

Alignment scores

1. The matrix of alignment scores explicitly shows the correlation between source and target words.

Hamid Beigy | Sharif university of technology | November 10, 2019 77 / 96

slide-84
SLIDE 84

Deep learning | Attention models

Alignment scores

1. The matrix of alignment scores explicitly shows the correlation between source and target words.

Hamid Beigy | Sharif university of technology | November 10, 2019 78 / 96

slide-85
SLIDE 85

Deep learning | Attention models

Alignment scores

(https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html)

Name | Alignment score function | Paper
• Content-base attention | score(s_t, h_i) = cosine[s_t, h_i] | A. Graves et al., "Neural Turing Machines", arXiv, 2014.
• Additive | score(s_t, h_i) = v_a⊤ tanh(W_a [s_t; h_i]) | D. Bahdanau et al., "Neural machine translation by jointly learning to align and translate", ICLR 2015.
• Location-based | α_{t,i} = softmax(W_a s_t) | T. Luong et al., "Effective Approaches to Attention-based Neural Machine Translation", EMNLP 2015.
• General | score(s_t, h_i) = s_t⊤ W_a h_i | Same as the above.
• Dot-product | score(s_t, h_i) = s_t⊤ h_i | Same as the above.
• Scaled dot-product | score(s_t, h_i) = s_t⊤ h_i / √n | A. Vaswani et al., "Attention is all you need", NIPS 2017.

Hamid Beigy | Sharif university of technology | November 10, 2019 79 / 96

slide-86
SLIDE 86

Deep learning | Attention models

Self-Attention

1. Self-attention¹⁸ (intra-attention) is an attention mechanism relating different positions of a single sequence in order to compute a representation of the same sequence.
2. It is very useful in:
   • machine reading (the automatic, unsupervised understanding of text),
   • abstractive summarization,
   • image description generation.

18Jianpeng Cheng, Li Dong, and Mirella Lapata. ”Long short-term memory-networks

for machine reading”. EMNLP 2016.

Hamid Beigy | Sharif university of technology | November 10, 2019 80 / 96

slide-87
SLIDE 87

Deep learning | Attention models

Self Attention

1. The self-attention mechanism enables us to learn the correlation between the current word and the previous part of the sentence.
2. The current word is in red, and the size of the blue shade indicates the activation level.

Hamid Beigy | Sharif university of technology | November 10, 2019 81 / 96

slide-88
SLIDE 88

Deep learning | Attention models

Self Attention

1. Self-attention is applied to the image to generate descriptions¹⁹.
2. The image is encoded by a CNN, and an RNN with self-attention consumes the CNN feature maps to generate the descriptive words one by one.
3. The visualization of the attention weights clearly demonstrates which regions of the image the model pays attention to when outputting a certain word.

19Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. "Show, attend and tell: Neural image caption generation with visual attention", ICML, 2015.

Hamid Beigy | Sharif university of technology | November 10, 2019 82 / 96

slide-89
SLIDE 89

Deep learning | Attention models

Self Attention

Hamid Beigy | Sharif university of technology | November 10, 2019 83 / 96

slide-90
SLIDE 90

Deep learning | Attention models

Soft vs Hard Attention

1. Soft vs. hard attention is another way to categorize how attention is defined, based on whether the attention has access to the entire image or only a patch.
   • Soft attention: the alignment weights are learned and placed "softly" over all patches in the source image (the same idea as in Bahdanau et al., 2015).
     Pro: the model is smooth and differentiable. Con: expensive when the source input is large.
   • Hard attention: only selects one patch of the image to attend to at a time.
     Pro: less calculation at inference time. Con: the model is non-differentiable and requires more complicated techniques such as variance reduction or reinforcement learning to train²⁰.

20Thang Luong, Hieu Pham, Christopher D. Manning. Effective Approaches to

Attention-based Neural Machine Translation. EMNLP 2015.

Hamid Beigy | Sharif university of technology | November 10, 2019 84 / 96

slide-91
SLIDE 91

Deep learning | Attention models

Global vs Local Attention

1. Global and local attention were proposed by Luong et al.²¹
2. The idea of a global attentional model is to consider all the hidden states of the encoder when deriving the context vector.

[Figure: global attention model; the attention layer computes global alignment weights a_t over the encoder states h̄_s, forming a context vector c_t that is combined with h_t to produce h̃_t and the output y_t.]

21Thang Luong, Hieu Pham, Christopher D. Manning. Effective Approaches to

Attention-based Neural Machine Translation. EMNLP 2015.

Hamid Beigy | Sharif university of technology | November 10, 2019 85 / 96

slide-92
SLIDE 92

Deep learning | Attention models

Global vs Local Attention

1. Global attention has the drawback that it has to attend to all words on the source side for each target word, which is expensive and can potentially render it impractical for translating longer sequences.
2. The local attentional mechanism chooses to focus only on a small subset of the source positions per target word.
3. Local attention is an interesting blend between hard and soft attention, an improvement over hard attention that makes it differentiable.
4. The model first predicts a single aligned position for the current target word, and a window centered around that source position is then used to compute the context vector:
   p_t = n × sigmoid( v_p⊤ tanh(W_p h_t) )
   where n is the length of the source sequence; hence p_t ∈ [0, n].

Hamid Beigy | Sharif university of technology | November 10, 2019 86 / 96

slide-93
SLIDE 93

Deep learning | Attention models

Global vs Local Attention

1. To favor alignment points near p_t, they placed a Gaussian distribution centered around p_t. Specifically, the alignment weights are defined as
   a_t(s) = align(h_t, h̄_s) · exp( −(s − p_t)² / (2σ²) )
   with p_t = n × sigmoid( v_p⊤ tanh(W_p h_t) ).
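A sketch of this local attention step, assuming a bilinear "general" score as the base alignment (the slide leaves align(h_t, h̄_s) unspecified); the window width σ and all parameters are illustrative.

```python
# Predict the aligned position p_t, then Gaussian-weight the alignment scores.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

n, d = 10, 4                                   # source length and state size
rng = np.random.default_rng(6)
H_src = rng.normal(size=(n, d))                # encoder states h_bar_s
h_t = rng.normal(size=d)                       # current decoder state
W_p = rng.normal(size=(d, d))
v_p = rng.normal(size=d)
W_a = rng.normal(size=(d, d))

p_t = n * sigmoid(v_p @ np.tanh(W_p @ h_t))    # predicted aligned position in [0, n]
sigma = 2.0                                    # e.g. half the window width
align = softmax(H_src @ (W_a @ h_t))           # assumed bilinear base alignment
s = np.arange(n)
a_t = align * np.exp(-((s - p_t) ** 2) / (2 * sigma ** 2))   # Gaussian-weighted weights
c_t = a_t @ H_src                              # local context vector
print(p_t.round(2), c_t.shape)
```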

Hamid Beigy | Sharif university of technology | November 10, 2019 87 / 96

slide-94
SLIDE 94

Deep learning | Attention models

Global vs Local Attention

[Figure: local attention model; the model first predicts an aligned position p_t, computes local alignment weights a_t over the encoder states h̄_s within a window, and forms a context vector c_t that is combined with h_t to produce h̃_t and the output y_t.]

Hamid Beigy | Sharif university of technology | November 10, 2019 88 / 96

slide-95
SLIDE 95

Deep learning | Attention models

Transformer model

1. The transformer improves on soft attention and makes it possible to do sequence-to-sequence modeling without recurrent network units²².
2. The transformer model is entirely built on self-attention mechanisms, without using a sequence-aligned recurrent architecture.

22Ashish Vaswani, et al. Attention is all you need. NIPS 2017. Hamid Beigy | Sharif university of technology | November 10, 2019 89 / 96

slide-96
SLIDE 96

Deep learning | Attention models

Transformer encoder-decoder

1. The encoding component is a stack of six encoders.
2. The decoding component is a stack of the same number of decoders.

Hamid Beigy | Sharif university of technology | November 10, 2019 90 / 96

slide-97
SLIDE 97

Deep learning | Attention models

Transformer encoder-decoder

1. Each encoder has two sub-layers.
2. Each decoder has three sub-layers.

Hamid Beigy | Sharif university of technology | November 10, 2019 91 / 96

slide-98
SLIDE 98

Deep learning | Attention models

Transformer encoder

1. All encoders receive a list of vectors, each of size 512.
2. The size of this list is a hyperparameter we can set; basically, it would be the length of the longest sentence in our training dataset.

Hamid Beigy | Sharif university of technology | November 10, 2019 92 / 96

slide-99
SLIDE 99

Deep learning | Attention models

Transformer encoder

1. Each sub-layer has a residual connection.

Hamid Beigy | Sharif university of technology | November 10, 2019 93 / 96

slide-100
SLIDE 100

Deep learning | Attention models

Transformer

1. A transformer with two stacked encoders and decoders.

Hamid Beigy | Sharif university of technology | November 10, 2019 94 / 96

slide-101
SLIDE 101

Deep learning | Attention models

Simple Neural Attention Meta-Learner (SNAIL)

1. SNAIL was developed partially to resolve the problem with positioning in the transformer model, by combining the self-attention mechanism of the transformer with convolutions²³.
2. It has been demonstrated to be good at both supervised learning and reinforcement learning tasks.

23Nikhil Mishra, Mostafa Rohaninejad, Xi Chen, and Pieter Abbeel. A simple neural

attentive meta-learner. NIPS Workshop on Meta-Learning. 2017.

Hamid Beigy | Sharif university of technology | November 10, 2019 95 / 96

slide-102
SLIDE 102

Deep learning | Reading

Table of contents

1. Introduction
2. Recurrent neural networks
3. Training recurrent neural networks
4. Design patterns of RNN
5. Long-term dependencies
6. Attention models
7. Reading

Hamid Beigy | Sharif university of technology | November 10, 2019 95 / 96

slide-103
SLIDE 103

Deep learning | Reading

Reading

Please read chapter 10 of Deep Learning Book and papers referenced in these slides.

Hamid Beigy | Sharif university of technology | November 10, 2019 96 / 96