Deep learning
Recurrent neural networks
Hamid Beigy
Sharif University of Technology
November 10, 2019
Table of contents
1 Introduction
2 Recurrent neural networks
3 Training recurrent neural networks
4 Design patterns of RNN
5 Long-term dependencies
6 Attention models
7 Reading

Deep learning | Introduction
1 In previous sessions, we considered deep learning models with the following structure:
Input layer: a (possibly vectorized) quantitative representation of the data.
Hidden layer(s): apply transformations with nonlinearities.
Output layer: produces the result for classification, regression, translation, segmentation, etc.
2 These models are used for supervised learning.
1 Sequence learning is the study of machine learning algorithms designed for sequential data.
2 Language modeling is one of the most interesting topics that uses sequence learning.
3 For example, consider the machine translation task:
We have a sentence in the source language. We must translate this sentence into the destination language.
4 How do we use feed-forward networks for solving this machine translation task?
5 Consider other solutions as well, such as autoregressive models, linear dynamical systems, and hidden Markov models.
1 Consider the stock market1.
2 We must consider the series of stock values over the past several days to decide whether to invest.
[Figure: stock values on days 7/03 through 15/03, with the question "To invest or not to invest?"]
3 Inputs are vectors.
4 The output may be a scalar (should I invest?) or a vector (in which stocks should I invest?).
1 From Bhiksha Raj's slides
1 We need a network that accepts the values from the previous days and decides.
[Figure: a network that slides over the input vectors X(t), X(t+1), . . . , X(t+7) and produces the output Y(t+4). Credit: Bhiksha Raj]
1 Although all problems in sequence learning can be converted into one another, it is useful to distinguish three categories of tasks2:
2 Sequence classification:
sentiment analysis, activity/action recognition, DNA sequence classification, action selection.
3 Sequence synthesis:
text synthesis, music synthesis, motion synthesis.
4 Sequence-to-sequence translation:
speech recognition, text translation, part-of-speech tagging.
2 From Francois Fleuret's slides
1 Processing sequences3. RNN architectures can be categorized by their input/output structure:
One to one: vanilla neural networks (fixed-size input to fixed-size output).
One to many: e.g., image captioning (image → sequence of words).
Many to one: e.g., sentiment classification (sequence of words → sentiment).
Many to many: e.g., machine translation (sequence of words → sequence of words).
Many to many, synchronized: e.g., video classification at the frame level.
3 From Fei-Fei Li et al.'s slides
1 We have non-sequence data, but we are processing it sequentially.
2 Is it possible?
3 Please read the following papers.
1 MLPs only accept an input of fixed dimensionality and map it to an output of fixed dimensionality.
2 In traditional MLPs, we assume that all inputs (and outputs) are independent of each other.
3 They need to re-learn the rules from scratch each time.
4 We need to reuse knowledge about previous events to help in processing later events.
Deep learning | Recurrent neural networks
1 A recurrent model maintains a recurrent state updated at each time step t: ht = Φ(xt, ht−1; w)4.
2 Consider an input sequence x = (x1, x2, . . . , xT).
3 A prediction can be computed at any time step from the recurrent state: yt = Ψ(ht).
4 From Francois Fleuret's slides
1 Recurrent neural networks are networks for handling sequential data.
2 Are the weights dependent on the time instant?
3 RNNs share parameters across different parts of the model.
4 Why do we share weights?
5 Without parameter sharing, it would not be possible to generalize to sequence lengths not seen during training, nor to share statistical strength across different positions in time.
[Figure: the unrolled recurrence. Starting from h0, each state ht is computed as Φ(xt, ht−1; w), and each output yt is computed as Ψ(ht)5.]
5 From Francois Fleuret's slides
1 Machine translation is similar to language modeling in that our input is a sequence of words in the source language.
2 We want to output a sequence of words in the target language.
3 A key difference is that the output only starts after we have seen the complete input.
Credit: Denny Britz
1 We often use the following shortcut to represent recurrent neural networks: instead of unrolling the network over time, it is drawn once with a self-loop on the hidden layer.
Credit: Bhiksha Raj
1 We consider the following simple binary sequence classification problem6:
Label 1: the sequence is the concatenation of two identical halves; label 0: otherwise.
6 From Francois Fleuret's slides
1 Usually, we want to predict a vector at some time steps7.
2 We can process a sequence of vectors x by applying a recurrence formula at every time step: the new state is computed from the old state and the current input by some function fw with parameters w:
ht = fw(xt, ht−1)
3 Assume that the activation function is tanh:
ht = tanh(Uxt + Wht−1)
yt = Vht
7 From Fei-Fei Li et al.'s slides
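A minimal NumPy sketch of this recurrence (the dimensions, weight initialization, and data are illustrative assumptions, not from the slides):

```python
import numpy as np

def rnn_step(x_t, h_prev, U, W, V):
    """One step of the vanilla RNN recurrence from the slide."""
    h_t = np.tanh(U @ x_t + W @ h_prev)  # h_t = tanh(U x_t + W h_{t-1})
    y_t = V @ h_t                        # y_t = V h_t
    return h_t, y_t

# Illustrative dimensions: 4-dimensional inputs, 8 hidden units, 3 outputs.
rng = np.random.default_rng(0)
U, W, V = rng.normal(0, 0.1, (8, 4)), rng.normal(0, 0.1, (8, 8)), rng.normal(0, 0.1, (3, 8))
h = np.zeros(8)
for x in rng.normal(size=(5, 4)):   # a sequence of length 5
    h, y = rnn_step(x, h, U, W, V)  # the same U, W, V are reused at every step
```

Note how the parameter sharing discussed above appears in the code: one triple (U, W, V) serves every time step, regardless of sequence length.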
1 The computational graph unrolls this recurrence over time8: starting from h0, each hidden state ht is computed from xt and ht−1 using the same weights, and an output yt is produced at every step up to hT.
8 From Fei-Fei Li et al.'s slides
1 Example: character-level language model9. Assume that the vocabulary is [h,e,l,o].
2 Example training sequence: "hello".
3 At the output layer, we use softmax.
9 From Fei-Fei Li et al.'s slides
1 Example: character-level language model sampling, with vocabulary [h,e,l,o]10.
2 At test time, we sample characters one at a time and feed each sampled character back into the model as the next input.
3 At each step, the softmax output gives a distribution over the four characters, from which the next character ("e", "l", "l", "o" in the example) is sampled.
10 From Fei-Fei Li et al.'s slides
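A sketch of this sampling loop in NumPy (untrained toy weights; the one-hot encoding and dimensions are illustrative assumptions):

```python
import numpy as np

vocab = ['h', 'e', 'l', 'o']
rng = np.random.default_rng(0)
U, W, V = (rng.normal(0, 0.1, s) for s in [(16, 4), (16, 16), (4, 16)])  # toy, untrained weights

def softmax(o):
    e = np.exp(o - o.max())
    return e / e.sum()

h = np.zeros(16)
idx = vocab.index('h')                 # seed character
for _ in range(10):
    x = np.eye(4)[idx]                 # one-hot encoding of the current character
    h = np.tanh(U @ x + W @ h)
    p = softmax(V @ h)                 # distribution over the next character
    idx = rng.choice(4, p=p)           # sample, then feed back as the next input
    print(vocab[idx], end='')
```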
Comparing the two setups of the character-level language model:
1 Under the first setup (training on the full sequence "hello"): this is more powerful because a ... The training will be harder. It ...
2 Under the second setup (test-time sampling fed back to the model): this is less powerful unless very ... The training will be easier. It ...
Deep learning | Training recurrent neural networks
1 We have a collection of labeled samples, where xi = (xi,0, xi,1, . . . , xi,T) is the input sequence and yi = (yi,0, yi,1, . . . , yi,T) is the output sequence.
2 The goal is to find the weights of the network that minimize the error between the network outputs and the target sequences.
3 In the forward phase, the input is given to the network and the output is computed.
4 In the backward phase, the gradient of the cost function with respect to the weights is computed, and the weights are updated.
1 The input is given to the network and the output is calculated.
2 Consider two hidden layers, denoted by h(1) and h(2).
3 The output of the first hidden layer equals
$h^{(1)}_j(t) = f\Big(\sum_i u^{(1)}_{ji} x_i(t) + \sum_i w^{(1)}_{ji} h^{(1)}_i(t-1)\Big)$
4 The output of the second hidden layer equals
$h^{(2)}_j(t) = f\Big(\sum_i u^{(2)}_{ji} h^{(1)}_i(t) + \sum_i w^{(2)}_{ji} h^{(2)}_i(t-1)\Big)$
5 The output equals
$y_k(t) = g\Big(\sum_j v_{kj} h^{(2)}_j(t)\Big)$
1 Forward through the entire sequence to compute the loss, then backward through the entire sequence to compute the gradient11.
11 From Fei-Fei Li et al.'s slides
1 We must find the gradient of the loss with respect to all of the weights.
2 Then we treat the unrolled network as a usual multi-layer network and apply backpropagation (backpropagation through time).
1 Let us consider a network with one hidden layer.
2 We use the softmax activation function for the output layer.
3 We use tanh for the hidden layer.
4 Suppose that L(t) is the loss at time t.
5 If L(t) is the negative log-likelihood of y(t) given x(1), . . . , x(t), then the total loss is $L = \sum_t L(t)$.
1 We backpropagate the gradient in the following way (L is the loss):
[Figure: the network unrolled for t = 1, . . . , 5. At each step, xt enters through U, ht is computed through W from ht−1, and ŷt is produced through V; the gradients ∂L/∂V, ∂L/∂U, and ∂L/∂W are accumulated backward through the unrolled graph. Credit: Trivedi & Kondor]
1 In the forward phase, the hidden layer computes h(t) = tanh(Ux(t) + Wh(t−1) + b).
2 In the forward phase, the output layer computes o(t) = Vh(t) + c and ŷ(t) = softmax(o(t)).
1 We now calculate the gradient of the loss with respect to the network outputs.
2 By using softmax in the output layer, we have
$\hat{y}_i(t) = \frac{e^{o_i(t)}}{\sum_k e^{o_k(t)}}$
1 We compute the gradient of the loss function with respect to oi(t). With $L(t) = -\sum_k y_k(t) \log \hat{y}_k(t)$,
$\frac{\partial L(t)}{\partial o_i(t)} = -\sum_k y_k(t) \frac{\partial \log \hat{y}_k(t)}{\partial o_i(t)} = -\sum_k \frac{y_k(t)}{\hat{y}_k(t)} \frac{\partial \hat{y}_k(t)}{\partial o_i(t)}$
1 Using $\frac{\partial \hat{y}_k(t)}{\partial o_i(t)} = \hat{y}_i(t)(1-\hat{y}_i(t))$ for $k = i$ and $-\hat{y}_i(t)\hat{y}_k(t)$ for $k \neq i$, we obtain
$\frac{\partial L(t)}{\partial o_i(t)} = -y_i(t)(1-\hat{y}_i(t)) + \sum_{k\neq i} y_k(t)\,\hat{y}_i(t) = -y_i(t) + \hat{y}_i(t)\Big(y_i(t) + \sum_{k\neq i} y_k(t)\Big)$
1 We had
$\frac{\partial L(t)}{\partial o_i(t)} = -y_i(t) + \hat{y}_i(t)\Big(y_i(t) + \sum_{k\neq i} y_k(t)\Big)$
2 Since y(t) is a one-hot encoded vector of the labels, $\sum_k y_k(t) = 1$ and $y_i(t) + \sum_{k\neq i} y_k(t) = 1$. So we have
$\frac{\partial L(t)}{\partial o_i(t)} = \hat{y}_i(t) - y_i(t)$
3 This is a very simple and elegant expression.
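The expression is easy to check numerically; a small finite-difference verification with illustrative values:

```python
import numpy as np

def softmax(o):
    e = np.exp(o - o.max())
    return e / e.sum()

def loss(o, y):                      # negative log-likelihood of the one-hot target y
    return -np.sum(y * np.log(softmax(o)))

o = np.array([1.0, -0.5, 2.0, 0.3])
y = np.array([0.0, 0.0, 1.0, 0.0])   # one-hot target
analytic = softmax(o) - y            # the expression derived above: yhat - y

eps = 1e-6                           # central finite differences in each coordinate
numeric = np.array([(loss(o + eps * np.eye(4)[i], y) - loss(o - eps * np.eye(4)[i], y)) / (2 * eps)
                    for i in range(4)])
print(np.allclose(analytic, numeric, atol=1e-6))  # True
```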
1 At the final time step T, h(T) has only o(T) as a descendant, so $\nabla_{h(T)}L = V^\top \nabla_{o(T)}L$.
2 We can then iterate backward in time to back-propagate the gradients: for t < T,
$\nabla_{h(t)}L = W^\top \mathrm{diag}\big(1 - h(t+1)^2\big)\,\nabla_{h(t+1)}L + V^\top \nabla_{o(t)}L$
1 The gradient on the remaining parameters is given by
$\nabla_c L = \sum_t \Big(\frac{\partial o(t)}{\partial c}\Big)^{\!\top} \nabla_{o(t)}L = \sum_t \nabla_{o(t)}L$
$\nabla_b L = \sum_t \Big(\frac{\partial h(t)}{\partial b}\Big)^{\!\top} \nabla_{h(t)}L = \sum_t \mathrm{diag}\big(1 - h(t)^2\big)\,\nabla_{h(t)}L$
$\nabla_V L = \sum_t \sum_i \frac{\partial L}{\partial o_i(t)} \nabla_V\, o_i(t) = \sum_t \big(\nabla_{o(t)}L\big)\, h(t)^\top$
$\nabla_W L = \sum_t \sum_i \frac{\partial L}{\partial h_i(t)} \nabla_W\, h_i(t) = \sum_t \mathrm{diag}\big(1 - h(t)^2\big)\big(\nabla_{h(t)}L\big)\, h(t-1)^\top$
$\nabla_U L = \sum_t \sum_i \frac{\partial L}{\partial h_i(t)} \nabla_U\, h_i(t) = \sum_t \mathrm{diag}\big(1 - h(t)^2\big)\big(\nabla_{h(t)}L\big)\, x(t)^\top$
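A NumPy sketch that transcribes these gradient equations for a single sequence (`bptt` is a hypothetical helper name; dimensions and data are illustrative):

```python
import numpy as np

def softmax(o):
    e = np.exp(o - o.max())
    return e / e.sum()

def bptt(xs, ys, U, W, V, b, c):
    """Forward pass, then the gradient equations from the slide."""
    T, H = len(xs), b.size
    h, yhat = {-1: np.zeros(H)}, {}
    for t in range(T):                                  # forward phase
        h[t] = np.tanh(U @ xs[t] + W @ h[t - 1] + b)
        yhat[t] = softmax(V @ h[t] + c)
    gU, gW, gV = np.zeros_like(U), np.zeros_like(W), np.zeros_like(V)
    gb, gc = np.zeros_like(b), np.zeros_like(c)
    dh_next = np.zeros(H)                               # gradient flowing from step t+1 into h(t)
    for t in reversed(range(T)):                        # backward phase
        do = yhat[t] - ys[t]                            # grad of loss w.r.t. o(t)
        dh = V.T @ do + dh_next                         # grad w.r.t. h(t)
        dz = (1 - h[t] ** 2) * dh                       # through tanh: diag(1 - h(t)^2)
        gc += do;  gV += np.outer(do, h[t])
        gb += dz;  gW += np.outer(dz, h[t - 1]);  gU += np.outer(dz, xs[t])
        dh_next = W.T @ dz                              # propagate to h(t-1)
    return gU, gW, gV, gb, gc

# Toy usage: 3 steps of 4-dim inputs, 5 hidden units, 2 output classes (one-hot ys).
rng = np.random.default_rng(0)
U, W, V = rng.normal(0, 0.1, (5, 4)), rng.normal(0, 0.1, (5, 5)), rng.normal(0, 0.1, (2, 5))
b, c = np.zeros(5), np.zeros(2)
xs, ys = rng.normal(size=(3, 4)), np.eye(2)[[0, 1, 0]]
grads = bptt(xs, ys, U, W, V, b, c)
```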
1 Finally, bearing in mind that the weights to and from each unit in the hidden layer are the same at every time step, each of the above gradients is a sum over the whole sequence, t = 1, . . . , T.
1 Truncated backpropagation through time: run forward and backward through chunks of the sequence instead of the whole sequence12.
2 Carry hidden states forward in time forever, but only backpropagate for some smaller number of steps.
12 From Fei-Fei Li et al.'s slides
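A self-contained sketch of this chunking scheme on a toy model (a scalar regression head `v`, all sizes, and the data are illustrative assumptions):

```python
import numpy as np

# Truncated BPTT: carry the hidden state forward across the whole sequence,
# but backpropagate only within chunks of k steps (a sketch, not a full trainer).
rng = np.random.default_rng(0)
D, H, k, lr = 4, 8, 10, 0.1
U, W = rng.normal(0, 0.1, (H, D)), rng.normal(0, 0.1, (H, H))
v = rng.normal(0, 0.1, H)                       # scalar output head for a toy regression
xs, ys = rng.normal(size=(100, D)), rng.normal(size=100)

h = np.zeros(H)
for start in range(0, len(xs), k):
    chunk_x, chunk_y = xs[start:start + k], ys[start:start + k]
    hs = {-1: h}                                # hidden state carried in from the previous chunk
    for t in range(len(chunk_x)):               # forward through the chunk
        hs[t] = np.tanh(U @ chunk_x[t] + W @ hs[t - 1])
    gU, gW, gv = np.zeros_like(U), np.zeros_like(W), np.zeros_like(v)
    dh_next = np.zeros(H)
    for t in reversed(range(len(chunk_x))):     # backward through the chunk only
        err = v @ hs[t] - chunk_y[t]            # gradient of the loss 0.5*(v.h - y)^2 w.r.t. v.h
        dh = err * v + dh_next
        dz = (1 - hs[t] ** 2) * dh
        gv += err * hs[t]; gU += np.outer(dz, chunk_x[t]); gW += np.outer(dz, hs[t - 1])
        dh_next = W.T @ dz
    U -= lr * gU; W -= lr * gW; v -= lr * gv    # update, then move on
    h = hs[len(chunk_x) - 1]                    # the gradient never flows past this point
```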
Deep learning | Design patterns of RNN
1 Producing a single output after reading the whole sequence, with recurrent connections between the hidden units.
2 This is useful for summarizing a sequence, as in sentiment analysis.
1 Sometimes we are interested in taking only a single, fixed-size vector as input and generating a sequence as output.
2 The most common way is to provide the vector as an extra input at each time step.
3 Other solutions? (Please consider them.)
4 Application: image caption generation.
1 We considered RNNs in the context of a sequence x(t), t = 1, . . . , T.
2 In many applications, y(t) may depend on the whole input sequence.
3 Bidirectional RNNs were introduced to address this need: one RNN processes the sequence forward in time, another processes it backward, and their states are combined at each step.
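A sketch of the bidirectional pattern: two independent recurrences, one over the reversed sequence, with states concatenated per time step (weights and sizes are illustrative):

```python
import numpy as np

def run_rnn(xs, U, W):
    """Run a simple tanh RNN over xs and return all hidden states."""
    h = np.zeros(W.shape[0])
    return np.array([h := np.tanh(U @ x + W @ h) for x in xs])

rng = np.random.default_rng(0)
D, H = 4, 8
Uf, Wf = rng.normal(0, 0.1, (H, D)), rng.normal(0, 0.1, (H, H))     # forward RNN
Ub, Wb = rng.normal(0, 0.1, (H, D)), rng.normal(0, 0.1, (H, H))     # backward RNN
xs = rng.normal(size=(6, D))

h_fwd = run_rnn(xs, Uf, Wf)                   # processes x(1) ... x(T)
h_bwd = run_rnn(xs[::-1], Ub, Wb)[::-1]       # processes x(T) ... x(1), then realigned
h = np.concatenate([h_fwd, h_bwd], axis=1)    # y(t) can now depend on the whole input
```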
1 How do we map input sequences to output sequences that are not necessarily of the same length?
2 The input to the RNN is called the context; we want to find a representation of this context, C.
3 C may be a vector or a sequence that summarizes the input sequence X = (x(1), . . . , x(T)).
1 The computations in RNNs can be decomposed into three blocks of parameters and associated transformations:
Input to hidden state
Previous hidden state to the next hidden state
Hidden state to the output
2 So far, each of these transformations was a learned affine transformation followed by a fixed nonlinearity.
3 Introducing depth in each of these operations is advantageous, as sketched below.
4 The intuition for why depth should be more useful is quite similar to the case of deep feed-forward networks.
5 Optimization can become much harder, but this can be mitigated by introducing skip connections.
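A sketch of one way to introduce depth, stacking a second recurrent layer on top of the first, following the two-layer equations from the training section (shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 4, 8
# Layer 1 reads the input; layer 2 reads layer 1's state. Each layer also has
# its own recurrent connection, as in the two-layer equations seen earlier.
U1, W1 = rng.normal(0, 0.1, (H, D)), rng.normal(0, 0.1, (H, H))
U2, W2 = rng.normal(0, 0.1, (H, H)), rng.normal(0, 0.1, (H, H))

h1, h2 = np.zeros(H), np.zeros(H)
for x in rng.normal(size=(5, D)):
    h1 = np.tanh(U1 @ x + W1 @ h1)    # h1(t) = f(U1 x(t) + W1 h1(t-1))
    h2 = np.tanh(U2 @ h1 + W2 @ h2)   # h2(t) = f(U2 h1(t) + W2 h2(t-1))
```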
Deep learning | Long-term dependencies
1 RNNs involve the composition of a function multiple times, once per time step.
2 This function composition resembles matrix multiplication.
3 Consider the recurrence relationship h(t + 1) = W⊤h(t).
4 This is a very simple RNN, without a nonlinear activation and without inputs x.
5 This recurrence equation can be written as h(t) = (W^t)⊤ h(0).
6 If W has an eigendecomposition of the form W = QΛQ⊤ with orthogonal Q,
7 the recurrence becomes h(t) = (W^t)⊤ h(0) = QΛ^t Q⊤ h(0).
1 The recurrence becomes h(t) = (W^t)⊤ h(0) = QΛ^t Q⊤ h(0).
2 The eigenvalues are raised to the power t: eigenvalues with magnitude less than one quickly decay to zero, while eigenvalues with magnitude greater than one explode.
3 Problem: gradients propagated over many stages tend to vanish (most of the time) or explode (rarely, but with much damage to the optimization), as the demonstration below shows.
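The effect is easy to see numerically; a toy demonstration with eigenvalues 0.9 and 1.1 (the matrix itself is illustrative):

```python
import numpy as np

# A symmetric W with eigenvalues 0.9 and 1.1: one mode decays, the other explodes.
Q, _ = np.linalg.qr(np.random.default_rng(0).normal(size=(2, 2)))  # random orthogonal Q
W = Q @ np.diag([0.9, 1.1]) @ Q.T
h = Q @ np.array([1.0, 1.0])          # put weight on both eigenvectors
for t in [10, 50, 100]:
    ht = np.linalg.matrix_power(W.T, t) @ h
    print(t, np.abs(Q.T @ ht))        # coordinates in the eigenbasis: 0.9^t and 1.1^t
```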
1 The expression for h(t) was h(t) = tanh(Wh(t−1) + Ux(t)).
2 The partial derivative of the loss with respect to the hidden state equals
$\frac{\partial L}{\partial h(t)} = \Big(\frac{\partial h(T)}{\partial h(t)}\Big)^{\!\top} \frac{\partial L}{\partial h(T)}, \qquad \frac{\partial h(T)}{\partial h(t)} = \prod_{k=t}^{T-1} \frac{\partial h(k+1)}{\partial h(k)}$
1 For any matrices A, B, we have ∥AB∥ ≤ ∥A∥∥B∥.
2 Writing $\frac{\partial h(k+1)}{\partial h(k)} = D_{k+1} W^\top$ with $D_{k+1} = \mathrm{diag}\big(1 - h(k+1)^2\big)$, we get
$\Big\|\frac{\partial h(T)}{\partial h(t)}\Big\| = \Big\|\prod_{k=t}^{T-1} D_{k+1} W^\top\Big\| \le \prod_{k=t}^{T-1} \sigma_{\max}(D_{k+1})\,\sigma_{\max}(W)$
3 Hence the gradient norm can shrink to zero or grow exponentially fast with the number of time steps.
1 Set the recurrent weights such that they do a good job of capturing the history of past inputs, and learn only the output weights.
2 Methods: echo state machines and liquid state machines.
3 The general methodology is called reservoir computing.
4 How do we choose the recurrent weights?
5 In echo state machines, choose the recurrent weights such that the spectral radius of the recurrent weight matrix keeps the dynamics near the edge of stability.
1 Adding skip connections through time: adding direct connections from variables in the distant past to variables in the present.
2 Leaky units: having units with linear self-connections.
3 Removing connections: removing length-one connections and replacing them with longer connections.
1 RNNs can accumulate information, but it might be useful to forget.
2 Creating paths through time where derivatives can flow.
3 Learn when to forget!
4 Gates allow learning how to read, write, and forget.
5 We consider two gated units:
Long Short-Term Memory (LSTM)
Gated Recurrent Unit (GRU)
1 LSTMs are explicitly designed to avoid the long-term dependency problem.
2 All recurrent neural networks have the form of a chain of repeating modules.
3 In standard RNNs, this repeating module has a very simple structure, such as a single tanh layer.
1 LSTMs also have this chain-like structure, but the repeating module has a different structure.
2 Instead of having a single neural network layer, there are four, interacting in a very special way.
1 Let us define the following notation.
2 In the diagrams, each line carries an entire vector, from the output of one node to the inputs of others.
The pink circles represent point-wise operations, like vector addition.
The yellow boxes are learned neural network layers.
Lines merging denote concatenation.
Lines forking denote the content being copied, with the copies going to different locations.
1 The key to LSTMs is the cell state, the horizontal line running through the top of the diagram.
2 The cell state is kind of like a conveyor belt.
3 It runs straight down the entire chain, with only some minor linear interactions.
4 It is very easy for information to just flow along it unchanged.
5 The LSTM does have the ability to remove or add information to the cell state, carefully regulated by structures called gates.
1 Gates are a way to optionally let information through.
2 They are composed of a sigmoid neural net layer and a point-wise multiplication operation.
3 The sigmoid layer outputs numbers between zero and one, describing how much of each component should be let through.
4 An LSTM has three of these gates, to protect and control the cell state.
1 The first step is to decide what information to throw away from the cell state. This decision is made by a sigmoid layer called the forget gate layer.
2 It looks at ht−1 and xt, and outputs a number between 0 and 1 for each entry of the cell state Ct−1:
ft = σ(Wf · [ht−1, xt] + bf)
1 The next step is to decide what new information we are going to store in the cell state.
2 This has the two following parts, which produce the values that could be added to the state:
A sigmoid layer, called the input gate layer, decides which values we will update: it = σ(Wi · [ht−1, xt] + bi).
A tanh layer creates a vector of new candidate values: C̃t = tanh(WC · [ht−1, xt] + bC).
3 Then we combine these two to create an update to the state.
1 Now we must update the old cell state Ct−1 into the new cell state Ct.
2 We multiply the old state by ft, forgetting the things we decided to forget.
3 Then we add it × C̃t, the new candidate values scaled by how much we decided to update each state value:
Ct = ft × Ct−1 + it × C̃t
1 Finally, we need to decide what we are going to output.
2 This output will be based on our cell state, but will be a filtered version of it:
First, we run a sigmoid layer which decides what parts of the cell state we are going to output: ot = σ(Wo · [ht−1, xt] + bo).
Then, we put the cell state through tanh and multiply it by the output of the sigmoid gate: ht = ot × tanh(Ct).
3 So we only output the parts we decided to.
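A NumPy sketch of one LSTM step implementing the four equations above (weight shapes, initialization, and data are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, Wf, Wi, WC, Wo, bf, bi, bC, bo):
    """One LSTM step; each gate looks at the concatenation [h_{t-1}, x_t]."""
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(Wf @ z + bf)            # forget gate: what to erase from C_{t-1}
    i_t = sigmoid(Wi @ z + bi)            # input gate: which candidates to write
    C_tilde = np.tanh(WC @ z + bC)        # candidate values
    C_t = f_t * C_prev + i_t * C_tilde    # cell state update
    o_t = sigmoid(Wo @ z + bo)            # output gate: which parts of the cell to expose
    h_t = o_t * np.tanh(C_t)
    return h_t, C_t

rng = np.random.default_rng(0)
D, H = 4, 8
Ws = [rng.normal(0, 0.1, (H, H + D)) for _ in range(4)]   # Wf, Wi, WC, Wo
bs = [np.zeros(H) for _ in range(4)]                      # bf, bi, bC, bo
h, C = np.zeros(H), np.zeros(H)
for x in rng.normal(size=(5, D)):
    h, C = lstm_step(x, h, C, *Ws, *bs)
```

Note how the additive cell update C_t = f_t × C_{t−1} + i_t × C̃_t creates the path through time along which gradients can flow.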
1 One popular LSTM variant adds peephole connections: the gate layers are also allowed to look at the cell state.
1 Another variation is to use coupled forget and input gates.
2 Instead of separately deciding what to forget and what new information we should add, we make those decisions together.
3 We only forget when we are going to input something in its place, and we only input new values to the state when we forget something older: Ct = ft × Ct−1 + (1 − ft) × C̃t.
1 GRU combines the forget and input gates into a single update gate15.
2 It merges the cell and hidden states, and makes some other changes.
3 The resulting model is simpler than standard LSTM models and has been growing increasingly popular.
15 Kyunghyun Cho et al. "Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation", Proc. of the Conference on Empirical Methods in Natural Language Processing, pages 1724-1734, 2014.
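A corresponding sketch of one GRU step, under one common formulation of the update and reset gates (shapes and initialization are illustrative):

```python
import numpy as np

def sigmoid(a):
    return 1 / (1 + np.exp(-a))

def gru_step(x_t, h_prev, Wz, Wr, Wh):
    """One GRU step: the update gate z_t plays the role of coupled forget/input gates."""
    z_t = sigmoid(Wz @ np.concatenate([h_prev, x_t]))            # update gate
    r_t = sigmoid(Wr @ np.concatenate([h_prev, x_t]))            # reset gate
    h_tilde = np.tanh(Wh @ np.concatenate([r_t * h_prev, x_t]))  # candidate state
    return (1 - z_t) * h_prev + z_t * h_tilde                    # interpolate old and new

rng = np.random.default_rng(0)
D, H = 4, 8
Wz, Wr, Wh = (rng.normal(0, 0.1, (H, H + D)) for _ in range(3))
h = np.zeros(H)
for x in rng.normal(size=(5, D)):
    h = gru_step(x, h, Wz, Wr, Wh)
```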
1 There are also other LSTM variants, such as:
Jan Koutnik et al. "A Clockwork RNN", Proceedings of the 31st International Conference on Machine Learning, 2014.
Kaisheng Yao et al. "Depth-Gated Recurrent Neural Networks", https://arxiv.org/pdf/1508.03790v2.pdf
1 There are two simple solutions for vanishing / exploding gradients:
For vanishing gradients: careful initialization + ReLUs.
For exploding gradients: the clipping trick. When ∥g∥ > v, set g ← v g/∥g∥.
2 Vanishing gradients happen when the optimization gets stuck in a region where the error surface is nearly flat.
3 Exploding gradients happen when the gradient becomes too big and a single update throws the parameters far away.
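The clipping trick as code (the threshold v = 5 is an illustrative choice):

```python
import numpy as np

def clip_gradient(g, v=5.0):
    """Rescale g to norm v when its norm exceeds v: g <- v * g / ||g||."""
    norm = np.linalg.norm(g)
    return g * (v / norm) if norm > v else g

g = np.array([30.0, -40.0])          # ||g|| = 50 > v
print(clip_gradient(g))              # [ 3. -4.], norm exactly 5
```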
1 Gradient clipping helps to deal with exploding gradients, but it does not help with vanishing gradients.
2 Another way is to encourage the creation of paths in the unfolded recurrent graph along which the gradient can flow.
3 Solution: regularize to encourage information flow.
4 We want $\nabla_{h(t)}L \, \frac{\partial h(t)}{\partial h(t-1)}$ to be as large as $\nabla_{h(t)}L$.
5 One regularizer is
$\Omega = \sum_t \left( \frac{\big\| \nabla_{h(t)}L \, \frac{\partial h(t)}{\partial h(t-1)} \big\|}{\big\| \nabla_{h(t)}L \big\|} - 1 \right)^{\!2}$
6 Computing this regularizer is difficult, and an approximation is used in practice.
Deep learning | Attention models
1 Sequence-to-sequence modeling transforms an input sequence (source) into an output sequence (target).
2 Examples of transformation tasks include:
Machine translation between multiple languages, in either text or audio.
Question-answer dialog generation.
Parsing sentences into grammar trees.
3 The sequence-to-sequence model normally has an encoder-decoder architecture:
An encoder processes the input sequence and compresses the information into a context vector of fixed length. This representation is expected to be a good summary of the meaning of the whole source sequence.
A decoder is initialized with the context vector to emit the transformed output. Early work used the last state of the encoder network as the decoder's initial state.
1 Both the encoder and decoder are recurrent neural networks, using LSTM or GRU units.
2 A critical disadvantage of this fixed-length context vector design is its incapability of remembering long sentences.
1 The attention mechanism was born to help memorize long source sentences in neural machine translation16.
2 Instead of building a single context vector out of the encoder's last hidden state, attention creates shortcut connections between the entire source input and the context vector.
3 The weights of these shortcut connections are customizable for each output element.
4 The alignment between the source and the target is learned and controlled by the context vector.
5 Essentially, the context vector consumes three pieces of information:
Encoder hidden states
Decoder hidden states
Alignment between source and target
16 Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align and translate". ICLR 2015.
1 Assume that we have a source sequence x of length n and try to output a target sequence y of length m.
2 The encoder is a bidirectional RNN with a forward hidden state →hi and a backward hidden state ←hi.
3 A simple concatenation of the two represents the encoder state: hi = [→hi ; ←hi], for i = 1, . . . , n.
4 The motivation is to include both the preceding and the following words in the annotation of one word.
1 The model of attention16:
[Figure: the encoder-decoder model with an additive attention mechanism.]
1 The decoder network has hidden state st = f(st−1, yt−1, ct) at time step t.
2 The context vector ct is a weighted sum of the hidden states of the input sequence:
$c_t = \sum_{i=1}^{n} \alpha_{t,i}\, h_i$  (context vector for output yt)
$\alpha_{t,i} = \frac{\exp(\mathrm{score}(s_{t-1}, h_i))}{\sum_{j=1}^{n} \exp(\mathrm{score}(s_{t-1}, h_j))}$  (softmax of predefined alignment scores: how well yt and xi are aligned)
3 The alignment model assigns a score αt,i to the pair (yt, xi) based on how well they match.
4 The set of {αt,i} are weights defining how much of each source hidden state should be considered for each output.
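A sketch of these two equations in NumPy, plugging in the additive score function defined on the next slide (all weights and sizes are illustrative):

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def attention_context(s_prev, hs, va, Wa):
    """Compute alpha_{t,i} and c_t from decoder state s_{t-1} and encoder states h_i."""
    scores = np.array([va @ np.tanh(Wa @ np.concatenate([s_prev, h_i])) for h_i in hs])
    alpha = softmax(scores)             # alignment weights over the n source positions
    c_t = alpha @ hs                    # context vector: weighted sum of encoder states
    return c_t, alpha

rng = np.random.default_rng(0)
n, H, S = 6, 8, 8                       # n source states of size H, decoder state of size S
hs = rng.normal(size=(n, H))
s_prev = rng.normal(size=S)
va, Wa = rng.normal(size=16), rng.normal(size=(16, S + H))
c_t, alpha = attention_context(s_prev, hs, va, Wa)
print(alpha.sum())                      # 1.0: a proper distribution over source positions
```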
1 The alignment score α is parametrized by a feed-forward network with a single hidden layer17.
2 This network is jointly trained with the other parts of the model.
3 The score function is therefore of the following form, given that tanh is used as the nonlinear activation function:
score(st, hi) = va⊤ tanh(Wa [st; hi])
17 Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align and translate". ICLR 2015.
1 The matrix of alignment scores explicitly shows the correlation between source and target words.
[Figure: an alignment-score matrix for a translation example.]
Popular alignment score functions (summarized at https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html):
Content-based: score(st, hi) = cosine[st, hi].
Additive: score(st, hi) = va⊤ tanh(Wa [st; hi]) — Bahdanau et al., "Neural machine translation by jointly learning to align and translate", ICLR 2015.
Location-based: αt,i = softmax(Wa st) — Luong et al., "Effective Approaches to Attention-based Neural Machine Translation", EMNLP 2015.
General: score(st, hi) = st⊤ Wa hi — same as the above.
Dot-product: score(st, hi) = st⊤ hi — same as the above.
Scaled dot-product: score(st, hi) = st⊤ hi / √n, where n is the hidden-state dimension — Vaswani et al., "Attention is all you need", NIPS 2017.
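The score functions above, side by side as code (weights, dimensions, and the helper names are illustrative):

```python
import numpy as np

def score_dot(s, h):                      # dot-product (Luong et al., 2015)
    return s @ h

def score_scaled_dot(s, h):               # scaled dot-product (Vaswani et al., 2017)
    return s @ h / np.sqrt(h.size)        # here n is the hidden-state dimension

def score_general(s, h, Wa):              # general (Luong et al., 2015)
    return s @ Wa @ h

def score_additive(s, h, va, Wa):         # additive (Bahdanau et al., 2015)
    return va @ np.tanh(Wa @ np.concatenate([s, h]))

def score_content(s, h):                  # content-based: cosine similarity
    return s @ h / (np.linalg.norm(s) * np.linalg.norm(h))

rng = np.random.default_rng(0)
s, h = rng.normal(size=8), rng.normal(size=8)
Wa, va, Wa2 = rng.normal(size=(8, 8)), rng.normal(size=16), rng.normal(size=(16, 16))
print(score_dot(s, h), score_general(s, h, Wa), score_additive(s, h, va, Wa2))
```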
1 Self-attention18 (intra-attention) is an attention mechanism relating different positions of a single sequence in order to compute a representation of the same sequence.
2 It is very useful in:
Machine reading (the automatic, unsupervised understanding of text)
Abstractive summarization
Image description generation
18 Jianpeng Cheng, Li Dong, and Mirella Lapata. "Long short-term memory-networks for machine reading". EMNLP 2016.
1 The self-attention mechanism enables us to learn the correlation between the current word and the previous part of the sentence.
2 In the visualization, the current word is in red, and the size of the blue shade indicates the activation level.
1 Self-attention is applied to the image to generate descriptions19.
2 The image is encoded by a CNN, and an RNN with self-attention consumes the CNN features to generate the descriptive words one by one.
3 The visualization of the attention weights clearly demonstrates which regions of the image the model attends to when outputting a certain word.
19 Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention". ICML 2015.
1 Soft vs. hard attention is another way to categorize how attention is defined, based on whether the attention has access to the entire input or only a patch of it:
Soft attention: the alignment weights are learned and placed "softly" over all patches in the source image (essentially the same type of attention as in Bahdanau et al., 2015).
Pro: the model is smooth and differentiable. Con: expensive when the source input is large.
Hard attention: only selects one patch of the image to attend to at a time.
Pro: less calculation at inference time. Con: the model is non-differentiable and requires more complicated techniques, such as variance reduction or reinforcement learning, to train20.
20 Thang Luong, Hieu Pham, Christopher D. Manning. "Effective Approaches to Attention-based Neural Machine Translation". EMNLP 2015.
1 Global and local attention were proposed by Luong et al.21
2 The idea of a global attentional model is to consider all the hidden states of the encoder when deriving the context vector.
[Figure: the global attention model. The attention layer computes global align weights at over all source states h̄s, forms the context vector ct, and combines it with ht into h̃t to predict yt.]
21 Thang Luong, Hieu Pham, Christopher D. Manning. "Effective Approaches to Attention-based Neural Machine Translation". EMNLP 2015.
1 Global attention has a drawback: it has to attend to all words on the source side for each target word, which is expensive for long sequences.
2 The local attentional mechanism chooses to focus only on a small subset of the source positions per target word.
3 Local attention is an interesting blend between hard and soft attention: it is differentiable, unlike hard attention.
4 The model first predicts a single aligned position pt for the current target word:
pt = S · sigmoid(vp⊤ tanh(Wp ht))
where S is the length of the source sentence.
1 To favor alignment points near pt, they placed a Gaussian distribution centered around pt on the alignment weights:
αt,i = align(st, hi) exp(−(i − pt)² / (2σ²))
with σ set empirically, e.g., σ = D/2 for a window of size 2D + 1.
[Figure: the local attention model. The attention layer predicts the aligned position pt, computes local weights over a window of source states around pt, and forms the context vector.]
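A sketch of the local attention weighting, combining the predicted position pt with the Gaussian window (the dot-product base alignment and all sizes are illustrative choices):

```python
import numpy as np

def sigmoid(a):
    return 1 / (1 + np.exp(-a))

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

rng = np.random.default_rng(0)
S, H, D = 10, 8, 2                       # S source words, window of size 2D+1
hs = rng.normal(size=(S, H))             # encoder states
h_t = rng.normal(size=H)                 # current decoder state
vp, Wp = rng.normal(size=H), rng.normal(size=(H, H))

p_t = S * sigmoid(vp @ np.tanh(Wp @ h_t))        # predicted aligned position in [0, S]
sigma = D / 2
align = softmax(hs @ h_t)                        # base alignment; dot-product score here
weights = align * np.exp(-(np.arange(S) - p_t) ** 2 / (2 * sigma ** 2))
# weights now concentrate on source positions near p_t
```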
1 Soft attention makes it possible to do sequence-to-sequence modeling without recurrent network units22.
2 The transformer model is entirely built on self-attention mechanisms, without using a sequence-aligned recurrent architecture.
22 Ashish Vaswani et al. "Attention is all you need". NIPS 2017.
1 The encoding component is a stack of six encoders. 2 The decoding component is a stack of decoders of the same number.
1 Each encoder has two sub-layers: a self-attention layer and a feed-forward layer.
2 Each decoder has three sub-layers: a self-attention layer, an encoder-decoder attention layer, and a feed-forward layer.
1 All encoders receive a list of vectors, each of size 512.
2 The size of this list is a hyperparameter we can set; basically, it would be the length of the longest sentence in our training dataset.
1 Each sub-layer has a residual connection around it.
1 A transformer with two stacked encoders and decoders; the core self-attention computation is sketched below.
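A minimal sketch of single-head scaled dot-product self-attention, the core operation of each encoder (a small d is used instead of 512; the projections and data are illustrative):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention: every position attends to every position."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])         # (T, T) alignment of positions
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)  # row-wise softmax
    return weights @ V                             # each output mixes all positions

rng = np.random.default_rng(0)
T, d = 5, 16                                       # 5 positions; d is 512 in the real model
X = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(0, d ** -0.5, (d, d)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)                # same shape as the input list of vectors
print(out.shape)                                   # (5, 16)
```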
1 SNAIL was developed partially to resolve the problem of positioning in the transformer model by combining the self-attention mechanism with temporal convolutions23.
2 It has been demonstrated to be good at both supervised learning and reinforcement-learning tasks.
23 Nikhil Mishra, Mostafa Rohaninejad, Xi Chen, and Pieter Abbeel. "A simple neural attentive meta-learner". NIPS Workshop on Meta-Learning, 2017.
Deep learning | Reading