Recurrent Neural Networks
M. Soleymani
Sharif University of Technology, Fall 2017
Most slides have been adapted from the Fei-Fei Li et al. lectures (cs231n, Stanford, 2017) and some from the Socher lectures (cs224d, Stanford, 2017).
A "vanilla" NN maps one fixed-size input to one fixed-size output; RNNs instead operate on sequences, e.g.:
- Image captioning: image -> sequence of words
- Sentiment classification: sequence of words -> sentiment
- Machine translation: sequence of words -> sequence of words
- Video classification at the frame level
An RNN consumes the input sequence x while maintaining an internal state, and at some (or all) time steps we usually want it to predict an output vector y.
We can process a sequence of vectors x by applying a recurrence formula at every time step:
$h_t = f_W(h_{t-1}, x_t)$
where $h_t$ is the new state, $h_{t-1}$ is the old state, $x_t$ is the input vector at that time step, and $f_W$ is some function with parameters W.
Notice: the same function and the same set of parameters are used at every time step.
The state consists of a single "hidden" vector h:
$h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t)$
$y_t = W_{hy} h_t$
The same weight matrices are re-used at every time step.
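As a concrete illustration of this recurrence, here is a minimal NumPy sketch of one vanilla RNN step; the matrix names (Wxh, Whh, Why) mirror the formulas above and the dimensions are arbitrary placeholders:

```python
import numpy as np

def rnn_step(x, h_prev, Wxh, Whh, Why, bh, by):
    """One vanilla RNN time step: update the hidden state and emit an output."""
    h = np.tanh(Wxh @ x + Whh @ h_prev + bh)   # new hidden state h_t
    y = Why @ h + by                           # output y_t (e.g. class scores)
    return h, y

# Tiny usage example: a 3-step sequence of 4-dim inputs, 5 hidden units, 3 outputs.
rng = np.random.default_rng(0)
D, H, O = 4, 5, 3
Wxh, Whh, Why = rng.normal(size=(H, D)), rng.normal(size=(H, H)), rng.normal(size=(O, H))
bh, by = np.zeros(H), np.zeros(O)
h = np.zeros(H)                                # h_0: initial hidden state
for x in rng.normal(size=(3, D)):              # same weights reused at every step
    h, y = rnn_step(x, h, Wxh, Whh, Why, bh, by)
```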
Example: character-level language model. Vocabulary: [h, e, l, o]; example training sequence: "hello".
At test time, sample characters one at a time and feed each sampled character back into the model as the next input.
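A minimal sketch of that test-time sampling loop over the [h, e, l, o] vocabulary; the random weights here merely stand in for a trained model, so the output is gibberish:

```python
import numpy as np

vocab = ['h', 'e', 'l', 'o']
rng = np.random.default_rng(1)
H, V = 8, len(vocab)
Wxh, Whh, Why = rng.normal(size=(H, V)), rng.normal(size=(H, H)), rng.normal(size=(V, H))

def sample_chars(seed_idx, n_chars):
    """Sample characters one at a time, feeding each sample back as the next input."""
    h = np.zeros(H)
    idx, out = seed_idx, []
    for _ in range(n_chars):
        x = np.zeros(V); x[idx] = 1.0                    # one-hot current character
        h = np.tanh(Wxh @ x + Whh @ h)                   # recurrence
        scores = Why @ h
        p = np.exp(scores - scores.max()); p /= p.sum()  # softmax over the vocabulary
        idx = rng.choice(V, p=p)                         # sample the next character
        out.append(vocab[idx])
    return ''.join(out)

print(sample_chars(seed_idx=0, n_chars=10))  # untrained weights -> gibberish
```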
min-char-rnn.py gist: 112 lines of Python (https://gist.github.com/karpathy/d4dee566867f8291f086)
At first the samples are gibberish; as we train more (and more, and more) they look increasingly like the training data, e.g. plausible LaTeX source when the model is trained on LaTeX.
Backpropagation through time: forward through the entire sequence to compute the loss, then backward through the entire sequence to compute the gradient.
Truncated backpropagation through time: run forward and backward through chunks of the sequence instead of the whole sequence; carry hidden states forward in time forever, but only backpropagate for some smaller number of steps.
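A rough NumPy sketch of the truncated-BPTT pattern, using an illustrative squared-error loss on the hidden states (the loss and all variable names are assumptions for this sketch, not course code): within each chunk we backpropagate only to the chunk boundary, while the hidden state itself is carried forward across chunks.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, T, chunk = 3, 4, 12, 4
Wxh, Whh = rng.normal(size=(H, D)) * 0.1, rng.normal(size=(H, H)) * 0.1
xs = rng.normal(size=(T, D))
targets = rng.normal(size=(T, H))           # toy per-step targets for the hidden state

h = np.zeros(H)                             # carried across chunks (no gradient)
for start in range(0, T, chunk):
    hs, hprevs = [], []
    h_in = h                                # treated as a constant within this chunk
    for x in xs[start:start + chunk]:       # forward through the chunk only
        hprevs.append(h_in)
        h_in = np.tanh(Wxh @ x + Whh @ h_in)
        hs.append(h_in)
    h = hs[-1]                              # carry the hidden state forward in time

    dWxh, dWhh, dh_next = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros(H)
    for t in reversed(range(len(hs))):      # backward through the chunk only
        dh = (hs[t] - targets[start + t]) + dh_next   # loss gradient + future gradient
        dpre = (1.0 - hs[t] ** 2) * dh      # backprop through tanh
        dWxh += np.outer(dpre, xs[start + t])
        dWhh += np.outer(dpre, hprevs[t])
        dh_next = Whh.T @ dpre              # stops at the chunk boundary
    Wxh -= 0.01 * dWxh
    Whh -= 0.01 * dWhh
```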
A language model assigns a probability $P(x_1, \dots, x_T)$ to a sequence of words, capturing e.g.:
– Word ordering: P(the cat is small) > P(small the is cat)
– Word choice: P(walking home after school) > P(walking house after school)
Given word vectors $x_1, x_2, \dots, x_T$, an RNN language model computes at each time step
$h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1})$
$\hat{y}_t = \mathrm{softmax}(W_{hy} h_t)$
$P(x_{t+1} = w_k \mid x_1, \dots, x_t) \approx \hat{y}_{t,k}$
Here $h_0$ is some initialization vector for the hidden layer at time step 0, and $x_t$ is the column vector of the word at time step t.
The cross-entropy loss at time step t is
$J^{(t)} = -\sum_{k=1}^{|V|} y_{t,k} \log \hat{y}_{t,k}$
and over the whole sequence
$J = -\frac{1}{T} \sum_{t=1}^{T} \sum_{k=1}^{|V|} y_{t,k} \log \hat{y}_{t,k}$
where $y_{t,k} = 1$ when the correct word at time step t is word k of the vocabulary (one-hot target).
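A short sketch of this loss computation for one sequence, assuming the softmax outputs $\hat{y}_t$ have already been produced by a forward pass (random values stand in for them here):

```python
import numpy as np

rng = np.random.default_rng(0)
T, V = 5, 10                                   # sequence length, vocabulary size

# Stand-in for the softmax outputs y_hat[t] of the RNN language model.
scores = rng.normal(size=(T, V))
y_hat = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)

targets = rng.integers(0, V, size=T)           # index of the correct word at each step

# J = -(1/T) * sum_t sum_k y_{t,k} * log(y_hat_{t,k}), with one-hot targets y_{t,k}.
J = -np.mean(np.log(y_hat[np.arange(T), targets]))
print(J)
```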
With $h_k = W_{hh}\, g(h_{k-1}) + W_{xh}\, x_k$, the Jacobian between consecutive hidden states is
$\frac{\partial h_k}{\partial h_{k-1}} = W_{hh}^{T}\, \mathrm{diag}\big[g'(h_{k-1})\big]$
whose norm is bounded by
$\big\| \frac{\partial h_k}{\partial h_{k-1}} \big\| \le \| W_{hh}^{T} \|\, \big\| \mathrm{diag}\big[g'(h_{k-1})\big] \big\| \le \gamma_W \gamma_h$
Chaining these factors from time step l to time step t gives
$\big\| \frac{\partial h_t}{\partial h_l} \big\| = \Big\| \prod_{k=l+1}^{t} \frac{\partial h_k}{\partial h_{k-1}} \Big\| = \Big\| \prod_{k=l+1}^{t} W_{hh}^{T}\, \mathrm{diag}\big[g'(h_{k-1})\big] \Big\| \le (\gamma_W \gamma_h)^{t-l}$
so the gradient can shrink (vanish) or grow (explode) exponentially in $t-l$.
[Bengio et al 1994].
Element-wise, $\frac{\partial h_{k,n}}{\partial h_{k-1,o}} = [W_{hh}]_{n,o}\, g'(h_{k-1,o})$.
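This can be illustrated numerically: each extra time step multiplies the product by one more Jacobian factor, so with small recurrent weights the gradient norm collapses quickly. A small sketch (the sizes and weight scale are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
H, steps = 10, 50
Whh = rng.normal(size=(H, H)) * 0.05          # largest singular value well below 1

h, J = rng.normal(size=H), np.eye(H)
norms = []
for _ in range(steps):
    h = np.tanh(Whh @ h)
    J = Whh.T @ np.diag(1.0 - h ** 2) @ J     # one more Jacobian factor in the product
    norms.append(np.linalg.norm(J))

print(norms[0], norms[-1])                    # the norm decays (roughly) exponentially
# With gamma_W * gamma_h > 1 the same product can instead grow without bound
# (exploding gradients); with tanh, vanishing is the more common failure mode.
```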
When gradients vanish, words from time steps far away are not taken into consideration when training to predict the next word. Example: "... the day. Jane said hi to ____", where filling the blank requires information mentioned many time steps earlier.
Computing the gradient of h_0 involves many factors of W (and repeated tanh):
– Largest singular value > 1: exploding gradients. Remedy: gradient clipping, i.e. scale the gradient if its norm is too big.
– Largest singular value < 1: vanishing gradients. Remedy: change the RNN architecture.
(Figure: error surface with high-curvature walls; clipped gradients are rescaled to a fixed maximum value.)
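A minimal sketch of the clipping rule (the threshold value 5.0 is an arbitrary placeholder):

```python
import numpy as np

def clip_gradient(grad, max_norm=5.0):
    """Gradient clipping: rescale the gradient if its norm exceeds max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

g = np.array([30.0, -40.0])        # norm 50, too big
print(clip_gradient(g))            # rescaled to norm 5, direction preserved
```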
Le et al., A Simple Way to Initialize Recurrent Networks of Rectified Linear Units, 2015.
– $h_t = \mathrm{LSTM}(x_t, h_{t-1})$ or $h_t = \mathrm{GRU}(x_t, h_{t-1})$
These gated architectures:
– keep around memories to capture long-distance dependencies
– allow error messages to flow at different strengths depending on the inputs
$i_t = \sigma\big(W^{(i)} [h_{t-1}, x_t] + b_i\big)$  (input gate)
$f_t = \sigma\big(W^{(f)} [h_{t-1}, x_t] + b_f\big)$  (forget gate)
$o_t = \sigma\big(W^{(o)} [h_{t-1}, x_t] + b_o\big)$  (output gate)
$\tilde{c}_t = \tanh\big(W^{(c)} [h_{t-1}, x_t] + b_c\big)$  (candidate cell values)
$c_t = f_t \circ c_{t-1} + i_t \circ \tilde{c}_t$
$h_t = o_t \circ \tanh(c_t)$
By Chris Olah: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
– Forget gate: look at the previous cell state and the current input, and decide which information to throw away.
– Input gate: decide which information in the cell state we want to update.
– Output gate: filter the cell state and output the filtered result.
– Gate (candidate) values: propose new values for the cell state.
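Putting the four gates together, here is a minimal NumPy sketch of a single LSTM step, with all four gate pre-activations stacked in one matrix W (the stacking and the dimensions are assumptions of this sketch):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step. W has shape (4H, H+D): rows hold the i, f, o, g blocks."""
    H = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x]) + b
    i = sigmoid(z[0*H:1*H])          # input gate: what to write into the cell
    f = sigmoid(z[1*H:2*H])          # forget gate: what to erase from the cell
    o = sigmoid(z[2*H:3*H])          # output gate: what to reveal from the cell
    g = np.tanh(z[3*H:4*H])          # candidate values for the cell state
    c = f * c_prev + i * g           # additive cell-state update
    h = o * np.tanh(c)               # hidden state exposed to the rest of the network
    return h, c

rng = np.random.default_rng(0)
D, H = 3, 4
W, b = rng.normal(size=(4 * H, H + D)) * 0.1, np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
for x in rng.normal(size=(5, D)):
    h, c = lstm_step(x, h, c, W, b)
```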
[Hochreiter & Schmidhuber, 1997]
For an element-wise product $z = g(x_1, x_2) = x_1 \circ x_2$, the gradient flows as
$\frac{\partial F}{\partial x_1} = \frac{\partial F}{\partial z} \frac{\partial z}{\partial x_1} = \frac{\partial F}{\partial z} \circ x_2$
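A quick finite-difference check of that rule, using F = sum(z) as a convenient scalar test function (an assumption just for the check):

```python
import numpy as np

# Check dF/dx1 = (dF/dz) * x2 for z = x1 * x2 and F = sum(z), by finite differences.
rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=4), rng.normal(size=4)
dF_dz = np.ones(4)                       # F = sum(z)  =>  dF/dz = 1
analytic = dF_dz * x2                    # gradient is scaled by the other operand

eps = 1e-6
numeric = np.array([
    (np.sum((x1 + eps * np.eye(4)[k]) * x2) - np.sum(x1 * x2)) / eps
    for k in range(4)
])
print(np.allclose(analytic, numeric, atol=1e-4))   # True
```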
$z_t = \sigma\big(W^{(z)} [h_{t-1}, x_t] + b_z\big)$  (update gate)
$r_t = \sigma\big(W^{(r)} [h_{t-1}, x_t] + b_r\big)$  (reset gate)
$\tilde{h}_t = \tanh\big(W [r_t \circ h_{t-1}, x_t] + b\big)$  (candidate hidden state)
$h_t = z_t \circ h_{t-1} + (1 - z_t) \circ \tilde{h}_t$
– If a reset gate unit is ~0, the candidate ignores the previous memory for that unit and keeps only the new input information.
– This allows the model to drop information that is irrelevant in the future.
– If z is close to 1, we can copy information in that unit through many time steps: less vanishing gradient!
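For completeness, a minimal NumPy sketch of one GRU step implementing the equations above (matrix names and dimensions are placeholders):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, Wz, Wr, Wh, bz, br, bh):
    """One GRU step; each W maps the concatenated [h_prev, x] to H units."""
    hx = np.concatenate([h_prev, x])
    z = sigmoid(Wz @ hx + bz)                                     # update gate
    r = sigmoid(Wr @ hx + br)                                     # reset gate
    h_tilde = np.tanh(Wh @ np.concatenate([r * h_prev, x]) + bh)  # candidate state
    return z * h_prev + (1.0 - z) * h_tilde                       # mix old and new

rng = np.random.default_rng(0)
D, H = 3, 4
Wz, Wr, Wh = (rng.normal(size=(H, H + D)) * 0.1 for _ in range(3))
bz, br, bh = np.zeros(H), np.zeros(H), np.zeros(H)
h = np.zeros(H)
for x in rng.normal(size=(5, D)):
    h = gru_step(x, h, Wz, Wr, Wh, bz, br, bh)
```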
– Greff et al. (2015) performed a comparison of popular LSTM variants, finding that they are all about the same.
– Jozefowicz et al. (2015) tested more than ten thousand RNN architectures, finding some that worked better than LSTMs on certain tasks.
– LSTMs are now used in many applications, including speech.
– Tasks that were difficult with previous methods are now feasible.
– They are not necessarily the top performers everywhere: e.g., POS tagging and NER (Choi 2016).
– Output grammaticality has improved (partly because of characters/subwords), but it is not yet industry strength.
[Ann Copestake, Overview of LSTMs and word2vec, 2016.] https://arxiv.org/ftp/arxiv/papers/1611/1611.00068.pdf
– The hidden state summarizes the sentence up to that time step.
– Bidirectional RNNs use context from words both preceding and following the current position.
– In deep (stacked) RNNs, each layer passes its sequence of hidden states to the next.
Image captioning at test time: the test image is passed through the CNN, its feature vector v conditions the RNN, and generation starts from the <START> token as input x0.
before: h = tanh(Wxh * x + Whh * h)
now: h = tanh(Wxh * x + Whh * h + Wih * v)
where v is the CNN feature vector of the image and Wih is its weight matrix.
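A minimal sketch of this modified recurrence; the CNN feature v and the <START> embedding are random stand-ins here, and Wih is the extra weight matrix from the formula above:

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, V, F = 4, 5, 6, 7            # word dim, hidden dim, vocab size, CNN feature dim
Wxh, Whh, Wih = rng.normal(size=(H, D)), rng.normal(size=(H, H)), rng.normal(size=(H, F))
Why = rng.normal(size=(V, H))

v = rng.normal(size=F)             # CNN feature of the test image (stand-in values)
h = np.zeros(H)                    # h0
x = rng.normal(size=D)             # embedding of the <START> token (stand-in)

h = np.tanh(Wxh @ x + Whh @ h + Wih @ v)   # image feature added to the recurrence
scores = Why @ h                           # scores over words (before softmax)
```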
At each step the hidden state produces a distribution y over the vocabulary; a word is sampled (e.g. "straw", then "hat") and fed back to the RNN as the next input. Generation finishes when the <END> token is sampled.
[Tsung-Yi Lin et al. 2014] mscoco.org
– Exploding gradients are controlled with gradient clipping.
– Vanishing gradients are controlled with additive interactions (LSTM).
The additive cell-state update gives the LSTM an uninterrupted gradient flow backward through time.