slide-1
SLIDE 1

Recurrent Neural Networks

  • M. Soleymani

Sharif University of Technology, Fall 2017. Most slides have been adapted from Fei-Fei Li and colleagues' lectures (cs231n, Stanford, 2017) and some from Socher's lectures (cs224d, Stanford, 2017).

slide-2
SLIDE 2

“Vanilla” Neural Networks

“Vanilla” NN

slide-3
SLIDE 3

Recurrent Neural Networks: Process Sequences

“Vanilla” NN
e.g. Image Captioning: image -> sequence of words
e.g. Sentiment Classification: sequence of words -> sentiment
e.g. Machine Translation: sequence of words -> sequence of words
e.g. Video classification at the frame level

slide-4
SLIDE 4

Recurrent Neural Network

x → RNN → y

We usually want to predict an output vector y at some time steps.

slide-5
SLIDE 5

Recurrent Neural Network

x → RNN → y

We can process a sequence of vectors x by applying a recurrence formula at every time step:

$h_t = f_W(h_{t-1}, x_t)$

where $h_t$ is the new state, $h_{t-1}$ the old state, $x_t$ the input vector at time step $t$, and $f_W$ some function with parameters $W$.

slide-6
SLIDE 6

Recurrent Neural Network

x → RNN → y

We can process a sequence of vectors x by applying a recurrence formula at every time step:

Notice: the same function and the same set of parameters are used at every time step.

slide-7
SLIDE 7

(Vanilla) Recurrent Neural Network

x → RNN → y

The state consists of a single “hidden” vector h:
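For reference, the recurrence shown on this slide is the standard vanilla RNN update:

$h_t = \tanh\!\left(W_{hh}\, h_{t-1} + W_{xh}\, x_t\right), \qquad y_t = W_{hy}\, h_t$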

slide-8
SLIDE 8

RNN: Computational Graph

slide-9
SLIDE 9

RNN: Computational Graph

Re-use the same weight matrix at every time-step

slide-10
SLIDE 10

RNN: Computational Graph: Many to One

slide-11
SLIDE 11

RNN: Computational Graph: Many to Many

slide-12
SLIDE 12

RNN: Computational Graph: Many to Many

slide-13
SLIDE 13

RNN: Computational Graph: Many to Many

slide-14
SLIDE 14

RNN: Computational Graph: One to Many

slide-15
SLIDE 15

Sequence to Sequence: Many-to-one + one-to-many

slide-16
SLIDE 16

Character-level language model example

Vocabulary: [h,e,l,o] Example training sequence: “hello”

x → RNN → y
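As a concrete illustration (not code from the slides), here is how the training data for this example could be set up in numpy: each character becomes a one-hot vector, and the target at each step is simply the next character:

    # A minimal sketch: one-hot encoding of "hello" over the vocabulary [h, e, l, o].
    import numpy as np

    vocab = ['h', 'e', 'l', 'o']
    char_to_ix = {ch: i for i, ch in enumerate(vocab)}

    seq = "hello"
    inputs  = [char_to_ix[ch] for ch in seq[:-1]]   # h, e, l, l
    targets = [char_to_ix[ch] for ch in seq[1:]]    # e, l, l, o  (the next character)

    # one-hot input vectors fed to the RNN, one per time step
    X = np.zeros((len(inputs), len(vocab)))
    X[np.arange(len(inputs)), inputs] = 1.0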

slide-17
SLIDE 17

Character-level language model example

Vocabulary: [h,e,l,o] Example training sequence: “hello”

slide-18
SLIDE 18

Character-level language model example

Vocabulary: [h,e,l,o] Example training sequence: “hello”

slide-19
SLIDE 19

Character-level language model example

Vocabulary: [h,e,l,o] Example training sequence: “hello”

slide-20
SLIDE 20

Example: Character-level Language Model Sampling

  • Vocabulary: [h,e,l,o]
  • At test time, sample characters one at a time and feed them back into the model.
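A minimal numpy sketch of this sampling loop, in the spirit of min-char-rnn.py (the parameter names Wxh, Whh, Why, bh, by are assumed to come from an already trained model):

    import numpy as np

    def sample(h, seed_ix, n, Wxh, Whh, Why, bh, by, V):
        """Sample n character indices, starting from seed_ix and hidden state h."""
        x = np.zeros((V, 1)); x[seed_ix] = 1
        ixes = []
        for _ in range(n):
            h = np.tanh(Wxh @ x + Whh @ h + bh)      # recurrence
            y = Why @ h + by                         # unnormalized scores
            p = np.exp(y) / np.sum(np.exp(y))        # softmax over the vocabulary
            ix = np.random.choice(V, p=p.ravel())    # sample one character
            x = np.zeros((V, 1)); x[ix] = 1          # feed it back in as the next input
            ixes.append(ix)
        return ixes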

slide-21
SLIDE 21

Example: Character-level Language Model Sampling

  • Vocabulary: [h,e,l,o]
  • At test time, sample characters one at a time and feed them back into the model.

slide-22
SLIDE 22

Example: Character-level Language Model Sampling

  • Vocabulary: [h,e,l,o]
  • At test time, sample characters one at a time and feed them back into the model.

slide-23
SLIDE 23

Example: Character-level Language Model Sampling

  • Vocabulary: [h,e,l,o]
  • At test time, sample characters one at a time and feed them back into the model.

slide-24
SLIDE 24

min-char-rnn.py gist: 112 lines of Python

(https://gist.github.com/karpathy/d4dee566867f8291f086)

slide-25
SLIDE 25

x → RNN → y

Language Modeling: Example I

slide-26
SLIDE 26
slide-27
SLIDE 27

at first: nearly random text, then progressively better samples as we train more and more.

slide-28
SLIDE 28
slide-29
SLIDE 29
  • Open source textbook on algebraic geometry (LaTeX source)

slide-30
SLIDE 30
slide-31
SLIDE 31
slide-32
SLIDE 32
slide-33
SLIDE 33

Generated C code

slide-34
SLIDE 34
slide-35
SLIDE 35
slide-36
SLIDE 36

Searching for interpretable cells

slide-37
SLIDE 37

Searching for interpretable cells

slide-38
SLIDE 38

Searching for interpretable cells

quote detection cell

slide-39
SLIDE 39

Searching for interpretable cells

line length tracking cell

slide-40
SLIDE 40

Searching for interpretable cells

if statement cell

slide-41
SLIDE 41

Searching for interpretable cells

if statement cell

quote/comment cell

slide-42
SLIDE 42

Searching for interpretable cells

code depth cell

slide-43
SLIDE 43

Backpropagation through time

Forward through entire sequence to compute loss, then backward through entire sequence to compute gradient

slide-44
SLIDE 44

Truncated Backpropagation through time

Run forward and backward through chunks of the sequence instead of whole sequence

slide-45
SLIDE 45

Truncated Backpropagation through time

Carry hidden states forward in time forever, but only backpropagate for some smaller number of steps
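A rough sketch of the chunking pattern (illustrative only; forward_backward and sgd_update are hypothetical helpers standing in for the model's forward/backward pass and parameter update, and data, params, H come from the surrounding training script):

    import numpy as np

    chunk_len = 25                      # backpropagate through at most 25 steps
    h = np.zeros((H, 1))                # hidden state, carried forward across chunks
    for start in range(0, len(data) - 1, chunk_len):
        xs = data[start:start + chunk_len]              # inputs for this chunk
        ys = data[start + 1:start + chunk_len + 1]      # targets (next characters)
        # gradients are computed only within this chunk; h flows forward untouched
        loss, grads, h = forward_backward(xs, ys, h)    # hypothetical helper
        params = sgd_update(params, grads)              # hypothetical helper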

slide-46
SLIDE 46

Truncated Backpropagation through time

slide-47
SLIDE 47

Example: Language Models

  • A language model computes a probability for a sequence of words:

$P(x_1, \dots, x_T)$

  • Useful for machine translation, spelling correction, …

– Word ordering: p(the cat is small) > p(small the is cat)
– Word choice: p(walking home after school) > p(walking house after school)
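By the chain rule, this joint probability factorizes into per-word conditionals, which is what an RNN language model estimates one step at a time:

$P(x_1, \dots, x_T) = \prod_{t=1}^{T} P(x_t \mid x_1, \dots, x_{t-1})$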

slide-48
SLIDE 48

Example: RNN language model

  • Given a list of word vectors: $x_1, x_2, \dots, x_T$
  • At a single time step: $h_t = \tanh\!\left(W_{xh}\, x_t + W_{hh}\, h_{t-1}\right)$
  • Output: $\hat{y}_t = \mathrm{softmax}\!\left(W_{hy}\, h_t\right)$

$P(x_{t+1} = w_j \mid x_1, \dots, x_t) \approx \hat{y}_{t,j}$

slide-49
SLIDE 49

Example: RNN language model

  • Given a list of word vectors: $x_1, x_2, \dots, x_T$
  • At a single time step: $h_t = \tanh\!\left(W_{xh}\, x_t + W_{hh}\, h_{t-1}\right)$
  • Output: $\hat{y}_t = \mathrm{softmax}\!\left(W_{hy}\, h_t\right)$

$P(x_{t+1} = w_j \mid x_1, \dots, x_t) \approx \hat{y}_{t,j}$

$h_0$ is some initialization vector for the hidden layer at time step 0; $x_t$ is the column word vector at time step $t$.
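A small numpy sketch of a single time step of this model (a direct paraphrase of the equations above, not code from the slides):

    import numpy as np

    def rnn_lm_step(x_t, h_prev, Wxh, Whh, Why):
        """One time step: returns the new hidden state and the next-word distribution."""
        h_t = np.tanh(Wxh @ x_t + Whh @ h_prev)          # hidden state update
        scores = Why @ h_t
        y_hat = np.exp(scores) / np.sum(np.exp(scores))  # softmax over the vocabulary
        return h_t, y_hat                                # y_hat[j] ~ P(next word = j | history)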

slide-50
SLIDE 50

Example: RNN language model loss

  • $\hat{y}_t \in \mathbb{R}^{|V|}$ is a probability distribution over the vocabulary
  • Cross-entropy loss function at position $t$ of the sequence:

$E_t = -\sum_{j=1}^{|V|} y_{t,j} \log \hat{y}_{t,j}$

  • Cost function over the entire sequence:

$E = -\frac{1}{T} \sum_{t=1}^{T} \sum_{j=1}^{|V|} y_{t,j} \log \hat{y}_{t,j}$

where $y_{t,j} = 1$ when the target at time step $t$ is word $j$ of the vocabulary (and 0 otherwise).
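In code, the per-sequence cost could look like this (assuming y_hat holds the T predicted distributions and targets the integer indices of the true next words):

    import numpy as np

    def sequence_loss(y_hat, targets):
        """y_hat: array of shape (T, |V|); targets: integer indices of the true words."""
        T = len(targets)
        # with one-hot targets, only the log-probability of the true word survives the sum
        return -np.mean(np.log(y_hat[np.arange(T), targets]))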

slide-51
SLIDE 51

Training RNN

slide-52
SLIDE 52

Training RNN

slide-53
SLIDE 53

Training RNN

$h_t = W_{hh}\, f(h_{t-1}) + W_{xh}\, x_t$

$\frac{\partial h_t}{\partial h_{t-1}} = W_{hh}^{\top}\, \mathrm{diag}\!\left[f'(h_{t-1})\right]$, or element-wise, $\frac{\partial h_{t,i}}{\partial h_{t-1,j}} = (W_{hh})_{ij}\, f'(h_{t-1,j})$

$\left\|\frac{\partial h_t}{\partial h_{t-1}}\right\| \le \left\|W_{hh}^{\top}\right\|\, \left\|\mathrm{diag}\!\left[f'(h_{t-1})\right]\right\| \le \gamma_W\, \gamma_h$

$\left\|\frac{\partial h_t}{\partial h_l}\right\| = \left\|\prod_{k=l+1}^{t} \frac{\partial h_k}{\partial h_{k-1}}\right\| = \left\|\prod_{k=l+1}^{t} W_{hh}^{\top}\, \mathrm{diag}\!\left[f'(h_{k-1})\right]\right\| \le \left(\gamma_W\, \gamma_h\right)^{t-l}$

  • This product can become very small or very large quickly (vanishing/exploding gradients) [Bengio et al., 1994].
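A tiny numpy experiment makes the bound concrete: repeatedly multiplying a vector by (a scaled version of) the same matrix makes its norm decay or blow up exponentially. The matrices and scale factors here are arbitrary, purely for illustration:

    import numpy as np

    np.random.seed(0)
    H = 50
    W = np.random.randn(H, H) / np.sqrt(H)   # random "recurrent" matrix
    v = np.random.randn(H)

    for scale, label in [(0.5, "small weights"), (2.0, "large weights")]:
        g = v.copy()
        for _ in range(50):                  # 50 repeated Jacobian-like products
            g = (scale * W).T @ g
        print(label, "-> norm after 50 steps:", np.linalg.norm(g))
    # one case collapses toward 0 (vanishing), the other blows up (exploding)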

slide-54
SLIDE 54

Training RNNs is hard

  • Multiply the same matrix at each time step during forward prop
  • Ideally inputs from many time steps ago can modify output y
slide-55
SLIDE 55

The vanishing gradient problem: Example

  • In the case of language modeling, words from time steps far away are not taken into consideration when training to predict the next word.
  • Example: Jane walked into the room. John walked in too. It was late in the day. Jane said hi to ____

slide-56
SLIDE 56

Vanilla RNN Gradient Flow

slide-57
SLIDE 57

Vanilla RNN Gradient Flow

slide-58
SLIDE 58

Vanilla RNN Gradient Flow

Computing the gradient of h0 involves many factors of W (and repeated tanh):
– Largest singular value > 1: exploding gradients
– Largest singular value < 1: vanishing gradients

slide-59
SLIDE 59

Trick for exploding gradient: clipping trick

  • The solution, first introduced by Mikolov, is to clip gradients to a maximum value.
  • This makes a big difference in RNNs.
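A minimal sketch of gradient clipping by norm (one common variant; clipping each element to a fixed range, e.g. np.clip(grad, -5, 5), is even simpler):

    import numpy as np

    def clip_gradient(grad, max_norm=5.0):
        """Rescale the gradient if its norm exceeds max_norm (norm clipping)."""
        norm = np.linalg.norm(grad)
        if norm > max_norm:
            grad = grad * (max_norm / norm)
        return grad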
slide-60
SLIDE 60

Gradient clipping intuition

  • Error surface of a single hidden unit RNN

– High curvature walls

  • Solid lines: standard gradient descent trajectories
  • Dashed lines: gradients rescaled to a fixed size
slide-61
SLIDE 61

Vanilla RNN Gradient Flow

Computing the gradient of h0 involves many factors of W (and repeated tanh):
– Largest singular value > 1: exploding gradients → gradient clipping (scale the gradient if its norm is too big)
– Largest singular value < 1: vanishing gradients → change the RNN architecture

slide-62
SLIDE 62

For vanishing gradients: Initialization + ReLUs!

  • Initialize the recurrent weight matrix W to the identity matrix I and use ReLU activations.
  • New experiments with recurrent neural nets.

Le et al., A Simple Way to Initialize Recurrent Networks of Rectified Linear Units, 2015.
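A sketch of the idea from the paper (often called IRNN): initialize the recurrent matrix to the identity and replace tanh with ReLU. Sizes below are illustrative:

    import numpy as np

    H, D = 100, 50                        # hidden and input sizes (illustrative)
    Whh = np.eye(H)                       # recurrent weights initialized to the identity
    Wxh = np.random.randn(H, D) * 0.01    # small random input weights
    b = np.zeros(H)

    def irnn_step(x, h_prev):
        # ReLU replaces tanh; with Whh = I the state initially just accumulates inputs
        return np.maximum(0.0, Whh @ h_prev + Wxh @ x + b)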

slide-63
SLIDE 63

Better units for recurrent models

  • More complex hidden unit computation in the recurrence!

– $h_t = \mathrm{LSTM}(x_t, h_{t-1})$
– $h_t = \mathrm{GRU}(x_t, h_{t-1})$

  • Main ideas:

– keep around memories to capture long-distance dependencies
– allow error messages to flow at different strengths depending on the inputs

slide-64
SLIDE 64

Long Short Term Memory (LSTM)

slide-65
SLIDE 65

Long-short-term-memories (LSTMs)

  • Input gate (current cell matters): $i_t = \sigma\!\left(W_i\,[h_{t-1}, x_t] + b_i\right)$
  • Forget gate (0 means forget the past): $f_t = \sigma\!\left(W_f\,[h_{t-1}, x_t] + b_f\right)$
  • Output gate (how much the cell is exposed): $o_t = \sigma\!\left(W_o\,[h_{t-1}, x_t] + b_o\right)$
  • New memory cell: $\tilde{c}_t = \tanh\!\left(W_c\,[h_{t-1}, x_t] + b_c\right)$
  • Final memory cell: $c_t = i_t \circ \tilde{c}_t + f_t \circ c_{t-1}$
  • Final hidden state: $h_t = o_t \circ \tanh(c_t)$
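A numpy sketch of one LSTM step that follows the equations above literally (each W_* acts on the concatenation [h_{t-1}, x_t]):

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def lstm_step(x_t, h_prev, c_prev, Wi, Wf, Wo, Wc, bi, bf, bo, bc):
        z = np.concatenate([h_prev, x_t])      # [h_{t-1}, x_t]
        i = sigmoid(Wi @ z + bi)               # input gate
        f = sigmoid(Wf @ z + bf)               # forget gate
        o = sigmoid(Wo @ z + bo)               # output gate
        c_tilde = np.tanh(Wc @ z + bc)         # new memory candidate
        c = i * c_tilde + f * c_prev           # final memory cell
        h = o * np.tanh(c)                     # final hidden state
        return h, c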
slide-66
SLIDE 66

Some visualization

By Chris Olah: http://colah.github.io/posts/2015-08-Understanding-LSTMs/

slide-67
SLIDE 67

LSTM Gates

  • Gates are ways to let information through (or not):

– Forget gate: look at the previous cell state and the current input, and decide which information to throw away.
– Input gate: decide which information in the current state we want to update.
– Output gate: filter the cell state and output the filtered result.
– Gate (update) gate: propose new values for the cell state.

  • For instance: store the gender of the subject until another subject is seen.
slide-68
SLIDE 68

Long Short Term Memory (LSTM)

slide-69
SLIDE 69

Long Short Term Memory (LSTM)

[Hochreiter et al., 1997]

slide-70
SLIDE 70

Long Short Term Memory (LSTM)

[Hochreiter et al., 1997]

slide-71
SLIDE 71

Long Short Term Memory (LSTM)

[Hochreiter et al., 1997]

slide-72
SLIDE 72

Long Short Term Memory (LSTM)

[Hochreiter et al., 1997]

slide-73
SLIDE 73

Derivatives for LSTM

$z = f(x_1, x_2) = x_1 \circ x_2$

$\frac{\partial E}{\partial x_1} = \frac{\partial E}{\partial z}\,\frac{\partial z}{\partial x_1} = \frac{\partial E}{\partial z} \circ x_2$

slide-74
SLIDE 74

GRUs

  • Gated Recurrent Units (GRU) introduced by Cho et al. 2014
  • Update gate: $z_t = \sigma\!\left(W_z\,[h_{t-1}, x_t] + b_z\right)$
  • Reset gate: $r_t = \sigma\!\left(W_r\,[h_{t-1}, x_t] + b_r\right)$
  • New memory (candidate): $\tilde{h}_t = \tanh\!\left(W\,[r_t \circ h_{t-1},\, x_t] + b\right)$
  • Final memory: $h_t = z_t \circ h_{t-1} + (1 - z_t) \circ \tilde{h}_t$

If the reset gate is ~0, the unit ignores the previous memory and only stores the new input.
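And the corresponding numpy sketch of one GRU step, again a direct transcription of the equations above:

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def gru_step(x_t, h_prev, Wz, Wr, W, bz, br, b):
        hx = np.concatenate([h_prev, x_t])
        z = sigmoid(Wz @ hx + bz)                                      # update gate
        r = sigmoid(Wr @ hx + br)                                      # reset gate
        h_tilde = np.tanh(W @ np.concatenate([r * h_prev, x_t]) + b)   # candidate memory
        return z * h_prev + (1.0 - z) * h_tilde                        # final memory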
slide-75
SLIDE 75

GRU intuition

  • Units with long term dependencies have active update gates z
  • Illustration:
slide-76
SLIDE 76

GRU intuition

  • If the reset gate is close to 0, ignore the previous hidden state.

– Allows the model to drop information that is irrelevant in the future.

  • The update gate z controls how much of the past state should matter now.

– If z is close to 1, we can copy information in that unit through many time steps: less vanishing gradient!

  • Units with short-term dependencies often have very active reset gates.

slide-77
SLIDE 77

Other RNN Variants

slide-78
SLIDE 78

Which of these variants is best?

  • Do the differences matter?

– Greff et al. (2015) perform a comparison of popular variants, finding that they are all about the same.
– Jozefowicz et al. (2015) tested more than ten thousand RNN architectures, finding some that worked better than LSTMs on certain tasks.

slide-79
SLIDE 79

LSTM Achievements

  • LSTMs have essentially replaced n-grams as language models for speech.
  • Image captioning and other multi-modal tasks, which were very difficult with previous methods, are now feasible.
  • Many traditional NLP tasks work very well with LSTMs, but they are not necessarily the top performers: e.g., POS tagging and NER (Choi 2016).
  • Neural MT has broken away from the plateau of SMT, especially for grammaticality (partly because of characters/subwords), but is not yet industry strength.

[Ann Copestake, Overview of LSTMs and word2vec, 2016.] https://arxiv.org/ftp/arxiv/papers/1611/1611.00068.pdf

slide-80
SLIDE 80

Multi-layer RNN

slide-81
SLIDE 81

Bidirectional RNN

  • h is the memory, computed from the past memory and the current word. It summarizes the sentence up to that time.

slide-82
SLIDE 82

Bidirectional RNN

  • Problem: for classification you want to incorporate information from words both preceding and following.
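A schematic numpy sketch of a bidirectional pass: run one RNN left-to-right and another right-to-left, then concatenate their states at each position (rnn_step is a hypothetical single-direction step function, e.g. the vanilla update from earlier):

    import numpy as np

    def bidirectional_rnn(xs, H, fwd_params, bwd_params, rnn_step):
        """xs: list of input vectors; returns one [forward; backward] state per position."""
        T = len(xs)
        hf, hb = np.zeros(H), np.zeros(H)
        fwd, bwd = [], [None] * T
        for t in range(T):                          # left-to-right pass
            hf = rnn_step(xs[t], hf, fwd_params)
            fwd.append(hf)
        for t in reversed(range(T)):                # right-to-left pass
            hb = rnn_step(xs[t], hb, bwd_params)
            bwd[t] = hb
        return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]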

slide-83
SLIDE 83

Multilayer bidirectional RNN

  • Each memory layer passes an intermediate sequential representation to the next.

slide-84
SLIDE 84
Image Captioning

  • Explain Images with Multimodal Recurrent Neural Networks, Mao et al.
  • Deep Visual-Semantic Alignments for Generating Image Descriptions, Karpathy and Fei-Fei
  • Show and Tell: A Neural Image Caption Generator, Vinyals et al.
  • Long-term Recurrent Convolutional Networks for Visual Recognition and Description, Donahue et al.
  • Learning a Recurrent Visual Representation for Image Caption Generation, Chen and Zitnick

slide-85
SLIDE 85

Convolutional Neural Network → Recurrent Neural Network

slide-86
SLIDE 86

test image

slide-87
SLIDE 87

test image

slide-88
SLIDE 88

test image


slide-89
SLIDE 89

test image: feed the <START> token as the first input x0.

slide-90
SLIDE 90

test image: feed the <START> token as x0 and compute h0 and the output distribution y0.

before: h = tanh(Wxh * x + Whh * h)
now: h = tanh(Wxh * x + Whh * h + Wih * v), where v is the image feature vector and Wih its weight matrix.
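A one-line numpy sketch of this conditioned recurrence (v would be the CNN feature vector of the test image, e.g. taken from a late fully-connected layer; parameter names follow the formula above):

    import numpy as np

    def caption_rnn_step(x, h_prev, v, Wxh, Whh, Wih, bh):
        """Vanilla RNN step with an extra image-conditioning term Wih @ v."""
        return np.tanh(Wxh @ x + Whh @ h_prev + Wih @ v + bh)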

slide-91
SLIDE 91

Sample the first word from y0: "straw".

slide-92
SLIDE 92

Feed "straw" back in as the next input and compute h1, y1.

slide-93
SLIDE 93

Sample the next word from y1: "hat".

slide-94
SLIDE 94

Feed "hat" back in as the next input and compute h2, y2.

slide-95
SLIDE 95

Sample the <END> token from y2 => finish the caption ("straw hat").

slide-96
SLIDE 96

Image Sentence Datasets

Microsoft COCO [Tsung-Yi Lin et al. 2014], mscoco.org
currently: ~120K images, ~5 sentences each

slide-97
SLIDE 97

Image Captioning: Example Results

slide-98
SLIDE 98

Image Captioning: Failure Cases

slide-99
SLIDE 99

RNN: Summary

  • RNNs allow a lot of flexibility in architecture design
  • Vanilla RNNs are simple but don’t work very well
  • Backward flow of gradients in RNN can explode or vanish.

– Exploding is controlled with gradient clipping.
– Vanishing is controlled with additive interactions (LSTM).

  • It is common to use LSTM or GRU: their additive interactions improve gradient flow.

  • Better/simpler architectures are a hot topic of current research
  • Better understanding (both theoretical and empirical) is needed.