slide-1
SLIDE 1

Recurrent Neural Networks

  • M. Soleymani

Sharif University of Technology, Fall 2017. Most slides have been adapted from Fei-Fei Li and colleagues' lectures (cs231n, Stanford, 2017) and some from Socher's lectures (cs224d, Stanford, 2017).

slide-2
SLIDE 2

“Vanilla” Neural Networks

“Vanilla” NN

slide-3
SLIDE 3

Recurrent Neural Networks: Process Sequences

“Vanilla” NN
e.g. Image Captioning: image -> sequence of words
e.g. Sentiment Classification: sequence of words -> sentiment
e.g. Machine Translation: sequence of words -> sequence of words
e.g. Video classification at the frame level

slide-4
SLIDE 4

Recurrent Neural Network

x → RNN → y

We usually want to predict an output vector y at some time steps.

slide-5
SLIDE 5

Recurrent Neural Network

x → RNN → y

We can process a sequence of vectors x by applying a recurrence formula at every time step:

$h_t = f_W(h_{t-1}, x_t)$

where $h_t$ is the new state, $h_{t-1}$ the old state, $x_t$ the input vector at time step $t$, and $f_W$ some function with parameters $W$.

slide-6
SLIDE 6

Recurrent Neural Network

x → RNN → y

We can process a sequence of vectors x by applying a recurrence formula at every time step:

Notice: the same function and the same set of parameters are used at every time step.

slide-7
SLIDE 7

(Vanilla) Recurrent Neural Network

x → RNN → y

The state consists of a single “hidden” vector h:
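For reference, the recurrence shown on this slide is the standard vanilla RNN update:

$h_t = \tanh\!\left(W_{hh}\, h_{t-1} + W_{xh}\, x_t\right), \qquad y_t = W_{hy}\, h_t$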

slide-8
SLIDE 8

RNN: Computational Graph

slide-9
SLIDE 9

RNN: Computational Graph

Re-use the same weight matrix at every time-step

slide-10
SLIDE 10

RNN: Computational Graph: Many to One

slide-11
SLIDE 11

RNN: Computational Graph: Many to Many

slide-12
SLIDE 12

RNN: Computational Graph: Many to Many

slide-13
SLIDE 13

RNN: Computational Graph: Many to Many

slide-14
SLIDE 14

RNN: Computational Graph: One to Many

slide-15
SLIDE 15

Sequence to Sequence: Many-to-one + one-to-many

slide-16
SLIDE 16

Character-level language model example

Vocabulary: [h,e,l,o] Example training sequence: “hello”

x → RNN → y
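As a concrete illustration (not code from the slides), here is how the training data for this example could be set up in numpy: each character becomes a one-hot vector, and the target at each step is simply the next character:

    # A minimal sketch: one-hot encoding of "hello" over the vocabulary [h, e, l, o].
    import numpy as np

    vocab = ['h', 'e', 'l', 'o']
    char_to_ix = {ch: i for i, ch in enumerate(vocab)}

    seq = "hello"
    inputs  = [char_to_ix[ch] for ch in seq[:-1]]   # h, e, l, l
    targets = [char_to_ix[ch] for ch in seq[1:]]    # e, l, l, o  (the next character)

    # one-hot input vectors fed to the RNN, one per time step
    X = np.zeros((len(inputs), len(vocab)))
    X[np.arange(len(inputs)), inputs] = 1.0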

slide-17
SLIDE 17

Character-level language model example

Vocabulary: [h,e,l,o] Example training sequence: “hello”

slide-18
SLIDE 18

Character-level language model example

Vocabulary: [h,e,l,o] Example training sequence: “hello”

slide-19
SLIDE 19

Character-level language model example

Vocabulary: [h,e,l,o] Example training sequence: “hello”

slide-20
SLIDE 20

Example: Character-level Language Model Sampling

  • Vocabulary: [h,e,l,o]
  • At test time, sample characters one at a time and feed them back into the model.
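A minimal numpy sketch of this sampling loop, in the spirit of min-char-rnn.py (the parameter names Wxh, Whh, Why, bh, by are assumed to come from an already trained model):

    import numpy as np

    def sample(h, seed_ix, n, Wxh, Whh, Why, bh, by, V):
        """Sample n character indices, starting from seed_ix and hidden state h."""
        x = np.zeros((V, 1)); x[seed_ix] = 1
        ixes = []
        for _ in range(n):
            h = np.tanh(Wxh @ x + Whh @ h + bh)      # recurrence
            y = Why @ h + by                         # unnormalized scores
            p = np.exp(y) / np.sum(np.exp(y))        # softmax over the vocabulary
            ix = np.random.choice(V, p=p.ravel())    # sample one character
            x = np.zeros((V, 1)); x[ix] = 1          # feed it back in as the next input
            ixes.append(ix)
        return ixes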

slide-21
SLIDE 21

Example: Character-level Language Model Sampling

  • Vocabulary: [h,e,l,o]
  • At test time, sample characters one at a time and feed them back into the model.

slide-22
SLIDE 22

Example: Character-level Language Model Sampling

  • Vocabulary: [h,e,l,o]
  • At test time, sample characters one at a time and feed them back into the model.

slide-23
SLIDE 23

Example: Character-level Language Model Sampling

  • Vocabulary: [h,e,l,o]
  • At test time, sample characters one at a time and feed them back into the model.

slide-24
SLIDE 24

min-char-rnn.py gist: 112 lines of Python

(https://gist.github.com/karpathy/d4dee566867f8291f086)

slide-25
SLIDE 25

x → RNN → y

Language Modeling: Example I

slide-26
SLIDE 26
slide-27
SLIDE 27

at first: nearly random text, then progressively better samples as we train more and more.

slide-28
SLIDE 28
slide-29
SLIDE 29
  • Open source textbook on algebraic geometry (LaTeX source)

slide-30
SLIDE 30
slide-31
SLIDE 31
slide-32
SLIDE 32
slide-33
SLIDE 33

Generated C code

slide-34
SLIDE 34
slide-35
SLIDE 35
slide-36
SLIDE 36

Searching for interpretable cells

slide-37
SLIDE 37

Searching for interpretable cells

slide-38
SLIDE 38

Searching for interpretable cells

quote detection cell

slide-39
SLIDE 39

Searching for interpretable cells

line length tracking cell

slide-40
SLIDE 40

Searching for interpretable cells

if statement cell

slide-41
SLIDE 41

Searching for interpretable cells

if statement cell

quote/comment cell

slide-42
SLIDE 42

Searching for interpretable cells

code depth cell

slide-43
SLIDE 43

Backpropagation through time

Forward through entire sequence to compute loss, then backward through entire sequence to compute gradient

slide-44
SLIDE 44

Truncated Backpropagation through time

Run forward and backward through chunks of the sequence instead of whole sequence

slide-45
SLIDE 45

Truncated Backpropagation through time

Carry hidden states forward in time forever, but only backpropagate for some smaller number of steps
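A rough sketch of the chunking pattern (illustrative only; forward_backward and sgd_update are hypothetical helpers standing in for the model's forward/backward pass and parameter update, and data, params, H come from the surrounding training script):

    import numpy as np

    chunk_len = 25                      # backpropagate through at most 25 steps
    h = np.zeros((H, 1))                # hidden state, carried forward across chunks
    for start in range(0, len(data) - 1, chunk_len):
        xs = data[start:start + chunk_len]              # inputs for this chunk
        ys = data[start + 1:start + chunk_len + 1]      # targets (next characters)
        # gradients are computed only within this chunk; h flows forward untouched
        loss, grads, h = forward_backward(xs, ys, h)    # hypothetical helper
        params = sgd_update(params, grads)              # hypothetical helper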

slide-46
SLIDE 46

Truncated Backpropagation through time

slide-47
SLIDE 47

Example: Language Models

  • A language model computes a probability for a sequence of words:

$P(x_1, \dots, x_T)$

  • Useful for machine translation, spelling correction, …

– Word ordering: p(the cat is small) > p(small the is cat)
– Word choice: p(walking home after school) > p(walking house after school)
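By the chain rule, this joint probability factorizes into per-word conditionals, which is what an RNN language model estimates one step at a time:

$P(x_1, \dots, x_T) = \prod_{t=1}^{T} P(x_t \mid x_1, \dots, x_{t-1})$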

slide-48
SLIDE 48

Example: RNN language model

  • Given a list of word vectors: $x_1, x_2, \dots, x_T$
  • At a single time step: $h_t = \tanh\!\left(W_{xh}\, x_t + W_{hh}\, h_{t-1}\right)$
  • Output: $\hat{y}_t = \mathrm{softmax}\!\left(W_{hy}\, h_t\right)$

$P(x_{t+1} = w_j \mid x_1, \dots, x_t) \approx \hat{y}_{t,j}$

slide-49
SLIDE 49

Example: RNN language model

  • Given a list of word vectors: $x_1, x_2, \dots, x_T$
  • At a single time step: $h_t = \tanh\!\left(W_{xh}\, x_t + W_{hh}\, h_{t-1}\right)$
  • Output: $\hat{y}_t = \mathrm{softmax}\!\left(W_{hy}\, h_t\right)$

$P(x_{t+1} = w_j \mid x_1, \dots, x_t) \approx \hat{y}_{t,j}$

$h_0$ is some initialization vector for the hidden layer at time step 0; $x_t$ is the column word vector at time step $t$.
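A small numpy sketch of a single time step of this model (a direct paraphrase of the equations above, not code from the slides):

    import numpy as np

    def rnn_lm_step(x_t, h_prev, Wxh, Whh, Why):
        """One time step: returns the new hidden state and the next-word distribution."""
        h_t = np.tanh(Wxh @ x_t + Whh @ h_prev)          # hidden state update
        scores = Why @ h_t
        y_hat = np.exp(scores) / np.sum(np.exp(scores))  # softmax over the vocabulary
        return h_t, y_hat                                # y_hat[j] ~ P(next word = j | history)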

slide-50
SLIDE 50

Example: RNN language model loss

  • $\hat{y}_t \in \mathbb{R}^{|V|}$ is a probability distribution over the vocabulary
  • Cross-entropy loss function at position $t$ of the sequence:

$E_t = -\sum_{j=1}^{|V|} y_{t,j} \log \hat{y}_{t,j}$

  • Cost function over the entire sequence:

$E = -\frac{1}{T} \sum_{t=1}^{T} \sum_{j=1}^{|V|} y_{t,j} \log \hat{y}_{t,j}$

where $y_{t,j} = 1$ when the target at time step $t$ is word $j$ of the vocabulary (and 0 otherwise).
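In code, the per-sequence cost could look like this (assuming y_hat holds the T predicted distributions and targets the integer indices of the true next words):

    import numpy as np

    def sequence_loss(y_hat, targets):
        """y_hat: array of shape (T, |V|); targets: integer indices of the true words."""
        T = len(targets)
        # with one-hot targets, only the log-probability of the true word survives the sum
        return -np.mean(np.log(y_hat[np.arange(T), targets]))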

slide-51
SLIDE 51

Training RNN

slide-52
SLIDE 52

Training RNN

slide-53
SLIDE 53

Training RNN

$h_t = W_{hh}\, f(h_{t-1}) + W_{xh}\, x_t$

$\frac{\partial h_t}{\partial h_{t-1}} = W_{hh}^{\top}\, \mathrm{diag}\!\left[f'(h_{t-1})\right]$, or element-wise, $\frac{\partial h_{t,i}}{\partial h_{t-1,j}} = (W_{hh})_{ij}\, f'(h_{t-1,j})$

$\left\|\frac{\partial h_t}{\partial h_{t-1}}\right\| \le \left\|W_{hh}^{\top}\right\|\, \left\|\mathrm{diag}\!\left[f'(h_{t-1})\right]\right\| \le \gamma_W\, \gamma_h$

$\left\|\frac{\partial h_t}{\partial h_l}\right\| = \left\|\prod_{k=l+1}^{t} \frac{\partial h_k}{\partial h_{k-1}}\right\| = \left\|\prod_{k=l+1}^{t} W_{hh}^{\top}\, \mathrm{diag}\!\left[f'(h_{k-1})\right]\right\| \le \left(\gamma_W\, \gamma_h\right)^{t-l}$

  • This product can become very small or very large quickly (vanishing/exploding gradients) [Bengio et al., 1994].
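A tiny numpy experiment makes the bound concrete: repeatedly multiplying a vector by (a scaled version of) the same matrix makes its norm decay or blow up exponentially. The matrices and scale factors here are arbitrary, purely for illustration:

    import numpy as np

    np.random.seed(0)
    H = 50
    W = np.random.randn(H, H) / np.sqrt(H)   # random "recurrent" matrix
    v = np.random.randn(H)

    for scale, label in [(0.5, "small weights"), (2.0, "large weights")]:
        g = v.copy()
        for _ in range(50):                  # 50 repeated Jacobian-like products
            g = (scale * W).T @ g
        print(label, "-> norm after 50 steps:", np.linalg.norm(g))
    # one case collapses toward 0 (vanishing), the other blows up (exploding)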

slide-54
SLIDE 54

Training RNNs is hard

  • Multiply the same matrix at each time step during forward prop
  • Ideally inputs from many time steps ago can modify output y
slide-55
SLIDE 55

The vanishing gradient problem: Example

  • In the case of language modeling, words from time steps far away are not taken into consideration when training to predict the next word.
  • Example: Jane walked into the room. John walked in too. It was late in the day. Jane said hi to ____

slide-56
SLIDE 56

Vanilla RNN Gradient Flow

slide-57
SLIDE 57

Vanilla RNN Gradient Flow

slide-58
SLIDE 58

Vanilla RNN Gradient Flow

Computing the gradient of h0 involves many factors of W (and repeated tanh):
– Largest singular value > 1: exploding gradients
– Largest singular value < 1: vanishing gradients

slide-59
SLIDE 59

Trick for exploding gradient: clipping trick

  • The solution, first introduced by Mikolov, is to clip gradients to a maximum value.
  • This makes a big difference in RNNs.
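A minimal sketch of gradient clipping by norm (one common variant; clipping each element to a fixed range, e.g. np.clip(grad, -5, 5), is even simpler):

    import numpy as np

    def clip_gradient(grad, max_norm=5.0):
        """Rescale the gradient if its norm exceeds max_norm (norm clipping)."""
        norm = np.linalg.norm(grad)
        if norm > max_norm:
            grad = grad * (max_norm / norm)
        return grad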
slide-60
SLIDE 60

Gradient clipping intuition

  • Error surface of a single hidden unit RNN

– High curvature walls

  • Solid lines: standard gradient descent trajectories
  • Dashed lines: gradients rescaled to a fixed size
slide-61
SLIDE 61

Vanilla RNN Gradient Flow

Computing the gradient of h0 involves many factors of W (and repeated tanh):
– Largest singular value > 1: exploding gradients → gradient clipping (scale the gradient if its norm is too big)
– Largest singular value < 1: vanishing gradients → change the RNN architecture

slide-62
SLIDE 62

For vanishing gradients: Initialization + ReLUs!

  • Initialize the recurrent weight matrix W to the identity matrix I and use ReLU activations.
  • New experiments with recurrent neural nets.

Le et al., A Simple Way to Initialize Recurrent Networks of Rectified Linear Units, 2015.
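A sketch of the idea from the paper (often called IRNN): initialize the recurrent matrix to the identity and replace tanh with ReLU. Sizes below are illustrative:

    import numpy as np

    H, D = 100, 50                        # hidden and input sizes (illustrative)
    Whh = np.eye(H)                       # recurrent weights initialized to the identity
    Wxh = np.random.randn(H, D) * 0.01    # small random input weights
    b = np.zeros(H)

    def irnn_step(x, h_prev):
        # ReLU replaces tanh; with Whh = I the state initially just accumulates inputs
        return np.maximum(0.0, Whh @ h_prev + Wxh @ x + b)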

slide-63
SLIDE 63

Better units for recurrent models

  • More complex hidden unit computation in the recurrence!

– $h_t = \mathrm{LSTM}(x_t, h_{t-1})$
– $h_t = \mathrm{GRU}(x_t, h_{t-1})$

  • Main ideas:

– keep around memories to capture long-distance dependencies
– allow error messages to flow at different strengths depending on the inputs

slide-64
SLIDE 64

Long Short Term Memory (LSTM)

slide-65
SLIDE 65

Long-short-term-memories (LSTMs)

  • Input gate (current cell matters): $i_t = \sigma\!\left(W_i\,[h_{t-1}, x_t] + b_i\right)$
  • Forget gate (0 means forget the past): $f_t = \sigma\!\left(W_f\,[h_{t-1}, x_t] + b_f\right)$
  • Output gate (how much the cell is exposed): $o_t = \sigma\!\left(W_o\,[h_{t-1}, x_t] + b_o\right)$
  • New memory cell: $\tilde{c}_t = \tanh\!\left(W_c\,[h_{t-1}, x_t] + b_c\right)$
  • Final memory cell: $c_t = i_t \circ \tilde{c}_t + f_t \circ c_{t-1}$
  • Final hidden state: $h_t = o_t \circ \tanh(c_t)$
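A numpy sketch of one LSTM step that follows the equations above literally (each W_* acts on the concatenation [h_{t-1}, x_t]):

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def lstm_step(x_t, h_prev, c_prev, Wi, Wf, Wo, Wc, bi, bf, bo, bc):
        z = np.concatenate([h_prev, x_t])      # [h_{t-1}, x_t]
        i = sigmoid(Wi @ z + bi)               # input gate
        f = sigmoid(Wf @ z + bf)               # forget gate
        o = sigmoid(Wo @ z + bo)               # output gate
        c_tilde = np.tanh(Wc @ z + bc)         # new memory candidate
        c = i * c_tilde + f * c_prev           # final memory cell
        h = o * np.tanh(c)                     # final hidden state
        return h, c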
slide-66
SLIDE 66

Some visualization

By Chris Olah: http://colah.github.io/posts/2015-08-Understanding-LSTMs/

slide-67
SLIDE 67

LSTM Gates

  • Gates are ways to let information through (or not):

– Forget gate: look at the previous cell state and the current input, and decide which information to throw away.
– Input gate: decide which information in the current state we want to update.
– Output gate: filter the cell state and output the filtered result.
– Gate (update) gate: propose new values for the cell state.

  • For instance: store the gender of the subject until another subject is seen.
slide-68
SLIDE 68

Long Short Term Memory (LSTM)

slide-69
SLIDE 69

Long Short Term Memory (LSTM)

[Hochreiter et al., 1997]

slide-70
SLIDE 70

Long Short Term Memory (LSTM)

[Hochreiter et al., 1997]

slide-71
SLIDE 71

Long Short Term Memory (LSTM)

[Hochreiter et al., 1997]

slide-72
SLIDE 72

Long Short Term Memory (LSTM)

[Hochreiter et al., 1997]

slide-73
SLIDE 73

Derivatives for LSTM

$z = f(x_1, x_2) = x_1 \circ x_2$

$\frac{\partial E}{\partial x_1} = \frac{\partial E}{\partial z}\,\frac{\partial z}{\partial x_1} = \frac{\partial E}{\partial z} \circ x_2$

slide-74
SLIDE 74

GRUs

  • Gated Recurrent Units (GRU) introduced by Cho et al. 2014
  • Update gate: $z_t = \sigma\!\left(W_z\,[h_{t-1}, x_t] + b_z\right)$
  • Reset gate: $r_t = \sigma\!\left(W_r\,[h_{t-1}, x_t] + b_r\right)$
  • New memory (candidate): $\tilde{h}_t = \tanh\!\left(W\,[r_t \circ h_{t-1},\, x_t] + b\right)$
  • Final memory: $h_t = z_t \circ h_{t-1} + (1 - z_t) \circ \tilde{h}_t$

If the reset gate is ~0, the unit ignores the previous memory and only stores the new input.
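And the corresponding numpy sketch of one GRU step, again a direct transcription of the equations above:

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def gru_step(x_t, h_prev, Wz, Wr, W, bz, br, b):
        hx = np.concatenate([h_prev, x_t])
        z = sigmoid(Wz @ hx + bz)                                      # update gate
        r = sigmoid(Wr @ hx + br)                                      # reset gate
        h_tilde = np.tanh(W @ np.concatenate([r * h_prev, x_t]) + b)   # candidate memory
        return z * h_prev + (1.0 - z) * h_tilde                        # final memory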
slide-75
SLIDE 75

GRU intuition

  • Units with long term dependencies have active update gates z
  • Illustration:
slide-76
SLIDE 76

GRU intuition

  • If the reset gate is close to 0, ignore the previous hidden state.

– Allows the model to drop information that is irrelevant in the future.

  • The update gate z controls how much of the past state should matter now.

– If z is close to 1, we can copy information in that unit through many time steps: less vanishing gradient!

  • Units with short-term dependencies often have very active reset gates.

slide-77
SLIDE 77

Other RNN Variants

slide-78
SLIDE 78

Which of these variants is best?

  • Do the differences matter?

– Greff et al. (2015) perform a comparison of popular variants, finding that they are all about the same.
– Jozefowicz et al. (2015) tested more than ten thousand RNN architectures, finding some that worked better than LSTMs on certain tasks.

slide-79
SLIDE 79

LSTM Achievements

  • LSTMs have essentially replaced n-grams as language models for speech.
  • Image captioning and other multi-modal tasks, which were very difficult with previous methods, are now feasible.
  • Many traditional NLP tasks work very well with LSTMs, but they are not necessarily the top performers: e.g., POS tagging and NER (Choi 2016).
  • Neural MT has broken away from the plateau of SMT, especially for grammaticality (partly because of characters/subwords), but is not yet industry strength.

[Ann Copestake, Overview of LSTMs and word2vec, 2016.] https://arxiv.org/ftp/arxiv/papers/1611/1611.00068.pdf

slide-80
SLIDE 80

Multi-layer RNN

slide-81
SLIDE 81

Bidirectional RNN

  • h is the memory, computed from the past memory and the current word. It summarizes the sentence up to that time.

slide-82
SLIDE 82

Bidirectional RNN

  • Problem: for classification you want to incorporate information from words both preceding and following.
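A schematic numpy sketch of a bidirectional pass: run one RNN left-to-right and another right-to-left, then concatenate their states at each position (rnn_step is a hypothetical single-direction step function, e.g. the vanilla update from earlier):

    import numpy as np

    def bidirectional_rnn(xs, H, fwd_params, bwd_params, rnn_step):
        """xs: list of input vectors; returns one [forward; backward] state per position."""
        T = len(xs)
        hf, hb = np.zeros(H), np.zeros(H)
        fwd, bwd = [], [None] * T
        for t in range(T):                          # left-to-right pass
            hf = rnn_step(xs[t], hf, fwd_params)
            fwd.append(hf)
        for t in reversed(range(T)):                # right-to-left pass
            hb = rnn_step(xs[t], hb, bwd_params)
            bwd[t] = hb
        return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]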

slide-83
SLIDE 83

Multilayer bidirectional RNN

  • Each memory layer passes an intermediate sequential representation to the next.

slide-84
SLIDE 84
Image Captioning

  • Explain Images with Multimodal Recurrent Neural Networks, Mao et al.
  • Deep Visual-Semantic Alignments for Generating Image Descriptions, Karpathy and Fei-Fei
  • Show and Tell: A Neural Image Caption Generator, Vinyals et al.
  • Long-term Recurrent Convolutional Networks for Visual Recognition and Description, Donahue et al.
  • Learning a Recurrent Visual Representation for Image Caption Generation, Chen and Zitnick

slide-85
SLIDE 85

Convolutional Neural Network → Recurrent Neural Network

slide-86
SLIDE 86

test image

slide-87
SLIDE 87

test image

slide-88
SLIDE 88

test image


slide-89
SLIDE 89

test image: feed the <START> token as the first input x0.

slide-90
SLIDE 90

test image: feed the <START> token as x0 and compute h0 and the output distribution y0.

before: h = tanh(Wxh * x + Whh * h)
now: h = tanh(Wxh * x + Whh * h + Wih * v), where v is the image feature vector and Wih its weight matrix.
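A one-line numpy sketch of this conditioned recurrence (v would be the CNN feature vector of the test image, e.g. taken from a late fully-connected layer; parameter names follow the formula above):

    import numpy as np

    def caption_rnn_step(x, h_prev, v, Wxh, Whh, Wih, bh):
        """Vanilla RNN step with an extra image-conditioning term Wih @ v."""
        return np.tanh(Wxh @ x + Whh @ h_prev + Wih @ v + bh)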

slide-91
SLIDE 91

Sample the first word from y0: "straw".

slide-92
SLIDE 92

Feed "straw" back in as the next input and compute h1, y1.

slide-93
SLIDE 93

Sample the next word from y1: "hat".

slide-94
SLIDE 94

Feed "hat" back in as the next input and compute h2, y2.

slide-95
SLIDE 95

Sample the <END> token from y2 => finish the caption ("straw hat").

slide-96
SLIDE 96

Image Sentence Datasets

Microsoft COCO [Tsung-Yi Lin et al. 2014], mscoco.org
currently: ~120K images, ~5 sentences each

slide-97
SLIDE 97

Image Captioning: Example Results

slide-98
SLIDE 98

Image Captioning: Failure Cases

slide-99
SLIDE 99

RNN: Summary

  • RNNs allow a lot of flexibility in architecture design
  • Vanilla RNNs are simple but don’t work very well
  • Backward flow of gradients in RNN can explode or vanish.

– Exploding is controlled with gradient clipping.
– Vanishing is controlled with additive interactions (LSTM).

  • It is common to use LSTM or GRU: their additive interactions improve gradient flow.

  • Better/simpler architectures are a hot topic of current research
  • Better understanding (both theoretical and empirical) is needed.