
slide-1
SLIDE 1

Recurrent Neural Networks: Stability analysis and LSTMs

  • M. Soleymani

Sharif University of Technology, Spring 2019. Most slides have been adopted from Bhiksha Raj, 11-785, CMU 2019, and some from Fei-Fei Li and colleagues' lectures, cs231n, Stanford 2017.

slide-2
SLIDE 2

Story so far

  • Iterated structures are good for analyzing time series data with short-time dependence on the past

– These are “Time delay Neural Nets” (TDNNs), AKA convnets

[Figure: a TDNN sliding over a stock vector X(t)…X(t+7) to produce Y(t+6)]

2

slide-3
SLIDE 3

Story so far

  • Recurrent structures are good for analyzing time series data with long-term dependence on the past

– These are recurrent neural networks

[Figure: an RNN unrolled over time, mapping X(t) to Y(t), starting from h(-1) at t=0]

3

slide-4
SLIDE 4

Recurrent structures can do what static structures cannot

  • The addition problem: Add two N-bit numbers to produce an (N+1)-bit number

– Input is binary
– Will require a large number of training instances

  • Output must be specified for every pair of inputs
  • Weights that generalize will make errors

– A network trained for N-bit numbers will not work for (N+1)-bit numbers

[Figure: an MLP mapping two N-bit input strings to an (N+1)-bit output string]

4

slide-5
SLIDE 5
  • The addition problem: Add two N-bit numbers to produce an (N+1)-bit number

  • RNN solution: Very simple; can add two numbers of any size (a small simulation sketch follows this slide)

[Figure: a single RNN unit adding one bit pair per step, passing the previous carry forward as its state]

5

MLPs vs RNNs
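The carry recurrence is easy to simulate directly. Below is a minimal sketch (not from the slides; the function name serial_add is hypothetical) in which a single carry value plays the role of the RNN's hidden state. Because the same update is applied at every bit position, the same rule works for numbers of any length.

def serial_add(a_bits, b_bits):
    """Add two equal-length bit lists, LSB first; return the N+1 result bits."""
    carry = 0                      # the "hidden state"
    out = []
    for a, b in zip(a_bits, b_bits):
        s = a + b + carry          # same update at every time step
        out.append(s % 2)
        carry = s // 2
    out.append(carry)              # the final carry is the (N+1)-th bit
    return out

# Example: 6 + 7 = 13  ->  [1, 0, 1, 1] LSB-first
print(serial_add([0, 1, 1], [1, 1, 1]))

A trained single-unit RNN learns essentially this per-step update, which is why it generalizes to inputs of arbitrary length.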

slide-6
SLIDE 6

MLP: The parity problem

  • Is the number of “ones” even or odd?
  • The network must be complex to capture all patterns

– Essentially an XOR network, quite complex
– Fixed input size

[Figure: an MLP mapping the full bit string 1 0 0 0 1 1 0 0 1 0 to the single parity bit 1]

6

slide-7
SLIDE 7

RNN: The parity problem

  • Trivial solution (a tiny sketch follows this slide)
  • Generalizes to input of any size

[Figure: a single RNN unit processing one bit per step, feeding its previous output back as state]

7
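The trivial recurrent solution is a running XOR. A minimal sketch (not from the slides; plain Python) of that recurrence:

def parity(bits):
    state = 0                  # hidden state: parity so far
    for b in bits:
        state = state ^ b      # XOR with the current input bit
    return state               # 1 if the number of ones is odd

print(parity([1, 0, 0, 0, 1, 1, 0, 0, 1, 0]))  # -> 0 (four ones, even)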

slide-8
SLIDE 8
  • Recurrent structures can be trained by minimizing the loss between the sequence of outputs and the sequence of desired outputs

– Through gradient descent and backpropagation

[Figure: an unrolled RNN with a loss computed between Y(t) and Ydesired(t) at each time step]

8

Story so far

slide-9
SLIDE 9

[Figure: unrolled network starting from h0, producing outputs Y(1)…Y(T), compared with the desired outputs through an overall divergence DIV(1…T)]

  • The loss computed is between the sequence of outputs produced by the network and the desired sequence of outputs
  • This is not just the sum of the divergences at individual times

– Unless we explicitly define it that way

9

Back Propagation Through Time

slide-10
SLIDE 10
  • Usual assumption: the sequence divergence is the sum of the divergences at the individual instants (a short numpy sketch follows this slide)

Div(Y(1…T), Ytarget(1…T)) = Σ_t Div(Y(t), Ytarget(t))

∇_{Y(t)} Div(Y(1…T), Ytarget(1…T)) = ∇_{Y(t)} Div(Y(t), Ytarget(t))

[Figure: unrolled RNN from h0 at t=1, with a per-step DIVERGENCE between Y(t) and Ytarget(t)]

10

Time-synchronous recurrence
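A minimal sketch of this assumption (assuming an L2 divergence, which is not specified on the slide): the sequence loss is the sum of per-step divergences, so the gradient with respect to Y(t) involves only the divergence at time t.

import numpy as np

def seq_loss(Y, Y_target):
    # Y, Y_target: arrays of shape (T, output_dim)
    per_step = 0.5 * np.sum((Y - Y_target) ** 2, axis=1)   # Div(Y(t), Ytarget(t))
    return per_step.sum()                                  # sum over time

def seq_loss_grad(Y, Y_target):
    return Y - Y_target          # dLoss/dY(t) = Y(t) - Ytarget(t), one term per step

Y = np.random.randn(5, 3); D = np.random.randn(5, 3)
print(seq_loss(Y, D), seq_loss_grad(Y, D).shape)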

slide-11
SLIDE 11
  • In linear systems, long-term behavior depends entirely on the eigenvalues of the hidden-layer weight matrix

– If the largest eigenvalue is greater than 1, the system will “blow up”
– If it is less than 1, the response will “vanish” very quickly

11

Long-term behavior of RNNs

slide-12
SLIDE 12

“BIBO” Stability

  • “Bounded Input Bounded Output” stability

– This is a highly desirable characteristic

12

slide-13
SLIDE 13
  • Time-delay structures have bounded output if

– The function g() has bounded output for bounded input

  • Which is true of almost every activation function

– The input X(t) is bounded

[Figure: a TDNN over X(t+1)…X(t+7) producing Y(t+5)]

13

“BIBO” Stability

  • Returning to an old model..

Y(t) = g(X(t), X(t−1), …, X(t−L))

  • When will the output “blow up”?
slide-14
SLIDE 14
  • Will an RNN necessarily be BIBO?

[Figure: an unrolled RNN over X(t), Y(t), starting from h0 at t=1]

14

Is this BIBO?

slide-15
SLIDE 15
  • Will this necessarily be BIBO?

– Guaranteed if the output and hidden activations are bounded

  • But will it saturate (and where)?

– What if the activations are linear?

[Figure: an unrolled RNN over X(t), Y(t), starting from h0 at t=1]

15

Is this BIBO?

slide-16
SLIDE 16
  • It is sufficient to analyze the behavior of the hidden layer h(t), since it carries all the relevant information

– We will assume only a single hidden layer for simplicity

[Figure: an unrolled RNN over X(t), Y(t), starting from h0 at t=1]

16

Analyzing recurrence

slide-17
SLIDE 17

Analyzing Recursion

17

slide-18
SLIDE 18
  • Linear systems are easier to analyze

– We will attempt to extrapolate to non-linear systems subsequently

  • All activations are identity functions

– z(t) = W_h h(t−1) + W_x x(t),  h(t) = z(t)

[Figure: an unrolled RNN over X(t), Y(t), starting from h0 at t=1]

18

Streetlight effect

slide-19
SLIDE 19

Linear systems

  • h(t) = W_h h(t−1) + W_x x(t)

– h(t−1) = W_h h(t−2) + W_x x(t−1)

  • h(t) = W_h² h(t−2) + W_h W_x x(t−1) + W_x x(t)

  • h(t) = W_h^t h(0) + W_h^(t−1) W_x x(1) + W_h^(t−2) W_x x(2) + ⋯ + W_x x(t)

19

slide-20
SLIDE 20

Streetlight effect

  • It is sufficient to analyze the response to a single input at t = 1

– Principle of superposition in linear systems

[Figure: an unrolled RNN over X(t), Y(t), starting from h0 at t=1]

20

slide-21
SLIDE 21
  • Consider a simple, scalar, linear recursion (note the change of notation)

– h(t) = w_h h(t−1) + w_x x(t)
– h₁(t) = w_h^(t−1) w_x x(1)

  • Response to a single input at t = 1

21

Linear recursions

[Figure: plots of the scalar response h₁(t) for different values of w_h]

slide-22
SLIDE 22
  • Vector linear recursion (note the change of notation)

– h(t) = W_h h(t−1) + W_x x(t)
– h₁(t) = W_h^(t−1) W_x x(1)

  • The length of the response vector to a single input at t = 1 is |h₁(t)|
  • We can write W_h = V Λ V⁻¹

– W_h vᵢ = λᵢ vᵢ
– For any vector w we can write

  • w = b₁v₁ + b₂v₂ + ⋯ + bₙvₙ
  • W_h w = b₁λ₁v₁ + b₂λ₂v₂ + ⋯ + bₙλₙvₙ
  • W_h^t w = b₁λ₁^t v₁ + b₂λ₂^t v₂ + ⋯ + bₙλₙ^t vₙ

– lim_{t→∞} W_h^t w = b_m λ_m^t v_m, where m = argmaxⱼ |λⱼ|

(A numerical sketch of this growth/decay follows this slide.)

22

Linear recursions: Vector version
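A minimal numerical sketch of the statement above (hypothetical 4×4 weights, W_x taken as the identity for simplicity): the response to a single input at t=1 shrinks at the geometric rate set by the largest-magnitude eigenvalue of W_h.

import numpy as np

np.random.seed(0)
n = 4
W_h = np.random.randn(n, n)
W_h *= 0.9 / np.max(np.abs(np.linalg.eigvals(W_h)))   # rescale so lambda_max = 0.9
lam_max = np.max(np.abs(np.linalg.eigvals(W_h)))

h = np.ones(n)                 # h(1) = W_x x(1) with W_x = I, x(1) = [1,1,1,1]
for t in range(2, 51):
    h = W_h @ h                # h(t) = W_h^(t-1) h(1)
    if t % 10 == 0:
        # the norm shrinks at (roughly) the rate lam_max**(t-1)
        print(t, np.linalg.norm(h), lam_max ** (t - 1))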

slide-23
SLIDE 23
  • Vector linear recursion (note the change of notation)

– h(t) = W_h h(t−1) + W_x x(t)
– h₁(t) = W_h^(t−1) W_x x(1)

(Derivation repeated from the previous slide.)

For any input, for large t the length of the hidden vector will expand or contract according to the t-th power of the largest eigenvalue of the hidden-layer weight matrix.

23

Linear recursions: Vector version

slide-24
SLIDE 24
  • Vector linear recursion (note the change of notation)

– h(t) = W_h h(t−1) + W_x x(t)
– h₁(t) = W_h^(t−1) W_x x(1)

(Derivation repeated from the previous slide.)

For any input, for large t the length of the hidden vector will expand or contract according to the t-th power of the largest eigenvalue of the hidden-layer weight matrix.

Unless it has no component along the eigenvector corresponding to the largest eigenvalue. In that case it will grow according to the second-largest eigenvalue.. And so on..

24

Linear recursions: Vector version

slide-25
SLIDE 25
  • Vector linear recursion (note the change of notation)

– h(t) = W_h h(t−1) + W_x x(t)
– h₁(t) = W_h^(t−1) W_x x(1)

(Derivation repeated from the previous slide.)

If |λ_max| > 1 it will blow up; otherwise it will contract and shrink to 0 rapidly.

For any input, for large t the length of the hidden vector will expand or contract according to the t-th power of the largest eigenvalue of the hidden-layer weight matrix.

Unless it has no component along the eigenvector corresponding to the largest eigenvalue. In that case it will grow according to the second-largest eigenvalue.. And so on..

25

Linear recursions: Vector version

slide-26
SLIDE 26
  • Vector linear recursion (note the change of notation)

– h(t) = W_h h(t−1) + W_x x(t)
– h₁(t) = W_h^(t−1) W_x x(1)

(Derivation repeated from the previous slide.)

If |λ_max| > 1 it will blow up; otherwise it will contract and shrink to 0 rapidly. What about at middling values of t? It will depend on the other eigenvalues.

For any input, for large t the length of the hidden vector will expand or contract according to the t-th power of the largest eigenvalue of the hidden-layer weight matrix.

Unless it has no component along the eigenvector corresponding to the largest eigenvalue. In that case it will grow according to the second-largest eigenvalue.. And so on..

26

Linear recursions: Vector version

slide-27
SLIDE 27
  • Vector linear recursion

– h(t) = W_h h(t−1) + W_x x(t)
– h₁(t) = W_h^(t−1) W_x x(1)

  • Response to a single input [1 1 1 1] at t = 1

[Figure: |h₁(t)| over time for λ_max = 0.9, 1.0 and 1.1]

27

Linear recursions

slide-28
SLIDE 28
  • Vector linear recursion

– h(t) = W_h h(t−1) + W_x x(t)
– h₁(t) = W_h^(t−1) W_x x(1)

  • Response to a single input [1 1 1 1] at t = 1

[Figure: |h₁(t)| for λ_max = 0.9, 1.0 and 1.1; with complex eigenvalues (λ_2nd = 0.5 and 0.1) the response oscillates]

28

Linear recursions

slide-29
SLIDE 29
  • In linear systems, long-term behavior depends entirely on the eigenvalues of the hidden-layer weight matrix

– If the largest eigenvalue is greater than 1, the system will “blow up”
– If it is less than 1, the response will “vanish” very quickly
– Complex eigenvalues cause an oscillatory response

  • Which we may or may not want
  • For smooth behavior, the weight matrix must be forced to have real eigenvalues
  • e.g. a symmetric weight matrix

29

Lesson..

slide-30
SLIDE 30
  • The behavior of scalar non-linearities (a small simulation sketch follows this slide)
  • Left: Sigmoid, Middle: Tanh, Right: ReLU

– Sigmoid: Saturates in a limited number of steps, regardless of w_h
– Tanh: Sensitive to w_h, but eventually saturates

  • “Prefers” weights close to 1.0

– ReLU: Sensitive to w_h, can blow up

h(t) = g(w_h h(t−1) + w_x x(t))

30

How about non-linearities (scalar)
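A minimal sketch (hypothetical weights, single excitation at t=1, no plots): iterate the scalar recursion h(t) = g(w_h h(t−1)) for each activation and look at the long-run value.

import numpy as np

def run(g, w_h, h0=1.0, T=20):
    h = h0
    for _ in range(T):
        h = g(w_h * h)          # x(t) = 0 after the initial excitation
    return h

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
relu    = lambda z: np.maximum(0.0, z)

for name, g in [("sigmoid", sigmoid), ("tanh", np.tanh), ("relu", relu)]:
    for w in [0.9, 1.1, 2.0]:
        print(name, w, round(float(run(g, w)), 3))
# sigmoid settles to a fixed point regardless of w_h; tanh eventually saturates
# (or decays to 0 for w_h < 1); ReLU with w_h > 1 blows up.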

slide-31
SLIDE 31
  • With a negative start
  • Left: Sigmoid, Middle: Tanh, Right: ReLU

– Sigmoid: Saturates in a limited number of steps, regardless of w_h
– Tanh: Sensitive to w_h, but eventually saturates
– ReLU: For negative starts, has no response

h(t) = g(w_h h(t−1) + w_x x(t))

31

How about non-linearities (scalar)

slide-32
SLIDE 32
  • Assuming a uniform unit-vector initialization

– [1, 1, 1, …]/√N

  • Behavior is similar to the scalar recursion

– Interestingly, ReLU is more prone to blowing up (why?)

  • Eigenvalues less than 1.0 retain the most “memory”

h(t) = g(W_h h(t−1) + W_x x(t))

[Figure: vector recursion behavior for sigmoid, tanh and ReLU activations]

32

Vector Process

slide-33
SLIDE 33

Stability Analysis

  • Formal stability analysis considers convergence of “Lyapunov” functions

– Alternately, Routh’s criterion and/or pole-zero analysis
– Positive definite functions evaluated at h
– Conclusions are similar: only the tanh activation gives us any reasonable behavior

  • And even then it has very short “memory”
  • Lessons:

– Bipolar activations (e.g. tanh) have the best memory behavior
– Still sensitive to the eigenvalues of W
– Best-case memory is short
– Exponential memory behavior

  • “Forgets” in an exponential manner

33

slide-34
SLIDE 34
  • Recurrent networks retain information from the infinite past in principle
  • In practice, they tend to blow up or forget

– If the largest eigenvalue of the recurrent weight matrix is greater than 1, the network response may blow up
– If it is less than one, the response dies down very quickly

  • The “memory” of the network also depends on the activation of the hidden units

– Sigmoid activations saturate and the network becomes unable to retain new information
– ReLUs blow up
– Tanh activations are the most effective at storing memory

  • But still, not for very long

34

Story so far

slide-35
SLIDE 35

RNNs..

  • Excellent models for time-series analysis tasks

– Time-series prediction
– Time-series classification
– Sequence prediction..
– They can even simplify problems that are difficult for MLPs

  • But the memory isn’t all that great..

– Also..

35

slide-36
SLIDE 36

The vanishing gradient problem

  • A particular problem with training deep networks..

– (Any deep network, not just recurrent nets)
– The gradient of the error with respect to the weights is unstable..

36

slide-37
SLIDE 37

Reminder: Training deep networks

Output = a[L] = g(z[L]) = g(W[L] a[L−1]) = g(W[L] g(W[L−1] a[L−2])) = g(W[L] g(W[L−1] ⋯ g(W[2] g(W[1] x))))

For convenience, we use the same activation function for all layers. However, output-layer neurons most commonly do not need an activation function (they produce class scores or real-valued targets).

[Figure: feed-forward chain x → W[1] → g → a[1] → W[2] → g → ⋯ → W[L] → g → a[L] = output]

37

slide-38
SLIDE 38
  • For

Loss(x) = F( g[L]( W[L] g[L−1]( W[L−1] g[L−2]( ⋯ W[1] x ))) )

  • We get:

∇_{a[l]} Loss = ∇_{a[L]} Loss · ∇g[L] · W[L] · ∇g[L−1] · W[L−1] ⋯ ∇g[l+1] · W[l+1]

  • Where

– ∇_{a[l]} Loss is the gradient of the error w.r.t. the output of the l-th layer of the network

  • Needed to compute the gradient of the error w.r.t. W[l]

– ∇g[l] is the Jacobian of g[l] w.r.t. its current input
– All blue terms are matrices

38

Reminder: Training deep networks

slide-39
SLIDE 39

Reminder: Gradient problems in deep networks

  • The gradients in the lower/earlier layers can explode or vanish

– Resulting in insignificant or unstable gradient descent updates
– The problem gets worse as network depth increases

∇_{a[l]} Loss = ∇_{a[L]} Loss · ∇g[L] · W[L] · ∇g[L−1] · W[L−1] ⋯ ∇g[l+1] · W[l+1]

39

slide-40
SLIDE 40
  • As we go back through the layers, the Jacobians of the activations constantly shrink the derivative

– After a few layers the derivative of the loss at any time is totally “forgotten”

∇_{a[l]} Loss = ∇_{a[L]} Loss · ∇g[L] · W[L] · ∇g[L−1] · W[L−1] ⋯ ∇g[l+1] · W[l+1]

40

Reminder: Training deep networks

slide-41
SLIDE 41
  • ∇g() is the derivative of the output of the (layer of) hidden recurrent neurons with respect to their input

– For vector activations: a full matrix
– For scalar activations: a matrix where the diagonal entries are the derivatives of the activation of the recurrent hidden layer

h_i(t) = g(z_i(t)),   ∇g(z(t)) = diag( g′(z_1(t)), g′(z_2(t)), …, g′(z_N(t)) )

41

The Jacobian of the hidden layers for an RNN

slide-42
SLIDE 42
  • The derivative (or subgradient) of the activation function is always bounded

– The diagonal entries (or singular values) of the Jacobian are bounded

  • There is a limit on how much multiplying a vector by the Jacobian will scale it

42

The Jacobian

h_i(t) = g(z_i(t)),   ∇g(z(t)) = diag( g′(z_1(t)), g′(z_2(t)), …, g′(z_N(t)) )

slide-43
SLIDE 43

The derivative of the hidden state activation

  • The most common activation functions, such as sigmoid, tanh() and ReLU, have derivatives that are never greater than 1

  • The most common activation for the hidden units in an RNN is tanh()

– The derivative of tanh() is never greater than 1 (and mostly less than 1)

  • Multiplication by the Jacobian is always a shrinking operation

43

slide-44
SLIDE 44

∇_{h(t)} Loss = ∇_{h(T)} Loss · ∇g(T) · W · ∇g(T−1) · W ⋯ ∇g(t+1) · W

  • In a single-layer RNN, the weight matrices at every step are identical

– The conclusion below holds for any deep network, though

  • The chain product for ∇_{h(t)} Loss will

– Expand ∇_{h(T)} Loss along directions in which the singular values of the weight matrices are greater than 1
– Shrink ∇_{h(T)} Loss in directions where the singular values are less than 1
– Repeated multiplication by the weight matrix will result in exploding or vanishing gradients

44

What about the weights

slide-45
SLIDE 45

∇_{h(t)} Loss = ∇_{h(T)} Loss · ∇g(T) · W · ∇g(T−1) · W ⋯ ∇g(t+1) · W

  • Every blue term is a matrix
  • ∇_{h(T)} Loss is proportional to the actual loss

– Particularly for L2 and KL divergence

  • The chain product for ∇_{h(t)} Loss will

– Expand it in directions where each stage has singular values greater than 1
– Shrink it in directions where each stage has singular values less than 1

45

Exploding/Vanishing gradients

slide-46
SLIDE 46

Training RNN

46

slide-47
SLIDE 47

Training RNN

h_k = g(W_hh h_{k−1} + W_xh x_k)

∂h_k/∂h_{k−1} = W_hhᵀ diag(g′(z_k))

‖∂h_k/∂h_{k−1}‖ ≤ ‖W_hhᵀ‖ ‖diag(g′(z_k))‖ ≤ γ_W γ_g

∂h_t/∂h_k = ∏_{j=k+1}^{t} ∂h_j/∂h_{j−1} = ∏_{j=k+1}^{t} W_hhᵀ diag(g′(z_j)),   so   ‖∂h_t/∂h_k‖ ≤ (γ_W γ_g)^{t−k}

  • This can become very small or very large quickly (vanishing/exploding gradients)

[Bengio et al., 1994]

Elementwise: ∂h_{k,m}/∂h_{k−1,n} = [W_hh]_{m,n} g′(z_{k,m})

(A numerical sketch of this bound follows this slide.)

47
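A minimal numerical sketch of the bound above (hypothetical tanh RNN with the recurrent weights rescaled so γ_W = 0.95; γ_g = 1 for tanh, since |tanh′| ≤ 1): the measured norm of ∂h_t/∂h_0 always stays below (γ_W γ_g)^t and vanishes geometrically.

import numpy as np

np.random.seed(1)
n, T = 8, 40
W_hh = np.random.randn(n, n)
W_hh *= 0.95 / np.linalg.svd(W_hh, compute_uv=False)[0]   # largest singular value = 0.95
W_xh = np.random.randn(n, n) / np.sqrt(n)
x = np.random.randn(T + 1, n)
gamma_W = np.linalg.svd(W_hh, compute_uv=False)[0]

h = np.zeros(n)
J = np.eye(n)                                   # accumulates d h_t / d h_0
for t in range(1, T + 1):
    z = W_hh @ h + W_xh @ x[t]
    h = np.tanh(z)
    J = np.diag(1.0 - h ** 2) @ W_hh @ J        # one more chain-rule factor
    if t % 10 == 0:
        print(t, np.linalg.norm(J, 2), gamma_W ** t)   # measured norm vs. bound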

slide-48
SLIDE 48
  • The relation between X(1) and Y(T) is one of a very deep network

– Gradients from errors at t = T will vanish by the time they’re propagated back to t = 1

[Figure: unrolled recurrence from X(1), through hidden states starting at h_f(0), to Y(T)]

48

Recurrent nets are very deep nets

slide-49
SLIDE 49

Training RNNs is hard

  • The unrolled network can be very deep, and inputs from many time steps ago can modify the output

– The unrolled network is very deep

  • The same matrix is multiplied in at each time step during the forward pass

49

slide-50
SLIDE 50

The vanishing gradient problem: Example

  • In the case of language modeling, words from time steps far away are not taken into consideration when training to predict the next word

  • Example: Jane walked into the room. John walked in too. It was late in the day. Jane said hi to ____

This slide has been adapted from Socher's lectures, cs224d, Stanford, 2017

50

slide-51
SLIDE 51

The long-term dependency problem

  • Must know to “remember” for extended periods of time and “recall” when necessary

– Can be performed with a multi-tap recursion, but how many taps?
– Need an alternate way to “remember” stuff

51

slide-52
SLIDE 52
  • Recurrent networks retain information from the infinite past in principle
  • In practice, they are poor at memorization

– The hidden outputs can blow up, or shrink to zero, depending on the eigenvalues of the recurrent weight matrix
– The memory is also a function of the activation of the hidden units

  • Tanh activations are the most effective at retaining memory, but even they don’t hold it very long

  • Deep networks also suffer from a “vanishing or exploding gradient” problem

– The gradient of the error at the output gets concentrated into a small number of parameters in the earlier layers, and goes to zero for others

52

Story so far

slide-53
SLIDE 53

Vanilla RNN Gradient Flow

53

slide-54
SLIDE 54

Vanilla RNN Gradient Flow

54

slide-55
SLIDE 55

Vanilla RNN Gradient Flow

Computing the gradient of h0 involves many factors of W (and repeated tanh)

– Largest singular value > 1: Exploding gradients
– Largest singular value < 1: Vanishing gradients

55

slide-56
SLIDE 56

Trick for exploding gradient: clipping trick

  • The solution, first introduced by Mikolov, is to clip gradients to a maximum value.

  • This makes a big difference in RNNs. (A small sketch of norm clipping follows this slide.)

56
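A minimal sketch (not from the slides) of clipping by the global gradient norm, the variant used in most modern implementations; frameworks provide this built in (e.g. torch.nn.utils.clip_grad_norm_), so this is only illustrative.

import numpy as np

def clip_by_global_norm(grads, max_norm=5.0):
    # grads: list of gradient arrays for all parameters
    total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total > max_norm:
        scale = max_norm / total
        grads = [g * scale for g in grads]     # rescale, preserving direction
    return grads, total

g1, g2 = np.random.randn(100, 100) * 10, np.random.randn(100) * 10
(c1, c2), norm = clip_by_global_norm([g1, g2], max_norm=5.0)
print(norm, np.sqrt(np.sum(c1 ** 2) + np.sum(c2 ** 2)))   # original norm, clipped norm (~5)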

slide-57
SLIDE 57

Gradient clipping intuition

  • Error surface of a single-hidden-unit RNN

– High-curvature walls

  • Solid lines: standard gradient descent trajectories
  • Dashed lines: gradients rescaled to a fixed size

57

slide-58
SLIDE 58

Vanilla RNN Gradient Flow

Computing the gradient of h0 involves many factors of W (and repeated tanh)

– Largest singular value > 1: Exploding gradients → Gradient clipping: scale the gradient if its norm is too big
– Largest singular value < 1: Vanishing gradients → Change the RNN architecture

58

slide-59
SLIDE 59

For vanishing gradients: Initialization + ReLus!

  • Initialize the recurrent weight matrix W to the identity matrix I and use ReLU activations (a small sketch follows this slide)
  • New experiments with recurrent neural nets.

Le et al., A Simple Way to Initialize Recurrent Networks of Rectified Linear Units, 2015.

59
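A minimal sketch of the IRNN idea from Le et al. (hypothetical shapes and scale for the input weights): the recurrent matrix starts as the identity and the activation is ReLU, so at initialization the hidden state is simply copied forward and the new input is added.

import numpy as np

def irnn_init(n_hidden, n_input):
    W_hh = np.eye(n_hidden)                        # identity recurrent matrix
    W_xh = np.random.randn(n_hidden, n_input) * 0.01
    b = np.zeros(n_hidden)
    return W_hh, W_xh, b

def irnn_step(h_prev, x, W_hh, W_xh, b):
    return np.maximum(0.0, W_hh @ h_prev + W_xh @ x + b)   # ReLU activation

W_hh, W_xh, b = irnn_init(16, 8)
h = np.zeros(16)
for x in np.random.randn(20, 8):
    h = irnn_step(h, x, W_hh, W_xh, b)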

slide-60
SLIDE 60

Better units for recurrent models

  • More complex hidden unit computation in the recurrence!

– h(t) = LSTM(x(t), h(t−1))
– h(t) = GRU(x(t), h(t−1))

  • Main ideas:

– Keep around memories to capture long-distance dependencies
– Allow error messages to flow at different strengths depending on the inputs

60

slide-61
SLIDE 61

And now we enter the domain of..

61

slide-62
SLIDE 62
  • Can we replace this with something that doesn’t fade or blow up?
  • Can we have a network that just “remembers” arbitrarily long, to be recalled on demand?

– It should not be directly dependent on the vagaries of the network parameters, but rather on an input-based determination of whether something must be remembered
– Replace them, e.g., by a function of the input that decides if things must be forgotten or not

62

Exploding/Vanishing gradients

slide-63
SLIDE 63

Enter the LSTM

  • Long Short-Term Memory
  • Explicitly latch information to prevent decay / blowup
  • Following notes borrow liberally from
  • http://colah.github.io/posts/2015-08-Understanding-LSTMs/

63

slide-64
SLIDE 64

Standard RNN

  • Recurrent neurons receive past recurrent outputs and the current input as inputs
  • These are processed through a tanh() activation function

– As mentioned earlier, tanh() is the generally used activation for the hidden layer

  • The current recurrent output is passed to the next higher layer and to the next time instant

64

slide-65
SLIDE 65

Some visualization

65

slide-66
SLIDE 66

Long Short-Term Memory

  • The σ() are multiplicative gates that decide if something is important or not
  • Remember, every line actually represents a vector

66

slide-67
SLIDE 67

LSTM: Constant Error Carousel

  • Key component: a remembered cell state

67

slide-68
SLIDE 68

LSTM: CEC

  • C(t) is the linear history
  • It carries information through, affected only by a gate

– And by the addition of history, which too is gated..

68

slide-69
SLIDE 69

LSTM: Gates

  • Gates are simple sigmoidal units with outputs in the range (0,1)
  • They control how much of the information is to be let through

69

slide-70
SLIDE 70
  • The first gate determines whether to carry over the history or to forget it

– More precisely, how much of the history to carry over
– Also called the “forget” gate
– Note, we are actually distinguishing between the cell memory C and the state h that is carried over time! They are related, though

70

LSTM: Forget gate

slide-71
SLIDE 71

LSTM: Input gate

  • The second input has two parts

– A perceptron layer that determines if there is something new and interesting in the input
– A gate that decides if it is worth remembering
– If so, it is added to the current memory cell

71

slide-72
SLIDE 72

LSTM: Memory cell update

  • The second input has two parts

– A perceptron layer that determines if there is something interesting in the input
– A gate that decides if it is worth remembering
– If so, it is added to the current memory cell

72

slide-73
SLIDE 73

LSTM: Output and Output gate

  • The output of the cell

– Simply compress it with tanh to make it lie between −1 and 1

  • Note that this compression no longer affects our ability to carry memory forward

– Controlled by an output gate

  • To decide if the memory contents are worth reporting at this time

73

slide-74
SLIDE 74

Long Short-Term Memories (LSTMs)

  • Input gate (current cell matters): i(t) = σ(W_i [h(t−1), x(t)] + b_i)

  • Forget gate (gate 0, forget past): f(t) = σ(W_f [h(t−1), x(t)] + b_f)

  • Output gate (how much the cell is exposed): o(t) = σ(W_o [h(t−1), x(t)] + b_o)

  • New memory cell: c̃(t) = tanh(W_c [h(t−1), x(t)] + b_c)

  • Final memory cell: c(t) = i(t) ∘ c̃(t) + f(t) ∘ c(t−1)

  • Final hidden state: h(t) = o(t) ∘ tanh(c(t))

(A numpy sketch of one LSTM step follows this slide.)

74
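A minimal numpy sketch of the equations above (hypothetical shapes: each weight matrix acts on the concatenated vector [h(t−1), x(t)]):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_i, W_f, W_o, W_c, b_i, b_f, b_o, b_c):
    v = np.concatenate([h_prev, x_t])
    i_t = sigmoid(W_i @ v + b_i)          # input gate
    f_t = sigmoid(W_f @ v + b_f)          # forget gate
    o_t = sigmoid(W_o @ v + b_o)          # output gate
    c_tilde = np.tanh(W_c @ v + b_c)      # new memory candidate
    c_t = i_t * c_tilde + f_t * c_prev    # final memory cell
    h_t = o_t * np.tanh(c_t)              # final hidden state
    return h_t, c_t

n_h, n_x = 4, 3
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(n_h, n_h + n_x)) for _ in range(4)]
bs = [np.zeros(n_h) for _ in range(4)]
h, c = np.zeros(n_h), np.zeros(n_h)
for x in rng.normal(size=(5, n_x)):
    h, c = lstm_step(x, h, c, *Ws, *bs)
print(h)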

slide-75
SLIDE 75

75

  • i(t): input gate, how much of the new information will be let through to the memory cell.
  • f(t): forget gate, responsible for what information should be thrown away from the memory cell.
  • o(t): output gate, how much of the information will be passed on to the next time step.
  • c̃(t): candidate memory (the self-recurrent term, which is equal to a standard RNN update)
  • c(t): internal memory of the memory cell
  • h(t): hidden state
  • y: final output

LSTM Equations

slide-76
SLIDE 76

LSTM Gates

  • Gates are ways to let information through (or not):

– Forget gate: look at the previous cell state and the current input, and decide which information to throw away.
– Input gate: see which information in the current state we want to update.
– Output gate: filter the cell state and output the filtered result.
– Gate (or update) gate: propose new values for the cell state.

  • For instance: store the gender of the subject until another subject is seen.

76

slide-77
SLIDE 77

LSTM: The “Peephole” Connection

  • The raw memory is informative by itself and can also be an input

– Note, we are using both C and h

77

slide-78
SLIDE 78
  • Forward rules:

[Figure: LSTM cell taking x(t), h(t−1) and C(t−1); the gates f(t), i(t), o(t) use sigmoid (σ) units, the candidate C̃(t) and the output compression use tanh; the cell emits C(t) and h(t)]

78

Backpropagation rules: Forward

slide-79
SLIDE 79

# Continuing from the previous slide
# Note: [W,b] is the set of parameters, whose individual elements are
# shown in red within the code. These are passed in.
# Static local variables which aren't required outside this cell
static local zf, zi, zc, zo, f, i, o, Ci

function [Co, ho] = LSTM_cell.forward(C, h, x, [W,b])
    zf = Wfc C + Wfh h + Wfx x + bf
    f  = sigmoid(zf)        # forget gate
    zi = Wic C + Wih h + Wix x + bi
    i  = sigmoid(zi)        # input gate
    zc = Wcc C + Wch h + Wcx x + bc
    Ci = tanh(zc)           # detecting input pattern
    Co = f∘C + i∘Ci         # "∘" is component-wise multiply
    zo = Woc Co + Woh h + Wox x + bo
    o  = sigmoid(zo)        # output gate
    ho = o∘tanh(Co)         # "∘" is component-wise multiply
    return Co, ho

79

LSTM cell forward

slide-80
SLIDE 80

# Assuming h(0,*) is known and C(0,*) = 0
# Assuming L hidden-state layers and an output layer
# Note: LSTM_cell is an indexed class with functions
# [W{l},b{l}] are the entire set of weights and biases
#    for the lth hidden layer
# Wo and bo are output layer weights and biases
for t = 1:T   # Including both ends of the index
    h(t,0) = x(t)    # Vectors. Initialize hidden layer h(0) to the input
    for l = 1:L      # hidden layers operate at time t
        [C(t,l), h(t,l)] = LSTM_cell(t,l).forward(
                               C(t-1,l), h(t-1,l), h(t,l-1), [W{l},b{l}])
    zo(t) = Wo h(t,L) + bo
    Y(t) = softmax( zo(t) )

80

LSTM network forward

slide-81
SLIDE 81

Long Short Term Memory (LSTM)

g in the previous slides was called c̃

81

slide-82
SLIDE 82

Long Short Term Memory (LSTM)

[Hochreiter et al., 1997]

82

slide-83
SLIDE 83

Long Short Term Memory (LSTM)

[Hochreiter et al., 1997]

83

slide-84
SLIDE 84

Long Short Term Memory (LSTM)

[Hochreiter et al., 1997]

84

slide-85
SLIDE 85

Gated Recurrent Units: Let's simplify the LSTM

  • Don’t bother to separately maintain compressed and regular memories

– Pointless computation!

  • But compress it before using it to decide on the usefulness of the current input!

85

slide-86
SLIDE 86

GRUs

  • Gated Recurrent Units (GRU), introduced by Cho et al. 2014
  • Update gate:

z(t) = σ(W_z [h(t−1), x(t)] + b_z)

  • Reset gate:

r(t) = σ(W_r [h(t−1), x(t)] + b_r)

  • Memory:

h̃(t) = tanh(W_h [r(t) ∘ h(t−1), x(t)] + b_h)

  • Final memory:

h(t) = z(t) ∘ h(t−1) + (1 − z(t)) ∘ h̃(t)

If the reset gate unit is ~0, then this ignores previous memory and only stores the new input.

(A numpy sketch of one GRU step follows this slide.)

86
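A minimal numpy sketch of the GRU equations above (hypothetical shapes, weights acting on the concatenated vector):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W_z, W_r, W_h, b_z, b_r, b_h):
    v = np.concatenate([h_prev, x_t])
    z_t = sigmoid(W_z @ v + b_z)                      # update gate
    r_t = sigmoid(W_r @ v + b_r)                      # reset gate
    v_r = np.concatenate([r_t * h_prev, x_t])
    h_tilde = np.tanh(W_h @ v_r + b_h)                # candidate memory
    return z_t * h_prev + (1.0 - z_t) * h_tilde       # final memory

n_h, n_x = 4, 3
rng = np.random.default_rng(1)
Ws = [rng.normal(size=(n_h, n_h + n_x)) for _ in range(3)]
bs = [np.zeros(n_h) for _ in range(3)]
h = np.zeros(n_h)
for x in rng.normal(size=(5, n_x)):
    h = gru_step(x, h, *Ws, *bs)
print(h)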

slide-87
SLIDE 87

GRU intuition

  • Units with long term dependencies have active update gates z
  • Illustration:

87

This slide has been adapted from Socher's lectures, cs224d, Stanford, 2017

slide-88
SLIDE 88

GRU intuition

  • If the reset gate is close to 0, ignore the previous hidden state

– → Allows the model to drop information that is irrelevant in the future

  • The update gate z controls how much of the past state should matter now.

– If z is close to 1, then we can copy information in that unit through many time steps! Less vanishing gradient!

  • Units with short-term dependencies often have very active reset gates

88

This slide has been adapted from Socher's lectures, cs224d, Stanford, 2017

slide-89
SLIDE 89

Other RNN Variants

89

slide-90
SLIDE 90

Which of these variants is best?

  • Do the differences matter?

– Greff et al. (2015) perform a comparison of popular variants, finding that they are all about the same.
– Jozefowicz et al. (2015) tested more than ten thousand RNN architectures, finding some that worked better than LSTMs on certain tasks.

90

slide-91
SLIDE 91

LSTM Achievements

  • LSTMs have essentially replaced n-grams as language models for speech.

  • Image captioning and other multi-modal tasks which were very difficult with previous methods are now feasible.

  • Many traditional NLP tasks work very well with LSTMs, but they are not necessarily the top performers: e.g., POS tagging and NER: Choi 2016.

  • Neural MT has broken away from the plateau of SMT, especially for grammaticality (partly because of characters/subwords), but is not yet industry strength.

[Ann Copestake, Overview of LSTMs and word2vec, 2016.] https://arxiv.org/ftp/arxiv/papers/1611/1611.00068.pdf

91

slide-92
SLIDE 92

Multi-layer RNN

92

slide-93
SLIDE 93
  • Each green box is now an entire LSTM or GRU unit
  • Also keep in mind each box is an array of units

[Figure: a multi-layer recurrent network unrolled over time, mapping X(t) to Y(t)]

93

Multi-layer LSTM architecture

slide-94
SLIDE 94
  • An RNN with both forward and backward recursion

– Explicitly models the fact that just as the future can be predicted from the past, the past can be deduced from the future

Proposed by Schuster and Paliwal, 1997

94

Extensions to the RNN: Bidirectional RNN

slide-95
SLIDE 95
  • A forward net processes the data from t=1 to t=T
  • A backward net processes it backward, from t=T down to t=1 (a small sketch follows this slide)

[Figure: forward chain from h_f(0) over X(1)…X(T) and backward chain from h_b(inf) over X(T)…X(1), jointly producing Y(1)…Y(T)]

95

Bidirectional RNN
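A minimal sketch (hypothetical simple tanh cells and output layer; not the full BRNN training procedure): the forward net runs over t = 1..T, the backward net over t = T..1, and the output at each t combines both hidden states, so every Y(t) sees the whole input string.

import numpy as np

def rnn_pass(X, W_hh, W_xh):
    h, hs = np.zeros(W_hh.shape[0]), []
    for x in X:
        h = np.tanh(W_hh @ h + W_xh @ x)
        hs.append(h)
    return np.stack(hs)                          # shape (T, n_hidden)

def birnn(X, params):
    W_hh_f, W_xh_f, W_hh_b, W_xh_b, W_out = params
    h_f = rnn_pass(X, W_hh_f, W_xh_f)            # t = 1 .. T
    h_b = rnn_pass(X[::-1], W_hh_b, W_xh_b)[::-1]   # t = T .. 1, re-aligned to t
    return np.concatenate([h_f, h_b], axis=1) @ W_out.T   # Y(t) from both states

n_h, n_x, n_y, T = 4, 3, 2, 6
rng = np.random.default_rng(2)
params = (rng.normal(size=(n_h, n_h)), rng.normal(size=(n_h, n_x)),
          rng.normal(size=(n_h, n_h)), rng.normal(size=(n_h, n_x)),
          rng.normal(size=(n_y, 2 * n_h)))
print(birnn(rng.normal(size=(T, n_x)), params).shape)   # -> (6, 2)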

slide-96
SLIDE 96
  • The forward net processes the data from t=1 to t=T

– Only computing the hidden states, initially

  • The backward net processes it backward, from t=T down to t=0

[Figure: forward chain from h_f(0) over X(1)…X(T); only hidden states are computed at this stage]

96

Bidirectional RNN: Processing an input string

slide-97
SLIDE 97

Bidirectional RNN: Processing an input string

  • The backward net processes the input data in reverse time, from the end to the beginning

– Initially, only the hidden state values are computed

  • Clearly, this is not an online process and requires the entire input data

– Note: This is not the backward pass of backprop.

[Figure: backward chain from h_b(inf) over X(T)…X(1), alongside the forward chain from h_f(0)]

97

slide-98
SLIDE 98

Bidirectional RNN: Processing an input string

  • The computed states of both networks are used to compute the final output at each time

[Figure: forward and backward hidden states combined at each t to produce Y(1)…Y(T)]

98

slide-99
SLIDE 99
  • Forward pass: Compute both the forward and backward networks and the final output

[Figure: forward and backward chains over X(1)…X(T) producing Y(1)…Y(T)]

99

Backpropagation in BRNNs

h(t) = [h_f(t), h_b(t)]: represents both the past and the future

slide-100
SLIDE 100
  • Backward pass: Define a divergence from the desired outputs d1..dT
  • Separately perform backpropagation on both nets

– From t=T down to t=0 for the forward net
– From t=0 up to t=T for the backward net

[Figure: loss Loss(d1..dT) computed from the outputs Y(1)…Y(T) and back-propagated into both nets]

100

Backpropagation in BRNNs

slide-101
SLIDE 101
  • Backward pass: Define a divergence from the desired outputs d1..dT
  • Separately perform backpropagation on both nets

– From t=T down to t=0 for the forward net
– From t=0 up to t=T for the backward net

[Figure: gradients from the loss propagated back through the forward net, from t=T down to t=1]

101

Backpropagation in BRNNs

slide-102
SLIDE 102
  • Backward pass: Define a divergence from the desired outputs d1..dT
  • Separately perform backpropagation on both nets

– From t=T down to t=0 for the forward net
– From t=0 up to t=T for the backward net

[Figure: gradients from the loss propagated through the backward net, from t=1 up to t=T]

102

Backpropagation in BRNNs

slide-103
SLIDE 103
  • Like the BRNN, but now the hidden nodes are LSTM units
  • Can have multiple layers of LSTM units in either direction

– It is also possible to have MLP feed-forward layers between the hidden layers..

  • The output nodes (orange boxes) may be complete MLPs

[Figure: bidirectional LSTM with a forward chain from h_f(0) and a backward chain from h_b(inf) over X(1)…X(T), producing Y(1)…Y(T)]

103

Bidirectional LSTM

slide-104
SLIDE 104
  • Recurrent networks are poor at memorization

– The memory can explode or vanish depending on the weights and the activation

  • They also suffer from the vanishing gradient problem during training

– The error at any time cannot affect parameter updates in the too-distant past
– E.g. seeing a “close bracket” cannot affect the network's ability to predict an “open bracket” if it happened too long ago in the input

  • LSTMs are an alternative formalism where memory is made more directly dependent on the input, rather than on network parameters/structure

– Through a memory structure with no weights or activations, but instead direct switching and “increment/decrement” from pattern recognizers
– They do not suffer from a vanishing gradient problem, but do suffer from an exploding gradient issue

  • Bidirectional networks analyze data both ways, begin→end and end→begin, to make predictions

– In these networks, backprop must follow the chain of recursion (and gradient pooling) separately in the forward and reverse nets

104

Story so far

slide-105
SLIDE 105

RNN: Summary

  • RNNs allow a lot of flexibility in architecture design
  • Vanilla RNNs are simple but don’t work very well
  • The backward flow of gradients in an RNN can explode or vanish

– Exploding is controlled with gradient clipping
– Vanishing is controlled with additive interactions (LSTM)

  • It is common to use LSTM or GRU: their additive interactions improve gradient flow

  • Better/simpler architectures are a hot topic of current research
  • Better understanding (both theoretical and empirical) is needed.

105