

slide-1
SLIDE 1

Deep Learning

Recurrent Networks Part 3

1

slide-2
SLIDE 2

Story so far

  • Iterated structures are good for analyzing time series data with short-time dependence on the past
    – These are "Time delay" neural nets, AKA convnets
  • Recurrent structures are good for analyzing time series data with long-term dependence on the past
    – These are recurrent neural networks

[Figure: time-delay network reading inputs X(t) … X(t+7) and producing Y(t+6)]

2

slide-3
SLIDE 3

Story so far

  • Iterated structures are good for analyzing time series data with short-time dependence on the past
    – These are "Time delay" neural nets, AKA convnets
  • Recurrent structures are good for analyzing time series data with long-term dependence on the past
    – These are recurrent neural networks

[Figure: recurrent network unrolled over time with inputs X(t), outputs Y(t), and initial state h-1 at t=0]

3

slide-4
SLIDE 4

Recap: Recurrent networks can be incredibly effective at modeling long-term dependencies

4

slide-5
SLIDE 5

Recurrent structures can do what static structures cannot

  • The addition problem: Add two N-bit numbers to produce an N+1-bit number
    – Input is binary
    – Will require a large number of training instances
      • Output must be specified for every pair of inputs
      • Weights that generalize will make errors
    – A network trained for N-bit numbers will not work for N+1 bit numbers
  • An RNN learns to do this very quickly
    – With very little training data!

[Figure: an MLP that must map both full binary strings to the sum at once, vs. an RNN unit that adds one bit pair at a time, passing the carry from the previous step forward]

5

slide-6
SLIDE 6

Story so far

  • Recurrent structures can be trained by minimizing the divergence between the sequence of outputs and the sequence of desired outputs
    – Through gradient descent and backpropagation

[Figure: unrolled recurrent network with inputs X(t), outputs Y(t), and a DIVERGENCE computed against Ydesired(t)]

6

slide-7
SLIDE 7

Story so far

  • Recurrent structures can be trained by minimizing the divergence between the sequence of outputs and the sequence of desired outputs
    – Through gradient descent and backpropagation

[Figure: the same network; the DIVERGENCE against Ydesired(t) is highlighted as the primary topic for today]

7

slide-8
SLIDE 8

Story so far: stability

  • Recurrent networks can be unstable
    – And not very good at remembering at other times

[Figure: long-term memory behavior of recurrent units with sigmoid, tanh and relu activations]

8

slide-9
SLIDE 9

Vanishing gradient examples..

  • Learning is difficult: gradients tend to vanish..

[Figure: gradient magnitudes from the output layer back to the input layer, ELU activation, batch gradients]

9

slide-10
SLIDE 10

The long-term dependency problem

  • Long-term dependencies are hard to learn in a network where memory behavior is an untriggered function of the network
    – Need it to be a triggered response to input

PATTERN1 [………………………..] PATTERN2

Jane had a quick lunch in the bistro. Then she..

10

slide-11
SLIDE 11

Long Short-Term Memory

  • The LSTM addresses the problem of input-dependent memory behavior

11

slide-12
SLIDE 12

LSTM-based architecture

  • LSTM-based architectures are identical to RNN-based architectures

[Figure: unrolled network over time with inputs X(t) and outputs Y(t)]

12

slide-13
SLIDE 13

Bidirectional LSTM

  • Bidirectional version..

[Figure: bidirectional network; a forward net with initial state hf(-1) runs over X(0)…X(T), a backward net with initial state hb(inf) runs right to left, and together they produce Y(0)…Y(T)]

13

slide-14
SLIDE 14

Key Issue

  • How do we define the divergence?
  • Also: how do we compute the outputs..

[Figure: unrolled recurrent network; the DIVERGENCE between Y(t) and Ydesired(t) is the primary topic for today]

14

slide-15
SLIDE 15

What follows in this series on recurrent nets

  • Architectures: How to train recurrent networks of different architectures
  • Synchrony: How to train recurrent networks when
    – The target output is time-synchronous with the input
    – The target output is order-synchronous, but not time-synchronous
      • Applies to only some types of nets
  • How to make predictions/inference with such networks

15

slide-16
SLIDE 16

Variants on recurrent nets

  • Conventional MLP
  • Time-synchronous outputs

– E.g. part of speech tagging

Images from Karpathy

16

slide-17
SLIDE 17

Variants on recurrent nets

  • Sequence classification: Classifying a full input sequence
    – E.g. phoneme recognition
  • Order-synchronous, time-asynchronous sequence-to-sequence generation
    – E.g. speech recognition
    – Exact location of output is unknown a priori

17

slide-18
SLIDE 18

Variants

  • A posteriori sequence-to-sequence: Generate output sequence after processing input
    – E.g. language translation
  • Single-input a posteriori sequence generation
    – E.g. captioning an image

Images from Karpathy

18

slide-19
SLIDE 19

Variants on recurrent nets

  • Conventional MLP
  • Time-synchronous outputs

– E.g. part of speech tagging

Images from Karpathy

19

slide-20
SLIDE 20

Regular MLP for processing sequences

  • No recurrence in model
    – Exactly as many outputs as inputs
    – Every input produces a unique output
    – The output at time t is unrelated to the output at t′ ≠ t

[Figure: a series of independent MLP columns, one per time step, each mapping X(t) to Y(t)]

20

slide-21
SLIDE 21

Learning in a Regular MLP

  • No recurrence
    – Exactly as many outputs as inputs
      • One-to-one correspondence between desired output and actual output
    – The output at time t is unrelated to the output at t′ ≠ t.

[Figure: per-time MLPs with a DIVERGENCE computed between Y(t) and Ydesired(t)]

21

slide-22
SLIDE 22

Regular MLP

  • Gradient backpropagated at each time: ∇_{Y(t)} Div(Ytarget(1…T), Y(1…T))
  • Common assumption:

    Div(Ytarget(1…T), Y(1…T)) = Σ_t w_t Div(Ytarget(t), Y(t))

    ∇_{Y(t)} Div(Ytarget(1…T), Y(1…T)) = w_t ∇_{Y(t)} Div(Ytarget(t), Y(t))

    – w_t is typically set to 1.0
    – This is further backpropagated to update weights etc

[Figure: per-time outputs Y(t) compared against Ytarget(t) by a DIVERGENCE]

22

slide-23
SLIDE 23

Regular MLP

  • Gradient backpropagated at each time: ∇_{Y(t)} Div(Ytarget(1…T), Y(1…T))
  • Common assumption:

    Div(Ytarget(1…T), Y(1…T)) = Σ_t Div(Ytarget(t), Y(t))

    ∇_{Y(t)} Div(Ytarget(1…T), Y(1…T)) = ∇_{Y(t)} Div(Ytarget(t), Y(t))

    – This is further backpropagated to update weights etc

[Figure: per-time outputs Y(t) compared against Ytarget(t) by a DIVERGENCE]

Typical Divergence for classification: Div(Ytarget(t), Y(t)) = Xent(Ytarget(t), Y(t))
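A minimal PyTorch sketch of this per-time divergence for a non-recurrent MLP (not from the slides; the sizes and targets are hypothetical, for illustration only):

```python
import torch
import torch.nn as nn

# Hypothetical sizes, for illustration only
T, D_in, n_classes = 10, 16, 5

mlp = nn.Sequential(nn.Linear(D_in, 32), nn.Tanh(), nn.Linear(32, n_classes))

X = torch.randn(T, D_in)                   # one input vector per time step
Ytarget = torch.randint(n_classes, (T,))   # target class at each time step

logits = mlp(X)                            # the same MLP applied independently at every t

# Div(Ytarget(1..T), Y(1..T)) = sum_t Xent(Ytarget(t), Y(t)), with w_t = 1.0
div = nn.CrossEntropyLoss(reduction="sum")(logits, Ytarget)
div.backward()                             # each time step contributes its own gradient
```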

23

slide-24
SLIDE 24

Variants on recurrent nets

  • Conventional MLP
  • Time-synchronous outputs

– E.g. part of speech tagging

Images from Karpathy

24

slide-25
SLIDE 25

Variants on recurrent nets

  • Conventional MLP
  • Time-synchronous outputs

– E.g. part of speech tagging

Images from Karpathy

25

With a brief detour into modelling language

slide-26
SLIDE 26

Time synchronous network

  • Network produces one output for each input
    – With one-to-one correspondence
    – E.g. assigning grammar tags to words
  • May require a bidirectional network to consider both past and future words in the sentence (a minimal tagger sketch follows below)

26

[Figure: the words "two roads diverged in a yellow wood" tagged CD NNS VBD IN DT JJ NN by a recurrent tagger starting from state h-1]
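A minimal sketch of such a time-synchronous tagger (not from the slides), here using a bidirectional LSTM in PyTorch; the vocabulary, tag set and sizes are made up for illustration:

```python
import torch
import torch.nn as nn

vocab_size, tagset_size, emb_dim, hidden = 1000, 10, 32, 64  # hypothetical sizes

class BiLSTMTagger(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # bidirectional: each output can see both past and future words
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, tagset_size)  # one tag score vector per word

    def forward(self, word_ids):                # word_ids: (batch, T)
        h, _ = self.lstm(self.embed(word_ids))  # h: (batch, T, 2*hidden)
        return self.out(h)                      # (batch, T, tagset_size)

tagger = BiLSTMTagger()
sentence = torch.randint(vocab_size, (1, 7))   # e.g. "two roads diverged in a yellow wood"
tag_scores = tagger(sentence)                  # one output per input word
print(tag_scores.argmax(dim=-1))               # predicted tag index for each word
```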

slide-27
SLIDE 27

Time-synchronous networks: Inference

  • Process input left to right and produce output after each input

27

[Figure: unidirectional network processing X(0)…X(T) left to right from initial state h-1, emitting Y(0)…Y(T)]

slide-28
SLIDE 28

Time-synchronous networks: Inference

  • For bidirectional networks:
    – Process input left to right using the forward net
    – Process it right to left using the backward net
    – The combined outputs are used subsequently to produce one output per input symbol
  • The rest of the lecture(s) will not specifically consider bidirectional nets, but the discussion generalizes

28

[Figure: forward and backward nets over X(0)…X(T), combined to produce Y(0)…Y(T)]

slide-29
SLIDE 29

How do we train the network?

  • Back propagation through time (BPTT)
  • Given a collection of sequence training instances comprising input sequences and output sequences of equal length, with one-to-one correspondence
    – (X_i, D_i), where
    – X_i = X_{i,0}, …, X_{i,T}
    – D_i = D_{i,0}, …, D_{i,T}

[Figure: unrolled recurrent network over X(0)…X(T), producing Y(0)…Y(T) from initial state h-1]

29

slide-30
SLIDE 30

Training: Forward pass

  • For each training input:
  • Forward pass: pass the entire data sequence through the network, generate outputs

[Figure: forward pass through the unrolled network, X(0)…X(T) → Y(0)…Y(T)]

30

slide-31
SLIDE 31

Training: Computing gradients

  • For each training input:
  • Backward pass: Compute gradients via backpropagation
    – Back Propagation Through Time

[Figure: gradients flowing backward through the unrolled network]

31

slide-32
SLIDE 32

Back Propagation Through Time

[Figure: unrolled network from h-1; inputs X(0)…X(T), outputs Y(0)…Y(T), targets D(1..T), and the overall DIV]

  • The divergence computed is between the sequence of outputs by the network and the desired sequence of outputs
  • This is not just the sum of the divergences at individual times
    – Unless we explicitly define it that way

32

slide-33
SLIDE 33

Back Propagation Through Time

[Figure: unrolled network from h-1; inputs X(0)…X(T), outputs Y(0)…Y(T), targets D(1..T), and DIV]

First step of backprop: Compute ∇_{Y(t)} DIV for all t. The rest of backprop continues from there.

33

slide-34
SLIDE 34

Back Propagation Through Time

[Figure: unrolled network from h-1; inputs X(0)…X(T), outputs Y(0)…Y(T), targets D(1..T), and DIV]

First step of backprop: Compute ∇_{Y(t)} DIV for all t

∇_{z^{(1)}(t)} DIV = ∇_{Y(t)} DIV · ∇_{z(t)} Y(t)

And so on!

34

slide-35
SLIDE 35

Back Propagation Through Time

[Figure: unrolled network from h-1; inputs X(0)…X(T), outputs Y(0)…Y(T), targets D(1..T), and DIV]

35

First step of backprop: Compute ∇_{Y(t)} DIV for all t

  • The key component is the computation of this derivative!!
  • This depends on the definition of “DIV”
slide-36
SLIDE 36

Time-synchronous recurrence

  • Usual assumption: Sequence divergence is the sum of the divergence at individual instants

    Div(Ytarget(1…T), Y(1…T)) = Σ_t Div(Ytarget(t), Y(t))

    ∇_{Y(t)} Div(Ytarget(1…T), Y(1…T)) = ∇_{Y(t)} Div(Ytarget(t), Y(t))

[Figure: unrolled recurrent network with a per-time DIVERGENCE between Y(t) and Ytarget(t)]

36

slide-37
SLIDE 37

Time-synchronous recurrence

  • Usual assumption: Sequence divergence is the sum of the divergence at individual instants

    Div(Ytarget(1…T), Y(1…T)) = Σ_t Div(Ytarget(t), Y(t))

    ∇_{Y(t)} Div(Ytarget(1…T), Y(1…T)) = ∇_{Y(t)} Div(Ytarget(t), Y(t))

[Figure: unrolled recurrent network with a per-time DIVERGENCE between Y(t) and Ytarget(t)]

37

Typical Divergence for classification: Div(Ytarget(t), Y(t)) = Xent(Ytarget(t), Y(t))
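A minimal sketch of this time-synchronous training (not from the slides), using a simple RNN in PyTorch so that calling backward() on the summed divergence performs BPTT; sizes and targets are hypothetical:

```python
import torch
import torch.nn as nn

# Hypothetical sizes for illustration
T, D_in, n_classes, hidden = 12, 8, 5, 32

rnn = nn.RNN(D_in, hidden, batch_first=True)   # simple recurrent layer
readout = nn.Linear(hidden, n_classes)

X = torch.randn(1, T, D_in)                    # one input sequence
Ytarget = torch.randint(n_classes, (1, T))     # time-synchronous targets

h, _ = rnn(X)                                  # hidden state at every time step
logits = readout(h)                            # Y(t) for every t

# Sequence divergence = sum over t of Xent(Ytarget(t), Y(t))
div = nn.CrossEntropyLoss(reduction="sum")(logits.reshape(T, n_classes), Ytarget.reshape(T))
div.backward()   # autograd unrolls the recurrence: this is BPTT
```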

slide-38
SLIDE 38

Simple recurrence example: Text Modelling

  • Learn a model that can predict the next character given a sequence of characters
    – Or, at a higher level, words
  • After observing inputs w0 … wk it predicts wk+1

[Figure: recurrent network reading w0 … w6 from initial state h-1 and predicting w1 … w7 at each step]

38

slide-39
SLIDE 39

Simple recurrence example: Text Modelling

  • Input presented as one-hot vectors

– Actually “embeddings” of one-hot vectors

  • Output: probability distribution over characters

– Must ideally peak at the target character

Figure from Andrej Karpathy. Input: Sequence of characters (presented as one-hot vectors). Target output after observing “h e l l” is “o”

39

slide-40
SLIDE 40

Training

  • Input: symbols as one-hot vectors
  • Dimensionality of the vector is the size of the "vocabulary"
  • Output: Probability distribution over symbols

    Y(t, i) = P(V_i | w0 … w_{t−1})

      • V_i is the i-th symbol in the vocabulary

  • Divergence

    Div(Ytarget(1…T), Y(1…T)) = Σ_t Xent(Ytarget(t), Y(t)) = −Σ_t log Y(t, w_{t+1})

    (the probability assigned to the correct next word)

[Figure: recurrent network reading w0 … w6, predicting w1 … w7, with a DIVERGENCE at each output]
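A minimal character-level sketch of this training objective (not from the slides); the toy text, vocabulary and sizes are made up for illustration:

```python
import torch
import torch.nn as nn

# Hypothetical character vocabulary, for illustration
vocab = sorted(set("hello world"))
V = len(vocab)
char_to_id = {c: i for i, c in enumerate(vocab)}

emb = nn.Embedding(V, 16)            # "embedding" of the one-hot input
lstm = nn.LSTM(16, 64, batch_first=True)
readout = nn.Linear(64, V)           # scores over the vocabulary at each step

text = "hello world"
ids = torch.tensor([[char_to_id[c] for c in text]])   # (1, T)
inputs, targets = ids[:, :-1], ids[:, 1:]              # predict w_{t+1} from w_0..w_t

h, _ = lstm(emb(inputs))
logits = readout(h)                                    # Y(t, i) before the softmax

# Div = -sum_t log Y(t, w_{t+1}): cross-entropy against the correct next character
div = nn.CrossEntropyLoss(reduction="sum")(logits.squeeze(0), targets.squeeze(0))
div.backward()
```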

40

slide-41
SLIDE 41

Brief detour: Language models

  • Modelling language using time-synchronous nets
  • More generally, language models and embeddings..

41

slide-42
SLIDE 42

Which open source project?

42

slide-43
SLIDE 43

Language modelling using RNNs

  • Problem: Given a sequence of words (or characters), predict the next one

Four score and seven years ???
A B R A H A M L I N C O L ??

43

slide-44
SLIDE 44

Language modelling: Representing words

  • Represent words as one-hot vectors
    – Pre-specify a vocabulary of N words in fixed (e.g. lexical) order
      • E.g. [ A AARDVARK AARON ABACK ABACUS … ZZYP ]
    – Represent each word by an N-dimensional vector with N−1 zeros and a single 1 (in the position of the word in the ordered list of words)
      • E.g. "AARDVARK" → [0 1 0 0 0 …]
      • E.g. "AARON" → [0 0 1 0 0 0 …]
  • Characters can be similarly represented
    – English will require about 100 characters, to include both cases, special characters such as commas, hyphens, apostrophes, etc., and the space character
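A minimal sketch of this representation (not from the slides), assuming a toy five-word vocabulary:

```python
import numpy as np

# Toy vocabulary in fixed lexical order (hypothetical)
vocab = ["A", "AARDVARK", "AARON", "ABACK", "ABACUS"]
N = len(vocab)
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """N-dimensional vector with a single 1 at the word's position in the vocabulary."""
    v = np.zeros(N)
    v[index[word]] = 1.0
    return v

print(one_hot("AARDVARK"))   # [0. 1. 0. 0. 0.]
```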

44

slide-45
SLIDE 45

Predicting words

  • Given one-hot representations of W_1 … W_{n−1}, predict W_n
  • Dimensionality problem: All inputs W_1 … W_{n−1} are both very high-dimensional and very sparse

    W_n = f(W_1, …, W_{n−1})

[Figure: "Four score and seven years ???"; each word is an N×1 one-hot vector feeding a predictor f()]

45

slide-46
SLIDE 46

Predicting words

  • Given one-hot representations of W_1 … W_{n−1}, predict W_n
  • Dimensionality problem: All inputs W_1 … W_{n−1} are both very high-dimensional and very sparse

    W_n = f(W_1, …, W_{n−1})

[Figure: "Four score and seven years ???"; each word is an N×1 one-hot vector feeding a predictor f()]

46

slide-47
SLIDE 47

The one-hot representation

  • The one-hot representation uses only N corners of the 2^N corners of a unit cube
    – Actual volume of space used = 0
      • (1, ε, δ) has no meaning except for ε = δ = 0
    – Density of points: O(N / 2^N)
  • This is a tremendously inefficient use of dimensions

[Figure: unit cube with only the one-hot corners (1,0,0), (0,1,0), (0,0,1) in use]

47

slide-48
SLIDE 48

Why one-hot representation

  • The one-hot representation makes no assumptions about the relative importance of words
    – All word vectors are the same length
  • It makes no assumptions about the relationships between words
    – The distance between every pair of words is the same

[Figure: the one-hot corners (1,0,0), (0,1,0), (0,0,1), all equidistant from one another]

48

slide-49
SLIDE 49

Solution to dimensionality problem

  • Project the points onto a lower-dimensional subspace
    – The volume used is still 0, but density can go up by many orders of magnitude
      • Density of points: O(N / 2^M)
    – If properly learned, the distances between projected points will capture semantic relations between the words

[Figure: the one-hot corners (1,0,0), (0,1,0), (0,0,1) projected onto a lower-dimensional subspace]

49

slide-50
SLIDE 50

Solution to dimensionality problem

  • Project the points onto a lower-dimensional subspace
    – The volume used is still 0, but density can go up by many orders of magnitude
      • Density of points: O(N / 2^M)
    – If properly learned, the distances between projected points will capture semantic relations between the words
  • This will also require a linear transformation (stretching/shrinking/rotation) of the subspace

[Figure: the one-hot corners (1,0,0), (0,1,0), (0,0,1) projected onto a lower-dimensional subspace]

50

slide-51
SLIDE 51

The Projected word vectors

  • Project the N-dimensional one-hot word vectors into a lower-dimensional space
    – Replace every one-hot vector W_i by PW_i
    – P is an M × N matrix
    – PW_i is now an M-dimensional vector
    – Learn P using an appropriate objective
      • Distances in the projected space will reflect relationships imposed by the objective

    W_n = f(PW_1, PW_2, …, PW_{n−1})

[Figure: "Four score and seven years ???"; each one-hot word vector is multiplied by P before being fed to the predictor f()]

51

slide-52
SLIDE 52

“Projection”

  • P is a simple linear transform
  • A single transform can be implemented as a layer of M neurons with linear activation
  • The transforms that apply to the individual inputs are all M-neuron linear-activation subnets with tied weights

    W_n = f(PW_1, PW_2, …, PW_{n−1})

[Figure: each N-dimensional one-hot input passes through the same M-neuron linear layer (tied weights) before the predictor f()]
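A minimal sketch of this point (not from the slides): multiplying a one-hot vector by P is the same as reading out one column of P, which is exactly what an embedding lookup does. Sizes are hypothetical.

```python
import torch
import torch.nn as nn

N, M = 1000, 50                       # hypothetical vocabulary size and projected dimension

# P as a linear layer with no bias: one-hot input -> M-dimensional projection
P = nn.Linear(N, M, bias=False)

word_id = 3
one_hot = torch.zeros(N)
one_hot[word_id] = 1.0

proj = P(one_hot)                     # P @ one_hot
same = P.weight[:, word_id]           # the equivalent "embedding lookup" of column word_id
print(torch.allclose(proj, same))     # True: the projection is a table lookup with tied weights
```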

52

slide-53
SLIDE 53

Predicting words: The TDNN model

  • Predict each word based on the past N words
    – "A neural probabilistic language model", Bengio et al. 2003
    – Hidden layer has tanh() activation, output is softmax
  • One of the outcomes of learning this model is that we also learn low-dimensional representations PW of words

[Figure: a sliding window of past words, each projected by the shared P, feeding a tanh hidden layer and a softmax output that predicts the next word]
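A minimal sketch in the spirit of this TDNN model (not the authors' implementation); vocabulary size, embedding width, context length and hidden size are all made up:

```python
import torch
import torch.nn as nn

# Hypothetical sizes: vocabulary V, embedding dim M, context of 4 past words
V, M, context, hidden = 5000, 60, 4, 100

class NeuralLM(nn.Module):
    """Bengio-style feedforward language model (sketch)."""
    def __init__(self):
        super().__init__()
        self.P = nn.Embedding(V, M)                 # shared projection of each word
        self.hidden = nn.Linear(context * M, hidden)
        self.out = nn.Linear(hidden, V)             # softmax over the next word (via cross-entropy)

    def forward(self, past_words):                  # past_words: (batch, context)
        e = self.P(past_words).view(past_words.size(0), -1)
        return self.out(torch.tanh(self.hidden(e)))

lm = NeuralLM()
past = torch.randint(V, (1, context))               # ids of the preceding words
next_word_logits = lm(past)                          # scores for every word in the vocabulary
```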

53

slide-54
SLIDE 54

Alternative models to learn projections

  • Soft bag of words: Predict word based on words in immediate context
    – Without considering specific position
  • Skip-grams: Predict adjacent words based on current word

  • More on these in a future recitation?

[Figure, left: soft bag of words — the projections P of the surrounding words are mean-pooled to predict the middle word. Right: skip-grams — the projection of the current word predicts each surrounding word. Color indicates shared parameters]
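A minimal sketch of the soft bag-of-words idea (not from the slides); the vocabulary size, window and target id are made up:

```python
import torch
import torch.nn as nn

V, M = 5000, 60   # hypothetical vocabulary and embedding sizes

class CBOW(nn.Module):
    """Soft bag-of-words sketch: mean-pool context embeddings, predict the middle word."""
    def __init__(self):
        super().__init__()
        self.P = nn.Embedding(V, M)       # shared projection for all context positions
        self.out = nn.Linear(M, V)

    def forward(self, context_ids):       # (batch, window) word ids; position is ignored
        pooled = self.P(context_ids).mean(dim=1)
        return self.out(pooled)

model = CBOW()
context = torch.randint(V, (1, 6))        # e.g. 3 words on each side of the target
loss = nn.CrossEntropyLoss()(model(context), torch.tensor([42]))  # 42: hypothetical target id
loss.backward()
```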

54

slide-55
SLIDE 55

Embeddings: Examples

  • From Mikolov et al., 2013, "Distributed Representations of Words and Phrases and their Compositionality"

55

slide-56
SLIDE 56

Generating Language: The model

  • The hidden units are (one or more layers of) LSTM units
  • Trained via backpropagation from a lot of text

[Figure: recurrent language model; each input word is projected by the shared P, fed through the LSTM layers, and the network predicts the next word at every step]

56

slide-57
SLIDE 57

Generating Language: Synthesis

  • On the trained model: Provide the first few words
    – One-hot vectors
  • After the last input word, the network generates a probability distribution over words
    – Outputs an N-valued probability distribution rather than a one-hot vector

[Figure: the first few words W_1 W_2 W_3, projected by P, are fed in; the final output is a distribution over the vocabulary]

57

slide-58
SLIDE 58

Generating Language: Synthesis

  • On the trained model: Provide the first few words
    – One-hot vectors
  • After the last input word, the network generates a probability distribution over words
    – Outputs an N-valued probability distribution rather than a one-hot vector
  • Draw a word from the distribution
    – And set it as the next word in the series

[Figure: a word W_4 is drawn from the output distribution after the seed words W_1 W_2 W_3]

58

slide-59
SLIDE 59

Generating Language: Synthesis

  • Feed the drawn word as the next word in the series
    – And draw the next word from the output probability distribution
  • Continue this process until we terminate generation
    – In some cases, e.g. generating programs, there may be a natural termination

[Figure: the drawn word W_4 is fed back in, and W_5 is drawn from the new output distribution]

59

slide-60
SLIDE 60

Generating Language: Synthesis

  • Feed the drawn word as the next word in the series
    – And draw the next word from the output probability distribution
  • Continue this process until we terminate generation
    – In some cases, e.g. generating programs, there may be a natural termination

[Figure: generation continues, feeding back each drawn word W_4, W_5, … W_10]
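A minimal sketch of this synthesis loop (not from the slides), with an untrained model and made-up sizes; the point is the structure of the loop, not the output quality:

```python
import torch
import torch.nn as nn

V, M, H = 100, 16, 64                        # hypothetical sizes
emb, lstm, readout = nn.Embedding(V, M), nn.LSTM(M, H, batch_first=True), nn.Linear(H, V)

def generate(seed_ids, n_steps):
    """Feed the seed words, then repeatedly draw the next symbol and feed it back."""
    ids = list(seed_ids)
    state = None
    x = torch.tensor([ids])                   # (1, len(seed))
    for _ in range(n_steps):
        h, state = lstm(emb(x), state)         # carry the recurrent state forward
        probs = torch.softmax(readout(h[:, -1]), dim=-1)
        nxt = torch.multinomial(probs, 1).item()   # draw a word from the distribution
        ids.append(nxt)
        x = torch.tensor([[nxt]])              # the next input is the drawn symbol
    return ids

print(generate([3, 7, 1], 10))                 # untrained model: a random-looking continuation
```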

60

slide-61
SLIDE 61

Which open source project?

Trained on Linux source code. Actually uses a character-level model (predicts character sequences)

61

slide-62
SLIDE 62

Composing music with RNN

http://www.hexahedria.com/2015/08/03/composing-music-with-recurrent-neural-networks/

62

slide-63
SLIDE 63

Returning to our problem

  • Divergences are harder to define in other scenarios..

63

slide-64
SLIDE 64

Variants on recurrent nets

  • Sequence classification: Classifying a full input sequence
    – E.g. phoneme recognition
  • Order-synchronous, time-asynchronous sequence-to-sequence generation
    – E.g. speech recognition
    – Exact location of output is unknown a priori

64

slide-65
SLIDE 65

Example..

  • Question answering
  • Input : Sequence of words
  • Output: Answer at the end of the question

65

[Figure: the question "Color of sky?" is read word by word; the answer "Blue" is produced at the end]

slide-66
SLIDE 66

Example..

  • Speech recognition
  • Input : Sequence of feature vectors (e.g. Mel spectra)
  • Output: Phoneme ID at the end of the sequence

– Represented as an N-dimensional output probability vector, where N is the number of phonemes

[Figure: feature vectors X0 X1 X2 are read in; the phoneme /AH/ is output at the end of the sequence]

66

slide-67
SLIDE 67

Inference: Forward pass

  • Exact input sequence provided

– Output generated when the last vector is processed

  • Output is a probability distribution over phonemes
  • But what about at intermediate stages?

[Figure: inputs X0 X1 X2; the output is read as /AH/ after the last vector]

67

slide-68
SLIDE 68

Forward pass

  • Exact input sequence provided

– Output generated when the last vector is processed

  • Output is a probability distribution over phonemes
  • Outputs are actually produced for every input

– We only read it at the end of the sequence

[Figure: inputs X0 X1 X2; outputs are produced at every step, but only the final one is read as /AH/]

68

slide-69
SLIDE 69

Training

  • The divergence is only defined at the final input

    DIV(Ytarget, Y) = Xent(Y(T), Phoneme)

  • This divergence must propagate through the net to update all parameters

[Figure: Div computed only between the final output Y(2) and the target phoneme /AH/]

69

slide-70
SLIDE 70

Training

  • The divergence is only defined at the final input

    DIV(Ytarget, Y) = Xent(Y(T), Phoneme)

  • This divergence must propagate through the net to update all parameters

[Figure: Div computed only at Y(2). Shortcoming: pretends there is no useful information in the earlier outputs]

70

slide-71
SLIDE 71

Training

  • Exploiting the untagged inputs: assume the same output for the entire input
  • Define the divergence everywhere

    DIV(Ytarget, Y) = Σ_t w_t Xent(Y(t), Phoneme)

[Figure: Fix: use the earlier outputs too — a Div against /AH/ is computed at every step, since these outputs too must ideally point to the correct phoneme /AH/]

71

slide-72
SLIDE 72

Training

  • Define the divergence everywhere

    DIV(Ytarget, Y) = Σ_t w_t Xent(Y(t), Phoneme)

  • Typical weighting scheme for speech: all instants are equally important
  • For a problem like question answering, the answer is only expected after the question ends
    – Only w_T is high; the other weights are 0 or low

[Figure, top: speech — a Div against /AH/ at every step. Bottom: question answering ("Color of sky?" → "Blue") — the Div at the final step dominates]

72
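A minimal sketch of this weighted divergence (not from the slides); the number of phonemes, the target class and the weights are made up:

```python
import torch
import torch.nn.functional as F

# Hypothetical: per-time logits over 40 phonemes for a T=3 input, target /AH/ = class 7
T, n_phonemes, target = 3, 40, 7
logits = torch.randn(T, n_phonemes, requires_grad=True)
targets = torch.full((T,), target, dtype=torch.long)   # same label repeated over the whole input

# Per-time cross-entropies Xent(Y(t), phoneme)
per_time = F.cross_entropy(logits, targets, reduction="none")   # shape (T,)

w_speech = torch.ones(T)                       # speech: all instants weighted equally
w_qa = torch.tensor([0.0, 0.0, 1.0])           # QA-style: only the final output matters

div = (w_speech * per_time).sum()              # swap in w_qa for the other weighting
div.backward()
```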

slide-73
SLIDE 73

Variants on recurrent nets

  • Sequence classification: Classifying a full input sequence
    – E.g. phoneme recognition
  • Order-synchronous, time-asynchronous sequence-to-sequence generation
    – E.g. speech recognition
    – Exact location of output is unknown a priori

73

slide-74
SLIDE 74

A more complex problem

  • Objective: Given a sequence of inputs, asynchronously output a sequence of symbols
    – This is just a simple concatenation of many copies of the simple "output at the end of the input sequence" model we just saw
  • But this simple extension complicates matters..

[Figure: inputs X0…X9; /B/ is emitted around X2, /AH/ around X6, /T/ around X9]

74

slide-75
SLIDE 75

The sequence-to-sequence problem

  • How do we know when to output symbols?
    – In fact, the network produces outputs at every time
    – Which of these are the real outputs?
      • Outputs that represent the definitive occurrence of a symbol

[Figure: inputs X0…X9; the symbols /B/ /AH/ /T/ must be emitted at unknown positions]

75

slide-76
SLIDE 76

The actual output of the network

  • At each time the network outputs a probability for each output symbol

[Figure: a table of output probabilities y_t(i), one row per symbol (/AH/ /B/ /D/ /EH/ /IY/ /F/ /G/) and one column per time step t = 0…8]

76

slide-77
SLIDE 77

The actual output of the network

  • Option 1: Simply select the most probable symbol at each time

[Figure: the same probability table; at each time step the highest-probability symbol is selected]

77

slide-78
SLIDE 78

The actual output of the network

  • Option 1: Simply select the most probable symbol at each time
    – Merge adjacent repeated symbols, and place the actual emission of the symbol in the final instant

[Figure: the per-time picks collapse, after merging adjacent repeats, to the sequence /G/ /F/ /IY/ /D/]
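A minimal sketch of this greedy "pick and merge" decoding (not from the slides); the probability table here is random, for illustration only:

```python
import torch

# Hypothetical output: probabilities y_t(i) over 7 symbols at 9 time steps
symbols = ["/AH/", "/B/", "/D/", "/EH/", "/IY/", "/F/", "/G/"]
probs = torch.softmax(torch.randn(9, len(symbols)), dim=-1)

# Option 1: pick the most probable symbol at each time...
picks = probs.argmax(dim=-1).tolist()

# ...then merge adjacent repeats, keeping one emission per run
decoded = [symbols[i] for k, i in enumerate(picks) if k == 0 or i != picks[k - 1]]
print(decoded)   # e.g. ['/G/', '/F/', '/IY/', '/D/'] — may be meaningless without constraints
```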

78

slide-79
SLIDE 79

The actual output of the network

  • Option 1: Simply select the most probable symbol at each

time

– Merge adjacent repeated symbols, and place the actual emission

  • f the symbol in the final instant

𝑌0 𝑌1 𝑌2 𝑌4 𝑌5 𝑌6 𝑌7 𝑌8 𝑌3 /AH/ /B/ /D/ /EH/ /IY/ /F/ /G/ 𝑧0

1

𝑧0

2

𝑧0

3

𝑧0

4

𝑧0

5

𝑧0

6

𝑧0

7

𝑧1

1

𝑧1

2

𝑧1

3

𝑧1

4

𝑧1

5

𝑧1

6

𝑧1

7

𝑧2

1

𝑧2

2

𝑧2

3

𝑧2

4

𝑧2

5

𝑧2

6

𝑧2

7

𝑧3

1

𝑧3

2

𝑧3

3

𝑧3

4

𝑧3

5

𝑧3

6

𝑧3

7

𝑧4

1

𝑧4

2

𝑧4

3

𝑧4

4

𝑧4

5

𝑧4

6

𝑧4

7

𝑧5

1

𝑧5

2

𝑧5

3

𝑧5

4

𝑧5

5

𝑧5

6

𝑧5

7

𝑧6

1

𝑧6

2

𝑧6

3

𝑧6

4

𝑧6

5

𝑧6

6

𝑧6

7

𝑧7

1

𝑧7

2

𝑧7

3

𝑧7

4

𝑧7

5

𝑧7

6

𝑧7

7

𝑧8

1

𝑧8

2

𝑧8

3

𝑧8

4

𝑧8

5

𝑧8

6

𝑧8

7

/G/ /F/ /IY/ /D/ Cannot distinguish between an extended symbol and repetitions of the symbol /F/

79

slide-80
SLIDE 80

The actual output of the network

  • Option 1: Simply select the most probable symbol at each

time

– Merge adjacent repeated symbols, and place the actual emission

  • f the symbol in the final instant

𝑌0 𝑌1 𝑌2 𝑌4 𝑌5 𝑌6 𝑌7 𝑌8 𝑌3 /AH/ /B/ /D/ /EH/ /IY/ /F/ /G/ 𝑧0

1

𝑧0

2

𝑧0

3

𝑧0

4

𝑧0

5

𝑧0

6

𝑧0

7

𝑧1

1

𝑧1

2

𝑧1

3

𝑧1

4

𝑧1

5

𝑧1

6

𝑧1

7

𝑧2

1

𝑧2

2

𝑧2

3

𝑧2

4

𝑧2

5

𝑧2

6

𝑧2

7

𝑧3

1

𝑧3

2

𝑧3

3

𝑧3

4

𝑧3

5

𝑧3

6

𝑧3

7

𝑧4

1

𝑧4

2

𝑧4

3

𝑧4

4

𝑧4

5

𝑧4

6

𝑧4

7

𝑧5

1

𝑧5

2

𝑧5

3

𝑧5

4

𝑧5

5

𝑧5

6

𝑧5

7

𝑧6

1

𝑧6

2

𝑧6

3

𝑧6

4

𝑧6

5

𝑧6

6

𝑧6

7

𝑧7

1

𝑧7

2

𝑧7

3

𝑧7

4

𝑧7

5

𝑧7

6

𝑧7

7

𝑧8

1

𝑧8

2

𝑧8

3

𝑧8

4

𝑧8

5

𝑧8

6

𝑧8

7

/G/ /F/ /IY/ /D/ Cannot distinguish between an extended symbol and repetitions of the symbol /F/ Resulting sequence may be meaningless (what word is “GFIYD”?)

80

slide-81
SLIDE 81

The actual output of the network

  • Option 2: Impose external constraints on what sequences are allowed
    – E.g. only allow sequences corresponding to dictionary words
    – E.g. sub-symbol units (like in HW1 – what were they?)

[Figure: the same probability table; decoding is restricted to allowed symbol sequences]

81

slide-82
SLIDE 82

The sequence-to-sequence problem

[Figure: inputs X0…X9 producing the symbol sequence /B/ /AH/ /T/]

82

  • How do we know when to output symbols?
    – In fact, the network produces outputs at every time
    – Which of these are the real outputs? (Partially addressed; we will revisit this, though)
  • How do we train these models?

slide-83
SLIDE 83

Training

  • Given output symbols at the right locations
    – The phoneme /B/ ends at X2, /AH/ at X6, /T/ at X9

[Figure: inputs X0…X9 with /B/ aligned to X2, /AH/ to X6, /T/ to X9]

83

slide-84
SLIDE 84

Training

  • Either just define the Divergence as:

    DIV = Xent(Y_2, /B/) + Xent(Y_6, /AH/) + Xent(Y_9, /T/)

  • Or..

[Figure: Div computed only at Y_2, Y_6 and Y_9, against /B/, /AH/ and /T/ respectively]

84

slide-85
SLIDE 85
Training

  • Either just define the Divergence as:

    DIV = Xent(Y_2, /B/) + Xent(Y_6, /AH/) + Xent(Y_9, /T/)

  • Or repeat the symbols over their duration:

    DIV = Σ_t Xent(Y_t, symbol_t) = −Σ_t log Y(t, symbol_t)

[Figure: /B/, /AH/ and /T/ repeated over their durations, with a Div computed at every time step]
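A minimal sketch of both options (not from the slides); the phoneme class ids, segment boundaries and sizes are made up for illustration:

```python
import torch
import torch.nn.functional as F

# Hypothetical: per-time logits over 40 phonemes for a 10-step input,
# with /B/ ending at t=2, /AH/ at t=6, /T/ at t=9 (class ids made up)
B, AH, T_ = 4, 11, 29
logits = torch.randn(10, 40, requires_grad=True)

# Option 1: divergence only at the known end points
ends = torch.tensor([2, 6, 9])
labels = torch.tensor([B, AH, T_])
div_ends = F.cross_entropy(logits[ends], labels, reduction="sum")

# Option 2: repeat each symbol over its duration and sum over all t
full_labels = torch.tensor([B]*3 + [AH]*4 + [T_]*3)       # t = 0..2, 3..6, 7..9
div_full = F.cross_entropy(logits, full_labels, reduction="sum")

div_full.backward()
```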

85

slide-86
SLIDE 86

[Figure: inputs X0…X9; the target sequence /B/ /AH/ /T/ is known, but not where each symbol occurs]

Problem: No timing information provided

  • Only the sequence of output symbols is provided for the training data
    – But no indication of which one occurs where
  • How do we compute the divergence?
    – And how do we compute its gradient w.r.t. Y_t?

[Figure: outputs Y_1…Y_9 at every step, each marked "?"; the alignment to /B/ /AH/ /T/ is unknown]

86

slide-87
SLIDE 87

Next Class

  • Training without aligned truth..

– Connectionist Temporal Classification
– Separating repeated symbols

  • The CTC decoder..

87