SLIDE 1

Transformer Sequence Models

CSE354 - Spring 2020 Natural Language Processing

SLIDE 2

Most NLP Tasks, e.g.:

  • Sequence Tasks
    ○ Language Modeling
    ○ Machine Translation
    ○ Speech Recognition
  • Transformer Networks
    ○ Transformers
    ○ BERT

SLIDE 3

Evolution of Sequence Modeling

RNNs → LSTMs → LSTMs with Attention → Attention without LSTMs

SLIDE 4
SLIDE 4

Multi-level bidirectional RNN (LSTM or GRU)

(Eisenstein, 2018)

SLIDE 5

Multi-level bidirectional RNN (LSTM or GRU)

Each node has a forward (→) and a backward (←) hidden state; the node can be represented as a concatenation of both. (Eisenstein, 2018)

SLIDE 6

Multi-level bidirectional RNN (LSTM or GRU)

The average of the top layer is an embedding (an average of the concatenated vectors). (Eisenstein, 2018)
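A minimal PyTorch sketch of this pooling (the dimensions and the toy input are assumptions for illustration):

```python
# Sketch: mean-pool the top layer of a bidirectional LSTM into one embedding.
import torch
import torch.nn as nn

emb_dim, hidden_dim, vocab_size = 50, 64, 1000
embed = nn.Embedding(vocab_size, emb_dim)
birnn = nn.LSTM(emb_dim, hidden_dim, num_layers=2,
                bidirectional=True, batch_first=True)

tokens = torch.randint(0, vocab_size, (1, 12))   # one toy 12-token document
top_layer, _ = birnn(embed(tokens))              # (1, 12, 2*hidden_dim)
# Each position is [forward; backward] concatenated; average over time.
doc_embedding = top_layer.mean(dim=1)            # (1, 2*hidden_dim)
```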

SLIDE 7

Multi-level bidirectional RNN (LSTM or GRU)

Sometimes just the left-most and right-most hidden states are used instead. (Eisenstein, 2018)

SLIDE 8

Sentiment Analysis: Example Application of a Single Representation of a Document

"I feel terrible about what happened to Stark but the movie was excellent !"

[Figure: the document is encoded into a single feature vector.]

SLIDE 9

Sentiment Analysis: Example Application of a Single Representation of a Document

"I feel terrible about what happened to Stark but the movie was excellent !"

[Figure: the document encoding (feature vector) feeds a classifier or regressor; a sketch of these heads follows below.]

  • like/dislike (classifier)
  • r: rating on a scale (regressor)
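A sketch of the two heads over such a document vector (sizes are illustrative; doc_dim matches the biLSTM sketch above):

```python
# Hypothetical classifier and regressor heads over one document vector.
import torch
import torch.nn as nn

doc_dim = 128                        # e.g., 2 * hidden_dim from the biLSTM sketch
doc_embedding = torch.randn(1, doc_dim)

clf = nn.Linear(doc_dim, 2)          # logits for like/dislike (classifier)
reg = nn.Linear(doc_dim, 1)          # rating on a scale (regressor)

like_logits = clf(doc_embedding)     # softmax/argmax gives like vs. dislike
rating = reg(doc_embedding)          # a real-valued rating
```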


SLIDE 11

Encoder

A representation of the input. (Eisenstein, 2018)

SLIDE 12

Encoder-Decoder

Representing the input and converting it to output. (Eisenstein, 2018)

SLIDE 13

Encoder-Decoder

[Figure: the decoder emits y(0), y(1), y(2), y(3), … through a softmax at each step. (Eisenstein, 2018)]

SLIDE 14

Encoder-Decoder

[Figure: the decoder's inputs are <go>, y(0), y(1), y(2), …; its softmax outputs are y(0), y(1), y(2), y(3), …; each output is fed back in as the next input.]

SLIDE 15

Encoder-Decoder

A representation of the input feeds the decoder, which starts from <go> and emits y(0), y(1), y(2), y(3), … through a softmax.

SLIDE 16

Encoder-Decoder

The decoder is essentially a language model conditioned on the final state from the encoder: starting from <go>, it emits y(0), y(1), y(2), y(3), … through a softmax.
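A minimal sketch of that conditioning (toy sizes; GRUs are an assumption, chosen for brevity): the encoder's final hidden state initializes the decoder, which then behaves as a language model.

```python
# Sketch: the decoder is a language model seeded with the encoder's final state.
import torch
import torch.nn as nn

V, E, H = 1000, 32, 64
src_emb, tgt_emb = nn.Embedding(V, E), nn.Embedding(V, E)
encoder = nn.GRU(E, H, batch_first=True)
decoder = nn.GRU(E, H, batch_first=True)
out = nn.Linear(H, V)                          # softmax layer over the vocabulary

src = torch.randint(0, V, (1, 7))              # toy source sentence
tgt = torch.randint(0, V, (1, 5))              # toy target, assumed to start with <go>
_, h_final = encoder(src_emb(src))             # final encoder state
dec_states, _ = decoder(tgt_emb(tgt), h_final) # condition the LM on it
logits = out(dec_states)                       # y(0), y(1), ... via softmax
```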

SLIDE 17

Encoder-Decoder

When applied to new data, decoding starts from <go>: the decoder is essentially a language model conditioned on the final state from the encoder.


SLIDE 19

Encoder-Decoder: the "seq2seq" model

The encoder reads Language 1 (e.g. Chinese); the decoder, starting from <go>, emits Language 2 (e.g. English) as y(0), y(1), y(2), y(3), … through a softmax.

SLIDE 20

Encoder-Decoder

Challenge:

  • Long-distance dependencies when translating


SLIDE 22

Encoder-Decoder

Challenge:

  • Long-distance dependencies when translating:

"Kayla kicked the ball." → "The ball was kicked by Kayla."

SLIDE 23

Encoder-Decoder

Challenge:

  • Long-distance dependencies when translating:

"Kayla kicked the ball." → "The ball was kicked by Kayla."

A lot of responsibility is put on the fixed-size hidden state passed from encoder to decoder.

SLIDE 24

Long Distance / Out-of-order Dependencies

A lot of responsibility is put on the fixed-size hidden state passed from encoder to decoder.


SLIDE 26

Attention

[Figure: the decoder can attend back to the encoder states s1, s2, s3, s4.]

SLIDE 27

Attention

Analogy: random access memory. The encoder states s1, s2, s3, s4 act as memory the decoder can read from.

SLIDE 28

Attention

An attention layer sits between the decoder and the encoder states s1, s2, s3, s4.

SLIDE 29

Attention

[Figure: the attention layer connects decoder states h_{i-1}, h_i, h_{i+1} to the input states (with values z_n), producing a context vector c_hi. Notation: i indexes the current token of the output; n = 1..N indexes the tokens of the input.]

SLIDE 30

Attention

The attention weights α_{hi→s1}, α_{hi→s2}, α_{hi→s3}, α_{hi→s4} weight the states s1..s4 to form the context vector c_hi:

c_hi = Σ_n α_{hi→sn} · s_n

SLIDE 31

Attention

The attention weights α_{hi→s1}, …, α_{hi→s4} now weight value vectors z1..z4 to form the context vector c_hi:

c_hi = Σ_n α_{hi→sn} · z_n

Z is the vector to be attended to (the value in memory). It is typically the hidden states of the input (i.e. s_n) but can be anything.
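A toy sketch of this weighted sum (the scores and dimensions are made up; any scoring function ω could produce them):

```python
# Sketch: softmax-normalized scores mix the value vectors into c_hi.
import torch

omega = torch.tensor([2.0, 0.5, -1.0, 0.3])  # ω(h_i, s_n) for n = 1..4
alpha = torch.softmax(omega, dim=0)          # α_{hi→s1..s4}, sums to 1
Z = torch.randn(4, 64)                       # value vectors z_1..z_4 (assumed 64-dim)
c_hi = alpha @ Z                             # context vector, shape (64,)
```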


SLIDE 33

Attention

The weights come from a score function 𝜔 comparing the decoder state h_i to each s_n; the scores are normalized into α_{hi→s1}, …, α_{hi→s4}.

SLIDE 34

Attention

Score function parameters: v, W_h, W_s (the parameterization of the additive form, 𝜔(h_i, s_n) = vᵀ tanh(W_h h_i + W_s s_n)).
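A sketch of a score function with these parameters, in the additive (Bahdanau-style) form; the dimensions are assumptions:

```python
# Sketch: additive attention scoring, ω(h_i, s_n) = v^T tanh(W_h h_i + W_s s_n).
import torch
import torch.nn as nn

d_h, d_s, d_a = 64, 64, 32
W_h = nn.Linear(d_h, d_a, bias=False)
W_s = nn.Linear(d_s, d_a, bias=False)
v = nn.Linear(d_a, 1, bias=False)

def score(h_i, S):
    # Broadcasts W_h(h_i) against every row of W_s(S): one score per s_n.
    return v(torch.tanh(W_h(h_i) + W_s(S))).squeeze(-1)

h_i, S = torch.randn(d_h), torch.randn(4, d_s)
omega = score(h_i, S)                        # shape (4,): ω for s_1..s_4
```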

SLIDE 35

Attention

Score function parameters: v, W_h, W_s. A useful abstraction is to make the vector attended to (the "value vector", Z) separate from the "key vector" (s): scores are computed against the keys s1..s4, but the weighted sum is taken over the values z1..z4.

SLIDE 36

Attention

In this abstraction: z1..z4 are the values, s1..s4 are the keys, and the decoder state h_i is the query.

SLIDE 37

Attention

Alternative Scoring Functions

[Figure: a table of alternative forms for the scoring function 𝜔.]

SLIDE 38

Attention

Alternative Scoring Functions

If the variables are standardized, a matrix multiply produces a similarity score.

SLIDE 39

Attention

[Figure: attention weights between the decoder state h_i and the encoder states s1..s4. ("synced", 2017)]


SLIDE 41

Attention

[Figure: attention weights between the decoder state h_i and the encoder states s1..s4. ("synced", 2017)]

(Bahdanau et al., 2015)


SLIDE 43

Machine Translation

Why?

  • A $40 billion/year industry
  • A centerpiece of many genres of science fiction
  • A fairly "universal" problem:
    ○ Language understanding
    ○ Language generation
  • Societal benefits of intercultural communication


SLIDE 45

Machine Translation

Why does the neural network approach work? (Manning, 2018)

  • Joint end-to-end training: learning all parameters at once
  • Exploiting distributed representations (embeddings)
  • Exploiting variable-length context
  • High-quality generation from deep decoders: stronger language models (even when wrong, the outputs make sense)

SLIDE 46

Machine Translation

As an optimization problem (Eisenstein, 2018): choose the output sentence that maximizes a scoring function over the input, ŷ = argmax_y Ψ(x, y).

SLIDE 47

Attention

[Figure, recap: attention weights between the decoder state h_i and the encoder states s1..s4. ("synced", 2017)]

SLIDE 48

Attention

Analogy: random access memory over the encoder states s1, s2, s3, s4.

SLIDE 49

Attention

Do we even need all these RNNs?

(Vaswani et al., 2017: "Attention Is All You Need")

SLIDE 50

Attention

Recap: the score function (parameters v, W_h, W_s) produces the weights. A useful abstraction is to make the vector attended to (the "value vector", Z) separate from the "key vector" (s): z1..z4 are the values, s1..s4 the keys, and h_i the query.

SLIDE 51

Attention

The same values/keys/query abstraction in Eisenstein (2018)'s notation: value z_j, key s_j, query h_i.

SLIDE 52

The Transformer: “Attention-only” models

Attention as weighting a value based on a query and key (Eisenstein, 2018):
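A minimal sketch of this query/key/value weighting in scaled dot-product form (the shapes are assumptions; the √d scaling follows Vaswani et al., 2017):

```python
# Sketch: weight each value by how well its key matches the query.
import math
import torch

d = 64                          # assumed dimensionality
q = torch.randn(d)              # query (e.g., the current state h_i)
K = torch.randn(4, d)           # keys (e.g., s_1..s_4)
V = torch.randn(4, d)           # values (e.g., z_1..z_4)

alpha = torch.softmax(K @ q / math.sqrt(d), dim=0)  # match scores -> weights
context = alpha @ V                                 # weighted sum of values
```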

SLIDE 53

The Transformer: “Attention-only” models

[Figure: one attention unit over input x: hidden states h_{i-1}, h_i, h_{i+1}, score function 𝜔, weights α, and the output. (Eisenstein, 2018)]

SLIDE 54

The Transformer: “Attention-only” models

Self-attention: each hidden state h_i attends to the other hidden states of the same sequence (h_{i-1}, h_{i+1}, …). (Eisenstein, 2018)

SLIDE 55

The Transformer: “Attention-only” models

[Figure: self-attention over hidden states h_{i-1}, h_i, h_{i+1}, h_{i+2}.]

SLIDE 56

The Transformer: “Attention-only” models

[Figure: word inputs w_{i-1}, w_i, w_{i+1}, w_{i+2} feed the self-attention layer, followed by a feed-forward network (FFN).]

SLIDE 57

The Transformer: “Attention-only” models

[Figure: words w_{i-1}..w_{i+2} in, self-attention plus FFN, outputs y_{i-1}..y_{i+2} out.]

SLIDE 58

The Transformer: “Attention-only” models

[Figure: the same attention-plus-FFN block, stacked multiple times (…).]

SLIDE 59

The Transformer: “Attention-only” models

Attend to all hidden states in your “neighborhood”.

SLIDE 60

The Transformer: “Attention-only” models

[Figure: each attention weight is a dot product kᵀq between a key and the query; the weighted values are summed (+) to produce the output.]

SLIDE 61

The Transformer: “Attention-only” models

[Figure: as before, but the dot product kᵀq passes through a scaling parameter before the softmax σ(k, q).]

SLIDE 62

The Transformer: “Attention-only” models

Linear layer: WᵀX, with one set of weights for each of K, Q, and V. The attention weights remain the scaled σ(kᵀq).
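A sketch of one self-attention layer with the three linear maps (the dimensions are assumptions):

```python
# Sketch: linear projections give each input its key, query, and value;
# every position then attends to every other position.
import math
import torch
import torch.nn as nn

d = 64
W_k, W_q, W_v = (nn.Linear(d, d, bias=False) for _ in range(3))

X = torch.randn(5, d)                              # hidden states for 5 positions
K, Q, V = W_k(X), W_q(X), W_v(X)
A = torch.softmax(Q @ K.T / math.sqrt(d), dim=-1)  # scaled σ(kᵀq), rows sum to 1
H = A @ V                                          # new hidden states
```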

SLIDE 63

The Transformer: “Attention-only” models

Why?

  • Don't need the complexity of LSTM/GRU cells
  • Constant number of edges between words (or input steps)
  • Enables "interactions" (i.e. adaptations) between words
  • Easy to parallelize: don't need sequential processing
SLIDE 64
SLIDE 64

The Transformer

Limitation (thus far): Can’t capture multiple types of dependencies between words.

SLIDE 65
SLIDE 65

The Transformer

Solution: Multi-head attention

SLIDE 66
SLIDE 66

Multi-head Attention
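A sketch using PyTorch's built-in layer (embed_dim and num_heads are assumptions): several heads run in parallel, so each head is free to capture a different type of dependency.

```python
# Sketch: multi-head self-attention over a toy sequence.
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=64, num_heads=8, batch_first=True)
x = torch.randn(1, 5, 64)      # (batch, seq, dim)
out, weights = mha(x, x, x)    # self-attention; weights averaged over heads
```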

SLIDE 67
SLIDE 67

Transformer for Encoder-Decoder

SLIDE 68

Transformer for Encoder-Decoder

Positional information enters through the sequence index (t).
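A sketch of the sinusoidal positional encoding over the sequence index t, the form used by Vaswani et al. (2017); the dimensions are assumptions:

```python
# Sketch: sinusoids of different frequencies encode each position t.
import torch

def positional_encoding(max_len, d):
    t = torch.arange(max_len).unsqueeze(1).float()          # positions t
    freqs = 10000 ** (-torch.arange(0, d, 2).float() / d)   # per-dimension rates
    pe = torch.zeros(max_len, d)
    pe[:, 0::2] = torch.sin(t * freqs)
    pe[:, 1::2] = torch.cos(t * freqs)
    return pe                                               # added to word embeddings

pe = positional_encoding(100, 64)
```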

SLIDE 69
SLIDE 69

Transformer for Encoder-Decoder

SLIDE 70
SLIDE 70

Transformer for Encoder-Decoder

Residualized Connections

SLIDE 71

Transformer for Encoder-Decoder

Residualized Connections: residuals enable positional information to be passed along.
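A minimal sketch of a residualized sublayer (the post-norm arrangement of Vaswani et al., 2017, assumed here): the input x is added back to the sublayer's output, so positional information can flow upward unchanged.

```python
# Sketch: wrap any sublayer (self-attention, FFN, ...) with a residual.
import torch.nn as nn

class Residual(nn.Module):
    def __init__(self, sublayer, d):
        super().__init__()
        self.sublayer, self.norm = sublayer, nn.LayerNorm(d)

    def forward(self, x):
        return self.norm(x + self.sublayer(x))   # add input back, then normalize

ffn = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 64))
block = Residual(ffn, d=64)
```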

SLIDE 72
SLIDE 72

Transformer for Encoder-Decoder

SLIDE 73

Transformer for Encoder-Decoder

The decoder is, essentially, a language model.

SLIDE 74

Transformer for Encoder-Decoder

The decoder is, essentially, a language model; it blocks out future inputs (masking).
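A sketch of that masking (toy sizes assumed): position i may attend only to positions ≤ i, implemented by setting future scores to -inf before the softmax.

```python
# Sketch: a causal mask keeps the decoder from seeing the future.
import torch

seq_len = 5
mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
scores = torch.randn(seq_len, seq_len)               # raw attention scores
scores = scores.masked_fill(mask, float("-inf"))     # block future positions
alpha = torch.softmax(scores, dim=-1)                # rows sum to 1
```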

SLIDE 75

Transformer for Encoder-Decoder

The decoder is, essentially, a language model; conditioning on the encoder is added by letting the decoder attend to the encoder's states.

SLIDE 76
SLIDE 76

Transformer for Encoder-Decoder

SLIDE 77

Transformer (as of 2017)

BLEU scores on the “WMT-2014” data set:

[Table: BLEU comparison of the Transformer against prior systems.]

SLIDE 78

Transformer

  • Utilizes self-attention
  • Simple attention scoring function (dot product, scaled)
  • Added linear layers for Q, K, and V
  • Multi-head attention
  • Added positional encoding
  • Added residual connections
  • Simulates decoding by masking
SLIDE 79

Transformer

Why?

  • Don't need the complexity of LSTM/GRU cells
  • Constant number of edges between words (or input steps)
  • Enables "interactions" (i.e. adaptations) between words
  • Easy to parallelize: don't need sequential processing

Drawbacks:

  • Only unidirectional by default
  • Only a "single-hop" relationship per layer (multiple layers needed to capture multiple)

SLIDE 80

BERT

Drawbacks of Vanilla Transformers:

  • Only unidirectional by default
  • Only a "single-hop" relationship per layer (multiple layers needed to capture multiple)

BERT: Bidirectional Encoder Representations from Transformers. Produces contextualized embeddings (or a pre-trained contextualized encoder).

SLIDE 81

BERT

Bidirectional Encoder Representations from Transformers. Produces contextualized embeddings (or a pre-trained contextualized encoder).

  • Bidirectional context by "masking" in the middle
  • A lot of layers, hidden states, attention heads
SLIDE 82

BERT

Bidirectional context by "masking" in the middle:

She saw the man on the hill with the telescope.
She [mask] the man on the hill [mask] the telescope.

SLIDE 83

BERT

She saw the man on the hill with the telescope.
She [mask] the man on the hill [mask] the telescope.

Mask 1 in 7 words (a sketch of the procedure follows this list):

  • Too few: expensive, less robust
  • Too many: not enough context
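A toy sketch of the masking step (the function name and plain-string [MASK] token are illustrative; BERT's actual scheme masks 15% of tokens and sometimes swaps in random or unchanged tokens instead):

```python
# Sketch: replace roughly 1 in 7 tokens and remember what to predict.
import random

def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]"):
    masked, targets = [], []
    for tok in tokens:
        if random.random() < mask_rate:
            masked.append(mask_token)
            targets.append(tok)        # the model must predict this token
        else:
            masked.append(tok)
            targets.append(None)       # no loss at this position
    return masked, targets

sent = "She saw the man on the hill with the telescope .".split()
print(mask_tokens(sent))
```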
SLIDE 84

BERT

A lot of layers, hidden states, attention heads:

  • BERT-Base, Cased: 12-layer, 768-hidden, 12-heads, 110M parameters

SLIDE 85

BERT

  • BERT-Base, Cased: 12-layer, 768-hidden, 12-heads, 110M parameters
  • BERT-Large, Cased: 24-layer, 1024-hidden, 16-heads, 340M parameters
  • BERT-Base, Multilingual Cased: 104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters

SLIDE 86
SLIDE 86

BERT

(Devlin et al., 2019)

SLIDE 87

BERT

Differences from the previous state of the art:

  • Bidirectional transformer (through masking)
  • Directions jointly trained at once

(Devlin et al., 2019)

SLIDE 88

BERT

Differences from the previous state of the art:

  • Bidirectional transformer (through masking)
  • Directions jointly trained at once
  • Captures sentence-level relations

(Devlin et al., 2019)


SLIDE 92

BERT

Differences from the previous state of the art:

  • Bidirectional transformer (through masking)
  • Directions jointly trained at once
  • Captures sentence-level relations
  • Tokenizes into "word pieces"

(Devlin et al., 2019)

SLIDE 93
SLIDE 93

BERT Performance: e.g. Question Answering

https://rajpurkar.github.io/SQuAD-explorer/

SLIDE 94

BERT: Attention by Layers

https://colab.research.google.com/drive/1vlOJ1lhdujVjfH857hvYKIdKPTD9Kid8

(Vig, 2019)

SLIDE 95
SLIDE 95

BERT: Pre-training; Fine-tuning

12 or 24 layers


SLIDE 97

BERT: Pre-training; Fine-tuning

12 or 24 layers, with a novel classifier on top (e.g. a sentiment classifier, stance detector, etc.).

SLIDE 98

BERT: Pre-training; Fine-tuning

The [CLS] vector at the start is supposed to capture the meaning of the whole sequence; the novel classifier (e.g. sentiment classifier, stance detector, etc.) is trained on top of it.

SLIDE 99

BERT: Pre-training; Fine-tuning

The [CLS] vector at the start is supposed to capture the meaning of the whole sequence. The average of the top layer (or the second-to-top layer) is also often used as the classifier's input; a sketch of both options follows.
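A sketch of both pooling options using the Hugging Face transformers library (assumed installed; bert-base-cased is the standard released checkpoint):

```python
# Sketch: [CLS] vector vs. mean of the top layer as a sequence representation.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-cased")
bert = AutoModel.from_pretrained("bert-base-cased")

batch = tok("The movie was excellent!", return_tensors="pt")
with torch.no_grad():
    hidden = bert(**batch).last_hidden_state   # (1, seq_len, 768)

cls_vec = hidden[:, 0]        # the [CLS] vector
avg_vec = hidden.mean(dim=1)  # average of the top layer
```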

SLIDE 100
SLIDE 100

Extra Material:

SLIDE 101
SLIDE 101

BERT for Machine Translation:

(Lample & Conneau, Facebook, 2019)


SLIDE 103

BERT for Machine Translation:

Use as a pre-trained model for feeding into a machine translation system. (Lample & Conneau, Facebook, 2019)


SLIDE 105

Neural Machine Translation

Where does the neural approach fall short? (Manning, 2018)

  • The translation process is mostly a black box: can't answer "why" for reordering or word-choice decisions
  • No direct use of semantic or syntactic structures
  • Doesn't model discourse structure: only a rough sense of how sentences relate to each other; doesn't model long-distance anaphora