

SLIDE 1

Transformer Ablation Studies

Simon Will

Institute of Formal and Applied Linguistics, Charles University
Seminar: Statistical Machine Translation
Instructor: Dr. Ondřej Bojar

May 30, 2019

SLIDE 2

Structure

▶ The Transformer
▶ Ablation Concerns
▶ Feed Forward Layers
▶ Positional Embeddings
▶ Self-Attention Keys and Queries

SLIDE 3

Idea

▶ The Transformer is successful and many variations exist (e.g. Ott et al. 2018; Dai et al. 2019)
▶ Difficult to know what the essentials are and what each part contributes
→ Train similar models differing in crucial points

SLIDE 4

Transformer

▶ Encoder-decoder architecture based on attention (Vaswani et al. 2017)
▶ No recurrence
▶ Constant in source and target “time” while training
▶ In inference, only constant in source “time”
▶ More easily parallelizable than RNN-based networks

SLIDE 5

Transformer Illustration

Figure: Two-Layer Transformer (Image from Alammar 2018)

SLIDE 6

Ablation Concerns

Figure: Areas of Concern for this Project

SLIDE 7

Feed Forward Layers

▶ Contribution of feed-forward layers in encoder and decoder is not clear.
▶ Is the attention alone enough?
→ Three configurations (a sketch of the toggle follows the list):

▶ No encoder FF layer
▶ No decoder FF layer
▶ No decoder and no encoder FF layer
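The slides do not show the implementation, but a minimal PyTorch sketch of a switchable feed-forward sublayer looks like this; `EncoderLayer` and `use_ff` are illustrative names, not the project's actual code:

```python
import torch.nn as nn

class EncoderLayer(nn.Module):
    """Transformer encoder layer whose feed-forward sublayer can be disabled."""

    def __init__(self, d_model: int, n_heads: int, d_ff: int, use_ff: bool = True):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads)
        self.norm1 = nn.LayerNorm(d_model)
        self.use_ff = use_ff
        if use_ff:
            self.ff = nn.Sequential(
                nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
            )
            self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)   # self-attention sublayer
        x = self.norm1(x + attn_out)       # residual connection + layer norm
        if self.use_ff:                    # ablation: FF sublayer switched off
            x = self.norm2(x + self.ff(x))
        return x
```

A decoder layer would get the same toggle; the three configurations then correspond to disabling the FF sublayer in the encoder, in the decoder, or in both.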

SLIDE 10

Positional Embeddings

▶ No recurrence → no information about order of tokens

→ Add information via explicit positional embeddings
▶ Added to the word embedding vector

▶ Two types:

▶ Learned embeddings of absolute position (e.g. Gehring et al. 2017)
▶ Sinusoidal embeddings (used in Vaswani et al. 2017):

PE(pos, 2i) = sin(pos / 10000^(2i / d_key))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_key))

where d_key is the key dimensionality.
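As a concrete check of the formulas, a small NumPy sketch (not from the slides) that builds the full embedding matrix:

```python
import numpy as np

def sinusoidal_embeddings(n_positions: int, d_key: int) -> np.ndarray:
    """PE(pos, 2i) = sin(pos / 10000^(2i/d_key)); PE(pos, 2i+1) = cos(same)."""
    pos = np.arange(n_positions)[:, np.newaxis]    # (n_positions, 1)
    two_i = np.arange(0, d_key, 2)[np.newaxis, :]  # even dimension indices 2i
    angles = pos / np.power(10000.0, two_i / d_key)
    pe = np.zeros((n_positions, d_key))
    pe[:, 0::2] = np.sin(angles)                   # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                   # odd dimensions: cosine
    return pe

# The setting of the next slide's figure: 20 tokens, key dimensionality 512.
pe = sinusoidal_embeddings(20, 512)
```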

SLIDE 11

Positional Embeddings Illustrated

Figure: Illustrated positional embeddings for a 20-token sentence and key dimensionality 512 (taken from Alammar 2018)

SLIDE 12

Modifications

▶ Vary the “rainbow stretch” by introducing a stretching factor α (a code sketch follows this list):

PE(pos, 2i) = sin(pos / 10000^(2iα / d_key))

▶ Expectations:
▶ α too low: no positional information
▶ α too high: word embedding information destroyed
▶ Some α other than 1 is optimal
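A minimal sketch of the modification, reusing the construction from the earlier NumPy snippet; only the exponent changes, so α rescales how quickly the wavelengths grow across dimensions:

```python
import numpy as np

def stretched_embeddings(n_positions: int, d_key: int, alpha: float = 1.0) -> np.ndarray:
    """Sinusoidal embeddings with the exponent scaled by alpha.
    alpha = 1.0 recovers the standard embeddings of Vaswani et al. (2017)."""
    pos = np.arange(n_positions)[:, np.newaxis]
    two_i = np.arange(0, d_key, 2)[np.newaxis, :]
    angles = pos / np.power(10000.0, alpha * two_i / d_key)  # exponent 2iα/d_key
    pe = np.zeros((n_positions, d_key))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe
```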

SLIDE 13

Self-Attention Keys and Queries

▶ Attention(Q, K, V) = softmax((Q W_Q)(K W_K)^T / √d_k) (V W_V)
▶ In the encoder, source words are used for key and query generation with different matrices
▶ Modification: use the same matrix for both (sketched below)
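A NumPy sketch of the modified attention, assuming single-head attention for clarity; `share_qk` is an illustrative flag for the proposed change, not the project's actual interface:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V, W_Q, W_K, W_V, share_qk=False):
    """softmax((Q W_Q)(K W_K)^T / sqrt(d_k)) (V W_V), optionally with W_K = W_Q."""
    d_k = W_Q.shape[1]
    q = Q @ W_Q
    k = K @ (W_Q if share_qk else W_K)  # ablation: one matrix for keys and queries
    v = V @ W_V
    return softmax(q @ k.T / np.sqrt(d_k)) @ v
```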

SLIDE 14

Experiment Design

▶ Run all basic configurations
▶ Combine well-performing modifications
▶ How to compare?

▶ BLEU score on the test set at best dev-set performance (computed as sketched below)
▶ Whole learning curves (similar to Popel and Bojar 2018)
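The slides do not name a scoring tool; one common choice is the sacrebleu package, sketched here with placeholder file names:

```python
import sacrebleu

# Placeholder file names: one detokenized sentence per line.
with open("test.hyp") as f:
    hypotheses = [line.strip() for line in f]
with open("test.ref") as f:
    references = [line.strip() for line in f]

# corpus_bleu expects the hypotheses and a list of reference streams.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU = {bleu.score:.2f}")
```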

SLIDE 15

Dataset

▶ Parallel image captions (Elliott et al. 2016)

▶ https://github.com/multi30k/dataset

▶ Short sentences
▶ Rather small (30k sentences)

▶ Good because fitting takes less than a day
▶ Bad because dev and test performance is far below train performance

SLIDE 16

Conclusion

▶ Experiments still pending
▶ Expecting to see mainly negative results
▶ Hopefully some positive ones

SLIDE 17

Nice translation by the system

“eine junge frau hält eine blume , um sich an der blume zu halten .”
“a young woman is holding a flower in order to hold on to the flower .”

SLIDE 18

References I

Alammar, Jay (2018). The Illustrated Transformer. URL: http://jalammar.github.io/illustrated-transformer/.

Dai, Zihang et al. (2019). “Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context”. In: CoRR abs/1901.02860. arXiv: 1901.02860. URL: http://arxiv.org/abs/1901.02860.

Elliott, Desmond et al. (2016). “Multi30K: Multilingual English-German Image Descriptions”. In: Proceedings of the 5th Workshop on Vision and Language. Berlin, Germany: Association for Computational Linguistics, pp. 70–74. DOI: 10.18653/v1/W16-3210. URL: http://www.aclweb.org/anthology/W16-3210.

Gehring, Jonas et al. (2017). “Convolutional Sequence to Sequence Learning”. In: CoRR abs/1705.03122. arXiv: 1705.03122. URL: http://arxiv.org/abs/1705.03122.

Ott, Myle et al. (2018). “Scaling Neural Machine Translation”. In: CoRR abs/1806.00187. arXiv: 1806.00187. URL: http://arxiv.org/abs/1806.00187.

SLIDE 19

References II

Popel, Martin and Ondřej Bojar (2018). “Training Tips for the Transformer Model”. In: CoRR abs/1804.00247. arXiv: 1804.00247. URL: http://arxiv.org/abs/1804.00247.

Vaswani, Ashish et al. (2017). “Attention Is All You Need”. In: CoRR abs/1706.03762. arXiv: 1706.03762. URL: http://arxiv.org/abs/1706.03762.