Neural Machine Translation: directions for improvement
CMSC 470, Marine Carpuat
How can we improve on state-of-the-art machine translation approaches?
- Model
- Training
- Data
- Objective
- Algorithm
Addressing domain mismatch
Slides adapted from Kevin Duh [Domain Adaptation in Machine Translation, MTMA 2019]
Supervised training data is not always in the domain we want to translate!
Domain adaptation is an important practical problem in machine translation
- It may be expensive to obtain training sets that are both large and relevant to the test domain
- So we often have to work with whatever we can!
Possible strategies: “Continued Training” or “fine-tuning”
- Requires small in-domain parallel data
[Luong and Manning 2016]
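Continued training can be sketched on a toy model: pre-train on plentiful "general domain" data, then keep optimizing the same parameters on a small "in-domain" set with a smaller learning rate. The linear model, synthetic data, and learning rates below are illustrative stand-ins for an NMT system, a minimal sketch of the idea rather than the actual method of [Luong and Manning 2016]:

```python
import numpy as np

rng = np.random.default_rng(0)

def gradient_descent(w, X, y, lr, steps):
    # full-batch gradient descent on a least-squares objective y ~ X @ w
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(X)
        w = w - lr * grad
    return w

# "general domain": plentiful data drawn from one linear relation
X_gen = rng.normal(size=(1000, 3))
y_gen = X_gen @ np.array([1.0, -2.0, 0.5])

# "in domain": a small sample from a shifted relation
X_in = rng.normal(size=(50, 3))
y_in = X_in @ np.array([1.5, -1.5, 0.0])

w_pre = gradient_descent(np.zeros(3), X_gen, y_gen, lr=0.1, steps=200)  # pre-train
w_ft = gradient_descent(w_pre, X_in, y_in, lr=0.01, steps=200)          # continued training

def mse(w, X, y):
    return float(np.mean((X @ w - y) ** 2))

print(mse(w_pre, X_in, y_in), mse(w_ft, X_in, y_in))
```

Starting from the pre-trained weights rather than from scratch is what makes the second phase "continued" training; the small learning rate keeps the model from forgetting too much of the general domain.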
Possible strategies: back-translation
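The back-translation pipeline can be sketched as follows: synthesize extra (source, target) pairs from monolingual target-side text using a target-to-source model, then train the forward model on the augmented corpus. The `reverse_translate` function below is a toy token-reversing stand-in for a real reverse-direction NMT model, and the sentences are made up:

```python
def reverse_translate(target_sentence):
    # toy stand-in for a trained target->source NMT system:
    # a real pipeline would decode with a reverse-direction model here
    return " ".join(reversed(target_sentence.split()))

# small genuine parallel corpus (source, target)
parallel = [("wir kochten abendessen", "we made dinner")]

# plentiful monolingual target-side text
monolingual_target = ["we made lunch", "they made dinner"]

# back-translate the monolingual data to get synthetic source sides
synthetic = [(reverse_translate(t), t) for t in monolingual_target]

# the forward model is then trained on the augmented corpus
augmented = parallel + synthetic
print(augmented)
```

The key property is that the target side of every synthetic pair is genuine, fluent text, so the forward model learns from clean outputs even when the synthetic sources are noisy.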
Possible strategies: data selection
- Train a language model on data representative of the test domain
- N-gram count-based model [Moore & Lewis 2010]
- Neural model [Duh et al. 2013]
- Neural MT model [Junczys-Dowmunt 2018]
- Use perplexity of the LM on new data to measure distance from the test domain
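A Moore & Lewis-style selection can be sketched with a toy unigram LM: score each candidate sentence by the difference between its in-domain and general-domain per-token log-probability, and keep the sentences closest to the in-domain model. The tiny corpora and the add-one-smoothed unigram model are illustrative stand-ins for the real n-gram or neural LMs cited above:

```python
import math
from collections import Counter

def unigram_lm(corpus):
    # add-one smoothed unigram LM; returns a per-token log-probability scorer
    counts = Counter(w for s in corpus for w in s.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 slot for unseen words
    def logprob(sentence):
        toks = sentence.split()
        return sum(math.log((counts[w] + 1) / (total + vocab)) for w in toks) / len(toks)
    return logprob

in_domain = ["the patient received treatment", "the doctor saw the patient"]
general = ["the cat sat on the mat", "stocks fell sharply today"]
candidates = ["the patient saw the doctor", "the cat sat today"]

lp_in, lp_gen = unigram_lm(in_domain), unigram_lm(general)

# Moore-Lewis cross-entropy difference: larger lp_in - lp_gen means
# closer to the in-domain distribution; sort best-first
scores = sorted(candidates, key=lambda s: -(lp_in(s) - lp_gen(s)))
print(scores[0])
```

Using the *difference* of the two scores, rather than in-domain perplexity alone, avoids simply preferring short or generally frequent sentences.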
Possible strategies: different weights for different training samples
- Corpus-level weights
- Instance-level weights, based on a classifier that measures similarity of samples with in-domain data
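Instance-level weighting can be sketched as a weighted training loss: each sample's loss term is scaled by a weight, e.g. a domain classifier's estimate of how in-domain the sample is. The per-sentence log-likelihoods and weights below are made-up values, not real model outputs:

```python
import numpy as np

def weighted_nll(logprobs, weights):
    # weighted negative log-likelihood: each sentence's loss is scaled
    # by its weight (e.g. classifier-estimated in-domain similarity)
    logprobs = np.asarray(logprobs, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return float(-(weights * logprobs).sum() / weights.sum())

# hypothetical values: two in-domain-looking samples, one out-of-domain
logprobs = [-2.0, -3.0, -10.0]
weights = [1.0, 0.9, 0.1]  # from a domain-similarity classifier

print(weighted_nll(logprobs, weights))
```

The out-of-domain sample still contributes, but its influence on the gradient is down-weighted rather than discarded outright, which is the difference between weighting and hard data selection.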
How can we improve on state-of-the-art machine translation approaches?
- Model
- Training
- Data
- Objective
- Algorithm
Beyond Maximum Likelihood Training
How can we improve NMT training?
- Assumption: references can substitute for predicted translations during training
- Our hypothesis: modeling divergences between references and predictions improves NMT
Based on a paper by Weijia Xu [NAACL 2019]
Exposure Bias: Gap Between Training and Inference
[Figure: left, maximum likelihood training conditions each decoder step on the reference prefix ("<s> We made dinner"); right, inference conditions each step on the model's own translation so far ("<s> We will ..."), both for the source 我们 做了 晚餐]

P(y | x) = ∏_{t=1}^{T} p(y_t | y_<t, x)

Loss = − ∑_{t=1}^{T} log p(y_t | y_<t, x)

where x is the source sentence and y the reference translation.
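The teacher-forced maximum-likelihood loss can be made concrete with a toy decoder: the hand-written `step_prob` table below stands in for the model's next-word distribution, and the loss sums negative log-probabilities of each reference word given the *reference* prefix, never the model's own output:

```python
import math

def step_prob(prefix, word):
    # toy stand-in for the decoder's next-word distribution p(word | prefix, x)
    table = {
        (): {"We": 0.8, "I": 0.2},
        ("We",): {"made": 0.6, "will": 0.4},
        ("We", "made"): {"dinner": 0.9, "lunch": 0.1},
    }
    return table[tuple(prefix)][word]

reference = ["We", "made", "dinner"]

# Loss = -sum_t log p(y_t | y_<t, x), conditioning on reference prefixes
loss = -sum(math.log(step_prob(reference[:t], reference[t]))
            for t in range(len(reference)))
print(round(loss, 4))
```

Because every prefix comes from the reference, the model never sees its own mistakes during training, which is exactly the exposure bias this slide describes.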
How to Address Exposure Bias?
- Because of exposure bias
- Models don’t learn to recover from their errors
- Cascading errors at test time
- Solution: expose models to their own predictions during training
- But how to compute the loss when the partial translation diverges from the reference?
Existing Method: Scheduled Sampling [Bengio et al., NeurIPS 2015]

[Figure: decoding the source 我们 做了 晚餐 with reference "<s> We made dinner </s>"; at each step the next decoder input is chosen randomly (with probability P) between the reference token, e.g. "made", and the model's own prediction, e.g. "will", which can yield an incorrect synthetic reference such as "We will dinner"]

The loss still scores each reference word against the prefix fed to the decoder, aligned by time index:

J = log p("We" | "<s>", source)
J = log p("made" | "<s> We", source)
J = log p("dinner" | "<s> We will", source)
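The coin-flip between reference token and model prediction can be sketched as follows; `predict` is a toy stand-in for the decoder's argmax, chosen so that feeding pure model predictions reproduces the slide's "We will dinner" example. The sampling probability p is annealed from 1 toward 0 over training:

```python
import random

def predict(prefix):
    # toy stand-in for the decoder's most likely next word
    return {"<s>": "We", "We": "will", "will": "dinner"}.get(prefix[-1], "</s>")

def sample_prefix(reference, p, rng):
    # at each step, feed the reference token with probability p,
    # otherwise feed the model's own previous prediction
    prefix = ["<s>"]
    for t in range(1, len(reference)):
        token = reference[t] if rng.random() < p else predict(prefix)
        prefix.append(token)
    return prefix

rng = random.Random(0)
reference = ["<s>", "We", "made", "dinner"]
print(sample_prefix(reference, p=1.0, rng=rng))  # pure teacher forcing
print(sample_prefix(reference, p=0.0, rng=rng))  # pure model predictions
```

With p = 1 this reduces to ordinary teacher forcing; with p = 0 the decoder runs on its own outputs, producing the incorrect synthetic reference the next slides discuss.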
Our Solution: Align Reference with Partial Translations

[Figure: sampled partial translation "<s> We will make dinner" with decoder hidden states h_1 ... h_4, for the source 我们 做了 晚餐; reference "<s> We made dinner </s>"]

Each reference word is scored against every sampled prefix, weighted by a soft alignment. For the reference word "dinner":

b_2 log p("dinner" | "<s>", source)
+ b_3 log p("dinner" | "<s> We", source)
+ b_4 log p("dinner" | "<s> We will", source)
+ b_5 log p("dinner" | "<s> We will make", source)

Soft alignment: b_j ∝ exp(Embed(dinner) · h_j)
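The soft alignment weights can be sketched as a softmax over dot products between the reference word's embedding and the decoder hidden states of the sampled prefix. The vectors below are random stand-ins for learned embeddings and LSTM states:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden = rng.normal(size=(4, 8))   # h_1..h_4, one per sampled-prefix position
embed_dinner = rng.normal(size=8)  # embedding of the reference word "dinner"

# b_j proportional to exp(Embed(dinner) . h_j), normalized over positions j
scores = hidden @ embed_dinner
b = np.exp(scores - scores.max())  # subtract max for numerical stability
b /= b.sum()

print(b.round(3))
```

Because the weights are a differentiable softmax rather than a hard choice of one position, gradients flow to every prefix position, letting the model learn which sampled prefix each reference word should be scored against.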
Training Objective

Ours: soft alignment between reference word y_t and sampled prefix ŷ_<j

J_SA = ∑_{(x,y)∈D} ∑_{t=1}^{T} log ∑_{j=1}^{T'} b_{t,j} p(y_t | ŷ_<j, x)

Scheduled Sampling: hard alignment by time index t

J_SS = ∑_{(x,y)∈D} ∑_{t=1}^{T} log p(y_t | ŷ_<t, x)

Combined with maximum likelihood: J = J_SA + J_ML
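A minimal numeric sketch of the soft-alignment objective for one sentence: for each reference position t, the loss marginalizes over sampled-prefix positions j with alignment weights b[t, j]. The probabilities p, the weights b, and the maximum-likelihood term are toy values, not real model outputs:

```python
import numpy as np

# p[t, j] stands in for p(y_t | sampled prefix up to position j, x)
p = np.array([[0.7, 0.5, 0.2],
              [0.1, 0.4, 0.8]])

# b[t, j]: soft alignment weights, each row sums to 1
b = np.array([[0.8, 0.1, 0.1],
              [0.1, 0.2, 0.7]])

# J_SA = sum_t log sum_j b[t, j] * p[t, j]
J_SA = float(np.log((b * p).sum(axis=1)).sum())

J_ML = -1.2          # toy teacher-forced log-likelihood for the same sentence
J = J_SA + J_ML      # combined training objective
print(round(J_SA, 4))
```

Scheduled sampling is the special case where each row of b is a one-hot vector on the diagonal, i.e. a hard alignment by time index.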
Experiments
- Data
- IWSLT14 de-en
- IWSLT15 vi-en
- Model
- Bi-LSTM encoder, LSTM decoder, multilayer perceptron attention
- Differentiable sampling with Straight-Through Gumbel Softmax
- Based on AWS Sockeye
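The Straight-Through Gumbel Softmax named above can be sketched in NumPy (forward pass only, since NumPy has no autograd): sample a one-hot word for the forward pass while keeping a differentiable soft distribution for the backward pass. In a real system the gradient flows through `soft` while `hard` feeds the next decoder step:

```python
import numpy as np

rng = np.random.default_rng(0)

def st_gumbel_softmax(logits, tau=1.0):
    # Gumbel(0, 1) noise makes the argmax a sample from softmax(logits)
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))
    soft = np.exp((logits + g) / tau)
    soft /= soft.sum()                 # differentiable relaxation
    hard = np.zeros_like(soft)
    hard[soft.argmax()] = 1.0          # discrete one-hot sample, forward pass
    # straight-through: forward uses `hard`, gradients would flow via `soft`
    return hard, soft

hard, soft = st_gumbel_softmax(np.array([2.0, 1.0, 0.1]))
print(hard, soft.round(3))
```

The temperature `tau` trades off between a near-one-hot `soft` (small tau, low bias, high gradient variance) and a smoother one (large tau).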
Our Method Outperforms Maximum Likelihood and Scheduled Sampling
[Figure: BLEU (axis 22 to 28) on de-en, en-de, vi-en for Baseline, Scheduled Sampling, Differentiable Scheduled Sampling, and Our Method]
Our Method Needs No Annealing
[Figure: BLEU (axis 17 to 27) on de-en, en-de, vi-en for Baseline, Scheduled Sampling w/ annealing, Scheduled Sampling w/o annealing, and Our Method (no annealing)]
Scheduled sampling: BLEU drops when used without annealing!
Summary
Introduced a new training objective
- 1. Generate translation prefixes via differentiable sampling
- 2. Learn to align the reference words with sampled prefixes
Better BLEU than maximum likelihood and scheduled sampling (de-en, en-de, vi-en).
Simple to train: no annealing schedule required.
What you should know
- Lots of things can be done to improve neural MT even without changing the model architecture
- The domain of training data matters
- Simple techniques can be used to measure distance from test domain
- And to adapt model to domain of interest
- The standard maximum likelihood objective is suboptimal
- It does not directly measure translation quality
- It is based on reference translations only, so the model is not exposed to its own errors during training
- Developing reliable alternatives is an active area of research