Machine Translation
Spring 2020 CMPT 825: Natural Language Processing
SFU NatLangLab
Adapted from slides from Chris Manning, Abigail See, Matthew Lamm, Danqi Chen and Karthik Narasimhan
Translation
One of the "holy grail" problems in artificial intelligence
Practical use case: facilitates communication around the world
Examples (with English glosses):
ich mag Äpfel (German) = "I like apples"
J'aime les pommes (French) = "I like apples"
J'aime les pommes rouges (French) = "I like red apples"
[Figure: word-level alignments between English and French, e.g. les ↔ the, pommes ↔ apples]
Machine translation maps a sentence $w^{(s)}$ in a source language (input) to a sentence $w^{(t)}$ in the target language (output):
$$\hat{w}^{(t)} = \arg\max_{w^{(t)}} \psi(w^{(s)}, w^{(t)})$$
where $\psi$ is a scoring function over source–target pairs and $\hat{w}^{(t)}$ is the predicted translation.
Extremely large output space ⟹ decoding is NP-hard.
Two criteria for a good translation:
Adequacy: $w^{(t)}$ should adequately reflect the linguistic content of $w^{(s)}$.
Fluency: $w^{(t)}$ should be fluent text in the target language.
Different translations of "A Vinay le gusta Python" (Spanish for "Vinay likes Python") illustrate these criteria.
BLEU (bilingual evaluation understudy): the geometric mean of $n$-gram precisions $p_n$ between system output and reference, with two modifications:
To avoid $\log 0$, all precisions are smoothed.
Each reference $n$-gram can be matched at most once, so that degenerate output like "the the the the" should not get a unigram precision of 1.
Precision-based metrics favor short translations, so when the hypothesis (length $h$) is shorter than the reference (length $r$), a brevity penalty $e^{1-r/h}$ is applied:
$$\text{BLEU} = \min\left(1, e^{1-r/h}\right) \cdot \exp\left(\frac{1}{N} \sum_{n=1}^{N} \log p_n\right)$$
(G. Doddington, NIST)
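To make the formula concrete, here is a minimal single-sentence, single-reference sketch with clipped $n$-gram counts, additive smoothing, and the brevity penalty; the official corpus-level, multi-reference BLEU differs in details.

```python
import math
from collections import Counter

def bleu(hyp, ref, N=4, eps=1e-9):
    """Smoothed sentence-level BLEU for one hypothesis/reference pair.

    hyp, ref: lists of tokens. Counts are clipped against the
    reference and smoothed with eps to avoid log 0.
    """
    log_precisions = []
    for n in range(1, N + 1):
        hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        # Clipping: each reference n-gram occurrence can be matched at most once
        matched = sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
        total = sum(hyp_ngrams.values())
        log_precisions.append(math.log((matched + eps) / (total + eps)))
    # Brevity penalty: e^(1 - r/h) when the hypothesis is shorter than the reference
    bp = min(1.0, math.exp(1 - len(ref) / max(len(hyp), 1)))
    return bp * math.exp(sum(log_precisions) / N)

print(bleu("I like red apples".split(), "I like red apples".split()))  # -> 1.0
```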
[Table: sample BLEU scores for various system outputs and the reference translation]
Issues?
Early MT (1950s) was rule-based: use a bilingual dictionary to map Russian words to their English counterparts.
1-minute video showing 1954 MT: https://youtu.be/K-HfpsHPmvw
Statistical MT decomposes the problem via Bayes' rule:
$$\hat{y} = \arg\max_y P(y|x) = \arg\max_y P(x|y)\,P(y)$$
Translation/alignment model $P(x|y)$: models how words and phrases should be translated (adequacy/fidelity); learned from parallel data.
Language model $P(y)$: models how to write good English (fluency); learned from monolingual data.
We want the target sentence $w^{(t)}$ that is maximally likely under the conditional distribution $p_{T|S}$:
$$\hat{w}^{(t)} = \arg\max_{w^{(t)}} \psi(w^{(s)}, w^{(t)})$$
Noisy channel view: a source model $p_T$ generates the target sentence $w^{(t)}$, the channel model $p_{S|T}$ turns it into the observed input $w^{(s)}$, and the decoder recovers $w^{(t)}$.
The score decomposes into an adequacy term and a fluency term:
$$\psi(w^{(s)}, w^{(t)}) = \psi_A(w^{(s)}, w^{(t)}) + \psi_F(w^{(t)})$$
$$\log P_{S,T}(w^{(s)}, w^{(t)}) = \log P_{S|T}(w^{(s)}|w^{(t)}) + \log P_T(w^{(t)})$$
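A minimal sketch of noisy-channel decoding over an explicit candidate set. The callables `tm_logprob` and `lm_logprob` are hypothetical stand-ins for a trained channel model and language model; a real decoder searches the exponentially large output space instead of enumerating candidates.

```python
def noisy_channel_decode(src, candidates, tm_logprob, lm_logprob):
    """Pick the candidate maximizing log P(src|tgt) + log P(tgt):
    adequacy from the translation model plus fluency from the LM.
    tm_logprob and lm_logprob are hypothetical scoring functions."""
    return max(candidates,
               key=lambda tgt: tm_logprob(src, tgt) + lm_logprob(tgt))
```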
Parallel data: Europarl (Koehn, 2005).
Introduce a latent variable $A$ modeling the alignment (word-level correspondence) between the source sentence $x$ and the target sentence $y$:
$$P(x|y) = \sum_A P(x, A|y)$$
Alignment can be many-to-one
Alignment can be one-to-many
Some words are very fertile!
Alignment can be many-to-many (phrase-level)
Given the alignment, how do we incorporate it into our model? Marginalize over all possible alignments:
$$P(x|y) = \sum_A P(x, A|y)$$
Let $M^{(s)}$ and $M^{(t)}$ denote the lengths of the source and target sentences. The alignment variable $a_m$ is defined for each word in the source sentence, i.e. it specifies that the $m$th source word is aligned to the $a_m$th word in the target.
Is this sufficient? Example: $a_1 = 2,\ a_2 = 3,\ a_3 = 4, \ldots$
Multiple source words may align to the same target word!
(Slide credit: Brendan O’Connor)
Assume an extra NULL token on the target side, so that source words can align to nothing.
$$\arg\max_{w^{(t)}} p(w^{(t)}|w^{(s)}) = \arg\max_{w^{(t)}} \frac{p(w^{(s)}, w^{(t)})}{p(w^{(s)})} = \arg\max_{w^{(t)}} p(w^{(s)}, w^{(t)})$$
How should source words be aligned to words in the target?
IBM Model 1: assume a uniform alignment distribution,
$$p(a_m \,|\, m, M^{(s)}, M^{(t)}) = \frac{1}{M^{(t)}}$$
Every alignment is equally likely! The joint probability becomes
$$p(w^{(s)}, w^{(t)}) = p(w^{(t)}) \sum_A \left(\frac{1}{M^{(t)}}\right)^{M^{(s)}} p(w^{(s)}|w^{(t)}, A)$$
leaving the word translation probabilities $p(w^{(s)} = v \,|\, w^{(t)} = u)$ as the parameters to estimate.
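The sum over alignments looks exponential, but with a uniform alignment distribution it factorizes per source position: $p(w^{(s)}|w^{(t)}) = \prod_m \frac{1}{M^{(t)}} \sum_{a_m} p(w^{(s)}_m | w^{(t)}_{a_m})$. A sketch, assuming the translation probabilities are given as a nested dict `t_prob[u][v]` $= p(v|u)$:

```python
import math

def model1_loglik(src, tgt, t_prob, floor=1e-12):
    """log p(src | tgt) under IBM Model 1 (uniform alignments).

    The alignment sum factorizes per source word:
    p(src | tgt) = prod_m (1 / M_t) * sum_u t_prob[u][src[m]]
    t_prob: nested dict with t_prob[u][v] = p(v | u) (assumed given).
    """
    tgt = ["<NULL>"] + list(tgt)  # extra NULL token for unaligned source words
    ll = 0.0
    for v in src:
        marginal = sum(t_prob.get(u, {}).get(v, 0.0) for u in tgt)
        ll += math.log(max(marginal, floor)) - math.log(len(tgt))
    return ll
```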
If alignments were observed, we could estimate the translation probabilities using the MLE:
$$p(v|u) = \frac{\text{count}(u, v)}{\text{count}(u)}$$
where $\text{count}(u, v)$ = #instances where target word $u$ was aligned to source word $v$ in the training set.
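With observed alignments, estimation is just counting. A sketch, where `links` is an assumed list of (target word $u$, source word $v$) alignment links from the training set:

```python
from collections import Counter, defaultdict

def mle_translation_probs(links):
    """p(v|u) = count(u, v) / count(u) from observed alignment links.
    links: iterable of (u, v) pairs, one per aligned word token."""
    links = list(links)
    pair_counts = Counter(links)             # count(u, v)
    u_counts = Counter(u for u, _ in links)  # count(u)
    t_prob = defaultdict(dict)
    for (u, v), c in pair_counts.items():
        t_prob[u][v] = c / u_counts[u]
    return t_prob
```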
But alignments are not observed in parallel text. What can we do? Use expectation maximization (EM): with the current parameters, estimate the likelihood of each alignment (E-step), then re-estimate the parameters from the expected counts (M-step):
$$p(v|u) = \frac{E_q[\text{count}(u, v)]}{\text{count}(u)}$$
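A compact sketch of the EM loop for this model, assuming `pairs` is a list of tokenized (source, target) sentence pairs; initialization is uniform, and the per-word posterior over alignments gives the expected counts:

```python
from collections import defaultdict

def em_ibm1(pairs, iters=10):
    """EM for IBM Model 1 translation probabilities p(v|u).

    pairs: list of (src_tokens, tgt_tokens); the target side gets an
    extra NULL token. Uniform init, expected counts in the E-step,
    normalized counts (MLE on expected counts) in the M-step.
    """
    t = defaultdict(lambda: defaultdict(lambda: 1.0))  # uniform (unnormalized) init
    for _ in range(iters):
        count = defaultdict(lambda: defaultdict(float))
        total = defaultdict(float)
        for src, tgt in pairs:
            tgt = ["<NULL>"] + tgt
            for v in src:
                z = sum(t[u][v] for u in tgt)  # normalizer over alignments of v
                for u in tgt:
                    q = t[u][v] / z            # posterior prob. that v aligns to u
                    count[u][v] += q           # E_q[count(u, v)]
                    total[u] += q              # E_q[count(u)]
        for u in count:                        # M-step: p(v|u) = E[count(u,v)] / E[count(u)]
            for v in count[u]:
                t[u][v] = count[u][v] / total[u]
    return t

# Example: t = em_ibm1([(["das", "Haus"], ["the", "house"])])
```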
In general, use greedy or beam decoding:
$$\arg\max_{w^{(t)}} p(w^{(t)}|w^{(s)}) = \arg\max_{w^{(t)}} p(w^{(s)}, w^{(t)})$$
$$p(w^{(s)}, w^{(t)}) = p(w^{(t)}) \sum_A \left(\frac{1}{M^{(t)}}\right)^{M^{(s)}} p(w^{(s)}|w^{(t)}, A)$$
Greedy decoding: at every step $m$, pick the target word $w^{(t)}_m$ that maximizes the product of:
the language model probability $p_{LM}(w^{(t)}_m \,|\, w^{(t)}_{<m})$
the translation probability $p(w^{(s)}_{b_m} \,|\, w^{(t)}_m)$, where $b_m$ is the inverse alignment from target to source
The alignment probability $p(a_m \,|\, m, M^{(s)}, M^{(t)})$ is constant, so it can be ignored.
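A sketch of this greedy rule, assuming one target word per source word (inverse alignment $b_m = m$) and hypothetical `lm_logprob` / `t_logprob` scoring functions:

```python
def greedy_decode(src, tgt_vocab, lm_logprob, t_logprob):
    """At step m, emit the target word maximizing
    log p_LM(w | history) + log p(src[m] | w); the uniform
    alignment term is constant and therefore dropped.
    lm_logprob(word, history) and t_logprob(src_word, tgt_word)
    are hypothetical scoring functions."""
    out = []
    for s_word in src:  # assume inverse alignment b_m = m
        out.append(max(tgt_vocab,
                       key=lambda w: lm_logprob(w, out) + t_logprob(s_word, w)))
    return out
```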
Phrase-based MT: align and translate multiword spans or "phrases" rather than individual words.
Motivation: the literal word-by-word translation often differs from the actual idiomatic translation.
Assign translation probabilities to multi-word units, stored in phrase tables.
(Slide credit: Greg Durrett)
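A toy illustration of a phrase table; the entries and probabilities below are invented purely for illustration:

```python
# Toy phrase table: source phrase -> {candidate target phrase: probability}.
# Entries and probabilities are invented for illustration only.
phrase_table = {
    ("j'aime",): {("i", "like"): 0.8, ("i", "love"): 0.2},
    ("les", "pommes"): {("the", "apples"): 0.9, ("apples",): 0.1},
}

def phrase_options(src_span):
    """Candidate translations (with probabilities) for a source span."""
    return phrase_table.get(tuple(src_span), {})

print(phrase_options(["les", "pommes"]))  # {('the', 'apples'): 0.9, ('apples',): 0.1}
```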