New Territory of Machine Translation
Kyunghyun Cho
Courant Institute of Mathematical Sciences & Center for Data Science, New York University
I really enjoyed this film. (I, really, enjoyed, this, film, .)
Word segmentation, tokenization, …
I really enjoyed this film. However, that is on the basis that Science Fiction is one of my favourite genres: I can see some audiences finding the philosophical plotting too slow and wordy to hold their interest. But if you like your films deep and thought-provoking, as well as deliciously tense in places, then this might be for you.
http://www.imdb.com/title/tt0470752/reviews?ref_=tt_urv
Machine Translation
(J’, ai, vraiment, aimé, ce, film, .) J'ai vraiment aimé ce film.
Detokenization, …
Je vraiment aimé ce film. Cependant, ce qui est sur la base que la science-fiction est un de mes genres préférés: je peux voir certains publics trouver le tracé philosophique trop lent et verbeux pour maintenir leur intérêt. Mais si vous aimez vos films et profonde réflexion, ainsi que délicieusement tendue dans les lieux, alors ce pourrait être pour vous.
Google Translate
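The tokenization and detokenization steps on this slide can be sketched as follows. This is only an illustration: the regular expressions are assumptions chosen to reproduce the English token list shown above, not the preprocessing used in the talk, and the function names `tokenize` and `detokenize` are hypothetical.

```python
import re

def tokenize(sentence):
    # Split into runs of word characters or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", sentence)

def detokenize(tokens):
    # Join with spaces, then drop the space before closing punctuation.
    text = " ".join(tokens)
    return re.sub(r"\s+([.,:;!?])", r"\1", text)

source = "I really enjoyed this film."
tokens = tokenize(source)
restored = detokenize(tokens)
```

A translation system would sit between the two calls, mapping the source token sequence to a target token sequence before detokenization.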
However, that is on the basis that Science Fiction is one of my favourite genres:
(However, ,, that, is, on, the, basis, that, Science, Fiction, is, one, of, my, favourite, genres, :)
Word segmentation, tokenization, …
Machine Translation
(Cependant, ,, ce, qui, est, sur, la, base, que, la, science-fiction, est, un, de, mes, genres, préférés, :) Cependant, ce qui est sur la base que la science-fiction est un de mes genres préférés:
Detokenization, …
Do you see three issues here?
Word-level Sentence-wise Bilingual Translation
$$Y = (y_1, y_2, \ldots, y_{T_y})$$
$$X = (x_1, x_2, \ldots, x_{T_x}), \qquad \{(X^1, Y^1), (X^2, Y^2), \ldots, (X^N, Y^N)\}$$
$$\frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_{y,n}} \log p(y_t^n \mid y_{<t}^n, X^n)$$
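As a small worked instance of the objective above, the sketch below averages per-sentence log-likelihoods over a toy corpus. The per-token probabilities are made-up numbers standing in for a model's outputs p(y_t | y_<t, X), and the function names are hypothetical.

```python
import math

def sequence_log_likelihood(token_probs):
    # token_probs[t] stands for p(y_t | y_<t, X) at each target position.
    return sum(math.log(p) for p in token_probs)

def corpus_objective(corpus_token_probs):
    # (1/N) * sum over sentence pairs of the per-sentence log-likelihood.
    n = len(corpus_token_probs)
    return sum(sequence_log_likelihood(p) for p in corpus_token_probs) / n

# Two toy sentence pairs: the first has two target tokens, the second has one.
toy = [[0.5, 0.25], [0.5]]
objective = corpus_objective(toy)
```

Training would adjust the model parameters to make this quantity as large as possible.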
[Figure: encoder-decoder. Encoder states h_1, ..., h_4 read x_1, ..., x_4, and the last state is the context c. Decoder states z_0, ..., z_4 start from ⟨s⟩ and emit y_1, ..., y_4.]
ID    Word
1     the
2     a
2093  cat
$e_{\text{cat}} = [0, \ldots, 0, 1, 0, \ldots, 0]^\top$, with the 1 at the 2093-th element
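A minimal sketch of the one-hot encoding in the table above. The vocabulary size of 3000 is a hypothetical value chosen only so that ID 2093 fits, and `one_hot` is an illustrative helper name.

```python
def one_hot(word_id, vocab_size):
    # IDs are 1-based, as in the table above.
    vec = [0] * vocab_size
    vec[word_id - 1] = 1
    return vec

vocab = {"the": 1, "a": 2, "cat": 2093}
VOCAB_SIZE = 3000  # hypothetical vocabulary size

e_cat = one_hot(vocab["cat"], VOCAB_SIZE)
e_the = one_hot(vocab["the"], VOCAB_SIZE)

# Any two distinct one-hot vectors are orthogonal, so the encoding
# itself carries no notion of similarity between words.
dot = sum(a * b for a, b in zip(e_cat, e_the))
```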
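$$h_t = \begin{cases} \phi_{\text{enc}}(h_{t-1}, x_t), & \text{if } t > 0 \\ 0, & \text{otherwise} \end{cases}$$
$$c = h_{T_x}$$
$$z_0 = f_{\text{init}}(c)$$
$$z_t = \phi_{\text{dec}}(z_{t-1}, c, y_{t-1})$$
$$p(y_t \mid y_{<t}, X) \propto \exp(\phi_{\text{out}}(z_t))$$
Generate $y_t$ until $y_t = \langle\text{eos}\rangle$.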
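The recurrences above can be sketched in plain Python. Everything concrete here is an assumption made for illustration: a plain tanh recurrence stands in for the unspecified functions phi_enc, phi_dec and f_init, the weights are random and untrained, the sizes are tiny, and token 0 doubles as both the start symbol and ⟨eos⟩.

```python
import math
import random

random.seed(0)
DIM, VOCAB, EOS = 4, 5, 0  # hypothetical sizes; token 0 plays the role of <eos>

def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def add(*vecs):
    return [sum(xs) for xs in zip(*vecs)]

def tanh(v):
    return [math.tanh(x) for x in v]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

E = rand_matrix(VOCAB, DIM)  # embedding table, one row per token
W_enc, U_enc = rand_matrix(DIM, DIM), rand_matrix(DIM, DIM)
W_init = rand_matrix(DIM, DIM)
W_dec, U_dec, C_dec = (rand_matrix(DIM, DIM) for _ in range(3))
W_out = rand_matrix(VOCAB, DIM)

def encode(xs):
    h = [0.0] * DIM  # h_0 = 0
    for x in xs:     # h_t = phi_enc(h_{t-1}, x_t)
        h = tanh(add(matvec(W_enc, h), matvec(U_enc, E[x])))
    return h         # c = h_{T_x}

def greedy_decode(c, max_len=10):
    z = tanh(matvec(W_init, c))  # z_0 = f_init(c)
    y_prev, out = EOS, []        # assumption: <eos> is reused as the start symbol
    for _ in range(max_len):
        # z_t = phi_dec(z_{t-1}, c, y_{t-1})
        z = tanh(add(matvec(W_dec, z), matvec(C_dec, c), matvec(U_dec, E[y_prev])))
        # p(y_t | y_<t, X) is proportional to exp(phi_out(z_t))
        probs = softmax(matvec(W_out, z))
        y_prev = max(range(VOCAB), key=probs.__getitem__)
        out.append(y_prev)
        if y_prev == EOS:        # stop once <eos> is produced
            break
    return out

c = encode([1, 2, 3, 4])
ys = greedy_decode(c)
```

Note that the whole source sentence reaches the decoder only through the single vector `c`, whatever the sentence length.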
This is not too great a model, because
“You can’t cram the meaning of a whole %&!$# sentence into a single $&!#* vector!”
Ray Mooney
[Figure: bidirectional encoder. Forward states →h_1, ..., →h_4 and backward states ←h_1, ..., ←h_4 over x_1, ..., x_4 are combined into annotation vectors h_1, ..., h_4, which together form the context c. A scoring function f_score takes the previous decoder state z_{t-1} and an annotation vector and outputs a score, e.g. e_{2,3}.]
$$\alpha_{j,t} = \frac{\exp(e_{j,t})}{\sum_{j'} \exp(e_{j',t})}$$
[Figure: attention. The weights α_{1,3}, α_{2,3}, α_{3,3}, α_{4,3} scale the annotation vectors h_1, ..., h_4, and their weighted sum feeds the decoder state z_3.]
$$c_t = \sum_{j} \alpha_{j,t} h_j$$
$$z_t = \phi_{\text{dec}}(z_{t-1}, c_t, y_{t-1})$$
$$p(y_t \mid y_{<t}, X) \propto \exp(\phi_{\text{out}}(z_t))$$
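The attention step can be sketched as a softmax over scores followed by a weighted sum of annotation vectors. The dot product used for scoring is an assumption standing in for f_score, whose form the slides do not give, and the annotation vectors are toy values.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot_score(h_j, z_prev):
    # Assumption: a dot product stands in for f_score.
    return sum(a * b for a, b in zip(h_j, z_prev))

def attend(annotations, z_prev, score=dot_score):
    e = [score(h_j, z_prev) for h_j in annotations]   # e_{j,t}
    alpha = softmax(e)                                # alpha_{j,t}
    dim = len(annotations[0])
    # c_t = sum_j alpha_{j,t} * h_j
    c_t = [sum(a * h[k] for a, h in zip(alpha, annotations)) for k in range(dim)]
    return alpha, c_t

annotations = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
alpha, c_t = attend(annotations, [0.0, 0.0])
focused_alpha, focused_c = attend(annotations, [5.0, 0.0])
```

With an all-zero decoder state every score is equal, so the weights are uniform and `c_t` is the plain average of the annotations; a non-zero state shifts weight toward the annotations it scores highly.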
Word-level → Subword-level, Sentence-wise, Bilingual Translation
“산업” “통상” “자원” “부” => Ministry of Trade, Industry and Energy
all independent from each other?
ID    Word
1     the
2     a
2093  cat
$e_{\text{cat}} = [0, \ldots, 0, 1, 0, \ldots, 0]^\top$, with the 1 at the 2093-th element
[Figure: 2-D plot of word embeddings, with axis ticks from −33 to −27 and from −13 to −10.5. Labelled words: Canada, European, Europe, Union, Canadian, EU, Africa, Assembly, African, North, Germany, Kingdom, Ontario, Iraq, British, Japan.]
preserves its important similarities according to the objective.
corresponds to a word.
23 words/sentence vs. 115 letters/sentence
(EMNLP 2015): with recurrent nets
$$X = (x_1, x_2, \ldots, x_{T_x})$$
$$x_t = (c^t_1, c^t_2, \ldots, c^t_{T_t})$$
$$h_t = \phi_{\text{word}}(h_{t-1}, f(x_t))$$
$$z_t = \phi_{\text{char}}(z_{t-1}, c_{t-1})$$
$$p(c_t \mid \ldots) \propto \exp(g_{c_t}(z_t))$$
simultaneously, or avoid segmentation in general?
consonant+vowel(+consonant) => syllable
Unicode encodes each and every syllable
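The way Unicode encodes every Hangul syllable can be made concrete with the standard arithmetic for the precomposed Hangul Syllables block, which starts at U+AC00 and is laid out as 19 initial consonants × 21 vowels × 28 finals (the 28 includes "no final consonant"). The function names below are hypothetical; the sketch only illustrates that a syllable code point decomposes back into its consonant and vowel indices.

```python
HANGUL_BASE = 0xAC00
NUM_INITIALS, NUM_MEDIALS, NUM_FINALS = 19, 21, 28  # 28 includes "no final"

def compose(initial, medial, final=0):
    # consonant + vowel (+ consonant) => one precomposed syllable
    return chr(HANGUL_BASE + (initial * NUM_MEDIALS + medial) * NUM_FINALS + final)

def decompose(syllable):
    offset = ord(syllable) - HANGUL_BASE
    if not 0 <= offset < NUM_INITIALS * NUM_MEDIALS * NUM_FINALS:
        raise ValueError("not a precomposed Hangul syllable")
    initial, rest = divmod(offset, NUM_MEDIALS * NUM_FINALS)
    medial, final = divmod(rest, NUM_FINALS)
    return initial, medial, final

# U+D55C is the syllable built from initial index 18, vowel index 0, final index 4.
parts = decompose(chr(0xD55C))
rebuilt = compose(*parts)
```

A model that works on these components sees 19 + 21 + 28 symbols instead of 11,172 separate syllables.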
* (Zhang et al., NIPS 2015)
Subword-level, Sentence-wise → Larger-Context, Bilingual Translation
James the Turtle was always getting in trouble. Sometimes he’d reach into the freezer and empty
she could to keep him out of trouble, but he was sneaky and got into lots of trouble behind her back. One day, James thought he would go into town and see what kind of trouble he could get into. He went to the grocery store and pulled all the pudding off the shelves and ate two jars. Then he walked to the fast food restaurant and ordered 15 bags of fries. He didn’t pay, and instead headed home.
While it's not flawless, some motivations and scenarios remain somewhat underdeveloped or questionable; Ex_Machina is a stunning Sci-Fi vision that is also a fully formed thinking man's thriller. With a jaw droopingly good turn from the soon to be megastar Vikander, ____?____ is another excellent example of what makes the ____?____
Context Following Sentence
(Wang & Cho, arXiv 2015; Ji et al., arXiv 2015)
previous sentences
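conditioned on this bag-of-words
$$P(D) \approx P(S_1) P(S_2) \cdots P(S_N)$$
vs.
$$P(D) \approx P(S_1) P(S_2 \mid S_1) \cdots P(S_N \mid S_{N-n}, \ldots, S_{N-1})$$
[Figure: language model with states z_0, ..., z_4 starting from ⟨s⟩ and predicting w_1, ..., w_4, conditioned on the context sentences (S_{l-n}, S_{l-n+1}, ..., S_{l-1}).]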
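The two factorizations of P(D) can be compared with a toy scorer. The scorer below is entirely hypothetical (it just makes words that already appeared in the context twice as likely) and stands in for a real language model; only the way the context window of n previous sentences is passed along reflects the formulas above.

```python
import math

def document_log_prob(sentences, sentence_log_prob, n=0):
    # Sum of log P(S_l | S_{l-n}, ..., S_{l-1}); n = 0 treats sentences independently.
    total = 0.0
    for l, sentence in enumerate(sentences):
        context = sentences[max(0, l - n):l]
        total += sentence_log_prob(sentence, context)
    return total

def toy_log_prob(sentence, context):
    # Hypothetical scorer: words already seen in the context are twice as likely.
    seen = {w for s in context for w in s}
    return sum(math.log(0.2 if w in seen else 0.1) for w in sentence)

doc = [["james", "the", "turtle"], ["james", "got", "into", "trouble"]]
independent = document_log_prob(doc, toy_log_prob, n=0)
with_context = document_log_prob(doc, toy_log_prob, n=1)
```

Under this toy scorer the context-aware score is higher because the repeated word in the second sentence becomes more predictable once the first sentence is visible.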
Early Fusion Late Fusion
IMDB PTB
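[Figure: encoder-decoder. Encoder states h_1, ..., h_4 over x_1, ..., x_4 are mapped by f_summary to the context c, which feeds decoder states z_0, ..., z_4 emitting y_1, ..., y_4 from ⟨s⟩.]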
* Unless Chris Dyer uploads it on arXiv tomorrow
(1) Hierarchical Model? (2) Something other than BPTT?
[Figure: document hierarchy. Document → Chapter 1, Chapter 2; Chapter 2 → Section 1, Section 2; Section 2 → Paragraph 1, Paragraph 2.]
Summarize! Summarize! Summarize!
Utterance-level RNN + Dialogue-level RNN
Document Chapter 1 Chapter 2 World Knowledge
$$P(S_l \mid S_1, S_2, \ldots, S_{l-1}, D_1, D_2, \ldots, D_M)$$
Subword-level, Larger-Context, Bilingual → Multilingual Translation
N M N × M
max {N, M}
[Figure: separate encoder-decoder models, each mapping x_1, ..., x_4 through encoder states h_1, ..., h_4 to decoder states z_0, ..., z_3 that emit y_1, ..., y_3 from ⟨s⟩.]
Language-agnostic vector space
O(N × M) O(N + M)
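The difference between O(N × M) and O(N + M) can be illustrated by counting components. The language lists below are hypothetical examples; the sketch assumes one model per language pair in the first case and one encoder per source language plus one decoder per target language in the second.

```python
from itertools import product

sources = ["en", "de", "fi"]  # N = 3 hypothetical source languages
targets = ["fr", "es"]        # M = 2 hypothetical target languages

# One dedicated model per (source, target) pair: N x M models.
pair_models = list(product(sources, targets))

# One encoder per source language and one decoder per target language: N + M parts.
shared_parts = [("encoder", s) for s in sources] + [("decoder", t) for t in targets]

def route(source, target):
    # With shared parts, any pair is served by combining an encoder and a decoder.
    return ("encoder", source), ("decoder", target)
```

Adding one more source language then costs a single new encoder rather than M new pair-specific models.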
[Figure: multilingual model. English, German and Finnish bidirectional encoders each turn x_1, ..., x_4 into annotation vectors h_1, ..., h_4. A shared attention mechanism (f_score) produces the weights α_{1,3}, ..., α_{4,3} used by the English and German decoders.]
Direction   Single-pair models (BLEU)   One multilingual model (BLEU)   Diff
En->De      21.03                       20.45
De->En      25.77                       24.53
Fi->En      13.32                       14.33                           +1.09
[Figure: multi-modal extension. English and German encoders, an image encoder (convolutional neural network) and a speech encoder each produce annotation vectors h_j. A shared attention mechanism (f_score, with weights summing to 1 over j) connects them through a modality-agnostic space to English, Finnish, speech and image decoders.]
New Territory of Machine Translation
Why is this a talk at a NIPS workshop not at ACL?
[Figure: Single Frame → Feature Extraction → Recognition Engine → Man, Woman, Motorcycle, Bus; compared with Single Frame → Recognition Engine → Man, Woman, Motorcycle, Bus.]
* I think DeepMind and Facebook are all doing these…
(Artificial) Intelligence Natural Language Understanding
* Oh, so many people might hate me, but probably not at NIPS