Neural Machine Translation at Sogou, Inc.
Feifei Zhai and Dongxu Yang
Sogou Company

Strong R&D Capabilities
– 2,100 employees, of which 76% are technology staff, the highest proportion in China's Internet industry
– Staff with doctoral degrees
No. 2 Chinese Internet company in terms of user base
– PC MAU 520MM, mobile MAU 560MM, covering 96% of the Internet users in China
Robust revenue growth
– In 2015, revenue reached $592 million, with a profit of $110 million
Sogou Search, including Web Search and 24 vertical search products
UGC Platform: Sogou Wenwen, Sogou Encyclopedia, Sogou Guidance
Sogou Exclusive: WeChat search, Zhihu search, English search
Automatically translate a sentence in the source language into the target language
Methods
Rule-based machine translation (RBMT)
Example-based machine translation (EBMT)
Statistical Machine Translation (SMT)
…
布什 与 沙龙 举行 了 会谈 → Bush held talks with Sharon
Model the direct mapping between source and target languages with a neural network
Really amazing translation quality
From ( Sennrich 2016, http://www.meta-net.eu/events/meta-forum-2016/slides/09_sennrich.pdf )
布什 与 沙龙 举行 了 会谈 → Bush held talks with Sharon
[Chart: Edinburgh's WMT results over the years, 2013–2016 — BLEU scores for phrase-based SMT, syntax-based SMT, and neural MT; best score 24.7, by neural MT in 2016]
Encoder-Decoder Framework
Encoder: represent the source sentence as a vector with a neural network
Decoder: generate target words one by one based on the vector from the encoder
What do we actually have in the encoded vector?
[Diagram: the source sentence 布什 与 沙龙 举行 了 会谈 is read word by word into the encoder; the decoder then emits "Bush held talks with Sharon </s>" one word at a time]
(Sutskever et al., 2014)
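The encoder-decoder idea above can be sketched in a few lines. This is a deliberately minimal toy, not the Sutskever et al. model: it uses a plain tanh RNN instead of LSTM/GRU cells, random untrained weights, tiny dimensions, and greedy decoding; all names and sizes here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative only)
V_SRC, V_TGT, EMB, HID = 10, 10, 8, 16
EOS = 0  # index of the end-of-sentence token </s>

# Randomly initialised parameters stand in for trained weights.
E_src = rng.normal(0, 0.1, (V_SRC, EMB))   # source embeddings
E_tgt = rng.normal(0, 0.1, (V_TGT, EMB))   # target embeddings
W_enc = rng.normal(0, 0.1, (EMB + HID, HID))
W_dec = rng.normal(0, 0.1, (EMB + HID, HID))
W_out = rng.normal(0, 0.1, (HID, V_TGT))   # projection to target vocabulary

def rnn_step(W, x, h):
    """One plain tanh-RNN step (GRU/LSTM gating omitted for brevity)."""
    return np.tanh(np.concatenate([x, h]) @ W)

def encode(src_ids):
    """Encoder: compress the whole source sentence into one vector."""
    h = np.zeros(HID)
    for i in src_ids:
        h = rnn_step(W_enc, E_src[i], h)
    return h  # the fixed-length sentence representation

def decode(h, max_len=20):
    """Decoder: greedily emit target words one by one until </s>."""
    out, prev = [], EOS  # generation starts from the </s> symbol
    for _ in range(max_len):
        h = rnn_step(W_dec, E_tgt[prev], h)
        prev = int(np.argmax(h @ W_out))
        if prev == EOS:
            break
        out.append(prev)
    return out

translation = decode(encode([3, 1, 4, 1, 5]))
print(translation)  # some target-id sequence; the weights are untrained
```

Note that the decoder sees the source sentence only through the single vector `h`, which is exactly the limitation the next slide's question points at.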
Attention Mechanism
For each target word to be generated, dynamically compute the source-language information relevant to it
[Diagram: for each decoder step, attention weights over the source words 布什 与 沙龙 举行 了 会谈 form a weighted average of the encoder states]
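The "weighted average" step of attention can be sketched directly. Dot-product scoring is used here for brevity (Bahdanau-style attention scores with a small feed-forward network instead); all sizes and values are toy illustrations.

```python
import numpy as np

def attention_context(dec_state, enc_states):
    """For one target position, compute a weighted average of encoder states."""
    scores = enc_states @ dec_state          # one relevance score per source word
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                 # softmax -> attention weights
    return weights @ enc_states, weights     # context vector = weighted average

# 4 source positions, hidden size 4 (toy values)
enc_states = np.eye(4)
dec_state = np.array([0.0, 0.0, 5.0, 0.0])  # decoder state "asks about" position 2

ctx, w = attention_context(dec_state, enc_states)
print(w.argmax())  # → 2: most weight falls on the related source word
```

The context vector `ctx` replaces the single fixed encoder vector of the plain encoder-decoder, so each target word draws on the source positions most relevant to it.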
A pure neural-based commercial machine translation engine
Stacked encoders and decoders
Residual network
Length normalization
Domain adaptation
…
[Diagram: a stacked encoder over 布什 与 沙龙 举行 了 会谈 produces encoder hidden states; the attention mechanism feeds a stacked decoder whose softmax emits "Bush held talks with Sharon"]
Dual Learning
Zero-shot Learning
…
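Of the features listed above, length normalization is easy to illustrate. A minimal sketch, using the GNMT-style length penalty as a stand-in (the slide does not say which formula Sogou uses, and alpha = 0.6 is an illustrative value):

```python
import math

def normalized_score(logprob_sum, length, alpha=0.6):
    """Length-normalized hypothesis score (GNMT-style penalty, assumed here).

    Without normalization, beam search favours short hypotheses, because
    every extra word multiplies in another probability below 1.
    """
    penalty = ((5 + length) ** alpha) / ((5 + 1) ** alpha)
    return logprob_sum / penalty

# Raw log-probabilities favour the short hypothesis ...
short_raw = math.log(0.5)        # 1 word,  p = 0.5
long_raw = 6 * math.log(0.85)    # 6 words, p = 0.85 per word
assert short_raw > long_raw

# ... but after length normalization the longer hypothesis wins.
assert normalized_score(long_raw, 6) > normalized_score(short_raw, 1)
```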
We keep optimizing our translation engine: the translation model, bilingual data mining, distributed training, and decoding.
Currently focused on Chinese-English and English-Chinese translation
Good performance on both Chinese-English and English-Chinese translation
Human evaluation results (Sogou):
– English-Chinese translation: initial performance 2.9 → current performance 3.9
– Chinese-English translation: initial performance 3.6 → current performance 4.2
Training is too slow!
Decoding is slow: it must take less than 200 ms per translation request on average to meet the real-time standard.
Take a one-layer GRU NMT system as an example:
– Vocabulary size: 80,000; word embedding: 620; hidden state: 1,000
– Encoder (bidirectional): ~16M MACs per word (forward only): 2 × 3 × 2000 × 1000 + 2 × 3 × 620 × 1000
– Decoder: ~70M MACs per word (forward only); for training: 3 × 3620 × 1000 + 3 × 2000 × 1000 + 80000 × 620
– For beam-search inference, decoder computation is BeamSize times larger!
We need fast training and decoding
(Sutskever et al., 2014) (Wu et al., 2016)
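The slide's back-of-the-envelope MAC counts can be checked directly (the beam size of 12 below is an assumed example value, not a figure from the slide):

```python
# Reproduce the per-word MAC counts for the one-layer GRU system
# (vocabulary 80,000; embedding 620; hidden state 1,000).
encoder_macs = 2 * 3 * 2000 * 1000 + 2 * 3 * 620 * 1000
decoder_macs = 3 * 3620 * 1000 + 3 * 2000 * 1000 + 80000 * 620

print(f"encoder: {encoder_macs / 1e6:.1f}M MACs per word")  # 15.7M (~16M)
print(f"decoder: {decoder_macs / 1e6:.1f}M MACs per word")  # 66.5M (~70M)

# With beam search, decoder cost grows linearly with the beam size
# (beam = 12 is an assumed example value):
beam = 12
print(f"decoder at beam {beam}: {beam * decoder_macs / 1e6:.0f}M MACs per word")  # 798M
```

Note that the 80000 × 620 output-layer term alone is ~50M MACs, i.e. the softmax over the large vocabulary dominates the decoder cost.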
– Effective batch size = (same per-GPU batch size) × (number of GPUs)
Number of GPUs:          1      4      8      16
Acceleration ratio:      1      3.904  7.408  13.232
Acceleration efficiency: 1      0.976  0.926  0.827
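The efficiency figures are just the acceleration ratio divided by the GPU count, which a two-line check reproduces:

```python
# Efficiency = acceleration ratio / number of GPUs (values from the slide).
gpus = [1, 4, 8, 16]
ratios = [1.0, 3.904, 7.408, 13.232]
efficiency = [round(r / n, 3) for n, r in zip(gpus, ratios)]
print(efficiency)  # [1.0, 0.976, 0.926, 0.827]
```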
– Sort by sentence length inside each window of 20 mini-batches
– Within each mini-batch, sentence lengths are then similar, so little computation is wasted on padding
– About 2 times faster than the above
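A minimal sketch of this length-bucketing scheme, assuming the slide's window of 20 mini-batches (the batch size of 4 and the function name are illustrative, not Sogou's actual code):

```python
import random

def length_bucketed_batches(sentences, batch_size=4, window_batches=20):
    """Group sentences into mini-batches of similar length.

    Sorting happens inside windows of `window_batches` mini-batches, so the
    data stays roughly shuffled while padding waste per batch stays small.
    """
    sentences = list(sentences)
    random.shuffle(sentences)
    window = batch_size * window_batches
    batches = []
    for start in range(0, len(sentences), window):
        chunk = sorted(sentences[start:start + window], key=len)
        batches += [chunk[i:i + batch_size] for i in range(0, len(chunk), batch_size)]
    random.shuffle(batches)  # keep the order of batches random across an epoch
    return batches

# Toy corpus: "sentences" of lengths 1..40
batches = length_bucketed_batches([[0] * n for n in range(1, 41)])
spreads = [max(map(len, b)) - min(map(len, b)) for b in batches]
print(max(spreads))  # → 3: lengths inside any one batch differ by at most 3
```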
[Chart: training speedup vs. batch time (s) — speedup factors 1, 1.33, 1.59, 2.26, 1.97]
– Also fuse the parameter matrices ahead of time, instead of at each step
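Fusing the per-gate weight matrices once, ahead of time, turns three small matrix multiplies per step into one large GEMM, which GPUs execute far more efficiently. A sketch of the idea (random stand-in weights; the GRU gate names are for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
E, H = 620, 1000
x = rng.normal(size=E + H)  # [word embedding; previous hidden state]

# Three per-gate GRU weight matrices ...
W_r = rng.normal(size=(E + H, H))
W_z = rng.normal(size=(E + H, H))
W_h = rng.normal(size=(E + H, H))

# ... fused once, ahead of time, into a single matrix.
W_fused = np.concatenate([W_r, W_z, W_h], axis=1)  # shape (E+H, 3H)

# Each step now runs one large GEMM instead of three small ones.
out = x @ W_fused
r, z, h = out[:H], out[H:2 * H], out[2 * H:]

assert np.allclose(r, x @ W_r)
assert np.allclose(z, x @ W_z)
assert np.allclose(h, x @ W_h)
```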
– Reorder sentences by length; scale the batch size at each step
– Rearrange beams at each step; also scale the batch size according to the remaining beams
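Scaling the batch to the remaining beams can be sketched as follows: hypotheses that have emitted </s> are set aside, so each later decode step runs over a smaller batch. The function name and the (tokens, score) tuple layout are illustrative, not Sogou's actual data structures.

```python
def split_finished(hypotheses, eos="</s>"):
    """Separate finished beams so later decode steps run a smaller batch."""
    active, finished = [], []
    for tokens, score in hypotheses:
        (finished if tokens and tokens[-1] == eos else active).append((tokens, score))
    return active, finished

beams = [(["Bush", "held", "talks", "</s>"], -1.2),
         (["Bush", "held", "a"], -1.5),
         (["Bush", "met", "</s>"], -2.0)]
active, finished = split_finished(beams)
print(len(active), len(finished))  # → 1 2: only one beam still needs the decoder
```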
[Chart: decoding speedup vs. batch time (s) — speedup factors 1, 1.35, 1.81, 2.21, 2.67]
                   P40        P100
TFLOPS             12T        9.3T
Memory Bandwidth   346 GB/s   732 GB/s
Translation box in search results
Translation vertical channel
Translation with OCR
Overseas search
[Diagram: a Chinese query is machine-translated into an English query; English webpages are searched; the English results and abstracts are machine-translated back into Chinese]