Neural Machine Translation in Sogou, Inc. by Feifei Zhai and Dongxu Yang (PowerPoint presentation)



SLIDE 1

Neural Machine Translation In Sogou, Inc.

Feifei Zhai and Dongxu Yang

SLIDE 2

Sogou Company

  • No. 2 Chinese Internet company in terms of user base: PC MAU 520MM, mobile MAU 560MM, covering 96% of the Internet users in China
  • 2,100 employees, of which 76% are technology staff, the highest proportion in China's Internet industry

Strong R&D Capabilities

  • 38% of employees hold graduate or doctoral degrees
  • Revenue CAGR of 126% from 2011 to 2015; in 2015, revenue reached $592 million with a profit of $110 million

Robust revenue growth

SLIDE 3

Rich Product Line

Sogou Search, including Web Search and 24 vertical search products. UGC platforms: Sogou Wenwen, Sogou Encyclopedia, Sogou Guidance. Sogou exclusives: WeChat search, Zhihu search, English search.

SLIDE 4

Outline

1. Neural Machine Translation

2. Related application scenarios

SLIDE 5

Machine Translation

Automatically translate a sentence in the source language into the target language

Methods

Rule-based machine translation (RBMT)

Example-based machine translation (EBMT)

Statistical Machine Translation (SMT)

布什 与 沙龙 举行 了 会谈 → Bush held talks with Sharon

SLIDE 6

Neural Machine Translation – A New Era

Model the direct mapping between source and target language with a neural network

Remarkably good translation quality

From ( Sennrich 2016, http://www.meta-net.eu/events/meta-forum-2016/slides/09_sennrich.pdf )

布什 与 沙龙 举行 了 会谈 → Bush held talks with Sharon

[Chart: Edinburgh's WMT Results Over the Years (BLEU, 2013–2016), comparing phrase-based SMT, syntax-based SMT, and neural MT; neural MT leads in 2016 at 24.7 BLEU]

SLIDE 7

Neural Machine Translation – A New Era

Encoder-Decoder Framework

Encoder: represent the source sentence as a vector via a neural network

Decoder: generate target words one by one, based on the vector from the encoder

What do we actually have in the encoded vector?

[Diagram: the encoder reads "布什 与 沙龙 举行 了 会谈"; the decoder then emits "Bush held talks with Sharon </s>" one word at a time from the encoded vector]

(Sutskever et al., 2014)

SLIDE 8

Neural Machine Translation – A New Era

Attention Mechanism

For each target word to be generated, dynamically compute the source-side information relevant to it

[Diagram: for each target word, attention weights over the source words "布什 与 沙龙 举行 了 会谈" are computed, and the context is their weighted average]
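The weighted-average step described above can be sketched in a few lines. This is a generic illustration, not Sogou's code; for simplicity it scores with dot products, whereas Bahdanau-style attention uses a small MLP scorer.

```python
import math

def attention_context(decoder_state, encoder_states):
    """Score each encoder hidden state against the current decoder state,
    softmax the scores into weights, and return the weighted average
    (the context vector) over encoder hidden states."""
    scores = [sum(d * h for d, h in zip(decoder_state, hs))
              for hs in encoder_states]
    m = max(scores)                          # for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(encoder_states[0])
    context = [sum(w * hs[i] for w, hs in zip(weights, encoder_states))
               for i in range(dim)]
    return weights, context
```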

SLIDE 9

A purely neural commercial machine translation engine

Stacked encoders and decoders

Residual network

Length normalization

Domain adaptation

Sogou Neural Machine Translation Engine

[Diagram: stacked encoder layers read "布什 与 沙龙 举行 了 会谈"; an attention mechanism over the encoder hidden states feeds stacked decoder layers, and a softmax layer emits "Bush held talks with Sharon"]

Dual Learning

Zero-shot Learning

SLIDE 10

Keep optimizing our translation engine: translation models, bilingual data mining, distributed training, and decoding.

Focus on Chinese-English and English-Chinese translation for now

Good performance on Chinese-English and English-Chinese translation

Sogou Neural Machine Translation Engine

[Charts: human evaluation of Sogou translation quality, initial vs. current performance]

  English-Chinese: initial 2.9 → current 3.9
  Chinese-English: initial 3.6 → current 4.2

SLIDE 11

Training is too slow!

Decoding is slow: we need less than 200 ms per translation request on average to meet the real-time standard

Take a one-layer GRU NMT system as an example

Vocabulary size: 80,000; word embedding: 620; hidden state: 1,000

Encoder (bidirectional): ~16M MACs per word (forward only)
  2 × 3 × 2000 × 1000 + 2 × 3 × 620 × 1000

Decoder: ~70M MACs per word (forward only)
  For training: 3 × 3620 × 1000 + 3 × 2000 × 1000 + 80000 × 620
  For beam-search inference: decoder computation is BeamSize times larger!

We need fast training and decoding

Challenges in Real Application

(Sutskever et al., 2014) (Wu et al., 2016)
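The per-word MAC arithmetic above can be checked directly. The snippet below reproduces the slide's numbers for the stated dimensions (vocabulary 80,000, embedding 620, hidden 1,000); the interpretation of each factor in the comments is my reading, not something stated on the slide.

```python
vocab, emb, hid = 80_000, 620, 1_000

# Bidirectional encoder, forward pass: 2 directions x 3 GRU gates, each gate
# mixing a 2000-dim recurrent input and the 620-dim embedding.
encoder_macs = 2 * 3 * 2000 * hid + 2 * 3 * emb * hid

# Decoder, forward pass: 3 gates over a 3620-dim input (presumably embedding
# + previous state + 2000-dim context), a context term, and the output layer.
decoder_macs = 3 * 3620 * hid + 3 * 2000 * hid + vocab * emb

print(f"encoder ~{encoder_macs / 1e6:.1f}M MACs/word")  # matches the ~16M
print(f"decoder ~{decoder_macs / 1e6:.1f}M MACs/word")  # roughly the ~70M

# Beam search multiplies the decoder cost by the beam size:
beam = 10
print(f"decoder with beam {beam}: ~{beam * decoder_macs / 1e6:.0f}M MACs/word")
```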

SLIDE 12

Distributed Training

  • Parameter server
    – Keeps the current model parameters
    – Receives gradients from workers and updates the parameters accordingly
  • Workers
    – Use GPUs for model training
    – Communicate with the parameter server to update parameters

SLIDE 13

  • Asynchronous
    – Each worker sends its locally updated parameters to the parameter server
    – The parameter server averages the worker's parameters with its own version
    – The updated parameters are returned to the worker
  • Synchronous
    – Each worker sends its gradients to the parameter server
    – The parameter server updates the parameters after receiving gradients from all workers

Distributed Training
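The two update rules described above can be contrasted with a toy sketch (plain Python, single-process; the function names and the averaging weight `alpha` are my own, not Sogou's):

```python
def sync_update(params, worker_grads, lr=0.1):
    """Synchronous: wait for gradients from all workers, average them,
    then apply a single SGD-style update to the server's parameters."""
    n = len(worker_grads)
    avg = [sum(g[i] for g in worker_grads) / n for i in range(len(params))]
    return [p - lr * g for p, g in zip(params, avg)]

def async_update(server_params, worker_params, alpha=0.5):
    """Asynchronous: average one worker's locally updated parameters
    with the server's current copy and send the result back."""
    return [alpha * s + (1 - alpha) * w
            for s, w in zip(server_params, worker_params)]
```

Synchronous updates are equivalent to large-batch SGD but wait for the slowest worker; asynchronous updates avoid the barrier at the cost of mixing stale parameters.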

SLIDE 14

  • Acceleration ratio
    – Asynchronous
      • around 3x acceleration with 10 GPU cards
    – Synchronous
      • acceleration ratio vs. number of GPUs
        (same per-GPU batchsize, so effective batchsize = batchsize × number of GPUs)

Distributed Training

  Number of GPUs:           1    4      8      16
  Acceleration ratio:       1    3.904  7.408  13.232
  Acceleration efficiency:  1    0.976  0.926  0.827

SLIDE 15

  • Acceleration on a single card
    – Corpus shuffle
      • Global random shuffle
      • Local sort
        – sort by sentence length inside each window of 20 mini-batches
        – within each mini-batch, sentence lengths are then similar
    – Optimization-function selection
      • Adadelta
      • Momentum
      • Adam
        – about 2 times faster than the two above

Training acceleration
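The shuffle-then-local-sort scheme above can be sketched as follows (an illustrative implementation; `window=20` matches the slide, everything else is my own naming):

```python
import random

def make_batches(sentences, batch_size, window=20, seed=0):
    """Globally shuffle the corpus, then sort by sentence length inside
    each window of `window` mini-batches, so every mini-batch holds
    sentences of similar length and wastes little padding."""
    rng = random.Random(seed)
    data = list(sentences)
    rng.shuffle(data)                      # global random shuffle
    span = batch_size * window             # sentences per local-sort window
    ordered = []
    for i in range(0, len(data), span):
        ordered.extend(sorted(data[i:i + span], key=len))  # local sort
    # carve the reordered stream into mini-batches
    return [ordered[i:i + batch_size]
            for i in range(0, len(ordered), batch_size)]
```

The global shuffle preserves randomness across epochs, while the local sort keeps padding overhead low inside each batch.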

SLIDE 16

  • Acceleration on single card

– Use a better GPU or newer CUDA if possible ☺

Training acceleration

[Chart: training batch time (s) across GPU/CUDA configurations, with speed-ups of 1, 1.33, 1.59, 2.26, and 1.97 over the baseline]

SLIDE 17

  • Compute acceleration
    – Fusion of computations
      • fuse element-wise operations together
      • fuse matrix multiplications into larger ones
        – also concatenate parameter matrices ahead of time
      • fuse input embedding projections together
        – instead of at each step
    – CUDA function selection
      • for batchsize = 1, use level-2 cuBLAS functions (matrix-vector) instead of level-3 (matrix-matrix)

Decoding acceleration
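Fusing several small matrix multiplications into one larger one can be illustrated in plain Python (a minimal sketch with my own naming; real kernels do this on the GPU, with the concatenated weight matrix prepared ahead of time):

```python
def matvec(W, x):
    """Naive matrix-vector product over nested lists."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def fused_gates(W_r, W_z, W_h, x):
    """Instead of three small products W_r@x, W_z@x, W_h@x, stack the
    weight matrices once ahead of time, do a single larger product, then
    split the result back into the three gate pre-activations."""
    W = W_r + W_z + W_h              # row-wise concatenation, done once
    y = matvec(W, x)                 # one larger multiplication
    h = len(W_r)
    return y[:h], y[h:2 * h], y[2 * h:]
```

One large GEMM keeps the GPU busier than three small ones: fewer kernel launches and better utilization of the multiply units.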

SLIDE 18

  • Batch processing
    – about 3x faster than single-sentence decoding
      • use batch mode if possible
    – Sentence reordering
      • sentence lengths may vary greatly
      • Encoder
        – reorder sentences by length
        – scale the batchsize at each step
      • Decoder
        – rearrange beams at each step
        – also scale the batchsize according to the remaining beams

Decoding acceleration
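The encoder-side reordering can be sketched like this (names are mine, for illustration). Sorting by descending length means the sentences still active at step t are always a prefix of the batch, so the effective batchsize simply shrinks as t grows:

```python
def reorder_by_length(sentences):
    """Sort a batch by descending length; return the sorted batch and
    the permutation that restores the original order afterwards."""
    order = sorted(range(len(sentences)),
                   key=lambda i: len(sentences[i]), reverse=True)
    restore = [0] * len(order)
    for pos, idx in enumerate(order):
        restore[idx] = pos
    return [sentences[i] for i in order], restore

def active_batchsize(sorted_batch, t):
    """How many sentences still have a token at time step t; with the
    batch sorted by descending length these form a contiguous prefix."""
    return sum(1 for s in sorted_batch if len(s) > t)
```

The decoder side is analogous: finished beams are dropped at each step and the remaining ones are repacked, so the per-step batchsize tracks the surviving hypotheses.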

SLIDE 19

  • Other acceleration methods

– Use a better GPU or newer CUDA if possible ☺

Decoding acceleration

[Chart: decoding batch time (s) across configurations, with speed-ups of 1, 1.35, 1.81, 2.21, and 2.67 over the baseline]

SLIDE 20

  • P40 vs. P100
  • batchsize
    – training: 80 or more
      • computation dominates
    – inference: 10 or less
      • memory bandwidth also plays an important role, compared with training

                      P40        P100
  TFLOPS              12 T       9.3 T
  Memory bandwidth    346 GB/s   732 GB/s
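A crude roofline estimate explains the split above. The sketch below (my own simplification, reusing the table's peak numbers) models one fp32 output-layer GEMM as limited either by compute or by streaming the weights from memory: at a training-sized batch the P40's higher TFLOPS wins, while at batch 1 the P100's higher bandwidth wins.

```python
def gemm_time_ms(batch, rows=80_000, cols=620, tflops=12.0, bw_gbs=346.0):
    """Roofline-style estimate for a (rows x cols) weight matrix applied
    to `batch` inputs: time is whichever is larger, compute at peak
    TFLOPS or streaming the fp32 weights once at peak bandwidth."""
    flops = 2.0 * batch * rows * cols
    weight_bytes = 4.0 * rows * cols      # weights read once per batch
    return max(flops / (tflops * 1e12),
               weight_bytes / (bw_gbs * 1e9)) * 1e3

# Training-sized batch (80): compute-bound, P40 (12 TFLOPS) is faster.
p40_train = gemm_time_ms(80, tflops=12.0, bw_gbs=346.0)
p100_train = gemm_time_ms(80, tflops=9.3, bw_gbs=732.0)

# Inference-sized batch (1): bandwidth-bound, P100 (732 GB/s) is faster.
p40_infer = gemm_time_ms(1, tflops=12.0, bw_gbs=346.0)
p100_infer = gemm_time_ms(1, tflops=9.3, bw_gbs=732.0)
```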

SLIDE 21

Outline

1. Neural Machine Translation

2. Related application scenarios

SLIDE 22

Sogou translate related products

Translation box in search results

Translation vertical channel

Translation with OCR

SLIDE 23

Sogou translate related products

[Diagram: oversea search pipeline. A Chinese query is machine-translated into an English query, which retrieves English results; the result abstracts are machine-translated into Chinese abstracts, and the English pages into Chinese webpages]

Oversea search
