Neural Machine Translation in Sogou, Inc. by Feifei Zhai and Dongxu Yang (PowerPoint presentation)



SLIDE 1

Neural Machine Translation In Sogou, Inc.

Feifei Zhai and Dongxu Yang

SLIDE 2

Sogou Company

  • No. 2 Chinese Internet company in terms of user base: PC MAU 520MM, mobile MAU 560MM, covering 96% of the Internet users in China
  • 2,100 employees, of which 76% are technology staff, the highest proportion in China's Internet industry

Strong R&D Capabilities

  • 38% of employees hold graduate or doctoral degrees
  • Revenue CAGR of 126% from 2011 to 2015; in 2015, revenue reached $592 million with a profit of $110 million

Robust revenue growth

SLIDE 3

Rich Product Line

Sogou Search, including Web Search and 24 vertical search products. UGC platforms: Sogou Wenwen, Sogou Encyclopedia, Sogou Guidance. Sogou exclusives: WeChat search, Zhihu search, English search.

SLIDE 4

Outline

1. Neural Machine Translation

2. Related application scenarios

SLIDE 5

Machine Translation

Automatically translate a sentence in the source language into the target language

Methods

Rule-based machine translation (RBMT)

Example-based machine translation (EBMT)

Statistical Machine Translation (SMT)

布什 与 沙龙 举行 了 会谈 → Bush held talks with Sharon

SLIDE 6

Neural Machine Translation – A New Era

Model the direct mapping between source and target language with a neural network

Remarkably good translation quality

From ( Sennrich 2016, http://www.meta-net.eu/events/meta-forum-2016/slides/09_sennrich.pdf )

布什 与 沙龙 举行 了 会谈 → Bush held talks with Sharon

[Chart: Edinburgh's WMT Results Over the Years (BLEU, 2013–2016), comparing phrase-based SMT, syntax-based SMT, and neural MT; neural MT leads in 2016 at 24.7 BLEU]

SLIDE 7

Neural Machine Translation – A New Era

Encoder-Decoder Framework

Encoder: represent the source sentence as a vector via a neural network

Decoder: generate target words one by one, based on the vector from the encoder

What do we actually have in the encoded vector?

[Diagram: the encoder reads "布什 与 沙龙 举行 了 会谈"; the decoder then emits "Bush held talks with Sharon </s>" one word at a time from the encoded vector]

(Sutskever et al., 2014)

SLIDE 8

Neural Machine Translation – A New Era

Attention Mechanism

For each target word to be generated, dynamically compute the source-side information relevant to it

[Diagram: for each target word, attention weights over the source words "布什 与 沙龙 举行 了 会谈" are computed, and the context is their weighted average]
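The weighted-average step described above can be sketched in a few lines. This is a generic illustration, not Sogou's code; for simplicity it scores with dot products, whereas Bahdanau-style attention uses a small MLP scorer.

```python
import math

def attention_context(decoder_state, encoder_states):
    """Score each encoder hidden state against the current decoder state,
    softmax the scores into weights, and return the weighted average
    (the context vector) over encoder hidden states."""
    scores = [sum(d * h for d, h in zip(decoder_state, hs))
              for hs in encoder_states]
    m = max(scores)                          # for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(encoder_states[0])
    context = [sum(w * hs[i] for w, hs in zip(weights, encoder_states))
               for i in range(dim)]
    return weights, context
```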

SLIDE 9

A purely neural commercial machine translation engine

Stacked encoders and decoders

Residual network

Length normalization

Domain adaptation

Sogou Neural Machine Translation Engine

[Diagram: stacked encoder layers read "布什 与 沙龙 举行 了 会谈"; an attention mechanism over the encoder hidden states feeds stacked decoder layers, and a softmax layer emits "Bush held talks with Sharon"]

Dual Learning

Zero-shot Learning

SLIDE 10

Keep optimizing our translation engine: translation models, bilingual data mining, distributed training, and decoding.

Focus on Chinese-English and English-Chinese translation for now

Good performance on Chinese-English and English-Chinese translation

Sogou Neural Machine Translation Engine

[Charts: human evaluation of Sogou translation quality, initial vs. current performance]

  English-Chinese: initial 2.9 → current 3.9
  Chinese-English: initial 3.6 → current 4.2

SLIDE 11

Training is too slow!

Decoding is slow: we need less than 200 ms per translation request on average to meet the real-time standard

Take a one-layer GRU NMT system as an example

Vocabulary size: 80,000; word embedding: 620; hidden state: 1,000

Encoder (bidirectional): ~16M MACs per word (forward only)
  2 × 3 × 2000 × 1000 + 2 × 3 × 620 × 1000

Decoder: ~70M MACs per word (forward only)
  For training: 3 × 3620 × 1000 + 3 × 2000 × 1000 + 80000 × 620
  For beam-search inference: decoder computation is BeamSize times larger!

We need fast training and decoding

Challenges in Real Application

(Sutskever et al., 2014) (Wu et al., 2016)
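The per-word MAC arithmetic above can be checked directly. The snippet below reproduces the slide's numbers for the stated dimensions (vocabulary 80,000, embedding 620, hidden 1,000); the interpretation of each factor in the comments is my reading, not something stated on the slide.

```python
vocab, emb, hid = 80_000, 620, 1_000

# Bidirectional encoder, forward pass: 2 directions x 3 GRU gates, each gate
# mixing a 2000-dim recurrent input and the 620-dim embedding.
encoder_macs = 2 * 3 * 2000 * hid + 2 * 3 * emb * hid

# Decoder, forward pass: 3 gates over a 3620-dim input (presumably embedding
# + previous state + 2000-dim context), a context term, and the output layer.
decoder_macs = 3 * 3620 * hid + 3 * 2000 * hid + vocab * emb

print(f"encoder ~{encoder_macs / 1e6:.1f}M MACs/word")  # matches the ~16M
print(f"decoder ~{decoder_macs / 1e6:.1f}M MACs/word")  # roughly the ~70M

# Beam search multiplies the decoder cost by the beam size:
beam = 10
print(f"decoder with beam {beam}: ~{beam * decoder_macs / 1e6:.0f}M MACs/word")
```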

SLIDE 12

Distributed Training

  • Parameter server
    – Keeps the current model parameters
    – Receives gradients from workers and updates the parameters accordingly
  • Workers
    – Use GPUs for model training
    – Communicate with the parameter server to update parameters

SLIDE 13

  • Asynchronous
    – Each worker sends its locally updated parameters to the parameter server
    – The parameter server averages the worker's parameters with its own version
    – The updated parameters are returned to the worker
  • Synchronous
    – Each worker sends its gradients to the parameter server
    – The parameter server updates the parameters after receiving gradients from all workers

Distributed Training
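The two update rules described above can be contrasted with a toy sketch (plain Python, single-process; the function names and the averaging weight `alpha` are my own, not Sogou's):

```python
def sync_update(params, worker_grads, lr=0.1):
    """Synchronous: wait for gradients from all workers, average them,
    then apply a single SGD-style update to the server's parameters."""
    n = len(worker_grads)
    avg = [sum(g[i] for g in worker_grads) / n for i in range(len(params))]
    return [p - lr * g for p, g in zip(params, avg)]

def async_update(server_params, worker_params, alpha=0.5):
    """Asynchronous: average one worker's locally updated parameters
    with the server's current copy and send the result back."""
    return [alpha * s + (1 - alpha) * w
            for s, w in zip(server_params, worker_params)]
```

Synchronous updates are equivalent to large-batch SGD but wait for the slowest worker; asynchronous updates avoid the barrier at the cost of mixing stale parameters.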

SLIDE 14

  • Acceleration ratio
    – Asynchronous
      • around 3x acceleration with 10 GPU cards
    – Synchronous
      • acceleration ratio vs. number of GPUs
        (same per-GPU batchsize, so effective batchsize = batchsize × number of GPUs)

Distributed Training

  Number of GPUs:           1    4      8      16
  Acceleration ratio:       1    3.904  7.408  13.232
  Acceleration efficiency:  1    0.976  0.926  0.827

SLIDE 15

  • Acceleration on a single card
    – Corpus shuffle
      • Global random shuffle
      • Local sort
        – sort by sentence length inside each window of 20 mini-batches
        – within each mini-batch, sentence lengths are then similar
    – Optimization-function selection
      • Adadelta
      • Momentum
      • Adam
        – about 2 times faster than the two above

Training acceleration
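The shuffle-then-local-sort scheme above can be sketched as follows (an illustrative implementation; `window=20` matches the slide, everything else is my own naming):

```python
import random

def make_batches(sentences, batch_size, window=20, seed=0):
    """Globally shuffle the corpus, then sort by sentence length inside
    each window of `window` mini-batches, so every mini-batch holds
    sentences of similar length and wastes little padding."""
    rng = random.Random(seed)
    data = list(sentences)
    rng.shuffle(data)                      # global random shuffle
    span = batch_size * window             # sentences per local-sort window
    ordered = []
    for i in range(0, len(data), span):
        ordered.extend(sorted(data[i:i + span], key=len))  # local sort
    # carve the reordered stream into mini-batches
    return [ordered[i:i + batch_size]
            for i in range(0, len(ordered), batch_size)]
```

The global shuffle preserves randomness across epochs, while the local sort keeps padding overhead low inside each batch.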

SLIDE 16

  • Acceleration on single card

– Use a better GPU or newer CUDA if possible ☺

Training acceleration

[Chart: training batch time (s) across GPU/CUDA configurations, with speed-ups of 1, 1.33, 1.59, 2.26, and 1.97 over the baseline]

SLIDE 17

  • Compute acceleration
    – Fusion of computations
      • fuse element-wise operations together
      • fuse matrix multiplications into larger ones
        – also concatenate parameter matrices ahead of time
      • fuse input embedding projections together
        – instead of at each step
    – CUDA function selection
      • for batchsize = 1, use level-2 cuBLAS functions (matrix-vector) instead of level-3 (matrix-matrix)

Decoding acceleration
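Fusing several small matrix multiplications into one larger one can be illustrated in plain Python (a minimal sketch with my own naming; real kernels do this on the GPU, with the concatenated weight matrix prepared ahead of time):

```python
def matvec(W, x):
    """Naive matrix-vector product over nested lists."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def fused_gates(W_r, W_z, W_h, x):
    """Instead of three small products W_r@x, W_z@x, W_h@x, stack the
    weight matrices once ahead of time, do a single larger product, then
    split the result back into the three gate pre-activations."""
    W = W_r + W_z + W_h              # row-wise concatenation, done once
    y = matvec(W, x)                 # one larger multiplication
    h = len(W_r)
    return y[:h], y[h:2 * h], y[2 * h:]
```

One large GEMM keeps the GPU busier than three small ones: fewer kernel launches and better utilization of the multiply units.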

SLIDE 18

  • Batch processing
    – about 3x faster than single-sentence decoding
      • use batch mode if possible
    – Sentence reordering
      • sentence lengths may vary greatly
      • Encoder
        – reorder sentences by length
        – scale the batchsize at each step
      • Decoder
        – rearrange beams at each step
        – also scale the batchsize according to the remaining beams

Decoding acceleration
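The encoder-side reordering can be sketched like this (names are mine, for illustration). Sorting by descending length means the sentences still active at step t are always a prefix of the batch, so the effective batchsize simply shrinks as t grows:

```python
def reorder_by_length(sentences):
    """Sort a batch by descending length; return the sorted batch and
    the permutation that restores the original order afterwards."""
    order = sorted(range(len(sentences)),
                   key=lambda i: len(sentences[i]), reverse=True)
    restore = [0] * len(order)
    for pos, idx in enumerate(order):
        restore[idx] = pos
    return [sentences[i] for i in order], restore

def active_batchsize(sorted_batch, t):
    """How many sentences still have a token at time step t; with the
    batch sorted by descending length these form a contiguous prefix."""
    return sum(1 for s in sorted_batch if len(s) > t)
```

The decoder side is analogous: finished beams are dropped at each step and the remaining ones are repacked, so the per-step batchsize tracks the surviving hypotheses.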

SLIDE 19

  • Other acceleration methods

– Use a better GPU or newer CUDA if possible ☺

Decoding acceleration

[Chart: decoding batch time (s) across configurations, with speed-ups of 1, 1.35, 1.81, 2.21, and 2.67 over the baseline]

SLIDE 20

  • P40 vs. P100
  • batchsize
    – training: 80 or more
      • computation dominates
    – inference: 10 or less
      • memory bandwidth also plays an important role, compared with training

                      P40        P100
  TFLOPS              12 T       9.3 T
  Memory bandwidth    346 GB/s   732 GB/s
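A crude roofline estimate explains the split above. The sketch below (my own simplification, reusing the table's peak numbers) models one fp32 output-layer GEMM as limited either by compute or by streaming the weights from memory: at a training-sized batch the P40's higher TFLOPS wins, while at batch 1 the P100's higher bandwidth wins.

```python
def gemm_time_ms(batch, rows=80_000, cols=620, tflops=12.0, bw_gbs=346.0):
    """Roofline-style estimate for a (rows x cols) weight matrix applied
    to `batch` inputs: time is whichever is larger, compute at peak
    TFLOPS or streaming the fp32 weights once at peak bandwidth."""
    flops = 2.0 * batch * rows * cols
    weight_bytes = 4.0 * rows * cols      # weights read once per batch
    return max(flops / (tflops * 1e12),
               weight_bytes / (bw_gbs * 1e9)) * 1e3

# Training-sized batch (80): compute-bound, P40 (12 TFLOPS) is faster.
p40_train = gemm_time_ms(80, tflops=12.0, bw_gbs=346.0)
p100_train = gemm_time_ms(80, tflops=9.3, bw_gbs=732.0)

# Inference-sized batch (1): bandwidth-bound, P100 (732 GB/s) is faster.
p40_infer = gemm_time_ms(1, tflops=12.0, bw_gbs=346.0)
p100_infer = gemm_time_ms(1, tflops=9.3, bw_gbs=732.0)
```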

SLIDE 21

Outline

1. Neural Machine Translation

2. Related application scenarios

SLIDE 22

Sogou translate related products

Translation box in search results

Translation vertical channel

Translation with OCR

SLIDE 23

Sogou translate related products

[Diagram: oversea search pipeline. A Chinese query is machine-translated into an English query, which retrieves English results; the result abstracts are machine-translated into Chinese abstracts, and the English pages into Chinese webpages]

Oversea search
