Findings of the 2016 Conference on Machine Translation (WMT 2016)



SLIDE 1

Findings of the 2016 Conference on Machine Translation

WMT 2016 @ ACL Berlin, Germany, August 11–12

Organizers: Ondřej Bojar (Charles University in Prague), Christian Buck (University of Edinburgh), Rajen Chatterjee (FBK), Christian Federmann (MSR), Liane Guillou (University of Edinburgh), Barry Haddow (University of Edinburgh), Matthias Huck (University of Edinburgh), Antonio Jimeno Yepes (IBM Research Australia), Varvara Logacheva (University of Sheffield), Aurélie Névéol (LIMSI, CNRS), Mariana Neves (Hasso-Plattner Institute), Pavel Pecina (Charles University in Prague), Martin Popel (Charles University in Prague), Philipp Koehn (University of Edinburgh / Johns Hopkins University), Christof Monz (University of Amsterdam), Matteo Negri (FBK), Matt Post (Johns Hopkins University), Carolina Scarton (University of Sheffield), Lucia Specia (University of Sheffield), Karin Verspoor (University of Melbourne), Jörg Tiedemann (University of Helsinki), Marco Turchi (FBK)

SLIDE 2

News Translation Task

SLIDE 3

Overview

Français, čeština, Deutsch, română (NEW), русский, suomi, Türkçe (NEW) ↔ English

SLIDE 4

Funding

  • European Union’s Horizon 2020 program
  • Yandex (Russian–English and Turkish–English test sets)
  • University of Helsinki (Finnish–English test set)

SLIDE 5

Participation

102 entries from 24 institutions, plus 4 anonymized commercial, online, and rule-based systems
SLIDE 6

Human Evaluation

SLIDE 7

Human Evaluation

  • We wish to identify the best systems for each task
    – Automatic metrics are useful for development, but must be grounded in human evaluation of system output
  • How to compute it?
    – Adequacy / fluency, sentence ranking (RR), constituent ranking, constituent OK, sentence comprehension
    – Direct Assessment (DA)

SLIDE 8

[Table: human evaluation method used per year, ’06–’16 — Adequacy / Fluency, Sentence Ranking, Constituent Ranking, Constituent OK, Sentence Comprehension, Direct Assessment]

SLIDE 9

Sentence Ranking

https://github.com/cfedermann/Appraise/

Systems: A, B, C, D, E
A > {B, D, E}; B > {D, E}; C > {A, B, D, E}; D > {E} = 10 pairwise rankings
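The expansion on the slide (one relative ranking of 5 systems yields 5 choose 2 = 10 pairwise judgments) can be sketched as follows; the function name and the dict representation are illustrative, not taken from the WMT tooling:

```python
from itertools import combinations

def pairwise_judgments(ranks):
    """Expand a relative ranking {system: rank} into pairwise
    'better-than' judgments. Lower rank = better; ties yield none."""
    wins = []
    for a, b in combinations(sorted(ranks), 2):
        if ranks[a] < ranks[b]:
            wins.append((a, b))   # a beats b
        elif ranks[b] < ranks[a]:
            wins.append((b, a))   # b beats a
    return wins

# Ranking from the slide: C > A > B > D > E.
ranks = {"A": 2, "B": 3, "C": 1, "D": 4, "E": 5}
pairs = pairwise_judgments(ranks)
print(len(pairs))  # 10
```

Ties deliberately produce no judgment, which is why a batch of 5 outputs can contribute fewer than 10 pairs in practice.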

SLIDE 10

More Judgments

  • Innovation: rank distinct outputs instead of systems
  • Then, distribute rankings across systems
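A minimal sketch of this "rank distinct outputs, then distribute" idea: systems producing identical translations share one annotation, and each comparison between distinct outputs is expanded back to all system pairs. Names and data here are hypothetical, not from the actual WMT pipeline:

```python
from collections import defaultdict
from itertools import combinations, product

def distribute_rankings(outputs, output_ranks):
    """outputs: {system: translation}; output_ranks: {translation: rank}
    (annotators rank each *distinct* translation once; lower = better).
    Returns pairwise judgments expanded back to all systems."""
    by_output = defaultdict(list)
    for system, text in outputs.items():
        by_output[text].append(system)
    judgments = []
    for t1, t2 in combinations(by_output, 2):
        if output_ranks[t1] == output_ranks[t2]:
            continue  # tied outputs yield no judgment
        better, worse = (t1, t2) if output_ranks[t1] < output_ranks[t2] else (t2, t1)
        for winner, loser in product(by_output[better], by_output[worse]):
            judgments.append((winner, loser))
    return judgments

# sysB and sysC produced the same translation: ranking 2 distinct
# outputs yields judgments covering all 3 systems.
outputs = {"sysA": "ein Haus", "sysB": "das Haus", "sysC": "das Haus"}
judgments = distribute_rankings(outputs, {"ein Haus": 2, "das Haus": 1})
print(sorted(judgments))  # [('sysB', 'sysA'), ('sysC', 'sysA')]
```

This is how the same annotation budget yields more pairwise judgments per hour: duplicate outputs are annotated once but counted for every system that produced them.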

SLIDE 11

Data collected

  • 150 trusted annotators, 939 person-hours

[Table: pairwise judgments collected (thousands), 2014–2016, "pairs" vs. "expanded" rows — recovered values 245, 252, 324, 290, 328; statmt.org/wmt16/results.html]

SLIDE 12

Clustering

  • Rank systems using TrueSkill (Herbrich et al., 2006; Sakaguchi et al., 2014)
  • Cluster (Koehn, 2012)
    – Aggregate each system’s rank over 1,000 bootstrap-resampled folds
    – Throw out top and bottom 25 ranks, collect ranges
    – Group systems by non-overlapping ranges
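The bootstrap-clustering step might look like the sketch below. `score_fn` stands in for the TrueSkill ranking, and everything here is an illustrative reconstruction of the Koehn (2012) procedure, not the released evaluation code:

```python
import random

def rank_clusters(score_fn, judgments, n_systems, folds=1000, drop=25):
    """Cluster systems into groups with non-overlapping bootstrap rank
    ranges. score_fn(sample) must return system ids, best first;
    dropping the top/bottom `drop` of `folds` ranks leaves a ~95%
    range when drop=25, folds=1000."""
    ranks = [[] for _ in range(n_systems)]
    for _ in range(folds):
        sample = random.choices(judgments, k=len(judgments))  # resample
        for rank, system in enumerate(score_fn(sample)):
            ranks[system].append(rank)
    ranges = [(sorted(r)[drop], sorted(r)[-drop - 1]) for r in ranks]
    # Walk systems in rank order; start a new cluster whenever the next
    # system's range does not overlap the current cluster's span.
    clusters, hi = [], -1
    for s in sorted(range(n_systems), key=lambda s: ranges[s]):
        if ranges[s][0] > hi:
            clusters.append([])
        clusters[-1].append(s)
        hi = max(hi, ranges[s][1])
    return clusters

# Toy scorer: rank by raw win count (the real evaluation uses TrueSkill).
def by_wins(sample):
    wins = [0, 0, 0]
    for winner, _loser in sample:
        wins[winner] += 1
    return sorted(range(3), key=lambda s: -wins[s])

# System 0 clearly beats 1 and 2; systems 1 and 2 are statistically tied,
# so they should land in one cluster.
judgments = [(0, 1)] * 50 + [(0, 2)] * 50 + [(1, 2)] * 5 + [(2, 1)] * 5
print(rank_clusters(by_wins, judgments, 3, folds=200, drop=5))
```

Non-overlapping ranges are what justify reporting "clusters" rather than a strict total order: systems inside one cluster cannot be reliably distinguished by the collected judgments.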

SLIDE 13

Manual evaluation summary

[Plot: pairwise judgments per system (1,000–11,000) vs. number of systems in task (5–20), for 2015 and 2016]

  • ~4.1k rankings / task (~3k last year)
  • Total judgments: 542k (328k last year)
  • Data: statmt.org/wmt16/results.html

SLIDE 14

Czech–English

Clusters (best first; constrained and unconstrained systems):
1: uedin-nmt
2: jhu-pbmt
3: online-B
4: PJATK, TT-*
5: online-A
6: cu-mergetrees

SLIDE 15

English–Czech

Clusters (best first; constrained and unconstrained systems):
1: uedin-nmt
2: nyu-montreal
3: jhu-pbmt
4: cu-chimera, cu-tamchyna
5: uedin-cu-syntax, online-B
6: TT-*
7: online-A
8: cu-tectomt
9: tt-usaar-hmm-mert
10: cu-mergetrees
11: tt-usaar-hmm-mira
12: tt-usaar-harm

SLIDE 16

Russian–English

Clusters (best first; constrained and unconstrained systems):
1: amu-uedin, NRC, uedin-nmt, online-G, online-B
2: AFRL-MITLL-phr, online-A
3: AFRL-MITLL-cntr, PROMT-rule
4: online-F
SLIDE 17

English–Russian

Clusters (best first; constrained and unconstrained systems):
1: promt-rule
2: amu-uedin, uedin-nmt, online-B, online-G
3: NYU-montreal
4: jhu-pbmt, limsi, AFRL-MITLL-phr, online-A
5: AFRL-MITLL-verb
6: online-F
SLIDE 18

German–English

Clusters (best first; constrained and unconstrained systems):
1: uedin-nmt
2: uedin-syntax, kit, uedin-pbmt, jhu-pbmt, online-B, online-A
3: jhu-syntax, online-G
4: online-F
SLIDE 19

English–German

Clusters (best first; constrained and unconstrained systems):
1: uedin-nmt
2: metamind
3: uedin-syntax
4: nyu-montreal
5: kit-limsi, cambridge, promt-rule, kit, online-B, online-A
6: jhu-syntax, jhu-pbmt
7: uedin-pbmt, online-F, online-G
SLIDE 20

Romanian–English

Clusters (best first; constrained and unconstrained systems):
1: uedin-nmt, online-B
2: uedin-pbmt
3: uedin-syntax, jhu-pbmt, limsi, online-A
SLIDE 21

English–Romanian

Clusters (best first; constrained and unconstrained systems):
1: uedin-nmt, qt21-himl-comb
2: kit, uedin-pbmt, uedin-lmu-hiero, rwth-comb, online-B
3: limsi, lmu-cuni, jhu-pbmt, usfd-rescoring, online-A
SLIDE 22

Finnish–English

Clusters (best first; constrained and unconstrained systems):
1: uedin-pbmt, online-G, online-B, uh-opus
2: PROMT-smt
3: uh-factored, uedin-syntax
4: online-A
5: jhu-pbmt

SLIDE 23

English–Finnish

Clusters (best first; constrained and unconstrained systems):
1: abumatran-nmt, abumatran-cmb, online-G, online-B, uh-opus
2: abumatran-pb, nyu-montreal, online-A
3: jhu-pbmt, uh-factored, aalto, jhu-hltcoe, uut

SLIDE 24

Turkish–English

Clusters (best first; constrained and unconstrained systems):
1: online-B, online-G, online-A
2: tbtk-syscomb, usda, PROMT-smt
3: jhu-syntax, jhu-pbmt, parFDA

SLIDE 25

English–Turkish

Clusters (best first; constrained and unconstrained systems):
1: online-G, online-B
2: online-A
3: ysda
4: jhu-hltcoe, tbtk-morph, cmu
5: jhu-pbmt, parFDA

SLIDE 26

Trends

  • UEdin-NMT
    – 4 languages: uncontested winner
    – 3 languages: tied for first
    – 1 language: tied for second (behind rule-based!)
  • English–Russian: rule-based system (PROMT-rule) the winner by a wide margin

SLIDE 27

Comparison with BLEU

[Scatter plot: TrueSkill mean (y-axis, −1.4 to 0.8) vs. BLEU score (x-axis, 0.05 to 0.3), with promt-rule and uedin-nmt labeled]

SLIDE 28

Data

  • statmt.org/wmt16/results.html
    – Source and reference data, system outputs
    – Manual evaluation results (raw XML, CSV files with pairwise rankings)
  • github.com/cfedermann/wmt16
    – Code used to compute rankings, clusters, annotator agreement

Sample CSV row:
srclang,trglang,id,judge,sys1,sys1rank,sys2,sys2rank,group
deu,eng,348,judge13,jhu-syntax,3,online-B,5,190
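Given the CSV schema shown in the sample row (lower rank = better), the released pairwise rankings can be tallied into win counts like this; `count_wins` is an illustrative helper, not part of the WMT scripts:

```python
import csv
import io
from collections import Counter

def count_wins(csv_text):
    """Tally pairwise wins from a WMT-style ranking CSV
    (columns as in the sample row; lower rank = better; ties skipped)."""
    wins = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        r1, r2 = int(row["sys1rank"]), int(row["sys2rank"])
        if r1 < r2:
            wins[row["sys1"]] += 1
        elif r2 < r1:
            wins[row["sys2"]] += 1
    return wins

data = """srclang,trglang,id,judge,sys1,sys1rank,sys2,sys2rank,group
deu,eng,348,judge13,jhu-syntax,3,online-B,5,190
"""
print(count_wins(data))  # jhu-syntax beats online-B (rank 3 < 5)
```

Win counts like these are the raw input to the TrueSkill ranking and the bootstrap clustering described on slide 12.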

SLIDE 29

Direct Assessment