
SLIDE 1

An Empirical Comparison of Features and Tuning for Phrase-based Machine Translation

Spence Green, with Daniel Cer and Chris Manning. Stanford University. WMT // 27 June 2014

SLIDE 5

Recap: ACL13 Results

  • SGD-based, n-best learning
  • L1 feature selection
  • BOLT-scale Zh–En on NIST data:

                   BLEU     Δ
    MERT           48.4
    SGD            48.1
    SGD+Features   49.9   +1.5  :-)

SLIDE 6

Motivation #1: WMT13 Shared Task :-(

[Figure: BLEU on newstest2008–2011 by tuning epoch, comparing the dense and feature-rich models]

SLIDE 7

Motivation #1: WMT13 Shared Task

En–Fr newstest2012 (dev)

                   BLEU     Δ
    Dense          31.1
    SGD+Features   31.5   +0.4


SLIDE 9

Motivation #2: Practical Issues

Q1: Which phrase-based features should I use?
Q2: Why don’t my features help?

SLIDE 13

My Frustrating Summer...

What’s wrong with feature-rich MT?

  • 1. Loss Function
  • 2. References and scoring functions
  • 3. Representation: Features

This paper as a pain reliever...

SLIDE 14

Loss Function

SLIDE 15

ACL13: Online PRO

  • Sensitive to length
  • Doesn’t optimize top-k
  • Slow to compute (sampling)


SLIDE 17

This work: Online Expected Error

Expected BLEU loss:

  ℓ_t(w_{t−1}) = E_{p_{t−1}}[−BLEU(d)] = −∑_{d∈H} p_{t−1}(d) · BLEU(d)

  • Smooth, non-convex
  • Fast, less sensitive to length
  • ...but still doesn’t prefer top-k

SLIDE 18

References and Scoring


SLIDE 21

Single vs. Multiple References

Experiment: compute BLEU+1 for each reference separately, with a baseline MT system. Ar–En NIST MT05 has five references.

SLIDE 22

MT05: Max. vs. Min. BLEU+1

[Scatter plot: per-segment BLEU+1 under the minimum-scoring reference (x-axis) vs. the maximum-scoring reference (y-axis)]

SLIDE 23

MT05: Max. vs. All References BLEU+1

[Scatter plot: per-segment BLEU+1 under the maximum-scoring reference vs. all references]


SLIDE 27

Refs and Scoring Functions

Single-ref lesson: don’t try too hard. Blame the scoring function?

  • BLEU+1
  • BLEU-Nakov [Nakov et al. 2012]
  • BLEU+Noise: add Gaussian noise to the n-gram precisions
  • TER (short translations)
  • Linear combinations
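For context, a minimal sketch of the BLEU+1 idea: sentence-level BLEU with add-one smoothing applied to the higher-order n-gram precisions so that a single zero count does not zero out the whole score. This is illustrative only; real implementations differ in exactly which terms they smooth.

```python
import math
from collections import Counter

def bleu_plus_one(hyp, ref, max_n=4):
    """Sentence-level BLEU with add-one smoothing on n-gram precisions for n > 1."""
    log_prec = 0.0
    for n in range(1, max_n + 1):
        hyp_ngrams = Counter(tuple(hyp[i:i+n]) for i in range(len(hyp) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i+n]) for i in range(len(ref) - n + 1))
        match = sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
        total = max(sum(hyp_ngrams.values()), 1)
        if n == 1:
            prec = match / total if match > 0 else 1e-9   # unigrams left unsmoothed
        else:
            prec = (match + 1) / (total + 1)              # add-one smoothing
        log_prec += math.log(prec) / max_n
    bp = min(1.0, math.exp(1 - len(ref) / max(len(hyp), 1)))  # brevity penalty
    return bp * math.exp(log_prec)

score = bleu_plus_one("the house is big".split(), "the house is very big".split())
```

The smoothing matters precisely in the single-reference condition, where exact higher-order matches against one reference are sparse.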

SLIDE 28

Representation: Features


SLIDE 31

Representation: Dense + Extended

Dense features:
  • Moses baseline templates [Koehn et al. 2007]
  • Hierarchical lexicalized reordering [Galley and Manning 2008]
  • Rule count and uniqueness indicators

Extended features: fire less often than dense features, but more often than sparse ones.


SLIDE 34

Goal: a general, robust feature-rich model

  • No more ad hoc features
  • A starting point for more specific features


SLIDE 36

Five Feature Categories

Common MT error types:

  • 1. Lexical Choice
  • 2. Word Alignments
  • 3. Phrase Boundaries
  • 4. Derivation Quality
  • 5. Reordering

Sources: novel, the literature, word of mouth, etc.


SLIDE 38

Features: Lexical Choice

Filtered rule indicator: the rule maison → the house fires the feature maison->the_house
Class-based variant: with word classes, the same rule fires 64->22_14


SLIDE 41

Features: Lexical Choice

Target unigram class. e: utility stocks lead shares higher, with word classes 77 82 3 82 267.

Feature strings: CLASS:77 CLASS:82 CLASS:3 CLASS:82 CLASS:267
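A sketch of how such class-based feature strings could be generated; the class map below is a toy stand-in for the learned word classes, and the function name is illustrative.

```python
def target_class_features(tokens, word_class):
    """Emit one CLASS:<id> feature string per target word; OOVs map to a catch-all class."""
    return ["CLASS:%d" % word_class.get(tok, -1) for tok in tokens]

# Toy class map standing in for the 512 learned classes.
classes = {"utility": 77, "stocks": 82, "lead": 3, "shares": 82, "higher": 267}
feats = target_class_features("utility stocks lead shares higher".split(), classes)
# feats == ["CLASS:77", "CLASS:82", "CLASS:3", "CLASS:82", "CLASS:267"]
```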


SLIDE 43

Features: Word Alignments

[Word alignment grid: fr “tarceva parvient ainsi à stopper la croissance” ↔ en “tarceva was thus able to halt the growth”]

Feature strings: ALGN:parvient->able ALGN:stopper->to_halt etc.
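The alignment features key source words to their aligned target words, joining multi-word links with underscores. A toy sketch (the alignment links and function name are illustrative assumptions):

```python
def alignment_features(src_words, tgt_words, alignments):
    """Emit ALGN:<src>-><tgt> features; multi-word target links are joined with '_'."""
    feats = []
    for i, tgt_idxs in alignments.items():
        tgt = "_".join(tgt_words[j] for j in sorted(tgt_idxs))
        feats.append("ALGN:%s->%s" % (src_words[i], tgt))
    return feats

src = "tarceva parvient ainsi à stopper la croissance".split()
tgt = "tarceva was thus able to halt the growth".split()
# Hypothesized alignment links: parvient->able, stopper->{to, halt}
feats = alignment_features(src, tgt, {1: [3], 4: [4, 5]})
# feats == ["ALGN:parvient->able", "ALGN:stopper->to_halt"]
```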


SLIDE 46

Features: Phrase Boundaries

Target bigram phrase boundary. e: utility stocks lead shares higher, with word classes 77 82 3 82 267.

Feature strings: BOUNDARY:77_82 BOUNDARY:82_267
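These features record the class bigram straddling each phrase boundary in the derivation. A toy sketch, where the phrase segmentation and class map are illustrative assumptions:

```python
def boundary_features(phrases, word_class):
    """Emit a BOUNDARY:<c1>_<c2> feature for the class bigram at each phrase boundary."""
    feats = []
    for left, right in zip(phrases, phrases[1:]):
        c1 = word_class.get(left[-1], -1)   # last word of the left phrase
        c2 = word_class.get(right[0], -1)   # first word of the right phrase
        feats.append("BOUNDARY:%d_%d" % (c1, c2))
    return feats

classes = {"utility": 77, "stocks": 82, "lead": 3, "shares": 82, "higher": 267}
# Hypothesized segmentation: [utility] [stocks lead shares] [higher]
feats = boundary_features([["utility"], ["stocks", "lead", "shares"], ["higher"]], classes)
# feats == ["BOUNDARY:77_82", "BOUNDARY:82_267"]
```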


SLIDE 48

Features: Derivation Quality

Rule dimension features. For maison ⇒ the house:

Feature strings: SOURCE_DIM:1 TARGET_DIM:2 DIM:1-2
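Dimension features are indicators on a rule's source length, target length, and the length pair. A minimal illustrative sketch:

```python
def rule_dimension_features(src_words, tgt_words):
    """Indicators for a rule's source span length, target span length, and the pair."""
    s, t = len(src_words), len(tgt_words)
    return ["SOURCE_DIM:%d" % s, "TARGET_DIM:%d" % t, "DIM:%d-%d" % (s, t)]

feats = rule_dimension_features(["maison"], ["the", "house"])
# feats == ["SOURCE_DIM:1", "TARGET_DIM:2", "DIM:1-2"]
```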


SLIDE 50

Features: Reordering

Filtered rule orientation: a swap of the rule maison → the house fires the feature SWAP:maison->the_house
Class-based variant: SWAP:64->22_14


SLIDE 55

Aside: Learning Word Classes

Experiment: 3.7M English tokens, 512 classes

                              #threads   min:sec
    Brown (wcluster)              1      1023:39
    Clark (cluster_neyessen)      1       890:11
    Och (mkcls)                   1       199:04
    This paper                    8         2:42

[Whittaker and Woodland 2001] [Uszkoreit and Brants 2008]

SLIDE 56

Experiments

SLIDE 57

NIST Experiments

  • Stanford Phrasal [Green et al. 2014]
  • BOLT-scale systems: Ar–En, Zh–En
  • Four references, uncased BLEU-4


SLIDE 62

NIST Results: Ar–En

                        BLEU     Δ
    Dense               42.2
    Dense+ACL13         44.0   +1.8
    Dense+Ext           44.6   +2.4
    Dense+Ext+Domain    45.0   +2.8

Domain: feature space augmentation [Daumé III 2007]
Zh–En: +2.0 BLEU


SLIDE 64

WMT–14 Shared Task

Single reference, uncased BLEU-4. All Fr–En constrained data:

                Bilingual               Monolingual
                #Segments    #Tokens    #Tokens
                36.3M        2.1M       7.2B


SLIDE 66

WMT–14 Results: Fr–En

                2014 BLEU     Δ
    Dense          35.6
    Dense+Ext      36.0      +0.4  :-(

Uncased BLEU: 1st place. Manual eval: in the 2–4 cluster.

SLIDE 67

Analysis: Single vs. Multiple References

Ar–En MT09 results:

                 4-ref     Δ      1-ref     Δ
    Dense        48.0             47.8
    Dense+Ext    50.0   +2.0      48.9    +1.1


SLIDE 70

General Observations

  • More expressive models match refs better (duh)
  • Single-ref condition == overfitting
  • Sensitivity to tuning set size/content; bitext tuning
  • Ablation isn’t very helpful: approximate search, non-convex objective


SLIDE 73

Conclusion and Impact

  • Baseline feature-rich representation; domain adaptation
  • Faster, better online tuning
  • Scalable software implementing the features: see the new Phrasal release

SLIDE 74

An Empirical Comparison of Features and Tuning for Phrase-based Machine Translation

Spence Green, with Daniel Cer and Chris Manning. Stanford University. WMT // 27 June 2014