


Tuning SMT Systems on the Training Set

Chris Dyer, Patrick Simianer, Stefan Riezler, Phil Blunsom, Eva Hasler

Project Report, MT Marathon 2011, FBK Trento

Goal: Discriminative training using sparse features on the full training set.

Approach: picky-picky / elitist learning:

  • Stochastic learning with true random selection of examples.
  • Feature selection according to various regularization criteria.
  • Leave-one-out estimation: leave out the sentence/shard currently being trained on when extracting rules/features during training.
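The leave-one-out idea can be illustrated on a simple count-based rule probability. This is a minimal sketch with hypothetical names and a plain relative-frequency estimate, not the project's actual extraction code:

```python
def loo_relative_freq(rule_count, src_count, sent_rule_count, sent_src_count):
    """Leave-one-out relative frequency p(rule | src): subtract the current
    sentence's own contribution from both counts before estimating, so a
    sentence cannot 'memorize' rules extracted only from itself."""
    num = rule_count - sent_rule_count
    den = src_count - sent_src_count
    return num / den if den > 0 and num > 0 else 0.0
```

A rule seen only in the current sentence gets probability zero under leave-one-out, which is exactly the overfitting this estimate is meant to suppress.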

SMT Framework + Data

  • cdec decoder (https://github.com/redpony/cdec)
  • Hiero SCFG grammars
  • WMT11 news-commentary corpus: 132,755 parallel sentences, German-to-English


Learning Framework: SGD for Pairwise Ranking
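One SGD step of the pairwise ranking objective can be sketched as a perceptron update over sparse feature dictionaries. This is an illustrative sketch, not the project's actual training code; the function name and dict representation are assumptions:

```python
def perceptron_update(w, feats_better, feats_worse, lr=1.0):
    """One SGD step for pairwise ranking: if the higher-BLEU hypothesis does
    not already out-score the lower-BLEU one under the current weights, move
    the weights toward its features and away from the other's."""
    score = lambda f: sum(w.get(k, 0.0) * v for k, v in f.items())
    if score(feats_better) <= score(feats_worse):
        for k, v in feats_better.items():
            w[k] = w.get(k, 0.0) + lr * v
        for k, v in feats_worse.items():
            w[k] = w.get(k, 0.0) - lr * v
    return w
```

Each sampled pair (see the next section) contributes one such update; correctly ranked pairs leave the weights unchanged.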

Constraint Selection = Sampling of Pairs

Random sampling of pairs from the full chart for pairwise ranking:

  • First sample translations according to their model score, then sample pairs.
  • Sampling diminishes the problem of learning to discriminate translations that are too close to each other (in terms of sentence-wise approximate BLEU).
  • Sampling also speeds up learning.
  • Many variations on sampling are possible ...
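One possible instantiation of this two-stage scheme is sketched below. It assumes each candidate carries a model score and a sentence-level BLEU; the softmax weighting over model scores and the `min_gap` cutoff for near-ties are illustrative choices, not the project's exact procedure:

```python
import math
import random

def sample_pairs(hyps, n_samples=100, n_pairs=50, min_gap=0.05, seed=0):
    """hyps: list of (model_score, sentence_bleu) candidates. First sample
    translations in proportion to exp(model_score), then sample pairs,
    skipping pairs whose sentence-BLEU difference is below min_gap."""
    rng = random.Random(seed)
    top = max(score for score, _ in hyps)
    weights = [math.exp(score - top) for score, _ in hyps]  # stable softmax weights
    pool = rng.choices(hyps, weights=weights, k=n_samples)
    pairs = []
    for _ in range(100 * n_pairs):           # bounded number of attempts
        if len(pairs) == n_pairs:
            break
        a, b = rng.sample(pool, 2)
        if abs(a[1] - b[1]) >= min_gap:      # drop near-ties in BLEU
            pairs.append((a, b) if a[1] > b[1] else (b, a))
    return pairs
```

Each returned pair is ordered (better, worse) by sentence BLEU, ready to feed the pairwise ranking update.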

Candidate Features

Efficient computation of features from local rule context:

  • Hiero SCFG rule identifier
  • target n-grams within a rule
  • target n-grams with gaps (X) within a rule
  • binned rule counts in the full training set
  • rule length features
  • rule shape features
  • word alignments in rules
  • ... and many more!
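A few of these rule-local features can be sketched as follows. The feature-name templates (`RuleId:`, `Bigram:`, `RuleLen:`) and the `'X'` gap symbol are made up for illustration; the actual feature set lives in the cdec feature functions:

```python
def rule_features(rule_id, target_tokens):
    """Sketch of rule-local feature extraction: a sparse indicator for the
    rule identifier, target bigrams inside the rule (gaps appear as 'X'),
    and a rule-length feature over the terminal tokens."""
    feats = {"RuleId:%d" % rule_id: 1.0}
    for a, b in zip(target_tokens, target_tokens[1:]):
        key = "Bigram:%s_%s" % (a, b)
        feats[key] = feats.get(key, 0.0) + 1.0
    n_terminals = len([t for t in target_tokens if t != "X"])
    feats["RuleLen:%d" % n_terminals] = 1.0
    return feats
```

Because every feature depends only on the rule itself, these can be computed once per rule rather than once per derivation.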

Feature Selection

ℓ1/ℓ2-regularization:

  • Compute the ℓ2-norm of each column vector (the vector of values over examples/shards for each of the n features), then the ℓ1-norm of the resulting n-dimensional vector.
  • The effect is to choose a small subset of features that are useful across all examples/shards.
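The norm computation above is simple to state in code. A minimal sketch, representing each shard's weights as a sparse dict (the function name is an assumption):

```python
import math

def l1_l2_norm(shard_weights):
    """shard_weights: one sparse weight dict per shard. Compute the l2-norm
    of each feature's 'column' across shards, then the l1-norm (sum) of the
    resulting per-feature vector. Features useful on only a few shards
    produce small columns and contribute little to the total."""
    features = set().union(*shard_weights)
    column_l2 = {
        f: math.sqrt(sum(w.get(f, 0.0) ** 2 for w in shard_weights))
        for f in features
    }
    return sum(column_l2.values()), column_l2
```

Penalizing this mixed norm drives whole columns to zero at once, which is what makes it a feature *selection* criterion rather than mere shrinkage.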

Feature Selection, done properly

Incremental gradient-based selection of column vectors (Obozinski, Taskar, Jordan: Joint covariate selection and joint subspace selection for multiple classification problems. Statistics and Computing, 2010).
Feature Selection, quick and dirty

Combine feature selection with averaging:

  • Keep only those features with a large enough ℓ2-norm computed over examples/shards.
  • Then average the surviving feature values over examples/shards.
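The quick-and-dirty variant fits in a few lines. A sketch under the same sparse-dict-per-shard representation; the function name and the threshold default are assumptions:

```python
import math

def select_and_average(shard_weights, threshold=0.1):
    """Quick-and-dirty selection: keep a feature only if the l2-norm of its
    column over shards exceeds threshold, then set the kept feature's weight
    to its average value over all shards."""
    features = set().union(*shard_weights)
    kept = {}
    for f in features:
        norm = math.sqrt(sum(w.get(f, 0.0) ** 2 for w in shard_weights))
        if norm > threshold:
            kept[f] = sum(w.get(f, 0.0) for w in shard_weights) / len(shard_weights)
    return kept
```

Unlike the incremental gradient-based approach, this is a single post-hoc pass, so it composes directly with parallel training followed by weight averaging.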

How far did we get in a few days?

First full training run finished!

  • 150k parallel sentences from the news-commentary data, German-to-English
  • pairwise ranking perceptron
  • sample 100 translations from the chart and use all 100×99/2 = 4,950 pairs, OR use an n-best list
  • sparse rule-id features AND/OR dense features
  • 200 shards (25 machines with 8 cores each)

Results

  • Still a lot of bugs due to integrating code from different sources
  • Infrastructure is working
  • Experiments still running
  • Sensible things happening:
      Best rule: X → ⟨X1 , dass X2 ; X1 that X2⟩
      Bad rule:  X → ⟨X1 oder X2 ; X1 and X2⟩
  • At the moment still trailing behind MERT ... we'll catch up!

Thanks

Thanks to the organizers for a great opportunity to learn/chat/hobnob!