SLIDE 1

Developments in Hierarchical Phrase-based Translation

Philip Resnik University of Maryland

Work done with David Chiang, Chris Dyer, Nitin Madnani, and Adam Lopez

SLIDE 2

Some things you’ve seen recently…

Shamelessly stolen from Philipp Koehn

SLIDE 3

Some things you’ve seen recently…

Shamelessly stolen from Kevin Knight

SLIDE 4

Flat Phrases

澳洲 是 与 北 韩 有 邦交 的 少数 国家 之一
Australia is one of the few countries that have diplomatic relations with North Korea

[Alignment diagram: flat phrase pairs link the Chinese and English words directly.]

Can we capture this modification relationship without ISI-style syntactic modeling?

SLIDE 5

Hierarchical phrases

澳洲 是 与 北 韩 有 邦交 的 少数 国家 之一

[Diagram: the pair (与 北 韩 有 邦交, have diplomatic relations with North Korea) is built from the smaller pairs (北 韩, North Korea) and (邦交, diplomatic relations), alongside (澳洲 是, Australia is) and (少数 国家, few countries).]

SLIDE 6

Hierarchical phrases

[Diagram continued: 的 pairs with "that"/"the", attaching the relative clause "have diplomatic relations with North Korea" to "the few countries", with (澳洲 是, Australia is) and 之一 alongside.]

SLIDE 7

Hierarchical phrases

[Diagram continued: (之一, one of) combines with "the few countries that have diplomatic relations with North Korea" and (澳洲 是, Australia is) to yield the full translation.]

SLIDE 8

Synchronous CFG

(X → 与 X1 有 X2, X → have X2 with X1)
(X → 邦交, X → diplomatic relations)
(X → 北 韩, X → North Korea)

[Diagram: the first rule pairs 与 … 有 with "have … with", reordering the two co-indexed gaps.]
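A toy sketch of how such synchronous rules pair a source derivation with a target derivation. The rule set is from this slide, but the string-matching mechanism is an illustrative simplification, not Hiero's actual machinery (which parses with CKY rather than pattern-matching):

```python
# Toy synchronous grammar: each rule pairs a source pattern with a target
# pattern; X1 and X2 are co-indexed nonterminal gaps.
# (Illustrative sketch, not Hiero's implementation.)
RULES = [
    ("与 X1 有 X2", "have X2 with X1"),
    ("邦交", "diplomatic relations"),
    ("北 韩", "North Korea"),
]

def derive(source):
    """Rewrite a source string into its target string via synchronous rules."""
    for src, tgt in RULES:
        if "X1" in src:
            head, tail = src.split(" X1 ")
            mid = tail.split(" X2")[0]
            if source.startswith(head + " ") and (" " + mid + " ") in source:
                x1, x2 = source[len(head) + 1:].split(" " + mid + " ", 1)
                # Gaps are translated recursively; the rule reorders them.
                return tgt.replace("X1", derive(x1)).replace("X2", derive(x2))
        elif source == src:
            return tgt
    return source  # fall back: pass unknown material through

print(derive("与 北 韩 有 邦交"))
# → have diplomatic relations with North Korea
```

The key point mirrored here is that one rule both translates and reorders: X1 and X2 swap positions between the source and target sides.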

SLIDE 9

Grammar extraction

澳洲 是 与 北 韩 有 邦交 的 少数 国家 之一
Australia is one of the few countries that have diplomatic relations with North Korea

(与 北 韩 有 邦交, have diplomatic relations with North Korea)
(邦交, diplomatic relations)
(北 韩, North Korea)
(X → 与 X1 有 X2, X → have X2 with X1)

SLIDE 10

Permits dependencies over long distances without memorizing intervening material (sparseness!)

SLIDE 11

Non-Hierarchical Phrases

SLIDE 12

Hierarchical Modeling

SLIDE 13

Structures Useful for MT

SLIDE 14

Hiero: Hierarchical Phrase-Based Translation

  • Introduced by Chiang (2005, 2007)
  • Moves from phrase-based models toward syntax

– Phrase table → Synchronous CFG

  • Learn reordering rules together with phrases

X → < 与 X1 有 X2, have X2 with X1 >
X → < 北 韩, North Korea >

– Decoder → Parser

  • CKY parser
  • Target side of grammar intersected with finite state LM
  • Log-linear model tuned to optimize objective (BLEU, TER, …)
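The last point can be sketched in a few lines: a derivation's score is a weighted sum of feature values, and the decoder keeps the highest-scoring hypothesis. Feature names, weights, and values below are invented for illustration; in practice the weights are tuned (e.g. by MERT) against BLEU or TER:

```python
import math

# Minimal log-linear scorer. Features (translation probs, LM score, word
# penalty, ...) and weights are illustrative, not Hiero's actual model.
weights = {"p_fe": 1.0, "p_ef": 0.8, "lm": 0.6, "word_penalty": -0.3}

def score(features):
    """Weighted sum of feature values = log-linear model score."""
    return sum(weights[name] * value for name, value in features.items())

# Two hypothetical translation hypotheses with their feature vectors.
hyp_a = {"p_fe": math.log(0.4), "p_ef": math.log(0.5),
         "lm": math.log(0.01), "word_penalty": 7}
hyp_b = {"p_fe": math.log(0.6), "p_ef": math.log(0.2),
         "lm": math.log(0.005), "word_penalty": 6}

best = max((hyp_a, hyp_b), key=score)  # decoder keeps the higher-scoring one
```

Tuning then amounts to adjusting `weights` so that the hypotheses the model prefers also score well under the evaluation objective.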
SLIDE 15

Roadmap

  • Brief review of Hiero
  • New developments

– Confusion network decoding (Dyer)
– Suffix arrays for richer features (Lopez)
– Paraphrase to improve parameter tuning (Madnani)

  • Summary and conclusions
SLIDE 16

Confusion Network Decoding for Translating ASR Output

  • ASR systems produce word graphs:
  • Equivalent to weighted FSA
  • However, Hiero assumes 1-best input
SLIDE 17

Confusion networks (a.k.a. pinched lattices, meshes, sausages)

  • Approximation of a word lattice (Mangu, et al., 2000)

– Every path through the network hits every node
– Probability distribution over words at a given position
– Special symbol ε (epsilon) represents a skip
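A minimal sketch of this representation (words and probabilities invented): each position is a distribution over words, ε marks a skip, and every path visits every column:

```python
EPS = "ε"  # special skip symbol

# A confusion network as a list of columns; each column is a probability
# distribution over the words hypothesized at that position.
# (Toy ASR output for illustration.)
cn = [
    {"if": 0.7, "is": 0.3},
    {"a": 0.6, EPS: 0.4},
    {"man": 1.0},
]

from math import prod
# Every path hits every column, so the path count is just the product of
# the column sizes (here 2 * 2 * 1 = 4, counting the ε-skip as a choice).
n_paths = prod(len(col) for col in cn)

def best_path(network):
    """Greedy 1-best: take the most probable word per column, dropping ε."""
    words = [max(col, key=col.get) for col in network]
    return [w for w in words if w != EPS]

print(best_path(cn))  # → ['if', 'a', 'man']
```

The column-by-column structure is exactly what makes the dynamic programming on the next slides convenient.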

SLIDE 18

Translating from Confusion Networks

  • Confusion networks for MT

– Many more paths than in the source lattice
– Nice properties for dynamic programming

  • Decoding confusion networks beats 1-best hypothesis with a phrase-based model

– Bertoldi, et al. 2005

  • Decoding confusion networks is highly efficient with a phrase-based model

– Hopkins Summer Workshop

  • Moses decoder accepts input as a confusion network

– Bertoldi, et al. 2007

SLIDE 19

The value of hierarchy in the face of ambiguity

[Diagram: a confusion network over the input saafara al-ra’iisu ‘ila Baghdad, with alternatives such as cala for ‘ila, and al-ra’iisu al-amriikiy or al-rajulu al-manfiyu allathiy laa yuħibbu al-Ńayaraana for al-ra’iisu.]

Grammar rule: saafara X ‘ila Y ↔ X traveled to Y

SLIDE 20

Parsing Confusion Networks

  • Efficient CKY parsing available

– Insight: except for the initialization pass (processing terminal symbols), standard CKY already operates on “confusion networks”.
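That insight can be sketched directly: with confusion-network input, only the initialization pass changes, seeding chart cell (i, i+1) with every word in column i (weighted by its posterior) rather than a single terminal. The toy lexicon and probabilities below are invented for illustration:

```python
# Sketch of CKY initialization over a confusion network. Only this pass
# differs from plain-text CKY; the binary inference rules are unchanged.
# (Toy lexicon/grammar, not a real system.)
lexicon = {"he": "PRN", "the": "DET", "ate": "V", "eight": "NUM"}

# Two-column confusion network: homophone-style ASR confusions.
cn = [
    {"he": 0.6, "the": 0.4},
    {"ate": 0.5, "eight": 0.5},
]

chart = {}  # (start, end, nonterminal) -> best probability
for i, column in enumerate(cn):
    for word, p in column.items():
        if word in lexicon:
            key = (i, i + 1, lexicon[word])
            # Keep the best posterior if several words yield the same label.
            chart[key] = max(p, chart.get(key, 0.0))

# Cell (0,1) now holds both PRN and DET analyses; the rest of CKY proceeds
# exactly as for text input, combining adjacent cells bottom-up.
```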

SLIDE 21

Parsing Confusion Networks

  • Axioms:
  • Inferences:
  • Goal:

[Slide shows the deductive parsing items for text input and for confusion-network input side by side.]

SLIDE 22

Model features

[Slide shows the model's feature set, including a confusion-network feature with weight λCN.]

SLIDE 23

Application: spoken language translation

  • Experiments

– Chinese – English (IWSLT 2006)

  • Small standard training bitext (<1M words)
  • Trigram LM from English side of bitext only
  • Spontaneous and read speech from the travel domain
  • Text only development data! (λCN=λLM)

– Arabic – English (BNAT05)

  • UMD training bitext (6.7M words)
  • Trigram LM from bitext and portions of Gigaword
  • Broadcast news and broadcast conversations
  • ASR output development data. (λCN tuned by MERT)
SLIDE 24

Chinese-English (IWSLT 2006)

Input                 WER   Hiero*  Moses*
verbatim               0.0  19.63   18.40
read, 1-best (CN)     24.9  16.37   15.69
read, full CN         16.8  16.51   15.59
spont., 1-best (CN)   32.5  14.96   13.57
spont., full CN       23.1  15.61   14.26

Noisier signal → more improvement

* BLEU, 7 references

p<0.05

SLIDE 25

Performance impact

  • The impact on decoding time is minimal

– Roughly the average depth of the confusion network
– Similar to the impact in a phrase-based system

  • Moses: 3.8x slower over 1-best baseline
  • Hiero: 4.3x slower over 1-best baseline
  • Both systems have efficient disk-based formats available to them

– Adaptation of Zens & Ney (2007)

SLIDE 26

Arabic-English (BNAT05)

Input      WER   Hiero*  Moses*
Verbatim    0.0  26.46   25.13
1-best     12.2  23.64   22.64
Full CN     7.5  24.58   22.61

p<0.05 p<0.01 n.s. p<0.05

Extremely low WER (audio was part of recognizer training data). Hiero appears to make better use of ambiguity.

* BLEU, 1 reference

SLIDE 27

Another Application: Decoder-Guided Morphological Backoff

  • Morphological complexity makes the sparse data problem even more acute

  • Example: Czech → English

– Hypothesis:

From the US side of the Atlantic all such odůvodnění appears to be a totally bizarre.

– Target:

From the American side of the Atlantic, all of these rationales seem utterly bizarre.

SLIDE 28

Solving the morphology dilemma with confusion networks

  • Conventional solution: reduce morphological complexity by removing morphemes

  • Lemmatize (Goldwater & McCloskey 2005)
  • Truncate (Och)
  • Collapse meaningless distinctions (Talbot and Osborne, 2006)
  • Backoff for words you don’t know how to translate (Yang and Kirchhoff)

– Problem: the removed morphemes contain important translation information

  • Surface only:

From the US side of the Atlantic all such odůvodnění appears to be a totally bizarre.

  • Lemma only:

From the [US] side of the Atlantic with any such justification seem completely bizarre.

SLIDE 29

Solving the morphology dilemma with confusion networks

  • Use confusion networks to give access to both representations
  • Use surface forms if it makes sense to do so, otherwise back off to lemmas, with individual choices guided by the model
  • Create a single grammar by combining the rules from both grammars
  • Variety of cost assignment strategies available

[Diagram: a Czech confusion network pairing each surface form with its lemma at the same position, e.g. atlantiku/atlantik, amerického/americký, jeví/jevit, odůvodnění.]

SLIDE 30

Czech-English results

  • Improvements for using CNs are significant at p<.05, CN > surface at p<.01
  • WMT07 training data (2.6M words), trigram LM

Input                               BLEU*
Surface forms only                  22.74
Backoff (~ Yang & Kirchhoff 2006)   23.94
Surface+Lemma (CN)                  25.01
Lemmas only                         22.50

* 1 reference translation

  • Best system on Czech-English task at WMT’07 on all evaluation measures.

SLIDE 31

Confusion Networks Summary

  • Keeping as much information as possible is a good idea.

– Alternative transcription hypotheses from ASR
– Full morphological information

  • Hierarchical phrase-based models outperform conventional models

– Higher absolute baseline
– Better utilization of ambiguity in the signal (cf. Arabic results)

  • Decoding ambiguous input can be done efficiently
  • Current work: Arabic morphological backoff
SLIDE 32

Roadmap

  • Brief review of Hiero
  • New developments

– Confusion network decoding (Dyer)
– Suffix arrays for richer features (Lopez)
– Paraphrase to improve parameter tuning (Madnani)

  • Summary and conclusions
SLIDE 33

Standard Decoder Architecture

SLIDE 34

Standard Decoder Architecture

Much larger training set → much larger phrase table

SLIDE 35

Alternative Decoder Architecture

(Callison-Burch et al., Zhang and Vogel et al.)

Look up (or sample from) all e for substring f

SLIDE 36

Hierarchical Phrase Based Translation with Suffix Arrays

  • Key idea: instead of pre-tabulating information to support features like p(e|f), look up instances of f in the training bitext, on the fly
  • Facilitates:

– Scaling to large training corpora
– Use of arbitrary length phrases
– Ability to decode without test-set-specific filtering
– Features that use broader context
– Features that use corpus annotations
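A sketch of the key idea, with the bitext lookup faked by a pre-extracted list of aligned target phrases (borrowing the "and it" example that follows); a real system collects these occurrences from the indexed bitext and may sample them to bound work on very frequent source phrases:

```python
from collections import Counter
import random

# Stand-in for "look up instances of f in the bitext, on the fly":
# the target phrases aligned to source phrase f = "and it".
# (Toy data echoing the deck's example; extraction itself is not shown.)
occurrences = ["y él", "y ella", "y él", "pero él", "y él"]

def p_e_given_f(e, aligned):
    """Relative-frequency estimate of p(e|f) from the collected instances."""
    counts = Counter(aligned)
    return counts[e] / len(aligned)

random.seed(0)
# For very frequent phrases, estimate from a random sample instead of
# enumerating every occurrence.
sample = random.sample(occurrences, k=3)

print(p_e_given_f("y él", occurrences))  # → 0.6
```

Because nothing is pre-tabulated, features computed this way can also inspect the sentence context or corpus annotations around each occurrence.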

SLIDE 37

Example

(using English as source language for readability)

… and it || y él
and it || y ella
and it || pero él …

SLIDE 38

Looking source patterns up on the fly

… discussed the issue with her and it seems as if …
… offered the organization a better alternative and it …
… built between the new building and it . After proposing …
… y él parece que …
… mejor pero él …
… y el otro . …

[Diagram marks subj annotations on the English occurrences.]

SLIDE 39

Efficient Pattern Matching

  • If the F side of the bitext is indexed using a suffix array, lookup of all matches can be done very quickly.
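A minimal sketch of that lookup over a tiny, invented token corpus: sort all suffix start positions once, then two binary searches find every occurrence of a contiguous pattern. (Real systems index the whole bitext and avoid materializing keys; this is for illustration only.)

```python
import bisect

# Token-level corpus standing in for the F side of the bitext.
text = "the few countries that have diplomatic relations".split()

# Suffix array: suffix start positions, sorted lexicographically.
sa = sorted(range(len(text)), key=lambda i: text[i:])

def find_all(pattern):
    """Start positions of every occurrence of a contiguous token pattern."""
    pat = pattern.split()
    # Prefixes of the sorted suffixes are themselves sorted, so the matches
    # form one contiguous run found by two binary searches.
    keys = [text[i:i + len(pat)] for i in sa]  # materialized for clarity
    lo = bisect.bisect_left(keys, pat)
    hi = bisect.bisect_right(keys, pat)
    return sorted(sa[lo:hi])

print(find_all("the few"))  # → [0]
```

This is the contiguous-pattern case; the next slides show why patterns with gaps (like "him X it") need more machinery.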

SLIDE 40

Example (using English as source language for readability)

SLIDE 41
SLIDE 42

Problem: patterns with gaps

(using English as source language for readability) …

SLIDE 43
  • Instances of pattern are no longer contiguous in suffix array
  • Naïve approaches (e.g. using intersection of subpatterns) are very inefficient – baseline timing result is that decoding takes 2241 seconds per sentence!

Query pattern: him X it

SLIDE 44

Algorithmic extensions

  • Exploiting redundancy using prefix tree with suffix links (Zhang and Vogel 2005)
  • Double binary search (Baeza-Yates 2004) for cases where there is an infrequent subpattern
  • Precomputation for cases where there are multiple frequent subpatterns
  • Caching
SLIDE 45

Timing Results

SLIDE 46

Applications

  • Sampling for feature value estimation
  • Features based on context
  • Features based on annotations
  • Take-home message: the suffix array framework allows very rapid exploration of a larger feature space.

SLIDE 47

Roadmap

  • Brief review of Hiero
  • New developments

– Confusion network decoding (Dyer)
– Suffix arrays for richer features (Lopez)
– Paraphrase to improve parameter tuning (Madnani)

  • Summary and conclusions
SLIDE 48

Using paraphrases to improve parameter tuning

  • Virtually all SMT systems tune model parameters by optimizing an objective function that compares decoder output to reference translations (e.g. BLEU).
  • It’s widely accepted that multiple references per translation are better.
  • But references are expensive to obtain.
  • Could we exploit a quantity/quality tradeoff by increasing the number of references artificially?
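A toy illustration of why extra references help the tuning objective: BLEU-style precision clips each hypothesis n-gram count by the maximum count over all references, so an additional (e.g. paraphrased) reference can only credit more of the hypothesis, never less. Unigram precision only; the sentences loosely echo the earlier Czech example and are otherwise invented:

```python
from collections import Counter

def unigram_precision(hyp, refs):
    """BLEU-style clipped unigram precision against multiple references.
    (Sketch of the multi-reference mechanism, not the full BLEU metric.)"""
    hyp_counts = Counter(hyp.split())
    clipped = 0
    for word, count in hyp_counts.items():
        # Clip by the best count any single reference gives this word.
        best = max(Counter(r.split())[word] for r in refs)
        clipped += min(count, best)
    return clipped / sum(hyp_counts.values())

hyp = "the rationales seem utterly bizarre"
ref1 = "all such reasoning appears totally bizarre"
ref2 = "the rationales seem utterly strange"  # a paraphrased extra reference

p_one = unigram_precision(hyp, [ref1])        # only "bizarre" matches
p_two = unigram_precision(hyp, [ref1, ref2])  # paraphrase credits the rest
```

With one reference the hypothesis looks poor; adding the paraphrased reference recovers the credit it deserves, which is exactly the leverage sought during parameter tuning.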

SLIDE 49

Example

SLIDE 50

Paraphrase as English-to-English translation

SLIDE 51

Examples (Europarl, using French as pivot)

SLIDE 52

Examples (NIST’03 test set using Chinese as pivot)

SLIDE 53

Experiment

SLIDE 54

Results

  • Potentially more interesting scenario, since any bitext provides one human reference translation per source sentence.
  • Raises the possibility of topic- and genre-specific parameter tuning.
  • Score tuning on four human references is matched (statistically) with only two human references needed.
  • “Standard” (for NIST) four references can still improve.

SLIDE 55

Conclusions

  • Hiero is both a framework and a strategy for bringing more linguistically relevant properties into statistical MT

– Start with hierarchy, lexically anchored reordering
– Be driven by parallel data, not by monolingual analysis
– Embrace and extend phrase-based ideas that work well
– Tackle cross-cutting challenges (e.g. more ref translations)

SLIDE 56

Thanks and acknowledgements

  • The work presented would not have been possible without the many good ideas and generous assistance from the following people:

Nicola Bertoldi, David Chiang, Marcello Federico, Ian Lane, Lidia Mangu, Smaranda Muresan, Daniel Zeman, Richard Zens

SLIDE 57

And thank you!