Deep Linguistic Information in Hybrid Machine Translation
- Charles University in Prague
Faculty of Mathematics and Physics Institute of Formal and Applied Linguistics Czech Republic
Outline: From Data To an MT
Hybrid MT Workshop - Coling 2012
surface syntax; syntax & semantics (and more) = “tectogrammatics”
At both syntactic levels
automatic, test section manually corrected (in part)
tectogrammatics, surface syntax, PTB syntax
– Dependency (head rules + additions, manual corrections)
– Function label (PDT-style) at all nodes (from PTB + rules)
– Lemmatization + “pure” POS tags from PTB
– Automatic (from PTB) + a few manual corrections
– PDT style, no change
– Syntax: automatic (MST); 2000 sentences fully manual for testing
– Lemmatization and tagging: automatic
99%/96%, Spoustová et al. EACL 2009 (COMPOST tagger) http://ufal.mff.cuni.cz/compost (Czech, English & other)
– No p-level (of course)
– Nodes with “autosemantic” words only (no function words)
Ellipsis “restored” (new node for verbal arguments)
– (Semantic) function (dependent-head relation)
Verb arguments + ca. 50 functions for other relations
– Valency lexicons attached (Eng: links to PropBank)
– “Formemes”: prep+case style label (useful in MT and search)
– Co-reference integrated (Eng: BBN + more; Czech: manually)
– To surface syntax & between Czech and English
This temblor-prone city dispatched inspectors, firefighters and other earthquake-trained personnel *-1 to aid San Francisco.
– Annotation, View/Browse and Search environment
– Open source, Perl
– Search and visualization:
Simple data browser (http://ufal.mff.cuni.cz/pcedt2.0)
PML-TQ: Powerful query language for complex tree-based annotation
– Modular NLP processing environment
– Easy handling of complex NLP-annotated data
– Modules exist for Czech and English data processing
– CPAN-distributed
[Figure: translation from source language (English) to target language (Czech) in three phases: ANALYSIS, TRANSFER, SYNTHESIS]
– w-layer
– m-layer (morphological layer): POS & lemmatization
– a-layer (analytical layer): shallow syntax
– t-layer (tectogrammatical layer): deep syntax & semantics
[Figure: processing steps from source language (English) to target language (Czech) across the w-, m-, a- and t-layers]
ANALYSIS: Tokenization, Lemmatization, Tagging (Compost), Parsing (MST), Analytical dep. function, Convert to t-tree, Grammatemes and formemes
TRANSFER: Structural transfer
SYNTHESIS: Basic morph. categories, Agreement, Add function words, Generate forms, Concatenate
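A minimal Python sketch of how the steps named on this slide could be chained as one sequential pipeline. It is an illustration only, not the actual TectoMT/Treex code: `make_stage`, `run_pipeline` and `PHASES` are hypothetical names, each stage is a placeholder that merely records that it ran, and the grouping and order of the steps within ANALYSIS, TRANSFER and SYNTHESIS are assumptions read off the diagram.

```python
from typing import Callable, Dict, List, Tuple

Doc = Dict[str, List[str]]
Stage = Callable[[Doc], Doc]


def make_stage(name: str) -> Stage:
    """Build a placeholder stage that only appends its name to doc['trace']."""
    def stage(doc: Doc) -> Doc:
        return {**doc, "trace": doc.get("trace", []) + [name]}
    return stage


# Step names are taken from the slide; their grouping and order are assumptions.
PHASES: List[Tuple[str, List[str]]] = [
    ("ANALYSIS", [
        "tokenization", "lemmatization", "tagging", "parsing",
        "analytical functions", "convert to t-tree",
        "grammatemes and formemes",
    ]),
    ("TRANSFER", ["structural transfer"]),
    ("SYNTHESIS", [
        "basic morphological categories", "agreement",
        "add function words", "generate forms", "concatenate",
    ]),
]


def run_pipeline(doc: Doc) -> Doc:
    """Apply every placeholder stage in order and return the final document."""
    for phase, step_names in PHASES:
        for name in step_names:
            doc = make_stage(f"{phase}: {name}")(doc)
    return doc


result = run_pipeline({"text": ["machine translation should be easy ."]})
for line in result["trace"]:
    print(line)
```

In a real system each placeholder would be replaced by a module that reads and enriches the annotated document; the point here is only that every step consumes the output of the previous one.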
Lexical transfer (dictionary) & lexical choice
Tokenized: machine translation should be easy .
Lemmatized & POS tagged: machine/NN translation/NN should/MD be/VB easy/JJ ./.
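As a small illustration of the m-layer output, the tokens and tags from the example can be paired position by position. The sketch below simply hard-codes the slide's example; it is not a tagger, and the variable names are hypothetical.

```python
# Tokens and POS tags copied from the slide's example sentence.
tokens = "machine translation should be easy .".split()
tags = "NN NN MD VB JJ .".split()

# Pair each token with the tag in the same position.
tagged = list(zip(tokens, tags))

for token, tag in tagged:
    print(f"{token}/{tag}")
```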
Analytical functions: machine/Atr translation/Sb should/Pred be/Obj easy/Pnom ./AuxK
Formemes: machine n:attr, translation n:subj, be v:fin, easy adj:compl
Formemes: machine n:attr, translation n:subj, be v:fin, easy adj:compl
Grammatemes: Modality=hort, Conditional=1, Tense=PresSim, DoC=Positive, Num=sg
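The step from the tagged sentence to the t-layer can be sketched as follows: function words are dropped as nodes, and the modal "should" survives only as a grammateme. This is a toy illustration under stated assumptions, not the TectoMT conversion: `TNode`, `to_t_nodes` and both lookup tables are hypothetical, the formemes are copied by hand from the example rather than computed, and attaching `Modality=hort` to the node "be" is an assumption (the slide does not show which node carries it).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class TNode:
    lemma: str
    formeme: str
    grammatemes: Dict[str, str] = field(default_factory=dict)


# Tagged input copied from the example sentence.
TAGGED: List[Tuple[str, str]] = [
    ("machine", "NN"), ("translation", "NN"), ("should", "MD"),
    ("be", "VB"), ("easy", "JJ"), (".", "."),
]

# Formemes copied from the slide, not derived by rules (assumption).
SLIDE_FORMEMES = {
    "machine": "n:attr", "translation": "n:subj",
    "be": "v:fin", "easy": "adj:compl",
}

# Only the modal/modality pair visible in the example (assumption).
MODAL_TO_MODALITY = {"should": "hort"}


def to_t_nodes(tagged: List[Tuple[str, str]]) -> List[TNode]:
    """Keep content words as nodes; fold a modal into the next verb's grammatemes."""
    nodes: List[TNode] = []
    pending_modality: Optional[str] = None
    for lemma, tag in tagged:
        if tag == "MD":            # modal verb: no node of its own
            pending_modality = MODAL_TO_MODALITY.get(lemma)
            continue
        if tag == ".":             # punctuation: no node
            continue
        node = TNode(lemma, SLIDE_FORMEMES[lemma])
        if tag.startswith("VB") and pending_modality is not None:
            node.grammatemes["Modality"] = pending_modality
            pending_modality = None
        nodes.append(node)
    return nodes


t_nodes = to_t_nodes(TAGGED)
for node in t_nodes:
    print(node)
```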
Candidate lemmas: strojový, stroj; mít, být; snadný, jednoduchý
Candidate formemes: n:2, adj:attr, n:attr; n:1; v:fin, v:inf; adj:compl, n:1, adv:
Grammatemes: Modality=hort, Conditional=1, Tense=PresSim, DoC=Positive, Num=sg
* Dictionary translation: MaxEnt classifier, ~106 features
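To make the idea of a MaxEnt classifier for lexical choice concrete, the sketch below scores two candidate translations of "easy" with a log-linear model and normalizes the scores with a softmax. It is a generic illustration, not the model used in the system: the function `maxent_probs`, the feature names and the weights are all invented for this example.

```python
import math
from typing import Dict, List, Tuple


def maxent_probs(
    candidates: List[str],
    active_features: List[str],
    weights: Dict[Tuple[str, str], float],
) -> Dict[str, float]:
    """Return P(candidate | features) under a log-linear (MaxEnt) model."""
    scores = {
        c: sum(weights.get((f, c), 0.0) for f in active_features)
        for c in candidates
    }
    top = max(scores.values())  # subtract the maximum for numerical stability
    exps = {c: math.exp(s - top) for c, s in scores.items()}
    total = sum(exps.values())
    return {c: e / total for c, e in exps.items()}


# Hypothetical features of the source node "easy" and made-up weights.
features = ["formeme=adj:compl", "parent_lemma=be"]
weights = {
    ("formeme=adj:compl", "snadný"): 0.5,
    ("parent_lemma=be", "jednoduchý"): 0.2,
}

probs = maxent_probs(["snadný", "jednoduchý"], features, weights)
best = max(probs, key=probs.get)
print(probs, best)
```

In a trained model the weights would be estimated from parallel data; here they only serve to show how features of the source node shift probability mass between candidates.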
strojový: Deg=pos, Case=1, Gen=MInanim
Case=1
mít: Gen=MInanim, C=PastP, Num=sg
snadný: Deg=pos, Case=1, Gen=MInanim
být: C=inf
by
.
strojový
snadný . být by
Acknowledgements:
– Ministry of Education Czech Rep.: LC536, MSM0021620838, ME09008, 7Ennnn
– Czech Science Foundation: GAP406/10/0875, GPP406/10/P193, GA405/09/0729
– “Information Society” Programme: 1ET101120503
– Charles Univ. student grants: 116310, 158010, 3537/2011
– European projects (in part): 034434, 034291, 231720, 247762, 249119, 257528
– Charles University research funds (“PRVOUK”)