Czechizator – Čechizátor
Rudolf Rosa, rosa@ufal.mff.cuni.cz
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics
SloNLP, Tatranské Matliare, 18 September 2016
Rudolf Rosa: Czechizator - Čechizátor
Czechizator
lexicon-less “translation” from English to Czech
usual approach: use a bilingual lexicon
Czech-English texts → statistical translation model training → translation system
input: presentation → output: prezentace
Czechizator approach: use a set of rules instead
rules (-ise → -iza, -tion → -ce, ...) → translation system
input: presentation → output: presentace
Example: Czechizating ITAT titles
Statistical modelling in climate science
Statistické modelování v klimat scienci
12 years of Unsupervised Dependency Parsing
12 jírů nesupervizované parsování dependence
Multivariable Approximation by Convolutional Kernel Networks
Multivariabilní aproximace Konvolucional Kernel netvorksu
Implementation
lexical translation: a set of Czechization rules
43 ending-based transformation rules (see later)
33 transliteration rules: th → t, ti → ci, ck → k, ph → f, sh → š, igh → aj, dg → dž, w → v, c → k…
36 hard-coded translations of semi-auxiliaries: be, have, do, and, or, all, this, many, only, main…
grammar and function words: TectoMT
English-Czech machine translation system
Czechizator implemented as a TectoMT lexical translation model
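The transliteration step could be sketched roughly as follows. This is a minimal illustration, not the Czechizator code: it uses only the nine rules quoted above, and the single left-to-right pass with longer patterns tried first is an assumption made so that one rule's output (the "c" from ti → ci) is not rewritten by another rule (c → k). The name `transliterate` is hypothetical.

```python
import re

# The nine transliteration rules quoted on the slide (the remaining ones are not listed).
TRANSLIT_RULES = {
    "igh": "aj",
    "th": "t",
    "ti": "ci",
    "ck": "k",
    "ph": "f",
    "sh": "š",
    "dg": "dž",
    "w": "v",
    "c": "k",
}

# Assumption: longer patterns are tried first and the word is scanned once,
# left to right, so the output of one rule is never fed into another rule.
_PATTERN = re.compile(
    "|".join(re.escape(src) for src in sorted(TRANSLIT_RULES, key=len, reverse=True))
)


def transliterate(word: str) -> str:
    """Apply the transliteration rules to a lowercased English word."""
    return _PATTERN.sub(lambda m: TRANSLIT_RULES[m.group(0)], word.lower())


for english in ["insight", "network", "photograph"]:
    print(english, "->", transliterate(english))
```

Under these assumptions, "insight" and "network" come out as "insajt" and "netvork", the stems visible in the Czechized examples elsewhere in the talk.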
Implementation
I preferred the presentation of David.
TectoMT analysis:
prefer: verb, 1st person, past
presentation: noun, definite, object
David: noun+of, named ent.
Czechization of lemmas:
preferovat, prezentace, David
TectoMT transfer of attributes:
preferovat: verb, 1st person, past
prezentace: noun, accusative
David: noun, genitive, n.e.
TectoMT synthesis:
Preferoval jsem prezentaci Davida.
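The division of labour in this pipeline can be illustrated with a small sketch: Czechization touches only the lemma of each analysed word, while the grammatical attributes pass through unchanged for the later transfer and synthesis steps. The `Node` class, `czechize_nodes`, and the toy lookup table are hypothetical stand-ins; they are not TectoMT data structures or the real Czechization rules.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Node:
    """Hypothetical stand-in for an analysed word: a lemma plus grammatical attributes."""
    lemma: str
    attributes: Tuple[str, ...]


def czechize_nodes(nodes: List[Node], czechize_lemma: Callable[[str], str]) -> List[Node]:
    """Replace each lemma with its Czechized form and leave the attributes untouched."""
    return [replace(node, lemma=czechize_lemma(node.lemma)) for node in nodes]


# Analysis of "I preferred the presentation of David." as shown on the slide.
analysed = [
    Node("prefer", ("verb", "1st person", "past")),
    Node("presentation", ("noun", "definite", "object")),
    Node("David", ("noun+of", "named ent.")),
]

# Toy lookup standing in for the rule-based Czechization step (not the real rules).
toy_czechization = {"prefer": "preferovat", "presentation": "prezentace"}

czechized = czechize_nodes(analysed, lambda lemma: toy_czechization.get(lemma, lemma))
for node in czechized:
    print(node.lemma, node.attributes)
```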
Transformation rules for adjectives
partial → parciální
stable → stabilní
tolerant → tolerantní
tolerated → tolerovaný
turkic → turkický
practical → praktický
native → nativní
regular → regulární
fatal → fatální
nervous → nervózní
parsed → parsovaný
parsing → parsující
park → parkový
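A rough sketch of how such ending-based rules might be applied is given below. The rule table is inferred from the thirteen examples on this slide only; the actual rule set has 43 entries and may differ. The choice to swap the ending first and transliterate only the stem, and the names `ADJECTIVE_ENDINGS` and `czechize_adjective`, are assumptions made for illustration.

```python
import re

# Ending rules inferred from the adjective examples on this slide (assumption:
# the real rule set is larger and may differ in detail). Longest endings first.
ADJECTIVE_ENDINGS = [
    ("ated", "ovaný"),
    ("ical", "ický"),
    ("able", "abilní"),
    ("ant", "antní"),
    ("ive", "ivní"),
    ("ous", "ózní"),
    ("ing", "ující"),
    ("al", "ální"),
    ("ar", "ární"),
    ("ic", "ický"),
    ("ed", "ovaný"),
]
DEFAULT_ENDING = "ový"  # fallback suggested by park -> parkový

# Two of the transliteration rules from the Implementation slide, applied to the stem only.
STEM_RULES = {"ti": "ci", "c": "k"}
_STEM_PATTERN = re.compile("ti|c")


def czechize_adjective(word: str) -> str:
    """Swap the English ending for a Czech one, then transliterate the stem."""
    word = word.lower()
    stem, ending = word, DEFAULT_ENDING
    for english_ending, czech_ending in ADJECTIVE_ENDINGS:
        if word.endswith(english_ending):
            stem, ending = word[: -len(english_ending)], czech_ending
            break
    stem = _STEM_PATTERN.sub(lambda m: STEM_RULES[m.group(0)], stem)
    return stem + ending


for english in ["partial", "practical", "stable", "parsing", "park"]:
    print(english, "->", czechize_adjective(english))
```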
What is it good for?
translations sometimes “reasonable”
scientific titles and abstracts, marketing texts:
Accenture Operations combines technology that digitizes and automates business processes, unlocks actionable insights, and delivers everything-as-a-service with our team's deep industry, functional and technical expertise.
Operacions acenturu kombinuje technologii, která digitizuje a automuje procesy businosti, unlokuje akcionabilní insajty a deliveruje everyting-as-a-servicová s funkcionální a technickou expertizou dípové industrie našeho tímu.
still, only a proof of concept & a fun application
not really useful as a standalone tool
maybe as a starting point for later post-editing
potential: combine with TectoMT lexical models
frequent words: translation model trained from data
infrequent words: insufficient training data, Czechize!
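The proposed combination amounts to a frequency-based fallback, which might look roughly like this. Everything here is a hypothetical sketch: the function `translate_lemma`, the dictionary-based lexicon, the training counts, and the `min_count` threshold are assumptions, not part of TectoMT or Czechizator.

```python
from typing import Callable, Dict


def translate_lemma(
    lemma: str,
    lexicon: Dict[str, str],
    train_counts: Dict[str, int],
    czechize: Callable[[str], str],
    min_count: int = 5,  # assumed threshold between "frequent" and "infrequent"
) -> str:
    """Use the trained lexicon for frequent lemmas; Czechize rare or unseen ones."""
    if train_counts.get(lemma, 0) >= min_count and lemma in lexicon:
        return lexicon[lemma]
    return czechize(lemma)


# Toy data standing in for a trained translation model and the Czechization rules.
toy_lexicon = {"word": "slovo"}
toy_counts = {"word": 1000}
toy_czechize = lambda lemma: {"anaphora": "anafora"}.get(lemma, lemma)

print(translate_lemma("word", toy_lexicon, toy_counts, toy_czechize))
print(translate_lemma("anaphora", toy_lexicon, toy_counts, toy_czechize))
```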
Complementing TectoMT
rare/unseen words not well handled by TectoMT
unreliable translation for rare words, none for unseen
e.g. scientific terms
large number and growing, rare in data
often rather regular translations → can be Czechized
anaphora → anafora
hypotactical → hypotaktický
circumfixal → cirkumfixální
current issues: named entities get Czechized
usually should be avoided, but detection insufficient
Conclusion
lexicon-less lexical “translation” module
transformation (endings) and transliteration rules
grammar and aux words handled by TectoMT
Czechization of lemmas on t-layer
Czechization of scientific titles sometimes “good”
but still not really useful
work in progress: integrate into TectoMT
complement existing lexical models
Czechize rare and unseen words, e.g. science terms
Thank you for your attention
http://ufal.mff.cuni.cz/rudolf-rosa/
Czechizator – Čechizátor
Rudolf Rosa, rosa@ufal.mff.cuni.cz
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics
http://ufal.mff.cuni.cz/czechizator
Examples: parsing papers
The theory of parsing, translation, and compiling
Accurate unlexicalized parsing
An efficient context-free parsing algorithm
Seven principles of surface structure parsing in natural language
Head-driven statistical models for natural language parsing
Parsing by chunks
Shallow parsing with conditional random fields
Examples: parsing papers
Teorie parsování, translace a kompiluje
Akuratová unlexikalizovaná parsování
Eficientová kontext-fríová parsování algoritm
7 principlů struktur surface parsování v naturální langvaži
Híd-drivenové statistické modely pro naturální langvaž parsování
Parsují Chunky
Šalovujte, parsujete s kondicionálními