Czechizator – Čechizátor


slide-1
SLIDE 1

SloNLP, Tatranské Matliare, 18 September 2016 Rudolf Rosa rosa@ufal.mff.cuni.cz

Czechizator – Čechizátor

Charles University Faculty of Mathematics and Physics Institute of Formal and Applied Linguistics

slide-4
SLIDE 4

Rudolf Rosa: Czechizator - Čechizátor

Czechizator

• lexicon-less “translation” from English to Czech
• usual approach: use a bilingual lexicon
  • Czech-English texts → training → statistical translation model
  • input: presentation → translation system → output: prezentace
• Czechizator approach: use a set of rules instead
  • Czech-English texts, rules: -ise → -iza, -tion → -ce, ...
  • input: presentation → translation system → output: presentace

slide-10
SLIDE 10

Example: Czechizating ITAT titles

• Statistical modelling in climate science
  → Statistické modelování v klimat scienci
• 12 years of Unsupervised Dependency Parsing
  → 12 jírů nesupervizované parsování dependence
• Multivariable Approximation by Convolutional Kernel Networks
  → Multivariabilní aproximace Konvolucional Kernel netvorksu

slide-11
SLIDE 11

Implementation

• lexical translation: a set of Czechization rules
  • 43 ending-based transformation rules (see later)
  • 33 transliteration rules: th → t, ti → ci, ck → k, ph → f, sh → š, igh → aj, dg → dž, w → v, c → k…
  • 36 hard-coded translations of semi-auxiliaries: be, have, do, and, or, all, this, many, only, main…
• grammar and function words: TectoMT
  • English-Czech machine translation system
  • Czechizator implemented as a TectoMT lexical translation model
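
The rule-based lexical step could be sketched roughly as follows. This is only an illustration, not the Czechizator code: it uses just the nine transliteration rules and the two ending rules (-ise → -iza, -tion → -ce) that the slides actually list. The single-pass application order and the choice to transliterate only the stem are assumptions, and the names `transliterate` and `czechize` are hypothetical.

```python
import re

# Transliteration rules listed on the slide (9 of the 33; the rest are not shown).
TRANSLITERATION = {
    "igh": "aj", "th": "t", "ti": "ci", "ck": "k", "ph": "f",
    "sh": "š", "dg": "dž", "w": "v", "c": "k",
}

# Try longer source strings first so that "igh" or "ck" win over shorter overlaps.
_PATTERN = re.compile(
    "|".join(sorted(map(re.escape, TRANSLITERATION), key=len, reverse=True))
)

# The two ending rules shown in the talk; the full set of 43 is not listed.
ENDINGS = [("tion", "ce"), ("ise", "iza")]


def transliterate(text: str) -> str:
    """Apply all rules in one left-to-right pass (an assumption), so the output
    of one rule (e.g. "ci" from "ti") is never rewritten by another ("c" -> "k")."""
    return _PATTERN.sub(lambda m: TRANSLITERATION[m.group(0)], text)


def czechize(lemma: str) -> str:
    """Rewrite the ending if a rule matches and transliterate the remaining stem."""
    for source, target in ENDINGS:
        if lemma.endswith(source):
            return transliterate(lemma[: -len(source)]) + target
    return transliterate(lemma)


print(czechize("presentation"))  # presentace
print(czechize("shock"))         # šok
print(czechize("night"))         # najt
```

Doing the substitution in a single regex pass avoids rule interactions: if the rules were applied one after another, the "ci" produced by ti → ci would later be turned into "ki" by c → k.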

slide-21
SLIDE 21

Implementation

Input: I preferred the presentation of David.

TectoMT analysis
  • prefer: verb, 1st person, past
  • presentation: noun, definite, object
  • David: noun+of, named ent.

Czechization of lemmas
  • prefer → preferovat
  • presentation → prezentace
  • David → David

TectoMT transfer of attributes
  • preferovat: verb, 1st person, past
  • prezentace: noun, accusative
  • David: noun, genitive, n.e.

TectoMT synthesis

Output: Preferoval jsem prezentaci Davida.
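
The data flow of this walk-through can be illustrated with a small sketch. It does not call TectoMT; the `TNode` class, the lookup tables and the function names are hypothetical stand-ins whose values are copied from the slide. Analysis and synthesis are left out, since they are handled by TectoMT itself.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TNode:
    """Hypothetical stand-in for a deep-syntactic node: a lemma plus attributes."""
    lemma: str
    attrs: tuple


# Result of the analysis step, as shown on the slide.
ANALYSIS = [
    TNode("prefer", ("verb", "1st person", "past")),
    TNode("presentation", ("noun", "definite", "object")),
    TNode("David", ("noun+of", "named ent.")),
]

# Stub for the rule-based Czechization of lemmas (values copied from the slide).
CZECHIZED_LEMMAS = {"prefer": "preferovat", "presentation": "prezentace"}

# Stub for the attribute transfer (values copied from the slide).
ATTRIBUTE_TRANSFER = {
    ("noun", "definite", "object"): ("noun", "accusative"),
    ("noun+of", "named ent."): ("noun", "genitive", "n.e."),
}


def czechize_lemmas(nodes):
    """Replace each lemma, leaving the attributes untouched."""
    return [replace(n, lemma=CZECHIZED_LEMMAS.get(n.lemma, n.lemma)) for n in nodes]


def transfer_attributes(nodes):
    """Map source-side attributes to target-side ones, leaving lemmas untouched."""
    return [replace(n, attrs=ATTRIBUTE_TRANSFER.get(n.attrs, n.attrs)) for n in nodes]


target = transfer_attributes(czechize_lemmas(ANALYSIS))
for node in target:
    print(node.lemma, node.attrs)
```

The point of the sketch is that lemmas and grammatical attributes travel separately: Czechization only touches the lemma, so inflection ("prezentaci", "Davida") can be left to the synthesis step.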

slide-22
SLIDE 22

Transformation rules for adjectives

• partial → parciální
• stable → stabilní
• tolerant → tolerantní
• tolerated → tolerovaný
• turkic → turkický
• practical → praktický
• native → nativní
• regular → regulární
• fatal → fatální
• nervous → nervózní
• parsed → parsovaný
• parsing → parsující
• park → parkový
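
One way to reproduce the pairs above is a longest-ending-first rule table. The endings below are inferred from these thirteen examples only and are not the actual Czechizator rules. The fallback ending -ový and the function name `czechize_adjective` are likewise assumptions, and the input is assumed to be already known to be an adjective.

```python
import re

# Ending rules inferred from the example pairs (an assumption), longest first.
ADJECTIVE_ENDINGS = [
    ("ical", "ický"), ("able", "abilní"), ("ated", "ovaný"),
    ("ant", "antní"), ("ing", "ující"), ("ive", "ivní"), ("ous", "ózní"),
    ("al", "ální"), ("ar", "ární"), ("ed", "ovaný"), ("ic", "ický"),
]
DEFAULT_ENDING = "ový"  # guessed fallback, based on park → parkový

# Transliteration rules that the slides list explicitly.
TRANSLITERATION = {
    "igh": "aj", "th": "t", "ti": "ci", "ck": "k", "ph": "f",
    "sh": "š", "dg": "dž", "w": "v", "c": "k",
}
_PATTERN = re.compile(
    "|".join(sorted(map(re.escape, TRANSLITERATION), key=len, reverse=True))
)


def transliterate(stem: str) -> str:
    """Single left-to-right pass over the stem (assumed application order)."""
    return _PATTERN.sub(lambda m: TRANSLITERATION[m.group(0)], stem)


def czechize_adjective(lemma: str) -> str:
    """Swap the first matching ending and transliterate the remaining stem."""
    for source, target in ADJECTIVE_ENDINGS:
        if lemma.endswith(source):
            return transliterate(lemma[: -len(source)]) + target
    return transliterate(lemma) + DEFAULT_ENDING


for word in ["partial", "stable", "practical", "tolerated", "park"]:
    print(word, "→", czechize_adjective(word))
```

Listing longer endings first matters: "practical" has to match -ical before -al, and "tolerated" has to match -ated before -ed.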

slide-24
SLIDE 24

What is it good for?

• translations sometimes “reasonable”
  • scientific titles and abstracts, marketing texts:
  • Accenture Operations combines technology that digitizes and automates business processes, unlocks actionable insights, and delivers everything-as-a-service with our team's deep industry, functional and technical expertise.
  • Operacions acenturu kombinuje technologii, která digitizuje a automuje procesy businosti, unlokuje akcionabilní insajty a deliveruje everyting-as-a-servicová s funkcionální a technickou expertizou dípové industrie našeho tímu.

slide-26
SLIDE 26

What is it good for?

• translations sometimes “reasonable”
  • scientific titles and abstracts, marketing texts
• still, only a proof of concept & a fun application
  • not really useful as a standalone tool
  • maybe as a starting point for later post-editing
• potential: combine with TectoMT lexical models
  • frequent words: translation model trained from data
  • infrequent words: insufficient training data, Czechize!
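
The proposed combination amounts to a frequency-based back-off, which could look roughly like this. The function name, the `min_count` threshold and the toy data are hypothetical; the slides do not say how "frequent" would be decided.

```python
from typing import Callable, Mapping


def translate_lemma(
    lemma: str,
    model: Mapping[str, str],      # stand-in for a trained lexical model
    counts: Mapping[str, int],     # training-data frequency of each lemma
    czechize: Callable[[str], str],
    min_count: int = 5,            # assumed threshold, not from the slides
) -> str:
    """Use the trained model for frequent lemmas and back off to Czechization
    for rare or unseen ones."""
    if counts.get(lemma, 0) >= min_count and lemma in model:
        return model[lemma]
    return czechize(lemma)


# Toy data for illustration only.
model = {"presentation": "prezentace"}
counts = {"presentation": 120}


def fallback(lemma: str) -> str:
    """Placeholder for the rule-based Czechization."""
    return f"<czechized:{lemma}>"


print(translate_lemma("presentation", model, counts, fallback))  # prezentace
print(translate_lemma("anaphora", model, counts, fallback))      # <czechized:anaphora>
```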

slide-28
SLIDE 28

Complementing TectoMT

• rare/unseen words not well handled by TectoMT
  • unreliable translation for rare words, none for unseen
• e.g. scientific terms
  • large number and growing, rare in data
  • often rather regular translations → can be Czechized
  • anaphora → anafora
  • hypotactical → hypotaktický
  • circumfixal → cirkumfixální
• current issues: named entities get Czechized
  • usually should be avoided, but detection insufficient

slide-29
SLIDE 29

Conclusion

• lexicon-less lexical “translation” module
  • transformation (endings) and transliteration rules
• grammar and aux words handled by TectoMT
  • Czechization of lemmas on t-layer
• Czechization of scientific titles sometimes “good”
  • but still not really useful
• work in progress: integrate into TectoMT
  • complement existing lexical models
  • Czechize rare and unseen words, e.g. science terms

slide-30
SLIDE 30

Thank you for your attention

http://ufal.mff.cuni.cz/rudolf-rosa/

Charles University Faculty of Mathematics and Physics Institute of Formal and Applied Linguistics Czechizator – Čechizátor Rudolf Rosa rosa@ufal.mff.cuni.cz

http://ufal.mff.cuni.cz/czechizator

slide-31
SLIDE 31

Examples: parsing papers

• The theory of parsing, translation, and compiling
• Accurate unlexicalized parsing
• An efficient context-free parsing algorithm
• Seven principles of surface structure parsing in natural language
• Head-driven statistical models for natural language parsing
• Parsing by chunks
• Shallow parsing with conditional random fields

slide-32
SLIDE 32

Examples: parsing papers

• Teorie parsování, translace a kompiluje
• Akuratová unlexikalizovaná parsování
• Eficientová kontext-fríová parsování algoritm
• 7 principlů struktur surface parsování v naturální langvaži
• Híd-drivenové statistické modely pro naturální langvaž parsování
• Parsují Chunky
• Šalovujte, parsujete s kondicionálními randomovými Fieldy