
SLIDE 1

MDL-Based Models for Transliteration Generation

Javad Nouri, Lidia Pivovarova, and Roman Yangarber

University of Helsinki Department of Computer Science

SLSP 2013 July 30, 2013

J. Nouri, L. Pivovarova, R. Yangarber (UH)

SLSP 2013 July 30, 2013 1 / 27

SLIDE 2

Outline

1 Introduction
    Transliteration
    Applications
    Challenges
2 Data
    Data-sets
3 Methods
    Review
    Motivation
    Etymon Project
    Models
    Prediction
    Pre-processing
    Evaluation
    Baselines
4 Results
5 Future Work

SLIDE 4

Transliteration

Predicting word representation in another language, based on spelling or pronunciation

Tarragona
Таррагона
Таррагонæ
تاراگونا
טרגונה
Taragona
टैरागोना
Тарраґона
طراغونة
塔拉戈纳
타라고나
タラゴナ
ታራጎና
தாராகோணம்

SLIDE 5

Where can it be used?

Machine Translation: proper names, terms
Information Retrieval, Information Extraction, Named Entity Recognition
Several scripts for a language

SLIDE 9

Challenges

Transliteration can be based on:

    pronunciation
    spelling
    translation

Санкт-Петербург /Sankt-Peterburg/

    Saint Petersburg
    Pietari

SLIDE 14

Different transliterations for the same name

Лев Толстой

    Leo Tolstoy
    Lev Tolstoy

SLIDE 15

Challenges

Transliteration can be based on:

    pronunciation
    spelling
    translation

Different transliterations for the same name

Scripts are based on different principles:

    Alphabetic: English, Russian, ...
    Consonantal: Arabic, Persian, Hebrew, ...
    Syllabic: Katakana, ...
    ...

SLIDE 20

Data

Wikipedia article titles devoted to the same entity in different languages are transliterations of each other.
Language links can be used to find such pairs.
Categories can be used to distinguish different types of data (locations, people, etc.)

SLIDE 24

Data-sets

#   Language              Script type
1   English               Alphabetic
2   Farsi                 Consonantal
3   French                Alphabetic
4   Greek                 Alphabetic
5   Hebrew                Consonantal
6   Japanese (Katakana)   Syllabic
7   Russian               Alphabetic

Data-set             Language pair   Size: # of pairs
American Actors      En–Ru           1471
                     En–He           1245
                     En–Fa            840
                     En–Gr            407
Russian Writers      Ru–En           1462
Iranian Cities       Fa–En            439
                     Fa–Ru            469
Russian Cities       Ru–En           1136
                     Ru–Fa            870
                     Ru–Fr            828
                     Ru–Jp            317
French Cities        Fr–Ru            828
Iranian Locations    Fa–Ru           1893

Using Wikipedia categories increases homogeneity in data

    The Russian Writers data-set contains mainly Russian names, etc.
    Locations are more homogeneous than person names.

SLIDE 30

Methods

Karimi, S., Scholer, F., Turpin, A.: Machine transliteration survey. ACM Computing Surveys 43(3) (2011)

[Figure: taxonomy tree of transliteration methods. Node labels: Transliteration; Generation; Extraction; Phonetics-based; Spelling-based; Hybrid; Combined; Rule-based; Noisy Channel; MDL]

SLIDE 36

MDL: motivation

Model trained to align words in a training set

    H  e  l   s  i  n  k  i
    |  |  |   |  |  |  |  |
    Х  е  л   з  и  н  к  и

    H  e  l   s  i  n  k  i
    |  |  |   |  |  |  |  |
    Х  е  ль  с  и  н  к  и

    H  e  l   s  i  n  k  i
    |  |  |   |  |  |  |  |
    .  Ε  λ   σ  ί  ν  κ  ι

The better an alignment, the more regularities have been found
The better an alignment, the better the data can be compressed

MDL (Minimum Description Length)
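The link between regular alignments and compression can be illustrated with a small sketch. It is not the authors' code length: it assumes a simple adaptive code in which each aligned symbol pair is coded with add-one smoothing over a fixed number of possible pairs. The function name `code_length_bits` and the toy data are hypothetical.

```python
import math
from collections import Counter


def code_length_bits(alignments, num_event_types):
    """Adaptive code length, in bits, of a list of aligned words.

    Each alignment is a list of (source_symbol, target_symbol) pairs.
    Assumption: pairs are coded one at a time with add-one smoothing
    over `num_event_types` possible pairs.
    """
    counts = Counter()
    seen = 0
    bits = 0.0
    for alignment in alignments:
        for event in alignment:
            probability = (counts[event] + 1) / (seen + num_event_types)
            bits -= math.log2(probability)
            counts[event] += 1
            seen += 1
    return bits


# Same symbols, two alignments: one repeats its pairs, the other does not.
regular = [[("l", "л"), ("i", "и")], [("l", "л"), ("i", "и")]]
irregular = [[("l", "л"), ("i", "и")], [("l", "и"), ("i", "л")]]

print(code_length_bits(regular, 16) < code_length_bits(irregular, 16))
```

Repeated pairs become cheaper each time they recur, so the alignment that reuses the same correspondences needs fewer bits.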

SLIDE 41

Etymon

Study of regular sound changes across languages

Main idea:

    The closer the languages, the better the alignments between words

http://etymon.cs.helsinki.fi

SLIDE 45

1×1 Model

Each symbol of one language may correspond to at most one symbol in another language
No context is used
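A one-to-one alignment of this kind can be found with dynamic programming over the two words. The sketch below is only an illustration: the cost function is a toy stand-in (in an MDL setting the costs would presumably come from the model's code lengths), and the names `align_1x1`, `toy_cost`, and the use of "." for an empty symbol are assumptions.

```python
EMPTY = "."  # assumed marker for "no symbol on this side"


def align_1x1(source, target, cost):
    """Return the lowest-cost alignment as a list of (source, target) pairs.

    Each step pairs one source symbol with one target symbol, or pairs a
    symbol on either side with EMPTY.
    """
    n, m = len(source), len(target)
    inf = float("inf")
    best = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            steps = []
            if i > 0 and j > 0:
                steps.append((i - 1, j - 1, cost(source[i - 1], target[j - 1])))
            if i > 0:
                steps.append((i - 1, j, cost(source[i - 1], EMPTY)))
            if j > 0:
                steps.append((i, j - 1, cost(EMPTY, target[j - 1])))
            for pi, pj, step_cost in steps:
                total = best[pi][pj] + step_cost
                if total < best[i][j]:
                    best[i][j] = total
                    back[i][j] = (pi, pj)
    # Follow the back-pointers from the end of both words to the start.
    pairs = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        pairs.append((source[pi] if pi < i else EMPTY,
                      target[pj] if pj < j else EMPTY))
        i, j = pi, pj
    return pairs[::-1]


def toy_cost(s, t):
    """Toy costs: identical symbols are free, an empty side costs 1, else 2."""
    if s == t:
        return 0.0
    return 1.0 if EMPTY in (s, t) else 2.0


print(align_1x1("abc", "ac", toy_cost))
```

With these toy costs, "abc" and "ac" align as a:a, b:(empty), c:c.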

SLIDE 52

2×2 Model

Align up to two symbols at a time
A larger count matrix
Takes into account word boundaries

    ^Alda$
    1×1: ^الدا$
    2×2: ^آلدا$

Is able to find one-to-two correspondences

    Cyrillic Ч to English ch

SLIDE 56

Prediction

Use the converged model to predict unseen data

1×1 model

    Simple look-up in count matrix

2×2 model

    Dynamic Programming
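For the 1×1 case, the look-up can be sketched as follows. This is a minimal illustration under assumptions, not the authors' implementation: the count matrix is a nested counter built from already aligned pairs, "." stands for an empty symbol, and source symbols never seen in training are simply skipped. The names `build_count_matrix` and `predict_1x1` are hypothetical.

```python
from collections import Counter, defaultdict


def build_count_matrix(alignments):
    """Count how often each source symbol was aligned with each target symbol."""
    counts = defaultdict(Counter)
    for alignment in alignments:
        for source_symbol, target_symbol in alignment:
            counts[source_symbol][target_symbol] += 1
    return counts


def predict_1x1(word, counts, empty="."):
    """Replace every source symbol by its most frequent aligned target symbol."""
    output = []
    for symbol in word:
        if symbol not in counts:
            continue  # assumption: unseen symbols are dropped
        target_symbol = counts[symbol].most_common(1)[0][0]
        if target_symbol != empty:
            output.append(target_symbol)
    return "".join(output)


# Toy training alignments (hypothetical data).
training = [
    [("m", "м"), ("o", "о"), ("s", "с")],
    [("o", "о"), ("m", "м")],
    [("o", "а")],
]
matrix = build_count_matrix(training)
print(predict_1x1("som", matrix))
```

Because each symbol is looked up on its own, this kind of prediction cannot use context such as word boundaries or neighbouring symbols.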

SLIDE 62

Data Cleanup

Semi-automatic data cleanup

    Remove accent marks from the Greek data-set
    Remove patronymics from Russian names
    Delete word pairs that are not transliterations of each other

Some noise still remains
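The accent-removal step can be sketched with Python's standard `unicodedata` module. This is one possible way to do it, not necessarily the procedure used for the data-sets; the function name `strip_accents` is hypothetical. Note that it removes every combining mark, not only the Greek tonos.

```python
import unicodedata


def strip_accents(text):
    """Remove combining marks (such as the Greek tonos) from a string."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


print(strip_accents("Ελσίνκι"))
```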

SLIDE 66

Evaluation

Leave-one-out evaluation

Word level

    \[ \text{Accuracy} = \frac{\text{Number of correct transliterations}}{\text{Total number of test words}} \]

Symbol level

    Normalised Edit Distance

    \[ \mathrm{NED} = \frac{\sum_i \mathrm{ED}(c_i, r_i)}{\sum_i |c_i|} \]

    $c_i$: Expected transliteration for word $i$
    $r_i$: System response for word $i$
    $\mathrm{ED}(c_i, r_i)$: Levenshtein edit distance
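Both measures are straightforward to compute. The sketch below follows the definitions on this slide; the function names and the two-word example are hypothetical.

```python
def levenshtein(a, b):
    """Levenshtein edit distance (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, symbol_a in enumerate(a, start=1):
        current = [i]
        for j, symbol_b in enumerate(b, start=1):
            current.append(min(previous[j] + 1,            # deletion
                               current[j - 1] + 1,         # insertion
                               previous[j - 1] + (symbol_a != symbol_b)))
        previous = current
    return previous[-1]


def word_accuracy(expected, responses):
    """Fraction of test words whose response matches the expected form exactly."""
    correct = sum(c == r for c, r in zip(expected, responses))
    return correct / len(expected)


def normalised_edit_distance(expected, responses):
    """Total edit distance divided by the total length of the expected words."""
    total_distance = sum(levenshtein(c, r) for c, r in zip(expected, responses))
    total_length = sum(len(c) for c in expected)
    return total_distance / total_length


expected = ["tolstoy", "helsinki"]
responses = ["tolstoi", "helsinki"]
print(word_accuracy(expected, responses))
print(normalised_edit_distance(expected, responses))
```

In this example one of two words is exactly right, and one symbol out of fifteen expected symbols is wrong.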

SLIDE 68

Evaluation (cont.)

Symbol level

    F-Score

    Longest Common Subsequence:

    \[ \mathrm{LCS}(c, r) = \tfrac{1}{2}\left(|c| + |r| - \mathrm{ED}'(c, r)\right) \]

    $\mathrm{ED}'(c, r)$ allows insertions and deletions and no substitutions

    Recall, Precision, F-Score:

    \[ P = \frac{\mathrm{LCS}(c, r)}{|r|}, \qquad R = \frac{\mathrm{LCS}(c, r)}{|c|}, \qquad F = 2 \times \frac{R \times P}{R + P} \]

    Average F-Score over all words
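A minimal sketch of the symbol-level F-Score follows. It computes the LCS length directly by dynamic programming rather than through ED′; the two are related by the identity given on this slide. Function names are hypothetical, and returning 0 when there is no common symbol is an assumption made to avoid division by zero.

```python
def lcs_length(c, r):
    """Length of the longest common subsequence of c and r."""
    previous = [0] * (len(r) + 1)
    for symbol_c in c:
        current = [0]
        for j, symbol_r in enumerate(r, start=1):
            if symbol_c == symbol_r:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def f_score(c, r):
    """F-Score of response r against expected transliteration c."""
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0  # assumption: no overlap scores zero
    precision = lcs / len(r)
    recall = lcs / len(c)
    return 2 * recall * precision / (recall + precision)


def mean_f_score(expected, responses):
    """Average F-Score over all test words."""
    scores = [f_score(c, r) for c, r in zip(expected, responses)]
    return sum(scores) / len(scores)


print(f_score("tolstoy", "tolstoi"))
```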

SLIDE 69

Baselines: Baseline 1

Simple rule-based algorithm

English to Farsi

    a → ا, b → ب, c → ك, d → د, e →
    f → ف, g → گ, h → ه, i → ي, j → ج
    k → ك, l → ل, m → م, n → ن, o → و
    p → پ, q → ك, r → ر, s → س, t → ت
    u → و, v → و, w → و, x → کس, y → ي, z → ز

Farsi to English

    آ → a, ا → a, ؤ → u, ئ → e, ب → b, پ → p
    ت → t, ث → s, ج → j, چ → ch, ح → h, خ → kh
    د → d, ذ → z, ر → r, ز → z, ژ → zh, س → s
    ش → sh, ص → s, ض → z, ط → t, ظ → z, ع → e
    غ → gh, ف → f, ق → gh, ك → k, گ → g, ل → l
    م → m, ن → n, و → v, ه → h, ي → y
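A rule-based baseline of this kind amounts to a per-character table look-up. The sketch below uses the Farsi-to-English rules from this slide; copying characters that have no rule through unchanged is an assumption, and the function name `rule_based_fa_to_en` is hypothetical.

```python
# Farsi-to-English rules as listed on the slide.
FA_TO_EN = {
    "آ": "a", "ا": "a", "ؤ": "u", "ئ": "e", "ب": "b", "پ": "p",
    "ت": "t", "ث": "s", "ج": "j", "چ": "ch", "ح": "h", "خ": "kh",
    "د": "d", "ذ": "z", "ر": "r", "ز": "z", "ژ": "zh", "س": "s",
    "ش": "sh", "ص": "s", "ض": "z", "ط": "t", "ظ": "z", "ع": "e",
    "غ": "gh", "ف": "f", "ق": "gh", "ك": "k", "گ": "g", "ل": "l",
    "م": "m", "ن": "n", "و": "v", "ه": "h", "ي": "y",
}


def rule_based_fa_to_en(word):
    """Transliterate symbol by symbol using the fixed rule table."""
    # Assumption: characters without a rule are copied through unchanged.
    return "".join(FA_TO_EN.get(ch, ch) for ch in word)


print(rule_based_fa_to_en("تهران"))
```

Since short vowels are usually not written in Farsi, a table like this cannot restore them: the example word (Tehran) comes out without its first vowel.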

SLIDE 70

Baselines: Baseline 2

Open source system

    DirecTL+
        Jiampojamarn, S., Cherry, C., Kondrak, G.: Joint processing and discriminative training for letter-to-phoneme conversion. 2008

    M2M-aligner
        Jiampojamarn, S., Kondrak, G., Sherif, T.: Applying many-to-many alignments and hidden Markov models to letter-to-phoneme conversion. 2007

SLIDE 72

Results

On average

Model      Word level accuracy   NED     Mean F-Score
1x1        0.235                 0.259   0.804
2x2        0.442                 0.162   0.878
Baseline   0.245                 0.278   0.795
DirecTL+   0.311                 0.253   0.832

SLIDE 73

Names vs. Locations

American Actors (size: 1471 pairs)

           En → Ru                                     Ru → En
Model      Word level accuracy   NED     Mean F-Score  Word level accuracy   NED     Mean F-Score
1x1        0.338                 0.222   0.815         0.309                 0.223   0.814
2x2        0.430                 0.176   0.851         0.388                 0.177   0.853
Baseline   0.298                 0.250   0.799         0.282                 0.250   0.795
DirecTL+   0.387                 0.214   0.834         0.373                 0.189   0.854

Russian Writers (size: 1462 pairs)

           En → Ru                                     Ru → En
Model      Word level accuracy   NED     Mean F-Score  Word level accuracy   NED     Mean F-Score
1x1        0.400                 0.153   0.878         0.415                 0.126   0.920
2x2        0.634                 0.091   0.934         0.689                 0.073   0.943
Baseline   0.347                 0.201   0.856         0.651                 0.075   0.944
DirecTL+   0.462                 0.176   0.875         0.588                 0.090   0.933

Russian Cities (size: 1136 pairs)

           En → Ru                                     Ru → En
Model      Word level accuracy   NED     Mean F-Score  Word level accuracy   NED     Mean F-Score
1x1        0.448                 0.113   0.904         0.509                 0.082   0.957
2x2        0.762                 0.040   0.972         0.881                 0.018   0.989
Baseline   0.379                 0.176   0.868         0.823                 0.028   0.983
DirecTL+   0.501                 0.163   0.886         0.813                 0.028   0.985

SLIDE 74

Size vs. Homogeneity

Iranian Cities (size: 439 pairs)

           En → Fa                                     Fa → En
Model      Word level accuracy   NED     Mean F-Score  Word level accuracy   NED     Mean F-Score
1x1        0.196                 0.334   0.787         0.109                 0.280   0.817
2x2        0.435                 0.155   0.896         0.228                 0.205   0.857
Baseline   0.175                 0.353   0.789         0.057                 0.282   0.790
DirecTL+   0.132                 0.391   0.786         0.289                 0.185   0.863

Iranian Cities (size: 469 pairs)

           Ru → Fa                                     Fa → Ru
Model      Word level accuracy   NED     Mean F-Score  Word level accuracy   NED     Mean F-Score
1x1        0.382                 0.197   0.856         0.134                 0.282   0.803
2x2        0.525                 0.139   0.890         0.252                 0.237   0.827
Baseline   0.267                 0.277   0.803         0.092                 0.296   0.775
DirecTL+   0.151                 0.332   0.800         0.222                 0.210   0.846

Iranian Locations (size: 1893 pairs)

           Ru → Fa                                     Fa → Ru
Model      Word level accuracy   NED     Mean F-Score  Word level accuracy   NED     Mean F-Score
1x1        0.380                 0.201   0.863         0.135                 0.274   0.812
2x2        0.553                 0.134   0.902         0.278                 0.217   0.841
Baseline   0.285                 0.270   0.816         0.078                 0.318   0.752
DirecTL+   0.155                 0.345   0.813         0.317                 0.189   0.854

SLIDE 80

Future Work

Context-sensitive models
Improved prediction algorithm
More data:

    Other languages
    Other types of names, e.g. company names

SLIDE 81

Thank you!

Questions?
