Learning Morphology of Romance, Germanic, and Slavic languages with the tool Linguistica

SLIDE 1

Learning Morphology of Romance, Germanic, and Slavic languages with the tool Linguistica

Helena Blancafort LREC 2010

SLIDE 2

Outline

  • 1. Introduction
  • 2. State of the art
  • 3. Linguistica: How it works
  • 4. Experiments and Results
  • 5. Conclusions and further work

20/05/2010 LREC 2010

SLIDE 3

Introduction

Motivation

How can we predict the cost of developing a morphosyntactic lexicon for a new language?

Goals

  • Evaluate if we can benefit from unsupervised learning of morphology
  • Input: Bible parallel corpus, tool Linguistica (Goldsmith 2001, 2006)

SLIDE 4

State of the Art: Induction of morphology

Objective

  • induce morphological information from raw data

Affix inventory

  • Brent et al. 1995; Kazakov, 1997
  • MDL (Rissanen, 1998)

Cluster of stems and affixes

  • Schone and Jurafsky 2001
  • Yarowsky and Wicentowski 2001

SLIDE 5

State of the Art II

Using linguistic knowledge or not

Lexicon

  • Nakov et al. (2003); Oliver (2005)
  • Learn all possible endings of an unknown word
  • Apply Maximum Likelihood Estimation (Mikheev)

Inflection rules

  • Clément et al. (2004)
  • Forsberg et al. (2006); Loupy et al. (2008)
  • POS-tagger Zanchetta and Baroni (2005)

SLIDE 6

Linguistica: How it works I

  • Knowledge-free
  • Input: raw corpus
  • Heuristics to generate a probabilistic morphological grammar
  • MDL (minimum description length) & EM (expectation-maximization algorithm) to filter out inappropriate analyses
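
The filtering idea behind MDL can be illustrated with a toy comparison. This is only a sketch under strong assumptions: description length is approximated by counting letters plus one pointer unit per stem or suffix reference, a crude stand-in for the bit-based costs a real MDL model would use. It is not Linguistica's actual computation, and the function names are hypothetical.

```python
def cost_unanalysed(words):
    """Letters needed to list every distinct word form in full."""
    return sum(len(w) for w in set(words))


def cost_analysed(analyses):
    """Letters for the distinct stems and suffixes, plus two pointers per word.

    `analyses` is a list of (stem, suffix) pairs, one per word form.
    """
    stems = {stem for stem, _ in analyses}
    suffixes = {suffix for _, suffix in analyses}
    letters = sum(len(s) for s in stems) + sum(len(s) for s in suffixes)
    pointers = 2 * len(set(analyses))  # each word points to one stem and one suffix
    return letters + pointers


words = ["walk", "walks", "walked", "walking", "jump", "jumps", "jumped", "jumping"]

# Two candidate segmentations of the same eight words.
good = [(w[:4], w[4:]) for w in words]  # walk + NULL/s/ed/ing, jump + ...
bad = [(w[:3], w[3:]) for w in words]   # wal + k/ks/ked/king, jum + ...

print(cost_unanalysed(words), cost_analysed(good), cost_analysed(bad))
```

Under this proxy the segmentation that reuses a small suffix set is the cheapest description, and the poor segmentation is barely better than listing every word in full, which is the sense in which a shorter description is preferred.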

SLIDE 7

Linguistica: How it works II

Signatures

Paradigm-like clusters of words sharing the same affixes, which could help to build a morphological grammar

The algorithm:

  • Splits a word into stem and affix
  • For each stem, list of affixes
  • Cluster of stems sharing the same affixes
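
The three steps above can be sketched as follows. This is a minimal illustration, not Linguistica's implementation: it assumes a fixed, hand-picked suffix list (`SUFFIXES`) and tries every split, whereas Linguistica discovers the affixes itself. The function name `build_signatures`, the `min_stem_len` parameter, and the toy word list are hypothetical.

```python
from collections import defaultdict

# Assumption: a small hand-picked suffix inventory; "" stands for NULL.
SUFFIXES = ("", "ed", "ing", "s")


def build_signatures(words, suffixes=SUFFIXES, min_stem_len=3):
    """Group stems by the set of suffixes they occur with."""
    vocabulary = set(words)

    # Steps 1 and 2: split each word into stem + suffix and
    # collect, for each stem, the suffixes it was seen with.
    stem_to_suffixes = defaultdict(set)
    for word in vocabulary:
        for suffix in suffixes:
            if suffix and not word.endswith(suffix):
                continue
            stem = word[: len(word) - len(suffix)]
            if len(stem) >= min_stem_len:
                stem_to_suffixes[stem].add(suffix)

    # Step 3: cluster the stems that share exactly the same suffix set.
    signatures = defaultdict(set)
    for stem, found in stem_to_suffixes.items():
        if len(found) > 1:  # a stem seen with one suffix only is not informative
            label = ".".join(sorted(s or "NULL" for s in found))
            signatures[label].add(stem)
    return dict(signatures)


toy_words = ["ask", "asked", "asking", "asks",
             "look", "looked", "looking", "looks",
             "doubt", "doubted"]
toy_signatures = build_signatures(toy_words)
```

On this toy list, `ask` and `look` end up under the signature `NULL.ed.ing.s`, while `doubt` gets `NULL.ed` because only two of its forms are present.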

SLIDE 8

Linguistica: How it works III

Signatures

NULL.ed.ing.s 68 7889
gather abound account ascend ask belong boil chasten concern confirm consider delay doubt encamp enter exceed explain fail fasten fold gain gather glean greet groan guard hang happen harden insult journey knock lack leap lift listen look minister number obey offer overflow

SLIDE 9

Linguistica: How it works IV

Main hurdles

1) Allomorphy
   ES colgar -> colg, cuelg
   FR acheter -> achet, achèt
2) Incomplete paradigms due to bad segmentation
   Spanish verb anunciar: anunci(o, en, etc.), anunciab(a)
3) No distinction between inflectional and derivational suffixes
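
The allomorphy hurdle can be illustrated with a short sketch. It assumes a purely string-based learner that looks for a shared initial substring; the four forms of Spanish colgar are illustrative examples chosen here, not data from the experiments. `os.path.commonprefix` is a standard-library helper that compares strings character by character.

```python
from os.path import commonprefix

# Illustrative forms of Spanish "colgar" (chosen for this sketch).
forms_o = ["colgar", "colgó"]
forms_ue = ["cuelgo", "cuelga"]

stem_o = commonprefix(forms_o)               # stem shared by the o-forms
stem_ue = commonprefix(forms_ue)             # stem shared by the ue-forms
shared = commonprefix(forms_o + forms_ue)    # what all four forms share

print(stem_o, stem_ue, shared)
```

The two groups yield the stems `colg` and `cuelg`, and the four forms together share only `c`, so a learner that expects one stem per paradigm splits the verb into two incomplete paradigms.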

SLIDE 10

Experiments and Results I

[Figure: number of suffixes generated by Linguistica, by language (pl, it, cat, es, fr, pt, de, nl, en)]

SLIDE 11

Experiments and Results II

Number of paradigms and number of suffixes

[Figure: number of paradigms and number of suffixes for pt, fr, de, cat, nl, en, it, es, pl]

SLIDE 12

Experiments and Results III

[Figure: maximum number of forms per signature (Linguistica), by language (pl, es, cat, it, pt, fr, de, nl, en)]

SLIDE 13

Experiments and Results IV

Knowledge-free vs. knowledge-based

Max nb forms per signature (Linguistica)
  es  31
  it  28
  fr  24
  de  14
  en   9

Max nb forms per paradigm (Multext)
  it  63
  fr  62
  es  55
  de  29
  en  14
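
One way to read the two lists together is to divide, for each language, the longest Linguistica signature by the longest Multext paradigm. The sketch below does this arithmetic with the figures from the slide. The ratio is an illustrative measure introduced here, not one reported in the presentation, and the longest signature and the longest paradigm need not belong to the same word.

```python
# Figures copied from the slide.
max_forms_signature = {"es": 31, "it": 28, "fr": 24, "de": 14, "en": 9}  # Linguistica
max_forms_paradigm = {"it": 63, "fr": 62, "es": 55, "de": 29, "en": 14}  # Multext

# Assumption: the ratio is only a rough indicator of how much of the
# largest paradigm the largest signature covers.
coverage = {
    lang: max_forms_signature[lang] / max_forms_paradigm[lang]
    for lang in max_forms_signature
}

# Print the languages from highest to lowest ratio.
for lang, ratio in sorted(coverage.items(), key=lambda item: item[1], reverse=True):
    print(f"{lang}: {ratio:.2f}")
```

With these figures English has the highest ratio (about 0.64) and French the lowest (about 0.39).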

SLIDE 14

Experiments and Results V

Longest signatures suggested by Linguistica for a stem

Lang  Nb affixes  Stem    Signature
pl    39          da      NULL.ch.cie.dzą.j.je.jmy.jmyż.ją.jąc.li.liście.liśmy.m.my.na.ne.nej.ni.nie.niu.no.ny.ną.rze.sz.wa.wał.wszy.d.ł.ła.łby.łbyś.łem.łeś.ło.ły.o
es    31          anunci  a.ad.ada.adas.adlo.ado.amos.an.ando.ar.ara.arles.aron.aros.arte.ará.arán.arás.aré.as.ase.asen.e.emos.en.es.o.áis.é.éis.ó
de    14          heil    NULL.e.en.et.ig.los.lose.loser.sam.same.sames.t.te.ten
en    9           light   NULL.ed.en.er.ing.ly.ness.ning.s

SLIDE 15

Experiments and Results VI

List of most frequent prefixes for German

Prefix  Nb occ.   Prefix  Nb occ.   Prefix  Nb occ.
ge      40        her     13        er      8
aus     30        un      13        *nied   7
ver     21        weg     11        bei     6
hin     20        be      10        heim    6
auf     19        zu      10        über    5
ab      19        *üb     9         durch   5
ein     16        an      9         ent     4

SLIDE 16

Conclusions and Further Work

  • Useful information to evaluate the richness and complexity of the morphology of a language
  • Unsupervised techniques should be improved with human input: handwritten rules are necessary for dealing with allomorphy and correcting bad segmentation (Karasimos & Petropoulou 2010)
  • Complete paradigms using the web (Oliver 2005) or
  • Output quality is language-dependent: English gives better results than other languages (complete verbal paradigms)

SLIDE 17


Thank you Grazzi