Automatic Learning of a Morphological Model - Theory and Unsupervised Approaches
Unsupervised Learning - ToC
Foundations
Problem description
General architecture
Papers:
– Goldsmith 2001 and 2006
– Goldsmith and Hu 2004
Unsupervised Models - Foundations
Saffran et al. 1996:
Adults are capable of rapidly discovering word units in a stream of a nonsense language, without any connection to meaning.
Creutz and Lagus 2007:
This suggests that humans do use distributional cues, such as transition probabilities between sounds, in language learning. These kinds of statistical patterns in language data can be successfully exploited by appropriately designed algorithms.
Unsupervised Models - Problem desc.
Input:
– Untagged text in orthographic form; words are separated.
– No syntactic, semantic, or phonological information is given.
Output:
– A lexicon of morphemes (stems and affixes). Frequent unsegmented words might also be included (a mixed lexicon, see Creutz and Lagus).
– An analysis of each word in the corpus by segmentation into morphemes.
Results evaluation
– The analysis should be as close as possible to the gold standard, obtained by manual segmentation.
– The data should be represented efficiently.
Unsupervised Models – Components
Bootstrapping heuristic - creates the initial morphological model.
Incremental heuristics - create an improved morphological model based on the existing one.
Evaluation model - compares morphological models and tells whether a significant improvement was achieved.
Unsupervised Models – Flow diag.
[Flow diagram] Corpus → Bootstrapping Heuristic → Model of Morphology → Incremental Heuristics → New Model of Morphology → Evaluation Model (MDL, MAP): if the new model is (much) better, replace the old model and evaluate again; if there is not much improvement, stop.
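A minimal Python sketch of this loop; the function names (bootstrap, propose_refinements, description_length) and the min_gain threshold are hypothetical placeholders, not taken from any of the papers:

def learn_morphology(corpus, bootstrap, propose_refinements, description_length, min_gain=1.0):
    """Hypothetical skeleton of the bootstrap / refine / evaluate loop sketched above."""
    model = bootstrap(corpus)                       # bootstrapping heuristic
    best_cost = description_length(model, corpus)   # evaluation model (e.g. MDL)
    improved = True
    while improved:                                 # keep refining while it pays off
        improved = False
        for candidate in propose_refinements(model, corpus):   # incremental heuristics
            cost = description_length(candidate, corpus)
            if best_cost - cost > min_gain:         # new model is (much) better
                model, best_cost = candidate, cost  # replace the old model
                improved = True
                break
    return model                                    # not much improvement: stop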
Unsupervised Models – Preface
Goldsmith 2001/6
– Recursive segmentation into stem+suffix / prefix+stem.
– Evaluation in terms of Minimum Description Length.
Goldsmith and Hu 2004
– An NFSA holds layered morpheme compositions.
– Evaluation in terms of Minimum Description Length.
Creutz and Lagus 2005/7
– An HMM of categories emitting morphemes.
– Evaluation in terms of Maximum A Posteriori.
Unsupervised Models - Methodologies - 1
Tradeoff between restrictiveness and flexibility
– A too restrictive model may exclude all optimal and near-optimal models, making it impossible to learn a good model, regardless of how much data and computation time is spent.
– A too flexible model is very hard to learn, as it requires impractical amounts of data and computation.
Unsupervised Models - Methodologies - 2
Minimum Description Length (Rissanen 1989)
– Main ideas: every regularity in data may be used to compress that data; learning can be equated with finding regularities in data.
– Formalization of Occam's Razor: the best hypothesis for a given set of data is the one that leads to the largest compression of the data.
– Choose the best model by simultaneously considering model accuracy and model complexity.
– Simpler models are favored over complex ones. This generally improves generalization capacity by inhibiting overlearning.
Unsupervised Models - Methodologies - 3
MDL formulation, used in Goldsmith's papers:

  $\hat{M} = \operatorname*{argmin}_{\text{Model } M} \mathrm{DescriptionLength}(C, M) = \operatorname*{argmin}_{\text{Model } M} \Big[ \mathrm{length}(M) + \log_2 \frac{1}{P(C \mid M)} \Big]$

Alternative formulation, used in the papers by Creutz and Lagus: Maximum A Posteriori (MAP):

  $\widehat{\text{Lexicon}} = \operatorname*{argmax}_{\text{Lexicon}} P(\text{Lexicon} \mid \text{Corpus})$

Chen 1996: the two approaches are equivalent with respect to the task discussed here.
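A toy numeric illustration in Python; the bit counts below are invented, only to show how the MDL criterion compares two candidate models:

models = {
    "no segmentation":   {"model_bits": 500.0, "log2_p_corpus": -1400.0},
    "stem+suffix model": {"model_bits": 800.0, "log2_p_corpus": -900.0},
}

def description_length(m):
    # DL = length(M) + log2(1 / P(C|M)) = length(M) - log2 P(C|M)
    return m["model_bits"] - m["log2_p_corpus"]

best = min(models, key=lambda name: description_length(models[name]))
print(best, description_length(models[best]))   # the stem+suffix model wins: 1700 < 1900 bits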
Unsupervised Models – Goldsmith 2001/6
The signature concept
– The list of affixes appearing with a stem, e.g. jump: NULL.ed.ing.s
– Each stem has a unique signature.
The structure of the morphological model:
– List of stems, e.g. {cat, jump, laugh, hat, walk, sav}
– List of affixes, e.g. {NULL, ed, ing, s, e, es}
– List of signatures and their associated stems; e.g. the signature NULL.ed.ing.s is stored as the affix pointers ptr(NULL) ptr(ed) ptr(ing) ptr(s) together with the stem pointers ptr(jump) ptr(walk) ptr(laugh).
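As a small illustration (a sketch of the idea, not Goldsmith's code), this is how stems can be grouped into signatures once each word has been split into stem + suffix:

from collections import defaultdict

analyses = [("jump", ""), ("jump", "ed"), ("jump", "ing"), ("jump", "s"),
            ("walk", ""), ("walk", "ed"), ("walk", "ing"), ("walk", "s"),
            ("cat", ""), ("cat", "s")]

suffixes_of = defaultdict(set)          # stem -> set of suffixes seen with it
for stem, suffix in analyses:
    suffixes_of[stem].add(suffix if suffix else "NULL")

signatures = defaultdict(set)           # signature string -> set of stems
for stem, sufs in suffixes_of.items():
    signatures[".".join(sorted(sufs))].add(stem)

for sig, stems in signatures.items():
    print(sig, "->", sorted(stems))     # NULL.ed.ing.s -> ['jump', 'walk'], NULL.s -> ['cat']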
Goldsmith 2001/6 - Samples
Some signatures from The Adventures of Tom Sawyer :
Signature        # Tokens   # Stems   Sample of tokens
NULL.ed.ing      816        69        betray betrayed betraying
NULL.ed.ing.s    516        14        remain remained remaining remains
NULL.s           3414       253       cow cows
e.ed.es.ing      62         4         notice noticed notices noticing
Goldsmith 2001/6 - Bootstrapping heuristic - 1
Creates the initial (rough) morphological model: cuts words into morphemes (stem + affix) and builds the lists.
The most effective method is based on an early proposal of Harris (1955, 1967): successor frequency.
The idea: words should be cut at the point where the succeeding character is hardest to predict.
Goldsmith 2001/6 - Bootstrapping heuristic - 2
An example of successor frequency
– Consider the word: government
– Assume that empirically:
  {n} follows gover
  {e, i, m, o, s, #} follows govern
  {e} follows governm
– We get the successor frequencies: g o v e r [1] n [6] m [1] e n t
– The peak (6) after govern tells us to cut the word into govern + ment.
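A small Python sketch of the successor-frequency computation (illustrative only; the vocabulary below is invented so that the peak falls after govern):

def successor_frequencies(word, vocabulary):
    """For each prefix of word, count the distinct letters that can follow it in the vocabulary."""
    freqs = []
    for i in range(1, len(word)):
        prefix = word[:i]
        successors = {w[i] for w in vocabulary if len(w) > i and w.startswith(prefix)}
        freqs.append((prefix, len(successors)))
    return freqs

vocab = ["government", "governmental", "governed", "governing", "governs", "governor"]
for prefix, sf in successor_frequencies("government", vocab):
    print(prefix, sf)      # the peak after "govern" suggests the cut govern + ment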
Goldsmith 2001/6 - Bootstrapping heuristic - 3
Difficulties with successor frequency:
– Consider: c [9] o [18] n [11] s [6] e [4] r [1] v [2] a [1] t [1] i [2] v [1] e [1] s — three local peaks appear.
– This may lead to under-cutting or over-cutting. How should one decide?
Set constraints (higher precision, lower recall):
– Stem length must be at least 5.
– The number of stems in a signature must be at least 5.
– Absolute peaks only: the frequencies around the cut must form the pattern 1 N 1.
Goldsmith 2001/6 - Incremental Heuristics - 1
General idea:
– Try to reorganize the lists in the model.
– Evaluate the model length with and without the change and proceed accordingly.
Create signatures by a loose-fit strategy (see the sketch after this list):
– For every known suffix F and every word that can be split into S+F: collect all the suffixes seen with S and suggest a new signature.
The Check Signatures function:
– Move letters from stems to suffixes (slide the stem-suffix boundary left): examine each signature and suggest moving letters.
  Example 1: consider the i in -ion or -ive.
  Example 2: consider the words ending with -able and -ible.
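A rough sketch of the loose-fit step in Python (an illustration under simplifying assumptions, not the paper's implementation; NULL stands for the empty suffix):

from collections import defaultdict

words = {"act", "acted", "action", "acts", "acting"}
known_suffixes = {"NULL", "ed", "ion", "s", "ing"}

suffixes_of = defaultdict(set)                  # candidate stem -> suffixes it takes
for w in words:
    for f in known_suffixes:
        suffix = "" if f == "NULL" else f
        stem = w[:len(w) - len(suffix)] if suffix else w
        if stem and w == stem + suffix:
            suffixes_of[stem].add(f)

# propose a new signature for every stem that takes more than one suffix
candidates = {s: ".".join(sorted(fs)) for s, fs in suffixes_of.items() if len(fs) > 1}
print(candidates)                               # {'act': 'NULL.ed.ing.ion.s'}

Each proposed signature would then be kept only if it lowers the description length of the model.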
Goldsmith 2001/6 - Incremental Heuristics - 2
Extending to unanalyzed words
– Recall the conservatism of the bootstrapping successor-frequency peak constraint: 1 N 1.
– No account was given for words like derivation and derivative (they could not be analyzed as deriv-ation and deriv-ative).
– Heuristic: for every unanalyzed word, suggest a cut into a known stem+suffix (prefer the most common stem).
Slide the stem-suffix boundary right
– Consider signatures whose suffixes all share the same prefix, e.g. te.ting.ts.
– Suggest sliding the boundary right, thus creating e.ing.s.
Goldsmith 2001/6 – Description Length Evaluation 1
Evaluating the description length in MDL:
– Recall the formula:

  $DL = \mathrm{length}(M) + \log_2 \frac{1}{P(C \mid M)}$

Evaluating the model length:

  $\mathrm{length}(M) = \mathrm{length}(\text{stems}) + \mathrm{length}(\text{affixes}) + \mathrm{length}(\text{signatures})$
Goldsmith 2001/6 – Description Length Evaluation 2
Stems list structure: [diagram: the number of stems N, a pointer for each stem, and each stem's data (its spelling); i.e. bits for the item count, bits for the pointers, bits for the data]

The stems list (T) length:

  $\lambda(T) = \log_2 |T| + \sum_{t \in T} \Big( \log_2 26 \cdot \mathrm{length}(t) + \log_2 \frac{W}{[t \in W]} \Big)$

The affixes list (F) length:

  $\lambda(F) = \log_2 |F| + \sum_{f \in F} \Big( \log_2 26 \cdot \mathrm{length}(f) + \log_2 \frac{W}{[f \in W]} \Big)$

(W is the corpus token count and $[x \in W]$ is the number of tokens containing x.)
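The same formula written out as a small Python function (my own rendering of the expression above, with made-up token counts):

import math

def list_length(items, token_count, total_tokens):
    """Bits for the item count, plus letters and corpus pointer for each item."""
    bits = math.log2(len(items))                             # bits for the item count
    for item in items:
        bits += math.log2(26) * len(item)                    # bits for the letters
        bits += math.log2(total_tokens / token_count[item])  # bits for the pointer
    return bits

stem_counts = {"jump": 40, "walk": 25, "laugh": 10}          # invented token counts
print(round(list_length(list(stem_counts), stem_counts, total_tokens=1000), 1))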
Goldsmith 2001/6 – Description Length Evaluation 3
Evaluating the model length (cont'):
– The signatures list ($\Sigma$) length:

  bits for the signature count: $\log_2 |\Sigma|$
  bits for the signature pointers: $\sum_{\sigma \in \Sigma} \log_2 \frac{W}{[\sigma]}$
  for each signature $\sigma$, bits for its stem count and stem pointers: $\log_2 |\mathrm{Stems}(\sigma)| + \sum_{t \in \mathrm{Stems}(\sigma)} \log_2 \frac{W}{[t]}$
  plus bits for its affix count and affix pointers: $\log_2 |\mathrm{Affixes}(\sigma)| + \sum_{f \in \mathrm{Affixes}(\sigma)} \log_2 \frac{[\sigma]}{[f \in \sigma]}$
Goldsmith 2001/6 – Description Length Evaluation 4
Evaluating the probability of the corpus decomposition:

  $\log_2 \frac{1}{P(C \mid M)} = \sum_{w \in \text{Corpus}} \log_2 \frac{1}{P(w \mid M)}$

Evaluating the probability of the decomposition of each word w into stem t and affix f:

  $P(w \mid M) = P(w = t + f \mid M) = P(\mathrm{sig} \mid M) \cdot P(t \mid \mathrm{sig}, M) \cdot P(f \mid \mathrm{sig}, M)$

  $-\log_2 P(w \mid M) = -\log_2 P(\mathrm{sig} \mid M) - \log_2 P(t \mid \mathrm{sig}, M) - \log_2 P(f \mid \mathrm{sig}, M)$

  $= \log_2 \frac{W}{[\mathrm{sig}(w)]} + \log_2 \frac{[\mathrm{sig}(w)]}{[\mathrm{stem}(w) \in \mathrm{sig}(w)]} + \log_2 \frac{[\mathrm{sig}(w)]}{[\mathrm{affix}(w) \in \mathrm{sig}(w)]}$
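A numeric illustration in Python (all counts are invented) of the per-word cost that this decomposition yields:

import math

W = 1000            # total corpus token count
sig_tokens = 120    # tokens analyzed with this word's signature
stem_tokens = 30    # tokens of this stem within the signature
affix_tokens = 45   # tokens of this affix within the signature

bits = (math.log2(W / sig_tokens)               # -log2 P(sig | M)
        + math.log2(sig_tokens / stem_tokens)   # -log2 P(stem | sig, M)
        + math.log2(sig_tokens / affix_tokens)) # -log2 P(affix | sig, M)
print(round(bits, 2), "bits for this word token")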
Goldsmith 2001/6 – An Example of Loose-fit - 1
Assume that the corpus contains: {act, acted, action, acts, acting}.
The bootstrapping heuristic adds all the words to the stems list, and also adds {NULL, ed, ion, s, ing} to the affixes list.
Loose fit: consider adding the signature NULL.ed.ion.s.ing instead of the 4 separate instances.
– Evaluate the cost of each alternative.
– Make the change only if it turns out to be the cheaper alternative.
Goldsmith 2001/6 – An Example of Loose-fit - 2
Evaluation of the number of bits saved:
– 4 instances of "act": (4*3)*log2 26 ≈ 56 bits.
Cost evaluation - signature creation:
– Length of the stems count: log2 5 ≈ 2.3 bits.
– Length of the affixes count: log2 5 ≈ 2.3 bits.
– Length of a stem pointer: ~5 bits.
– Length of an affix pointer: ~5 bits; for 5 affixes: 25 bits.
– Length of the pointer to the signature: ~13 bits.
– In sum: 2.3 + 2.3 + 5 + 25 + 13 ≈ 47.6 bits.
MDL tells us to make the split and create the signature (47.6 bits < 56 bits).
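A quick Python check of this arithmetic (the pointer sizes of roughly 5 and 13 bits are taken from the slide as given):

import math

saved = 4 * 3 * math.log2(26)          # four redundant copies of the string "act"
cost = (math.log2(5)                   # stems count
        + math.log2(5)                 # affixes count
        + 5                            # one stem pointer
        + 5 * 5                        # five affix pointers
        + 13)                          # pointer to the signature
print(round(saved, 1), round(cost, 1)) # about 56.4 vs 47.6 bits: the signature pays off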
Goldsmith 2001 – Results - 1
Test corpus:
– English: the first part of the Brown corpus: 500,000 tokens, 30,000 distinct words.
– French: 350,000 words.
Baseline: a set of 1,000 consecutive words in the alphabetical word list of each corpus, analyzed manually.
Goldsmith 2001 – Results - 2
English results:
Category             Count   Distribution
Good                 829     82.9%
Wrong analysis       52      5.2%
Failed to analyze    36      3.6%
Spurious analysis    83      8.3%
Goldsmith 2001 – Results - 3
French results:
Category             Count   Distribution
Good                 833     83.3%
Wrong analysis       61      6.1%
Failed to analyze    42      4.2%
Spurious analysis    64      6.4%
Goldsmith 2006 - Results
Baseline: a gold standard of some 15,000 words, built by hand.
Accuracy measure: correct / incorrect (no precision and recall).
Test corpus: the first 200,000 and 300,000 words of the Brown corpus.
Results: accuracy of 72% over all words.
Goldsmith 2001/6 – Problems & Conclusions
The problems with the signature concept
– Oriented toward a stem + affix (prefix/suffix) cut.
– No support for alternating stems and affixes.
– As Creutz and Lagus point out: memory consumption.
– Word frequency is not taken into account.
Conclusions
– In general, not bad for a small corpus.
– Poorly suited for highly-inflecting or compounding languages, where words can consist of possibly lengthy sequences of morphemes with an alternation of stems and suffixes.
Unsupervised Models – Goldsmith & Hu 2004
The paper sketches a new algorithm that has not been evaluated but nonetheless suggests some nice ideas.
Motivation:
– Overcoming the problems of the signature concept.
– Taking advantage of the differences between inflectional and derivational morphology via Layered Morphology.
Approaches that differ from those discussed above:
– Bootstrapping heuristic: based on a trie with some constraints.
– Morphological model: an NFSA is used instead of signatures.
– Incremental heuristics: based on graph complexity.
MDL is used to evaluate the morphological model.
Goldsmith & Hu 2004 - Observations
Consider the following inflectional sub-system of French:

  Stem:     petit (small), grand (large)
  Gender:   Ø (masculine), e (feminine)
  Number:   Ø (singular), s (plural)

Assume that the words given in the corpus are: petit (masc. sg.), petite (fem. sg.), petits (masc. pl.), petites (fem. pl.).
This ends up building the signature NULL.e.s.es.
We would like to have NULL.e and NULL.s instead (signature-wise).
The same phenomenon shows up in other European languages.
Goldsmith & Hu 2004 - Motivation
Motivation for Layered Morphology :
– Natural-language inflectional morphology tends to consist of a sequence of largely non-overlapping realizations of morpho-syntactic information. This can be thought of as a set of convergent paths in a paradigm.
– Derivational morphology, by contrast, is divergent, in the sense that a derivational affix typically involves a morpheme whose function is to change a word from one category to another and shift it from one convergent path to another, in two paradigms.
Goldsmith & Hu 2004 – Morphological Model
List of morphemes
Non-deterministic FSA:
– States contain pointers to morphemes.
– Common arcs for the "default" case (inflectional morphology).
– Morpheme-specific arcs (derivational morphology).
Goldsmith & Hu 2004 – Internal Structure
The internal structure of a state in the NFSA: [diagram]
Goldsmith & Hu 2004 - Bootstrapping Heuristic - 1
Use successor frequency and a Patricia trie.
For example: {walk, walks, walked, walking, laugh, laughs, laughing, laughed}
[Trie diagram: the root branches into walk and laugh; each continues with Ø, ed, s, ing]
Goldsmith & Hu 2004 - Bootstrapping Heuristic - 2
Consider each node in the trie as a possible morpheme break, starting from the 5th character of the word.
Create a temporary data structure of signatures and assign a credibility to each:
– Signatures with short affixes are unreliable; they must be supported by at least 25 stems in order to be counted.
– Signatures whose affixes are longer than 1 character are reliable; they must be supported by at least 5 stems in order to be counted.
Build an FSA out of the counted signatures and collapse states from the end (starting from the accept state).
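A toy Python sketch of this idea (a plain prefix scan stands in for the Patricia trie, and the thresholds are relaxed so that the eight-word example above goes through; this is not the paper's implementation):

from collections import defaultdict

words = ["walk", "walks", "walked", "walking", "laugh", "laughs", "laughed", "laughing"]
min_stem_len = 4    # relaxed; the paper starts at the 5th character
min_stems = 2       # relaxed; the paper requires 5 or 25 stems depending on affix length

continuations = defaultdict(set)                     # candidate stem -> continuations
for w in words:
    for i in range(min_stem_len, len(w) + 1):
        stem = w[:i]
        if sum(1 for v in words if v.startswith(stem)) > 1:   # a branching trie node
            continuations[stem].add(w[i:] or "NULL")

signatures = defaultdict(set)                        # signature -> supporting stems
for stem, sufs in continuations.items():
    if len(sufs) > 1:
        signatures[".".join(sorted(sufs))].add(stem)

for sig, stems in signatures.items():
    if len(stems) >= min_stems:
        print(sig, "->", sorted(stems))              # NULL.ed.ing.s -> ['laugh', 'walk']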
Goldsmith & Hu 2004 - Bootstrapping Heuristic - 3
Following the English example, we get: [FSA diagram]
Following the French example, we get: [FSA diagram]
Goldsmith & Hu 2004 - Incremental Heuristic - 1
For each node in the FSA, perform the following:
– Remove the morpheme-specific arcs to the granddaughter node.
– Check which strings now fail to be generated.
– Try to generate these strings by adding one or more morphemes to the state. This is done under the assumption that the state becomes a convergent state.
This will typically decrease the description length of the grammar, since we are certain to reduce the number of pointers in the grammar in so doing.
Goldsmith & Hu 2004 - Incremental Heuristic - 2
Recall the FSA before applying the heuristic: [diagram]
The incremental heuristic gives the desired result: [diagram]
Goldsmith & Hu 2004 - Description Length Evaluation -1
Evaluating the description length in MDL:
– Recall:

  $DL = \mathrm{length}(M) + \log_2 \frac{1}{P(C \mid M)}$

Evaluating the model length:

  $\mathrm{length}(M) = \mathrm{length}(\text{MorphemesList}) + \mathrm{length}(\text{FSA})$

– The length of the morphemes list:

  $\mathrm{length}(\text{MorphemesList}) = \log_2 |M| + \sum_{m \in M} \log_2 26 \cdot \mathrm{length}(m)$

– The length of a state $s_j$:

  $\mathrm{state\_len}(s_j) = \sum_{m \in \mathrm{Morphemes}(s_j)} \mathrm{mor\_ptr\_len}(m) + \mathrm{arc\_len}(\text{common arc}) + \sum_{a \in \mathrm{Arcs}(s_j)} \mathrm{arc\_len}(a)$

– The length of the FSA, in sum:

  $\mathrm{length}(\text{FSA}) = \sum_{s \in \text{States}} \mathrm{state\_len}(s)$
Goldsmith & Hu 2004 - Description Length Evaluation -2
Evaluating the model length:
– The length of the FSA:

  The number of word paths through an arc $a_{i,j}$: $\#(a_{i,j})$
  The number of word paths through a state $s_j$: $\#(s_j)$
  The number of word paths through all arcs: $\#(\text{arcs}) = \sum_{s_i \in \text{States}} \#(s_i)$
  The length of an arc $a_{i,j}$: $\mathrm{arc\_len}(a_{i,j}) = \log_2 \frac{\#(\text{arcs})}{\#(a_{i,j})}$
  The number of word paths through a morpheme $m_j$: $\#(m_j) = \sum_{s_i \in \text{States}} \#(s_i) \cdot [\, m_j \in \mathrm{Morphemes}(s_i) \,]$
  The length of a pointer to a morpheme $m_j$: $\mathrm{mor\_ptr\_len}(m_j) = \log_2 \frac{\sum_{m_i \in \mathrm{Morphemes}(\text{FSA})} \#(m_i)}{\#(m_j)}$
Goldsmith & Hu 2004 - Description Length Evaluation -3
Evaluating the description length in MDL (cont'):
– Recall:

  $DL = \mathrm{length}(M) + \log_2 \frac{1}{P(C \mid M)}$

Evaluating the corpus probability:
– Evaluate the probability of every word w.
– P(w) is obtained by multiplying the probabilities of the arcs on w's path:

  $P(w) = \prod_{a_{i,j} \in \mathrm{Path}(w)} P(a_{i,j})$

– The probability of an arc $a_{i,j}$ is:

  $P(a_{i,j}) = \frac{\#(a_{i,j})}{\sum_k \#(a_{i,k})}$
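An illustrative Python computation (the arc counts and state names are invented, loosely following the petit/grand example):

import math

arc_counts = {                                    # word-path counts per arc
    ("start", "petit"): 40, ("start", "grand"): 60,
    ("stem", "e"): 55, ("stem", "NULL_gender"): 45,
    ("gender", "s"): 30, ("gender", "NULL_number"): 70,
}

def arc_prob(state, label):
    """P(arc_ij) = #(arc_ij) / sum over arcs leaving the same state."""
    total = sum(c for (s, _), c in arc_counts.items() if s == state)
    return arc_counts[(state, label)] / total

path = [("start", "petit"), ("stem", "e"), ("gender", "s")]   # petit + e + s = "petites"
p = 1.0
for state, label in path:
    p *= arc_prob(state, label)
print(p, round(-math.log2(p), 2))                 # probability and its cost in bits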
References - 1
CHEN, S. F. 1996. Building Probabilistic Models for Natural Language. PhD thesis, Harvard University.
CREUTZ, M. AND LAGUS, K. 2002. Unsupervised discovery of morphemes. In Proceedings of the Workshop on Morphological and Phonological Learning of ACL. Philadelphia, PA. 21–30.
CREUTZ, M. 2003. Unsupervised segmentation of words using prior distributions of morph length and frequency. In Proceedings of the Association for Computational Linguistics (ACL'03). Sapporo, Japan. 280–287.
CREUTZ, M. AND LAGUS, K. 2004. Induction of a simple morphology for highly-inflecting languages. In Proceedings of the 7th Meeting of the ACL Special Interest Group in Computational Phonology (SIGPHON). Barcelona, Spain. 43–51.
CREUTZ, M. AND LAGUS, K. 2005a. Inducing the morphological lexicon of a natural language from unannotated text. In Proceedings of the International and Interdisciplinary Conference on Adaptive Knowledge Representation and Reasoning (AKRR’05). Espoo, Finland. 106–113.
CREUTZ, M. AND LAGUS, K. 2007. Unsupervised Models for Morpheme Segmentation and Morphology Learning. ACM Transactions on Speech and Language Processing, Volume 4, Issue 1, January 2007.
FALK, Y. 2007 - Diagrams on slides 4-6: http://pluto.huji.ac.il/~msyfalk/Morph/SyntaxMorphology.pdf, http://pluto.huji.ac.il/~msyfalk/WordStructure/Morphology.pdf.
References - 2
GOLDSMITH, J. 2001. Unsupervised learning of the morphology of a natural language. Computational Linguistics 27, 2, 153–198.
GOLDSMITH, J. 2006. An algorithm for the unsupervised learning of morphology. Tech. rep. TR-2005-06, Department of Computer Science, University of Chicago. http://humfs1.uchicago.edu/∼jagoldsm/Papers/Algorithm.pdf.
GOLDSMITH, J. AND HU, Y. 2004. From signatures to finite state automata. Midwest Computational Linguistics Colloquium. Bloomington, IN.
HARRIS, Z. S. 1955. From phoneme to morpheme. Language 31, 2, 190–222. (Reprinted 1970 in Papers in Structural and Transformational Linguistics, Reidel Publishing Company, Dordrecht, Holland.)
HARRIS, Z. S. 1967. Morpheme boundaries within words: Report on a computer test. Transformations and Discourse Analysis Papers 73. (Reprinted 1970 in Papers in Structural and Transformational Linguistics, Reidel Publishing Company, Dordrecht, Holland.)
RISSANEN, J. 1989. Stochastic Complexity in Statistical Inquiry. Vol. 15. World Scientific Series in Computer Science, Singapore.
SAFFRAN, J. R., NEWPORT, E. L., AND ASLIN, R. N. 1996. Word segmentation: The role of distributional cues. Journal of Memory and Language 35, 606–621.
WINTNER, S. – Some presentation ideas and an example for MDL on slides 21–22 and 25–26: http://cs.haifa.ac.il/~shuly/teaching/03/lab/ofer-yaniv.ppt