Automatic Learning of a Morphological Model - Theory and Unsupervised Approaches
Unsupervised Learning - ToC
Foundations
Problem description
General architecture
Papers:
– Goldsmith 2001 and 2006
– Goldsmith and Hu 2004
Unsupervised Models - Foundations
Saffran et al. 1996:
Adults are capable of rapidly discovering word units in a stream of a nonsense language, without any connection to meaning.
Creutz and Lagus 2007:
This suggests that humans do use distributional cues, such as transition probabilities between sounds, in language learning. These kinds of statistical patterns in language data can be successfully exploited by appropriately designed algorithms.
Unsupervised Models - Problem desc.
Input:
– Untagged text in orthographic form; words are separated.
– No syntactic, semantic, or phonological information is given.
Output:
– A lexicon of morphemes (stems and affixes). Frequent unsegmented words might also be included (a mixed lexicon, see Creutz and Lagus).
– An analysis of each word in the corpus by segmentation into morphemes.
Results evaluation
– The analysis should be as close as possible to the gold standard, obtained by manual segmentation.
– The data should be represented efficiently.
Unsupervised Models – Components
Bootstrapping heuristic - creates the initial morphological model.
Incremental heuristics - create an improved morphological model based on the existing one.
Evaluation model - compares morphological models and tells whether a significant improvement was achieved.
Unsupervised Models – Flow diag.
[Flow diagram] Corpus → Bootstrapping Heuristic → Model of Morphology → Incremental Heuristics → New Model of Morphology → Evaluation Model (MDL, MAP): if the new model is (much) better, replace the old model and evaluate again; if there is not much improvement, stop.
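A minimal Python sketch of this loop; the function names (bootstrap, propose_refinements, description_length) and the min_gain threshold are hypothetical placeholders, not taken from any of the papers:

def learn_morphology(corpus, bootstrap, propose_refinements, description_length, min_gain=1.0):
    """Hypothetical skeleton of the bootstrap / refine / evaluate loop sketched above."""
    model = bootstrap(corpus)                       # bootstrapping heuristic
    best_cost = description_length(model, corpus)   # evaluation model (e.g. MDL)
    improved = True
    while improved:                                 # keep refining while it pays off
        improved = False
        for candidate in propose_refinements(model, corpus):   # incremental heuristics
            cost = description_length(candidate, corpus)
            if best_cost - cost > min_gain:         # new model is (much) better
                model, best_cost = candidate, cost  # replace the old model
                improved = True
                break
    return model                                    # not much improvement: stop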
Unsupervised Models – Preface
Goldsmith 2001/6
– Recursive segmentation into stem+suffix / prefix+stem.
– Evaluation in terms of Minimum Description Length.
Goldsmith and Hu 2004
– An NFSA holds layered morpheme compositions.
– Evaluation in terms of Minimum Description Length.
Creutz and Lagus 2005/7
– An HMM of categories emitting morphemes.
– Evaluation in terms of Maximum A Posteriori.
Unsupervised Models - Methodologies - 1
Tradeoff between restrictiveness and flexibility
– A too restrictive model may exclude all optimal and near-optimal models, making it impossible to learn a good model, regardless of how much data and computation time is spent.
– A too flexible model is very hard to learn, as it requires impractical amounts of data and computation.
Unsupervised Models - Methodologies - 2
Minimum Description Length (Rissanen 1989)
– Main ideas: every regularity in data may be used to compress that data; learning can be equated with finding regularities in data.
– Formalization of Occam's Razor: the best hypothesis for a given set of data is the one that leads to the largest compression of the data.
– Choose the best model by simultaneously considering model accuracy and model complexity.
– Simpler models are favored over complex ones. This generally improves generalization capacity by inhibiting overlearning.
Unsupervised Models - Methodologies - 3
MDL formulation, used in Goldsmith's papers:

  $\hat{M} = \operatorname*{argmin}_{\text{Model } M} \mathrm{DescriptionLength}(C, M) = \operatorname*{argmin}_{\text{Model } M} \Big[ \mathrm{length}(M) + \log_2 \frac{1}{P(C \mid M)} \Big]$

Alternative formulation, used in the papers by Creutz and Lagus: Maximum A Posteriori (MAP):

  $\widehat{\text{Lexicon}} = \operatorname*{argmax}_{\text{Lexicon}} P(\text{Lexicon} \mid \text{Corpus})$

Chen 1996: the two approaches are equivalent with respect to the task discussed here.
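A toy numeric illustration in Python; the bit counts below are invented, only to show how the MDL criterion compares two candidate models:

models = {
    "no segmentation":   {"model_bits": 500.0, "log2_p_corpus": -1400.0},
    "stem+suffix model": {"model_bits": 800.0, "log2_p_corpus": -900.0},
}

def description_length(m):
    # DL = length(M) + log2(1 / P(C|M)) = length(M) - log2 P(C|M)
    return m["model_bits"] - m["log2_p_corpus"]

best = min(models, key=lambda name: description_length(models[name]))
print(best, description_length(models[best]))   # the stem+suffix model wins: 1700 < 1900 bits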
Unsupervised Models – Goldsmith 2001/6
The signature concept
– The list of affixes appearing with a stem, e.g. jump: NULL.ed.ing.s
– Each stem has a unique signature.
The structure of the morphological model:
– List of stems, e.g. {cat, jump, laugh, hat, walk, sav}
– List of affixes, e.g. {NULL, ed, ing, s, e, es}
– List of signatures and their associated stems; e.g. the signature NULL.ed.ing.s is stored as the affix pointers ptr(NULL) ptr(ed) ptr(ing) ptr(s) together with the stem pointers ptr(jump) ptr(walk) ptr(laugh).
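As a small illustration (a sketch of the idea, not Goldsmith's code), this is how stems can be grouped into signatures once each word has been split into stem + suffix:

from collections import defaultdict

analyses = [("jump", ""), ("jump", "ed"), ("jump", "ing"), ("jump", "s"),
            ("walk", ""), ("walk", "ed"), ("walk", "ing"), ("walk", "s"),
            ("cat", ""), ("cat", "s")]

suffixes_of = defaultdict(set)          # stem -> set of suffixes seen with it
for stem, suffix in analyses:
    suffixes_of[stem].add(suffix if suffix else "NULL")

signatures = defaultdict(set)           # signature string -> set of stems
for stem, sufs in suffixes_of.items():
    signatures[".".join(sorted(sufs))].add(stem)

for sig, stems in signatures.items():
    print(sig, "->", sorted(stems))     # NULL.ed.ing.s -> ['jump', 'walk'], NULL.s -> ['cat']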
Goldsmith 2001/6 - Samples
Some signatures from The Adventures of Tom Sawyer :
Signature        # Tokens   # Stems   Sample of tokens
NULL.ed.ing      816        69        betray betrayed betraying
NULL.ed.ing.s    516        14        remain remained remaining remains
NULL.s           3414       253       cow cows
e.ed.es.ing      62         4         notice noticed notices noticing
Goldsmith 2001/6 - Bootstrapping heuristic - 1
Creates the initial (rough) morphological model: cuts words into morphemes (stem + affix) and builds the lists.
The most effective method is based on an early proposal of Harris (1955, 1967): successor frequency.
The idea: words should be cut at the point where the succeeding character is hardest to predict.
Goldsmith 2001/6 - Bootstrapping heuristic - 2
An example of successor frequency
– Consider the word: government
– Assume that empirically:
  {n} follows gover
  {e, i, m, o, s, #} follows govern
  {e} follows governm
– We get the successor frequencies: g o v e r [1] n [6] m [1] e n t
– The peak (6) after govern tells us to cut the word into govern + ment.
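A small Python sketch of the successor-frequency computation (illustrative only; the vocabulary below is invented so that the peak falls after govern):

def successor_frequencies(word, vocabulary):
    """For each prefix of word, count the distinct letters that can follow it in the vocabulary."""
    freqs = []
    for i in range(1, len(word)):
        prefix = word[:i]
        successors = {w[i] for w in vocabulary if len(w) > i and w.startswith(prefix)}
        freqs.append((prefix, len(successors)))
    return freqs

vocab = ["government", "governmental", "governed", "governing", "governs", "governor"]
for prefix, sf in successor_frequencies("government", vocab):
    print(prefix, sf)      # the peak after "govern" suggests the cut govern + ment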
Goldsmith 2001/6 - Bootstrapping heuristic - 3
Difficulties with successor frequency:
– Consider: c [9] o [18] n [11] s [6] e [4] r [1] v [2] a [1] t [1] i [2] v [1] e [1] s — three local peaks appear.
– This may lead to under-cutting or over-cutting. How should one decide?
Set constraints (higher precision, lower recall):
– Stem length must be at least 5.
– The number of stems in a signature must be at least 5.
– Absolute peaks only: the frequencies around the cut must form the pattern 1 N 1.
Goldsmith 2001/6 - Incremental Heuristics - 1
General idea:
– Try to reorganize the lists in the model.
– Evaluate the model length with and without the change and proceed accordingly.
Create signatures by a loose-fit strategy (see the sketch after this list):
– For every known suffix F and every word that can be split into S+F: collect all the suffixes seen with S and suggest a new signature.
The Check Signatures function:
– Move letters from stems to suffixes (slide the stem-suffix boundary left): examine each signature and suggest moving letters.
  Example 1: consider the i in -ion or -ive.
  Example 2: consider the words ending with -able and -ible.
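A rough sketch of the loose-fit step in Python (an illustration under simplifying assumptions, not the paper's implementation; NULL stands for the empty suffix):

from collections import defaultdict

words = {"act", "acted", "action", "acts", "acting"}
known_suffixes = {"NULL", "ed", "ion", "s", "ing"}

suffixes_of = defaultdict(set)                  # candidate stem -> suffixes it takes
for w in words:
    for f in known_suffixes:
        suffix = "" if f == "NULL" else f
        stem = w[:len(w) - len(suffix)] if suffix else w
        if stem and w == stem + suffix:
            suffixes_of[stem].add(f)

# propose a new signature for every stem that takes more than one suffix
candidates = {s: ".".join(sorted(fs)) for s, fs in suffixes_of.items() if len(fs) > 1}
print(candidates)                               # {'act': 'NULL.ed.ing.ion.s'}

Each proposed signature would then be kept only if it lowers the description length of the model.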
Goldsmith 2001/6 - Incremental Heuristics - 2
Extending to unanalyzed words
– Recall the conservatism of the bootstrapping successor-frequency peak constraint: 1 N 1.
– No account was given for words like derivation and derivative (they could not be analyzed as deriv-ation and deriv-ative).
– Heuristic: for every unanalyzed word, suggest a cut into a known stem+suffix (prefer the most common stem).
Slide the stem-suffix boundary right
– Consider signatures whose suffixes all share the same prefix, e.g. te.ting.ts.
– Suggest sliding the boundary right, thus creating e.ing.s.
Goldsmith 2001/6 – Description Length Evaluation 1
Evaluating the description length in MDL:
– Recall the formula:

  $DL = \mathrm{length}(M) + \log_2 \frac{1}{P(C \mid M)}$

Evaluating the model length:

  $\mathrm{length}(M) = \mathrm{length}(\text{stems}) + \mathrm{length}(\text{affixes}) + \mathrm{length}(\text{signatures})$
Goldsmith 2001/6 – Description Length Evaluation 2
Stems list structure: [diagram: the number of stems N, a pointer for each stem, and each stem's data (its spelling); i.e. bits for the item count, bits for the pointers, bits for the data]

The stems list (T) length:

  $\lambda(T) = \log_2 |T| + \sum_{t \in T} \Big( \log_2 26 \cdot \mathrm{length}(t) + \log_2 \frac{W}{[t \in W]} \Big)$

The affixes list (F) length:

  $\lambda(F) = \log_2 |F| + \sum_{f \in F} \Big( \log_2 26 \cdot \mathrm{length}(f) + \log_2 \frac{W}{[f \in W]} \Big)$

(W is the corpus token count and $[x \in W]$ is the number of tokens containing x.)
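The same formula written out as a small Python function (my own rendering of the expression above, with made-up token counts):

import math

def list_length(items, token_count, total_tokens):
    """Bits for the item count, plus letters and corpus pointer for each item."""
    bits = math.log2(len(items))                             # bits for the item count
    for item in items:
        bits += math.log2(26) * len(item)                    # bits for the letters
        bits += math.log2(total_tokens / token_count[item])  # bits for the pointer
    return bits

stem_counts = {"jump": 40, "walk": 25, "laugh": 10}          # invented token counts
print(round(list_length(list(stem_counts), stem_counts, total_tokens=1000), 1))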
Goldsmith 2001/6 – Description Length Evaluation 3
Evaluating the model length (cont'):
– The signatures list ($\Sigma$) length:

  bits for the signature count: $\log_2 |\Sigma|$
  bits for the signature pointers: $\sum_{\sigma \in \Sigma} \log_2 \frac{W}{[\sigma]}$
  for each signature $\sigma$, bits for its stem count and stem pointers: $\log_2 |\mathrm{Stems}(\sigma)| + \sum_{t \in \mathrm{Stems}(\sigma)} \log_2 \frac{W}{[t]}$
  plus bits for its affix count and affix pointers: $\log_2 |\mathrm{Affixes}(\sigma)| + \sum_{f \in \mathrm{Affixes}(\sigma)} \log_2 \frac{[\sigma]}{[f \in \sigma]}$
Goldsmith 2001/6 – Description Length Evaluation 4
Evaluating the probability of the corpus decomposition:

  $\log_2 \frac{1}{P(C \mid M)} = \sum_{w \in \text{Corpus}} \log_2 \frac{1}{P(w \mid M)}$

Evaluating the probability of the decomposition of each word w into stem t and affix f:

  $P(w \mid M) = P(w = t + f \mid M) = P(\mathrm{sig} \mid M) \cdot P(t \mid \mathrm{sig}, M) \cdot P(f \mid \mathrm{sig}, M)$

  $-\log_2 P(w \mid M) = -\log_2 P(\mathrm{sig} \mid M) - \log_2 P(t \mid \mathrm{sig}, M) - \log_2 P(f \mid \mathrm{sig}, M)$

  $= \log_2 \frac{W}{[\mathrm{sig}(w)]} + \log_2 \frac{[\mathrm{sig}(w)]}{[\mathrm{stem}(w) \in \mathrm{sig}(w)]} + \log_2 \frac{[\mathrm{sig}(w)]}{[\mathrm{affix}(w) \in \mathrm{sig}(w)]}$
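A numeric illustration in Python (all counts are invented) of the per-word cost that this decomposition yields:

import math

W = 1000            # total corpus token count
sig_tokens = 120    # tokens analyzed with this word's signature
stem_tokens = 30    # tokens of this stem within the signature
affix_tokens = 45   # tokens of this affix within the signature

bits = (math.log2(W / sig_tokens)               # -log2 P(sig | M)
        + math.log2(sig_tokens / stem_tokens)   # -log2 P(stem | sig, M)
        + math.log2(sig_tokens / affix_tokens)) # -log2 P(affix | sig, M)
print(round(bits, 2), "bits for this word token")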
Goldsmith 2001/6 – An Example of Loose-fit - 1
Assume that the corpus contains: {act, acted, action, acts, acting}.
The bootstrapping heuristic adds all the words to the stems list, and also adds {NULL, ed, ion, s, ing} to the affixes list.
Loose fit: consider adding the signature NULL.ed.ion.s.ing instead of the 4 separate instances.
– Evaluate the cost of each alternative.
– Make the change only if it turns out to be the cheaper alternative.
Goldsmith 2001/6 – An Example of Loose-fit - 2
Evaluation of the number of bits saved:
– 4 instances of "act": (4*3)*log2 26 ≈ 56 bits.
Cost evaluation - signature creation:
– Length of the stems count: log2 5 ≈ 2.3 bits.
– Length of the affixes count: log2 5 ≈ 2.3 bits.
– Length of a stem pointer: ~5 bits.
– Length of an affix pointer: ~5 bits; for 5 affixes: 25 bits.
– Length of the pointer to the signature: ~13 bits.
– In sum: 2.3 + 2.3 + 5 + 25 + 13 ≈ 47.6 bits.
MDL tells us to make the split and create the signature (47.6 bits < 56 bits).
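A quick Python check of this arithmetic (the pointer sizes of roughly 5 and 13 bits are taken from the slide as given):

import math

saved = 4 * 3 * math.log2(26)          # four redundant copies of the string "act"
cost = (math.log2(5)                   # stems count
        + math.log2(5)                 # affixes count
        + 5                            # one stem pointer
        + 5 * 5                        # five affix pointers
        + 13)                          # pointer to the signature
print(round(saved, 1), round(cost, 1)) # about 56.4 vs 47.6 bits: the signature pays off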
Goldsmith 2001 – Results - 1
Test corpus:
– English: the first part of the Brown corpus: 500,000 tokens, 30,000 distinct words.
– French: 350,000 words.
Baseline: a set of 1,000 consecutive words in the alphabetical word list of each corpus, analyzed manually.
Goldsmith 2001 – Results - 2
English results:
Category             Count   Distribution
Good                 829     82.9%
Wrong analysis       52      5.2%
Failed to analyze    36      3.6%
Spurious analysis    83      8.3%
Goldsmith 2001 – Results - 3
French results:
Category             Count   Distribution
Good                 833     83.3%
Wrong analysis       61      6.1%
Failed to analyze    42      4.2%
Spurious analysis    64      6.4%
Goldsmith 2006 - Results
Baseline: a gold standard of some 15,000 words, built by hand.
Accuracy measure: correct / incorrect (no precision and recall).
Test corpus: the first 200,000 and 300,000 words of the Brown corpus.
Results: accuracy of 72% over all words.
Goldsmith 2001/6 – Problems & Conclusions
The problems with the signature concept
– Oriented toward a stem + affix (prefix/suffix) cut.
– No support for alternating stems and affixes.
– As Creutz and Lagus point out: memory consumption.
– Word frequency is not taken into account.
Conclusions
– In general, not bad for a small corpus.
– Poorly suited for highly-inflecting or compounding languages, where words can consist of possibly lengthy sequences of morphemes with an alternation of stems and suffixes.
Unsupervised Models – Goldsmith & Hu 2004
The paper sketches a new algorithm that has not been evaluated but nonetheless suggests some nice ideas.
Motivation:
– Overcoming the problems of the signature concept.
– Taking advantage of the differences between inflectional and derivational morphology via Layered Morphology.
Approaches that differ from those discussed above:
– Bootstrapping heuristic: based on a trie with some constraints.
– Morphological model: an NFSA is used instead of signatures.
– Incremental heuristics: based on graph complexity.
MDL is used to evaluate the morphological model.
Goldsmith & Hu 2004 - Observations
Consider the following inflectional sub-system of French:

  Stem:     petit (small), grand (large)
  Gender:   Ø (masculine), e (feminine)
  Number:   Ø (singular), s (plural)

Assume that the words given in the corpus are: petit (masc. sg.), petite (fem. sg.), petits (masc. pl.), petites (fem. pl.).
This ends up building the signature NULL.e.s.es.
We would like to have NULL.e and NULL.s instead (signature-wise).
The same phenomenon shows up in other European languages.
Goldsmith & Hu 2004 - Motivation
Motivation for Layered Morphology :
– Natural-language inflectional morphology tends to consist of a sequence of largely non-overlapping realizations of morpho-syntactic information. This can be thought of as a set of convergent paths in a paradigm.
– Derivational morphology, by contrast, is divergent, in the sense that a derivational affix typically involves a morpheme whose function is to change a word from one category to another and shift it from one convergent path to another, in two paradigms.
Goldsmith & Hu 2004 – Morphological Model
List of morphemes
Non-deterministic FSA:
– States contain pointers to morphemes.
– Common arcs for the "default" case (inflectional morphology).
– Morpheme-specific arcs (derivational morphology).
Goldsmith & Hu 2004 – Internal Structure
The internal structure of a state in the NFSA: [diagram]
Goldsmith & Hu 2004 - Bootstrapping Heuristic - 1
Use successor frequency and a Patricia trie.
For example: {walk, walks, walked, walking, laugh, laughs, laughing, laughed}
[Trie diagram: the root branches into walk and laugh; each continues with Ø, ed, s, ing]
Goldsmith & Hu 2004 - Bootstrapping Heuristic - 2
Consider each node in the trie as a possible morpheme break, starting from the 5th character of the word.
Create a temporary data structure of signatures and assign a credibility to each:
– Signatures with short affixes are unreliable; they must be supported by at least 25 stems in order to be counted.
– Signatures whose affixes are longer than 1 character are reliable; they must be supported by at least 5 stems in order to be counted.
Build an FSA out of the counted signatures and collapse states from the end (starting from the accept state).
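A toy Python sketch of this idea (a plain prefix scan stands in for the Patricia trie, and the thresholds are relaxed so that the eight-word example above goes through; this is not the paper's implementation):

from collections import defaultdict

words = ["walk", "walks", "walked", "walking", "laugh", "laughs", "laughed", "laughing"]
min_stem_len = 4    # relaxed; the paper starts at the 5th character
min_stems = 2       # relaxed; the paper requires 5 or 25 stems depending on affix length

continuations = defaultdict(set)                     # candidate stem -> continuations
for w in words:
    for i in range(min_stem_len, len(w) + 1):
        stem = w[:i]
        if sum(1 for v in words if v.startswith(stem)) > 1:   # a branching trie node
            continuations[stem].add(w[i:] or "NULL")

signatures = defaultdict(set)                        # signature -> supporting stems
for stem, sufs in continuations.items():
    if len(sufs) > 1:
        signatures[".".join(sorted(sufs))].add(stem)

for sig, stems in signatures.items():
    if len(stems) >= min_stems:
        print(sig, "->", sorted(stems))              # NULL.ed.ing.s -> ['laugh', 'walk']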
Goldsmith & Hu 2004 - Bootstrapping Heuristic - 3
Following the English example, we get: [FSA diagram]
Following the French example, we get: [FSA diagram]
Goldsmith & Hu 2004 - Incremental Heuristic - 1
For each node in the FSA, perform the following:
– Remove the morpheme-specific arcs to the granddaughter node.
– Check which strings now fail to be generated.
– Try to generate these strings by adding one or more morphemes to the state. This is done under the assumption that the state becomes a convergent state.
This will typically decrease the description length of the grammar, since we are certain to reduce the number of pointers in the grammar in so doing.
Goldsmith & Hu 2004 - Incremental Heuristic - 2
Recall the FSA before applying the heuristic: [diagram]
The incremental heuristic gives the desired result: [diagram]
Goldsmith & Hu 2004 - Description Length Evaluation -1
Evaluating the description length in MDL:
– Recall:

  $DL = \mathrm{length}(M) + \log_2 \frac{1}{P(C \mid M)}$

Evaluating the model length:

  $\mathrm{length}(M) = \mathrm{length}(\text{MorphemesList}) + \mathrm{length}(\text{FSA})$

– The length of the morphemes list:

  $\mathrm{length}(\text{MorphemesList}) = \log_2 |M| + \sum_{m \in M} \log_2 26 \cdot \mathrm{length}(m)$

– The length of a state $s_j$:

  $\mathrm{state\_len}(s_j) = \sum_{m \in \mathrm{Morphemes}(s_j)} \mathrm{mor\_ptr\_len}(m) + \mathrm{arc\_len}(\text{common arc}) + \sum_{a \in \mathrm{Arcs}(s_j)} \mathrm{arc\_len}(a)$

– The length of the FSA, in sum:

  $\mathrm{length}(\text{FSA}) = \sum_{s \in \text{States}} \mathrm{state\_len}(s)$
Goldsmith & Hu 2004 - Description Length Evaluation -2
Evaluating the model length:
– The length of the FSA:

  The number of word paths through an arc $a_{i,j}$: $\#(a_{i,j})$
  The number of word paths through a state $s_j$: $\#(s_j)$
  The number of word paths through all arcs: $\#(\text{arcs}) = \sum_{s_i \in \text{States}} \#(s_i)$
  The length of an arc $a_{i,j}$: $\mathrm{arc\_len}(a_{i,j}) = \log_2 \frac{\#(\text{arcs})}{\#(a_{i,j})}$
  The number of word paths through a morpheme $m_j$: $\#(m_j) = \sum_{s_i \in \text{States}} \#(s_i) \cdot [\, m_j \in \mathrm{Morphemes}(s_i) \,]$
  The length of a pointer to a morpheme $m_j$: $\mathrm{mor\_ptr\_len}(m_j) = \log_2 \frac{\sum_{m_i \in \mathrm{Morphemes}(\text{FSA})} \#(m_i)}{\#(m_j)}$
Goldsmith & Hu 2004 - Description Length Evaluation -3
Evaluating the description length in MDL (cont'):
– Recall:

  $DL = \mathrm{length}(M) + \log_2 \frac{1}{P(C \mid M)}$

Evaluating the corpus probability:
– Evaluate the probability of every word w.
– P(w) is obtained by multiplying the probabilities of the arcs on w's path:

  $P(w) = \prod_{a_{i,j} \in \mathrm{Path}(w)} P(a_{i,j})$

– The probability of an arc $a_{i,j}$ is:

  $P(a_{i,j}) = \frac{\#(a_{i,j})}{\sum_k \#(a_{i,k})}$
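An illustrative Python computation (the arc counts and state names are invented, loosely following the petit/grand example):

import math

arc_counts = {                                    # word-path counts per arc
    ("start", "petit"): 40, ("start", "grand"): 60,
    ("stem", "e"): 55, ("stem", "NULL_gender"): 45,
    ("gender", "s"): 30, ("gender", "NULL_number"): 70,
}

def arc_prob(state, label):
    """P(arc_ij) = #(arc_ij) / sum over arcs leaving the same state."""
    total = sum(c for (s, _), c in arc_counts.items() if s == state)
    return arc_counts[(state, label)] / total

path = [("start", "petit"), ("stem", "e"), ("gender", "s")]   # petit + e + s = "petites"
p = 1.0
for state, label in path:
    p *= arc_prob(state, label)
print(p, round(-math.log2(p), 2))                 # probability and its cost in bits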
References - 1
CHEN, S. F. 1996. Building Probabilistic Models for Natural Language. PhD thesis, Harvard University.
CREUTZ, M. AND LAGUS, K. 2002. Unsupervised discovery of morphemes. In Proceedings of the Workshop on Morphological and Phonological Learning of ACL. Philadelphia, PA. 21–30.
CREUTZ, M. 2003. Unsupervised segmentation of words using prior distributions of morph length and frequency. In Proceedings of the Association for Computational Linguistics (ACL'03). Sapporo, Japan. 280–287.
CREUTZ, M. AND LAGUS, K. 2004. Induction of a simple morphology for highly-inflecting languages. In Proceedings of the 7th Meeting of the ACL Special Interest Group in Computational Phonology (SIGPHON). Barcelona, Spain. 43–51.
CREUTZ, M. AND LAGUS, K. 2005a. Inducing the morphological lexicon of a natural language from unannotated text. In Proceedings of the International and Interdisciplinary Conference on Adaptive Knowledge Representation and Reasoning (AKRR’05). Espoo, Finland. 106–113.
CREUTZ, M. AND LAGUS, K. 2007. Unsupervised Models for Morpheme Segmentation and Morphology Learning. ACM Transactions on Speech and Language Processing, Volume 4, Issue 1, January 2007.
FALK, Y. 2007 - Diagrams on slides 4-6: http://pluto.huji.ac.il/~msyfalk/Morph/SyntaxMorphology.pdf, http://pluto.huji.ac.il/~msyfalk/WordStructure/Morphology.pdf.
References - 2
GOLDSMITH, J. 2001. Unsupervised learning of the morphology of a natural language. Computational Linguistics 27, 2, 153–198.
GOLDSMITH, J. 2006. An algorithm for the unsupervised learning of morphology. Tech. rep. TR-2005-06, Department of Computer Science, University of Chicago. http://humfs1.uchicago.edu/∼jagoldsm/Papers/Algorithm.pdf.
GOLDSMITH, J. AND HU, Y. 2004. From signatures to finite state automata. Midwest Computational Linguistics Colloquium. Bloomington, IN.
HARRIS, Z. S. 1955. From phoneme to morpheme. Language 31, 2, 190–222. (Reprinted 1970 in Papers in Structural and Transformational Linguistics, Reidel Publishing Company, Dordrecht, Holland.)
HARRIS, Z. S. 1967. Morpheme boundaries within words: Report on a computer test. Transformations and Discourse Analysis Papers 73. (Reprinted 1970 in Papers in Structural and Transformational Linguistics, Reidel Publishing Company, Dordrecht, Holland.)
RISSANEN, J. 1989. Stochastic Complexity in Statistical Inquiry. Vol. 15. World Scientific Series in Computer Science, Singapore.
SAFFRAN, J. R., NEWPORT, E. L., AND ASLIN, R. N. 1996. Word segmentation: The role of distributional cues. Journal of Memory and Language 35, 606–621.
WINTNER, S. – Some presentation ideas and an example for MDL on slides 21–22 and 25–26: http://cs.haifa.ac.il/~shuly/teaching/03/lab/ofer-yaniv.ppt