Mimicking Word Embeddings using Subword RNNs Yuval Pinter, Robert - PowerPoint PPT Presentation
Mimicking Word Embeddings using Subword RNNs Yuval Pinter, Robert Guthrie, Jacob Eisenstein @yuvalpi Presented at EMNLP September 2017, Copenhagen The Word Embedding Pipeline Unlabeled Unlabeled corpus Unlabeled corpus Unlabeled corpus
Enter MIMICK ● What data do we have, post-unlabeled corpus? ○ Vector dictionary ○ Orthography (the way words are spelled) ● Use the former as training objective, latter as input ● Pre-trained vectors as target ○ No need to access original unlabeled corpus ○ Many training examples Unlabeled Supervised ○ (No context) Unlabeled corpus task corpus Unlabeled corpus Unlabeled corpus corpus OOV OOV 11
Enter MIMICK ● What data do we have, post-unlabeled corpus? ○ Vector dictionary ○ Orthography (the way words are spelled) ● Use the former as training objective, latter as input ● Pre-trained vectors as target ○ No need to access original unlabeled corpus ○ Many training examples Unlabeled Supervised ○ (No context) Unlabeled corpus task corpus Unlabeled corpus Unlabeled ● Subword units as inputs corpus corpus ○ Very extensible ○ (Character inventory changes?) OOV OOV 11
MIMICK Training Pre-trained Embedding make (Polyglot/FastText/etc.) All possible text Unlabeled text make m a k e 12
MIMICK Training Pre-trained Embedding make (Polyglot/FastText/etc.) All possible text Unlabeled text make Character embeddings m a k e 12
MIMICK Training Pre-trained Embedding make (Polyglot/FastText/etc.) All possible text Unlabeled text make Forward LSTM Character embeddings m a k e 12
MIMICK Training Pre-trained Embedding make (Polyglot/FastText/etc.) All possible text Unlabeled text make Backward LSTM Forward LSTM Character embeddings m a k e 12
MIMICK Training Pre-trained Embedding make (Polyglot/FastText/etc.) All possible text Multilayered Perceptron Unlabeled text make Backward LSTM Forward LSTM Character embeddings m a k e 12
MIMICK Training Pre-trained Embedding make (Polyglot/FastText/etc.) Loss (L 2 ) All possible text Mimicked Embedding Multilayered Perceptron Unlabeled text make Backward LSTM Forward LSTM Character embeddings m a k e 12
MIMICK Inference All possible text Mimicked Embedding Multilayered Perceptron Unlabeled text blah Backward LSTM Forward LSTM Character embeddings b l a h 13
Observation – Nearest Neighbors 14
Observation – Nearest Neighbors ● English (OOV Nearest in-vocab words) 14
Observation – Nearest Neighbors ● English (OOV Nearest in-vocab words) ○ MCT → AWS, OTA, APT, PDM 14
Observation – Nearest Neighbors ● English (OOV Nearest in-vocab words) ○ MCT → AWS, OTA, APT, PDM ○ pesky → euphoric, disagreeable, horrid, ghastly 14
Observation – Nearest Neighbors ● English (OOV Nearest in-vocab words) ○ MCT → AWS, OTA, APT, PDM ○ pesky → euphoric, disagreeable, horrid, ghastly ○ lawnmower → tradesman, bookmaker, postman, hairdresser 14
Observation – Nearest Neighbors ● English (OOV Nearest in-vocab words) ○ MCT → AWS, OTA, APT, PDM ○ pesky → euphoric, disagreeable, horrid, ghastly ○ lawnmower → tradesman, bookmaker, postman, hairdresser ● Hebrew 14
Observation – Nearest Neighbors ● English (OOV Nearest in-vocab words) ○ MCT → AWS, OTA, APT, PDM ○ pesky → euphoric, disagreeable, horrid, ghastly ○ lawnmower → tradesman, bookmaker, postman, hairdresser ● Hebrew ○ רותפת → גתת (she/you-3p.sg.) will come true (she/you-3p.sg.) will solve 14
Observation – Nearest Neighbors ● English (OOV Nearest in-vocab words) ○ MCT → AWS, OTA, APT, PDM ○ pesky → euphoric, disagreeable, horrid, ghastly ○ lawnmower → tradesman, bookmaker, postman, hairdresser ● Hebrew ○ רותפת → גתת (she/you-3p.sg.) will come true (she/you-3p.sg.) will solve ○ םיירט → םיירטמואיג geometric (m.pl., nontrad. spelling) geometric (m.pl.) 14
Observation – Nearest Neighbors ● English (OOV Nearest in-vocab words) ○ MCT → AWS, OTA, APT, PDM ○ pesky → euphoric, disagreeable, horrid, ghastly ○ lawnmower → tradesman, bookmaker, postman, hairdresser ● Hebrew ○ רותפת → גתת (she/you-3p.sg.) will come true (she/you-3p.sg.) will solve ○ םיירט → םיירטמואיג geometric (m.pl., nontrad. spelling) geometric (m.pl.) ○ ךרא → ציר ’ ןוסדר Richardson Eustrach 14
Observation – Nearest Neighbors ● English (OOV Nearest in-vocab words) ○ MCT → AWS, OTA, APT, PDM ○ pesky → euphoric, disagreeable, horrid, ghastly ○ lawnmower → tradesman, bookmaker, postman, hairdresser ● Hebrew ○ רותפת → גתת (she/you-3p.sg.) will come true (she/you-3p.sg.) will solve ○ םיירט → םיירטמואיג geometric (m.pl., nontrad. spelling) geometric (m.pl.) ○ ךרא → ציר ’ ןוסדר Richardson Eustrach ● ✔ Surface form ✔ Syntactic properties ✘ Semantics 14
Intrinsic Evaluation – RareWords 15
Intrinsic Evaluation – RareWords ● RareWords similarity task: morphologically-complex, mostly unseen words 15
Intrinsic Evaluation – RareWords ● RareWords similarity task: morphologically-complex, mostly unseen words 15
Intrinsic Evaluation – RareWords ● RareWords similarity task: morphologically-complex, mostly unseen words Names ● Domain-specific jargon ● Foreign words ● Rare(-ish) morphological ● derivations Nonce words ● Nonstandard orthography ● Typos and other errors ● ... ● 15
Intrinsic Evaluation – RareWords ● RareWords similarity task: morphologically-complex, mostly unseen words NN FUN!!! Nearest: programmatic transformational Names ● mechanistic Domain-specific jargon ● transactional Foreign words ● contextual Rare(-ish) morphological ● derivations Nonce words ● Nonstandard orthography ● Typos and other errors ● ... ● 15
Extrinsic Evaluation – POS + Attribute Tagging ● UD is annotated for POS and morphosyntactic attributes ○ Eng: his stated goals Tense=Past|VerbForm=Part ○ Cze: osoby v pokročilém věku Animacy=Inan|Case=Loc|Degree=Pos|Gender=Masc|Negative=Pos|Number=Sing people of advanced age ● Names ● Domain-specific jargon ● Foreign words ● Rare(-ish) morphological derivations Nonce words ● ● Nonstandard orthography ● Typos and other errors ● ... 16
Extrinsic Evaluation – POS + Attribute Tagging ● UD is annotated for POS and morphosyntactic attributes ○ Eng: his stated goals Tense=Past|VerbForm=Part ○ Cze: osoby v pokročilém věku Animacy=Inan|Case=Loc|Degree=Pos|Gender=Masc|Negative=Pos|Number=Sing people of advanced age ● POS model from Ling et al. (2015) DT NN VBZ VBG ● Names ● Domain-specific jargon Backward LSTM ● Foreign words ● Rare(-ish) morphological Forward derivations LSTM Nonce words ● ● Nonstandard Word orthography embeddings ● Typos and other errors the cat is sitting ● ... 16
Extrinsic Evaluation – POS + Attribute Tagging ● UD is annotated for POS and morphosyntactic attributes ○ Eng: his stated goals Tense=Past|VerbForm=Part ○ Cze: osoby v pokročilém věku Animacy=Inan|Case=Loc|Degree=Pos|Gender=Masc|Negative=Pos|Number=Sing people of advanced age ● POS model from Ling et al. (2015) Tense -- -- pres -- Number -- sing -- -- ● Attributes - same as POS layer POS DT NN VBZ VBG Backward LSTM Forward LSTM Word embeddings the cat is sitting 17
Extrinsic Evaluation – POS + Attribute Tagging ● UD is annotated for POS and morphosyntactic attributes ○ Eng: his stated goals Tense=Past|VerbForm=Part ○ Cze: osoby v pokročilém věku Animacy=Inan|Case=Loc|Degree=Pos|Gender=Masc|Negative=Pos|Number=Sing people of advanced age ● POS model from Ling et al. (2015) Tense -- -- pres -- Number -- sing -- -- ● Attributes - same as POS layer POS DT NN VBZ VBG ● Negative effect on POS Backward LSTM Forward LSTM Word embeddings the cat is sitting 17
Extrinsic Evaluation – POS + Attribute Tagging ● UD is annotated for POS and morphosyntactic attributes ○ Eng: his stated goals Tense=Past|VerbForm=Part ○ Cze: osoby v pokročilém věku Animacy=Inan|Case=Loc|Degree=Pos|Gender=Masc|Negative=Pos|Number=Sing people of advanced age ● POS model from Ling et al. (2015) Tense -- -- pres -- Number -- sing -- -- ● Attributes - same as POS layer POS DT NN VBZ VBG ● Negative effect on POS ● Attribute evaluation metric Backward LSTM ○ Micro F1 Forward LSTM Word embeddings the cat is sitting 17
Language Selection Minna Sundberg 18
Language Selection ● |UD ∩ Polyglot| = 44, we took 23 Minna Sundberg 18
Language Selection ● |UD ∩ Polyglot| = 44, we took 23 ● Morphological structure Minna Sundberg 18
Language Selection ● |UD ∩ Polyglot| = 44, we took 23 ● Morphological structure ○ 12 fusional Minna Sundberg 18
Language Selection ● |UD ∩ Polyglot| = 44, we took 23 ● Morphological structure ○ 12 fusional ○ 3 analytic Minna Sundberg 18
Language Selection ● |UD ∩ Polyglot| = 44, we took 23 ● Morphological structure ○ 12 fusional ○ 3 analytic ○ 1 isolating Minna Sundberg 18
Language Selection ● |UD ∩ Polyglot| = 44, we took 23 ● Morphological structure ○ 12 fusional ○ 3 analytic ○ 1 isolating ○ 7 agglutinative Minna Sundberg 18
Language Selection ● |UD ∩ Polyglot| = 44, we took 23 ● Morphological structure ○ 12 fusional ○ 3 analytic ○ 1 isolating ○ 7 agglutinative ● Geneological diversity Minna Sundberg 18
Language Selection ● |UD ∩ Polyglot| = 44, we took 23 ● Morphological structure ○ 12 fusional ○ 3 analytic ○ 1 isolating ○ 7 agglutinative ● Geneological diversity ○ 13 Indo-European (7 different branches) Minna Sundberg 18
Language Selection ● |UD ∩ Polyglot| = 44, we took 23 ● Morphological structure ○ 12 fusional ○ 3 analytic ○ 1 isolating ○ 7 agglutinative ● Geneological diversity ○ 13 Indo-European (7 different branches) ○ 10 from 8 non-IE branches Minna Sundberg 18
Language Selection ● |UD ∩ Polyglot| = 44, we took 23 ● Morphological structure ○ 12 fusional ○ 3 analytic ○ 1 isolating ○ 7 agglutinative ● Geneological diversity ○ 13 Indo-European (7 different branches) ○ 10 from 8 non-IE branches MRLs (e.g. Slavic languages) ● Minna Sundberg 18
Language Selection ● |UD ∩ Polyglot| = 44, we took 23 ● Morphological structure ○ 12 fusional ○ 3 analytic ○ 1 isolating ○ 7 agglutinative ● Geneological diversity ○ 13 Indo-European (7 different branches) ○ 10 from 8 non-IE branches MRLs (e.g. Slavic languages) ● ○ Much word-level data Minna Sundberg 18
Language Selection ● |UD ∩ Polyglot| = 44, we took 23 ● Morphological structure ○ 12 fusional ○ 3 analytic ○ 1 isolating ○ 7 agglutinative ● Geneological diversity ○ 13 Indo-European (7 different branches) ○ 10 from 8 non-IE branches MRLs (e.g. Slavic languages) ● ○ Much word-level data ○ Relatively free word order Minna Sundberg 18
Language Selection ● |UD ∩ Polyglot| = 44, we took 23 ● Morphological structure ○ 12 fusional ○ 3 analytic ○ 1 isolating ○ 7 agglutinative Institutional ● Geneological diversity Entrepreneurial Linguistic ○ 13 Indo-European (7 different branches) Anatomical ○ 10 from 8 non-IE branches Ideological MRLs (e.g. Slavic languages) ● ○ Much word-level data ○ Relatively free word order Minna Sundberg 18
Language Selection (contd.) 19
Language Selection (contd.) ● Script type 19
Language Selection (contd.) ● Script type ○ 7 in non-alphabetic scripts 19
Language Selection (contd.) ● Script type ○ 7 in non-alphabetic scripts ○ Ideographic (Chinese) - ~12K characters 19
Language Selection (contd.) ● Script type ○ 7 in non-alphabetic scripts ○ Ideographic (Chinese) - ~12K characters ○ Hebrew, Arabic - no casing, no vowels, syntactic fusion 19
Language Selection (contd.) ● Script type ○ 7 in non-alphabetic scripts ○ Ideographic (Chinese) - ~12K characters ○ Hebrew, Arabic - no casing, no vowels, syntactic fusion ○ Vietnamese - tokens are non-compositional syllables 19
Language Selection (contd.) ● Script type ○ 7 in non-alphabetic scripts ○ Ideographic (Chinese) - ~12K characters ○ Hebrew, Arabic - no casing, no vowels, syntactic fusion ○ Vietnamese - tokens are non-compositional syllables ● Attribute-carrying tokens 19
Language Selection (contd.) ● Script type ○ 7 in non-alphabetic scripts ○ Ideographic (Chinese) - ~12K characters ○ Hebrew, Arabic - no casing, no vowels, syntactic fusion ○ Vietnamese - tokens are non-compositional syllables ● Attribute-carrying tokens ○ Range from 0% (Vietnamese) to 92.4% (Hindi) 19
Language Selection (contd.) ● Script type ○ 7 in non-alphabetic scripts ○ Ideographic (Chinese) - ~12K characters ○ Hebrew, Arabic - no casing, no vowels, syntactic fusion ○ Vietnamese - tokens are non-compositional syllables ● Attribute-carrying tokens ○ Range from 0% (Vietnamese) to 92.4% (Hindi) ● OOV rate (UD against Polyglot vocabulary) 19
Language Selection (contd.) ● Script type ○ 7 in non-alphabetic scripts ○ Ideographic (Chinese) - ~12K characters ○ Hebrew, Arabic - no casing, no vowels, syntactic fusion ○ Vietnamese - tokens are non-compositional syllables ● Attribute-carrying tokens ○ Range from 0% (Vietnamese) to 92.4% (Hindi) ● OOV rate (UD against Polyglot vocabulary) ○ 16.9%-70.8% type-level (median 29.1%) 19
Language Selection (contd.) ● Script type ○ 7 in non-alphabetic scripts ○ Ideographic (Chinese) - ~12K characters ○ Hebrew, Arabic - no casing, no vowels, syntactic fusion ○ Vietnamese - tokens are non-compositional syllables ● Attribute-carrying tokens ○ Range from 0% (Vietnamese) to 92.4% (Hindi) ● OOV rate (UD against Polyglot vocabulary) ○ 16.9%-70.8% type-level (median 29.1%) ○ 2.2%-33.1% token-level (median 9.2%) 19
Evaluated Systems ● NONE: Polyglot ’ s default UNK embedding the flatfish is sitting 20
Evaluated Systems ● NONE: Polyglot ’ s default UNK embedding ● MIMICK the flatfish is sitting 20
Evaluated Systems ● NONE: Polyglot ’ s default UNK embedding ● MIMICK ● CHAR2TAG - additional RNN layer ○ 3x Training time the flatfish is sitting Char- Char- Char- Char- LSTM LSTM LSTM LSTM 20
Evaluated Systems ● NONE: Polyglot ’ s default UNK embedding ● MIMICK ● CHAR2TAG - additional RNN layer ○ 3x Training time ● BOTH: MIMICK + CHAR2TAG the flatfish is sitting Char- Char- Char- Char- LSTM LSTM LSTM LSTM 20
Evaluated Systems ● NONE: Polyglot ’ s default UNK embedding ● MIMICK POINT ● CHAR2TAG - additional RNN layer UNION ○ 3x Training time ROAD LIGHT ● BOTH: MIMICK + CHAR2TAG LONG the flatfish is sitting Char- Char- Char- Char- LSTM LSTM LSTM LSTM 20
Results - Full Data NONE MIMICK CHAR2TAG BOTH NONE MIMICK CHAR2TAG BOTH Morpho. Attributes (micro F1) POS tags (accuracy) 21
Results - 5,000 training tokens NONE MIMICK CHAR2TAG BOTH NONE MIMICK CHAR2TAG BOTH Morpho. Attributes (micro F1) POS tags (accuracy) 22
Results - Language Types (5,000 tokens) NONE MIMICK CHAR2TAG BOTH Slavic languages POS 23
Results - Language Types (5,000 tokens) NONE MIMICK CHAR2TAG BOTH NONE MIMICK CHAR2TAG BOTH Agglutinative languages morpho. attribute F1 Slavic languages POS 23
Results - Chinese NONE MIMICK CHAR2TAG BOTH NONE MIMICK CHAR2TAG BOTH Morpho. Attributes (micro F1) POS tags (accuracy) 24
Recommend
More recommend
Explore More Topics
Stay informed with curated content and fresh updates.