Word Embeddings
Natural Language Processing VU (706.230) - Andi Rexha
02/04/2020
Agenda
Traditional NLP
Word Embeddings-1
Word Embeddings-2
How to preprocess text?
○ “Divide et impera” (divide and conquer) approach:
■ Word split
■ Sentence split
■ Paragraphs, etc.
○ Is there any other information that we can collect?
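A minimal sketch of this splitting step, using NLTK; the library choice and the example text are illustrative assumptions, not part of the slides.

```python
# Sketch: sentence split and word split with NLTK (assumed tool choice).
import nltk
nltk.download("punkt", quiet=True)  # tokenizer models
from nltk.tokenize import sent_tokenize, word_tokenize

text = "There are different examples that we might use! This is a second sentence."
sentences = sent_tokenize(text)                  # sentence split
tokens = [word_tokenize(s) for s in sentences]   # word split, per sentence

print(sentences)   # ['There are different examples that we might use!', 'This is a second sentence.']
print(tokens[0])   # ['There', 'are', 'different', 'examples', 'that', 'we', 'might', 'use', '!']
```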
Other preprocessing steps:
○ Stemming/Lemmatization
○ Part of Speech Tagging (PoS)
○ Chunking/Constituency Parsing
○ Dependency Parsing
Morphological:
○ Stemming: bringing the inflected words to their common root:
■ producing => produc; produced => produc
■ are => are
○ Lemmatization: bringing the words to the same lemma:
■ am, is, are => be
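A small sketch of both normalizations with NLTK (assumed tool choice), reproducing the examples above:

```python
# Stemming vs. lemmatization on the slide's examples (NLTK, assumed tool choice).
import nltk
nltk.download("wordnet", quiet=True)  # lexical data for the lemmatizer
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
print(stemmer.stem("producing"), stemmer.stem("produced"))  # produc produc
print(stemmer.stem("are"))                                  # are (stemming cannot map it to 'be')

lemmatizer = WordNetLemmatizer()
print([lemmatizer.lemmatize(w, pos="v") for w in ("am", "is", "are")])  # ['be', 'be', 'be']
```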
Grammatical:
○ Assign a grammatical tag (PoS tag) to each word
○ Preprocessing example: PoS tagging and lemmatization of a sample sentence
○ Shallow Parsing (Chunking):
■ Adds a tree structure on top of the PoS tags
■ First identifies the constituents and then their relations
○ Deep Parsing (Dependency Parsing):
■ Parses the sentence into its grammatical structure
■ “Head” - “Dependent” form
■ It is a directed acyclic graph (mostly implemented as a tree)
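A quick sketch of PoS tagging, chunking, and dependency parsing with spaCy (an assumed tool choice; the model must first be installed with `python -m spacy download en_core_web_sm`):

```python
# PoS tags, noun chunks (shallow parsing) and dependency heads with spaCy (assumed tool).
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("There are different examples that we might use!")

for token in doc:
    # word, PoS tag, dependency label, and the "head" it depends on
    print(f"{token.text:10} {token.pos_:6} {token.dep_:10} head={token.head.text}")

# Shallow-parsing view: the noun phrases (chunks) of the sentence
print([chunk.text for chunk in doc.noun_chunks])
```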
“There are different examples that we might use!”
○ How to encode the features?
○ The text is considered to be a set of its words
○ Grammatical dependencies are ignored
○ Feature encoding:
■ Dictionary based (nominal features)
■ One-hot encoded / frequency encoded
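A minimal sketch of the dictionary-based encoding with scikit-learn (tool choice and toy corpus are assumptions):

```python
# Frequency and one-hot bag-of-words encodings with scikit-learn (assumed tool choice).
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the cat is walking in the bedroom",
    "a dog was running in the room",
]

freq = CountVectorizer()                    # dictionary + frequency encoding
X = freq.fit_transform(corpus)
print(freq.get_feature_names_out())         # the learned dictionary
print(X.toarray())                          # word counts per document

onehot = CountVectorizer(binary=True)       # one-hot (presence/absence) encoding
print(onehot.fit_transform(corpus).toarray())
```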
○ Word + PoS tag as part of the dictionary:
■ Example: John-PN
○ Use noun phrases:
■ Example: the bank account
○ Word + dependency path as part of the dictionary:
■ Example: use-nsubj-acl:relcl
○ We are still missing linguistic resources:
■ Synonyms, antonyms, hyponyms, hypernyms, ...
■ Enrich the features of our examples with their synonyms
■ Assign a negative weight to antonyms
○ WordNet: a lexical database for English, which groups words into synsets
○ Wiktionary: a free multilingual dictionary enriched with relations between words
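A small sketch of how such a resource can be queried, using NLTK's WordNet interface (the target word "good" is an illustrative assumption):

```python
# Collecting synonyms and antonyms from WordNet to enrich features (NLTK, assumed tool).
import nltk
nltk.download("wordnet", quiet=True)
from nltk.corpus import wordnet as wn

synonyms, antonyms = set(), set()
for synset in wn.synsets("good"):
    for lemma in synset.lemmas():
        synonyms.add(lemma.name())              # candidates to add as extra features
        for ant in lemma.antonyms():
            antonyms.add(ant.name())            # candidates for a negative weight

print(sorted(synonyms)[:5])
print(sorted(antonyms))
```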
○ Information Extraction:
■ Example: extract the names of companies from documents (open domain)
○ Classify each token as part or not part of the information:
■ The classification of the current token depends on the classification of the previous one
■ Sequential classifier
■ Still not enough; we need to encode the output
■ We need to know where every “annotation” starts and ends
○ Example: I work for TU Graz Austria!
○ BILOU: The-O; Know-B; Center-I; GmbH-L; is-O; a-O; spinoff-O; of-O; TUGraz-U
○ BIO: The-O; Know-B; Center-I; GmbH-I; is-O; a-O; spinoff-O; of-O; TUGraz-B
1: Named Entity recognition: https://www.aclweb.org/anthology/W09-1119.pdf
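The following sketch shows how one annotated span is turned into BIO and BILOU labels; the helper function and its name are hypothetical, not from the slides.

```python
# Hypothetical helper: encode one annotation span as BIO or BILOU token labels.
def encode(tokens, span, scheme="BILOU"):
    start, end = span                      # inclusive token indices of the annotation
    labels = []
    for i in range(len(tokens)):
        if i < start or i > end:
            labels.append("O")             # outside the annotation
        elif scheme == "BIO":
            labels.append("B" if i == start else "I")
        elif start == end:
            labels.append("U")             # unit-length annotation (BILOU only)
        elif i == start:
            labels.append("B")
        elif i == end:
            labels.append("L")             # last token of the annotation (BILOU only)
        else:
            labels.append("I")
    return list(zip(tokens, labels))

tokens = "The Know Center GmbH is a spinoff of TUGraz".split()
print(encode(tokens, (1, 3), "BILOU"))   # Know-B Center-I GmbH-L, rest O
print(encode(tokens, (1, 3), "BIO"))     # Know-B Center-I GmbH-I, rest O
print(encode(tokens, (8, 8), "BILOU"))   # TUGraz-U
```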
Sentiment Analysis
○ Binary (like/dislike)
○ Rating based (e.g. 1-5)
○ Usually involves features around the target
○ SentiWordNet: http://sentiwordnet.isti.cnr.it/
○ Statistics are collected and the next word is predicted based on this information
○ Mainly models the probability of the sequence:
■ P(w_1, …, w_T) = ∏_t P(w_t | w_1, …, w_{t−1})
○ Usually solved by combining n-grams of different sizes and weighting them
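A toy sketch of such a count-based model, interpolating bigram and unigram estimates (corpus and interpolation weight are assumptions):

```python
# Toy interpolated bigram language model (corpus and weights are assumptions).
from collections import Counter

corpus = "the cat is walking in the bedroom . a dog was running in the room .".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
N = len(corpus)

def prob(word, prev, lam=0.7):
    """P(word | prev): weighted mix of the bigram and unigram estimates."""
    p_bi = bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0
    p_uni = unigrams[word] / N
    return lam * p_bi + (1 - lam) * p_uni

print(prob("cat", "the"))   # seen bigram -> relatively high probability
print(prob("dog", "the"))   # unseen bigram -> backs off to the unigram estimate
```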
○ Rule based:
■ Usually using a dictionary
○ Statistical (involving bilingual aligned corpora):
■ IBM models (1-6) for aligning and training
○ Hybrid:
■ A combination of the two previous techniques
End
○ These methods use low-rank approximations (e.g. SVD) to decompose large matrices that capture statistical information about a corpus
○ pLSA (Probabilistic Latent Semantic Analysis):
■ Uses a probabilistic model instead of SVD (Singular Value Decomposition)
○ LDA (Latent Dirichlet Allocation):
■ A Bayesian version of pLSA
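A compact sketch of both families on a toy corpus with scikit-learn (tool choice, corpus, and number of topics are assumptions):

```python
# LSA via truncated SVD and LDA on a toy corpus (scikit-learn, assumed tool choice).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD, LatentDirichletAllocation

docs = [
    "ice cold water solid",
    "steam hot water gas",
    "stock market finance money",
    "bank finance market stock",
]
X = CountVectorizer().fit_transform(docs)   # term-document counts

lsa = TruncatedSVD(n_components=2).fit_transform(X)                                # low-rank SVD
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit_transform(X)   # Bayesian model

print(lsa.round(2))   # documents in a 2-dimensional latent space
print(lda.round(2))   # per-document topic proportions
```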
○ The word sequences that we want to predict are likely to be different from the ones we have seen in training
○ Seeing “The cat is walking in the bedroom” should help us generate “A dog was running in the room”:
■ Similar semantics and grammatical roles
○ Bengio et al. (2003) implemented the idea of Mnih and Hinton (1989):
■ Learned a language model and embeddings for the words jointly
○ Approximate the function f(w_t, …, w_{t−n+1}) = P(w_t | w_1, …, w_{t−1}) with a window approach
○ Model the approximation with a neural network
○ Input layer in a one-hot-encoding form
○ Two hidden layers (the first is more of a random initialization)
○ A tanh intermediate layer
○ Output: the next word in the sequence
○ Trained on a corpus of almost 1M words
○ Words that tend to occur in similar linguistic contexts tend to resemble each other in meaning
○ Continuous Bag-of-Words Model (CBOW) (faster):
■ Predict the middle word from a window of context words
○ Skip-gram Model (better with a small amount of data):
■ Predict the context of a middle word, given the word
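A minimal sketch of training both architectures with gensim; the corpus and hyper-parameters are toy assumptions, far too small for meaningful vectors.

```python
# CBOW vs. skip-gram with gensim's Word2Vec (toy corpus and settings assumed).
from gensim.models import Word2Vec

sentences = [
    "the cat is walking in the bedroom".split(),
    "a dog was running in the room".split(),
]

cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)  # CBOW (faster)
skip = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)  # skip-gram

print(skip.wv["cat"].shape)                  # (50,) dense vector for "cat"
print(skip.wv.most_similar("cat", topn=3))   # nearest neighbours in the toy space
```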
CBOW vs. Skip-Gram architecture diagrams
1. Subsampling of frequent words:
■ Each word w_i in the training set is discarded with probability P(w_i) = 1 − sqrt(t / f(w_i)), where f(w_i) is the word’s frequency and t a small threshold (e.g. 10^-5)
■ Accelerates learning and even significantly improves the accuracy of the learned vectors of the rare words
2. Hierarchical Softmax:
○ A tree approximation of the softmax, using a sigmoid at every step
○ Intuition: at every step, decide whether to go right or left
○ O(log(n)) instead of O(n)
3. Negative sampling:
○ Alternative to Hierarchical Softmax (works better)
○ Brings up infrequent terms, squeezes the probability of frequent terms
○ Changes the weights of only a selected number of terms, sampled with probability:
■ P(w_i) = f(w_i)^(3/4) / Σ_j f(w_j)^(3/4)
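A small numeric sketch of the two sampling formulas above (the toy counts and the threshold t = 10^-5 are assumptions):

```python
# Subsampling and negative-sampling probabilities from Word2Vec (toy counts assumed).
import numpy as np

counts = {"the": 5000, "cat": 50, "bedroom": 5}
total = sum(counts.values())
freq = {w: c / total for w, c in counts.items()}

# 1. Subsampling: discard word w with probability 1 - sqrt(t / f(w))
t = 1e-5
discard = {w: max(0.0, 1 - np.sqrt(t / f)) for w, f in freq.items()}
print(discard)                       # very frequent words are dropped most often

# 3. Negative sampling: draw negatives with probability proportional to f(w)^(3/4)
weights = np.array([f ** 0.75 for f in freq.values()])
print(weights / weights.sum())       # flatter than the raw frequency distribution
```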
○ Count-based methods (e.g. LSA):
■ Advantage: learn statistical information
■ Drawback: poor on analogy tasks
○ Prediction-based methods (e.g. Word2Vec):
■ Advantage: learn analogies
■ Drawback: poor at learning global statistical information
○ Use the co-occurrence matrix as a starting point
○ The ratio of co-occurrence probabilities is better than the raw co-occurrence at distinguishing relevant words (solid vs. gas) from unrelated words:
■ Use the ratio as the starting point for the algorithm
○ Since the vector spaces are linear, the function should combine the difference of the two word vectors:
■ F(w_i − w_j, w̃_k) = P_ik / P_jk
○ How do we combine the vectors and keep the linearity? F could be a complicated function, but we need to keep the linearity:
■ To avoid that, use the dot product: F((w_i − w_j)^T w̃_k) = P_ik / P_jk
■ With F being the exponential function (taking the log of the ratio above):
■ w_i^T w̃_k + b_i + b̃_k = log(X_ik)
○ This leads to the weighted least-squares objective: J = Σ_{i,j} f(X_ij) (w_i^T w̃_j + b_i + b̃_j − log X_ij)^2
○ The weighting function is f(x) = (x / x_max)^α for x < x_max, and 1 otherwise
○ The best α is ¾: it looks quite similar to the Word2Vec negative-sampling exponent
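A one-function sketch of this weighting, with x_max = 100 and α = ¾ as in the GloVe paper:

```python
# GloVe weighting function f(x): small weight for rare co-occurrences, capped at 1.
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    x = np.asarray(x, dtype=float)
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

print(glove_weight([1, 10, 100, 1000]))   # [0.0316... 0.1778... 1. 1.]
```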
○ Distinguish the prefix and postfix with ‘<’ and ‘>’ respectively
○ Example of 3-grams for the word TUGraz:
■ ‘<TU’, ‘TUG’, ‘UGr’, ‘Gra’, ‘raz’, ‘az>’
○ Select all n-grams with ‘n’ between 3 and 6 (inclusive)
■ Also select the word itself: ‘<TUGraz>’
○ Each word is represented as a bag of character n-grams
○ Uses the skip-gram architecture of Word2Vec
○ Allows computing word representations for words that do not occur in the training data
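A plain-Python sketch of the n-gram extraction described above (an illustration, not the actual fastText implementation):

```python
# Character n-grams (3 <= n <= 6) of a word, FastText-style (illustrative sketch).
def char_ngrams(word, n_min=3, n_max=6):
    marked = f"<{word}>"                 # '<' and '>' mark prefix and postfix
    grams = {marked}                     # the whole word itself is also kept
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            grams.add(marked[i:i + n])
    return sorted(grams)

print(char_ngrams("TUGraz"))
# the word vector is then the sum of the vectors of these n-grams,
# which also gives a representation for words unseen during training
```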
○ Using random sampling to train the weights of the network
Results: sentence completion, lexical substitution, and word sense disambiguation tasks
End
○ Pre-train word embeddings and LM embeddings on large corpora
○ Extract word embeddings and Language Model embeddings for a given input sequence
○ Use them in a supervised task
Pre-train
○ Concatenate a character-based embedding and a word embedding:
■ The character representation captures morphological information (CNN or RNN)
○ Concatenate the outputs of the two directions for each layer
Architecture
○ Maximize the log likelihood of the forward and backward directions:
■ Σ_k [ log p(t_k | t_1, …, t_{k−1}; Θ_x, Θ_LSTM_fwd, Θ_s) + log p(t_k | t_{k+1}, …, t_N; Θ_x, Θ_LSTM_bwd, Θ_s) ]
○ Θ_x => parameters for the representation of the tokens
○ Θ_s => parameters for the softmax layer
Pre-train network
○ Shares some weights between the two directions, instead of keeping completely independent parameters
○ For each token, the biLM computes a set of representations:
■ R_k = { x_k, h_{k,j}^fwd, h_{k,j}^bwd | j = 1, …, L }
■ Where x_k => a token-independent representation (CNN over characters)
Pre-train network
○ The final embedding is a combination of all the layers, not just the last layer:
■ ELMo_k^task = γ^task · Σ_j s_j^task · h_{k,j}
○ γ^task => allows the task model to scale the entire output
■ It is important for optimizing the whole process (the authors claim)
○ s_j^task => softmax-normalized weights over the layers
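A numpy sketch of this layer combination (layer count, dimensionality, and the weights are toy assumptions):

```python
# ELMo-style mix of layers: gamma * sum_j softmax(w)_j * h_j (toy numbers assumed).
import numpy as np

layers = np.random.randn(3, 1024)     # L+1 = 3 layer representations for one token
w = np.array([0.1, 0.5, -0.2])        # learned scalar weight per layer
gamma = 1.3                           # learned scale for the whole vector

s = np.exp(w) / np.exp(w).sum()       # softmax-normalised layer weights
elmo = gamma * (s[:, None] * layers).sum(axis=0)
print(elmo.shape)                     # (1024,) - one mixed representation for the token
```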
○ Run the LM on each token and record the output of each layer
○ Then let the architecture for the specific task learn from this representation:
■ Because most of the tasks share the same architecture at the lowest level
■ The model then forms a task-sensitive representation
○ The weights of the ELMo LM are frozen:
■ The ELMo vectors are then concatenated with the task’s input embeddings and fed to the task architecture
Universal Language Model Fine Tuning
○ Transfer Learning with task-specific tuning
○ Language Models capture many properties useful for downstream tasks:
■ Long-term dependencies
■ Hierarchical relations
■ Sentiment
Universal Language Model Fine Tuning
○ Language Model trained on a general-domain corpus
○ Target task LM fine-tuning
○ Target task classifier fine-tuning
Steps of ULMFiT
○ Discriminative fine-tuning:
■ Each layer is tuned with a different learning rate
○ Slanted triangular learning rates:
■ A short, steep increase of the learning rate followed by a long decay region (see the sketch below)
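A small sketch of that schedule, using the default values from the ULMFiT paper (T, cut_frac, ratio, lr_max):

```python
# Slanted triangular learning rate: short linear increase, long linear decay.
def stlr(t, T=1000, cut_frac=0.1, ratio=32, lr_max=0.01):
    cut = int(T * cut_frac)                                   # step where the peak is reached
    p = t / cut if t < cut else 1 - (t - cut) / (cut * (1 / cut_frac - 1))
    return lr_max * (1 + p * (ratio - 1)) / ratio

print([round(stlr(t), 5) for t in (0, 50, 100, 500, 999)])
# rises to lr_max at step 100, then decays slowly towards lr_max / ratio
```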
○ Concat pooling (to avoid losing the information in the last states):
■ Concatenates the last hidden state with a max-pool and a mean-pool of all the hidden states (doesn’t it look similar to ELMo?)
○ Gradual unfreezing:
■ To avoid catastrophic forgetting: unfreeze the model one layer at a time, starting from the last layer (which contains the least general knowledge), and fine-tune it
Encoder-Decoder architecture
○ Encoder: learns the representations of the words
○ Decoder: generates the new/translated sequence
○ Previously: a bi-RNN with some attention
○ In the Transformer: only via attention
Transformer
○ 6 layers of encoders & 6 layers of decoders
○ Encoder layers: self-attention & feed-forward sublayers
○ Decoder layers: self-attention, encoder-decoder attention & feed-forward sublayers
○ In this way the system learns how to relate the words to each other
From the rich to the poor
○ Pre-train
○ Fine-tune
○ Based on the infamous paper “Attention is All You Need”:
■ A multi-layer bidirectional Transformer encoder
○ Masked LM:
■ Hide (mask) a word with a probability of 15% and try to predict it (see the sketch below)
○ Next sentence prediction:
■ Given two sentences, the network tries to predict whether one follows the other
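A toy sketch of the masking step as described above; note that the full BERT recipe additionally replaces some selected tokens with random words or leaves them unchanged.

```python
# Masked-LM corruption: mask each token with 15% probability and remember the target.
import random

tokens = "the cat is walking in the bedroom".split()
masked, targets = [], {}
for i, tok in enumerate(tokens):
    if random.random() < 0.15:
        masked.append("[MASK]")
        targets[i] = tok          # the model must predict this token at position i
    else:
        masked.append(tok)

print(masked)
print(targets)
```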
○ The input representation is the sum of the token embedding, segment embedding, and position embedding
○ WordPiece tokenization is used to handle Out-Of-Vocabulary words:
■ Example: walking => walk ##ing; walked => walk ##ed
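A greedy longest-match-first sketch of this kind of subword splitting (the three-entry vocabulary is an assumption; the real tokenizer ships a vocabulary of roughly 30k pieces):

```python
# WordPiece-style greedy longest-match-first splitting (toy vocabulary assumed).
def wordpiece(word, vocab):
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                pieces.append(piece)
                break
            end -= 1
        if end == start:          # no matching piece -> truly out of vocabulary
            return ["[UNK]"]
        start = end
    return pieces

vocab = {"walk", "##ing", "##ed"}
print(wordpiece("walking", vocab))   # ['walk', '##ing']
print(wordpiece("walked", vocab))    # ['walk', '##ed']
```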
Similar to BERT
○ Improves the training methodology of BERT:
■ 1000 times more data to train
○ Uses half of the parameters of BERT:
■ Achieves 97% of BERT’s performance
■ A good tradeoff
Transfer learning for non-contextual Embeddings
○ Advantages:
■ Transfer Learning
■ Machine Translation
■ Cross-lingual entity linking
○ The basic idea in most of the papers is to learn a matrix transformation between the embedding spaces (see the sketch below):
■ Exploits the linearity of the spaces
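A numpy sketch of learning such a matrix transformation from a seed dictionary via least squares (the data is random; the orthogonal/Procrustes constraint used in several papers is omitted here):

```python
# Learn a linear map W so that source vectors X land on their target translations Y.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 300))          # source-language vectors of a seed dictionary
W_true = rng.normal(size=(300, 300))
Y = X @ W_true                           # their aligned target-language vectors

W, *_ = np.linalg.lstsq(X, Y, rcond=None)   # solve min_W ||X W - Y||
print(np.allclose(W, W_true, atol=1e-6))    # True: the mapping is recovered

# a new source word is then translated by mapping its vector with W and
# taking nearest neighbours in the target embedding space
```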
Example with a small amount of parallel data
Multilingual BERT
○ The Word Piece tokenizer helps to capture semantics across languages:
■ To see the effect: try a NER task with M-BERT compared to English-BERT
■ M-BERT’s pretraining on multiple languages has enabled a representational capacity deeper than simple vocabulary memorization
○ A task trained in one language performs well in the others
Multilingual ULM-FiT (MultiFiT)
○ QRNNs alternate CNN layers and a recurrent pooling function:
■ Outperform LSTMs (and are ~16 times faster)
○ MultiFiT uses subword tokenization (sounds familiar?)
○ A new variant of the slanted triangular learning rate + gradual unfreezing:
■ A cosine variant of the one-cycle policy