
Word Embeddings

Natural Language Processing VU (706.230) - Andi Rexha

02/04/2020


Agenda

Traditional NLP

  • Text preprocessing
  • Bag-of-words model
  • External Resources
  • Sequential classification
  • Other tasks (MT, LM)

Word Embeddings-1

  • Topic Modeling
  • Neural Embeddings
  • Word2Vec
  • GloVe
  • fastText

Word Embeddings-2

  • ELMo
  • ULMFit
  • BERT
  • RoBERTa, DistilBERT
  • Multilinguality

Traditional NLP


How to preprocess text?

  • How do we (humans) split the text to analyse it?

○ “Divide et impera” approach:
  ■ Word split
  ■ Sentence split
  ■ Paragraphs, etc.
○ Is there any other information that we can collect?

Preprocessing


Preprocessing (2)

Other preprocessing steps:

  • Morphological:

○ Stemming/Lemmatization

  • Grammatical:

○ Part of Speech Tagging (PoS)
○ Chunking/Constituency Parsing
○ Dependency Parsing


Preprocessing (3)

Morphological:

  • Stemming:

○ The process of bringing the inflected words to their common root:
  ■ Producing => produc; produced => produc
  ■ are => are

  • Lemmatization:

○ Bringing the words to the same lemma (dictionary form):
  ■ am, is, are => be (see the sketch below)
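A minimal sketch of both steps with NLTK (assumes the nltk package and its WordNet data are installed; the example words mirror the slide):

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()             # needs: nltk.download("wordnet")

print(stemmer.stem("producing"))             # produc
print(stemmer.stem("produced"))              # produc
print(lemmatizer.lemmatize("are", pos="v"))  # be
```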


Preprocessing (4)

Grammatical:

  • Part of Speech Tagging (PoS):

○ Assign to each word a grammatical tag

  • Sentence: “There are different examples that we might use!”

○ Preprocessing of the example: PoS tag and lemma for each token (table shown on the slide)


Preprocessing (5)

  • Parsing:

○ Shallow Parsing (Chunking):
  ■ Adds a tree structure on top of the PoS tags
  ■ First identifies the constituents and then their relations
○ Deep Parsing (Dependency Parsing):
  ■ Parses the sentence into its grammatical structure
  ■ “Head” - “Dependent” form
  ■ It is a directed acyclic graph (mostly implemented as a tree)


Preprocessing (6)

  • Sentence:

“There are different examples that we might use!”

  • Dependency Parsing:
  • Constituency Parsing:
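The dependency and constituency parses themselves are figures on the slide. As an illustration only (not from the slides), a minimal spaCy sketch that reproduces the PoS tags, lemmas, and dependency relations for the example sentence (assumes the en_core_web_sm model is installed):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("There are different examples that we might use!")

for token in doc:
    # surface form, lemma, PoS tag, dependency label, syntactic head
    print(token.text, token.lemma_, token.pos_, token.dep_, token.head.text)
```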

Bag-of-words Model

  • Use the preprocessed text in Machine Learning tasks

■ How to encode the features?

  • A major paradigm in NLP and IR - Bag-of-Words (BoW):

○ The text is considered to be a set of its words
○ Grammatical dependencies are ignored
○ Feature encoding:
  ■ Dictionary based (nominal features)
  ■ One-hot encoded / frequency encoded


Bag-of-words Model (2)

  • Sentences:
  • Features:
  • Representation of features for Machine Learning (table on the slide; see the sketch below):
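A minimal scikit-learn sketch of the bag-of-words representation (the two sentences are illustrative, not the slide's table):

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "There are different examples that we might use!",
    "We might use different words.",
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(sentences)     # sparse document-term matrix
print(vectorizer.get_feature_names_out())   # the dictionary (one feature per word)
print(X.toarray())                          # frequency-encoded bag-of-words vectors
```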

Feature encoding

  • PoS tagging:

○ Word + PoS tag as part of dictionary:
  ■ Example: John-PN

  • Chunking:

○ Use Noun Phrases:
  ■ Example: the bank account

  • Dependency Parsing:

○ Word + Dependency Path as part of dictionary:
  ■ Example: use-nsubj-acl:relcl


External Resources

  • The feature space is sparse and high-dimensional:

○ We miss linguistic resources:
  ■ Synonyms, antonyms, hyponyms, hypernyms, ...
  ■ Enrich the features of our examples with their synonyms
  ■ Set negative weights for antonyms

  • External resources to mitigate sparsity:

○ WordNet: a lexical database for English, which groups words into synsets
○ Wiktionary: a free multilingual dictionary enriched with relations between words


Sequential Classification

  • We need to classify a sequence of tokens:

○ Information Extraction:
  ■ Example: extract the names of companies from documents (open domain)

  • How to Model?

○ Classify each token as part or not part of the information:
  ■ The classification of the current token depends on the classification of the previous one
  ■ Sequential classifier
  ■ Still not enough; we need to encode the output
  ■ We need to know where every “annotation” starts and ends


Sequential Classification (2)

  • Why do we need a schema?

○ Example: I work for TU Graz Austria!

  • BILOU: Beginning, Inside, Last, Outside, Unit
  • BIO (most used): Beginning, Inside, Outside
  • BILOU has been shown to perform better on some datasets
  • Example: “The Know Center GmbH is a spinoff of TUGraz.”

○ BILOU: The-O; Know-B; Center-I; GmbH-L; is-O; a-O; spinoff-O; of-O; TUGraz-U
○ BIO: The-O; Know-B; Center-I; GmbH-I; is-O; a-O; spinoff-O; of-O; TUGraz-B

  • Sequential classifiers: Hidden Markov Models, CRFs, etc. (see the encoding sketch below)

1: Named Entity recognition: https://www.aclweb.org/anthology/W09-1119.pdf
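A toy sketch of how token-level spans map to BIO tags for the example above (the helper function and the span indices are illustrative, not from the slides):

```python
def bio_encode(tokens, spans):
    """spans: list of (start, end) token indices, end exclusive."""
    tags = ["O"] * len(tokens)
    for start, end in spans:
        tags[start] = "B"
        for i in range(start + 1, end):
            tags[i] = "I"
    return tags

tokens = "The Know Center GmbH is a spinoff of TUGraz .".split()
print(list(zip(tokens, bio_encode(tokens, [(1, 4), (8, 9)]))))
# ... ('Know', 'B'), ('Center', 'I'), ('GmbH', 'I'), ... ('TUGraz', 'B') ...
```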


Sentiment Analysis

  • Assign a sentiment to a piece of text:

○ Binary (like/dislike)
○ Rating based (e.g. 1-5)

  • Assign the sentiment to a target phrase:

○ Usually involving features around the target

  • External resources:

○ SentiWordnet http://sentiwordnet.isti.cnr.it/


Language model

  • Generating the next token of a sequence
  • Usually based on collecting co-occurrences of words within a window:

○ Statistics are collected and the next word is predicted based on this information
○ Mainly, it models the probability of the sequence (see the formula below)

  • In traditional approaches, solved as an n-gram approximation:

○ Usually solved by combining different sizes of n-grams and weighting them
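The probability the slide refers to, written out in standard notation (a reconstruction, not copied from the slide):

```latex
P(w_1,\dots,w_m) = \prod_{i=1}^{m} P(w_i \mid w_1,\dots,w_{i-1})
                 \approx \prod_{i=1}^{m} P(w_i \mid w_{i-(n-1)},\dots,w_{i-1})
```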


Machine Translation

  • Translate text from one language to another
  • Different approaches:

○ Rule based:
  ■ Usually by using a dictionary
○ Statistical (involving bilingual aligned corpora):
  ■ IBM models (1-6) for aligning and training
○ Hybrid:
  ■ Combines the two previous techniques


Traditional NLP

End


Dense Word Representation


From Sparse to Dense

  • Topic Modeling
  • Since LSA (Latent Semantic Analysis):

○ These methods utilize low-rank approximations to decompose large matrices that capture statistical information about a corpus

  • Later methods:

○ pLSA (Probabilistic Latent Semantic Analysis)
  ■ Uses probabilities instead of SVD (Singular Value Decomposition)

  • LDA (Latent Dirichlet Allocation):

○ A Bayesian version of pLSA
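A minimal LSA-style sketch with scikit-learn: build a TF-IDF term-document matrix and take a low-rank SVD of it (the toy documents and the number of components are illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stock markets fell sharply today",
]
X = TfidfVectorizer().fit_transform(docs)   # sparse term-document statistics
lsa = TruncatedSVD(n_components=2)          # low-rank approximation
doc_topics = lsa.fit_transform(X)           # dense "topic" vector per document
print(doc_topics)
```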


Neural embeddings

  • Language models suffer from the “Curse of dimensionality”:

○ The word sequence that we want to predict is likely to be different from the ones we have seen in training
○ Having seen “The cat is walking in the bedroom” should help us generate “A dog was running in the room”:
  ■ Similar semantics and grammatical roles

  • A Neural Probabilistic Language Model:

○ Bengio et al. (2003) implemented the idea of Mnih and Hinton (1989):
  ■ Learned a language model and embeddings for the words


Neural embeddings (2)

  • Bengio’s architecture:

○ Approximate the next-word function with a window approach (see the formula below)
○ Model the approximation with a neural network
○ Input layer in a one-hot-encoding form
○ Two hidden layers (the first is more of a random initialization)
○ A tanh intermediate layer
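The model of Bengio et al. (2003), written out as a reminder (standard formulation from the paper, not copied from the slide): x concatenates the embeddings C(w) of the n-1 previous words, and the next word is predicted with a softmax.

```latex
x = \big(C(w_{t-1}), C(w_{t-2}), \dots, C(w_{t-n+1})\big)
y = b + W x + U \tanh(d + H x)
P(w_t \mid w_{t-1}, \dots, w_{t-n+1}) = \mathrm{softmax}(y)
```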


Neural embeddings (3)

  • A final softmax layer:

○ Outputting the next word in the sequence

  • Learned word representations for 18K words, with almost 1M words in the corpus
  • IMPORTANT linguistic theory:

○ Words that tend to occur in similar linguistic contexts tend to resemble each other in meaning


Word2vec

  • A neural model (2 layers) that computes dense vector representations of words
  • Two different architectures:

○ Continuous Bag-of-Words Model (CBOW) (faster)
  ■ Predict the middle word in a window of words
○ Skip-gram Model (better with small amounts of data)
  ■ Predict the context of a middle word, given the word

  • Models the probability of words co-occurring with the current word (candidate)
  • The embedding learned is the output of the hidden layer
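A minimal gensim sketch of both architectures (the toy corpus and hyperparameters are illustrative; sg switches between CBOW and skip-gram, negative enables negative sampling):

```python
from gensim.models import Word2Vec

sentences = [
    ["the", "cat", "sits", "on", "the", "mat"],
    ["the", "dog", "lies", "on", "the", "rug"],
]
# sg=0 -> CBOW, sg=1 -> skip-gram
model = Word2Vec(sentences, vector_size=50, window=2, sg=1,
                 negative=5, min_count=1, epochs=50)
print(model.wv["cat"][:5])            # first dimensions of the dense vector for "cat"
print(model.wv.most_similar("cat"))   # nearest neighbours in the embedding space
```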

Word2vec (2)

CBOW and Skip-gram architectures (diagrams shown on the slide)


Word2vec (3)

  • The output is a softmax function
  • Three new techniques:

1. Subsampling of frequent words:
  ■ Each word in the training set is discarded with a probability (reconstructed below)

  • f(w_i) is the frequency of word w_i and t a threshold
  • Frequent words are discarded more often, so rarer words are relatively kept

  ■ Accelerates learning and even significantly improves the accuracy of the learned vectors for rare words
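The discard probability from Mikolov et al. (2013), reconstructed here because the formula is an image on the slide (in the paper the threshold t is around 10^-5):

```latex
P_{\text{discard}}(w_i) = 1 - \sqrt{\frac{t}{f(w_i)}}
```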


Word2vec (4)

2. Hierarchical Softmax:
○ A tree approximation of the softmax, using a sigmoid at every step
○ Intuition: at every step decide whether to go right or left
○ O(log(n)) instead of O(n)

3. Negative sampling:
○ Alternative to Hierarchical Softmax (works better)
○ Brings up infrequent terms, squeezes the probability of frequent terms
○ Changes the weights of only a selected number of terms, sampled with a probability (reconstructed below)
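The sampling distribution used for negative sampling in Mikolov et al. (2013), reconstructed since the slide shows it as an image (the unigram distribution raised to the 3/4 power):

```latex
P_n(w_i) = \frac{f(w_i)^{3/4}}{\sum_{j} f(w_j)^{3/4}}
```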


Word2vec (5)

  • A serendipitous effect of Word2vec is the linearity (analogy) between embeddings:

  • The famous example: (King - Man) + Woman = Queen
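An illustrative analogy query with pretrained vectors via gensim's downloader (requires downloading the rather large word2vec-google-news-300 model; not part of the slides):

```python
import gensim.downloader as api

wv = api.load("word2vec-google-news-300")
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# "queen" is typically ranked first
```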

GloVe (Global Vectors)

  • Previous embeddings: advantages and drawbacks
  • Methods similar to LSA:

○ Advantage: learn statistical information
○ Drawback: poor at analogy

  • Word2Vec:

○ Advantage: learns analogy
○ Drawback: poor at learning statistical information

  • GloVe combines the two, removing the disadvantages

GloVe (2)

  • Proposed approach:

○ Use the co-occurrence matrix as a starting point
○ The ratio of co-occurrence probabilities is better than raw co-occurrence at separating relevant words (solid vs. gas) from unrelated words:
  ■ Use the ratio as the starting point for the algorithm


GloVe (3)

  • Find a function F!
  • The function should calculate the ratio of co-occurrence probabilities (the formulas, shown as images on the slide, are reconstructed below)
  • The ratio should be encoded in the word vector space:

○ Since the vector spaces are linear, the function should combine the difference of the two word vectors with the context vector

  • The left side is a vector, the right side is a scalar:

○ How do we combine them and keep the linearity? F could be a complicated function, but we need to keep the linear structure:
  ■ To avoid that, F is applied to a dot product
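A reconstruction of the missing formulas, following Pennington et al. (2014), where P_{ik} = X_{ik}/X_i is the probability that word k appears in the context of word i:

```latex
F(w_i, w_j, \tilde{w}_k) = \frac{P_{ik}}{P_{jk}}
\qquad\Rightarrow\qquad
F\big((w_i - w_j)^{\top} \tilde{w}_k\big) = \frac{P_{ik}}{P_{jk}}
```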


GloVe (4)

  • The function is required to be a homomorphism:
  • Given X, the matrix of word-to-word co-occurrences:

  ■ With F being the exponential function, the ratio above resolves to a difference of logarithms

  • Add two bias terms to simplify the previous function

  • Objective function (reconstructed below):
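The homomorphism requirement and the final objective from the GloVe paper, reconstructed since the slide shows them as images:

```latex
F\big((w_i - w_j)^{\top}\tilde{w}_k\big) = \frac{F(w_i^{\top}\tilde{w}_k)}{F(w_j^{\top}\tilde{w}_k)},
\qquad F = \exp \;\Rightarrow\; w_i^{\top}\tilde{w}_k = \log X_{ik} - \log X_i

J = \sum_{i,j=1}^{V} f(X_{ij})\,\big(w_i^{\top}\tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij}\big)^2
```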

GloVe (5)

  • Improve the function:

○ Weight each co-occurrence with a function f(X_ij); the best alpha is ¾, which looks quite similar to the exponent in Word2Vec's negative sampling
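The weighting function from the GloVe paper, reconstructed (the paper uses x_max = 100 and alpha = 3/4):

```latex
f(x) =
\begin{cases}
(x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\
1 & \text{otherwise}
\end{cases}
```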


fastText

  • Use n-grams to share weights between words
  • The goal is to share information between words
  • Split the words in n-grams

○ Distinguish the prefix and suffix with ‘<’ and ‘>’ respectively
○ Example of 3-grams for the word TUGraz:
  ■ ‘<TU’, ‘TUG’, ‘UGr’, ‘Gra’, ‘raz’, ‘az>’
○ Select all n-grams with ‘n’ between 3 and 6 (inclusive)
  ■ Also select the word itself: ‘<TUGraz>’ (see the sketch below)
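A small sketch of the n-gram extraction described above (the helper function is illustrative, not fastText's actual code):

```python
def char_ngrams(word, n_min=3, n_max=6):
    w = f"<{word}>"
    grams = [w[i:i + n] for n in range(n_min, n_max + 1)
             for i in range(len(w) - n + 1)]
    return grams + [w]            # fastText also keeps the whole word "<word>"

print(char_ngrams("TUGraz")[:6])  # ['<TU', 'TUG', 'UGr', 'Gra', 'raz', 'az>']
```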


fastText (2)

  • Feature:

○ Each word is represented as a bag of n-grams

  • Advantage:

○ Uses the skip-gram architecture of Word2Vec
○ Allows computing word representations for words that don't occur in the training data (see the sketch below)
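A minimal gensim FastText sketch showing the out-of-vocabulary effect (the toy corpus and hyperparameters are illustrative):

```python
from gensim.models import FastText

sentences = [["graz", "is", "in", "austria"], ["vienna", "is", "in", "austria"]]
model = FastText(sentences, vector_size=50, window=2, min_count=1,
                 min_n=3, max_n=6, epochs=50)

print("grazer" in model.wv.key_to_index)  # False: never seen during training
print(model.wv["grazer"][:5])             # still gets a vector from its n-grams
```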


Context2Vec

  • Use a bidirectional LSTM to learn the missing word
  • Similar to CBOW, but in this case an LSTM is used
  • Later a Multilayer Perceptron is added to capture complex patterns

  • Similar to Word2Vec (CBOW)

○ Using random sampling to train weights of the network


Context2Vec (2)

Results:

  • Sentence embeddings are close to the term embeddings
  • Outperforms the context representation obtained by averaging word embeddings
  • Surpasses or nearly reaches state-of-the-art results on sentence completion, lexical substitution and word sense disambiguation tasks


Dense Word Representation

End


Second Generation Word Embeddings


TagLM (Pre-ELMo)

  • Pretrain the network first
  • Three basic steps:

○ Pre-train word embeddings and LM embeddings on large corpora
○ Extract word embeddings and Language Model embeddings for a given input sequence
○ Use them in a supervised task


TagLM (2)

Pretrain

  • For each token:

○ Concatenate a character-based embedding and a word embedding
  ■ The character representation captures morphological information (CNN or RNN)

  • Use these embeddings in a bi-directional LSTM

○ Concatenate the two outputs for each layer

  • Stacked bidirectional LSTMs
  • An elaborate sequence tagging head with a CRF loss

TagLM (3)

Architecture


ELMo (Embeddings from LM)

  • Pretraining as in TagLM
  • As previously, use a forward and a backward LSTM
  • The formulation of the problem:

○ Maximize the log likelihood of the forward and backward directions (reconstructed below)
○ Θ_x => parameters for the representation of the tokens
○ Θ_s => parameters for the softmax layer
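The biLM objective from Peters et al. (2018), reconstructed here because the formula is an image on the slide:

```latex
\sum_{k=1}^{N} \Big( \log p(t_k \mid t_1,\dots,t_{k-1};\; \Theta_x, \overrightarrow{\Theta}_{LSTM}, \Theta_s)
             + \log p(t_k \mid t_{k+1},\dots,t_N;\; \Theta_x, \overleftarrow{\Theta}_{LSTM}, \Theta_s) \Big)
```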


ELMo (2)

Pre-train network

  • Different from the previous approach:

○ Shares some weights between the directions, instead of using completely independent parameters

  • For each token t_k, the L-layer biLM computes a set of 2*L + 1 representations (reconstructed below):

○ Where x_k^LM is a token-independent representation (a CNN over characters)
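The set of representations per token, in the paper's notation (a reconstruction):

```latex
R_k = \{\, x_k^{LM},\ \overrightarrow{h}_{k,j}^{LM},\ \overleftarrow{h}_{k,j}^{LM} \mid j = 1,\dots,L \,\}
    = \{\, h_{k,j}^{LM} \mid j = 0,\dots,L \,\}
```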


ELMo (3)

Pre-train network

  • ELMo collapses all the layers into one vector:
  • For a specific task, ELMo calculates:

○ A combination of all the layers, not just the last layer
○ γ^task => allows the model to scale the entire output
  ■ The authors claim it is important to optimize it with the whole process
○ s_j^task => softmax-normalized weights
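The task-specific combination from the ELMo paper, reconstructed since the slide shows it as an image:

```latex
\mathrm{ELMo}_k^{task} = \gamma^{task} \sum_{j=0}^{L} s_j^{task}\, h_{k,j}^{LM}
```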


ELMo (4)

  • Given a task and the pretrained LM, ELMo does:

○ Run the LM on each token and record the output of each layer
○ Then let the architecture for the specific task learn from the representation:
  ■ Because most of the tasks share the same architecture at the lowest level
  ■ The model then forms a task-sensitive representation
○ The weights of ELMo's LM are frozen:
  ■ The ELMo vectors are then concatenated with the token representations x_k and fed to the task architecture


ULMFiT

Universal Language Model Fine Tuning

  • Published at almost the same time as ELMo
  • Similar idea:

○ Transfer Learning with task-specific tuning
○ Language Models capture many aspects useful for downstream tasks:
  ■ Long-term dependencies
  ■ Hierarchical relations
  ■ Sentiment


ULMFiT (2)

Universal Language Model Fine Tuning

  • Consists of three steps:

○ Language Model trained on a general domain
○ Target task LM fine-tuning
○ Target task classifier fine-tuning


ULMFiT (3)

Steps of ULMFiT

  • Target task LM fine tuning:

○ Discriminative fine-tuning:
  ■ Each layer gets a different learning rate (see the sketch below)

  • Empirically found: α_{l-1} = α_l / 2.6

  ■ Slanted triangular learning rates:

  • For adapting to task-specific features: quickly converge to a suitable region and then refine within it
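A sketch of discriminative fine-tuning with PyTorch parameter groups (the stand-in layers and base learning rate are illustrative, not ULMFiT's actual modules):

```python
import torch

layers = [torch.nn.Linear(10, 10) for _ in range(3)]  # stand-in for the LM's layers
base_lr = 0.01
param_groups = [
    # the top layer gets the base lr, each lower layer gets it divided by 2.6 per level
    {"params": layer.parameters(), "lr": base_lr / (2.6 ** depth)}
    for depth, layer in enumerate(reversed(layers))
]
optimizer = torch.optim.Adam(param_groups)
```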


ULMFiT (4)

  • Target task classifier fine tuning:

○ Concat pooling (to avoid losing the information in the last states):
  ■ Concatenates the last hidden state with a max pool and a mean pool of all the hidden states (doesn't it look similar to ELMo?); see the sketch below
○ Gradual unfreezing:
  ■ To avoid catastrophic forgetting:

  • In the first epoch, unfreeze only the last layer (it contains the least general knowledge) and fine-tune it
  • Then go through the layers top-down, unfreezing and fine-tuning one more layer at a time
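A small PyTorch sketch of concat pooling over a sequence of hidden states (the shapes are illustrative):

```python
import torch

hidden = torch.randn(1, 20, 400)               # (batch, seq_len, hidden_size)
last = hidden[:, -1, :]                        # last hidden state
max_pool = hidden.max(dim=1).values            # max over time
mean_pool = hidden.mean(dim=1)                 # mean over time
pooled = torch.cat([last, max_pool, mean_pool], dim=1)
print(pooled.shape)                            # torch.Size([1, 1200])
```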

ULMFiT (5)


BERT (Introduction)

Encoder-Decoder architecture

  • Used for Machine Translation or Sequence Tagging
  • Encoder:

○ Learns the representations of the words

  • Decoder:

○ Generates the new/translated sequence

  • Traditionally:

○ bi-RNN with some attention

  • Transformer:

○ Only via attention


BERT (Introduction 2)

Transformer

  • Uses only attention to learn the connection
  • Architecture:

○ 6 layers of encoders & 6 of decoders

  • Encoders:

○ Self attention & feed forward layers

  • Decoder:

○ Self attention, Encoder-Decoder attention & Feed Forward

  • Tokens are encoded with position embeddings:

○ So the system learns how to relate the words to each other


BERT

From the rich to the poor

  • Two steps for BERT:

○ Pre-train
○ Fine-tune

  • Architecture:

○ Based on the famous paper “Attention Is All You Need”
○ Multi-layer bidirectional Transformer encoder

  • Pretrain on a large dataset so the users don’t have to spend time/resources to do so

BERT (2)

  • Two tasks for the pretraining:

○ Masked LM
  ■ Hide (or mask) a word with a probability of 15% and try to identify it (see the sketch below)
○ Next sentence prediction:
  ■ Given two sentences, the network tries to predict whether one follows the other:

  • Seems to help later fine-tuning for Question Answering
  • Easy to generate (50% of the pairs are true next sentences)
  • IMPORTANT: BERT tunes all the parameters, so the task is trained end-to-end
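An illustrative masked-word query with a pretrained BERT via Hugging Face transformers (downloads the bert-base-uncased model on first use; the example sentence is mine, not the slide's):

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill_mask("Word embeddings are widely used in natural language [MASK]."):
    print(pred["token_str"], round(pred["score"], 3))
```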

BERT (3)

  • Tokens are constructed by summing the token embedding, segment embedding, and position embedding
  • Uses wordpiece tokenization

○ Used to handle Out-Of-Vocabulary words (see the sketch after this list):
  ■ Example: walking => walk ##ing; walked => walk ##ed

  • Training:
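A minimal wordpiece tokenization sketch with Hugging Face transformers (the exact pieces depend on the pretrained vocabulary, so the commented output is only indicative):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("TUGraz embeddings"))
# out-of-vocabulary words are split into known pieces, e.g. ['tu', '##gra', '##z', ...]
```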

Worth mentioning

Similar to BERT

  • RoBERTa:

○ Improves the training methodology of BERT
○ Trained on roughly an order of magnitude more data

  • DistilBERT:

○ Has roughly 40% fewer parameters than BERT
○ Achieves 97% of BERT's performance
○ Good tradeoff


Multilinguality


Multilinguality

Transfer learning for non-contextual Embeddings

  • Transfer embeddings from one language to another:

○ Advantages:
  ■ Transfer Learning
  ■ Machine Translation
  ■ Cross-lingual entity linking
○ The basic idea in most papers is to learn a matrix transformation:
  ■ Exploits the linearity of the spaces


Multilinguality (2)

Example with a small amount of parallel data

  • Learn the mapping either with a Neural Network or with a closed-form algebraic solution (see the sketch below)
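A sketch of the closed-form (orthogonal Procrustes) solution for mapping one embedding space onto another from a small dictionary of translation pairs (random matrices stand in for the real embeddings):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))    # source-language vectors of the dictionary pairs
Y = rng.normal(size=(100, 50))    # target-language vectors of the same pairs

# Orthogonal map W minimizing ||X @ W - Y||_F
U, _, Vt = np.linalg.svd(X.T @ Y)
W = U @ Vt
print(np.linalg.norm(X @ W - Y))  # residual of the fit
```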

Multilinguality (3)

Multilingual BERT

  • Uses a single multilingual vocabulary

○ The WordPiece tokenizer helps to capture semantics across languages:
  ■ To see the effect: try a NER task with M-BERT compared to English BERT
  ■ M-BERT's pretraining on multiple languages has enabled a representational capacity deeper than simple vocabulary memorization

  • Nice effects:

○ Tasks trained in one language perform well in the others


Multilinguality (4)

Multilingual ULM-FiT (MultiFiT)

  • Similar to ULMFiT, but uses a Quasi-RNN (QRNN) instead of an LSTM

○ QRNNs alternate convolutional layers and a recurrent pooling function
○ Outperforms the LSTM (and is about 16 times faster)

  • ULMFiT is restricted to words:

○ MultiFiT uses subword tokenization (sounds familiar?)
○ A new variant of the slanted triangular learning rate + gradual unfreezing
  ■ A cosine variant of the one-cycle policy


Thank you!