
Word Embeddings

Natural Language Processing VU (706.230) - Andi Rexha

02/04/2020


Agenda

Traditional NLP

  • Text preprocessing
  • Bag-of-words model
  • External Resources
  • Sequential classification
  • Other tasks (MT, LM)

Word Embeddings-1

  • Topic Modeling
  • Neural Embeddings
  • Word2Vec
  • GloVe
  • fastText

Word Embeddings-2

  • ELMo
  • ULMFit
  • BERT
  • RoBERTa, DistilBERT
  • Multilinguality

Traditional NLP


How to preprocess text?

  • How do we (humans) split the text to analyse it?

○ “Divide et impera” approach:
  ■ Word split
  ■ Sentence split
  ■ Paragraphs, etc.
○ Is there any other information that we can collect?

Preprocessing


Preprocessing (2)

Other preprocessing steps:

  • Morphological:

○ Stemming/Lemmatization

  • Grammatical:

○ Part of Speech Tagging (PoS)
○ Chunking/Constituency Parsing
○ Dependency Parsing


Preprocessing (3)

Morphological:

  • Stemming:

○ The process of bringing the inflected words to their common root:
  ■ Producing => produc; produced => produc
  ■ are => are

  • Lemmatization:

○ Bringing the words to the same lemma (dictionary form):
  ■ am, is, are => be (see the sketch below)
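A minimal sketch of both steps with NLTK (assumes the nltk package and its WordNet data are installed; the example words mirror the slide):

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()             # needs: nltk.download("wordnet")

print(stemmer.stem("producing"))             # produc
print(stemmer.stem("produced"))              # produc
print(lemmatizer.lemmatize("are", pos="v"))  # be
```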


Preprocessing (4)

Grammatical:

  • Part of Speech Tagging (PoS):

○ Assign to each word a grammatical tag

  • Sentence: “There are different examples that we might use!”

○ Preprocessing of the example: PoS tag and lemma for each token (table shown on the slide)


Preprocessing (5)

  • Parsing:

○ Shallow Parsing (Chunking):
  ■ Adds a tree structure on top of the PoS tags
  ■ First identifies the constituents and then their relations
○ Deep Parsing (Dependency Parsing):
  ■ Parses the sentence into its grammatical structure
  ■ “Head” - “Dependent” form
  ■ It is a directed acyclic graph (mostly implemented as a tree)


Preprocessing (6)

  • Sentence:

“There are different examples that we might use!”

  • Dependency Parsing:
  • Constituency Parsing:
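The dependency and constituency parses themselves are figures on the slide. As an illustration only (not from the slides), a minimal spaCy sketch that reproduces the PoS tags, lemmas, and dependency relations for the example sentence (assumes the en_core_web_sm model is installed):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("There are different examples that we might use!")

for token in doc:
    # surface form, lemma, PoS tag, dependency label, syntactic head
    print(token.text, token.lemma_, token.pos_, token.dep_, token.head.text)
```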

Bag-of-words Model

  • Use the preprocessed text in Machine Learning tasks

■ How to encode the features?

  • A major paradigm in NLP and IR - Bag-of-Words (BoW):

○ The text is considered to be a set of its words
○ Grammatical dependencies are ignored
○ Feature encoding:
  ■ Dictionary based (nominal features)
  ■ One-hot encoded / frequency encoded


Bag-of-words Model (2)

  • Sentences:
  • Features:
  • Representation of features for Machine Learning (table on the slide; see the sketch below):
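A minimal scikit-learn sketch of the bag-of-words representation (the two sentences are illustrative, not the slide's table):

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "There are different examples that we might use!",
    "We might use different words.",
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(sentences)     # sparse document-term matrix
print(vectorizer.get_feature_names_out())   # the dictionary (one feature per word)
print(X.toarray())                          # frequency-encoded bag-of-words vectors
```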

Feature encoding

  • PoS tagging:

○ Word + PoS tag as part of dictionary:
  ■ Example: John-PN

  • Chunking:

○ Use Noun Phrases:
  ■ Example: the bank account

  • Dependency Parsing:

○ Word + Dependency Path as part of dictionary:
  ■ Example: use-nsubj-acl:relcl


External Resources

  • The feature space is sparse and high-dimensional:

○ We miss linguistic resources:
  ■ Synonyms, antonyms, hyponyms, hypernyms, ...
  ■ Enrich the features of our examples with their synonyms
  ■ Set negative weights for antonyms

  • External resources to mitigate sparsity:

○ WordNet: a lexical database for English, which groups words into synsets
○ Wiktionary: a free multilingual dictionary enriched with relations between words


Sequential Classification

  • We need to classify a sequence of tokens:

○ Information Extraction:
  ■ Example: extract the names of companies from documents (open domain)

  • How to Model?

○ Classify each token as part or not part of the information:
  ■ The classification of the current token depends on the classification of the previous one
  ■ Sequential classifier
  ■ Still not enough; we need to encode the output
  ■ We need to know where every “annotation” starts and ends


Sequential Classification (2)

  • Why do we need a schema?

○ Example: I work for TU Graz Austria!

  • BILOU: Beginning, Inside, Last, Outside, Unit
  • BIO (most used): Beginning, Inside, Outside
  • BILOU has been shown to perform better on some datasets
  • Example: “The Know Center GmbH is a spinoff of TUGraz.”

○ BILOU: The-O; Know-B; Center-I; GmbH-L; is-O; a-O; spinoff-O; of-O; TUGraz-U
○ BIO: The-O; Know-B; Center-I; GmbH-I; is-O; a-O; spinoff-O; of-O; TUGraz-B

  • Sequential classifiers: Hidden Markov Models, CRFs, etc. (see the encoding sketch below)

1: Named Entity recognition: https://www.aclweb.org/anthology/W09-1119.pdf
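A toy sketch of how token-level spans map to BIO tags for the example above (the helper function and the span indices are illustrative, not from the slides):

```python
def bio_encode(tokens, spans):
    """spans: list of (start, end) token indices, end exclusive."""
    tags = ["O"] * len(tokens)
    for start, end in spans:
        tags[start] = "B"
        for i in range(start + 1, end):
            tags[i] = "I"
    return tags

tokens = "The Know Center GmbH is a spinoff of TUGraz .".split()
print(list(zip(tokens, bio_encode(tokens, [(1, 4), (8, 9)]))))
# ... ('Know', 'B'), ('Center', 'I'), ('GmbH', 'I'), ... ('TUGraz', 'B') ...
```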


Sentiment Analysis

  • Assign a sentiment to a piece of text:

○ Binary (like/dislike)
○ Rating based (e.g. 1-5)

  • Assign the sentiment to a target phrase:

○ Usually involving features around the target

  • External resources:

○ SentiWordnet http://sentiwordnet.isti.cnr.it/


Language model

  • Generating the next token of a sequence
  • Usually based on collecting co-occurrences of words within a window:

○ Statistics are collected and the next word is predicted based on this information
○ Mainly, it models the probability of the sequence (see the formula below)

  • In traditional approaches, solved as an n-gram approximation:

○ Usually solved by combining different sizes of n-grams and weighting them
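The probability the slide refers to, written out in standard notation (a reconstruction, not copied from the slide):

```latex
P(w_1,\dots,w_m) = \prod_{i=1}^{m} P(w_i \mid w_1,\dots,w_{i-1})
                 \approx \prod_{i=1}^{m} P(w_i \mid w_{i-(n-1)},\dots,w_{i-1})
```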


Machine Translation

  • Translate text from one language to another
  • Different approaches:

○ Rule based:
  ■ Usually by using a dictionary
○ Statistical (involving bilingual aligned corpora):
  ■ IBM models (1-6) for aligning and training
○ Hybrid:
  ■ Combines the two previous techniques


Traditional NLP

End


Dense Word Representation


From Sparse to Dense

  • Topic Modeling
  • Since LSA (Latent Semantic Analysis):

○ These methods utilize low-rank approximations to decompose large matrices that capture statistical information about a corpus

  • Later methods:

○ pLSA (Probabilistic Latent Semantic Analysis)
  ■ Uses probabilities instead of SVD (Singular Value Decomposition)

  • LDA (Latent Dirichlet Allocation):

○ A Bayesian version of pLSA
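A minimal LSA-style sketch with scikit-learn: build a TF-IDF term-document matrix and take a low-rank SVD of it (the toy documents and the number of components are illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stock markets fell sharply today",
]
X = TfidfVectorizer().fit_transform(docs)   # sparse term-document statistics
lsa = TruncatedSVD(n_components=2)          # low-rank approximation
doc_topics = lsa.fit_transform(X)           # dense "topic" vector per document
print(doc_topics)
```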


Neural embeddings

  • Language models suffer from the “Curse of dimensionality”:

○ The word sequence that we want to predict is likely to be different from the ones we have seen in training
○ Having seen “The cat is walking in the bedroom” should help us generate “A dog was running in the room”:
  ■ Similar semantics and grammatical roles

  • A Neural Probabilistic Language Model:

○ Bengio et al. (2003) implemented the idea of Mnih and Hinton (1989):
  ■ Learned a language model and embeddings for the words


Neural embeddings (2)

  • Bengio’s architecture:

○ Approximate the next-word function with a window approach (see the formula below)
○ Model the approximation with a neural network
○ Input layer in a one-hot-encoding form
○ Two hidden layers (the first is more of a random initialization)
○ A tanh intermediate layer
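The model of Bengio et al. (2003), written out as a reminder (standard formulation from the paper, not copied from the slide): x concatenates the embeddings C(w) of the n-1 previous words, and the next word is predicted with a softmax.

```latex
x = \big(C(w_{t-1}), C(w_{t-2}), \dots, C(w_{t-n+1})\big)
y = b + W x + U \tanh(d + H x)
P(w_t \mid w_{t-1}, \dots, w_{t-n+1}) = \mathrm{softmax}(y)
```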


Neural embeddings (3)

  • A final softmax layer:

○ Outputting the next word in the sequence

  • Learned word representations for 18K words, with almost 1M words in the corpus
  • IMPORTANT linguistic theory:

○ Words that tend to occur in similar linguistic contexts tend to resemble each other in meaning


Word2vec

  • A neural model (2 layers) that computes dense vector representations of words
  • Two different architectures:

○ Continuous Bag-of-Words Model (CBOW) (faster)
  ■ Predict the middle word in a window of words
○ Skip-gram Model (better with small amounts of data)
  ■ Predict the context of a middle word, given the word

  • Models the probability of words co-occurring with the current word (candidate)
  • The embedding learned is the output of the hidden layer
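A minimal gensim sketch of both architectures (the toy corpus and hyperparameters are illustrative; sg switches between CBOW and skip-gram, negative enables negative sampling):

```python
from gensim.models import Word2Vec

sentences = [
    ["the", "cat", "sits", "on", "the", "mat"],
    ["the", "dog", "lies", "on", "the", "rug"],
]
# sg=0 -> CBOW, sg=1 -> skip-gram
model = Word2Vec(sentences, vector_size=50, window=2, sg=1,
                 negative=5, min_count=1, epochs=50)
print(model.wv["cat"][:5])            # first dimensions of the dense vector for "cat"
print(model.wv.most_similar("cat"))   # nearest neighbours in the embedding space
```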

Word2vec (2)

CBOW and Skip-gram architectures (diagrams shown on the slide)


Word2vec (3)

  • The output is a softmax function
  • Three new techniques:

1. Subsampling of frequent words:
  ■ Each word in the training set is discarded with a probability (reconstructed below)

  • f(w_i) is the frequency of word w_i and t a threshold
  • Frequent words are discarded more often, so rarer words are relatively kept

  ■ Accelerates learning and even significantly improves the accuracy of the learned vectors for rare words
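The discard probability from Mikolov et al. (2013), reconstructed here because the formula is an image on the slide (in the paper the threshold t is around 10^-5):

```latex
P_{\text{discard}}(w_i) = 1 - \sqrt{\frac{t}{f(w_i)}}
```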


Word2vec (4)

2. Hierarchical Softmax:
○ A tree approximation of the softmax, using a sigmoid at every step
○ Intuition: at every step decide whether to go right or left
○ O(log(n)) instead of O(n)

3. Negative sampling:
○ Alternative to Hierarchical Softmax (works better)
○ Brings up infrequent terms, squeezes the probability of frequent terms
○ Changes the weights of only a selected number of terms, sampled with a probability (reconstructed below)
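The sampling distribution used for negative sampling in Mikolov et al. (2013), reconstructed since the slide shows it as an image (the unigram distribution raised to the 3/4 power):

```latex
P_n(w_i) = \frac{f(w_i)^{3/4}}{\sum_{j} f(w_j)^{3/4}}
```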


Word2vec (5)

  • A serendipitous effect of Word2vec is the linearity (analogy) between embeddings:

  • The famous example: (King - Man) + Woman = Queen
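An illustrative analogy query with pretrained vectors via gensim's downloader (requires downloading the rather large word2vec-google-news-300 model; not part of the slides):

```python
import gensim.downloader as api

wv = api.load("word2vec-google-news-300")
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# "queen" is typically ranked first
```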

GloVe (Global Vectors)

  • Previous embeddings: advantages and drawbacks
  • Methods similar to LSA:

○ Advantage: learn statistical information
○ Drawback: poor at analogy

  • Word2Vec:

○ Advantage: learns analogy
○ Drawback: poor at learning statistical information

  • GloVe combines the two, removing the disadvantages

GloVe (2)

  • Proposed approach:

○ Use the co-occurrence matrix as a starting point
○ The ratio of co-occurrence probabilities is better than raw co-occurrence at separating relevant words (solid vs. gas) from unrelated words:
  ■ Use the ratio as the starting point for the algorithm


GloVe (3)

  • Find a function F!
  • The function should calculate the ratio of co-occurrence probabilities (the formulas, shown as images on the slide, are reconstructed below)
  • The ratio should be encoded in the word vector space:

○ Since the vector spaces are linear, the function should combine the difference of the two word vectors with the context vector

  • The left side is a vector, the right side is a scalar:

○ How do we combine them and keep the linearity? F could be a complicated function, but we need to keep the linear structure:
  ■ To avoid that, F is applied to a dot product
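A reconstruction of the missing formulas, following Pennington et al. (2014), where P_{ik} = X_{ik}/X_i is the probability that word k appears in the context of word i:

```latex
F(w_i, w_j, \tilde{w}_k) = \frac{P_{ik}}{P_{jk}}
\qquad\Rightarrow\qquad
F\big((w_i - w_j)^{\top} \tilde{w}_k\big) = \frac{P_{ik}}{P_{jk}}
```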


GloVe (4)

  • The function is required to be a homomorphism:
  • Given X, the matrix of word-to-word co-occurrences:

  ■ With F being the exponential function, the ratio above resolves to a difference of logarithms

  • Add two bias terms to simplify the previous function

  • Objective function (reconstructed below):
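The homomorphism requirement and the final objective from the GloVe paper, reconstructed since the slide shows them as images:

```latex
F\big((w_i - w_j)^{\top}\tilde{w}_k\big) = \frac{F(w_i^{\top}\tilde{w}_k)}{F(w_j^{\top}\tilde{w}_k)},
\qquad F = \exp \;\Rightarrow\; w_i^{\top}\tilde{w}_k = \log X_{ik} - \log X_i

J = \sum_{i,j=1}^{V} f(X_{ij})\,\big(w_i^{\top}\tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij}\big)^2
```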

GloVe (5)

  • Improve the function:

○ Weight each co-occurrence with a function f(X_ij); the best alpha is ¾, which looks quite similar to the exponent in Word2Vec's negative sampling
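The weighting function from the GloVe paper, reconstructed (the paper uses x_max = 100 and alpha = 3/4):

```latex
f(x) =
\begin{cases}
(x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\
1 & \text{otherwise}
\end{cases}
```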


fastText

  • Use n-grams to share weights between words
  • The goal is to share information between words
  • Split the words in n-grams

○ Distinguish the prefix and suffix with ‘<’ and ‘>’ respectively
○ Example of 3-grams for the word TUGraz:
  ■ ‘<TU’, ‘TUG’, ‘UGr’, ‘Gra’, ‘raz’, ‘az>’
○ Select all n-grams with ‘n’ between 3 and 6 (inclusive)
  ■ Also select the word itself: ‘<TUGraz>’ (see the sketch below)
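A small sketch of the n-gram extraction described above (the helper function is illustrative, not fastText's actual code):

```python
def char_ngrams(word, n_min=3, n_max=6):
    w = f"<{word}>"
    grams = [w[i:i + n] for n in range(n_min, n_max + 1)
             for i in range(len(w) - n + 1)]
    return grams + [w]            # fastText also keeps the whole word "<word>"

print(char_ngrams("TUGraz")[:6])  # ['<TU', 'TUG', 'UGr', 'Gra', 'raz', 'az>']
```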


fastText (2)

  • Feature:

○ Each word is represented as a bag of n-grams

  • Advantage:

○ Uses the skip-gram architecture of Word2Vec
○ Allows computing word representations for words that don't occur in the training data (see the sketch below)
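A minimal gensim FastText sketch showing the out-of-vocabulary effect (the toy corpus and hyperparameters are illustrative):

```python
from gensim.models import FastText

sentences = [["graz", "is", "in", "austria"], ["vienna", "is", "in", "austria"]]
model = FastText(sentences, vector_size=50, window=2, min_count=1,
                 min_n=3, max_n=6, epochs=50)

print("grazer" in model.wv.key_to_index)  # False: never seen during training
print(model.wv["grazer"][:5])             # still gets a vector from its n-grams
```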


Context2Vec

  • Use a bidirectional LSTM to learn the missing word
  • Similar to CBOW, but in this case an LSTM is used
  • Later a Multilayer Perceptron is added to capture complex patterns

  • Similar to Word2Vec (CBOW)

○ Using random sampling to train weights of the network


Context2Vec (2)

Results:

  • Sentence embeddings are close to the term embeddings
  • Outperforms the context representation obtained by averaging word embeddings
  • Surpasses or nearly reaches state-of-the-art results on sentence completion, lexical substitution and word sense disambiguation tasks


Dense Word Representation

End


Second Generation Word Embeddings


TagLM (Pre-ELMo)

  • Pretrain the network first
  • Three basic steps:

○ Pre-train word embeddings and LM embeddings on large corpora
○ Extract word embeddings and Language Model embeddings for a given input sequence
○ Use them in a supervised task


TagLM (2)

Pretrain

  • For each token:

○ Concatenate a character-based embedding and a word embedding
  ■ The character representation captures morphological information (CNN or RNN)

  • Use these embeddings in a bi-directional LSTM

○ Concatenate the two outputs for each layer

  • Stacked bidirectional LSTMs
  • An elaborate sequence tagging head with a CRF loss

TagLM (3)

Architecture


ELMo (Embeddings from LM)

  • Pretraining as in TagLM
  • As previously, use a forward and a backward LSTM
  • The formulation of the problem:

○ Maximize the log likelihood of the forward and backward directions (reconstructed below)
○ Θ_x => parameters for the representation of the tokens
○ Θ_s => parameters for the softmax layer
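The biLM objective from Peters et al. (2018), reconstructed here because the formula is an image on the slide:

```latex
\sum_{k=1}^{N} \Big( \log p(t_k \mid t_1,\dots,t_{k-1};\; \Theta_x, \overrightarrow{\Theta}_{LSTM}, \Theta_s)
             + \log p(t_k \mid t_{k+1},\dots,t_N;\; \Theta_x, \overleftarrow{\Theta}_{LSTM}, \Theta_s) \Big)
```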


ELMo (2)

Pre-train network

  • Different from the previous approach:

○ Shares some weights between the directions, instead of using completely independent parameters

  • For each token t_k, the L-layer biLM computes a set of 2*L + 1 representations (reconstructed below):

○ Where x_k^LM is a token-independent representation (a CNN over characters)
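The set of representations per token, in the paper's notation (a reconstruction):

```latex
R_k = \{\, x_k^{LM},\ \overrightarrow{h}_{k,j}^{LM},\ \overleftarrow{h}_{k,j}^{LM} \mid j = 1,\dots,L \,\}
    = \{\, h_{k,j}^{LM} \mid j = 0,\dots,L \,\}
```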


ELMo (3)

Pre-train network

  • ELMo collapses all the layers into one vector:
  • For a specific task, ELMo calculates:

○ A combination of all the layers, not just the last layer
○ γ^task => allows the model to scale the entire output
  ■ The authors claim it is important to optimize it with the whole process
○ s_j^task => softmax-normalized weights
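The task-specific combination from the ELMo paper, reconstructed since the slide shows it as an image:

```latex
\mathrm{ELMo}_k^{task} = \gamma^{task} \sum_{j=0}^{L} s_j^{task}\, h_{k,j}^{LM}
```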


ELMo (4)

  • Given a task and the pretrained LM, ELMo does:

○ Run the LM on each token and record the output of each layer
○ Then let the architecture for the specific task learn from the representation:
  ■ Because most of the tasks share the same architecture at the lowest level
  ■ The model then forms a task-sensitive representation
○ The weights of ELMo's LM are frozen:
  ■ The ELMo vectors are then concatenated with the token representations x_k and fed to the task architecture


ULMFiT

Universal Language Model Fine Tuning

  • Published at almost the same time as ELMo
  • Similar idea:

○ Transfer Learning with task-specific tuning
○ Language Models capture many aspects useful for downstream tasks:
  ■ Long-term dependencies
  ■ Hierarchical relations
  ■ Sentiment


ULMFiT (2)

Universal Language Model Fine Tuning

  • Consists of three steps:

○ Language Model trained on a general domain
○ Target task LM fine-tuning
○ Target task classifier fine-tuning


ULMFiT (3)

Steps of ULMFiT

  • Target task LM fine tuning:

○ Discriminative fine-tuning:
  ■ Each layer gets a different learning rate (see the sketch below)

  • Empirically found: α_{l-1} = α_l / 2.6

  ■ Slanted triangular learning rates:

  • For adapting to task-specific features: quickly converge to a suitable region and then refine within it
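A sketch of discriminative fine-tuning with PyTorch parameter groups (the stand-in layers and base learning rate are illustrative, not ULMFiT's actual modules):

```python
import torch

layers = [torch.nn.Linear(10, 10) for _ in range(3)]  # stand-in for the LM's layers
base_lr = 0.01
param_groups = [
    # the top layer gets the base lr, each lower layer gets it divided by 2.6 per level
    {"params": layer.parameters(), "lr": base_lr / (2.6 ** depth)}
    for depth, layer in enumerate(reversed(layers))
]
optimizer = torch.optim.Adam(param_groups)
```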


ULMFiT (4)

  • Target task classifier fine tuning:

○ Concat pooling (to avoid losing the information in the last states):
  ■ Concatenates the last hidden state with a max pool and a mean pool of all the hidden states (doesn't it look similar to ELMo?); see the sketch below
○ Gradual unfreezing:
  ■ To avoid catastrophic forgetting:

  • In the first epoch, unfreeze only the last layer (it contains the least general knowledge) and fine-tune it
  • Then go through the layers top-down, unfreezing and fine-tuning one more layer at a time
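A small PyTorch sketch of concat pooling over a sequence of hidden states (the shapes are illustrative):

```python
import torch

hidden = torch.randn(1, 20, 400)               # (batch, seq_len, hidden_size)
last = hidden[:, -1, :]                        # last hidden state
max_pool = hidden.max(dim=1).values            # max over time
mean_pool = hidden.mean(dim=1)                 # mean over time
pooled = torch.cat([last, max_pool, mean_pool], dim=1)
print(pooled.shape)                            # torch.Size([1, 1200])
```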

ULMFiT (5)


BERT (Introduction)

Encoder-Decoder architecture

  • Used for Machine Translation or Sequence Tagging
  • Encoder:

○ Learns the representations of the words

  • Decoder:

○ Generates the new/translated sequence

  • Traditionally:

○ bi-RNN with some attention

  • Transformer:

○ Only via attention


BERT (Introduction 2)

Transformer

  • Uses only attention to learn the connection
  • Architecture:

○ 6 layers of encoders & 6 of decoders

  • Encoders:

○ Self attention & feed forward layers

  • Decoder:

○ Self attention, Encoder-Decoder attention & Feed Forward

  • Tokens are encoded with position embeddings:

○ So the system learns how to relate the words to each other


BERT

From the rich to the poor

  • Two steps for BERT:

○ Pre-train
○ Fine-tune

  • Architecture:

○ Based on the famous paper “Attention Is All You Need”
○ Multi-layer bidirectional Transformer encoder

  • Pretrain on a large dataset so the users don’t have to spend time/resources to do so

BERT (2)

  • Two tasks for the pretraining:

○ Masked LM
  ■ Hide (or mask) a word with a probability of 15% and try to identify it (see the sketch below)
○ Next sentence prediction:
  ■ Given two sentences, the network tries to predict whether one follows the other:

  • Seems to help later fine-tuning for Question Answering
  • Easy to generate (50% of the pairs are true next sentences)
  • IMPORTANT: BERT tunes all the parameters, so the task is trained end-to-end
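An illustrative masked-word query with a pretrained BERT via Hugging Face transformers (downloads the bert-base-uncased model on first use; the example sentence is mine, not the slide's):

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill_mask("Word embeddings are widely used in natural language [MASK]."):
    print(pred["token_str"], round(pred["score"], 3))
```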

BERT (3)

  • Tokens are constructed by summing the token embedding, segment embedding, and position embedding
  • Uses wordpiece tokenization

○ Used to handle Out-Of-Vocabulary words (see the sketch after this list):
  ■ Example: walking => walk ##ing; walked => walk ##ed

  • Training:
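A minimal wordpiece tokenization sketch with Hugging Face transformers (the exact pieces depend on the pretrained vocabulary, so the commented output is only indicative):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("TUGraz embeddings"))
# out-of-vocabulary words are split into known pieces, e.g. ['tu', '##gra', '##z', ...]
```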

Worth mentioning

Similar to BERT

  • RoBERTa:

○ Improves the training methodology of BERT
○ Trained on roughly an order of magnitude more data

  • DistilBERT:

○ Has roughly 40% fewer parameters than BERT
○ Achieves 97% of BERT's performance
○ Good tradeoff


Multilinguality


Multilinguality

Transfer learning for non-contextual Embeddings

  • Transfer embeddings from one language to another:

○ Advantages:
  ■ Transfer Learning
  ■ Machine Translation
  ■ Cross-lingual entity linking
○ The basic idea in most papers is to learn a matrix transformation:
  ■ Exploits the linearity of the spaces


Multilinguality (2)

Example with a small amount of parallel data

  • Learn the mapping either with a Neural Network or with a closed-form algebraic solution (see the sketch below)
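A sketch of the closed-form (orthogonal Procrustes) solution for mapping one embedding space onto another from a small dictionary of translation pairs (random matrices stand in for the real embeddings):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))    # source-language vectors of the dictionary pairs
Y = rng.normal(size=(100, 50))    # target-language vectors of the same pairs

# Orthogonal map W minimizing ||X @ W - Y||_F
U, _, Vt = np.linalg.svd(X.T @ Y)
W = U @ Vt
print(np.linalg.norm(X @ W - Y))  # residual of the fit
```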

Multilinguality (3)

Multilingual BERT

  • Uses a single multilingual vocabulary

○ The WordPiece tokenizer helps to capture semantics across languages:
  ■ To see the effect: try a NER task with M-BERT compared to English BERT
  ■ M-BERT's pretraining on multiple languages has enabled a representational capacity deeper than simple vocabulary memorization

  • Nice effects:

○ Tasks trained in one language perform well in the others


Multilinguality (4)

Multilingual ULM-FiT (MultiFiT)

  • Similar to ULMFiT, but uses a Quasi-RNN (QRNN) instead of an LSTM

○ QRNNs alternate convolutional layers and a recurrent pooling function
○ Outperforms the LSTM (and is about 16 times faster)

  • ULMFiT is restricted to words:

○ MultiFiT uses subword tokenization (sounds familiar?)
○ A new variant of the slanted triangular learning rate + gradual unfreezing
  ■ A cosine variant of the one-cycle policy


Thank you!