Neural Network Approaches to Representation Learning for NLP
Navid Rekabsaz Idiap Research Institute
@navidrekabsaz navid.rekabsaz@idiap.ch
Agenda
§ Brief Intro to Deep Learning - Neural Networks
§ Word Representation Learning - Neural …
Dimensions of matrix products:
§ (1×d) · (d×1) = 1 (a scalar: the dot product of two vectors)
§ (1×d) · (d×e) = 1×e (vector times matrix)
§ (l×m) · (m×n) = l×n (matrix times matrix)
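These shape rules can be checked with a few lines of numpy (the library choice is mine; the slides do not prescribe one, and the concrete sizes are illustrative):

```python
import numpy as np

d, e, l, m, n = 4, 3, 2, 5, 6

v = np.ones((1, d)); u = np.ones((d, 1))
print((v @ u).shape)   # (1, 1): dot product, a single number

W = np.ones((d, e))
print((v @ W).shape)   # (1, 3) = 1×e: vector times matrix

A = np.ones((l, m)); B = np.ones((m, n))
print((A @ B).shape)   # (2, 6) = l×n: matrix times matrix
```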
§ Neural Networks are non-linear functions with many parameters
§ They consist of several simple non-linear operations
§ Normally, the objective is to maximize the likelihood of the training data
§ Generally optimized using Stochastic Gradient Descent (SGD)
[Figure: a two-layer network. The input vector is multiplied by parameter matrices of size 3×4 and 4×2 to produce a prediction, which is compared with the labels by the loss function.]
Initialize parameters
Loop over training data (or minibatches):
1. Do forward pass: given input x⃗, predict output ŷ
2. Calculate the loss function by comparing ŷ with the labels y
3. Do backpropagation: calculate the gradient of the loss function with respect to each parameter
4. Update parameters in the opposite direction of the gradient
5. Exit if some stopping criterion is met
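A minimal sketch of this loop in numpy, using the 3×4 and 4×2 parameter matrices from the figure above; the toy data, tanh activation, and learning rate are my assumptions for illustration, not the slides':

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(3, 4))      # parameter matrix, size 3x4
W2 = rng.normal(scale=0.1, size=(4, 2))      # parameter matrix, size 4x2

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# toy inputs and labels, assumed for illustration
data = [(np.array([[1.0, 0.0, 1.0]]), 0),
        (np.array([[0.0, 1.0, 1.0]]), 1)]
lr = 0.1                                     # learning rate

for epoch in range(100):                     # stopping criterion: fixed epoch budget
    for x, y in data:
        # 1. forward pass: predict y_hat from the input
        h = np.tanh(x @ W1)                  # simple non-linear operation
        p = softmax(h @ W2)                  # predicted distribution over 2 classes
        # 2. loss: negative log-likelihood of the true label
        loss = -np.log(p[0, y])
        # 3. backpropagation: gradients of the loss w.r.t. each parameter
        dlogits = p.copy(); dlogits[0, y] -= 1.0
        dW2 = h.T @ dlogits
        dW1 = x.T @ ((dlogits @ W2.T) * (1.0 - h ** 2))
        # 4. update in the opposite direction of the gradient (SGD)
        W1 -= lr * dW1
        W2 -= lr * dW2
print(f"final loss on last sample: {loss:.4f}")
```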
[Figure: common activation functions. The sigmoid squashes its input into (0, 1), yielding a probability value; the softmax outputs a vector whose values sum to 1.]
Figure fetched from https://towardsdatascience.com/activation-functions-neural-networks-1cbd9f8d91d6
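A short numpy sketch of these two activations (function names are mine):

```python
import numpy as np

def sigmoid(z):
    # squashes any real number into (0, 1): a probability value
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # maps a score vector to a distribution; the output values sum to 1
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / e.sum()

print(sigmoid(0.0))                               # 0.5
print(softmax(np.array([1.0, 2.0, 3.0])).sum())   # 1.0
```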
Adapted from http://mlss.tuebingen.mpg.de/2017/speaker_slides/Zoubin1.pdf
Word Embedding Model
[Diagram: a word embedding model maps each word w₁ … wₙ to a dense vector.]
Vector representations of words projected in two-dimensional space
Firth, J. R. (1957). A Synopsis of Linguistic Theory 1930–1955.
Nida [1975]
Algorithmic intuition:
… sugar, a sliced lemon, a tablespoonful of apricot preserve or jam, a pinch each of …
… their enjoyment. Cautiously she sampled her first pineapple and another fruit whose taste she likened …
… well suited to programming on the digital computer. In finding the optimal R-stage policy from …
… for the purpose of gathering data and information necessary for the study authorized in the …
Term–context count matrix (how often each row word co-occurs with each column word):

             aardvark  computer  data  pinch  result  sugar
apricot          0         0       0     1      0       1
pineapple        0         0       0     1      0       1
digital          0         2       1     0      1       0
information      0         1       6     0      4       0
similarity(digital, information) = cosine(v⃗_digital, v⃗_information)
where v⃗_digital and v⃗_information are the corresponding rows of the count matrix above.
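A sketch of this similarity computation over the count vectors from the table above (numpy, with the counts hard-coded for illustration):

```python
import numpy as np

# rows of the count matrix above, over the contexts
# (aardvark, computer, data, pinch, result, sugar)
counts = {
    "apricot":     np.array([0, 0, 0, 1, 0, 1]),
    "pineapple":   np.array([0, 0, 0, 1, 0, 1]),
    "digital":     np.array([0, 2, 1, 0, 1, 0]),
    "information": np.array([0, 1, 6, 0, 4, 0]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(counts["digital"], counts["information"]))  # ~0.67: shared contexts
print(cosine(counts["apricot"], counts["digital"]))      # 0.0: no shared contexts
```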
Vocabulary size of the corpus: typically ~10,000–500,000 words, so the count vectors are very high-dimensional.
a. Set word w as input and context word c as output
b. Calculate the output of the network, namely p(c|w): the probability of observing context word c given word w
c. Optimize the network to maximize this likelihood
Recipe for creating (dense) word embeddings with neural networks. Details come next!
Window size of 2
http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
https://web.stanford.edu/~jurafsky/slp3/
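A minimal sketch of how (word, context) training pairs could be extracted with a window size of 2; the function name and example sentence are mine, for illustration only:

```python
def skipgram_pairs(tokens, window=2):
    """Extract (word, context) training pairs from a token list."""
    pairs = []
    for i, word in enumerate(tokens):
        # context = up to `window` words on each side of the input word
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((word, tokens[j]))
    return pairs

print(skipgram_pairs("a bottle of tesgüino is on the table".split()))
```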
Training sample: (Tesgüino, drink)
[Diagram: skip-gram architecture. Input layer (one-hot encoding, 1×|V|) → words matrix → hidden layer with linear activation (1×d) → context-words matrix → output layer with softmax (1×|V|). The forward pass runs left to right; backpropagation runs right to left.]
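A sketch of one forward pass through this architecture in numpy; the vocabulary size, embedding dimension, and initialization are toy assumptions:

```python
import numpy as np

V, d = 5, 3                            # vocabulary size |V| and embedding dim d (toy values)
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(V, d)) # words matrix
C = rng.normal(scale=0.1, size=(V, d)) # context-words matrix

w = 2                                  # index of the input word, e.g. "tesgüino"
x = np.zeros(V); x[w] = 1.0            # one-hot input, 1×|V|

h = x @ W                              # hidden layer (linear activation): just row w of W
scores = h @ C.T                       # one score per candidate context word, 1×|V|
p = np.exp(scores - scores.max())
p /= p.sum()                           # softmax output: p(c | w) for every context word c
print(p)
```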
[Figure sequence: training on the sample (Tesgüino, drink) pulls the word vector of "Tesgüino" toward the context vector of "drink"; since "Ale" also appears with "drink", the word vectors of "Tesgüino" and "Ale" end up close together.]
p(c|w) = exp(v⃗_c · v⃗_w) / Σ_{c′ ∈ D} exp(v⃗_{c′} · v⃗_w), where D is the vocabulary (the set of all words)
Sorry! The denominator is too expensive: it sums over the entire vocabulary.
Skip-gram with negative sampling: maximize log σ(v⃗_c · v⃗_w) + Σᵢ₌₁ᵏ log σ(−v⃗_{cᵢ} · v⃗_w), where σ is the sigmoid function and the cᵢ are k randomly sampled negative context words, typically k ≈ 2–10.
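A sketch of one negative-sampling SGD step following this objective; the function name, learning rate, and calling convention are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgns_update(W, C, w, c_pos, neg_ids, lr=0.025):
    """One SGD step of skip-gram with negative sampling.
    W: words matrix, C: context-words matrix,
    w: input word id, c_pos: observed context id, neg_ids: k sampled negative ids."""
    v_w = W[w].copy()
    # positive pair: push sigmoid(v_c . v_w) toward 1
    g = sigmoid(C[c_pos] @ v_w) - 1.0
    grad_w = g * C[c_pos]
    C[c_pos] -= lr * g * v_w
    # negative pairs: push sigmoid(v_c . v_w) toward 0
    for c_neg in neg_ids:
        g = sigmoid(C[c_neg] @ v_w)
        grad_w += g * C[c_neg]
        C[c_neg] -= lr * g * v_w
    W[w] -= lr * grad_w
```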
[Figure sequence: with negative sampling, the word vector of "Tesgüino" and the context vector of "drink" are pulled together over successive updates.]
[Figure: despite referring to the same gender-neutral pronoun, the word vectors of "Nurse" and "Housekeeper" lie closer to the context vector of "she", while "Manager" lies closer to "he".]
§ The bias of 350 occupations toward female/male in a word2vec model trained on English Wikipedia
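One simple way to quantify such occupation bias (an assumption on my part; the slides do not spell out the exact measure) is the difference of cosine similarities to the "she" and "he" vectors:

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def gender_bias(v_occupation, v_she, v_he):
    # positive: the occupation sits closer to "she"; negative: closer to "he"
    return cosine(v_occupation, v_she) - cosine(v_occupation, v_he)

# hypothetical usage with vectors taken from a trained word2vec model:
# bias = gender_bias(emb["nurse"], emb["she"], emb["he"])
```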