CS11-747 Neural Networks for NLP
Introduction, Bag-of-words, and Multi-layer Perceptron
Graham Neubig
Site https://phontron.com/class/nn4nlp2020/
Language is Hard!

Are These Sentences OK?
- Jane went to the store.
- store to Jane went
A (Classical) Engineering Solution
- Create a grammar of the language
- Handle all of its morphology and exceptions
- Find ways to express soft preferences between analyses

Class Format
- Before class: read the assigned material (should be easy) and think about problems you want to solve in NLP.
- During class: we summarize the material, field questions, elaborate on details and talk about advanced topics, often walking through some demonstration code or equations.
- Some weeks we will have recitation.
- Coverage: the full scope of NLP, with not as much depth on learning fundamentals as other DL classes.
Assignments
- Assignment 1: individually implement a text classifier and fill in a questionnaire on project topics.
- Assignment 2: pick a topic and describe the state-of-the-art, then implement and reproduce results from a state-of-the-art model.
- Final project: work that either (1) improves on the state-of-the-art, or (2) applies neural net models to a unique task.

Office hours: (Fri. 4-5PM GHC5409), Pengfei Liu (Wed. 2-3PM GHC6607)
Example Task: Sentiment Classification
- Input: a sentence such as "I hate this movie" or "I love this movie"
- Output: a label from [very good, good, neutral, bad, very bad]

A First Try: Bag of Words (BOW)
- For each word in "I hate this movie", look up a score vector.
- Sum the word vectors and add a bias to get scores.
- Apply a softmax to convert scores into probs, a probability distribution over [very good, good, neutral, bad, very bad].
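A minimal sketch of such a BOW classifier in plain NumPy (hypothetical vocabulary and randomly initialized weights; the course provides its own reference code):

```python
import numpy as np

LABELS = ["very good", "good", "neutral", "bad", "very bad"]
vocab = {"i": 0, "hate": 1, "love": 2, "this": 3, "movie": 4}

# One score vector per word, plus a bias vector (random here; normally learned).
rng = np.random.default_rng(0)
W = rng.normal(size=(len(vocab), len(LABELS)))  # word -> per-label scores
b = np.zeros(len(LABELS))                       # bias

def predict(sentence):
    # Look up each word's score vector, sum them, and add the bias.
    scores = b + sum(W[vocab[w]] for w in sentence.lower().split())
    # Softmax turns scores into a probability distribution over labels.
    exp = np.exp(scores - scores.max())
    probs = exp / exp.sum()
    return dict(zip(LABELS, probs))

print(predict("I hate this movie"))
```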
What's the Problem? Combination Features
- "There's nothing I don't love about this movie" → very good (double negation)
- "I don't love this movie" → bad
A bag of words gives each word the same score in every context, so it cannot capture combinations like negation.
What do we want? Between the word lookups and the scores/softmax/probs, insert some complicated function to extract combination features (a neural net).
Continuous Bag of Words (CBOW)
- Look up a dense embedding vector for each word and sum them.
- Multiply the sum by a weight matrix and add a bias to get scores, then predictions.
- Note: this still only has the power of a linear model, but the dimension is reduced. A minimal sketch follows.
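A sketch of CBOW in PyTorch, with hypothetical sizes (PyTorch is one of the dynamic frameworks recommended later in this lecture; the class demos use DyNet):

```python
import torch
import torch.nn as nn

class CBOW(nn.Module):
    def __init__(self, vocab_size, emb_dim, num_labels):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)  # dense word embeddings
        self.out = nn.Linear(emb_dim, num_labels)     # W*h + bias -> scores

    def forward(self, word_ids):
        h = self.emb(word_ids).sum(dim=0)  # sum embeddings over the sentence
        return self.out(h)                 # scores (apply softmax for probs)

model = CBOW(vocab_size=1000, emb_dim=64, num_labels=5)
scores = model(torch.tensor([0, 1, 3, 4]))  # e.g. "I hate this movie" as ids
```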
Deep CBOW
- Sum the word embeddings as in CBOW, then pass the result through nonlinear layers, e.g. h1 = tanh(W1*h + b1), h2 = tanh(W2*h1 + b2), before computing the scores.
- The nonlinearities let the network learn combination features (e.g., a node in the second layer might detect "feature 1 AND feature 5 are active").
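Extending the CBOW sketch above with two tanh layers gives a minimal Deep CBOW (again hypothetical sizes, not the course's reference code):

```python
import torch
import torch.nn as nn

class DeepCBOW(nn.Module):
    def __init__(self, vocab_size, emb_dim, hid_dim, num_labels):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.h1 = nn.Linear(emb_dim, hid_dim)     # tanh(W1*h + b1)
        self.h2 = nn.Linear(hid_dim, hid_dim)     # tanh(W2*h1 + b2)
        self.out = nn.Linear(hid_dim, num_labels)

    def forward(self, word_ids):
        h = self.emb(word_ids).sum(dim=0)
        h = torch.tanh(self.h1(h))  # nonlinear layers can pick up
        h = torch.tanh(self.h2(h))  # combinations of features
        return self.out(h)
```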
Original Motivation: Neurons in the Brain
[Figure: a biological neuron. Image credit: Wikipedia]
Current Conception: Computation Graphs
expression: y = xᵀAx + b·x + c

A node is a {tensor, matrix, vector, scalar} value. The graph for this expression has nodes for the inputs x, A, b, c and for the functions that combine them:
- f(u) = uᵀ (transpose)
- f(U, V) = UV and f(M, v) = Mv (matrix-matrix and matrix-vector products)
- f(u, v) = u · v (dot product)
- f(x1, x2, x3) = Σᵢ xᵢ (sum)

An edge represents a function argument (and also a data dependency). Edges are just pointers to nodes. A node with an incoming edge is a function of that edge's tail node.

A node knows how to compute its value, and the value of its derivative w.r.t. each argument (edge) times the derivative of an arbitrary input F. For example, for f(u) = uᵀ:

  ∂f(u)/∂u · ∂F/∂f(u) = (∂F/∂f(u))ᵀ
Functions can be nullary, unary, binary, … n-ary; they are often unary or binary.

Computation graphs are directed and acyclic (in DyNet).

A node can also be a composite function with its own derivatives, e.g.

  f(x, A) = xᵀAx,   ∂f(x, A)/∂A = xxᵀ,   ∂f(x, A)/∂x = (Aᵀ + A)x

Variable names (like y above) are just labelings of nodes.
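As a sanity check on that derivative, one can compare it against automatic differentiation; a sketch in PyTorch (not course code):

```python
import torch

x = torch.randn(3, requires_grad=True)
A = torch.randn(3, 3)

y = x @ A @ x   # f(x, A) = x^T A x (a scalar)
y.backward()    # populate x.grad

manual = (A.T + A) @ x                 # the hand-derived derivative w.r.t. x
print(torch.allclose(x.grad, manual))  # True
```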
Algorithms: Forward Propagation
In topological order, compute the value of each node given its inputs. For the example expression, the graph computes, in order:

  xᵀ  →  xᵀA  →  b·x  →  xᵀAx  →  xᵀAx + b·x + c
Algorithms: Back Propagation
Process nodes in reverse topological order, calculating the derivatives of the parameters with respect to the final value. (This is usually a "loss function", a value we want to minimize.)
Parameter update: move each parameter against its derivative, e.g. W -= α * dl/dW.
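Putting forward and backward together for the full expression, in define-by-run style (a PyTorch sketch; the DyNet code shown in lecture is analogous):

```python
import torch

x = torch.randn(3, requires_grad=True)
A = torch.randn(3, 3)
b = torch.randn(3)
c = torch.randn(())

y = x @ A @ x + b @ x + c   # building the graph runs forward propagation
y.backward()                # reverse topological order: back propagation

# The gradient matches the hand-derived (A^T + A) x + b.
print(torch.allclose(x.grad, (A.T + A) @ x + b))
```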
Neural Network Frameworks
- Static frameworks (define, then run): e.g. TensorFlow, Theano; some static toolkits have added dynamic interfaces (TensorFlow +Eager, MXNet +Gluon).
- Dynamic frameworks (define by running; recommended!): e.g. DyNet, PyTorch, Chainer.

Basic process in dynamic frameworks: create a model; for each example, create a graph that represents the computation you want, calculate the result of that computation, and if training, perform back propagation and update the parameters. A sketch of this loop follows.
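A minimal sketch of that process, reusing the hypothetical DeepCBOW class from earlier on a single toy example (assumed data, not the course's assignment code):

```python
import torch
import torch.nn as nn

model = DeepCBOW(vocab_size=1000, emb_dim=64, hid_dim=64, num_labels=5)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)  # W -= α * dl/dW
loss_fn = nn.CrossEntropyLoss()

for word_ids, label in [(torch.tensor([0, 1, 3, 4]), torch.tensor(3))]:
    optimizer.zero_grad()
    scores = model(word_ids)    # create the graph / forward propagation
    loss = loss_fn(scores.unsqueeze(0), label.unsqueeze(0))
    loss.backward()             # back propagation
    optimizer.step()            # update the parameters
```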
Why Model Architecture? In theory, a sufficiently large multi-layer network can calculate any continuous function. In practice, the architecture we choose provides an inductive bias, to make it easy to learn things we'd like to learn.
Example: judging a sentence like "this movie's reputation is undeserved" requires the NN to combine evidence from across the whole sentence.
It also helps to visualize and interpret your models: why did the model make a particular prediction? Example: [Ribeiro+ 16]
Example: machine translation with sequence-to-sequence models.
[Figure: encoder LSTMs read "I hate this movie </s>"; decoder LSTMs then generate the Japanese translation この 映画 が 嫌い ("I hate this movie"), choosing each output word, and finally </s>, by argmax over the predicted distribution. A toy sketch of this greedy decoding loop follows.]
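A toy sketch of greedy (argmax) decoding with an LSTM decoder, under assumed hypothetical sizes and an untrained model (a real system would condition on the encoder's state and use learned parameters):

```python
import torch
import torch.nn as nn

VOCAB, EMB, HID, EOS = 100, 32, 64, 0   # hypothetical sizes; id 0 = </s>
emb = nn.Embedding(VOCAB, EMB)
cell = nn.LSTMCell(EMB, HID)
out = nn.Linear(HID, VOCAB)

h, c = torch.zeros(1, HID), torch.zeros(1, HID)  # stand-in for encoder state
word = torch.tensor([EOS])                       # start symbol
output = []
for _ in range(20):                              # cap the output length
    h, c = cell(emb(word), (h, c))
    word = out(h).argmax(dim=1)                  # greedy: pick the best word
    if word.item() == EOS:
        break                                    # stop at </s>
    output.append(word.item())
```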
Also discussed: learned representations such as word embeddings and sentence embeddings.
Example: sequence labeling. Running LSTMs (or simpler RNNs) over "I hate this movie" and predicting one part-of-speech tag per word yields "PRP VB DT NN".

Example: structured knowledge, e.g. a graph with is-a relations such as "dog is-a animal" and "cat is-a animal".

In short, the class covers models that map sentences like "I hate this movie" to many kinds of outputs: class labels, translations (この 映画 が 嫌い), or tag sequences (PRP VB DT NN).