CS 533: Natural Language Processing
Backpropagation, Self-Attention, Text Representations Through Language Modeling
Karl Stratos
Rutgers University
Dropout (Slide credit Danqi Chen & Karthik Narasimhan)
[Dropout figure: hidden units h and inputs x]
◮ A technique to automatically calculate ∇J(θ) for any differentiable loss J(θ)
◮ Input: a loss function J(θ) ∈ R and a parameter value θ̂. Output: ∇J(θ̂), the gradient of J(θ) evaluated at θ = θ̂
◮ Calculates the gradient of an arbitrary differentiable function, including neural networks
◮ For the most part, we will consider (differentiable) functions f : R → R
◮ The gradient of f with respect to x is a function of x: ∂f(x)/∂x
◮ The gradient of f with respect to x evaluated at x = a is a number: ∂f(x)/∂x |_{x=a}
◮ Given any differentiable functions f, g from R to R, the chain rule states

  ∂g(f(x))/∂x = (∂g(z)/∂z |_{z=f(x)}) · ∂f(x)/∂x

◮ “Proof”: linearize g(z) around z = f(x), then linearize f(x) around x
◮ What is the value of the gradient of f(x) := 7?
◮ What is the value of the gradient of f(x) := 2x?
◮ What is the value of the gradient of f(x) := 2x + 99999?
◮ What is the value of the gradient of f(x) := x^3?
◮ What is the value of the gradient of f(x) := exp(x)?
◮ What is the value of the gradient of f(x) := exp(2x^3 + 10)? (checked numerically below)
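As a sanity check on the last case, a short script (plain Python, no dependencies; the function names are illustrative, not from the slides) compares the chain-rule answer for f(x) = exp(2x^3 + 10) against a finite-difference estimate:

```python
import math

def f(x):
    return math.exp(2 * x**3 + 10)

def grad_f(x):
    # Chain rule: d/dx exp(2x^3 + 10) = 6x^2 * exp(2x^3 + 10)
    return 6 * x**2 * math.exp(2 * x**3 + 10)

x, eps = 0.5, 1e-6
finite_diff = (f(x + eps) - f(x - eps)) / (2 * eps)
print(grad_f(x), finite_diff)  # the two estimates should agree closely
```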
◮ Let f_1 . . . f_m denote any differentiable functions from R to R
◮ If g : R^m → R is a differentiable function from R^m to R, then the multivariate chain rule states

  ∂g(f_1(x), . . . , f_m(x))/∂x = Σ_{i=1}^m (∂g(z)/∂z_i |_{z=(f_1(x),...,f_m(x))}) · ∂f_i(x)/∂x

◮ Calculate the gradient of x + x^2 + yx with respect to x using the multivariate chain rule (checked below)
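A minimal check of the exercise, assuming SymPy is available: take g(z_1, z_2, z_3) = z_1 + z_2 + z_3 with f_1(x) = x, f_2(x) = x^2, f_3(x) = yx, so each ∂g/∂z_i = 1 and the multivariate chain rule gives 1 + 2x + y.

```python
import sympy as sp

x, y = sp.symbols("x y")

# Direct differentiation of g(f1(x), f2(x), f3(x)) = x + x**2 + y*x:
print(sp.diff(x + x**2 + y*x, x))  # 2*x + y + 1

# Multivariate chain rule: dg/dz_i = 1 for every slot, so the gradient is
# df1/dx + df2/dx + df3/dx = 1 + 2*x + y, matching the direct answer.
```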
◮ DAG G = (V, E) with a single output node ω ∈ V
◮ Every node i ∈ V is equipped with a value x_i ∈ R: input nodes receive their values directly; every other node i is equipped with a function f_i : R^|pa(i)| → R and computes x_i = f_i((x_j)_{j∈pa(i)})
◮ Thus G represents a function: it receives multiple input values and returns the single output value x_ω
◮ We can calculate x_ω by a forward pass (sketched below)
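A minimal sketch of the forward pass; the dict-based representation and the example graph for (x_1 + x_2) · x_3 are mine, not the slides':

```python
# Each node is (function, parent_ids); input nodes have function None.
graph = {
    1: (None, []),                      # input x1
    2: (None, []),                      # input x2
    3: (None, []),                      # input x3
    4: (lambda a, b: a + b, [1, 2]),    # x4 = x1 + x2
    5: (lambda a, b: a * b, [4, 3]),    # x5 = x4 * x3 (output node omega)
}

def forward(graph, inputs):
    x = dict(inputs)  # node id -> value
    for i in sorted(graph):             # sorted ids happen to be topological here
        fn, parents = graph[i]
        if fn is not None:
            x[i] = fn(*(x[j] for j in parents))
    return x

values = forward(graph, {1: 2.0, 2: 3.0, 3: 4.0})
print(values[5])  # (2 + 3) * 4 = 20.0
```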
◮ Collectively refer to all input slots by x_I = (x_i)_{i∈V_I}
◮ Collectively refer to all input values by a_I = (a_i)_{i∈V_I}
◮ At i ∈ V:
  ◮ Refer to its parental slots by x_I^i = (x_j)_{j∈pa(i)}
  ◮ Refer to its parental values by a_I^i = (a_j)_{j∈pa(i)}
◮ The value at node i can be viewed two ways:
  ◮ A “global” function of x_I evaluated at a_I
  ◮ A “local” function of x_I^i evaluated at a_I^i
◮ Now for every node i ∈ V, we introduce an additional slot z_i to hold the gradient of the output with respect to that node: z_i = ∂x_ω/∂x_i |_{x_I=a_I}
◮ The goal of backpropagation is to calculate z_i for every node i ∈ V
◮ Why are we done if we achieve this goal?
◮ Chain rule on the DAG structure: for every non-output node i,

  z_i = Σ_{j: i∈pa(j)} z_j · (∂f_j(x_I^j)/∂x_i |_{x_I^j = a_I^j})

  where the sum ranges over the children j of i
◮ If we compute z_i in a reverse topological ordering, then every child j already has its slot z_j filled by the time we reach i (see the sketch below)
◮ What’s the base case z_ω?
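Continuing the dict-based sketch above (still illustrative): if each non-input node also stores the partial derivatives of its function with respect to each parent slot, evaluated at the forward values, the z slots can be accumulated in a reverse topological ordering starting from z_ω = 1.

```python
# Partials per node, aligned with its parent list; for x4 = x1 + x2 they are
# (1, 1), and for x5 = x4 * x3 with parents [4, 3] they are (x3, x4).

def backward(graph, partials, omega):
    z = {i: 0.0 for i in graph}
    z[omega] = 1.0                         # base case: d x_omega / d x_omega = 1
    for i in sorted(graph, reverse=True):  # reverse topological ordering
        _, parents = graph[i]
        for j, dij in zip(parents, partials.get(i, [])):
            z[j] += z[i] * dij             # chain rule on the DAG
    return z

partials = {4: [1.0, 1.0], 5: [values[3], values[4]]}
z = backward(graph, partials, omega=5)
print(z[1], z[2], z[3])  # 4.0 4.0 5.0 = gradients of (x1+x2)*x3 at (2,3,4)
```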
◮ Each type of function f creates a child node from parent nodes
  ◮ The “Add” function creates a child node c with two parents (a, b) and sets c.z ← 0
◮ Each node has an associated forward function
  ◮ Calling forward at c populates c.x = a.x + b.x (assumes parents have their values)
◮ Each node also has an associated backward function
  ◮ Calling backward at c “broadcasts” its gradient c.z (assumes it’s already calculated) to its parents:
    a.z ← a.z + c.z
    b.z ← b.z + c.z
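A sketch of how the “Add” node above might look in code (the class names are mine, and the Mul node is an extra I add to make the later usage less trivial):

```python
class Node:
    """A graph node with a value slot x and a gradient slot z."""
    def __init__(self, parents=()):
        self.parents = parents
        self.x = None    # value slot, filled by forward
        self.z = 0.0     # gradient slot, accumulated by backward

class Add(Node):
    """c = a + b, created from two parent nodes."""
    def forward(self):
        a, b = self.parents
        self.x = a.x + b.x               # assumes parents have their values
    def backward(self):
        a, b = self.parents
        a.z += self.z                    # d(a + b)/da = 1
        b.z += self.z                    # d(a + b)/db = 1

class Mul(Node):
    """c = a * b."""
    def forward(self):
        a, b = self.parents
        self.x = a.x * b.x
    def backward(self):
        a, b = self.parents
        a.z += self.z * b.x              # d(a * b)/da = b
        b.z += self.z * a.x              # d(a * b)/db = a
```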
◮ Express your loss J_B(θ) on minibatch B at θ = θ̂ as a computation graph
◮ Forward pass. For each node a in a topological ordering, call a.forward()
◮ Backward pass. Set the output node’s gradient slot to 1; then for each node a in a reverse topological ordering, call a.backward()
◮ The gradient of J_B(θ) at θ = θ̂ can then be read off the z slots of the parameter nodes
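Putting the recipe together with the hypothetical classes above, for the toy loss J = (a + b) · c:

```python
class Input(Node):
    def forward(self): pass              # input nodes already hold their values
    def backward(self): pass

a, b, c = Input(), Input(), Input()
a.x, b.x, c.x = 2.0, 3.0, 4.0
loss = Mul(parents=(Add(parents=(a, b)), c))

order = [a, b, c, loss.parents[0], loss]  # a topological ordering
for node in order:                        # forward pass
    node.forward()
loss.z = 1.0                              # base case at the output node
for node in reversed(order):              # backward pass
    node.backward()
print(a.z, b.z, c.z)                      # 4.0 4.0 5.0
```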
◮ Computation graph in which node values are vectors: x_i ∈ R^{d_i}
◮ The corresponding gradients are also vectors of the same size
◮ Backpropagation has exactly the same structure using the chain rule with Jacobians:

  z_i = Σ_{j: i∈pa(j)} z_j · (∂f_j(x_I^j)/∂x_i |_{x_I^j = a_I^j})

  where z_j is a 1 × d_j row vector and the Jacobian ∂f_j/∂x_i is a d_j × d_i matrix
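A small NumPy illustration of the vector case (the dimensions are chosen arbitrarily): for an affine node x_j = W x_i, the Jacobian ∂x_j/∂x_i is W itself, so the backward message is the 1 × d_j row vector z_j times that matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
d_i, d_j = 4, 3
W = rng.normal(size=(d_j, d_i))      # node j computes x_j = W @ x_i
z_j = rng.normal(size=(1, d_j))      # gradient slot at the child (1 x d_j)

# Jacobian of x_j w.r.t. x_i is W (d_j x d_i), so the contribution to z_i
# is a (1 x d_j) @ (d_j x d_i) = (1 x d_i) row vector:
z_i = z_j @ W
print(z_i.shape)                     # (1, 4)
```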
◮ Q ∈ R^{d×T}: T query vectors of the “asker”
◮ K ∈ R^{d×T′}: T′ key vectors of the “answerer”
◮ V ∈ R^{d×T′}: T′ value vectors of the “answerer”
◮ A ∈ R^{d×T}: T answer vectors of the “asker” after asking (see the sketch below)
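A minimal NumPy sketch of dot-product attention under the column-vector convention above; the function name and the 1/√d scaling (motivated two slides down) are assumptions of this sketch, not the slides’ code.

```python
import numpy as np

def attention(Q, K, V):
    """Q: d x T queries, K: d x T' keys, V: d x T' values -> A: d x T answers."""
    d = Q.shape[0]
    scores = K.T @ Q / np.sqrt(d)             # T' x T: key-query compatibilities
    weights = np.exp(scores - scores.max(axis=0))
    weights /= weights.sum(axis=0)            # softmax over the T' keys, per query
    return V @ weights                        # d x T: weighted sums of values

# Seq2seq usage as on the next slide: Q = Y (targets), K = V = X (sources).
d, T, T_src = 8, 5, 7
X, Y = np.random.randn(d, T_src), np.random.randn(d, T)
print(attention(Y, X, X).shape)  # (8, 5): new target encodings
```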
◮ Q = Y: target LSTM encodings
◮ K = X: source LSTM encodings
◮ V = X: source LSTM encodings
◮ A = Attention(Y, X, X): new target encodings
◮ var(q^⊤k) = d if the entries of q, k ∈ R^d are independent with mean 0 and variance 1
◮ var(q^⊤k/√d) = 1: scaling the scores by 1/√d keeps their variance constant as d grows
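A quick simulation (illustrative) confirming that unscaled dot-product scores have variance ≈ d while scaled ones have variance ≈ 1:

```python
import numpy as np

d, n = 64, 100_000
q = np.random.randn(n, d)                # entries ~ N(0, 1)
k = np.random.randn(n, d)
scores = (q * k).sum(axis=1)             # n dot products q^T k
print(scores.var())                      # ~ d = 64
print((scores / np.sqrt(d)).var())       # ~ 1 after scaling by 1/sqrt(d)
```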
◮ W_i^Q: query projection for head i
◮ W_i^K: key projection for head i
◮ W_i^V: value projection for head i
◮ W ∈ R^{d×d}: output projection applied to the concatenated heads
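A sketch of multi-head self-attention built on the attention function from the earlier sketch, assuming H heads with per-head projections of shape (d/H) × d (a common choice; the exact shapes on the slide may differ):

```python
import numpy as np

def multi_head_attention(X, WQ, WK, WV, W):
    """Self-attention over X (d x T); WQ/WK/WV are lists of (d/H) x d
    per-head projections, W is the d x d output projection."""
    heads = [
        attention(wq @ X, wk @ X, wv @ X)     # each head: (d/H) x T
        for wq, wk, wv in zip(WQ, WK, WV)
    ]
    return W @ np.concatenate(heads, axis=0)  # concat to d x T, then project

d, H, T = 8, 2, 5
rng = np.random.default_rng(0)
WQ = [rng.normal(size=(d // H, d)) for _ in range(H)]
WK = [rng.normal(size=(d // H, d)) for _ in range(H)]
WV = [rng.normal(size=(d // H, d)) for _ in range(H)]
W = rng.normal(size=(d, d))
X = rng.normal(size=(d, T))
print(multi_head_attention(X, WQ, WK, WV, W).shape)  # (8, 5)
```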
◮ Binary sentiment classification
  This film doesn’t care about intelligent humor → negative
◮ Example datasets: SST-2, IMDb, Yelp Review, SemEval, . . .
◮ Sentiment analysis results: http:
◮ Other types of classification: grammatical vs ungrammatical
◮ Each such downstream task provides only a limited amount of labeled data
◮ Can we transfer a large-scale pretrained language model to these downstream tasks?
◮ Popular benchmarks (Wang et al., 2018):
  ◮ GLUE: https://gluebenchmark.com/leaderboard
  ◮ SuperGLUE: https://super.gluebenchmark.com/leaderboard
◮ For the purposes of representation learning, we don’t care about language modeling itself, only the representations it induces
◮ We want a prediction problem which conditions on the entire input (context on both sides)
◮ Solution: mask out words at random and predict them
◮ Need to be careful:
  ◮ Too little masking: too expensive to train
  ◮ Too much masking: not enough context
  ◮ Test time: no [MASK] input, so training should also handle no [MASK] input sometimes (see the sketch below)
◮ Details: https://arxiv.org/pdf/1810.04805.pdf
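A sketch of the masking rule from the BERT paper linked above (Devlin et al., 2018): choose 15% of tokens; of those, replace 80% with [MASK], 10% with a random token, and leave 10% unchanged, so the model also sees unmasked inputs at training time. The function below is an illustrative simplification.

```python
import random

def mask_tokens(tokens, vocab, mask_rate=0.15):
    """Return (possibly masked) tokens and the prediction targets (or None)."""
    masked, targets = [], []
    for tok in tokens:
        if random.random() < mask_rate:
            targets.append(tok)                  # predict the original token
            r = random.random()
            if r < 0.8:
                masked.append("[MASK]")          # 80%: replace with [MASK]
            elif r < 0.9:
                masked.append(random.choice(vocab))  # 10%: random token
            else:
                masked.append(tok)               # 10%: keep unchanged
        else:
            masked.append(tok)
            targets.append(None)                 # no prediction here
    return masked, targets

print(mask_tokens("the dog barked at the cat".split(),
                  vocab=["the", "dog", "cat", "barked", "ran", "at"]))
```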
Input: [CLS] the dog [MASK] [SEP] the cat [MASK] away [SEP]
Targets: IsNext (for [CLS]), barked and ran (for the two [MASK]s)
(Vaswani et al., 2017)
◮ ELMo: 94 million
◮ BERT Base: 110 million
◮ BERT Large: 340 million
◮ TagLM (Peters et al., 2017)
◮ CoVe (McCann et al., 2017)
◮ ULMFiT (Howard and Ruder, 2018)
◮ ELMo (Peters et al., 2018)
◮ OpenAI GPT (Radford et al., 2018)
◮ BERT (Devlin et al., 2018)
◮ OpenAI GPT-2 (Radford et al., 2019)
◮ XLNet (Yang et al., 2019)
◮ SpanBERT (Joshi et al., 2019)
◮ RoBERTa (Liu et al., 2019)
◮ ALBERT (Anonymous)
◮ T5 (Raffel et al., 2019)
◮ . . .