CS 6355: Structured Prediction
Predicting Sequences: Structured Perceptron
Conditional Random Fields: summary
– An undirected graphical model
– Decompose the score over the structure into a collection of factors
– Each factor assigns a score to the assignment of the random variables it connects
[Figure: a linear-chain CRF with label nodes y₀, y₁, y₂, y₃ and input x; each adjacent pair of labels shares a factor, scored wᵀφ(x, y₀, y₁), wᵀφ(x, y₁, y₂), and wᵀφ(x, y₂, y₃)]
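A minimal sketch of how this decomposed score could be computed, assuming a hypothetical feature function phi(x, y_prev, y_curr) that returns a sparse feature dictionary for one factor:

```python
def chain_score(w, phi, x, y):
    """Score a label sequence y for input x as a sum of factor scores
    w . phi(x, y[i-1], y[i]), one factor per adjacent pair of labels."""
    total = 0.0
    for i in range(1, len(y)):
        feats = phi(x, y[i - 1], y[i])   # features of one factor
        total += sum(w.get(f, 0.0) * v for f, v in feats.items())
    return total
```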
[Figure: the HMM as a graphical model, with its two kinds of factors labeled: Transitions (between adjacent states) and Emissions (from states to observed words)]
Log joint probability = transition scores + emission scores
Indicators are functions that map Booleans to 0 or 1.
The indicators ensure that only one term in each double summation is non-zero. Equivalent to:
$$\log P(\mathbf{x}, \mathbf{y}) \;=\; \sum_{i} \sum_{t} \sum_{t'} \log P(t \mid t') \cdot I[y_i = t] \cdot I[y_{i-1} = t'] \;+\; \sum_{i} \sum_{t} \log P(x_i \mid t) \cdot I[y_i = t]$$
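A minimal sketch of this sum, assuming hypothetical lookup tables trans[t_prev][t] = log P(t | t_prev) and emit[t][word] = log P(word | t):

```python
def hmm_log_joint(trans, emit, words, tags):
    """log P(x, y) = transition scores + emission scores.
    The indicators in the equation above just pick out, at each position i,
    the single (y_{i-1}, y_i) pair and (y_i, x_i) pair the sequence uses.
    (The initial-state term is omitted for brevity.)"""
    score = 0.0
    for i in range(1, len(tags)):
        score += trans[tags[i - 1]][tags[i]]   # log P(y_i | y_{i-1})
    for word, tag in zip(words, tags):
        score += emit[tag][word]               # log P(x_i | y_i)
    return score
```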
Count(t′ → t): the number of times there is a transition in the sequence from state t′ to state t
Count(t): the number of times state t occurs in the sequence
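The same log joint can therefore be grouped by counts instead of positions. A quick sketch (the emission term here counts (state, word) pairs, a slight refinement of Count(t); helper names are mine):

```python
from collections import Counter

def hmm_log_joint_by_counts(trans, emit, words, tags):
    """The same sum as hmm_log_joint, grouped by how often each
    transition and each emission occurs in the sequence."""
    trans_counts = Counter(zip(tags, tags[1:]))   # Count(t' -> t)
    emit_counts = Counter(zip(tags, words))       # how often state t emits word w
    score = sum(trans[t_prev][t] * n for (t_prev, t), n in trans_counts.items())
    score += sum(emit[t][w] * n for (t, w), n in emit_counts.items())
    return score
```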
Example: The/Det dog/Noun ate/Verb the/Det homework/Noun

log P(x, y) = transition scores + emission scores

Transition scores: log P(Det → Noun) × 2 + log P(Noun → Verb) × 1 + log P(Verb → Det) × 1

Emission scores: log P(The | Det) × 1 + log P(dog | Noun) × 1 + log P(ate | Verb) × 1 + log P(the | Det) × 1 + log P(homework | Noun) × 1

w: Parameters
φ(x, y): Properties of this output and the input
Writing both as vectors:

w = [log P(Det → Noun), log P(Noun → Verb), log P(Verb → Det), log P(The | Det), log P(dog | Noun), log P(ate | Verb), log P(the | Det), log P(homework | Noun)]ᵀ

φ(x, y) = [2, 1, 1, 1, 1, 1, 1, 1]ᵀ

log P(x, y) = wᵀφ(x, y): a linear scoring function
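The dot product for this example, written out as a quick numpy check (the log-probability values are made-up placeholders, not values from the slides):

```python
import numpy as np

# w: one log-probability parameter per feature (placeholder values)
w = np.log(np.array([
    0.40,  # P(Det -> Noun)
    0.30,  # P(Noun -> Verb)
    0.50,  # P(Verb -> Det)
    0.60,  # P(The | Det)
    0.10,  # P(dog | Noun)
    0.20,  # P(ate | Verb)
    0.60,  # P(the | Det)
    0.05,  # P(homework | Noun)
]))

# phi(x, y): how often each feature fires in
# "The/Det dog/Noun ate/Verb the/Det homework/Noun"
phi = np.array([2, 1, 1, 1, 1, 1, 1, 1])

log_p = w @ phi   # log P(x, y) = w^T phi(x, y)
```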
Viterbi only cares about the scores assigned to structures (not necessarily normalized)
– Can we treat it as any linear classifier for training?
– If so, we could add additional features that are global properties
– If we need normalization, we can always normalize by exponentiating and dividing by Z (the partition function)
– That is, the learning algorithm can effectively just focus on the score of y for a particular x
– Train a discriminative model instead of the generative one!
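For completeness, the normalization the slide refers to, in its standard form:

$$P(\mathbf{y} \mid \mathbf{x}) = \frac{\exp\big(\mathbf{w}^{\top}\phi(\mathbf{x}, \mathbf{y})\big)}{\sum_{\mathbf{y}'} \exp\big(\mathbf{w}^{\top}\phi(\mathbf{x}, \mathbf{y}')\big)} = \frac{\exp\big(\mathbf{w}^{\top}\phi(\mathbf{x}, \mathbf{y})\big)}{Z(\mathbf{x})}$$

Since Z(x) does not depend on y, the argmax over y (and hence Viterbi) is unaffected by it.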
Structured Perceptron update
Update only on an error: the perceptron is a mistake-driven algorithm. If there is a mistake, promote y and demote y′.
T, the number of passes over the training data, is a hyperparameter of the algorithm.
In practice, it is good to shuffle D before the inner loop.
Inference inside the training loop!
Exercise: optimize this for performance – modify a (the running weight accumulator) only on errors.
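A minimal sketch of the training loop these slides describe, assuming a hypothetical viterbi_decode(x, w) helper that returns the highest-scoring structure under weights w, and a feature function phi(x, y) returning a sparse feature dictionary:

```python
import random

def structured_perceptron(D, phi, viterbi_decode, T=10, eta=1.0):
    """Structured perceptron training loop.
    D: list of (x, y_gold) pairs.
    T, the number of passes over the data, is a hyperparameter."""
    w = {}                                   # feature name -> weight
    for _ in range(T):
        random.shuffle(D)                    # shuffle D before the inner loop
        for x, y_gold in D:
            y_hat = viterbi_decode(x, w)     # inference inside the training loop!
            if y_hat != y_gold:              # update only on an error
                for f, v in phi(x, y_gold).items():
                    w[f] = w.get(f, 0.0) + eta * v   # promote the gold y
                for f, v in phi(x, y_hat).items():
                    w[f] = w.get(f, 0.0) - eta * v   # demote the prediction y'
    return w
```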
Caveat: adding regularization will change the CRF update, and averaging changes the perceptron update. (Expectation vs. max.)
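To make the "expectation vs. max" contrast concrete (standard update forms, not taken verbatim from the slides): for a training example (x, y), the CRF gradient step uses an expectation over all outputs, while the perceptron uses only the Viterbi maximizer ŷ:

$$\nabla_{\text{CRF}} \propto \phi(\mathbf{x}, \mathbf{y}) - \mathbb{E}_{\mathbf{y}' \sim P(\cdot \mid \mathbf{x})}\big[\phi(\mathbf{x}, \mathbf{y}')\big] \qquad \text{vs.} \qquad \Delta_{\text{perceptron}} = \phi(\mathbf{x}, \mathbf{y}) - \phi(\mathbf{x}, \hat{\mathbf{y}}), \quad \hat{\mathbf{y}} = \arg\max_{\mathbf{y}'} \mathbf{w}^{\top}\phi(\mathbf{x}, \mathbf{y}')$$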
Two roads diverge
Discriminative/Conditional models: Conditional Random Fields vs. the structured perceptron and Structured SVM
– In the end, both are just linear classifiers over a feature representation
– We are not predicting probabilities: we are assigning scores to a full output for a given input (like multiclass classification)
– Sophisticated models that can use arbitrary features
– Applicable beyond sequences
– Eventually, a similar objective minimized with different loss functions
Two roads diverge: coming soon…
Discriminative/Conditional models
– Predict via the Viterbi algorithm
– Local approaches (no inference during training)
– Global approaches (inference during training)
– What are the parts in a sequence model?
– How does each model score these parts?
Same dichotomy for more general structures
– Prediction is not always tractable for general structures