

SLIDE 1

CS 6355: Structured Prediction

Predicting Sequences: Structured Perceptron

SLIDE 2

Conditional Random Fields summary

  • An undirected graphical model
    – Decomposes the score over the structure into a collection of factors
    – Each factor assigns a score to an assignment of the random variables it is connected to
  • Training and prediction
    – Final prediction via argmax_y wᵀφ(x, y)
    – Train by maximum (regularized) likelihood
  • Connections to other models
    – Effectively a linear classifier
    – A generalization of logistic regression to structures
    – A conditional variant of a Markov Random Field
  • We will see this soon

SLIDE 3

Global features

The feature function decomposes over the sequence:

wᵀφ(x, y_0, y_1) + wᵀφ(x, y_1, y_2) + wᵀφ(x, y_2, y_3)

(Figure: a chain of labels y_0, y_1, y_2, y_3 over the input x, with one factor per adjacent pair of labels.)

SLIDE 4

Outline

  • Sequence models
  • Hidden Markov models
    – Inference with HMM
    – Learning
  • Conditional Models and Local Classifiers
  • Global models
    – Conditional Random Fields
    – Structured Perceptron for sequences

SLIDE 5

HMM is also a linear classifier

Consider the HMM: P(x, y) = ∏_i P(y_i | y_{i-1}) · P(x_i | y_i)

SLIDE 6

HMM is also a linear classifier

Consider the HMM: P(x, y) = ∏_i P(y_i | y_{i-1}) · P(x_i | y_i)

Transitions: P(y_i | y_{i-1}). Emissions: P(x_i | y_i).

SLIDE 7

HMM is also a linear classifier

Consider the HMM: P(x, y) = ∏_i P(y_i | y_{i-1}) · P(x_i | y_i)

Or equivalently: log P(x, y) = Σ_i [ log P(y_i | y_{i-1}) + log P(x_i | y_i) ]

Log joint probability = transition scores + emission scores

SLIDE 8

HMM is also a linear classifier

log P(x, y) = Σ_i [ log P(y_i | y_{i-1}) + log P(x_i | y_i) ]

Let us examine this expression using a carefully defined set of indicator functions.

Log joint probability = transition scores + emission scores

SLIDE 9

HMM is also a linear classifier

log P(x, y) = Σ_i [ log P(y_i | y_{i-1}) + log P(x_i | y_i) ]

Let us examine this expression using a carefully defined set of indicator functions:

I[z] = 1 if z is true, 0 if z is false

Log joint probability = transition scores + emission scores
Indicators are functions that map Booleans to 0 or 1

SLIDE 10

HMM is also a linear classifier

log P(x, y) = Σ_i [ log P(y_i | y_{i-1}) + log P(x_i | y_i) ]

Let us examine this expression using a carefully defined set of indicator functions. The transition term becomes

Σ_t′ Σ_t log P(t | t′) · I[y_i = t] · I[y_{i-1} = t′]

Log joint probability = transition scores + emission scores
The indicators ensure that only one of the elements of the double summation is non-zero, so this is equivalent to log P(y_i | y_{i-1}).
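
To see why only one term survives, here is a tiny Python illustration (not from the slides): for a fixed position i, the product of indicators is non-zero for exactly the (t′, t) pair that matches the adjacent tags in y.

```python
def indicator(condition):
    # I[z]: 1 if z is true, 0 otherwise
    return 1 if condition else 0

tags = ["Det", "Noun", "Verb"]          # the possible states t
y = ["Det", "Noun"]                     # y[i-1] = Det, y[i] = Noun
i = 1
surviving = [(t_prev, t)
             for t_prev in tags for t in tags
             if indicator(y[i] == t) * indicator(y[i - 1] == t_prev) == 1]
print(surviving)                        # [('Det', 'Noun')]: only one non-zero term in the double sum
```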

SLIDE 11

HMM is also a linear classifier

log P(x, y) = Σ_i [ log P(y_i | y_{i-1}) + log P(x_i | y_i) ]

Let us examine this expression using a carefully defined set of indicator functions. The emission term becomes

Σ_t log P(x_i | t) · I[y_i = t]

Log joint probability = transition scores + emission scores
The indicators ensure that only one of the elements of the summation is non-zero, so this is equivalent to log P(x_i | y_i).

SLIDE 12

HMM is also a linear classifier

log P(x, y) = Σ_i [ log P(y_i | y_{i-1}) + log P(x_i | y_i) ]

Let us examine this expression using a carefully defined set of indicator functions:

log P(x, y) = Σ_i Σ_t′ Σ_t log P(t | t′) · I[y_i = t] · I[y_{i-1} = t′]
            + Σ_i Σ_t log P(x_i | t) · I[y_i = t]

SLIDE 13

HMM is also a linear classifier

log P(x, y) = Σ_i [ log P(y_i | y_{i-1}) + log P(x_i | y_i) ]

Let us examine this expression using a carefully defined set of indicator functions:

log P(x, y) = Σ_i Σ_t′ Σ_t log P(t | t′) · I[y_i = t] · I[y_{i-1} = t′]
            + Σ_i Σ_t log P(x_i | t) · I[y_i = t]

Rearranging the sums:

log P(x, y) = Σ_t′ Σ_t log P(t | t′) · ( Σ_i I[y_i = t] · I[y_{i-1} = t′] )
            + Σ_t log P(x_i | t) · ( Σ_i I[y_i = t] )

SLIDE 14

HMM is also a linear classifier

log P(x, y) = Σ_i [ log P(y_i | y_{i-1}) + log P(x_i | y_i) ]

Let us examine this expression using a carefully defined set of indicator functions:

log P(x, y) = Σ_t′ Σ_t log P(t | t′) · ( Σ_i I[y_i = t] · I[y_{i-1} = t′] )
            + Σ_t log P(x_i | t) · ( Σ_i I[y_i = t] )

Σ_i I[y_i = t] · I[y_{i-1} = t′] is the number of times there is a transition in the sequence from state t′ to state t: Count(t′ → t)

SLIDE 15

HMM is also a linear classifier

log P(x, y) = Σ_i [ log P(y_i | y_{i-1}) + log P(x_i | y_i) ]

Let us examine this expression using a carefully defined set of indicator functions:

log P(x, y) = Σ_t′ Σ_t log P(t | t′) · Count(t′ → t)
            + Σ_t log P(x_i | t) · ( Σ_i I[y_i = t] )

SLIDE 16

HMM is also a linear classifier

log P(x, y) = Σ_i [ log P(y_i | y_{i-1}) + log P(x_i | y_i) ]

Let us examine this expression using a carefully defined set of indicator functions:

log P(x, y) = Σ_t′ Σ_t log P(t | t′) · Count(t′ → t)
            + Σ_t log P(x_i | t) · ( Σ_i I[y_i = t] )

Σ_i I[y_i = t] is the number of times state t occurs in the sequence: Count(t)

SLIDE 17

HMM is also a linear classifier

log P(x, y) = Σ_i [ log P(y_i | y_{i-1}) + log P(x_i | y_i) ]

Let us examine this expression using a carefully defined set of indicator functions:

log P(x, y) = Σ_t′ Σ_t log P(t | t′) · Count(t′ → t)
            + Σ_t log P(x_i | t) · Count(t)

SLIDE 18

HMM is also a linear classifier

log P(x, y) = Σ_i [ log P(y_i | y_{i-1}) + log P(x_i | y_i) ]

Let us examine this expression using a carefully defined set of indicator functions:

log P(x, y) = Σ_t′ Σ_t log P(t | t′) · Count(t′ → t)
            + Σ_t log P(x_i | t) · Count(t)

This is a linear function: the log P terms are the weights, and the counts obtained via the indicators are the features. It can be written as wᵀφ(x, y), and we can add more features.
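
As a concrete rendering of this view, here is a minimal Python sketch (not from the slides): the weights are log-probabilities, the features are transition and emission counts, and log P(x, y) is their dot product. The dictionaries `trans_prob` and `emit_prob` are assumed HMM parameters; initial-state probabilities are ignored to match the worked example on the next slides.

```python
import math
from collections import Counter

def hmm_log_score(words, tags, trans_prob, emit_prob):
    """log P(x, y) as weights (log-probabilities) dotted with features (counts)."""
    # Features: Count(t' -> t) over adjacent tag pairs and Count(tag, word) for emissions.
    transition_counts = Counter(zip(tags[:-1], tags[1:]))
    emission_counts = Counter(zip(tags, words))
    score = sum(c * math.log(trans_prob[pair]) for pair, c in transition_counts.items())
    score += sum(c * math.log(emit_prob[pair]) for pair, c in emission_counts.items())
    return score

# For "The dog ate the homework" tagged Det Noun Verb Det Noun, the transition features are
# Count(Det->Noun) = 2, Count(Noun->Verb) = 1, Count(Verb->Det) = 1, plus one emission count
# per (tag, word) pair -- exactly the quantities in the worked example on the following slides.
```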

SLIDE 19

HMM is a linear classifier: An example

The/Det  dog/Noun  ate/Verb  the/Det  homework/Noun

SLIDE 20

HMM is a linear classifier: An example

Consider: The/Det  dog/Noun  ate/Verb  the/Det  homework/Noun

SLIDE 21

HMM is a linear classifier: An example

Consider: The/Det  dog/Noun  ate/Verb  the/Det  homework/Noun

Score = transition scores + emission scores

SLIDE 22

HMM is a linear classifier: An example

Consider: The/Det  dog/Noun  ate/Verb  the/Det  homework/Noun

Transition scores:
log P(Det → Noun) × 2 + log P(Noun → Verb) × 1 + log P(Verb → Det) × 1
+ emission scores

SLIDE 23

HMM is a linear classifier: An example

Consider: The/Det  dog/Noun  ate/Verb  the/Det  homework/Noun

Transition scores:
log P(Det → Noun) × 2 + log P(Noun → Verb) × 1 + log P(Verb → Det) × 1
Emission scores:
log P(The | Det) × 1 + log P(dog | Noun) × 1 + log P(ate | Verb) × 1 + log P(the | Det) × 1 + log P(homework | Noun) × 1

SLIDE 24

HMM is a linear classifier: An example

Consider: The/Det  dog/Noun  ate/Verb  the/Det  homework/Noun

Transition scores:
log P(Det → Noun) × 2 + log P(Noun → Verb) × 1 + log P(Verb → Det) × 1
Emission scores:
log P(The | Det) × 1 + log P(dog | Noun) × 1 + log P(ate | Verb) × 1 + log P(the | Det) × 1 + log P(homework | Noun) × 1

w: Parameters of the model

SLIDE 25

HMM is a linear classifier: An example

Consider: The/Det  dog/Noun  ate/Verb  the/Det  homework/Noun

Transition scores:
log P(Det → Noun) × 2 + log P(Noun → Verb) × 1 + log P(Verb → Det) × 1
Emission scores:
log P(The | Det) × 1 + log P(dog | Noun) × 1 + log P(ate | Verb) × 1 + log P(the | Det) × 1 + log P(homework | Noun) × 1

φ(x, y): Properties of this output and the input

SLIDE 26

HMM is a linear classifier: An example

Consider: The/Det  dog/Noun  ate/Verb  the/Det  homework/Noun

w: Parameters of the model
  log P(Det → Noun), log P(Noun → Verb), log P(Verb → Det),
  log P(The | Det), log P(dog | Noun), log P(ate | Verb), log P(the | Det), log P(homework | Noun)

φ(x, y): Properties of this output and the input
  2, 1, 1, 1, 1, 1, 1, 1

SLIDE 27

HMM is a linear classifier: An example

Consider: The/Det  dog/Noun  ate/Verb  the/Det  homework/Noun

w: Parameters of the model
  log P(Det → Noun), log P(Noun → Verb), log P(Verb → Det),
  log P(The | Det), log P(dog | Noun), log P(ate | Verb), log P(the | Det), log P(homework | Noun)

φ(x, y): Properties of this output and the input
  2, 1, 1, 1, 1, 1, 1, 1

log P(x, y) = a linear scoring function = wᵀφ(x, y)

SLIDE 28

Towards structured Perceptron

1. HMM is a linear classifier
   – Can we treat it as any linear classifier for training?
   – If so, we could add additional features that are global properties
     • As long as the output can be decomposed for easy inference

2. The Viterbi algorithm calculates max_y wᵀφ(x, y)
   – Viterbi only cares about the scores assigned to structures (not necessarily normalized)

3. We could push the learning algorithm to train for un-normalized scores
   – If we need normalization, we could always normalize by exponentiating and dividing by Z
   – That is, the learning algorithm can effectively just focus on the score of y for a particular x
   – Train a discriminative model!

SLIDE 29

Towards structured Perceptron

1. HMM is a linear classifier
   – Can we treat it as any linear classifier for training?
   – If so, we could add additional features that are global properties
     • As long as the output can be decomposed for easy inference

2. The Viterbi algorithm calculates max_y wᵀφ(x, y)
   – Viterbi only cares about the scores assigned to structures (not necessarily normalized)

3. We could push the learning algorithm to train for un-normalized scores
   – If we need normalization, we could always normalize by exponentiating and dividing by Z
   – That is, the learning algorithm can effectively just focus on the score of y for a particular x
   – Train a discriminative model!

SLIDE 30

Towards structured Perceptron

1. HMM is a linear classifier
   – Can we treat it as any linear classifier for training?
   – If so, we could add additional features that are global properties
     • As long as the output can be decomposed for easy inference

2. The Viterbi algorithm calculates max_y wᵀφ(x, y) (a sketch follows this slide)
   – Viterbi only cares about the scores assigned to structures (not necessarily normalized)

3. We could push the learning algorithm to train for un-normalized scores
   – If we need normalization, we could always normalize by exponentiating and dividing by Z (the partition function)
   – That is, the learning algorithm can effectively just focus on the score of y for a particular x
   – Train a discriminative model instead of the generative one!
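
Because Viterbi only needs (possibly unnormalized) scores, the same dynamic program computes argmax_y wᵀφ(x, y) for any score that factors over adjacent labels. A minimal sketch, assuming a hypothetical helper `score(x, i, prev_tag, tag)` that returns the factor score (transition plus emission, or any richer local features) at position i:

```python
def viterbi(x, tags, score, start="<s>"):
    """argmax over tag sequences of sum_i score(x, i, y[i-1], y[i])."""
    n = len(x)
    # best[i][t]: best score of any tag prefix ending in tag t at position i
    best = [{t: score(x, 0, start, t) for t in tags}]
    back = [{}]
    for i in range(1, n):
        best.append({})
        back.append({})
        for t in tags:
            prev = max(tags, key=lambda tp: best[i - 1][tp] + score(x, i, tp, t))
            best[i][t] = best[i - 1][prev] + score(x, i, prev, t)
            back[i][t] = prev
    # Follow back-pointers from the best final tag to recover the sequence.
    last = max(tags, key=lambda t: best[n - 1][t])
    y = [last]
    for i in range(n - 1, 0, -1):
        y.append(back[i][y[-1]])
    return list(reversed(y)), best[n - 1][last]
```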

SLIDE 31

Structured Perceptron algorithm

Given a training set D = {(x, y)}
  1. Initialize w = 0 ∈ ℝⁿ
  2. For epoch = 1 … T:
     1. For each training example (x, y) ∈ D:
        1. Predict y′ = argmax_y′ wᵀφ(x, y′)
        2. If y ≠ y′, update w ← w + learningRate · (φ(x, y) − φ(x, y′))
  3. Return w

Prediction: argmax_y wᵀφ(x, y)

Structured Perceptron update

SLIDE 32

Structured Perceptron algorithm

Given a training set D = {(x, y)}
  1. Initialize w = 0 ∈ ℝⁿ
  2. For epoch = 1 … T:
     1. For each training example (x, y) ∈ D:
        1. Predict y′ = argmax_y′ wᵀφ(x, y′)
        2. If y ≠ y′, update w ← w + r · (φ(x, y) − φ(x, y′))
  3. Return w

Prediction: argmax_y wᵀφ(x, y)

SLIDE 33

Structured Perceptron algorithm

Given a training set D = {(x, y)}
  1. Initialize w = 0 ∈ ℝⁿ
  2. For epoch = 1 … T:
     1. For each training example (x, y) ∈ D:
        1. Predict y′ = argmax_y′ wᵀφ(x, y′)
        2. If y ≠ y′, update w ← w + r · (φ(x, y) − φ(x, y′))
  3. Return w

Prediction: argmax_y wᵀφ(x, y)

Update only on an error. Perceptron is a mistake-driven algorithm. If there is a mistake, promote y and demote y′.

SLIDE 34

Structured Perceptron algorithm

Given a training set D = {(x, y)}
  1. Initialize w = 0 ∈ ℝⁿ
  2. For epoch = 1 … T:
     1. For each training example (x, y) ∈ D:
        1. Predict y′ = argmax_y′ wᵀφ(x, y′)
        2. If y ≠ y′, update w ← w + r · (φ(x, y) − φ(x, y′))
  3. Return w

Prediction: argmax_y wᵀφ(x, y)

T is a hyperparameter to the algorithm

SLIDE 35

Structured Perceptron algorithm

Given a training set D = {(x, y)}
  1. Initialize w = 0 ∈ ℝⁿ
  2. For epoch = 1 … T:
     1. For each training example (x, y) ∈ D:
        1. Predict y′ = argmax_y′ wᵀφ(x, y′)
        2. If y ≠ y′, update w ← w + r · (φ(x, y) − φ(x, y′))
  3. Return w

Prediction: argmax_y wᵀφ(x, y)

In practice, good to shuffle D before the inner loop

SLIDE 36

Structured Perceptron algorithm

Given a training set D = {(x, y)}
  1. Initialize w = 0 ∈ ℝⁿ
  2. For epoch = 1 … T:
     1. For each training example (x, y) ∈ D:
        1. Predict y′ = argmax_y′ wᵀφ(x, y′)
        2. If y ≠ y′, update w ← w + r · (φ(x, y) − φ(x, y′))
  3. Return w

Prediction: argmax_y wᵀφ(x, y)

Inference in training loop!
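
A minimal sketch of the full loop in Python (not the course code), assuming a feature function `phi(x, y)` that returns a numpy vector and a `predict(w, x)` that runs Viterbi-style inference (e.g., the sketch above) to return the argmax output; both names are hypothetical placeholders.

```python
import random
import numpy as np

def structured_perceptron(data, phi, predict, n_features, epochs=10, rate=1.0):
    """Structured perceptron: on a mistake, promote phi(x, y) and demote phi(x, y')."""
    w = np.zeros(n_features)
    data = list(data)
    for _ in range(epochs):          # T (number of epochs) is a hyperparameter
        random.shuffle(data)         # shuffling D before the inner loop helps in practice
        for x, y in data:
            y_pred = predict(w, x)   # inference inside the training loop
            if y_pred != y:          # update only on an error
                w += rate * (phi(x, y) - phi(x, y_pred))
    return w
```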

SLIDE 37

Notes on structured perceptron

  • Mistake bound for separable data, just like perceptron
  • In practice, use averaging for better generalization
    – Initialize a = 0
    – After each step, whether there is an update or not, a ← a + w
      • Note, we still check for mistakes using w, not a
    – Return a at the end instead of w
    Exercise: Optimize this for performance – modify a only on errors
  • Global update
    – One weight vector for the entire sequence
      • Not one for each position
    – The same algorithm can be derived via constraint classification
      • Create a binary classification data set and run perceptron

SLIDE 38

Structured Perceptron with averaging

Given a training set D = {(x, y)}
  1. Initialize w = 0 ∈ ℝⁿ, a = 0 ∈ ℝⁿ
  2. For epoch = 1 … T:
     1. For each training example (x, y) ∈ D:
        1. Predict y′ = argmax_y′ wᵀφ(x, y′)
        2. If y ≠ y′, update w ← w + r · (φ(x, y) − φ(x, y′))
        3. Set a ← a + w
  3. Return a
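
The averaged variant changes only a few lines relative to the earlier sketch: accumulate a after every example and return it. A minimal sketch under the same assumptions (`phi` and `predict` are hypothetical helpers); the efficiency exercise from SLIDE 37 is noted in a comment but not implemented.

```python
import random
import numpy as np

def averaged_structured_perceptron(data, phi, predict, n_features, epochs=10, rate=1.0):
    w = np.zeros(n_features)
    a = np.zeros(n_features)
    data = list(data)
    for _ in range(epochs):
        random.shuffle(data)
        for x, y in data:
            y_pred = predict(w, x)   # mistakes are still checked against w, not a
            if y_pred != y:
                w += rate * (phi(x, y) - phi(x, y_pred))
            a += w                   # accumulate after every step, update or not
            # Exercise: between errors, a grows by the same w each step, so this
            # accumulation can be batched and applied lazily, only when an error occurs.
    return a                         # the accumulated (averaged, up to scaling) weight vector
```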

SLIDE 39

CRF vs. structured perceptron

Stochastic gradient descent update for CRF
  – For a training example (x_i, y_i)

Structured perceptron
  – For a training example (x_i, y_i)

Caveat: Adding regularization will change the CRF update; averaging changes the perceptron update.
Expectation vs. max
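
The update equations themselves did not survive extraction. For a training example (x_i, y_i) with learning rate r, the standard forms of the two updates are sketched below (regularization omitted, as the caveat notes); the exact notation on the original slide may differ.

```latex
% CRF: stochastic gradient step on the log-likelihood.
% Move toward the gold features and away from the model's *expected* features.
w \leftarrow w + r \left( \phi(x_i, y_i) - \mathbb{E}_{y \sim P(y \mid x_i, w)}\!\left[ \phi(x_i, y) \right] \right)

% Structured perceptron: replace the expectation with the features of the argmax,
% and update only when the prediction is wrong.
\hat{y} = \arg\max_{y} \, w^\top \phi(x_i, y), \qquad
w \leftarrow w + r \left( \phi(x_i, y_i) - \phi(x_i, \hat{y}) \right)
```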

SLIDE 40

The lay of the land

HMM: A generative model, assigns probabilities to sequences

SLIDE 41

The lay of the land

HMM: A generative model, assigns probabilities to sequences

Two roads diverge

SLIDE 42

The lay of the land

HMM: A generative model, assigns probabilities to sequences

Two roads diverge

  • Hidden Markov Models are actually just linear classifiers
  • Don't really care whether we are predicting probabilities. We are assigning scores to a full output for a given input (like multiclass)
  • Generalize algorithms for linear classifiers. Sophisticated models that can use arbitrary features
    – Structured Perceptron
    – Structured SVM

SLIDE 43

The lay of the land

HMM: A generative model, assigns probabilities to sequences

Two roads diverge

  • Hidden Markov Models are actually just linear classifiers
  • Don't really care whether we are predicting probabilities. We are assigning scores to a full output for a given input (like multiclass)
  • Generalize algorithms for linear classifiers. Sophisticated models that can use arbitrary features
    – Structured Perceptron
    – Structured SVM

  • Model probabilities via exponential functions. Gives us the log-linear representation
  • Log-probabilities for sequences for a given input
  • Learn by maximizing likelihood. Sophisticated models that can use arbitrary features
    – Conditional Random Field

SLIDE 44

The lay of the land

HMM: A generative model, assigns probabilities to sequences

Two roads diverge

  • Hidden Markov Models are actually just linear classifiers
  • Don't really care whether we are predicting probabilities. We are assigning scores to a full output for a given input (like multiclass)
  • Generalize algorithms for linear classifiers. Sophisticated models that can use arbitrary features
    – Structured Perceptron
    – Structured SVM

  • Model probabilities via exponential functions. Gives us the log-linear representation
  • Log-probabilities for sequences for a given input
  • Learn by maximizing likelihood. Sophisticated models that can use arbitrary features
    – Conditional Random Field

Discriminative/Conditional models

SLIDE 45

The lay of the land

HMM: A generative model, assigns probabilities to sequences

Two roads diverge

  • Hidden Markov Models are actually just linear classifiers
  • Don't really care whether we are predicting probabilities. We are assigning scores to a full output for a given input (like multiclass)
  • Generalize algorithms for linear classifiers. Sophisticated models that can use arbitrary features
    – Structured Perceptron
    – Structured SVM

  • Model probabilities via exponential functions. Gives us the log-linear representation
  • Log-probabilities for sequences for a given input
  • Learn by maximizing likelihood. Sophisticated models that can use arbitrary features
    – Conditional Random Field

Applicable beyond sequences
Eventually, similar objective minimized with different loss functions

Discriminative/Conditional models

SLIDE 46

The lay of the land

HMM: A generative model, assigns probabilities to sequences

Two roads diverge

  • Hidden Markov Models are actually just linear classifiers
  • Don't really care whether we are predicting probabilities. We are assigning scores to a full output for a given input (like multiclass)
  • Generalize algorithms for linear classifiers. Sophisticated models that can use arbitrary features
    – Structured Perceptron
    – Structured SVM

  • Model probabilities via exponential functions. Gives us the log-linear representation
  • Log-probabilities for sequences for a given input
  • Learn by maximizing likelihood. Sophisticated models that can use arbitrary features
    – Conditional Random Field

Applicable beyond sequences
Eventually, similar objective minimized with different loss functions

Discriminative/Conditional models

Coming soon…

SLIDE 47

Sequence models: Summary

  • Goal: Predict an output sequence given an input sequence
  • Hidden Markov Model
  • Inference
    – Predict via the Viterbi algorithm
  • Conditional models / discriminative models
    – Local approaches (no inference during training)
      • MEMM, conditional Markov model
    – Global approaches (inference during training)
      • CRF, structured perceptron
  • To think
    – What are the parts in a sequence model?
    – How is each model scoring these parts?

The same dichotomy holds for more general structures. Prediction is not always tractable for general structures.