L2S: Learning to Search


CS 6355: Structured Prediction. Some slides adapted from Daumé and Ross.

Inference
• What is inference? An overview of what we have seen before
• Combinatorial optimization
• Different views of inference


Learning to search: General setting

Predicting an output z as a sequence of decisions. General data structures:
– State: partial assignments to (z_1, z_2, …, z_n)
– Initial state: the empty assignment (−, −, …, −)
– Actions: pick a component z_j and assign a label to it
– Transition model: move from one partial structure to another
– Goal test: whether all z components are assigned
  • A goal state does not need to be optimal
– Path cost/score function: w^T φ(x, node), or more generally a neural network that depends on the input x and the node
  • A node contains the current state and a back pointer to trace back the search path

Example

Suppose each y can be one of A, B, or C. (Figure: a model with inputs x_1, x_2, x_3 and outputs y_1, y_2, y_3.)
• State: triples (y_1, y_2, y_3), each component possibly unknown, e.g. (A, −, −), (−, A, A), (−, −, −), …
• Start state: (−, −, −)
• Transition: fill in one of the unknowns
• End state: all three y's are assigned

The search tree starts at (−, −, −); its children include (A, −, −), (B, −, −), (C, −, −); those branch further, e.g. (A, A, −), …, (C, C, −); and so on down to complete assignments (A, A, A), …, (C, C, C).
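The search space above is small enough to enumerate directly. The following is a minimal sketch (not from the slides, and the state encoding is an assumption): states are triples over {A, B, C} with None marking an unknown, and a transition fills in one unknown slot.

```python
from itertools import product

LABELS = ["A", "B", "C"]

def successors(state):
    """All states reachable by filling in one unknown (None) slot."""
    out = []
    for j, z in enumerate(state):
        if z is None:
            for label in LABELS:
                out.append(state[:j] + (label,) + state[j + 1:])
    return out

def is_goal(state):
    return None not in state

start = (None, None, None)
# Exhaustively enumerate all goal states reachable from the start.
frontier, goals = [start], set()
while frontier:
    state = frontier.pop()
    if is_goal(state):
        goals.add(state)
    else:
        frontier.extend(successors(state))

assert goals == set(product(LABELS, repeat=3))  # all 27 complete assignments
```

Note that the same goal state is reached along many paths (one per assignment order), which is exactly why the search-based view needs a scoring function to rank partial structures rather than brute-force enumeration.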

1st Framework — LaSO: Learning as Search Optimization [Hal Daumé III and Daniel Marcu, ICML 2005]

The enqueue function in LaSO

• The goal of learning is to produce an enqueue function that
  – places good hypotheses high on the queue
  – places bad hypotheses low on the queue
• LaSO assumes enqueue is based on two components: g + h
  – g: path component (g = w^T φ(x, node)). The goal is to learn w. How?
  – h: heuristic component (h is given)
• This recovers A* if h is admissible, heuristic search if h is not admissible, best-first search if h = 0, and beam search if the queue size is limited.
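The g + h decomposition can be sketched as follows. This is an illustration, not LaSO's exact implementation; `phi` (the feature function) and `heuristic` are stand-ins supplied by the caller.

```python
import heapq
import itertools

import numpy as np

_counter = itertools.count()  # tie-breaker so heapq never compares nodes

def enqueue(queue, candidates, w, x, phi, heuristic):
    """Push candidate nodes onto a max-priority queue ordered by g + h."""
    for node in candidates:
        g = float(np.dot(w, phi(x, node)))  # learned path component
        h = heuristic(node)                 # given heuristic component
        # heapq is a min-heap, so negate the score to pop the best first.
        heapq.heappush(queue, (-(g + h), next(_counter), node))
    return queue
```

With h = 0 and an unbounded queue this behaves as best-first search on the learned score; truncating the queue to size k after each round of pushes would give beam search.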

“y-good” node

Assumption: for any given node s and a gold output y, we can tell whether s can or cannot lead to y.
Definition: the node s is y-good if s can lead to y.

Example: y = (y_1, y_2, y_3), where each y can be one of A, B, or C, and the true label is (y_1 = A, y_2 = B, y_3 = C). In the search tree rooted at (−, −, −), the children (A, −, −) and (−, B, −) are y-good, while (C, −, −) is not.
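For the partial-assignment search space above, the y-good check is easy to state in code. This sketch assumes states are tuples with None for unassigned slots: a node is y-good iff every slot it has already filled agrees with the gold output, since the remaining unknowns can still be filled to reach y.

```python
def is_y_good(state, gold):
    """True iff the partial assignment `state` can still be completed to `gold`."""
    return all(z is None or z == g for z, g in zip(state, gold))

gold = ("A", "B", "C")
assert is_y_good(("A", None, None), gold)
assert is_y_good((None, "B", None), gold)
assert not is_y_good(("C", None, None), gold)
```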

Learning in LaSO

• Search as if in the prediction phase, but when an error is made:
  – update w
  – clear the queue and insert all the correct moves
• Two kinds of errors:
  – Error type 1: none of the queue is y-good
  – Error type 2: the goal state is not y-good

Learning Algorithm in LaSO

Algo Learn(problem, initial, enqueue, w, x, y)
    nodes = MakeQueue(MakeNode(problem, initial))
    while nodes is not empty:
        node = Pop(nodes)
        if error:
            step 1: update w
            step 2: refresh queue
        else if GoalTest(node): return w
        next = Result(node, Actions(node))
        nodes = enqueue(problem, nodes, next, w)

What should learning do?

(Figure: a search tree whose path so far passes through y-good nodes 1–3; at the frontier, node 4 is y-good and node 5 is the current node.)

Say we found an error (of either type) at the current node. Then we should have chosen node 4 instead of the current node: node 4 is the y-good sibling of the current node.

Learning Algorithm in LaSO

Algo Learn(problem, initial, enqueue, w, x, y)
    nodes = MakeQueue(MakeNode(problem, initial))
    while nodes is not empty:
        node = Pop(nodes)
        if none of (node + nodes) is y-good,
           or (GoalTest(node) and node is not y-good):
            sibs = siblings(node, y)
            w = update(w, x, sibs, {node, nodes})
            nodes = MakeQueue(sibs)
        else if GoalTest(node): return w
        next = Result(node, Actions(node))
        nodes = enqueue(problem, nodes, next, w)

Parameter Updates

We need to specify w = update(w, x, sibs, nodes). A simple perceptron-style update rule is w ← w + Δ, where

    Δ = (1/|sibs|) Σ_{n ∈ sibs} Φ(x, n) − (1/|nodes|) Σ_{n ∈ nodes} Φ(x, n)

It comes with the usual perceptron-style mistake bound and generalization bound. (See references)
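The update above can be written directly in code. A sketch, assuming `Phi(x, node)` returns a NumPy feature vector: the update moves w toward the average features of the y-good siblings and away from the average features of the erroneous queue contents.

```python
import numpy as np

def update(w, x, sibs, nodes, Phi):
    """Perceptron-style LaSO update: w + mean(good features) - mean(bad features)."""
    good = np.mean([Phi(x, n) for n in sibs], axis=0)
    bad = np.mean([Phi(x, n) for n in nodes], axis=0)
    return w + good - bad
```

After the update, nodes in `sibs` score higher under w^T Φ and nodes in `nodes` score lower, which is exactly what the enqueue function needs to rank y-good hypotheses above the mistakes.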

2nd Framework — SEARN: Search and Learning [Hal Daumé III, John Langford, and Daniel Marcu, 2007]

Policy

• A policy is a mapping from a state to an action: for a given node, the policy tells what action should be taken.
• A policy gives a search path in the search space.
  – Different policies mean different search paths
  – It can be thought of as the “driver” in the search space
• A policy may be deterministic, or may contain some randomness. (More on this later)
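Concretely, a policy is just a function from a state to an action. A minimal illustration (not from the slides; the (slot, label) action encoding is an assumption carried over from the earlier example):

```python
import random

def deterministic_policy(state):
    """A toy deterministic policy: always label the first unknown slot 'A'."""
    j = state.index(None)  # first unassigned slot
    return (j, "A")

def stochastic_policy(state, rng=random.Random(0)):
    """A toy stochastic policy: label the first unknown slot at random."""
    j = state.index(None)
    return (j, rng.choice(["A", "B", "C"]))

assert deterministic_policy((None, None, None)) == (0, "A")
```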

Reference Policy and Learned Policy

• We assume we already have a good reference policy π^ref for the training data (y, d)
  – i.e., examples associated with costs for outputs
• Goal: learn a good policy π̂ for test data, when we do not have access to the cost vector d. (Imitation learning)
• For example, if we are using Hamming distance for the cost vector d, then the reference policy is trivial to compute. Why? Just make the right decision at every step.
  – Suppose the gold state is (A, B, C, A) and we are at state (A, C, −, −). The reference policy tells us the next action is to assign C to the third slot.
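The Hamming-cost reference policy in the bullet above is a one-liner. A sketch, assuming states are partial tuples and the policy fills the first unassigned slot: the optimal next action is simply the gold label of that slot, matching the (A, C, −, −) example.

```python
def reference_policy(state, gold):
    """Optimal next action under Hamming cost: copy the gold label."""
    j = state.index(None)  # next unassigned slot
    return (j, gold[j])

gold = ("A", "B", "C", "A")
assert reference_policy(("A", "C", None, None), gold) == (2, "C")
```

Note that the already-made mistake in slot 2 (C instead of B) does not change the reference action: under Hamming cost, each slot's contribution is independent, so the best continuation is still to get every remaining slot right.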

Cost-Sensitive Classification

Suppose we want to learn a classifier h that maps examples to one of L labels.

Standard multiclass classification
• Training data: pairs of examples and labels, (y, z) ∈ Y × [L]
• Learning goal: find a classifier that has low error
  – min_h Pr[ h(y) ≠ z ]

Cost-sensitive classification
• Training data: an example paired with a cost vector that lists the cost of predicting each label, (y, d) ∈ Y × [0, ∞)^L
• Learning goal: find a classifier that has low expected cost
  – min_h E_{(y, d)} [ d_{h(y)} ]

Exercise: how would you design a cost-sensitive learner?

SEARN uses a cost-sensitive learner to learn a policy.
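One simple answer to the exercise (an illustration, not SEARN's prescribed learner): train one regressor per label to predict that label's cost, then predict the label with the lowest estimated cost.

```python
import numpy as np

class CostSensitiveClassifier:
    """Least-squares cost regression, one weight vector per label."""

    def fit(self, X, C):
        # X: (n, d) examples; C: (n, L) per-label costs
        self.W, *_ = np.linalg.lstsq(X, C, rcond=None)  # W: (d, L)
        return self

    def predict(self, X):
        # Predict the label whose estimated cost is lowest.
        return np.argmin(X @ self.W, axis=1)
```

Other standard reductions exist (e.g. weighted one-against-all), but any learner of this shape suffices for the role SEARN needs: mapping a state's features to the lowest-cost action.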

SEARN at test time

We have already learned a policy. We can use this policy to construct a sequence of decisions y and get the final structured output.
1. Use the learned policy on the initial state (−, …, −) to compute y_1
2. Use the learned policy on state (y_1, −, …, −) to compute y_2
3. Keep going until we get y = (y_1, …, y_n)
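The three steps above can be sketched as a single decoding loop. This assumes the (slot, label) action encoding and partial-tuple states from the earlier sketches; `policy` is any callable of that shape.

```python
def decode(policy, n):
    """Run the learned policy from the empty state until all n slots are filled."""
    state = (None,) * n
    while None in state:
        j, label = policy(state)
        state = state[:j] + (label,) + state[j + 1:]
    return state

# e.g. with a toy policy that always assigns "A" to the first unknown slot:
assert decode(lambda s: (s.index(None), "A"), 3) == ("A", "A", "A")
```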

SEARN at training time

• The core idea in training is to notice that at each decision step, we are actually doing a cost-sensitive classification.
• Construct cost-sensitive classification examples (s, c) with state s and cost vector c.
• Learn a cost-sensitive classifier. (This is nothing but a policy)

Roll-in, Roll-out

• Roll-in: at each state, use some policy to move to a new state.
• Then ask: what is the cost of deviating from the policy at this step?
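The roll-out half of that question can be sketched concretely under Hamming cost (a hedged illustration, reusing the partial-tuple states and (slot, label) actions assumed earlier): from a rolled-in state, take each candidate action, then roll out with the policy to a complete output and score it against the gold output. The resulting cost vector, paired with the state, is one cost-sensitive classification example.

```python
def hamming(pred, gold):
    return sum(p != g for p, g in zip(pred, gold))

def apply_action(state, action):
    j, label = action
    return state[:j] + (label,) + state[j + 1:]

def rollout_costs(state, actions, policy, gold):
    """Cost of deviating to each action at `state`, completing with `policy`."""
    costs = []
    for a in actions:
        s = apply_action(state, a)
        while None in s:                 # roll out to a complete output
            s = apply_action(s, policy(s))
        costs.append(hamming(s, gold))
    return costs
```

Rolling out with the reference policy gives each action's best-case completion cost; rolling out with the learned policy (or a mixture) instead measures the cost the current system would actually incur, which is the distinction the roll-in/roll-out framing is building toward.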
