L2S: Learning to Search


CS 6355: Structured Prediction. Some slides adapted from Daumé and Ross.

Inference
• What is inference? An overview of what we have seen before
• Combinatorial optimization
• Different views of inference


Learning to search: General setting

Predicting an output z as a sequence of decisions. General data structures:
– State: partial assignments to (z_1, z_2, …, z_n)
– Initial state: the empty assignment (−, −, …, −)
– Actions: pick a component z_j and assign a label to it
– Transition model: move from one partial structure to another
– Goal test: whether all z components are assigned
  • A goal state does not need to be optimal
– Path cost/score function: w^T φ(x, node), or more generally a neural network that depends on the input x and the node
  • A node contains the current state and a back pointer to trace back the search path

Example

Suppose each y can be one of A, B, or C. (Figure: a model with inputs x_1, x_2, x_3 and outputs y_1, y_2, y_3.)
• State: triples (y_1, y_2, y_3), each component possibly unknown, e.g. (A, −, −), (−, A, A), (−, −, −), …
• Start state: (−, −, −)
• Transition: fill in one of the unknowns
• End state: all three y's are assigned

The search tree starts at (−, −, −); its children include (A, −, −), (B, −, −), (C, −, −); those branch further, e.g. (A, A, −), …, (C, C, −); and so on down to complete assignments (A, A, A), …, (C, C, C).
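The search space above is small enough to enumerate directly. The following is a minimal sketch (not from the slides, and the state encoding is an assumption): states are triples over {A, B, C} with None marking an unknown, and a transition fills in one unknown slot.

```python
from itertools import product

LABELS = ["A", "B", "C"]

def successors(state):
    """All states reachable by filling in one unknown (None) slot."""
    out = []
    for j, z in enumerate(state):
        if z is None:
            for label in LABELS:
                out.append(state[:j] + (label,) + state[j + 1:])
    return out

def is_goal(state):
    return None not in state

start = (None, None, None)
# Exhaustively enumerate all goal states reachable from the start.
frontier, goals = [start], set()
while frontier:
    state = frontier.pop()
    if is_goal(state):
        goals.add(state)
    else:
        frontier.extend(successors(state))

assert goals == set(product(LABELS, repeat=3))  # all 27 complete assignments
```

Note that the same goal state is reached along many paths (one per assignment order), which is exactly why the search-based view needs a scoring function to rank partial structures rather than brute-force enumeration.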

1st Framework — LaSO: Learning as Search Optimization [Hal Daumé III and Daniel Marcu, ICML 2005]

The enqueue function in LaSO

• The goal of learning is to produce an enqueue function that
  – places good hypotheses high on the queue
  – places bad hypotheses low on the queue
• LaSO assumes enqueue is based on two components: g + h
  – g: path component (g = w^T φ(x, node)). The goal is to learn w. How?
  – h: heuristic component (h is given)
• This recovers A* if h is admissible, heuristic search if h is not admissible, best-first search if h = 0, and beam search if the queue size is limited.
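The g + h decomposition can be sketched as follows. This is an illustration, not LaSO's exact implementation; `phi` (the feature function) and `heuristic` are stand-ins supplied by the caller.

```python
import heapq
import itertools

import numpy as np

_counter = itertools.count()  # tie-breaker so heapq never compares nodes

def enqueue(queue, candidates, w, x, phi, heuristic):
    """Push candidate nodes onto a max-priority queue ordered by g + h."""
    for node in candidates:
        g = float(np.dot(w, phi(x, node)))  # learned path component
        h = heuristic(node)                 # given heuristic component
        # heapq is a min-heap, so negate the score to pop the best first.
        heapq.heappush(queue, (-(g + h), next(_counter), node))
    return queue
```

With h = 0 and an unbounded queue this behaves as best-first search on the learned score; truncating the queue to size k after each round of pushes would give beam search.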

“y-good” node

Assumption: for any given node s and a gold output y, we can tell whether s can or cannot lead to y.
Definition: the node s is y-good if s can lead to y.

Example: y = (y_1, y_2, y_3), where each y can be one of A, B, or C, and the true label is (y_1 = A, y_2 = B, y_3 = C). In the search tree rooted at (−, −, −), the children (A, −, −) and (−, B, −) are y-good, while (C, −, −) is not.
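For the partial-assignment search space above, the y-good check is easy to state in code. This sketch assumes states are tuples with None for unassigned slots: a node is y-good iff every slot it has already filled agrees with the gold output, since the remaining unknowns can still be filled to reach y.

```python
def is_y_good(state, gold):
    """True iff the partial assignment `state` can still be completed to `gold`."""
    return all(z is None or z == g for z, g in zip(state, gold))

gold = ("A", "B", "C")
assert is_y_good(("A", None, None), gold)
assert is_y_good((None, "B", None), gold)
assert not is_y_good(("C", None, None), gold)
```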

Learning in LaSO

• Search as if in the prediction phase, but when an error is made:
  – update w
  – clear the queue and insert all the correct moves
• Two kinds of errors:
  – Error type 1: none of the queue is y-good
  – Error type 2: the goal state is not y-good

Learning Algorithm in LaSO

Algo Learn(problem, initial, enqueue, w, x, y)
    nodes = MakeQueue(MakeNode(problem, initial))
    while nodes is not empty:
        node = Pop(nodes)
        if error:
            step 1: update w
            step 2: refresh queue
        else if GoalTest(node): return w
        next = Result(node, Actions(node))
        nodes = enqueue(problem, nodes, next, w)

What should learning do?

(Figure: a search tree whose path so far passes through y-good nodes 1–3; at the frontier, node 4 is y-good and node 5 is the current node.)

Say we found an error (of either type) at the current node. Then we should have chosen node 4 instead of the current node: node 4 is the y-good sibling of the current node.

Learning Algorithm in LaSO

Algo Learn(problem, initial, enqueue, w, x, y)
    nodes = MakeQueue(MakeNode(problem, initial))
    while nodes is not empty:
        node = Pop(nodes)
        if none of (node + nodes) is y-good,
           or (GoalTest(node) and node is not y-good):
            sibs = siblings(node, y)
            w = update(w, x, sibs, {node, nodes})
            nodes = MakeQueue(sibs)
        else if GoalTest(node): return w
        next = Result(node, Actions(node))
        nodes = enqueue(problem, nodes, next, w)

Parameter Updates

We need to specify w = update(w, x, sibs, nodes). A simple perceptron-style update rule is w ← w + Δ, where

    Δ = (1/|sibs|) Σ_{n ∈ sibs} Φ(x, n) − (1/|nodes|) Σ_{n ∈ nodes} Φ(x, n)

It comes with the usual perceptron-style mistake bound and generalization bound. (See references)
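The update above can be written directly in code. A sketch, assuming `Phi(x, node)` returns a NumPy feature vector: the update moves w toward the average features of the y-good siblings and away from the average features of the erroneous queue contents.

```python
import numpy as np

def update(w, x, sibs, nodes, Phi):
    """Perceptron-style LaSO update: w + mean(good features) - mean(bad features)."""
    good = np.mean([Phi(x, n) for n in sibs], axis=0)
    bad = np.mean([Phi(x, n) for n in nodes], axis=0)
    return w + good - bad
```

After the update, nodes in `sibs` score higher under w^T Φ and nodes in `nodes` score lower, which is exactly what the enqueue function needs to rank y-good hypotheses above the mistakes.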

2nd Framework — SEARN: Search and Learning [Hal Daumé III, John Langford, and Daniel Marcu, 2007]

Policy

• A policy is a mapping from a state to an action: for a given node, the policy tells what action should be taken.
• A policy gives a search path in the search space.
  – Different policies mean different search paths
  – It can be thought of as the “driver” in the search space
• A policy may be deterministic, or may contain some randomness. (More on this later)
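Concretely, a policy is just a function from a state to an action. A minimal illustration (not from the slides; the (slot, label) action encoding is an assumption carried over from the earlier example):

```python
import random

def deterministic_policy(state):
    """A toy deterministic policy: always label the first unknown slot 'A'."""
    j = state.index(None)  # first unassigned slot
    return (j, "A")

def stochastic_policy(state, rng=random.Random(0)):
    """A toy stochastic policy: label the first unknown slot at random."""
    j = state.index(None)
    return (j, rng.choice(["A", "B", "C"]))

assert deterministic_policy((None, None, None)) == (0, "A")
```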

Reference Policy and Learned Policy

• We assume we already have a good reference policy π^ref for the training data (y, d)
  – i.e., examples associated with costs for outputs
• Goal: learn a good policy π̂ for test data, when we do not have access to the cost vector d. (Imitation learning)
• For example, if we are using Hamming distance for the cost vector d, then the reference policy is trivial to compute. Why? Just make the right decision at every step.
  – Suppose the gold state is (A, B, C, A) and we are at state (A, C, −, −). The reference policy tells us the next action is to assign C to the third slot.
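The Hamming-cost reference policy in the bullet above is a one-liner. A sketch, assuming states are partial tuples and the policy fills the first unassigned slot: the optimal next action is simply the gold label of that slot, matching the (A, C, −, −) example.

```python
def reference_policy(state, gold):
    """Optimal next action under Hamming cost: copy the gold label."""
    j = state.index(None)  # next unassigned slot
    return (j, gold[j])

gold = ("A", "B", "C", "A")
assert reference_policy(("A", "C", None, None), gold) == (2, "C")
```

Note that the already-made mistake in slot 2 (C instead of B) does not change the reference action: under Hamming cost, each slot's contribution is independent, so the best continuation is still to get every remaining slot right.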

Cost-Sensitive Classification

Suppose we want to learn a classifier h that maps examples to one of L labels.

Standard multiclass classification
• Training data: pairs of examples and labels, (y, z) ∈ Y × [L]
• Learning goal: find a classifier that has low error
  – min_h Pr[ h(y) ≠ z ]

Cost-sensitive classification
• Training data: an example paired with a cost vector that lists the cost of predicting each label, (y, d) ∈ Y × [0, ∞)^L
• Learning goal: find a classifier that has low expected cost
  – min_h E_{(y, d)} [ d_{h(y)} ]

Exercise: how would you design a cost-sensitive learner?

SEARN uses a cost-sensitive learner to learn a policy.
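One simple answer to the exercise (an illustration, not SEARN's prescribed learner): train one regressor per label to predict that label's cost, then predict the label with the lowest estimated cost.

```python
import numpy as np

class CostSensitiveClassifier:
    """Least-squares cost regression, one weight vector per label."""

    def fit(self, X, C):
        # X: (n, d) examples; C: (n, L) per-label costs
        self.W, *_ = np.linalg.lstsq(X, C, rcond=None)  # W: (d, L)
        return self

    def predict(self, X):
        # Predict the label whose estimated cost is lowest.
        return np.argmin(X @ self.W, axis=1)
```

Other standard reductions exist (e.g. weighted one-against-all), but any learner of this shape suffices for the role SEARN needs: mapping a state's features to the lowest-cost action.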

SEARN at test time

We have already learned a policy. We can use this policy to construct a sequence of decisions y and get the final structured output.
1. Use the learned policy on the initial state (−, …, −) to compute y_1
2. Use the learned policy on state (y_1, −, …, −) to compute y_2
3. Keep going until we get y = (y_1, …, y_n)
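The three steps above can be sketched as a single decoding loop. This assumes the (slot, label) action encoding and partial-tuple states from the earlier sketches; `policy` is any callable of that shape.

```python
def decode(policy, n):
    """Run the learned policy from the empty state until all n slots are filled."""
    state = (None,) * n
    while None in state:
        j, label = policy(state)
        state = state[:j] + (label,) + state[j + 1:]
    return state

# e.g. with a toy policy that always assigns "A" to the first unknown slot:
assert decode(lambda s: (s.index(None), "A"), 3) == ("A", "A", "A")
```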

SEARN at training time

• The core idea in training is to notice that at each decision step, we are actually doing a cost-sensitive classification.
• Construct cost-sensitive classification examples (s, c) with state s and cost vector c.
• Learn a cost-sensitive classifier. (This is nothing but a policy)

Roll-in, Roll-out

• Roll-in: at each state, use some policy to move to a new state.
• Then ask: what is the cost of deviating from the policy at this step?
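The roll-out half of that question can be sketched concretely under Hamming cost (a hedged illustration, reusing the partial-tuple states and (slot, label) actions assumed earlier): from a rolled-in state, take each candidate action, then roll out with the policy to a complete output and score it against the gold output. The resulting cost vector, paired with the state, is one cost-sensitive classification example.

```python
def hamming(pred, gold):
    return sum(p != g for p, g in zip(pred, gold))

def apply_action(state, action):
    j, label = action
    return state[:j] + (label,) + state[j + 1:]

def rollout_costs(state, actions, policy, gold):
    """Cost of deviating to each action at `state`, completing with `policy`."""
    costs = []
    for a in actions:
        s = apply_action(state, a)
        while None in s:                 # roll out to a complete output
            s = apply_action(s, policy(s))
        costs.append(hamming(s, gold))
    return costs
```

Rolling out with the reference policy gives each action's best-case completion cost; rolling out with the learned policy (or a mixture) instead measures the cost the current system would actually incur, which is the distinction the roll-in/roll-out framing is building toward.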
