Search-Guided, Lightly-Supervised Training of Structured Prediction Energy Networks

Pedram Rooshenas, Dongxu Zhang, Gopal Sharma, Andrew McCallum

Structured Prediction

We are interested in predicting structured output variables.
[Figure: example structured prediction tasks; pictures from Belanger (2016) and Altinel (2018)]
We have a reward function that provides indirect supervision. We want to learn a smooth version of the reward function so that we can use gradient-descent inference at test time.
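As a concrete illustration of indirect supervision, a reward function might simply score how well a candidate tagging agrees with a weak reference. This is a minimal sketch; the function name `token_level_reward` and the tag strings are illustrative, not from the talk.

```python
def token_level_reward(y_pred, y_ref):
    """Fraction of positions where the candidate tag matches a reference.

    In lightly-supervised settings the reference may come from heuristics
    or a rule-based annotator rather than gold labels, so the reward
    supervises training only indirectly.
    """
    assert len(y_pred) == len(y_ref)
    return sum(p == r for p, r in zip(y_pred, y_ref)) / len(y_ref)

r = token_level_reward(["author", "author", "title"],
                       ["author", "title", "title"])   # 2 of 3 match -> 2/3
```

The learned energy network approximates a smooth surrogate of such a reward, so its gradients can drive inference.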
Sampling and training (samples y0, y1, ..., y5):

- We sample a sequence of points y0, y1, ..., y5 from the energy function using noisy gradient-descent inference.
- We then project each sample to the domain of the reward function (the sample is a point in the simplex, but the domain of the reward function is often discrete, i.e., the vertices of the simplex).
- The search procedure takes the sample as input and returns an output structure by searching the reward function.
- We expect the sample and the search output to have the same ranking under the energy function as under the reward function.
- When we find a pair of points that violates the ranking constraints (a ranking violation), we update the energy function to reduce the violation.
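The sampling, projection, and ranking-violation steps above can be sketched as follows. This is a toy NumPy sketch under stated assumptions: the energy is a simple linear function of relaxed tags (standing in for the paper's deep network), and the margin-based update is a generic surrogate whose exact form may differ from the paper's loss.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

L, K = 5, 3                       # sequence length, number of tags
W = rng.normal(size=(L, K))       # hypothetical energy parameters

def energy(y_relaxed, W):
    # Linear energy over relaxed tags; a stand-in for a deep energy network.
    return float(np.sum(W * y_relaxed))

def noisy_gd_sample(W, steps=20, lr=0.5, noise=0.1):
    """Noisy gradient-descent inference: descend the energy in logit space,
    adding Gaussian noise so repeated calls yield different samples y0, y1, ..."""
    logits = rng.normal(size=(L, K))
    for _ in range(steps):
        y = softmax(logits)
        # dE/dlogits through the row-wise softmax
        g = y * (W - (W * y).sum(axis=-1, keepdims=True))
        logits -= lr * g + noise * rng.normal(size=logits.shape)
    return softmax(logits)

y_relaxed = noisy_gd_sample(W)        # a point inside the simplex
y_discrete = y_relaxed.argmax(-1)     # projection onto a simplex vertex

def one_hot(idx, K):
    v = np.zeros((len(idx), K))
    v[np.arange(len(idx)), idx] = 1.0
    return v

def ranking_update(W, y_a, y_b, r_a, r_b, margin=1.0, lr=0.1):
    """If the reward ranks y_a above y_b but the energy does not separate
    them by a margin, nudge W to lower E(y_a) and raise E(y_b)."""
    if r_a > r_b and energy(y_a, W) > energy(y_b, W) - margin:
        W = W - lr * (y_a - y_b)      # dE/dW = y for this linear energy
    return W

# Suppose search returned y_a with reward 0.9 while the sample y_b scored 0.5.
y_b = one_hot(y_discrete, K)
y_a = one_hot(noisy_gd_sample(W).argmax(-1), K)
W = ranking_update(W, y_a, y_b, r_a=0.9, r_b=0.5)
```

The noise in the inference steps is what produces the diverse samples y0 through y5 on the slide; without it, repeated descents would collapse to the same local minimum.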
[Figure: energy network for citation field extraction. The tokens (e.g., "Wei Li . Deep Learning for ...") pass through an input embedding and are combined with the tag distribution (author, title, ...); a convolutional layer with multiple filters and different window sizes is followed by max pooling and concatenation, and a multi-layer perceptron outputs the energy.]
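The figure's architecture can be sketched in miniature: concatenate token embeddings with the relaxed tag distribution, convolve with filters of several window sizes, max-pool, and map to a scalar energy. All shapes and parameters below are illustrative, and a single linear layer stands in for the figure's multi-layer perceptron.

```python
import numpy as np

rng = np.random.default_rng(1)

def energy_net(token_emb, tag_dist, conv_filters, out_w, out_b):
    """Toy citation-extraction energy network mirroring the figure."""
    x = np.concatenate([token_emb, tag_dist], axis=1)       # (L, D + K)
    pooled = []
    for f in conv_filters:                                  # f: (win, D + K)
        win = f.shape[0]
        feature_map = [np.tanh(np.sum(x[i:i + win] * f))    # 1-D convolution
                       for i in range(x.shape[0] - win + 1)]
        pooled.append(max(feature_map))                     # max pooling
    h = np.array(pooled)                                    # concatenation
    return float(h @ out_w + out_b)                         # scalar energy

L, D, K = 6, 8, 4                     # tokens, embedding dim, number of tags
token_emb = rng.normal(size=(L, D))   # e.g. embeddings of "Wei Li . Deep ..."
tag_dist = rng.dirichlet(np.ones(K), size=L)   # relaxed tags (author, title, ...)
filters = [rng.normal(size=(w, D + K)) for w in (2, 3)]  # two window sizes
E = energy_net(token_emb, tag_dist, filters, rng.normal(size=2), 0.0)
```

Feeding the tag *distribution* rather than hard tags is what keeps the energy differentiable with respect to the output, enabling the gradient-descent inference described earlier.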
[Figure: shape parsing. Given an input image I, the model predicts a program built from primitives such as c(32,32,24) and t(32,32,20) combined with '+'; a graphics engine renders the predicted program into an output image O, which is compared with the input.]
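In the shape-parsing task, the reward compares the rendering of a predicted program against the input image. The sketch below is a hypothetical graphics engine, not the one used in the talk: circles and a deliberately simplified triangle primitive are rasterized onto a binary canvas, and intersection-over-union serves as the reward.

```python
import numpy as np

def render(program, size=64):
    """Hypothetical graphics engine: rasterize c(x, y, r) circles and
    t(x, y, s) triangles onto a binary canvas, unioning shapes ('+')."""
    yy, xx = np.mgrid[0:size, 0:size]
    canvas = np.zeros((size, size), dtype=bool)
    for op, x, y, p in program:
        if op == "c":                                   # circle, radius p
            canvas |= (xx - x) ** 2 + (yy - y) ** 2 <= p ** 2
        elif op == "t":                                 # simplified triangle
            canvas |= ((yy >= y - p) & (yy <= y)
                       & (np.abs(xx - x) <= (yy - (y - p)) / 2))
    return canvas

def iou_reward(pred_img, target_img):
    # Intersection-over-union between the rendered output and the input image.
    inter = np.logical_and(pred_img, target_img).sum()
    union = np.logical_or(pred_img, target_img).sum()
    return inter / union if union else 1.0

target = render([("c", 16, 16, 12)])
pred = render([("c", 16, 16, 12), ("t", 32, 48, 16), ("c", 16, 24, 12)])
reward = iou_reward(pred, target)     # extra shapes lower the reward below 1
```

Because the reward is computed only from the rendered result, it supervises the program indirectly: no ground-truth program is ever needed.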
[Figure: energy network for program induction. The input image is encoded by a CNN; the program (e.g., circle(16,16,12) triangle(32,48,16) + circle(16,24,12)) is represented as an output distribution; a convolutional layer and a multi-layer perceptron map them to an energy value.]