Search-Guided, Lightly-Supervised Training of Structured Prediction Energy Networks

Pedram Rooshenas, Dongxu Zhang, Gopal Sharma, Andrew McCallum

Structured Prediction

We are interested in predicting structured output variables.
[Figure: example structured prediction tasks; pictures from Belanger (2016) and Altinel (2018)]
We have a reward function that provides indirect supervision. We want to learn a smooth version of the reward function so that we can use gradient-descent inference at test time.
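As a concrete illustration of indirect supervision, a reward function might simply score how well a candidate tagging agrees with a weak reference. This is a minimal sketch; the function name `token_level_reward` and the tag strings are illustrative, not from the talk.

```python
def token_level_reward(y_pred, y_ref):
    """Fraction of positions where the candidate tag matches a reference.

    In lightly-supervised settings the reference may come from heuristics
    or a rule-based annotator rather than gold labels, so the reward
    supervises training only indirectly.
    """
    assert len(y_pred) == len(y_ref)
    return sum(p == r for p, r in zip(y_pred, y_ref)) / len(y_ref)

r = token_level_reward(["author", "author", "title"],
                       ["author", "title", "title"])   # 2 of 3 match -> 2/3
```

The learned energy network approximates a smooth surrogate of such a reward, so its gradients can drive inference.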
Sampling and training (samples y0, y1, ..., y5):

- We sample a sequence of points y0, y1, ..., y5 from the energy function using noisy gradient-descent inference.
- We then project each sample to the domain of the reward function (the sample is a point in the simplex, but the domain of the reward function is often discrete, i.e., the vertices of the simplex).
- The search procedure takes the sample as input and returns an output structure by searching the reward function.
- We expect the sample and the search output to have the same ranking under the energy function as under the reward function.
- When we find a pair of points that violates the ranking constraints (a ranking violation), we update the energy function to reduce the violation.
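The sampling, projection, and ranking-violation steps above can be sketched as follows. This is a toy NumPy sketch under stated assumptions: the energy is a simple linear function of relaxed tags (standing in for the paper's deep network), and the margin-based update is a generic surrogate whose exact form may differ from the paper's loss.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

L, K = 5, 3                       # sequence length, number of tags
W = rng.normal(size=(L, K))       # hypothetical energy parameters

def energy(y_relaxed, W):
    # Linear energy over relaxed tags; a stand-in for a deep energy network.
    return float(np.sum(W * y_relaxed))

def noisy_gd_sample(W, steps=20, lr=0.5, noise=0.1):
    """Noisy gradient-descent inference: descend the energy in logit space,
    adding Gaussian noise so repeated calls yield different samples y0, y1, ..."""
    logits = rng.normal(size=(L, K))
    for _ in range(steps):
        y = softmax(logits)
        # dE/dlogits through the row-wise softmax
        g = y * (W - (W * y).sum(axis=-1, keepdims=True))
        logits -= lr * g + noise * rng.normal(size=logits.shape)
    return softmax(logits)

y_relaxed = noisy_gd_sample(W)        # a point inside the simplex
y_discrete = y_relaxed.argmax(-1)     # projection onto a simplex vertex

def one_hot(idx, K):
    v = np.zeros((len(idx), K))
    v[np.arange(len(idx)), idx] = 1.0
    return v

def ranking_update(W, y_a, y_b, r_a, r_b, margin=1.0, lr=0.1):
    """If the reward ranks y_a above y_b but the energy does not separate
    them by a margin, nudge W to lower E(y_a) and raise E(y_b)."""
    if r_a > r_b and energy(y_a, W) > energy(y_b, W) - margin:
        W = W - lr * (y_a - y_b)      # dE/dW = y for this linear energy
    return W

# Suppose search returned y_a with reward 0.9 while the sample y_b scored 0.5.
y_b = one_hot(y_discrete, K)
y_a = one_hot(noisy_gd_sample(W).argmax(-1), K)
W = ranking_update(W, y_a, y_b, r_a=0.9, r_b=0.5)
```

The noise in the inference steps is what produces the diverse samples y0 through y5 on the slide; without it, repeated descents would collapse to the same local minimum.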
[Figure: energy network for citation field extraction. The tokens (e.g., "Wei Li . Deep Learning for ...") pass through an input embedding and are combined with the tag distribution (author, title, ...); a convolutional layer with multiple filters and different window sizes is followed by max pooling and concatenation, and a multi-layer perceptron outputs the energy.]
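The figure's architecture can be sketched in miniature: concatenate token embeddings with the relaxed tag distribution, convolve with filters of several window sizes, max-pool, and map to a scalar energy. All shapes and parameters below are illustrative, and a single linear layer stands in for the figure's multi-layer perceptron.

```python
import numpy as np

rng = np.random.default_rng(1)

def energy_net(token_emb, tag_dist, conv_filters, out_w, out_b):
    """Toy citation-extraction energy network mirroring the figure."""
    x = np.concatenate([token_emb, tag_dist], axis=1)       # (L, D + K)
    pooled = []
    for f in conv_filters:                                  # f: (win, D + K)
        win = f.shape[0]
        feature_map = [np.tanh(np.sum(x[i:i + win] * f))    # 1-D convolution
                       for i in range(x.shape[0] - win + 1)]
        pooled.append(max(feature_map))                     # max pooling
    h = np.array(pooled)                                    # concatenation
    return float(h @ out_w + out_b)                         # scalar energy

L, D, K = 6, 8, 4                     # tokens, embedding dim, number of tags
token_emb = rng.normal(size=(L, D))   # e.g. embeddings of "Wei Li . Deep ..."
tag_dist = rng.dirichlet(np.ones(K), size=L)   # relaxed tags (author, title, ...)
filters = [rng.normal(size=(w, D + K)) for w in (2, 3)]  # two window sizes
E = energy_net(token_emb, tag_dist, filters, rng.normal(size=2), 0.0)
```

Feeding the tag *distribution* rather than hard tags is what keeps the energy differentiable with respect to the output, enabling the gradient-descent inference described earlier.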
[Figure: shape parsing. Given an input image I, the model predicts a program built from primitives such as c(32,32,24) and t(32,32,20) combined with '+'; a graphics engine renders the predicted program into an output image O, which is compared with the input.]
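In the shape-parsing task, the reward compares the rendering of a predicted program against the input image. The sketch below is a hypothetical graphics engine, not the one used in the talk: circles and a deliberately simplified triangle primitive are rasterized onto a binary canvas, and intersection-over-union serves as the reward.

```python
import numpy as np

def render(program, size=64):
    """Hypothetical graphics engine: rasterize c(x, y, r) circles and
    t(x, y, s) triangles onto a binary canvas, unioning shapes ('+')."""
    yy, xx = np.mgrid[0:size, 0:size]
    canvas = np.zeros((size, size), dtype=bool)
    for op, x, y, p in program:
        if op == "c":                                   # circle, radius p
            canvas |= (xx - x) ** 2 + (yy - y) ** 2 <= p ** 2
        elif op == "t":                                 # simplified triangle
            canvas |= ((yy >= y - p) & (yy <= y)
                       & (np.abs(xx - x) <= (yy - (y - p)) / 2))
    return canvas

def iou_reward(pred_img, target_img):
    # Intersection-over-union between the rendered output and the input image.
    inter = np.logical_and(pred_img, target_img).sum()
    union = np.logical_or(pred_img, target_img).sum()
    return inter / union if union else 1.0

target = render([("c", 16, 16, 12)])
pred = render([("c", 16, 16, 12), ("t", 32, 48, 16), ("c", 16, 24, 12)])
reward = iou_reward(pred, target)     # extra shapes lower the reward below 1
```

Because the reward is computed only from the rendered result, it supervises the program indirectly: no ground-truth program is ever needed.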
[Figure: energy network for program induction. The input image is encoded by a CNN; the program (e.g., circle(16,16,12) triangle(32,48,16) + circle(16,24,12)) is represented as an output distribution; a convolutional layer and a multi-layer perceptron map them to an energy value.]