Video Paragraph Captioning using Hierarchical Recurrent Neural Networks
Haonan Yu, Jiang Wang, Zhiheng Huang, Yi Yang, Wei Xu
Problem
Given a video, generate a paragraph (multiple sentences).
The person entered the kitchen. The person opened the drawer. The person took out a knife and a sharpener. The person sharpened the knife. The person cleaned the knife.
VS.
The person sharpened the knife in the kitchen.
The person took out some potatoes. The person peeled the potatoes. The person turned on the stove.
The person took out some potatoes. The person peeled the potatoes. … …
[Diagram labels: RNN]
(a) Sentence Generator
Input Words → Embedding (512) → Recurrent I (512) → Multimodal (1024) → Hidden (512) → Softmax → MaxID → Predicted Words
Video Feature Pool → Attention I → Attention II → Sequential Softmax → Weighted Average
(b) Paragraph Generator
Embedding Average, Last Instance → Sentence Embedding (512) → Recurrent II (512) → Paragraph State (512)
Video Feature Pool
Appearance Feature Pool
> VGG-16 (fc7) [Simonyan et al., 2015], pre-trained on the ImageNet dataset
Action Feature Pool
> C3D (fc6) [Tran et al., 2015], pre-trained on the Sports-1M dataset
> Dense Trajectories + Fisher Vector [Wang et al., 2011]
Video Feature Pool → Attention I → Attention II → Sequential Softmax → Weighted Average
Inputs to attention: the feature pool (items …, i-1, i, i+1, …) and the previous recurrent state (t-1) of Recurrent I (512)
Output of Sequential Softmax: attention weights
Dot product of the attention weights with the feature pool → average feature (input to multimodal layer)
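The weighted-average step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the relevance scores are taken as given (the Attention I and Attention II layers that would compute them from the feature pool and the previous recurrent state are not modeled), and the names `softmax` and `attend` are hypothetical.

```python
import math

def softmax(scores):
    """Normalize raw scores into attention weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(feature_pool, scores):
    """Return (attention weights, weighted average of the pooled features).

    feature_pool: list of equal-length feature vectors.
    scores: one raw relevance score per pooled feature (assumed given here).
    """
    weights = softmax(scores)
    dim = len(feature_pool[0])
    # Dot product of the weights with each feature dimension across the pool.
    avg = [sum(w * f[d] for w, f in zip(weights, feature_pool))
           for d in range(dim)]
    return weights, avg

# Toy pool of three 2-dimensional features with equal scores.
pool = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, avg = attend(pool, [0.0, 0.0, 0.0])
```

With equal scores the weights are uniform, so the result is the plain mean of the pool; unequal scores shift the average toward the higher-scored features.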
[Figure: the sentence generator and the paragraph generator unrolled over sentence n-1 and sentence n. Labels: visual features, current word, embedding, multimodal, hidden, softmax, maxid, next word, input to next sentence. Sizes shown: 512, 1024, 7192.]
Input Words → Embedding (512) → Recurrent I (512)
Embedding Average, Last Instance → Sentence Embedding (512)
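A rough sketch of how the two inputs to the Sentence Embedding layer might be assembled is given here. Several things are assumptions made for illustration: the two vectors are combined by concatenation, the learned Sentence Embedding layer itself is not modeled, and the names `average` and `sentence_inputs` are hypothetical.

```python
def average(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

def sentence_inputs(word_embeddings, recurrent_states):
    """Build the compact input summarizing one generated sentence.

    word_embeddings: one embedding vector per word of the sentence.
    recurrent_states: one recurrent state per word (same order).
    """
    embedding_average = average(word_embeddings)  # "Embedding Average"
    last_instance = recurrent_states[-1]          # "Last Instance"
    # Assumption: the two parts are concatenated before the learned layer.
    return embedding_average + list(last_instance)

# Toy sentence of two words with 2-dimensional embeddings and states.
combined = sentence_inputs([[1.0, 3.0], [3.0, 5.0]], [[0.0, 0.0], [7.0, 9.0]])
```

In this toy case the first half of `combined` is the mean of the word embeddings and the second half is the final recurrent state.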
YouTube2Text
> open-domain
> 1,970 videos, ~80k video-sentence pairs, 12k unique words
> only one sentence for a video (special case)
TACoS-MultiLevel
> closed-domain: cooking
> 173 videos, 16,145 intervals, ~40k interval-sentence pairs, 2k unique words
> several dependent sentences for a video
Evaluation metrics
> BLEU [Papineni et al., 2002]
> METEOR [Banerjee and Lavie, 2005]
> CIDEr [Vedantam et al., 2015]
The higher, the better.
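To give a feel for what BLEU measures, here is a minimal sketch of its core ingredient, the clipped (modified) n-gram precision. It is not a full BLEU implementation: BLEU additionally combines the precisions for several n-gram orders (up to 4 for BLEU@4) with a geometric mean and applies a brevity penalty. The function names `ngrams` and `modified_precision` are hypothetical.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    """Clipped n-gram precision of a candidate against reference sentences."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    # For each n-gram, the most times it appears in any single reference.
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    # A candidate n-gram is credited at most as often as a reference has it.
    clipped = sum(min(count, max_ref[gram])
                  for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

candidate = "the person peeled the potatoes".split()
references = ["the person peeled some potatoes".split()]
p1 = modified_precision(candidate, references, 1)
p2 = modified_precision(candidate, references, 2)
```

Clipping is what stops a caption from scoring well by repeating a common word: the second "the" in the candidate earns no credit because the reference contains "the" only once.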
[Charts: BLEU@4, METEOR, and CIDEr scores]
Evaluation metric scores are not always reliable, so we need further comparison.
The person took out some potatoes. The person peeled the potatoes. … …
[Diagram labels: RNN]
Which of the two sentences better describes the video?
good or bad