SLIDE 1

Video Paragraph Captioning using Hierarchical Recurrent Neural Networks

Haonan Yu, Jiang Wang, Zhiheng Huang, Yi Yang, Wei Xu

SLIDE 2

Problem

Given a video, generate a paragraph (multiple sentences).

SLIDE 3

Problem

Given a video, generate a paragraph (multiple sentences).

The person entered the kitchen. The person opened the drawer. The person took out a knife and a sharpener. The person sharpened the knife. The person cleaned the knife.

SLIDE 4

Problem

Given a video, generate a paragraph (multiple sentences).

"The person entered the kitchen. The person opened the drawer. The person took out a knife and a sharpener. The person sharpened the knife. The person cleaned the knife." vs. "The person sharpened the knife in the kitchen."

SLIDE 5

Motivation

Inter-sentence dependency (semantic context)

SLIDE 6

Motivation

Inter-sentence dependency (semantic context)

The person took out some potatoes.

SLIDE 7

Motivation

Inter-sentence dependency (semantic context)

The person took out some potatoes. The person peeled the potatoes. The person turned on the stove.

SLIDE 8

Motivation

Inter-sentence dependency (semantic context)

The person took out some potatoes. The person peeled the potatoes. The person turned on the stove.

We want to model this dependency.

SLIDE 9

Hierarchy

A paragraph is inherently hierarchical.

SLIDE 10

Hierarchy

A paragraph is inherently hierarchical.

The person took out some potatoes.

SLIDE 11

Hierarchy

A paragraph is inherently hierarchical.

The person took out some potatoes. The person peeled the potatoes. …

SLIDE 12

Hierarchy

A paragraph is inherently hierarchical.

The person took out some potatoes. The person peeled the potatoes. …

[Diagram: an RNN attached to each sentence.]

SLIDE 13

Hierarchy

A paragraph is inherently hierarchical.

The person took out some potatoes. The person peeled the potatoes. …

[Diagram: a sentence-level RNN encodes each sentence, and a higher-level RNN links the sentence RNNs so that context flows across sentences.]
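A rough sketch of this two-level hierarchy (a minimal sketch assuming PyTorch; the GRU cells and 512-d sizes are illustrative choices, not the paper's exact gating):

    import torch
    import torch.nn as nn

    class HierarchicalRNN(nn.Module):
        def __init__(self, vocab_size, emb_dim=512, hid_dim=512):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb_dim)
            # Sentence-level RNN: runs over the words of one sentence.
            self.sentence_rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)
            # Paragraph-level RNN: runs over sentence embeddings, carrying
            # context from one sentence to the next.
            self.paragraph_rnn = nn.GRUCell(hid_dim, hid_dim)

        def forward(self, sentences):
            # sentences: list of 1-D LongTensors of word ids, one per sentence.
            para_state = torch.zeros(1, self.paragraph_rnn.hidden_size)
            for words in sentences:
                emb = self.embed(words).unsqueeze(0)   # (1, T, emb_dim)
                _, last = self.sentence_rnn(emb)       # (1, 1, hid_dim)
                para_state = self.paragraph_rnn(last.squeeze(0), para_state)
            return para_state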

SLIDE 14

Framework

[Diagram: (a) Sentence Generator; (b) Paragraph Generator.]

SLIDE 15

Framework – language model

[Diagram, sentence generator: Input Words → Embedding (512) → Recurrent I (512) → Multimodal (1024) → Hidden (512) → Softmax → MaxID → Predicted Words.]
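Read as a stack, one decoding step of the sentence generator might look like the following (a sketch assuming PyTorch; the layer widths come from the slide, while the fusion and activation choices are our assumptions):

    import torch
    import torch.nn as nn

    class SentenceGenerator(nn.Module):
        def __init__(self, vocab_size, video_dim):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, 512)  # Input Words -> Embedding (512)
            self.recurrent1 = nn.GRUCell(512, 512)      # Recurrent I (512)
            # Multimodal layer: fuses the recurrent state with the attended video feature.
            self.multimodal = nn.Linear(512 + video_dim, 1024)
            self.hidden = nn.Linear(1024, 512)          # Hidden (512)
            self.out = nn.Linear(512, vocab_size)       # Softmax over the vocabulary

        def step(self, word_id, h, video_feat):
            e = self.embed(word_id)                     # (N, 512)
            h = self.recurrent1(e, h)                   # (N, 512)
            m = torch.tanh(self.multimodal(torch.cat([h, video_feat], dim=1)))
            logits = self.out(torch.tanh(self.hidden(m)))
            return logits.argmax(dim=1), h              # MaxID -> predicted word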

SLIDE 16

Framework – attention model for video feature

[Diagram: the Video Feature Pool feeds Attention I and Attention II with a sequential softmax; the resulting weighted average enters the multimodal layer of the sentence generator.]

SLIDE 17

Framework – paragraph model

[Diagram: the last instance of Recurrent I (512) and an embedding average (512) form the Sentence Embedding (512), which feeds Recurrent II (512) to maintain the Paragraph State that initializes the next sentence.]

SLIDE 18

Visual Features

Object appearance:

VGG-16 (fc7) [Simonyan et al., 2015], pre-trained on ImageNet dataset

Action:

C3D (fc6) [Tran et al., 2015], pre-trained on Sports-1M dataset

Dense Trajectories + Fisher Vector [Wang et al., 2011]

[Diagram: the Video Feature Pool comprises an Appearance Feature Pool and an Action Feature Pool.]
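In shape terms, the two pools might be assembled as below (a sketch only; vgg_fc7 and c3d_fc6 stand in for the actual extractors, and the 4096-d widths are the standard fc7/fc6 sizes, assumed here rather than read off the slide):

    import numpy as np

    def build_feature_pools(frames, clips, vgg_fc7, c3d_fc6):
        # vgg_fc7: maps one frame to a 4096-d appearance vector (VGG-16 fc7).
        appearance_pool = np.stack([vgg_fc7(f) for f in frames])  # (num_frames, 4096)
        # c3d_fc6: maps one short clip to a 4096-d action vector (C3D fc6).
        action_pool = np.stack([c3d_fc6(c) for c in clips])       # (num_clips, 4096)
        return appearance_pool, action_pool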

SLIDE 19

Attention Model

Learning spatial & temporal attention simultaneously

[Diagram: the Video Feature Pool and the Recurrent I state (512) drive Attention I, Attention II, and a sequential softmax, producing a weighted-average feature.]


slide-22
SLIDE 22

Attention Model

[Diagram build-up: the feature pool supplies candidate features v_{i-1}, v_i, v_{i+1}, …]

SLIDE 23

Attention Model

[Diagram build-up: the previous recurrent state h_{t-1} is added alongside the feature pool.]

SLIDE 24

Attention Model

[Diagram build-up: attention weights are computed from the feature pool and the previous recurrent state.]

SLIDE 25

Attention Model

[Diagram: dot products score each feature against the previous recurrent state; a sequential softmax turns the scores into attention weights, and the weighted-average feature is the input to the multimodal layer.]
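Putting the pieces of the diagram together, a plausible reconstruction of the computation (notation ours; phi is an activation such as tanh, and W_h, W_v, w, b are learned parameters):

    q_t^{(i)} = w^\top \phi(W_h h_{t-1} + W_v v_i + b)           % attention score for feature v_i
    \beta_t^{(i)} = \exp(q_t^{(i)}) / \sum_j \exp(q_t^{(j)})     % sequential softmax
    u_t = \sum_i \beta_t^{(i)} v_i                               % weighted average, fed to the multimodal layer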

SLIDE 26

Paragraph Generator Unrolled

[Diagram: the network unrolled over sentence n-1 and sentence n. For each sentence, the current word embedding (512), the visual features, the multimodal layer (1024), the hidden layer (512), and a softmax/maxid over a 7,192-word vocabulary produce the next word; the sentence generator's output passes through the paragraph generator as input to the next sentence.]
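The unrolled computation can be read as the following greedy decoding loop (a sketch building on the SentenceGenerator above; the simple sum used for the sentence embedding is an assumption, see the next slide):

    import torch

    @torch.no_grad()
    def generate_paragraph(gen, paragraph_rnn, video_feat, bos_id, eos_id,
                           max_sents=5, max_words=20):
        para_state = torch.zeros(1, 512)        # Paragraph State
        sentences = []
        for _ in range(max_sents):
            h = para_state.clone()              # paragraph state initializes Recurrent I
            word = torch.tensor([bos_id])
            sent, embs = [], []
            for _ in range(max_words):
                word, h = gen.step(word, h, video_feat)
                if word.item() == eos_id:
                    break
                sent.append(word.item())
                embs.append(gen.embed(word))
            sentences.append(sent)
            # Sentence embedding: last Recurrent-I state plus average word embedding.
            if embs:
                sent_emb = h + torch.stack(embs).mean(dim=0)
            else:
                sent_emb = h
            para_state = paragraph_rnn(sent_emb, para_state)  # Recurrent II update
        return sentences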

SLIDE 27

Sentence Embedding

[Diagram: the last instance of Recurrent I (512, from Input Words → Embedding → Recurrent I) and the average of the input word embeddings (512) are combined into the 512-d Sentence Embedding.]
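One reading of this slide as a module (a sketch; the slide shows the two 512-d inputs but not how they are merged, so the linear fusion below is our assumption):

    import torch
    import torch.nn as nn

    class SentenceEmbedding(nn.Module):
        def __init__(self, dim=512):
            super().__init__()
            self.fuse = nn.Linear(2 * dim, dim)

        def forward(self, last_state, word_embs):
            # last_state: (N, 512), last instance of Recurrent I
            # word_embs:  (T, N, 512), embeddings of the sentence's input words
            avg_emb = word_embs.mean(dim=0)     # Embedding Average
            return torch.tanh(self.fuse(torch.cat([last_state, avg_emb], dim=1)))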

SLIDE 28

Experiments - Setup

Two datasets:

YouTube2Text

> open-domain
> 1,970 videos, ~80k video-sentence pairs, 12k unique words
> only one sentence per video (a special case)

TACoS-MultiLevel

> closed-domain: cooking
> 173 videos, 16,145 intervals, ~40k interval-sentence pairs, 2k unique words
> several dependent sentences per video

Three evaluation metrics:

BLEU [Papineni et al., 2002]
METEOR [Banerjee and Lavie, 2005]
CIDEr [Vedantam et al., 2015]

The higher, the better.
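For a concrete sense of what these scores measure, BLEU@4 on one tokenized pair can be computed with NLTK (an illustration only; the evaluations here use the standard captioning evaluation scripts, not necessarily NLTK):

    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    references = [["the", "person", "peeled", "the", "potatoes"]]
    hypothesis = ["the", "person", "peeled", "a", "potato"]
    score = sentence_bleu(references, hypothesis,
                          weights=(0.25, 0.25, 0.25, 0.25),
                          smoothing_function=SmoothingFunction().method1)
    print(f"BLEU@4 = {score:.3f}")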

SLIDE 29

Experiments - YouTube2Text

[Bar chart: BLEU@4, METEOR, and CIDEr scores on YouTube2Text.]

SLIDE 30

Experiments - TACoS-MultiLevel

[Bar charts: BLEU@4 and METEOR scores (left) and CIDEr scores (right) on TACoS-MultiLevel.]


SLIDE 32

Experiments - TACoS-MultiLevel

[Bar charts repeated from the previous slide.]

Evaluation metric scores are not always reliable, so we need further comparison.

SLIDE 33

RNN-cat vs. h-RNN

SLIDE 34

RNN-cat vs. h-RNN

RNN-cat: a flat structure that concatenates sentences directly with one RNN.

The person took out some potatoes. The person peeled the potatoes. …

[Diagram: a single RNN spanning the concatenated sentences.]

SLIDE 35

RNN-cat vs. h-RNN

RNN-cat: a flat structure that concatenates sentences directly with one RNN.

Amazon Mechanical Turk (AMT): side-by-side comparison.

The person took out some potatoes. The person peeled the potatoes. …

[Diagram: a single RNN spanning the concatenated sentences.]

Which of the two sentences better describes the video?

1. the first
2. the second
3. equally good or bad


SLIDE 37

RNN-sent vs. h-RNN examples

SLIDE 38

Conclusions & Discussions

Hierarchical RNN improves paragraph generation

SLIDE 39

Conclusions & Discussions

Hierarchical RNN improves paragraph generation.

Issues:

1. Most errors occur when generating nouns; small objects are hard to recognize (on TACoS-MultiLevel)
2. One-way information flow
3. The language model helps, but sometimes overrides the computer vision result in a wrong way

SLIDE 40

Thanks!

Poster #4