Video Paragraph Captioning using Hierarchical Recurrent Neural Networks (PowerPoint presentation)

SLIDE 1

Video Paragraph Captioning using Hierarchical Recurrent Neural Networks

Haonan Yu, Jiang Wang, Zhiheng Huang, Yi Yang, Wei Xu

SLIDE 2

Problem

Given a video, generate a paragraph (multiple sentences).

SLIDE 3

Problem

Given a video, generate a paragraph (multiple sentences).

The person entered the kitchen. The person opened the drawer. The person took out a knife and a sharpener. The person sharpened the knife. The person cleaned the knife.

SLIDE 4

Problem

Given a video, generate a paragraph (multiple sentences).

The person entered the kitchen. The person opened the drawer. The person took out a knife and a sharpener. The person sharpened the knife. The person cleaned the knife.

vs.

The person sharpened the knife in the kitchen.

SLIDE 5

Motivation

Inter-sentence dependency (semantic context)

SLIDE 6

Motivation

Inter-sentence dependency (semantic context)

The person took out some potatoes.

SLIDE 7

Motivation

Inter-sentence dependency (semantic context)

The person took out some potatoes. The person peeled the potatoes. The person turned on the stove.

SLIDE 8

Motivation

Inter-sentence dependency (semantic context)

The person took out some potatoes. The person peeled the potatoes. The person turned on the stove.

We want to model this dependency.

SLIDE 9

Hierarchy

A paragraph is inherently hierarchical.

SLIDE 10

Hierarchy

A paragraph is inherently hierarchical.

The person took out some potatoes.

SLIDE 11

Hierarchy

A paragraph is inherently hierarchical.

The person took out some potatoes. The person peeled the potatoes. … …

SLIDE 12

Hierarchy

A paragraph is inherently hierarchical.

The person took out some potatoes. The person peeled the potatoes. … …

[Figure: two RNN blocks]

SLIDE 13

Hierarchy

A paragraph is inherently hierarchical.

The person took out some potatoes. The person peeled the potatoes. … …

[Figure: three RNN blocks]

SLIDE 14

Framework

(a) Sentence Generator (b) Paragraph Generator

[Figure: framework overview with two RNN blocks]

SLIDE 15

Framework – language model

(a) Sentence Generator (b) Paragraph Generator

[Figure: language model of the sentence generator, with the labels Input Words, Embedding, Recurrent I, Multimodal, Hidden, Softmax, MaxID, Predicted Words. Layer sizes shown: 512, 512, 1024, 512.]

SLIDE 16

Framework – attention model for video feature

(a) Sentence Generator (b) Paragraph Generator

[Figure: sentence generator with the labels Input Words, Embedding, Recurrent I, Multimodal, Hidden, Softmax, MaxID, Predicted Words, plus the attention labels Video Feature Pool, Attention I, Attention II, Sequential Softmax, Weighted Average. Layer sizes shown: 512, 512, 1024, 512.]

SLIDE 17

Framework – paragraph model

(a) Sentence Generator (b) Paragraph Generator

[Figure: sentence generator with the labels Input Words, Embedding, Recurrent I, Multimodal, Hidden, Softmax, MaxID, Predicted Words; attention labels Video Feature Pool, Attention I, Attention II, Sequential Softmax, Weighted Average; paragraph generator labels Last Instance, Embedding Average, Sentence Embedding (512), Recurrent II (512), Paragraph State (512). Other layer sizes shown: 512, 512, 512, 1024, 512.]

SLIDE 18

Visual Features

Object appearance:

VGG-16 (fc7) [Simonyan et al., 2015], pre-trained on the ImageNet dataset

Action:

C3D (fc6) [Tran et al., 2015], pre-trained on the Sports-1M dataset
Dense Trajectories + Fisher Vector [Wang et al., 2011]

[Figure labels: Video Feature Pool, Appearance Feature Pool, Action Feature Pool]

SLIDE 19

Attention Model

Learning spatial & temporal attention simultaneously

[Figure: attention model with Video Feature Pool, Attention I, Attention II, Sequential Softmax, Weighted Average, and Recurrent I; size shown: 512]

SLIDE 20

Attention Model

[Figure: attention model with Video Feature Pool, Attention I, Attention II, Sequential Softmax, Weighted Average, and Recurrent I; size shown: 512]

SLIDE 22

Attention Model

[Figure: attention model with Video Feature Pool, Attention I, Attention II, Sequential Softmax, Weighted Average, and Recurrent I; size shown: 512]

Feature pool: …, i-1, i, i+1, …

SLIDE 23

Attention Model

[Figure: attention model with Video Feature Pool, Attention I, Attention II, Sequential Softmax, Weighted Average, and Recurrent I; size shown: 512]

Feature pool: …, i-1, i, i+1, …
Previous recurrent state: t-1

SLIDE 24

Attention Model

[Figure: attention model with Video Feature Pool, Attention I, Attention II, Sequential Softmax, Weighted Average, and Recurrent I; size shown: 512]

Feature pool: …, i-1, i, i+1, …
Previous recurrent state: t-1
Attention weights: one per feature in the pool

SLIDE 25

Attention Model

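[Figure: attention model with Video Feature Pool, Attention I, Attention II, Sequential Softmax, Weighted Average, and Recurrent I; size shown: 512]

The feature pool entries (…, i-1, i, i+1, …) and the previous recurrent state (t-1) produce the attention weights. A dot product of the attention weights with the features gives the average feature (input to multimodal layer).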
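
A minimal sketch of the attention step drawn on these slides, in plain Python. It assumes the common additive form: each feature in the pool is scored against the previous recurrent state, the scores are normalized by a softmax over the pool, and the resulting weights average the features. Reading "Attention I" as a tanh projection and "Attention II" as the reduction to a scalar score is an assumption, and the names `attend`, `W`, `U`, `b`, and `w` are hypothetical; none of them come from the slides.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(features, h_prev, W, U, b, w):
    """Return (attention weights, weighted-average feature) for one step.

    features: list of feature vectors in the pool
    h_prev:   previous recurrent state (t-1)
    W, U, b:  assumed projection parameters ("Attention I")
    w:        assumed scoring vector ("Attention II")
    """
    projected_state = matvec(U, h_prev)
    scores = []
    for feature in features:
        projected_feature = matvec(W, feature)
        hidden = [math.tanh(pf + ps + bias)
                  for pf, ps, bias in zip(projected_feature, projected_state, b)]
        scores.append(sum(wi * hi for wi, hi in zip(w, hidden)))
    weights = softmax(scores)  # one weight per feature in the pool
    dim = len(features[0])
    pooled = [sum(wt * f[k] for wt, f in zip(weights, features))
              for k in range(dim)]
    return weights, pooled
```

With `w` set to all zeros every score is equal, so the weights become uniform and the result is the plain mean of the pool, which makes a quick sanity check.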

SLIDE 26

Paragraph Generator Unrolled

[Figure: the paragraph generator unrolled over sentence n-1 and sentence n. For each sentence, the sentence generator takes the visual features and the current word through embedding, multimodal, hidden, softmax, and maxid layers to predict the next word; the paragraph generator supplies the input to the next sentence. Layer sizes shown: 512, 512, 512, 1024, 512, 7192, 7192.]
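
The unrolled structure can be sketched as a loop in Python. This is an illustration under assumptions, not the authors' implementation: the three callables stand in for the sentence generator, the sentence embedding, and the paragraph-level recurrent update, and stopping on an empty sentence is an assumed termination rule. All names here are hypothetical.

```python
def generate_paragraph(generate_sentence, embed_sentence, update_paragraph,
                       initial_state, max_sentences):
    """Generate sentences one at a time, carrying a paragraph state forward.

    generate_sentence(paragraph_state) -> (words, word_states)
    embed_sentence(words, word_states) -> sentence embedding
    update_paragraph(paragraph_state, sentence_embedding) -> new state
    """
    paragraph_state = initial_state
    sentences = []
    for _ in range(max_sentences):
        words, word_states = generate_sentence(paragraph_state)
        if not words:  # assumed stop signal: an empty sentence
            break
        sentences.append(words)
        embedding = embed_sentence(words, word_states)
        # The updated state is the input to the next sentence.
        paragraph_state = update_paragraph(paragraph_state, embedding)
    return sentences, paragraph_state


# Toy stand-ins: the "state" is just a counter into a fixed script.
script = [["the", "person", "took", "out", "some", "potatoes"],
          ["the", "person", "peeled", "the", "potatoes"]]


def toy_generate(state):
    return (script[state], None) if state < len(script) else ([], None)


paragraph, final_state = generate_paragraph(
    toy_generate,
    lambda words, states: 1,
    lambda state, embedding: state + embedding,
    0,
    5,
)
```

The toy run only shows the control flow: each sentence is generated from the state left behind by the previous one, which is the inter-sentence dependency the slides set out to model.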

SLIDE 27

Sentence Embedding

[Figure: Input Words pass through Embedding and Recurrent I; the Last Instance and the Embedding Average feed the Sentence Embedding. Layer sizes shown: 512, 512, 512.]
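
A small Python sketch of how the two inputs named on this slide could be combined. It assumes that "Embedding Average" is the mean of the word embeddings, that "Last Instance" is the final recurrent state, and that the two are concatenated and passed through one tanh layer; the concatenation, the tanh, and the names `sentence_embedding`, `W`, and `b` are assumptions, not details given on the slide.

```python
import math


def sentence_embedding(word_embeddings, last_state, W, b):
    """Combine the averaged word embeddings with the last recurrent state.

    word_embeddings: list of embedding vectors, one per word
    last_state:      final state of the word-level recurrent layer
    W, b:            assumed parameters of the sentence embedding layer;
                     each row of W has len(embedding) + len(last_state) entries
    """
    count = len(word_embeddings)
    dim = len(word_embeddings[0])
    average = [sum(e[k] for e in word_embeddings) / count for k in range(dim)]
    joint = average + list(last_state)  # concatenation (assumed)
    return [math.tanh(sum(w * x for w, x in zip(row, joint)) + bias)
            for row, bias in zip(W, b)]
```

The output has one entry per row of `W`, so a 512-row matrix would give a 512-dimensional sentence embedding like the one labeled on the slide.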

SLIDE 28

Experiments - Setup

Two datasets:

YouTube2Text

> open-domain
> 1,970 videos, ~80k video-sentence pairs, 12k unique words
> only one sentence for a video (special case)

TACoS-MultiLevel

> closed-domain: cooking
> 173 videos, 16,145 intervals, ~40k interval-sentence pairs, 2k unique words
> several dependent sentences for a video

Three evaluation metrics:

BLEU [Papineni et al., 2002]
METEOR [Banerjee and Lavie, 2005]
CIDEr [Vedantam et al., 2015]

The higher, the better.
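
To give a feel for what BLEU measures, here is a short Python sketch of its core ingredient, the clipped (modified) n-gram precision. It is only one component: the full metric also combines several n-gram orders and applies a brevity penalty, and this is not the evaluation script used for the results in this talk. The function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, references, n):
    """Clipped n-gram precision of a candidate against reference sentences."""
    candidate_counts = Counter(ngrams(candidate, n))
    if not candidate_counts:
        return 0.0
    # For each n-gram, the most times it appears in any single reference.
    max_reference = Counter()
    for reference in references:
        for gram, count in Counter(ngrams(reference, n)).items():
            max_reference[gram] = max(max_reference[gram], count)
    clipped = sum(min(count, max_reference[gram])
                  for gram, count in candidate_counts.items())
    return clipped / sum(candidate_counts.values())
```

Clipping is what stops a caption from scoring well by repeating one correct word: each n-gram is credited at most as often as it occurs in a reference.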

SLIDE 29

Experiments - YouTube2Text

[Figure: chart of BLEU@4, METEOR, and CIDEr scores; axis range 0.2 to 0.7]

SLIDE 30

Experiments - TACoS-MultiLevel

[Figure: chart of BLEU@4 and METEOR scores (axis range 0.24 to 0.31) and CIDEr scores (axis range 1.2 to 1.65)]

SLIDE 32

Experiments - TACoS-MultiLevel

[Figure: chart of BLEU@4 and METEOR scores (axis range 0.24 to 0.31) and CIDEr scores (axis range 1.2 to 1.65)]

Evaluation metric scores are not always reliable; we need further comparison.

SLIDE 33

RNN-cat vs. h-RNN

SLIDE 34

RNN-cat vs. h-RNN

RNN-cat: flat structure, concatenating sentences directly with one RNN

The person took out some potatoes. The person peeled the potatoes. … …

[Figure: a single RNN block]

SLIDE 35

RNN-cat vs. h-RNN

RNN-cat: flat structure, concatenating sentences directly with one RNN

Amazon Mechanical Turk (AMT): side-by-side comparison

The person took out some potatoes. The person peeled the potatoes. … …

[Figure: a single RNN block]

Which of the two sentences better describes the video?

1. The first
2. The second
3. Equally good or bad

SLIDE 37

RNN-sent vs. h-RNN examples

SLIDE 38

Conclusions & Discussions

Hierarchical RNN improves paragraph generation

SLIDE 39

Conclusions & Discussions

Hierarchical RNN improves paragraph generation

Issues:

1. Most errors occur when generating nouns; small objects are hard to recognize (on TACoS-MultiLevel)
2. One-way information flow
3. The language model helps, but sometimes overrides the computer vision result in a wrong way

SLIDE 40

Thanks!

Poster #4