SLIDE 1

Learning Graph Representations for Video Understanding

Xiaolong Wang

Carnegie Mellon University

SLIDE 2

Computer Vision

Dog

He et al. Mask R-CNN. ICCV 2017.
Güler et al. DensePose: Dense Human Pose Estimation In The Wild. CVPR 2018.

SLIDE 3

Deep Learning

Mushroom Dog Ant Jelly Fungus Nest

ImageNet

Image Mushroom Dog Ant Jelly Fungus Nest

Train a Convolutional Neural Network

Russakovsky et al. ImageNet Large Scale Visual Recognition Challenge. 2014.

SLIDE 4

Convolutional Neural Networks

Figure credit: Van Den Oord et al.

Convolution is local
Long-range pairwise relations are not modeled

SLIDE 5

Related Work: Relation Networks

[Santoro et al, 2017]

SLIDE 6

Related Work: Self-Attention

[Vaswani et al, 2017]

SLIDE 7

Related Work: Graph Convolution Networks

[Kipf et al, 2017]

SLIDE 8

This Tutorial

Draw connections between different graph/relation networks
Applied to video understanding
Covering both supervised and self-supervised methods

SLIDE 9

Video Recognition

3D Conv 3D Conv 3D Conv

Playing Soccer

SLIDE 10

Reasoning for Action Recognition

X. Wang, R. Girshick, A. Gupta, and K. He. Non-local Neural Networks. CVPR 2018.

Long-range explicit reasoning

SLIDE 11

Non-local Means

Figure labels: $p$, $q_1$, $q_2$, $q_3$

Buades et al. A non-local algorithm for image denoising. CVPR, 2005.

SLIDE 12

Non-local Operator

Operation in feature space
Can be embedded into any ConvNet

Figure labels: $x_i$, $x_j$

SLIDE 13

Non-local Operator

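Figure labels: $x_i$, $x_j$

$y_i = \frac{1}{C(x)} \sum_{\forall j} f(x_i, x_j)\, g(x_j)$

where $f(x_i, x_j)$ is the affinity and $g(x_j)$ are the features.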

SLIDE 14

Non-local Operator

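Input $x$: $T \times H \times W \times 512$

$\theta$: $1 \times 1 \times 1$ and $\phi$: $1 \times 1 \times 1$, each with output $T \times H \times W \times 512$

Affinity: $(THW \times 512) \times (512 \times THW) = THW \times THW$

$y_i = \frac{1}{C(x)} \sum_{\forall j} f(x_i, x_j)\, g(x_j)$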

SLIDE 15

Non-local Operator

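Input $x$: $T \times H \times W \times 512$

$\theta$: $1 \times 1 \times 1$ and $\phi$: $1 \times 1 \times 1$, each with output $T \times H \times W \times 512$

Affinity: $(THW \times 512) \times (512 \times THW) = THW \times THW$

$y_i = \frac{1}{C(x)} \sum_{\forall j} f(x_i, x_j)\, g(x_j)$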

SLIDE 16

Non-local Operator

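Input $x$: $T \times H \times W \times 512$

$\theta$: $1 \times 1 \times 1$ and $\phi$: $1 \times 1 \times 1$, each with output $T \times H \times W \times 512$

Affinity: $(THW \times 512) \times (512 \times THW) = THW \times THW$, then normalize

$y_i = \frac{1}{C(x)} \sum_{\forall j} f(x_i, x_j)\, g(x_j)$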

SLIDE 17

Non-local Operator

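Input $x$: $T \times H \times W \times 512$

$\theta$: $1 \times 1 \times 1$ and $\phi$: $1 \times 1 \times 1$, each with output $T \times H \times W \times 512$

Affinity: $(THW \times 512) \times (512 \times THW) = THW \times THW$, then normalize

$y_i = \frac{1}{C(x)} \sum_{\forall j} f(x_i, x_j)\, g(x_j)$

$f(x_i, x_j) = \exp(x_i^T x_j)$

$C(x) = \sum_{\forall j} f(x_i, x_j)$

$\frac{f(x_i, x_j)}{C(x)} = \frac{\exp(x_i^T x_j)}{\sum_{\forall j} \exp(x_i^T x_j)}$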
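
The normalization on this slide is a row-wise softmax over the pairwise dot products. The following is a minimal NumPy sketch of that identity, not the authors' code; the toy sizes (6 positions, 4 channels) and the function names are assumptions made for illustration.

```python
import numpy as np

def gaussian_affinity(x):
    """Explicit form: f(x_i, x_j) = exp(x_i^T x_j), divided by C(x).

    x has shape (N, C): N space-time positions, C channels.
    """
    f = np.exp(x @ x.T)                   # (N, N) pairwise affinities
    c = f.sum(axis=1, keepdims=True)      # C(x) = sum over j, one value per i
    return f / c

def softmax_rows(logits):
    """Numerically stable softmax over each row."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 4))               # hypothetical toy features
explicit = gaussian_affinity(x)
via_softmax = softmax_rows(x @ x.T)
print(np.allclose(explicit, via_softmax))
```

Subtracting the row maximum before exponentiating leaves the result unchanged and avoids overflow for large dot products.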

SLIDE 18

Non-local Operator

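Input $x$: $T \times H \times W \times 512$

$\theta$: $1 \times 1 \times 1$ and $\phi$: $1 \times 1 \times 1$, each with output $T \times H \times W \times 512$

$g$: $1 \times 1 \times 1$, with output $T \times H \times W \times 512$

Affinity: $(THW \times 512) \times (512 \times THW) = THW \times THW$, then normalize

Output: $(THW \times THW) \times (THW \times 512) = THW \times 512$

$y_i = \frac{1}{C(x)} \sum_{\forall j} f(x_i, x_j)\, g(x_j)$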

SLIDE 19

Non-local Operator

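Input $x$: $T \times H \times W \times 512$

$\theta$: $1 \times 1 \times 1$ and $\phi$: $1 \times 1 \times 1$, each with output $T \times H \times W \times 512$

$g$: $1 \times 1 \times 1$, with output $T \times H \times W \times 512$

Affinity: $(THW \times 512) \times (512 \times THW) = THW \times THW$, then normalize

Output: $(THW \times THW) \times (THW \times 512) = THW \times 512$

$y_i = \frac{1}{C(x)} \sum_{\forall j} f(x_i, x_j)\, g(x_j)$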

SLIDE 20

Non-local Operator as A Residual Block

Video → [3D Conv, 3D Conv, Non-local, 3D Conv, Non-local] → Action Class

$z_i = y_i W + x_i$
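
To make the shapes concrete, here is a hedged NumPy sketch of a non-local residual block. It treats each 1 × 1 × 1 convolution as a per-position matrix multiply; the weight names (`w_theta`, `w_phi`, `w_g`, `w_out`) and the small tensor sizes are assumptions for illustration, and the sketch leaves out details such as channel reduction or normalization layers.

```python
import numpy as np

def softmax_rows(a):
    """Numerically stable softmax over each row."""
    e = np.exp(a - a.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def nonlocal_block(x, w_theta, w_phi, w_g, w_out):
    """x: (T, H, W, C). Returns z with z_i = y_i W + x_i, same shape as x."""
    t, h, wd, c = x.shape
    flat = x.reshape(t * h * wd, c)          # THW x C
    theta = flat @ w_theta                   # 1x1x1 conv == per-position matmul
    phi = flat @ w_phi
    g = flat @ w_g
    affinity = softmax_rows(theta @ phi.T)   # THW x THW, rows sum to 1
    y = affinity @ g                         # THW x C
    z = y @ w_out + flat                     # residual connection
    return z.reshape(t, h, wd, c)

rng = np.random.default_rng(0)
T, H, W, C = 2, 3, 3, 8                      # toy sizes, not the 512 channels on the slide
x = rng.normal(size=(T, H, W, C))
w_theta, w_phi, w_g = (0.1 * rng.normal(size=(C, C)) for _ in range(3))

z_identity = nonlocal_block(x, w_theta, w_phi, w_g, np.zeros((C, C)))
z_mixed = nonlocal_block(x, w_theta, w_phi, w_g, np.eye(C))
```

With `w_out` set to zero the block reduces to the identity, which is a convenient sanity check when inserting it into an existing network.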

SLIDE 21

Examples

SLIDE 22

Action Recognition in Daily Lives

Gunnar A. Sigurdsson, Gül Varol, Xiaolong Wang, Ivan Laptev, Ali Farhadi, Abhinav Gupta. Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding. ECCV 2016.

Charades Dataset: 157 classes, 9.8k videos, 30s per video
We let people upload their own videos!

SLIDE 23

Action Recognition on Charades

Method                  mAP
3D Conv                 31.8%
3D Conv + Non-local     33.5%

SLIDE 24

Opening A Book

SLIDE 25

Opening A Book

The Non-local Block

SLIDE 26

Opening A Book

Object states change over time
Human-object and object-object interactions

X. Wang and A. Gupta. Video as Space-Time Region Graphs. ECCV 2018.

SLIDE 27

Opening A Book

A1 B1 A2 B2 A3 B3 A4 B4

Highly Correlated

SLIDE 28

Relations between Regions

SLIDE 29

Relations between Regions

$f(x_i, x_j) = \phi(x_i)^T \phi'(x_j)$

$G_{ij} = \frac{\exp f(x_i, x_j)}{\sum_{\forall j} \exp f(x_i, x_j)}$

SLIDE 30

Graph Convolutional Network

$Z = GXW$

Shapes: $G$ ($N \times N$) $\times$ $X$ ($N \times d$) $\times$ $W$ ($d \times d$) $=$ $Z$ ($N \times d$)

Kipf and Welling. Semi-Supervised Classification with Graph Convolutional Networks. ICLR 2017.

SLIDE 31

Graph Convolutional Network

Propagation

SLIDE 32

Connecting Non-local and GCN

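The Non-local Operator:

$z_i = y_i W + x_i$

$y_i = \frac{1}{C(x)} \sum_{\forall j} f(x_i, x_j)\, g(x_j) = \sum_{\forall j} \frac{f(x_i, x_j)}{\sum_{\forall j} f(x_i, x_j)}\, g(x_j) = \sum_{\forall j} G_{ij}\, g(x_j)$

$z_i = \left( \sum_{\forall j} G_{ij}\, g(x_j) \right) W + x_i$

The Graph Convolution:

$Z = G\, g(X)\, W + X$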
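
The equivalence on this slide can be checked numerically. The sketch below computes the non-local output position by position and compares it with the matrix form; the linear map `W_g` standing in for g, the Gaussian affinity, and the toy sizes are assumptions chosen only to make the check runnable.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 5, 3                                  # toy sizes: N nodes/positions, d channels
X = rng.normal(size=(N, d))
W_g = rng.normal(size=(d, d))                # hypothetical linear g
W = rng.normal(size=(d, d))

f = np.exp(X @ X.T)                          # pairwise affinities f(x_i, x_j)
G = f / f.sum(axis=1, keepdims=True)         # normalized affinity, rows sum to 1

def g(v):
    """Feature transform g, here a plain linear map (an assumption)."""
    return v @ W_g

# Non-local form, one position at a time: z_i = (sum_j G_ij g(x_j)) W + x_i
Z_per_position = np.stack([
    sum(G[i, j] * g(X[j]) for j in range(N)) @ W + X[i]
    for i in range(N)
])

# Graph-convolution form with a residual: Z = G g(X) W + X
Z_matrix = G @ g(X) @ W + X

print(np.allclose(Z_per_position, Z_matrix))
```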

SLIDE 33

Action Recognition on Charades

Method                    mean AP
3D Conv                   31.8%
3D Conv + Non-local       33.5%
3D Conv + Region Graph    36.2% (+4.4%)

SLIDE 34

Action Recognition on Charades

[Bar chart: 3D Conv vs. 3D Conv + Graph, grouped by "Involves Objects?" (No / Yes); axis ticks from 30% to 45%]

SLIDE 35

Action Recognition on Charades

[Bar chart: 3D Conv vs. 3D Conv + Graph, grouped by "Pose Variances"; axis ticks from 30% to 45%]

SLIDE 36

Connection to Mean-Shift

The Non-local Operator:

$y_i = \sum_{\forall j} \frac{f(x_i, x_j)}{\sum_{\forall j} f(x_i, x_j)}\, g(x_j)$

The Mean-Shift Clustering:

$m(x) = \frac{\sum_{x_j \in N(x)} K(x, x_j)\, x_j}{\sum_{x_j \in N(x)} K(x, x_j)}$

Converging to the same mean?

https://tw.rpi.edu/web/project/JeffersonProjectAtLakeGeorge/Clustering
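
For comparison with the non-local operator, here is a small sketch of the mean-shift update as a kernel-weighted average. It assumes a Gaussian kernel and takes the neighborhood N(x) to be the whole point set; the bandwidth, the toy data, and the function name are illustrative choices, not taken from the slides.

```python
import numpy as np

def mean_shift_step(x, points, bandwidth):
    """One update m(x): kernel-weighted mean of the points around x."""
    sq_dist = ((points - x) ** 2).sum(axis=1)
    k = np.exp(-sq_dist / (2.0 * bandwidth ** 2))    # Gaussian K(x, x_j)
    return (k[:, None] * points).sum(axis=0) / k.sum()

rng = np.random.default_rng(0)
cluster_a = rng.normal(loc=0.0, scale=0.1, size=(20, 2))
cluster_b = rng.normal(loc=5.0, scale=0.1, size=(20, 2))
points = np.vstack([cluster_a, cluster_b])

x = np.array([1.0, 1.0])                             # start nearer to cluster_a
for _ in range(50):
    x = mean_shift_step(x, points, bandwidth=0.5)

print(x, cluster_a.mean(axis=0))
```

Seen this way, the non-local output has the same weighted-average structure, with the affinity f playing the role of the kernel K and g applied to the averaged features.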

SLIDE 37

Recent Related Work

Actor-Centric Relation Network [Sun et al, 2018]
Video Action Transformer Network [Girdhar et al, 2019]
Long-Term Feature Banks for Detailed Video Understanding [Wu et al, 2019]

SLIDE 38

Learning Affinity with Semantic Supervision

SLIDE 39

Learn Correspondence without Human Supervision

Goal:

SLIDE 40

The visual world exhibits continuity

SLIDE 41

Prior Work: Learning from Time

Predict Color in Time [Vondrick et al, 2018]

Inputs Outputs

Predict Pixel in Time [Mathieu et al, 2015]
Predict Arrow of Time [Wei et al, 2018]

SLIDE 42

Using Tracking to Learn Features

CNN CNN

Similarity

Tracking → Similarity [Wang et al, 2015]

SLIDE 43

Using Tracking to Learn Features

CNN CNN

Similarity

Tracking → Similarity [Wang et al, 2015]

Limited by Off-the-shelf Trackers

SLIDE 44

Similarity requires tracking
Tracking requires similarity

Let’s jointly learn both!

SLIDE 45

Learning to Track

How to obtain supervision?

ℱ ℱ ℱ

ℱ: a deep tracker

SLIDE 46

Supervision: Cycle-Consistency in Time

Track backwards
Track forwards, back to the future

ℱ ℱ ℱ ℱ ℱ ℱ

SLIDE 47

Backpropagation through time along the cycle

Supervision: Cycle-Consistency in Time

ℱ ℱ ℱ ℱ ℱ ℱ

SLIDE 48

Differentiable Tracking

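Encoder $\phi$ (applied to both the patch and the image)

Patch feature at time $t$: $x_t^p$ (100 positions $\times$ $c$ channels)

Image feature at time $t-1$: $x_{t-1}^I$ (900 positions $\times$ $c$ channels)

Affinity between $x_{t-1}^I$ and $x_t^p$: matrix product after a transpose, giving $100 \times 900$ pairwise entries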
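
The affinity step on this slide can be sketched in a few lines of NumPy. The orientation of the matrix (patch positions as rows), the softmax over image positions, and the random stand-in features are assumptions for illustration; only the 100 and 900 position counts come from the slide.

```python
import numpy as np

def softmax_rows(a):
    """Numerically stable softmax over each row."""
    e = np.exp(a - a.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
c = 16                                       # hypothetical channel count
patch_feat = rng.normal(size=(100, c))       # stand-in for the patch feature at time t
image_feat = rng.normal(size=(900, c))       # stand-in for the image feature at time t-1

# (100 x c) times (c x 900): similarity of every patch position to every image position
affinity = softmax_rows(patch_feat @ image_feat.T)

# Hard nearest match per patch position, for inspection only
best_match = affinity.argmax(axis=1)
```

Keeping the affinity soft (rather than using the hard `argmax`) is what lets gradients flow through the tracking step.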

SLIDE 49

Differentiable Tracking

Encoder $\phi$ (applied to both the patch and the image), transpose

Spatial Transformer $\theta$, Cropping

Patch feature at time $t$: $x_t^p$

Image feature at time $t-1$: $x_{t-1}^I$

Patch feature at time $t-1$: $x_{t-1}^p$

SLIDE 50

Differentiable Tracking

Encoder $\phi$ (applied to both the patch and the image), transpose

Spatial Transformer $\theta$, Cropping

$x_{t-1}^p = \mathcal{F}(x_{t-1}^I, x_t^p)$

ℱ: the whole tracking operation

SLIDE 51

Recurrent Tracking

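[Diagram: starting from the patch $x_t^p$, ℱ is applied recurrently, backward through frames $t-1$, $t-2$, $t-3$ and then forward through $t-2$, $t-1$, $t$; $\mathcal{L}_{cycle}$ compares the returned patch with the starting patch $x_t^p$]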

SLIDE 52

Cycle-Consistency Loss Function

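$\mathcal{L}_{cycle} = \| Loc(x_t^p) - Loc(\hat{x}_t^p) \|_2^2$

where $x_t^p$ is the starting patch and $\hat{x}_t^p$ is the patch returned after tracking backward and then forward through the cycle with ℱ.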
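
As a rough illustration of the loss, the toy sketch below tracks a patch backward through a short clip and forward again, then penalizes the squared distance between the start and end locations. It swaps in a soft-argmax tracker for the spatial-transformer tracker described in the slides, and the one-hot features, temperature, and function names are all assumptions.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def track(query, frame_feats, coords, temperature=0.1):
    """Toy soft tracker: returns the expected location and feature of the match."""
    weights = softmax(frame_feats @ query / temperature)
    return weights @ coords, weights @ frame_feats

def cycle_loss(frames, coords, start_idx):
    """Squared distance between the start location and the location after a cycle."""
    start_loc = coords[start_idx]
    feat = frames[-1][start_idx]
    for frame in frames[-2::-1]:             # backward: t-1, t-2, ...
        _, feat = track(feat, frame, coords)
    loc = start_loc
    for frame in frames[1:]:                 # forward again, ending at time t
        loc, feat = track(feat, frame, coords)
    return float(((start_loc - loc) ** 2).sum())

# Four positions on a 2 x 2 grid, with distinct (one-hot) features per position
coords = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
static_frames = [np.eye(4) for _ in range(3)]
loss_static = cycle_loss(static_frames, coords, start_idx=0)
print(loss_static)
```

In a static scene with unambiguous features the cycle closes and the loss is close to zero; every step is differentiable, so in a learned setting the error could be backpropagated along the cycle.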

SLIDE 53

Multiple Cycles

Sub-cycles: a natural curriculum

SLIDE 54

Skip Cycles

Skip-cycles: skipping occlusions

SLIDE 55

Visualization of Training

SLIDE 56

Test Time: Nearest Neighbors in Feature Space

Frames: $t-1$, $t$

SLIDE 57

Frames: $t-1$, $t$

Test Time: Nearest Neighbors in Feature Space
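
One way to read "nearest neighbors in feature space" is as label propagation: each position in frame t copies labels from its most similar positions in frame t − 1. The sketch below is an assumption-laden illustration (dot-product similarity, top-k with softmax weights, toy one-hot features), not the exact test-time procedure behind the results.

```python
import numpy as np

def propagate_labels(feat_prev, labels_prev, feat_cur, k=2):
    """feat_prev: (P, c), labels_prev: (P, K), feat_cur: (Q, c) -> (Q, K)."""
    sim = feat_cur @ feat_prev.T                        # (Q, P) similarities
    idx = np.argsort(-sim, axis=1)[:, :k]               # k nearest neighbors per query
    top = np.take_along_axis(sim, idx, axis=1)          # (Q, k)
    w = np.exp(top - top.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)                   # softmax weights over neighbors
    return (w[:, :, None] * labels_prev[idx]).sum(axis=1)

feat_prev = 5.0 * np.eye(4)                  # toy features for 4 positions at t-1
labels_prev = np.eye(2)[[0, 1, 1, 0]]        # one-hot labels (e.g. background / object)
feat_cur = feat_prev[[2, 0, 3]]              # frame t: three positions, reordered
labels_cur = propagate_labels(feat_prev, labels_prev, feat_cur)
print(labels_cur.argmax(axis=1))
```

The same routine applies to masks, keypoints, or texture coordinates, since only the label array changes.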

SLIDE 58

Instance Mask Tracking

DAVIS Dataset

DAVIS Dataset: Pont-Tuset et al. The 2017 DAVIS Challenge on Video Object Segmentation. 2017.

SLIDE 59

Pose Keypoint Tracking

JHMDB Dataset

SLIDE 60

Comparison

Our Correspondence vs. Optical Flow

SLIDE 61

Pose Keypoint Tracking

JHMDB Dataset

Method            PCK @.1
Optical Flow      45%
Vondrick et al.   45%
Ours              58%
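
For readers unfamiliar with the metric, PCK counts a predicted keypoint as correct when it lies within a fraction of a reference size from the ground truth. The sketch below assumes the reference size is a single scalar per example (for instance the larger side of the person bounding box); the exact evaluation protocol behind the table may differ.

```python
import numpy as np

def pck(pred, gt, ref_size, alpha=0.1):
    """Fraction of keypoints whose error is at most alpha * ref_size.

    pred, gt: (K, 2) arrays of keypoint coordinates; ref_size: scalar in pixels.
    """
    dist = np.linalg.norm(pred - gt, axis=-1)
    return float((dist <= alpha * ref_size).mean())

gt = np.zeros((4, 2))
pred = np.array([[0.0, 0.0], [3.0, 4.0], [9.0, 12.0], [30.0, 40.0]])  # errors 0, 5, 15, 50
score = pck(pred, gt, ref_size=100.0, alpha=0.1)                      # threshold: 10 pixels
print(score)
```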

Vondrick et al. Tracking Emerges by Colorizing Videos. ECCV 2018.

SLIDE 62

Texture Tracking

DAVIS Dataset

DAVIS Dataset: Pont-Tuset et al. The 2017 DAVIS Challenge on Video Object Segmentation. 2017.

SLIDE 63

Semantic Masks Tracking

Video Instance Parsing Dataset

Zhou et al. Adaptive Temporal Encoding Network for Video Instance-level Human Parsing. ACM MM 2018.

SLIDE 64

Questions?