SLIDE 1

Learning Space-Time Structures for Human Action Recognition and Localization

Shugao Ma (1), Jianming Zhang (1), Nazli Ikizler-Cinbis (2), Leonid Sigal (3), Stan Sclaroff (1)

(1) Department of Computer Science, Boston University
(2) Department of Computer Engineering, Hacettepe University
(3) Disney Research Pittsburgh

10/7/15

SLIDE 2

Human actions are inherently structured patterns of body movements.

SLIDE 3

spatial structures

Below

credit of original photo: www.paceliving.com

SLIDE 4

temporal structures

Before

credit of original photo: www.paceliving.com

SLIDE 5

hierarchical structures

credit of original photo: www.paceliving.com

is-part is-part

SLIDE 6

Algorithms for Action Recognition

| Method | Space-time structural information | Number of structures | Topology of the structures | Supervision |
|---|---|---|---|---|
| Bag-of-Words | Discarded | N/A | N/A | Action class label of video |

Bag-of-Words, e.g., Laptev et al. CVPR 2008, Wang et al. IJCV 2013, Wang et al. ICCV 2013, Ma et al. ICCV 2013, Zhang et al. CVPR 2014, Kantorov et al. CVPR 2014

SLIDE 7

Algorithms for Action Recognition

| Method | Space-time structural information | Number of structures | Topology of the structures | Supervision |
|---|---|---|---|---|
| Bag-of-Words | Discarded | N/A | N/A | Action class label of video |
| Space-Time Pyramid | Weakly captured | N/A | N/A | Action class label of video |

Space-Time Pyramid, e.g., Laptev et al. CVPR 2008, Sadanand et al. CVPR 2012, Oneata et al. ICCV 2013

SLIDE 8

Algorithms for Action Recognition

| Method | Space-time structural information | Number of structures | Topology of the structures | Supervision |
|---|---|---|---|---|
| Bag-of-Words | Discarded | N/A | N/A | Action class label of video |
| Space-Time Pyramid | Weakly captured | N/A | N/A | Action class label of video |
| Structural Models (past works) | Captured | Predefined, often one | Predefined | Action class label of video + human bounding box annotations |

Structural Models, e.g., Ramanan et al. NIPS 2003, Weinland et al. ICCV 2007, Ikizler et al. IJCV 2008, Wang et al. TPAMI 2011, Raptis et al. CVPR 2012, Wang et al. ECCV 2014

SLIDE 9

Our Approach

SLIDE 10

Action as Space-Time Trees

[Figure legend: root action word, part action word, temporal relationship, spatial relationship]

Any graph can be approximated by a set of trees.
Inference with trees is efficient and exact.
A collection of trees is needed to cover intra-class variations.
Inference allows partial matching of trees.

SLIDE 11

Space-Time Tree

A space-time tree consists of:
tree nodes (indices to action words)
adjacency matrices for time, space and hierarchy
discriminative node and edge weights

The tree nodes, the tree edges and their weights are all learned from training data.
Action words are used to share parameters among trees, reducing model complexity.

[Figure: example tree whose nodes carry the action-word indices 10, 12, 20, 22, 18, 24]
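The three components listed on this slide can be pictured as a small data structure. The following is only a sketch under stated assumptions: the class name `SpaceTimeTree`, its field names, the relation keys and all weight values are hypothetical and are not taken from the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Relation names are assumptions chosen to mirror the slide's wording.
RELATIONS = ("time", "space", "hierarchy")


@dataclass
class SpaceTimeTree:
    # words[i] is the action-word index attached to tree node i.
    words: List[int]
    # adjacency[rel][i][j] == 1 when node i is linked to node j by relation rel.
    adjacency: Dict[str, List[List[int]]]
    # One weight per node and one per (i, j) edge; in the slides these are learned.
    node_weights: List[float]
    edge_weights: Dict[Tuple[int, int], float] = field(default_factory=dict)

    def edges(self) -> List[Tuple[int, int, str]]:
        """List every edge as (i, j, relation)."""
        out = []
        for rel in RELATIONS:
            for i, row in enumerate(self.adjacency[rel]):
                for j, flag in enumerate(row):
                    if flag:
                        out.append((i, j, rel))
        return out

    def is_tree(self) -> bool:
        """A tree on n nodes has exactly n - 1 edges and is connected."""
        n = len(self.words)
        es = self.edges()
        if n == 0 or len(es) != n - 1:
            return False
        nbrs = {i: set() for i in range(n)}
        for i, j, _ in es:
            nbrs[i].add(j)
            nbrs[j].add(i)
        seen, stack = {0}, [0]
        while stack:
            for v in nbrs[stack.pop()]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        return len(seen) == n


def make_example() -> SpaceTimeTree:
    """Three-node toy tree; the weights are made-up numbers."""
    adj = {rel: [[0] * 3 for _ in range(3)] for rel in RELATIONS}
    adj["hierarchy"][0][1] = 1  # node 1 is a part of root node 0
    adj["time"][1][2] = 1       # node 1 occurs before node 2
    return SpaceTimeTree(
        words=[10, 12, 20],
        adjacency=adj,
        node_weights=[1.0, 0.5, 0.5],
        edge_weights={(0, 1): 0.3, (1, 2): 0.2},
    )
```

Storing action-word indices rather than descriptors in the nodes is what lets several trees share the same word parameters.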

SLIDE 12

Ensemble of Space-Time Trees

For each action class, a collection of trees is used to construct the action classifier.

Terms in the classifier: the video graph, the collection of trees, the matching score of each tree against the video graph, and the ensemble weights.
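One plausible reading of the terms named on this slide is a weighted sum of tree matching scores. The sketch below assumes that form; the function names, the optional bias, the arg-max decision rule and the toy matching function in the usage note are all assumptions rather than the authors' formulation.

```python
from typing import Callable, Dict, Sequence, Tuple


def ensemble_score(graph, trees: Sequence, weights: Sequence[float],
                   match_score: Callable, bias: float = 0.0) -> float:
    """Weighted sum of per-tree matching scores for one action class (assumed form)."""
    if len(trees) != len(weights):
        raise ValueError("need exactly one ensemble weight per tree")
    return bias + sum(w * match_score(t, graph) for t, w in zip(trees, weights))


def classify(graph, classifiers: Dict[str, Tuple[Sequence, Sequence[float]]],
             match_score: Callable):
    """Score the video graph with every class ensemble and return the best class."""
    scores = {
        name: ensemble_score(graph, trees, weights, match_score)
        for name, (trees, weights) in classifiers.items()
    }
    return max(scores, key=scores.get), scores
```

For a quick check, a stand-in matching function such as `lambda tree, graph: float(len(set(tree) & set(graph)))` can be passed as `match_score`, with trees and graphs represented as plain lists of action-word indices.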

SLIDE 13

Algorithms for Action Recognition

| Method | Space-time structural information | Number of structures | Topology of the structures | Supervision |
|---|---|---|---|---|
| Bag-of-Words | Discarded | N/A | N/A | Action class label of video |
| Space-Time Pyramid | Weakly captured | N/A | N/A | Action class label of video |
| Structural Models (past works) | Captured | Predefined, often one | Predefined | Action class label of video + human bounding box annotations |
| Ensemble of Space-Time Trees | Better captured | Discovered from training data | Discovered from training data | Action class label of video |

SLIDE 14

The Algorithm

SLIDE 15

Hierarchical Space-Time Segments

Space-time volumes of video segments preserving their hierarchical relationships.
Covering relevant static parts of video.
Two types: root space-time segments and part space-time segments.
Published in ICCV 2013.

SLIDE 16

Hierarchical Space-Time Segments Extraction

Step 1: hierarchical video frame segments extraction

Key idea: segment tree pruning

1. Each segment tree is either pruned altogether or preserved with all nodes
2. Pruning cues: shape, motion, structure and global color

SLIDE 17

Hierarchical Space-Time Segments Extraction

Step 2: video frame segments tracking

SLIDE 18

SLIDE 19

Learning Action Words

image credit: familysponge.com

SLIDE 20

Extracting Hierarchical Space-Time Segments (Ma et al. ICCV 2013)

[Figure: training videos are decomposed into root space-time segments and part space-time segments]

SLIDE 21

Discriminative Clustering

[Figure: segments from the training videos are clustered into root action words (1, 2, 3, …) and part action words (1, 2, 3, …)]

SLIDE 22

[Figure: clustering illustration]

Affinity Propagation (Frey et al. Science 2007)
Discriminative Subcategorization (Hoai et al. CVPR 2013)

SLIDE 23

Part Action Words

Root Action Words

SLIDE 24

Learning Space-Time Trees

image credit: www.naturalturf.net

SLIDE 25

Extracting Hierarchical Space-Time Segments

[Figure: two training videos]

SLIDE 26

Construct Video Graph

[Figure: two training videos]

SLIDE 27

Associating Action Words to Graph Vertices

[Figure: two training videos whose graph vertices carry the action-word indices 10, 12, 17, 7, 23, 27, 2, 8, 8, 17, 31, 31, 25, 25]

SLIDE 28

Tree Structure Discovery

tree mining, tree clustering, tree ranking

[Figure: two training videos whose graph vertices carry the action-word indices 10, 12, 17, 7, 23, 27, 2, 8, 8, 17, 31, 31, 25, 25, and discovered tree structures over the indices 31, 25, 25, 17, 8, 10, 33, 44, 27]

SLIDE 29

Tree Structure Discovery

tree mining, tree clustering, tree ranking

[Figure: two training videos whose graph vertices carry the action-word indices 10, 12, 17, 7, 23, 27, 2, 8, 8, 17, 31, 31, 25, 25, and two discovered tree structures, one over the indices 31, 25, 25, 17, 8 and one over 10, 33, 44, 27]

SLIDE 30

Tree Structure Discovery

Pipeline: training video graphs -> tree mining -> tree clustering -> tree ranking -> tree structures

SLIDE 31

Tree Structure Discovery

Find frequent subtrees by graph mining.
Train discriminative edge and node weights for each mined tree by one iteration of latent SVM.

SLIDE 32

Frequent Subtree Mining

We use GASTON (Nijssen et al. ICCS 2005) to mine frequent subtrees from training graphs.

Trees with at most six nodes are mined.
We use a small support threshold to mine thousands of trees per action class.
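The role of the support threshold can be illustrated with the smallest possible pattern, a two-node subtree. This is not GASTON and not the authors' code: it is a simplified sketch that only counts labelled edges, and the graph encoding (a dictionary of vertex labels plus a list of `(u, v, relation)` edges) is an assumption made for the example.

```python
from collections import Counter


def edge_patterns(labels, edges):
    """Distinct (word_u, relation, word_v) patterns present in one video graph."""
    return {(labels[u], rel, labels[v]) for u, v, rel in edges}


def frequent_edge_patterns(graphs, min_support):
    """Keep the two-node patterns that occur in at least min_support graphs."""
    support = Counter()
    for labels, edges in graphs:
        # A graph contributes at most once to the support of each pattern.
        support.update(edge_patterns(labels, edges))
    return {p: s for p, s in support.items() if s >= min_support}
```

Lowering `min_support` keeps more patterns, which is the effect the slide relies on to obtain thousands of candidate trees per class before clustering and ranking.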

SLIDE 33

Tree Structure Discovery

Compute tree similarities by tree matching.
Cluster trees and select one tree per cluster.

SLIDE 34

Tree Structure Discovery

Rank trees by activation entropy and select trees with small entropies.
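The slide does not define activation entropy, so the sketch below adopts one plausible reading: a tree "activates" on the training videos where it matches, and its entropy is the Shannon entropy of the class labels of those videos. Under that assumption a low entropy means the tree fires mostly on one class. The function names and the input format are hypothetical.

```python
import math
from collections import Counter


def activation_entropy(activated_labels):
    """Entropy (in bits) of the class labels of the videos on which a tree fired."""
    counts = Counter(activated_labels)
    total = sum(counts.values())
    if total == 0:
        return float("inf")  # a tree that never fires is ranked last
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def rank_trees(activations, keep):
    """activations maps a tree id to the labels of the videos where it fired."""
    ranked = sorted(activations, key=lambda t: activation_entropy(activations[t]))
    return ranked[:keep]
```

For example, a tree that fires on four "kiss" videos has entropy 0, while one that fires equally on "kiss" and "hug" has entropy 1 bit and would be ranked lower.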

[Figure: mean average precision and mean per-class accuracy plotted against the number of trees]

SLIDE 35

Inference

The matching score of a tree to a graph is obtained by applying a pooling function over the set of all (partial) matches, where each match is scored by the matching scores of its tree nodes and edges.

max pooling: find the best match of the tree in the graph by dynamic programming.
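Max pooling over matches can be computed bottom-up on the tree, which is why inference stays exact. The sketch below is a simplified illustration and not the authors' algorithm: it assumes hard label agreement between a tree node and a graph vertex, lets an unmatched subtree contribute 0 (a simple form of partial matching), and does not prevent two tree nodes from mapping to the same vertex. All argument names are hypothetical.

```python
from functools import lru_cache


def tree_match_score(root, words, children, node_w, edge_w, labels, adj):
    """Best score of placing the tree in the graph, by dynamic programming.

    words[t]    action-word index of tree node t
    children[t] list of (child, relation) pairs below tree node t
    node_w[t]   weight of tree node t; edge_w[(t, c)] weight of edge t -> c
    labels[v]   action-word index of graph vertex v
    adj[v]      list of (neighbour, relation) pairs leaving graph vertex v
    """

    @lru_cache(maxsize=None)
    def best(t, v):
        if labels[v] != words[t]:
            return float("-inf")  # node t cannot sit on vertex v
        total = node_w[t]
        for c, rel in children.get(t, []):
            candidates = [
                edge_w[(t, c)] + best(c, u)
                for u, r in adj.get(v, [])
                if r == rel
            ]
            # Partial matching: a child subtree may be skipped at zero gain.
            total += max([0.0] + candidates)
        return total

    return max((best(root, v) for v in labels), default=float("-inf"))
```

Memoising `best(t, v)` means each (tree node, graph vertex) pair is solved once, so the cost grows with the number of tree nodes times the number of graph edges rather than with the number of possible matches.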

SLIDE 36

Evaluation

SLIDE 37

Experiments

UCF-Sports [Rodriguez et al. CVPR 2008]: 10 actions, 103 training videos and 47 testing videos
HighFive [Patron-Perez et al. BMVC 2010]: 4 interactions from TV programs, 150 training videos and 150 testing videos

SLIDE 38

Action Classification

HighFive Dataset

| Method | mAP |
|---|---|
| Ours (early fusion) | 62.7 |
| Ours (late fusion) | 64.4 |
| Gaidon et al. IJCV 2014 | 62.4 |
| Wang et al. CVPR 2011 | 53.4 |
| Ma et al. ICCV 2013 | 53.3 |
| Patron-Perez et al. BMVC 2010 | 42.4 |
| Laptev et al. CVPR 2008 | 36.9 |

UCF-Sports Dataset

| Method | Accuracy |
|---|---|
| Ours (early fusion) | 89.4 |
| Ours (late fusion) | 86.9 |
| Wang et al. ICCV 2013 | 85.2 |
| Ma et al. ICCV 2013 | 81.7 |
| Raptis et al. CVPR 2012 | 79.4 |
| Tian et al. CVPR 2013 | 75.2 |
| Lan et al. ICCV 2011 | 73.1 |

mAP: mean average precision
Accuracy: mean per-class accuracy
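For readers unfamiliar with the two metrics reported here, the sketch below shows one common way to compute them. The slides do not state the exact evaluation protocol, so the non-interpolated form of average precision used here is an assumption, and the function names are illustrative only.

```python
from collections import defaultdict


def mean_per_class_accuracy(y_true, y_pred):
    """Average of the per-class recognition rates, so each class counts equally."""
    hits, totals = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


def average_precision(scores, is_positive):
    """Non-interpolated AP: mean of the precision measured at each positive."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    found, precisions = 0, []
    for rank, i in enumerate(order, start=1):
        if is_positive[i]:
            found += 1
            precisions.append(found / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(per_class):
    """per_class is a list of (scores, is_positive) pairs, one per action class."""
    return sum(average_precision(s, p) for s, p in per_class) / len(per_class)
```

Mean per-class accuracy differs from plain accuracy when classes are unbalanced, because a small class weighs as much as a large one.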

SLIDE 39

Impact of Tree Size

Tree discriminative power increases as we capture more complex time, space and hierarchical structures.

[Figure: UCF-Sports, mean per-class accuracy (axis from 0.3 to 1) against the number of trees (12 to 60), with one curve per tree size from 2 to 6 nodes]

SLIDE 40

SLIDE 41

SLIDE 42

SLIDE 43

Action Localization

UCF-Sports

PA: predicted area; GA: ground truth area
Precision: intersection of PA and GA divided by PA
Recall: intersection of PA and GA divided by GA
IOU: intersection of PA and GA divided by union of PA and GA
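The localization metrics can be made concrete for a single frame. The sketch assumes that the predicted area (PA) and ground truth area (GA) are axis-aligned boxes given as `(x1, y1, x2, y2)`, and it takes precision as the overlap divided by the predicted area, which is the conventional definition; both choices are assumptions about what the slide intends, and the function names are illustrative.

```python
def area(box):
    """Area of an (x1, y1, x2, y2) box; degenerate boxes have area 0."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def intersection(a, b):
    """Area of the overlap between two boxes."""
    return area((max(a[0], b[0]), max(a[1], b[1]),
                 min(a[2], b[2]), min(a[3], b[3])))


def localization_metrics(pa, ga):
    """Precision, recall and IOU of a predicted box against the ground truth."""
    inter = intersection(pa, ga)
    union = area(pa) + area(ga) - inter
    return {
        "precision": inter / area(pa) if area(pa) else 0.0,
        "recall": inter / area(ga) if area(ga) else 0.0,
        "iou": inter / union if union else 0.0,
    }
```

For instance, a prediction `(0, 0, 2, 2)` against ground truth `(1, 0, 3, 2)` overlaps on half of each box, giving precision 0.5, recall 0.5 and an IOU of one third.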

SLIDE 44

Cross Dataset Validation

We use trees learned on HighFive to recognize the two actions it has in common with the Hollywood3D dataset: kiss and hug.

| Method | Kiss | Hug |
|---|---|---|
| Hadfield et al. CVPR 2013 | 10.2 | 12.1 |
| Ours (not using depth info) | 20.8 | 27.4 |
| Hadfield et al. ECCV 2014 | 31.3 | 32.4 |

Evaluation metric: average precision

SLIDE 45

Now you might have the following question:

SLIDE 46

Our method automatically discovers space-time trees from training videos that capture rich time, space and hierarchical structures in human actions.
We propose the ensemble of space-time trees for action classification and achieve promising results.
We show generalization of the learned trees by cross-dataset validation.

SLIDE 47
