Disentangling and Unifying Graph Convolutions for Skeleton-Based Action Recognition (PowerPoint Presentation)

SLIDE 1

Disentangling and Unifying Graph Convolutions for Skeleton-Based Action Recognition

Ziyu Liu¹,³, Hongwen Zhang², Zhenghao Chen¹, Zhiyong Wang¹, Wanli Ouyang¹,³

¹ The University of Sydney; ² University of Chinese Academy of Sciences & CASIA; ³ The University of Sydney, SenseTime Computer Vision Research Group, Australia

SLIDE 2

Agenda

Overview
Contributions
1. Factorized Modeling → Unified Spatial-Temporal Modeling
2. Adjacency Powering → Disentangling Neighborhoods
Experiments & Results

SLIDE 3

Action Recognition from Skeletons

Human actions can be efficiently represented by skeletons
Free of background clutter / lighting conditions / clothing variations

[Figure: skeleton-based action recognition pipeline, from input video, to estimated 2D/3D poses (skeletons), to predicted action (e.g. “Hand Shaking”)]

Image Credit: Amir Shahroudy, Jun Liu, Tian-Tsong Ng, Gang Wang, "NTU RGB+D: A Large Scale Dataset for 3D Human Activity Analysis", CVPR 2016

SLIDE 4

Previous Approaches

Traditional
Hand-crafted features (e.g. Vemulapalli et al., CVPR’14; Huang et al., CVPR’17)
CNNs / RNNs (e.g. Ke et al., Wang et al., Liu et al., CVPR’17; Si et al., ECCV’18)
Often overlook semantic connectivity patterns between joints

Graph-Based (e.g. Shi et al., Li et al., Si et al., CVPR’19; Li et al., Wen et al., AAAI’19)
Graphs naturally capture the structure of human bodies
Joints → nodes, bones → edges: $G = (V, E)$
No hand-crafted node traversal

(Figure: Huang et al., CVPR’17)

SLIDE 5

Preliminaries

Actions as Graph Sequences
Structure: $N$-node graph with adjacency matrix $A$ (normalized: $\hat{A}$)
Features: joint locations $X$ over $T$ frames
Goal: Learn to classify graph sequences

Each frame: structure $A \in \mathbb{R}^{N \times N}$, features $X_t \in \mathbb{R}^{N \times C}$
Entire action: structure $A \in \mathbb{R}^{N \times N}$, features $X \in \mathbb{R}^{T \times N \times C}$

SLIDE 6

Preliminaries

Feature Learning with Graph Convolutional Nets (GCNs) (Kipf et al., ICLR’17)
1. Neighborhood feature aggregation
2. Layer-wise feature update

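Each frame: structure $A \in \mathbb{R}^{N \times N}$, features $X_t \in \mathbb{R}^{N \times C}$
Entire action: structure $A \in \mathbb{R}^{N \times N}$, features $X \in \mathbb{R}^{T \times N \times C}$

$$X^{(l+1)} = \sigma\big(\hat{A} X^{(l)} \Theta^{(l)}\big)$$

Neighborhood aggregation: $\hat{A} X^{(l)}$
Feature update: multiplication by $\Theta^{(l)}$, followed by the nonlinearity $\sigma$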
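The layer rule above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: it assumes the usual normalization $\hat{A} = \tilde{D}^{-1/2}(A + I)\tilde{D}^{-1/2}$ with added self-loops, uses ReLU as $\sigma$, and runs on a hypothetical 5-joint chain graph with random weights.

```python
import numpy as np

def normalize_adjacency(A):
    """Symmetric normalization D^{-1/2} (A + I) D^{-1/2} (self-loops are an assumption here)."""
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_layer(A_hat, X, Theta):
    """Aggregate neighbour features (A_hat @ X), update them (@ Theta), apply ReLU as sigma."""
    return np.maximum(A_hat @ X @ Theta, 0.0)

# Hypothetical 5-joint chain 0-1-2-3-4 standing in for a skeleton.
N, C_in, C_out = 5, 3, 4
A = np.zeros((N, N))
for i in range(N - 1):
    A[i, i + 1] = A[i + 1, i] = 1.0

rng = np.random.default_rng(0)
X_t = rng.normal(size=(N, C_in))          # per-frame joint features
Theta = rng.normal(size=(C_in, C_out))    # layer weights (random here, learnable in practice)

A_hat = normalize_adjacency(A)
X_next = gcn_layer(A_hat, X_t, Theta)
print(X_next.shape)  # (5, 4)
```

Every joint's output row mixes only its own features and those of its direct neighbours, which is why reaching distant joints requires either stacked layers or larger neighbourhoods.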

SLIDE 7

Existing Graph-Based Approaches

1. Factorized Modeling
GCNs + Temporal Models
e.g. Li et al. CVPR’19, Shi et al. CVPR’19, Shi et al. CVPR’19, Yan et al. AAAI’18
[Figure: spatial aggregation within each frame, followed by temporal aggregation across frames]

2. Multi-Scale Graph Convolutions
GCNs over different adjacency powers
e.g. Li et al. CVPR’19, Li et al. AAAI’18

SLIDE 8

Agenda

Overview
Contributions
1. Factorized Modeling → Unified Spatial-Temporal Modeling
2. Adjacency Powering → Disentangling Neighborhoods
Experiments & Results

SLIDE 9

Previous Approach #1: Factorized Modeling

Learn spatial-temporal features with spatial / temporal modules
Spatial: Neighborhood aggregation (GCNs)
Temporal: Node-wise sequence models (1d conv / recurrent)

(cf. Factorized 3D CNNs)

[Figure: spatial aggregation within each frame, followed by temporal aggregation across frames]
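A factorized block of this kind can be sketched as a spatial step followed by a temporal step. The sketch below is illustrative only: the function names are hypothetical, the temporal filter is a single kernel shared across nodes and channels (real models learn per-channel filters), and an identity matrix stands in for the normalized adjacency.

```python
import numpy as np

def spatial_aggregate(A_hat, X):
    """Spatial step: apply the same neighbourhood aggregation to every frame. X: (T, N, C)."""
    return np.einsum("ij,tjc->tic", A_hat, X)

def temporal_conv(X, kernel):
    """Temporal step: node-wise 1D filter over frames with zero padding.

    kernel: (K,) with K odd, shared by all nodes and channels (a simplification).
    """
    K = kernel.shape[0]
    pad = K // 2
    Xp = np.pad(X, ((pad, pad), (0, 0), (0, 0)))
    T = X.shape[0]
    return sum(kernel[k] * Xp[k:k + T] for k in range(K))

def factorized_block(A_hat, X, Theta, kernel):
    """Spatial aggregation and feature update first, temporal aggregation second."""
    H = np.maximum(spatial_aggregate(A_hat, X) @ Theta, 0.0)
    return temporal_conv(H, kernel)

T, N, C = 6, 5, 3
rng = np.random.default_rng(0)
X = rng.normal(size=(T, N, C))
A_hat = np.eye(N)  # placeholder normalized adjacency; any (N, N) matrix works
out = factorized_block(A_hat, X, rng.normal(size=(C, 4)), np.array([0.25, 0.5, 0.25]))
print(out.shape)  # (6, 5, 4)
```

Note that a joint in one frame can only influence a different joint in another frame by passing through both steps in sequence.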

SLIDE 10

Motivation #1: Indirect Information Flow

Factorization can create bottlenecks for feature propagation
Unweighted message passing (GCNs) can also make aggregated features generic

(cf. Factorized 3D CNNs)

Information bottleneck: hard to learn spatial-temporal relationships

SLIDE 11

Idea #1: Unified Spatial-Temporal Modeling

G3D modules: neighborhood aggregation across space and time
Edges serve as skip connections, allowing more direct information flow

[Figure: spatial-temporal information flow in G3D]

SLIDE 12

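Idea #1: Unified Spatial-Temporal Modeling

[Figure: spatial-temporal information flow in G3D. Legend: spatial edges, temporal edges, spatial-temporal edges; $\tau$ = window size]

GCN: features $X \in \mathbb{R}^{T_{in} \times N \times C_{in}}$, adjacency $A \in \mathbb{R}^{N \times N}$
Sliding temporal window: size = $\tau$, dilation = $d$
G3D: features $X_{(\tau)} \in \mathbb{R}^{T_{out} \times \tau N \times C_{in}}$, adjacency $A_{(\tau)} \in \mathbb{R}^{\tau N \times \tau N}$
After graph convolution: $X_{(\tau)} \in \mathbb{R}^{T_{out} \times \tau N \times C_{mid}}$
Collapse window (reshape + FC): $X \in \mathbb{R}^{T_{out} \times N \times C_{out}}$

(1) Sliding Temporal Window: spatial graph and skeleton features → spatial-temporal window and window features
(2) Extrapolate Spatial Connectivity
(3) Graph Convolutions over Windows
(4) Squeeze Windows with 1x1 Conv (Conv 1x1, BatchNorm): $X_{(\tau)} \to X$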
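The four steps can be sketched end to end in NumPy. This is a rough illustration under stated assumptions, not the released implementation: the window is centred with zero padding and stride 1 (so $T_{out} = T_{in}$), the window adjacency is obtained by tiling $A + I$ into $\tau \times \tau$ blocks, the window collapse is written as a single fully connected map, and BatchNorm is omitted. All function names are hypothetical.

```python
import numpy as np

def sliding_windows(X, tau, d=1):
    """(1) For each frame, gather tau frames spaced d apart (zero-padded in time).

    X: (T, N, C) -> (T, tau * N, C). Assumes tau is odd so the window is centred.
    """
    assert tau % 2 == 1, "this sketch assumes an odd window size"
    T = X.shape[0]
    pad = (tau - 1) * d // 2
    Xp = np.pad(X, ((pad, pad), (0, 0), (0, 0)))
    frames = [Xp[i * d : i * d + T] for i in range(tau)]
    return np.concatenate(frames, axis=1)

def window_adjacency(A, tau):
    """(2) Tile (A + I) into a tau x tau block matrix and normalize it symmetrically."""
    A_tau = np.tile(A + np.eye(A.shape[0]), (tau, tau))
    d_inv_sqrt = 1.0 / np.sqrt(A_tau.sum(axis=1))
    return A_tau * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def g3d(X, A, Theta, W_out, tau=3, d=1):
    T, N, _ = X.shape
    X_tau = sliding_windows(X, tau, d)                        # (T, tau*N, C_in)
    A_hat_tau = window_adjacency(A, tau)                      # (tau*N, tau*N)
    H = np.einsum("ij,tjc->tic", A_hat_tau, X_tau) @ Theta    # (3) graph conv over each window
    H = np.maximum(H, 0.0)                                    # (T, tau*N, C_mid)
    H = H.reshape(T, tau, N, -1).transpose(0, 2, 1, 3)        # (T, N, tau, C_mid)
    return H.reshape(T, N, -1) @ W_out                        # (4) collapse window -> (T, N, C_out)

T, N, C_in, C_mid, C_out, tau = 8, 5, 3, 4, 6, 3
A = np.zeros((N, N))
for i in range(N - 1):                  # hypothetical 5-joint chain
    A[i, i + 1] = A[i + 1, i] = 1.0

rng = np.random.default_rng(0)
X = rng.normal(size=(T, N, C_in))
Y = g3d(X, A, rng.normal(size=(C_in, C_mid)), rng.normal(size=(tau * C_mid, C_out)), tau=tau)
print(Y.shape)  # (8, 5, 6)
```

Because the tiled adjacency links a joint to its neighbours in every frame of the window, one aggregation step moves information across space and time at once.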

SLIDE 17

Discussion

Analogous to 3D convolutions on RGB videos
Unlike 3D conv, the number of parameters is independent of receptive field size
Temporal receptive field can be controlled based on input resolution
Considers more information at once and helps prevent losing features during unweighted spatial aggregation
Memory footprint

[Figure: spatial-temporal neighborhood aggregation]

SLIDE 18

Agenda

Overview
Contributions
1. Factorized Modeling → Unified Spatial-Temporal Modeling
2. Adjacency Powering → Disentangling Neighborhoods
Experiments & Results

SLIDE 19

Previous Approach #2: Multi-Scale Graph Convolutions

Making $k$-hop neighbors reachable with $\tilde{A}^k$
Mixing features $\hat{A}^k X$ for $k = 0, 1, \ldots$ with normalized $\hat{A}$

$$X_t^{(l+1)} = \sigma\Big(\sum_{k=0}^{K} \hat{A}^k X_t^{(l)} \Theta_{(k)}^{(l)}\Big)$$

Multi-scale aggregation with different $\hat{A}^k$ and $\Theta_{(k)}$

[Figure: neighborhoods reached on an example graph under $\hat{A}^1$, $\hat{A}^2$, $\hat{A}^3$]

e.g. Li et al. CVPR’19, Abu-El-Haija et al. ICML’19, Luan et al. NeurIPS’19, Liao et al. ICLR’19
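The multi-scale rule above can be written as a short loop over adjacency powers. A minimal sketch, assuming ReLU as $\sigma$, random weights, and a hypothetical 5-joint chain graph with self-loops:

```python
import numpy as np

def multi_scale_gcn(A_hat, X_t, Thetas):
    """sigma(sum_k A_hat^k X_t Theta_k), with ReLU as sigma; Thetas holds K + 1 matrices."""
    out = np.zeros((X_t.shape[0], Thetas[0].shape[1]))
    A_pow = np.eye(A_hat.shape[0])   # A_hat^0 = I
    for Theta_k in Thetas:
        out += A_pow @ X_t @ Theta_k
        A_pow = A_pow @ A_hat        # next power of the normalized adjacency
    return np.maximum(out, 0.0)

N, C_in, C_out, K = 5, 3, 4, 3
A = np.zeros((N, N))
for i in range(N - 1):               # hypothetical 5-joint chain
    A[i, i + 1] = A[i + 1, i] = 1.0
A_tilde = A + np.eye(N)
d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
A_hat = A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

rng = np.random.default_rng(0)
X_t = rng.normal(size=(N, C_in))
Thetas = [rng.normal(size=(C_in, C_out)) for _ in range(K + 1)]
out = multi_scale_gcn(A_hat, X_t, Thetas)
print(out.shape)  # (5, 4)
```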

SLIDE 20

Motivation #2: Biased Node Weighting

Node weights from $\hat{A}^k$ are biased towards closer nodes
More length-$k$ walks to closer nodes due to cyclic walks

$$X_t^{(l+1)} = \sigma\Big(\sum_{k=0}^{K} \hat{A}^k X_t^{(l)} \Theta_{(k)}^{(l)}\Big)$$

Multi-scale aggregation with different $\hat{A}^k$ and $\Theta_{(k)}$

[Plot: number of length-$k$ walks from Node 1 to Nodes 1-5, against walk length $k$]
[Plot: number of length-$k$ walks to self for Nodes 1-5, against walk length $k$]
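The bias can be checked by counting walks directly, since entry $(i, j)$ of the $k$-th power of an adjacency matrix is the number of length-$k$ walks from $i$ to $j$. The example below uses a hypothetical 5-node chain with self-loops (not the graph plotted in the slides) and 0-based node indices.

```python
import numpy as np

N = 5
A = np.zeros((N, N), dtype=int)
for i in range(N - 1):               # chain 0-1-2-3-4
    A[i, i + 1] = A[i + 1, i] = 1
A_tilde = A + np.eye(N, dtype=int)   # self-loops included

def walks_from(node, k):
    """Row `node` of A_tilde^k: entry j is the number of length-k walks node -> j."""
    return np.linalg.matrix_power(A_tilde, k)[node]

for k in range(1, 5):
    print(k, walks_from(0, k).tolist())
# 1 [1, 1, 0, 0, 0]
# 2 [2, 2, 1, 0, 0]
# 3 [4, 5, 3, 1, 0]
# 4 [9, 12, 9, 4, 1]
```

At $k = 4$ the only node that is exactly four hops away receives a weight of 1, while the immediate neighbour receives 12, so the fourth power is still dominated by nearby nodes.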

SLIDE 21

Motivation #2: Biased Node Weighting

Node weights from $\hat{A}^k$ are biased towards closer nodes
More length-$k$ walks to closer nodes due to cyclic walks

$$X_t^{(l+1)} = \sigma\Big(\sum_{k=0}^{K} \hat{A}^k X_t^{(l)} \Theta_{(k)}^{(l)}\Big)$$

Multi-scale aggregation with different $\hat{A}^k$ and $\Theta_{(k)}$

[Plot: number of length-$k$ walks to self for Nodes 1-5, with and without self-loops, against walk length $k$]

SLIDE 22

Motivation #2: Biased Node Weighting

Node weights from $\hat{A}^k$ are biased towards closer nodes
More length-$k$ walks to closer nodes due to cyclic walks
On skeleton graphs, features from local body parts will dominate under GCNs
Hard to learn long-range dependencies

$$X_t^{(l+1)} = \sigma\Big(\sum_{k=0}^{K} \hat{A}^k X_t^{(l)} \Theta_{(k)}^{(l)}\Big)$$

Multi-scale aggregation with different $\hat{A}^k$ and $\Theta_{(k)}$

[Figure: node weighting on an example graph under $\hat{A}^1$, $\hat{A}^2$, $\hat{A}^3$]

SLIDE 23

Motivation #2: Biased Node Weighting

Node weights from $\hat{A}^k$ are biased towards closer nodes
More length-$k$ walks to closer nodes due to cyclic walks
On skeleton graphs, features from local body parts will dominate under GCNs
Hard to learn long-range dependencies
Ideally: unbiased weighting

$$X_t^{(l+1)} = \sigma\Big(\sum_{k=0}^{K} \hat{A}^k X_t^{(l)} \Theta_{(k)}^{(l)}\Big)$$

Multi-scale aggregation with different $\hat{A}^k$ and $\Theta_{(k)}$

[Figure: node weighting on an example graph under $\hat{A}^1$, $\hat{A}^2$, $\hat{A}^3$, compared with the ideal unbiased weighting]

SLIDE 24

Idea #2: Disentangled Multi-Scale Aggregation

$k$-adjacency: reweigh joint importance at different neighborhoods
A new graph construction with no impact on runtime / number of parameters
Self-loops to model joint relationships and retain identity features

$$X_t^{(l+1)} = \sigma\Big(\sum_{k=0}^{K} \hat{A}^k X_t^{(l)} \Theta_{(k)}^{(l)}\Big)$$

Multi-scale aggregation with different $\hat{A}^k$ and $\Theta_{(k)}$

[Figure: neighborhoods on an example graph under adjacency powers $\hat{A}^1$, $\hat{A}^2$, $\hat{A}^3$ and under $k$-adjacency $\tilde{A}_{(1)}$, $\tilde{A}_{(2)}$, $\tilde{A}_{(3)}$]

SLIDE 25

Idea #2: Disentangled Multi-Scale Aggregation

$k$-adjacency: reweigh joint importance at different neighborhoods
A new graph construction with no impact on runtime / number of parameters

$$X_t^{(l+1)} = \sigma\Big(\sum_{k=0}^{K} \hat{A}^k X_t^{(l)} \Theta_{(k)}^{(l)}\Big)$$

Multi-scale aggregation with different $\hat{A}^k$ and $\Theta_{(k)}$

[Figure: neighborhoods on an example graph under adjacency powers $\hat{A}^1$, $\hat{A}^2$, $\hat{A}^3$ and under $k$-adjacency $\tilde{A}_{(1)}$, $\tilde{A}_{(2)}$, $\tilde{A}_{(3)}$]

SLIDE 26

Idea #2: Disentangled Multi-Scale Aggregation

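Updated multi-scale graph convolutions:

With adjacency powers:
$$X_t^{(l+1)} = \sigma\Big(\sum_{k=0}^{K} \hat{A}^k X_t^{(l)} \Theta_{(k)}^{(l)}\Big), \qquad \hat{A} = \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}}$$

With $k$-adjacency:
$$X_t^{(l+1)} = \sigma\Big(\sum_{k=0}^{K} \hat{A}_{(k)} X_t^{(l)} \Theta_{(k)}^{(l)}\Big), \qquad \hat{A}_{(k)} = \tilde{D}_{(k)}^{-\frac{1}{2}} \tilde{A}_{(k)} \tilde{D}_{(k)}^{-\frac{1}{2}}$$

[Figure: neighborhoods on an example graph under adjacency powers $\hat{A}^1$, $\hat{A}^2$, $\hat{A}^3$ and under $k$-adjacency $\tilde{A}_{(1)}$, $\tilde{A}_{(2)}$, $\tilde{A}_{(3)}$]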
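One way to build the $k$-adjacency matrices is sketched below. It assumes that $\tilde{A}_{(k)}$ marks exactly the node pairs at shortest-path distance $k$, plus self-loops, and obtains them by thresholding powers of $A + I$. The function names and the 5-joint chain graph are hypothetical, and ReLU stands in for $\sigma$.

```python
import numpy as np

def k_adjacency(A, k):
    """Binary matrix: 1 where the shortest-path distance is exactly k, plus self-loops."""
    I = np.eye(A.shape[0], dtype=int)
    if k == 0:
        return I
    A_tilde = A + I
    within_k = (np.linalg.matrix_power(A_tilde, k) >= 1).astype(int)        # distance <= k
    within_km1 = (np.linalg.matrix_power(A_tilde, k - 1) >= 1).astype(int)  # distance <= k - 1
    return within_k - within_km1 + I

def normalize(A_k):
    """D_(k)^{-1/2} A_(k) D_(k)^{-1/2}; degrees are >= 1 because of the self-loops."""
    d_inv_sqrt = 1.0 / np.sqrt(A_k.sum(axis=1))
    return A_k * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def disentangled_gcn(A, X_t, Thetas):
    """sigma(sum_k A_hat_(k) X_t Theta_k), with ReLU as sigma."""
    out = np.zeros((X_t.shape[0], Thetas[0].shape[1]))
    for k, Theta_k in enumerate(Thetas):
        out += normalize(k_adjacency(A, k)) @ X_t @ Theta_k
    return np.maximum(out, 0.0)

N = 5
A = np.zeros((N, N), dtype=int)
for i in range(N - 1):               # chain 0-1-2-3-4
    A[i, i + 1] = A[i + 1, i] = 1

print(k_adjacency(A, 2)[0].tolist())  # [1, 0, 1, 0, 0]
print(k_adjacency(A, 4)[0].tolist())  # [1, 0, 0, 0, 1]

rng = np.random.default_rng(0)
X_t = rng.normal(size=(N, 3))
out = disentangled_gcn(A, X_t, [rng.normal(size=(3, 4)) for _ in range(3)])
print(out.shape)  # (5, 4)
```

Each scale now weights only the nodes at its own distance (and the node itself), so far neighbours are no longer drowned out by near ones, and the per-scale matrices can be precomputed once.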

SLIDE 27

MS-G3D: Putting everything together

Multi-scale learning directly over spatial-temporal domain
Disentangled aggregation complements G3D due to its higher node degrees

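$$\big[X_{(\tau)}^{(l+1)}\big]_t = \sigma\Big(\sum_{k=0}^{K} \tilde{D}_{(\tau,k)}^{-\frac{1}{2}} \tilde{A}_{(\tau,k)} \tilde{D}_{(\tau,k)}^{-\frac{1}{2}} \big[X_{(\tau)}^{(l)}\big]_t \Theta_{(k)}^{(l)}\Big)$$

[Figure: disentangled multi-scale aggregation combined with unified spatial-temporal modeling (partially coloured for visual clarity)]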
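The combined rule can be sketched by tiling each $k$-adjacency matrix into $\tau \times \tau$ blocks before normalizing. This is an illustrative sketch, not the released code: it assumes $\tilde{A}_{(\tau,k)}$ is the block-tiled $\tilde{A}_{(k)}$, takes already-windowed features as input, and uses ReLU as $\sigma$. The names are hypothetical.

```python
import numpy as np

def k_adjacency(A, k):
    """1 where the shortest-path distance is exactly k, plus self-loops (assumed definition)."""
    I = np.eye(A.shape[0], dtype=int)
    if k == 0:
        return I
    A_tilde = A + I
    within_k = (np.linalg.matrix_power(A_tilde, k) >= 1).astype(int)
    within_km1 = (np.linalg.matrix_power(A_tilde, k - 1) >= 1).astype(int)
    return within_k - within_km1 + I

def ms_g3d_aggregate(X_tau, A, tau, Thetas):
    """X_tau: (T, tau*N, C_in) window features -> (T, tau*N, C_out)."""
    out = 0.0
    for k, Theta_k in enumerate(Thetas):
        A_tk = np.tile(k_adjacency(A, k), (tau, tau)).astype(float)   # A_(tau,k)
        d_inv_sqrt = 1.0 / np.sqrt(A_tk.sum(axis=1))                  # D_(tau,k)^{-1/2}
        A_hat = A_tk * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
        out = out + np.einsum("ij,tjc->tic", A_hat, X_tau) @ Theta_k
    return np.maximum(out, 0.0)

T, N, tau, C_in, C_out, K = 4, 5, 3, 3, 4, 2
A = np.zeros((N, N), dtype=int)
for i in range(N - 1):               # hypothetical 5-joint chain
    A[i, i + 1] = A[i + 1, i] = 1

rng = np.random.default_rng(0)
X_tau = rng.normal(size=(T, tau * N, C_in))
Thetas = [rng.normal(size=(C_in, C_out)) for _ in range(K + 1)]
out = ms_g3d_aggregate(X_tau, A, tau, Thetas)
print(out.shape)  # (4, 15, 4)
```

Tiling multiplies every node's degree by $\tau$, which is one reading of why the slides pair the higher-degree window graph with disentangled, per-distance weighting.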

SLIDE 28

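Model Architecture

[Figure: overall architecture. Inputs pass through a stack of $r$ STGC blocks, then global average pooling, then FC + Softmax to give the predicted action (e.g. “Hand Waving”). Final features $X$ of shape $T \times N \times C$ are pooled to $1 \times 1 \times C$.]

[Figure: STGC block, mapping $X \in \mathbb{R}^{T^{(r)} \times N \times C^{(r)}}$ to $X \in \mathbb{R}^{T^{(r+1)} \times N \times C^{(r+1)}}$]
Factorized pathway: MS-GCN, MS-TCN, MS-TCN
G3D pathway(s): Conv 1×1 and MS-G3D, each pathway with its own window size and window dilation
Pathway outputs are added; MS-TCN (stride = 2)

Legend: $A$: skeleton graph adjacency; $X$: node features; $T$: number of frames; $N$: number of nodes; $C$: number of channels
Legend: Optional Modules; Multi-Scale Temporal Conv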
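The classification head at the end of the architecture (global average pooling, then FC + Softmax) can be sketched as follows. This is a minimal illustration with random weights; the shapes are hypothetical apart from the 60 classes, which match NTU RGB+D 60, and the handling of batches or multiple skeletons per clip is left out.

```python
import numpy as np

def classify(X, W, b):
    """X: (T, N, C) final features -> class probabilities of shape (num_classes,)."""
    pooled = X.mean(axis=(0, 1))        # global average pooling over frames and joints -> (C,)
    logits = pooled @ W + b             # fully connected layer
    logits = logits - logits.max()      # shift for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()          # softmax

T, N, C, num_classes = 16, 25, 8, 60
rng = np.random.default_rng(0)
X = rng.normal(size=(T, N, C))
probs = classify(X, rng.normal(size=(C, num_classes)), np.zeros(num_classes))
print(probs.shape)  # (60,)
```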

SLIDE 29

Model Architecture

[Figure: overall architecture and STGC block, as on the previous slide]
Factorized pathway: efficient, long-range spatial and temporal factorized modeling
G3D pathway(s): captures regional spatial-temporal features

SLIDE 30

Model Architecture

[Figure: overall architecture and STGC block, as on the previous slide]
Multi-Scale Temporal Conv (MS-TCN): extends vanilla temporal convolutions with multi-scale modelling
Optional modules: bottleneck layers can be used to control complexity

SLIDE 31

Agenda

Overview
Contributions
1. Factorized Modeling → Unified Spatial-Temporal Modeling
2. Adjacency Powering → Disentangling Neighborhoods
Experiments & Results

SLIDE 32

Quantitative Comparison

Datasets and MS-G3D Net results

Dataset                  Joints  Samples  Action classes  Accuracy
NTU RGB+D 60 Skeleton    25      56578    60              91.5% (X-Sub), 96.2% (X-View)
NTU RGB+D 120 Skeleton   25      113945   120             86.9% (X-Sub), 88.4% (X-Set)
Kinetics Skeleton 400    18      260232   400             38.0% (Top-1), 60.9% (Top-5)

Code & Models: bit.ly/ms-g3d

NTU RGB+D 60

Methods                    X-Sub (%)  X-View (%)
IndRNN [23]                81.8       88.0
HCN [20]                   86.5       91.1
ST-GR [18]                 86.9       92.3
AS-GCN [21]                86.8       94.2
2s-AGCN [33]               88.5       95.1
AGC-LSTM [34]              89.2       95.0
DGNN [32]                  89.9       96.1
GR-GCN [8]                 87.5       94.3
MS-G3D Net (Joint Only)    89.4       95.0
MS-G3D Net (Bone Only)     90.1       95.3
MS-G3D Net                 91.5       96.2

NTU RGB+D 120

Methods                        X-Sub (%)  X-Set (%)
ST-LSTM [26]                   55.7       57.9
GCA-LSTM [27]                  61.2       63.3
RotClips + MTCNN [16]          62.2       61.8
Body Pose Evolution Map [28]   64.6       66.9
2s-AGCN [33]                   82.9       84.9
MS-G3D Net                     86.9       88.4

Table 4: Classification accuracy comparison against state-of-the-

Kinetics Skeleton 400

Methods         Top-1 (%)  Top-5 (%)
ST-GCN [50]     30.7       52.8
AS-GCN [21]     34.8       56.5
ST-GR [18]      33.6       56.1
2s-AGCN [33]    36.1       58.7
DGNN [32]       36.9       59.6
MS-G3D Net      38.0       60.9

SLIDE 33

Ablations: Disentangled Multi-Scale Aggregation

G3D Pathway on NTU RGB+D 60 Cross Subject

[Chart: results against number of scales $K$ (1, 4, 8, 12)]
Exponentiation: $\big(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}}\big)^k$
Add learnable residual mask to adjacency matrix of every $k$

SLIDE 34

Ablations: Disentangled Multi-Scale Aggregation

G3D Pathway on NTU RGB+D 60 Cross Subject

[Chart: results against number of scales $K$ (1, 4, 8, 12)]
Exponentiation: $\big(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}}\big)^k$
Disentangled: $\tilde{D}_{(k)}^{-\frac{1}{2}} \tilde{A}_{(k)} \tilde{D}_{(k)}^{-\frac{1}{2}}$

SLIDE 35

Ablations: G3D Modules

[Chart: NTU RGB+D 60 Cross Subject accuracy (%) against number of parameters (1M to 4M) for three settings: without G3D (factorized), with a single G3D pathway, with two G3D pathways. Data labels: 89.4, 89.3, 89.0, 89.2, 89.2, 89.1, 89.1, 89.0, 88.6, 88.5, 87.8]

SLIDE 36

Ablations: Spatial-Temporal Edges in G3D

[Figure: (a) grid-like connectivity, with spatial edges and temporal edges inside a window of size $\tau$]
[Chart: accuracy (%) on NTU RGB+D 60 Cross Subject for (a) Grid]

SLIDE 37

Ablations: Spatial-Temporal Edges in G3D

[Figure: (a) grid-like connectivity; (b) grid-like + dense self edges. Each panel shows spatial edges and temporal edges inside a window of size $\tau$]
[Chart: accuracy (%) on NTU RGB+D 60 Cross Subject for (a) Grid and (b) Grid + Dense Self-Edges]

SLIDE 38

Ablations: Spatial-Temporal Edges in G3D

[Figure: (a) grid-like connectivity; (b) grid-like + dense self edges; (c) dense cross-spacetime edges. Panels show spatial edges, temporal edges and, in (c), spatial-temporal edges inside a window of size $\tau$]
[Chart: accuracy (%) on NTU RGB+D 60 Cross Subject for (a) Grid, (b) Grid + Dense Self-Edges and (c) Cross-Spacetime Edges]

SLIDE 39

Qualitative Results

SLIDE 40

Failure Modes

Noisy Inputs Long-Term Minor Movements


SLIDE 41

Code & Models

bit.ly/ms-g3d