Disentangling and Unifying Graph Convolutions for Skeleton-Based Action Recognition

slide-1
SLIDE 1

Disentangling and Unifying Graph Convolutions for Skeleton-Based Action Recognition

Ziyu Liu¹,³, Hongwen Zhang², Zhenghao Chen¹, Zhiyong Wang¹, Wanli Ouyang¹,³

¹ The University of Sydney
² University of Chinese Academy of Sciences & CASIA
³ The University of Sydney, SenseTime Computer Vision Research Group, Australia

slide-2
SLIDE 2

Agenda

  • Overview
  • Contributions
  • 1. Factorized Modeling → Unified Spatial-Temporal Modeling
  • 2. Adjacency Powering → Disentangling Neighborhoods
  • Experiments & Results

slide-3
SLIDE 3

Action Recognition from Skeletons

  • Human actions can be efficiently represented by skeletons
  • Free of background clutter / lighting conditions / clothing variations

[Figure: Input Video → Estimated 2D/3D Poses (Skeletons) → Predicted Action: “Hand Shaking”]

Image Credit: Amir Shahroudy, Jun Liu, Tian-Tsong Ng, Gang Wang, "NTU RGB+D: A Large Scale Dataset for 3D Human Activity Analysis", CVPR 2016

Skeleton-Based Action Recognition

slide-4
SLIDE 4

Previous Approaches

  • Traditional
      • Hand-crafted features (e.g. Vemulapalli et al., CVPR’14; Huang et al., CVPR’17)
      • CNNs / RNNs (e.g. Ke et al., Wang et al., Liu et al., CVPR’17; Si et al., ECCV’18)
      • Often overlook semantic connectivity patterns between joints
  • Graph-Based (e.g. Shi et al., Li et al., Si et al., CVPR’19; Li et al., Wen et al., AAAI’19)
      • Graphs naturally capture the structure of human bodies: $G = (V, E)$
      • Joints → nodes, bones → edges
      • No hand-crafted node traversal

Image Credit: Huang et al., CVPR’17

slide-5
SLIDE 5

Preliminaries

  • Actions as Graph Sequences
  • Structure: $N$-node graph with adjacency matrix $A$ (normalized: $\hat{A}$)
  • Features: Joint locations $X$ over $T$ frames
  • Goal: Learn to classify graph sequences

Entire Action: Structure $A \in \mathbb{R}^{N \times N}$, Features $X \in \mathbb{R}^{T \times N \times C}$

Each Frame: Structure $A \in \mathbb{R}^{N \times N}$, Features $X_t \in \mathbb{R}^{N \times C}$

slide-6
SLIDE 6

Preliminaries

  • Feature Learning with Graph Convolutional Nets (GCNs) (Kipf et al., ICLR’17)
  • 1. Neighborhood feature aggregation
  • 2. Layer-wise feature update

$$X^{(l+1)} = \sigma\left(\hat{A}\, X^{(l)}\, \Theta^{(l)}\right)$$

Neighborhood Aggregation: $\hat{A} X^{(l)}$. Feature Update: $\Theta^{(l)}$ followed by $\sigma$.

Each Frame: Structure $A \in \mathbb{R}^{N \times N}$, Features $X_t \in \mathbb{R}^{N \times C}$

Entire Action: Structure $A \in \mathbb{R}^{N \times N}$, Features $X \in \mathbb{R}^{T \times N \times C}$
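The layer rule on this slide can be sketched for a single frame in plain Python. This is an illustrative sketch, not the authors' code: it assumes the normalized adjacency is $\hat{A} = \tilde{D}^{-1/2}(A + I)\tilde{D}^{-1/2}$ with added self-loops and that $\sigma$ is ReLU. The 3-joint chain, the feature values and the weights `Theta` are made up for the example.

```python
import math

def matmul(P, Q):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]

def normalize_adjacency(A):
    """Return D^{-1/2} (A + I) D^{-1/2} (assumed form of the normalized adjacency)."""
    n = len(A)
    A_tilde = [[A[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in A_tilde]
    return [[A_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]

def gcn_layer(A_hat, X_t, Theta):
    """One GCN layer on one frame: ReLU(A_hat @ X_t @ Theta)."""
    aggregated = matmul(A_hat, X_t)        # neighborhood aggregation
    updated = matmul(aggregated, Theta)    # layer-wise feature update
    return [[max(0.0, v) for v in row] for row in updated]  # sigma = ReLU (assumption)

# Hypothetical 3-joint chain 0-1-2 with C = 2 input channels.
A = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
X_t = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Theta = [[1.0, -1.0], [0.5, 2.0]]

A_hat = normalize_adjacency(A)
X_next = gcn_layer(A_hat, X_t, Theta)
```

The same layer is applied independently to every frame $X_t$ of the sequence; only the features change between frames, not the graph structure.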

slide-7
SLIDE 7

Existing Graph-Based Approaches

  • 1. Factorized Modeling: GCNs + Temporal Models (e.g. Li et al. CVPR’19, Shi et al. CVPR’19, Shi et al. CVPR’19, Yan et al. AAAI’18)
  • 2. Multi-Scale Graph Convolutions: GCNs over different adjacency powers (e.g. Li et al. CVPR’19, Li et al. AAAI’18)

[Figure: Spatial Aggregation within frames, Temporal Aggregation across frames]

slide-8
SLIDE 8

Agenda

  • Overview
  • Contributions
  • 1. Factorized Modeling → Unified Spatial-Temporal Modeling
  • 2. Adjacency Powering → Disentangling Neighborhoods
  • Experiments & Results

slide-9
SLIDE 9

Previous Approach #1: Factorized Modeling

  • Learn spatial-temporal features with spatial / temporal modules
  • Spatial: Neighborhood aggregation (GCNs)
  • Temporal: Node-wise sequence models (1d conv / recurrent)

(cf. Factorized 3D CNNs)

[Figure: Spatial Aggregation, Temporal Aggregation]

slide-10
SLIDE 10

Motivation #1: Indirect Information Flow

  • Factorization can create bottlenecks for feature propagation
  • Unweighted message passing (GCNs) can also make aggregated features generic

(cf. Factorized 3D CNNs)

→ Hard to learn spatial-temporal relationships

[Figure label: information bottleneck]

slide-11
SLIDE 11

Idea #1: Unified Spatial-Temporal Modeling

[Figure: Spatial-Temporal Information Flow in G3D]

  • G3D modules: neighborhood aggregation across space and time
  • Edges serve as skip connections, allowing more direct information flow
slide-12
SLIDE 12

Idea #1: Unified Spatial-Temporal Modeling

[Figure: Spatial-Temporal Information Flow in G3D. Edge types: Spatial Edges, Temporal Edges, Spatial-Temporal Edges. $\tau$: Window Size]

  • GCN input: Skeleton Features $X$: $T_{in} \times N \times C_{in}$; Spatial Graph $A$: $N \times N$
  • (1) Sliding Temporal Window (size = $\tau$, dilation = $d$): Window Features $X_{(\tau)}$: $T_{out} \times \tau N \times C_{in}$
  • (2) Extrapolate Spatial Connectivity: Spatial-Temporal Window $A_{(\tau)}$: $\tau N \times \tau N$
  • (3) Graph Convolutions over Windows (G3D): $X_{(\tau)}$: $T_{out} \times \tau N \times C_{mid}$
  • (4) Squeeze Windows with 1x1 Conv (Collapse Window: Reshape + FC; Conv 1x1, BatchNorm): $X$: $T_{out} \times N \times C_{out}$
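Steps (1) and (2) can be sketched in plain Python. This is an illustrative sketch rather than the authors' implementation. It assumes the window adjacency is built by tiling $A + I$ into a $\tau \times \tau$ block matrix (dense cross-spacetime edges), that windows are centred on each frame with odd $\tau$, and that out-of-range frames are zero-padded so that $T_{out} = T_{in}$. The function names are hypothetical.

```python
def tile_adjacency(A, tau):
    """(2) Extrapolate spatial connectivity: tau x tau blocks of (A + I),
    giving a (tau*N) x (tau*N) adjacency over all joints in the window."""
    n = len(A)
    A_tilde = [[1 if (i == j or A[i][j]) else 0 for j in range(n)] for i in range(n)]
    size = tau * n
    return [[A_tilde[i % n][j % n] for j in range(size)] for i in range(size)]

def sliding_windows(X, tau, d=1):
    """(1) Sliding temporal window: X is a list of T frames, each a list of N
    joint feature vectors. Returns T windows of tau*N feature vectors each."""
    T, N, C = len(X), len(X[0]), len(X[0][0])
    zero_frame = [[0.0] * C for _ in range(N)]   # zero padding (assumption)
    half = (tau - 1) // 2                        # assumes odd tau
    windows = []
    for t in range(T):
        window = []
        for s in range(-half, half + 1):
            idx = t + s * d                      # dilation d spaces out the frames
            window.extend(X[idx] if 0 <= idx < T else zero_frame)
        windows.append(window)
    return windows

# Hypothetical example: 3-joint chain, T = 4 frames, C = 2 channels, tau = 3.
A = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
X = [[[float(t), float(j)] for j in range(3)] for t in range(4)]

A_tau = tile_adjacency(A, tau=3)       # 9 x 9
windows = sliding_windows(X, tau=3)    # 4 windows, each 9 x 2
```

With this construction a joint in one frame is connected directly to its spatial neighbours in every other frame of the window, which is what lets a single graph convolution over the window mix information across space and time.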

slide-17
SLIDE 17

Discussion

  • Analogous to 3D convolutions on RGB videos
  • Unlike 3D convolutions, the number of parameters is independent of the receptive field size
  • Temporal receptive field can be controlled based on input resolution
  • Considers more information at once and helps prevent losing features during unweighted spatial aggregation
  • Memory footprint

[Figure: Spatial-Temporal Neighborhood Aggregation]

slide-18
SLIDE 18

Agenda

  • Overview
  • Contributions
  • 1. Factorized Modeling → Unified Spatial-Temporal Modeling
  • 2. Adjacency Powering → Disentangling Neighborhoods
  • Experiments & Results

slide-19
SLIDE 19

Previous Approach #2: Multi-Scale Graph Convolutions

  • Making $k$-hop neighbors reachable with $\tilde{A}^k$
  • Mixing features $\hat{A}^k X$ for $k = 0, 1, \ldots$ with normalized $\hat{A}$

$$X_t^{(l+1)} = \sigma\left(\sum_{k=0}^{K} \hat{A}^k X_t^{(l)} \Theta_{(k)}^{(l)}\right)$$

Multi-Scale Aggregation with different $\hat{A}^k$ and $\Theta_{(k)}$

[Figure: example graph with numbered nodes; neighborhoods under $\hat{A}^1$, $\hat{A}^2$, $\hat{A}^3$]

e.g. Li et al. CVPR’19, Abu-El-Haija et al. ICML’19, Luan et al. NeurIPS’19, Liao et al. ICLR’19
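The multi-scale rule above can be sketched for one frame as a sum over adjacency powers. This is a minimal illustration in plain Python, assuming $\hat{A} = \tilde{D}^{-1/2}(A + I)\tilde{D}^{-1/2}$ and ReLU for $\sigma$. The chain graph, the one-hot features and the all-ones weights in `Thetas` are made up so the result is easy to check by hand.

```python
import math

def matmul(P, Q):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]

def normalize_adjacency(A):
    """Return D^{-1/2} (A + I) D^{-1/2} (assumed form of the normalized adjacency)."""
    n = len(A)
    A_tilde = [[A[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in A_tilde]
    return [[A_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]

def multi_scale_gcn(A_hat, X_t, Thetas):
    """ReLU(sum_k A_hat^k @ X_t @ Thetas[k]) for k = 0..K, with K = len(Thetas) - 1."""
    n, c_out = len(A_hat), len(Thetas[0][0])
    power = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # A_hat^0 = I
    total = [[0.0] * c_out for _ in range(n)]
    for Theta_k in Thetas:
        term = matmul(matmul(power, X_t), Theta_k)
        total = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(total, term)]
        power = matmul(power, A_hat)        # next adjacency power
    return [[max(0.0, v) for v in row] for row in total]

# Hypothetical 3-joint chain; only joint 0 carries a feature; K = 2.
A = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
X_t = [[1.0], [0.0], [0.0]]
Thetas = [[[1.0]], [[1.0]], [[1.0]]]

Y = multi_scale_gcn(normalize_adjacency(A), X_t, Thetas)
```

In this toy case joint 2 only receives the feature of joint 0 through the $k = 2$ term, while joint 0 receives its own feature through every term.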

slide-20
SLIDE 20

Motivation #2: Biased Node Weighting

  • Node weights from $\hat{A}^k$ are biased towards closer nodes
  • More length-$k$ walks to closer nodes due to cyclic walks

$$X_t^{(l+1)} = \sigma\left(\sum_{k=0}^{K} \hat{A}^k X_t^{(l)} \Theta_{(k)}^{(l)}\right)$$

Multi-Scale Aggregation with different $\hat{A}^k$ and $\Theta_{(k)}$

[Figure: Number of length-k walks from Node 1 (to Node 1 through Node 5). x-axis: Walk Length k; y-axis: Number of Walks]

[Figure: Number of length-k walks to Self (Node 1 through Node 5). x-axis: Walk Length k; y-axis: Number of Walks]
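The bias can be reproduced on a toy graph by counting walks with integer adjacency powers: entry $(i, j)$ of $\tilde{A}^k$ is the number of length-$k$ walks from $i$ to $j$. The sketch below uses a hypothetical 5-joint path graph with self-loops, not the skeleton graph plotted on the slide.

```python
def matmul(P, Q):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]

def walk_counts(A, k_max, source=0):
    """counts[k][j] = number of length-k walks from `source` to node j."""
    n = len(A)
    power = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
    counts = {}
    for k in range(1, k_max + 1):
        power = matmul(power, A)          # A^k
        counts[k] = list(power[source])
    return counts

# Hypothetical path graph 0-1-2-3-4 with self-loops (A + I).
n = 5
A_tilde = [[1 if abs(i - j) <= 1 else 0 for j in range(n)] for i in range(n)]
counts = walk_counts(A_tilde, k_max=4)

for k, row in counts.items():
    print(k, row)
# counts[4] is [9, 12, 9, 4, 1]
```

At $k = 4$ the only node that is exactly 4 hops away is reached by a single walk, while the source and its immediate neighbour are reached by 9 and 12 walks, so the 4th power still weights nearby joints most heavily.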

slide-21
SLIDE 21

Motivation #2: Biased Node Weighting

  • Node weights from $\hat{A}^k$ are biased towards closer nodes
  • More length-$k$ walks to closer nodes due to cyclic walks

$$X_t^{(l+1)} = \sigma\left(\sum_{k=0}^{K} \hat{A}^k X_t^{(l)} \Theta_{(k)}^{(l)}\right)$$

Multi-Scale Aggregation with different $\hat{A}^k$ and $\Theta_{(k)}$

[Figure: Number of length-k walks to Self, with and without self-loops (Node 1 through Node 5). x-axis: Walk Length k; y-axis: Number of Walks]

slide-23
SLIDE 23

Motivation #2: Biased Node Weighting

  • Node weights from $\hat{A}^k$ are biased towards closer nodes
  • More length-$k$ walks to closer nodes due to cyclic walks
  • On skeleton graphs, features from local body parts will dominate under GCNs

→ Hard to learn long-range dependencies

Ideally: unbiased weighting

$$X_t^{(l+1)} = \sigma\left(\sum_{k=0}^{K} \hat{A}^k X_t^{(l)} \Theta_{(k)}^{(l)}\right)$$

Multi-Scale Aggregation with different $\hat{A}^k$ and $\Theta_{(k)}$

[Figure: example graph with numbered nodes; neighborhoods under $\hat{A}^1$, $\hat{A}^2$, $\hat{A}^3$ compared with the ideal unbiased weighting]

slide-24
SLIDE 24

Idea #2: Disentangled Multi-Scale Aggregation

  • $k$-adjacency: Reweigh joint importance at different neighborhoods
  • A new graph construction with no impact on runtime / # params
  • Self-loops to model joint relationships and retain identity features

[Figure: example graph with numbered nodes; neighborhoods under adjacency powers $\hat{A}^1$, $\hat{A}^2$, $\hat{A}^3$ versus $k$-adjacency $\tilde{A}_{(1)}$, $\tilde{A}_{(2)}$, $\tilde{A}_{(3)}$]

$$X_t^{(l+1)} = \sigma\left(\sum_{k=0}^{K} \hat{A}^k X_t^{(l)} \Theta_{(k)}^{(l)}\right)$$

Multi-Scale Aggregation with different $\hat{A}^k$ and $\Theta_{(k)}$

slide-26
SLIDE 26

Idea #2: Disentangled Multi-Scale Aggregation

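  • Updated multi-scale graph convolutions:

Adjacency powering:

$$X_t^{(l+1)} = \sigma\left(\sum_{k=0}^{K} \hat{A}^k X_t^{(l)} \Theta_{(k)}^{(l)}\right), \qquad \hat{A} = \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}}$$

Disentangled:

$$X_t^{(l+1)} = \sigma\left(\sum_{k=0}^{K} \hat{A}_{(k)} X_t^{(l)} \Theta_{(k)}^{(l)}\right), \qquad \hat{A}_{(k)} = \tilde{D}_{(k)}^{-\frac{1}{2}} \tilde{A}_{(k)} \tilde{D}_{(k)}^{-\frac{1}{2}}$$

[Figure: example graph with numbered nodes; neighborhoods under $\hat{A}^1$, $\hat{A}^2$, $\hat{A}^3$ versus $\tilde{A}_{(1)}$, $\tilde{A}_{(2)}$, $\tilde{A}_{(3)}$]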
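One way to build the $k$-adjacency and its normalized form is sketched below. The slides do not spell out the definition of $\tilde{A}_{(k)}$; this sketch assumes that entry $(i, j)$ is 1 when the shortest-path distance between joints $i$ and $j$ is exactly $k$, or when $i = j$ (the self-loop), and 0 otherwise. The 5-joint path graph and the function names are hypothetical.

```python
import math
from collections import deque

def hop_distances(A):
    """All-pairs shortest-path hop counts via BFS (None if unreachable)."""
    n = len(A)
    dist = [[None] * n for _ in range(n)]
    for s in range(n):
        dist[s][s] = 0
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in range(n):
                if A[u][v] and dist[s][v] is None:
                    dist[s][v] = dist[s][u] + 1
                    queue.append(v)
    return dist

def k_adjacency(A, k):
    """Assumed definition: 1 if d(i, j) == k or i == j, else 0."""
    dist = hop_distances(A)
    n = len(A)
    return [[1 if (i == j or dist[i][j] == k) else 0 for j in range(n)]
            for i in range(n)]

def normalize(A_k):
    """D_(k)^{-1/2} A_(k) D_(k)^{-1/2}; degrees are >= 1 because of self-loops."""
    n = len(A_k)
    deg = [sum(row) for row in A_k]
    return [[A_k[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]

# Hypothetical path graph 0-1-2-3-4.
n = 5
A = [[1 if abs(i - j) == 1 else 0 for j in range(n)] for i in range(n)]

A_2 = k_adjacency(A, 2)       # only exact 2-hop neighbours plus self-loops
A_hat_2 = normalize(A_2)
A_4 = k_adjacency(A, 4)
```

Unlike the $k$-th adjacency power, each scale here contains only the exact $k$-hop neighbours, so a distant joint is not out-weighted by the many cyclic walks that return to nearby joints.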

slide-27
SLIDE 27

MS-G3D: Putting everything together

  • Multi-scale learning directly over spatial-temporal domain
  • Disentangled aggregation complements G3D due to its higher node degrees

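$$\left[X_{(\tau)}^{(l+1)}\right]_t = \sigma\left(\sum_{k=0}^{K} \tilde{D}_{(\tau,k)}^{-\frac{1}{2}} \tilde{A}_{(\tau,k)} \tilde{D}_{(\tau,k)}^{-\frac{1}{2}} \left[X_{(\tau)}^{(l)}\right]_t \Theta_{(k)}^{(l)}\right)$$

Disentangled Multi-Scale Aggregation: the sum over scales $k$ with $\tilde{A}_{(\tau,k)}$. Unified Spatial-Temporal Modeling: the window subscript $(\tau)$.

[Figure: spatial-temporal window graph (partially coloured for visual clarity)]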

slide-28
SLIDE 28

Model Architecture

[Figure: Inputs → stacked STGC blocks ($r$) → Global Average Pooling → FC + Softmax → predicted action, e.g. “Hand Waving”]

  • STGC Block: input $X$: $T^{(l)} \times N \times C^{(l)}$, output $X$: $T^{(l+1)} \times N \times C^{(l+1)}$
  • Factorized Pathway: MS-GCN and MS-TCN modules
  • G3D Pathway(s): MS-G3D modules with Conv 1×1 layers; each MS-G3D has its own window size and window dilation
  • Pathway outputs are combined by addition (Add, +); an MS-TCN uses stride = 2
  • Global Average Pooling reduces $X$ to $1 \times 1 \times C$ before FC + Softmax
  • Notation: $A$: Skeleton Graph Adjacency; $X$: Node Features; $T$: Number of Frames; $N$: Number of Nodes; $C$: Number of Channels
  • Legend: Optional Modules; Multi-Scale Temporal Conv

slide-29
SLIDE 29

Model Architecture

  • Factorized Pathway (MS-GCN, MS-TCN): Efficient, long-range spatial and temporal factorized modeling
  • G3D Pathway(s) (MS-G3D): Captures regional spatial-temporal features

slide-30
SLIDE 30

Model Architecture

  • Multi-Scale Temporal Conv (MS-TCN): Extend vanilla temporal convolutions with multi-scale modelling
  • Optional Modules (Conv 1×1): Bottleneck layers can be used to control complexity

slide-31
SLIDE 31

Agenda

  • Overview
  • Contributions
  • 1. Factorized Modeling → Unified Spatial-Temporal Modeling
  • 2. Adjacency Powering → Disentangling Neighborhoods
  • Experiments & Results

slide-32
SLIDE 32

Quantitative Comparison

Datasets and MS-G3D Net results

Dataset                 Joints   Samples   Action classes   MS-G3D Net accuracy
NTU RGB+D 60            25       56578     60               91.5% (X-Sub), 96.2% (X-View)
NTU RGB+D 120           25       113945    120              86.9% (X-Sub), 88.4% (X-Set)
Kinetics Skeleton 400   18       260232    400              38.0% (Top-1), 60.9% (Top-5)

Code & Models: bit.ly/ms-g3d

NTU RGB+D 60

Methods                        X-Sub (%)   X-View (%)
IndRNN [23]                    81.8        88.0
HCN [20]                       86.5        91.1
ST-GR [18]                     86.9        92.3
AS-GCN [21]                    86.8        94.2
2s-AGCN [33]                   88.5        95.1
AGC-LSTM [34]                  89.2        95.0
DGNN [32]                      89.9        96.1
GR-GCN [8]                     87.5        94.3
MS-G3D Net (Joint Only)        89.4        95.0
MS-G3D Net (Bone Only)         90.1        95.3
MS-G3D Net                     91.5        96.2

NTU RGB+D 120

Methods                        X-Sub (%)   X-Set (%)
ST-LSTM [26]                   55.7        57.9
GCA-LSTM [27]                  61.2        63.3
RotClips + MTCNN [16]          62.2        61.8
Body Pose Evolution Map [28]   64.6        66.9
2s-AGCN [33]                   82.9        84.9
MS-G3D Net                     86.9        88.4

Table 4: Classification accuracy comparison against state-of-the-

Kinetics Skeleton 400

Methods                        Top-1 (%)   Top-5 (%)
ST-GCN [50]                    30.7        52.8
AS-GCN [21]                    34.8        56.5
ST-GR [18]                     33.6        56.1
2s-AGCN [33]                   36.1        58.7
DGNN [32]                      36.9        59.6
MS-G3D Net                     38.0        60.9

slide-33
SLIDE 33

Ablations: Disentangled Multi-Scale Aggregation

[Figure: G3D Pathway on NTU RGB+D 60 Cross Subject. x-axis: Number of Scales ($K$) = 1, 4, 8, 12. Series: Exponentiation, $\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}}\right)^k$]

  • Add learnable residual mask to adjacency matrix of every $k$

slide-34
SLIDE 34

Ablations: Disentangled Multi-Scale Aggregation

[Figure: G3D Pathway on NTU RGB+D 60 Cross Subject. x-axis: Number of Scales ($K$) = 1, 4, 8, 12. Series: Disentangled, $\tilde{D}_{(k)}^{-\frac{1}{2}} \tilde{A}_{(k)} \tilde{D}_{(k)}^{-\frac{1}{2}}$; Exponentiation, $\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}}\right)^k$]

slide-35
SLIDE 35

Ablations: G3D Modules

[Figure: NTU RGB+D 60 Cross Subject Acc (%) versus # Parameters (1M to 4M). Series: Without G3D (Factorized); With Single G3D Pathway; With Two G3D Pathways. Accuracy labels shown on the plot: 89.4, 89.3, 89.0, 89.2, 89.2, 89.1, 89.1, 89.0, 88.6, 88.5, 87.8]

slide-38
SLIDE 38

Ablations: Spatial-Temporal Edges in G3D

  • (a) Grid-like Connectivity: Spatial Edges and Temporal Edges
  • (b) Grid-like + Dense Self Edges
  • (c) Dense Cross-Spacetime Edges: Spatial Edges, Temporal Edges and Spatial-Temporal Edges

[Figure: bar chart of Acc (%) on NTU RGB+D 60 Cross Subject for (a) Grid, (b) Grid + Dense Self-Edges, (c) Cross-Spacetime Edges; y-axis range 88.4 to 89.2. $\tau$: Window Size]

slide-39
SLIDE 39

Qualitative Results

slide-40
SLIDE 40

Failure Modes

Noisy Inputs Long-Term Minor Movements


slide-41
SLIDE 41

Code & Models

bit.ly/ms-g3d