Semantic Spaces for Zero-Shot Behaviour Analysis Xun Xu Computer - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Semantic Spaces for Zero-Shot Behaviour Analysis

Xun Xu Computer Vision and Interactive Media Lab, NUS Singapore

slide-2
SLIDE 2

Collaborators

  • Prof. Shaogang Gong
  • Dr. Timothy Hospedales
slide-3
SLIDE 3

Outline

  • Background
  • Transductive Zero-Shot Action Recognition
  • Multi-Task Zero-Shot Embedding
  • Zero-Shot Crowd Analysis

slide-4
SLIDE 4

Video Behaviour

Defined as Visually Distinguishable Activities

  • Human Actions
  • Crowd Behaviour

slide-5
SLIDE 5

Human Actions

  • Individual or multiple interactive human activities

Soomro, et al. “UCF101: A Dataset of 101 human actions classes from videos in the wild.” 2012

slide-6
SLIDE 6

Human Actions Tasks

  • Action Recognition

Example classes: Eye Makeup, Rafting, Swimming, Diving, Archery, Fencing

slide-7
SLIDE 7

Human Actions Tasks

  • Action Detection (Retrieval)

Given the query “Swimming”, return ranked videos (the figure shows the results in order of decreasing rank).

slide-8
SLIDE 8

Crowd Behaviour

  • A group of people acting collectively

Shao, J., et al. “Deeply learned attributes for crowded scene understanding.” CVPR 2015

slide-9
SLIDE 9

Crowd Behaviour Tasks

  • Crowd Behaviour Profiling

slide-10
SLIDE 10

Crowd Behaviour Tasks

  • Crowd Anomaly Detection

Hassner, T., et al. “Violent flows: Real-time detection of violent crowd behavior.” CVPR 2012

slide-11
SLIDE 11

Potential Applications

  • Surveillance
  • Human Computer Interaction
  • Video Sharing

slide-12
SLIDE 12

Outline

  • Background
  • Transductive Zero-Shot Action Recognition
  • Multi-Task Zero-Shot Embedding
  • Zero-Shot Crowd Analysis

slide-13
SLIDE 13

Motivation

  • Ever-increasing number of categories for action recognition

Dataset timeline:

  • 2004: KTH, 6 classes
  • 2005: Weizmann, 9 classes
  • 2010: Olympic Sports, 16 classes
  • 2011: HMDB51, 51 classes
  • 2012: UCF101, 101 classes
  • 2015: 203 classes

slide-14
SLIDE 14

Motivation

  • Ever-increasing number of categories: KTH (2004, 6 classes), Weizmann (2005, 9 classes), Olympic Sports (2010, 16 classes), HMDB51 (2011, 51 classes), UCF101 (2012, 101 classes), 203 classes by 2015

Limitations

  • Training data is expensive to collect
  • Annotating video is costly

slide-15
SLIDE 15

Zero-Shot Learning (ZSL)

  • Can we use videos from known classes to help predict videos from unknown classes?

Known classes: Discus Throw, Hammer Throw. Unknown class: Shot-Put.

slide-16
SLIDE 16

Attribute Semantic Space

  • Attribute Based

Attributes: Ball, Throw Away, Bend, Turn Around, Outdoor

Classes placed in the attribute space: Discus Throw, Hammer Throw

slide-17
SLIDE 17

Attribute Semantic Space

  • Attribute Based

Attributes: Ball, Throw Away, Bend, Turn Around, Outdoor

Classes placed in the attribute space: Discus Throw, Hammer Throw, and Shot-put (known a priori)

slide-18
SLIDE 18

Attribute Semantic Space

  • Attribute Based

Attributes: Ball, Throw Away, Bend, Turn Around, Outdoor

Classes placed in the attribute space: Discus Throw, Hammer Throw, Shot-put

Also shown: Test video

slide-19
SLIDE 19

Attribute Semantic Space

  • Attribute Based

Attributes: Ball, Throw Away, Bend, Turn Around, Outdoor

Classes placed in the attribute space: Discus Throw, Hammer Throw, Shot-put

Limitations

  • Ontological problem
  • Manually labelling attributes is costly for videos
  • Incompatible with other attribute sets

slide-20
SLIDE 20

Word-Vector Semantic Space

Feature Space X: videos of Hammer Throw and Discus Throw

Mapping from feature space to word-vector space: $z = f(x)$

Word-Vector Space Z:

Discus Throw = [0.2 0.5 0.1 …]
Hammer Throw = [0.1 0.6 0.1 …]

slide-21
SLIDE 21

Word-Vector Semantic Space

Feature Space X: videos of Hammer Throw and Discus Throw

Word-Vector Space Z:

Discus Throw = [0.2 0.5 0.1 …]
Hammer Throw = [0.1 0.6 0.1 …]
ShotPut = [0.3 0.4 0.2 …]

slide-22
SLIDE 22

Semantic Word-Vector

  • Skip-gram model predicts adjacent words

$$\max_{\{z\}} \frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\ j \ne 0} \log p(z_{t+j} \mid z_t)$$

$$p(z_i \mid z_j) = \frac{\exp(z_i^{\top} z_j)}{\sum_{i'} \exp(z_{i'}^{\top} z_j)}$$

Result of this optimization:

vec(“ball”) = [-0.004 0.01 0.01 -0.03 0.05]
vec(“sword”) = [0.16 0.06 0.09 -0.06 -0.002]
vec(“archery”) = [0.02 0.01 0.02 -0.03 -0.03]
vec(“boxing”) = [-0.08 -0.01 0.15 -0.01 0.09]

Mikolov, T., et al. “Distributed representations of words and phrases and their compositionality.” NIPS 2013
Pennington, J., et al. “GloVe: Global vectors for word representation.” EMNLP 2014
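The skip-gram objective on this slide can be illustrated with a small NumPy sketch. It assumes a single shared set of word vectors, as the slide's notation suggests, and a full softmax; practical word2vec implementations use separate input and output vectors and sampling-based approximations. The function names, the toy corpus and the random vectors are hypothetical.

```python
import numpy as np

def skipgram_prob(Z, i, j):
    """Return p(z_i | z_j): a softmax over inner products with z_j."""
    scores = Z @ Z[j]                # z_k^T z_j for every word k
    scores = scores - scores.max()   # subtract the max for numerical stability
    weights = np.exp(scores)
    return weights[i] / weights.sum()

def skipgram_objective(Z, corpus, c):
    """Average log-probability of the words within c positions of each word."""
    T = len(corpus)
    total = 0.0
    for t in range(T):
        for j in range(-c, c + 1):
            if j == 0 or not 0 <= t + j < T:
                continue             # skip the centre word and out-of-range positions
            total += np.log(skipgram_prob(Z, corpus[t + j], corpus[t]))
    return total / T

# Toy setting (assumption): 5 words with random 3-dimensional vectors,
# and a corpus given as a list of word indices.
rng = np.random.default_rng(0)
Z = rng.normal(size=(5, 3))
corpus = [0, 1, 2, 3, 4, 2, 1]
print(skipgram_objective(Z, corpus, c=2))
```

Training would adjust Z to increase this value; the sketch only evaluates the objective.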

slide-23
SLIDE 23

Benefits

  • Geometrically meaningful

Figure (Word-Vector Space): the words Run, Walk, ship, cat and dog plotted as points, annotated “Closer” and “Far Away”.

slide-24
SLIDE 24

Benefits

  • Unsupervised Semantic Space

slide-25
SLIDE 25

Benefits

  • Wide coverage of words

Vec(“Apple”) = [0.2 0.3 0.1 …]
Vec(“Bear”) = [0.1 0.9 0.1 …]
Vec(“Car”) = [0.6 0.2 0.4 …]
Vec(“Desk”) = [0.2 0.8 0.4 …]
Vec(“Fish”) = [0.5 0.2 0.3 …]
…

slide-26
SLIDE 26

Benefits

  • Uniform across datasets

Dataset 1: Hammer Throw = [0.1 0.2 …], Discus Throw = [0.2 0.5 …]
Dataset 2: Hammer Throw = [0.1 0.2 …], Discus Throw = [0.2 0.5 …]

slide-27
SLIDE 27

Challenges

  • Domain Shift

Figure: Feature Space X and Semantic Vector Space Y, showing the classes Discus Throw, Hammer Throw, Sword Exercise and Play Guitar.

slide-28
SLIDE 28

Challenges

  • Domain Shift

Figure: Feature Space X and Semantic Vector Space Y, showing the classes Discus Throw, Hammer Throw, Sword Exercise and Play Guitar, annotated “Confusion”.

slide-29
SLIDE 29

Our Solution

Xu, X., et al. “Transductive Zero-Shot Action Recognition by Word-Vector Embedding.” IJCV 2017

slide-30
SLIDE 30

Our Solution

Xu, X., et al. “Transductive Zero-Shot Action Recognition by Word-Vector Embedding.” IJCV 2017

slide-31
SLIDE 31

Low-Level Visual Feature

  • Improved Trajectory Feature for x

Wang, H. and Schmid, C. “Action recognition with improved trajectories.” ICCV 2013

slide-32
SLIDE 32

Our Solution

Xu, X., et al. “Transductive Zero-Shot Action Recognition by Word-Vector Embedding.” IJCV 2017

slide-33
SLIDE 33

Combinations of Multi Words

  • A phrase is constructed from single word vectors

Additive Composition

vec(“Apply Eye Makeup”) = vec(“Apply”) + vec(“Eye”) + vec(“Makeup”)
vec(“Brushing Teeth”) = vec(“Brushing”) + vec(“Teeth”)
vec(“Playing Guitar”) = vec(“Playing”) + vec(“Guitar”)
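Additive composition can be sketched in a few lines of NumPy. The dictionary of toy vectors below is hypothetical and stands in for a trained word-vector model; lower-casing the phrase is also an assumption about how words are looked up.

```python
import numpy as np

def phrase_vector(phrase, word_vectors):
    """Sum the vectors of the words in a phrase (additive composition)."""
    return np.sum([word_vectors[w] for w in phrase.lower().split()], axis=0)

# Hypothetical 3-dimensional toy vectors, not real word2vec values.
toy_vectors = {
    "playing": np.array([0.1, 0.2, 0.0]),
    "guitar": np.array([0.3, -0.1, 0.5]),
}
print(phrase_vector("Playing Guitar", toy_vectors))
```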

slide-34
SLIDE 34

Our Solution

Xu, X., et al. “Transductive Zero-Shot Action Recognition by Word-Vector Embedding.” IJCV 2017

slide-35
SLIDE 35

Visual to Semantic Mapping by Regularized Linear Regression

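  • Multi-Dimensional Regularized Linear Regression

$$\min_{W} \sum_{i=1}^{N} \|z_i - W x_i\|_2^2 + \lambda \|W\|_2^2$$

z is the D-dimensional semantic space; x is the N-dimensional feature space; $\lambda$ is the regularisation weight.

Diagram: W maps the feature dimensions x1, x2, x3, … to the semantic dimensions z1, z2, …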
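A minimal sketch of this regression, assuming the standard closed-form ridge solution and a row-per-video data layout (both are assumptions, since the slide does not say how the problem is solved). The name `fit_ridge` and the synthetic data are hypothetical.

```python
import numpy as np

def fit_ridge(X, Z, lam):
    """Closed-form minimiser of sum_i ||z_i - W x_i||^2 + lam * ||W||^2.

    X: (n_samples, n_features) visual features, one row per video.
    Z: (n_samples, n_semantic) word vectors of each video's class name.
    Returns W with shape (n_semantic, n_features).
    """
    d = X.shape[1]
    # Solve (X^T X + lam I) W^T = X^T Z instead of forming an explicit inverse.
    W_t = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Z)
    return W_t.T

# Synthetic check: with noise-free data and a tiny lam, W is recovered.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
W_true = rng.normal(size=(3, 4))
Z = X @ W_true.T
W = fit_ridge(X, Z, lam=1e-8)
print(np.allclose(W, W_true, atol=1e-5))
```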

slide-36
SLIDE 36

Domain Shift – Semi Supervised (Manifold Regularized) Regression

  • Semi-supervised regression is applied to tackle domain shift by taking the test data distribution into consideration

Manifold regulariser:

$$\sum_{ij} \omega_{ij} \|f(x_i) - f(x_j)\|_2^2, \quad x \in [X_{tr}; X_{te}]$$

$\omega_{ij}$ is the KNN graph weight; the KNN graph models the manifold.

Train and test data in feature space: $X_{tr} = X^{trg}_{tr}$ (target train data), $X_{te} = X^{trg}_{te}$ (target test data).

slide-37
SLIDE 37

Domain Shift – Semi Supervised (Manifold Regularized) Regression

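  • Semi-supervised regression is applied to tackle domain shift by taking the test data distribution into consideration

$$\min_{W} \sum_{i=1}^{N} \|z_i - W x_i\|_2^2 + \lambda \|W\|_2^2 + \gamma \sum_{ij} \omega_{ij} \|W x_i - W x_j\|_2^2$$

The last term is the manifold regulariser with $f(x) = Wx$. The weights $\omega_{ij}$ come from a KNN graph built over the target train data $X^{trg}_{tr}$ and the target test data $X^{trg}_{te}$ to model the manifold.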
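The manifold-regularised objective also admits a closed-form solution, sketched below. Several details are assumptions not stated on the slides: binary, symmetrised KNN weights, Euclidean neighbours, and the closed-form solve itself. For symmetric weights, the graph term equals 2 tr(W Xᵀ L X Wᵀ) with L the graph Laplacian, which is where the factor 2 comes from. All function names are hypothetical.

```python
import numpy as np

def knn_graph(X, k):
    """Symmetric binary KNN adjacency matrix (binary weights are an assumption)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T   # squared Euclidean distances
    np.fill_diagonal(d2, np.inf)                     # a point is not its own neighbour
    nbrs = np.argsort(d2, axis=1)[:, :k]
    n = X.shape[0]
    omega = np.zeros((n, n))
    omega[np.repeat(np.arange(n), k), nbrs.ravel()] = 1.0
    return np.maximum(omega, omega.T)                # symmetrise

def fit_manifold_ridge(X_tr, Z_tr, X_te, lam, gamma, k=5):
    """Ridge regression plus a graph-smoothness term over train and test data."""
    X_all = np.vstack([X_tr, X_te])                  # labelled and unlabelled data
    omega = knn_graph(X_all, k)
    L = np.diag(omega.sum(axis=1)) - omega           # graph Laplacian
    d = X_tr.shape[1]
    A = X_tr.T @ X_tr + lam * np.eye(d) + 2.0 * gamma * X_all.T @ L @ X_all
    return np.linalg.solve(A, X_tr.T @ Z_tr).T       # shape (n_semantic, n_features)

rng = np.random.default_rng(0)
X_tr, Z_tr = rng.normal(size=(30, 4)), rng.normal(size=(30, 3))
X_te = rng.normal(size=(20, 4)) + 0.5                # shifted test distribution
W = fit_manifold_ridge(X_tr, Z_tr, X_te, lam=1.0, gamma=0.1)
print(W.shape)
```

Setting `gamma=0` reduces this to plain ridge regression on the labelled training data.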

slide-38
SLIDE 38

Our Solution

Xu, X., et al. “Transductive Zero-Shot Action Recognition by Word-Vector Embedding.” IJCV 2017

Additional datasets are available

slide-39
SLIDE 39

Data Augmentation

  • Use more training data from an auxiliary dataset to help learn a better regression

More data is used to learn a more robust regressor:

  • Auxiliary dataset data (e.g. UCF101): $X^{aux}$
  • Target dataset train data (e.g. HMDB51): $X^{trg}_{tr}$

Augmented train and test data in feature space:

$$X_{tr} = [X^{trg}_{tr};\ X^{aux}], \qquad X_{te} = X^{trg}_{te}$$

Figure labels: Target Train Data $X^{trg}_{tr}$, Target Test Data $X^{trg}_{te}$, Auxiliary Data $X^{aux}$.

slide-40
SLIDE 40

Semantic Word Vector Approach

slide-41
SLIDE 41

Zero-Shot Recognition by Nearest Neighbor

  • Do a nearest-neighbour search in word-vector space to predict the category of test data

Figure: a test video instance is projected by W into word-vector space and assigned the category name at minimal distance. Category names shown: Basketball, Kayaking, Fencing, Diving, HulaHoop, TaiChi, Rafting.
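The nearest-neighbour step can be sketched as follows. Euclidean distance is used here to match the slide's “minimal distance” label, which is an assumption; cosine distance is a common alternative for word vectors. The function name and the toy prototypes are hypothetical.

```python
import numpy as np

def zero_shot_predict(X_test, W, prototypes, class_names):
    """Project test features with W, then pick the nearest class prototype.

    X_test: (n_test, n_features); W: (n_semantic, n_features);
    prototypes: (n_classes, n_semantic) word vectors of unseen class names.
    """
    Z_test = X_test @ W.T                               # projections into word-vector space
    diff = Z_test[:, None, :] - prototypes[None, :, :]
    dist = np.linalg.norm(diff, axis=2)                 # (n_test, n_classes) distances
    return [class_names[c] for c in dist.argmin(axis=1)]

# Toy example with a 2-dimensional "word-vector space" and an identity mapping.
W = np.eye(2)
prototypes = np.array([[1.0, 0.0], [0.0, 1.0]])
names = ["Fencing", "Diving"]
X_test = np.array([[0.9, 0.1], [0.2, 0.8]])
print(zero_shot_predict(X_test, W, prototypes, names))  # ['Fencing', 'Diving']
```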

slide-42
SLIDE 42

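Domain Shift – Self-Training

  • Self-training is applied to tackle domain shift

A test video instance x is projected to $z^{te} = f(x)$, and a category name is embedded as the prototype $Z(\text{“Taichi”}) = g(\text{“Taichi”})$. The prototype is then adapted to the mean of its K nearest test projections:

$$Z^{*}(\text{“Taichi”}) = \frac{1}{K} \sum_{z^{te} \in NN(Z(\text{“Taichi”}),\, K)} z^{te}$$

$NN(Z_{proto}, K)$ is the KNN function.

4-NN example, with test projections $z_1, \dots, z_8$:

$$Z^{*}(\text{“Taichi”}) = \frac{1}{4}(z_5 + z_6 + z_7 + z_8)$$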
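The prototype adaptation can be sketched in NumPy as below, assuming Euclidean nearest neighbours and a single adaptation pass; the function name and toy points are hypothetical.

```python
import numpy as np

def self_train_prototypes(prototypes, Z_test, K):
    """Replace each class prototype by the mean of its K nearest test projections.

    prototypes: (n_classes, n_semantic) word vectors of the category names.
    Z_test: (n_test, n_semantic) projected test videos, z_te = f(x).
    """
    adapted = np.empty_like(prototypes, dtype=float)
    for c, proto in enumerate(prototypes):
        dist = np.linalg.norm(Z_test - proto, axis=1)
        nearest = np.argsort(dist)[:K]            # indices of the K nearest test points
        adapted[c] = Z_test[nearest].mean(axis=0)
    return adapted

# Toy example: one prototype at the origin, three test projections.
prototypes = np.array([[0.0, 0.0]])
Z_test = np.array([[1.0, 0.0], [0.0, 1.0], [10.0, 10.0]])
print(self_train_prototypes(prototypes, Z_test, K=2))  # [[0.5 0.5]]
```

The adapted prototypes would then replace the original ones in the nearest-neighbour search.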

slide-44
SLIDE 44

Experiments

Dataset:

  • HMDB51 – 51 classes 6766 videos
  • UCF101 – 101 classes 13320 videos
  • Olympic Sports – 16 classes 786 videos
  • CCV – 20 classes 9317 videos
  • USAA – 8 classes (subset of CCV)

Visual Feature:

  • Improved Trajectory Feature [1]
  • Improved Fisher vector encoding [2]

Semantic Embedding Space:

  • Skip-gram neural network model trained on the Google News dataset
  • 300-dimensional word vectors

[1] Wang, Heng, and Cordelia Schmid. "Action recognition with improved trajectories." ICCV 2013.
[2] Perronnin, Florent, Jorge Sánchez, and Thomas Mensink. "Improving the Fisher kernel for large-scale image classification." ECCV 2010.

slide-45
SLIDE 45

Qualitative Insight

  • How do Self-Training, Manifold Regularization and Data Augmentation perform?

All data are projected to 2D space via t-SNE [1]

[1] van der Maaten, L.J.P. and Hinton, G. “Visualizing high-dimensional data using t-SNE.” JMLR 2008.

slide-46
SLIDE 46

Zero-Shot Experiment

  • Test on public human action datasets

slide-47
SLIDE 47

Outline

  • Background
  • Transductive Zero-Shot Action Recognition
  • Multi-Task Zero-Shot Embedding
  • Zero-Shot Crowd Analysis

slide-48
SLIDE 48

Revisit Visual-to-Semantic Mapping

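  • Multi-Dimensional Regularized Linear Regression

$$\min_{W} \sum_{i=1}^{N} \|z_i - W x_i\|_2^2 + \lambda \|W\|_2^2$$

z is the D-dimensional semantic space; x is the N-dimensional feature space.

Diagram: W maps the feature dimensions x1, x2, x3, … to the semantic dimensions z1, z2, …; the weights into z1 and z2 are labelled w1 and w2.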

slide-49
SLIDE 49

Visual-to-Semantic Mapping by Multi-Task Regression

  • Two stage regression

$$W = AS$$

Diagram: the visual feature x1, …, xd is mapped by A (labelled a1, …, aT) to the latent tasks l1, …, lT, which are mapped by S (labelled s1, …, sm) to the semantic space z1, …, zm.

Xu, X., et al. “Multi-task zero-shot action recognition with prioritised data augmentation.” ECCV 2016

slide-50
SLIDE 50

Visual-to-Semantic Mapping by Multi-Task Regression

  • Two stage regression

Diagram: the visual feature x1, …, xd is mapped by A (labelled a1, …, aT) to the latent tasks l1, …, lT, which are mapped by S (labelled s1, …, sT) to the semantic space z1, …, zm.

slide-51
SLIDE 51

Visual-to-Semantic Mapping by Multi-Task Regression

  • Solve efficiently

Figure panels: Loss Function; Iterative Update

slide-52
SLIDE 52

Multi-Task Embedding

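  • Lower dimension subspace embedding

Diagram: a test video x is mapped by A, and a novel category z by $S^{-1}$, into the shared latent space l, where they are matched:

$$z^{*} = \arg\min_{z} \|A x - S^{-1} z\|_2^2$$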

slide-53
SLIDE 53

Importance Weighting for Domain Adaptation

$$\min_{\omega} D_{KL}\big(p_{te}(x) \,\|\, \omega(x)\, p_{tr}(x)\big) = \int p_{te}(x) \log \frac{p_{te}(x)}{\omega(x)\, p_{tr}(x)} \, dx$$

slide-54
SLIDE 54

Revisit Visual-to-Semantic Mapping

  • Uniform weight is given to all training examples

Uniform Model:

$$\min_{W} \sum_{i=1}^{N} \|z_i - W x_i\|_2^2 + \lambda \|W\|_2^2$$

Weighted Model:

$$\min_{W} \sum_{i=1}^{N} \omega_i \|z_i - W x_i\|_2^2 + \lambda \|W\|_2^2$$
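The weighted model has the same closed form as ridge regression with a diagonal weight matrix, sketched below. The per-sample weights are taken as given; estimating them (for example with KLIEP, as the results tables suggest) is outside this sketch. The function name and data are hypothetical.

```python
import numpy as np

def fit_weighted_ridge(X, Z, weights, lam):
    """Closed-form minimiser of sum_i w_i ||z_i - W x_i||^2 + lam * ||W||^2.

    X: (n_samples, n_features); Z: (n_samples, n_semantic);
    weights: (n_samples,) non-negative importance weights, assumed precomputed.
    """
    d = X.shape[1]
    Xw = X * weights[:, None]                      # rows of X scaled by their weight
    # Solve (X^T diag(w) X + lam I) W^T = X^T diag(w) Z.
    return np.linalg.solve(X.T @ Xw + lam * np.eye(d), Xw.T @ Z).T

rng = np.random.default_rng(0)
X, Z = rng.normal(size=(10, 3)), rng.normal(size=(10, 2))
uniform = fit_weighted_ridge(X, Z, np.ones(10), lam=0.5)
weighted = fit_weighted_ridge(X, Z, rng.uniform(0.1, 2.0, size=10), lam=0.5)
print(uniform.shape, weighted.shape)
```

With all weights equal to one this reduces to the uniform model; a weight of zero removes that training example from the fit.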

slide-55
SLIDE 55

Experiments

Dataset:

  • HMDB51 – 51 classes 6766 videos
  • UCF101 – 101 classes 13320 videos
  • Olympic Sports – 16 classes 786 videos

Feature:

  • Improved Trajectory Feature [1]
  • Improved Fisher vector encoding [2]

Semantic Embedding Space:

  • Skip-gram neural network model trained on the Google News dataset
  • 300-dimensional word vectors

[1] Wang, Heng, and Cordelia Schmid. "Action recognition with improved trajectories." ICCV 2013.
[2] Perronnin, Florent, Jorge Sánchez, and Thomas Mensink. "Improving the Fisher kernel for large-scale image classification." ECCV 2010.

slide-56
SLIDE 56

MTL vs. STL

| ZSL Model | MTL | Latent Matching | HMDB51 | UCF101 | Olympic Sports |
|---|---|---|---|---|---|
| RR [1] | X | N/A | 18.3±2.1 | 14.5±0.9 | 40.9±10.1 |
| RMTL [2] | ✓ | X | 18.5±2.1 | 14.6±1.1 | 41.1±10.0 |
| RMTL [2] | ✓ | ✓ | 18.7±1.7 | 14.7±1.0 | 41.1±10.0 |
| GOMTL [3] | ✓ | X | 18.5±2.2 | 13.1±1.5 | 43.5±8.8 |
| GOMTL [3] | ✓ | ✓ | 18.9±1.0 | 14.9±1.5 | 44.5±8.5 |
| MTE (Ours) | ✓ | X | 18.7±2.2 | 14.2±1.3 | 44.5±8.2 |
| MTE (Ours) | ✓ | ✓ | 19.7±1.6 | 15.8±1.3 | 44.3±8.1 |

[1] Xu, X., et al. “Transductive Zero-Shot Action Recognition by Word-Vector Embedding.” IJCV 2017
[2] Evgeniou, T., et al. “Regularized multi-task learning.” ACM SIGKDD 2004
[3] Kumar, A., et al. “Learning Task Grouping and Overlap in Multi-task Learning.” ICML 2012

slide-57
SLIDE 57

Importance Weighting

| ZSL Model | Weighting Model | HMDB51 | UCF101 | Olympic Sports |
|---|---|---|---|---|
| RR [1] | Uniform | 21.9±2.4 | 19.4±1.7 | 46.5±9.4 |
| MTE (Ours) | Uniform | 23.4±3.4 | 20.9±1.5 | 49.4±8.8 |
| RR [1] | Visual KLIEP | 23.2±2.7 | 20.3±1.6 | 47.2±9.3 |
| RR [1] | Category KLIEP | 23.0±2.1 | 20.2±1.6 | 51.8±8.7 |
| RR [1] | Full KLIEP | 23.7±2.7 | 20.7±1.4 | 51.3±9.0 |
| MTE (Ours) | Visual KLIEP | 23.4±2.8 | 20.8±2.0 | 51.4±9.2 |
| MTE (Ours) | Category KLIEP | 23.3±2.4 | 20.9±1.7 | 50.9±8.3 |
| MTE (Ours) | Full KLIEP | 23.9±3.0 | 21.9±2.7 | 52.3±8.1 |

[1] Xu, X., et al. “Transductive Zero-Shot Action Recognition by Word-Vector Embedding.” IJCV 2017

slide-58
SLIDE 58

Ours vs. State-of-the-Art

Ours

| Method | Embed | Feat | TD | Aug | HMDB51 | UCF101 | Olympic Sports |
|---|---|---|---|---|---|---|---|
| MTE | W | FV | X | X | 19.7±1.6 | 15.8±1.3 | 44.3±8.1 |
| MTE+Full KLIEP | W | FV | ✓ | ✓ | 23.9±3.0 | 21.9±2.7 | 52.3±8.1 |
| MTE+Full KLIEP+PP | W | FV | ✓ | ✓ | 24.8±2.2 | 22.9±3.3 | 56.6±7.7 |
| MTE | A | FV | X | X | N/A | 18.3±1.7 | 55.6±11.3 |

State-of-the-art models

| Method | Embed | Feat | TD | Aug | HMDB51 | UCF101 | Olympic Sports |
|---|---|---|---|---|---|---|---|
| DAP [1] CVPR09 | A | FV | X | X | N/A | 15.9±1.2 | 45.4±12.8 |
| IAP [1] CVPR09 | A | FV | X | X | N/A | 16.7±1.1 | 42.3±12.5 |
| HAA [2] CVPR11 | A | FV | X | X | N/A | 14.9±0.8 | 46.1±12.4 |
| SVE [3] ICIP15 | W | BoW | X | X | 14.9±1.8 | 12.0±1.4 | N/A |
| SVE [3] ICIP15 | W | BoW | ✓ | ✓ | 22.8±2.6 | 18.4±1.4 | N/A |
| ESZSL [4] ICML15 | W | FV | X | X | 18.5±2.0 | 15.0±1.3 | 39.6±9.6 |
| ESZSL [4] ICML15 | A | FV | X | X | N/A | 17.1±1.2 | 53.9±10.8 |
| SJE [5] CVPR15 | W | FV | X | X | 13.3±2.4 | 9.9±1.4 | 28.6±4.9 |
| SJE [5] CVPR15 | A | FV | X | X | N/A | 12.0±1.2 | 47.5±14.8 |

[1] Lampert, C., et al. “Learning to detect unseen object classes by between-class attribute transfer.” CVPR 2009
[2] Liu, J., et al. “Recognizing human actions by attributes.” CVPR 2011
[3] Xu, X., et al. “Semantic embedding space for zero-shot action recognition.” ICIP 2015
[4] Romera-Paredes, B., Torr, P.H.S. “An embarrassingly simple approach to zero-shot learning.” ICML 2015
[5] Akata, Z., et al. “Evaluation of Output Embeddings for Fine-Grained Image Classification.” CVPR 2015

slide-59
SLIDE 59

Outline

  • Background
  • Transductive Zero-Shot Action Recognition
  • Multi-Task Zero-Shot Embedding
  • Zero-Shot Crowd Analysis

slide-60
SLIDE 60

Zero-Shot Crowd Analysis

  • Interesting crowd behaviours, e.g. violence, are rare

Xu, X., et al. “Zero-Shot Crowd Behaviour Recognition.” In Shah et al., Group and Crowd Behaviour Understanding in Computer Vision, Elsevier, April 2017

slide-61
SLIDE 61

Motivation

  • Interesting Crowd Behaviours are Rare, e.g. ViolenceFlow Dataset.

Figure: violent videos and non-violent videos. Only 124 positive (violent) videos.

Hassner, T., et al. “Violent flows: Real-time detection of violent crowd behavior.” CVPR 2012

slide-62
SLIDE 62

Motivation

  • Exploit Existing Crowd Video Data, e.g. WWW Crowd dataset

Shao, J., et al. “Deeply learned attributes for crowded scene understanding.” CVPR 2015

slide-63
SLIDE 63

Zero-Shot Predict Crowd Behaviour

  • Predict “Violence” in Zero-Shot Manner

Gan, C., et al. “Exploring Semantic Inter-Class Relationships (SIR) for Zero-Shot Action Recognition.” AAAI 2015

slide-64
SLIDE 64

Challenges

  • Semantic relatedness vs. visual relatedness

$$\mathrm{vec}(\text{“Outdoor”})^{\top}\, \mathrm{vec}(\text{“Indoor”}) = 0.7104$$

“Outdoor” and “Indoor” are highly related in word-vector space, despite describing visually opposite scenes.

slide-65
SLIDE 65

Solution

  • Exploit co-occurrence of labels to improve ZSL

slide-66
SLIDE 66

Solution

  • Exploit co-occurrence of labels to improve ZSL

slide-67
SLIDE 67

Zero-Shot Predict Crowd Behaviour

  • Visual Context Aware ZSL

Figure panels: Text Only; Visual Co-Occurrence

slide-68
SLIDE 68

Solution

  • Exploit co-occurrence of labels to improve ZSL

slide-69
SLIDE 69

Experiment

Dataset

  • WWW Crowd dataset [1]
  • Violence Flow [2]

Visual Feature

  • Improved Trajectory Feature [3]

Semantic Embedding Space:

  • Skip-gram neural network model trained on the Google News dataset
  • 300-dimensional word vectors

Setting

  • Training on the WWW dataset and testing on Violence Flow
  • Evaluate both accuracy and ROC

[1] Shao, J., et al. “Deeply learned attributes for crowded scene understanding.” CVPR 2015
[2] Hassner, T., et al. “Violent flows: Real-time detection of violent crowd behavior.” CVPR 2012
[3] Wang, Heng, and Cordelia Schmid. "Action recognition with improved trajectories." ICCV 2013.

slide-70
SLIDE 70

Performance

  • Evaluation on Violence Detection Dataset

$$P(\text{“viol”} \mid \text{“mob”}) = \frac{1}{Z} \exp\big(v(\text{“viol”})^{\top} M\, v(\text{“mob”})\big)$$

TexCAZSL uses M = I; CoCAZSL learns M from attribute co-occurrence.

| Model | Split | Feature | Accuracy | AUC |
|---|---|---|---|---|
| WVE [1] | Zero-Shot | ITF | 64.27±5.06 | 64.25 |
| ESZSL [2] | Zero-Shot | ITF | 61.30±8.28 | 61.76 |
| ExDAP [3] | Zero-Shot | ITF | 54.47±7.37 | 52.31 |
| TexCAZSL | Zero-Shot | ITF | 67.07±3.87 | 69.95 |
| CoCAZSL | Zero-Shot | ITF | 80.52±4.67 | 87.22 |
| Linear SVM | 5-fold CV | ITF | 94.72±4.85 | 98.72 |
| Linear SVM [4] | 5-fold CV | ViF | 81.30±0.21 | 85.00 |

[1] Shao, J., et al. “Deeply learned attributes for crowded scene understanding.” CVPR 2015
[2] Romera-Paredes, B., Torr, P.H.S. “An embarrassingly simple approach to zero-shot learning.” ICML 2015
[3] Wang, Heng, and Cordelia Schmid. "Action recognition with improved trajectories." ICCV 2013.
[4] Hassner, T., et al. “Violent flows: Real-time detection of violent crowd behavior.” CVPR 2012
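The bilinear compatibility P(“viol” | “mob”) ∝ exp(v(“viol”)ᵀ M v(“mob”)) can be sketched as below. Normalising over a set of candidate labels is an assumption about what the constant Z sums over; the function name and the toy vectors are hypothetical. With M equal to the identity this corresponds to the text-only setting described for TexCAZSL.

```python
import numpy as np

def label_given_context(V_cand, v_ctx, M):
    """Return P(candidate | context) proportional to exp(v_cand^T M v_ctx).

    V_cand: (n_candidates, dim) word vectors of candidate labels.
    v_ctx: (dim,) word vector of the context label.
    M: (dim, dim) compatibility matrix.
    """
    scores = V_cand @ M @ v_ctx
    scores = scores - scores.max()   # numerical stability
    e = np.exp(scores)
    return e / e.sum()               # the division plays the role of 1/Z

# Toy 2-dimensional vectors; M = I means raw word-vector similarity is used.
V_cand = np.array([[1.0, 0.0], [0.0, 1.0]])
v_ctx = np.array([1.0, 0.0])
print(label_given_context(V_cand, v_ctx, np.eye(2)))
```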

slide-71
SLIDE 71

Qualitative Evaluation

  • Relation to “Violence”

slide-72
SLIDE 72

Conclusion

  • Zero-shot learning can overcome the challenge of labelling ever-increasing data
  • An unsupervised word-vector semantic space produces reasonable ZSL performance without attribute labelling
  • Access to testing data could substantially improve the quality of ZSL
  • ZSL underpinned by a large amount of related data may perform rather close to a model trained on specifically collected small training data

slide-73
SLIDE 73

Thank You

xu-xun.com