

SLIDE 1

Event Recognition by Learning

Amir Habibian

Qualcomm Research, Amsterdam 27 Feb 2017

SLIDE 2

What is an event?

An event is an interaction of people and objects under a certain scene.

Examples

  • Personal events: marriage proposal, grooming an animal
  • Traffic events: accident, traffic jam
  • Security events: breaking a lock, leaving a bag unattended

[Figure: an event decomposes into objects, actions, scenes, and people. Example event: winning a race without a vehicle]

SLIDE 3

Why is event recognition hard?

Large variation in examples (semantic variance)

  • Depending on the context, an event may involve various objects, actions, and scenes

Limited number of training examples

  • Events are more specific than individual objects, actions, and scenes

Example event: feeding an animal

SLIDE 4

Video representations for event recognition

Neither shallow bag-of-words (BoW) nor deep learned representations fit well:

  • BoW representations are not discriminative enough to handle the large variations
  • There are not enough training examples to train a deep neural network

State-of-the-art methods therefore rely on pre-trained semantic encoders to represent videos.

[Figure: non-semantic vs. semantic representation of the event "making a sandwich"]

SLIDE 5

Video representations for event recognition

[Figure: taxonomy of video representations, split into non-semantic vs. semantic and handcrafted vs. learned. The research trend runs from early non-semantic, handcrafted work toward learned, semantic representations.]

SLIDE 6

Non-semantic representation (handcrafted)

Aggregation of handcrafted descriptors over the video:

  1. Decoding the video
  2. Extracting descriptors
     • Appearance: SIFT, GIST, …
     • Motion: HOF, MBH, …
  3. Quantizing descriptors: bag-of-words, VLAD, Fisher vector

[Jiang et al., TRECVID 2010] [Natarajan et al., CVPR 2012] [Wang et al., ICCV 2013] and many others
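The quantization step above can be sketched in a few lines. This is an illustrative toy, not the cited pipelines: the codebook, descriptor dimensionality, and vocabulary size are made up, and real systems build the codebook with k-means over e.g. SIFT or HOF descriptors.

```python
import numpy as np

def bow_histogram(descriptors, codebook):
    """Assign each local descriptor to its nearest codeword and
    return an L1-normalized histogram (the bag-of-words vector)."""
    # Pairwise squared distances: (n_descriptors, n_codewords)
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    assignments = d2.argmin(axis=1)
    hist = np.bincount(assignments, minlength=len(codebook)).astype(float)
    return hist / hist.sum()

rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 16))       # 8 codewords, 16-D descriptors (toy sizes)
descriptors = rng.normal(size=(100, 16))  # local descriptors from one video
h = bow_histogram(descriptors, codebook)
print(h.shape, round(h.sum(), 6))         # (8,) 1.0
```

VLAD and Fisher vectors replace the hard count per codeword with first- (and second-) order residual statistics, but the aggregation idea is the same.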

SLIDE 7

Non-semantic representation (learned)

Aggregation of CNN descriptors over the video; more effective and efficient than handcrafted descriptors:

  1. Decoding the video
  2. Extracting CNN descriptors from networks trained on images (VGG, Inception)
  3. Video pooling: averaging, Fisher vector, VLAD

[Xu et al., CVPR 2015] [Nagel et al., BMVC 2015]
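The simplest pooling option, averaging, can be sketched as below. The random matrix stands in for per-frame CNN activations (e.g. a 4096-D fully connected layer); the L2-normalization is a common but assumed post-processing choice.

```python
import numpy as np

def average_pool(frame_features):
    """Average per-frame CNN descriptors into one video-level vector,
    then L2-normalize it (a common choice before a linear classifier)."""
    v = frame_features.mean(axis=0)
    return v / np.linalg.norm(v)

rng = np.random.default_rng(0)
frames = rng.normal(size=(120, 4096))  # stand-in for 120 frames of CNN features
video_vec = average_pool(frames)
print(video_vec.shape)  # (4096,)
```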

SLIDE 8

Video representations for event recognition

[Figure: taxonomy of video representations: non-semantic vs. semantic, handcrafted vs. learned]

SLIDE 9

Semantic representation (handcrafted)

Handcraft a vocabulary of concept detectors.

SLIDE 10

Handcrafting a concept vocabulary

The vocabulary is created in three steps:

  1. Identifying the concepts to include in the vocabulary
  2. Providing training examples per concept
  3. Training concept classifiers

This involves a lot of annotation effort:

  • To identify which concepts to include
  • To provide training examples per concept

SLIDE 11

Handcrafted vocabulary

Key questions

  • How many concepts to include in the vocabulary?
  • How accurate should the concept detectors be?
  • Which concept types to include in the vocabulary?
  • Which concepts to include in the vocabulary?
  • ...

  • A. Habibian, K. van de Sande, and C. Snoek, ICMR’13
  • A. Habibian and C. Snoek, CVIU’14
SLIDE 12

Quantity vs. quality

What is the impact of concept detector accuracy on event recognition? We impose noise on the concept detector predictions.

[Figure: mean average precision (0.05–0.35) as a function of imposed detection noise (10–100%), for vocabulary sizes 50, 100, 200, 300, 500, and 1346]

SLIDE 13

Quantity vs. quality

What is the impact of concept detector accuracy on event recognition? We impose noise on the concept detector predictions. Conclusion: make the vocabulary larger rather than more accurate.

[Figure: same plot as the previous slide]
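The noise-imposition protocol can be mimicked with a toy helper. This is an assumption about the mechanics, not the paper's exact procedure: here a given percentage of detector scores is simply replaced with random values.

```python
import numpy as np

def impose_noise(scores, noise_pct, rng):
    """Replace roughly `noise_pct` percent of concept detector scores
    with random values, simulating inaccurate detectors."""
    noisy = scores.copy()
    mask = rng.random(scores.shape) < noise_pct / 100.0
    noisy[mask] = rng.random(mask.sum())
    return noisy

rng = np.random.default_rng(0)
scores = rng.random((10, 200))        # 10 videos, 200 concept scores (toy sizes)
noisy = impose_noise(scores, 30, rng)
changed = (noisy != scores).mean()
print(round(changed, 2))              # close to 0.3
```

Sweeping `noise_pct` and the vocabulary size, then measuring event-recognition mean average precision, reproduces the shape of the experiment above.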

SLIDE 14

Conclusion

A comprehensive set of concepts of various types is needed, but this requires a lot of annotation effort…

SLIDE 15

Label composition trick

Expanding the labels by logical operations

  • AND, OR, …

  • A. Habibian, T. Mensink, and C. Snoek, ICMR’14
SLIDE 16

Label composition trick

Expanding the labels by logical operations

  • AND, OR, …
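The trick can be sketched directly on binary labels: given per-video annotations for individual concepts, labels for composite concepts come for free from elementwise logical operations. The concept names and label vectors below are illustrative.

```python
import numpy as np

labels = {  # per-video binary labels for individual concepts (toy data)
    "boat":  np.array([1, 0, 1, 0]),
    "sea":   np.array([1, 1, 0, 0]),
    "man":   np.array([0, 1, 0, 0]),
    "woman": np.array([1, 0, 0, 1]),
}

# Composite labels without any extra annotation:
boat_and_sea = labels["boat"] & labels["sea"]   # both concepts must be present
man_or_woman = labels["man"] | labels["woman"]  # either concept suffices

print(boat_and_sea.tolist())  # [1, 0, 0, 0]
print(man_or_woman.tolist())  # [1, 1, 0, 1]
```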

SLIDE 17

Motivation

Expanding the vocabulary for free. Composite concepts can be easier to detect:

  • boat-AND-sea
  • bear-AND-cage
  • man-OR-woman

Composite concepts can also be more indicative of the event:

  • bike-AND-ride for attempting a bike trick

SLIDE 18

Learning composite concepts

For a vocabulary of n concepts, there are B_n disjoint compositions:

  • B_n is the Bell number
  • Not all of them are useful

Which concepts should be composed together?

  • An NP-hard problem, equivalent to set partitioning
  • Approximated by a greedy search algorithm
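To see why exhaustive search is hopeless and a greedy approximation is needed, the Bell number B_n can be computed with the Bell triangle; it already exceeds 800 partitions for n = 7 concepts:

```python
def bell(n):
    """Bell number B_n (number of ways to partition a set of n items),
    computed with the Bell triangle."""
    row = [1]
    for _ in range(n):
        nxt = [row[-1]]           # each row starts with the previous row's last entry
        for v in row:
            nxt.append(nxt[-1] + v)
        row = nxt
    return row[0]

print([bell(k) for k in range(8)])  # [1, 1, 2, 5, 15, 52, 203, 877]
```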

SLIDE 19

Qualitative results

Top-ranked videos for the event flash mob gathering, with the most dominant concepts in the video representation.

SLIDE 20

Conclusion

Composing concepts yields a more comprehensive vocabulary, but it is still grounded in the handcrafted concepts…

SLIDE 21

Video representations for event recognition

[Figure: taxonomy of video representations: non-semantic vs. semantic, handcrafted vs. learned]

SLIDE 22

Discovering concepts from the web

[Wu et al., CVPR’14] [Chen et al., ICMR’14]

SLIDE 23

Video2Vec embedding

Learn the mutual underlying subspace between videos and their descriptions.

  • A. Habibian, T. Mensink, and C. Snoek, PAMI, in press

[Figure: videos and descriptions mapped into a shared semantic space. Example descriptions: “A woman folds and packages a scarf she has made.” “A woman points out bones on a skeleton for a lab practical for an anatomy class.” “A mother at a fountain tries to get her daughter to step on the water jets.”]

SLIDE 24

Autoencoder

Learn a compact representation from which the input can be reconstructed:

  • The codes serve as the data representation

An autoencoder can be trained on visual data or on textual data (e.g. the caption “Crazy guy doing insane stunts on bike.”); in both cases the pipeline is encoder → codes → decoder.
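A minimal autoencoder sketch, assuming linear encoder/decoder maps and plain gradient descent (the sizes, learning rate, and random data are all illustrative, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 20))        # 64 samples, 20-D input features (toy data)
W = rng.normal(size=(20, 5)) * 0.1   # encoder: features -> 5-D codes
A = rng.normal(size=(5, 20)) * 0.1   # decoder: codes -> features

def loss(W, A):
    """Mean squared reconstruction error of the linear autoencoder."""
    R = X @ W @ A - X
    return (R ** 2).mean()

lr = 0.01
initial = loss(W, A)
for _ in range(200):                 # joint gradient descent on W and A
    Z = X @ W                        # codes
    R = X @ W @ A - X                # reconstruction residual
    gA = 2 * Z.T @ R / R.size        # dL/dA
    gW = 2 * X.T @ (R @ A.T) / R.size  # dL/dW
    W -= lr * gW
    A -= lr * gA

print(loss(W, A) < initial)          # True: reconstruction error decreased
```

After training, the codes `X @ W` are used as the data representation.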

SLIDE 25

Video2Vec embedding

Reconstruct the other view of the data:

  • Reconstruct the textual view from the visual view
  • Reconstruct the visual view from the textual view

[Figure: a video is encoded into codes which are decoded into the caption “Crazy guy doing insane stunts on bike.”, and vice versa]

SLIDE 26

Video2Vec embedding

Reconstruct the textual view from the visual view:

  • W encodes visual features into codes
  • A decodes codes into textual features

ℒ(zᵢ, ẑᵢ) = ‖zᵢ − A W yᵢ‖², where yᵢ is the visual feature of video i and zᵢ its textual feature.
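The cross-modal reconstruction can be sketched as two matrix maps through shared codes. All dimensions and the random features below are stand-ins (e.g. CNN features for the visual view, term vectors for the textual view):

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.normal(size=(32, 4096))          # visual features per video (toy data)
Z = rng.random(size=(32, 300))           # textual (term-vector) features per video
W = rng.normal(size=(4096, 50)) * 0.01   # encoder: visual features -> codes
A = rng.normal(size=(50, 300)) * 0.01    # decoder: codes -> textual features

codes = Y @ W                            # shared semantic codes
Z_hat = codes @ A                        # reconstructed textual view
recon_loss = ((Z - Z_hat) ** 2).sum(axis=1).mean()  # squared error per video
print(codes.shape, Z_hat.shape)          # (32, 50) (32, 300)
```

Training minimizes `recon_loss` over W and A; at test time the codes are the video representation.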

SLIDE 27

Multimodal encoding

Train a separate encoder for every video channel:

  • Appearance, motion, and audio

Share the codes to enforce the common structure across modalities:

  • This acts as a regularizer

[Figure: appearance, motion, and audio encoders feeding shared codes, decoded into the caption “Crazy guy doing insane stunts on bike.”]

SLIDE 28

Multimodal encoding

Visualizing the decoder A as A Aᵀ: the multimodal encoder better learns the semantic relations.

[Figure: decoder visualizations for the unimodal encoders (audio, motion, appearance) and the multimodal encoder]

SLIDE 29

Impact of multimodal encoding

Jointly encoding multiple modalities leads to a better representation.

SLIDE 30

Task-specific decoding

Autoencoders rely on the ℓ₂ loss to measure reconstruction error:

ℒ(z, ẑ) = ‖z − ẑ‖²

so the errors in reconstructing all of the words are treated equally. We replace the ℓ₂ loss with:

ℒ(z, ẑ) = ‖Hₜ (z − ẑ)‖²

where Hₜ is a diagonal matrix determining the importance of each word for the task.
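The weighted loss above amounts to scaling each word's reconstruction error by its diagonal weight. A small numeric sketch (the word weights are illustrative, not learned values):

```python
import numpy as np

def weighted_loss(z, z_hat, h):
    """||H_t (z - z_hat)||^2 with H_t = diag(h)."""
    r = h * (z - z_hat)
    return float(r @ r)

z = np.array([1.0, 0.0, 1.0])        # true word indicators
z_hat = np.array([0.5, 0.5, 0.5])    # reconstruction
uniform = np.ones(3)                 # standard l2 loss: all words equal
task = np.array([2.0, 0.1, 2.0])     # emphasize task-relevant words 0 and 2

print(round(weighted_loss(z, z_hat, uniform), 4))  # 0.75
print(round(weighted_loss(z, z_hat, task), 4))     # 2.0025
```

With task weights, errors on the words that matter for the event dominate the loss, while irrelevant words contribute almost nothing.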

SLIDE 31

Task-specific decoding

[Figure: example reconstructions. Middle: standard decoder; bottom: task-specific decoder]

SLIDE 32

Impact of event-specific decoding

Event-specific decoding leads to a better representation:

  • For both the unimodal and multimodal encoders

[Figure: results, including zero-shot event recognition]

SLIDE 33

Event recognition with video examples

  1. Train the embedding on a collection of videos and their descriptions
     − Videos and their captions downloaded from YouTube
  2. Use the trained embedding to encode event videos
  3. Train and apply the event classifier on the encoded representations
     − SVM
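Steps 2–3 can be sketched as follows. The random projection stands in for the trained Video2Vec encoder, the data are synthetic, and a plain perceptron stands in for the SVM used in the slides, just to keep the sketch dependency-free:

```python
import numpy as np

rng = np.random.default_rng(0)
W_embed = rng.normal(size=(4096, 50))   # stand-in for the trained encoder
X = rng.normal(size=(40, 4096))         # raw video features (toy data)
y = np.array([1] * 20 + [-1] * 20)      # event / non-event labels
X[:20] += 0.5                           # shift event videos to make classes separable

codes = X @ W_embed                     # step 2: encode videos into the embedding

w = np.zeros(50)                        # step 3: train a linear event classifier
for _ in range(50):                     # perceptron epochs (SVM stand-in)
    for xi, yi in zip(codes, y):
        if yi * (xi @ w) <= 0:
            w += yi * xi

acc = ((codes @ w > 0) * 2 - 1 == y).mean()
print(round(acc, 2))
```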

SLIDE 34

Event recognition without video examples

[Figure: zero-shot pipeline. The event description goes through term extraction into a term vector; test videos go through Video2Vec into term vectors; the two are compared by text matching]
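The text-matching step can be sketched as cosine similarity between the query term vector and the term vectors predicted for the test videos. The vocabulary, term vectors, and query below are illustrative:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two term vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

vocab = ["bike", "stunt", "dog", "kitchen"]
# Term vectors predicted (here: hand-made) for three test videos
videos = np.array([
    [0.9, 0.8, 0.0, 0.1],   # bike stunt video
    [0.1, 0.0, 0.9, 0.2],   # dog video
    [0.0, 0.0, 0.1, 0.9],   # cooking video
])
query = np.array([1.0, 1.0, 0.0, 0.0])  # event description -> extracted terms

scores = [cosine(v, query) for v in videos]
ranking = np.argsort(scores)[::-1]       # rank videos by similarity to the event
print(ranking.tolist())                  # [0, 1, 2]
```

No video examples of the event are needed: only its textual description enters the ranking.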

SLIDE 35

Applications

SLIDE 36

Application 1: Cross-modal retrieval

Represent all modalities (speech, text, images, videos) in a mutual semantic space.

  • A. Habibian, T. Mensink, and C. Snoek, ICMR’15

SLIDE 37

Application 1: Cross-modal retrieval

  • A. Habibian and C. Snoek, MM’13
SLIDE 38

Application 1: Cross-modal retrieval

  • A. Habibian and C. Snoek, MM’13
SLIDE 39

Application 2: On-the-fly event search

Efficiency

  • Representing videos by a compact set of concepts

Few exemplars

  • Transfer learning from the vocabulary's training examples

Recounting

  • Interpretable video representation

  • A. Habibian, M. Mazloom, and C. Snoek, ICMR’14
  • M. Mazloom, A. Habibian, and C. Snoek, MM’13
SLIDE 40

Application 2: On-the-fly event search

SLIDE 41

Application 2: On-the-fly event search

SLIDE 42

Application 2: On-the-fly event search

SLIDE 43

Application 3: Video summarization

Localizing the event over time by following its concepts; summarizing long videos, e.g. GoPro footage.

  • M. Mazloom, A. Habibian, and C. Snoek, ICMR’15

Example event: changing a vehicle tire

SLIDE 44

Thanks!

habibian.a.h@gmail.com