Event Recognition by Learning
Amir Habibian
Qualcomm Research, Amsterdam 27 Feb 2017
What is an event?
Interaction of people and objects under a certain scene
Components: People, Object, Action, Scene
Example event: Winning a race without a vehicle
Large variation in examples (semantic variance)
Limited number of training examples
Event: Feeding an animal
Neither shallow BoW nor deep learned representations fit well
State-of-the-art methods rely on pre-trained semantic encoders to represent videos
Example event: Making a sandwich
Non-semantic representation vs. semantic representation
Research trend: from early, handcrafted and non-semantic representations toward learned, semantic representations
Aggregation of handcrafted descriptors over video
[Jiang et al., TRECVID 2010] [Natarajan et al., CVPR 2012] [Wang et al., ICCV 2013] and many others
Pipeline: decoding video → extracting descriptors (appearance, motion, …) → quantizing descriptors (bag-of-words, VLAD, Fisher vector)
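The quantization step of this pipeline can be sketched in a few lines; a minimal illustration assuming local descriptors are already extracted and a codebook of cluster centers has already been learned (e.g., by k-means) — the function name `bow_histogram` is illustrative, not from the cited papers.

```python
import numpy as np

def bow_histogram(centers, descriptors):
    # Assign each local descriptor to its nearest codebook center
    # (visual word) and count occurrences per word.
    dists = np.linalg.norm(descriptors[:, None, :] - centers[None, :, :], axis=2)
    words = dists.argmin(axis=1)
    hist = np.bincount(words, minlength=len(centers)).astype(float)
    return hist / hist.sum()  # L1-normalize so videos of any length compare

rng = np.random.default_rng(0)
centers = rng.normal(size=(4, 16))       # toy codebook of 4 visual words
descriptors = rng.normal(size=(50, 16))  # descriptors from one video
h = bow_histogram(centers, descriptors)
print(h.shape)  # → (4,)
```

VLAD and Fisher vectors replace the hard count with residual statistics per center, but the assignment step is the same.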
Aggregation of CNN descriptors over video: more effective and efficient than the handcrafted descriptors
[Xu et al., CVPR 2015] [Nagel et al., BMVC 2015]
Pipeline: decoding video → extracting CNN descriptors (VGG, Inception; trained on images) → video pooling (averaging, Fisher vector, VLAD)
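The simplest pooling option is averaging; a sketch under the assumption that per-frame CNN activations have already been extracted:

```python
import numpy as np

def average_pool(frame_features):
    # frame_features: (num_frames, dim) array of per-frame CNN
    # activations (e.g., a fully-connected layer of VGG).
    # Mean-pool over time, then L2-normalize so videos of
    # different lengths are comparable.
    v = frame_features.mean(axis=0)
    return v / (np.linalg.norm(v) + 1e-12)

frames = np.arange(12.0).reshape(3, 4)   # 3 frames, 4-dim features
video_vector = average_pool(frames)
print(video_vector.shape)  # → (4,)
```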
Handcraft a vocabulary of concept detectors
The vocabulary is created in three steps:
1. Identifying the concepts to be included in the vocabulary
2. Providing training examples per concept
3. Training concept classifiers
Involves lots of annotation effort
Key questions
Impact of concept detector accuracy on event recognition
Experiment: impose noise on concept detector predictions
[Figure: mean average precision vs. imposed detection noise (%), for vocabulary sizes 50, 100, 200, 300, 500, and 1346]
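The noise-injection protocol can be sketched as follows; this is an assumption about the perturbation scheme (the slide does not specify it): overwrite a given fraction of detector scores with random values.

```python
import numpy as np

def impose_noise(scores, noise_frac, rng):
    # scores: (num_videos, num_concepts) concept detector outputs in [0, 1].
    # Overwrite a `noise_frac` fraction of entries with uniform random
    # values, simulating progressively less accurate detectors.
    noisy = scores.copy()
    mask = rng.random(scores.shape) < noise_frac
    noisy[mask] = rng.random(int(mask.sum()))
    return noisy

rng = np.random.default_rng(0)
scores = rng.random((5, 10))
unchanged = impose_noise(scores, 0.0, rng)  # zero noise leaves scores intact
```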
Takeaway: make the vocabulary larger rather than more accurate
A comprehensive set of concepts of various types is needed, which requires a lot of annotation effort …
Expanding the labels by logical operations
Expanding the vocabulary for free Composite concepts can be easier to detect
Composite concepts can be more indicative of the event
For a vocabulary of n concepts, there are B_n (the n-th Bell number) disjoint compositions
Which concepts should be composed together?
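If B_n denotes the Bell number (as the slide's notation suggests), it counts the ways to partition a set of n concepts into disjoint groups, and grows very quickly; a small sketch computing it via the Bell triangle:

```python
def bell(n):
    # n-th Bell number via the Bell triangle: each row starts with the
    # last entry of the previous row, and each next entry adds the
    # neighbor from the previous row. The first entry of row n is B_n.
    row = [1]
    for _ in range(n):
        new = [row[-1]]
        for v in row:
            new.append(new[-1] + v)
        row = new
    return row[0]

print([bell(n) for n in range(6)])  # → [1, 1, 2, 5, 15, 52]
```

Even a modest vocabulary makes exhaustive search over compositions infeasible, which is why the choice of which concepts to compose matters.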
Top-ranked videos for "flash mob gathering"
Most dominant concepts in the video representation
A more comprehensive vocabulary by composing the concepts
Still grounded in the handcrafted concepts …
[Wu et al., CVPR'14] [Chen et al., ICMR'14]
Learn the mutual underlying subspace between videos and descriptions
Example descriptions: "A woman folds and packages a scarf she has made." "A woman points out bones on a skeleton for a lab practical in an anatomy class." "A mother at a fountain tries to get her daughter to step on the water jets."
[Figure: videos and their descriptions mapped into a common semantic space]
Learn a compact representation from which the input can be reconstructed
Autoencoder for visual data: encoder → codes → decoder
Autoencoder for textual data: encoder → codes → decoder (e.g., "Crazy guy doing insane stunts on bike.")
Reconstruct the other view of the data
Video encoder → codes → text decoder: reconstruct the description ("Crazy guy doing insane stunts on bike.") from the video
Text encoder → codes → video decoder: reconstruct the video from its description
Reconstruct the other view of the data: with video feature $y_i$, encoder $X$, decoder $B$, and description vector $z_i$, minimize
$\mathcal{L}(z, \hat{z}) = \| z_i - B X y_i \|_2^2$
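Using the symbols from the slide ($y_i$ video feature, $X$ encoder, $B$ decoder, $z_i$ description vector), the loss is a few lines of NumPy; a sketch, under the assumption that encoder and decoder are plain linear maps:

```python
import numpy as np

def crossmodal_loss(y, z, X, B):
    # Encode the video feature (X @ y), decode it into the text space
    # (B @ X @ y), and penalize the squared error against the true
    # description vector z.
    z_hat = B @ (X @ y)
    return float(np.sum((z - z_hat) ** 2))

X = np.eye(3)                  # toy encoder
B = np.eye(3)                  # toy decoder
y = np.array([1.0, 2.0, 3.0])  # video feature
z = np.array([1.0, 2.0, 3.0])  # description vector
print(crossmodal_loss(y, z, X, B))  # → 0.0 (perfect reconstruction)
```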
Train a separate encoder for every video channel (appearance, motion, audio)
Share the codes to enforce the common structure across modalities
Appearance, motion, and audio encoders → shared codes → decoder ("Crazy guy doing insane stunts on bike.")
Visualizing the decoder $A$ as $A A^\top$
The multimodal encoder better learns the semantic relations
[Figure: learned word relations for the unimodal encoders (audio, motion, appearance) vs. the multimodal encoder]
Joint encoding of multiple modalities leads to a better representation
Autoencoders rely on the ℓ₂ loss to measure the reconstruction error: $\mathcal{L}(z, \hat{z}) = \| z - \hat{z} \|_2^2$
The errors in reconstructing all of the words are treated equally
We replace the ℓ₂ loss with: $\mathcal{L}(z, \hat{z}) = \| H_t (z - \hat{z}) \|_2^2$
$H_t$ is a diagonal matrix determining the importance of each word per task
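The change from the plain ℓ₂ loss to the weighted one is small; a sketch where the vector `w` holds the diagonal of H_t:

```python
import numpy as np

def weighted_loss(z, z_hat, w):
    # Equivalent to || H_t (z - z_hat) ||^2 with H_t = diag(w):
    # words with larger weight contribute more to the reconstruction error.
    return float(np.sum((w * (z - z_hat)) ** 2))

z = np.array([1.0, 0.0, 1.0])
z_hat = np.array([0.0, 0.0, 0.0])
print(weighted_loss(z, z_hat, np.ones(3)))                  # → 2.0 (plain l2)
print(weighted_loss(z, z_hat, np.array([2.0, 1.0, 0.0])))   # → 4.0
```

With unit weights this reduces to the standard ℓ₂ loss; task-specific weights let the decoder focus on the words that matter for the event.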
[Figure: middle: standard decoder; bottom: task-specific decoder]
Event-specific decoding leads to a better representation
Zero-shot event recognition
− Videos and their captions downloaded from YouTube
− SVM
Event description → term extraction → term vector
Test videos → Video2Vec → term vector
Text matching between the two term vectors
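The text-matching step can be sketched as cosine similarity between term vectors; the similarity function is an assumption (the slide does not specify it):

```python
import numpy as np

def rank_videos(query_vec, video_vecs):
    # query_vec: (dim,) term vector extracted from the event description.
    # video_vecs: (num_videos, dim) Video2Vec term vectors of test videos.
    # Rank videos by cosine similarity to the query, best first;
    # no training examples of the event are used (zero-shot).
    q = query_vec / np.linalg.norm(query_vec)
    v = video_vecs / np.linalg.norm(video_vecs, axis=1, keepdims=True)
    return np.argsort(-(v @ q))

query = np.array([1.0, 0.0, 0.0])
videos = np.array([[0.0, 1.0, 0.0],
                   [0.9, 0.1, 0.0],
                   [0.5, 0.5, 0.0]])
print(rank_videos(query, videos))  # → [1 2 0]
```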
Represent all the modalities in a mutual semantic space
Speech Text Images Videos
Efficiency
Few exemplars
Recounting
Localizing the event over time by following its concepts
Summarizing long videos, e.g., GoPro footage
M.Mazloom, A. Habibian and C. Snoek, ICMR’15
Changing a vehicle tire
habibian.a.h@gmail.com