

SLIDE 1

Event Recognition by Learning

Amir Habibian

Qualcomm Research, Amsterdam 27 Feb 2017

SLIDE 2

What is an event?

An event is an interaction of people and objects under a certain scene.

Examples

  • Personal events: marriage proposal, grooming an animal
  • Traffic events: accident, traffic jam
  • Security events: breaking a lock, leaving a bag unattended

[Figure: an event decomposes into objects, actions, scenes, and people. Example event: winning a race without a vehicle]

SLIDE 3

Why is event recognition hard?

Large variation in examples (semantic variance)

  • Depending on the context, an event may involve various objects, actions, and scenes

Limited number of training examples

  • Events are more specific than individual objects, actions, and scenes

Example event: feeding an animal

SLIDE 4

Video representations for event recognition

Neither shallow bag-of-words (BoW) nor deep learned representations fit well:

  • BoW representations are not discriminative enough to handle the large variations
  • There are not enough training examples to train a deep neural network

State-of-the-art methods therefore rely on pre-trained semantic encoders to represent videos.

[Figure: non-semantic vs. semantic representation of the event "making a sandwich"]

SLIDE 5

Video representations for event recognition

[Figure: taxonomy of video representations, split into non-semantic vs. semantic and handcrafted vs. learned. The research trend runs from early non-semantic, handcrafted work toward learned, semantic representations.]

SLIDE 6

Non-semantic representation (handcrafted)

Aggregation of handcrafted descriptors over the video:

  1. Decoding the video
  2. Extracting descriptors
     • Appearance: SIFT, GIST, …
     • Motion: HOF, MBH, …
  3. Quantizing descriptors: bag-of-words, VLAD, Fisher vector

[Jiang et al., TRECVID 2010] [Natarajan et al., CVPR 2012] [Wang et al., ICCV 2013] and many others
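The quantization step above can be sketched in a few lines. This is an illustrative toy, not the cited pipelines: the codebook, descriptor dimensionality, and vocabulary size are made up, and real systems build the codebook with k-means over e.g. SIFT or HOF descriptors.

```python
import numpy as np

def bow_histogram(descriptors, codebook):
    """Assign each local descriptor to its nearest codeword and
    return an L1-normalized histogram (the bag-of-words vector)."""
    # Pairwise squared distances: (n_descriptors, n_codewords)
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    assignments = d2.argmin(axis=1)
    hist = np.bincount(assignments, minlength=len(codebook)).astype(float)
    return hist / hist.sum()

rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 16))       # 8 codewords, 16-D descriptors (toy sizes)
descriptors = rng.normal(size=(100, 16))  # local descriptors from one video
h = bow_histogram(descriptors, codebook)
print(h.shape, round(h.sum(), 6))         # (8,) 1.0
```

VLAD and Fisher vectors replace the hard count per codeword with first- (and second-) order residual statistics, but the aggregation idea is the same.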

SLIDE 7

Non-semantic representation (learned)

Aggregation of CNN descriptors over the video; more effective and efficient than handcrafted descriptors:

  1. Decoding the video
  2. Extracting CNN descriptors from networks trained on images (VGG, Inception)
  3. Video pooling: averaging, Fisher vector, VLAD

[Xu et al., CVPR 2015] [Nagel et al., BMVC 2015]
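The simplest pooling option, averaging, can be sketched as below. The random matrix stands in for per-frame CNN activations (e.g. a 4096-D fully connected layer); the L2-normalization is a common but assumed post-processing choice.

```python
import numpy as np

def average_pool(frame_features):
    """Average per-frame CNN descriptors into one video-level vector,
    then L2-normalize it (a common choice before a linear classifier)."""
    v = frame_features.mean(axis=0)
    return v / np.linalg.norm(v)

rng = np.random.default_rng(0)
frames = rng.normal(size=(120, 4096))  # stand-in for 120 frames of CNN features
video_vec = average_pool(frames)
print(video_vec.shape)  # (4096,)
```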

SLIDE 8

Video representations for event recognition

[Figure: taxonomy of video representations: non-semantic vs. semantic, handcrafted vs. learned]

SLIDE 9

Semantic representation (handcrafted)

Handcraft a vocabulary of concept detectors.

SLIDE 10

Handcrafting a concept vocabulary

The vocabulary is created in three steps:

  1. Identifying the concepts to include in the vocabulary
  2. Providing training examples per concept
  3. Training concept classifiers

This involves a lot of annotation effort:

  • To identify which concepts to include
  • To provide training examples per concept

SLIDE 11

Handcrafted vocabulary

Key questions

  • How many concepts to include in the vocabulary?
  • How accurate should the concept detectors be?
  • Which concept types to include in the vocabulary?
  • Which concepts to include in the vocabulary?
  • ...

  • A. Habibian, K. van de Sande, and C. Snoek, ICMR’13
  • A. Habibian and C. Snoek, CVIU’14
SLIDE 12

Quantity vs. quality

What is the impact of concept detector accuracy on event recognition? We impose noise on the concept detector predictions.

[Figure: mean average precision (0.05–0.35) as a function of imposed detection noise (10–100%), for vocabulary sizes 50, 100, 200, 300, 500, and 1346]

SLIDE 13

Quantity vs. quality

What is the impact of concept detector accuracy on event recognition? We impose noise on the concept detector predictions. Conclusion: make the vocabulary larger rather than more accurate.

[Figure: same plot as the previous slide]
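The noise-imposition protocol can be mimicked with a toy helper. This is an assumption about the mechanics, not the paper's exact procedure: here a given percentage of detector scores is simply replaced with random values.

```python
import numpy as np

def impose_noise(scores, noise_pct, rng):
    """Replace roughly `noise_pct` percent of concept detector scores
    with random values, simulating inaccurate detectors."""
    noisy = scores.copy()
    mask = rng.random(scores.shape) < noise_pct / 100.0
    noisy[mask] = rng.random(mask.sum())
    return noisy

rng = np.random.default_rng(0)
scores = rng.random((10, 200))        # 10 videos, 200 concept scores (toy sizes)
noisy = impose_noise(scores, 30, rng)
changed = (noisy != scores).mean()
print(round(changed, 2))              # close to 0.3
```

Sweeping `noise_pct` and the vocabulary size, then measuring event-recognition mean average precision, reproduces the shape of the experiment above.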

SLIDE 14

Conclusion

A comprehensive set of concepts of various types is needed, but this requires a lot of annotation effort…

SLIDE 15

Label composition trick

Expanding the labels by logical operations

  • AND, OR, …

  • A. Habibian, T. Mensink, and C. Snoek, ICMR’14
SLIDE 16

Label composition trick

Expanding the labels by logical operations

  • AND, OR, …
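The trick can be sketched directly on binary labels: given per-video annotations for individual concepts, labels for composite concepts come for free from elementwise logical operations. The concept names and label vectors below are illustrative.

```python
import numpy as np

labels = {  # per-video binary labels for individual concepts (toy data)
    "boat":  np.array([1, 0, 1, 0]),
    "sea":   np.array([1, 1, 0, 0]),
    "man":   np.array([0, 1, 0, 0]),
    "woman": np.array([1, 0, 0, 1]),
}

# Composite labels without any extra annotation:
boat_and_sea = labels["boat"] & labels["sea"]   # both concepts must be present
man_or_woman = labels["man"] | labels["woman"]  # either concept suffices

print(boat_and_sea.tolist())  # [1, 0, 0, 0]
print(man_or_woman.tolist())  # [1, 1, 0, 1]
```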

SLIDE 17

Motivation

Expanding the vocabulary for free. Composite concepts can be easier to detect:

  • boat-AND-sea
  • bear-AND-cage
  • man-OR-woman

Composite concepts can also be more indicative of the event:

  • bike-AND-ride for attempting a bike trick

SLIDE 18

Learning composite concepts

For a vocabulary of n concepts, there are B_n disjoint compositions:

  • B_n is the Bell number
  • Not all of them are useful

Which concepts should be composed together?

  • An NP-hard problem, equivalent to set partitioning
  • Approximated by a greedy search algorithm
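To see why exhaustive search is hopeless and a greedy approximation is needed, the Bell number B_n can be computed with the Bell triangle; it already exceeds 800 partitions for n = 7 concepts:

```python
def bell(n):
    """Bell number B_n (number of ways to partition a set of n items),
    computed with the Bell triangle."""
    row = [1]
    for _ in range(n):
        nxt = [row[-1]]           # each row starts with the previous row's last entry
        for v in row:
            nxt.append(nxt[-1] + v)
        row = nxt
    return row[0]

print([bell(k) for k in range(8)])  # [1, 1, 2, 5, 15, 52, 203, 877]
```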

SLIDE 19

Qualitative results

Top-ranked videos for the event flash mob gathering, with the most dominant concepts in the video representation.

SLIDE 20

Conclusion

Composing concepts yields a more comprehensive vocabulary, but it is still grounded in the handcrafted concepts…

SLIDE 21

Video representations for event recognition

[Figure: taxonomy of video representations: non-semantic vs. semantic, handcrafted vs. learned]

SLIDE 22

Discovering concepts from the web

[Wu et al., CVPR’14] [Chen et al., ICMR’14]

SLIDE 23

Video2Vec embedding

Learn the mutual underlying subspace between videos and their descriptions.

  • A. Habibian, T. Mensink, and C. Snoek, PAMI, in press

[Figure: videos and descriptions mapped into a shared semantic space. Example descriptions: “A woman folds and packages a scarf she has made.” “A woman points out bones on a skeleton for a lab practical for an anatomy class.” “A mother at a fountain tries to get her daughter to step on the water jets.”]

SLIDE 24

Autoencoder

Learn a compact representation from which the input can be reconstructed:

  • The codes serve as the data representation

An autoencoder can be trained on visual data or on textual data (e.g. the caption “Crazy guy doing insane stunts on bike.”); in both cases the pipeline is encoder → codes → decoder.
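A minimal autoencoder sketch, assuming linear encoder/decoder maps and plain gradient descent (the sizes, learning rate, and random data are all illustrative, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 20))        # 64 samples, 20-D input features (toy data)
W = rng.normal(size=(20, 5)) * 0.1   # encoder: features -> 5-D codes
A = rng.normal(size=(5, 20)) * 0.1   # decoder: codes -> features

def loss(W, A):
    """Mean squared reconstruction error of the linear autoencoder."""
    R = X @ W @ A - X
    return (R ** 2).mean()

lr = 0.01
initial = loss(W, A)
for _ in range(200):                 # joint gradient descent on W and A
    Z = X @ W                        # codes
    R = X @ W @ A - X                # reconstruction residual
    gA = 2 * Z.T @ R / R.size        # dL/dA
    gW = 2 * X.T @ (R @ A.T) / R.size  # dL/dW
    W -= lr * gW
    A -= lr * gA

print(loss(W, A) < initial)          # True: reconstruction error decreased
```

After training, the codes `X @ W` are used as the data representation.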

SLIDE 25

Video2Vec embedding

Reconstruct the other view of the data:

  • Reconstruct the textual view from the visual view
  • Reconstruct the visual view from the textual view

[Figure: a video is encoded into codes which are decoded into the caption “Crazy guy doing insane stunts on bike.”, and vice versa]

SLIDE 26

Video2Vec embedding

Reconstruct the textual view from the visual view:

  • W encodes visual features into codes
  • A decodes codes into textual features

ℒ(zᵢ, ẑᵢ) = ‖zᵢ − A W yᵢ‖², where yᵢ is the visual feature of video i and zᵢ its textual feature.
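The cross-modal reconstruction can be sketched as two matrix maps through shared codes. All dimensions and the random features below are stand-ins (e.g. CNN features for the visual view, term vectors for the textual view):

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.normal(size=(32, 4096))          # visual features per video (toy data)
Z = rng.random(size=(32, 300))           # textual (term-vector) features per video
W = rng.normal(size=(4096, 50)) * 0.01   # encoder: visual features -> codes
A = rng.normal(size=(50, 300)) * 0.01    # decoder: codes -> textual features

codes = Y @ W                            # shared semantic codes
Z_hat = codes @ A                        # reconstructed textual view
recon_loss = ((Z - Z_hat) ** 2).sum(axis=1).mean()  # squared error per video
print(codes.shape, Z_hat.shape)          # (32, 50) (32, 300)
```

Training minimizes `recon_loss` over W and A; at test time the codes are the video representation.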

SLIDE 27

Multimodal encoding

Train a separate encoder for every video channel:

  • Appearance, motion, and audio

Share the codes to enforce the common structure across modalities:

  • This acts as a regularizer

[Figure: appearance, motion, and audio encoders feeding shared codes, decoded into the caption “Crazy guy doing insane stunts on bike.”]

SLIDE 28

Multimodal encoding

Visualizing the decoder A as A Aᵀ: the multimodal encoder better learns the semantic relations.

[Figure: decoder visualizations for the unimodal encoders (audio, motion, appearance) and the multimodal encoder]

SLIDE 29

Impact of multimodal encoding

Jointly encoding multiple modalities leads to a better representation.

SLIDE 30

Task-specific decoding

Autoencoders rely on the ℓ₂ loss to measure reconstruction error:

ℒ(z, ẑ) = ‖z − ẑ‖²

so the errors in reconstructing all of the words are treated equally. We replace the ℓ₂ loss with:

ℒ(z, ẑ) = ‖Hₜ (z − ẑ)‖²

where Hₜ is a diagonal matrix determining the importance of each word for the task.
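The weighted loss above amounts to scaling each word's reconstruction error by its diagonal weight. A small numeric sketch (the word weights are illustrative, not learned values):

```python
import numpy as np

def weighted_loss(z, z_hat, h):
    """||H_t (z - z_hat)||^2 with H_t = diag(h)."""
    r = h * (z - z_hat)
    return float(r @ r)

z = np.array([1.0, 0.0, 1.0])        # true word indicators
z_hat = np.array([0.5, 0.5, 0.5])    # reconstruction
uniform = np.ones(3)                 # standard l2 loss: all words equal
task = np.array([2.0, 0.1, 2.0])     # emphasize task-relevant words 0 and 2

print(round(weighted_loss(z, z_hat, uniform), 4))  # 0.75
print(round(weighted_loss(z, z_hat, task), 4))     # 2.0025
```

With task weights, errors on the words that matter for the event dominate the loss, while irrelevant words contribute almost nothing.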

SLIDE 31

Task-specific decoding

[Figure: example reconstructions. Middle: standard decoder; bottom: task-specific decoder]

SLIDE 32

Impact of event-specific decoding

Event-specific decoding leads to a better representation:

  • For both the unimodal and multimodal encoders

[Figure: results, including zero-shot event recognition]

SLIDE 33

Event recognition with video examples

  1. Train the embedding on a collection of videos and their descriptions
     − Videos and their captions downloaded from YouTube
  2. Use the trained embedding to encode event videos
  3. Train and apply the event classifier on the encoded representations
     − SVM
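Steps 2–3 can be sketched as follows. The random projection stands in for the trained Video2Vec encoder, the data are synthetic, and a plain perceptron stands in for the SVM used in the slides, just to keep the sketch dependency-free:

```python
import numpy as np

rng = np.random.default_rng(0)
W_embed = rng.normal(size=(4096, 50))   # stand-in for the trained encoder
X = rng.normal(size=(40, 4096))         # raw video features (toy data)
y = np.array([1] * 20 + [-1] * 20)      # event / non-event labels
X[:20] += 0.5                           # shift event videos to make classes separable

codes = X @ W_embed                     # step 2: encode videos into the embedding

w = np.zeros(50)                        # step 3: train a linear event classifier
for _ in range(50):                     # perceptron epochs (SVM stand-in)
    for xi, yi in zip(codes, y):
        if yi * (xi @ w) <= 0:
            w += yi * xi

acc = ((codes @ w > 0) * 2 - 1 == y).mean()
print(round(acc, 2))
```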

SLIDE 34

Event recognition without video examples

[Figure: zero-shot pipeline. The event description goes through term extraction into a term vector; test videos go through Video2Vec into term vectors; the two are compared by text matching]
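The text-matching step can be sketched as cosine similarity between the query term vector and the term vectors predicted for the test videos. The vocabulary, term vectors, and query below are illustrative:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two term vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

vocab = ["bike", "stunt", "dog", "kitchen"]
# Term vectors predicted (here: hand-made) for three test videos
videos = np.array([
    [0.9, 0.8, 0.0, 0.1],   # bike stunt video
    [0.1, 0.0, 0.9, 0.2],   # dog video
    [0.0, 0.0, 0.1, 0.9],   # cooking video
])
query = np.array([1.0, 1.0, 0.0, 0.0])  # event description -> extracted terms

scores = [cosine(v, query) for v in videos]
ranking = np.argsort(scores)[::-1]       # rank videos by similarity to the event
print(ranking.tolist())                  # [0, 1, 2]
```

No video examples of the event are needed: only its textual description enters the ranking.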

SLIDE 35

Applications

SLIDE 36

Application 1: Cross-modal retrieval

Represent all modalities (speech, text, images, videos) in a mutual semantic space.

  • A. Habibian, T. Mensink, and C. Snoek, ICMR’15

SLIDE 37

Application 1: Cross-modal retrieval

  • A. Habibian and C. Snoek, MM’13
SLIDE 38

Application 1: Cross-modal retrieval

  • A. Habibian and C. Snoek, MM’13
SLIDE 39

Application 2: On-the-fly event search

Efficiency

  • Representing videos by a compact set of concepts

Few exemplars

  • Transfer learning from the vocabulary's training examples

Recounting

  • Interpretable video representation

  • A. Habibian, M. Mazloom, and C. Snoek, ICMR’14
  • M. Mazloom, A. Habibian, and C. Snoek, MM’13
SLIDE 40

Application 2: On-the-fly event search

SLIDE 41

Application 2: On-the-fly event search

SLIDE 42

Application 2: On-the-fly event search

SLIDE 43

Application 3: Video summarization

Localizing the event over time by following its concepts; summarizing long videos, e.g. GoPro footage.

  • M. Mazloom, A. Habibian, and C. Snoek, ICMR’15

Example event: changing a vehicle tire

SLIDE 44

Thanks!

habibian.a.h@gmail.com