Zero-Example Event Detection and Recounting
Speaker: Yi-Jie Lu
Yi-Jie Lu, Hao Zhang, Ting Yao, Chong-Wah Ngo On behalf of VIREO Group, City University of Hong Kong
Feb. 12, 2015
Outline
Multimedia Event Detection (MED)
– Background
– System Overview
– Findings
Multimedia Event Recounting (MER)
– Background
– System Workflow
– Results
– An activity occurring at a specific place and time, involving people interacting with other people and/or objects
– e.g., a procedural action or a social activity
Ad-Hoc Testing and Evaluation Events AH14: E041-E050
E041 - Baby shower
E042 - Building a fire
E043 - Busking
E044 - Decorating for a celebration
E045 - Extinguishing a fire
E046 - Making a purchase
E047 - Modeling
E048 - Doing a magic trick
E049 - Putting on additional apparel
E050 - Teaching dance choreography
[Diagram] Raw images / video snippets → extract low-level visual features → how to model high-level events?
[Diagram] Raw images / video snippets → extract low-level visual features → pre-trained concept models → high-level events
Example event concepts: birthday cake, gift, decoration: party hat, decoration: balloon, several persons gathered around, candles
[Diagram] Event hierarchy, from high-level to low-level:
Event: Changing a vehicle tire
Human-object-scene interactions: person opening the car trunk, person jacking the car, person using a wrench, person changing to a new tire
Actions: squatting, standing up, walking
Objects and scene: tire wrench, tire; side of the road
Low-level visual features and low-level motion features
– [ Exemplar videos …… ]
– Description: One or more people work to replace a tire on a vehicle
– Explication: …
– Evidential description
– Given an event query, SQG (semantic query generation) translates the query into a semantic query: a set of relevant concepts drawn from the concept bank, each with a relevance score
[Diagram] Event query ("Attempting a Bike Trick") → SQG → semantic query: relevant concepts with relevance scores, grouped as < Objects > (e.g., 0.60), < Actions > (e.g., 1.00, 0.62, 0.61), and < Scenes > (e.g., 0.01), drawn from the concept bank
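As a concrete illustration, here is a minimal sketch of exact-matching SQG in Python. The concept bank is assumed to be a plain list of concept names, and the term-frequency scoring and top-k cutoff are illustrative stand-ins rather than the system's actual scheme:

```python
# Minimal SQG sketch: exact matching of concept names against event-kit
# text. All names and the scoring heuristic here are illustrative.
import re
from collections import Counter

def generate_semantic_query(event_kit_text, concept_names, top_k=8):
    """Return {concept: normalized relevance score} for concepts whose
    names appear verbatim in the event-kit text."""
    tokens = re.findall(r"[a-z]+", event_kit_text.lower())
    text = " ".join(tokens)
    counts = Counter(tokens)
    scored = {}
    for concept in concept_names:
        name = concept.lower()
        if name in text:  # exact string match against the kit text
            scored[concept] = min(counts[w] for w in name.split())
    top = sorted(scored.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    if not top:
        return {}
    peak = max(score for _, score in top)
    return {c: score / peak for c, score in top}

# Example (hypothetical event-kit snippet): "kitchen" does not match.
query = generate_semantic_query(
    "One or more people attempt a bike trick, riding a bike on ramps.",
    ["bike", "bike trick", "kitchen"])
```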
Concept bank: TRECVID SIN, Research Collection, HMDB51, UCF101, ImageNet
– Research collection (497 concepts)
– ImageNet ILSVRC'12 (1000 concepts)
– SIN'14 (346 concepts)
– Exact matching vs. WordNet/ConceptNet matching
– How many concepts are chosen to represent an event?
– To further improve the performance:
– Ranking according to the semantic query (SQ) and the videos' concept responses
Event Search
[Diagram] The semantic query (< Objects > / < Actions > / < Scenes > with relevance scores) is matched against each video's concept responses to produce a video ranking.
Ranking score: s = Σ_i q_i c_i, where q_i is the relevance score of concept i in the semantic query and c_i is the video's response to concept i.
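A minimal sketch of this ranking step in Python, assuming each test video has already been reduced to a concept-response vector aligned with the semantic-query weights (all variable names are illustrative):

```python
# Ranking sketch for s = sum_i q_i * c_i, one dot product per video.
import numpy as np

def rank_videos(query_weights, video_responses):
    """query_weights: (num_concepts,) relevance scores q from SQG.
    video_responses: (num_videos, num_concepts) concept responses c.
    Returns (indices sorted by descending score, the scores)."""
    scores = video_responses @ query_weights
    return np.argsort(-scores), scores

# Example: rank 3 videos against a 4-concept semantic query.
q = np.array([0.60, 1.00, 0.62, 0.01])
C = np.random.rand(3, 4)
ranking, scores = rank_videos(q, C)
```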
* 8,000 hours of video
[Chart] Average precision per event for WordNet matching, exact matching, and EM-TOP. Chart annotation: "Exact matching, but few concepts (7%)".
[Chart] Mean average precision (all events) vs. top-k concepts (k = 1 to 26). The best MAP is hit by retaining only the top 8 concepts.
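A sketch of how such a sweep over k could look, assuming a hypothetical evaluate_map callback that scores a truncated query against labeled data (the deck itself reports k = 8 as best):

```python
# Sweep k, the number of retained concepts, and keep the best-scoring k.
# evaluate_map is a hypothetical callback, not part of the deck's system.
def best_k(query, candidate_ks, evaluate_map):
    """query: {concept: relevance score}. Returns (best k, {k: MAP})."""
    maps = {}
    for k in candidate_ks:
        truncated = dict(sorted(query.items(),
                                key=lambda kv: kv[1], reverse=True)[:k])
        maps[k] = evaluate_map(truncated)
    return max(maps, key=maps.get), maps
```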
[Chart] Average precision vs. top-k concepts for Event 31: Beekeeping. Top concepts: honeycomb (ImageNet), bee (ImageNet), bee house (ImageNet), cutting (research collection), cutting down tree (research collection).
[Chart] Average precision vs. top-k concepts for Event 23: Dog show. Top concepts: brush dog (research collection), dog show (research collection).
A sample query in TRECVID 2009
[Diagram] ConceptNet neighborhood of "Tailgating": car, food, helmet, team uniform, portable shelter, parking lot; expanding further pulls in noisier nodes such as driver, engine, and bus (via relations like "desires").
[Diagram] Ontology neighborhood of the concept "dog" (Dog Show event): cat, horse, mammal, carnivore, animal, kit fox, red wolf; candidate matches drawn from SIN and ImageNet.
– Harness the ontology-based mapping while constraining the expansion to:
– Synonyms
– Strict sub-categories
– (avoids spurious matches such as "hot dog" for "dog")
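A minimal sketch of this constrained mapping with NLTK's WordNet interface, assuming the intended sense of the query word is already known (sense disambiguation is out of scope here):

```python
# Constrained ontology mapping: accept only synonyms and direct hyponyms
# (strict sub-categories) of one intended WordNet sense.
from nltk.corpus import wordnet as wn

def constrained_expansion(sense="dog.n.01"):
    synset = wn.synset(sense)
    # synonyms: lemmas of the same synset
    allowed = {l.name().replace("_", " ") for l in synset.lemmas()}
    # strict sub-categories: direct hyponyms only
    for hypo in synset.hyponyms():
        allowed.update(l.name().replace("_", " ") for l in hypo.lemmas())
    return allowed

# "puppy" and "corgi" pass; loose neighbors like "cat" or "mammal" do not.
```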
– Key concepts are missing for some events
– Scale up the number of concepts, thus increasing the coverage (e.g., a large concept bank covering 631 events, Cui et al.)
Yin Cui, Dong Liu, Jiawei Chen, Shih-Fu Chang. Building a Large Concept Bank for Representing Events in Video. arXiv.
Andrea Frome, Greg S. Corrado, Jonathon Shlens, Samy Bengio, Jeffrey Dean, Marc'Aurelio Ranzato, Tomas Mikolov. DeViSE: A Deep Visual-Semantic Embedding Model. NIPS'13.
Amirhossein Habibian, Thomas Mensink, Cees G. M. Snoek. VideoStory: A New Multimedia Embedding for Few-Example Recognition and Translation of Events. MM'14 (best paper).
Method                                      MAP (MED14-Test)
Exact Matching Only                         0.0306
Exact Matching + TF                         0.0420
Exact Matching + TFIDF                      0.0495
Exact Matching + TFIDF + Word Specificity   0.0502
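A sketch of the TF-IDF weighting step, assuming the event-kit text serves as the query document and IDF is estimated over a background collection of texts; the word-specificity term from the table is omitted, and all names are illustrative:

```python
# TF-IDF weights for exactly-matched concepts. A multi-word concept is
# represented by its rarest word in the kit text (a simple heuristic).
import math
import re
from collections import Counter

def tfidf_weights(event_text, matched_concepts, background_texts):
    tokens = re.findall(r"[a-z]+", event_text.lower())
    tf = Counter(tokens)
    n_docs = len(background_texts)
    weights = {}
    for concept in matched_concepts:
        word = min(concept.lower().split(), key=lambda w: tf[w])
        df = sum(1 for doc in background_texts if word in doc.lower())
        idf = math.log((n_docs + 1) / (df + 1)) + 1.0  # smoothed IDF
        weights[concept] = tf[word] * idf
    return weights
```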
Outline
Multimedia Event Detection (MED)
– Background
– System Overview
– Findings
Multimedia Event Recounting (MER)
– Background
– System Workflow
– Results
– Given an event query and a test video clip that contains an instance of the event, produce a recounting summarizing the key evidence for the event in the clip. The recounting states:
– When: intervals of time (or frames) when the event occurred in the clip
– Where: spatial location in the clip (pixel coordinate or bounding polygon)
– What: a clear, concise textual recounting of the observations
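A minimal data structure mirroring this when/where/what specification, with illustrative field types (seconds for time, pixel coordinates for the polygon), not the official MER output format:

```python
# One recounting entry: when, where, and what the evidence is.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class RecountingEntry:
    when: Tuple[float, float]     # start and end time within the clip
    where: List[Tuple[int, int]]  # bounding polygon, pixel coordinates
    what: str                     # concise textual observation

entry = RecountingEntry(
    when=(12.0, 15.5),
    where=[(40, 30), (200, 30), (200, 180), (40, 180)],
    what="Person jacking the car")
```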
Criteria for selecting evidential shots:
– Concept-to-event relevancy
– Evidence diversity: each evidential shot should be different from other shots
– Viewing time of evidential shots
Concept Responses
Apply the concept detectors (TRECVID SIN, Research Collection, HMDB51, UCF101, ImageNet)
Choose the keyframes/snippets that are most relevant to this event by the weighted sum s = Σ_i w_i r_i, where w_i is the weight of concept i in the semantic query (the < Objects > / < Actions > / < Scenes > relevance scores) and r_i is the shot's response to concept i.
The top 3 shots are selected as key evidence; the rest are non-key evidence.
The top concept in each key evidential shot is selected as its representative concept.
* We choose a unique concept for each shot (see the sketch below).
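A sketch of this selection step, combining the weighted-sum scoring above with the top-3 cutoff and the unique-concept constraint; tie-breaking and all names are illustrative:

```python
# Key-evidence selection: score shots by s = sum_i w_i r_i, keep the top
# 3 shots, and assign each a unique representative concept (its highest-
# contributing concept not yet taken by an earlier shot).
import numpy as np

def select_key_evidence(weights, shot_responses, n_key=3):
    """weights: (num_concepts,) semantic-query weights w.
    shot_responses: (num_shots, num_concepts) concept responses r."""
    R = np.asarray(shot_responses)
    scores = R @ weights                     # s = sum_i w_i r_i per shot
    key_shots = np.argsort(-scores)[:n_key]  # top 3 = key evidence
    taken, representative = set(), {}
    for shot in key_shots:
        contribution = weights * R[shot]     # per-concept contribution
        for c in np.argsort(-contribution):
            if int(c) not in taken:          # unique concept per shot
                taken.add(int(c))
                representative[int(shot)] = int(c)
                break
    return key_shots, representative
```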
[Diagram] The representative concepts come from the semantic query (< Objects > / < Actions > / < Scenes >), e.g., riding bike, bike trick, bike; the selected shots are marked Key and the remainder Non-Key.
[Charts] Human evaluation across teams: (a) evidence quality and (b) event query quality, measured by the percentage of "strongly agree" judgments and by the percentage of "agree" plus "strongly agree" judgments. VIREO places first or second under the "strongly agree" measure and second or fourth under the combined measure.
– A good baseline: the simple exact matching already shows competitive performance
– Don't include noisy concepts
– The context of a concept is important in event detection
– In key evidence localization, we emphasize the event relevancy, diversity, and viewing time of the evidential shots
– We recommend three shots as key evidence, each kept short in viewing time