End-to-end Learning of Action Detection from Frame Glimpse in Videos
Ezgi Pekşen Soysal
Hacettepe University
BIL722 - Advanced Topics in Computer Vision
Olga Russakovsky: The human side of computer vision
Task: what is the person doing?
[Figure: input video timeline from t = 0 to t = T; output action labels Running, Talking]
Interpretability, efficiency, accuracy
Dominant paradigm: sliding windows
Used in all THUMOS challenge action detection entries [OneVerSch 2014] [WanQiaTan 2014] [KarSeiBim 2014] [YuaPeiNiMouKas 2015]
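The sliding-window paradigm can be sketched as follows. This is a minimal illustration only, not the implementation of any THUMOS entry; the function name, window lengths and stride ratio are assumptions chosen for the example.

```python
def sliding_windows(num_frames, window_lengths=(16, 32, 64), stride_ratio=0.5):
    """Enumerate candidate [start, end) frame intervals at several temporal scales."""
    windows = []
    for length in window_lengths:
        if length > num_frames:
            continue  # this scale does not fit in the video
        stride = max(1, int(length * stride_ratio))
        for start in range(0, num_frames - length + 1, stride):
            windows.append((start, start + length))
    return windows


# Hypothetical 128-frame video; each candidate would then be scored by a classifier.
candidates = sliding_windows(num_frames=128)
```

Every candidate window has to be scored, so the work grows with video length and with the number of scales.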
“Knowing the output or the final state… there is no need to explicitly store many previous states”
“Time may be represented in several ways… The intervals between ‘pulses’ need not be equal.”
[N. I. Badler. “Temporal Scene Analysis…” 1975]
[YeuRusMorFei CVPR’16]
Frame model
Input: a frame
Output: detection instance [start, end]; next frame to glimpse
Convolutional neural network (frame information)
Recurrent neural network (time information)
Optional output: detection instance [start, end]
Output: next frame to glimpse
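The glimpse loop described above might look roughly like this. It is a sketch under stated assumptions: `run_glimpse_agent` and `toy_step` are hypothetical names, `toy_step` makes random decisions in place of a trained CNN + RNN, and positions are assumed to be normalized to [0, 1].

```python
import random


def run_glimpse_agent(video, step_fn, num_glimpses=6, start_frame=0):
    """Observe a few frames; at each step optionally emit a detection and pick the next frame."""
    detections, visited = [], []
    state, frame_idx = None, start_frame
    for _ in range(num_glimpses):
        visited.append(frame_idx)
        # step_fn stands in for the CNN (frame information) + RNN (time information).
        state, emit, segment, next_pos = step_fn(video[frame_idx], state)
        if emit:  # the detection output is optional at each step
            detections.append(segment)
        # Assumption: next_pos is a normalized location; map it to a frame index.
        frame_idx = min(len(video) - 1, max(0, int(next_pos * len(video))))
    return detections, visited


def toy_step(frame, state):
    """Hypothetical stand-in for a trained model: random decisions, seeded for repeatability."""
    rng = state if state is not None else random.Random(0)
    start = rng.random()
    end = min(1.0, start + 0.1)
    emit = rng.random() < 0.5
    next_pos = rng.random()
    return rng, emit, (start, end), next_pos


video = list(range(300))  # placeholder "frames"
detections, visited = run_glimpse_agent(video, toy_step)
```

Only `num_glimpses` frames are ever read, and the agent is free to jump forward or backward in time.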
Training data
Positive video, negative video
Aside: [YeuRusJinAndMorFei UnderReview] [LiuRusDenBerFei ImageNetChallenge ’15] [YeuRamRusMorFei InPreparation]
Detections: d1, d2, d3, d4
Ground truth: g1, g2
Reward for detection
Cross-entropy classification loss, with labels y1 = 1, y2 = 1, y3 = 2, y4 = 0
L2 distance localization loss
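One way to combine the two loss terms is sketched below. It assumes that each label y is the 1-based index of the matched ground-truth instance, with 0 meaning no match, and that the classification target is simply "matched or not". The function name, the `loc_weight` parameter and all numeric values are hypothetical.

```python
import math


def detection_loss(detections, ground_truth, loc_weight=1.0):
    """Sum a cross-entropy term on confidence and an L2 localization term.

    detections: list of (confidence, start, end, y), where y is the 1-based
    index of the matched ground-truth instance, or 0 for no match.
    ground_truth: list of (start, end) intervals.
    """
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for conf, start, end, y in detections:
        target = 1.0 if y > 0 else 0.0
        # Cross-entropy classification loss on the detection confidence.
        total += -(target * math.log(conf + eps)
                   + (1.0 - target) * math.log(1.0 - conf + eps))
        if y > 0:
            # L2 distance between predicted and matched ground-truth boundaries.
            g_start, g_end = ground_truth[y - 1]
            total += loc_weight * math.sqrt((start - g_start) ** 2 + (end - g_end) ** 2)
    return total


# Hypothetical values laid out like the slide: y1 = 1, y2 = 1, y3 = 2, y4 = 0.
ground_truth = [(0.20, 0.40), (0.60, 0.90)]
detections = [
    (0.9, 0.22, 0.41, 1),
    (0.6, 0.15, 0.35, 1),
    (0.8, 0.58, 0.88, 2),
    (0.3, 0.95, 0.99, 0),
]
loss = detection_loss(detections, ground_truth)
```

Unmatched detections contribute only the classification term, so the localization term never pulls them toward a ground-truth instance.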
Model’s action sequence a
Glimpses: Frame 1, go to frame 8, go to frame 6, go to frame 15
Detections: d1, d2, d3
Each action decides (1) whether to predict a detection and (2) where to look next
Train a policy for actions (1) and (2) using REINFORCE [Williams 1992]
Reward for an action sequence (example labels on the slide: bad, bad, good)
Objective, gradient, Monte-Carlo approximation: [equations not recoverable from the slide text]
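For reference, the standard REINFORCE derivation [Williams 1992] has an objective, a gradient and a Monte-Carlo approximation of the following form. The notation is assumed, not taken from the slide: θ are the policy parameters, a is an action sequence, p_θ(a) is its probability under the policy, r(a) is its reward, and K is the number of sampled sequences.

```latex
\begin{align}
J(\theta) &= \sum_{a} p_\theta(a)\, r(a) \\
\nabla_\theta J(\theta) &= \sum_{a} p_\theta(a)\, \nabla_\theta \log p_\theta(a)\, r(a) \\
\nabla_\theta J(\theta) &\approx \frac{1}{K} \sum_{k=1}^{K} \nabla_\theta \log p_\theta\!\left(a^{k}\right) r\!\left(a^{k}\right),
\qquad a^{k} \sim p_\theta
\end{align}
```

The second line follows from the identity ∇p = p ∇log p. This is what lets the gradient be estimated from sampled action sequences even though the choice of which frame to glimpse is not differentiable.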
[YeuRusMorFei CVPR’16]
Accuracy
Detection AP at IOU 0.5:
Dataset              State-of-the-art   Our result
THUMOS 2014          14.4               17.1
ActivityNet sports   33.2               36.7
ActivityNet work     31.1               39.9
Efficiency
Glimpse only 2% of video frames
Sampling             Detection AP at IOU 0.5
Uniform              9.3
Our glimpses         17.1
Interpretability
[Figure: two Javelin throw examples showing ground truth, detections, glimpses and frames]
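The "IOU 0.5" criterion in the tables can be illustrated with a temporal intersection-over-union check. This is a minimal sketch: the function names are hypothetical, and the full AP computation (ranking detections by confidence and handling duplicates) is omitted.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two [start, end] intervals on the time axis."""
    (p_start, p_end), (g_start, g_end) = pred, gt
    inter = max(0.0, min(p_end, g_end) - max(p_start, g_start))
    union = (p_end - p_start) + (g_end - g_start) - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred, gt, threshold=0.5):
    """A detection counts as correct when its temporal IOU reaches the threshold."""
    return temporal_iou(pred, gt) >= threshold
```

For example, a detection covering [2, 10] against ground truth [0, 10] has IOU 0.8 and counts as correct, while [0, 10] against [5, 15] has IOU 1/3 and does not.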
[Figure: AI vision system: training data (cars, dogs, people), algorithm, test data, output labels (Car, Person, Building)]
Reinforcement learning for human action detection
[YeuRusMorFei CVPR’16]
Open-world probabilistic human-in-the-loop model
[RusLiFei CVPR’15]
The human cost of developing computer vision expertise
[RusDenSuKraSatEtal IJCV’15] [DenRusKraBerBerFei CHI’14]
Data is critical
Beyond classification
Resource allocation
Broader context
[Figure: object labels Table, Chair, Bowl, Dog, Cat, …]
Challenging objects
Long-tail distributions
Designing open-world evaluation frameworks
Formulating the right temporal questions
Multi-task models
Playing tennis
Visual fluents
Understanding human intention
Knowledge acquisition
Effective teaching
Stanford Artificial Intelligence Laboratory’s Outreach Summer (SAILORS) program
24 high school girls, 2 weeks
Rigorous AI curriculum emphasizing humanistic applications
http://sailors.stanford.edu
[VacWuChaRusSomFei SIGCSE’16]
Jia Deng, Alexander Berg, Fei-Fei Li, Sean Ma, Jonathan Krause, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Jia Li, Yuanqing Lin, Ellen Klingbeil, Vittorio Ferrari, Amy Bearman, Blake Carpenter, Davide Modolo, Jenny Jin, Misha Andriluka, Hao Su, Sanjeev Satheesh, Andrew Ng, Kai Yu, Greg Mori, Serena Yeung, Sasha Vezhnevets, Deva Ramanan, Abhinav Gupta, Vignesh Ramanathan
http://cs.cmu.edu/~orussako olgarus@cmu.edu
[RusDenHuaBerFei ICCV’13] [KliCarRusNg ICRA’10]
Strongly supervised: [RusNg CVPR’10] [RusFei ECCVW’10] [RusGupRam UnderReview] [ModVezRusFei CVPR’15] [YeuRusJinAndMorFei UnderReview]
Weakly supervised: [RusLinYuFei ECCV’12] [BeaRusFerFei UnderReview]
[RusLiFei CVPR’15] [YeuRusMorFei CVPR’16] [RusDenSuKraSatEtal IJCV’15] [DenRusKraBerBerFei CHI’14]