SLIDE 1

End-to-end Learning of Action Detection from Frame Glimpses in Videos

Ezgi Pekşen Soysal

Hacettepe University

BIL722 - Advanced Topics in Computer Vision

SLIDE 2

Olga Russakovsky: The human side of computer vision

Input Output

t = 0 t = T

Running Talking

Task: what is the person doing?

SLIDE 3

Input Output

Task: what is the person doing?

Interpretability Efficiency Accuracy

t = 0 t = T

Running Talking

SLIDE 4

Efficient video processing

t = 0 t = T

SLIDE 5

Efficient video processing

Talking

t = 0 t = T

Running

SLIDE 6

Efficient video processing

Talking

t = 0 t = T

Running

SLIDE 7

Efficient video processing

[N. I. Badler. “Temporal Scene Analysis…” 1975]

“Knowing the output or the final state… there is no need to explicitly store many previous states”

Talking

t = 0 t = T

Running

SLIDE 8

Efficient video processing

Dominant paradigm: sliding windows

t = 0 t = T

… …

Used in all THUMOS challenge action detection entries [OneVerSch 2014] [WanQiaTan 2014] [KarSeiBim 2014] [YuaPeiNiMouKas 2015]

“Knowing the output or the final state… there is no need to explicitly store many previous states”

[N. I. Badler. “Temporal Scene Analysis…” 1975]

Talking

t = 0 t = T

Running
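To make the cost of the sliding-window paradigm concrete, here is a minimal sketch that enumerates multi-scale temporal windows and counts how many frame evaluations they imply. The window sizes, the 50% stride, and the function name are illustrative assumptions, not settings from any THUMOS entry.

```python
def sliding_windows(num_frames, window_sizes, stride_fraction=0.5):
    """Enumerate [start, end) candidate windows at several temporal scales."""
    windows = []
    for size in window_sizes:
        # Assumed stride: half a window, so neighbouring windows overlap.
        stride = max(1, int(size * stride_fraction))
        for start in range(0, max(1, num_frames - size + 1), stride):
            windows.append((start, min(start + size, num_frames)))
    return windows


# Hypothetical 1000-frame video scanned at three window lengths.
windows = sliding_windows(num_frames=1000, window_sizes=[50, 100, 200])
frames_scored = sum(end - start for start, end in windows)
print(len(windows), "windows,", frames_scored, "frame evaluations")
```

Because the windows overlap across positions and scales, the number of frame evaluations exceeds the video length several times over, which is the inefficiency the glimpse-based model is meant to avoid.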

SLIDE 9

Efficient video processing

“Knowing the output or the final state… there is no need to explicitly store many previous states”

“Time may be represented in several ways… The intervals between ‘pulses’ need not be equal.”

[N. I. Badler. “Temporal Scene Analysis…” 1975]

Talking

t = 0 t = T

Running

SLIDE 10

t = 0 t = T

[YeuRusMorFei CVPR’16]

Our model for efficient action detection

Frame model
Input: A frame

Output

SLIDE 11

t = 0 t = T

Our model for efficient action detection

Frame model
Input: A frame
Output: Detection instance [start, end]; next frame to glimpse

[ ]

Output

[YeuRusMorFei CVPR’16]

SLIDE 12

t = 0 t = T

Output: Detection instance [start, end]; next frame to glimpse

Our model for efficient action detection

Frame model

[ ] [ ]

Output Output

[YeuRusMorFei CVPR’16]

SLIDE 13

t = 0 t = T Output

Output: Detection instance [start, end]; next frame to glimpse

Our model for efficient action detection

Frame model

[ ] [ ] [ ] …

Output Output

[YeuRusMorFei CVPR’16]

SLIDE 14

t = 0 t = T Output

Our model for efficient action detection

[ ]

Convolutional neural network (frame information)
Recurrent neural network (time information)

[ ] [ ]

Output: Detection instance [start, end]; next frame to glimpse

…

Output Output

[YeuRusMorFei CVPR’16]

SLIDE 15

t = 0 t = T Output

Optional output: Detection instance [start, end]
Output: Next frame to glimpse

Our model for efficient action detection

…

Convolutional neural network (frame information)
Recurrent neural network (time information)

Output Output

[ ]

[YeuRusMorFei CVPR’16]
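The observe-then-jump loop described on these slides can be sketched as follows. This is a minimal illustration, not the authors' implementation: `run_glimpse_agent` and `toy_step` are hypothetical names, the CNN + RNN is replaced by a hand-written stand-in, and positions are assumed to be normalized to [0, 1].

```python
def run_glimpse_agent(frames, step_fn, max_glimpses=6):
    """Observe a few frames, letting step_fn pick each next position.

    step_fn(frame, state) must return (state, detection, emit, next_position),
    where next_position is a normalized location in [0, 1].
    """
    state = None
    position = 0.0  # assumption: the first glimpse is the first frame
    visited, detections = [], []
    for _ in range(max_glimpses):
        position = min(1.0, max(0.0, position))
        index = int(position * (len(frames) - 1))
        visited.append(index)
        state, detection, emit, position = step_fn(frames[index], state)
        if emit:  # the detection output is optional at each step
            detections.append(detection)
    return detections, visited


def toy_step(frame, state):
    """Hypothetical stand-in for the CNN + RNN: hops forward by a quarter."""
    glimpses_so_far = (state or 0) + 1
    detection = (0.33, 0.66)  # made-up normalized [start, end]
    emit = frame == "action"
    next_position = glimpses_so_far * 0.25
    return glimpses_so_far, detection, emit, next_position


frames = ["background"] * 10 + ["action"] * 10 + ["background"] * 10
detections, visited = run_glimpse_agent(frames, toy_step, max_glimpses=5)
print("glimpsed frames:", visited, "detections:", detections)
```

In this sketch only 5 of 30 frames are ever read. A learned step function could also jump backward, since the next position is unconstrained within the video.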

SLIDE 16

Training the detection instance output

[Figure: training data, a positive video and a negative video, each on a timeline from t = 0 to t = T]

[YeuRusMorFei CVPR’16]

SLIDE 17

Training the detection instance output

[Figure: training data, a positive video and a negative video, each on a timeline from t = 0 to t = T]

Aside:
Effective video annotation [YeuRusJinAndMorFei UnderReview] [LiuRusDenBerFei ImageNetChallenge ’15]
Weakly supervised detection [YeuRamRusMorFei InPreparation]

[YeuRusMorFei CVPR’16]

SLIDE 18

Training the detection instance output

[Figure: training data (positive video, negative video) and detections on timelines from t = 0 to t = T, with ground-truth instances g1, g2 and detections d1, d2, d3, d4]

[YeuRusMorFei CVPR’16]

SLIDE 19

Training the detection instance output

[Figure: training data (positive video, negative video) and detections on timelines from t = 0 to t = T, with ground-truth instances g1, g2 and detections d1, d2, d3, d4]

Reward for detection

[YeuRusMorFei CVPR’16]

SLIDE 20

Training the detection instance output

[Figure: training data (positive video, negative video) and detections on timelines from t = 0 to t = T, with ground-truth instances g1, g2 and detections d1, d2, d3, d4]

Reward for detection
y1 = 1, y2 = 1, y3 = 2, y4 = 0
Cross-entropy classification loss

[YeuRusMorFei CVPR’16]

SLIDE 21

Training the detection instance output

[Figure: training data (positive video, negative video) and detections on timelines from t = 0 to t = T, with ground-truth instances g1, g2 and detections d1, d2, d3, d4]

Reward for detection
y1 = 1, y2 = 1, y3 = 2, y4 = 0
Cross-entropy classification loss
L2 distance localization loss

[YeuRusMorFei CVPR’16]
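One way to read the labels y1..y4 is as the index of the ground-truth instance each detection is matched to, with 0 meaning no match. The sketch below follows that reading under loud assumptions: matching is done by nearest [start, end] pair, the classification target is 1 for matched detections and 0 otherwise, and `loc_weight` is a made-up balancing factor. The paper's actual matching rule may differ.

```python
import math


def match_detections(detections, ground_truths):
    """Return y_n for each detection: 1-based index of the nearest ground truth, or 0."""
    labels = []
    for det in detections:
        if not ground_truths:  # negative video: nothing to match
            labels.append(0)
            continue
        distances = [math.dist(det, gt) for gt in ground_truths]
        labels.append(distances.index(min(distances)) + 1)
    return labels


def detection_loss(detections, confidences, ground_truths, loc_weight=1.0):
    """Cross-entropy on confidences plus L2 distance on matched [start, end] pairs."""
    labels = match_detections(detections, ground_truths)
    total = 0.0
    for det, conf, label in zip(detections, confidences, labels):
        target = 1.0 if label > 0 else 0.0
        conf = min(max(conf, 1e-7), 1.0 - 1e-7)  # avoid log(0)
        total -= target * math.log(conf) + (1.0 - target) * math.log(1.0 - conf)
        if label > 0:
            total += loc_weight * math.dist(det, ground_truths[label - 1])
    return total, labels


# Hypothetical normalized intervals for a positive video with two instances.
ground_truths = [(0.10, 0.30), (0.60, 0.90)]
detections = [(0.10, 0.25), (0.15, 0.30), (0.55, 0.90)]
loss, labels = detection_loss(detections, [0.9, 0.8, 0.7], ground_truths)
print(labels, round(loss, 3))
```

With these made-up intervals the labels come out as 1, 1, 2, mirroring y1, y2, y3 on the slide. A detection in a video with no ground truth gets label 0 and contributes only the classification term.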

SLIDE 22

Training the non-differentiable outputs

[Figure: training data and detections on timelines from t = 0 to t = T]

[YeuRusMorFei CVPR’16]

SLIDE 23

Training the non-differentiable outputs

[Figure: training data and detections d1, d2, d3 on timelines from t = 0 to t = T]

Model’s action sequence a:
Frame 1: go to frame 8
Frame 8: go to frame 6
Frame 6: go to frame 15
Frame 15

Actions:
(1) whether to predict a detection
(2) where to look next

[YeuRusMorFei CVPR’16]

SLIDE 24

Training the non-differentiable outputs

[Figure: training data and detections d1, d2, d3 on timelines from t = 0 to t = T]

Model’s action sequence a:
Frame 1: go to frame 8
Frame 8: go to frame 6
Frame 6: go to frame 15
Frame 15

Actions:
(1) whether to predict a detection
(2) where to look next

Train a policy for actions (1) and (2) using REINFORCE [Williams 1992]

[YeuRusMorFei CVPR’16]

SLIDE 25

Training the non-differentiable outputs

[Figure: training data and detections d1, d2, d3 on timelines from t = 0 to t = T]

Model’s action sequence a:
Frame 1: go to frame 8
Frame 8: go to frame 6
Frame 6: go to frame 15
Frame 15

Actions:
(1) whether to predict a detection
(2) where to look next

Reward for an action sequence (detections marked bad, bad, good):

Train a policy for actions (1) and (2) using REINFORCE [Williams 1992]

[YeuRusMorFei CVPR’16]

SLIDE 26

Training the non-differentiable outputs

[Figure: training data and detections d1, d2, d3 on timelines from t = 0 to t = T]

Model’s action sequence a:
Frame 1: go to frame 8
Frame 8: go to frame 6
Frame 6: go to frame 15
Frame 15

Actions:
(1) whether to predict a detection
(2) where to look next

Reward for an action sequence (detections marked bad, bad, good):

Objective:
Gradient:
Monte-Carlo approximation:

Train a policy for actions (1) and (2) using REINFORCE [Williams 1992]

[YeuRusMorFei CVPR’16]
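The equations for the objective, gradient, and Monte-Carlo approximation did not survive in the slide text. For orientation, the standard REINFORCE formulation from [Williams 1992] is written out below; the symbols (θ for policy parameters, p_θ for the distribution over action sequences a, r for the sequence reward, K for the number of sampled sequences) are assumed notation and may not match the original slide.

```latex
% Objective: expected reward over action sequences a drawn from the policy
J(\theta) = \mathbb{E}_{a \sim p_\theta(a)}\left[ r(a) \right]
          = \sum_{a} p_\theta(a)\, r(a)

% Gradient: log-derivative (score function) identity
\nabla_\theta J(\theta) = \sum_{a} p_\theta(a)\, \nabla_\theta \log p_\theta(a)\, r(a)

% Monte-Carlo approximation: average over K sampled sequences a^{(1)}, ..., a^{(K)}
\nabla_\theta J(\theta) \approx \frac{1}{K} \sum_{k=1}^{K}
    \nabla_\theta \log p_\theta\big(a^{(k)}\big)\, r\big(a^{(k)}\big)
```

The approximation only needs sampled action sequences and their rewards, which is why it applies to the non-differentiable choices of whether to emit a detection and where to look next.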

SLIDE 27

Accuracy Interpretability

[YeuRusMorFei CVPR’16]

Efficiency

SLIDE 28

[YeuRusMorFei CVPR’16]

Accuracy ✓

Detection AP at IOU 0.5:

Dataset              State-of-the-art   Our result
THUMOS 2014          14.4               17.1
ActivityNet sports   33.2               36.7
ActivityNet work     31.1               39.9

Interpretability

Efficiency
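For readers unfamiliar with the metric, "IOU 0.5" refers to temporal intersection-over-union between a predicted interval and a ground-truth interval. A minimal sketch follows; the function names are illustrative, and treating a detection as correct when IoU is at least 0.5 is the usual convention, assumed here rather than taken from the slides.

```python
def temporal_iou(pred, gt):
    """Intersection-over-union of two [start, end] intervals on the time axis."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0


def is_correct(pred, gt, threshold=0.5):
    """Assumed convention: a detection counts if its IoU reaches the threshold."""
    return temporal_iou(pred, gt) >= threshold


# Hypothetical intervals in seconds.
print(temporal_iou((10.0, 20.0), (15.0, 25.0)))
print(is_correct((10.0, 20.0), (12.0, 20.0)))
```

Average precision is then computed over detections ranked by confidence, counting each one as a true or false positive according to this test.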

SLIDE 29

[YeuRusMorFei CVPR’16]

Accuracy ✓

Detection AP at IOU 0.5:

Dataset              State-of-the-art   Our result
THUMOS 2014          14.4               17.1
ActivityNet sports   33.2               36.7
ActivityNet work     31.1               39.9

Interpretability

Efficiency ✓

Glimpse only 2% of video frames

SLIDE 30

[YeuRusMorFei CVPR’16]

Accuracy ✓

Detection AP at IOU 0.5:

Dataset              State-of-the-art   Our result
THUMOS 2014          14.4               17.1
ActivityNet sports   33.2               36.7
ActivityNet work     31.1               39.9

Interpretability

Efficiency ✓

Glimpse only 2% of video frames

Sampling       Detection AP at IOU 0.5
Uniform        9.3
Our glimpses   17.1
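The "Uniform" baseline in the sampling comparison can be pictured as reading the same small share of frames at evenly spaced positions instead of learned ones. The sketch below shows one plausible way to pick such frames; the function name and the choice of bin centres are assumptions, since the slides do not describe how the baseline was built.

```python
def uniform_sample_indices(num_frames, fraction=0.02):
    """Pick evenly spaced frame indices covering the given fraction of the video."""
    count = max(1, round(num_frames * fraction))
    step = num_frames / count
    # Take the centre of each equal-length bin.
    return [int(step * i + step / 2) for i in range(count)]


# Hypothetical 1000-frame video sampled at 2%.
indices = uniform_sample_indices(1000, fraction=0.02)
print(len(indices), indices[:3], indices[-1])
```

Both strategies look at the same number of frames. The difference reported on the slide comes from where those frames are, which is what the learned glimpse policy controls.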

SLIDE 31

[YeuRusMorFei CVPR’16]

Accuracy ✓

Detection AP at IOU 0.5:

Dataset              State-of-the-art   Our result
THUMOS 2014          14.4               17.1
ActivityNet sports   33.2               36.7
ActivityNet work     31.1               39.9

Efficiency ✓

Glimpse only 2% of video frames

Sampling       Detection AP at IOU 0.5
Uniform        9.3
Our glimpses   17.1

Interpretability ✓

[Figure: two "Javelin throw" examples showing ground truth, detections, glimpses, and frames]

SLIDE 32

…

Algorithm Test data

Car, Person Building

AI vision system

Training data

cars dogs people

Reinforcement learning for human action detection

[YeuRusMorFei CVPR’16]

Open-world probabilistic human-in-the-loop model

[RusLiFei CVPR’15]

The human cost of developing computer vision expertise

[RusDenDuKraSatEtal IJCV’15] [DenRusKraBerBerFei CHI’14]

SLIDE 33

Takeaways

Data is critical
Beyond classification
Resource allocation
Broader context

[Figure: Table, Chair, Bowl, Dog, Cat, …]


SLIDE 34

What next?

SLIDE 35

Open-world recognition

Challenging objects
Long-tail distributions
Designing open-world evaluation frameworks

SLIDE 36

Large-scale video analysis

Formulating the right temporal questions
Multi-task models

Playing tennis

Visual fluents

SLIDE 37

Collaborative visual systems

Understanding human intention

Knowledge acquisition

Effective teaching

SLIDE 38

AI will change the world. Who will change AI?

SLIDE 39

Stanford Artificial Intelligence Laboratory’s Outreach Summer (SAILORS) program

24 high school girls, 2 weeks
Rigorous AI curriculum emphasizing humanistic applications
http://sailors.stanford.edu

AI will change the world. Who will change AI?

[VacWuChaRusSomFei SIGCSE’16]

SLIDE 40

Acknowledgements

Jia Deng, Alexander Berg, Fei-Fei Li, Sean Ma, Jonathan Krause, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Jia Li, Yuanqing Lin, Ellen Klingbeil, Vittorio Ferrari, Amy Bearman, Blake Carpenter, Davide Modolo, Jenny Jin, Misha Andriluka, Hao Su, Sanjeev Satheesh, Andrew Ng, Kai Yu, Greg Mori, Serena Yeung, Sasha Vezhnevets, Deva Ramanan, Abhinav Gupta, Vignesh Ramanathan

SLIDE 41

…

Algorithm Test data

Car, Person Building

AI vision system

Training data

cars dogs people

http://cs.cmu.edu/~orussako
olgarus@cmu.edu

Questions?

[RusDenHuaBerFei ICCV’13] [KliCarRusNg ICRA’10]
Strongly supervised: [RusNg CVPR’10] [RusFei ECCVW’10] [RusGupRam UnderReview] [ModVezRusFei CVPR’15] [YeuRusJinAndMorFei UnderReview]
Weakly supervised: [RusLinYuFei ECCV’12] [BeaRusFerFei UnderReview]

[RusLiFei CVPR’15] [YeuRusMorFei CVPR’16] [RusDenSuKraSatEtal IJCV’15] [DenRusKraBerBerFei CHI’14]