Actions in the Eye: Dynamic Gaze Datasets and Learnt Saliency Models for Visual Recognition
Stefan Mathe, Cristian Sminchisescu
Presented by Mit Shah
Motivation: Current Computer Vision
○ Annotations are subjectively defined
○ What happens at intermediate levels of computation?
The human visual system
A person browsing reddit with the F-shaped pattern
○ Inter-observer consistency
○ Bottom-up features
○ Human fixations
○ Models of saliency
○ Uses of saliency maps: action recognition, scene classification, object localization
○ Previous datasets
At most a few hundred videos were recorded under free-viewing conditions
❏ Extended existing large-scale datasets Hollywood-2 and UCF Sports
❏ Dynamic consistency and alignment measures: AOI Markov dynamics and temporal AOI alignment
❏ Trained an end-to-end automatic visual action recognition system
Hollywood-2 Movie Dataset
○ 12 action classes, 69 movies
○ 823/884 train/test split, 487k frames, 20 hours of video
○ Largest and most challenging action dataset
○ Classes include answering phone, driving a car, eating, fighting, etc.
UCF Sports Action Dataset
○ Collected from television channel broadcasts
○ 150 videos covering 9 sports action classes
○ Classes include diving, golf swinging, kicking, etc.
Extending the two data sets
○ 19 human subjects
○ Tasks: action recognition, context recognition, and free viewing
○ Recording environment: SMI iView X HiSpeed 1250 tower-mounted eye tracker
○ Recording protocol: subjects divided into tasks, with many other specifications (timings/durations and breaks)
Action Recognition by Humans
○ Co-occurring actions
○ False positives
○ Mislabeled videos
Are fixations consistent across subjects on a per-frame basis?
○ Repeat nA times: derive saliency maps from SA \ {s} and predict fixations of subject s → nA prediction scores
○ Derive saliency maps from SA and evaluate the average prediction score for each s' in SB → nB prediction scores
○ Compare the two score sets with an independent 2-sample t-test with unequal variances; hypothesis: p-value >= 0.5?
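The group-comparison step above can be sketched as follows; `consistency_test`, the toy score arrays, and their values are illustrative stand-ins (not the authors' code), with only the Welch t-test and the slide's 0.5 cutoff taken from the text:

```python
import numpy as np
from scipy import stats

def consistency_test(scores_a, scores_b, alpha=0.5):
    """Compare two sets of fixation-prediction scores with an
    independent two-sample t-test with unequal variances (Welch's test).
    scores_a: nA leave-one-out scores within group SA.
    scores_b: nB scores for subjects in SB predicted from SA's maps.
    Returns (p_value, consistent) using the slide's 'p-value >= 0.5' criterion."""
    t_stat, p_value = stats.ttest_ind(scores_a, scores_b, equal_var=False)
    return p_value, p_value >= alpha

# Hypothetical AUC-like prediction scores (not real data from the paper).
a = np.array([0.91, 0.88, 0.93, 0.90, 0.89])
b = np.array([0.90, 0.92, 0.87, 0.91])
p, consistent = consistency_test(a, b)
```

A high p-value here means the two groups' prediction scores are statistically indistinguishable, i.e. one group's fixations predict the other's about as well as they predict their own.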
Results
○ AOI Markov dynamics
○ Temporal AOI alignment
○ Start K-means with 1 cluster
○ Successively increase the number of clusters until the sum of squared errors drops below a threshold
○ Link centroids from successive frames into tracks
○ Each resulting track becomes an AOI
○ Each fixation is assigned to the closest AOI at the time of its creation
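The per-frame clustering step above can be sketched as follows; `adaptive_kmeans`, the SSE threshold value, and the synthetic fixation points are hypothetical, and the cross-frame track-linking step is omitted:

```python
import numpy as np
from sklearn.cluster import KMeans

def adaptive_kmeans(points, sse_threshold, max_k=10, seed=0):
    """Cluster one frame's fixation points: start with k=1 clusters and
    increase k until the sum of squared errors (inertia) drops below
    the threshold. Returns (centroids, labels)."""
    for k in range(1, max_k + 1):
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(points)
        if km.inertia_ <= sse_threshold:
            break
    return km.cluster_centers_, km.labels_

# Hypothetical fixations drawn around two well-separated image regions.
rng = np.random.default_rng(0)
pts = np.vstack([rng.normal((100, 100), 5, (20, 2)),
                 rng.normal((300, 200), 5, (20, 2))])
centers, labels = adaptive_kmeans(pts, sse_threshold=5000)
```

In the full pipeline, centroids found in successive frames would then be linked into tracks, each track becoming one AOI.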
AOI Markov dynamics: given a fixation at AOI "a" at time t-1, model the probability of transitioning to AOI "b" at time t; each subject's scanpath becomes a human fixation string fi
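Estimating those transition probabilities from fixation strings can be sketched as follows; `transition_matrix` and the toy strings are hypothetical illustrations, not the paper's code:

```python
import numpy as np

def transition_matrix(fixation_strings, n_aois):
    """Estimate AOI Markov dynamics P(AOI b at time t | AOI a at time t-1)
    by counting consecutive-fixation transitions over all subjects'
    fixation strings (each string is a sequence of AOI ids)."""
    counts = np.zeros((n_aois, n_aois))
    for s in fixation_strings:
        for a, b in zip(s[:-1], s[1:]):
            counts[a, b] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1  # avoid dividing by zero for absorbing AOIs
    return counts / row_sums

# Hypothetical fixation strings f_i over 3 AOIs (ids 0..2).
strings = [[0, 0, 1, 2], [0, 1, 1, 2]]
P = transition_matrix(strings, n_aois=3)
```

Each row of `P` is the conditional distribution over next AOIs given the current one, so non-empty rows sum to 1.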
Recognition pipeline: interest point operator → descriptor → visual dictionary → classifiers
Interest point operator
○ Input: a video
○ Output: a set of spatio-temporal coordinates
○ Space-time generalization
Descriptor: MBH (motion boundary histograms)
○ Visual dictionary: cluster descriptors into 4000 visual words using K-means
○ Classifiers: RBF-χ² kernel within a Multiple Kernel Learning (MKL) framework
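The dictionary and kernel stages can be sketched as below, using a toy 8-word vocabulary instead of 4000 and random descriptors standing in for MBH features; `bow_histogram` is a hypothetical helper, and the MKL combination stage is omitted:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import chi2_kernel

# Hypothetical local descriptors pooled from training videos.
rng = np.random.default_rng(0)
train_descriptors = rng.random((500, 16))

# Visual dictionary: K-means over descriptors (8 words here for speed).
vocab = KMeans(n_clusters=8, n_init=10, random_state=0).fit(train_descriptors)

def bow_histogram(descriptors, vocab):
    """Quantize one video's descriptors against the visual dictionary
    and return an L1-normalized bag-of-words histogram."""
    words = vocab.predict(descriptors)
    hist = np.bincount(words, minlength=vocab.n_clusters).astype(float)
    return hist / hist.sum()

# Two hypothetical videos with 60 and 80 local descriptors each.
h1 = bow_histogram(rng.random((60, 16)), vocab)
h2 = bow_histogram(rng.random((80, 16)), vocab)

# Chi-square kernel between video histograms, as fed to an SVM/MKL stage.
K = chi2_kernel(np.vstack([h1, h2]))
```

The resulting kernel matrix `K` (one entry per video pair) is what a kernel classifier, or one channel of an MKL combination, would consume.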
Human vs. Computer Vision Operators
○ Low correlation between human fixations and computer vision interest point operators
○ Why?
○ Saliency maps encode only the weak surface structure of fixations (no temporal information)
○ Static features and motion features
○ Evaluation metrics: AUC & spatial KL divergence
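The two evaluation metrics can be sketched as follows; `saliency_auc` and `spatial_kl` are hypothetical helpers, the toy map is synthetic, and the exact KL variant used in the paper may differ from this common formulation:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def saliency_auc(saliency_map, fixation_mask):
    """Score a saliency map by how well its values separate fixated
    pixels (positives) from non-fixated pixels, via ROC AUC."""
    return roc_auc_score(fixation_mask.ravel().astype(int),
                         saliency_map.ravel())

def spatial_kl(p_map, q_map, eps=1e-8):
    """KL divergence between two saliency maps treated as spatial
    probability distributions (smoothed and L1-normalized)."""
    p = p_map.ravel() + eps
    q = q_map.ravel() + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

# Toy 4x4 example: the saliency map peaks exactly at the fixated pixels.
sal = np.zeros((4, 4))
sal[1, 1] = sal[2, 2] = 1.0
fix = sal > 0
```

A map that ranks every fixated pixel above every non-fixated one scores AUC = 1.0, and a map compared against itself has zero KL divergence.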