
Mar 27, 2024

Video Concept Detection by Deep Nets with FLAIR. Cees Snoek, Koen van de Sande, Daniel Fontijne. Qualcomm Technologies Netherlands B.V.; University of Amsterdam.

SLIDE 1

Video Concept Detection by Deep Nets with FLAIR

Cees Snoek, Koen van de Sande, Daniel Fontijne

Qualcomm Technologies Netherlands B.V.
University of Amsterdam, The Netherlands

Presented by Thomas Mensink, UvA

SLIDE 2

Summary of our efforts

Last year: Deep CNN for video concept detection and localization
This year: Tangential improvements for concept detection
Our main innovation is in concept localization

SLIDE 3

TASK I: DETECTING CONCEPTS

SLIDE 4

Conclusion from TRECVID 2013

[Chart labels: Video deep net, MediaMill 2012, Frame fusion, Video fusion]

Bag of words and deep net profit from each other

SLIDE 5

MediaMill TRECVID 2014 runs

[Diagram labels: Bag of codes, Net of convolutions, Late fusion by weighted averaging]

Baseline run: 8x CNN
Run Fusion 1: Fuse-all
Run Fusion 2: 8x CNN + BoW
Run Fusion 3: Best CNN + BoW
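The late fusion by weighted averaging named on this slide can be sketched as follows. This is a minimal illustration, not the MediaMill implementation: the function name `late_fusion`, the example scores, and the weights are all hypothetical, and the slides do not say how the actual weights were chosen.

```python
def late_fusion(scores_per_model, weights):
    """Weighted average of per-model concept scores, item by item.

    scores_per_model: one list of scores per model, all of equal length.
    weights: one non-negative weight per model (hypothetical values here).
    """
    if len(scores_per_model) != len(weights):
        raise ValueError("one weight per model is required")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    n_items = len(scores_per_model[0])
    fused = []
    for i in range(n_items):
        weighted = sum(w * s[i] for w, s in zip(weights, scores_per_model))
        fused.append(weighted / total)
    return fused


# Hypothetical scores for three video shots from two models.
bow_scores = [0.2, 0.8, 0.5]   # bag-of-words classifier (assumed values)
cnn_scores = [0.4, 0.6, 0.9]   # deep net (assumed values)
fused = late_fusion([bow_scores, cnn_scores], weights=[1.0, 3.0])
```

Dividing by the weight total keeps the fused scores on the same scale as the inputs, so runs with different numbers of models remain comparable.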

SLIDE 6

MediaMill: Color difference coding

Densely sampled points
SIFT, C-SIFT and T-SIFT descriptors
PCA reduction to 80D
Fisher vector coding with codebook size 256
Spatial pyramid 1x1+1x3
Spatial coordinate coding
Linear classifier

Color Descriptor software available for download at http://colordescriptors.com

SLIDE 7

MediaMill: Video deep learning

Convolutional neural network with 8 layers with weights
Trained using error back propagation

– ImageNet for pre-training

SLIDE 8

Results

[Chart legend: Fusion 3: Best CNN + BoW; Baseline: 8x CNN; Fusion 1 / 2]

Bag of words and deep net profit from each other, better results with more nets

SLIDE 9

Results per concept

SLIDE 10

TASK II: LOCALIZING CONCEPTS

Fisher and VLAD with FLAIR, Koen van de Sande, Cees Snoek, and Arnold Smeulders, CVPR 2014

SLIDE 11

Goal: meaningful localization

Finding where, when, what is happening
Challenges: huge search space, non-rigid deformation

SLIDE 12

Related work

Approaches, each shown for image and for video: Sliding Window, Branch and Bound, Deformable Parts, Boosting Cascade, …

Cited work: [Rowley, 1996], [Viola & Jones, 2001], [Ke, 2005], [Felzenszwalb, 2008], [Rodriguez, 2008], [Lampert, 2009], [Yuan, 2011], [Tian, 2013]

SLIDE 13

Inspiration: Selective Search

[Uijlings, 2013]

High recall with modestly sized object hypotheses set
Feasible to train an expensive classifier

[Figure: iterations of selective search, hierarchical grouping of super-pixels, object proposals]

SLIDE 14

Selective Search

Multiple complementary invariant color spaces
Location hypotheses are class-independent

VOC2007 test: 1,500 windows/image, 98.0% recall
Software available for download at http://koen.me/research/selectivesearch/

SLIDE 15

Local object classification

Repeat for each region: Local Feature Extraction, Feature Encoding, Feature Pooling, Kernel Classification

Spatial Pyramids [Lazebnik, CVPR06] (#regions: 10-100)
Object Detection [Sande, ICCV11] (#regions: 1,000-10,000)

Requires repetitive computations on overlapping regions

SLIDE 16

Features

Use SIFT and ColorSIFT descriptors
Bag-of-words, VLAD, Fisher vector encoding
Encoding 2000 boxes per image is expensive

Bag-of-words: 10s
VLAD: 30s
Fisher: 120s

SLIDE 17

Key idea

Decompose assignment over codebook elements

[Figure labels: Codebook, Point feature, Codeword index, Decomposition]

SLIDE 18

Area-independent decomposition

Fast box evaluation with integral images

[Figure: Decomposition, Integral image, Box feature encoding. Integral image values shown: 0 0 0 1 / 0 1 1 2 / 1 2 2 3 and 1 1 2 2 / 2 2 4 4 / 2 3 5 5. Box feature encodings shown: (2 0 2) and (1 0 1)]
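The decomposition and integral-image lookup described on the last two slides can be sketched for the simplest case, a bag-of-words histogram. This is an illustrative sketch under assumptions, not the authors' code: the grid of codeword indices is made up, and the helper names (`integral_image`, `box_sum`, `decompose`, `box_histogram`) are hypothetical.

```python
def integral_image(grid):
    """Summed-area table of size (H+1) x (W+1) with a zero top row and left column."""
    h, w = len(grid), len(grid[0])
    table = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        for x in range(w):
            table[y + 1][x + 1] = (grid[y][x] + table[y][x + 1]
                                   + table[y + 1][x] - table[y][x])
    return table


def box_sum(table, top, left, bottom, right):
    """Sum over rows top..bottom-1 and columns left..right-1 with four lookups."""
    return (table[bottom][right] - table[top][right]
            - table[bottom][left] + table[top][left])


def decompose(codeword_index, n_codewords):
    """One binary map per codeword: 1 where the point is assigned to that codeword."""
    return [[[1 if idx == k else 0 for idx in row] for row in codeword_index]
            for k in range(n_codewords)]


# Hypothetical 3x4 grid: the codeword index assigned to each sampled point.
codeword_index = [
    [0, 2, 0, 1],
    [2, 1, 0, 2],
    [0, 0, 2, 1],
]
tables = [integral_image(m) for m in decompose(codeword_index, 3)]


def box_histogram(top, left, bottom, right):
    """Bag-of-words histogram of any box; cost does not depend on the box area."""
    return [box_sum(t, top, left, bottom, right) for t in tables]


hist = box_histogram(0, 0, 2, 3)  # rows 0-1, columns 0-2
```

The tables are built once per image; after that each box costs four lookups per codeword, which is what makes evaluating thousands of overlapping boxes affordable.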

SLIDE 19

VLAD with FLAIR

Decomposition as multi-dimensional integral image
Sparsity drops memory from 14GB to 1GB/image
Supports power norm, L2 norm and spatial pyramid

18X speedup

Porikli, CVPR 2005
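A toy version of VLAD over a multi-dimensional integral image might look like the sketch below. It assumes hard assignment to the nearest codeword and a power norm with exponent 0.5; the data, the exponent, and all function names are hypothetical, and the sparse storage that gives the memory saving on the slide is not reproduced.

```python
import math


def summed_area(grid):
    """Summed-area table of size (H+1) x (W+1) with a zero top row and left column."""
    h, w = len(grid), len(grid[0])
    table = [[0.0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        for x in range(w):
            table[y + 1][x + 1] = (grid[y][x] + table[y][x + 1]
                                   + table[y + 1][x] - table[y][x])
    return table


def box_sum(table, top, left, bottom, right):
    """Sum over rows top..bottom-1 and columns left..right-1 with four lookups."""
    return (table[bottom][right] - table[top][right]
            - table[bottom][left] + table[top][left])


def nearest(desc, centers):
    """Index of the codeword closest to the descriptor (squared Euclidean distance)."""
    return min(range(len(centers)),
               key=lambda k: sum((a - b) ** 2 for a, b in zip(desc, centers[k])))


def build_tables(points, centers, height, width):
    """One summed-area table per (codeword, dimension) over the residuals desc - center."""
    dim = len(centers[0])
    grids = [[[[0.0] * width for _ in range(height)] for _ in range(dim)]
             for _ in centers]
    for y, x, desc in points:
        k = nearest(desc, centers)
        for d in range(dim):
            grids[k][d][y][x] += desc[d] - centers[k][d]
    return [[summed_area(g) for g in per_codeword] for per_codeword in grids]


def box_vlad(tables, top, left, bottom, right, alpha=0.5):
    """VLAD of a box: residual sums, then signed power norm, then L2 norm."""
    raw = [box_sum(t, top, left, bottom, right)
           for per_codeword in tables for t in per_codeword]
    powered = [math.copysign(abs(v) ** alpha, v) for v in raw]
    norm = math.sqrt(sum(v * v for v in powered))
    return [v / norm for v in powered] if norm > 0 else powered


# Hypothetical data: two 2-D codewords and four points on a 2x2 grid.
centers = [[0.0, 0.0], [10.0, 10.0]]
points = [(0, 0, [1.0, 0.0]), (0, 1, [9.0, 10.0]),
          (1, 0, [0.0, 2.0]), (1, 1, [12.0, 10.0])]
tables = build_tables(points, centers, height=2, width=2)
whole = box_vlad(tables, 0, 0, 2, 2)    # the full image
top_row = box_vlad(tables, 0, 0, 1, 2)  # first row only
```

Both normalizations are applied to the box sums after the lookup, so the integral images stay additive and nothing has to be approximated for individual boxes.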

SLIDE 20

Fisher with FLAIR

Decomposition as four multi-dimensional integral images [See paper]
Supports power norm, L2 norm, spatial pyramids
No need for approximations
Scalable to modern datasets

18X speedup

SLIDE 21

Overall detection speedup and accuracy

[Chart: time (s) per image]

Fisher with FLAIR is better and faster than BoW

SLIDE 22

MediaMill TRECVID 2014 runs

[Diagram labels: Selective Search, Fisher with FLAIR, MediaMill 2014 SIN runs, four runs, Bounding box annotations]

SLIDE 23

Implementation details

ColorSIFT descriptors, PCA-reduced to 80D
Fisher with FLAIR encoding
Spatial pyramid
Linear SVM
Hard negative mining

SLIDE 24

Boat

[Figure legend: Best box, Other boxes]

SLIDE 25

Airplane

[Figure legend: Best box, Other boxes]

SLIDE 26

Results

[Chart legend: 8x CNN + FLAIR, Fusion 1 + FLAIR, Fusion 2 + FLAIR, Fusion 3 + FLAIR]

FLAIR after deep nets is best

SLIDE 27

Conclusions

Bag of words and deep net profit from each other
Encoding Fisher with FLAIR is 18x faster
– Area independent
– Supports spatial pyramids, power norm, L2 norm
– No approximation
Allows for large-scale localization in video
