[PPT] - Applications in Visual Object Tracking Yuanwei Wu 10-21-2016 1 PowerPoint Presentation

SLIDE 1

Week 42: Siamese Network: Architecture and Applications in Visual Object Tracking

Yuanwei Wu 10-21-2016


SLIDE 2

Outline

Siamese Architecture
Siamese Applications in Computer Vision
Paper review: Visual Object Tracking using Siamese CNN

Future Work


SLIDE 3

What does “Siamese” mean?

Source: http://vision.ia.ac.cn/zh/senimar/reports/Siamese-Network-Architecture-and-Applications-in-Computer-Vision.pdf

SLIDE 4

Siamese Architecture

Source: Learning Hierarchies of Invariant Features. Yann LeCun. helper.ipam.ucla.edu/publications/gss2012/gss2012_10739.pdf

SLIDE 5

Siamese Architecture and loss function

Source: Learning Hierarchies of Invariant Features. Yann LeCun. helper.ipam.ucla.edu/publications/gss2012/gss2012_10739.pdf
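The diagram and formula for this slide are not in the extracted text. A minimal sketch of the idea, assuming the contrastive loss of Hadsell, Chopra and LeCun (an assumption: the slide text does not name the loss). Both inputs pass through the same function with shared weights W; similar pairs are pulled together and dissimilar pairs are pushed at least a margin apart. The linear `embed` function, the weight shape and the margin value are hypothetical.

```python
import numpy as np

def embed(x, W):
    """Hypothetical embedding G_W(x): one shared linear map followed by tanh."""
    return np.tanh(W @ x)

def contrastive_loss(x1, x2, same, W, margin=1.0):
    """Contrastive loss for one pair; `same` is True for a similar pair."""
    # Both branches use the SAME weights W: this is what makes the network Siamese.
    d = np.linalg.norm(embed(x1, W) - embed(x2, W))
    if same:
        return 0.5 * d ** 2                   # pull similar pairs together
    return 0.5 * max(0.0, margin - d) ** 2    # push dissimilar pairs beyond the margin

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))     # assumed toy sizes: 8-D input, 4-D embedding
a = rng.normal(size=8)
b = rng.normal(size=8)
loss_identical = contrastive_loss(a, a, True, W)
loss_pos = contrastive_loss(a, b, True, W)
loss_neg = contrastive_loss(a, b, False, W)
```

Because the two branches share W, swapping the inputs leaves the loss unchanged.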

SLIDE 6

Siamese Applications in Computer Vision:

1. Signature Verification

Source: http://vision.ia.ac.cn/zh/senimar/reports/Siamese-Network-Architecture-and-Applications-in-Computer-Vision.pdf

SLIDE 7

Siamese Applications in Computer Vision:

2. Dimensionality Reduction

Source: http://vision.ia.ac.cn/zh/senimar/reports/Siamese-Network-Architecture-and-Applications-in-Computer-Vision.pdf

SLIDE 8

Siamese Applications in Computer Vision: 3.1 Learning Image Descriptors

Source: http://vision.ia.ac.cn/zh/senimar/reports/Siamese-Network-Architecture-and-Applications-in-Computer-Vision.pdf

CNN Model

SLIDE 9

Siamese Applications in Computer Vision: 3.2 Learning Image Descriptors

Source: http://vision.ia.ac.cn/zh/senimar/reports/Siamese-Network-Architecture-and-Applications-in-Computer-Vision.pdf

SLIDE 10

Siamese Applications in Computer Vision: 4.1 Face Verification

Source: http://vision.ia.ac.cn/zh/senimar/reports/Siamese-Network-Architecture-and-Applications-in-Computer-Vision.pdf

SLIDE 11

Siamese Applications in Computer Vision: 4.2 Face Verification

Source: http://vision.ia.ac.cn/zh/senimar/reports/Siamese-Network-Architecture-and-Applications-in-Computer-Vision.pdf

SLIDE 12

Siamese Applications in Computer Vision: 4.3 Face Verification

Source: http://vision.ia.ac.cn/zh/senimar/reports/Siamese-Network-Architecture-and-Applications-in-Computer-Vision.pdf

SLIDE 13

Siamese Applications in Computer Vision: 4.4 Face Verification

Source: http://vision.ia.ac.cn/zh/senimar/reports/Siamese-Network-Architecture-and-Applications-in-Computer-Vision.pdf

SLIDE 14

Siamese Applications in Computer Vision: 4.5 Face Verification

Source: http://vision.ia.ac.cn/zh/senimar/reports/Siamese-Network-Architecture-and-Applications-in-Computer-Vision.pdf

SLIDE 15

@article{bertinetto2016fully, title={Fully-Convolutional Siamese Networks for Object Tracking}, author={Bertinetto, Luca and Valmadre, Jack and Henriques, Jo{\~a}o F and Vedaldi, Andrea and Torr, Philip HS}, journal={arXiv preprint arXiv:1606.09549}, year={2016} }

Paper Review: Fully-Convolutional Siamese Networks for Object Tracking


SLIDE 16

Architecture of Siamese CNN

Source: Luca Bertinetto, Jack Valmadre, João F. Henriques, Andrea Vedaldi, Philip H. S. Torr, Fully-Convolutional Siamese Networks for Object Tracking, arXiv preprint arXiv:1606.09549, 2016.

SLIDE 17

Details of the Architecture of Siamese CNN

Source: [1] Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton, ImageNet Classification with Deep Convolutional Neural Networks, NIPS 2012.

SLIDE 18

Details of the Architecture of Siamese CNN

Source: [1] Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton, ImageNet Classification with Deep Convolutional Neural Networks, NIPS 2012. [2] Bertinetto et al., Fully-Convolutional Siamese Networks for Object Tracking, arXiv preprint arXiv:1606.09549, 2016.

Cross-correlation layer
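A minimal sketch of what a cross-correlation layer computes: the exemplar embedding is slid over the search embedding and a dot product is taken at every offset. The embedding sizes (6 x 6 x 128 and 22 x 22 x 128, giving a 17 x 17 score map) follow the cited paper as far as can be told without the figure; the function name, the random inputs and the naive loop are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def cross_correlate(z_feat, x_feat, bias=0.0):
    """Slide the exemplar embedding over the search embedding (valid mode).

    z_feat: (hz, wz, c) exemplar embedding; x_feat: (hx, wx, c) search embedding.
    Returns a score map of shape (hx - hz + 1, wx - wz + 1).
    """
    hz, wz, c = z_feat.shape
    hx, wx, cx = x_feat.shape
    assert c == cx, "both branches must produce the same number of channels"
    out = np.empty((hx - hz + 1, wx - wz + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            window = x_feat[i:i + hz, j:j + wz, :]
            out[i, j] = np.sum(window * z_feat) + bias   # dot product at this offset
    return out

rng = np.random.default_rng(0)
z_feat = rng.normal(size=(6, 6, 128))
x_feat = rng.normal(size=(22, 22, 128))
x_feat[5:11, 9:15, :] = z_feat            # plant the exemplar at offset (5, 9)
score_map = cross_correlate(z_feat, x_feat)
peak = tuple(int(k) for k in np.unravel_index(np.argmax(score_map), score_map.shape))
```

The score map peaks at the offset where the search embedding matches the exemplar.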

SLIDE 19

Training: dataset

ImageNet Video dataset of 2015:

- contains ~4000 videos
- with ~1 million annotated frames

Source: Bertinetto et al., Fully-Convolutional Siamese Networks for Object Tracking, arXiv preprint arXiv:1606.09549, 2016.

SLIDE 20

Training: preprocessing on the images

Preprocessing: 2820 videos, exemplar image: 127 x 127, search image: 255 x 255

Source: Bertinetto et al., Fully-Convolutional Siamese Networks for Object Tracking, arXiv preprint arXiv:1606.09549, 2016.
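A minimal sketch of how the crop sizes relate to a bounding box of size (w, h), assuming the rule described in the cited paper: a context margin p = (w + h) / 4 is added around the box, and the scale s is chosen so that s(w + 2p) x s(h + 2p) = 127 x 127. The function name and return layout are hypothetical.

```python
import math

def crop_sides(w, h, exemplar_px=127, search_px=255):
    """Return (scale s, exemplar crop side, search crop side); sides in frame pixels."""
    p = (w + h) / 4.0                                 # context margin
    side_z = math.sqrt((w + 2 * p) * (h + 2 * p))     # square with the padded box's area
    s = exemplar_px / side_z                          # s(w+2p) * s(h+2p) = exemplar_px**2
    side_x = side_z * search_px / exemplar_px         # search crop covers ~2x the side
    return s, side_z, side_x

s, side_z, side_x = crop_sides(w=100, h=100)
```

For a 100 x 100 box the exemplar crop is 200 pixels wide before resizing to 127 x 127, and the search crop is about twice that side, which is roughly four times the area.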

SLIDE 21

Training: recap the steps

ImageNet Video dataset of 2015:

- contains ~4000 videos
- with ~1 million annotated frames

Preprocessing:

- 2820 videos
- exemplar image: 127 x 127
- search image: 255 x 255

Training with a standard Stochastic Gradient Descent (SGD) solver using MatConvNet

Source: Bertinetto et al., Fully-Convolutional Siamese Networks for Object Tracking, arXiv preprint arXiv:1606.09549, 2016.

SLIDE 22

Training: loss function

Employing a discriminative training approach using positive and negative pairs and adopting the logistic loss.

Source: Bertinetto et al., Fully-Convolutional Siamese Networks for Object Tracking, arXiv preprint arXiv:1606.09549, 2016.

SLIDE 23

Training: loss function

Employing a discriminative training approach using positive and negative pairs and adopting the logistic loss.

The loss of a score map is the mean of the individual losses.

Source: Bertinetto et al., Fully-Convolutional Siamese Networks for Object Tracking, arXiv preprint arXiv:1606.09549, 2016.
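A minimal sketch of that averaging, assuming the logistic loss log(1 + exp(-y v)) with labels y in {+1, -1} and real-valued scores v, as in the cited paper. The function names and the 3 x 3 example are hypothetical.

```python
import numpy as np

def logistic_loss(y, v):
    """Element-wise logistic loss log(1 + exp(-y * v)) for labels y in {+1, -1}."""
    # logaddexp(0, t) = log(exp(0) + exp(t)) avoids overflow for large |t|.
    return np.logaddexp(0.0, -y * v)

def score_map_loss(y_map, v_map):
    """Mean of the individual losses over all positions of the score map."""
    return float(np.mean(logistic_loss(y_map, v_map)))

# Hypothetical 3 x 3 label map: only the centre is a positive position.
y_map = -np.ones((3, 3))
y_map[1, 1] = 1.0
good = score_map_loss(y_map, 5.0 * y_map)         # scores agree with the labels
bad = score_map_loss(y_map, -5.0 * y_map)         # scores contradict the labels
zero = score_map_loss(y_map, np.zeros((3, 3)))    # uninformative scores
```

Scores that agree with the labels give a loss near zero, and all-zero scores give log 2 at every position.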


SLIDE 24

Training: loss function

Employing a discriminative training approach using positive and negative pairs and adopting the logistic loss.

The loss of a score map is the mean of the individual losses.

Applying SGD to find the conv-net parameters θ that minimize the expected loss.

Source: Bertinetto et al., Fully-Convolutional Siamese Networks for Object Tracking, arXiv preprint arXiv:1606.09549, 2016.
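The formulas on this slide were images and are missing from the extracted text. The block below restates them from the cited paper and is best checked against it. Assumed notation: v is the real-valued score of one exemplar-candidate pair, y in {+1, -1} is its ground-truth label, D is the set of positions u in the score map, z is the exemplar image, x is the search image, and f is the network with parameters θ.

```latex
\begin{align}
\ell(y, v) &= \log\bigl(1 + \exp(-y v)\bigr), \\
L(y, v) &= \frac{1}{|\mathcal{D}|} \sum_{u \in \mathcal{D}} \ell\bigl(y[u], v[u]\bigr), \\
\theta^{*} &= \arg\min_{\theta} \; \mathbb{E}_{(z, x, y)} \, L\bigl(y, f(z, x; \theta)\bigr).
\end{align}
```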


SLIDE 25

Tracking algorithm

Use a search image centered at the previous position of the target.

Source: Bertinetto et al., Fully-Convolutional Siamese Networks for Object Tracking, arXiv preprint arXiv:1606.09549, 2016.

SLIDE 26

Tracking algorithm

Use a search image centered at the previous position of the target.

Only search for the object within a region of approximately four times its previous size.

Source: Bertinetto et al., Fully-Convolutional Siamese Networks for Object Tracking, arXiv preprint arXiv:1606.09549, 2016.

SLIDE 27

Tracking algorithm

Use a search image centered at the previous position of the target.

Only search for the object within a region of approximately four times its previous size.

A cosine window is added to the score map to penalize large displacements.

Source: Bertinetto et al., Fully-Convolutional Siamese Networks for Object Tracking, arXiv preprint arXiv:1606.09549, 2016.

SLIDE 28

Tracking algorithm

Use a search image centered at the previous position of the target.

Only search for the object within a region of approximately four times its previous size.

A cosine window is added to the score map to penalize large displacements.

The position of the maximum score relative to the center of the score map, multiplied by the stride of the network, gives the displacement of the target from frame to frame.

Source: Bertinetto et al., Fully-Convolutional Siamese Networks for Object Tracking, arXiv preprint arXiv:1606.09549, 2016.
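A minimal sketch of the last two steps for one frame, assuming a network stride of 8 and a weighted sum of the normalized score map and a Hann (cosine) window. The `window_influence` value and the function name are hypothetical, and any upsampling of the score map before the peak search is left out.

```python
import numpy as np

def locate_target(score_map, stride=8, window_influence=0.3):
    """Return the (dy, dx) displacement in search-image pixels for one frame."""
    h, w = score_map.shape
    # 2-D cosine (Hann) window: highest at the centre, zero at the border.
    window = np.outer(np.hanning(h), np.hanning(w))
    window /= window.sum()
    # Shift and normalise the scores so both terms are on a comparable scale.
    scores = score_map - score_map.min()
    total = scores.sum()
    if total > 0:
        scores = scores / total
    blended = (1.0 - window_influence) * scores + window_influence * window
    r, c = np.unravel_index(np.argmax(blended), blended.shape)
    centre_r, centre_c = (h - 1) / 2.0, (w - 1) / 2.0
    # Offset of the peak from the centre, converted to pixels by the stride.
    return float((r - centre_r) * stride), float((c - centre_c) * stride)

score_map = np.zeros((17, 17))
score_map[10, 8] = 1.0                    # peak two cells below the centre
dy, dx = locate_target(score_map)
```

With a 17 x 17 score map the centre is cell (8, 8), so a peak at (10, 8) corresponds to a displacement of 2 x 8 = 16 pixels downward.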

SLIDE 29

Experiments: training dataset size

Accuracy: calculated as the average Intersection-over-Union (IoU)

Robustness: measured as the total number of failures

Source: Bertinetto et al., Fully-Convolutional Siamese Networks for Object Tracking, arXiv preprint arXiv:1606.09549, 2016.

SLIDE 30

Experiments: training dataset size

Accuracy: calculated as the average Intersection-over-Union (IoU)

Robustness: measured as the total number of failures

Using a larger video dataset could increase the performance even further.

Source: Bertinetto et al., Fully-Convolutional Siamese Networks for Object Tracking, arXiv preprint arXiv:1606.09549, 2016.
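A minimal sketch of the accuracy measure, assuming axis-aligned boxes given as (x, y, w, h); benchmark ground truth may use other box formats, and the function names are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two axis-aligned boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def accuracy(pred_boxes, gt_boxes):
    """Average IoU over the frames of one sequence."""
    scores = [iou(p, g) for p, g in zip(pred_boxes, gt_boxes)]
    return sum(scores) / len(scores)

gt = [(0, 0, 10, 10), (0, 0, 10, 10)]
pred = [(0, 0, 10, 10), (5, 0, 10, 10)]   # second prediction is shifted by half a box
acc = accuracy(pred, gt)
```

A perfect overlap scores 1, disjoint boxes score 0, and the half-shifted box scores 50 / 150 = 1/3.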

SLIDE 31

Experiments: OTB13 benchmark results

Source: Bertinetto et al., Fully-Convolutional Siamese Networks for Object Tracking, arXiv preprint arXiv:1606.09549, 2016.

SLIDE 32

Experiments: VOT15 benchmark results

Source: Bertinetto et al., Fully-Convolutional Siamese Networks for Object Tracking, arXiv preprint arXiv:1606.09549, 2016.

SLIDE 33

Experiments: VOT15 benchmark results

Source: Bertinetto et al., Fully-Convolutional Siamese Networks for Object Tracking, arXiv preprint arXiv:1606.09549, 2016.

SLIDE 34

Experiments: VOT15 benchmark results

Estimates the new position of the target object by merely cross-correlating the embeddings of two patches over three scales.

Achieves real-time performance and state-of-the-art results.

Source: Bertinetto et al., Fully-Convolutional Siamese Networks for Object Tracking, arXiv preprint arXiv:1606.09549, 2016.

SLIDE 35

Future work: How to improve the performance?

By augmenting the online tracking pipeline:

- online model updating (e.g. tracking-by-detection)
- bounding-box regression (e.g. YOLO, Faster R-CNN)
- fine-tuning (e.g. correlation filters + CNN features)
- memory (e.g. adding an RNN or LSTM)

SLIDE 36

Source: Guanghan Ning, Zhi Zhang, Chen Huang, Zhihai He, Xiaobo Ren, Haohong Wang, Spatially Supervised Recurrent Convolutional Neural Networks for Visual Object Tracking, arXiv preprint, 2016.

SLIDE 37

Future work: How to improve the performance?

By augmenting the online tracking pipeline:

- online model updating (e.g. tracking-by-detection)
- bounding-box regression (e.g. YOLO, Faster R-CNN)
- fine-tuning (e.g. correlation filters + CNN features)
- memory (e.g. adding an RNN or LSTM)

By introducing new architectures within the Siamese CNN framework, which requires digging deeper into the structure of the networks (e.g. regression network, triplet network).

SLIDE 38

Triplet Network

Source: http://vision.ia.ac.cn/zh/senimar/reports/Siamese-Network-Architecture-and-Applications-in-Computer-Vision.pdf
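The figure for this slide is not in the extracted text. A minimal sketch of the idea, assuming the margin-based (hinge) triplet loss on squared Euclidean distances; the slide does not say which formulation it shows, and the margin value is hypothetical. Three inputs (anchor, positive, negative) share one embedding network, and the loss asks the anchor to be closer to the positive than to the negative by at least the margin.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge triplet loss on squared Euclidean distances between embeddings."""
    d_pos = float(np.sum((anchor - positive) ** 2))   # anchor-to-positive distance
    d_neg = float(np.sum((anchor - negative) ** 2))   # anchor-to-negative distance
    return max(0.0, d_pos - d_neg + margin)

anchor = np.array([0.0, 0.0])
near = np.array([0.1, 0.0])
far = np.array([1.0, 0.0])
satisfied = triplet_loss(anchor, near, far)   # positive is already much closer
violated = triplet_loss(anchor, far, near)    # roles swapped: constraint violated
```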

SLIDE 39

Future work: How to improve the performance?

By augmenting the online tracking pipeline:

- online model updating (e.g. tracking-by-detection)
- bounding-box regression (e.g. YOLO, Faster R-CNN)
- fine-tuning (e.g. correlation filters + CNN features)
- memory (e.g. adding an RNN or LSTM)

By introducing new architectures within the Siamese CNN framework, which requires digging deeper into the structure of the networks (e.g. regression network, triplet network).

By introducing a new loss function in the Siamese network.

SLIDE 40


Loss function used in face verification

Source: http://vision.ia.ac.cn/zh/senimar/reports/Siamese-Network-Architecture-and-Applications-in-Computer-Vision.pdf

SLIDE 41

Thank you!
