Siamese Neural Networks and Similarity Learning


SLIDE 1

Siamese Neural Networks and Similarity Learning

SLIDE 2

What can ML do for us?

  • Classification problem
  • Prof. Leal-Taixé and Prof. Niessner

[Figure: a cat image fed into a Neural Network, which outputs the label CAT]

SLIDE 3

What can ML do for us?

  • Classification problem on ImageNet with thousands of categories

SLIDE 4

What can ML do for us?

  • Performance on ImageNet

– Size of the blobs indicates the number of parameters

  • A. Canziani et al. „An Analysis of Deep Neural Network Models for Practical Applications“. arXiv:1605.07678, 2016

SLIDE 5

What can ML do for us?

  • Regression problem: pose regression

[Architecture: pretrained network → feature extraction y ∈ R^2048 → FC layers → linear regression of p ∈ R^3 and q ∈ R^4]

SLIDE 6

What can ML do for us?

  • Regression problem: bounding box regression
  • D. Held et al. „Learning to Track at 100 FPS with Deep Regression Networks“. ECCV 2016
SLIDE 7

What can ML do for us?

  • Third type of problems

[Figure: face image A classified as person, face, female; face image B classified as person, face, male]

SLIDE 8

What can ML do for us?

  • Third type of problems

[Figure: face images A and B] Is it the same person?

SLIDE 9

What can ML do for us?

  • Third type of problems: Similarity Learning

[Figure: face images A and B]

  • Comparison
  • Ranking
SLIDE 10

Similarity Learning: when and why?

  • Application: unlocking your iPhone with your face

Training

SLIDE 11

Similarity Learning: when and why?

  • Application: unlocking your iPhone with your face

Testing: A vs. B → YES / NO. Can be solved as a classification problem.

SLIDE 12

Similarity Learning: when and why?

  • Application: face recognition system so students can enter the exam room without the need for ID check

Training: Person 1, Person 2, Person 3

SLIDE 13

Similarity Learning: when and why?

  • Application: face recognition system so students can enter the exam room without the need for ID check

What is the problem with this approach? Scalability – we need to retrain our model every time a new student is registered to the course.

SLIDE 14

Similarity Learning: when and why?

  • Application: face recognition system so students can enter the exam room without the need for ID check

Can we train one model and use it every year?

SLIDE 15

Similarity Learning: when and why?

  • Learn a similarity function

[Figure: one pair A, B with a low similarity score; another pair A, B with a high similarity score]

SLIDE 16

Similarity Learning: when and why?

  • Learn a similarity function: testing

d(A, B) > τ → not the same person

SLIDE 17

Similarity Learning: when and why?

  • Learn a similarity function

d(A, B) < τ → same person
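The decision rule on the last two slides is just a threshold test on the embedding distance. A minimal pure-Python sketch; the 3-d embeddings and the threshold τ = 1.0 are illustrative assumptions, not values from the lecture:

```python
def euclidean_sq(fa, fb):
    # Squared L2 distance between two embeddings, d(A, B) = ||f(A) - f(B)||^2
    return sum((a - b) ** 2 for a, b in zip(fa, fb))

def same_person(fa, fb, tau=1.0):
    # Verification rule from the slides: "same person" iff d(A, B) < tau
    return euclidean_sq(fa, fb) < tau

f_A = [0.1, 0.9, 0.2]    # hypothetical 3-d embeddings (DeepFace uses 128 values)
f_B = [0.15, 0.85, 0.25]
f_C = [0.9, 0.1, 0.7]

print(same_person(f_A, f_B))  # close embeddings -> True
print(same_person(f_A, f_C))  # distant embeddings -> False
```

Note that τ is a free parameter: in practice it is tuned on a validation set to trade off false accepts against false rejects.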

SLIDE 18

Similarity learning

  • How do we train a network to learn similarity?

SLIDE 19

Siamese Neural Networks

SLIDE 20

Similarity learning

  • How do we train a network to learn similarity?

Taigman et al. „DeepFace: closing the gap to human level performance“. CVPR 2014

[Architecture: image A → CNN → FC → representation of my face in 128 values]

SLIDE 21

Similarity learning

  • How do we train a network to learn similarity?

Taigman et al. „DeepFace: closing the gap to human level performance“. CVPR 2014

[Figure: images A and B mapped to encodings f(A) and f(B)]

SLIDE 22

Similarity learning

  • Siamese network = shared weights

Taigman et al. „DeepFace: closing the gap to human level performance“. CVPR 2014

[Figure: A and B processed by the same network (shared weights) to produce f(A) and f(B)]

SLIDE 23

Similarity learning

  • Siamese network = shared weights
  • We use the same network to obtain an encoding of the image
  • To be done: compare the encodings

Taigman et al. „DeepFace: closing the gap to human level performance“. CVPR 2014

SLIDE 24

Similarity learning

  • Distance function: d(A, B) = ||f(A) − f(B)||²
  • Training: learn the parameters such that

– If A and B depict the same person, d(A, B) is small
– If A and B depict a different person, d(A, B) is large

Taigman et al. „DeepFace: closing the gap to human level performance“. CVPR 2014

SLIDE 25

Similarity learning

  • Loss function for a positive pair:

– If A and B depict the same person, d(A, B) should be small

L(A, B) = ||f(A) − f(B)||²

SLIDE 26

Similarity learning

  • Loss function for a negative pair:

– If A and B depict a different person, d(A, B) should be large
– Better use a Hinge loss:

L(A, B) = max(0, m² − ||f(A) − f(B)||²)

If two elements are already far away, do not spend energy in pushing them even further apart.

SLIDE 27

Similarity learning

  • Contrastive loss:

L(A, B) = y* ||f(A) − f(B)||² + (1 − y*) max(0, m² − ||f(A) − f(B)||²)

Positive pair (y* = 1): reduce the distance between the elements. Negative pair (y* = 0): push the elements further apart, up to a margin.
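The contrastive loss above fits in a few lines of plain Python. A hedged sketch with toy 2-d embeddings and an illustrative margin m = 2.0 (the lecture does not fix these values):

```python
def dist_sq(fa, fb):
    # Squared L2 distance ||f(A) - f(B)||^2
    return sum((a - b) ** 2 for a, b in zip(fa, fb))

def contrastive_loss(fa, fb, y, m=2.0):
    # y = 1 for a positive pair (same identity), y = 0 for a negative pair
    d = dist_sq(fa, fb)
    return y * d + (1 - y) * max(0.0, m ** 2 - d)

# Positive pair: the loss is the squared distance itself
print(contrastive_loss([0.0, 0.0], [1.0, 0.0], y=1))  # 1.0
# Negative pair closer than the margin: penalized by m^2 - d
print(contrastive_loss([0.0, 0.0], [1.0, 0.0], y=0))  # 3.0
# Negative pair already far away: the hinge gives zero loss
print(contrastive_loss([0.0, 0.0], [3.0, 4.0], y=0))  # 0.0
```

The last case is exactly the hinge behavior from the previous slide: distant negatives contribute nothing, so the network does not waste capacity on them.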

SLIDE 28

Similarity learning

  • Training the siamese networks

– You can update the weights for each channel independently and then average them

  • This loss function allows us to learn to bring positive pairs together and push negative pairs apart

SLIDE 29

Triplet Loss

SLIDE 30

Triplet loss

  • Triplet loss allows us to learn a ranking. We want:

Schroff et al. „FaceNet: a unified embedding for face recognition and clustering“. CVPR 2015

Anchor (A), Positive (P), Negative (N)

||f(A) − f(P)||² < ||f(A) − f(N)||²

SLIDE 31

Triplet loss

  • Triplet loss allows us to learn a ranking

Schroff et al. „FaceNet: a unified embedding for face recognition and clustering“. CVPR 2015

||f(A) − f(P)||² < ||f(A) − f(N)||²
||f(A) − f(P)||² − ||f(A) − f(N)||² < 0
||f(A) − f(P)||² − ||f(A) − f(N)||² + m < 0, with margin m

SLIDE 32

Triplet loss

  • Triplet loss allows us to learn a ranking

Schroff et al. „FaceNet: a unified embedding for face recognition and clustering“. CVPR 2015

||f(A) − f(P)||² < ||f(A) − f(N)||²
||f(A) − f(P)||² − ||f(A) − f(N)||² + m < 0

L(A, P, N) = max(0, ||f(A) − f(P)||² − ||f(A) − f(N)||² + m)
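The final hinge form can be sketched directly in plain Python. The 2-d embeddings and the margin m = 0.2 are illustrative choices, not prescribed by the slides:

```python
def dist_sq(fa, fb):
    # Squared L2 distance ||f(A) - f(B)||^2
    return sum((a - b) ** 2 for a, b in zip(fa, fb))

def triplet_loss(fa, fp, fn, m=0.2):
    # Hinge on the ranking constraint: the positive should be closer
    # to the anchor than the negative, by at least the margin m
    return max(0.0, dist_sq(fa, fp) - dist_sq(fa, fn) + m)

anchor   = [0.0, 0.0]
positive = [1.0, 0.0]   # d(A, P) = 1.0
negative = [0.0, 2.0]   # d(A, N) = 4.0
print(triplet_loss(anchor, positive, negative))  # constraint satisfied -> 0.0
print(triplet_loss(anchor, negative, positive))  # violated -> 4.0 - 1.0 + 0.2
```

Swapping the roles of positive and negative in the second call shows how a violated ranking produces a positive loss that gradient descent can then reduce.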

SLIDE 33

Triplet loss

  • Hard negative mining: training with hard cases
  • Train for a few epochs
  • Choose the hard cases where d(A, P) ≈ d(A, N)
  • Train with those to refine the distance learned

L(A, P, N) = max(0, ||f(A) − f(P)||² − ||f(A) − f(N)||² + m)
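A hedged sketch of the selection step: after a few epochs, keep only the negatives whose distance to the anchor is comparable to d(A, P), i.e. those that still violate the margin. The toy embeddings and margin are illustrative:

```python
def dist_sq(fa, fb):
    # Squared L2 distance ||f(A) - f(B)||^2
    return sum((a - b) ** 2 for a, b in zip(fa, fb))

def hard_negatives(anchor, positive, negatives, m=0.2):
    # Keep only the "hard" negatives: those roughly as close to the anchor
    # as the positive, i.e. the ones that still give a non-zero triplet loss
    d_ap = dist_sq(anchor, positive)
    return [n for n in negatives if d_ap - dist_sq(anchor, n) + m > 0]

anchor, positive = [0.0, 0.0], [1.0, 0.0]          # d(A, P) = 1.0
negatives = [[0.0, 1.0], [5.0, 0.0], [0.9, 0.1]]   # d(A, N) = 1.0, 25.0, ~0.82
print(hard_negatives(anchor, positive, negatives))  # the easy negative [5.0, 0.0] is dropped
```

Only the surviving negatives would be fed back into training to refine the learned distance.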

SLIDE 34

Triplet loss

[Figure: training triplets, each consisting of an Anchor, a Positive and a Negative]

SLIDE 35

Triplet loss: test time

  • Just do nearest neighbor search!
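At test time there is no loss at all. A sketch of the retrieval step; the gallery embeddings and labels (person_1, …) are hypothetical:

```python
def dist_sq(fa, fb):
    # Squared L2 distance ||f(A) - f(B)||^2
    return sum((a - b) ** 2 for a, b in zip(fa, fb))

def nearest_neighbor(query, gallery):
    # gallery: (label, embedding) pairs computed once, offline.
    # At test time we only embed the query and return the closest identity,
    # so enrolling a new person never requires retraining.
    return min(gallery, key=lambda item: dist_sq(query, item[1]))[0]

gallery = [("person_1", [0.0, 1.0]),
           ("person_2", [1.0, 0.0]),
           ("person_3", [0.7, 0.7])]
print(nearest_neighbor([0.9, 0.1], gallery))  # person_2
```

This is exactly why similarity learning solves the scalability problem from the exam-room example: adding a student only adds a row to the gallery.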

SLIDE 36

Triplet Loss Challenges

  • Random sampling does not work: the number of possible triplets is O(n³), so the network would need to be trained for a very long time.
  • Even with hard negative mining, there is the risk of getting stuck in local minima.

SLIDE 37

Several approaches to improve similarity learning

SLIDE 38

Improving similarity learning

  • Loss:

– Contrastive vs. triplet loss

  • Sampling:

– Choosing the best triplets to train with, sample the space wisely = diversity of classes + hard cases

  • Ensembles:

– Why not use several networks, each of them trained with a subset of triplets?

  • Can we use a classification loss for similarity learning?

SLIDE 39

Losses: interesting works

  • Wang et al., Deep metric learning with angular loss, ICCV 2017
  • Yu et al., Correcting the triplet selection bias for triplet loss, ECCV 2018

SLIDE 40

Improving similarity learning

  • Loss:

– Contrastive vs. triplet loss

  • Sampling:

– Choosing the best triplets to train with, sample the space wisely = diversity of classes + hard cases

  • Ensembles:

– Why not use several networks, each of them trained with a subset of triplets?

  • Can we use a classification loss for similarity learning?

SLIDE 41

Sampling: Hierarchical Triplet Loss

  • Build a hierarchical tree where the leaves of the tree represent the image classes. Recursively merge them until you reach the root node.

[Figure: leaves Class 1 … Class 8 merged recursively into a tree]

Ge et al., Deep Metric Learning with Hierarchical Triplet Loss, ECCV 2018

SLIDE 42

HTL: building the tree

  • In order to create the tree, we first define a distance between classes. Intuition: if the distance between two classes is small, they will be merged in the next level of the tree.

[Equation not extracted: the class distance is computed from the deep features of images i and j, normalized by the cardinality of classes p and q (how many samples we have for each class)]

SLIDE 43

HTL: finding the anchors

  • Randomly select l’ nodes at the 0th level

– This is done to preserve class diversity in the mini-batch

[Figure: leaves Class 1 … Class 8]

SLIDE 44

HTL: finding the anchors

  • Randomly select l’ nodes at the 0th level

– This is done to preserve class diversity in the mini-batch

  • m−1 nearest classes at the 0th level are selected for each of the l’ nodes, based on the distance in feature space.

SLIDE 45

HTL: finding the anchors

  • Randomly select l’ nodes at the 0th level

– This is done to preserve class diversity in the mini-batch

  • m−1 nearest classes at the 0th level are selected for each of the l’ nodes, based on the distance in feature space:

– We want to encourage the model to learn discriminative features from the visually similar classes.

SLIDE 46

HTL: finding the anchors

  • Randomly select l’ nodes at the 0th level

– This is done to preserve class diversity in the mini-batch

  • m−1 nearest classes at the 0th level are selected for each of the l’ nodes, based on the distance in feature space:

– We want to encourage the model to learn discriminative features from the visually similar classes.

  • t images per class are randomly collected

→ t · m · l’ images in the mini-batch
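The sampling recipe above can be sketched in a few lines. Everything here (the toy classes, the stand-in class distance, and the values l' = m = t = 2) is illustrative, not from the paper:

```python
import random

def htl_minibatch(features_by_class, class_dist, l_prime=2, m=2, t=2, seed=0):
    # Sketch of the HTL sampling above: pick l' random anchor classes,
    # add the m-1 nearest classes to each, then draw t images per chosen class.
    # features_by_class: {class_id: [images]}, class_dist: f(c1, c2) -> distance.
    rng = random.Random(seed)
    classes = list(features_by_class)
    anchors = rng.sample(classes, l_prime)
    batch = []
    for a in anchors:
        neighbors = sorted((c for c in classes if c != a),
                           key=lambda c: class_dist(a, c))
        for c in [a] + neighbors[: m - 1]:
            batch += rng.sample(features_by_class[c], t)
    return batch  # t * m * l' images

# Hypothetical toy data: class ids 0..3, "images" are just tagged strings
data = {c: [f"img{c}_{k}" for k in range(5)] for c in range(4)}
dist = lambda a, b: abs(a - b)  # stand-in for the feature-space class distance
batch = htl_minibatch(data, dist)
print(len(batch))  # t * m * l' = 2 * 2 * 2 = 8
```

In the real method the class distance comes from the hierarchical tree built on deep features; the toy `dist` only stands in for that.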

SLIDE 47

HTL: loss formulation

The loss is a triplet loss summed over all the triplets, but the margin depends on the distances computed on the hierarchical tree. The idea is that the margin can adapt to class distributions and to differences of the samples within the classes.

SLIDE 48

Sampling: interesting works

  • Manmatha et al., Sampling matters for deep metric learning, ICCV 2017 – original sampling method
  • Xu et al., Deep asymmetric metric learning via rich relationship mining, CVPR 2019
  • Duan et al., Deep embedding learning with discriminative sampling policy, CVPR 2019
  • Wang et al., Ranked list loss for deep metric learning, CVPR 2019
  • Wang et al., Multi-similarity loss with general pair weighting for deep metric learning, CVPR 2019 – best performance

SLIDE 49

Improving similarity learning

  • Loss:

– Contrastive vs. triplet loss

  • Sampling:

– Choosing the best triplets to train with, sample the space wisely = diversity of classes + hard cases

  • Ensembles:

– Why not use several networks, each of them trained with a subset of triplets?

  • Can we use a classification loss for similarity learning?

SLIDE 50

Ensembles

  • Idea: divide the space into K clusters, and have one learner per cluster.

Divide → Conquer

Sanakoyeu et al., Divide and Conquer the Embedding Space for Metric Learning, CVPR 2019

SLIDE 51

Ensembles: Divide and Conquer

1) Cluster the embedding space in K clusters using K-means.
2) Build K independent learners (fully connected layers) on top of the CNN, where each learner corresponds to one cluster – DIVIDE.
3) Until convergence, sample each mini-batch from one random cluster, and update only its corresponding learner.
4) After the network has converged, finetune using all learners at the same time – CONQUER.
5) Go back to (1) and repeat several times.
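A hedged sketch of steps 1–3: partition the samples, then draw each mini-batch from a single random cluster, so only that cluster's learner would be updated. The one-pass "K-means" and the toy embeddings are simplifications of the real method:

```python
import random

def divide_step(embeddings, k, rng):
    # Stand-in for K-means: assign each sample to the nearest of k randomly
    # chosen "centroids" (a single assignment pass, not full K-means).
    centroids = rng.sample(embeddings, k)
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    clusters = [[] for _ in range(k)]
    for i, e in enumerate(embeddings):
        clusters[min(range(k), key=lambda c: dist(e, centroids[c]))].append(i)
    return clusters

def training_step(clusters, rng, batch_size=2):
    # Sample one mini-batch from a single random cluster; in the real method
    # only that cluster's learner (FC head) receives a gradient update.
    c = rng.randrange(len(clusters))
    batch = rng.sample(clusters[c], min(batch_size, len(clusters[c])))
    return c, batch

rng = random.Random(0)
embeddings = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 4.9]]
clusters = divide_step(embeddings, k=2, rng=rng)
learner_id, batch = training_step(clusters, rng)
print(learner_id, batch)
```

The "conquer" phase would then finetune with all learners jointly, which this sketch omits.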

SLIDE 52

Ensembles: interesting works

  • Opitz et al., BIER – Boosting Independent Embeddings Robustly, ICCV 2017 – train K independent networks.
  • Elezi et al., The Group Loss for Metric Learning, arXiv 2020 – train K independent networks and concatenate their features.
  • Yuan et al., Hard-Aware Deeply Cascaded Embedding, CVPR 2017 – concatenate features from different levels of the network.
  • Wang et al., Ranked list loss for deep metric learning, CVPR 2019 – concatenate features from different levels of the network.
  • Kim et al., Attention-based Ensemble for Deep Metric Learning, ECCV 2018 – use an attention mechanism such that each learner looks at different parts of the object.

SLIDE 53

Improving similarity learning

  • Loss:

– Contrastive vs. triplet loss

  • Sampling:

– Choosing the best triplets to train with, sample the space wisely = diversity of classes + hard cases

  • Ensembles:

– Why not use several networks, each of them trained with a subset of triplets?

  • Can we use a classification loss for similarity learning?

SLIDE 54

Classification loss: interesting works

  • Movshovitz-Attias et al., No Fuss Distance Metric Learning using Proxies, ICCV 2017 – learn “proxy” samples to keep as positives and negatives in the mini-batch.
  • Teh et al., ProxyNCA++: Revisiting and Revitalizing Proxy Neighborhood Component Analysis, arXiv 2020 – a better way of using proxies, some of the best results in the field.
  • Qian et al., SoftTriple Loss: Deep Metric Learning Without Triplet Sampling, ICCV 2019 – using multiple centers per class.
  • Elezi et al., The Group Loss for Deep Metric Learning, arXiv 2020 – refine the softmax probabilities via a dynamical system for a better feature embedding.

SLIDE 55

Some results

Jacob et al., Metric Learning With HORDE: High-Order Regularizer for Deep Embeddings, ICCV 2019

SLIDE 56

Some results

SLIDE 57

So, which model to use?

[Results on CUB and CARS] When trained correctly (and using the same backbone, same embedding space and no extra tricks to boost the results), the difference in accuracy between different models is not that large…

Musgrave et al., A Metric Learning Reality Check, arXiv 2020

SLIDE 58

Tips and tricks

  • Simple baselines (contrastive loss, triplet loss and classification loss) actually perform well when trained correctly.
  • Sampling is as important as the choice of loss function. Every method can be boosted by devising an intelligent sampling strategy.
  • Some tricks may further improve the results (temperature for the softmax, freezing batch-norm layers, using multiple centers per class, etc.).

SLIDE 59

Tips and tricks

  • Even naive ensembles may (significantly) boost performance.
  • Good out-of-the-box choices: Proxy-NCA and SoftTriple Loss → they perform well, do not require a massive hyperparameter search, and have code online!
  • Contrastive loss and triplet loss give a similarity score in addition to the feature embedding.
  • Stronger backbone choices (e.g. DenseNet) further improve the results.

SLIDE 60

Applications in vision

SLIDE 61

Siamese network on MNIST

SLIDE 62

Establishing image correspondences

Image from University of Washington

SLIDE 63

Establishing image correspondences

Image from University of Washington

SLIDE 64

Establishing image correspondences

  • Used in a wide range of Computer Vision applications

– Image stitching or image alignment
– Object recognition
– 3D reconstruction
– Object tracking
– Image retrieval

  • Many of these applications are now targeted directly with Neural Networks, as we will see in the course

SLIDE 65

Establishing image correspondences

  • Classic method pipeline

– Extract manually designed feature descriptors

  • Harris, SIFT, SURF: most are based on image gradients
  • They suffer under extreme illumination or viewpoint changes
  • Slow to extract dense features

– Match descriptors from the two images

  • Many descriptors are similar; one needs to filter out possible double matches and keep only reliable ones.

Sameer Agarwal et al. „Building Rome in a Day“. ICCV 2009

SLIDE 66

Establishing image correspondences

  • End-to-end learning for patch similarity
  • Fast, to allow dense extraction
  • Invariant to a wide array of transformations (illumination, viewpoint)

Siamese network

  • S. Zagoruyko and N. Komodakis. „Learning to Compare Image Patches via Convolutional Neural Networks“. CVPR 2015

SLIDE 67

Establishing image correspondences

  • Classic Siamese architecture

– Shared layers

  • Simulate the feature extraction

– One decision layer

  • Simulates the matching

  • S. Zagoruyko and N. Komodakis. „Learning to Compare Image Patches via Convolutional Neural Networks“. CVPR 2015

SLIDE 68

Image retrieval

Radenovic et al. „Fine-tuning CNN Image Retrieval with No Human Annotation“. TPAMI 2018

SLIDE 69

Unsupervised learning

  • Learning from videos

– Tracking provides the supervision
– Use the tracked patches as positive samples
– Extract random patches as negative samples

Wang and Gupta. „Unsupervised Learning of Visual Representations using Videos“. ICCV 2015

SLIDE 70

Optical flow

  • Input: 2 consecutive images (e.g. from a video)
  • Output: displacement of every pixel from image A to image B
  • Results in the “perceived” 2D motion, not the real motion of the object

SLIDE 71

Optical flow

SLIDE 72

Optical flow

SLIDE 73

Optical flow with CNNs

  • End-to-end supervised learning of optical flow
  • P. Fischer et al. „FlowNet: Learning Optical Flow With Convolutional Networks“. ICCV 2015
SLIDE 74

Optical flow with CNNs

  • P. Fischer et al. „FlowNet: Learning Optical Flow With Convolutional Networks“. ICCV 2015
SLIDE 75

FlowNet: architecture 1

  • Stack both images → input is now 2 × RGB = 6 channels
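A sketch of this input construction with toy nested-list "images" (real implementations stack tensors along the channel axis):

```python
def stack_images(img_a, img_b):
    # FlowNet architecture-1-style input: concatenate the two RGB frames
    # along the channel axis, so each pixel carries 2 x 3 = 6 values.
    assert len(img_a) == len(img_b) and len(img_a[0]) == len(img_b[0])
    return [[pa + pb for pa, pb in zip(row_a, row_b)]
            for row_a, row_b in zip(img_a, img_b)]

# Hypothetical 1x2 images; each pixel is an [R, G, B] list
frame_t  = [[[255, 0, 0], [0, 255, 0]]]
frame_t1 = [[[250, 5, 0], [0, 250, 5]]]
stacked = stack_images(frame_t, frame_t1)
print(len(stacked[0][0]))  # 6 channels per pixel
```

A single CNN then processes the 6-channel input and must discover on its own how to compare the two frames.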
SLIDE 76

FlowNet: architecture 2

  • Siamese architecture
SLIDE 77

FlowNet: architecture 2

  • Two key design choices

How to combine the information from both images?

SLIDE 78

Correlation layer

  • Multiplies a feature vector with another feature vector

Fixed operation. No learnable weights!
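As the slide says, the correlation of two feature vectors is a fixed dot product with no learnable weights. A minimal sketch with hypothetical feature vectors from the two Siamese branches:

```python
def correlation(f1, f2):
    # Correlation of two feature vectors: a plain dot product.
    # Nothing is learned here; the layer is a fixed operation.
    return sum(a * b for a, b in zip(f1, f2))

feat_a = [0.5, 1.0, -0.5]   # hypothetical features from branch 1
feat_b = [0.5, 1.0, -0.5]   # identical features from branch 2
feat_c = [-0.5, 0.0, 1.0]   # a mismatched location in branch 2
print(correlation(feat_a, feat_b))  # aligned features -> high score (1.5)
print(correlation(feat_a, feat_c))  # mismatched features -> low score (-0.75)
```

In FlowNet this dot product is evaluated between a patch in one feature map and all patches in a neighborhood of the other, producing a matching-cost volume.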

SLIDE 79

Correlation layer

  • The matching score represents how correlated these two feature vectors are

SLIDE 80

Correlation layer

  • Useful for finding image correspondences
  • I. Rocco et al. „Convolutional neural network architecture for geometric matching“. CVPR 2017

Find a transformation from image A to image B

SLIDE 81

Correlation layer

  • I. Rocco et al. „Convolutional neural network architecture for geometric matching“. CVPR 2017
SLIDE 82

Siamese Neural Networks and Similarity Learning

SLIDE 83

Further references

  • Savinov et al. „Quad-networks: unsupervised learning to rank for interest point detection“. CVPR 2017
  • Ristani & Tomasi. „Features for Multi-Target Multi-Camera Tracking and Re-Identification“. CVPR 2018
  • Chen et al. „Beyond triplet loss: a deep quadruplet network for person re-identification“. CVPR 2017