Neural Codes for Image Retrieval
David Stutz
July 22, 2015
Table of Contents
1 Introduction
2 Image Retrieval
    Bag of Visual Words
    Vector of Locally Aggregated Descriptors
◮ Text-based retrieval systems are based on manual annotations;
◮ impractical for large collections of images.
◮ Techniques based on the Bag of Visual Words [SZ03] model.
◮ Histograms;
◮ Bag of Visual Words [SZ03].
Bag of Visual Words: cluster the local descriptors $Y = \bigcup_{n=1}^{N} Y_n$ to obtain visual words $w_1, \ldots, w_L$; each descriptor $y_{l,n}$ is assigned to its nearest visual word $\hat{y}(y_{l,n})$, and an image is represented by the histogram of these assignments.
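The histogram computation can be sketched in a few lines of NumPy (a minimal illustration; the clustering step that produces the visual words, e.g. k-means, is assumed already done, and the function names are illustrative):

```python
import numpy as np

def bovw_histogram(descriptors, words):
    """Bag of Visual Words: assign each local descriptor to its nearest
    visual word and count the assignments into an L-bin histogram."""
    # squared Euclidean distances between descriptors (M x D) and words (L x D)
    d2 = ((descriptors[:, None, :] - words[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)                     # index of nearest visual word
    hist = np.bincount(nearest, minlength=len(words)).astype(float)
    return hist / hist.sum()                        # L1-normalized histogram

# toy example: 2 visual words in 2-D, 3 descriptors
words = np.array([[0.0, 0.0], [10.0, 10.0]])
descs = np.array([[0.1, -0.2], [9.8, 10.1], [10.2, 9.9]])
hist = bovw_histogram(descs, words)
```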
Vector of Locally Aggregated Descriptors: for each visual word $w_l$, sum the residuals of the descriptors assigned to it, $r_l = \sum_{\hat{y}(y_{l,n}) = w_l} (y_{l,n} - w_l)$, and concatenate $r_1, \ldots, r_L$.
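The residual aggregation can be sketched as follows (a minimal NumPy example under the assumption of hard assignment to the nearest visual word; function names are illustrative):

```python
import numpy as np

def vlad(descriptors, words):
    """VLAD: for each visual word, sum the residuals of the descriptors
    assigned to it, then L2-normalize the concatenated vector."""
    d2 = ((descriptors[:, None, :] - words[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)                 # hard assignment to nearest word
    residuals = np.zeros_like(words)
    for i, y in zip(nearest, descriptors):
        residuals[i] += y - words[i]            # accumulate residual y - w_l
    v = residuals.ravel()                       # concatenate r_1, ..., r_L
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

words = np.array([[0.0, 0.0], [10.0, 10.0]])
descs = np.array([[1.0, 0.0], [9.0, 10.0]])
code = vlad(descs, words)
```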
Sparse-Coded Features: encode each descriptor $y_{l,n}$ as a sparse code $r_l$ by solving $\min_{r_l} \|y_{l,n} - D r_l\|_2^2 + \lambda \|r_l\|_1$, and pool the codes component-wise, $F(Y_n) = \left( \max_{1 \leq l \leq L} \{f_1(y_{l,n})\}, \ldots \right)$.
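The sparse encoding step can be illustrated with iterative shrinkage-thresholding (ISTA), one standard solver for this objective; a minimal NumPy sketch (not the specific solver used in [GKS13]):

```python
import numpy as np

def ista(y, D, lam=0.1, step=None, iters=200):
    """Solve min_r ||y - D r||_2^2 + lam * ||r||_1 by iterative
    shrinkage-thresholding (gradient step + soft threshold)."""
    if step is None:
        # a safe step size from the Lipschitz constant of the quadratic term
        step = 1.0 / (2.0 * np.linalg.norm(D, 2) ** 2)
    r = np.zeros(D.shape[1])
    for _ in range(iters):
        grad = 2.0 * D.T @ (D @ r - y)          # gradient of the quadratic term
        z = r - step * grad
        r = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft threshold
    return r

D = np.array([[1.0, 0.0], [0.0, 1.0]])          # toy orthonormal dictionary
y = np.array([1.0, 0.01])
r = ista(y, D, lam=0.1)
```

With an orthonormal dictionary the solution is the soft-thresholded projection, so small components of y are driven exactly to zero.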
◮ compression of image representations;
◮ efficient indexing and nearest-neighbor search [JDS11];
◮ query expansion [CPS+07] and spatial verification [PCI+07].
◮ Unsupervised methods, e.g. Principal Component Analysis (PCA);
◮ or discriminative methods, e.g. Joint Subspace and Classifier Learning.
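The unsupervised option, PCA compression, can be sketched with a plain SVD (a minimal NumPy illustration; function names are illustrative):

```python
import numpy as np

def pca_compress(X, d):
    """Unsupervised dimensionality reduction via PCA: project the
    centered representations onto the top-d principal directions."""
    mean = X.mean(axis=0)
    Xc = X - mean
    # SVD of the centered data; rows of Vt are the principal directions
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:d].T, Vt[:d], mean

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 16))                  # stand-in image representations
Z, components, mean = pca_compress(X, d=4)      # compressed to 4 dimensions
```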
The first layer of a multi-layer perceptron computes $y_i^{(1)} = \sigma\left( \sum_{j=1}^{D} w_{i,j}^{(1)} x_j + w_{i,0}^{(1)} \right)$ with the logistic sigmoid $\sigma(z) = \frac{1}{1 + \exp(-z)}$.
In general, layer $l$ computes $y_i^{(l)} = \sigma\left( \sum_{j=1}^{m^{(l-1)}} w_{i,j}^{(l)} y_j^{(l-1)} + w_{i,0}^{(l)} \right)$ with $\sigma(z) = \frac{1}{1 + \exp(-z)}$.
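The layer-wise forward pass can be sketched directly from this formula (a minimal NumPy example with made-up layer sizes; the bias is written as a separate vector):

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid: 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def layer_forward(y_prev, W, b):
    """One fully-connected layer: y_i = sigma(sum_j w_ij y_prev_j + w_i0)."""
    return sigmoid(W @ y_prev + b)

# toy two-layer forward pass: 3 inputs -> 4 hidden units -> 2 outputs
rng = np.random.default_rng(1)
x = rng.normal(size=3)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
y = layer_forward(layer_forward(x, W1, b1), W2, b2)
```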
Figure : Multi-layer perceptron with layers of $m^{(1)}, \ldots, m^{(L-1)}, m^{(L)}$ units.
◮ Multi-layer perceptrons do not naturally accept images as input; ◮ however, spatial information is important.
A convolutional layer computes the feature maps $Y_i^{(l)} = \sum_{j=1}^{m_1^{(l-1)}} W_{i,j}^{(l)} * Y_j^{(l-1)}$, where the $W_{i,j}^{(l)}$ are filters.
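This sum of per-map convolutions can be sketched explicitly (a minimal NumPy illustration using "valid" cross-correlation, as is common in CNN implementations; no padding or stride):

```python
import numpy as np

def conv2d_valid(Y, W):
    """'Valid' 2-D convolution (implemented as cross-correlation) of one
    feature map Y with one filter W."""
    h, w = W.shape
    H, Wd = Y.shape
    out = np.zeros((H - h + 1, Wd - w + 1))
    for r in range(out.shape[0]):
        for s in range(out.shape[1]):
            out[r, s] = (Y[r:r + h, s:s + w] * W).sum()
    return out

def conv_layer(Y_prev, filters):
    """Convolutional layer: Y_i = sum_j W_ij * Y_j over the input maps.
    Y_prev: (m_in, H, W); filters: (m_out, m_in, h, w)."""
    return np.array([
        sum(conv2d_valid(Y_prev[j], filters[i, j]) for j in range(Y_prev.shape[0]))
        for i in range(filters.shape[0])
    ])

Y_prev = np.ones((2, 5, 5))                     # 2 input feature maps
filters = np.ones((3, 2, 3, 3))                 # 3 output maps, 3x3 filters
Y = conv_layer(Y_prev, filters)                 # all-ones input and filters
```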
A normalization layer normalizes each value $\left(Y_i^{(l)}\right)_{r,s}$ using the responses $\left(Y_j^{(l-1)}\right)_{r,s}$ of all feature maps $j = 1, \ldots, m_1^{(l-1)}$ at the same location; the number of feature maps is unchanged, $m_1^{(l)} = m_1^{(l-1)}$.
A pooling layer reduces the size of the feature maps by
◮ computing the average value within (non-overlapping) windows (average pooling);
◮ or keeping the maximum value of (non-overlapping) windows (max pooling).
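Both pooling variants can be sketched with a single reshape (a minimal NumPy example; assumes the spatial dimensions are divisible by the window size):

```python
import numpy as np

def pool(Y, size=2, mode="max"):
    """Pooling over non-overlapping size x size windows of a feature map."""
    H, W = Y.shape
    blocks = Y.reshape(H // size, size, W // size, size)
    if mode == "max":
        return blocks.max(axis=(1, 3))          # max pooling
    return blocks.mean(axis=(1, 3))             # average pooling

Y = np.array([[1.0, 2.0, 3.0, 4.0],
              [5.0, 6.0, 7.0, 8.0],
              [9.0, 1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0, 7.0]])
```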
Figure : Architecture used by Krizhevsky et al. [KSH12], L = 13.
The output layer uses the softmax activation $y_i = \frac{\exp(z_i)}{\sum_{j=1}^{m^{(L)}} \exp(z_j)}$; the network is trained on $N$ samples $(x_n, t_n)$ by minimizing the negative log-likelihood.
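The softmax output and the per-sample loss can be sketched as follows (a minimal, numerically stable NumPy version; function names are illustrative):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax: exp(z_i) / sum_j exp(z_j)."""
    e = np.exp(z - z.max())                     # shift for numerical stability
    return e / e.sum()

def nll(z, t):
    """Negative log-likelihood of the correct class index t under softmax(z)."""
    return -np.log(softmax(z)[t])

z = np.array([2.0, 1.0, 0.1])
p = softmax(z)
```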
Figure : Back-projection of a single feature activation in the fourth convolutional layer [ZF14].
◮ layer l = 10: last convolutional layer, including subsequent max pooling;
◮ layer l = 11 and l = 12: first and second layer of the three-layer fully-connected part.
The network is
◮ pre-trained on ImageNet1 (∼ 3.2 million images, > 1000 classes);
◮ and re-trained on the Landmark dataset (213,678 images of 672 landmarks).
1Available at http://www.image-net.org/.
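Once neural codes have been extracted from such a layer, retrieval reduces to nearest-neighbor search on L2-normalized codes. A minimal NumPy sketch with random stand-in codes (real codes would come from the network; the 4096-dimensional size matches the fully-connected layers, but all names here are illustrative):

```python
import numpy as np

def retrieve(query_code, database_codes, k=5):
    """Rank database images by cosine similarity of L2-normalized codes."""
    q = query_code / np.linalg.norm(query_code)
    db = database_codes / np.linalg.norm(database_codes, axis=1, keepdims=True)
    sims = db @ q                               # cosine similarities
    return np.argsort(-sims)[:k]                # indices of the top-k matches

rng = np.random.default_rng(2)
db = rng.normal(size=(50, 4096))                # stand-in for neural codes
query = db[7] + 0.01 * rng.normal(size=4096)    # near-duplicate of image 7
top = retrieve(query, db, k=3)
```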
◮ Oxford 5k [PCI+07]: 5,062 images of eleven different landmarks in Oxford;
◮ INRIA Holidays [JDS08]: 1,491 holiday images with 500 distinct scenes.
Figure : Example images from the Oxford 5k dataset showing the All Souls College of the University of Oxford.
Figure : Example images from the INRIA Holidays dataset.
◮ Recall: ratio of true positives to the number of relevant images;
◮ Precision: ratio of true positives to the number of retrieved images.
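Mean average precision (mAP), the measure used in the tables below, averages the average precision over all queries. A minimal sketch of average precision for one query (assuming all relevant images appear in the ranked list; function names are illustrative):

```python
import numpy as np

def average_precision(ranked_relevance):
    """Average precision: mean of precision@k over the ranks k at which
    a relevant image is retrieved. Input: 0/1 relevance in rank order."""
    rel = np.asarray(ranked_relevance, dtype=float)
    if rel.sum() == 0:
        return 0.0
    precision_at_k = np.cumsum(rel) / (np.arange(len(rel)) + 1)
    return float((precision_at_k * rel).sum() / rel.sum())

# relevance of the retrieved list, in rank order (1 = relevant)
ap = average_precision([1, 0, 1, 0, 0])
```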
                                                   Oxford 5k   Holidays
Fisher Vectors [GRPV12]                                    –      0.774
Vector of Locally Aggregated Descriptors [AZ13]        0.555      0.646
Sparse-Coded Features [GKS13]                              –      0.767
Triangulation Embedding [JZ14]                         0.676      0.771
Pre-Trained on ImageNet, l = 10                        0.389      0.690
Pre-Trained on ImageNet, l = 11                        0.435      0.749
Pre-Trained on ImageNet, l = 12                        0.430      0.736
Re-Trained, l = 10                                     0.387      0.674
Re-Trained, l = 11                                     0.545      0.793
Re-Trained, l = 12                                     0.538      0.764

Table : Mean average precision for the Oxford 5k dataset and the Holidays dataset.
                                                   Oxford 5k   Holidays
Fisher Vectors [GRPV12]                                    –      0.723
Fisher Vectors* [GRPV12]                                   –      0.764
Vector of Locally Aggregated Descriptors [AZ13]        0.448      0.625
Sparse-Coded Features [GKS13]                              –      0.727
Triangulation Embedding [JZ14]                         0.433      0.617
Pre-Trained on ImageNet, l = 11 (PCA)                  0.433      0.747
Pre-Trained on ImageNet, l = 11 (Large-Margin)         0.439          –
Re-Trained, l = 11 (PCA)                               0.557      0.789

Table : Mean average precision for the Oxford 5k dataset and the Holidays dataset using 128-dimensional image representations.
Figure : Qualitative examples provided by Babenko et al. [BSCL14]: left-most image is the query; correctly retrieved images are marked.
◮ No experiments using Large Margin Dimensionality Reduction on the re-trained network;
◮ and the results for state-of-the-art approaches are taken from the corresponding publications.
◮ Fully learned features are an interesting alternative to hand-crafted features;
◮ and convolutional neural networks may be explicitly trained for the image retrieval task.
◮ Bag of Visual Words [SZ03];
◮ Vector of Locally Aggregated Descriptors [AZ13];
◮ Sparse-Coded Features [GKS13].
◮ Excellent performance on ImageNet;
◮ but difficult to train or implement.
◮ hierarchical k-means [NS06];
◮ or approximate k-means [PCI+07].
◮ term frequency-inverse document frequency (tf-idf) weighting;
◮ or component-wise square root and L2 normalization.
For each visual word $w_l$, the residuals of the descriptors assigned to it, $\hat{y}(y_{l,n}) = w_l$, are summed; the concatenated vector is post-processed by
◮ power-law normalization (usually, α = 0.5): $f(z_i) = \text{sign}(z_i) |z_i|^{\alpha}$;
◮ intra-normalization: L2-normalize the sum of residuals for each visual word.
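Both post-processing steps are short enough to sketch directly (a minimal NumPy illustration; function names are illustrative):

```python
import numpy as np

def power_law(v, alpha=0.5):
    """Power-law normalization sign(v_i) * |v_i|^alpha, then L2-normalize."""
    v = np.sign(v) * np.abs(v) ** alpha
    return v / np.linalg.norm(v)

def intra_normalize(residuals):
    """Intra-normalization: L2-normalize the summed residuals of each
    visual word before concatenation. residuals: (L, D)."""
    norms = np.linalg.norm(residuals, axis=1, keepdims=True)
    norms[norms == 0] = 1.0                     # leave all-zero rows unchanged
    return (residuals / norms).ravel()

v = power_law(np.array([4.0, -1.0, 0.0]))
r = intra_normalize(np.array([[3.0, 4.0], [0.0, 0.0]]))
```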
◮ Compute ∇E(W[t]) in O(|W|) using Error Backpropagation.
◮ Add a regularizer, e.g. weight decay.
◮ Use dropout [HSK+12] and stochastic gradient descent.
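A single training step combining these ingredients can be sketched as follows (a minimal NumPy illustration assuming an L2 weight-decay regularizer, a common choice, and "inverted" dropout; all names are illustrative):

```python
import numpy as np

def sgd_step(W, grad, lr=0.01, weight_decay=5e-4):
    """One stochastic gradient descent step with an assumed L2
    weight-decay regularizer added to the gradient."""
    return W - lr * (grad + weight_decay * W)

def dropout(y, p=0.5, rng=None):
    """Dropout: zero each activation with probability p, scale the rest
    so the expected activation is preserved ('inverted' dropout)."""
    if rng is None:
        rng = np.random.default_rng(0)
    mask = (rng.random(y.shape) >= p).astype(y.dtype)
    return y * mask / (1.0 - p)

W = np.ones((2, 2))
W_new = sgd_step(W, grad=np.ones((2, 2)), lr=0.1, weight_decay=0.0)
h = dropout(np.ones(1000), p=0.5)
```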
Figure : Back-projection of a single feature activation in layer l = 3 [ZF14]2.
2Note that the architecture used by Zeiler et al. [ZF14] does not exactly match the architecture presented previously.
Figure : Computed image to maximize posterior for classes “goose” (left) and “husky” (right) [SVZ13].
◮ Oxford 5k dataset (including evaluation script);
◮ SIFT, Vector of Locally Aggregated Descriptors and Fisher Vectors implementations;
◮ Caffe: http://caffe.berkeleyvision.org/.