
slide-1
SLIDE 1

Metric Learning for Large-Scale Image Classification:

Generalizing to New Classes at Near-Zero Cost

Florent Perronnin1

work published at ECCV 2012 with:

Thomas Mensink1,2 Jakob Verbeek2 Gabriela Csurka1

1 Xerox Research Centre Europe, 2 INRIA

NIPS BigVision Workshop December 7, 2012

1

slide-2
SLIDE 2

Motivation

Real-life image datasets are always evolving:

  • new images are added every second
  • new labels, tags, faces and products appear over time
  • for example: Facebook, Flickr, Twitter, Amazon, ...

These items need to be annotated for indexing and retrieval. We are therefore interested in large-scale visual classification methods where new images and new classes can be added on the fly at near-zero cost.

2

slide-3
SLIDE 3

Outline

  • 1. Introduction
  • 2. Distance Based Classifiers
  • 3. Metric learning for NCM Classifier
  • 4. Experimental Evaluation
  • 5. Conclusion

3

slide-4
SLIDE 4

Introduction

Recent focus on large-scale image classification

  • ImageNet data set [1]
  • Currently over 14 million images, and 20 thousand classes

Standard large-scale classification pipeline:

  • High dim. features: Super Vector [3] & Fisher Vector [4]
  • Linear 1-vs-Rest SVM classifiers [2,3,4]
  • Stochastic Gradient Descent (SGD) training [3,4]

→ In this work, we take features for granted and focus on the learning problem.

  • 1. Deng et al., ImageNet: A large-scale hierarchical image database, CVPR’09
  • 2. Deng et al., What does classifying 10,000 image categories tell us?, ECCV’10
  • 3. Lin et al., Large-scale image classification: Fast feature extraction, CVPR’11
  • 4. Sánchez and Perronnin, High-dimensional signature compression for large-scale image classification, CVPR’11

4

slide-5
SLIDE 5

Challenges of open-ended datasets

1-vs-Rest + SGD might look ideal for our problem:

  • 1-vs-Rest: classes are trained independently
  • SGD: an online algorithm that can accommodate new data

Still several issues need to be addressed:

  • Given a new sample, feed it to all classifiers?

→ costly and suboptimal [1]

  • How to balance the negatives and positives?
  • How to regularize (and choose the step-size)?

→ We turn to distance-based classifiers.

  • 1. Perronnin et al., Towards good practice in large-scale learning for image classification, CVPR’12

5

slide-6
SLIDE 6

Outline

  • 1. Introduction
  • 2. Distance Based Classifiers
  • 3. Metric learning for NCM Classifier
  • 4. Experimental Evaluation
  • 5. Conclusion

6

slide-7
SLIDE 7

Distance Based Classifiers

Classify based on the distance between images, or between an image and class representatives:

  • k-Nearest Neighbors
  • Nearest Class Mean Classification

Trivial to add new images or new classes. Performance critically depends on the distance function.

7

slide-8
SLIDE 8

k-Nearest Neighbor Classifier

Assign an image i to the most common class among the k closest images from the training set.

✓ Very flexible non-linear model
✓ Easy to integrate new images
✓ Easy to integrate new classes
✗ Expensive at test time!

8


slide-10
SLIDE 10

k-Nearest Neighbor Classifier

Assign an image i to the most common class among the k closest images from the training set.

✓ Very flexible non-linear model
✓ Easy to integrate new images
✓ Easy to integrate new classes
✗ Expensive at test time!

Metric Learning: Large Margin Nearest Neighbors [1]

  • 1. Weinberger et al., Distance Metric Learning for LMNN Classification, NIPS’06
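A minimal NumPy sketch (not the authors' code; function name and signature are illustrative) of this rule under a learned low-rank metric W:

```python
import numpy as np

def knn_predict(x, X_train, y_train, W, k=5):
    """Majority vote among the k training images closest to x under d_W."""
    diff = (X_train - x) @ W.T                  # projected differences, shape (N, m)
    dists = np.einsum('nm,nm->n', diff, diff)   # squared distances d_W(x, x_i)
    nearest = np.argsort(dists)[:k]             # indices of the k closest images
    return int(np.bincount(y_train[nearest]).argmax())
```

Every test image must be compared against all N training images, which is what makes k-NN expensive at test time.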

8

slide-11
SLIDE 11

Nearest Class Mean Classifier

Assign an image x to the class with the closest class mean:

µc = (1/Nc) Σ_{i: yi = c} xi        c∗ = argmin_c d(x, µc)

✓ Very fast at test time: linear model
✓ Easy to integrate new images
✓ Easy to integrate new classes
✗ Class represented only by its mean; perhaps not flexible enough?

9


slide-13
SLIDE 13

Nearest Class Mean Classifier

Assign an image x to the class with the closest class mean:

µc = (1/Nc) Σ_{i: yi = c} xi        c∗ = argmin_c d(x, µc)

✓ Very fast at test time: linear model
✓ Easy to integrate new images
✓ Easy to integrate new classes
✗ Class represented only by its mean; perhaps not flexible enough?

→ We introduce metric learning
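A minimal NumPy sketch (illustrative names, not the authors' implementation) of the NCM rule with a linear projection W; with W = I it is the plain Euclidean NCM, and adding a new class only requires computing its mean:

```python
import numpy as np

def fit_class_means(X, y, num_classes):
    """mu_c = (1/N_c) * sum over {i: y_i = c} of x_i; a new class is just one more mean."""
    means = np.zeros((num_classes, X.shape[1]))
    for c in range(num_classes):
        means[c] = X[y == c].mean(axis=0)
    return means

def ncm_predict(x, means, W):
    """c* = argmin_c ||W x - W mu_c||^2."""
    diff = (means - x) @ W.T        # shape (C, m): W(mu_c - x) for every class
    return int(np.argmin(np.sum(diff ** 2, axis=1)))
```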

9

slide-14
SLIDE 14

Outline

  • 1. Introduction
  • 2. Distance Based Classifiers
  • 3. Metric learning for NCM Classifier
  • 4. Experimental Evaluation
  • 5. Conclusion

10

slide-15
SLIDE 15

Mahalanobis Distance Learning

d(x, x′) = (x − x′)⊤ M (x − x′)        dW(x, x′) = ||Wx − Wx′||₂²

  • 1. M = I: Euclidean distance
      • Likely to be suboptimal
  • 2. M of size D × D: full Mahalanobis distance
      • Huge number of parameters for large D
      • Expensive computation of distances: O(D²)
  • 3. M = W⊤W: low-rank projection W of size m × D
      • Controllable number of parameters: m × D
      • Allows compression of images to only m dimensions
      • Cheap computation of distances: O(m²) (see the sketch below)
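A small NumPy sketch (random stand-ins for the features, illustrative sizes) of why the low-rank parameterization is cheap: images are projected once with W, after which dW is an ordinary Euclidean distance in m dimensions, and it coincides with the full Mahalanobis form M = W⊤W:

```python
import numpy as np

rng = np.random.default_rng(0)
D, m = 1024, 128
W = rng.standard_normal((m, D)) / np.sqrt(D)   # low-rank projection (learned in practice)
x, x2 = rng.standard_normal(D), rng.standard_normal(D)

# Low-rank form: project once, then an ordinary m-dimensional squared Euclidean distance.
d_low_rank = np.sum((W @ x - W @ x2) ** 2)

# Equivalent full Mahalanobis form with M = W^T W (quadratic in D per distance).
M = W.T @ W
diff = x - x2
d_full = diff @ M @ diff

assert np.allclose(d_low_rank, d_full)
```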

11


slide-18
SLIDE 18

NCM Metric Learning (NCMML)

Probabilistic formulation using the soft-min function:

p(c|x) = exp(−dW(x, µc)) / Σ_{c′=1..C} exp(−dW(x, µc′))

This corresponds to the class posterior in a generative model with p(x|c) = N(x; µc, Σ) and a shared covariance matrix Σ.

Crucial point: the parameters W and {µc, c = 1, . . . , C} can be learned independently, on different data subsets.
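A short derivation, not on the slide, making this correspondence explicit (assuming equal class priors, so the shared Gaussian normalizers cancel):

```latex
p(c \mid x)
  = \frac{\mathcal{N}(x;\mu_c,\Sigma)}{\sum_{c'=1}^{C}\mathcal{N}(x;\mu_{c'},\Sigma)}
  = \frac{\exp\!\big(-\tfrac{1}{2}(x-\mu_c)^\top\Sigma^{-1}(x-\mu_c)\big)}
         {\sum_{c'=1}^{C}\exp\!\big(-\tfrac{1}{2}(x-\mu_{c'})^\top\Sigma^{-1}(x-\mu_{c'})\big)}
% a soft-min over Mahalanobis distances; taking M = W^\top W = \tfrac{1}{2}\Sigma^{-1}
% recovers p(c|x) \propto \exp(-d_W(x,\mu_c)).
```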

12

slide-19
SLIDE 19

NCM Metric Learning (NCMML)

Discriminative maximum likelihood training:

  • We maximize the log-likelihood with respect to W:

      L(W) = Σ_{i=1..N} ln p(yi|xi)

  • Implicit regularization through the rank of W

Stochastic Gradient Descent (SGD): at time t

  • Pick a random sample (xt, yt)
  • Update: W(t) = W(t−1) + ηt ∇W ln p(yt|xt), with the gradient evaluated at W = W(t−1)

→ mini-batches are more efficient (a sketch of one step follows below)
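A minimal NumPy sketch of one such step (not the authors' code; a single sample rather than a mini-batch, illustrative names), using the gradient of ln p(y|x) implied by the soft-min formulation:

```python
import numpy as np

def ncmml_sgd_step(W, x, y, means, lr):
    """One ascent step on ln p(y|x): W (m, D), x (D,), y int label, means (C, D)."""
    diff = means - x                          # rows are (mu_c - x), shape (C, D)
    proj = diff @ W.T                         # W(mu_c - x), shape (C, m)
    d = np.einsum('cm,cm->c', proj, proj)     # d_W(x, mu_c) = ||W(x - mu_c)||^2
    p = np.exp(-(d - d.min()))                # soft-min posterior, shifted for stability
    p /= p.sum()
    coeff = p.copy()
    coeff[y] -= 1.0                           # p(c|x) - [c == y]
    # grad_W ln p(y|x) = sum_c (p(c|x) - [c == y]) * 2 W (x - mu_c)(x - mu_c)^T
    grad = 2.0 * W @ ((diff.T * coeff) @ diff)
    return W + lr * grad                      # W(t) = W(t-1) + eta_t * gradient
```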

13

slide-20
SLIDE 20

Illustration of Learned Distances

14


slide-22
SLIDE 22

Relationship to FDA

Three non-linearly separable classes

15

slide-23
SLIDE 23

Relationship to FDA

Fisher Discriminant Analysis: maximizes variance between all class means

15

slide-24
SLIDE 24

Relationship to FDA

NCMML: maximizes variance between nearby class means

15

slide-25
SLIDE 25

Relation to other linear classifiers

Each classifier scores class c with a linear function fc(x) = bc + wc⊤x:

Linear SVM

  • Learn {bc, wc} per class

WSABIE [1]

  • wc = vcW, with a shared projection W ∈ Rd×D
  • Learn {vc} per class and the shared W

Nearest Class Mean

  • bc = ||Wµc||₂², wc = −2 µc⊤W⊤W
  • Learn only the shared W (see the expansion below)
  • 1. Weston et al., Scaling up to large vocabulary image annotation, IJCAI’11
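The NCM line follows from expanding the squared distance and dropping the term that does not depend on c; a short derivation using only the definitions above:

```latex
d_W(x,\mu_c) = \lVert Wx - W\mu_c \rVert_2^2
             = \lVert Wx \rVert_2^2 - 2\,\mu_c^\top W^\top W x + \lVert W\mu_c \rVert_2^2
% The first term is constant over classes, so
\arg\min_c d_W(x,\mu_c) = \arg\min_c \big( b_c + w_c^\top x \big),
\quad b_c = \lVert W\mu_c \rVert_2^2, \quad w_c^\top = -2\,\mu_c^\top W^\top W .
```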

16

slide-26
SLIDE 26

Outline

  • 1. Introduction
  • 2. Distance Based Classifiers
  • 3. Metric learning for NCM Classifier
  • 4. Experimental Evaluation
  • 5. Conclusion

17

slide-27
SLIDE 27

Experimental Evaluation

Data sets:

  • ILSVRC’10: 1,000 classes; 1.2M training, 50K validation and 150K test images
  • INET10K: ≈10K classes; 4.5M training, 50K validation and 4.5M test images

Features:

  • 4K and 64K dimensional Fisher Vectors [1]
  • PQ Compression on 64K features [2]
  • 1. Perronnin et al., Improving the Fisher kernel for image classification, ECCV’10
  • 2. Jégou et al., Product quantization for nearest neighbor search, PAMI’11

18

slide-28
SLIDE 28

Evaluation: ILSVRC’10 (Top 5 acc.)

k-NN and NCM both improve with metric learning; NCM outperforms the more flexible k-NN.

4K Fisher Vectors; columns give the projection dimensionality (ℓ2 = Euclidean distance, no learned metric):

  Method                       256    512   1024     ℓ2
  k-NN, LMNN [1] (dynamic)    61.0   60.9   59.6   44.1
  NCM, learned metric         62.6   63.0   63.0   32.0

  • 1. Weinberger et al., Distance Metric Learning for LMNN Classification, NIPS’06

19

slide-29
SLIDE 29

Evaluation: ILSVRC’10 (Top 5 acc.)

k-NN and NCM both improve with metric learning; NCM outperforms the more flexible k-NN and is competitive with SVM and WSABIE.

4K Fisher Vectors; columns give the projection dimensionality (ℓ2 = Euclidean distance, no learned metric):

  Method                       256    512   1024     ℓ2
  k-NN, LMNN [1] (dynamic)    61.0   60.9   59.6   44.1
  NCM, learned metric         62.6   63.0   63.0   32.0
  WSABIE [2]                  61.6   61.3   61.5      -

  Baseline: 1-vs-Rest SVM: 61.8

  • 1. Weinberger et al., Distance Metric Learning for LMNN Classification, NIPS’06
  • 2. Weston et al., Scaling up to large vocabulary image annotation, IJCAI’11

19

slide-30
SLIDE 30

Generalization on INET10K (Top 1 acc.)

Nearest Class Mean Classifier

  • Compute means of 10K classes, in about 1 CPU hour
  • Re-use metric learned on ILSVRC’10

1-vs-Rest SVM baseline

  • Train 10K SVM classifiers, in about 280 CPU days

20

slide-31
SLIDE 31

Generalization on INET10K (Top 1 acc.)

Nearest Class Mean Classifier

  • Compute means of 10K classes, in about 1 CPU hour
  • Re-use metric learned on ILSVRC’10

1-vs-Rest SVM baseline

  • Train 10K SVM classifiers, in about 280 CPU days
Flat top-1 accuracy (%):

  Method        NCM    SVM   SVM [1]   SVM [2]   DL [3]
  Feat. dim.    64K    64K     21K      128K     ≈60K
  Flat top-1   13.9   21.9     6.4      19.1     19.2

  • 1. Deng et al., What does classifying 10,000 image categories tell us?, ECCV’10
  • 2. Perronnin et al., Good practice in large-scale image classification, CVPR’12
  • 3. Le et al., Building high-level features using large scale unsupervised learning, ICML’12

20

slide-32
SLIDE 32

Transfer Learning - Zero-Shot Prior

Use the ImageNet class hierarchy to estimate a prior mean for a new class [1].

[Figure: hierarchy with internal nodes, training classes and the new class]

  • 1. Rohrbach et al., Evaluating knowledge transfer and zero-shot learning in a large-scale setting, CVPR’11

21


slide-35
SLIDE 35

Transfer Learning - Results ILSVRC’10

Step 1: metric learning on 800 classes.
Step 2: estimate the means of the remaining 200 classes, used for evaluation, in two ways:

  • Data mean (maximum likelihood)
  • Zero-shot prior + data mean (maximum a posteriori); a standard Gaussian-prior form is sketched below
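A hedged sketch of how such a prior can be combined with the data: the standard conjugate-Gaussian form below (with prior mean µ0 from the zero-shot estimate and variances σ0², σ²) is an assumption for illustration, not necessarily the exact estimator used in the paper:

```latex
% Prior \mu_c \sim \mathcal{N}(\mu_0, \sigma_0^2 I), samples x_1,\dots,x_n \sim \mathcal{N}(\mu_c, \sigma^2 I):
\hat{\mu}_c^{\text{MAP}}
  = \frac{n/\sigma^2}{\,n/\sigma^2 + 1/\sigma_0^2\,}\;\bar{x}
  + \frac{1/\sigma_0^2}{\,n/\sigma^2 + 1/\sigma_0^2\,}\;\mu_0,
\qquad \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i
% With few samples the prior dominates; with many samples the estimate approaches the data mean.
```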

[Figure: top-5 accuracy vs. number of samples per class (1 to 1000), comparing the two mean estimates]

22

slide-36
SLIDE 36

Outline

  • 1. Introduction
  • 2. Distance Based Classifiers
  • 3. Metric learning for NCM Classifier
  • 4. Experimental Evaluation
  • 5. Conclusion

23

slide-37
SLIDE 37

Conclusion

Nearest Class Mean (NCM) classification:

  • We proposed NCM metric learning (NCMML)
  • It outperforms k-NN and is on par with SVM and WSABIE

Advantages of NCM over alternatives:

  • Allows adding new images and new classes at near-zero cost
  • Shows competitive results on unseen classes
  • Can benefit from class priors for small sample sizes

Further improvements: an extension using multiple class centroids [1]

  • 1. Mensink et al., Large Scale Metric Learning for Distance-Based Image Classification, tech report, 2012

24

slide-38
SLIDE 38

Metric Learning for Large-Scale Image Classification:

Generalizing to New Classes at Near-Zero Cost

Florent Perronnin1

work published at ECCV 2012 with:

Thomas Mensink1,2 Jakob Verbeek2 Gabriela Csurka1

1 Xerox Research Centre Europe, 2 INRIA

NIPS BigVision Workshop December 7, 2012

25