SLIDE 1

Feature Selection in Image and Video Recognition

Jianxin Wu
National Key Laboratory for Novel Software Technology, Nanjing University
http://lamda.nju.edu.cn

VALSE webinar, May 27, 2015

SLIDE 2

Introduction

For image classification, how to represent an image? With:

  • strong discriminative power, and
  • manageable storage and CPU costs

SLIDE 3

Bag of words

• Dense sampling
• Extract a visual descriptor (e.g., SIFT or CNN) at every sample location; usually apply PCA to reduce dimensionality
• Learn a visual codebook by k-means

SLIDE 4

The VLAD pipeline

• $L$ code words $\boldsymbol{d}_j \in \mathbb{R}^E$
• Pooling: $\boldsymbol{g}_j = \sum_{\boldsymbol{y} \in \boldsymbol{d}_j} (\boldsymbol{y} - \boldsymbol{d}_j)$, where $\boldsymbol{y} \in \boldsymbol{d}_j$ denotes the descriptors assigned to code word $\boldsymbol{d}_j$
• Concatenation: $[\boldsymbol{g}_1 \, \boldsymbol{g}_2 \cdots \boldsymbol{g}_L]$
• Dimensionality: $E \times L$

Jegou et al. Aggregating local image descriptors into compact codes. TPAMI, 2012.
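
A minimal NumPy sketch of this encoding, assuming hard assignment to the nearest code word and a precomputed k-means codebook; the function name and the sizes in the usage line are illustrative:

```python
import numpy as np

def vlad_encode(descriptors, codebook):
    """VLAD: sum the residuals (y - d_j) of descriptors assigned to each code word.

    descriptors: (N, E) local descriptors of one image
    codebook:    (L, E) k-means centers d_1 ... d_L
    returns:     (E * L,) concatenated vector [g_1 g_2 ... g_L]
    """
    # Hard-assign each descriptor to its nearest code word
    dists = np.linalg.norm(descriptors[:, None, :] - codebook[None, :, :], axis=2)
    assign = dists.argmin(axis=1)                        # (N,)

    L, E = codebook.shape
    g = np.zeros((L, E))
    for j in range(L):
        members = descriptors[assign == j]
        if len(members) > 0:
            g[j] = (members - codebook[j]).sum(axis=0)   # residual pooling
    return g.reshape(-1)                                 # dimensionality E x L

# Usage with random data (illustrative sizes: L = 256, E = 64)
rng = np.random.default_rng(0)
x = vlad_encode(rng.standard_normal((1000, 64)), rng.standard_normal((256, 64)))
print(x.shape)  # (16384,)
```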

SLIDE 5

Effect of High Dimensionality

• Blessing
  • Fisher Vector: $L \times (2E + 1)$
  • Super Vector: $L \times (E + 1)$
  • State-of-the-art results in many application domains
• Curse
  • 1 million images
  • 8 spatial pyramid regions
  • $L = 256$, $E = 64$, 4 bytes to store a floating-point number
  • 1056 GB!

J. Sanchez et al. Image classification with the Fisher vector: Theory and practice. IJCV, 2013.
X. Zhou et al. Image classification using super-vector coding of local image descriptors. ECCV, 2010.
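
The 1056 GB figure follows directly from the numbers on this slide, assuming the Fisher Vector dimensionality $L \times (2E + 1)$ per spatial region:

\[
256 \times (2 \times 64 + 1) \times 8 = 264{,}192 \ \text{dimensions per image}, \qquad
264{,}192 \times 4 \ \text{bytes} \times 10^6 \ \text{images} \approx 1056 \ \text{GB}.
\]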

SLIDE 6

Solution?

• Use fewer examples / dimensions?
  • Reduces accuracy quickly
• Feature compression
  • Introduced soon
• Feature selection
  • This talk

SLIDE 7

To compress?

Methods in the literature: feature compression. Compress the long feature vectors so that:

  • much fewer bytes are needed to store them
  • (possibly) faster learning

SLIDE 8

Product Quantization illustration

• For every 8 dimensions:
  1. Generate a codebook with 256 words
  2. VQ an 8-d sub-vector (32 bytes) into an index (1 byte)
• On-the-fly decoding:
  1. Get stored index $j$
  2. Expand into the 8-d code word $\boldsymbol{d}_j$
• Does not change learning time

Jegou et al. Product quantization for nearest neighbor search. TPAMI, 2011.
Vedaldi & Zisserman. Sparse kernel approximations for efficient classification and detection. CVPR, 2012.
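
A rough sketch of the encode / decode steps, assuming the per-block codebooks (256 words per 8 dimensions) have already been learned by k-means; function names and the sizes in the usage lines are illustrative:

```python
import numpy as np

def pq_encode(x, codebooks):
    """Product quantization: store one byte (a codeword index) per 8-d block.

    x:         (D,) feature vector, D divisible by 8
    codebooks: list of (256, 8) arrays, one codebook per block (learned by k-means)
    returns:   (D // 8,) uint8 indices
    """
    blocks = x.reshape(-1, 8)
    codes = np.empty(len(blocks), dtype=np.uint8)
    for b, (block, cb) in enumerate(zip(blocks, codebooks)):
        codes[b] = np.argmin(np.linalg.norm(cb - block, axis=1))
    return codes

def pq_decode(codes, codebooks):
    """On-the-fly decoding: expand each stored index back into its 8-d code word."""
    return np.concatenate([cb[j] for j, cb in zip(codes, codebooks)])

# Usage: 64 dims x 4 bytes = 256 bytes become 8 index bytes (32x compression)
rng = np.random.default_rng(0)
codebooks = [rng.standard_normal((256, 8)) for _ in range(8)]
codes = pq_encode(rng.standard_normal(64), codebooks)
approx = pq_decode(codes, codebooks)
print(codes.nbytes, approx.shape)  # 8 (64,)
```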

SLIDE 9

Thresholding

• A simple idea: $y \leftarrow \begin{cases} -1, & y < 0 \\ +1, & y \geq 0 \end{cases}$
• 32 times compression (one bit replaces a 32-bit float)
• Works surprisingly well!
• But, why?

Perronnin et al. Large-scale image retrieval with compressed Fisher vectors. CVPR, 2010.

SLIDE 10

Bilinear projections (BPBC)

• FV or VLAD requires a rotation
  • A large matrix times the long vector
• Bilinear projection + binary feature
  • Example: reshape the $LE$-dimensional vector $\boldsymbol{y}$ into an $L \times E$ matrix $Y$
  • Bilinear projection / rotation: $\mathrm{sgn}(S_1^\top Y S_2)$
  • $S_1$: $L \times L$, $S_2$: $E \times E$
  • Smaller storage and faster computation than PQ
• But, learning $S_1$ and $S_2$ is very time consuming (circulant?)

Gong et al. Learning binary codes for high-dimensional data using bilinear projections. CVPR, 2013.
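
A small sketch of the bilinear binary code, assuming $S_1$ and $S_2$ are already available (random orthogonal matrices here; the method learns them); names are illustrative:

```python
import numpy as np

def bpbc_encode(y, S1, S2, L, E):
    """Bilinear binary code: sgn(S1^T Y S2) on the reshaped vector.

    y:  (L * E,) FV / VLAD vector
    S1: (L, L) rotation, S2: (E, E) rotation
    returns: (L * E,) vector of +1 / -1
    """
    Y = y.reshape(L, E)            # reshape the long vector into an L x E matrix
    B = np.sign(S1.T @ Y @ S2)     # two small rotations instead of one (LE x LE) rotation
    B[B == 0] = 1                  # treat exact zeros as +1, matching y >= 0 -> +1
    return B.reshape(-1)

# Usage with random orthogonal S1, S2 (illustrative)
rng = np.random.default_rng(0)
L, E = 256, 64
S1, _ = np.linalg.qr(rng.standard_normal((L, L)))
S2, _ = np.linalg.qr(rng.standard_normal((E, E)))
bits = bpbc_encode(rng.standard_normal(L * E), S1, S2, L, E)
print(bits.shape)  # (16384,)
```

Storing $S_1$ and $S_2$ costs $L^2 + E^2$ numbers instead of $(LE)^2$ for one large rotation, which is where the storage and speed advantage comes from.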

SLIDE 11

The commonality

• Linear projection!
  • New features are linear combinations of multiple dimensions from the original vector
• What does this mean?
  • It assumes strong multicollinearity exists!
• Is this true in reality?

SLIDE 12

Collinearity and multicollinearity

Examining real data, we find that:

  • collinearity almost never exists
  • it is too expensive to examine the existence of multicollinearity, but we have something to say

SLIDE 13

Collinearity

• Existence of strong linear dependencies between two dimensions in the VLAD / FV vector
• Pearson’s correlation coefficient
  $s = \dfrac{\boldsymbol{y}_{:j}^\top \boldsymbol{y}_{:k}}{\|\boldsymbol{y}_{:j}\| \, \|\boldsymbol{y}_{:k}\|}$
  where $\boldsymbol{y}_{:j}$ collects the (mean-subtracted) $j$-th dimension over all examples
  • $s = \pm 1$: perfect collinearity
  • $s = 0$: no linear dependency at all
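
A sketch of such a check, computing $s$ for randomly chosen pairs of dimensions over a matrix of FV / VLAD vectors (NumPy only; names are illustrative):

```python
import numpy as np

def pearson_pairs(X, n_pairs=10000, seed=0):
    """Estimate collinearity: Pearson's s for randomly chosen pairs of dimensions.

    X: (n_images, D) matrix of FV / VLAD vectors
    returns: (n_pairs,) correlation coefficients
    """
    rng = np.random.default_rng(seed)
    Xc = X - X.mean(axis=0)                         # mean-subtract each dimension
    j = rng.integers(0, X.shape[1], size=n_pairs)
    k = rng.integers(0, X.shape[1], size=n_pairs)
    num = (Xc[:, j] * Xc[:, k]).sum(axis=0)
    den = np.linalg.norm(Xc[:, j], axis=0) * np.linalg.norm(Xc[:, k], axis=0)
    return num / np.maximum(den, 1e-12)

# Usage: |s| close to 1 would indicate collinearity between the two dimensions
s = pearson_pairs(np.random.default_rng(1).standard_normal((500, 4096)))
print(np.abs(s).max())
```
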
SLIDE 14

Three types of checks

(Figure: layout of the FV vector, with 8 spatial regions, each holding words 1 … K of dimensions 1 … D)

1. Random pair of dimensions
2. Pair within the same spatial region
3. Pair within the same code word / Gaussian component (across all regions)

SLIDE 15

• Same Gaussian shows a little stronger correlation
• Mostly, no correlation at all!

SLIDE 16

From 2 to many

• Multicollinearity: strong linear dependency among > 2 dimensions
• Given the absence of collinearity, the chance of multicollinearity is also small
• PCA is essential for FV and VLAD
  • Dimensions after PCA are uncorrelated
• Thus, we should choose, not compress!

SLIDE 17

MI-based feature selection

A simple mutual-information-based importance sorting algorithm to choose features:

  • computationally very efficient
  • when the selection ratio changes, no need to repeat the computation
  • highly accurate

SLIDE 18

Yes, to choose!

• Choosing is better than compressing
  • Given that multicollinearity is absent
• Cannot afford expensive feature selection
  • Features are too big to fit into memory
  • Complex algorithms take too long

SLIDE 19

Usefulness measure

• Mutual information: $J(\boldsymbol{y}, \boldsymbol{z}) = I(\boldsymbol{y}) + I(\boldsymbol{z}) - I(\boldsymbol{y}, \boldsymbol{z})$
  • $I$: entropy
  • $\boldsymbol{y}$: one dimension
  • $\boldsymbol{z}$: image label vector
• Selection
  • Sort all MI values, choose the top $E'$
  • Only one pass over the data
  • No additional work if $E'$ changes

SLIDE 20

Entropy computation

• Too expensive using complex methods
  • e.g., kernel density estimation
• Use discrete quantization
  • 1-bit: $y \leftarrow \begin{cases} -1, & y < 0 \\ +1, & y \geq 0 \end{cases}$
  • N-bins: uniformly quantize into $N$ bins
  • 1-bit and 2-bins are different (1-bit always thresholds at 0; 2-bins splits the value range uniformly)
  • Discrete entropy: $I = -\sum_k q_k \log_2 q_k$
  • Larger $N$, bigger $I$ value
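
A sketch combining slides 19 and 20: 1-bit quantization, discrete entropies, and importance sorting by mutual information (NumPy only; function and variable names are illustrative):

```python
import numpy as np

def mi_select(X, labels, n_keep):
    """Importance sorting by mutual information, computed in one pass over the data.

    X:      (n_images, D) FV / VLAD features
    labels: (n_images,) integer class labels
    returns: indices of the n_keep most useful dimensions
    """
    y = (X >= 0).astype(np.int64)            # 1-bit quantization, threshold at 0
    z = np.asarray(labels)
    classes = np.unique(z)

    def entropy(p):
        p = p[p > 0]
        return -(p * np.log2(p)).sum()

    n = len(z)
    Iz = entropy(np.bincount(z).astype(float) / n)            # I(z), shared by all dimensions
    mi = np.empty(X.shape[1])
    for d in range(X.shape[1]):
        q = np.bincount(y[:, d], minlength=2).astype(float) / n
        Iy = entropy(q)                                        # I(y)
        joint = np.array([[np.mean((y[:, d] == b) & (z == c)) for c in classes]
                          for b in (0, 1)]).ravel()
        Iyz = entropy(joint)                                   # I(y, z), joint entropy
        mi[d] = Iy + Iz - Iyz                                  # J(y, z)

    return np.argsort(mi)[::-1][:n_keep]                       # top E' dimensions

# Usage with random data (illustrative sizes)
rng = np.random.default_rng(0)
keep = mi_select(rng.standard_normal((200, 512)), rng.integers(0, 5, 200), n_keep=64)
print(keep[:10])
```

Because the scores are computed once and then sorted, changing $E'$ only changes where the sorted list is cut.
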
SLIDE 21

• Most features are not useful
• Choosing a small subset is not only for speed or scalability, but also for accuracy!
• 1-bit >> 4/8 bins: keeping the threshold at 0 is important!

SLIDE 22

The pipeline

1. Generate an FV / VLAD vector
2. Keep only the chosen $E'$ dimensions
3. Further quantize the $E'$ dimensions into $E'$ bits

• Compression ratio: $\dfrac{32E}{E'}$
• Store 8 bits in a byte
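A sketch of this pipeline, assuming the selected indices come from the MI-based sorting above; with $E = 262{,}144$ and $E' = 4{,}096$ the compression ratio is $32E / E' = 2048$:

```python
import numpy as np

def compress(x, keep_idx):
    """Select the chosen dimensions, binarize at 0, and pack 8 bits into each byte.

    x:        (D,) FV / VLAD vector (32-bit floats)
    keep_idx: indices of the E' selected dimensions (e.g., from MI-based sorting)
    returns:  packed bytes of length ceil(E' / 8)
    """
    bits = (x[keep_idx] >= 0).astype(np.uint8)   # 1-bit quantization, threshold at 0
    return np.packbits(bits)                     # 8 bits stored in each byte

# Usage: E = 262,144 original 32-bit dims, E' = 4,096 kept bits -> 512 bytes per image
rng = np.random.default_rng(0)
x = rng.standard_normal(262_144).astype(np.float32)
code = compress(x, np.arange(4_096))
print(code.nbytes)  # 512
```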

SLIDE 23

Image Results

  • Much faster in feature dimensionality reduction and learning
  • Requires almost no extra storage
  • In general, significantly higher accuracy at the same compression ratio

SLIDE 24

Features

• Use the Fisher Vector
• D = 64 (128-dim SIFT, reduced by PCA)
• K = 256
• Use the mean and variance parts
• 8 spatial regions
• Total dimensionality: 256 × 64 × 2 × 8 = 262,144

SLIDE 25

VOC2007: accuracy

• #classes: 20
• #training: 5,000
• #testing: 5,000

SLIDE 26

ILSVRC2010: accuracy

• #classes: 1,000
• #training: 1,200,000
• #testing: 150,000

SLIDE 27

SUN397: accuracy

• #classes: 397
• #training: 19,850
• #testing: 19,850

SLIDE 28

Fine-Grained Categorization

Selecting features is more important

SLIDE 29

Selection of subtle differences?

SLIDE 30

What features (parts) are chosen?

SLIDE 31

SLIDE 32

SLIDE 33

How about accuracy?

SLIDE 34

Published results

• Compact Representation for Image Classification: To Choose or to Compress? Yu Zhang, Jianxin Wu, Jianfei Cai. CVPR 2014.
• Towards Good Practices for Action Video Encoding. Jianxin Wu, Yu Zhang, Weiyao Lin. CVPR 2014.

SLIDE 35

New methods & results in arXiv

• VOC 2012: 90.7%, VOC 2007: 92.0%
  • http://host.robots.ox.ac.uk:8080/leaderboard/displaylb.php?challengeid=11&compid=2
  • http://arxiv.org/abs/1504.05843
• SUN 397: 61.83%
  • http://arxiv.org/abs/1504.05277
  • http://arxiv.org/abs/1504.04792
• Details of fine-grained categorization
  • http://arxiv.org/abs/1504.04943

SLIDE 36

DSP

• An intuitive, principled, efficient, and effective image representation for image recognition
  • Using only the convolutional layers of a CNN
• Very efficient, but impressive representational power
  • No fine-tuning at all
  • Extremely small but effective FV / VLAD encoding (K=1 or 2)
  • Small memory footprint
• New normalization strategy
  • Matrix norm to utilize global information
• Spatial pyramid
  • Natural and principled way to integrate spatial information

SLIDE 37

D3

• Discriminative Distribution Distance
  • FV, VLAD and Super Vectors are generative representations
  • They ask “how is one set generated?”
  • But for image recognition, we care about “how are two sets separated?”
  • Proposed a directional distribution distance to compare two sets
  • Proposed using a classifier (MPM) to robustly estimate the distance
  • D3 is very stable
  • D3 is very efficient

SLIDE 38

Multiview image representation

• Using DSP as the global view
• But context is also important: what is the neighborhood structure?
  • Solving distance metric learning as a DNN
  • Called the label view
• Integrated (global + label) views
  • 90.7% @ VOC2012 recognition task
  • 92.0% @ VOC2007 recognition task

SLIDE 39

Thanks!