Semantic Image Indexing and Retrieval
Source: H. Jegou
Outline
- State of the nation
- Early description methods
- Local features detection and description
- Matching local features
- Large scale image retrieval
- Image classification
Context
- Available online video content is ever increasing
– 35 hours of video uploaded every minute
– 2 billion videos watched per day
– 20 million videos uploaded each month
– 2+ billion videos watched per month
– Archived TV content is growing
- 1.5 million hours = 120 km of shelves
- 300000 hours | 1 petabyte/year
Content is difficult to search and reuse, and barely visible to search engines
Context
- Query by example is reaching the general public
Context
- Other fields are equally popular
- Visual data-mining
- Security applications
- Advertising
- Augmented reality
- Assistive technologies
- Autonomous vehicles
- Landmark recognition
Why is it so difficult to find appropriate multimedia content, to reuse it and to present this content in interfaces that vary with user needs?
[Figure: the semantic gap — human vs. machine labeling of "tree", "dog", "car"]
The semantic gap is the lack of coincidence between the information that one can extract from the sensory data and the interpretation that the same data has for a user in a given situation [Arnold Smeulders, PAMI, 2000]
Automatic semantics extraction
- Generic principle : workflow
pixels → Low-level visual description → Image- & mid-level feature aggregation → Learning & Classification → knowledge
(e.g. sky, Greek guard, my ex-boss, Greek Parliament)
Introduction
Properties of ideal features
- Local: features are local, so robust to occlusion and clutter (no prior segmentation)
- Invariant (or covariant)
- Robust: noise, blur, discretization, compression, etc. do not have a big impact on the feature
- Distinctive: individual features can be matched to a large database of objects
- Quantity: many features can be generated even for small objects
- Accurate: precise localization
- Efficient: close to real-time performance
Any other philosophical aspects? Let’s get to business.
Color: color space, color quantization, scalable color (Haar transform representation), color structure (histogram of structuring elements), GoF/GoP color histograms, dominant colors, color layout (DCT-based)
Texture: gradient orientation histogram, homogeneous texture (multi-resolution Gabor filters), texture browsing (Tamura features)
Shape: region shape 2D (Angular Radial Transform), contour shape 2D (curvature scale space), 3D shape (3D shape spectrum)
Motion: parametric motion, motion trajectory, camera motion (complete 3D camera model), motion activity
Others: face recognition (eigenfaces)
- More visual descriptors …
MPEG-7 visual descriptors
Any other global appearance descriptors?
- We need to establish a correspondence between the
target image and other images in the model database
[Figure: model image vs. target image (Lowe, 1999)]
How can we recognize specific objects?
- Object class recognition
How can we recognize specific objects?
Find these landmarks ... in these images and 1M others
How can we recognize specific objects?
- We need to establish a correspondence between the
query image and all images from the database depicting the same object/scene
Query image Database image
Solution?
Source: J.Sivic
- Finding the object despite possibly large changes in
scale, viewpoint, lighting and partial occlusion
- Viewpoint
- Scale
- Occlusion
- Lighting
Main challenges?
Application
https://www.youtube.com/watch?v=Hhgfz0zPmH4
Application
https://www.youtube.com/watch?v=w95kwXy_MOY&feature=youtu.be&t=26m30s
- Global representations have major limitations
- Instead, describe and match only local regions
- Increased robustness to
– Occlusions
– Articulation
– Intra-category variations
[Figure: matching geometry — query angles/distances (θq, φ, dq) vs. database angles/distances (θ, φ, d)]
Local features
Motivation
1) Detection: identify the interest points
2) Description: extract vector feature descriptor around each point of interest
3) Matching: determine correspondence between descriptors in two views
Source: K.Grauman
Local features
Main components
- Interest operator repeatability
– We want to detect at least the same points in both images, while running the detection procedure independently per image
Cannot find true matches here
Source: K.Grauman
We need a repeatable detector
Local features
Desired features
- Descriptor distinctiveness
– We want to be able to reliably determine which point goes with which
– We must provide some invariance to geometric and photometric differences between the two views
Source: K.Grauman
We need a reliable and distinctive detector
Local features
Desired features
1) Detection: identify the interest points
2) Description: extract vector feature descriptor around each point of interest
3) Matching: determine correspondence between descriptors in two views
Source: K.Grauman
Local features
Main components
What parts would you choose?
Let’s try the corners
Finding corners
- Key property: in the region around a corner, image
gradient has two or more dominant directions
- Corners are repeatable and distinctive
Source: L. Lazebnik
Many existing detectors available
- Hessian & Harris [Beaudet '78], [Harris '88]
- Laplacian, DoG [Lindeberg '98], [Lowe '99]
- Harris-/Hessian-Laplace [Mikolajczyk & Schmid '01]
- Harris-/Hessian-Affine [Mikolajczyk & Schmid '04]
- EBR and IBR [Tuytelaars & Van Gool '04]
- MSER [Matas '02]
- Salient Regions [Kadir & Brady '01]
- Others…
- We should easily recognize the point by looking through a
small window
- Shifting a window in any direction should give a large
change in intensity
Harris detector – basic idea
How to find corners?
Source: K. Grauman
“edge”: no change along the edge direction
“corner”: significant change in all directions
“flat” region: no change in any direction
Harris detector – basic idea
How to find corners?
Source: R. Szeliski
Window-averaged change of intensity induced by shifting the image data by [u, v]:

E(u,v) = \sum_{x,y} w(x,y) \, [\, I(x+u,\, y+v) - I(x,y) \,]^2

where I(x+u, y+v) is the shifted intensity, I(x,y) the original intensity, and the window function w(x,y) is either 1 inside the window and 0 outside, or a Gaussian.
Harris detector
Maths
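The windowed SSD above can be evaluated directly on a toy image. The sketch below (with a hypothetical helper `E` and a uniform window, w = 1 inside the box) illustrates the slide's point: at a corner E is large for every shift direction, at an edge only for shifts across the edge, and in a flat region nowhere.

```python
import numpy as np

def E(I, r, c, u, v, k=3):
    """Windowed SSD E(u, v) around (r, c): w(x, y) = 1 in a (2k+1)^2 box, 0 outside."""
    win = I[r - k:r + k + 1, c - k:c + k + 1]
    shifted = I[r - k + u:r + k + 1 + u, c - k + v:c + k + 1 + v]
    return float(np.sum((shifted - win) ** 2))

# Synthetic image: a bright quadrant, so (10, 10) is a corner
I = np.zeros((20, 20))
I[10:, 10:] = 1.0

for name, (r, c) in [("corner", (10, 10)), ("edge", (15, 10)), ("flat", (5, 5))]:
    print(name, [E(I, r, c, u, v) for u, v in [(1, 0), (0, 1)]])
```

Running this shows nonzero E for both shift directions at the corner, for only one direction on the edge, and zero in the flat region.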
Taylor series approximation to shifted image
Harris detector
Maths
Expanding I(x,y) in a Taylor series, we get a bilinear approximation for small shifts [u, v]:

E(u,v) \approx [\,u \;\; v\,] \; M \; \begin{bmatrix} u \\ v \end{bmatrix}
where M is a 2x2 matrix computed from image derivates:
Source: R. Szeliski
M = \sum_{x,y} w(x,y) \begin{bmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{bmatrix}

The sum runs over the image region (the area we are checking for a corner); I_x I_y is the gradient with respect to x times the gradient with respect to y.
M is also called “structure tensor”
Harris detector
Maths
Let’s consider an axis-aligned corner
Harris detector
What does this matrix reveal?
In this case:

M = \begin{bmatrix} \sum I_x^2 & \sum I_x I_y \\ \sum I_x I_y & \sum I_y^2 \end{bmatrix} = \begin{bmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{bmatrix}

- The dominant gradient directions align with the x or y axis
- If either λ is close to 0, then this is not a corner, so look for locations where both are large
Harris detector
What does this matrix reveal?
Intensity change in a shifting window: eigenvalue analysis of the ellipse

E(u,v) \approx [\,u \;\; v\,] \; M \; \begin{bmatrix} u \\ v \end{bmatrix}

λ1, λ2 — eigenvalues of M. The ellipse axes point along the directions of slowest and fastest change, with lengths (λmax)^{-1/2} and (λmin)^{-1/2}.
Harris detector
General case
\mathrm{Hessian}(I) = \begin{bmatrix} I_{xx} & I_{xy} \\ I_{xy} & I_{yy} \end{bmatrix}

\det(\mathrm{Hessian}(I)) = I_{xx} I_{yy} - I_{xy}^2
Harris detector
Hessian determinant
- Second moment matrix / autocorrelation matrix:

\mu(\sigma_I, \sigma_D) = g(\sigma_I) * \begin{bmatrix} I_x^2(\sigma_D) & I_x I_y(\sigma_D) \\ I_x I_y(\sigma_D) & I_y^2(\sigma_D) \end{bmatrix}

- 1. Image derivatives: I_x = g_x(\sigma_D) * I, I_y = g_y(\sigma_D) * I
- 2. Squares of derivatives: I_x^2, I_y^2, I_x I_y
- 3. Gaussian filter g(\sigma_I): g(I_x^2), g(I_y^2), g(I_x I_y)
- 4. Cornerness function – both eigenvalues are strong:

har = \det[\mu(\sigma_I, \sigma_D)] - \alpha \, [\mathrm{trace}\,\mu(\sigma_I, \sigma_D)]^2 = g(I_x^2)\, g(I_y^2) - [g(I_x I_y)]^2 - \alpha \, [g(I_x^2) + g(I_y^2)]^2

- 5. Non-maxima suppression

Harris detector
Second moment matrix
Classification of image points using the eigenvalues of M:
- “Corner”: λ1 and λ2 are large, λ1 ~ λ2; E increases in all directions
- “Edge”: λ1 >> λ2 (or λ2 >> λ1); E changes strongly in only one direction
- “Flat” region: λ1 and λ2 are small; E is almost constant in all directions
Source: K. Grauman
Harris detector
Interpretation of eigenvalues
Measure of corner response:

R = \det M - \alpha\,(\mathrm{trace}\, M)^2 = \lambda_1 \lambda_2 - \alpha\,(\lambda_1 + \lambda_2)^2

- Does not require computing the eigenvalues explicitly
- α: empirical constant, α = 0.04–0.06
Harris detector
Corner response function
- “Corner”: R > 0
- “Edge”: R < 0
- “Flat” region: |R| small
Source: K. Grauman
Harris detector
Corner response function
- Compute the M matrix within all image windows to get their R scores
- Find points with large corner response (R > threshold)
- Take the points of local maxima of R
Harris detector
Algorithm
Source: D. Frolova

- Compute corner response R
- Find points with large corner response: R > threshold
- Take only the points of local maxima of R
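The five computation steps above can be strung together in a few lines of NumPy/SciPy. This is a minimal sketch, not the original implementation: the function name `harris` and the parameter values (σ = 1, α = 0.05, threshold 1e-3, 5×5 non-maximum suppression) are illustrative choices.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter

def harris(I, sigma=1.0, alpha=0.05, thresh=1e-3):
    """Harris corners: structure tensor -> R = det(M) - alpha * trace(M)^2,
    threshold, then non-maximum suppression. Returns (row, col) points."""
    # 1. image derivatives
    Iy, Ix = np.gradient(I.astype(float))
    # 2. products of derivatives, 3. Gaussian-weighted sums (entries of M)
    Sxx = gaussian_filter(Ix * Ix, sigma)
    Syy = gaussian_filter(Iy * Iy, sigma)
    Sxy = gaussian_filter(Ix * Iy, sigma)
    # 4. cornerness function
    R = Sxx * Syy - Sxy ** 2 - alpha * (Sxx + Syy) ** 2
    # 5. threshold + non-maximum suppression
    mask = (R == maximum_filter(R, size=5)) & (R > thresh)
    return np.argwhere(mask)

# A white square on a black background has four corners
img = np.zeros((40, 40))
img[10:30, 10:30] = 1.0
print(harris(img))
```

The detected points cluster at the four corners of the square; edge pixels get negative R and flat regions R ≈ 0, so neither survives the threshold.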
Perceptual and Sensory Augmented Computing Visual Object Recognition Tutorial
Harris detector
Typical response
Properties of ideal features
- Local: features are local, so robust to occlusion and
clutter (no prior segmentation)
- Invariant (or covariant)
- Robust: noise, blur, discretization, compression, etc. do
not have a big impact on the feature
- Distinctive: individual features can be matched to a
large database of objects
- Quantity: many features can be generated even for small objects
- Accurate: precise localization
- Efficient: close to real-time performance
Remember this?
Is the Harris corner detector rotation invariant?
Ellipse rotates but its shape remains the same Corner response R is invariant to image rotation
Is the Harris corner detector scale invariant?
Is the Harris corner detector scale invariant?
Not invariant to image scale! At a fine scale, all points along the curve are classified as edges; only at the right (coarser) scale is this a corner.
How can we detect scale invariant interest points?
Source: T. Tuytelaars
Exhaustive search
A multi-scale approach
Source: T. Tuytelaars

- Extract a patch from each image individually, at multiple scales
- We want to extract the patches from each image independently
Scale invariant detection
Lindeberg et al., 1996
Design a function on the region, which is “scale invariant” (the same for corresponding regions, even if they are at different scales)
Example: average intensity. For corresponding regions (even of different sizes) it will be the same.
- For a point in one image, we can consider it as a function of
region size (patch width)
[Plot: response f vs. region size for Image 1 and Image 2 (scale = 1/2)]
Exhaustive search
Solution
Lindeberg et al., 1996

Take a local maximum of this function. Observation: the region size at which the maximum is achieved should be invariant to image scale. This scale-invariant region size is found in each image independently!

[Plot: response f vs. region size for Image 1 and Image 2 (scale = 1/2); maxima at region sizes s1 and s2]
Scale invariant detection
Common approach
f(I_{i_1 \ldots i_m}(x, \sigma)) = f(I_{i_1 \ldots i_m}(x', \sigma'))

Same operator responses if the patch contains the same image up to a scale factor. How to find corresponding patch sizes?
Scale invariant detection
Function responses for increasing scale (scale signature): compute f(I_{i_1 \ldots i_m}(x, \sigma)) and f(I_{i_1 \ldots i_m}(x', \sigma')) at the corresponding points while increasing the scale, and compare the two response curves.

Scale invariant detection
Function responses for increasing scale (scale signature)
[Plot: scale signatures f vs. region size for Image 1 and Image 2 (scale = 1/2); maxima at region sizes s1 and s2]
Scale invariant detection
Common approach
- A good function for scale detection has one stable sharp peak
- For usual images, a good function is one that responds to contrast (sharp local intensity change)
Scale invariant detection
Common approach
Source: L. Lazebnik
We define the characteristic scale as the one that produces a peak of the (scale-normalized) Laplacian response.
Scale invariant detection
Common approach
Source: K. Grauman
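The peak selection above is easy to reproduce numerically. The sketch below (hypothetical helper `characteristic_scale`) evaluates the scale-normalized Laplacian |σ²·LoG(σ)| at the center of a binary disk; for a disk of radius r the response is known to peak near σ = r/√2.

```python
import numpy as np
from scipy.ndimage import gaussian_laplace

def characteristic_scale(I, r, c, sigmas):
    """Characteristic scale at (r, c): the sigma maximizing |sigma^2 * LoG(I, sigma)|."""
    responses = [s ** 2 * gaussian_laplace(I, s)[r, c] for s in sigmas]
    return sigmas[int(np.argmax(np.abs(responses)))]

# Binary disk of radius R, centered in the image
R = 8
yy, xx = np.mgrid[-30:31, -30:31]
disk = (xx ** 2 + yy ** 2 <= R ** 2).astype(float)

sigmas = np.arange(1.0, 15.0, 0.25)
print(characteristic_scale(disk, 30, 30, sigmas))  # expected near R / sqrt(2)
```

Because the characteristic scale tracks the structure's size, the same disk rendered at half resolution would peak at roughly half the σ — which is exactly the property the slides exploit.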
Interest points (blobs) are local maxima in both position and scale → a list of (x, y, σ)
Source: K. Grauman
Scale invariant detection
Harris-Laplace
Scale-space blob detector example
Source: T. Lindeberg
Scale invariant detection
Harris-Laplace
Scale-space blob detector example
Source: L. Lazebnik
Scale invariant detection
Harris-Laplace
Harris points vs. Harris Laplace points
Source: C. Schmid
Harris points Harris-Laplace points
Scale invariant detection
Harris-Laplace
- Functions for determining scale
Kernels:

L = \sigma^2 \,( G_{xx}(x, y, \sigma) + G_{yy}(x, y, \sigma) )   (Laplacian)

DoG = G(x, y, k\sigma) - G(x, y, \sigma)   (Difference of Gaussians)

where G(x, y, \sigma) is the Gaussian kernel, and f = Kernel * Image
Lowe, 1999
Scale invariant detection
Technical detail
Source: T. Tuytelaars
- Local maxima in the scale space of the Laplacian of Gaussian (LoG),
  \sigma^2 \,( L_{xx}(\sigma) + L_{yy}(\sigma) ),
  computed over a stack of scales → a list of (x, y, scale)
Scale invariant detection
Technical detail
L = \sigma^2 \,( G_{xx}(x, y, \sigma) + G_{yy}(x, y, \sigma) )   (Laplacian)

DoG = G(x, y, k\sigma) - G(x, y, \sigma)   (Difference of Gaussians)

Source: T. Tuytelaars

Subtracting two Gaussians of nearby scales approximates the scale-normalized Laplacian: DoG ≈ (k - 1)\,\sigma^2 \nabla^2 G
Scale invariant detection
Difference of Gaussians approximation
Source: T. Tuytelaars
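The DoG-approximates-LoG claim can be checked numerically. From the heat-equation property ∂G/∂σ = σ∇²G, for nearby scales G(kσ) − G(σ) ≈ (k − 1)·σ²·∇²G_σ. A small sketch (the image and the parameter values are arbitrary test choices):

```python
import numpy as np
from scipy.ndimage import gaussian_filter, gaussian_laplace

# Smooth random test image
rng = np.random.default_rng(0)
img = gaussian_filter(rng.random((64, 64)), 2.0)

sigma, k = 3.0, 1.1
dog = gaussian_filter(img, k * sigma) - gaussian_filter(img, sigma)
log = (k - 1) * sigma ** 2 * gaussian_laplace(img, sigma)

# Relative error between the DoG and the scaled, scale-normalized LoG
err = np.abs(dog - log).max() / np.abs(log).max()
print(err)
```

The relative error is small for k close to 1 and grows with (k − 1), which is why DoG pyramids use modest scale ratios per level.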
- LoG result
Scale invariant detection
Technical detail
[Figure: pyramid construction — original image at σ = 1, then subsampling with step 4 at σ = 2]
Source: T. Tuytelaars
Scale invariant detection
Difference of Gaussians approximation
→ a list of (x, y, σ)
- Detect maxima of the difference-of-Gaussians (DoG) in scale space
- Then reject points with low contrast (threshold)
- Eliminate edge responses
Scale invariant detection
Difference of Gaussians approximation
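The first two steps above (scale-space DoG extrema plus a contrast threshold) can be sketched as follows. This is an illustrative simplification: `dog_keypoints` is a made-up helper, the parameter values are arbitrary, and Lowe's edge-response test on the ratio of principal curvatures is omitted.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter, minimum_filter

def dog_keypoints(I, sigma0=1.6, k=2 ** 0.5, n=6, contrast=0.02):
    """Gaussian stack -> differences -> 3x3x3 scale-space extrema
    above a contrast threshold. Returns (y, x, sigma) triples."""
    sigmas = [sigma0 * k ** i for i in range(n)]
    gauss = np.stack([gaussian_filter(I.astype(float), s) for s in sigmas])
    dog = gauss[1:] - gauss[:-1]                      # difference of Gaussians
    # extrema over the (scale, y, x) neighborhood, with contrast threshold
    is_ext = ((dog == maximum_filter(dog, size=3)) & (dog > contrast)) | \
             ((dog == minimum_filter(dog, size=3)) & (dog < -contrast))
    return [(y, x, sigmas[s + 1]) for s, y, x in np.argwhere(is_ext)]

# Bright blob of radius 4 -> a keypoint near its center
yy, xx = np.mgrid[:40, :40]
img = ((yy - 20) ** 2 + (xx - 20) ** 2 <= 16).astype(float)
print(dog_keypoints(img))
```

A bright blob produces a scale-space minimum of the DoG at its center (more blur lowers the center value), which is why both maxima and minima are kept.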
(a) 233x189 image (b) 832 DoG extrema (c) 729 left after peak value threshold (d) 536 left after testing the ratio of principal curvatures (removing edge responses)
Source: D. Lowe
Scale invariant detection
Difference of Gaussians approximation
Harris corner detector Harris Laplace detector DoG detector
Scale invariant detection
Detector comparison
- Simple, efficient scheme
- The Laplacian fires more on edges than the determinant of the Hessian
Scale invariant detection
Evaluation
- Given: two images of the same scene with a large scale difference between them
- Goal: find the same interest points independently in each image
- Solution: search for maxima of suitable functions in scale and in space (over the image)
Scale Invariant local features
Summary
Local features
Main components
1) Detection: identify the interest points
2) Description: extract vector feature descriptor around each point of interest
3) Matching: determine correspondence between descriptors in two views
- We know how to detect points
- Next question:
How to describe them for matching?
The point descriptor should be invariant and distinctive
Local features
- Geometry:
– Rotation
– Similarity (rotation + uniform scale)
– Affine (scale dependent on direction)
- Photometry:
– Affine intensity change (I -> aI + b)
Source: C. Snoek
Models of image change
- The easiest way to describe the neighborhood around an interest point is to write down the list of intensities to form a feature vector
- However, this is very sensitive to even small shifts and rotations
Local descriptors
- Patches
– Disadvantage of patches as descriptors: small shifts can strongly affect the matching
- Histograms
Source: L. Lazebnik
Local descriptors
SIFT descriptor
- Scale Invariant Feature Transform
- How does it work?
– Divide the patch into 4x4 sub-patches: 16 cells
– Compute a histogram of gradient orientations (8 reference angles) for all pixels inside each sub-patch
– Resulting descriptor: 4x4x8 = 128 dimensions
Lowe, 2004
SIFT descriptor
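The 4x4x8 construction can be sketched in NumPy. This is a simplified, hypothetical version (`sift_like_descriptor` is not the official implementation): it keeps only the cell/orientation histogram layout, skipping SIFT's Gaussian weighting, trilinear interpolation, and rotation normalization.

```python
import numpy as np

def sift_like_descriptor(patch):
    """128-D SIFT-style descriptor for a 16x16 patch: 4x4 cells, each an
    8-bin histogram of gradient orientations weighted by gradient magnitude."""
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    ori = np.mod(np.arctan2(gy, gx), 2 * np.pi)         # orientation in [0, 2*pi)
    bins = np.minimum((ori / (2 * np.pi) * 8).astype(int), 7)
    desc = np.zeros((4, 4, 8))
    for i in range(16):
        for j in range(16):
            desc[i // 4, j // 4, bins[i, j]] += mag[i, j]
    desc = desc.ravel()
    return desc / (np.linalg.norm(desc) + 1e-12)        # unit norm

patch = np.random.default_rng(0).random((16, 16))
d = sift_like_descriptor(patch)
print(d.shape)  # (128,)
```

Each of the 16 cells contributes 8 histogram entries, giving the 4x4x8 = 128 dimensions stated on the slide.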
- Compute orientation histogram
- Select dominant orientation
- Normalize: rotate to fixed orientation
Rotation invariance
Lowe, 2004
SIFT descriptor
- Multiple dominant orientations
Rotation invariance
Lowe, 2004
SIFT descriptor
Rotation invariance
SIFT descriptor
- Normalize the descriptor to norm 1
- A change in image contrast can change the magnitude but not the orientation
- Reduce the influence of large gradient magnitudes: threshold the values in the unit feature vector to be no larger than 0.2
- Renormalize the feature vector after thresholding
Illumination invariance
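The normalization steps above translate directly to code (a sketch; the helper name is illustrative):

```python
import numpy as np

def normalize_descriptor(d, clip=0.2):
    """SIFT-style illumination normalization: unit-normalize, clip large
    components at 0.2 to damp big gradient magnitudes, then renormalize."""
    d = d / (np.linalg.norm(d) + 1e-12)   # contrast invariance
    d = np.minimum(d, clip)               # damp large gradient magnitudes
    return d / (np.linalg.norm(d) + 1e-12)

v = np.array([10.0, 1.0, 1.0, 1.0])
print(normalize_descriptor(v))
```

Note that scaling the input (a global contrast change) leaves the output unchanged, which is exactly the invariance the slide claims.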
Evaluation
- Very robust
- 80% repeatability at:
  - 10% image noise
  - 45° viewing angle
  - 1k–100k keypoints in the database
- Best descriptor in [Mikolajczyk & Schmid 2005]'s extensive survey
- 11,200+ citations on Google Scholar for the 2004 paper
- Source code available for download
SIFT descriptor
Performance
- One image yields:
  - n 128-dimensional descriptors, each a histogram of the gradient orientations within a patch → [n x 128 matrix]
  - n scale parameters specifying the size of each patch → [n x 1 vector]
  - n orientation parameters specifying the angle of each patch → [n x 1 vector]
  - n 2D points giving the positions of the patches → [n x 2 matrix]
SIFT descriptor
Performance
Panorama stitching
SIFT descriptor
Applications
Panorama stitching – iPhone app
http://www.cloudburstresearch.com/
SIFT descriptor
Applications
- B. Leibe
Mobile tourist guide
- Self-localization
- Object/building recognition
- Photo/video augmentation
Quack, 2008
SIFT descriptor
Applications
- B. Leibe
- Augmented reality