Optimal Multi-Step k-Nearest Neighbor Search
Thomas Seidl and Hans-Peter Kriegel
University of Munich, Germany
ACM SIGMOD 98, Seattle
(c) 1998 Thomas Seidl SIGMOD ‘98 - 2
Survey
- Similarity search for complex similarity models
- Analysis of previous solution for k-nn search
- An optimality criterion for k-nn search
- Optimal algorithm for k-nn search
- Performance analysis
Distance-based Similarity Search
[Figure: query results labeled "no answer", "too many answers", and "k nearest neighbors" (ranked 1st, 2nd, 3rd, 4th)]
Principle: Small Distances ↔ Strong Similarity
RangeQuery(q, ε): $\{\, o \in DB \mid d(o, q) \le \varepsilon \,\}$
k-NearestNeighborQuery(q, k): $\{1, \dots, k\} \to DB$, $d_q$-monotonous
Complex Similarity Models
- Quadratic Form Distance Functions: $d_A^2(p, q) = (p - q) \cdot A \cdot (p - q)^T$
– Color Histograms for Image Databases (QBIC): 256-D histograms (Niblack et al. 93) (Hafner et al. 95)
– Shape Similarity for 2D and 3D: up to 4,096-D vectors (Thesis Seidl 97)
– …
- Max-Morphological Distance
– 2D images: Tumor shapes (Korn et al. 96)
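A minimal sketch of the quadratic form distance above, written in plain Python to make the cost visible: the double loop over the difference vector is quadratic in the dimension. The function name `quadratic_form_distance` and the list-of-lists matrix layout are assumptions for illustration, and the matrix A is assumed to be positive semi-definite so that the square root is defined.

```python
import math

def quadratic_form_distance(p, q, A):
    """Return d_A(p, q) = sqrt((p - q) * A * (p - q)^T) for equal-length vectors."""
    diff = [pi - qi for pi, qi in zip(p, q)]
    # Accumulate sum_i sum_j diff[i] * A[i][j] * diff[j]: O(n^2) in the dimension n.
    total = 0.0
    for i, di in enumerate(diff):
        for j, dj in enumerate(diff):
            total += di * A[i][j] * dj
    return math.sqrt(total)

# With the identity matrix, the quadratic form reduces to the Euclidean distance.
identity = [[1.0, 0.0], [0.0, 1.0]]
print(quadratic_form_distance([0.0, 0.0], [3.0, 4.0], identity))  # 5.0
```

Off-diagonal entries of A express similarity between different dimensions (for example, between related histogram bins), which is why the distance cannot be evaluated coordinate by coordinate.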
Cost of Single Evaluations
– Quadratic Form Distance Functions: evaluation time by dimension

  dimension    evaluation time [msec]
  21           0.23
  64           0.4
  112          1.1
  256          6.2
  1,024        102
  4,096        1,656

– Max-Morphological Distance (Korn et al. 96): 12.69 seconds (avg) per distance evaluation
Multi-Step Query Processing
- Multi-Step Similarity Search
– Range Queries (Faloutsos et al. 94)
– k-Nearest Neighbor Queries (Korn et al. 96)
- No False Drops?
[Figure: Filter Step (index-based) → candidates → Refinement Step (exact evaluation) → results]
Lower-Bounding Property: $d_f(p, q) \le d_o(p, q)$, where $d_f$ is the filter distance and $d_o$ is the object distance
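The filter-and-refine scheme for range queries can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: a linear scan stands in for the index-based filter step, and the function name `multi_step_range_query` and the toy distances `d_f` and `d_o` are hypothetical. Because the toy filter distance never exceeds the object distance, every true answer survives the filter step, so there are no false drops.

```python
def multi_step_range_query(db, q, eps, d_f, d_o):
    """Filter with the cheap d_f, then refine the surviving candidates with the exact d_o."""
    # Filter step: a linear scan stands in for the index-based filter here.
    candidates = [o for o in db if d_f(o, q) <= eps]
    # Refinement step: exact (expensive) evaluation only for the candidates.
    results = [o for o in candidates if d_o(o, q) <= eps]
    return candidates, results

# Hypothetical toy distances: d_o is Euclidean in 2-D, d_f looks at the first
# coordinate only, so d_f(p, q) <= d_o(p, q) holds for every pair of points.
def d_o(p, q):
    return ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5

def d_f(p, q):
    return abs(p[0] - q[0])

db = [(0, 0), (1, 5), (2, 1), (6, 0)]
candidates, results = multi_step_range_query(db, (0, 0), 3.0, d_f, d_o)
print(candidates)  # [(0, 0), (1, 5), (2, 1)]
print(results)     # [(0, 0), (2, 1)]
```

The point (1, 5) passes the filter but is rejected by the refinement step: a lower-bounding filter may admit extra candidates, but it never discards a true answer.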
Previous k-nn Algorithm (Korn et al. 96)
More candidates generated than necessary
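First Phase:
– query (q, k): k-nn query on Index (d_f)
– primary d_max: largest object distance (d_o) among the k objects returned
Second Phase:
– d_max query (range query with radius d_max) on Index (d_f)
– final k-nn: the k candidates with the smallest object distance (d_o)
Fixed d_max in 2nd Phase!
Number of candidates >> k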
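The two phases can be sketched as below. This is a simplified reading of the slide, not the code of Korn et al.: linear scans replace both index queries, and the name `two_phase_knn` and the toy distances are assumptions made for illustration.

```python
import heapq

def two_phase_knn(db, q, k, d_f, d_o):
    """Two-phase k-nn: primary d_max from a filter k-nn query, then a fixed-radius range query."""
    # First phase: k-nn query under the filter distance (linear scan instead of an index).
    primary = heapq.nsmallest(k, db, key=lambda o: d_f(o, q))
    # Primary d_max: largest exact distance among the k primary objects.
    d_max = max(d_o(o, q) for o in primary)
    # Second phase: range query under the filter distance with the fixed radius d_max.
    candidates = [o for o in db if d_f(o, q) <= d_max]
    # Final k-nn: rank all candidates by the exact object distance.
    result = heapq.nsmallest(k, candidates, key=lambda o: d_o(o, q))
    return result, len(candidates)

# Hypothetical toy distances with d_f(p, q) <= d_o(p, q).
def d_o(p, q):
    return ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5

def d_f(p, q):
    return abs(p[0] - q[0])

db = [(0, 0), (1, 5), (2, 1), (3, 0), (6, 0)]
result, n_candidates = two_phase_knn(db, (0, 0), 2, d_f, d_o)
print(result)        # [(0, 0), (2, 1)]
print(n_candidates)  # 4
```

In this toy run the primary d_max is set by the point (1, 5), which is close under the filter distance but far under the object distance. Since d_max stays fixed in the second phase, the point (3, 0) is fetched and evaluated even though it cannot belong to the result.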
Number of Candidates
[Figure: object and filter distances plotted against the rank according to filter distance (ranks 1 to 49); marked levels: d_max and the k-th object distance]
Optimality of k-NN Algorithms
Lemma
– Let $d_f$ be a lower-bounding filter of $d_o$: $d_f(p, q) \le d_o(p, q)$
– For a multi-step k-nn algorithm based on $d_o$ and $d_f$, the optimal set of candidates is: $\{\, o \in DB \mid d_f(o, q) \le \varepsilon_k \,\}$
– where $\varepsilon_k$ is the k-th object similarity distance: $\varepsilon_k = \max \{\, d_o(o, q) \mid o \in NN_q(k) \,\}$
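To make the lemma concrete, the optimal candidate set can be computed by brute force on a toy database. The sketch below is for illustration only: it evaluates the object distance for every object, which is exactly what a multi-step algorithm tries to avoid, and the name `optimal_candidate_set` and the toy distances are hypothetical.

```python
def optimal_candidate_set(db, q, k, d_f, d_o):
    """Return (eps_k, candidates) as defined in the lemma, computed by brute force."""
    # eps_k: the k-th smallest object distance, i.e. the largest d_o within the k-nn set.
    eps_k = sorted(d_o(o, q) for o in db)[k - 1]
    # Optimal candidates: every object whose filter distance does not exceed eps_k.
    return eps_k, [o for o in db if d_f(o, q) <= eps_k]

# Hypothetical toy distances with d_f(p, q) <= d_o(p, q).
def d_o(p, q):
    return ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5

def d_f(p, q):
    return abs(p[0] - q[0])

db = [(0, 0), (1, 5), (2, 1), (3, 0), (6, 0)]
eps_k, candidates = optimal_candidate_set(db, (0, 0), 2, d_f, d_o)
print(candidates)  # [(0, 0), (1, 5), (2, 1)]
```

Objects with a filter distance above eps_k can be skipped safely because their object distance is at least as large, while any object at or below eps_k could still be one of the k nearest neighbors until its object distance has been evaluated.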
Optimal k-nn Algorithm (new)
query (q, k):
– init ranking on Index (d_f)
– while d_f(o, q) ≤ d_max do: get next o from index and adjust d_max (d_o)
– final k-nn: d_o(o, q) ≤ d_max
d_max is adjusted step by step!
THEOREM:
– No unnecessary candidates
– No false drops
Required: Incremental Ranking on index (Hjaltason & Samet 95)
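The loop on this slide can be sketched in Python as follows. This is a sketch under stated assumptions rather than the authors' implementation: the generator `incremental_ranking` simply sorts the whole database by filter distance and stands in for a true incremental ranking on an index, and all function names and the toy distances are hypothetical.

```python
import heapq
import math

def incremental_ranking(db, q, d_f):
    """Yield (d_f(o, q), o) in ascending filter distance; a stand-in for index ranking."""
    for dist, _, o in sorted((d_f(o, q), i, o) for i, o in enumerate(db)):
        yield dist, o

def optimal_multi_step_knn(db, q, k, d_f, d_o):
    best = []          # max-heap via negated distances: the k smallest d_o seen so far
    d_max = math.inf   # no bound until k objects have been evaluated
    n_candidates = 0
    for filter_dist, o in incremental_ranking(db, q, d_f):
        if filter_dist > d_max:
            break      # every remaining object has d_o >= d_f > d_max
        n_candidates += 1
        dist = d_o(o, q)  # exact evaluation, done only for candidates
        entry = (-dist, n_candidates, o)
        if len(best) < k:
            heapq.heappush(best, entry)
        elif dist < -best[0][0]:
            heapq.heapreplace(best, entry)
        if len(best) == k:
            d_max = -best[0][0]  # adjust d_max step by step
    result = [o for _, _, o in sorted(best, reverse=True)]
    return result, n_candidates

# Hypothetical toy distances with d_f(p, q) <= d_o(p, q).
def d_o(p, q):
    return ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5

def d_f(p, q):
    return abs(p[0] - q[0])

db = [(0, 0), (1, 5), (2, 1), (3, 0), (6, 0)]
result, n_candidates = optimal_multi_step_knn(db, (0, 0), 2, d_f, d_o)
print(result)        # [(0, 0), (2, 1)]
print(n_candidates)  # 3
```

On this toy database, d_max shrinks as soon as the point (2, 1) is evaluated, so the ranking stops before (3, 0) is fetched. The three objects that are evaluated are exactly those whose filter distance does not exceed the k-th object distance.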
Minimal Set of Candidates
[Figure: object and filter distances plotted against the rank according to filter distance (ranks 1 to 49); marked levels: primary d_max and optimal d_max]
The higher the filter distance, the better the filter selectivity
Uniformly Distributed Data (20-D)
Experimental Setting
- 100,000 Objects, 20-D
- Matrices: sim-id, 1-0, 2-2
- Queries: k = 10 (0.01%)
- Index: 15-D
- Avg. Improvement Factors
– Candidates: 72, 120, 64
– Overall Time: 26, 48, 23

  similarity matrix    number of candidates         overall runtime [sec]
                       previous      optimal        previous      optimal
  sim-id               26,546        370            419           16
  sim-1-0              42,891        358            664           14
  sim-2-2              71,610        1,118          1,117         48
2-D Shape Similarity (1,024-D)
Experimental Setting
- 10,000 Images, 32x32 Pixel
- ‘Neighborhood Area’: 9-1
- Queries: k = 5 (0.05%)
- Index (KLT): 16-D, …, 64-D
- Avg. Improvement Factors
– Candidates: 2.3
– Overall Time: 1.6 to 2.3
[Figure: overall runtime [sec] and number of candidates by dimension of index (16-D, 32-D, 48-D, 64-D); previous algorithm vs. optimal algorithm]
Color Histograms (112-D)
Experimental Setting
- 112,700 Histograms (112-D)
- Quadratic Form Distance
- Queries: k = 2,…,12 (0.01%)
- Index (KLT): 12-D
- Avg. Improvement Factors
– Candidates: 17
– Overall Time: 8.5
[Figure: overall runtime [sec] and number of candidates by query parameter k (2 to 12); previous algorithm vs. optimal algorithm]
Conclusions
- Complex Similarity Search: Expensive similarity evaluations
- Multi-Step Approach: Lower-bounding filter distance function
- Optimal Algorithm: Minimum number of exact evaluations
- Average Improvement Factors:
– up to 120 (number of candidates)
– up to 48 (overall runtime)
- Future Work: New applications; Integration with Data Mining