
SLIDE 1

Optimal Multi-Step k-Nearest Neighbor Search

Thomas Seidl and Hans-Peter Kriegel
University of Munich, Germany
ACM SIGMOD '98, Seattle

SLIDE 2

Survey

Similarity search for complex similarity models
Analysis of previous solution for k-nn search
An optimality criterion for k-nn search
Optimal algorithm for k-nn search
Performance analysis

SLIDE 3

Distance-based Similarity Search

[Figure: query results ranked 1st, 2nd, 3rd, 4th by distance; panels labeled "no answer", "too many answers", and "k nearest neighbors"]

Principle: Small Distances ↔ Strong Similarity

RangeQuery(q, ε): $\{\, o \in DB \mid d(o, q) \le \varepsilon \,\}$

k-NearestNeighborQuery(q, k): $d_q^{-1} : \{1, \dots, k\} \xrightarrow{\text{monotonous}} DB$

(that is, the objects at ranks 1, …, k when DB is ordered by distance to q)

SLIDE 4

Complex Similarity Models

Quadratic Form Distance Functions

– Color Histograms for Image Databases (QBIC): 256-D histograms (Niblack et al. 93) (Hafner et al. 95)
– Shape Similarity for 2D and 3D: Up to 4,096-D vectors (Thesis Seidl 97)
– …

Max-Morphological Distance

– 2D images: Tumor shapes (Korn et al. 96)

$d_A^2(p, q) = (p - q) \cdot A \cdot (p - q)^T$
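The quadratic form distance above can be illustrated with a minimal Python sketch. The function name `quadratic_form_distance`, the two toy matrices, and the sample vectors are assumptions made for illustration; they do not come from the slides. The double sum makes the quadratic cost per evaluation visible.

```python
import math

def quadratic_form_distance(p, q, A):
    """Return d_A(p, q) = sqrt((p - q) . A . (p - q)^T) for a similarity matrix A."""
    diff = [pi - qi for pi, qi in zip(p, q)]
    n = len(diff)
    # Double sum over all pairs of dimensions: O(n^2) work per evaluation.
    squared = sum(diff[i] * A[i][j] * diff[j] for i in range(n) for j in range(n))
    return math.sqrt(squared)

# Hypothetical 2-D example: the identity matrix gives the Euclidean distance,
# while off-diagonal entries model similarity between different dimensions.
identity = [[1.0, 0.0], [0.0, 1.0]]
cross = [[1.0, 0.5], [0.5, 1.0]]
p, q = [1.0, 0.0], [0.0, 1.0]

d_identity = quadratic_form_distance(p, q, identity)
d_cross = quadratic_form_distance(p, q, cross)
```

With the cross-similarity matrix the two vectors come out closer than under the identity matrix, because the off-diagonal entries treat the two dimensions as partly similar.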

SLIDE 5

Cost of Single Evaluations

– Quadratic Form Distance Functions: evaluation time by dimension

dimension    evaluation time [msec]
21           0.23
64           0.4
112          1.1
256          6.2
1,024        102
4,096        1,656

– Max-Morphological Distance (Korn et al. 96): 12.69 seconds (avg) per distance evaluation

SLIDE 6

Multi-Step Query Processing

Multi-Step Similarity Search

Range Queries (Faloutsos et al. 94)
k-Nearest Neighbor Queries (Korn et al. 96)

No False Drops?

Filter Step (index-based) → candidates → Refinement Step (exact evaluation) → results

Lower-Bounding Property: $d_f(p, q) \le d_o(p, q)$ (filter distance ≤ object distance)
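The filter and refinement steps for a range query can be sketched as follows. This is a minimal Python illustration under stated assumptions: the Euclidean distance stands in for the expensive object distance, a distance on the first coordinate only stands in for the filter, and a linear scan replaces the index. All names and data are hypothetical.

```python
import math

def object_distance(p, q):
    """Exact object distance d_o: Euclidean distance on all dimensions."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def filter_distance(p, q, dims=1):
    """Filter distance d_f: Euclidean distance on the first `dims` dimensions.
    Dropping non-negative terms can only shrink the sum, so d_f <= d_o."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p[:dims], q[:dims])))

def multi_step_range_query(db, q, eps):
    # Filter step: a linear scan stands in for the index-based filter.
    candidates = [o for o in db if filter_distance(o, q) <= eps]
    # Refinement step: exact evaluation only for the candidates.
    results = [o for o in candidates if object_distance(o, q) <= eps]
    return candidates, results

db = [(0.0, 0.0), (1.0, 0.0), (1.0, 3.0), (4.0, 0.0)]
query = (0.0, 0.0)
candidates, results = multi_step_range_query(db, query, 2.0)

# Reference answer by exact evaluation of every object.
brute_force = [o for o in db if object_distance(o, query) <= 2.0]
```

Because the filter never overestimates the object distance, every true answer survives the filter step, so the refined results match the brute-force answer while the exact distance is evaluated for three objects instead of four.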

SLIDE 7

Previous k-nn Algorithm (Korn et al. 96)

More candidates generated than necessary

query (q, k)
First Phase: k-nn query on Index (df); primary dmax (do) from Objects
Second Phase: dmax query on Index (df); final k-nn (do) from Objects

Fixed dmax in 2nd Phase: number of candidates >> k
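The two phases named on this slide can be sketched in Python as below. This is an illustrative reading of the slide, not the code of Korn et al.; the function name `two_phase_knn`, the toy 2-D data, and the use of a sort and a linear scan in place of the index are all assumptions.

```python
import math

def d_o(p, q):
    """Exact object distance: Euclidean distance on both coordinates."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def d_f(p, q):
    """Lower-bounding filter distance: difference in the first coordinate only."""
    return abs(p[0] - q[0])

def two_phase_knn(db, q, k):
    # First phase: k-nn query by filter distance (a sort stands in for the index);
    # the exact distances of these k objects yield the primary dmax.
    first = sorted(db, key=lambda o: d_f(o, q))[:k]
    d_max = max(d_o(o, q) for o in first)
    # Second phase: range query on the filter distance with the fixed dmax.
    candidates = [o for o in db if d_f(o, q) <= d_max]
    # Final k-nn: rank the candidates by exact distance.
    result = sorted(candidates, key=lambda o: d_o(o, q))[:k]
    return result, candidates

db = [(0.0, 5.0), (0.5, 4.0), (1.0, 0.0), (1.5, 0.5),
      (2.0, 0.0), (3.0, 0.0), (4.5, 0.0), (6.0, 0.0)]
result, candidates = two_phase_knn(db, (0.0, 0.0), k=2)
```

On this toy data the two objects closest by filter distance are far away by object distance, so the primary dmax is large and the second phase returns 7 of the 8 objects as candidates for k = 2.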

SLIDE 8

Number of Candidates

[Figure: object distance and filter distance plotted against rank according to filter distance (ranks 1 to 49); marked levels: dmax and the k-th object distance]

SLIDE 9

Optimality of k-NN Algorithms

Lemma

– Let df be a lower-bounding filter of do: $d_f(p, q) \le d_o(p, q)$
– For a multi-step k-nn algorithm based on do and df, the optimal set of candidates is: $\{\, o \in DB \mid d_f(o, q) \le \varepsilon_k \,\}$
– where εk is the k-th object similarity distance: $\varepsilon_k = \max \{\, d_o(o, q) \mid o \in NN_q(k) \,\}$
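The candidate set in the lemma can be computed by brute force on a small example. The sketch below is only an illustration: it evaluates every object distance to find εk, which a real multi-step algorithm is designed to avoid. The function name `optimal_candidates`, the distances, and the data are hypothetical.

```python
import math

def d_o(p, q):
    """Exact object distance: Euclidean distance on both coordinates."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def d_f(p, q):
    """Lower-bounding filter distance: difference in the first coordinate only."""
    return abs(p[0] - q[0])

def optimal_candidates(db, q, k):
    # eps_k: the k-th smallest object distance, found here by brute force.
    eps_k = sorted(d_o(o, q) for o in db)[k - 1]
    # Candidate set of the lemma: all objects whose filter distance is within eps_k.
    return eps_k, [o for o in db if d_f(o, q) <= eps_k]

db = [(0.0, 5.0), (0.5, 4.0), (1.0, 0.0), (1.5, 0.5),
      (2.0, 0.0), (3.0, 0.0), (4.5, 0.0), (6.0, 0.0)]
eps_k, candidates = optimal_candidates(db, (0.0, 0.0), k=2)
```

For k = 2 the set contains 4 of the 8 objects: the two true nearest neighbors plus two objects that the filter alone cannot rule out.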

SLIDE 10

Optimal k-nn Algorithm (new)

dmax is adjusted step by step!

query (q, k)
init ranking on Index (df)
while df(o,q) ≤ dmax do
    get next o from index and adjust dmax (do)
final k-nn: do(o,q) ≤ dmax
result

[Diagram components: Index, Objects]

THEOREM:
– No unnecessary candidates
– No false drops

Required: Incremental Ranking on index (Hjaltason & Samet 95)
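The loop on this slide can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: a pre-sorted list stands in for the incremental ranking on the index, and the function name `optimal_knn`, the distances, and the toy data are assumptions.

```python
import math

def d_o(p, q):
    """Exact object distance: Euclidean distance on both coordinates."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def d_f(p, q):
    """Lower-bounding filter distance: difference in the first coordinate only."""
    return abs(p[0] - q[0])

def optimal_knn(db, q, k):
    # Stand-in for incremental ranking: objects in ascending filter distance.
    ranking = sorted(db, key=lambda o: d_f(o, q))
    best = []                 # up to k pairs (object distance, object), kept sorted
    d_max = float("inf")      # no bound until k objects have been evaluated
    evaluated = 0
    for o in ranking:
        if d_f(o, q) > d_max:
            break             # no later object in the ranking can qualify
        evaluated += 1
        best.append((d_o(o, q), o))
        best.sort()
        del best[k:]          # keep only the k closest objects seen so far
        if len(best) == k:
            d_max = best[-1][0]   # adjust dmax to the current k-th object distance
    return [o for _, o in best], evaluated

db = [(0.0, 5.0), (0.5, 4.0), (1.0, 0.0), (1.5, 0.5),
      (2.0, 0.0), (3.0, 0.0), (4.5, 0.0), (6.0, 0.0)]
query = (0.0, 0.0)
result, evaluated = optimal_knn(db, query, k=2)

# Reference answer by exact evaluation of every object.
brute_force = sorted(db, key=lambda o: d_o(o, query))[:2]
```

On this toy data the loop stops after 4 exact evaluations, which is the number of objects whose filter distance does not exceed the final k-th object distance, and the result agrees with the brute-force answer.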

SLIDE 11

Minimal Set of Candidates

[Figure: object distance and filter distance plotted against rank according to filter distance (ranks 1 to 49); marked levels: primary dmax and optimal dmax]

The higher the filter distance, the better the filter selectivity

SLIDE 12

Uniformly Distributed Data (20-D)

Experimental Setting

100,000 Objects, 20-D
Matrices: sim-id, 1-0, 2-2
Queries: k = 10 (0.01%)
Index: 15-D
Avg. Improvement Factors
Candidates: 72, 120, 64
Overall Time: 26, 48, 23

similarity matrix    overall runtime [sec]       number of candidates
                     previous    optimal         previous    optimal
sim-id               419         16              26,546      370
sim-1-0              664         14              42,891      358
sim-2-2              1,117       48              71,610      1,118

SLIDE 13

2-D Shape Similarity (1,024-D)

Experimental Setting

10,000 Images, 32x32 Pixel
‘Neighborhood Area’: 9-1
Queries: k = 5 (0.05%)
Index (KLT): 16-D, …, 64-D
Avg. Improvement Factors
Candidates: 2.3
Overall Time: 1.6 to 2.3

[Figures: overall runtime [sec] and number of candidates by dimension of index (16-D, 32-D, 48-D, 64-D); previous algorithm vs. optimal algorithm]

SLIDE 14

Color Histograms (112-D)

Experimental Setting

112,700 Histograms (112-D)
Quadratic Form Distance
Queries: k = 2,…,12 (0.01%)
Index (KLT): 12-D
Avg. Improvement Factors
Candidates: 17
Overall Time: 8.5

[Figures: overall runtime [sec] and number of candidates by query parameter k (2 to 12); previous algorithm vs. optimal algorithm]

SLIDE 15

Conclusions

Complex Similarity Search: Expensive similarity evaluations
Multi-Step Approach: Lower-bounding filter distance function
Optimal Algorithm: Minimum number of exact evaluations
Average Improvement Factors:

– up to 120 (number of candidates)
– up to 48 (overall runtime)

Future Work: New applications; Integration with Data Mining