


SLIDE 1

Optimal Multi-Step k-Nearest Neighbor Search

Thomas Seidl and Hans-Peter Kriegel University of Munich, Germany ACM SIGMOD ‘98, Seattle

SLIDE 2

(c) 1998 Thomas Seidl SIGMOD ‘98 - 2

Survey

  • Similarity search for complex similarity models
  • Analysis of previous solution for k-nn search
  • An optimality criterion for k-nn search
  • Optimal algorithm for k-nn search
  • Performance analysis
SLIDE 3

Distance-based Similarity Search

[Figure: range query (no answer / too many answers) versus k nearest neighbors (1st, 2nd, 3rd, 4th)]

Principle: Small Distances ↔ Strong Similarity

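Range query:

$$\mathrm{RangeQuery}(q, \varepsilon) = \{\, o \in DB : d(o, q) \le \varepsilon \,\}$$

k-NearestNeighborQuery (q,k): the first k objects of a ranking of DB that is monotonous in the distance d_q to the query:

$$\{1, \dots, k\} \xrightarrow{\ d_q\text{-monotonous}\ } DB$$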
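The two query types on this slide can be illustrated with a minimal Python sketch. The function names, the sample points and the use of the Euclidean distance for d are assumptions made for illustration; the slides do not fix a particular distance at this point.

```python
import heapq
import math


def euclidean(p, q):
    """Example distance function d; any other distance could be plugged in."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def range_query(db, q, eps, d=euclidean):
    """Return every object o in db with d(o, q) <= eps."""
    return [o for o in db if d(o, q) <= eps]


def knn_query(db, q, k, d=euclidean):
    """Return the k objects of db closest to q, nearest first."""
    return heapq.nsmallest(k, db, key=lambda o: d(o, q))


db = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 3.0)]
q = (5.0, 5.0)

print(range_query(db, q, 1.0))   # eps too small: no answer
print(range_query(db, q, 10.0))  # eps too large: the whole database
print(knn_query(db, q, 2))       # always exactly k answers
```

The range query depends on a well-chosen ε, whereas the k-nn query returns k objects regardless of how the distances are scaled.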

SLIDE 4

Complex Similarity Models

  • Quadratic Form Distance Functions

– Color Histograms for Image Databases (QBIC): 256-D histograms (Niblack et al. 93) (Hafner et al. 95)
– Shape Similarity for 2D and 3D: Up to 4,096-D vectors (Thesis Seidl 97)
– …

  • Max-Morphological Distance

– 2D images: Tumor shapes (Korn et al. 96)

Quadratic form distance:

$$d_A^2(p, q) = (p - q) \cdot A \cdot (p - q)^T$$
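As a small illustration of the quadratic form distance, the following Python sketch evaluates it with plain nested loops. The function name and the 3-D example vectors and matrices are hypothetical; the sketch assumes A is a symmetric similarity matrix for which the quadratic form is non-negative.

```python
import math


def quadratic_form_distance(p, q, A):
    """d_A(p, q) = sqrt((p - q) . A . (p - q)^T) for a similarity matrix A."""
    diff = [pi - qi for pi, qi in zip(p, q)]
    total = 0.0
    # Double sum over all pairs of components: cost grows quadratically
    # with the dimension of the vectors.
    for i, di in enumerate(diff):
        for j, dj in enumerate(diff):
            total += di * A[i][j] * dj
    return math.sqrt(max(total, 0.0))  # guard against tiny negative rounding


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Hypothetical matrix in which components 0 and 1 are considered similar.
similar = [[1.0, 0.5, 0.0], [0.5, 1.0, 0.0], [0.0, 0.0, 1.0]]

p = (1.0, 0.0, 0.0)
q = (0.0, 1.0, 0.0)

print(quadratic_form_distance(p, q, identity))  # Euclidean distance
print(quadratic_form_distance(p, q, similar))   # smaller: cross-similarity counts
```

With the identity matrix the function reduces to the Euclidean distance; off-diagonal entries let differences in similar components partly cancel.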

SLIDE 5

Cost of Single Evaluations

– Quadratic Form Distance Functions: evaluation time by dimension

    dimension    evaluation time [msec]
           21                      0.23
           64                      0.4
          112                      1.1
          256                      6.2
        1,024                    102
        4,096                  1,656

– Max-Morphological Distance (Korn et al. 96): 12.69 seconds (avg) per distance evaluation

SLIDE 6

Multi-Step Query Processing

  • Multi-Step Similarity Search

– Range Queries (Faloutsos et al. 94)
– k-Nearest Neighbor Queries (Korn et al. 96)

  • No False Drops?

[Figure: query → Filter Step (index-based) → candidates → Refinement Step (exact evaluation) → results]

Lower-Bounding Property: filter distance ≤ object distance

$$d_f(p, q) \le d_o(p, q)$$
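A minimal Python sketch of a multi-step range query shows why the lower-bounding property rules out false drops. Everything concrete here is an assumption for illustration: the object distance d_o is Euclidean on 4-D points, the filter distance d_f is Euclidean on the first two coordinates only (which can never exceed d_o), and a linear scan stands in for the index.

```python
import math


def d_o(p, q):
    """Object distance: exact (expensive) distance on the full vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def d_f(p, q, m=2):
    """Filter distance: Euclidean on the first m coordinates, so d_f <= d_o."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p[:m], q[:m])))


def multistep_range_query(db, q, eps):
    # Filter step: a linear scan stands in for the index-based filter here.
    candidates = [o for o in db if d_f(o, q) <= eps]
    # Refinement step: exact evaluation only for the candidates.
    results = [o for o in candidates if d_o(o, q) <= eps]
    return candidates, results


db = [
    (0.0, 0.0, 0.0, 0.0),
    (1.0, 0.0, 0.0, 0.0),
    (0.0, 0.0, 3.0, 0.0),
    (2.0, 2.0, 0.0, 0.0),
    (0.0, 1.0, 0.0, 1.0),
]
q = (0.0, 0.0, 0.0, 0.0)

candidates, results = multistep_range_query(db, q, 1.5)
print(len(candidates), len(results))
```

Any object with d_o(o, q) ≤ ε also satisfies d_f(o, q) ≤ ε, so the filter step cannot discard a true answer; it may only pass extra candidates that the refinement step removes.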

SLIDE 7

Previous k-nn Algorithm (Korn et al. 96)

More candidates generated than necessary

[Figure: data flow between the query, the Index and the Objects]

First Phase:
– query (q,k)
– k-nn query on Index (df)
– primary dmax (do)

Second Phase:
– dmax query on Index (df)
– final k-nn (do)

Fixed dmax in 2nd Phase! k' >> k
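The two phases of the previous algorithm can be sketched in Python as follows. This is an illustrative reading of the slide, not the original implementation: the function names, the 2-D sample data and the filter distance (difference of the first coordinate, a lower bound of the Euclidean object distance) are assumptions, and linear scans stand in for the index queries.

```python
import heapq
import math


def d_o(p, q):
    """Object distance: exact Euclidean distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def d_f(p, q):
    """Filter distance: first coordinate only, so d_f <= d_o."""
    return abs(p[0] - q[0])


def previous_knn(db, q, k):
    # First phase: k-nn query on the index with respect to d_f,
    # then exact evaluation of these k objects yields the primary d_max.
    primary = heapq.nsmallest(k, db, key=lambda o: d_f(o, q))
    d_max = max(d_o(o, q) for o in primary)
    # Second phase: range query on the index with the fixed d_max.
    candidates = [o for o in db if d_f(o, q) <= d_max]
    # Exact evaluation of every candidate gives the final k-nn.
    result = heapq.nsmallest(k, candidates, key=lambda o: d_o(o, q))
    return result, len(candidates)


db = [(0.0, 0.0), (1.0, 0.0), (0.5, 3.0), (2.0, 0.0), (0.2, 5.0), (4.0, 0.0)]
q = (0.0, 0.0)

result, n_candidates = previous_knn(db, q, 2)
print(result, n_candidates)
```

In this example the primary d_max is dominated by the point (0.2, 5.0), which looks close under d_f but is far under d_o, so the second phase has to evaluate all six objects exactly.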

SLIDE 8

Number of Candidates

[Figure: object and filter distances (0.2 to 1.2) versus rank according to filter distance (1 to 49), with dmax and the k-th object distance marked]

SLIDE 9

Optimality of k-NN Algorithms

Lemma

– Let df be a lower-bounding filter of do:

$$d_f(p, q) \le d_o(p, q)$$

– For a multi-step k-nn algorithm based on do and df, the optimal set of candidates is:

$$\{\, o \in DB : d_f(o, q) \le \varepsilon_k \,\}$$

– where εk is the k-th object similarity distance:

$$\varepsilon_k = \max \{\, d_o(o, q) : o \in NN_q(k) \,\}$$
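The candidate set named in the lemma can be computed by brute force on a toy example. The sketch below is only meant to make the definition concrete; the helper names, the 2-D sample data and the choice of filter distance (difference of the first coordinate) are assumptions. A real algorithm does not know εk in advance.

```python
import heapq
import math


def d_o(p, q):
    """Object distance: exact Euclidean distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def d_f(p, q):
    """Filter distance: first coordinate only, so d_f <= d_o."""
    return abs(p[0] - q[0])


def optimal_candidates(db, q, k):
    # eps_k: the k-th smallest object distance, found here by exhaustive search.
    nn = heapq.nsmallest(k, db, key=lambda o: d_o(o, q))
    eps_k = max(d_o(o, q) for o in nn)
    # Every object whose filter distance does not exceed eps_k must be
    # evaluated exactly: the filter alone cannot prove that it is not
    # among the k nearest neighbors.
    return eps_k, [o for o in db if d_f(o, q) <= eps_k]


db = [(0.0, 0.0), (1.0, 0.0), (0.5, 3.0), (2.0, 0.0), (0.2, 5.0), (4.0, 0.0)]
q = (0.0, 0.0)

eps_k, candidates = optimal_candidates(db, q, 2)
print(eps_k, candidates)
```

For k = 2 the k-th object distance is 1.0, and four of the six objects have a filter distance of at most 1.0; these four form the optimal candidate set for this example.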

SLIDE 10

Optimal k-nn Algorithm (new)

dmax is adjusted step by step!

[Figure: data flow between the query, the Index and the Objects]

– query (q,k)
– init ranking on Index (df)
– while df(o,q) ≤ dmax do: get next o from index and adjust dmax (do)
– final k-nn: do(o,q) ≤ dmax
– result

THEOREM:
– No unnecessary candidates
– No false drops

Required: Incremental Ranking on index (Hjaltason & Samet 95)
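The loop on this slide can be sketched in Python as follows. It is an illustration under stated assumptions rather than the authors' implementation: a sorted list stands in for the incremental ranking on the index, the filter distance is the difference of the first coordinate (a lower bound of the Euclidean object distance), and all names and sample data are hypothetical.

```python
import heapq
import math


def d_o(p, q):
    """Object distance: exact Euclidean distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def d_f(p, q):
    """Filter distance: first coordinate only, so d_f <= d_o."""
    return abs(p[0] - q[0])


def optimal_knn(db, q, k):
    # Stand-in for incremental ranking on the index: objects in ascending d_f.
    ranking = iter(sorted(db, key=lambda o: d_f(o, q)))
    heap = []          # max-heap of the best k so far: (-d_o, tiebreak, object)
    d_max = math.inf   # no bound until k objects have been evaluated
    n_candidates = 0
    for i, o in enumerate(ranking):
        if d_f(o, q) > d_max:
            break      # no later object in the ranking can qualify
        n_candidates += 1
        dist = d_o(o, q)  # exact evaluation
        if len(heap) < k:
            heapq.heappush(heap, (-dist, i, o))
        elif dist < -heap[0][0]:
            heapq.heapreplace(heap, (-dist, i, o))
        if len(heap) == k:
            d_max = -heap[0][0]  # adjust d_max to the current k-th distance
    result = [o for _, _, o in sorted(heap, key=lambda t: -t[0])]
    return result, n_candidates


db = [(0.0, 0.0), (1.0, 0.0), (0.5, 3.0), (2.0, 0.0), (0.2, 5.0), (4.0, 0.0)]
q = (0.0, 0.0)

result, n_candidates = optimal_knn(db, q, 2)
print(result, n_candidates)
```

On this toy data the loop stops after four exact evaluations, because d_max shrinks from 5.0 to about 3.0 and finally to 1.0 while the ranking is consumed; the remaining objects have filter distances above 1.0.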

SLIDE 11

Minimal Set of Candidates

[Figure: object and filter distances (0.2 to 1.2) versus rank according to filter distance (1 to 49), with primary dmax and optimal dmax marked]

The higher the filter distance, the better the filter selectivity

SLIDE 12

Uniformly Distributed Data (20-D)

Experimental Setting

  • 100,000 Objects, 20-D
  • Matrices: sim-id, 1-0, 2-2
  • Queries: k = 10 (0.01%)
  • Index: 15-D
  • Avg. Improvement Factors
      – Candidates: 72, 120, 64
      – Overall Time: 26, 48, 23

[Figures: overall runtime [sec] and number of candidates per similarity matrix, previous algorithm vs. optimal algorithm]

    similarity    overall runtime [sec]    number of candidates
    matrix        previous    optimal      previous    optimal
    sim-id             419         16        26,546        370
    sim-1-0            664         14        42,891        358
    sim-2-2          1,117         48        71,610      1,118

SLIDE 13

2-D Shape Similarity (1,024-D)

Experimental Setting

  • 10,000 Images, 32x32 Pixel
  • ‘Neighborhood Area’: 9-1
  • Queries: k = 5 (0.05%)
  • Index (KLT): 16-D, …, 64-D
  • Avg. Improvement Factors
      – Candidates: 2.3
      – Overall Time: 1.6 to 2.3

[Figures: overall runtime [sec] and number of candidates versus dimension of index (16-D, 32-D, 48-D, 64-D), previous algorithm vs. optimal algorithm]

SLIDE 14

Color Histograms (112-D)

Experimental Setting

  • 112,700 Histograms (112-D)
  • Quadratic Form Distance
  • Queries: k = 2,…,12 (0.01%)
  • Index (KLT): 12-D
  • Avg. Improvement Factors
      – Candidates: 17
      – Overall Time: 8.5

[Figures: overall runtime [sec] and number of candidates versus query parameter k (2 to 12), previous algorithm vs. optimal algorithm]

SLIDE 15

Conclusions

  • Complex Similarity Search: Expensive similarity evaluations
  • Multi-Step Approach: Lower-bounding filter distance function
  • Optimal Algorithm: Minimum number of exact evaluations
  • Average Improvement Factors:

      – up to 120 (number of candidates)
      – up to 48 (overall runtime)

  • Future Work: New applications; Integration with Data Mining