Similarity Estimation Techniques from Rounding Algorithms - PowerPoint PPT Presentation



SLIDE 1

Similarity Estimation Techniques from Rounding Algorithms

Moses Charikar
Princeton University

SLIDE 2

Compact sketches for estimating similarity

  • Collection of objects, e.g. mathematical representation of documents, images.
  • Implicit similarity/distance function.
  • Want to estimate similarity without looking at entire objects.
  • Compute compact sketches of objects so that similarity/distance can be estimated from them.

SLIDE 3

Similarity Preserving Hashing

  • Similarity function sim(x,y)
  • Family of hash functions F with probability distribution such that

$$\Pr_{h \in F}[h(x) = h(y)] = \mathrm{sim}(x,y)$$

SLIDE 4

Applications

  • Compact representation scheme for estimating similarity

$$x \to (h_1(x), h_2(x), \ldots, h_k(x)), \qquad y \to (h_1(y), h_2(y), \ldots, h_k(y))$$

  • Approximate nearest neighbor search [Indyk, Motwani] [Kushilevitz, Ostrovsky, Rabani]
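
As a hedged aside on how such a sketch could be used: assuming the k hash functions are drawn independently from F (an assumption the slide does not state), the fraction of agreeing coordinates is an unbiased estimator of sim(x,y), and its variance shrinks as 1/k. The symbol \hat{s}_k is introduced here only for illustration.

```latex
\[
\hat{s}_k(x,y) = \frac{1}{k}\sum_{i=1}^{k} \mathbf{1}\big[h_i(x) = h_i(y)\big],
\qquad
E[\hat{s}_k(x,y)] = \mathrm{sim}(x,y),
\qquad
\mathrm{Var}[\hat{s}_k(x,y)] = \frac{\mathrm{sim}(x,y)\,\big(1 - \mathrm{sim}(x,y)\big)}{k}.
\]
```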

SLIDE 5

Estimating Set Similarity

[Broder, Manasse, Glassman, Zweig] [Broder, C, Frieze, Mitzenmacher]

  • Collection of subsets

$$\text{similarity}(S_1, S_2) = \frac{|S_1 \cap S_2|}{|S_1 \cup S_2|}$$

SLIDE 6

Minwise Independent Permutations

$$\Pr[\min(\sigma(S_1)) = \min(\sigma(S_2))] = \frac{|S_1 \cap S_2|}{|S_1 \cup S_2|}$$

[Figure: sets S_1 and S_2 under a permutation σ, with min(σ(S_1)) and min(σ(S_2)) marked.]
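
The min-wise property can be checked empirically. The following is a minimal sketch, assuming a small explicit universe so that truly random permutations can be stored; the function names (`random_permutation`, `minhash_sketch`, `estimate_similarity`) and the example sets are illustrative and do not come from the slides.

```python
import random

def jaccard(s1, s2):
    """Exact set similarity: size of intersection over size of union."""
    return len(s1 & s2) / len(s1 | s2)

def random_permutation(universe, rng):
    """Map each element to its rank under a uniformly random permutation."""
    ranks = list(range(len(universe)))
    rng.shuffle(ranks)
    return dict(zip(universe, ranks))

def minhash_sketch(s, perms):
    """One coordinate per permutation: the smallest rank among elements of s."""
    return [min(sigma[e] for e in s) for sigma in perms]

def estimate_similarity(sketch1, sketch2):
    """Fraction of coordinates on which the two sketches agree."""
    agree = sum(a == b for a, b in zip(sketch1, sketch2))
    return agree / len(sketch1)

rng = random.Random(0)
universe = list(range(50))
perms = [random_permutation(universe, rng) for _ in range(2000)]

s1 = set(range(0, 30))
s2 = set(range(15, 45))
exact = jaccard(s1, s2)  # 15 shared elements out of 45 in the union
estimate = estimate_similarity(minhash_sketch(s1, perms),
                               minhash_sketch(s2, perms))
print(exact, estimate)
```

Each sketch has 2000 coordinates regardless of the set sizes, and the comparison never looks at the sets themselves, only at the sketches.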

SLIDE 7

Related Work

  • Streaming algorithms
      • Compute f(data) in one pass using small space.
      • Implicitly construct sketch of data seen so far.
  • Synopsis data structures [Gibbons, Matias]
  • Compact distance oracles, distance labels.
  • Hash functions with similar properties: [Linial, Sassoon] [Indyk, Motwani, Raghavan, Vempala] [Feige, Krauthgamer]

SLIDE 8

Results

  • Necessary conditions for existence of similarity preserving hashing (SPH).
  • SPH schemes from rounding algorithms
  • Hash function for vectors based on random hyperplane rounding.
  • Hash function for estimating Earth Mover Distance based on rounding schemes for classification with pairwise relationships.

SLIDE 9

Existence of SPH schemes

  • sim(x,y) admits an SPH scheme if ∃ family of hash functions F such that

$$\Pr_{h \in F}[h(x) = h(y)] = \mathrm{sim}(x,y)$$

SLIDE 10

Theorem: If sim(x,y) admits an SPH scheme then 1 - sim(x,y) satisfies triangle inequality.

Proof:

$$1 - \mathrm{sim}(x,y) = \Pr_{h \in F}[h(x) \ne h(y)]$$

$$\Delta_h(x,y): \text{ indicator variable for } h(x) \ne h(y)$$

$$\Delta_h(x,y) + \Delta_h(y,z) \ge \Delta_h(x,z)$$

$$1 - \mathrm{sim}(x,y) = E_{h \in F}[\Delta_h(x,y)]$$

Taking expectations over h ∈ F of the inequality for Δ_h gives the triangle inequality for 1 - sim.

SLIDE 11

Stronger Condition

Theorem: If sim(x,y) admits an SPH scheme then (1+sim(x,y))/2 has an SPH scheme with hash functions mapping objects to {0,1}.

Theorem: If sim(x,y) admits an SPH scheme then 1 - sim(x,y) is isometrically embeddable in the Hamming cube.

SLIDE 12

Random Hyperplane Rounding based SPH

  • Collection of vectors
  • Pick random hyperplane through origin (normal r)

$$h_r(u) = \begin{cases} 1 & \text{if } r \cdot u \ge 0 \\ 0 & \text{if } r \cdot u < 0 \end{cases}$$

  • [Goemans, Williamson]

$$\mathrm{sim}(u,v) = 1 - \frac{\theta(u,v)}{\pi}$$

where θ(u,v) is the angle between u and v.
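
The hyperplane hash can be simulated directly. This is a minimal sketch, assuming the normal r is drawn with independent standard Gaussian coordinates (a standard way to get a uniformly random direction); the helper names and the two example vectors are illustrative only.

```python
import math
import random

def random_normal(dim, rng):
    """Normal vector r of a random hyperplane through the origin."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]

def hyperplane_bit(r, u):
    """h_r(u): 1 if r . u >= 0, else 0."""
    return 1 if sum(ri * ui for ri, ui in zip(r, u)) >= 0 else 0

def angle(u, v):
    """Angle between u and v in radians."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.acos(max(-1.0, min(1.0, dot / (norm_u * norm_v))))

rng = random.Random(1)
u = [1.0, 0.0, 0.0]
v = [1.0, 1.0, 0.0]  # 45 degrees away from u

normals = [random_normal(3, rng) for _ in range(4000)]
sketch_u = [hyperplane_bit(r, u) for r in normals]
sketch_v = [hyperplane_bit(r, v) for r in normals]

agreement = sum(a == b for a, b in zip(sketch_u, sketch_v)) / len(normals)
expected = 1 - angle(u, v) / math.pi  # 1 - (pi/4)/pi = 0.75
print(agreement, expected)
```

Each vector is reduced to one bit per hyperplane, so the sketch length is fixed by the number of hyperplanes, not by the dimension of the vectors.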

SLIDE 13

Earth Mover Distance (EMD)

[Figure: two distributions P and Q, with EMD(P,Q) between them.]

SLIDE 14

Earth Mover Distance

  • Set of points L = {l_1, l_2, …, l_n}
  • Distance function d(i,j) (assume metric)
  • Distribution P(L): non-negative weights (p_1, p_2, …, p_n).
  • Earth Mover Distance (EMD): distance between distributions P and Q.
  • Proposed as metric in graphics and vision for distance between images. [Rubner, Tomasi, Guibas]

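SLIDE 15

$$\begin{aligned}
\min \quad & \sum_{i,j} f_{i,j} \cdot d(i,j) \\
\text{subject to} \quad & \forall i: \ \sum_{j} f_{i,j} = p_i \\
& \forall j: \ \sum_{i} f_{i,j} = q_j \\
& \forall i,j: \ f_{i,j} \ge 0
\end{aligned}$$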

SLIDE 16

Relaxation of SPH

  • Estimate distance measure, not similarity measure in [0,1].
  • Allow hash functions to map objects to points in metric space and measure E[d(h(P),h(Q))]. (SPH: d(x,y) = 1 if x ≠ y)
  • Estimator will approximate EMD.

SLIDE 17

Classification with pairwise relationships [Kleinberg, Tardos]

[Figure: assignment cost; separation cost w_e.]

SLIDE 18

Classification with pairwise relationships

  • Collection of objects V
  • Labels L = {l_1, l_2, …, l_n}
  • Assignment of labels h : V → L
  • Cost of assigning label to u : c(u,h(u))
  • Graph of related objects; for edge e=(u,v), cost paid: w_e · d(h(u),h(v))
  • Find assignment of labels to minimize cost.
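
To make the objective concrete, here is a minimal sketch that evaluates the total cost of a labeling and finds the cheapest one by brute force on a toy instance. The data layout (nested dicts, edge triples) and all numbers are assumptions made for illustration; brute force is only viable for tiny inputs and is not the approach of the cited work.

```python
from itertools import product

def labeling_cost(assignment, assign_cost, edges, dist):
    """Sum of c(u, h(u)) over objects plus w_e * d(h(u), h(v)) over edges."""
    assignment_part = sum(assign_cost[u][label]
                          for u, label in assignment.items())
    separation_part = sum(w * dist[assignment[u]][assignment[v]]
                          for (u, v, w) in edges)
    return assignment_part + separation_part

# Toy instance: two objects, two labels, uniform metric on labels.
objects = ["a", "b"]
labels = [0, 1]
assign_cost = {"a": {0: 1.0, 1: 3.0}, "b": {0: 2.0, 1: 0.5}}
edges = [("a", "b", 2.0)]  # (u, v, w_e)
dist = {0: {0: 0.0, 1: 1.0}, 1: {0: 1.0, 1: 0.0}}

# Enumerate every assignment h : V -> L and keep the cheapest.
candidates = [dict(zip(objects, choice))
              for choice in product(labels, repeat=len(objects))]
best = min(candidates,
           key=lambda h: labeling_cost(h, assign_cost, edges, dist))
best_cost = labeling_cost(best, assign_cost, edges, dist)
print(best, best_cost)
```

In this instance the edge weight is large enough that giving both objects the same label beats each object taking its individually cheapest label.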

SLIDE 19

LP Relaxation and Rounding

[Kleinberg, Tardos] [Chekuri, Khanna, Naor, Zosin]

Separation cost measured by EMD(P,Q)

Rounding algorithm guarantees

$$\Pr[h(P) = l_i] = p_i$$

$$E[d(h(P),h(Q))] \le O(\log n \log\log n) \cdot \mathrm{EMD}(P,Q)$$

SLIDE 20

Rounding details

  • Probabilistically approximate metric on L by tree metric (HST)
  • Expected distortion O(log n log log n)
  • EMD on tree metric has nice form:
      • T: subtree
      • P(T): sum of probabilities for leaves in T
      • l_T: length of edge leading up from T
      • EMD(P,Q) = ∑_T l_T |P(T) - Q(T)|
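
The closed form for EMD on a tree can be evaluated in a few lines. This is a minimal sketch, assuming the tree is given as a parent map with one edge length per non-root node; the representation and the small example tree are illustrative, not taken from the slides.

```python
def tree_emd(parent, edge_length, p, q):
    """EMD between leaf distributions p and q on a tree.

    parent:      dict node -> parent node (the root maps to None)
    edge_length: dict node -> length of the edge from node up to its parent
    p, q:        dict leaf -> probability mass
    """
    # diff[T] accumulates P(T) - Q(T) for the subtree rooted at T by pushing
    # each leaf's mass difference up along its path to the root.
    diff = {node: 0.0 for node in parent}
    for leaf in set(p) | set(q):
        delta = p.get(leaf, 0.0) - q.get(leaf, 0.0)
        node = leaf
        while node is not None:
            diff[node] += delta
            node = parent[node]
    # Sum l_T * |P(T) - Q(T)| over all subtrees that have an edge above them.
    return sum(edge_length[node] * abs(diff[node])
               for node in parent if parent[node] is not None)

# Root with children A and B (edges of length 2); leaves hang off edges of length 1.
parent = {"root": None, "A": "root", "B": "root",
          "a1": "A", "a2": "A", "b1": "B"}
edge_length = {"A": 2.0, "B": 2.0, "a1": 1.0, "a2": 1.0, "b1": 1.0}

near = tree_emd(parent, edge_length,
                {"a1": 0.5, "b1": 0.5}, {"a2": 0.5, "b1": 0.5})
far = tree_emd(parent, edge_length, {"a1": 1.0}, {"b1": 1.0})
print(near, far)
```

In the first call half a unit of mass moves between sibling leaves at tree distance 2; in the second a full unit crosses the root at tree distance 6, which matches the two results.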

SLIDE 21

Theorem: The rounding scheme gives a hashing scheme such that

$$\mathrm{EMD}(P,Q) \le E[d(h(P),h(Q))] \le O(\log n \log\log n) \cdot \mathrm{EMD}(P,Q)$$

Proof:

$$y_{i,j} = \Pr[h(P) = l_i,\ h(Q) = l_j]$$

The y_{i,j} give a feasible solution to the LP for EMD (their marginals are p_i and q_j, since Pr[h(P) = l_i] = p_i).

Cost of this solution = E[d(h(P),h(Q))]

Hence EMD(P,Q) ≤ E[d(h(P),h(Q))]

SLIDE 22

SPH for weighted sets

  • Weighted Set: (p_1, p_2, …, p_n), weights in [0,1]
  • Kleinberg-Tardos rounding scheme for uniform metric can be thought of as a hashing scheme for weighted sets with

$$\mathrm{sim}(P,Q) = \frac{\sum_i \min(p_i, q_i)}{\sum_i \max(p_i, q_i)}$$

  • Generalization of minwise independent permutations
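
One way to read the rounding scheme as a hash function is sketched below. It is an interpretation, not the construction from the slides: it assumes a random stream of (coordinate, threshold) pairs shared by all weighted sets, and takes the hash of a weighted set to be the index of the first round in which the drawn threshold falls below the set's weight at the drawn coordinate. Using the round index as the hash value, and all names and numbers here, are illustrative choices.

```python
import random

def weighted_similarity(p, q):
    """sim(P,Q) = sum_i min(p_i, q_i) / sum_i max(p_i, q_i)."""
    return (sum(min(a, b) for a, b in zip(p, q)) /
            sum(max(a, b) for a, b in zip(p, q)))

def weighted_hash(p, seed, max_rounds=100000):
    """Hash of weighted set p under the shared random stream named by seed.

    Each round draws a coordinate i and a threshold alpha in [0, 1);
    the hash is the index of the first round with p[i] > alpha.
    """
    rng = random.Random(seed)  # same seed gives the same stream for every set
    for t in range(max_rounds):
        i = rng.randrange(len(p))
        alpha = rng.random()
        if p[i] > alpha:
            return t
    return None  # reached only if the weights are (almost) all zero

p = [0.9, 0.1, 0.5, 0.0]
q = [0.6, 0.4, 0.5, 0.2]
exact = weighted_similarity(p, q)  # 1.2 / 2.0

# Independent hash functions correspond to independent seeds.
master = random.Random(7)
seeds = [master.getrandbits(64) for _ in range(3000)]
collisions = sum(weighted_hash(p, s) == weighted_hash(q, s) for s in seeds)
estimate = collisions / len(seeds)
print(exact, estimate)
```

Two sets collide exactly when both accept in the first round where either accepts, which happens with probability ∑ min(p_i, q_i) / ∑ max(p_i, q_i). With 0/1 weights this reduces to the set similarity |S_1 ∩ S_2| / |S_1 ∪ S_2|.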

SLIDE 23

Conclusions and Future Work

  • Interesting connection between rounding procedures for approximation algorithms and hash functions for estimating similarity.
  • Better estimators for Earth Mover Distance
  • Ignored variance of estimators: related to dimensionality reduction in L_1
  • Study compact representation schemes in general