SLIDE 1

Growing a Graph Matching from a Handful of Seeds

Ehsan KAZEMI (1), S. Hamed HASSANI (2), and Matthias GROSSGLAUSER (1)

(1) School of Computer and Communication Sciences, EPFL
(2) Department of Computer Science, ETHZ

September 1, 2015

SLIDE 2

Motivation

Example 1: network de-anonymization

x@epfl.ch y@epfl.ch z@epfl.ch

Anonymized e-mail network

Hamed@epfl.ch Matthias@epfl.ch Ehsan@epfl.ch

LinkedIn connections

Example 2: protein-protein interaction network alignment

Q07890 P55957 P04637 Q92934 P06436 P62805 P00742 Q8WUU5 Q9Y365 O60271

Human network

P46108 P01127 P58391 P62806 O88947 Q920S3 Q9JMD3 Q58A65

Mouse network

1/18

SLIDE 3

Motivation

Graph matching (also known as network reconciliation or network alignment) is studied in many fields:
Network analysis: matching networks in similar domains for friend suggestion and personalized advertisements
Bioinformatics: protein-protein interaction network alignment
Document and image processing: OCR and handwriting recognition
Biometric identification: face authentication and recognition
Image databases: matching graph segments of two scenes

Matching graph segments of scenes [Lazebnik et al., 2006]

2/18

SLIDE 4

What is Graph Matching?

Goal: find the unknown matching (bijection) between the nodes in the intersection of the two graphs G1(V1, E1) and G2(V2, E2), where the presence of an edge between the same pair of nodes is correlated across the two graphs
Questions:
When is it possible to align?
How to align? (graph matching algorithms)

Is it possible to use only the graph structures to establish the true matching between the nodes?

3/18

SLIDE 5

Algorithm, Model and Performance Guarantee

Algorithm: percolation graph matching [Yartseva and Grossglauser, 2013; Chiasserini et al., 2014; Korula and Lattanzi, 2014]
Model: a random bigraph generator [Pedarsani and Grossglauser, 2011; Kazemi et al., 2015]
Performance guarantee: theory of bootstrap percolation over random graphs [Janson et al., 2010]

4/18

SLIDE 6

Percolation Graph Matching

Start from an initial candidate set of seed pairs
Every unmatched pair with at least r neighbouring seed pairs gets matched and becomes a new seed
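The rule above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `percolation_graph_matching`, the adjacency-list dictionaries `adj1` and `adj2`, and the default threshold `r=2` are all assumptions made for the example. A pair (u, v) is treated as a neighbour of a seed pair (i, j) when u is adjacent to i in the first graph and v is adjacent to j in the second.

```python
from collections import defaultdict
from itertools import product


def percolation_graph_matching(adj1, adj2, seeds, r=2):
    """Grow a matching from seed pairs (illustrative sketch).

    adj1, adj2: dicts mapping each node to a list of its neighbours.
    seeds: list of (node in graph 1, node in graph 2) pairs.
    r: number of marks a pair needs before it is matched (assumed default).
    """
    matched1 = {i: j for i, j in seeds}   # graph-1 node -> graph-2 node
    matched2 = {j: i for i, j in seeds}   # graph-2 node -> graph-1 node
    marks = defaultdict(int)              # marks collected by each pair
    queue = list(seeds)                   # pairs that still have to spread marks
    while queue:
        i, j = queue.pop()
        # Every neighbouring pair of (i, j) receives one mark.
        for u, v in product(adj1[i], adj2[j]):
            if u in matched1 or v in matched2:
                continue                  # one of the two nodes is already taken
            marks[(u, v)] += 1
            if marks[(u, v)] >= r:
                # The pair is matched and becomes a new seed.
                matched1[u] = v
                matched2[v] = u
                queue.append((u, v))
    return matched1
```

With too few seeds no pair ever collects r marks, so the function returns only the seeds; this is the "stuck" situation that the later slides address.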

5/18

SLIDE 10

Percolation Graph Matching

Start from an initial candidate set of seed pairs
Every unmatched pair with at least r neighbouring seed pairs gets matched and becomes a new seed
Size of the final matching vs. number of initial seeds

5/18

SLIDE 11

Bi(G; t, s): A Random Bigraph Model

Bi(G; t, s) is a random bigraph model to generate two correlated graphs

[Diagram: G(V, E) → Bi( · ; t, s) (node sampling, edge sampling) → G1(V1, E1), G2(V2, E2)]
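A rough sketch of such a generator follows. The slide only names the two steps (node sampling and edge sampling), so the details here are assumptions: each node of G is kept independently with probability t and each surviving edge independently with probability s, separately for the two output graphs. The function name `sample_bigraph` and the `rng` parameter are hypothetical.

```python
import random


def sample_bigraph(nodes, edges, t, s, rng=None):
    """Sample two correlated graphs from G = (nodes, edges) (illustrative sketch).

    Assumption: nodes are kept with probability t and edges with probability s,
    independently for each of the two output graphs.
    Returns ((V1, E1), (V2, E2)) as sets.
    """
    rng = rng or random.Random()

    def sample_once():
        kept_nodes = {v for v in nodes if rng.random() < t}
        kept_edges = {
            (u, v)
            for u, v in edges
            if u in kept_nodes and v in kept_nodes and rng.random() < s
        }
        return kept_nodes, kept_edges

    return sample_once(), sample_once()
```

Under these assumptions a node of G appears in both graphs with probability t^2 and an edge between two shared nodes survives in both with probability s^2, which matches the "overlap probability" axes used in the experiments.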

6/18

SLIDE 12

Bootstrap Percolation

Marks are spread over the tensor product of the two graphs:

Green nodes are correct pairs
Red nodes are wrong pairs
Green nodes are more connected

Correct pairs (n nodes): (u1, u1) (u2, u2) (u3, u3) (u4, u4)

Wrong pairs (n^2 − n nodes): (u1, u2) (u1, u3) (u1, u4) (u2, u1) (u2, u3) (u2, u4) (u3, u1) (u3, u2) (u3, u4) (u4, u1) (u4, u2) (u4, u3)

7/18

SLIDE 15

Bootstrap Percolation: Phase Transition

Supercritical regime: the process percolates to the whole network

[Diagram: Seed set → PGM → Matched set]

Subcritical regime: the process dies young

[Diagram: Seed set → PGM → Matched set]

8/18

SLIDE 16

NoisySeeds Algorithms

State-of-the-art PGM algorithms need many seeds: with even a moderate number of seeds, percolation gets stuck in the early steps
Finding many seeds is difficult and expensive
Observation: PGM is robust to noise

[Diagram: the tensor product of the two graphs, with n correct pairs (u1, u1) (u2, u2) (u3, u3) (u4, u4) and n^2 − n wrong pairs; starting from the noisy seed set (u1, u2) (u1, u4) (u2, u4) (u3, u4) (u1, u1) (u3, u3), the process reaches the correct pairs (u4, u4) (u1, u1) (u3, u3) (u2, u2)]

9/18

SLIDE 19

NoisySeeds Algorithms

Adding many wrong pairs to the initial candidate set has a negligible effect on the performance of NoisySeeds

[Diagram: Seed set → NoisySeeds → Matched set, compared with Seed set → Expand → Expanded noisy seed set → NoisySeeds → Matched set]

10/18

SLIDE 20

NoisySeeds: Performance Guarantee

Theorem (Performance Guarantee over Bi(G(n, p); t, s)). For Bi(G(n, p); t, s) with fixed s and t, assume $n^{-1} \ll p \le n^{-\frac{5}{6} - \epsilon}$. Provided a seed set of

$$a_{t,s,r} = \left(1 - \frac{1}{r}\right) \left( \frac{(r-1)!}{n t^2 \, (p s^2)^r} \right)^{\frac{1}{r-1}} \text{ correct pairs}$$

and $O(n)$ wrong pairs,

with high probability NoisySeeds percolates and outputs $nt^2 \pm o(n)$ correct pairs and $o(n)$ wrong pairs.
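To get a feel for the size of the seed threshold, the formula can be evaluated numerically. The sketch below is only an illustration: the function name `seed_threshold` is hypothetical, it takes t^2 and s^2 directly as `t2` and `s2`, and the choice r = 2 is an assumption, since the slides do not state which threshold the experiments use.

```python
import math


def seed_threshold(n, p, t2, s2, r):
    """Evaluate a_{t,s,r} = (1 - 1/r) * ((r-1)! / (n t^2 (p s^2)^r))^(1/(r-1)).

    t2 and s2 are t^2 and s^2; r must be at least 2.
    """
    base = math.factorial(r - 1) / (n * t2 * (p * s2) ** r)
    return (1 - 1 / r) * base ** (1 / (r - 1))


n = 10**6
for s2 in (0.81, 0.64, 0.49):
    # Round up to a whole number of seed pairs.
    print(s2, math.ceil(seed_threshold(n, 20 / n, 1.0, s2, r=2)))
# prints 1906, 3052 and 5207 for s2 = 0.81, 0.64 and 0.49
```

With n = 10^6, p = 20/n, t^2 = 1.0 and the assumed r = 2, the rounded-up values coincide with the seed counts quoted for PercolateMatched in the legend of Experiment 1.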

11/18

SLIDE 21

ExpandWhenStuck

A heuristic based on the idea of robustness to noisy pairs

Percolation process is stuck
Node u is matched (correctly)

[Diagram: matched node u with neighbours u1, u2, u3 in one graph and u1, u2, u3, u4, u5 in the other]

12/18

SLIDE 22

ExpandWhenStuck

Unmatched neighbouring pairs of the node pair [u, u] are new candidate pairs
The two graphs are correlated: among the new candidate pairs a small fraction is correct, e.g., [u1, u1]
PGM is robust to noise in the candidate pairs

[Diagram: node u with neighbours u1, u2, u3 in G1 and u1, u2, u3, u4, u5 in G2]

13/18

SLIDE 23

ExpandWhenStuck

Expand the candidate set with many noisy pairs whenever the percolation process gets stuck

[Diagram: Seed set → NoisySeeds → Matched set → Expand → Expanded noisy candidate set → NoisySeeds → Matched set, repeated]

14/18

SLIDE 29

Experiment 1: Random Graphs

ExpandWhenStuck vs. PercolateMatched [Yartseva and Grossglauser, 2013] over Bi(G(n, p); t, s) with n = 10^6, p = 20/n, and t^2 = 1.0

[Plot: total number of matched pairs (0 to 10^6) vs. number of seeds (5 to 50). Legend: PercolateMatched: 1906 seeds for s^2 = 0.81, 3052 seeds for s^2 = 0.64, 5207 seeds for s^2 = 0.49; ExpandWhenStuck for s^2 = 0.81, s^2 = 0.64, and s^2 = 0.49]

238 times improvement for s^2 = 0.81

15/18

SLIDE 30

Experiment 2: Gowalla Network

ExpandWhenStuck vs. PercolateMatched [Yartseva and Grossglauser, 2013] and User–Matching [Korula and Lattanzi, 2014] over the Gowalla network.

[Plot: F1–score (0.1 to 0.8) vs. number of seeds (20 to 200) for ExpandWhenStuck, PercolateMatched, and User–Matching]

16/18

SLIDE 31

Experiment 3: Network Alignment in Bioinformatics

ExpandWhenStuck vs. state-of-the-art PPI network alignment algorithms∗

[Plot: axes labelled "OC Score" and "ICS Score" (tick ranges 0 to 700 and 0 to 0.3), comparing ExpandWhenStuck, IsoRank, SPINAL I, SPINAL II, and PINALOG]

Access: http://proper.epfl.ch

∗ E. Kazemi, S. H. Hassani, H. Pezeshgi Modarres and M. Grossglauser. “ProPer: Global Protein-Protein Interaction Network Alignment with Percolation Graph-Matching.” Submitted to Bioinformatics.

17/18

SLIDE 32

Conclusion

Graph matching has applications in many fields

Percolation graph matching

ExpandWhenStuck: a fast and accurate algorithm

Only a handful of seeds is enough for percolation

MapReduce implementation: a variant of ExpandWhenStuck

Analysis of a simplified version of ExpandWhenStuck: a phase transition result

Mahalo!

18/18

SLIDE 33

ExpandWhenStuck

Input: G1(V1, E1), G2(V2, E2), seed set A0 of correct pairs
Output: The set of matched pairs M

A ← A0 is the initial set of seed pairs, M ← A0;
Z ← ∅ is the set of used pairs;
while |A| > 0 do
    for all pairs [i, j] ∈ A do
        add the pair [i, j] to Z and add one mark to all of its neighbouring pairs;
    while there exists an unmatched pair with score at least 2 do
        among the pairs with the highest score select the unmatched pair [i, j] with the minimum |d_{1,i} − d_{2,j}|;
        add [i, j] to the set M;
        if [i, j] ∉ Z then
            add one mark to all of its neighbouring pairs and add the pair [i, j] to Z;
    A ← all neighbouring pairs [i, j] of matched pairs M s.t. [i, j] ∉ Z, i ∉ V1(M) and j ∉ V2(M);
return M;
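The pseudocode above translates fairly directly into Python. The following is a minimal, unoptimized sketch rather than the authors' code: the function name `expand_when_stuck`, the adjacency-list dictionaries, and the linear scan used to pick the best pair are choices made for the example, and d_{1,i}, d_{2,j} are taken to be the degrees of i in G1 and j in G2 (an assumption, as the slide does not define them). It needs Python 3.8 or later.

```python
from collections import defaultdict
from itertools import product


def expand_when_stuck(adj1, adj2, seeds):
    """Illustrative sketch of ExpandWhenStuck.

    adj1, adj2: dicts mapping each node to a list of its neighbours.
    seeds: list of (node in G1, node in G2) pairs, i.e. A0.
    """
    m1 = {i: j for i, j in seeds}     # M, indexed by the G1 node
    m2 = {j: i for i, j in seeds}     # M, indexed by the G2 node
    used = set()                      # Z, pairs that already spread marks
    marks = defaultdict(int)          # score of each pair
    candidates = set(seeds)           # A

    def spread(i, j):
        used.add((i, j))
        for pair in product(adj1[i], adj2[j]):
            marks[pair] += 1

    def best_unmatched():
        free = [(pair, score) for pair, score in marks.items()
                if score >= 2 and pair[0] not in m1 and pair[1] not in m2]
        if not free:
            return None
        # Highest score first, then smallest degree difference (assumed d).
        return min(free, key=lambda ps: (
            -ps[1], abs(len(adj1[ps[0][0]]) - len(adj2[ps[0][1]]))))[0]

    while candidates:
        for i, j in candidates:
            spread(i, j)
        # Ordinary percolation with threshold 2.
        while (pair := best_unmatched()) is not None:
            i, j = pair
            m1[i], m2[j] = j, i
            if pair not in used:
                spread(i, j)
        # Stuck: every unused neighbouring pair of a matched pair
        # becomes a (possibly wrong) candidate.
        candidates = {(u, v)
                      for i, j in m1.items()
                      for u, v in product(adj1[i], adj2[j])
                      if (u, v) not in used and u not in m1 and v not in m2}
    return m1
```

Each pass of the outer loop adds at least one new pair to the used set, so the loop terminates once no unused candidate pairs remain. A practical implementation would replace the linear scan in `best_unmatched` with a priority structure.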

18/18

SLIDES 34 to 40

ExpandWhenStuck

The same pseudocode is shown again, annotated step by step:

Initial candidate set
Spread marks from the candidate set
Percolation Graph Matching
Oops! Stuck! :-(
Expand When Stuck! :-)
Fuel the percolation process! :-)
Percolation Graph Matching

18/18

SLIDE 41

ExpandWhenStuck: MapReduce Implementation

A parallelized variant of ExpandWhenStuck

Inputs: graphs, seed set
Job 1: spread out marks from candidate pairs
Job 2: filter out the node pairs
Is percolation stuck?
no: Job 3: spread out marks from new matched pairs
yes: Job 4: expand to generate new candidate pairs

18/18

SLIDE 42

Sketch of the Proof

1. At the beginning of NoisySeeds, all the pairs in the noisy seed set spread out marks to their neighbouring pairs.
Lemma: At the completion time of the initial phase, the expected number of wrongly matched pairs is o(a_{t,s,r}).

2. The percolation graph matching process continues for at most min(|V1|, |V2|) more steps.
Lemma: When PGM stops, the expected number of wrongly matched pairs is o(a_{t,s,r}).

18/18

SLIDE 43

Sketch of the Proof (Continued)

3. Using Markov's inequality, we find an upper bound for the number of wrongly matched pairs.
Lemma: With high probability, the total number of wrongly matched pairs at any time step is o(a_{t,s,r}).

Idea: apply the theory of bootstrap percolation in G(nt^2, ps^2) [Janson et al., 2010] to Bi(G(n, p); t, s) graphs.

18/18

SLIDE 44

Experiment 4: Phase Transition

Bi(G(n, p); t, s) with n = 10^6 and p = 20/n

[Plot: total number of correct matched pairs (2e+05 to 1e+06) vs. normalized number of correct seeds Λ(A0) / a_{t,s,r} (0.8 to 1.4), with one curve for each combination of t^2 ∈ {1.0, 0.81, 0.64} and s^2 ∈ {1.0, 0.81, 0.64}]

18/18

SLIDE 45

Experiment 5: Slashdot Network

ExpandWhenStuck over Bi(Slashdot network; t, s) when the number of seeds is 10.
Combine precision and recall in one metric:

$$F_1\text{–score} = 2 \cdot \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$$

[Heat map: F1–score (0.0 to 1.0) as a function of edge overlap probability s^2 (0.5 to 1.0) and node overlap probability t^2 (0.5 to 1.0)]
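As a small worked example of the metric, the sketch below computes precision, recall and the F1–score for a matching. The slide gives only the F1 formula, so the definitions used here are assumptions: precision is the fraction of output pairs that are correct, and recall is the fraction of ground-truth pairs that were recovered. The function name `matching_scores` is hypothetical.

```python
def matching_scores(matched, truth):
    """Return (precision, recall, F1) for a matching (illustrative sketch).

    matched: dict from G1 nodes to G2 nodes produced by an algorithm.
    truth: dict with the ground-truth matching on the shared nodes.
    """
    correct = sum(1 for u, v in matched.items() if truth.get(u) == v)
    precision = correct / len(matched) if matched else 0.0
    recall = correct / len(truth) if truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For instance, an output of three pairs of which two are correct, against a ground truth of four pairs, gives precision 2/3, recall 1/2 and an F1–score of 4/7.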

For results regarding random graph models such as Barabási–Albert, Chung–Lu, and Erdős–Rényi, please refer to the paper.

18/18

SLIDE 46

Experiment 6: EPFL e-mail Network

ExpandWhenStuck vs. PercolateMatched over EPFL e-mail exchange network.

18/18