http://cs224w.stanford.edu Non overlapping vs overlapping - - PowerPoint PPT Presentation

CS224W: Social and Information Network Analysis, Jure Leskovec, Stanford University, http://cs224w.stanford.edu. Non‐overlapping vs. overlapping communities


slide-1
SLIDE 1

CS224W: Social and Information Network Analysis

Jure Leskovec, Stanford University

http://cs224w.stanford.edu

slide-2
SLIDE 2

 Non‐overlapping vs. overlapping communities

11/10/2010 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 2

slide-3
SLIDE 3

[Palla et al., ‘05]

 A node belongs to many social circles

slide-4
SLIDE 4

[Palla et al., ‘05]

 Two nodes belong to the same community if they can be connected through adjacent k‐cliques:
  • k‐clique: fully connected graph on k nodes
  • Adjacent k‐cliques: overlap in k-1 nodes

 k‐clique community
  • Set of nodes that can be reached through a sequence of adjacent k‐cliques

(Figure: a 4-clique; adjacent 3-cliques)

slide-5
SLIDE 5

[Palla et al., ‘05]

 Clique Percolation Method:
  • Find maximal‐cliques (not k‐cliques!)
  • Clique overlap matrix:
  • Each clique is a node
  • Connect two cliques if they overlap in at least k-1 nodes
  • Communities:
  • Connected components of the clique overlap matrix

 How to set k?
  • Set k so that we get the “richest” (most widely distributed cluster sizes) community structure

slide-6
SLIDE 6

[Palla et al., ‘05]

 Start with graph and find maximal cliques
 Create clique overlap matrix
 Threshold the matrix at value k‐1
  • If aij<k-1 set 0
 Communities are the connected components of the thresholded matrix

(Figure panels: (1) Graph, (2) Clique overlap matrix, (3) Thresholded matrix at k=4, (4) Communities (connected components))
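The four steps above can be sketched in plain Python. This is a minimal illustration, not the implementation from Palla et al.: the function names `maximal_cliques` and `clique_percolation` are made up for this example, maximal cliques are found with a basic Bron-Kerbosch recursion, and the overlap matrix is stored as an adjacency list over clique indices. Dropping maximal cliques with fewer than k nodes is an assumption added here, since such a clique cannot contain a k‐clique.

```python
from itertools import combinations


def maximal_cliques(adj):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(r, p, x):
        # r: current clique, p: candidates to add, x: nodes already tried
        if not p and not x:
            cliques.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


def clique_percolation(edges, k):
    """Return the k-clique communities of an undirected graph as node sets."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    # Step 1: maximal cliques (assumption: keep only those with >= k nodes).
    cliques = [c for c in maximal_cliques(adj) if len(c) >= k]

    # Steps 2-3: clique overlap matrix, thresholded at k-1.
    linked = {i: set() for i in range(len(cliques))}
    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) >= k - 1:
            linked[i].add(j)
            linked[j].add(i)

    # Step 4: connected components of the thresholded matrix.
    communities, seen = [], set()
    for start in linked:
        if start in seen:
            continue
        seen.add(start)
        stack, members = [start], set()
        while stack:
            i = stack.pop()
            members |= cliques[i]
            for j in linked[i] - seen:
                seen.add(j)
                stack.append(j)
        communities.append(members)
    return communities


# Two triangles sharing an edge, plus a third triangle touching at node 4.
edges = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4), (4, 5), (4, 6), (5, 6)]
communities = clique_percolation(edges, 3)
```

With k=3 the first two triangles merge into one community and node 4 ends up in both communities, which is the overlapping behavior the method is designed to capture.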

slide-7
SLIDE 7

[Palla et al., ‘07]

Communities in a “tiny” part of a phone calls network of 4 million users [Barabasi‐Palla, 2007]

slide-8
SLIDE 8

 Each node is a community
 Nodes are weighted for community size
 Links are weighted for overlap size
 DIP “core” database of protein interactions (S. cerevisiae, yeast)

slide-9
SLIDE 9

 No nice way, NP‐hard combinatorial problem
 Simple Algorithm:
  • Start with max‐clique size s
  • Choose node u, extract cliques of size s that node u is a member of
  • Delete u and its edges
  • When graph is empty, s=s-1, restart on original graph

slide-10
SLIDE 10

[Palla et al., ‘05]

 Finding cliques around u of size s:
  • 2 sets A and B:
  • Each node in B links to all nodes in A
  • Set A grows by moving nodes from B to it
  • Start with A={u}, B={v: (u,v)∈E}
  • Recursively move each possible v∈B to A and prune B
  • If B runs out of nodes before A reaches size s, backtrack the recursion and try a different v
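The A/B recursion can be sketched as follows. This is an illustrative sketch under stated assumptions: the name `cliques_around` and the `{node: set(neighbors)}` graph representation are choices made for this example, and nodes are assumed to be mutually comparable so that each clique is reported once (only nodes larger than the one just moved stay in B).

```python
def cliques_around(adj, u, s):
    """Return all cliques with exactly s nodes that contain node u."""
    found = []

    def grow(a, b):
        # Invariant: every node in b is linked to all nodes in a.
        if len(a) == s:
            found.append(a)
            return
        # If b is empty here, the loop does nothing and we backtrack.
        for v in sorted(b):
            # Move v from B to A, then prune B to common neighbors of v.
            # Keeping only w > v avoids reporting the same clique twice.
            grow(a | {v}, {w for w in b & adj[v] if w > v})

    grow({u}, set(adj[u]))
    return found


# A 4-clique on nodes 1..4 with a pendant node 5 attached to node 4.
adj = {1: {2, 3, 4}, 2: {1, 3, 4}, 3: {1, 2, 4}, 4: {1, 2, 3, 5}, 5: {4}}
triangles_with_1 = cliques_around(adj, 1, 3)
```

In the simple algorithm from the previous slide, this routine would be called for a chosen node u, after which u and its edges are deleted and the next node is processed.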

slide-11
SLIDE 11

 Let’s rethink what we are doing…
  • Given a network
  • Want to find clusters!

 Need to:
  • Formalize the notion of a cluster
  • Need to design an algorithm that will find sets of nodes that are “good” clusters

 More generally:
  • How to think about clusters in large networks?

slide-12
SLIDE 12

What is a good cluster?

 Many edges internally
 Few pointing outside

Formally: conductance Φ(S), where A(S) is the volume of S

Small Φ(S) corresponds to good clusters

(Figure: node sets S and S’)
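A small sketch of how conductance might be computed. The formula is an assumption here: the number of edges leaving S divided by the smaller of the volumes of S and its complement, where volume is the sum of node degrees. For a set S that holds at most half of the total volume this agrees with the form c/(2m+c) listed later in the deck. The function name `conductance` and the `{node: set(neighbors)}` representation are illustrative choices.

```python
def conductance(adj, s):
    """Conductance of node set s in an undirected graph {node: set(neighbors)}."""
    s = set(s)
    cut = sum(1 for u in s for v in adj[u] if v not in s)  # edges leaving S
    vol_s = sum(len(adj[u]) for u in s)                    # volume of S
    vol_rest = sum(len(nb) for nb in adj.values()) - vol_s # volume of the rest
    denom = min(vol_s, vol_rest)
    return cut / denom if denom else float("inf")


# Two triangles joined by the single edge 3-4.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}
phi = conductance(adj, {1, 2, 3})  # one cut edge over volume 7
```

Each triangle has many internal edges and a single edge pointing outside, so its conductance is small, which matches the intuition of a good cluster.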

slide-13
SLIDE 13

[WWW ‘08]

 Define: Network community profile (NCP) plot
  • Plot the score of best community of size k

(Figure: example clusters with k=5, Φ(5)=0.25 and k=7, Φ(7)=0.18; NCP plot of log Φ(k) against community size, log k)
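To make the definition concrete, here is a brute-force sketch that computes the NCP of a tiny graph by trying every node set of each size. It is only feasible for a handful of nodes; on real networks the best set of each size has to be approximated with the algorithms discussed later. The names `conductance` and `ncp` are illustrative, and the conductance formula (cut edges over the smaller volume) is the same assumption as before.

```python
from itertools import combinations


def conductance(adj, s):
    """Cut edges divided by the smaller of vol(S) and vol(rest)."""
    s = set(s)
    cut = sum(1 for u in s for v in adj[u] if v not in s)
    vol_s = sum(len(adj[u]) for u in s)
    vol_rest = sum(len(nb) for nb in adj.values()) - vol_s
    denom = min(vol_s, vol_rest)
    return cut / denom if denom else float("inf")


def ncp(adj, max_k):
    """Map each size k to the best (lowest) conductance over all sets of size k."""
    nodes = sorted(adj)
    return {
        k: min(conductance(adj, c) for c in combinations(nodes, k))
        for k in range(1, max_k + 1)
    }


# Two triangles joined by the single edge 3-4.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}
profile = ncp(adj, 3)
```

On this toy graph the profile drops as k grows from 1 to 3, because a whole triangle is a much better cluster than any single node or pair.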

slide-14
SLIDE 14

[WWW ‘08]

 Meshes, grids, dense random graphs:

(Figure: NCP plots for d-dimensional meshes and the California road network)

slide-15
SLIDE 15

[WWW ‘08]

 Collaborations between scientists in networks [Newman, 2005]

(Figure: NCP plot of conductance, log Φ(k), against community size, log k)

slide-16
SLIDE 16

[Internet Mathematics ‘09]

Natural hypothesis about NCP:

 NCP of real networks slopes downward
 Slope of the NCP corresponds to the dimensionality of the network

What about large networks?

Examine more than 100 large networks

slide-17
SLIDE 17

[Internet Mathematics ‘09]

Typical example: General Relativity collaborations (n=4,158, m=13,422)

slide-18
SLIDE 18

[Internet Mathematics ‘09]


slide-19
SLIDE 19

[Internet Mathematics ‘09]

(Figure: NCP plot of Φ(k) (conductance) against k (cluster size). Annotations: better and better communities; best community has ~100 nodes; communities get worse and worse)

slide-20
SLIDE 20

[Internet Mathematics ‘09]

 Each successive edge inside the community costs more cut‐edges

(Figure: NCP plot for an example where each node has twice as many children: Φ=1/3 = 0.33, Φ=2/4 = 0.5, Φ=8/6 = 1.3, Φ=64/14 = 4.5)

slide-21
SLIDE 21

[Internet Mathematics ‘09]

 Empirically we note that best clusters (call them whiskers) are barely connected to the network
 Core‐periphery structure

If we remove whiskers, what does the NCP look like?

slide-22
SLIDE 22

[Internet Mathematics ‘09]

Nothing happens!

 Nestedness of the core‐periphery structure

slide-23
SLIDE 23

(Figure: nested core‐periphery structure. Labels: denser and denser network core; small good communities)

slide-24
SLIDE 24

[Internet Mathematics ‘09]

Practically constant!

 Each dot is a different network

slide-25
SLIDE 25

[Internet Mathematics ‘09]

(Figure panels: LiveJournal, DBLP, Amazon, IMDB. Curves: rewired network, ground truth)

slide-26
SLIDE 26

 Some issues with community detection:
  • Many different formalizations of clustering objective functions
  • Objectives are NP‐hard to optimize exactly
  • Methods can find clusters that are systematically “biased”
  • Methods can perform well/poorly on some kinds of graphs

 Questions:
  • How well do algorithms optimize objectives?
  • What clusters do different methods find?

slide-27
SLIDE 27

[WWW ‘09]

Many algorithms to extract clusters:

 Heuristics:
  • Girvan‐Newman, Modularity optimization: popular heuristics
  • Metis (multi‐resolution heuristic): common in practice [Karypis‐Kumar ‘98]

 Theoretical approximation algorithms:
  • Spectral partitioning: most practical but confuses “long paths” with “deep cuts”
  • Local Spectral [Andersen‐Chung ‘07]
  • Leighton‐Rao: based on multi‐commodity flow

slide-28
SLIDE 28

[WWW ‘09]

 Spectral embeddings stretch along directions in which the random‐walk mixes slowly
  • Resulting hyperplane cuts have “good” conductance, but may not yield the optimal cuts

(Figure panels: spectral embedding, flow based embedding)

slide-29
SLIDE 29

[WWW ‘09]

(Figure: NCP plots for LiveJournal. Curves: rewired network, Local Spectral, Metis)

slide-30
SLIDE 30

[WWW ‘09]

500 node communities from Local Spectral:

500 node communities from Metis:

slide-31
SLIDE 31

[WWW ‘09]

(Figure: conductance of bounding cut. Curves: Local Spectral, Disconnected Metis, Connected Metis)

 Metis (red) gives sets with better conductance
 Local Spectral (blue) gives tighter and more well‐rounded sets

slide-32
SLIDE 32

[WWW ‘09]

 Leighton‐Rao: based on multi‐commodity flow
  • Disconnected clusters vs. Connected clusters
 Graclus prefers larger clusters
 Newman’s modularity optimization similar to Local Spectral

(Figure legend: Spectral, LRao conn, LRao disconn, Metis, Graclus, Newman, Local Spectral)

slide-33
SLIDE 33

[WWW ‘09]

 Single‐criterion:
  • Modularity: m-E(m)
  • Modularity Ratio: m/E(m)
  • Volume: Σu d(u)=2m+c
  • Edges cut: c

 Multi‐criterion:
  • Conductance: c/(2m+c)
  • Expansion: c/n
  • Density: 1-m/n²
  • CutRatio: c/(n(N-n))
  • Normalized Cut: c/(2m+c) + c/(2(M-m)+c)
  • Max‐ODF: max frac. of edges of a node pointing outside S
  • Average‐ODF: avg. frac. of edges of a node pointing outside S
  • Flake‐ODF: frac. of nodes with more than ½ edges inside

n: nodes in S; m: edges in S; c: edges pointing outside S
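Several of the scores above can be computed directly from n, m and c. The sketch below is illustrative only: the function name `cluster_scores` is made up, N and M are assumed to be the number of nodes and edges of the whole graph, S is assumed to be a non-empty proper subset, and no node is isolated. Modularity, density and Flake‐ODF are left out because they need details (the null model E(m), the exact normalization) that the slide does not spell out.

```python
def cluster_scores(adj, s):
    """Score node set s in an undirected graph {node: set(neighbors)}."""
    s = set(s)
    N = len(adj)                                     # nodes in the graph (assumed meaning of N)
    M = sum(len(nb) for nb in adj.values()) // 2     # edges in the graph (assumed meaning of M)
    n = len(s)                                       # nodes in S
    m = sum(1 for u in s for v in adj[u] if v in s) // 2   # edges in S
    c = sum(1 for u in s for v in adj[u] if v not in s)    # edges pointing outside S
    # Out-degree fraction of each node in S, used by the ODF scores.
    odf = [sum(1 for v in adj[u] if v not in s) / len(adj[u]) for u in s]
    return {
        "volume": 2 * m + c,
        "edges_cut": c,
        "conductance": c / (2 * m + c),
        "expansion": c / n,
        "cut_ratio": c / (n * (N - n)),
        "normalized_cut": c / (2 * m + c) + c / (2 * (M - m) + c),
        "max_odf": max(odf),
        "avg_odf": sum(odf) / n,
    }


# Two triangles joined by the single edge 3-4; score the first triangle.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}
scores = cluster_scores(adj, {1, 2, 3})
```

Computing all scores from the same n, m and c makes it easy to compare how differently they rank the same candidate cluster.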

slide-34
SLIDE 34

[WWW ‘09]

 Qualitatively similar
 Observations:
  • Conductance, Expansion, Norm‐cut, Cut‐ratio and Avg‐ODF are similar
  • Max‐ODF prefers smaller clusters
  • Flake‐ODF prefers larger clusters
  • Internal density is bad
  • Cut‐ratio has high variance

slide-35
SLIDE 35

[WWW ‘09]

Observations:

 All measures are monotonic
 Modularity
  • prefers large clusters
  • Ignores small clusters

slide-36
SLIDE 36

[WWW ‘09]

 Lower bounds on conductance can be computed from:
  • Spectral embedding (independent of balance)
  • SDP‐based methods (for volume‐balanced partitions)

 Algorithms find clusters close to theoretical lower bounds

slide-37
SLIDE 37

(Figure: nested core‐periphery structure. Labels: denser and denser network core; small good communities)

So, what’s a good model?

slide-38
SLIDE 38

[Internet Mathematics ‘09]

(Figure: NCP plots for three models: Pref. attachment, Small World, Geometric Pref. Attachment. Panel labels: Flat, Down and Flat, Flat and Down)

 None of the common models works!

slide-39
SLIDE 39

[Internet Mathematics ‘09]

 Forest Fire: connections spread like a fire
  • New node joins the network
  • Selects a seed node
  • Connects to some of its neighbors
  • Continue recursively

Model ingredients:

  • Preferential attachment: second neighbor is not uniform at random
  • Copying flavor: since we burn seed’s neighbors
  • Hierarchical flavor: burn around the seed node
  • “Local” flavor: burn “near” the seed node

As cluster grows it blends into the core of the network
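The growth process can be sketched in a few lines. This is a simplified, undirected variant written for illustration, not the published Forest Fire model: the function name `forest_fire`, the parameter `burn_prob`, and the rule that each unburned neighbor catches fire independently with that probability are all assumptions made here, since the slide does not say how many neighbors are chosen.

```python
import random


def forest_fire(num_nodes, burn_prob=0.3, seed=0):
    """Grow an undirected graph {node: set(neighbors)} one node at a time."""
    rng = random.Random(seed)
    adj = {0: set()}
    for new in range(1, num_nodes):
        start = rng.randrange(new)       # new node selects a seed node
        burned, frontier = {start}, [start]
        while frontier:                  # continue recursively from burned nodes
            w = frontier.pop()
            for x in sorted(adj[w] - burned):
                # Assumption: each unburned neighbor burns with prob. burn_prob.
                if rng.random() < burn_prob:
                    burned.add(x)
                    frontier.append(x)
        # The new node connects to every node the fire reached.
        adj[new] = set(burned)
        for w in burned:
            adj[w].add(new)
    return adj


graph = forest_fire(200, burn_prob=0.3, seed=42)
```

Because the fire only spreads along existing edges, new links land near the seed node, which is the “local” and copying flavor listed above. A higher `burn_prob` lets fires reach further into the network.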

slide-40
SLIDE 40

[Internet Mathematics ‘09]

(Figure: NCP plot. Curves: rewired network, network)