
SLIDE 1

CS224W: Social and Information Network Analysis
Jure Leskovec, Stanford University

http://cs224w.stanford.edu

SLIDE 2

• Non-overlapping vs. overlapping communities

11/10/2010 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 2

SLIDE 3

[Palla et al., ‘05]

• A node belongs to many social circles

SLIDE 4

[Palla et al., ‘05]

• Two nodes belong to the same community if they can be connected through adjacent k-cliques:
    k-clique: fully connected graph on k nodes
    Adjacent k-cliques: overlap in k-1 nodes
• k-clique community:
    Set of nodes that can be reached through a sequence of adjacent k-cliques

[Figure: a 4-clique; adjacent 3-cliques]

SLIDE 5

[Palla et al., ‘05]

• Clique Percolation Method:
    Find maximal cliques (not k-cliques!)
    Clique overlap matrix:
        Each clique is a node
        Connect two cliques if they overlap in at least k-1 nodes
    Communities:
        Connected components of the clique overlap matrix
• How to set k?
    Set k so that we get the “richest” (most widely distributed cluster sizes) community structure

SLIDE 6

[Palla et al., ‘05]

• Start with the graph and find maximal cliques
• Create the clique overlap matrix
• Threshold the matrix at value k-1:
    If a_ij < k-1, set it to 0
• Communities are the connected components of the thresholded matrix

[Figure panels: (1) Graph; (2) Clique overlap matrix; (3) Thresholded matrix at k=4; (4) Communities (connected components)]
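The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the helper names (`build_adj`, `maximal_cliques`, `clique_percolation`) are hypothetical, maximal cliques are enumerated with plain Bron-Kerbosch (an assumption, the slides do not prescribe this routine), and the clique overlap matrix is stored as an adjacency structure over cliques rather than as a dense matrix.

```python
from itertools import combinations


def build_adj(edges):
    """Undirected adjacency sets from an edge list (hypothetical helper)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def maximal_cliques(adj):
    """Enumerate maximal cliques with basic Bron-Kerbosch (no pivoting)."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:          # nothing left to add: r is maximal
            cliques.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


def clique_percolation(adj, k):
    """Return k-clique communities as a list of node sets."""
    # Step 1: maximal cliques; cliques smaller than k contain no k-clique.
    cliques = [c for c in maximal_cliques(adj) if len(c) >= k]
    # Steps 2-3: overlap "matrix" thresholded at k-1, kept as a clique graph.
    links = {i: set() for i in range(len(cliques))}
    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) >= k - 1:
            links[i].add(j)
            links[j].add(i)
    # Step 4: connected components of the clique graph, merged into node sets.
    communities, seen = [], set()
    for start in links:
        if start in seen:
            continue
        seen.add(start)
        stack, members = [start], set()
        while stack:
            i = stack.pop()
            members |= cliques[i]
            for j in links[i]:
                if j not in seen:
                    seen.add(j)
                    stack.append(j)
        communities.append(members)
    return communities


edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4), (3, 5), (4, 5)]
print(clique_percolation(build_adj(edges), k=3))
```

On this toy graph with k=3, the triangles {0,1,2} and {1,2,3} share two nodes and merge into one community, while the triangle {3,4,5} shares only node 3 with them and forms a second community, so node 3 belongs to both.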

SLIDE 7

[Palla et al., ‘07]

Communities in a “tiny” part of a phone calls network of 4 million users [Barabasi-Palla, 2007]

SLIDE 8

• Each node is a community
• Nodes are weighted for community size
• Links are weighted for overlap size
• DIP “core” database of protein interactions (S. cerevisiae, yeast)

SLIDE 9

• No nice way to find maximal cliques: NP-hard combinatorial problem
• Simple algorithm:
    Start with max-clique size s
    Choose node u, extract cliques of size s that node u is a member of
    Delete u and its edges
    When the graph is empty, set s = s-1 and restart on the original graph

SLIDE 10

[Palla et al., ‘05]

• Finding cliques around u of size s:
    2 sets A and B:
        Each node in B links to all nodes in A
        Set A grows by moving nodes from B to it
    Start with A = {u}, B = {v: (u,v) ∈ E}
    Recursively move each possible v ∈ B to A and prune B
    If B runs out of nodes before A reaches size s, backtrack the recursion and try a different v
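A minimal Python sketch of the A/B recursion, together with the outer "decrease s and restart" loop from the previous slide, might look as follows. The function names and the dictionary-of-sets graph representation are assumptions made for illustration; the upper bound on s (maximum degree plus one) is also an assumption, since the slides do not say how the starting size is chosen.

```python
def cliques_of_size(adj, u, s):
    """All cliques with exactly s nodes that contain u (sets A and B)."""
    found = []

    def grow(a, b):
        if len(a) == s:
            found.append(frozenset(a))
            return
        if len(a) + len(b) < s:      # B runs out before A reaches size s
            return                   # backtrack and try a different v
        b = set(b)
        for v in list(b):
            b.discard(v)             # later branches must not reuse v
            # Move v into A; prune B to nodes that also link to v.
            grow(a | {v}, b & adj[v])

    grow({u}, adj[u])
    return found


def maximal_cliques_by_size(adj):
    """Outer loop: scan clique sizes from large to small."""
    result = []
    s = max((len(nbrs) for nbrs in adj.values()), default=0) + 1
    while s >= 1:
        # Restart on (a copy of) the original graph for every size s.
        work = {u: set(nbrs) for u, nbrs in adj.items()}
        while work:
            u = next(iter(work))
            for c in cliques_of_size(work, u, s):
                # A clique inside a larger clique found earlier is not maximal.
                if not any(c < big for big in result):
                    result.append(c)
            for v in work.pop(u):    # delete u and its edges
                work[v].discard(u)
        s -= 1
    return result


adj = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {1, 2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
print(maximal_cliques_by_size(adj))
```

Deleting u after its cliques are extracted ensures each clique of size s is reported once, namely when its first-visited member is chosen as u. The running time is exponential in the worst case, in line with the NP-hardness noted above.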


SLIDE 11

• Let’s rethink what we are doing…
    Given a network
    Want to find clusters!
• Need to:
    Formalize the notion of a cluster
    Design an algorithm that will find sets of nodes that are “good” clusters
• More generally:
    How to think about clusters in large networks?

SLIDE 12

What is a good cluster?

• Many edges internally
• Few pointing outside

Formally, conductance Φ(S):
    Where A(S) is the volume of S
    Small Φ(S) corresponds to good clusters

[Figure: a node set S and a second node set S’]
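As a concrete reading of the definition, the sketch below computes conductance as (edges leaving S) divided by the volume A(S), the sum of degrees of nodes in S. This matches the form c/(2m+c) listed later in the deck; other texts divide by the smaller of A(S) and the volume of the complement, so treat the exact normalization here as an assumption. The function name and the graph representation are hypothetical.

```python
def conductance(adj, S):
    """Conductance of node set S: cut edges divided by the volume A(S)."""
    S = set(S)
    # Edges with one endpoint in S and the other outside S.
    cut = sum(1 for u in S for v in adj[u] if v not in S)
    # Volume A(S): sum of degrees of the nodes in S.
    volume = sum(len(adj[u]) for u in S)
    return cut / volume


# Two triangles joined by the single edge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
print(conductance(adj, {0, 1, 2}))   # one triangle: 1 cut edge, volume 7
print(conductance(adj, {2, 3}))      # the bridge endpoints: 4 cut edges, volume 6
```

The triangle has many internal edges and a single outgoing edge, so its conductance is low; the pair of bridge endpoints has mostly outgoing edges, so its conductance is high.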


SLIDE 13

[WWW ‘08]

• Define:
    Network community profile (NCP) plot
    Plot the score of best community of size k

[Figure: example NCP plot, log Φ(k) vs. community size, log k; marked points: k=5 with Φ(5)=0.25 and k=7 with Φ(7)=0.18]
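To make the definition concrete, the sketch below computes the NCP of a tiny graph by brute force: for each size k it takes the minimum conductance over all node sets of that size. This is only feasible for toy graphs and is an illustration of the definition, not the procedure used for large networks (those rely on approximation algorithms discussed later in the deck). Conductance is taken as cut edges over volume, and all names are hypothetical.

```python
from itertools import combinations


def conductance(adj, S):
    """Cut edges of S divided by the sum of degrees in S."""
    S = set(S)
    cut = sum(1 for u in S for v in adj[u] if v not in S)
    return cut / sum(len(adj[u]) for u in S)


def ncp(adj, max_k):
    """Phi(k) = best (lowest) conductance over all node sets of size k."""
    nodes = list(adj)
    return {
        k: min(conductance(adj, S) for S in combinations(nodes, k))
        for k in range(1, max_k + 1)
    }


# Two triangles joined by the single edge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
for k, phi in ncp(adj, 3).items():
    print(k, phi)
```

On this graph the profile drops as k grows from 1 to 3, because a whole triangle is the best cluster of size 3.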


SLIDE 14

[WWW ‘08]

• Meshes, grids, dense random graphs:

[Figure: NCP plots for d-dimensional meshes and the California road network]

SLIDE 15

[WWW ‘08]

• Collaborations between scientists in networks [Newman, 2005]

[Figure: NCP plot; axes: conductance, log Φ(k) vs. community size, log k]

SLIDE 16

[Internet Mathematics ‘09]

Natural hypothesis about NCP:

• NCP of real networks slopes downward
• Slope of the NCP corresponds to the dimensionality of the network

What about large networks?

Examine more than 100 large networks

SLIDE 17

[Internet Mathematics ‘09]

Typical example: General Relativity collaborations (n=4,158, m=13,422)

SLIDE 18

[Internet Mathematics ‘09]


SLIDE 19

[Internet Mathematics ‘09]

[Figure: NCP plot, Φ(k) (conductance) vs. k (cluster size); annotations: “Better and better communities”, “Best community has ~100 nodes”, “Communities get worse and worse”]

SLIDE 20

[Internet Mathematics ‘09]

• Each successive edge inside the community costs more cut-edges

[Figure: NCP plot for an example in which each node has twice as many children; Φ=1/3 = 0.33, Φ=2/4 = 0.5, Φ=8/6 = 1.3, Φ=64/14 = 4.5]

SLIDE 21

[Internet Mathematics ‘09]

• Empirically we note that the best clusters (call them whiskers) are barely connected to the network
• Core-periphery structure

If we remove the whiskers, what does the NCP look like?

SLIDE 22

[Internet Mathematics ‘09]

Nothing happens!

• Nestedness of the core-periphery structure

SLIDE 23

[Figure: nested core-periphery structure, with a denser and denser network core and small good communities]

SLIDE 24

[Internet Mathematics ‘09]

Practically constant!

• Each dot is a different network

SLIDE 25

[Internet Mathematics ‘09]

[Figure panels: LiveJournal, DBLP, Amazon, IMDB; curve labels: Rewired Network, Ground truth]

SLIDE 26

• Some issues with community detection:
    Many different formalizations of clustering objective functions
    Objectives are NP-hard to optimize exactly
    Methods can find clusters that are systematically “biased”
    Methods can perform well/poorly on some kinds of graphs
• Questions:
    How well do algorithms optimize objectives?
    What clusters do different methods find?

SLIDE 27

[WWW ‘09]

Many algorithms to extract clusters:

• Heuristics:
    Girvan-Newman, modularity optimization: popular heuristics
    Metis (multi-resolution heuristic): common in practice [Karypis-Kumar ‘98]
• Theoretical approximation algorithms:
    Spectral partitioning: most practical but confuses “long paths” with “deep cuts”
    Local Spectral [Andersen-Chung ‘07]
    Leighton-Rao: based on multi-commodity flow

SLIDE 28

[WWW ‘09]

• Spectral embeddings stretch along directions in which the random walk mixes slowly
    Resulting hyperplane cuts have “good” conductance, but may not yield the optimal cuts

[Figure: spectral embedding vs. flow-based embedding]
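One simple form of spectral partitioning can be sketched as follows: approximate the slowest-mixing direction of the random walk by power iteration, order the nodes along it, and sweep over prefixes of that order to pick the cut with the lowest conductance. This is a hedged illustration under several assumptions: the graph is connected, undirected and has no self-loops; a lazy random walk and a fixed iteration count stand in for a proper eigen-solver; conductance is normalized by the smaller side's volume, which is the usual choice for sweep cuts. The function name `sweep_cut` is hypothetical.

```python
def sweep_cut(adj, iters=500):
    """Order nodes along the slowest-mixing direction, then sweep for a cut."""
    nodes = list(adj)
    deg = {u: len(adj[u]) for u in nodes}
    total = sum(deg.values())
    x = {u: float(i) for i, u in enumerate(nodes)}    # arbitrary start vector
    for _ in range(iters):
        # Remove the stationary (constant) component: degree-weighted mean.
        mean = sum(deg[u] * x[u] for u in nodes) / total
        x = {u: x[u] - mean for u in nodes}
        # One lazy random-walk averaging step: x <- (x + D^-1 A x) / 2.
        x = {u: 0.5 * x[u] + 0.5 * sum(x[v] for v in adj[u]) / deg[u]
             for u in nodes}
        norm = max(abs(val) for val in x.values()) or 1.0
        x = {u: x[u] / norm for u in nodes}
    order = sorted(nodes, key=lambda u: x[u])

    best, best_phi = None, float("inf")
    inside, cut, vol = set(), 0, 0
    for u in order[:-1]:                               # every proper prefix
        inside.add(u)
        vol += deg[u]
        # Edges to nodes already inside stop being cut; the others become cut.
        cut += sum(-1 if v in inside else 1 for v in adj[u])
        phi = cut / min(vol, total - vol)
        if phi < best_phi:
            best, best_phi = set(inside), phi
    return best, best_phi


# Two triangles joined by the single edge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
print(sweep_cut(adj))
```

The sweep only examines cuts that are thresholds along one direction, which is why it can return a good cut quickly but is not guaranteed to return the optimal one.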


SLIDE 29

[WWW ‘09]

[Figure labels: Rewired network, LiveJournal; methods: Local Spectral, Metis]

SLIDE 30

[WWW ‘09]

[Figure: 500-node communities from Local Spectral and 500-node communities from Metis]

SLIDE 31

[WWW ‘09]

[Figure: conductance of bounding cut; legend: Local Spectral, Disconnected Metis, Connected Metis]

• Metis (red) gives sets with better conductance
• Local Spectral (blue) gives tighter and more well-rounded sets

SLIDE 32

[WWW ‘09]

• Leighton-Rao: based on multi-commodity flow
    Disconnected clusters vs. connected clusters
• Graclus prefers larger clusters
• Newman’s modularity optimization is similar to Local Spectral

[Figure legend: Spectral, LRao conn, LRao disconn, Metis, Graclus, Newman, Local Spectral]

SLIDE 33

[WWW ‘09]

Notation for a node set S: n: nodes in S; m: edges in S; c: edges pointing outside S

• Single-criterion:
    Modularity: m-E(m)
    Modularity ratio: m/E(m)
    Volume: Σ_u d(u) = 2m+c
    Edges cut: c
• Multi-criterion:
    Conductance: c/(2m+c)
    Expansion: c/n
    Density: 1-m/n^2
    CutRatio: c/(n(N-n))
    Normalized Cut: c/(2m+c) + c/(2(M-m)+c)
    Max-ODF: max. frac. of edges of a node pointing outside S
    Average-ODF: avg. frac. of edges of a node pointing outside S
    Flake-ODF: frac. of nodes with more than ½ edges inside
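Most of the scores in this list can be computed directly from the counts n, m and c. The sketch below does so for an undirected graph stored as adjacency sets. It assumes that N and M (not defined on the slide) are the total numbers of nodes and edges in the graph, keeps the density formula as written above, and leaves out modularity (which needs a null model for E(m)) and Flake-ODF. The function name is hypothetical.

```python
def community_scores(adj, S):
    """Score a node set S with several of the listed criteria."""
    S = set(S)
    N = len(adj)                                          # nodes in the graph
    M = sum(len(nbrs) for nbrs in adj.values()) // 2      # edges in the graph
    n = len(S)                                            # nodes in S
    m = sum(1 for u in S for v in adj[u] if v in S) // 2  # edges inside S
    c = sum(1 for u in S for v in adj[u] if v not in S)   # edges leaving S
    # Out-degree fraction of each node in S.
    odf = [sum(1 for v in adj[u] if v not in S) / len(adj[u]) for u in S]
    return {
        "volume": 2 * m + c,
        "edges_cut": c,
        "conductance": c / (2 * m + c),
        "expansion": c / n,
        "density": 1 - m / n ** 2,                        # as written above
        "cut_ratio": c / (n * (N - n)),
        "normalized_cut": c / (2 * m + c) + c / (2 * (M - m) + c),
        "max_odf": max(odf),
        "avg_odf": sum(odf) / n,
    }


# Two triangles joined by the single edge 2-3; score one triangle.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
for name, value in community_scores(adj, {0, 1, 2}).items():
    print(f"{name}: {value:.3f}")
```

All of these except volume and edges cut combine an internal quantity (m or n) with the external cut c, which is what distinguishes the multi-criterion scores from the single-criterion ones.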


SLIDE 34

[WWW ‘09]

• Qualitatively similar
• Observations:
    Conductance, Expansion, Norm-cut, Cut-ratio and Avg-ODF are similar
    Max-ODF prefers smaller clusters
    Flake-ODF prefers larger clusters
    Internal density is bad
    Cut-ratio has high variance

SLIDE 35

[WWW ‘09]

Observations:

• All measures are monotonic
• Modularity prefers large clusters
    Ignores small clusters

SLIDE 36

[WWW ‘09]

• Lower bounds on conductance can be computed from:
    Spectral embedding (independent of balance)
    SDP-based methods (for volume-balanced partitions)
• Algorithms find clusters close to theoretical lower bounds

SLIDE 37

So, what’s a good model?

[Figure: nested core-periphery structure, with a denser and denser network core and small good communities]

SLIDE 38

[Internet Mathematics ‘09]

[Figure: NCP plots for the Pref. Attachment, Small World and Geometric Pref. Attachment models; shape labels: Flat, Down and Flat, Flat and Down]

• None of the common models works!

SLIDE 39

[Internet Mathematics ‘09]

• Forest Fire: connections spread like a fire
    New node joins the network
    Selects a seed node
    Connects to some of its neighbors
    Continue recursively
• Model ingredients:
    Preferential attachment: second neighbor is not uniform at random
    Copying flavor: since we burn seed’s neighbors
    Hierarchical flavor: burn around the seed node
    “Local” flavor: burn “near” the seed node

As the cluster grows it blends into the core of the network
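The growth process described above can be sketched as a short simulation. This is a simplified, undirected variant written for illustration: the new node links to its seed and to every node the fire reaches, and each unburned neighbor of a burning node catches fire independently with probability `p`. The burning rule, the parameter `p`, its default value and the function name are all assumptions; the slides only state that the node connects to some of the seed's neighbors and continues recursively.

```python
import random


def forest_fire(n_nodes, p=0.35, seed=0):
    """Grow an undirected graph node by node with a fire-spreading rule."""
    rng = random.Random(seed)
    adj = {0: set()}
    for new in range(1, n_nodes):
        start = rng.randrange(new)       # select a seed node uniformly
        burned = {start}
        frontier = [start]
        while frontier:                  # continue recursively
            w = frontier.pop()
            for v in adj[w]:
                # Assumed rule: each unburned neighbor burns with prob. p.
                if v not in burned and rng.random() < p:
                    burned.add(v)
                    frontier.append(v)
        # The new node connects to the seed and to everything that burned.
        adj[new] = set(burned)
        for v in burned:
            adj[v].add(new)
    return adj


g = forest_fire(200, p=0.35, seed=1)
degrees = sorted((len(nbrs) for nbrs in g.values()), reverse=True)
print("nodes:", len(g), "edges:", sum(degrees) // 2, "top degrees:", degrees[:5])
```

Because the fire only spreads along existing edges, the links of a new node stay “near” its seed, and a seed's neighbors are reached in proportion to how well connected they are, which is where the local, copying and preferential-attachment flavors come from.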


SLIDE 40

[Internet Mathematics ‘09]

[Figure labels: rewired network, network]