
SLIDE 1

Privacy Aspects of Social Graphs

Joseph Bonneau
Stanford Security Seminar, July 14 2009
University of Cambridge Computer Laboratory

SLIDE 2

Social Context And The Web

SLIDE 3

Everything's Better With Friends...

“Hyper-presence” of friends
“networked public spaces”
All web activity will have social context

SLIDE 4

Facebook Is Becoming A Second Internet...

Function           Internet version    Facebook version
Page Markup        HTML, JavaScript    FBML
DB Queries         SQL                 FBQL
Email              SMTP                FB Mail
Forums             Usenet, etc.        FB Groups
Instant Messages   XMPP                FB Chat
News Streams       RSS                 FB Stream
Authentication     OpenID              FB Connect
Photo Sharing      Flickr, etc.        FB Photos
Video Sharing      YouTube, etc.       FB Video
Blogging           Blogger, etc.       FB Notes
Microblogging      Twitter, etc.       FB Status Updates
Micropayment       Peppercoin, etc.    FB Points
Event Planning     E-Vite              FB Events
Classified Ads     craigslist          FB Marketplace

SLIDE 5

Parallel Trend: The Internet is Becoming Social

“Given sufficient funding, all web sites expand in functionality until users can add each other as friends”

SLIDE 6

“Traditional” Social Network Analysis

Performed by sociologists, anthropologists, etc. since the 1970s
Use data carefully collected through interviews & observation
Typically < 100 nodes
Complete knowledge
Links have consistent meaning
All of these assumptions fail badly for online social network data

SLIDE 7

Traditional Graph Theory

Nice Proofs
Tons of definitions
Ignored topics:
Large graphs
Sampling
Uncertainty

SLIDE 8

Models Of Complex Networks From Math & Physics

Many nice models

Erdos-Renyi
Watts-Strogatz
Barabasi-Albert

Social network properties:

Power-law
Small-world
High clustering coefficient

SLIDE 9

Real social graphs are complicated!

SLIDE 10

When In Doubt, Compute!

We do know many graph algorithms:

Find important nodes
Identify communities
Train classifiers
Identify anomalous connections

Major Privacy Implications!

SLIDE 11

Privacy Questions

What can we infer purely from link structure?

SLIDE 12

Privacy Questions

What can we infer purely from link structure?

A surprising amount!

Popularity
Centrality
Introvert vs. Extrovert
Leadership potential

SLIDE 13

Privacy Questions

If we know nothing about a node but its neighbours, what can we infer?

SLIDE 14

Privacy Questions

If we know nothing about a node but its neighbours, what can we infer?

A lot!

Gender
Political Beliefs
Location
Breed?

SLIDE 15

Privacy Questions

Can we anonymise graphs?

SLIDE 16

Privacy Questions

Can we anonymise graphs?

Not easily...

Seminal result by Backstrom et al.: attack needs just 7 nodes
Can do even better given a user's complete neighbourhood
Also results for correlating users across networks
Developing line of research...

SLIDE 17

Privacy Questions

What can we infer if we “compromise” a fraction of nodes?

SLIDE 18

Privacy Questions

What can we infer if we “compromise” a fraction of nodes?

A lot...

Common theme: small groups of nodes can see the rest
Danezis et al.
Nagaraja
Korolova et al.
Bonneau et al.

SLIDE 19

Privacy Questions

Can we defend against crawling in a sound way?

Work in progress!

SLIDE 20

Privacy Questions

What if we get a subset of neighbours for all nodes?

SLIDE 21

Privacy Questions

What if we get a subset of k neighbours for all nodes?

Emerging question for many social graphs

Facebook and online SNS
Mobile SNS

SLIDE 22

A Quietly Introduced Feature...

Public Search Listings, Sep 2007

SLIDE 23

Public Search Listings

Unprotected against crawling
Indexed by search engines
Opt out—but most users don't know it exists!

SLIDE 24

Utility

Entity Resolution

SLIDE 25

Utility

Promotion via Network Effects

SLIDE 26

Legal Status

“Your name, network names, and profile picture thumbnail will be available in search results across the Facebook network and those limited pieces of information may be made available to third party search engines. This is primarily so your friends can find you and send a friend request.”

Facebook Privacy Policy

SLIDE 27

Legal Status

Much More Info Now Included...

SLIDE 28

Legal Status

Public Group Pages Recently Added

SLIDE 29

Obvious Attack

Initially returned new friend set on refresh
Can find all n friends in O(n·log n) queries
The Coupon Collector's Problem
For 100 Friends, need 65 page refreshes
As of Jan 2009, friends fixed per IP address
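
The 65-refresh figure is consistent with a coupon collector estimate in which each refresh reveals a batch of friends. A minimal sketch, assuming each listing shows 8 friends drawn uniformly at random; the batch size is an assumption made here because it reproduces the slide's number, and `expected_refreshes` is a hypothetical helper name:

```python
def harmonic(n: int) -> float:
    """n-th harmonic number H_n = 1 + 1/2 + ... + 1/n."""
    return sum(1.0 / i for i in range(1, n + 1))


def expected_refreshes(n_friends: int, per_page: int) -> float:
    # Coupon collector: roughly n * H_n single uniform draws are needed
    # to see all n items. Dividing by the batch size approximates the
    # number of page refreshes (assumption: per_page friends per refresh).
    return n_friends * harmonic(n_friends) / per_page


print(round(expected_refreshes(100, 8)))  # 65
```

The n * H_n term grows as n log n, which is where the O(n·log n) query bound comes from.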

SLIDE 30

Fun with Tor

UK Germany USA Australia

SLIDE 31

Attack Scenario

Spider all public listings
Our experiments crawled 250k users daily
Implies ~800 CPU-days to recover all users
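
The two figures are linked by a single multiplication. A quick check; the total number of listings is not stated on the slide, it is only what 800 days at 250k users per day would imply:

```python
users_per_day = 250_000   # crawl rate reported for the experiments
cpu_days = 800            # estimate quoted on the slide

# Implied size of the full set of public listings (derived, not quoted).
implied_total_users = users_per_day * cpu_days
print(implied_total_users)  # 200000000
```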

SLIDE 32

Abstraction

Take a graph G = <V,E>
Randomly select k out-edges from each node
Result is a sampled graph Gk = <V,Ek>
Try to approximate f(G) ≈ fapprox(Gk)
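
The sampling step of this abstraction can be sketched in a few lines. This is an illustration only: the adjacency-dict representation and the name `k_sample` are assumptions, not part of the original experiments.

```python
import random


def k_sample(adj, k, seed=None):
    """Build G_k: keep at most k randomly chosen out-edges per node.

    adj maps each node to an iterable of its neighbours in G.
    Returns a dict mapping each node to the list of neighbours shown.
    """
    rng = random.Random(seed)
    sampled = {}
    for node, nbrs in adj.items():
        nbrs = list(nbrs)
        # Nodes with at most k neighbours reveal all of them.
        sampled[node] = nbrs if len(nbrs) <= k else rng.sample(nbrs, k)
    return sampled


graph = {"a": ["b", "c", "d"], "b": ["a"], "c": ["a", "d"], "d": ["a", "c"]}
print(k_sample(graph, k=2, seed=1))
```

Any statistic f can then be computed on the output and compared with its value on the full graph.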

SLIDE 33

Approximable Functions

Node Degree
Dominating Set
Betweenness Centrality
Path Length
Community Structure

SLIDE 34

Experimental Data

Crawled networks for Stanford, Harvard universities
Representative sub-networks

Network    # Users   Mean d   Median d
Stanford   15043     125      90
Harvard    18273     116      76

SLIDE 35

Back To Our Abstraction

Take a graph G = <V,E>
Randomly select k out-edges from each node
Result is a sampled graph Gk = <V,Ek>
Try to approximate f(G) ≈ fapprox(Gk)

SLIDE 36

Estimating Degrees

Convert sampled graph into a directed graph
Edges originate at the node where they were seen
Learn exact degree for nodes with degree < k
Less than k out-edges
Get random sample for nodes with degree ≥ k
Many have more than k in-edges

SLIDE 37

Estimating Degrees

3 3 3 4 4 2 1 2 6

Average Degree: 3.5

SLIDE 38

Estimating Degrees

3 3 3 4 4 2 1 2 6

Sampled with k=2

SLIDE 39

Estimating Degrees

? ? ? ? ? ? 1 ? ?

Degree known exactly for one node

SLIDE 40

Estimating Degrees

3.5 3.5 1.75 3.5 5.25 1.75 1 1.75 7

Naïve approach: Multiply in-degree by average degree / k

SLIDE 41

Estimating Degrees

3.5 3.5 2 3.5 5.25 2 1 2 7

Raise estimates which are less than k

SLIDE 42

Estimating Degrees

3.5 3.5 2 3.5 5.25 2 1 2 7

Nodes with high-degree neighbors underestimated

SLIDE 43

Estimating Degrees

3.5 3.5 3.5 3.5 5.25 2 1 2 7

Iteratively scale by current estimate / k in each step

SLIDE 44

Estimating Degrees

2.75 2.75 3.5 3.63 5.5 2 1 2 5.5

After 1 iteration

SLIDE 45

Estimating Degrees

2.68 2.68 3.41 3.53 5.35 2 1 2 5.35

Normalise to estimated total degree

SLIDE 46

Estimating Degrees

2.48 2.83 3.04 3.64 5.09 2 1 2 5.91

Convergence after n > 10 iterations

SLIDE 47

Estimating Degrees

Converges fast, typically after 10 iterations
Absolute error is high—38% average
Reduced to 23% for nodes with d ≥ 50
Can still accurately pick high-degree nodes
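
The estimation steps walked through on the preceding slides can be sketched as follows. This is one possible reading of those steps, not the code used in the experiments: the function name, the dict-based graph representation, and passing the average degree in as a known parameter are all assumptions.

```python
def estimate_degrees(out_edges, k, avg_degree, iterations=10):
    """out_edges maps every node to the set of (at most k) neighbours shown."""
    nodes = list(out_edges)
    in_nbrs = {v: set() for v in nodes}
    for u, shown in out_edges.items():
        for v in shown:
            in_nbrs[v].add(u)

    # Fewer than k out-edges means the true degree is known exactly.
    exact = {v: float(len(out_edges[v])) for v in nodes if len(out_edges[v]) < k}

    # Naive start: in-degree * average degree / k, raised to at least k.
    est = {}
    for v in nodes:
        naive = len(in_nbrs[v]) * avg_degree / k
        est[v] = exact[v] if v in exact else max(float(k), naive)

    total = avg_degree * len(nodes)  # estimated total degree
    for _ in range(iterations):
        new = {}
        for v in nodes:
            if v in exact:
                new[v] = exact[v]
                continue
            # An in-edge from u was shown with probability about k / d(u),
            # so it stands for roughly est[u] / k real edges (at least 1).
            weight = sum(max(1.0, est[u] / k) for u in in_nbrs[v])
            new[v] = max(float(k), weight)
        # Normalise so the estimates sum to the estimated total degree.
        scale = total / sum(new.values())
        est = {v: new[v] if v in exact else max(float(k), new[v] * scale)
               for v in nodes}
    return est


sampled = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
print(estimate_degrees(sampled, k=2, avg_degree=2.0))
```

In this toy input, node `d` shows fewer than k edges, so its degree is taken as exact, while `a` is ranked highest because it has the most in-edges.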

SLIDE 48

Aggregate of x highest-degree nodes

SLIDE 49

Comparison of sampling parameters

SLIDE 50

Dominating Sets

Set of nodes D ⊆ V such that D ∪ Neighbours(D) = V

Set allows viewing the entire network
Also useful for marketing, trend-setting

SLIDE 51

Dominating Sets

3 3 4 4 4 5 3 2 3 1

Trivial Algorithm: Select High-Degree Nodes in Order

SLIDE 52

Dominating Sets

3 3 4 4 4 5 3 2 3 1

In fact, finding a minimum dominating set is NP-hard (the decision version is NP-complete)

SLIDE 53

Dominating Sets

4 4 5 5 5 6 4 3 4 2

Greedy Algorithm: select for maximal coverage

SLIDE 54

Dominating Sets

1 1 2 4 3 2

Greedy Algorithm: select for maximal coverage

SLIDE 55

Dominating Sets

Shown to perform adequately in practice
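
A minimal sketch of the greedy heuristic, assuming the graph is given as an adjacency dict (the representation and the name `greedy_dominating_set` are illustrative choices). At each step it picks the node that covers the most still-uncovered nodes, counting itself:

```python
def greedy_dominating_set(adj):
    """adj maps each node to an iterable of neighbours (undirected graph)."""
    uncovered = set(adj)
    chosen = []
    while uncovered:
        # Coverage of v = v itself plus its neighbours, restricted to
        # nodes not yet covered. Ties go to the first node in dict order.
        best = max(adj, key=lambda v: len(({v} | set(adj[v])) & uncovered))
        chosen.append(best)
        uncovered -= {best} | set(adj[best])
    return chosen


path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(greedy_dominating_set(path))  # [1, 3]
```

Running the same function on a sampled graph instead of the full one is the experiment the following slides report on.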

SLIDE 56

Works Well on Sampled Graph

SLIDE 57

Insensitive to Sampling Parameter!

Surprising: Even k = 1 performs quite well

SLIDE 58

Centrality

A measure of a node's importance
Betweenness centrality:

CBv= ∑

s≠v≠t∈V

 stv st

Measures the shortest paths in the

graph that a particular vertex is part of
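
For reference, betweenness can be computed with Brandes' algorithm. In the standard definition, σ_st is the number of shortest paths from s to t and σ_st(v) is the number of those passing through v. The sketch below assumes an unweighted graph stored as an adjacency dict and sums over ordered pairs (s, t), so on an undirected graph each pair is counted twice; it is an illustration, not the implementation used for the talk.

```python
from collections import deque


def betweenness(adj):
    """Brandes' algorithm for unweighted graphs; adj: node -> neighbours."""
    cb = {v: 0.0 for v in adj}
    for s in adj:
        order = []                       # nodes in order of distance from s
        preds = {v: [] for v in adj}     # predecessors on shortest paths
        sigma = {v: 0 for v in adj}      # number of shortest s-v paths
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:                     # breadth-first search from s
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}    # dependency of s on each node
        while order:                     # accumulate, farthest nodes first
            w = order.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                cb[w] += delta[w]
    return cb


print(betweenness({"a": ["b"], "b": ["a", "c"], "c": ["b"]}))
```

On the three-node path, only the middle node lies on a shortest path between two others, so it is the only node with a non-zero score.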

SLIDE 59

Centrality

SLIDE 60

Community Detection

Goal: Find highly-connected sub-groups
Measure success by high modularity:
Fraction of intra-community edges, compared with the fraction expected at random
Normalised to be between -1 and 1
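
A short sketch of the usual Newman-style modularity, Q = Σ_c (e_c / m − (d_c / 2m)²), where m is the number of edges, e_c the number of edges inside community c, and d_c the total degree of its nodes. The edge-list input and the function name are assumptions made for illustration:

```python
def modularity(edges, community):
    """edges: list of (u, v) pairs; community: node -> community label."""
    m = len(edges)
    internal = {}     # e_c: edges with both endpoints in community c
    degree_sum = {}   # d_c: sum of degrees of nodes in community c
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree_sum[cu] = degree_sum.get(cu, 0) + 1
        degree_sum[cv] = degree_sum.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(internal.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in degree_sum.items())


# Two triangles joined by a single bridge edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
labels = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
print(modularity(edges, labels))
```

Putting every node in a single community gives Q = 0, which is why a good split scores above zero.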

SLIDE 61

Community Detection

2 2 3 4 4 2 2 1

0.01 0.04 0.035 0.03 0.03 0.035 0.02 0.03 0.03 0.01 0.04

Clauset et al. 2004: find maximal modularity in O(n lg² n)
Track marginal modularity, update neighbours on each merge

SLIDE 62

Community Detection

Q=0.04

2 2 3 4 4 2 2 1

0.04 0.035 0.025 0.03 0.03 0.035 0.0125 0.04 0.03

SLIDE 63

Community Detection

Q=0.08

2 2 3 4 4 2 2 1

0.04 0.035 0.025 0.03 0.06 0.035 0.0125 0.06 0.04

SLIDE 64

Community Detection

Q=0.14

2 2 3 4 4 2 2 1

0.11

0.10 0.035 0.025 0.01 0.035 0.0125 0.04

SLIDE 65

Community Detection

Q=0.175

2 2 3 4 4 2 2 1

0.11

0.10 0.035 0.0375 0.01 0.025 0.0375 0.04

SLIDE 66

Community Detection

Q=0.2125

2 2 3 4 4 2 2 1

0.15

0.10 0.1125 0.01

SLIDE 67

Community Detection

Q=0.2225

2 2 3 4 4 2 2 1

0.15

0.11 0.1125

0.15

SLIDE 68

Community Detection

SLIDE 69

Conclusions

k-sampling of each node's edges gives away a lot

SLIDE 70

Conclusions

k-sampling of each node's edges gives away a lot

Can we fix it?

SLIDE 71

Regular subgraph extraction

3 3 3 4 4 2 1 2 6

Can we find a 2-regular subgraph?

SLIDE 72

Regular subgraph extraction

3 3 3 4 4 2 1 2 6

Step 1: Remove edges, weight by smallest attached node

3 1 3 2 3 3 3 4 4 2 2 4 3

SLIDE 73

Regular subgraph extraction

3 3 3 4 3 2 1 2 5

Step 1: Remove edges, weight by smallest attached node

3 1 3 2 3 3 3 4 2 2 3 3

SLIDE 74

Regular subgraph extraction

3 3 3 3 3 2 1 2 4

Step 1: Remove edges, weight by smallest attached node

3 1 3 2 3 3 3 2 2 3 3

SLIDE 75

Regular subgraph extraction

3 2 3 3 3 2 1 2 3

Step 1: Remove edges, weight by smallest attached node

3 1 2 2 2 3 2 2 3 3

SLIDE 76

Regular subgraph extraction

3 2 3 3 3 2 1 2 2

Step 1: Remove edges, weight by smallest attached node

2 1 2 2 2 3 2 2 3

SLIDE 77

Regular subgraph extraction

3 2 2 2 2 2 1 2 2

Step 2: Remove further edges to force all degrees ≤ k

2 1 2 2 2 2 2 2

SLIDE 78

Regular subgraph extraction

2 1 2 2 2 2 1 2 2

Step 3: Randomly add edges between pairs of nodes with degree below k

2 1 2 1 2 2 2

SLIDE 79

Regular subgraph extraction

2 1 2 2 2 2 1 2 2

Step 3: Randomly add edges between pairs of nodes with degree below k

2 1 2 1 2 2 2

SLIDE 80

Regular subgraph extraction

2 2 2 2 2 2 2 2 2

(note: producing a cycle is atypical!)
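
The three steps can be sketched as below. This is only one reading of the slides: the exact removal order in Step 1, the random choice in Step 2, and all names are assumptions, and node labels are assumed to be mutually comparable (for example all integers).

```python
import random
from itertools import combinations


def regular_subgraph(edges, k, seed=0):
    """Return (shown, fake): edges to display and the non-real ones among them."""
    rng = random.Random(seed)
    real = {tuple(sorted(e)) for e in edges}
    nodes = sorted({v for e in real for v in e})
    shown = set(real)

    def degrees():
        d = {v: 0 for v in nodes}
        for u, v in shown:
            d[u] += 1
            d[v] += 1
        return d

    # Step 1: weight each edge by its smaller endpoint degree and remove
    # the heaviest edges while both endpoints still exceed k.
    while True:
        d = degrees()
        heavy = [e for e in shown if min(d[e[0]], d[e[1]]) > k]
        if not heavy:
            break
        shown.remove(max(heavy, key=lambda e: (min(d[e[0]], d[e[1]]), e)))

    # Step 2: remove further edges until every degree is at most k.
    while True:
        d = degrees()
        over = [v for v in nodes if d[v] > k]
        if not over:
            break
        shown.remove(rng.choice(sorted(e for e in shown if over[0] in e)))

    # Step 3: randomly connect pairs of nodes whose degree is below k.
    while True:
        d = degrees()
        low = [v for v in nodes if d[v] < k]
        pairs = [p for p in combinations(low, 2) if p not in shown]
        if not pairs:
            break
        shown.add(rng.choice(pairs))
    return shown, shown - real


shown, fake = regular_subgraph(
    [(0, 1), (0, 2), (0, 3), (0, 4), (1, 2), (2, 3), (3, 4)], k=2)
print(sorted(shown), sorted(fake))
```

Step 3 can leave some nodes below k when no unconnected pair of low-degree nodes remains, so the result is not guaranteed to be exactly k-regular.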

SLIDE 81

How well have we done?

Recall original goal of showing k-sample
Promotion, identification
Two measures:
Precision: Percentage of edges shown which are real
Recall: Percentage of real edges which are shown

(normalise recall to showing a max of k per node)
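
A sketch of how the two measures might be computed. The recall normalisation is an interpretation of the note above: each node can have at most min(degree, k) real edges shown, and each shown real edge counts once for each endpoint. The function name and edge-list inputs are assumptions.

```python
def precision_recall(real_edges, shown_edges, k):
    real = {frozenset(e) for e in real_edges}
    shown = {frozenset(e) for e in shown_edges}
    true_shown = real & shown

    # Precision: share of displayed edges that really exist.
    precision = len(true_shown) / len(shown)

    # True degree of every node in the real graph.
    degree = {}
    for e in real:
        for v in e:
            degree[v] = degree.get(v, 0) + 1

    # Recall, normalised: best case shows min(d, k) real edges per node.
    best_possible = sum(min(d, k) for d in degree.values())
    achieved = 2 * len(true_shown)  # one count per endpoint
    return precision, achieved / best_possible


real = [("a", "b"), ("b", "c"), ("c", "a"), ("a", "d")]
shown = [("a", "b"), ("b", "c"), ("c", "d")]   # ("c", "d") is not a real edge
print(precision_recall(real, shown, k=2))
```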

SLIDE 83

Regular subgraph extraction

            Original   Step 1   Step 2   Step 3
Precision   1          1        1        0.90
Recall      1          1        0.99     0.99

SLIDE 84

Regular subgraph extraction

SLIDE 85

Drawbacks

Requires complete graph knowledge
Graph frequently changes!

SLIDE 86

Drawbacks

Requires complete graph knowledge
Graph frequently changes!

Alternative: Random Sampling

Weight selection towards low-degree neighbours
Computable locally, incrementally
(much weaker...)
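
A minimal sketch of such a locally computable sampler. The slide does not give the weighting, so the inverse-degree weights used here are an assumption, as are the function and parameter names:

```python
import random


def sample_low_degree_neighbours(neighbours, degree, k, seed=None):
    """Pick up to k distinct neighbours, favouring low-degree ones.

    degree maps each neighbour to its degree (at least 1, since it is
    connected to the node being displayed).
    """
    rng = random.Random(seed)
    pool = list(neighbours)
    chosen = []
    while pool and len(chosen) < k:
        # Assumed weighting: probability proportional to 1 / degree.
        weights = [1.0 / degree[v] for v in pool]
        pick = rng.choices(pool, weights=weights, k=1)[0]
        chosen.append(pick)
        pool.remove(pick)   # sample without replacement
    return chosen


friends = ["u1", "u2", "u3", "u4"]
degrees = {"u1": 500, "u2": 3, "u3": 40, "u4": 2}
print(sample_low_degree_neighbours(friends, degrees, k=2, seed=7))
```

Only the node's own neighbour list and their degrees are needed, which is what makes this computable locally and incrementally.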

SLIDE 87

Random Sampling

SLIDE 88

Caveats

Can gain some protection against degree estimation
With a lot of work
Doesn't prevent inference of dominating sets, centrality!

SLIDE 89

Conclusions

Availability of social graphs raises serious privacy concerns
The blueprint of our society...
Very fragile to many attacks
Right now, we're choosing utility over privacy