SLIDE 1
Privacy Aspects of Social Graphs
Joseph Bonneau
University of Cambridge Computer Laboratory
Stanford Security Seminar, July 14 2009
SLIDE 2
Social Context And The Web
SLIDE 3
Everything's Better With Friends...
- “Hyper-presence” of friends
- “networked public spaces”
- All web activity will have social context
SLIDE 4
Facebook Is Becoming A Second Internet...
Function | Internet version | Facebook version
Page Markup | HTML, JavaScript | FBML
DB Queries | SQL | FBQL
Email | SMTP | FB Mail
Forums | Usenet, etc. | FB Groups
Instant Messages | XMPP | FB Chat
News Streams | RSS | FB Stream
Authentication | OpenID | FB Connect
Photo Sharing | Flickr, etc. | FB Photos
Video Sharing | YouTube, etc. | FB Video
Blogging | Blogger, etc. | FB Notes
Microblogging | Twitter, etc. | FB Status Updates
Micropayment | Peppercoin, etc. | FB Points
Event Planning | E-Vite | FB Events
Classified Ads | craigslist | FB Marketplace
SLIDE 5
Parallel Trend: The Internet is Becoming Social
“Given sufficient funding, all web sites expand in functionality until users can add each other as friends”
SLIDE 6
“Traditional” Social Network Analysis
- Performed by sociologists, anthropologists, etc. since the 1970s
- Use data carefully collected through interviews & observation
- Typically < 100 nodes
- Complete knowledge
- Links have consistent meaning
- All of these assumptions fail badly for online social network data
SLIDE 7
Traditional Graph Theory
- Nice Proofs
- Tons of definitions
- Ignored topics:
- Large graphs
- Sampling
- Uncertainty
SLIDE 8
Models Of Complex Networks From Math & Physics
Many nice models
- Erdős-Rényi
- Watts-Strogatz
- Barabási-Albert
Social network properties:
- Power-law
- Small-world
- High clustering coefficient
SLIDE 9
Real social graphs are complicated!
SLIDE 10
When In Doubt, Compute!
We do know many graph algorithms:
- Find important nodes
- Identify communities
- Train classifiers
- Identify anomalous connections
Major Privacy Implications!
SLIDE 11
Privacy Questions
- What can we infer purely from link structure?
SLIDE 12
Privacy Questions
- What can we infer purely from link structure?
A surprising amount!
- Popularity
- Centrality
- Introvert vs. Extrovert
- Leadership potential
SLIDE 13
Privacy Questions
- If we know nothing about a node but its neighbours, what can we infer?
SLIDE 14
Privacy Questions
- If we know nothing about a node but its neighbours, what can we infer?
A lot!
- Gender
- Political Beliefs
- Location
- Breed?
SLIDE 15
Privacy Questions
- Can we anonymise graphs?
SLIDE 16
Privacy Questions
- Can we anonymise graphs?
Not easily...
- Seminal result by Backstrom et al.: active attack needs just 7 nodes
- Can do even better given a user's complete neighborhood
- Also results for correlating users across networks
- Developing line of research...
SLIDE 17
Privacy Questions
- What can we infer if we “compromise” a fraction of nodes?
SLIDE 18
Privacy Questions
- What can we infer if we “compromise” a fraction of nodes?
A lot...
- Common theme: small groups of nodes can see the rest
- Danezis et al.
- Nagaraja
- Korolova et al.
- Bonneau et al.
SLIDE 19
Privacy Questions
- Can we defend against crawling in a sound way?
Work in progress!
SLIDE 20
Privacy Questions
- What if we get a subset of neighbours for all nodes?
SLIDE 21
Privacy Questions
- What if we get a subset of k neighbours for all nodes?
Emerging question for many social graphs
- Facebook and online SNS
- Mobile SNS
SLIDE 22
A Quietly Introduced Feature...
Public Search Listings, Sep 2007
SLIDE 23
Public Search Listings
- Unprotected against crawling
- Indexed by search engines
- Opt out—but most users don't know it exists!
SLIDE 24
Utility
Entity Resolution
SLIDE 25
Utility
Promotion via Network Effects
SLIDE 26
Legal Status
“Your name, network names, and profile picture thumbnail will be available in search results across the Facebook network and those limited pieces of information may be made available to third party search engines. This is primarily so your friends can find you and send a friend request.”
- Facebook Privacy Policy
SLIDE 27
Legal Status
Much More Info Now Included...
SLIDE 28
Legal Status
Public Group Pages Recently Added
SLIDE 29
Obvious Attack
- Initially returned new friend set on refresh
- Can find all n friends in O(n·log n) queries
- The Coupon Collector's Problem
- For 100 Friends, need 65 page refreshes
- As of Jan 2009, friends fixed per IP address
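The 65-refresh figure is consistent with the coupon collector expectation if each listing shows 8 friends per refresh. That batch size is an assumption here, not something stated on the slide. A rough sketch of the arithmetic:

```python
from fractions import Fraction


def expected_refreshes(n_friends: int, shown_per_page: int) -> float:
    """Approximate number of refreshes needed to see every friend.

    Uses the single-draw coupon collector expectation n * H_n and divides
    by the batch size. This ignores the small gain from each page showing
    distinct friends, so it is a slight overestimate.
    """
    harmonic = sum(Fraction(1, i) for i in range(1, n_friends + 1))
    return float(n_friends * harmonic / shown_per_page)


# Hypothetical parameters: 100 friends, 8 shown per refresh (assumed).
print(round(expected_refreshes(100, 8)))
```

With these assumed parameters the estimate rounds to 65, matching the bullet above.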
SLIDE 30
Fun with Tor
[Figure panels: UK, Germany, USA, Australia]
SLIDE 31
Attack Scenario
- Spider all public listings
- Our experiments crawled 250 k users daily
- Implies ~800 CPU-days to recover all users
SLIDE 32
Abstraction
- Take a graph G = <V,E>
- Randomly select k out-edges from each node
- Result is a sampled graph G_k = <V, E_k>
- Try to approximate f(G) ≈ f_approx(G_k)
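A minimal sketch of the sampling step, assuming the graph is held as a dictionary of adjacency sets. The function name, the toy graph and the seed parameter are illustrative, not taken from the talk:

```python
import random


def k_sample(graph: dict, k: int, seed: int = 0) -> dict:
    """Return a directed sample in which each node keeps at most k out-edges."""
    rng = random.Random(seed)
    sampled = {}
    for node, neighbours in graph.items():
        ordered = sorted(neighbours)  # fixed order so a seed is reproducible
        sampled[node] = set(rng.sample(ordered, min(k, len(ordered))))
    return sampled


# Toy undirected graph stored as adjacency sets (illustrative data only).
G = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"a", "c"},
}
G_k = k_sample(G, k=2)
```

Nodes with at most k neighbours reveal their whole neighbour list, while the others reveal a uniform random subset of size k.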
SLIDE 33
Approximable Functions
- Node Degree
- Dominating Set
- Betweenness Centrality
- Path Length
- Community Structure
SLIDE 34
Experimental Data
- Crawled networks for Stanford, Harvard universities
- Representative sub-networks
Network | # Users | Mean d | Median d
Stanford | 15043 | 125 | 90
Harvard | 18273 | 116 | 76
SLIDE 35
Back To Our Abstraction
- Take a graph G = <V,E>
- Randomly select k out-edges from each node
- Result is a sampled graph G_k = <V, E_k>
- Try to approximate f(G) ≈ f_approx(G_k)
SLIDE 36
Estimating Degrees
- Convert sampled graph into a directed graph
- Edges originate at the node where they were seen
- Learn exact degree for nodes with degree < k
- Less than k out-edges
- Get random sample for nodes with degree ≥ k
- Many have more than k in-edges
SLIDE 37
Estimating Degrees
3 3 3 4 4 2 1 2 6
Average Degree: 3.5
SLIDE 38
Estimating Degrees
3 3 3 4 4 2 1 2 6
Sampled with k=2
SLIDE 39
Estimating Degrees
? ? ? ? ? ? 1 ? ?
Degree known exactly for one node
SLIDE 40
Estimating Degrees
3.5 3.5 1.75 3.5 5.25 1.75 1 1.75 7
Naïve approach: Multiply in-degree by average degree / k
SLIDE 41
Estimating Degrees
3.5 3.5 2 3.5 5.25 2 1 2 7
Raise estimates which are less than k
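The naïve estimate and the floor at k can be sketched as follows. This assumes the sample is a dictionary mapping each node to the set of friends it shows, and that the average degree is known or estimated separately; the function name and toy data are hypothetical:

```python
def naive_degree_estimates(sampled: dict, k: int, avg_degree: float) -> dict:
    """Estimate degrees from in-degree in the sampled (directed) graph."""
    in_degree = {node: 0 for node in sampled}
    for shown in sampled.values():
        for friend in shown:
            in_degree[friend] = in_degree.get(friend, 0) + 1

    estimates = {}
    for node, shown in sampled.items():
        if len(shown) < k:
            # Fewer than k out-edges: the whole friend list was shown.
            estimates[node] = float(len(shown))
        else:
            # Scale in-degree by avg_degree / k, then raise to at least k.
            estimates[node] = max(float(k), in_degree[node] * avg_degree / k)
    return estimates


# Toy sample with k = 2 (illustrative data, not the graph on the slide).
sample = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"c"},
    "e": {"a", "c"},
}
print(naive_degree_estimates(sample, k=2, avg_degree=3.5))
```

Node "d" shows fewer than k friends, so its degree is exact, while "e" is seen by nobody and is raised to the floor of k.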
SLIDE 42
Estimating Degrees
3.5 3.5 2 3.5 5.25 2 1 2 7
Nodes with high-degree neighbors underestimated
SLIDE 43
Estimating Degrees
3.5 3.5 3.5 3.5 5.25 2 1 2 7
Iteratively scale by current estimate / k in each step
SLIDE 44
Estimating Degrees
2.75 2.75 3.5 3.63 5.5 2 1 2 5.5
After 1 iteration
SLIDE 45
Estimating Degrees
2.68 2.68 3.41 3.53 5.35 2 1 2 5.35
Normalise to estimated total degree
SLIDE 46
Estimating Degrees
2.48 2.83 3.04 3.64 5.09 2 1 2 5.91
Convergence after n > 10 iterations
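One way to read the iterative refinement is that every observed edge u -> v stands for est(u) / (number of friends u shows) real edges, after which the adjustable estimates are rescaled to match the estimated total degree. The sketch below follows that reading; it is an interpretation of the slides, not the authors' code, and the names and toy data are hypothetical:

```python
def refine_degree_estimates(sampled, k, avg_degree, iterations=10):
    """Iteratively re-weight observed in-edges by the source's estimate."""
    nodes = list(sampled)
    # Nodes showing fewer than k friends have exactly known degree.
    exact = {v: float(len(sampled[v])) for v in nodes if len(sampled[v]) < k}
    shown_by = {v: [] for v in nodes}
    for u in nodes:
        for v in sampled[u]:
            shown_by[v].append(u)

    # Start from the naive estimate, floored at k.
    est = {
        v: exact.get(v, max(float(k), len(shown_by[v]) * avg_degree / k))
        for v in nodes
    }
    target_total = avg_degree * len(nodes)

    for _ in range(iterations):
        new = {}
        for v in nodes:
            if v in exact:
                new[v] = exact[v]
            else:
                weight = sum(est[u] / len(sampled[u]) for u in shown_by[v])
                new[v] = max(float(k), weight)
        # Rescale only estimates that are neither exact nor at the floor.
        free = [v for v in nodes if v not in exact and new[v] > k]
        free_total = sum(new[v] for v in free)
        fixed_total = sum(new.values()) - free_total
        if free_total > 0 and target_total > fixed_total:
            scale = (target_total - fixed_total) / free_total
            for v in free:
                new[v] = max(float(k), new[v] * scale)
        est = new
    return est


# Toy sample with k = 2 and an assumed average degree of 3.0.
toy = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"c"}, "e": {"a", "c"}}
refined = refine_degree_estimates(toy, k=2, avg_degree=3.0)
```

The absolute values remain rough, but the ordering of nodes by estimated degree is what matters when the goal is to pick out the high-degree nodes.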
SLIDE 47
Estimating Degrees
- Converges fast, typically after 10 iterations
- Absolute error is high—38% average
- Reduced to 23% for nodes with d ≥ 50
- Can still accurately pick out high-degree nodes
SLIDE 48
Aggregate of x highest-degree nodes
SLIDE 49
Comparison of sampling parameters
SLIDE 50
Dominating Sets
- Set of nodes D ⊆ V such that D ∪ Neighbours(D) = V
- Set allows viewing the entire network
- Also useful for marketing, trend-setting
SLIDE 51
Dominating Sets
3 3 4 4 4 5 3 2 3 1
Trivial Algorithm: Select High-Degree Nodes in Order
SLIDE 52
Dominating Sets
3 3 4 4 4 5 3 2 3 1
In fact, finding a minimum dominating set is NP-hard
SLIDE 53
Dominating Sets
4 4 5 5 5 6 4 3 4 2
Greedy Algorithm: select for maximal coverage
SLIDE 54
Dominating Sets
1 1 2 4 3 2
Greedy Algorithm: select for maximal coverage
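A minimal sketch of the greedy heuristic, assuming an undirected graph stored as adjacency sets. At each step it picks the node that covers the most still-uncovered nodes; the function name and toy graph are illustrative:

```python
def greedy_dominating_set(graph: dict) -> list:
    """Greedily pick nodes until every node is selected or adjacent to one."""
    uncovered = set(graph)
    dominating = []
    while uncovered:
        # Coverage of v = v itself plus its neighbours, restricted to
        # nodes not yet covered. sorted() makes tie-breaking deterministic.
        best = max(
            sorted(graph),
            key=lambda v: len(({v} | graph[v]) & uncovered),
        )
        dominating.append(best)
        uncovered -= {best} | graph[best]
    return dominating


# Toy graph: a small star attached to a short path (illustrative only).
toy_graph = {
    "hub": {"a", "b", "c"},
    "a": {"hub"},
    "b": {"hub"},
    "c": {"hub", "d"},
    "d": {"c", "e"},
    "e": {"d"},
}
print(greedy_dominating_set(toy_graph))
```

To run this on a k-sampled graph, one option is to first symmetrise the sample by treating every observed edge as undirected.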
SLIDE 55
Dominating Sets
Shown to perform adequately in practice
SLIDE 56
Works Well on Sampled Graph
SLIDE 57
Insensitive to Sampling Parameter!
Surprising: Even k = 1 performs quite well
SLIDE 58
Centrality
- A measure of a node's importance
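- Betweenness centrality:
C_B(v) = \sum_{s \neq v \neq t \in V} \frac{\sigma_{st}(v)}{\sigma_{st}}
where \sigma_{st} is the number of shortest paths from s to t and \sigma_{st}(v) is the number of those that pass through v
- Measures the fraction of shortest paths in the graph that pass through a particular vertex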
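A brute-force sketch of betweenness centrality for a small unweighted graph, summing over ordered pairs (s, t) as in the definition. It is cubic in the number of nodes and intended only as an illustration; Brandes' algorithm is the usual choice for larger graphs. The names and the toy path graph are assumptions:

```python
from collections import deque


def shortest_path_counts(graph, source):
    """BFS from source: distances and number of shortest paths to each node."""
    dist = {source: 0}
    sigma = {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in graph[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                sigma[w] = 0
                queue.append(w)
            if dist[w] == dist[u] + 1:
                sigma[w] += sigma[u]
    return dist, sigma


def betweenness(graph):
    """Sum, over ordered pairs (s, t), the share of shortest paths through v."""
    info = {s: shortest_path_counts(graph, s) for s in graph}
    score = {v: 0.0 for v in graph}
    for s in graph:
        dist_s, sigma_s = info[s]
        for t in graph:
            if t == s or t not in dist_s:
                continue
            for v in graph:
                if v in (s, t) or v not in dist_s:
                    continue
                dist_v, sigma_v = info[v]
                # v lies on a shortest s-t path iff the distances add up.
                if t in dist_v and dist_s[v] + dist_v[t] == dist_s[t]:
                    score[v] += sigma_s[v] * sigma_v[t] / sigma_s[t]
    return score


path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
print(betweenness(path))
```

On the three-node path the middle node lies on the only shortest path between the two ends, in both directions, so it scores 2.0 and the ends score 0.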
SLIDE 59
Centrality
SLIDE 60
Community Detection
- Goal: Find highly-connected sub-groups
- Measure success by high modularity:
- Compares the fraction of intra-community edges with the fraction expected at random
- Normalised to be between -1 and 1
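As a sketch, the standard (Newman) modularity can be computed as the sum over communities of the intra-community edge fraction minus the squared degree fraction. The slides do not spell out the formula, so this definition is an assumption, as are the names and the toy graph:

```python
def modularity(graph: dict, community: dict) -> float:
    """Q = sum over communities c of (L_c / m) - (d_c / 2m)^2."""
    m = sum(len(nbrs) for nbrs in graph.values()) / 2  # number of edges
    intra_twice = {}  # intra-community edges, each counted from both ends
    degree = {}       # total degree per community
    for u, nbrs in graph.items():
        c = community[u]
        degree[c] = degree.get(c, 0) + len(nbrs)
        for v in nbrs:
            if community[v] == c:
                intra_twice[c] = intra_twice.get(c, 0) + 1
    return sum(
        intra_twice.get(c, 0) / (2 * m) - (degree[c] / (2 * m)) ** 2
        for c in degree
    )


# Two triangles joined by a single edge (illustrative data).
two_triangles = {
    1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4},
    4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5},
}
split = {1: "A", 2: "A", 3: "A", 4: "B", 5: "B", 6: "B"}
print(modularity(two_triangles, split))
```

Splitting the toy graph into its two triangles gives Q = 5/14, whereas placing every node in one community gives Q = 0.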
SLIDE 61
Community Detection
2 2 3 4 4 2 2 1
0.01 0.04 0.035 0.03 0.03 0.035 0.02 0.03 0.03 0.01 0.04
- Clauset et al. 2004: greedy modularity maximisation in O(n lg² n)
- Track marginal modularity, update neighbours on each merge
SLIDE 62
Community Detection
Q=0.04
2 2 3 4 4 2 2 1
0.04 0.035 0.025 0.03 0.03 0.035 0.0125 0.04 0.03
SLIDE 63
Community Detection
Q=0.08
2 2 3 4 4 2 2 1
0.04 0.035 0.025 0.03 0.06 0.035 0.0125 0.06 0.04
SLIDE 64
Community Detection
Q=0.14
2 2 3 4 4 2 2 1
- 0.11
0.10 0.035 0.025 0.01 0.035 0.0125 0.04
SLIDE 65
Community Detection
Q=0.175
2 2 3 4 4 2 2 1
- 0.11
0.10 0.035 0.0375 0.01 0.025 0.0375 0.04
SLIDE 66
Community Detection
Q=0.2125
2 2 3 4 4 2 2 1
- 0.15
0.10 0.1125 0.01
SLIDE 67
Community Detection
Q=0.2225
2 2 3 4 4 2 2 1
- 0.15
0.11 0.1125
- 0.15
SLIDE 68
Community Detection
SLIDE 69
Conclusions
- k-sampling of each node's edges gives away a lot
SLIDE 70
Conclusions
- k-sampling of each node's edges gives away a lot
Can we fix it?
SLIDE 71
Regular subgraph extraction
3 3 3 4 4 2 1 2 6
Can we find a 2-regular subgraph?
SLIDE 72
Regular subgraph extraction
3 3 3 4 4 2 1 2 6
Step 1: Remove edges, weight by smallest attached node
3 1 3 2 3 3 3 4 4 2 2 4 3
SLIDE 73
Regular subgraph extraction
3 3 3 4 3 2 1 2 5
Step 1: Remove edges, weight by smallest attached node
3 1 3 2 3 3 3 4 2 2 3 3
SLIDE 74
Regular subgraph extraction
3 3 3 3 3 2 1 2 4
Step 1: Remove edges, weight by smallest attached node
3 1 3 2 3 3 3 2 2 3 3
SLIDE 75
Regular subgraph extraction
3 2 3 3 3 2 1 2 3
Step 1: Remove edges, weight by smallest attached node
3 1 2 2 2 3 2 2 3 3
SLIDE 76
Regular subgraph extraction
3 2 3 3 3 2 1 2 2
Step 1: Remove edges, weight by smallest attached node
2 1 2 2 2 3 2 2 3
SLIDE 77
Regular subgraph extraction
3 2 2 2 2 2 1 2 2
Step 2: Remove further edges to force all degrees ≤ k
2 1 2 2 2 2 2 2
SLIDE 78
Regular subgraph extraction
2 1 2 2 2 2 1 2 2
Step 3: Randomly add edges between pairs of nodes with degree below k
2 1 2 1 2 2 2
SLIDE 79
Regular subgraph extraction
2 1 2 2 2 2 1 2 2
Step 3: Randomly add edges between pairs of nodes with degree below k
2 1 2 1 2 2 2
SLIDE 80
Regular subgraph extraction
2 2 2 2 2 2 2 2 2
(note: producing a cycle is atypical!)
SLIDE 81
How well have we done?
- Recall original goal of showing k-sample
- Promotion, identification
- Two measures:
- Precision: Percentage of edges shown which are real
- Recall: Percentage of real edges which are shown
(normalise recall to showing a max of k per node)
SLIDE 82
How well have we done?
- Recall original goal of showing k-sample
- Promotion, identification
- Two measures:
- Precision: Percentage of edges shown which are real
- Recall: Percentage of real edges which are shown
(normalise recall to showing a max of k per node)
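The two measures can be sketched as below. Normalised recall is read here as the number of real edges shown divided by the most that could be shown at k per node; that reading, the function name and the toy data are assumptions:

```python
def precision_recall(true_graph: dict, shown: dict, k: int):
    """Precision and k-normalised recall of the displayed friend lists."""
    shown_edges = [(u, v) for u, friends in shown.items() for v in friends]
    real_shown = sum(1 for u, v in shown_edges if v in true_graph[u])
    precision = real_shown / len(shown_edges) if shown_edges else 1.0
    # At most min(k, degree) real friends can be displayed for each node.
    max_showable = sum(min(k, len(nbrs)) for nbrs in true_graph.values())
    recall = real_shown / max_showable if max_showable else 1.0
    return precision, recall


# Toy data: the displayed edge b -> d is not a real friendship.
real = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"a", "c"},
}
displayed = {
    "a": {"b", "c"},
    "b": {"a", "d"},
    "c": {"a", "b"},
    "d": {"a", "c"},
}
print(precision_recall(real, displayed, k=2))
```

In the toy data 7 of the 8 displayed edges are real, and 8 edges could have been shown at k = 2, so both measures come out at 0.875.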
SLIDE 83
Regular subgraph extraction
Measure | Original | Step 1 | Step 2 | Step 3
Precision | 1 | 1 | 1 | 0.90
Recall | 1 | 1 | 0.99 | 0.99
SLIDE 84
Regular subgraph extraction
SLIDE 85
Drawbacks
- Requires complete graph knowledge
- Graph frequently changes!
SLIDE 86
Drawbacks
- Requires complete graph knowledge
- Graph frequently changes!
Alternative: Random Sampling
- Weight selection towards low-degree neighbours
- Computable locally, incrementally
- (much weaker...)
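A sketch of degree-biased sampling, assuming each friend is weighted by the inverse of its degree. The slides do not state the actual weighting, so this choice is hypothetical. Weighted sampling without replacement uses the Efraimidis-Spirakis key method, where each item gets the key u^(1/w) and the k largest keys win:

```python
import random


def low_degree_biased_sample(graph: dict, node, k: int, seed: int = 0) -> set:
    """Pick up to k friends of node, favouring friends with low degree."""
    rng = random.Random(seed)
    keyed = []
    for friend in sorted(graph[node]):
        weight = 1.0 / len(graph[friend])  # assumed weighting: inverse degree
        # Efraimidis-Spirakis key: a larger weight tends to give a larger key.
        keyed.append((rng.random() ** (1.0 / weight), friend))
    keyed.sort(reverse=True)
    return {friend for _, friend in keyed[:k]}


# Toy graph (illustrative): "x" has one low-degree and one high-degree friend.
toy_net = {
    "x": {"leaf", "hub"},
    "leaf": {"x"},
    "hub": {"x"} | {f"h{i}" for i in range(9)},
}
print(low_degree_biased_sample(toy_net, "x", k=1))
```

Only the degrees of a node's own friends are needed, which is what makes this computable locally and incrementally as the graph changes.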
SLIDE 87
Random Sampling
SLIDE 88
Caveats
- Can gain some protection against degree estimation
- With a lot of work
- Doesn't prevent inference of dominating sets, centrality!
SLIDE 89
Conclusions
- Availability of social graphs raises serious privacy concerns
- The blueprint of our society...
- Very fragile to many attacks
- Right now, we're choosing utility over privacy