Introduction to Cybersecurity
Database Privacy
Review: Anonymity vs. Privacy
- Privacy
  - Privacy is the claim of individuals, groups, or institutions to determine for themselves when, how, and to what extent information about them is communicated to others.
- Anonymity
  - The state of not being identifiable within a set of subjects/individuals.
  - It is a property exclusively of individuals.
- Privacy != Anonymity
  - Anonymity is one way to maintain privacy, but it is not always necessary.
Foundations of Cybersecurity 2016 1
Review: Anonymous Communication (AC) Protocols
- Various AC protocols with different goals:
  - Low latency overhead
  - Low communication overhead
  - High traffic-analysis resistance
- Typically categorized by latency overhead:
  - Low-latency AC protocols, e.g., Tor, DC-nets, Crowds
  - High-latency AC protocols, e.g., mix networks
(Figure: trade-off between latency, traffic-analysis resistance, and communication complexity.)
A Glimpse on Research: Privacy Assessment with MATor
- Setting: a Tor client randomly chooses an entry, a middle, and an exit node; a budget adversary B_C corrupts nodes, given a cost function c: N -> R and a budget C.
- Goal: derive worst-case quantitative anonymity guarantees.
- From the impact of single-node corruption to an overall guarantee: an integer maximization problem. Choose the corrupted set A ⊆ N that maximizes the summed per-node impact, subject to the budget:
  maximize Σ_{n∈A} impact(n)   subject to   Σ_{n∈A} c(n) ≤ C
- Sender, recipient, and relationship anonymity are each bounded via sums of path-selection probabilities Pr[(e, m, x) ∈ path] over the possible circuits, weighted by the impact of the corrupted nodes.
- Computational soundness (anonymity degeneration for encryption modeled as symbolic terms): the computational anonymity bound exceeds the symbolic one by at most a small term, δ_computational(ε) - δ_symbolic(ε) ≤ 1/poly(η).
A Glimpse on Research: Privacy Assessment with MATor (cont.)
- Setup as before: randomly choose an entry, a middle, and an exit node; the adversary corrupts nodes. Goal: derive worst-case quantitative anonymity guarantees.
- Challenges: comprehensive network-layer attackers, extension beyond structural corruption, content-sensitive assessment.
- Potential killer arguments: attackers overly powerful, hence too pessimistic guarantees; assessment only for Tor, not for tailored attacks.
- Alternative path-selection algorithms (Tor, LASTor, Uniform, US-Exit) are compared in a live monitor: anonymity over time (2012-2014) and as a function of bandwidth in MB/s.
Lecture Summary - Part I
Basic Database Privacy
- Motivation
- Data Sanitization
- k-anonymity and l-diversity
Principal Approaches to Data Protection
- Sanitization before Publication
- Protection after Publication
- Publication without Control
Data Privacy: Attribute Disclosure
Example: linking a social-network profile (female, 29y, Saarbrücken) against a sanitized medical table:
  female  25-30  Saarland  Addison Disorder
  female  25-30  Saarland  Addison Disorder
  male    30-35  Saarland  Healthy
Every matching row carries the same diagnosis, so the attacker learns: Alice suffers from the Addison disorder!
Cryptographic Solutions
- Why not just delete the data?
- Why can't we encrypt?
In contrast to cryptography, privacy often requires a certain utility:
- Deleting data destroys utility.
- Storing or transmitting data encrypted is a good idea, but someone has (needs to have) the key.
Sanitization
- Legally, data has to be "sanitized": removal of "identifying" information.
Unsanitized data: Name, Gender, Age, Address, Phone Number, Field of studies, Grades
Sanitized data: Gender, Age, Field of studies, Grades (Name, Address, and Phone Number removed)
Benefits of Sanitization
Sanitized data can (still) be used for:
- Research
- Healthcare
- Governmental statistics
- Improving business models
(The sanitized attributes Gender, Age, Field of studies, and Grades still support statistics and science.)
Does Sanitization suffice?
Sanitization = Privacy?
- No identity
- No identifying information ("quasi-identifiers") such as address or phone number
But: if only one female student of this age attends a course, the remaining attributes still single her out. Privacy breach!
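The "only one female student of this age" breach above can be checked mechanically: count how often each combination of remaining quasi-identifiers occurs; a combination occurring once re-identifies a person. A minimal sketch with illustrative records (not from the slides); `risky_rows` is a hypothetical helper name.

```python
from collections import Counter

# Records after "sanitization": name/address/phone removed, but
# gender, age, and field of studies remain as quasi-identifiers.
records = [
    ("female", 21, "CS"),
    ("female", 21, "CS"),
    ("male",   22, "CS"),
    ("male",   22, "CS"),
    ("female", 23, "Biology"),  # unique combination -> re-identifiable
]

def risky_rows(rows):
    """Return quasi-identifier combinations that occur only once."""
    counts = Counter(rows)
    return [qi for qi, n in counts.items() if n == 1]

print(risky_rows(records))  # [('female', 23, 'Biology')]
```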
Attacks on Databases
Early defense mechanism: query sanitization.
  Name     Age  Gender  Semester  Grade
  Alice    19   Female  1         1.3
  Bob      18   Male    1         2.0
  Charlie  18   Male    1         1.7
  Dave     18   Male    1         3.7
  Eve      17   Female  1         1.0
  Fritz    19   Male    3         1.3
  Gerd     21   Male    3         2.3
  Hans     23   Male    3         3.0
  Isa      20   Female  3         3.7
  John     20   Male    3         1.7
  Kale     21   Male    5         1.7
  Leonard  23   Male    5         failed
  Martin   20   Male    5         2.7
  Nils     22   Male    5         3.0
  Otto     20   Male    5         1.0
SELECT SUM(Grade) WHERE Name = 'Isa'  ->  3.7
Sanitization rule: queries must not depend on identifiers!
Attacks on Databases (2)
(Same student table as on the previous slide.)
SELECT SUM(Grade) WHERE Semester = 3 AND Gender = Female  ->  3.7
Isa is the only female student in semester 3, so the "sum" is exactly her grade.
Sanitization rules: queries must not depend on identifiers, and queries must not be answered if the number of matching rows is below a threshold.
Attacks on Databases (3)
(Same student table as before.)
SELECT SUM(Grade)  ->  30.1
SELECT SUM(Grade) WHERE NOT (Semester = 3 AND Gender = Female)  ->  26.4
Local computation: Isa's grade = 30.1 - 26.4 = 3.7
Both queries individually satisfy the threshold rule, yet their difference reveals a single grade.
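The differencing attack can be reproduced directly from the table. A sketch: `sum_grades` is an illustrative helper, and "failed" is represented as `None` and excluded from sums, which matches the slide's totals.

```python
# The student table from the slide: (name, semester, gender, grade).
students = [
    ("Alice", 1, "Female", 1.3), ("Bob", 1, "Male", 2.0),
    ("Charlie", 1, "Male", 1.7), ("Dave", 1, "Male", 3.7),
    ("Eve", 1, "Female", 1.0),  ("Fritz", 3, "Male", 1.3),
    ("Gerd", 3, "Male", 2.3),   ("Hans", 3, "Male", 3.0),
    ("Isa", 3, "Female", 3.7),  ("John", 3, "Male", 1.7),
    ("Kale", 5, "Male", 1.7),   ("Leonard", 5, "Male", None),
    ("Martin", 5, "Male", 2.7), ("Nils", 5, "Male", 3.0),
    ("Otto", 5, "Male", 1.0),
]

def sum_grades(rows, pred=lambda r: True):
    """SUM(Grade) over rows matching pred; 'failed' (None) is skipped."""
    return round(sum(r[3] for r in rows if pred(r) and r[3] is not None), 1)

total = sum_grades(students)                                                   # 30.1
rest = sum_grades(students, lambda r: not (r[1] == 3 and r[2] == "Female"))    # 26.4
isa_grade = round(total - rest, 1)                                             # 3.7
print(total, rest, isa_grade)
```

Both aggregate queries touch many rows, so a query-set-size threshold does not block them; only their difference isolates Isa's grade.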
K-Anonymity (Intuitive Idea)
K-Anonymity: Privacy means that one can hide within a set of (at least) K other people with the same quasi-identifiers.
Quasi-identifiers: attributes that could identify a person (name, age, etc.)
(Figure: a crowd of K = 6 indistinguishable people.)
K-Anonymity (Definition)
Definition: Data satisfies K-Anonymity if each person contained in the data cannot be distinguished from at least K-1 other individuals also contained in the data.
Achieving K-Anonymity
Reduce the information such that rows collapse into indistinguishable groups:
Suppression (semester-1 rows; names and genders suppressed):
  Name  Age  Gender  Semester  Grade
  *     19   *       1         1.3
  *     18   *       1         2.0
  *     18   *       1         1.7
  *     18   *       1         3.7
  *     17   *       1         1.0
Generalization (semester-5 rows; ages generalized to ranges):
  Name  Age    Semester  Grade
  *     21-25  5         1.7
  *     21-25  5         failed
  *     18-20  5         2.7
  *     21-25  5         3.0
  *     18-20  5         1.0
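Generalization and a k-anonymity check can be sketched in a few lines; `generalize_age` and `is_k_anonymous` are illustrative helper names, applied here to the semester-1 ages from the table above.

```python
from collections import Counter

def generalize_age(age, width=5):
    """Map an exact age to a bucket label like '15-19'."""
    lo = age - age % width
    return f"{lo}-{lo + width - 1}"

def is_k_anonymous(rows, k):
    """rows: tuples of quasi-identifier values only.
    k-anonymous iff every combination occurs at least k times."""
    return all(n >= k for n in Counter(rows).values())

ages = [19, 18, 18, 18, 17]          # semester-1 students from the slide
raw = [(a, 1) for a in ages]         # (age, semester) as quasi-identifiers
gen = [(generalize_age(a), 1) for a in ages]

print(is_k_anonymous(raw, 5))   # False: exact ages split the group
print(is_k_anonymous(gen, 5))   # True: all fall into '15-19', semester 1
```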
K-Anonymity (3)
Example: K-Anonymity for a list of students with K=5. For each semester, there are at least 5 individuals present that cannot be distinguished. Idea/Goal: consequently, one cannot be identified, but hides in a group of K=5 people.
  Name  Semester  Grade
  *     1         1.3
  *     1         2.0
  *     1         1.7
  *     1         3.7
  *     1         1.0
  *     3         1.3
  *     3         2.3
  *     3         3.0
  *     3         failed
  *     3         1.7
  *     5         1.7
  *     5         failed
  *     5         2.7
  *     5         3.0
  *     5         1.0
Attacks on K-Anonymity - Homogeneity
One may still learn sensitive information about an individual if all K people in a group share the same sensitive value.
  Name  Semester  Grade
  *     1         1.3
  *     1         2.0
  *     1         1.7
  *     1         3.7
  *     1         1.0
  *     3         failed
  *     3         failed
  *     3         failed
  *     3         failed
  *     3         failed
  *     5         1.7
  *     5         failed
  *     5         2.7
  *     5         3.0
  *     5         1.0
K-Anonymity with K=5. But: if we know that a particular student, say Isa, is in the 3rd semester, then we immediately learn that she has failed the exam, because every grade in her group is "failed".
Attacks on K-Anonymity - Background Knowledge
Background knowledge that might look unsuspicious or not too privacy-critical may lead to privacy breaches.
  Name  Semester  Grade
  *     3         1.0
  *     3         1.3
  *     3         1.3
  *     3         failed
  *     3         failed
(Semesters 1 and 5 as before.)
K-Anonymity with K=5. But: after the exam, Isa (in the 3rd semester) looked disappointed after seeing the result. We can conclude that, with very high probability, she has not achieved a 1.0 or a 1.3, and thus she most likely failed the exam.
L-Diversity
Intuition and definition: there have to be L different, "representative" results for each set of quasi-identifiers.
  Name  Semester  Grade
  *     3         1.0
  *     3         2.3
  *     3         3.7
  *     3         failed
  *     3         3.0
L-Diversity (2)
Properties:
- Homogeneity attacks are impossible (there are enough representative values).
- Many knowledge-based attacks are covered: they often do not lead to direct deanonymization, but only quantitatively reduce the diversity.
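An l-diversity check over (quasi-identifier, sensitive-value) pairs can be sketched as follows; `is_l_diverse` is an illustrative helper, applied to the homogeneous semester-3 block and the diverse block from these slides.

```python
from collections import defaultdict

def is_l_diverse(rows, l):
    """rows: (quasi_identifier, sensitive_value) pairs.
    l-diverse iff each group has at least l distinct sensitive values."""
    groups = defaultdict(set)
    for qi, sensitive in rows:
        groups[qi].add(sensitive)
    return all(len(vals) >= l for vals in groups.values())

homogeneous = [(3, "failed")] * 5                              # homogeneity-attack block
diverse = [(3, g) for g in ("1.0", "2.3", "3.7", "failed", "3.0")]

print(is_l_diverse(homogeneous, 2))  # False: only one sensitive value
print(is_l_diverse(diverse, 5))      # True: five distinct grades
```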
Attack on L-Diversity - Lots of Knowledge
Every 5-block is optimally L-diverse (L=5):
  Name  Semester  Grade
  *     1         1.3
  *     1         2.0
  *     1         1.7
  *     1         3.7
  *     1         1.0
  *     3         1.0
  *     3         2.3
  *     3         3.7
  *     3         failed
  *     3         3.0
  *     5         1.7
  *     5         failed
  *     5         2.7
  *     5         3.0
  *     5         1.0
But:
- Assume you are in the 3rd semester and have a 2.3, your friend John has a 3.0, and you know that a male student from the 3rd semester just barely passed the exam.
- Moreover, Isa, who is also in the 3rd semester, looked unhappy after the exam, so it is very unlikely that she achieved the 1.0, and consequently it is very likely that she failed the exam.
Netflix Prize
When: 2007-2009. Challenge: "Find a better recommendation algorithm". Reward: $1,000,000 for the winner.
Data: training set (about 100,000,000 ratings from about 480,000 users).
"To prevent certain inferences being drawn about the Netflix customer base, some of the rating data for some customers in the training and qualifying sets have been deliberately perturbed in one or more of the following ways: deleting ratings; inserting alternative ratings and dates; and modifying rating dates."
Example record:
  User   Movie                     Rating  Date
  Alice  Pirates of the Caribbean  3       04-Nov-15
Netflix Prize - Anonymization and Deanonymization
Claim: "To protect customer privacy, all personal information identifying individual customers has been removed and all customer ids have been replaced by randomly-assigned ids."
Netflix data (ids replaced):
  ID           Movie           Rating  Date
  ID1 (Alice)  Pirates o.t.C.  3       04-Nov-15
  ID1 (Alice)  Matrix          4       95-Jun-04
  ID2 (Bob)    Titanic         5       97-Dec-21
  ID2 (Bob)    Matrix          4       02-Jan-23
  ID3 (Eve)    Godfather       4       05-Dec-06
  ID3 (Eve)    Pirates o.t.C.  5       03-Dec-12
  ID4 (Tom)    Toy Story       2       04-Jul-27
Public data (with names):
  Name   Movie           Rating  Date
  Alice  Pirates o.t.C.  3       04-Nov-14
  Tom    Toy Story       2       04-Jul-27
  John   Matrix          5       00-Jan-22
  Peter  Inception       5       11-Jan-07
  Bob    Toy Story       4       03-Jun-04
  Bob    Matrix          4       02-Jan-22
  Susie  Pirates o.t.C.  5       06-Feb-11
Linking rows that nearly agree on (movie, rating, date) re-identifies Alice, Bob, and Tom.
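The linkage attack can be sketched in a few lines. Simplified versions of the two tables above are used (the date format in the slide is ambiguous, so concrete dates here are illustrative assumptions), and `link` is a hypothetical helper name.

```python
from datetime import date

# "Anonymized" Netflix-style ratings and a public, named rating list.
netflix = [
    ("ID1", "Pirates o.t.C.", 3, date(2004, 11, 15)),
    ("ID2", "Matrix",         4, date(2002, 1, 23)),
    ("ID4", "Toy Story",      2, date(2004, 7, 27)),
]
public = [
    ("Alice", "Pirates o.t.C.", 3, date(2004, 11, 14)),
    ("Bob",   "Matrix",         4, date(2002, 1, 22)),
    ("Tom",   "Toy Story",      2, date(2004, 7, 27)),
]

def link(anon, named, max_day_diff=3):
    """Match records that agree on movie and rating, with dates only a
    few days apart; in sparse data this suffices to re-identify ids."""
    matches = {}
    for aid, movie, rating, d1 in anon:
        for name, m2, r2, d2 in named:
            if movie == m2 and rating == r2 and abs((d1 - d2).days) <= max_day_diff:
                matches[aid] = name
    return matches

print(link(netflix, public))  # {'ID1': 'Alice', 'ID2': 'Bob', 'ID4': 'Tom'}
```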
Data Sparseness
- Sparse data leads to privacy breaches.
(Figure: sparse vs. dense distribution of properties, plotted for only two attributes.)
Data Sparseness (2)
- Facebook data is really sparse:
  - Education
  - Hobbies
  - Favorite music / books / movies
Mathematically:
- A person is a dot in an n-dimensional space, where n = number of attributes.
(Figure: Alice and Bob plotted with 2 attributes vs. 6 attributes.)
Question: does a higher order (number of dimensions) lead to more or less sparseness?
Data Sparseness (3)
- Facebook data is really sparse:
  - Education
  - Hobbies
  - Favorite music / books / movies
Mathematically:
- A person is a dot in an n-dimensional space, where n = number of attributes.
- The higher the order (the number of dimensions), the sparser the data: similarities become less likely, as the space grows exponentially.
- For Boolean attributes:
  - 2 attributes: 2^2 = 4 possibilities
  - 6 attributes: 2^6 = 64 possibilities
  - 50 attributes: 2^50 = 1,125,899,906,842,624 possibilities
(Figure: Alice and Bob plotted with 2 attributes vs. 6 attributes.)
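The exponential blow-up above can be checked directly; `expected_sharers` is an illustrative helper estimating how many people would share a uniformly random Boolean profile.

```python
# With n Boolean attributes there are 2**n possible attribute
# combinations, so two people almost never coincide for large n.
def possibilities(n):
    return 2 ** n

for n in (2, 6, 50):
    print(n, possibilities(n))

# Expected number of people out of a population sharing one uniformly
# random profile: population / 2**n. Effectively zero for n = 50.
def expected_sharers(population, n):
    return population / 2 ** n

print(expected_sharers(8_000_000_000, 50))  # far below 1
```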
Lecture Summary - Part I
Basic Database Privacy
- Motivation
- Data Sanitization
- k-anonymity and l-diversity
Principal Approaches to Data Protection
- Sanitization before Publication
- Protection after Publication
- Publication without Control
Three Principal Approaches to Data Protection
Data under control: Sanitization before publication
- Problem: generic approaches are difficult; sanitization removes valuable and potentially crucially required information.
- Goal: provide data without potential linkability.
Data under control: Strong protection after publication
- Problem: limited computations, efficiency.
- Goal: use and modify data without direct access.
Most cases: Data dissemination without control
- Problem: difficulties to enforce privacy, no hard guarantees; existing approaches can be circumvented, and specific sanitization (removing information) comes without guarantees.
- Goal: understand exposure and privacy consequences; provide provably private sanitization.
Differential Privacy
Intuition: a mechanism is differentially private if the output does not observably depend on whether you are in the database or not.
  Name     Age  Gender  Semester  Grade
  Alice    19   Female  1         1.3
  Bob      18   Male    1         failed
  Charlie  18   Male    1         1.7
  Dave     18   Male    1         3.7
  Eve      17   Female  1         1.0
  Fritz    19   Male    3         1.3
  Gerd     21   Male    3         2.3
  Hans     23   Male    3         3.0
  Isa      20   Female  3         failed
  ->  Statistics
Differential Privacy (2)
Definition (informal): for two neighboring databases, i.e., databases differing in at most one row, every observable output must be almost equally likely.
(Left: the table above, including Isa's row. Right: the same table with Isa's row removed.)
Both databases yield the same output: "Around 60% of the students have passed."
Differential Privacy (3)
For two neighboring databases, every observable output must be almost equally likely. Idea: no attacker can learn whether or not an individual person is in the database. Consequently, no attacker can learn information about any individual member of the database, but tendencies and statistics can still be learned.
Differential Privacy - How (not) to achieve it
Generalization alone does not work: circumstances and sufficient knowledge break generalization.
With Isa: "7 students have passed, 2 have failed."  Without Isa: "7 students have passed, 1 has failed."
An attacker may observe the difference (1 vs. 2 students have failed). With sufficient knowledge (in the extreme case: about all other students), it may learn whether Isa participated or not.
Differential Privacy - How to achieve it
Observation: no deterministic method is possible if we want to preserve utility. For every function that allows one to learn some tendencies, there are corner cases in which one can observe the presence or absence of someone. Examples:
- Rounding to values: "10 people have succeeded"
  - Corner case: 14 is rounded to 10, 15 is rounded to 20.
- Boolean statements: "At least as many people have succeeded as failed"
  - Corner case: 15 students succeeded and 15 (or 14) failed.
- Even relative statements do not work if the attacker has arbitrary knowledge:
  - "80% of the people have succeeded" (if one knows all other students, the percentage leaks information).
We need a randomized sanitization method!
Achieving Differential Privacy
Addition of random noise: we randomly modify the result.
Precise answer: "170 students have passed, 82 have failed."
Noise: -2.3 students passed, +1.4 students failed.
Noisy answer: "167.3 students have passed, 83.4 have failed."
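Noise addition of this kind is typically realized with the Laplace mechanism. A minimal sketch using only the standard library; `laplace_noise` and `noisy_count` are hypothetical helper names, and the inverse-CDF sampling is one standard way to draw Laplace noise.

```python
import math
import random

def laplace_noise(scale, rng=random):
    """Sample from Lap(0, scale) via the inverse CDF."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def noisy_count(true_count, epsilon, sensitivity=1.0, rng=random):
    """epsilon-DP release of a counting query. Sensitivity is 1:
    adding or removing one student changes the count by at most 1."""
    return true_count + laplace_noise(sensitivity / epsilon, rng)

rng = random.Random(0)
print(noisy_count(170, epsilon=1.0, rng=rng))  # a noisy "passed" count near 170
```

Smaller epsilon means a larger noise scale, i.e., more privacy and less utility.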
Achieving Differential Privacy (2)
Addition of random noise: we randomly modify the result. Note that the noise does not preserve "sanity checks" such as:
- The total number of students is preserved.
- The result is always a natural number >= 0.
- If the noisy answer implies the exam has to be repeated (because too many people failed), then the same holds for the precise result.
Such a mismatch can leak information.
(Precise answer: 170 passed, 82 failed. Noise: -2.3 / +1.4. Noisy answer: 167.3 passed, 83.4 failed.)
Achieving Differential Privacy (3)
Differential privacy can cope with arbitrary adversarial knowledge: the adversary may know the whole database except for one entry (in the example: every row except Isa's).
Rules of thumb:
- The more precise the answer (less noise), the more privacy is lost.
- For a small database: good privacy -> lots of noise -> bad utility.
  - Answers like: "-3 students have passed the exam."
Post-Sanitization and Differential Privacy
Post-sanitization (deterministic or probabilistic) is possible: as long as it only depends on the noisy output (not on the original dataset), every computation is allowed and does not decrease privacy.
Precise answer: "170 students have passed, 82 have failed."
Noisy answer: "167.3 students have passed, 83.4 have failed."
Arbitrary computation on the noisy answer: rounding, bounding, relations, e.g., if (answer < 0) then answer = 0.
Result: "167 students have passed, 83 have failed. Twice as many students have passed as have failed."
More details in future lectures!
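The allowed post-processing can be sketched directly; `post_process` is a hypothetical helper that operates only on the noisy answer, never on the original data, so it cannot decrease privacy.

```python
def post_process(noisy_passed, noisy_failed):
    """Clean up a noisy answer: clamp to non-negative counts, round to
    integers, and derive a relation, all without touching raw data."""
    passed = max(0, round(noisy_passed))   # no negative counts
    failed = max(0, round(noisy_failed))
    ratio = passed / failed if failed else float("inf")
    return passed, failed, ratio

print(post_process(167.3, 83.4))  # (167, 83, ~2.01): roughly twice as many passed
```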
Privacy-friendly Aggregation (Smart Metering)
Goal: privacy guarantees for aggregated data. It should be impossible to infer the energy consumption of any individual household.
(Figure: per-household load curves, kW over time of day, aggregated over households 1-7; the aggregate is released with Laplace noise Lap(Δf/ε).)
Probability density of the noise (centered Laplace distribution with scale b = Δf/ε):
  pdf(y) = 1/(2b) * exp(-|y|/b)
- Δf: sensitivity, i.e., a single user's/household's impact on the function output.
- ε: privacy parameter from Differential Privacy.
Technical definition (Differential Privacy): for neighboring databases D1, D2 and every output set S,
  Pr[M(D1) ∈ S] ≤ e^ε * Pr[M(D2) ∈ S] + δ
Privacy-friendly Aggregation (Smart Metering) (cont.)
(Setup, noise mechanism, and parameters as on the previous slide.)
- Challenges: sanitization of dynamic data (streaming), decentralized noising, learning with differential privacy.
- Potential killer arguments: the provided utility is not sufficient for any practical use case; interesting outliers are removed.
Protected Publication - Rough Overview
- Idea: data never leaves the user's control.
  - Data is published encrypted, or stays in trusted hardware.
- Goal: only trustworthy users / processes are granted access.
- Possibility 1: trustworthy computations in secure hardware (IBM Cryptocard, ARM TrustZone, Intel SGX).
  - Major challenge: under which conditions should access be granted? Everything that has been output is out of the user's control (!)
(Diagram: m enters the trusted component; f(m) is output.)
Protected Publication - Rough Overview (cont.)
- Possibility 2: computing over encrypted data (fully homomorphic encryption).
  - Given E(K, m) and a function f, compute E(K, f(m)) by evaluating a corresponding function on E(K, m), without decrypting.
  - Major challenges: generality of E for the permitted functions f; currently still extremely inefficient.
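The homomorphic interface can be illustrated with a deliberately insecure toy scheme, E(k, m) = (m + k) mod n, which is additively homomorphic. This is only a sketch of the interface from the slide, not real FHE and not a secure construction.

```python
# Toy additively homomorphic "encryption": E(k, m) = (m + k) mod N.
# NOT secure; it only shows that adding two ciphertexts yields a
# ciphertext of the sum of the plaintexts.
N = 1_000_003

def enc(key, m):
    return (m + key) % N

def dec(key, c):
    return (c - key) % N

def add_ciphertexts(c1, c2):
    # The server adds ciphertexts without ever seeing m1 or m2.
    return (c1 + c2) % N

k1, k2 = 42, 1337
c = add_ciphertexts(enc(k1, 200), enc(k2, 300))
print(dec(k1 + k2, c))  # 500: the sum, recovered with the combined key
```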
Privacy & Individual Utility
People do want to post personal content and appreciate individualized services... but they don't want to be tracked, targeted, or rated.
Key issue: privacy risk models that reconcile privacy and utility, and tools that analyze and explain risks and guide users.
User Privacy Risks in Online Communities
("Nobody interested in your research? We read your papers!")
Established privacy models:
- Data: single database
- Goal: hard anonymity guarantees, non-disclosure of any properties
- Adversary: computationally powerful, but agnostic; global access & view
- Measures: data coarsening, perturbation, limited queries; tension with utility
Today's user behavior & risks:
- Data & user: textual contents, social, agile, longitudinal
- Goal: alert & advise, bound risk
- Adversary: world knowledge & probabilistic inference, cost-aware
- Measures: estimate risk, rank "target users", selective anonymization -> Privacy Advisor tool
Outlook: Assessing Privacy at Large
Example scenario (user "Zoe"):
- Search engine: queries such as "Levothroid shaking", "Addison's disease", "Nive concert", "Greenland singers", "Somalia elections", "Steve Biko"
- Social network: profile "female, 29y, Jamame"; publishes & recommends "Nive Nielsen", "Cry Freedom"
- Online forum: discusses & seeks help as "female, 25-30, Somalia": "Synthroid", "tremble", "Addison disorder"
Outlook: Assessing Privacy at Large (cont.)
(Same scenario as on the previous slide.)
Threats from:
- Direct cues: profile data
- Indirect cues: profiles of friends
- Semantic cues: health, taste, queries
- Statistical cues: correlations
Let a wise person speak to that...
First: our ERC Synergy Grant got scooped by the Simpsons! Second: let me try to give you a glimpse of how this could really work.
Privacy Advisor - Building Blocks
(Diagram components: user; user action; privacy policy; personal info and history; world knowledge; Internet contents and interactions; a probabilistic model of privacy state & transitions.)
Privacy Advisor (PA): a software tool that
- analyses risk,
- alerts the user,
- advises the user.
Lecture Summary - Part I
Basic Database Privacy
- Motivation
- Data Sanitization
- k-anonymity and l-diversity
Principal Approaches to Data Protection
- Sanitization before Publication
- Protection after Publication
- Publication without Control
Introduction to Cybersecurity: Secure Information Flow
Summary
Secure Information Flow
- Confidentiality
- (In-)Secure Information Flow
  - Explicit Flow
  - Implicit Flow
  - Termination Flow
Confidentiality
- Recall: Confidentiality means assuring that information is not disclosed to unauthorized principals.
- In general, we can observe that:
  - It is easy to check information release.
  - It is hard to check information propagation.
Confidentiality Issues
- Modern applications process sensitive data: passwords, credit card numbers, phone numbers, ...
- In most systems, data is shared and can be accessed by possibly untrusted applications (Facebook, smartphones, ...).
- How do we know whether or not applications access sensitive data in a legitimate way? (e.g., the address book may be read but not forwarded)
- Data leakage...
  - How does it happen?
  - How can we detect it?
  - How can we prevent it?
Confidentiality (2)
- Standard security mechanisms are unsatisfactory:
  - Anti-virus scanning: rejects a black list of known attacks... but doesn't prevent new attacks.
  - Cryptography: protects secret data on the network... but endpoints of communication may leak data.
  - Sandboxing: good for low-level events (read a file), but programs are treated as black boxes.
  - Access control: prevents unauthorized release of information... but which programs should be authorized?
Checking Confidentiality
- We need to look at the code (inside the black box!) and check whether or not our programs leak information.
- Immediate benefits:
  - Semantics-based security specification
  - End-to-end security policies
  - Powerful analysis techniques
(In-)Secure Information Flow
- Privacy leaks can also occur from improper processing of data, and this leakage is not always obvious: information might flow to unintended places / recipients / variables.
- Secret inputs of a program must not influence its public outputs.
- Most basic setting:
  - low variables: low security, public information
  - high variables: high security, private information
Security definition (intuitive):
- We assume the low variables are published at the end of the program.
- They should not leak information about the high variables.
(Diagram: low input and high input enter the program; low output and high output leave it.)
Information Flow - Example 1
Consider the following program. Is it secure?
  low2 := low3 + low3
  low1 := secret
No: the second assignment is a direct explicit flow from a high variable to a low variable.
Information Flow โ Example 2
- Consider the following program. Is it secure?
68 Introduction to Cybersecurity 2016
low1 := low2 + low3
secret3 := secret1 + secret2
copy := secret1
secret1 := secret2
secret2 := copy
low2 := copy
Indirect explicit flow from high variable to low variable (no matter whether copy is high or low)
Information Flow โ Explicit Flow
Explicit flow occurs whenever a computation involving a high variable is assigned to a low variable. Examples for explicit flow:
- low := high
- low := low + high
- low := function(low,high)
We need to find rules to avoid explicit information flow.
- There must never be an assignment of a high variable to a low variable.
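The assignment rule can be phrased as a simple static check: every variable carries a security label, and an assignment is rejected if any variable on the right-hand side is high while the target is low. A sketch with a hypothetical label table:

```python
# Assignment rule as a static check (sketch): labels form a two-point
# lattice LOW < HIGH; an assignment target must be at least as high as
# every variable appearing on the right-hand side.

LOW, HIGH = 0, 1

# hypothetical label assignment for the example variables
labels = {"low1": LOW, "low2": LOW, "secret": HIGH}

def check_assignment(target, rhs_vars):
    # label of the right-hand side = join (max) of its variables' labels
    rhs_label = max((labels[v] for v in rhs_vars), default=LOW)
    return labels[target] >= rhs_label

print(check_assignment("low1", ["low2"]))            # True: low to low
print(check_assignment("low1", ["low2", "secret"]))  # False: explicit flow
print(check_assignment("secret", ["low1"]))          # True: low may flow up
```

Note that flows from low to high are always permitted; only the downward direction is forbidden.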
Information Flow โ Solution: Assignment rule
- There must never be an assignment of a high variable to a low variable.
Information flow solved?
low2 := low3 + low3
low1 := 0
if secret_bit == 1:
    low1 := 1
Implicit (conditional) flow from high variable to low variable.
The program actually computes low1 := secret_bit, yet there is no assignment from high to low!
Information Flow โ Implicit (Conditional) Flow
Conditional flow occurs whenever a computation branches depending on a high variable and within these branches assigns (different) values to low variables. Examples for conditional flow:
- low := 0
  while high > 0:
      low := low + 1
      high := high - 1
- low := 0
  if boolean_function(high):
      low := 1

We need to find rules to avoid conditional information flow.
- If a conditional (if/for/while/...) depends on a high variable, then no
assignment to low variables is allowed.
Should be allowed:
- low := 0
  while high1 > 0:
      high1 := function(high1,high2)
- low := 0
  if boolean_function(high1):
      high2 := 1
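The conditional rule is usually enforced by tracking a program-counter (pc) label: inside a branch whose condition is high, the pc is high, and assignments to low variables are rejected even when the right-hand side contains only low variables. A sketch extending the hypothetical label table from before:

```python
# Conditional rule via a program-counter label (sketch): the pc label
# joins into every assignment checked inside a branch, so branching on a
# HIGH condition forbids assignments to LOW variables in that branch.

LOW, HIGH = 0, 1
labels = {"low1": LOW, "secret_bit": HIGH, "high2": HIGH}

def check_assignment(target, rhs_vars, pc=LOW):
    # right-hand-side label = join of variable labels and the pc label
    rhs_label = max([labels[v] for v in rhs_vars] + [pc])
    return labels[target] >= rhs_label

# if secret_bit == 1: low1 := 1   -> pc is HIGH inside the branch
print(check_assignment("low1", [], pc=HIGH))   # False: implicit flow
# if secret_bit == 1: high2 := 1  -> assigning to HIGH is still allowed
print(check_assignment("high2", [], pc=HIGH))  # True
```

This matches the "should be allowed" examples above: branches on high conditions may freely update high variables, just not low ones.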
Information Flow โ Solution: Conditional rule
- There must never be an assignment of a high variable to a low variable.
- If a conditional depends on a high variable, then no assignment to low
variables is allowed.

Information flow solved?
low2 := low3 + low3
low1 := 0
while secret_bit == 1:
    high := 0
low1 := 1
Covert channel (termination) flow from high variable.
The program only terminates if secret_bit == 0.
No explicit flow! No assignment to low in a conditional!
A covert channel is a channel not intended for information transfer at all.
Information Flow โ Covert Channel (Termination) Flow
Termination flow occurs whenever the termination of a computation depends on a high variable.
Examples for termination flow:
- low := 0
  while high > 0:
      high := high
- for (temp := 1; temp < high; temp := temp + 1):
      high := high + 1
- JUMP_MARK:
      if high == 1: goto JUMP_MARK

We need to find rules to avoid termination information flow.
- Termination may not depend on high variables.
Non-terminating behavior depending on high variables should never be allowed.
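The termination channel can be made concrete: an observer who only sees whether the program finishes within some time budget recovers the secret bit. A sketch that simulates the looping program from the previous slide with a step limit standing in for the observer's timeout:

```python
# Termination channel demo (sketch): the program
#     while secret_bit == 1: high := 0
# never terminates when the secret is 1. An observer's timeout (here a
# step budget) turns that behaviour into one bit of leaked information.

def runs_to_completion(secret_bit, max_steps=1000):
    steps = 0
    while secret_bit == 1:       # loops forever when the secret is 1
        steps += 1
        if steps >= max_steps:   # timeout stands in for "program hangs"
            return False
    return True

# The attacker learns the secret from termination behaviour alone:
print(runs_to_completion(0))  # True  -> secret_bit was 0
print(runs_to_completion(1))  # False -> secret_bit was 1
```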
Information Flow โ Solution: Termination rule
- There must never be an assignment of a high variable to a low variable.
- If a conditional depends on a high variable, then no assignment to low
variables is allowed.
- Termination may not depend on high variables.
Information flow solved?
low2 := low3 + low3
low1 := 0
if secret_bit == 1:
    high := compute_complex_function(high)
low1 := 1
Covert channel (timing) flow from high variable.
The program may take significantly longer to terminate if secret_bit == 1.
No explicit flow! No conditional flow! Always terminates!
Information Flow โ Covert channel (Timing) Flow
Timing flow occurs whenever the time that a computation needs depends on high variables. Examples for timing flow:
- while high1 > 0:
      high1 := function(high1,high2)
- for each bit b_i of secret_key:
      if b_i == 1: high := function(high)

We need to find rules to avoid timing information flow.
- Computation time may not depend on high variables.
Should be allowed:
for each bit b_i of secret_key:
    if b_i == 1: high := function(high)
    else: dummy := function(high)
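Why the balanced version is allowed: both branches perform the same amount of work, so the running time is independent of the key. A sketch that counts calls to a hypothetical `function` as a stand-in for measuring wall-clock time:

```python
# Balanced branching against timing leaks (sketch): with a dummy branch,
# the number of calls to function() per bit is the same no matter what
# the key bits are, so no timing information about the key escapes.

calls = 0

def function(x):
    global calls
    calls += 1          # count work done as a proxy for time spent
    return x + 1

def process_key(key_bits):
    global calls
    calls = 0
    high = dummy = 0
    for b in key_bits:
        if b == 1:
            high = function(high)
        else:                      # balancing branch: same cost either way
            dummy = function(dummy)
    return calls

print(process_key([1, 1, 1, 1]))  # 4
print(process_key([0, 0, 0, 0]))  # 4: same count, no timing leak
```

In real cryptographic code this idea appears as constant-time implementations, where even cache behaviour and branch prediction must be balanced, not just instruction counts.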
Information Flow โ Wrap-up
- There must never be an assignment of a high variable to a low variable.
- If a conditional depends on a high variable, then no assignment to low
variables is allowed.
- Termination may not depend on high variables.
- Computation time may not depend on high variables.
Even more forms of information flow exist:
- Concurrent programs can leak internal states (and are hard to analyze)
- Limited resources can be used to leak information:
- write high times LARGE DATA to the disk (or load it into the RAM) and
wait for overflow.
- Other side channels, e.g., using volume control to transfer information,
measuring the electricity consumption, โฆ
Summary
Secure Information Flow
- Confidentiality
- (In-)Secure Information Flow
- Explicit Flow
- Implicit Flow
- Termination Flow