Introduction to Cybersecurity
Database Privacy
Review: Anonymity vs. Privacy
- Privacy
  - Privacy is the claim of individuals, groups, or institutions to determine for themselves when, how, and to what extent information about them is communicated to others.
- Anonymity
  - The state of not being identifiable within a set of subjects/individuals.
  - It is a property exclusively of individuals.
- Privacy != Anonymity
  - Anonymity is one way to maintain privacy, but it is not always necessary.
Foundations of Cybersecurity 2016 1
Review: Anonymous Communication (AC) Protocols
- Various AC protocols with different goals:
  - Low latency overhead
  - Low communication overhead
  - High traffic-analysis resistance
- Typically categorized by latency overhead:
  - Low-latency AC protocols, e.g., Tor, DC-nets, Crowds
  - High-latency AC protocols, e.g., mix networks
(Figure: trade-off between latency, traffic-analysis resistance, and communication complexity.)
A Glimpse on Research: Privacy Assessment with MATor
- Setting: a Tor client randomly chooses an entry, a middle, and an exit node; a budget adversary B_C corrupts nodes, given a cost function c: N -> R and a budget C.
- Goal: derive worst-case quantitative anonymity guarantees.
- From the impact of single-node corruption to an overall guarantee: an integer maximization problem. Choose the corrupted set A ⊆ N that maximizes the summed per-node impact, subject to the budget:
  maximize Σ_{n∈A} impact(n)   subject to   Σ_{n∈A} c(n) ≤ C
- Sender, recipient, and relationship anonymity are each bounded via sums of path-selection probabilities Pr[(e, m, x) ∈ path] over the possible circuits, weighted by the impact of the corrupted nodes.
- Computational soundness (anonymity degeneration for encryption modeled as symbolic terms): the computational anonymity bound exceeds the symbolic one by at most a small term, δ_computational(ε) - δ_symbolic(ε) ≤ 1/poly(η).
A Glimpse on Research: Privacy Assessment with MATor (cont.)
- Setup as before: randomly choose an entry, a middle, and an exit node; the adversary corrupts nodes. Goal: derive worst-case quantitative anonymity guarantees.
- Challenges: comprehensive network-layer attackers, extension beyond structural corruption, content-sensitive assessment.
- Potential killer arguments: attackers overly powerful, hence too pessimistic guarantees; assessment only for Tor, not for tailored attacks.
- Alternative path-selection algorithms (Tor, LASTor, Uniform, US-Exit) are compared in a live monitor: anonymity over time (2012-2014) and as a function of bandwidth in MB/s.
Lecture Summary - Part I
Basic Database Privacy
- Motivation
- Data Sanitization
- k-anonymity and l-diversity
Principal Approaches to Data Protection
- Sanitization before Publication
- Protection after Publication
- Publication without Control
Data Privacy: Attribute Disclosure
Example: linking a social-network profile (female, 29y, Saarbrücken) against a sanitized medical table:
  female  25-30  Saarland  Addison Disorder
  female  25-30  Saarland  Addison Disorder
  male    30-35  Saarland  Healthy
Every matching row carries the same diagnosis, so the attacker learns: Alice suffers from the Addison disorder!
Cryptographic Solutions
- Why not just delete the data?
- Why can't we encrypt?
In contrast to cryptography, privacy often requires a certain utility:
- Deleting data destroys utility.
- Storing or transmitting data encrypted is a good idea, but someone has (needs to have) the key.
Sanitization
- Legally, data has to be "sanitized": removal of "identifying" information.
Unsanitized data: Name, Gender, Age, Address, Phone Number, Field of studies, Grades
Sanitized data: Gender, Age, Field of studies, Grades (Name, Address, and Phone Number removed)
Benefits of Sanitization
Sanitized data can (still) be used for:
- Research
- Healthcare
- Governmental statistics
- Improving business models
(The sanitized attributes Gender, Age, Field of studies, and Grades still support statistics and science.)
Does Sanitization suffice?
Sanitization = Privacy?
- No identity
- No identifying information ("quasi-identifiers") such as address or phone number
But: if only one female student of this age attends a course, the remaining attributes still single her out. Privacy breach!
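The "only one female student of this age" breach above can be checked mechanically: count how often each combination of remaining quasi-identifiers occurs; a combination occurring once re-identifies a person. A minimal sketch with illustrative records (not from the slides); `risky_rows` is a hypothetical helper name.

```python
from collections import Counter

# Records after "sanitization": name/address/phone removed, but
# gender, age, and field of studies remain as quasi-identifiers.
records = [
    ("female", 21, "CS"),
    ("female", 21, "CS"),
    ("male",   22, "CS"),
    ("male",   22, "CS"),
    ("female", 23, "Biology"),  # unique combination -> re-identifiable
]

def risky_rows(rows):
    """Return quasi-identifier combinations that occur only once."""
    counts = Counter(rows)
    return [qi for qi, n in counts.items() if n == 1]

print(risky_rows(records))  # [('female', 23, 'Biology')]
```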
Attacks on Databases
Early defense mechanism: query sanitization.
  Name     Age  Gender  Semester  Grade
  Alice    19   Female  1         1.3
  Bob      18   Male    1         2.0
  Charlie  18   Male    1         1.7
  Dave     18   Male    1         3.7
  Eve      17   Female  1         1.0
  Fritz    19   Male    3         1.3
  Gerd     21   Male    3         2.3
  Hans     23   Male    3         3.0
  Isa      20   Female  3         3.7
  John     20   Male    3         1.7
  Kale     21   Male    5         1.7
  Leonard  23   Male    5         failed
  Martin   20   Male    5         2.7
  Nils     22   Male    5         3.0
  Otto     20   Male    5         1.0
SELECT SUM(Grade) WHERE Name = 'Isa'  ->  3.7
Sanitization rule: queries must not depend on identifiers!
Attacks on Databases (2)
(Same student table as on the previous slide.)
SELECT SUM(Grade) WHERE Semester = 3 AND Gender = Female  ->  3.7
Isa is the only female student in semester 3, so the "sum" is exactly her grade.
Sanitization rules: queries must not depend on identifiers, and queries must not be answered if the number of matching rows is below a threshold.
Attacks on Databases (3)
(Same student table as before.)
SELECT SUM(Grade)  ->  30.1
SELECT SUM(Grade) WHERE NOT (Semester = 3 AND Gender = Female)  ->  26.4
Local computation: Isa's grade = 30.1 - 26.4 = 3.7
Both queries individually satisfy the threshold rule, yet their difference reveals a single grade.
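The differencing attack can be reproduced directly from the table. A sketch: `sum_grades` is an illustrative helper, and "failed" is represented as `None` and excluded from sums, which matches the slide's totals.

```python
# The student table from the slide: (name, semester, gender, grade).
students = [
    ("Alice", 1, "Female", 1.3), ("Bob", 1, "Male", 2.0),
    ("Charlie", 1, "Male", 1.7), ("Dave", 1, "Male", 3.7),
    ("Eve", 1, "Female", 1.0),  ("Fritz", 3, "Male", 1.3),
    ("Gerd", 3, "Male", 2.3),   ("Hans", 3, "Male", 3.0),
    ("Isa", 3, "Female", 3.7),  ("John", 3, "Male", 1.7),
    ("Kale", 5, "Male", 1.7),   ("Leonard", 5, "Male", None),
    ("Martin", 5, "Male", 2.7), ("Nils", 5, "Male", 3.0),
    ("Otto", 5, "Male", 1.0),
]

def sum_grades(rows, pred=lambda r: True):
    """SUM(Grade) over rows matching pred; 'failed' (None) is skipped."""
    return round(sum(r[3] for r in rows if pred(r) and r[3] is not None), 1)

total = sum_grades(students)                                                   # 30.1
rest = sum_grades(students, lambda r: not (r[1] == 3 and r[2] == "Female"))    # 26.4
isa_grade = round(total - rest, 1)                                             # 3.7
print(total, rest, isa_grade)
```

Both aggregate queries touch many rows, so a query-set-size threshold does not block them; only their difference isolates Isa's grade.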
K-Anonymity (Intuitive Idea)
K-Anonymity: Privacy means that one can hide within a set of (at least) K other people with the same quasi-identifiers.
Quasi-identifiers: attributes that could identify a person (name, age, etc.)
(Figure: a crowd of K = 6 indistinguishable people.)
K-Anonymity (Definition)
Definition: Data satisfies K-Anonymity if each person contained in the data cannot be distinguished from at least K-1 other individuals also contained in the data.
Achieving K-Anonymity
Reduce the information such that rows collapse into indistinguishable groups:
Suppression (semester-1 rows; names and genders suppressed):
  Name  Age  Gender  Semester  Grade
  *     19   *       1         1.3
  *     18   *       1         2.0
  *     18   *       1         1.7
  *     18   *       1         3.7
  *     17   *       1         1.0
Generalization (semester-5 rows; ages generalized to ranges):
  Name  Age    Semester  Grade
  *     21-25  5         1.7
  *     21-25  5         failed
  *     18-20  5         2.7
  *     21-25  5         3.0
  *     18-20  5         1.0
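Generalization and a k-anonymity check can be sketched in a few lines; `generalize_age` and `is_k_anonymous` are illustrative helper names, applied here to the semester-1 ages from the table above.

```python
from collections import Counter

def generalize_age(age, width=5):
    """Map an exact age to a bucket label like '15-19'."""
    lo = age - age % width
    return f"{lo}-{lo + width - 1}"

def is_k_anonymous(rows, k):
    """rows: tuples of quasi-identifier values only.
    k-anonymous iff every combination occurs at least k times."""
    return all(n >= k for n in Counter(rows).values())

ages = [19, 18, 18, 18, 17]          # semester-1 students from the slide
raw = [(a, 1) for a in ages]         # (age, semester) as quasi-identifiers
gen = [(generalize_age(a), 1) for a in ages]

print(is_k_anonymous(raw, 5))   # False: exact ages split the group
print(is_k_anonymous(gen, 5))   # True: all fall into '15-19', semester 1
```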
K-Anonymity (3)
Example: K-Anonymity for a list of students with K=5. For each semester, there are at least 5 individuals present that cannot be distinguished. Idea/Goal: consequently, one cannot be identified, but hides in a group of K=5 people.
  Name  Semester  Grade
  *     1         1.3
  *     1         2.0
  *     1         1.7
  *     1         3.7
  *     1         1.0
  *     3         1.3
  *     3         2.3
  *     3         3.0
  *     3         failed
  *     3         1.7
  *     5         1.7
  *     5         failed
  *     5         2.7
  *     5         3.0
  *     5         1.0
Attacks on K-Anonymity - Homogeneity
One may still learn sensitive information about an individual if all K people in a group share the same sensitive value.
  Name  Semester  Grade
  *     1         1.3
  *     1         2.0
  *     1         1.7
  *     1         3.7
  *     1         1.0
  *     3         failed
  *     3         failed
  *     3         failed
  *     3         failed
  *     3         failed
  *     5         1.7
  *     5         failed
  *     5         2.7
  *     5         3.0
  *     5         1.0
K-Anonymity with K=5. But: if we know that a particular student, say Isa, is in the 3rd semester, then we immediately learn that she has failed the exam, because every grade in her group is "failed".
Attacks on K-Anonymity - Background Knowledge
Background knowledge that might look unsuspicious or not too privacy-critical may lead to privacy breaches.
  Name  Semester  Grade
  *     3         1.0
  *     3         1.3
  *     3         1.3
  *     3         failed
  *     3         failed
(Semesters 1 and 5 as before.)
K-Anonymity with K=5. But: after the exam, Isa (in the 3rd semester) looked disappointed after seeing the result. We can conclude that, with very high probability, she has not achieved a 1.0 or a 1.3, and thus she most likely failed the exam.
L-Diversity
Intuition and definition: there have to be L different, "representative" results for each set of quasi-identifiers.
  Name  Semester  Grade
  *     3         1.0
  *     3         2.3
  *     3         3.7
  *     3         failed
  *     3         3.0
L-Diversity (2)
Properties:
- Homogeneity attacks are impossible (there are enough representative values).
- Many knowledge-based attacks are covered: they often do not lead to direct deanonymization, but only quantitatively reduce the diversity.
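An l-diversity check over (quasi-identifier, sensitive-value) pairs can be sketched as follows; `is_l_diverse` is an illustrative helper, applied to the homogeneous semester-3 block and the diverse block from these slides.

```python
from collections import defaultdict

def is_l_diverse(rows, l):
    """rows: (quasi_identifier, sensitive_value) pairs.
    l-diverse iff each group has at least l distinct sensitive values."""
    groups = defaultdict(set)
    for qi, sensitive in rows:
        groups[qi].add(sensitive)
    return all(len(vals) >= l for vals in groups.values())

homogeneous = [(3, "failed")] * 5                              # homogeneity-attack block
diverse = [(3, g) for g in ("1.0", "2.3", "3.7", "failed", "3.0")]

print(is_l_diverse(homogeneous, 2))  # False: only one sensitive value
print(is_l_diverse(diverse, 5))      # True: five distinct grades
```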
Attack on L-Diversity - Lots of Knowledge
Every 5-block is optimally L-diverse (L=5):
  Name  Semester  Grade
  *     1         1.3
  *     1         2.0
  *     1         1.7
  *     1         3.7
  *     1         1.0
  *     3         1.0
  *     3         2.3
  *     3         3.7
  *     3         failed
  *     3         3.0
  *     5         1.7
  *     5         failed
  *     5         2.7
  *     5         3.0
  *     5         1.0
But:
- Assume you are in the 3rd semester and have a 2.3, your friend John has a 3.0, and you know that a male student from the 3rd semester just barely passed the exam.
- Moreover, Isa, who is also in the 3rd semester, looked unhappy after the exam, so it is very unlikely that she achieved the 1.0, and consequently it is very likely that she failed the exam.
Netflix Prize
When: 2007-2009. Challenge: "Find a better recommendation algorithm". Reward: $1,000,000 for the winner.
Data: training set (about 100,000,000 ratings from about 480,000 users).
"To prevent certain inferences being drawn about the Netflix customer base, some of the rating data for some customers in the training and qualifying sets have been deliberately perturbed in one or more of the following ways: deleting ratings; inserting alternative ratings and dates; and modifying rating dates."
Example record:
  User   Movie                     Rating  Date
  Alice  Pirates of the Caribbean  3       04-Nov-15
Netflix Prize - Anonymization and Deanonymization
Claim: "To protect customer privacy, all personal information identifying individual customers has been removed and all customer ids have been replaced by randomly-assigned ids."
Netflix data (ids replaced):
  ID           Movie           Rating  Date
  ID1 (Alice)  Pirates o.t.C.  3       04-Nov-15
  ID1 (Alice)  Matrix          4       95-Jun-04
  ID2 (Bob)    Titanic         5       97-Dec-21
  ID2 (Bob)    Matrix          4       02-Jan-23
  ID3 (Eve)    Godfather       4       05-Dec-06
  ID3 (Eve)    Pirates o.t.C.  5       03-Dec-12
  ID4 (Tom)    Toy Story       2       04-Jul-27
Public data (with names):
  Name   Movie           Rating  Date
  Alice  Pirates o.t.C.  3       04-Nov-14
  Tom    Toy Story       2       04-Jul-27
  John   Matrix          5       00-Jan-22
  Peter  Inception       5       11-Jan-07
  Bob    Toy Story       4       03-Jun-04
  Bob    Matrix          4       02-Jan-22
  Susie  Pirates o.t.C.  5       06-Feb-11
Linking rows that nearly agree on (movie, rating, date) re-identifies Alice, Bob, and Tom.
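The linkage attack can be sketched in a few lines. Simplified versions of the two tables above are used (the date format in the slide is ambiguous, so concrete dates here are illustrative assumptions), and `link` is a hypothetical helper name.

```python
from datetime import date

# "Anonymized" Netflix-style ratings and a public, named rating list.
netflix = [
    ("ID1", "Pirates o.t.C.", 3, date(2004, 11, 15)),
    ("ID2", "Matrix",         4, date(2002, 1, 23)),
    ("ID4", "Toy Story",      2, date(2004, 7, 27)),
]
public = [
    ("Alice", "Pirates o.t.C.", 3, date(2004, 11, 14)),
    ("Bob",   "Matrix",         4, date(2002, 1, 22)),
    ("Tom",   "Toy Story",      2, date(2004, 7, 27)),
]

def link(anon, named, max_day_diff=3):
    """Match records that agree on movie and rating, with dates only a
    few days apart; in sparse data this suffices to re-identify ids."""
    matches = {}
    for aid, movie, rating, d1 in anon:
        for name, m2, r2, d2 in named:
            if movie == m2 and rating == r2 and abs((d1 - d2).days) <= max_day_diff:
                matches[aid] = name
    return matches

print(link(netflix, public))  # {'ID1': 'Alice', 'ID2': 'Bob', 'ID4': 'Tom'}
```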
Data Sparseness
- Sparse data leads to privacy breaches.
(Figure: sparse vs. dense distribution of properties, plotted for only two attributes.)
Data Sparseness (2)
- Facebook data is really sparse:
  - Education
  - Hobbies
  - Favorite music / books / movies
Mathematically:
- A person is a dot in an n-dimensional space, where n = number of attributes.
(Figure: Alice and Bob plotted with 2 attributes vs. 6 attributes.)
Question: does a higher order (number of dimensions) lead to more or less sparseness?
Data Sparseness (3)
- Facebook data is really sparse:
  - Education
  - Hobbies
  - Favorite music / books / movies
Mathematically:
- A person is a dot in an n-dimensional space, where n = number of attributes.
- The higher the order (the number of dimensions), the sparser the data: similarities become less likely, as the space grows exponentially.
- For Boolean attributes:
  - 2 attributes: 2^2 = 4 possibilities
  - 6 attributes: 2^6 = 64 possibilities
  - 50 attributes: 2^50 = 1,125,899,906,842,624 possibilities
(Figure: Alice and Bob plotted with 2 attributes vs. 6 attributes.)
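The exponential blow-up above can be checked directly; `expected_sharers` is an illustrative helper estimating how many people would share a uniformly random Boolean profile.

```python
# With n Boolean attributes there are 2**n possible attribute
# combinations, so two people almost never coincide for large n.
def possibilities(n):
    return 2 ** n

for n in (2, 6, 50):
    print(n, possibilities(n))

# Expected number of people out of a population sharing one uniformly
# random profile: population / 2**n. Effectively zero for n = 50.
def expected_sharers(population, n):
    return population / 2 ** n

print(expected_sharers(8_000_000_000, 50))  # far below 1
```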
Lecture Summary - Part I
Basic Database Privacy
- Motivation
- Data Sanitization
- k-anonymity and l-diversity
Principal Approaches to Data Protection
- Sanitization before Publication
- Protection after Publication
- Publication without Control
Three Principal Approaches to Data Protection
Data under control: Sanitization before publication
- Problem: generic approaches are difficult; sanitization removes valuable and potentially crucially required information.
- Goal: provide data without potential linkability.
Data under control: Strong protection after publication
- Problem: limited computations, efficiency.
- Goal: use and modify data without direct access.
Most cases: Data dissemination without control
- Problem: difficulties to enforce privacy, no hard guarantees; existing approaches can be circumvented, and specific sanitization (removing information) comes without guarantees.
- Goal: understand exposure and privacy consequences; provide provably private sanitization.
Differential Privacy
Intuition: a mechanism is differentially private if the output does not observably depend on whether you are in the database or not.
  Name     Age  Gender  Semester  Grade
  Alice    19   Female  1         1.3
  Bob      18   Male    1         failed
  Charlie  18   Male    1         1.7
  Dave     18   Male    1         3.7
  Eve      17   Female  1         1.0
  Fritz    19   Male    3         1.3
  Gerd     21   Male    3         2.3
  Hans     23   Male    3         3.0
  Isa      20   Female  3         failed
  ->  Statistics
Differential Privacy (2)
Definition (informal): for two neighboring databases, i.e., databases differing in at most one row, every observable output must be almost equally likely.
(Left: the table above, including Isa's row. Right: the same table with Isa's row removed.)
Both databases yield the same output: "Around 60% of the students have passed."
Differential Privacy (3)
For two neighboring databases, every observable output must be almost equally likely. Idea: no attacker can learn whether or not an individual person is in the database. Consequently, no attacker can learn information about any individual member of the database, but tendencies and statistics can still be learned.
Differential Privacy - How (not) to achieve it
Generalization alone does not work: circumstances and sufficient knowledge break generalization.
With Isa: "7 students have passed, 2 have failed."  Without Isa: "7 students have passed, 1 has failed."
An attacker may observe the difference (1 vs. 2 students have failed). With sufficient knowledge (in the extreme case: about all other students), it may learn whether Isa participated or not.
Differential Privacy - How to achieve it
Observation: no deterministic method is possible if we want to preserve utility. For every function that allows one to learn some tendencies, there are corner cases in which one can observe the presence or absence of someone. Examples:
- Rounding to values: "10 people have succeeded"
  - Corner case: 14 is rounded to 10, 15 is rounded to 20.
- Boolean statements: "At least as many people have succeeded as failed"
  - Corner case: 15 students succeeded and 15 (or 14) failed.
- Even relative statements do not work if the attacker has arbitrary knowledge:
  - "80% of the people have succeeded" (if one knows all other students, the percentage leaks information).
We need a randomized sanitization method!
Achieving Differential Privacy
Addition of random noise: we randomly modify the result.
Precise answer: "170 students have passed, 82 have failed."
Noise: -2.3 students passed, +1.4 students failed.
Noisy answer: "167.3 students have passed, 83.4 have failed."
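Noise addition of this kind is typically realized with the Laplace mechanism. A minimal sketch using only the standard library; `laplace_noise` and `noisy_count` are hypothetical helper names, and the inverse-CDF sampling is one standard way to draw Laplace noise.

```python
import math
import random

def laplace_noise(scale, rng=random):
    """Sample from Lap(0, scale) via the inverse CDF."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def noisy_count(true_count, epsilon, sensitivity=1.0, rng=random):
    """epsilon-DP release of a counting query. Sensitivity is 1:
    adding or removing one student changes the count by at most 1."""
    return true_count + laplace_noise(sensitivity / epsilon, rng)

rng = random.Random(0)
print(noisy_count(170, epsilon=1.0, rng=rng))  # a noisy "passed" count near 170
```

Smaller epsilon means a larger noise scale, i.e., more privacy and less utility.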
Achieving Differential Privacy (2)
Addition of random noise: we randomly modify the result. Note that the noise does not preserve "sanity checks" such as:
- The total number of students is preserved.
- The result is always a natural number >= 0.
- If the noisy answer implies the exam has to be repeated (because too many people failed), then the same holds for the precise result.
Such a mismatch can leak information.
(Precise answer: 170 passed, 82 failed. Noise: -2.3 / +1.4. Noisy answer: 167.3 passed, 83.4 failed.)
Achieving Differential Privacy (3)
Differential privacy can cope with arbitrary adversarial knowledge: the adversary may know the whole database except for one entry (in the example: every row except Isa's).
Rules of thumb:
- The more precise the answer (less noise), the more privacy is lost.
- For a small database: good privacy -> lots of noise -> bad utility.
  - Answers like: "-3 students have passed the exam."
Post-Sanitization and Differential Privacy
Post-sanitization (deterministic or probabilistic) is possible: as long as it only depends on the noisy output (not on the original dataset), every computation is allowed and does not decrease privacy.
Precise answer: "170 students have passed, 82 have failed."
Noisy answer: "167.3 students have passed, 83.4 have failed."
Arbitrary computation on the noisy answer: rounding, bounding, relations, e.g., if (answer < 0) then answer = 0.
Result: "167 students have passed, 83 have failed. Twice as many students have passed as have failed."
More details in future lectures!
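The allowed post-processing can be sketched directly; `post_process` is a hypothetical helper that operates only on the noisy answer, never on the original data, so it cannot decrease privacy.

```python
def post_process(noisy_passed, noisy_failed):
    """Clean up a noisy answer: clamp to non-negative counts, round to
    integers, and derive a relation, all without touching raw data."""
    passed = max(0, round(noisy_passed))   # no negative counts
    failed = max(0, round(noisy_failed))
    ratio = passed / failed if failed else float("inf")
    return passed, failed, ratio

print(post_process(167.3, 83.4))  # (167, 83, ~2.01): roughly twice as many passed
```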
Privacy-friendly Aggregation (Smart Metering)
Goal: privacy guarantees for aggregated data. It should be impossible to infer the energy consumption of any individual household.
(Figure: per-household load curves, kW over time of day, aggregated over households 1-7; the aggregate is released with Laplace noise Lap(Δf/ε).)
Probability density of the noise (centered Laplace distribution with scale b = Δf/ε):
  pdf(y) = 1/(2b) * exp(-|y|/b)
- Δf: sensitivity, i.e., a single user's/household's impact on the function output.
- ε: privacy parameter from Differential Privacy.
Technical definition (Differential Privacy): for neighboring databases D1, D2 and every output set S,
  Pr[M(D1) ∈ S] ≤ e^ε * Pr[M(D2) ∈ S] + δ
Privacy-friendly Aggregation (Smart Metering) (cont.)
(Setup, noise mechanism, and parameters as on the previous slide.)
- Challenges: sanitization of dynamic data (streaming), decentralized noising, learning with differential privacy.
- Potential killer arguments: the provided utility is not sufficient for any practical use case; interesting outliers are removed.
Protected Publication - Rough Overview
- Idea: data never leaves the user's control.
  - Data is published encrypted, or stays in trusted hardware.
- Goal: only trustworthy users / processes are granted access.
- Possibility 1: trustworthy computations in secure hardware (IBM Cryptocard, ARM TrustZone, Intel SGX).
  - Major challenge: under which conditions should access be granted? Everything that has been output is out of the user's control (!)
(Diagram: m enters the trusted component; f(m) is output.)
Protected Publication - Rough Overview (cont.)
- Possibility 2: computing over encrypted data (fully homomorphic encryption).
  - Given E(K, m) and a function f, compute E(K, f(m)) by evaluating a corresponding function on E(K, m), without decrypting.
  - Major challenges: generality of E for the permitted functions f; currently still extremely inefficient.
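The homomorphic interface can be illustrated with a deliberately insecure toy scheme, E(k, m) = (m + k) mod n, which is additively homomorphic. This is only a sketch of the interface from the slide, not real FHE and not a secure construction.

```python
# Toy additively homomorphic "encryption": E(k, m) = (m + k) mod N.
# NOT secure; it only shows that adding two ciphertexts yields a
# ciphertext of the sum of the plaintexts.
N = 1_000_003

def enc(key, m):
    return (m + key) % N

def dec(key, c):
    return (c - key) % N

def add_ciphertexts(c1, c2):
    # The server adds ciphertexts without ever seeing m1 or m2.
    return (c1 + c2) % N

k1, k2 = 42, 1337
c = add_ciphertexts(enc(k1, 200), enc(k2, 300))
print(dec(k1 + k2, c))  # 500: the sum, recovered with the combined key
```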
Privacy & Individual Utility
People do want to post personal content and appreciate individualized services... but they don't want to be tracked, targeted, or rated.
Key issue: privacy risk models that reconcile privacy and utility, and tools that analyze and explain risks and guide users.
User Privacy Risks in Online Communities
("Nobody interested in your research? We read your papers!")
Established privacy models:
- Data: single database
- Goal: hard anonymity guarantees, non-disclosure of any properties
- Adversary: computationally powerful, but agnostic; global access & view
- Measures: data coarsening, perturbation, limited queries; tension with utility
Today's user behavior & risks:
- Data & user: textual contents, social, agile, longitudinal
- Goal: alert & advise, bound risk
- Adversary: world knowledge & probabilistic inference, cost-aware
- Measures: estimate risk, rank "target users", selective anonymization -> Privacy Advisor tool
Outlook: Assessing Privacy at Large
Example scenario (user "Zoe"):
- Search engine: queries such as "Levothroid shaking", "Addison's disease", "Nive concert", "Greenland singers", "Somalia elections", "Steve Biko"
- Social network: profile "female, 29y, Jamame"; publishes & recommends "Nive Nielsen", "Cry Freedom"
- Online forum: discusses & seeks help as "female, 25-30, Somalia": "Synthroid", "tremble", "Addison disorder"
Outlook: Assessing Privacy at Large (cont.)
(Same scenario as on the previous slide.)
Threats from:
- Direct cues: profile data
- Indirect cues: profiles of friends
- Semantic cues: health, taste, queries
- Statistical cues: correlations
Let a wise person speak to that...
First: our ERC Synergy Grant got scooped by the Simpsons! Second: let me try to give you a glimpse of how this could really work.
Privacy Advisor - Building Blocks
(Diagram components: user; user action; privacy policy; personal info and history; world knowledge; Internet contents and interactions; a probabilistic model of privacy state & transitions.)
Privacy Advisor (PA): a software tool that
- analyses risk,
- alerts the user,
- advises the user.
Lecture Summary - Part I
Basic Database Privacy
- Motivation
- Data Sanitization
- k-anonymity and l-diversity
Principal Approaches to Data Protection
- Sanitization before Publication
- Protection after Publication
- Publication without Control
Introduction to Cybersecurity: Secure Information Flow
Summary
Secure Information Flow
- Confidentiality
- (In-)Secure Information Flow
  - Explicit Flow
  - Implicit Flow
  - Termination Flow
Confidentiality
- Recall: Confidentiality means assuring that information is not disclosed to unauthorized principals.
- In general, we can observe that:
  - It is easy to check information release.
  - It is hard to check information propagation.
Confidentiality Issues
- Modern applications process sensitive data: passwords, credit card numbers, phone numbers, ...
- In most systems, data is shared and can be accessed by possibly untrusted applications (Facebook, smartphones, ...).
- How do we know whether or not applications access sensitive data in a legitimate way? (e.g., the address book may be read but not forwarded)
- Data leakage...
  - How does it happen?
  - How can we detect it?
  - How can we prevent it?
Confidentiality (2)
- Standard security mechanisms are unsatisfactory:
  - Anti-virus scanning: rejects a black list of known attacks... but doesn't prevent new attacks.
  - Cryptography: protects secret data on the network... but endpoints of communication may leak data.
  - Sandboxing: good for low-level events (read a file), but programs are treated as black boxes.
  - Access control: prevents unauthorized release of information... but which programs should be authorized?
Checking Confidentiality
- We need to look at the code (inside the black box!) and check whether or not our programs leak information.
- Immediate benefits:
  - Semantics-based security specification
  - End-to-end security policies
  - Powerful analysis techniques
(In-)Secure Information Flow
- Privacy leaks can also occur from improper processing of data, and this leakage is not always obvious: information might flow to unintended places / recipients / variables.
- Secret inputs of a program must not influence its public outputs.
- Most basic setting:
  - low variables: low security, public information
  - high variables: high security, private information
Security definition (intuitive):
- We assume the low variables are published at the end of the program.
- They should not leak information about the high variables.
(Diagram: low input and high input enter the program; low output and high output leave it.)
Information Flow - Example 1
Consider the following program. Is it secure?
  low2 := low3 + low3
  low1 := secret
No: the second assignment is a direct explicit flow from a high variable to a low variable.
Information Flow โ Example 2
- Consider the following program. Is it secure?
68 Introduction to Cybersecurity 2016
low1 := low2 + low3
secret3 := secret1 + secret2
copy := secret1
secret1 := secret2
secret2 := copy
low2 := copy
Indirect explicit flow from high variable to low variable (no matter whether copy is high or low)
Information Flow โ Explicit Flow
Explicit flow occurs whenever a computation involving a high variable is assigned to a low variable. Examples for explicit flow:
- low := high
- low := low + high
- low := function(low,high)
We need to find rules to avoid explicit information flow.
- There must never be an assignment of a high variable to a low variable.
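The assignment rule can be phrased as a simple static check: every variable carries a security label, and an assignment is rejected if any variable on the right-hand side is high while the target is low. A sketch with a hypothetical label table:

```python
# Assignment rule as a static check (sketch): labels form a two-point
# lattice LOW < HIGH; an assignment target must be at least as high as
# every variable appearing on the right-hand side.

LOW, HIGH = 0, 1

# hypothetical label assignment for the example variables
labels = {"low1": LOW, "low2": LOW, "secret": HIGH}

def check_assignment(target, rhs_vars):
    # label of the right-hand side = join (max) of its variables' labels
    rhs_label = max((labels[v] for v in rhs_vars), default=LOW)
    return labels[target] >= rhs_label

print(check_assignment("low1", ["low2"]))            # True: low to low
print(check_assignment("low1", ["low2", "secret"]))  # False: explicit flow
print(check_assignment("secret", ["low1"]))          # True: low may flow up
```

Note that flows from low to high are always permitted; only the downward direction is forbidden.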
Information Flow โ Solution: Assignment rule
- There must never be an assignment of a high variable to a low variable.
Information flow solved?
low2 := low3 + low3
low1 := 0
if secret_bit == 1:
    low1 := 1
Implicit (conditional) flow from high variable to low variable.
The program actually computes low1 := secret_bit, yet there is no assignment from high to low!
Information Flow โ Implicit (Conditional) Flow
Conditional flow occurs whenever a computation branches depending on a high variable and within these branches assigns (different) values to low variables. Examples for conditional flow:
- low := 0
  while high > 0:
      low := low + 1
      high := high - 1
- low := 0
  if boolean_function(high):
      low := 1

We need to find rules to avoid conditional information flow.
- If a conditional (if/for/while/...) depends on a high variable, then no
assignment to low variables is allowed.
Should be allowed:
- low := 0
  while high1 > 0:
      high1 := function(high1,high2)
- low := 0
  if boolean_function(high1):
      high2 := 1
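The conditional rule is usually enforced by tracking a program-counter (pc) label: inside a branch whose condition is high, the pc is high, and assignments to low variables are rejected even when the right-hand side contains only low variables. A sketch extending the hypothetical label table from before:

```python
# Conditional rule via a program-counter label (sketch): the pc label
# joins into every assignment checked inside a branch, so branching on a
# HIGH condition forbids assignments to LOW variables in that branch.

LOW, HIGH = 0, 1
labels = {"low1": LOW, "secret_bit": HIGH, "high2": HIGH}

def check_assignment(target, rhs_vars, pc=LOW):
    # right-hand-side label = join of variable labels and the pc label
    rhs_label = max([labels[v] for v in rhs_vars] + [pc])
    return labels[target] >= rhs_label

# if secret_bit == 1: low1 := 1   -> pc is HIGH inside the branch
print(check_assignment("low1", [], pc=HIGH))   # False: implicit flow
# if secret_bit == 1: high2 := 1  -> assigning to HIGH is still allowed
print(check_assignment("high2", [], pc=HIGH))  # True
```

This matches the "should be allowed" examples above: branches on high conditions may freely update high variables, just not low ones.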
Information Flow โ Solution: Conditional rule
- There must never be an assignment of a high variable to a low variable.
- If a conditional depends on a high variable, then no assignment to low
variables is allowed.

Information flow solved?
low2 := low3 + low3
low1 := 0
while secret_bit == 1:
    high := 0
low1 := 1
Covert channel (termination) flow from high variable.
The program only terminates if secret_bit == 0.
No explicit flow! No assignment to low in a conditional!
A covert channel is a channel not intended for information transfer at all.
Information Flow โ Covert Channel (Termination) Flow
Termination flow occurs whenever the termination of a computation depends on a high variable.
Examples for termination flow:
- low := 0
  while high > 0:
      high := high
- for (temp := 1; temp < high; temp := temp + 1):
      high := high + 1
- JUMP_MARK:
      if high == 1: goto JUMP_MARK

We need to find rules to avoid termination information flow.
- Termination may not depend on high variables.
Non-terminating behavior depending on high variables should never be allowed.
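The termination channel can be made concrete: an observer who only sees whether the program finishes within some time budget recovers the secret bit. A sketch that simulates the looping program from the previous slide with a step limit standing in for the observer's timeout:

```python
# Termination channel demo (sketch): the program
#     while secret_bit == 1: high := 0
# never terminates when the secret is 1. An observer's timeout (here a
# step budget) turns that behaviour into one bit of leaked information.

def runs_to_completion(secret_bit, max_steps=1000):
    steps = 0
    while secret_bit == 1:       # loops forever when the secret is 1
        steps += 1
        if steps >= max_steps:   # timeout stands in for "program hangs"
            return False
    return True

# The attacker learns the secret from termination behaviour alone:
print(runs_to_completion(0))  # True  -> secret_bit was 0
print(runs_to_completion(1))  # False -> secret_bit was 1
```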
Information Flow โ Solution: Termination rule
- There must never be an assignment of a high variable to a low variable.
- If a conditional depends on a high variable, then no assignment to low
variables is allowed.
- Termination may not depend on high variables.
Information flow solved?
low2 := low3 + low3
low1 := 0
if secret_bit == 1:
    high := compute_complex_function(high)
low1 := 1
Covert channel (timing) flow from high variable.
The program may take significantly longer to terminate if secret_bit == 1.
No explicit flow! No conditional flow! Always terminates!
Information Flow โ Covert channel (Timing) Flow
Timing flow occurs whenever the time that a computation needs depends on high variables. Examples for timing flow:
- while high1 > 0:
      high1 := function(high1,high2)
- for each bit b_i of secret_key:
      if b_i == 1: high := function(high)

We need to find rules to avoid timing information flow.
- Computation time may not depend on high variables.
Should be allowed:
for each bit b_i of secret_key:
    if b_i == 1: high := function(high)
    else: dummy := function(high)
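Why the balanced version is allowed: both branches perform the same amount of work, so the running time is independent of the key. A sketch that counts calls to a hypothetical `function` as a stand-in for measuring wall-clock time:

```python
# Balanced branching against timing leaks (sketch): with a dummy branch,
# the number of calls to function() per bit is the same no matter what
# the key bits are, so no timing information about the key escapes.

calls = 0

def function(x):
    global calls
    calls += 1          # count work done as a proxy for time spent
    return x + 1

def process_key(key_bits):
    global calls
    calls = 0
    high = dummy = 0
    for b in key_bits:
        if b == 1:
            high = function(high)
        else:                      # balancing branch: same cost either way
            dummy = function(dummy)
    return calls

print(process_key([1, 1, 1, 1]))  # 4
print(process_key([0, 0, 0, 0]))  # 4: same count, no timing leak
```

In real cryptographic code this idea appears as constant-time implementations, where even cache behaviour and branch prediction must be balanced, not just instruction counts.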
Information Flow โ Wrap-up
- There must never be an assignment of a high variable to a low variable.
- If a conditional depends on a high variable, then no assignment to low
variables is allowed.
- Termination may not depend on high variables.
- Computation time may not depend on high variables.
Even more forms of information flow exist:
- Concurrent programs can leak internal states (and are hard to analyze)
- Limited resources can be used to leak information:
- write high times LARGE DATA to the disk (or load it into the RAM) and
wait for overflow.
- Other side channels, e.g., using volume control to transfer information,
measuring the electricity consumption, โฆ
Summary
Secure Information Flow
- Confidentiality
- (In-)Secure Information Flow
- Explicit Flow
- Implicit Flow
- Termination Flow