SLIDE 1

Privacy-preserving entity resolution and logistic regression on encrypted data

Giorgio Patrini & Mentari Djatmiko, Stephen Hardy, Wilko Henecka, Hamish Ivey-Law, Maximilian Ott, Huy Pham, Guillaume Smith, Brian Thorne, Dongyao Wu

N1 Analytics @ Data61 CSIRO

PSML workshop, ICML17
11/8/2017, Sydney

SLIDE 2

Scenario & motivation

[Diagram: A: Data provider (Data, Compute), C: Coordinator (Compute), B: Data provider (Data, Compute). Labels: Confidentiality boundary; Sensitive messages are encrypted.]

A and B hold different features for many shared entities.

SLIDE 3

Vertical partition of a dataset: common entities but different features

○ One data provider has the labels
○ E.g. banking and insurance data about common customers; labels are fraudulent activity

Goal: learn a predictive model in the cross-feature space

○ Accuracy comparable to having all the data in one place
○ Scale to real-world applications

Secure end-to-end system

SLIDE 4

Vertical partition of a dataset: common entities but different features

Constraints

○ Who is who? ⇨ Private entity resolution
○ Raw data remains private ⇨ federated learning + privacy

SLIDE 5

Overview

End-to-end system:

○ Security assumptions / requirements
○ Entity resolution
○ Learning on private data

Deployment & experiments

SLIDE 6

Security assumptions / requirements

Participants are honest-but-curious:

○ they follow the protocol
○ they are not colluding
○ but: they try to infer as much as possible

This is reasonable: participants have an incentive to compute an accurate model.
Only the Coordinator holds the private key used to decrypt messages.
No sensitive data (raw or aggregated) leaves a data provider unencrypted.

○ ...but each data provider computes on its own unencrypted individual records locally.

SLIDE 8

Privacy-preserving entity resolution

Goal: match corresponding rows in two distinct databases
Constraint: can’t share Personally Identifiable Information (PII)

SLIDE 9

Privacy-preserving entity resolution

Solution: fuzzy & private matching


SLIDE 10

Privacy-preserving entity resolution

[Diagram: A: Data provider and B: Data provider each hold PII; C: Coordinator sits between them. A's PII: name, DOB, gender, etc. of A's customers.]

SLIDE 11

Privacy-preserving entity resolution

[Diagram: A: Data provider and B: Data provider each hash their PII, using a shared secret salt; C: Coordinator sits between them.]

The hash preserves similarity, e.g. by hashing bigrams [Schnell et al. 11].
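
A minimal sketch of a similarity-preserving hash over bigrams, assuming a Bloom-filter style encoding keyed with HMAC-SHA256. The function names, the filter size and the number of hash functions are illustrative assumptions, not the parameters of the system described in this deck.

```python
import hashlib
import hmac


def bigrams(text):
    """Split a normalised string into overlapping 2-character tokens."""
    padded = f"_{text.strip().lower()}_"  # padding lets first/last letters count
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def encode(text, salt, size=256, num_hashes=4):
    """Return the set bit positions of a keyed Bloom filter built from bigrams.

    `salt` plays the role of the shared secret: without it, the bit
    positions cannot be recomputed from a guessed name.
    """
    bits = set()
    for gram in bigrams(text):
        for k in range(num_hashes):
            digest = hmac.new(salt, f"{k}:{gram}".encode(), hashlib.sha256).digest()
            bits.add(int.from_bytes(digest[:8], "big") % size)
    return bits


salt = b"shared-secret"  # hypothetical value, known to A and B only
code_a = encode("Smith", salt)
code_b = encode("smith ", salt)  # same name after normalisation
```

Because similar strings share most of their bigrams, their encodings share most of their set bits, which is what makes approximate comparison possible on the hashed values.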

SLIDE 12

Privacy-preserving entity resolution

[Diagram: A: Data provider and B: Data provider each hash their PII; a fuzzy matcher compares the hashes. C: Coordinator sits between A and B.]

The fuzzy matcher is robust to misspellings and errors.
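
A sketch of how fuzzy matching on hashed records could look, assuming keyed bigram encodings compared with the Dice coefficient and a greedy one-to-one assignment. The threshold, the greedy strategy and all names below are assumptions for illustration; the deck does not specify the matcher.

```python
import hashlib
import hmac


def encode(text, salt, size=1024, num_hashes=2):
    """Keyed Bloom-filter encoding of character bigrams (set bit positions)."""
    padded = f"_{text.strip().lower()}_"
    grams = {padded[i:i + 2] for i in range(len(padded) - 1)}
    return {
        int.from_bytes(hmac.new(salt, f"{k}:{g}".encode(), hashlib.sha256).digest()[:8], "big") % size
        for g in grams
        for k in range(num_hashes)
    }


def dice(a, b):
    """Dice coefficient between two sets of bit positions, in [0, 1]."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


def match(rows_a, rows_b, threshold=0.7):
    """Greedy one-to-one matching: best-scoring pairs first, above the threshold."""
    scored = sorted(
        ((dice(ea, eb), i, j) for i, ea in enumerate(rows_a) for j, eb in enumerate(rows_b)),
        reverse=True,
    )
    used_a, used_b, pairs = set(), set(), []
    for score, i, j in scored:
        if score < threshold:
            break
        if i not in used_a and j not in used_b:
            pairs.append((i, j))
            used_a.add(i)
            used_b.add(j)
    return sorted(pairs)


salt = b"shared-secret"  # hypothetical value agreed by A and B
names_a = ["jonathan smith", "maria garcia", "wei chen"]
names_b = ["maria garcia", "alice brown", "jonathon smith"]  # note the misspelling
pairs = match([encode(n, salt) for n in names_a], [encode(n, salt) for n in names_b])
```

The matcher only ever sees bit positions, never names, and the misspelled record still pairs with its counterpart because most bigrams are unchanged.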

SLIDE 13

Privacy-preserving entity resolution: the output

[Diagram: A: Data provider, C: Coordinator, B: Data provider; A and B hold PII. Output: permutation & encrypted mask.]

○ permutations: align the rows of A and B
○ encrypted mask: vector of encrypted 0/1 values that selects the matches

No data provider knows which, or how many, entities are in common!
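
To make the output concrete, here is a plaintext sketch of how row orderings and a 0/1 mask could be derived from a list of matched index pairs. The construction (pairing unmatched rows arbitrarily, then shuffling) and the name `build_output` are assumptions; in the system described here the mask entries would additionally be encrypted.

```python
import random


def build_output(pairs, n_a, n_b, seed=0):
    """Return (perm_a, perm_b, mask) aligning rows of A and B position by position."""
    matched_a = {i for i, _ in pairs}
    matched_b = {j for _, j in pairs}
    rest_a = [i for i in range(n_a) if i not in matched_a]
    rest_b = [j for j in range(n_b) if j not in matched_b]
    # Matched pairs get mask 1; leftover rows are paired arbitrarily with mask 0.
    slots = [(i, j, 1) for i, j in pairs] + [(i, j, 0) for i, j in zip(rest_a, rest_b)]
    # Shuffle so that a row's position says nothing about whether it matched.
    random.Random(seed).shuffle(slots)
    perm_a = [i for i, _, _ in slots]
    perm_b = [j for _, j, _ in slots]
    mask = [m for _, _, m in slots]  # plaintext here; encrypted in the real setting
    return perm_a, perm_b, mask


# Hypothetical example: rows 0 and 1 of A match rows 2 and 0 of B.
perm_a, perm_b, mask = build_output([(0, 2), (1, 0)], n_a=3, n_b=3)
```

Each provider reorders its own table with its list of indices, e.g. `[table_a[i] for i in perm_a]`, so that position k refers to the same slot on both sides. When the two tables differ in size, this sketch simply drops the surplus unmatched rows.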

SLIDE 15

Background: Paillier Partially Homomorphic Encryption

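$[\![x]\!]$ is the encryption of $x$
Addition: $[\![x]\!] + [\![y]\!] = [\![x + y]\!]$
Scalar multiplication: $a \cdot [\![x]\!] = [\![a x]\!]$ for a plaintext scalar $a$
Extend to vectors ⇨ encrypted linear algebra (almost)!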
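
The two homomorphic properties can be checked with a toy Paillier implementation. This is a sketch for illustration only: the primes are far too small to be secure, plaintexts are non-negative integers, and the function names are not those of any library.

```python
import math
import random

P, Q = 2**31 - 1, 2**61 - 1  # two Mersenne primes; toy key size, NOT secure
N = P * Q                    # public key
N2 = N * N
LAM = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)  # private key
MU = pow(LAM, -1, N)         # modular inverse (Python 3.8+), valid for g = N + 1


def encrypt(m):
    """Randomised encryption of an integer m in [0, N)."""
    r = random.randrange(1, N)
    while math.gcd(r, N) != 1:
        r = random.randrange(1, N)
    return (1 + (m % N) * N) * pow(r, N, N2) % N2


def decrypt(c):
    """Requires the private key (LAM, MU)."""
    return (pow(c, LAM, N2) - 1) // N * MU % N


def add(c1, c2):
    """Encrypted addition: multiplying ciphertexts adds plaintexts."""
    return c1 * c2 % N2


def scale(c, a):
    """Multiply an encrypted value by a plaintext scalar a."""
    return pow(c, a % N, N2)


def dot(enc_vec, plain_vec):
    """Inner product of an encrypted vector with a plaintext vector."""
    acc = encrypt(0)
    for c, a in zip(enc_vec, plain_vec):
        acc = add(acc, scale(c, a))
    return acc
```

The "almost" shows up in `dot`: one operand must be in plaintext, because there is no operation here that multiplies two ciphertexts.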

SLIDE 16

Background: Paillier Partially Homomorphic Encryption

Our Paillier implementations:

○ Python: github.com/n1analytics/python-paillier
○ Java: github.com/n1analytics/javallier

SLIDE 17

Logistic regression

Goal: distributed SGD for logistic regression, keeping data private
Challenges:

○ Constrained by Paillier to simple arithmetic (e.g. no log, no exp)
○ Data is split by features and cannot leave its data provider

SLIDE 18

Logistic regression

Solutions:

○ Gradient and loss approximation using Taylor expansion, up to 2nd order
○ Collaborative protocol for computing gradients and loss values
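
A plaintext sketch of why a second-order approximation suits feature-split data. It assumes labels in {-1, +1} and an expansion around zero, which gives a per-row gradient factor of (1/4) θ·x - (1/2) y. Since θ·x is the sum of each provider's partial score, each provider can form its own block of the gradient from that shared per-row factor. Encryption is omitted, and all names and data below are hypothetical.

```python
def partial_scores(theta, rows):
    """Each provider computes theta_block . x_block on its own columns."""
    return [sum(t * x for t, x in zip(theta, row)) for row in rows]


def residuals(scores_a, scores_b, labels):
    """Per-row factor (1/4) * theta.x - (1/2) * y of the approximate gradient."""
    return [0.25 * (sa + sb) - 0.5 * y for sa, sb, y in zip(scores_a, scores_b, labels)]


def gradient_block(rows, res):
    """Approximate gradient restricted to one provider's features."""
    n = len(rows)
    return [sum(r * row[k] for r, row in zip(res, rows)) / n for k in range(len(rows[0]))]


# Hypothetical toy data: rows are assumed to be already aligned across providers.
x_a = [[0.2, -0.1], [0.5, 0.3], [-0.4, 0.1]]  # A's features
x_b = [[0.7], [-0.2], [0.1]]                  # B's features
y = [1, -1, 1]                                # labels, held by one provider
theta_a, theta_b = [0.1, -0.3], [0.2]

res = residuals(partial_scores(theta_a, x_a), partial_scores(theta_b, x_b), y)
grad_a = gradient_block(x_a, res)
grad_b = gradient_block(x_b, res)
```

Concatenating `grad_a` and `grad_b` gives the same vector as computing the approximate gradient on the joined table, and every step uses only additions and multiplications.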

SLIDE 19

Taylor approximation*

Logistic loss (only used for the stopping criterion), and its gradient

* similar to [Aono et al. 16]
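
As a reference point, a second-order Taylor expansion of the logistic loss around zero can be written as follows. The notation is an assumption, not taken from the slide: labels $y_i \in \{-1, +1\}$, a linear model $\theta^\top x_i$, and an average over $n$ examples.

```latex
\ell(\theta)
  = \frac{1}{n}\sum_{i=1}^{n} \log\left(1 + e^{-y_i \theta^\top x_i}\right)
  \approx \frac{1}{n}\sum_{i=1}^{n}
    \left( \log 2 - \frac{1}{2}\, y_i\, \theta^\top x_i
           + \frac{1}{8}\, (\theta^\top x_i)^2 \right)

\nabla \ell(\theta)
  \approx \frac{1}{n}\sum_{i=1}^{n}
    \left( \frac{1}{4}\, \theta^\top x_i - \frac{1}{2}\, y_i \right) x_i
```

This follows from $\log(1 + e^{-z}) \approx \log 2 - z/2 + z^2/8$ with $z = y_i \theta^\top x_i$ and $y_i^2 = 1$. Both expressions need only additions and multiplications, and the squared term appears only in the loss.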

SLIDE 20

Logistic loss vs. its Taylor approximation

[Figure: the logistic loss plotted against its Taylor approximation.]

For a good approximation: scale features into a small interval and regularize!
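
The effect of the interval can be checked numerically. The sketch below compares $\log(1 + e^{-z})$ with its second-order expansion around zero; the helper `max_error` and the chosen radii are illustrative.

```python
import math


def logistic_loss(z):
    """Exact logistic loss for margin z = y * theta.x."""
    return math.log1p(math.exp(-z))


def taylor_loss(z):
    """Second-order Taylor expansion of the logistic loss around z = 0."""
    return math.log(2) - z / 2 + z * z / 8


def max_error(radius, steps=200):
    """Largest absolute gap between the two on [-radius, radius]."""
    points = [-radius + 2 * radius * k / steps for k in range(steps + 1)]
    return max(abs(logistic_loss(z) - taylor_loss(z)) for z in points)


for radius in (0.5, 1.0, 2.0, 4.0):
    print(f"|z| <= {radius}: max error {max_error(radius):.4f}")
```

The gap grows quickly with |z|, which is why keeping features in a small interval and regularizing the weights (so that margins stay small) matters for the quality of the approximation.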

SLIDE 21

Protocol example: how to compute a square?

The most complex operation in the learning protocol
… and we cannot compute squares of encrypted numbers with Paillier!

SLIDE 22

Protocol example: how to compute a square?

[Diagram: C: Coordinator, the private key holder, with A: Data provider and B: Data provider. Entities are matched via the permutation and mask at this step.]
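
A minimal sketch of one possible way to obtain an encrypted square with only additive operations. It assumes the value to square is a sum u_A + u_B with one term held by each data provider, and expands (u_A + u_B)² = u_A² + 2 u_A u_B + u_B². The message flow, the function names and the toy Paillier code are all assumptions; masking by the entity-resolution output and fixed-point encoding of real numbers are omitted.

```python
import math
import random

# Toy Paillier with an insecure key size; only the coordinator C holds LAM and MU.
P, Q = 2**31 - 1, 2**61 - 1
N = P * Q
N2 = N * N
LAM = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)
MU = pow(LAM, -1, N)


def encrypt(m):
    """Public-key operation: any party can encrypt an integer."""
    r = random.randrange(1, N)
    while math.gcd(r, N) != 1:
        r = random.randrange(1, N)
    return (1 + (m % N) * N) * pow(r, N, N2) % N2


def decrypt(c):
    """Private-key operation: coordinator only."""
    return (pow(c, LAM, N2) - 1) // N * MU % N


def provider_a(u_a):
    """A squares its own term in plaintext, then encrypts both values for B."""
    return encrypt(u_a), encrypt(u_a * u_a)


def provider_b(enc_u_a, enc_u_a_sq, u_b):
    """B combines A's ciphertexts with its own plaintext term."""
    cross = pow(enc_u_a, (2 * u_b) % N, N2)  # encryption of 2 * u_a * u_b
    own = encrypt(u_b * u_b)                 # encryption of u_b squared
    return enc_u_a_sq * cross * own % N2     # encryption of (u_a + u_b) squared


enc_square = provider_b(*provider_a(3), 4)
```

Neither provider sees the other's term in the clear: B works on ciphertexts it cannot decrypt, and only the key holder could open `enc_square`.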

SLIDE 23