SLIDE 1

Data Matching Research at the Australian National University

Peter Christen
Research School of Computer Science, ANU College of Engineering and Computer Science, The Australian National University
Contact: peter.christen@anu.edu.au
http://cs.anu.edu.au/people/Peter.Christen

February 2014 – p. 1/46

SLIDE 2

Outline

Background about me and the ANU
A short introduction to data matching and its challenges
Research projects in data matching at the ANU
  Scalable real-time entity resolution on dynamic databases
  Scalable privacy-preserving record linkage techniques
  Efficient matching of historical census data across time
Conclusions and research directions

SLIDE 3

Background - Short CV

Born and grew up in Basel, Switzerland
  Diploma in Computer Science, ETH Zürich in 1995
  PhD in Parallel Computing, University of Basel in 1999
Moved to Canberra / ANU in 1999
  Postdoctoral Researcher (funded by Swiss NSF) from 1999 to 2000
  Lecturer from 2001 to 2006
  Senior Lecturer from 2007 to 2012
  Associate Dean (Higher Degree Research) for Engineering and Computer Science, 2009 to 2011
  Associate Professor since 2013

SLIDE 4

Canberra, Australia

SLIDE 5

Research at the ANU (1)

Around 17,000 students, over 2,000 PhD students (around 100 in computer science)

SLIDE 6

Research at the ANU (2)

Over 1,600 academics (around 40 in computer science, including 14 full professors)

SLIDE 7

What is data matching?

The process of matching records that represent the same entity in one or more databases (patient, customer, business name, etc.)
Also known as record linkage, entity resolution, object identification, duplicate detection, identity uncertainty, merge-purge, etc.
Major challenge is that unique entity identifiers are often not available in the databases to be matched (or if available, they are not consistent)
E.g., which of these records represent the same person?
  Dr Smith, Peter    42 Miller Street   2602 O’Connor
  Pete Smith         42 Miller St       2600 Canberra A.C.T.
  P. Smithers        24 Mill Rd         2600 Canberra ACT
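To make the question concrete, the three example records can be compared with an approximate string comparison. The sketch below uses the ratio from Python's standard `difflib` module purely as a stand-in; it is an assumption for illustration, not the comparison function used in the work presented here.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a similarity in [0, 1] between two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# The three example records from the slide, each kept as one string.
records = [
    "Dr Smith, Peter 42 Miller Street 2602 O'Connor",
    "Pete Smith 42 Miller St 2600 Canberra A.C.T.",
    "P. Smithers 24 Mill Rd 2600 Canberra ACT",
]

# Compare every pair of records and print the approximate similarity.
for i in range(len(records)):
    for j in range(i + 1, len(records)):
        score = similarity(records[i], records[j])
        print(f"record {i + 1} vs record {j + 1}: {score:.2f}")
```

A whole-record comparison like this is crude; in practice records are split into attributes (name, street, postcode) that are compared separately.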

SLIDE 8

The data matching process

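[Figure: the data matching process. Database A and Database B each undergo data pre-processing, followed by indexing / searching, comparison, and classification into matches, non-matches, and potential matches; potential matches go to clerical review, and the outcome is evaluated.]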
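The steps of this process can be sketched end to end in a few lines. Everything specific here is an assumption made for illustration: the attribute names in `FIELDS`, the blocking key (first letter of the surname), the `difflib` similarity, and the two thresholds.

```python
from difflib import SequenceMatcher

FIELDS = ("given", "surname", "suburb")  # hypothetical attribute names


def preprocess(record):
    """Data pre-processing: lowercase and trim every attribute."""
    return {f: record.get(f, "").strip().lower() for f in FIELDS}


def block_key(record):
    """Indexing: a deliberately simple key, the first letter of the surname."""
    return record["surname"][:1]


def compare(a, b):
    """Comparison: average approximate similarity over all attributes."""
    sims = [SequenceMatcher(None, a[f], b[f]).ratio() for f in FIELDS]
    return sum(sims) / len(sims)


def classify(score, lower=0.6, upper=0.85):
    """Classification with two hypothetical thresholds."""
    if score >= upper:
        return "match"
    if score >= lower:
        return "potential match"  # would go to clerical review
    return "non-match"


def link(db_a, db_b):
    """Return (index in A, index in B, score, class) for all candidate pairs."""
    a_clean = [preprocess(r) for r in db_a]
    b_clean = [preprocess(r) for r in db_b]
    index = {}
    for j, rec in enumerate(b_clean):
        index.setdefault(block_key(rec), []).append(j)
    results = []
    for i, rec_a in enumerate(a_clean):
        for j in index.get(block_key(rec_a), []):
            score = compare(rec_a, b_clean[j])
            results.append((i, j, round(score, 2), classify(score)))
    return results
```

Only record pairs that share a blocking key are compared, which is what keeps the comparison step from touching every pair across the two databases.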

SLIDE 9

Applications of data matching

Remove duplicates in one data set (deduplication)
Merge new records into a larger master data set
Create patient or customer oriented statistics (for example for longitudinal studies)
Clean and enrich data for analysis and mining
Geocode matching (with reference address data)
Widespread use of data matching
  Immigration, taxation, social security, census
  Fraud, crime, and terrorism intelligence
  Business mailing lists, exchange of customer data
  Health and social science research

SLIDE 10

Data matching challenges

No unique entity identifiers are available (use approximate (string) comparison functions)
Real world data are dirty (typographical errors and variations, missing and out-of-date values, different coding schemes, etc.)
Scalability to very large databases (naïve comparison of all record pairs is quadratic; some form of blocking, indexing or filtering is needed)
No training data in many data matching applications (true match status not known)
Privacy and confidentiality (because personal information is commonly required for matching)
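The scalability point can be illustrated by counting candidate pairs with and without blocking. This is a minimal sketch; the five surnames and the first-letter blocking key are hypothetical choices made only for the example.

```python
from collections import defaultdict
from itertools import combinations


def all_pairs_count(n):
    """Number of record pairs when every record is compared with every other."""
    return n * (n - 1) // 2


def blocked_pairs(records, key):
    """Candidate pairs when only records sharing a blocking key are compared."""
    blocks = defaultdict(list)
    for rec_id, value in records.items():
        blocks[key(value)].append(rec_id)
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Hypothetical surnames; the blocking key is the first letter.
records = {"r1": "smith", "r2": "smyth", "r3": "miller", "r4": "millar", "r5": "jones"}
candidates = blocked_pairs(records, key=lambda value: value[0])

print(all_pairs_count(len(records)))  # naive comparison of all pairs
print(len(candidates))                # pairs left after blocking
```

The trade-off is that a true match whose two records fall into different blocks is never compared, so the blocking key has to tolerate the errors expected in the data.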

SLIDE 11

Types of data matching techniques

Deterministic matching
  Exact matching (if a unique identifier of high quality is available: precise, robust, stable over time)
  Examples: Social security or Medicare numbers
  Rule-based matching (complex to build and maintain)
Probabilistic record linkage (Fellegi and Sunter, 1969)
  Use available attributes for matching (often personal information, like names, addresses, dates of birth, etc.)
  Calculate matching weights for attributes
‘Computer science’ approaches (based on machine learning, data mining, database, or information retrieval techniques)
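As a small illustration of matching weights in the Fellegi and Sunter approach, each attribute gets an agreement weight log2(m/u) and a disagreement weight log2((1-m)/(1-u)), where m is the probability that the attribute agrees for a true match and u the probability that it agrees for a non-match. The m and u values below are made-up numbers for the sketch, not estimates from any real data.

```python
from math import log2


def agreement_weight(m, u):
    """Weight added when an attribute agrees: log2(m / u)."""
    return log2(m / u)


def disagreement_weight(m, u):
    """Weight added when an attribute disagrees: log2((1 - m) / (1 - u))."""
    return log2((1 - m) / (1 - u))


def match_weight(agreements, probabilities):
    """Sum the per-attribute weights for one record pair.

    agreements: {attribute: True if the values agree, else False}
    probabilities: {attribute: (m, u)}
    """
    total = 0.0
    for attribute, agrees in agreements.items():
        m, u = probabilities[attribute]
        if agrees:
            total += agreement_weight(m, u)
        else:
            total += disagreement_weight(m, u)
    return total


# Hypothetical m and u probabilities per attribute.
probabilities = {"surname": (0.95, 0.01), "postcode": (0.90, 0.05)}
pair = {"surname": True, "postcode": False}
print(round(match_weight(pair, probabilities), 2))
```

Record pairs with a high summed weight are classified as matches, pairs with a low weight as non-matches, and those in between are left for clerical review.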

SLIDE 12

Advanced classification techniques

View record pair classification as a multi-dimensional binary classification problem (use attribute similarities to classify record pairs as matches or non-matches)
Many machine learning techniques can be used
  Supervised: Decision trees, SVMs, neural networks, learnable string comparisons, active learning, etc.
  Unsupervised: Various clustering algorithms
Recently, collective classification techniques have been investigated (build graph of database and conduct overall classification, rather than each record pair independently)

SLIDE 13

Project 1

Scalable real-time entity resolution on dynamic databases

SLIDE 14

Scalable real-time entity resolution on dynamic databases

A Linkage Project funded by the Australian Research Council, Veda (credit bureau), and Funnelback (web and enterprise search)
Collaborators:
  Dr Huizhi (Elly) Liang (Post-doc, ANU)
  Ms Banda Ramadan (PhD student, ANU)
  Assoc Prof Peter Strazdins (ANU)
  Dr Ross Gayler (Veda)
  Prof David Hawking (Funnelback and ANU)

SLIDE 15

Motivation and objectives

Credit bureau requires matching in real-time of query records to a large database of entity records (credit enquiries)
Improve indexing to retrieve candidate records faster, therefore have more time for advanced classification (currently proprietary rules-based)
Objectives are to develop:
  Novel indexing techniques that allow for real-time matching of query records on dynamic databases
  Techniques that consider temporal data aspects
  Improved techniques for real-time classification of query records (to match with database records)

SLIDE 16

Dynamic similarity-aware indexing (1)

RecID  Given name  Double-Metaphone
r1     tony        tn
r2     cathrine    k0rn
r3     tony        tn
r4     kathryn     k0rn
r5     tonya       tn

RI: cathrine → r2; kathryn → r4; tony → r1, r3; tonya → r5
BI: k0rn → cathrine, kathryn; tn → tony, tonya
SI: cathrine → kathryn 0.7; kathryn → cathrine 0.7; tony → tonya 0.9; tonya → tony 0.9

RI: Record index, BI: Block index, SI: Similarity index

SLIDE 17

Dynamic similarity-aware indexing (2)

RecID  Given name  Double-Metaphone
r1     tony        tn
r2     cathrine    k0rn
r3     tony        tn
r4     kathryn     k0rn
r5     tonya       tn
r6     cathrine    k0rn

RI: cathrine → r2, r6; kathryn → r4; tony → r1, r3; tonya → r5
BI: k0rn → cathrine, kathryn; tn → tony, tonya
SI: cathrine → kathryn 0.7; kathryn → cathrine 0.7; tony → tonya 0.9; tonya → tony 0.9

RI: Record index, BI: Block index, SI: Similarity index

SLIDE 18

Dynamic similarity-aware indexing (3)

RecID  Given name  Double-Metaphone
r1     tony        tn
r2     cathrine    k0rn
r3     tony        tn
r4     kathryn     k0rn
r5     tonya       tn
r6     cathrine    k0rn
r7     linda       lnt

RI: cathrine → r2, r6; kathryn → r4; tony → r1, r3; tonya → r5; linda → r7
BI: k0rn → cathrine, kathryn; tn → tony, tonya; lnt → linda
SI: cathrine → kathryn 0.7; kathryn → cathrine 0.7; tony → tonya 0.9; tonya → tony 0.9; linda → (no similar values)

RI: Record index, BI: Block index, SI: Similarity index

SLIDE 19

Dynamic similarity-aware indexing (4)

RecID  Given name  Double-Metaphone
r1     tony        tn
r2     cathrine    k0rn
r3     tony        tn
r4     kathryn     k0rn
r5     tonya       tn
r6     cathrine    k0rn
r7     linda       lnt
r8     tonia       tn

RI: cathrine → r2, r6; kathryn → r4; tony → r1, r3; tonya → r5; linda → r7; tonia → r8
BI: k0rn → cathrine, kathryn; tn → tony, tonya, tonia; lnt → linda
SI: cathrine → kathryn 0.7; kathryn → cathrine 0.7; tony → tonya 0.9, tonia 0.8; tonya → tony 0.9, tonia 0.9; tonia → tony 0.8, tonya 0.9; linda → (no similar values)

RI: Record index, BI: Block index, SI: Similarity index
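The three indexes and the insertion steps walked through on these slides can be sketched as below. This is an illustrative reading of the figures, not the project's implementation: the encodings are copied from the slide table rather than computed by a Double-Metaphone library, and `difflib` is used as a stand-in similarity function, so the similarity values will differ from those shown.

```python
from difflib import SequenceMatcher

# Encodings copied from the slide table (assumption: a real system would
# compute them with a Double-Metaphone implementation).
ENCODING = {
    "tony": "tn", "tonya": "tn", "tonia": "tn",
    "cathrine": "k0rn", "kathryn": "k0rn", "linda": "lnt",
}


def similarity(a, b):
    """Stand-in approximate string similarity, rounded to two decimals."""
    return round(SequenceMatcher(None, a, b).ratio(), 2)


class SimilarityAwareIndex:
    def __init__(self):
        self.ri = {}  # record index: value -> list of record identifiers
        self.bi = {}  # block index: encoding -> list of distinct values
        self.si = {}  # similarity index: value -> {other value: similarity}

    def insert(self, rec_id, value):
        if value in self.ri:
            # Known value: only the record index changes.
            self.ri[value].append(rec_id)
            return
        self.ri[value] = [rec_id]
        block = self.bi.setdefault(ENCODING[value], [])
        self.si[value] = {}
        # New value: compute similarities once, against its block only.
        for other in block:
            sim = similarity(value, other)
            self.si[value][other] = sim
            self.si[other][value] = sim
        block.append(value)

    def query(self, value):
        """Return {record id: similarity} for a value already in the index."""
        matches = {rec_id: 1.0 for rec_id in self.ri.get(value, [])}
        for other, sim in self.si.get(value, {}).items():
            for rec_id in self.ri[other]:
                matches[rec_id] = sim
        return matches


index = SimilarityAwareIndex()
names = ["tony", "cathrine", "tony", "kathryn", "tonya", "cathrine", "linda", "tonia"]
for number, name in enumerate(names, start=1):
    index.insert(f"r{number}", name)
print(index.query("tonia"))
```

Because similarities are calculated at insertion time, a query needs only dictionary look-ups, which is the point of making the index similarity-aware.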

SLIDE 20

Dynamic similarity-aware indexing (5)

[Figure: insertion time for a single record (maximum, average, and minimum) in seconds, on a logarithmic axis, plotted against the record insertion number (up to about 2.5 million).]

On North Carolina voter database (around 2.4 million records)

SLIDE 21

Dynamic similarity-aware indexing (6)

[Figure: query time for a single record (maximum, average, and minimum) in seconds, on a logarithmic axis, plotted against the record insertion number (up to about 2.5 million).]

SLIDE 22

Project 2

Scalable privacy-preserving record linkage (PPRL)

SLIDE 23

Scalable privacy-preserving record linkage

A Discovery Project funded by the Australian Research Council
Collaborators:
  Ms Dinusha Vatsalan (PhD student, ANU)
  Assoc Prof Vassilios Verykios (Hellenic Open University)
  Mr Thilina Ranbaduge (PhD student, starting 2014)

SLIDE 24

Motivation and objectives

Privacy concerns in many applications where data are matched between organisations
Matched data can allow analysis not possible on individual databases (potentially revealing highly sensitive information)
Objectives are to develop:
  Scalable techniques to facilitate PPRL
  Techniques that allow PPRL on multiple databases
  Improved classification techniques for PPRL
  Methods to assess matching quality and completeness in a privacy-preserving framework

SLIDE 25

Privacy and data matching: An example scenario (1)

SLIDE 26

Privacy and data matching: An example scenario (2)

Preventing the outbreak of epidemics requires monitoring of occurrences of unusual patterns in symptoms (in real time!)
Data from many different sources will need to be collected (including travel and immigration records; doctors, emergency and hospital admissions; drug purchases in pharmacies; animal health data; etc.)
Privacy concerns arise if such data are stored and matched at a central location
Matched sensitive patient data and confidential data from healthcare organisations must be kept secure, while still allowing analysis

SLIDE 27

The PPRL process

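[Figure: the PPRL process. The same steps as the data matching process (data pre-processing of Database A and Database B, indexing / searching, comparison, classification into matches, non-matches, and potential matches, clerical review, evaluation), with the steps after pre-processing carried out on encoded data in a privacy-preserving context.]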

SLIDE 28

PPRL challenges and basic protocols

Main challenges in PPRL

  Allow for approximate matching of values (given real world data are often ‘dirty’)
  Have techniques that are not vulnerable to any kind of attack, and are scalable to matching large databases

Two basic types of protocols

  Two-party protocols: Only the two database owners who wish to link their data are involved
  Three-party protocols: Use a (trusted) third party (linkage unit) to conduct the linkage (this party will never see any unencoded values, but collusion is possible)

SLIDE 29

Basic protocol steps

[Figure: message flow between Alice and Bob (two parties) and between Alice, Carol, and Bob (three parties), with each arrow labelled by communication step (1), (2), or (3).]

Generally, three main communication steps
  1. Exchange of which attributes to use in a linkage, pre-processing methods, encoding functions, parameters, secret keys, etc.
  2. Exchange of the somehow encoded database records
  3. Exchange of records (or selected attribute values, or identifiers only) of records classified as matches

SLIDE 30

Bloom filter based PPRL

[Figure: the q-grams of Alice's value (pe, et, te, er) and of Bob's value (pe, et, te) hashed into two Bloom filters, with 7 and 5 bits set to 1 respectively.]

Proposed by Schnell et al. (BioMed Central, 2009)
Idea: Map q-grams into Bloom filters using hash functions only known to database owners, send Bloom filters to linkage unit to calculate Dice similarity
1-bits for string ‘peter’: 7, 1-bits for ‘pete’: 5, common 1-bits: 5, therefore simDice = 2×5/(7+5) = 10/12 = 0.83
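The idea can be sketched as follows. The filter length, the number of hash functions, and the keyed HMAC-SHA256 construction are assumptions chosen for illustration; they are not claimed to be the parameters of Schnell et al., so the bit counts for ‘peter’ and ‘pete’ will not necessarily be 7 and 5.

```python
import hashlib
import hmac


def qgrams(value, q=2):
    """Return the set of q-grams (substrings of length q) of a string."""
    return {value[i:i + q] for i in range(len(value) - q + 1)}


def bloom_filter(value, secret_key, length=30, num_hashes=2):
    """Return the set of bit positions set to 1 for the q-grams of value.

    secret_key is known only to the database owners; length and
    num_hashes are hypothetical parameters.
    """
    bits = set()
    for gram in qgrams(value):
        for k in range(num_hashes):
            message = f"{k}:{gram}".encode()
            digest = hmac.new(secret_key, message, hashlib.sha256).digest()
            bits.add(int.from_bytes(digest, "big") % length)
    return bits


def dice(bits_a, bits_b):
    """Dice similarity: 2 x common 1-bits / (1-bits in a + 1-bits in b)."""
    total = len(bits_a) + len(bits_b)
    return 2 * len(bits_a & bits_b) / total if total else 0.0


key = b"shared secret"  # hypothetical key agreed by Alice and Bob
alice = bloom_filter("peter", key)
bob = bloom_filter("pete", key)
print(round(dice(alice, bob), 2))
```

The linkage unit only needs `dice` and the two bit sets; it never sees the strings or the key, which is what allows approximate matching on encoded values.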

SLIDE 31

Two-party Bloom filter protocol (1)

Iteratively exchange certain bits from the Bloom filters between database owners
Calculate the minimum Dice similarity from the bits exchanged, and classify record pairs as matches, non-matches, and possible matches
Pairs classified as possible matches are taken to the next iteration (where more bits are exchanged)
The number of bits revealed in each iteration is calculated such that the risk of revealing more bits for non-matches is minimised
Minimum similarity of possible matches increases as more bits are revealed

SLIDE 32

Two-party Bloom filter protocol for PPRL (2)

[Figure: partially revealed Bloom filters of Alice's records (ra1, ra2, ra3) and Bob's records (rb1, rb2, rb3) over two iterations. In iteration 1 the similarity intervals shown are [0.22, 0.89] (possible match), [0.0, 0.75] (possible match), and [0.0, 0.28] (non-match); in iteration 2 they are [0.67, 0.89] (possible match) and [0.0, 0.25] (non-match).]

Each party knows how many 1-bits are set in total in a Bloom filter received from the other party
In iteration 1, for example, there is one unrevealed 1-bit in ra3, and so the maximum possible Dice similarity with rb3 is: max(sim(ra3, rb3)) = 2×1/(4+3) = 2/7 ≈ 0.28
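The bound calculation in this example can be written down as a small function. This is one plausible reading of the slide rather than the published protocol: the minimum assumes no further common 1-bits are found, the maximum assumes every unrevealed 1-bit that could still be common turns out to be common, and the threshold of 0.7 is a hypothetical value.

```python
def dice_bounds(common_revealed, unrevealed_ones_a, unrevealed_ones_b,
                total_ones_a, total_ones_b):
    """Return (minimum, maximum) possible Dice similarity of two Bloom filters.

    common_revealed: 1-bits found in common among the bits revealed so far
    unrevealed_ones_*: 1-bits of each filter not yet revealed
    total_ones_*: total number of 1-bits in each filter (known to both parties)
    """
    total = total_ones_a + total_ones_b
    if total == 0:
        return (0.0, 0.0)
    max_common = common_revealed + min(unrevealed_ones_a, unrevealed_ones_b)
    return (2 * common_revealed / total, 2 * max_common / total)


def classify(min_sim, max_sim, threshold=0.7):
    """Classify a pair from its similarity interval (hypothetical threshold)."""
    if min_sim >= threshold:
        return "match"
    if max_sim < threshold:
        return "non-match"
    return "possible match"  # more bits are exchanged in the next iteration


# The pair (ra3, rb3) from the slide: 4 and 3 one-bits in total, one unrevealed
# 1-bit in ra3, and no common 1-bits revealed so far. The number of unrevealed
# 1-bits in rb3 is assumed to be 1; any value of at least 1 gives the same bound.
low, high = dice_bounds(0, 1, 1, 4, 3)
print(low, round(high, 3), classify(low, high))
```

Here the maximum is 2/7 (about 0.286, shown truncated as 0.28 on the slide), which is below the threshold, so the pair can be dropped without revealing any more of its bits.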

SLIDE 33

Project 3

Efficient matching of historical census data across time

SLIDE 34

Project 3: Efficient matching of historical census data across time

Collaborators:

  Ms Zhichun (Sally) Fu (PhD student, ANU)
  Assoc Prof Mac Boot (Australian Demographic and Social Research Institute, ANU)

Motivation

  Shift in the social sciences from small-scale studies to using population databases
  New field of ‘population informatics’ to analyse the ‘social genome’
  Develop techniques to compile family trees over time from large data collections (population reconstruction)

SLIDE 35

Challenges with historical (census) data

Low literacy (recording errors and unknown exact values), no address or occupation standards
Large percentage of a population had one of just a few common names (‘John’ or ‘Mary’)
Households and families change over time
Immigration and emigration, birth and death
Scanning, OCR, and transcription errors

SLIDE 36

Group matching using household information

Conduct pair-wise matching of individual records
Calculate household similarities using Jaccard or weighted similarities (based on pair-wise matching)
Promising results on UK Census data from 1851 to 1901 (Rawtenstall, with around 17,000 to 31,000 records)
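A household-level Jaccard similarity built on pair-wise matches might look like the sketch below. The record identifiers and the set of matched pairs are hypothetical, and the function assumes the pair-wise matching is one-to-one; the weighted variant mentioned above is not shown.

```python
def household_jaccard(household_a, household_b, matches):
    """Jaccard similarity of two households based on pair-wise record matches.

    household_a, household_b: lists of record identifiers
    matches: set of (id_a, id_b) pairs classified as matches (assumed one-to-one)
    """
    members_a, members_b = set(household_a), set(household_b)
    linked = {(a, b) for a, b in matches if a in members_a and b in members_b}
    union = len(members_a) + len(members_b) - len(linked)
    return len(linked) / union if union else 0.0


# Hypothetical example: two of three household members were matched pair-wise.
household_1851 = ["r11", "r12", "r13"]
household_1861 = ["r21", "r22", "r23"]
pairwise_matches = {("r11", "r21"), ("r12", "r22")}
print(household_jaccard(household_1851, household_1861, pairwise_matches))
```

Matched members count once in both the intersection and the union, so two households with the same three people all matched would score 1.0.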

SLIDE 37

Graph-matching based on household structure

H1 − 1851
ID   FN     SN     Address   Age
r11  john   smith  goodshaw  31
r12  mary   smith  goodshaw  32
r13  anton  smith  goodshaw  1

H2 − 1861
ID   FN     SN     Address   Age
r21  jack   smith  goodshaw  39
r22  marie  smith  goodshaw  40
r23  toni   smith  goodshaw  10

[Figure: households H1 and H2 drawn as graphs, with records as nodes and age differences as edge labels, and attribute similarities (AttrSim) of 0.81, 0.42, and 0.56 between records of the two households.]

One graph per household, find best matching graphs using both record attribute and structural similarities
Edge attributes are information that does not change over time (like age differences)
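The structural part of this idea can be illustrated with age differences as edge attributes. The ages are those read from the slide's example households; the edge-similarity measure and the tolerance of two years are assumptions made for this sketch, not the graph matching method from the project.

```python
from itertools import combinations

# Ages as read from the example households on the slide.
H1_1851 = {"r11": 31, "r12": 32, "r13": 1}
H2_1861 = {"r21": 39, "r22": 40, "r23": 10}


def age_difference_edges(ages):
    """Return {(id_a, id_b): age_a - age_b} for every pair in a household."""
    return {(a, b): ages[a] - ages[b] for a, b in combinations(sorted(ages), 2)}


def edge_similarity(edges_a, edges_b, mapping, tolerance=2):
    """Fraction of edges whose age difference is preserved under a mapping.

    mapping: candidate assignment of records in household a to household b
    tolerance: hypothetical allowance (in years) for misreported ages
    """
    if not edges_a:
        return 0.0
    kept = 0
    for (a1, a2), diff in edges_a.items():
        b1, b2 = mapping[a1], mapping[a2]
        if (b1, b2) in edges_b:
            other = edges_b[(b1, b2)]
        else:
            other = -edges_b[(b2, b1)]  # same edge, opposite direction
        if abs(diff - other) <= tolerance:
            kept += 1
    return kept / len(edges_a)


edges_1851 = age_difference_edges(H1_1851)
edges_1861 = age_difference_edges(H2_1861)
candidate = {"r11": "r21", "r12": "r22", "r13": "r23"}
print(edge_similarity(edges_1851, edges_1861, candidate))
```

Age differences survive the ten years between censuses even though the ages themselves do not, which is why they are useful as edge attributes; a full method would combine this with the attribute similarities of the matched records.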

SLIDE 38

To make sure everybody is awake..

SLIDE 39

Conclusions and research directions

We address various challenges in data matching: real-time matching and dynamic data; temporal aspects; privacy; and population reconstruction
Challenges in data matching
  Improved classification for matching personal data
  Matching data from many sources
  Use of cloud computing platforms for large-scale data matching
  Frameworks for data matching that allow comparative experimental studies, and test data collections
  Develop practical PPRL techniques (assessing accuracy and completeness)

SLIDE 40

Recent publications (1)

Christen P and Gayler R: Adaptive temporal entity resolution on dynamic databases. PAKDD, Gold Coast, Australia, Springer LNCS vol. 7819, 2013.
Christen P and Vatsalan D: Flexible and extensible generation and corruption of personal data. ACM CIKM, San Francisco, 2013.
Christen P: Advanced record linkage methods and privacy aspects for population reconstruction. Workshop on Population Reconstruction, Amsterdam, 2014.
Fisher J, Wang Q, Wong P and Christen P: Data cleaning and matching of institutions in bibliographic databases. AusDM, Canberra, CRPIT vol. 146, 2013.
Fu Z, Zhou J, Christen P and Boot M: Multiple instance learning for group record linkage. PAKDD, Kuala Lumpur, Malaysia, Springer LNCS vol. 7301, 2012.
Fu Z, Boot M, Christen P and Zhou J: Automatic record linkage of individuals and households in historical census data. International Journal of Humanities and Arts Computing, 2014.
Fu Z, Christen P and Zhou J: A graph matching method for historical census household linkage. PAKDD, Tainan, Taiwan, 2014.

SLIDE 41

Recent publications (2)

Li S, Liang H and Ramadan B: Two stage similarity-aware indexing for large-scale real-time entity resolution. AusDM, Canberra, CRPIT vol. 146, 2013.
Liang H, Wang Y, Christen P and Gayler R: Noise-tolerant approximate blocking for dynamic real-time entity resolution. PAKDD, Tainan, Taiwan, 2014.
Ramadan B, Christen P, Liang H, Gayler R, and Hawking D: Dynamic similarity-aware inverted indexing for real-time entity resolution. PAKDD Workshops (DMApps), Gold Coast, Australia, Springer LNCS vol. 7867, 2013.
Tran KN, Vatsalan D and Christen P: GeCo: an online personal data generator and corruptor. ACM CIKM, San Francisco, 2013.
Vatsalan D and Christen P: Sorted nearest neighborhood clustering for efficient private blocking. PAKDD, Gold Coast, Australia, Springer LNCS vol. 7819, 2013.
Vatsalan D, Christen P and Verykios VS: A taxonomy of privacy-preserving record linkage techniques. Journal of Information Systems, 2013.
Vatsalan D, Christen P and Verykios VS: Efficient two-party private-blocking based on sorted nearest neighborhood clustering. ACM CIKM, San Francisco, 2013.

SLIDE 42

Advertisement: Book ‘Data Matching’

“The book is very well organized and exceptionally well written. Because of the depth, amount, and quality of the material that is covered, I would expect this book to be one of the standard references in future years.”
William E. Winkler, U.S. Bureau of the Census.

SLIDE 43

Collective classification example

[Figure: graph linking the authors (Dave White, Don White, Susan Grey, John Black, Joe Brown, Liz Pink), their affiliations (Intel, CMU, MIT), and Papers 1 to 6, with unknown weights w1 to w4 on the ambiguous links.]

Author records              Paper records
(A1, Dave White, Intel)     (P1, John Black / Don White)
(A2, Don White, CMU)        (P2, Sue Grey / D. White)
(A3, Susan Grey, MIT)       (P3, Dave White)
(A4, John Black, MIT)       (P4, Don White / Joe Brown)
(A5, Joe Brown, unknown)    (P5, Joe Brown / Liz Pink)
(A6, Liz Pink, unknown)     (P6, Liz Pink / D. White)

Adapted from Kalashnikov and Mehrotra, ACM TODS, 31(2), 2006

SLIDE 44

A definition of PPRL

Assume $O_1, \dots, O_d$ are the $d$ owners of their respective databases $D_1, \dots, D_d$
They wish to determine which of their records $r_1^i \in D_1$, $r_2^j \in D_2$, $\dots$, and $r_d^k \in D_d$ match according to a decision model $C(r_1^i, r_2^j, \dots, r_d^k)$ that classifies pairs (or groups) of records into one of the two classes $M$ of matches, and $U$ of non-matches
$O_1, \dots, O_d$ do not wish to reveal their actual records $r_1^i, \dots, r_d^k$ to any other party (they are, however, prepared to disclose to each other, or to an external party, the outcomes of the matching process, such as certain attribute values of record pairs in class $M$, to allow further analysis)

SLIDE 45

A taxonomy for PPRL

PPRL taxonomy
  Privacy aspects: Number of parties, Adversary model, Privacy techniques
  Linkage techniques: Indexing, Comparison, Classification
  Theoretical analysis: Scalability, Linkage quality, Privacy vulnerabilities
  Evaluation: Scalability, Linkage quality, Privacy
  Practical aspects: Implementation, Data sets, Application area

SLIDE 46

Secure multi-party computation

Compute a function across several parties, such that no party learns the information from the other parties, but all receive the final results [Yao 1982; Goldreich 1998/2002]

Simple example: Secure summation $s = \sum_i x_i$, with three parties holding $x_1 = 55$, $x_2 = 73$, and $x_3 = 42$
  Step 0: Party 1 chooses Z = 999
  Step 1: Party 1 computes Z + x1 = 1054
  Step 2: Party 2 computes (Z + x1) + x2 = 1127
  Step 3: Party 3 computes ((Z + x1) + x2) + x3 = 1169
  Step 4: Party 1 computes s = 1169 − Z = 170
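The secure summation steps can be simulated in a few lines. This sketch runs all parties in one process, so it only illustrates the arithmetic; the function name and the option to fix the mask are choices made for the example.

```python
import random


def secure_sum(private_values, mask=None):
    """Simulate secure summation around a ring of parties.

    private_values: one private number per party, in ring order
    mask: the random value Z chosen by the first party (random if omitted)
    Returns the sum and the list of messages passed between parties.
    """
    if mask is None:
        mask = random.randrange(1, 10**6)
    running = mask              # step 0: party 1 picks Z
    messages = []
    for value in private_values:
        running += value        # each party adds its own value
        messages.append(running)  # and passes only the masked total onwards
    return running - mask, messages  # party 1 removes Z at the end


total, messages = secure_sum([55, 73, 42], mask=999)
print(messages)
print(total)
```

Each party sees only a masked running total, so it cannot read off the individual values of the others, although two colluding neighbours could still recover the value of the party between them.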
