

SLIDE 1

Tracing Lineage Beyond Relational Operators

Mingwu Zhang1 Xiangyu Zhang1 Xiang Zhang2 Sunil Prabhakar1

1Computer Science 2Bindley Bioscience Center

Purdue University

SLIDE 2

Mingwu Zhang et al.

Introduction

Lineage (data provenance) is defined as a description of the origin of the data and the process by which the data is derived.

Lineage is

  • critical for determining data quality and reliability (e.g. biological data, data cleansing)
  • mandated by law (e.g. audit trails for the FDA)
  • essential for data dissemination and reproduction
  • informative (e.g. querying lineage)

Database support for tracing lineage is urgently needed.

SLIDE 3

Lineage tracing

[Figure: coarse-grained versus fine-grained lineage for a Program X. Coarse-grained lineage (e.g. workflow-level lineage) relates input files to output files; fine-grained lineage relates individual output records (R1, R2) to individual input tuples (T1, T2).]

SLIDE 4

Tracing fine-grained lineage

[Figure: input tuples (T1,1,2) and (T2,3,4) mapped to output records (R1,12) and (R2,34), with lineage classified as internal or external.]

State-of-the-art is limited to three special cases:

  • Relational operators (Cui et al.)
  • Reversible mathematical functions (Woodruff & Stonebraker)
  • Array Manipulation Language, AML (A. P. Marathe)

Currently there are no known techniques for general fine-grained lineage tracing.

SLIDE 5

Contributions

Enable fine-grained external lineage tracing for arbitrary programs without requiring

  • domain expertise
  • understanding the semantics of the operation
  • source code

Computed lineage is accurate (no false positives).

Lineage is derived directly from program execution.

SLIDE 6

Outline

  • Introduction
  • Lineage tracing
  • Case study
  • Conclusion

SLIDE 7

Our approach

Automatically trace lineage using only binary executables.

Monitor the data flow during program execution.

As the program executes, the lineage of each variable is traced.

Each binary instruction is instrumented to keep track of the data and control dependencies that it generates.

SLIDE 8

Tracing Lineage

  • A statement s data depends on another statement t if and only if a variable is defined at t and then used at s.
  • A statement s control depends on a predicate statement t if and only if the execution of s is the result of the branch outcome of t.
  • Definition: Given a program execution, the data lineage of variable v at an execution point of s_i, denoted DL(v@s_i), is the set of input items that are directly or indirectly involved in the computation of the value of v at s_i.

SLIDE 9

Tracing example

Consider the following program (the statement numbers are referenced in the text):

    1: y = 3;
    2: a1 = 3;
    3: b1 = 4;
    4: if (y > 2)
    5:     x = a1 + b1;

At statement 5, x depends on a1 and b1 (data dependence) and, through the predicate at statement 4, on y (control dependence). Thus the lineage of x is the union of the lineages of a1, b1 and y:

$$
\begin{aligned}
DL(x@5) &= \bigl(DL(a1@5) \cup DL(b1@5)\bigr) \cup DL(4) \\
        &= DL(a1@2) \cup DL(b1@3) \cup DL(y@4) \\
        &= DL(a1@2) \cup DL(b1@3) \cup DL(y@1)
\end{aligned}
$$
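
The five-statement example can be replayed by hand in Python. This is only an illustrative sketch, not the binary-level mechanism the slides describe: lineage sets are plain frozensets, the dictionary `DL` and the tags `y@1`, `a1@2`, `b1@3` are hypothetical names, and because the program reads no external input, the constants assigned at statements 1 to 3 are treated as the input items.

```python
# Sketch under stated assumptions: constants at statements 1-3 stand in for input items.
DL = {}  # variable name -> lineage set recorded at its latest definition

# 1: y = 3;   2: a1 = 3;   3: b1 = 4;
y = 3
DL["y"] = frozenset({"y@1"})
a1 = 3
DL["a1"] = frozenset({"a1@2"})
b1 = 4
DL["b1"] = frozenset({"b1@3"})

# 4: if (y > 2)   the predicate's lineage is the lineage of the variable it reads
predicate_lineage = DL["y"]
if y > 2:
    # 5: x = a1 + b1;   data dependence on a1 and b1, control dependence on statement 4
    x = a1 + b1
    DL["x"] = DL["a1"] | DL["b1"] | predicate_lineage

print(sorted(DL["x"]))  # ['a1@2', 'b1@3', 'y@1']
```

The printed set matches the last line of the derivation: DL(a1@2), DL(b1@3) and DL(y@1).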

SLIDE 10

Deriving lineage

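Let $s_i : dest = \; ?\, t_j : f(use_0, use_1, \ldots, use_n)$ be the executed statement instance $s_i$, which assigns a value to variable $dest$ by using variables $use_0, use_1, \ldots, use_n$, and which control depends on $t_j$. Let $DEF(x)$ be the latest statement instance that defines $x$; $DEF(x) = \phi$ when no executed statement has defined $x$.

$$
\begin{aligned}
DL(dest@s_i) &= \Bigl(\bigcup_{\forall x} DL(use_x@s_i)\Bigr) \cup DL(t_j) \\
&= DL(t_j) \cup \Bigl(\bigcup_{\forall x.\, DEF(use_x) \neq \phi} DL(use_x@DEF(use_x))\Bigr) \cup \Bigl(\bigcup_{\forall x.\, DEF(use_x) = \phi} \{use_x\}\Bigr)
\end{aligned}
$$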
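
The rule can be sketched at the level of a toy interpreter. The class `LineageTracer` and its methods are hypothetical names introduced for illustration; the real system works on binary instructions rather than named variables. One assumption goes beyond the slide: a predicate's lineage DL(t_j) is computed with the same union as an assignment, which is consistent with DL(4) = DL(y@4) in the tracing example.

```python
from typing import Dict, FrozenSet, Iterable

Lineage = FrozenSet[str]


class LineageTracer:
    """Toy tracer for the DL(dest@s_i) rule (hypothetical API, not the authors' code)."""

    def __init__(self) -> None:
        # Lineage recorded at DEF(x), the latest statement instance that defined x.
        self.at_def: Dict[str, Lineage] = {}

    def _union(self, uses: Iterable[str], control: Lineage) -> Lineage:
        result = set(control)                 # DL(t_j)
        for name in uses:
            if name in self.at_def:           # DEF(use_x) != phi
                result |= self.at_def[name]   # DL(use_x @ DEF(use_x))
            else:                             # DEF(use_x) = phi: an input item
                result.add(name)              # contributes {use_x}
        return frozenset(result)

    def predicate(self, uses: Iterable[str], control: Lineage = frozenset()) -> Lineage:
        # Assumption: a predicate's lineage uses the same union, but defines no variable.
        return self._union(uses, control)

    def assign(self, dest: str, uses: Iterable[str], control: Lineage = frozenset()) -> Lineage:
        self.at_def[dest] = self._union(uses, control)
        return self.at_def[dest]


tracer = LineageTracer()
tracer.assign("a", ["in1", "in2"])                    # a = in1 + in2
guard = tracer.predicate(["in3"])                     # if (in3 > 0)
lineage_b = tracer.assign("b", ["a"], control=guard)  # b = a * 2
print(sorted(lineage_b))  # ['in1', 'in2', 'in3']
```

Here `in1`, `in2` and `in3` are never assigned, so they fall into the DEF = φ case and enter the lineage as themselves.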
SLIDE 11

Instrumenting the code

We use Valgrind, an open-source dynamic binary instrumentation framework.

Lineage is set-valued data and is stored in a reduced ordered binary decision diagram (roBDD), a structure that is optimized for set operations.

Shadow memory (SM) stores the lineage sets associated with variables on the stack and heap; the shadow register file (SRF) stores the lineage sets for variables held in registers.

SLIDE 12

Instrumenting the code

Example instrumentation (original instruction → instrumented counterpart):

  • A = (int*) malloc(100) → SM(A) = malloc_in_shadow(100)
  • add (0x0884dc0), eax → mov SM[0x0884dc0] ∪ SRF(eax), SRF(eax)
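
The two shadow operations can be simulated in a few lines of Python. This is a sketch under several assumptions: the class `ShadowState` and its method names are hypothetical, lineage sets are frozensets rather than roBDDs, shadow memory is tracked per byte address, and `malloc_in_shadow` takes an explicit base address because the sketch has no real allocator.

```python
from typing import Dict, FrozenSet

Lineage = FrozenSet[str]
EMPTY: Lineage = frozenset()


class ShadowState:
    """Shadow memory (SM) and shadow register file (SRF) holding lineage sets."""

    def __init__(self) -> None:
        self.sm: Dict[int, Lineage] = {}   # address -> lineage set
        self.srf: Dict[str, Lineage] = {}  # register name -> lineage set

    def malloc_in_shadow(self, base: int, size: int) -> None:
        # Mirror an allocation: every shadowed byte starts with empty lineage.
        for addr in range(base, base + size):
            self.sm[addr] = EMPTY

    def read_input(self, addr: int, item: str) -> None:
        # An input item stored at addr is its own lineage.
        self.sm[addr] = frozenset({item})

    def add_mem_to_reg(self, addr: int, reg: str) -> None:
        # Shadow of `add (addr), reg`: SRF(reg) = SM[addr] U SRF(reg)
        self.srf[reg] = self.sm.get(addr, EMPTY) | self.srf.get(reg, EMPTY)


state = ShadowState()
state.malloc_in_shadow(0x0884dc0, 100)
state.read_input(0x0884dc0, "T1")      # memory cell holds a value read from input T1
state.srf["eax"] = frozenset({"T2"})   # eax already depends on input T2
state.add_mem_to_reg(0x0884dc0, "eax")
print(sorted(state.srf["eax"]))  # ['T1', 'T2']
```

After the shadowed add, the register's lineage is the union of its previous lineage and that of the memory operand, which is what the instrumented `mov` on the slide expresses.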

SLIDE 13

Architecture

We have developed a prototype system based on the Valgrind engine to trace fine-grained lineage.

[Figure: system architecture. Components: Valgrind kernel, instrumenter, runtime, roBDD and lineage repository. Labelled flows: x86 binary, code, instrumented code, event, input, output, bdd, lineage.]

SLIDE 14

Outline

  • Introduction
  • Lineage tracing
  • Case study
  • Conclusion

SLIDE 15

Case study: Cancer biomarker discovery

[Figure: biomarker discovery workflow. Labels: Cancer and Normal samples, Digestion, Isotope Labeling (Label1, Label2), 1:1 mix, LC-MS, De-Isotope. Spectra are plotted as intensity against m/z, with doublets marked.]

SLIDE 16

De-isotoping

[Figure: the challenge of de-isotoping. Mass spectrum (intensity against m/z, 400 to 1600) for Seq1: ATLNELVEYVSTNR, with peak groups labelled +1, +2 and +3.]

SLIDE 17

De-isotoping

[Figure: the full spectrum (intensity against m/z, 400 to 1600) with peak groups +1, +2 and +3, and three zoomed panels: +1 at m/z 1606 to 1614, +2 at m/z 803 to 808, and +3 at m/z 535 to 540.]

SLIDE 18

De-isotoping

[Figure: four panels of intensity against m/z over the range 803 to 808, for Seq1: ATLNELVEYVSTNR and Seq2: ITCAELR, with peak groups labelled +2 and +1.]

SLIDE 19

De-isotope algorithm

  • Complex, mostly heuristics
  • Numerous parameters picked by experts
  • Validity of results can be affected by the choice of parameters
  • Identifying a reverse function is impossible, even for experts
  • We use a state-of-the-art algorithm

SLIDE 20

De-isotope result

[Figure: two panels of intensity (0 to 2500) against m/z (912 to 924). Spectrum peaks are labelled α to λ; the de-isotoped results are labelled Pep 1 (L) (μ), Pep 1 (H) (ν), Pep 2 (L) (ξ) and Pep 2 (H) (ο).]

SLIDE 21

Fine-grained lineage

[Figure: fine-grained lineage of the de-isotope result. Two panels of intensity (0 to 2500) against m/z (912 to 924) showing the results Pep 1 (L) (μ), Pep 1 (H) (ν), Pep 2 (L) (ξ) and Pep 2 (H) (ο) together with the spectrum peaks α to λ.]

SLIDE 22

Case study

External fine-grained lineage is crucial for our biomarker discovery application.

Our technique enabled experts to

  • detect errors
  • assess the reliability of data
  • identify false positives
  • identify program limitations

SLIDE 23

Detect error

[Figure: panels of intensity against m/z over the range 1450 to 1457. Spectrum peaks are labelled α to ς in one set of panels and α' to ς' in another; the results σ, τ, υ and φ are annotated with charge +4, and σ', τ', υ' and φ' with charge +3.]

SLIDE 24

Identifying false positives

[Figure: two panels of intensity (0 to 3500) against m/z (969 to 977); peaks are labelled θ to λ in one panel and α to η in the other.]

SLIDE 25

Program limitations

[Figure: panels of intensity (0 to 100) against m/z (1051 to 1061). Spectrum peaks are labelled α to τ; the results υ, φ, χ, ψ and ω are annotated with charge +1 or +2, and the charge assignments differ between panels.]

SLIDE 26

Performance

Benchmark          Original (sec)   Valgrind (sec)   Tracing (sec)   Tracing/Valgrind
Auto-class         0.104            2.92             93.6            32.0
Image processing   0.8              5.15             166.3           32.3
lemur              0.85             12.1             302.8           25.0
rainbow            2.22             19.6             286.6           14.6
apriori            2.06             20.7             257.4           12.4
deisotope          9.2              85.8             646.6           7.5
cluto              1.67             42               1670            39.7
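
The table reports the overhead of tracing relative to Valgrind alone; the end-to-end slowdown relative to the original program follows from the same timings. The short sketch below recomputes both ratios. The dictionary `timings` and the column assignment (original, Valgrind, tracing, in seconds) are this sketch's reading of the table, not additional measurements.

```python
# Timings in seconds per benchmark: (original, valgrind, tracing).
# The column assignment is an assumption based on the table headers.
timings = {
    "Auto-class":       (0.104, 2.92, 93.6),
    "Image processing": (0.8,   5.15, 166.3),
    "lemur":            (0.85,  12.1, 302.8),
    "rainbow":          (2.22,  19.6, 286.6),
    "apriori":          (2.06,  20.7, 257.4),
    "deisotope":        (9.2,   85.8, 646.6),
    "cluto":            (1.67,  42.0, 1670.0),
}

# Overhead added by lineage tracing on top of Valgrind itself.
over_valgrind = {name: t / v for name, (_, v, t) in timings.items()}
# End-to-end slowdown relative to the uninstrumented program.
over_original = {name: t / o for name, (o, _, t) in timings.items()}

for name in timings:
    print(f"{name:18s} {over_valgrind[name]:7.1f}x over Valgrind "
          f"{over_original[name]:8.1f}x over original")
```

Separating the two ratios shows how much of the total cost comes from the instrumentation framework and how much from lineage tracing itself, which is the point the concluding slide makes about Valgrind's share of the overhead.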

SLIDE 27

Memory consumption

Benchmark          Orig (MB)   BDD (MB)   Tracing (MB)
Auto-class         1.8         1.9        2.2
Image processing   16.1        198        16
lemur              14          38.4       9.7
rainbow            6.8         50.8       15.3
apriori            4.1         0.19       3.6
deisotope          125         66.2       17.4
cluto              3           5.2        2.2

SLIDE 28

Outline

  • Introduction
  • Tracing lineage
  • Case study
  • Conclusion

SLIDE 29

Conclusion

This is the first work that can trace general, external fine-grained lineage.

Advantages

  • Fully automated
  • Does not require user input or domain knowledge
  • Does not need source code
  • The lineage is accurate
SLIDE 30

Conclusion

Disadvantage: performance

  • Tracing lineage incurs a slowdown, but one that is acceptable for applications in need of lineage.
  • Part of the overhead is caused by Valgrind; other industrial instrumentation engines such as dbt (Intel) and Vulcan (Microsoft) incur less overhead.