Tracing Lineage Beyond Relational Operators
Mingwu Zhang1 Xiangyu Zhang1 Xiang Zhang2 Sunil Prabhakar1
1Computer Science 2Bindley Bioscience Center
Purdue University
Mingwu Zhang et al.
Lineage (Data Provenance) is defined as
Lineage is
(e.g. biological data, data cleansing)
Database support for tracing lineage is urgently needed
[Figure: coarse-grained vs. fine-grained lineage. Coarse-grained (e.g., workflow-level lineage): Program X relates whole input files (File 1, File 2) to whole output files. Fine-grained: Program X relates individual input tuples (T1,1,2 and T2,3,4) to individual output records (R1, R2).]
[Figure: internal vs. external lineage. Program X reads input tuples T1,1,2 and T2,3,4 and produces output records R1,12 and R2,34.]
Currently no known techniques for general fine-grained lineage tracing
Existing techniques are limited to specific classes of operations:
Relational operators: Cui et al.
Reversible mathematical functions: Woodruff & Stonebraker
Array Manipulation Language (AML): A. P. Marathe
Enable fine-grained external lineage tracing
Computed lineage is accurate (no false
Lineage is derived directly from program
Introduction Lineage tracing Case study Conclusion
Automatically trace lineage using only
Monitor the data flow during program
As the program is executed, the lineage of
Each binary instruction is modified to keep
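At Statement 5, x depends upon a1 and b1 (data dependence) and upon the predicate at Statement 4 (control dependence).
Thus, the lineage of x is the union of the lineages of a1, b1, and Statement 4: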
$$
\begin{aligned}
DL(x@5) &= DL(a1@5) \cup DL(b1@5) \cup DL(4) \\
        &= DL(a1@2) \cup DL(b1@3) \cup DL(y@4) \\
        &= DL(a1@2) \cup DL(b1@3) \cup DL(y@1)
\end{aligned}
$$

1: y = 3;
2: a1 = 3;
3: b1 = 4;
4: if (y > 2)
5:     x = a1 + b1;
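Let s_i : dest = ? t_j : f(use_0, use_1, …, use_n) be the executed statement instance s_i, which assigns a value to variable dest by using variables use_0, use_1, …, use_n, where s_i is control dependent on t_j. Let DEF(x) be the latest statement instance that defines x. Then:

$$
DL(s_i) = \Big( \bigcup_{\forall x} DL(use_x @ s_i) \Big) \cup DL(t_j)
$$

$$
\bigcup_{\forall x} DL(use_x @ s_i) =
\bigcup_{\forall x.\, DEF(use_x) \neq \phi} DL(use_x @ DEF(use_x))
\;\cup
\bigcup_{\forall x.\, DEF(use_x) = \phi} \{ use_x \}
$$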
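The propagation rule can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the authors' binary-level implementation: the class `LineageTracer`, its methods, and the input names `in_y`, `in_a`, `in_b` are hypothetical. To make the result non-trivial, the example assumes that y, a1, and b1 are read from external inputs rather than assigned the constants shown on the slide.

```python
from typing import Dict, FrozenSet, Iterable, Optional

Lineage = FrozenSet[str]


class LineageTracer:
    """Toy bookkeeping for dynamic lineage (DL); names are illustrative only."""

    def __init__(self) -> None:
        # Lineage recorded at the latest definition of each variable,
        # i.e. DL(x @ DEF(x)).
        self.defs: Dict[str, Lineage] = {}

    def use(self, var: str) -> Lineage:
        # DEF(var) is empty: the value was never defined by the program,
        # so it is treated as an input item whose lineage is itself.
        if var not in self.defs:
            return frozenset({var})
        return self.defs[var]

    def execute(self, dest: Optional[str], uses: Iterable[str],
                control: Lineage = frozenset()) -> Lineage:
        # Lineage of the instance = union of the lineages of all uses,
        # plus the lineage of the controlling predicate instance t_j.
        lineage = frozenset(control)
        for u in uses:
            lineage |= self.use(u)
        if dest is not None:
            self.defs[dest] = lineage
        return lineage


def trace_example() -> Lineage:
    t = LineageTracer()
    t.execute("y", ["in_y"])        # 1: y  = <input in_y>   (assumed)
    t.execute("a1", ["in_a"])       # 2: a1 = <input in_a>   (assumed)
    t.execute("b1", ["in_b"])       # 3: b1 = <input in_b>   (assumed)
    pred = t.execute(None, ["y"])   # 4: if (y > 2)
    return t.execute("x", ["a1", "b1"], control=pred)  # 5: x = a1 + b1


if __name__ == "__main__":
    print(sorted(trace_example()))  # ['in_a', 'in_b', 'in_y']
```

The control-dependence term is passed in explicitly here; a real tracer would have to maintain the currently active predicate instance itself.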
We use an open-source instrumentation
The lineage is typically set data and is
Shadow memory (SM) stores the lineage
Example instrumentation
SM(A) = malloc_in_shadow(100)
mov SM[0x0884dc0] ∪ SRF(eax), SRF(eax)
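A rough way to picture the shadow state is as two maps that mirror memory and registers and hold lineage sets instead of values. The sketch below is an assumption-laden illustration, not the prototype's actual instrumentation: `ShadowState`, its method names, and the second and third addresses are hypothetical, and plain Python sets stand in for whatever set representation the real system uses.

```python
from typing import Dict, FrozenSet

Lineage = FrozenSet[str]
EMPTY: Lineage = frozenset()


class ShadowState:
    """Hypothetical shadow memory (SM) and shadow register file (SRF)."""

    def __init__(self) -> None:
        self.sm: Dict[int, Lineage] = {}    # address  -> lineage set
        self.srf: Dict[str, Lineage] = {}   # register -> lineage set

    def tag_input(self, addr: int, item: str) -> None:
        # An input value placed at addr has itself as lineage.
        self.sm[addr] = frozenset({item})

    def load(self, reg: str, addr: int) -> None:
        # Shadows a load such as `mov reg, [addr]`.
        self.srf[reg] = self.sm.get(addr, EMPTY)

    def add_from_memory(self, reg: str, addr: int) -> None:
        # Shadows `add reg, [addr]`: the result depends on both operands.
        self.srf[reg] = self.srf.get(reg, EMPTY) | self.sm.get(addr, EMPTY)

    def store(self, addr: int, reg: str) -> None:
        # Shadows a store such as `mov [addr], reg`.
        self.sm[addr] = self.srf.get(reg, EMPTY)


def shadow_example() -> Lineage:
    s = ShadowState()
    s.tag_input(0x0884dc0, "T1")        # address taken from the slide
    s.tag_input(0x0884dc4, "T2")        # hypothetical neighbouring address
    s.load("eax", 0x0884dc0)
    s.add_from_memory("eax", 0x0884dc4)
    s.store(0x0884dc8, "eax")           # hypothetical output address
    return s.sm[0x0884dc8]
```

After the three shadowed instructions, the lineage stored for the output address is the union of the two input tags.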
We have developed a prototype system
[Figure: prototype architecture. Components: Valgrind kernel, instrumenter, runtime, roBDD, and lineage repository. Labeled flows: x86 binary and input, code, instrumented code, events, BDDs, and lineage.]
Introduction Lineage tracing Case study Conclusion
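[Figure: sample preparation workflow. Cancer and normal samples each undergo digestion and isotope labeling (Label1, Label2), are combined in a 1:1 mix, and are analyzed by LC-MS followed by de-isotoping. In the m/z vs. intensity spectra, labeled peptide pairs appear as doublets.]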
Figure to show challenge of de-isotoping
Seq1: ATLNELVEYVSTNR
[Figure: mass spectrum of Seq1 (m/z 400 to 1600 vs. intensity) showing charge states +1, +2, and +3, with zoomed views of the +1 cluster (m/z 1606 to 1614), the +2 cluster (m/z 803 to 808), and the +3 cluster (m/z 535 to 540).]
Seq1: ATLNELVEYVSTNR
Seq2: ITCAELR
[Figure: spectra over m/z 803 to 808 vs. intensity for Seq1 and Seq2, with charge-state labels +2 and +1 falling in the same m/z range.]
Complex, mostly heuristics
Numerous parameters picked by
Validity of results can be affected by
Identifying a reverse function is
Using state-of-the-art algorithm.
[Figure: two spectra (m/z 912 to 924 vs. intensity, 0 to 2500) with peaks labeled α through λ and annotated as Pep 1 (L), Pep 1 (H), Pep 2 (L), and Pep 2 (H).]
External fine-grained lineage is crucial
Our technique enabled experts to:
  detect errors
  assess the reliability of data
  identify false positives
  identify program limitations
[Figures: additional spectra (m/z vs. intensity) over the m/z ranges 1450 to 1457, 969 to 977, and 1051 to 1061, with peaks labeled by Greek letters and annotated with assigned charge states (+1 to +4).]
Benchmark          Original (sec)  Valgrind (sec)  Tracing (sec)  Tracing/Valgrind
Auto-class         0.104           2.92            93.6           32.0
Image processing   0.8             5.15            166.3          32.3
lemur              0.85            12.1            302.8          25.0
rainbow            2.22            19.6            286.6          14.6
apriori            2.06            20.7            257.4          12.4
deisotope          9.2             85.8            646.6          7.5
cluto              1.67            42              1670           39.7
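The ratio column can be rechecked from the three timing columns. The short script below is a sanity check only; the dictionary layout and function names are illustrative choices, while the numbers are copied from the timing table. The recomputed Tracing/Valgrind ratios agree with the reported ones to within about 0.1, which is consistent with rounding of the published timings.

```python
# (original, valgrind, tracing) wall-clock seconds and the reported ratio,
# copied from the benchmark timing table.
TIMINGS = {
    "Auto-class":       (0.104, 2.92, 93.6, 32.0),
    "Image processing": (0.8, 5.15, 166.3, 32.3),
    "lemur":            (0.85, 12.1, 302.8, 25.0),
    "rainbow":          (2.22, 19.6, 286.6, 14.6),
    "apriori":          (2.06, 20.7, 257.4, 12.4),
    "deisotope":        (9.2, 85.8, 646.6, 7.5),
    "cluto":            (1.67, 42.0, 1670.0, 39.7),
}


def tracing_over_valgrind(name: str) -> float:
    """Slowdown of lineage tracing relative to running under plain Valgrind."""
    _, valgrind, tracing, _ = TIMINGS[name]
    return tracing / valgrind


def tracing_over_original(name: str) -> float:
    """End-to-end slowdown of lineage tracing relative to the native run."""
    original, _, tracing, _ = TIMINGS[name]
    return tracing / original


if __name__ == "__main__":
    for name, (_, _, _, reported) in TIMINGS.items():
        print(f"{name:18s} tracing/valgrind = {tracing_over_valgrind(name):6.2f} "
              f"(reported {reported}), tracing/original = "
              f"{tracing_over_original(name):8.1f}")
```

Separating the two ratios makes it visible how much of the end-to-end slowdown comes from Valgrind itself rather than from the lineage bookkeeping.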
Benchmark          Orig (MB)  BDD (MB)  Tracing (MB)
Auto-class         1.8        1.9       2.2
Image processing   16.1       198       16
lemur              14         38.4      9.7
rainbow            6.8        50.8      15.3
apriori            4.1        0.19      3.6
deisotope          125        66.2      17.4
cluto              3          5.2       2.2
Introduction Tracing lineage Case study Conclusion
This is the first work that can trace external
Advantages
knowledge
Disadvantage
Performance
Tracing lineage incurs a slowdown but
Part of the overhead is caused by