Cost-efficient Data Acquisition on Online Data Marketplaces for Correlation Analysis
VLDB’19
Yanying Li¹, Haipei Sun¹, Boxiang Dong², Hui (Wendy) Wang¹
¹ Stevens Institute of Technology, Hoboken, NJ
² Montclair State University, Montclair, NJ
(a) DS: Source instance owned by data shopper Adam

Age       Zipcode  Population
[35, 40]  10003    7,000
[20, 25]  01002    3,500
[55, 60]  07003    1,200
[35, 40]  07003    5,800
[35, 40]  07304    2,000

(b) Relevant instances on data marketplace

D1: Zipcode table (FD: Zipcode → State)

Zipcode  State
07003    NJ     (correct)
07304    NJ     (correct)
10001    NY     (correct)
10001    NJ     (wrong)

D2: Data and statistics of diseases by state

State       Disease       # of cases
MA          Flu           300
NJ          Flu           400
Florida     Lyme disease  130
California  Lyme disease  40
NJ          Lyme disease  200

D3: Insurance & disease data instance

Age       Address       Insurance         Disease
[35, 40]  10 North St.  UnitedHealthCare  Flu
[20, 25]  5 Main St.    MedLife           HIV
[35, 40]  25 South St.  UnitedHealthCare  Flu
Need: Find correlation between age groups and diseases in New Jersey
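The D1 table on this slide marks one of its rows as wrong under the functional dependency Zipcode → State. As a minimal sketch, assuming D1 is held as a list of (zipcode, state) pairs, the check below finds zipcodes that map to more than one state. The function name `fd_violations` is hypothetical and not taken from the paper.

```python
from collections import defaultdict

# D1 rows copied from the slide: (zipcode, state).
D1 = [("07003", "NJ"), ("07304", "NJ"), ("10001", "NY"), ("10001", "NJ")]

def fd_violations(rows):
    """Return {lhs: sorted rhs values} for each lhs value that maps to
    more than one rhs value, i.e. each violation of the FD lhs -> rhs."""
    seen = defaultdict(set)
    for lhs, rhs in rows:
        seen[lhs].add(rhs)
    return {lhs: sorted(vals) for lhs, vals in seen.items() if len(vals) > 1}

violations = fd_violations(D1)
print(violations)  # {'10001': ['NJ', 'NY']}
```

The check only shows that the two rows for 10001 conflict. Deciding which of them is the wrong one needs outside knowledge, which the slide supplies through its correct/wrong labels.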
Age       Zipcode  Population  Address       Insurance         Disease
[35, 40]  10003    7,000       10 North St.  UnitedHealthCare  Flu
[35, 40]  10003    7,000       25 South St.  UnitedHealthCare  Flu
[20, 25]  01002    3,500       5 Main St.    MedLife           HIV
[35, 40]  07003    5,800       10 North St.  UnitedHealthCare  Flu
[35, 40]  07003    5,800       10 North St.  UnitedHealthCare  Flu
[35, 40]  07304    2,000       25 South St.  UnitedHealthCare  Flu
[35, 40]  07304    2,000       25 South St.  UnitedHealthCare  Flu
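The table above has the columns of DS followed by those of D3, so it reads as a join of the two instances on their shared Age attribute. The sketch below assumes that reading (the slide does not state the join condition) and uses a hypothetical helper, `equi_join_on_first`, to show how joining on a coarse attribute such as Age multiplies rows.

```python
from collections import defaultdict

# DS and D3 rows copied from the earlier slide; Age is the first field.
DS = [
    ("[35, 40]", "10003", 7000),
    ("[20, 25]", "01002", 3500),
    ("[55, 60]", "07003", 1200),
    ("[35, 40]", "07003", 5800),
    ("[35, 40]", "07304", 2000),
]
D3 = [
    ("[35, 40]", "10 North St.", "UnitedHealthCare", "Flu"),
    ("[20, 25]", "5 Main St.", "MedLife", "HIV"),
    ("[35, 40]", "25 South St.", "UnitedHealthCare", "Flu"),
]

def equi_join_on_first(left, right):
    """Hash join on the first field; the key is kept once in the output."""
    index = defaultdict(list)
    for row in right:
        index[row[0]].append(row[1:])
    return [l + r for l in left for r in index.get(l[0], [])]

joined = equi_join_on_first(DS, D3)
for row in joined:
    print(row)
```

Under this assumption the join produces seven rows, the same count as the table: each of the three [35, 40] rows in DS pairs with both [35, 40] rows in D3, the [20, 25] row pairs once, and the [55, 60] row has no partner.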
1 Introduction
2 Related Work
3 Preliminaries
4 DANCE
5 Experiments
6 Conclusion
5 = 0.6
T
Ti∈S∪T
[Figure: DANCE framework, placed between the Data Shopper and the Data Marketplace. Offline phase: Request for Samples, Samples, Construction of Join Graph. Online phase: Source Instances and Correlation (AS, AT) from the data shopper, Data Acquisition on the Join Graph, Data Purchase Query, Purchased Data.]
… ⋈ S2)) = Q(D1 ⋈ D2)
… ⋈ S2(AS, AT)) = CORR_{D1 ⋈ D2}(AS, AT)
[Figure: Two-level join graph for D1 and D2. Attribute set level: attribute sets ABC, AB, AC, BC and BCDE, BCD, BCE, BDE, CDE, BC, BD, BE, CD, CE, DE, with edges labeled (BC, 0.5), (B, 0.45), (C, 0.6). Instance level: D1 and D2, with edges labeled (BC, 0.5), (B, 0.45), (C, 0.6).]
[Figure: Join graph over instances D1 to D9 with edges J12, J13, J16, J34, J35, J27, J46, J49, J56, J57, J59, J58, J89; the source attribute set and the target attribute set are marked.]
[Figure: Graph over instances D1 to D9 with edges s12, s14, s17, s29, s49, s79, s48, s28, s78; the source attribute set, the target attribute set, and the landmarks are marked.]
[Figure: Running time in seconds (log scale, 1 to 1x10^6) versus number of instances (5 to 8) for Heuristic, LP, and GP; three panels.]
Query  Source                    Target             Explanation
Q1     customer.account_balance  …                  link customers’ account with responsible clerks
Q2     nation.name               partsupp.availqty  link parts with the nation of their suppliers
Q3     …                         region.name        associate orders’ price with the origin region
[Figure: Correlation versus budget ratio (up to 0.15) for Heuristic, LP, and GP; three panels.]