Scaling out the Discovery of Inclusion Dependencies BTW 2015, Hamburg, Germany
Sebastian Kruse, Thorsten Papenbrock, Felix Naumann Research Assistant Hasso Plattner Institute, Potsdam, Germany
Inclusion Dependencies: Examples

Customers
ID  Name             Address
1   Tanja Jager      Marseiller Str. 12
2   Sandra Möller
3   Dennis Eberhart  Sonnenallee 19
4   Barbara Pabst
5   Thorsten Mauer   Güntzelstr. 90

Customer  Item      Quantity
3         CK-242-1  1
3         DF-098-7  1
3         KE-883-6  1
1         LM-437-2  2
5         PE-383-5  1

Customer ⊆ ID
Quantity ⊆ ID
Sebastian Kruse March 5, 2015 Scaling out the Discovery of INDs 2
http://geneontology.org/sites/default/files/public/diag-godb-er.jpg
http://www.ibm.com/developerworks/data/library/techarticle/dm-1109proteindatadb2purexml/pdb_scheme_large.jpg
■ Fabien De Marchi, Stéphan Lopes, and Jean-Marc Petit. Unary and n-ary inclusion dependency discovery in relational databases. Journal of Intelligent Information Systems, 32:53–73, 2009.
Value               Attributes
1                   ID, Customer, Quantity
Tanja Jager         Name
Marseiller Str. 12  Address
2                   ID, Quantity
Sandra Möller       Name
                    Address
…                   …

Intersection of the attribute sets that contain Quantity: ID, Quantity

Quantity ⊆ ID
Quantity ⊆ Quantity
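The value-to-attributes index and the intersection step above can be sketched in a few lines of Python. This is a minimal single-machine illustration, not the cited authors' implementation; the function name `unary_inds` and the dictionary layout are assumptions made for this example, and the data is the abbreviated example table used later in the deck.

```python
from collections import defaultdict

# Abbreviated example data, one value list per attribute (all values as strings).
COLUMNS = {
    "ID": ["1", "2", "3", "4", "5"],
    "Name": ["T.J.", "S.M.", "D.E.", "B.P.", "T.M."],
    "Addr": ["M.12", "F.63", "S.19", "Z.76", "G.90"],
    "Cus": ["3", "3", "3", "1", "5"],
    "Item": ["CK", "DF", "KE", "LM", "PE"],
    "Qty": ["1", "1", "1", "2", "1"],
}


def unary_inds(columns):
    """Return all non-trivial unary INDs as (dependent, referenced) pairs."""
    # Inverted index: value -> set of attributes in which the value occurs.
    index = defaultdict(set)
    for attribute, values in columns.items():
        for value in values:
            index[value].add(attribute)

    # For every attribute, intersect the attribute sets of all its values.
    referenced = {}
    for attributes in index.values():
        for attribute in attributes:
            if attribute in referenced:
                referenced[attribute] &= attributes
            else:
                referenced[attribute] = set(attributes)

    # Drop trivial INDs such as Quantity ⊆ Quantity.
    return {
        (dep, ref)
        for dep, refs in referenced.items()
        for ref in refs
        if ref != dep
    }


print(sorted(unary_inds(COLUMNS)))
```

On the example data, the only attribute sets that survive the intersection are those of Cus and Qty, each of which still contains ID.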
■ Jana Bauckmann, Ulf Leser, and Felix Naumann. Efficiently Computing Inclusion Dependencies for Schema Discovery. In ICDE Workshops, 2006.
Sorted distinct values per attribute:
ID:        1, 2, 3, 4, 5
Name:      Barbara Pabst, Dennis Eberhart, Sandra Möller, Tanja Jager, Thorsten Mauer
Customer:  1, 3, 5
Item:      CK-242-1, DF-098-7, KE-883-6, LM-437-2, PE-383-5
Quantity:  1, 2
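Sorted value lists make an inclusion test a single merge pass. The sketch below checks one candidate pair at a time, which is a simplification chosen for illustration: it does not reproduce the cited algorithm, which processes all sorted lists together. The function name `is_included` is hypothetical.

```python
def is_included(dependent_sorted, referenced_sorted):
    """Check dependent ⊆ referenced for two ascending-sorted value lists."""
    j = 0
    for value in dependent_sorted:
        # Advance in the referenced list until we reach or pass the value.
        while j < len(referenced_sorted) and referenced_sorted[j] < value:
            j += 1
        # The value is missing if the list is exhausted or a larger value follows.
        if j == len(referenced_sorted) or referenced_sorted[j] != value:
            return False
    return True


ids = ["1", "2", "3", "4", "5"]
customers = ["1", "3", "5"]
quantities = ["1", "2"]

print(is_included(customers, ids))   # Customer ⊆ ID
print(is_included(quantities, ids))  # Quantity ⊆ ID
print(is_included(ids, customers))   # ID ⊆ Customer does not hold
```

Each list is read once from front to back, so the check never has to hold more than the current position of each list in memory.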
Input Data

ID  Name  Addr
1   T.J.  M.12
2   S.M.  F.63
3   D.E.  S.19
4   B.P.  Z.76
5   T.M.  G.90

Cus  Item  Qty
3    CK    1
3    DF    1
3    KE    1
1    LM    2
5    PE    1

Step 1: Calculate full outer join of all attributes

Full Outer Join

ID  Name  Addr  Cus  Item  Qty
1               1          1
2                          2
3               3
4
5               5
    T.J.
    S.M.
…   …     …     …    …     …

Step 2: Extract inclusion dependencies

Inclusion Dependencies

Quantity ⊆ ID
Customer ⊆ ID
Input data, split into two partitions

Partition 1

ID  Name  Addr
1   T.J.  M.12
2   S.M.  F.63
3   D.E.  S.19

Cus  Item  Qty
3    CK    1
3    DF    1
3    KE    1

Partition 2

ID  Name  Addr
4   B.P.  Z.76
5   T.M.  G.90

Cus  Item  Qty
1    LM    2
5    PE    1
Split tuples into cells (value → attribute)

Partition 1: 1 → ID; 2 → ID; 3 → ID; T.J. → Name; S.M. → Name; D.E. → Name; M.12 → Addr; F.63 → Addr; S.19 → Addr; 3 → Cus; 3 → Cus; 3 → Cus; CK → Item; DF → Item; KE → Item; 1 → Qty; 1 → Qty; 1 → Qty
Partition 2: 4 → ID; 5 → ID; B.P. → Name; T.M. → Name; Z.76 → Addr; G.90 → Addr; 1 → Cus; 5 → Cus; LM → Item; PE → Item; 2 → Qty; 1 → Qty

Cells with equal values are merged locally by uniting their attributes

Partition 1: 1 → ID; 2 → ID; 3 → ID; T.J. → Name; S.M. → Name; D.E. → Name; M.12 → Addr; F.63 → Addr; S.19 → Addr; 3 → Cus; CK → Item; DF → Item; KE → Item; 1 → Qty
Partition 2: 4 → ID; B.P. → Name; T.M. → Name; Z.76 → Addr; G.90 → Addr; LM → Item; PE → Item; 1 → Cus, Qty; 5 → Cus, ID; 2 → Qty

Cells merged across partitions: 1 → ID, Cus, Qty; 2 → ID, Qty; 3 → ID, Cus
Attribute sets, one per distinct value (values dropped):
ID, Cus, Qty | ID, Qty | Cus, ID | Cus, ID | ID | Name (5 times) | Addr (5 times) | Item (5 times)

Attribute sets with duplicates partly removed:
ID, Cus, Qty | ID, Cus | Item | Name | Addr | ID, Cus | Name | ID, Qty | Item | Addr | ID | Name | Item
IND candidates (dependent attribute: referenced attributes):
ID: Cus, Qty | Cus: ID, Qty | Qty: ID, Cus | Addr: Ø | Name: Ø | ID: Ø | Item: Ø | ID: Qty | Qty: ID | ID: Cus | Cus: ID | Name: Ø | Item: Ø | ID: Cus | Cus: ID | Name: Ø | Item: Ø | Addr: Ø | ID: Ø

IND candidates, partially aggregated:
Cus: ID, Qty | Qty: ID, Cus | Addr: Ø | Name: Ø | ID: Ø | Item: Ø | Qty: ID | Cus: ID | Name: Ø | Item: Ø | ID: Cus | Cus: ID | Name: Ø | Item: Ø | Addr: Ø | ID: Ø
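IND candidates after intersecting the referenced attributes per dependent attribute:
Qty: ID | Cus: ID | Item: Ø | Addr: Ø | ID: Ø | Name: Ø

Quantity ⊆ ID
Customer ⊆ ID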
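The walkthrough above can be replayed with a small simulation. This is a sketch in plain Python, not the Apache Flink program used in the talk: the two partitions are ordinary lists, and all function names (`split_into_cells`, `union_by_value`, `split_into_candidates`, `intersect_by_dependent`, `discover_inds`) are hypothetical names chosen for this example.

```python
from collections import defaultdict

# Two partitions of the abbreviated example; each table is (header, rows).
PARTITION_1 = [
    (("ID", "Name", "Addr"),
     [("1", "T.J.", "M.12"), ("2", "S.M.", "F.63"), ("3", "D.E.", "S.19")]),
    (("Cus", "Item", "Qty"),
     [("3", "CK", "1"), ("3", "DF", "1"), ("3", "KE", "1")]),
]
PARTITION_2 = [
    (("ID", "Name", "Addr"), [("4", "B.P.", "Z.76"), ("5", "T.M.", "G.90")]),
    (("Cus", "Item", "Qty"), [("1", "LM", "2"), ("5", "PE", "1")]),
]


def split_into_cells(partition):
    """Emit one (value, {attribute}) cell per field of every tuple."""
    for header, rows in partition:
        for row in rows:
            for attribute, value in zip(header, row):
                yield value, frozenset([attribute])


def union_by_value(cells):
    """Group cells by value and unite their attribute sets."""
    groups = defaultdict(frozenset)
    for value, attributes in cells:
        groups[value] = groups[value] | attributes
    return dict(groups)


def split_into_candidates(attribute_sets):
    """Emit (dependent, referenced attributes) for each attribute of each set."""
    for attributes in attribute_sets:
        for dependent in attributes:
            yield dependent, attributes - {dependent}


def intersect_by_dependent(candidates):
    """Group candidates by dependent attribute and intersect the referenced sets."""
    groups = {}
    for dependent, referenced in candidates:
        groups[dependent] = groups.get(dependent, referenced) & referenced
    return groups


def discover_inds(partitions):
    # Local pre-aggregation per partition, then the merge across partitions.
    local = [union_by_value(split_into_cells(p)) for p in partitions]
    merged = union_by_value(cell for part in local for cell in part.items())
    # Values are no longer needed; duplicate attribute sets collapse in the set.
    candidates = split_into_candidates(set(merged.values()))
    groups = intersect_by_dependent(candidates)
    return {(dep, ref) for dep, refs in groups.items() for ref in refs}


print(sorted(discover_inds([PARTITION_1, PARTITION_2])))
```

Pre-aggregating within a partition before the merge mirrors the locally merged cells in the walkthrough (for example, value 1 becoming Cus, Qty on the second partition), so fewer cells have to be exchanged between partitions.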
■ Inclusion dependencies on combinations of columns (aka n-ary INDs)
  □ Adaptation: create cells for combinations of values
  □ Powerful in combination with an apriori-like procedure
■ Partial inclusion dependencies
  □ Adaptation: aggregate IND candidates with multiset union instead of intersection
  □ Compare the result with the number of distinct values of the dependent column
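The adaptation for partial inclusion dependencies can be illustrated as follows. This is a minimal sketch under stated assumptions: the input is one attribute set per distinct value (not deduplicated), and the function name `partial_inds`, the `threshold` parameter, and the idea of reporting the covered share as a ratio are choices made for this example rather than details given on the slide.

```python
from collections import Counter, defaultdict


def partial_inds(attribute_sets, threshold):
    """Return {(dependent, referenced): share of dependent values covered}."""
    overlap = defaultdict(Counter)  # dependent -> multiset of referenced attributes
    distinct = Counter()            # dependent -> number of distinct values
    for attributes in attribute_sets:  # exactly one set per distinct value
        for dependent in attributes:
            distinct[dependent] += 1
            # Multiset union instead of intersection: count co-occurrences.
            overlap[dependent].update(attributes - {dependent})

    result = {}
    for dependent, counts in overlap.items():
        for referenced, shared in counts.items():
            share = shared / distinct[dependent]
            if share >= threshold:
                result[(dependent, referenced)] = share
    return result


# Attribute sets of the values 1 to 5 from the example.
SETS = [
    frozenset({"ID", "Cus", "Qty"}),  # value 1
    frozenset({"ID", "Qty"}),         # value 2
    frozenset({"ID", "Cus"}),         # value 3
    frozenset({"ID"}),                # value 4
    frozenset({"ID", "Cus"}),         # value 5
]

print(partial_inds(SETS, threshold=0.55))
```

With this threshold the exact INDs Cus ⊆ ID and Qty ⊆ ID are reported with a share of 1.0, and ID ⊆ Cus appears as a partial IND because three of the five distinct ID values also occur in Cus.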
■ Cluster setup
  □ 1 master node (Intel Xeon @ 2x2.67 GHz, 8 GiB RAM)
  □ 10 worker nodes (Intel Core 2 Duo @ 2x2.6 GHz, 8 GiB RAM)
  □ Apache HDFS 2.2, Apache Flink 0.6.2
■ Single node (for SPIDER)
  □ Intel Xeon @ 8x2 GHz, 128 GiB RAM, RAID-1
■ Datasets
  □ Relational datasets from different domains
  □ 16 KB to 44.9 GB
[Figure: runtime [s] of SINDY and SPIDER (log scale) and the resulting speed up]
[Figure: runtime [s] by logical workers/physical workers (1/1 to 10/10, and 20/10). Datasets: MB-core (5.8 GB), TPC-H (1.4 GB), CATH (907 MB), LOD+ (825 MB), BIOSQLSP (567 MB), WIKIPEDIA (539 MB), CENSUS (111 MB), SCOP (15 MB), COMA (16 KB)]
■ Presented a new distributed IND discovery algorithm
  □ Applicable to unary, n-ary, and partial inclusion dependencies
  □ Consists of a full outer join calculation and an extraction phase
  □ Scales well on large datasets
■ Open questions
  □ How can one continuously maintain inclusion dependencies?
  □ How can the algorithm be applied to similar problems, e.g., RDF data?
Full outer join
  Map: split tuples into cells
  Combine, Reduce: group by value, union attributes

IND extraction
  Map: split attributes into IND candidates
  Combine, Reduce: group by dependent attribute, intersect referenced attributes
[Figure: runtime [s] by share of rows, datasets MB and PDB]
[Figure: runtime [s] by # columns, datasets MB and PDB]
[Figure: runtime [ms] and #INDs by #columns]
[Figure: runtime [s] per dataset (SCOP, COMA, CENSUS, CATH, BIOSQLSP, TPC-H), broken down into n=1 to n=6 and remainder]