
SLIDE 1

Scaling out the Discovery of Inclusion Dependencies BTW 2015, Hamburg, Germany

Sebastian Kruse, Thorsten Papenbrock, Felix Naumann Research Assistant Hasso Plattner Institute, Potsdam, Germany

SLIDE 2

Inclusion Dependencies

Examples

Customers

ID  Name             Address
1   Tanja Jager      Marseiller Str. 12
2   Sandra Möller    Flughafenstr. 63
3   Dennis Eberhart  Sonnenallee 19
4   Barbara Pabst    Ziegelstr. 76
5   Thorsten Mauer   Güntzelstr. 90

Orders

Customer  Item      Quantity
3         CK-242-1  1
3         DF-098-7  1
3         KE-883-6  1
1         LM-437-2  2
5         PE-383-5  1

Customer ⊆ ID
Quantity ⊆ ID

Sebastian Kruse March 5, 2015 Scaling out the Discovery of INDs 2
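The two inclusion dependencies on this slide can be checked directly with set containment. The following is only a minimal sketch: the column-wise dictionaries and the helper name `holds` are assumptions made for illustration and are not part of the talk.

```python
# Example tables from the slide, stored column-wise (illustrative layout).
customers = {
    "ID": [1, 2, 3, 4, 5],
    "Name": ["Tanja Jager", "Sandra Möller", "Dennis Eberhart",
             "Barbara Pabst", "Thorsten Mauer"],
}
orders = {
    "Customer": [3, 3, 3, 1, 5],
    "Item": ["CK-242-1", "DF-098-7", "KE-883-6", "LM-437-2", "PE-383-5"],
    "Quantity": [1, 1, 1, 2, 1],
}


def holds(dependent, referenced):
    """Return True if every value of `dependent` also occurs in `referenced`."""
    return set(dependent) <= set(referenced)


customer_in_id = holds(orders["Customer"], customers["ID"])
quantity_in_id = holds(orders["Quantity"], customers["ID"])
id_in_customer = holds(customers["ID"], orders["Customer"])
print(customer_in_id, quantity_in_id, id_in_customer)  # True True False
```

Checking every pair of columns this way needs a quadratic number of comparisons, which is what motivates the discovery algorithms discussed in the rest of the talk.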

SLIDE 3

http://geneontology.org/sites/default/files/public/diag-godb-er.jpg

Inclusion Dependencies

Examples


SLIDE 4

http://www.ibm.com/developerworks/data/library/techarticle/dm-1109proteindatadb2purexml/pdb_scheme_large.jpg

Inclusion Dependencies

Example


SLIDE 5

Scaling Out the Discovery of Inclusion Dependencies

Agenda

1. Discovering Inclusion Dependencies
2. Related Work
3. SINDY: A distributed discovery algorithm
4. Evaluation
5. Conclusions

SLIDE 7

Related Work

MIND

■ Fabien De Marchi, Stéphan Lopes, and Jean-Marc Petit. Unary and n-ary inclusion dependency discovery in relational databases. Journal of Intelligent Information Systems, 32:53–73, 2009.

Input data: the Customers and Orders example tables from Slide 2.

SLIDE 8

Related Work

MIND

Value               Attributes
1                   ID, Customer, Quantity
Tanja Jager         Name
Marseiller Str. 12  Address
2                   ID, Quantity
Sandra Möller       Name
Flughafenstr. 63    Address
…                   …

Quantity ⊆ ?

Intersection: ID, Quantity

Quantity ⊆ ID
Quantity ⊆ Quantity
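The inverted index and the intersection step shown on this slide can be sketched as follows. This is a hedged illustration in the spirit of the slide, not the MIND implementation; the function names `build_inverted_index` and `referenced_attributes` are hypothetical.

```python
from collections import defaultdict

# Example data from the slides, stored column-wise (illustrative layout).
columns = {
    "ID": [1, 2, 3, 4, 5],
    "Name": ["Tanja Jager", "Sandra Möller", "Dennis Eberhart",
             "Barbara Pabst", "Thorsten Mauer"],
    "Address": ["Marseiller Str. 12", "Flughafenstr. 63", "Sonnenallee 19",
                "Ziegelstr. 76", "Güntzelstr. 90"],
    "Customer": [3, 3, 3, 1, 5],
    "Item": ["CK-242-1", "DF-098-7", "KE-883-6", "LM-437-2", "PE-383-5"],
    "Quantity": [1, 1, 1, 2, 1],
}


def build_inverted_index(columns):
    """Map every value to the set of attributes in which it occurs."""
    index = defaultdict(set)
    for attribute, values in columns.items():
        for value in values:
            index[value].add(attribute)
    return index


def referenced_attributes(index, dependent):
    """Intersect the attribute sets of all values that occur in `dependent`."""
    result = None
    for attributes in index.values():
        if dependent in attributes:
            result = set(attributes) if result is None else result & attributes
    return result or set()


index = build_inverted_index(columns)
quantity_refs = referenced_attributes(index, "Quantity")
print(sorted(quantity_refs))  # ['ID', 'Quantity']
```

The result contains the trivial dependency Quantity ⊆ Quantity as well, exactly as on the slide; a caller would normally drop the dependent attribute from its own result.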

SLIDE 9

Related Work

SPIDER

■ Jana Bauckmann, Ulf Leser, and Felix Naumann. Efficiently Computing Inclusion Dependencies for Schema Discovery. In ICDE Workshops, 2006.

Input data: the Customers and Orders example tables from Slide 2.

SLIDE 10

Related Work

SPIDER

Sorted distinct values per attribute:

ID:        1, 2, 3, 4, 5
Name:      Barbara Pabst, Dennis Eberhart, Sandra Möller, Tanja Jager, Thorsten Mauer
Customer:  1, 3, 5
Item:      CK-242-1, DF-098-7, KE-883-6, LM-437-2, PE-383-5
Quantity:  1, 2
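One way to read the sorted value lists is as input to a simultaneous merge: repeatedly take the smallest remaining value, note which attributes contain it, and shrink the candidate references of those attributes accordingly. The sketch below follows that idea under simplifying assumptions (all values held in memory as strings, a heap standing in for the per-attribute cursors); the name `spider_like` is hypothetical and this is not the original SPIDER code.

```python
import heapq

# Sorted distinct values per attribute, as strings so they are comparable.
sorted_columns = {
    "ID": ["1", "2", "3", "4", "5"],
    "Name": ["Barbara Pabst", "Dennis Eberhart", "Sandra Möller",
             "Tanja Jager", "Thorsten Mauer"],
    "Customer": ["1", "3", "5"],
    "Item": ["CK-242-1", "DF-098-7", "KE-883-6", "LM-437-2", "PE-383-5"],
    "Quantity": ["1", "2"],
}


def spider_like(sorted_columns):
    """Merge all sorted value lists at once and prune IND candidates."""
    # Initially every other attribute is a candidate reference.
    refs = {a: set(sorted_columns) - {a} for a in sorted_columns}
    # Heap entries: (current value, attribute, position in its value list).
    heap = [(values[0], a, 0) for a, values in sorted_columns.items() if values]
    heapq.heapify(heap)
    while heap:
        smallest = heap[0][0]
        group = set()
        # Collect all attributes whose current value equals the smallest one.
        while heap and heap[0][0] == smallest:
            _, attribute, pos = heapq.heappop(heap)
            group.add(attribute)
            values = sorted_columns[attribute]
            if pos + 1 < len(values):
                heapq.heappush(heap, (values[pos + 1], attribute, pos + 1))
        # An attribute containing this value can only reference the group.
        for attribute in group:
            refs[attribute] &= group
    return refs


refs = spider_like(sorted_columns)
print(refs["Customer"], refs["Quantity"], refs["ID"])  # {'ID'} {'ID'} set()
```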

SLIDE 11

Related Work

Common procedure

Step 1: Calculate full outer join of all attributes
Step 2: Extract inclusion dependencies

Input Data

ID  Name  Addr
1   T.J.  M.12
2   S.M.  F.63
3   D.E.  S.19
4   B.P.  Z.76
5   T.M.  G.90

Cus  Item  Qty
3    CK    1
3    DF    1
3    KE    1
1    LM    2
5    PE    1

Full Outer Join

ID  Name  Addr  Cus  Item  Qty
1   -     -     1    -     1
2   -     -     -    -     2
3   -     -     3    -     -
4   -     -     -    -     -
5   -     -     5    -     -
-   T.J.  -     -    -     -
-   S.M.  -     -    -     -
…   …     …     …    …     …

Inclusion Dependencies

Quantity ⊆ ID
Customer ⊆ ID

SLIDE 12

Scaling Out the Discovery of Inclusion Dependencies

Agenda

1. Discovering Inclusion Dependencies
2. Related Work
3. SINDY: A distributed discovery algorithm
4. Evaluation
5. Conclusions

SLIDE 13

SINDY: A distributed discovery algorithm

Distributed setting

Partition 1

ID  Name  Addr
1   T.J.  M.12
2   S.M.  F.63
3   D.E.  S.19

Cus  Item  Qty
3    CK    1
3    DF    1
3    KE    1

Partition 2

ID  Name  Addr
4   B.P.  Z.76
5   T.M.  G.90

Cus  Item  Qty
1    LM    2
5    PE    1

SLIDE 14

SINDY: A distributed discovery algorithm

Perform full outer join

Input: the two table partitions from Slide 13.

Cells (value attribute) of partition 1:
1 ID; 2 ID; 3 ID; T.J. Name; S.M. Name; D.E. Name; M.12 Addr; F.63 Addr; S.19 Addr; 3 Cus; 3 Cus; 3 Cus; CK Item; DF Item; KE Item; 1 Qty; 1 Qty; 1 Qty

Cells (value attribute) of partition 2:
4 ID; 5 ID; B.P. Name; T.M. Name; Z.76 Addr; G.90 Addr; 1 Cus; 5 Cus; LM Item; PE Item; 2 Qty; 1 Qty

Merged cells (value attributes) of partition 2:
1 Cus, Qty; 5 Cus, ID

SLIDE 15

SINDY: A distributed discovery algorithm

Perform full outer join

Input: the two table partitions from Slide 13.

Cells (value attributes) of partition 1:
1 ID; 2 ID; 3 ID; T.J. Name; S.M. Name; D.E. Name; M.12 Addr; F.63 Addr; S.19 Addr; 3 Cus; CK Item; DF Item; KE Item; 1 Qty

Cells (value attributes) of partition 2:
4 ID; B.P. Name; T.M. Name; Z.76 Addr; G.90 Addr; LM Item; PE Item; 1 Cus, Qty; 5 Cus, ID; 2 Qty

Cells merged by value:
1 ID, Cus, Qty; 2 ID, Qty; 3 ID, Cus
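The join step on these slides can be simulated on a single machine: each partition is split into cells, the cells are pre-aggregated locally per value, and the local results are then merged by value. The sketch below assumes the two partitions shown in the distributed setting; `local_cells` and `merge_cells` are hypothetical names, and no actual distribution framework is involved.

```python
from collections import defaultdict

# The two partitions from the distributed-setting slide, stored column-wise.
partition_1 = {
    "ID": [1, 2, 3], "Name": ["T.J.", "S.M.", "D.E."],
    "Addr": ["M.12", "F.63", "S.19"],
    "Cus": [3, 3, 3], "Item": ["CK", "DF", "KE"], "Qty": [1, 1, 1],
}
partition_2 = {
    "ID": [4, 5], "Name": ["B.P.", "T.M."], "Addr": ["Z.76", "G.90"],
    "Cus": [1, 5], "Item": ["LM", "PE"], "Qty": [2, 1],
}


def local_cells(partition):
    """Split a partition into cells and pre-aggregate them per value."""
    cells = defaultdict(set)
    for attribute, values in partition.items():
        for value in values:
            cells[value].add(attribute)
    return cells


def merge_cells(*local_results):
    """Group the pre-aggregated cells by value and union their attributes."""
    merged = defaultdict(set)
    for cells in local_results:
        for value, attributes in cells.items():
            merged[value] |= attributes
    return merged


local_2 = local_cells(partition_2)
attribute_sets = merge_cells(local_cells(partition_1), local_2)
print(sorted(local_2[1]), sorted(attribute_sets[1]))
# ['Cus', 'Qty'] ['Cus', 'ID', 'Qty']
```

Pre-aggregating per partition means that duplicate cells such as the three `3 Cus` entries are collapsed before any data would have to cross the network.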

SLIDE 16

SINDY: A distributed discovery algorithm

Perform full outer join

Input: the two table partitions from Slide 13.

Attribute sets:
ID, Cus, Qty; Addr; Cus, ID; Name; ID; Name; Item; ID, Qty; Item; Name; Cus, ID; Addr; Addr; Addr; Item; Item; Name; Name; Item; Addr

SLIDE 17

SINDY: A distributed discovery algorithm

Distributed join product

Attribute sets:
ID, Cus, Qty; ID, Cus; Item; Name; Addr; ID, Cus; Name; ID, Qty; Item; Addr; ID; Name; Item

SLIDE 18

SINDY: A distributed discovery algorithm

Evaluate full outer join

IND candidates (dependent attribute: referenced attributes):
ID: Cus, Qty; Cus: ID, Qty; Qty: ID, Cus; Addr: Ø; Name: Ø; ID: Ø; Item: Ø; ID: Qty; Qty: ID; ID: Cus; Cus: ID; Name: Ø; Item: Ø; ID: Cus; Cus: ID; Name: Ø; Item: Ø; Addr: Ø; ID: Ø

Attribute sets:
ID, Cus; Item; Name; Addr; ID, Cus; Name; ID, Qty; Item; ID, Cus, Qty; Addr; ID; Name; Item

SLIDE 19

SINDY: A distributed discovery algorithm

Evaluate full outer join

IND candidates (dependent attribute: referenced attributes):
Cus: ID, Qty; Qty: ID, Cus; Addr: Ø; Name: Ø; ID: Ø; Item: Ø; Qty: ID; Cus: ID; Name: Ø; Item: Ø; ID: Cus; Cus: ID; Name: Ø; Item: Ø; Addr: Ø; ID: Ø

Attribute sets:
ID, Cus, Qty; ID, Cus; Item; Name; Addr; ID, Cus; Name; ID, Qty; Item; Addr; ID; Name; Item

SLIDE 20

SINDY: A distributed discovery algorithm

Distributed inclusion dependencies

Merged IND candidates (dependent attribute: referenced attributes):
Qty: ID; Cus: ID; Item: Ø; Addr: Ø; ID: Ø; Name: Ø

Quantity ⊆ ID
Customer ⊆ ID
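The evaluation of the join product can be sketched as follows: every attribute set is split into IND candidates, and the candidates are grouped by dependent attribute and intersected. This is a single-machine illustration under the assumption that the attribute sets are already available as Python sets; `candidates` and `extract_inds` are hypothetical names.

```python
def candidates(attribute_set):
    """Yield one IND candidate (dependent, referenced attributes) per attribute."""
    for dependent in attribute_set:
        yield dependent, attribute_set - {dependent}


def extract_inds(attribute_sets):
    """Group candidates by dependent attribute and intersect the referenced sets."""
    referenced = {}
    for attribute_set in attribute_sets:
        for dependent, refs in candidates(attribute_set):
            if dependent in referenced:
                referenced[dependent] &= refs
            else:
                referenced[dependent] = set(refs)
    return {(dep, ref) for dep, refs in referenced.items() for ref in refs}


# Distinct attribute sets of the join product for the running example.
attribute_sets = [
    {"ID", "Cus", "Qty"}, {"ID", "Qty"}, {"ID", "Cus"}, {"ID"},
    {"Name"}, {"Addr"}, {"Item"},
]
inds = extract_inds(attribute_sets)
print(sorted(inds))  # [('Cus', 'ID'), ('Qty', 'ID')]
```

Because intersection is associative and commutative, the candidates of one dependent attribute can be intersected partially on each worker before the final merge, which matches the partially merged candidate lists on the preceding slides.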

SLIDE 21

SINDY

Variants

■ Inclusion dependencies on combinations of columns (aka n-ary INDs)
  □ Adaptation: Create cells for combinations of values
  □ Powerful in combination with an apriori-like procedure
■ Partial inclusion dependencies
  □ Adaptation: Aggregate IND candidates with multiset union instead of intersection
  □ Compare with number of distinct values of dependent column
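The partial-IND variant can be sketched by replacing the intersection with counting. The sketch assumes that "multiset union" is meant additively, so that the count for a pair (dependent, referenced) equals the number of distinct dependent values that also occur in the referenced attribute. The function name `partial_inds` and the `threshold` parameter are hypothetical, and only the attribute sets of the numeric values 1 to 5 from the running example are used.

```python
from collections import Counter, defaultdict


def partial_inds(attribute_sets, threshold):
    """Return (dependent, referenced) pairs with their share of covered values.

    `attribute_sets` holds one attribute set per distinct value.
    """
    distinct = Counter()            # distinct values per dependent attribute
    overlap = defaultdict(Counter)  # additive aggregation of IND candidates
    for attribute_set in attribute_sets:
        for dependent in attribute_set:
            distinct[dependent] += 1
            overlap[dependent].update(attribute_set - {dependent})
    return {
        (dep, ref): count / distinct[dep]
        for dep, refs in overlap.items()
        for ref, count in refs.items()
        if count / distinct[dep] >= threshold
    }


# Attribute sets of the values 1, 2, 3, 4, 5 in the running example.
attribute_sets = [
    {"ID", "Cus", "Qty"}, {"ID", "Qty"}, {"ID", "Cus"}, {"ID"}, {"ID", "Cus"},
]
result = partial_inds(attribute_sets, threshold=0.55)
print(sorted(result))  # [('Cus', 'ID'), ('ID', 'Cus'), ('Qty', 'ID')]
```

With this threshold, ID ⊆ Cus is reported as a partial dependency because three of the five distinct ID values also occur in Cus, while the two exact dependencies keep a share of 1.0.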

SLIDE 22

Scaling Out the Discovery of Inclusion Dependencies

Agenda

1. Discovering Inclusion Dependencies
2. Related Work
3. SINDY: A distributed discovery algorithm
4. Evaluation
5. Conclusions

SLIDE 23

Evaluation

Experimental setup

■ Cluster Setup
  □ 1 master node (Intel Xeon @ 2x2.67 GHz, 8 GiB RAM)
  □ 10 worker nodes (Intel Core 2 Duo @ 2x2.6 GHz, 8 GiB RAM)
  □ Apache HDFS 2.2, Apache Flink 0.6.2
■ Single node (for SPIDER)
  □ Intel Xeon @ 8x2 GHz, 128 GiB RAM, RAID-1
■ Datasets
  □ Relational datasets from different domains
  □ 16 KB to 44.9 GB

SLIDE 24

Evaluation

Performance comparison with SPIDER

[Figure: runtime [s] of SINDY and SPIDER, together with the speed-up]

SLIDE 25

Evaluation

Scale-Out Behavior

[Figure: runtime [s] over logical workers/physical workers (1/1 to 10/10, and 20/10)]

Datasets: MB-core (5.8 GB), TPC-H (1.4 GB), CATH (907 MB), LOD+ (825 MB), BIOSQLSP (567 MB), WIKIPEDIA (539 MB), CENSUS (111 MB), SCOP (15 MB), COMA (16 KB)

SLIDE 26

Scaling Out the Discovery of Inclusion Dependencies

Agenda

1. Discovering Inclusion Dependencies
2. Related Work
3. SINDY: A distributed discovery algorithm
4. Evaluation
5. Conclusions

SLIDE 27

Scaling Out the Discovery of Inclusion Dependencies

Conclusions

■ Presented a new distributed IND discovery algorithm
  □ Applicable to unary, n-ary, and partial inclusion dependencies
  □ Consists of a full outer join calculation and an extraction phase
  □ Scales well on large datasets
■ Open questions
  □ How can one continuously maintain inclusion dependencies?
  □ How can the algorithm be applied to similar problems, e.g., RDF data?

SLIDE 28

Scaling out the Discovery of Inclusion Dependencies BTW 2015, Hamburg, Germany

Sebastian Kruse, Thorsten Papenbrock, Felix Naumann Research Assistant Hasso Plattner Institute, Potsdam, Germany

SLIDE 29

Backup

Apache Hadoop Implementation

Full outer join calculation
  Map: split tuples into cells
  Combine, Reduce: group by value, union attributes

Extraction
  Map: split attributes into IND candidates
  Combine, Reduce: group by dependent attribute, intersect referenced attributes

SLIDE 30

Backup

Apache Flink/Spark Implementation

Full outer join calculation
  Map: split tuples into cells
  Combine, Reduce: group by value, union attributes

Extraction
  Map: split attributes into IND candidates
  Combine, Reduce: group by dependent attribute, intersect referenced attributes

SLIDE 31

Backup

Column and row scaling behavior

[Figure, two panels: runtime [s] over share of rows, and runtime [s] over # columns, each for MB and PDB]

SLIDE 32

Backup

Evaluation: Wikitables

[Figure: #INDs and runtime [ms] over #columns]

SLIDE 33

Backup

Evaluation: n-ary INDs

[Figure: Runtime [s] for SCOP, COMA, CENSUS, CATH, BIOSQLSP, and TPC-H, broken down into n=1 to n=6 and remainder]