SLIDE 1

SPARQL Graph Pattern Processing with Apache Spark

GRADES 2017

Hubert Naacke, Université P. et M. Curie, Paris 6
Olivier Curé, Université Paris Est Marne-la-Vallée
Bernd Amann, Université P. et M. Curie, Paris 6

[Title-slide figure: an RDF star pattern relating the presentation (?P) to its title, speaker and authors]

SLIDE 2

Context

  • Big RDF data
  • Linked Open Data impulse: ever-growing RDF content
  • Large datasets: billions of <subject, prop, object> triples, e.g. DBpedia
  • Query RDF data in SPARQL
  • The building block is a Basic Graph Pattern (BGP) query, e.g.:

[Figure: a chain pattern t1 t2 t3 from the LUBM benchmark (?x advisor ?y . ?y teacherOf ?z . ?z type Course) and a snowflake pattern from the WatDiv benchmark (Retailer0 offers ?u, ?u includes ?x, ...)]
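To make the notion concrete: a BGP is a set of triple patterns, and evaluating it means joining the matches of each pattern on their shared variables. A minimal single-machine sketch in plain Python (toy data, not from the benchmarks):

```python
# A Basic Graph Pattern (BGP) is a set of triple patterns; variables
# (strings starting with '?') shared across patterns induce joins.

def match_pattern(pattern, triple):
    """Return a binding dict if the triple matches the pattern, else None."""
    binding = {}
    for p, t in zip(pattern, triple):
        if p.startswith('?'):
            if p in binding and binding[p] != t:
                return None
            binding[p] = t
        elif p != t:
            return None
    return binding

def eval_bgp(patterns, triples):
    """Naive BGP evaluation: join the matches of each pattern in turn."""
    solutions = [{}]
    for pattern in patterns:
        next_solutions = []
        for sol in solutions:
            # substitute already-bound variables before matching
            bound = tuple(sol.get(p, p) for p in pattern)
            for triple in triples:
                b = match_pattern(bound, triple)
                if b is not None:
                    next_solutions.append({**sol, **b})
        solutions = next_solutions
    return solutions

# Chain pattern in the style of LUBM: ?x advisor ?y . ?y teacherOf ?z . ?z type Course
triples = [
    ('alice', 'advisor', 'bob'),
    ('bob', 'teacherOf', 'db101'),
    ('db101', 'type', 'Course'),
    ('carol', 'advisor', 'dave'),
]
bgp = [('?x', 'advisor', '?y'), ('?y', 'teacherOf', '?z'), ('?z', 'type', 'Course')]
print(eval_bgp(bgp, triples))  # one solution: x=alice, y=bob, z=db101
```

This naive nested-loop evaluation is exactly what becomes expensive at a billion triples, which motivates the distributed join plans in the rest of the talk.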

SLIDE 3

Cluster computing platforms

  • Cluster computing platforms provide
  • main memory data management
  • distributed and parallel data access and processing
  • fault tolerance, high availability

➭ Leverage an existing platform, e.g. Apache Spark

SLIDE 4

SPARQL on Spark: Architecture

[Architecture figure: SPARQL Graph Pattern queries over RDF triples, evaluated on top of Spark's cluster resource management, distributed file system and Resilient Distributed Datasets (RDD). Five evaluation approaches: SPARQL SQL, SPARQL DF and SPARQL RDD, plus our solutions Hybrid DF and Hybrid RDD. The RDD interface applies no compression; the DataFrame (DF), SQL and GraphX layers use data compression.]

SLIDE 5

SPARQL query evaluation: Challenges

  • Requirements
  • Low memory usage: no data replication, no indexing
  • Fast data preparation: simple hash-based <Subject> partitioning
  • Challenges
  • Efficiently evaluate parallel and distributed join plans with Spark

➭ Favor local computation ➭ Reduce data transfers

  • Benefit from several join algorithms
  • Local partitioned join: no transfer
  • Distributed partitioned join
  • Broadcast join


SLIDE 6

Solution

  • Local subquery evaluation
  • Merge multiple triple selections (shared scans)
  • Distributed query evaluation
  • Cost model for partitioned and broadcast joins
  • Generate hybrid join plans via dynamic programming

SLIDE 7

Hybrid plan: example and cost model

Triple patterns of Q9:

SELECT * WHERE {
  ?x advisor ?y .
  ?y teacherOf ?z .
  ?z type Course }

[Figure: three plans for Q9. The SPARQL RDD plan Q91 uses only distributed partitioned joins (P): (t1 ⨝y t2) ⨝z t3. The SPARQL DF plan Q92 uses only broadcast joins (B). The SPARQL Hybrid plan Q93 mixes both: a partitioned join t1 ⨝y t2 followed by a broadcast join ⨝z with t3.]

Plan costs:

cost(Q91) = C(t1) + C(t2) + C(t2 ⨝ t3)
cost(Q92) = m * (C(t2) + C(t3))
cost(Q93) = C(t1) + m * C(t3)

with: C(pattern) = θcomm * size(pattern), where θcomm is the unit transfer cost and m = #computeNodes - 1
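The cost comparison above can be played out in a few lines of Python. The unit transfer cost, node count and pattern cardinalities below are made-up illustrative values, not measurements from the paper:

```python
# Sketch of the transfer-cost model: C(pattern) = theta_comm * size(pattern),
# m = #computeNodes - 1 (a broadcast sends data to every other node).

theta_comm = 1.0           # unit transfer cost (illustrative)
num_compute_nodes = 5
m = num_compute_nodes - 1

def transfer_cost(size):
    return theta_comm * size

# Illustrative cardinalities for the triple patterns of Q9
# and for the intermediate result t2 ⨝ t3.
size = {'t1': 100, 't2': 200, 't3': 50, 't2_join_t3': 40}
C = {k: transfer_cost(v) for k, v in size.items()}

cost_Q91 = C['t1'] + C['t2'] + C['t2_join_t3']   # pure partitioned-join plan
cost_Q92 = m * (C['t2'] + C['t3'])               # pure broadcast-join plan
cost_Q93 = C['t1'] + m * C['t3']                 # hybrid plan

best = min(('Q91', cost_Q91), ('Q92', cost_Q92), ('Q93', cost_Q93),
           key=lambda p: p[1])
print(best)  # with these sizes the hybrid plan Q93 is cheapest
```

Changing the relative sizes flips the winner, which is why the choice must be cost-based rather than fixed per plan shape.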

SLIDE 8

Performance comparison with S2RDF

  • S2RDF (VLDB 2016)
  • Same dataset (1B triples) and queries
  • Various query patterns: star, snowflake, complex

[Chart: response times for star, snowflake and complex queries]

➭ One dataset with <subject> partitioning: Hybrid DF accelerates DF by up to 2.4x
➭ One dataset per property, with <property> and <subject> partitioning: Hybrid accelerates S2RDF by up to 2.2x

SLIDE 9

Take-home message

  • Existing cluster computing platforms are mature enough to process SPARQL queries at large scale.
  • To accelerate query plans:
  • Provide several distributed join algorithms
  • Allow mixing several join algorithms

More info at the poster session. Thank you. Questions?

SLIDE 10

Existing solutions

  • S2RDF (VLDB 2016)
  • Spark
  • Long data preparation time
  • Uses a single join algorithm
  • CliqueSquare (ICDE 2015)
  • Hadoop platform
  • Data replicated 3 times: by subject, property and object
  • AdPart (VLDBJ 2016)
  • Native distributed layer
  • Semi-join based join algorithm
  • Distributed RDFox (ISWC 2016)
  • Native distributed layer
  • Data shuffling

SLIDE 11

Conclusion

  • First detailed analysis of SPARQL processing on Spark
  • Cost model aware of data transfers
  • Efficient query plan generation
  • Optimality not studied (future work)
  • Extensive experiments at large scale
  • Future work: incorporate other recent join algorithms
  • Handle data skew
  • Hypercube n-way join: targets load balancing

SLIDE 12

Thank you. Questions?

SLIDE 13

Extra slides

SLIDE 14

Hybrid plan: cost model

[Figure: the three plans for Q9 again. SPARQL RDD plan Q91: partitioned joins (P) only. SPARQL SQL plan Q92: broadcast joins (B) only. SPARQL Hybrid plan Q93: a partitioned join on y followed by a broadcast join on z.]

Plan costs:

cost(Q91) = C(t1) + C(t2) + C(t2 ⨝ t3)
cost(Q92) = m * (C(t2) + C(t3))
cost(Q93) = C(t1) + m * C(t3)

with: C(pattern) = θcomm * size(pattern), where θcomm is the unit transfer cost and m = #computeNodes - 1

SLIDE 15

Data distribution (1/2): hash-based partitioning

BDA 2016

[Figure: the dataset of (subject, prop, object) triples, e.g. s1 p1 o1, s1 p2 o3, s2 p1 o2, s2 p3 o4, s3 p1 o2, ..., is hash-partitioned on subject into Part 1 ... Part N, so s2 p1 o2 and s2 p3 o4 land in the same partition.]

Partitioning is
  • Straightforward: a simple map-reduce task
  • No preparation overhead
  • Hash-based partitioning on subject
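A minimal single-process sketch of this partitioning scheme, with Python lists standing in for the N partitions:

```python
# Hash-based partitioning on subject: each triple (s, p, o) goes to
# partition hash(s) mod N, so all triples sharing a subject are co-located.

N = 3  # number of partitions (illustrative)

def partition_by_subject(triples, n=N):
    parts = [[] for _ in range(n)]
    for s, p, o in triples:
        parts[hash(s) % n].append((s, p, o))
    return parts

triples = [
    ('s1', 'p1', 'o1'), ('s1', 'p2', 'o3'),
    ('s2', 'p1', 'o2'), ('s2', 'p3', 'o4'),
    ('s3', 'p1', 'o2'),
]
parts = partition_by_subject(triples)

# Record, for each subject, which partition(s) hold its triples.
where = {}
for i, part in enumerate(parts):
    for s, _, _ in part:
        where.setdefault(s, set()).add(i)
print(where)
```

Each subject maps to exactly one partition, which is what makes star joins on the subject purely local.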
SLIDE 16

Data distribution (2/2): processing over a cluster

[Figure: compute nodes 1 ... N each hold a piece of data in memory; operations run where the data is and produce results. Resources: memory and CPU are local per node; communication between nodes is expensive.]

SLIDE 17

Parallel and distributed data processing workflow (1/2)

[Figure: a dataset partitioned into Part 1 ... Part N across compute nodes 1 ... N; a local (MAP) operation, e.g. a selection, runs on each partition and yields a partitioned result Result 1 ... Result N.]

Examples of local MAP operations: selection, projection, join on subject

SLIDE 18

Parallel and distributed data processing workflow (2/2)

[Figure: a global (REDUCE) operation over the partitioned dataset requires data transfers between nodes before each node can produce its part of the result.]

Examples of global REDUCE operations: join, sort, distinct

SLIDE 19

Join processing wrt. query pattern

Data (partitioned on subject):

P1 lab L1 . P1 name Ali . P2 lab L3 . P2 age 20 . P2 name Bob . P3 lab L2 . P3 name Clo . P4 lab L1 .
L1 at Poitiers . L1 since 2000 . L2 at Aix . L2 at Toulon . L2 partner L1 . L3 at Paris . L3 staff 200 . ...

  • Star query (find the laboratory and name of persons: ?P lab ?L . ?P name ?N . ?P age ?a): all patterns share the subject ?P, so the join is local; no transfer.
  • Chain query (find the lab and its city for persons: ?P lab ?L . ?L at ?V): the join key ?L is an object on one side and a subject on the other, so either lab or at must be transferred.
  • Snowflake and complex queries combine star and chain sub-patterns (adding e.g. ?L at ?V, ?L staff ?s, ?L partner ...).

SLIDE 20

Join algorithms

  • Partitioned join (Pjoin)
  • Distributes data
  • Broadcast join (BrJoin)
  • Broadcasts to all nodes
  • Hybrid join (our contribution)
  • Distributes and/or broadcasts
  • Based on a cost model

SLIDE 21

Cost of join (1/2): partitioned join

[Figure: both sides of the partitioned dataset (P1 lab L1, P3 lab L2, P2 lab L3, P4 lab L1 on one side; the L triples on the other) are re-hashed on the join attribute L and shuffled, then joined per key: join on L1 yields P1 lab L1 at Poitiers and P4 lab L1 at Poitiers, join on L2 yields P3 lab L2 at Aix and P3 lab L2 at Toulon, join on L3 yields P2 lab L3 at Paris.]

  • The result is partitioned on L
  • Data transfers = sum of the repartitioned datasets
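The partitioned join can be sketched in plain Python, with lists of partitions standing in for compute nodes and the slide's person/lab triples as data (a single-process illustration, not the Spark implementation):

```python
# Partitioned join (Pjoin) sketch: both inputs are re-hashed on the join
# key (the lab ?L here), shuffled, then joined locally per partition.
from collections import defaultdict

def shuffle_on_key(records, key_index, n):
    """Re-partition records by hashing the join key; models the data transfer."""
    parts = [[] for _ in range(n)]
    for rec in records:
        parts[hash(rec[key_index]) % n].append(rec)
    return parts

def partitioned_join(left, right, n=2):
    # left: (?P, lab, ?L) triples, join key is the object (index 2)
    # right: (?L, at, ?V) triples, join key is the subject (index 0)
    left_parts = shuffle_on_key(left, 2, n)
    right_parts = shuffle_on_key(right, 0, n)
    out = []
    for lp, rp in zip(left_parts, right_parts):  # local join per partition
        index = defaultdict(list)
        for l, _, v in rp:
            index[l].append(v)
        for p, _, l in lp:
            for v in index[l]:
                out.append((p, l, v))
    return sorted(out)

lab = [('P1', 'lab', 'L1'), ('P2', 'lab', 'L3'), ('P3', 'lab', 'L2'), ('P4', 'lab', 'L1')]
at  = [('L1', 'at', 'Poitiers'), ('L2', 'at', 'Aix'), ('L2', 'at', 'Toulon'), ('L3', 'at', 'Paris')]
print(partitioned_join(lab, at))
```

Both shuffles contribute to the transfer cost, matching "data transfers = sum of the repartitioned datasets".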

SLIDE 22

Cost of join (2/2): broadcast join

[Figure: the smaller dataset (L1 at Poitiers, L2 at Aix, L2 at Toulon, L3 at Paris) is broadcast to every node; each partition of the larger target dataset (P1 lab L1, P3 lab L2 | P2 lab L3, P4 lab L1) joins locally on L, yielding P1 lab L1 at Poitiers, P3 lab L2 at Aix, P3 lab L2 at Toulon | P2 lab L3 at Paris, P4 lab L1 at Poitiers.]

  • The result preserves the target's partitioning
  • Data transfers = size of the small dataset * number of compute nodes
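A companion sketch of the broadcast join, in the same single-process style and on the same illustrative person/lab data:

```python
# Broadcast join (BrJoin) sketch: the small relation is copied to every
# node; each partition of the large target joins locally and keeps its
# partitioning. Transfer = size(small) * number of compute nodes.

def broadcast_join(target_parts, small):
    # Build the lookup once per node, as each node would after the broadcast.
    lookup = {}
    for l, _, v in small:
        lookup.setdefault(l, []).append(v)
    result_parts = []
    for part in target_parts:  # purely local work per partition
        result_parts.append(
            [(p, l, v) for p, _, l in part for v in lookup.get(l, [])])
    return result_parts

lab_parts = [[('P1', 'lab', 'L1'), ('P3', 'lab', 'L2')],
             [('P2', 'lab', 'L3'), ('P4', 'lab', 'L1')]]
at = [('L1', 'at', 'Poitiers'), ('L2', 'at', 'Aix'),
      ('L2', 'at', 'Toulon'), ('L3', 'at', 'Paris')]

parts = broadcast_join(lab_parts, at)
transfer = len(at) * len(lab_parts)  # small dataset * nb of compute nodes
print(parts, transfer)
```

Note that the output keeps the target's partitioning, so a later join on the same key needs no further shuffle.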

SLIDE 23

Proposed solution: hybrid join plan

  • Cost model for Pjoin and BrJoin
  • Aware of data partitioning and the number of compute nodes
  • Uses sizes of intermediate results
  • Handles plans of star patterns
  • Star = local Pjoin

➭ Get a linear join plan of stars

  • Often with successive BrJoins between selective stars
  • Build the plan at runtime
  • Get sizes of intermediate results

SLIDE 24

Build a hybrid join plan

1) Compute all stars S1, S2, ...
  • Si = Pjoin(t1, t2, ...)
2) Join two stars, say Si with Sj
  • Ensure cost(Si ⨝ Sj) is minimal
  ➭ yields Si, Sj and a join algorithm
  • Let Temp = Si ⨝ Sj
3) Continue with a third star, say Sk
  • Ensure cost(Temp ⨝ Sk) is minimal

... and so on.
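The steps above can be sketched as a greedy loop in Python. The cost formulas and the intermediate-size estimate below are deliberately crude placeholders (a real optimizer would use the paper's calibrated cost model and statistics):

```python
# Greedy construction of a hybrid join plan over star subqueries:
# seed with the smallest star, then repeatedly pick the star and join
# algorithm with minimal estimated transfer cost.

def build_plan(star_sizes, m):
    """star_sizes: estimated result size of each star Si; m = #nodes - 1."""
    remaining = dict(star_sizes)
    # Each star itself is evaluated with a local Pjoin (no transfer).
    current = min(remaining, key=remaining.get)
    temp_size = remaining.pop(current)
    plan = [current]
    while remaining:
        def best_cost(s):
            pjoin = temp_size + remaining[s]            # repartition both sides
            brjoin = m * min(temp_size, remaining[s])   # broadcast smaller side
            return min(pjoin, brjoin), 'Pjoin' if pjoin <= brjoin else 'BrJoin'
        s = min(remaining, key=lambda x: best_cost(x)[0])
        _, algo = best_cost(s)
        plan.append((s, algo))
        # Crude guess for |Temp ⨝ S|; real optimizers estimate this from stats.
        temp_size = min(temp_size, remaining.pop(s))
    return plan

print(build_plan({'S1': 1000, 'S2': 10, 'S3': 500}, m=4))
```

With a small selective star S2, the sketch keeps broadcasting the small intermediate result, which matches the "successive BrJoins between selective stars" observation.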

SLIDE 25

SPARQL on Spark: qualitative comparison

Method       Spark interface   Join plan        Query optimizer
SPARQL RDD   RDD               Pjoin            -
SPARQL DF    DF (v1.5)         Pjoin, BrJoin    poor
SPARQL SQL   SQL (v1.5)        Pjoin, BrJoin    cross products
Hybrid RDD   RDD (ours)        Pjoin, BrJoin+   cost based
Hybrid DF    DF (ours)         Pjoin, BrJoin+   cost based

[The slide's table also marks supported/not-supported co-partitioning, merged selection and data compression per method.]

SLIDE 26

Experimental validation: setup

  • Datasets

Dataset    Nb of triples   Description
DrugBank   500K            Real dataset
LUBM       1.3B            Synthetic data, Lehigh Univ.
WatDiv     1.1B            Synthetic data, Waterloo Univ.

  • Cluster
  • 17 compute nodes
  • Resources per node: 12 cores x 2 hyperthreads, 64 GB memory
  • 1 Gb/s interconnect
  • Spark
  • 16 worker nodes
  • Aggregated resources: 300 cores, 800 GB memory
  • Solution
  • Implementation written in Scala; see the companion website

SLIDE 27

Experiments: performance gain

  • Response time for the snowflake query Q8 from LUBM
  • 2 dataset sizes: medium (100M triples), large (1B triples)

[Chart: response time vs. dataset size]

➭ Higher gains on larger datasets: 4.7x faster without compression, 3x faster on compressed data

SLIDE 28

Thanks for your attention. Questions?

SLIDE 29

Extra slides

SLIDE 30

SPARQL on Spark

  • Spark: an in-memory cluster computing platform
  • 2 data models: RDD and DataFrame (DF); DF is compressed
  • 3 manipulation languages: SQL, the RDD interface, the DF interface
  • SPARQL SQL
  • Translates the SPARQL query into SQL
  • Benefits from the Catalyst query optimizer
  • But for chain queries it generates query plans with cross products
  • SPARQL RDD
  • Translates the query into join(), filter(), map() physical operators
  • No BrJoin, only Pjoin plans, fixed join order
  • SPARQL DF
  • Translates the query into join(), where(), select() logical operators
  • Poor Pjoin/BrJoin choice: ignores triple pattern selectivity and misses BrJoin opportunities in many cases
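To illustrate the SPARQL SQL idea, here is a hypothetical sketch that compiles a BGP into SQL self-joins over a single triples(s, p, o) table. The table and column names are assumptions for illustration; this is not the translation scheme used by Spark or by the paper:

```python
# Sketch: translate a Basic Graph Pattern into SQL self-joins over a
# hypothetical triples(s, p, o) table. Shared variables become join
# conditions; constants become selections.

def bgp_to_sql(patterns):
    froms, conds, var_cols = [], [], {}
    for i, (s, p, o) in enumerate(patterns):
        alias = f't{i}'
        froms.append(f'triples {alias}')
        for col, term in (('s', s), ('p', p), ('o', o)):
            ref = f'{alias}.{col}'
            if term.startswith('?'):
                if term in var_cols:          # shared variable => join condition
                    conds.append(f'{ref} = {var_cols[term]}')
                else:
                    var_cols[term] = ref
            else:                             # constant => selection
                conds.append(f"{ref} = '{term}'")
    select = ', '.join(f'{c} AS {v[1:]}' for v, c in var_cols.items())
    return f"SELECT {select} FROM {', '.join(froms)} WHERE {' AND '.join(conds)}"

print(bgp_to_sql([('?x', 'advisor', '?y'),
                  ('?y', 'teacherOf', '?z'),
                  ('?z', 'type', 'Course')]))
```

On this chain query the translation produces a clean chain of equi-joins; the cross-product problem mentioned above arises in how an SQL optimizer may then order those joins.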

SLIDE 31

SPARQL on Spark: hybrid solution

  • Combines multiple triple selections
  • Prunes data to reduce access cost
  • Builds a cost-based optimized plan
  • Supports both data models: RDD and DF
  • Implements the missing BrJoin for RDD
  • Allows broadcasting intermediate results

SLIDE 32

Performance of star queries

SLIDE 33

Performance of chain queries

SLIDE 34

Partitioned join: detailed algorithm

1) Partition the data on the join key (checking the current data partitioning first)
2) Distribute (i.e., shuffle) the partitions
3) Compute the join for each key

➭ Data transfers: see the formula in the paper (cost of the n-ary Pjoin)

SLIDE 35

Broadcast join: detailed algorithm

  • Joins a smaller dataset with a larger one
  • Larger dataset = target dataset

1) Broadcast the small dataset to every compute node
2) Compute the join for each partition of the target

➭ Data transfers: see the formula in the paper (cost of the n-ary BrJoin)


SLIDE 38

Join processing (1)

[Figure: evaluating the triple join ?x memberOf ?y . ?x email ?z over a partitioned triple dataset: each node selects the matching triples of the second pattern and hashes them on ?x; the join is then computed per hash value, e.g. for h(x)=1, producing Result 1 ... Result n.]
slide-39
SLIDE 39

ENLEVER Assumptions and requirements

  • Data Volume
  • Requires a distributed environnement
  • Data Velocity
  • Requires reduced data loading time
  • Main memory data management
  • No data replication

BDA 2016 39

SLIDE 40

SPARQL Query Processing with Apache Spark

[Title-slide figure: an RDF graph relating the presentation (?P) to its title, speaker and authors]

Author          Laboratory   University
Hubert Naacke   LIP6         Université P. et M. Curie, Paris 6
Olivier Curé    LIGM         Université Paris Est Marne-la-Vallée
Bernd Amann     LIP6         Université P. et M. Curie, Paris 6
SLIDE 41

Processing a global operation on distributed data

[Figure: a global operation over the partitioned triple dataset (Part 1 ... Part N) produces one result only after data transfers.]

➭ The global operation is not parallel enough. Scalability?

SLIDE 42

Join processing wrt. query shape

[Figure: the same person/lab dataset as on slide 19, with the chain query ?P lab ?L . ?L at ?V (find the laboratory and city of persons) and the star query ?P lab ?L . ?P name ?N . ?P age ?a.]

SLIDE 43

[Figure: repeats the star, chain, snowflake and complex query patterns over the person/lab dataset from slide 19.]

SLIDE 46

WatDiv F5 (snowflake pattern):

SELECT * WHERE {
  <retailer0> offers ?u .
  ?u price ?v .
  ?u validThrough ?w .
  ?u includes ?x .
  ?x title ?y .
  ?x type ?z }
