SLIDE 1

Beyond Macrobenchmarks

Microbenchmark-based Graph Database Evaluation

Matteo Lissandrini, Martin Brugnara, Yannis Velegrakis

Universiteit Utrecht

SLIDE 2

Graph Databases Evaluation – Matteo Lissandrini

Graphs are Everywhere

Protein Interaction Network, Road Network, Social Network, Knowledge Graph

SLIDE 3

PROPERTY GRAPHS

[Figure: example property graph with nodes node01, node02, node03 and edges edge01, edge02, edge03.
Node properties: Name: Matteo, Role: Post-doc, Interests: Graphs; Title: Beyond…, Topic: GraphDB; name: VLDB’19, year: 2019.
Edge labels: Presents (On: 2019-08-26), in, references.]

Edge-labelled Multigraphs

G: ⟨V, E, L, ℓ⟩

ID: V / E ↦ ℕ
Labeling ℓ : E ↦ L
Properties: V/E ↦ { <key,value>, …}
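The definition above can be sketched as a small data structure. This is a minimal illustration in Python, not the storage layout of any system discussed in the talk: the class `PropertyGraph` and its methods are hypothetical names, and the example values are borrowed from the figure without claiming which node or edge carries them there.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class PropertyGraph:
    """Edge-labelled multigraph G = <V, E, L, l> with key/value properties."""
    nodes: dict = field(default_factory=dict)  # node id -> properties
    edges: dict = field(default_factory=dict)  # edge id -> (src, dst, label, properties)
    _ids: count = field(default_factory=count, repr=False)

    def add_node(self, **props):
        nid = next(self._ids)  # ID: V -> N
        self.nodes[nid] = dict(props)
        return nid

    def add_edge(self, src, dst, label, **props):
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must exist")
        eid = next(self._ids)  # ID: E -> N
        self.edges[eid] = (src, dst, label, dict(props))
        return eid

    def labels(self):
        """The labels actually used by the edges (a subset of L)."""
        return {label for _, _, label, _ in self.edges.values()}


# Assumed example data, loosely following the figure.
g = PropertyGraph()
speaker = g.add_node(Name="Matteo", Role="Post-doc", Interests="Graphs")
talk = g.add_node(Topic="GraphDB")
venue = g.add_node(name="VLDB'19", year=2019)
g.add_edge(speaker, talk, "Presents", On="2019-08-26")
g.add_edge(talk, venue, "in")
# A multigraph allows parallel edges between the same pair of nodes.
g.add_edge(talk, venue, "in")
```

Edges get their own identifiers, which is what lets two edges with the same endpoints and label coexist.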

SLIDE 4

GRAPH DATABASES

[Figure: graph database logos: Neptune, CosmosDB, Oracle Graph]

SLIDE 5

WHERE TO STORE A GRAPH?

[Ammar and Özsu, VLDB’18]

Graph Databases (OLTP), Our Focus
Updates, Transaction, Selectivity, Indices, User-interaction, Concurrency, Availability
Systems: ArangoDB, Blazegraph, Neo4j, OrientDB, Sparksee, Titan/Janus

Graph Processing (OLAP*)
Business-intelligence, Batch Processing, Algorithms, Statistics, Mining, Complex Queries, Pathfinding, Connectivity, Export/Import
Systems: GraphLab, Giraph/Pregel, GraphX

SLIDE 6

HOW TO CHOOSE THE RIGHT SYSTEM?

Graph Databases (OLTP)
Updates, Transaction, Selectivity, Indices, User-interaction, Concurrency, Availability, Complex Queries, Pathfinding, Connectivity, Export/Import

Systems: ArangoDB, Blazegraph, Neo4j, OrientDB, Sparksee, Titan/Janus

What solution works best?

SLIDE 7

THERE IS NO SILVER BULLET

Different Data Characteristics
Different Query Types
Different Use-cases
Different Data Organization
Different Indexing/Optimizations
Different Query Processing Strategies

SLIDE 8

GRAPH DATABASE ARCHITECTURES

How to implement a Graph Database

Query Processing: Specialized Query-processing & Algorithms
Storage: Specialized Data-structures & Indexes

[Figure: architecture options labelled Native and Non-Native]

SLIDE 9

GOAL: UNDERSTAND GRAPH DATABASE PERFORMANCE

FACTORS

System Architecture
Query Workload
Data Characteristics

OUTCOME

1. Evaluate Pros/Cons of each design decision
2. Identify cause of underperforming operations

SLIDE 10

Macro-Benchmarks

Goals

Predefined realistic(?) Domain & Application
Study specific Use-Cases

Techniques

Test Complex Operations
Queries based on the structure of the data and output of previous queries

Limitations

Test the query planner but hide single-operator performance
Domain Specific

Micro-Benchmark (Our Proposal)

Goals

Applicable over different Domains/Datasets
Test Basic & Common Operations

Techniques

Decompose Complex Queries
Identify Ubiquitous Operators
Test Same Operations under Different Conditions

Advantages

Domain/Data Independent
Generalizable
Allow identification of Weak Operators
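The micro-benchmark idea of testing the same operation under different conditions can be sketched as a tiny timing harness. This is an illustrative sketch only, not the framework's code: the helper `time_operation`, the toy ring graphs and the chosen sizes are all assumptions.

```python
import time
from statistics import median


def time_operation(op, repetitions=5):
    """Run op() several times and return the median wall-clock time in ms."""
    samples = []
    for _ in range(repetitions):
        start = time.perf_counter()
        op()
        samples.append((time.perf_counter() - start) * 1000.0)
    return median(samples)


def make_ring(n):
    """Toy dataset: a directed ring with n nodes, stored as an adjacency dict."""
    return {v: [(v + 1) % n] for v in range(n)}


def count_edges(graph):
    """The single basic operation under test: count all edges."""
    return sum(len(targets) for targets in graph.values())


# Same operation, different conditions (here: dataset size).
results = {}
for size in (1_000, 10_000, 100_000):
    graph = make_ring(size)
    results[size] = time_operation(lambda: count_edges(graph))

for size, ms in results.items():
    print(f"count_edges on {size} nodes: {ms:.3f} ms")
```

Because only one basic operator is timed per run, a slow result points at that operator rather than at a query plan that combines many of them.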

SLIDE 11

MICRO-BENCHMARKING GRAPH OPERATIONS

CRUD: Create Read Update Delete

Insertions, updates, retrievals both for values stored on nodes and edges, and structural elements (add/remove/retrieve nodes/edges)

Graph Queries: Edges & Traversals

Access local structure around the node, verify reachability, as well as search for nodes with specific structural characteristics

SLIDE 12

MICRO-BENCHMARKING GRAPH OPERATIONS

CRUD: Create Read Update Delete

Insertions, updates, retrievals both for values stored on nodes and edges, and structural elements (add/remove/retrieve nodes/edges)

Create new node with property P { Name : Value }
Add edge from v1 to v2 (plus some properties P)
Add property P { Name : Value } to node v or to edge e
Add a new node, and then edges from it to other nodes
Update Value for property P { Name : Value }
Delete Node/Edge
Delete property P from node/edge

Find node/edge with specific ID
Find nodes/edges with property P { Name : Value }
Find edges with a specific label
Count edges/nodes
Count distinct edge labels
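The CRUD operations listed above can be illustrated on a toy in-memory store. This is a sketch under stated assumptions: the dict layout and the helpers `add_node`, `add_edge`, `find_nodes` and `remove_node` are hypothetical and do not correspond to the API of any benchmarked system.

```python
# Toy in-memory store: plain dicts, purely illustrative.
nodes = {}  # node id -> {property name: value}
edges = {}  # edge id -> {"src": ..., "dst": ..., "label": ..., "props": {...}}


def add_node(nid, **props):
    nodes[nid] = dict(props)


def add_edge(eid, src, dst, label, **props):
    edges[eid] = {"src": src, "dst": dst, "label": label, "props": dict(props)}


def find_nodes(name, value):
    """Read: ids of nodes whose property `name` equals `value`."""
    return [nid for nid, props in nodes.items() if props.get(name) == value]


def remove_node(nid):
    """Delete a node and every edge that touches it."""
    del nodes[nid]
    for eid in [e for e, rec in edges.items() if nid in (rec["src"], rec["dst"])]:
        del edges[eid]


# Create
add_node("v1", Name="Matteo")
add_node("v2", Name="VLDB'19")
add_edge("e1", "v1", "v2", "Presents", On="2019-08-26")
# Update: set a node property and an edge property
nodes["v1"]["Role"] = "Post-doc"
edges["e1"]["props"]["Topic"] = "GraphDB"
# Read
print(find_nodes("Name", "Matteo"))              # search by property
print(len(nodes), len(edges))                    # count nodes and edges
print({rec["label"] for rec in edges.values()})  # distinct edge labels
# Delete
del nodes["v1"]["Role"]                          # remove a property
remove_node("v2")                                # remove node and incident edges
```

Each statement maps to one primitive, which is the granularity at which a micro-benchmark would time them.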

SLIDE 13

MICRO-BENCHMARKING GRAPH OPERATIONS

Graph Queries: Edges & Traversals

Access local structure around the node, verify reachability, as well as search for nodes with specific structural characteristics

Find nodes directly connected (find all incoming/outgoing edges)
Find only certain connections (filter by label)
Degree based search: e.g., high degree nodes, only inbound connections
Find all nodes reachable in K or fewer steps (BFS)
Find a list of shortest paths between two nodes
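The two traversal primitives at the end of the list can be sketched with a plain breadth-first search. This is a minimal sketch over an assumed adjacency dict that follows outgoing edges only; the function names are hypothetical, and `shortest_path` returns a single shortest path rather than the full list of them.

```python
from collections import deque


def reachable_within(adj, start, k):
    """Nodes reachable from start in at most k steps (start excluded)."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == k:
            continue  # do not expand beyond k steps
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    seen.discard(start)
    return seen


def shortest_path(adj, source, target):
    """One unweighted shortest path as a list of nodes, or None."""
    parent = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in adj.get(node, ()):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None


adj = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"], "e": []}
print(reachable_within(adj, "a", 2))
print(shortest_path(adj, "a", "e"))
```

The set of visited nodes is the intermediate result that grows with depth, which is why deeper traversals stress a system far more than 1-hop lookups.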

SLIDE 14

OUR FRAMEWORK: Selected Operations

#    Cat  Query                                     Description
1.   L    g.loadGraphSON("/path")                   Load dataset into the graph 'g'
2.   C    g.addVertex(p[])                          Create new node with properties p
3.   C    g.addEdge(v1, v2, l)                      Add edge l from v1 to v2
4.   C    g.addEdge(v1, v2, l, p[])                 Same as Q.3, but with properties p
5.   C    v.setProperty(Name, Value)                Add property Name=Value to node v
6.   C    e.setProperty(Name, Value)                Add property Name=Value to edge e
7.   C    g.addVertex(...); g.addEdge(...)          Add a new node, and then edges to it
8.   R    g.V.count()                               Total number of nodes
9.   R    g.E.count()                               Total number of edges
10.  R    g.E.label.dedup()                         Existing edge labels (no duplicates)
11.  R    g.V.has(Name, Value)                      Nodes with property Name=Value
12.  R    g.E.has(Name, Value)                      Edges with property Name=Value
13.  R    g.E.has('label', l)                       Edges with label l
14.  R    g.V(id)                                   The node with identifier id
15.  R    g.E(id)                                   The edge with identifier id
16.  U    v.setProperty(Name, Value)                Update property Name for vertex v
17.  U    e.setProperty(Name, Value)                Update property Name for edge e
18.  D    g.removeVertex(id)                        Delete node identified by id
19.  D    g.removeEdge(id)                          Delete edge identified by id
20.  D    v.removeProperty(Name)                    Remove node property Name from v
21.  D    e.removeProperty(Name)                    Remove edge property Name from e
22.  T    v.in()                                    Nodes adjacent to v via incoming edges
23.  T    v.out()                                   Nodes adjacent to v via outgoing edges
24.  T    v.both('l')                               Nodes adjacent to v via edges labeled l
25.  T    v.inE.label.dedup()                       Labels of incoming edges of v (no dupl.)
26.  T    v.outE.label.dedup()                      Labels of outgoing edges of v (no dupl.)
27.  T    v.bothE.label.dedup()                     Labels of edges of v (no dupl.)
28.  T    g.V.filter{it.inE.count()>=k}             Nodes of at least k-incoming-degree
29.  T    g.V.filter{it.outE.count()>=k}            Nodes of at least k-outgoing-degree
30.  T    g.V.filter{it.bothE.count()>=k}           Nodes of at least k-degree
31.  T    g.V.out.dedup()                           Nodes having an incoming edge
32.  T    v.as('i').both().except(vs)               Nodes reached via breadth-first traversal from v
            .store(vs).loop('i')
33.  T    v.as('i').both(*ls).except(vs)            Nodes reached via breadth-first traversal from v on labels ls
            .store(vs).loop('i')
34.  T    v1.as('i').both().except(vs).store(vs)    Unweighted Shortest Path from v1 to v2
            .loop('i'){!it.object.equals(v2)}
            .retain([v2]).path()
35.  T    Shortest Path on 'l'                      Same as Q.34, but only following label l

* [] denotes a HashMap; g is the graph; v and e are node/edges.

35 distinct Concrete Operators

Coverage of all the required operations
Complex queries can be composed through those
Domain agnostic

SLIDE 15

OUR FRAMEWORK: Experimental Environment

Batteries Included

(CC = Connected Component, Deg = Degree)

Dataset  |V|    |E|    |L|   CC #   CC Maxim  Density     Modularity  Deg Avg  Deg Max  Δ
Yeast    2.3K   7.1K   167   101    2.2K      1.34×10^-3  3.66×10^-2  6.1      66       11
MiCo     100K   1.1M   106   1.3K   93K       1.10×10^-6  5.45×10^-3  21.6     1.3K     23
Frb-O    1.9M   4.3M   424   133K   1.6M      1.19×10^-6  9.82×10^-1  4.3      92K      48
Frb-S    0.5M   0.3M   1814  0.16M  20K       1.20×10^-6  9.91×10^-1  1.3      13K      4
Frb-M    4M     3.1M   2912  1.1M   1.4M      1.94×10^-7  7.97×10^-1  1.5      139K     37
Frb-L    28.4M  31.2M  3821  2M     23M       3.87×10^-8  2.12×10^-1  2.2      1.4M     33
ldbc     184K   1.5M   15    1      184K      4.43×10^-5  -           16.6     48K      10

PREVIOUS TESTS ONLY 1M Nodes

Various Sizes & Domains: Real and Synthetic Datasets
Ready-to-go Systems & Configurations: Most popular systems already integrated and ready to use

SLIDE 16

OUR FRAMEWORK: Extensibility

Reproducible!

Common Query Language
Plug and Play setup & Controlled Environment

Easy to add

New Queries
New Systems
New Datasets

SLIDE 17

Finding 1: Native GDB are best for Generic Traversals

Native systems with JOIN-free adjacency provide the best scalability for generic traversals (> 2 hops).

[Figure: BFS (Q32) time in ms, log scale, on Frb-S, Frb-O, Frb-M and Frb-L. One panel shows depth 2, the other depths 3, 4 and 5. Systems: Blaze, Tit. 0.5, Tit. 1.0, Neo 3.0, Arango, Neo 1.9, Orient, Sparksee, Pg. Reference lines at 1 sec, 1 min and 1 hour.]

SLIDE 18

Finding 2: Not all SEARCH queries are equally optimized

Depending on the nature of the query, some systems perform better than others: e.g., relational systems perform best in high selectivity queries for attributes.

[Figure: time in ms, log scale, on Frb-S, Frb-O, Frb-M and Frb-L. One panel covers Q8 to Q13 (count nodes, edges, and distinct labels; search by property and label), the other Q14 and Q15 (search by ID). Systems: Blaze, Tit. 0.5, Tit. 1.0, Neo 3.0, Arango, Neo 1.9, Orient, Sparksee, Pg.]

SLIDE 19

Finding 3: Many systems have scalability issues

With large graphs and large intermediate results, only a few systems can deliver good performance or even complete the query

[Figure: number of timeouts per DB engine and execution method (I, B) for Orient, Tit. 0.5, Tit. 1.0, Sparksee, Pg, Arango and Blaze, broken down by dataset (Frb-S, Frb-O, Frb-M, Frb-L).]

SLIDE 20

A Micro-Benchmark for an in-depth understanding of Graph Databases Performance

http://graphbenchmark.com/results.html

FEATURES & ADVANTAGES

Richest set of queries
Open source platform that is easily extensible
Multi-domain datasets included
Gremlin based: widespread system adoption

Extensible & Reproducible!

SLIDE 21

MACRO-BENCHMARKS PROVIDE LIMITED INSIGHT

Current Macro-benchmarks do not provide sufficient insight to understand the real capabilities and limitations of a graph database

[Figure: time in ms, log scale, for the macro-benchmark queries max-iid, max-oid, create, city, company, university, friend1, friend2, friend-tags, add-tags, friend-of-friend, triangle and places. Query groups: Global search; Local search 1-hop + edge insertion; Local search 2+ hops. Systems: Neo 1.9, Neo 3.0, Orient, Tit. 0.5, Tit. 1.0, Sparksee, Arango, Sqlg.]

SLIDE 22

A note on QUERY LANGUAGES

Specialized Query Languages & API:

AQL (ArangoDB)
CYPHER (Neo4j)
Extended SQL (OrientDB)
Programming API (Sparksee)

Standard Query Language

MAJORITY OF VENDORS: SQL dialect is simpler for customers
A COMMON STANDARD: Gremlin is still the most supported but implementations are still “young”

SLIDE 23

More info on GRAPH STORAGE

SLIDE 24

INDEXED ADJACENCY

FIND FRIENDS = 2 JOINS

THE NODE-EDGE STRUCTURE IS STORED IN INTERMEDIATE TABLES AND ACCESSED VIA INDEXES

COST O(log(n))

One Relation for each Node-type & Edge type
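A rough sketch of where the O(log(n)) cost per hop comes from: the edges live in a table sorted by source node, standing in for an index, and every neighbour lookup is a binary search in it. The table contents and the names `edge_index`, `friends_of` and `friends_of_friends` are hypothetical; real systems use their own index structures.

```python
from bisect import bisect_left, bisect_right

# Edge relation sorted by source id: plays the role of an index on the source column.
edge_index = sorted([(1, 2), (1, 3), (2, 3), (3, 4), (4, 1)])
sources = [src for src, _ in edge_index]


def friends_of(node):
    """Neighbour lookup via binary search: O(log n) to locate the range."""
    lo = bisect_left(sources, node)
    hi = bisect_right(sources, node)
    return [dst for _, dst in edge_index[lo:hi]]


def friends_of_friends(node):
    """Two hops: one index lookup per intermediate node."""
    return sorted({fof for f in friends_of(node) for fof in friends_of(f)})


print(friends_of(1))
print(friends_of_friends(1))
```

Every additional hop repeats the index lookup for each intermediate node, so the logarithmic factor is paid again at every level of a traversal.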

SLIDE 25

INDEX FREE ADJACENCY

DATA SPLIT INTO FILES: Node/Edge/Label/Property stores
RECORDS OF FIXED SIZE
NODE/EDGE RECORDS CONTAIN ONLY DIRECT POINTERS
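The pointer-based layout can be sketched with two record stores in which a list position stands in for a file offset. This is a simplified illustration under stated assumptions, not the on-disk format of any system: the record fields, the `NIL` sentinel and the helper names are hypothetical, and only outgoing edges are chained.

```python
from dataclasses import dataclass

NIL = -1  # stands in for a null pointer


@dataclass
class NodeRecord:
    first_edge: int = NIL  # offset of the first outgoing edge record


@dataclass
class EdgeRecord:
    target: int            # offset of the target node record
    next_edge: int = NIL   # next outgoing edge of the same source node


node_store = []  # position in the list == record offset
edge_store = []


def add_node():
    node_store.append(NodeRecord())
    return len(node_store) - 1


def add_edge(src, dst):
    # Prepend the new edge to the source node's edge chain.
    edge_store.append(EdgeRecord(target=dst, next_edge=node_store[src].first_edge))
    node_store[src].first_edge = len(edge_store) - 1


def neighbours(src):
    """Follow direct pointers only; no index lookup is involved."""
    out = []
    e = node_store[src].first_edge
    while e != NIL:
        out.append(edge_store[e].target)
        e = edge_store[e].next_edge
    return out


a, b, c = add_node(), add_node(), add_node()
add_edge(a, b)
add_edge(a, c)
add_edge(b, c)
print(neighbours(a))
```

With fixed-size records, an offset translates directly into a position in the file, so each hop costs a constant number of record reads regardless of graph size.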

SLIDE 26

INDEX FREE ADJACENCY (ALTERNATIVE)

VALUE IS EVERYTHING THAT IS SHARED BY MULTIPLE OBJECTS: Node Types, Edge Types, Attribute Values

BIT-MAPS: From object IDs to values, And from values to object IDs

[Figure: objects with OIDs 1 to 5 mapped to values A, B and C; each value keeps a bitmap of the OIDs that carry it.]
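The two-way mapping between object IDs and values can be sketched with Python integers used as bitmaps. This is a minimal sketch: the class `BitmapIndex` is hypothetical, and the assignment of values A, B and C to OIDs 1 to 5 is an assumption, since the figure's exact assignment is not recoverable from the text.

```python
class BitmapIndex:
    """Maps OID -> value, and value -> bitmap of the OIDs that carry it."""

    def __init__(self):
        self.value_of = {}  # OID -> value
        self.oids_of = {}   # value -> int whose set bits are OIDs

    def set(self, oid, value):
        if oid in self.value_of:
            # Clear the bit in the bitmap of the previous value.
            self.oids_of[self.value_of[oid]] &= ~(1 << oid)
        self.value_of[oid] = value
        self.oids_of[value] = self.oids_of.get(value, 0) | (1 << oid)

    def oids(self, value):
        """Decode the bitmap of a value back into a sorted list of OIDs."""
        bitmap = self.oids_of.get(value, 0)
        return [oid for oid in range(bitmap.bit_length()) if (bitmap >> oid) & 1]


# Assumed assignment: three objects share value A, one has B, one has C.
labels = BitmapIndex()
for oid, value in [(1, "A"), (2, "B"), (3, "A"), (4, "C"), (5, "A")]:
    labels.set(oid, value)

print(labels.value_of[2])
print(labels.oids("A"))
```

Storing each shared value once and referencing objects by bit position means that set operations, such as intersecting two bitmaps, can answer multi-condition lookups without touching the objects themselves.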

SLIDE 27

BIG TABLE DATA MODEL

https://docs.janusgraph.org/advanced-topics/data-model/

SLIDE 28

WRAPPERS

Wrappers for Column Stores
Wrappers for Relational
Wrappers for Document/NoSQL
Wrappers for RDF Store