SLIDE 1

Beyond Macrobenchmarks

Microbenchmark-based Graph Database Evaluation

Matteo Lissandrini, Martin Brugnara, Yannis Velegrakis

Universiteit Utrecht

SLIDE 2

Graph Databases Evaluation – Matteo Lissandrini

Graphs are Everywhere

  • Protein Interaction Network
  • Road Network
  • Social Network
  • Knowledge Graph

SLIDE 3

PROPERTY GRAPHS

[Figure: example property graph with three nodes (node01, node02, node03) and three edges (edge01, edge02, edge03).]

Properties shown: Name: Matteo; Role: Post-doc; Interests: Graphs; Title: Beyond…; Topic: GraphDB; On: 2019-08-26; name: VLDB’19; year: 2019

Edge labels shown: Presents in, references

Edge-labelled Multigraphs

G: ⟨V, E, L, ℓ⟩

  • ID: V / E ↦ ℕ
  • Labeling ℓ : E ↦ L
  • Properties: V / E ↦ { <key,value>, … }
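
The definition above can be sketched in Python as follows. This is only an illustration under assumptions: the names `PropertyGraph`, `Edge`, `add_node`, and `add_edge` are hypothetical, IDs are drawn from one shared counter, and the way the sample properties are attached to nodes is assumed rather than read from the figure.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Edge:
    src: int      # ID of the source node
    dst: int      # ID of the target node
    label: str    # the labeling function maps each edge to one label in L
    props: dict = field(default_factory=dict)


@dataclass
class PropertyGraph:
    nodes: dict = field(default_factory=dict)   # node ID -> {key: value}
    edges: dict = field(default_factory=dict)   # edge ID -> Edge
    _ids: count = field(default_factory=count, repr=False)

    def add_node(self, **props) -> int:
        nid = next(self._ids)
        self.nodes[nid] = dict(props)
        return nid

    def add_edge(self, src: int, dst: int, label: str, **props) -> int:
        eid = next(self._ids)
        self.edges[eid] = Edge(src, dst, label, dict(props))
        return eid

    def labels(self) -> set:
        """The label set L actually used by the edges."""
        return {e.label for e in self.edges.values()}


g = PropertyGraph()
person = g.add_node(Name="Matteo", Role="Post-doc")
venue = g.add_node(name="VLDB'19", year=2019)
# Multigraph: two parallel edges with the same label get distinct IDs.
e1 = g.add_edge(person, venue, "Presents in")
e2 = g.add_edge(person, venue, "Presents in")
```

Storing edges by ID rather than by (source, target) pair is what allows parallel edges between the same two nodes.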

SLIDE 4

GRAPH DATABASES

Neptune, CosmosDB, Oracle Graph

SLIDE 5

WHERE TO STORE A GRAPH?

OLTP (Graph Databases): Updates, Transaction, Selectivity, Indices, User-interaction, Concurrency, Availability

OLAP* (Graph Processing): Business-intelligence, Batch Processing, Algorithms, Statistics, Mining

Complex Queries, Pathfinding, Connectivity, Export/Import

Graph Processing systems: GraphLab, Giraph/Pregel, GraphX

Graph Databases (Our Focus): ArangoDB, Blazegraph, Neo4j, OrientDB, Sparksee, Titan/Janus

[Ammar and Özsu, VLDB’18]

SLIDE 6

HOW TO CHOOSE THE RIGHT SYSTEM?

Graph Databases (OLTP): Updates, Transaction, Selectivity, Indices, User-interaction, Concurrency, Availability, Complex Queries, Pathfinding, Connectivity, Export/Import

Systems: ArangoDB, Blazegraph, Neo4j, OrientDB, Sparksee, Titan/Janus

What solution works best?

SLIDE 7

THERE IS NO SILVER BULLET

  • Different Data Characteristics
  • Different Query Types
  • Different Use-cases
  • Different Data Organization
  • Different Indexing/Optimizations
  • Different Query Processing Strategies

SLIDE 8

GRAPH DATABASE ARCHITECTURES

Layers: Query Processing, Storage (Native or Non-Native)

  • Native Query Processing: Specialized Query-processing & Algorithms
  • Native Storage: Specialized Data-structures & Indexes

How to implement a Graph Database

SLIDE 9

GOAL: UNDERSTAND GRAPH DATABASES PERFORMANCE

FACTORS

  • System Architecture
  • Query Workload
  • Data Characteristics

OUTCOME

  1. Evaluate Pros/Cons of each design decision
  2. Identify the cause of underperforming operations

SLIDE 10

Macro-Benchmarks

Goals

  • Predefined realistic(?) Domain & Application
  • Study specific Use-Cases

Techniques

  • Test Complex Operations
  • Queries based on the structure of the data and the output of previous queries

Limitations

  • Test the query planner but hide single-operator performance
  • Domain Specific

Micro-Benchmark (Our Proposal)

Goals

  • Applicable over different Domains/Datasets
  • Test Basic & Common Operations

Techniques

  • Decompose Complex Queries
  • Identify Ubiquitous Operators
  • Test Same Operations under Different Conditions

Advantages

  • Domain/Data Independent
  • Generalizable
  • Allow identification of Weak Operators
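
The technique "Test Same Operations under Different Conditions" could look roughly like the sketch below. It is an illustration and not the framework's code: `time_operation` and `out_neighbors` are hypothetical names, the graphs are synthetic dictionaries, and wall-clock medians are only one possible measurement choice.

```python
import time
from statistics import median


def time_operation(op, args_list, repeats=5):
    """Return the median wall-clock time (ms) of op for each argument tuple."""
    results = []
    for args in args_list:
        samples = []
        for _ in range(repeats):
            start = time.perf_counter()
            op(*args)
            samples.append((time.perf_counter() - start) * 1000.0)
        results.append(median(samples))
    return results


def out_neighbors(adj, v):
    """Primitive operator under test: nodes reached via one outgoing edge."""
    return adj.get(v, [])


# Same operator, different conditions: a small and a larger synthetic graph.
small = {i: [(i + 1) % 10] for i in range(10)}
large = {i: [(i + 1) % 10_000, (i + 7) % 10_000] for i in range(10_000)}

timings = time_operation(out_neighbors, [(small, 0), (large, 0)])
```

Because a single operator is measured in isolation, a slow result points at that operator rather than at the query planner.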

SLIDE 11

MICRO-BENCHMARKING GRAPH OPERATIONS

CRUD: Create Read Update Delete

Insertions, updates, retrievals both for values stored on nodes and edges, and structural elements (add/remove/retrieve nodes/edges)

Graph Queries: Edges & Traversals

Access local structure around the node, verify reachability, as well as search for nodes with specific structural characteristics

SLIDE 12

MICRO-BENCHMARKING GRAPH OPERATIONS

CRUD: Create Read Update Delete

Insertions, updates, retrievals both for values stored on nodes and edges, and structural elements (add/remove/retrieve nodes/edges)

  • Create new node with property P { Name : Value }
  • Add edge from v1 to v2 (plus some properties P)
  • Add property P { Name : Value } to node v or to edge e
  • Add a new node, and then edges from it to other nodes
  • Update Value for property P { Name : Value }
  • Delete Node/Edge
  • Delete property P from node/edge
  • Find node/edge with specific ID
  • Find nodes/edges with property P { Name : Value }
  • Find edges with a specific label
  • Count edges/nodes
  • Count distinct edge labels
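
As a rough illustration of the CRUD primitives listed above, here is a minimal in-memory sketch. The class `TinyGraph` and its method names are hypothetical, and dropping incident edges when a node is deleted is an assumption made here; real systems may behave differently.

```python
class TinyGraph:
    def __init__(self):
        self.nodes = {}    # node id -> properties
        self.edges = {}    # edge id -> (src, dst, label, properties)
        self._next_id = 0

    def _new_id(self):
        self._next_id += 1
        return self._next_id

    # Create
    def add_node(self, props=None):
        nid = self._new_id()
        self.nodes[nid] = dict(props or {})
        return nid

    def add_edge(self, v1, v2, label, props=None):
        eid = self._new_id()
        self.edges[eid] = (v1, v2, label, dict(props or {}))
        return eid

    # Read
    def find_nodes(self, name, value):
        return [n for n, p in self.nodes.items() if p.get(name) == value]

    def distinct_labels(self):
        return {label for (_, _, label, _) in self.edges.values()}

    # Update
    def set_node_property(self, nid, name, value):
        self.nodes[nid][name] = value

    # Delete (assumption: incident edges are removed with the node)
    def remove_node(self, nid):
        del self.nodes[nid]
        self.edges = {e: t for e, t in self.edges.items()
                      if nid not in (t[0], t[1])}


g = TinyGraph()
a = g.add_node({"Name": "a"})
b = g.add_node({"Name": "b"})
e = g.add_edge(a, b, "knows")
g.set_node_property(a, "Name", "A")
labels_before = g.distinct_labels()
g.remove_node(b)
```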

SLIDE 13

MICRO-BENCHMARKING GRAPH OPERATIONS

Graph Queries: Edges & Traversals

Access local structure around the node, verify reachability, as well as search for nodes with specific structural characteristics

  • Find nodes directly connected (find all incoming/outgoing edges)
  • Find only certain connections (filter by label)
  • Degree based search: e.g., high degree nodes, only inbound connections
  • Find all nodes reachable in K or less steps (BFS)
  • Find a list of shortest paths between two nodes
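
The last two traversal operations can be sketched with a plain breadth-first search. This is a minimal illustration assuming a directed adjacency dictionary that follows outgoing edges only; the function names `reachable_within` and `shortest_paths` are hypothetical.

```python
from collections import deque


def reachable_within(adj, source, k):
    """Nodes reachable from source in at most k steps (source excluded)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if dist[u] == k:
            continue  # do not expand beyond depth k
        for w in adj.get(u, ()):
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return {v for v in dist if v != source}


def shortest_paths(adj, source, target):
    """All unweighted shortest paths from source to target."""
    dist = {source: 0}
    parents = {source: []}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj.get(u, ()):
            if w not in dist:
                dist[w] = dist[u] + 1
                parents[w] = [u]
                queue.append(w)
            elif dist[w] == dist[u] + 1:
                parents[w].append(u)  # another equally short way to reach w
    if target not in dist:
        return []

    def build(v):
        if v == source:
            return [[source]]
        return [path + [v] for p in parents[v] for path in build(p)]

    return build(target)


adj = {1: [2, 3], 2: [4], 3: [4], 4: [5]}
within_two = reachable_within(adj, 1, 2)
paths = shortest_paths(adj, 1, 4)
```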

SLIDE 14

OUR FRAMEWORK Selected Operations

#  | Query | Description | Cat
1  | g.loadGraphSON("/path") | Load dataset into the graph 'g' | L
2  | g.addVertex(p[]) | Create new node with properties p | C
3  | g.addEdge(v1, v2, l) | Add edge from v1 to v2 | C
4  | g.addEdge(v1, v2, l, p[]) | Same as Q.3, but with properties p | C
5  | v.setProperty(Name, Value) | Add property Name=Value to node v | C
6  | e.setProperty(Name, Value) | Add property Name=Value to edge e | C
7  | g.addVertex(...); g.addEdge(...) | Add a new node, and then edges to it | C
8  | g.V.count() | Total number of nodes | R
9  | g.E.count() | Total number of edges | R
10 | g.E.label.dedup() | Existing edge labels (no duplicates) | R
11 | g.V.has(Name, Value) | Nodes with property Name=Value | R
12 | g.E.has(Name, Value) | Edges with property Name=Value | R
13 | g.E.has('label', l) | Edges with label l | R
14 | g.V(id) | The node with identifier id | R
15 | g.E(id) | The edge with identifier id | R
16 | v.setProperty(Name, Value) | Update property Name for vertex v | U
17 | e.setProperty(Name, Value) | Update property Name for edge e | U
18 | g.removeVertex(id) | Delete node identified by id | D
19 | g.removeEdge(id) | Delete edge identified by id | D
20 | v.removeProperty(Name) | Remove node property Name from v | D
21 | e.removeProperty(Name) | Remove edge property Name from e | D
22 | v.in() | Nodes adjacent to v via incoming edges | T
23 | v.out() | Nodes adjacent to v via outgoing edges | T
24 | v.both('l') | Nodes adjacent to v via edges labeled l | T
25 | v.inE.label.dedup() | Labels of incoming edges of v (no dupl.) | T
26 | v.outE.label.dedup() | Labels of outgoing edges of v (no dupl.) | T
27 | v.bothE.label.dedup() | Labels of edges of v (no dupl.) | T
28 | g.V.filter{it.inE.count()>=k} | Nodes of at least k-incoming-degree | T
29 | g.V.filter{it.outE.count()>=k} | Nodes of at least k-outgoing-degree | T
30 | g.V.filter{it.bothE.count()>=k} | Nodes of at least k-degree | T
31 | g.V.out.dedup() | Nodes having an incoming edge | T
32 | v.as('i').both().except(vs).store(vs).loop('i') | Nodes reached via breadth-first traversal from v | T
33 | v.as('i').both(*ls).except(vs).store(vs).loop('i') | Nodes reached via breadth-first traversal from v on labels ls | T
34 | v1.as('i').both().except(vs).store(vs).loop('i'){!it.object.equals(v2)}.retain([v2]).path() | Unweighted Shortest Path from v1 to v2 | T
35 | Shortest Path on 'l' | Same as Q.34, but only following label l | T

p[] denotes a HashMap; g is the graph; v and e are node/edges.

35 distinct Concrete Operators

  • Coverage of all the required operations
  • Complex queries can be composed through those
  • Domain agnostic

SLIDE 15

OUR FRAMEWORK Experimental Environment

Batteries Included

Dataset characteristics (CC = Connected Component; the header of the last column did not survive extraction):

Dataset  |V|    |E|    |L|   CC #   CC Maxim  Density     Modularity  Deg. Avg  Deg. Max  (unlabeled)
Yeast    2.3K   7.1K   167   101    2.2K      1.34×10^-3  3.66×10^-2  6.1       66        11
MiCo     100K   1.1M   106   1.3K   93K       1.10×10^-6  5.45×10^-3  21.6      1.3K      23
Frb-O    1.9M   4.3M   424   133K   1.6M      1.19×10^-6  9.82×10^-1  4.3       92K       48
Frb-S    0.5M   0.3M   1814  0.16M  20K       1.20×10^-6  9.91×10^-1  1.3       13K       4
Frb-M    4M     3.1M   2912  1.1M   1.4M      1.94×10^-7  7.97×10^-1  1.5       139K      37
Frb-L    28.4M  31.2M  3821  2M     23M       3.87×10^-8  2.12×10^-1  2.2       1.4M      33
ldbc     184K   1.5M   15    1      184K      4.43×10^-5  (missing)   16.6      48K       10

  • Various Sizes & Domains: Real and Synthetic Datasets (previous tests: only 1M nodes)
  • Ready-to-go Systems & Configurations: Most popular systems already integrated and ready to use
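
The Density and average Degree columns can be approximated from |V| and |E| alone. The slide does not state the formulas; the sketch below assumes the usual directed-graph density |E| / (|V|(|V|-1)) and average degree 2|E| / |V|, which come close to the Yeast row when fed the rounded values from the table. The function names are hypothetical.

```python
def density(num_nodes, num_edges):
    """Directed-graph density: share of the |V|(|V|-1) possible edges present."""
    return num_edges / (num_nodes * (num_nodes - 1))


def average_degree(num_nodes, num_edges):
    """Each edge contributes to the degree of its two endpoints."""
    return 2 * num_edges / num_nodes


# Rounded Yeast values from the table: |V| = 2.3K, |E| = 7.1K.
yeast_density = density(2_300, 7_100)
yeast_avg_degree = average_degree(2_300, 7_100)
```

Since the inputs are rounded, the results only match the table to about two significant digits.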

SLIDE 16

OUR FRAMEWORK Extensibility

Reproducible!

  • Common Query Language
  • Plug and Play setup & Controlled Environment

Easy to add

  • New Queries
  • New Systems
  • New Datasets

SLIDE 17

Finding 1: Native GDB are best for Generic Traversals

Native systems with JOIN-free adjacency provide the best scalability for generic traversals (> 2 hops).

[Figure: BFS (Q32) execution time in ms, log scale, on Frb-S, Frb-O, Frb-M, and Frb-L. Panels: Q32 (depth 2); Q32 (depth 3, 4, 5). Systems: Blaze, Tit. 0.5, Tit. 1.0, Neo 3.0, Arango, Neo 1.9, Orient, Sparksee, Pg.]

SLIDE 18

Finding 2: Not all SEARCH queries are equally optimized

Depending on the nature of the query, some systems perform better than others: e.g., relational systems perform best on high-selectivity attribute queries.

[Figure: execution time in ms, log scale, on Frb-S, Frb-O, Frb-M, and Frb-L. Panels: Q8, Q9, Q10 (count nodes, edges, and distinct labels); Q11, Q12, Q13 (search by property and label); Q14, Q15 (search by ID). Systems: Blaze, Tit. 0.5, Tit. 1.0, Neo 3.0, Arango, Neo 1.9, Orient, Sparksee, Pg.]

SLIDE 19

Finding 3: Many systems have scalability issues

With large graphs and large intermediate results, only a few systems can deliver good performance or even complete the query.

[Figure (c): number of timeouts per DB engine and execution method (I, B) on Frb-S, Frb-O, Frb-M, and Frb-L. Systems: Orient, Tit. 0.5, Tit. 1.0, Sparksee, Pg, Arango, Blaze.]

SLIDE 20

A Micro-Benchmark for an in-depth understanding of Graph Databases Performance

http://graphbenchmark.com/results.html

FEATURES & ADVANTAGES

  • Richest set of queries
  • Open source platform that is easily extensible
  • Multi-domain datasets included
  • Gremlin based: widespread system adoption

Extensible & Reproducible!

SLIDE 21

MACRO-BENCHMARKS PROVIDE LIMITED INSIGHT

Current macro-benchmarks do not provide sufficient insight to understand the real capabilities and limitations of a graph database.

[Figure: execution time in ms, log scale, for the macro-benchmark queries max-iid, max-oid, create, city, company, university, friend1, friend2, friend-tags, add-tags, friend-of-friend, triangle, places. Query groups: Global search; Local search 1-hop + edge insertion; Local search 2+ hops. Systems: Neo 1.9, Neo 3.0, Orient, Tit. 0.5, Tit. 1.0, Sparksee, Arango, Sqlg.]

SLIDE 22

A note on QUERY LANGUAGES

Specialized Query Languages & API:

  • AQL (ArangoDB)
  • Cypher (Neo4j)
  • Extended SQL (OrientDB)
  • Programming API (Sparksee)

Standard Query Language

  • MAJORITY OF VENDORS: SQL dialect is simpler for customers
  • A COMMON STANDARD: Gremlin is still the most supported, but implementations are still “young”

SLIDE 23

More info on GRAPH STORAGE

SLIDE 24

INDEXED ADJACENCY

FIND FRIENDS = 2 JOINS

THE NODE-EDGE STRUCTURE IS STORED IN INTERMEDIATE TABLES AND ACCESSED VIA INDEXES

COST O(log(n))

One Relation for each Node-type & Edge type
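
The indexed-adjacency idea can be sketched with one node relation and one edge relation. This is an illustration under assumptions: the relation names `person` and `knows` are hypothetical, a sorted list with binary search stands in for a B-tree index, and a Python dict stands in for the primary-key index on the node table.

```python
from bisect import bisect_left, bisect_right

# Node relation: id -> name (the dict plays the role of a primary-key index).
person = {1: "Ann", 2: "Bob", 3: "Cy", 4: "Dee"}

# Edge relation kept sorted by source id, standing in for an index on src.
knows = sorted([(1, 2), (1, 3), (2, 4), (3, 4)])
knows_src = [src for src, _ in knows]


def friends(pid):
    """person -> knows (index range scan) -> person (key lookup): 2 joins."""
    lo = bisect_left(knows_src, pid)     # O(log n) probe into the edge index
    hi = bisect_right(knows_src, pid)
    return [person[dst] for _, dst in knows[lo:hi]]


friends_of_ann = friends(1)
```

Every hop of a traversal repeats this index probe, which is where the O(log(n)) cost per step comes from.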

SLIDE 25

INDEX FREE ADJACENCY

  • DATA SPLIT IN FILES: Node/Edge/Label/Property stores
  • RECORDS OF FIXED SIZE
  • NODE/EDGE RECORDS CONTAIN ONLY DIRECT POINTERS
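
A minimal sketch of index-free adjacency, assuming a layout in which each node record points to its first outgoing edge record and each edge record points to the next one of the same source. List positions stand in for offsets of fixed-size records in a file; the record fields and function names are hypothetical and do not describe a specific system's on-disk format.

```python
from dataclasses import dataclass

NIL = -1  # marks the end of a chain


@dataclass
class NodeRecord:
    first_out: int = NIL     # position of the first outgoing edge record


@dataclass
class EdgeRecord:
    src: int
    dst: int
    next_out: int = NIL      # next outgoing edge of the same source node


nodes = [NodeRecord() for _ in range(4)]
edges = []


def add_edge(src, dst):
    """Prepend a new edge record to the source node's chain."""
    edges.append(EdgeRecord(src, dst, nodes[src].first_out))
    nodes[src].first_out = len(edges) - 1


def out_neighbors(v):
    """Follow direct pointers only; no index lookup is involved."""
    result, e = [], nodes[v].first_out
    while e != NIL:
        result.append(edges[e].dst)
        e = edges[e].next_out
    return result


add_edge(0, 1)
add_edge(0, 2)
add_edge(1, 3)
```

Because records have a fixed size, a record ID can be turned into a position directly, so each hop costs a constant number of pointer dereferences.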

SLIDE 26

INDEX FREE ADJACENCY (ALTERNATIVE)

VALUE IS EVERYTHING THAT IS SHARED BY MULTIPLE OBJECTS: Node Types, Edge Types, Attribute Values

BIT-MAPS: From object IDs to values, and from values to object IDs

[Figure: bitmap example with object IDs (OID) 1 to 5 and values A, B, C; each value has a bitmap marking the OIDs that carry it.]
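
The two-way mapping can be sketched with one dictionary from object IDs to values and one bitmap per value. This is only an illustration: the class `BitmapIndex` is hypothetical, Python integers are used as bitsets, and the sample assignment of OIDs to A, B, and C is assumed rather than read from the figure.

```python
from collections import defaultdict


class BitmapIndex:
    def __init__(self):
        self.value_of = {}               # object ID -> value
        self.bitmap = defaultdict(int)   # value -> int used as a bitset of OIDs

    def assign(self, oid, value):
        old = self.value_of.get(oid)
        if old is not None:
            self.bitmap[old] &= ~(1 << oid)   # clear the bit in the old bitmap
        self.value_of[oid] = value
        self.bitmap[value] |= 1 << oid        # set the bit in the new bitmap

    def objects_with(self, value):
        bits = self.bitmap.get(value, 0)
        return [oid for oid in range(bits.bit_length()) if (bits >> oid) & 1]


idx = BitmapIndex()
for oid, value in [(1, "A"), (2, "B"), (3, "A"), (4, "C"), (5, "A")]:
    idx.assign(oid, value)

with_a = idx.objects_with("A")
```

Intersecting two bitmaps with `&` would then answer conjunctive lookups, such as objects of a given type that also carry a given attribute value.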

SLIDE 27

BIG TABLE DATA MODEL

https://docs.janusgraph.org/advanced-topics/data-model/

SLIDE 28

WRAPPERS

  • Wrappers for Column Stores
  • Wrappers for Relational
  • Wrappers for Document/NoSQL
  • Wrappers for RDF Store