Graphs On Databases: a talk by Alekh Jindal at NEDB 2014, with Sam Madden, Mike Stonebraker, and Amol Deshpande (MIT and University of Maryland).


slide-1
SLIDE 1

Graphs On Databases

Alekh Jindal

Sam Madden, Mike Stonebraker, Amol Deshpande

MIT, University of Maryland

NEDB 2014

[Figure: the title slide draws these names as a graph, with edges labeled "talking on", "at", "supervisors", "collaborate", "work", and "sabbatical".]

slide-2
SLIDE 2

And they are growing bigger

slide-3
SLIDE 3
slide-4
SLIDE 4

X DATA -> X DATABASE

slide-5
SLIDE 5

Relational DATA -> Relational DATABASE

slide-6
SLIDE 6

Streaming DATA -> Streaming DATABASE

slide-7
SLIDE 7

XML DATA -> XML DATABASE

slide-8
SLIDE 8

RDF DATA -> RDF DATABASE

slide-9
SLIDE 9

Graph DATA -> Graph DATABASE

slide-10
SLIDE 10

DATA

DATABASE

slide-11
SLIDE 11

DATA

DATABASE

Physical Data Independence

Logical Data Independence

APPLICATIONS

slide-12
SLIDE 12
Barriers to “Graphs on Databases”

  • Graphs in relational model
  • Graph operations in SQL
  • Expressing iterative graph queries
  • Efficient graph analytics performance
  • Ease-of-use

slide-13
SLIDE 13

Graphs in Relational Model

[Figure: the example graph with nodes 1 to 5]

Nodes

id  value
1   1
2   1
3   1
4   1
5   1

Edges

fromId  toId  weight
1       2     1
2       4     1
2       5     1
3       2     1
5       1     1
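The two tables above can be loaded into any SQL engine. The following is a minimal sketch using Python's standard `sqlite3` module with an in-memory database. The choice of SQLite, the column types, and the helper name `build_example_graph` are assumptions made for illustration; the talk does not say which schema definitions were used.

```python
import sqlite3

def build_example_graph():
    """Create the slide's Nodes and Edges tables in an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    # Column types are an assumption; the slide only lists column names.
    conn.execute("CREATE TABLE Nodes (id INTEGER PRIMARY KEY, value INTEGER)")
    conn.execute("CREATE TABLE Edges (fromId INTEGER, toId INTEGER, weight INTEGER)")
    # Rows copied from the Nodes and Edges tables on the slide.
    conn.executemany("INSERT INTO Nodes VALUES (?, ?)", [(i, 1) for i in range(1, 6)])
    conn.executemany(
        "INSERT INTO Edges VALUES (?, ?, ?)",
        [(1, 2, 1), (2, 4, 1), (2, 5, 1), (3, 2, 1), (5, 1, 1)],
    )
    conn.commit()
    return conn

conn = build_example_graph()
node_count = conn.execute("SELECT count(*) FROM Nodes").fetchone()[0]
edge_count = conn.execute("SELECT count(*) FROM Edges").fetchone()[0]
print(node_count, edge_count)
```

Each directed edge is one row, so adding or removing an edge is an ordinary insert or delete.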

slide-14
SLIDE 14

Graph Operations in SQL

[Figure: the example graph with nodes 1 to 5]

  • Node access

Select * From Nodes Where Id=ID

  • Neighborhood access

Select * From Edges Where fromId=ID

  • Parallel neighborhood access

Select * From Edges Group By fromId

  • 1-hop neighbors

Select * From Edges e1,Edges e2 Where e1.toId=e2.fromId
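The four access patterns above can be tried on the example graph. This is a sketch under assumptions: it uses SQLite through Python's `sqlite3` module, picks node 2 as a hypothetical node of interest, and replaces `Select *` in the grouped query with an explicit `count(*)` aggregate, because many SQL engines reject ungrouped columns alongside `Group By`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Nodes (id INTEGER PRIMARY KEY, value INTEGER);
    CREATE TABLE Edges (fromId INTEGER, toId INTEGER, weight INTEGER);
    INSERT INTO Nodes VALUES (1,1),(2,1),(3,1),(4,1),(5,1);
    INSERT INTO Edges VALUES (1,2,1),(2,4,1),(2,5,1),(3,2,1),(5,1,1);
""")

node_id = 2  # hypothetical node of interest

# Node access: fetch one node by id.
node = conn.execute("SELECT * FROM Nodes WHERE id = ?", (node_id,)).fetchone()

# Neighborhood access: outgoing edges of one node.
neighbors = [row[0] for row in conn.execute(
    "SELECT toId FROM Edges WHERE fromId = ? ORDER BY toId", (node_id,))]

# Parallel neighborhood access: one group per source node, here aggregated
# into an out-degree (the aggregate is an illustrative choice).
out_degree = dict(conn.execute(
    "SELECT fromId, count(*) FROM Edges GROUP BY fromId"))

# Self-join: pair every edge with the edges leaving its destination node.
joined = conn.execute("""
    SELECT DISTINCT e1.fromId, e2.toId
    FROM Edges e1, Edges e2
    WHERE e1.toId = e2.fromId
    ORDER BY e1.fromId, e2.toId
""").fetchall()

print(node, neighbors, out_degree, joined)
```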

slide-15
SLIDE 15

Example: Shortest Paths

UPDATE Nodes AS node
SET value = new_node.value
FROM (
    SELECT e.toId AS Id, min(n1.value+1) AS value
    FROM Nodes AS n1, Edges AS e, Nodes AS n2
    WHERE n1.Id = e.fromId AND n2.Id = e.toId
    GROUP BY e.toId, n2.value
    HAVING min(n1.value+1) < n2.value
) AS new_node
WHERE node.Id = new_node.Id;

slide-16
SLIDE 16

Example: Shortest Paths

UPDATE Nodes AS node
SET value = new_node.value
FROM (
    SELECT e.toId AS Id, min(n1.value+1) AS value
    FROM Nodes AS n1, Edges AS e, Nodes AS n2
    WHERE n1.Id = e.fromId AND n2.Id = e.toId
    GROUP BY e.toId, n2.value
    HAVING min(n1.value+1) < n2.value
) AS new_node
WHERE node.Id = new_node.Id;

  • Parallel Graph Exploration
  • Nested Query
  • Sorting/Indexing

slide-17
SLIDE 17

Iterative Graph Queries

  • Driver program: UDF / Stored Procedure
  • Three Things:
      • initialization
      • actual graph query (in a loop)
      • termination condition

[Figure: the example graph with nodes 1 to 5]
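The three parts of a driver program can be sketched as a small loop. This is an illustration only: the helper `run_driver`, the use of a Python script instead of a UDF or stored procedure, and the reachability query (mark every node reachable from node 1) are all assumptions, chosen because the loop structure is easy to see in them.

```python
import sqlite3

def run_driver(conn, init_sql, step_sql, max_iterations=100):
    """Run init_sql once, then repeat step_sql until it modifies no rows."""
    conn.execute(init_sql)                          # 1. initialization
    for iteration in range(1, max_iterations + 1):
        changed = conn.execute(step_sql).rowcount   # 2. graph query in a loop
        if changed == 0:                            # 3. termination condition
            return iteration
    raise RuntimeError("no fixpoint reached within max_iterations")

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Nodes (id INTEGER PRIMARY KEY, value INTEGER);
    CREATE TABLE Edges (fromId INTEGER, toId INTEGER, weight INTEGER);
    INSERT INTO Nodes VALUES (1,1),(2,1),(3,1),(4,1),(5,1);
    INSERT INTO Edges VALUES (1,2,1),(2,4,1),(2,5,1),(3,2,1),(5,1,1);
""")

# Hypothetical query: value = 1 means "reachable from node 1", 0 means "not yet".
INIT_SQL = "UPDATE Nodes SET value = CASE WHEN id = 1 THEN 1 ELSE 0 END"
STEP_SQL = """
    UPDATE Nodes SET value = 1
    WHERE value = 0
      AND id IN (SELECT e.toId
                 FROM Edges AS e JOIN Nodes AS n ON n.id = e.fromId
                 WHERE n.value = 1)
"""

run_driver(conn, INIT_SQL, STEP_SQL)
reached = [row[0] for row in conn.execute(
    "SELECT id FROM Nodes WHERE value = 1 ORDER BY id")]
print(reached)
```

The loop stops when an iteration updates no rows, which is the same kind of termination condition the shortest-paths example uses.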

slide-18
SLIDE 18

Example: Shortest Paths

Initialization:

  1. Set the value of the start node to 0
  2. Set the value of all other nodes to inf

Loop: the shortest paths SQL

UPDATE Nodes AS node
SET value = new_node.value
FROM (
    SELECT e.toId AS Id, min(n1.value+1) AS value
    FROM Nodes AS n1, Edges AS e, Nodes AS n2
    WHERE n1.Id = e.fromId AND n2.Id = e.toId
    GROUP BY e.toId, n2.value
    HAVING min(n1.value+1) < n2.value
) AS new_node
WHERE node.Id = new_node.Id;

Termination Condition: no more nodes to update
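The initialization, loop, and termination condition for shortest paths can be put together as follows. This is a sketch under stated assumptions: it runs on SQLite through Python's `sqlite3` module, uses a large integer `INF` as a stand-in for infinity, and splits the slide's `UPDATE ... FROM` into a `SELECT` plus a separate `UPDATE`, since `UPDATE ... FROM` is not available in every SQL engine or version. The inner `SELECT` is the one from the slide.

```python
import sqlite3

INF = 10**9  # assumed sentinel for "infinity"; any value above every path length works

# Inner query from the slide: nodes whose distance improves via an incoming edge.
CANDIDATES_SQL = """
    SELECT e.toId AS Id, min(n1.value + 1) AS value
    FROM Nodes AS n1, Edges AS e, Nodes AS n2
    WHERE n1.Id = e.fromId AND n2.Id = e.toId
    GROUP BY e.toId, n2.value
    HAVING min(n1.value + 1) < n2.value
"""

def shortest_paths(conn, start_node):
    # Initialization: start node gets 0, every other node gets INF.
    conn.execute(
        "UPDATE Nodes SET value = CASE WHEN id = ? THEN 0 ELSE ? END",
        (start_node, INF),
    )
    while True:
        improved = conn.execute(CANDIDATES_SQL).fetchall()
        if not improved:  # termination: no more nodes to update
            break
        conn.executemany(
            "UPDATE Nodes SET value = ? WHERE id = ?",
            [(value, node_id) for node_id, value in improved],
        )
    return dict(conn.execute("SELECT id, value FROM Nodes"))

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Nodes (id INTEGER PRIMARY KEY, value INTEGER);
    CREATE TABLE Edges (fromId INTEGER, toId INTEGER, weight INTEGER);
    INSERT INTO Nodes VALUES (1,1),(2,1),(3,1),(4,1),(5,1);
    INSERT INTO Edges VALUES (1,2,1),(2,4,1),(2,5,1),(3,2,1),(5,1,1);
""")
distances = shortest_paths(conn, start_node=1)
print(distances)
```

On the example graph, node 3 has no incoming edges, so it keeps the sentinel value.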

slide-19
SLIDE 19

Efficient Graph Analytics

  • Three SQL Databases:
      • row store
      • column store
      • main-memory store
  • Two Graph Databases:
      • transactional graph database
      • graph analytics system
  • Two queries: PageRank, Shortest Paths
  • Social network dataset from snap.stanford.edu/data
slide-20
SLIDE 20

PageRank

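[Figure: bar chart of PageRank running time in seconds (log scale, 1 to 10000) on the Twitter, GPlus, and LiveJournal datasets. Systems: Graph Database, Main-memory Database, Row Store Database, Apache Giraph, Column Store Database. Bar labels in extraction order: 29.4, 4.2, 3.3, 218.1, 53.5, 47.0, 4,172.4, 101.5, 17.4, 28.0, 589.0.]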
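The slides do not show the PageRank query that was benchmarked. As a rough illustration of how one PageRank iteration can be written relationally, here is a sketch of the textbook formulation on the five-node example graph. Everything in it is an assumption: the damping factor 0.85, the fixed 30 iterations, the lack of special handling for nodes without outgoing edges, and the use of SQLite through Python's `sqlite3` module.

```python
import sqlite3

DAMPING = 0.85  # conventional damping factor, assumed here

# New rank of each node: (1 - d) / N + d * sum(rank(src) / outdegree(src))
# over incoming edges. Nodes without incoming edges get only the first term.
RANK_SQL = """
    SELECT n.id,
           (1 - :d) / :n + :d * COALESCE(SUM(src.value / deg.outdeg), 0) AS new_value
    FROM Nodes AS n
    LEFT JOIN Edges AS e ON e.toId = n.id
    LEFT JOIN Nodes AS src ON src.id = e.fromId
    LEFT JOIN (SELECT fromId, count(*) AS outdeg
               FROM Edges GROUP BY fromId) AS deg
           ON deg.fromId = e.fromId
    GROUP BY n.id
"""

def pagerank(conn, iterations=30):
    n = conn.execute("SELECT count(*) FROM Nodes").fetchone()[0]
    conn.execute("UPDATE Nodes SET value = ?", (1.0 / n,))  # uniform start
    for _ in range(iterations):
        new_ranks = conn.execute(RANK_SQL, {"d": DAMPING, "n": n}).fetchall()
        conn.executemany(
            "UPDATE Nodes SET value = ? WHERE id = ?",
            [(value, node_id) for node_id, value in new_ranks],
        )
    return dict(conn.execute("SELECT id, value FROM Nodes"))

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Nodes (id INTEGER PRIMARY KEY, value REAL);
    CREATE TABLE Edges (fromId INTEGER, toId INTEGER, weight INTEGER);
    INSERT INTO Nodes VALUES (1,1),(2,1),(3,1),(4,1),(5,1);
    INSERT INTO Edges VALUES (1,2,1),(2,4,1),(2,5,1),(3,2,1),(5,1,1);
""")
ranks = pagerank(conn)
print(ranks)
```

Node 2 ends up with the highest rank in this sketch because it receives all of the rank flowing out of nodes 1 and 3.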

slide-21
SLIDE 21

Shortest Paths

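[Figure: bar chart of Shortest Paths running time in seconds (log scale, 1 to 100000) on the Twitter, GPlus, and LiveJournal datasets. Systems: Graph Database, Main-memory Database, Row Store Database, Apache Giraph, Column Store Database. Bar labels in extraction order: 135.1, 9.2, 6.7, 115.5, 50.8, 43.7, 18,702.2, 492.9, 74.5, 29.1, 395.6.]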

slide-22
SLIDE 22

Ease-of-Use

SQL

UPDATE Nodes AS node
SET value = new_node.value
FROM (
    SELECT e.toId AS Id, min(n1.value+1) AS value
    FROM Nodes AS n1, Edges AS e, Nodes AS n2
    WHERE n1.Id = e.fromId AND n2.Id = e.toId
    GROUP BY e.toId, n2.value
    HAVING min(n1.value+1) < n2.value
) AS new_node
WHERE node.Id = new_node.Id;

Pregel

void compute(vector<float> messages){
    // get the minimum distance; FLT_MAX (from <cfloat>) stands in for infinity
    float mindist = id==START_NODE ? 0 : FLT_MAX;
    for(vector<float>::iterator it = messages.begin(); it != messages.end(); ++it)
        mindist = min(mindist,*it);
    // send messages to all edges if new minimum is found
    float vvalue = getVertexValue();
    if(mindist < vvalue){
        modifyVertexValue(mindist);
        vector<int> edges = getOutEdges();
        for(vector<int>::iterator it = edges.begin(); it != edges.end(); ++it)
            sendMessage(*it, mindist+1);
    }
    // halt
    voteToHalt();
}

slide-23
SLIDE 23

Ease-of-Use

SQL Pregel

Nodes

id  value
1   1
2   1
3   1
4   1
5   1

Edges

fromId  toId  weight
1       2     1
2       4     1
2       5     1
3       2     1
5       1     1

[Figure: the example graph with nodes 1 to 5]

slide-24
SLIDE 24

Vertex-centric Query Interface

DATA
DATABASE
APPLICATION

Physical Data Independence
Logical Data Independence

Vertex Programs

Pregel-style API:

  • getMessages()
  • getEdges()
  • sendMessages()
  • voteToHalt(), etc.

Vertex UDF

Invokes the vertex program if:

  • the vertex is active, or
  • the vertex has incoming messages

Coordinator

  • Synchronizes supersteps
  • Redistributes messages

Vertex (V), Edge (E), Message (M)

slide-25
SLIDE 25

Vertex-centric Query Interface

DATA
DATABASE
APPLICATION

Physical Data Independence
Logical Data Independence

Vertex Programs

Pregel-style API:

  • getMessages()
  • getEdges()
  • sendMessages()
  • voteToHalt(), etc.

Vertex UDF

Invokes the vertex program if:

  • the vertex is active, or
  • the vertex has incoming messages

Coordinator

  • Synchronizes supersteps
  • Redistributes messages

Vertex (V), Edge (E), Message (M)

  • Union
  • Batching
  • No in-place updates
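How a coordinator, a vertex UDF, and a vertex program fit together can be sketched in a few lines. This is a plain in-memory simulation, not the system from the talk: the class `Vertex`, the functions `run_supersteps` and `shortest_path_program`, and the choice of node 1 as the start node are hypothetical names and values introduced for illustration. The database-side pieces (the V, E, and M tables) are replaced by Python dictionaries and lists.

```python
INF = float("inf")
START_NODE = 1  # hypothetical choice of source vertex

class Vertex:
    """Minimal vertex state with a Pregel-like API (names are illustrative)."""
    def __init__(self, vertex_id, out_edges):
        self.id = vertex_id
        self.value = INF
        self.out_edges = out_edges
        self.active = True
        self.outbox = []

    def send_message(self, target, message):
        self.outbox.append((target, message))

    def vote_to_halt(self):
        self.active = False

def shortest_path_program(vertex, messages):
    """Vertex program: keep the smallest distance seen and forward improvements."""
    mindist = 0 if vertex.id == START_NODE else INF
    mindist = min([mindist, *messages])
    if mindist < vertex.value:
        vertex.value = mindist
        for target in vertex.out_edges:
            vertex.send_message(target, mindist + 1)
    vertex.vote_to_halt()

def run_supersteps(vertices, program, max_supersteps=50):
    """Coordinator: synchronize supersteps and redistribute messages."""
    inbox = {vid: [] for vid in vertices}
    supersteps = 0
    while supersteps < max_supersteps:
        # Invoke the program only for active vertices or vertices with messages.
        runnable = [v for v in vertices.values() if v.active or inbox[v.id]]
        if not runnable:
            break
        for v in runnable:
            v.active = True  # an incoming message reactivates a halted vertex
            program(v, inbox[v.id])
        # Deliver this superstep's messages at the start of the next one.
        inbox = {vid: [] for vid in vertices}
        for v in runnable:
            for target, message in v.outbox:
                inbox[target].append(message)
            v.outbox = []
        supersteps += 1
    return supersteps

edges = [(1, 2), (2, 4), (2, 5), (3, 2), (5, 1)]
vertices = {i: Vertex(i, [t for f, t in edges if f == i]) for i in range(1, 6)}
run_supersteps(vertices, shortest_path_program)
distances = {vid: v.value for vid, v in vertices.items()}
print(distances)
```

Messages sent in one superstep are only visible in the next, which is what lets the coordinator process all runnable vertices of a superstep in any order.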

slide-26
SLIDE 26

PageRank (Vertex)

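[Figure: bar chart of vertex-centric PageRank running time in seconds (log scale, 1 to 100000) on the Twitter, GPlus, and LiveJournal datasets. Systems: Main-memory Database, Apache Giraph, Column Store Database. Bar labels in extraction order: 335.5, 47.7, 10.9, 218.1, 53.5, 47.0, 2,071.0, 421.5.]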

slide-27
SLIDE 27

Shortest Paths (Vertex)

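[Figure: bar chart of vertex-centric Shortest Paths running time in seconds (log scale, 1 to 10000) on the Twitter, GPlus, and LiveJournal datasets. Systems: Main-memory Database, Apache Giraph, Column Store Database. Bar labels in extraction order: 146.3, 23.8, 10.6, 115.5, 50.8, 43.7, 7,950.1, 712.2, 121.0.]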

slide-28
SLIDE 28

Vertex-centric interface allows…

  • Connected Components
  • Random Walks with Restart
  • Stochastic Gradient Descent
  • Or other message-passing algorithms

… right within the database system!
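As one example from the list, connected components can be phrased as message passing: every vertex starts with its own id as a label, and repeatedly adopts the smallest label it receives. The sketch below is a standalone in-memory illustration and not the implementation from the talk. The function name `connected_components`, the treatment of edges as undirected, and the extra edge (6, 7) added to get a second component are all assumptions.

```python
def connected_components(neighbors):
    """Label each vertex with the smallest vertex id in its component."""
    labels = {v: v for v in neighbors}
    # Superstep 0: every vertex sends its own label to all neighbors.
    inbox = {v: [] for v in neighbors}
    for v, nbrs in neighbors.items():
        for n in nbrs:
            inbox[n].append(labels[v])
    # Later supersteps: a vertex sends again only when its label got smaller.
    while any(inbox.values()):
        outbox = {v: [] for v in neighbors}
        for v, messages in inbox.items():
            if messages and min(messages) < labels[v]:
                labels[v] = min(messages)
                for n in neighbors[v]:
                    outbox[n].append(labels[v])
        inbox = outbox
    return labels

# Example edges from the slides, read as undirected, plus a hypothetical
# edge (6, 7) so that the graph has two components.
edges = [(1, 2), (2, 4), (2, 5), (3, 2), (5, 1), (6, 7)]
neighbors = {v: set() for v in range(1, 8)}
for a, b in edges:
    neighbors[a].add(b)
    neighbors[b].add(a)

components = connected_components(neighbors)
print(components)
```

The loop ends when a superstep produces no messages, since labels can only decrease and a vertex stays silent unless its label changed.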

slide-29
SLIDE 29
Advantages of “Graphs on Databases”

  • Running arbitrary SQL queries
  • Pre- and post-processing of data
  • Updates are trivial
  • ACID for free
  • Don’t need to deal with Yet-Another-System!

slide-30
SLIDE 30

Summary

  • Graph analytics can be mapped to relational queries (plus UDFs)
  • SQL systems can offer very good performance over relational queries
  • We can extend SQL systems to provide more graph-natural query interfaces