GraphFrames: An Integrated API for Mixing Graph and Relational Queries


slide-1
SLIDE 1

GraphFrames: An Integrated API for Mixing Graph and Relational Queries

Ankur Dave, UC Berkeley AMPLab

Joint work with Alekh Jindal (Microsoft), Li Erran Li (Uber), Reynold Xin (Databricks), Joseph Gonzalez (UC Berkeley), and Matei Zaharia (MIT and Databricks)

UC BERKELEY

slide-2
SLIDE 2

Trend: Unified Graph Analysis

  • 2009: Spark (relational queries)
  • 2013: Apache Spark + GraphX (+ graph algorithms)
  • 2016: Apache Spark + GraphFrames (+ graph queries)

slide-3
SLIDE 3

Graph Algorithms vs. Graph Queries

Graph Algorithms: PageRank, Alternating Least Squares

Graph Queries

slide-4
SLIDE 4

Graph Algorithms vs. Graph Queries

Graph Algorithm: PageRank

Graph Query: Wikipedia Collaborators

[Figure: Editor 1 and Editor 2 both edit Article 1 and Article 2, each on the same day.]

slide-5
SLIDE 5

Graph Algorithms vs. Graph Queries

Graph Algorithm: PageRank

    // Iterate until convergence
    wikipedia.pregel(
      sendMsg = { e => e.sendToDst(e.srcRank * e.weight) },
      mergeMsg = _ + _,
      vprog = { (id, oldRank, msgSum) => 0.15 + 0.85 * msgSum })
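The three callbacks on the slide can be mimicked on plain Scala collections. The sketch below is not the GraphX or GraphFrames API: the `Edge` type and the `pageRank` helper are hypothetical, the edge weight is assumed to be 1 / out-degree of the source, and a fixed iteration count stands in for a convergence test.

```scala
// Toy Pregel-style PageRank on plain collections (hypothetical helper, not a library API).
final case class Edge(src: Int, dst: Int)

def pageRank(vertices: Seq[Int], edges: Seq[Edge], iterations: Int): Map[Int, Double] = {
  // Assumed edge weight: 1 / outDegree(src), so each vertex splits its rank evenly.
  val outDegree: Map[Int, Int] = edges.groupBy(_.src).map { case (v, es) => (v, es.size) }
  var ranks: Map[Int, Double] = vertices.map(v => (v, 1.0)).toMap
  for (_ <- 1 to iterations) {
    // sendMsg: every edge sends srcRank * weight to its destination.
    val messages: Seq[(Int, Double)] =
      edges.map(e => (e.dst, ranks(e.src) / outDegree(e.src)))
    // mergeMsg: sum the messages arriving at each vertex.
    val msgSum: Map[Int, Double] =
      messages.groupBy(_._1).map { case (v, ms) => (v, ms.map(_._2).sum) }
    // vprog: damped update; a vertex with no incoming messages sees a sum of 0.
    ranks = vertices.map(v => (v, 0.15 + 0.85 * msgSum.getOrElse(v, 0.0))).toMap
  }
  ranks
}

// Usage: on a 3-cycle every vertex receives exactly its neighbor's full rank.
val cycleRanks = pageRank(Seq(1, 2, 3), Seq(Edge(1, 2), Edge(2, 3), Edge(3, 1)), iterations = 10)
```

On the cycle, all ranks stay at (approximately) 1.0, which is a quick sanity check that the send, merge, and update steps are wired together correctly.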

Graph Query: Wikipedia Collaborators

    wikipedia.find(
        "(u1)-[e11]->(article1); (u2)-[e21]->(article1); " +
        "(u1)-[e12]->(article2); (u2)-[e22]->(article2)")
      .select(
        "*",
        "e11.date - e21.date".as("d1"),
        "e12.date - e22.date".as("d2"))
      .sort("d1 + d2".desc)
      .take(10)
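To make the four-edge pattern concrete, here is a minimal sketch that evaluates it over an in-memory edit log. Everything in it is an assumption for illustration: the `Edit` and `Match` types, the sample data, and the choice to rank by the smallest total absolute date gap (the slide itself sorts signed differences in descending order).

```scala
// Toy edit log: (user, article, day). All names and values here are made up.
final case class Edit(user: String, article: String, day: Int)
final case class Match(u1: String, u2: String, a1: String, a2: String, gap: Int)

def collaborators(edits: Seq[Edit]): Seq[Match] = {
  // Index edits by article so each pattern edge becomes a lookup, like a hash join.
  val byArticle: Map[String, Seq[Edit]] = edits.groupBy(_.article)
  val articles: Seq[String] = byArticle.keys.toSeq.sorted
  val matches = for {
    a1  <- articles
    a2  <- articles if a1 < a2                    // two distinct articles, each pair once
    e11 <- byArticle(a1)                          // (u1)-[e11]->(article1)
    e21 <- byArticle(a1) if e11.user < e21.user   // (u2)-[e21]->(article1), distinct users
    e12 <- byArticle(a2) if e12.user == e11.user  // (u1)-[e12]->(article2)
    e22 <- byArticle(a2) if e22.user == e21.user  // (u2)-[e22]->(article2)
  } yield Match(e11.user, e21.user, a1, a2,
    math.abs(e11.day - e21.day) + math.abs(e12.day - e22.day))
  matches.sortBy(_.gap)                           // smallest total gap first (assumed ranking)
}

val sampleEdits = Seq(
  Edit("alice", "Spark", 1), Edit("bob", "Spark", 1),
  Edit("alice", "Graphs", 5), Edit("bob", "Graphs", 7),
  Edit("carol", "Spark", 9))
val found = collaborators(sampleEdits)
```

Only alice and bob share two articles in the sample data, so `found` holds a single match; carol edits one article and never satisfies the pattern.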

slide-6
SLIDE 6

Separate Systems

Graph Algorithms Graph Queries

slide-7
SLIDE 7

Problem: Mixed Graph Analysis

[Figure: analysis pipeline starting from raw Wikipedia XML, with stages labeled Text Table, Edit Table, Edit Graph, Hyperlinks, PageRank, Frequent Collaborators, and Vandalism Suspects. Tables carry columns such as Article, Text, and User.]

slide-8
SLIDE 8

Solution: GraphFrames

[Figure: architecture diagram with components Graph Algorithms, Graph Queries, GraphFrames API, Pattern Query Optimizer, and Spark SQL.]

slide-9
SLIDE 9

GraphFrames API

  • Unifies graph algorithms, graph queries, and relational operations (DataFrames)
  • Designed for interactive use
  • Available in Scala, Java, and Python

    class GraphFrame {
      def vertices: DataFrame
      def edges: DataFrame
      def find(pattern: String): DataFrame
      def registerView(pattern: String, df: DataFrame): Unit
      def degrees(): DataFrame
      def pageRank(): GraphFrame
      def connectedComponents(): GraphFrame
      ...
    }

slide-10
SLIDE 10

Implementation

[Figure: query pipeline. Query String → Parsed Pattern → Logical Plan (Graph–Relational Translation) → Optimized Logical Plan (Join Elimination and Reordering; View Selection over Materialized Views) → DataFrame Result (Spark SQL). Graph Algorithms run on GraphX.]

slide-11
SLIDE 11

Graph–Relational Translation

[Figure: a pattern edge C → D is added to an existing logical plan with output A, B, C by joining the Edge Table (Src, Dst) on C = Src, then the Vertex Table (ID, Attr) on D = ID.]
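The translation step can be sketched as follows: each pattern edge scans the edge table under its own alias, and any vertex name already bound by an earlier edge becomes an equality join condition. The `PatternEdge` type and both helper functions are hypothetical illustrations, not GraphFrames internals, and the follow-up joins against the vertex table are left out.

```scala
final case class PatternEdge(src: String, name: String, dst: String)

// Parse "(a)-[e]->(b); (b)-[f]->(c)" into pattern edges.
def parsePattern(pattern: String): Seq[PatternEdge] = {
  val EdgeRe = """\((\w+)\)-\[(\w+)\]->\((\w+)\)""".r
  pattern.split(";").toSeq.map(_.trim).filter(_.nonEmpty).map {
    case EdgeRe(s, e, d) => PatternEdge(s, e, d)
    case other => throw new IllegalArgumentException(s"Bad pattern term: $other")
  }
}

// Emit one join condition per repeated vertex name, in pattern order.
def joinConditions(edges: Seq[PatternEdge]): Seq[String] = {
  val bound = scala.collection.mutable.Map.empty[String, String] // vertex -> first column binding it
  val conds = scala.collection.mutable.ArrayBuffer.empty[String]
  for {
    e <- edges
    (vertex, column) <- Seq((e.src, s"${e.name}.src"), (e.dst, s"${e.name}.dst"))
  } {
    bound.get(vertex) match {
      case Some(first) => conds += s"$first = $column" // vertex seen before: join on it
      case None        => bound(vertex) = column       // first occurrence: just bind it
    }
  }
  conds.toSeq
}

// Usage: a two-hop path shares vertex b between the two edge-table scans.
val pathConds = joinConditions(parsePattern("(a)-[e1]->(b); (b)-[e2]->(c)"))
```

For the two-hop path, the only shared vertex is `b`, so the single condition is `e1.dst = e2.src`, which mirrors the "C = Src" join in the figure.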

slide-12
SLIDE 12

Join Elimination

Edges

    Src  Dst
    1    2
    1    3
    2    3
    2    5

Vertices

    ID  Attr
    1   A
    2   B
    3   C
    4   D

SELECT src, dst FROM edges INNER JOIN vertices ON src = id;

Unnecessary join can be eliminated if tables satisfy referential integrity, simplifying graph–relational translation:

SELECT src, dst FROM edges;
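A small check of this rewrite on the slide's tables, written with plain collections rather than Spark SQL. The row types and function names are hypothetical; the redundancy test assumes the condition is "vertex IDs are unique and every edge source appears as an ID".

```scala
final case class EdgeRow(src: Int, dst: Int)
final case class VertexRow(id: Int, attr: String)

val edgeTable = Seq(EdgeRow(1, 2), EdgeRow(1, 3), EdgeRow(2, 3), EdgeRow(2, 5))
val vertexTable = Seq(VertexRow(1, "A"), VertexRow(2, "B"), VertexRow(3, "C"), VertexRow(4, "D"))

// SELECT src, dst FROM edges INNER JOIN vertices ON src = id
def withJoin(edges: Seq[EdgeRow], vertices: Seq[VertexRow]): Seq[(Int, Int)] =
  for (e <- edges; v <- vertices if e.src == v.id) yield (e.src, e.dst)

// SELECT src, dst FROM edges
def withoutJoin(edges: Seq[EdgeRow]): Seq[(Int, Int)] =
  edges.map(e => (e.src, e.dst))

// The join neither drops nor duplicates rows when ids are unique and every src is an id.
def joinIsRedundant(edges: Seq[EdgeRow], vertices: Seq[VertexRow]): Boolean = {
  val ids = vertices.map(_.id)
  ids.distinct.size == ids.size && edges.forall(e => ids.contains(e.src))
}
```

Because the join is on `src` only, the edge with `dst = 5` (which has no vertex row) does not affect the result; both queries return the same four rows for these tables.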

slide-13
SLIDE 13

Materialized View Selection

GraphX: Triplet view enabled efficient message-passing algorithms

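[Figure: a graph with vertices A, B, C, D and edges A → B, A → C, B → C, C → D. The triplet view lists the same four edges together with their endpoint vertices and is used to compute updated PageRanks.]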
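A triplet view can be thought of as the edge table pre-joined with the attributes of both endpoints. The sketch below builds one for the four-vertex graph in the figure; the `Triplet` type, the `tripletView` helper, and the vertex attribute values are assumptions for illustration, not the GraphX data structure.

```scala
final case class Triplet(src: String, srcAttr: Double, dst: String, dstAttr: Double)

// Join each edge with the attributes of its source and destination vertices.
def tripletView(vertexAttr: Map[String, Double], edges: Seq[(String, String)]): Seq[Triplet] =
  edges.map { case (s, d) => Triplet(s, vertexAttr(s), d, vertexAttr(d)) }

// Graph from the figure; the attribute values are made up.
val attrs = Map("A" -> 1.0, "B" -> 2.0, "C" -> 3.0, "D" -> 4.0)
val graphEdges = Seq(("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"))
val triplets = tripletView(attrs, graphEdges)
```

Once the view is materialized, a message-passing step only needs to scan `triplets`, since each row already carries everything a `sendMsg` function would read.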

slide-14
SLIDE 14

Materialized View Selection

GraphFrames: User-defined views enable efficient graph queries

[Figure: the same graph (vertices A, B, C, D; edges A → B, A → C, B → C, C → D) with its triplet view and user-defined views, feeding PageRank, Community Detection, … and Graph Queries.]

slide-15
SLIDE 15

Join Reordering

[Figure: panels "Example Query", "Left-Deep Plan", and "Bushy Plan" for a pattern with edges A → B, B → A, B → D, C → B, B → E, C → D, C → E. Joins are labeled with their keys (A, B; B; C, D; C, E; B, C), and one subtree is marked "User-Defined View".]

slide-16
SLIDE 16

Query Planning Algorithm

Dynamic programming algorithm based on:

  • J. Huang, K. Venkatraman, and D.J. Abadi. Query optimization of distributed pattern matching. In ICDE 2014.

1. Considers all left-deep plans, and a subset of bushy plans
  • Bushy plans to explore are chosen using layered-DAG and cycle-detection heuristics
2. Considers using each view that is exactly equivalent to a plan subtree
  • Result: Selects the largest of multiple hierarchically contained views
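
The left-deep part of such a planner can be sketched as a dynamic program over subsets of pattern edges. This is a simplified illustration, not the GraphFrames optimizer or the algorithm from the cited paper: bushy plans and view substitution are omitted, the `QueryEdge` and `Plan` types are hypothetical, and the cost model (sum of estimated intermediate sizes supplied by the caller) is an assumption.

```scala
final case class QueryEdge(src: String, dst: String)
final case class Plan(order: List[Int], cost: Long)

// best(S) = cheapest left-deep order joining exactly the edges in S,
// extending only with edges that share a vertex (no cross products).
def bestLeftDeepPlan(edges: Vector[QueryEdge], estimate: Set[Int] => Long): Option[Plan] = {
  val all: Set[Int] = edges.indices.toSet
  def ends(i: Int): Set[String] = Set(edges(i).src, edges(i).dst)
  def sharesVertex(i: Int, joined: Set[Int]): Boolean =
    joined.exists(j => ends(i).intersect(ends(j)).nonEmpty)

  val best = scala.collection.mutable.Map.empty[Set[Int], Plan]
  for (i <- all) { best(Set(i)) = Plan(List(i), estimate(Set(i))) }
  for (size <- 2 to edges.size; subset <- all.subsets(size)) {
    val candidates = subset.toList.flatMap { last =>
      val rest = subset - last
      best.get(rest).toList
        .filter(_ => sharesVertex(last, rest))
        .map(prev => Plan(prev.order :+ last, prev.cost + estimate(subset)))
    }
    if (candidates.nonEmpty) { best(subset) = candidates.minBy(_.cost) }
  }
  best.get(all)
}

// Usage: chain A -> B -> C -> D with made-up size estimates per joined edge set.
val chain = Vector(QueryEdge("A", "B"), QueryEdge("B", "C"), QueryEdge("C", "D"))
val sizes: Map[Set[Int], Long] = Map(
  Set(0) -> 100L, Set(1) -> 10L, Set(2) -> 1000L,
  Set(0, 1) -> 50L, Set(1, 2) -> 500L, Set(0, 1, 2) -> 200L)
val chosen = bestLeftDeepPlan(chain, sizes)
```

With these estimates the planner starts from the smallest edge (index 1), joins edge 0 next, and adds the large edge 2 last, for a total cost of 260. The disconnected pair {0, 2} never gets a plan, so no estimate is needed for it.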
slide-17
SLIDE 17

Evaluation

Faster than Neo4j for unanchored pattern queries

[Figure: two bar charts of query latency (s) for GraphFrames and Neo4j. Left: "Anchored Pattern Query" (axis ticks 0.5 to 2.5). Right: "Unanchored Pattern Query" (axis ticks 10 to 80).]

Triangle query on 1M edge subgraph of web-Google. Each system configured to use a single core.

slide-18
SLIDE 18

Evaluation

Approaches performance of GraphX for graph algorithms using Spark SQL whole-stage code generation

[Figure: "PageRank Performance" bar chart of per-iteration runtime (s, axis ticks 1 to 7) for GraphFrames, GraphX, and Naïve Spark.]

Per-iteration performance on web-Google, single 8-core machine. Naïve Spark uses Scala RDD API.

slide-19
SLIDE 19

Evaluation

Registering the right views can greatly improve performance for some queries

Workload: J. Huang, K. Venkatraman, and D.J. Abadi. Query optimization of distributed pattern matching. In ICDE 2014.

slide-20
SLIDE 20

Future Work

  • Suggest views automatically
  • Exploit attribute-based partitioning in optimizer
  • Code generation for single node
slide-21
SLIDE 21

Try It Out!

Released as a Spark Package at: https://github.com/graphframes/graphframes

Thanks to Joseph Bradley, Xiangrui Meng, and Timothy Hunter.

ankurd@eecs.berkeley.edu