SLIDE 1
  • 11 July 2019

One Trillion Edges: Graph Processing at Facebook-Scale

Tong Niu tong.niu.cn@outlook.com

SLIDE 2


Outline

  • Introduction
  • Improvements
  • Experiment Results
  • Conclusion & Future Work
  • Discussion
SLIDE 3


Introduction

  • Graph structures (entities, connections)
  • Social networks
  • Facebook manages a social graph composed of people, their friendships, subscriptions, likes, posts, and many other connections: 1.39B active users in 2014 and more than 400B edges

SLIDE 4


Introduction

  • What is Apache Giraph?
  • “Think like a vertex” (see the sketch below)
  • Each vertex has an id, a value, a list of adjacent neighbors, and corresponding edge values
  • Bulk synchronous parallel (BSP)
  • Computation is broken into supersteps (iterations)
  • Messages are sent from one vertex to another during a superstep and delivered in the following superstep
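A minimal sketch of the "think like a vertex" model, assuming the Giraph 1.1-style BasicComputation API: every vertex starts with its own id as its label and repeatedly adopts the smallest label it receives, so connected components emerge after a few supersteps. The class name MinLabelComputation is hypothetical; the API calls (compute, sendMessageToAllEdges, voteToHalt) are the standard Giraph ones.

import java.io.IOException;
import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;

// Hypothetical example class; the same compute() runs on every vertex once
// per superstep (BSP).
public class MinLabelComputation
    extends BasicComputation<LongWritable, LongWritable, NullWritable, LongWritable> {

  @Override
  public void compute(Vertex<LongWritable, LongWritable, NullWritable> vertex,
                      Iterable<LongWritable> messages) throws IOException {
    // In superstep 0 the label is the vertex's own id.
    long label = getSuperstep() == 0 ? vertex.getId().get() : vertex.getValue().get();

    // Messages sent in the previous superstep are delivered here.
    for (LongWritable msg : messages) {
      label = Math.min(label, msg.get());
    }

    if (getSuperstep() == 0 || label < vertex.getValue().get()) {
      vertex.setValue(new LongWritable(label));
      // Tell the neighbors; they will see this message in the next superstep.
      sendMessageToAllEdges(vertex, new LongWritable(label));
    }

    // Halt; the vertex is woken up again if a message arrives later.
    vertex.voteToHalt();
  }
}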

SLIDE 5


Introduction

  • What is Apache Giraph?
SLIDE 6


Introduction

  • What is Apache Giraph?
  • Master – application coordinator
  • Assigns partitions to workers
  • Synchronizes supersteps
  • Worker – computation, messaging
  • Loads the graph from input splits
  • Does the computation/messaging for its assigned partitions
SLIDE 7


  • 1. Flexible vertex/edge-based input
  • Original input:
  • All data (vertex and edge) had to be read from the same record and was assumed to come from the same data source
  • Modified input:
  • Allow loading vertex data and edges from separate sources (see the configuration sketch below)
  • Add an arbitrary number of data sources
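A sketch of what this looks like on the configuration side, assuming the GiraphConfiguration setters from Giraph 1.x. MyVertexValueInputFormat and MyEdgeInputFormat are hypothetical placeholders for user-defined VertexInputFormat/EdgeInputFormat implementations, and MinLabelComputation is the hypothetical class from the earlier sketch.

import org.apache.giraph.conf.GiraphConfiguration;

// Sketch only: the two *InputFormat classes are hypothetical placeholders
// for formats a user would implement for their own data sources.
public class SeparateInputsExample {
  public static GiraphConfiguration configure() {
    GiraphConfiguration conf = new GiraphConfiguration();
    conf.setComputationClass(MinLabelComputation.class);
    // Vertex values are loaded from one source ...
    conf.setVertexInputFormatClass(MyVertexValueInputFormat.class);
    // ... while the edges come from a completely separate one.
    conf.setEdgeInputFormatClass(MyEdgeInputFormat.class);
    return conf;
  }
}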
SLIDE 8


  • 2. Parallelization support
  • Original:
  • Scheduled as a single MapReduce job
  • Modified:
  • Add more workers per machine
  • Use local multithreading to maximize resource utilization (see the sketch below)
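A sketch of the corresponding knobs: run one large worker per machine and let it use several threads for computation and input loading. The option keys are assumptions based on GiraphConstants in Giraph 1.x; verify the exact names against the version in use.

import org.apache.giraph.conf.GiraphConfiguration;

// Sketch: per-worker thread counts (option keys assumed from GiraphConstants).
public class ThreadTuningExample {
  public static void tune(GiraphConfiguration conf) {
    conf.setInt("giraph.numComputeThreads", 8); // threads running compute()
    conf.setInt("giraph.numInputThreads", 8);   // threads loading input splits
  }
}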
SLIDE 9


  • 3. Memory optimization
  • Original:
  • Large memory overhead because of flexibility
  • Modified:
  • Serialize the edges of every vertex into a byte array rather than using native direct serialization methods
  • Create an OutEdges interface that allows developers to plug in their own edge stores (see the sketch below)
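A sketch of how the edge-store choice is exposed, assuming the setOutEdgesClass setter and the ByteArrayEdges implementation from Giraph 1.x: ByteArrayEdges keeps a vertex's out-edges serialized in a single byte array instead of one object per edge, and any other class implementing the OutEdges interface can be plugged in the same way.

import org.apache.giraph.conf.GiraphConfiguration;
import org.apache.giraph.edge.ByteArrayEdges;

// Sketch: select a compact, serialized edge store for all vertices.
public class EdgeStoreExample {
  public static void useCompactEdgeStore(GiraphConfiguration conf) {
    conf.setOutEdgesClass(ByteArrayEdges.class);
  }
}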

SLIDE 10


  • 4. Sharded aggregators
  • Aggregators provide global computation (e.g., min/max values)
  • They provide efficient shared state across workers
  • Values aggregated during a superstep are made available to all workers in the next superstep (see the sketch below)
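A sketch of the aggregator API the slide refers to, assuming the Giraph 1.x MasterCompute and aggregator classes: the master registers a named aggregator once, vertices aggregate() into it during a superstep, and the combined value can be read with getAggregatedValue() in the next superstep. The name "max-value" is only illustrative.

import org.apache.giraph.aggregators.DoubleMaxAggregator;
import org.apache.giraph.master.DefaultMasterCompute;

// Sketch: register a global max aggregator once at the start of the job.
public class MaxValueMaster extends DefaultMasterCompute {
  public static final String MAX_AGG = "max-value"; // illustrative name

  @Override
  public void initialize() throws InstantiationException, IllegalAccessException {
    registerAggregator(MAX_AGG, DoubleMaxAggregator.class);
  }
}

// Inside a vertex computation, a vertex would contribute with
//   aggregate(MaxValueMaster.MAX_AGG, new DoubleWritable(someValue));
// and read the global result in the next superstep with
//   DoubleWritable globalMax = getAggregatedValue(MaxValueMaster.MAX_AGG);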
SLIDE 11


  • 4. Sharded aggregators
  • Original:
  • ZooKeeper znodes store the partially aggregated data from the workers; the master aggregates all of them and writes the result back to a znode for the workers to access
  • This becomes a bottleneck when every worker has plenty of data that needs to be aggregated
  • Modified:
  • Each aggregator is randomly assigned to one of the workers, which aggregates its partial values
  • The owning worker then distributes the final value to the master and the other workers

SLIDE 12


K-Means clustering

In a graph application, the input vectors are the vertices and the centroids are aggregators (a simplified sketch follows).
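A simplified sketch of that mapping, assuming the Giraph 1.1 BasicComputation/aggregator API, one-dimensional points, and K = 3 clusters. The class and the aggregator names (centroid.i, sum.i, count.i) are hypothetical and match the master-computation sketch on the next slide: each vertex reads the current centroid positions from aggregators, picks the nearest one, and aggregates its coordinate and a count into that cluster.

import java.io.IOException;
import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;

// Hypothetical sketch: vertices are the input points (1-D here for brevity),
// centroids live in aggregators named "centroid.0" .. "centroid.K-1".
public class KMeansAssignStep
    extends BasicComputation<LongWritable, DoubleWritable, NullWritable, NullWritable> {
  private static final int K = 3;

  @Override
  public void compute(Vertex<LongWritable, DoubleWritable, NullWritable> vertex,
                      Iterable<NullWritable> messages) throws IOException {
    double point = vertex.getValue().get();

    // Find the nearest centroid published through the aggregators.
    int best = 0;
    double bestDistance = Double.MAX_VALUE;
    for (int i = 0; i < K; i++) {
      DoubleWritable centroid = getAggregatedValue("centroid." + i);
      double distance = Math.abs(point - centroid.get());
      if (distance < bestDistance) {
        bestDistance = distance;
        best = i;
      }
    }

    // Contribute this point to the chosen cluster; the sums become visible
    // to the master and the other workers in the next superstep.
    aggregate("sum." + best, new DoubleWritable(point));
    aggregate("count." + best, new LongWritable(1));
    // Halting / number of iterations is controlled elsewhere and elided here.
  }
}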

SLIDE 13


  • 1. Worker phases
  • Add preApplication() to initialize the positions of the centroids
  • Add preSuperstep() to calculate the new position of each centroid before the next superstep
  • 2. Master computation
  • Centralized computation that runs prior to every superstep and can communicate with the workers via aggregators (see the sketch below)
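A sketch of that centralized step, assuming the Giraph 1.x DefaultMasterCompute API and reusing the hypothetical aggregator names from the assignment sketch above. The slide describes recomputing centroids either in the workers' preSuperstep() or in a master computation; this sketch takes the master-computation route, and initial centroid seeding plus the stopping condition are elided.

import org.apache.giraph.aggregators.DoubleSumAggregator;
import org.apache.giraph.aggregators.LongSumAggregator;
import org.apache.giraph.master.DefaultMasterCompute;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;

// Hypothetical sketch of the k-means master computation.
public class KMeansMaster extends DefaultMasterCompute {
  private static final int K = 3;

  @Override
  public void initialize() throws InstantiationException, IllegalAccessException {
    for (int i = 0; i < K; i++) {
      // Centroids persist across supersteps; sums/counts are reset each superstep.
      // The centroid aggregator is only a holder whose value the master overwrites.
      registerPersistentAggregator("centroid." + i, DoubleSumAggregator.class);
      registerAggregator("sum." + i, DoubleSumAggregator.class);
      registerAggregator("count." + i, LongSumAggregator.class);
    }
  }

  @Override
  public void compute() {
    if (getSuperstep() == 0) {
      return; // nothing has been aggregated yet; initial seeding is elided
    }
    for (int i = 0; i < K; i++) {
      DoubleWritable sum = getAggregatedValue("sum." + i);
      LongWritable count = getAggregatedValue("count." + i);
      if (count.get() > 0) {
        // Publish the new centroid position for the next superstep.
        setAggregatedValue("centroid." + i, new DoubleWritable(sum.get() / count.get()));
      }
    }
  }
}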

SLIDE 14


  • 3. Composable computation
  • Allows different message types, combiners, and computations to be combined into a powerful k-means application (see the sketch below)
  • 4. Superstep splitting
  • For a message-heavy superstep
  • Send a fragment of the messages to their destinations and do a partial computation during each iteration
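A rough sketch of the composable-computation idea, assuming a setComputation() hook on MasterCompute as in the Giraph 1.1 API: the master decides which Computation class (and hence which message types and combiners) runs in the next superstep. AssignStep and UpdateStep are hypothetical Computation classes.

import org.apache.giraph.master.DefaultMasterCompute;

// Rough sketch: alternate between two hypothetical computation classes.
public class AlternatingMaster extends DefaultMasterCompute {
  @Override
  public void compute() {
    if (getSuperstep() % 2 == 0) {
      setComputation(AssignStep.class);  // e.g. assign points to clusters
    } else {
      setComputation(UpdateStep.class);  // e.g. update cluster state
    }
  }
}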

SLIDE 15


Experiment results

SLIDE 16


Experiment results

  • Giraph (200 machines) vs. Hive (at least 200 machines)
  • Compare CPU time and elapsed time
  • Label propagation algorithm
  • Weighted PageRank
SLIDE 17


Conclusion & Future Work

We have described how a processing framework supports Facebook-scale production workloads and the improvements made to Giraph. Future work:

  • 1. Determine a good-quality graph partitioning prior to the computation
  • 2. Make the computation more asynchronous to improve convergence speed
  • 3. Leverage Giraph as a parallel machine-learning platform

SLIDE 18


Discussion