SLIDE 1

Piccolo: Building fast distributed programs with partitioned tables

Russell Power, Jinyang Li
New York University

SLIDE 2

Motivating Example: PageRank

for each node X in graph:
    for each edge X→Z:
        next[Z] += curr[X]

Repeat until convergence

AB,C,D BE CD … A: 0.12 B: 0.15 C: 0.2 … A: 0 B: 0 C: 0 …

Curr Next Input Graph

A: 0.2 B: 0.16 C: 0.21 … A: 0.2 B: 0.16 C: 0.21 … A: 0.25 B: 0.17 C: 0.22 … A: 0.25 B: 0.17 C: 0.22 …

Fits in memory!
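
To make the loop concrete: a minimal single-machine sketch in plain Python (not Piccolo code; the damping factor d and the division by out-degree are standard PageRank details the simplified pseudocode above omits, and the graph is assumed to be a dict mapping every page to its outgoing links).

def pagerank(graph, iters=50, d=0.85):
    n = len(graph)
    curr = {x: 1.0 / n for x in graph}           # uniform initial ranks
    for _ in range(iters):
        next_ = {x: (1 - d) / n for x in graph}  # teleport mass
        for x, out in graph.items():
            for z in out:                        # edge x -> z
                next_[z] += d * curr[x] / len(out)
        curr = next_
    return curr

ranks = pagerank({'A': ['B', 'C'], 'B': ['C'], 'C': ['A']})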

SLIDE 3

PageRank in MapReduce

[Diagram: MapReduce workers 1–3 over distributed storage; each iteration joins a graph stream (A->B,C; B->D; …) with a rank stream (A:0.1, B:0.2; …) and emits a new rank stream]

  • Data flow models do not expose global state.
SLIDE 5

PageRank With MPI/RPC

[Diagram: workers 1–3 each hold a graph partition (A->B,C / B->D / C->E,F) and its rank partition (A: 0 / B: 0 / C: 0) in memory, above distributed storage]

User explicitly programs communication

SLIDE 6

Piccolo’s Goal: Distributed Shared State

[Diagram: workers 1–3 read/write a shared Graph table (A->B,C; B->D; …) and Ranks table (A: 0; B: 0; …) held as distributed in-memory state, in place of distributed storage]

SLIDE 7

Piccolo’s Goal: Distributed Shared State

[Diagram: graph and rank partitions (A->B,C / B->D / C->E,F, with ranks A / B / C) spread across workers 1–3]

Piccolo runtime handles communication

SLIDE 8

Ease of use + Performance

SLIDE 10

Talk Outline

• Motivation
• Piccolo's Programming Model
• Runtime Scheduling
• Evaluation

SLIDE 11

Programming Model

[Diagram: kernel instances on workers 1–3 share tables (Graph A→B,C; B→D; …, Ranks A: 0; B: 0; …); raw read/write is crossed out in favor of get/put and update/iterate]

Implemented as library for C++ and Python
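
For intuition, a toy single-machine stand-in for this table interface (a sketch with hypothetical names, not the Piccolo API; real tables are partitioned across machines and accessed remotely):

class ToyTable:
    # Single-machine sketch of a Piccolo-style key/value table.
    def __init__(self, accumulate=None):
        self.accumulate = accumulate   # optional merge function
        self.data = {}

    def put(self, key, value):
        self.data[key] = value

    def get(self, key):
        return self.data[key]

    def update(self, key, value):
        # Merge the update into the old value via the accumulator.
        if self.accumulate is not None and key in self.data:
            self.data[key] = self.accumulate(self.data[key], value)
        else:
            self.data[key] = value

    def iterate(self):
        return iter(self.data.items())

ranks = ToyTable(accumulate=lambda old, new: old + new)
ranks.update('A', 0.2)
ranks.update('A', 0.3)   # merged: ranks.get('A') == 0.5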

SLIDE 12

Naïve PageRank with Piccolo

curr = Table(key=PageID, value=double)
next = Table(key=PageID, value=double)

def pr_kernel(graph, curr, next):       # jobs run by many machines
    i = my_instance
    n = len(graph) // NUM_MACHINES
    for s in graph[(i-1)*n : i*n]:
        for t in s.out:
            next[t] += curr[s.id] / len(s.out)

def main():                             # run by a single controller
    for i in range(50):
        # controller launches jobs in parallel
        launch_jobs(NUM_MACHINES, pr_kernel, graph, curr, next)
        swap(curr, next)
        next.clear()

SLIDE 13

Naïve PageRank is Slow

[Diagram: kernels on workers 1–3 issue remote gets against curr and remote puts against next for every edge]

SLIDE 14

PageRank: Exploiting Locality

• Control table partitioning
• Co-locate tables
• Co-locate execution with table

curr = Table(…, partitions=100, partition_by=site)
next = Table(…, partitions=100, partition_by=site)
group_tables(curr, next, graph)

def pr_kernel(graph, curr, next):
    for s in graph.get_iterator(my_instance):
        for t in s.out:
            next[t] += curr[s.id] / len(s.out)

def main():
    for i in range(50):
        launch_jobs(curr.num_partitions, pr_kernel,
                    graph, curr, next, locality=curr)
        swap(curr, next)
        next.clear()
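
What partition_by=site might look like under the hood, as a sketch with hypothetical helpers (site_of and partition_by_site are illustrative names, not the Piccolo API): pages of one site hash to the same partition, so most links stay partition-local.

import hashlib
from urllib.parse import urlparse

def site_of(url):
    # Pages on one site mostly link to each other.
    return urlparse(url).netloc

def partition_by_site(url, num_partitions=100):
    digest = hashlib.md5(site_of(url).encode()).digest()
    return int.from_bytes(digest[:4], 'big') % num_partitions

assert partition_by_site('http://nyu.edu/a') == partition_by_site('http://nyu.edu/b')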

SLIDE 15

Exploiting Locality

[Diagram: graph and rank partitions now co-located on workers 1–3; kernel gets and puts still shown crossing machines]

SLIDE 16

Exploiting Locality

[Diagram: with execution co-located with its table partition, gets and puts are served locally on each worker]

SLIDE 17

Synchronization

[Diagram: workers 1–3 with their graph and rank partitions; two workers concurrently put (a=0.3) and put (a=0.2) to the same key a]

How to handle synchronization?

SLIDE 18

Synchronization Primitives

• Avoid write conflicts with accumulation functions (see the sketch below)
  • NewValue = Accum(OldValue, Update)
  • sum, product, min, max
• Global barriers are sufficient
• Tables provide release consistency
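
Why commutative accumulators avoid write conflicts, as one runnable check (plain Python, not the Piccolo API): buffered updates can be applied in any arrival order and still merge to the same value.

from functools import reduce
import random

def accum_sum(old, update):    # NewValue = Accum(OldValue, Update)
    return old + update

updates = [0.2, 0.3, 0.1]      # concurrent updates to one key
a = reduce(accum_sum, updates, 0.0)
random.shuffle(updates)        # a different arrival order
b = reduce(accum_sum, updates, 0.0)
assert abs(a - b) < 1e-12      # same merged value either way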

SLIDE 19

PageRank: Efficient Synchronization

curr = Table(…, partition_by=site, accumulate=sum)
next = Table(…, partition_by=site, accumulate=sum)
group_tables(curr, next, graph)

def pr_kernel(graph, curr, next):
    for s in graph.get_iterator(my_instance):
        for t in s.out:
            # update() invokes the accumulation function (sum)
            next.update(t, curr.get(s.id) / len(s.out))

def main():
    for i in range(50):
        handle = launch_jobs(curr.num_partitions, pr_kernel,
                             graph, curr, next, locality=curr)
        barrier(handle)        # explicitly wait between iterations
        swap(curr, next)
        next.clear()

SLIDE 20

Efficient Synchronization

[Diagram: instead of conflicting puts (a=0.3, a=0.2), workers send update(a, 0.2) and update(a, 0.3); the runtime computes the sum]

Workers buffer updates locally → release consistency

SLIDE 21

Table Consistency

[Diagram: concurrent update(a, 0.2) and update(a, 0.3) against the same key are merged by the table's accumulator]

SLIDE 22

PageRank with Checkpointing

curr = Table(…, partition_by=site, accumulate=sum)
next = Table(…, partition_by=site, accumulate=sum)
group_tables(curr, next)

def pr_kernel(graph, curr, next):
    for s in graph.get_iterator(my_instance):
        for t in s.out:
            next.update(t, curr.get(s.id) / len(s.out))

def main():
    curr, userdata = restore()          # restore previous computation
    last = userdata.get('iter', 0)
    for i in range(last, 50):
        handle = launch_jobs(curr.num_partitions, pr_kernel,
                             graph, curr, next, locality=curr)
        # user decides which tables to checkpoint and when
        cp_barrier(handle, tables=(next,), userdata={'iter': i})
        swap(curr, next)
        next.clear()

SLIDE 23

Recovery via Checkpointing

[Diagram: workers 1–3 write checkpoints of their graph and rank partitions to distributed storage]

Runtime uses Chandy-Lamport protocol
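
The marker rule at the heart of Chandy-Lamport, as a toy single-process simulation (illustrative only, with hypothetical Worker/send structures; the real runtime snapshots table partitions and buffered update messages):

from collections import deque

class Worker:
    def __init__(self, wid, peers):
        self.wid, self.peers = wid, peers
        self.state = {}            # local table partition
        self.saved_state = None    # local piece of the snapshot
        self.channel_log = None    # in-flight messages, per channel

    def start_snapshot(self, send):
        self.saved_state = dict(self.state)          # record local state
        self.channel_log = {p: [] for p in self.peers}
        for p in self.peers:                         # flood markers
            send(self.wid, p, 'MARKER')

    def receive(self, src, msg, send):
        if msg == 'MARKER':
            if self.saved_state is None:
                self.start_snapshot(send)
            self.channel_log.pop(src, None)          # channel finished
        else:
            key, val = msg
            self.state[key] = self.state.get(key, 0) + val
            if self.channel_log and src in self.channel_log:
                self.channel_log[src].append(msg)    # in-flight update

# Two workers over FIFO queues; A snapshots while an update to B
# is still in flight.
queues = {('A', 'B'): deque(), ('B', 'A'): deque()}
send = lambda s, d, m: queues[(s, d)].append(m)
wa, wb = Worker('A', ['B']), Worker('B', ['A'])
send('A', 'B', ('x', 1))       # update already in the channel
wa.start_snapshot(send)
while any(queues.values()):
    for (s, d), q in list(queues.items()):
        if q:
            (wa if d == 'A' else wb).receive(s, q.popleft(), send)
# The snapshot is each saved_state plus the recorded channel messages.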

SLIDE 24

Talk Outline

• Motivation
• Piccolo's Programming Model
• Runtime Scheduling
• Evaluation

SLIDE 25

Load Balancing

[Diagram: the master tracks each worker's partitions (P1,P2 / P3,P4 / P5,P6) and pending jobs (J1,J2 / J3,J4 / J5,J6) and reassigns a pending job between workers. Before moving work on P6 it pauses traffic: "Other workers are updating P6!" → "Pause updates!"]

Master coordinates work-stealing (see the sketch below)
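
A sketch of master-coordinated stealing (hypothetical Task/WorkerState structures, not Piccolo's internals): only tasks that have not started can move, and the real master would pause updates to the stolen partition while ownership migrates, as the slide shows.

from dataclasses import dataclass, field

@dataclass
class Task:
    partition: int                # table partition the kernel scans

@dataclass
class WorkerState:
    wid: int
    pending: list = field(default_factory=list)   # un-started tasks

def steal_task(workers, idle):
    # Donor = the worker with the most un-started tasks.
    donor = max(workers, key=lambda w: len(w.pending))
    if donor is idle or not donor.pending:
        return None
    task = donor.pending.pop()
    # Real system: pause updates to task.partition, migrate the
    # partition to the idle worker, then resume updates.
    idle.pending.append(task)
    return task

workers = [WorkerState(0, [Task(0), Task(1), Task(2)]), WorkerState(1)]
steal_task(workers, workers[1])   # Task(partition=2) moves to worker 1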

SLIDE 26

Talk Outline

• Motivation
• Piccolo's Programming Model
• System Design
• Evaluation

SLIDE 27

Piccolo is Fast

• NYU cluster, 12 nodes, 64 cores
• 100M-page graph

Main Hadoop Overheads:

  • Sorting
  • HDFS
  • Serialization

[Figure: PageRank iteration time (seconds) vs. workers (8, 16, 32, 64), Hadoop vs. Piccolo]

SLIDE 28

Piccolo Scales Well

• EC2 cluster, input graph scaled linearly with workers

[Figure: PageRank iteration time (seconds) vs. workers (12–200) against the ideal scaling curve; the largest run uses a 1 billion page graph]

SLIDE 29

Other applications

• Iterative applications
  • N-Body Simulation
  • Matrix Multiply
• Asynchronous applications
  • Distributed web crawler

No straightforward Hadoop implementation

SLIDE 30

Related Work

• Data flow
  • MapReduce, Dryad
• Tuple spaces
  • Linda, JavaSpaces
• Distributed shared memory
  • CRL, TreadMarks, Munin, Ivy
  • UPC, Titanium

SLIDE 31

Conclusion

• Distributed shared table model
• User-specified policies provide for:
  • Effective use of locality
  • Efficient synchronization
  • Robust failure recovery

SLIDE 32

Gratuitous Cat Picture

I can haz kwestions?

Try it out: piccolo.news.cs.nyu.edu