SLIDE 1

Piccolo: Building fast distributed programs with partitioned tables

Russell Power, Jinyang Li
New York University

SLIDE 2

Motivating Example: PageRank

for each node X in graph:
    for each edge X→Z:
        next[Z] += curr[X]

Repeat until convergence

AB,C,D BE CD … A: 0.12 B: 0.15 C: 0.2 … A: 0 B: 0 C: 0 …

Curr Next Input Graph

A: 0.2 B: 0.16 C: 0.21 … A: 0.2 B: 0.16 C: 0.21 … A: 0.25 B: 0.17 C: 0.22 … A: 0.25 B: 0.17 C: 0.22 …

Fits in memory!
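
To make the loop concrete: a minimal single-machine sketch in plain Python (not Piccolo code; the damping factor d and the division by out-degree are standard PageRank details the simplified pseudocode above omits, and the graph is assumed to be a dict mapping every page to its outgoing links).

def pagerank(graph, iters=50, d=0.85):
    n = len(graph)
    curr = {x: 1.0 / n for x in graph}           # uniform initial ranks
    for _ in range(iters):
        next_ = {x: (1 - d) / n for x in graph}  # teleport mass
        for x, out in graph.items():
            for z in out:                        # edge x -> z
                next_[z] += d * curr[x] / len(out)
        curr = next_
    return curr

ranks = pagerank({'A': ['B', 'C'], 'B': ['C'], 'C': ['A']})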

SLIDE 3

PageRank in MapReduce

[Diagram: MapReduce workers 1–3 over distributed storage; each iteration joins a graph stream (A->B,C; B->D; …) with a rank stream (A:0.1, B:0.2; …) and emits a new rank stream]

  • Data flow models do not expose global state.
SLIDE 5

PageRank With MPI/RPC

[Diagram: workers 1–3 each hold a graph partition (A->B,C / B->D / C->E,F) and its rank partition (A: 0 / B: 0 / C: 0) in memory, above distributed storage]

User explicitly programs communication

SLIDE 6

Piccolo’s Goal: Distributed Shared State

[Diagram: workers 1–3 read/write a shared Graph table (A->B,C; B->D; …) and Ranks table (A: 0; B: 0; …) held as distributed in-memory state, in place of distributed storage]

SLIDE 7

Piccolo’s Goal: Distributed Shared State

[Diagram: graph and rank partitions (A->B,C / B->D / C->E,F, with ranks A / B / C) spread across workers 1–3]

Piccolo runtime handles communication

SLIDE 8

Ease of use + Performance

SLIDE 10

Talk Outline

• Motivation
• Piccolo's Programming Model
• Runtime Scheduling
• Evaluation

SLIDE 11

Programming Model

[Diagram: kernel instances on workers 1–3 share tables (Graph A→B,C; B→D; …, Ranks A: 0; B: 0; …); raw read/write is crossed out in favor of get/put and update/iterate]

Implemented as library for C++ and Python
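
For intuition, a toy single-machine stand-in for this table interface (a sketch with hypothetical names, not the Piccolo API; real tables are partitioned across machines and accessed remotely):

class ToyTable:
    # Single-machine sketch of a Piccolo-style key/value table.
    def __init__(self, accumulate=None):
        self.accumulate = accumulate   # optional merge function
        self.data = {}

    def put(self, key, value):
        self.data[key] = value

    def get(self, key):
        return self.data[key]

    def update(self, key, value):
        # Merge the update into the old value via the accumulator.
        if self.accumulate is not None and key in self.data:
            self.data[key] = self.accumulate(self.data[key], value)
        else:
            self.data[key] = value

    def iterate(self):
        return iter(self.data.items())

ranks = ToyTable(accumulate=lambda old, new: old + new)
ranks.update('A', 0.2)
ranks.update('A', 0.3)   # merged: ranks.get('A') == 0.5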

SLIDE 12

Naïve PageRank with Piccolo

curr = Table(key=PageID, value=double)
next = Table(key=PageID, value=double)

def pr_kernel(graph, curr, next):       # jobs run by many machines
    i = my_instance
    n = len(graph) // NUM_MACHINES
    for s in graph[(i-1)*n : i*n]:
        for t in s.out:
            next[t] += curr[s.id] / len(s.out)

def main():                             # run by a single controller
    for i in range(50):
        # controller launches jobs in parallel
        launch_jobs(NUM_MACHINES, pr_kernel, graph, curr, next)
        swap(curr, next)
        next.clear()

SLIDE 13

Naïve PageRank is Slow

[Diagram: kernels on workers 1–3 issue remote gets against curr and remote puts against next for every edge]

SLIDE 14

PageRank: Exploiting Locality

• Control table partitioning
• Co-locate tables
• Co-locate execution with table

curr = Table(…, partitions=100, partition_by=site)
next = Table(…, partitions=100, partition_by=site)
group_tables(curr, next, graph)

def pr_kernel(graph, curr, next):
    for s in graph.get_iterator(my_instance):
        for t in s.out:
            next[t] += curr[s.id] / len(s.out)

def main():
    for i in range(50):
        launch_jobs(curr.num_partitions, pr_kernel,
                    graph, curr, next, locality=curr)
        swap(curr, next)
        next.clear()
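
What partition_by=site might look like under the hood, as a sketch with hypothetical helpers (site_of and partition_by_site are illustrative names, not the Piccolo API): pages of one site hash to the same partition, so most links stay partition-local.

import hashlib
from urllib.parse import urlparse

def site_of(url):
    # Pages on one site mostly link to each other.
    return urlparse(url).netloc

def partition_by_site(url, num_partitions=100):
    digest = hashlib.md5(site_of(url).encode()).digest()
    return int.from_bytes(digest[:4], 'big') % num_partitions

assert partition_by_site('http://nyu.edu/a') == partition_by_site('http://nyu.edu/b')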

SLIDE 15

Exploiting Locality

[Diagram: graph and rank partitions now co-located on workers 1–3; kernel gets and puts still shown crossing machines]

SLIDE 16

Exploiting Locality

[Diagram: with execution co-located with its table partition, gets and puts are served locally on each worker]

SLIDE 17

Synchronization

[Diagram: workers 1–3 with their graph and rank partitions; two workers concurrently put (a=0.3) and put (a=0.2) to the same key a]

How to handle synchronization?

SLIDE 18

Synchronization Primitives

• Avoid write conflicts with accumulation functions (see the sketch below)
  • NewValue = Accum(OldValue, Update)
  • sum, product, min, max
• Global barriers are sufficient
• Tables provide release consistency
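
Why commutative accumulators avoid write conflicts, as one runnable check (plain Python, not the Piccolo API): buffered updates can be applied in any arrival order and still merge to the same value.

from functools import reduce
import random

def accum_sum(old, update):    # NewValue = Accum(OldValue, Update)
    return old + update

updates = [0.2, 0.3, 0.1]      # concurrent updates to one key
a = reduce(accum_sum, updates, 0.0)
random.shuffle(updates)        # a different arrival order
b = reduce(accum_sum, updates, 0.0)
assert abs(a - b) < 1e-12      # same merged value either way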

SLIDE 19

PageRank: Efficient Synchronization

curr = Table(…, partition_by=site, accumulate=sum)
next = Table(…, partition_by=site, accumulate=sum)
group_tables(curr, next, graph)

def pr_kernel(graph, curr, next):
    for s in graph.get_iterator(my_instance):
        for t in s.out:
            # update() invokes the accumulation function (sum)
            next.update(t, curr.get(s.id) / len(s.out))

def main():
    for i in range(50):
        handle = launch_jobs(curr.num_partitions, pr_kernel,
                             graph, curr, next, locality=curr)
        barrier(handle)        # explicitly wait between iterations
        swap(curr, next)
        next.clear()

SLIDE 20

Efficient Synchronization

[Diagram: instead of conflicting puts (a=0.3, a=0.2), workers send update(a, 0.2) and update(a, 0.3); the runtime computes the sum]

Workers buffer updates locally → release consistency

SLIDE 21

Table Consistency

[Diagram: concurrent update(a, 0.2) and update(a, 0.3) against the same key are merged by the table's accumulator]

SLIDE 22

PageRank with Checkpointing

curr = Table(…, partition_by=site, accumulate=sum)
next = Table(…, partition_by=site, accumulate=sum)
group_tables(curr, next)

def pr_kernel(graph, curr, next):
    for s in graph.get_iterator(my_instance):
        for t in s.out:
            next.update(t, curr.get(s.id) / len(s.out))

def main():
    curr, userdata = restore()          # restore previous computation
    last = userdata.get('iter', 0)
    for i in range(last, 50):
        handle = launch_jobs(curr.num_partitions, pr_kernel,
                             graph, curr, next, locality=curr)
        # user decides which tables to checkpoint and when
        cp_barrier(handle, tables=(next,), userdata={'iter': i})
        swap(curr, next)
        next.clear()

SLIDE 23

Recovery via Checkpointing

[Diagram: workers 1–3 write checkpoints of their graph and rank partitions to distributed storage]

Runtime uses Chandy-Lamport protocol
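
The marker rule at the heart of Chandy-Lamport, as a toy single-process simulation (illustrative only, with hypothetical Worker/send structures; the real runtime snapshots table partitions and buffered update messages):

from collections import deque

class Worker:
    def __init__(self, wid, peers):
        self.wid, self.peers = wid, peers
        self.state = {}            # local table partition
        self.saved_state = None    # local piece of the snapshot
        self.channel_log = None    # in-flight messages, per channel

    def start_snapshot(self, send):
        self.saved_state = dict(self.state)          # record local state
        self.channel_log = {p: [] for p in self.peers}
        for p in self.peers:                         # flood markers
            send(self.wid, p, 'MARKER')

    def receive(self, src, msg, send):
        if msg == 'MARKER':
            if self.saved_state is None:
                self.start_snapshot(send)
            self.channel_log.pop(src, None)          # channel finished
        else:
            key, val = msg
            self.state[key] = self.state.get(key, 0) + val
            if self.channel_log and src in self.channel_log:
                self.channel_log[src].append(msg)    # in-flight update

# Two workers over FIFO queues; A snapshots while an update to B
# is still in flight.
queues = {('A', 'B'): deque(), ('B', 'A'): deque()}
send = lambda s, d, m: queues[(s, d)].append(m)
wa, wb = Worker('A', ['B']), Worker('B', ['A'])
send('A', 'B', ('x', 1))       # update already in the channel
wa.start_snapshot(send)
while any(queues.values()):
    for (s, d), q in list(queues.items()):
        if q:
            (wa if d == 'A' else wb).receive(s, q.popleft(), send)
# The snapshot is each saved_state plus the recorded channel messages.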

SLIDE 24

Talk Outline

• Motivation
• Piccolo's Programming Model
• Runtime Scheduling
• Evaluation

SLIDE 25

Load Balancing

[Diagram: the master tracks each worker's partitions (P1,P2 / P3,P4 / P5,P6) and pending jobs (J1,J2 / J3,J4 / J5,J6) and reassigns a pending job between workers. Before moving work on P6 it pauses traffic: "Other workers are updating P6!" → "Pause updates!"]

Master coordinates work-stealing (see the sketch below)
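
A sketch of master-coordinated stealing (hypothetical Task/WorkerState structures, not Piccolo's internals): only tasks that have not started can move, and the real master would pause updates to the stolen partition while ownership migrates, as the slide shows.

from dataclasses import dataclass, field

@dataclass
class Task:
    partition: int                # table partition the kernel scans

@dataclass
class WorkerState:
    wid: int
    pending: list = field(default_factory=list)   # un-started tasks

def steal_task(workers, idle):
    # Donor = the worker with the most un-started tasks.
    donor = max(workers, key=lambda w: len(w.pending))
    if donor is idle or not donor.pending:
        return None
    task = donor.pending.pop()
    # Real system: pause updates to task.partition, migrate the
    # partition to the idle worker, then resume updates.
    idle.pending.append(task)
    return task

workers = [WorkerState(0, [Task(0), Task(1), Task(2)]), WorkerState(1)]
steal_task(workers, workers[1])   # Task(partition=2) moves to worker 1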

SLIDE 26

Talk Outline

• Motivation
• Piccolo's Programming Model
• System Design
• Evaluation

SLIDE 27

Piccolo is Fast

• NYU cluster, 12 nodes, 64 cores
• 100M-page graph

Main Hadoop Overheads:

  • Sorting
  • HDFS
  • Serialization

[Figure: PageRank iteration time (seconds) vs. workers (8, 16, 32, 64), Hadoop vs. Piccolo]

SLIDE 28

Piccolo Scales Well

• EC2 cluster, input graph scaled linearly with workers

[Figure: PageRank iteration time (seconds) vs. workers (12–200) against the ideal scaling curve; the largest run uses a 1 billion page graph]

SLIDE 29

Other applications

• Iterative applications
  • N-Body Simulation
  • Matrix Multiply
• Asynchronous applications
  • Distributed web crawler

No straightforward Hadoop implementation

SLIDE 30

Related Work

• Data flow
  • MapReduce, Dryad
• Tuple spaces
  • Linda, JavaSpaces
• Distributed shared memory
  • CRL, TreadMarks, Munin, Ivy
  • UPC, Titanium

SLIDE 31

Conclusion

• Distributed shared table model
• User-specified policies provide for:
  • Effective use of locality
  • Efficient synchronization
  • Robust failure recovery

SLIDE 32

Gratuitous Cat Picture

I can haz kwestions?

Try it out: piccolo.news.cs.nyu.edu