Case Studies in Asynchronous, Message-Driven Shared Memory


SLIDE 1

04/25/11

Case Studies in Asynchronous, Message-Driven Shared Memory Programming

Pritish Jetley
Parallel Programming Laboratory

pjetley2@illinois.edu

SLIDE 2

Outline

  • Shared memory programming today
  • Charm++ on multicore systems
  • Shared memory (SM) programming in Charm++
  • Case studies
      • Barnes-Hut (SPLASH)
      • SAH-based kd-tree construction

SLIDE 3

SM programming today

  • Fork-join
  • Amorphous, thread-based (pthreads)
  • Data parallelism-centric (OpenMP)
  • Tasks (TBB, Cilk)
  • Message-driven execution (Charm++)

SLIDE 4

Fork-join model

Pros (+):
  • Simple to program (?)
  • Global view of control
  • Natural fit for certain problems

Cons (-):
  • Forced synchrony
  • Low-level
  • Mutex
  • Grainsize control

SLIDE 5

Charm++ on multicore systems

  • Decompose algorithm into objects encapsulating its natural elements
  • Objects present reactive interfaces
  • Control flows through asynchronous entry method invocations
  • Data flows through pointer exchange
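
The points above can be illustrated with a minimal sketch in plain C++ rather than Charm++ itself. The `Scheduler` and `Summer` classes and the `demoPointerExchange` function are hypothetical names invented for this illustration. An "entry method" call only queues work, the receiving object reacts when the scheduler delivers the message, and the payload is a pointer into shared memory rather than a copy.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-in for a message-driven scheduler: entry method
// invocations are queued and run later, so a call never blocks the caller.
class Scheduler {
public:
    void enqueue(std::function<void()> msg) { queue_.push_back(std::move(msg)); }

    // Deliver queued invocations until none remain; returns how many ran.
    int run() {
        int delivered = 0;
        while (!queue_.empty()) {
            std::function<void()> msg = std::move(queue_.front());
            queue_.pop_front();
            msg();
            ++delivered;
        }
        return delivered;
    }

private:
    std::deque<std::function<void()>> queue_;
};

// An object with a reactive interface: it only acts when a message arrives.
class Summer {
public:
    explicit Summer(Scheduler &s) : sched_(s) {}

    // Asynchronous "entry method" call: only a pointer and a length travel.
    void sendAdd(const int *data, int n) {
        sched_.enqueue([this, data, n] { add(data, n); });
    }

    long total() const { return total_; }

private:
    void add(const int *data, int n) {
        assert(data != nullptr && n >= 0);
        for (int i = 0; i < n; ++i) total_ += data[i];
    }

    Scheduler &sched_;
    long total_ = 0;
};

// Runs the sketch once and returns the accumulated total.
long demoPointerExchange() {
    Scheduler sched;
    Summer summer(sched);
    std::vector<int> shared = {1, 2, 3, 4};
    summer.sendAdd(shared.data(), 2);      // first half, no copy
    summer.sendAdd(shared.data() + 2, 2);  // second half, no copy
    assert(summer.total() == 0);           // nothing has run yet
    sched.run();
    return summer.total();
}
```

Because only pointers are exchanged, the sender must keep the data alive until the message has been processed; in this sketch the vector outlives the call to `run()`.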

SLIDE 6

SM programming with Charm++ and message-driven execution (MDE)

Pros (+):
  • Natural decomposition
  • Dependencies = messages
  • Asynchrony
  • Dynamic load balancing
  • Task prioritization

Cons (-):
  • Charm++ has no faults whatsoever
  • No global view of control
  • MDE is low-level

SLIDE 7

Performance and productivity studies

  • How easy (or hard) is it to write SM programs in Charm++?
  • Can we expect improvements in performance?
  • Are there abstractions that would improve programmability in Charm++?

SLIDE 8

Comparison points

  • SPLASH2 Barnes-Hut benchmark
      • Study evolution of self-gravitating systems
      • Tree-based code
      • Uses pthreads
  • SAH-based kd-tree construction
      • High-performance ray tracing
      • Nested parallelism
      • Uses TBB

SLIDE 9

SPLASH Barnes-Hut

  • Domain decomposition and tree building
      • Partition space into compact, disjoint regions containing approximately equal numbers of particles
      • Regions arranged in an octree
      • Independent subtrees: task parallel
      • Shuffle particles into child bins: data parallel
  • Force calculation
      • Objects own non-intersecting sets of particles, and calculate forces on them

SLIDE 10

Decomposition

  • Recursively divide a partition into quadrants if it contains more than τ particles (example: τ = 3)
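
The recursive rule can be sketched as a small 2-D quadtree builder. This is an illustrative sequential version, not the SPLASH or Charm++ code; the names `Point`, `QuadNode`, `build`, `countLeaves` and `countPoints` are assumptions, and the `maxDepth` guard is an addition that stops the recursion when points coincide.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

struct Point { double x, y; };

// Hypothetical quadtree node covering the square [x0, x0+size) x [y0, y0+size).
struct QuadNode {
    double x0, y0, size;
    std::vector<Point> points;                       // filled only in leaves
    std::array<std::unique_ptr<QuadNode>, 4> child;  // all null in leaves
    bool isLeaf() const { return !child[0]; }
};

// Build the subtree for pts, splitting while a node holds more than tau points.
std::unique_ptr<QuadNode> build(std::vector<Point> pts, double x0, double y0,
                                double size, std::size_t tau, int maxDepth = 32) {
    assert(size > 0);
    auto node = std::make_unique<QuadNode>();
    node->x0 = x0;
    node->y0 = y0;
    node->size = size;
    if (pts.size() <= tau || maxDepth == 0) {
        node->points = std::move(pts);  // few enough particles: stay a leaf
        return node;
    }
    const double half = size / 2;
    std::array<std::vector<Point>, 4> bins;
    for (const Point &p : pts) {
        // Quadrant index: bit 0 selects the x half, bit 1 selects the y half.
        int q = (p.x >= x0 + half ? 1 : 0) + (p.y >= y0 + half ? 2 : 0);
        bins[q].push_back(p);
    }
    for (int q = 0; q < 4; ++q) {
        node->child[q] = build(std::move(bins[q]), x0 + (q & 1) * half,
                               y0 + (q >> 1) * half, half, tau, maxDepth - 1);
    }
    return node;
}

std::size_t countLeaves(const QuadNode &n) {
    if (n.isLeaf()) return 1;
    std::size_t total = 0;
    for (const auto &c : n.child) total += countLeaves(*c);
    return total;
}

std::size_t countPoints(const QuadNode &n) {
    if (n.isLeaf()) return n.points.size();
    std::size_t total = 0;
    for (const auto &c : n.child) total += countPoints(*c);
    return total;
}
```

Each recursive call works on a disjoint set of points, which is what makes the construction of independent subtrees a natural unit of task parallelism.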

SLIDE 11

Domain decomposition

[Figure labels: node task N, child tasks, first particle, last particle, combined messages]

SLIDE 12

Decomposition with pthreads

    void decompose() {
        for (int i = 0; i < myNP; i++) {
            Particle *p = myParticles[i];
            Cell *cell = g_root;
            while (1) {
                cell->LOCK();
                if (!cell->isLeaf()) {
                    // Descend one level, then release the parent
                    Cell *save = cell;
                    int which = cell->which(p->key);
                    cell = cell->child(which);
                    save->UNLOCK();
                } else {
                    // Insert into the leaf while holding its lock
                    cell->particles.add(p);
                    cell->split();
                    cell->UNLOCK();
                    break;
                }
            }
        }
    }

SLIDE 13

Decomposition with Charm++

    void TreePiece::decompose() {
        for (int i = 0; i < myNP; i++) {
            Particle *p = myParticles[i];
            int which = g_root->whichChild(p->key);
            bufferParticle(which, p);
            if (outParticles[which].size() > THRESH) {
                flushParticles(which);
            }
        }
        flushAllParticles();
    }

    void TreePiece::recvParticles(Particle *ptr, int np) {
        if (myRoot->isLeaf()) {
            myRoot->addParticles(ptr, np);
            if (myRoot->split()) {
                forwardParticlesToChildren(myRoot->particles);
            }
        } else {
            forwardParticlesToChildren(ptr, np);
        }
    }

    void TreePiece::forwardParticlesToChildren(/* parameter list missing in the source */) {
        for (int i = 0; i < NUM_CHILDREN; i++) {
            treePieceProxy[childIndex[i]].recvParticles(
                childParticles[i], childParticles[i].size());
        }
    }

    void TreePiece::flushParticles(int i) {
        treePieceProxy[i].recvParticles(buffered[i], buffered[i].size());
    }
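
The buffering step in `decompose` combines many small sends into fewer, larger ones. A minimal, self-contained sketch of that idea follows; `CombiningBuffer` and its members are hypothetical names, the items are plain integers instead of particles, and the actual delivery to a destination object is replaced by counters.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-destination buffer that combines items into batches.
class CombiningBuffer {
public:
    CombiningBuffer(std::size_t numDest, std::size_t thresh)
        : out_(numDest), thresh_(thresh) {}

    // Buffer one item; flush the destination once it exceeds the threshold.
    void add(std::size_t dest, int item) {
        assert(dest < out_.size());
        out_[dest].push_back(item);
        if (out_[dest].size() > thresh_) flush(dest);
    }

    // Send whatever is still buffered, one message per non-empty destination.
    void flushAll() {
        for (std::size_t d = 0; d < out_.size(); ++d) {
            if (!out_[d].empty()) flush(d);
        }
    }

    std::size_t messagesSent() const { return sent_; }
    std::size_t itemsSent() const { return items_; }

private:
    void flush(std::size_t dest) {
        // A real program would hand out_[dest] to the destination object here.
        ++sent_;
        items_ += out_[dest].size();
        out_[dest].clear();
    }

    std::vector<std::vector<int>> out_;
    std::size_t thresh_;
    std::size_t sent_ = 0;
    std::size_t items_ = 0;
};
```

With a threshold of 2, seven items for one destination leave in two batches of three plus one final message at `flushAll()`, rather than in seven separate messages.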

SLIDE 14

Tree traversal

    Traverse(Leaf b, Node n) {
        if (IsLeaf(n)) {
            LeafForces(b, n);
        } else if (Side(n) / |r(n) - r(b)| < Theta_T) {
            CellForces(b, n);
        }
        /* the slide text ends here; the remaining branch is not in the source */
    }
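
A runnable sketch of this kind of traversal is given here under several assumptions: positions are one-dimensional, the gravitational constant is 1, the traversal is done per body rather than per leaf bucket, and the final branch (open the cell and recurse into its children) follows the standard Barnes-Hut scheme rather than anything shown on the slide. All names (`Body`, `Cell`, `Tally`, `makeLeaf`, `makeInternal`, `traverse`, `demoTraverse`) are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <memory>
#include <utility>
#include <vector>

struct Body { double x, mass; };  // 1-D positions keep the sketch short

struct Cell {
    double side;  // edge length of the region
    double com;   // centre of mass
    double mass;  // total mass
    std::vector<Body> bodies;                 // non-empty only in leaves
    std::vector<std::unique_ptr<Cell>> kids;  // empty in leaves
    bool isLeaf() const { return kids.empty(); }
};

struct Tally { int direct = 0; int approximated = 0; double accel = 0; };

std::unique_ptr<Cell> makeLeaf(double side, std::vector<Body> bodies) {
    auto c = std::make_unique<Cell>();
    c->side = side;
    for (const Body &b : bodies) { c->mass += b.mass; c->com += b.mass * b.x; }
    if (c->mass > 0) c->com /= c->mass;
    c->bodies = std::move(bodies);
    return c;
}

std::unique_ptr<Cell> makeInternal(double side, std::unique_ptr<Cell> a,
                                   std::unique_ptr<Cell> b) {
    auto c = std::make_unique<Cell>();
    c->side = side;
    c->mass = a->mass + b->mass;
    assert(c->mass > 0);
    c->com = (a->mass * a->com + b->mass * b->com) / c->mass;
    c->kids.push_back(std::move(a));
    c->kids.push_back(std::move(b));
    return c;
}

// Acceleration contribution of a point mass at distance d (G = 1).
double pull(double mass, double d) { return mass * d / std::pow(std::fabs(d), 3); }

void traverse(const Body &b, const Cell &n, double theta, Tally &t) {
    if (n.isLeaf()) {
        for (const Body &o : n.bodies) {
            double d = o.x - b.x;
            if (d == 0) continue;  // skip the body itself
            t.accel += pull(o.mass, d);
            ++t.direct;
        }
    } else if (n.side / std::fabs(n.com - b.x) < theta) {
        t.accel += pull(n.mass, n.com - b.x);  // far enough: use the whole cell
        ++t.approximated;
    } else {
        for (const auto &k : n.kids) traverse(b, *k, theta, t);  // open the cell
    }
}

// Fixed four-body example; returns the tally for the body at x = 1.
Tally demoTraverse(double theta) {
    auto left = makeLeaf(4, {{1, 1}, {3, 1}});
    auto right = makeInternal(4, makeLeaf(2, {{5, 1}}), makeLeaf(2, {{7, 1}}));
    auto root = makeInternal(8, std::move(left), std::move(right));
    Tally t;
    traverse(Body{1, 1}, *root, theta, t);
    return t;
}
```

A larger opening threshold lets more cells be approximated by their centre of mass; a smaller one forces the traversal down to the leaves and increases the number of direct interactions.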

SLIDE 15

Fewer barriers

[Figures: 10k.1.comparison.eps and 100k.1.comparison.eps (plots not preserved)]

SLIDE 16

Performance profile

SLIDE 17

Performance profile

SLIDE 18

More results

[Figures: 10k.2.comparison.eps and 100k.2.comparison.eps (plots not preserved)]

SLIDE 19

SAH-based kd-trees

  • Used to efficiently render complex graphical scenes
  • Task parallel construction of independent subtrees (dynamically created chares)
  • Data parallel calculation of node split point (chare arrays)

SLIDE 20

Binary Space Partitioning

  • SAH decides position of partition based on triangle distribution and partition surface area

[Figure labels: extents, partitioning plane]
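
As a rough illustration of how triangle distribution and surface area combine, the sketch below evaluates the commonly used surface area heuristic cost, traversal cost plus intersection cost weighted by the surface-area fraction and triangle count of each side, for axis-aligned candidate planes. The names `Box`, `Interval`, `surfaceArea`, `sahCost` and `bestSplit`, and the unit cost constants, are assumptions made for this example; the exact cost model used in the case study is not given in the slides.

```cpp
#include <cassert>
#include <vector>

struct Box { double lo[3], hi[3]; };  // axis-aligned extents of a node
struct Interval { double lo, hi; };   // a triangle's extent along the split axis

double surfaceArea(const Box &b) {
    const double dx = b.hi[0] - b.lo[0];
    const double dy = b.hi[1] - b.lo[1];
    const double dz = b.hi[2] - b.lo[2];
    return 2 * (dx * dy + dy * dz + dz * dx);
}

// Estimated cost of splitting `box` at `split` along `axis`.
// cTrav and cIsect are illustrative constants, not values from the slides.
double sahCost(const Box &box, int axis, double split,
               const std::vector<Interval> &tris,
               double cTrav = 1.0, double cIsect = 1.0) {
    assert(axis >= 0 && axis < 3);
    assert(split > box.lo[axis] && split < box.hi[axis]);
    Box left = box, right = box;
    left.hi[axis] = split;
    right.lo[axis] = split;
    int nL = 0, nR = 0;
    for (const Interval &t : tris) {
        if (t.lo < split) ++nL;  // triangle overlaps the left child
        if (t.hi > split) ++nR;  // triangle overlaps the right child
    }
    return cTrav + cIsect * (surfaceArea(left) * nL + surfaceArea(right) * nR) /
                       surfaceArea(box);
}

// Return the candidate plane position with the lowest estimated cost.
double bestSplit(const Box &box, int axis, const std::vector<double> &candidates,
                 const std::vector<Interval> &tris) {
    assert(!candidates.empty());
    double best = candidates.front();
    double bestCost = sahCost(box, axis, best, tris);
    for (double c : candidates) {
        const double cost = sahCost(box, axis, c, tris);
        if (cost < bestCost) { bestCost = cost; best = c; }
    }
    return best;
}
```

The triangle counts on each side of every candidate plane are the data-parallel part of the computation; the final selection of the cheapest plane is a reduction over the candidates.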

SLIDE 21

kd-tree construction

[Figure labels: node task N, child tasks, particle chare array P, first triangle, last triangle]

SLIDE 22

Charm++ pseudocode

    entry void Worker::scanTriangleCounts(ActivationRec ar, NodeTaskID N) {
        dist = W >> 1;
        while (dist > 0) {
            if (thisIdx < dist) {
                ScanMsg m;
                m.NL = myNL;
                m.NR = ar.nTris - myNR;
                RefNum(m) = dist;
                workers[thisIdx + dist].recvNeighborCounts(m);
            }
            when recvNeighborCounts[dist](ScanMsg m1) {
                myNL += m1.NL;
                myNR -= m1.NR;
                dist >>= 1;
            }
        }
        Plane bestPlane = calculateSAH();
        reduce(bestPlane, N, NodeTask::getBestPlanes);
    }

  • Use SDAG to sequence events in parallel scan
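
For readers unfamiliar with round-based scans, the following sequential sketch simulates a textbook doubling (Hillis-Steele) inclusive scan. It is not the communication pattern of the pseudocode above and involves no SDAG; it only illustrates why each round must complete before the next starts, which is the kind of ordering the `when` clause enforces. The function name `inclusiveScan` is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Inclusive prefix sum computed in rounds. In round `dist`, element i adds the
// value its neighbour at i - dist held at the start of that round.
std::vector<long> inclusiveScan(std::vector<long> v) {
    const std::size_t n = v.size();
    for (std::size_t dist = 1; dist < n; dist <<= 1) {
        std::vector<long> next = v;  // every "message" carries pre-round values
        for (std::size_t i = dist; i < n; ++i) {
            next[i] = v[i] + v[i - dist];
        }
        assert(next.size() == n);
        v.swap(next);
    }
    return v;
}
```

For per-worker triangle counts, the scanned value at index i is the number of triangles held by workers 0 through i; the count on the other side of a split follows by subtracting from the total.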

SLIDE 23

Charm++ implementation

  • One chare for each node of kd-tree (orchestrator)
  • For data-parallel operations, orchestrator either
      • Fires new chares (dynamic load balance)
      • Uses chare array (low overhead of use)
  • Several optimizations in place
      • Prioritization
      • Array-level multicasts/reductions
      • Manual “smearing” of tasks at top level
      • Use of chunked arrays
          – Reduces false sharing
          – Reduces amount of coordination communication

SLIDE 24

Results

[Figures: bunny.eps, angel.eps, fairy.eps and happy.eps (plots not preserved)]

SLIDE 25

Performance profile