SLIDE 1

Lecture 11: HW3, Rest of Parallel Patterns, Load Balancing

G63.2011.002/G22.2945.001 · November 16, 2010

D&C General

SLIDE 2

Outline

  • Divide-and-Conquer
  • General Data Dependencies

SLIDE 5

Divide and Conquer

$y_i = f_i(x_1, \dots, x_N)$ for $i \in \{1, \dots, M\}$.

Main purpose: A way of partitioning up fully dependent tasks.

[Figure: diagram over inputs x0, ..., x7 and outputs y0, ..., y7, with intermediate values labeled u0, ..., u7, v0, ..., v7, and w0, ..., w7.]

Processor allocation?
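
As a minimal sketch of the pattern, consider the simplest fully dependent case: a single output (M = 1) that depends on every input, here a sum. The function name `dc_sum` and the half-open index convention are assumptions made for illustration, not part of the lecture. The two recursive calls touch disjoint halves, so they are the natural candidates for running in parallel.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from the lecture): sums x[lo..hi) by divide and
   conquer. The two recursive calls are independent of each other. */
double dc_sum(const double *x, size_t lo, size_t hi)
{
    assert(lo <= hi);
    if (hi - lo == 0)
        return 0.0;                     /* empty range */
    if (hi - lo == 1)
        return x[lo];                   /* leaf: a single input */

    size_t mid = lo + (hi - lo) / 2;    /* works for any size, not only 2^n */
    double left = dc_sum(x, lo, mid);   /* could run in parallel ... */
    double right = dc_sum(x, mid, hi);  /* ... with this call */
    return left + right;                /* combine step */
}
```

The recursion depth is about log2(N), so with enough processors the combine steps along one root-to-leaf path bound the running time, rather than the N - 1 additions overall.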

SLIDE 6

Divide and Conquer: Examples

  • GEMM, TRMM, TRSM, GETRF (LU)
  • FFT
  • Sorting: Bucket sort, Merge sort
  • N-Body problems (Barnes-Hut, FMM)
  • Adaptive Integration

More fun with work and span: D&C analysis lecture
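
As a small worked example of such an analysis, take merge sort and assume (this is an assumption, not something the slide states) that the two recursive calls run in parallel while the merge step stays serial. Writing $T_1$ for the work and $T_\infty$ for the span:

```latex
\begin{align*}
T_1(n)      &= 2\,T_1(n/2) + \Theta(n) = \Theta(n \log n) && \text{(work)} \\
T_\infty(n) &= T_\infty(n/2) + \Theta(n) = \Theta(n)      && \text{(span)} \\
\frac{T_1(n)}{T_\infty(n)} &= \Theta(\log n)              && \text{(parallelism)}
\end{align*}
```

Under that assumption the serial merge dominates the span, so the available parallelism grows only logarithmically with the input size.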

SLIDE 7

Divide and Conquer: Issues

  • “No idea how to parallelize that” → Try D&C
  • Non-optimal during partition, merge
  • But: Does not matter if deep levels do heavy enough processing
  • Subtle to map to fixed-width machines (e.g. GPUs)
  • Varying data size along tree
  • Bookkeeping nontrivial for non-2^n sizes
  • Side benefit: D&C is generally cache-friendly
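
Two of these points can be sketched with a merge sort. The leaves do a meaningful amount of serial work, and the split handles sizes that are not powers of two. The names `dc_merge_sort` and `CUTOFF`, and the threshold value 16, are hypothetical choices for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CUTOFF 16  /* hypothetical leaf size; would be tuned per machine */

/* Serial base case used once a subproblem is small enough. */
static void insertion_sort(int *a, size_t n)
{
    for (size_t i = 1; i < n; ++i) {
        int key = a[i];
        size_t j = i;
        while (j > 0 && a[j - 1] > key) {
            a[j] = a[j - 1];
            --j;
        }
        a[j] = key;
    }
}

/* Sorts a[0..n) in place, using tmp[0..n) as scratch space. */
void dc_merge_sort(int *a, int *tmp, size_t n)
{
    assert(n == 0 || (a != NULL && tmp != NULL));
    if (n <= CUTOFF) {              /* leaf: heavy enough to do serially */
        insertion_sort(a, n);
        return;
    }
    size_t mid = n / 2;             /* halves differ by at most one element */
    dc_merge_sort(a, tmp, mid);                  /* independent subproblem */
    dc_merge_sort(a + mid, tmp + mid, n - mid);  /* independent subproblem */

    /* Serial merge: the "non-optimal" part at this level of the tree. */
    size_t i = 0, j = mid, k = 0;
    while (i < mid && j < n)
        tmp[k++] = (a[j] < a[i]) ? a[j++] : a[i++];
    while (i < mid)
        tmp[k++] = a[i++];
    while (j < n)
        tmp[k++] = a[j++];
    memcpy(a, tmp, n * sizeof *a);
}
```

Each call works on one contiguous slice of the array and its matching slice of scratch space, which is also where the cache-friendliness comes from.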

SLIDE 10

General Dependency Graphs

B = f(A)
C = g(B)
E = f(C)
F = h(C)
G = g(E,F)
P = p(B)
Q = q(B)
R = r(G,P,Q)

[Figure: the corresponding dependency graph, with nodes A, B, C, E, F, G, P, Q, R and edges labeled by the functions f, g, h, p, q, r.]

Great: All patterns discussed so far can be reduced to this one.
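
One way to see the parallelism in this graph is to give each node the earliest step at which it can run, namely one more than the latest of its inputs. The sketch below encodes the eight assignments from this slide. The table layout and the function name `compute_levels` are assumptions made for illustration, and the code relies on the nodes being listed in an order where every input precedes its consumer.

```c
#include <assert.h>

enum { A, B, C, E, F, G, P, Q, R, N_NODES };
#define MAX_DEPS 3

/* deps[v] lists the inputs of node v, terminated by -1. */
static const int deps[N_NODES][MAX_DEPS + 1] = {
    [A] = {-1},
    [B] = {A, -1},
    [C] = {B, -1},
    [E] = {C, -1},
    [F] = {C, -1},
    [G] = {E, F, -1},
    [P] = {B, -1},
    [Q] = {B, -1},
    [R] = {G, P, Q, -1},
};

/* Fills level[v] with the earliest step at which node v can run and returns
   the number of steps on the longest dependency chain. Assumes the enum
   order is a topological order (every input has a smaller index). */
int compute_levels(int level[N_NODES])
{
    int steps = 0;
    for (int v = 0; v < N_NODES; ++v) {
        level[v] = 0;
        for (int k = 0; deps[v][k] != -1; ++k) {
            int u = deps[v][k];
            assert(u < v);
            if (level[u] + 1 > level[v])
                level[v] = level[u] + 1;
        }
        if (level[v] + 1 > steps)
            steps = level[v] + 1;
    }
    return steps;
}
```

Nodes that end up on the same level (C, P, and Q, or E and F) have no dependencies on each other and could run concurrently. The chain A, B, C, E, G, R fixes the minimum number of steps.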

SLIDE 12

Cilk

    cilk int fib (int n)
    {
      if (n < 2) return n;
      else
      {
        int x, y;
        x = spawn fib (n-1);
        y = spawn fib (n-2);
        sync;
        return (x+y);
      }
    }

Features:

  • Adds keywords spawn, sync, (inlet, abort)
  • Remove keywords → valid (seq.) C

Timeline:

  • Developed at MIT, starting in ’94
  • Commercialized in ’06
  • Bought by Intel in ’09
  • Available in the Intel Compilers

Efficient implementation?
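
To make the "remove keywords" point concrete, here is the slide's function with `cilk`, `spawn`, and `sync` deleted and nothing else changed. What remains is ordinary sequential C that computes the same result.

```c
/* The slide's fib with the Cilk keywords removed: plain sequential C. */
int fib(int n)
{
    if (n < 2)
        return n;
    else {
        int x, y;
        x = fib(n - 1);   /* was: x = spawn fib (n-1); */
        y = fib(n - 2);   /* was: y = spawn fib (n-2); */
        /* was: sync;  (nothing to wait for in the sequential version) */
        return (x + y);
    }
}
```

For example, `fib(10)` evaluates to 55 in this serial version.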

SLIDE 21

Work-Stealing

Cilk’s Work-Stealing Scheduler

Each processor maintains a work deque of ready threads, and it manipulates the bottom of the deque like a stack.

[Figure: processors (P), each with its own deque; animation frames are labeled Spawn!, Return!, and Steal!]

When a processor runs out of work, it steals a thread from the top of a random victim’s deque.

With material by Charles E. Leiserson (MIT)

Why is Work-Stealing better than a Task Queue?
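
The deque discipline described on this slide can be modeled in a few lines. This is a single-threaded sketch under stated assumptions, not Cilk's actual implementation. The type `deque_t`, the function names, the fixed capacity, and the use of integer task IDs in place of ready threads are all hypothetical. A real scheduler would also need atomics or locks so that owner and thief can act concurrently.

```c
#include <assert.h>
#include <stddef.h>

#define DEQUE_CAP 64  /* hypothetical fixed capacity, never reclaimed here */

/* Single-threaded model of one worker's deque. */
typedef struct {
    int tasks[DEQUE_CAP];
    size_t top;     /* index of the oldest task; thieves take from here */
    size_t bottom;  /* one past the newest task; the owner works here */
} deque_t;

void deque_init(deque_t *d) { d->top = d->bottom = 0; }

size_t deque_size(const deque_t *d) { return d->bottom - d->top; }

/* Owner, on spawn: push a task onto the bottom. */
void push_bottom(deque_t *d, int task)
{
    assert(d->bottom < DEQUE_CAP);
    d->tasks[d->bottom++] = task;
}

/* Owner, on return: pop the newest task from the bottom; -1 if empty. */
int pop_bottom(deque_t *d)
{
    if (d->bottom == d->top)
        return -1;
    return d->tasks[--d->bottom];
}

/* Thief, on steal: take the oldest task from the victim's top; -1 if empty. */
int steal_top(deque_t *victim)
{
    if (victim->bottom == victim->top)
        return -1;
    return victim->tasks[victim->top++];
}
```

In this model the owner and a thief operate on opposite ends, so they only meet when the deque is nearly empty. The thief also receives the oldest entry, which in a divide-and-conquer computation tends to be the one closest to the root of the recursion.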

SLIDE 22

General Graphs: Issues

  • Model can accommodate ‘speculative execution’
  • Launch many different ‘approaches’
  • Abort the others as soon as one satisfactory one emerges.
  • Discover dependencies, make up schedule at run-time
  • Usually less efficient than the case of known dependencies
  • Map-Reduce absorbs many cases that would otherwise be general
  • On-line scheduling: complicated
  • Not a good fit if a more specific pattern applies
  • Good if inputs/outputs/functions are (somewhat) heavy-weight

SLIDE 23

Questions?
