SLIDE 1

Algorithm Engineering

(aka. How to Write Fast Code)

What is Parallelism and Scheduling

CS260 – Lecture 8, Yan Gu

Many slides in this lecture are borrowed from the seventh lecture of 6.172 Performance Engineering of Software Systems at MIT. Credit goes to Prof. Charles E. Leiserson, and the instructor appreciates the permission to use them in this course.
SLIDE 2

CS260: Algorithm Engineering Lecture 8

  • Fork-Join Parallelism
  • Greedy Scheduler
  • Work-Stealing Scheduler

SLIDE 3

Recall: Basics of Cilk

  • Cilk keywords grant permission for parallel execution. They do not command parallel execution.

int fib(int n) {
  if (n < 2) return n;
  int x, y;
  x = cilk_spawn fib(n-1);  // the named child function may execute in parallel with the parent caller
  y = fib(n-2);
  cilk_sync;                // control cannot pass this point until all spawned children have returned
  return x + y;
}
SLIDE 4

int fib(int n) {
  if (n < 2) return n;
  else {
    int x, y;
    x = cilk_spawn fib(n-1);
    y = fib(n-2);
    cilk_sync;
    return x + y;
  }
}

Execution Model

Example: fib(4)

SLIDE 5


Execution Model

The computation dag unfolds dynamically. Example: fib(4). The program is "processor-oblivious."

[Figure: the fib(4) computation dag]
SLIDE 6

How Much Parallelism?

Loop parallelism (cilk_for) is converted to spawns and syncs using recursive divide-and-conquer. Assuming that each node executes in unit time, what is the parallelism of this computation?

SLIDE 7

Performance Measures

T_P = execution time on P processors
W = work = 18

SLIDE 8

Performance Measures

T_P = execution time on P processors
W = work = 18
D = span* = 9
*Also called critical-path length or computational depth.

SLIDE 9

Performance Measures

T_P = execution time on P processors
W = work = 18
D = span* = 9
*Also called critical-path length or computational depth.

WORK LAW: T_P ≥ W/P
SPAN LAW: T_P ≥ D

SLIDE 10

Series Composition

[Figure: subcomputations A and B composed in series]

Work: W(A∪B) = W(A) + W(B)
Span: D(A∪B) = D(A) + D(B)

SLIDE 11

Parallel Composition

[Figure: subcomputations A and B composed in parallel]

Work: W(A∪B) = W(A) + W(B)
Span: D(A∪B) = max{D(A), D(B)}

SLIDE 12
Speedup

  • Definition. W/T_P = speedup on P processors.
  • If W/T_P < P, we have sublinear speedup.
  • If W/T_P = P, we have (perfect) linear speedup.
  • If W/T_P > P, we have superlinear speedup, which is not possible in this simple performance model, because of the WORK LAW T_P ≥ W/P.

SLIDE 13

Parallelism

Because the SPAN LAW dictates that T_P ≥ D, the maximum possible speedup given W and D is W/D = parallelism = the average amount of work per step along the span = 18/9 = 2.

SLIDE 14

Example: fib(4)

Assume for simplicity that each strand in fib(4) takes unit time to execute.

[Figure: the fib(4) dag; the 8 strands on the critical path]

Work: W = 17
Span: D = 8
Parallelism: W/D = 2.125

Using many more than 2 processors can yield only marginal performance gains.

SLIDE 15

  • Fork-Join Parallelism
  • Greedy Scheduler
  • Work-Stealing Scheduler

SLIDE 16

Scheduling

  • Fork-join parallelism allows the programmer to express potential parallelism in an application
  • The scheduler maps strands onto processors dynamically at runtime
  • Since the theory of distributed schedulers is complicated, we'll first explore the ideas with a centralized scheduler

[Figure: P processors, each with a cache ($), connected by a network to memory and I/O]

SLIDE 17

Greedy Scheduling

IDEA: Do as much as possible on every step.

  • Definition. A node is ready if all its predecessors have executed.

SLIDE 18

Greedy Scheduling

P = 3

Complete step:
  • ≥ P strands ready.
  • Run any P.

SLIDE 19

Greedy Scheduling

P = 3

Complete step:
  • ≥ P strands ready.
  • Run any P.

Incomplete step:
  • < P strands ready.
  • Run all of them.

SLIDE 20

Theorem [G68, B75, EZL89]. Any greedy scheduler achieves T_P ≤ W/P + D.

Analysis of Greedy

Proof.
∙ # complete steps ≤ W/P, since each complete step performs P work.
∙ # incomplete steps ≤ D, since each incomplete step reduces the span of the unexecuted dag by 1. ■

SLIDE 21

CS260: Algorithm Engineering Lecture 8

  • Fork-Join Parallelism
  • Greedy Scheduler
  • Work-Stealing Scheduler

SLIDE 22

Execution Model (recap)

The computation dag unfolds dynamically. Example: fib(4). "Processor-oblivious."

[Figure: the fib(4) computation dag]

SLIDE 23

Execution Model — Example: fib(4)

[Figure: P1 executes the dag; spawned strands become available for execution]

SLIDE 24

[Figure: idle workers P2 and P3 steal available strands — Steal! Steal!]

SLIDE 25

[Figure: P1, P2, and P3 execute strands of the dag in parallel]

SLIDE 26

[Figure: a worker that reaches cilk_sync can't execute further until its spawned children return — "Can't execute!"]

SLIDE 27

[Figure: execution continues]

SLIDE 28

Cilk Runtime System

Each worker (processor) maintains a work deque of ready strands, and it manipulates the bottom of the deque like a stack [MKH90, BL94, FLR98].

[Figure: four workers' deques of spawn/call frames — Call! pushes a frame onto the bottom of the caller's deque]

SLIDE 29

[Figure: deque animation — Spawn!]

SLIDE 30

[Figure: deque animation — Spawn! Spawn! Call!]

SLIDE 31

[Figure: deque animation — Return!]

SLIDE 32

[Figure: deque animation — Return!]

SLIDE 33

When a worker runs out of work, it steals from the top of a random victim's deque.

[Figure: deque animation — Steal!]

SLIDE 34

[Figure: deque animation — Steal!]

SLIDE 35

[Figure: deque animation — after the steal]

SLIDE 36

[Figure: deque animation — Spawn!]

SLIDE 37

[Figure: deque animation]

SLIDE 38

[Figure: deque animation]

Theorem [BL94]: With sufficient parallelism, workers steal infrequently ⇒ linear speed-up.

Link to Cilk implementation

SLIDE 39

Work-Stealing Bounds

Theorem. The work-stealing scheduler achieves expected running time T_P ≈ W/P + O(D) on P processors.

Pseudoproof. A processor is either working or stealing. The total time all processors spend working is W. Each steal has a 1/P chance of reducing the span by 1, so the expected cost of all steals is O(PD). Since there are P processors, the expected time is (W + O(PD))/P = W/P + O(D). ■

SLIDE 40

Overhead of work-stealing scheduler

Bound on the number of steals (whp): O(PD)

Running time (whp): T_P = (W + O(PD))/P = W/P + O(D)

Link to a simple proof

SLIDE 41

Successful steals can be expensive

  • Physical communication between two processors
  • Can lead to considerably more cache misses
  • Coarsening will not increase #SuccSteal

Bound on the number of steals (whp): O(PD)

SLIDE 42

Cilk supports C's rule for pointers: a pointer to stack space can be passed from parent to child, but not from child to parent. Cilk's cactus stack supports multiple views of the stack in parallel.

[Figure: spawn tree with frames A–E and the workers' views of the stack: A; A,B; A,C; A,C,D; A,C,E]

Cactus Stack

SLIDE 43

Bound on Stack Space

  • Theorem. Let S_1 be the stack space required by a serial execution of a Cilk program. Then the stack space required by a P-processor execution is at most S_P ≤ P·S_1.

Proof (by induction). The work-stealing algorithm maintains the busy-leaves property: every extant leaf activation frame has a worker executing it. ■

[Figure: P = 3 workers, each executing a leaf of the spawn tree, each using at most S_1 stack space]

SLIDE 44

  • Fork-Join Parallelism
  • Greedy Scheduler
  • Work-Stealing Scheduler

SLIDE 45

Design and Analysis of Parallel Algorithms

  • Work W, depth D, I/O cost Q (sequential / random)
  • Time for work: W/P
  • Time for I/O: max{Q/P, Q/B_max}
  • Number of steals: O(PD)
  • Most combinatorial algorithms are I/O-bottlenecked