SLIDE 1

Algorithm Engineering

(aka. How to Write Fast Code)

What is Parallelism and Scheduling

CS260 – Lecture 8, Yan Gu

Many slides in this lecture are borrowed from the seventh lecture of 6.172 Performance Engineering of Software Systems at MIT. Credit goes to Prof. Charles E. Leiserson, and the instructor appreciates the permission to use them in this course.
SLIDE 2

CS260: Algorithm Engineering Lecture 8

  • Fork-Join Parallelism
  • Greedy Scheduler
  • Work-Stealing Scheduler

SLIDE 3

Recall: Basics of Cilk

  • Cilk keywords grant permission for parallel execution. They do not command parallel execution.

int fib(int n) {
  if (n < 2) return n;
  int x, y;
  x = cilk_spawn fib(n-1);  // the named child function may execute in parallel with the parent caller
  y = fib(n-2);
  cilk_sync;                // control cannot pass this point until all spawned children have returned
  return x + y;
}
SLIDE 4

int fib(int n) {
  if (n < 2) return n;
  else {
    int x, y;
    x = cilk_spawn fib(n-1);
    y = fib(n-2);
    cilk_sync;
    return x + y;
  }
}

Execution Model

Example: fib(4)

SLIDE 5


Execution Model

The computation dag unfolds dynamically. Example: fib(4). The program is "processor-oblivious."

[Figure: the fib(4) computation dag]
SLIDE 6

How Much Parallelism?

Loop parallelism (cilk_for) is converted to spawns and syncs using recursive divide-and-conquer. Assuming that each node executes in unit time, what is the parallelism of this computation?

SLIDE 7

Performance Measures

T_P = execution time on P processors
W = work = 18

SLIDE 8

Performance Measures

T_P = execution time on P processors
W = work = 18
D = span* = 9
*Also called critical-path length or computational depth.

SLIDE 9

Performance Measures

T_P = execution time on P processors
W = work = 18
D = span* = 9
*Also called critical-path length or computational depth.

WORK LAW: T_P ≥ W/P
SPAN LAW: T_P ≥ D

SLIDE 10

Series Composition

[Figure: subcomputations A and B composed in series]

Work: W(A∪B) = W(A) + W(B)
Span: D(A∪B) = D(A) + D(B)

SLIDE 11

Parallel Composition

[Figure: subcomputations A and B composed in parallel]

Work: W(A∪B) = W(A) + W(B)
Span: D(A∪B) = max{D(A), D(B)}

SLIDE 12
Speedup

  • Definition. W/T_P = speedup on P processors.
  • If W/T_P < P, we have sublinear speedup.
  • If W/T_P = P, we have (perfect) linear speedup.
  • If W/T_P > P, we have superlinear speedup, which is not possible in this simple performance model, because of the WORK LAW T_P ≥ W/P.

SLIDE 13

Parallelism

Because the SPAN LAW dictates that T_P ≥ D, the maximum possible speedup given W and D is W/D = parallelism = the average amount of work per step along the span = 18/9 = 2.

SLIDE 14

Example: fib(4)

Assume for simplicity that each strand in fib(4) takes unit time to execute.

[Figure: the fib(4) dag; the 8 strands on the critical path]

Work: W = 17
Span: D = 8
Parallelism: W/D = 2.125

Using many more than 2 processors can yield only marginal performance gains.

SLIDE 15

  • Fork-Join Parallelism
  • Greedy Scheduler
  • Work-Stealing Scheduler

SLIDE 16

Scheduling

  • Fork-join parallelism allows the programmer to express potential parallelism in an application
  • The scheduler maps strands onto processors dynamically at runtime
  • Since the theory of distributed schedulers is complicated, we'll first explore the ideas with a centralized scheduler

[Figure: P processors, each with a cache ($), connected by a network to memory and I/O]

SLIDE 17

Greedy Scheduling

IDEA: Do as much as possible on every step.

  • Definition. A node is ready if all its predecessors have executed.

SLIDE 18

Greedy Scheduling

P = 3

Complete step:
  • ≥ P strands ready.
  • Run any P.

SLIDE 19

Greedy Scheduling

P = 3

Complete step:
  • ≥ P strands ready.
  • Run any P.

Incomplete step:
  • < P strands ready.
  • Run all of them.

SLIDE 20

Theorem [G68, B75, EZL89]. Any greedy scheduler achieves T_P ≤ W/P + D.

Analysis of Greedy

Proof.
∙ # complete steps ≤ W/P, since each complete step performs P work.
∙ # incomplete steps ≤ D, since each incomplete step reduces the span of the unexecuted dag by 1. ■

SLIDE 21

CS260: Algorithm Engineering Lecture 8

  • Fork-Join Parallelism
  • Greedy Scheduler
  • Work-Stealing Scheduler

SLIDE 22

Execution Model (recap)

The computation dag unfolds dynamically. Example: fib(4). "Processor-oblivious."

[Figure: the fib(4) computation dag]

SLIDE 23

Execution Model — Example: fib(4)

[Figure: P1 executes the dag; spawned strands become available for execution]

SLIDE 24

[Figure: idle workers P2 and P3 steal available strands — Steal! Steal!]

SLIDE 25

[Figure: P1, P2, and P3 execute strands of the dag in parallel]

SLIDE 26

[Figure: a worker that reaches cilk_sync can't execute further until its spawned children return — "Can't execute!"]

SLIDE 27

[Figure: execution continues]

SLIDE 28

Cilk Runtime System

Each worker (processor) maintains a work deque of ready strands, and it manipulates the bottom of the deque like a stack [MKH90, BL94, FLR98].

[Figure: four workers' deques of spawn/call frames — Call! pushes a frame onto the bottom of the caller's deque]

SLIDE 29

[Figure: deque animation — Spawn!]

SLIDE 30

[Figure: deque animation — Spawn! Spawn! Call!]

SLIDE 31

[Figure: deque animation — Return!]

SLIDE 32

[Figure: deque animation — Return!]

SLIDE 33

When a worker runs out of work, it steals from the top of a random victim's deque.

[Figure: deque animation — Steal!]

SLIDE 34

[Figure: deque animation — Steal!]

SLIDE 35

[Figure: deque animation — after the steal]

SLIDE 36

[Figure: deque animation — Spawn!]

SLIDE 37

[Figure: deque animation]

SLIDE 38

[Figure: deque animation]

Theorem [BL94]: With sufficient parallelism, workers steal infrequently ⇒ linear speed-up.

Link to Cilk implementation

SLIDE 39

Work-Stealing Bounds

Theorem. The work-stealing scheduler achieves expected running time T_P ≈ W/P + O(D) on P processors.

Pseudoproof. A processor is either working or stealing. The total time all processors spend working is W. Each steal has a 1/P chance of reducing the span by 1, so the expected cost of all steals is O(PD). Since there are P processors, the expected time is (W + O(PD))/P = W/P + O(D). ■

SLIDE 40

Overhead of work-stealing scheduler

Bound on the number of steals (whp): O(PD)

Running time (whp): T_P = (W + O(PD))/P = W/P + O(D)

Link to a simple proof

SLIDE 41

Successful steals can be expensive

  • Physical communication between two processors
  • Can lead to considerably more cache misses
  • Coarsening will not increase #SuccSteal

Bound on the number of steals (whp): O(PD)

SLIDE 42

Cilk supports C's rule for pointers: a pointer to stack space can be passed from parent to child, but not from child to parent. Cilk's cactus stack supports multiple views of the stack in parallel.

[Figure: spawn tree with frames A–E and the workers' views of the stack: A; A,B; A,C; A,C,D; A,C,E]

Cactus Stack

SLIDE 43

Bound on Stack Space

  • Theorem. Let S_1 be the stack space required by a serial execution of a Cilk program. Then the stack space required by a P-processor execution is at most S_P ≤ P·S_1.

Proof (by induction). The work-stealing algorithm maintains the busy-leaves property: every extant leaf activation frame has a worker executing it. ■

[Figure: P = 3 workers, each executing a leaf of the spawn tree, each using at most S_1 stack space]

SLIDE 44

  • Fork-Join Parallelism
  • Greedy Scheduler
  • Work-Stealing Scheduler

SLIDE 45

Design and Analysis of Parallel Algorithms

  • Work W, depth D, I/O cost Q (sequential / random)
  • Time for work: W/P
  • Time for I/O: max{Q/P, Q/B_max}
  • Number of steals: O(PD)
  • Most combinatorial algorithms are I/O-bottlenecked