

SLIDE 1

Space Profiling for Parallel Functional Programs

Daniel Spoonhower¹, Guy Blelloch¹, Robert Harper¹, & Phillip Gibbons²

¹Carnegie Mellon University   ²Intel Research Pittsburgh

23 September 2008, ICFP ’08, Victoria, BC

SLIDES 2–4

Improving Performance – Profiling Helps!

Profiling improves functional program performance.
Good performance in parallel programs is also hard to achieve.
This work: space profiling for parallel programs.

SLIDES 5–8

Example: Matrix Multiply

Naïve NESL code for matrix multiplication:

    function dot(a,b) = sum({a * b : a; b})
    function prod(m,n) = {{dot(m,n) : n} : m}

Requires O(n³) space for n × n matrices!

◮ compare to O(n²) for sequential ML

Given a parallel functional program, can we determine, “How much space will it use?” Short answer: it depends on the implementation.
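
For readers who prefer ML to NESL, a rough transliteration is sketched below. Here par and parMap are hypothetical fork/join primitives of my own, standing in for NESL's parallel comprehensions; they are not the API of NESL or of the MLton extension discussed later.

    (* Hypothetical fork/join primitive: evaluate both thunks, possibly in
       parallel, and return both results.  This sequential stand-in fixes
       the types without committing to a scheduler. *)
    fun par (f, g) = (f (), g ())

    (* Parallel map expressed with par. *)
    fun parMap f [] = []
      | parMap f (x :: xs) =
          let val (y, ys) = par (fn () => f x, fn () => parMap f xs)
          in y :: ys end

    (* dot(a,b) = sum({a * b : a; b}) *)
    fun dot (a, b) = foldl op+ 0 (parMap (op * ) (ListPair.zip (a, b)))

    (* prod(m,n) = {{dot(m,n) : n} : m}; n is passed as a list of columns *)
    fun prod (m, n) = parMap (fn row => parMap (fn col => dot (row, col)) n) m

Every multiplication inside dot is its own parallel task, so whether all n³ intermediate products are live at once depends on how those tasks are scheduled.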

SLIDE 9

Scheduling Matters

Parallel programs admit many different executions

◮ not all implementations of matrix multiply are O(n³)

Determined (in part) by scheduling policy

◮ lots of parallelism; policy says what runs next

SLIDE 10

Semantic Space Profiling

Our approach: factor problem into two parts.

1. Define parallel structure (as graphs)

   ◮ circumscribes all possible executions
   ◮ deterministic (independent of policy, &c.)
   ◮ includes approximate space use

2. Define scheduling policies (as traversals of graphs)

   ◮ used in profiling, visualization
   ◮ gives a specification for the implementation

SLIDE 11

Contributions

Contributions of this work:

◮ a cost semantics accounting for...
   ◮ scheduling policies
   ◮ space use
◮ semantic space profiling tools
◮ an extensible implementation in MLton

SLIDES 12–13

Talk Summary

Cost Semantics, Part I: Parallel Structure
Cost Semantics, Part II: Space Use
Semantic Profiling

SLIDES 14–16

Program Execution as a Dag

Model execution as a directed acyclic graph (dag). One graph describes all parallel executions.

◮ nodes represent units of work
◮ edges represent sequential dependencies

Each schedule corresponds to a traversal.

◮ every node must be visited; parents first
◮ limit the number of nodes visited in each step

A policy determines the schedule for every program.
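
This traversal view can be made executable. The simulator below is my own SML sketch (not the paper's implementation): a dag is given by a children function and per-node in-degrees, each step visits at most p ready nodes, and the merge argument decides where newly enabled nodes are queued. Scheduling policies will differ only in merge.

    (* Nodes are 0 .. n-1; the left-to-right order of children lists gives
       the "leftmost" tie-breaking order.  Returns the schedule as the list
       of node sets visited at each step. *)
    fun schedule (n, children : int -> int list, indegree : int -> int)
                 (p : int)
                 (merge : int list * int list -> int list) =
      let
        val remaining = Array.tabulate (n, indegree)   (* unvisited parents *)
        val roots = List.filter (fn v => indegree v = 0)
                                (List.tabulate (n, fn i => i))
        (* Visiting v enables the children whose parents are now all done. *)
        fun enable (v, acc) =
          List.foldl (fn (c, acc) =>
                        let val r = Array.sub (remaining, c) - 1 in
                          Array.update (remaining, c, r);
                          if r = 0 then acc @ [c] else acc
                        end)
                     acc (children v)
        fun loop ([], steps) = List.rev steps
          | loop (ready, steps) =
              let
                val now = List.take (ready, Int.min (p, List.length ready))
                val pending = List.drop (ready, List.length now)
              in
                loop (merge (pending, List.foldl enable [] now), now :: steps)
              end
      in
        loop (roots, [])
      end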

SLIDES 17–18

Program Execution as a Dag (con’t)

Graphs are NOT...

◮ control flow graphs
◮ explicitly built at runtime

Graphs are...

◮ derived from the cost semantics
◮ unique per closed program
◮ independent of scheduling

SLIDE 19

Breadth-First Scheduling Policy

Scheduling policy defined by:

◮ breadth-first traversal of the dag (i.e., visit nodes at shallow depth first)
◮ break ties by taking the leftmost node
◮ visit at most p nodes per step (p = number of processor cores)
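
In the simulator sketched earlier, this policy is (roughly) a FIFO discipline: newly enabled nodes, which sit one level deeper, wait behind everything already pending, and list order supplies the leftmost tie-break. A sketch under the same assumptions:

    (* Breadth-first: enabled children go to the back of the queue. *)
    fun breadthFirst (pending, enabled) = pending @ enabled

    (* e.g., schedule (n, children, indegree) 2 breadthFirst *)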

SLIDES 20–27

Breadth-First Illustrated (p = 2)

(animation: eight figure-only frames stepping a breadth-first traversal of an example dag, two nodes per step)

SLIDE 28

Breadth-First Scheduling Policy

A variation of this policy is implicit in implementations of NESL & Data Parallel Haskell

◮ vectorization bakes in the schedule

SLIDE 29

Depth-First Scheduling Policy

Scheduling policy defined by:

◮ depth-first traversal of the dag (i.e., favor children of recently visited nodes)
◮ break ties by taking the leftmost node
◮ visit at most p nodes per step (p = number of processor cores)
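
In the same simulator sketch, depth-first is the LIFO discipline: children of just-visited nodes are taken before older pending work.

    (* Depth-first: enabled children run next; with p = 1 this follows the
       sequential evaluation order (cf. slide 39). *)
    fun depthFirst (pending, enabled) = enabled @ pending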

SLIDES 30–38

Depth-First Illustrated (p = 2)

(animation: nine figure-only frames stepping a depth-first traversal of the same dag)

SLIDE 39

Depth-First Scheduling Policy

Sequential execution = a one-processor depth-first schedule

SLIDE 40

Work-Stealing Scheduling Policy

“Work-stealing” means many things:

◮ idle processors shoulder the burden of communication
◮ specific implementations, e.g. Cilk
◮ an implied ordering of parallel tasks

For the purposes of space profiling, the ordering is important

◮ briefly: globally breadth-first, locally depth-first
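
That ordering can be rendered in the style of the earlier simulator. The sketch below is my own lock-step approximation (real work-stealing runtimes such as Cilk use concurrent per-processor deques, not a synchronous loop): each processor works depth-first on its own stack, and an idle processor steals the oldest entry from the first nonempty victim, which is what makes the policy globally breadth-first.

    fun workStealing (n, children, indegree) p =
      let
        val remaining = Array.tabulate (n, indegree)
        val stacks = Array.array (p, [] : int list)
        val () = Array.update (stacks, 0,            (* roots start on proc 0 *)
                   List.filter (fn v => indegree v = 0)
                               (List.tabulate (n, fn i => i)))
        fun push (i, c) = Array.update (stacks, i, c :: Array.sub (stacks, i))
        fun enable (i, v) =                          (* locally depth-first *)
          List.app (fn c =>
                      let val r = Array.sub (remaining, c) - 1 in
                        Array.update (remaining, c, r);
                        if r = 0 then push (i, c) else ()
                      end)
                   (children v)
        fun steal j =                                (* oldest entry is stolen *)
          if j = p then NONE
          else case rev (Array.sub (stacks, j)) of
                 [] => steal (j + 1)
               | oldest :: rest =>
                   (Array.update (stacks, j, rev rest); SOME oldest)
        fun pop i =
          case Array.sub (stacks, i) of
            [] => steal 0
          | v :: vs => (Array.update (stacks, i, vs); SOME v)
        fun step () =
          case List.mapPartial
                 (fn i => Option.map (fn v => (i, v)) (pop i))
                 (List.tabulate (p, fn i => i)) of
            [] => []
          | now => (List.app enable now; map #2 now :: step ())
      in
        step ()
      end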

SLIDE 41

Computation Graphs: Summary

The cost semantics defines a graph for each closed program

◮ i.e., it defines the parallel structure
◮ we call this graph the computation graph

Scheduling policies are defined on graphs

◮ they describe behavior without data structures, synchronization, &c.

SLIDE 42

Talk Summary

Cost Semantics, Part I: Parallel Structure
Cost Semantics, Part II: Space Use
Semantic Profiling

SLIDES 43–44

Heap Graphs

Goal: describe space use independently of schedule

◮ our innovation: add heap graphs

Heap graphs also act as a specification

◮ constrain the use of space by compiler & GC
◮ just as the computation graph constrains the schedule

Computation & heap graphs share nodes

◮ think: one graph with two sets of edges
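
Concretely, one can picture this as a single node set carrying two edge relations. The representation below is my own SML sketch (the semantics defines these graphs abstractly); later sketches reuse it.

    type node = int

    type costGraph = {
      nodes     : node list,
      compEdges : (node * node) list,   (* sequential dependencies *)
      heapEdges : (node * node) list    (* uses of heap values *)
    }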

SLIDES 45–51

Cost for Parallel Pairs

Generate costs for a parallel pair, {e1, e2} (see paper for inference rules)

(figure: the computation and heap graphs built for the pair, with nodes for e1 and e2)
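
To give the flavor of what such a rule builds, here is a sketch of my own over the costGraph type above (the authoritative inference rules are in the paper): the subgraphs for e1 and e2 are composed in series-parallel fashion between fresh fork and join nodes. The real rules also add heap edges for the pair's allocated value, omitted here.

    (* Compose cost graphs g1 and g2 for {e1, e2}.  fork/join are assumed
       fresh; srcI and snkI are the entry and exit nodes of each subgraph. *)
    fun parPair {g1 : costGraph, g2 : costGraph, fork : node, join : node,
                 src1 : node, snk1 : node, src2 : node, snk2 : node} : costGraph =
      { nodes = fork :: join :: #nodes g1 @ #nodes g2,
        compEdges = (fork, src1) :: (fork, src2) :: (snk1, join) :: (snk2, join)
                    :: #compEdges g1 @ #compEdges g2,
        heapEdges = #heapEdges g1 @ #heapEdges g2 }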

SLIDES 52–53

From Cost Graphs to Space Use

Recall: a schedule is a traversal of the computation graph

◮ visiting p nodes per step to simulate p processors

Each step of the traversal divides the set of nodes into:

1. nodes executed in the past
2. nodes to be executed in the future

Heap edges crossing from future to past are “roots”

◮ i.e., future uses of existing values
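
With the costGraph representation sketched earlier, the roots at a step fall out directly (an illustration only; the profiler measures the space reachable from the roots, not merely their number):

    (* Roots after some prefix of the schedule has run: targets of heap
       edges whose source is still in the future but whose target is in the
       past, i.e. existing values the future may still use. *)
    fun roots (g : costGraph) (past : node -> bool) : node list =
      List.mapPartial
        (fn (src, dst) =>
           if not (past src) andalso past dst then SOME dst else NONE)
        (#heapEdges g)

The space high-water mark is then, roughly, the maximum over all steps of the space reachable from these roots.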

SLIDES 54–59

Determining Space Use

(animation: six figure-only frames)

SLIDES 60–67

Heap Edges Also Track Uses

Heap edges are also added as “possible last-uses,” e.g., for

    if e1 then e2 else e3        (where e1 →* true)

(figure: graph with nodes for e1 and e2 and the values of e3)

SLIDES 68–69

Heap Graphs: Summary

A heap edge from B to A indicates a dependency on A... given knowledge up to the time corresponding to B.

There is some push-back on the semantics from the implementation:

◮ the semantics must be implementable
◮ e.g., “true” vs. “provable” garbage

SLIDE 70

Example Graphs

Matrix multiplication

◮ (figure: computation graph on the left, heap graph on the right)

SLIDE 71

Talk Summary

Cost Semantics, Part I: Parallel Structure
Cost Semantics, Part II: Space Use
Semantic Profiling

SLIDES 72–75

Semantic Profiling

Analysis of costs

◮ not a static analysis

The semantics yields one set of costs per input

◮ run the program over many inputs to generalize

Semantic ⇒ independent of implementation

✖ loses some precision
✔ acts as a specification

SLIDES 76–77

Visualizing Schedules

Distill graphs, focusing on parallel structure

◮ coalesce sequential computation
◮ use size, color, relative position
◮ omit less interesting edges

Graphs are derived from the semantics, compressed mechanically, then laid out with GraphViz.
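
For instance, the final step could emit DOT for GraphViz. A minimal sketch on the costGraph type from earlier (the actual tool also encodes size, color, and relative position, as described above):

    (* Render a costGraph for GraphViz: solid computation edges, dashed
       heap edges. *)
    fun toDot (g : costGraph) : string =
      let
        fun edge style (a, b) =
          "  n" ^ Int.toString a ^ " -> n" ^ Int.toString b ^ style ^ ";\n"
      in
        "digraph G {\n"
        ^ String.concat (map (edge "") (#compEdges g))
        ^ String.concat (map (edge " [style=dashed]") (#heapEdges g))
        ^ "}\n"
      end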

SLIDES 78–82

Distilled graph visualizations (figures):

Matrix Multiply (Breadth-First, p = 2)
Matrix Multiply (Work Stealing, p = 2)
Quick Hull
Quick Hull (Depth-First, p = 2)
Quick Hull (Work Stealing, p = 2)

SLIDES 83–85

Space Use By Input Size

Matrix multiply with the breadth-first scheduling policy:

(plot: space high-water mark (units) vs. input size (# rows/columns); series: work queue, [b,a], [m,u], append (#1 lr, #2 lr), remainder; annotations on successive slides call out Scheduler Overhead and Closures)

SLIDES 86–87

Verifying Profiling Results

Implemented a parallel extension to MLton

◮ including three different schedulers
◮ compared predicted and actual space use

SLIDE 88

Matrix Multiply – MLton Space Use

(plot: max live (MB) vs. input size (# of rows/columns) for the Depth-First, Work-Stealing, and Breadth-First schedulers)

SLIDE 89

Quicksort – MLton Space Use

(plot: max live (MB) vs. input size (# elements) for the Depth-First, Work-Stealing, and Breadth-First schedulers)

SLIDES 90–91

Initial Quicksort Results

◮ predicted: breadth-first outperforms depth-first
◮ initial observation: same results!

SLIDES 92–94

Space Leak Revealed

Cause: the reference flattening optimization (representing reference cells directly in records)

Now fixed in the MLton source repository.

Without a cost semantics, there is no bug!

SLIDE 95

Also in the Paper

More details, including...

◮ rules for the cost semantics
◮ discussion of the MLton implementation
   ◮ efficient method for space measurements
◮ more plots (profiling, speedup, &c.)
◮ application to vectorization (in the TR)

SLIDE 96

Selected Related Work

Cost semantics

◮ Sansom & Peyton Jones. POPL ’95
◮ Blelloch & Greiner. ICFP ’96

Scheduling

◮ Blelloch, Gibbons, & Matias. JACM ’99
◮ Blumofe & Leiserson. JACM ’99

Profiling

◮ Runciman & Wakeling. JFP ’93
◮ ibid. Glasgow FP ’93

SLIDES 97–98

Conclusion

Semantic profiling for parallel programs...

◮ accounts for scheduling and space use
◮ constrains the implementation (and finds bugs!)
◮ supports visualization & predicts actual performance

SLIDE 99

Thanks!

Thanks to the MLton developers, and thank you for listening! Questions? spoons@cmu.edu

Download binaries, source code, papers, and slides:
http://www.cs.cmu.edu/~spoons/parallel/

    svn co svn://mlton.org/mlton/... branches/shared-heap-multicore mlton