SLIDE 1

Parallel Dataflow Graph Coloring

Ahmet Erdem Sarıyüce¹, Erik Saule², and Ümit V. Çatalyürek¹,³

esaule@uncc.edu, {aerdem,umit}@bmi.osu.edu

¹ Department of Biomedical Informatics, The Ohio State University
² Department of Computer Science, University of North Carolina at Charlotte
³ Department of Electrical and Computer Engineering, The Ohio State University

Scheduling in Aussois^W Dagstuhl^W^W^W Algorithms and Scheduling Techniques for Exascale Systems

Erik Saule (UNCC) Parallel Dataflow Coloring Dagstuhl 2013 1 / 21

SLIDE 2

Outline

1. Parallel Graph Coloring
2. Dataflow Graph Coloring
3. What’s the link with scheduling?
4. Conclusion

SLIDE 3

The Graph Coloring Problem

Definition

Coloring a graph consists of assigning a color (an integer) to each vertex so that no two adjacent vertices have the same color.

Complexity

The problem of finding a coloring with the minimum number of colors is NP-Hard. No approximation within |V|^(1−ε) is possible. The greedy algorithm returns a solution with at most 1 + Δ colors, where Δ is the maximum degree.


SLIDE 4

Graph Coloring Algorithm

First Fit algorithm

Pick a vertex and assign it the first available color. Then pick another one.

There exists a vertex ordering which leads to an optimal coloring.

Algorithm 1: Sequential greedy coloring.
Data: G = (V, E)
for each v ∈ V do
    for each w ∈ adj(v) do
        forbiddenColors[color[w]] ← v
    color[v] ← min{i > 0 : forbiddenColors[i] ≠ v}
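The first-fit scheme above can be sketched in a few lines of Python (an illustrative transcription of Algorithm 1, not the authors' code; the adjacency-list graph is a made-up example):

```python
# Minimal sketch of sequential greedy first-fit coloring.
def greedy_coloring(graph):
    """graph: dict mapping each vertex to a list of its neighbors."""
    color = {}      # vertex -> assigned color (1-based)
    forbidden = {}  # color -> last vertex that marked it forbidden
    for v in graph:
        for w in graph[v]:
            if w in color:
                forbidden[color[w]] = v  # color[w] is unusable for v
        # first fit: smallest color not marked forbidden by v
        c = 1
        while forbidden.get(c) == v:
            c += 1
        color[v] = c
    return color

# Example: a 4-cycle, where this visit order happens to use 2 colors.
cycle = {1: [2, 4], 2: [1, 3], 3: [2, 4], 4: [1, 3]}
colors = greedy_coloring(cycle)
```

The `forbidden.get(c) == v` test is the stamping trick from Algorithm 1: one forbidden-color table is reused across all vertices without ever being cleared.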

Many derivative algorithms:
With Largest First
With Smallest Last
Dynamic orderings
Least Used instead of First Fit
Iterated algorithm to do local descent
Today, let’s talk about the natural one.

SLIDE 5

Parallel Speculative Graph Coloring (Shared Memory)

Algorithm 2: TentativeColoring

Data: G = (V, E), Visit ⊂ V, color[1 : |V|]
maxcolor ← 1
localMC ← 1
for each v ∈ Visit in parallel do
    for each w ∈ adj(v) do
        localFC[color[w]] ← v
    color[v] ← min{i > 0 : localFC[i] ≠ v}
    if color[v] > localMC then
        localMC ← color[v]
maxcolor ← Reduce(max) localMC
return maxcolor

Algorithm 3: DetectConflict

Data: G = (V, E), Visit ⊂ V, color[1 : |V|]
Conflict ← ∅
for each v ∈ Visit in parallel do
    for each w ∈ adj(v) do
        if color[v] = color[w] then
            if v < w then
                atomic Conflict ← Conflict ∪ {v}
return Conflict

At least two passes. More if unlucky (in practice 2 + ε).
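The multi-pass structure can be modeled sequentially in Python (a hedged sketch mimicking Algorithms 2 and 3, not the authors' implementation; the snapshot read is my assumption to model worst-case concurrent threads that all see stale colors):

```python
# Sequential model of speculative coloring: each pass tentatively
# colors the Visit set against a snapshot of the colors (mimicking
# threads reading pre-pass values), then keeps only the conflicting
# vertices for the next pass.
def tentative_coloring(graph, visit, color):
    snapshot = dict(color)  # all "threads" see pre-pass colors
    for v in visit:
        forbidden = {snapshot[w] for w in graph[v] if snapshot[w] > 0}
        c = 1
        while c in forbidden:
            c += 1
        color[v] = c

def detect_conflicts(graph, visit, color):
    # of two same-colored neighbors, only the smaller ID is redone
    return {v for v in visit for w in graph[v]
            if color[v] == color[w] and v < w}

def speculative_coloring(graph):
    color = {v: 0 for v in graph}
    visit, passes = set(graph), 0
    while visit:
        tentative_coloring(graph, visit, color)
        visit = detect_conflicts(graph, visit, color)
        passes += 1
    return color, passes

# A triangle under worst-case concurrency: every pass conflicts until
# the colors separate, taking 3 passes here.
triangle = {1: [2, 3], 2: [1, 3], 3: [1, 2]}
colors, passes = speculative_coloring(triangle)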

SLIDE 6

Outline

1. Parallel Graph Coloring
2. Dataflow Graph Coloring
3. What’s the link with scheduling?
4. Conclusion

SLIDE 7

Parallel Dataflow Algorithm

Principle

The principle of a dataflow algorithm is that the generation of a result triggers the computation of the tasks that depend on it.

Dataflow coloring

The idea is to pick an absolute order of the vertices; each vertex only considers the colors of the neighbors with an ID lower than its own.

[Figure: example graph with vertices 0–5]

0 and 1 can be executed concurrently; 2 and 3 can be executed concurrently; 4 and 5 can be executed concurrently.
Not speculative, so only one pass.

SLIDE 8

Two approaches

Pick the vertices in some order. What happens when you pick a vertex whose higher-priority neighbors have not yet been assigned a color?

Recursive Dataflow

You recursively process the neighbor.
No waiting time.
Some form of “work-stealing” algorithm.
Complex synchronisation.
Higher memory allocation (or potentially redundant work).

(Direct) Dataflow

You wait.
No redundant work.
Simpler work-sharing construct.
But maybe you waste time waiting.
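A hedged sequential sketch of the recursive variant (my reconstruction of the idea, not the authors' code): when a lower-ID neighbor is still uncolored at the time a vertex is picked, the thread colors that neighbor itself instead of waiting.

```python
# Recursive-dataflow sketch: instead of waiting for an uncolored
# lower-ID neighbor, process it recursively (a sequential stand-in for
# the work-stealing behavior described above).
def color_vertex(graph, v, color):
    if v in color:
        return
    for w in graph[v]:
        if w < v and w not in color:
            color_vertex(graph, w, color)  # "steal" the pending work
    forbidden = {color[w] for w in graph[v] if w < v}
    c = 1
    while c in forbidden:
        c += 1
    color[v] = c

# Visiting a 4-vertex path in an arbitrary order still yields the
# ID-ordered coloring.
g = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}
color = {}
for v in (4, 2, 1, 3):
    color_vertex(graph=g, v=v, color=color)
```

In a real parallel run two threads can recurse into the same neighbor, hence the redundant work and synchronisation cost noted above.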

SLIDE 9

Which is best?

[Figure: speedup (0–5.5) vs. number of threads (1–16) for Dataflow and Dataflow-recursive]

with lots of (as yet) unexplained optimizations.

SLIDE 10

Outline

1. Parallel Graph Coloring
2. Dataflow Graph Coloring
3. What’s the link with scheduling?
4. Conclusion

SLIDE 11

In practice, parallel speedup is 1

[Figure: the graph is a chain of 10 vertices, 1–10]

The graph is executed one vertex after another. So there are actually dependencies.

Graham List Scheduling

When scheduling a DAG, a greedy algorithm guarantees:

Cmax ≤ W/p + (1 − 1/p) · CP

where W is the total work, p the number of processors, and CP the critical path length.
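A quick arithmetic check of the bound (my numbers, not the slide's): with unit-time vertices, W = |V| and CP is the longest dependency chain.

```python
# Graham's list-scheduling bound: Cmax <= W/p + (1 - 1/p) * CP.
def graham_bound(W, CP, p):
    return W / p + (1 - 1 / p) * CP

# A chain of 10 unit tasks: CP == W == 10, so the bound is 10 for any
# p -- no speedup is guaranteed, matching "parallel speedup is 1".
chain = graham_bound(W=10, CP=10, p=4)   # 2.5 + 0.75 * 10 = 10.0
# 10 independent tasks (CP = 1) on 10 processors: near-perfect speedup.
spread = graham_bound(W=10, CP=1, p=10)  # 1.0 + 0.9 * 1 = 1.9
```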

SLIDE 13

It gets worse...

Static Scheduling

[Figure: chain of vertices 1–10 partitioned into static chunks]

If you use a static OpenMP schedule, you add de facto dependencies in your graph. And the critical path increases significantly.
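This effect can be simulated with a hypothetical sketch (entirely my construction, not the authors' code): take vertices with no true coloring dependencies, chain each static chunk, and watch the critical path grow.

```python
# Static-schedule effect: each thread runs its contiguous chunk in
# order, which chains the chunk's vertices even when the coloring
# itself imposes no dependency between them.
def critical_path(n, deps):
    # longest path (in vertices) of a DAG whose deps point to smaller IDs
    length = {}
    for v in range(1, n + 1):
        length[v] = 1 + max((length[w] for w in deps[v]), default=0)
    return max(length.values())

def static_chunk_deps(n, p):
    chunk = -(-n // p)  # ceil(n / p): one contiguous chunk per thread
    deps = {v: [] for v in range(1, n + 1)}
    for start in range(1, n + 1, chunk):
        for v in range(start + 1, min(start + chunk, n + 1)):
            deps[v].append(v - 1)  # de facto edge inside the chunk
    return deps

# 10 fully independent vertices on 2 threads: CP jumps from 1 to 5.
cp = critical_path(10, static_chunk_deps(n=10, p=2))
```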

That’s easy! Let’s use dynamic instead!

SLIDE 15

Is Dynamic Better?

Dynamic Scheduling

[Figure: chain of vertices 1–10 under a dynamic schedule with two threads]

Even if you use a dynamic OpenMP schedule, similar effects still happen. With two threads, 4 and 5 need to be executed before 6 can start. So 1, 2 and 3 are implicit predecessors of 6, because that is what the scheduler will do.

An easy solution

Compute a level-by-level order. Well... that requires a graph traversal, and the whole point was to traverse the graph only once.

SLIDE 17

It gets EVEN worse

Nobody should use “dynamic,1”

[Figure: chain of vertices 1–10 under a “dynamic,1” schedule]

Since vertices are grouped together by OpenMP’s granularity, you have implicit edges between the vertices of each group. There is an implicit edge between 1 and 2, and between 5 and 6.

In this type of kernel, you should use chunks of at least 32 vertices.

SLIDE 19

First Results

[Figure: real vs. expected speedup, static and dynamic schedules, vs. number of threads (1–8)]

(a) auto, (b) ldoor

Ouch!

SLIDE 20

Reordering of vertex IDs

Just a chain

[Figure: a chain of 8 vertices in natural order, 1–8]

The critical path can be quite long in a natural ordering.

At best

[Figure: the chain with a reordering of the vertex IDs that shortens the critical path]

Need to traverse the graph...

At random

[Figure: the chain with randomly shuffled vertex IDs]

Not the best, but probably not the worst you can get. For cache purposes, you need to keep some locality, so you shuffle blocks of vertices. Any guarantee on that?
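The block shuffle can be sketched as follows (my construction; the block size and seed are arbitrary choices for illustration):

```python
import random

# Block shuffle: randomize the order of blocks of consecutive vertex
# IDs, keeping locality within each block for cache purposes.
def block_shuffled_order(n, block_size, seed=0):
    ids = list(range(1, n + 1))
    blocks = [ids[i:i + block_size] for i in range(0, n, block_size)]
    random.Random(seed).shuffle(blocks)   # shuffle blocks, not vertices
    return [v for b in blocks for v in b]

order = block_shuffled_order(n=12, block_size=4)
```

Every permutation produced this way keeps runs of `block_size` consecutive IDs intact, so memory accesses within a block stay local.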

SLIDE 22

Results

SLIDE 23

Outline

1. Parallel Graph Coloring
2. Dataflow Graph Coloring
3. What’s the link with scheduling?
4. Conclusion

SLIDE 24

Faster than speculative coloring?

[Figure: speedup of CD_best and DF_best vs. number of threads on Owens (8 cores), Oakley (16 cores), and Mirasol (40 cores)]

Works on small machines. The impact seems to decrease with core count; there is another limiting factor.

SLIDE 25

Wrap up

Conclusions

We designed and tested dataflow coloring algorithms on multicore architectures and analyzed their performance. Classical scheduling helps in understanding performance issues in not-so-related problems. A similar analysis was performed on BFS.

Future work

How to pick a better order? Can we get rid of the computation and lookup of the permutation?

SLIDE 26

Thank you

More information

Contact: esaule@uncc.edu
Visit: http://webpages.uncc.edu/~esaule
