Steady-State Scheduling on Heterogeneous Platforms
Matthieu Gallet
Advisors: Yves Robert and Frédéric Vivien
École Normale Supérieure de Lyon, GRAAL team, Laboratoire de l’Informatique du Parallélisme
October 20, 2009
tel-00637362, version 1 - 1 Nov 2011
◮ Presentation of the Divisible Load Theory
◮ Scheduling divisible loads on a processor chain
◮ Mono-allocation schedules of task graphs on heterogeneous
◮ Dynamic bag-of-tasks applications
◮ Computing the throughput of replicated workflows
◮ Task graph scheduling on the Cell processor
◮ undefined for a continuous flow of data sets
◮ does not benefit from regular problem structure
[Figure: processors P1–P7, tasks T1–T4 and files F1,2, F2,3, F3,4]
◮ Strict One-Port: a processor can either compute or perform a
◮ Full-Duplex One-Port: a processor can either compute or
◮ One-Port with overlap of computation by communications
◮ Limited incoming bandwidth $bw^{in}_q$
◮ Limited outgoing bandwidth $bw^{out}_q$
◮ Limited bandwidth per link $bw_{q,r}$
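The three bandwidth limits above can be checked against a set of average communication rates. The following is a minimal sketch, not the model's formal definition: the function name `respects_bandwidth_limits`, the dictionary layout, and the sample values are assumptions made for illustration.

```python
def respects_bandwidth_limits(rates, bw_in, bw_out, bw_link):
    """Check average rates against the three bandwidth limits.

    rates[(q, r)] is the (assumed) average amount of data sent from
    processor q to processor r per time unit.
    """
    sent = {}
    received = {}
    for (q, r), rate in rates.items():
        # Per-link limit bw_{q,r}; a missing link has zero bandwidth.
        if rate > bw_link.get((q, r), 0.0):
            return False
        sent[q] = sent.get(q, 0.0) + rate
        received[r] = received.get(r, 0.0) + rate
    # Outgoing limit bw^out_q and incoming limit bw^in_r.
    return (all(total <= bw_out[q] for q, total in sent.items())
            and all(total <= bw_in[r] for r, total in received.items()))


# Hypothetical three-processor example.
rates = {(1, 2): 3.0, (1, 3): 2.0, (2, 3): 1.0}
bw_out = {1: 5.0, 2: 4.0, 3: 4.0}
bw_in = {1: 5.0, 2: 4.0, 3: 3.0}
bw_link = {(1, 2): 3.0, (1, 3): 2.5, (2, 3): 2.0}
print(respects_bandwidth_limits(rates, bw_in, bw_out, bw_link))
```

In this example every limit is met, some of them with equality; lowering the incoming bandwidth of processor 3 below 3.0 makes the check fail.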
◮ assign “largest” task to best processor
◮ continue with second “largest” task, assign it to the processor
◮ ...
◮ take communication times into account when sorting tasks
◮ when mapping a task, select the processor such that the
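The simple greedy strategy listed above can be sketched as follows. This is only an illustration under stated assumptions: "largest" is taken to mean the task with the most work, the "best" processor is taken to be the one whose computation load stays smallest after the assignment, and communication times are ignored. The names `simple_greedy`, `work` and `speed` are hypothetical.

```python
def simple_greedy(work, speed):
    """Map each task to one processor, largest task first.

    work[k]  : amount of computation of task k (assumed measure of "largest")
    speed[i] : speed of processor i
    Returns the mapping and the resulting period (largest processor load).
    """
    load = [0.0] * len(speed)
    mapping = {}
    # Consider tasks by decreasing amount of work.
    for k in sorted(range(len(work)), key=lambda k: work[k], reverse=True):
        # Assumed criterion: the processor with the smallest load once
        # task k is added to it.
        best = min(range(len(speed)),
                   key=lambda i: load[i] + work[k] / speed[i])
        load[best] += work[k] / speed[best]
        mapping[k] = best
    return mapping, max(load)


mapping, period = simple_greedy(work=[8.0, 4.0, 2.0, 2.0], speed=[2.0, 1.0])
print(mapping, period)
```

With one data set processed per period, the throughput of such a mapping is the inverse of the returned period. A refined variant would add communication times to both the sorting key and the processor choice.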
◮ Select the $\alpha_i^k$ with maximum value
◮ Set $\alpha_i^k$ to 1
◮ Select a task $T_k$ not yet mapped
◮ Randomly choose a processor $P_i$ with probability $\alpha_i^k$
◮ Set $\alpha_i^k$ to 1
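The two rounding rules above can be sketched as follows, assuming that `alpha[k][i]` holds the fractional value $\alpha_i^k$ of task $T_k$ on processor $P_i$ and that each row sums to 1. The sketch works on a fixed table and does not recompute the fractional values after fixing one of them; that simplification, like the function names, is an assumption made for illustration.

```python
import random


def round_max(alpha):
    """Repeatedly fix the largest alpha[k][i] among tasks not yet mapped."""
    mapping = {}
    while len(mapping) < len(alpha):
        candidates = ((k, i)
                      for k in range(len(alpha)) if k not in mapping
                      for i in range(len(alpha[k])))
        k, i = max(candidates, key=lambda ki: alpha[ki[0]][ki[1]])
        mapping[k] = i  # corresponds to setting alpha[k][i] to 1
    return mapping


def round_random(alpha, rng):
    """For each task, draw processor i with probability alpha[k][i]."""
    mapping = {}
    for k, row in enumerate(alpha):
        mapping[k] = rng.choices(range(len(row)), weights=row, k=1)[0]
    return mapping


# Hypothetical fractional values for 3 tasks and 2 processors.
alpha = [[0.7, 0.3], [0.1, 0.9], [0.5, 0.5]]
print(round_max(alpha))
print(round_random(alpha, random.Random(0)))
```

Passing a seeded `random.Random` instance keeps the randomized variant reproducible from one run to the next.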
◮ for all tasks, we test all possible immediate neighborhoods, and
◮ for a given mapping, we sort all resource occupation times by
◮ Evaluate the mapping of $T_k$ and its neighbors on each
◮ Definitively assign $T_k$ to the best processor
◮ “Small problems”: 8–12 tasks
◮ “Large problems”: up to 47 tasks (MLP not used)
◮ for each application, we compute a CCR = communications
◮ we try to cover a large CCR range
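As a rough illustration of the CCR mentioned above, the sketch below assumes the usual communication-to-computation ratio, computed here as the total size of the files divided by the total work of the tasks. The exact formula used in the experiments is not recoverable from the slide, so the definition, the function name `ccr` and the sample values are all assumptions.

```python
def ccr(file_sizes, task_works):
    """Assumed definition: total communication volume / total computation."""
    return sum(file_sizes) / sum(task_works)


# Hypothetical application: 3 files and 4 tasks.
value = ccr(file_sizes=[2.0, 1.0, 1.0], task_works=[10.0, 20.0, 5.0, 5.0])
print(value)
```

Under this definition, a small value describes a computation-bound application and a large value a communication-bound one.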
[Figure: four plots of normalized throughput against log(CCR), for CCR from 0.01 to 10, annotated “Small communications”, “Large communications” and “Optimal”. Curves shown: (1) Upper bound, Delegate; (2) Delegate, HEFT, Data-parallel; (3) Refined Greedy, Simple Greedy, Delegate, Clustering; (4) RLP RAND, RLP MAX, Neighborhood, Delegate]
[Figure: the Cell processor: PPE0 and SPE0–SPE7 connected by the EIB, together with MEMORY; labels P0–P8]
◮ at most 16 simultaneous incoming communications for each SPE
◮ at most 8 simultaneous communications
◮ $read_k$: data to read before executing $T_k$
◮ $write_k$: data to write after the execution of $T_k$
◮ $peek_k$: number of next data sets to receive before executing $T_k$
[Figure: platform graphs over processors P1–P7]
Comments on “Design and performance evaluation of load distribution strategies for multiple loads
  Matthieu Gallet, Yves Robert, Frédéric Vivien, Journal of Parallel and Distributed Computing, 68(2), 2008
Divisible Load Scheduling
  Matthieu Gallet, Yves Robert, Frédéric Vivien, Introduction to Scheduling, 2009
Scheduling communication requests traversing a switch: complexity and algorithms
  Matthieu Gallet, Yves Robert, Frédéric Vivien, Proceedings of the 15th Euromicro Workshop on Parallel, Distributed and Network-based Processing (PDP’2007), 2007
Scheduling multiple divisible loads on a linear processor network
  Matthieu Gallet, Yves Robert, Frédéric Vivien, Proceedings of the 13th IEEE International Conference on Parallel and Distributed Systems (ICPADS’07), 2007
Efficient Scheduling of Task Graph Collections on Heterogeneous Resources
  Matthieu Gallet, Loris Marchal, Frédéric Vivien, Proceedings of the 23rd International Parallel and Distributed Processing Symposium (IPDPS’09), 2009
Allocating Series of Workflows on Computing Grids
  Matthieu Gallet, Loris Marchal, Frédéric Vivien, Proceedings of the 14th IEEE International Conference on Parallel and Distributed Systems (ICPADS’08), 2008
Computing the throughput of replicated workflows on heterogeneous platforms
  Anne Benoit, Matthieu Gallet, Bruno Gaujal, Yves Robert, Proceedings of the 38th International Conference on Parallel Processing (ICPP’09), 2009