Scheduling complex streaming applications on the Cell processor
Mathias Jacquelin, joint work with Matthieu Gallet and Loris Marchal
INRIA ROMA project-team, LIP (ENS-Lyon, CNRS, INRIA)
École Normale Supérieure de Lyon, France
Workshop
Outline
◮ Introduction
  ◮ Steady-state scheduling
  ◮ CELL
◮ Platform and Application Modeling
◮ Mapping the Application
◮ Practical Steady-State on CELL
  ◮ Preprocessing of the schedule
  ◮ State machine of the framework
◮ Experimental results
◮ Conclusion and Future works
Motivation
◮ Multicore architectures: a new opportunity to test the scheduling strategies designed in the ROMA team.
◮ Our trademark: efficient scheduling on heterogeneous platforms
◮ Most multicore architectures are homogeneous and regular
◮ Need for tailored algorithms (linear algebra, ...)
◮ Emerging heterogeneous multicore:
  ◮ Dedicated processing units on GPUs
  ◮ Mixed system: processor + accelerator
◮ This study: steady-state scheduling on CELL (bounded heterogeneity) to demonstrate the usefulness of complex (static) scheduling techniques
Introduction: Steady-state Scheduling
Rationale:
◮ A pipelined application:
  ◮ Simple chain
  ◮ More complex application (Directed Acyclic Graph)
◮ Objective: optimize the throughput of the application (number of input files processed per second)
◮ Today: simple case where each task has to be mapped on one single resource
[Figure: a chain T1, T2, T3; a task graph T1 to T9; and the same graph with its tasks mapped onto processors P1 to P4]
CELL brief introduction
◮ Multicore heterogeneous processor
◮ Accelerator extension to Power architecture
[Figure: CELL block diagram with PPE0, SPE0 to SPE7 and MEMORY connected through the EIB]
◮ 1 PPE core
  ◮ VMX unit
  ◮ L1, L2 cache
  ◮ 2-way SMT
◮ 8 SPEs
  ◮ 128-bit SIMD instruction set
  ◮ Local store 256 KB
  ◮ Dedicated asynchronous DMA engine
◮ Element Interconnect Bus (EIB)
  ◮ 200 GB/s bandwidth
◮ 25 GB/s bandwidth
Platform modeling
Simple CELL modeling:
◮ 1 PPE and 8 SPEs: 9 processing elements $P_1, \dots, P_9$, with unrelated speeds,
◮ Each processing element accesses the communication bus with a (bidirectional) bandwidth $b = 25$ GB/s,
◮ The bus is able to route all concurrent communications without contention (in a first step),
◮ Due to the limited size of the DMA stack on each SPE:
  ◮ Each SPE can perform at most 16 simultaneous DMA operations,
  ◮ The PPE can perform at most 8 simultaneous DMA operations to/from a given SPE.
◮ Linear cost communication model: data of size $S$ is sent/received in time $S/b$
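The linear cost model can be illustrated with a few lines of Python. This is only a sketch: the helper name `transfer_time` and the choice of 1 GB = 10^9 bytes are assumptions made for the example, not part of the talk.

```python
# Illustrative helper (hypothetical name): linear cost model, time = S / b.
BANDWIDTH_BYTES_PER_S = 25e9  # b = 25 GB/s, assuming 1 GB = 10**9 bytes


def transfer_time(size_bytes: float, bandwidth: float = BANDWIDTH_BYTES_PER_S) -> float:
    """Time in seconds to send or receive `size_bytes` under the linear model."""
    if size_bytes < 0:
        raise ValueError("size must be non-negative")
    return size_bytes / bandwidth


if __name__ == "__main__":
    # A block as large as an SPE local store (256 KB) takes about 10.5 microseconds.
    print(transfer_time(256 * 1024))
```

The model has no latency term, so the cost of many small transfers is the same as that of one large transfer of the same total size.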
Application modeling
The application is described by a directed acyclic graph:
◮ Tasks $T_1, \dots, T_n$
◮ Processing time of task $T_k$ on $P_i$ is $t_i(k)$,
◮ If there is a dependency $T_k \to T_l$, $data_{k,l}$ is the size of the file produced by $T_k$ and needed by $T_l$,
◮ If $T_k$ is an input task, it reads $read_k$ bytes from main memory,
◮ If $T_k$ is an output task, it writes $write_k$ bytes to main memory.
[Figure: example task graph with tasks T1 to T9]
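One possible in-memory representation of this application model is sketched below. The class names `Task` and `TaskGraph`, the processor labels, and all numeric values in `example_chain` are hypothetical and chosen only for illustration.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    cost: dict          # cost[p] = t_p(k): processing time on processing element p
    read: float = 0.0   # read_k: bytes read from main memory (input tasks)
    write: float = 0.0  # write_k: bytes written to main memory (output tasks)


@dataclass
class TaskGraph:
    tasks: dict = field(default_factory=dict)  # task name -> Task
    data: dict = field(default_factory=dict)   # (k, l) -> data_{k,l} in bytes

    def add_task(self, task: Task) -> None:
        self.tasks[task.name] = task

    def add_dependency(self, k: str, l: str, size: float) -> None:
        if k not in self.tasks or l not in self.tasks:
            raise KeyError("both endpoints must be added before the edge")
        self.data[(k, l)] = size

    def predecessors(self, l: str) -> list:
        return [k for (k, dst) in self.data if dst == l]

    def topological_order(self) -> list:
        """Kahn's algorithm; raises ValueError if the graph has a cycle."""
        indegree = {name: 0 for name in self.tasks}
        for (_, dst) in self.data:
            indegree[dst] += 1
        ready = deque(name for name, d in indegree.items() if d == 0)
        order = []
        while ready:
            k = ready.popleft()
            order.append(k)
            for (src, dst) in self.data:
                if src == k:
                    indegree[dst] -= 1
                    if indegree[dst] == 0:
                        ready.append(dst)
        if len(order) != len(self.tasks):
            raise ValueError("dependency graph is not acyclic")
        return order


def example_chain() -> TaskGraph:
    """Three-task chain T1 -> T2 -> T3 with made-up costs and sizes."""
    g = TaskGraph()
    g.add_task(Task("T1", {"PPE": 4.0, "SPE": 2.0}, read=1024))
    g.add_task(Task("T2", {"PPE": 3.0, "SPE": 1.0}))
    g.add_task(Task("T3", {"PPE": 2.0, "SPE": 5.0}, write=512))
    g.add_dependency("T1", "T2", 2048)
    g.add_dependency("T2", "T3", 256)
    return g
```

The acyclicity check matters because the model assumes a DAG: a cycle would make the notion of input and output tasks ill-defined.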
Target application: any DAG
◮ Today, we will focus on three random task graphs:
[Figure: task graphs whose nodes are labelled stateful or stateless and carry a PPE cost, an SPE cost and a peek value (0, 1 or 2); the surviving node labels cover a 50-task graph and a 94-task graph]
◮ And a simple chain graph (50 tasks)
How to compute an optimal mapping
◮ Objective: maximize throughput $\rho$
◮ Method: write a linear program gathering constraints on the mapping
◮ Binary variables: $\alpha_i^k = 1$ if $T_k$ is mapped on $P_i$, $0$ otherwise
◮ Other useful binary variables: $\beta_{i,j}^{k,l} = 1$ iff file $T_k \to T_l$ is transferred from $P_i$ to $P_j$
Constraints 1/2
On the application structure:
◮ Each task is mapped on a processor:
  $\forall T_k, \quad \sum_i \alpha_i^k = 1$
◮ Given a dependency $T_k \to T_l$, the processor computing $T_l$ must receive the corresponding file:
  $\forall (k,l) \in E, \ \forall P_j, \quad \sum_i \beta_{i,j}^{k,l} \ge \alpha_j^l$
◮ Given a dependency $T_k \to T_l$, only the processor computing $T_k$ can send the corresponding file:
  $\forall (k,l) \in E, \ \forall P_i, \quad \sum_j \beta_{i,j}^{k,l} \le \alpha_i^k$
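To make the binary variables concrete, the sketch below derives them from a task-to-processor mapping and checks the three structural constraints. The function names, and the convention that a file between two tasks on the same processor is represented by a variable with $i = j$, are assumptions made for this example.

```python
def build_variables(mapping, edges, processors):
    """Derive alpha[i, k] and beta[i, j, k, l] from a task -> processor mapping."""
    alpha = {(i, k): int(mapping[k] == i) for i in processors for k in mapping}
    beta = {
        (i, j, k, l): int(mapping[k] == i and mapping[l] == j)
        for (k, l) in edges for i in processors for j in processors
    }
    return alpha, beta


def satisfies_structure(alpha, beta, tasks, edges, processors):
    """Check the three structural constraints for given 0/1 variables."""
    # Each task is mapped on exactly one processor.
    each_task_mapped_once = all(
        sum(alpha[i, k] for i in processors) == 1 for k in tasks
    )
    # The processor computing T_l receives the file of every edge (k, l).
    receiver_gets_file = all(
        sum(beta[i, j, k, l] for i in processors) >= alpha[j, l]
        for (k, l) in edges for j in processors
    )
    # Only the processor computing T_k may send the file of edge (k, l).
    only_owner_sends = all(
        sum(beta[i, j, k, l] for j in processors) <= alpha[i, k]
        for (k, l) in edges for i in processors
    )
    return each_task_mapped_once and receiver_gets_file and only_owner_sends


if __name__ == "__main__":
    procs = ["P1", "P2"]
    edges = [("T1", "T2")]
    mapping = {"T1": "P1", "T2": "P2"}
    alpha, beta = build_variables(mapping, edges, procs)
    print(satisfies_structure(alpha, beta, list(mapping), edges, procs))
```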
Constraints 2/2
On the achievable throughput $\rho = 1/T$:
◮ On a given processor, all tasks must be completed within $T$:
  $\forall P_i, \quad \sum_k \alpha_i^k \times t_i(k) \le T$
◮ All incoming communications must be completed within $T$:
  $\forall P_j, \quad \frac{1}{b} \left( \sum_k \alpha_j^k \times read_k + \sum_{k,l} \sum_i \beta_{i,j}^{k,l} \times data_{k,l} \right) \le T$
◮ All outgoing communications must be completed within $T$:
  $\forall P_i, \quad \frac{1}{b} \left( \sum_k \alpha_i^k \times write_k + \sum_{k,l} \sum_j \beta_{i,j}^{k,l} \times data_{k,l} \right) \le T$
+ constraints on the number of incoming/outgoing communications to respect the DMA requirements
+ constraints on the available memory on SPE
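For a fixed mapping, the left-hand sides of these three constraints can be evaluated directly, and the smallest feasible $T$ is their maximum. The sketch below does this under stated assumptions: the function names and example values are hypothetical, transfers between tasks on the same processor are counted like any other transfer (the sums above do not exclude them), and the DMA and memory constraints are ignored.

```python
def constraint_loads(mapping, cost, read, write, data, processors, b):
    """Per-processor left-hand sides (compute, incoming, outgoing), in seconds."""
    compute = {p: 0.0 for p in processors}
    incoming = {p: 0.0 for p in processors}  # bytes received per period
    outgoing = {p: 0.0 for p in processors}  # bytes sent per period
    for k, p in mapping.items():
        compute[p] += cost[p][k]             # alpha_p^k * t_p(k)
        incoming[p] += read.get(k, 0.0)      # alpha_p^k * read_k
        outgoing[p] += write.get(k, 0.0)     # alpha_p^k * write_k
    for (k, l), size in data.items():
        outgoing[mapping[k]] += size         # beta term, sender side
        incoming[mapping[l]] += size         # beta term, receiver side
    return {p: (compute[p], incoming[p] / b, outgoing[p] / b) for p in processors}


def period(loads):
    """Smallest T satisfying every constraint; the throughput is 1 / T."""
    return max(max(triple) for triple in loads.values())


if __name__ == "__main__":
    procs = ["P1", "P2"]
    cost = {"P1": {"A": 2.0, "B": 4.0}, "P2": {"A": 3.0, "B": 1.0}}
    loads = constraint_loads({"A": "P1", "B": "P2"}, cost,
                             read={"A": 20.0}, write={"B": 10.0},
                             data={("A", "B"): 30.0}, processors=procs, b=10.0)
    print(loads, period(loads))
```

Returning the three values per processor, rather than only their maximum, shows which resource is the bottleneck of a given mapping.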
Optimal mapping computation
◮ Linear program with the objective of minimizing $T$
◮ Integer (binary) variables: Mixed Integer Programming
◮ NP-complete problem
◮ Efficient solvers exist with short running time
  ◮ for small-size problems
  ◮ or when an approximate solution is sought
◮ We use CPLEX, and look for an approximate solution (within 5% of the optimal throughput is good enough)
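The talk solves the mapping problem with CPLEX. As a stand-in that needs no solver, the sketch below simply enumerates every mapping of a tiny instance and keeps the one with the smallest period. This is exhaustive search, not the mixed integer program itself, and it is only usable for a handful of tasks since there are $|P|^n$ mappings. All names and values are hypothetical, and the DMA and memory constraints are left out.

```python
from itertools import product


def period(mapping, cost, read, write, data, b):
    """Period T of a mapping: largest per-processor compute or transfer time."""
    load = {}  # processor -> (compute seconds, bytes in, bytes out)
    for k, p in mapping.items():
        comp, inc, out = load.get(p, (0.0, 0.0, 0.0))
        load[p] = (comp + cost[p][k], inc + read.get(k, 0.0), out + write.get(k, 0.0))
    for (k, l), size in data.items():
        comp, inc, out = load[mapping[k]]
        load[mapping[k]] = (comp, inc, out + size)
        comp, inc, out = load[mapping[l]]  # re-read: sender and receiver may coincide
        load[mapping[l]] = (comp, inc + size, out)
    return max(max(comp, inc / b, out / b) for comp, inc, out in load.values())


def best_mapping(tasks, processors, cost, read, write, data, b):
    """Try every assignment of tasks to processors and return the best one."""
    best, best_period = None, float("inf")
    for choice in product(processors, repeat=len(tasks)):
        mapping = dict(zip(tasks, choice))
        t = period(mapping, cost, read, write, data, b)
        if t < best_period:
            best, best_period = mapping, t
    return best, best_period


if __name__ == "__main__":
    cost = {"P1": {"A": 2.0, "B": 4.0}, "P2": {"A": 3.0, "B": 1.0}}
    mapping, t = best_mapping(["A", "B"], ["P1", "P2"], cost,
                              read={"A": 20.0}, write={"B": 10.0},
                              data={("A", "B"): 30.0}, b=10.0)
    print(mapping, t, "throughput:", 1.0 / t)
```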
Preprocessing of the schedule
Main objective: compute the minimal starting period and buffer sizes.
◮ $\mathrm{min\_period}_l = \max_{m \in \mathrm{prec}_l}(\mathrm{min\_period}_m) + \mathrm{peek}_l + 2$
◮ $\mathrm{min\_buff}_{i,l} = \mathrm{min\_period}_l - \mathrm{min\_period}_i$
Example on four tasks $T_i$, $T_j$, $T_k$, $T_l$:
◮ $\mathrm{peek}_i = 0$, $\mathrm{peek}_j = 1$, $\mathrm{peek}_k = 3$, $\mathrm{peek}_l = 2$
◮ $\mathrm{min\_period}_i = 0$, $\mathrm{min\_period}_j = 3$, $\mathrm{min\_period}_k = 5$, $\mathrm{min\_period}_l = 9$
◮ $\mathrm{min\_buff}_{i,j} = 3$, $\mathrm{min\_buff}_{i,k} = 5$, $\mathrm{min\_buff}_{i,l} = 9$, $\mathrm{min\_buff}_{j,l} = 6$, $\mathrm{min\_buff}_{k,l} = 4$
[Figure: the four-task graph annotated with these values, animated from period = 0 to period = 9]
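The two formulas can be checked against the four-task example with a short Python sketch. The function name `preprocess` is hypothetical, and the rule that a task without predecessors starts at period 0 is an assumption taken from the example, where $\mathrm{min\_period}_i = 0$.

```python
def preprocess(peek, preds):
    """Return (min_period, min_buff) computed from the two formulas.

    peek:  task -> peek value
    preds: task -> list of predecessor tasks (absent or empty for source tasks)
    """
    min_period = {}

    def visit(l):
        if l not in min_period:
            if not preds.get(l):
                min_period[l] = 0  # assumption: source tasks start at period 0
            else:
                min_period[l] = max(visit(m) for m in preds[l]) + peek[l] + 2
        return min_period[l]

    for task in peek:
        visit(task)
    # One buffer size per dependency m -> l.
    min_buff = {(m, l): min_period[l] - min_period[m]
                for l in preds for m in preds[l]}
    return min_period, min_buff


if __name__ == "__main__":
    peek = {"i": 0, "j": 1, "k": 3, "l": 2}
    preds = {"j": ["i"], "k": ["i"], "l": ["i", "j", "k"]}
    periods, buffers = preprocess(peek, preds)
    print(periods)  # {'i': 0, 'j': 3, 'k': 5, 'l': 9}
    print(buffers)
```

The recursion follows dependency chains, so very long chains would call for an explicit topological traversal instead; the graph sizes mentioned in the talk stay well within Python's default recursion limit.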
State machine of the framework
Two main phases: Computation and Communication
[Figure: Computation phase state machine with states Select a Task, Wait Resources, Process Task and Signal new Data, and a transition to Communicate]
[Figure: Communication phase state machine, run for each inbound communication, with states Check input data, Check input buffers, Get Data and Watch DMA; transitions labelled "No" and "No more comm."; exit to Compute]
18/ 28
Communication between processors
[Diagram: processors P_K and P_L, with task instances T_1^(i), T_1^(i+1) and T_2^(i−1), T_2^(i)]
◮ Signal Data(i)
  ◮ The output buffer containing i cannot be overwritten.
  ◮ mfc_putb for SPEs’ outbound communications.
  ◮ spe_mfcio_getb for PPEs’ outbound communications to SPEs.
  ◮ memcpy for PPEs’ outbound communications to main memory.
◮ Get Data(i)
  ◮ Input buffers are available to store data.
  ◮ mfc_get for SPEs’ inbound communications.
  ◮ spe_mfcio_put for PPEs’ inbound communications from SPEs.
  ◮ memcpy for PPEs’ inbound communications from main memory.
◮ Transfer Done(i)
  ◮ The output buffer containing i can now be overwritten.
  ◮ mfc_putb for SPEs’ acknowledgements.
  ◮ spe_mfcio_getb for PPEs’ acknowledgements to SPEs.
  ◮ Self acknowledgement of PPEs’ transfers from main memory.
◮ Signal Data(i + 1)
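The handshake on this slide (signal, get, acknowledge) can be illustrated with a toy model. This is a sketch under stated assumptions, not the framework's implementation: the `Link` class, its one-slot output buffer and the `input_slots` parameter are hypothetical, and a plain Python assignment stands in for the DMA calls named above.

```python
class Link:
    """Toy channel between a sender and a receiver (illustration only)."""

    def __init__(self, input_slots):
        self.output_buffer = None       # instance held by the sender until acknowledged
        self.signalled = False          # a "Signal Data(i)" is pending
        self.input_buffers = []         # receiver-side storage
        self.input_slots = input_slots  # hypothetical: number of input buffers
        self.log = []

    def signal_data(self, i, payload):
        if self.output_buffer is not None:
            return False                # buffer holding the previous instance is locked
        self.output_buffer = (i, payload)
        self.signalled = True
        self.log.append(f"Signal Data({i})")
        return True

    def get_data(self):
        if not self.signalled or len(self.input_buffers) >= self.input_slots:
            return False                # nothing signalled, or no free input buffer
        i, payload = self.output_buffer
        self.input_buffers.append((i, payload))  # stands in for the DMA transfer
        self.signalled = False
        self.log.append(f"Get Data({i})")
        self.log.append(f"Transfer Done({i})")
        self.output_buffer = None       # the sender may now overwrite its buffer
        return True

    def consume(self):
        """Receiver processes the oldest stored instance, freeing an input buffer."""
        return self.input_buffers.pop(0)
```

In this model a second `signal_data` fails until `get_data` has acknowledged the first one, which mirrors the rule that the output buffer containing i cannot be overwritten before Transfer Done(i).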
19/ 28
Experimental setup
◮ Linear-Programming: solver stopped within 5% of the optimal, to reduce compute time.
◮ GreedyMem: simple greedy heuristic balancing memory footprint across PEs.
  ◮ Tasks are processed in topological order.
  ◮ The valid SPE with the least loaded memory is selected.
◮ GreedyCpu: simple greedy heuristic balancing compute load across PEs.
  ◮ Tasks are processed in topological order.
  ◮ The least loaded SPE is selected, provided that it has enough free memory.
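The two greedy heuristics differ only in the load they balance, which a short Python sketch makes visible. Several points are assumptions: the function name `greedy_map`, the task tuple layout, reading "valid" as "has enough free memory for the task", and falling back to the PPE when no SPE fits.

```python
def greedy_map(tasks, n_spes, spe_memory, criterion):
    """Map tasks, given in topological order, onto SPEs.

    tasks:     list of (name, compute_cost, memory_footprint) tuples.
    criterion: "mem" for GreedyMem, "cpu" for GreedyCpu.
    Returns a dict from task name to SPE index, or "PPE" when no SPE
    has enough free memory (this fallback is an assumption).
    """
    mem_used = [0] * n_spes
    cpu_load = [0.0] * n_spes
    mapping = {}
    for name, cost, footprint in tasks:
        # An SPE is valid if the task still fits in its memory.
        valid = [s for s in range(n_spes)
                 if mem_used[s] + footprint <= spe_memory]
        if not valid:
            mapping[name] = "PPE"
            continue
        load = mem_used if criterion == "mem" else cpu_load
        best = min(valid, key=lambda s: load[s])  # least loaded valid SPE
        mem_used[best] += footprint
        cpu_load[best] += cost
        mapping[name] = best
    return mapping
```

With the same task list, the two criteria can produce different mappings as soon as the SPE with the least memory in use is not the one with the least compute load.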
20/ 28
Reaching steady state
Graph 1
[Figure: task graph with nodes T1 to T50, each labelled stateful or stateless, with cost ppe, cost spe and peek (0 or 1)]
[Plot: throughput (instances / second) against number of instances (up to 10000); curves: theoretical throughput, experimental throughput]
95% of the theoretical throughput is achieved after 1000 periods
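One way to read this figure is through a toy ramp-up model, which is an assumption and not something stated on the slide: if one instance completes per period in steady state and the pipeline needs a fixed number of extra periods to fill, the measured throughput approaches the theoretical one as the number of instances grows. The function names and the `startup_periods` parameter below are hypothetical.

```python
def achieved_fraction(n_instances, startup_periods):
    """Fraction of the steady-state throughput measured after n_instances.

    Toy model (assumption): the run lasts n_instances + startup_periods periods.
    """
    return n_instances / (n_instances + startup_periods)


def max_startup(n_instances, target):
    """Largest startup overhead, in periods, that still reaches `target`."""
    return n_instances * (1.0 / target - 1.0)
```

Under this model, reaching 95% after 1000 instances corresponds to a startup overhead of about 52 periods, since `max_startup(1000, 0.95)` is 1000 × (1/0.95 − 1).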
21/ 28
Experimental results
Graph 1
[Figure: task graph with nodes T1 to T50, each labelled stateful or stateless, with cost ppe, cost spe and peek (0 or 1)]
[Plot: speed-up for 5000 instances against number of SPEs (1 to 8); curves: GreedyCpu, GreedyMem, Linear Programming]
Results are obtained over 5000 periods, 2x speedup using 8 SPEs.
22/ 28
Experimental results
Graph 2
[Figure: task graph with nodes T1 to T94, each labelled stateful or stateless, with cost ppe, cost spe and peek (0 to 2)]
[Plot: speed-up for 5000 instances against number of SPEs (1 to 8); curves: GreedyCpu, GreedyMem, Linear Programming]
Results are obtained over 5000 periods, 2x speedup using 8 SPEs.
23/ 28
Experimental results
Graph 3: 50 tasks deep chain graph
[Plot: speed-up for 5000 instances against number of SPEs (1 to 8); curves: GreedyCpu, GreedyMem, Linear Programming]
Results are obtained over 5000 periods, 3x speedup using 8 SPEs.
24/ 28
Experimental results
We let the communication to computation ratio of each graph vary
[Plot: speed-up for 10000 instances against communication to computation ratio; curves: Random graph 1, Random graph 2, Random graph 3]
Results are obtained over 10000 periods.
◮ The heavier the communications, the harder it is to reach the theoretical throughput...
◮ ... but increasing the number of periods helps a lot.
26/ 28
Feedback on our approach
◮ We designed a realistic yet tractable model of the Cell processor.
◮ Our framework allowed us to test our scheduling strategy and to compare it to simpler heuristic strategies.
◮ We have shown that our approach:
  ◮ achieves 95% of the throughput predicted by the linear program,
  ◮ gives good and scalable speedup when using up to 8 SPEs,
  ◮ clearly outperforms simple heuristics.
Scheduling a complex application on a heterogeneous multicore processor is a challenging task.
Scheduling tools can help to achieve good performance.
27/ 28
Feedback on Cell programming
◮ Multilevel heterogeneity:
  ◮ 32-bit SPE vs 64-bit PPE architectures
  ◮ Different communication mechanisms and constraints
◮ Non-trivial initialization phase:
  ◮ Varying data structure sizes (32/64 bits)
  ◮ Runtime memory allocation
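The point about varying data structure sizes can be illustrated with Python's `struct` module. This is only a sketch: the control block made of one buffer address and one length is hypothetical, and the three layouts are assumptions chosen to show how a pointer-sized field changes the size of a structure shared between 32-bit and 64-bit cores.

```python
import struct

# Hypothetical control block: one buffer address followed by one length.
LAYOUT_32 = "<II"     # 32-bit address, 32-bit length
LAYOUT_64 = "<QI"     # 64-bit address, 32-bit length
SHARED = "<QI4x"      # fixed-width layout: 64-bit address, padded to 16 bytes


def layout_sizes():
    """Size in bytes of each layout of the same logical structure."""
    return {
        "32-bit": struct.calcsize(LAYOUT_32),
        "64-bit": struct.calcsize(LAYOUT_64),
        "shared": struct.calcsize(SHARED),
    }


def roundtrip(address, length):
    """Pack and unpack the shared layout, as both sides would see it."""
    raw = struct.pack(SHARED, address, length)
    return struct.unpack(SHARED, raw)
```

Using one fixed-width layout on both sides avoids depending on each core's native pointer size, at the cost of a few padding bytes.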
28/ 28
On-going and Future work
◮ Better communication modeling
  ◮ Is a linear cost model relevant?
  ◮ Contention on concurrent DMA operations?
◮ Larger platforms
  ◮ Using multiple CELL processors
  ◮ CELL + other types of processing units?
  ◮ Work on communication modeling
◮ Design scheduling heuristics
  ◮ MIP is costly