DDM-VMc: The Data-Driven Multithreading Virtual Machine for the Cell Processor
Samer Arandi, Skevos (Paraskevas) Evripidou
University of Cyprus, Computer Science Department
HiPEAC11, Heraklion, Crete
Outline
- Motivation
- Data-Driven Multithreading
- The DDM-VMc
- Programming Toolchain
- Evaluation
- Conclusion
Motivation
- The adoption of multicore architectures ushered in the "Concurrency Era", which gave rise to new challenges:
  - Traditional programming models do not allow efficient utilization of the large multicore resources
  - Multicores still suffer from the effects of the Memory Wall
- Heterogeneous multicores (motivated by more power- and area-efficient design) make this task even more complex
- One technique to combat the memory wall is to utilize explicitly managed on-chip local memories (scratchpads)
  - This offers great opportunities for optimization but burdens the programmer with the management of the memory hierarchy
Our Take:
- Instead of extending sequential models with concurrent constructs, which is mostly an ad hoc solution, re-visit alternative models that are inherently parallel and offer distributed concurrency, i.e. Dataflow

Our Goal:
- Exploit data-flow concurrency on commercial multicores, with performance as good as or better than similar systems
Data-Driven Multithreading (DDM)
- Execution model that combines:
  - Distributed data-flow concurrency for scheduling threads
  - Efficient sequential execution within a thread
- Decouples synchronization from computation
- Non-blocking: threads execute to completion
- The core of DDM is the Thread Scheduling Unit (TSU)
  - Holds the meta-data of the threads (the dependency graph)
  - Uses the graph to schedule threads dynamically at runtime, based on data availability
- CacheFlow: data-driven prefetching drastically improves the hit ratio of the cache and requires much smaller caches
  - The Ready Queue (RQ) gives the near-future execution patterns
Data-Driven Multithreading (DDM) - Projects
- Data-Driven Network of Workstations (D2NOW)
  - A simulated cluster of distributed machines augmented with a hardware Thread Scheduling Unit
  - Explored CacheFlow optimizations and showed that data-driven scheduling can generally improve locality
- ThreadFlux (TFlux)
  - Developed a portable software platform that runs on a variety of commercial multi-core systems
  - The first full-system simulation of a DDM machine
  - TFlux Pragmas: data-flow-specific directives
- Data-Driven Multithreading Virtual Machine (DDM-VM)
  - A virtual machine that supports DDM execution on homogeneous and heterogeneous multi-cores
Data-Driven Multithreading Execution

[Figure: a DDM processing element (PE) with a hardware TSU. The Thread Synchronization Unit (TSU) comprises the Graph Memory (GM), the Synchronization Memory (SM), the Ready Queue (RQ) and the Acknowledgement Queue (AQ), attached through a snooping unit to a motherboard holding the processor, its L1/L2 caches and memory. A threads dependency graph over threads 31, 32, 33, 34 and 36 drives the example.]

The execution proceeds as follows:
- The GM holds, for each thread, the IFP, the DFP and the two consumers (Con1 and Con2).
- The SM holds the Ready Counts: one value for each loop iteration.
- The processor reads from the RQ the pointers (IFP, DFP and index) of ready threads and executes them.
- After executing a thread, the processor stores in the AQ the information (Thread #, index and status) of the executed thread.
- The TSU determines the consumers of completed threads from the GM.
- The TSU updates the SM and checks whether any of the consumers is ready (Ready Count = 0).
- The TSU loads into the RQ the pointers (IFP, DFP) of ready threads from the GM and the index from the SM.
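To make the TSU's update step concrete, here is a minimal sketch in C of the core scheduling logic: when a completed thread is acknowledged, the Ready Counts of its consumers are decremented, and any consumer that reaches zero is enqueued in the RQ. The type and field names (thread_meta_t, consumers, and so on) are illustrative assumptions, not the actual DDM hardware or DDM-VM structures; the SM is folded into the per-thread meta-data (one Ready Count per thread rather than per loop iteration) to keep the example short.

    /* Illustrative sketch only; names and layout are assumptions. */
    typedef struct {
        void *ifp;          /* instruction frame pointer (thread code) */
        void *dfp;          /* data frame pointer (thread data)        */
        int   ready_count;  /* producers still outstanding             */
        int   consumers[2]; /* Con1 and Con2; -1 marks an unused slot  */
    } thread_meta_t;

    /* Invoked when the AQ reports that thread `tid` has completed. */
    static void tsu_ack(thread_meta_t *gm, int tid, int *rq, int *rq_tail)
    {
        for (int i = 0; i < 2; i++) {
            int c = gm[tid].consumers[i];
            if (c >= 0 && --gm[c].ready_count == 0)
                rq[(*rq_tail)++] = c;  /* consumer became ready */
        }
    }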
The Data-Driven Virtual Machine (DDM-VM)
- The DDM-VM is a virtual machine that supports DDM execution on homogeneous and heterogeneous multicores.

[Figure: the two DDM-VM implementations. The DDM-VMc targets the Cell: the PPE runs the DDM-VMc PPE Runtime, which hosts the TSU and the S-CacheFlow execution, while each SPE (SPE 1 ... SPE 8) runs DDM thread execution under the DDM-VMc SPE Runtime out of its Local Store (LS); PPE and SPEs reach main memory over the bus. The DDM-VMs targets homogeneous multicores: each core (Core 1 ... Core n) runs the DDM-VMs Runtime (TSU + CacheFlow) alongside DDM thread execution, sharing the program data and the TSU memory structures in main memory through the cache hierarchy, with I/O and network connections to other nodes.]
DDM-VM Overview
- Combines dynamic dataflow concurrency with efficient sequential execution, with competitive performance on commercial multicore systems
- Distributed concurrency: no central point of control in the program
  - Tolerance to memory, synchronization and network latencies
  - Interleaves execution with synchronization, shortening the critical path
- Ease of programmability
  - No explicit synchronization, no race conditions, no barriers...
  - Functional/side-effect-free threads expose the maximum amount of parallelism (only partial ordering of instructions, only true dependencies, no hazards)
  - Rich programming toolchain: macros, pragmas, automatic compilation tools
- Automatic management of the memory hierarchy
  - Implicit prefetching with locality-aware optimizations
The Cell Processor

[Figure: the Cell architecture. Eight SPEs, each an SPU (SXU) with a Local Store (LS) and a Memory Flow Controller (MFC), share the Element Interconnect Bus with the PPE (PPU with PXU, L1 and L2 caches), the MIC to main memory and the BIC to the I/O devices.]

- The Cell provides high computational power on a single chip (204 GFLOPS), but programming it is not a trivial process:
  - SPE code runs in a special pthread; the programmer is responsible for instantiating, scheduling and synchronizing the threads
  - Most importantly, the programmer is responsible for the data management between the two levels of the memory hierarchy, as the fragment below illustrates
  - Two different ISAs and address spaces (two sets of sources)
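To illustrate that burden, the fragment below shows the kind of explicit SPU-side DMA a Cell programmer otherwise writes by hand with the SDK's spu_mfcio.h interface; managing such transfers, with their alignment and size constraints, for every block of data is exactly what the DDM-VMc automates.

    #include <spu_mfcio.h>

    /* Pull `size` bytes from effective address `ea` in main memory
     * into the Local Store buffer `ls_buf`, then block until the DMA
     * completes. `size` must respect the MFC constraints (e.g. a
     * multiple of 16 bytes, at most 16 KB per transfer). */
    static void fetch_block(void *ls_buf, unsigned long long ea,
                            unsigned int size)
    {
        const unsigned int tag = 0;
        mfc_get(ls_buf, ea, size, tag, 0, 0); /* queue the transfer    */
        mfc_write_tag_mask(1 << tag);         /* select tag to wait on */
        mfc_read_tag_status_all();            /* stall until complete  */
    }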
The Data-Driven Virtual Machine on the Cell (DDM-VMc)

[Figure: the DDM-VMc runtime organization. The PPE runs the DDM-VMc PPE Runtime, which implements the Thread Synchronization Unit (TSU) and the Soft-CacheFlow execution. The TSU memory structures reside in main memory: common structures (GM, SM, AQ, DFPL, CL) and per-SPE structures (FQ, CD, WQ, RCLD, CQ) for SPE 0 through SPE 7. Each SPE runs the DDM-VMc SPE Runtime: DDM thread code (computation interleaved with runtime calls) plus DMA calls that move program data between main memory and the Local Store through the DDM Cache. SPEs post DDM commands (e.g. DDMCommand_ThreadFinish) to the TSU through per-SPE command buffers; the TSU performs the RC = 0 check and the RCLD cache lookup and sends ready-thread information back to the SPEs.]
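As a rough sketch of how the SPE side of this organization might fit together in C: dequeue ready-thread info delivered by the TSU, let S-CacheFlow stage the inputs into the LS, execute the thread body, then post a thread-finish command back to the PPE. Every identifier below (ready_thread_t, scf_fetch_inputs, ddm_cmd_thread_finish) is a hypothetical placeholder for the runtime calls shown in the figure, not the actual DDM-VMc API.

    /* Hypothetical shape of the DDM-VMc SPE runtime loop. */
    typedef struct {
        int    tid;                        /* thread template ID         */
        long   ctx;                        /* invocation index (context) */
        void (*ifp)(void *data, long ctx); /* thread code (IFP)          */
        void  *dfp;                        /* main-memory data (DFP)     */
    } ready_thread_t;

    extern int   spe_dequeue_ready(ready_thread_t *t);    /* from TSU   */
    extern void *scf_fetch_inputs(void *dfp);             /* DMA to LS  */
    extern void  ddm_cmd_thread_finish(int tid, long ctx);/* notify TSU */

    static void spe_runtime_loop(void)
    {
        ready_thread_t t;
        while (spe_dequeue_ready(&t)) {
            void *ls_data = scf_fetch_inputs(t.dfp); /* S-CacheFlow     */
            t.ifp(ls_data, t.ctx);                   /* run thread body */
            ddm_cmd_thread_finish(t.tid, t.ctx);     /* via cmd buffer  */
        }
    }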
Software CacheFlow
- CacheFlow is a cache management policy used with DDM to improve performance by prefetching thread data into the cache
- DDM-VMc uses the CacheFlow concept to implement Software CacheFlow (S-CacheFlow): an automated prefetching software cache with variable cache block sizes for managing the memory hierarchy of the Cell
  - A portion of the LS is pre-allocated and divided into cache blocks; the block size is adjustable based on the application characteristics
  - A Cache Directory allocated in main memory keeps the state of the blocks
  - Each input/output is separately allocated at least one block
  - The S-CacheFlow module in the TSU handles allocation, eviction and fetching of data between main memory and the LS with no user intervention
S-CacheFlow - Multi-buffering/Locality Optimizations
- Data transfers and management are overlapped with thread execution to tolerate latencies and achieve multi-buffering
- Maintains consistency with data-flow synchronization, in a manner similar to DAG consistency
- Supports "explicit locality", which avoids expensive cross-SPE coherence operations:
  - Explicitly inserting write-backs in the graph of the program
  - Flagging input/output data for re-use (a reuse flag for inputs and a keep flag for outputs)
  - Assigning a reference count
  - A dirty bit protects against premature eviction of dirty blocks
The S-CacheFlow algorithm (the TSU side runs on the PPE; the threads execute on the SPEs), condensed into the code sketch after this list:
1. While the WQ (or the priority queue, PrioWQ, which is serviced first) has an entry, dequeue the thread info and get the thread's DFPs.
2. Allocate data in the LS: consult the Cache Directory and perform block eviction, block allocation or block reuse.
3. If allocation succeeds for all DFPs, issue the DMA calls: write back evicted dirty blocks from the LS to main memory, fetch the DFPs from main memory to the LS, and copy the lookup info to the LS. Record all issued DMAs in a PendingBuffer (PB) entry with Entry ID = Thread ID.
4. If only a partial allocation succeeds, restore the saved cache state and enqueue the thread in the PrioWQ.
5. When a PB entry exists for which all DMAs have completed, move the thread info with Thread ID = Entry ID to the FQ.
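Condensing the allocation path above into C, a sketch might look like the following; every function name stands in for one box of the flowchart and is an assumption, not the DDM-VMc implementation.

    /* Sketch of one S-CacheFlow scheduling pass (all names invented). */
    extern int  cache_alloc_blocks(void **dfps, int n); /* directory ops */
    extern void writeback_evicted_dirty(void);
    extern void fetch_dfps(void **dfps, int n);
    extern void pending_buffer_add(int tid);
    extern void save_cache_state(void);
    extern void restore_cache_state(void);
    extern void prio_wq_enqueue(int tid);

    static void scf_process(int tid, void **dfps, int ndfps)
    {
        save_cache_state();
        if (cache_alloc_blocks(dfps, ndfps)) {   /* evict/allocate/reuse */
            writeback_evicted_dirty();           /* dirty LS -> memory   */
            fetch_dfps(dfps, ndfps);             /* memory -> LS (DMA)   */
            pending_buffer_add(tid);             /* entry ID = thread ID */
        } else {
            restore_cache_state();               /* undo partial alloc   */
            prio_wq_enqueue(tid);                /* retry this one first */
        }
    }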
CacheFlow vs. D-CacheFlow

[Figure: SPE runtime activity breakdown for MatMult on 1, 2, 4 and 6 SPEs, comparing CF and D-CF. Each bar (60%-100%) splits into Computation, D-CacheFlow (SPE), Idle (Wait NextThread), Idle (Wait Process Msg) and Runtime (Other).]

- D-CacheFlow, the distributed implementation of the algorithm, successfully achieves efficient management of the memory in the Cell
Programming DDM-VMc
- DDM-VMc programs consist of two parts:
  - The code of the threads
  - The dependency graph of the threads
- The programmer must:
  - Identify thread boundaries
  - Identify the input/output data
  - Identify the dependencies amongst the threads
- The programs are coded in C using a set of macros that expand to calls to the DDM-VMc runtime (see the sketch below)

[Figure: a small graph of DDM threads with producer/consumer arcs and Ready Count (RC) values, e.g. RC=1.]

- The programmer can also use the TFlux+ pragmas, originally developed for the TFlux platform and extended to target the DDM-VMc and generate the macros
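The slides do not spell out the macro syntax itself; purely to illustrate the two program parts (thread code and dependency graph), a hypothetical use might look as follows, with every macro name (DDM_THREAD, DDM_IMPORT, ...) invented for exposition rather than taken from the real DDM-VMc headers.

    /* Hypothetical DDM-style macros; the real DDM-VMc names may differ. */
    DDM_THREAD(T_PRODUCE, READY_COUNT(0))       /* no producers: fires at start */
    {
        DDM_EXPORT(float *, block, BLOCK_SIZE); /* declare output data          */
        produce(block);                         /* thread code                  */
    }
    DDM_END_THREAD(DDM_UPDATE(T_CONSUME))       /* arc in the dependency graph  */

    DDM_THREAD(T_CONSUME, READY_COUNT(1))       /* waits for one producer       */
    {
        DDM_IMPORT(float *, block, BLOCK_SIZE); /* declare input data           */
        consume(block);                         /* thread code                  */
    }
    DDM_END_THREAD(/* no consumers */)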
Programming Toolchain
- Two compiler projects, currently under development, automate generating the macros:
  - A GCC-based auto-parallelizing compiler for C programs
  - A source-to-source Concurrent Collections (CnC) compiler

[Figure: the DDM-VMc programming toolchain. A sequential C program passes through the GCC parallelizing compiler, a CnC program through the CnC-to-DDM compiler, and a C program with TFlux* pragmas through the TFlux preprocessor; each produces a C program with DDM macros, the DDM-VMc internal representation. Macro expansion yields the SPE source file (DDM thread code plus calls to the run-time) and the PPE source file (dependency graph, initialization and clean-up, plus calls to the run-time). The Cell SDK compilers (gcc or xlc) compile each side, the SPU linker links the SPU object files against the DDM-VMc runtime SPU library and embeds them in the PPU link, and the PPU linker links the PPU object files against the DDM-VMc runtime PPU library to produce a single Cell executable.]
Concurrent Collections (CnC) to DDM
- Declarative parallel programming language
- Separation of concerns: Domain Expert vs. Parallelization Expert
- Similar semantics to DDM:
  - Step => DDM thread
  - Tag => dynamic meta-data
  - Item produce/consume relationships => static meta-data

[Figure: the blocked MatMult CnC graph (tag collections < >, item collections [ ], step collections ( ), producer/consumer and control relationships, environment input/output) and its translation into a DDM program of threads T1 (Iterator) and T2 (Multiply) linked by a producer/consumer arc.]

    // Item definitions
    [int* A <PAIR>];   // Item A, pointer to a block in main memory
    [int* B <PAIR>];   // Item B, pointer to a block in main memory
    [int* C <TRIPLE>]; // Item C, pointer to a block in main memory
    // Tag definitions
    <PAIR ITag>;
    <TRIPLE MTag>;
    // Prescriptions (control relationships) <TAG>::(STEP)
    <ITag> :: (Iterator);
    <MTag> :: (Multiply);
    // Step produce/consume relationships
    (Iterator) -> <MTag>;          // Iterator produces MTag
    [A], [B], [C] -> (Multiply);   // Multiply consumes A, B, C
    (Multiply) -> [C], <MTag>;     // Multiply produces C
    env -> <ITag>, [A], [B], [C];  // initialization code produces A, B, C
    [C] -> env;                    // post-execution code consumes C
DDM-VMc Program: Blocked LU Decomposition
- Factorizes a matrix A into L and U matrices, such that A = LU

[Figure: the operations on the tiled matrix and the dependency graph of one LU iteration (LU iter 0) on a 4x4 tiling. The diag invocation <0> feeds the front invocations <0,1>, <0,2>, <0,3> and the down invocations <0,1>, <0,2>, <0,3>, which in turn feed the comb invocations <0,1,1> through <0,3,3>. The legend distinguishes DDM threads, dynamic instantiations, dependencies on initialized data and data dependences; the number next to each invocation is its Ready Count.]

The animation walks through the data-driven firing of the graph:
- Initially only the diag invocation can run; all other Ready Counts are still non-zero.
- Once diag completes, all front and down invocations can run.
- front and down then execute in parallel, and the comb invocations whose Ready Counts reach zero can run.
The LU threads and their graph are expressed with the TFlux+ pragmas. Each #pragma ddm thread ... #pragma ddm endthread pair marks the thread boundaries, the import/import_export clauses identify the input/output data of the threads, and the update/cond_update clauses identify the consumer threads:

    #pragma ddm thread TID_DIAG kernel(dynamic) readycount 1 arity 1
            import_export(float *T : A[@context][@context] : BLOCK_SIZE);
        ... // computation code executes here
        bUpdate = @context < TILES-1;
    #pragma ddm endthread cond_update(
            TID_FRONT, @(@context, @context+1) : @(@context, TILES-1) : bUpdate,
            TID_DOWN,  @(@context, @context+1) : @(@context, TILES-1) : bUpdate);

    #pragma ddm thread TID_FRONT kernel(dynamic) readycount 2 arity 2
            import(float *T : A[@context.1][@context.1] : BLOCK_SIZE);
            import_export(float *P : A[@context.1][@context.0] : BLOCK_SIZE);
        ... // computation code executes here
    #pragma ddm endthread update(
            TID_COMB, @(@context.1, @context.1+1, @context.0) : @(@context.1, TILES-1, @context.0));

    #pragma ddm thread TID_DOWN kernel(dynamic) readycount 2 arity 2
            import(float *T : A[@context.1][@context.1] : BLOCK_SIZE);
            import_export(float *Q : A[@context.0][@context.1] : BLOCK_SIZE);
        ... // computation code executes here
    #pragma ddm endthread update(
            TID_COMB, @(@context.1, @context.0, @context.1+1) : @(@context.1, @context.0, TILES-1));
Benchmark Suite
- 10 applications featuring kernels widely used in scientific and image-processing applications

    Benchmark   Avg. thread granularity   Input size (Small / Medium / Large)
    Trapez      variable                  Small / Medium / Large
    MatCopy     4.6 µs                    512x512 / 1024x1024 / 2048x2048
    MatAdd      4.6 µs                    512x512 / 1024x1024 / 2048x2048
    MatMult     22.1 µs                   512x512 / 1024x1024 / 2048x2048
    Cholesky    22 µs                     512x512 / 1024x1024 / 2048x2048
    LU          1.82 ms                   512x512 / 1024x1024 / 2048x2048
    IDCT        12.37 µs - 98.8 µs        512x512 / 1024x1024 / 2048x2048
    Conv2D      12.28 µs - 48.11 µs       512x512 / 1024x1024 / 2048x2048
    RK4         variable                  Small / Medium / Large
    FDTD        28.65 µs - 116 µs         304 / 608 / 1216 Y-Cells
CacheFlow vs. D-CacheFlow - Thread Granularity

[Figure: speedup vs. number of SPEs (1, 2, 4, 6) for CacheFlow and D-CacheFlow across MatMult, Cholesky, LU, Trapez, Conv2D (32x32 and 64x64), MatAdd, MatCopy, FDTD (304x304 and 608x608) and IDCT (32x16 and 64x32). All benchmarks use the Large problem size.]

- For most of the applications, the platform scales well and tolerates latencies and overheads efficiently
- The distributed implementation of CacheFlow (D-CacheFlow) is more efficient, as it alleviates the pressure on the PPE
- For applications with varying granularities, the performance improves as granularity increases
Performance - Problem Size

[Figure: speedup vs. number of SPEs (2, 4, 6) for the Small, Medium and Large problem sizes of MatMult, Cholesky, LU, Trapez, RK4, FDTD, Conv2D (64x64) and IDCT (64x64).]

- The system generally scales well across the range of the benchmarks, achieving an almost linear speedup for the large problem size
Synchronization Latency Tolerance

[Figure: execution time on 1 SPE, normalized, for Sequential and for D-CacheFlow with 1, 2 and 3 concurrent threads (D-CacheFlow-1/2/3). All benchmarks use the Large problem size.]

- When the number of concurrent threads is limited to 1, the TSU overhead is added to the critical path
- When it is increased to 2 and 3, the TSU overlaps the scheduling overheads and data transfers with execution, and finishes in less time than the sequential version
GFLOPS Performance

[Figure: GFLOPS vs. number of SPEs (2, 4, 6) for MatMult-2048x2048 and Cholesky-2048x2048 against the theoretical peak.]

- MatMult achieves an average of 88% of the theoretical peak
- Cholesky achieves a speedup of 5 out of 6
Comparison with Sequoia [Stanford]

[Figure: GFLOPS vs. number of SPEs (2, 4, 6) for DDM-VMc and Sequoia on MatMult-512/1024/2048 and Conv2D-512/1024/2048 (9x9 convolution filter).]

- MatMult: an improvement of 10%, 16% and 11% for the 512, 1024 and 2048 sizes
- Conv2D: an improvement of 27%, 22% and 27% for the 512, 1024 and 2048 sizes
Comparison* with CellSs [Barcelona Supercomputing Center]

[Figure: GFLOPS vs. number of SPEs (2, 4, 6) for DDM-VMc and CellSs on MatMult-512/1024/2048 and Cholesky-512/1024/2048.]

- MatMult: an improvement of 80%, 28% and 19% for the 512, 1024 and 2048 sizes
- Cholesky: an improvement of 213%, 99% and 23% for the 512, 1024 and 2048 sizes

*Details in our SAMOS2010 paper
Conclusion
- Models like DDM, which combine dataflow concurrency with efficient sequential execution, are promising candidates for the programming model of multicore systems
- Our results demonstrate that data-flow concurrency can be efficiently implemented as a virtual machine on commercial multicore systems