DDM-VMc: The Data-Driven Multithreading Virtual Machine for the Cell Processor (HiPEAC'11, Heraklion, Crete)


SLIDE 1

DDM-VMc: The Data-Driven Multithreading Virtual Machine for the Cell Processor

Samer Arandi, Skevos (Paraskevas) Evripidou

University of Cyprus, Computer Science Department

HiPEAC'11, Heraklion, Crete

SLIDE 2

Outline

• Motivation
• Data Driven Multithreading
• The DDM-VMc
• Programming Toolchain
• Evaluation
• Conclusion

SLIDE 3

Outline

• Motivation
• Data Driven Multithreading
• The DDM-VMc
• Programming Toolchain
• Evaluation
• Conclusion

SLIDE 4

Motivation

• The adoption of multicore architectures ushered in the "Concurrency Era", which gave rise to new challenges:
    • Traditional programming models do not allow efficient utilization of the large multicore resources
    • Multicores still suffer from the effects of the Memory Wall
• Heterogeneous multicores (motivated by more power- and area-efficient designs) make this task even more complex
• One technique to combat the memory wall is to utilize explicitly managed on-chip local memories (scratchpads)
    • This offers great opportunities for optimization, but burdens the programmer with the management of the memory hierarchy

SLIDE 5

Our Take:

• Instead of extending sequential models with concurrent constructs (mostly an ad hoc solution), re-visit alternative models that are inherently parallel and offer distributed concurrency, i.e. Dataflow

Our Goal:

• Exploit data-flow concurrency on commercial multicores with performance as good as or better than similar systems

SLIDE 6

Outline

• Motivation
• Data Driven Multithreading
• The DDM-VMc
• Programming Toolchain
• Evaluation
• Conclusion

SLIDE 7

Data Driven Multithreading (DDM)

• Execution model that combines:
    • Distributed data-flow concurrency for scheduling threads
    • Efficient sequential execution within a thread
• Decouples synchronization from computation
• Non-blocking: threads execute to completion
• The core of DDM is the Thread Scheduling Unit (TSU)
    • Holds the meta-data of the threads (the dependency graph)
    • Uses the graph to schedule threads dynamically at runtime based on data availability
• CacheFlow: data-driven prefetching drastically improves the hit ratio of the cache and requires much smaller caches
    • The Ready Queue (RQ) gives the near-future execution patterns
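To make the scheduling mechanism concrete, the following is a minimal C sketch of the kind of per-thread meta-data the TSU could keep and of the consumer update performed when a thread completes. It only illustrates the technique described above; all names (ThreadTemplate, SynchSlot, tsu_thread_completed, rq_enqueue) are hypothetical and do not correspond to the actual DDM-VM implementation.

#include <stdint.h>

#define MAX_CONSUMERS 2

/* Graph Memory entry: static meta-data of one DDM thread (hypothetical layout). */
typedef struct {
    void     (*ifp)(uint32_t ctx);        /* Instruction Frame Pointer: the thread code */
    void      *dfp;                        /* Data Frame Pointer: the thread data        */
    uint32_t   consumers[MAX_CONSUMERS];   /* IDs of the consumer threads                */
    uint32_t   num_consumers;
} ThreadTemplate;

/* Synchronization Memory entry: one Ready Count per thread invocation. */
typedef struct {
    int32_t ready_count;                   /* producers still pending for this thread    */
} SynchSlot;

extern ThreadTemplate graph_memory[];      /* GM */
extern SynchSlot      synch_memory[];      /* SM */
void rq_enqueue(uint32_t thread_id);       /* push a ready thread into the Ready Queue   */

/* Invoked by the TSU when a completed thread is dequeued from the Acknowledgement Queue. */
void tsu_thread_completed(uint32_t thread_id)
{
    ThreadTemplate *t = &graph_memory[thread_id];
    for (uint32_t i = 0; i < t->num_consumers; i++) {
        uint32_t c = t->consumers[i];
        if (--synch_memory[c].ready_count == 0)    /* all producers done?    */
            rq_enqueue(c);                         /* consumer is now ready  */
    }
}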

SLIDE 8

Data Driven Multithreading (DDM) – Projects

• Data-Driven Network of Workstations (D2Now)
    • A simulated cluster of distributed machines augmented with a hardware Thread Scheduling Unit
    • Explored CacheFlow optimizations and showed that data-driven scheduling can generally improve locality
• ThreadFlux (TFlux)
    • A portable software platform that runs on a variety of commercial multi-core systems
    • The first full-system simulation of a DDM machine
    • TFlux Pragmas: data-flow specific directives
• Data Driven Multithreading Virtual Machine (DDM-VM)
    • A virtual machine that supports DDM execution on homogeneous and heterogeneous multi-cores

SLIDE 9

Data-Driven Multithreading Execution

[Figure: a DDM PE with a hardware TSU. The Thread Synchronization Unit (TSU) holds the Graph Memory (GM), the Synchronization Memory (SM), the Ready Queue (RQ) and the Acknowledgement Queue (AQ), and is attached through a snooping unit to the processor (with its L1/L2 caches), the memory and the bus. Each GM entry holds a thread's IFP, DFP and two consumers (Con1, Con2); the SM holds the Ready Counts. The example threads dependency graph contains threads 31, 32, 33, 34 and 36.]

SLIDE 10

Data-Driven Multithreading Execution

The GM contains the IFP, DFP and the two consumers (Con1 and Con2).

[Same figure as Slide 9.]

SLIDE 11

Data-Driven Multithreading Execution

The SM contains the Ready Counts: one value for each loop iteration.

[Same figure as Slide 9.]

SLIDE 12

Data-Driven Multithreading Execution

The processor reads from the RQ the pointers (IFP, DFP) and the index of ready threads and executes them.

[Same figure as Slide 9.]

SLIDE 13

Data-Driven Multithreading Execution

After executing a thread, the processor stores in the AQ the information (Thread#, index and status) of the executed thread.

[Same figure as Slide 9.]

SLIDE 14

Data-Driven Multithreading Execution

The TSU determines the consumers of completed threads from the GM.

[Same figure as Slide 9.]

SLIDE 15

Data-Driven Multithreading Execution

The TSU updates the SM and checks whether any of the consumers has become ready (Ready Count = 0).

[Same figure as Slide 9.]

SLIDE 16

Data-Driven Multithreading Execution

The TSU loads into the RQ the pointers (IFP, DFP) of the ready thread from the GM, together with its index from the SM.

[Same figure as Slide 9.]

SLIDE 17

Outline

• Motivation
• Data Driven Multithreading
• The DDM-VMc
• Programming Toolchain
• Evaluation
• Conclusion

SLIDE 18

The Data-Driven Virtual Machine (DDM-VM)

• The DDM-VM is a virtual machine that supports DDM execution on homogeneous and heterogeneous multicores

[Figure: the two DDM-VM implementations. DDM-VMc targets the Cell: the PPE runs the DDM-VMc PPE runtime (TSU + S-CacheFlow execution) and each of the SPEs 1-8 runs the DDM-VMc SPE runtime and DDM thread execution out of its Local Store (LS), all connected over the bus to main memory, which holds the TSU memory structures and the program data. DDM-VMs targets homogeneous multicores: each of the cores 1-n runs the DDM-VMs runtime (TSU + CacheFlow) and DDM thread execution over the cache hierarchy, bus and main memory, with I/O and network links to other nodes.]

SLIDE 19

The Data-Driven Virtual Machine (DDM-VM)

• The DDM-VM is a virtual machine that supports DDM execution on homogeneous and heterogeneous multicores

[Same figure as Slide 18, with the TSU component of each runtime highlighted.]

SLIDE 20

The Data-Driven Virtual Machine (DDM-VM)

• The DDM-VM is a virtual machine that supports DDM execution on homogeneous and heterogeneous multicores

[Same figure as Slide 18, with the DDM thread execution component of each runtime highlighted.]

SLIDE 21

DDM-VM Overview

• Combines dynamic dataflow concurrency with efficient sequential execution, with competitive performance on commercial multicore systems
    • Distributed concurrency: no central point of control in the program
    • Tolerance to memory, synchronization and network latencies
    • Interleaves execution with synchronization, shortening the critical path
• Ease of programmability
    • No explicit synchronization, no race conditions, no barriers…
    • Functional / side-effect free: exposes the maximum amount of parallelism (only partial ordering of instructions, only true dependencies, avoids hazards)
    • Rich programming toolchain: macros, pragmas, automatic compilation tools
• Automatic management of the memory hierarchy
    • Implicit prefetching with locality-aware optimizations
SLIDE 22

The Cell Processor

[Figure: the Cell architecture – a PPE (PPU with L1/L2 caches) and eight SPEs (each an SXU with a Local Store and an MFC) connected by the Element Interconnect Bus to the memory interface (MIC), main memory and the I/O devices (BIC).]

The Cell provides high computational power on a single chip (204 GFLOPs), but programming it is not a trivial process:
    • SPE code runs in a special pthread; the programmer is responsible for instantiating, scheduling and synchronizing the threads
    • Most importantly, the programmer is responsible for the data management between the two levels of the memory hierarchy
    • Two different ISAs and address spaces (two sets of sources)
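To illustrate this burden, here is a minimal sketch of the kind of explicit DMA code an SPE program has to contain just to bring one block from main memory into the Local Store and write it back, using the MFC intrinsics of the Cell SDK. The block size and the doubling computation are assumptions made up for the example.

#include <spu_mfcio.h>
#include <stdint.h>

#define BLOCK_BYTES 16384                          /* example block size (multiple of 16)     */
static volatile float ls_buf[BLOCK_BYTES / 4]      /* the buffer lives in the SPE Local Store */
    __attribute__((aligned(128)));

/* Fetch one block from main memory, compute on it, and write it back. */
void process_block(uint64_t ea_block)              /* effective address in main memory        */
{
    const unsigned tag = 0;

    /* DMA-get the block into the Local Store and wait for completion. */
    mfc_get(ls_buf, ea_block, BLOCK_BYTES, tag, 0, 0);
    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();

    for (int i = 0; i < BLOCK_BYTES / 4; i++)      /* ... the actual computation ...          */
        ls_buf[i] *= 2.0f;

    /* DMA-put the result back to main memory and wait again. */
    mfc_put(ls_buf, ea_block, BLOCK_BYTES, tag, 0, 0);
    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();
}

In DDM-VMc, S-CacheFlow issues and overlaps transfers of this kind automatically on behalf of the threads, as described in the following slides.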

SLIDE 23

The Data-Driven Virtual Machine on the Cell (DDM-VMc)

[Figure: DDM-VMc organization. The PPE runs the DDM-VMc PPE runtime, which implements the Thread Synchronization Unit (TSU) plus Soft-CacheFlow execution. The TSU memory structures in main memory comprise common structures (GM, SM, AQ, DFPL, CL) and per-SPE structures (FQ, CD, WQ, RCLD, CQ). Each SPE (0-7) runs the DDM-VMc SPE runtime next to the DDM thread code in its Local Store; threads issue DMA calls for their data and runtime calls (DDMCommands such as ThreadFinish) that notify the TSU through command buffers, while the TSU performs cache lookups (RCLD), checks Ready Counts (RC = 0?) and returns ready-thread information.]

SLIDE 24

The Data-Driven Virtual Machine on the Cell (DDM-VMc)

[Same figure as Slide 23, highlighting the VM runtime and the TSU structures (TSU + S-CacheFlow execution) on the PPE.]

SLIDE 25

The Data-Driven Virtual Machine on the Cell (DDM-VMc)

[Same figure as Slide 23, highlighting the DDM thread execution on the SPEs.]

SLIDE 26

The Data-Driven Virtual Machine on the Cell (DDM-VMc)

[Same figure as Slide 23, highlighting the contents of the SPE Local Store: the thread code, the VM runtime and the DDM cache.]

SLIDE 27

The Data-Driven Virtual Machine on the Cell (DDM-VMc)

[Same figure as Slide 23, highlighting the DMA calls.]

SLIDE 28

The Data-Driven Virtual Machine on the Cell (DDM-VMc)

[Same figure as Slide 23.]

SLIDE 29

The Data-Driven Virtual Machine on the Cell (DDM-VMc)

[Same figure as Slide 23.]

SLIDE 30

Software CacheFlow

• CacheFlow is a cache management policy used with DDM to improve performance by prefetching the threads' data into the cache
• DDM-VMc uses the CacheFlow concept to implement Software CacheFlow (S-CacheFlow): an automated prefetching software cache with variable cache block sizes for managing the memory hierarchy of the Cell
• A portion of the LS is pre-allocated and divided into cache blocks; the block size is adjustable based on the application characteristics
• A Cache Directory allocated in main memory keeps the state of the blocks
• Each input/output is separately allocated at least one block
• The Soft-CacheFlow module in the TSU handles allocation, eviction and fetching of data between main memory and the LS with no user intervention
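The following is a minimal sketch of what an S-CacheFlow-style lookup/allocation step could look like. The structures and names (cache_block, scf_acquire, dma_fetch) are hypothetical, and the real implementation also handles write-backs, variable block sizes and multi-block inputs; this only illustrates the directory-based hit/miss/eviction decision.

#include <stdint.h>
#include <stddef.h>

#define NUM_BLOCKS 32

typedef struct {
    uint64_t ea;        /* effective (main-memory) address cached in this block; 0 = free */
    void    *ls_addr;   /* address of the block inside the pre-allocated LS region        */
    int      ref_count; /* live threads currently using the block                         */
    int      dirty;     /* modified in the LS and not yet written back to main memory     */
} cache_block;

static cache_block directory[NUM_BLOCKS];    /* the Cache Directory                       */

void dma_fetch(void *ls_addr, uint64_t ea);  /* issue a DMA get, main memory -> LS (not shown) */

/* Return the LS copy of the data at effective address 'ea', fetching it if needed. */
void *scf_acquire(uint64_t ea)
{
    cache_block *victim = NULL;

    for (int i = 0; i < NUM_BLOCKS; i++) {
        if (directory[i].ea == ea) {              /* hit: the block is already resident    */
            directory[i].ref_count++;
            return directory[i].ls_addr;
        }
        if (!victim && directory[i].ref_count == 0 && !directory[i].dirty)
            victim = &directory[i];               /* candidate block that is safe to evict */
    }
    if (!victim)
        return NULL;                              /* partial allocation: caller retries later */

    victim->ea = ea;                              /* miss: claim the block and fetch the data */
    victim->ref_count = 1;
    dma_fetch(victim->ls_addr, ea);
    return victim->ls_addr;
}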

SLIDE 31

S-CacheFlow – Multi-buffering / Locality Optimizations

• Data transfers and management are overlapped with thread execution to tolerate latencies and achieve multi-buffering
• Maintains consistency with data-flow synchronization in a manner similar to DAG consistency
• Supports "explicit locality", which avoids expensive cross-SPE coherence operations, by:
    • Explicitly inserting "write-backs" in the graph of the program
    • Flagging input/output data for re-use (a reuse flag for inputs and a keep flag for outputs)
    • Assigning a reference count to blocks
    • Using a dirty bit to protect against premature eviction of dirty blocks
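One possible way to express the eviction policy implied by these flags is sketched below; the field and function names are hypothetical and only illustrate how the reuse/keep flags, the reference count and the dirty bit interact.

/* Per-block state relevant to eviction (hypothetical sketch). */
typedef struct {
    int ref_count;   /* threads still referencing the block                 */
    int dirty;       /* modified in the LS, not yet written to main memory  */
    int keep;        /* producer asked to keep this output resident         */
    int reuse;       /* a later thread will re-read this input              */
} block_state;

/* A block may be evicted only when nobody uses it and no re-use was requested. */
int can_evict(const block_state *b)
{
    return b->ref_count == 0 && !b->keep && !b->reuse;
}

/* Dirty blocks must be written back (DMA put) before their LS space is reused. */
int needs_writeback(const block_state *b)
{
    return b->dirty;
}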

SLIDE 32

The S-CacheFlow Algorithm

[Flowchart, executed partly on the PPE and partly on the SPE:]
    • While the Waiting Queue (WQ) or the priority Waiting Queue (PrioWQ) has entries, dequeue the thread info and get the thread's DFPs
    • Allocate the data in the LS (consult the Cache Directory; evict, allocate or reuse blocks)
    • If allocation succeeds for all DFPs: issue the DMA calls (write back evicted dirty blocks from the LS to main memory, fetch the DFPs from main memory to the LS, copy the lookup info to the LS) and record all issued DMAs in the PendingBuffer (PB) under an entry whose ID is the ThreadID
    • If only a partial allocation succeeds: save the cache state and enqueue the thread in the PrioWQ; otherwise restore the cache state
    • When all DMAs of a PB entry have completed, move the thread info with ThreadID = Entry ID to the Firing Queue (FQ)

SPE Runtime Activities (MatMult): CacheFlow vs. D-CacheFlow

[Bar chart: breakdown of SPE time (computation, D-CacheFlow on the SPE, idle waiting for the next thread, idle waiting to process messages, other runtime) for CacheFlow and D-CacheFlow on 1, 2, 4 and 6 SPEs.]

D-CacheFlow successfully achieves an efficient management of the memory in the Cell.
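The flowchart above can be summarized as the following C-style sketch of one iteration of the scheduling loop. The queue names follow the slide (WQ, PrioWQ, PB, FQ), but every function below is a hypothetical placeholder for the corresponding step, not the real runtime API, and the handling of partial allocations is one possible reading of the flowchart.

#include <stddef.h>

typedef struct thread_info thread_info;

/* Queue and cache operations provided by the runtime (declarations only in this sketch). */
thread_info *wq_dequeue(void);                    /* Waiting Queue                           */
thread_info *priowq_dequeue(void);                /* priority Waiting Queue                  */
thread_info *pb_pop_completed(void);              /* PendingBuffer entry whose DMAs finished */
void fq_enqueue(thread_info *t);                  /* Firing Queue: ready to run on an SPE    */
void wq_enqueue(thread_info *t);
void priowq_enqueue(thread_info *t);
int  allocate_all_dfps_in_ls(thread_info *t);     /* consult directory; evict/allocate/reuse */
int  partial_allocation_succeeded(thread_info *t);
void issue_writeback_dmas_for_evicted_blocks(void);
void issue_fetch_dmas_for_dfps(thread_info *t);
void copy_lookup_info_to_ls(thread_info *t);
void pb_record_pending_dmas(thread_info *t);      /* PB entry ID = ThreadID                  */
void save_cache_state(thread_info *t);
void restore_cache_state(thread_info *t);

/* One iteration of the S-CacheFlow scheduling loop running on the PPE (sketch). */
void scacheflow_step(void)
{
    thread_info *t;

    /* Threads whose DMAs were issued earlier and have now completed become firable. */
    while ((t = pb_pop_completed()) != NULL)
        fq_enqueue(t);

    /* Prefer threads that previously obtained only a partial allocation. */
    t = priowq_dequeue();
    if (t == NULL)
        t = wq_dequeue();
    if (t == NULL)
        return;

    if (allocate_all_dfps_in_ls(t)) {
        issue_writeback_dmas_for_evicted_blocks();   /* evicted dirty blocks: LS -> main memory */
        issue_fetch_dmas_for_dfps(t);                /* thread data: main memory -> LS          */
        copy_lookup_info_to_ls(t);
        pb_record_pending_dmas(t);                   /* completion later moves the thread to FQ */
    } else if (partial_allocation_succeeded(t)) {
        save_cache_state(t);                         /* keep what was allocated                 */
        priowq_enqueue(t);                           /* retry this thread with priority         */
    } else {
        restore_cache_state(t);                      /* roll back and try again later           */
        wq_enqueue(t);
    }
}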

SLIDE 33

Outline

• Motivation
• Data Driven Multithreading
• The DDM-VMc
• Programming Toolchain
• Evaluation
• Conclusion

SLIDE 34

Programming DDM-VMc

• DDM-VMc programs consist of two parts:
    • The code of the threads
    • The dependency graph of the threads
• The programs are coded in C using a set of macros that expand to calls to the DDM-VMc runtime; the programmer must:
    • Identify the thread boundaries
    • Identify the input/output data
    • Identify the dependencies amongst the threads
• The programmer can also use the TFlux+ pragmas, originally developed for the TFlux platform and extended to target the DDM-VMc and generate the macros

[Figure: an example dependency graph of DDM threads with producer/consumer relationships and Ready Counts (RC).]
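To give a feel for this style, the sketch below shows roughly what a two-thread DDM-VMc program could look like once the macros are expanded into runtime calls. The runtime functions and thread identifiers used here (ddm_register_thread, ddm_declare_data, ddm_add_consumer, ddm_run) are invented for illustration and do not match the actual DDM-VMc macro set or API.

#include <stdint.h>

/* Hypothetical runtime entry points that the macros could expand to. */
void ddm_register_thread(uint32_t tid, void (*code)(uint32_t ctx), uint32_t ready_count);
void ddm_declare_data(uint32_t tid, void *addr, uint32_t size, int is_output);
void ddm_add_consumer(uint32_t producer_tid, uint32_t consumer_tid);
void ddm_run(void);

enum { T_PRODUCE = 1, T_CONSUME = 2 };
#define N 1024
static float A[N];

static void produce(uint32_t ctx) { A[ctx] = (float)ctx; }      /* body of one DDM thread */
static void consume(uint32_t ctx) { A[ctx] = 2.0f * A[ctx]; }   /* body of another thread */

int main(void)
{
    /* Thread boundaries: each of the C functions above is one DDM thread.        */
    ddm_register_thread(T_PRODUCE, produce, /* ready_count = */ 0);
    ddm_register_thread(T_CONSUME, consume, /* ready_count = */ 1);

    /* Input/output data of the threads.                                          */
    ddm_declare_data(T_PRODUCE, A, sizeof(A), /* is_output = */ 1);
    ddm_declare_data(T_CONSUME, A, sizeof(A), /* is_output = */ 1);

    /* Dependency among the threads: T_CONSUME consumes what T_PRODUCE produces.  */
    ddm_add_consumer(T_PRODUCE, T_CONSUME);

    ddm_run();   /* data-driven execution, scheduled by the TSU, until all threads finish */
    return 0;
}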

SLIDE 35

Programming Toolchain

• Two compiler projects, currently under development, automate the generation of the macros:
    • A GCC-based auto-parallelizing compiler for C programs
    • A source-to-source Concurrent Collections (CnC) compiler

[Figure: the DDM-VMc programming toolchain. A sequential C program (through the GCC parallelizing compiler), a CnC program (through the CnC-to-DDM compiler) or a C program with TFlux* pragmas (through the TFlux preprocessor) is lowered to a C program with DDM macros: the DDM-VMc internal representation. Macro expansion produces the DDM thread code with calls to the runtime (the SPE source file) and the dependency graph with initialization and clean-up code plus runtime calls (the PPE source file). The Cell SDK compilers and linkers (gcc or xlc) build the SPU and PPU object files, link them against the DDM-VMc SPU and PPU runtime libraries, embed the SPU binary into the PPU binary, and produce the Cell executable.]

SLIDE 36

Concurrent Collections (CnC) to DDM

• Declarative parallel programming language
• Separation of concerns: domain expert vs. parallelization expert
• Notation: tag collections < >, item collections [ ], step collections ( ), producer/consumer relationships, control relationships, environment input/output
• Similar semantics with DDM:
    • Step => DDM thread
    • Tag => dynamic meta-data
    • Item producer/consumer relationships => static meta-data

Blocked MatMult in CnC:

// Item definitions
[int* A <PAIR>];   // Item A, pointer to a block in main memory
[int* B <PAIR>];   // Item B, pointer to a block in main memory
[int* C <TRIPLE>]; // Item C, pointer to a block in main memory
// Tag definitions
<PAIR ITag>;
<TRIPLE MTag>;
// Prescriptions (control relationships) <TAG>::(STEP)
<ITag> :: (Iterator);
<MTag> :: (Multiply);
// Step produce/consume relationships
(Iterator) -> <MTag>;            // Iterator produces MTag
[A], [B], [C] -> (Multiply);     // Multiply consumes A, B, C
(Multiply) -> [C], <MTag>;       // Multiply produces C
env -> <ITag>, [A], [B], [C];    // initialization code produces A, B, C
[C] -> env;                      // post-execution code consumes C

[Figure: the blocked MatMult CnC graph (ITag -> Iterator -> MTag -> Multiply over items A, B, C) and its translation into a DDM program of two threads, T1 (Iterator) and T2 (Multiply), connected by a producer/consumer relationship.]
SLIDE 37

DDM-VMc Program: Blocked LU Decomposition

• Factorizes a matrix A into L and U matrices, such that A = LU
• Four operations are applied to the tiles of the matrix: diag, front, down and comb

[Figure: the operations on the matrix for iteration 0 of LU, and the dependency graph of the DDM thread instantiations – the diag invocation <0>, the front invocations, the down invocations and the comb invocations – with the data dependencies between them and on the initialized data.]
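For reference, the sequential structure that this DDM program parallelizes is roughly the tiled loop nest below. This is only a sketch: the kernel bodies are elided, and the assignment of front to the pivot row and down to the pivot column is an assumption of the example. In the DDM-VMc version, each kernel invocation becomes one DDM thread, scheduled purely by the data dependencies shown in the graph.

#define TILES      4         /* the matrix is TILES x TILES blocks        */
#define BLOCK_SIZE 64        /* each block is BLOCK_SIZE x BLOCK_SIZE     */

typedef float tile_t[BLOCK_SIZE][BLOCK_SIZE];

/* The four kernels operating on tiles of A (bodies elided in this sketch). */
void diag(tile_t Akk);                               /* factorize the diagonal tile       */
void front(tile_t Akk, tile_t Akj);                  /* update a tile in the pivot row    */
void down(tile_t Akk, tile_t Aik);                   /* update a tile in the pivot column */
void comb(tile_t Aik, tile_t Akj, tile_t Aij);       /* update the trailing submatrix     */

void blocked_lu(tile_t A[TILES][TILES])
{
    for (int k = 0; k < TILES; k++) {
        diag(A[k][k]);
        for (int j = k + 1; j < TILES; j++)
            front(A[k][k], A[k][j]);
        for (int i = k + 1; i < TILES; i++)
            down(A[k][k], A[i][k]);
        for (int i = k + 1; i < TILES; i++)
            for (int j = k + 1; j < TILES; j++)
                comb(A[i][k], A[k][j], A[i][j]);
    }
}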

SLIDE 38

DDM-VM Program: Blocked LU Decomposition

[Same dependency-graph figure as Slide 37, showing the thread instantiations of LU iteration 0 with their Ready Counts.]

SLIDE 39

DDM-VM Program: Blocked LU Decomposition

[Same figure as Slide 37, annotated: the diag invocation can run now.]

SLIDE 40

DDM-VM Program: Blocked LU Decomposition

[Same figure as Slide 37, with the Ready Counts after the diag invocation has executed.]

SLIDE 41

DDM-VM Program: Blocked LU Decomposition

[Same figure as Slide 37, annotated: all the front and down invocations can run now.]

SLIDE 42

DDM-VM Program: Blocked LU Decomposition

[Same figure as Slide 37, annotated: the front and down invocations are executing in parallel; the invocations of comb that can run are underlined.]

SLIDE 43

DDM-VMc Program: Blocked LU Decomposition – TFlux+ pragmas

[Dependency-graph figure as on Slide 37, shown next to the code.]

#pragma ddm thread TID_DIAG kernel(dynamic) readycount 1 arity 1
        import_export(float *T:A[@context][@context]:BLOCK_SIZE);
  ... // computation code execution here
  bUpdate = @context < TILES-1;
#pragma ddm endthread cond_update(TID_FRONT, @(@context,@context+1) : @(@context, TILES-1) : bUpdate,
                                  TID_DOWN,  @(@context,@context+1) : @(@context, TILES-1) : bUpdate);

#pragma ddm thread TID_FRONT kernel(dynamic) readycount 2 arity 2
        import(float *T:A[@context.1][@context.1]:BLOCK_SIZE);
        import_export(float *P:A[@context.1][@context.0]:BLOCK_SIZE);
  ... // computation code execution here
#pragma ddm endthread update(TID_COMB, @(@context.1, @context.1+1, @context.0) : @(@context.1, TILES-1, @context.0));

#pragma ddm thread TID_DOWN kernel(dynamic) readycount 2 arity 2
        import(float *T:A[@context.1][@context.1]:BLOCK_SIZE);
        import_export(float *Q:A[@context.0][@context.1]:BLOCK_SIZE);
  ... // computation code execution here
#pragma ddm endthread update(TID_COMB, @(@context.1, @context.0, @context.1+1) : @(@context.1, @context.0, TILES-1));

SLIDE 44

[Same pragma code and dependency-graph figure as Slide 43, relating each pragma block to the diag, front and down threads in the graph.]

SLIDE 45

[Same code as Slide 43, highlighting the thread start and thread end boundaries (the ddm thread / ddm endthread pragmas).]

SLIDE 46

[Same code as Slide 43, highlighting the input/output data of the threads (the import / import_export clauses).]

SLIDE 47

[Same code as Slide 43, highlighting the identification of the consumer threads (the update / cond_update clauses).]

SLIDE 48

Outline

• Motivation
• Data Driven Multithreading
• The DDM-VMc
• Programming Toolchain
• Evaluation
• Conclusion

SLIDE 49

Benchmark Suite

10 applications featuring kernels widely used in scientific and image processing applications.

Benchmark   Avg. thread granularity    Input size (Small / Medium / Large)
Trapez      variable                   Small / Medium / Large
MatCopy     4.6 µs                     512x512 / 1024x1024 / 2048x2048
MatAdd      4.6 µs                     512x512 / 1024x1024 / 2048x2048
MatMult     22.1 µs                    512x512 / 1024x1024 / 2048x2048
Cholesky    22 µs                      512x512 / 1024x1024 / 2048x2048
LU          1.82 ms                    512x512 / 1024x1024 / 2048x2048
IDCT        12.37 µs – 98.8 µs         512x512 / 1024x1024 / 2048x2048
Conv2D      12.28 µs – 48.11 µs        512x512 / 1024x1024 / 2048x2048
RK4         variable                   Small / Medium / Large
FDTD        28.65 µs – 116 µs          304 / 608 / 1216 Y-Cells

SLIDE 50

CacheFlow vs. D-CacheFlow – Thread Granularity

[Speedup charts on 1, 2, 4 and 6 SPEs, large problem size, comparing CacheFlow and D-CacheFlow for Conv2D (32x32 and 64x64 blocks), MatAdd, MatCopy, FDTD (304x304 and 608x608), IDCT (32x16 and 64x32 blocks), MatMult, Cholesky, LU and Trapez.]

• For most of the applications, the platform scales well and tolerates latencies and overheads efficiently
• The distributed implementation of CacheFlow (D-CacheFlow) is more efficient, as it alleviates the pressure on the PPE
• For applications with varying granularities, the performance improves as the granularity increases

SLIDE 51

Performance – Problem Size

[Speedup charts on 2, 4 and 6 SPEs for MatMult, Cholesky, LU, Trapez, Conv2D (64x64), IDCT (64x64), RK4 and FDTD, each for the Small, Medium and Large problem sizes (512x512, 1024x1024 and 2048x2048 for the matrix kernels).]

The system generally scales well across the range of the benchmarks, achieving an almost linear speedup for the large problem size.

SLIDE 52

Synchronization Latency Tolerance

[Chart: execution time on 1 SPE, normalized to the sequential version, for all benchmarks at the large problem size: Sequential vs. D-CacheFlow with 1, 2 and 3 concurrent threads.]

• When the number of concurrent threads is limited to 1, the TSU overhead is added to the critical path
• When it is increased to 2 or 3, the TSU overlaps the scheduling overheads and data transfers with execution, and finishes in less time than the sequential version

SLIDE 53

GFLOPs Performance

[Charts: achieved GFLOPs vs. the theoretical peak on 2, 4 and 6 SPEs for MatMult (2048x2048) and Cholesky (2048x2048).]

• MatMult achieves on average 88% of the theoretical peak
• Cholesky achieves a speedup of 5 on 6 SPEs

SLIDE 54

Comparison with Sequoia [Stanford]

[Charts: GFLOPs on 2, 4 and 6 SPEs, DDM-VMc vs. Sequoia, for MatMult (512, 1024, 2048) and Conv2D (512, 1024, 2048).]

• MatMult: an improvement of 10%, 16% and 11% for the 512, 1024 and 2048 sizes
• Conv2D (9x9 convolution filter): an improvement of 27%, 22% and 27% for the 512, 1024 and 2048 sizes

SLIDE 55

Comparison* with CellSs [Barcelona Supercomputing Center]

[Charts: GFLOPs on 2, 4 and 6 SPEs, DDM-VMc vs. CellSs, for MatMult (512, 1024, 2048) and Cholesky (512, 1024, 2048).]

• MatMult: an improvement of 80%, 28% and 19% for the 512, 1024 and 2048 sizes
• Cholesky: an improvement of 213%, 99% and 23% for the 512, 1024 and 2048 sizes

*Details in our SAMOS 2010 paper

SLIDE 56

Outline

• Motivation
• Data Driven Multithreading
• The DDM-VMc
• Programming Toolchain
• Evaluation
• Conclusion

SLIDE 57

Conclusion

• Models like DDM, which combine dataflow concurrency with efficient sequential execution, are promising candidates for the programming model of multicore systems
• Our results demonstrate that data-flow concurrency can be efficiently implemented as a virtual machine on commercial multicore systems

SLIDE 58

Questions!