Theory of Multicore Algorithms
Jeremy Kepner and Nadya Bliss
MIT Lincoln Laboratory
HPEC 2008

This work is sponsored by the Department of Defense under Air Force Contract FA8721-05-C-0002. Opinions, interpretations, conclusions, and recommendations are those of the author and are not necessarily endorsed by the United States Government.
Outline
• Parallel Design
  – Programming Challenge
  – Example Issues
  – Theoretical Approach
  – Integrated Process
• Distributed Arrays
• Kuck Diagrams
• Hierarchical Arrays
• Tasks and Conduits
• Summary
Multicore Programming Challenge

Past programming model: Von Neumann
• Great success of the Moore's Law era
• Simple model: load, op, store
• Many transistors devoted to delivering this model
• Moore's Law is ending: transistors are needed for performance

Future programming model: ???
• Processor topology includes registers, cache, local memory, remote memory, disk
• Cell has multiple programming models

Can we describe how an algorithm is supposed to behave on a hierarchical heterogeneous multicore processor?
Example Issues

X, Y : N x N
Y = X + 1

• Where is the data? How is it distributed? What is the initialization policy?
• Where is it running? Which binary to run?
• How does the data flow? What are the allowed message sizes?
• Overlap computations and communications?

• A serial algorithm can run on a serial processor with relatively little specification
• A hierarchical heterogeneous multicore algorithm requires a lot more information
Theoretical Approach

[Diagram: Task1 S1() operating on A : R^(N x P(N)), mapped onto processors P0-P3 with memories M0, connected through a conduit (Topic12) to Task2 S2() operating on B : R^(N x P(N)) on processors P4-P7; Task2 is shown with Replica 0 and Replica 1]

• Provide notation and diagrams that allow hierarchical heterogeneous multicore algorithms to be specified
Integrated Development Process

1. Develop serial code (desktop):           X, Y : N x N;        Y = X + 1
2. Parallelize code (cluster):              X, Y : P(N) x N;     Y = X + 1
3. Deploy code (embedded computer):         X, Y : P(P(N)) x N;  Y = X + 1
4. Automatically parallelize code

• Should naturally support standard parallel embedded software development practices
Outline
• Parallel Design
• Distributed Arrays
  – Serial Program
  – Parallel Execution
  – Distributed Arrays
  – Redistribution
• Kuck Diagrams
• Hierarchical Arrays
• Tasks and Conduits
• Summary
Serial Program

X, Y : N x N
Y = X + 1

• Math is the language of algorithms
• Allows mathematical expressions to be written concisely
• Multi-dimensional arrays are fundamental to mathematics
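The serial program above can be written almost verbatim in an array language. A minimal sketch in NumPy (an illustrative choice; the paper uses mathematical notation, not any particular language):

```python
import numpy as np

N = 4
X = np.zeros((N, N))   # X, Y : N x N
Y = X + 1              # Y = X + 1, applied elementwise
```

The elementwise "+ 1" over a whole array is exactly the kind of concise mathematical expression the slide refers to.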
Parallel Execution

PID = 0, 1, ..., NP-1
X, Y : N x N
Y = X + 1

• Run NP copies of the same program
  – Single Program Multiple Data (SPMD)
• Each copy has a unique PID
• Every array is replicated on each copy of the program
Distributed Array Program

PID = 0, 1, ..., NP-1
X, Y : P(N) x N
Y = X + 1

• Use P() notation to make a distributed array
• Tells the program which dimension to distribute the data over
• Each program implicitly operates on only its own data (owner computes rule)
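The owner computes rule for P(N) x N can be sketched by deriving each PID's row range from a block map and having each SPMD copy update only those rows. A single-process simulation (the function name `local_rows` is illustrative, not from the paper):

```python
import numpy as np

N, NP = 8, 4          # array size and number of SPMD copies

def local_rows(pid, n, n_p):
    """Block distribution of n rows over n_p processors: P(N) x N."""
    return pid * n // n_p, (pid + 1) * n // n_p

# Simulate the NP SPMD copies in one process: each PID owns a row
# block and applies Y = X + 1 only to its own rows (owner computes).
X = np.zeros((N, N))
Y = np.empty_like(X)
for pid in range(NP):
    lo, hi = local_rows(pid, N, NP)
    Y[lo:hi, :] = X[lo:hi, :] + 1
```

In a real SPMD run the loop body would execute once per process, selected by that process's own PID.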
Explicitly Local Program

X, Y : P(N) x N
Y.loc = X.loc + 1

• Use .loc notation to explicitly retrieve the local part of a distributed array
• Operation is the same as the serial program, but with different data on each processor (recommended approach)
Parallel Data Maps

Math → Computer (PID: 0 1 2 3):
  P(N) x N    — rows split among processors
  N x P(N)    — columns split among processors
  P(N) x P(N) — both dimensions split among processors

• A map is a mapping of array indices to processors
• Can be block, cyclic, block-cyclic, or block w/overlap
• Use P() notation to set which dimension to split among processors
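The block and cyclic maps named above are simple index-to-processor functions. A sketch (function names are illustrative):

```python
def block_owner(i, n, n_p):
    """Block map: index i of n elements -> owning processor.
    Contiguous runs of indices go to the same processor."""
    return i * n_p // n

def cyclic_owner(i, n_p):
    """Cyclic map: index i -> owning processor.
    Consecutive indices are dealt out round-robin."""
    return i % n_p
```

A block-cyclic map composes the two: deal out fixed-size blocks round-robin; block-with-overlap additionally replicates boundary elements on neighboring processors.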
Redistribution of Data

X : P(N) x N
Y : N x P(N)
Y = X + 1

[Diagram: data sent among processors P0-P3 as the row-distributed X is assigned to the column-distributed Y]

• Different distributed arrays can have different maps
• Assignment between arrays with the "=" operator causes data to be redistributed
• The underlying library determines all the messages to send
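The messages the library must derive can be sketched explicitly: for a row-block source map and a column-block destination map, each (sender, receiver) pair exchanges the tile where their owned index ranges intersect. A single-process simulation (helper names are illustrative; a real library hides all of this behind `Y = X + 1`):

```python
import numpy as np

N, NP = 4, 2

def row_block(pid):   # rows owned under X : P(N) x N
    return slice(pid * N // NP, (pid + 1) * N // NP)

def col_block(pid):   # columns owned under Y : N x P(N)
    return slice(pid * N // NP, (pid + 1) * N // NP)

X = np.arange(N * N, dtype=float).reshape(N, N)
Y = np.empty((N, N))

# One "message" per (sender src, receiver dst) pair: the tile where
# src's rows intersect dst's columns, with the +1 applied.
for src in range(NP):
    for dst in range(NP):
        Y[row_block(src), col_block(dst)] = X[row_block(src), col_block(dst)] + 1
```

With NP processors this generalizes to up to NP^2 messages, which is why leaving message generation to the library is valuable.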
Outline
• Parallel Design
• Distributed Arrays
• Kuck Diagrams
  – Serial
  – Parallel
  – Hierarchical
  – Cell
• Hierarchical Arrays
• Tasks and Conduits
• Summary
Single Processor Kuck Diagram

A : R^(N x N)

[Diagram: processor P0 connected to memory M0]

• Processors denoted by boxes
• Memory denoted by ovals
• Lines connect associated processors and memories
• Subscript denotes level in the memory hierarchy
Parallel Kuck Diagram

A : R^(N x P(N))

[Diagram: four P0 processors, each with its own M0 memory, connected by Net0.5]

• Replicates serial processors
• Net denotes the network connecting memories at a level in the hierarchy (incremented by 0.5)
• Distributed array has a local piece in each memory
Hierarchical Kuck Diagram (2-Level Hierarchy)

The Kuck notation provides a clear way of describing a hardware architecture along with the memory and communication hierarchy.

[Diagram: SM2 connected via SMNet2 to two SM1 shared memories; each SM1 connected via its SMNet1 to a pair of M0/P0 nodes; Net0.5 within each pair, Net1.5 between pairs]

Legend:
• P - processor
• Net - inter-processor network
• M - memory
• SM - shared memory
• SMNet - shared memory network

Subscript indicates hierarchy level; an x.5 subscript for Net indicates indirect memory access.

*High Performance Computing: Challenges for Future Systems, David Kuck, 1996
Cell Example

Kuck diagram for the Sony/Toshiba/IBM Cell processor

[Diagram: the PPE (P0 with M0) and eight SPEs (P0 with M0, numbered 0-7) connected by Net0.5, joined through MNet1 to main memory M1]

• P_PPE = PPE speed (GFLOPS)
• M0,PPE = size of PPE cache (bytes)
• P_PPE-M0,PPE = PPE to cache bandwidth (GB/sec)
• P_SPE = SPE speed (GFLOPS)
• M0,SPE = size of SPE local store (bytes)
• P_SPE-M0,SPE = SPE to local store memory bandwidth (GB/sec)
• Net0.5 = SPE to SPE bandwidth (matrix encoding topology, GB/sec)
• MNet1 = PPE,SPE to main memory bandwidth (GB/sec)
• M1 = size of main memory (bytes)
Outline
• Parallel Design
• Distributed Arrays
• Kuck Diagrams
• Hierarchical Arrays
  – Hierarchical Maps
  – Kuck Diagram
  – Explicitly Local Program
• Tasks and Conduits
• Summary
Hierarchical Arrays

[Diagram: a global array split among top-level PIDs 0 and 1; each top-level piece further split among local PIDs 0 and 1 into local arrays]

• Hierarchical arrays allow algorithms to conform to hierarchical multicore processors
• Each processor in P controls another set of processors p
• The array local to P is sub-divided among the local p processors
Hierarchical Array and Kuck Diagram

A : R^(N x P(P(N)))

[Diagram: A.loc resides in each SM1 (reached via SMNet1); A.loc.loc resides in each M0 beside its P0; Net0.5 within groups, Net1.5 between groups]

• Array is allocated across the SM1 memories of the P processors
• Within each SM1, responsibility for processing is divided among the local p processors
• The p processors will move their portion to their local M0
Explicitly Local Hierarchical Program

X, Y : P(P(N)) x N
Y.loc_P.loc_p = X.loc_P.loc_p + 1

• Extend the .loc notation to explicitly retrieve the local part of a local distributed array: .loc.loc (assumes SPMD on p)
• Subscripts P and p provide explicit access to a particular local part (implicit otherwise)
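The two-level map P(P(N)) x N can be sketched by splitting rows among outer processors and splitting each outer piece again among inner processors. A single-process simulation (the names `outer_rows` and `inner_rows`, and the even-divisibility of N, are illustrative assumptions):

```python
import numpy as np

N, NP, Np = 8, 2, 2   # rows; outer processors P; inner processors p

def outer_rows(P):
    """Rows owned at the top level: the .loc_P piece."""
    return slice(P * N // NP, (P + 1) * N // NP)

def inner_rows(P, p):
    """Rows owned at the inner level: the .loc_P.loc_p piece."""
    n_loc = N // NP          # rows per outer piece
    base = P * n_loc
    return slice(base + p * n_loc // Np, base + (p + 1) * n_loc // Np)

X = np.zeros((N, N))
Y = np.empty_like(X)
# SPMD over both levels: each (P, p) pair computes only its own piece,
# i.e. Y.loc_P.loc_p = X.loc_P.loc_p + 1
for P in range(NP):
    for p in range(Np):
        r = inner_rows(P, p)
        Y[r, :] = X[r, :] + 1
```

Each inner slice sits inside its outer slice, mirroring how .loc.loc selects within .loc.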
Block Hierarchical Arrays

[Diagram: a global array split among top-level PIDs 0 and 1 into local arrays; each local array further divided among core PIDs 0-3 into core blocks of size b=4, cycled between in-core and out-of-core]

• Memory constraints are common at the lowest level of the hierarchy
• Blocking at this level allows control of the size of data operated on by each p
Block Hierarchical Program

X, Y : P(P_b(4)(N)) x N

for i = 0 to X.loc.loc.n_blk - 1
  Y.loc.loc.blk_i = X.loc.loc.blk_i + 1

• P_b(4) indicates each sub-array should be broken up into blocks of size 4
• .n_blk provides the number of blocks; looping over each block allows controlling the size of data at the lowest level
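The blocked loop above can be sketched for one processor's local piece: with block size b = 4, only one block need be resident (in-core) at a time. Variable names are illustrative:

```python
import numpy as np

b = 4                              # block size, as in P_b(4)
X_loc = np.zeros((8, 8))           # one processor's .loc.loc piece
Y_loc = np.empty_like(X_loc)

n_blk = X_loc.shape[0] // b        # .n_blk: number of row blocks
for i in range(n_blk):             # for i = 0 .. n_blk - 1
    blk = slice(i * b, (i + 1) * b)
    Y_loc[blk, :] = X_loc[blk, :] + 1   # Y.loc.loc.blk_i = X.loc.loc.blk_i + 1
```

An out-of-core variant would load X's block i from disk before the update and write Y's block i back afterward, keeping only b rows in memory.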
Outline
• Parallel Design
• Distributed Arrays
• Kuck Diagrams
• Hierarchical Arrays
• Tasks and Conduits
  – Basic Pipeline
  – Replicated Tasks
  – Replicated Pipelines
• Summary