X10: a High-Productivity Approach to High Performance Programming
Rajkishore Barik, Christopher Donawa, Matteo Frigo, Allan …
High-Productivity, High-Performance Programming with X10
Motivation: Productivity Challenges caused by Future Hardware Trends
Challenge: Develop new language, compiler and tools technologies to support productive portable parallel abstractions for future hardware
[Figure: three hardware trends. Heterogeneous Accelerators: a Cell BE block diagram showing a 64-bit Power Architecture PPE with VMX (PXU, L1, PPU, L2) and eight SPEs (each with LS, SXU, SPU, SMF) on an EIB (up to 96B/cycle), with MIC/Dual XDR memory and BIC/FlexIO I/O at 16B/cycle links. Homogeneous Multi-core: SMP nodes of PEs with L1 caches sharing an L2 cache and memory. Clusters: SMP nodes connected by an interconnect, presenting a Global Address Space.]
X10 Programming Model
- Dynamic parallelism with a Partitioned Global Address Space
- Places encapsulate binding of activities and globally addressable data
- All concurrency is expressed as asynchronous activities – subsumes threads, structured parallelism, messaging, DMA transfers (beyond SPMD)
- Atomic sections enforce mutual exclusion of co-located data
- No place-remote accesses permitted in atomic section
- Immutable data offers opportunity for single-assignment parallelism
Storage classes:
- Activity-local
- Place-local
- Partitioned global
- Immutable
Deadlock safety: any X10 program written with async, atomic, finish, foreach, ateach, and clocks can never deadlock
X10 Deployment
- X10 data structures → X10 places: the X10 language defines the mapping from X10 objects & activities to X10 places
- X10 places → physical PEs: the X10 deployment defines the mapping from virtual X10 places to physical processing elements
[Figure: X10 places mapped onto the physical PEs of each deployment target: homogeneous multi-core (SMP nodes of PEs with L1/L2 caches and memory), clusters (SMP nodes over an interconnect), and heterogeneous accelerators (the Cell BE diagram from the motivation slide).]
Current Status: Multi-core SMP Implementation for X10
X10 Front End:
- X10 source → X10 Parser (driven by the X10 Grammar) → AST
- Analysis passes (DOMO Static Analyzer) → Annotated AST
- Java code emitter (using Code Generation Templates) → Target Java → Java compiler
- Output: X10 classfiles (Java classfiles with special annotations for X10 analysis info)

X10 Runtime:
- Each place (Place 0, Place 1, …) schedules its own Ready, Executing, Blocked, and Completed activities, with Clock and Future support, and exchanges inbound/outbound activities and replies with other places
- Atomic sections do not have blocking semantics
- An activity can only access its stack, place-local mutable data, or global immutable data
- Built on the Java Concurrency Utilities (JCU) thread pool; X10 libraries share common components with the SAFARI STM library
- Extern interface to Fortran and C/C++ DLLs

Java Runtime:
- High Performance JRE (IBM J9 VM + Testarossa JIT Compiler modified for X10, on PPC/AIX)
- Portable Standard Java 5 Runtime Environment (runs on multiple platforms)
System Configuration used for Performance Results
- Hardware
  − STREAM (C/OpenMP & X10), RandomAccess (C/OpenMP & X10), FFT (X10): 64-core POWER5+, p595+, 2.3 GHz, 512 GB (r28n01.pbm.ihost.com)
  − FFT (Cilk version): 16-core POWER5+, p570, 1.9 GHz
  − All runs performed with page size = 4KB and SMT turned off
- Operating System
  − AIX v5.3
- Compiler
  − xlc v7.0.0.5 w/ -O3 option (plus -qsmp=omp for OpenMP compilation)
- X10
  − Dynamic compilation options: -J-Xjit:count=0,optLevel=veryHot
  − X10 activities use serial libraries written in C and linked with the X10 runtime
  − Data size limitation: the current X10 runtime is limited to a max heap size of 2GB
- All results reported are for runs that passed validation
  − Caveat: these results should not be treated as official benchmark measurements of the above systems
STREAM
OpenMP / C version
#pragma omp parallel for
for (j = 0; j < N; j++) {
    b[j] = scalar * c[j];
}
Hybrid X10 + Serial C version
finish ateach (point p : dist.factory.unique()) {
    final region myR = (D | here).region;
    scale(b, scalar, c, myR.rank(0).low(), myR.rank(0).high() + 1);
}
- scale( ) is a sequential C function
- Multi-place version designed to run unchanged on an SMP or a cluster
- Restrict operator simplifies computation of the local region
- Implicitly assumes a Uniform Memory Access model (no distributed arrays)
- Traversing the array region can be error-prone
- SLOC counts are comparable
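The body of scale( ) is not shown on the slide; since the X10 version delegates the Scale kernel to serial C, a minimal sketch of what it might look like (the signature and names are assumed, not taken from the talk's actual code) is:

```c
#include <assert.h>

/* Hypothetical sketch of the sequential STREAM Scale kernel the X10
 * version calls: computes b[j] = scalar * c[j] for j in [lo, hi). */
static void scale(double *b, double scalar, const double *c, int lo, int hi) {
    for (int j = lo; j < hi; j++) {
        b[j] = scalar * c[j];
    }
}
```

Each X10 place would invoke this over its own sub-range of the distributed arrays, so the parallel decomposition lives entirely in the X10 harness.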
Performance Results for STREAM
[Chart: STREAM bandwidth (GB/s) by #threads / places, C vs. X10. The four series are listed below in legend order (Triad, Add, Scale, Copy); the series-to-row assignment follows extraction order and may differ from the original chart.]

         C-1   X10-1  C-2   X10-2  C-4   X10-4  C-8   X10-8  C-16  X10-16  C-32  X10-32  C-64   X10-64
Triad    3.6   3.7    7.1   7.1    14.5  13.7   27.4  26.9   49.6  49.3    87.1  86.3    100.0  98.2
Add      4.3   4.5    8.6   8.8    17.2  15.9   31.4  31.8   53.6  53.8    96.5  98.6    108.6  115.6
Scale    2.8   2.9    5.6   5.6    11.2  11.1   21.9  21.9   42.1  41.7    71.8  68.7    77.2   78.4
Copy     3.8   4.3    7.6   8.2    15.4  14.6   28.5  28.0   49.1  48.3    79.6  72.2    79.1   77.7
Array size = 2^26 elements; combined memory for the 3 arrays = 1.5GB (3 × 2^26 × 8B)
RandomAccess

OpenMP / C version

#define NUPDATE (4 * TableSize)
for (i = 0; i < NUPDATE/128; i++) {
    #pragma omp parallel for
    for (j = 0; j < 128; j++) {
        ran[j] = (ran[j] << 1) ^ ((s64Int) ran[j] < 0 ? POLY : 0);
        Table[ran[j] & (TableSize-1)] ^= ran[j];
    }
}

Hybrid X10 + Serial C version

finish ateach (point p : dist.factory.unique()) {
    final region myR = (D | here).region;
    for (int i = 0; i < (4 * TableSize)/W; i++) {
        innerLoop(Table, TableSize, ran, myR.rank(0).low(), myR.rank(0).high() + 1);
    }
}
- innerLoop() is a sequential C function
- Multi-place version designed to run unchanged on an SMP or a cluster
- Restrict operator simplifies computation of the local region
- Inner parallel loop is a source of inefficiency in the OpenMP version
- SLOC counts are comparable
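The body of innerLoop() is not shown; based on the update rule visible in the OpenMP version, a sketch of what it could look like (the signature is assumed, and POLY is taken from the HPC Challenge reference value, not from this talk) is:

```c
#include <stdint.h>
#include <assert.h>

#define POLY 0x0000000000000007ULL  /* HPCC RandomAccess polynomial */

typedef uint64_t u64Int;
typedef int64_t  s64Int;

/* Hypothetical sketch of the sequential inner loop the X10 version calls:
 * one round of updates for the random-number streams in [lo, hi), using
 * the same shift-and-XOR recurrence as the OpenMP body above. */
static void innerLoop(u64Int *Table, u64Int TableSize,
                      u64Int *ran, int lo, int hi) {
    for (int j = lo; j < hi; j++) {
        ran[j] = (ran[j] << 1) ^ ((s64Int) ran[j] < 0 ? POLY : 0);
        Table[ran[j] & (TableSize - 1)] ^= ran[j];
    }
}
```

Hoisting the whole round into one sequential call is what removes the per-iteration fork/join cost that the slide flags as the inefficiency of the inner OpenMP parallel loop.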
Performance Results for RandomAccess
Array size = 1.8GB
[Chart: RandomAccess GUPS by #threads / places.]

#threads/places   1         2         4         8         16        32        64
OpenMP/C          3.7E-03   6.7E-03   1.1E-02   1.5E-02   1.6E-02   1.3E-02   7.5E-03
Hybrid X10        4.7E-03   8.1E-03   1.6E-02   3.0E-02   4.2E-02   5.3E-02   2.5E-02
FFT: Transpose example

Cilk / C version (Recursive version)

#define SUB(A, i, j) (A)[(i)*SQRTN+(j)]

cilk void transpose(fftw_complex *A, int n)
{
    if (n > 1) {
        int n2 = n / 2;
        spawn transpose(A, n2);
        spawn transpose(&SUB(A, n2, n2), n - n2);
        spawn transpose_and_swap(A, 0, n2, n2, n);
    } else {
        /* 1x1 transpose is a NOP */
    }
}

Hybrid X10 + Serial C version (Non-recursive version)

int nBlocks = SQRTN / bSize;
int p = 0;
finish for (int r = 0; r < nBlocks; ++r) {
    for (int c = r; c < nBlocks; ++c) {  // Triangular loop
        final int topLefta_r = (bSize * r);
        final int topLefta_c = (bSize * c);
        final int topLeftb_r = (bSize * c);
        final int topLeftb_c = (bSize * r);
        async (place.factory.place(p++))
            transpose_and_swap(A, topLefta_r, topLefta_c,
                               topLeftb_r, topLeftb_c, bSize);
    }
}
- transpose_and_swap( ) is a sequential C function
- The "finish" operator is used to wait for termination of all sub-activities (async's)
- Implicit sync at function boundary
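The body of transpose_and_swap( ) is not shown. For the blocked X10 version, a plain-C sketch of what it might do is below; the element type (double rather than fftw_complex), the explicit N parameter, and the row-major layout are all assumptions for illustration, not the talk's actual code:

```c
#include <assert.h>

/* Hypothetical sketch: exchange the bSize x bSize block whose top-left
 * corner is (ra, ca) with the transpose of the block at (rb, cb), in an
 * N x N row-major matrix A. The triangular X10 loop also visits diagonal
 * blocks (r == c), which must be transposed in place rather than swapped. */
static void transpose_and_swap(double *A, int N,
                               int ra, int ca, int rb, int cb, int bSize) {
    if (ra == rb && ca == cb) {
        /* Diagonal block: in-place transpose of the block. */
        for (int i = 0; i < bSize; i++)
            for (int j = i + 1; j < bSize; j++) {
                double t = A[(ra + i) * N + ca + j];
                A[(ra + i) * N + ca + j] = A[(ra + j) * N + ca + i];
                A[(ra + j) * N + ca + i] = t;
            }
    } else {
        /* Off-diagonal pair: element (i,j) of block a swaps with (j,i) of b. */
        for (int i = 0; i < bSize; i++)
            for (int j = 0; j < bSize; j++) {
                double t = A[(ra + i) * N + ca + j];
                A[(ra + i) * N + ca + j] = A[(rb + j) * N + cb + i];
                A[(rb + j) * N + cb + i] = t;
            }
    }
}
```

Visiting each off-diagonal block pair once (the triangular loop) is what lets every block be handled by a single independent async with no coordination between activities.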
Performance Results for FFT (w/ memoized sine/cosine twiddle factors)
N = 2^24 (SQRTN = 2^12)
[Chart: FFT execution time in seconds (y-axis 0.5 to 3) vs. # of threads / places (1, 2, 4, 8, 16), comparing Cilk (1.9 GHz) and X10 (2.3 GHz).]
Summary
- The X10 programming model provides core concurrency and distribution constructs for a new era of parallel processing
- Results show competitive performance for Hybrid X10+C relative to OpenMP/C and Cilk
- Past studies have shown other productivity benefits of X10
- To find out more, come to the X10 exhibit in the Exotic Technologies area!
BACKUP SLIDES START HERE
X10 context: PERCS Programming Model, Tools and Compilers (PERCS = Productive Easy-to-use Reliable Computer System)
[Diagram: the PERCS tool chain, built on the Eclipse platform and the Parallel Tools Platform (PTP). Source languages: X10, Fortran (w/ MPI, OpenMP), C/C++ (w/ MPI, OpenMP, UPC), and Java (w/ threads & concurrency utilities), each with its own compiler (X10 Compiler, Fortran Compiler, C/C++ Compiler w/ UPC extensions, Java Compiler), components, and runtime, over an Integrated Parallel Runtime (MPI + LAPI + RDMA + OpenMP + threads) with a fast extern interface and Dynamic Compilation + Continuous Program Optimization. Development toolkits: X10, Java, Fortran, and C/C++ (+ MPI & OpenMP extensions), plus Performance Explorer and Productivity Measurements. Text in blue identifies PERCS contributions.]
Related tooling: Refactoring for Concurrency, Rational PurifyPlus, Remote System Explorer, Rational Team Platform, and HPC Toolkit + pSigma + Performance Tuning Automation.
X10 Eclipse Development Toolkit
X10 Eclipse Debugging Toolkit
X10 Language
- async [(Place)] [clocked(c…)] Stm
− Run Stm asynchronously at Place
- finish Stm
− Execute Stm, wait for all asyncs to terminate (generalizes join)
- foreach ( point P : Reg) Stm
− Run Stm asynchronously for each point in region
- ateach ( point P : Dist) Stm
− Run Stm asynchronously for each point in dist, in its place.
- atomic Stm
− Execute Stm atomically
- new T
− Allocate object at this place (here)
- new T[d] / new T value [d]
− Array of base type T and distribution d
- Region
− Collection of index points, e.g. region r = [1:N,1:M];
- Distribution
− Mapping from region to places, e.g. dist d = block(r);
- next
− Suspend till all clocks that the current activity is registered with can advance
− Clocks are a generalization of barriers and MPI communicators
- future [(Place)] [clocked(c…)] Expr
− Compute Expr asynchronously at Place
- F. force()
− Block until future F has been computed
- extern
− Lightweight interface to native code

Deadlock safety: any X10 program written with the above constructs (excluding future) can never deadlock
- Can be extended to restricted cases of using future
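As a rough analogue of the central finish/async pair (illustrative C with POSIX threads, not X10 itself, and not how the X10 runtime is implemented — X10 activities are far lighter-weight than OS threads): each async becomes a concurrent task, and finish blocks until every one of them has terminated, generalizing a thread join.

```c
#include <pthread.h>
#include <assert.h>

/* Rough pthreads analogue of:  finish { async S1; async S2; }
 * Illustrates only the termination semantics of finish, not X10's
 * lightweight-activity implementation. All names here are hypothetical. */

static int results[2];

static void *activity(void *arg) {
    int id = *(int *)arg;
    results[id] = id + 1;   /* stand-in for the async's statement Stm */
    return NULL;
}

static void finish_two_asyncs(void) {
    pthread_t t[2];
    int ids[2] = {0, 1};
    for (int i = 0; i < 2; i++)           /* the two "async Stm"s */
        pthread_create(&t[i], NULL, activity, &ids[i]);
    for (int i = 0; i < 2; i++)           /* "finish": wait for all asyncs */
        pthread_join(t[i], NULL);
}
```

Unlike a plain join, X10's finish is transitive: it also waits for asyncs spawned by the asyncs, at any place.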
X10 Arrays, Regions, Distributions
ArrayExpr:
  new ArrayType ( Formal ) { Stm }
  Distribution Expr                     -- Lifting
  ArrayExpr [ Region ]                  -- Section
  ArrayExpr | Distribution              -- Restriction
  ArrayExpr || ArrayExpr                -- Union
  ArrayExpr.overlay(ArrayExpr)          -- Update
  ArrayExpr.scan( [fun [, ArgList]] )
  ArrayExpr.reduce( [fun [, ArgList]] )
  ArrayExpr.lift( [fun [, ArgList]] )

ArrayType:
  Type [Kind] [ ]
  Type [Kind] [ region(N) ]
  Type [Kind] [ Region ]
  Type [Kind] [ Distribution ]

Region:
  Expr : Expr                           -- 1-D region
  [ Range, …, Range ]                   -- Multidimensional region
  Region && Region                      -- Intersection
  Region || Region                      -- Union
  Region - Region                       -- Set difference
  BuiltinRegion

Dist:
  Region -> Place                       -- Constant distribution
  Distribution | Place                  -- Restriction
  Distribution | Region                 -- Restriction
  Distribution || Distribution          -- Union
  Distribution - Distribution           -- Set difference
  Distribution.overlay ( Distribution )
  BuiltinDistribution
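For intuition about what a built-in distribution such as dist d = block(r) computes: a 1-D block distribution maps each index of a region onto one of p places in contiguous, near-equal blocks. A hedged C sketch of that mapping (an illustration of the concept, not the X10 runtime's actual algorithm):

```c
#include <assert.h>

/* Illustrative sketch: the place owning index i of a 1-D region [0, n)
 * under a block distribution over p places. Indices are split into
 * contiguous blocks of roughly n/p elements each. */
static int block_place(long i, long n, int p) {
    return (int)((i * p) / n);
}
```

A cyclic distribution would instead compute i % p; the Region/Dist algebra above (restriction, union, difference) then composes such mappings without changing the underlying arrays.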