X10: a High-Productivity Approach to High Performance Programming
Rajkishore Barik, Christopher Donawa, Matteo Frigo, Allan …
High-Productivity, High-Performance Programming with X10
Motivation: Productivity Challenges caused by Future Hardware Trends
Challenge: Develop new language, compiler and tools technologies to support productive portable parallel abstractions for future hardware
[Figure: three hardware trends. Heterogeneous Accelerators: a Cell BE block diagram showing a 64-bit Power Architecture PPE with VMX (PXU, L1, PPU, L2) and eight SPEs (each with LS, SXU, SPU, SMF) on an EIB (up to 96B/cycle), with MIC/Dual XDR memory and BIC/FlexIO I/O at 16B/cycle links. Homogeneous Multi-core: SMP nodes of PEs with L1 caches sharing an L2 cache and memory. Clusters: SMP nodes connected by an interconnect, presenting a Global Address Space.]
X10 Programming Model
- Dynamic parallelism with a Partitioned Global Address Space
- Places encapsulate binding of activities and globally addressable data
- All concurrency is expressed as asynchronous activities – subsumes threads, structured parallelism, messaging, DMA transfers (beyond SPMD)
- Atomic sections enforce mutual exclusion of co-located data
- No place-remote accesses permitted in atomic section
- Immutable data offers opportunity for single-assignment parallelism
Storage classes:
- Activity-local
- Place-local
- Partitioned global
- Immutable
Deadlock safety: any X10 program written with async, atomic, finish, foreach, ateach, and clocks can never deadlock
X10 Deployment
- X10 data structures → X10 places: the X10 language defines the mapping from X10 objects & activities to X10 places
- X10 places → physical PEs: the X10 deployment defines the mapping from virtual X10 places to physical processing elements
[Figure: X10 places mapped onto the physical PEs of each deployment target: homogeneous multi-core (SMP nodes of PEs with L1/L2 caches and memory), clusters (SMP nodes over an interconnect), and heterogeneous accelerators (the Cell BE diagram from the motivation slide).]
Current Status: Multi-core SMP Implementation for X10
X10 Front End:
- X10 source → X10 Parser (driven by the X10 Grammar) → AST
- Analysis passes (DOMO Static Analyzer) → Annotated AST
- Java code emitter (using Code Generation Templates) → Target Java → Java compiler
- Output: X10 classfiles (Java classfiles with special annotations for X10 analysis info)

X10 Runtime:
- Each place (Place 0, Place 1, …) schedules its own Ready, Executing, Blocked, and Completed activities, with Clock and Future support, and exchanges inbound/outbound activities and replies with other places
- Atomic sections do not have blocking semantics
- An activity can only access its stack, place-local mutable data, or global immutable data
- Built on the Java Concurrency Utilities (JCU) thread pool; X10 libraries share common components with the SAFARI STM library
- Extern interface to Fortran and C/C++ DLLs

Java Runtime:
- High Performance JRE (IBM J9 VM + Testarossa JIT Compiler modified for X10, on PPC/AIX)
- Portable Standard Java 5 Runtime Environment (runs on multiple platforms)
System Configuration used for Performance Results
- Hardware
  − STREAM (C/OpenMP & X10), RandomAccess (C/OpenMP & X10), FFT (X10): 64-core POWER5+, p595+, 2.3 GHz, 512 GB (r28n01.pbm.ihost.com)
  − FFT (Cilk version): 16-core POWER5+, p570, 1.9 GHz
  − All runs performed with page size = 4KB and SMT turned off
- Operating System
  − AIX v5.3
- Compiler
  − xlc v7.0.0.5 w/ -O3 option (plus -qsmp=omp for OpenMP compilation)
- X10
  − Dynamic compilation options: -J-Xjit:count=0,optLevel=veryHot
  − X10 activities use serial libraries written in C and linked with the X10 runtime
  − Data size limitation: the current X10 runtime is limited to a max heap size of 2GB
- All results reported are for runs that passed validation
  − Caveat: these results should not be treated as official benchmark measurements of the above systems
STREAM
OpenMP / C version
#pragma omp parallel for
for (j = 0; j < N; j++) {
    b[j] = scalar * c[j];
}
Hybrid X10 + Serial C version
finish ateach (point p : dist.factory.unique()) {
    final region myR = (D | here).region;
    scale(b, scalar, c, myR.rank(0).low(), myR.rank(0).high() + 1);
}
- scale( ) is a sequential C function
- Multi-place version designed to run unchanged on an SMP or a cluster
- Restrict operator simplifies computation of the local region
- Implicitly assumes a Uniform Memory Access model (no distributed arrays)
- Traversing the array region can be error-prone
- SLOC counts are comparable
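The body of scale( ) is not shown on the slide; since the X10 version delegates the Scale kernel to serial C, a minimal sketch of what it might look like (the signature and names are assumed, not taken from the talk's actual code) is:

```c
#include <assert.h>

/* Hypothetical sketch of the sequential STREAM Scale kernel the X10
 * version calls: computes b[j] = scalar * c[j] for j in [lo, hi). */
static void scale(double *b, double scalar, const double *c, int lo, int hi) {
    for (int j = lo; j < hi; j++) {
        b[j] = scalar * c[j];
    }
}
```

Each X10 place would invoke this over its own sub-range of the distributed arrays, so the parallel decomposition lives entirely in the X10 harness.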
Performance Results for STREAM
[Chart: STREAM bandwidth (GB/s) by #threads / places, C vs. X10. The four series are listed below in legend order (Triad, Add, Scale, Copy); the series-to-row assignment follows extraction order and may differ from the original chart.]

         C-1   X10-1  C-2   X10-2  C-4   X10-4  C-8   X10-8  C-16  X10-16  C-32  X10-32  C-64   X10-64
Triad    3.6   3.7    7.1   7.1    14.5  13.7   27.4  26.9   49.6  49.3    87.1  86.3    100.0  98.2
Add      4.3   4.5    8.6   8.8    17.2  15.9   31.4  31.8   53.6  53.8    96.5  98.6    108.6  115.6
Scale    2.8   2.9    5.6   5.6    11.2  11.1   21.9  21.9   42.1  41.7    71.8  68.7    77.2   78.4
Copy     3.8   4.3    7.6   8.2    15.4  14.6   28.5  28.0   49.1  48.3    79.6  72.2    79.1   77.7
Array size = 2^26 elements; combined memory for the 3 arrays = 1.5GB (3 × 2^26 × 8B)
RandomAccess

OpenMP / C version

#define NUPDATE (4 * TableSize)
for (i = 0; i < NUPDATE/128; i++) {
    #pragma omp parallel for
    for (j = 0; j < 128; j++) {
        ran[j] = (ran[j] << 1) ^ ((s64Int) ran[j] < 0 ? POLY : 0);
        Table[ran[j] & (TableSize-1)] ^= ran[j];
    }
}

Hybrid X10 + Serial C version

finish ateach (point p : dist.factory.unique()) {
    final region myR = (D | here).region;
    for (int i = 0; i < (4 * TableSize)/W; i++) {
        innerLoop(Table, TableSize, ran, myR.rank(0).low(), myR.rank(0).high() + 1);
    }
}
- innerLoop() is a sequential C function
- Multi-place version designed to run unchanged on an SMP or a cluster
- Restrict operator simplifies computation of the local region
- Inner parallel loop is a source of inefficiency in the OpenMP version
- SLOC counts are comparable
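The body of innerLoop() is not shown; based on the update rule visible in the OpenMP version, a sketch of what it could look like (the signature is assumed, and POLY is taken from the HPC Challenge reference value, not from this talk) is:

```c
#include <stdint.h>
#include <assert.h>

#define POLY 0x0000000000000007ULL  /* HPCC RandomAccess polynomial */

typedef uint64_t u64Int;
typedef int64_t  s64Int;

/* Hypothetical sketch of the sequential inner loop the X10 version calls:
 * one round of updates for the random-number streams in [lo, hi), using
 * the same shift-and-XOR recurrence as the OpenMP body above. */
static void innerLoop(u64Int *Table, u64Int TableSize,
                      u64Int *ran, int lo, int hi) {
    for (int j = lo; j < hi; j++) {
        ran[j] = (ran[j] << 1) ^ ((s64Int) ran[j] < 0 ? POLY : 0);
        Table[ran[j] & (TableSize - 1)] ^= ran[j];
    }
}
```

Hoisting the whole round into one sequential call is what removes the per-iteration fork/join cost that the slide flags as the inefficiency of the inner OpenMP parallel loop.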
Performance Results for RandomAccess
Array size = 1.8GB
[Chart: RandomAccess GUPS by #threads / places.]

#threads/places   1         2         4         8         16        32        64
OpenMP/C          3.7E-03   6.7E-03   1.1E-02   1.5E-02   1.6E-02   1.3E-02   7.5E-03
Hybrid X10        4.7E-03   8.1E-03   1.6E-02   3.0E-02   4.2E-02   5.3E-02   2.5E-02
FFT: Transpose example

Cilk / C version (Recursive version)

#define SUB(A, i, j) (A)[(i)*SQRTN+(j)]

cilk void transpose(fftw_complex *A, int n)
{
    if (n > 1) {
        int n2 = n / 2;
        spawn transpose(A, n2);
        spawn transpose(&SUB(A, n2, n2), n - n2);
        spawn transpose_and_swap(A, 0, n2, n2, n);
    } else {
        /* 1x1 transpose is a NOP */
    }
}

Hybrid X10 + Serial C version (Non-recursive version)

int nBlocks = SQRTN / bSize;
int p = 0;
finish for (int r = 0; r < nBlocks; ++r) {
    for (int c = r; c < nBlocks; ++c) {  // Triangular loop
        final int topLefta_r = (bSize * r);
        final int topLefta_c = (bSize * c);
        final int topLeftb_r = (bSize * c);
        final int topLeftb_c = (bSize * r);
        async (place.factory.place(p++))
            transpose_and_swap(A, topLefta_r, topLefta_c,
                               topLeftb_r, topLeftb_c, bSize);
    }
}
- transpose_and_swap( ) is a sequential C function
- The "finish" operator is used to wait for termination of all sub-activities (async's)
- Implicit sync at function boundary
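The body of transpose_and_swap( ) is not shown. For the blocked X10 version, a plain-C sketch of what it might do is below; the element type (double rather than fftw_complex), the explicit N parameter, and the row-major layout are all assumptions for illustration, not the talk's actual code:

```c
#include <assert.h>

/* Hypothetical sketch: exchange the bSize x bSize block whose top-left
 * corner is (ra, ca) with the transpose of the block at (rb, cb), in an
 * N x N row-major matrix A. The triangular X10 loop also visits diagonal
 * blocks (r == c), which must be transposed in place rather than swapped. */
static void transpose_and_swap(double *A, int N,
                               int ra, int ca, int rb, int cb, int bSize) {
    if (ra == rb && ca == cb) {
        /* Diagonal block: in-place transpose of the block. */
        for (int i = 0; i < bSize; i++)
            for (int j = i + 1; j < bSize; j++) {
                double t = A[(ra + i) * N + ca + j];
                A[(ra + i) * N + ca + j] = A[(ra + j) * N + ca + i];
                A[(ra + j) * N + ca + i] = t;
            }
    } else {
        /* Off-diagonal pair: element (i,j) of block a swaps with (j,i) of b. */
        for (int i = 0; i < bSize; i++)
            for (int j = 0; j < bSize; j++) {
                double t = A[(ra + i) * N + ca + j];
                A[(ra + i) * N + ca + j] = A[(rb + j) * N + cb + i];
                A[(rb + j) * N + cb + i] = t;
            }
    }
}
```

Visiting each off-diagonal block pair once (the triangular loop) is what lets every block be handled by a single independent async with no coordination between activities.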
Performance Results for FFT (w/ memoized sine/cosine twiddle factors)
N = 2^24 (SQRTN = 2^12)
[Chart: FFT execution time in seconds (y-axis 0.5 to 3) vs. # of threads / places (1, 2, 4, 8, 16), comparing Cilk (1.9 GHz) and X10 (2.3 GHz).]
Summary
- The X10 programming model provides core concurrency and distribution constructs for a new era of parallel processing
- Results show competitive performance for Hybrid X10+C relative to OpenMP/C and Cilk
- Past studies have shown other productivity benefits of X10
- To find out more, come to the X10 exhibit in the Exotic Technologies area!
BACKUP SLIDES START HERE
X10 context: PERCS Programming Model, Tools and Compilers (PERCS = Productive Easy-to-use Reliable Computer System)
[Diagram: the PERCS tool chain, built on the Eclipse platform and the Parallel Tools Platform (PTP). Source languages: X10, Fortran (w/ MPI, OpenMP), C/C++ (w/ MPI, OpenMP, UPC), and Java (w/ threads & concurrency utilities), each with its own compiler (X10 Compiler, Fortran Compiler, C/C++ Compiler w/ UPC extensions, Java Compiler), components, and runtime, over an Integrated Parallel Runtime (MPI + LAPI + RDMA + OpenMP + threads) with a fast extern interface and Dynamic Compilation + Continuous Program Optimization. Development toolkits: X10, Java, Fortran, and C/C++ (+ MPI & OpenMP extensions), plus Performance Explorer and Productivity Measurements. Text in blue identifies PERCS contributions.]
Related tooling: Refactoring for Concurrency, Rational PurifyPlus, Remote System Explorer, Rational Team Platform, and HPC Toolkit + pSigma + Performance Tuning Automation.
X10 Eclipse Development Toolkit
X10 Eclipse Debugging Toolkit
X10 Language
- async [(Place)] [clocked(c…)] Stm
− Run Stm asynchronously at Place
- finish Stm
− Execute Stm, wait for all asyncs to terminate (generalizes join)
- foreach ( point P : Reg) Stm
− Run Stm asynchronously for each point in region
- ateach ( point P : Dist) Stm
− Run Stm asynchronously for each point in dist, in its place.
- atomic Stm
− Execute Stm atomically
- new T
− Allocate object at this place (here)
- new T[d] / new T value [d]
− Array of base type T and distribution d
- Region
− Collection of index points, e.g. region r = [1:N,1:M];
- Distribution
− Mapping from region to places, e.g. dist d = block(r);
- next
− Suspend till all clocks that the current activity is registered with can advance
− Clocks are a generalization of barriers and MPI communicators
- future [(Place)] [clocked(c…)] Expr
− Compute Expr asynchronously at Place
- F. force()
− Block until future F has been computed
- extern
− Lightweight interface to native code

Deadlock safety: any X10 program written with the above constructs (excluding future) can never deadlock
- Can be extended to restricted cases of using future
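As a rough analogue of the central finish/async pair (illustrative C with POSIX threads, not X10 itself, and not how the X10 runtime is implemented — X10 activities are far lighter-weight than OS threads): each async becomes a concurrent task, and finish blocks until every one of them has terminated, generalizing a thread join.

```c
#include <pthread.h>
#include <assert.h>

/* Rough pthreads analogue of:  finish { async S1; async S2; }
 * Illustrates only the termination semantics of finish, not X10's
 * lightweight-activity implementation. All names here are hypothetical. */

static int results[2];

static void *activity(void *arg) {
    int id = *(int *)arg;
    results[id] = id + 1;   /* stand-in for the async's statement Stm */
    return NULL;
}

static void finish_two_asyncs(void) {
    pthread_t t[2];
    int ids[2] = {0, 1};
    for (int i = 0; i < 2; i++)           /* the two "async Stm"s */
        pthread_create(&t[i], NULL, activity, &ids[i]);
    for (int i = 0; i < 2; i++)           /* "finish": wait for all asyncs */
        pthread_join(t[i], NULL);
}
```

Unlike a plain join, X10's finish is transitive: it also waits for asyncs spawned by the asyncs, at any place.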
X10 Arrays, Regions, Distributions
ArrayExpr:
  new ArrayType ( Formal ) { Stm }
  Distribution Expr                     -- Lifting
  ArrayExpr [ Region ]                  -- Section
  ArrayExpr | Distribution              -- Restriction
  ArrayExpr || ArrayExpr                -- Union
  ArrayExpr.overlay(ArrayExpr)          -- Update
  ArrayExpr.scan( [fun [, ArgList]] )
  ArrayExpr.reduce( [fun [, ArgList]] )
  ArrayExpr.lift( [fun [, ArgList]] )

ArrayType:
  Type [Kind] [ ]
  Type [Kind] [ region(N) ]
  Type [Kind] [ Region ]
  Type [Kind] [ Distribution ]

Region:
  Expr : Expr                           -- 1-D region
  [ Range, …, Range ]                   -- Multidimensional region
  Region && Region                      -- Intersection
  Region || Region                      -- Union
  Region - Region                       -- Set difference
  BuiltinRegion

Dist:
  Region -> Place                       -- Constant distribution
  Distribution | Place                  -- Restriction
  Distribution | Region                 -- Restriction
  Distribution || Distribution          -- Union
  Distribution - Distribution           -- Set difference
  Distribution.overlay ( Distribution )
  BuiltinDistribution
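For intuition about what a built-in distribution such as dist d = block(r) computes: a 1-D block distribution maps each index of a region onto one of p places in contiguous, near-equal blocks. A hedged C sketch of that mapping (an illustration of the concept, not the X10 runtime's actual algorithm):

```c
#include <assert.h>

/* Illustrative sketch: the place owning index i of a 1-D region [0, n)
 * under a block distribution over p places. Indices are split into
 * contiguous blocks of roughly n/p elements each. */
static int block_place(long i, long n, int p) {
    return (int)((i * p) / n);
}
```

A cyclic distribution would instead compute i % p; the Region/Dist algebra above (restriction, union, difference) then composes such mappings without changing the underlying arrays.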