

SLIDE 1

X10: a High-Productivity Approach to High Performance Programming

Rajkishore Barik, Christopher Donawa, Matteo Frigo, Allan Kielstra, Vivek Sarkar

HPC Challenge Class 2 Award Submission

This work has been supported in part by the Defense Advanced Research Projects Agency (DARPA) under contract No. NBCH30390004.

SLIDE 2

Motivation: Productivity Challenges caused by Future Hardware Trends

Challenge: Develop new language, compiler and tools technologies to support productive portable parallel abstractions for future hardware

[Figure: the three hardware trends behind the challenge. Heterogeneous Accelerators: Cell BE block diagram (64-bit Power Architecture PPE with VMX plus eight SPEs, each with LS/SXU/SPU/SMF, on an EIB with up to 96B/cycle bandwidth; MIC with dual XDR memory; BIC with FlexIO). Homogeneous Multi-core: PEs with per-PE L1 caches sharing L2 caches. Clusters with a Global Address Space: SMP nodes of PEs and memory joined by an interconnect.]

SLIDE 3

X10 Programming Model

  • Dynamic parallelism with a Partitioned Global Address Space
  • Places encapsulate binding of activities and globally addressable data
  • All concurrency is expressed as asynchronous activities – subsumes threads, structured parallelism, messaging, and DMA transfers (beyond SPMD)

  • Atomic sections enforce mutual exclusion on co-located data
  • No place-remote accesses permitted in atomic section
  • Immutable data offers opportunity for single-assignment parallelism

Storage classes:

  • Activity-local
  • Place-local
  • Partitioned global
  • Immutable

Deadlock safety: any X10 program written with async, atomic, finish, foreach, ateach, and clocks can never deadlock
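
As a concrete illustration, here is a minimal hedged sketch combining these constructs in the v0.4-era syntax used elsewhere in this submission (the array a, the bound N, and the dist.factory.block idiom are illustrative assumptions, not taken from the deck):

    // Block-distribute the index range [0:N-1] across all places (assumed factory idiom).
    final dist D = dist.factory.block([0:N-1]);
    // Partitioned global array: a[p] lives in the place that D maps p to.
    final int[.] a = new int[D];
    // ateach runs one asynchronous activity per point, in that point's place;
    // finish waits for all of them (and any sub-activities) to terminate.
    finish ateach (point p : D) {
        atomic a[p] += 1;   // mutual exclusion on co-located, place-local data
    }

Because this sketch uses only ateach, atomic, and finish, the deadlock-safety guarantee above applies to it.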

SLIDE 4

X10 Deployment

The X10 language defines the mapping from X10 objects and activities (X10 data structures) to X10 places; the X10 deployment defines the mapping from virtual X10 places to physical processing elements.

[Figure: X10 places mapped onto the three hardware targets from Slide 2: Homogeneous Multi-core (PEs with L1 caches sharing L2 caches), Clusters (SMP nodes of PEs and memory joined by an interconnect), and Heterogeneous Accelerators (Cell BE: PPE plus SPEs on an EIB, with MIC/dual-XDR memory and BIC/FlexIO I/O).]

SLIDE 5

Current Status: Multi-core SMP Implementation for X10

[Figure: architecture of the X10 multi-core SMP implementation, in three layers.]

  • X10 Front End: the X10 Parser (driven by the X10 grammar) turns X10 source into an AST; analysis passes and the DOMO static analyzer produce an annotated AST; a Java code emitter with code-generation templates produces target Java, which a Java compiler turns into X10 classfiles (Java classfiles with special annotations for X10 analysis info).
  • X10 Runtime: each place (Place 0, Place 1, ...) tracks its ready, executing, blocked (on clocks and futures), and completed activities, along with inbound/outbound activities and replies exchanged with other places. The runtime is built on a Java Concurrency Utilities (JCU) thread pool, provides the X10 libraries and common components with the SAFARI STM library, and offers an extern interface to Fortran and C/C++ DLLs. Atomic sections do not have blocking semantics, and an activity can only access its stack, place-local mutable data, or global immutable data.
  • Java Runtime: either a high-performance JRE (IBM J9 VM + Testarossa JIT Compiler, modified for X10, on PPC/AIX) or a portable standard Java 5 Runtime Environment (runs on multiple platforms).
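
To make the runtime's atomic-section rules concrete, a minimal hedged sketch (localCounts and bin are hypothetical names):

    // Allowed: the atomic section touches only place-local mutable data
    // and contains no blocking operation.
    atomic localCounts[bin] += 1;

    // Not allowed inside atomic: place-remote accesses, or blocking
    // constructs such as a clock's "next" or a future's force(), since
    // atomic sections do not have blocking semantics in this runtime.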

SLIDE 6

System Configuration used for Performance Results

  • Hardware

− 64-core POWER5+ (p595), 2.3 GHz, 512 GB (r28n01.pbm.ihost.com): used for STREAM (C/OpenMP & X10), RandomAccess (C/OpenMP & X10), and FFT (X10)
− 16-core POWER5+ (p570), 1.9 GHz: used for FFT (Cilk version)
− All runs performed with page size = 4KB and SMT turned off

  • Operating System

− AIX v5.3

  • Compiler

− xlc v7.0.0.5 w/ -O3 option (also -qsmp=omp for OpenMP compilation)

  • X10

− Dynamic compilation options: -J-Xjit:count=0,optLevel=veryHot
− X10 activities use serial libraries written in C and linked with the X10 runtime
− Data size limitation: the current X10 runtime is limited to a max heap size of 2GB

  • All results reported are for runs that passed validation

− Caveat: these results should not be treated as official benchmark measurements of the above systems

SLIDE 7

STREAM

OpenMP / C version

    #pragma omp parallel for
    for (j=0; j<N; j++) {
        b[j] = scalar*c[j];
    }

Hybrid X10 + Serial C version

    finish ateach (point p : dist.factory.unique()) {
        final region myR = (D | here).region;
        scale(b, scalar, c, myR.rank(0).low(), myR.rank(0).high()+1);
    }

SLIDE 8

STREAM

OpenMP / C version

    #pragma omp parallel for
    for (j=0; j<N; j++) {
        b[j] = scalar*c[j];
    }

Hybrid X10 + Serial C version

    finish ateach (point p : dist.factory.unique()) {
        final region myR = (D | here).region;
        scale(b, scalar, c, myR.rank(0).low(), myR.rank(0).high()+1);
    }

  • scale( ) is a sequential C function
  • The multi-place X10 version is designed to run unchanged on an SMP or a cluster
  • The restriction operator (D | here) simplifies computation of the local region
  • The OpenMP version implicitly assumes a Uniform Memory Access model (no distributed arrays)
  • Traversing the array region can be error-prone
  • SLOC counts are comparable
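
For reference, if the scale( ) call were written directly in X10 instead of calling out to serial C, its body would be roughly the following (a hedged sketch mirroring the OpenMP loop above; lo and hi stand for the low/high+1 bounds passed in):

    // Equivalent X10-level body for scale(b, scalar, c, lo, hi):
    for (int j = lo; j < hi; j++) {
        b[j] = scalar * c[j];
    }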

SLIDE 9

Performance Results for STREAM

[Figure: STREAM bandwidth (GB/s) for the Copy, Scale, Add, and Triad kernels, comparing C/OpenMP against X10 at 1, 2, 4, 8, 16, 32, and 64 threads/places. Per-kernel bandwidth grows from roughly 3-4.5 GB/s at one thread/place to roughly 77-116 GB/s at 64 threads/places, with the C and X10 versions within a few percent of each other at every configuration.]

Array size = 2^26 elements; combined memory for the 3 arrays = 1.5 GB

SLIDE 10

RandomAccess

OpenMP / C version

    #define NUPDATE (4 * TableSize)

    for (i=0; i<NUPDATE/128; i++) {
        #pragma omp parallel for
        for (j=0; j<128; j++) {
            ran[j] = (ran[j] << 1) ^ ((s64Int) ran[j] < 0 ? POLY : 0);
            Table[ran[j] & (TableSize-1)] ^= ran[j];
        }
    }

Hybrid X10 + Serial C version

    finish ateach (point p : dist.factory.unique()) {
        final region myR = (D | here).region;
        for (int i=0; i<(4 * TableSize)/W; i++) {
            innerLoop(Table, TableSize, ran, myR.rank(0).low(), myR.rank(0).high()+1);
        }
    }

SLIDE 11

RandomAccess

OpenMP / C version

    #define NUPDATE (4 * TableSize)

    for (i=0; i<NUPDATE/128; i++) {
        #pragma omp parallel for
        for (j=0; j<128; j++) {
            ran[j] = (ran[j] << 1) ^ ((s64Int) ran[j] < 0 ? POLY : 0);
            Table[ran[j] & (TableSize-1)] ^= ran[j];
        }
    }

Hybrid X10 + Serial C version

    finish ateach (point p : dist.factory.unique()) {
        final region myR = (D | here).region;
        for (int i=0; i<(4 * TableSize)/W; i++) {
            innerLoop(Table, TableSize, ran, myR.rank(0).low(), myR.rank(0).high()+1);
        }
    }

  • innerLoop( ) is a sequential C function
  • The multi-place X10 version is designed to run unchanged on an SMP or a cluster
  • The restriction operator (D | here) simplifies computation of the local region
  • The inner parallel loop is a source of inefficiency in the OpenMP version
  • SLOC counts are comparable
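
Likewise, if innerLoop( ) were written directly in X10 rather than serial C, each call would perform roughly the following update, mirroring the OpenMP body above (a hedged sketch; lo and hi stand for the bounds passed in, and the int cast on the table index is illustrative):

    // Equivalent X10-level body for innerLoop(Table, TableSize, ran, lo, hi):
    for (int j = lo; j < hi; j++) {
        ran[j] = (ran[j] << 1) ^ (ran[j] < 0 ? POLY : 0);
        Table[(int)(ran[j] & (TableSize-1))] ^= ran[j];
    }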

SLIDE 12

Performance Results for RandomAccess

Array size = 1.8GB

[Figure: RandomAccess throughput in GUPS vs. #threads/places, for OpenMP/C and Hybrid X10:]

    #threads/places    1        2        4        8        16       32       64
    OpenMP/C           3.7E-03  6.7E-03  1.1E-02  1.5E-02  1.6E-02  1.3E-02  7.5E-03
    Hybrid X10         4.7E-03  8.1E-03  1.6E-02  3.0E-02  4.2E-02  5.3E-02  2.5E-02

SLIDE 13

FFT: Transpose Example

Cilk / C version (recursive)

    #define SUB(A, i, j) (A)[(i)*SQRTN+(j)]

    cilk void transpose(fftw_complex *A, int n) {
        if (n > 1) {
            int n2 = n/2;
            spawn transpose(A, n2);
            spawn transpose(&SUB(A, n2, n2), n-n2);
            spawn transpose_and_swap(A, 0, n2, n2, n);
        } else {
            /* 1x1 transpose is a NOP */
        }
    }

Hybrid X10 + Serial C version (non-recursive)

    int nBlocks = SQRTN / bSize;
    int p = 0;
    finish for (int r = 0; r < nBlocks; ++r) {
        for (int c = r; c < nBlocks; ++c) { // Triangular loop
            final int topLefta_r = (bSize * r);
            final int topLefta_c = (bSize * c);
            final int topLeftb_r = (bSize * c);
            final int topLeftb_c = (bSize * r);
            async (place.factory.place(p++))
                transpose_and_swap(A, topLefta_r, topLefta_c,
                                   topLeftb_r, topLeftb_c, bSize);
        }
    }

  • transpose_and_swap( ) is a sequential C function
  • The "finish" operator is used to wait for termination of all subactivities (async's)
  • Implicit sync at function boundary (in the Cilk version)

SLIDE 14

Performance Results for FFT (w/ memoized sine/cosine twiddle factors)

N = 2^24 (SQRTN = 2^12)

[Figure: FFT execution time in seconds (0.5-3 s) vs. number of threads/places (1, 2, 4, 8, 16), comparing Cilk (1.9 GHz) and X10 (2.3 GHz).]

SLIDE 15

Summary

  • The X10 programming model provides core concurrency and distribution constructs for a new era of parallel processing
  • Results show competitive performance for Hybrid X10+C relative to OpenMP/C and Cilk
  • Past studies have shown other productivity benefits of X10
  • To find out more, come to the X10 exhibit in the Exotic Technologies area!


SLIDE 16

BACKUP SLIDES START HERE

SLIDE 17

X10 context: PERCS Programming Model, Tools and Compilers (PERCS = Productive Easy-to-use Reliable Computer System)

[Figure: the PERCS software stack; text in blue on the original slide identifies PERCS contributions. Source languages (X10; Fortran w/ MPI, OpenMP; C/C++ w/ MPI, OpenMP, UPC; Java w/ threads & concurrency utilities) are compiled by their respective compilers (X10 Compiler, Fortran Compiler, C/C++ Compiler w/ UPC extensions, Java Compiler) into components and runtimes layered on an Integrated Parallel Runtime (MPI + LAPI + RDMA + OpenMP + threads), with a fast extern interface to the X10 runtime and Dynamic Compilation + Continuous Program Optimization. Tooling builds on the Eclipse platform and the Parallel Tools Platform (PTP): X10, Java, Fortran, and C/C++ Development Toolkits (the latter with MPI & OpenMP extensions), Performance Explorer, Productivity Measurements, Refactoring for Concurrency, Rational PurifyPlus, Remote System Explorer, Rational Team Platform, and HPC Toolkit + pSigma + Performance Tuning Automation.]

SLIDE 18

X10 Eclipse Development Toolkit

SLIDE 19

X10 Eclipse Debugging Toolkit

SLIDE 20

X10 Language

  • async [(Place)] [clocked(c…)] Stm

− Run Stm asynchronously at Place

  • finish Stm

− Execute Stm, wait for all asyncs to terminate (generalizes join)

  • foreach ( point P : Reg) Stm

− Run Stm asynchronously for each point in region

  • ateach ( point P : Dist) Stm

− Run Stm asynchronously for each point in dist, in its place.

  • atomic Stm

− Execute Stm atomically

  • new T

− Allocate object at this place (here)

  • new T[d] / new T value [d]

− Array of base type T and distribution d

  • Region

− Collection of index points, e.g. region r = [1:N,1:M];

  • Distribution

− Mapping from region to places, e.g. dist d = block(r);

  • next

− Suspend until all clocks that the current activity is registered with can advance
− Clocks are a generalization of barriers and MPI communicators

  • future [(Place)] [clocked(c…)] Expr

− Compute Expr asynchronously at Place

  • F.force()

− Block until future F has been computed

  • extern

− Lightweight interface to native code

Deadlock safety: any X10 program written with the above constructs (excluding future) can never deadlock
  • Can be extended to restricted cases of using future
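
A short hedged sketch of the clock and future constructs listed above (clock.factory.clock(), phase1/phase2, and expensiveComputation are assumed names; the slide lists the clocked clause only for async and future, so attaching it to foreach is also an assumption):

    // Clocks: a two-phase computation in which "next" acts as a barrier
    // across all activities registered with clock c.
    final clock c = clock.factory.clock();   // the creating activity is registered on c
    foreach (point i : [1:P]) clocked(c) {
        phase1(i);
        next;   // suspend until all registered activities can advance
        phase2(i);
    }
    next;   // the spawning activity is registered on c too, so it must also advance

    // Futures: compute asynchronously, then block for the result.
    future<double> f = future (here) { expensiveComputation() };
    double v = f.force();   // block until f has been computed
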
SLIDE 21

X10 Arrays, Regions, Distributions

ArrayExpr:
    new ArrayType ( Formal ) { Stm }       -- Array constructor
    Distribution Expr                      -- Lifting
    ArrayExpr [ Region ]                   -- Section
    ArrayExpr | Distribution               -- Restriction
    ArrayExpr || ArrayExpr                 -- Union
    ArrayExpr.overlay(ArrayExpr)           -- Update
    ArrayExpr.scan( [fun [, ArgList]] )
    ArrayExpr.reduce( [fun [, ArgList]] )
    ArrayExpr.lift( [fun [, ArgList]] )

ArrayType:
    Type [Kind] [ ]
    Type [Kind] [ region(N) ]
    Type [Kind] [ Region ]
    Type [Kind] [ Distribution ]

Region:
    Expr : Expr                            -- 1-D region
    [ Range, ..., Range ]                  -- Multidimensional region
    Region && Region                       -- Intersection
    Region || Region                       -- Union
    Region – Region                        -- Set difference
    BuiltinRegion

Distribution:
    Region -> Place                        -- Constant distribution
    Distribution | Place                   -- Restriction
    Distribution | Region                  -- Restriction
    Distribution || Distribution           -- Union
    Distribution – Distribution            -- Set difference
    Distribution.overlay ( Distribution )
    BuiltinDistribution

Language supports type safety, memory safety, place safety, clock safety.
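
A hedged sketch tying these forms together (v0.4-era syntax; dist.factory.block and the reducer add are assumed names, where the slide writes the distribution simply as block(r)):

    region r    = [1:N, 1:M];             // multidimensional region of index points
    region top  = r && [1:1, 1:M];        // intersection: the first row of r
    dist d      = dist.factory.block(r);  // map r's points across places
    dist dHere  = d | here;               // restriction: the fragment of d at this place
    double[.] A = new double[d];          // array of base type double over distribution d
    double s    = A.reduce(add, 0.0);     // ArrayExpr.reduce(fun, ArgList) form from above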