
SLIDE 1

Improved Static Analysis to Generate More Efficient Code for Execution of Loop Nests in GPUs

  • J. Nelson Amaral

Department of Computing Science

SLIDE 2

Artem H., Brad Simons, Artem Chikin, Tristan, Sarah, Lyle Roleman, Ben Hao, Eldon, Dylan Thomas, Rebecca, Michael Z., Braedy, Michael N., Nathan, Jacky

SLIDE 3

September 2014

SLIDE 4

Artem Chikin

SLIDE 5

July 2018 September 2014

SLIDE 6

Artem Chikin Taylor Lloyd

SLIDE 7

SLIDE 8

November 2014

SLIDE 9

http://www.cnet.com/news/ibm-nvidia-land-325-million-supercomputer-deal/

http://www.extremetech.com/computing/194283-ibm-and-nvidia-will-build-two-ultra-efficient-150-petaflop-supercomputers-for-the-doe
http://www.anandtech.com/show/8727/nvidia-ibm-supercomputers
http://www.zdnet.com/article/ibm-nvidia-tapped-to-build-worlds-fastest-supercomputers/

SLIDE 10


"Summit … is expected to deliver more than five times the system-level application performance of Titan while consuming only 10% more power."

http://info.nvidianews.com/rs/nvidia/images/Coral%20White%20Paper%20Final-3-2.pdf

http://www.top500.org/list/2014/11/

[Chart: 88000 Linpack TFLOP/s, 9000 kW vs. 200000 Linpack TFLOP/s, 15000 kW]

SLIDE 11

Technology? Nvidia Volta GPU, IBM Power9, Nvidia NVLink

SLIDE 12

Programming Model? OpenMP OpenCL CUDA OpenACC MPI

SLIDE 13

Compiler Technology? LLVM IBM XL Compiler

SLIDE 14

May 2015

SLIDE 15

IBM Canada Software Laboratory

Markham, ON

Taylor Lloyd, Ettore Tiotto, Artem Chikin

Science Internship Program

  • J. Nelson Amaral
SLIDE 16

Programming Model

OpenMP 3.x ➔ OpenMP 4.x

SLIDE 17

Sequential version:

    void vecAdd(double *a, double *b, double *c, int n) {
      for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
      }
    }

CPU version:

    void vecAdd(double *a, double *b, double *c, int n) {
    #pragma omp parallel for
      for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
      }
    }

TARGET version:

    void vecAdd(double *a, double *b, double *c, int n) {
    #pragma omp target map(to: a[:n], b[:n]) \
                       map(from: c[:n])
      for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
      }
    }

[Figure: CPU with cores and caches sharing memory vs. a target device with its own memory and a team of threads]

SLIDE 18

    void vecAdd(double *a, double *b, double *c, int n) {
    #pragma omp target map(to: a[:n], b[:n]) \
                       map(from: c[:n])
    #pragma omp teams parallel for
      for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
      }
    }

    void vecAdd(double *a, double *b, double *c, int n) {
    #pragma omp target map(to: a[:n], b[:n]) \
                       map(from: c[:n])
    #pragma omp parallel for
      for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
      }
    }

These two teams execute the same computation on the same memory locations without synchronization. Likely a wrong result.
SLIDE 19

    void vecAdd(double *a, double *b, double *c, int n) {
    #pragma omp target map(to: a[:n], b[:n]) \
                       map(from: c[:n])
    #pragma omp teams distribute parallel for
      for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
      }
    }

The iterations of the loop are distributed to the two teams and the result is correct.

SLIDE 20

Memory Coalescing

SLIDE 21

A warp of 32 threads issues 32 accesses in one cycle.

One 128-byte L1 cache line: accesses aligned to the boundary of the cache line are coalesced into a single transaction.

Warps may access addresses in any order within the cache line; the accesses are still coalesced into a single transaction.

SLIDE 22

With four 32-byte L2 cache lines touched, four transactions are required for the warp of 32 threads.

SLIDE 23

[Figure: a warp of 32 threads accessing 32-byte cache lines, contrasting an inter-thread stride with an intra-thread stride]

SLIDE 24

October 2016

SLIDE 25

Taylor Lloyd, Karim Ali

CSC building (Computing Science Centre)

SLIDE 26

Taint Analysis

SLIDE 27

    void main() {
      long int a = readCreditCardNumber();
      long int b = 0;
      b = foo(a);
      print(b);
    }

    long int foo(int p) {
      if (p != 0)
        print(p);
      return p;
    }

a is tainted. Is b tainted?

SLIDE 28

Arithmetic Control Form (ACF) Analysis

Taylor Lloyd

SLIDE 29

    int readBounded(int* a) {
      int tx = threadIdx.x;
      if (tx > 256)
        tx = 256;
      int *addr = a + tx;
      return *addr;
    }

[Figure: control-flow graph of readBounded: tx = threadIdx.x; branch on tx > 256; tx = 256 on the taken path; *addr = a + tx; return *addr]


ACF_T(*addr) = (T > 256)*([a] + 4*256) + (T <= 256)*([a] + 4*T)

ACF_0(*addr) = (0 > 256)*([a] + 4*256) + (0 <= 256)*([a] + 4*0) = [a]
ACF_1(*addr) = [a] + 4

ACF_1(*addr) - ACF_0(*addr) = 4

SLIDE 30

June/July 2017

SLIDE 31

Sanket Kedia Dhruv Jain

Taylor Lloyd Artem Chikin

SLIDE 32

SLIDE 33

August 2017

SLIDE 34

IBM Canada Software Laboratory

Markham, ON

Artem Chikin

Computing Science Centre

SLIDE 35

Iteration Point Difference Analysis (IPDA)

  • ACF can be used for any pair of expressions.
  • ACF can be based on the induction variables in a loop nest.
  • ACF is useful when applied to address expressions in a loop nest.
  • ACF can make a Data Dependence Graph more precise.
  • The compiler can transform code based on IPDA.

SLIDE 36

IPDA Analysis in an example

SLIDE 37

conv2D: Two-dimensional convolution

SLIDE 38

    for (CIVI = 0; CIVI < NI - 2; ++CIVI) {
      i = CIVI + 1;
      for (CIVJ = 0; CIVJ < NJ - 2; ++CIVJ) {
        B[i*NJ + CIVJ + 1] = … ;
      }
    }

Address of the store: B + 8*((CIVI+1)*NJ + CIVJ + 1)

Base address for array B, assuming that the data type size is 8 bytes. IPDA propagates symbolic expressions from the dominant definition to each use.

SLIDE 39

    for (CIVI = 0; CIVI < NI - 2; ++CIVI) {
      i = CIVI + 1;
      for (CIVJ = 0; CIVJ < NJ - 2; ++CIVJ) {
        B[i*NJ + CIVJ + 1] = … ;
      }
    }

(B + 8*((CIVI’+1)*NJ + CIVJ’ + 1)) - (B + 8*((CIVI+1)*NJ + CIVJ + 1))
= 8*(CIVI’*NJ + CIVJ’) - 8*(CIVI*NJ + CIVJ)

Iteration Point Algebraic Difference:

8*((CIVI’ - CIVI)*NJ + (CIVJ’ - CIVJ)) = 8*(∆CIVI*NJ + ∆CIVJ) = 0 ?

SLIDE 40

    for (CIVI = 0; CIVI < NI - 2; ++CIVI) {
      i = CIVI + 1;
      for (CIVJ = 0; CIVJ < NJ - 2; ++CIVJ) {
        B[i*NJ + CIVJ + 1] = … ;
      }
    }

Address of the store: B + 8*((CIVI+1)*NJ + CIVJ + 1)

Iteration Point Algebraic Difference:

8*(∆CIVI*NJ + ∆CIVJ) = 0 ?

    ∆CIVI    ∆CIVJ
     = 0      ≠ 0
     ≠ 0      = 0
     ≠ 0      ≠ 0

SLIDE 41

Loop Collapsing and Loop Interchange

SLIDE 42

    for (i = 0; i < N; ++i) {
      for (j = 0; j < N; ++j) {
        A[i+j*N] = A[i+j*N] * A[i+j*N];
      }
    }

    for (c = 0; c < N*N; ++c) {
      i = c / N;
      j = c % N;
      A[i+j*N] = A[i+j*N] * A[i+j*N];
    }

Loop Collapse


SLIDE 43

    for (i = 0; i < N; ++i) {
      for (j = 0; j < N; ++j) {
        A[i+j*N] = A[i+j*N] * A[i+j*N];
      }
    }

    for (c = 0; c < N*N; ++c) {
      i = c / N;
      j = c % N;
      A[i+j*N] = A[i+j*N] * A[i+j*N];
    }

Loop Collapse

    for (j = 0; j < N; ++j) {
      for (i = 0; i < N; ++i) {
        A[i+j*N] = A[i+j*N] * A[i+j*N];
      }
    }

Loop Interchange


SLIDE 44

A detailed example of how IPDA Analysis helps

SLIDE 45

557.pcsp

SLIDE 46

557.pcsp

  • An OpenMP program
  • Written in C
  • SP = Pentadiagonal Solver

SLIDE 47

4-dimensional loop nest; outer-dimension range: 0, 1, 2

    for (k = 1; k <= gp2-2; k++) {
      for (j = 0; j <= gp1-3; j++) {
        j1 = j + 1;
        j2 = j + 2;
        for (i = 1; i <= gp0-2; i++) {
          ...
          for (m = 0; m < 3; m++) {
          }
          ...
        }
      }
    }

SLIDE 48

i is the innermost loop and the last coordinate

SLIDE 49

Elements from three rows are accessed ⇒ data dependence on loop j ⇒ the j loop is sequential

SLIDE 50

The loop nest is not perfect.

SLIDE 51

Expression Re-materialization

SLIDE 52

SLIDE 53

We will focus on m=3

SLIDE 54

Sequential Execution

    for (k = 1; k <= gp2-2; k++) {
      for (j = 0; j <= gp1-3; j++) {
        for (i = 1; i <= gp0-2; i++) {
          ...
          lhsY[3][k][j][i] = fac1 * lhsY[3][k][j][i];
          ...
        }
      }
    }

SLIDE 55

lhsY[3][k][j][i]: intra-thread access pattern when parallelizing loop k.

SLIDE 56

lhsY[3][k][j][i]: inter-thread access pattern when parallelizing loop k (warp of 32 threads). None of the accesses are coalesced.

SLIDE 57

Interchange loops j and i

SLIDE 58

SLIDE 59

lhsY[3][k][j][i]: inter-thread access pattern when parallelizing loop k after the interchange (warp of 32 threads). None of the accesses are coalesced.

SLIDE 60

Collapse loops k and i

SLIDE 61

SLIDE 62

lhsY[3][k][j][i]: inter-thread access pattern when parallelizing the collapsed loop c (warp of 32 threads). Perfect coalescing.

SLIDE 63

After IPDA-enabled transformations:
  • On Nvidia Pascal (P100) the kernel is 29.4 times faster (execution time: 41.13 ms → 1.4 ms)
  • On Nvidia Volta (V100): 16.4 times faster
  • Benchmark speedups: Pascal (P100): 1.53x; Volta (V100): 1.26x
*The benchmark was not verifying.

SLIDE 64

After IPDA-enabled transformations:
  • On Nvidia Pascal (P100) the kernel is 29.4 times faster (execution time: 41.13 ms → 1.4 ms)
  • On Nvidia Volta (V100): 16.4 times faster
  • Benchmark speedups: Pascal (P100): 1.53x → 3.33x; Volta (V100): 1.26x → 2.3x
  • Updated figures: 88.5, 111
*After bug fixes, with the benchmark verifying.

SLIDE 65

October 2017

Artem Chikin

  • J. Nelson Amaral

Ettore Tiotto

SLIDE 66

March 2018

SLIDE 67

SLIDE 68

Opportunities in three other benchmarks

SLIDE 69

[Charts: speedups for GEMM, 2DCONV, and 3DCONV on IBM P8 + Nvidia P100 and IBM P9 + Nvidia V100]

SLIDE 70

Symbolic differences of control-dependent expressions …
… enable code transformations that were not possible …
… with significant performance improvements …
… improve dependence testing …
… and enable increased code portability.

SLIDE 71

May-August 2018

SLIDE 72

Muhammad Usman Tyler Gobian

Artem Chikin

SLIDE 73

SLIDE 74

July 2018 September 2014

SLIDE 75

Taylor Lloyd Artem Chikin