
SLIDE 1

Improved Static Analysis to Generate More Efficient Code for Execution of Loop Nests in GPUs

  • J. Nelson Amaral

Department of Computing Science

SLIDE 2

Artem H., Brad Simons, Artem Chikin, Tristan, Sarah, Lyle Roleman, Ben Hao, Eldon, Dylan Thomas, Rebecca, Michael Z., Braedy, Michael N., Nathan, Jacky

SLIDE 3

September 2014

SLIDE 4

Artem Chikin

SLIDE 5

July 2018 September 2014

SLIDE 6

Artem Chikin Taylor Lloyd

SLIDE 7

SLIDE 8

November 2014

SLIDE 9

http://www.cnet.com/news/ibm-nvidia-land-325-million-supercomputer-deal/

http://www.extremetech.com/computing/194283-ibm-and-nvidia-will-build-two-ultra-efficient-150-petaflop-supercomputers-for-the-doe
http://www.anandtech.com/show/8727/nvidia-ibm-supercomputers
http://www.zdnet.com/article/ibm-nvidia-tapped-to-build-worlds-fastest-supercomputers/

SLIDE 10


"Summit … is expected to deliver more than five times the system-level application performance of Titan while consuming only 10% more power."

http://info.nvidianews.com/rs/nvidia/images/Coral%20White%20Paper%20Final-3-2.pdf

http://www.top500.org/list/2014/11/

[Chart: 88000 Linpack TFLOP/s, 9000 kW vs. 200000 Linpack TFLOP/s, 15000 kW]

SLIDE 11

Technology? Nvidia Volta GPU, IBM Power9, Nvidia NVLink

SLIDE 12

Programming Model? OpenMP OpenCL CUDA OpenACC MPI

SLIDE 13

Compiler Technology? LLVM IBM XL Compiler

SLIDE 14

May 2015

SLIDE 15

IBM Canada Software Laboratory

Markham, ON

Taylor Lloyd, Ettore Tiotto, Artem Chikin

Science Internship Program

  • J. Nelson Amaral
SLIDE 16

Programming Model

OpenMP 3.x ➔ OpenMP 4.x

SLIDE 17

Sequential version:

    void vecAdd(double *a, double *b, double *c, int n) {
      for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
      }
    }

CPU version:

    void vecAdd(double *a, double *b, double *c, int n) {
    #pragma omp parallel for
      for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
      }
    }

TARGET version:

    void vecAdd(double *a, double *b, double *c, int n) {
    #pragma omp target map(to: a[:n], b[:n]) \
                       map(from: c[:n])
      for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
      }
    }

[Figure: CPU with cores and caches sharing memory vs. a target device with its own memory and a team of threads]

SLIDE 18

    void vecAdd(double *a, double *b, double *c, int n) {
    #pragma omp target map(to: a[:n], b[:n]) \
                       map(from: c[:n])
    #pragma omp teams parallel for
      for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
      }
    }

    void vecAdd(double *a, double *b, double *c, int n) {
    #pragma omp target map(to: a[:n], b[:n]) \
                       map(from: c[:n])
    #pragma omp parallel for
      for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
      }
    }

These two teams execute the same computation on the same memory locations without synchronization. Likely a wrong result.
SLIDE 19

    void vecAdd(double *a, double *b, double *c, int n) {
    #pragma omp target map(to: a[:n], b[:n]) \
                       map(from: c[:n])
    #pragma omp teams distribute parallel for
      for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
      }
    }

The iterations of the loop are distributed to the two teams and the result is correct.

SLIDE 20

Memory Coalescing

SLIDE 21

A warp of 32 threads issues 32 accesses in one cycle.

One 128-byte L1 cache line: accesses aligned to the boundary of the cache line are coalesced into a single transaction.

Warps may access addresses in any order within the cache line; the accesses are still coalesced into a single transaction.

SLIDE 22

With four 32-byte L2 cache lines touched, four transactions are required for the warp of 32 threads.

SLIDE 23

[Figure: a warp of 32 threads accessing 32-byte cache lines, contrasting an inter-thread stride with an intra-thread stride]

SLIDE 24

October 2016

SLIDE 25

Taylor Lloyd, Karim Ali

CSC building (Computing Science Centre)

SLIDE 26

Taint Analysis

SLIDE 27

    void main() {
      long int a = readCreditCardNumber();
      long int b = 0;
      b = foo(a);
      print(b);
    }

    long int foo(int p) {
      if (p != 0)
        print(p);
      return p;
    }

a is tainted. Is b tainted?

SLIDE 28

Arithmetic Control Form (ACF) Analysis

Taylor Lloyd

SLIDE 29

    int readBounded(int* a) {
      int tx = threadIdx.x;
      if (tx > 256)
        tx = 256;
      int *addr = a + tx;
      return *addr;
    }

[Figure: control-flow graph of readBounded: tx = threadIdx.x; branch on tx > 256; tx = 256 on the taken path; *addr = a + tx; return *addr]


ACF_T(*addr) = (T > 256)*([a] + 4*256) + (T <= 256)*([a] + 4*T)

ACF_0(*addr) = (0 > 256)*([a] + 4*256) + (0 <= 256)*([a] + 4*0) = [a]
ACF_1(*addr) = [a] + 4

ACF_1(*addr) - ACF_0(*addr) = 4

SLIDE 30

June/July 2017

SLIDE 31

Sanket Kedia Dhruv Jain

Taylor Lloyd Artem Chikin

SLIDE 32

SLIDE 33

August 2017

SLIDE 34

IBM Canada Software Laboratory

Markham, ON

Artem Chikin

Computing Science Centre

SLIDE 35

Iteration Point Difference Analysis (IPDA)

  • ACF can be used for any pair of expressions.
  • ACF can be based on the induction variables in a loop nest.
  • ACF is useful when applied to address expressions in a loop nest.
  • ACF can make a Data Dependence Graph more precise.
  • The compiler can transform code based on IPDA.

SLIDE 36

IPDA Analysis in an example

SLIDE 37

conv2D: Two-dimensional convolution

SLIDE 38

    for (CIVI = 0; CIVI < NI - 2; ++CIVI) {
      i = CIVI + 1;
      for (CIVJ = 0; CIVJ < NJ - 2; ++CIVJ) {
        B[i*NJ + CIVJ + 1] = … ;
      }
    }

Address of the store: B + 8*((CIVI+1)*NJ + CIVJ + 1)

Base address for array B, assuming that the data type size is 8 bytes. IPDA propagates symbolic expressions from the dominant definition to each use.

SLIDE 39

    for (CIVI = 0; CIVI < NI - 2; ++CIVI) {
      i = CIVI + 1;
      for (CIVJ = 0; CIVJ < NJ - 2; ++CIVJ) {
        B[i*NJ + CIVJ + 1] = … ;
      }
    }

(B + 8*((CIVI’+1)*NJ + CIVJ’ + 1)) - (B + 8*((CIVI+1)*NJ + CIVJ + 1))
= 8*(CIVI’*NJ + CIVJ’) - 8*(CIVI*NJ + CIVJ)

Iteration Point Algebraic Difference:

8*((CIVI’ - CIVI)*NJ + (CIVJ’ - CIVJ)) = 8*(∆CIVI*NJ + ∆CIVJ) = 0 ?

SLIDE 40

    for (CIVI = 0; CIVI < NI - 2; ++CIVI) {
      i = CIVI + 1;
      for (CIVJ = 0; CIVJ < NJ - 2; ++CIVJ) {
        B[i*NJ + CIVJ + 1] = … ;
      }
    }

Address of the store: B + 8*((CIVI+1)*NJ + CIVJ + 1)

Iteration Point Algebraic Difference:

8*(∆CIVI*NJ + ∆CIVJ) = 0 ?

    ∆CIVI    ∆CIVJ
     = 0      ≠ 0
     ≠ 0      = 0
     ≠ 0      ≠ 0

SLIDE 41

Loop Collapsing and Loop Interchange

SLIDE 42

    for (i = 0; i < N; ++i) {
      for (j = 0; j < N; ++j) {
        A[i+j*N] = A[i+j*N] * A[i+j*N];
      }
    }

    for (c = 0; c < N*N; ++c) {
      i = c / N;
      j = c % N;
      A[i+j*N] = A[i+j*N] * A[i+j*N];
    }

Loop Collapse


SLIDE 43

    for (i = 0; i < N; ++i) {
      for (j = 0; j < N; ++j) {
        A[i+j*N] = A[i+j*N] * A[i+j*N];
      }
    }

    for (c = 0; c < N*N; ++c) {
      i = c / N;
      j = c % N;
      A[i+j*N] = A[i+j*N] * A[i+j*N];
    }

Loop Collapse

    for (j = 0; j < N; ++j) {
      for (i = 0; i < N; ++i) {
        A[i+j*N] = A[i+j*N] * A[i+j*N];
      }
    }

Loop Interchange


SLIDE 44

A detailed example of how IPDA Analysis helps

SLIDE 45

557.pcsp

SLIDE 46

557.pcsp

  • An OpenMP program
  • Written in C
  • SP = Pentadiagonal Solver

SLIDE 47

4-dimensional loop nest; outer-dimension range: 0, 1, 2

    for (k = 1; k <= gp2-2; k++) {
      for (j = 0; j <= gp1-3; j++) {
        j1 = j + 1;
        j2 = j + 2;
        for (i = 1; i <= gp0-2; i++) {
          ...
          for (m = 0; m < 3; m++) {
          }
          ...
        }
      }
    }

SLIDE 48

i is the innermost loop and the last coordinate

SLIDE 49

Elements from three rows are accessed ⇒ data dependence on loop j ⇒ the j loop is sequential

SLIDE 50

The loop nest is not perfect.

SLIDE 51

Expression Re-materialization

SLIDE 52

SLIDE 53

We will focus on m=3

SLIDE 54

Sequential Execution

    for (k = 1; k <= gp2-2; k++) {
      for (j = 0; j <= gp1-3; j++) {
        for (i = 1; i <= gp0-2; i++) {
          ...
          lhsY[3][k][j][i] = fac1 * lhsY[3][k][j][i];
          ...
        }
      }
    }

SLIDE 55

lhsY[3][k][j][i]: intra-thread access pattern when parallelizing loop k.

SLIDE 56

lhsY[3][k][j][i]: inter-thread access pattern when parallelizing loop k (warp of 32 threads). None of the accesses are coalesced.

SLIDE 57

Interchange loops j and i

SLIDE 58

SLIDE 59

lhsY[3][k][j][i]: inter-thread access pattern when parallelizing loop k after the interchange (warp of 32 threads). None of the accesses are coalesced.

SLIDE 60

Collapse loops k and i

SLIDE 61

SLIDE 62

lhsY[3][k][j][i]: inter-thread access pattern when parallelizing the collapsed loop c (warp of 32 threads). Perfect coalescing.

SLIDE 63

After IPDA-enabled transformations:
  • On Nvidia Pascal (P100) the kernel is 29.4 times faster (execution time: 41.13 ms → 1.4 ms)
  • On Nvidia Volta (V100): 16.4 times faster
  • Benchmark speedups: Pascal (P100): 1.53x; Volta (V100): 1.26x
*The benchmark was not verifying.

SLIDE 64

After IPDA-enabled transformations:
  • On Nvidia Pascal (P100) the kernel is 29.4 times faster (execution time: 41.13 ms → 1.4 ms)
  • On Nvidia Volta (V100): 16.4 times faster
  • Benchmark speedups: Pascal (P100): 1.53x → 3.33x; Volta (V100): 1.26x → 2.3x
  • Updated figures: 88.5, 111
*After bug fixes, with the benchmark verifying.

SLIDE 65

October 2017

Artem Chikin

  • J. Nelson Amaral

Ettore Tiotto

SLIDE 66

March 2018

SLIDE 67

SLIDE 68

Opportunities in three other benchmarks

SLIDE 69

[Charts: speedups for GEMM, 2DCONV, and 3DCONV on IBM P8 + Nvidia P100 and IBM P9 + Nvidia V100]

SLIDE 70

Symbolic differences of control-dependent expressions …
… enable code transformations that were not possible …
… with significant performance improvements …
… improve dependence testing …
… and enable increased code portability.

SLIDE 71

May-August 2018

SLIDE 72

Muhammad Usman Tyler Gobian

Artem Chikin

SLIDE 73

SLIDE 74

July 2018 September 2014

SLIDE 75

Taylor Lloyd Artem Chikin