A Case for Better Integration of Host and Target Compilation When Using OpenCL for FPGAs


slide-1
SLIDE 1

A Case for Better Integration of Host and Target Compilation When Using OpenCL for FPGAs

Taylor Lloyd, Artem Chikin, Erick Ochoa, Karim Ali, José Nelson Amaral University of Alberta

1

Sept 7 FSP 2017

slide-2
SLIDE 2

University of Alberta Systems Group

  • Focused on compiler optimizations, heterogeneous systems
  • Recently working primarily on GPU computing

2

slide-3
SLIDE 3

So can traditional compiler techniques help with OpenCL for FPGAs?

3

slide-4
SLIDE 4

Background: OpenCL Execution Models

Data Parallelism (NDRange)

  • Kernel defined per-thread
  • Kernel execution defines number and grouping of threads
  • Behaviour varies by querying thread ID

Task Parallelism (Single Work-Item)

  • Kernel defines a complete unit of work
  • Kernel execution starts a single thread

4

slide-5
SLIDE 5

Background: OpenCL Execution Model

NDRange Example / Single Work-Item Example

__kernel void memcpy(char* tgt, char* src, int length) {
  int index = get_global_id(0);
  while (index < length) {
    tgt[index] = src[index];
    index += get_global_size(0);
  }
}

5

slide-6
SLIDE 6

Background: OpenCL Execution Model

NDRange Example / Single Work-Item Example

__kernel void memcpy(char* tgt, char* src, int length) {
  int index = get_global_id(0);
  while (index < length) {
    tgt[index] = src[index];
    index += get_global_size(0);
  }
}

size_t offset = 0, threads = 2048, groupsize = 128;
clSetKernelArg(kernel, 0, sizeof(cl_mem), &tgtbuf);
clSetKernelArg(kernel, 1, sizeof(cl_mem), &srcbuf);
clSetKernelArg(kernel, 2, sizeof(int), &length);
clEnqueueNDRangeKernel(queue, kernel, 1, &offset, &threads, &groupsize,
                       0, NULL, NULL);

6

slide-7
SLIDE 7

Background: OpenCL Execution Model

NDRange Example / Single Work-Item Example

__kernel void memcpy(char* tgt, char* src, int length) {
  int index = get_global_id(0);
  while (index < length) {
    tgt[index] = src[index];
    index += get_global_size(0);
  }
}

size_t offset = 0, threads = 2048, groupsize = 128;
clSetKernelArg(kernel, 0, sizeof(cl_mem), &tgtbuf);
clSetKernelArg(kernel, 1, sizeof(cl_mem), &srcbuf);
clSetKernelArg(kernel, 2, sizeof(int), &length);
clEnqueueNDRangeKernel(queue, kernel, 1, &offset, &threads, &groupsize,
                       0, NULL, NULL);

__kernel void memcpy(char* tgt, char* src, int length) {
  for (int i = 0; i < length; i++) {
    tgt[i] = src[i];
  }
}

7

slide-8
SLIDE 8

Background: OpenCL Execution Model

NDRange Example / Single Work-Item Example

__kernel void memcpy(char* tgt, char* src, int length) {
  int index = get_global_id(0);
  while (index < length) {
    tgt[index] = src[index];
    index += get_global_size(0);
  }
}

__kernel void memcpy(char* tgt, char* src, int length) {
  for (int i = 0; i < length; i++) {
    tgt[i] = src[i];
  }
}

size_t offset = 0, threads = 2048, groupsize = 128;
clSetKernelArg(kernel, 0, sizeof(cl_mem), &tgtbuf);
clSetKernelArg(kernel, 1, sizeof(cl_mem), &srcbuf);
clSetKernelArg(kernel, 2, sizeof(int), &length);
clEnqueueNDRangeKernel(queue, kernel, 1, &offset, &threads, &groupsize,
                       0, NULL, NULL);

clSetKernelArg(kernel, 0, sizeof(cl_mem), &tgtbuf);
clSetKernelArg(kernel, 1, sizeof(cl_mem), &srcbuf);
clSetKernelArg(kernel, 2, sizeof(int), &length);
clEnqueueTask(queue, kernel, 0, NULL, NULL);

8

slide-9
SLIDE 9

Single Work-Item Kernel versus NDRange Kernel

“Intel recommends that you structure your OpenCL kernel as a single work-item, if possible” [1]


9

slide-10
SLIDE 10

NDRange Kernel / Single Work-Item

__kernel void memcpy(char* tgt, char* src, int length) {
  int index = get_global_id(0);
  while (index < length) {
    tgt[index] = src[index];
    index += get_global_size(0);
  }
}

10

slide-11
SLIDE 11

__kernel void memcpy(char* tgt, char* src, int length,
                     int offset, int threads, int groups) {
  int index = get_global_id(0);
  while (index < length) {
    tgt[index] = src[index];
    index += get_global_size(0);
  }
}

NDRange Kernel / Single Work-Item

11

slide-12
SLIDE 12

__kernel void memcpy(char* tgt, char* src, int length,
                     int offset, int threads, int groups) {
  for (int tid = offset; tid < offset + threads; tid++) {
    int index = tid;
    while (index < length) {
      tgt[index] = src[index];
      index += threads;
    }
  }
}

NDRange Kernel / Single Work-Item

12
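The conversion above can be sanity-checked on the host. The following is a minimal plain-C sketch (hypothetical, emulating the device loop on the CPU; `memcpy_ndrange` is an illustrative name) showing that the single work-item loop nest touches exactly the elements a plain copy would:

```c
#include <assert.h>
#include <string.h>

/* Plain-C emulation (hypothetical) of the single work-item rewrite:
 * the outer loop enumerates the logical work-items, the inner loop is
 * the original strided kernel body. */
static void memcpy_ndrange(char *tgt, const char *src, int length,
                           int offset, int threads) {
    for (int tid = offset; tid < offset + threads; tid++) {
        int index = tid;
        while (index < length) {
            tgt[index] = src[index];
            index += threads;
        }
    }
}
```

Because the guard stays `index < length`, this rewrite is exact even when `threads` does not divide `length`.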

slide-13
SLIDE 13

Is that really better?

13

slide-14
SLIDE 14

Loop Canonicalization

__kernel void memcpy(char* tgt, char* src, int length,
                     int offset, int threads, int groups) {
  for (int tid = offset; tid < offset + threads; tid++) {
    int index = tid;
    for (int i = 0; i < length / threads; i++) {
      if (index + i * threads < length)
        tgt[index + i * threads] = src[index + i * threads];
    }
  }
}

14

slide-15
SLIDE 15

__kernel void memcpy(char* tgt, char* src, int length,
                     int offset, int threads, int groups) {
  for (int j = 0; j < threads; j++) {
    int tid = j + offset;
    int index = tid;
    for (int i = 0; i < length / threads; i++) {
      if (index + i * threads < length)
        tgt[index + i * threads] = src[index + i * threads];
    }
  }
}

Loop Canonicalization

15

slide-16
SLIDE 16

Loop Collapsing

__kernel void memcpy(char* tgt, char* src, int length,
                     int offset, int threads, int groups) {
  for (int x = 0; x < threads * (length / threads); x++) {
    int j = x / (length / threads);
    int i = x % (length / threads);
    int tid = j + offset;
    int index = tid;
    if (index + i * threads < length)
      tgt[index + i * threads] = src[index + i * threads];
  }
}

16

slide-17
SLIDE 17

Copy Propagation

__kernel void memcpy(char* tgt, char* src, int length,
                     int offset, int threads, int groups) {
  for (int x = 0; x < length; x++) {
    int j = x / (length / threads);
    int i = x % (length / threads);
    if (j + offset + i * threads < length)
      tgt[j + offset + i * threads] = src[j + offset + i * threads];
  }
}

17
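The collapsed, copy-propagated form can likewise be checked in plain C. A hypothetical host-side emulation follows; note that simplifying the trip count from threads*(length/threads) to length is only exact when threads divides length, so the check below uses such a size:

```c
#include <assert.h>
#include <string.h>

/* Plain-C emulation (hypothetical) of the kernel after loop
 * canonicalization, collapsing, and copy propagation.
 * Assumes threads divides length. */
static void memcpy_collapsed(char *tgt, const char *src, int length,
                             int offset, int threads) {
    for (int x = 0; x < length; x++) {
        int j = x / (length / threads);  /* recovered outer index */
        int i = x % (length / threads);  /* recovered inner index */
        if (j + offset + i * threads < length)
            tgt[j + offset + i * threads] = src[j + offset + i * threads];
    }
}
```

With offset = 0 and threads | length, the map x → j + i*threads is a bijection onto [0, length), so every element is copied exactly once.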

slide-18
SLIDE 18

Why isn’t this done today?

18

slide-19
SLIDE 19

Recall: Host OpenCL API

  • Host code must be rewritten to pass new arguments and call a different API

19

slide-20
SLIDE 20

Recall: Host OpenCL API

  • Host code must be rewritten to pass new arguments and call a different API

size_t offset = 0, threads = 2048, groupsize = 128;
clSetKernelArg(kernel, 0, sizeof(cl_mem), &tgtbuf);
clSetKernelArg(kernel, 1, sizeof(cl_mem), &srcbuf);
clSetKernelArg(kernel, 2, sizeof(int), &length);
clEnqueueNDRangeKernel(queue, kernel, 1, &offset, &threads, &groupsize,
                       0, NULL, NULL);

int offset = 0, threads = 2048, groups = 128;
clSetKernelArg(kernel, 0, sizeof(cl_mem), &tgtbuf);
clSetKernelArg(kernel, 1, sizeof(cl_mem), &srcbuf);
clSetKernelArg(kernel, 2, sizeof(int), &length);
clSetKernelArg(kernel, 3, sizeof(int), &offset);
clSetKernelArg(kernel, 4, sizeof(int), &threads);
clSetKernelArg(kernel, 5, sizeof(int), &groups);
clEnqueueTask(queue, kernel, 0, NULL, NULL);

20

slide-21
SLIDE 21

The Altera OpenCL Toolchain

Host Code (.c/.cpp) → C/C++ Compiler + OpenCL Runtime Library → Host Binary
Kernel Code (.cl) → Altera OpenCL Compiler (LLVM-based) → Kernel Code (Verilog) → Quartus Placement & Routing → FPGA Bitstream

21

slide-22
SLIDE 22

The Argument for Separation

  • Device-side code can be Just-In-Time (JIT) compiled for each device

22

slide-23
SLIDE 23

The Argument for Separation

  • Device-side code can be Just-In-Time (JIT) compiled for each device
  • Host compilers can be separately maintained by experts (icc, xlc, gcc, clang)

23

slide-24
SLIDE 24

The Argument for Separation

  • Device-side code can be Just-In-Time (JIT) compiled for each device
  • Host compilers can be separately maintained by experts (icc, xlc, gcc, clang)
  • Host code can be recompiled without needing to recompile device code

24

slide-25
SLIDE 25

The Argument for Combined Compilation

  • Execution context information (constants, pointer aliases) can be passed from host to device
  • Context information allows for better compiler transformations (Strength Reduction, Pipelining)
  • Better transformations improve final executables

25

slide-26
SLIDE 26

Our Proposed OpenCL Toolchain

Host Code (.c/.cpp) + Kernel Code (.cl) → Combined Host/Device Compiler + OpenCL Runtime Library → Host Binary
Combined Host/Device Compiler → Kernel Code (Verilog) → Quartus Placement & Routing → FPGA Bitstream

26

slide-27
SLIDE 27

Research Question: Can OpenCL be better targeted to FPGAs given communication between host and device compilers?

27

slide-28
SLIDE 28

Inspiration

[SC 16]

28

slide-29
SLIDE 29

Inspiration

  • Zohouri et al. hand-tuned OpenCL benchmarks for FPGA execution
  • Achieved speedups of 30% to 100x
  • Can we match their performance through compiler transformations?

[SC 16]

29

slide-30
SLIDE 30

Prototype OpenCL Toolchain

Host Code (.c/.cpp) → LLVM 4.0 + Prototype Transformations + OpenCL Runtime Library → Host Binary
Kernel Code (.cl) → Altera OpenCL Compiler (LLVM 3 based) + Prototype Transformations → Kernel Code (Verilog) → Quartus Placement & Routing → FPGA Bitstream
Host Context Information and Kernel Information are exchanged between the two compilers.

30

slide-31
SLIDE 31

Prototype Transformations

1. Geometry Propagation
2. NDRange To Loop
3. Restricted Pointer Analysis
4. Reduction Dependence Elimination

31

slide-32
SLIDE 32
  • 1. Geometry Propagation - Motivation
  • Operations on constants in kernel can undergo strength reduction

32

slide-33
SLIDE 33
  • 1. Geometry Propagation - Motivation
  • Operations on constants in kernel can undergo strength reduction
  • Loops of known size are easier to manipulate by the compiler

33

slide-34
SLIDE 34
  • 1. Geometry Propagation

1. Collect Host-Side kernel invocations

size_t offset = 0, threads = 2048, groupsize = 128;
cl_kernel kernel = clCreateKernel(program, "memcpy", &err);
clSetKernelArg(kernel, 0, sizeof(cl_mem), &tgtbuf);
clSetKernelArg(kernel, 1, sizeof(cl_mem), &srcbuf);
clSetKernelArg(kernel, 2, sizeof(int), &length);
clEnqueueNDRangeKernel(queue, kernel, 1, &offset, &threads, &groupsize,
                       0, NULL, NULL);

34


slide-36
SLIDE 36
  • 1. Geometry Propagation

1. Collect Host-Side kernel invocations
2. Identify associated kernels

size_t offset = 0, threads = 2048, groupsize = 128;
cl_kernel kernel = clCreateKernel(program, "memcpy", &err);
clSetKernelArg(kernel, 0, sizeof(cl_mem), &tgtbuf);
clSetKernelArg(kernel, 1, sizeof(cl_mem), &srcbuf);
clSetKernelArg(kernel, 2, sizeof(int), &length);
clEnqueueNDRangeKernel(queue, kernel, 1, &offset, &threads, &groupsize,
                       0, NULL, NULL);

36


slide-39
SLIDE 39
  • 1. Geometry Propagation

1. Collect Host-Side kernel invocations
2. Identify associated kernels
3. Identify call geometry

size_t offset = 0, threads = 2048, groupsize = 128;
cl_kernel kernel = clCreateKernel(program, "memcpy", &err);
clSetKernelArg(kernel, 0, sizeof(cl_mem), &tgtbuf);
clSetKernelArg(kernel, 1, sizeof(cl_mem), &srcbuf);
clSetKernelArg(kernel, 2, sizeof(int), &length);
clEnqueueNDRangeKernel(queue, kernel, 1, &offset, &threads, &groupsize,
                       0, NULL, NULL);

39


slide-41
SLIDE 41
  • 1. Geometry Propagation

1. Collect Host-Side kernel invocations
2. Identify associated kernels
3. Identify call geometry
4. Discovered constants are passed to the device compiler

size_t offset = 0, threads = 2048, groupsize = 128;
cl_kernel kernel = clCreateKernel(program, "memcpy", &err);
clSetKernelArg(kernel, 0, sizeof(cl_mem), &tgtbuf);
clSetKernelArg(kernel, 1, sizeof(cl_mem), &srcbuf);
clSetKernelArg(kernel, 2, sizeof(int), &length);
clEnqueueNDRangeKernel(queue, kernel, 1, &offset, &threads, &groupsize,
                       0, NULL, NULL);

41
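As a concrete instance of the strength reduction this enables (an illustrative example, not from the slides): once the host analysis proves groupsize is the constant 128, a local-ID computation that would otherwise need a hardware divider reduces to a bitwise mask:

```c
#include <assert.h>

/* Without a known group size the compiler must keep the modulo,
 * which costs a hardware divider on the FPGA. */
static int local_id_generic(int gid, int groupsize) {
    return gid % groupsize;
}

/* With groupsize propagated as the constant 128 (a power of two),
 * the same value needs only a bitwise AND. */
static int local_id_const128(int gid) {
    return gid & 127;
}
```

The two agree for every non-negative global ID, which is exactly the range OpenCL guarantees.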

slide-42
SLIDE 42

Prototype Transformations

1. Geometry Propagation
2. NDRange To Loop
3. Restricted Pointer Analysis
4. Reduction Dependence Elimination

42

slide-43
SLIDE 43
  • 2. NDRange To Loop - Motivation

1. Allow threads to be pipelined together, and share intermediate products

43

slide-44
SLIDE 44
  • 2. NDRange To Loop - Motivation

1. Allow threads to be pipelined together, and share intermediate products
2. Enable further optimization: e.g. Reduction Dependence Elimination

44

slide-45
SLIDE 45
  • 2. NDRange To Loop - Motivation

1. Allow threads to be pipelined together, and share intermediate products
2. Enable further optimization: e.g. Reduction Dependence Elimination
3. Allow inner loops in kernels to be pipelined

45

slide-46
SLIDE 46
  • 2. NDRange To Loop

1. Inject kernel parameters for non-constant geometry

__kernel void kernel(...) {
  int index = get_global_id(0);
  f(index);
  barrier(CLK_GLOBAL_MEM_FENCE);
  g(index);
}

46

slide-47
SLIDE 47
  • 2. NDRange To Loop

1. Inject kernel parameters for non-constant geometry

__kernel void kernel(..., int dims, int gbl_offset_x,
                     int gbl_size_x, int lcl_size_x, ...) {
  int index = get_global_id(0);
  f(index);
  barrier(CLK_GLOBAL_MEM_FENCE);
  g(index);
}

47

slide-48
SLIDE 48
  • 2. NDRange To Loop

1. Inject kernel parameters for non-constant geometry
2. Detect number of dimensions

__kernel void kernel(..., int dims, int gbl_offset_x,
                     int gbl_size_x, int lcl_size_x, ...) {
  int index = get_global_id(0);
  f(index);
  barrier(CLK_GLOBAL_MEM_FENCE);
  g(index);
}

48


slide-50
SLIDE 50
  • 2. NDRange To Loop

1. Inject kernel parameters for non-constant geometry
2. Detect number of dimensions
3. Identify synchronization points

__kernel void kernel(..., int dims, int gbl_offset_x,
                     int gbl_size_x, int lcl_size_x, ...) {
  int index = get_global_id(0);
  f(index);
  barrier(CLK_GLOBAL_MEM_FENCE);
  g(index);
}

50


slide-52
SLIDE 52
  • 2. NDRange To Loop

1. Inject kernel parameters for non-constant geometry
2. Detect number of dimensions
3. Identify synchronization points
4. Wrap unsynchronized portions in loops

__kernel void kernel(..., int dims, int gbl_offset_x,
                     int gbl_size_x, int lcl_size_x, ...) {
  int index = get_global_id(0);
  f(index);
  barrier(CLK_GLOBAL_MEM_FENCE);
  g(index);
}

52

slide-53
SLIDE 53
  • 2. NDRange To Loop

1. Inject kernel parameters for non-constant geometry
2. Detect number of dimensions
3. Identify synchronization points
4. Wrap unsynchronized portions in loops

__kernel void kernel(..., int dims, int gbl_offset_x,
                     int gbl_size_x, int lcl_size_x, ...) {
  for (int i = 0; i < gbl_size_x; i += lcl_size_x)
    for (int j = 0; j < lcl_size_x; j++) {
      int index = i + j;
      f(index);
    }
  for (int i = 0; i < gbl_size_x; i += lcl_size_x)
    for (int j = 0; j < lcl_size_x; j++) {
      int index = i + j;
      g(index);
    }
}

53
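The barrier handling can be sanity-checked on the host. The sketch below is a hypothetical plain-C emulation (f_done and g_after_f are illustrative instrumentation) showing that splitting at the barrier makes every work-item finish f() before any work-item runs g():

```c
#include <assert.h>

/* Geometry for the emulation -- arbitrary illustrative values. */
enum { GBL_SIZE = 64, LCL_SIZE = 8 };

static int f_done[GBL_SIZE]; /* set once f() ran for that index        */
static int g_after_f = 1;    /* cleared if some g() ran before all f() */

static void f(int index) { f_done[index] = 1; }

static void g(int index) {
    (void)index;
    for (int k = 0; k < GBL_SIZE; k++)
        if (!f_done[k])
            g_after_f = 0; /* the barrier guarantee would be broken */
}

/* The loop nest NDRangeToLoop generates: one workgroup/work-item
 * loop nest per barrier-delimited region of the kernel. */
static void run_transformed_kernel(void) {
    for (int i = 0; i < GBL_SIZE; i += LCL_SIZE)
        for (int j = 0; j < LCL_SIZE; j++)
            f(i + j);
    for (int i = 0; i < GBL_SIZE; i += LCL_SIZE)
        for (int j = 0; j < LCL_SIZE; j++)
            g(i + j);
}
```

Running the first loop nest to completion before starting the second is precisely what the original barrier required.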

slide-54
SLIDE 54

Prototype Transformations

1. Geometry Propagation
2. NDRange To Loop
3. Restricted Pointer Analysis
4. Reduction Dependence Elimination

54

slide-55
SLIDE 55
  • 3. Restricted Pointer Analysis - Motivation
  • Pipelining of FPGA loops often fails due to aliased memory operations

55

slide-56
SLIDE 56
  • 3. Restricted Pointer Analysis - Motivation
  • Pipelining of FPGA loops often fails due to aliased memory operations
  • Marking parameters as restrict dramatically reduces false aliasing

56

slide-57
SLIDE 57
  • 3. Restricted Pointer Analysis - Motivation
  • Pipelining of FPGA loops often fails due to aliased memory operations
  • Marking parameters as restrict dramatically reduces false aliasing
  • Detecting non-aliasing parameters must be done through host analysis

57

slide-58
SLIDE 58
  • 3. Restricted Pointer Analysis

1. (Host) Identify pointed-to host buffers

__kernel void memcpy(char* tgt, char* src, int length) {
  for (int i = 0; i < length; i++) {
    tgt[i] = src[i];
  }
}

cl_mem srcbuf = clCreateBuffer(...);
cl_mem tgtbuf = clCreateBuffer(...);
clSetKernelArg(kernel, 0, sizeof(cl_mem), &tgtbuf);
clSetKernelArg(kernel, 1, sizeof(cl_mem), &srcbuf);
clSetKernelArg(kernel, 2, sizeof(int), &length);
clEnqueueTask(queue, kernel, 0, NULL, NULL);

58

slide-59
SLIDE 59
  • 3. Restricted Pointer Analysis

1. (Host) Identify pointed-to host buffers
2. Verify buffer distinction

__kernel void memcpy(char* tgt, char* src, int length) {
  for (int i = 0; i < length; i++) {
    tgt[i] = src[i];
  }
}

cl_mem srcbuf = clCreateBuffer(...);
cl_mem tgtbuf = clCreateBuffer(...);
clSetKernelArg(kernel, 0, sizeof(cl_mem), &tgtbuf);
clSetKernelArg(kernel, 1, sizeof(cl_mem), &srcbuf);
clSetKernelArg(kernel, 2, sizeof(int), &length);
clEnqueueTask(queue, kernel, 0, NULL, NULL);

59

slide-60
SLIDE 60
  • 3. Restricted Pointer Analysis

1. (Host) Identify pointed-to host buffers
2. Verify buffer distinction
3. (Device) Mark parameters restricted

__kernel void memcpy(char *restrict tgt, char *restrict src, int length) {
  for (int i = 0; i < length; i++) {
    tgt[i] = src[i];
  }
}

cl_mem srcbuf = clCreateBuffer(...);
cl_mem tgtbuf = clCreateBuffer(...);
clSetKernelArg(kernel, 0, sizeof(cl_mem), &tgtbuf);
clSetKernelArg(kernel, 1, sizeof(cl_mem), &srcbuf);
clSetKernelArg(kernel, 2, sizeof(int), &length);
clEnqueueTask(queue, kernel, 0, NULL, NULL);

60
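A plain-C analogue (hypothetical) shows what the restrict qualifier asserts: the two pointers never alias, which the host analysis proved by observing that they come from distinct clCreateBuffer calls. Calling such a function with overlapping buffers would be undefined behaviour:

```c
#include <assert.h>
#include <string.h>

/* C99 analogue of the transformed kernel. restrict is a promise, not
 * a check -- it is only safe here because the host compiler proved
 * the two buffers are distinct allocations. */
static void memcpy_restrict(char *restrict tgt, const char *restrict src,
                            int length) {
    for (int i = 0; i < length; i++)
        tgt[i] = src[i];
}
```

With the promise in place, the compiler (and the FPGA pipeliner) no longer has to assume a store to tgt may change a later load from src.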

slide-61
SLIDE 61

Prototype Transformations

1. Geometry Propagation
2. NDRange To Loop
3. Restricted Pointer Analysis
4. Reduction Dependence Elimination

61

slide-62
SLIDE 62
  • 4. Reduction Dependency Elimination - Motivation
  • Floating-point operation latency means long initiation intervals for reduction loops - the pipeline stalls on every iteration

62

slide-63
SLIDE 63
  • 4. Reduction Dependency Elimination - Motivation
  • Floating-point operation latency means long initiation intervals for reduction loops - the pipeline stalls on every iteration
  • Data dependency on the reduction variable can be resolved by using a rotating register to modulo schedule the reduction computation

63

slide-64
SLIDE 64
  • 4. Reduction Dependency Elimination - Motivation
  • Floating-point operation latency means long initiation intervals for reduction loops - the pipeline stalls on every iteration
  • Data dependency on the reduction variable can be resolved by using a rotating register to modulo schedule the reduction computation
  • Pipelined reduction via a rotating register is an idiom recognized by the Intel FPGA OpenCL compiler and efficiently implemented using a shift register in hardware

64

slide-65
SLIDE 65

1. Detect reduction idiom in loops

  • 4. Reduction Dependency Elimination

__kernel void vec_sum(__global double *arr, __global double *res, int N) {
  double temp_sum = 0;
  for (int i = 0; i < N; ++i)
    temp_sum += arr[i];
  *res = temp_sum;
}

65


slide-67
SLIDE 67

1. Detect reduction idiom in loops
2. Create and initialize a “shift-register” array

  • 4. Reduction Dependency Elimination

__kernel void vec_sum(__global double *arr, __global double *res, int N) {
  double shift_reg[II_CYCLES + 1];
  for (int j = 0; j < II_CYCLES + 1; ++j)
    shift_reg[j] = 0;
  double temp_sum = 0;
  for (int i = 0; i < N; ++i)
    temp_sum += arr[i];
  *res = temp_sum;
}

67

slide-68
SLIDE 68

1. Detect reduction idiom in loops
2. Create and initialize a “shift-register” array
3. Rewrite the reduction update to store into the shift register’s tail element

  • 4. Reduction Dependency Elimination

__kernel void vec_sum(__global double *arr, __global double *res, int N) {
  double shift_reg[II_CYCLES + 1];
  for (int j = 0; j < II_CYCLES + 1; ++j)
    shift_reg[j] = 0;
  double temp_sum = 0;
  for (int i = 0; i < N; ++i)
    shift_reg[II_CYCLES] = shift_reg[0] + arr[i];
  *res = temp_sum;
}

68

slide-69
SLIDE 69

1. Detect reduction idiom in loops
2. Create and initialize a “shift-register” array
3. Rewrite the reduction update to store into the shift register’s tail element
4. Shift the values of the shift register down

  • 4. Reduction Dependency Elimination

__kernel void vec_sum(__global double *arr, __global double *res, int N) {
  double shift_reg[II_CYCLES + 1];
  for (int j = 0; j < II_CYCLES + 1; ++j)
    shift_reg[j] = 0;
  double temp_sum = 0;
  for (int i = 0; i < N; ++i) {
    shift_reg[II_CYCLES] = shift_reg[0] + arr[i];
    for (int k = 0; k < II_CYCLES; ++k)
      shift_reg[k] = shift_reg[k+1];
  }
  *res = temp_sum;
}

69

slide-70
SLIDE 70

1. Detect reduction idiom in loops
2. Create and initialize a “shift-register” array
3. Rewrite the reduction update to store into the shift register’s tail element
4. Shift the values of the shift register down
5. Compute the final reduction value by summing shift register values

  • 4. Reduction Dependency Elimination

__kernel void vec_sum(__global double *arr, __global double *res, int N) {
  double shift_reg[II_CYCLES + 1];
  for (int j = 0; j < II_CYCLES + 1; ++j)
    shift_reg[j] = 0;
  double temp_sum = 0;
  for (int i = 0; i < N; ++i) {
    shift_reg[II_CYCLES] = shift_reg[0] + arr[i];
    for (int k = 0; k < II_CYCLES; ++k)
      shift_reg[k] = shift_reg[k+1];
  }
  for (int m = 0; m < II_CYCLES; ++m)
    temp_sum += shift_reg[m];
  *res = temp_sum;
}

70
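The full transformation can be validated in plain C. Below is a hypothetical host-side emulation (II_CYCLES = 8 is an arbitrary illustrative latency): the shift register behaves as II_CYCLES independent partial accumulators, so for integral inputs the result matches a sequential sum exactly:

```c
#include <assert.h>

#define II_CYCLES 8

/* Plain-C emulation (hypothetical) of the transformed reduction. */
static double vec_sum_shift_reg(const double *arr, int n) {
    double shift_reg[II_CYCLES + 1];
    for (int j = 0; j < II_CYCLES + 1; ++j)
        shift_reg[j] = 0;
    for (int i = 0; i < n; ++i) {
        /* New partial sum enters at the tail... */
        shift_reg[II_CYCLES] = shift_reg[0] + arr[i];
        /* ...and every element shifts down one slot. */
        for (int k = 0; k < II_CYCLES; ++k)
            shift_reg[k] = shift_reg[k + 1];
    }
    /* Collapse the II_CYCLES partial sums into the final value. */
    double temp_sum = 0;
    for (int m = 0; m < II_CYCLES; ++m)
        temp_sum += shift_reg[m];
    return temp_sum;
}
```

Each slot accumulates every II_CYCLES-th element, which is what breaks the loop-carried dependence on a single accumulator and lets the FPGA pipeline issue one iteration per cycle.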

slide-71
SLIDE 71

Evaluation

  • OpenCL kernels taken from Rodinia benchmark suite[1]
  • Execution time measured on a DE5-Net Development Kit (Stratix V)

[1] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S.-H. Lee, and K. Skadron. Rodinia: A Benchmark Suite for Heterogeneous Computing. In Proceedings of the IEEE International Symposium on Workload Characterization (IISWC), pp. 44-54, Oct. 2009.

71

slide-72
SLIDE 72

Transformation Applicability

* An opportunity was found, but hurt performance
** Benchmarks that could not be transformed by NDRangeToLoop were excluded from evaluation

72

slide-73
SLIDE 73

Results

Gaussian: 7x slower
Hotspot3D: 2.6x slower
Kmeans: 2.8x faster
NN: 6% faster
SRAD: 6% slower

73

slide-74
SLIDE 74

What Happened?

74

slide-75
SLIDE 75

Analysis

1. Inter-compiler Communication 2. Compiler Versions 3. Missing Heuristics

75

slide-76
SLIDE 76

Inter-Compiler Communication

  • Loops created by NDRangeToLoop carry no dependencies, but the Intel FPGA OpenCL Compiler doesn’t read that information

76

slide-77
SLIDE 77

Inter-Compiler Communication

  • Loops created by NDRangeToLoop carry no dependencies, but the Intel FPGA OpenCL Compiler doesn’t read that information
  • Often, the Intel FPGA OpenCL Compiler cannot rediscover the parallelism, and cannot pipeline

77

slide-78
SLIDE 78

Compiler Versions

  • The Intel FPGA OpenCL Compiler is built on LLVM 3.0 (circa 2011)

78

slide-79
SLIDE 79

Compiler Versions

  • The Intel FPGA OpenCL Compiler is built on LLVM 3.0 (circa 2011)
  • Modern LLVM Analyses & Transformations ineffective or unavailable:

AssumptionCache, Loop Vectorization, Interprocedural Alias Analysis

79

slide-80
SLIDE 80

Missing Heuristics

  • Our prototype cannot access the Intel FPGA OpenCL Compiler’s heuristics, only IR between stages

80

slide-81
SLIDE 81

Missing Heuristics

  • Our prototype cannot access the Intel FPGA OpenCL Compiler’s heuristics, only IR between stages
  • Our transformations do not know if they’re helping or hurting

81

slide-82
SLIDE 82

Missing Heuristics

  • Our prototype cannot access the Intel FPGA OpenCL Compiler’s heuristics, only IR between stages
  • Our transformations do not know if they’re helping or hurting
  • An open-source compiler would help a lot with this

82

slide-83
SLIDE 83

Conclusion

83

slide-84
SLIDE 84

Compilers can perform much more powerful transformations when able to inspect and affect both host and device compilation.

84

slide-85
SLIDE 85

Compilers can perform much more powerful transformations when able to inspect and affect both host and device compilation. Deep integration is required to determine if transformations improve performance.

85

slide-86
SLIDE 86

Contact Me

Taylor Lloyd - tjlloyd@ualberta.ca

86