Using Intra-Core Loop-Task Accelerators to Improve the Productivity and Performance of Task-Based Parallel Programs - PowerPoint PPT Presentation



SLIDE 1

Cornell University, Ji Kim, 1/21

Ji Kim, Shunning Jiang, Christopher Torng, Moyang Wang, Shreesha Srinath, Berkin Ilbeyi, Khalid Al-Hawaj, and Christopher Batten

Computer Systems Laboratory Cornell University

50th ACM/IEEE Int’l Symp. on Microarchitecture, MICRO-2017

Using Intra-Core Loop-Task Accelerators to Improve the Productivity and Performance of Task-Based Parallel Programs
SLIDE 2

Inter-Core

  • Task-Based Parallel Programming Frameworks

○ Intel TBB, Cilk

Intra-Core

  • Packed-SIMD Vectorization

○ Intel AVX, Arm NEON


SLIDE 6

Challenges of Combining Tasks and Vectors


Challenge #1: Intra-Core Parallel Abstraction Gap

void app_kernel_tbb_avx(int N, float* src, float* dst) {
  // Pack data into padded aligned chunks
  //   src -> src_chunks[NUM_CHUNKS * SIMD_WIDTH]
  //   dst -> dst_chunks[NUM_CHUNKS * SIMD_WIDTH]
  ...
  // Use TBB across cores
  parallel_for(range(0, NUM_CHUNKS, TASK_SIZE), [&](range r) {
    for (int i = r.begin(); i < r.end(); i++) {
      // Use packed-SIMD within a core
      #pragma simd vlen(SIMD_WIDTH)
      for (int j = 0; j < SIMD_WIDTH; j++) {
        if (src_chunks[i][j] > THRESHOLD)
          aligned_dst[i] = DoLightCompute(aligned_src[i]);
        else
          aligned_dst[i] = DoHeavyCompute(aligned_src[i]);
        ...

SLIDE 8

Challenges of Combining Tasks and Vectors

Challenge #1: Intra-Core Parallel Abstraction Gap
Challenge #2: Inefficient Execution of Irregular Tasks

SLIDE 9

Native Performance Results

[Figure: native performance comparison, regular vs. irregular kernels]

SLIDE 10

Loop-Task Accelerator (LTA) Vision


  • Motivation
  • Challenge #1: LTA SW
  • Challenge #2: LTA HW
  • Evaluation
  • Conclusion
SLIDE 11

LTA SW: API and ISA Hint

void app_kernel_lta(int N, float* src, float* dst) {
  LTA_PARALLEL_FOR(0, N, (dst, src), ({
    if (src[i] > THRESHOLD)
      dst[i] = DoComputeLight(src[i]);
    else
      dst[i] = DoComputeHeavy(src[i]);
  }));
}

void loop_task_func(void* a, int start, int end, int step=1);


Hint that hardware can potentially accelerate task execution

SLIDE 12

LTA SW: Task-Based Runtime


SLIDE 13

Loop-Task Accelerator (LTA) Vision


  • Motivation
  • Challenge #1: LTA SW
  • Challenge #2: LTA HW
  • Evaluation
  • Conclusion
SLIDE 14

LTA HW: Fully-Coupled LTA


Coupling better for regular workloads (amortize frontend/memory)

SLIDE 15

LTA HW: Fully-Decoupled LTA


Decoupling better for irregular workloads (hide latencies)

SLIDE 16

LTA HW: Task-Coupling Taxonomy


+ Higher performance on irregular workloads
- Higher area/energy

A task group executes in lock-step. More decoupling (more task groups) in either space or time improves performance on irregular workloads at the cost of area/energy.

SLIDE 17

Does it matter whether we decouple in space or in time?

LTA HW: Task-Coupling Taxonomy


SLIDE 18

Loop-Task Accelerator (LTA) Vision


  • Motivation
  • Challenge #1: LTA SW
  • Challenge #2: LTA HW
  • Evaluation
  • Conclusion
SLIDE 19

Evaluation: Methodology

  • Ported 16 application kernels from PBBS and in-house benchmark suites with diverse loop-task parallelism

○ Scientific computing: N-body simulation, MRI-Q, SGEMM
○ Image processing: bilateral filter, RGB-to-CMYK, DCT
○ Graph algorithms: breadth-first search, maximal matching
○ Search/sort algorithms: radix sort, substring matching

  • gem5 + PyMTL co-simulation for cycle-level performance
  • Component/event-based area/energy modeling

○ Uses an area/energy dictionary backed by VLSI results and McPAT


SLIDE 20

Evaluation: Design Space Exploration


[Figure: design space exploration under resource constraints; axes: spatial decoupling vs. temporal decoupling]

SLIDE 21

Evaluation: Design Space Exploration


[Figure: design space exploration under resource constraints; axes: spatial decoupling vs. temporal decoupling]

Prefer spatial decoupling over temporal decoupling


SLIDE 23

Evaluation: Design Space Exploration


[Figure: design space exploration under resource constraints; axes: spatial decoupling vs. temporal decoupling]

Reduce spatial decoupling to improve energy efficiency

SLIDE 24

Evaluation: Multicore LTA Performance

[Figure: multicore LTA performance, regular vs. irregular kernels; labeled speedups: 4.4x, 2.9x, 10.7x, 5.2x]

SLIDE 25

Evaluation: Area-Normalized Performance

[Figure: area-normalized performance; labeled speedups: 1.8x, 1.6x, 1.2x]

SLIDE 26

Loop-Task Accelerator (LTA) Vision


  • Motivation
  • Challenge #1: LTA SW
  • Challenge #2: LTA HW
  • Evaluation
  • Conclusion
SLIDE 27

Related Work

  • Challenge #1: Intra-Core Parallel Abstraction Gap

○ Persistent threads for GPGPUs (S. Tzeng et al.)
○ OpenCL, OpenMP, C++ AMP
○ Cilk for vectorization (B. Ren et al.)
○ And more...

  • Challenge #2: Inefficient Execution of Irregular Tasks

○ Variable warp sizing (T. Rogers et al.)
○ Temporal SIMT (S. Keckler et al.)
○ Vector-lane threading (S. Rivoire et al.)
○ And more...

  • Please see the paper for more detailed references!


SLIDE 28

Take-Away Points


  • The intra-core parallel abstraction gap and inefficient execution of irregular tasks are fundamental challenges for CMPs
  • LTAs address both challenges with a lightweight ISA hint and a flexible microarchitectural template
  • Results suggest that in a resource-constrained environment, architects should favor spatial decoupling over temporal decoupling
  • A first step towards accelerating a wider variety of task parallelism