Using Intra-Core Loop-Task Accelerators to Improve the Productivity and Performance of Task-Based Parallel Programs - PowerPoint PPT Presentation



SLIDE 1

Cornell University, Ji Kim, 1/21

Ji Kim, Shunning Jiang, Christopher Torng, Moyang Wang, Shreesha Srinath, Berkin Ilbeyi, Khalid Al-Hawaj, and Christopher Batten

Computer Systems Laboratory Cornell University

50th ACM/IEEE Int’l Symp. on Microarchitecture, MICRO-2017

Using Intra-Core Loop-Task Accelerators to Improve the Productivity and Performance of Task-Based Parallel Programs
SLIDE 2

Inter-Core

  • Task-Based Parallel Programming Frameworks

○ Intel TBB, Cilk

Intra-Core

  • Packed-SIMD Vectorization

○ Intel AVX, Arm NEON


SLIDE 6

Challenges of Combining Tasks and Vectors


Challenge #1: Intra-Core Parallel Abstraction Gap

void app_kernel_tbb_avx(int N, float* src, float* dst) {
  // Pack data into padded aligned chunks
  //   src -> src_chunks[NUM_CHUNKS * SIMD_WIDTH]
  //   dst -> dst_chunks[NUM_CHUNKS * SIMD_WIDTH]
  ...
  // Use TBB across cores
  parallel_for(range(0, NUM_CHUNKS, TASK_SIZE), [&](range r) {
    for (int i = r.begin(); i < r.end(); i++) {
      // Use packed-SIMD within a core
      #pragma simd vlen(SIMD_WIDTH)
      for (int j = 0; j < SIMD_WIDTH; j++) {
        if (src_chunks[i][j] > THRESHOLD)
          aligned_dst[i] = DoLightCompute(aligned_src[i]);
        else
          aligned_dst[i] = DoHeavyCompute(aligned_src[i]);
        ...

SLIDE 8

Challenges of Combining Tasks and Vectors

Challenge #1: Intra-Core Parallel Abstraction Gap
Challenge #2: Inefficient Execution of Irregular Tasks

SLIDE 9

Native Performance Results

[Figure: native performance comparison, regular vs. irregular kernels]

SLIDE 10

Loop-Task Accelerator (LTA) Vision


  • Motivation
  • Challenge #1: LTA SW
  • Challenge #2: LTA HW
  • Evaluation
  • Conclusion
SLIDE 11

LTA SW: API and ISA Hint

void app_kernel_lta(int N, float* src, float* dst) {
  LTA_PARALLEL_FOR(0, N, (dst, src), ({
    if (src[i] > THRESHOLD)
      dst[i] = DoComputeLight(src[i]);
    else
      dst[i] = DoComputeHeavy(src[i]);
  }));
}

void loop_task_func(void* a, int start, int end, int step=1);


Hint that hardware can potentially accelerate task execution

SLIDE 12

LTA SW: Task-Based Runtime


SLIDE 13

Loop-Task Accelerator (LTA) Vision


  • Motivation
  • Challenge #1: LTA SW
  • Challenge #2: LTA HW
  • Evaluation
  • Conclusion
SLIDE 14

LTA HW: Fully-Coupled LTA


Coupling better for regular workloads (amortize frontend/memory)

SLIDE 15

LTA HW: Fully-Decoupled LTA


Decoupling better for irregular workloads (hide latencies)

SLIDE 16

LTA HW: Task-Coupling Taxonomy


+ Higher performance on irregular workloads
- Higher area/energy

A task group executes in lock-step. More decoupling (more task groups) in either space or time improves performance on irregular workloads at the cost of area/energy.

SLIDE 17

Does it matter whether we decouple in space or in time?

LTA HW: Task-Coupling Taxonomy


SLIDE 18

Loop-Task Accelerator (LTA) Vision


  • Motivation
  • Challenge #1: LTA SW
  • Challenge #2: LTA HW
  • Evaluation
  • Conclusion
SLIDE 19

Evaluation: Methodology

  • Ported 16 application kernels from PBBS and in-house benchmark suites with diverse loop-task parallelism

○ Scientific computing: N-body simulation, MRI-Q, SGEMM
○ Image processing: bilateral filter, RGB-to-CMYK, DCT
○ Graph algorithms: breadth-first search, maximal matching
○ Search/sort algorithms: radix sort, substring matching

  • gem5 + PyMTL co-simulation for cycle-level performance
  • Component/event-based area/energy modeling

○ Uses an area/energy dictionary backed by VLSI results and McPAT


SLIDE 20

Evaluation: Design Space Exploration


[Figure: design space exploration under resource constraints; axes: spatial decoupling vs. temporal decoupling]

SLIDE 21

Evaluation: Design Space Exploration


[Figure: design space exploration under resource constraints; axes: spatial decoupling vs. temporal decoupling]

Prefer spatial decoupling over temporal decoupling


SLIDE 23

Evaluation: Design Space Exploration


[Figure: design space exploration under resource constraints; axes: spatial decoupling vs. temporal decoupling]

Reduce spatial decoupling to improve energy efficiency

SLIDE 24

Evaluation: Multicore LTA Performance

[Figure: multicore LTA performance, regular vs. irregular kernels; labeled speedups: 4.4x, 2.9x, 10.7x, 5.2x]

SLIDE 25

Evaluation: Area-Normalized Performance

[Figure: area-normalized performance; labeled speedups: 1.8x, 1.6x, 1.2x]

SLIDE 26

Loop-Task Accelerator (LTA) Vision


  • Motivation
  • Challenge #1: LTA SW
  • Challenge #2: LTA HW
  • Evaluation
  • Conclusion
SLIDE 27

Related Work

  • Challenge #1: Intra-Core Parallel Abstraction Gap

○ Persistent threads for GPGPUs (S. Tzeng et al.)
○ OpenCL, OpenMP, C++ AMP
○ Cilk for vectorization (B. Ren et al.)
○ And more...

  • Challenge #2: Inefficient Execution of Irregular Tasks

○ Variable warp sizing (T. Rogers et al.)
○ Temporal SIMT (S. Keckler et al.)
○ Vector-lane threading (S. Rivoire et al.)
○ And more...

  • Please see the paper for more detailed references!


SLIDE 28

Take-Away Points


  • The intra-core parallel abstraction gap and inefficient execution of irregular tasks are fundamental challenges for CMPs
  • LTAs address both challenges with a lightweight ISA hint and a flexible microarchitectural template
  • Results suggest that in a resource-constrained environment, architects should favor spatial decoupling over temporal decoupling
  • A first step towards accelerating a wider variety of task parallelism