Microarchitectural Mechanisms to Exploit Value Structure in SIMT - - PowerPoint PPT Presentation

microarchitectural mechanisms to exploit value structure
SMART_READER_LITE
LIVE PREVIEW

Microarchitectural Mechanisms to Exploit Value Structure in SIMT - - PowerPoint PPT Presentation

Microarchitectural Mechanisms to Exploit Value Structure in SIMT Architectures Ji Kim, Christopher Torng, Shreesha Srinath, Derek Lockhart, and Christopher Batten Cornell University Cornell University IEEE/ACM International Symposium on


slide-1
SLIDE 1

Cornell University Ji Kim 1/20 Cornell University Ji Kim 1/20 Cornell University Ji Kim

Cornell University

IEEE/ACM International Symposium on Computer Architecture 2013 (ISCA-40)

Cornell University

IEEE/ACM International Symposium on Computer Architecture 2013 (ISCA-40)

Ji Kim, Christopher Torng, Shreesha Srinath, Derek Lockhart, and Christopher Batten

Microarchitectural Mechanisms to Exploit Value Structure in SIMT Architectures

1/27

slide-2
SLIDE 2

Cornell University Ji Kim 2/20 Cornell University Ji Kim 2/20 Cornell University Ji Kim

Motivation

Value Structure occurs when the same operation uses values across threads which can be represented as a compact function.

  • SIMT architectures exploit:
  • Control Structure (i.e. common instruction fetch/decode/issue)
  • Memory-Access Structure (i.e. memory coalescing)
  • Primary research questions:
  • How does value structure impact control and memory-access structure?
  • How can we realistically implement hardware mechanisms to exploit

value structure to improve performance and energy-efficiency?

         2/27 Motivation GP-SIMT vs. FG-SIMT Value Structure FG-SIMT Baseline Compact Affine Execution Evaluation

slide-3
SLIDE 3

Cornell University Ji Kim 3/20 Cornell University Ji Kim 3/20 Cornell University Ji Kim

Presentation Outline

  • General-Purpose vs. Fine-Grain SIMT
  • Characterizing Value Structure
  • FG-SIMT Baseline Architecture
  • Compact Affine Execution
  • Evaluation

3/27 Motivation GP-SIMT vs. FG-SIMT Value Structure FG-SIMT Baseline Compact Affine Execution Evaluation

slide-4
SLIDE 4

Cornell University Ji Kim 4/20 Cornell University Ji Kim 4/20 Cornell University Ji Kim

Why GP-SIMT and FG-SIMT?

  • Holistic approach for evaluating on different SIMT architectures
  • GP-SIMT as a model for traditional SIMT architecture
  • Focus on exploiting inter-warp parallelism
  • FG-SIMT as our own alternative SIMT architecture that we are building from

the ground up

  • Targeting flexible, compute-focused data-parallel accelerators
  • Focus on exploiting intra-warp parallelism, area-efficiency
  • Build credibility with FG-SIMT with cycle time, area, and energy analysis

                                              4/27 Motivation GP-SIMT vs. FG-SIMT Value Structure FG-SIMT Baseline Compact Affine Execution Evaluation

slide-5
SLIDE 5

Cornell University Ji Kim 5/20 Cornell University Ji Kim 5/20 Cornell University Ji Kim

__global__ void vsadd( int y[], int a )! {! int idx = // get thread index! ! y[idx] = y[idx] + a;! if ( y[idx] > THRESHOLD )! y[idx] = Y_MAX_VALUE;! }!

GP-SIMT Programming Model FG-SIMT Programming Model

  • Key difference is in how kernel is launched
  • GP-SIMT: HW-managed, coarse-grain kernel launch
  • FG-SIMT: HW/SW-managed, fine-grain kernel launch

                                              5/27 Motivation GP-SIMT vs. FG-SIMT Value Structure FG-SIMT Baseline Compact Affine Execution Evaluation

slide-6
SLIDE 6

Cornell University Ji Kim 6/20 Cornell University Ji Kim 6/20 Cornell University Ji Kim                            

     



                   



GP-SIMT Microarchitecture FG-SIMT Microarchitecture

  • Multi-warp execution
  • Single-ported register file
  • Wide, unbanked L1 cache
  • Integrated fetch/decode/issue
  • Distinct memory space
  • Single warp execution
  • Multi-ported register file
  • Shared, banked L1 cache
  • SW-programmable control processor
  • Unified memory space

6/27 Motivation GP-SIMT vs. FG-SIMT Value Structure FG-SIMT Baseline Compact Affine Execution Evaluation

slide-7
SLIDE 7

Cornell University Ji Kim 7/20 Cornell University Ji Kim 7/20 Cornell University Ji Kim

Presentation Outline

  • General-Purpose vs. Fine-Grain SIMT
  • Characterizing Value Structure
  • FG-SIMT Baseline Architecture
  • Compact Affine Execution
  • Evaluation

7/27 Motivation GP-SIMT vs. FG-SIMT Value Structure FG-SIMT Baseline Compact Affine Execution Evaluation

slide-8
SLIDE 8

Cornell University Ji Kim 8/20 Cornell University Ji Kim 8/20 Cornell University Ji Kim

         

Identifying Value Structure

__global__ void! vsadd( int y[], int a ) {! int idx = // get thread index! ! y[idx] = y[idx] + a;! if ( y[idx] > THRESHOLD )! y[idx] = Y_MAX_VALUE;! }! vsadd:! ld.sh R_a, M[A]! ld.sh R_ybase, M[Y]! add R_yptr, R_ybase, IDX! load R_y, M[R_yptr]! add R_y, R_y, R_a! store R_y, M[R_yptr]! branch R_y, THRESHOLD! imm R_max, Y_MAX_VALUE! store R_max, M[R_yptr]! stop! ! branc imm store stop R_a! R_ybase! R_ybase! R_a! R_max! R_max! THRESHOLD! Y_MAX_VALUE! R_yptr! IDX! R_yptr! R_yptr! R_yptr!

V(i) = b+i× s

Affine Value Structure:

44! 2! 32! 40! 3! 2! 2! 2! 32! 32! 32! 40! 40! 40! 0! 1! 2! 32! 36! 40! 127! 19! 89! 8!

8/27 Motivation GP-SIMT vs. FG-SIMT Value Structure FG-SIMT Baseline Compact Affine Execution Evaluation

slide-9
SLIDE 9

Cornell University Ji Kim 9/20 Cornell University Ji Kim 9/20 Cornell University Ji Kim

Why does value structure occur?

  • Operating on or loading constants
  • Common control flow (e.g., inner loops)
  • Manipulating addresses for structured memory access

vsadd:! ld.sh R_a, M[A]! ld.sh R_ybase, M[Y]! add R_yptr, R_ybase, IDX! load R_y, M[R_yptr]! add R_y, R_y, R_a! store R_y, M[R_yptr]! branch R_y, THRESHOLD! imm R_max, Y_MAX_VALUE! store R_max, M[R_yptr]! stop! ! branc imm store stop R_a! R_ybase! R_ybase! R_a! R_max! R_max! THRESHOLD! Y_MAX_VALUE! R_yptr! IDX! R_yptr! R_yptr! R_yptr! __global__ void! vsadd( int y[], int a ) {! int idx = // get thread index! ! y[idx] = y[idx] + a;! if ( y[idx] > THRESHOLD )! y[idx] = Y_MAX_VALUE;! }!

9/27 Motivation GP-SIMT vs. FG-SIMT Value Structure FG-SIMT Baseline Compact Affine Execution Evaluation

slide-10
SLIDE 10

Cornell University Ji Kim 10/20 Cornell University Ji Kim 10/20 Cornell University Ji Kim

How often does value structure occur?

  • GP-SIMT Hardware detection, Collange et al. HPPC-2009
  • On average, 34% of register reads and 22% of register writes are affine
  • GP-SIMT Software detection, Lee et al. CGO-2013
  • On average, 31% of combined register reads/writes are affine
  • Our own FG-SIMT functional simulation:
  • 30-80% of register reads and 20-70% of register writes are affine

10/27 Motivation GP-SIMT vs. FG-SIMT Value Structure FG-SIMT Baseline Compact Affine Execution Evaluation

slide-11
SLIDE 11

Cornell University Ji Kim 11/20 Cornell University Ji Kim 11/20 Cornell University Ji Kim

Presentation Outline

  • General-Purpose vs. Fine-Grain SIMT
  • Characterizing Value Structure
  • FG-SIMT Baseline Architecture
  • Compact Affine Execution
  • Evaluation

11/27 Motivation GP-SIMT vs. FG-SIMT Value Structure FG-SIMT Baseline Compact Affine Execution Evaluation

slide-12
SLIDE 12

Cornell University Ji Kim 12/20 Cornell University Ji Kim 12/20 Cornell University Ji Kim

FG-SIMT Baseline Example Execution

12/27 Motivation GP-SIMT vs. FG-SIMT Value Structure FG-SIMT Baseline Compact Affine Execution Evaluation

   

                               

  

    



               



                     

slide-13
SLIDE 13

Cornell University Ji Kim 13/20 Cornell University Ji Kim 13/20 Cornell University Ji Kim

Presentation Outline

  • General-Purpose vs. Fine-Grain SIMT
  • Characterizing Value Structure
  • FG-SIMT Baseline Architecture
  • Compact Affine Execution
  • Evaluation

13/27 Motivation GP-SIMT vs. FG-SIMT Value Structure FG-SIMT Baseline Compact Affine Execution Evaluation

slide-14
SLIDE 14

Cornell University Ji Kim 14/20 Cornell University Ji Kim 14/20 Cornell University Ji Kim

Tracking Value Structure

  • Store affine values in Affine SIMT

Register File (ASRF)

  • ASRF encodes affine values as base

and stride pair with uniform/affine tags

  • Registers are tagged as affine when:
  • Shared loads (e.g., ld.param, ld.sh)
  • Thread index (e.g., tid.x, IDX)
  • Result of affine arithmetic

14/27

  

  



               





Motivation GP-SIMT vs. FG-SIMT Value Structure FG-SIMT Baseline Compact Affine Execution Evaluation

slide-15
SLIDE 15

Cornell University Ji Kim 15/20 Cornell University Ji Kim 15/20 Cornell University Ji Kim

Exploiting Value Structure

  • Affine arithmetic
  • Affine memory operations

vsadd:! ld.sh R_a, M[A]! ld.sh R_ybase, M[Y]! add R_yptr, R_ybase, IDX! load R_y, M[R_yptr]! add R_y, R_y, R_a! store R_y, M[R_yptr]! branch R_y, THRESHOLD! imm R_max, Y_MAX_VALUE! store R_max, M[R_yptr]! stop! ! branc imm store stop

V0(i) = b0 +i× s0 V

1(i) = b 1 +i× s1

V0(i)+V

1(i) = (b0 + b 1)+i×(s0 + s1)

  • addiu!
  • lui!
  • addu!
  • subu!
  • sll/sllv!
  • srl/srlv!
  • sra/srav!
  • mul!
  • lw/lh/lb!
  • sw/sh/sb!

15/27 Motivation GP-SIMT vs. FG-SIMT Value Structure FG-SIMT Baseline Compact Affine Execution Evaluation

slide-16
SLIDE 16

Cornell University Ji Kim 16/20 Cornell University Ji Kim 16/20 Cornell University Ji Kim

Exploiting Value Structure

  • Affine arithmetic
  • Affine memory operations
  • Affine branches

vsadd:! ld.sh R_a, M[A]! ld.sh R_ybase, M[Y]! add R_yptr, R_ybase, IDX! load R_y, M[R_yptr]! add R_y, R_y, R_a! store R_y, M[R_yptr]! branch R_a, THRESHOLD! imm R_max, Y_MAX_VALUE! store R_max, M[R_yptr]! stop! ! branc imm store stop

V0(i) = b0 +i× s0 V

1(i) = b 1 +i× s1

V0(i)+V

1(i) = (b0 + b 1)+i×(s0 + s1)

  • addiu!
  • lui!
  • addu!
  • subu!
  • sll/sllv!
  • srl/srlv!
  • sra/srav!
  • mul!
  • lw/lh/lb!
  • sw/sh/sb!
  • beq/bne!
  • blez/bgez!
  • bltz/bgtz!

Consider the common case

  • f comparing uniform

registers

15/27 Motivation GP-SIMT vs. FG-SIMT Value Structure FG-SIMT Baseline Compact Affine Execution Evaluation

slide-17
SLIDE 17

Cornell University Ji Kim 17/20 Cornell University Ji Kim 17/20 Cornell University Ji Kim

   

  

  



               





          

Affine Arithmetic

Add parallel affine datapath for base/ stride computation

16/27 Motivation GP-SIMT vs. FG-SIMT Value Structure FG-SIMT Baseline Compact Affine Execution Evaluation

slide-18
SLIDE 18

Cornell University Ji Kim 18/20 Cornell University Ji Kim 18/20 Cornell University Ji Kim

Affine Arithmetic

16/27 Motivation GP-SIMT vs. FG-SIMT Value Structure FG-SIMT Baseline Compact Affine Execution Evaluation

   

  

  



               





                                              

slide-19
SLIDE 19

Cornell University Ji Kim 19/20 Cornell University Ji Kim 19/20 Cornell University Ji Kim

Affine Memory Operations

17/27 Motivation GP-SIMT vs. FG-SIMT Value Structure FG-SIMT Baseline Compact Affine Execution Evaluation

   

  

  



               





                                               

slide-20
SLIDE 20

Cornell University Ji Kim 20/20 Cornell University Ji Kim 20/20 Cornell University Ji Kim

Affine Branches

18/27 Motivation GP-SIMT vs. FG-SIMT Value Structure FG-SIMT Baseline Compact Affine Execution Evaluation

   

  

  



               





                                         

slide-21
SLIDE 21

Cornell University Ji Kim 21/20 Cornell University Ji Kim 21/20 Cornell University Ji Kim 19/27 Motivation GP-SIMT vs. FG-SIMT Value Structure FG-SIMT Baseline Compact Affine Execution Evaluation

   

  

  



               





                                         

   

                               

  

    



               



                     

slide-22
SLIDE 22

Cornell University Ji Kim 22/20 Cornell University Ji Kim 22/20 Cornell University Ji Kim

Three Types of Affine Expansions

20/27

See paper for more details

  • Affine Source Expansion
  • When generic instructions read affine operands
  • Expand out source operands, then execute on SIMT lanes
  • No performance overhead
  • Affine Destination Expansion
  • When affine instructions execute after divergence
  • Execute compactly on CP, then expand result on SIMT lanes
  • No performance overhead
  • Affine Pre-Destination Expansion
  • When affine register is overwritten after divergence
  • Expand destination first, then execute on SIMT lanes
  • Adds performance overhead

Motivation GP-SIMT vs. FG-SIMT Value Structure FG-SIMT Baseline Compact Affine Execution Evaluation

slide-23
SLIDE 23

Cornell University Ji Kim 23/20 Cornell University Ji Kim 23/20 Cornell University Ji Kim

Compact Affine Execution on GP-SIMT

  • Affine arithmetic avoids time spent in the operand

collection, execution, and writeback stages

  • Affine memory operations and branches reduce the

pressure on the operand collector

  • All mechanisms still improve energy-efficiency

21/27 Motivation GP-SIMT vs. FG-SIMT Value Structure FG-SIMT Baseline Compact Affine Execution Evaluation

slide-24
SLIDE 24

Cornell University Ji Kim 24/20 Cornell University Ji Kim 24/20 Cornell University Ji Kim

Presentation Outline

  • General-Purpose vs. Fine-Grain SIMT
  • Characterizing Value Structure
  • FG-SIMT Baseline Architecture
  • Compact Affine Execution
  • Evaluation

22/27 Motivation GP-SIMT vs. FG-SIMT Value Structure FG-SIMT Baseline Compact Affine Execution Evaluation

slide-25
SLIDE 25

Cornell University Ji Kim 25/20 Cornell University Ji Kim 25/20 Cornell University Ji Kim

Methodology

  • GP-SIMT modeled in GPGPU-Sim 3.0 with PTX front-end
  • FG-SIMT modeled in Verilog RTL
  • Area, cycle time, and energy results obtained using Synopsys

DesignCompiler, IC Compiler, and PrimeTime PX

  • TSMC 40nm standard cell library
  • Cycle time is 3.1ns with critical path through memory system
  • 5% area overhead for adding compact affine execution
  • Benchmarks from Parboil, Rodinia, and in-house applications

23/27 Motivation GP-SIMT vs. FG-SIMT Value Structure FG-SIMT Baseline Compact Affine Execution Evaluation

slide-26
SLIDE 26

Cornell University Ji Kim 26/20 Cornell University Ji Kim 26/20 Cornell University Ji Kim

FG-SIMT Detailed Microarchitecture

SGU SLU SAU1 SAU0 Lane 7 SSU SGU SLU SAU1 SAU0 Lane 2 SSU SGU SLU SAU1 SAU0 Lane 1 SSU SRF 124 × 32b 6r3w SGU SLU SAU1 SAU0 Lane 0 SSU Lane Control SAU0 SAU1 SGU SLU SSU Memory Coalescing Unit SMU SRF 124 × 32b 6r3w SRF 124 × 32b 6r3w SRF 124 × 32b 6r3w

256b 256b 32b 32b 32b 32b 32b 32b 32b 32b

SIU SLWQ SMRQ SLDQ SMRQ SLDQ SMRQ SLDQ SMRQ SLDQ BRMR Control Processor CP RF 31 × 32b 2r1w SIQ CP PC Microarch Kernel State

CEVS Execution Engine L1 Memory System

L1 D$ Bank 0 16 KB L1 D$ Bank 1 16 KB L1 D$ Bank 2 16 KB L1 D$ Bank 7 16 KB L1 I$ 16 KB D$ Request and Response Crossbars L2 Request and Response Crossbars

32b 32b 256b 256b 256b 256b 256b 256b 256b 256b

PWFB AWFR PC Mask SMRRQ Shared Load Cache

Eight SIMT lanes Dynamic reconvergence Five vector functional units with support for chaining Multi-ported banked regfile with support for executing 32 threads at a time Shared load cache for kernel input parameters Memory coalescing to dynamically create wide accesses

24/27 Motivation GP-SIMT vs. FG-SIMT Value Structure FG-SIMT Baseline Compact Affine Execution Evaluation

slide-27
SLIDE 27

Cornell University Ji Kim 27/20 Cornell University Ji Kim 27/20 Cornell University Ji Kim

cmult mfilt bsearch viterbi rsort dither strsearch rgb2cmykconv kmeans bfs sgemm bilat 0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4 1.6 1.8 Speedup

FG-SIMT Performance Results

25/27

       

Motivation GP-SIMT vs. FG-SIMT Value Structure FG-SIMT Baseline Compact Affine Execution Evaluation

slide-28
SLIDE 28

Cornell University Ji Kim 28/20 Cornell University Ji Kim 28/20 Cornell University Ji Kim

0.9 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7

Task/Second

0.6 0.7 0.8 0.9 1.0 1.1 1.2

Dynamic Energy/Task

cmult mfilt bsearch viterbi rsort dither strsearch rgb2cmyk conv kmeans bfs sgemm bilateral

FG-SIMT Energy vs. Performance Results

26/27 Motivation GP-SIMT vs. FG-SIMT Value Structure FG-SIMT Baseline Compact Affine Execution Evaluation

Baseline

slide-29
SLIDE 29

Cornell University Ji Kim 29/20 Cornell University Ji Kim 29/20 Cornell University Ji Kim

0.9 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7

Task/Second

0.6 0.7 0.8 0.9 1.0 1.1 1.2

Dynamic Energy/Task

cmult mfilt bsearch viterbi rsort dither strsearch rgb2cmyk conv kmeans bfs sgemm bilateral

FG-SIMT Energy vs. Performance Results

  • viterbi
  • 54% saved within register file
  • 29% saved within functional units
  • 34% saved within memory system

26/27 Motivation GP-SIMT vs. FG-SIMT Value Structure FG-SIMT Baseline Compact Affine Execution Evaluation

Baseline

slide-30
SLIDE 30

Cornell University Ji Kim 30/20 Cornell University Ji Kim 30/20 Cornell University Ji Kim

0.9 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7

Task/Second

0.6 0.7 0.8 0.9 1.0 1.1 1.2

Dynamic Energy/Task

cmult mfilt bsearch viterbi rsort dither strsearch rgb2cmyk conv kmeans bfs sgemm bilateral

FG-SIMT Energy vs. Performance Results

  • strsearch
  • 115 to 130 uJ per task
  • 89% of 15 uJ due to expansion units

26/27 Motivation GP-SIMT vs. FG-SIMT Value Structure FG-SIMT Baseline Compact Affine Execution Evaluation

Baseline

slide-31
SLIDE 31

Cornell University Ji Kim 31/20 Cornell University Ji Kim 31/20 Cornell University Ji Kim

Take-Away Points

  • A significant amount of value structure exists in common

SIMT workloads and is often overlooked

  • Compact affine execution exploits value structure in

arithmetic, branch, and memory instructions to improve performance and energy-efficiency

  • FG-SIMT is a promising architectural paradigm for

compute-focused, area-efficient data-parallel accelerators

27/27 Motivation GP-SIMT vs. FG-SIMT Value Structure FG-SIMT Baseline Compact Affine Execution Evaluation