Get Out of the Valley: Power-Efficient Address Mapping for GPUs

SLIDE 1

Get Out of the Valley: Power-Efficient Address Mapping for GPUs

The 45th International Symposium on Computer Architecture (ISCA) Monday June 4th, 2018 Yuxi Liu (Ghent & Peking), Xia Zhao (Ghent), Magnus Jahre (NTNU), Zhenlin Wang (MTU), Xiaolin Wang (Peking), Yingwei Luo (Peking), and Lieven Eeckhout (Ghent)

SLIDE 2

GPU Memory Systems

GPUs require high-bandwidth memory systems to support efficient execution of hundreds to thousands of concurrent threads. Achieving high bandwidth requires effectively utilizing the parallel units in the memory system.

[Diagram: Streaming Multiprocessors (SMs) connect through a Network-on-Chip (NoC) to LLC slices; each LLC slice is attached to one of four DRAM channels (DRAM Channel 0 to 3), each containing multiple DRAM banks.]

SLIDE 3

Entropy Valley

Entropy is a measure of the information content of each address bit. Bank and channel bits must be highly variable to ensure an even distribution of memory requests across LLC slices, channels, and banks.

[Diagram: memory address fields from most to least significant bit (Row, Channel, Bank, Column, Block), with per-bit entropy profiles for CPUs and GPUs.]

Entropy valleys create significant resource imbalance in GPU memory systems, leading to poor performance and low power-efficiency.

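The per-bit entropy used throughout the talk can be stated precisely. Treating each address bit as a binary random variable with probability p of being 1 (the bit-value ratio introduced later in the talk), Shannon's binary entropy gives:

```latex
H(p) = -p \log_2 p \; - \; (1 - p) \log_2 (1 - p), \qquad 0 \le H(p) \le 1
```

H(p) peaks at 1 bit when p = 1/2 and drops to 0 when the bit is constant; a channel or bank bit that rarely changes is exactly an entropy valley.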

SLIDE 4

Why Do Entropy Valleys Exist?

[Diagram: an 8×8 grid of thread blocks in the X- and Y-dimensions, indexed [y,x], with column-major 1D thread block (TB) allocation. Requests [0,0] through [7,0] produce memory addresses … 0000 00 … through … 1110 00 …; the two channel bits select among DRAM Channels 0 to 3.]

SLIDE 5

Why Do Entropy Valleys Exist?

[Diagram: the same column-major TB allocation; the addresses of requests [0,0] through [7,0] all carry channel bits 00.]

Entropy valleys are caused by dimension-related array indexing. All requests end up in Channel 0. Our solution: BIM-based address mapping.
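The channel-bit collapse can be reproduced with a short sketch. The stride and bit layout below are illustrative assumptions, not the paper's exact configuration: with column-major TB allocation, consecutive requests differ only in higher-order y bits, so the channel bits never change.

```python
# Illustrative sketch: column-major thread-block allocation starving DRAM
# channels. Assumed bit layout: bits 0-5 block offset (64-byte blocks),
# bits 6-7 channel select (4 channels); the y-coordinate lands higher up.

BLOCK_BITS = 6        # 64-byte cache block offset
CHANNEL_MASK = 0b11   # 4 DRAM channels -> 2 channel bits
ROW_STRIDE = 2048     # hypothetical byte stride between consecutive y rows

def channel_of(addr: int) -> int:
    """Extract the channel bits sitting just above the block offset."""
    return (addr >> BLOCK_BITS) & CHANNEL_MASK

# Requests [0,0] .. [7,0]: x is fixed, only y varies (column-major order).
addrs = [y * ROW_STRIDE for y in range(8)]
channels = [channel_of(a) for a in addrs]
print(channels)  # [0, 0, 0, 0, 0, 0, 0, 0] -> every request hits channel 0
```

Because the stride is a multiple of 256, bits 6 and 7 are always zero: this is the entropy valley in the channel bits.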

SLIDE 6


Getting Out of the Entropy Valley

[Diagram: BIM-based address mapping computes Output Addr. = Binary Invertible Matrix (BIM) × Input Addr. The remapped channel bits of requests [0,0] through [7,0] take each value from 00 to 11 exactly twice, spreading the requests across DRAM Channels 0 to 3.]

SLIDE 7

Perfect channel utilization!

Getting Out of the Entropy Valley

[Diagram: with the same BIM-based address mapping, requests [0,0] through [7,0] occupy all four DRAM channels, two per channel.]
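Applying a BIM is a matrix-vector product over GF(2): AND each matrix row with the input address, then XOR-reduce to produce one output bit. The matrix below is an illustrative sketch, not the BIM from the paper's evaluation: identity plus two extra 1s that XOR y-dimension bits into the channel bits.

```python
# Sketch of BIM-based address mapping over GF(2). Each matrix row is stored
# as a bitmask of input bits; output bit i = XOR of the input bits selected
# by row i. Assumed toy layout: bits 0-5 block, 6-7 channel, 8+ y-dimension.

N = 12  # toy address width in bits

def bim_apply(rows, addr):
    out = 0
    for i, row in enumerate(rows):
        bit = bin(row & addr).count("1") & 1   # AND with row, XOR-reduce
        out |= bit << i
    return out

# Identity matrix plus two off-diagonal 1s: still invertible over GF(2)
# (unit triangular), so the mapping remains one-to-one.
M = [1 << i for i in range(N)]
M[6] |= 1 << 8    # channel bit 0 ^= y bit 0
M[7] |= 1 << 9    # channel bit 1 ^= y bit 1

addrs = [y << 8 for y in range(8)]             # channel bits all 00 before
channels = [(bim_apply(M, a) >> 6) & 0b11 for a in addrs]
print(channels)  # [0, 1, 2, 3, 0, 1, 2, 3] -> all four channels used
```

The same eight requests that previously all hit channel 0 now spread evenly, two per channel.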

SLIDE 8

Outline

  • 1. Introduction
  • 2. Window-based memory address entropy
  • 3. Binary Invertible Matrix (BIM) address mapping
  • 4. Results
  • 5. Conclusion

SLIDE 9

Window-based Entropy

We need an entropy metric without memory request ordering assumptions. A window is the set of TBs that are likely to issue requests that coexist in the memory system; with Greedy-Then-Oldest (GTO) warp scheduling, we heuristically set the window size to the number of Streaming Multiprocessors (SMs).

We compute Shannon's entropy function over the bit-value ratio (BVR) probabilities within each window; the overall entropy is the mean of the window entropies.

[Diagram: intra-TB entropy computes the BVR over the addresses within each thread block (TB 1, TB 2); inter-TB entropy computes the BVR across a window of thread blocks (TB1 through TB4).]
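A minimal sketch of the metric (simplified; the window contents and sizes here are illustrative, not a GTO-scheduled request stream): compute the bit-value ratio of an address bit within each window, apply Shannon's binary entropy, and average over windows.

```python
# Sketch of window-based entropy. A window holds the addresses assumed to
# coexist in the memory system; per window we compute the bit-value ratio
# (BVR) of a bit, take Shannon's binary entropy, then average over windows.
import math

def binary_entropy(p: float) -> float:
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def window_entropy(addresses, bit, window_size):
    """Mean per-window entropy of a single address bit."""
    entropies = []
    for start in range(0, len(addresses), window_size):
        window = addresses[start:start + window_size]
        bvr = sum((a >> bit) & 1 for a in window) / len(window)
        entropies.append(binary_entropy(bvr))
    return sum(entropies) / len(entropies)

# Toy stream: the low 3 bits cycle 0..7, so bit 0 flips constantly while
# bit 3 never changes -- an entropy valley at bit 3.
stream = [i & 0b111 for i in range(32)]
print(window_entropy(stream, 0, 8))  # 1.0 (full entropy)
print(window_entropy(stream, 3, 8))  # 0.0 (valley)
```

Windowing matters because a bit can look variable over a whole trace yet be constant within every window that is actually in flight together.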

SLIDE 10

Entropy Profile Examples

[Figure: per-bit entropy profiles (bits 6 through 29, entropy 0.0 to 1.0) for MT, LU, GS, NW, LPS, and NN (no valley); annotations mark valleys spanning two channel bits and one bank bit, and three bank bits.]

All workloads have low-entropy bits, and their location is highly application-dependent. GPU address mapping schemes must harvest entropy across broad address bit ranges.

SLIDE 11

Outline

  • 1. Introduction
  • 2. Window-based memory address entropy
  • 3. Binary Invertible Matrix (BIM) address mapping
  • 4. Results
  • 5. Conclusion

SLIDE 12

The Binary Invertible Matrix (BIM)

The BIM can represent all possible address mapping schemes that consist of AND and XOR operations

  • Matrix covers all possible transformations
  • Invertibility criterion ensures that all possible one-to-one relations are considered

The BIM has low hardware overhead

  • Can be implemented with a tree of XOR-gates
  • Mapping can be performed in a single clock cycle

[Diagram: Output Addr. = Binary Invertible Matrix (BIM) × Input Addr.]

[Examples: a plain memory map and Remap (RMP) have a single 1 per row; permutation-based mapping (PM, Zhang et al. [MICRO'00]) has two 1s in the bank and channel rows.]
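These schemes can be written down as BIM instances. The 8-bit layout below is a toy assumption (bits 0-1 column, 2-3 bank, 4-7 row), not the layout used in the evaluation: a single 1 per row copies a bit (plain map / RMP), while PM adds a second 1 to each bank row, XORing bank bits with row bits. Each output bit is then a small XOR tree, which is why the mapping fits in a single cycle.

```python
# Toy BIM instances of the schemes named on this slide (assumed 8-bit
# layout: bits 0-1 column, 2-3 bank, 4-7 row).

def apply_bim(rows, addr):
    out = 0
    for i, row in enumerate(rows):
        out |= (bin(row & addr).count("1") & 1) << i   # AND + XOR-reduce
    return out

identity = [1 << i for i in range(8)]   # plain map: single 1 per row
pm = list(identity)
pm[2] |= 1 << 4                          # bank bit 0 ^= row bit 0
pm[3] |= 1 << 5                          # bank bit 1 ^= row bit 1

addr = 0b00110000                        # row bits 0-1 set, bank bits 00
print((apply_bim(identity, addr) >> 2) & 0b11)  # 0: bank unchanged
print((apply_bim(pm, addr) >> 2) & 0b11)        # 3: bank XORed with row bits
```

Even a fully dense row over 8 input bits needs only a depth-3 XOR tree, consistent with the single-cycle claim.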

SLIDE 13

Our Mapping Schemes

[Diagram: the broad mapping strategy places multiple 1s in each bank and channel row; it is refined into broad sub-strategies.]

Entropy analysis shows that a GPU address mapping policy needs to harvest entropy across broad address bit ranges

  • We call this the broad mapping strategy
  • Covers many possible mapping schemes

We define three sub-strategies that differ in which memory address fields can be used as input and output in the BIM

  • Page Address Entropy (PAE)
  • Full Address Entropy (FAE)
  • All

[Diagram: the input address fields (Row, Channel, Bank, Column, Block) pass through the BIM to the output fields; PAE, FAE, and All restrict which fields the BIM may mix.]

We randomly generate BIMs that match the input and output restrictions of each sub-strategy

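The generation step can be sketched as rejection sampling (an assumption about the generator; the slide only states that BIMs matching each sub-strategy's restrictions are generated randomly): draw random rows, then keep the matrix only if it is invertible over GF(2), verified by Gaussian elimination, so the mapping stays one-to-one.

```python
# Sketch: randomly generate candidate BIMs, rejecting any matrix that is
# not invertible over GF(2). Rows are stored as integer bitmasks.
import random

def is_invertible_gf2(rows, n):
    rows = list(rows)                       # work on a copy
    for col in range(n):
        pivot = next((r for r in range(col, n) if (rows[r] >> col) & 1), None)
        if pivot is None:
            return False                    # no pivot -> rank-deficient
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(n):
            if r != col and (rows[r] >> col) & 1:
                rows[r] ^= rows[col]        # clear this column elsewhere
    return True

def random_bim(n, seed=0):
    rng = random.Random(seed)
    while True:                             # rejection sampling
        rows = [rng.randrange(1, 1 << n) for _ in range(n)]
        if is_invertible_gf2(rows, n):
            return rows

bim = random_bim(6)
print(is_invertible_gf2(bim, 6))  # True
```

A sub-strategy's input/output restrictions would further constrain which row bits may be nonzero; that filtering is omitted here for brevity.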

SLIDE 14

Entropy Impact of Address Mapping Schemes for the MT Benchmark

[Figure: per-bit entropy profiles (bits 6 through 29, entropy 0.0 to 1.0) for MT under Baseline, Remap, PM, PAE, FAE, and All.]

PAE, FAE, and All remove the entropy valleys – the other mapping schemes do not

SLIDE 15

Outline

  • 1. Introduction
  • 2. Window-based memory address entropy
  • 3. Binary Invertible Matrix (BIM) address mapping
  • 4. Results
  • 5. Conclusion

SLIDE 16

Execution Time vs. DRAM Power

[Figure: scatter plot of average DRAM power consumption normalized to BASE (0.2 to 1.2) against average execution time normalized to BASE (0.8 to 1.5) for BASE, PM, RMP, PAE, FAE, and ALL; annotations mark 1.51X and 1.30X.]

SLIDE 17

Performance

[Figure: speed-up relative to BASE (1 to 8) for MT, LU, GS, NW, LPS, SC, SRAD2, DWT2D, HS, SP, and HMEAN under BASE, PM, RMP, PAE, FAE, and ALL.]

PAE improves performance by 1.31X on average compared to PM (annotations: 6.7X, 7.5X, 4.0X, 2.0X, 1.9X, 1.4X, 1.4X, 1.3X, 1.1X, 1.0X, 1.0X, 1.5X).

SLIDE 18

Performance per Watt

[Figure: performance per Watt (0.5 to 4.5) for MT, LU, GS, NW, LPS, SC, SRAD2, DWT2D, HS, SP, and HMEAN under BASE, PM, RMP, PAE, FAE, and ALL.]

PAE improves performance per Watt by 1.25X on average compared to PM (annotations: 3.9X, 1.4X).

SLIDE 19

Why is PAE Most Power-Efficient?

FAE and ALL tend to distribute requests with good DRAM page locality to different banks, which increases the number of DRAM page activations. PAE saves power by keeping these requests in the same bank.

[Figure: DRAM power breakdown in Watts (10 to 60) into background, activate, read, and write components for MT, LU, GS, NW, LPS, SC, SRAD2, DWT2D, HS, SP, and AVG under BASE, PM, RMP, PAE, FAE, and ALL.]

SLIDE 20

Outline

  • 1. Introduction
  • 2. Window-based memory address entropy
  • 3. Binary Invertible Matrix (BIM) address mapping
  • 4. Results
  • 5. Conclusion

SLIDE 21

Conclusion

Window-Based Entropy

  • A novel entropy metric tailored for the highly concurrent memory behavior of GPU compute workloads

Binary Invertible Matrix (BIM) address mapping

  • A unified representation of address mapping schemes that use AND and XOR operations

Page Address Entropy (PAE) address mapping

  • PAE improves performance by 1.31X and performance per Watt by 1.25X compared to the state-of-the-art permutation-based address mapping scheme

SLIDE 22

Thank You!
