A Fast Analytical Model of Fully Associative Caches



SLIDE 1

spcl.inf.ethz.ch @spcl_eth

TOBIAS GYSI, TOBIAS GROSSER, LAURIN BRANDNER, AND TORSTEN HOEFLER

A Fast Analytical Model of Fully Associative Caches

SLIDE 2


The Cost of Data Movement Depends on Global State and Does Not Compose

    int N = 1000;
    for(int i = 0; i < N; i++) {
      for(int j = 0; j < i; j++) {
        for(int k = 0; k < j; k++) {
          A[i][j] -= A[i][k] * A[j][k];
        }
        A[i][j] /= A[j][j];
      }
      for(int k = 0; k < i; k++) {
        A[i][i] -= A[i][k] * A[i][k];
      }
      A[i][i] = sqrt(A[i][i]);
    }

Cholesky kernel from https://sourceforge.net/projects/polybench/

  • percentage of cache misses? L1 cache: 1.6%, L2 cache: 1.4%
  • amount of compulsory and capacity misses? # compulsory misses: 31,752, # capacity misses: 10,630,620
  • most expensive memory access? A[j][k]

SLIDE 3

HayStack Output for Cholesky Factorization

    5 for (int i = 0; i < N; i++) {
    6   for (int j = 0; j < i; j++) {
    7     for (int k = 0; k < j; k++) {
    8       A[i][j] -= A[i][k] * A[j][k];

relative number of cache misses (statement):

    ref      type  comp[%]  L1[%]    L2[%]    tot[%]    reuse[ln]
    A[i][j]  rd    0.00459  0.00000  0.00000  24.86910  8,10
    A[i][k]  rd    0.00000  0.00000  0.00000  24.86910  8,10
    A[j][k]  rd    0.00000  1.58635  1.38213  24.86910  8,10,13,15
    A[i][j]  wr    0.00000  0.00000  0.00000  24.86910  8

absolute number of cache misses (program):

  • compulsory: 31'752
  • capacity (L1): 10'630'620
  • capacity (L2): 9'258'460
  • total: 668'166'500

parameters:

  • cache sizes (32k and 512k)
  • cacheline size (64B)
SLIDE 4

Comparison to Simulation

[Figure: runtime comparison to simulation; time axis with ticks at 1 second, 1 minute, 1 hour, and 1 day]

SLIDE 5

Symbolic Counting Avoids the Explicit Enumeration

[Figure: 1d illustration of counting the integer points between i and j. Enumeration visits the points one by one (#points = 1, #points = 2, #points = 3, ...), while symbolic counting evaluates #points = j-i+1 directly: 3 points in the short example, 9 points in the long one.]

Barvinok algorithm

Alexander I. Barvinok, A Polynomial Time Algorithm for Counting Integral Points in Polyhedra When the Dimension is Fixed. 1994.
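A minimal sketch of the two counting strategies, assuming hand-derived closed forms rather than the Barvinok algorithm itself. All function names are hypothetical. The second pair counts the iterations of a triangular nest 0 <= k < j < i < n, the shape of the innermost Cholesky loop, whose closed form n(n-1)(n-2)/6 is the binomial coefficient "n choose 3".

```c
#include <assert.h>
#include <stdint.h>

/* Enumeration: visit every integer point of [i, j] and count it. */
uint64_t count_points_enumerated(int64_t i, int64_t j) {
    uint64_t count = 0;
    for (int64_t p = i; p <= j; ++p)
        ++count;
    return count;
}

/* Symbolic: evaluate #points = j - i + 1 without visiting any point. */
uint64_t count_points_symbolic(int64_t i, int64_t j) {
    return j >= i ? (uint64_t)(j - i) + 1 : 0;
}

/* Enumeration of the iterations of the nest 0 <= k < j < i < n. */
uint64_t count_triangle_enumerated(int64_t n) {
    uint64_t count = 0;
    for (int64_t i = 0; i < n; ++i)
        for (int64_t j = 0; j < i; ++j)
            for (int64_t k = 0; k < j; ++k)
                ++count;
    return count;
}

/* Symbolic: the same count as the polynomial n(n-1)(n-2)/6. */
uint64_t count_triangle_symbolic(int64_t n) {
    assert(n >= 0);
    if (n < 3)
        return 0;
    uint64_t m = (uint64_t)n;
    return m * (m - 1) * (m - 2) / 6;
}
```

The cost of the enumerating variants grows with the number of points, while the symbolic variants take constant time: `count_triangle_symbolic(1000)` returns 166167000 after a few multiplications.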

SLIDE 6

The LRU Stack Distance Allows Us to Model Fully Associative Caches

example:

    int sum = 0;
    for(int i=0; i<4; ++i)
    S0:   M[i] = i;
    for(int j=0; j<4; ++j)
    S1:   sum += M[3-j];

[Figure: the memory accesses i=0 M(0), i=1 M(1), i=2 M(2), i=3 M(3), j=0 M(3), j=1 M(2), j=2 M(1), j=3 M(0) shown next to the LRU stack after each access; accesses are marked as hit or miss, and the reuses are labelled distance 1, distance 2, distance 3, and distance 4]

deliberately generic model

Richard L Mattson, Jan Gecsei, Donald R Slutz, and Irving L Traiger, Evaluation techniques for storage hierarchies. 1970.
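The stack distances of the example can be reproduced with a small trace-driven sketch. It keeps the LRU stack as a plain array, which is an illustrative simplification and not how HayStack computes distances. The name `lru_stack_distances`, the bound `MAX_DISTINCT`, and the convention that a distance of 0 marks a first access are assumptions of this sketch.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DISTINCT 64 /* hypothetical bound on distinct addresses */

/* Writes the LRU stack distance of every access in trace[0..n) to dist[].
   A distance of 0 marks the first access to an address (compulsory miss).
   stack[0] always holds the most recently used address. */
void lru_stack_distances(const int *trace, size_t n, size_t *dist) {
    int stack[MAX_DISTINCT];
    size_t depth = 0;
    for (size_t a = 0; a < n; ++a) {
        /* Search the stack from the top for the accessed address. */
        size_t pos = 0;
        while (pos < depth && stack[pos] != trace[a])
            ++pos;
        if (pos == depth) {
            assert(depth < MAX_DISTINCT);
            dist[a] = 0;       /* never seen before */
            ++depth;
        } else {
            dist[a] = pos + 1; /* position in the stack, counted from 1 */
        }
        /* Move the entries above pos down by one and put the address on top. */
        for (size_t s = pos; s > 0; --s)
            stack[s] = stack[s - 1];
        stack[0] = trace[a];
    }
}
```

For the trace M(0), M(1), M(2), M(3), M(3), M(2), M(1), M(0) of the example, the four writes of S0 are first accesses and the four reads of S1 get the distances 1, 2, 3, and 4.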

SLIDE 7

Compute the LRU Stack Distance

example:

    int sum = 0;
    for(int i=0; i<4; ++i)
    S0:   M[i] = i;
    for(int j=0; j<4; ++j)
    S1:   sum += M[3-j];

[Figure: stack distance of S1 plotted over j]

$p(j) = j + 1$

apply symbolic counting once

Kristof Beyls and Erik H D’Hollander, Generating cache hints for improved program efficiency. 2005.

SLIDE 8

Count the Cache Misses Given the LRU Stack Distance

example:

    int sum = 0;
    for(int i=0; i<4; ++i)
    S0:   M[i] = i;
    for(int j=0; j<4; ++j)
    S1:   sum += M[3-j];

[Figure: stack distance of S1 plotted over j with the cache size C=2 separating hits from misses]

$p(j) = j + 1$

many different pieces and sometimes non-affine polynomials

apply symbolic counting twice

$\{\, j : p(j) > C \land 0 \le j < 4 \,\}$

$|\{\, j : p(j) > C \land 0 \le j < 4 \,\}| = 2$
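A minimal sketch of the miss count for the reads of S1, assuming the stack distance on the slide reads p(j) = j + 1 and that an access misses when its distance exceeds the cache size C. The function names are hypothetical, and the closed form is derived by hand for this one polynomial; it stands in for the general symbolic counting step.

```c
#include <assert.h>

/* Stack distance of the S1 read in iteration j (assumed: p(j) = j + 1). */
static long stack_distance(long j) { return j + 1; }

/* Enumeration: test every iteration 0 <= j < n against the cache size c. */
long misses_enumerated(long n, long c) {
    long misses = 0;
    for (long j = 0; j < n; ++j)
        if (stack_distance(j) > c)
            ++misses;
    return misses;
}

/* Symbolic: j + 1 > c holds exactly for j >= c, so iterations c..n-1 miss. */
long misses_symbolic(long n, long c) {
    assert(n >= 0 && c >= 0);
    return n > c ? n - c : 0;
}
```

With n = 4 and C = 2 both functions return 2, the iterations j = 2 and j = 3.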

SLIDE 9

Some Access Patterns Result in Non-Linearities

original: $p(j) = j + 1$

    int sum = 0;
    for(int i=0; i<4; ++i)
    S0:   M[i] = i;
    for(int j=0; j<4; ++j)
    S1:   sum += M[3-j];

additional time loop t: $p(j, t) = t^2 + j + 1$

    int sum = 0;
    for(int t=0; t<4; ++t) {
      for(int i=0; i<4; ++i)
    S0:     M[i] = i;
      for(int m=0; m<t; ++m)
        for(int n=0; n<t; ++n)
          N[m][n] = t;
      for(int j=0; j<4; ++j)
    S1:     sum += M[3-j];
    }

[Figure: iteration space over i and j, original and with the additional time loop t]

partial enumeration

SLIDE 10

Enumerate the Non-Affine Dimensions

$p(j, t) = t^2 + j + 1$ with $0 \le (j, t) < 3$

[Figure: the points (0,0) to (2,2) of the domain and the projection $\pi_t$ onto $0 \le t < 3$]

  • $p_{t=0}(j) = 0 + j + 1$
  • $p_{t=1}(j) = 1 + j + 1$
  • $p_{t=2}(j) = 4 + j + 1$

12.4x speedup due to partial enumeration
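A minimal sketch of partial enumeration, assuming the stack distance on the slide reads p(j, t) = t^2 + j + 1 and that an access misses when its distance exceeds the cache size c. Only the non-affine dimension t is enumerated; for each fixed t the polynomial is affine in j and is counted with a hand-derived closed form. The function names are hypothetical.

```c
#include <assert.h>

/* Misses for one fixed t among the iterations 0 <= j < n.
   With t fixed, t*t + j + 1 > c is affine in j and equivalent to
   j >= c - t*t, so the count has a closed form. */
long misses_for_fixed_t(long t, long n, long c) {
    assert(t >= 0 && n >= 0 && c >= 0);
    long first_miss = c - t * t; /* smallest j that misses */
    if (first_miss < 0)
        first_miss = 0;
    return n > first_miss ? n - first_miss : 0;
}

/* Partial enumeration: loop over t only, count each j range symbolically. */
long misses_partial_enumeration(long t_count, long n, long c) {
    long misses = 0;
    for (long t = 0; t < t_count; ++t)
        misses += misses_for_fixed_t(t, n, c);
    return misses;
}

/* Full enumeration of all (j, t) points, for comparison. */
long misses_full_enumeration(long t_count, long n, long c) {
    long misses = 0;
    for (long t = 0; t < t_count; ++t)
        for (long j = 0; j < n; ++j)
            if (t * t + j + 1 > c)
                ++misses;
    return misses;
}
```

For the 3 by 3 domain and c = 2, both variants return 6: one miss for t = 0, two for t = 1, and three for t = 2.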

SLIDE 11

Modelling Cache Lines Introduces Floor Terms

example:

    int sum = 0;
    for(int i=0; i<4; ++i)
    S0:   M[i] = i;
    for(int j=0; j<4; ++j)
    S1:   sum += M[3-j];

[Figure: iteration space over i and j, original and with cache lines modelled]

  • original: $p(j) = j + 1$
  • modelling cache lines: $p(j) = \frac{j}{2}\left(\left\lfloor \frac{j}{2} \right\rfloor - \left\lfloor \frac{j-1}{2} \right\rfloor\right) + 1$

equalization and rasterization

SLIDE 12

Split the Domain to Eliminate Floor Terms

$p(j) = \frac{j}{2}\left(\left\lfloor \frac{j}{2} \right\rfloor - \left\lfloor \frac{j-1}{2} \right\rfloor\right) + 1$ with $0 \le j < 4$

[Figure: the domain $0 \le j < 4$ split into the even and the odd values of j]

  • $p(j)|_{j \% 2 = 0} = \frac{j}{2} \cdot 1 + 1$
  • $p(j)|_{j \% 2 > 0} = \frac{j}{2} \cdot 0 + 1$

1.9x speedup due to equalization
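A minimal sketch of the domain split, assuming the stack distance with floor terms reads p(j) = (j/2)(floor(j/2) - floor((j-1)/2)) + 1. The difference of the two floors is 1 for even j and 0 for odd j, so splitting the domain by j % 2 leaves one floor-free piece per case. The function names are hypothetical, and `floor_div` is needed because integer division in C truncates toward zero instead of rounding down.

```c
#include <assert.h>

/* Floor division for a positive divisor b. */
static long floor_div(long a, long b) {
    assert(b > 0);
    long q = a / b;
    return (a % b != 0 && a < 0) ? q - 1 : q;
}

/* Stack distance with floor terms (assumed reading of the slide). */
long distance_with_floors(long j) {
    /* 1 for even j, 0 for odd j */
    long is_even = floor_div(j, 2) - floor_div(j - 1, 2);
    return j * is_even / 2 + 1;
}

/* The same function after splitting the domain by j % 2: no floors left. */
long distance_split(long j) {
    assert(j >= 0);
    if (j % 2 == 0)
        return j / 2 + 1; /* piece for j % 2 = 0 */
    return 1;             /* piece for j % 2 > 0 */
}
```

On the domain 0 <= j < 4 both functions yield the distances 1, 1, 2, 1.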

SLIDE 13


Accuracy of HayStack for the L1 Cache of Our Test System

SLIDE 14


Error of HayStack Compared to Simulation (Dinero IV)

  • HayStack (fully associative)
  • Dinero IV (8-way associative)
  • Dinero IV (fully associative)

SLIDE 15


Performance of HayStack for the Large Problem Size of PolyBench

SLIDE 16


Performance of HayStack Compared to PolyCache and Dinero

Wenlei Bao, Sriram Krishnamoorthy, Louis-Noel Pouchet, and P Sadayappan, Analytical modeling of cache behavior for affine programs. 2017.

Jan Edler and Mark D. Hill, Dinero IV Trace-Driven Uniprocessor Cache Simulator. 2003.

Dinero IV

  • simulator
  • setup to simulate full associativity
  • problem size dependent performance

PolyCache

  • analytical cache model
  • models set associativity
  • one core per cache set

370x speedup

21x speedup

SLIDE 17


Conclusion

  • fast enough to provide interactive feedback
  • generic model of fully associative caches
  • accurate results compared to measurements
  • excellent performance compared to alternatives