

SLIDE 1

Advanced cache memory optimizations

Computer Architecture

  • J. Daniel García Sánchez (coordinator)
  • David Expósito Singh
  • Francisco Javier García Blas

ARCOS Group, Computer Science and Engineering Department, University Carlos III of Madrid


– Computer Architecture – ARCOS Group – http://www.arcos.inf.uc3m.es 1/44

SLIDE 2

  • 1. Introduction
  • 2. Advanced optimizations
  • 3. Conclusion


SLIDE 3

Why do we use caching?

To overcome the memory wall:

  • 1980 – 2010: Processor performance improved by orders of magnitude more than memory performance.
  • 2005 – ...: The situation worsens with emerging multi-core architectures.

To reduce both data and instruction access times:

  • Bring the average memory access time closer to the cache access time.
  • Offer the illusion of a cache size approaching main memory size.
  • Both rest on the principle of locality.


SLIDE 4

Average memory access time

With th(Li) the hit time, mLi the miss rate, and tp(Li) the miss penalty of level i:

  • 1 level: t = th(L1) + mL1 × tp(L1)
  • 2 levels: t = th(L1) + mL1 × (th(L2) + mL2 × tp(L2))
  • 3 levels: t = th(L1) + mL1 × (th(L2) + mL2 × (th(L3) + mL3 × tp(L3)))
  • ...


SLIDE 5

Basic optimizations

  • 1. Increase block size.
  • 2. Increase cache size.
  • 3. Increase associativity.
  • 4. Introduce multi-level caches.
  • 5. Give priority to read misses over writes.
  • 6. Avoid address translation during cache indexing.


SLIDE 6

Advanced optimizations

Metrics to be decreased:

  • Hit time.
  • Miss rate.
  • Miss penalty.

Metric to be increased:

  • Cache bandwidth.

Observation: Every advanced optimization aims to improve at least one of these metrics.



SLIDE 8

2. Advanced optimizations

  • Small and simple caches
  • Way prediction
  • Pipelined access to cache
  • Non-blocking caches
  • Multi-bank caches
  • Critical word first and early restart
  • Write buffer merge
  • Compiler optimizations
  • Hardware prefetching


SLIDE 9

Small caches

Lookup procedure:

  • Select a line using the index bits.
  • Read the line tag.
  • Compare it to the address tag.

Lookup time grows as cache size grows. A smaller cache allows:

  • Simpler lookup hardware.
  • A cache that fits better on the processor chip.

A small cache improves lookup time.


SLIDE 10

Simple caches

Cache simplification:

  • Use mapping mechanisms as simple as possible.
  • Direct mapping allows tag comparison and data transfer to be overlapped.

Observation: Most modern processors focus more on keeping caches small than on simplifying them.


SLIDE 11

Intel Core i7

L1 cache (1 per core):

  • 32 KB instructions + 32 KB data.
  • Latency: 3 cycles.
  • Set associative: 4 ways (instructions), 8 ways (data).

L2 cache (1 per core):

  • 256 KB.
  • Latency: 9 cycles.
  • Set associative: 8 ways.

L3 cache (shared):

  • 8 MB.
  • Latency: 39 cycles.
  • Set associative: 16 ways.



SLIDE 13

Way prediction

Problem:

  • Direct mapping → fast, but many misses.
  • Set-associative mapping → fewer misses, but slower lookup.

Way prediction:

  • Extra bits are stored to predict the way to be selected by the next access.
  • The predicted block is read and compared against a single tag.
  • On a miss in the predicted way, the remaining tags are compared.



SLIDE 15

Pipelined access to cache

Goal: Improve cache bandwidth.
Solution: Pipeline cache access over multiple clock cycles.
Effects:

  • The clock cycle can be shortened.
  • A new access can start every clock cycle.
  • Cache bandwidth increases.
  • Latency increases.

Especially beneficial in superscalar processors.



SLIDE 17

Non-blocking caches

Problem: A cache miss stalls the processor until the block is obtained.
Solution: Out-of-order execution.
But: How is memory accessed while a miss is being resolved?

Hit under miss:

  • Allow hits to be served while waiting for the miss.
  • Reduces miss penalty.

Hit under several misses / miss under miss:

  • Allow misses to overlap.
  • Requires multi-channel memory.
  • Highly complex.



SLIDE 19

Multi-bank caches

Goal: Allow simultaneous accesses to different cache locations.
Solution: Divide the cache into independent banks.
Effect: Bandwidth is increased.
Example: Sun Niagara, whose L2 has 4 banks.


SLIDE 20

Bandwidth

To increase bandwidth, accesses must be distributed across the banks.
Simple approach: sequential interleaving, a round-robin assignment of blocks to banks:

  Bank 0: blocks 0, 4, 8, 12
  Bank 1: blocks 1, 5, 9, 13
  Bank 2: blocks 2, 6, 10, 14
  Bank 3: blocks 3, 7, 11, 15



SLIDE 22

Critical word first and early restart

Observation: The processor usually needs a single word to proceed.
Solution: Do not wait until the whole block has been transferred from memory.
Alternatives:

  • Critical word first: Transfer the block starting at the word the processor needs, wrapping around.
  • Early restart: Transfer the block in order, without reordering.

In both cases, the processor proceeds as soon as the requested word arrives.

Effect: Depends on block size → the larger the block, the larger the benefit.



SLIDE 24

Write buffer

A write buffer decreases the miss penalty:

  • The processor considers a write complete as soon as it enters the buffer.
  • Writing several words to memory at once is more efficient than a separate write per word.

Uses:

  • Write-through: On every write.
  • Write-back: When a modified block is replaced.


SLIDE 25

Merges in write buffer

If the buffer already holds modified words of the same block, addresses are checked and, when possible, the new write is merged into the existing entry. Effects:

  • Fewer memory writes.
  • Fewer stalls due to a full buffer.


SLIDE 26

Merges in write buffer

Without merging, four writes to consecutive addresses occupy four buffer entries, each with a single valid word:

  100: V M[100] | - | - | -
  108: V M[108] | - | - | -
  116: V M[116] | - | - | -
  124: V M[124] | - | - | -

With merging, the same four writes collapse into a single entry with all word slots valid:

  100: V M[100] | V M[108] | V M[116] | V M[124]



SLIDE 28

Compiler optimizations

Goal: Generate code that causes fewer instruction and data misses.

Instructions:

  • 1. Procedure reordering.
  • 2. Align basic blocks to the start of a cache line.
  • 3. Branch linearization.

Data:

  • 1. Array merge.
  • 2. Loop interchange.
  • 3. Loop merge.
  • 4. Blocked access.


SLIDE 29

Procedure reordering

Goal: Decrease conflict misses caused when two concurrently active procedures map to the same cache lines.
Technique: Reorder procedures in memory.

Cache line occupation, before and after reordering:

  Before: P1 P1 P1 P1 | P2 P2 P2 P2 | P3 P3 P3 P3
  After:  P1 P1 P1 P1 | P3 P3 P3 P3 | P2 P2 P2 P2


SLIDE 30

Basic block alignment

Definition: A basic block is a sequence of instructions executed sequentially (it contains no branches).
Goal: Decrease the chance of cache misses in sequential code.
Technique: Align the first instruction of a basic block with the first word of a cache line.


SLIDE 31

Branch linearization

Goal: Decrease cache misses due to conditional branches.
Technique: If the compiler detects that a branch is likely to be taken, it may invert the condition and interchange the basic blocks of the two alternatives, so that the common path falls through.

Some compilers define extensions to hint the expected outcome. Example: gcc's `__builtin_expect`, commonly wrapped in `likely`/`unlikely` macros.


SLIDE 32

Array merge

Parallel arrays:

```cpp
vector<int> key;
vector<int> val;
for (int i = 0; i < max; ++i) {
  cout << key[i] << "," << val[i] << endl;
}
```

Merged array:

```cpp
struct entry { int key; int val; };
vector<entry> v;
for (int i = 0; i < max; ++i) {
  cout << v[i].key << "," << v[i].val << endl;
}
```

Effects: Fewer conflicts; improved spatial locality.


SLIDE 33

Loop interchange

Strided accesses:

```cpp
for (int j = 0; j < 100; ++j) {
  for (int i = 0; i < 5000; ++i) {
    v[i][j] = k * v[i][j];
  }
}
```

Sequential accesses:

```cpp
for (int i = 0; i < 5000; ++i) {
  for (int j = 0; j < 100; ++j) {
    v[i][j] = k * v[i][j];
  }
}
```

Goal: Improve spatial locality. The right loop order depends on the storage layout defined by the programming language (FORTRAN is column-major, C is row-major).


SLIDE 34

Loop merge

Independent loops:

```cpp
for (int i = 0; i < rows; ++i) {
  for (int j = 0; j < cols; ++j) {
    a[i][j] = b[i][j] * c[i][j];
  }
}
for (int i = 0; i < rows; ++i) {
  for (int j = 0; j < cols; ++j) {
    d[i][j] = a[i][j] + c[i][j];
  }
}
```

Merged loop:

```cpp
for (int i = 0; i < rows; ++i) {
  for (int j = 0; j < cols; ++j) {
    a[i][j] = b[i][j] * c[i][j];
    d[i][j] = a[i][j] + c[i][j];
  }
}
```

Goal: Improve temporal locality. Beware: it may decrease spatial locality.


SLIDE 35

Blocked access

Original product:

```cpp
for (int i = 0; i < size; ++i) {
  for (int j = 0; j < size; ++j) {
    r = 0;
    for (int k = 0; k < size; ++k) {
      r += b[i][k] * c[k][j];
    }
    a[i][j] = r;
  }
}
```

Blocked product (bsize is the blocking factor; a must start zeroed, since each block pass accumulates into it):

```cpp
for (bj = 0; bj < size; bj += bsize) {
  for (bk = 0; bk < size; bk += bsize) {
    for (i = 0; i < size; ++i) {
      for (j = bj; j < min(bj + bsize, size); ++j) {
        r = 0;
        for (k = bk; k < min(bk + bsize, size); ++k) {
          r += b[i][k] * c[k][j];
        }
        a[i][j] += r;
      }
    }
  }
}
```



SLIDE 37

Instruction prefetching

Observation: Instructions exhibit high spatial locality.
Instruction prefetching: On a miss, read two consecutive blocks:

  • The block causing the miss → placed in the instruction cache.
  • The next block → placed in the instruction buffer.


SLIDE 38

Data prefetching

Example: Pentium 4 can prefetch a 4 KB page into the L2 cache. Prefetching is triggered when:

  • There are 2 L2 misses to the same page.
  • The distance between those misses is less than 256 bytes.



SLIDE 40

Summary (I)

Small and simple caches:

  • Improves: Hit time.
  • Worsens: Miss rate.
  • Complexity: Very low.
  • Observation: Widely used.

Way prediction:

  • Improves: Hit time.
  • Complexity: Low.
  • Observation: Used in Pentium 4.

Pipelined access to cache:

  • Improves: Bandwidth.
  • Worsens: Hit latency.
  • Complexity: Low.
  • Observation: Widely used.


SLIDE 41

Summary (II)

Non-blocking caches:

  • Improves: Bandwidth and miss penalty.
  • Complexity: Very high.
  • Observation: Widely used.

Multi-bank caches:

  • Improves: Bandwidth.
  • Complexity: Low.
  • Observation: Used for L2 in the Intel i7 and the ARM Cortex-A8.

Critical word first and early restart:

  • Improves: Miss penalty.
  • Complexity: High.
  • Observation: Widely used.


SLIDE 42

Summary (III)

Write buffer merge:

  • Improves: Miss penalty.
  • Complexity: Low.
  • Observation: Widely used.

Compiler optimizations:

  • Improves: Miss rate.
  • Complexity: Low in hardware.
  • Observation: The challenge lies in the software.

Hardware prefetching:

  • Improves: Miss penalty and miss rate.
  • Complexity: Very high.
  • Observation: More common for instructions than for data accesses.


SLIDE 43

References

Computer Architecture: A Quantitative Approach, 5th Ed. Hennessy and Patterson.

  • Sections: 2.1, 2.2.
  • Recommended exercises: 2.1, 2.2, 2.3, 2.8, 2.9, 2.10, 2.11, 2.12.

