Basic cache memory – Computer Architecture – J. Daniel García Sánchez



SLIDE 1

Basic cache memory

Computer Architecture

  • J. Daniel García Sánchez (coordinator)
  • David Expósito Singh
  • Francisco Javier García Blas

ARCOS Group, Computer Science and Engineering Department, University Carlos III of Madrid

– Computer Architecture – ARCOS Group – http://www.arcos.inf.uc3m.es 1/38

SLIDE 2

Basic cache memory – Introduction

1. Introduction
2. Policies and strategies
3. Basic optimizations
4. Conclusion

SLIDE 3

Basic cache memory – Introduction

Latency evolution

Multiple views of performance

performance = 1 / latency

Useful for comparing processor and memory evolution.

Processors

Yearly performance increase from 25% to 52%. Combined effect from 1980 to 2010 → above 3,000 times.

Memory

Yearly performance increase around 7%. Combined effect from 1980 to 2010 → around 7.5 times.
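As a sanity check, the combined factors follow directly from the yearly rates. A minimal sketch; the slide does not give the year-by-year processor schedule, so the two quoted rates only bracket the combined processor factor:

```python
# Compound effect of the yearly growth rates quoted above, 1980-2010.
years = 2010 - 1980

memory_factor = 1.07 ** years    # ~7.6x, the "around 7.5 times"
proc_low = 1.25 ** years         # if every year had grown only 25%
proc_high = 1.52 ** years        # if every year had grown 52%

print(f"memory: {memory_factor:.1f}x")
print(f"processor: between {proc_low:.0f}x and {proc_high:.0f}x")
```

The "above 3,000 times" processor figure lies between the two bounds, consistent with a rate that rose from 25% to 52% over the period.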

SLIDE 4

Basic cache memory – Introduction

Multi-core effect

Intel Core i7:

Two 64-bit data accesses per cycle per core.
4 cores at 3.2 GHz → 25.6 × 10⁹ data accesses/sec.
Instruction demand: 12.8 × 10⁹ fetches of 128 bits each.
Peak bandwidth: 409.6 GB/sec.
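The peak-bandwidth figure follows from the access counts above; a quick arithmetic check:

```python
# Reproducing the Core i7 demand arithmetic from the slide.
cores = 4
freq = 3.2e9                          # 3.2 GHz

data_accesses = cores * freq * 2      # two 64-bit accesses per cycle
instr_fetches = cores * freq          # one 128-bit fetch per cycle

# 64 bits = 8 bytes, 128 bits = 16 bytes
bandwidth = data_accesses * 8 + instr_fetches * 16

print(data_accesses / 1e9)   # 25.6 (x10^9 accesses/sec)
print(instr_fetches / 1e9)   # 12.8
print(bandwidth / 1e9)       # 409.6 GB/sec
```

Either stream alone (204.8 GB/sec of data, 204.8 GB/sec of instructions) already exceeds what any of the SDRAM generations below can deliver.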

SDRAM memory.

DDR2 (2003): 3.20 GB/sec – 8.50 GB/sec DDR3 (2007): 8.53 GB/sec – 18.00 GB/sec DDR4 (2014): 17.06 GB/sec – 25.60 GB/sec

Solutions:

Multi-port memories, pipelined caches, multi-level caches, per-core caches, instruction/data caches separation.

SLIDE 5

Basic cache memory – Introduction

Principle of locality

Principle of locality.

It is a property of programs exploited in hardware design: a program accesses a relatively small portion of its address space at any given time.

Types of locality:

Temporal locality: Elements recently accessed tend to be accessed again.

Examples: Loops, variable reuse, . . .

Spatial locality: Elements next to a recently accessed one tend to be accessed in the future.

Examples: Sequential execution of instructions, arrays, . . .
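A tiny sketch of spatial locality with a hypothetical access trace: sequential array accesses touch each cache block once and then keep hitting it. The 4-word block size is an assumption for illustration:

```python
BLOCK_WORDS = 4                      # words per cache block (assumed)
trace = list(range(12))              # sequential array accesses

blocks_touched = [addr // BLOCK_WORDS for addr in trace]
misses = len(set(blocks_touched))    # each block is fetched once...
hits = len(trace) - misses           # ...the rest hit thanks to
                                     # spatial locality
print(misses, hits)  # 3 9
```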

SLIDE 6

Basic cache memory – Introduction

Situation (2008)

SRAM (Static RAM):

Access time: 0.5 ns – 2.5 ns. Cost per GB: $2,000 – $5,000.

DRAM (Dynamic RAM):

Access time: 50 ns – 70 ns. Cost per GB: $20 – $75.

Magnetic disk:

Access time: 5,000,000 ns – 20,000,000 ns. Cost per GB: $0.20 – $2.

SLIDE 7

Basic cache memory – Introduction

Memory hierarchy

[Diagram] Processor → L1 Cache → L2 Cache → L3 Cache

SLIDE 8

Basic cache memory – Introduction

Memory hierarchy

Block or line: Unit of copy operations.

Usually composed of multiple words.

If accessed data is present in the upper level:

Hit: delivered by the upper level. Hit ratio: h = hits / accesses.

If accessed data is missing:

Miss: block copied from the lower level, then the data is accessed in the upper level. Time needed → miss penalty. Miss ratio: m = misses / accesses = 1 − h.

SLIDE 9

Basic cache memory – Introduction

Metrics

Average memory access time: t_avg = t_h + (1 − h) × t_m

Miss penalty (t_m): time to replace a block and deliver the data to the CPU. It includes:

Access time: time to get the block from the lower level. Depends on lower-level latency.

Transfer time: time to transfer a block. Depends on the bandwidth across levels.
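The average-access-time formula in runnable form. The numbers are illustrative, not from the slide:

```python
def amat(t_hit, hit_ratio, miss_penalty):
    """Average memory access time: t_avg = t_h + (1 - h) * t_m."""
    return t_hit + (1 - hit_ratio) * miss_penalty

# 1-cycle hit, 95% hit ratio, 100-cycle miss penalty:
print(amat(1, 0.95, 100))  # ≈ 6.0 cycles
```

Even a 5% miss ratio multiplies the effective access time by six, which is why the optimizations later in the deck target both the miss ratio and the penalty.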

SLIDE 10

Basic cache memory – Introduction

Metrics

CPU execution time: t_CPU = (cycles_CPU + cycles_memory_stall) × t_cycle

CPU clock cycles: cycles_CPU = IC × CPI

Memory stall cycles: cycles_memory_stall = n_misses × penalty_miss = IC × (misses / instruction) × penalty_miss = IC × (memory_accesses / instruction) × (1 − h) × penalty_miss
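Plugging hypothetical figures into the formulas above (every parameter here is assumed for illustration only):

```python
ic = 1_000_000_000        # instruction count
cpi = 1.0                 # base CPI without memory stalls
accesses_per_instr = 1.3
h = 0.97                  # hit ratio
penalty = 100             # miss penalty in cycles
t_cycle = 1e-9            # 1 GHz clock

stall_cycles = ic * accesses_per_instr * (1 - h) * penalty
t_cpu = (ic * cpi + stall_cycles) * t_cycle

print(round(stall_cycles))  # ≈ 3.9e9 stall cycles
print(round(t_cpu, 2))      # ≈ 4.9 seconds
```

With these numbers the memory stalls dwarf the useful compute cycles, nearly quintupling execution time.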

SLIDE 11

Basic cache memory – Policies and strategies

1. Introduction
2. Policies and strategies
3. Basic optimizations
4. Conclusion

SLIDE 12

Basic cache memory – Policies and strategies

Four questions about memory hierarchy

1. Where is a block placed in the upper level? → Block placement.
2. How is a block found in the upper level? → Block identification.
3. Which block must be replaced on a miss? → Block replacement.
4. What happens on a write? → Write strategy.

SLIDE 13

Basic cache memory – Policies and strategies

Q1: Block placement

Direct mapping.

Placement → (block address) mod n_blocks

Fully associative mapping.

Placement → Anywhere.

Set associative mapping.

Set placement → (block address) mod n_sets. Block placement within the set → Anywhere in the set.
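The three placements can be seen as one rule, varying the number of ways. A sketch, where `ways` is the associativity:

```python
def set_index(block_addr, n_blocks, ways):
    """ways=1: direct mapped; ways=n_blocks: fully associative."""
    n_sets = n_blocks // ways
    return block_addr % n_sets

# Block 13 in an 8-block cache:
print(set_index(13, 8, 1))  # 5  (direct mapped: 13 mod 8)
print(set_index(13, 8, 4))  # 1  (4-way, 2 sets: 13 mod 2)
print(set_index(13, 8, 8))  # 0  (fully associative: a single set)
```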

SLIDE 14

Basic cache memory – Policies and strategies

Q2: Block identification

Block address:

Tag: Identifies entry address.

Validity bit in every entry to signal whether content is valid.

Index: Selects the set.

Block offset:

Selects data within block.

Higher associativity means:

Fewer index bits. More tag bits.
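Tag/index/offset extraction for an assumed geometry (64-byte blocks and 256 sets; these parameters are illustrative, not from the slide):

```python
OFFSET_BITS = 6   # 64-byte blocks (assumed)
INDEX_BITS = 8    # 256 sets      (assumed)

def split_address(addr):
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

print(split_address(0x12345))  # (4, 141, 5)
```

Doubling the associativity halves the number of sets, removing one index bit and adding it to the tag, which is the trade-off stated above.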

SLIDE 15

Basic cache memory – Policies and strategies

Q3: Block replacement

Relevant for fully associative and set associative mappings:

Random.

Easy to implement.

LRU: Least Recently Used.

Increasing complexity as associativity increases.

FIFO.

Approximates LRU with a lower complexity.

Misses per 1000 instructions, SPEC 2000:

Size    | 2 ways            | 4 ways            | 8 ways
        | LRU   Rand  FIFO  | LRU   Rand  FIFO  | LRU   Rand  FIFO
16 KB   | 114.1 117.3 115.5 | 111.7 115.1 113.3 | 109.0 111.8 110.4
64 KB   | 103.4 104.3 103.9 | 102.4 102.3 103.1 |  99.7 100.5 100.3
256 KB  |  92.2  92.1  92.5 |  92.1  92.1  92.5 |  92.1  92.1  92.5

Source: Computer Architecture: A Quantitative Approach, 5th ed. Hennessy and Patterson. Morgan Kaufmann, 2012.
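A minimal LRU policy for a single set, sketched with an ordered dictionary (tags only, no data; names are illustrative):

```python
from collections import OrderedDict

class LRUSet:
    """One cache set with LRU replacement."""
    def __init__(self, ways):
        self.ways = ways
        self.tags = OrderedDict()   # insertion order = recency order

    def access(self, tag):
        if tag in self.tags:
            self.tags.move_to_end(tag)     # hit: now most recently used
            return True
        if len(self.tags) == self.ways:
            self.tags.popitem(last=False)  # evict least recently used
        self.tags[tag] = None
        return False

s = LRUSet(2)
print([s.access(t) for t in [1, 2, 1, 3, 2]])
# [False, False, True, False, False]
```

Real hardware avoids this bookkeeping cost per way, which is why the table shows LRU's advantage shrinking as size grows while its complexity rises with associativity.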

SLIDE 16

Basic cache memory – Policies and strategies

Q4: Write strategy

Write-through:

All writes go to the bus and memory. Easy to implement. Performance issues in SMPs.

Write-back:

Many writes are hits; write hits do not go to the bus and memory. Propagation and serialization problems. More complex.

SLIDE 17

Basic cache memory – Policies and strategies

Q4: Write strategy

Where is the write done?

write-through: in the cache block and the next level in memory. write-back: only in the cache block.

What happens when a block is evicted from the cache?

write-through: nothing else. write-back: the next level in memory is updated.

Debugging:

write-through: easy. write-back: difficult.

Does a miss cause a write to the next level?

write-through: no. write-back: yes, if the evicted block is modified.

Does a repeated write go to the next level?

write-through: yes. write-back: no.

SLIDE 18

Basic cache memory – Policies and strategies

Write buffer

[Diagram] Processor → Cache → Write buffer → Next level

Why a buffer?

To avoid stalls in CPU.

Why a buffer instead of a register?

Write bursts are frequent.

Are RAW hazards possible? Yes. Alternatives:

Flush buffer before a read. Check buffer before a read.
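The second alternative (checking the buffer before a read) can be sketched as follows; all names and the dictionary-backed memory are illustrative:

```python
from collections import deque

class WriteBuffer:
    """Sketch: reads check pending writes to resolve RAW hazards."""
    def __init__(self):
        self.pending = deque()   # (address, value) writes not yet drained
        self.memory = {}

    def write(self, addr, value):
        self.pending.append((addr, value))

    def read(self, addr):
        # Check pending writes, newest first, before going to memory.
        for a, v in reversed(self.pending):
            if a == addr:
                return v
        return self.memory.get(addr, 0)

    def drain_one(self):
        if self.pending:
            a, v = self.pending.popleft()
            self.memory[a] = v

wb = WriteBuffer()
wb.write(0x40, 7)
print(wb.read(0x40))  # 7, served from the buffer
wb.drain_one()
print(wb.read(0x40))  # 7, now from memory
```

Flushing instead of checking is simpler but stalls every read that follows a burst of writes.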

SLIDE 19

Basic cache memory – Policies and strategies

Miss penalty

Miss penalty:

Total miss latency vs. exposed latency (the part that generates CPU stalls).

stall_cycles_memory / IC = (misses / IC) × (latency_total − latency_overlapped)

SLIDE 20

Basic cache memory – Basic optimizations

1. Introduction
2. Policies and strategies
3. Basic optimizations
4. Conclusion

SLIDE 21

Basic cache memory – Basic optimizations

Cache basic optimizations

Decrease the miss rate.

Increase block size. Increase cache size. Increase associativity.

Decrease miss penalty.

Multi-level caches. Prioritize reads over writes.

Decrease the hit time.

Avoid address translation in cache indexing.

SLIDE 22

Basic cache memory – Basic optimizations – Decrease miss rate

3. Basic optimizations
   • Decrease miss rate
   • Decrease miss penalty
   • Decrease hit time

SLIDE 23

Basic cache memory – Basic optimizations – Decrease miss rate

1: Increase block size

Larger block size → Lower miss rate.

Better exploitation of spatial locality.

Larger block size → Higher miss penalty.

Upon a miss, a larger block must be transferred. And, for a fixed cache size, fewer blocks fit, which can increase misses again.

Need to balance:

Memory with high latency and high bandwidth:

Increase block size.

Memory with low latency and low bandwidth:

Decrease block size.
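A toy penalty model shows why the balance depends on the latency/bandwidth mix; all figures here are illustrative:

```python
def miss_penalty(block_bytes, latency_s, bandwidth_bytes_s):
    """Penalty = fixed access latency + block transfer time."""
    return latency_s + block_bytes / bandwidth_bytes_s

# High latency, high bandwidth: quadrupling the block barely hurts.
hi = miss_penalty(256, 100e-9, 50e9) / miss_penalty(64, 100e-9, 50e9)
# Low latency, low bandwidth: the same change doubles the penalty.
lo = miss_penalty(256, 20e-9, 5e9) / miss_penalty(64, 20e-9, 5e9)

print(round(hi, 2), round(lo, 2))  # ≈ 1.04 2.17
```

When latency dominates, larger blocks amortize it almost for free; when transfer time dominates, they inflate every miss.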

SLIDE 24

Basic cache memory – Basic optimizations – Decrease miss rate

2: Increase cache size

Larger cache size → lower miss rate.

More data fit in cache.

It may increase hit time.

More time needed to find a block.

Higher cost and higher energy consumption. A balance must be found.

Especially in on-chip caches.

SLIDE 25

Basic cache memory – Basic optimizations – Decrease miss rate

3: Increase associativity

Higher associativity → Lower miss rate.

Fewer conflicts, as more ways in the same set may be used.

It may increase hit time.

More time needed to find a block.

Consequence:

8 ways ≈ Fully associative.

SLIDE 26

Basic cache memory – Basic optimizations – Decrease miss penalty

3. Basic optimizations
   • Decrease miss rate
   • Decrease miss penalty
   • Decrease hit time

SLIDE 27

Basic cache memory – Basic optimizations – Decrease miss penalty

4: Multi-level caches

Goal: Decrease miss penalty.

Evolution:

The gap between DRAM and CPU performance keeps growing over time, increasing the cost of each miss.

Alternatives:

Make cache faster. Make cache larger.

Solution:

Both of them! Several cache levels.

SLIDE 28

Basic cache memory – Basic optimizations – Decrease miss penalty

Global and local miss rates

Local miss rate:

Misses at a cache level over accesses to that cache level. L1 miss rate → m(L1). L2 miss rate → m(L2).

Global miss rate:

Misses at a cache level over all memory accesses. L1 miss rate → m(L1). L2 miss rate → m(L1) × m(L2).

Average access time: t_h(L1) + m(L1) × t_m(L1) = t_h(L1) + m(L1) × (t_h(L2) + m(L2) × t_m(L2))
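The two-level formula in runnable form, with hypothetical timings and miss rates:

```python
def amat_two_levels(th1, m1, th2, m2, tm2):
    # t = th(L1) + m(L1) * (th(L2) + m(L2) * tm(L2))
    return th1 + m1 * (th2 + m2 * tm2)

# 1-cycle L1, 5% local L1 misses, 10-cycle L2,
# 30% local L2 misses, 100-cycle memory:
print(amat_two_levels(1, 0.05, 10, 0.30, 100))  # ≈ 3.0 cycles
print(0.05 * 0.30)  # global L2 miss rate, ≈ 0.015
```

The L2 cuts the effective penalty of most L1 misses from 100 cycles to 10, which is exactly the point of adding the level.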

SLIDE 29

Basic cache memory – Basic optimizations – Decrease miss penalty

5: Prioritize read misses over writes

Goal: Decrease miss penalty.

Avoid that a read miss has to wait until writes are completed.

Write-through caches.

Write buffer might contain the data being read.

a) Wait for write buffer to be empty. b) Check values in write buffer.

Write-back caches.

A read miss might replace a modified block.

Copy the modified block to a buffer, serve the read first, then write the block back to memory. Apply option a or b to this buffer.

SLIDE 30

Basic cache memory – Basic optimizations – Decrease hit time

3. Basic optimizations
   • Decrease miss rate
   • Decrease miss penalty
   • Decrease hit time

SLIDE 31

Basic cache memory – Basic optimizations – Decrease hit time

6: Avoid address translation during indexing

Goal: Decrease hit time. Translation procedure:

Virtual address → Physical address. May require additional memory accesses.

Or at least an access to the TLB.

Idea: Optimize the most common case (hits).

Use virtual addresses for cache.

Tasks:

Indexing the cache → Use the page offset. Comparing tags → Use the translated physical address.

SLIDE 32

Basic cache memory – Basic optimizations – Decrease hit time

Problems with virtual caches (I)

Protection.

Page-level protection checked during virtual-to-physical translation. Solution: Copy protection information from TLB on misses.

Process switch.

After a process switch, the same virtual addresses refer to different physical addresses. Solutions:

Flush the cache. Add PID to tag.

SLIDE 33

Basic cache memory – Basic optimizations – Decrease hit time

Problems with virtual caches (II)

Aliasing:

Two different virtual addresses map to the same physical address.

Anti-aliasing hardware: guarantees that every cache block corresponds to a unique physical address.

Check multiple addresses and invalidate duplicates.

Page coloring: force all aliases to have identical last n bits.

This makes it impossible for two aliases to be in the cache at the same time.

Input/output addresses:

I/O typically uses physical addresses. They must be mapped to virtual addresses to interact with a virtual cache.

SLIDE 34

Basic cache memory – Basic optimizations – Decrease hit time

Address translation

Solution:

Virtual indexing and physical tagging.

Tasks:

Indexing the cache → Use page offset.

This part is identical in physical and virtual addresses.

Comparing tags → Use translated physical address.

Tag matching uses physical address.
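Virtual indexing works without translation only if the index and block-offset bits fit within the page offset. A quick check under assumed parameters (4 KiB pages and 64-byte blocks; neither is stated on the slide):

```python
PAGE_BITS = 12    # 4 KiB pages    (assumed)
BLOCK_BITS = 6    # 64-byte blocks (assumed)

def max_virtually_indexed_size(ways):
    """Largest cache (bytes) indexable purely with page-offset bits."""
    index_bits = PAGE_BITS - BLOCK_BITS
    return ways * (1 << index_bits) * (1 << BLOCK_BITS)

print(max_virtually_indexed_size(1) // 1024)  # 4 KiB direct mapped
print(max_virtually_indexed_size(8) // 1024)  # 32 KiB 8-way
```

This constraint is one practical reason to raise associativity: each extra way grows the cache without adding index bits.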

SLIDE 35

Basic cache memory – Conclusion

1. Introduction
2. Policies and strategies
3. Basic optimizations
4. Conclusion

SLIDE 36

Basic cache memory – Conclusion

Summary

Caching is a solution to mitigate the memory wall. Technology evolution and the principle of locality put increasing pressure on caches. Miss penalty depends on access time and transfer time.

Four key dimensions in cache design:

Block placement, block identification, block replacement, and write strategy.

Six basic cache optimizations:

Decrease miss rate: increase block size, increase cache size, increase associativity.

Decrease miss penalty: multi-level caches, prioritize reads over writes.

Decrease hit time: avoid address translation.

SLIDE 37

Basic cache memory – Conclusion

References

Computer Architecture: A Quantitative Approach, 5th ed. Hennessy and Patterson. Sections B.1, B.2, B.3. Recommended exercises:

B.1, B.2, B.3, B.4, B.5, B.6, B.7, B.8, B.9, B.10, B.11

SLIDE 38

Basic cache memory – Conclusion

Basic cache memory

Computer Architecture

  • J. Daniel García Sánchez (coordinator)
  • David Expósito Singh
  • Francisco Javier García Blas

ARCOS Group, Computer Science and Engineering Department, University Carlos III of Madrid
