SLIDE 1
An Adaptive Bloom Filter Cache Partitioning Scheme for Multicore Architectures
Konstantinos Nikas
Computing Systems Laboratory, National Technical University of Athens, Greece
Matthew Horsnell, Jim Garside
Advanced Processor Technologies Group, University of Manchester
SLIDE 2 Introduction
- Cores in CMPs typically share some level of the memory hierarchy
- Applications compete for the limited shared space
- Need for efficient use of the shared cache
– Requests to off-chip memory are expensive (latency and power)
SLIDE 3 Introduction
- LRU (or approximations) is typically employed
- Partitions the cache implicitly on a demand basis
– Application with highest demand gets majority of cache resources
– Could be suboptimal (e.g. streaming applications)
– Cannot detect and deal with inter-thread interference
SLIDE 4 Motivation
- Applications can be classified into 3 different categories [Qureshi and Patt (MICRO ’06)]
- High Utility
– Applications that continue to benefit significantly as the cache space is increased
SLIDE 5 Motivation
- Low Utility
– Applications that do not benefit significantly as the cache space is gradually increased
SLIDE 6 Motivation
- Saturating Utility
– Applications that initially benefit as the cache space is increased, but whose benefit eventually saturates
- Target: Exploit the differences in the cache utility of concurrently executed applications
SLIDE 7
Static Cache Partitioning
SLIDE 8 Static Cache Partitioning
- Two major drawbacks
- The system must be aware of each application’s profile
- Partitions remain the same throughout the execution
– Programs are known to have distinct phases of behaviour
- Need for a scheme that can partition the cache dynamically
– Acquire the applications’ profile at run-time
– Repartition when the phase of an application changes
SLIDE 9 Dynamic Cache Partitioning
- LRU’s “stack property” [Mattson et al. 1970]
“An access that hits in an N-way associative cache using the LRU replacement policy is guaranteed to also hit if the cache had more than N ways, provided that the number of sets remains the same.” (illustrated by the sketch below)
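A minimal C++ illustration (mine, not from the talk) of the stack property: modelling one set as an LRU recency stack, an access hits in an N-way cache exactly when its stack distance is at most N, so any hit at N ways remains a hit at higher associativity.

```cpp
// Minimal sketch (mine, not from the talk): one cache set modelled as an
// LRU recency stack. The stack distance of an access is the position of
// its tag in the stack; an access hits in an N-way set iff its stack
// distance <= N, so a hit at N ways is also a hit at any associativity > N.
#include <cstddef>
#include <cstdint>
#include <deque>
#include <iostream>
#include <vector>

// Returns the 1-based stack distance of 'tag' (SIZE_MAX for a cold miss),
// then moves the tag to the most-recently-used position.
static std::size_t access_set(std::deque<std::uint64_t>& stack, std::uint64_t tag) {
    for (std::size_t i = 0; i < stack.size(); ++i) {
        if (stack[i] == tag) {
            stack.erase(stack.begin() + static_cast<std::ptrdiff_t>(i));
            stack.push_front(tag);
            return i + 1;
        }
    }
    stack.push_front(tag);
    return SIZE_MAX; // never seen before: misses at every associativity
}

int main() {
    std::deque<std::uint64_t> stack;
    const std::vector<std::uint64_t> trace = {1, 2, 3, 1, 4, 2, 1};
    for (std::uint64_t tag : trace) {
        const std::size_t d = access_set(stack, tag);
        // Hit in a 2-way set iff d <= 2, in a 4-way set iff d <= 4, and so on.
        if (d == SIZE_MAX) std::cout << "tag " << tag << ": cold miss\n";
        else               std::cout << "tag " << tag << ": distance " << d << '\n';
    }
}
```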
SLIDE 10 ABFCP : Overview
- Adaptive Bloom Filter Cache Partitioning (ABFCP)
[Diagram: cores 0…n with private I- and D-caches, a Partitioning Module, the shared L2 cache and DRAM]
– Track misses and hits
– Partitioning Algorithm
– Replacement support to enforce partitions
SLIDE 11 ABFCP : Tracking system
– Misses that would have been hits had the application been allowed to use more cache ways (“far misses”)
– Tracked by Bloom filters (see the sketch below)
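A hedged C++ sketch of the tracking idea, assuming a per-core, per-set 32-bit Bloom filter (the 32b array size comes from the hardware-cost slide; the hash functions and clearing policy here are illustrative, not the paper's exact design).

```cpp
// Illustrative sketch, not the exact ABFCP design: per core and per set,
// a small Bloom filter remembers tags recently evicted from that core's
// partition. A miss whose tag is (probably) in the filter is a "far
// miss": it would have hit had the core owned more ways.
#include <bitset>
#include <cstdint>

struct FarMissFilter {
    std::bitset<32> bits;        // 32-bit filter array per set
    std::uint32_t farMisses = 0; // feeds C_FarMiss for this set

    // Two cheap hash functions over the evicted tag (illustrative).
    static unsigned h1(std::uint64_t t) { return  t       % 32; }
    static unsigned h2(std::uint64_t t) { return (t >> 5) % 32; }

    // Called when a line of the owning core is evicted from the set.
    void recordEviction(std::uint64_t tag) {
        bits.set(h1(tag));
        bits.set(h2(tag));
    }

    // Called on a cache miss by the owning core.
    void onMiss(std::uint64_t tag) {
        if (bits.test(h1(tag)) && bits.test(h2(tag)))
            ++farMisses; // probably evicted recently: count a far miss
    }

    void clear() { bits.reset(); farMisses = 0; } // e.g. after repartitioning
};
```

Bloom filters can report false positives, so the far-miss count is an approximation; the pay-off is that a 32-bit array per set is far cheaper than storing the evicted tags themselves.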
SLIDE 12 ABFCP : Partitioning Algorithm
- 2 counters per core per cache set
– C_LRU
– C_FarMiss
- Each core’s allocation can be changed by ± 1 way
- Estimate performance loss/gain
– −1 way: Hits in the LRU position will become misses (loss → C_LRU)
– +1 way: A portion of the far misses will become hits
- perf. gain → a · C_FarMiss, where a = (1 − ways/assoc); see the sketch below
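A minimal C++ sketch (variable names are mine, not the paper's) of how these per-set estimates could be computed from the two counters.

```cpp
// Per-core, per-set counters as described on the slide.
struct CoreCounters {
    unsigned lruHits;    // C_LRU:     hits in the LRU position of the partition
    unsigned farMisses;  // C_FarMiss: misses flagged by the Bloom filter
};

// Estimated hits lost if this core's allocation shrinks by one way:
// exactly the hits that landed in the LRU position.
unsigned lossMinusOneWay(const CoreCounters& c) { return c.lruHits; }

// Estimated hits gained if the allocation grows by one way: a fraction
// 'a' of the far misses become hits, with a = 1 - ways/assoc.
double gainPlusOneWay(const CoreCounters& c, unsigned ways, unsigned assoc) {
    const double a = 1.0 - static_cast<double>(ways) / assoc;
    return a * c.farMisses;
}
```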
SLIDE 13 ABFCP : Partitioning Algorithm
- Select the partition that maximises performance (hits)
- Complexity
– cores = 2 → possible partitions = 3
– cores = 4 → possible partitions = 19
– cores = 8 → possible partitions = 1107
– cores = 16 → possible partitions = 5196627
- Linear algorithm that selects the best partition or a good approximation thereof (see the sketch below)
– N/2 comparisons (worst case) → O(N)
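One plausible C++ reading of such a linear pass (mine; the paper's exact pairing may differ): scan once for the core with the largest estimated gain from +1 way and the core with the smallest estimated loss from −1 way, then trade a way only if the gain exceeds the loss.

```cpp
// Hedged sketch of a linear +/-1-way selection pass, not the paper's
// exact algorithm: one scan finds the best gainer and the cheapest
// loser; a way moves between them only if the trade raises total hits.
#include <utility>
#include <vector>

struct Candidate {
    double gain;    // estimated hits gained by +1 way (a * C_FarMiss)
    double loss;    // estimated hits lost by -1 way (C_LRU)
    unsigned ways;  // current allocation
};

// Returns {gainer, loser} core indices, or {-1, -1} if no move helps.
std::pair<int, int> pickMove(const std::vector<Candidate>& cores) {
    int best = -1, worst = -1;
    for (int i = 0; i < static_cast<int>(cores.size()); ++i) {
        if (best < 0 || cores[i].gain > cores[best].gain)
            best = i;
        if (cores[i].ways > 0 && (worst < 0 || cores[i].loss < cores[worst].loss))
            worst = i;   // a core must own a way before it can lose one
    }
    if (best < 0 || worst < 0 || best == worst) return {-1, -1};
    if (cores[best].gain <= cores[worst].loss)  return {-1, -1};
    return {best, worst}; // +1 way to 'best', -1 way from 'worst'
}
```

Each pass is a single O(N) scan, which matches the linear complexity claimed on the slide.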
SLIDE 14 ABFCP : Way Partitioning
- Way Partitioning support [Suh et al. HPCA ’02, Qureshi and Patt MICRO ’06]
- Each line has a core-id field
- On a miss, the ways occupied by the miss-causing application are counted (see the sketch below)
– ways_occupied < partition_limit → victim is the LRU line of another application
– Otherwise the victim is the LRU line of the miss-causing application
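A C++ sketch of this victim selection (structure and field names are illustrative, not from the paper).

```cpp
// Sketch of the replacement rule on a miss: each line carries a core-id;
// a core under its quota evicts another application's LRU line, while a
// core at or over its quota evicts its own LRU line.
#include <vector>

struct Line {
    unsigned coreId;   // which core allocated this line
    unsigned lruRank;  // recency position: higher rank = older
};

// 'set' holds the lines of one cache set; 'core' caused the miss and
// owns 'limit' ways under the current partition. Returns victim index.
int pickVictim(const std::vector<Line>& set, unsigned core, unsigned limit) {
    unsigned occupied = 0;
    for (const Line& l : set)
        if (l.coreId == core) ++occupied;

    const bool evictOther = occupied < limit; // under quota: take from others
    int victim = -1;
    for (int i = 0; i < static_cast<int>(set.size()); ++i) {
        const bool mine = (set[i].coreId == core);
        if (mine == evictOther) continue; // not in the candidate group
        if (victim < 0 || set[i].lruRank > set[victim].lruRank)
            victim = i;                   // oldest candidate so far
    }
    return victim;
}
```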
SLIDE 15 Evaluation
– 2, 4, 8 single-issue, in-order cores
– Private L1 I and D caches (32KB, 4-way associative, 32B line size, 1-cycle access latency)
– Unified shared on-chip L2 cache (4MB, 32-way associative, 32B line size, 16-cycle access latency)
– Main memory (32 outstanding requests, 100-cycle access latency)
– 9 apps from JavaGrande + NAS
– One application per processor
– Simulation stops when one of the benchmarks finishes
SLIDE 16
Results (Dual core system)
SLIDE 17
Results (Dual core system)
SLIDE 18
Results (Quad core system)
SLIDE 19
Results (Eight core system)
SLIDE 20 Evaluation
- Increasing promise as the number of cores increases
- Hardware Cost per core
– BF arrays (4096 sets × 32b) → 16KB
– Counters (4096 sets × 2 counters × 8b) → 8KB
– L2 Cache (240KB tags + 4MB data) → 4336KB
– 0.55% increase in area
– 48KB for the per-line core-ids across all cache sets
– Total overhead 240KB → 5.5% increase over L2 (worked through below)
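The arithmetic behind these figures, worked through in a small C++ check (the 3-bit core-id width for an 8-core system is my inference, not stated on the slide).

```cpp
// Worked check of the storage figures above: 8 cores, 4096 sets, 32 ways.
#include <iostream>

int main() {
    const unsigned sets = 4096, ways = 32, cores = 8;
    const unsigned bfBits     = sets * 32;       // Bloom filter array per core
    const unsigned ctrBits    = sets * 2 * 8;    // two 8-bit counters per core
    const unsigned coreIdBits = sets * ways * 3; // assumed 3-bit id per line

    const unsigned perCoreKB = (bfBits + ctrBits) / 8 / 1024;          // 24 KB
    const unsigned totalKB   = perCoreKB * cores + coreIdBits / 8 / 1024;

    std::cout << perCoreKB << " KB per core, " << totalKB << " KB total\n";
    // Prints 24 KB per core and 240 KB total; 240 KB over a 4336 KB L2
    // (tags + data) is the ~5.5% overhead quoted on the slide.
}
```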
SLIDE 21
Evaluation
SLIDE 22 Related Work
- Cache Partitioning Aware Replacement Policy [Dybdahl et al. HiPC ’06]
– Cannot deal with applications with non-convex miss rate curves
- Utility-Based cache partitioning [Qureshi and Patt MICRO ’06]
– Smaller overhead
– Enforces the same partition over all the cache sets
SLIDE 23 Conclusions
- It is important to share the cache efficiently in CMPs
- LRU does not achieve optimal sharing of the cache
- Cache partitioning can alleviate the consequences of suboptimal LRU sharing
- ABFCP
– shows increasing promise as the number of cores increases
– provides better performance than LRU at a reasonable cost (5.5% increase for an 8-core system achieves similar results to using LRU with a 50% bigger L2 cache)
SLIDE 24
Any Questions?
Thank you!
SLIDE 25
Utility-Based Cache Partitioning
SLIDE 26 Utility-Based Cache Partitioning
- High hardware overhead
- Dynamic Set Sampling (monitor only 32 sets)
– Smaller UMONs
- Enforce the same partition for the whole cache
– Fewer counters
SLIDE 27
Utility-Based Cache Partitioning
SLIDE 28 ABFCP Comparison with UCP
- UCP has a smaller storage overhead (70KB for an 8-core system)
- If UCP were to partition on a line basis, it would require 11MB per processor
- ABFCP is more robust
- ABFCP scales better as the number of cores increases
SLIDE 29
ABFCP Comparison with UCP
SLIDE 30
CPARP
SLIDE 31
Conclusions
SLIDE 32 Evaluation
- UCP acquires a more accurate profile than CPARP
– curr_hits = 135
– if app2 gets 6 ways then hits = 145 (UCP)
– CPARP does not modify the partition