SLIDE 1
An Adaptive Bloom Filter Cache Partitioning Scheme for Multicore Architectures
Konstantinos Nikas
Computing Systems Laboratory, National Technical University of Athens, Greece
Matthew Horsnell, Jim Garside
Advanced Processor Technologies Group, University of Manchester
SLIDE 2 Introduction
- Cores in CMPs typically share some level of the memory hierarchy
- Applications compete for the limited shared space
- Need for efficient use of the shared cache
– Requests to off-chip memory are expensive (latency and power)
SLIDE 3 Introduction
- LRU (or approximations) is typically employed
- Partitions the cache implicitly on a demand basis
– Application with highest demand gets majority of cache resources
– Could be suboptimal (e.g. streaming applications)
– Cannot detect and deal with inter-thread interference
SLIDE 4 Motivation
- Applications can be classified into 3 different categories [Qureshi and Patt (MICRO ’06)]
- High Utility
– Applications that continue to benefit significantly as the cache space is increased
SLIDE 5 Motivation
- Low Utility
– Applications that do not benefit significantly as the cache space is gradually increased
SLIDE 6 Motivation
- Saturating Utility
– Applications that initially benefit as the cache space is increased, but whose benefit eventually saturates
- Target: Exploit the differences in the cache utility of concurrently executed applications
SLIDE 7
Static Cache Partitioning
SLIDE 8 Static Cache Partitioning
- Two major drawbacks
- The system must be aware of each application’s profile
- Partitions remain the same throughout the execution
– Programs are known to have distinct phases of behaviour
- Need for a scheme that can partition the cache dynamically
– Acquire the applications’ profile at run-time
– Repartition when the phase of an application changes
SLIDE 9 Dynamic Cache Partitioning
- LRU’s “stack property” [Mattson et al. 1970]
“An access that hits in an N-way associative cache using the LRU replacement policy is guaranteed to also hit if the cache had more than N ways, provided that the number of sets remains the same.” (illustrated by the sketch below)
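A minimal C++ illustration (mine, not from the talk) of the stack property: modelling one set as an LRU recency stack, an access hits in an N-way cache exactly when its stack distance is at most N, so any hit at N ways remains a hit at higher associativity.

```cpp
// Minimal sketch (mine, not from the talk): one cache set modelled as an
// LRU recency stack. The stack distance of an access is the position of
// its tag in the stack; an access hits in an N-way set iff its stack
// distance <= N, so a hit at N ways is also a hit at any associativity > N.
#include <cstddef>
#include <cstdint>
#include <deque>
#include <iostream>
#include <vector>

// Returns the 1-based stack distance of 'tag' (SIZE_MAX for a cold miss),
// then moves the tag to the most-recently-used position.
static std::size_t access_set(std::deque<std::uint64_t>& stack, std::uint64_t tag) {
    for (std::size_t i = 0; i < stack.size(); ++i) {
        if (stack[i] == tag) {
            stack.erase(stack.begin() + static_cast<std::ptrdiff_t>(i));
            stack.push_front(tag);
            return i + 1;
        }
    }
    stack.push_front(tag);
    return SIZE_MAX; // never seen before: misses at every associativity
}

int main() {
    std::deque<std::uint64_t> stack;
    const std::vector<std::uint64_t> trace = {1, 2, 3, 1, 4, 2, 1};
    for (std::uint64_t tag : trace) {
        const std::size_t d = access_set(stack, tag);
        // Hit in a 2-way set iff d <= 2, in a 4-way set iff d <= 4, and so on.
        if (d == SIZE_MAX) std::cout << "tag " << tag << ": cold miss\n";
        else               std::cout << "tag " << tag << ": distance " << d << '\n';
    }
}
```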
SLIDE 10 ABFCP : Overview
- Adaptive Bloom Filter Cache Partitioning (ABFCP)
[Diagram: cores 0…n with private I- and D-caches, a Partitioning Module, the shared L2 cache and DRAM]
– Track misses and hits
– Partitioning Algorithm
– Replacement support to enforce partitions
SLIDE 11 ABFCP : Tracking system
– Misses that would have been hits had the application been allowed to use more cache ways (“far misses”)
– Tracked by Bloom filters (see the sketch below)
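A hedged C++ sketch of the tracking idea, assuming a per-core, per-set 32-bit Bloom filter (the 32b array size comes from the hardware-cost slide; the hash functions and clearing policy here are illustrative, not the paper's exact design).

```cpp
// Illustrative sketch, not the exact ABFCP design: per core and per set,
// a small Bloom filter remembers tags recently evicted from that core's
// partition. A miss whose tag is (probably) in the filter is a "far
// miss": it would have hit had the core owned more ways.
#include <bitset>
#include <cstdint>

struct FarMissFilter {
    std::bitset<32> bits;        // 32-bit filter array per set
    std::uint32_t farMisses = 0; // feeds C_FarMiss for this set

    // Two cheap hash functions over the evicted tag (illustrative).
    static unsigned h1(std::uint64_t t) { return  t       % 32; }
    static unsigned h2(std::uint64_t t) { return (t >> 5) % 32; }

    // Called when a line of the owning core is evicted from the set.
    void recordEviction(std::uint64_t tag) {
        bits.set(h1(tag));
        bits.set(h2(tag));
    }

    // Called on a cache miss by the owning core.
    void onMiss(std::uint64_t tag) {
        if (bits.test(h1(tag)) && bits.test(h2(tag)))
            ++farMisses; // probably evicted recently: count a far miss
    }

    void clear() { bits.reset(); farMisses = 0; } // e.g. after repartitioning
};
```

Bloom filters can report false positives, so the far-miss count is an approximation; the pay-off is that a 32-bit array per set is far cheaper than storing the evicted tags themselves.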
SLIDE 12 ABFCP : Partitioning Algorithm
- 2 counters per core per cache set
– C_LRU
– C_FarMiss
- Each core’s allocation can be changed by ± 1 way
- Estimate performance loss/gain
– −1 way: Hits in the LRU position will become misses (loss → C_LRU)
– +1 way: A portion of the far misses will become hits
- perf. gain → a · C_FarMiss, where a = (1 − ways/assoc); see the sketch below
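A minimal C++ sketch (variable names are mine, not the paper's) of how these per-set estimates could be computed from the two counters.

```cpp
// Per-core, per-set counters as described on the slide.
struct CoreCounters {
    unsigned lruHits;    // C_LRU:     hits in the LRU position of the partition
    unsigned farMisses;  // C_FarMiss: misses flagged by the Bloom filter
};

// Estimated hits lost if this core's allocation shrinks by one way:
// exactly the hits that landed in the LRU position.
unsigned lossMinusOneWay(const CoreCounters& c) { return c.lruHits; }

// Estimated hits gained if the allocation grows by one way: a fraction
// 'a' of the far misses become hits, with a = 1 - ways/assoc.
double gainPlusOneWay(const CoreCounters& c, unsigned ways, unsigned assoc) {
    const double a = 1.0 - static_cast<double>(ways) / assoc;
    return a * c.farMisses;
}
```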
SLIDE 13 ABFCP : Partitioning Algorithm
- Select the partition that maximises performance (hits)
- Complexity
– cores = 2 → possible partitions = 3
– cores = 4 → possible partitions = 19
– cores = 8 → possible partitions = 1107
– cores = 16 → possible partitions = 5196627
- Linear algorithm that selects the best partition or a good approximation thereof (see the sketch below)
– N/2 comparisons (worst case) → O(N)
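One plausible C++ reading of such a linear pass (mine; the paper's exact pairing may differ): scan once for the core with the largest estimated gain from +1 way and the core with the smallest estimated loss from −1 way, then trade a way only if the gain exceeds the loss.

```cpp
// Hedged sketch of a linear +/-1-way selection pass, not the paper's
// exact algorithm: one scan finds the best gainer and the cheapest
// loser; a way moves between them only if the trade raises total hits.
#include <utility>
#include <vector>

struct Candidate {
    double gain;    // estimated hits gained by +1 way (a * C_FarMiss)
    double loss;    // estimated hits lost by -1 way (C_LRU)
    unsigned ways;  // current allocation
};

// Returns {gainer, loser} core indices, or {-1, -1} if no move helps.
std::pair<int, int> pickMove(const std::vector<Candidate>& cores) {
    int best = -1, worst = -1;
    for (int i = 0; i < static_cast<int>(cores.size()); ++i) {
        if (best < 0 || cores[i].gain > cores[best].gain)
            best = i;
        if (cores[i].ways > 0 && (worst < 0 || cores[i].loss < cores[worst].loss))
            worst = i;   // a core must own a way before it can lose one
    }
    if (best < 0 || worst < 0 || best == worst) return {-1, -1};
    if (cores[best].gain <= cores[worst].loss)  return {-1, -1};
    return {best, worst}; // +1 way to 'best', -1 way from 'worst'
}
```

Each pass is a single O(N) scan, which matches the linear complexity claimed on the slide.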
SLIDE 14 ABFCP : Way Partitioning
- Way Partitioning support [Suh et al. HPCA ’02, Qureshi and Patt MICRO ’06]
- Each line has a core-id field
- On a miss, the ways occupied by the miss-causing application are counted (see the sketch below)
– ways_occupied < partition_limit → victim is the LRU line of another application
– Otherwise the victim is the LRU line of the miss-causing application
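A C++ sketch of this victim selection (structure and field names are illustrative, not from the paper).

```cpp
// Sketch of the replacement rule on a miss: each line carries a core-id;
// a core under its quota evicts another application's LRU line, while a
// core at or over its quota evicts its own LRU line.
#include <vector>

struct Line {
    unsigned coreId;   // which core allocated this line
    unsigned lruRank;  // recency position: higher rank = older
};

// 'set' holds the lines of one cache set; 'core' caused the miss and
// owns 'limit' ways under the current partition. Returns victim index.
int pickVictim(const std::vector<Line>& set, unsigned core, unsigned limit) {
    unsigned occupied = 0;
    for (const Line& l : set)
        if (l.coreId == core) ++occupied;

    const bool evictOther = occupied < limit; // under quota: take from others
    int victim = -1;
    for (int i = 0; i < static_cast<int>(set.size()); ++i) {
        const bool mine = (set[i].coreId == core);
        if (mine == evictOther) continue; // not in the candidate group
        if (victim < 0 || set[i].lruRank > set[victim].lruRank)
            victim = i;                   // oldest candidate so far
    }
    return victim;
}
```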
SLIDE 15 Evaluation
– 2, 4, 8 single-issue, in-order cores
– Private L1 I and D caches (32KB, 4-way associative, 32B line size, 1-cycle access latency)
– Unified shared on-chip L2 cache (4MB, 32-way associative, 32B line size, 16-cycle access latency)
– Main memory (32 outstanding requests, 100-cycle access latency)
– 9 apps from JavaGrande + NAS
– One application per processor
– Simulation stops when one of the benchmarks finishes
SLIDE 16
Results (Dual core system)
SLIDE 17
Results (Dual core system)
SLIDE 18
Results (Quad core system)
SLIDE 19
Results (Eight core system)
SLIDE 20 Evaluation
- Increasing promise as the number of cores increases
- Hardware Cost per core
– BF arrays (4096 sets × 32b) → 16KB
– Counters (4096 sets × 2 counters × 8b) → 8KB
– L2 Cache (240KB tags + 4MB data) → 4336KB
– 0.55% increase in area
– 48KB for the per-line core-ids across all cache sets
– Total overhead 240KB → 5.5% increase over L2 (worked through below)
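The arithmetic behind these figures, worked through in a small C++ check (the 3-bit core-id width for an 8-core system is my inference, not stated on the slide).

```cpp
// Worked check of the storage figures above: 8 cores, 4096 sets, 32 ways.
#include <iostream>

int main() {
    const unsigned sets = 4096, ways = 32, cores = 8;
    const unsigned bfBits     = sets * 32;       // Bloom filter array per core
    const unsigned ctrBits    = sets * 2 * 8;    // two 8-bit counters per core
    const unsigned coreIdBits = sets * ways * 3; // assumed 3-bit id per line

    const unsigned perCoreKB = (bfBits + ctrBits) / 8 / 1024;          // 24 KB
    const unsigned totalKB   = perCoreKB * cores + coreIdBits / 8 / 1024;

    std::cout << perCoreKB << " KB per core, " << totalKB << " KB total\n";
    // Prints 24 KB per core and 240 KB total; 240 KB over a 4336 KB L2
    // (tags + data) is the ~5.5% overhead quoted on the slide.
}
```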
SLIDE 21
Evaluation
SLIDE 22 Related Work
- Cache Partitioning Aware Replacement Policy [Dybdahl et al. HiPC ’06]
– Cannot deal with applications with non-convex miss rate curves
- Utility-Based cache partitioning [Qureshi and Patt MICRO ’06]
– Smaller overhead
– Enforces the same partition over all the cache sets
SLIDE 23 Conclusions
- It is important to share the cache efficiently in CMPs
- LRU does not achieve optimal sharing of the cache
- Cache partitioning can alleviate the consequences of suboptimal LRU sharing
- ABFCP
– shows increasing promise as the number of cores increases
– provides better performance than LRU at a reasonable cost (5.5% increase for an 8-core system achieves similar results to using LRU with a 50% bigger L2 cache)
SLIDE 24
Any Questions?
Thank you!
SLIDE 25
Utility-Based Cache Partitioning
SLIDE 26 Utility-Based Cache Partitioning
- High hardware overhead
- Dynamic Set Sampling (monitor only 32 sets)
– Smaller UMONs
- Enforce the same partition for the whole cache
– Fewer counters
SLIDE 27
Utility-Based Cache Partitioning
SLIDE 28 ABFCP Comparison with UCP
- UCP has a smaller storage overhead (70KB for an 8-core system)
- If UCP were to partition on a line basis, it would require 11MB per processor
- ABFCP is more robust
- ABFCP scales better as the number of cores increases
SLIDE 29
ABFCP Comparison with UCP
SLIDE 30
CPARP
SLIDE 31
Conclusions
SLIDE 32 Evaluation
- UCP acquires a more accurate profile than CPARP
– curr_hits = 135
– if app2 gets 6 ways then hits = 145 (UCP)
– CPARP does not modify the partition