SLIDE 1

COMPUTER ARCHITECTURE GROUP

Rethinking Last-Level Cache Management for Multicores Operating at Near-Threshold

Farrukh Hijaz, Omer Khan
University of Connecticut

SLIDE 2

Power Efficiency

Power-performance efficiency stalled

Multicores enable efficiency

Still need >3× improvement to meet HPC GFLOPS/Watt goal

[Figure: Performance/Watt vs. Years, with ‘14 marked. Image credit: http://www.vr-zone.com]

SLIDE 3

The Value of Operating at NTV

Near Threshold Voltage operation potentially enables 5-10× power-performance efficiency

[Intel: DAC’12]

SLIDE 4

NTV Operation? Logic (✓)

[Intel:DAC’12]

NTV (✓)

SLIDE 5

NTV Operation? Cache (✗)

SRAM bit-cells susceptible to errors at NTV

NTV (✗)

[Intel:DAC’12]

SLIDE 6

NTV Approaches for On-chip Memory

High voltage, High frequency
  High performance
  Low energy efficiency
  No faults
Low voltage, Low frequency
  Low performance
  Highest energy efficiency
  No faults
Low voltage, High frequency (Our Approach!)
  High performance
  High energy efficiency
  Permanent faults

SLIDE 7

NTV Approaches for Permanent Faults

Circuit level (8T, 10T SRAM bit-cell)
  High area overhead
  Higher leakage current
ECC based (SECDED, MS-ECC)
  Constant latency overhead
Disabling based (e.g., cache line disabling)
  Lower available capacity
Hybrid of ECC and Disabling (e.g., VS-ECC): Our Approach!
  Trades off available capacity and latency overhead

SLIDE 8

The NTV Challenge in Multicores

Future multicores will have 100s of cores

LLC management is key to optimizing performance and energy

Last-level cache (LLC) data locality and off-chip miss rates are 1st order constraints and often show opposing trends

Lower available LLC capacity at NTV presents new challenges

On-Chip Latency: Diameter of on-chip network increases with core count

Off-Chip Bandwidth: Limited off-chip bandwidth

SLIDE 9

Static-NUCA (LLC Data Placement)

Statically address interleaves data across all physically distributed LLC slices
No replication of data in the LLC slices
High cache utilization since all data evenly distributed
Data resides in a remote LLC slice with high probability
High remote LLC slice access rate results in higher on-chip network traffic and high average LLC access latency/energy
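
The static address interleaving described on this slide can be sketched as follows. This is a minimal illustration, not the simulator's implementation: the 64-byte line size, the line-granularity interleaving, the one-slice-per-core layout, and the function names `snuca_home_slice` and `remote_fraction` are all assumptions made for the example.

```python
CACHE_LINE_BYTES = 64  # assumed line size; not stated in the slides
NUM_SLICES = 64        # assumed: one LLC slice per core in a 64-core chip


def snuca_home_slice(phys_addr: int,
                     num_slices: int = NUM_SLICES,
                     line_bytes: int = CACHE_LINE_BYTES) -> int:
    """Return the single home LLC slice for an address (no replication)."""
    line_number = phys_addr // line_bytes
    # Consecutive cache lines map to consecutive slices, spreading data evenly.
    return line_number % num_slices


def remote_fraction(num_slices: int = NUM_SLICES) -> float:
    """Fraction of uniformly distributed lines whose home is NOT the local slice."""
    return (num_slices - 1) / num_slices
```

Under these assumptions, a core finds only 1 of every 64 lines in its local slice, which is why most S-NUCA accesses travel over the on-chip network.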

SLIDE 10

Reactive-NUCA (LLC Data Placement, Limited Replication)

Classifies data as private or shared on page granularity using the existing virtual memory system
Maps private pages to requesting core’s local LLC slice
Maps shared pages across the chip based on static address interleaving (similar to Static-NUCA)
Replication of data not allowed
Instructions replicated in LLC slice per cluster of 4, using rotational interleaving
Low LLC access latency/energy for correctly classified private data and instructions

No locality optimizations for shared data
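
The placement rules on this slide can be sketched as a single lookup function. This is a simplified illustration under stated assumptions: an 8×8 mesh of 64 tiles, 64-byte lines, and hypothetical names (`PageClass`, `rnuca_slice`). In particular, instruction placement here uses fixed 2×2 clusters as a stand-in; it is not the rotational interleaving that R-NUCA actually uses.

```python
from enum import Enum


class PageClass(Enum):
    PRIVATE = "private"
    SHARED = "shared"
    INSTRUCTION = "instruction"


def rnuca_slice(phys_addr, page_class, requester,
                num_slices=64, line_bytes=64, mesh_width=8):
    """Pick the LLC slice to look up, given the page's classification."""
    line_number = phys_addr // line_bytes
    if page_class is PageClass.PRIVATE:
        # Private pages live in the requesting core's local slice.
        return requester
    if page_class is PageClass.SHARED:
        # Shared pages fall back to static address interleaving (as in S-NUCA).
        return line_number % num_slices
    # Instructions: one copy per cluster of 4 tiles.
    # ASSUMPTION: fixed 2x2 clusters, NOT true rotational interleaving.
    x, y = requester % mesh_width, requester // mesh_width
    origin_x, origin_y = x - x % 2, y - y % 2
    member = line_number % 4
    return (origin_y + member // 2) * mesh_width + (origin_x + member % 2)
```

The sketch makes the slide's last point visible: only the private and instruction branches depend on `requester`, so shared data gets no locality benefit.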

SLIDE 11

Victim Replication (LLC Data Placement and Replication)

Starts with S-NUCA and uses the local LLC slice of a core as a victim cache for the cache lines evicted from its L1 cache

Inserts replica only if there exists:
an invalid cache line,
a home cache line with zero sharers, or
another replica
Improves locality and reduces on-chip traffic
Replication strategy causes LLC pollution, resulting in higher evictions of home cache lines with zero sharers and other replicas
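
The replica insertion rule on this slide can be sketched as a victim search over one LLC set. This is a minimal sketch: the `LLCLine` fields and the function name `pick_replica_victim` are hypothetical, and the priority order (invalid line first, then zero-sharer home line, then existing replica) is an assumption, since the slide lists the three conditions without ranking them.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LLCLine:
    valid: bool = False
    is_replica: bool = False
    sharers: int = 0  # L1 sharers tracked for a home line


def pick_replica_victim(cache_set: List[LLCLine]) -> Optional[int]:
    """Return the way to overwrite with a new replica, or None to skip insertion."""
    # 1) An invalid cache line costs nothing to reuse.
    for way, line in enumerate(cache_set):
        if not line.valid:
            return way
    # 2) A home cache line with zero sharers.
    for way, line in enumerate(cache_set):
        if not line.is_replica and line.sharers == 0:
            return way
    # 3) Another replica.
    for way, line in enumerate(cache_set):
        if line.is_replica:
            return way
    # Every way holds a home line with active sharers: do not replicate.
    return None
```

Steps 2 and 3 are where the pollution noted above comes from: each inserted replica displaces a zero-sharer home line or another replica.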

SLIDE 12

Evaluation Methodology

Evaluation using Graphite multicore simulator for 64 cores
McPAT/CACTI cache energy models and DSENT network energy models at 11 nm
Evaluated 21 benchmarks from the SPLASH-2 (11), PARSEC (8), Parallel MI-Bench (1) and UHPC (1) suites
LLC management schemes compared:
Static-NUCA (S-NUCA)
Reactive-NUCA (R-NUCA)
Victim Replication (VR)

SLIDE 13

NTV Fault Model for LLC

Normal distribution of error bits in a cache line with random occurrence probabilities

[Figure: LLC Cache Capacity (0% to 100%) at fault rates of 0.10%, 0.30%, 0.50%, 0.70%, and 1.00%, broken down into cache lines with 0e, 1e, 2e, 3e, and >=4e error bits]

LLC tag arrays extended to record “disable bits”
0e – 2e: ECC correction with additional 1-cycle latency
>2e: Cache line disabling
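
The per-line policy above (ECC for up to 2 error bits, disabling beyond that) can be sketched as a small classifier. This is an illustration only: the function names `classify_line` and `usable_capacity` and the returned dictionary layout are hypothetical, and the per-line error counts are taken as given inputs rather than drawn from the fault distribution.

```python
ECC_CORRECTABLE_BITS = 2  # from the slide: 0e to 2e handled by ECC
ECC_EXTRA_CYCLES = 1      # from the slide: additional 1-cycle latency


def classify_line(error_bits: int) -> dict:
    """Decide how one LLC line is treated given its count of faulty bits."""
    if error_bits < 0:
        raise ValueError("error_bits must be non-negative")
    disabled = error_bits > ECC_CORRECTABLE_BITS
    return {
        "disabled": disabled,  # would be recorded as a disable bit in the tag array
        "extra_latency_cycles": 0 if disabled else ECC_EXTRA_CYCLES,
    }


def usable_capacity(error_bits_per_line) -> float:
    """Fraction of LLC lines that remain usable after disabling."""
    lines = list(error_bits_per_line)
    if not lines:
        return 0.0
    usable = sum(1 for e in lines if not classify_line(e)["disabled"])
    return usable / len(lines)
```

For example, `usable_capacity([0, 1, 2, 3, 4, 0, 0, 5])` counts five usable lines out of eight.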

SLIDE 14

Average Results – Completion Time

R-NUCA and VR perform consistently better than S-NUCA
VR’s replication helps at low fault rates
Lower replication opportunities for VR at higher fault rates result in completion time on par with R-NUCA

[Figure: Completion Time (Normalized) for S-NUCA, R-NUCA, and VR at fault rates of 0%, 0.10%, 0.30%, and 0.50%. Breakdown: Synchronization, LLCHome-OffChip, LLCHome-Sharers, LLCHome-Waiting, L1C-LLCHome, L1C-LLCReplica, Compute]

SLIDE 15

Average Results – Energy

Static energy dominates the overall energy
Energy consumption tracks completion time

[Figure: Energy (Normalized) for S-NUCA, R-NUCA, and VR at fault rates of 0%, 0.10%, 0.30%, and 0.50%. Breakdown: DRAM, Network Link, Network Router, Directory, LLC, L1-D Cache, L1-I Cache]

SLIDE 16

Benchmark Results – Barnes

Replication helps significantly at lower fault rates
Lower replication opportunity at higher fault rates diminishes advantage over R-NUCA

[Figure: Completion Time (Normalized) for S-NUCA, R-NUCA, and VR at fault rates of 0%, 0.10%, 0.30%, and 0.50%. Breakdown: Synchronization, LLCHome-OffChip, LLCHome-Sharers, LLCHome-Waiting, L1C-LLCHome, L1C-LLCReplica, Compute]

SLIDE 17

Benchmark Results – Ocean_NC

R-NUCA performance degrades due to false sharing
VR better than R-NUCA, however, lower advantage at higher fault rates

[Figure: Completion Time (Normalized) for S-NUCA, R-NUCA, and VR at fault rates of 0%, 0.10%, 0.30%, and 0.50%. Breakdown: Synchronization, LLCHome-OffChip, LLCHome-Sharers, LLCHome-Waiting, L1C-LLCHome, L1C-LLCReplica, Compute]

[Figure: LLC Accesses (0% to 100%) classified at Cache-Line and Page granularity as Shared Read-Write, Shared Read-Only, Private, or Instruction]

SLIDE 18

Benchmark Results – Dedup

High number of LLC accesses to thread-private data
R-NUCA’s local placement of private data is effective in improving completion time over VR

[Figure: Completion Time (Normalized) for S-NUCA, R-NUCA, and VR at fault rates of 0%, 0.10%, 0.30%, and 0.50%. Breakdown: Synchronization, LLCHome-OffChip, LLCHome-Sharers, LLCHome-Waiting, L1C-LLCHome, L1C-LLCReplica, Compute]

SLIDE 19

Observations

No one-size-fits-all data management scheme at the lower LLC capacity when operating at NTV

A scheme that works optimally at higher LLC capacity might not be effective at the lower usable capacity

Optimizing locality ends up putting extra stress on the LLC, increasing the off-chip miss rate

There is a need for a data management scheme that not only utilizes LLC capacity more intelligently but also possesses the ability to handle the random distribution of faults
