SLIDE 1

Bloom Filtering Cache Misses for Accurate Data Speculation and Prefetching

Jih-Kwon Peir, Shih-Chang Lai, Shih-Lien Lu, Jared Stark, Konrad Lai
peir@cise.ufl.edu
Computer & Information Science and Engineering, University of Florida

ICS’02 Peir

SLIDE 2

It’s the Memory, Stupid

  • 3 out of 4 cycles are spent waiting for memory on the Pentium Pro and Alpha 21164 running TPC-C (Richard Sites)
  • Cache performance: f(hit time, miss ratio, miss penalty)
  • Performance impact due to cache latency:
    – Minimum 3-cycle hit latency
    – Speculative issue for hit
    – Squashed / re-issued on miss (total 3 cycles)

[Pipeline diagram: the load lw r1 <= 0(r2) passes through issue, register, addgen, mem1, mem2, hit/miss, commit; the dependent add r3 <= r2, r1 stalls for three cycles, then passes through issue, register, execute, commit]
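The slide gives cache performance only as a function of hit time, miss ratio, and miss penalty. One common concrete form, which the slide itself does not state, is the average memory access time: hit time plus miss ratio times miss penalty. The sketch below assumes that form; the function name and the miss ratio and miss penalty values are illustrative, and only the 3-cycle hit latency comes from the slide.

```python
def average_access_time(hit_time, miss_ratio, miss_penalty):
    """Average memory access time in cycles.

    Assumed textbook form, not taken from the slides:
    AMAT = hit_time + miss_ratio * miss_penalty
    """
    if not 0.0 <= miss_ratio <= 1.0:
        raise ValueError("miss_ratio must be between 0 and 1")
    return hit_time + miss_ratio * miss_penalty


# The 3-cycle hit latency is from the slide; the 5% miss ratio and the
# 20-cycle miss penalty are made-up illustrative values.
amat = average_access_time(hit_time=3, miss_ratio=0.05, miss_penalty=20)
print(amat)
```

Under this form, a longer hit time raises the cost of every access, which is why the slide singles out hit latency as a scheduling problem.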

SLIDE 3

Outline

  • Introduction
  • Motivation and Related Work
  • Bloom Filtering Cache Misses

    – Partitioned-address BF
    – Partial-address BF

  • Pipeline Microarchitecture with BF
  • Performance Evaluation
  • Summary

SLIDE 4

No Speculation vs. Perfect Scheduling

[Bar chart: IPC (axis 0.5 to 3) for Bzip, Gap, Gcc, Gzip, Mcf, Parser, Perl, Twolf, Vortex, Vpr, and Average; series: no data speculation, perfect scheduler]

No data speculation degrades IPC by 15-20% (SPECint2000)

SLIDE 5

Related Work

  • Simple solution: assume loads always hit L1
  • Alpha 21264: 4-bit counter; +1 on hit, -2 on miss
    – Predict hit if counter >= 8; mini-recovery if miss
    – Predict miss if counter < 8; delay dependents if hit
  • Pentium 4: always hit, “replay” on miss
    – Aggressive way prediction, needs recovery
    – Replay way prediction and load-dependent scheduling
  • Recovery buffer: free speculative instructions from the scheduling queue
  • Hybrid 2-level hit/miss predictor (like branch prediction)
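The Alpha 21264 counter rule in the list above (+1 on hit, -2 on miss, predict hit at 8 or more) can be sketched in a few lines of Python. The class name, the starting value of 15, and saturation at 0 and 15 are assumptions made for illustration; the slide gives only the counter width, the step sizes, and the threshold.

```python
class SaturatingHitMissCounter:
    """4-bit counter: +1 on hit, -2 on miss, predict hit when value >= 8."""

    MAX_VALUE = 15   # largest value a 4-bit counter can hold
    THRESHOLD = 8    # from the slide: predict hit when counter >= 8

    def __init__(self, initial=15):
        # Starting value is an assumption; the slide does not give one.
        self.value = initial

    def predict_hit(self):
        return self.value >= self.THRESHOLD

    def update(self, was_hit):
        # Saturate at both ends instead of wrapping around (assumed).
        if was_hit:
            self.value = min(self.MAX_VALUE, self.value + 1)
        else:
            self.value = max(0, self.value - 2)


counter = SaturatingHitMissCounter()
outcomes = [True, False, False, False, False, True, True]  # made-up trace
predictions = []
for was_hit in outcomes:
    predictions.append(counter.predict_hit())  # predict before the outcome is known
    counter.update(was_hit)
print(predictions)
```

In this made-up trace, four consecutive misses are needed before the predictor switches to "miss", and a single later hit brings it back to the threshold.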

SLIDE 6

Bloom Filter (BF) - Introduction

  • A probabilistic algorithm to test membership in a large set, using hash functions that map into a bit array.
  • A BF quickly filters out non-members without querying the large set.
  • Filter cache misses
    – Accurate scheduling of load dependents
    – Overlap and reduce cache miss penalty
  • Filter other table accesses
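As a point of reference for the bullets above, here is a minimal software Bloom filter in Python. It is a generic sketch, not the hardware design in this talk: the class name, the array size, the number of hash functions, and the use of SHA-256 to derive indexes are all assumptions chosen for readability.

```python
import hashlib


class BloomFilter:
    """Generic Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _indexes(self, item):
        # Derive num_hashes indexes by salting one hash with a seed (assumed scheme).
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for i in self._indexes(item):
            self.bits[i] = 1

    def might_contain(self, item):
        # False means "definitely not a member"; True means "possibly a member".
        return all(self.bits[i] for i in self._indexes(item))


members = BloomFilter()
for line_addr in (0x1000, 0x2040, 0x3080):   # made-up line addresses
    members.add(line_addr)
print(members.might_contain(0x2040))
```

The useful property for miss filtering is the one-sided error: a negative answer is always correct, so it can be acted on without consulting the large set.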

SLIDE 7

Partitioned-Address BF

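[Diagram: on a cache miss, the request line address is split into partitions A1, A2, A3, A4 and the replaced line address into R1, R2, R3, R4; each partition indexes one of the counter arrays BF1, BF2, BF3, BF4]

  • Request line address: increment counter on cache miss
  • Replaced line address: decrement counter on cache miss
  • Filter output: true if cache miss; false, likely cache hit
  • Guaranteed miss!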
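The diagram labels can be read as a counting filter with one counter array per address partition. The Python sketch below follows that reading under stated assumptions: four 4-bit partitions of a 16-bit line address, unbounded integer counters, and the rule that a zero counter in any array means a guaranteed miss. The class and method names are hypothetical.

```python
class PartitionedAddressBF:
    """One counter array per address partition (sizes are illustrative)."""

    def __init__(self, partition_bits=(4, 4, 4, 4)):
        self.partition_bits = partition_bits
        self.counters = [[0] * (1 << bits) for bits in partition_bits]

    def _fields(self, line_addr):
        # Split the line address into fields, lowest bits first.
        fields = []
        for bits in self.partition_bits:
            fields.append(line_addr & ((1 << bits) - 1))
            line_addr >>= bits
        return fields

    def on_cache_miss(self, request_line, replaced_line=None):
        # The requested line enters the cache: increment its counters.
        for table, field in zip(self.counters, self._fields(request_line)):
            table[field] += 1
        # The replaced line leaves the cache: decrement its counters.
        if replaced_line is not None:
            for table, field in zip(self.counters, self._fields(replaced_line)):
                table[field] -= 1

    def guaranteed_miss(self, line_addr):
        # Assumed lookup rule: a zero counter in any array rules the line out.
        pairs = zip(self.counters, self._fields(line_addr))
        return any(table[field] == 0 for table, field in pairs)


bf = PartitionedAddressBF()
bf.on_cache_miss(0x1234)
bf.on_cache_miss(0x5678)
absent = bf.guaranteed_miss(0x9ABC)     # no field seen before
mixed = bf.guaranteed_miss(0x1278)      # every field seen, line never cached
bf.on_cache_miss(0x9ABC, replaced_line=0x1234)
evicted = bf.guaranteed_miss(0x1234)    # counters for 0x1234 are back to zero
print(absent, mixed, evicted)
```

The `mixed` case shows the weakness of this scheme: an address assembled from fields of different cached lines passes every array, so the filter cannot rule it out even though the line was never cached.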

SLIDE 8

Partial-Address BF

[Diagram: the requested line address (tag, partial address of p bits, offset) goes to the L1 cache tag array and the hit/miss detector; its partial address indexes the BF array. The partial address (p bits) of the replaced cache line goes to a collision detector, which answers "Collision? (yes/no)"]

  • Set bit on cache miss
  • Reset bit on cache miss, but only if there is no collision
  • BF output: false, miss; true, (may) hit
  • Guaranteed miss!
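The set and reset rules in the diagram can be sketched as a one-bit-per-entry filter plus a collision check against the lines still in the cache. In the Python sketch below, the partial address is assumed to be the low p bits of the line address, and a plain set of line addresses stands in for the L1 tag array; the class, the `fill` helper, and all addresses are hypothetical.

```python
class PartialAddressBF:
    """One bit per p-bit partial address; a clear bit means a guaranteed miss."""

    def __init__(self, p_bits=8):
        self.mask = (1 << p_bits) - 1
        self.bits = [False] * (1 << p_bits)

    def partial(self, line_addr):
        # Assumption: the partial address is the low p bits of the line address.
        return line_addr & self.mask

    def on_cache_miss(self, request_line, replaced_line, resident_lines):
        """resident_lines: lines in the cache after the fill (victim excluded)."""
        if replaced_line is not None:
            target = self.partial(replaced_line)
            collision = any(self.partial(line) == target for line in resident_lines)
            if not collision:
                self.bits[target] = False      # reset only when no collision
        self.bits[self.partial(request_line)] = True   # set bit on cache miss

    def guaranteed_miss(self, line_addr):
        return not self.bits[self.partial(line_addr)]


bf = PartialAddressBF(p_bits=8)
cache = set()   # hypothetical stand-in for the L1 tag array


def fill(line, victim=None):
    """Bring `line` into the toy cache, optionally evicting `victim`."""
    if victim is not None:
        cache.discard(victim)
    cache.add(line)
    bf.on_cache_miss(line, victim, cache)


fill(0x1A2B)
fill(0x3C2B)                              # same partial address (0x2B) as 0x1A2B
never_seen = bf.guaranteed_miss(0x0042)   # bit 0x42 has never been set
fill(0x0042, victim=0x1A2B)               # collision with 0x3C2B: bit 0x2B stays set
after_collision = bf.guaranteed_miss(0x1A2B)
fill(0x0099, victim=0x3C2B)               # no collision: bit 0x2B is reset
after_reset = bf.guaranteed_miss(0x3C2B)
print(never_seen, after_collision, after_reset)
```

The collision check is what keeps the "guaranteed miss" answer safe: a bit is cleared only when no resident line still maps to it, at the cost of an occasional "may hit" answer for a line that has already been evicted.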

SLIDE 9

Virtual-Address BF

  • Benefit of cache miss filtering
    – Must be early, before dependent scheduling
    – Must be accurate
  • Filter cache misses using the virtual address
    – Must handle the address synonym problem
    – Special handling of collision detection for physical caches
  • Partial-address (virtual) BFs
    – Separate collision detection from the cache tag path
    – Compare the replaced PA with all other PAs with the same page offset
    – Reset the BF array bit only when no match is found

SLIDE 10

Pipeline Execution with BF

[Pipeline diagram: the load lw r1 <= 0(r2) passes through issue, register, addgen, mem1, mem2, hit/miss, commit, with BF filtering marked on the load's pipeline; the dependent add r3 <= r2, r1 stalls for three cycles, then passes through issue, register, execute, commit]

  • Virtual-address BF filters cache misses 2 cycles earlier, but still one cycle too late for dependent scheduling
    – Delay one cycle for dependent scheduling
    – Always hit, precise recovery for single-cycle speculation

SLIDE 11

Cache Miss Filtered by BF

[Pipeline timing diagram: the load (SCH, REG, AGN, CA1, CA2, H/M, with BF and L2 access marked), a dependent instruction, and an independent instruction (SCH, REG, EXE, WRB, CMT; no penalty); annotation: filter miss, cancel dependents]

  • Cache miss filtered by BF, one-cycle window
    – Precise recovery, reschedule only dependents (they have to wait for the miss anyway)
    – No penalty for independent instructions

SLIDE 12

Cache Miss Not Filtered

[Pipeline timing diagram: a cache miss not filtered by the BF. The load passes through SCH, REG, AGN, CA1, CA2, H/M, with BF and L2 access marked; dependent and independent instructions scheduled in the speculative window are flushed and rescheduled; labels: 4-cycle penalty, 2-cycle penalty]

SLIDE 13

Prefetching with BF

[Pipeline timing diagram: the load (SCH, REG, AGN, CA1, CA2, H/M, with BF and L2 access marked) and a dependent instruction (SCH, REG, EXE, WRB, CMT); annotations: filter miss, cancel dependents; filter miss, trigger L2 access 2 cycles earlier]

  • A BF-filtered miss triggers the L2 access 2 cycles earlier
    – L1 miss is guaranteed
    – Applicable to other caches, TLB, branch prediction tables, etc.

SLIDE 14

Predictors and Extra Hardware

Prediction Method        Additional Table (Bits)
Always-Hit               (none listed)
Counter - 1 (Alpha)      4
Counter - 128            512
Counter - 512            2048
Counter - 2048           8192
Counter - 8192           32768
Partitioned BF - 3       15360
Partitioned BF - 4       4480
Partial BF - 1x          512
Partial BF - 4x          2048
Partial BF - 16x         8192
Partial BF - 64x         32768
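The Counter and Partial BF sizes in the table are consistent with a simple reading, which is an inference from the numbers rather than something the slide states: Counter-N is N entries of 4 bits each, and Partial BF-kx is k times the 512-bit size listed for Partial BF-1x. The short Python check below reproduces those rows under that assumption; the constant and function names are hypothetical.

```python
COUNTER_BITS = 4             # 4-bit counters, as on the Related Work slide
PARTIAL_BF_BASE_BITS = 512   # size listed for Partial BF - 1x


def counter_table_bits(entries):
    """Storage for a table of 4-bit hit/miss counters (assumed layout)."""
    return entries * COUNTER_BITS


def partial_bf_bits(multiplier):
    """Storage for a partial-address BF scaled from the 1x size (assumed)."""
    return PARTIAL_BF_BASE_BITS * multiplier


counter_rows = {n: counter_table_bits(n) for n in (1, 128, 512, 2048, 8192)}
partial_rows = {k: partial_bf_bits(k) for k in (1, 4, 16, 64)}
print(counter_rows)
print(partial_rows)

# Partial BF - 16x expressed in bytes.
partial_16x_bytes = partial_bf_bits(16) // 8
print(partial_16x_bytes, "bytes")
```

Under this reading, Partial BF-16x occupies 8192 bits, or 1024 bytes, the same storage as the Counter-2048 table.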

SLIDE 15

Cache Miss Filter Rate

[Bar chart: cache miss filtering rate (%), axis 10 to 100, for Bzip, Gap, Gcc, Gzip, Mcf, Parser, Perl, Twolf, Vortex, Vpr, and Average; series: Partition-3, Partial-1x, Partial-4x, Partial-16x, Partial-64x]

Partitioned BFs perform poorly; 97% of misses are filtered by Partial-16x

SLIDE 16

Accuracy of Various Predictors

[Bar chart: percentage correct/incorrect, axis 50 to 100, for Counter-1, Counter-128, Counter-512, Counter-2048, Counter-8192, Always-hit, Partition-3, Partial-1x, Partial-4x, Partial-16x, Partial-64x; categories: incorrect-cancel, incorrect-delay, correct hit/miss]

Bloom filter has NO incorrect-delay, i.e., a predicted miss is always a miss

SLIDE 17

IPC Comparison

[Bar chart: IPC (axis 0.5 to 3) for Bzip, Gap, Gcc, Gzip, Mcf, Parser, Perl, Twolf, Vortex, Vpr, and Average; series: No-speculation, Counter-1, Counter-2048, Always-hit, Partition-3, Partial-16x, Partial-16x-DP, Perfect-sch, Perfect-sch-DP]

SLIDE 18

Effect of Data Cache Size

[Chart: IPC improvement (%), axis 8 to 20, versus cache size (8KB, 16KB, 32KB, 64KB); series: Perfect-sch-DP, Partial-16x-DP, Partial-16x, Always-hit]

SLIDE 19

Impact of RUU to Always-Hit

[Chart: IPC improvement over Always-Hit (%), axis 1 to 10, for ruu32, ruu64, ruu128; series: Partial-16x, Partial-16x-DP, Partial-16x-perfect, Partial-16x-DP-perfect]

SLIDE 20

Summary

  • Data speculation schedules load dependents without knowing the load latency
  • Bloom Filter identifies 97% of the misses early using a small 1KB bit array
  • 19% IPC improvement over no-speculation, 6% IPC improvement over the always-hit method
  • Reaches 99.7% of the IPC of a perfect scheduler
