An Imitation Learning Approach for Cache Replacement (Evan Z. Liu et al.)



SLIDE 1

Evan Z. Liu, Milad Hashemi, Kevin Swersky, Parthasarathy Ranganathan, Junwhan Ahn

An Imitation Learning Approach for Cache Replacement

SLIDE 2

The Need for Faster Compute

(https://openai.com/blog/ai-and-compute/)

Small cache improvements can make large differences! (Beckman, 2019)

  • E.g., 1% cache hit rate improvement → 35% decrease in latency (Cidon et al., 2016)

Caches are everywhere:

  • CPU chips
  • Operating Systems
  • Databases
  • Web applications

Our goal: Faster applications via better cache replacement policies

SLIDE 3

TL;DR:

I. We approximate the optimal cache replacement policy by (implicitly) predicting the future

II. Caching is an attractive benchmark for the general reinforcement learning / imitation learning communities

SLIDE 4

Cache Replacement

[Diagram: a stream of cache accesses (B A C D A B A D C B A D) against a small cache; each access is a hit (100x faster) or a miss, and each miss forces an eviction.]

Goal: Evict the cache lines to maximize cache hits
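The eviction game above can be made concrete with a tiny simulator. This is an illustrative sketch, not the paper's setup; the `evict` callback and the traces are my own:

```python
def simulate(accesses, capacity, evict):
    """Replay an access trace; on a miss with a full cache, ask the
    policy `evict(cache, t, accesses)` for a victim. Returns hit count."""
    cache, hits = set(), 0
    for t, line in enumerate(accesses):
        if line in cache:
            hits += 1  # hit: much faster than going to memory
        else:
            if len(cache) >= capacity:
                cache.remove(evict(cache, t, accesses))  # miss on a full cache: evict
            cache.add(line)
    return hits
```

Any replacement policy, learned or heuristic, slots in as the `evict` callback, which is what makes the problem a clean decision-making benchmark.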

SLIDE 5

Cache Replacement

[Diagram: replaying the access stream; evicting line C is a mistake, since C is accessed again soon after, turning a would-be hit into a miss.]

SLIDE 6

Cache Replacement

[Diagram: the same stream with the optimal eviction decision, which avoids the extra miss on line C.]

SLIDE 7

Cache Replacement

[Diagram: the same access stream, now annotated with each line's reuse distance.]

Reuse distance dt(line): number of accesses from access t until the line is reused

d0(A) = 1, d0(B) > 2, d0(C) = 2

Optimal Policy (Belady’s): Evict the line with the greatest reuse distance (Belady, 1966)
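Belady's rule is easy to state in code. A minimal brute-force sketch (the function names are mine, not the paper's):

```python
def reuse_distance(accesses, t, line):
    """d_t(line): number of accesses after access t until `line` is
    next used; infinite if it is never reused."""
    for d, future in enumerate(accesses[t + 1:], start=1):
        if future == line:
            return d
    return float('inf')

def belady_evict(cache, t, accesses):
    """Belady's optimal policy: evict the cached line with the
    greatest reuse distance (Belady, 1966)."""
    return max(cache, key=lambda line: reuse_distance(accesses, t, line))
```

On a trace where A returns after 1 access and C after 2 while B is further out (as in the slide's d0 values), the policy evicts B.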

SLIDE 8

Belady’s Requires Future Information

Reuse distance dt(line): number of accesses from access t until the line is reused

Problem: Computing reuse distance requires knowing the future

So in practice, we use heuristics, e.g.:

  • Least-recently used (LRU)
  • Most-recently used (MRU)

… but these perform poorly on complex access patterns
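For reference, the LRU heuristic mentioned above fits in a few lines using Python's OrderedDict (a sketch; the class name and bookkeeping are mine):

```python
from collections import OrderedDict

class LRUCache:
    """Least-recently-used replacement: on a miss with a full cache,
    evict the line whose last access is oldest."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()  # insertion order tracks recency
        self.hits = 0

    def access(self, line):
        if line in self.lines:
            self.lines.move_to_end(line)  # mark as most recently used
            self.hits += 1
        else:
            if len(self.lines) >= self.capacity:
                self.lines.popitem(last=False)  # evict least recently used
            self.lines[line] = True
```

LRU only looks backward, which is exactly why it fails on patterns (e.g., loops slightly larger than the cache) where the past is a poor proxy for the future.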

SLIDE 9

Leveraging Belady’s

Idea: approximate Belady’s from past accesses

[Diagram: a learned model sees only past accesses and the current access (not the future accesses) and is trained so its predicted decision matches Belady's optimal decision.]

SLIDE 10

Prior Work

[Diagram: Hawkeye / Glider, the current state of the art (Shi et al., '19, Jain et al., '18): a predictor trained on Belady's uses past accesses, the current access, and the current cache state to classify the current line as cache-friendly or cache-averse; a traditional algorithm then decides which line to evict.]

SLIDE 11

Prior Work

+ binary classification is relatively easy to learn
- traditional algorithm can't express the optimal policy

[Diagram: the same Hawkeye / Glider pipeline (Shi et al., '19, Jain et al., '18).]

SLIDE 12

Our Approach

Our contribution: Directly approximate Belady's via imitation learning

[Diagram: our proposal replaces the classifier-plus-heuristic pipeline with a single model, trained on Belady's, that maps past accesses, the current access, and the current cache state directly to an eviction decision ("Evict line X").]

SLIDE 13

Cache Replacement Markov Decision Process

[Diagram: the access stream replayed as a Markov decision process; each cache miss is a decision point where a line is evicted. Similar to Wang et al., 2019.]

SLIDE 14

Cache Replacement Markov Decision Process

[Diagram: the state consists of past accesses, the current access, and the current cache contents. Similar to Wang et al., 2019.]

SLIDE 15

Cache Replacement Markov Decision Process

[Diagram: the same MDP, highlighting the sequence of hits and misses as transitions. Similar to Wang et al., 2019.]

SLIDE 16

Cache Replacement Markov Decision Process

[Diagram: the action is the eviction decision taken on each miss. Similar to Wang et al., 2019.]

SLIDE 17

Leveraging the Optimal Policy

Typical imitation learning setting (Pomerleau, 1991, Ross et al., 2011, Kim et al., 2013): a learned policy maps the state to a predicted action and optimizes, e.g., a classification loss against the optimal action.

Our setting: we can query an approximate optimal policy (Belady's) at each state.

Observation: Not all errors are equally bad

  • Learning from the optimal policy yields greater training signal

Concretely: minimize a ranking loss
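One way to realize such a ranking loss is a pairwise hinge over predicted eviction scores: whenever Belady's prefers to evict line i over line j (larger reuse distance), the model is penalized unless it scores i above j by a margin. This is an illustrative sketch, not the paper's exact loss:

```python
def pairwise_ranking_loss(scores, reuse_distances, margin=1.0):
    """Hinge-style pairwise ranking loss (a sketch): if line i has a
    larger true reuse distance than line j, its predicted eviction
    score should exceed j's by at least `margin`."""
    loss, pairs = 0.0, 0
    lines = list(scores)
    for i in lines:
        for j in lines:
            if reuse_distances[i] > reuse_distances[j]:
                # i should be ranked as the better eviction candidate
                loss += max(0.0, margin - (scores[i] - scores[j]))
                pairs += 1
    return loss / max(pairs, 1)
```

Unlike a 0/1 classification loss, this penalty grows with how badly the predicted ordering disagrees with the oracle's, which is one way to encode "not all errors are equally bad."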

SLIDE 18

Reuse Distance as an Auxiliary Task

Observation: predicting reuse distance is correlated with cache replacement

  • Cast this as an auxiliary task (Jaderberg, et al., 2016)

[Diagram: the state st is mapped to a shared state embedding that feeds both the eviction policy and a reuse-distance prediction head; both contribute to the loss.]
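The auxiliary setup combines the replacement loss with a reuse-distance prediction term computed from the shared state embedding. A minimal sketch; the squared-error form and the weighting are my assumptions, not the paper's exact formulation:

```python
def joint_loss(replacement_loss, predicted_reuse, true_reuse, aux_weight=0.5):
    """Total training loss = replacement (e.g., ranking) loss plus a
    weighted auxiliary reuse-distance regression term, in the style of
    auxiliary tasks (Jaderberg et al., 2016)."""
    aux = sum((p - t) ** 2 for p, t in zip(predicted_reuse, true_reuse))
    aux /= len(true_reuse)  # mean squared error over reuse-distance predictions
    return replacement_loss + aux_weight * aux
```

Because both heads read the same state embedding, gradients from the auxiliary term shape representations that also help the eviction policy.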

SLIDE 19

Results

~19% cache-hit rate increase over Glider (Shi et al., 2019) on memory-intensive SPEC2006 applications (Jaleel et al., 2009)

~64% cache-hit rate increase over LRU on Google Web Search

[Chart: cache-hit rates shown between the LRU cache-hit rate (lower reference) and the optimal cache-hit rate (upper reference).]

SLIDE 20

A Note on Practicality

This work: Establish a proof-of-concept

[Diagram: a 64-bit address (ending in ...C5A1) split into bytes; each byte indexes its own small embedding table, and a linear layer combines the per-byte embeddings into the address embedding.]

Per-byte address embedding:

  • Reduce embedding size from 100MB to <10KB
  • ~6% cache-hit rate increase on SPEC2006 vs. Glider
  • ~59% cache-hit rate increase on Google Web Search vs. LRU
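The per-byte trick swaps one embedding row per distinct 64-bit address for 256 rows per byte position. A sketch under assumed parameters (the embedding dimension, table initialization, and names are mine for illustration):

```python
import random

NUM_BYTES, DIM = 8, 4  # 64-bit addresses; tiny embedding dim for illustration
random.seed(0)
# One small table of 256 rows per byte position: NUM_BYTES * 256 rows total,
# instead of one row per distinct address (up to 2**64 rows).
tables = [[[random.random() for _ in range(DIM)] for _ in range(256)]
          for _ in range(NUM_BYTES)]

def embed_address(address):
    """Concatenate the per-byte embeddings of an address; a linear
    layer would then mix them into the final address embedding."""
    out = []
    for i in range(NUM_BYTES):
        byte = (address >> (8 * i)) & 0xFF  # extract byte i
        out.extend(tables[i][byte])
    return out
```

With small rows this table occupies kilobytes rather than the 100MB of a full per-address table, which is the spirit of the reduction on the slide.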

SLIDE 21


Future work: Production-ready learned policies

  • Smaller models via distillation (Hinton et al., 2015), pruning (Janowsky, 1989, Han et al., 2015, Sze et al., 2017), or quantization
  • Target domains with longer latency and larger caches (e.g., software caches)

SLIDE 22

A New Imitation / Reinforcement Learning Benchmark

Games (Bellemare et al., 2012, Silver et al., 2017, OpenAI, 2019, Vinyals et al., 2019):
+ plentiful data
- delayed real-world utility

Robotics (Levine et al., 2016, Lillicrap et al., 2015):
+ immediate real-world impact
- limited / expensive data

Cache replacement:
+ plentiful data
+ immediate real-world impact

Open-source cache replacement Gym environment coming soon!

SLIDE 23

Takeaways

  • A new state-of-the-art approach for cache replacement by imitating the oracle policy

○ Future work: making this production ready

  • A new benchmark for imitation learning / reinforcement learning research