

SLIDE 1

Introduction · Cache Implementation · Self-Timed Pipeline Design · Performance Results · Summary

A Low-Latency, Energy-Efficient L1 Cache Based on a Self-Timed Pipeline

Louis-Charles Trudeau¹, Ghyslain Gagnon¹, François Gagnon¹, Claude Thibeault¹, Thomas Awad², Doug Morrissey²

¹École de technologie supérieure, Montréal, Canada; ²Octasic Inc., Montréal, Canada

21st IEEE International Symposium on Asynchronous Circuits and Systems, 2015

Trudeau, Gagnon, Gagnon, Thibeault, Awad, Morrissey: L1 Cache Based on a Self-Timed Pipeline

SLIDE 2

Plan

◮ Introduction: Problem, Motivations, Scope of Work
◮ Cache Implementation: Architecture and Organization, Operation
◮ Self-Timed Pipeline Design: Design Guidelines, Pipeline Control, Pipeline Operation
◮ Performance Results
◮ Summary

SLIDE 3

Research Program

Objective: adapt Octasic's power-efficient asynchronous architecture to a general-purpose processor (ARM v7-A).

Collaborators: École de technologie supérieure and Octasic Inc.

SLIDE 4

Problem

The current architecture separates the asynchronous CPU from the synchronous L1 memory:

◮ 2-cycle synchronization penalty.
◮ Suboptimal energy efficiency.

[Figure: execution sub-system (execution units 0–7 with instruction/data ports, crossbar switch, instruction bus, program counter, branch predictor, instruction fetch, instruction decoder, load/store unit, register file) connected to the memory sub-system (L1 instruction memory and L1 data memory).]

SLIDE 7

Motivations

This work focuses on improving L1 memory access.

Why go asynchronous?

◮ No balanced clock trees: clocks are point-to-point and skew-insensitive.
◮ No major critical path imposed by frequency constraints: fewer large, leaky gates.
◮ Simpler pipeline structure: only neighboring stages are connected.

SLIDE 8

Scope of Work

Design an asynchronous cache based on a self-timed pipeline.

Objectives

1. Mitigate the CPU ↔ L1 memory access latency.
2. Reduce the average memory access time.
3. Improve the cache's energy efficiency.
4. Push the synchronization barrier to the L2 memory.


SLIDE 10

L1 Instruction Cache Design

Dual-fetch, 32 kB, 4-way set-associative phased cache.

Synchronous cache
◮ 5-stage pipeline (hit).
◮ Pipeline stall on miss.
◮ 2-cycle pipeline reinjection following a cache fill.

Asynchronous cache
◮ 4-stage pipeline (hit).
◮ Single-stage stall on miss.
◮ Resource arbitration for concurrent cache fills.

Integration in an ARM-like processor ⇒ Dhrystone & CoreMark (armcc-compiled).
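To make the "phased cache" term concrete: tags are compared in a first phase, and only the hitting way's data array is read in a second phase, instead of probing all four data ways in parallel. A minimal Python model follows; the class, field names, and geometry are invented for this sketch, not taken from the paper.

```python
# Minimal model of a 4-way set-associative phased cache lookup.
# Phase 1 reads and compares tags; phase 2 reads data from the
# single hitting way only, instead of all ways in parallel.
WAYS, SETS, LINE = 4, 128, 32  # hypothetical geometry

class PhasedCache:
    def __init__(self):
        self.tags = [[None] * WAYS for _ in range(SETS)]
        self.data = [[None] * WAYS for _ in range(SETS)]
        self.array_reads = 0  # counts data-array accesses

    def lookup(self, addr):
        idx = (addr // LINE) % SETS
        tag = addr // (LINE * SETS)
        # Phase 1: tag read + compare (all ways).
        for way in range(WAYS):
            if self.tags[idx][way] == tag:
                # Phase 2: read exactly one data array.
                self.array_reads += 1
                return self.data[idx][way]
        return None  # miss: handled by the fill path

    def fill(self, addr, value):
        idx = (addr // LINE) % SETS
        tag = addr // (LINE * SETS)
        way = hash(addr) % WAYS  # placeholder replacement policy
        self.tags[idx][way] = tag
        self.data[idx][way] = value

cache = PhasedCache()
cache.fill(0x1000, "line@0x1000")
assert cache.lookup(0x1000) == "line@0x1000"   # hit
assert cache.lookup(0x2000) is None            # miss
assert cache.array_reads == 1                  # only the hit read a data array
```

A conventional parallel-read organization would access all four data ways on every lookup; the phased organization serializes tag and data access to cut that energy, which is the trade-off the self-timed pipeline exploits.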

SLIDE 11

Cache Pipeline

Shared resources
◮ Tag memory
◮ Data memory
◮ (L2 memory)

Task partitioning
◮ 6 pipeline stages.
◮ Two-phase handshake protocol.

[Figure: cache pipeline diagram. PC and Inst[1:0] feed the controller; the stages Tag Read, Fwd Addr, Tag Write, Data Write, Data Read and Data Out connect the tag RAMs (4 ways), the tag comparison, the data RAMs (+ data FFs), and the L1/L2 FIFO to L2.]

SLIDE 13

Cache Operation

Pipeline stages
◮ Tag Read
◮ Forward Address
◮ Tag & Data Write
◮ Data Read
◮ Data Output

[Figure: animated walkthrough of the pipeline diagram. The tag comparison yields the hit/miss decision; a hit proceeds to Data Read and Data Out, while a miss is forwarded through the L1/L2 FIFO to L2, and the returning fill drives the Tag Write and Data Write stages.]


SLIDE 20

Design Guidelines

The self-timed pipeline design had to follow these guidelines:

◮ Standard cell libraries.
◮ Standard edge-triggered flip-flops.
◮ Prioritize high-voltage-threshold (HVT) cells to limit leakage.

SLIDE 21

Overview of a Single Pipeline Stage

1. Click controllers: two-phase handshake protocol.
2. Token modules: synchronization; pulse generation.
3. Adjustable delays.

[Figure: one pipeline stage. The stage logic feeds a DFF clocked by N.pulse; a toggle element holds N.phase, from which a matched delay derives N.phase_del; the token logic combines N.request, N.available and other conditions, with N−1.phase_del and N+1.phase as handshake inputs.]

SLIDE 22

Click Controllers

◮ Based on a two-phase handshake protocol.
◮ Each controller stores its stage's phase and toggles it upon use.

Control signals
◮ Request: N−1.phase_del ⊕ N.phase
◮ Available: N.phase ⊕ N+1.phase

[Figure: click controller. A toggle element produces N.phase, delayed into N.phase_del; N.request and N.available are derived from the neighboring stages' phases.]
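The two XOR conditions above can be exercised with a few lines of Python. This is a functional sketch, not a timing model: the matched delay is ignored, so N−1.phase stands in for N−1.phase_del.

```python
# Two-phase (transition) signaling: every toggle of a phase bit is an
# event.  Per the slide:
#   N.request   = (N-1).phase_del XOR N.phase
#   N.available = N.phase XOR (N+1).phase
# The exact circuit-level meaning of "available" depends on details the
# slide does not show; here we only exercise the two expressions.

def request(prev_phase_del, phase):
    return prev_phase_del ^ phase

def available(phase, next_phase):
    return phase ^ next_phase

p_prev, p_n, p_next = 0, 0, 0       # idle pipeline: no pending events
assert request(p_prev, p_n) == 0    # nothing offered to stage N

p_prev ^= 1                         # stage N-1 fires: offers new data
assert request(p_prev, p_n) == 1    # N now sees a pending request

p_n ^= 1                            # stage N fires: latches the data
assert request(p_prev, p_n) == 0    # the request has been consumed
assert available(p_n, p_next) == 1  # N's and N+1's phases now differ

p_next ^= 1                         # stage N+1 fires in turn
assert available(p_n, p_next) == 0  # phases re-aligned: handshake done
```

The appeal of two-phase signaling is visible here: a request or acknowledgement is a single transition, so XOR of neighboring phase bits is all the control logic a click controller needs.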

SLIDE 23

Token Modules

General idea: to enable a transaction with a specific shared resource, "users" must hold that resource's token.

Operation
1. Hold the token.
2. Consume the token and access the resource.
3. Pass the token to the next user.

[Figure: token users 1 to N arranged in a ring around the shared resource; the token T circulates from user to user.]
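A software analogue of the hold/consume/pass cycle might look like the sketch below (the names are invented; in hardware the "ring" is the token path between stages contending for the tag or data RAMs).

```python
from collections import deque

# Round-robin token ring: only the current token holder may access the
# shared resource; used or not, the token then passes to the next user,
# so mutual exclusion falls out of the single circulating token.
class TokenRing:
    def __init__(self, users):
        self.users = deque(users)        # users[0] holds the token

    def step(self, wants_access):
        """One token hop: the holder may use the resource, then passes."""
        holder = self.users[0]
        accessed = holder if wants_access(holder) else None
        self.users.rotate(-1)            # pass the token onward
        return accessed

ring = TokenRing(["user1", "user2", "user3"])
grants = [ring.step(lambda u: True) for _ in range(4)]
assert grants == ["user1", "user2", "user3", "user1"]
assert ring.step(lambda u: False) is None   # a holder may decline
```

Because at most one user ever holds the token, concurrent access to the shared tag or data memory is excluded by construction, without a clocked arbiter.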

SLIDE 27

Token Modules: Internal Structure

◮ Token control function F()
◮ Pulse generation
◮ Token ring delay

Operation
1. Token conditions are met.
2. The token passes through F(), causing a transition.
3. The transition (edge) generates a pulse signal.
4. The token is delayed, then passed to the next user.

[Figure: Token In enters F(); an edge on Token Out drives the pulse-generation block (gated by N.request, N.available and other conditions), and a delay element paces the token ring.]

SLIDE 32

Self-Timed Pipeline Operation

1. Initialization: all stages are available.
2. Request in: stage N's conditions are met.
3. Pulse generation: N flops the input data and toggles its phase.
4. Processing: stage N is in use, therefore unavailable.
5. Request in: the delayed phase of N triggers stage N+1's request.
6. Pulse generation: N+1 flops the input data and toggles its phase.
7. Stage N is now available again; N+1 processes the data.

[Figure: two adjacent stages N and N+1, each with its delay line, DFF, toggle element and token logic; the animation steps the phases from N.phase = N+1.phase = 0 through the handshake.]
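Steps 1–7 can be mimicked with a toy event-driven model. The firing rule below is one plausible formalization, not the paper's exact circuit: a stage may fire when its predecessor's phase differs from its own (new data pending) and its successor's phase equals its own (the previous output has been consumed).

```python
# Toy model of steps 1-7 for a source, two stages (N, N+1) and a sink,
# each represented only by its phase bit.  Firing flops the data and
# toggles the stage's phase; matched delays are abstracted away.

def fireable(ph, i):
    return ph[i - 1] != ph[i] and ph[i] == ph[i + 1]

ph = [0, 0, 0, 0]                 # [source, N, N+1, sink]; step 1: idle
assert not fireable(ph, 1) and not fireable(ph, 2)

ph[0] ^= 1                        # step 2: a request arrives at stage N
assert fireable(ph, 1)
ph[1] ^= 1                        # steps 3-4: N flops data, now "busy"

assert fireable(ph, 2)            # step 5: N's phase requests N+1
ph[2] ^= 1                        # step 6: N+1 flops data, toggles phase

ph[0] ^= 1                        # step 7: N is free again, so a fresh
assert fireable(ph, 1)            # source request can re-fire stage N
```

Note how no global clock appears anywhere: each stage advances as soon as its local conditions hold, which is exactly what lets the cache pipeline run at data-dependent speed.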


SLIDE 41

Extracting Results

Objective: use the synchronous instruction cache for the baseline results, keeping a very similar silicon layout for a fair comparison.

Performance metrics
◮ Average memory access time
◮ Energy consumption
◮ Area

SLIDE 42

Results: Area

[Figure: side-by-side silicon layouts of the synchronous and asynchronous caches.]

SLIDE 43

Results: Area

Table: Pipeline Size Comparison

                   Synchronous          Asynchronous
  Pipeline         Area (µm²)    (%)    Area (µm²)    (%)
  Total            11185         100    9900          100
  (w/o L2 FIFO)    8455          75.6   7170          72.4
  (control)        375           3.35   915           9.24

⇒ Pipeline size reduced by 10–15%.
⇒ Pipeline control is ∼2.5× larger.

SLIDE 44

Results: Energy

Power analysis
◮ Power estimates based on capacitive switching.
◮ Routing estimated from Manhattan distances.
◮ L1–L2 interface clock frequencies matched.

L1 cache behavior tests (32 kB program)
1. Miss: fetch of 20k random instructions.
2. Hit: fetches of 200k, 2M and 20M random instructions.

SLIDE 45

Results: Energy

Table: Energy Consumption

               Synchronous    Asynchronous
  Sequence     Energy (nJ)    Energy (nJ)    ∆E (%)
  20k inst.    150.6          118.0          21.6
  200k inst.   1373.5         1048.3         23.7
  2M inst.     13609.1        10357.9        23.9
  20M inst.    135917.3       103424.7       24.0

⇒ Energy efficiency improved by > 22%.

SLIDE 46

Results: Performance

Table: Average Memory Access Time

               Synchronous        Asynchronous
  Sequence     Exec. Time (ms)    Exec. Time (ms)    ∆T (%)
  20k inst.    43.1               29.5               31.6
  200k inst.   358.1              265.8              25.8
  2M inst.     3508.1             2628.3             25.1
  20M inst.    34360.0            25769.9            25.0

⇒ Access time reduced by > 25%.
⇒ Throughput at the L1–L2 interface improved by > 40%.
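The ∆E and ∆T columns of the two results tables can be re-derived from the raw numbers (values copied verbatim from the slides; Python used only as a calculator):

```python
# Recompute the relative savings reported in the results tables:
#   delta = (synchronous - asynchronous) / synchronous * 100
energy_nj = [(150.6, 118.0), (1373.5, 1048.3),
             (13609.1, 10357.9), (135917.3, 103424.7)]
time_ms = [(43.1, 29.5), (358.1, 265.8),
           (3508.1, 2628.3), (34360.0, 25769.9)]

def delta_pct(sync, async_):
    return round((sync - async_) / sync * 100, 1)

e_deltas = [delta_pct(s, a) for s, a in energy_nj]
t_deltas = [delta_pct(s, a) for s, a in time_ms]
assert e_deltas == [21.6, 23.7, 23.9, 23.9]  # slide rounds the last to 24.0
assert t_deltas == [31.6, 25.8, 25.1, 25.0]
assert all(d >= 25.0 for d in t_deltas)      # "access time reduced by > 25%"
```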

SLIDE 47

Results: What Needs to Be Addressed

Future work
◮ Further pipeline the cache to reach a > 1 GHz equivalent.
◮ Design an L1 data cache based on the self-timed pipeline.
◮ Integrate asynchronous L1 caches into Octasic's next-generation processors.


SLIDE 49

Summary

Problem
⇒ An asynchronous CPU accesses a synchronous L1 memory.

Goals & results: design and implement an asynchronous L1 cache to:
1. Mitigate the CPU ↔ L1 memory access latency.
2. Improve the cache's energy efficiency: > 22%.
3. Reduce the average memory access time: > 25%.
4. Push the synchronization barrier to the L2 memory.