SLIDE 1

The Rise and the Fall of Scratch Pads

Aviral Shrivastava
Compiler Microarchitecture Labs, Arizona State University

SLIDE 2

Utopia of Caches

  • Few other things affect programming as much as the memory architecture
– Number of registers
– Pipeline structure
– Bypasses
  • The illusion of a large, unified memory makes programming simple
– Coherent caches
– A variable's name is its unique address
– The cache fetches the latest value of the variable from wherever it lives in memory

SLIDE 3

SPMs for Power, Perf., and Area

[Plot: energy per access (nJ) vs. memory size, 256 B to 16 KB -- a scratch pad costs less per access than a 2-way cache at every size, whether the cache covers a 1 MB, 16 MB, or 4 GB address space.]

[Diagram: a cache needs a data array, tag array, tag comparators, muxes, and an address decoder; an SPM needs only the data array and address decoder.]

  • 40% less energy than a cache [Banakar02]
– Absence of tag arrays, comparators, and muxes
  • 34% less area than a cache of the same size [Banakar02]
– Simple hardware design: only a memory array and address-decoding circuitry
  • Faster access to the SPM than to a cache
SLIDE 4

SPMs for Predictability

  • In hard real-time systems, WCET analysis is essential
– A new application can be admitted only if all WCETs fit within the period
  • Caches: estimating the number of misses in a program is at least doubly exponential
– The analysis reduces to Presburger arithmetic with nested existential quantifiers
– The practical workaround (simply assume no cache) makes the WCET very large
  • With static data mapping onto an SPM
– Tighter WCET -- can fit more applications

SLIDE 5

Rise of SPMs

  • The SuperH in Sega gaming consoles used SPMs
  • Sony PlayStations have used SPMs extensively
– PS1: could use the SPM for stack data
– PS2: 16 KB SPM
– PS3: each SPU has a 256 KB SPM
  • Intel network processors have used SPMs
  • Graphics processing units (GPUs) use SPMs
– Nvidia Tesla
  • Many embedded processors used cache line locking
– ColdFire MCF5249, PowerPC 440, MPC5554, ARM940, and ARM946E-S
  • But SPMs remained confined to embedded systems

[Photos: Sony PlayStation, Sega Saturn]

SLIDE 6

A storm is brewing

  • Power and temperature are becoming key design concerns
– Multi-cores seem to be the solution
  • Each core is smaller and lower-performance, but throughput stays high
– Perf_sys = n × Perf_core
– Power_sys = n × Power_core
– PE_sys = PE_core (power efficiency is preserved)
  • Throughput is the new metric
– Throughput increases by a factor of n
– Each core needs to be as low-power as possible
– Energy hogs need to go away:
  • Out-of-order execution
  • Register renaming
  • Branch prediction
SLIDE 7

Era of disillusionment

  • The illusion of unified memory is breaking
– Cache-coherency protocols do not scale beyond tens of cores
  • Tilera's TILE64 has a coherent cache architecture
  • Intel's 48-core and 80-core prototypes have non-coherent caches
– Big push toward software coherency and TLM
  • Most of the time, coherency is not needed
  • Rarely-done things can be slower (Amdahl's law)
  • The illusion of large memory is breaking
– Reduce the automation of the cache
– Software is exposed to the distributed-memory reality
  • SGI Altix: 320 GB RAM, half a million dollars
– MPI-like communication is coming inside the chip
  • Most of the time, a core can operate on local data
SLIDE 8

Limited Local Memory (LLM) Architecture

  • A distributed-memory platform in which each core has its own small scratch pad memory
  • Cores can access only their local memory
  • Access to the global memory is through explicit DMA calls in the application program
  • Example: the IBM Cell Broadband Engine
SLIDE 9

LLM Programming Model

  • Extremely power-efficient execution if
– all code and application data fit in the local memory

Main core:

#include <libspe2.h>
extern spe_program_handle_t hello_spu;
int main(void) {
    int speid, status;
    speid = spe_create_thread(&hello_spu);
    spe_wait(speid, &status);
    return 0;
}

Local cores (the same program runs on every SPU):

#include <spu_mfcio.h>
int main(speid, argp) {
    printf("Hello world!\n");
    return 0;
}

SLIDE 10

Managing Data on Limited Local Memory

  • The LLM is shared by all data and code
– No protection
  • All data needs to be managed
  • Code
– May not fit
  • Stack
– May grow and overwrite heap data, or code
  • Heap
– May grow and overwrite stack data
  • IBM's fix:
– Software cache
– Do not use pointers
– Do not use recursion

[Diagram: local-memory layout with code, globals, the stack, and a heap that grows toward the stack.]

SLIDE 11

Using LLM is difficult

Original code:

int global;
f1() {
    int a, b;
    global = a + b;
    f2();
}

Using the SPM:

int global;
f1() {
    int a, b;
    glob = DSPM.fetch(global);
    glob = a + b;
    DSPM.writeback(glob, global);
    ISPM.fetch(f2);
    f2();
}

SLIDE 12

LLM is different from a traditional SPM

[Diagram: ARM memory architecture -- the ARM core has both an SPM and a cache, with DMA to global memory.]

  • Programs work without using the SPM
– The SPM is an optimization: place frequently used data in it
  • "What to place in the SPM?"
– The candidate set can be larger than the SPM
  • "Where to place it in the SPM?"

  • Programs will not work without the LLM
– The SPM is essential for execution; the task is to make its use more efficient
  • "What to place in the SPM?"
– Everything
  • "Where to place it in the SPM?"

[Diagram: LLM architecture -- the SPU has only its LLM, with DMA to global memory.]

SLIDE 13

Outline

  • 0. Global
  • 1. Code management
– [HIPC 2008] SDRM: Simultaneous Determination of Regions and Function-to-Region Mapping for Scratchpad Memories
– [ASAP 2010] Dynamic Code Mapping for Limited Local Memory Systems
SLIDE 14

Code Management Mechanism

(a) Application call graph: F1 calls F2 and F3.

(b) Linker script:

SECTIONS {
    OVERLAY { F1.o F3.o }
    OVERLAY { F2.o }
}

(c) Local memory: F1 and F3 overlay each other in one region; F2 occupies another.

(d) Main memory: holds the code for F1, F2, and F3, plus heap variables and the stack.

http://www.public.asu.edu/~ashriva6

SLIDE 15

Code Management Problem

11/30/2010

[Diagram: the code section of the local memory divided into regions.]

  • Choose the number of regions and the function-to-region mapping
– Two extreme cases: one region for all functions (minimum space, maximum transfers) vs. one region per function (maximum space, no transfers)
  • Code management is NP-complete
– Minimize data transfer within the given space

SLIDE 16

Capturing the Call Pattern

(a) Example application:

F1() { F2(); F3(); }
F2() {
    for (i = 0; i < 10; i++)  { F4(); }
    for (i = 0; i < 100; i++) { F5(); }
}
F3() {
    for (i = 0; i < 10; i++) { F6(); F7(); }
}

(b) Call graph: F1 calls F2 and F3 once each; F2 calls F4 ten times and F5 a hundred times; F3 calls F6 and F7 ten times each. Every function is 1 KB.

(c) GCCFG: adds loop nodes (L1, L2, L3) and preserves the explicit execution order among calls.

SLIDE 17

FMUM Heuristic

  • Start from the maximum mapping (7.5 KB: every function in its own region) and merge regions until the mapping fits the given space (5.5 KB)

(a) Start: F1 (1 KB), F2 (1.5 KB), F3 (0.5 KB), F4 (2 KB), F5 (1.5 KB), F6 (1 KB), one region each.
(b) Next step: F1 and F5 merge into one 1.5 KB region.
(c) Final: F2 and F6 also merge into one 1.5 KB region; F3 (0.5 KB) and F4 (2 KB) keep their own regions. Total = 5.5 KB.


SLIDE 18

FMUP Heuristic

  • Start from the minimum mapping (2 KB: all of F1–F6 in one region, sized by the largest function) and split functions out into new regions while the given size (5 KB) allows

(a) Start: one 2 KB region holds F1–F6. (b)–(d) Steps: new regions of 1.5 KB each are created and functions redistributed. (e) Final: three regions (2 KB + 1.5 KB + 1.5 KB = 5 KB).

SLIDE 19

Typical Performance Result

[Plot: total execution cycles (millions, 8–11) for Stringsearch vs. given code space in the limited local memory (800–3792 bytes), comparing FMUM, FMUP, and SDRM; FMUP performs better over some code-space ranges and FMUM over others.]

SLIDE 20

Outline

  • 0. Global
  • 1. Code management
  • 2. Stack management
– [ASPDAC 2009] A Software Solution for Dynamic Stack Management on Scratch Pad Memory
SLIDE 21

Circular Stack Management

  • Function frame sizes: F1 = 50 bytes, F2 = 20 bytes, F3 = 30 bytes; space for the stack in local memory = 70 bytes
  • F1 calls F2, which calls F3: F1 (50) and F2 (20) fill the 70-byte local stack space, so before F3's frame is placed, the oldest frame (F1) is evicted to main memory -- tracked by the main-memory pointer -- and the local stack space is reused circularly


SLIDE 23

How to evict data to global memory?

  • Can use DMA to transfer the heap object to global memory
– DMA is very fast: no core-to-core communication
  • But eventually you can overwrite some other data in global memory
  • Need OS mediation: let the main core perform the malloc
– But thread communication between cores is slow!

[Diagram: left, the execution core forwards each malloc to the main core, which allocates global memory; right, the execution core DMAs data directly to global memory.]
SLIDE 24

Hybrid DMA + Communication

  • Fast path: when the execution core knows it has space in global memory, it writes the data with a plain DMA
  • Slow path: when it runs out, it requests more global memory from the main core via mailbox-based communication

Execution thread on the execution core:

if (enough space in global memory)
    write data using DMA
else
    request more space in global memory

[Diagram: the execution core sends a request of size S (startAddr, endAddr) over the mailbox; the main core allocates >= S bytes of global memory; the data is then DMA-written from local memory to global memory.]

SLIDE 25

Pointer Problem

Space for the stack = 80 bytes; frame sizes: F1 = 50 bytes, F2 = 30 bytes.

F1() {
    int k = 3, a = -1;
    int *ptrA = &a;
    fci(F2); F2(k, ptrA); fco(F2);
    printf("%d\n", a);
}
F2(int k, int *ptr) {
    if (k == 1) { *ptr = 1000; return; }
    fci(F2); F2(--k, ptr); fco(F2);
}

With only 80 bytes of stack space, F1 plus two F2 frames fill local memory; by the time k == 1, F1's frame -- and the variable a -- has been evicted to main memory. The pointer still holds a's old local address, so *ptr = 1000 writes to the wrong place: where is &a?
SLIDE 26

Pointer Resolution

F1() {
    int k = 3, a = -1;
    int *ptrA = s2p(&a);
    fci(F2); F2(k, ptrA); fco(F2);
    printf("%d\n", a);
}
F2(int k, int *ptr) {
    if (k == 1) { ptr_wr(ptr, 1000); return; }
    fci(F2); F2(--k, ptr); fco(F2);
}

s2p() converts the local (stack) address of a into its global address before the pointer escapes; ptr_wr() dereferences through the global address, reaching a whether its frame is still in local memory or has been evicted to main memory.

[Diagram: address mapping by the pointer-management functions between local stack addresses (3060–3100) and main-memory addresses (181300–181340).]

SLIDE 27

Enabling Limitless Stack Depth

int rcount(int n) {
    if (n == 0) return 0;
    return rcount(n - 1) + 1;
}

[Plot: log runtime (µs, 100–100,000) vs. the parameter n, with and without stack management; without stack management, execution fails beyond n = 3842, while the managed stack keeps going.]

  • 1%–20% overhead
SLIDE 28

Outline

  • 0. Global
  • 1. Code management
  • 2. Stack management
  • 3. Heap management
– [CODES+ISSS 2010] Heap Data Management for Limited Local Memory (LLM) Multi-core Processors

SLIDE 29

Heap Data Management

Heap space in local memory = 32 bytes; sizeof(Student) = 16 bytes.

typedef struct {
    int id;
    float score;
} Student;

main() {
    for (i = 0; i < N; i++)
        student[i] = malloc(sizeof(Student));
    for (i = 0; i < N; i++)
        student[i].id = i;
}

  • malloc()
– allocates space in local memory
  • mymalloc()
– may need to evict older heap objects to global memory
– may need to allocate more global memory

[Diagram: malloc1 and malloc2 fill the 32-byte local heap (HP); malloc3 forces an eviction to the global-memory heap (GM_HP).]

SLIDE 30

Address Translation Functions

  • The mapping from an SPU (local) address to a global address is one-to-many
– The global address cannot easily be found from an SPU address
  • So all heap accesses must happen through global addresses
  • p2s() translates a global address to an SPU address
– and makes sure the heap object is in local memory
  • s2p() translates the SPU address back to a global address

main() {
    for (i = 0; i < N; i++)
        student[i] = malloc(sizeof(Student));
    for (i = 0; i < N; i++) {
        student[i] = p2s(student[i]);
        student[i].id = i;
        student[i] = s2p(student[i]);
    }
}

SLIDE 31

Heap Management API

Original code:

typedef struct { int id; float score; } Student;
main() {
    for (i = 0; i < N; i++) {
        student[i] = malloc(sizeof(Student));
        student[i].id = i;
    }
}

Code with heap management:

typedef struct { int id; float score; } Student;
main() {
    for (i = 0; i < N; i++) {
        student[i] = malloc(sizeof(Student));
        student[i] = p2s(student[i]);
        student[i].id = i;
        student[i] = s2p(student[i]);
    }
}

  • malloc(): allocates space in local memory and global memory, and returns the global address
  • free(): frees the space in global memory
  • p2s(): ensures the heap variable exists in local memory and yields the SPU address
  • s2p(): translates the SPU address back to the PPU (global) address

Heap management overhead: 1–20%

SLIDE 32

Outline

  • 0. Global
  • 1. Code management
  • 2. Stack management
  • 3. Heap management
SLIDE 33

Scalability of Data Management

[Plot: log runtime (µs) vs. number of cores (1–6) for DFS, dijkstra, fft, fft_inverse, MST, rbTree, and stringsearch.]

Scattered data-management requests limit scaling.

SLIDE 34

Summary

  • Rise 1: SPMs entered embedded systems for power, performance, and predictability
  • Rise 2: with power and temperature becoming key concerns and core counts scaling, SPMs are entering high-performance systems
  • Cell may die, but...
  • The LLM architecture is different from traditional SPMs
– It needs software memory management:
  • Code management
  • Stack data management
  • Heap data management
  • Goal: allow any multi-threaded application to execute on an LLM architecture
  • http://www.public.asu.edu/~ashriva6
SLIDE 35

Experimental Setup

  • Sony PlayStation 3 running Fedora Core 9 Linux
  • MiBench benchmark suite and other applications
– http://www.public.asu.edu/~kbai3/publications.html
  • Runtimes measured with spu_decrementer() on the SPEs and _mftb() on the PPE, both provided with IBM Cell SDK 3.1

SLIDE 36

SPMs worse than caches?

  • Caches request data at the last moment
– With an SPM, insert a DMA instruction just before the load
– The question: extra DMA instructions vs. hardware simplicity
  • From then on, there are only advantages
– Can specify exactly how much data to DMA
– Can hoist the DMA earlier
– Forces more intelligence into the compiler

SLIDE 37

  • One important reason for the rise of SPMs is the failure of prefetching
– Even after so much research on prefetching, no processor implements more than next-line prefetching
  • And even that very cautiously, since prefetching is not very accurate and leads to cache pollution
  • One alternative is a separate prefetch cache
– Caches request data when it is already too late
  • Power and temperature have never been so important

SLIDE 38

  • The trend for power efficiency: build hardware only for what the processor does very frequently
  • Things that happen rarely can be handled in software, since hardware consumes power all the time
– E.g., coherency; or soft-error detection in hardware with correction in software