The Rise and the Fall of Scratch Pads
Aviral Shrivastava
Compiler Microarchitecture Labs, Arizona State University
Utopia of Caches
- Few other things affect programming as much as the memory architecture
  – No. of registers
  – Pipeline structure
  – Bypasses
- Illusion of a large unified memory makes programming simple
  – Coherent caches
  – Unique address for a variable – its name
  – Cache gets the latest value of the variable from wherever it is in memory
SPMs for Power, Performance, and Area
[Figure: energy per access (nJ) vs. memory size (256–16384 bytes) – scratch pad vs. 2-way caches with 1 MB, 16 MB, and 4 GB address spaces]
[Figure: a cache needs a data array, tag array, tag comparators, muxes, and an address decoder; an SPM needs only the data array and address decoder]
- 40% less energy than a cache of the same size [Banakar02]
  – No tag arrays, comparators, or muxes
- 34% less area than a cache of the same size [Banakar02]
  – Simple hardware design (only a memory array and address-decoding circuitry)
- Faster access to SPM than cache
SPMs for Predictability
- In hard real-time systems, WCET analysis is essential
  – An application can be added only if all WCETs fit within the period
- Caches: estimating the number of misses in a program is at least doubly exponential
  – Presburger arithmetic with nested existential quantifiers
  – The simple alternative (assume no cache) yields a very large WCET
- With static data mapping on SPM
– Tighter WCET: more applications can fit
Rise of SPMs
- SuperH in Sega Gaming Consoles used SPMs
- Sony PlayStations have used SPMs extensively
  – PS1: could use the SPM for stack data
  – PS2: 16KB SPM
  – PS3: each SPU has a 256KB SPM
- Intel Network processors have used SPMs
- Graphics Processing Units (GPUs) use SPMs
– Nvidia Tesla
- Many embedded processors used line locking
– Coldfire MCF5249, PowerPC440, MPC5554, ARM940, and ARM946E-S
- SPMs remained in embedded systems
[Photos: Sony PlayStation, Sega Saturn]
A storm is brewing
- Power and temperature are becoming key design concerns
  – Multi-cores seem to be the solution
- Each core is smaller and lower-performance, but the chip still delivers high throughput
  – Perf_sys = n × Perf_core
  – Power_sys = n × Power_core
  – PE_sys = PE_core (power efficiency per core is unchanged)
- Throughput is the new metric
  – Throughput increases by a factor of n
  – Each core needs to be as low-power as possible
  – Energy hogs need to go away
- Out-of-order execution
- Register renaming
- Branch prediction
Era of disillusionment
- Illusion of unified memory is breaking
– Cache coherency protocols do not scale beyond tens of cores
- Tilera64 has a coherent cache architecture
- Intel's 48-core and 80-core chips have non-coherent caches
– Big push towards software coherency and TLM
- Most of the time, it is not needed
- Rarely-done things can be slower (Amdahl's law)
- Illusion of large memory is breaking
– Reduce the reliance on cache automation
– Software exposed to the distributed-memory reality
- SGI Altix: 320 GB RAM, half a million dollars
– MPI-like communication coming inside the chip
- Most of the time, a core can operate on local data
Limited Local Memory (LLM) Architecture
- Distributed-memory platform with each core having its own small scratch pad memory
- Cores can access only their local memory
- Access to global memory is through explicit DMA calls in the application program
- Example: IBM Cell Broadband Engine
LLM Programming Model
- Extremely power-efficient execution if
– all code and application data can fit in the local memory
#include <libspe2.h>
extern spe_program_handle_t hello_spu;
int main(void) {
    int speid, status;
    speid = spe_create_thread(&hello_spu);
    spe_wait(speid, &status);
    return 0;
}
Main Core
#include <spu_mfcio.h>
int main(speid, argp) {
    printf("Hello world!\n");
    return 0;
}
Local Core
(the same hello_spu program runs on each of the local cores)
Managing Data on Limited Local Memory
- Shared LLM for all data and code
– No protection
- All data needs to be managed
- Code
– May not fit
- Stack
– May grow and overwrite heap data, or code
- Heap
– May grow and overwrite stack data
- IBM’s Fix:
– Software cache
– Do not use pointers
– Do not use recursion
[Figure: local memory layout – code, global data, stack, and heap share one space with no protection]
Using LLM is difficult
Original code:
int global;
f1() {
    int a, b;
    global = a + b;
    f2();
}

Using SPM:
int global;
f1() {
    int a, b;
    glob = DSPM.fetch(global);
    glob = a + b;
    DSPM.writeback(glob, global);
    ISPM.fetch(f2);
    f2();
}
LLM is different from SPM
[Figure: ARM memory architecture – the core accesses both a cache and an SPM; DMA connects to global memory]
- Programs work without using the SPM
  – SPM is an optimization: place frequently used data in it
- "What to place in SPM?"
  – Can be more than the SPM size
- "Where to place in SPM?"
- Programs will not work without the LLM
  – SPM essential for execution
  – Need to make its use more efficient
- "What to place in SPM?"
  – Everything
- "Where to place in SPM?"
[Figure: LLM architecture – each SPU accesses only its local memory; DMA connects to global memory]
Outline
- 0. Global
- 1. Code management
  – [HIPC 2008] SDRM: Simultaneous Determination of Regions and Function-to-Region Mapping for Scratchpad Memories
  – [ASAP 2010] Dynamic Code Mapping for Limited Local Memory Systems
Code Management Mechanism

[Figure: (a) application call graph F1 → F2, F3; (c) local memory: F1 and F3 share one region, F2 has its own; (d) main memory holds the code, global variables, stack, and heap]

(b) Linker script:
SECTIONS {
    OVERLAY {
        F1.o
        F3.o
    }
    OVERLAY {
        F2.o
    }
}
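For reference, GNU ld can express the same mapping with its OVERLAY directive. The sketch below is ours, not the talk's: the section names and addresses are placeholders, and only sections inside one OVERLAY share a start address (so F1 and F3 overlay each other, while F2 gets its own region):

```ld
SECTIONS {
  /* .ovl_f1 and .ovl_f3 get the same run-time address:
     F1 and F3 overlay each other in one region */
  OVERLAY 0x0000 : AT (0x80000) {
    .ovl_f1 { F1.o(.text) }
    .ovl_f3 { F3.o(.text) }
  }
  /* F2 occupies a region of its own */
  .region_f2 : { F2.o(.text) }
}
```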
http://www.public.asu.edu/~ashriva6
Code Management Problem
11/30/2010
- Determine the number of regions and the function-to-region mapping
  – Two extreme cases
- Code management is NP-complete
  – Minimize data transfers within the given space

[Figure: the code section of local memory is divided into regions]
Capturing the call pattern
F1() {
    F2();
    F3();
}
F2() {
    for (i = 0; i < 10; i++)  { F4(); }
    for (i = 0; i < 100; i++) { F5(); }
}
F3() {
    for (i = 0; i < 10; i++) { F6(); F7(); }
}
[Figure: (b) call graph for the example application (a) – F1 calls F2 and F3 once each; F2 calls F4 10 times and F5 100 times; F3 calls F6 and F7 10 times each; every function is 1KB. (c) The GCCFG adds loop nodes L1–L3 so the execution order is explicit]
Explicit Execution Order
FMUM Heuristic
[Figure: FMUM example – F1 (1KB), F2 (1.5KB), F3 (0.5KB), F4 (2KB), F5 (1.5KB), F6 (1KB). (a) Start from one region per function: 7.5KB, the maximum. Regions are then merged until the mapping fits the given 5.5KB: final regions {F1, F5} 1.5KB, {F2, F6} 1.5KB, {F3} 0.5KB, {F4} 2KB]
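The merging idea behind FMUM can be sketched in C. This is a simplification of ours: the published heuristic also weighs call-graph interference from the GCCFG when choosing which regions to merge, whereas this sketch merges purely by size (the two largest regions, since that saves the most space per merge):

```c
#include <assert.h>

/* FMUM-style sketch: start with one region per function ("Maximum"),
 * then greedily merge regions until the total fits the given budget.
 * A merged region is as large as its largest member, because functions
 * mapped to one region overlay each other. */
static int fmum(int region[], int n, int budget)
{
    int total = 0;
    for (int i = 0; i < n; i++) total += region[i];

    while (total > budget && n > 1) {
        /* find the two largest regions; merging them saves the most */
        int a = 0, b = 1;
        if (region[b] > region[a]) { a = 1; b = 0; }
        for (int i = 2; i < n; i++) {
            if (region[i] > region[a])      { b = a; a = i; }
            else if (region[i] > region[b]) { b = i; }
        }
        /* merged size = max of the two; saving = min of the two */
        total -= region[a] < region[b] ? region[a] : region[b];
        if (region[b] > region[a]) region[a] = region[b];
        region[b] = region[--n];           /* drop region b */
    }
    return total;                          /* final mapped size */
}
```

On the slide's example (function sizes 1KB, 1.5KB, 0.5KB, 2KB, 1.5KB, 1KB with a 5.5KB budget) this greedy variant also reaches a fitting mapping, although its final grouping can differ from what the real interference-aware cost function would pick.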
FMUP Heuristic

[Figure: FMUP example – start from all functions in a single region: 2KB, the minimum. New regions are created and functions promoted into them step by step ((a) start, (b)–(d) intermediate steps, (e) final) until the given 5KB space is best used]
Typical Performance Result
[Figure: Stringsearch – total execution cycles (millions) vs. given code space (800–3792 bytes) for FMUM, FMUP, and SDRM; FMUP performs better at small code spaces, FMUM at larger ones]
Outline
- 0. Global
- 1. Code management
- 2. Stack management
  – [ASPDAC 2009] A Software Solution for Dynamic Stack Management on Scratch Pad Memory
Circular Stack Management
[Figure: frames F1 (50 bytes) and F2 (20 bytes) fill the 70-byte local stack space; calling F3 (30 bytes) evicts F1's frame to main memory, tracked by the stack pointer (SP) and a main-memory pointer]

Stack size = 70 bytes

Function | Frame size (bytes)
F1       | 50
F2       | 20
F3       | 30
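The mechanism can be sketched as a host-side simulation (our own simplification: memcpy stands in for the DMA transfers, eviction is whole-frame FIFO, and bringing evicted frames back on return is elided; fci/fco follow the instrumentation names used later in the pointer example):

```c
#include <assert.h>
#include <string.h>

#define LOCAL_SZ 70                    /* local stack space, as on the slide */
static char local_mem[LOCAL_SZ];
static char main_mem[1024];            /* stand-in for global memory         */
static int  sp = 0;                    /* bytes of live frames in local_mem  */
static int  main_ptr = 0;              /* eviction cursor in main_mem        */
static int  frame_sz[16];              /* resident frame sizes, oldest first */
static int  nframes = 0;
static int  evicted = 0;               /* bytes evicted so far               */

/* fci: called before a call, with the callee's frame size.
 * Assumes every individual frame fits in LOCAL_SZ. */
void fci(int size)
{
    while (sp + size > LOCAL_SZ) {     /* evict oldest frame(s): "DMA out"   */
        int old = frame_sz[0];
        memcpy(main_mem + main_ptr, local_mem, old);
        main_ptr += old;
        evicted  += old;
        memmove(local_mem, local_mem + old, sp - old);
        sp -= old;
        memmove(frame_sz, frame_sz + 1, --nframes * sizeof(int));
    }
    frame_sz[nframes++] = size;
    sp += size;
}

/* fco: called after the call returns; pops the callee's frame.
 * A full implementation would also DMA evicted caller frames back in. */
void fco(void)
{
    sp -= frame_sz[--nframes];
}
```

With the slide's frame sizes, fci(50); fci(20); fci(30); evicts exactly F1's 50-byte frame, leaving F2 and F3 (50 bytes) resident.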
How to evict data to global memory?
- Can use DMA to transfer heap object to global memory
— DMA is very fast – no core-to-core communication
- But eventually, you can overwrite some other data
- Need OS mediation
[Figure: two options – (left) the execution core sends each malloc request to the main core, which allocates space in global memory; (right) the execution core DMAs the data directly to global memory]
- Thread communication between cores is slow!
Hybrid DMA + Communication
if (enough space in global memory)
    write data using DMA
else
    request more space in global memory
[Figure: the execution thread requests at least S bytes via mailbox-based communication; the main core allocates the space (startAddr to endAddr) in global memory; the execution core then DMA-writes from local memory to global memory]
Pointer Problem
F1() {
    int k = 3, a = -1;
    int *ptrA = &a;
    fci(F2); F2(k, ptrA); fco(F2);
    printf("%d\n", a);
}
F2(int k, int *ptr) {
    if (k == 1) { *ptr = 1000; return; }
    fci(F2); F2(--k, ptr); fco(F2);
}
Space for stack = 80 bytes (frame sizes: F1 = 50 bytes, F2 = 30 bytes)

[Figure: by the time k = 1, the recursive F2 frames have evicted F1's frame, and with it the variable a, to main memory; the local address &a held in ptr is now stale]
Pointer Resolution
F1() {
    int k = 3, a = -1;
    int *ptrA = s2p(&a);
    fci(F2); F2(k, ptrA); fco(F2);
    printf("%d\n", a);
}
F2(int k, int *ptr) {
    if (k == 1) { ptr_wr(ptr, 1000); return; }
    fci(F2); F2(--k, ptr); fco(F2);
}
Space for stack = 80 bytes (frame sizes: F1 = 50 bytes, F2 = 30 bytes)

[Figure: (b) address mapping by the pointer-management functions – s2p() converts the local address of a (e.g., 3070) to its main-memory address (e.g., 181310), so ptr_wr() writes to the correct location even after F1's frame has been evicted]
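The translation can be sketched on a host machine. This is our own single-window simplification: s2p and ptr_wr are the slide's names, everything else (the stack's fixed home in main memory, the residency window) is assumed; a real implementation walks the frame table and uses DMA:

```c
#include <assert.h>
#include <string.h>

#define STACK_SZ   80                  /* local stack space, as on the slide    */
#define STACK_HOME 1000                /* assumed home of the stack in main_mem */
static char local_mem[STACK_SZ];
static char main_mem[4096];            /* stand-in for global memory            */
static int  resident_lo = 0;           /* window of local bytes still valid     */
static int  resident_hi = STACK_SZ;

/* s2p: rewrite a local stack address as its stable main-memory address */
char *s2p(char *p) { return main_mem + STACK_HOME + (p - local_mem); }

/* ptr_wr: write an int through a main-memory stack address; touch the
 * local copy only while the containing frame is still resident */
void ptr_wr(char *gp, int v)
{
    int off = (int)(gp - (main_mem + STACK_HOME));
    if (off >= resident_lo && off < resident_hi)
        memcpy(local_mem + off, &v, sizeof v);   /* frame still local */
    else
        memcpy(gp, &v, sizeof v);                /* frame was evicted */
}
```

In the slide's scenario, &a is converted once by s2p(); after F1's frame is evicted, ptr_wr() lands in the main-memory copy of a, and F1 sees the value once its frame is brought back.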
Enabling Limitless Stack Depth
[Figure: log runtime (µs) vs. recursion parameter n for rcount() – without stack management the program fails beyond n = 3842; with our approach it runs for arbitrarily large n]

int rcount(int n) {
    if (n == 0) return 0;
    return rcount(n - 1) + 1;
}

- 1%–20% overhead
Outline
- 0. Global
- 1. Code management
- 2. Stack management
- 3. Heap management
  – [CODES+ISSS 2010] Heap Data Management for Limited Local Memory (LLM) Multi-core Processors
Heap Data Management
[Figure: the 32-byte local heap holds malloc1 and malloc2 (16 bytes each); malloc3 must evict the oldest object to global memory, tracked by the local heap pointer (HP) and the global heap pointer (GM_HP)]

Heap size = 32 bytes, sizeof(Student) = 16 bytes

typedef struct {
    int id;
    float score;
} Student;

main() {
    for (i = 0; i < N; i++) {
        student[i] = malloc(sizeof(Student));
    }
    for (i = 0; i < N; i++) {
        student[i]->id = i;
    }
}
- mymalloc()
  – May need to evict older heap objects to global memory
  – May need to allocate more global memory
- malloc()
  – Allocates space in local memory
Address Translation Functions
- The mapping from SPU addresses to global addresses is one-to-many
  – Cannot easily find the global address from an SPU address
- All heap accesses must therefore happen through global addresses
- p2s() translates a global address to an SPU address
  – Ensures the heap object is in the local memory
- s2p() translates the SPU address back to a global address

main() {
    for (i = 0; i < N; i++) {
        student[i] = malloc(sizeof(Student));
    }
    for (i = 0; i < N; i++) {
        student[i] = p2s(student[i]);
        student[i]->id = i;
        student[i] = s2p(student[i]);
    }
}
Heap Management API
Original code:
typedef struct {
    int id;
    float score;
} Student;

main() {
    for (i = 0; i < N; i++) {
        student[i] = malloc(sizeof(Student));
        student[i]->id = i;
    }
}

API:
- malloc(): allocates space in local memory and global memory, and returns the global address
- free(): frees the space in global memory
- p2s(): ensures the heap object exists in local memory and returns its SPU address
- s2p(): translates the SPU address back to the PPU (global) address

Code with heap management:
main() {
    for (i = 0; i < N; i++) {
        student[i] = malloc(sizeof(Student));
        student[i] = p2s(student[i]);
        student[i]->id = i;
        student[i] = s2p(student[i]);
    }
}

Heap management overhead: 1–20%
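A host-side sketch of this API (our own simplification: a two-slot local heap with FIFO eviction, global addresses represented as offsets into a buffer, and memcpy standing in for the DMA; the names p2s/s2p mirror the slide's, and malloc appears as my_malloc to avoid shadowing the C library):

```c
#include <assert.h>
#include <string.h>

#define OBJ   16                        /* sizeof(Student), as on the slide  */
#define SLOTS 2                         /* 32-byte local heap / 16-byte objs */
static char local_heap[SLOTS * OBJ];
static char global_heap[4096];          /* stand-in for global memory        */
static int  gm_hp = 0;                  /* global heap pointer (GM_HP)       */
static int  slot_owner[SLOTS] = { -1, -1 }; /* global offset cached per slot */
static int  next_evict = 0;             /* FIFO eviction cursor              */

/* my_malloc: allocate one object; the handle is a GLOBAL heap offset */
int my_malloc(void) { int h = gm_hp; gm_hp += OBJ; return h; }

/* p2s: ensure object h is in the local heap; return its local address */
char *p2s(int h)
{
    for (int s = 0; s < SLOTS; s++)     /* already resident? */
        if (slot_owner[s] == h) return local_heap + s * OBJ;
    int s = next_evict;                 /* pick a victim slot */
    next_evict = (next_evict + 1) % SLOTS;
    if (slot_owner[s] >= 0)             /* write back the evicted object */
        memcpy(global_heap + slot_owner[s], local_heap + s * OBJ, OBJ);
    memcpy(local_heap + s * OBJ, global_heap + h, OBJ);  /* fetch object */
    slot_owner[s] = h;
    return local_heap + s * OBJ;
}

/* s2p: translate a local heap address back to its global handle */
int s2p(char *p) { return slot_owner[(p - local_heap) / OBJ]; }
```

As on the slide, the third allocation evicts the first object; accessing it again through p2s() transparently fetches it back from global memory, and s2p() recovers the global handle before a pointer is stored.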
Outline
- 0. Global
- 1. Code management
- 2. Stack management
- 3. Heap management
Scalability of Data Management
[Figure: log runtime (µs) vs. number of cores (1–6) for DFS, dijkstra, fft, fft_inverse, MST, rbTree, and stringsearch]
Scattered data management requests
Summary
- Rise 1: SPMs in embedded systems, for power, performance, and predictability
- Rise 2: with power and temperature becoming important concerns and core counts scaling, SPMs in high-performance systems
- Cell may die, but…
- LLM architecture different from traditional SPMs
– Need software memory management
- Code Management
- Stack data management
- Heap data management
- Goal: allow any multi-threaded application to execute on the LLM architecture
- http://www.public.asu.edu/~ashriva6
Experimental Setup
- Sony PlayStation 3 running Fedora Core 9 Linux
- MiBench benchmark suite and other applications
  – http://www.public.asu.edu/~kbai3/publications.html
- Runtimes are measured with spu_decrementer() on the SPEs and _mftb() on the PPE, both provided with the IBM Cell SDK 3.1
SPMs worse than Caches?
- Caches request data at the last moment
  – SPM: insert a DMA instruction just before the load
  – Question: DMA instructions vs. hardware simplicity
- From then on, there are only advantages
  – Can also specify how much data to DMA
  – Can hoist the DMA earlier
  – Forces more intelligence into the compiler
- One important reason for the rise of SPMs is the failure of prefetching
  – Even after so much research on prefetching, no processor implements more than next-line prefetching
    - And that, too, very cautiously, since it is not very accurate and leads to cache pollution
    - Or a separate prefetch cache is added
  – Caches request data when it is already too late
- Power and temperature have never been so important
- Trend to improve power efficiency
  – Do in hardware only the things the processor does very frequently
  – Things that happen rarely can be handled in s/w, since h/w consumes power all the time
    - E.g., coherency; soft-error detection in h/w, correction in s/w