The Rise and the Fall of Scratch Pads
Aviral Shrivastava
Compiler Microarchitecture Labs, Arizona State University
Utopia of Caches
- Few other things affect programming as much as the memory architecture
  – No. of registers
  – Pipeline structure
  – Bypasses
- Illusion of a large unified memory makes programming simple
  – Coherent caches
  – Unique address for a variable – its name
  – Cache gets the latest value of the variable from wherever it is in memory
SPMs for Power, Performance, and Area
[Figure: energy per access (nJ) vs. memory size (256–16384 bytes) – scratch pad vs. 2-way caches with 1 MB, 16 MB, and 4 GB address spaces]
[Figure: a cache needs a data array, tag array, tag comparators, muxes, and an address decoder; an SPM needs only the data array and address decoder]
- 40% less energy than a cache of the same size [Banakar02]
  – No tag arrays, comparators, or muxes
- 34% less area than a cache of the same size [Banakar02]
  – Simple hardware design (only a memory array and address-decoding circuitry)
- Faster access to SPM than cache
SPMs for Predictability
- In hard real-time systems, WCET analysis is essential
  – An application can be added only if all WCETs fit within the period
- Caches: estimating the number of misses in a program is at least doubly exponential
  – Presburger arithmetic with nested existential quantifiers
  – The simple alternative (assume no cache) yields a very large WCET
- With static data mapping on SPM
– Tighter WCET: more applications can fit
Rise of SPMs
- SuperH in Sega Gaming Consoles used SPMs
- Sony PlayStations have used SPMs extensively
  – PS1: could use the SPM for stack data
  – PS2: 16KB SPM
  – PS3: each SPU has a 256KB SPM
- Intel Network processors have used SPMs
- Graphics Processing Units (GPUs) use SPMs
– Nvidia Tesla
- Many embedded processors used line locking
– Coldfire MCF5249, PowerPC440, MPC5554, ARM940, and ARM946E-S
- SPMs remained in embedded systems
[Photos: Sony PlayStation, Sega Saturn]
A storm is brewing
- Power and temperature are becoming key design concerns
  – Multi-cores seem to be the solution
- Each core is smaller and lower-performance, but the chip still delivers high throughput
  – Perf_sys = n × Perf_core
  – Power_sys = n × Power_core
  – PE_sys = PE_core (power efficiency per core is unchanged)
- Throughput is the new metric
  – Throughput increases by a factor of n
  – Each core needs to be as low-power as possible
  – Energy hogs need to go away
- Out-of-order execution
- Register renaming
- Branch prediction
Era of disillusionment
- Illusion of unified memory is breaking
– Cache coherency protocols do not scale beyond tens of cores
- Tilera64 has a coherent cache architecture
- Intel's 48-core and 80-core chips have non-coherent caches
– Big push towards software coherency and TLM
- Most of the time, it is not needed
- Rarely-done things can be slower (Amdahl's law)
- Illusion of large memory is breaking
– Reduce the reliance on cache automation
– Software exposed to the distributed-memory reality
- SGI Altix: 320 GB RAM, half a million dollars
– MPI-like communication coming inside the chip
- Most of the time, a core can operate on local data
Limited Local Memory (LLM) Architecture
- Distributed-memory platform with each core having its own small scratch pad memory
- Cores can access only their local memory
- Access to global memory is through explicit DMA calls in the application program
- Example: IBM Cell Broadband Engine
LLM Programming Model
- Extremely power-efficient execution if
– all code and application data can fit in the local memory
#include <libspe2.h>
extern spe_program_handle_t hello_spu;
int main(void) {
    int speid, status;
    speid = spe_create_thread(&hello_spu);
    spe_wait(speid, &status);
    return 0;
}
Main Core
#include <spu_mfcio.h>
int main(speid, argp) {
    printf("Hello world!\n");
    return 0;
}
Local Core
(the same hello_spu program runs on each of the local cores)
Managing Data on Limited Local Memory
- Shared LLM for all data and code
– No protection
- All data needs to be managed
- Code
– May not fit
- Stack
– May grow and overwrite heap data, or code
- Heap
– May grow and overwrite stack data
- IBM’s Fix:
– Software cache
– Do not use pointers
– Do not use recursion
[Figure: local memory layout – code, global data, stack, and heap share one space with no protection]
Using LLM is difficult
Original code:
int global;
f1() {
    int a, b;
    global = a + b;
    f2();
}

Using SPM:
int global;
f1() {
    int a, b;
    glob = DSPM.fetch(global);
    glob = a + b;
    DSPM.writeback(glob, global);
    ISPM.fetch(f2);
    f2();
}
LLM is different from SPM
[Figure: ARM memory architecture – the core accesses both a cache and an SPM; DMA connects to global memory]
- Programs work without using the SPM
  – SPM is an optimization: place frequently used data in it
- "What to place in SPM?"
  – Can be more than the SPM size
- "Where to place in SPM?"
- Programs will not work without the LLM
  – SPM essential for execution
  – Need to make its use more efficient
- "What to place in SPM?"
  – Everything
- "Where to place in SPM?"
[Figure: LLM architecture – each SPU accesses only its local memory; DMA connects to global memory]
Outline
- 0. Global
- 1. Code management
  – [HIPC 2008] SDRM: Simultaneous Determination of Regions and Function-to-Region Mapping for Scratchpad Memories
  – [ASAP 2010] Dynamic Code Mapping for Limited Local Memory Systems
Code Management Mechanism

[Figure: (a) application call graph F1 → F2, F3; (c) local memory: F1 and F3 share one region, F2 has its own; (d) main memory holds the code, global variables, stack, and heap]

(b) Linker script:
SECTIONS {
    OVERLAY {
        F1.o
        F3.o
    }
    OVERLAY {
        F2.o
    }
}
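For reference, GNU ld can express the same mapping with its OVERLAY directive. The sketch below is ours, not the talk's: the section names and addresses are placeholders, and only sections inside one OVERLAY share a start address (so F1 and F3 overlay each other, while F2 gets its own region):

```ld
SECTIONS {
  /* .ovl_f1 and .ovl_f3 get the same run-time address:
     F1 and F3 overlay each other in one region */
  OVERLAY 0x0000 : AT (0x80000) {
    .ovl_f1 { F1.o(.text) }
    .ovl_f3 { F3.o(.text) }
  }
  /* F2 occupies a region of its own */
  .region_f2 : { F2.o(.text) }
}
```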
http://www.public.asu.edu/~ashriva6
Code Management Problem
11/30/2010
- Determine the number of regions and the function-to-region mapping
  – Two extreme cases
- Code management is NP-complete
  – Minimize data transfers within the given space

[Figure: the code section of local memory is divided into regions]
Capturing the call pattern
F1() {
    F2();
    F3();
}
F2() {
    for (i = 0; i < 10; i++)  { F4(); }
    for (i = 0; i < 100; i++) { F5(); }
}
F3() {
    for (i = 0; i < 10; i++) { F6(); F7(); }
}
[Figure: (b) call graph for the example application (a) – F1 calls F2 and F3 once each; F2 calls F4 10 times and F5 100 times; F3 calls F6 and F7 10 times each; every function is 1KB. (c) The GCCFG adds loop nodes L1–L3 so the execution order is explicit]
Explicit Execution Order
FMUM Heuristic
[Figure: FMUM example – F1 (1KB), F2 (1.5KB), F3 (0.5KB), F4 (2KB), F5 (1.5KB), F6 (1KB). (a) Start from one region per function: 7.5KB, the maximum. Regions are then merged until the mapping fits the given 5.5KB: final regions {F1, F5} 1.5KB, {F2, F6} 1.5KB, {F3} 0.5KB, {F4} 2KB]
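The merging idea behind FMUM can be sketched in C. This is a simplification of ours: the published heuristic also weighs call-graph interference from the GCCFG when choosing which regions to merge, whereas this sketch merges purely by size (the two largest regions, since that saves the most space per merge):

```c
#include <assert.h>

/* FMUM-style sketch: start with one region per function ("Maximum"),
 * then greedily merge regions until the total fits the given budget.
 * A merged region is as large as its largest member, because functions
 * mapped to one region overlay each other. */
static int fmum(int region[], int n, int budget)
{
    int total = 0;
    for (int i = 0; i < n; i++) total += region[i];

    while (total > budget && n > 1) {
        /* find the two largest regions; merging them saves the most */
        int a = 0, b = 1;
        if (region[b] > region[a]) { a = 1; b = 0; }
        for (int i = 2; i < n; i++) {
            if (region[i] > region[a])      { b = a; a = i; }
            else if (region[i] > region[b]) { b = i; }
        }
        /* merged size = max of the two; saving = min of the two */
        total -= region[a] < region[b] ? region[a] : region[b];
        if (region[b] > region[a]) region[a] = region[b];
        region[b] = region[--n];           /* drop region b */
    }
    return total;                          /* final mapped size */
}
```

On the slide's example (function sizes 1KB, 1.5KB, 0.5KB, 2KB, 1.5KB, 1KB with a 5.5KB budget) this greedy variant also reaches a fitting mapping, although its final grouping can differ from what the real interference-aware cost function would pick.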
FMUP Heuristic

[Figure: FMUP example – start from all functions in a single region: 2KB, the minimum. New regions are created and functions promoted into them step by step ((a) start, (b)–(d) intermediate steps, (e) final) until the given 5KB space is best used]
Typical Performance Result
[Figure: Stringsearch – total execution cycles (millions) vs. given code space (800–3792 bytes) for FMUM, FMUP, and SDRM; FMUP performs better at small code spaces, FMUM at larger ones]
Outline
- 0. Global
- 1. Code management
- 2. Stack management
  – [ASPDAC 2009] A Software Solution for Dynamic Stack Management on Scratch Pad Memory
Circular Stack Management
[Figure: frames F1 (50 bytes) and F2 (20 bytes) fill the 70-byte local stack space; calling F3 (30 bytes) evicts F1's frame to main memory, tracked by the stack pointer (SP) and a main-memory pointer]

Stack size = 70 bytes

Function | Frame size (bytes)
F1       | 50
F2       | 20
F3       | 30
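The mechanism can be sketched as a host-side simulation (our own simplification: memcpy stands in for the DMA transfers, eviction is whole-frame FIFO, and bringing evicted frames back on return is elided; fci/fco follow the instrumentation names used later in the pointer example):

```c
#include <assert.h>
#include <string.h>

#define LOCAL_SZ 70                    /* local stack space, as on the slide */
static char local_mem[LOCAL_SZ];
static char main_mem[1024];            /* stand-in for global memory         */
static int  sp = 0;                    /* bytes of live frames in local_mem  */
static int  main_ptr = 0;              /* eviction cursor in main_mem        */
static int  frame_sz[16];              /* resident frame sizes, oldest first */
static int  nframes = 0;
static int  evicted = 0;               /* bytes evicted so far               */

/* fci: called before a call, with the callee's frame size.
 * Assumes every individual frame fits in LOCAL_SZ. */
void fci(int size)
{
    while (sp + size > LOCAL_SZ) {     /* evict oldest frame(s): "DMA out"   */
        int old = frame_sz[0];
        memcpy(main_mem + main_ptr, local_mem, old);
        main_ptr += old;
        evicted  += old;
        memmove(local_mem, local_mem + old, sp - old);
        sp -= old;
        memmove(frame_sz, frame_sz + 1, --nframes * sizeof(int));
    }
    frame_sz[nframes++] = size;
    sp += size;
}

/* fco: called after the call returns; pops the callee's frame.
 * A full implementation would also DMA evicted caller frames back in. */
void fco(void)
{
    sp -= frame_sz[--nframes];
}
```

With the slide's frame sizes, fci(50); fci(20); fci(30); evicts exactly F1's 50-byte frame, leaving F2 and F3 (50 bytes) resident.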
How to evict data to global memory?
- Can use DMA to transfer heap object to global memory
— DMA is very fast – no core-to-core communication
- But eventually, you can overwrite some other data
- Need OS mediation
[Figure: two options – (left) the execution core sends each malloc request to the main core, which allocates space in global memory; (right) the execution core DMAs the data directly to global memory]
- Thread communication between cores is slow!
Hybrid DMA + Communication
if (enough space in global memory)
    write data using DMA
else
    request more space in global memory
[Figure: the execution thread requests at least S bytes via mailbox-based communication; the main core allocates the space (startAddr to endAddr) in global memory; the execution core then DMA-writes from local memory to global memory]
Pointer Problem
F1() {
    int k = 3, a = -1;
    int *ptrA = &a;
    fci(F2); F2(k, ptrA); fco(F2);
    printf("%d\n", a);
}
F2(int k, int *ptr) {
    if (k == 1) { *ptr = 1000; return; }
    fci(F2); F2(--k, ptr); fco(F2);
}
Space for stack = 80 bytes (frame sizes: F1 = 50 bytes, F2 = 30 bytes)

[Figure: by the time k = 1, the recursive F2 frames have evicted F1's frame, and with it the variable a, to main memory; the local address &a held in ptr is now stale]
Pointer Resolution
F1() {
    int k = 3, a = -1;
    int *ptrA = s2p(&a);
    fci(F2); F2(k, ptrA); fco(F2);
    printf("%d\n", a);
}
F2(int k, int *ptr) {
    if (k == 1) { ptr_wr(ptr, 1000); return; }
    fci(F2); F2(--k, ptr); fco(F2);
}
Space for stack = 80 bytes (frame sizes: F1 = 50 bytes, F2 = 30 bytes)

[Figure: (b) address mapping by the pointer-management functions – s2p() converts the local address of a (e.g., 3070) to its main-memory address (e.g., 181310), so ptr_wr() writes to the correct location even after F1's frame has been evicted]
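The translation can be sketched on a host machine. This is our own single-window simplification: s2p and ptr_wr are the slide's names, everything else (the stack's fixed home in main memory, the residency window) is assumed; a real implementation walks the frame table and uses DMA:

```c
#include <assert.h>
#include <string.h>

#define STACK_SZ   80                  /* local stack space, as on the slide    */
#define STACK_HOME 1000                /* assumed home of the stack in main_mem */
static char local_mem[STACK_SZ];
static char main_mem[4096];            /* stand-in for global memory            */
static int  resident_lo = 0;           /* window of local bytes still valid     */
static int  resident_hi = STACK_SZ;

/* s2p: rewrite a local stack address as its stable main-memory address */
char *s2p(char *p) { return main_mem + STACK_HOME + (p - local_mem); }

/* ptr_wr: write an int through a main-memory stack address; touch the
 * local copy only while the containing frame is still resident */
void ptr_wr(char *gp, int v)
{
    int off = (int)(gp - (main_mem + STACK_HOME));
    if (off >= resident_lo && off < resident_hi)
        memcpy(local_mem + off, &v, sizeof v);   /* frame still local */
    else
        memcpy(gp, &v, sizeof v);                /* frame was evicted */
}
```

In the slide's scenario, &a is converted once by s2p(); after F1's frame is evicted, ptr_wr() lands in the main-memory copy of a, and F1 sees the value once its frame is brought back.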
Enabling Limitless Stack Depth
[Figure: log runtime (µs) vs. recursion parameter n for rcount() – without stack management the program fails beyond n = 3842; with our approach it runs for arbitrarily large n]

int rcount(int n) {
    if (n == 0) return 0;
    return rcount(n - 1) + 1;
}

- 1%–20% overhead
Outline
- 0. Global
- 1. Code management
- 2. Stack management
- 3. Heap management
  – [CODES+ISSS 2010] Heap Data Management for Limited Local Memory (LLM) Multi-core Processors
Heap Data Management
[Figure: the 32-byte local heap holds malloc1 and malloc2 (16 bytes each); malloc3 must evict the oldest object to global memory, tracked by the local heap pointer (HP) and the global heap pointer (GM_HP)]

Heap size = 32 bytes, sizeof(Student) = 16 bytes

typedef struct {
    int id;
    float score;
} Student;

main() {
    for (i = 0; i < N; i++) {
        student[i] = malloc(sizeof(Student));
    }
    for (i = 0; i < N; i++) {
        student[i]->id = i;
    }
}
- mymalloc()
  – May need to evict older heap objects to global memory
  – May need to allocate more global memory
- malloc()
  – Allocates space in local memory
Address Translation Functions
- The mapping from SPU addresses to global addresses is one-to-many
  – Cannot easily find the global address from an SPU address
- All heap accesses must therefore happen through global addresses
- p2s() translates a global address to an SPU address
  – Ensures the heap object is in the local memory
- s2p() translates the SPU address back to a global address

main() {
    for (i = 0; i < N; i++) {
        student[i] = malloc(sizeof(Student));
    }
    for (i = 0; i < N; i++) {
        student[i] = p2s(student[i]);
        student[i]->id = i;
        student[i] = s2p(student[i]);
    }
}
Heap Management API
Original code:
typedef struct {
    int id;
    float score;
} Student;

main() {
    for (i = 0; i < N; i++) {
        student[i] = malloc(sizeof(Student));
        student[i]->id = i;
    }
}

API:
- malloc(): allocates space in local memory and global memory, and returns the global address
- free(): frees the space in global memory
- p2s(): ensures the heap object exists in local memory and returns its SPU address
- s2p(): translates the SPU address back to the PPU (global) address

Code with heap management:
main() {
    for (i = 0; i < N; i++) {
        student[i] = malloc(sizeof(Student));
        student[i] = p2s(student[i]);
        student[i]->id = i;
        student[i] = s2p(student[i]);
    }
}

Heap management overhead: 1–20%
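A host-side sketch of this API (our own simplification: a two-slot local heap with FIFO eviction, global addresses represented as offsets into a buffer, and memcpy standing in for the DMA; the names p2s/s2p mirror the slide's, and malloc appears as my_malloc to avoid shadowing the C library):

```c
#include <assert.h>
#include <string.h>

#define OBJ   16                        /* sizeof(Student), as on the slide  */
#define SLOTS 2                         /* 32-byte local heap / 16-byte objs */
static char local_heap[SLOTS * OBJ];
static char global_heap[4096];          /* stand-in for global memory        */
static int  gm_hp = 0;                  /* global heap pointer (GM_HP)       */
static int  slot_owner[SLOTS] = { -1, -1 }; /* global offset cached per slot */
static int  next_evict = 0;             /* FIFO eviction cursor              */

/* my_malloc: allocate one object; the handle is a GLOBAL heap offset */
int my_malloc(void) { int h = gm_hp; gm_hp += OBJ; return h; }

/* p2s: ensure object h is in the local heap; return its local address */
char *p2s(int h)
{
    for (int s = 0; s < SLOTS; s++)     /* already resident? */
        if (slot_owner[s] == h) return local_heap + s * OBJ;
    int s = next_evict;                 /* pick a victim slot */
    next_evict = (next_evict + 1) % SLOTS;
    if (slot_owner[s] >= 0)             /* write back the evicted object */
        memcpy(global_heap + slot_owner[s], local_heap + s * OBJ, OBJ);
    memcpy(local_heap + s * OBJ, global_heap + h, OBJ);  /* fetch object */
    slot_owner[s] = h;
    return local_heap + s * OBJ;
}

/* s2p: translate a local heap address back to its global handle */
int s2p(char *p) { return slot_owner[(p - local_heap) / OBJ]; }
```

As on the slide, the third allocation evicts the first object; accessing it again through p2s() transparently fetches it back from global memory, and s2p() recovers the global handle before a pointer is stored.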
Outline
- 0. Global
- 1. Code management
- 2. Stack management
- 3. Heap management
Scalability of Data Management
[Figure: log runtime (µs) vs. number of cores (1–6) for DFS, dijkstra, fft, fft_inverse, MST, rbTree, and stringsearch]
Scattered data management requests
Summary
- Rise 1: SPMs in embedded systems, for power, performance, and predictability
- Rise 2: with power and temperature becoming important concerns and core counts scaling, SPMs in high-performance systems
- Cell may die, but…
- LLM architecture different from traditional SPMs
– Need software memory management
- Code Management
- Stack data management
- Heap data management
- Goal: allow any multi-threaded application to execute on the LLM architecture
- http://www.public.asu.edu/~ashriva6
Experimental Setup
- Sony PlayStation 3 running Fedora Core 9 Linux
- MiBench benchmark suite and other applications
  – http://www.public.asu.edu/~kbai3/publications.html
- Runtimes are measured with spu_decrementer() on the SPEs and _mftb() on the PPE, both provided with the IBM Cell SDK 3.1
SPMs worse than Caches?
- Caches request data at the last moment
  – SPM: insert a DMA instruction just before the load
  – Question: DMA instructions vs. hardware simplicity
- From then on, there are only advantages
  – Can also specify how much data to DMA
  – Can hoist the DMA earlier
  – Forces more intelligence into the compiler
- One important reason for the rise of SPMs is the failure of prefetching
  – Even after so much research on prefetching, no processor implements more than next-line prefetching
    - And that, too, very cautiously, since it is not very accurate and leads to cache pollution
    - Or a separate prefetch cache is added
  – Caches request data when it is already too late
- Power and temperature have never been so important
- Trend to improve power efficiency
  – Do in hardware only the things the processor does very frequently
  – Things that happen rarely can be handled in s/w, since h/w consumes power all the time
    - E.g., coherency; soft-error detection in h/w, correction in s/w