
SLIDE 1

RAMP-White / FAST-MP

Hari Angepat and Derek Chiou

Electrical and Computer Engineering, University of Texas at Austin

Supported in part by DOE, NSF, SRC, Bluespec, Intel, Xilinx, IBM, and Freescale

SLIDE 2

RAMP-White Overview

Use existing FPGA processor implementations to build scalable, flexible, coherent shared memory platforms that run standard operating systems
Standard ISA/OS enables more complex applications such as software emulators (QEMU) when desired.

SLIDE 3

RAMP White Architecture

Classic shared memory machine design

[Diagram. Block labels: Processor (x2), Intersection Unit (x2), Router (x2), IO/MEM (x2)]

SLIDE 4

RAMP White Architecture

[Diagram: RAMP-White. Block labels: Processor (x2), Proc shim (x2), Intersection Unit (x2), NIU (x2), Ring Router (x2), Periph shim (x2), Peripheral bus (x2), DRAM (x2), IO Device (x3)]

SLIDE 5

RAMP White Architecture


Model CMP/SMP targets
Coherent shared memory platform
Single image OS
RAMP scalability (1K cores) via spatial and temporal replication

SLIDE 6

RAMP White Architecture


Ability to use commodity cores:
SparcV8: Leon3 soft‐core
PowerPC: PPC405 hard‐core
Configurable coherence protocol, engines

SLIDE 7

RAMP White Architecture


Configurable modules:
NIC, network, coherence engine, intersection unit
Modules connected by Connectors:
Point‐to‐point FIFOs that can model target time if required
Shim adapters
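
A connector of this kind can be sketched in software as a FIFO whose entries become visible only after a configurable number of target cycles. This is a minimal illustration, not the RAMP-White implementation: the class name `Connector` and the parameters `latency` and `depth` are assumptions made for the example. With `latency=0` it behaves as a plain FIFO, which matches the case where target time does not need to be modeled.

```python
from collections import deque


class Connector:
    """Point-to-point FIFO that delays each message by `latency` target cycles.

    Hypothetical model: `latency` and `depth` are illustrative parameters,
    not names taken from RAMP-White.
    """

    def __init__(self, latency=0, depth=4):
        self.latency = latency
        self.depth = depth
        self._queue = deque()  # entries are (ready_cycle, message)

    def can_send(self):
        """True while the FIFO has room for another message."""
        return len(self._queue) < self.depth

    def send(self, message, now):
        """Enqueue at target cycle `now`; returns False if the FIFO is full."""
        if not self.can_send():
            return False
        self._queue.append((now + self.latency, message))
        return True

    def receive(self, now):
        """Return the oldest message if its delay has elapsed, else None."""
        if self._queue and self._queue[0][0] <= now:
            return self._queue.popleft()[1]
        return None
```

A full FIFO refusing `send` is what lets a slow consumer backpressure a producer module.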

SLIDE 8

RAMP-White Status

Working:
Multiprocessor Leon
Soft‐fp kernel and userspace as initramfs
Standard pthread Splash benchmarks
Still debugging:
Multichip crossing with scalable interrupt components

Integration with parametrizable FAST cache model
See me during retreat if interested in Alpha release

SLIDE 9

Prototype (See at Demo)

Hardware
Sparc V8 32-bit soft-core processor (Leon3)
50 MHz core clock, soft-FP, 16KB Icache, Dcache bypassed
GRLIB Components {serial, ethernet, ddr, jtag}
Software
Linux SMP 2.6.21 for Leon3
Pthread‐based Splash2 benchmarks
RAM disk rootfs with simple userspace apps
Platform
BEE2 control FPGA with JTAG based programming
Ethernet for kernel loading/debugging


SLIDE 10

FAST-MP

SLIDE 11

FAST-MP: High Level Goal

Multi-resolution coherent shared memory target emulation
Predict performance/power for a wide range of micro-architectures at accuracies ranging from cycle accurate to functional-only
Capable of running real ISAs aided by binary translation (x86, Sparc, PowerPC, etc.), operating systems (unmodified Windows, Linux), compilers, and applications (SQLServer, Apache, etc.)
Extensible/flexible (new instructions, different micro-architectures)

SLIDE 12

Performance Modeling on RAMP-White

RAMP-White host predicts RAMP-White target performance perfectly
Predicting performance of arbitrary micro-architectures requires additional support
FAST (FPGA Accelerated Simulation Techniques) uses a timing model to predict performance of an arbitrary micro-architecture

Special purpose structure designed to predict time
Very small (complex model in a fraction of an FPGA)
Uses same functional model for any micro‐architecture
White as a scalable functional model for FAST‐MP

SLIDE 13

FAST (FPGA Accelerated Simulation)

Speculative functional model (FM) with checkpoint/rollback of the FM when the FM and timing model (TM) paths diverge
Example: branch mispredict/resolve
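
The checkpoint/rollback idea can be illustrated with a toy functional model. This is a sketch under stated assumptions, not FAST's actual mechanism: the two-instruction ISA (`addi`, `beqz`), the class `FunctionalModel`, and the per-instruction checkpointing policy are all invented for the example. The model saves its state before every instruction so that a timing model could ask it to rewind to instruction `n` and, if needed, resume from a different pc (for example, to follow a mispredicted path and later return to the resolved one).

```python
class FunctionalModel:
    """Toy functional model that checkpoints before every instruction."""

    def __init__(self, program):
        self.program = program   # list of (op, reg, arg) tuples
        self.pc = 0
        self.regs = {}
        self.count = 0           # number of instructions executed so far
        self._checkpoints = []   # _checkpoints[n] = (pc, regs) before instruction n

    def step(self):
        """Execute one instruction; return a trace entry (n, pc, next_pc)."""
        self._checkpoints.append((self.pc, dict(self.regs)))
        op, reg, arg = self.program[self.pc]
        pc = self.pc
        if op == "addi":
            self.regs[reg] = self.regs.get(reg, 0) + arg
            self.pc += 1
        elif op == "beqz":
            self.pc = arg if self.regs.get(reg, 0) == 0 else self.pc + 1
        else:
            raise ValueError(f"unknown op {op!r}")
        self.count += 1
        return (self.count - 1, pc, self.pc)

    def rollback(self, n, pc=None):
        """Restore the state saved before instruction n.

        If `pc` is given, resume from that pc instead of the saved one.
        """
        saved_pc, saved_regs = self._checkpoints[n]
        del self._checkpoints[n:]  # drop checkpoints of the abandoned path
        self.pc = saved_pc if pc is None else pc
        self.regs = dict(saved_regs)
        self.count = n
```

In this sketch the caller supplies the pc to resume from, since a checkpoint taken on a redirected path records the redirected pc rather than the originally computed one.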

SLIDE 14

FAST-MP Approach

Multicore functional model executes as it wishes
Functional instruction stream generated (per core) and sent to timing model
Rollback when functional model execution differs from timing model
Branch mispredictions, address speculation, etc.
Possible for functional model to access memory in different order than target

SLIDE 15

FAST-MP Memory Reordering

All memory references tagged with a version number
FM passes a version number in trace to TM
Essentially a precondition on the validity of the given trace
If TM version != FM version
Freeze timing models (to avoid corrupting TM)
Rollback functional models to restore correct memory/architectural state
Use TM-directed order to re-execute
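
The version check can be sketched as follows. The slides do not say how version numbers are assigned, so this example assumes one plausible scheme: a per-address counter that increments on every store, with each trace entry carrying the version the functional model observed. The names `TraceEntry` and `TimingModelMemory` are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceEntry:
    core: int
    op: str        # "load" or "store"
    addr: int
    version: int   # version of `addr` that the functional model observed


class TimingModelMemory:
    """Tracks per-address versions in timing-model (target) order.

    Assumption for this sketch: a version is a per-address store counter.
    """

    def __init__(self):
        self.version = defaultdict(int)

    def consume(self, entry):
        """Return True if the entry is valid in target order.

        False means the FM observed a different write history than the
        target order implies, so the FM would need to roll back.
        """
        if self.version[entry.addr] != entry.version:
            return False
        if entry.op == "store":
            self.version[entry.addr] += 1
        return True
```

For instance, if the functional model ran a store on core 0 before a load on core 1, but the timing model orders the load first, the load's entry carries version 1 while the timing model still holds version 0, and `consume` returns False.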

SLIDE 16

White

[Diagram. Block labels: Processor (x2), Intersection Unit (x2), Router (x2), IO/MEM (x2)]

SLIDE 17

White + Timing Model

PowerPC/Sparc ISA with arbitrary timing model

[Diagram. Block labels: Processor (x2), Intersection Unit (x2), Router (x2), IO/MEM (x2), Timing Model (x2), Net model]

SLIDE 18

White + VM + Timing Model

Sparc ISA with QEMU to emulate any ISA
Requires trace/rollback:
Hardware
Software (QEMU), which can also be hardware accelerated

[Diagram. Block labels: SMP OS, QEMU x86 VCPU (x2), Processor (x2), Intersection Unit (x2), Router (x2), IO/MEM (x2), X86 Timing Model (x2), Timing Model (x2), Net model]

SLIDE 19

Probability of Reordered Memory Ops

Functionally driven speculation in an MP is costly if timing-ordered memory references conflict
Preliminary study of atomic operations in x86 applications
Use Pin dynamic instrumentation tool to monitor every atomic operation while running a multi-threaded app
Analyze inter-atomic distance for existing shared memory workloads (Splash2, Parsec)
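
The analysis step can be sketched as a post-processing pass over a log of atomic operations. The log format `(cycle, cpu, addr)` and the exact definition of distance used here are assumptions for illustration; the slides do not specify either. In this sketch, a distance is recorded whenever the immediately preceding atomic operation on the same address came from a different CPU.

```python
def interprocessor_reuse_distances(events):
    """Compute interprocessor atomic reuse distances in cycles.

    events: iterable of (cycle, cpu, addr) tuples sorted by cycle
            (hypothetical log format).

    For each atomic operation whose immediately preceding atomic operation
    on the same address was issued by a different CPU, record the gap in
    cycles between the two.
    """
    last = {}        # addr -> (cycle, cpu) of the most recent atomic op
    distances = []
    for cycle, cpu, addr in events:
        prev = last.get(addr)
        if prev is not None and prev[1] != cpu:
            distances.append(cycle - prev[0])
        last[addr] = (cycle, cpu)
    return distances
```

Short distances are the interesting case, since they indicate atomic operations from different processors that are close enough in time for functional and timing order to disagree.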

SLIDE 20

Interprocessor Atomic Reuse Distance

[Chart: percent of atomic operations (y-axis, 0% to 35%) versus interprocessor reuse distance in cycles (x-axis, 2500 to 90000). Series: FFT, LU, Ocean, Radix, BlackScholes, BodyTrack, FaceSim, Ferret, FluidAnimate, FreqMine, Swaptions]

SLIDE 21

Task Size Scaling on Intel CMPs

[Chart: speedup normalized to serial implementation (y-axis, 0.00 to 4.00) versus task size in cycles (x-axis, 1000 to 8000). Series: Xeon5140-1Thread, XeonX3230-1Thread, Xeon5140-2Threads, XeonX3230-2Threads, Xeon5140-4Threads, XeonX3230-4Threads]

SLIDE 22

FAST-MP Can Be Less Than Accurate!

Nearly accurate
Functional model backpressured by timing model
Don’t want to overflow buffers
Each functional core roughly at correct instruction relative to other cores

Do not rollback to reorder memory operations
Still correct, just locks taken in different order
Eliminate rollback overheads, probably quite accurate
Model RAMP‐White on FAST‐MP to check accuracy
Functional + cache
Run with just cache simulators
Etc.

SLIDE 23

QEMU on White-Leon3

QEMU 0.9.1 with patches
Some issues remaining with Dyngen for V8 ISA with Leon3 cross compiler
For initial Linux boot:
X86 instructions: 1
QEMU uOPs: ~3.1
Sparc instructions: ~22.5
High overheads involved in address computation, segmentation checks, software TLB, etc.

Can modify/replace Leon3 to improve efficiency
MicroOP‐based processor
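
The expansion figures above imply the following ratios. This is plain arithmetic on the approximate numbers from the slide; the final ratio assumes an idealized host that executes one QEMU micro-op per host instruction, which is an assumption made here to show why a micro-op-based processor is attractive, not a measured result.

```python
# Approximate counts from the slide, per x86 instruction during initial Linux boot.
X86_INSTRUCTIONS = 1.0
QEMU_UOPS = 3.1
SPARC_INSTRUCTIONS = 22.5

# Host (Sparc) instructions needed per QEMU micro-op.
sparc_per_uop = SPARC_INSTRUCTIONS / QEMU_UOPS

# Host instructions needed per emulated x86 instruction.
sparc_per_x86 = SPARC_INSTRUCTIONS / X86_INSTRUCTIONS

# Idealized assumption: a micro-op-based host runs one uop per host instruction,
# so it would need QEMU_UOPS host instructions per x86 instruction.
ideal_reduction = sparc_per_x86 / (QEMU_UOPS / X86_INSTRUCTIONS)

print(f"Sparc instructions per uop: {sparc_per_uop:.2f}")
print(f"Sparc instructions per x86 instruction: {sparc_per_x86:.1f}")
print(f"Idealized reduction with a uop-based host: {ideal_reduction:.2f}x")
```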

SLIDE 24

Conclusions

Initial RAMP‐White Alpha design functional
FAST‐MP
Provide various ISAs
Cycle‐accuracy to purely functional
Developing power models
FAST-MP will run on top of RAMP-White as well as on standard multicore systems
