RAMP-White / FAST-MP
Hari Angepat and Derek Chiou
Electrical and Computer Engineering, University of Texas at Austin
Supported in part by DOE, NSF, SRC, Bluespec, Intel, Xilinx, IBM, and Freescale
RAMP-White Overview
- Use existing FPGA processor implementations to build scalable, flexible, coherent shared memory platforms that run standard operating systems
- Standard ISA/OS enables more complex
applications such as software emulators (QEMU) when desired.
RAMP White Architecture
- Classic shared memory machine design
[Diagram: processors, intersection units, routers, and IO/MEM blocks]
RAMP White Architecture
[Diagram: RAMP-White nodes built from a processor, proc shim, intersection unit, NIU, ring router, periph shim, peripheral bus, DRAM, and IO devices]
- Model CMP/SMP targets
- Coherent shared memory platform
- Single image OS
- RAMP scalability (1K cores) via spatial and temporal
replication
- Ability to use commodity cores:
- SparcV8: Leon3 soft‐core
- PowerPC: PPC405 hard‐core
- Configurable coherence protocol, engines
- Configurable modules:
- NIC, network, coherence engine, intersection unit
- Modules connected by Connectors:
- Point‐to‐point FIFOs that can model target time if required
- Shim adapters
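A Connector, as described above, can be pictured as a bounded FIFO that optionally holds each message until a target-time latency has elapsed. The following is a minimal sketch under assumptions: the class name, the `capacity` and `latency` parameters, and the cycle-stamped interface are hypothetical and not taken from the RAMP-White sources.

```python
from collections import deque


class Connector:
    """Point-to-point FIFO; optionally delays delivery by a target latency.

    Hypothetical illustration, not the RAMP-White implementation.
    """

    def __init__(self, capacity, latency=0):
        self.capacity = capacity   # maximum number of in-flight messages
        self.latency = latency     # target cycles between send and earliest receive
        self._queue = deque()      # entries are (ready_cycle, message)

    def can_send(self):
        return len(self._queue) < self.capacity

    def send(self, message, now):
        """Enqueue a message at target cycle `now`; return False when full."""
        if not self.can_send():
            return False
        self._queue.append((now + self.latency, message))
        return True

    def receive(self, now):
        """Return the oldest message if its target time has arrived, else None."""
        if self._queue and self._queue[0][0] <= now:
            return self._queue.popleft()[1]
        return None
```

With `latency=0` the connector behaves as a plain functional FIFO; a nonzero latency is one way to "model target time if required".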
RAMP-White Status
- Working:
- Multi processor Leon
- Soft‐fp kernel and userspace as initramfs
- Standard pthread Splash benchmarks
- Still debugging:
- Multichip crossing with scalable interrupt
components
- Integration with parametrizable FAST cache model
- See me during retreat if interested in Alpha release
Prototype (See at Demo)
- Hardware
- Sparc V8 32bit soft‐core processor (Leon3)
- 50 MHz core clock, soft‐FP, 16KB Icache, Dcache bypassed
- GRLIB Components {serial, ethernet, ddr, jtag}
- Software
- Linux SMP 2.6.21 for Leon3
- Pthread‐based Splash2 benchmarks
- RAM disk rootfs with simple userspace apps
- Platform
- BEE2 control FPGA with JTAG based programming
- Ethernet for kernel loading/debugging
RAMP-White
FAST-MP
FAST-MP: High Level Goal
- Multi‐resolution coherent shared memory
target emulation
- Predict performance/power for wide range of
micro‐architectures at accuracies ranging from cycle accurate to functional‐only
- Capable of running real ISAs aided by binary
translation (x86, Sparc, PowerPC, etc), operating systems (unmodified Windows, Linux), compilers, applications (SQLServer, Apache, etc)
- Extensible/flexible (new instructions, different
micro architectures)
Performance Modeling on RAMP-White
- RAMP‐White host predicts RAMP‐White target
performance perfectly
- Predicting performance of arbitrary micro‐architectures
requires additional support
- FAST (FPGA Accelerated Simulation Techniques) uses a
timing model to predict performance of arbitrary micro‐architecture
- Special purpose structure designed to predict time
- Very small (complex model in a fraction of an FPGA)
- Uses same functional model for any micro‐architecture
- White as a scalable functional model for FAST‐MP
FAST (FPGA Accelerated Simulation)
- Speculative FM with checkpoint/rollback of
FM when FM/TM paths diverge
- Example: branch mispredict/resolve
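The checkpoint/rollback idea can be sketched with a toy functional model that snapshots its state before every instruction. Everything here is a hypothetical illustration: the two-instruction ISA, the `force_next_pc` hook standing in for a timing-model-directed path, and the per-instruction checkpoint granularity are assumptions, not the FAST design.

```python
import copy


class FunctionalModel:
    """Toy functional model (FM) with per-instruction checkpoints."""

    def __init__(self, program):
        self.program = program
        self.state = {"pc": 0, "regs": [0, 0]}
        self.checkpoints = []  # checkpoints[i] = state before the i-th executed instruction

    def step(self, force_next_pc=None):
        """Execute one instruction; return (pc, next_pc) as the trace entry."""
        self.checkpoints.append(copy.deepcopy(self.state))
        pc = self.state["pc"]
        op, a, b = self.program[pc]
        next_pc = pc + 1
        if op == "addi":                      # regs[a] += b
            self.state["regs"][a] += b
        elif op == "beqz" and self.state["regs"][a] == 0:
            next_pc = b                       # branch taken
        if force_next_pc is not None:         # path dictated by the timing model
            next_pc = force_next_pc
        self.state["pc"] = next_pc
        return pc, next_pc

    def rollback(self, index):
        """Restore the state from before the index-th executed instruction."""
        self.state = self.checkpoints[index]
        del self.checkpoints[index:]
```

A possible use: the FM runs ahead down the correct path; if the timing model predicted the branch differently, the FM is rolled back to the branch and stepped with `force_next_pc` set to the predicted target, then rolled back again once the branch resolves.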
FAST-MP Approach
- Multicore functional model executes as it
wishes
- Functional instruction stream generated (per core)
and sent to timing model
- Rollback when functional model execution differs
from timing model
- Branch mispredictions, address speculation, etc.
- Possible for functional model to access
memory in different order than target
FAST-MP Memory Reordering
- All memory references tagged with a version
number
- FM passes a version number in trace to TM
- essentially a precondition on the validity of the
given trace
- If TM version != FM version
- Freeze timing models (to avoid corrupting TM)
- Rollback functional models to restore correct
memory/architectural state
- Use TM directed order to re‐execute
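The version check above can be illustrated with a small sketch. It assumes, hypothetically, that each memory location carries a counter bumped on every write, that the trace records the version each reference saw or produced, and that the timing model replays references in target order; the names `VersionedMemory` and `first_mismatch` are invented for this example.

```python
from collections import defaultdict


class VersionedMemory:
    """Functional-model memory where every location has a version number."""

    def __init__(self):
        self.value = defaultdict(int)
        self.version = defaultdict(int)

    def write(self, addr, val):
        """Store val and return the new version of the location."""
        self.version[addr] += 1
        self.value[addr] = val
        return self.version[addr]

    def read(self, addr):
        """Return (value, version) as observed by the functional model."""
        return self.value[addr], self.version[addr]


def first_mismatch(tm_order):
    """Replay trace entries in timing-model order.

    Each entry is (op, addr, fm_version). Returns the index of the first
    entry whose FM version disagrees with the TM's view, or None.
    """
    tm_version = defaultdict(int)
    for i, (op, addr, fm_version) in enumerate(tm_order):
        if op == "write":
            tm_version[addr] += 1
        if tm_version[addr] != fm_version:
            return i  # precondition violated: freeze TM, roll back FM
    return None
```

If the functional model executed a write before a read of the same address but the target orders the read first, `first_mismatch` flags the read, which is the point where the functional models would be rolled back and re-executed in TM order.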
White
[Diagram: processors, intersection units, routers, and IO/MEM blocks]
White + Timing Model
- PowerPC/Sparc ISA with arbitrary timing
model
[Diagram: White (processors, intersection units, routers, IO/MEM) with timing models and a net model]
White + VM + Timing Model
- Sparc ISA with QEMU to emulate any ISA
- Requires trace/rollback:
- Hardware
- Software (QEMU) ‐ can also be hardware
accelerated
[Diagram: an SMP OS running QEMU x86 VCPUs on White (processors, intersection units, routers, IO/MEM), with x86 timing models, timing models, and a net model]
Probability of Reordered Memory Ops
- Functionally‐driven speculation in an MP is costly if timing‐ordered memory references conflict
- Preliminary study of atomic operations in x86 applications
- Use Pin dynamic instrumentation tool to monitor
every atomic operation running a multi‐threaded app
- Analyze inter‐atomic distance for existing shared
memory workloads (Splash2, Parsec)
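One way to compute such a distance from an instrumentation trace is sketched below. The trace format `(cycle, cpu, addr)` and the definition used (cycles between consecutive atomic operations on the same address issued by different processors) are assumptions made for illustration; the study's exact definition is not given here.

```python
from bisect import bisect_left


def interprocessor_reuse_distances(trace):
    """trace: iterable of (cycle, cpu, addr) for atomic ops, sorted by cycle."""
    last = {}        # addr -> (cycle, cpu) of the most recent atomic op
    distances = []
    for cycle, cpu, addr in trace:
        prev = last.get(addr)
        if prev is not None and prev[1] != cpu:
            distances.append(cycle - prev[0])
        last[addr] = (cycle, cpu)
    return distances


def bucket_percentages(distances, edges):
    """Percent of distances falling at or below each edge, plus an overflow bucket."""
    counts = [0] * (len(edges) + 1)
    for d in distances:
        counts[bisect_left(edges, d)] += 1
    total = len(distances)
    return [100.0 * c / total for c in counts] if total else [0.0] * len(counts)
```

Short distances are the interesting case: they indicate atomic operations from different processors close enough in time that a functional-model ordering could plausibly differ from the target ordering.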
Interprocessor Atomic Reuse Distance
[Figure: percent of atomic operations (0% to 35%) vs. interprocessor reuse distance in cycles (2500 to 90000) for FFT, LU, Ocean, Radix, BlackScholes, BodyTrack, FaceSim, Ferret, FluidAnimate, FreqMine, and Swaptions]
Task Size Scaling on Intel CMPs
[Figure: speedup normalized to serial implementation (0.00 to 4.00) vs. task size in cycles (1000 to 8000) for Xeon 5140 and Xeon X3230 with 1, 2, and 4 threads]
FAST-MP Can Be Less Than Accurate!
- Nearly accurate
- Functional model backpressured by timing model
- Don’t want to overflow buffers
- Each functional core roughly at correct instruction relative
to other cores
- Do not rollback to reorder memory operations
- Still correct, just locks taken in different order
- Eliminate rollback overheads, probably quite accurate
- Model RAMP‐White on FAST‐MP to check accuracy
- Functional + cache
- Run with just cache simulators
- Etc.
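The backpressure mode listed under "Nearly accurate" can be sketched as bounded per-core trace buffers: a functional core stalls when its buffer is full, so no core runs more than a buffer's worth of instructions ahead of what the timing model has consumed. The buffer capacity, per-step rates, and function names below are hypothetical choices for this toy simulation.

```python
from collections import deque


class TraceBuffer:
    """Bounded FIFO between one functional core and its timing model."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = deque()

    def try_push(self, entry):
        if len(self.entries) >= self.capacity:
            return False          # backpressure: the functional core must stall
        self.entries.append(entry)
        return True

    def pop(self):
        return self.entries.popleft() if self.entries else None


def run(fm_rates, tm_rate, capacity, steps):
    """Return (executed, consumed) instruction counts per core."""
    buffers = [TraceBuffer(capacity) for _ in fm_rates]
    executed = [0] * len(fm_rates)
    consumed = [0] * len(fm_rates)
    for _ in range(steps):
        for core, rate in enumerate(fm_rates):
            for _ in range(rate):             # FM tries to run `rate` instructions
                if buffers[core].try_push(executed[core]):
                    executed[core] += 1
        for core in range(len(fm_rates)):
            for _ in range(tm_rate):          # TM drains up to `tm_rate` entries
                if buffers[core].pop() is not None:
                    consumed[core] += 1
    return executed, consumed
```

In this sketch a fast functional core is held to within `capacity` instructions of the timing model's progress, which keeps cores roughly aligned without any rollback.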
QEMU on White-Leon3
- QEMU 0.9.1 with patches
- Some issues remaining with Dyngen for V8 ISA with
Leon3 cross compiler
- For initial Linux Boot:
- x86 instructions: 1
- QEMU uOPs: ~3.1
- Sparc instructions: ~22.5
- High overheads involved in address computation, segmentation checks, software TLB, etc.
- Can modify/replace Leon3 to improve efficiency
- MicroOP‐based processor
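The expansion ratios above imply a rough throughput bound, worked out in the sketch below. The ratios and the 50 MHz core clock come from the slides; the host cycles-per-instruction value is purely an assumption (and an optimistic one, given the bypassed Dcache), so the result should be read as an upper bound rather than a measurement.

```python
UOPS_PER_X86 = 3.1       # QEMU uOPs per x86 instruction (from the slide)
SPARC_PER_X86 = 22.5     # Sparc instructions per x86 instruction (from the slide)
CORE_CLOCK_HZ = 50e6     # Leon3 prototype core clock


def sparc_per_uop():
    """Average Sparc instructions needed per QEMU uOP."""
    return SPARC_PER_X86 / UOPS_PER_X86


def x86_instructions_per_second(host_cpi=1.0):
    """Emulated x86 rate; host_cpi is an assumed value, not a measured one."""
    return CORE_CLOCK_HZ / (SPARC_PER_X86 * host_cpi)
```

This gives about 7.3 Sparc instructions per uOP and, with an assumed CPI of 1, roughly 2.2 million emulated x86 instructions per second per core, which is the motivation for a more efficient host such as a MicroOP-based processor.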
Conclusions
- Initial RAMP‐White Alpha design functional
- FAST‐MP
- Provide various ISAs
- Cycle‐accuracy to purely functional
- Developing power models
- FAST‐MP will run on top of RAMP‐White as