The Open Source ProtoFlex Simulator
Eric S. Chung, Michael K. Papamichael, James C. Hoe, Babak Falsafi, Ken Mai
Computer Architecture Lab
RAMP Retreat, June 2009
The ProtoFlex Simulator
• History
  – Project started (circa 2007) to build scalable, full-system multiprocessor simulators using FPGAs
• Key Features
  – Functional simulator for an N-way UltraSPARC III server (~50-90 MIPS)
  – Uses hybrid simulation to run real server applications and the Solaris OS
  – Employs host multithreading to virtualize multiple CPUs per FPGA core
[Figures: Hybrid Simulation; Virtualization]
Open Sourcing ProtoFlex
• Why open source?
  – Demonstrate FPGAs as a viable architecture research vehicle
  – Facilitate adoption of hybrid simulation and host multithreading
  – Encourage others to build on top of our work
• What are we releasing?
  – Bluespec source HDL, Verilog, and pre-generated netlists for the SPARCV9 CPU model and interfaces
  – XUPV5 reference design for EDK 10.1
  – Virtutech Simics plug-ins for hybrid simulation
  – Top-level SW controller and user command-line interface
  – Documentation through an online wiki
Outline
• Motivation
• The ProtoFlex Simulator
  – High-level components
  – UltraSPARC core model
• Using ProtoFlex
• XUPV5 Reference Design
• Distribution Details
The ProtoFlex Simulator
• User perceives a familiar, software-like UltraSPARC III simulator
  – Software user interface similar to Simics
  – Applications load directly from Simics checkpoints
  – Standard simulation features: state viewing, scripting, single-stepping, checkpointing, terminal, profiling/monitoring
[Figure: Linux PC (user, PFMON, Simics for I/O) connected over Ethernet to the FPGA (FPGA core, interface, PowerPC or MicroBlaze, main memory)]
Our UltraSPARC III Core Model
• ISA specifications
  – 64-bit SPARCV9 ISA + UltraSPARC III extensions
  – 8 register windows, 4 global register files
  – 512-entry D-TLB, 128-entry I-TLB
• Implementation
  – 14-stage, multithreaded pipeline that switches context every cycle
  – On Virtex-5: ~148MHz after XST synthesis; placed and routed at 100MHz
  – Parameterized non-blocking caches
  – FP and rare MMU instructions are emulated in software by a nearby MicroBlaze
  – Mirrors the Virtutech Simics model 100%
[Figure: instruction unit pipeline: Context Scheduler, I-TLB Stages 1-2, I-Fetch Address Generate, Non-blocking I-cache (BRAM), I-Fetch Tag Check, US III Decoder, Integer RF (BRAM), 64-bit ALU Stages 1-2, D-TLB Stages 1-3, Non-blocking D-Cache Address Generate, D-cache (BRAM), D-Cache Tag Check, Multi-Cycle Writeback; arbiter to DDR memory]
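The slide says only that the pipeline switches context every cycle. As a rough behavioral sketch, assuming a round-robin policy that skips contexts stalled on an outstanding non-blocking cache miss (the skip rule and the names `pick_next_context` and `issue_trace` are hypothetical, not taken from the BlueSPARC source), a context scheduler could choose the issuing thread like this:

```python
from typing import List, Optional


def pick_next_context(ready: List[bool], last: int) -> Optional[int]:
    """Round-robin pick of the next ready hardware context after `last`.

    Assumption: a context waiting on a non-blocking cache miss is marked
    not ready and skipped. Returns None when no context can issue.
    """
    n = len(ready)
    for offset in range(1, n + 1):
        candidate = (last + offset) % n
        if ready[candidate]:
            return candidate
    return None


def issue_trace(ready_per_cycle: List[List[bool]], start: int = -1) -> List[Optional[int]]:
    """Return which context issues into the pipeline on each cycle."""
    issued: List[Optional[int]] = []
    last = start
    for ready in ready_per_cycle:
        ctx = pick_next_context(ready, last)
        issued.append(ctx)
        if ctx is not None:
            last = ctx  # resume the rotation after the context that just issued
    return issued
```

For example, with four contexts and context 2 stalled on a miss, `issue_trace([[True, True, False, True]] * 4)` returns `[0, 1, 3, 0]`: the stalled thread is skipped while the others keep the pipeline busy.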
Core Design Statistics
• Runs at 100MHz on Virtex-5
  – Synthesizes at up to 148MHz using standard tools (ISE XST)
• Logic usage
  – 23.5K LUTs (11.3% of an LX330T)
• BRAM usage
  – 120 BRAMs for the 16-context configuration (37% of an LX330T)
• Future optimizations
  – Moving paging structures to SRAM or DRAM can significantly reduce BRAM usage
  – Will be released in future updates
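The percentages above follow from simple arithmetic. The LX330T capacities used here (207,360 LUTs and 324 36Kb BRAMs) are assumed from the Virtex-5 family data sheet and should be verified there; `utilization_pct` is a hypothetical helper.

```python
# Assumed XC5VLX330T capacities (Virtex-5 family data sheet; verify there).
LX330T_LUTS = 207_360
LX330T_BRAM36 = 324


def utilization_pct(used: float, available: float) -> float:
    """Share of a device resource consumed, in percent."""
    return 100.0 * used / available


if __name__ == "__main__":
    print(f"LUTs: {utilization_pct(23_500, LX330T_LUTS):.1f}%")    # 23.5K LUTs
    print(f"BRAMs: {utilization_pct(120, LX330T_BRAM36):.1f}%")    # 16-context config
```

Run as a script, this prints `LUTs: 11.3%` and `BRAMs: 37.0%`, matching the figures on the slide.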
Outline
• Motivation
• The ProtoFlex Simulator
• Using ProtoFlex
• XUPv5 Reference Design
• Distribution
Using ProtoFlex
• Add passive monitors
  – Counters, histograms
  – Roll your own trackers
• Trace-based simulation
  – Collect dynamic traces
  – Feed traces to a functional-first timing model
• Sampled program monitoring
  – Use the MicroBlaze (or PowerPC) to monitor core/memory state
  – Unintrusive profiling without changes to target software
[Figure: core pipeline instrumented with counters, histograms, and trackers; a timing model; the arbiter to DDR memory; an FPGA hard/soft core (PowerPC or MicroBlaze)]
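A passive monitor only observes pipeline events and never alters them. The software model below sketches the idea, assuming a hypothetical per-instruction record tapped at writeback; `RetiredInstr`, `PassiveMonitor`, and `tap` are illustrative names, not ProtoFlex interfaces. In hardware, the counters and histograms would be registers or BRAM tables fed by pipeline signals.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class RetiredInstr:
    """Hypothetical record a monitor might tap at writeback."""
    context: int   # hardware context (virtual CPU) id
    opclass: str   # e.g. "alu", "load", "store", "branch"
    kernel: bool   # executed in privileged mode


class PassiveMonitor:
    """Counts and histograms events without modifying what it observes."""

    def __init__(self) -> None:
        self.per_context: Counter = Counter()   # instructions per context
        self.opclass_hist: Counter = Counter()  # histogram by instruction class
        self.kernel_instrs = 0

    def observe(self, rec: RetiredInstr) -> None:
        self.per_context[rec.context] += 1
        self.opclass_hist[rec.opclass] += 1
        if rec.kernel:
            self.kernel_instrs += 1


def tap(stream: Iterable[RetiredInstr], monitor: PassiveMonitor) -> Iterator[RetiredInstr]:
    """Forward every record unchanged while the monitor watches."""
    for rec in stream:
        monitor.observe(rec)
        yield rec
```

Because `tap` yields every record unchanged, a downstream consumer such as a trace-driven timing model sees exactly the stream it would see without the monitor.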
Applications of ProtoFlex
• Examples
  – Functional-first CMP cache coherency model for first-order timing models and functional warming [TRETS’09]
  – Real-time stack trace profiling
  – CMP interconnect model (in progress)
  – Realistic CPU traffic generators (in progress)
• … running real 16-CPU server workloads
  – Oracle TPC-C, IBM DB2 TPC-C, TPC-H, SPEC2K
[Figure: Piranha CMP cache (first-order timing model); outputs: statistics + warmed coherency & tag states]
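Functional warming replays a functional trace through a cache's tag state, without modeling timing, so that a later detailed or sampled simulation starts from realistic contents. A minimal sketch, assuming a generic set-associative cache with LRU replacement and 64-byte lines; this is not the Piranha coherence protocol, which would also track per-line coherency states, and `TagArray` and `functional_warm` are hypothetical names.

```python
from collections import OrderedDict
from typing import Iterable, List


class TagArray:
    """Set-associative tag store with LRU replacement (no data, no timing)."""

    def __init__(self, num_sets: int, ways: int, line_bytes: int = 64) -> None:
        self.num_sets = num_sets
        self.ways = ways
        self.line_bytes = line_bytes
        # Each set maps tag -> True, ordered from least to most recently used.
        self.sets: List[OrderedDict] = [OrderedDict() for _ in range(num_sets)]
        self.hits = 0
        self.misses = 0

    def access(self, addr: int) -> bool:
        """Update tag state for one access; return True on a hit."""
        line = addr // self.line_bytes
        index = line % self.num_sets
        tag = line // self.num_sets
        s = self.sets[index]
        if tag in s:
            s.move_to_end(tag)        # mark most recently used
            self.hits += 1
            return True
        self.misses += 1
        if len(s) >= self.ways:
            s.popitem(last=False)     # evict the least recently used tag
        s[tag] = True
        return False


def functional_warm(cache: TagArray, addresses: Iterable[int]) -> TagArray:
    """Replay a functional address trace to warm the tag state."""
    for a in addresses:
        cache.access(a)
    return cache
```

After warming, `hits` and `misses` give first-order statistics, and the tag arrays hold the warmed state that a subsequent model can start from.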
What does the RTL look like?
• We use Bluespec SystemVerilog (a high-level, synthesizable HDL)
  – 4-8 week learning curve for typical HDL users
  – Once learned, easier to read and modify than conventional RTL
  – Requires the BSV compiler (free for academics)
  – A paper in MEMOCODE’09 describes BSV coding and validation of the core
• Sample code:

  2-stage ALU:
    rule split_ALU_pipeline (True);
      …
      // Stage 1 result is registered in ALU1; stage 2 reads the value
      // written in the previous cycle, so the ALU spans two pipeline stages.
      p1 = piperegs[DECODE];
      piperegs[ALU1] <= doALUStage1(p1, alu_ifc);
      p2 = piperegs[ALU1];
      piperegs[ALU2] <= doALUStage2(p2, alu_ifc);
      …
    endrule

  1-stage ALU:
    rule merged_ALU_pipeline (True);
      …
      // Stage 1 output feeds stage 2 combinationally in the same cycle;
      // only the final result is registered.
      p1 = piperegs[DECODE];
      p_tmp1 = doALUStage1(p1, alu_ifc);
      p_tmp2 = doALUStage2(p_tmp1, alu_ifc);
      piperegs[ALU] <= p_tmp2;
      …
    endrule
Other Simulator Features
• Changing core parameters
  – Number of CPU contexts
  – Cache sizes
  – Merge/split pipeline stages
  – Enable/disable modules for profiling and debugging
  – Clock frequency (tested at 10MHz-100MHz)
  – Set the optimal LUTRAM size (16 for V2P, 64 for V5)
  – Choose LUTRAMs or BRAMs for any CPU state
• System parameters
  – UDP or TCP/IP (for PFMON-to-FPGA communication)
  – Platform: XUPv5 or BEE2
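These parameters could be captured in a configuration object that fills in defaults and rejects unsupported combinations. This is a hypothetical sketch: the class and field names are illustrative, not the actual ProtoFlex build options. The value ranges (clock tested at 10-100MHz, LUTRAM depth 16 for V2P and 64 for V5, UDP or TCP, XUPv5 or BEE2) come from the slide, the defaults of 4 contexts and 64KB caches come from the XUPv5 reference design, and mapping BEE2 to Virtex-II Pro is an assumption.

```python
from dataclasses import dataclass
from enum import Enum


class Transport(Enum):
    UDP = "udp"
    TCP = "tcp"


# Assumed platform -> FPGA family map, and the slide's suggested LUTRAM depths.
PLATFORM_FAMILY = {"XUPv5": "V5", "BEE2": "V2P"}
OPTIMAL_LUTRAM_DEPTH = {"V2P": 16, "V5": 64}


@dataclass
class SimConfig:
    platform: str = "XUPv5"
    num_contexts: int = 4
    icache_kb: int = 64
    dcache_kb: int = 64
    clock_mhz: int = 100
    transport: Transport = Transport.UDP
    lutram_depth: int = 0  # 0 means "use the suggested depth for the platform"

    def __post_init__(self) -> None:
        if self.platform not in PLATFORM_FAMILY:
            raise ValueError(f"unsupported platform: {self.platform}")
        if self.num_contexts < 1:
            raise ValueError("need at least one CPU context")
        if self.lutram_depth == 0:
            self.lutram_depth = OPTIMAL_LUTRAM_DEPTH[PLATFORM_FAMILY[self.platform]]

    @property
    def clock_in_tested_range(self) -> bool:
        """True if the clock lies in the 10-100MHz range the slide reports as tested."""
        return 10 <= self.clock_mhz <= 100
```

Flagging an untested clock through a property, rather than raising, leaves room for experiments outside the validated range while still making them visible.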
Platform Release: XUPV5
• Why XUPv5?
  – Inexpensive (~$750) and easily accessible
  – Standard tool flows (EDK, ISE)
  – Reference design is portable to other platforms: just drop in our ‘pcores’
• Supporting other platforms
  – Future ports to BEE3 and the Xilinx Accelerated Computing Platform (ACP)
  – Planned for release in future updates
Required Equipment
[Figure: Linux PC running PFMON + Simics, connected over Ethernet to the FPGA board running BlueSPARC]
XUPv5 Overview
• Virtex-5 LX110T
• DDR2 memory: up to 2GB
• 1MB SRAM
• 1Gbps Ethernet
• 3Gbps SATA
• Serial port
Reference Design Block Diagram
[Figure: XUPv5 board with an LX110T FPGA containing BlueSPARC, a MicroBlaze, BRAMs, Ethernet, a serial port, the PLB, a Multi-Port Memory Controller (to DRAM), and an SRAM controller (to SRAM)]
BlueSPARC
• EDK IP core
  – Connects to the PLB and NPI
• Runs at 100MHz
• 4 CPU contexts
• 64KB I&D L1 caches
BlueSPARC
• 81% BRAM utilization
  – Core: 51% (76 of 148 BRAMs)
  – Rest of design: 30% (45 of 148 BRAMs)
Ethernet
• 4 MB/s bandwidth
• 350 µs round-trip latency
• Socket abstraction
  – Uses the lwIP RAW interface
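With a 350 µs round trip, small request/response messages are latency-bound. A simple stop-and-wait model shows where the link rate starts to dominate. It assumes each exchange costs one full RTT plus payload/bandwidth and takes 1 MB as 10^6 bytes; both are simplifications, and the function names are hypothetical.

```python
RTT_S = 350e-6          # slide: 350 us round trip
BANDWIDTH_BPS = 4e6     # slide: 4 MB/s (assumed 10**6 bytes per MB)


def transfer_time_s(payload_bytes: int, rtt_s: float = RTT_S, bw: float = BANDWIDTH_BPS) -> float:
    """Time for one request/response exchange that moves `payload_bytes`."""
    return rtt_s + payload_bytes / bw


def effective_throughput(payload_bytes: int) -> float:
    """Bytes/s achieved when each message waits for its reply before the next is sent."""
    return payload_bytes / transfer_time_s(payload_bytes)
```

The break-even payload, where latency and transfer time are equal, is 350 µs × 4 MB/s = 1400 bytes; at that size throughput is half the link rate, while a 64-byte message achieves under 0.2 MB/s. Batching several requests per message is the usual way to amortize the round trip.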
DDR2 Memory Controller
• 1.5GB/s peak bandwidth
• 115ns latency
• Multiple ports/interfaces
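Little's law (concurrency = throughput × latency) gives the amount of data that must be in flight to reach the quoted peak. A sketch, assuming 1 GB = 10^9 bytes and a request size you choose; the slide does not state the controller's transfer size, so 64 bytes is only an example, and the helper names are hypothetical.

```python
import math

PEAK_BW_BYTES_PER_S = 1.5e9   # slide: 1.5 GB/s peak (assumed 10**9 bytes per GB)
LATENCY_S = 115e-9            # slide: 115 ns


def bytes_in_flight(bw: float = PEAK_BW_BYTES_PER_S, latency: float = LATENCY_S) -> float:
    """Little's law: data in flight = throughput x latency."""
    return bw * latency


def outstanding_requests(request_bytes: int) -> int:
    """Minimum number of concurrent requests of `request_bytes` needed to reach peak."""
    return math.ceil(bytes_in_flight() / request_bytes)
```

1.5 GB/s × 115 ns ≈ 172.5 bytes, so at least three concurrent 64-byte requests are needed; a requester with only one 64-byte request outstanding at a time would reach about 64 B / 115 ns ≈ 0.56 GB/s.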
Linux PC
• Software requirements
  – SuSE Linux 10.1
  – CAD tools + licenses (Bluespec compiler, Xilinx ISE/EDK)
  – Simics 3.0.22
  – Hybrid simulation plug-in modules
  – ProtoFlex MONitor tool (PFMON)
Linux PC
• Runs PFMON (ProtoFlex MONitor)
  – Orchestrates communication between Simics and BlueSPARC
  – Provides a command-line interface to the simulator (like the Simics console)
• Runs Simics
  – Handles I/O, plus FPGA core and memory initialization
[Figure: Simics and PFMON on the Linux PC, connected to BlueSPARC]
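PFMON's orchestration role can be pictured as a loop: let the FPGA run, and when a CPU stops on an I/O operation, move that CPU's state to the software simulator, execute the operation there, and copy the state back. This is a simplified, hypothetical model: `FpgaCore`, `SoftwareSim`, their methods, and the stop reasons are illustrative interfaces, not the PFMON or Simics APIs, and the sketch ignores memory coherence, DMA, and interrupts.

```python
from dataclasses import dataclass
from typing import List, Protocol


@dataclass
class StopEvent:
    cpu: int       # which virtual CPU stopped
    reason: str    # "io" or "quantum_end" in this sketch


class FpgaCore(Protocol):
    def run(self, max_instrs: int) -> StopEvent: ...
    def get_cpu_state(self, cpu: int) -> dict: ...
    def set_cpu_state(self, cpu: int, state: dict) -> None: ...


class SoftwareSim(Protocol):
    def set_cpu_state(self, cpu: int, state: dict) -> None: ...
    def step(self, cpu: int) -> None: ...
    def get_cpu_state(self, cpu: int) -> dict: ...


def hybrid_run(fpga: FpgaCore, sim: SoftwareSim, quanta: int,
               quantum: int = 10_000) -> List[StopEvent]:
    """Run `quanta` FPGA execution slices, servicing I/O stops in software."""
    events: List[StopEvent] = []
    for _ in range(quanta):
        ev = fpga.run(quantum)
        events.append(ev)
        if ev.reason == "io":
            # Hand the stopped CPU to the software simulator for one instruction,
            # then return its updated architectural state to the FPGA.
            sim.set_cpu_state(ev.cpu, fpga.get_cpu_state(ev.cpu))
            sim.step(ev.cpu)
            fpga.set_cpu_state(ev.cpu, sim.get_cpu_state(ev.cpu))
    return events
```

Keeping the common case on the FPGA and migrating only the rare I/O instruction is what lets the software side stay simple while the FPGA supplies the throughput.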
From RTL to Running System
1. Bluespec code → Verilog code
   – Bluespec compiler
   – ~30 minutes
2. Verilog code → bitstream
   – Xilinx EDK
   – ~3 hours
3. Bitstream → working system
   – Stream the memory image over Ethernet
   – ~5 minutes (for a 512MB image)
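The turnaround implied by these three steps can be tallied directly. A small sketch using the slide's approximate times (the step names and helpers are hypothetical): any RTL change pays every step, whereas loading a new memory image reuses the existing bitstream. The second helper gives a bandwidth-only lower bound for step 3, assuming the 4 MB/s Ethernet figure from the reference design applies.

```python
from typing import Dict, Iterable

# Approximate per-step times from the slide, in minutes.
FLOW_MINUTES: Dict[str, int] = {
    "bluespec_to_verilog": 30,     # Bluespec compiler
    "verilog_to_bitstream": 180,   # Xilinx EDK (~3 hours)
    "stream_memory_image": 5,      # 512MB image over Ethernet
}


def turnaround_minutes(skip: Iterable[str] = ()) -> int:
    """Total flow time, omitting any steps whose outputs are reused."""
    skipped = set(skip)
    return sum(t for step, t in FLOW_MINUTES.items() if step not in skipped)


def min_stream_seconds(image_mb: float, link_mb_per_s: float = 4.0) -> float:
    """Bandwidth-only lower bound on streaming a memory image."""
    return image_mb / link_mb_per_s
```

`turnaround_minutes()` is 215 (about 3.5 hours), while `turnaround_minutes(skip={"bluespec_to_verilog", "verilog_to_bitstream"})` is 5. `min_stream_seconds(512)` is 128 seconds, so the quoted ~5 minutes sits above the raw link-rate bound.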