
SLIDE 1

CASE STUDY 4: CUSTOM DESIGNED MULTI-PROCESSOR SYSTEM

Active Cells: A Programming Model for Configurable Multicore Systems


SLIDE 2

Acknowledgements

Results originate in the project "Supercomputer in the Pocket" (J. Gutknecht, L. Liu, A. Morzov, P. Hunziker, A. Gokhberg, FF) in the Microsoft Innovation Cluster for Embedded Systems, funded by Microsoft Research (2009-2014). We are particularly indebted to Chuck Thacker, Niklaus Wirth, Timothée Martiel, Paul Reed and Florian Negele for consulting. Thanks to the Xilinx Academic Program for support. CRBM implementation and exercise 12 by Stephan Koster.

SLIDE 3

Vision

[Figure: a general purpose shared memory computer (cores, caches, bus, memory) contrasted with an application specific multicore network on chip, in which processes P are mapped onto dedicated cores and engines.]

SLIDE 4

Motivation: Multicore Systems Challenges

Cache Coherence
Shared Memory Communication Bottleneck
Thread Synchronization Overhead

Hard to predict performance of a program
Difficult to scale the design to massive multi-core architecture

SLIDE 5

Operating System Challenges

Processor Time Sharing
Interrupts
Context Switches
Thread Synchronisation
Memory Sharing
Inter-process: Paging
Intra-process, Inter-Thread: Monitors


SLIDE 6

Focus

Academia: Education

holistic design of computing systems
simplicity
consistency

Industry: High Performance Sensor Driven Medical IT

streaming applications: ultrasound, tomography, hemodynamics, etc.

SLIDE 7

Focus: Streaming Applications

Stream Parallelism: Pipelining
Task Parallelism: Parallel Execution
Data Parallelism: Vector Computing
Loop-Level Parallelism
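
As a rough illustration of stream parallelism, the Python sketch below chains two processing stages so that every sample passes through the whole pipeline one at a time. The stage names `scale` and `clip` and the sample values are invented for illustration. Python generators only model the stage structure here; they do not run the stages concurrently as hardware pipeline stages would.

```python
def scale(samples, gain):
    """Stage 1: multiply every sample by a constant gain."""
    for s in samples:
        yield s * gain

def clip(samples, limit):
    """Stage 2: limit every sample to the range [-limit, limit]."""
    for s in samples:
        yield max(-limit, min(limit, s))

# Connect the stages: each sample flows through scale, then clip.
pipeline = clip(scale(iter([1, -4, 3]), 2), 5)
result = list(pipeline)
```

In a hardware pipeline each stage would work on a different sample in the same clock cycle; the generator chain only shows how the stages compose.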

SLIDE 8

4.1. HARDWARE BUILDING BLOCKS TRM AND INTERCONNECTS


SLIDE 9

TRM: Tiny Register Machine*

Extremely simple processor on FPGA with Harvard architecture.
Two-stage pipelined
Each TRM contains
Arithmetic-logic unit (ALU) and a shifter.
32-bit operands and results stored in a bank of 2*8 registers.
local data memory: d*512 words of 32 bits.
local program memory: i*1024 instructions with 18 bits.
7 general purpose registers
Register H for storing the high 32 bits of a product, and 4 condition registers C, N, V, Z.
No caches
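
The memory figures above can be turned into concrete sizes with a little arithmetic. The sketch below assumes that d and i are small positive integers counting memory blocks; the function names are hypothetical and not part of any TRM tooling.

```python
WORD_BITS = 32    # data word width, as stated on the slide
INSTR_BITS = 18   # instruction width, as stated on the slide

def data_memory_bytes(d):
    """Bytes of local data memory for d blocks of 512 words of 32 bits."""
    return d * 512 * WORD_BITS // 8

def program_memory_bits(i):
    """Bits of local program memory for i blocks of 1024 instructions of 18 bits."""
    return i * 1024 * INSTR_BITS

one_data_block = data_memory_bytes(1)        # 2048 bytes
one_program_block = program_memory_bits(1)   # 18432 bits
```

One program memory block of 1024 x 18 bits is 18 Kbit, which matches the 18 Kbit block RAM size common on Xilinx devices; whether that is the reason for the 18-bit instruction width is not stated on the slide.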


* Invented and implemented by Dr. Ling Liu and Prof. Niklaus Wirth

SLIDE 10

TRM Machine Language

Machine language: binary representation of instructions
18-bit instructions
Three instruction types:
Type a: arithmetical and logical operations
Type b: load and store instructions
Type c: branch instructions (for jumping)


from Lectures on Reconfigurable Computing, Dr. Ling Liu, ETH Zürich

SLIDE 11

Encoding Overview

Register Operations
Load and Store
Conditional Branches
Special Instructions
Branch and Link

[Figure: bit-field layouts of the 18-bit instruction formats for register operations, load and store, conditional branches, special instructions, and branch and link. Field boundaries are marked at bits 9, 10, 11, 13, 14 and 17; the fields shown include op, imm, Rs, Rd, VRs, VRd, cond and off.]

Notes from the diagrams:
Register operations: imm is zero extended to 32 bits
Load and store: off is zero extended
Conditional branches: off is sign extended
Branch and link: off is a 14-bit offset
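
The encoding notes distinguish zero-extended from sign-extended fields. The Python sketch below shows the difference and applies it to a conditional branch. The branch layout used here (opcode 1110 in bits 17..14, cond in bits 13..10, offset in bits 9..0) is an assumption read off the damaged slide diagram, and all function names are hypothetical.

```python
def zero_extend(value, bits):
    """Interpret the low `bits` bits of value as an unsigned number."""
    return value & ((1 << bits) - 1)

def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    sign_bit = 1 << (bits - 1)
    return (value ^ sign_bit) - sign_bit

def decode_conditional_branch(instr):
    """Split an 18-bit word into (opcode, cond, offset).

    ASSUMED layout: opcode in bits 17..14, cond in bits 13..10,
    sign-extended offset in bits 9..0.
    """
    opcode = (instr >> 14) & 0xF
    cond = (instr >> 10) & 0xF
    offset = sign_extend(instr & 0x3FF, 10)
    return opcode, cond, offset

# Example word: opcode 1110, condition 0011, offset field 0x3FE.
word = (0b1110 << 14) | (0b0011 << 10) | 0x3FE
decoded = decode_conditional_branch(word)
```

The same 10-bit field 0x3FE reads as 1022 when zero extended and as -2 when sign extended, which is why backward branches need the signed interpretation.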

SLIDE 12

TRM architecture


Figure from: Niklaus Wirth, Experiments in Computer System Design, Technical Report, August 2010 http://www.inf.ethz.ch/personal/wirth/Articles/FPGA-relatedWork/ComputerSystemDesign.pdf

SLIDE 13

Variants of TRM

FTRM
includes floating point unit
VTRM (Master Thesis Dan Tecu)
includes a vector processing unit
supports 8 x 8-word registers
available with / without FP unit
TRM with software-configurable instruction width (Master Thesis Stefan Koster, 2015)

SLIDE 14

Initial Experiments

[Figure: two initial experiments with twelve TRMs each. TRM12 Bus: processor cores C0 to C11, each with a network controller N0 to N11, arranged in four columns with inbound and outbound arbiters, plus an RS232 transmitter/receiver, timer, LCD, LEDs and DDR2. TRM12 Ring: nodes node0 to node11 (TRM0 to TRM11), each attached to the ring through a TRMRing interface, with RS232 attached to the ring.]

SLIDE 15

4.2. HARDWARE SOFTWARE CODESIGN


SLIDE 16

Software / Hardware Co-design Vision: Custom System on Button Push

From system design as high-level program code to electronic circuits:

Computing model
Programming Language
Compiler, Synthesizer, Hardware Library, Simulator
Programmable Hardware (FPGA)

SLIDE 17

Traditional HW/SW co-design for embedded systems

System specification, HW/SW partitioning
Program microcontroller in C/C++
Program system-specific hardware in HDL
Compilation, synthesis
Microcontroller + machine code + specific hardware (e.g. DSP)

Goal: Active Cells approach for embedded systems development

One Program
One Toolchain
System on FPGA

SLIDE 18

Active Cells Computing Model

On-chip distributed system

Cell

Scope and environment for a running isolated process.
Integrated control thread(s)
Provides communication ports

Net

Network of communicating cells
Cells connected via channels (FIFOs)


Inspired by

Kahn Process Networks
Dataflow Programming
CSP (e.g. Google's Go)
Actor Model (e.g. Erlang)

SLIDE 19

Software → Hardware Map

[Figure: cells connected by channels in software; in hardware each channel becomes a FIFO.]

SLIDE 20

Consequences of the approach

No global memory
No processor sharing
No peculiarities of a specific processor
No predefined topology (NoC)
No interrupts

→ No operating system

SLIDE 21

Cell

type BernoulliSampler* = cell (probIn: port in; valOut: port out);
var r: Random.Generator; p: real;
begin
  new(r);
  loop
    p << probIn;
    valOut << r.Bernoulli(p);
  end
end BernoulliSampler;

Annotations: non-typed communication ports; blocking receive; asynchronous send.

[Figure: Bernoulli Sampler cell with input port probIn and output port valOut]

Cell Activity
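
The behaviour of such a cell can be imitated on an ordinary computer with a thread and two queues. This is only an analogy under stated assumptions: Python's `queue.Queue` stands in for a FIFO channel, an unbounded queue stands in for asynchronous send, and the `None` sentinel used to stop the thread does not exist in the original cell, which loops forever.

```python
import queue
import random
import threading

def bernoulli_sampler(prob_in, val_out, rng):
    """Cell activity: receive a probability, send one Bernoulli sample."""
    while True:
        p = prob_in.get()              # blocking receive
        if p is None:                  # stop sentinel (illustration only)
            break
        val_out.put(1 if rng.random() < p else 0)   # send without waiting

prob_in = queue.Queue()    # plays the role of the channel feeding probIn
val_out = queue.Queue()    # plays the role of the channel fed by valOut
cell = threading.Thread(
    target=bernoulli_sampler, args=(prob_in, val_out, random.Random(0))
)
cell.start()

for p in (0.0, 1.0, 0.0):  # probabilities chosen so the outcome is deterministic
    prob_in.put(p)
prob_in.put(None)
cell.join()

samples = [val_out.get() for _ in range(3)]
```

The sampler owns its random generator and shares no memory with its environment; the queues are its only contact with the outside, which is the isolation property the cell model relies on.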

SLIDE 22

Properties

type Controller = cell {Processor=TRM, FPU, DataMemory=2048, BitWidth=18}
    (in: port in (64); result: port out);
  ...
begin
  (* ... controller action ... *)
end Controller;
....

Port width: the (64) in "in: port in (64)".

Properties can influence both the generation of hardware and the generation of software code.

[Figure: Controller (TRM) with attached FPU]

SLIDE 23

Configurable Processor on PL

Tiny Register Machine, configurable options:

FPU: off, on
Vector Unit: off, on
Multiplier: off, on
Instruction Width: 14, 16, 18, 20, 22
Program Memory: 0k, 1k, 2k, 3k, 4k
Data Memory: 0.5k, 1k, 1.5k

SLIDE 24

Engine

type Merger = cell {Engine, inputs=1} (ind: array inputs of port in; outd: port out);
var data, i: longint;
begin
  loop
    for i := 0 to len(ind)-1 do
      data << ind[i];
      outd << data;
    end
  end
end Merger;

Engines are prebuilt components instantiated as electronic circuits on the target hardware.

[Figure: Merger engine with several input ports and one output port]
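
The Merger visits its inputs in a fixed round-robin order and forwards one item per input per round. A minimal Python sketch of that policy follows; the function name is hypothetical. Unlike the hardware engine, which would simply block on an empty input, this sketch stops as soon as any input runs dry.

```python
def merge_round_robin(inputs):
    """Yield one item from each input in turn until any input is exhausted."""
    iterators = [iter(source) for source in inputs]
    while True:
        for it in iterators:
            try:
                yield next(it)     # corresponds to: data << ind[i]; outd << data
            except StopIteration:
                return             # the hardware engine would block here instead

merged = list(merge_round_robin([[1, 2], [10, 20]]))
```

Because the visiting order is fixed, a slow input stalls all the others; that is acceptable when all inputs deliver data at the same rate, as in a regular streaming pipeline.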

SLIDE 25

Unit of Deployment: (Terminal) Cellnet

LearnTest = cellnet;
var learner: CRBMNet.CRBMLearner; reader: MLUtil.imageReader; ims0, ims1: MLUtil.imshow; …
begin
  new(learner);
  new(reader);
  new(ims0{name='v0debug',posx=0,posy=100});
  new(ims1{name='v1debug',posx=300,posy=100});
  …
  reader.imageOUT >> learner.imgIN;
  learner.v0DebugOUT >> ims0.imageIN;
  learner.v1DebugOUT >> ims1.imageIN;
  …
end LearnTest;

Annotations: dynamic construction (the new statements); connection (the >> statements).

[Figure: imageReader connected to CRBMLearner, which is connected to two imShow cells]

SLIDE 26

Hierarchical Composition

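[Figure: hierarchical composition in three levels. Top level: imageReader, CRBMLearner and two imShow cells. CRBMLearner expands into DownNet, UpNet, Sample, GetGradients and several Delay cells, with ports img in, debug and result. UpNet expands into split and UpStep, operating on visible units, hidden units and kernels to compute P(h|v).]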

SLIDE 27

Hierarchic Composition: non-terminal Cellnet

UpNet = cellnet {vr=28,vc=28,kr=5,kc=5,k=9,c=2,name='upstep'}
    (pvIN, kerIN, bIN: port in; phOUT: array k of port out; pvSideOUT: port out);
var i, hr, hc: longint; upstep: CRBMUpstepCell; vSplit: MLFunctions.SplitterCell;
begin
  hr := vr-kr+1; hc := vc-kc+1;
  new(vSplit {dataSize=vr*vc,numOut=2});
  new(upstep {vr=vr,vc=vc,kr=kr,kc=kc,k=k,c=c});
  pvIN >> vSplit.dataIN;
  vSplit.dataOUT[0] >> pvSideOUT;
  vSplit.dataOUT[1] >> upstep.vIN;
  kerIN >> upstep.kerIN;
  bIN >> upstep.bIN;
  for i := 0 to k-1 do upstep.phOUT[i] >> phOUT[i]; end;
end UpNet;

Annotations: ports and properties (the cellnet header); port delegation (connections between the net's own ports and the ports of its inner cells).

[Figure: UpNet with inputs img in, kernIn and biasIn, inner cells split and UpStep, and outputs pvOut and phOut]
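
The assignments `hr := vr-kr+1` and `hc := vc-kc+1` in the UpNet code compute the size of the hidden layer from the sizes of the visible layer and the kernel. The Python sketch below works through the numbers from the cellnet properties. It assumes that vr and vc are the visible rows and columns, kr and kc the kernel rows and columns, and k the number of kernels; the slide does not spell these meanings out.

```python
def valid_output_size(visible, kernel):
    """Positions at which a kernel fits fully inside the visible extent."""
    return visible - kernel + 1

vr, vc, kr, kc, k = 28, 28, 5, 5, 9   # property values from the cellnet header

hr = valid_output_size(vr, kr)        # hidden rows
hc = valid_output_size(vc, kc)        # hidden columns
visible_values = vr * vc              # dataSize passed to the splitter
hidden_values = k * hr * hc           # total hidden units over all k kernels
```

With a 28 x 28 visible layer and 5 x 5 kernels this gives a 24 x 24 hidden map per kernel.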

SLIDE 28

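Software → Hardware Map

[Figure: mapping onto a Zynq device with processing system (PS, ARM cores) and programmable logic (PL, TRMs and engines). Legend: cell → thread or softcore; engine → hw engine; port → AXI4 FIFO on an AXI4 Stream interconnect.]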

SLIDE 29

Hybrid Compilation

Code body         Role               Compilation method
Cell (Softcore)   Program logic      Software Compilation
Cell (Engine)     Computation unit   Hardware Generation
Cell Net          Architecture       Hardware Compilation

cellnet N;
type A = cell (pi: port in; po: port out);
var x: integer;
begin
  … pi ? x; … po ! x; …
end A;
var a, b: A;
begin
  … connect(a.po, b.pi)
end N.

SLIDE 30

Implementation Alternatives (1)

[Figure: source → compiler frontend → one backend per target (Zybo Backend, Spartan3 Backend, Spartan3e Backend, …) → bitstream]

+ simple
- redundant
- not flexible
- hard to extend

SLIDE 31

Implementation Alternatives (2)

[Figure: source → compiler frontend → cellnet interpreter backend, drawing on a Hardware Library and a Component Library → bitstream]

+ extensible
+ not redundant
- not simple
- static configuration limits flexibility

SLIDE 32

Implementation Alternatives (3)

[Figure: source → compiler frontend → executable; the executable runs on the hdl runtime together with the Hardware Library executables and Component Library executables → bitstream]

+ extensible
+ not redundant
+ simpler
+ simulation becomes execution

SLIDE 33

Frontend

[Figure: Source Code (Module) → Compiler Frontend → Intermediate Code (Module) → Compiler Backend → Executable. The executable runs, together with software libraries, on one of three runtimes: Simulation Runtime, Visualization Runtime, or HDL Backend Runtime.]

SLIDE 34

Backend

[Figure: inside the HDL Backend Runtime. The executable performs a dependency analysis against the HW Library and produces a dependency graph. An HDL code generator emits HDL code, which the FPGA vendor implementation tools turn into a bitstream. An instruction code generator feeds intermediate code (modules) and runtime libraries to the compiler backend, producing one executable (kernel plus code) per TRM and ARM core. Bitstream and executables are then deployed.]

SLIDE 35

Extensibility: Defining components and platforms

Software HLL Code

type Gpo = cell{Engine,DataWidth=32,InitState="0H"} (input: port in);

Hardware HDL Code

module Gpo #( parameter integer DW = 8 … ) ( input aclk, input aresetn, input [DW-1:0] … );

Component Specification

Build Command

AcHdlBackend.Build

-target="ZyboBoard"
p="Vivado"

CRBM.TestCellnet ~

Platform Specification

Hardware Platform and Tools

Hardware Types Platform Instances Vendor specific tools

SLIDE 36

Component Specification

module Gpo;
import Hdl := AcHdlBackend;
var c: Hdl.Engine;
begin
  (* identification and description *)
  new(c,"Gpo","Gpo");
  c.SetDescription("General Purpose Output … ");
  (* supported devices *)
  c.SupportedOn("*"); (* portable *)
  (* parameters *)
  c.NewProperty("DataWidth","DW",Hdl.NewInteger(32),Hdl.IntegerPropertyRangeCheck(1,Hdl.MaxInteger));
  c.NewProperty("InitState","InitState",Hdl.NewBinaryValue("0H"),nil);
  (* ports *)
  c.SetMainClockInput("aclk"); (* main component's clock *)
  c.SetMainResetInput("aresetn",Hdl.ActiveLow); (* active-low reset *)
  c.NewAxisPort("input","inp",Hdl.In,8);
  c.NewExternalHdlPort("gpo","gpo",Hdl.Out,8);
  (* dependencies *)
  c.NewDependency("Gpo.v",true,false);
  (* configuration actions *)
  c.AddPostParamSetter(Hdl.SetPortWidthFromProperty("inp","DW"));
  c.AddPostParamSetter(Hdl.SetPortWidthFromProperty("gpo","DW"));
  Hdl.hwLibrary.AddComponent(c);
end Gpo.

Highlighted on the slide: c.NewExternalHdlPort("gpo","gpo",Hdl.Out,8);

SLIDE 37

Generic Peer-to-Peer Communication Interface

Use of AXI4 Stream interconnect standard from ARM
Generic, flexible
Non-redundant

[Figure: a sender port and a receiver port, both clocked by clk, connected through a multiplexing network (Data out, Select, Data in). TVALID: data valid. TREADY: ready to process.]

The master must assert tvalid and keep it asserted until the transfer has taken place.
The slave may wait for the master's tvalid before asserting tready.
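
The handshake rule can be made concrete with a small cycle-by-cycle model: a transfer happens in exactly those clock cycles in which tvalid and tready are both high. The Python sketch below is a simplified illustration, not an implementation of the full AXI4 Stream standard; the function name and the example ready pattern are invented.

```python
def simulate_stream(data, ready_pattern):
    """Cycle-by-cycle model of one sender port and one receiver port.

    data: items the sender wants to transfer, in order.
    ready_pattern: the receiver's tready, one bool per clock cycle.
    Returns a list of (cycle, item) pairs, one per completed transfer.
    """
    transfers = []
    index = 0
    for cycle, tready in enumerate(ready_pattern):
        tvalid = index < len(data)     # sender keeps tvalid high while data is pending
        if tvalid and tready:          # handshake: both signals high in the same cycle
            transfers.append((cycle, data[index]))
            index += 1
    return transfers

log = simulate_stream(["a", "b"], [False, True, False, False, True, True])
```

In this run the item "a" waits during cycle 0 and is transferred in cycle 1, and "b" waits during cycles 2 and 3 and is transferred in cycle 4. The sender never withdraws tvalid while it waits, which is what the first rule above demands.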

SLIDE 38

Target Platform Specification

module Basys2Board;
import Hdl := AcHdlBackend, AcXilinx;
var t: Hdl.TargetDevice; pldPart: AcXilinx.PldPart; ioSetup: Hdl.IoSetup; pin: Hdl.IoPin;
begin
  (* FPGA part *)
  new(pldPart,"XC3S100E-4CP132");
  pldPart.SetJtagChainIndex(0);
  new(t,"Basys2Board",pldPart);
  (* system signals *)
  new(pin,Hdl.In,"B8","LVCMOS33");
  t.NewExternalClock(pin,50000000,50,0); (* ExternalClock0 *)
  t.SetSystemClock(t.clocks.GetClockByName("ExternalClock0"),1,1);
  new(pin,Hdl.In,"G12","LVCMOS33");
  t.SetSystemReset(pin,true);
  (* mapping of external ports *)
  new(ioSetup,"Gpo_0");
  ioSetup.NewIoPort("gpo",Hdl.Out,"U16,E19,U19,V19","LVCMOS33");
  t.AddIoSetup(ioSetup);
  Hdl.hwLibrary.AddTarget(t);
end Basys2Board.

Highlighted on the slide: ioSetup.NewIoPort("gpo",Hdl.Out,"U16,E19,U19,V19","LVCMOS33");

SLIDE 39

Case Study 1: ECG

Focus: Resources and Power

Real-time ECG Monitor

[Figure, processing stages: Signal input, Wave proc_1 to Wave proc_8, QRS detect, HRV analysis, Disease classifier; ECG bitstream in, stream out.]

[Figure, implementation: TRM1 to TRM12 connected by FIFO1 to FIFO34, with UART, CF and LCD controllers, on a Virtex-5 LX50T FPGA (Xilinx ML505 board); external RS232, CF, LCD and ECG sensor.]

SLIDE 40

Resources

ECG Monitor*

#TRMs   #LUTs         #BRAMs     #DSPs      TRM load
12      13859 (48%)   52 (86%)   12 (25%)   <5% @116 MHz

Maximum number of TRMs in communication chain

FPGA       #TRMs   #LUTs         #BRAMs      #DSPs
Virtex-5   30      27692 (96%)   60 (100%)   30 (62%)
Virtex 6   500

*8 physical channels @ 500 Hz sampling frequency implemented on Virtex 5

SLIDE 41

Comparative Power Usage

Preconfigured FPGA (#TRMs, IM/DM, I/O, Interconnect fixed) versus fully configurable FPGA (Active Cells)

System                    Static Power (W)   Dynamic Power (W)
Preconfigured ("TRM12")   3.44               0.59
Dynamically configured    0.5                0.58

86% saving!
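
The quoted saving can be checked against the power figures. The Python sketch below assumes that the 86% refers to static power, since the dynamic power is nearly unchanged; the helper name `saving` is hypothetical.

```python
# Power figures in watts, copied from the comparison on this slide.
static_pre, dynamic_pre = 3.44, 0.59    # preconfigured "TRM12"
static_dyn, dynamic_dyn = 0.5, 0.58     # dynamically configured (Active Cells)

def saving(before, after):
    """Relative reduction from before to after, as a fraction of before."""
    return (before - after) / before

static_saving = saving(static_pre, static_dyn)
total_saving = saving(static_pre + dynamic_pre, static_dyn + dynamic_dyn)
```

With these values the static saving comes out at roughly 85.5%, close to the quoted 86% (the small gap could stem from rounding of the listed figures), and the saving in total power is roughly 73%.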

[Figure: Core 0 to Core 11 of the preconfigured system, with the ECG stages (Signal input, Wave proc_1 to Wave proc_8, QRS detect, HRV analysis, Disease classifier) mapped onto them.]

SLIDE 42

Case Study: Non-Invasive Continuous Blood Pressure Monitor

Focus: Development Cycle Time

A2 Host OS with GUI on ARM
Sensor control and medical algorithms on Zynq PL
Sensors and motors on bracelet

SLIDE 43

Medical Monitor Network On Chip


Dominated by TRM processors. Feedback driven. Not performance critical.

SLIDE 44

Development Cycle Times

Step                        Medical Monitor   OCT (full)
Software Compilation        4 s               2 s
Hardware Implementation     20 min            20 min
Deployment                  11 s              16 s

Stream Patching (all): 2 min
Stream Patching (typical): 12 s

Frequency annotations on the slide: sporadic, often

SLIDE 45

Case Study 3: Optical Coherence Tomography

Focus: Performance

z-Axis Processing: from A(λi) = A(1/fi) to f(z)

1. Non-uniform sampling: A(λi) → A(fi)
2. Dispersion compensation
3. (Inverse) FFT

… for many lines x in a row (2d)
… and many rows y in a column (3d)

[Figure: scan volume with axes x, y and z]

SLIDE 46

A component of OCT image processing

Dispersion Compensation

Dominated by Engines. Dataflow driven.

SLIDE 47

Case Study: ANN

[Figure: a perceptron built from engines in the programmable logic. InnerProdFlt multiplies inputs x0, x1, x2 with y0, y1, y2 and sums the products in an adder tree. Sigmoid combines Frac, LUT and Fix blocks. Together with inputs x, weights w0, w1, w2 and bias b they form a Perceptron with output z. The ARM SoC feeds ANN Layer 1 (P0 to P2, R0 to R2), followed by more layers.]

SLIDE 48

Performance and Resource Usage

Medical Monitor
  Architecture: Spartan 6 XC6SLX75
  Resources: 28% Slice LUTs, 4% Slice Registers, 80% BRAMs, 24% DSPs
  Clock Rate: 58 MHz
  Data Bandwidth: 1.25 Mbit/s (in), 23 kB/s (out)
  Power: ~2 W

Simple OCT
  Architecture: Zynq 7000 XC7Z020
  Resources: 11% Slice LUTs, 6% Slice Registers, 7% BRAMs, 15% DSPs, 1 ARM Cortex A9
  Clock Rate: 118 MHz
  Data Bandwidth: 236 MWords/s (in), 118 MWords/s (out)
  Performance: 8.3 GFPOps*, up to 32 GFPops**
  Power: ~5 W

OCT
  Architecture: Zynq 7000 XC7Z020
  Resources: 17% Slice LUTs, 8% Slice Registers, 22% BRAMs, 31% DSPs, 1 ARM Cortex A9
  Clock Rate: 50 MHz
  Data Bandwidth: 50 MWords/s (in), 50 MWords/s (out)
  Performance: 4.3 GFPOps*
  Power: ~5 W

Perceptron
  Architecture: Zynq 7000 XC7Z010
  Resources: 83% Slice LUTs, 67% Slice Registers, 13.33% BRAMs, 21% DSPs, 1 ARM Cortex A9
  Clock Rate: 147 MHz
  Data Bandwidth: 9.6 GBits/s (in), 9.6 GBits/s (out)
  Performance: 4.9 GFlops
  Power: 2.1 W

** Fixed point operations, 32bit
* when instantiated 4 times

SLIDE 49

Conclusion

ActiveCells: Computing model and tool-chain for configurable computing

Configurable interconnect → Simple Computing, Power Saving
Embedding of task engines → High Performance
Hybrid compilation → Quick Development
Backend execution → Eased Flexibility and Extensibility

SLIDE 50

Mapping to Hardware – Simple Example

IO* = CELL (input: PORT IN; output: PORT OUT; buttons: PORT IN; leds: PORT OUT);
BEGIN ... END IO;

Controller* = CELL (input: PORT IN; output: PORT OUT);
BEGIN ... END Controller;

VAR controller: Controller; io: IO;
  gpo{DataWidth=8}: Engines.Gpo; gpi{DataWidth=11}: Engines.Gpi;
BEGIN
  NEW(controller); NEW(io); NEW(gpi); NEW(gpo);
  CONNECT(controller.output, io.input, 32);
  CONNECT(io.output, controller.input, 32);
  CONNECT(gpi.output, io.buttons);
  CONNECT(io.leds, gpo.input);
END Simple.

[Figure: Controller TRM and IO TRM connected in both directions through FIFOs; gpi feeds the IO TRM and the IO TRM feeds gpo]

→ Blackboard / Code inspection

SLIDE 51

Spartan3 Board

Resources
4-input LUTs: 3840
Flip Flops: 3840
BRAMs: 12
DSPs

SLIDE 52

Resource Usage Scenarios

Scenario: 1 TRM

CELLNET Game;
IMPORT LED;
TYPE IO* = CELL {DataMemorySize(512), CodeMemorySize(1024), LED} (in: PORT IN);
BEGIN END IO;
VAR io: IO;
BEGIN NEW(io);
END Game.

Resources: 4-input LUTs 27%, Flip Flops 5%, BRAMs 16%

Scenario: TRM → TRM

VAR io: IO; trm: TRM;
BEGIN NEW(io); NEW(trm); CONNECT(trm.out, io.in);
END Game.

Resources: 4-input LUTs 58%, Flip Flops 9%, BRAMs 33%

→ TRM ≈ 21-27% (LUTs), 16% (BRAMs); Fifo ≈ 6% (LUTs)

Scenario: TRM → TRM → TRM (cf. next page)

Resources: 4-input LUTs 86%, Flip Flops 18%, BRAMs 75%

SLIDE 53

Resource Usage (Lab)

controller: Controller; ioTRM: IO; digitsDriver: DigitsDriver; digits: Engines.LEDDigits;
gpo{DataWidth=8}: Engines.Gpo; gpi{DataWidth=11}: Engines.Gpi;
BEGIN
  NEW(controller); NEW(ioTRM); NEW(digitsDriver); NEW(digits); NEW(gpi); NEW(gpo);
  CONNECT(controller.cmdOUT, ioTRM.cmdIN, 10);
  CONNECT(ioTRM.cmdOUT, controller.cmdIN, 10);
  CONNECT(gpi.output, ioTRM.ButtonsIN);
  CONNECT(ioTRM.LEDOUT, gpo.input);
  CONNECT(ioTRM.digitsOUT, digitsDriver.cmdIN, 10);
  CONNECT(digitsDriver.ledOUT, digits.input);

3 TRMs
3 FIFOs

Resources: 4-input LUTs 86%, Flip Flops 18%, BRAMs 75%