Active Cells: A Programming Model for Configurable Multicore Systems. Case Study 4: Custom Designed Multi-Processor System


slide-1
SLIDE 1

CASE STUDY 4: CUSTOM DESIGNED MULTI-PROCESSOR SYSTEM

Active Cells: A Programming Model for Configurable Multicore Systems


slide-2
SLIDE 2

Acknowledgements

Results originate in the project "Supercomputer in the Pocket" (J. Gutknecht, L. Liu, A. Morzov, P. Hunziker, A. Gokhberg, FF) in the Microsoft Innovation Cluster for Embedded Systems, funded by Microsoft Research (2009-2014).

We are particularly indebted to Chuck Thacker, Niklaus Wirth, Timothée Martiel, Paul Reed and Florian Negele for consulting. Thanks to the Xilinx Academic Program for support. CRBM implementation and exercise 12 by Stephan Koster.

slide-3
SLIDE 3

Vision

[Figure: from a General Purpose Shared Memory Computer (cores with caches, connected by a bus to shared memory) to an Application Specific Multicore Network On Chip (processes P mapped to individual cores and engines)]

slide-4
SLIDE 4

Motivation: Multicore Systems Challenges

  • Cache Coherence
  • Shared Memory Communication Bottleneck
  • Thread Synchronization Overhead

Hard to predict the performance of a program.

Difficult to scale the design to a massive multi-core architecture.

slide-5
SLIDE 5

Operating System Challenges

  • Processor Time Sharing
  • Interrupts
  • Context Switches
  • Thread Synchronisation
  • Memory Sharing
  • Inter-process: Paging
  • Intra-process, Inter-Thread: Monitors


slide-6
SLIDE 6

Focus

Academia: Education

  • holistic design of computing systems
  • simplicity
  • consistency

Industry: High Performance Sensor Driven Medical IT

  • streaming applications: ultrasound, tomography, hemodynamics, etc.

slide-7
SLIDE 7

Focus: Streaming Applications

  • Stream parallelism: pipelining
  • Task parallelism: parallel execution
  • Data parallelism: vector computing
  • Loop-level parallelism

slide-8
SLIDE 8

4.1. HARDWARE BUILDING BLOCKS TRM AND INTERCONNECTS


slide-9
SLIDE 9

TRM: Tiny Register Machine*

  • Extremely simple processor on FPGA with Harvard architecture.
  • Two-stage pipelined
  • Each TRM contains
  • Arithmetic-logic unit (ALU) and a shifter.
  • 32-bit operands and results stored in a bank of 2*8 registers.
  • local data memory: d*512 words of 32 bits.
  • local program memory: i*1024 instructions with 18 bits.
  • 7 general purpose registers
  • Register H for storing the high 32 bits of a product, and 4 condition registers C, N, V, Z.
  • No caches

* Invented and implemented by Dr. Ling Liu and Prof. Niklaus Wirth

slide-10
SLIDE 10

TRM Machine Language

  • Machine language: binary representation of instructions
  • 18-bit instructions
  • Three instruction types:
  • Type a: arithmetical and logical operations
  • Type b: load and store instructions
  • Type c: branch instructions (for jumping)


from Lectures on Reconfigurable Computing, Dr. Ling Liu, ETH Zürich

slide-11
SLIDE 11

Encoding Overview

  • Register Operations
  • Load and Store
  • Conditional Branches
  • Special Instructions
  • Branch and Link

[Figure: bit-field layouts of the 18-bit instruction formats, with fields op, Rd, Rs, imm, off and cond, and vector registers VRs and VRd in the vector variants]

Notes from the figure:

  • Register operations: imm is zero extended to 32 bits
  • Load and store: off is zero extended
  • Conditional branches (cond field, opcode bits 1110): off is sign extended
  • Branch and link (opcode bits 1111): off is a 14-bit offset
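
The difference between zero extension and sign extension of an instruction field can be sketched in a few lines of Python. This is an illustration only, not TRM code: the helper names are hypothetical, and the 14-bit width is used here simply because it is the one field width the slide states.

```python
def zero_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as an unsigned number."""
    return value & ((1 << bits) - 1)


def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    sign_bit = 1 << (bits - 1)
    return value - (1 << bits) if value & sign_bit else value


def to_word32(value: int) -> int:
    """Show a (possibly negative) Python int as a 32-bit register pattern."""
    return value & 0xFFFFFFFF


# A 14-bit field with all bits set (hypothetical example value).
raw = 0x3FFF
print(hex(to_word32(zero_extend(raw, 14))))  # field read as unsigned
print(hex(to_word32(sign_extend(raw, 14))))  # field read as signed
```

Zero extension suits immediates and forward-only memory offsets, while sign extension lets a branch offset point backwards as well as forwards.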
slide-12
SLIDE 12

TRM architecture


Figure from: Niklaus Wirth, Experiments in Computer System Design, Technical Report, August 2010 http://www.inf.ethz.ch/personal/wirth/Articles/FPGA-relatedWork/ComputerSystemDesign.pdf

slide-13
SLIDE 13

Variants of TRM

  • FTRM
      • includes floating point unit
  • VTRM (Master's thesis, Dan Tecu)
      • includes a vector processing unit
      • supports 8 x 8-word registers
      • available with / without FP unit
  • TRM with software-configurable instruction width (Master's thesis, Stefan Koster, 2015)

slide-14
SLIDE 14

Initial Experiments

[Figure: two initial 12-core designs, TRM12 Bus and TRM12 Ring.
TRM12 Bus: processor cores C0 to C11, each with a network controller N0 to N11, arranged in four columns (Column 0 to Column 3, H0 to H3), each column with an inbound arbiter and an outbound arbiter; peripherals: RS232 transmitter/receiver (RS232TR), Timer, LCD, LEDs, DDR2.
TRM12 Ring: node0 to node11, each consisting of a TRM (TRM0 to TRM11) and a TRMRing interface, with RS232 attached to the ring.]

slide-15
SLIDE 15

4.2. HARDWARE SOFTWARE CODESIGN


slide-16
SLIDE 16

Software / Hardware Co-design Vision: Custom System on Button Push

System design as high-level program code → electronic circuits

  • Computing model
  • Programming language
  • Compiler, synthesizer, hardware library, simulator
  • Programmable hardware (FPGA)

slide-17
SLIDE 17

Traditional HW/SW co-design for embedded systems

  • System specification, HW/SW partitioning
  • Program microcontroller in C/C++ → compilation
  • Program system-specific hardware in HDL → synthesis
  • Result: microcontroller + machine code + specific hardware (e.g. DSP)

Goal: Active Cells approach for embedded systems development

  • One program
  • One toolchain
  • System on FPGA

slide-18
SLIDE 18

Active Cells Computing Model

On-chip distributed system

Cell

  • Scope and environment for a running isolated process.
  • Integrated control thread(s)
  • Provides communication ports

Net

  • Network of communicating cells
  • Cells connected via channels (FIFOs)

Inspired by

  • Kahn Process Networks
  • Dataflow Programming
  • CSP (e.g. Google's Go)
  • Actor Model (e.g. Erlang)
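
The cell and channel model can be imitated on a desktop with threads and bounded queues. The sketch below is an assumption-laden analogy in Python, not Active Cells code: each "cell" is a thread with private state, each "channel" is a `queue.Queue` with a fixed depth, and the `DONE` marker and all function names are invented for the example.

```python
import queue
import threading

DONE = object()  # end-of-stream marker; an assumption of this sketch


def source(out_ch, values):
    """Cell that emits a fixed list of values, then signals the end."""
    for v in values:
        out_ch.put(v)  # send; blocks only while the FIFO is full
    out_ch.put(DONE)


def scale(in_ch, out_ch, factor):
    """Cell that multiplies every incoming value by a constant."""
    while True:
        v = in_ch.get()  # blocking receive
        if v is DONE:
            out_ch.put(DONE)
            return
        out_ch.put(v * factor)


def sink(in_ch, results):
    """Cell that collects values so the caller can inspect them."""
    while True:
        v = in_ch.get()
        if v is DONE:
            return
        results.append(v)


def run_net(values, factor=2, depth=4):
    """Build the net source -> scale -> sink and run it to completion."""
    a = queue.Queue(maxsize=depth)  # channel = bounded FIFO
    b = queue.Queue(maxsize=depth)
    results = []
    cells = [
        threading.Thread(target=source, args=(a, values)),
        threading.Thread(target=scale, args=(a, b, factor)),
        threading.Thread(target=sink, args=(b, results)),
    ]
    for cell in cells:
        cell.start()
    for cell in cells:
        cell.join()
    return results


print(run_net([1, 2, 3]))
```

The cells share nothing except the two FIFOs, so the order of results is fixed by the channels rather than by thread scheduling.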
slide-19
SLIDE 19

Software → Hardware Map

[Figure labels: cell, channel, FIFO]

slide-20
SLIDE 20

Consequences of the approach

  • No global memory
  • No processor sharing
  • No peculiarities of a specific processor
  • No predefined topology (NoC)
  • No interrupts

→ No operating system

slide-21
SLIDE 21

Cell

type BernoulliSampler* = cell (probIn: port in; valOut: port out);  (* non-typed communication ports *)
var r: Random.Generator; p: real;
begin  (* cell activity *)
  new(r);
  loop
    p << probIn;                (* blocking receive *)
    valOut << r.Bernoulli(p);   (* asynchronous send *)
  end
end BernoulliSampler;

[Figure: Bernoulli Sampler cell with input port probIn and output port valOut]

slide-22
SLIDE 22

Properties

type Controller = cell {Processor=TRM, FPU, DataMemory=2048, BitWidth=18}
    (in: port in (64); result: port out);   (* (64): port width *)
...
begin
  (* ... controller action ... *)
end Controller;
....

Properties can influence both the generation of hardware and the generation of software code.

[Figure: Controller (TRM) with FPU]

slide-23
SLIDE 23

Configurable Processor on PL

Tiny Register Machine configuration options:

  • FPU: off / on
  • Vector Unit: off / on
  • Multiplier: off / on
  • Instruction Width: 14, 16, 18, 20, 22
  • Program Memory: 0k, 1k, 2k, 3k, 4k
  • Data Memory: 0.5k, 1k, 1.5k

slide-24
SLIDE 24

Engine

type Merger = cell {Engine, inputs=1} (ind: array inputs of port in; outd: port out);
var data, i: longint;
begin
  loop
    for i := 0 to len(ind)-1 do
      data << ind[i];   (* receive from input i *)
      outd << data;     (* forward to the single output *)
    end
  end
end Merger;

Engines are prebuilt components instantiated as electronic circuits on a target hardware.

[Figure: Merger engine]
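
The Merger loop takes one value from each input in index order and forwards it to the single output. A minimal Python sketch of that round-robin behaviour follows; the function name is hypothetical, and stopping when an input runs dry is an assumption of the sketch (a hardware engine would block on the empty port instead).

```python
from typing import Iterable, Iterator, List


def merge_round_robin(inputs: List[Iterable[int]]) -> Iterator[int]:
    """Yield one item from each input in index order, over and over.

    Stops as soon as any input is exhausted (sketch-only behaviour).
    """
    iterators = [iter(stream) for stream in inputs]
    if not iterators:
        return
    while True:
        for it in iterators:
            try:
                item = next(it)
            except StopIteration:
                return
            yield item


print(list(merge_round_robin([[1, 2, 3], [10, 20, 30]])))
```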

slide-25
SLIDE 25

Unit of Deployment: (Terminal) Cellnet

LearnTest = cellnet;
var
  learner: CRBMNet.CRBMLearner;
  reader: MLUtil.imageReader;
  ims0, ims1: MLUtil.imshow;
  …
begin
  (* dynamic construction *)
  new(learner);
  new(reader);
  new(ims0{name='v0debug',posx=0,posy=100});
  new(ims1{name='v1debug',posx=300,posy=100});
  …
  (* connection *)
  reader.imageOUT >> learner.imgIN;
  learner.v0DebugOUT >> ims0.imageIN;
  learner.v1DebugOUT >> ims1.imageIN;
  …
end LearnTest;

[Figure: imageReader connected to CRBMLearner, which is connected to two imShow cells]

slide-26
SLIDE 26

Hierarchical Composition

[Figure: hierarchical composition. Top level: imageReader, CRBMLearner and two imShow cells. Inside CRBMLearner: UpNet, DownNet, Sample, GetGradients and several Delay cells, with ports img in, result and debug, exchanging visible, hidden, kernels and P(h|v) data. Inside UpNet: split and UpStep.]

slide-27
SLIDE 27

Hierarchic Composition: non-terminal Cellnet

UpNet = cellnet {vr=28,vc=28,kr=5,kc=5,k=9,c=2,name='upstep'}   (* ports and properties *)
    (pvIN, kerIN, bIN: port in; phOUT: array k of port out; pvSideOUT: port out);
var
  i, hr, hc: longint;
  upstep: CRBMUpstepCell;
  vSplit: MLFunctions.SplitterCell;
begin
  hr := vr-kr+1; hc := vc-kc+1;
  new(vSplit {dataSize=vr*vc,numOut=2});
  new(upstep {vr=vr,vc=vc,kr=kr,kc=kc,k=k,c=c});
  pvIN >> vSplit.dataIN;              (* port delegation *)
  vSplit.dataOUT[0] >> pvSideOUT;
  vSplit.dataOUT[1] >> upstep.vIN;
  kerIN >> upstep.kerIN;
  bIN >> upstep.bIN;
  for i := 0 to k-1 do
    upstep.phOUT[i] >> phOUT[i];      (* port delegation *)
  end;
end UpNet;

[Figure: UpNet containing split and UpStep, with ports img in, kernIn, biasIn, pvOut and phOut]
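
The expressions hr := vr-kr+1 and hc := vc-kc+1 are the usual output size of a convolution without padding. A short Python check with the property values from the slide follows; the function name is hypothetical, and reading k as the number of hr × hc maps (one per phOUT port) is an assumption.

```python
def valid_conv_size(visible: int, kernel: int) -> int:
    """Number of positions at which a kernel fits inside the visible extent."""
    if kernel > visible:
        raise ValueError("kernel larger than visible extent")
    return visible - kernel + 1


# Property values taken from the UpNet declaration on the slide.
vr, vc, kr, kc, k = 28, 28, 5, 5, 9

hr = valid_conv_size(vr, kr)
hc = valid_conv_size(vc, kc)
hidden_units = k * hr * hc  # assumes k maps of hr x hc values each

print(hr, hc, hidden_units)
```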

slide-28
SLIDE 28

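Software → Hardware Map

[Figure: mapping onto a ZYNQ device with processing system (PS: ARM) and programmable logic (PL: TRM, engines). Legend: a cell maps to a thread or a softcore, an engine maps to a hw engine, a port maps to an AXI4 FIFO; components are linked by an AXI4 Stream Interconnect.]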

slide-29
SLIDE 29

Hybrid Compilation

Code body       | Role             | Compilation method
Cell (Softcore) | Program logic    | Software Compilation
Cell (Engine)   | Computation unit | Hardware Generation
Cell Net        | Architecture     | Hardware Compilation

cellnet N;
type A = cell(pi: port in; po: port out);
  var x: integer;
begin
  … pi ? x; … po ! x; …
end A;
var a, b: A;
begin
  … connect(a.po, b.pi)
end N.

slide-30
SLIDE 30

Implementation Alternatives (1)

[Figure: source → compiler frontend → one backend per board (Zybo Backend, Spartan3 Backend, Spartan3e Backend, …) → bitstream]

  + simple
  − redundant
  − not flexible
  − hard to extend

slide-31
SLIDE 31

Implementation Alternatives (2)

[Figure: source → compiler frontend → cellnet interpreter backend, using a Hardware Library and a Component Library → bitstream]

  + extensible
  + not redundant
  − not simple
  − static configuration limits flexibility

slide-32
SLIDE 32

Implementation Alternatives (3)

[Figure: source → compiler frontend → executable; the executable runs with the hdl runtime, Hardware Library executables and Component Library executables → bitstream]

  + extensible
  + not redundant
  + simpler
  + simulation becomes execution

slide-33
SLIDE 33

Frontend

[Figure: Source Code (Module) → Compiler Frontend → Intermediate Code (Module) → Compiler Backend → Executable. The executable is linked against software libraries and one of three runtimes: Simulation Runtime, Visualization Runtime or HDL Backend Runtime.]

slide-34
SLIDE 34

Backend

[Figure: the HDL Backend Runtime performs Dependency Analysis (producing a Dependency Graph from the HW Library), then runs the HDL Code Generator (producing HDL Code) and the Instruction Code Generator. For each processor, Intermediate Code (Module) plus Runtime Libraries pass through the Compiler Backend to give TRM or ARM code, which is combined with a TRM Kernel or ARM Kernel into an Executable. The HDL code goes through the FPGA vendor implementation tools to yield the bitstream for deployment.]

slide-35
SLIDE 35

Extensibility: Defining components and platforms

Component Specification

Software (HLL code):

type Gpo = cell{Engine,DataWidth=32,InitState="0H"} (input: port in);

Hardware (HDL code):

module Gpo #( parameter integer DW = 8 … ) ( input aclk, input aresetn, input [DW-1:0] … );

Platform Specification

Hardware platform and tools:

  • Hardware types
  • Platform instances
  • Vendor-specific tools

Build Command

AcHdlBackend.Build --target="ZyboBoard" -p="Vivado" CRBM.TestCellnet ~

slide-36
SLIDE 36

Component Specification

module Gpo;
import Hdl := AcHdlBackend;
var c: Hdl.Engine;
begin
  (* identification and description *)
  new(c,"Gpo","Gpo");
  c.SetDescription("General Purpose Output … ");
  (* supported devices *)
  c.SupportedOn("*"); (* portable *)
  (* parameters *)
  c.NewProperty("DataWidth","DW",Hdl.NewInteger(32),Hdl.IntegerPropertyRangeCheck(1,Hdl.MaxInteger));
  c.NewProperty("InitState","InitState",Hdl.NewBinaryValue("0H"),nil);
  (* ports *)
  c.SetMainClockInput("aclk"); (* main component's clock *)
  c.SetMainResetInput("aresetn",Hdl.ActiveLow); (* active-low reset *)
  c.NewAxisPort("input","inp",Hdl.In,8);
  c.NewExternalHdlPort("gpo","gpo",Hdl.Out,8);
  (* dependencies *)
  c.NewDependency("Gpo.v",true,false);
  (* configuration actions *)
  c.AddPostParamSetter(Hdl.SetPortWidthFromProperty("inp","DW"));
  c.AddPostParamSetter(Hdl.SetPortWidthFromProperty("gpo","DW"));
  Hdl.hwLibrary.AddComponent(c);
end Gpo.

slide-37
SLIDE 37

Generic Peer-to-Peer Communication Interface

  • Use of AXI4 Stream interconnect standard from ARM
  • Generic, flexible
  • Non-redundant

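[Figure: a sender port and a receiver port connected through a multiplexing network (Data in, Data out, Select, clk). TVALID: data valid, driven by the sender. TREADY: ready to process, driven by the receiver.]

  • The master must assert TVALID and keep it asserted until the transfer takes place.
  • The slave may wait for the master's TVALID before asserting TREADY.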
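
The handshake rule can be made concrete with a small cycle-based model: a word moves in exactly those clock cycles in which TVALID and TREADY are both high. The Python below is a simplified sketch under that single rule; the function name and the per-cycle `ready_pattern` list are assumptions of the example, not part of any AXI tooling.

```python
def simulate_handshake(data, ready_pattern):
    """Return (cycle, word) pairs for every completed transfer.

    data          -- words the master wants to send, in order
    ready_pattern -- one boolean per clock cycle: the slave's tready
    """
    transfers = []
    index = 0
    for cycle, tready in enumerate(ready_pattern):
        # The master holds tvalid high for as long as it has a word pending.
        tvalid = index < len(data)
        if tvalid and tready:
            transfers.append((cycle, data[index]))
            index += 1
    return transfers


print(simulate_handshake(["a", "b", "c"], [False, True, True, False, True]))
```

In cycles 0 and 3 the slave is not ready, so the pending word simply stays on the bus and is taken in the next ready cycle; no data is lost or duplicated.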

slide-38
SLIDE 38

Target Platform Specification

module Basys2Board;
import Hdl := AcHdlBackend, AcXilinx;
var
  t: Hdl.TargetDevice;
  pldPart: AcXilinx.PldPart;
  ioSetup: Hdl.IoSetup;
  pin: Hdl.IoPin;
begin
  (* FPGA part *)
  new(pldPart,"XC3S100E-4CP132");
  pldPart.SetJtagChainIndex(0);
  new(t,"Basys2Board",pldPart);
  (* system signals *)
  new(pin,Hdl.In,"B8","LVCMOS33");
  t.NewExternalClock(pin,50000000,50,0); (* ExternalClock0 *)
  t.SetSystemClock(t.clocks.GetClockByName("ExternalClock0"),1,1);
  new(pin,Hdl.In,"G12","LVCMOS33");
  t.SetSystemReset(pin,true);
  (* mapping of external ports *)
  new(ioSetup,"Gpo_0");
  ioSetup.NewIoPort("gpo",Hdl.Out,"U16,E19,U19,V19","LVCMOS33");
  t.AddIoSetup(ioSetup);
  Hdl.hwLibrary.AddTarget(t);
end Basys2Board.

slide-39
SLIDE 39

Case Study 1: ECG

Focus: Resources and Power

Real-time ECG Monitor

[Figure: processing stages Signal input, Wave proc_1 to Wave proc_8, QRS detect, HRV analysis and Disease classifier, turning the ECG bitstream into an out stream. Implementation: TRM1 to TRM12 connected by FIFO1 to FIFO34, with UART, CF and LCD controllers, on a Virtex-5 LX50T FPGA (Xilinx ML505 board) with RS232, CF, LCD and ECG sensor attached.]

slide-40
SLIDE 40

Resources

  • ECG Monitor*

#TRMs | #LUTs       | #BRAMs   | #DSPs    | TRM load
12    | 13859 (48%) | 52 (86%) | 12 (25%) | <5% @ 116 MHz

  • Maximum number of TRMs in communication chain

FPGA     | #TRMs | #LUTs       | #BRAMs    | #DSPs
Virtex-5 | 30    | 27692 (96%) | 60 (100%) | 30 (62%)
Virtex-6 | 500   |             |           |

* 8 physical channels @ 500 Hz sampling frequency, implemented on Virtex-5

slide-41
SLIDE 41

Comparative Power Usage

  • Preconfigured FPGA (#TRMs, IM/DM, I/O, interconnect fixed) versus fully configurable FPGA (Active Cells)

System                  | Static Power (W) | Dynamic Power (W)
Preconfigured ("TRM12") | 3.44             | 0.59
Dynamically configured  | 0.5              | 0.58

86% saving!
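
The saving can be recomputed from the table. The Python below uses only the four rounded values shown on the slide; which quantity the "86%" refers to is not stated, so the comparison at the end is an assumption. With these rounded inputs the static-power reduction comes out at roughly 85%, which is the closest match.

```python
def relative_saving(before: float, after: float) -> float:
    """Fraction of `before` that is saved when moving to `after`."""
    return (before - after) / before


# Values in watts, as given in the table.
static_before, dynamic_before = 3.44, 0.59   # preconfigured "TRM12"
static_after, dynamic_after = 0.5, 0.58      # dynamically configured

static_saving = relative_saving(static_before, static_after)
dynamic_saving = relative_saving(dynamic_before, dynamic_after)
total_saving = relative_saving(static_before + dynamic_before,
                               static_after + dynamic_after)

print(f"static:  {static_saving:.1%}")
print(f"dynamic: {dynamic_saving:.1%}")
print(f"total:   {total_saving:.1%}")
```

Dynamic power is almost unchanged, so nearly all of the gain comes from not instantiating unused cores, memories and interconnect.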

[Figure: the ECG stages (Signal input, Wave proc_1 to Wave proc_8, QRS detect, HRV analysis, Disease classifier) placed on Core 0 to Core 11]

slide-42
SLIDE 42

Case Study: Non-Invasive Continuous Blood Pressure Monitor

Focus: Development Cycle Time

  • A2 Host OS with GUI on ARM
  • Sensor control and medical algorithms on Zynq PL
  • Sensors and motors on bracelet

slide-43
SLIDE 43

Medical Monitor Network On Chip


Dominated by TRM processors. Feedback driven. Not performance critical.

slide-44
SLIDE 44

Development Cycle Times

Step                      | Medical Monitor | OCT (full)
Software Compilation      | 4 s             | 2 s
Hardware Implementation   | 20 min          | 20 min
Stream Patching (all)     | 2 min           | -
Stream Patching (typical) | 12 s            | -
Deployment                | 11 s            | 16 s

Figure annotations on how frequently steps occur: sporadic, often.

slide-45
SLIDE 45

Case Study 3: Optical Coherence Tomography

Focus: Performance

z-Axis Processing: A(λ_i) = A(1/f_i) → f(z)

  1. Non-uniform sampling: A(λ_i) → A(f_i)
  2. Dispersion compensation
  3. (Inverse) FFT

… for many lines x in a row (2D)
… and many rows y in a column (3D)
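
The three z-axis steps can be sketched for a single line in plain Python. Everything below is an illustrative assumption rather than the project's implementation: linear interpolation stands in for the resampling step, dispersion compensation is modelled as multiplication by a known phase correction, and a naive inverse DFT replaces the FFT. All function names are hypothetical.

```python
import cmath
import math


def resample_uniform_in_f(wavelengths, amplitudes, n_out):
    """Step 1: interpolate samples A(lambda_i) onto a grid uniform in f = 1/lambda."""
    pairs = sorted((1.0 / lam, amp) for lam, amp in zip(wavelengths, amplitudes))
    fs = [f for f, _ in pairs]
    vals = [a for _, a in pairs]
    out, j = [], 0
    for n in range(n_out):
        f = fs[0] + (fs[-1] - fs[0]) * n / (n_out - 1)
        while j < len(fs) - 2 and fs[j + 1] < f:
            j += 1  # advance to the interval that contains f
        t = (f - fs[j]) / (fs[j + 1] - fs[j])
        out.append(vals[j] + t * (vals[j + 1] - vals[j]))
    return out


def compensate_dispersion(spectrum, phase):
    """Step 2: remove a known phase error phase[n] from every spectral sample."""
    return [a * cmath.exp(-1j * phi) for a, phi in zip(spectrum, phase)]


def inverse_dft(spectrum):
    """Step 3: naive O(n^2) inverse DFT; a real system would use an FFT."""
    n = len(spectrum)
    return [
        sum(s * cmath.exp(2j * math.pi * k * m / n) for k, s in enumerate(spectrum)) / n
        for m in range(n)
    ]


def a_scan(wavelengths, amplitudes, phase, n_out):
    """Depth profile |f(z)| of one line from its sampled spectrum."""
    spectrum = resample_uniform_in_f(wavelengths, amplitudes, n_out)
    spectrum = compensate_dispersion(spectrum, phase)
    return [abs(v) for v in inverse_dft(spectrum)]
```

Because every line is processed independently, the same chain can be replicated or pipelined across the x and y dimensions, which is what makes the problem a good fit for streaming engines.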

slide-46
SLIDE 46

A component of OCT image processing

Dispersion Compensation

Dominated by engines. Dataflow driven.

slide-47
SLIDE 47

Case Study: ANN

[Figure: a Perceptron built from two engines. InnerProdFlt multiplies inputs x0, x1, x2 by weights w0, w1, w2 and adds the products and the bias b to give y. Sigmoid maps y to z using Frac, LUT and Fix stages. In the programmable logic, perceptrons P0 to P2 with results R0 to R2 form ANN Layer 1 (more layers may follow); the ARM SoC supplies x and w and receives z.]
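
Functionally, the perceptron in the figure is an inner product plus bias followed by a sigmoid. A floating-point Python sketch of that computation is given below; the function names are hypothetical, and the exact `math.exp` sigmoid stands in for the lookup-table approach that the figure labels suggest.

```python
import math
from typing import List, Sequence


def inner_product(x: Sequence[float], w: Sequence[float], b: float) -> float:
    """y = sum_i x_i * w_i + b, the job of the inner-product engine."""
    return sum(xi * wi for xi, wi in zip(x, w)) + b


def sigmoid(y: float) -> float:
    """z = 1 / (1 + e^-y); computed exactly here instead of via a table."""
    return 1.0 / (1.0 + math.exp(-y))


def perceptron(x: Sequence[float], w: Sequence[float], b: float) -> float:
    return sigmoid(inner_product(x, w, b))


def layer(x: Sequence[float],
          weights: Sequence[Sequence[float]],
          biases: Sequence[float]) -> List[float]:
    """One layer: every perceptron sees the same input vector x."""
    return [perceptron(x, w, b) for w, b in zip(weights, biases)]


print(layer([1.0, 0.0, -1.0],
            [[0.5, 0.5, 0.5], [1.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
            [0.0, 0.0, 0.0]))
```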

slide-48
SLIDE 48

Performance and Resource Usage

                 | Medical Monitor    | Simple OCT                          | OCT               | Perceptron
Architecture     | Spartan 6 XC6SLX75 | Zynq 7000 XC7Z020                   | Zynq 7000 XC7Z020 | Zynq 7000 XC7Z010
Slice LUTs       | 28%                | 11%                                 | 17%               | 83%
Slice Registers  | 4%                 | 6%                                  | 8%                | 67%
BRAMs            | 80%                | 7%                                  | 22%               | 13.33%
DSPs             | 24%                | 15%                                 | 31%               | 21%
ARM Cortex A9    | -                  | 1                                   | 1                 | 1
Clock Rate       | 58 MHz             | 118 MHz                             | 50 MHz            | 147 MHz
Data Bandwidth   | 1.25 Mbit/s (in)   | 236 MWords/s (in)                   | 50 MWords/s (in)  | 9.6 GBits/s (in)
                 | 23 kB/s (out)      | 118 MWords/s (out)                  | 50 MWords/s (out) | 9.6 GBits/s (out)
Performance      | -                  | 8.3 GFPOps*, up to 32 GFPops**      | 4.3 GFPOps*       | 4.9 GFlops
Power            | ~2 W               | ~5 W                                | ~5 W              | 2.1 W

** Fixed point operations, 32 bit
* when instantiated 4 times

slide-49
SLIDE 49

Conclusion

ActiveCells: Computing model and tool-chain for configurable computing

  • Configurable interconnect → Simple Computing, Power Saving
  • Embedding of task engines → High Performance
  • Hybrid compilation → Quick Development
  • Backend execution → Eased Flexibility and Extensibility

slide-50
SLIDE 50

Mapping to Hardware – Simple Example

IO* = CELL (input: PORT IN; output: PORT OUT; buttons: PORT IN; leds: PORT OUT);
BEGIN
  ...
END IO;

Controller* = CELL (input: PORT IN; output: PORT OUT);
BEGIN
  ...
END Controller;

VAR
  controller: Controller;
  io: IO;
  gpo{DataWidth=8}: Engines.Gpo;
  gpi{DataWidth=11}: Engines.Gpi;
BEGIN
  NEW(controller); NEW(io); NEW(gpi); NEW(gpo);
  CONNECT(controller.output, io.input, 32);
  CONNECT(io.output, controller.input, 32);
  CONNECT(gpi.output, io.buttons);
  CONNECT(io.leds, gpo.input);
END Simple.

[Figure: IO (TRM) and Controller (TRM) connected by two FIFOs; gpi feeds IO, IO drives gpo]

→ Blackboard / Code inspection

slide-51
SLIDE 51

Spartan3 Board

Resources
4-input LUTs | 3840
Flip Flops   | 3840
BRAMs        | 12
DSPs         | -
slide-52
SLIDE 52

Resource Usage Scenarios

Scenario 1: one TRM

CELLNET Game;
IMPORT LED;
TYPE
  IO* = CELL {DataMemorySize(512), CodeMemorySize(1024), LED} (in: PORT IN);
  BEGIN
  END IO;
VAR io: IO;
BEGIN
  NEW(io);
END Game.

Resources: 4-input LUTs 27%, Flip Flops 5%, BRAMs 16%

Scenario 2: TRM → TRM

VAR io: IO; trm: TRM;
BEGIN
  NEW(io); NEW(trm);
  CONNECT(trm.out, io.in);
END Game.

Resources: 4-input LUTs 58%, Flip Flops 9%, BRAMs 33%

Scenario 3: TRM → TRM → TRM (cf. next page)

Resources: 4-input LUTs 86%, Flip Flops 18%, BRAMs 75%

→ TRM ≈ 21-27% (LUTs), 16% (BRAMs); FIFO ≈ 6% (LUTs)

slide-53
SLIDE 53

Resource Usage (Lab)

controller: Controller;
ioTRM: IO;
digitsDriver: DigitsDriver;
digits: Engines.LEDDigits;
gpo{DataWidth=8}: Engines.Gpo;
gpi{DataWidth=11}: Engines.Gpi;
BEGIN
  NEW(controller); NEW(ioTRM); NEW(digitsDriver);
  NEW(digits); NEW(gpi); NEW(gpo);
  CONNECT(controller.cmdOUT, ioTRM.cmdIN, 10);
  CONNECT(ioTRM.cmdOUT, controller.cmdIN, 10);
  CONNECT(gpi.output, ioTRM.ButtonsIN);
  CONNECT(ioTRM.LEDOUT, gpo.input);
  CONNECT(ioTRM.digitsOUT, digitsDriver.cmdIN, 10);
  CONNECT(digitsDriver.ledOUT, digits.input);

  • 3 TRMs
  • 3 FIFOs

Resources: 4-input LUTs 86%, Flip Flops 18%, BRAMs 75%
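
The per-component figures quoted for the Spartan3 board (a TRM at roughly 21 to 27% of the LUTs, a FIFO at roughly 6%) allow a quick feasibility check before synthesis. The Python below is a back-of-the-envelope sketch: the function name is hypothetical, and engines such as Gpo, Gpi and LEDDigits are ignored, which is an assumption.

```python
TRM_LUT_PERCENT = (21, 27)  # low and high estimate per TRM, from the slides
FIFO_LUT_PERCENT = 6        # estimate per FIFO, from the slides


def estimate_lut_percent(n_trm: int, n_fifo: int) -> tuple:
    """Return (low, high) estimated LUT usage in percent of the device."""
    low = n_trm * TRM_LUT_PERCENT[0] + n_fifo * FIFO_LUT_PERCENT
    high = n_trm * TRM_LUT_PERCENT[1] + n_fifo * FIFO_LUT_PERCENT
    return low, high


# Designs from the slides with their reported LUT usage in percent.
designs = [
    ("1 TRM", 1, 0, 27),
    ("TRM -> TRM", 2, 1, 58),
    ("lab design (3 TRMs, 3 FIFOs)", 3, 3, 86),
]
for name, n_trm, n_fifo, reported in designs:
    low, high = estimate_lut_percent(n_trm, n_fifo)
    fits = "fits" if high <= 100 else "may not fit"
    print(f"{name}: estimated {low}-{high}%, reported {reported}% ({fits})")
```

All three reported values fall inside the estimated ranges, and the upper estimate for the lab design is close to 100%, so a fourth TRM would not be expected to fit on this device.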