

SLIDE 1

Tsunami simulation on FPGA/GPU and its analysis based on Statistical Model Checking

Masahiro Fujita, VLSI Design and Education Center (VDEC), University of Tokyo

SLIDE 2

Outline

  • What Tsunami simulation means in this talk
  • Acceleration with FPGA/GPU
    – Based on stream processing (pipelining) with loop unrolling
    – Based on parallel processing for decomposed regions
  • (Formal verification of those implementations)
    – (Equivalence checking between the FPGA/GPU implementation and the original program in C/Fortran)
    – Just show our strategy
  • Statistical model checking
    – On software in Fortran
    – Acceleration with FPGA/GPU

SLIDE 3

Motivation

  • Based on the values of many earthquake sensors (wired/wireless), compute how the Tsunami wave will propagate
  • Goal: Realize supercomputer-level performance in Tsunami simulation with FPGA/GPU
  • Compute/Predict the Tsunami as fast and as accurately as possible

[Figure: geographically distributed earthquake sensors; an initial wave is generated from the sensor data, then propagated by numerically solving partial differential equations.]

SLIDE 4

Tsunami simulation

  • Tsunami simulation algorithm: Find solutions of fluid dynamics equations
    – Law of Conservation of Mass
    – Law of Conservation of Momentum, with and without bottom friction
  • Solved with known boundary conditions and bathymetric input of the region
  • Here the above is processed by numerically solving sets of partial differential equations with finite difference methods

SLIDE 5

Partial differential equations to be solved

Symbols used in the equations:
  • η = vertical displacement of water above the still water surface
  • D = total water depth = h + η
  • g = acceleration due to gravity
  • A = horizontal eddy viscosity
  • τ = friction along the x or y direction
  • M = water flux discharge along the X direction
  • N = water flux discharge along the Y direction

[Figure: the mass conservation equation and the momentum equations along the X-axis and Y-axis respectively, with and without bottom friction.]

Reference: Tsunami Modeling Manual by Prof. Nobuo Shuto

SLIDE 6

Here we use a simplified model: Linear Shallow Water Theory

  • Shallow Water Theory (Long Wave Theory); the linear form is valid if the sea depth is large enough

Mass conservation:

\[ \frac{\partial \eta}{\partial t} + \frac{\partial M}{\partial x} + \frac{\partial N}{\partial y} = 0 \]

Momentum conservation (with bottom friction, Manning roughness n):

\[ \frac{\partial M}{\partial t} + \frac{\partial}{\partial x}\!\left(\frac{M^2}{D}\right) + \frac{\partial}{\partial y}\!\left(\frac{MN}{D}\right) + gD\,\frac{\partial \eta}{\partial x} + \frac{g n^2}{D^{7/3}}\, M \sqrt{M^2 + N^2} = 0 \]

\[ \frac{\partial N}{\partial t} + \frac{\partial}{\partial x}\!\left(\frac{MN}{D}\right) + \frac{\partial}{\partial y}\!\left(\frac{N^2}{D}\right) + gD\,\frac{\partial \eta}{\partial y} + \frac{g n^2}{D^{7/3}}\, N \sqrt{M^2 + N^2} = 0 \]

where η: wave height, D: total depth, g: gravity, n: Manning roughness coefficient, M, N: water flux along x and y.

  • Linear Long Wave Theory: the advection and friction terms are dropped, so the momentum equations reduce to

\[ \frac{\partial M}{\partial t} + gD\,\frac{\partial \eta}{\partial x} = 0 , \qquad \frac{\partial N}{\partial t} + gD\,\frac{\partial \eta}{\partial y} = 0 , \]

together with the same mass conservation equation.

SLIDE 7

Finite difference methods

  • Solution of the mass conservation equation based on the finite difference method:

    Z(i,j,t+1) = Z(i,j,t) - (dt/dx) * ( M(i,j,t) - M(i-1,j,t) + N(i,j,t) - N(i,j-1,t) )

    which discretizes ∂η/∂t + ∂M/∂x + ∂N/∂y = 0

  • Where
    – i, j = x, y grid coordinates
    – Z(i,j,t) = water surface level at time t; Z(i,j,t+1) = water surface level at time t+dt
    – H(i,j,t) = still water depth at time t
    – M(i,j,t) = water flux discharge along the x-axis at time t
    – N(i,j,t) = water flux discharge along the y-axis at time t
    – dt = temporal step, dx = spatial step
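For concreteness, here is a minimal C sketch of one such mass-conservation sweep over the whole grid (our illustration, not the TUNAMI-N1 source; the array names and row-major layout are assumptions, and the open-boundary handling is omitted):

/* One mass-conservation (Z) update over an nx-by-ny grid.
   z, m, n hold the current Z, M, N fields; z_new receives Z at time t+dt. */
void mass_step(float *z_new, const float *z, const float *m, const float *n,
               int nx, int ny, float dt, float dx)
{
    for (int j = 1; j < ny; j++) {          /* raster-scan the grid */
        for (int i = 1; i < nx; i++) {
            int idx = j * nx + i;
            /* Z(i,j,t+1) = Z(i,j,t) - (dt/dx) * (M(i,j,t) - M(i-1,j,t)
                                                  + N(i,j,t) - N(i,j-1,t)) */
            z_new[idx] = z[idx]
                       - (dt / dx) * (m[idx] - m[idx - 1]
                                    + n[idx] - n[idx - nx]);
        }
    }
}

The momentum sweep is analogous and, as the following slides show, the two sweeps simply alternate once per simulated second.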

SLIDE 8

Wave Height Computation

[Figure: time-stepping scheme. In each 1-second step, the mass-conservation update computes the wave height Z at time T+1 from H, M, N and Z at time T, and the momentum-conservation update then computes M, N for the next step from the updated Z; the two updates alternate over the simulated time.]

SLIDE 9

Target Tsunami Simulator

  • "TUNAMI N1" program in FORTRAN
    – Developed by Tohoku University

[Figure: simulation flow. Earthquake source information gives the initial wave height (Z0); the standing water depth H is the bathymetric input (1040 grids across the region, 1 grid: 1 km x 1 km); the mass-conservation computation (wave height Z) and the momentum-conservation computation (momentum M, N) alternate, with one wave-height update per iteration = 1 second.]

SLIDE 10

C Implementation (base program)

  • Mass and Momentum functions are computed alternately
    – Each function raster-scans the grids
  • Since there is no data dependency between the computations at different grid points, they can be parallelized

[Figure: Mass Conservation (Z update) and Momentum Conservation (M, N update) sweeps; within each sweep the grid points have no dependency on each other (parallelizable).]

SLIDE 11

Speed of N1 simulation program

  • Original Fortran program has been manually converted into C
    – C is 4 times faster than Fortran in our environment
    – The C version is the base simulator
  • Size of simulation area
    – Grid width: 1 [km]
    – Number of grids: 1040 * 668
  • Simulated time
    – 1 time step = 1 sec
    – 7,200 steps computed (2 hours)
  • Tsunami simulation time on an Intel microprocessor (i7@2.93GHz, single core)
    → 78.7 sec

SLIDE 12

Profiling of Computation time of the software

Program flow: Start → Input → Initial condition → Main loop (t = 1; t <= T; t++) { Mass Conservation → Open Boundary Condition → Momentum Conservation → Open Boundary Condition } → Stop

[Figure: "Simulation Cycles and Computation time of TUNAMI": cumulative computation time [sec] versus simulation cycles (t = 3600 ... 21600) for the initial phase, mass conservation, momentum conservation and open boundary parts.]

SLIDE 13

Co-execution

C on microprocessor (host):
  • Read earthquake information from files
  • Compute initial wave
  • Load initial wave to the DRAM memory of the FPGA board
  • Run FPGA (host is idle while the main loop runs on the FPGA/GPU)
  • Read data from DRAM
  • Store results into files

FPGA/GPU:
  • Main loop: Mass Conservation → Open Boundary Condition → Momentum Conservation → Open Boundary Condition, repeated
  • Store results to DRAM

SLIDE 14

Pipeline processing for higher throughput

  • Latency
    – After receiving an input, how many cycles are required to generate its output
  • Throughput
    – How frequently input data can be processed

[Figure: iterations 1 ... N flowing through the pipeline stages over time; the latency is the depth of the pipeline and the initiation interval (~throughput) is the spacing between iterations. Goal: faster throughput = larger number of pipeline stages.]
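As a back-of-the-envelope relation (our formulation, not stated on the slide): if N iterations enter a pipeline of a given latency and a new iteration can be issued every II cycles (the initiation interval), then

\[ T_{\text{total}} \approx \text{latency} + (N - 1)\cdot II \]

so for large N the throughput approaches 1/II and the per-iteration latency hardly matters, which is why deep pipelines that keep II small are the goal here.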

SLIDE 15

Typical way for larger pipelines

  • Usually each loop becomes one pipeline
    – Multiple loops should be merged as much as possible
  • The number of pipeline stages depends on the length of each iteration
    – Better to have larger loops
  • Various loop optimizations have been proposed
    – Formal analysis becomes possible with such transformations

Original loops:

for (i = 0; i < N; i++)
  for (j = 0; j < N; j++)
    S0(i,j): A[i][j] += u[i] * v[j];

for (i = 0; i < N; i++)
  for (j = 0; j < N; j++)
    S1(i,j): x[i] += A[j][i] * y[j];

After fusion into one loop nest (S0 runs as iteration (i,j) = (t2,t1), S1 as (i,j) = (t1,t2)):

for (t1 = 0; t1 < N; t1++)
  for (t2 = 0; t2 < N; t2++) {
    S0: A[t2][t1] += u[t2] * v[t1];
    S1: x[t1] += A[t2][t1] * y[t2];
  }

[Pluto 08] U. Bondhugula, et al., "A Practical and Automatic Polyhedral Program Optimization System," in ACM PLDI'08, 2008.

SLIDE 16

Transforming latency-based to throughput-based computation

  • Stream-based programming
    – Communication/buffering becomes explicit
    – Easier for formal analysis as well
  • Works for both FPGA and GPU
    – And also for many-cores

[Figure: a conventional "instruction, latency-based" microprocessor (instruction memory, data memory, µP) versus "data, throughput-based" hardware in which data streams through processing elements P1 → P2 → P3.]

SLIDE 17

Introducing streams

  • Stream = sequence of data, functions, or both combined
  • Formal equivalence checking is done with manipulation of the transformation matrix

[Figure: the data objects M, N, H, Z of the original program at times t and t+1 can be organized either as one stream for each region or as one stream for each time step; the original program code plus a data object can itself be viewed as a single stream.]

SLIDE 18

Strategy for GPU implementation

  • As shown earlier, the stream is based on each region
    – Easier and more efficient for GPU
    – But depends on the memory access architecture of the target GPU systems
  • Essentially, the area where the Tsunami should be simulated is decomposed into a set of small regions
    – Each core of the GPU is in charge of one region
    – Straightforward parallelism
    – Pipelined computation inside each core

SLIDE 19

Target GPGPU Architecture

NVIDIA Tesla C2075 (Fermi architecture):
  • 14 Streaming Multiprocessors (SMs), each with 32 integer & FP cores
  • Per-SM shared memory / L1 cache (64KB) and register file (32k words)
  • 768KB L2 cache and 6GB global (main) memory

[Figure: block diagram of the SMs connected through the L2 cache to the 6GB global memory.]

SLIDE 20

Naïve GPGPU Implementation

[Gidra et al., IEEE HPCC 2011]

  • Mass Conservation (Z update) and Momentum Conservation (M, N update) are computed with a global synchronization between them
  • The grid is divided into blocks of 16x16 threads; a warp (32 threads) is executed in parallel; threads in a block share the shared memory
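A minimal sketch of what such a naive kernel looks like (our illustration under assumed array names and data layout, not the actual code of [Gidra et al.]); every operand is read from global memory, and the Momentum kernel is a separate launch, which acts as the global synchronization:

// Naive mass-conservation (Z) update, one thread per grid point.
__global__ void mass_naive(const float *d_Z, const float *d_M, const float *d_N,
                           float *d_Zout, int nx, int ny, float dtdx)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // x grid index
    int j = blockIdx.y * blockDim.y + threadIdx.y;   // y grid index
    if (i < 1 || j < 1 || i >= nx || j >= ny) return;

    int idx = j * nx + i;
    // Z(i,j,t+1) = Z(i,j,t) - (dt/dx)*(M(i,j)-M(i-1,j) + N(i,j)-N(i,j-1))
    d_Zout[idx] = d_Z[idx]
                - dtdx * (d_M[idx] - d_M[idx - 1] + d_N[idx] - d_N[idx - nx]);
}

// Host-side launch with 16x16 thread blocks over the 1040x668 grid:
//   dim3 block(16, 16);
//   dim3 grid((1040 + 15) / 16, (668 + 15) / 16);
//   mass_naive<<<grid, block>>>(d_Z, d_M, d_N, d_Zout, 1040, 668, dt / dx);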

SLIDE 21

Performance Bottleneck

  • Runtime is dominated by global memory accesses
    – #global accesses
      • Mass: Read H,Z,M,N (1040x668x6), Write Z (1040x668)
      • Momentum: Read H,Z,M,N (1040x668x6), Write M,N (1040x668x2)
      • Total: 1040x668x12 reads & 1040x668x3 writes
    – Global memory synchronization between Mass and Momentum
  • How to reduce the accesses?
    – Technique 1: Using shared memory to share H,Z,M,N between Mass and Momentum
      • Can eliminate all H,Z,M,N reads in Momentum
    – Technique 2: Merging Mass and Momentum to eliminate the global memory synchronization
      • More chance to utilize the computation cores during memory accesses

SLIDE 22

Technique 1: Using Shared Memory

  • For each block, (H, Z, M, N) are loaded to shared memory
    – #global accesses
      • Mass: Read H,Z,M,N (1040x668x4), Write Z (1040x668)
      • Momentum: Write M,N (1040x668x2)
      • Total: 1040x668x4 reads (67% reduction) & 1040x668x3 writes

[Figure: in the original implementation every core reads H, Z, M, N directly from global memory; in the shared-memory implementation the block transfers H, Z, M, N once into shared memory and the cores read from there.]
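A minimal sketch of the staging pattern, shown for the Z update only (our illustration; tile size, array names and boundary handling are assumptions, and the full technique also keeps H, Z, M, N resident in shared memory for the Momentum computation):

#define BX 16
#define BY 16

// Z update reading M and N from per-block shared memory instead of global memory.
__global__ void mass_shared(const float *d_Z, const float *d_M, const float *d_N,
                            float *d_Zout, int nx, int ny, float dtdx)
{
    __shared__ float sM[BY][BX + 1];   // +1 halo column for M(i-1, j)
    __shared__ float sN[BY + 1][BX];   // +1 halo row for N(i, j-1)

    int tx = threadIdx.x, ty = threadIdx.y;
    int i = blockIdx.x * BX + tx;      // global x index
    int j = blockIdx.y * BY + ty;      // global y index
    bool inside = (i < nx && j < ny);

    // Each thread stages its own M and N value once ...
    sM[ty][tx + 1] = inside ? d_M[j * nx + i] : 0.0f;
    sN[ty + 1][tx] = inside ? d_N[j * nx + i] : 0.0f;
    // ... and edge threads also stage the halo from the neighbouring block.
    if (tx == 0) sM[ty][0] = (inside && i > 0) ? d_M[j * nx + i - 1] : 0.0f;
    if (ty == 0) sN[0][tx] = (inside && j > 0) ? d_N[(j - 1) * nx + i] : 0.0f;
    __syncthreads();

    if (!inside || i == 0 || j == 0) return;
    // The update now reads M and N only from shared memory.
    d_Zout[j * nx + i] = d_Z[j * nx + i]
        - dtdx * (sM[ty][tx + 1] - sM[ty][tx] + sN[ty + 1][tx] - sN[ty][tx]);
}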

SLIDE 23

Technique 2: Eliminating Synchronization

  • Global synchronization can be eliminated by merging the Mass and Momentum functions
    – However, Mass and Momentum depend on neighboring values of the block
      • Neighboring values are loaded onto the shared memory
      • Neighboring Z values are also computed
      • The duplicated loads & computations do not impact the runtime

[Figure: the original block is expanded with its neighborhood (halo); each expanded block performs the Mass computation and then the Momentum computation without a global synchronization in between, at the cost of redundant work on the halo.]

SLIDE 24

Experimental Results

  • CUDA based implementation
  • Runtime of 7200 iterations (2 hours of simulated time)
  • Original C implementation
    – Runtime: 78.7 seconds
  • Naïve GPGPU implementation
    – Runtime: 2.75 seconds (28.6X speedup)
  • Our GPGPU implementation
    – Runtime: 1.96 seconds (40.2X speedup)

SLIDE 25

Overview of FPGA System

  • FPGA board: Virtex6 SX475T with 24 [Gbyte] device memory (38 [Gbyte/sec] to the FPGA)
  • Host: CPU Xeon X5650 @2.67GHz, 48 [Gbyte] host memory, HDD; 2 [Gbyte/sec] between host and FPGA board
  • FPGA resources: LUT 297600, FF 595200, BRAM 1064, DSP 2016

Data Flow Graph (MaxJava):

public class Example {
  x = input();
  y = x*x + x;
  y = output();
}

Host Code (C):

int in_data[n] = {1,2,3,4,5};
int out_data[n];
run_fpga( input("x", in_data),
          output("result", out_data),
          run_cycle("Example", n) );

SLIDE 26

Strategy for FPGA implementation

  • As shown earlier, the stream is based on each time step computation
    – We would like to keep the amount of communication between the FPGA and DRAM as small as possible
  • Stream for each time step: 17 * 4 [Byte] * 200 [MHz] = 13.6 [GByte/s] for 12 time steps
  • Stream for each region: 84 * 4 [Byte] * 200 [MHz] = 67 [GByte/s] for 12 time steps

SLIDE 27

MaxCompiler

  • Development using RTL requires time and effort
  • The DFG is more abstract and reduces the development time
  • This enables us to try more design alternatives

Tool flow: Data Flow Graph (MaxJava) → MaxCompiler (automatic pipelining according to the DFG and the clock frequency) → RTL → Logic Synthesis → Gate Level → Placing and Routing → FPGA Configuration

Example DFG source (MaxJava):

public class Example {
  x = input();
  y = x*x + x;
  y = output();
}

SLIDE 28

DFG example

  • Generate a DFG corresponding to as large a portion of the code as possible

int a, b, c;
void fct() {
  a++;
  if (c > 0)
    b = a + c;
  else
    b = a * c;
  c = b;
}

[Figure: the corresponding data flow graph, built from nodes for the constant 1 and the +, >, * operations over a, b, c.]

SLIDE 29

Optimizations

  • The final implementation has over 1,200 pipeline stages

[Figure: data flow graph of the preprocess unit before and after optimization.]

SLIDE 30

Performance of the main loop part

Performance for 7200 iterations (loop only), in GFLOPS:

  C    :  1.1
  FPGA : 46.0
  GPU  : 41.5

SLIDE 31

Power Consumption

[Figure: measured power [W] versus time [sec] for four runs (SW and GPU on the GPU workstation, SW and FPGA on the FPGA workstation) over roughly -100 to 250 sec, with peaks of several hundred watts.]

SLIDE 32

Comparison of Power Consumption

  • FPGA is much better in terms of energy consumption

  Power consumption [Watt]  :  C = 24,      FPGA = 42,    GPU = 129
  Energy consumption [Joule]:  C = 1888.8,  FPGA = 77.7,  GPU = 264.45
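As a consistency check on these numbers (our arithmetic, not shown on the slide), energy is roughly power times runtime:

\[ 24\ \mathrm{W} \times 78.7\ \mathrm{s} = 1888.8\ \mathrm{J} \]

which matches the C result exactly; the FPGA and GPU energies likewise correspond to runs of roughly 1.9 s at 42 W and about 2 s at 129 W.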

SLIDE 33

Compilation time

  • Time to compile the DFG into an FPGA implementation
    – High-level/logic synthesis, placement & routing
  • The relationship between the number of unrolls and the compilation time

[Figure: compilation time [sec] (in the range of roughly 200 to 1400 sec) versus the number of unrolls (1, 2, 5, 10, 12, 20).]

SLIDE 34

Statistical model checking on Tsunami simulation results

  • Used the SMC developed by Prof. Clarke's group
    – With Bayesian statistics analysis
    – Software based
  • Sensors may not be so accurate
    – What will happen if they have some errors? Currently fixed values are used

[Figure: flow from the parameters of the earthquake (depth, fault/dislocation) to the generation of the initial wave and the computation of the propagation.]

SLIDE 35

Software implementation

  • Used the SMC developed by Prof. Clarke's group
    – Only the colored (yellow) parts are replaced with ours

[Figure: SMC loop. Earthquake data feeds the Tsunami simulator with parameter variations; the checker (checker.cpp) evaluates the BLTL property on each run; the statistical analysis (bngwrap.cpp) decides when the property has been sufficiently tried and produces the results.]

SLIDE 36

Results (1)

Parameters: H, sigma=1% (earthquake depth); Test: BFT, 0.9, 1000, 1, 1
  Property               A/R   Satisfy   All   Time[sec]
  G[1800] ( Z1 < 3.3 )   R     -         3     149
  G[1800] ( Z1 < 3.4 )   R     -         3     149
  G[1800] ( Z1 < 3.5 )   A     44        44    1635
  G[1800] ( Z1 < 3.6 )   A     44        44    1635
  G[1800] ( Z1 < 3.7 )   A     44        44    1635
  G[1800] ( Z1 < 3.8 )   A     44        44    1635

Parameters: H, sigma=1% (earthquake depth); Test: BFT, 0.99, 1000, 1, 1
  Property               A/R   Satisfy   All   Time[sec]
  G[1800] ( Z1 < 3.3 )   R     -         2     74
  G[1800] ( Z1 < 3.4 )   R     -         2     74
  G[1800] ( Z1 < 3.5 )   A     239       239   8962
  G[1800] ( Z1 < 3.6 )   A     239       239   8962
  G[1800] ( Z1 < 3.7 )   A     239       239   8962
  G[1800] ( Z1 < 3.8 )   A     239       239   8962

Parameters: L, W, sigma=5% (fault/dislocation length and width); Test: BFT, 0.9, 1000, 1, 1
  Property               A/R   Satisfy   All   Time[sec]
  G[1800] ( Z1 < 3.3 )   R     -         3     149
  G[1800] ( Z1 < 3.4 )   R     -         3     149
  G[1800] ( Z1 < 3.5 )   A     224       237   8865
  G[1800] ( Z1 < 3.6 )   A     44        44    1638
  G[1800] ( Z1 < 3.7 )   A     44        44    1638
  G[1800] ( Z1 < 3.8 )   A     44        44    1638

Parameters: L, W, sigma=5% (fault/dislocation length and width); Test: BFT, 0.99, 1000, 1, 1
  Property               A/R   Satisfy   All   Time[sec]
  G[1800] ( Z1 < 3.3 )   R     -         2     77
  G[1800] ( Z1 < 3.4 )   R     4         7     299
  G[1800] ( Z1 < 3.5 )   R     306       319   11927
  G[1800] ( Z1 < 3.6 )   A     239       239   8939
  G[1800] ( Z1 < 3.7 )   A     239       239   8939
  G[1800] ( Z1 < 3.8 )   A     239       239   8939

SLIDE 37

Results (2)

Parameters: H, sigma=1%; Test: BEST, 0.05, 0.9, 1, 1 (C-H Bound: 460)
  Property               p           Satisfy   All   Time[sec]
  G[1800] ( Z1 < 3.3 )   0.0434783   -         21    817
  G[1800] ( Z1 < 3.4 )   0.0434783   -         21    817
  G[1800] ( Z1 < 3.5 )   0.956522    21        21    817
  G[1800] ( Z1 < 3.6 )   0.956522    21        21    817
  G[1800] ( Z1 < 3.7 )   0.956522    21        21    817
  G[1800] ( Z1 < 3.8 )   0.956522    21        21    817

Parameters: L, W, sigma=5%; Test: BEST, 0.05, 0.9, 1, 1 (C-H Bound: 460)
  Property               p           Satisfy   All   Time[sec]
  G[1800] ( Z1 < 3.3 )   0.0434783   -         21    796
  G[1800] ( Z1 < 3.4 )   0.430189    113       263   9531
  G[1800] ( Z1 < 3.5 )   0.956522    21        21    796
  G[1800] ( Z1 < 3.6 )   0.956522    21        21    796
  G[1800] ( Z1 < 3.7 )   0.956522    21        21    796
  G[1800] ( Z1 < 3.8 )   0.956522    21        21    796

Parameters: L, W, sigma=5%; Test: BEST, 0.01, 0.9, 1, 1 (C-H Bound: 11513)
  Property               p           Satisfy   All    Time[sec]
  G[1800] ( Z1 < 3.3 )   0.0251177   15        635    23803
  G[1800] ( Z1 < 3.4 )   0.424508    2805      6608   249805
  G[1800] ( Z1 < 3.5 )   0.96146     947       984    36908
  G[1800] ( Z1 < 3.6 )   0.991304    113       113    4229
  G[1800] ( Z1 < 3.7 )   0.991304    113       113    4229
  G[1800] ( Z1 < 3.8 )   0.991304    113       113    4229

SLIDE 38

Acceleration by HW Implementation

  • The main loop of the TUNAMI simulation can be made 46.0 times faster (FPGA implementation)
  • In the case of the GPU, 41.5 times acceleration is realized (just for reference)

[Figure: comparison among CPU (single core), FPGA and GPU for the simulation cycle (7200 sec, loop only): C = 1.1 GFLOPS, FPGA = 46.0 GFLOPS, GPU = 41.5 GFLOPS; plus the FPGA system diagram (Virtex6 SX475T with 24 Gbyte device memory at 38 Gbyte/sec, Xeon X5650 @2.67GHz host with 48 Gbyte memory and HDD, 2 Gbyte/sec between host and FPGA board).]

SLIDE 39

How can we speed up statistical model checking

  • Tsunami simulation can be accelerated with FPGA/GPU by 40 times or more
    – But the data transfer speed between the FPGA/GPU board and the microprocessor (PCI-e) is not so fast

[Figure: SMC loop. The accelerated Tsunami simulation produces the Tsunami height, the checker decides valid or not (T/F), and the statistical analysis consumes the result and decides whether to finish or to run another simulation.]

SLIDE 40

Data Transfer b/w Host and FPGA

  • Results of the Tsunami simulation should be transferred from the FPGA/GPU to the host processor
    – FPGA → Host: 2 Gbyte/sec (by PCI Express bus)
  • FPGA-based Tsunami simulation needs:
    – 28 bytes of data / clock cycle (16 bytes for input, 12 bytes for output)
    – i.e. 5.6 Gbyte/sec @ 200MHz FPGA clock
  • Considering the data transfer, the actual acceleration by the FPGA-based implementation is 16 times
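One way to read the 16x figure (our arithmetic, not spelled out on the slide): the PCI Express link sustains only 2 Gbyte/sec of the 5.6 Gbyte/sec the 200MHz pipeline would need to stream, so the 46.0x raw speedup of the main loop scales down to about

\[ 46.0 \times \frac{2\ \mathrm{Gbyte/sec}}{5.6\ \mathrm{Gbyte/sec}} \approx 16 \]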

SLIDE 41

Statistical Model Checking of Tsunami Simulation Results

  • SMC of the Tsunami simulation can be accelerated by hardware implementation
    – The data transfer can be reduced
    – Can fully utilize the acceleration of the Tsunami simulation in SMC

[Figure: with a SW implementation of the SMC part, the FPGA@200MHz generates the initial wave and runs the Tsunami simulation, and 5.6 GB/sec would be required to stream the results to the software checker and statistical analysis; with a HW implementation of the (partial) SMC part, the checker runs next to the simulator and only T/F (pass or fail) is transferred.]

SLIDE 42

HW Implementation of "checker"

  • With an FSM for each property
    – e.g. G[t] Φ, Φ1 U[t] Φ2 (time-bounded G and U operators)
  • With model checking of the traces
    – LTL formulae can be checked in a bottom-up way in linear time
    – Example: F(a -> (b U c))
      • Check each sub-formula
      • Combine in a bottom-up way

  Time            1 2 3 4 5 6 7 8 9 10 11 12 13
  a               0 1 0 1 0 1 0 1 1 1  0  1  1
  b               1 1 1 0 0 0 1 0 0 1  1  1  1
  c               0 0 1 1 1 1 0 0 0 1  1  1  1
  b U c           1 1 1 1 1 1 0 0 0 1  1  1  1
  a -> (b U c)    1 1 1 1 1 1 1 0 0 1  1  1  1
  F(a -> (b U c)) 1 1 1 1 1 1 1 1 1 1  1  1  1
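The same bottom-up evaluation is easy to write down directly; a minimal C sketch (our illustration, using the traces from this slide rather than any of the actual checker code):

/* Bottom-up evaluation of F(a -> (b U c)) over a finite trace, backwards in
   linear time; the a, b, c traces are the ones shown above. */
#include <stdio.h>

#define LEN 13

int main(void)
{
    int a[LEN] = {0,1,0,1,0,1,0,1,1,1,0,1,1};
    int b[LEN] = {1,1,1,0,0,0,1,0,0,1,1,1,1};
    int c[LEN] = {0,0,1,1,1,1,0,0,0,1,1,1,1};
    int buc[LEN], imp[LEN], fut[LEN];

    for (int t = LEN - 1; t >= 0; t--) {
        /* b U c: c holds now, or b holds now and (b U c) holds at the next step */
        buc[t] = c[t] || (b[t] && t + 1 < LEN && buc[t + 1]);
        /* a -> (b U c) */
        imp[t] = !a[t] || buc[t];
        /* F(a -> (b U c)): holds now or at some later step */
        fut[t] = imp[t] || (t + 1 < LEN && fut[t + 1]);
    }

    for (int t = 0; t < LEN; t++) printf("%d", fut[t]);
    printf("\n");   /* prints 1111111111111, matching the last row above */
    return 0;
}

On the FPGA the slide's approach instead uses one FSM per (bounded) property, so only the final pass/fail bit has to leave the board.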

SLIDE 43

Performance Improvement of SMC of Tsunami Simulation

[Figure: performance improvement compared with SW execution (bar chart; the axis runs up to about 50x).]

SLIDE 44

Conclusions and on-going works

  • Tsunami simulation has been accelerated by 40-45 times
    – Space decomposition with GPU
    – Time-wise pipelining with FPGA
  • Statistical model checking on Tsunami simulation results
    – Could be time consuming with a SW-only implementation (15X speedup)
    – By HW implementation of the checker, 40X is achieved
  • Entire HW implementation is on-going
    – Target example: rounding robustness of floating point computation with Monte Carlo Arithmetic