Tsunami simulation on FPGA/GPU and its analysis based on Statistical Model Checking
Masahiro Fujita VLSI Design and Education Center (VDEC) University of Tokyo
1
2
Outline
– What Tsunami simulation means
– Based on stream processing (pipelining) with loop unrolling
– Based on parallel processing for decomposed regions
– (Equivalence checking between FPGA/GPU implementation and the original program in C/Fortran): just our strategy is shown
– Statistical model checking
  – On software in Fortran
  – Acceleration with FPGA/GPU
3
Tsunami simulation with FPGA/GPU
– From geographically distributed earthquake sensors (wired/wireless), compute how the Tsunami wave will propagate
– Goal: compute/predict the Tsunami as fast and as accurately as possible
– Generate the initial wave from sensor data
– Propagate the wave by numerically solving partial differential equations
4
Based on shallow-water fluid dynamics equations:
– Law of Conservation of Mass
– Law of Conservation of Momentum, with and without bottom friction
with known boundary conditions and bathymetric input of the region.
These sets of partial differential equations are solved with finite difference methods.
5
η = vertical displacement of water above still water level
D = total water depth = h + η
g = acceleration due to gravity
A = horizontal eddy viscosity
τ = friction along the x or y direction
M = water flux discharge along the X direction
N = water flux discharge along the Y direction
Momentum equations along the X-axis and Y-axis respectively, with or without bottom friction; mass conservation without bottom friction.
Reference: Tsunami Modeling Manual by Prof. Nobuo Shuto
6
(Mass Conservation)

$$\frac{\partial \eta}{\partial t} + \frac{\partial M}{\partial x} + \frac{\partial N}{\partial y} = 0$$

(Momentum Conservation, X direction)

$$\frac{\partial M}{\partial t} + \frac{\partial}{\partial x}\!\left(\frac{M^2}{D}\right) + \frac{\partial}{\partial y}\!\left(\frac{MN}{D}\right) + gD\frac{\partial \eta}{\partial x} + \frac{gn^2}{D^{7/3}}\, M\sqrt{M^2+N^2} = 0$$

(Momentum Conservation, Y direction)

$$\frac{\partial N}{\partial t} + \frac{\partial}{\partial x}\!\left(\frac{MN}{D}\right) + \frac{\partial}{\partial y}\!\left(\frac{N^2}{D}\right) + gD\frac{\partial \eta}{\partial y} + \frac{gn^2}{D^{7/3}}\, N\sqrt{M^2+N^2} = 0$$

where η: wave height, D: depth, g: gravity, n: Manning roughness coefficient, M, N: flux along x and y
7
Discretized with the finite difference method:

    Z(i,j,t+1) = Z(i,j,t) - (dt/dx) * ( M(i,j,t) - M(i-1,j,t) + N(i,j,t) - N(i,j-1,t) )

where i, j = x, y coordinates, and
Z(i,j,t) = water surface level at time t
H(i,j,t) = still water depth at time t
dt = temporal step, dx = spatial step
M(i,j,t) = water flux discharge along the x-axis at time t
N(i,j,t) = water flux discharge along the y-axis at time t
Z(i,j,t+1) = water surface level at time t+dt
(Figure: time-stepping scheme. At each 1-second time step T, the mass-conservation step updates the wave height Z from M, N, and H, then the momentum-conservation step updates M, N from H and Z, and the process repeats at T+1, ...)
9
TUNAMI simulator, developed by Tohoku University:
– Inputs: standing water depth (H) and earthquake source info, from which the initial wave height (Z0) is generated
– Mass-conservation computation (wave height Z update) and momentum-conservation computation (momentum M, N update) alternate; 1 iteration = 1 second
– Region of 1040 grids (1 grid: 1 km x 1 km)
10
– Each function raster-scans the grids:
  – Mass Conservation (Z update)
  – Momentum Conservation (M, N update)
– Within each update there is no dependency between grid points (parallelizable)
11
– The Fortran code was ported to C: C is 4 times faster than Fortran in our environment (i7@2.93GHz, single core)
– The C version is the base simulator
12
Simulation Cycles and Computation Time of TUNAMI

Flow of the simulator:
Start -> Input -> Initial condition -> Main loop (t=1; t<=T; t++) { Mass Conservation -> Open Boundary Condition -> Momentum Conservation } -> Stop

(Chart: computation time [sec] versus simulation cycles t from 3600 to 21600, broken down into the initial part, mass conservation, and momentum conservation.)
13
Comparison: C on microprocessor vs. FPGA/GPU
– The microprocessor reads the input information from files and transfers it to the memory of the FPGA board
– Mass Conservation, Open Boundary Condition, and Momentum Conservation run on the FPGA/GPU, while the microprocessor stays idle
14
Latency:
– After receiving an input, how many cycles are required to generate its output
Throughput:
– How frequently input data can be processed
(Figure: pipeline processing of iterations 1 to N across the pipeline stages; the initiation interval determines the throughput.)
Goal: faster throughput = larger number of pipeline stages
15
– Multiple loops should be merged as much as possible
– Better to have larger loops for each iteration
– Formal analysis becomes possible with such transformations

Original code:

    for (i=0; i<N; i++)
      for (j=0; j<N; j++)
        S0(0,i,j): A[i][j] += u[i] * v[j];
    for (i=0; i<N; i++)
      for (j=0; j<N; j++)
        S1(1,i,j): x[i] += A[j][i] * y[j];

Transformed (fused) code:

    for (t1=0; t1<N; t1++)
      for (t2=0; t2<N; t2++) {
        S0(j,i,0): A[t2][t1] += u[t2]*v[t1];
        S1(i,j,1): x[t1] += A[t2][t1]*y[t2];
      }

(Figure: dependence between S0(i,j) and S1(i',j') in the iteration space.)

[Pluto 08] U. Bondhugula, et al., "A Practical and Automatic Polyhedral Program Optimization System," in ACM PLDI'08, 2008.
16
Stream computing:
– Communication/buffering becomes explicit
– Easier for formal analysis as well
– And also for many-cores
Microprocessor (instruction memory, data memory, µP): "instruction, latency-based".
Hardware pipeline (P1 -> P2 -> P3): "data, throughput-based", a sequence of data functions, or combined.
17
Streams for the Tsunami simulation:
– One stream for each region
– One stream for each time step (t, t+1, ...)
– The data object can be a single stream
Formal equivalence checking between the original program code and the stream implementation is done with manipulation of the transformation matrix.
18
Space decomposition for GPU:
– The whole region is decomposed into a set of small regions
– Each core of the GPU is in charge of one region
– Straightforward parallelism, with pipelined computation inside each core
– Easier and more efficient for GPU, but depends on the memory access architecture of the target GPU systems
NVIDIA Tesla C2075 (Fermi architecture):
– 14 Streaming Multiprocessors (SM), each with 32 integer & FP cores, a register file (32k words), and 64KB shared memory / L1 cache
– 768KB L2 cache
– 6GB global (main) memory
Thread organization [Gidra et al., IEEE HPCC 2011]:
– Block (16x16 threads); warp (32 threads)
– Threads in a warp are executed in parallel
– Threads in a block share the shared memory
– Synchronization through global memory is needed between the Mass Conservation (Z update) and Momentum Conservation (M, N update) kernels
21
Costs on the GPU:
– Number of global memory accesses: in total 1040 x 668 x 12 reads & 1040 x 668 x 3 writes
– Global memory synchronization between Mass and Momentum
How to reduce them?
– Technique 1: use shared memory to share H, Z, M, N between Mass and Momentum
– Technique 2: merge Mass and Momentum to eliminate the global memory synchronization
– Keep the computation cores busy during memory access
22
Technique 1: reducing the number of global accesses.
(Figure: in the original implementation, every core accesses H, Z, M, N in global memory directly; in the shared memory implementation, H, Z, M, N are block-transferred from global memory into shared memory and accessed there.)
23
Technique 2: merging Mass and Momentum.
– However, Mass and Momentum depend on neighboring values of the block
– Each block is therefore expanded with its neighborhood, so that the Mass computation and the Momentum computation can run back-to-back without a global synchronization in between
(Figure: original block vs. expanded block with neighborhood.)
24
– C baseline runtime: 78.7 seconds
– GPU, original implementation: 2.75 seconds (28.6X speedup)
– GPU, optimized implementation: 1.96 seconds (40.2X speedup)
25
FPGA system configuration:
– FPGA: Virtex6 SX475T on an FPGA board
– Device memory: 24[Gbyte], 38[Gbyte/sec]
– Host memory: 48[Gbyte]; CPU: Xeon X5650 @2.67GHz; HDD
– FPGA board to host: 2[Gbyte/sec]
FPGA resources (Virtex6 SX475T): LUT 297600, FF 595200, BRAM 1064, DSP 2016
Data Flow Graph (MaxJava):

    public class Example {
      x = input();
      y = x*x + x;
      y = output()
    }

Host code (C):

    int in_data[n] = {1,2,3,4,5};
    int out_data[n];
    run_fpga( input("x", in_data),
              run_cycle("Example", n) )
26
Time-step computation as streams:
– We would like to keep the amount of communication between the FPGA and DRAM as small as possible
– Stream for each time step vs. stream for each region: 13.6[GByte/s], = 67[GByte/s] for 12 time steps
27
MaxCompiler design flow:
– Writing RTL directly requires time and effort
– Instead, a Data Flow Graph (DFG) is described in MaxJava:

    public class Example {
      x = input();
      y = x*x + x;
      y = output()
    }

– MaxCompiler performs automatic pipelining according to the DFG and the clock frequency
– Flow: DFG -> RTL -> Logic Synthesis -> Gate Level -> Placing and Routing -> FPGA Configuration
– This reduces the development time and allows more design alternatives to be explored
28
Generate a DFG corresponding to as large a portion of the code as possible:

    int a, b, c;
    void fct() {
      a++;
      if (c > 0) b = a + c;
      else b = a * c;
      c = b;
    }

(Figure: the corresponding data flow graph, with +, *, and > nodes and a multiplexer selecting between a + c and a * c.)
29
(Figure: design before and after the Preprocess Unit.)
30
(Chart: performance [GFLOPS] for 7200 simulation cycles (loop only): C 1.1, FPGA 46.0, GPU 41.5.)
31
(Chart: power [W] versus time [sec] for SW(GPU workstation), GPU(GPU workstation), SW(FPGA workstation), and FPGA(FPGA workstation).)
32
Power consumption [Watt]: C 24, FPGA 42, GPU 129.
Energy consumption [Joule]: C 1888.8, FPGA 77.7, GPU 264.45.
33
Compilation time:
(Chart: compilation time [sec] versus number of unrolls (1, 2, 5, 10, 12, 20), growing toward 1400 sec.)
34
Statistical model checking:
– With Bayes statistics analysis
– Software based
The earthquake parameters (depth, fault/dislocation) drive the generation of the initial wave and the computation of the propagation. Fixed values are used for these parameters; what about their errors?
35
– Only the colored (yellow) parts are replaced with ours
– Tsunami simulator with parameter variations, driven by earthquake data
– Checker: checker.cpp; statistical analysis: bngwrap.cpp; property: BLTL
– Iterate until sufficiently tried, then report results
36
Parameters           Test                   Property             A/R  Satisfy  All   Time[sec]
H, sigma=1%          BFT, 0.9, 1000, 1, 1   G[1800] (Z1 < 3.3)   R             3     149
(earthquake depth)                          G[1800] (Z1 < 3.4)   R             3     149
                                            G[1800] (Z1 < 3.5)   A    44       44    1635
                                            G[1800] (Z1 < 3.6)   A    44       44    1635
                                            G[1800] (Z1 < 3.7)   A    44       44    1635
                                            G[1800] (Z1 < 3.8)   A    44       44    1635
H, sigma=1%          BFT, 0.99, 1000, 1, 1  G[1800] (Z1 < 3.3)   R             2     74
(earthquake depth)                          G[1800] (Z1 < 3.4)   R             2     74
                                            G[1800] (Z1 < 3.5)   A    239      239   8962
                                            G[1800] (Z1 < 3.6)   A    239      239   8962
                                            G[1800] (Z1 < 3.7)   A    239      239   8962
                                            G[1800] (Z1 < 3.8)   A    239      239   8962
L, W, sigma=5%       BFT, 0.9, 1000, 1, 1   G[1800] (Z1 < 3.3)   R             3     149
(fault/dislocation                          G[1800] (Z1 < 3.4)   R             3     149
length and width)                           G[1800] (Z1 < 3.5)   A    224      237   8865
                                            G[1800] (Z1 < 3.6)   A    44       44    1638
                                            G[1800] (Z1 < 3.7)   A    44       44    1638
                                            G[1800] (Z1 < 3.8)   A    44       44    1638
L, W, sigma=5%       BFT, 0.99, 1000, 1, 1  G[1800] (Z1 < 3.3)   R             2     77
(fault/dislocation                          G[1800] (Z1 < 3.4)   R    4        7     299
length and width)                           G[1800] (Z1 < 3.5)   R    306      319   11927
                                            G[1800] (Z1 < 3.6)   A    239      239   8939
                                            G[1800] (Z1 < 3.7)   A    239      239   8939
                                            G[1800] (Z1 < 3.8)   A    239      239   8939
37
Parameters          Test                    Property             p          Satisfy  All    Time[sec]
H, sigma=1%         BEST, 0.05, 0.9, 1, 1   G[1800] (Z1 < 3.3)   0.0434783           21     817
(C-H Bound: 460)                            G[1800] (Z1 < 3.4)   0.0434783           21     817
                                            G[1800] (Z1 < 3.5)   0.956522   21       21     817
                                            G[1800] (Z1 < 3.6)   0.956522   21       21     817
                                            G[1800] (Z1 < 3.7)   0.956522   21       21     817
                                            G[1800] (Z1 < 3.8)   0.956522   21       21     817
L, W, sigma=5%      BEST, 0.05, 0.9, 1, 1   G[1800] (Z1 < 3.3)   0.0434783           21     796
(C-H Bound: 460)                            G[1800] (Z1 < 3.4)   0.430189   113      263    9531
                                            G[1800] (Z1 < 3.5)   0.956522   21       21     796
                                            G[1800] (Z1 < 3.6)   0.956522   21       21     796
                                            G[1800] (Z1 < 3.7)   0.956522   21       21     796
                                            G[1800] (Z1 < 3.8)   0.956522   21       21     796
L, W, sigma=5%      BEST, 0.01, 0.9, 1, 1   G[1800] (Z1 < 3.3)   0.0251177  15       635    23803
(C-H Bound: 11513)                          G[1800] (Z1 < 3.4)   0.424508   2805     6608   249805
                                            G[1800] (Z1 < 3.5)   0.96146    947      984    36908
                                            G[1800] (Z1 < 3.6)   0.991304   113      113    4229
                                            G[1800] (Z1 < 3.7)   0.991304   113      113    4229
                                            G[1800] (Z1 < 3.8)   0.991304   113      113    4229
38
Summary: comparison among CPU (single core), FPGA, and GPU
– The main loop of the TUNAMI simulation is accelerated up to 46.0 GFLOPS (with the C version, 1.1 GFLOPS, as the reference)
– Performance [GFLOPS] for the 7200-cycle simulation (7200 sec, loop only): C 1.1, FPGA 46.0, GPU 41.5
– System: device memory 24[Gbyte] at 38[Gbyte/sec]; host memory 48[Gbyte]; FPGA Virtex6 SX475T; CPU Xeon X5650 @2.67GHz; FPGA board to host at 2[Gbyte/sec]; HDD
39
Accelerating SMC itself:
– But the data transfer speed between the FPGA/GPU board and the microprocessor (PCI-e) is not so fast
– Loop: accelerated Tsunami simulation -> Tsunami height -> checker (valid or not, T/F) -> statistical analysis -> finish?
40
– FPGA to host: 2Gbyte/sec (over the PCI Express bus)
– 28 bytes of data per clock cycle (16 bytes for input, 12 bytes for output)
– Needs 5.6Gbyte/sec at a 200MHz FPGA clock
41
Moving the checker into hardware:
– The data transfer can be reduced
– The acceleration of the Tsunami simulation can then be fully utilized in SMC
SW implementation of the SMC part: initial wave generation -> Tsunami simulation (FPGA@200MHz) -> 5.6 GB/sec (required) to the SMC checker in software -> statistical analysis -> pass or fail.
HW implementation of the (partial) SMC part: initial wave generation -> Tsunami simulation -> SMC checker on the FPGA -> only T/F is transferred -> statistical analysis -> pass or fail.
The checker is implemented with an FSM.
42
Bounded temporal operators: GtΦ, Φ1 Ut Φ2.
– LTL formulae can be checked in a bottom-up way in linear time
– Example: F(a -> (b U c))

    Time            1 2 3 4 5 6 7 8 9 10 11 12 13
    a               0 1 0 1 0 1 0 1 1 1  0  1  1
    b               1 1 1 0 0 0 1 0 0 1  1  1  1
    c               0 0 1 1 1 1 0 0 0 1  1  1  1
    b U c           1 1 1 1 1 1 0 0 0 1  1  1  1
    a -> (b U c)    1 1 1 1 1 1 1 0 0 1  1  1  1
    F(a -> (b U c)) 1 1 1 1 1 1 1 1 1 1  1  1  1
43
(Chart: performance improvement compared to the SW execution.)
44
– Space decomposition with GPU; time-wise pipelining with FPGA
– SMC could be time consuming with a SW-only implementation (15X speedup); with a HW implementation of the checker, 40X was achieved
– Future target example: rounding robustness of floating point computation with Monte Carlo Arithmetic