S7527 - Unstructured low-order finite-element earthquake simulation using OpenACC on Pascal GPUs
Takuma Yamaguchi, Kohei Fujita, Tsuyoshi Ichimura, Muneo Hori, Lalith Maddegedara
GPU Technology Conference May 11, 2017
Introduction
Goal: comprehensive simulation of the earthquake disaster process by full use of the CPU-based K computer system.
Earthquake disaster process
K computer: 8-core CPU × 82,944-node system with a peak performance of 10.6 PFLOPS (7th on the Top500 list).
[Figure: earthquake disaster process. a) Earthquake wave propagation; b) City response simulation; c) Resident evacuation, with two million agents evacuating to the nearest safe site around Tokyo Station, Shinjuku, Ikebukuro, Shibuya, Shinbashi, and Ueno. Panels cover the earthquake and post-earthquake phases.]
World's largest finite-element simulation enabled by the developed solver
Solve Ku = f, where K is a sparse, symmetric positive-definite matrix, u is the unknown vector with 1 trillion degrees of freedom, and f is the external force vector.
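Because K is symmetric positive definite and far too large to assemble, a Krylov iteration that needs only matrix-vector products is the natural fit. The sketch below is a minimal, unpreconditioned conjugate-gradient loop, not the presenters' production solver; the names cg_solve and apply_k are hypothetical, and apply_k stands for the matrix-free Element-by-Element product described next.

! Minimal CG sketch: K is never assembled; K*p is computed matrix-free.
! cg_solve and apply_k are illustrative names, not the talk's code.
subroutine cg_solve(n, u, f, tol, maxiter)
  implicit none
  integer, intent(in) :: n, maxiter
  real(8), intent(in) :: f(n), tol
  real(8), intent(inout) :: u(n)
  real(8) :: r(n), p(n), q(n), alpha, beta, rho, rho_old
  integer :: iter

  call apply_k(n, u, q)            ! q = K u (Element-by-Element)
  r = f - q
  p = r
  rho = dot_product(r, r)
  do iter = 1, maxiter
    call apply_k(n, p, q)          ! q = K p (Element-by-Element)
    alpha = rho / dot_product(p, q)
    u = u + alpha*p                ! update solution
    r = r - alpha*q                ! update residual
    rho_old = rho
    rho = dot_product(r, r)
    if (sqrt(rho) <= tol) exit     ! converged
    beta = rho / rho_old
    p = r + beta*p                 ! new search direction
  end do
end subroutine cg_solve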
Element-by-Element method: compute f = Σ_e P_e K_e P_eᵀ u, where K_e is the element stiffness matrix (generated on the fly, so K is never stored) and P_e maps element-local to global degrees of freedom.
[Figure: contributions K_e u from Element #0, Element #1, …, Element #N-1 are accumulated (+=) into the global vector f.]
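As a hedged illustration of this product (not the presenters' kernel), each element gathers its nodal values, multiplies by its 12×12 stiffness matrix, and scatters the result back; cny is the element connectivity, and compute_ke is a hypothetical on-the-fly generator of K_e from nodal coordinates and material data.

! Element-by-Element product f = sum_e Pe Ke Pe^T u for linear
! tetrahedra (4 nodes x 3 DOF); serial reference version.
! cny and compute_ke are illustrative names.
subroutine ebe_matvec(ne, nn, cny, u, f)
  implicit none
  integer, intent(in) :: ne, nn, cny(4, ne)
  real(8), intent(in) :: u(3, nn)
  real(8), intent(out) :: f(3, nn)
  real(8) :: ke(12, 12), ue(12), fe(12)
  integer :: ie, j

  f = 0.0d0
  do ie = 1, ne
    do j = 1, 4                          ! gather: ue = Pe^T u
      ue(3*j-2:3*j) = u(:, cny(j, ie))
    end do
    call compute_ke(ie, ke)              ! Ke generated on the fly
    fe = matmul(ke, ue)                  ! fe = Ke ue
    do j = 1, 4                          ! scatter: f += Pe fe
      f(:, cny(j, ie)) = f(:, cny(j, ie)) + fe(3*j-2:3*j)
    end do
  end do
end subroutine ebe_matvec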
Structured/unstructured hybrid mesh: replace the pure unstructured mesh with a combination of structured (voxel) and unstructured (tetrahedral) parts.
[Figure: pure unstructured mesh vs. hybrid mesh with structured and unstructured regions.]
Operation count for the Element-by-Element kernel (linear elements): moving from tetrahedral to voxel elements reduces the FLOP count to 1/3.6 and random register-to-L1-cache accesses to 1/3.0.
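A hedged sketch of where those savings come from (illustrative names; assumes uniform voxel size and material so all voxel elements can share one precomputed 24×24 stiffness matrix ke0): the on-the-fly generation of K_e drops out of the loop, and connectivity is computed from (i, j, k) instead of being loaded from a connectivity array, which removes the random accesses.

! EBE loop over the structured (voxel) part: one shared stiffness
! matrix ke0 and index-computed connectivity. Illustrative sketch only.
subroutine ebe_matvec_voxel(nx, ny, nz, ke0, u, f)
  implicit none
  integer, intent(in) :: nx, ny, nz            ! nodes per direction
  real(8), intent(in) :: ke0(24, 24), u(3, nx*ny*nz)
  real(8), intent(inout) :: f(3, nx*ny*nz)
  real(8) :: ue(24), fe(24)
  integer :: i, j, k, m, n(8)

  do k = 1, nz - 1
    do j = 1, ny - 1
      do i = 1, nx - 1
        n(1) = i + (j-1)*nx + (k-1)*nx*ny      ! regular connectivity
        n(2) = n(1) + 1
        n(3) = n(1) + nx + 1
        n(4) = n(1) + nx
        n(5:8) = n(1:4) + nx*ny
        do m = 1, 8                            ! gather
          ue(3*m-2:3*m) = u(:, n(m))
        end do
        fe = matmul(ke0, ue)                   ! shared Ke, no generation
        do m = 1, 8                            ! scatter
          f(:, n(m)) = f(:, n(m)) + fe(3*m-2:3*m)
        end do
      end do
    end do
  end do
end subroutine ebe_matvec_voxel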
Element-by-Element method and its data recurrence:
[Figure: contributions K_e u from Element #0, Element #1, …, Element #N-1 are added (+=) into the global vector f; elements sharing a node add into the same entries, causing a data recurrence.]
Atomic add:
[Figure: Element #0, Element #1, … add their contributions into f with atomic operations, while u is read through the cache.]
Atomic add version:

!$ACC DATA PRESENT(…)
…
!$ACC PARALLEL LOOP
do i=1,ne
  ! read arrays
  ...
  ! compute Ku
  Ku11=…
  Ku12=…
  ...
  ! add to global vector
  !$ACC ATOMIC
  f(1,cny1)=Ku11+f(1,cny1)
  !$ACC ATOMIC
  f(2,cny1)=Ku21+f(2,cny1)
  ...
  !$ACC ATOMIC
  f(3,cny4)=Ku34+f(3,cny4)
enddo
!$ACC END DATA
Coloring add version:

!$ACC DATA PRESENT(...)
...
do icolor=1,ncolor
  !$ACC PARALLEL LOOP
  do i=ns(icolor),ne(icolor)
    ! read arrays
    ...
    ! compute Ku
    Ku11=…
    Ku12=…
    ...
    ! add to global vector
    f(1,cny1)=Ku11+f(1,cny1)
    f(2,cny1)=Ku21+f(2,cny1)
    ...
    f(3,cny4)=Ku34+f(3,cny4)
  enddo
enddo
!$ACC END DATA
a) Coloring add avoids the recurrence by looping over colors whose elements share no nodes, but it shortens each parallel loop and loses cache reuse of u; b) atomic add keeps a single long loop and lets the hardware resolve conflicting updates.
[Figure: elapsed time per EBE call (ms) for coloring vs. atomic add on K40 and P100; the atomic version runs in 1/4.2 of the coloring time on K40 and 1/2.8 on P100.]
[Figure: elapsed time per EBE call (ms) for tetrahedral vs. voxel elements on K40 and P100; switching Tetra⇒Voxel reduces the elapsed time on both GPUs, e.g. to 1/1.81.]
Overlapping communication with computation:
[Figure: each GPU (#0, #1, #2) runs the EBE kernel on its boundary part first, then packs the boundary values, exchanges them via MPI send/receive, and unpacks the received values, while the EBE kernel on the larger inner part runs concurrently.]
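A hedged OpenACC + MPI sketch of this overlap (halo_exchange, bidx, sbuf, rbuf and the single neighbor rank are hypothetical): the boundary exchange is serialized on async queue 1, while the inner-part EBE kernel, assumed to have been launched on queue 2, keeps the GPU busy until the final wait.

! Boundary exchange on async queue 1; the inner-part EBE kernel is
! assumed to be running on queue 2. All names are illustrative.
subroutine halo_exchange(nf, f, nb, bidx, sbuf, rbuf, nbr, comm)
  use mpi
  implicit none
  integer, intent(in) :: nf, nb, bidx(nb), nbr, comm
  real(8), intent(inout) :: f(nf), sbuf(nb), rbuf(nb)
  integer :: i, req(2), ierr

  !$ACC PARALLEL LOOP PRESENT(f, sbuf, bidx) ASYNC(1)
  do i = 1, nb                        ! pack boundary values on the GPU
    sbuf(i) = f(bidx(i))
  enddo
  !$ACC UPDATE HOST(sbuf) ASYNC(1)    ! stage send buffer through host
  !$ACC WAIT(1)

  call MPI_Irecv(rbuf, nb, MPI_REAL8, nbr, 0, comm, req(1), ierr)
  call MPI_Isend(sbuf, nb, MPI_REAL8, nbr, 0, comm, req(2), ierr)
  call MPI_Waitall(2, req, MPI_STATUSES_IGNORE, ierr)

  !$ACC UPDATE DEVICE(rbuf) ASYNC(1)
  !$ACC PARALLEL LOOP PRESENT(f, rbuf, bidx) ASYNC(1)
  do i = 1, nb                        ! unpack: add neighbor values
    f(bidx(i)) = f(bidx(i)) + rbuf(i) ! bidx assumed duplicate-free
  enddo
  !$ACC WAIT                          ! join queues 1 and 2
end subroutine halo_exchange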
Computation time in the EBE kernel, compared across the K computer, a GPU cluster (K40), and NVIDIA DGX-1 (P100):

System       | # of nodes | CPU/node            | GPU/node | Hardware peak FLOPS | Memory bandwidth
K computer   | 8          | 1 x SPARC64 VIIIfx  | -        | -                   | 512 GB/s
GPU cluster  | 4          | 2 x Xeon E5-2695 v2 | 2 x K40  | 34.3 TFLOPS         | 2.30 TB/s
NVIDIA DGX-1 | 1          | 2 x Xeon E5-2698 v4 | 8 x P100 | 84.8 TFLOPS         | 5.76 TB/s
[Figure: elapsed time (s) per system, split into the target part and the other part (CPU).]