Optimizing MPI Intra-node Communication with New Task Model for Many-core Systems
Research and Development Group, Hitachi, Ltd. Akio SHIMADA
LENS INTERNATIONAL WORKSHOP 2015
Background
The number of cores within a single node has kept increasing since the appearance of the multi-core processor, so many parallel processes now run on one node. MPI intra-node communication therefore becomes more important on many-core systems, and this work tries to accelerate intra-node communication.
[Figure: parallel processes on a multi-core node (a few cores, one process per core) vs. a many-core node (many cores, one process per core); intra-node communication takes place among the processes within each node]
Conventional Intra-node Communication Schemes
Shared memory: two memory copies (send buffer to an intermediate buffer in shared memory, then to the receive buffer) are produced for every communication.
OS kernel assistance (KNEM, LiMIC, etc.): only one memory copy, but a system call is required for every communication.
(A minimal sketch of the shared-memory path follows the figure below.)
[Figure: shared-memory scheme (the sender copies the send buffer into an intermediate buffer in shared memory, and the receiver copies it into the receive buffer: two memory copies) vs. OS-kernel-assisted scheme (the OS kernel copies the data from the send buffer to the receive buffer: one memory copy)]
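As an illustration of the conventional two-copy path, here is a minimal C sketch. The POSIX shared-memory segment name /mpi_intra_buf and the buffer size are hypothetical, synchronization and error handling are omitted, and this is not Open MPI's actual SM BTL code.

```c
/* Minimal sketch of the conventional two-copy shared-memory scheme.
 * The shm name and size are illustrative; synchronization is omitted. */
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define INTERMEDIATE_SIZE (64 * 1024)

/* Sender side: copy #1, send buffer -> intermediate buffer in shared memory. */
static void sm_send(const void *send_buf, size_t len)
{
    int fd = shm_open("/mpi_intra_buf", O_CREAT | O_RDWR, 0600);
    ftruncate(fd, INTERMEDIATE_SIZE);
    void *inter = mmap(NULL, INTERMEDIATE_SIZE, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
    memcpy(inter, send_buf, len);          /* copy #1 (by the sender) */
    munmap(inter, INTERMEDIATE_SIZE);
    close(fd);
}

/* Receiver side: copy #2, intermediate buffer -> receive buffer. */
static void sm_recv(void *recv_buf, size_t len)
{
    int fd = shm_open("/mpi_intra_buf", O_RDWR, 0600);
    void *inter = mmap(NULL, INTERMEDIATE_SIZE, PROT_READ, MAP_SHARED, fd, 0);
    memcpy(recv_buf, inter, len);          /* copy #2 (by the receiver) */
    munmap(inter, INTERMEDIATE_SIZE);
    close(fd);
}
```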
New task model: PVAS (Partitioned Virtual Address Space)
PVAS enables the parallel processes within the same node to run in the same address space. This removes the address space boundary from intra-node communication.
The PVAS task model divides a single virtual address space into PVAS partitions and assigns them to parallel processes (PVAS tasks). Each PVAS task can directly access the data of the other tasks (data within a PVAS partition assigned to the other PVAS task). An illustrative sketch of this layout follows the figure below.
[Figure: normal task model (Process 0 and Process 1 each own a separate address space containing TEXT, DATA&BSS, HEAP, STACK, and the kernel region) vs. PVAS task model (PVAS Task 0 and PVAS Task 1 occupy PVAS Partition 0 and PVAS Partition 1, ..., within one address space from low to high addresses; each partition has its own TEXT, DATA&BSS, HEAP, and STACK, and the kernel region is shared)]
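To illustrate the partitioned layout, the following sketch assumes a hypothetical base address, partition size, and helper (PVAS_BASE, PVAS_PARTITION_SIZE, pvas_partition_addr). These names are not the real PVAS interface; they only show that a peer task's data is reachable by ordinary loads and stores once all tasks share one address space.

```c
/* Illustrative only: the constants and helper below are hypothetical,
 * not the actual PVAS API. Every task's TEXT/DATA/HEAP/STACK live at a
 * task-specific offset in one shared address space, so a peer's data
 * is reachable by plain load/store. */
#include <stdint.h>
#include <string.h>

#define PVAS_BASE            0x100000000000ULL   /* hypothetical layout base    */
#define PVAS_PARTITION_SIZE  0x010000000000ULL   /* hypothetical partition size */

static inline void *pvas_partition_addr(int task_id, uint64_t offset)
{
    return (void *)(PVAS_BASE + (uint64_t)task_id * PVAS_PARTITION_SIZE + offset);
}

/* Task 1 reading a buffer that task 0 placed at a known offset:
 * no system call and no address-space crossing, just a memcpy. */
static void read_peer(void *dst, uint64_t peer_offset, size_t len)
{
    memcpy(dst, pvas_partition_addr(0, peer_offset), len);
}
```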
MPI processes run as PVAS tasks within a single address space (there are no address space boundaries among them), so one MPI process can access another process's buffers directly, without overheads for crossing an address space boundary.
A new PVAS BTL was implemented in the Byte Transfer Layer (BTL) of the Open MPI. It is compared against the existing shared-memory BTL (SM) and the SM BTL with OS kernel assistance (using KNEM). The PVAS BTL performs one-copy intra-node communication without OS kernel assistance by using the PVAS facility.
PVAS BTL communication:
[Figure: MPI Process 0 (PVAS Task 0) with a send buffer and MPI Process 1 (PVAS Task 1) with a receive buffer, both in the same address space]
① The sender posts the pointer to the send buffer. ② The receiver copies the data directly from the send buffer into the receive buffer. Only one memory copy occurs when transferring the data.
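A minimal sketch of this one-copy exchange is shown below. The single shared request slot and its field names are hypothetical; a real BTL uses lock-free queues and handles message matching, fragmentation, and completion.

```c
/* Minimal sketch of the one-copy protocol enabled by the shared address
 * space. The request slot below is hypothetical and handles exactly one
 * message between one sender/receiver pair. */
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

struct post {
    const void *src;           /* pointer into the sender's partition  */
    size_t      len;
    atomic_int  ready;         /* 0 = empty, 1 = posted, 2 = consumed  */
};

/* Lives in memory reachable by both tasks (same address space). */
static struct post slot;

/* Step 1: the sender only publishes a pointer to its send buffer. */
void pvas_send(const void *send_buf, size_t len)
{
    slot.src = send_buf;
    slot.len = len;
    atomic_store_explicit(&slot.ready, 1, memory_order_release);
    while (atomic_load_explicit(&slot.ready, memory_order_acquire) != 2)
        ;                      /* wait until the receiver has copied */
}

/* Step 2: the receiver copies directly from the send buffer (one copy). */
void pvas_recv(void *recv_buf, size_t max_len)
{
    while (atomic_load_explicit(&slot.ready, memory_order_acquire) != 1)
        ;                      /* wait for a posted message */
    size_t n = slot.len < max_len ? slot.len : max_len;
    memcpy(recv_buf, slot.src, n);     /* the only memory copy */
    atomic_store_explicit(&slot.ready, 2, memory_order_release);
}
```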
Evaluation: intra-node point-to-point latency was measured with the Intel MPI Benchmarks.
OS kernel assistance (KNEM) performs poorly when the message size is small because of the system call overhead; PVAS avoids this overhead.
1" 10" 100" 1000" 10000" 100000" 1000000" 64" 128" 256" 512" 1K" 2K" 4K" 8K" 16K" 32K" 64K" 128K" 256K" 512K" 1M" 2M" 4M" 8M" 16M" 32M"
Lanteyc"(usec) Message"Size"(Bytes) SM" SM"(KNEM)" PVAS"
NAS Parallel Benchmarks: PVAS improves the benchmark performance by up to 28%.
[Figure: performance improvement (%) for SM (KNEM) and PVAS on the NAS Parallel Benchmarks (MG, CG, FT, IS, LU, SP, BT) with CLASS A, B, and C problem sizes; some results are N/A]
Optimizing Non-contiguous Data Transfer Using Derived Data Types
Non-contiguous data described by derived data types can also be transferred with only one memory copy when using the PVAS facility. The sender and receiver exchange pointers to their data type information, and the data is copied directly between the send buffer and the receive buffer, consulting the data type information of both sides (a direct-copy sketch follows the figure below).
[Figure: non-contiguous data transfer with derived data types, SM BTL vs. PVAS BTL. SM BTL: the sender packs the data from the send buffer into the intermediate buffer in shared memory and posts the pointer to the intermediate buffer (memory copies by the sender), then the receiver copies and unpacks the data into the receive buffer (memory copy by the receiver). PVAS BTL: ① the sender and receiver exchange the pointers to their data type information, ② a single memory copy (by the sender or by the receiver) moves the data directly between the send buffer and the receive buffer]
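Below is a sketch of the direct copy between two non-contiguous buffers. The vector-style descriptor struct ddt_vec is a hypothetical simplification standing in for full MPI derived data type information; real datatypes are far more general.

```c
/* Hypothetical flattened description of a vector-like derived data type:
 * 'count' blocks of 'blocklen' bytes separated by 'stride' bytes. */
#include <stddef.h>
#include <string.h>

struct ddt_vec {
    size_t count;
    size_t blocklen;   /* bytes per contiguous block */
    size_t stride;     /* bytes between block starts */
};

/* Copy straight from the sender's non-contiguous buffer into the
 * receiver's non-contiguous buffer, consulting both descriptions.
 * No packing into an intermediate buffer is needed because the send
 * buffer is directly addressable in the shared address space. */
static void ddt_direct_copy(char *dst, const struct ddt_vec *ddst,
                            const char *src, const struct ddt_vec *dsrc)
{
    size_t si = 0, so = 0;     /* block index / byte offset on the send side */
    size_t di = 0, dof = 0;    /* block index / byte offset on the recv side */

    while (si < dsrc->count && di < ddst->count) {
        size_t s_left = dsrc->blocklen - so;
        size_t d_left = ddst->blocklen - dof;
        size_t n = s_left < d_left ? s_left : d_left;

        memcpy(dst + di * ddst->stride + dof,
               src + si * dsrc->stride + so, n);

        so  += n; if (so  == dsrc->blocklen) { so  = 0; si++; }
        dof += n; if (dof == ddst->blocklen) { dof = 0; di++; }
    }
}
```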
The latency of non-contiguous data transfer was measured with micro-benchmarks that model the derived data types of real applications (WRF, NAS MG, NAS LU, MILC, LAMMPS, FFT, and SPECFEM3D).
0" 200" 400" 600" 800" 1000" 1200" 43K" 55K" 63K" 75K" 90K"
WRF_y_sa SM" PVAS"
0" 1000" 2000" 3000" 4000" 5000" 6000" 63K" 102K" 173K"
WRF_x_sa SM" PVAS"
0" 200" 400" 600" 800" 1000" 1200" 43K" 55K" 63K" 75K" 90K"
WRF_y_vec SM" PVAS"
0" 1000" 2000" 3000" 4000" 5000" 6000" 63K" 102K" 173K"
WRF_x_vec SM" PVAS"
0" 10000" 20000" 30000" 40000" 50000" 60000" 70000" 2K" 32K" 131K" 524K"
NAS_MG_x SM" PVAS"
0" 1000" 2000" 3000" 4000" 5000" 6000" 7000" 4K" 65K" 262K" 1M"
NAS_MG_y SM" PVAS"
0" 1000" 2000" 3000" 4000" 5000" 6000" 7000" 4K" 65K" 262K" 1M"
NAS_MG_z SM" PVAS"
0" 100" 200" 300" 400" 500" 600" 700" 800" 12K" 24K" 49K" 98K"
MILC_su3_zd SM" PVAS"
X-axis: Data Size, Y-axis: Latency (usec)
[Figure: latency (usec, Y-axis) vs. data size (X-axis) for SM and PVAS on the NAS_LU_x, NAS_LU_y, LAMMPS_full, LAMMPS_atomic, FFT, SPECFEM3D_mt, SPECFEM3D_oc, and SPECFEM3D_cm patterns]
PVAS reduces the latency of non-contiguous data transfer between the processes in most of the patterns. The improvement is small in some cases: copying data directly from a complex data type buffer to another complex data type buffer incurs many cache misses during the data copy.
The fft2d benchmark code uses derived data types for matrix transpose on the send/recv side. PVAS improves the benchmark performance by up to 21%.
[Figure: fft2d_datatype results (NP=240), SM vs. PVAS]
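For readers unfamiliar with transpose data types, the following generic MPI example (not the fft2d benchmark's actual code) shows how a strided derived data type on the receive side makes a row-major matrix arrive transposed.

```c
/* Generic example: rank 0 sends an N x N row-major matrix contiguously;
 * rank 1 receives it through a resized column type, so it arrives transposed. */
#include <mpi.h>
#include <stdio.h>

#define N 4

int main(int argc, char **argv)
{
    int rank;
    double A[N][N], B[N][N];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                A[i][j] = i * N + j;
        MPI_Send(A, N * N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Datatype col, coltype;
        /* One column of a row-major N x N matrix: N elements with stride N. */
        MPI_Type_vector(N, 1, N, MPI_DOUBLE, &col);
        /* Shrink the extent to one double so consecutive columns interleave. */
        MPI_Type_create_resized(col, 0, sizeof(double), &coltype);
        MPI_Type_commit(&coltype);

        /* Receiving N columns places row i of A into column i of B. */
        MPI_Recv(B, N, coltype, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        MPI_Type_free(&col);
        MPI_Type_free(&coltype);
        printf("B[0][1] = %g (was A[1][0])\n", B[0][1]);
    }

    MPI_Finalize();
    return 0;
}
```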
RDMA is efficient for transferring contiguous data. [Mingzhe et al., IEEE Cluster'15] shows that non-contiguous data can also be transferred through one RDMA operation by using derived data types.
Conclusion
MPI intra-node communication was optimized with the new PVAS task model for many-core systems. PVAS removes the overhead of crossing address space boundaries from intra-node communication by running the parallel processes within the same address space. The scheme was implemented in the Byte Transfer Layer of Open MPI and reduces intra-node communication latency for both contiguous and non-contiguous data.