Optimizing MPI Intra-node Communication with New Task Model for Many-core Systems - PowerPoint PPT Presentation


SLIDE 1

Optimizing MPI Intra-node Communication with New Task Model for Many-core Systems

Research and Development Group, Hitachi, Ltd. Akio SHIMADA

LENS INTERNATIONAL WORKSHOP 2015

SLIDE 2

Background

  • A large number of parallel processes can be invoked within a node on a many-core system
  • MPI and some PGAS language runtimes invoke multiple processes on many-core systems (e.g., hybrid MPI)
  • Fast intra-node communication is required
  • Many studies have proposed a variety of intra-node communication schemes (e.g., KNEM, LiMIC) since the appearance of the multi-core processor, aiming to accelerate intra-node communication

[Figure: communication on a multi-core node vs. a many-core node; each core runs one of the parallel processes, and the many-core node holds far more processes per node]

SLIDE 3

Conventional Intra-node Communication Schemes

  • Overheads for "crossing address space boundaries among processes" are produced
  • There are address space boundaries among processes

Shared Memory

  • Double-copy via shared memory is required for every communication (a minimal sketch of this scheme follows below)

OS kernel assistance (KNEM, LiMIC, etc.)

  • A system call overhead is produced for every communication

[Figure: in the shared-memory scheme, the sender copies data from its send buffer into an intermediate buffer in shared memory and the receiver copies it from there into its receive buffer (two memory copies); in the OS-kernel-assisted scheme, the data is copied from the send buffer to the receive buffer through the OS kernel (one memory copy)]
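
To make the double-copy cost concrete, here is a minimal sketch of the shared-memory scheme under simplifying assumptions: a single fixed-size message slot already mapped into both processes (e.g., via shm_open and mmap). The names (shm_slot_t, shm_send, shm_recv) are invented for this example and are not Open MPI internals.

/* Minimal sketch of the double-copy scheme, assuming a single fixed-size
 * message slot already mapped into both processes.  Illustrative only. */
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

#define SLOT_SIZE 8192

typedef struct {
    atomic_int full;                 /* 0 = empty, 1 = holds a message */
    size_t     len;                  /* must be <= SLOT_SIZE           */
    char       data[SLOT_SIZE];      /* intermediate buffer            */
} shm_slot_t;

/* Copy #1: the sender copies from its private send buffer into the
 * intermediate buffer that lives in shared memory. */
void shm_send(shm_slot_t *slot, const void *send_buf, size_t len)
{
    while (atomic_load(&slot->full))      /* wait until the slot is free */
        ;
    memcpy(slot->data, send_buf, len);
    slot->len = len;
    atomic_store(&slot->full, 1);
}

/* Copy #2: the receiver copies from the intermediate buffer into its
 * private receive buffer -- the second copy the slide refers to. */
size_t shm_recv(shm_slot_t *slot, void *recv_buf)
{
    while (!atomic_load(&slot->full))     /* wait for a message */
        ;
    size_t len = slot->len;
    memcpy(recv_buf, slot->data, len);
    atomic_store(&slot->full, 0);
    return len;
}

The OS-kernel-assisted schemes (KNEM, LiMIC) avoid the second copy by letting the kernel copy between the two private buffers, but pay a system call per message instead.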

SLIDE 4

Proposal

  • Partitioned Virtual Address Space (PVAS)
  • A new task model for efficient parallel processing on many-core systems
  • PVAS makes it possible for parallel processes within the same node to run in the same address space
  • PVAS can remove the overhead of crossing address space boundaries from intra-node communication


SLIDE 5

Address Space Layout

  • PVAS partitions a single address space into multiple segments (PVAS partitions) and assigns them to parallel processes (PVAS tasks); a minimal sketch follows the figure note below
  • Parallel processes use the same page table for managing memory mapping information
  • A PVAS task can use only its own PVAS partition as its local memory (it cannot allocate memory within a PVAS partition assigned to another PVAS task)
  • A PVAS task is almost the same as a normal process, except that it shares the address space with the other processes

[Figure: address space layouts, from low to high addresses. Normal task model: Process 0 and Process 1 each have their own address space containing TEXT, DATA&BSS, HEAP, STACK, and the KERNEL region. PVAS task model: PVAS Task 0 and PVAS Task 1 occupy PVAS Partition 0 and PVAS Partition 1 (each with TEXT, DATA&BSS, HEAP, STACK) within one shared address space, with a single KERNEL region]
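
As a rough illustration of the partitioned layout, the sketch below derives a partition's base address from the task id under the assumption of fixed-size, consecutively placed partitions. PVAS_BASE, PVAS_PARTITION_SIZE, and the helper names are invented for this example and are not the real PVAS interface.

/* Illustrative only: with fixed-size partitions laid out consecutively,
 * the location of any task's partition follows from its task id.
 * The constants below are made up for the example. */
#include <stdint.h>

#define PVAS_BASE            0x100000000000ULL   /* hypothetical base */
#define PVAS_PARTITION_SIZE  0x000010000000ULL   /* hypothetical size */

/* Start address of the partition owned by PVAS task `id`. */
static inline uintptr_t pvas_partition_base(int id)
{
    return (uintptr_t)(PVAS_BASE + (uint64_t)id * PVAS_PARTITION_SIZE);
}

/* Translate an address inside my own partition into the corresponding
 * address inside a peer's partition (same offset, different task). */
static inline void *pvas_peer_addr(void *my_addr, int my_id, int peer_id)
{
    uintptr_t off = (uintptr_t)my_addr - pvas_partition_base(my_id);
    return (void *)(pvas_partition_base(peer_id) + off);
}

Because all tasks share one page table, a pointer obtained this way (or any pointer a peer publishes) can be dereferenced directly with ordinary load/store instructions, which is what the following slides exploit.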

SLIDE 6

PVAS Feature

  • All memory of a PVAS task is exposed to the other PVAS tasks within the same node
  • A PVAS task can access the memory of the other PVAS tasks with load/store instructions (there are no address space boundaries among them)
  • A pair of PVAS tasks can exchange data without the overhead of crossing an address space boundary


SLIDE 7

Optimizing Open MPI by PVAS

  • A PVAS BTL component is implemented in the Byte Transfer Layer (BTL) of Open MPI
  • SM BTL
  • Supports double-copy communication via shared memory
  • Supports single-copy communication with OS kernel assistance (using KNEM)
  • PVAS BTL (developed on the basis of the SM BTL)
  • Copies the data from the send buffer to the receive buffer without OS kernel assistance by using the PVAS facility


SLIDE 8

PVAS BTL

[Figure: MPI Process 0 (PVAS Task 0) with its send buffer and MPI Process 1 (PVAS Task 1) with its receive buffer. ① The sender posts the pointer to the send buffer; ② the receiver copies the data from the send buffer]

  • MPI processes are invoked as PVAS tasks
  • The data is copied from the send buffer to the receive buffer directly (a sketch of the protocol follows below)
  • The overhead of crossing an address space boundary is not produced when transferring the data
  • Single-copy communication (avoids an extra memory copy)
  • OS kernel assistance is not necessary (avoids the system call overhead)
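
A minimal sketch of the post-pointer/copy protocol described above, assuming a per-pair request descriptor at an address both tasks know (for example, inside the sender's partition). The structure and function names are hypothetical, not the actual PVAS BTL code.

/* Illustrative single-copy protocol between two PVAS tasks.
 * pvas_req_t and the function names are invented for this sketch. */
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    atomic_int  posted;       /* set to 1 once the request is filled in */
    const void *send_buf;     /* valid in the receiver too: both tasks  */
    size_t      len;          /* live in the same address space         */
} pvas_req_t;

/* (1) The sender posts the pointer to its send buffer.  No data moves. */
void pvas_btl_send(pvas_req_t *req, const void *send_buf, size_t len)
{
    req->send_buf = send_buf;
    req->len      = len;
    atomic_store_explicit(&req->posted, 1, memory_order_release);
}

/* (2) The receiver copies directly from the sender's buffer into its own
 * receive buffer: one memcpy, no intermediate buffer, no system call. */
size_t pvas_btl_recv(pvas_req_t *req, void *recv_buf)
{
    while (!atomic_load_explicit(&req->posted, memory_order_acquire))
        ;                                  /* wait for a posted request */
    size_t len = req->len;
    memcpy(recv_buf, req->send_buf, len);
    atomic_store_explicit(&req->posted, 0, memory_order_relaxed);
    return len;
}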
SLIDE 9

Evaluation Environment

  • Intel Xeon Phi 5110P
  • 1.053 GHz, 60 cores (4-way HT)
  • 32 KB L1 cache, 512 KB L2 cache
  • 8 GB of main memory
  • OS
  • Intel MPSS Linux 2.6.38.8 with the PVAS facility
  • MPI
  • Open MPI 1.8 with the PVAS BTL
SLIDE 10

Latency Evaluation

  • Ping-pong communication latency was measured by running the Intel MPI Benchmarks


  • The PVAS BTL outperforms the others regardless of the message size
  • The latency of the SM BTL (KNEM) is higher than that of the SM BTL when the message size is small because of the system call overhead

1" 10" 100" 1000" 10000" 100000" 1000000" 64" 128" 256" 512" 1K" 2K" 4K" 8K" 16K" 32K" 64K" 128K" 256K" 512K" 1M" 2M" 4M" 8M" 16M" 32M"

Lanteyc"(usec) Message"Size"(Bytes) SM" SM"(KNEM)" PVAS"

SLIDE 11

NAS Parallel Benchmarks (NPB)

  • Running NPB on a single node
  • Number of processes
  • 128 (MG, CG, FT, IS, LU)
  • 225 (SP, BT)
  • Problem size
  • CLASS A, B, C (A < B < C)
  • The PVAS BTL improves benchmark performance by up to 28% (SP, CLASS C)

[Figure: performance improvement (%) of SM (KNEM) and PVAS relative to the SM BTL for MG, CG, FT, IS, LU, SP, and BT with CLASS A, CLASS B, and CLASS C problem sizes; one result is N/A]

SLIDE 12

Optimizing Non-contiguous Data Transfer Using Derived Data Types

  • The sender and receiver exchange pointers to their data type information
  • An MPI process can access the MPI internal objects of the other MPI process when using the PVAS facility
  • The sender and receiver copy the data from the send buffer to the receive buffer, consulting both sides' data type information
  • The sender and receiver copy the data in parallel (a sketch follows the figure note below)

[Figure: non-contiguous transfer between MPI Process 0 (PVAS Task 0) and MPI Process 1 (PVAS Task 1). SM BTL: ① memory copy by the sender into an intermediate buffer in shared memory, ② the sender posts the pointer to the intermediate buffer, ③ memory copy by the receiver into the receive buffer. PVAS BTL: ① the sender and receiver exchange pointers to their data type information, ② memory copy by the sender and ②' memory copy by the receiver, performed in parallel from the send buffer to the receive buffer]

SLIDE 13

Latency Evaluation Using DDTBench (1/2)

  • DDTBench [Timo et al., EuroMPI'12] mimics the communication patterns of MPI applications by using derived data types
  • MPI processes send and receive the non-contiguous data found in WRF, MILC, NPB, LAMMPS, and SPECFEM3D

[Figure: DDTBench latency, SM vs. PVAS. Panels: WRF_y_sa, WRF_x_sa, WRF_y_vec, WRF_x_vec, NAS_MG_x, NAS_MG_y, NAS_MG_z, MILC_su3_zd. X-axis: data size, Y-axis: latency (usec)]

SLIDE 14

Latency Evaluation Using DDTBench (2/2)

[Figure: DDTBench latency, SM vs. PVAS. Panels: NAS_LU_x, NAS_LU_y, LAMMPS_full, LAMMPS_atomic, FFT, SPECFEM3D_mt, SPECFEM3D_oc, SPECFEM3D_cm. X-axis: data size, Y-axis: latency (usec)]

SLIDE 15

Latency Analysis

  • The performance improvement can be larger when the data size is large
  • The PVAS implementation can accelerate the data copy between processes
  • The time for the data copy has little impact when the message size is small
  • The performance improvement can be smaller when transferring data from a complex data type buffer to a complex data type buffer
  • Access to sparsely located data incurs many cache misses during the data copy


SLIDE 16

FFT2D_datatype

  • 2D Fast Fourier Transform code
  • Uses derived data types for the matrix transpose (a minimal illustration follows below)
  • Different vector types on the send/recv side
  • The PVAS BTL improves benchmark performance by up to 21%

[Figure: fft2d_datatype results (NP=240)]
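
For reference, a minimal two-rank example of the derived-data-type transpose idiom the benchmark relies on: one rank sends a row-major N x N matrix of doubles, and the other receives it with a column vector type whose extent is resized to one double, so the data lands transposed. This only shows the general technique; the benchmark's actual vector types and 240-process layout differ.

/* Minimal illustration of a matrix transpose via derived data types.
 * Rank 0 sends a row-major N x N matrix; rank 1 receives it with a
 * resized column type, so element (i,j) lands at (j,i).
 * Compile with mpicc; run with 2 processes. */
#include <mpi.h>
#include <stdio.h>

#define N 4

int main(int argc, char **argv)
{
    int rank;
    double A[N][N], B[N][N];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* One matrix column: N doubles with a stride of N doubles. */
    MPI_Datatype col, colt;
    MPI_Type_vector(N, 1, N, MPI_DOUBLE, &col);
    /* Shrink the extent to one double so consecutive columns interleave. */
    MPI_Type_create_resized(col, 0, (MPI_Aint)sizeof(double), &colt);
    MPI_Type_commit(&colt);

    if (rank == 0) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                A[i][j] = i * N + j;
        MPI_Send(&A[0][0], N * N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Receive N "columns": B becomes the transpose of A. */
        MPI_Recv(&B[0][0], N, colt, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("B[1][0] = %g (expected A[0][1] = 1)\n", B[1][0]);
    }

    MPI_Type_free(&colt);
    MPI_Type_free(&col);
    MPI_Finalize();
    return 0;
}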

SLIDE 17

Related Work

  • SMARTMAP [Ron et al., SC'08]
  • SMARTMAP enables a process to map the whole memory of the other processes into its address space
  • It is similar to PVAS, but the implementation is different
  • SMARTMAP accelerates MPI intra-node communication for transferring contiguous data
  • User-mode Memory Registration (UMR)
  • UMR is a function of Mellanox InfiniBand that makes it possible to transfer non-contiguous data through one RDMA operation
  • UMR accelerates MPI inter-node communication using derived data types [Mingzhe et al., IEEE Cluster'15]


SLIDE 18

Summary

  • We introduced the PVAS task model
  • A new task model for efficient parallel processing on many-core systems
  • PVAS removes the overhead of crossing address space boundaries from intra-node communication by running the parallel processes within the same address space
  • We optimized MPI intra-node communication by using the PVAS facility
  • We optimized contiguous and non-contiguous data transfers in Open MPI
  • The PVAS implementation outperforms the SM implementation
