Optimizing MPI Intra-node Communication with New Task Model for Many-core Systems - PowerPoint PPT Presentation


SLIDE 1

Optimizing MPI Intra-node Communication with New Task Model for Many-core Systems

Research and Development Group, Hitachi, Ltd. Akio SHIMADA

LENS INTERNATIONAL WORKSHOP 2015

SLIDE 2

Background

  • A large number of parallel processes can be invoked within a node on a many-core system
  • MPI and some PGAS language runtimes invoke multiple processes on many-core systems (e.g., hybrid MPI)
  • Fast intra-node communication is required
  • Many studies have proposed a variety of intra-node communication schemes (e.g., KNEM, LiMIC) since the appearance of the multi-core processor, aiming to accelerate intra-node communication

[Figure: communication on a multi-core node vs. a many-core node; each core runs one of the parallel processes, and the many-core node holds far more processes per node]

SLIDE 3

Conventional Intra-node Communication Schemes

  • Overheads for "crossing address space boundaries among processes" are produced
  • There are address space boundaries among processes

Shared Memory

  • Double-copy via shared memory is required for every communication (a minimal sketch of this scheme follows below)

OS kernel assistance (KNEM, LiMIC, etc.)

  • A system call overhead is produced for every communication

[Figure: in the shared-memory scheme, the sender copies data from its send buffer into an intermediate buffer in shared memory and the receiver copies it from there into its receive buffer (two memory copies); in the OS-kernel-assisted scheme, the data is copied from the send buffer to the receive buffer through the OS kernel (one memory copy)]
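
To make the double-copy cost concrete, here is a minimal sketch of the shared-memory scheme under simplifying assumptions: a single fixed-size message slot already mapped into both processes (e.g., via shm_open and mmap). The names (shm_slot_t, shm_send, shm_recv) are invented for this example and are not Open MPI internals.

/* Minimal sketch of the double-copy scheme, assuming a single fixed-size
 * message slot already mapped into both processes.  Illustrative only. */
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

#define SLOT_SIZE 8192

typedef struct {
    atomic_int full;                 /* 0 = empty, 1 = holds a message */
    size_t     len;                  /* must be <= SLOT_SIZE           */
    char       data[SLOT_SIZE];      /* intermediate buffer            */
} shm_slot_t;

/* Copy #1: the sender copies from its private send buffer into the
 * intermediate buffer that lives in shared memory. */
void shm_send(shm_slot_t *slot, const void *send_buf, size_t len)
{
    while (atomic_load(&slot->full))      /* wait until the slot is free */
        ;
    memcpy(slot->data, send_buf, len);
    slot->len = len;
    atomic_store(&slot->full, 1);
}

/* Copy #2: the receiver copies from the intermediate buffer into its
 * private receive buffer -- the second copy the slide refers to. */
size_t shm_recv(shm_slot_t *slot, void *recv_buf)
{
    while (!atomic_load(&slot->full))     /* wait for a message */
        ;
    size_t len = slot->len;
    memcpy(recv_buf, slot->data, len);
    atomic_store(&slot->full, 0);
    return len;
}

The OS-kernel-assisted schemes (KNEM, LiMIC) avoid the second copy by letting the kernel copy between the two private buffers, but pay a system call per message instead.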

SLIDE 4

Proposal

  • Partitioned Virtual Address Space (PVAS)
  • A new task model for efficient parallel processing on many-core systems
  • PVAS makes it possible for parallel processes within the same node to run in the same address space
  • PVAS can remove the overhead of crossing address space boundaries from intra-node communication


SLIDE 5

Address Space Layout

  • PVAS partitions a single address space into multiple segments (PVAS partitions) and assigns them to parallel processes (PVAS tasks); a minimal sketch follows the figure note below
  • Parallel processes use the same page table for managing memory mapping information
  • A PVAS task can use only its own PVAS partition as its local memory (it cannot allocate memory within a PVAS partition assigned to another PVAS task)
  • A PVAS task is almost the same as a normal process, except that it shares the address space with the other processes

[Figure: address space layouts, from low to high addresses. Normal task model: Process 0 and Process 1 each have their own address space containing TEXT, DATA&BSS, HEAP, STACK, and the KERNEL region. PVAS task model: PVAS Task 0 and PVAS Task 1 occupy PVAS Partition 0 and PVAS Partition 1 (each with TEXT, DATA&BSS, HEAP, STACK) within one shared address space, with a single KERNEL region]
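
As a rough illustration of the partitioned layout, the sketch below derives a partition's base address from the task id under the assumption of fixed-size, consecutively placed partitions. PVAS_BASE, PVAS_PARTITION_SIZE, and the helper names are invented for this example and are not the real PVAS interface.

/* Illustrative only: with fixed-size partitions laid out consecutively,
 * the location of any task's partition follows from its task id.
 * The constants below are made up for the example. */
#include <stdint.h>

#define PVAS_BASE            0x100000000000ULL   /* hypothetical base */
#define PVAS_PARTITION_SIZE  0x000010000000ULL   /* hypothetical size */

/* Start address of the partition owned by PVAS task `id`. */
static inline uintptr_t pvas_partition_base(int id)
{
    return (uintptr_t)(PVAS_BASE + (uint64_t)id * PVAS_PARTITION_SIZE);
}

/* Translate an address inside my own partition into the corresponding
 * address inside a peer's partition (same offset, different task). */
static inline void *pvas_peer_addr(void *my_addr, int my_id, int peer_id)
{
    uintptr_t off = (uintptr_t)my_addr - pvas_partition_base(my_id);
    return (void *)(pvas_partition_base(peer_id) + off);
}

Because all tasks share one page table, a pointer obtained this way (or any pointer a peer publishes) can be dereferenced directly with ordinary load/store instructions, which is what the following slides exploit.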

SLIDE 6

PVAS Feature

  • All memory of a PVAS task is exposed to the other PVAS tasks within the same node
  • A PVAS task can access the memory of the other PVAS tasks with load/store instructions (there are no address space boundaries among them)
  • A pair of PVAS tasks can exchange data without the overhead of crossing an address space boundary


SLIDE 7

Optimizing Open MPI by PVAS

  • A PVAS BTL component is implemented in the Byte Transfer Layer (BTL) of Open MPI
  • SM BTL
  • Supports double-copy communication via shared memory
  • Supports single-copy communication with OS kernel assistance (using KNEM)
  • PVAS BTL (developed on the basis of the SM BTL)
  • Copies the data from the send buffer to the receive buffer without OS kernel assistance by using the PVAS facility


SLIDE 8

PVAS BTL

[Figure: MPI Process 0 (PVAS Task 0) with its send buffer and MPI Process 1 (PVAS Task 1) with its receive buffer. ① The sender posts the pointer to the send buffer; ② the receiver copies the data from the send buffer]

  • MPI processes are invoked as PVAS tasks
  • The data is copied from the send buffer to the receive buffer directly (a sketch of the protocol follows below)
  • The overhead of crossing an address space boundary is not produced when transferring the data
  • Single-copy communication (avoids an extra memory copy)
  • OS kernel assistance is not necessary (avoids the system call overhead)
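
A minimal sketch of the post-pointer/copy protocol described above, assuming a per-pair request descriptor at an address both tasks know (for example, inside the sender's partition). The structure and function names are hypothetical, not the actual PVAS BTL code.

/* Illustrative single-copy protocol between two PVAS tasks.
 * pvas_req_t and the function names are invented for this sketch. */
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    atomic_int  posted;       /* set to 1 once the request is filled in */
    const void *send_buf;     /* valid in the receiver too: both tasks  */
    size_t      len;          /* live in the same address space         */
} pvas_req_t;

/* (1) The sender posts the pointer to its send buffer.  No data moves. */
void pvas_btl_send(pvas_req_t *req, const void *send_buf, size_t len)
{
    req->send_buf = send_buf;
    req->len      = len;
    atomic_store_explicit(&req->posted, 1, memory_order_release);
}

/* (2) The receiver copies directly from the sender's buffer into its own
 * receive buffer: one memcpy, no intermediate buffer, no system call. */
size_t pvas_btl_recv(pvas_req_t *req, void *recv_buf)
{
    while (!atomic_load_explicit(&req->posted, memory_order_acquire))
        ;                                  /* wait for a posted request */
    size_t len = req->len;
    memcpy(recv_buf, req->send_buf, len);
    atomic_store_explicit(&req->posted, 0, memory_order_relaxed);
    return len;
}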
SLIDE 9

Evaluation Environment

  • Intel Xeon Phi 5110P
  • 1.053 GHz, 60 cores (4-way HT)
  • 32 KB L1 cache, 512 KB L2 cache
  • 8 GB of main memory
  • OS
  • Intel MPSS Linux 2.6.38.8 with the PVAS facility
  • MPI
  • Open MPI 1.8 with the PVAS BTL
SLIDE 10

Latency Evaluation

  • Ping-pong communication latency was measured by running the Intel MPI Benchmarks


  • The PVAS BTL outperforms the others regardless of the message size
  • The latency of the SM BTL (KNEM) is higher than that of the SM BTL when the message size is small because of the system call overhead

1" 10" 100" 1000" 10000" 100000" 1000000" 64" 128" 256" 512" 1K" 2K" 4K" 8K" 16K" 32K" 64K" 128K" 256K" 512K" 1M" 2M" 4M" 8M" 16M" 32M"

Lanteyc"(usec) Message"Size"(Bytes) SM" SM"(KNEM)" PVAS"

SLIDE 11

NAS Parallel Benchmarks (NPB)

  • Running NPB on a single node
  • Number of processes
  • 128 (MG, CG, FT, IS, LU)
  • 225 (SP, BT)
  • Problem size
  • CLASS A, B, C (A < B < C)
  • The PVAS BTL improves benchmark performance by up to 28% (SP, CLASS C)

[Figure: performance improvement (%) of SM (KNEM) and PVAS relative to the SM BTL for MG, CG, FT, IS, LU, SP, and BT with CLASS A, CLASS B, and CLASS C problem sizes; one result is N/A]

SLIDE 12

Optimizing Non-contiguous Data Transfer Using Derived Data Types

  • The sender and receiver exchange pointers to their data type information
  • An MPI process can access the MPI internal objects of the other MPI process when using the PVAS facility
  • The sender and receiver copy the data from the send buffer to the receive buffer, consulting both sides' data type information
  • The sender and receiver copy the data in parallel (a sketch follows the figure note below)

[Figure: non-contiguous transfer between MPI Process 0 (PVAS Task 0) and MPI Process 1 (PVAS Task 1). SM BTL: ① memory copy by the sender into an intermediate buffer in shared memory, ② the sender posts the pointer to the intermediate buffer, ③ memory copy by the receiver into the receive buffer. PVAS BTL: ① the sender and receiver exchange pointers to their data type information, ② memory copy by the sender and ②' memory copy by the receiver, performed in parallel from the send buffer to the receive buffer]

SLIDE 13

Latency Evaluation Using DDTBench (1/2)

  • DDTBench [Timo et al., EuroMPI'12] mimics the communication patterns of MPI applications by using derived data types
  • MPI processes send and receive the non-contiguous data found in WRF, MILC, NPB, LAMMPS, and SPECFEM3D

[Figure: DDTBench latency, SM vs. PVAS. Panels: WRF_y_sa, WRF_x_sa, WRF_y_vec, WRF_x_vec, NAS_MG_x, NAS_MG_y, NAS_MG_z, MILC_su3_zd. X-axis: data size, Y-axis: latency (usec)]

SLIDE 14

Latency Evaluation Using DDTBench (2/2)

[Figure: DDTBench latency, SM vs. PVAS. Panels: NAS_LU_x, NAS_LU_y, LAMMPS_full, LAMMPS_atomic, FFT, SPECFEM3D_mt, SPECFEM3D_oc, SPECFEM3D_cm. X-axis: data size, Y-axis: latency (usec)]

SLIDE 15

Latency Analysis

  • The performance improvement can be larger when the data size is large
  • The PVAS implementation can accelerate the data copy between processes
  • The time for the data copy has little impact when the message size is small
  • The performance improvement can be smaller when transferring data from a complex data type buffer to a complex data type buffer
  • Access to sparsely located data incurs many cache misses during the data copy


SLIDE 16

FFT2D_datatype

  • 2D Fast Fourier Transform code
  • Uses derived data types for the matrix transpose (a minimal illustration follows below)
  • Different vector types on the send/recv side
  • The PVAS BTL improves benchmark performance by up to 21%

[Figure: fft2d_datatype results (NP=240)]
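
For reference, a minimal two-rank example of the derived-data-type transpose idiom the benchmark relies on: one rank sends a row-major N x N matrix of doubles, and the other receives it with a column vector type whose extent is resized to one double, so the data lands transposed. This only shows the general technique; the benchmark's actual vector types and 240-process layout differ.

/* Minimal illustration of a matrix transpose via derived data types.
 * Rank 0 sends a row-major N x N matrix; rank 1 receives it with a
 * resized column type, so element (i,j) lands at (j,i).
 * Compile with mpicc; run with 2 processes. */
#include <mpi.h>
#include <stdio.h>

#define N 4

int main(int argc, char **argv)
{
    int rank;
    double A[N][N], B[N][N];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* One matrix column: N doubles with a stride of N doubles. */
    MPI_Datatype col, colt;
    MPI_Type_vector(N, 1, N, MPI_DOUBLE, &col);
    /* Shrink the extent to one double so consecutive columns interleave. */
    MPI_Type_create_resized(col, 0, (MPI_Aint)sizeof(double), &colt);
    MPI_Type_commit(&colt);

    if (rank == 0) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                A[i][j] = i * N + j;
        MPI_Send(&A[0][0], N * N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Receive N "columns": B becomes the transpose of A. */
        MPI_Recv(&B[0][0], N, colt, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("B[1][0] = %g (expected A[0][1] = 1)\n", B[1][0]);
    }

    MPI_Type_free(&colt);
    MPI_Type_free(&col);
    MPI_Finalize();
    return 0;
}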

SLIDE 17

Related Work

  • SMARTMAP [Ron et al., SC'08]
  • SMARTMAP enables a process to map the whole memory of the other processes into its address space
  • It is similar to PVAS, but the implementation is different
  • SMARTMAP accelerates MPI intra-node communication for transferring contiguous data
  • User-mode Memory Registration (UMR)
  • UMR is a function of Mellanox InfiniBand that makes it possible to transfer non-contiguous data through one RDMA operation
  • UMR accelerates MPI inter-node communication using derived data types [Mingzhe et al., IEEE Cluster'15]


SLIDE 18

Summary

  • We introduced the PVAS task model
  • A new task model for efficient parallel processing on many-core systems
  • PVAS removes the overhead of crossing address space boundaries from intra-node communication by running the parallel processes within the same address space
  • We optimized MPI intra-node communication by using the PVAS facility
  • We optimized contiguous and non-contiguous data transfers in Open MPI
  • The PVAS implementation outperforms the SM implementation
