MVAPICH2 over OpenStack with SR-IOV: An Efficient Approach to Build HPC Clouds
Jie Zhang, Xiaoyi Lu, Mark Arnold and Dhabaleswar K. Panda
Outline
- Introduction
- Problem Statement
- Proposed Design
- Performance Evaluation
Single Root I/O Virtualization (SR-IOV)
- SR-IOV provides new opportunities to design HPC clouds with very low overhead
- Allows a single physical device, or a Physical Function (PF), to present itself as multiple virtual devices, or Virtual Functions (VFs)
- Each VF can be dedicated to a single VM through PCI passthrough
- VFs are designed based on the existing non-virtualized PFs, so no driver change is needed
[Figure: guests with VF drivers attach to Virtual Functions of the SR-IOV hardware through the I/O MMU; the hypervisor's PF driver manages the Physical Function over PCI Express]
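For concreteness, here is a hedged sketch of how VFs are typically brought up on Linux (general SR-IOV practice, not something this deck specifies): the host writes the desired VF count to the PF's standard sriov_numvfs sysfs attribute. The PCI address below is a hypothetical placeholder.

```c
/* Sketch: enable SR-IOV Virtual Functions via the standard Linux sysfs
 * attribute. The PCI address of the Physical Function is hypothetical. */
#include <stdio.h>

int main(void)
{
    const char *attr =
        "/sys/bus/pci/devices/0000:05:00.0/sriov_numvfs"; /* placeholder PF */
    FILE *f = fopen(attr, "w");
    if (!f) { perror("fopen"); return 1; }
    fprintf(f, "%d\n", 4);  /* expose 4 VFs for PCI passthrough to VMs */
    return fclose(f) == 0 ? 0 : 1;
}
```

Each resulting VF can then be handed to a VM through PCI passthrough, as the slide describes.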
Inter-VM Shared Memory (IVShmem)
- SR-IOV shows near-native performance for inter-node point-to-point communication
- However, it is NOT VM locality aware
- IVShmem offers zero-copy access to data in shared memory of co-resident VMs
[Figure: two co-resident guests each run an MPI process with a VF driver; they communicate either over the SR-IOV channel through Virtual Functions of the InfiniBand adapter, or over the IVShmem channel backed by /dev/shm on the host]
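To make the zero-copy claim concrete, here is a hedged sketch of mapping the IVShmem region inside a guest, assuming QEMU's ivshmem device (which exposes the host's shared memory as PCI BAR2); the device path and size are hypothetical placeholders.

```c
/* Sketch: map the IVShmem shared-memory BAR inside a guest. With QEMU's
 * ivshmem device the shared region is PCI BAR2; path/size are placeholders. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define SHM_SIZE (64u * 1024 * 1024)  /* must match the host-side region */

int main(void)
{
    const char *bar2 =
        "/sys/bus/pci/devices/0000:00:05.0/resource2"; /* placeholder */
    int fd = open(bar2, O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    /* Every co-resident VM maps the same physical pages, which is what
     * makes zero-copy data exchange possible. */
    char *shm = mmap(NULL, SHM_SIZE, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
    if (shm == MAP_FAILED) { perror("mmap"); return 1; }

    strcpy(shm, "hello from this VM"); /* visible to co-resident VMs */
    munmap(shm, SHM_SIZE);
    close(fd);
    return 0;
}
```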
Outline
- Introduction
- Problem Statement
- Proposed Design
- Performance Evaluation
Problem Statement
- How to design a high-performance MPI library that efficiently takes advantage of SR-IOV and IVShmem to deliver VM-locality-aware communication and optimal performance?
- How to build an HPC cloud with near-native performance for MPI applications over SR-IOV-enabled InfiniBand clusters?
- How much performance improvement can be achieved by our proposed design on MPI point-to-point operations, collective operations, and applications in HPC clouds?
- How much benefit can the proposed approach with InfiniBand provide compared to Amazon EC2?
Outline
- Introduction
- Problem Statement
- Proposed Design
- Performance Evaluation
VM Locality Aware MVAPICH2 Design Overview
- MVAPICH2 library runs in both native and virtualized environments
- In the virtualized environment:
  – Supports shared-memory channels (SMP, IVShmem) and the SR-IOV channel
  – Locality detection
  – Communication coordination
[Figure: in the native stack, the MPI library sits on SMP and network channels over shared memory and the InfiniBand API; in the virtualized stack, a Locality Detector and Communication Coordinator sit below the ADI3 layer and schedule traffic across the SMP, IVShmem, and SR-IOV channels on the virtualized hardware]
Virtual Machine Locality Detection
- Create a VM List structure in the IVShmem region of each host
- Each MPI process writes its own membership information into the shared VM List structure according to its global rank
- One byte per process, lock-free, O(N) (a minimal sketch follows below)
[Figure: four MPI processes (ranks 0, 1, 4, 5) running in separate VMs on one host each set their one-byte entry in the shared VM List under /dev/shm, so every process can discover which ranks are co-resident]
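A hedged sketch of the one-byte, lock-free idea with illustrative names (not MVAPICH2's actual code); in the real design the array lives in the IVShmem region shared by all co-resident VMs rather than in process memory:

```c
/* Sketch: lock-free VM locality detection over a shared VM List.
 * Each process sets the one byte indexed by its global rank; scanning
 * the list is O(N) and needs no locks because writers never collide. */
#define MAX_PROCS 64

/* Stand-in for the VM List in the host's IVShmem region (/dev/shm). */
static volatile unsigned char vm_list[MAX_PROCS];

void publish_membership(int my_rank)
{
    vm_list[my_rank] = 1;       /* one byte per process, no lock needed */
}

int is_coresident(int peer_rank)
{
    /* Set only if the peer published into this host's shared region. */
    return vm_list[peer_rank];
}
```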
Communication Coordination
- Retrieve the VM locality detection information
- Schedule communication channels based on the VM locality information (see the sketch below)
- Fast indexing, lightweight
[Figure: the Communication Coordinator in each MPI process (e.g., ranks 1 and 4 in Guest1 and Guest2) consults the shared VM List in /dev/shm and routes each message over the IVShmem channel for co-resident peers, or over the SR-IOV channel through a Virtual Function of the InfiniBand adapter for remote peers]
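Building on the locality sketch above (same illustrative names), channel scheduling then reduces to a constant-time lookup per peer:

```c
/* Sketch: the coordinator indexes the VM List once per peer and picks
 * the channel accordingly; names are illustrative. */
typedef enum { CHANNEL_IVSHMEM, CHANNEL_SRIOV } channel_t;

int is_coresident(int peer_rank);   /* from the locality sketch above */

channel_t select_channel(int peer_rank)
{
    /* Co-resident peers take the zero-copy shared-memory path; remote
     * peers go through the SR-IOV Virtual Function. */
    return is_coresident(peer_rank) ? CHANNEL_IVSHMEM : CHANNEL_SRIOV;
}
```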
MVAPICH2 with SR-IOV over OpenStack
- OpenStack is one of the most popular open-source solutions for building clouds and managing large numbers of virtual machines
- Deployment with OpenStack
  – Supporting SR-IOV configuration
  – Extending Nova in OpenStack to support IVShmem
  – Virtual Machine Aware design of MVAPICH2 with SR-IOV
- An efficient approach to build HPC clouds
[Figure: OpenStack services around a VM: Horizon provides the UI, Keystone provides authentication, Heat orchestrates the cloud, Ceilometer monitors it, Nova provisions VMs, Glance provides images, Neutron provides networking, Cinder provides volumes, and Swift stores images and backs up volumes]
Experimental HPC Cloud
Outline
- Introduction
- Problem Statement
- Proposed Design
- Performance Evaluation
Cloud Testbeds

Cluster       | Nowlab Cloud                                               | Amazon EC2
Instance      | 4 Core/VM and 8 Core/VM                                    | 4 Core/VM (C3.xlarge) and 8 Core/VM (C3.2xlarge)
Platform      | RHEL 6.5, Qemu+KVM, HVM                                    | Amazon Linux (EL6), Xen HVM
CPU           | SandyBridge Intel Xeon E5-2670 (2.6 GHz)                   | IvyBridge Intel Xeon E5-2680 v2 (2.8 GHz)
RAM           | 6 GB / 12 GB                                               | 7.5 GB / 15 GB
Interconnect  | FDR (56 Gbps) InfiniBand, Mellanox ConnectX-3 with SR-IOV  | 10 GigE with Intel ixgbevf SR-IOV driver
Performance Evaluation
- Performance of MPI-level point-to-point operations
  – Inter-node MPI-level two-sided operations
  – Intra-node MPI-level two-sided operations
  – Intra-node MPI-level one-sided operations
- Performance of MPI-level collective operations
  – Broadcast, Allreduce, Allgather, and Alltoall
- Performance of typical MPI benchmarks and applications
  – NAS and Graph500

*Amazon EC2 so far does not allow users to explicitly allocate VMs on one physical node. We allocate multiple VMs in one logical group and compare the point-to-point performance for each pair of VMs. Pairs with the lowest latency are treated as located within one physical node (intra-node); all other pairs as inter-node.
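For reference, here is a minimal two-process ping-pong sketch in the spirit of osu_latency; timing each VM pair this way yields the latencies behind the intra-/inter-node classification above. Message size and iteration count are illustrative.

```c
/* Sketch: two-process ping-pong latency, in the spirit of osu_latency. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    enum { ITERS = 1000, SIZE = 4096 };  /* illustrative */
    char buf[SIZE];
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++) {
        if (rank == 0) {
            MPI_Send(buf, SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)  /* one-way latency = half the round-trip time */
        printf("latency: %.2f us\n", (t1 - t0) * 1e6 / (2.0 * ITERS));
    MPI_Finalize();
    return 0;
}
```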
Inter-node MPI Level Two-sided Point-to-Point Performance
- EC2 C3.xlarge instances
- Similar performance to SR-IOV-Def
- Compared to Native, similar overhead to that at the basic IB level
- Compared to EC2, up to 29X and 16X performance speedup on Lat & BW
Intra-node MPI Level Two-sided Point-to-Point Performance
- EC2 C3.xlarge instances
- Compared to SR-IOV-Def, up to 84% and 158% performance improvement on Lat & BW
- Compared to Native, 3%-7% overhead for Lat, 3%-8% overhead for BW
- Compared to EC2, up to 160X and 28X performance speedup on Lat & BW
Intra-node MPI Level One-sided Put Performance
- EC2 C3.xlarge instances
- Compared to SR-IOV-Def, up to 63% and 42% improvement on Lat & BW
- Compared to EC2, up to 134X and 33X performance speedup on Lat & BW
Intra-node MPI Level One-sided Get Performance
- EC2 C3.xlarge instances
- Compared to SR-IOV-Def, up to 70% improvement on both Lat & BW
- Compared to EC2, up to 121X and 24X performance speedup on Lat & BW
MPI Level Collective Operations Performance
(4 cores/VM * 4 VMs)
- EC2 C3.xlarge instances
- Compared to SR-IOV-Def, up to 74% and 60% performance improvement on Broadcast & Allreduce
- Compared to EC2, up to 65X and 22X performance speedup on Broadcast & Allreduce
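For reference, a minimal sketch of how a collective such as MPI_Allreduce can be timed (in the spirit of the OSU collective benchmarks; element count and iterations are illustrative):

```c
/* Sketch: average MPI_Allreduce latency over many iterations. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    enum { ITERS = 1000, COUNT = 1024 };  /* illustrative */
    double in[COUNT], out[COUNT];
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (int i = 0; i < COUNT; i++) in[i] = rank + i;

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++)
        MPI_Allreduce(in, out, COUNT, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("avg Allreduce latency: %.2f us\n", (t1 - t0) * 1e6 / ITERS);
    MPI_Finalize();
    return 0;
}
```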
MPI Level Collective Operations Performance
(4 cores/VM * 4 VMs)
- EC2 C3.xlarge instances
- Compared to SR-IOV-Def, up to 74% and 81% performance improvement on Allgather & Alltoall
- Compared to EC2, up to 28X and 45X performance speedup on Allgather & Alltoall
MPI Level Collective Operations Performance
(4 cores/VM * 16 VMs)
- Compared to SR-IOV-Def, up to 41% and 45% performance improvement on Broadcast & Allreduce
MPI Level Collective Operations Performance
(4 cores/VM * 16 VMs)
- Compared to SR-IOV-Def, up to 40% and 39% performance improvement on Allgather & Alltoall
Performance of Typical MPI Benchmarks and Applications
(8 cores/VM * 4 VMs)
- EC2 C3.2xlarge instances
- Compared to Native, 2%-9% overhead for NAS, around 6% overhead for Graph500
- Compared to EC2, up to 4.4X (FT) speedup for NAS and up to 12X (20,10) speedup for Graph500
Performance of Typical MPI Benchmarks and Applications
(8 cores/VM * 8 VMs)
- EC2 C3.2xlarge instances
- Compared to Native, 6%-9% overhead for NAS, around 8% overhead for Graph500
Is Singularity-based Container Technology Ready for Running MPI Applications on HPC Clouds?
Jie Zhang, Xiaoyi Lu and Dhabaleswar K. Panda
Outline
- Introduction
- Problem Statement
- Evaluation Methodology
- Performance Evaluation
Virtualization Technology (Hypervisor vs. Container)
- Virtualization provides abstractions of multiple virtual resources by utilizing an intermediate software layer on top of the underlying system
[Figure: hypervisor-based virtualization: each VM stacks an application, bins/libs, and a full guest OS (e.g., Red Hat Linux, Windows, Ubuntu) on a hypervisor above the host OS and hardware]
- The hypervisor provides a full abstraction of the VM
- Full virtualization: different guest OSes, better isolation
- Larger overhead due to the heavy software stack
Virtualization Technology (Hypervisor vs. Container)
- Virtualization provides abstractions of multiple virtual resources by utilizing an intermediate software layer on top of the underlying system
[Figure: container-based virtualization: containers stack applications and bins/libs directly on the host Linux OS and hardware, with no guest OS]
- Containers share the host kernel
- Allow execution of isolated user-space instances
- Lightweight and portable
- Weaker isolation
Container Technology (Docker vs. Singularity)
- Docker inherits the advantages of container techniques
- Active community contribution
- Root-owned daemon process
- Root escalation in Docker containers
- Non-negligible performance overhead
Singularity Overview
- Reproducible software stacks
  – Easily verified via checksum or cryptographic signature
- Mobility of compute
  – Able to transfer (and store) containers via standard data mobility tools
- Compatibility with complicated architectures
  – Runtime immediately compatible with existing HPC architectures
- Security model
  – Supports untrusted users running untrusted containers

Source: http://singularity.lbl.gov/about
Container Technology (Docker vs. Singularity)
- Singularity aims to provide reproducible and mobile environments across HPC centers
- NO root-owned daemon
- NO root escalation
- Example invocation:
  mpirun_rsh -np 2 -hostfile htfiles singularity exec /tmp/Centos-7.img /usr/bin/osu_latency
- Performance?
Outline
- Introduction
- Problem Statement
- Evaluation Methodology
- Performance Evaluation
Problem Statement
- What are the performance characteristics of running Singularity on an HPC cloud?
- Can Singularity deliver near-native performance for MPI applications on HPC clouds with different cutting-edge hardware technologies?
- Is Singularity-based container technology ready for running MPI applications on HPC clouds on top of HPC infrastructure?
Outline
- Introduction
- Problem Statement
- Evaluation Methodology
- Performance Evaluation
Evaluation Methodology
[Figure: three evaluation dimensions: Processor Architecture (multi-core Haswell vs. many-core KNL), Memory Access Mode (NUMA, Cache, Flat), and High-Speed Interconnects (InfiniBand, Omni-Path)]
- Virtualization solution overhead
  – Singularity vs. Native
- Three-dimensional evaluation
Processor Architecture

Haswell
- Dual socket (NUMA)
- Each socket has 12 cores (2.30 GHz)
- Each core supports 2 threads
- 4 DDR channels, 2 per socket

KNL
- Up to 72 cores (1.4 GHz) on 36 active tiles
- Each tile has a single 1 MB L2 cache shared between two cores
- Each core supports 4 threads
- 6 DDR channels + 8 Multi-Channel DRAMs (MCDRAM)
Memory Access Mode
[Figure: Haswell with 2 NUMA nodes; KNL in Cache mode (16 GB MCDRAM as a cache in front of 96 GB DDR4) and in Flat mode (16 GB MCDRAM plus 96 GB DDR4 as 112 GB of addressable RAM)]

Haswell: 2 NUMA nodes
- 2 NUMA nodes
- QPI channels between sockets
- Intra-/inter-socket communication

KNL with Cache mode
- MCDRAM acts as an L3 cache
- The OS transparently uses MCDRAM to stage data from main memory

KNL with Flat mode
- DDR4 and MCDRAM act as two distinct NUMA nodes
- The type of memory (DDR4 or MCDRAM) must be specified when allocating
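In Flat mode the allocation target can be chosen with numactl or programmatically; here is a minimal sketch assuming the memkind library's hbw_* interface is installed:

```c
/* Sketch: allocate explicitly from MCDRAM on a KNL booted in Flat mode,
 * using memkind's high-bandwidth-memory interface. */
#include <hbwmalloc.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    size_t n = 1u << 20;

    /* Returns 0 when high-bandwidth memory is present; with the default
     * policy, hbw_malloc falls back to DDR otherwise. */
    if (hbw_check_available() != 0)
        fprintf(stderr, "no MCDRAM detected; falling back to DDR\n");

    double *a = hbw_malloc(n * sizeof *a);  /* MCDRAM */
    double *b = malloc(n * sizeof *b);      /* regular DDR4 */
    if (!a || !b) return 1;

    for (size_t i = 0; i < n; i++)
        a[i] = b[i] = (double)i;

    hbw_free(a);
    free(b);
    return 0;
}
```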
Outline
- Introduction
- Problem Statement
- Evaluation Methodology
- Performance Evaluation
Testbeds

Cluster       | Chameleon Cloud                                    | Nowlab Cloud
CPU           | Intel Xeon E5-2670 (Haswell), 24 cores (2.3 GHz)   | Intel Xeon Phi 7250 (KNL), 68 cores (1.40 GHz)
Memory        | 2 NUMA nodes, 128 GB                               | 96 GB host memory + 16 GB MCDRAM
Interconnect  | Mellanox ConnectX-3 HCA (FDR 56 Gbps)              | Omni-Path HFI Silicon 100 Series fabric controller (100 Gbps)
OS            | CentOS Linux release 7.1.1503 (Core)               | CentOS Linux release 7.3.1611 (Core)
Software      | Singularity 2.3, MVAPICH2-2.3a, OSU micro-benchmarks v5.3
Evaluation Methodology
[Figure: Dimension 1, Processor Architecture: multi-core (Haswell) vs. many-core (KNL)]
- Virtualization solution overhead
  – Singularity vs. Native
- Dimension 1
  – Multi-core processor
  – Many-core processor
Processor Architecture (Haswell & KNL)
- MPI point-to-point bandwidth
- On both Haswell and KNL, less than 7% overhead for the Singularity solution
- KNL shows worse intra-node performance than Haswell because of its lower CPU frequency, complex cluster modes, and the cost of maintaining cache coherence
- On KNL, inter-node outperforms intra-node beyond roughly 256 KB, as the Omni-Path interconnect beats shared-memory transfers for large messages
[Figure: bandwidth vs. message size on Haswell and KNL, comparing Singularity and Native for intra-node and inter-node; the Singularity gap stays within 7%]
Evaluation Methodology
[Figure: Dimension 2, Memory Access Mode: NUMA, Cache, Flat]
- Virtualization solution overhead
  – Singularity vs. Native
- Dimension 2
  – NUMA
  – Cache
  – Flat
Memory Access Mode (NUMA, Cache)
- MPI point-to-point latency
- NUMA
  – Intra-socket performs better than inter-socket, due to the QPI bottleneck between NUMA nodes
  – The performance difference gradually shrinks as the message size increases
- Overall, less than 8% overhead for the Singularity solution in both cases, compared with Native
[Figure: latency vs. message size for the 2-NUMA-node case (intra- vs. inter-socket) and for KNL with Cache mode, Singularity vs. Native; the gap stays within 8%]
Memory Access Mode (Flat)
- Explicitly specify DDR or MCDRAM for memory allocation
- MPI point-to-point bandwidth: no significant difference between DDR and MCDRAM
- MPI_Allreduce: clear benefit (up to 67%) with MCDRAM beyond roughly 256 KB messages, compared with DDR
- More parallel processes generate data accesses that no longer fit in the L2 cache, so MCDRAM's higher bandwidth pays off
- Near-native performance for Singularity (less than 8% overhead)
[Figure: point-to-point bandwidth and MPI_Allreduce latency in Flat mode, each comparing Singularity vs. Native with DDR vs. MCDRAM allocations]
Evaluation Methodology
[Figure: Dimension 3, High-Speed Interconnects: InfiniBand, Omni-Path]
- Virtualization solution overhead
  – Singularity vs. Native
- Dimension 3
  – InfiniBand
  – Omni-Path
High Performance Interconnects (InfiniBand & Omni-Path)
- MPI_Allreduce
- InfiniBand: 512 processes across 32 nodes
- Omni-Path: 128 processes on 2 nodes
- Near-native performance for Singularity (within about 8%)
[Figure: MPI_Allreduce latency vs. message size on InfiniBand and Intel Omni-Path, Singularity vs. Native]
Put It All Together
[Figure: the three dimensions combined: Processor Architecture (Haswell, KNL), Memory Access Mode (NUMA, Cache, Flat), and High-Speed Interconnects (InfiniBand, Omni-Path)]
- Virtualization solution overhead
  – Singularity vs. Native
- Haswell + InfiniBand
- KNL + Cache mode + Omni-Path
- KNL + Flat mode + Omni-Path
Application Performance on Haswell with InfiniBand
[Figure: execution time of Class D NAS kernels (CG, EP, FT, IS, LU, MG) and of Graph500 problem sizes (22,16) through (26,20), Singularity vs. Native]
- 512 processes across 32 Haswell nodes
- Singularity delivers near-native performance: less than 7% overhead on Haswell with InfiniBand
Application Performance on KNL with Omni-Path
- 128 processes across 2 KNL nodes with Omni-Path
- Singularity incurs less than 6% overhead on KNL in both Cache and Flat modes
- No clear performance difference between DDR and MCDRAM in Flat mode
  – Graph500 heavily utilizes point-to-point communication with 4 KB messages for the BFS search
  – Consistent with the point-to-point performance on KNL in Flat mode
[Figure: execution time of Class C NAS kernels with Cache mode, and of Graph500 problem sizes (20,10) through (24,16) with Flat mode (DDR vs. MCDRAM), Singularity vs. Native]
High Performance Data Transfer in Grid Environment Using GridFTP over InfiniBand
H. Subramoni, P. Lai, R. Kettimuthu and D. K. Panda
Overview
- GridFTP is a high-performance, secure, reliable extension of standard FTP, optimized for WANs
- The Globus XIO framework, used to design GridFTP, offers an easy-to-use interface
- The framework hides the complications of the communication semantics of the underlying devices (network or disk)
Contribution
- Combining the ease of use of the Globus XIO framework with the high performance achieved through InfiniBand
- Enhancing the disk I/O performance of the existing ADTS library
  – By decoupling network processing from disk I/O operations
- Evaluation of the design
  – At the micro-benchmark level
  – With applications such as the Community Climate System Model and ultra-scale visualization
Problem Statement
- Most HPC applications require movement of huge amounts of data
  – The data must be stored on comparatively slow hard disks and RAIDs
  – With the low bandwidth of TCP/UDP-based FTP, this was not an issue
  – It will be an issue for Globus ADTS XIO
- Solution
  – Decouple network processing from disk I/O
Design of the Globus ADTS XIO Driver
- Introduces
  – Multiple threads: read, write, and network threads
  – A set of buffers to stage the data
- The read thread prefetches a set of locations from disk and keeps them ready for the network
- Avoids frequent context switches
  – Low and high water marks
  – High water mark: maximum size of the circular buffer
  – Read when the number of available buffers drops below the low water mark
(A minimal sketch of this staging scheme follows.)
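A hedged sketch of the staging scheme (illustrative names, not the actual ADTS code): the read thread refills a circular buffer up to the high water mark and sleeps until the network thread drains it below the low water mark, keeping context switches infrequent.

```c
/* Sketch: circular staging buffer with low/high water marks. */
#include <pthread.h>

#define HIGH_WATER 16            /* capacity of the circular buffer */
#define LOW_WATER   4            /* refill trigger */
#define BLOCK_SIZE (256 * 1024)

static char staging[HIGH_WATER][BLOCK_SIZE];
static int count, head, tail;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t need_refill = PTHREAD_COND_INITIALIZER;
static pthread_cond_t have_data   = PTHREAD_COND_INITIALIZER;

/* Read thread: prefetch from disk, sleep between low and high water. */
void *read_thread(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        while (count >= LOW_WATER)        /* wake only below low water */
            pthread_cond_wait(&need_refill, &lock);
        while (count < HIGH_WATER) {      /* top up to high water */
            /* read_block(staging[head]); -- stands in for the disk read */
            head = (head + 1) % HIGH_WATER;
            count++;
        }
        pthread_cond_broadcast(&have_data);
        pthread_mutex_unlock(&lock);
    }
}

/* Network thread: drain staged buffers onto the wire. */
void *network_thread(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        while (count == 0)
            pthread_cond_wait(&have_data, &lock);
        /* send_block(staging[tail]); -- stands in for the network send */
        tail = (tail + 1) % HIGH_WATER;
        if (--count < LOW_WATER)          /* signal the read thread */
            pthread_cond_signal(&need_refill);
        pthread_mutex_unlock(&lock);
    }
}
```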
Evaluation
- Varying network delays
- Transmit 128 GB of aggregate data as multiples of 256 MB files
- Legend: staging buffer sizes
Evaluation
- Community Climate System Model (CCSM)
  – National Center for Atmospheric Research (NCAR) and Lawrence Livermore National Laboratory (LLNL)
  – Most files are 256 MB
- Ultra-Scale Visualization
  – ORNL and UC Davis
  – Most files are 2.6 GB
Thank You!
Network-Based Computing Laboratory http://nowlab.cse.ohio-state.edu/