MVAPICH2 over OpenStack with SR-IOV: An Efficient Approach to Build HPC Clouds
Jie Zhang, Xiaoyi Lu, Mark Arnold and Dhabaleswar K. Panda
Outline
- Introduction
- Problem Statement
- Proposed Design
- Performance Evaluation
Single Root I/O Virtualization (SR-IOV)
- SR-IOV provides new opportunities to design HPC clouds with very low overhead
- Allows a single physical device, or a Physical Function (PF), to present itself as multiple virtual devices, or Virtual Functions (VFs)
- Each VF can be dedicated to a single VM through PCI passthrough
- VFs are designed based on the existing non-virtualized PFs, so no driver change is needed
[Figure: guests with VF drivers attach to Virtual Functions of the SR-IOV hardware through the I/O MMU; the hypervisor's PF driver manages the Physical Function over PCI Express]
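For concreteness, here is a hedged sketch of how VFs are typically brought up on Linux (general SR-IOV practice, not something this deck specifies): the host writes the desired VF count to the PF's standard sriov_numvfs sysfs attribute. The PCI address below is a hypothetical placeholder.

```c
/* Sketch: enable SR-IOV Virtual Functions via the standard Linux sysfs
 * attribute. The PCI address of the Physical Function is hypothetical. */
#include <stdio.h>

int main(void)
{
    const char *attr =
        "/sys/bus/pci/devices/0000:05:00.0/sriov_numvfs"; /* placeholder PF */
    FILE *f = fopen(attr, "w");
    if (!f) { perror("fopen"); return 1; }
    fprintf(f, "%d\n", 4);  /* expose 4 VFs for PCI passthrough to VMs */
    return fclose(f) == 0 ? 0 : 1;
}
```

Each resulting VF can then be handed to a VM through PCI passthrough, as the slide describes.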
Inter-VM Shared Memory (IVShmem)
- SR-IOV shows near-native performance for inter-node point-to-point communication
- However, it is NOT VM locality aware
- IVShmem offers zero-copy access to data in shared memory of co-resident VMs
[Figure: two co-resident guests each run an MPI process with a VF driver; they communicate either over the SR-IOV channel through Virtual Functions of the InfiniBand adapter, or over the IVShmem channel backed by /dev/shm on the host]
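To make the zero-copy claim concrete, here is a hedged sketch of mapping the IVShmem region inside a guest, assuming QEMU's ivshmem device (which exposes the host's shared memory as PCI BAR2); the device path and size are hypothetical placeholders.

```c
/* Sketch: map the IVShmem shared-memory BAR inside a guest. With QEMU's
 * ivshmem device the shared region is PCI BAR2; path/size are placeholders. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define SHM_SIZE (64u * 1024 * 1024)  /* must match the host-side region */

int main(void)
{
    const char *bar2 =
        "/sys/bus/pci/devices/0000:00:05.0/resource2"; /* placeholder */
    int fd = open(bar2, O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    /* Every co-resident VM maps the same physical pages, which is what
     * makes zero-copy data exchange possible. */
    char *shm = mmap(NULL, SHM_SIZE, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
    if (shm == MAP_FAILED) { perror("mmap"); return 1; }

    strcpy(shm, "hello from this VM"); /* visible to co-resident VMs */
    munmap(shm, SHM_SIZE);
    close(fd);
    return 0;
}
```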
Outline
- Introduction
- Problem Statement
- Proposed Design
- Performance Evaluation
Problem Statement
- How to design a high-performance MPI library that efficiently takes advantage of SR-IOV and IVShmem to deliver VM-locality-aware communication and optimal performance?
- How to build an HPC cloud with near-native performance for MPI applications over SR-IOV-enabled InfiniBand clusters?
- How much performance improvement can be achieved by our proposed design on MPI point-to-point operations, collective operations, and applications in HPC clouds?
- How much benefit can the proposed approach with InfiniBand provide compared to Amazon EC2?
Outline
- Introduction
- Problem Statement
- Proposed Design
- Performance Evaluation
VM Locality Aware MVAPICH2 Design Overview
- MVAPICH2 library runs in both native and virtualized environments
- In the virtualized environment:
  – Supports shared-memory channels (SMP, IVShmem) and the SR-IOV channel
  – Locality detection
  – Communication coordination
[Figure: in the native stack, the MPI library sits on SMP and network channels over shared memory and the InfiniBand API; in the virtualized stack, a Locality Detector and Communication Coordinator sit below the ADI3 layer and schedule traffic across the SMP, IVShmem, and SR-IOV channels on the virtualized hardware]
Virtual Machine Locality Detection
- Create a VM List structure in the IVShmem region of each host
- Each MPI process writes its own membership information into the shared VM List structure according to its global rank
- One byte per process, lock-free, O(N) (a minimal sketch follows below)
[Figure: four MPI processes (ranks 0, 1, 4, 5) running in separate VMs on one host each set their one-byte entry in the shared VM List under /dev/shm, so every process can discover which ranks are co-resident]
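A hedged sketch of the one-byte, lock-free idea with illustrative names (not MVAPICH2's actual code); in the real design the array lives in the IVShmem region shared by all co-resident VMs rather than in process memory:

```c
/* Sketch: lock-free VM locality detection over a shared VM List.
 * Each process sets the one byte indexed by its global rank; scanning
 * the list is O(N) and needs no locks because writers never collide. */
#define MAX_PROCS 64

/* Stand-in for the VM List in the host's IVShmem region (/dev/shm). */
static volatile unsigned char vm_list[MAX_PROCS];

void publish_membership(int my_rank)
{
    vm_list[my_rank] = 1;       /* one byte per process, no lock needed */
}

int is_coresident(int peer_rank)
{
    /* Set only if the peer published into this host's shared region. */
    return vm_list[peer_rank];
}
```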
Communication Coordination
- Retrieve the VM locality detection information
- Schedule communication channels based on the VM locality information (see the sketch below)
- Fast indexing, lightweight
[Figure: the Communication Coordinator in each MPI process (e.g., ranks 1 and 4 in Guest1 and Guest2) consults the shared VM List in /dev/shm and routes each message over the IVShmem channel for co-resident peers, or over the SR-IOV channel through a Virtual Function of the InfiniBand adapter for remote peers]
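Building on the locality sketch above (same illustrative names), channel scheduling then reduces to a constant-time lookup per peer:

```c
/* Sketch: the coordinator indexes the VM List once per peer and picks
 * the channel accordingly; names are illustrative. */
typedef enum { CHANNEL_IVSHMEM, CHANNEL_SRIOV } channel_t;

int is_coresident(int peer_rank);   /* from the locality sketch above */

channel_t select_channel(int peer_rank)
{
    /* Co-resident peers take the zero-copy shared-memory path; remote
     * peers go through the SR-IOV Virtual Function. */
    return is_coresident(peer_rank) ? CHANNEL_IVSHMEM : CHANNEL_SRIOV;
}
```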
MVAPICH2 with SR-IOV over OpenStack
- OpenStack is one of the most popular open-source solutions for building clouds and managing large numbers of virtual machines
- Deployment with OpenStack
  – Supporting SR-IOV configuration
  – Extending Nova in OpenStack to support IVShmem
  – Virtual Machine Aware design of MVAPICH2 with SR-IOV
- An efficient approach to build HPC clouds
[Figure: OpenStack services around a VM: Horizon provides the UI, Keystone provides authentication, Heat orchestrates the cloud, Ceilometer monitors it, Nova provisions VMs, Glance provides images, Neutron provides networking, Cinder provides volumes, and Swift stores images and backs up volumes]
Experimental HPC Cloud
Outline
- Introduction
- Problem Statement
- Proposed Design
- Performance Evaluation
Cloud Testbeds

Cluster       | Nowlab Cloud                                               | Amazon EC2
Instance      | 4 Core/VM and 8 Core/VM                                    | 4 Core/VM (C3.xlarge) and 8 Core/VM (C3.2xlarge)
Platform      | RHEL 6.5, Qemu+KVM, HVM                                    | Amazon Linux (EL6), Xen HVM
CPU           | SandyBridge Intel Xeon E5-2670 (2.6 GHz)                   | IvyBridge Intel Xeon E5-2680 v2 (2.8 GHz)
RAM           | 6 GB / 12 GB                                               | 7.5 GB / 15 GB
Interconnect  | FDR (56 Gbps) InfiniBand, Mellanox ConnectX-3 with SR-IOV  | 10 GigE with Intel ixgbevf SR-IOV driver
Performance Evaluation
- Performance of MPI-level point-to-point operations
  – Inter-node MPI-level two-sided operations
  – Intra-node MPI-level two-sided operations
  – Intra-node MPI-level one-sided operations
- Performance of MPI-level collective operations
  – Broadcast, Allreduce, Allgather, and Alltoall
- Performance of typical MPI benchmarks and applications
  – NAS and Graph500

*Amazon EC2 so far does not allow users to explicitly allocate VMs on one physical node. We allocate multiple VMs in one logical group and compare the point-to-point performance for each pair of VMs. Pairs with the lowest latency are treated as located within one physical node (intra-node); all other pairs as inter-node.
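For reference, here is a minimal two-process ping-pong sketch in the spirit of osu_latency; timing each VM pair this way yields the latencies behind the intra-/inter-node classification above. Message size and iteration count are illustrative.

```c
/* Sketch: two-process ping-pong latency, in the spirit of osu_latency. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    enum { ITERS = 1000, SIZE = 4096 };  /* illustrative */
    char buf[SIZE];
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++) {
        if (rank == 0) {
            MPI_Send(buf, SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)  /* one-way latency = half the round-trip time */
        printf("latency: %.2f us\n", (t1 - t0) * 1e6 / (2.0 * ITERS));
    MPI_Finalize();
    return 0;
}
```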
Inter-node MPI Level Two-sided Point-to-Point Performance
- EC2 C3.xlarge instances
- Similar performance to SR-IOV-Def
- Compared to Native, similar overhead to that at the basic IB level
- Compared to EC2, up to 29X and 16X performance speedup on Lat & BW
Intra-node MPI Level Two-sided Point-to-Point Performance
- EC2 C3.xlarge instances
- Compared to SR-IOV-Def, up to 84% and 158% performance improvement on Lat & BW
- Compared to Native, 3%-7% overhead for Lat, 3%-8% overhead for BW
- Compared to EC2, up to 160X and 28X performance speedup on Lat & BW
Intra-node MPI Level One-sided Put Performance
- EC2 C3.xlarge instances
- Compared to SR-IOV-Def, up to 63% and 42% improvement on Lat & BW
- Compared to EC2, up to 134X and 33X performance speedup on Lat & BW
Intra-node MPI Level One-sided Get Performance
- EC2 C3.xlarge instances
- Compared to SR-IOV-Def, up to 70% improvement on both Lat & BW
- Compared to EC2, up to 121X and 24X performance speedup on Lat & BW
MPI Level Collective Operations Performance
(4 cores/VM * 4 VMs)
- EC2 C3.xlarge instances
- Compared to SR-IOV-Def, up to 74% and 60% performance improvement on Broadcast & Allreduce
- Compared to EC2, up to 65X and 22X performance speedup on Broadcast & Allreduce
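For reference, a minimal sketch of how a collective such as MPI_Allreduce can be timed (in the spirit of the OSU collective benchmarks; element count and iterations are illustrative):

```c
/* Sketch: average MPI_Allreduce latency over many iterations. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    enum { ITERS = 1000, COUNT = 1024 };  /* illustrative */
    double in[COUNT], out[COUNT];
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (int i = 0; i < COUNT; i++) in[i] = rank + i;

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++)
        MPI_Allreduce(in, out, COUNT, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("avg Allreduce latency: %.2f us\n", (t1 - t0) * 1e6 / ITERS);
    MPI_Finalize();
    return 0;
}
```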
MPI Level Collective Operations Performance
(4 cores/VM * 4 VMs)
- EC2 C3.xlarge instances
- Compared to SR-IOV-Def, up to 74% and 81% performance improvement on Allgather & Alltoall
- Compared to EC2, up to 28X and 45X performance speedup on Allgather & Alltoall
MPI Level Collective Operations Performance
(4 cores/VM * 16 VMs)
- Compared to SR-IOV-Def, up to 41% and 45% performance improvement on Broadcast & Allreduce
MPI Level Collective Operations Performance
(4 cores/VM * 16 VMs)
- Compared to SR-IOV-Def, up to 40% and 39% performance improvement on Allgather & Alltoall
Performance of Typical MPI Benchmarks and Applications
(8 cores/VM * 4 VMs)
- EC2 C3.2xlarge instances
- Compared to Native, 2%-9% overhead for NAS, around 6% overhead for Graph500
- Compared to EC2, up to 4.4X (FT) speedup for NAS and up to 12X (20,10) speedup for Graph500
Performance of Typical MPI Benchmarks and Applications
(8 cores/VM * 8 VMs)
- EC2 C3.2xlarge instances
- Compared to Native, 6%-9% overhead for NAS, around 8% overhead for Graph500
Is Singularity-based Container Technology Ready for Running MPI Applications on HPC Clouds?
Jie Zhang, Xiaoyi Lu and Dhabaleswar K. Panda
Outline
- Introduction
- Problem Statement
- Evaluation Methodology
- Performance Evaluation
Virtualization Technology (Hypervisor vs. Container)
- Virtualization provides abstractions of multiple virtual resources by utilizing an intermediate software layer on top of the underlying system
[Figure: hypervisor-based virtualization: each VM stacks an application, bins/libs, and a full guest OS (e.g., Red Hat Linux, Windows, Ubuntu) on a hypervisor above the host OS and hardware]
- The hypervisor provides a full abstraction of the VM
- Full virtualization: different guest OSes, better isolation
- Larger overhead due to the heavy software stack
Virtualization Technology (Hypervisor vs. Container)
- Virtualization provides abstractions of multiple virtual resources by utilizing an intermediate software layer on top of the underlying system
[Figure: container-based virtualization: containers stack applications and bins/libs directly on the host Linux OS and hardware, with no guest OS]
- Containers share the host kernel
- Allow execution of isolated user-space instances
- Lightweight and portable
- Weaker isolation
Container Technology (Docker vs. Singularity)
- Docker inherits the advantages of container techniques
- Active community contribution
- Root-owned daemon process
- Root escalation in Docker containers
- Non-negligible performance overhead
Singularity Overview
- Reproducible software stacks
  – Easily verified via checksum or cryptographic signature
- Mobility of compute
  – Able to transfer (and store) containers via standard data mobility tools
- Compatibility with complicated architectures
  – Runtime immediately compatible with existing HPC architectures
- Security model
  – Supports untrusted users running untrusted containers

Source: http://singularity.lbl.gov/about
Container Technology (Docker vs. Singularity)
- Singularity aims to provide reproducible and mobile environments across HPC centers
- NO root-owned daemon
- NO root escalation
- Example invocation:
  mpirun_rsh -np 2 -hostfile htfiles singularity exec /tmp/Centos-7.img /usr/bin/osu_latency
- Performance?
Outline
- Introduction
- Problem Statement
- Evaluation Methodology
- Performance Evaluation
Problem Statement
- What are the performance characteristics of running Singularity on an HPC cloud?
- Can Singularity deliver near-native performance for MPI applications on HPC clouds with different cutting-edge hardware technologies?
- Is Singularity-based container technology ready for running MPI applications on HPC clouds on top of HPC infrastructure?
Outline
- Introduction
- Problem Statement
- Evaluation Methodology
- Performance Evaluation
Evaluation Methodology
[Figure: three evaluation dimensions: Processor Architecture (multi-core Haswell vs. many-core KNL), Memory Access Mode (NUMA, Cache, Flat), and High-Speed Interconnects (InfiniBand, Omni-Path)]
- Virtualization solution overhead
  – Singularity vs. Native
- Three-dimensional evaluation
Processor Architecture

Haswell
- Dual socket (NUMA)
- Each socket has 12 cores (2.30 GHz)
- Each core supports 2 threads
- 4 DDR channels, 2 per socket

KNL
- Up to 72 cores (1.4 GHz) on 36 active tiles
- Each tile has a single 1 MB L2 cache shared between two cores
- Each core supports 4 threads
- 6 DDR channels + 8 Multi-Channel DRAMs (MCDRAM)
Memory Access Mode
[Figure: Haswell with 2 NUMA nodes; KNL in Cache mode (16 GB MCDRAM as a cache in front of 96 GB DDR4) and in Flat mode (16 GB MCDRAM plus 96 GB DDR4 as 112 GB of addressable RAM)]

Haswell: 2 NUMA nodes
- 2 NUMA nodes
- QPI channels between sockets
- Intra-/inter-socket communication

KNL with Cache mode
- MCDRAM acts as an L3 cache
- The OS transparently uses MCDRAM to stage data from main memory

KNL with Flat mode
- DDR4 and MCDRAM act as two distinct NUMA nodes
- The type of memory (DDR4 or MCDRAM) must be specified when allocating
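In Flat mode the allocation target can be chosen with numactl or programmatically; here is a minimal sketch assuming the memkind library's hbw_* interface is installed:

```c
/* Sketch: allocate explicitly from MCDRAM on a KNL booted in Flat mode,
 * using memkind's high-bandwidth-memory interface. */
#include <hbwmalloc.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    size_t n = 1u << 20;

    /* Returns 0 when high-bandwidth memory is present; with the default
     * policy, hbw_malloc falls back to DDR otherwise. */
    if (hbw_check_available() != 0)
        fprintf(stderr, "no MCDRAM detected; falling back to DDR\n");

    double *a = hbw_malloc(n * sizeof *a);  /* MCDRAM */
    double *b = malloc(n * sizeof *b);      /* regular DDR4 */
    if (!a || !b) return 1;

    for (size_t i = 0; i < n; i++)
        a[i] = b[i] = (double)i;

    hbw_free(a);
    free(b);
    return 0;
}
```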
Outline
- Introduction
- Problem Statement
- Evaluation Methodology
- Performance Evaluation
Testbeds

Cluster       | Chameleon Cloud                                    | Nowlab Cloud
CPU           | Intel Xeon E5-2670 (Haswell), 24 cores (2.3 GHz)   | Intel Xeon Phi 7250 (KNL), 68 cores (1.40 GHz)
Memory        | 2 NUMA nodes, 128 GB                               | 96 GB host memory + 16 GB MCDRAM
Interconnect  | Mellanox ConnectX-3 HCA (FDR 56 Gbps)              | Omni-Path HFI Silicon 100 Series fabric controller (100 Gbps)
OS            | CentOS Linux release 7.1.1503 (Core)               | CentOS Linux release 7.3.1611 (Core)
Software      | Singularity 2.3, MVAPICH2-2.3a, OSU micro-benchmarks v5.3
Evaluation Methodology
[Figure: Dimension 1, Processor Architecture: multi-core (Haswell) vs. many-core (KNL)]
- Virtualization solution overhead
  – Singularity vs. Native
- Dimension 1
  – Multi-core processor
  – Many-core processor
Processor Architecture (Haswell & KNL)
- MPI point-to-point bandwidth
- On both Haswell and KNL, less than 7% overhead for the Singularity solution
- KNL shows worse intra-node performance than Haswell because of its lower CPU frequency, complex cluster modes, and the cost of maintaining cache coherence
- On KNL, inter-node outperforms intra-node beyond roughly 256 KB, as the Omni-Path interconnect beats shared-memory transfers for large messages
[Figure: bandwidth vs. message size on Haswell and KNL, comparing Singularity and Native for intra-node and inter-node; the Singularity gap stays within 7%]
Evaluation Methodology
[Figure: Dimension 2, Memory Access Mode: NUMA, Cache, Flat]
- Virtualization solution overhead
  – Singularity vs. Native
- Dimension 2
  – NUMA
  – Cache
  – Flat
Memory Access Mode (NUMA, Cache)
- MPI point-to-point latency
- NUMA
  – Intra-socket performs better than inter-socket, due to the QPI bottleneck between NUMA nodes
  – The performance difference gradually shrinks as the message size increases
- Overall, less than 8% overhead for the Singularity solution in both cases, compared with Native
[Figure: latency vs. message size for the 2-NUMA-node case (intra- vs. inter-socket) and for KNL with Cache mode, Singularity vs. Native; the gap stays within 8%]
Memory Access Mode (Flat)
- Explicitly specify DDR or MCDRAM for memory allocation
- MPI point-to-point bandwidth: no significant difference between DDR and MCDRAM
- MPI_Allreduce: clear benefit (up to 67%) with MCDRAM beyond roughly 256 KB messages, compared with DDR
- More parallel processes generate data accesses that no longer fit in the L2 cache, so MCDRAM's higher bandwidth pays off
- Near-native performance for Singularity (less than 8% overhead)
[Figure: point-to-point bandwidth and MPI_Allreduce latency in Flat mode, each comparing Singularity vs. Native with DDR vs. MCDRAM allocations]
Evaluation Methodology
[Figure: Dimension 3, High-Speed Interconnects: InfiniBand, Omni-Path]
- Virtualization solution overhead
  – Singularity vs. Native
- Dimension 3
  – InfiniBand
  – Omni-Path
High Performance Interconnects (InfiniBand & Omni-Path)
- MPI_Allreduce
- InfiniBand: 512 processes across 32 nodes
- Omni-Path: 128 processes on 2 nodes
- Near-native performance for Singularity (within about 8%)
[Figure: MPI_Allreduce latency vs. message size on InfiniBand and Intel Omni-Path, Singularity vs. Native]
Put It All Together
[Figure: the three dimensions combined: Processor Architecture (Haswell, KNL), Memory Access Mode (NUMA, Cache, Flat), and High-Speed Interconnects (InfiniBand, Omni-Path)]
- Virtualization solution overhead
  – Singularity vs. Native
- Haswell + InfiniBand
- KNL + Cache mode + Omni-Path
- KNL + Flat mode + Omni-Path
Application Performance on Haswell with InfiniBand
[Figure: execution time of Class D NAS kernels (CG, EP, FT, IS, LU, MG) and of Graph500 problem sizes (22,16) through (26,20), Singularity vs. Native]
- 512 processes across 32 Haswell nodes
- Singularity delivers near-native performance: less than 7% overhead on Haswell with InfiniBand
Application Performance on KNL with Omni-Path
- 128 processes across 2 KNL nodes with Omni-Path
- Singularity incurs less than 6% overhead on KNL in both Cache and Flat modes
- No clear performance difference between DDR and MCDRAM in Flat mode
  – Graph500 heavily utilizes point-to-point communication with 4 KB messages for the BFS search
  – Consistent with the point-to-point performance on KNL in Flat mode
[Figure: execution time of Class C NAS kernels with Cache mode, and of Graph500 problem sizes (20,10) through (24,16) with Flat mode (DDR vs. MCDRAM), Singularity vs. Native]
High Performance Data Transfer in Grid Environment Using GridFTP over InfiniBand
H. Subramoni, P. Lai, R. Kettimuthu and D. K. Panda
Overview
- GridFTP is a high-performance, secure, reliable extension of standard FTP, optimized for WANs
- The Globus XIO framework, used to design GridFTP, offers an easy-to-use interface
- The framework hides the complications of the communication semantics of the underlying devices (network or disk)
Contribution
- Combining the ease of use of the Globus XIO framework with the high performance achieved through InfiniBand
- Enhancing the disk I/O performance of the existing ADTS library
  – By decoupling network processing from disk I/O operations
- Evaluation of the design
  – At the micro-benchmark level
  – With applications such as the Community Climate System Model and ultra-scale visualization
Problem Statement
- Most HPC applications require movement of huge amounts of data
  – The data must be stored on comparatively slow hard disks and RAIDs
  – With the low bandwidth of TCP/UDP-based FTP, this was not an issue
  – It will be an issue for Globus ADTS XIO
- Solution
  – Decouple network processing from disk I/O
Design of the Globus ADTS XIO Driver
- Introduces
  – Multiple threads: read, write, and network threads
  – A set of buffers to stage the data
- The read thread prefetches a set of locations from disk and keeps them ready for the network
- Avoids frequent context switches
  – Low and high water marks
  – High water mark: maximum size of the circular buffer
  – Read when the number of available buffers drops below the low water mark
(A minimal sketch of this staging scheme follows.)
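A hedged sketch of the staging scheme (illustrative names, not the actual ADTS code): the read thread refills a circular buffer up to the high water mark and sleeps until the network thread drains it below the low water mark, keeping context switches infrequent.

```c
/* Sketch: circular staging buffer with low/high water marks. */
#include <pthread.h>

#define HIGH_WATER 16            /* capacity of the circular buffer */
#define LOW_WATER   4            /* refill trigger */
#define BLOCK_SIZE (256 * 1024)

static char staging[HIGH_WATER][BLOCK_SIZE];
static int count, head, tail;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t need_refill = PTHREAD_COND_INITIALIZER;
static pthread_cond_t have_data   = PTHREAD_COND_INITIALIZER;

/* Read thread: prefetch from disk, sleep between low and high water. */
void *read_thread(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        while (count >= LOW_WATER)        /* wake only below low water */
            pthread_cond_wait(&need_refill, &lock);
        while (count < HIGH_WATER) {      /* top up to high water */
            /* read_block(staging[head]); -- stands in for the disk read */
            head = (head + 1) % HIGH_WATER;
            count++;
        }
        pthread_cond_broadcast(&have_data);
        pthread_mutex_unlock(&lock);
    }
}

/* Network thread: drain staged buffers onto the wire. */
void *network_thread(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        while (count == 0)
            pthread_cond_wait(&have_data, &lock);
        /* send_block(staging[tail]); -- stands in for the network send */
        tail = (tail + 1) % HIGH_WATER;
        if (--count < LOW_WATER)          /* signal the read thread */
            pthread_cond_signal(&need_refill);
        pthread_mutex_unlock(&lock);
    }
}
```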
Evaluation
- Varying network delays
- Transmit 128 GB of aggregate data as multiples of 256 MB files
- Legend: staging buffer sizes
Evaluation
- Community Climate System Model (CCSM)
  – National Center for Atmospheric Research (NCAR) and Lawrence Livermore National Laboratory (LLNL)
  – Most files are 256 MB
- Ultra-Scale Visualization
  – ORNL and UC Davis
  – Most files are 2.6 GB
Thank You!
Network-Based Computing Laboratory http://nowlab.cse.ohio-state.edu/