S7750 - THE MAKING OF DGX SATURNV: BREAKING THE BARRIERS TO AI SCALE


SLIDE 1

Presenter: Louis Capps, Solution Architect, NVIDIA, lcapps@nvidia.com

S7750 - THE MAKING OF DGX SATURNV: BREAKING THE BARRIERS TO AI SCALE

SLIDE 2

Basic OK
List
10 for x = 1 to 3
20 print x
30 next x
Run
1
2
3
OK

A TALE OF ENLIGHTENMENT

1 FPS

SLIDE 3

Assembly
      LDX #$00
dec:  INX
      JSR printx
      CPX #$03
      BNE dec
      BRK

DEEP LEARNING – A NEW COMPUTING PLATFORM

30 FPS DL -->

SLIDE 4

Innovation is fueled by the right engine!

  • Deep Learning scalability: move outside the box
  • Drive research and Deep Learning applications
  • Partner with university research, government, and industry collaborations
  • Enable data science in HPC

SATURNV PURPOSE

124 Node Supercomputing Cluster

SLIDE 5

124 NVIDIA DGX-1 Nodes – 992 P100 GPUs

8x NVIDIA Tesla P100 SXM GPUs – NVLink cube mesh
2x Intel Xeon 20-core CPUs
512 GB DDR4 system memory
SSD – 7 TB scratch + 0.5 TB OS

Mellanox 36 port EDR L1 and L2 switches

4 ports per system; fat-tree topology

Ubuntu 14.04, CUDA 8, OpenMPI 1.10.5a1, Docker, DL Frameworks

NVIDIA GPU BLAS + Intel MKL (NVIDIA GPU HPL)

Deep Learning applied research

Many users, frameworks, algorithms, networks, new approaches Embedded, robotic, auto, hyperscale, HPC

NVIDIA DGX SATURNV ARCHITECTURE

124 node Cluster

nvidia.com/dgx1

SLIDE 6

SATURNV STACK

SLIDE 7

DGX-1 MULTI-SYSTEM

SLIDE 8

NVIDIA DGX SATURNV

Greenest Supercomputer

SLIDE 9

HPL Setup

Problem contained mainly in GPU memory (~16 GB / GPU)
124 nodes × 8 GPUs/node × 16 GB/GPU = 15,872 GB

  • N = 1,419,552
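The problem size follows from the memory budget: HPL factors a dense N×N matrix of doubles (8 bytes each), so N is roughly sqrt(total_bytes / 8), rounded to a multiple of the HPL block size NB. A quick sanity-check sketch (NB = 96 is an assumed block size, not taken from the run):

```python
import math

def hpl_problem_size(nodes, gpus_per_node, mem_per_gpu_gb, nb=96):
    """Estimate HPL's N from aggregate GPU memory: the dense N x N
    matrix of doubles occupies 8*N^2 bytes, and N should be a
    multiple of the block size NB."""
    total_bytes = nodes * gpus_per_node * mem_per_gpu_gb * 1e9
    n = int(math.sqrt(total_bytes / 8))
    return (n // nb) * nb

# 124 DGX-1 nodes x 8 P100s x 16 GB each = 15,872 GB
print(hpl_problem_size(124, 8, 16))  # ~1.41M
```

The run's slightly larger N = 1,419,552 implies the matrix extended a little beyond GPU memory alone, which is consistent with "contained mainly in GPU memory."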

Measurement

PDU input power time-stamped during the full run
All cluster hardware measured – nodes, switches, storage

Performance

HPL Rpeak – 4,896 TF
HPL Rmax – 3,307 TF
Power, full-run avg – 321.2 kW
Power, core-phase avg – 349.5 kW

9.4 GF/W – 40% better than the nearest competing technology
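The 9.4 GF/W figure is simply Rmax divided by the average power during the core compute phase:

```python
def gflops_per_watt(rmax_tflops, power_kw):
    """Green500-style efficiency: sustained GFLOPS per watt of input power."""
    return (rmax_tflops * 1e3) / (power_kw * 1e3)

# SATURNV: Rmax = 3,307 TF at a core-phase average of 349.5 kW
print(round(gflops_per_watt(3307, 349.5), 2))  # 9.46
```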

NVIDIA DGX-1 SATURNV HPL RUN

124 node Supercomputing cluster

SATURNV produced groundbreaking 9.4 GF/W at full scale

  • Sets the stage for future exascale-class computing

~15 kW sustained per rack

SLIDE 10

NOV2016 TOP GREEN500 SYSTEM

Green500.org Top500.org

SATURNV produced groundbreaking 9.4 GF/W at full scale

  • Sets the stage for future exascale-class computing

SLIDE 11

HPL – High Performance Linpack

  • Multi-system benchmark – measures optimized double-precision floating-point performance
  • Solves a system of dense linear equations
  • One system or many connected in a cluster – usually Ethernet or InfiniBand
  • Single problem split across many systems – single final performance number
  • Well designed to scale across large clusters and push limits

Top500 (top500.org)

List of the fastest HPL clusters in the world
Updated twice a year – June and November – published at the ISC and SC conferences

Green500 (green500.org)

Same HPL clusters, but ranked by energy efficiency (performance per watt) during the HPL run
Published at the same time as the Top500

WHAT IS HPL, TOP500, GREEN500?

SLIDE 12

Compute

  • Significant math performance – FP32, FP16, INT8
  • Highly optimized frameworks
  • Training, Inference

Interconnect

  • Multiple compute units inside node
  • Multiple systems

Storage

  • Low latency, high bandwidth
  • Equal perf to all systems
  • Local caching for DL workloads

Facilities

  • Sufficient for bursts
  • Maintain inlet air temp always
  • High power density

DGX-1 SUPERCOMPUTER CHALLENGES

Giant Leap Towards Exascale AI

SLIDE 13

NVIDIA DGX-1 COMPUTE

NCCL Collective Library

SLIDE 14

DGX-1 single system considerations

  • Higher performance per system
  • 27x to 58x faster
  • Ingest data faster, provides faster results
  • Also more power and heat
  • High data ingest for DL workloads
  • More storage and I/O into single system
  • Cache data locally
  • NFS cache on local SSD for training data
  • Higher power/thermal density
  • Example: 32 racks @ 750 kW vs 200 racks @ 1,000 kW
  • Ambient temperatures very important
  • Silicon uses more power @ higher temps
  • Clocks will gate at thermal and power limits
  • Variability lowers overall performance of multi-GPU and multi-system runs

DGX-1 COMPUTE AND MULTI-SYSTEM

SLIDE 15

DGX-1 COMPUTE CONSIDERATIONS

#1 Recommendation - Using containers improves performance

  • Access to latest NVIDIA tuned codes
  • Latest NCCL libraries

Clocking

  • CPUs set to performance mode to improve memory/I/O bandwidth
  • Leave GPU clocks at default – if you do set them, use base or slightly higher
  • Running pinned at max can cause extreme variation and reduced performance, depending on workload
  • Monitor with nvidia-smi dmon
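As a rough illustration of monitoring with `nvidia-smi dmon`, a small parser over its default output lets you watch the processor clock for gating. The column layout assumed below (gpu pwr gtemp mtemp sm mem enc dec mclk pclk) is dmon's default but may vary with driver version; the sample row is fabricated for illustration:

```python
def parse_dmon_line(line):
    """Parse one data row of `nvidia-smi dmon` default output
    (# gpu pwr gtemp mtemp sm mem enc dec mclk pclk).
    Returns (gpu_index, power_watts, pclk_mhz), or None for header rows."""
    if line.lstrip().startswith("#"):
        return None  # header/comment rows start with '#'
    f = line.split()
    # f[9] is pclk, the processor (graphics) clock in MHz
    return int(f[0]), int(f[1]), int(f[9])

# Hypothetical dmon row: GPU 0 drawing 161 W at a 1480 MHz processor clock
gpu, watts, pclk = parse_dmon_line("    0   161    62     -    99    55     0     0   715  1480")
```

A sudden drop in pclk under steady load is the clock-gating signature described above.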

[Chart: Effects of Clocking – GPU clock (MHz) over time]

SLIDE 16

DGX-1 COMPUTE CONSIDERATIONS

Affinity

  • Best performance when CPU/GPU/mem/IB affinity are aligned
  • E.g. CPU socket 0 <-> GPU0/1 <-> mlx5_0

Interrupt traffic can be high

  • Keep core 0 and core 20 free for interrupts

SLIDE 17

Example affinity with numactl:

mpirun \
  -np 4 --bind-to node --mca btl openib,sm,self --mca btl_openib_if_include mlx5_0 -x CUDA_VISIBLE_DEVICES=0 numactl --physcpubind=1-4 ./mycode : \
  -np 4 --bind-to node --mca btl openib,sm,self --mca btl_openib_if_include mlx5_0 -x CUDA_VISIBLE_DEVICES=1 numactl --physcpubind=6-9 ./mycode : \
  -np 4 --bind-to node --mca btl openib,sm,self --mca btl_openib_if_include mlx5_1 -x CUDA_VISIBLE_DEVICES=2 numactl --physcpubind=10-13 ./mycode : \
  -np 4 --bind-to node --mca btl openib,sm,self --mca btl_openib_if_include mlx5_1 -x CUDA_VISIBLE_DEVICES=3 numactl --physcpubind=15-18 ./mycode : \
  -np 4 --bind-to node --mca btl openib,sm,self --mca btl_openib_if_include mlx5_2 -x CUDA_VISIBLE_DEVICES=4 numactl --physcpubind=21-24 ./mycode : \
  -np 4 --bind-to node --mca btl openib,sm,self --mca btl_openib_if_include mlx5_2 -x CUDA_VISIBLE_DEVICES=5 numactl --physcpubind=25-28 ./mycode : \
  -np 4 --bind-to node --mca btl openib,sm,self --mca btl_openib_if_include mlx5_3 -x CUDA_VISIBLE_DEVICES=6 numactl --physcpubind=30-33 ./mycode : \
  -np 4 --bind-to node --mca btl openib,sm,self --mca btl_openib_if_include mlx5_3 -x CUDA_VISIBLE_DEVICES=7 numactl --physcpubind=35-38 ./mycode
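The per-GPU bindings in the mpirun example can be captured in a small table, which makes it easier to generate launch prefixes programmatically. The core ranges and HCA pairings below mirror that example; this is a sketch, not NVIDIA's tooling:

```python
# Per-GPU affinity on a 2-socket, 20-core-per-socket DGX-1, mirroring
# the mpirun example: each GPU gets four cores on its own socket and
# shares an IB HCA with its PCIe neighbor; cores 0 and 20 stay free
# for interrupt handling.
GPU_AFFINITY = {
    0: ("1-4",   "mlx5_0"), 1: ("6-9",   "mlx5_0"),
    2: ("10-13", "mlx5_1"), 3: ("15-18", "mlx5_1"),
    4: ("21-24", "mlx5_2"), 5: ("25-28", "mlx5_2"),
    6: ("30-33", "mlx5_3"), 7: ("35-38", "mlx5_3"),
}

def rank_spec(gpu, binary="./mycode", np=4):
    """Build one mpirun app-context segment for a GPU's ranks."""
    cores, hca = GPU_AFFINITY[gpu]
    return (f"-np {np} --mca btl_openib_if_include {hca} "
            f"-x CUDA_VISIBLE_DEVICES={gpu} "
            f"numactl --physcpubind={cores} {binary}")

cmd = "mpirun " + " : ".join(rank_spec(g) for g in range(8))
```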

DGX-1 MULTI-SYSTEM CONSIDERATIONS

[Diagram: DGX-1 topology – two CPU sockets with local memory, PCIe switches connecting GPU pairs to Mellanox IB HCAs, uplinked to an IB leaf switch]

SLIDE 18

Design topologies that reduce latency and improve total bandwidth

  • Fat-tree topologies, for instance
  • Equal bandwidth from a system all the way up to the top-level switch
  • Ensure GPUDirect RDMA is enabled
  • DL and many computational workloads rely on fast synchronization
  • Collectives
  • Consistent iteration times

System hierarchy

  • CPU0 <-> GPU0/1 <-> mlx5_0
  • CPU0 <-> GPU2/3 <-> mlx5_1
  • CPU1 <-> GPU4/5 <-> mlx5_2
  • CPU1 <-> GPU6/7 <-> mlx5_3

If designing with only two IB ports, hook up mlx5_0 and mlx5_2

DGX-1 MULTI-NODE INTERCONNECT DESIGN

6,012 GB/s

SLIDE 19

DGX-1 multi-system considerations

  • High node to node communications
  • DL and HPC workloads

4 IB ports → 2 ports

  • DL: up to 5% loss
  • Compute: up to 18% loss

1 IB port per system: low performance

  • Significant contention for many workloads
  • Can't use GPUDirect RDMA across the full system

Switch hierarchy critical

  • Low bandwidth on second level
  • Same issues as lowering ports per system
  • Contention, lower bandwidth, variability

DGX-1 MULTI-SYSTEM INTERCONNECT

SLIDE 20

Storage needs

  • HPC needs well known
  • Parallel FS like Lustre and Spectrum Scale well suited
  • DL workloads just being understood
  • Read dominated
  • Input data rarely changes
  • Can be raw or formatted in a DB (like LMDB)
  • Large group of random read, then reread same data later
  • Approaches
  • Local caching helps significantly
  • Can be many GB (>16GB for instance)
  • Another approach is keep full datasets local (>100GB for ImageNet)
  • Local SSD RAID
  • Alternately, copy all data to nodes at beginning of job
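
The "copy all data to nodes at the beginning of the job" approach is a one-time staging step. A minimal sketch with hypothetical paths (the `/raid` SSD mount and NFS path are assumptions for illustration):

```python
import shutil
from pathlib import Path

def stage_dataset(shared_src, local_cache):
    """Copy a read-only training dataset from shared storage (e.g. the
    central NFS) to the node-local SSD once; subsequent epochs reread
    from local disk instead of the shared filesystem."""
    src, dst = Path(shared_src), Path(local_cache)
    if not dst.exists():  # idempotent: skip if already staged
        shutil.copytree(src, dst)
    return dst

# Hypothetical paths: central NFS mount -> local SSD scratch
# data_dir = stage_dataset("/nfs/datasets/imagenet", "/raid/cache/imagenet")
```

Because DL input data rarely changes, the existence check is usually a safe cache-validity test; a checksum or manifest comparison would be more robust.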

Reference designs

  • 10Gb attached Central NFS with local caching
  • Spectrum Scale IB attached (still evaluating)
  • Lustre IB attached (still evaluating)

DGX-1 STORAGE CONSIDERATIONS

SLIDE 21

  • CANDLE – accelerate cancer research
  • Energy / Fusion – the future of low-cost energy
  • Weather and Climate – disaster preparedness
  • Astrophysics – our future?
  • Autonomous cars

AI GRAND CHALLENGES

SLIDE 22

Summary

DGX-1 crafted for AI and Computational workloads

  • High compute density, but also high power and thermal density
  • Watch ambient – can cause large variability

A single system has large demands in data ingest and GPU-to-GPU communication
Multiple DGX-1 systems place large demands on inter-node communication for most workloads

  • Need at least two IB rails per system (1 EDR IB port for every 2 GPUs)

DL Storage needs are very high

  • But read dominated (vs writes with HPC)

Many codes benefit significantly when watching affinity

  • Align CPU/memory with GPUs and IB cards
  • Avoid cores handling interrupts

NVIDIA pre-made containers significantly reduce user work

  • Affinity is already handled
  • Provides technologies like NCCL and the latest, tuned code and frameworks

DGX-1 DL SCALABILITY SUMMARY

SLIDE 23

Thanks!!! More info at NVIDIA DGX-1 System Architecture:

  • http://www.nvidia.com/object/dgx-1-system-architecture-whitepaper.html

CANDLE sessions (http://www.gputechconf.com/agenda/schedule)

  • S7788 - CANDLE: PREDICTING TUMOR CELL RESPONSE TO DRUG TREATMENTS
  • S7782 - THE DOE AND NCI PARTNERSHIP ON PRECISION ONCOLOGY AND THE CANCER MOONSHOT
  • S7792 - BUILDING EXASCALE DEEP LEARNING TOOLS TO HELP UNDERSTAND CANCER BIOLOGY AT THE MOLECULAR SCALE
  • S7780 - BUILDING EXASCALE DEEP TEXT COMPREHENSION TOOLS FOR EFFECTIVE CANCER SURVEILLANCE


S7754 - WHAT'S NEXT IN DGX SERVER SOLUTIONS FOR DEEP LEARNING

  • Thursday, May 11, 10:00 AM - 10:50 AM – Room 210B