SLIDE 1

Multi-GPU Scaling of Direct Sparse Linear System Solver for Finite-Difference Frequency-Domain Photonic Simulation

Cheng-Han Du*, I-Hsin Chung**, Weichung Wang*

* Institute of Applied Mathematical Sciences, National Taiwan University, Taipei, Taiwan
** IBM T. J. Watson Research Center, NY, US

GTC 2017 @ San Jose, May 8, 2017

SLIDE 2

Outline

 Introduction
 Implementation
 Numerical Results I
 P2P Matrix Sharing
 Numerical Results II
 Summary

SLIDE 3

Introduction

 Photonics
   • Waveguides
   • Resonant cavities
   • Frequency filters
   • Plasmonic devices

 Design concerns
   • Structural characteristics
   • Parameter refinement
   • Experiment data

(Ref: Sun et al., Nature 528, 2015; Ivinskaya & Lavrinenko, 2011)

SLIDE 4

Introduction - Why Multi-GPU Scaling

 Global supercomputing trend
   • High energy efficiency
   • Growing popularity in deep learning applications
   • Integration of high-performance numerical simulation and deep learning

(Image sources: ORNL, NVIDIA)

SLIDE 5

Introduction

Solver ecosystem (block diagram):
 Parallel Direct FDFD Solver Kernel
 Shift-Inverse Eigensolver
 Preconditioner and Algorithm for Iterative Side-Equation Solver
 Applications: Photonic Crystal Analyzer, Photonic Integrated Circuit Design, Broadband Spectral Analysis, Nonlinear Equations with Multiphysics Features
 Machine-Learning-Derived Behavior Model and Intelligent Design

SLIDE 6

Introduction

(Same solver-ecosystem diagram as the previous slide, annotated: "When iterative solver fails…")

SLIDE 7

Introduction

Parallel Direct FDFD Solver Kernel
 Objectives
   • Fast generation of numerical data for different parameters
   • Data-driven intelligent design of optical components
   • Explicit and fast acquisition of quantitative characteristics
   • Reduction of postprocessing and data storage/transfer requirements
 Finite-Difference Frequency-Domain (FDFD) method

SLIDE 8

Outline

 Introduction
 Implementation
 Numerical Results I
 P2P Matrix Sharing
 Numerical Results II
 Summary

SLIDE 9

Implementation

Parallel Direct FDFD Solver Kernel
 FDFD problem
   • Linear system (frequency-domain curl-curl equation):
     −∇ × ∇ × E + k₀² ε_r E = d
     discretized into a sparse linear system K x = d
 Direct solver for robust solution
   • Yee's mesh
   • Perfectly-matched layer
   • High-frequency problem
 Challenge
   • Heavy factorization loads

SLIDE 10

Implementation

 Compressed hierarchical Schur method (CHiS)
   • Domain decomposition, multi-level algorithm
   • 3D nested dissection of Yee's mesh (N_x × N_y × N_z)
   • Ideal periodic structure makes many blocks identical:
     E_1 = E_2 = E_3 = ⋯ = E_16
     T_{1,1} = T_{1,2} = T_{1,3} = ⋯ = T_{1,8}
     T_{2,1} = T_{2,2} = T_{2,3} = T_{2,4}
     T_{3,1} = T_{3,2}
     T_{4,1}
   (the Schur-update sketch below shows where these blocks enter)

SLIDE 11

Implementation

 Compressed hierarchical Schur method
   • Elimination tree deduplication
     • Diagonals
     • Interfaces to children (figure labels: J_V, J_M)

SLIDE 12

Implementation

(Elimination tree deduplication diagram, continued from the previous slide.)

SLIDE 13

Implementation

 Compressed hierarchical Schur method
   • Leaf-level Interface Compression (LIC)
     • Use one updating submatrix for multiple Schur complement submatrices, applying row/column permutations
     • Less sparse-matrix computation means less CPU-centric load
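Spelled out, my reading of the LIC bullet is the reuse pattern below; the permutation/selection operators P_k and Q_k are my own hypothetical notation, not symbols from the deck.

```latex
% Assumed reuse pattern: the updating submatrix U = J_M T^{-1} J_V is formed once,
% and each affected Schur-complement block S_k is updated through row/column
% permutations (selections) P_k, Q_k instead of recomputing U:
\[
  S_k \;\leftarrow\; S_k \;-\; P_k \, U \, Q_k, \qquad U = J_M \, T^{-1} J_V
\]
```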

SLIDE 14

Implementation

 Compressed Hierarchical Schur method
   • Expose larger chunks of matrix computation
   • Major function calls and libraries
     • Subdomains
       Sparse diagonal: sparse factorization
       Sparse interface: sparse LS solve and matrix multiply
       Libraries: (Option 1) PARDISO and Sparse BLAS, (Option 2) MUMPS
     • Separators
       Dense diagonal: dense LU
       Packed dense interface: dense LS solve and matrix multiply
       Libraries: BLAS (ZGEMM) and LAPACK (ZGETRF, ZGETRS); hardware acceleration on GPU via cuBLAS, cuSolver, etc.
   (a minimal GPU sketch of the dense path follows below)
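To make the dense separator path above concrete, here is a minimal, self-contained sketch of ZGETRF/ZGETRS on cuSOLVER chained into a cuBLAS ZGEMM. Matrix sizes and contents are placeholders, and this is an illustrative sketch rather than the authors' solver code; the point it mirrors from the following slides is that the ZGETRS result stays in device memory and feeds the ZGEMM directly, so only the final product crosses back to the host.

```cpp
// Sketch: factor T, solve T z = J_V, then form J_M * z, all on one GPU.
#include <cuda_runtime.h>
#include <cusolverDn.h>
#include <cublas_v2.h>
#include <cuComplex.h>
#include <vector>
#include <cstdio>

int main() {
    const int n = 512, nrhs = 128, m = 256;          // illustrative sizes only
    cusolverDnHandle_t sol;  cusolverDnCreate(&sol);
    cublasHandle_t blas;     cublasCreate(&blas);

    // Host matrices (column-major): T (n x n, diagonally dominant), J_V (n x nrhs), J_M (m x n)
    std::vector<cuDoubleComplex> T(n * n), JV(n * nrhs), JM(m * n);
    for (int i = 0; i < n * n; ++i)    T[i]  = make_cuDoubleComplex((i % 7) * 0.01, 0.0);
    for (int i = 0; i < n; ++i)        T[i * n + i] = make_cuDoubleComplex(10.0, 1.0);
    for (int i = 0; i < n * nrhs; ++i) JV[i] = make_cuDoubleComplex(1.0, 0.0);
    for (int i = 0; i < m * n; ++i)    JM[i] = make_cuDoubleComplex(0.5, 0.0);

    cuDoubleComplex *dT, *dJV, *dJM, *dUpd;
    cudaMalloc(&dT,   sizeof(cuDoubleComplex) * n * n);
    cudaMalloc(&dJV,  sizeof(cuDoubleComplex) * n * nrhs);
    cudaMalloc(&dJM,  sizeof(cuDoubleComplex) * m * n);
    cudaMalloc(&dUpd, sizeof(cuDoubleComplex) * m * nrhs);
    cudaMemcpy(dT,  T.data(),  sizeof(cuDoubleComplex) * n * n,    cudaMemcpyHostToDevice);
    cudaMemcpy(dJV, JV.data(), sizeof(cuDoubleComplex) * n * nrhs, cudaMemcpyHostToDevice);
    cudaMemcpy(dJM, JM.data(), sizeof(cuDoubleComplex) * m * n,    cudaMemcpyHostToDevice);

    // Dense LU of T (ZGETRF)
    int lwork = 0, *dIpiv, *dInfo;
    cusolverDnZgetrf_bufferSize(sol, n, n, dT, n, &lwork);
    cuDoubleComplex *dWork;
    cudaMalloc(&dWork, sizeof(cuDoubleComplex) * lwork);
    cudaMalloc(&dIpiv, sizeof(int) * n);
    cudaMalloc(&dInfo, sizeof(int));
    cusolverDnZgetrf(sol, n, n, dT, n, dWork, dIpiv, dInfo);

    // Dense solve T z = J_V (ZGETRS); the solution overwrites dJV and stays on the GPU
    cusolverDnZgetrs(sol, CUBLAS_OP_N, n, nrhs, dT, n, dIpiv, dJV, n, dInfo);

    // Schur-update contribution J_M * (T^{-1} J_V) via ZGEMM, still on the GPU
    cuDoubleComplex one = make_cuDoubleComplex(1.0, 0.0), zero = make_cuDoubleComplex(0.0, 0.0);
    cublasZgemm(blas, CUBLAS_OP_N, CUBLAS_OP_N, m, nrhs, n, &one, dJM, m, dJV, n, &zero, dUpd, m);

    // Only the m x nrhs product goes back to the host (D2H)
    std::vector<cuDoubleComplex> Upd(m * nrhs);
    cudaMemcpy(Upd.data(), dUpd, sizeof(cuDoubleComplex) * m * nrhs, cudaMemcpyDeviceToHost);
    printf("update(0,0) = (%f, %f)\n", cuCreal(Upd[0]), cuCimag(Upd[0]));

    cudaFree(dT); cudaFree(dJV); cudaFree(dJM); cudaFree(dUpd);
    cudaFree(dWork); cudaFree(dIpiv); cudaFree(dInfo);
    cublasDestroy(blas); cusolverDnDestroy(sol);
    return 0;
}
```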

SLIDE 15

Implementation

 GPU acceleration
   • Considerations
     • Multi-GPU scaling in a single node (scale-up): no longer solely based on nested dissection
     • Asynchronous streams for small submatrices: overlapping some computation kernels; hardware scheduling
     • Threaded GPU controls; thread affinity
   (a minimal threading sketch follows below)
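A minimal sketch of the "threaded GPU controls" idea, assuming one OpenMP host thread per device, each with its own stream and cuBLAS handle; the per-submatrix work loop is elided, and this is an assumption about the structure rather than the authors' scheduler.

```cpp
// Sketch: one host thread drives one GPU; copies and kernels are issued on that
// thread's stream so work on different GPUs (and different streams) can overlap.
#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <omp.h>

int main() {
    int ngpu = 0;
    cudaGetDeviceCount(&ngpu);
    if (ngpu == 0) return 0;

    #pragma omp parallel num_threads(ngpu)   // pin threads via OMP_PLACES / OMP_PROC_BIND for affinity
    {
        const int dev = omp_get_thread_num();
        cudaSetDevice(dev);                   // this thread controls exactly one GPU

        cudaStream_t stream;
        cudaStreamCreate(&stream);
        cublasHandle_t handle;
        cublasCreate(&handle);
        cublasSetStream(handle, stream);      // cuBLAS calls from this thread run on this stream

        // For each small submatrix assigned to this GPU one would issue, e.g.:
        //   cudaMemcpyAsync(dA, hA, bytes, cudaMemcpyHostToDevice, stream);
        //   cublasZgemm(handle, ...);        // overlaps with transfers/kernels on other streams
        cudaStreamSynchronize(stream);

        cublasDestroy(handle);
        cudaStreamDestroy(stream);
    }
    return 0;
}
```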

SLIDE 16

Implementation

 GPU acceleration

Factorize all diagonal blocks T_{j,k} related to level j. (CPU or GPU work.)

SLIDE 17

Implementation

 GPU acceleration
   • Asynchronously send some blocks to the GPU and perform T_{j,k}^{-1} J_V

SLIDE 18

Implementation

 GPU acceleration
   • Continue to ZGEMM with no D2H data transmission: T_{j,k}^{-1} J_V is kept in GPU memory for the J_M (T_{j,k}^{-1} J_V) operation later. The workspace is simply discarded when no longer needed.

SLIDE 19

Implementation

 GPU acceleration
   • Asynchronously perform ZGEMM: J_M (T_{j,k}^{-1} J_V)

SLIDE 20

Implementation

 GPU acceleration
   • Collect J_M (T_{j,k}^{-1} J_V) from all GPUs and perform the higher-level Schur update on the CPU

SLIDE 21

Implementation

 GPU acceleration
   • Continue with more ZGEMMs J_M (T_{j,k}^{-1} J_V) related to (T_{j,k}^{-1} J_V), and Schur updates…

SLIDE 22

Implementation

 GPU acceleration
   • Workload balance for multi-GPU
     • Distribute J_V blocks by parent levels
     • Tackle extreme cases with lots of duplicates
     • Minor increase in H2D transfer
   (one possible assignment scheme is sketched below)
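One plausible (hypothetical) reading of "distribute J_V blocks by parent levels": group the blocks by their parent separator and hand each group to the currently least-loaded GPU. The Block struct and its cost field below are illustrative assumptions, not the authors' data structures.

```cpp
// Sketch: greedy, parent-grouped assignment of J_V blocks to GPUs.
#include <vector>
#include <cstdio>

struct Block { int parent; long cost; };   // cost ~ rows x columns of the J_V block

std::vector<int> assign(const std::vector<Block>& blocks, int ngpu) {
    std::vector<long> load(ngpu, 0);
    std::vector<int> owner(blocks.size(), -1);
    // Blocks are assumed ordered so that blocks sharing a parent are contiguous;
    // each parent group goes, as a whole, to the least-loaded GPU.
    for (size_t i = 0; i < blocks.size(); ) {
        size_t j = i;
        long groupCost = 0;
        while (j < blocks.size() && blocks[j].parent == blocks[i].parent)
            groupCost += blocks[j++].cost;
        int best = 0;
        for (int g = 1; g < ngpu; ++g)
            if (load[g] < load[best]) best = g;
        for (size_t k = i; k < j; ++k) owner[k] = best;
        load[best] += groupCost;
        i = j;
    }
    return owner;
}

int main() {
    std::vector<Block> blocks = {{0, 40}, {0, 40}, {1, 90}, {2, 30}, {3, 120}};
    std::vector<int> owner = assign(blocks, 2);
    for (size_t i = 0; i < owner.size(); ++i)
        printf("block %zu -> GPU %d\n", i, owner[i]);
    return 0;
}
```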

SLIDE 23

Implementation

 GPU acceleration
   • Workload balance for multi-GPU
     • Panel J_V: each J_V column block should be large enough
     • Multiple J_M copies sent to GPUs
     • Moderate increase in H2D transfer

SLIDE 24

Implementation

 GPU acceleration

 Without workload balance

Finishing time > 325 seconds

SLIDE 25

Implementation

 GPU acceleration

 With workload balance

Finishing time < 250 seconds

SLIDE 26

Outline

 Introduction
 Implementation
 Numerical Results I
 P2P Matrix Sharing
 Numerical Results II
 Summary

SLIDE 27

Numerical Results I

 Hardware specifications

              Brillante                                  P8Exp
  CPU         2 × Intel E5-2670 v3 (12 + 12 cores used)  2 × IBM POWER8 (8 + 8 cores used)
  Memory      256 GB                                     1 TB
  GPU         2 × K40                                    4 × K80
  Software    Intel Parallel Studio 2016 update 1,       IBM ESSL and Parallel ESSL,
              Intel PARDISO, CUDA 7.5                    IBM XL Fortran and XL C Compiler,
                                                         MUMPS 5.0.1, CUDA 7.5

SLIDE 28

Numerical Results I

 SOI dielectric waveguide
   • Total grids: 79 × 319 × 39 (matrix dimension 2,948,517)
   • Wavelength: 1.5 μm
   • Grid size: 0.02 μm
   • 100 GB RAM

SLIDE 29

Numerical Results I

 Brillante: 2 × K40
   • ZGETRS + ZGEMM: 439.3 seconds (90% of overall time)

SLIDE 30

Numerical Results I

 Brillante: 2 × K40
   • Naïve GPU acceleration already yields good speedup due to high arithmetic intensity (AI). "Scatter" time includes the D2H transfer.

SLIDE 31

Numerical Results I

 Brillante: 2 × K40
   • Async streams apply to low-level separators, which finish in seconds even in CPU-only mode.

SLIDE 32

Numerical Results I

 Brillante: 2 × K40
   • Workload balance yields better speedup and multi-GPU scaling.

SLIDE 33

Numerical Results I

 P8Exp: 4 × K80 with autoboost
   • Good performance scaling in the quad-K80 server
   • Higher performance with half-K80 computing
   • Two threads compete for a single PCI-E link's bandwidth when using the full K80

SLIDE 34

Numerical Results I

 P8Exp: 4 × K80 with autoboost
   • AccTRSMM: multi-GPU scaling
   • Increased H2D transfer due to multiple J_M copies sent to work-sharing GPUs
   • We still get acceptable scaling performance

SLIDE 35

Numerical Results I

 Periodic air hole wavelength filter
   • No propagation at λ₀ = 1.5 μm
   • Total grids: 79 × 575 × 47 (matrix dimension 6,404,925)
   • 188 GB RAM

SLIDE 36

Numerical Results I

 Brillante: 2 × K40

SLIDE 37

Numerical Results I

 P8Exp: 4 × K80 with autoboost

SLIDE 38

Numerical Results I

 P8Exp: GPU scaling of AccTRSMM
   • Many more dense matrix operations
   • Good scaling in multi-GPU systems

SLIDE 39

Outline

 Introduction
 Implementation
 Numerical Results I
 P2P Matrix Sharing
 Numerical Results II
 Summary

SLIDE 40

P2P Matrix Sharing

 Improved multi-GPU scaling with P2P transfer
   • Past: multiple J_M copies sent to work-sharing GPUs
   • Growing H2D transfer with increasing GPU sharing
   • Major bottleneck for multi-P100 acceleration
   • No cuBLAS-XT: some matrix contents are already distributed across the GPUs

(Diagram labels: T⁻¹, Broadcast)

SLIDE 41

(Figure only.)

SLIDE 42

P2P Matrix Sharing

 Improved multi-GPU scaling with P2P transfer
   • J_M division
   • cudaMemcpyPeerAsync
   • Threaded GPU control with busy-waiting
   • T⁻¹ division: J_V is shared with identical T⁻¹
   • Expectation: replace massive H2D with P2P; reduced H2D transmission
 Other improvements
   • Asynchronous D2H transfer right after ZGEMM
   • D2H is counted in the AccTRSMM time in our P2P scheme
   (a P2P broadcast sketch follows below)
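A hedged sketch of the P2P sharing step described above: the panel is uploaded to one GPU once, then forwarded device-to-device with cudaMemcpyPeerAsync instead of repeating the H2D copy per GPU. The buffer size, the single broadcast source, and the topology checks are placeholder assumptions, not the authors' implementation.

```cpp
// Sketch: one H2D upload to GPU 0, then peer-to-peer broadcast to the other GPUs.
#include <cuda_runtime.h>
#include <vector>

int main() {
    int ngpu = 0;
    cudaGetDeviceCount(&ngpu);
    if (ngpu == 0) return 0;
    const size_t bytes = size_t(1) << 26;               // 64 MB stand-in for a J_M panel

    std::vector<char*> buf(ngpu);
    for (int d = 0; d < ngpu; ++d) {
        cudaSetDevice(d);
        cudaMalloc(&buf[d], bytes);
        for (int peer = 0; peer < ngpu; ++peer) {       // enable direct GPU-GPU paths where possible
            if (peer == d) continue;
            int can = 0;
            cudaDeviceCanAccessPeer(&can, d, peer);
            if (can) cudaDeviceEnablePeerAccess(peer, 0);
        }
    }

    // One H2D transfer of the panel to GPU 0 ...
    std::vector<char> host(bytes, 1);
    cudaSetDevice(0);
    cudaStream_t s0;
    cudaStreamCreate(&s0);
    cudaMemcpyAsync(buf[0], host.data(), bytes, cudaMemcpyHostToDevice, s0);
    cudaStreamSynchronize(s0);

    // ... then device-to-device copies replace the remaining per-GPU H2D copies.
    for (int d = 1; d < ngpu; ++d)
        cudaMemcpyPeerAsync(buf[d], d, buf[0], 0, bytes, s0);
    cudaStreamSynchronize(s0);

    for (int d = 0; d < ngpu; ++d) { cudaSetDevice(d); cudaFree(buf[d]); }
    cudaSetDevice(0);
    cudaStreamDestroy(s0);
    return 0;
}
```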

SLIDE 43

Outline

 Introduction
 Implementation
 Numerical Results I
 P2P Matrix Sharing
 Numerical Results II
 Summary

SLIDE 44

Numerical Results II

 IntelExp
   • 2 × Intel E5-2640 v4 (20 physical cores)
   • 8 × Tesla P100 with 16 GB device memory
   • PCI-E switch enclosure; no NVLink
 DGX-1
   • 2 × Intel E5-2698 v4 (40 physical cores)
   • 8 × NVLink-enabled Tesla P100

SLIDE 45

Numerical Results II

 IntelExp
   • PCI-E enclosure on one CPU (experimental build)
   • Aggregate CPU-GPU bandwidth: 10~12 GB/s (uni-directional)
   • GPU-GPU link bandwidth: 12.5 GB/s (uni-directional)

(Topology diagram: CPU0 and CPU1; GPU0 through GPU7)

SLIDE 46

Numerical Results II

 IntelExp: 4 GPUs
   • Consistent PCI-E speed between GPUs at 12.5 GB/s
   • Saturated CPU-GPU link

SLIDE 47

Numerical Results II

 IntelExp: 8 GPUs
   • Some GPU links slow down by half
   • Heavy congestion on the CPU-GPU links

SLIDE 48

Numerical Results II

 IntelExp: SOI waveguide simulation

SLIDE 49

Numerical Results II

 IntelExp: AccTRSMM Speedup (SOI waveguide)

SLIDE 50

Numerical Results II

 GPU AccTRSMM in the SOI waveguide case
   • Great scaling performance in computing
   • H2D and D2H transfer becomes the major scaling bottleneck
   • P2P sharing eliminates H2D growth in multi-GPU runs

           Total H2D (GB)     Total D2H (GB)     AccTRSMM time (s)    AccTRSMM scale
           No-P2P   W/P2P     No-P2P   W/P2P     No-P2P   W/P2P       No-P2P   W/P2P
  1-GPU    207.8    207.8     170.9    170.9     121.3    146.1       1.00×    1.00×
  2-GPU    341.1    207.8     170.9    170.9      85.4     89.6       1.42×    1.63×
  4-GPU    531.2    207.8     170.9    170.9      87.1     67.1       1.39×    2.18×
  8-GPU    805.5    207.8     170.9    170.9     109.3     58.4       1.11×    2.50×

SLIDE 51

Numerical Results II

 IntelExp: Periodic air hole wavelength filter

SLIDE 52

Numerical Results II

 IntelExp: AccTRSMM Speedup (Air hole filter)

SLIDE 53

Numerical Results II

 GPU AccTRSMM in the filter case
   • Great scaling performance in computing
   • H2D and D2H transfer becomes the major scaling bottleneck
   • P2P sharing eliminates H2D growth in multi-GPU runs

           Total H2D (GB)     Total D2H (GB)     AccTRSMM time (s)    AccTRSMM scale
           No-P2P   W/P2P     No-P2P   W/P2P     No-P2P   W/P2P       No-P2P   W/P2P
  1-GPU     427.5   427.5     348.0    348.0     320.4    376.3       1.00×    1.00×
  2-GPU     690.9   427.5     348.0    348.0     204.2    220.6       1.57×    1.71×
  4-GPU    1144.2   427.5     348.0    348.0     195.5    158.8       1.64×    2.37×
  8-GPU    1839.9   427.5     348.0    348.0     252.1    134.0       1.27×    2.81×

SLIDE 54

Numerical Results II

 DGX-1
   • Doubled CPU-GPU bandwidth in multi-GPU computing
   • Aggregate bandwidth: 24 GB/s (uni-directional)
   • NVLink
     • Up to 20 GB/s (uni-directional)
     • Over 18 GB/s observed in the profiler

(Image source: NVIDIA)

SLIDE 55

Numerical Results II

 DGX-1: SOI waveguide simulation

Strange CPU behavior with OpenMP under investigation

SLIDE 56

Numerical Results II

 DGX-1: AccTRSMM (SOI waveguide)

SLIDE 57

Numerical Results II

 DGX-1 AccTRSMM in the SOI waveguide case
   • Significant speedup from H2D and D2H (double CPU-GPU links)
   • NVLink further reduces sharing overheads
   • NVLink between CPU and GPU?

           AccTRSMM time (s)      AccTRSMM scale
           DGX-1     IntelExp     DGX-1    IntelExp
  1-GPU    146.1     146.1        1.00×    1.00×
  2-GPU     78.5      89.6        1.86×    1.63×
  4-GPU     47.5      67.1        3.08×    2.18×
  8-GPU     35.3      58.4        4.14×    2.50×

From 439.3 seconds (24 Haswell cores) to 35.3 seconds: over 12.4× speedup

SLIDE 58

Outline

 Introduction
 Implementation
 Numerical Results I
 P2P Matrix Sharing
 Numerical Results II
 Summary

SLIDE 59

Summary

 CHiS solver for 3D photonic simulation with multi-GPU
   • FLOP, time, and memory savings: CPU-GPU traffic reduced
   • Dense LA functions: ready for modern HPC architectures
   • Sparse LA functions: SpMM, sparse LS solver
   • Balanced multi-GPU acceleration with asynchronous data transfer and matrix computations
 P2P transfer: great computation scaling up to 8 GPUs
   • Successfully harnesses high-density GPU-accelerated systems
   • Fast transfer between CPU and GPU
 MPI implementation in progress
   • Fit computation task units into the GPU
   • Maintain resource saving and scheduling while exposing parallelization

SLIDE 60

Acknowledgement

 IBM Research
 NVIDIA Taiwan
 NVAITC Program

Thank you!