SLIDE 1

The Scalable Petascale Data-Driven Approach for the Cholesky Factorization with multiple GPUs

Yuki Tsujita, Toshio Endo, Katsuki Fujisawa
Tokyo Institute of Technology, Japan
ESPM2 2015 @ Austin, Texas, USA

SLIDE 2

What is the Cholesky factorization?

  • The Cholesky factorization is a factorization of a real symmetric positive-definite matrix into the product of a lower triangular matrix and its transpose
  • Statement: $A = LL^{T}$, where $A \in \mathbb{R}^{m \times m}$
  • The time complexity of the Cholesky factorization is $O(m^3)$

[Figure: A = L × L^T]
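
As a small illustration of the statement on this slide, the following is a minimal sketch of an unblocked Cholesky factorization in plain Python. It is not the GPU implementation discussed in the talk; the function name `cholesky` and the 3 × 3 test matrix are chosen here only for illustration.

```python
import math


def cholesky(a):
    """Return lower-triangular L with a = L * L^T for a symmetric positive-definite a."""
    m = len(a)
    l = [[0.0] * m for _ in range(m)]
    for j in range(m):
        # Diagonal entry: subtract the squares already placed in row j.
        diag = a[j][j] - sum(l[j][k] ** 2 for k in range(j))
        if diag <= 0.0:
            raise ValueError("matrix is not positive definite")
        l[j][j] = math.sqrt(diag)
        # Entries below the diagonal in column j.
        for i in range(j + 1, m):
            off = a[i][j] - sum(l[i][k] * l[j][k] for k in range(j))
            l[i][j] = off / l[j][j]
    return l


# Illustrative symmetric positive-definite matrix (not from the slides).
A = [[4.0, 12.0, -16.0],
     [12.0, 37.0, -43.0],
     [-16.0, -43.0, 98.0]]
L = cholesky(A)
```

The three nested loops (over j, i, and the summation index k) are where the O(m^3) cost comes from.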

SLIDE 3

SDPARA: Our Target application

  • Dense Cholesky factorization is an important kernel of SDPARA GPU ver. [Fujisawa et al. 2011]
  • SDPARA GPU ver.
– An application to solve SDPs (semidefinite programs)
– Offloads a part of its calculations to the GPU

Table: Performance record of CHOLESKY of SDPARA
Year   n           m           CHOLESKY (Flops)
2003   630         24,503      78.58 Giga
2010   10,462      76,554      2.414 Tera
2012   1,779,204   1,484,406   0.533 Peta
2014   2,752,649   2,339,331   1.713 Peta

SLIDE 4

Existing approach I: Synchronous Implementation [Fujisawa et al. IPDPS 2014]

  • Block Cholesky factorization
– The input data is divided into blocks
  • The calculations proceed block by block
– The blocks are assigned to the processes by two-dimensional block-cyclic distribution
  • Each process computes only on the data assigned to it
  • Each iteration proceeds synchronously
– The data are transferred from CPU to GPU at the beginning of each iteration
– If a process has no task in a certain iteration, it has to wait idle until the other processes finish

[Figure: block partition of the matrix into A11, A21, A22, with factored blocks L11 and L21 and the remaining block A22]
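
The two-dimensional block-cyclic assignment mentioned above can be sketched as follows. This is a generic formulation under stated assumptions, not the SDPARA code: the process grid shape (`pr` × `pc`) and the row-major rank numbering are assumptions, and the helper names are hypothetical.

```python
def owner(i, j, pr, pc):
    """Rank that owns block (i, j) on a pr x pc process grid (row-major ranks assumed)."""
    return (i % pr) * pc + (j % pc)


def local_blocks(rank, nblocks, pr, pc):
    """Lower-triangle blocks (i >= j) of an nblocks x nblocks block matrix owned by rank."""
    return [(i, j)
            for i in range(nblocks)
            for j in range(i + 1)
            if owner(i, j, pr, pc) == rank]


# Print the owner of every lower-triangle block for a 4 x 4 block matrix on a 2 x 2 grid.
for i in range(4):
    print(" ".join(str(owner(i, j, 2, 2)) for j in range(i + 1)))
```

In a given iteration only the processes that own blocks of the current panel or trailing submatrix have work, which is why some processes sit idle in the synchronous scheme.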

SLIDE 5

Existing approach II: Data-Driven Implementation [Tsujita & Endo, JSSPP 2015]

  • Kernels are divided into fine-grained tasks
– Basically, each task proceeds asynchronously
  • PCIe communication is performed only when it is needed
  • Inter-process communication is performed in a point-to-point way
  • We found that the performance may decrease at extremely large scale

[Figure: The DAG of the Cholesky factorization. Legend: DPOTRF, DTRSM, DSYRK, and DGEMM tasks; intra-process and inter-process dependencies; proc 0 to proc 3]
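
To make the task DAG concrete, here is a minimal sketch that enumerates the tasks of a right-looking tile Cholesky factorization and derives dependencies from which task last wrote each tile. The kernel names come from the slide; the loop order, the function names, and the last-writer rule (which captures read-after-write and write-after-write dependencies) are assumptions for illustration, not the authors' scheduler.

```python
from collections import Counter


def cholesky_tasks(t):
    """Yield (kernel, output_tile, input_tiles) for a t x t tile lower-triangular factorization."""
    for k in range(t):
        yield ("DPOTRF", (k, k), [])
        for i in range(k + 1, t):
            yield ("DTRSM", (i, k), [(k, k)])
        for i in range(k + 1, t):
            yield ("DSYRK", (i, i), [(i, k)])
            for j in range(k + 1, i):
                yield ("DGEMM", (i, j), [(i, k), (j, k)])


def build_dag(t):
    """Return (tasks, edges); an edge (a, b) means task b must wait for task a."""
    last_writer = {}
    tasks, edges = [], []
    for idx, (kernel, out, ins) in enumerate(cholesky_tasks(t)):
        tasks.append((kernel, out))
        for tile in ins + [out]:
            if tile in last_writer:
                edges.append((last_writer[tile], idx))
        last_writer[out] = idx
    return tasks, edges


tasks, edges = build_dag(4)
print(Counter(kernel for kernel, _ in tasks))
```

A data-driven runtime would fire a task as soon as all of its incoming edges are satisfied, instead of waiting for a global iteration boundary.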

SLIDE 6

Our Target

  • Large problem size (m > 2M)
– Use the capacity of host memory to hold the matrix data [Tsujita & Endo, JSSPP 2015]
  • High performance (> 1.7 PFlops)
– Use multiple GPUs and reduce PCIe communication by GPU memory-aware scheduling [Tsujita & Endo, JSSPP 2015]
– Solve the communication bottleneck by introducing scalable communication

SLIDE 7

Contribution

  • Goal
– Performance improvement of the multi-node, multi-GPU Cholesky factorization
  • Approach
– Data-driven scheduling to reduce data movement (presented at JSSPP 2015)
  • Scheduling tasks within the application
  • Task selection to improve GPU memory reuse
– An MPI communication pattern that improves scalability
  • Achieved 1.77 PFlops with 1360 nodes

SLIDE 8

Implementation overview

                                          Existing method I   Existing method II   Proposed method
                                          (Synchronous)       (Data-Driven)
Data driven                               ×                   ✓                    ✓
PCIe comm. reduction                      × (Naïve)           ✓ (Swap)             ✓ (Swap)
MPI comm. scalability                     ✓ (Group)           × (Point-to-point)   ✓ (Scalable communication)
Overlap of calculations & communications  ✓                   ✓                    ✓

SLIDE 9

Our Basic Data-Driven Implementation (Existing Approach II)

  • GPU memory-aware scheduling
– Task selection considering the reusability of GPU memory
  • Point-to-point asynchronous MPI communication
  • GPU memory management by swapping
– Select data that is no longer needed as the victim
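
The slide does not spell out the selection or eviction policy, so the following is only a sketch of one plausible reading: prefer the ready task whose tiles are already resident on the GPU, and evict a resident tile that no pending task needs. The scoring rule, the data layout, and the names `pick_task` and `pick_victim` are all assumptions.

```python
def pick_task(ready_tasks, resident):
    """Return the ready task with the most tiles already resident in GPU memory.

    Assumption: a task is a dict with a "tiles" list; ties go to the earliest task.
    """
    return max(ready_tasks,
               key=lambda task: sum(tile in resident for tile in task["tiles"]))


def pick_victim(resident, still_needed):
    """Return a resident tile that no pending task needs, or None if there is none."""
    for tile in sorted(resident):
        if tile not in still_needed:
            return tile
    return None


ready = [
    {"name": "DGEMM(2,1,0)", "tiles": [(2, 0), (1, 0), (2, 1)]},
    {"name": "DSYRK(1,0)", "tiles": [(1, 0), (1, 1)]},
]
resident = {(1, 0), (1, 1)}
chosen = pick_task(ready, resident)
victim = pick_victim(resident, still_needed={(1, 0)})
```

Here the DSYRK task is chosen because both of its tiles are already on the GPU, so it needs no PCIe transfer.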

SLIDE 10

Worker thread & Ignition thread

  • Each MPI process has several worker threads and one ignition thread
  • Worker
– Executes tasks
– Each process has two or three workers per GPU so that calculation, PCIe transfer, and MPI communication overlap in a simple way
  • Ignition
– Checks for the arrival of notice messages from other processes
– Handles data requests
  • All threads in a process share a single task queue
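
The thread structure above can be sketched with standard Python threads. This is a toy model under stated assumptions: notices are a plain list rather than MPI messages, a "task" is just a string, and the function name `run_process` is hypothetical.

```python
import queue
import threading


def run_process(notices, num_workers=3):
    """Toy model of one MPI process: one ignition thread feeds a shared task queue."""
    task_queue = queue.Queue()          # single queue shared by all threads
    done = []
    done_lock = threading.Lock()

    def ignition():
        # Stand-in for polling notice messages: each notice fires one task.
        for notice in notices:
            task_queue.put(f"task-after-{notice}")
        for _ in range(num_workers):
            task_queue.put(None)        # sentinel: tell each worker to stop

    def worker():
        while True:
            task = task_queue.get()
            if task is None:
                return
            with done_lock:
                done.append(task)       # stand-in for executing the task on the GPU

    threads = [threading.Thread(target=ignition)]
    threads += [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return sorted(done)


finished = run_process(["A", "B"])
```

With several workers per GPU, one worker can wait on a transfer while another runs a kernel, which is the overlap the slide refers to.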

SLIDE 11

Task Execution

[Figure: task execution across MPI Process 1, MPI Process 2, and MPI Process 3, each with a worker thread and an ignition thread (worker1/ignition1, worker2/ignition2, worker3/ignition3), running Task A, Task B, and Task C. Labels in the diagram: Notice, Firing, Send data request, Receive data request, Send data, Receive data, cudaMemcpy, Execute on GPU, Send notice of task end]

SLIDE 12

The pitfall of Data-Driven

[Figure: Speed (TFlops) vs. Number of Nodes, up to 20 nodes. Series: Synchronous, Data-Driven]

  • With the data-driven implementation, we get better performance

[Figure: Speed (TFlops) vs. Number of Nodes, up to 500 nodes. Series: SYNC (QAP5), SYNC (QAP6), SYNC (QAP7), D2 (QAP5), D2 (QAP6), D2 (QAP7)]

  • As the problem size or the number of nodes increases, the performance of the data-driven execution decreases

The suspected bottleneck is a concentration of MPI communication
Our approach is not the only one that suffers from this problem!

SLIDE 13

The pitfall of Data-Driven

The synchronous implementation uses MPI_Bcast for data transfer
But in the data-driven implementation:

  • Each task runs asynchronously -> MPI_Bcast and MPI_Ibcast cannot be used
  • When many processes request the same tile, point-to-point communication is executed 2√P times

The existing data-driven implementation shows lower performance in highly parallel situations

[Figure: the owner of tile T sends it point-to-point to 2√P requesting processes]

For scalable data transfer, we create a broadcast tree structure dynamically
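
To get a feel for the numbers, the sketch below compares the 2√P point-to-point sends made by a single owner with the number of rounds a tree needs to reach the same receivers. The tree model (every process that already holds the tile serves one new process per round) is an idealized assumption, not the exact tree the CSlist mechanism builds; the function names are hypothetical.

```python
import math


def flat_sends(p):
    """Sends issued by the tile owner alone when about 2*sqrt(p) processes request the tile."""
    return 2 * math.sqrt(p)


def tree_rounds(receivers):
    """Rounds to reach all receivers if every holder serves one new process per round.

    Holders double each round, so this equals ceil(log2(receivers + 1)).
    """
    return receivers.bit_length()


p = 4080  # number of MPI processes in the largest run on the slides
print(round(flat_sends(p)), tree_rounds(round(flat_sends(p))))
```

For P = 4080 the owner would issue roughly 128 sends on its own, whereas the idealized tree reaches the same receivers in 8 rounds.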

SLIDE 14

Scalable Communication

  • Presupposition
– Data is sent only when a process receives requests from other processes
– The order of data requests is not known in advance
  • For scalable data transfer, we make a CSlist (Client-Server list)
– One CSlist per tile
– A CSlist holds clients and their corresponding servers
– When a process receives a request, it checks the CSlist
  ・Server: send the data
  ・Others: forward the request to the client's server
– When a process sends data, it hands over a part of its clients to the requester

[Figure: CSlist for Tile A]
C  2  3  4  5
S  1  1  1  1

SLIDE 15

Scalable Communication

  • 1. request

[Figure: Process 1 to Process 5 and Tile A, with the CSlist for Tile A]
C  2  3  4  5
S  1  1  1  1

SLIDE 16

Scalable Communication

  • 2. data send

[Figure: Process 1 to Process 5, with two copies of Tile A and one CSlist per copy]
CSlist (first copy of Tile A)
C  2  3  4  5
S  3  -  1  1
CSlist (second copy of Tile A)
C  2  3
S  3  -

SLIDE 17

Scalable Communication

  • 1. request
  • 2. request (forward)
  • 3. data send

[Figure: Process 1 to Process 5, with two copies of Tile A and one CSlist per copy]
CSlist (first copy of Tile A)
C  2  3  4  5
S  3  -  1  1
CSlist (second copy of Tile A)
C  2  3
S  3  -
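
The request, forward, and data-send steps can be sketched as a small simulation. Several details here are assumptions rather than facts from the slides: the function name `handle_request`, the rule that a server hands over half of its remaining clients (lowest ranks first), and the choice to clear a client's entry as soon as its request has been seen, including when it is only forwarded.

```python
def handle_request(cslists, at, client, log):
    """Process `client`'s request for the tile when it arrives at process `at`.

    cslists maps a process rank to its CSlist: {client: server rank, or None once handled}.
    """
    cslist = cslists[at]
    server = cslist[client]
    assert server is not None, "each client is assumed to request the tile once"
    cslist[client] = None                    # this client will not ask this process again
    if server != at:
        log.append(("forward", at, server, client))
        handle_request(cslists, server, client, log)
        return
    log.append(("send", at, client))
    mine = sorted(c for c, s in cslist.items() if s == at)
    handed = mine[: len(mine) // 2]          # assumption: delegate half of the clients
    for c in handed:
        cslist[c] = client                   # later requests from c will be forwarded
    cslists[client] = {c: client for c in handed}
    cslists[client][client] = None           # the new holder needs no further copy


cslists = {1: {2: 1, 3: 1, 4: 1, 5: 1}}      # process 1 owns Tile A
log = []
handle_request(cslists, 1, 3, log)           # process 3 asks the owner and is served
snapshot = {rank: dict(c) for rank, c in cslists.items()}
handle_request(cslists, 1, 2, log)           # process 2 asks the owner and is forwarded to 3
print(log)
```

After the first request, the snapshot matches the CSlists drawn on the slides: process 1 lists server 3 for client 2, and process 3 holds a short CSlist for clients 2 and 3.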

SLIDE 18

Scalable termination detection

A process cannot exit even if all of its tasks have finished
→ The process may still receive requests for the data it owns from other running processes

Detecting process termination becomes difficult! We solve this by using the CSlist

A CSlist shows which clients have already requested the tile and which have not yet
So no further request message will arrive once the CSlists for all local tiles become empty
By using CSlists, we can detect process termination without any extra communication

  • If all server entries in the CSlist of a tile become empty, the tile has been sent to all processes that need it

[Figure: Process 1 to Process 5 and two copies of Tile A; the CSlists with clients 2 3 4 5 and clients 2 3 have all server entries empty]
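
The exit condition described on this slide can be written as a short check. This is a sketch under the assumption that each CSlist is a mapping from client rank to server rank, with `None` standing for an empty entry; the function name `can_exit` and the data layout are hypothetical.

```python
def can_exit(pending_tasks, cslists):
    """True when no local task remains and every CSlist of a locally held tile is empty.

    cslists maps a tile name to {client: server rank, or None once the entry is empty}.
    """
    no_requests_left = all(server is None
                           for cslist in cslists.values()
                           for server in cslist.values())
    return pending_tasks == 0 and no_requests_left


print(can_exit(0, {"Tile A": {2: None, 3: None}}))
```

Because the check only reads local state, no extra termination messages are exchanged between processes.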

SLIDE 19

Experiment Conditions

  • We use 1360 nodes of TSUBAME2.5

Node architecture of TSUBAME 2.5
CPU          Intel Xeon 2.93 GHz (6 cores) × 2
CPU memory   54 GiB
GPU          NVIDIA Tesla K20X × 3
GPU memory   6 GiB

  • Three MPI processes per node
  • One GPU per MPI process (3 GPUs/node)
  • Tile size: 2,048 × 2,048
  • GPU memory: 5,000 MiB per GPU
  • NVIDIA CUDA 7.0 and CUBLAS 7.0
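
A quick back-of-the-envelope calculation shows what these settings imply. It assumes double-precision elements (8 bytes), which is consistent with the D-prefixed kernel names but is not stated on this slide; the problem sizes are the ones listed for QAP5 and QAP9 in the evaluation.

```python
TILE_DIM = 2048
BYTES_PER_ELEMENT = 8            # assumption: double precision
MIB = 1024 * 1024
GPU_BUDGET_MIB = 5000            # GPU memory used per GPU, from the slide

tile_mib = TILE_DIM * TILE_DIM * BYTES_PER_ELEMENT // MIB
tiles_per_gpu = GPU_BUDGET_MIB // tile_mib


def tiles_per_side(m, tile_dim=TILE_DIM):
    """Number of tiles along one dimension of an m x m matrix (ceiling division)."""
    return -(-m // tile_dim)


print(tile_mib, tiles_per_gpu, tiles_per_side(379_350), tiles_per_side(1_962_225))
```

Under this assumption a tile occupies 32 MiB, so about 156 tiles fit in the 5,000 MiB budget of one GPU, which is why swapping and reuse-aware task selection matter.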

SLIDE 20

Performance Evaluation

  • Compared implementations

                        Existing approach I    Existing approach II   Proposed method
                        (Synchronous: SYNC)    (Data-Driven: DD)      (Proposal)
PCIe comm. reduction    ×                      ✓                      ✓
MPI comm. scalability   ✓ (Group)              × (Point-to-point)     ✓ (Scalable communication)

  • Evaluation
– Scalability evaluation
– Extremely large scale
  • Problem size
– QAP5: m = 379,350
– QAP6: m = 709,275
– QAP7: m = 1,218,400
– QAP9: m = 1,962,225

SLIDE 21

Scalability Evaluation

Scalability evaluation on TSUBAME2.5 using up to 400 nodes (3 GPUs per node)

[Figure: Speed (TFlops) vs. Number of Nodes, up to 500 nodes. Series: SYNC, DD, and Proposal for each of QAP5, QAP6, and QAP7]

With Data-Driven + Tree Comm.: 37% performance improvement, 695 TFlops on 400 nodes with 1200 GPUs
Without Tree Comm.: performance drops well below SYNC (communication bottleneck)

SLIDE 22

Extremely Large Scale

  • Scalability evaluation from 400 nodes to 1360 nodes (3 GPUs per node)

[Figure: Speed (TFlops) vs. Number of Nodes, up to 1500 nodes. Series: SYNC (QAP7), Proposal (QAP7), SYNC (QAP9), Proposal (QAP9)]

  • 1.775 PFlops on 1360 nodes with 4080 GPUs by our approach
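
The reported rates can be put in perspective with a little arithmetic. The sketch below uses only figures from the slides plus two assumptions: the standard m³/3 operation count for Cholesky, and, for the time estimate, a hypothetical run of a QAP9-sized problem at a sustained 1.775 PFlops (the slides do not state which problem produced that peak).

```python
def tflops_per_gpu(total_tflops, gpus):
    """Average sustained rate per GPU."""
    return total_tflops / gpus


def cholesky_seconds(m, pflops):
    """Estimated factorization time, assuming m**3 / 3 floating-point operations."""
    flops = m ** 3 / 3
    return flops / (pflops * 1e15)


print(round(tflops_per_gpu(695, 1200), 3))     # 400 nodes, 1200 GPUs
print(round(tflops_per_gpu(1775, 4080), 3))    # 1360 nodes, 4080 GPUs
print(round(cholesky_seconds(1_962_225, 1.775)))
```

The per-GPU rate is lower at 4080 GPUs than at 1200 GPUs, which gives a rough measure of the remaining scaling overhead.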

SLIDE 23

Related work

  • StarPU: a unified platform for task scheduling on heterogeneous multicore architectures [Cédric Augonnet et al.]
– A DAG scheduling framework for heterogeneous environments
– Allows each task to run on either CPUs or GPUs according to resource utilization in order to improve performance
– But StarPU does not have scalability improvement techniques like those of our approach
  • DAGuE: A generic distributed DAG engine for high performance computing [George Bosilca et al.]
– DAG (Directed Acyclic Graph) scheduler for distributed environments with GPUs
– The Cholesky factorization is one of its target applications
– But it is not clear how DAGuE treats memory objects when GPU memory is full

SLIDE 24

Conclusion

  • We solved the scalability issues of a typical data-driven implementation by introducing a scalable data transfer method and a termination detection method
  • Compared with the synchronous implementation, 37% performance improvement on 400 nodes and 1,200 GPUs of the TSUBAME2.5 supercomputer
  • Achieved 1.775 PFlops on 1360 nodes and 4080 GPUs of the TSUBAME2.5 supercomputer

SLIDE 25

Future Work

  • Use both CPUs and GPUs for kernel calculations
  • Comparative experiments with related work
  • Construct an ideal task selection model and conduct comparative experiments with it
  • Apply our approach to other applications