SLIDE 1

Introduction to StarNEig

A Task-based Library for Solving Nonsymmetric Eigenvalue Problems

Mirko Myllykoski and Carl Christian Kjelgaard Mikkelsen

in collaboration with Angelika Schwarz, Lars Karlsson, Bo Kågström, and Mahmoud Eljammaly

Department of Computing Science, Umeå University

PPAM 2019

SLIDE 2

Eigenvalue problem

◮ The StarNEig library aims to implement a complete stack of algorithms for solving dense nonsymmetric eigenvalue problems.
◮ Both standard (Axᵢ = λᵢxᵢ) and generalized (Axᵢ = λᵢBxᵢ) eigenvalue problems are considered.

SLIDE 3

Eigenvalue problem (algorithm stack)

Figure: An illustration of the complete algorithm stack in the standard case (preprocessing, Hessenberg reduction, Schur reduction with AED, eigenvalue reordering, and eigenvectors, yielding eigenvalues and invariant subspaces).

SLIDE 4

Motivation (eigenvalue reordering)

◮ In some cases, we want to reorder the Schur form S such that a selected cluster of eigenvalues appears in the leading diagonal blocks of the updated Schur form S̃.
◮ This gives an orthonormal basis for the desired invariant subspace (see the sketch below).

Figure: An illustration of the reordering process in the standard case.
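To make the reordering step concrete, here is a minimal single-core sketch using LAPACK's dtrsen through the LAPACKE C interface. This is the sequential analogue of the operation StarNEig parallelizes, not StarNEig's own API, and the 4 × 4 matrix and selection vector are made-up illustration data.

#include <stdio.h>
#include <lapacke.h>

int main(void)
{
    /* A 4x4 real Schur form T (upper triangular here, so the
       eigenvalues 1, 2, 3, 4 sit on the diagonal) and Q = I. */
    double T[16] = {
        1.0, 0.1, 0.2, 0.3,
        0.0, 2.0, 0.4, 0.5,
        0.0, 0.0, 3.0, 0.6,
        0.0, 0.0, 0.0, 4.0 };
    double Q[16] = {
        1.0, 0.0, 0.0, 0.0,
        0.0, 1.0, 0.0, 0.0,
        0.0, 0.0, 1.0, 0.0,
        0.0, 0.0, 0.0, 1.0 };

    /* Select the eigenvalues 3 and 4; dtrsen moves them to the
       leading diagonal blocks of the updated Schur form. */
    lapack_logical select[4] = { 0, 0, 1, 1 };
    double wr[4], wi[4], s, sep;
    lapack_int m;

    lapack_int info = LAPACKE_dtrsen(LAPACK_ROW_MAJOR, 'B', 'V',
                                     select, 4, T, 4, Q, 4,
                                     wr, wi, &m, &s, &sep);
    if (info != 0) return 1;

    /* The first m = 2 columns of the updated Q now form an
       orthonormal basis for the invariant subspace associated
       with the selected eigenvalues. */
    printf("reordered eigenvalues: %g %g %g %g\n",
           wr[0], wr[1], wr[2], wr[3]);
    return 0;
}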

SLIDE 5

Motivation (accumulated transformations)

◮ A modern algorithm

◮ groups a set of orthogonal transformations together and
◮ initially applies them only within a small diagonal window.

◮ The transformations are accumulated and later propagated with level 3 BLAS operations.

Figure: An illustration of accumulated transformations (group transformations, apply locally, propagate with BLAS-3 updates).
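The propagate step is plain matrix multiplication, which is why it runs at level 3 BLAS speed. The following sketch (an illustrative helper, not StarNEig code) applies an accumulated k × k orthogonal transformation Q_acc to an off-diagonal row slab R, i.e. R ← Q_accᵀ R, with a single GEMM call:

#include <string.h>
#include <cblas.h>

/* Propagate an accumulated k-by-k orthogonal transformation Q_acc
   to an off-diagonal row slab R (k rows, m columns): R <- Q_acc^T R.
   work must hold k*m doubles. */
void propagate_row_slab(int k, int m, const double *Q_acc,
                        double *R, double *work)
{
    /* work = Q_acc^T * R: one level 3 BLAS call instead of
       applying each small transformation individually */
    cblas_dgemm(CblasRowMajor, CblasTrans, CblasNoTrans,
                k, m, k, 1.0, Q_acc, k, R, m, 0.0, work, m);
    memcpy(R, work, (size_t)k * m * sizeof(double));
}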

SLIDE 6

Motivation (concurrent windows)

◮ Multiple diagonal windows can be active concurrently.
◮ The level 3 BLAS updates must be propagated in a sequentially consistent order.

◮ Requires careful coordination!

Figure: An illustration of two concurrent windows (the overlapping update regions conflict).

SLIDE 7

Motivation (ScaLAPACK-style approach)

◮ Eigenvalue reordering is implemented in ScaLAPACK¹.
◮ With p cores, we can have up to √p concurrent windows.
◮ The transformations are broadcast and applied in parallel.

◮ The theoretically possible degree of parallelism is p.
◮ Only if we have √p concurrent windows.

Figure: An illustration of a ScaLAPACK-style algorithm.

¹ Granat, R., Kågström, B., Kressner, D.: Parallel eigenvalue reordering in real Schur forms. Concurrency and Computation: Practice and Experience 21(9), 1225–1250 (2009).

SLIDE 8

Motivation (task-based approach and task graphs)

◮ Computational work is cut into self-contained tasks.
◮ A runtime system

◮ derives the task dependencies and
◮ schedules the tasks to computational resources.

◮ The task dependencies can be visualized as a task graph.

Figure: A simplified task graph arising from eigenvalue reordering.
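StarNEig delegates this to the StarPU runtime (see slide 11). The standalone sketch below, assuming StarPU ≥ 1.2 is installed, shows the mechanism in miniature: each task declares how it accesses each piece of registered data, and the runtime derives the dependencies and schedules the tasks. The codelet and vector here are illustrative toys, not StarNEig tasks.

#include <starpu.h>

/* A trivial CPU task: scale a vector in place. */
static void scale_cpu(void *buffers[], void *cl_arg)
{
    (void)cl_arg;
    double *x = (double *)STARPU_VECTOR_GET_PTR(buffers[0]);
    unsigned n = STARPU_VECTOR_GET_NX(buffers[0]);
    for (unsigned i = 0; i < n; i++)
        x[i] *= 2.0;
}

static struct starpu_codelet cl = {
    .cpu_funcs = { scale_cpu },
    .nbuffers  = 1,
    .modes     = { STARPU_RW },   /* access mode drives dependencies */
};

int main(void)
{
    if (starpu_init(NULL) != 0) return 1;

    double x[4] = { 1.0, 2.0, 3.0, 4.0 };
    starpu_data_handle_t h;
    starpu_vector_data_register(&h, STARPU_MAIN_RAM,
                                (uintptr_t)x, 4, sizeof(double));

    /* Both tasks access h in read-write mode, so the runtime
       inserts a dependence and runs them in submission order. */
    starpu_task_insert(&cl, STARPU_RW, h, 0);
    starpu_task_insert(&cl, STARPU_RW, h, 0);

    starpu_data_unregister(h);   /* waits for the tasks and syncs x */
    starpu_shutdown();
    return 0;
}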

SLIDE 9

Motivation (more opportunities for concurrency)

◮ Real-life task graphs are much more complex.

◮ But they contain more opportunities for increased concurrency.

◮ The runtime system unrolls the task graph.

◮ No global synchronization.
◮ Computational steps are allowed to overlap and merge.

Figure: An illustration of a task-based algorithm².

² Myllykoski, M.: A task-based algorithm for reordering the eigenvalues of a matrix in real Schur form. In: Parallel Processing and Applied Mathematics, PPAM 2017. LNCS, vol. 10777, pp. 207–216. Springer International Publishing (2018).

SLIDE 10

Motivation (GPUs, distributed memory, other benefits)

◮ Other benefits of the task-based approach include

◮ better load balancing,
◮ task priorities,
◮ accelerator (GPU) support with performance models,
◮ automatic data transfers between memory spaces and
◮ implicit MPI communications.

Figure: An illustration of implicit MPI communications (sends and receives between nodes 0–3).
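As a sketch of what "implicit" means here, the toy program below (assuming StarPU is built with its MPI support; again, not StarNEig code) registers a vector with an owner rank and lets the runtime post the matching sends and receives when a task on another rank touches it:

#include <starpu.h>
#include <starpu_mpi.h>

static void scale_cpu(void *buffers[], void *cl_arg)
{
    (void)cl_arg;
    double *x = (double *)STARPU_VECTOR_GET_PTR(buffers[0]);
    unsigned n = STARPU_VECTOR_GET_NX(buffers[0]);
    for (unsigned i = 0; i < n; i++)
        x[i] *= 2.0;
}

static struct starpu_codelet cl = {
    .cpu_funcs = { scale_cpu },
    .nbuffers  = 1,
    .modes     = { STARPU_RW },
};

int main(int argc, char **argv)
{
    int rank;
    starpu_init(NULL);
    starpu_mpi_init(&argc, &argv, 1);   /* 1: let StarPU call MPI_Init */
    starpu_mpi_comm_rank(MPI_COMM_WORLD, &rank);

    /* Rank 0 owns the vector; the other ranks register a placeholder. */
    double x[4] = { 1.0, 2.0, 3.0, 4.0 };
    starpu_data_handle_t h;
    if (rank == 0)
        starpu_vector_data_register(&h, STARPU_MAIN_RAM,
                                    (uintptr_t)x, 4, sizeof(double));
    else
        starpu_vector_data_register(&h, -1, (uintptr_t)NULL,
                                    4, sizeof(double));
    starpu_mpi_data_register(h, /* tag */ 42, /* owner */ 0);

    /* Execute the task on rank 1; the runtime transfers the data
       from rank 0 and back without any explicit MPI calls. */
    starpu_mpi_task_insert(MPI_COMM_WORLD, &cl,
                           STARPU_EXECUTE_ON_NODE, 1,
                           STARPU_RW, h, 0);

    starpu_data_unregister(h);
    starpu_mpi_shutdown();
    starpu_shutdown();
    return 0;
}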

SLIDE 11

StarNEig library (overview)

◮ Designed and implemented at Umeå University as a part of the NLAFET project.
◮ Runs on top of the StarPU task-based runtime system.
◮ Targets both

◮ shared memory and
◮ distributed memory machines.

◮ Some components of the library support GPUs.
◮ Real arithmetic is supported; complex arithmetic is planned.
◮ A beta release (v0.1-beta.2) is available at https://github.com/NLAFET/StarNEig.

SLIDE 12

StarNEig library (current status)

[Table: status of the Hessenberg, Schur, Reordering, and Eigenvectors components on shared memory (SM), distributed memory (DM), and GPU, for both the standard and the generalized case. Legend: Ready; Experimental; LAPACK or ScaLAPACK wrapper; ✗ In progress; — Not planned.]

SLIDE 13

Distributed memory (data distribution)

◮ StarNEig distributes matrices in rectangular blocks of a uniform size.
◮ The user has three options:

1. Use the default data distribution.
2. Use a two-dimensional block cyclic distribution.
3. Define a data distribution function d : Z⁺ × Z⁺ → Z⁺ that maps the block indices to the MPI rank space (a sketch follows the figure below).

Figure: Examples of various data distributions: (a) 2D block cyclic, (b) arbitrary.
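As an illustration of option 3, a distribution function realizing the 2D block cyclic layout of panel (a) over a pm × pn process mesh could look as follows (the signature is made up for illustration; it is not StarNEig's actual callback type):

/* Map block (i, j) to an MPI rank: a 2D block cyclic distribution
   over a pm-by-pn process mesh. */
int block_cyclic_rank(int i, int j, int pm, int pn)
{
    return (i % pm) * pn + (j % pn);
}

For example, on a 2 × 3 mesh, block (4, 6) maps to rank (4 mod 2) · 3 + (6 mod 3) = 0.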

SLIDE 14

Distributed memory (block size)

◮ StarNEig divides the distributed blocks into square tiles.
◮ Tile size is closely connected to task granularity.

◮ A tiny tile size ⇒ fine-grained task granularity ⇒ large scheduling overhead.
◮ Distributed blocks should be relatively large.
◮ Many ScaLAPACK-style codes are designed for / perform optimally with smaller block sizes.

Figure: An illustration of how the blocks are divided into tiles.

SLIDE 15

Distributed memory (CPU core mapping)

◮ StarPU manages a set of worker threads.

◮ Usually one thread per CPU core / GPU + MPI thread.

◮ One process per node (1ppn) configuration required.

◮ A node can be, e.g., a full compute node or a NUMA island.
◮ Many ScaLAPACK-style codes are designed for / perform optimally in a one process per core (1ppc) configuration.

Figure: Illustrations of CPU core mappings and data distributions for ScaLAPACK (one process per core) versus StarNEig (one process per node).

SLIDE 16

Distributed memory (ScaLAPACK compatibility)

◮ StarNEig is compatible with ScaLAPACK and provides a ScaLAPACK compatibility layer:

// create a 2D block cyclic data distribution (pm X pn process mesh)
starneig_distr_t distr = starneig_distr_init_mesh(
    pm, pn, STARNEIG_ORDER_DEFAULT);

// create a n X n distributed matrix (bn X bn blocks)
starneig_distr_matrix_t dA = starneig_distr_matrix_create(
    n, n, bn, bn, STARNEIG_REAL_DOUBLE, distr);

...

// convert the data distribution to a BLACS context
starneig_blacs_context_t context =
    starneig_distr_to_blacs_context(distr);

// convert the distributed matrix to a BLACS descriptor and a local buffer
starneig_blacs_descr_t descr_a;
double *local_a;
starneig_distr_matrix_to_blacs_descr(
    dA, context, &descr_a, (void **)&local_a);

// ScaLAPACK subroutine for reducing a general distributed matrix to upper
// Hessenberg form
extern void pdgehrd_(
    int const *, int const *, int const *, double *, int const *,
    int const *, starneig_blacs_descr_t const *, double *, double *,
    int const *, int *);

pdgehrd_(&n, &ilo, &ihi, local_a, &ia, &ja, &descr_a, tau, ...);

SLIDE 17

Regarding the presented numerical results

◮ Computational experiments were performed on the Kebnekaise system, HPC2N, Umeå University.

◮ Regular compute node: 28 Intel Xeon E5-2690v4 Broadwell cores, 128 GB memory, FDR Infiniband.

◮ V100 GPU node: 28 Intel Xeon Gold 6132 Skylake cores, 192 GB memory, two NVIDIA Tesla V100 GPUs.

◮ The results are extracted from

◮ Mirko Myllykoski, Carl Christian Kjelgaard Mikkelsen, Angelika Schwarz, Bo Kågström: D2.7 Eigenvalue solvers for nonsymmetric problems, public NLAFET deliverable, 2019.

SLIDE 18

Schur reduction (distributed memory performance)

[Plot (a), standard case³: relative runtime versus matrix dimension (20k–120k); StarNEig achieves a 1.6–2.9-fold speedup over PDHSEQR.]

[Plot (b), generalized case⁴: relative runtime versus matrix dimension (20k–80k); StarNEig achieves a 1.4–7.9-fold speedup over PDHGEQZ.]

Figure: StarNEig versus the ScaLAPACK-style approach (relative run-time improvement).

³ https://github.com/NLAFET/SEVP-PDHSEQR-Alg953/
⁴ https://github.com/NLAFET/GEVP-PDHGEQZ

SLIDE 19

Schur reduction (distributed memory scalability)

[Plot: runtime [s] versus matrix dimension (20k–160k) for 1, 4, 9, 16, and 25 nodes.]

Figure: Standard case, 28 cores / node, max 700 cores.

SLIDE 20

Eigenvalue reordering (distributed memory performance)

[Plot (a), standard case⁵: relative runtime versus matrix dimension (20k–120k); StarNEig achieves a 2.8–5.0-fold speedup over PDTRSEN.]

[Plot (b), generalized case⁶: relative runtime versus matrix dimension (20k–80k); StarNEig achieves a 1.6–4.3-fold speedup over PDTGSEN.]

Figure: StarNEig versus the ScaLAPACK-style approach, 35% of the eigenvalues selected.

⁵ http://www.netlib.org/scalapack/explore-html/d8/db0/pdtrsen_8f.html
⁶ https://github.com/NLAFET/GEVP-PDHGEQZ

SLIDE 21

Eigenvalue reordering (distributed memory scalability)

[Plot: runtime [s] versus matrix dimension (20k–160k) for 1, 4, 9, 16, and 25 nodes.]

Figure: Standard case, 35% selected, 28 cores / node, max 700 cores.

SLIDE 22

Eigenvalue reordering (GPU performance)

[Plot: runtime [s] versus matrix dimension (20k–80k) for 28 cores without GPUs, 28 cores + 1 GPU, and 28 cores + 2 GPUs.]

Figure: Standard case, 35% selected, NVIDIA V100.

SLIDE 23

Summary

◮ Task-based algorithms for most steps in the algorithm chain:

[Table: as on slide 12 — status of the Hessenberg, Schur, Reordering, and Eigenvectors components on shared memory (SM), distributed memory (DM), and GPU, for the standard and generalized cases.]
◮ Support for shared memory, distributed memory, and GPUs.
◮ Increased parallelism through expressing algorithms as DAGs.
◮ Better (heterogeneous) scheduling and load balancing.
◮ Overlapping communications and computations.

◮ Parallel, efficient and robust algorithm for computing eigenvectors.

◮ See the preceding presentation by Carl Christian (Parallel Robust Computation of Generalized Eigenvectors of Matrix Pencils).

SLIDE 24

Extra (Hessenberg reduction, GPU performance)

[Plot: runtime [s] versus matrix dimension (5,000–30,000) for StarNEig and MAGMA.]

Figure: StarNEig versus MAGMA, NVIDIA V100.
