Kokkos Task-DAG: Memory Management and Locality Challenges Conquered


SLIDE 1


Sandia National Laboratories is a multi-mission laboratory managed and operated by National Technology and Engineering Solutions of Sandia, LLC, a wholly owned subsidiary of Honeywell International, Inc., for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-NA0003525.

Kokkos Task-DAG: Memory Management and Locality Challenges Conquered

PADAL Workshop

August 2-4, 2017, Chicago, IL

H. Carter Edwards

SAND2017-8173 C

SLIDE 2

[Diagram: Kokkos* provides performance portability for C++ applications and libraries (Drekar, Trilinos, SPARC, Albany, EMPIRE, LAMMPS) across architectures with diverse memory systems (DDR, HBM): Multi-Core, Many-Core, APU, CPU+GPU.]

*κόκκος, Greek: "granule" or "grain"; like grains of sand on a beach

SLIDE 3

Dynamic Directed Acyclic Graph (DAG) of Tasks

§ Parallel Pattern

§ Tasks: heterogeneous collection of parallel computations
§ DAG: tasks may have acyclic execute-after dependences
§ Dynamic: tasks allocated by executing tasks, deallocated when complete

§ Task Scheduler Responsibilities

§ Execute ready tasks
§ Choose from among ready tasks
§ Honor "execute after" dependences
§ Manage tasks' dynamic lifecycle
§ Manage tasks' dynamic memory
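To make the pattern concrete, here is a minimal sketch of expressing an execute-after dependence with the Kokkos tasking interface; the `HelloTask` functor and pool size are illustrative, and exact scheduler signatures have varied across Kokkos releases:

```cpp
#include <Kokkos_Core.hpp>

using sched_t = Kokkos::TaskScheduler<Kokkos::DefaultExecutionSpace>;

// A task is a C++ closure: data members plus an operator().
struct HelloTask {
  using value_type = void;  // this task produces no value

  KOKKOS_INLINE_FUNCTION
  void operator()(sched_t::member_type& /*member*/) {
    // Task body: runs only after all assigned dependences have completed.
  }
};

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    // The scheduler draws all task memory from a bounded pool (1 MB here).
    sched_t sched(sched_t::memory_pool(sched_t::memory_space(), 1 << 20));

    // Spawn task A; the returned future references it.
    auto fa = Kokkos::host_spawn(Kokkos::TaskSingle(sched), HelloTask{});
    // Spawn task B with an acyclic execute-after dependence on A.
    auto fb = Kokkos::host_spawn(Kokkos::TaskSingle(fa), HelloTask{});
    (void)fb;

    Kokkos::wait(sched);  // drive the scheduler until the task-DAG drains
  }
  Kokkos::finalize();
  return 0;
}
```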

SLIDE 4

Motivating Use Cases

1. Multifrontal Cholesky factorization of a sparse matrix

§ Frontal matrices require different sizes of workspace (green) for sub-assembly
§ Hybrid task parallelism: tree-parallel & matrix-parallel within supernodes (brown)
§ Dynamic task-DAG with memory constraints
§ Matrix computation is internally data parallel
§ Lead: Kyungjoo Kim / SNL

2. Triangle enumeration in social networks (highly irregular graphs)

§ Discover triangles within the graph
§ Compute statistics on those triangles
§ Triangles are intermediate results that do not need to be saved / stored
Ø Challenge: memory "high water mark"
§ Lead: Michael Wolf / SNL

[Figures: supernode elimination tree and sparse-matrix nonzero pattern for the multifrontal factorization; example graph and adjacency patterns for triangle enumeration.]

SLIDE 5

Hierarchical Parallelism

§ Shared functionality with hierarchical data parallelism

§ The same kernel (task) executed on …
§ OpenMP: League of Teams of Threads
§ CUDA: Grid of Blocks of Threads

§ Inter-Team Parallelism (data or task)

§ Threads within a team execute concurrently
§ Data: each team executes the same computation
Ø Task: each team executes a different task

§ Intra-Team Parallelism (data)

§ Nested parallel patterns: for, reduce, scan

§ Mapping teams onto hardware

§ CPU: team == hyperthreads sharing L1 cache
§ GPU: team == warp, for a modest degree of intra-team data parallelism
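For reference, the data-parallel side of this hierarchy is expressed with Kokkos::TeamPolicy; a minimal sketch, in which the kernel name and views (`row_sums`, `A`, `row_sum`) are illustrative:

```cpp
#include <Kokkos_Core.hpp>

// Hierarchical (team) data parallelism: a league of teams, with the
// threads of each team cooperating on nested parallel patterns.
void team_example(int league_size, int n,
                  Kokkos::View<double**> A, Kokkos::View<double*> row_sum) {
  using policy_t = Kokkos::TeamPolicy<>;
  using member_t = policy_t::member_type;

  Kokkos::parallel_for("row_sums",
    policy_t(league_size, Kokkos::AUTO),  // AUTO: let Kokkos pick team size
    KOKKOS_LAMBDA(const member_t& team) {
      const int i = team.league_rank();   // which team (e.g., CUDA block)
      double sum = 0;
      // Intra-team data parallelism: the team's threads split the row.
      Kokkos::parallel_reduce(Kokkos::TeamThreadRange(team, n),
        [&](const int j, double& partial) { partial += A(i, j); }, sum);
      // One thread per team publishes the team's result.
      Kokkos::single(Kokkos::PerTeam(team), [&]() { row_sum(i) = sum; });
    });
}
```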

SLIDE 6

Anatomy and Life-cycle of a Task

§ Anatomy

§ Is a C++ closure (e.g., functor) of data + function
§ Is referenced by a Kokkos::future
§ Executes on a single thread or a thread team
§ May only execute when its dependences are complete (DAG)

§ Life-cycle:

constructing → waiting → executing → complete
A task with internal data parallelism executes on a thread team; a serial task executes on a single thread.
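As a sketch of this anatomy (assuming the tasking interface shown earlier; `DotChunk` and its members are illustrative), a value-producing task carries its data as members and writes its result through the reference the scheduler passes in, which later becomes readable through the task's Kokkos::future:

```cpp
#include <Kokkos_Core.hpp>

using sched_t = Kokkos::TaskScheduler<Kokkos::DefaultExecutionSpace>;

// A task closure: data (views, bounds) + function (operator()).
struct DotChunk {
  using value_type = double;           // result type seen through the future

  Kokkos::View<const double*> x, y;    // the closure's captured data
  int begin, end;

  KOKKOS_INLINE_FUNCTION
  void operator()(sched_t::member_type& member, double& result) {
    double sum = 0;
    // A task may itself be internally data parallel across its thread team.
    Kokkos::parallel_reduce(Kokkos::TeamThreadRange(member, begin, end),
      [&](const int i, double& partial) { partial += x(i) * y(i); }, sum);
    result = sum;                      // published when the task completes
  }
};
```

Spawning it with Kokkos::TaskTeam(sched) runs it on a whole thread team, while Kokkos::TaskSingle(sched) runs it on a single thread.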

SLIDE 7

Dynamic Task DAG Challenges

§ A DAG of heterogeneous closures

§ Map execution to a single thread or a thread team
§ Scalable, low-latency scheduling
§ Scalable dynamically allocated / deallocated tasks
§ Scalable dynamically created and completed execute-after dependences

§ GPU idiosyncrasies

Ø Non-blocking tasks forced a beneficial reconceptualization!
§ Eliminate context-switching overhead: stack, registers, ...
§ Heterogeneous function pointers (CPU, GPU)
§ Creating GPU tasks on the host and within tasks executing on the GPU
§ Bounded memory pool and scalable allocation/deallocation
§ Non-coherent L1 caches

SLIDE 8

Managing a Non-blocking Task’s Lifecycle

§ Create: allocate and construct

§ By the main process or within another task
§ Allocate from a memory pool
§ Construct internal data
§ Assign DAG dependences

§ Spawn: enqueue to scheduler

§ Assign DAG dependences
§ Assign priority: high, regular, low

§ Respawn: re-enqueue to scheduler

§ Replaces waiting or yielding
§ Assign new DAG dependences and/or priority
§ Reconceived wait-for-child-task pattern (sketched in code below):

Ø Create & spawn child task(s)
Ø Reassign DAG dependence(s) to the new child task(s)
Ø Re-spawn to execute again after the child task(s) complete

[Diagram: life-cycle transitions: create → constructing, spawn → waiting, then executing → complete; respawn returns an executing task to the waiting state.]
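A sketch of the respawn-based wait-for-child pattern; the functors are illustrative, and `member.scheduler()`, `when_all`, and `respawn` follow the Kokkos tasking interface, whose exact signatures have shifted between releases:

```cpp
#include <Kokkos_Core.hpp>

using sched_t  = Kokkos::TaskScheduler<Kokkos::DefaultExecutionSpace>;
using future_t = Kokkos::BasicFuture<void, sched_t>;

struct ChildTask {
  using value_type = void;
  KOKKOS_INLINE_FUNCTION
  void operator()(sched_t::member_type&) { /* child work */ }
};

// Instead of blocking, the parent spawns children and respawns itself
// to execute again after they complete.
struct ParentTask {
  using value_type = void;
  future_t dep;  // null on the first execution

  KOKKOS_INLINE_FUNCTION
  void operator()(sched_t::member_type& member) {
    auto& sched = member.scheduler();
    if (dep.is_null()) {  // pass 1: create & spawn child tasks
      future_t children[2];
      children[0] = Kokkos::task_spawn(
          Kokkos::TaskSingle(sched, Kokkos::TaskPriority::High), ChildTask{});
      children[1] = Kokkos::task_spawn(
          Kokkos::TaskSingle(sched, Kokkos::TaskPriority::High), ChildTask{});
      dep = sched.when_all(children, 2);  // aggregate dependence
      Kokkos::respawn(this, dep);         // re-enqueue; no thread blocks
    } else {
      /* pass 2: children are complete; consume their results */
    }
  }
};
```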

SLIDE 9

Task Scheduler and Memory Pool

§ Memory Pool

§ Large chunk of memory allocated in a Kokkos memory space
§ Allocate & deallocate small blocks of varying size within a parallel execution
§ Lock free, extremely low latency
§ Tuning: min-alloc-size <= max-alloc-size <= superblock-size <= total-size

§ Task Scheduler

§ Uses the memory pool for tasks' memory
§ Ready queues (by priority) and waiting queues
Ø Each queue is a simple linked list of tasks
§ A ready queue is the head of a linked list
§ Each task is the head of a linked list of "execute after" tasks
§ Limit updates to push/pop, implemented with atomic operations
§ "When all" is a non-executing task with a list of dependences

[Diagram: queues thread through the tasks as linked lists via "next" pointers; each task's "execute after" list links via "dep" pointers.]
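Although the scheduler manages its pool internally, Kokkos::MemoryPool is also usable directly; a small sketch of the tuning knobs and device-side allocation (the sizes and kernel are illustrative):

```cpp
#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    using pool_t = Kokkos::MemoryPool<Kokkos::DefaultExecutionSpace>;

    // Tuning: min-alloc-size <= max-alloc-size <= superblock-size <= total-size
    pool_t pool(pool_t::memory_space(),
                10u << 20,   // total size: 10 MB
                32,          // min block size: 32 bytes
                128,         // max block size: 128 bytes
                64u << 10);  // superblock size: 64 KB

    // allocate/deallocate are lock free and callable inside parallel kernels.
    Kokkos::parallel_for("pool_demo", 1000, KOKKOS_LAMBDA(const int) {
      void* p = pool.allocate(80);            // nullptr if the pool is exhausted
      if (p != nullptr) pool.deallocate(p, 80);
    });
    Kokkos::fence();
  }
  Kokkos::finalize();
  return 0;
}
```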

SLIDE 10

Memory Pool Performance

§ Test Setup

§ 10 MB pool comprised of 153 × 64 KB superblocks, min block size 32 bytes
§ Allocations ranging between 32 and 128 bytes; average 80 bytes
§ [1] Allocate to N%; [2] cyclically deallocate & allocate between N and 2/3 N
§ parallel_for: every index allocates; cyclically deallocates & allocates
§ Measure allocate + deallocate operations / second (best of 10 trials)

§ Deallocate is much simpler and requires fewer operations than allocate

§ Test Hardware: Pascal, Broadwell, Knights Landing

§ Fully subscribe cores
§ Every thread within every warp allocates & deallocates

§ For reference, an “apples to oranges” comparison

§ CUDA malloc / free on Pascal
§ jemalloc on Knights Landing

SLIDE 11

Memory Pool Performance

§ Memory pools have finite size with well-bounded scope

§ Algorithms’ and data structures’ memory pools do not pollute (fragment) each other’s memory

                           Fill 75%   Fill 95%    Cycle 75%   Cycle 95%
blocks:                    938,500    1,187,500
Pascal                     79 M/s     74 M/s      287 M/s     244 M/s
Broadwell                  13 M/s     13 M/s      46 M/s      49 M/s
Knights Landing            5.8 M/s    5.8 M/s     40 M/s      43 M/s
Apples-to-oranges comparison:
Pascal, CUDA malloc        3.5 M/s    2.9 M/s     15 M/s      12 M/s
Knights Landing, jemalloc  379 M/s (fill)         4115 M/s (cycle)
  (jemalloc uses thread-local caches and optimal blocking, NOT a fixed pool size)

SLIDE 12

Scheduler Unit Test Performance

§ Test Setup: (silly) Fibonacci task-DAG algorithm

§ F(k) = F(k-1) + F(k-2)
§ if k >= 2: spawn F(k-1) and F(k-2), then
§ respawn F(k) dependent on completion of when_all( { F(k-1), F(k-2) } )
§ F(k) cumulatively allocates/deallocates N tasks >> "high water mark"
§ 1 MB pool comprised of 31 × 32 KB superblocks, min block size 32 bytes
§ Fully subscribe cores; a single-thread Fibonacci task consumes an entire GPU warp
§ Real algorithms' tasks have modest internal parallelism
§ Measure tasks / second; compare to raw allocate + deallocate performance

                   F(21)      F(23)      Alloc/Dealloc (for comparison)
cumulative tasks:  53,131     139,102
Pascal             1.2 M/s    1.3 M/s    144 M/s
Broadwell          0.98 M/s   1.1 M/s    24 M/s
Knights Landing    0.30 M/s   0.31 M/s   21 M/s
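For concreteness, a sketch of the Fibonacci task described above, instantiating the respawn pattern from the earlier sketches (the functor is illustrative; the real unit test differs in detail):

```cpp
#include <Kokkos_Core.hpp>

using sched_t = Kokkos::TaskScheduler<Kokkos::DefaultExecutionSpace>;

// Spawn F(k-1) and F(k-2), then respawn F(k) to run again after
// when_all{ F(k-1), F(k-2) } completes.
struct FibTask {
  using value_type  = long;
  using future_type = Kokkos::BasicFuture<long, sched_t>;

  long k;
  future_type f1, f2;  // null until the first pass spawns the children

  KOKKOS_INLINE_FUNCTION
  void operator()(sched_t::member_type& member, long& result) {
    auto& sched = member.scheduler();
    if (k < 2) {
      result = k;                      // base case
    } else if (!f1.is_null() && !f2.is_null()) {
      result = f1.get() + f2.get();    // pass 2: children are complete
    } else {                           // pass 1: spawn children, respawn self
      f1 = Kokkos::task_spawn(Kokkos::TaskSingle(sched), FibTask{k - 1});
      f2 = Kokkos::task_spawn(Kokkos::TaskSingle(sched), FibTask{k - 2});
      Kokkos::BasicFuture<void, sched_t> deps[] = {f1, f2};
      Kokkos::respawn(this, sched.when_all(deps, 2));
    }
  }
};
```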

SLIDE 13

GPU Non-Coherent L1 Cache

§ Production and Consumption of Tasks

§ Create: allocate from the memory pool and construct the closure in that memory
§ Complete: destroy the closure and deallocate to the memory pool
§ Task memory is reused as the dynamic task-DAG executes
§ "Race" consequence of non-coherent L1 cache:
[Diagram: a block in the global memory pool holds task-A, which SM0 executes and completes [1] and then deallocates [2]. SM1 allocates the same block [3], constructs task-B in it, and pushes it to the queue [4]. SM0 pops the queue [5] and executes the task [6], but steps [3-4] never touched SM0's L1 cache, so SM0 may read stale task-A bytes and execute "task-??".]

SLIDE 14

GPU Non-Coherent L1 Cache: Conquered

§ Options:

§ Mark all user task code with the "volatile" qualifier to bypass the L1 cache (CUDA)
§ Extremely annoying to users: ugly, and degrades performance
§ Manage memory motion through GPU shared memory (a.k.a. explicit L1)
Ø Transparent to user code and retains L1 performance
[Diagram: task-A's block is executed and completed on SM0 [1] and deallocated to the global memory pool [2]; SM1 allocates it [3], constructs task-B in its explicit cache [4.1], copies the closure out to pool memory [4.2], and pushes it to the queue [4.3]; SM0 pops the queue [5.1], copies the closure into its own explicit cache [5.2], and correctly executes task-B [6].]

SLIDE 15

Tacho’s Sparse Cholesky Factorization

§ Multifrontal algorithm with bounded memory constraint

§ Kokkos task-DAG + Kokkos memory pool for shared scratch memory
§ A task that fails allocation respawns to try again after other tasks deallocate (sketched below)
§ Test setup: scratch memory size = M * sparse-matrix supernode size
§ Compare to Intel's Pardiso; sparse matrix N=57k, NNZ=383k, 6662 supernodes
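A sketch of that fail-then-respawn strategy; the functor and its members are illustrative, not Tacho's actual code, and it assumes the memory-pool and respawn interfaces sketched earlier:

```cpp
#include <Kokkos_Core.hpp>

using sched_t = Kokkos::TaskScheduler<Kokkos::DefaultExecutionSpace>;
using pool_t  = Kokkos::MemoryPool<Kokkos::DefaultExecutionSpace>;

// Bounded-memory strategy: if the scratch allocation fails, respawn this
// task so it retries after other tasks have deallocated.
struct FactorTask {
  using value_type = void;

  pool_t scratch_pool;   // shared scratch memory for all factor tasks
  size_t scratch_bytes;  // e.g., M * supernode size

  KOKKOS_INLINE_FUNCTION
  void operator()(sched_t::member_type& member) {
    void* scratch = scratch_pool.allocate(scratch_bytes);
    if (scratch == nullptr) {
      // Pool exhausted: defer instead of failing; low priority lets
      // tasks that will deallocate scratch run first.
      Kokkos::respawn(this, member.scheduler(), Kokkos::TaskPriority::Low);
      return;
    }
    /* ... factor the supernode using the scratch workspace ... */
    scratch_pool.deallocate(scratch, scratch_bytes);
  }
};
```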

[Plots: factorizations/minute vs. # threads and peak memory (MB) vs. # threads, on Knights Landing (1x68x4) and Haswell (2x16x2), comparing Pardiso against Tacho with scratch multiplier M = 4, 8, 16.]

SLIDE 16

Conclusion

§ Initial Dynamic Task-DAG capability

§ Portable: CPU and NVIDIA GPU architectures
§ Directed acyclic graph (DAG) of heterogeneous tasks
§ Dynamic – tasks may create tasks and dependences
§ Hierarchical – thread-team data parallelism within tasks

§ Challenges conquered, especially for GPU portability and performance

§ Non-blocking tasks → respawn instead of wait
§ Pool of finite memory with scalable, lock-free allocation/deallocation
§ Intra-team parallelism mapping tasks onto GPU warps (or CPU cores)
§ Non-coherent L1 cache: use GPU shared memory (a.k.a. explicit L1)

§ Tacho’s Sparse Cholesky Factorization

§ Straightforward to manage dependences and the total memory constraint
§ Overhead for Kokkos' heterogeneous dynamic task-DAG is acceptable

SLIDE 17

Future Work

§ Performance evaluation & improvement

§ Mini-application based benchmarks
§ Tacho on NVIDIA Volta (and Pascal)
§ CPU thread-team collectives' memory-network contention
§ Leverage new CUDA 9 "thread group" functionality

§ Simplified (constrained) task-DAG use cases

§ Reduce overhead / improve performance?
§ Homogeneous dynamic task-DAG
  § Single closure (C++ functor)
  § Dynamic task-DAG of work identifiers (e.g., integers)
§ Homogeneous static task-DAG
  § Single closure (C++ functor)
  § Static task-DAG of work identifiers – compressed sparse row graph