MEMORY MANAGEMENT ON MODERN GPU ARCHITECTURES, Nikolay Sakharnykh (PowerPoint PPT Presentation)



SLIDE 1

Nikolay Sakharnykh, Tue Mar 19, 3:00 PM

MEMORY MANAGEMENT ON MODERN GPU ARCHITECTURES

SLIDE 2

HOW DO WE ALLOCATE MEMORY IN CUDA?

cudaMalloc cudaMallocHost cudaHostRegister cudaMallocManaged cudaMallocArray cudaMalloc3D

SLIDE 3

HOW DO WE ALLOCATE MEMORY IN CUDA?

cudaMalloc: accessible by GPU only; pinned to a single GPU
cudaMallocHost / cudaHostRegister: accessible by CPU & GPU; pinned to a CPU memory node
cudaMallocManaged: accessible by CPU & GPU; can “migrate”
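As a minimal sketch of the three allocation flavors above (the sizes and variable names are illustrative, not from the talk):

```cuda
#include <cuda_runtime.h>

int main() {
    const size_t n = 1 << 20;
    int *dev, *pinned, *managed;

    // GPU-only memory, resident on the current GPU
    cudaMalloc((void**)&dev, n * sizeof(int));
    // pinned host memory: accessible by CPU & GPU, stays on the CPU memory node
    cudaMallocHost((void**)&pinned, n * sizeof(int));
    // managed memory: accessible by CPU & GPU, pages can migrate on demand
    cudaMallocManaged((void**)&managed, n * sizeof(int));

    cudaFree(dev);
    cudaFreeHost(pinned);
    cudaFree(managed);
    return 0;
}
```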
SLIDE 4

AGENDA

  • Key principles
  • Performance tuning
  • Multi-GPU systems
  • Summit & Sierra
  • OS integration

*Marks behavior that may change in the future

SLIDE 5

UNIFIED MEMORY BASICS

Single virtual memory space shared between computing processors

[Diagram: process P’s virtual pages (page1, page2, page3, …) backed by GPU A’s, B’s, and C’s memories and CPU 0’s and 1’s memories]

SLIDE 6

UNIFIED MEMORY BASICS

[Diagram: GPU A and GPU B each have their own page table and physical memory holding page1, page2, page3, …]

SLIDE 7

EXAMPLE: LOCAL ACCESS

[Diagram: GPU A writes *addr1 = 1; page1 is mapped in A’s page table and resident in A’s physical memory: a local access]

SLIDE 8

EXAMPLE: POPULATE

[Diagram: GPU A writes *addr3 = 1; page3 is not yet backed by physical memory, so the access page-faults]

SLIDE 9

EXAMPLE: POPULATE

[Diagram: on the page fault for *addr3 = 1, the driver allocates memory for page3’s data]

SLIDE 10

EXAMPLE: POPULATE

[Diagram: page3 is populated and mapped into the new location]

SLIDE 11

EXAMPLE: POPULATE

[Diagram: *addr3 = 1 is replayed and now succeeds: access replay]

SLIDE 12

EXAMPLE: MIGRATE

[Diagram: GPU A writes *addr2 = 1 and *addr3 = 1, but page2 and page3 are resident in B’s memory: both accesses page-fault]

SLIDE 13

EXAMPLE: MIGRATE

[Diagram: page2 and page3 are unmapped from B’s memory]

SLIDE 14

EXAMPLE: MIGRATE

[Diagram: the pages’ data is copied from B to A]

SLIDE 15

EXAMPLE: MIGRATE

[Diagram: page2 and page3 are mapped into A’s memory]

SLIDE 16

EXAMPLE: MIGRATE

[Diagram: *addr2 = 1 and *addr3 = 1 are replayed: access replay]

SLIDE 17

EXAMPLE: OVERSUBSCRIBE

[Diagram: A’s GPU memory is FULL]

SLIDE 18

EXAMPLE: OVERSUBSCRIBE

[Diagram: GPU A writes *addr6 = 1; page6 is not resident and A’s memory is full: page fault]

SLIDE 19

EXAMPLE: OVERSUBSCRIBE

[Diagram: page3 is unmapped from A’s memory to make room]

SLIDE 20

EXAMPLE: OVERSUBSCRIBE

[Diagram: page3’s data is copied (evicted) to B’s memory]

SLIDE 21

EXAMPLE: OVERSUBSCRIBE

[Diagram: page3 is now resident in B’s memory and mapped there]

SLIDE 22

EXAMPLE: OVERSUBSCRIBE

[Diagram: page6 is populated in A’s memory]

SLIDE 23

EXAMPLE: OVERSUBSCRIBE

[Diagram: *addr6 = 1 is replayed: access replay. *B cannot be a GPU in this case]

SLIDE 24

RECAP

[Diagrams: Populate, Migrate, and Oversubscribe flows between Proc A and Proc B]
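The populate and migrate flows above can be sketched with managed memory (a minimal sketch; the kernel and sizes are illustrative):

```cuda
#include <cuda_runtime.h>
#include <cstring>

__global__ void touch(char *p, size_t n) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] = 1;            // first GPU touch: page fault, pages migrate to the GPU
}

int main() {
    char *data;
    size_t n = 1 << 24;
    cudaMallocManaged(&data, n);
    memset(data, 0, n);             // CPU first touch: pages populated in CPU memory
    touch<<<(unsigned)((n + 255) / 256), 256>>>(data, n);
    cudaDeviceSynchronize();
    data[0] = 2;                    // CPU touch again: the page faults and migrates back
    cudaFree(data);
    return 0;
}
```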

SLIDE 25

APPLICATIONS IN ANALYTICS AND DL

S9726 - Unified Memory for Data Analytics and Deep Learning

Thursday, Mar 21, 3:00 PM – SJCC Room 211A (Concourse Level)

[Diagram: data pipeline: CSV → read CSV → DataFrame (Arrow) → filter → join → groupby → Arrow DataFrames → concat → convert → DMatrix → XGBoost]

SLIDE 26

APPLICATIONS IN ANALYTICS AND DL

S9726 - Unified Memory for Data Analytics and Deep Learning

Thursday, Mar 21, 3:00 PM – SJCC Room 211A (Concourse Level)

[Diagram: the GPU working set oversubscribes GPU memory, spilling into CPU memory]

  • Oversubscribe
SLIDE 27

AGENDA

  • Key principles
  • Performance tuning
  • Multi-GPU systems
  • Summit & Sierra
  • OS integration

SLIDE 28

PREFETCH

[Diagram: pages resident in B’s memory, mapped in B’s page table; A anticipates access in the future]

SLIDE 29

PREFETCH

cudaMemPrefetchAsync(ptr, size, procA, stream)

[Diagram: prefetch initiated, anticipating access in the future]

SLIDE 30

PREFETCH

cudaMemPrefetchAsync(ptr, size, procA, stream)

[Diagram: prefetch in progress]

SLIDE 31

PREFETCH

cudaMemPrefetchAsync(ptr, size, procA, stream)

[Diagram: prefetch complete: pages resident on A and mapped in A’s page table]
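A minimal usage sketch of the call above (the stream, size, and device id are illustrative):

```cuda
#include <cuda_runtime.h>

int main() {
    float *ptr;
    size_t size = 1 << 26;
    int gpuId = 0;
    cudaStream_t s;
    cudaStreamCreate(&s);
    cudaMallocManaged(&ptr, size);

    // migrate pages to the GPU ahead of kernel launches, avoiding fault-based migration
    cudaMemPrefetchAsync(ptr, size, gpuId, s);
    // ... launch kernels on stream s that access ptr ...

    // prefetch back to the CPU before host-side processing
    cudaMemPrefetchAsync(ptr, size, cudaCpuDeviceId, s);
    cudaStreamSynchronize(s);

    cudaFree(ptr);
    cudaStreamDestroy(s);
    return 0;
}
```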

SLIDE 32

MIGRATION PERFORMANCE

For more details see blog: https://devblogs.nvidia.com/maximizing-unified-memory-performance-cuda/

__global__ void kernel(int *host, int *device) {
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    device[i] = host[i];
}

// allocate and initialize memory
cudaMallocManaged(&host, size);
memset(host, 0, size);

// benchmark CPU->GPU migration
if (prefetch)
    cudaMemPrefetchAsync(host, size, gpuId);
kernel<<<grid, block>>>(host, device);

Tesla V100 PCIe3 throughput (GB/s): fault-based 5.8, prefetch 11.4, cudaMemcpy 12.0

SLIDE 33

MIGRATION W/ OVERSUBSCRIPTION

// pre-populate GPU memory
cudaMallocManaged(&tmp, GPU_MEM_SIZE);
cudaMemPrefetchAsync(tmp, GPU_MEM_SIZE, gpuId);

// allocate and initialize memory
cudaMallocManaged(&host, size);
memset(host, 0, size);

// benchmark CPU->GPU migration
if (prefetch)
    cudaMemPrefetchAsync(host, size, gpuId);
kernel<<<grid, block>>>(host, device);

Tesla V100 PCIe, throughput (GB/s):
  GPU memory free: fault-based 5.8, prefetch 11.4, cudaMemcpy 12.0
  GPU memory fully populated: fault-based 3.8, prefetch 8.8

SLIDE 34

POPULATION PERFORMANCE

// fault-based
cudaMallocManaged(&ptr, size);
cudaMemset(ptr, 0, size);

// prefetch
cudaMallocManaged(&ptr, size);
cudaMemPrefetchAsync(ptr, size, gpuId);
cudaMemset(ptr, 0, size);

No migration traffic, just page population

[Chart: Tesla V100 population throughput (GB/s) vs buffer size (bytes); series: fault-based (driver 410), fault-based (driver 418), prefetch (driver 410), prefetch (driver 418)]

SLIDE 35

PREFETCH GOTCHAS

CPU overhead related to updating page table mappings
Driver may defer prefetches to a background thread

How this may impact your applications:

  • DtoH prefetch may not return until the operation is completed
  • Achieving good DtoH / HtoD overlap may be difficult in some cases

We’re actively working on improving the prefetch implementation to alleviate these issues

For more details see my blog: https://devblogs.nvidia.com/maximizing-unified-memory-performance-cuda/

SLIDE 36

USER POLICIES

Default: data migrates on access/prefetch
ReadMostly: data is duplicated on read/prefetch
PreferredLocation: resist migrating data away from the set location
AccessedBy: establish a direct mapping / avoid faults

[Diagrams: GPU0/GPU1 behavior under each policy]
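A sketch of setting each hint through cudaMemAdvise (the device ids are illustrative; the "..Set*" shorthand in the slides stands for the cudaMemAdviseSet* enums):

```cuda
#include <cuda_runtime.h>

int main() {
    char *data;
    size_t N = 1 << 20;
    int gpu0 = 0, gpu1 = 1;
    cudaMallocManaged(&data, N);

    // duplicate read-mostly data on each processor that reads it
    cudaMemAdvise(data, N, cudaMemAdviseSetReadMostly, gpu0);
    // keep the pages on gpu0 unless something forces them away
    cudaMemAdvise(data, N, cudaMemAdviseSetPreferredLocation, gpu0);
    // let gpu1 map the pages directly instead of faulting and migrating
    cudaMemAdvise(data, N, cudaMemAdviseSetAccessedBy, gpu1);

    cudaFree(data);
    return 0;
}
```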

SLIDE 37

READ DUPLICATION

char *data;
cudaMallocManaged(&data, N);
init_data(data, N);                         // populates data on the CPU
cudaMemAdvise(data, N, ..SetReadMostly, myGpuId);
cudaMemPrefetchAsync(data, N, myGpuId, s);  // creates a copy on the GPU
mykernel<<<..., s>>>(data, N);
use_data(data, N);                          // both CPU and GPU can read simultaneously without faults
cudaDeviceSynchronize();
cudaFree(data);

Writes will collapse all copies into one; subsequent reads will fault & duplicate again.

SLIDE 38

PREFERRED LOCATION

char *data;
cudaMallocManaged(&data, N);
init_data(data, N);                         // populates data on the CPU
cudaMemAdvise(data, N, ..PreferredLocation, cudaCpuDeviceId);
mykernel<<<..., s>>>(data, N);              // GPU faults and creates a direct mapping to the data
use_data(data, N);
cudaDeviceSynchronize();
cudaFree(data);

Possible reasons for migrating away: 1) cannot access directly 2) oversubscription 3) prefetch

SLIDE 39

PREFERRED LOCATION

char *data;
cudaMallocManaged(&data, N);
cudaMemAdvise(data, N, ..PreferredLocation, cudaCpuDeviceId);
mykernel<<<..., s>>>(data, N);              // GPU faults, populates data on the CPU, creates a direct mapping
use_data(data, N);
cudaDeviceSynchronize();
cudaFree(data);

*Pages are populated in the preferred location if the faulting processor can access it.
Note: this is also true for GPUs: set the preferred location to GPU1, run a kernel on GPU0, and it will populate the data on GPU1.

SLIDE 40

ACCESSED BY

char *data;
cudaMallocManaged(&data, N);
init_data(data, N);                         // populates data on the CPU
cudaMemAdvise(data, N, ..SetAccessedBy, myGpuId);  // GPU creates a direct mapping
mykernel<<<..., s>>>(data, N);              // GPU accesses data remotely without page faults
use_data(data, N);
cudaDeviceSynchronize();
cudaFree(data);

*Memory can move freely to other processors and the mapping will carry over.

SLIDE 41

CUDAMALLOC VS UNIFIED MEMORY

cudaMalloc(&ptr, size);

is equivalent to:

cudaMallocManaged(&ptr, size);
for (int i = 0; i < ngpus; i++)
    cudaMemAdvise(ptr, size, ..AccessedBy, i);
cudaMemAdvise(ptr, size, ..PreferredLocation, gpuId);
cudaSetDevice(gpuId);
cudaMemPrefetchAsync(ptr, size, gpuId);
cudaDeviceSynchronize();

[Diagram: ptr..ptr+size in virtual memory, backed by GPU memory]

SLIDE 42

PAGE STRIPING

cudaMallocManaged(&ptr, size);
for (int i = 0; i < ngpus; i++)
    cudaMemAdvise(ptr, size, ..AccessedBy, i);
for (each page p, striped to GPU g) {
    cudaMemAdvise(p, page_size, ..PreferredLocation, g);
    cudaSetDevice(g);
    cudaMemPrefetchAsync(p, page_size, g);
}
for (int i = 0; i < ngpus; i++) {
    cudaSetDevice(i);
    cudaDeviceSynchronize();
}

[Diagram: ptr..ptr+size in virtual memory, striped across GPU0 and GPU1 memory]

SLIDE 43

AGENDA

  • Key principles
  • Performance tuning
  • Multi-GPU systems
  • Summit & Sierra
  • OS integration

SLIDE 44

UNIFIED MEMORY + DGX-2

UNIFIED MEMORY PROVIDES:
  • Single memory view shared by all GPUs
  • Automatic migration of data between GPUs
  • User control of data locality

[Diagram: GPU 0–15 sharing 512 GB Unified Memory]

S9241 – All You Need To Know about Programming NVIDIA’s DGX-2

Wednesday, Mar 20, 1:00 PM – SJCC Room 220C (Concourse Level)

SLIDE 45

ENABLE MULTI-GPU WITH SINGLE PROCESS

Single-GPU:

__global__ void kernel(int *data) {
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    doSomeStuff(idx, data, ...);
}

cudaMallocManaged(&data, N * sizeof(int));
// initialize data on the CPU
kernel<<<grid, block>>>(data);

Multi-GPU (update the launch config and blockIdx.x):

__global__ void kernel(int *data, int gpuId) {
    int idx = threadIdx.x + blockDim.x * (blockIdx.x + gpuId * gridDim.x);
    doSomeStuff(idx, data, ...);
}

cudaMallocManaged(&data, N * sizeof(int));
// initialize data on the CPU
for (int i = 0; i < ngpus; i++) {
    cudaSetDevice(i);
    kernel<<<grid/ngpus, block>>>(data, i);
}

SLIDE 46

MULTI-GPU WITH UNIFIED MEMORY

[Diagram: GPU 0–3 + SYSMEM] GPU kernels or cudaMemPrefetchAsync initiate migrations

SLIDE 47

MULTI-GPU WITH UNIFIED MEMORY

[Diagram: GPU 0–3 + SYSMEM] Data partitioned between GPUs

SLIDE 48

MULTI-GPU WITH UNIFIED MEMORY

[Diagram: GPU 0–3 + SYSMEM] By default, accesses to remote memory will fault and initiate migrations

SLIDE 49

MULTI-GPU WITH UNIFIED MEMORY

[Diagram: GPU 0–3] Remote memory accessed directly without faults or migrations

for (int i = 0; i < ngpus; i++)
    cudaMemAdvise(ptr, size, ..SetAccessedBy, i);
for (int i = 0; i < ngpus; i++) {
    size_t off = (size / ngpus) * i;
    cudaMemAdvise(ptr + off, size / ngpus, ..PreferredLocation, i);
    cudaMemPrefetchAsync(ptr + off, size / ngpus, i);
}

https://github.com/NVIDIA/multi-gpu-programming-models/tree/master/multi_threaded_um

SLIDE 50

MULTI-GPU WITH UNIFIED MEMORY

[Diagram: GPU 0–3] Read-only data is duplicated and accessed locally

cudaMemAdvise(ptr, size, ..SetReadMostly, cudaInvalidDeviceId);
for (int i = 0; i < ngpus; i++) {
    size_t off = (size / ngpus) * i;
    cudaMemPrefetchAsync(ptr + off, size / ngpus, i);
}

https://github.com/NVIDIA/multi-gpu-programming-models/tree/master/multi_threaded_um

SLIDE 51

UNIFIED MEMORY: MULTIPLE PROCESSES

CUDA IPC is used to exchange cudaMalloc pointers; this does not work for cudaMallocManaged! All major MPI implementations support Unified Memory through staging.

Multiple processes sharing a single GPU’s memory:

Process A:
    cudaSetDevice(0);
    cudaMallocManaged(&ptr, GPU_MEM_SIZE);
    cudaMemPrefetchAsync(ptr, GPU_MEM_SIZE, 0);

Process B (its prefetch evicts A’s data to the CPU):
    cudaSetDevice(0);
    cudaMallocManaged(&ptr, GPU_MEM_SIZE);
    cudaMemPrefetchAsync(ptr, GPU_MEM_SIZE, 0);

SLIDE 52

AGENDA

  • Key principles
  • Performance tuning
  • Multi-GPU systems
  • Summit & Sierra
  • OS integration

SLIDE 53

SUMMIT NODE

(2) IBM Power9 + (6) NVIDIA Volta V100

[Diagram: two sockets linked at 64 GB/s. Each socket: one Power9 CPU with 256 GB DDR4 at 135 GB/s, attached to three V100 GPUs (16 GB HBM2 at 900 GB/s each) over 100 GB/s bidirectional NVLink2. GPUs 0–2 on CPU 0; GPUs 3–5 on CPU 1]

SLIDE 54

SIERRA NODE

(2) IBM Power9 + (4) NVIDIA Volta V100

CPU 0

128 GB

(DDR4)

CPU 1

128 GB

(DDR4)

64 GB/s

120 GB/s 120 GB/s

(150 GB/s) bidirectional NVLink2

GPU 0 GPU 1

16 GB

(HBM2)

16 GB

(HBM2)

(900 GB/s)

GPU 3 GPU 4

16 GB

(HBM2)

16 GB

(HBM2)

SLIDE 55

VOLTA+P9 FEATURES

Volta’s access counters
  • cudaMallocManaged may use access counters to guide migrations (opt-in)
NVLINK2 protocol
  • Enables HW coherency (CPU can access GPU memory)
  • Indirect peers: GPU can access memory of remote GPUs on a different socket
  • Native atomics support for all accessible memory
ATS (Address Translation Service)
  • GPU can access all system memory (malloc, stack, mmap’d files)
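With ATS, a plain malloc pointer can be handed straight to a kernel. A minimal sketch, assuming a Power9 + V100 node (this will not work on x86 systems today):

```cuda
#include <cuda_runtime.h>
#include <cstdlib>

__global__ void inc(int *p, size_t n) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] += 1;           // the GPU reaches system memory through ATS
}

int main() {
    size_t n = 1 << 20;
    // ordinary system allocation: no cudaMalloc* call involved
    int *p = (int*)malloc(n * sizeof(int));
    for (size_t i = 0; i < n; i++) p[i] = 0;

    inc<<<(unsigned)((n + 255) / 256), 256>>>(p, n);
    cudaDeviceSynchronize();

    free(p);
    return 0;
}
```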

SLIDE 56

UNIFIED MEMORY ON VOLTA+P9

If memory is mapped to the GPU, migration can be triggered by access counters

New Feature: Access Counters

[Diagram: page1 receives few accesses, page2 many; access counters track them]

SLIDE 57

UNIFIED MEMORY ON VOLTA+P9

page2 is unmapped from B

New Feature: Access Counters

[Diagram: page2 (many accesses) is unmapped from B]

SLIDE 58

UNIFIED MEMORY ON VOLTA+P9

Data for page2 is copied to A and mapping updated

New Feature: Access Counters

[Diagram: page2’s data is copied to A and the mapping is updated]

SLIDE 59

ACCESSED BY ON VOLTA+P9

char *data;
cudaMallocManaged(&data, N);
cudaMemAdvise(data, N, ..SetAccessedBy, myGpuId);  // GPU creates a direct mapping
init_data(data, N);                                // populates data on the CPU
mykernel<<<..., s>>>(data, N);
use_data(data, N);
cudaDeviceSynchronize();
cudaFree(data);

Volta’s access counters may eventually trigger migration of frequently accessed pages to the GPU.

  • On non-P9/V100 systems the data will stay in CPU memory

SLIDE 60

UNIFIED MEMORY ON VOLTA+P9

CPU can directly access and cache GPU memory Native atomics support for all accessible memory

New Feature: Hardware Coherency with NVLINK2

[Diagram: V100’s and P9’s page tables each map page1 in GPU memory and page2 in system memory]

SLIDE 61

PREFERRED LOCATION ON VOLTA+P9

char *data;
cudaMallocManaged(&data, N);
cudaMemAdvise(data, N, ..PreferredLocation, gpuId);
init_data(data, N);               // CPU page-faults and populates data on the GPU
mykernel<<<..., s>>>(data, N);
use_data(data, N);                // CPU accesses GPU memory directly
cudaFree(data);

The driver will “resist” migrating data away from the GPU.

  • On non-P9/V100 systems the driver will migrate the data back to the CPU
SLIDE 62

UNIFIED MEMORY ON VOLTA+P9

ATS: address translation service; CPU and GPU can share a single page table

New Feature: ATS support

[Diagram: V100 and P9 share a single ATS page table mapping page1 in GPU memory and page2 in CPU memory]

SLIDE 63

MANAGED VS MALLOC ON VOLTA+P9

First-touch allocation policy

cudaMallocManaged:
    ptr = cudaMallocManaged(size);
    doStuffOnGpu<<<...>>>(ptr, size);
    // GPU page-faults; the Unified Memory driver allocates on the GPU; GPU accesses GPU memory

malloc:
    ptr = malloc(size);
    doStuffOnGpu<<<...>>>(ptr, size);
    // GPU uses ATS and faults; the OS allocates on the CPU (by default); GPU uses ATS to access CPU memory

SLIDE 64

MANAGED VS MALLOC ON P9

cudaMallocManaged: same behavior as x86

ptr = cudaMallocManaged(size);
fillData(ptr, size);
doStuffOnGpu<<<...>>>(ptr, size);   // GPU page-faults, ptr migrated to GPU
cudaDeviceSynchronize();
doStuffOnCpu(ptr, size);            // CPU page-faults, ptr migrated to CPU

SLIDE 65

MANAGED VS MALLOC ON P9

malloc: no on-demand migrations*

ptr = malloc(size);
fillData(ptr, size);                // CPU accesses CPU memory
doStuffOnGpu<<<...>>>(ptr, size);   // GPU uses ATS to access CPU memory
cudaDeviceSynchronize();
doStuffOnCpu(ptr, size);

*No on-demand migration, except via cudaMemPrefetchAsync. In the future, access counters may be used to migrate malloc memory.

SLIDE 66

POWER9+V100 APPLICATIONS

https://www.exascaleproject.org/project/amrex-co-design-center-block-structured-amr/

SLIDE 67

AGENDA

  • Key principles
  • Performance tuning
  • Multi-GPU systems
  • Summit & Sierra
  • OS integration

SLIDE 68

UNIFIED MEMORY IN MANY LANGUAGES

CUDA C/C++: cudaMallocManaged
CUDA Fortran: managed attribute
Python: pycuda.driver.managed_empty
OpenACC: -ta=managed compiler option (all dynamic allocations)
Available as opt-in in GPU caching allocators and memory managers (CNMEM, RMM)

SLIDE 69

UNIFIED MEMORY + OPENACC

Literally adding a single line will get your code running on the GPU
Easy to optimize later: add loop and data directives

An effortless way to run your code on GPUs:

#pragma acc kernels   // initiates parallel execution
{
    for (i = 0; i < n; ++i) {
        c[i] = a[i] + b[i];
        ...
    }
}

SLIDE 70

UNIFIED MEMORY WITH SYSTEM ALLOCATOR

System allocator support allows the GPU to access all system memory: malloc, stack, globals, file-backed mappings
P9: Address Translation Service (ATS), enabled since CUDA 9.2
x86: Heterogeneous Memory Management (HMM)
  • Initial version of the patchset integrated into the 4.14 kernel
  • Still in the kernel as of 5.0
  • NVIDIA will be supporting upcoming versions of HMM

https://lkml.org/lkml/2017/6/23/443

SLIDE 71

WHAT YOU CAN DO WITH UNIFIED MEMORY

Works everywhere today:

    int *data;
    cudaMallocManaged(&data, sizeof(int) * n);
    kernel<<<grid, block>>>(data);

Works today on Power9 + Volta; will work in the future on x86 + HMM:

    int *data = (int*)malloc(sizeof(int) * n);
    kernel<<<grid, block>>>(data);

    int data[1024];
    kernel<<<grid, block>>>(data);

    extern int *data;
    kernel<<<grid, block>>>(data);

    int *data = mmap(0, size, .. , fd, 0);
    kernel<<<grid, block>>>(data);

SLIDE 72

DEMO TIME!

SLIDE 73

TAKEAWAY

Unified Memory enables new, more productive ways of managing CPU/GPU memory
It’s easy to start with bare-metal Unified Memory, then tune performance with hints
ATS and HMM provide easier integration with the OS and legacy CPU libraries
Connect with the Experts session on memory management: Hall 3 Pod C, 4:00 PM, right after this talk!

SLIDE 74