ZEN AND THE ART OF VGPU SELECTION Jeremy Main - Lead Solution - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Jeremy Main - Lead Solution Architect NVIDIA GRID, Japan jmain@nvidia.com

ZEN AND THE ART OF VGPU SELECTION

slide-2
SLIDE 2

2

“ The real purpose of the scientific

method is to make sure nature hasn’t misled you into thinking you know something you actually don’t know.”

Robert M. Pirsig Zen and the Art of Motorcycle Maintenance: An Inquiry Into Values

slide-3
SLIDE 3

3

FUNCTIONAL VIEWPOINTS WITH APPLICATION OF RATIONAL ANALYSIS

slide-4
SLIDE 4

4

FUNDAMENTALS: FRAME RATE

slide-5
SLIDE 5

5

3D APPLICATION : CATIA V5

slide-6
SLIDE 6

6

4 seconds

FRAMERATE

slide-7
SLIDE 7

7

4 seconds 1 second

FRAMERATE

slide-8
SLIDE 8

8

4 Frames / Second

250ms / Frame

1 second

FRAMERATE

slide-9
SLIDE 9

9

8 Frames / Second

125ms / Frame

1 second

FRAMERATE

slide-10
SLIDE 10

10

16 Frames / Second

62ms / Frame

1 second

FRAMERATE

slide-11
SLIDE 11

11

30 Frames / Second

33ms / Frame

1 second

FRAMERATE

slide-12
SLIDE 12

12

60 Frames / Second

16ms / Frame

1 second

FRAMERATE
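The per-frame times on the preceding slides follow directly from the frame rate. A minimal sketch of that arithmetic (the function name is illustrative, not from the deck); the slides appear to truncate the result to whole milliseconds:

```python
def frame_time_ms(fps: float) -> float:
    """Return the time budget for one frame, in milliseconds."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


# Frame rates shown on the slides.
for fps in (4, 8, 16, 30, 60):
    print(f"{fps:>2} FPS -> {frame_time_ms(fps):.1f} ms/frame")
```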

slide-13
SLIDE 13

13

AND SO?

slide-14
SLIDE 14

14

IF the application can construct 3D data fast enough (efficient geometry representation)
AND the GPU is powerful enough…
  • Max FPS = 60 FPS
  • GPU Utilization = 100%
(grossly simplified for illustrative purposes only)

1 second

FRAMERATE

slide-15
SLIDE 15

15

FUNDAMENTALS: GPU UTILIZATION

slide-16
SLIDE 16

16

IF the application can construct 3D data fast enough (efficient geometry representation)
AND the GPU is powerful enough
  • Max FPS = 60 FPS
  • GPU Utilization = 50%
(grossly simplified for illustrative purposes only)

1 second

GPU UTILIZATION

slide-17
SLIDE 17

17

IF the application can construct 3D data fast enough (efficient geometry representation)
AND the GPU is powerful enough
  • Max FPS = 60 FPS
  • GPU Utilization = 50%
(grossly simplified for illustrative purposes only)

1 second BUSY IDLE

GPU UTILIZATION

slide-18
SLIDE 18

18

IF the application can’t construct 3D data fast enough (inefficient geometry representation)
AND the GPU is powerful enough
  • Max FPS = 15 FPS
  • GPU Utilization = 20%
(grossly simplified for illustrative purposes only)

1 second IDLE BUSY

GPU UTILIZATION

slide-19
SLIDE 19

19

IF the application can construct 3D data fast enough (efficient geometry representation)
AND the GPU is NOT powerful enough
  • Max FPS = 15 FPS
  • GPU Utilization = 100%
(grossly simplified for illustrative purposes only)

1 second BUSY

GPU UTILIZATION
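The three cases above can be reproduced with a deliberately simple model: the delivered frame rate is the lowest of what the application can feed, what the GPU can render, and any cap in place, and utilization is the delivered rate as a share of what the GPU could render. This is a sketch in the same "grossly simplified" spirit as the slides; the function names and the GPU capacities used below (120, 75 and 15 FPS) are assumptions chosen to match the slide figures, not measured values.

```python
from typing import Optional


def delivered_fps(app_fps: float, gpu_fps: float,
                  cap_fps: Optional[float] = None) -> float:
    """Frame rate actually achieved: the tightest of the limits."""
    limits = [app_fps, gpu_fps]
    if cap_fps is not None:
        limits.append(cap_fps)
    return min(limits)


def gpu_utilization(app_fps: float, gpu_fps: float,
                    cap_fps: Optional[float] = None) -> float:
    """Percent of time the GPU is busy under this simplified model."""
    return 100.0 * delivered_fps(app_fps, gpu_fps, cap_fps) / gpu_fps


# Fast application, GPU able to render 120 FPS, capped at 60 FPS.
print(gpu_utilization(app_fps=1000, gpu_fps=120, cap_fps=60))
# Application can only feed 15 FPS, GPU able to render 75 FPS.
print(gpu_utilization(app_fps=15, gpu_fps=75))
# Fast application, GPU can only render 15 FPS.
print(gpu_utilization(app_fps=1000, gpu_fps=15))
```

The point of the model is that 15 FPS can mean either an idle GPU (application-bound) or a saturated one (GPU-bound), so frame rate alone does not identify the bottleneck.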

slide-20
SLIDE 20

20

FUNDAMENTALS: VSYNC

slide-21
SLIDE 21

21

IF the application can construct 3D data fast enough (efficient geometry representation)
AND the GPU is powerful enough
  • Max FPS = 60 FPS
  • GPU Utilization = 50%
(grossly simplified for illustrative purposes only)

1 second

VSYNC = ON: frames are synchronized to the (v)Display vertical sync. Ex: 60 Hz == 16 ms/frame

IDLE BUSY

VSYNC

slide-22
SLIDE 22

22

IF the application can construct 3D data fast enough (efficient geometry representation)
AND the GPU is powerful enough
  • Max FPS = 30 FPS
  • GPU Utilization = 25%
(grossly simplified for illustrative purposes only)

1 second

VSYNC = ON (Half Display Refresh): frames are synchronized to every second (v)Display vertical sync. Ex: (60 Hz / 2) == 33 ms/frame

IDLE BUSY

VSYNC

slide-23
SLIDE 23

23

FUNDAMENTALS: FRAME RATE LIMITER

slide-24
SLIDE 24

24

IF the application can construct 3D data fast enough (efficient geometry representation)
AND the GPU is powerful enough
  • Max FPS = 60 FPS
  • GPU Utilization = 50%
(grossly simplified for illustrative purposes only)

1 second

Frame Rate Limiter = ON: at most ~60 frames rendered per second

IDLE BUSY

FRAME RATE LIMITER

slide-25
SLIDE 25

25

FUNDAMENTALS: GOING (OR NOT GOING) FASTER

slide-26
SLIDE 26

26

IF the application can construct 3D data fast enough (efficient geometry representation)
AND the GPU is powerful enough
  • Max FPS = 60 FPS
  • GPU Utilization = 50%
(grossly simplified for illustrative purposes only)

1 second

Frame Rate Limiter = OFF , VSYNC = ON

IDLE BUSY

GOING (OR NOT GOING) FASTER

slide-27
SLIDE 27

27

IF the application can construct 3D data fast enough (efficient geometry representation)
AND the GPU is powerful enough
  • Max FPS = 60 FPS
  • GPU Utilization = 50%
(grossly simplified for illustrative purposes only)

1 second

Frame Rate Limiter = ON , VSYNC = OFF

IDLE BUSY

GOING (OR NOT GOING) FASTER

slide-28
SLIDE 28

28

IF the application can construct 3D data fast enough (efficient geometry representation)
AND the GPU is powerful enough
  • Max FPS = (until CPU or GPU bottleneck)
  • GPU Utilization = 100%
(grossly simplified for illustrative purposes only)

1 second

Frame Rate Limiter = OFF , VSYNC = OFF

BUSY

GOING (OR NOT GOING) FASTER
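The three setting combinations on these slides reduce to one rule: whichever mechanism is switched on imposes its ceiling, and with both off nothing limits the frame rate until the CPU or GPU does. A small sketch of that rule, assuming a 60 FPS limiter and a 60 Hz display as on the slides (the function and parameter names are illustrative only):

```python
from typing import Optional


def effective_cap(frame_rate_limiter_on: bool, vsync_on: bool,
                  limiter_fps: float = 60.0,
                  refresh_hz: float = 60.0) -> Optional[float]:
    """Return the frame rate ceiling in FPS, or None if uncapped."""
    caps = []
    if frame_rate_limiter_on:
        caps.append(limiter_fps)
    if vsync_on:
        caps.append(refresh_hz)
    # With neither mechanism active, only CPU or GPU capacity limits FPS.
    return min(caps) if caps else None


for frl, vsync in ((False, True), (True, False), (False, False)):
    print(f"FRL={frl!s:<5} VSYNC={vsync!s:<5} -> cap={effective_cap(frl, vsync)}")
```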

slide-29
SLIDE 29

29

FUNDAMENTALS: RENDERED VS. SAMPLED

slide-30
SLIDE 30

30

Sampling

20 Frames / Second

50ms / Sample

Rendering

60 Frames / Second

16ms / Frame

slide-31
SLIDE 31

31

Sampling

20 Frames / Second

50ms / Sample

Rendering

60 Frames / Second

16ms / Frame

slide-32
SLIDE 32

32

Sampling

20 Frames / Second

50ms / Sample

Rendered but… unused frames

Rendering

60 Frames / Second

16ms / Frame

slide-33
SLIDE 33

33

@VIRTUALIZED_RESOURCE “A BAD, VERY BAD WASTE OF SHARED RESOURCES”

slide-34
SLIDE 34

34

RENDERED VS. SAMPLED

Options…

  • If your sample framerate is < 30 FPS, consider changing the VSYNC policy to “Adaptive Half Refresh” to lock max FPS @ 30 FPS and reduce “waste”

Cautions

  • May lead to additional input/output latency due to the longer period between frame updates
  • CPU-based image compression can limit the actual delivered framerate, depending on quality settings, percentage of display changed and number of displays
  • Network bandwidth deficiencies and network quality affect the delivered framerate
  • Endpoint performance (ability to decode compression) affects the displayable framerate
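The "waste" in question can be put into numbers using the figures from the rendered vs. sampled slides: 60 frames rendered per second against 20 sampled. The sketch below assumes each sample consumes at most one rendered frame; the function names are illustrative.

```python
def unused_frames_per_second(rendered_fps: float, sampled_fps: float) -> float:
    """Frames rendered each second that the remoting protocol never samples."""
    return max(rendered_fps - sampled_fps, 0.0)


def waste_percent(rendered_fps: float, sampled_fps: float) -> float:
    """Share of rendered frames that are never sampled, in percent."""
    if rendered_fps <= 0:
        raise ValueError("rendered_fps must be positive")
    return 100.0 * unused_frames_per_second(rendered_fps, sampled_fps) / rendered_fps


# Rendering at 60 FPS vs. locked at 30 FPS, both sampled at 20 FPS.
for rendered in (60, 30):
    print(f"{rendered} FPS rendered: "
          f"{unused_frames_per_second(rendered, 20):.0f} unused frames/s, "
          f"{waste_percent(rendered, 20):.1f}% wasted")
```

Under these assumptions, halving the render rate to 30 FPS cuts the unused frames from 40 to 10 per second, at the cost of the latency caution noted on the slide.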

slide-35
SLIDE 35

35

FUNDAMENTALS: CPU

slide-36
SLIDE 36

36

CPU / VCPU UTILIZATION

  • 1 of 1 vCPUs @ 100% utilization = 100% reported utilization
  • 1 of 2 vCPUs @ 100% utilization = 50% reported utilization
  • 1 of 4 vCPUs @ 100% utilization = 25% reported utilization
  • 1 of 8 vCPUs @ 100% utilization = 13% reported utilization
  • Virtual environments using CPU-based image compression with full-screen updates can expect the compressor process to consume a single vCPU
  • Adding more vCPU cores can negatively impact VM performance due to pCPU scheduling contention by the hypervisor
  • Know how much CPU your application and workload require

More is not always better
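The reported figures on this slide are simply the busy vCPUs averaged over all vCPUs in the VM. A minimal sketch (function name is illustrative); note that the exact value for 1 of 8 is 12.5%, which the slide rounds to 13%:

```python
def reported_utilization(busy_vcpus: int, total_vcpus: int,
                         busy_percent: float = 100.0) -> float:
    """Average utilization across all vCPUs when only some are busy."""
    if total_vcpus <= 0 or not 0 <= busy_vcpus <= total_vcpus:
        raise ValueError("need 0 <= busy_vcpus <= total_vcpus and total_vcpus > 0")
    return busy_vcpus * busy_percent / total_vcpus


for total in (1, 2, 4, 8):
    print(f"1 of {total} vCPUs busy -> {reported_utilization(1, total):.1f}% reported")
```

A single saturated thread can therefore hide behind a low total, which is why per-core counters are worth collecting alongside the total.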

slide-37
SLIDE 37

37

FUNDAMENTALS: SYSTEM MEMORY

slide-38
SLIDE 38

38

SYSTEM MEMORY

  • vDGA and vGPU VMs require all VM memory to be locked on startup
  • Important consideration during the PoC phase as well as in production
  • Be aware of VM memory exceeding the per-socket capacity (NUMA traversal)

Locked in (like a time-share contract)

slide-39
SLIDE 39

39

FUNDAMENTALS: FRAMEBUFFER

slide-40
SLIDE 40

40

FRAMEBUFFER

  • It is yours for the duration, so ensure you get the correct “size”, i.e. Profile
  • Cannot use another GPU’s framebuffer
  • Does not support dynamic resizing
  • Cannot use excess “unused” capacity of other VM framebuffers on the same GPU
  • Applications may efficiently represent geometry but will fall back to legacy methods when the framebuffer is exhausted, which leads to reduced rendering performance

I own thee… until shutdown

slide-41
SLIDE 41

41

FUNDAMENTALS: DECODE

slide-42
SLIDE 42

42

DECODE

  • Stream must be H.264, VP8, HEVC Main Profile or VP9 Profile 0
  • Complete details in NVIDIA Video Codec SDK Application Notes – Decoder
  • Application must support GPU decode capability for supported streams
  • YouTube playback on Chrome uses VP9 (Caution) -> VP9 decode not verified
  • Firefox and Edge will play back with hardware decode
  • Splash player with GPU decode enabled will play back with hardware decode
  • Other video players natively support available GPU decode as well

For most of your video playback needs

slide-43
SLIDE 43

43

FUNDAMENTALS: ENCODE

slide-44
SLIDE 44

44

ENCODER

  • Dedicated silicon for encode on each GPU
  • Out-of-band encoding, does not impact rendering performance
  • NVENC support added from Citrix XenDesktop 7.11 and VMware Horizon 7.0 Blast Extreme
  • Confirm endpoints can perform H.264 decode and that it is enabled in client settings
  • Up-to-date endpoint software required
  • Ensure policies or settings do not override GPU encoder use, e.g. “build to lossless”

Free a vCPU to do other special things

slide-45
SLIDE 45

45

MEASUREMENT

slide-46
SLIDE 46

46

MEASUREMENT PRINCIPLES

  • Clarify and document the context(s) being measured
  • Select metrics that will help explain different points of resource contention
  • Capture workstation / PC data for pre-PoC sizing investigation
  • (Optional) Capture screenshots @ 1 FPS -> PNG -> ffmpeg -> MP4 file
  • Capture VM, endpoint and host metrics (nvidia-smi) for the PoC
  • Save data in a consistent manner and document testing procedures

Not all possible data points!
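For the nvidia-smi part of the capture, one possible approach is to log a handful of query fields once per second and parse the result afterwards. The sketch below builds such a command line and parses its CSV output; the choice of fields, the function names and the sample text are assumptions for illustration, and values are kept as strings because some fields can come back as non-numeric text on configurations that do not support them.

```python
import csv
import io

# Query fields to log; a small, illustrative selection.
QUERY_FIELDS = ["timestamp", "utilization.gpu", "utilization.memory",
                "memory.used", "memory.total"]


def nvidia_smi_command(interval_s: int = 1) -> list:
    """Build an nvidia-smi command that prints one CSV row per interval."""
    return ["nvidia-smi",
            "--query-gpu=" + ",".join(QUERY_FIELDS),
            "--format=csv,noheader,nounits",
            "-l", str(interval_s)]


def parse_samples(text: str) -> list:
    """Turn captured CSV text into a list of {field: value} dicts."""
    rows = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(record) != len(QUERY_FIELDS):
            continue  # skip blank or malformed lines
        rows.append(dict(zip(QUERY_FIELDS, (v.strip() for v in record))))
    return rows


def peak_gpu_utilization(rows: list) -> int:
    """Highest numeric GPU utilization seen; non-numeric values are ignored."""
    values = [int(r["utilization.gpu"]) for r in rows
              if r["utilization.gpu"].isdigit()]
    return max(values) if values else 0


print(" ".join(nvidia_smi_command(1)))
```

Redirecting the command's output to a file named after the test run keeps the data consistent with the "save data in a consistent manner" principle.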

slide-47
SLIDE 47

47

TOOLS

slide-48
SLIDE 48

48

TOOLS: SYSINFO32

Use SysInfo32 “System Information” to capture the measurement context

  • CPU model, Clocks, Logical Cores
  • Operating System
  • Display Adapters
  • Lots of ‘other’ information that surely must be interesting to someone?

Available in all Windows Environments

slide-49
SLIDE 49

49


slide-50
SLIDE 50

50

TOOLS: PERFMON

  • A large variety of counters!
  • Very powerful for local or remote collection
  • Some counters only exist in WMI, sadly
  • Export hundreds of data points to CSV for endless sorting

Available in all Windows Environments

slide-51
SLIDE 51

COUNTER CREATION AND USAGE

slide-52
SLIDE 52

Create new “User Defined” collector

  • Start “perfmon”
  • Expand “Data Collector Sets”
  • Select “User Defined” -> “New” -> “Data Collector Set”
slide-53
SLIDE 53

Set base collector properties

  • Enter a name for the collector
  • Select “Create Manually”
  • Click “Next”

slide-54
SLIDE 54

Configuration (continued 1)

Select “Performance Counter”

slide-55
SLIDE 55

Configuration (continued 2)

  • Change sample interval to 1 second
  • Click “Add” to add counters
slide-56
SLIDE 56

Counters

  • \Processor(_Total)\% Processor Time
  • \Processor(*)\% Processor Time
  • \Memory\Available MBytes
  • \Memory\% Committed Bytes In Use
  • \NVIDIA GPU(*)\% GPU Usage
  • \NVIDIA GPU(*)\% GPU Memory Usage
  • \PCoIP Session Imaging Statistics(*)\Imaging Encoded Frames/sec
  • \VMware Blast\Estimated fps
  • \Network Interface(*)\Bytes Received/sec (scale 0.00001)
  • \Network Interface(*)\Bytes Sent/sec (scale 0.00001)
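As an alternative to clicking through the GUI, the counters on this slide (written with the backslash separator Windows expects) could be collected from the command line with the built-in typeperf tool. The sketch below only prepares the counter file contents and the command line; the file names are hypothetical, the "scale" notes are graph display settings rather than part of the counter path, and the NVIDIA, PCoIP and Blast counters exist only on systems where those components are installed.

```python
# Counter paths taken from the slide; scale notes omitted.
COUNTERS = [
    r"\Processor(_Total)\% Processor Time",
    r"\Processor(*)\% Processor Time",
    r"\Memory\Available MBytes",
    r"\Memory\% Committed Bytes In Use",
    r"\NVIDIA GPU(*)\% GPU Usage",
    r"\NVIDIA GPU(*)\% GPU Memory Usage",
    r"\PCoIP Session Imaging Statistics(*)\Imaging Encoded Frames/sec",
    r"\VMware Blast\Estimated fps",
    r"\Network Interface(*)\Bytes Received/sec",
    r"\Network Interface(*)\Bytes Sent/sec",
]


def counter_file_text(counters: list) -> str:
    """Contents of a typeperf counter file: one counter path per line."""
    return "\n".join(counters) + "\n"


def typeperf_command(counter_file: str, output_csv: str,
                     interval_s: int = 1) -> list:
    """Build a typeperf command that samples the counters into a CSV file."""
    return ["typeperf", "-cf", counter_file,
            "-si", str(interval_s),
            "-f", "CSV", "-o", output_csv]


# Hypothetical file names, shown for illustration.
print(counter_file_text(COUNTERS), end="")
print(" ".join(typeperf_command("counters.txt", "poc_run01.csv")))
```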

slide-57
SLIDE 57

Counters

  • Click “Finish” to complete
slide-58
SLIDE 58

Starting the collector

  • Right click on collector and select “Start”
slide-59
SLIDE 59

DO STUFF

slide-60
SLIDE 60
slide-61
SLIDE 61

Stopping the Collector

  • Switch to PerfMon
  • Stop the collector by right-clicking it and selecting “Stop”
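Once the collector is stopped, the log can be summarized rather than sorted by hand. The sketch below assumes the data has been saved as, or converted to, CSV with a timestamp in the first column and one column per counter; it reports the mean and maximum of each counter and skips cells that are blank or non-numeric. The function name and sample data are illustrative.

```python
import csv
import io
from statistics import mean


def summarize(csv_text: str) -> dict:
    """Return {counter: {"mean": ..., "max": ...}} for a counter log in CSV form."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    columns = {name: [] for name in header[1:]}  # first column is the timestamp
    for row in reader:
        for name, cell in zip(header[1:], row[1:]):
            try:
                columns[name].append(float(cell))
            except ValueError:
                pass  # blank or non-numeric sample
    return {name: {"mean": mean(values), "max": max(values)}
            for name, values in columns.items() if values}


SAMPLE = r'''"(PDH-CSV 4.0) (Tokyo Standard Time)(-540)","\\VM01\Processor(_Total)\% Processor Time"
"03/01/2017 10:00:00.000"," "
"03/01/2017 10:00:01.000","20.0"
"03/01/2017 10:00:02.000","40.0"
'''

for counter, stats in summarize(SAMPLE).items():
    print(counter, stats)
```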

slide-62
SLIDE 62

OR

slide-63
SLIDE 63

GPUPROFILER

slide-64
SLIDE 64

64

GPUPROFILER

A Freeware Community Tool

  • Not an official NVIDIA supported / sanctioned tool
  • http://www.gpuprofiler.com (redirect to GitHub)
  • Utilizes the NVML API and can monitor both physical and virtual GPUs*
  • Does not use the WMI interface to capture metrics
  • Captures key system details, can save data for later analysis or sharing
  • Save in native GPD format with system details + utilization data
  • Export utilization data (only) to CSV file at any time

*Virtual GPU support limited to Maxwell GPUs with GRID August 2016 SW and later

slide-65
SLIDE 65

65


slide-66
SLIDE 66

66

DATA SAMPLES FROM POCS

slide-67
SLIDE 67

67


slide-68
SLIDE 68

68


slide-69
SLIDE 69

69


slide-70
SLIDE 70

70


slide-71
SLIDE 71

71


slide-72
SLIDE 72

COMPARING PROTOCOLS

slide-73
SLIDE 73

HDX3D PRO : CPU VS GPU

YouTube HTML5 Video Playback

HDX3D Pro : CPU HDX3D Pro : NVENC

slide-74
SLIDE 74

XENAPP HOST MONITORING

slide-75
SLIDE 75

PCOIP VS BLAST EXTREME

YouTube HTML5 Video Playback

PCoIP : CPU Blast Extreme : NVENC

slide-76
SLIDE 76

APPLICATION EXAMPLES

slide-77
SLIDE 77
slide-78
SLIDE 78


slide-79
SLIDE 79
slide-80
SLIDE 80


slide-81
SLIDE 81
slide-82
SLIDE 82


slide-83
SLIDE 83
slide-84
SLIDE 84


slide-85
SLIDE 85
slide-86
SLIDE 86


slide-87
SLIDE 87

MANAGEMENT & MONITORING SDK

slide-88
SLIDE 88

NVIDIA GRID MONITORING FEATURES

Host Monitoring

  • Physical characteristics (Existing)
  • Physical GPU utilization (Existing)
  • vGPU discovery (New!)
  • vGPU properties (New!)
  • 3D engine utilization (New!)
  • Frame buffer utilization (New!)
  • Encode engine utilization (New!)
  • Decode engine utilization (New!)

Granular vGPU metrics : better measurement, management and support

Guest Monitoring

  • Frame buffer usage (Existing)
  • Encoder utilization (New!)
  • Decoder utilization (New!)
  • Frame buffer utilization (New!)
  • GPU utilization (New!)

VM

slide-89
SLIDE 89

CODE SAMPLES

Get started creating your own monitoring tools

Both samples demonstrate:

  • How to initialize the APIs and enumerate available GPUs and…

NVAPI sample https://github.com/JeremyMain/NVAPIQuery-Windows-

  • Query pGPU/vGPU, frame buffer utilization

NVML sample https://github.com/JeremyMain/NVMLQuery-Windows

  • Query pGPU/vGPU, frame buffer, encoder and decoder utilization

REQUIRED: Download NVIDIA GRID Management SDK from: https://developer.nvidia.com/nvidia-grid-software-management-sdk

slide-90
SLIDE 90

NVIDIA GRID RESOURCES

jmain@nvidia.com @_JeremyMain

http://www.nvidia.com/object/grid-enterprise-resources.html
http://developer.nvidia.com/nvidia-grid-software-management-sdk