Jeremy Main - Lead Solution Architect NVIDIA GRID, Japan jmain@nvidia.com
ZEN AND THE ART OF VGPU SELECTION
2
“The real purpose of the scientific method is to make sure nature hasn’t misled you into thinking you know something you actually don’t know.”
Robert M. Pirsig, Zen and the Art of Motorcycle Maintenance: An Inquiry Into Values
3
FUNCTIONAL VIEWPOINTS WITH APPLICATION OF RATIONAL ANALYSIS
4
FUNDAMENTALS: FRAME RATE
5
3D APPLICATION : CATIA V5
6
4 seconds
FRAMERATE
7
4 seconds 1 second
FRAMERATE
8
4 Frames / Second
250ms / Frame
1 second
FRAMERATE
9
8 Frames / Second
125ms / Frame
1 second
FRAMERATE
10
16 Frames / Second
62ms / Frame
1 second
FRAMERATE
11
30 Frames / Second
33ms / Frame
1 second
FRAMERATE
12
60 Frames / Second
16ms / Frame
1 second
FRAMERATE
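The per-frame times on the slides above are simply 1000 ms divided by the frame rate. A minimal sketch (the function name is illustrative, not from the talk); the slides truncate 62.5, 33.3 and 16.7 ms to whole milliseconds.

```python
def frame_time_ms(fps: float) -> float:
    """Return the time budget per frame, in milliseconds, for a given frame rate."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


# Frame rates shown on the slides above.
for fps in (4, 8, 16, 30, 60):
    print(f"{fps:>2} FPS -> {frame_time_ms(fps):.1f} ms/frame")
```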
13
AND SO?
14
IF the application can construct 3D data fast enough (efficient geometry representation)
AND the GPU is powerful enough…
Max FPS = 60 FPS
GPU Utilization = 100%
(grossly simplified for illustrative purposes only)
1 second
FRAMERATE
15
FUNDAMENTALS: GPU UTILIZATION
16
IF the application can construct 3D data fast enough (efficient geometry representation)
AND the GPU is powerful enough
Max FPS = 60 FPS
GPU Utilization = 50%
(grossly simplified for illustrative purposes only)
1 second
GPU UTILIZATION
17
IF the application can construct 3D data fast enough (efficient geometry representation)
AND the GPU is powerful enough
Max FPS = 60 FPS
GPU Utilization = 50%
(grossly simplified for illustrative purposes only)
1 second BUSY IDLE
GPU UTILIZATION
18
IF the application can’t construct 3D data fast enough (inefficient geometry representation)
AND the GPU is powerful enough
Max FPS = 15 FPS
GPU Utilization = 20%
(grossly simplified for illustrative purposes only)
1 second IDLE BUSY
GPU UTILIZATION
19
IF the application can construct 3D data fast enough (efficient geometry representation)
AND the GPU is NOT powerful enough
Max FPS = 15 FPS
GPU Utilization = 100%
(grossly simplified for illustrative purposes only)
1 second BUSY
GPU UTILIZATION
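The three cases above (capped at 60 FPS, application-bound, GPU-bound) can be reproduced with an equally simplified model. This is a sketch under stated assumptions: the function and parameter names are illustrative, and the per-frame GPU times are chosen only so the output matches the slide figures.

```python
def simulate(app_fps: float, gpu_ms_per_frame: float, cap_fps: float = 60.0):
    """Return (delivered FPS, GPU utilization %) for a grossly simplified pipeline.

    app_fps          -- frames/second the application can construct and submit
    gpu_ms_per_frame -- GPU busy time needed to render one frame (assumed constant)
    cap_fps          -- upper bound imposed by VSYNC or a frame rate limiter
    """
    gpu_fps = 1000.0 / gpu_ms_per_frame          # what the GPU alone could sustain
    fps = min(app_fps, gpu_fps, cap_fps)         # the slowest stage sets the pace
    utilization = fps * gpu_ms_per_frame / 10.0  # busy ms per 1000 ms, as a percent
    return round(fps, 1), round(utilization, 1)


print(simulate(app_fps=120, gpu_ms_per_frame=1000 / 120))  # capped at 60 FPS, GPU half idle
print(simulate(app_fps=15, gpu_ms_per_frame=1000 / 75))    # application-bound
print(simulate(app_fps=120, gpu_ms_per_frame=1000 / 15))   # GPU-bound
```

Low utilization at a low frame rate points at the application (or CPU); full utilization at a low frame rate points at the GPU.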
20
FUNDAMENTALS: VSYNC
21
IF the application can construct 3D data fast enough (efficient geometry representation)
AND the GPU is powerful enough
Max FPS = 60 FPS
GPU Utilization = 50%
(grossly simplified for illustrative purposes only)
1 second
VSYNC = ON: frame presentation follows the (virtual) display's vertical sync. Ex: 60Hz == ~16ms/frame
IDLE BUSY
VSYNC
22
IF the application can construct 3D data fast enough (efficient geometry representation)
AND the GPU is powerful enough
Max FPS = 30 FPS
GPU Utilization = 25%
(grossly simplified for illustrative purposes only)
1 second
VSYNC = ON (Half Display Refresh): frame presentation follows every second vertical sync of the (virtual) display. Ex: (60Hz / 2) == ~33ms/frame
IDLE BUSY
VSYNC
23
FUNDAMENTALS: FRAME RATE LIMITER
24
IF the application can construct 3D data fast enough (efficient geometry representation)
AND the GPU is powerful enough
Max FPS = 60 FPS
GPU Utilization = 50%
(grossly simplified for illustrative purposes only)
1 second
Frame Rate Limiter = ON : <= ~60 Potential frames rendered / second
IDLE BUSY
FRAME RATE LIMITER
25
FUNDAMENTALS GOING (OR NOT GOING) FASTER
26
IF the application can construct 3D data fast enough (efficient geometry representation)
AND the GPU is powerful enough
Max FPS = 60 FPS
GPU Utilization = 50%
(grossly simplified for illustrative purposes only)
1 second
Frame Rate Limiter = OFF , VSYNC = ON
IDLE BUSY
GOING (OR NOT GOING) FASTER
27
IF the application can construct 3D data fast enough (efficient geometry representation)
AND the GPU is powerful enough
Max FPS = 60 FPS
GPU Utilization = 50%
(grossly simplified for illustrative purposes only)
1 second
Frame Rate Limiter = ON , VSYNC = OFF
IDLE BUSY
GOING (OR NOT GOING) FASTER
28
IF the application can construct 3D data fast enough (efficient geometry representation)
AND the GPU is powerful enough
Max FPS = (until CPU or GPU bottleneck)
GPU Utilization = 100%
(grossly simplified for illustrative purposes only)
1 second
Frame Rate Limiter = OFF , VSYNC = OFF
BUSY
GOING (OR NOT GOING) FASTER
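The VSYNC / Frame Rate Limiter combinations above reduce to "the lowest active cap wins". A minimal sketch, assuming a 60 Hz display and a 60 FPS limiter as on the slides; the function name and the "on" / "half" / "off" mode strings are illustrative, not driver settings.

```python
from typing import Optional


def frame_cap(vsync: str, frl_fps: Optional[float], refresh_hz: float = 60.0) -> float:
    """Return the max FPS allowed by VSYNC and the frame rate limiter.

    vsync   -- "on", "half" (half display refresh) or "off"
    frl_fps -- frame rate limiter ceiling, or None when the limiter is off
    Returns float("inf") when nothing caps the frame rate, i.e. the
    application runs until it hits a CPU or GPU bottleneck.
    """
    caps = []
    if vsync == "on":
        caps.append(refresh_hz)
    elif vsync == "half":
        caps.append(refresh_hz / 2)
    elif vsync != "off":
        raise ValueError(f"unknown vsync mode: {vsync!r}")
    if frl_fps is not None:
        caps.append(frl_fps)
    return min(caps) if caps else float("inf")


print(frame_cap("on", None))    # Frame Rate Limiter = OFF, VSYNC = ON
print(frame_cap("off", 60.0))   # Frame Rate Limiter = ON, VSYNC = OFF
print(frame_cap("off", None))   # both off: uncapped
```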
29
FUNDAMENTALS: RENDERED VS. SAMPLED
30
Sampling
20 Frames / Second
50ms / Sample
Rendering
60 Frames / Second
16ms / Frame
31
Sampling
20 Frames / Second
50ms / Sample
Rendering
60 Frames / Second
16ms / Frame
32
Sampling
20 Frames / Second
50ms / Sample
Rendered but… unused frames
Rendering
60 Frames / Second
16ms / Frame
33
@VIRTUALIZED_RESOURCE “A BAD, VERY BAD WASTE OF SHARED RESOURCES”
34
RENDERED VS. SAMPLED
Options…
- If your sample framerate < 30 FPS, consider changing the VSYNC policy to “Adaptive Half Refresh” to lock max FPS @ 30 FPS and reduce “waste”
Cautions
- May lead to additional input/output latency due to the longer period between frame updates
- CPU-based image compression can limit the actual delivered framerate based on quality settings, percentage of display changed, and number of displays
- Network bandwidth deficiencies and quality affect delivered framerate
- Endpoint performance (ability to decode compression) affects displayable framerate
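To put a number on the "waste": frames rendered but never sampled by the remoting protocol are the difference between the two rates. A minimal sketch using the 60 FPS rendered / 20 FPS sampled figures from the slides (the function name is illustrative).

```python
def unused_frames(rendered_fps: float, sampled_fps: float):
    """Return (frames/second rendered but never sampled, that share as a percent)."""
    wasted = max(rendered_fps - sampled_fps, 0.0)
    share = 100.0 * wasted / rendered_fps if rendered_fps else 0.0
    return wasted, round(share, 1)


print(unused_frames(60, 20))  # full refresh: 40 of 60 rendered frames are never delivered
print(unused_frames(30, 20))  # half refresh: only 10 of 30 are never delivered
```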
35
FUNDAMENTALS: CPU
36
CPU / VCPU UTILIZATION
- 1 of 1 vCPUs @ 100% utilization = 100% reported utilization
- 1 of 2 vCPUs @ 100% utilization = 50% reported utilization
- 1 of 4 vCPUs @ 100% utilization = 25% reported utilization
- 1 of 8 vCPUs @ 100% utilization = 13% reported utilization
- Virtual environments using CPU-based image compression with full-screen updates can expect the compressor process to consume a single vCPU
- Adding more vCPU cores can negatively impact VM performance due to pCPU scheduling contention by the hypervisor
- Know how much CPU your application and workload require
More is not always better
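The reported figures above are just the busy vCPU count averaged over all vCPUs, which is why a saturated single-threaded process can hide behind a low total. A minimal sketch (the function name is illustrative):

```python
def reported_utilization(busy_vcpus: float, total_vcpus: int) -> float:
    """Return the total CPU utilization (%) a VM reports when `busy_vcpus`
    of its `total_vcpus` are running at 100%."""
    if total_vcpus <= 0:
        raise ValueError("total_vcpus must be positive")
    return 100.0 * busy_vcpus / total_vcpus


for vcpus in (1, 2, 4, 8):
    print(f"1 of {vcpus} vCPUs busy -> {reported_utilization(1, vcpus):.1f}% reported")
```

This is why per-core counters are worth capturing alongside the total.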
37
FUNDAMENTALS: SYSTEM MEMORY
38
SYSTEM MEMORY
- vDGA and vGPU VMs require all VM memory to be locked on startup
- Important consideration during PoC phase as well as production
- Be aware of VM memory exceeding the per-socket capacity (NUMA traversal)
Locked in (like a time-share contract)
39
FUNDAMENTALS: FRAMEBUFFER
40
FRAMEBUFFER
- It is yours for the duration, so ensure you get the correct “size”, i.e. Profile
- Cannot use another GPU’s framebuffer
- Does not support dynamic resizing
- Cannot use excess “unused” capacity of other VM framebuffers on the same GPU
- Applications may efficiently represent geometry but will fall back to legacy methods when the framebuffer is exhausted, which leads to reduced rendering performance
I own thee… until shutdown
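Because the framebuffer cannot grow after boot, sizing comes down to picking the smallest profile that covers the measured peak. A rough sketch only: the profile sizes and the 20% headroom below are assumptions for illustration, not values from this talk or from NVIDIA documentation.

```python
from typing import Iterable, Optional


def smallest_fitting_profile(peak_fb_mb: float,
                             profile_sizes_mb: Iterable[int],
                             headroom: float = 0.2) -> Optional[int]:
    """Return the smallest framebuffer size (MB) that covers the measured peak
    plus headroom, or None if no offered profile is large enough."""
    needed = peak_fb_mb * (1 + headroom)   # headroom fraction is an assumed safety margin
    for size in sorted(profile_sizes_mb):
        if size >= needed:
            return size
    return None


# Hypothetical framebuffer sizes offered on one GPU, in MB.
SIZES_MB = [512, 1024, 2048, 4096]
print(smallest_fitting_profile(1500, SIZES_MB))  # measured peak of 1500 MB during the PoC
```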
41
FUNDAMENTALS: DECODE
42
DECODE
- Stream must be H.264, VP8, HEVC Main Profile, or VP9 Profile 0
- Complete details in NVIDIA Video Codec SDK Application Notes – Decoder
- Application must support GPU decode capability for supported streams
- YouTube playback on Chrome uses VP9 (Caution) -> VP9 decode not verified
- Firefox and Edge will play back with hardware decode
- Splash player with GPU decode enabled will play back with hardware decode
- Other video players natively support available GPU decode as well
For most of your video playback needs
43
FUNDAMENTALS: ENCODE
44
ENCODER
- Dedicated silicon for encode on each GPU
- Out-of-band encoding, does not impact rendering performance
- NVENC support added in Citrix XenDesktop 7.11 and VMware Horizon 7.0 Blast Extreme
- Confirm endpoints can perform H.264 decode and that it is enabled in client settings
- Up-to-date endpoint software required
- Ensure policies or settings do not override GPU encoder use, e.g. “build to lossless”
Free a vCPU to do other special things
45
MEASUREMENT
46
MEASUREMENT PRINCIPLES
- Clarify and document the context(s) being measured
- Select metrics that will help explain different points of resource contention
- Capture workstation, PC data for pre-PoC sizing investigation
- (Optional) Capture screenshots @ 1FPS -> PNG -> ffmpeg -> MP4 file
- Capture VM, Endpoint and host metrics (nvidia-smi) for PoC
- Save data in a consistent manner, document testing procedures
Not all possible data points!
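One possible way to script the host capture and the screenshot-to-video step. This is a sketch, not the procedure used in the talk: the choice of nvidia-smi fields, the file names and the helper names are all assumptions. The commands are only built here; pass them to `subprocess.run` on a machine that has the tools installed.

```python
import csv
import io

# Real nvidia-smi --query-gpu properties; which ones to log is an assumption.
GPU_FIELDS = ["timestamp", "utilization.gpu", "utilization.memory",
              "memory.used", "memory.total"]


def nvidia_smi_command(interval_s: int = 1) -> list:
    """Build an nvidia-smi command that emits one CSV row per GPU every interval."""
    return ["nvidia-smi", "--query-gpu=" + ",".join(GPU_FIELDS),
            "--format=csv,noheader,nounits", "-l", str(interval_s)]


def ffmpeg_command(pattern: str = "shot_%05d.png", output: str = "session.mp4") -> list:
    """Build an ffmpeg command that turns 1 FPS PNG screenshots into an MP4 file.
    The file name pattern and output name are hypothetical."""
    return ["ffmpeg", "-framerate", "1", "-i", pattern,
            "-c:v", "libx264", "-pix_fmt", "yuv420p", output]


def parse_gpu_rows(text: str) -> list:
    """Turn captured nvidia-smi CSV output into a list of field -> value dicts."""
    rows = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if record:
            rows.append(dict(zip(GPU_FIELDS, record)))
    return rows


# Made-up sample row in the shape nvidia-smi prints for the fields above.
print(parse_gpu_rows("2017/05/09 10:00:00.000, 45, 12, 1024, 4096\n"))
```

Keeping the field list in one place means the capture command and the parser cannot drift apart between test runs.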
47
TOOLS
48
TOOLS: MSINFO32
- Use “System Information” (msinfo32) to capture the measurement context
- CPU model, Clocks, Logical Cores
- Operating System
- Display Adapters
- Lots of ‘other’ information that surely must be interesting to someone?
Available in all Windows Environments
50
TOOLS: PERFMON
- A large variety of counters!
- Very powerful for local or remote collection
- Some counters only exist in WMI, sadly
- Export hundreds of data points to CSV for endless sorting
Available in all Windows Environments
COUNTER CREATION AND USAGE
Create new “User Defined” collector
- Start “perfmon”
- Expand “Data Collector Sets”
- Select “User Defined” -> “New” -> “Data Collector Set”
Set base collector properties
- Enter a name for the collector
- Select “Create Manually”
- Click “Next”
Configuration (continued 1)
Select “Performance Counter”
Configuration (continued 2)
- Change sample interval: 1 Second
- Click “Add” to add counters
Counters
\Processor(_Total)\% Processor Time
\Processor(*)\% Processor Time
\Memory\Available MBytes
\Memory\% Committed Bytes In Use
\NVIDIA GPU(*)\% GPU Usage
\NVIDIA GPU(*)\% GPU Memory Usage
\PCoIP Session Imaging Statistics(*)\Imaging Encoded Frames/sec
\VMware Blast\Estimated fps
\Network Interface(*)\Bytes Received/sec (scale 0.00001)
\Network Interface(*)\Bytes Sent/sec (scale 0.00001)
Counters
- Click “Finish” to complete
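The same counters can also be collected without the GUI using the Windows `typeperf` command. A minimal sketch that only builds the command line; the helper name, the shortened counter list, the output file name and the 60-sample run length are assumptions for illustration. Counter paths use backslash separators.

```python
# A subset of the counter paths listed on the slide.
COUNTERS = [
    r"\Processor(_Total)\% Processor Time",
    r"\Memory\Available MBytes",
    r"\NVIDIA GPU(*)\% GPU Usage",
]


def typeperf_command(counters, output_csv: str, interval_s: int = 1, samples: int = 60) -> list:
    """Build a typeperf command line that samples `counters` into a CSV file.

    -si sets the sample interval in seconds, -sc the number of samples,
    -f the output format, -o the output file, -y answers yes to prompts.
    """
    return ["typeperf", *counters,
            "-si", str(interval_s), "-sc", str(samples),
            "-f", "CSV", "-o", output_csv, "-y"]


# On a Windows VM this list can be passed to subprocess.run().
print(typeperf_command(COUNTERS, "poc_run01.csv"))
```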
Starting the collector
- Right click on collector and select “Start”
DO STUFF
Stopping the Collector
- Switch to PerfMon
- Stop the collector by right-clicking it and selecting “Stop”
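Once the collector is stopped and the log exported to CSV, a few lines of Python can summarize each counter. A minimal sketch, assuming the usual perfmon CSV layout (timestamp in the first column, one column per counter); the sample data, host name and function name are made up for illustration.

```python
import csv
import io
from statistics import mean


def counter_averages(csv_text: str) -> dict:
    """Average every counter column of a perfmon-style CSV export."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    names = header[1:]                      # column 0 holds the sample timestamp
    samples = {name: [] for name in names}
    for row in reader:
        for name, cell in zip(names, row[1:]):
            try:
                samples[name].append(float(cell))
            except ValueError:
                continue                    # blank cell: no sample for this interval
    return {name: mean(values) for name, values in samples.items() if values}


# Made-up data in the shape of a perfmon CSV export (host name VM01 is hypothetical).
SAMPLE = "\n".join([
    r'"(PDH-CSV 4.0) (Tokyo Standard Time)(-540)","\\VM01\Processor(_Total)\% Processor Time","\\VM01\NVIDIA GPU(*)\% GPU Usage"',
    '"05/09/2017 10:00:01.000","25.0","40.0"',
    '"05/09/2017 10:00:02.000","35.0"," "',
])

for counter, average in counter_averages(SAMPLE).items():
    print(f"{counter}: {average:.1f}")
```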
OR
GPUPROFILER
64
GPUPROFILER
A Freeware Community Tool
- Not an official NVIDIA supported / sanctioned tool
- http://www.gpuprofiler.com (redirect to GitHub)
- Utilizes NVML API and can monitor both physical and virtual GPUs*
- Does not use the WMI interface to capture metrics
- Captures key system details, can save data for later analysis or sharing
- Save in native GPD format with system details + utilization data
- Export utilization data (only) to CSV file at any time
*Virtual GPU support limited to Maxwell GPUs with GRID August 2016 SW and later
66
DATA SAMPLES FROM POCS
COMPARING PROTOCOLS
HDX3D PRO : CPU VS GPU
YouTube HTML5 Video Playback
HDX3D Pro : CPU HDX3D Pro : NVENC
XENAPP HOST MONITORING
PCOIP VS BLAST EXTREME
YouTube HTML5 Video Playback
PCoIP : CPU Blast Extreme : NVENC
APPLICATION EXAMPLES
MANAGEMENT & MONITORING SDK
NVIDIA GRID MONITORING FEATURES
Host Monitoring
- Physical characteristics (Existing)
- Physical GPU utilization (Existing)
- vGPU discovery (New!)
- vGPU properties (New!)
- 3D engine utilization (New!)
- Frame buffer utilization (New!)
- Encode engine utilization (New!)
- Decode engine utilization (New!)
Granular vGPU metrics : better measurement, management and support
Guest Monitoring
- Frame buffer usage (Existing)
- Encoder utilization (New!)
- Decoder utilization (New!)
- Frame buffer utilization (New!)
- GPU utilization (New!)
VM
CODE SAMPLES
Get started creating your own monitoring tools
Both samples demonstrate:
- How to initialize the APIs and enumerate available GPUs and…
NVAPI sample https://github.com/JeremyMain/NVAPIQuery-Windows-
- Query pGPU/vGPU, frame buffer utilization
NVML sample https://github.com/JeremyMain/NVMLQuery-Windows
- Query pGPU/vGPU, frame buffer, encoder and decoder utilization
REQUIRED: Download NVIDIA GRID Management SDK from: https://developer.nvidia.com/nvidia-grid-software-management-sdk
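For a quick look at the same NVML metrics from a script, the NVML Python bindings (the `pynvml` module from the nvidia-ml-py package) expose the calls below. This is a sketch and not the linked NVMLQuery sample: the function name and the returned dictionary keys are illustrative, and it assumes the bindings and an NVIDIA driver are present where it runs.

```python
def query_gpus(nvml) -> list:
    """Return one dict per GPU with GPU, frame buffer, encoder and decoder usage.

    `nvml` is the NVML binding module (for example `pynvml`), passed in so the
    function can be exercised without a GPU by supplying a stand-in object.
    """
    nvml.nvmlInit()
    try:
        results = []
        for index in range(nvml.nvmlDeviceGetCount()):
            handle = nvml.nvmlDeviceGetHandleByIndex(index)
            util = nvml.nvmlDeviceGetUtilizationRates(handle)        # .gpu / .memory in %
            mem = nvml.nvmlDeviceGetMemoryInfo(handle)               # .used / .total in bytes
            enc, _period = nvml.nvmlDeviceGetEncoderUtilization(handle)
            dec, _period = nvml.nvmlDeviceGetDecoderUtilization(handle)
            results.append({
                "index": index,
                "gpu_pct": util.gpu,
                "fb_used_mb": mem.used // (1024 * 1024),
                "fb_total_mb": mem.total // (1024 * 1024),
                "encoder_pct": enc,
                "decoder_pct": dec,
            })
        return results
    finally:
        nvml.nvmlShutdown()                 # always release NVML, even on error


# Usage on a machine with the NVIDIA driver and nvidia-ml-py installed:
#   import pynvml
#   for gpu in query_gpus(pynvml):
#       print(gpu)
```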
NVIDIA GRID RESOURCES
jmain@nvidia.com @_JeremyMain
http://www.nvidia.com/object/grid-enterprise-resources.html
http://developer.nvidia.com/nvidia-grid-software-management-sdk