GPUnet: networking abstractions for GPU programs. Mark Silberstein (Technion – Israel Institute of Technology); Sangman Kim, Seonggu Huh, Xinya Zhang, Yige Hu, Emmett Witchel (University of Texas at Austin); Amir Wated (Technion). Mark Silberstein - EE, Technion
What: a socket API for programs running on GPU. Why: GPU-accelerated servers are hard to build. Results (GPU vs. CPU): 50% throughput, 60% latency, ½ LOC.
Motivation: GPU-accelerated networking applications. [Diagram: data processing servers and MapReduce nodes, each with multiple GPUs.]
Recent GPU-accelerated networking applications: SSLShader (Jang 2011), GPU MapReduce (Stuart 2011), Deep Neural Networks (Coates 2013), Dandelion (Rossbach 2013), Rhythm (Agrawal 2014) ... All required heroic efforts.
GPU-accelerated networking apps: recurring themes ● NIC-GPU interaction ● CPU-GPU-NIC pipelining and buffer management ● Request batching. We will sidestep these problems.
The real problem: CPU is the only boss. [Diagram: the NIC, storage, and GPU are all managed by the CPU.]
Example: CPU server. The CPU runs recv(), compute(), send(). [Diagram: CPU, memory, NIC.]
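
The recv(), compute(), send() loop on this slide can be sketched on a CPU as follows. This is a minimal illustration, not code from the talk: `compute` is a hypothetical stand-in for the server's real work, and one `recv()` is treated as one request for simplicity, which a real stream protocol would not guarantee.

```python
import socket


def compute(request: bytes) -> bytes:
    # Hypothetical stand-in for the server's real work (assumption).
    return request.upper()


def serve(conn: socket.socket) -> int:
    """Handle requests on one connection until the peer closes it."""
    handled = 0
    while True:
        request = conn.recv(4096)        # recv(): data arrives in CPU memory
        if not request:                  # empty read means the peer closed
            break
        conn.sendall(compute(request))   # compute(), then send()
        handled += 1
    return handled


if __name__ == "__main__":
    client, server = socket.socketpair()
    client.sendall(b"hello")
    client.shutdown(socket.SHUT_WR)      # signal end of requests
    print(serve(server), client.recv(4096))
```

The whole server fits in one short loop because the CPU owns both the network stack and the computation.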
Inside a GPU-accelerated server. In theory the server is recv(), GPU_compute(), send(). In practice the CPU code grows to: recv(); batch(); optimize(); transfer(); balance(); GPU_compute(); transfer(); cleanup(); dispatch(); send(); [Diagram: CPU and GPU, each with its own memory, and the NIC, connected by the PCIe bus.]
Aggressive pipelining inside a GPU-accelerated server: double buffering, asynchrony, multithreading. Many overlapping instances of recv(); batch(); optimize(); transfer(); balance(); GPU_compute(); transfer(); cleanup(); dispatch(); send(); run at once. This code is for a CPU to manage a GPU.
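
To give a feel for the CPU-side machinery these slides describe, here is a rough sketch of a two-stage pipeline with request batching, built from threads and queues. Everything in it is an assumption made for illustration: `BATCH_SIZE`, the stage names, and the squaring step that stands in for transfer(); GPU_compute(); transfer(); are hypothetical, and no GPU is involved.

```python
import queue
import threading

BATCH_SIZE = 4      # assumed batch size, not taken from the slides
DONE = object()     # sentinel that shuts the pipeline down


def batcher(requests: queue.Queue, batches: queue.Queue) -> None:
    """batch(): group single requests so the accelerator gets enough work."""
    batch = []
    while True:
        item = requests.get()
        if item is DONE:
            break
        batch.append(item)
        if len(batch) == BATCH_SIZE:
            batches.put(batch)
            batch = []
    if batch:
        batches.put(batch)          # flush the partial last batch
    batches.put(DONE)


def gpu_stage(batches: queue.Queue, results: queue.Queue) -> None:
    """Stand-in for transfer(); GPU_compute(); transfer(); (runs on the CPU)."""
    while True:
        batch = batches.get()
        if batch is DONE:
            break
        results.put([x * x for x in batch])
    results.put(DONE)


def run_pipeline(inputs):
    requests, batches, results = queue.Queue(), queue.Queue(), queue.Queue()
    threads = [
        threading.Thread(target=batcher, args=(requests, batches)),
        threading.Thread(target=gpu_stage, args=(batches, results)),
    ]
    for t in threads:
        t.start()
    for x in inputs:                # recv()
        requests.put(x)
    requests.put(DONE)
    out = []
    while True:                     # dispatch(); send()
        result = results.get()
        if result is DONE:
            break
        out.extend(result)
    for t in threads:
        t.join()
    return out


if __name__ == "__main__":
    print(run_pipeline(range(10)))
```

Even this toy version needs sentinels, partial-batch flushing, and thread lifecycle handling, none of which is part of the application's actual work.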
GPUs are not co-processors; GPUs are peer-processors. They need I/O abstractions ● File system I/O: [GPUfs ASPLOS13] ● Network I/O: this work
GPUnet: socket API for GPUs. Application view ● GPU native server on node0.technion.ac.il: socket(AF_INET, SOCK_STREAM); listen(:2340) ● GPU native client: socket(AF_INET, SOCK_STREAM); connect("node0:2340") ● CPU client: socket(AF_INET, SOCK_STREAM); connect("node0:2340"). [Diagram: the endpoints connect across the network; the GPU endpoints use GPUnet.]
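
For readers less familiar with the call sequence named on this slide, the sketch below shows the ordinary CPU-side equivalent using Python's standard `socket` module. It is not GPUnet code. The helper names are hypothetical, and the server binds to an ephemeral loopback port rather than the slide's port 2340 so the example can run anywhere.

```python
import socket
import threading


def start_echo_server(host: str = "127.0.0.1", port: int = 0):
    """socket(AF_INET, SOCK_STREAM); bind; listen; then accept one client."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind((host, port))              # port 0 lets the OS pick a free port
    srv.listen()
    bound_port = srv.getsockname()[1]

    def handle_one():
        conn, _ = srv.accept()
        with conn:
            conn.sendall(conn.recv(1024))   # echo a single small request
        srv.close()

    thread = threading.Thread(target=handle_one, daemon=True)
    thread.start()
    return bound_port, thread


def client_request(port: int, payload: bytes, host: str = "127.0.0.1") -> bytes:
    """socket(AF_INET, SOCK_STREAM); connect; send one request; read reply."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as c:
        c.connect((host, port))
        c.sendall(payload)
        return c.recv(1024)


if __name__ == "__main__":
    port, thread = start_echo_server()
    print(client_request(port, b"hello"))
    thread.join()
```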
GPU-accelerated server with GPUnet: CPU not involved. The GPU runs recv(), GPU_compute(), send(). [Diagram: GPU with its memory and the NIC on the PCIe bus.]
GPU-accelerated server with GPUnet ● No request batching ● Automatic request pipelining ● Automatic buffer management. [Diagram: several independent recv(), GPU_compute(), send() sequences running side by side on the GPU.]
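
The structure this slide describes, several independent recv(), GPU_compute(), send() sequences with no shared batching step, can be mimicked on a CPU with one thread per connection. This is only an analogy under stated assumptions: Python threads stand in for whatever unit of GPU execution owns a connection, and the byte reversal is a hypothetical placeholder for GPU_compute().

```python
import socket
import threading


def worker(conn: socket.socket) -> None:
    """One independent recv() -> compute -> send() loop."""
    with conn:
        while True:
            data = conn.recv(4096)
            if not data:                 # peer closed the connection
                return
            conn.sendall(data[::-1])     # placeholder for GPU_compute()


def serve_independently(n_workers: int = 3):
    """Start n workers, each with its own connection; return client ends."""
    clients = []
    for _ in range(n_workers):
        client_end, server_end = socket.socketpair()
        threading.Thread(target=worker, args=(server_end,), daemon=True).start()
        clients.append(client_end)
    return clients


if __name__ == "__main__":
    clients = serve_independently()
    for i, c in enumerate(clients):
        c.sendall(b"req%d" % i)
    print([c.recv(4096) for c in clients])
    for c in clients:
        c.close()
```

Each worker's code is the same three-step loop as a plain CPU server; overlap comes from running many of them, not from hand-written batching.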
Building a socket abstraction for GPUs
Goals ● Simplicity: reliable streaming abstraction for GPUs ● Performance: NIC → GPU data path optimizations. [Diagram: CPU and GPU, each with its own memory, and the NIC on the PCIe bus; the GPU calls recv().]
Design option 1: Transport layer processing on CPU. The GPU calls recv() and controls the flow of data; transport processing and network buffers are on the CPU side, next to the NIC. ● Extra CPU-GPU memory transfers
Design option 2: Transport layer processing on GPU. The GPU calls recv() and runs transport processing; the NIC reaches the GPU's network buffers by P2P DMA. ● TCP/IP on GPU? ● CPU applications access network through GPU?
Not CPU, not GPU: we need help from NIC hardware
RDMA: offloading transport layer processing to NIC. [Diagram labels: CPU, GPU, streaming, message buffers, reliable RDMA, NIC.]
GPUnet layers, top to bottom ● GPU socket API ● Reliable in-order streaming ● Reliable channel ● RDMA transports (Infiniband) and non-RDMA transports (UNIX domain socket, TCP/IP). [Diagram annotations: Simplicity and Performance; GPU, NIC, CPU.]
See the paper for ● Coalesced API calls ● Latency-optimized GPU-CPU flow control ● Memory management ● Bounce buffers ● Non-RDMA support ● GPU performance optimizations
Implementation ● Standard API calls, blocking/nonblocking ● libGPUnet.a: AF_INET, streaming over Infiniband RDMA ● Fully compatible with CPU rsocket library ● libUNIXnet.a: AF_LOCAL, Unix domain socket support for inter-GPU and CPU-GPU communication
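
The blocking/nonblocking distinction listed above follows the usual socket semantics. The sketch below shows those semantics with Python's standard `socket` module on a local socket pair (an AF_UNIX pair by default on POSIX systems, where AF_LOCAL is another name for AF_UNIX). It does not use GPUnet, and `try_recv` is a hypothetical helper name.

```python
import socket


def try_recv(sock: socket.socket, nbytes: int = 4096):
    """Non-blocking receive: return data, or None if nothing is ready yet."""
    sock.setblocking(False)
    try:
        return sock.recv(nbytes)
    except BlockingIOError:
        return None                 # a blocking recv() would have waited here
    finally:
        sock.setblocking(True)      # restore the default blocking mode


if __name__ == "__main__":
    a, b = socket.socketpair()
    print(try_recv(b))              # nothing has been sent yet
    a.sendall(b"x")
    print(b.recv(1))                # blocking recv returns once data arrives
    a.close()
    b.close()
```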
Implementation. [Diagram components: GPU application, GPUnet socket library, flow control, network buffers (GPU memory), GPUnet proxy, bounce buffers (CPU memory, fallback), NIC.]
Evaluation ● Analysis of GPU-native server design ● Matrix product server ● In-GPU-memory MapReduce ● Face verification server. Hardware: 2x6 Intel E5-2620, NVIDIA Tesla K20Xm GPU, Mellanox Connect-IB HCA, Switch-X bridge
In-GPU-memory MapReduce. [Diagram: two GPUs, each running Map, Receiver, Sort, and Reduce stages; labels GPUfs and GPUnet.]
In-GPU-memory MapReduce: Scalability
Benchmark     1 GPU (no network)    4 GPUs (GPUnet)
K-means       5.6 sec               1.6 sec (3.5x)
Word-count    29.6 sec              10 sec (2.9x)
GPUnet enables scale-out for GPU-accelerated systems
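
The speedups quoted on this slide are ratios of the single-GPU time to the four-GPU time. The short sketch below recomputes them and also derives parallel efficiency (speedup divided by GPU count), which is not on the slide and is added here only as a derived figure. The times shown appear to be rounded, so the recomputed word-count ratio need not match the slide's 2.9x exactly.

```python
def speedup(t_single: float, t_multi: float) -> float:
    """Ratio of single-GPU running time to multi-GPU running time."""
    return t_single / t_multi


def efficiency(t_single: float, t_multi: float, n_gpus: int) -> float:
    """Speedup per GPU; 1.0 would be perfect linear scaling."""
    return speedup(t_single, t_multi) / n_gpus


if __name__ == "__main__":
    # Times in seconds, as shown on the slide.
    for name, t1, t4 in [("K-means", 5.6, 1.6), ("Word-count", 29.6, 10.0)]:
        print(name, round(speedup(t1, t4), 2), round(efficiency(t1, t4, 4), 2))
```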
Face verification server. [Diagram: CPU client (unmodified, via rsocket), GPU server (GPUnet), and memcached (unmodified, via rsocket), connected over Infiniband.] Server steps: recv(); features() as GPU_features(); query_DB(); compare() as GPU_compare(); send().
Face verification: Different implementations. [Figure: latency in μsec (median, 25th-75th percentile, 99th percentile; axis 500 to 2500) against throughput in KReq/sec (labels 23, 34, 54) for three implementations: 1 GPU (no GPUnet), CPU (6 cores), and 1 GPU with GPUnet.] Annotations: 1.9x throughput, 1/3x latency, ½ LOC.