

SLIDE 1

DCS: A Fast, Scalable, Flexible Device-Centric Server Architecture

Jangwoo Kim

E-mail: jangwoo@snu.ac.kr
Web: https://hpcs.snu.ac.kr/~jangwoo
High Performance Computer System (HPCS) Lab
Department of Electrical and Computer Engineering, Seoul National University

SLIDE 2

Major IT companies run datacenters

Datacenter infra market is huge.

SLIDE 3

All others use the datacenters

  • Buy a SW/HW platform as a service


Again, datacenter infra market is huge.

SLIDE 4

Moore’s Law is Dead

We can’t build a faster CPU due to the power ceiling.
What is the use of extra transistors?

SLIDE 5

CPU is NOT the 1st-class citizen any more

“Un-CPU” devices now dominate the performance, power, and costs.


SLIDE 6

Every company now deals with big data

Storage infra market is EVEN larger!

SLIDE 7

Neuromorphic computing is coming

Brain-inspired computing → New World?

SLIDE 8

  • Message #1 (for system engineers)

We must build a datacenter-friendly, intelligent server (e.g., cloud, big data, artificial intelligence)

  • Message #2 (for system engineers)

The advantage must come from emerging devices (e.g., Memory, SSD, GPU, ASIC, ..)

SLIDE 9

My solution:

Let’s use our intelligent server architecture!
“DCS: Device-Centric Server Architecture”

Three papers appeared in

  • 2018 ACM/IEEE International Symposium on Computer Architecture (ISCA)
  • 2017 ACM/IEEE International Symposium on Microarchitecture (MICRO)
  • 2015 ACM/IEEE International Symposium on Microarchitecture (MICRO)

SLIDE 10

Existing servers do not work

  • Host-centric device management

− Host manages every device invocation
− Frequent host-involved layer crossings

  • Increases latency and management cost

[Figure: host-centric architecture — the userspace application reaches each device (A, B, C) through its own driver and kernel stack, so both the data path and the metadata/command path cross the host.]
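For reference, a minimal sketch of the host-centric path this slide describes, using plain POSIX calls: every block of data is staged in a host buffer between the storage read and the NIC send, and each iteration crosses the user/kernel boundary twice.

    /* Host-centric baseline: the application drives each device separately,
     * so data is copied into a host buffer between the storage read and the
     * NIC send, and every iteration crosses the user/kernel boundary twice. */
    #include <sys/socket.h>
    #include <unistd.h>

    ssize_t host_centric_sendfile(int file_fd, int sock_fd,
                                  char *buf, size_t buf_size)
    {
        ssize_t total = 0;
        for (;;) {
            ssize_t n = read(file_fd, buf, buf_size);   /* storage driver + VFS */
            if (n <= 0)
                return n < 0 ? n : total;
            if (send(sock_fd, buf, (size_t)n, 0) != n)  /* network stack + NIC  */
                return -1;
            total += n;
        }
    }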

SLIDE 11

Latency: High software overhead

  • Single sendfile: Storage read & NIC send

− Faster devices, more software overhead

[Chart: normalized latency decomposition (software vs. storage vs. NIC) — software overhead is 7% with HDD + 10Gb NIC, 50% with NVMe + 10Gb NIC, 77% with PCM + 10Gb NIC, and 82% with PCM + 100Gb NIC.]

SLIDE 12

Cost: High host resource demand

  • Sendfile under host resource (CPU) contention

− Faster devices, more host resource consumption

[Chart: measured with NVMe SSD / 10Gb NIC — with no contention, sendfile reaches 100% bandwidth at 34% CPU usage; under high CPU contention, sendfile bandwidth drops to 14% with only 6% CPU available to it.]

SLIDE 13

Limitations of existing work

  • Single-device optimization

− Does not address inter-device communication

e.g., Moneta (SSD), DCA (NIC), mTCP (NIC), Arrakis (Generic)

  • Inter-device communication

− Not applicable for unsupported devices

e.g., GPUNet (GPU–NIC), GPUDirect RDMA (GPU–InfiniBand)

  • Integrating devices

− Custom devices and protocols, limited applicability

e.g., QuickSAN (SSD+NIC), BlueDBM (Accelerator–SSD+NIC)

Need for fast, scalable, and generic inter-device communication

SLIDE 14

Our solution: Device-Centric Server

  • Minimize host involvement & data movement

[Figure: DCS architecture — the application calls the DCS library; the DCS driver sits above the existing device drivers and kernel stacks on the metadata/command path, and the DCS engine connects devices A–C directly in hardware on the data path.]

Single command → optimized multi-device invocation
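To make the single-command idea concrete, a sketch of what the application side could look like; dcs_handle_t and dcs_sendfile() are illustrative names standing in for the DCS library interface, not its actual API.

    /* Illustrative only: one DCS library call replaces the read-then-send
     * loop, and the DCS engine moves the data SSD -> NIC device-to-device.
     * dcs_handle_t and dcs_sendfile() are assumed names, not the real API. */
    #include <stddef.h>
    #include <sys/types.h>

    typedef struct { int dev_id; int desc; } dcs_handle_t;    /* assumed handle */

    extern ssize_t dcs_sendfile(dcs_handle_t src_file, dcs_handle_t dst_sock,
                                off_t offset, size_t length);  /* assumed call  */

    ssize_t serve_file(dcs_handle_t file, dcs_handle_t sock, size_t length)
    {
        /* The host submits one command (metadata only); the payload never
         * passes through host memory or the per-device kernel stacks. */
        return dcs_sendfile(file, sock, 0, length);
    }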

SLIDE 15

DCS: Benefits

  • Selective, device-to-device (D2D) transfer

− Faster data delivery, lower total operation latency

  • Better host performance/efficiency

− Resources and time spent on device management become available for other applications

  • High applicability

− Relies on existing drivers / kernel support / interfaces

− Easy to extend and cover more devices

SLIDE 16

Device-Centric Server Components

  • DCS Engine

− A custom HW device to selectively connect devices

  • DCS drivers

− Convert commodity devices to work with DCS engines

  • DCS library

− OS library that hooks into the existing system calls

  • DCS applications

− Applications developed or tuned for DCS systems

SLIDE 17

DCS: Architecture overview

[Figure: DCS architecture overview — the application calls the DCS library (sendfile(), encrypted sendfile()); the DCS driver contains a command generator and a kernel communicator; the DCS engine (on a NetFPGA NIC) holds a command queue, a command interpreter, and per-device managers, and reaches the NVMe SSD, GPU, and NetFPGA NIC through a PCIe switch. The existing drivers and kernel stack remain in place.]

Fully compatible with existing systems

SLIDE 18

Communicating with storage

[Figure: storage path — the DCS library hooks the file API call and hands the file descriptor to the DCS driver; the (virtual) filesystem resolves it to either a block address on the source device or a buffer address in the VFS cache, which the DCS engine uses to read from the NVMe SSD. Data consistency with the cache is guaranteed.]
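A small sketch of the resolution step above, with hypothetical helper names (dcs_lookup_page_cache, dcs_file_to_block) standing in for whatever the DCS driver actually uses: the file descriptor becomes either a cached buffer address or an on-device block address.

    /* Hypothetical sketch of the driver-side source resolution: turn
     * (file descriptor, offset, length) into either a cached buffer address
     * or an on-device block address. Both dcs_* helpers are illustrative. */
    #include <stddef.h>
    #include <stdint.h>
    #include <sys/types.h>

    extern int      dcs_lookup_page_cache(int fd, off_t off, size_t len,
                                          uint64_t *buf_addr);  /* assumed */
    extern uint64_t dcs_file_to_block(int fd, off_t off);        /* assumed */

    struct dcs_src {
        int      cached;      /* 1: data already sits in the VFS/page cache */
        uint64_t host_buf;    /* buffer address when cached                 */
        uint64_t block_addr;  /* on-device block address otherwise          */
    };

    static void resolve_source(int fd, off_t off, size_t len, struct dcs_src *s)
    {
        if (dcs_lookup_page_cache(fd, off, len, &s->host_buf)) {
            s->cached = 1;   /* serve the cached copy: keeps data consistent  */
        } else {
            s->cached = 0;   /* engine reads the blocks straight from the SSD */
            s->block_addr = dcs_file_to_block(fd, off);
        }
    }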

SLIDE 19

Communicating with network interface

[Figure: network path — the DCS library hooks the socket API call; the DCS driver pulls the connection information out of the kernel network stack and passes it, together with the data buffer, to the DCS engine; the hardware packet generator (HW PacketGen) on the NetFPGA NIC builds and sends the packets.]

HW-assisted packet generation
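As an illustration of the "connection information" mentioned above, the endpoint addresses of an established TCP socket can be read in user space with standard POSIX calls; the real DCS driver gathers the equivalent state inside the kernel network stack.

    /* Sketch: gathering the addresses and ports a hardware packet generator
     * would need for an established TCP socket, using plain POSIX calls. */
    #include <netinet/in.h>
    #include <sys/socket.h>

    struct conn_info {
        struct sockaddr_in local;   /* source IP/port for outgoing packets */
        struct sockaddr_in remote;  /* destination IP/port                 */
    };

    static int get_conn_info(int sock_fd, struct conn_info *ci)
    {
        socklen_t llen = sizeof(ci->local), rlen = sizeof(ci->remote);
        if (getsockname(sock_fd, (struct sockaddr *)&ci->local, &llen) < 0)
            return -1;
        return getpeername(sock_fd, (struct sockaddr *)&ci->remote, &rlen);
    }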

SLIDE 20

Communicating with accelerator

[Figure: accelerator path — the application calls the DCS library; the DCS driver asks the GPU user library / kernel driver to allocate GPU memory and to expose its memory mapping; the DCS engine then moves data from the source device into GPU memory by DMA / NVMe transfer, and the application launches the GPU kernel to process it.]

Direct data loading without memcpy
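A rough sketch of the accelerator path under the stated idea: GPU memory is allocated up front and the data lands there directly, with no intermediate memcpy through a host buffer. cudaMalloc()/cudaFree() are the real CUDA runtime calls; dcs_d2d_transfer() and launch_processing_kernel() are hypothetical placeholders.

    /* Rough sketch (hypothetical dcs_* call): allocate GPU memory, let the
     * DCS engine DMA data into it directly from the source device, then
     * process it. The idea that a plain cudaMalloc() allocation can be the
     * DMA target is an assumption made for this sketch. */
    #include <cuda_runtime.h>
    #include <stddef.h>

    extern int  dcs_d2d_transfer(int src_fd, void *gpu_dst, size_t len); /* hypothetical */
    extern void launch_processing_kernel(void *gpu_buf, size_t len);     /* app-defined  */

    int process_on_gpu(int src_fd, size_t len)
    {
        void *gpu_buf = NULL;
        if (cudaMalloc(&gpu_buf, len) != cudaSuccess)
            return -1;
        if (dcs_d2d_transfer(src_fd, gpu_buf, len) != 0) {  /* no host memcpy */
            cudaFree(gpu_buf);
            return -1;
        }
        launch_processing_kernel(gpu_buf, len);   /* data is already on the GPU */
        cudaFree(gpu_buf);
        return 0;
    }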

SLIDE 21

DCS system in a big picture!

SLIDE 22

Experimental setup

  • Host: Power-efficient system

− Core 2 Duo @ 2.00GHz, 2MB LLC
− 2GB DDR2 DRAM

  • Device: Off-the-shelf emerging devices

− Storage: Samsung XS1715 NVMe SSD
− NIC: NetFPGA with Xilinx Virtex 5 (up to 1Gb bandwidth)
− Accelerator: NVIDIA Tesla K20m
− Device interconnect: Cyclone Microsystems PCIe2-2707 (Gen 2 switch, 5 slots, up to 80Gbps)

SLIDE 23

DCS prototype implementation

  • Our 4-node DCS prototype

− Can support many devices per host

A working prototype of Device-Centric Server (DCS)!

SLIDE 24

Reducing device utilization latency

  • Single sendfile: Storage read & NIC send

− Host-centric: Per-device layer crossings
− DCS: Batch management in HW layer

[Chart: sendfile latency (ms), split into HW and SW time, host-centric vs. DCS]

2x latency improvement (with low-latency devices)

SLIDE 25

Host-independent performance

  • Sendfile under host resource (CPU) contention

− Host-centric: host-dependent, high management cost
− DCS: host-independent, low management cost

[Chart: sendfile bandwidth vs. CPU busy, no contention vs. high contention — host-centric: 100% BW at 70% CPU busy, falling to 13% BW (10% CPU) under high contention; DCS: 100% BW at 29% CPU busy, still 71% BW (11% CPU) under high contention.]

High performance even on weak hosts

SLIDE 26

Multi-device invocation

  • Encrypted sendfile (SSD → GPU → NIC, 512MB)

− DCS provides much more efficient data movement to the GPU
− Current bottleneck is the NIC (1Gbps)

[Chart: normalized processing time (GPU data loading, GPU processing, network send, NVIDIA driver) for host-centric vs. DCS — a 14% reduction with the 1Gb network send and a 38% reduction with a 10Gb network send.]

SLIDE 27

Real-world workload: Hadoop-grep

  • Hadoop-grep (10GB)

− Faster input delivery & smaller host resource consumption

[Chart: Map and Reduce progress (%) over time for host-centric vs. DCS]

40% faster processing

SLIDE 28

Scalability: More devices per host

  • Doubling # of devices in a single host

[Chart: CPU utilization and normalized total device throughput when doubling devices (SSD + NIC → SSDx2 + NICx2) — host-centric: CPU 60% → 100%, throughput only 1.3x; DCS: CPU 22% → 37%, throughput 2x.]

Scalable many-device support

SLIDE 29

1st prototype in 2015

  • A new server architecture: DCS!

− Device latency reduction: ~25%
− Host resource savings: ~61%
− Hadoop speed improvement: ~40%

[MICRO 2015]

SLIDE 30

  • Wait. We can do even better!

SLIDE 31

Limitations of Existing D2D Comm.

  • P2P communication

− Direct data transfers through PCI Express → D2D comm.
− Slow, high-overhead control path becomes the killer

[Figure: with P2P, the data path runs directly between devices, but the control path still runs through the CPU. Charts: latency (µs, control vs. data copy) and CPU utilization (%, control vs. others) for SW opt vs. P2P.]

SLIDE 32

Limitations of Existing D2D Comm.

  • Integrated devices

− Integrating heterogeneous devices → D2D comm.
− Fast data & control transfers
− Fixed and inflexible aggregate implementation

[Figure: an integrated device bundles Dev A–C behind custom controllers; adding a new device is costly ($$$).]

SLIDE 33

Limited Performance Potential

    while (true) {
        rc_recv = recv(fd_sock, buffer, recv_size, 0);
        if (rc_recv <= 0) break;
        processing(&md_ctx, buffer, recv_size);
        rc_write = write(fd_file, buffer, recv_size);
        …
    }

  • “Intermediate” processing between device ops

− Prevents applications from using direct D2D comm.
− Causes host-side resource contention (CPU and memory)

[Figure: data flows Dev A → CPU → Dev B]

SLIDE 34

DCS-v2: Key Ideas & Benefits

    void ssd_to_nic() {
        get_from_ssd(&data);
        process_in_HDC(&data);
        write_to_nic(&data);
    }

[Figure: in DCS-v2, devices A–C and new devices attach to the DCS engine through a generic device controller; both the data path and the control path bypass the CPU, and intermediate steps run in the DCS HDC next to the devices.]

Optimized dev. control → faster & scalable communication
Generic dev. interfaces → higher flexibility
Near-device processing → higher applicability

SLIDE 35

DCS-v2: (1) standard device interfaces

  • Standard interfaces in DCS Engine

− Based on a “scoreboard” with independent queues

  • Keep track of (src, dst, commands, status)

Standard interface provided by FPGA
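A minimal sketch of what one scoreboard slot could track, derived only from the (src, dst, commands, status) fields named above; the field names and widths are assumptions, not the actual DCS-v2 hardware layout.

    /* Illustrative scoreboard entry for the DCS engine: one slot per
     * in-flight D2D command, tracked with independent queues per device.
     * Field names and widths are assumptions made for this sketch. */
    #include <stdint.h>

    enum dcs_status { DCS_IDLE, DCS_ISSUED, DCS_IN_FLIGHT, DCS_DONE, DCS_ERROR };

    struct dcs_scoreboard_entry {
        uint8_t         src_dev;    /* source device ID (e.g., SSD)      */
        uint8_t         dst_dev;    /* destination device ID (e.g., NIC) */
        uint16_t        command;    /* opcode: read, send, hash, ...     */
        uint64_t        src_addr;   /* device-local source address       */
        uint64_t        dst_addr;   /* device-local destination address  */
        uint32_t        length;     /* transfer length in bytes          */
        enum dcs_status status;     /* progress of this command          */
    };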

SLIDE 36

DCS-v2: (2) HW-based fast D2D “control”

  • Device ctrl functions in DCS Engine

− Bypass OS as much as possible

  • Handle kernel-dependent functions (e.g., recvfile)

Both data and control managed by FPGA

[Figure: the per-device controller in the DCS engine drives each device over the PCIe switch through its native submission/completion queues and doorbell registers.]
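To illustrate the control flow the engine implements in hardware, a simplified, NVMe-style software model of driving one device through its submission/completion queues and doorbell register; the structures and the polling loop are deliberate simplifications, not a real NVMe or DCS register map.

    /* Simplified, NVMe-style model of what the per-device controller does in
     * hardware: enqueue a command, ring the doorbell, poll for completion.
     * Entry formats and polling are assumptions (real NVMe uses phase bits). */
    #include <stdint.h>

    struct sq_entry { uint8_t opcode; uint64_t addr; uint32_t len; uint16_t cid; };
    struct cq_entry { uint16_t cid; uint16_t status; };

    struct dev_ctrl {
        volatile struct sq_entry *sq;     /* submission queue               */
        volatile struct cq_entry *cq;     /* completion queue               */
        volatile uint32_t *sq_doorbell;   /* submission-queue tail doorbell */
        uint16_t sq_tail, cq_head, depth;
    };

    static void submit_cmd(struct dev_ctrl *c, struct sq_entry cmd)
    {
        c->sq[c->sq_tail] = cmd;                               /* enqueue       */
        c->sq_tail = (uint16_t)((c->sq_tail + 1) % c->depth);
        *c->sq_doorbell = c->sq_tail;                          /* ring doorbell */
    }

    static int wait_completion(struct dev_ctrl *c, uint16_t cid)
    {
        for (;;) {                                             /* poll the CQ   */
            struct cq_entry e = c->cq[c->cq_head];
            if (e.cid != cid)
                continue;
            c->cq_head = (uint16_t)((c->cq_head + 1) % c->depth);
            return e.status == 0 ? 0 : -1;
        }
    }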

SLIDE 37

DCS-v2: (3) Near-Device Processing (NDP)

  • Intermediate processing (MD5_Update) between device ops
  • CPU- and memory-intensive routines in existing applications

− Prevent applications from using direct D2D communications
− Cause host-side resource contention (CPU and memory)

    MD5_Init(&md_ctx);
    while (true) {
        rc_recv = recv(fd_sock, buffer, recv_size, 0);
        if (rc_recv <= 0) break;
        MD5_Update(&md_ctx, buffer, recv_size);
        rc_write = write(fd_file, buffer, recv_size);
        if (recv_size != rc_write) { break; }
    }
    MD5_Final(md_res, &md_ctx);

Intermediate computation by FPGA

SLIDE 38

A new DCS system in a big picture!

[Figure: DCS-v2 system overview — the DCS2 library reaches the DCS2 driver via ioctl(); the driver (command generator, interrupt handler, memory manager, completion queue) builds D2D commands, using the EXT4 file system for block addresses, the TCP/IP network stack for headers, and the GPU driver/library for GPU memory information; the DCS2 engine (command queue, command parser, scoreboard, interrupt generator, PIO interface, near-device processing units for MD5, CRC32, and packet gather) drives the NVMe SSD controller through submission/completion queues, the Broadcom 10Gbps NIC controller through send/recv queues, and the NVIDIA GPU.]

SLIDE 39

DCS-v2: a working prototype now

  • Off-the-shelf emerging devices

− Storage: Intel 750 Series SSD 400GB
− NIC: Broadcom Corporation NetXtreme II BCM57711 (10Gb)
− Accelerator: NVIDIA Tesla K20m
− PCIe switch: Cyclone Microsystems PCIe2-2707 (Gen2)
− FPGA: Xilinx Virtex 7 VC707 board

[Photo: the DCS-v2 prototype — DCS2 engine (FPGA board), DCS2 driver, and DCS2 library]

[ISCA’18]

SLIDE 40

Performance: Low D2D Latency

  • encrypted_sendfile(): SSD → hash → NIC

− SW opt (+P2P): frequent boundary crossings, complex software
− DCS-ctrl: fewer crossings, hardware-based device control

[Charts: D2D latency without processing and with processing (AES256)]

Significant performance boost!

SLIDE 41

Utilization: CPU becomes silent

  • Swift & HDFS workloads

− Offload device control & data transfers to hardware

[Charts: host CPU utilization for the Swift and HDFS workloads]

Significant host CPU saving!

SLIDE 42

Scalability: support many devices

  • Swift & HDFS workloads

− More CPU-efficient → support more high-performance devices

[Charts: total device throughput for the Swift and HDFS workloads]

Significant scalability boost!

SLIDE 43

What we are doing now!

  • New applications require new systems!

SLIDE 44

Questions? Thank you!

Jangwoo Kim
E-mail: jangwoo@snu.ac.kr
Web: https://hpcs.snu.ac.kr/~jangwoo
