

slide-1
SLIDE 1

NVRAMOS 2019

Asynchronous I/O Stack: A Low-latency Kernel I/O Stack for Ultra-Low Latency SSDs

Jinkyu Jeong Sungkyunkwan University (SKKU)

Source: Gyusun Lee et al., Asynchronous I/O Stack: A Low-latency Kernel I/O Stack for Ultra-Low Latency SSDs, USENIX ATC 19

slide-2
SLIDE 2

NVRAMOS 2019

  • Emerging ultra-low latency SSDs deliver I/Os in a few µs

Storage Performance Trends

2

Source: R. E. Bryant and D. R. O'Hallaron, Computer Systems: A Programmer's Perspective, Second Edition, Pearson Education, Inc., 2015

[Figure: storage and memory latency (ns, log scale) by year, 1985-2017, for HDD, SSD, ULL-SSD (Samsung Z-SSD, Intel Optane SSD), DRAM, and SRAM]

slide-3
SLIDE 3

NVRAMOS 2019

[Figure: normalized read and write(+fsync) latency, broken down into device, kernel, and user time, for SATA SSD, NVMe SSD, Z-SSD, and Optane SSD]

  • Low-latency SSDs expose the overhead of kernel I/O stack

Overhead of Kernel I/O Stack

3

slide-4
SLIDE 4

NVRAMOS 2019

Synchronous I/O vs. Asynchronous I/O

4

Synchronous I/O: the CPU performs computation A, then issues I/O B to the device and waits; the total latency is the sum of A and B.

Asynchronous I/O: A is split into A′ and A″, where A″ is independent of B, so A″ runs on the CPU while the device performs B; the overlap reduces the total latency. Asynchronous I/O is usually exploited for throughput; here the target is total latency.

Our idea: apply the asynchronous I/O concept to the I/O stack itself.
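To make the generic idea concrete at user level, the sketch below overlaps an independent computation with a pending read using plain POSIX AIO. It is unrelated to the in-kernel changes proposed in this talk, and the file path is a placeholder.

```c
/* User-level illustration of overlapping computation with I/O (POSIX AIO);
 * not related to the in-kernel AIOS mechanisms discussed in this talk. */
#include <aio.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    static char buf[4096];
    struct aiocb cb;

    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = open("/tmp/testfile", O_RDONLY);   /* placeholder path */
    if (cb.aio_fildes < 0) { perror("open"); return 1; }
    cb.aio_buf    = buf;
    cb.aio_nbytes = sizeof(buf);
    cb.aio_offset = 0;

    aio_read(&cb);                    /* B: the device I/O starts          */

    long sum = 0;                     /* A'': computation independent of B */
    for (long i = 0; i < 10000000; i++)
        sum += i;                     /* ...runs while B is in flight      */

    const struct aiocb *list[1] = { &cb };
    aio_suspend(list, 1, NULL);       /* wait only for whatever remains    */
    printf("read %zd bytes, sum=%ld\n", aio_return(&cb), sum);
    return 0;
}
```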

slide-5
SLIDE 5

NVRAMOS 2019

Read Path Overview

5

[Figure: timelines from sys_read() to return-to-user. In the vanilla read path, the I/O stack operations run on the CPU before and after the device I/O; in the proposed read path, asynchronous I/O stack operations overlap with the device I/O, reducing latency.]
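To make the reordering concrete, here is the contrast in C-style pseudocode; every helper name below is a hypothetical stand-in for the corresponding kernel step, not an actual Linux or AIOS function.

```c
/* Hypothetical sketch of the two read paths (miss in the page cache);
 * helper names are placeholders, not real kernel/AIOS symbols. */

/* Vanilla: every I/O-stack operation finishes before the device I/O starts,
 * and more stack work follows after the completion interrupt. */
ssize_t vanilla_read(struct file *f, char *ubuf, loff_t pos)
{
    struct page *p = alloc_page_for_read();     /* page allocation       */
    page_cache_insert(f, pos, p);               /* page cache insertion  */
    dma_map(p);                                 /* DMA mapping           */
    submit_read(f, pos, p);                     /* BIO + NVMe submission */
    wait_for_read(p);                           /* device I/O (~7.26 us) */
    dma_unmap(p);                               /* in the interrupt path */
    return copy_to_user_buf(ubuf, p);
}

/* Proposed: submit the device I/O first from a pre-DMA-mapped page and do
 * the remaining stack work while the device is busy. */
ssize_t async_stack_read(struct file *f, char *ubuf, loff_t pos)
{
    struct page *p = dma_page_pool_alloc();     /* already DMA-mapped    */
    submit_read(f, pos, p);                     /* device starts now     */
    page_cache_insert_lazy(f, pos, p);          /* overlapped with I/O   */
    wait_for_read(p);                           /* shorter residual wait */
    /* DMA unmapping is deferred (lazy), off the critical path. */
    return copy_to_user_buf(ubuf, p);
}
```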

slide-6
SLIDE 6

NVRAMOS 2019

Write Path Overview

[Figure: vanilla write path. A buffered write (sys_write()) completes on the CPU and returns to user without any device I/O.]

slide-7
SLIDE 7

NVRAMOS 2019

Write Path Overview

7

[Figure: timelines from sys_fsync() to return-to-user. In the vanilla fsync path, each set of I/O stack operations and its device I/O run back-to-back; in the proposed fsync path, I/O stack operations overlap with in-flight I/Os, reducing latency.]

slide-8
SLIDE 8

NVRAMOS 2019

  • Read path

− Analysis of vanilla read path − Proposed read path

  • Light-weight block I/O layer
  • Write path

− Analysis of vanilla write path − Proposed write path

  • Evaluation
  • Conclusion

Agenda

8

slide-9
SLIDE 9

NVRAMOS 2019


Analysis of Vanilla Read Path

9

Breakdown of a read from sys_read() to return-to-user; total latency 12.82µs, of which the device I/O takes 7.26µs:

  • Page cache: page cache lookup 0.30µs, page allocation 0.19µs, page cache insertion 0.33µs
  • File system: LBA retrieval 0.09µs
  • Block layer: BIO submission 0.72µs
  • Device driver: DMA mapping 0.29µs, NVMe I/O submission 0.37µs
  • Context switches: 0.95µs out after submission and 0.95µs back in after completion
  • Interrupt handler: DMA unmapping 0.23µs, request completion 0.81µs
  • Copy-to-user: 0.21µs

slide-10
SLIDE 10

NVRAMOS 2019

Page Allocation / DMA Mapping

10

[Figure: read path timeline highlighting page allocation (0.19µs) and DMA mapping (0.29µs) on the critical path before the 7.26µs device I/O.]

slide-11
SLIDE 11

NVRAMOS 2019

  • DMA-mapped page pool

Asynchronous Page Allocation / DMA Mapping

11

[Figure: read path timeline with page allocation (0.19µs) and DMA mapping (0.29µs) before the 7.26µs device I/O, alongside the DMA-mapped page pool: each core (core 0 … core N) keeps 64 pre-allocated, DMA-mapped 4KB pages.]

slide-12
SLIDE 12

NVRAMOS 2019

  • DMA-mapped page pool

Asynchronous Page Allocation / DMA Mapping

12

[Figure: with the pool, a page-pool allocation (0.016µs) replaces page allocation (0.19µs) and DMA mapping (0.29µs) on the read path; each per-core pool of 64 DMA-mapped 4KB pages is refilled off the critical path.]
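A minimal sketch of such a per-core pool, assuming the standard kernel page and DMA-mapping APIs on the refill path; the structure and function names are illustrative, not the AIOS implementation.

```c
/* Illustrative per-core pool of pre-DMA-mapped 4 KB pages (not AIOS code).
 * The read path pays only a pool pop (~0.016 us) instead of page
 * allocation (~0.19 us) plus DMA mapping (~0.29 us); refill runs off the
 * critical path. */
#define POOL_PAGES 64

struct dma_page {
    struct page *page;
    dma_addr_t   dma_addr;             /* mapping set up at refill time */
};

struct dma_page_pool {                 /* one instance per CPU core */
    struct dma_page entries[POOL_PAGES];
    int             count;             /* number of ready pages */
};

/* Per-core pool accessed with preemption disabled, so no locking. */
static struct dma_page pool_pop(struct dma_page_pool *pool)
{
    return pool->entries[--pool->count];
}

/* Called when the pool runs low, outside the I/O submission path. */
static void pool_refill(struct dma_page_pool *pool, struct device *dev)
{
    while (pool->count < POOL_PAGES) {
        struct page *p = alloc_page(GFP_KERNEL);

        if (!p)
            break;
        pool->entries[pool->count].page = p;
        pool->entries[pool->count].dma_addr =
            dma_map_page(dev, p, 0, PAGE_SIZE, DMA_FROM_DEVICE);
        pool->count++;
    }
}
```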

slide-13
SLIDE 13

NVRAMOS 2019

Page Cache Insertion

13

[Figure: read path timeline with page cache insertion (0.33µs) before the 7.26µs device I/O, alongside the page cache tree (root and leaf nodes holding pages).]

In the vanilla path, a cache lookup that misses is followed by an insertion into the page cache tree, and only when the insertion succeeds is the I/O request made; a thread whose lookup hits instead waits for the page read to finish. Inserting before the I/O prevents duplicate I/O requests for the same file index, but it puts the page cache lookup and tree-extension overheads on the critical path.

slide-14
SLIDE 14

NVRAMOS 2019

Lazy Page Cache Insertion

14

[Figure: in the proposed path, the page cache insertion (0.35µs) is performed while the 7.26µs device I/O is in flight.]

On a cache miss, the I/O request is made immediately and the page is inserted into the page cache tree lazily, during the device I/O. If the lazy insertion fails because another thread has already inserted a page for the same index, a duplicate I/O request has been issued (at extremely low frequency); the extra page is freed and the cached page is used instead. The page cache lookup and tree-extension overheads move off the critical path.
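A hedged sketch of the lazy-insertion flow; the helper names are hypothetical, not AIOS symbols.

```c
/* Lazy page cache insertion, illustrative only (hypothetical helpers).
 * Vanilla order:  lookup -> insert -> submit I/O -> wait
 * Lazy order:     lookup -> submit I/O -> insert (overlapped) -> wait */
struct page *read_page_lazy(struct address_space *mapping, pgoff_t index)
{
    struct page *p = cache_lookup(mapping, index);

    if (p)
        return p;                                 /* cache hit */

    p = dma_page_pool_alloc();                    /* pre-DMA-mapped page */
    submit_read_io(mapping, index, p);            /* device starts now   */

    if (cache_insert(mapping, index, p) != 0) {
        /* Another thread inserted a page for this index first, so a
         * duplicate I/O was issued (extremely low frequency): wait for
         * our read, free the extra page, and use the cached one. */
        wait_for_read(p);
        dma_page_pool_free(p);
        p = cache_lookup(mapping, index);
    }
    wait_for_read(p);
    return p;
}
```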

slide-15
SLIDE 15

NVRAMOS 2019

DMA Unmapping

15

[Figure: read path timeline highlighting DMA unmapping (0.23µs) in the interrupt path after the 7.26µs device I/O.]

slide-16
SLIDE 16

NVRAMOS 2019

  • Implementation

− Delays DMA unmapping until the system is idle or is waiting for another I/O request − An extended version of the deferred protection scheme in Linux [ASPLOS’16] − Can optionally be disabled for safety

Lazy DMA Unmapping

16

[Figure: read path timeline with lazy DMA unmapping (0.35µs) deferred out of the I/O completion path.]
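One way the deferral can be structured, sketched with the standard kernel list and DMA APIs; the surrounding structure is an assumption, not the AIOS implementation.

```c
/* Illustrative deferral of DMA unmapping off the completion path
 * (assumed structure; per-CPU in practice). */
struct deferred_unmap {
    struct list_head list;
    struct device   *dev;
    dma_addr_t       dma_addr;
    size_t           len;
};

static LIST_HEAD(lazy_unmap_list);

/* Completion/interrupt path: just record the mapping, cheaply. */
static void lazy_dma_unmap(struct device *dev, dma_addr_t addr, size_t len)
{
    struct deferred_unmap *d = kmalloc(sizeof(*d), GFP_ATOMIC);

    if (!d) {                       /* fall back to immediate unmapping */
        dma_unmap_page(dev, addr, len, DMA_FROM_DEVICE);
        return;
    }
    d->dev = dev;
    d->dma_addr = addr;
    d->len = len;
    list_add(&d->list, &lazy_unmap_list);
}

/* Drained when the CPU is idle or about to wait for another I/O. */
static void drain_lazy_unmaps(void)
{
    struct deferred_unmap *d, *tmp;

    list_for_each_entry_safe(d, tmp, &lazy_unmap_list, list) {
        dma_unmap_page(d->dev, d->dma_addr, d->len, DMA_FROM_DEVICE);
        list_del(&d->list);
        kfree(d);
    }
}
```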

slide-17
SLIDE 17

NVRAMOS 2019

Remaining Overheads in the Proposed Read Path

17

[Figure: remaining overheads on the proposed read path around the 7.26µs device I/O: BIO submission (0.72µs), NVMe I/O submission (0.37µs), and request completion (0.81µs).]

slide-18
SLIDE 18

NVRAMOS 2019

  • Read path

− Analysis of vanilla read path − Proposed read path

  • Light-weight block I/O layer
  • Write path

− Analysis of vanilla write path − Proposed write path

  • Evaluation
  • Conclusion

Agenda

18

slide-19
SLIDE 19

NVRAMOS 2019

  • Structure conversion

− Merges the bio into a pending request via I/O merging − Otherwise assigns a new tag and request and converts the bio into it

  • Multi-queue structure

− Software staging queue (SW queue)

✓ Support I/O scheduling and reordering

− Hardware dispatch queue (HW queue)

✓ Deliver the I/O request to the device driver

  • Multiple dynamic memory allocations

− Bio (block layer) − NVMe iod, scatter/gather list, NVMe PRP* list (device driver)

Linux Multi-queue Block I/O Layer

19

[Figure: Linux multi-queue block layer. submit_bio() takes a bio (LBA, length, pages), which is converted into a tagged request (length, bios) in the per-core SW queues and HW queues; the device driver then allocates an iod, sg_list, and prp_list and builds the NVMe command for the NVMe queue pairs.]

*PRP: physical region page

slide-20
SLIDE 20

NVRAMOS 2019

  • Structure conversion

− Inefficiency of I/O merging [Zhang,OSDI’18]

✓ Useful for low-performance storage devices

  • Multi-queue structure
  • Multiple dynamic memory allocations

Linux Multi-queue Block I/O Layer

20

[Figure: the Linux multi-queue block layer (same diagram as the previous slide).]

slide-21
SLIDE 21

NVRAMOS 2019

  • Structure conversion

− Inefficiency of I/O merging [Zhang,OSDI’18]

✓ Useful for low-performance storage devices

  • Multi-queue structure

− Inefficiency of I/O scheduling for low-latency SSDs [Saxena,ATC’10] [Xu,SYSTOR’15]

✓ The default configuration is the noop scheduler

− Bypass the multi-queue structure [Zhang,OSDI’18] − Device-side I/O scheduling [Peter,OSDI’14] [Joshi,HotStorage’17]

  • Multiple dynamic memory allocations

Linux Multi-queue Block I/O Layer

21

[Figure: the Linux multi-queue block layer (same diagram as the previous slides).]

slide-22
SLIDE 22

NVRAMOS 2019

Light-weight Block I/O Layer

22

[Figure: light-weight block layer. submit_lbio() takes an lbio (LBA, length, prp_list, pages, dma_addrs) allocated from a per-CPU lbio pool backed by the DMA-mapped page pool; the lbio carries the tag, and the device driver builds the NVMe command directly for the NVMe queue pairs.]

  • Light-weight bio (lbio) structure

− Contains only the essential fields needed to build an NVMe I/O request − Eliminates unnecessary structure conversions and allocations

  • Per-CPU lbio pool

− Supports lockless lbio object allocation − Provides the tagging function

  • Single dynamic memory allocation

− NVMe PRP* list (device driver)

*PRP: physical region page
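A rough C sketch of what this design implies; the field names, sizes, and helpers are assumptions for illustration, not the actual AIOS definitions.

```c
/* Assumed shape of a light-weight bio (lbio) and its per-CPU pool;
 * illustrative only, not the actual AIOS code. */
#define LBIO_POOL_SIZE   64            /* one pool per CPU core        */
#define LBIO_MAX_PAGES   32            /* up to 128 KB per lbio        */

struct lbio {
    u64          lba;                          /* start LBA            */
    u32          length;                       /* bytes                */
    dma_addr_t   prp_list;                     /* the only remaining
                                                * dynamic allocation   */
    struct page *pages[LBIO_MAX_PAGES];
    dma_addr_t   dma_addrs[LBIO_MAX_PAGES];    /* pre-mapped addresses */
};

struct lbio_pool {
    struct lbio   lbios[LBIO_POOL_SIZE];
    unsigned long used;                        /* allocation bitmap    */
};

/* Lockless allocation from the local CPU's pool; the array index doubles
 * as the command tag, so there is no separate tag allocator and no
 * bio-to-request conversion.  (Handling of a full pool is omitted.) */
static struct lbio *lbio_alloc(struct lbio_pool *pool, int *tag)
{
    int idx = ffz(pool->used);                 /* first free slot      */

    __set_bit(idx, &pool->used);
    *tag = idx;
    return &pool->lbios[idx];
}
```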

slide-23
SLIDE 23

NVRAMOS 2019

Read Path Comparison

23

[Figure: with the light-weight block I/O layer, BIO submission (0.72µs) plus NVMe I/O submission (0.37µs) shrink to LBIO submission (0.13µs), and request completion (0.81µs) shrinks to LBIO completion (0.65µs), further reducing latency around the 7.26µs device I/O.]

slide-24
SLIDE 24

NVRAMOS 2019

Read Path Comparison

24

[Figure: vanilla vs. proposed read path, from sys_read() to return-to-user. With the same 7.26µs device I/O, the total latency drops from 12.82µs to 10.10µs.]

slide-25
SLIDE 25

NVRAMOS 2019

  • Read path

− Analysis of vanilla read path − Proposed read path

  • Light-weight block I/O layer
  • Write path

− Analysis of vanilla write path − Proposed write path

  • Evaluation
  • Conclusion

Agenda

25

slide-26
SLIDE 26

NVRAMOS 2019

Analysis of Vanilla Fsync Path (Ext4 Ordered Mode)

26

[Figure: vanilla fsync timeline from sys_fsync() to return-to-user, with CPU, jbd2, and device rows. Data writeback (5.68µs) and the data block submit precede the data block I/O (12.73µs); a jbd2 call (0.80µs) and journal block preparation (5.55µs) precede the journal block submit and journal block I/O (10.72µs); commit block preparation (2.15µs) precedes the flush & commit block submit and the commit block I/O (12.57µs).]

Journal block preparation includes:
  • Allocating buffer pages
  • Allocating journal area blocks
  • Checksum computation, …
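The same sequence, as illustrative C-style pseudocode; the function names are placeholders rather than actual ext4/jbd2 symbols.

```c
/* Rough shape of the vanilla ext4 ordered-mode fsync (illustrative
 * pseudocode, placeholder names).  Each step waits before the next one
 * begins, so CPU work never overlaps with the three device I/Os. */
int vanilla_fsync(struct file *f)
{
    write_back_dirty_data(f);            /* data writeback: ~5.68 us       */
    submit_data_blocks(f);
    wait_for_data_io(f);                 /* data block I/O: ~12.73 us      */

    call_jbd2();                         /* ~0.80 us                       */
    prepare_journal_blocks();            /* buffer pages, journal-area
                                          * blocks, checksums: ~5.55 us    */
    submit_journal_blocks();
    wait_for_journal_io();               /* journal block I/O: ~10.72 us   */

    prepare_commit_block();              /* ~2.15 us                       */
    submit_flush_and_commit_block();     /* FLUSH + commit record          */
    wait_for_commit_io();                /* commit block I/O: ~12.57 us    */
    return 0;
}
```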

slide-27
SLIDE 27

NVRAMOS 2019

Proposed Fsync Path (Ext4 Ordered Mode)

27

[Figure: build-up of the proposed fsync timeline (completed on the next slide). After sys_fsync(), data writeback (4.18µs) and the data block submit start the data block I/O; a jbd2 call (0.78µs) wakes jbd2, which prepares the journal blocks (5.21µs) and the commit block (1.90µs), submits the journal blocks, and enters the jbd2 commit wait.]

  • Reduced latency by applying the light-weight block I/O layer
  • jbd2 is woken before the data block wait

slide-28
SLIDE 28

NVRAMOS 2019

Proposed Fsync Path (Ext4 Ordered Mode)

28

[Figure: proposed fsync timeline from sys_fsync() to return-to-user, with CPU, jbd2, and device rows. Data writeback (4.18µs) and the data block submit start the data block I/O (10.61µs); a jbd2 call (0.78µs) wakes jbd2, which prepares the journal blocks (5.21µs) and the commit block (1.90µs) and submits the journal blocks (journal block I/O, 10.44µs) while the data block I/O is in flight; after the combined data & journal block I/O wait, the flush & commit block is dispatched (0.04µs) and its I/O (11.37µs) completes before returning to user out of the jbd2 commit wait.]

  • Reduced latency by applying the light-weight block I/O layer
  • jbd2 is woken before the data block wait
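The reordering can be summarized in the same pseudocode style (placeholder names again, not AIOS or jbd2 internals): jbd2 is woken immediately, so journal and commit preparation overlap with the in-flight data block I/O.

```c
/* Illustrative pseudocode of the proposed fsync ordering (placeholder
 * names).  Journal/commit preparation by jbd2 overlaps with the data
 * block I/O instead of starting after it completes. */
int proposed_fsync(struct file *f)
{
    write_back_dirty_data(f);            /* ~4.18 us with the light-weight
                                          * block layer                    */
    submit_data_blocks(f);               /* data block I/O starts          */
    call_jbd2();                         /* wake jbd2 (~0.78 us) without
                                          * waiting for the data blocks    */

    /* jbd2, concurrently with the data block I/O on the device: */
    prepare_journal_blocks();            /* ~5.21 us                       */
    submit_journal_blocks();             /* journal block I/O starts       */
    prepare_commit_block();              /* ~1.90 us                       */

    wait_for_data_and_journal_io();      /* single combined wait           */
    submit_flush_and_commit_block();     /* dispatch: ~0.04 us             */
    wait_for_commit_io();
    return 0;
}
```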

slide-29
SLIDE 29

NVRAMOS 2019

Fsync Path Comparison (Ext4 Ordered Mode)

29

[Figure: vanilla vs. proposed fsync path (data, journal, and flush & commit block I/Os across the CPU, jbd2, and device rows). The total latency drops from 53.94µs (vanilla) to 38.03µs (proposed).]

slide-30
SLIDE 30

NVRAMOS 2019

  • Read path

− Analysis of vanilla read path − Proposed read path

  • Light-weight block I/O layer
  • Write path

− Analysis of vanilla write path − Proposed write path

  • Evaluation
  • Conclusion

Agenda

30

slide-31
SLIDE 31

NVRAMOS 2019

Experimental Setup

31

Server: Dell R730
OS: Ubuntu 16.04.4
Base kernel: Linux 5.0.5
CPU: Intel Xeon E5-2640 v3, 2.6GHz, 8 cores
Memory: DDR4 32GB
Storage devices: Z-SSD (Samsung SZ985 800GB), Optane SSD (Intel Optane 905P 960GB)
Workloads: synthetic micro-benchmark (FIO), real-world workload (RocksDB DBbench)

slide-32
SLIDE 32

NVRAMOS 2019

  • 4KB block size

FIO Performance (Random Read)

  • Single thread

32

[Figure: left, random read latency (µs) vs. block size for a single thread, comparing Vanilla, AIOS, and AIOS-poll, with up to 23% latency reduction (annotated point: 7.6µs); right, 4KB random read IOPS (k) vs. thread count (1-128) for Vanilla and AIOS, with up to 26% IOPS improvement.]
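For reference, a minimal user-space loop in the spirit of such a 4KB random-read latency test; this is not the FIO job used in the paper, and the device path is a placeholder.

```c
/* Stand-in for a FIO-style single-threaded 4 KB random read latency test
 * (O_DIRECT); not the configuration used in the paper. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "/dev/nvme0n1";   /* placeholder */
    int fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    void *buf;
    if (posix_memalign(&buf, 4096, 4096))      /* O_DIRECT needs alignment */
        return 1;

    off_t blocks = lseek(fd, 0, SEEK_END) / 4096;
    if (blocks <= 0)
        blocks = 1;

    double total_us = 0;
    const int iters = 10000;
    for (int i = 0; i < iters; i++) {
        off_t off = (off_t)(rand() % blocks) * 4096;       /* random 4 KB */
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        pread(fd, buf, 4096, off);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        total_us += (t1.tv_sec - t0.tv_sec) * 1e6 +
                    (t1.tv_nsec - t0.tv_nsec) / 1e3;
    }
    printf("avg 4KB random read latency: %.2f us\n", total_us / iters);
    close(fd);
    free(buf);
    return 0;
}
```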

slide-33
SLIDE 33

NVRAMOS 2019

  • 4KB block size

FIO Performance (Random Write+Fsync, Ext4 Ordered)

  • Single thread

33

[Figure: left, random write+fsync latency (µs) vs. block size for a single thread, comparing Vanilla and AIOS, with up to 26% latency reduction; right, 4KB write+fsync IOPS (k) vs. thread count (1-16) for Vanilla and AIOS, with up to 27% IOPS improvement.]

slide-34
SLIDE 34

NVRAMOS 2019

  • DBbench readrandom

− 64GB dataset

  • DBbench fillsync

− 16GB dataset

[Figure: DBbench throughput (k OP/s) vs. thread count (1, 2, 4, 8, 16) for Vanilla and AIOS on both workloads.]

RocksDB Performance

34

Up to 44% performance improvement on one workload and up to 27% on the other; the per-thread-count gains are 27.2%, 22.3%, 22.9%, 18.5%, 10.7% and 44.3%, 38.9%, 35.7%, 30.6%, 23.6%.

slide-35
SLIDE 35

NVRAMOS 2019

Conclusion

35

  • Asynchronous I/O stack

− Applies asynchronous I/O concept to the kernel I/O stack itself − Overlaps computation with I/O to reduce total I/O latency

  • Light-weight block I/O layer

− Provides low-latency block I/O services for low-latency NVMe SSDs

  • Performance evaluation

− Achieves a single-digit microsecond I/O latency on Optane SSD − Achieves significant latency reduction and performance improvement on real-world workloads

Source code: https://github.com/skkucsl/aios

slide-36
SLIDE 36

NVRAMOS 2019

Comparison with User-level Direct Access Approach

36

[Figure: random read latency (µs) vs. block size (4KB-128KB), broken down into user+device, copy-to-user, and kernel time, compared against user+device time with SPDK.]

slide-37
SLIDE 37

NVRAMOS 2019

Q&A

  • Thank you

37

slide-38
SLIDE 38

NVRAMOS 2019

Acknowledgment

  • This talk is supported by the National Research Foundation of

Korea (NRF) grant funded by the Korea government (MSIT) (NRF-2017R1C1B2007273).

38

slide-39
SLIDE 39

NVRAMOS 2019

Extra Slides

slide-40
SLIDE 40

NVRAMOS 2019

  • Extent status tree (in-memory structure)

− Contains the LBA mapping information for file blocks (composed of extent objects) − Incurs an additional I/O request to read the mapping block on an extent lookup miss

  • Control & data plane approach [OSDI‘14]

− Preloads the entire mapping information into memory in the control plane (e.g., at file open) − Selectively applied to files requiring low-latency access

Preloading Extent Tree

40

[Figure: read path timeline before the 7.26µs device I/O; LBA retrieval drops from 0.09µs to 0.07µs with the extent tree preloaded.]
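A hypothetical sketch of the preloading step at open time; the types and helper names are simplified illustrations, not actual ext4/AIOS code.

```c
/* Hypothetical extent-tree preloading at file open (illustrative names).
 * Walking the whole file once in the control plane pulls every mapping
 * block into the in-memory extent status tree, so the data plane (the
 * read path) never pays an extra mapping-block I/O for LBA retrieval. */
struct extent {
    u64 lba;        /* starting LBA of the extent */
    u64 length;     /* extent length in bytes     */
};

int preload_extent_tree(struct inode *inode)
{
    loff_t pos = 0;

    while (pos < inode->i_size) {
        struct extent ext;

        /* On a miss, this reads the on-disk mapping block and caches
         * the extent in the extent status tree. */
        lookup_extent(inode, pos, &ext);
        pos += ext.length;             /* jump to the next extent */
    }
    return 0;
}
```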

slide-41
SLIDE 41

NVRAMOS 2019

CDF of the Block I/O Submission Latency

41

[Figure: CDFs of the block I/O submission latency for 4KB and 32KB random reads.]

  • Block I/O submission latency

− Time from allocating an object (bio or lbio) to dispatching an I/O command to the device

slide-42
SLIDE 42

NVRAMOS 2019

Read Performance Analysis with Opt. Levels

42

[Figure: normalized random read latency vs. block size (4KB-128KB) as optimizations are applied cumulatively: Vanilla, +Preload, +MQ-bypass, +LBIO, +Async-page, +Async-DMA.]

  • Preload

− Preloading extent tree

  • MQ-bypass

− Bypassing SW, HW queue

  • LBIO

− Light-weight block I/O layer

  • Async-page

− Asynchronous page allocation − Lazy page cache insertion

  • Async-DMA

− Asynchronous DMA mapping − Lazy DMA unmapping

slide-43
SLIDE 43

NVRAMOS 2019

Write(+fsync) Performance Analysis with Opt. Levels

43

[Figure: normalized write(+fsync) latency vs. block size (4KB-128KB) for Vanilla, +LBIO, and +AIOS.]

  • LBIO

− Light-weight block I/O layer − Synchronous page allocation / DMA mapping

  • AIOS

− Proposed fsync path

slide-44
SLIDE 44

NVRAMOS 2019

CPU Usage Breakdown

  • RocksDB DBbench readrandom / fillsync using 8 threads

44

[Figure: normalized CPU usage breakdown (idle, I/O wait, kernel, user) for Vanilla and AIOS running readrandom and fillsync.]

AIOS effectively overlaps kernel operations with I/O, reducing I/O wait time, kernel CPU time, and the overall runtime.

slide-45
SLIDE 45

NVRAMOS 2019

Comparison with User-level Direct Access Approach

45

[Figure: random read latency (µs) vs. block size (4KB-128KB), broken down into user+device, copy-to-user, and kernel time, compared against user+device time with SPDK (same figure as Slide 36).]

slide-46
SLIDE 46

NVRAMOS 2019

Memory Overheads on Asynchronous I/O Stack

46

Object                                     Linux block layer       Light-weight block layer
Statically allocated objects (per core)    request pool: 412KB     lbio array: 192KB
                                                                    free page pool: 256KB
Per 128KB block request:
  (l)bio + (l)bio_vec                      648B                    704B
  request                                  384B                    −
  iod                                      974B                    −
  prp_list                                 4096B                   4096B
  Total                                    6104B                   4800B

slide-47
SLIDE 47

NVRAMOS 2019

Proposed Fsync Path (Ext4 Ordered Mode)

47

[Figure: proposed fsync timeline, identical to Slide 28: data writeback (4.18µs), jbd2 call (0.78µs), journal block preparation (5.21µs), commit block preparation (1.90µs), combined data & journal block I/O wait, flush & commit block dispatch (0.04µs); the three I/Os take 10.61µs, 10.44µs, and 11.37µs.]

  • Reduced latency by applying the light-weight block I/O layer
  • jbd2 is woken before the data block wait

slide-48
SLIDE 48

NVRAMOS 2019

  • Implementation

− Delays DMA unmapping until the system is idle or is waiting for another I/O request − An extended version of the deferred protection scheme in Linux [ASPLOS’16] − Can optionally be disabled for safety

Lazy DMA Unmapping

48

[Figure: read path timeline with lazy DMA unmapping (0.35µs) deferred out of the I/O completion path (same figure as Slide 16).]