SLIDE 1

LightKV: A Cross Media Key Value Store with Persistent Memory to Cut Long Tail Latency

Shukai Han, Dejun Jiang, Jin Xiong

Institute of Computing Technology, Chinese Academy of Sciences University of Chinese Academy of Sciences

MSST '20

October 29-30, 2020

SLIDE 2

Outline

✓ Background & Motivation

  • Design
  • Evaluation
  • Conclusion
SLIDE 3

Key-Value Store

  • Key-Value (KV) stores are widely deployed in data centers.
  • KV stores are latency-critical applications.

– Workloads with a high percentage of small KV items [1]
– Applications with low latency requirements

[1] Atikoglu et al., "Workload Analysis of a Large-Scale Key-Value Store", SIGMETRICS 2012

SLIDE 4

Log-Structured Merge Tree (LSM-Tree)

[Figure: LSM-Tree architecture. A KV pair is (1) appended to the write-ahead log (WAL), (2) written into the in-memory MemTable, (3) flushed, via an immutable MemTable, to a Level 0 SSTable, and (4) merged between levels (Level 0 … Level k) by compaction. SSTable structure: sorted KV data plus metadata (bloom filter, index, etc.).]
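To make the four numbered steps concrete, here is a minimal sketch of an LSM-Tree write path. The class, file names, and flush threshold are illustrative assumptions, not LevelDB's actual API.

```cpp
#include <cstddef>
#include <fstream>
#include <map>
#include <string>

// Minimal LSM-Tree write path: WAL append, MemTable insert, flush trigger.
// Names and thresholds are illustrative, not taken from any real system.
class LSMWritePath {
public:
    void Put(const std::string& key, const std::string& value) {
        wal_ << key << '\t' << value << '\n';      // 1. WAL: persist for crash recovery
        wal_.flush();
        memtable_[key] = value;                    // 2. Write into the sorted in-memory MemTable
        if (memtable_.size() >= kFlushThreshold)
            FlushToLevel0();                       // 3. Flush: MemTable becomes a Level 0 SSTable
        // 4. Compaction (not shown) later merges SSTables between levels.
    }

private:
    void FlushToLevel0() {
        std::ofstream sst("level0_sstable.dat");   // std::map iterates in key order,
        for (const auto& [k, v] : memtable_)       // so the SSTable comes out sorted
            sst << k << '\t' << v << '\n';
        memtable_.clear();
    }

    static constexpr std::size_t kFlushThreshold = 1024;
    std::map<std::string, std::string> memtable_;  // a skiplist in LevelDB; std::map here
    std::ofstream wal_{"wal.log", std::ios::app};
};
```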

SLIDE 5

Limitations of Persistent KV Store

Inefficient indexing for cross-media

[Figure: the LSM-Tree write path, as on Slide 4.]

  • On one hand, LSM-Tree adopts a skiplist to index in-memory data.
  • On the other hand, LSM-Tree builds manifest files to record the key range of each on-disk SSTable.

Thus no single index spans both media: a read must first probe the in-memory skiplist and then consult the manifest to locate candidate SSTables on disk.


SLIDE 6

Limitations of Persistent KV Store

High write amplification

[Figure: the LSM-Tree write path, as on Slide 4.]

Writing the log and transferring data between levels increase write amplification. Write (read) amplification is defined as the ratio between the amount of data written to (read from) the underlying storage device and the amount of data requested by the user.
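As a worked example (illustrative numbers): if a user inserts 10 GB of KV pairs, and the WAL write, the flush, and repeated compactions between levels cause 100 GB to be written to the device in total, then the write amplification is 100 GB / 10 GB = 10.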

SLIDE 7

Limitations of Persistent KV Store

High write amplification

[Figure: write amplification as the total amount of written data grows from 10 GB to 100 GB, with series listed in legend order.]

Data written    10G    20G    30G    40G    50G    60G    70G    80G    90G    100G
LevelDB         8.80   10.47  11.42  12.29  12.90  13.28  13.57  14.01  14.41  14.64
HyperLevelDB    6.84   7.78   8.69   9.38   10.14  10.60  10.75  10.90  11.23  11.68
RocksDB         5.16   6.30   7.80   9.22   9.43   9.72   10.16  10.56  10.54  10.95

The write amplification of LSM-Tree can reach 10x or more, and it continues to trend upward as the amount of written data increases.

SLIDE 8

Limitations of Persistent KV Store

Heavy tailed read latency under mixed workload

  • We first warm up LevelDB with 100 GB of data.
  • t1: Run a mixed workload that randomly reads 50 GB of existing data while randomly inserting another 50 GB of new data.
  • We measure the average latency as well as the 99th and 99.9th percentile read latencies every 10 seconds.

The maximum 99th and 99.9th percentile read latencies reach 13x and 28x the average read latency.
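As a side note, the percentile latencies above can be computed per 10-second window as in the following sketch; this is a generic illustration, not the paper's measurement harness.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Index-based percentile over one window of latency samples (assumed non-empty).
uint64_t Percentile(std::vector<uint64_t> samples, double p) {
    std::sort(samples.begin(), samples.end());
    auto rank = static_cast<std::size_t>(p * (samples.size() - 1));
    return samples[rank];
}

double Average(const std::vector<uint64_t>& samples) {
    return std::accumulate(samples.begin(), samples.end(), 0.0) / samples.size();
}

// Per window, report Average(s), Percentile(s, 0.99), and Percentile(s, 0.999).
```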

SLIDE 9

Limitations of Persistent KV Store

Heavy tailed read latency under mixed workload


t2: Run a read-only workload.

After the compaction finishes, the read tail latency is significantly reduced. Reducing write amplification therefore not only lowers the total amount of data written to the disk and increases system throughput, but also reduces the read tail latency under mixed read-write workloads.

SLIDE 10

Non-Volatile Memory

  • Non-Volatile Memories (NVMs) provide low latency and byte-addressable access, and can persist data after power off.
  • Example technologies include 3D XPoint, Phase Change Memory (PCM), and Resistive Memory (ReRAM).
  • The first PM product, Intel Optane DC Persistent Memory (PM), was announced in April 2019 [19].
  • The write latency of Optane DC PM is close to DRAM, while its read latency is 3 to 4 times that of DRAM.
  • The write and read bandwidths of Optane DC PM are around 2 GB/s and 6.5 GB/s, about 1/8 and 1/4 those of DRAM respectively.
SLIDE 11

Outline

  • Background & Motivation

✓ Design

  • Evaluation
  • Conclusion
SLIDE 12

LightKV System Overview

[Figure: LightKV overview across DRAM, persistent memory, and SSD.]

1. Radix Hash Tree (RH-Tree) index — a single global index over all media.
2. Persistent Write Buffer (PWB) — segments in persistent memory that absorb incoming writes and are flushed to the main data store.
3. Main Data Store — SSTables on SSD, organized into partitions (Partition 1 … Partition N) that are compacted independently.
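To connect the three components, here is a minimal sketch of what a write path through this architecture could look like. All names, the flush policy, and the capacity are our own illustration, not LightKV's code; the apparent design point is that a write buffer in PM is already durable, so no separate WAL is needed.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Illustrative write path: KV pairs land in a PM-resident segment of the
// Persistent Write Buffer, the RH-Tree records where they live, and a full
// segment is flushed as an SSTable into a partition on SSD.
struct Segment { std::vector<std::pair<std::string, std::string>> items; };

class LightKVSketch {
public:
    void Put(const std::string& key, const std::string& value) {
        active_.items.emplace_back(key, value);        // durable once in the PM segment
        IndexInRHTree(key, active_.items.size() - 1);  // point the RH-Tree at the new item
        if (active_.items.size() >= kSegmentCap)
            FlushToPartition();                        // segment -> SSTable in a partition
    }

private:
    void IndexInRHTree(const std::string& key, std::size_t offset) { /* update hash leaf */ }
    void FlushToPartition() { /* write SSTable to SSD, re-point index */ active_.items.clear(); }

    static constexpr std::size_t kSegmentCap = 4096;   // illustrative capacity
    Segment active_;                                   // lives in persistent memory in LightKV
};
```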

SLIDE 13
Challenges

  • How does the Radix Hash Tree index KV items across media?
  • How does the Radix Hash Tree balance performance and data growth?
  • How does the Radix Hash Tree conduct well-controlled data compaction to reduce write amplification?

SLIDE 14

Radix Hash Tree Structure

[Figure: RH-Tree structure. A prefix search tree routes each key by its prefix range (e.g., [0,32] and [128,255] at one level; [0,64], [96,255], and [64,255] below); the leaf nodes are hash tables whose entries point to KV items stored in an SSTable or Segment.]

HashTable bucket layout (64 B per bucket):
  signature:  4 B × 4 = 16 B
  cache:      4 B × 4 = 16 B
  kv offset:  8 B × 4 = 32 B
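The 64 B bucket maps naturally onto a packed, cache-line-aligned struct; the field names below follow the slide, while the exact types are our assumption.

```cpp
#include <cstdint>

// One 64 B RH-Tree hash bucket, following the layout on the slide:
// 4 signatures (16 B) + 4 cache fields (16 B) + 4 KV offsets (32 B).
// Field types are assumptions; the slide only gives names and sizes.
struct alignas(64) HashBucket {
    uint32_t signature[4];   // key fingerprints to filter probes cheaply
    uint32_t cache[4];       // per-slot cached metadata
    uint64_t kv_offset[4];   // where each KV item lives in a Segment or SSTable
};
static_assert(sizeof(HashBucket) == 64, "one bucket per cache line");
```

Keeping a bucket within one cache line bounds a probe to a single cache miss per bucket.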

SLIDE 15
RH-Tree split

[Figure: RH-Tree splits. Normal split: a hash leaf node LN1 covering [0,127] under internal node IN1 splits into LN1 [0,63] and LN2 [64,127]. Level split: a new internal node IN2 is inserted beneath IN1, and the data is divided into LN1 [0,127] and LN2 [128,255] under IN2, adding one level to the tree.]
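A rough sketch of the two split paths, under our reading of the figure (a leaf splits in place while its parent has room, and otherwise the tree grows a level; the trigger condition and node layout are assumptions, not code from the paper):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Rough sketch of RH-Tree normal split vs. level split; see assumptions above.
struct Node {
    bool is_leaf;
    struct Range { uint8_t lo, hi; } range;
    std::vector<Node*> children;       // internal nodes only
};

void Split(Node* parent, Node* leaf, std::size_t max_children) {
    uint8_t mid = leaf->range.lo + (leaf->range.hi - leaf->range.lo) / 2;
    if (parent->children.size() < max_children) {
        // Normal split: halve the leaf's range into two sibling leaves,
        // e.g. [0,127] -> [0,63] and [64,127].
        parent->children.push_back(new Node{true, {uint8_t(mid + 1), leaf->range.hi}, {}});
        leaf->range.hi = mid;          // (items are re-hashed into the two halves, not shown)
    } else {
        // Level split: put a new internal node in the leaf's place and hang
        // two half-range leaves beneath it; the tree gains a level.
        Node* inner = new Node{false, leaf->range, {}};
        inner->children.push_back(new Node{true, {leaf->range.lo, mid}, {}});
        inner->children.push_back(new Node{true, {uint8_t(mid + 1), leaf->range.hi}, {}});
        // (replace `leaf` with `inner` in parent->children, not shown)
    }
}
```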

SLIDE 16

Linked hash leaf node

[Figure: linked hash leaf nodes across three stages (DRAM / PM / SSD). Stage 1: leaf node LN1 indexes Segment1. Stage 2: while Segment1 is flushed to an SSTable, a new leaf node LN2 indexing the fresh Segment2 is linked to LN1, and a persisted node LN1' is built to index the SSTable. Stage 3: LN2 indexes Segment2 and LN1' indexes the SSTable, so lookups can proceed throughout the flush.]

SLIDE 17

RH-Tree placement

[Figure: RH-Tree placement across media. The prefix search tree resides in DRAM; the hash leaf nodes and the Persistent Write Buffer segments reside in persistent memory; the partitioned SSTables of the Main Data Store reside on SSD.]

SLIDE 18

Partition-based data compaction

[Figure: partition-based compaction over time (t1 … t6) with Compaction Size (CS) = 4. Four level-0 SSTables S1(0) … S4(0) merge into S5(1), S6(0) … S9(0) into S10(1), and so on; once four level-1 SSTables S5(1), S10(1), S15(1), S20(1) accumulate, they merge into S21(2). With N SSTables in a partition, the number of levels grows roughly as log_CS N.]
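A minimal sketch of this policy, assuming each partition merges CS same-level SSTables into one table at the next level and merges may cascade upward (types and names are illustrative, not LightKV's code):

```cpp
#include <cstddef>
#include <map>
#include <vector>

// Partition-local compaction: once a level holds CS SSTables, merge them
// into a single SSTable at the next level.
struct SSTable { int level; /* sorted KV data omitted */ };

class Partition {
public:
    void AddTable(SSTable t) {
        levels_[t.level].push_back(t);
        while (levels_[t.level].size() >= kCompactionSize) {
            SSTable merged{t.level + 1};             // k-way merge of the CS tables (not shown)
            levels_[t.level].clear();
            levels_[merged.level].push_back(merged);
            t = merged;                              // a merge may cascade up a level
        }
    }

private:
    static constexpr std::size_t kCompactionSize = 4;  // CS from the slide
    std::map<int, std::vector<SSTable>> levels_;       // level -> tables in this partition
};
```

Because each merge touches only CS tables inside one partition, a KV item is rewritten roughly once per level, keeping write amplification near the number of levels rather than growing with level size as in a leveled LSM-Tree.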

SLIDE 19

Recovery

[Figure: recovery. The hash leaf nodes and Persistent Write Buffer segments in persistent memory, and the partitioned SSTables on SSD, survive a crash; only the volatile prefix search tree in DRAM must be rebuilt from them.]
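A sketch of what that rebuild could look like, assuming each persistent leaf node records the key-prefix range it covers (the PersistentLeaf layout is our assumption):

```cpp
#include <cstdint>
#include <vector>

// Recovery sketch: only the volatile prefix search tree is rebuilt, by
// scanning the persistent hash leaf nodes; KV data never leaves PM/SSD.
struct PersistentLeaf { uint8_t lo, hi; /* 64 B hash buckets follow */ };

struct SearchTree {
    void InsertRange(uint8_t lo, uint8_t hi, PersistentLeaf* leaf) { /* ... */ }
};

SearchTree Rebuild(const std::vector<PersistentLeaf*>& leaves_in_pm) {
    SearchTree tree;                                 // volatile, lives in DRAM
    for (PersistentLeaf* leaf : leaves_in_pm)
        tree.InsertRange(leaf->lo, leaf->hi, leaf);  // re-link leaf under its prefix range
    return tree;
}
```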

SLIDE 20

Outline

  • Background & Motivation
  • Design

✓ Evaluation

  • Conclusion
SLIDE 21
Experiment Setup

  • System and hardware configuration

– Two Intel Xeon Gold 5215 CPUs (2.5 GHz), 64 GB memory, and one 400 GB Intel DC P3700 SSD.
– CentOS Linux release 7.6.1810 with the 4.18.8 kernel and the ext4 file system.

  • Compared systems

– LevelDB, RocksDB
– NoveLSM, SLM-DB

  • Workloads

– db_bench as the microbenchmark
– YCSB as the real-world workload

YCSB Workload   Description
A               50% reads and 50% updates
B               95% reads and 5% updates
C               100% reads
D               95% reads for latest keys and 5% inserts
E               95% scans and 5% inserts
F               50% reads and 50% read-modify-writes

SLIDE 22

Reducing write amplification

The write amplification of LightKV is reduced by 7.1x, 5.1x, 2.9x, and 2.3x compared to that of LevelDB, RocksDB, NoveLSM, and SLM-DB respectively. When the total amount of written data increases, the write amplification of LightKV remains stable (e.g. from 1.6 to 1.8 when the data amount increases from 50 GB to 100 GB).

SLIDE 23

Basic Operations

Thanks to the global index and partition compaction, LightKV effectively reduces read-write amplification and improves read and write performance.

[Figure callouts: 13.5x, 8.3x, 5.0x, 4.0x and 4.5x, 1.9x, 4.2x, 1.3x, relative to LevelDB, RocksDB, NoveLSM, and SLM-DB.]

SLIDE 24

Basic Operations

The performance of LightKV on short range queries is low, because a short range query must search all SSTables in one or more partitions. [Figure callouts: reduced by 24.3% and 13.2%.]

SLIDE 25

Tail latency under read-write workload

Thanks to lower write amplification and global indexing, LightKV provides lower and more stable read and write tail latencies.

99th percentile: 17.9x, 10.5x, 6.4x, 3.5x; 99.9th percentile: 15.7x, 9.2x, 8.8x, 3.4x (improvements relative to LevelDB, RocksDB, NoveLSM, and SLM-DB).

SLIDE 26

Results with YCSB

LightKV provides better throughput on workloads that simulate real applications.

SLIDE 27

Outline

  • Background & Motivation
  • Design
  • Evaluation

✓ Conclusion

SLIDE 28

Conclusion

  • LSM-Tree based on traditional storage devices faces problems such as read-write amplification.
  • At the same time, the emergence of non-volatile memory provides opportunities and challenges for building efficient key-value storage systems.
  • In this paper, we propose LightKV, a cross-media key-value store with persistent memory. LightKV effectively reduces the read-write amplification of the system by building an RH-Tree and adopting partition-based compaction.
  • The experiment results show that LightKV reduces write amplification by up to 8.1x and improves read performance by up to 9.2x. It also reduces read tail latency by up to 18.8x under a mixed read-write workload.

SLIDE 29

THANK YOU !

Q & A

Author Email: hanshukai@ict.ac.cn

SLIDE 30

Sensitivity analysis

As the maximum number of partitions increases, the read and write performance of LightKV increases, but the NVM capacity consumption also increases.
SLIDE 31

Sensitivity analysis

As the compaction size increases, merging happens less frequently and write amplification drops, which improves write performance; however, more SSTables accumulate before each merge, which hurts read performance.