ADVANCED DATABASE SYSTEMS
OLTP Indexes (Trie Data Structures)

slide-1
SLIDE 1

OLTP Indexes (Trie Data Structures)

@Andy_Pavlo // 15-721 // Spring 2019

ADVANCED DATABASE SYSTEMS

Lecture #08

slide-2
SLIDE 2 CMU 15-721 (Spring 2019)

Index Implementation Issues Judy Array ART Masstree

2

slide-3
SLIDE 3 CMU 15-721 (Spring 2019)

INDEX IMPLEMENTATION ISSUES

Garbage Collection
Memory Pools
Non-Unique Keys
Variable-Length Keys
Compression

3

slide-4
SLIDE 4 CMU 15-721 (Spring 2019)

GARBAGE COLLECTION

We need to know when it is safe to reclaim memory for deleted nodes in a latch-free index.

→ Reference Counting
→ Epoch-based Reclamation
→ Hazard Pointers
→ Many others…

4

(Diagram: deleting K3 V3 from a chain of nodes K2 V2, K3 V3, K4 V4; the node is unlinked but readers may still hold pointers to it.)

slide-8
SLIDE 8 CMU 15-721 (Spring 2019)

REFERENCE COUNTING

Maintain a counter for each node to keep track of the number of threads that are accessing it.

→ Increment the counter before accessing.
→ Decrement it when finished.
→ A node is only safe to delete when the count is zero.

This has bad performance for multi-core CPUs.

→ Incrementing/decrementing counters causes a lot of cache coherence traffic.

5
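The counter protocol above can be sketched in a few lines of Python. This is a minimal illustration, not a real latch-free implementation: the class and field names are invented, and a lock stands in for the atomic increment a real index would use.

```python
import threading

class RefCountedNode:
    """Hypothetical index node guarded by a reference counter."""
    def __init__(self, key, value):
        self.key, self.value = key, value
        self._refs = 0
        self._lock = threading.Lock()   # stands in for an atomic counter
        self.deleted = False            # unlinked from the index, not yet freed

    def acquire(self):
        # Increment the counter before accessing the node.
        with self._lock:
            self._refs += 1

    def release(self):
        # Decrement when finished; report whether reclamation is now safe.
        with self._lock:
            self._refs -= 1
            return self.deleted and self._refs == 0

node = RefCountedNode("K3", "V3")
node.acquire()          # a reader pins the node
node.deleted = True     # a writer unlinks it from the index
safe = node.release()   # the last reader leaves -> safe to reclaim
```

Note that every `acquire`/`release` pair writes the counter's cache line, which is exactly the coherence traffic the slide warns about.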

slide-9
SLIDE 9 CMU 15-721 (Spring 2019)

OBSERVATION

We don’t care about the actual value of the reference counter; we only need to know when it reaches zero. We don’t have to perform garbage collection immediately when the counter reaches zero.

6

Source: Stephen Tu

slide-10
SLIDE 10 CMU 15-721 (Spring 2019)

EPOCH GARBAGE COLLECTION

Maintain a global epoch counter that is periodically updated (e.g., every 10 ms).

→ Keep track of what threads enter the index during an epoch and when they leave.

Mark the current epoch of a node when it is marked for deletion.

→ The node can be reclaimed once all threads have left that epoch (and all preceding epochs).

Also known as Read-Copy-Update (RCU) in Linux.

7
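The epoch scheme described above can be sketched as follows. This is a single-threaded illustration with invented names (`EpochGC`, `retire`, `limbo`); a real implementation uses per-thread epoch slots and atomic counters rather than Python dicts.

```python
class EpochGC:
    """Minimal epoch-based reclamation sketch."""
    def __init__(self):
        self.global_epoch = 0
        self.active = {}      # thread id -> epoch in which it entered the index
        self.limbo = []       # (epoch when retired, node) awaiting reclamation

    def enter(self, tid):
        self.active[tid] = self.global_epoch

    def leave(self, tid):
        del self.active[tid]

    def retire(self, node):
        # A deleted node is stamped with the current epoch.
        self.limbo.append((self.global_epoch, node))

    def advance_and_reclaim(self):
        self.global_epoch += 1
        # A node is safe once every active thread entered after its epoch.
        oldest = min(self.active.values(), default=self.global_epoch)
        reclaimed = [n for e, n in self.limbo if e < oldest]
        self.limbo = [(e, n) for e, n in self.limbo if e >= oldest]
        return reclaimed

gc = EpochGC()
gc.enter("T1")                        # T1 starts reading in epoch 0
gc.retire("node-A")                   # a writer deletes node-A during epoch 0
blocked = gc.advance_and_reclaim()    # T1 is still in epoch 0 -> nothing freed
gc.leave("T1")
freed = gc.advance_and_reclaim()      # all readers gone -> node-A reclaimed
```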

slide-11
SLIDE 11 CMU 15-721 (Spring 2019)

MEMORY POOLS

We don’t want to call malloc and free every time we need to add or delete a node. If all the nodes are the same size, then the index can maintain a pool of available nodes.

→ Insert: Grab a free node, otherwise create a new one.
→ Delete: Add the node back to the free pool.

Need some policy to decide when to shrink the pool.

8
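A free-list pool of this kind is only a few lines. The sketch below uses a Python dict as a stand-in for a fixed-size node; the names are illustrative.

```python
class NodePool:
    """Fixed-size node pool: reuse freed nodes instead of malloc/free."""
    def __init__(self):
        self.free_list = []
        self.allocations = 0   # counts actual "malloc" calls

    def acquire(self):
        # Insert path: grab a free node if one exists, otherwise create one.
        if self.free_list:
            return self.free_list.pop()
        self.allocations += 1
        return {"keys": [], "children": []}   # stand-in for a real node

    def release(self, node):
        # Delete path: reset the node and return it to the pool.
        node["keys"].clear()
        node["children"].clear()
        self.free_list.append(node)

pool = NodePool()
a = pool.acquire()        # fresh allocation
pool.release(a)
b = pool.acquire()        # reuses the same node, no new allocation
```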

slide-12
SLIDE 12 CMU 15-721 (Spring 2019)

NON-UNIQUE INDEXES

Approach #1: Duplicate Keys

→ Use the same node layout but store duplicate keys multiple times.

Approach #2: Value Lists

→ Store each key only once and maintain a linked list of unique values.

9

MODERN B-TREE TECHNIQUES

NOW PUBLISHERS 2010

slide-13
SLIDE 13 CMU 15-721 (Spring 2019)

B+Tree Leaf Node

NON-UNIQUE: DUPLICATE KEYS

10

(Diagram: a B+Tree leaf node with header fields Prev, Next, Level, and Slots; Sorted Keys K1 K1 K1 K2 K2 … Kn store duplicates multiple times, each paired with its own value pointer.)
slide-14
SLIDE 14 CMU 15-721 (Spring 2019)

B+Tree Leaf Node

NON-UNIQUE: VALUE LISTS

11

(Diagram: the same leaf node layout with Sorted Keys K1 K2 K3 K4 K5 … Kn stored once each; every key points to a list of its unique values.)
slide-15
SLIDE 15 CMU 15-721 (Spring 2019)

VARIABLE-LENGTH KEYS

Approach #1: Pointers

→ Store the keys as pointers to the tuple’s attribute.

Approach #2: Variable-Length Nodes

→ The size of each node in the index can vary.
→ Requires careful memory management.

Approach #3: Padding

→ Always pad the key to the max length of the key type.

Approach #4: Key Map / Indirection

→ Embed an array of pointers that map to the key + value list within the node.

12
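Approach #4 can be sketched as follows. This is an illustrative model only: Python lists stand in for the node's byte area, and the class name is invented. The point is that binary search touches only the small offset array, never the variable-length keys themselves.

```python
import bisect

class IndirectionNode:
    """Key map / indirection sketch: keys+values packed in a heap area,
    with a sorted array of offsets into it."""
    def __init__(self):
        self.heap = []      # append-only; stands in for the node's byte area
        self.key_map = []   # offsets into heap, kept sorted by key

    def _sorted_keys(self):
        return [self.heap[o][0] for o in self.key_map]

    def insert(self, key, value):
        off = len(self.heap)
        self.heap.append((key, value))         # record lands wherever there is room
        pos = bisect.bisect_left(self._sorted_keys(), key)
        self.key_map.insert(pos, off)          # only the small offset array shifts

    def lookup(self, key):
        keys = self._sorted_keys()
        i = bisect.bisect_left(keys, key)
        if i < len(keys) and keys[i] == key:
            return self.heap[self.key_map[i]][1]
        return None

node = IndirectionNode()
for k, v in [("Obama", "V3"), ("Andy", "V1"), ("Prashanth", "V4"), ("Lin", "V2")]:
    node.insert(k, v)
```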

slide-16
SLIDE 16 CMU 15-721 (Spring 2019)

B+Tree Leaf Node

KEY MAP / INDIRECTION

13

(Diagram: a B+Tree leaf node whose Key+Values area stores Andy V1, Obama V3, Prashanth V4, and Lin V2 in arbitrary order; a Sorted Key Map A·¤ L·¤ O·¤ P·¤ holds one pointer per key, ordered by key.)

slide-19
SLIDE 19 CMU 15-721 (Spring 2019)

PREFIX COMPRESSION

Sorted keys in the same leaf node are likely to have the same prefix. Instead of storing the entire key each time, extract the common prefix and store only the unique suffix for each key.

14

(Example: keys robbed, robbing, robot share Prefix: rob and are stored as suffixes bed, bing, ot.)
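The slide's example can be reproduced with a short helper (the function name is illustrative):

```python
def compress_prefix(keys):
    """Split sorted leaf keys into (common prefix, unique suffixes)."""
    prefix = keys[0]
    for k in keys[1:]:
        # Shrink the candidate prefix until it covers this key too.
        while not k.startswith(prefix):
            prefix = prefix[:-1]
    return prefix, [k[len(prefix):] for k in keys]

prefix, suffixes = compress_prefix(["robbed", "robbing", "robot"])
```

The node then stores "rob" once plus the three short suffixes instead of three full keys.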

slide-20
SLIDE 20 CMU 15-721 (Spring 2019)

SUFFIX TRUNCATION

The keys in the inner nodes are only used to "direct traffic". We don't need the entire key. Store a minimum prefix that is needed to correctly route probes into the index.

15

(Example: inner-node separator keys abcdefghijk and lmnopqrstuv can be truncated to abc and lmn.)
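One plausible way to pick a truncated separator is to take the shortest prefix of the right fence key that still sorts above everything in the left subtree. This is a sketch with an invented function name, not the exact algorithm from the lecture; note it can truncate even more aggressively than the slide's three-character example.

```python
def shortest_separator(left_max, right_min):
    """Shortest key s with left_max < s <= right_min: the minimal
    distinguishing prefix of the right fence key."""
    for i in range(1, len(right_min) + 1):
        candidate = right_min[:i]
        if candidate > left_max:
            return candidate
    return right_min

sep = shortest_separator("abcdefghijk", "lmnopqrstuv")
```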

slide-22
SLIDE 22 CMU 15-721 (Spring 2019)

OBSERVATION

The inner node keys in a B+tree cannot tell you whether a key exists in the index. You always have to traverse to the leaf node. This means that you could have (at least) one cache miss per level in the tree.

16

slide-23
SLIDE 23 CMU 15-721 (Spring 2019)

TRIE INDEX

Use a digital representation of keys to examine prefixes one-by-one instead of comparing the entire key.

→ Also known as Digital Search Tree, Prefix Tree.

17

Keys: HELLO, HAT, HAVE

(Diagram: a trie over the keys, branching H → E → L → L → O and H → A → {T, V → E}.)

slide-25
SLIDE 25 CMU 15-721 (Spring 2019)

TRIE INDEX PROPERTIES

Shape only depends on the key space and key lengths.

→ Does not depend on existing keys or insertion order.
→ Does not require rebalancing operations.

All operations have O(k) complexity where k is the length of the key.

→ The path to a leaf node represents the key of the leaf.
→ Keys are stored implicitly and can be reconstructed from paths.

18
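The properties above are easy to see in a toy trie (dicts stand in for trie nodes, and "$" is an invented end-of-key marker playing the role of a tuple pointer):

```python
def trie_insert(root, key, value):
    node = root
    for ch in key:                    # one level per digit of the key: O(k)
        node = node.setdefault(ch, {})
    node["$"] = value                 # end-of-key marker (the "tuple pointer")

def trie_lookup(root, key):
    node = root
    for ch in key:
        if ch not in node:
            return None
        node = node[ch]
    return node.get("$")

root = {}
for k in ["HELLO", "HAT", "HAVE"]:    # any insertion order yields the same shape
    trie_insert(root, k, "ptr-" + k)
```

The key itself is never stored in a node; it is spelled out by the path from the root.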

slide-26
SLIDE 26 CMU 15-721 (Spring 2019)

TRIE KEY SPAN

The span of a trie level is the number of bits that each partial key / digit represents.

→ If the digit exists in the corpus, then store a pointer to the next level in the trie branch. Otherwise, store null.

This determines the fan-out of each node and the physical height of the tree.

→ n-way Trie = Fan-Out of n

19
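Slicing a key into partial keys for a given span is a simple shift-and-mask loop. The helper below (illustrative name, 16-bit keys as on the next slide) shows how the span trades tree height for fan-out:

```python
def digits(key, span, key_bits=16):
    """Split an integer key into partial keys of `span` bits each,
    most-significant digit first."""
    assert key_bits % span == 0
    mask = (1 << span) - 1
    shifts = range(key_bits - span, -1, -span)
    return [(key >> s) & mask for s in shifts]

one_bit = digits(25, 1)    # 1-bit span -> 16 levels, fan-out 2
byte_span = digits(25, 8)  # 8-bit span -> 2 levels, fan-out 256
```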

slide-27
SLIDE 27 CMU 15-721 (Spring 2019)

TRIE KEY SPAN

Keys: K10, K25, K31

20

K10 → 00000000 00001010
K25 → 00000000 00011001
K31 → 00000000 00011111

(Diagram: a 1-bit span trie over these 16-bit keys. The levels for the shared zero prefix form a single-child chain ("Repeat 10x") before the paths diverge; entries are either tuple pointers or node pointers.)

slide-37
SLIDE 37 CMU 15-721 (Spring 2019)

RADIX TREE

Omit all nodes with only a single child.

→ Also known as Patricia Tree.

21

1-bit Span Radix Tree

(Diagram: the same keys in a 1-bit span radix tree; the single-child chains, including the "Repeat 10x" prefix, are collapsed so only branching nodes and tuple/node pointers remain.)

slide-38
SLIDE 38 CMU 15-721 (Spring 2019)

TRIE VARIANTS

Judy Arrays (HP)
ART Index (HyPer)
Masstree (Silo)

22

slide-39
SLIDE 39 CMU 15-721 (Spring 2019)

JUDY ARRAYS

Variant of a 256-way radix tree. First known radix tree that supports adaptive node representation. Three array types:

→ Judy1: Bit array that maps integer keys to true/false.
→ JudyL: Map integer keys to integer values.
→ JudySL: Map variable-length keys to integer values.

Open-source implementation (LGPL). Patented by HP in 2000. Expires in 2022.

→ Not an issue according to the authors.
→ http://judy.sourceforge.net/

23

slide-40
SLIDE 40 CMU 15-721 (Spring 2019)

JUDY ARRAYS

Do not store meta-data about a node in its header.

→ This could lead to additional cache misses.

Pack meta-data about a node into 128-bit "Judy Pointers" stored in its parent node.

→ Node Type
→ Population Count
→ Child Key Prefix / Value (if only one child below)
→ 64-bit Child Pointer

24

A COMPARISON OF ADAPTIVE RADIX TREES AND HASH TABLES

ICDE 2015

slide-41
SLIDE 41 CMU 15-721 (Spring 2019)

JUDY ARRAYS: NODE TYPES

Every node can store up to 256 digits. Not all nodes will be 100% full though. Adapt a node's organization based on its keys.

→ Linear Node: Sparse Populations
→ Bitmap Node: Typical Populations
→ Uncompressed Node: Dense Populations

25

A COMPARISON OF ADAPTIVE RADIX TREES AND HASH TABLES

ICDE 2015

slide-42
SLIDE 42 CMU 15-721 (Spring 2019)

JUDY ARRAYS: LINEAR NODES

Store a sorted list of partial prefixes, up to two cache lines.

→ Original spec was one cache line.

Store a separate array of pointers to children, ordered according to the sorted prefixes.

26

(Diagram: a Linear Node with a Sorted Digits array and a parallel Child Pointers array.
Sorted Digits: 6 × 1 byte = 6 bytes; Child Pointers: 6 × 16 bytes = 96 bytes; 102 bytes total, padded to 128 bytes.)

slide-45
SLIDE 45 CMU 15-721 (Spring 2019)

JUDY ARRAYS: BITMAP NODES

256-bit map to mark whether a prefix is present in the node. The bitmap is divided into eight segments, each with a pointer to a sub-array of pointers to child nodes.

27

(Diagram: a Bitmap Node with eight Prefix Bitmaps covering digits 0-7, 8-15, …, 248-255; each segment's Sub-Array Pointer leads to a packed array of Child Pointers, one per set bit.)
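The bitmap-to-slot mapping described above can be sketched for a single segment. This is an illustrative model (invented class name, Python ints for bitmaps): a set bit marks a present digit, and the popcount of the lower bits gives the digit's slot in the packed child array.

```python
class BitmapSegment:
    """One bitmap segment: present-digit bits plus a packed child array."""
    def __init__(self):
        self.bitmap = 0
        self.children = []   # packed: one entry per set bit, in digit order

    def _slot(self, digit):
        # Popcount of the bits below `digit` = index into the packed array.
        return bin(self.bitmap & ((1 << digit) - 1)).count("1")

    def insert(self, digit, child):
        slot = self._slot(digit)
        self.bitmap |= 1 << digit
        self.children.insert(slot, child)

    def lookup(self, digit):
        if not (self.bitmap >> digit) & 1:
            return None
        return self.children[self._slot(digit)]

seg = BitmapSegment()
seg.insert(6, "child-6")
seg.insert(1, "child-1")
seg.insert(2, "child-2")
```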

slide-49
SLIDE 49 CMU 15-721 (Spring 2019)

ADAPTIVE RADIX TREE (ART)

Developed for the TUM HyPer DBMS in 2013. A 256-way radix tree that supports different node types based on its population.

→ Stores meta-data about each node in its header.

Concurrency support was added in 2015.

28

THE ADAPTIVE RADIX TREE: ARTFUL INDEXING FOR MAIN-MEMORY DATABASES

ICDE 2013

slide-50
SLIDE 50 CMU 15-721 (Spring 2019)

ART vs. J UDY

Difference #1: Node Types

→ Judy has three node types with different organizations. → ART has four nodes types that (mostly) vary in the maximum number of children.

Difference #2: Purpose

→ Judy is a general-purpose associative array. It "owns" the keys and values. → ART is a table index and does not need to cover the full

  • keys. Values are pointers to tuples.

29

slide-51
SLIDE 51 CMU 15-721 (Spring 2019)

ART: INNER NODE TYPES (1)

Store only the 8-bit digits that exist at a given node in a sorted array. The offset in the sorted digit array corresponds to the offset in the value array.

30

(Diagram: Node4 stores up to 4 sorted digits with 4 child pointers; Node16 stores up to 16 sorted digits with 16 child pointers.)

slide-53
SLIDE 53 CMU 15-721 (Spring 2019)

ART: INNER NODE TYPES (2)

Instead of storing 1-byte digits, maintain an array of 1-byte offsets into a child pointer array that is indexed on the digit bits.

31

(Diagram: Node48 indexes a 256-entry offset array by digit; each offset points into a 48-entry child pointer array.
Offsets: 256 × 1 byte = 256 bytes; Child Pointers: 48 × 8 bytes = 384 bytes; 640 bytes total.)

slide-56
SLIDE 56 CMU 15-721 (Spring 2019)

ART: INNER NODE TYPES (3)

Store an array of 256 pointers to child nodes. This covers all possible values in 8-bit digits. Same as Judy Array's Uncompressed Node.

32

Node256

(Diagram: Node256 stores a direct array of 256 child pointers, one per possible digit value: 256 × 8 bytes = 2048 bytes.)
slide-57
SLIDE 57 CMU 15-721 (Spring 2019)

ART: BINARY COMPARABLE KEYS

Not all attribute types can be decomposed into binary comparable digits for a radix tree.

→ Unsigned Integers: Byte order must be flipped for little-endian machines.
→ Signed Integers: Flip the two’s-complement sign bit so that negative numbers are smaller than positive.
→ Floats: Classify into group (neg vs. pos, normalized vs. denormalized), then store as unsigned integer.
→ Compound: Transform each attribute separately.

33
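The integer transformations are standard and easy to demonstrate with Python's `struct` module (the helper names are illustrative): big-endian packing makes unsigned integers compare correctly byte-by-byte, and flipping the sign bit does the same for signed integers.

```python
import struct

def encode_u32(x):
    # Big-endian byte order makes unsigned integers memcmp-sortable.
    return struct.pack(">I", x)

def encode_i32(x):
    # Flipping the two's-complement sign bit maps the signed range onto
    # the unsigned range, so negatives sort before positives.
    return struct.pack(">I", (x + (1 << 31)) % (1 << 32))

key = encode_u32(168496141)   # the slide's Int Key
```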

slide-58
SLIDE 58 CMU 15-721 (Spring 2019)

ART: BINARY COMPARABLE KEYS

34

Int Key: 168496141
Hex Key: 0A 0B 0C 0D (Big Endian) / 0D 0C 0B 0A (Little Endian)

(Diagram: an 8-bit span radix tree over big-endian keys. Finding Int Key 658205 = Hex 0A 0B 1D follows the digits 0A → 0B → 1D to a tuple pointer.)

slide-62
SLIDE 62 CMU 15-721 (Spring 2019)

CONCURRENT ART INDEX

HyPer’s ART is not latch-free.

→ The authors argue that it would be a significant amount of work to make it latch-free.

Approach #1: Optimistic Lock Coupling
Approach #2: Read-Optimized Write Exclusion

35

THE ART OF PRACTICAL SYNCHRONIZATION

DAMON 2016

slide-63
SLIDE 63 CMU 15-721 (Spring 2019)

OPTIMISTIC LATCH COUPLING

Optimistic crabbing scheme where writers are not blocked by readers. Every node now has a version number (counter).

→ Writers increment the counter when they acquire the latch.
→ Readers proceed if a node’s latch is available, but do not acquire it.
→ A reader then checks whether the latch’s counter has changed from when it first read it.

Relies on epoch GC to ensure pointers are valid.

36
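The read-side protocol can be sketched for a single node. This is a single-threaded illustration with invented names; a real implementation packs the lock bit and version into one atomic word and uses memory fences around the reads.

```python
class VersionedNode:
    """Node with a version counter, as in optimistic latch coupling."""
    def __init__(self, payload):
        self.payload = payload
        self.version = 0
        self.locked = False

    def write(self, payload):
        self.locked = True
        self.version += 1       # writer bumps the version on acquire
        self.payload = payload
        self.locked = False

def optimistic_read(node):
    while True:
        v = node.version        # 1. remember the version
        if node.locked:         #    (latch must be free to proceed)
            continue
        result = node.payload   # 2. read without holding the latch
        if node.version == v:   # 3. validate: unchanged -> read was consistent
            return result
        # version moved -> a writer interfered; restart

node = VersionedNode({"K10": "V10"})
seen = optimistic_read(node)
node.write({"K10": "V10", "K20": "V20"})
seen_after = optimistic_read(node)
```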

slide-64
SLIDE 64 CMU 15-721 (Spring 2019)

OPTIMISTIC LATCH COUPLING

37

(Diagram: searching for K44 in a tree with nodes A-G, each carrying a version number (v3, v5, v6, v9, …). The reader reads A’s version (v3), examines A, reads child B’s version (v5), rechecks that A is still v3, examines B, and continues down the tree (C: Read v9, B: Recheck v5, …). When a writer changes a node along the path, its version advances (e.g., B: v5 → v6), the recheck fails, and the reader restarts.)
slide-71
SLIDE 71 CMU 15-721 (Spring 2019)

READ-OPTIMIZED WRITE EXCLUSION

Each node includes an exclusive latch that blocks only other writers, not readers.

→ Readers proceed without checking versions or latches.
→ Every writer must ensure that reads are always consistent.

Requires fundamental changes to how threads make modifications to the data structure.

→ Creating new nodes means that we have to atomically update pointers from other nodes (see Bw-Tree).

38

slide-72
SLIDE 72 CMU 15-721 (Spring 2019)

MASSTREE

Instead of using different layouts for each trie node based on its size, use an entire B+Tree.

→ Each B+Tree represents an 8-byte span.
→ Optimized for long keys.
→ Uses something similar to OLC.

Part of the Harvard Silo project.

39

CACHE CRAFTINESS FOR FAST MULTICORE KEY-VALUE STORAGE

EUROSYS 2012

Masstree

(Diagram: the root B+Tree indexes key Bytes [0-7]; each of its entries points to a child B+Tree indexing Bytes [8-15], and so on for longer keys.)

slide-73
SLIDE 73 CMU 15-721 (Spring 2019)

IN-MEMORY INDEXES

40

(Chart: throughput in millions of operations/sec for Open Bw-Tree, Skip List, B+Tree, Masstree, and ART across Insert-Only, Read-Only, Read/Update, and Scan/Insert workloads.)

Processor: 1 socket, 10 cores w/ 2×HT Workload: 50m Random Integer Keys (64-bit)

Source: Ziqi Wang

slide-74
SLIDE 74 CMU 15-721 (Spring 2019)

IN-MEMORY INDEXES

41

(Chart: memory footprint in GB for Open Bw-Tree, Skip List, B+Tree, Masstree, and ART with monotonically increasing integer, random integer, and email keys.)

Processor: 1 socket, 10 cores w/ 2×HT Workload: 50m Keys

Source: Ziqi Wang

slide-75
SLIDE 75 CMU 15-721 (Spring 2019)

PARTING THOUGHTS

Andy was wrong about the Bw-Tree and latch-free indexes. Radix trees have interesting properties, but a well-written B+tree is still a solid design choice.

42

slide-76
SLIDE 76 CMU 15-721 (Spring 2019)

NEXT CLASS

System Catalogs
Data Layout
Storage Models

43