ADVANCED DATABASE SYSTEMS
Lecture #08: OLTP Indexes (Trie Data Structures)
@Andy_Pavlo // 15-721 // Spring 2019
Index Implementation Issues / Judy Array / ART / Masstree

INDEX IMPLEMENTATION ISSUES
Garbage Collection / Memory Pools / Non-Unique Keys / Variable-Length Keys / Compression
GARBAGE COLLECTION
We need to know when it is safe to reclaim memory for deleted nodes in a latch-free index.
→ Reference Counting
→ Epoch-based Reclamation
→ Hazard Pointers
→ Many others…
[Figure: a chain of nodes K2/V2 → K3/V3 → K4/V4; after K3/V3 is unlinked, the chain is K2/V2 → K4/V4, but a concurrent reader may still be accessing the removed node.]
REFERENCE COUNTING
Maintain a counter for each node to keep track of the number of threads that are accessing it.
→ Increment the counter before accessing.
→ Decrement it when finished.
→ A node is only safe to delete when the count is zero.
This has bad performance for multi-core CPUs.
→ Incrementing/decrementing counters causes a lot of cache coherence traffic.
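A minimal sketch of this scheme, assuming a hypothetical Node type with a "logically deleted" flag (and ignoring the race between that flag check and a concurrent Acquire). The shared atomic counter is exactly what generates the cache coherence traffic described above.

```cpp
#include <atomic>
#include <cstdint>

struct Node {
    std::atomic<uint32_t> ref_count{0};
    bool deleted{false};  // set when the node is logically unlinked
    // ... keys, values, child pointers ...
};

// Increment the counter before accessing the node.
void Acquire(Node *n) { n->ref_count.fetch_add(1); }

// Decrement when finished; the node is only reclaimable at count zero.
void Release(Node *n) {
    if (n->ref_count.fetch_sub(1) == 1 && n->deleted) {
        delete n;  // last reference gone and node already deleted
    }
}
```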
OBSERVATION
We don't care about the actual value of the reference counter; we only need to know when it reaches zero. And we don't have to perform garbage collection immediately when it does.
Source: Stephen Tu
EPOCH GARBAGE COLLECTION
Maintain a global epoch counter that is periodically updated (e.g., every 10 ms).
→ Keep track of what threads enter the index during an epoch and when they leave.
Mark the current epoch of a node when it is marked for deletion.
→ The node can be reclaimed once all threads have left that epoch (and all preceding epochs).
Also known as Read-Copy-Update (RCU) in Linux.
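A minimal sketch of epoch-based reclamation under these rules; the type and function names are illustrative, not from any particular implementation.

```cpp
#include <atomic>
#include <cstdint>
#include <vector>

std::atomic<uint64_t> global_epoch{0};  // bumped periodically (e.g., every 10 ms)

struct ThreadState {
    // ~0 means "not currently inside the index".
    std::atomic<uint64_t> local_epoch{~0ull};
};

void EnterIndex(ThreadState &ts) { ts.local_epoch.store(global_epoch.load()); }
void ExitIndex(ThreadState &ts)  { ts.local_epoch.store(~0ull); }

// A node marked for deletion in `delete_epoch` can be reclaimed once every
// thread is either outside the index or in a strictly later epoch.
bool SafeToReclaim(uint64_t delete_epoch,
                   const std::vector<ThreadState*> &threads) {
    for (const ThreadState *t : threads) {
        uint64_t e = t->local_epoch.load();
        if (e != ~0ull && e <= delete_epoch) return false;
    }
    return true;
}
```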
MEMORY POOLS
We don't want to be calling malloc and free every time we need to add or delete a node. If all the nodes are the same size, then the index can maintain a pool of available nodes (see the sketch below).
→ Insert: Grab a free node, otherwise create a new one.
→ Delete: Add the node back to the free pool.
Need some policy to decide when to shrink the pool.
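A minimal sketch of such a pool, assuming single-threaded access and one node size; a production pool would be latch-protected (or per-thread) and would implement the shrink policy.

```cpp
#include <vector>

template <typename Node>
class NodePool {
    std::vector<Node*> free_list_;
public:
    // Insert path: grab a free node, otherwise create a new one.
    Node *Allocate() {
        if (!free_list_.empty()) {
            Node *n = free_list_.back();
            free_list_.pop_back();
            return n;
        }
        return new Node();
    }
    // Delete path: return the node to the free pool for reuse.
    void Free(Node *n) { free_list_.push_back(n); }

    ~NodePool() { for (Node *n : free_list_) delete n; }
};
```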
NON-UNIQUE INDEXES
Approach #1: Duplicate Keys
→ Use the same node layout but store duplicate keys multiple times.
Approach #2: Value Lists
→ Store each key only once and maintain a linked list of unique values.
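A minimal sketch of the value-list approach (#2), using std::map as a stand-in for the index's actual node structure; the names are illustrative.

```cpp
#include <cstdint>
#include <map>
#include <vector>

using RecordId = uint64_t;

// Each key is stored once; its entry chains every matching record id.
std::map<int64_t, std::vector<RecordId>> index;

void Insert(int64_t key, RecordId rid) { index[key].push_back(rid); }
```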
MODERN B-TREE TECHNIQUES
NOW PUBLISHERS 2010
NON-UNIQUE: DUPLICATE KEYS
[Figure: B+Tree leaf node with a header (Prev/Next pointers, Level, #Slots), sorted keys K1 K1 K1 K2 K2, and a parallel array of value pointers.]
NON-UNIQUE: VALUE LISTS
[Figure: B+Tree leaf node with the same header, sorted keys K1 K2 K3 K4 K5 stored once each, and value pointers leading to per-key value lists.]
VARIABLE-LENGTH KEYS
Approach #1: Pointers
→ Store the keys as pointers to the tuple's attribute.
Approach #2: Variable-Length Nodes
→ The size of each node in the index can vary.
→ Requires careful memory management.
Approach #3: Padding
→ Always pad the key to be the maximum length of the key type.
Approach #4: Key Map / Indirection
→ Embed an array of pointers that map to the key + value list within the node.
KEY MAP / INDIRECTION
[Figure: B+Tree leaf node where a sorted key map of pointers indirects into unsorted key+value storage (Andy V1, Lin V2, Obama V3, Prashanth V4). Each map entry can also embed the first letter of its key next to the pointer (A·¤ L·¤ O·¤ P·¤) to avoid chasing the pointer during the binary search.]
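A minimal sketch of this layout, assuming keys and values live in a byte heap inside the node; the struct and field names are illustrative.

```cpp
#include <cstdint>
#include <vector>

struct KeyMapEntry {
    char prefix;      // first byte of the key, checked before the full compare
    uint16_t offset;  // where the key+value record starts in the node's heap
};

struct LeafNode {
    std::vector<char> heap;            // unsorted variable-length key+value data
    std::vector<KeyMapEntry> key_map;  // kept sorted by full key for binary search
};
```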
PREFIX COMPRESSION
Sorted keys in the same leaf node are likely to have the same prefix. Instead of storing the entire key each time, extract the common prefix and store only the unique suffix for each key.
Example: robbed, robbing, robot → Prefix: rob / Suffixes: bed, bing, ot
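A minimal sketch of computing the shared prefix; because the keys in the node are sorted, comparing only the first and last key is sufficient.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// For sorted keys, the prefix shared by all of them equals the prefix
// shared by the first and last key alone.
std::string CommonPrefix(const std::vector<std::string> &sorted_keys) {
    if (sorted_keys.empty()) return "";
    const std::string &first = sorted_keys.front();
    const std::string &last = sorted_keys.back();
    std::size_t i = 0;
    while (i < first.size() && i < last.size() && first[i] == last[i]) i++;
    return first.substr(0, i);
}
```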
SUFFIX TRUNCATION
The keys in the inner nodes are only used to "direct traffic", so we don't need the entire key. Store the minimum prefix that is needed to correctly route probes into the index.
[Figure: inner-node separator keys abcdefghijk / lmnopqrstuv truncated to abc / lmn.]
OBSERVATION
The inner node keys in a B+tree cannot tell you whether a key exists in the index. You always have to traverse to the leaf node. This means that you could have (at least) one cache miss per level in the tree.
TRIE INDEX
Use a digital representation of keys to examine prefixes one-by-one instead of comparing the entire key.
→ Also known as Digital Search Tree, Prefix Tree.
Keys: HELLO, HAT, HAVE
[Figure: trie sharing the root digit H, branching to A (then T, or V→E) and E (then L→L→O), with tuple pointers (¤) at the end of each key's path.]
TRIE INDEX PROPERTIES
Shape only depends on the key space and key lengths.
→ Does not depend on existing keys or insertion order.
→ Does not require rebalancing operations.
All operations have O(k) complexity, where k is the length of the key.
→ The path to a leaf node represents the key of the leaf.
→ Keys are stored implicitly and can be reconstructed from paths.
TRIE KEY SPAN
The span of a trie level is the number of bits that each partial key / digit represents (see the sketch below).
→ If the digit exists in the corpus, then store a pointer to the next level in the trie branch. Otherwise, store null.
This determines the fan-out of each node and the physical height of the tree.
→ n-way Trie = Fan-Out of n
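A minimal sketch of extracting span-sized digits from a fixed-length key, assuming a 64-bit big-endian (binary comparable) representation; the helper name is illustrative.

```cpp
#include <cstdint>

// Returns the i-th digit (MSB-first) of a 64-bit key for a trie whose
// span is `bits` wide; assumes (i + 1) * bits <= 64.
uint64_t GetDigit(uint64_t key, unsigned i, unsigned bits) {
    unsigned shift = 64 - (i + 1) * bits;
    return (key >> shift) & ((1ull << bits) - 1);
}
// e.g., bits = 8 gives a 256-way trie: GetDigit(key, 0, 8) is the first byte.
```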
TRIE KEY SPAN
Keys: K10, K25, K31
K10 → 00000000 00001010
K25 → 00000000 00011001
K31 → 00000000 00011111
[Figure: 1-bit span trie over the three keys. The leading shared 0-bits form a long chain of single-branch nodes ("Repeat 10x") before the paths diverge on the low-order bits; leaf slots hold tuple pointers, inner slots hold node pointers or null (Ø).]
RADIX TREE
Omit all nodes with only a single child.
→ Also known as Patricia Tree.
[Figure: the same keys as a 1-bit span radix tree; the "Repeat 10x" chain of single-child nodes collapses away, leaving only the levels where the keys diverge.]
TRIE VARIANTS
Judy Arrays (HP)
ART Index (HyPer)
Masstree (Silo)
JUDY ARRAYS
Variant of a 256-way radix tree. First known radix tree that supports adaptive node representation.
Three array types:
→ Judy1: Bit array that maps integer keys to true/false.
→ JudyL: Map integer keys to integer values.
→ JudySL: Map variable-length keys to integer values.
Open-Source Implementation (LGPL). Patented by HP in 2000. Expires in 2022.
→ Not an issue according to authors.
→ http://judy.sourceforge.net/
JUDY ARRAYS
Do not store meta-data about a node in its header.
→ This could lead to additional cache misses.
Pack meta-data about a node into 128-bit "Judy Pointers" stored in its parent node:
→ Node Type
→ Population Count
→ Child Key Prefix / Value (if only one child below)
→ 64-bit Child Pointer
A COMPARISON OF ADAPTIVE RADIX TREES AND HASH TABLES
ICDE 2015
JUDY ARRAYS: NODE TYPES
Every node can store up to 256 digits. Not all nodes will be 100% full though. Adapt the node's organization based on its keys:
→ Linear Node: Sparse Populations
→ Bitmap Node: Typical Populations
→ Uncompressed Node: Dense Populations
JUDY ARRAYS: LINEAR NODES
Store a sorted list of partial prefixes, up to two cache lines.
→ Original spec was one cache line.
Store a separate array of pointers to children, ordered according to the sorted prefixes.
[Figure: linear node with a sorted digit array and a parallel child-pointer array. Sizes: 6 × 1-byte digits = 6 bytes, 6 × 16-byte Judy Pointers = 96 bytes, 102 bytes total, padded to 128 bytes.]
JUDY ARRAYS: BITMAP NODES
256-bit map to mark whether a prefix is present in the node. The bitmap is divided into eight segments, each with a pointer to a sub-array with pointers to child nodes.
[Figure: bitmap node whose prefix bitmap segments (labeled with digit ranges 0-7, 8-15, …, 248-255) each carry a pointer to a sub-array of child pointers for the digits whose bits are set.]
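A minimal sketch of the general bitmap-node technique, assuming one flat 256-bit bitmap with a dense child-pointer array (rather than Judy's segmented layout): a child's slot is the count of set bits before its digit. Uses the GCC/Clang __builtin_popcountll builtin.

```cpp
#include <cstdint>

struct BitmapNode {
    uint64_t bitmap[4];   // 256 bits, one per possible digit value
    void *children[256];  // only the first popcount(bitmap) slots are used
};

void *Lookup(const BitmapNode &n, uint8_t digit) {
    uint64_t word = n.bitmap[digit >> 6];
    uint64_t bit = 1ull << (digit & 63);
    if (!(word & bit)) return nullptr;                // digit not present
    int slot = 0;
    for (int i = 0; i < (digit >> 6); i++)
        slot += __builtin_popcountll(n.bitmap[i]);    // set bits in earlier words
    slot += __builtin_popcountll(word & (bit - 1));   // set bits before the digit
    return n.children[slot];
}
```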
ADAPTIVE RADIX TREE (ART)
Developed for the TUM HyPer DBMS in 2013. 256-way radix tree that supports different node types based on its population.
→ Stores meta-data about each node in its header.
Concurrency support was added in 2015.
THE ADAPTIVE RADIX TREE: ARTFUL INDEXING FOR MAIN-MEMORY DATABASES
ICDE 2013
ART vs. JUDY
Difference #1: Node Types
→ Judy has three node types with different organizations.
→ ART has four node types that (mostly) vary in the maximum number of children.
Difference #2: Purpose
→ Judy is a general-purpose associative array. It "owns" the keys and values.
→ ART is a table index and does not need to cover the full keys; values are pointers to tuples.
ART: INNER NODE TYPES (1)
Store only the 8-bit digits that exist at a given node in a sorted array. The offset in the sorted digit array corresponds to the offset in the child pointer array.
[Figure: Node4 with up to four sorted digits and a parallel array of four child pointers; Node16 with up to sixteen sorted digits and sixteen child pointers.]
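A minimal sketch of the lookup on these nodes; real ART searches Node16 with SIMD comparisons, so the scalar loop below is only illustrative.

```cpp
#include <cstdint>

template <int N>  // N = 4 for Node4, N = 16 for Node16
struct SortedNode {
    uint8_t count = 0;    // number of digits currently stored
    uint8_t digits[N];    // sorted; digits[i] pairs with children[i]
    void *children[N];
};

template <int N>
void *Lookup(const SortedNode<N> &node, uint8_t digit) {
    for (int i = 0; i < node.count && node.digits[i] <= digit; i++) {
        if (node.digits[i] == digit) return node.children[i];
    }
    return nullptr;
}
```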
ART: INNER NODE TYPES (2)
Instead of storing 1-byte digits, maintain an array of 1-byte offsets into the child pointer array that is indexed on the digit itself.
[Figure: Node48 with a 256-entry offset array (K0…K255, unused entries null) pointing into 48 child pointer slots. Sizes: 256 × 1 byte = 256 bytes, 48 × 8 bytes = 384 bytes, 640 bytes total.]
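A minimal sketch of a Node48-style lookup, assuming 0xFF marks an empty slot in the offset array (the sentinel choice is an assumption of this sketch).

```cpp
#include <cstdint>

struct Node48 {
    uint8_t child_index[256];  // digit -> offset into children; 0xFF = empty
    void *children[48];
    Node48() { for (uint8_t &c : child_index) c = 0xFF; }
};

void *Lookup(const Node48 &node, uint8_t digit) {
    uint8_t slot = node.child_index[digit];
    return slot == 0xFF ? nullptr : node.children[slot];
}
```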
ART: INNER NODE TYPES (3)
Store an array of 256 pointers to child nodes. This covers all possible values of an 8-bit digit. Same as the Judy Array's Uncompressed Node.
[Figure: Node256 with one pointer slot per digit (K0…K255, unused slots null). Size: 256 × 8 bytes = 2048 bytes.]
ART: BINARY COMPARABLE KEYS
Not all attribute types can be decomposed into binary comparable digits for a radix tree (a sketch of the integer cases follows below).
→ Unsigned Integers: Byte order must be flipped for little-endian machines.
→ Signed Integers: Flip two's-complement so that negative numbers are smaller than positive.
→ Floats: Classify into group (neg vs. pos, normalized vs. denormalized), then store as unsigned integer.
→ Compound: Transform each attribute separately.
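A minimal sketch of the integer transformations, assuming a little-endian machine; uses the GCC/Clang __builtin_bswap64 builtin.

```cpp
#include <cstdint>

// Signed 64-bit integer -> binary comparable form: flip the sign bit so
// negatives sort below positives, then byte-swap to big endian so a
// digit-by-digit comparison matches numeric order on little-endian CPUs.
uint64_t ToBinaryComparable(int64_t key) {
    uint64_t u = static_cast<uint64_t>(key) ^ (1ull << 63);
    return __builtin_bswap64(u);  // GCC/Clang builtin
}
```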
ART: BINARY COMPARABLE KEYS
Int Key: 168496141 → Hex Key: 0A 0B 0C 0D
Big Endian: 0A 0B 0C 0D / Little Endian: 0D 0C 0B 0A
Probe: Find 658205 (Hex: 0A 0B 1D).
[Figure: 8-bit span radix tree storing the big-endian digits; the root branches on 0A (and 0F), the next level on 0B, then on 0C vs. 1D, so the probe for 0A 0B 1D is routed one byte at a time to its tuple pointer.]
CONCURRENT ART INDEX
HyPer's ART is not latch-free.
→ The authors argue that it would be a significant amount of effort to make it latch-free.
Approach #1: Optimistic Lock Coupling
Approach #2: Read-Optimized Write Exclusion
THE ART OF PRACTICAL SYNCHRONIZATION
DAMON 2016
OPTIMISTIC LATCH COUPLING
Optimistic crabbing scheme where writers are not blocked by readers. Every node now has a version number (counter); see the sketch below.
→ Writers increment the counter when they acquire the latch.
→ Readers proceed if a node's latch is available but do not acquire it.
→ A reader then checks whether the counter has changed from when it first checked the latch.
Relies on epoch GC to ensure pointers are valid.
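A minimal sketch of the reader-side protocol on one node, assuming a version word whose low bit marks "locked"; the DAMON 2016 paper packs lock and version into one word like this, but the names here are illustrative.

```cpp
#include <atomic>
#include <cstdint>

struct Node {
    std::atomic<uint64_t> version{0};  // low bit = "locked", rest = counter
};

// Reader: wait until the node is unlocked, remember the observed version.
uint64_t ReadLockOrRestart(const Node &n) {
    uint64_t v;
    do { v = n.version.load(); } while (v & 1);
    return v;
}

// Reader: after examining the node, verify no writer intervened;
// a false return means the traversal must restart.
bool ReadUnlockOrRestart(const Node &n, uint64_t v) {
    return n.version.load() == v;
}
```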
OPTIMISTIC LATCH COUPLING
Search for K44:
→ A: Read v3 → A: Examine Node
→ B: Read v5 → A: Recheck v3 → B: Examine Node
→ C: Read v9 → B: Recheck v5 → C: Examine Node
If a recheck observes a changed version (e.g., a writer bumps a node from v5 to v6 mid-traversal), the reader restarts.
[Figure: tree of nodes A-G with per-node version counters (v3, v5, v6, v9, v4, v4, v5) over separator keys K10, K20, K35 and leaf entries K6, K12, K23, K38, K44.]
READ-OPTIMIZED WRITE EXCLUSION
Each node includes an exclusive latch that blocks only other writers, not readers.
→ Readers proceed without checking versions or latches.
→ Every writer must ensure that reads are always consistent.
Requires fundamental changes to how threads make modifications to the data structure.
→ Creating new nodes means that we have to atomically update pointers from other nodes (see Bw-Tree).
MASSTREE
Instead of using different layouts for each trie node based on its size, use an entire B+Tree.
→ Each B+tree represents an 8-byte span of the key.
→ Optimized for long keys.
→ Uses something similar to OLC.
Part of the Harvard Silo project.
CACHE CRAFTINESS FOR FAST MULTICORE KEY-VALUE STORAGE
EUROSYS 2012
[Figure: Masstree as a trie of B+trees; the top tree indexes key bytes [0-7], and each of its values points to a child tree indexing bytes [8-15], and so on.]
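A minimal sketch of slicing a variable-length key into the 8-byte spans that select successive B+tree layers in a Masstree-style design; the helper name is illustrative.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <string>

// Returns the i-th 8-byte slice of `key`, zero-padded, packed big-endian
// so that byte-wise order matches integer order within a layer.
uint64_t KeySlice(const std::string &key, std::size_t layer) {
    uint8_t buf[8] = {0};
    std::size_t off = layer * 8;
    if (off < key.size())
        std::memcpy(buf, key.data() + off,
                    std::min<std::size_t>(8, key.size() - off));
    uint64_t v = 0;
    for (int i = 0; i < 8; i++) v = (v << 8) | buf[i];
    return v;
}
```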
IN-MEMORY INDEXES
[Figure: throughput in operations/sec (M) for Open Bw-Tree, Skip List, B+Tree, Masstree, and ART across Insert-Only, Read-Only, Read/Update, and Scan/Insert workloads.]
Processor: 1 socket, 10 cores w/ 2×HT
Workload: 50m Random Integer Keys (64-bit)
Source: Ziqi Wang
IN-MEMORY INDEXES
[Figure: memory footprint in GB for Open Bw-Tree, Skip List, B+Tree, Masstree, and ART on Mono Int, Rand Int, and Email keys.]
Processor: 1 socket, 10 cores w/ 2×HT
Workload: 50m Keys
Source: Ziqi Wang
PARTING THOUGHTS
Andy was wrong about the Bw-Tree and latch-free indexes. Radix trees have interesting properties, but a well-written B+tree is still a solid design choice.
NEXT CLASS
System Catalogs
Data Layout
Storage Models