Main Memory Adaptive Indexing for Multi-core Systems
Felix Martin Schuhknecht, Victor Alvarez, Jens Dittrich, Stefan Richter
Information Systems Group, Saarland University
https://infosys.uni-saarland.de/
SIGMOD DaMoN, 23.06.2014
Problem: Answer Range Queries
select A from R where R.A >= 10 and R.A < 20
One extreme: Scan + Filter
R.A (unsorted): 43 9 13 22 19 15 7 99 48 17 34
Test every entry: >= 10 && < 20 ?
Other extreme: Index
R.A (sorted): 7 9 13 15 17 19 22 34 43 48 99
Regions: <10 | >=10 and <20 | >=20
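The scan + filter extreme can be sketched in a few lines of C++. This is only an illustration of the idea on the slide; the function name `scanFilter` is an assumption and not part of the original talk.

```cpp
#include <cassert>
#include <vector>

// Scan + Filter: check every entry of the unsorted column once and keep
// the values v with lo <= v < hi. No index is built or maintained.
std::vector<int> scanFilter(const std::vector<int>& column, int lo, int hi) {
    assert(lo <= hi);
    std::vector<int> result;
    for (int v : column) {
        if (v >= lo && v < hi) {
            result.push_back(v);
        }
    }
    return result;
}
```

Called as `scanFilter(column, 10, 20)` on the unsorted R.A above, it touches all 11 entries to find the qualifying ones, which is the cost that an index tries to avoid.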
2 / 30
Index: When to build?
One extreme: At once (Traditional Indexing)
[Figure: Pressure on the system over Time; markers: build index, build finished]
Other extreme: Incrementally at query time (Adaptive Indexing)
[Figure: marker: initialize index]
3 / 30
Traditional Indexing: Sort + Binary Search
Column A: 13 16 4 9 2 12 7 1 19 3 14 11 8 6
Sort
Index Column (A): 1 2 3 4 6 7 8 9 11 12 13 14 16 19
select A from R where R.A > 10 and R.A < 14
Binary Search on Index Column (A)
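A minimal sketch of sort + binary search, assuming a plain `std::vector<int>` as the index column. The names `buildSortedIndex` and `rangeQuery` are hypothetical; the slide only names the two steps.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Builds the index at once: a fully sorted copy of the column.
std::vector<int> buildSortedIndex(std::vector<int> column) {
    std::sort(column.begin(), column.end());
    return column;
}

// Answers lo < A < hi (both bounds exclusive, as in the slide's query) with
// two binary searches. Returns the half-open position range [first, last)
// of the qualifying keys inside the sorted index.
std::pair<std::size_t, std::size_t> rangeQuery(const std::vector<int>& index,
                                               int lo, int hi) {
    assert(lo < hi);
    auto first = std::upper_bound(index.begin(), index.end(), lo);  // first key > lo
    auto last = std::lower_bound(first, index.end(), hi);           // first key >= hi
    return {static_cast<std::size_t>(first - index.begin()),
            static_cast<std::size_t>(last - index.begin())};
}
```

The whole sorting cost is paid before the first query can be answered; afterwards every query needs only two logarithmic searches and returns a consecutive result.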
4 / 30
Adaptive Indexing: Standard Cracking
[Database Cracking. S. Idreos, M. Kersten, S. Manegold. In CIDR 2007.]
Column A: 13 16 4 9 2 12 7 1 19 3 14 11 8 6
Query 1: select A from R where R.A > 10 and R.A < 14
Cracked Column (A) after query 1: 4 9 2 7 1 3 8 6 | 13 12 11 | 16 19 14
Pieces: A <= 10 | 10 < A < 14 | A >= 14
Query 2: select A from R where R.A > 7 and R.A <= 16
Cracked Column (A) after query 2: 4 2 1 3 6 7 | 9 8 | 13 12 11 | 14 16 | 19
Pieces: A <= 7 | 7 < A <= 10 | 10 < A < 14 | 14 <= A <= 16 | 16 < A
Cracker Index (AVL): stores the piece boundaries
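The cracking steps above can be sketched as follows. This is a simplified illustration, not the authors' implementation: it uses `std::partition` for the crack and `std::map` (a red-black tree) as a stand-in for the AVL-based cracker index, and it only supports the half-open form lo <= A < hi. For integer keys, the slide's first query (A > 10 and A < 14) corresponds to `query(11, 14)` and the second (A > 7 and A <= 16) to `query(8, 17)`. The type and member names are assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <map>
#include <utility>
#include <vector>

struct CrackerColumn {
    std::vector<int> data;              // the cracker column, reorganized in place
    std::map<int, std::size_t> index;   // crack key k -> first position with value >= k

    // Ensures that a crack at key k exists and returns its position.
    std::size_t crack(int k) {
        auto it = index.lower_bound(k);
        if (it != index.end() && it->first == k) return it->second;  // already cracked
        // The piece that contains k lies between the neighbouring cracks.
        const std::size_t pieceEnd = (it == index.end()) ? data.size() : it->second;
        const std::size_t pieceBegin = (it == index.begin()) ? 0 : std::prev(it)->second;
        const auto first = data.begin() + static_cast<std::ptrdiff_t>(pieceBegin);
        const auto last = data.begin() + static_cast<std::ptrdiff_t>(pieceEnd);
        // Partition only this piece: values < k move in front of values >= k.
        const auto mid = std::partition(first, last, [k](int v) { return v < k; });
        const auto pos = static_cast<std::size_t>(mid - data.begin());
        index.emplace(k, pos);
        return pos;
    }

    // Answers lo <= A < hi; the result is the position range [first, last).
    std::pair<std::size_t, std::size_t> query(int lo, int hi) {
        assert(lo <= hi);
        const std::size_t first = crack(lo);
        const std::size_t last = crack(hi);
        return {first, last};
    }
};
```

Each query partitions at most two pieces, and the pieces shrink as more cracks accumulate, so the index is refined as a side effect of query processing.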
5 / 30
Motivation
Single-threaded adaptive algorithms: Standard Cracking, Hybrid Cracking, Stochastic Cracking, Coarse-granular Index, Sideways Cracking
Single-threaded sorting algorithms: Quicksort, Radixsort, Mergesort
Multi-threaded adaptive algorithms: Parallel Standard Cracking
?
6 / 30
Setup
100 million entries
Entry: Key + RowID (4 Byte + 4 Byte)
Total size: ~762 MB
Uniform Random Key Distribution
Query selectivity: 1%
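The ~762 MB figure follows from the entry layout if MB is read as 2^20 bytes, which is an assumption here: 100,000,000 entries * 8 bytes = 800,000,000 bytes, or about 762.9 * 2^20 bytes. A small sketch that checks this arithmetic at compile time (the struct and function names are hypothetical):

```cpp
#include <cstdint>

// One index entry as described on the slide: 4-byte key + 4-byte row ID.
struct Entry {
    std::uint32_t key;
    std::uint32_t rowId;
};
static_assert(sizeof(Entry) == 8, "an entry is expected to occupy 8 bytes");

// Size of an index column with the given number of entries, in bytes.
constexpr std::uint64_t indexBytes(std::uint64_t entries) {
    return entries * sizeof(Entry);
}

// Converts bytes to units of 2^20 bytes.
constexpr double toMiB(std::uint64_t bytes) {
    return static_cast<double>(bytes) / (1024.0 * 1024.0);
}
```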
7 / 30
Setup
2 x Xeon E5-2407, each with 4 cores at 2.2 GHz, 10MB L3, and 24 GB RAM
No Turbo, no HyperThreading
8 / 30
Single-threaded algorithms
[Figure: Accumulated Query Response Time [s] vs. Query Sequence (1 to 10000); 1 Thread, 100 Million Elements. Series: Standard Cracking (SC), Hybrid Crack Sort (HCS), Coarse-granular Index (CGI), Radix Sort (RS), STL std::sort (STL-S). Annotations: 750 Queries, > 10000 Queries]
[The Uncracked Pieces in Database Cracking. F. M. Schuhknecht, A. Jindal, J. Dittrich. In PVLDB 2013]
9 / 30
Multi-threaded environments? Multi-threaded algorithms!
10 / 30
Multi-threaded algorithms: Parallel Standard Cracking (P-SC)
[Concurrency control for adaptive indexing. G. Graefe, F. Halim, S. Idreos, H. Kuno, S. Manegold. In PVLDB 2013]
[Figure: queries Q1 and Q2, executed by threads T1 and T2, request write (W) and read (R) locks on pieces of the same cracked column. Requested Locks: W R R W R R R R W R W. Some requests are granted (✓), others conflict (⚡).]
Inter-query parallelism
Lock contention
Underutilize resources (T3, T4, T5, ...)
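The "Requested Locks" row can be illustrated with a small sketch. It assumes, as a simplification of the cited concurrency-control scheme, that a query needs a write lock on a piece it has to crack (a border piece whose boundary does not yet match the query bound) and a read lock on every piece that lies completely inside the query range. The function `requestedLocks` and the struct `LockRequest` are hypothetical names; the sketch only computes which locks would be requested and does not acquire any.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct LockRequest {
    std::size_t firstPiece;  // index of the first piece the query touches
    std::string modes;       // one character per touched piece: 'W' or 'R'
};

// `cracks` holds the sorted crack keys. Piece i covers the keys in
// [cracks[i-1], cracks[i]); the first and the last piece are open-ended.
// The query asks for lo <= A < hi.
LockRequest requestedLocks(const std::vector<int>& cracks, int lo, int hi) {
    assert(lo < hi);
    assert(std::is_sorted(cracks.begin(), cracks.end()));
    auto pieceOf = [&cracks](int key) {
        return static_cast<std::size_t>(
            std::upper_bound(cracks.begin(), cracks.end(), key) - cracks.begin());
    };
    const bool loIsCrack = std::binary_search(cracks.begin(), cracks.end(), lo);
    const bool hiIsCrack = std::binary_search(cracks.begin(), cracks.end(), hi);
    const std::size_t first = pieceOf(lo);
    const std::size_t last = hiIsCrack ? pieceOf(hi) - 1 : pieceOf(hi);
    std::string modes(last - first + 1, 'R');  // inner pieces are only read
    if (!loIsCrack) modes.front() = 'W';       // lower border piece must be cracked
    if (!hiIsCrack) modes.back() = 'W';        // upper border piece must be cracked
    return {first, modes};
}
```

With cracks at 10, 20, 30, and 40, the query 15 <= A < 35 requests W R W on pieces 1 to 3. Two concurrent queries conflict as soon as one of them needs W on a piece the other one touches, which is the source of the lock contention named on the slide.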
11 / 30
Multi-threaded algorithms: Parallel-chunked Standard Cracking (P-CSC)
[Figure: the column is split into k Chunks, each with its own Cracker Index. Threads T1, T2, T3, ..., Tk each answer the Query on one chunk and produce a Local Result.]
12 / 30
Multi-threaded algorithms: Parallel-chunked Standard Cracking (P-CSC)
[Figure: the Query is sent to k Chunks; threads T1, T2, T3, ..., Tk each work on one chunk with its own Cracker Index.]
Complete independence
Fully utilize resources
No consecutive result
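A minimal sketch of the chunked scheme, under simplifying assumptions: every thread cracks only its own chunk with two `std::partition` calls and reports a local result, and the per-chunk cracker index is omitted for brevity. The name `chunkedCrackQuery` is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <utility>
#include <vector>

// Local result of one chunk: the position range [first, second) in the
// column that holds the qualifying keys of that chunk after cracking.
using LocalResult = std::pair<std::size_t, std::size_t>;

// Answers lo <= A < hi with k threads, one per chunk. The threads share no
// data: each one reorganizes only its own chunk, so no locks are needed.
std::vector<LocalResult> chunkedCrackQuery(std::vector<int>& column,
                                           std::size_t k, int lo, int hi) {
    assert(k > 0 && lo <= hi);
    const std::size_t n = column.size();
    std::vector<LocalResult> results(k);
    std::vector<std::thread> workers;
    for (std::size_t c = 0; c < k; ++c) {
        const std::size_t begin = c * n / k;
        const std::size_t end = (c + 1) * n / k;
        workers.emplace_back([&column, &results, c, begin, end, lo, hi] {
            const auto first = column.begin() + static_cast<std::ptrdiff_t>(begin);
            const auto last = column.begin() + static_cast<std::ptrdiff_t>(end);
            // Crack the chunk at lo, then crack the upper part at hi.
            const auto mid1 = std::partition(first, last, [lo](int v) { return v < lo; });
            const auto mid2 = std::partition(mid1, last, [hi](int v) { return v < hi; });
            results[c] = LocalResult(static_cast<std::size_t>(mid1 - column.begin()),
                                     static_cast<std::size_t>(mid2 - column.begin()));
        });
    }
    for (auto& w : workers) w.join();
    return results;
}
```

The returned vector makes the "No consecutive result" point visible: the answer is a set of k separate position ranges instead of one contiguous range.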
13 / 30
Micro Benchmark: Reading 1% from k locations using one thread
[Figure: Time [s] vs. Number of Chunks (k), for k from 1 to 1000000]
No problem for realistic k
14 / 30
Multi-threaded algorithms: Parallel Coarse-Granular Index (P-CGI)
- 1. Range-partition while copying: A ➞ Index(A), 1024 partitions (How to do?)
- 2. Perform P-SC: the Query runs on Index(A) with its Cracker Index and locks (W R R W)
Like starting ... after 1000 cracks
Reduces lock contention of P-SC
Reduces Variance
Adds (small) initialization time
15 / 30
Multi-threaded algorithms: Parallel Coarse-Granular Index (P-CGI): Parallel Range Partitioning
[Figure: Source array A is split into slices of n/k entries (# Elements / # Threads (k)), one for each of Thread 1, Thread 2, ..., Thread k. In Destination array B, the threads write into separate regions t1, t2, ..., tk.]
Range-partition:
- 1. Build Histogram
- 2. Copy entries
No locks required
Fully utilize resources
NUMA-fragmented memory
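The two phases can be sketched as below. This is an illustration under stated assumptions, not the authors' code: partitions are equal-width ranges of the 32-bit key domain, only keys are moved (no row IDs), and the names `partitionOf` and `parallelRangePartition` are hypothetical. Each thread first counts its keys per partition, a prefix sum then assigns every thread a private write region inside every partition, and finally each thread copies its entries without any locking.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Maps a 32-bit key to one of `parts` equal-width key ranges.
inline std::size_t partitionOf(std::uint32_t key, std::size_t parts) {
    return static_cast<std::size_t>((static_cast<std::uint64_t>(key) * parts) >> 32);
}

std::vector<std::uint32_t> parallelRangePartition(const std::vector<std::uint32_t>& src,
                                                  std::size_t parts, std::size_t k) {
    assert(parts > 0 && k > 0);
    const std::size_t n = src.size();
    // offsets[t][p]: first a count, later the next write position of
    // thread t inside partition p.
    std::vector<std::vector<std::size_t>> offsets(k, std::vector<std::size_t>(parts, 0));
    std::vector<std::uint32_t> dst(n);

    // Runs work(t, begin, end) on k threads, one slice of the source each.
    auto runOnSlices = [&](auto work) {
        std::vector<std::thread> workers;
        for (std::size_t t = 0; t < k; ++t)
            workers.emplace_back(work, t, t * n / k, (t + 1) * n / k);
        for (auto& w : workers) w.join();
    };

    // 1. Build Histogram: every thread counts the keys of its own slice.
    runOnSlices([&](std::size_t t, std::size_t begin, std::size_t end) {
        for (std::size_t i = begin; i < end; ++i) ++offsets[t][partitionOf(src[i], parts)];
    });

    // Prefix sum in partition-major order turns counts into write positions.
    std::size_t pos = 0;
    for (std::size_t p = 0; p < parts; ++p) {
        for (std::size_t t = 0; t < k; ++t) {
            const std::size_t count = offsets[t][p];
            offsets[t][p] = pos;
            pos += count;
        }
    }

    // 2. Copy entries: the write regions are disjoint, so no locks are needed.
    runOnSlices([&](std::size_t t, std::size_t begin, std::size_t end) {
        for (std::size_t i = begin; i < end; ++i)
            dst[offsets[t][partitionOf(src[i], parts)]++] = src[i];
    });
    return dst;
}
```

Because every partition ends up as a sequence of per-thread regions, the memory of one partition may be written by threads on different sockets, which matches the "NUMA-fragmented memory" remark.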
16 / 30
Multi-threaded algorithms: Parallel-chunked Coarse-Granular Index (P-CCGI)
P-CSC + Range Partitioning
[Figure: k Chunks, each range-partitioned and equipped with its own Cracker Index. Threads T1, T2, T3, ..., Tk each answer the Query on one chunk and produce a Local Result.]
17 / 30
Multi-threaded algorithms: Parallel Range-Partitioned Radix Sort (P-RPRS)
- 1. Range-partition while copying: A ➞ Index(A), 1024 partitions
- 2. Perform in-place radix sort on each partition: Index(A) ➞ Fully sorted
Radix sort: 256 buckets, most significant byte first ➞ 4 recursion levels
shared with P-CCGI
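An in-place radix sort with 256 buckets that starts at the most significant byte can be sketched as follows. This is one common way to do it (an American-flag-style permutation) and is shown here only as an illustration for plain 32-bit keys; the deck does not show the authors' implementation, and the function name `msdRadixSort` is an assumption.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Sorts data[begin, end) in place by the byte selected by `shift`
// (24 = most significant byte) and then recurses into each of the 256
// buckets with the next byte: 4 recursion levels for 32-bit keys.
void msdRadixSort(std::vector<std::uint32_t>& data, std::size_t begin,
                  std::size_t end, int shift = 24) {
    assert(begin <= end && end <= data.size());
    if (end - begin < 2) return;

    // Histogram of the current byte.
    std::array<std::size_t, 256> count{};
    for (std::size_t i = begin; i < end; ++i) ++count[(data[i] >> shift) & 0xFF];

    // head[b]: next unfilled slot of bucket b, tail[b]: end of bucket b.
    std::array<std::size_t, 256> head{}, tail{};
    std::size_t pos = begin;
    for (std::size_t b = 0; b < 256; ++b) {
        head[b] = pos;
        pos += count[b];
        tail[b] = pos;
    }

    // Swap every entry into its bucket without a second array.
    for (std::size_t b = 0; b < 256; ++b) {
        while (head[b] < tail[b]) {
            const std::size_t d = (data[head[b]] >> shift) & 0xFF;
            if (d == b) {
                ++head[b];                               // already in the right bucket
            } else {
                std::swap(data[head[b]], data[head[d]]);  // move it to bucket d
                ++head[d];
            }
        }
    }

    if (shift == 0) return;  // the least significant byte has been processed
    std::size_t bucketBegin = begin;
    for (std::size_t b = 0; b < 256; ++b) {
        msdRadixSort(data, bucketBegin, tail[b], shift - 8);
        bucketBegin = tail[b];
    }
}
```

In a range-partitioned setting each partition can be handed to `msdRadixSort` independently, for example as `msdRadixSort(index, partitionBegin, partitionEnd)`. A tuned version would likely switch to a simpler sort for very small buckets; that refinement is left out here.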
18 / 30
Multi-threaded algorithms: Parallel Range-Partitioned Radix Sort (P-RPRS)
[Figure: Time to sort [s] for P-RPRS vs. (Parallel) Mergesort (GNU libstdc++); 4 Cores / 8 Threads, 512 million 4 byte integers, Uniform random distribution]
19 / 30
Multi-threaded algorithms: Parallel-chunked Range-Partitioned Radix Sort (P-CRS)
P-RPRS + Chunking
[Figure: k Chunks; threads T1, T2, T3, ..., Tk each apply Range-partitioning and radix sort (CGI + RS) to one chunk, answer the Query on it, and produce a Local Result.]
20 / 30
Multi-threaded Results
[Figure: Accumulated Query Response Time [s] vs. Query Sequence (1 to 10000); 8 Threads, 100 Million Elements. Series: P-SC, P-CSC, P-CGI, P-CCGI, P-RPRS, P-CRS]
P-SC
P-CSC (#Chunks * SC): lock vs. lock-free, almost 2x faster
P-CGI (Par. Range Partitioning + P-SC): faster despite range partitioning
P-CCGI (#Chunks * (Range Partitioning + SC)): huge improvement in querying
P-RPRS (Par. Range Partitioning + #Parts * RS): most expensive initialization
P-CRS (#Chunks * (Range Partitioning + RS))
Chunked: P-CSC, P-CCGI, P-CRS
Non-chunked: P-SC, P-CGI, P-RPRS
21 / 30
Multi-threaded Results: Factor Speedup from 1 to 8 Threads
[Figure: Speedup [times] (axis 1 to 8) for P-SC; bars: Initialization (including first query), Total]
22 / 30
P-SC: Analysis
Bandwidth
Lock Time (Intel VTune Amplifier XE 2013 Data)
Mutex Wait Time (sec):
  Piece lock: 11.671
  Cracker index lock: 5.169
  Total: 16.84
  Average (Total by 8): 2.105
23 / 30
Multi-threaded Results: Factor Speedup from 1 to 8 Threads
[Figure: Speedup [times] (axis 1 to 8) for P-SC, P-CGI, P-RPRS; bars: Initialization (including first query), Total]
24 / 30
Non-Chunked Algorithms: Analysis (P-RPRS)
Range Partitioning (RP) Phase:
[Figure: Source array A is split into slices of n/k entries (# Elements / # Threads (k)) for Thread 1, Thread 2, ..., Thread k; in Destination array B the threads write regions t1, t2, ..., tk. Labels: Socket 1, Socket 2, NUMA 1, NUMA 2.]
25 / 30
Non-Chunked Algorithms: Analysis (P-RPRS)
Query Phase:
[Figure: Destination array B with the regions t1, t2, ..., tk written by Thread 1, Thread 2, ..., Thread k. Labels: Socket 1, Socket 2; memory regions NUMA 1, NUMA 2, NUMA 1, NUMA 2.]
Query 1, Thread 1, Socket 1: Remote Access
26 / 30
Multi-threaded Results: Factor Speedup from 1 to 8 Threads
[Figure: Speedup [times] (axis 1 to 8) for P-SC, P-CGI, P-RPRS, P-CSC, P-CCGI, P-CRS; bars: Initialization (including first query), Total]
27 / 30
Chunked Algorithms: Analysis (P-CRS)
All chunks are completely independent - 8x Speedup?
[Figure: 8 Cores, one Chunk per Core, in two groups]
5.413x Speedup
2.9x Speedup | 2.9x Speedup
Shared LLC | Shared LLC
Main Memory (NUMA Region 1) | Main Memory (NUMA Region 2)
28 / 30
Conclusion
[Figure: Initialization Time [s] vs. Number of Threads (1, 2, 4, 8); series: RS, P-RPRS, P-CRS]
[Figure: Total Time [s], non-chunked vs. chunked; series: P-SC, P-CGI, P-CSC, P-CCGI]
[Figure: Accumulated Query Response Time [s] vs. Query Sequence (1 to 10000); 8 Threads, 100 Million Elements; series: P-CCGI, P-CRS]
100 million in less than a second
> 10000 queries to win over best cracking
gap decreased from 5 seconds (1T) to 0.5 seconds (8T)
29 / 30
Upcoming
[Figure: Accumulated Query Response Time [s] vs. Query Sequence (1 to 10000); 8 Threads, 100 Million Elements; series: P-CCGI, P-CRS]
Uses plain Standard Cracking inside
Improvable? Next talk!
30 / 30