Main Memory Adaptive Indexing for Multi-core Systems

Felix Martin Schuhknecht, Victor Alvarez, Jens Dittrich, Stefan Richter. Information Systems Group, Saarland University, https://infosys.uni-saarland.de/. SIGMOD DaMoN, 23.06.2014.


slide-1
SLIDE 1

Main Memory Adaptive Indexing for Multi-core Systems

Felix Martin Schuhknecht, Victor Alvarez, Jens Dittrich, Stefan Richter
Information Systems Group, Saarland University
https://infosys.uni-saarland.de/
SIGMOD DaMoN, 23.06.2014

slide-7
SLIDE 7

Problem: Answer Range Queries

select A from R where R.A >= 10 and R.A < 20

One extreme: Scan + Filter
R.A (unsorted): 43 9 13 22 19 15 7 99 48 17 34
Test every value: >= 10 && < 20 ?

Other extreme: Index
R.A (sorted): 7 9 | 13 15 17 19 | 22 34 43 48 99
Regions: < 10 | >= 10 && < 20 | >= 20

slide-18
SLIDE 18

Index: When to build?

One extreme: At once (Traditional Indexing): build index, then build finished
Other extreme: Incrementally at query time (Adaptive Indexing): initialize index

[Figure: Pressure on the system over Time for both strategies]

slide-23
SLIDE 23

Traditional Indexing: Sort + Binary Search

Column A: 13 16 4 9 2 12 7 1 19 3 14 11 8 6

Sort
Index Column (A): 1 2 3 4 6 7 8 9 11 12 13 14 16 19

select A from R where R.A > 10 and R.A < 14

Binary Search
Index Column (A): 1 2 3 4 6 7 8 9 [11 12 13] 14 16 19
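The two extremes can be contrasted in a few lines. The following is only a minimal sketch, not the code used in the talk; the function names scan_filter, build_index and index_lookup are made up for illustration, and Python's bisect module stands in for the binary search.

```python
import bisect

# Column A as shown on the slide.
column_a = [13, 16, 4, 9, 2, 12, 7, 1, 19, 3, 14, 11, 8, 6]

def scan_filter(column, low, high):
    """Scan + Filter: test every entry against low < v < high."""
    return [v for v in column if low < v < high]

def build_index(column):
    """Traditional indexing: pay the full sorting cost up front."""
    return sorted(column)

def index_lookup(index, low, high):
    """Two binary searches locate the qualifying range in the sorted index."""
    start = bisect.bisect_right(index, low)   # first position with value > low
    end = bisect.bisect_left(index, high)     # first position with value >= high
    return index[start:end]

index_a = build_index(column_a)
print(scan_filter(column_a, 10, 14))    # [13, 12, 11] (column order)
print(index_lookup(index_a, 10, 14))    # [11, 12, 13] (sorted order)
```

The scan touches all entries on every query, while the index answers each query with two logarithmic searches after the one-time sort.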

slide-30
SLIDE 30

Adaptive Indexing: Standard Cracking

[Database Cracking. S. Idreos, M. Kersten, S. Manegold. In CIDR 2007.]

Column A: 13 16 4 9 2 12 7 1 19 3 14 11 8 6

Query 1: select A from R where R.A > 10 and R.A < 14

Cracked Column (A) after query 1:
A <= 10: 4 9 2 7 1 3 8 6
10 < A < 14: 13 12 11
A >= 14: 16 19 14

Query 2: select A from R where R.A > 7 and R.A <= 16

Cracked Column (A) after query 2:
A <= 7: 4 2 1 3 6 7
7 < A <= 10: 9 8
10 < A < 14: 13 12 11
14 <= A <= 16: 14 16
16 < A: 19

Cracker Index (AVL): A <= 7, 7 < A <= 10, 10 < A < 14, 14 <= A <= 16, 16 < A
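The cracking steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation from the cited paper: the class name CrackedColumn is hypothetical, the cracker index is a pair of sorted Python lists instead of an AVL tree, and the integer predicates are rewritten as half-open ranges (A > 10 and A < 14 becomes [11, 14); A > 7 and A <= 16 becomes [8, 17)).

```python
import bisect

class CrackedColumn:
    """Column copy that is partitioned a little further by every range query."""

    def __init__(self, values):
        self.data = list(values)   # cracker column, reorganized in place
        self.pivots = []           # sorted pivot values (the cracker index keys)
        self.positions = []        # positions[i]: first index holding a value >= pivots[i]

    def _crack(self, pivot):
        """Partition the piece containing pivot; return the split position."""
        i = bisect.bisect_left(self.pivots, pivot)
        if i < len(self.pivots) and self.pivots[i] == pivot:
            return self.positions[i]           # already cracked at this value
        left = self.positions[i - 1] if i > 0 else 0
        end = self.positions[i] if i < len(self.pivots) else len(self.data)
        right = end - 1
        # Crack-in-two: values < pivot move to the front of the piece.
        while left <= right:
            if self.data[left] < pivot:
                left += 1
            else:
                self.data[left], self.data[right] = self.data[right], self.data[left]
                right -= 1
        self.pivots.insert(i, pivot)
        self.positions.insert(i, left)
        return left

    def range_query(self, low, high):
        """Return all values v with low <= v < high (assumes low <= high)."""
        start = self._crack(low)
        end = self._crack(high)
        return self.data[start:end]

column = CrackedColumn([13, 16, 4, 9, 2, 12, 7, 1, 19, 3, 14, 11, 8, 6])
q1 = column.range_query(11, 14)   # A > 10 and A < 14
q2 = column.range_query(8, 17)    # A > 7 and A <= 16
print(sorted(q1))                 # [11, 12, 13]
print(sorted(q2))                 # [8, 9, 11, 12, 13, 14, 16]
print(column.pivots)              # [8, 11, 14, 17]
```

Each query only reorganizes the pieces its two bounds fall into, so the second query reuses the boundaries that the first one created.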

slide-37
SLIDE 37

Motivation

Single-threaded adaptive algorithms: Standard Cracking, Hybrid Cracking, Stochastic Cracking, Coarse-granular Index, Sideways Cracking
Single-threaded sorting algorithms: Quicksort, Radixsort, Mergesort
Multi-threaded adaptive algorithms: Parallel Standard Cracking

?

slide-44
SLIDE 44

Setup

100 million entries
Each entry: Key (4 Byte) + RowID (4 Byte)
Total size: ~762 MB
Uniform random key distribution
Query selectivity: 1%
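The ~762 MB figure follows from the entry count and entry size, assuming that MB here means 2^20 bytes:

```latex
\[
  10^{8}\ \text{entries} \times (4 + 4)\ \text{B}
  = 8 \times 10^{8}\ \text{B},
  \qquad
  \frac{8 \times 10^{8}\ \text{B}}{2^{20}\ \text{B/MB}} \approx 762.9\ \text{MB}.
\]
```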

slide-46
SLIDE 46

Setup

2 x Xeon E5-2407, each with 4 cores at 2.2 GHz, 10MB L3, and 24 GB RAM
No Turbo, no HyperThreading

slide-49
SLIDE 49

Single-threaded algorithms

[Figure: Accumulated Query Response Time [s] over Query Sequence (1 to 10000); 1 Thread, 100 Million Elements]
Curves: Standard Cracking (SC), Hybrid Crack Sort (HCS), Coarse-granular Index (CGI), Radix Sort (RS), STL std::sort (STL-S)
Annotations: 750 Queries; > 10000 Queries

[The Uncracked Pieces in Database Cracking. F. M. Schuhknecht, A. Jindal, J. Dittrich. In PVLDB 2013]

slide-50
SLIDE 50

Multi-threaded environments? Multi-threaded algorithms!

slide-59
SLIDE 59

Multi-threaded algorithms: Parallel Standard Cracking (P-SC)

[Concurrency control for adaptive indexing. G. Graefe, F. Halim, S. Idreos, H. Kuno, S. Manegold. In PVLDB 2013]

Queries Q1 and Q2 run on threads T1 and T2
Requested Locks (Q1, Q2): W R R W R R R R W R W
✓ ✓ ✓ ✓

Inter-query parallelism
Lock contention
Underutilize resources (T3, T4, T5, ...)

slide-64
SLIDE 64

Multi-threaded algorithms: Parallel-chunked Standard Cracking (P-CSC)

Query
k Chunks, each with its own Cracker Index
Threads: T1, T2, T3, ..., Tk
Each thread produces a Local Result

slide-70
SLIDE 70

Multi-threaded algorithms: Parallel-chunked Standard Cracking (P-CSC)

Query
k Chunks: one Cracker Index per chunk, processed by threads T1, T2, T3, ..., Tk

Complete independence
Fully utilize resources
No consecutive result
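A rough sketch of the chunked scheme, under several assumptions: the names Chunk, make_chunks and chunked_query are hypothetical, each chunk keeps its cracker index as two sorted lists rather than an AVL tree, queries are half-open integer ranges, and a Python thread pool stands in for the k worker threads (it illustrates the structure, not the performance).

```python
import bisect
from concurrent.futures import ThreadPoolExecutor

class Chunk:
    """One chunk of the column with a private cracker index (no shared state)."""

    def __init__(self, values):
        self.data = list(values)
        self.pivots = []       # sorted pivot values
        self.positions = []    # positions[i]: first index with value >= pivots[i]

    def crack(self, pivot):
        i = bisect.bisect_left(self.pivots, pivot)
        if i < len(self.pivots) and self.pivots[i] == pivot:
            return self.positions[i]
        left = self.positions[i - 1] if i > 0 else 0
        right = (self.positions[i] if i < len(self.pivots) else len(self.data)) - 1
        while left <= right:                      # crack-in-two on this piece
            if self.data[left] < pivot:
                left += 1
            else:
                self.data[left], self.data[right] = self.data[right], self.data[left]
                right -= 1
        self.pivots.insert(i, pivot)
        self.positions.insert(i, left)
        return left

    def query(self, low, high):
        """Local result: values v in this chunk with low <= v < high."""
        start = self.crack(low)
        end = self.crack(high)
        return self.data[start:end]

def make_chunks(column, k):
    """Split the column into k contiguous chunks of (nearly) equal size."""
    n = len(column)
    return [Chunk(column[(t * n) // k:((t + 1) * n) // k]) for t in range(k)]

def chunked_query(chunks, low, high):
    """Every thread cracks only its own chunk, so no locks are needed."""
    with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
        return list(pool.map(lambda c: c.query(low, high), chunks))

column_a = [13, 16, 4, 9, 2, 12, 7, 1, 19, 3, 14, 11, 8, 6]
chunks = make_chunks(column_a, 4)
local_results = chunked_query(chunks, 11, 14)   # A > 10 and A < 14
print(local_results)                            # [[13], [12], [], [11]]
```

The answer arrives as k separate local results rather than one contiguous range, which is the "no consecutive result" drawback noted on the slide.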

slide-72
SLIDE 72

Micro Benchmark: Reading 1% from k locations using one thread

[Figure: Time [s] over Number of Chunks (k), for k from 1 to 1000000]
No problem for realistic k

slide-82
SLIDE 82

Multi-threaded algorithms: Parallel Coarse-Granular Index (P-CGI)

  • 1. Range-partition A into Index(A) while copying (1024 partitions). How to do?
  • 2. Perform P-SC on Index(A) with a Cracker Index; queries request locks (W R R W)

Like starting after 1000 cracks

Reduces lock contention of P-SC
Reduces variance
Adds (small) initialization time

slide-90
SLIDE 90

Multi-threaded algorithms: Parallel Coarse-Granular Index (P-CGI): Parallel Range Partitioning

Source A: # Elements n, # Threads k; Thread 1, Thread 2, ..., Thread k each read n/k elements
Destination B: divided into regions written by t1, t2, ..., tk

Range-partition:
  • 1. Build Histogram
  • 2. Copy entries

No locks required
Fully utilize resources
NUMA-fragmented memory
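The two steps can be sketched as below. This is an illustration under assumptions rather than the authors' code: the function name parallel_range_partition and its parameters are hypothetical, partitions are chosen by the most significant key bits (so the number of partitions must be a power of two), and a Python thread pool stands in for the k threads. The slides use 1024 partitions; the demo uses 8 over 8-bit keys to stay small.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def parallel_range_partition(src, num_threads, num_parts, key_bits=32):
    """Range-partition src into a new array using per-thread histograms."""
    assert num_parts & (num_parts - 1) == 0, "num_parts must be a power of two"
    shift = key_bits - (num_parts.bit_length() - 1)   # keep the top bits as partition id
    n = len(src)
    # Thread t owns the source slice [bounds[t], bounds[t + 1]).
    bounds = [(t * n) // num_threads for t in range(num_threads + 1)]

    def histogram(t):
        counts = [0] * num_parts
        for key in src[bounds[t]:bounds[t + 1]]:
            counts[key >> shift] += 1
        return counts

    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # Step 1: every thread counts how many of its keys go to each partition.
        hists = list(pool.map(histogram, range(num_threads)))

        # Prefix sum (partition-major, thread-minor): each thread gets a
        # private write region inside every partition, so writes never collide.
        offsets = [[0] * num_parts for _ in range(num_threads)]
        part_starts = []
        pos = 0
        for p in range(num_parts):
            part_starts.append(pos)
            for t in range(num_threads):
                offsets[t][p] = pos
                pos += hists[t][p]

        dst = [None] * n

        def scatter(t):
            cursor = offsets[t][:]
            for key in src[bounds[t]:bounds[t + 1]]:
                p = key >> shift
                dst[cursor[p]] = key
                cursor[p] += 1

        # Step 2: every thread copies its keys into its reserved regions.
        list(pool.map(scatter, range(num_threads)))
    return dst, part_starts

random.seed(0)
src = [random.randrange(256) for _ in range(1000)]
dst, part_starts = parallel_range_partition(src, num_threads=4, num_parts=8, key_bits=8)
```

Because the histogram fixes every write position in advance, the copy step needs no locks; the price is that each partition ends up assembled from regions written by different threads.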

slide-96
SLIDE 96

Multi-threaded algorithms: Parallel-chunked Coarse-Granular Index (P-CCGI)

Query
k Chunks, each range-partitioned and with its own Cracker Index
Threads: T1, T2, T3, ..., Tk
Each thread produces a Local Result

P-CCGI = P-CSC + Range Partitioning

slide-101
SLIDE 101

Multi-threaded algorithms: Parallel Range-Partitioned Radix Sort (P-RPRS)

  • 1. Range-partition A into Index(A) while copying (1024 partitions)
  • 2. Perform in-place radix sort on each partition; Index(A) is then fully sorted

Radix sort: 256 buckets, most significant byte first ➞ 4 recursion levels
shared with P-CCGI
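The in-place radix sort of step 2 can be sketched as an MSD radix sort with 256 buckets per level. This is a minimal sketch assuming unsigned 4-byte keys; the function name msd_radix_sort is hypothetical, and the in-place permutation follows the well-known American flag sort pattern, which may differ from what the authors implemented.

```python
import random

def msd_radix_sort(a, lo=0, hi=None, shift=24):
    """Sort a[lo:hi] in place by one byte per level, most significant first."""
    if hi is None:
        hi = len(a)
    if hi - lo <= 1 or shift < 0:
        return
    # Histogram over the current byte (256 buckets).
    counts = [0] * 256
    for i in range(lo, hi):
        counts[(a[i] >> shift) & 0xFF] += 1
    # Bucket boundaries via prefix sum.
    starts = [0] * 256
    pos = lo
    for b in range(256):
        starts[b] = pos
        pos += counts[b]
    ends = [starts[b] + counts[b] for b in range(256)]
    nxt = starts[:]                      # next free slot per bucket
    # Permute in place: swap each misplaced key into its target bucket.
    for b in range(256):
        while nxt[b] < ends[b]:
            v = a[nxt[b]]
            d = (v >> shift) & 0xFF
            if d == b:
                nxt[b] += 1              # already in the right bucket
            else:
                a[nxt[b]], a[nxt[d]] = a[nxt[d]], v
                nxt[d] += 1
    # Recurse on the next byte: shift 24, 16, 8, 0 gives 4 levels for 4-byte keys.
    for b in range(256):
        if counts[b] > 1:
            msd_radix_sort(a, starts[b], ends[b], shift - 8)

random.seed(1)
keys = [random.randrange(2**32) for _ in range(5000)]
expected = sorted(keys)
msd_radix_sort(keys)
```

In P-RPRS this sort would run on each range partition separately, so the partitions can be handed to different threads without coordination.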

slide-102
SLIDE 102

Multi-threaded algorithms: Parallel Range-Partitioned Radix Sort (P-RPRS)

[Figure: Time to sort [s] for P-RPRS vs. (Parallel) Mergesort (GNU libstdc++)]
4 Cores / 8 Threads, 512 million 4 byte integers, uniform random distribution

slide-105
SLIDE 105

Multi-threaded algorithms: Parallel-chunked Range-Partitioned Radix Sort (P-CRS)

Query
k Chunks, each processed with Range-partitioning (CGI) + RS
Threads: T1, T2, T3, ..., Tk
Each thread produces a Local Result

P-CRS = P-RPRS + Chunking

slide-106
SLIDE 106

Multi-threaded Results

[Figure: Accumulated Query Response Time [s] over Query Sequence (1 to 10000); 8 Threads, 100 Million Elements]

Curves, in the order they are added to the plot, with the annotation shown at each step:
P-SC
P-CSC (#Chunks * SC): lock vs. lock-free, almost 2x faster
P-CGI (Par. Range Partitioning + P-SC): faster despite range partitioning
P-CCGI (#Chunks * (Range Partitioning + SC)): huge improvement in querying
P-RPRS (Par. Range Partitioning + #Parts * RS): most expensive initialization
P-CRS (#Chunks * (Range Partitioning + RS))

Chunked: P-CSC, P-CCGI, P-CRS
Non-chunked: P-SC, P-CGI, P-RPRS

21 / 30

slide-122
SLIDE 122

Multi-threaded Results: Factor Speedup from 1 to 8 Threads

[Figure: Speedup [times] (scale 1 to 8) for P-SC; bars: Initialization (including first query), Total]

slide-124
SLIDE 124

P-SC: Analysis

Bandwidth
Lock Time

Mutex Wait Time (sec):
Piece lock: 11.671
Cracker index lock: 5.169
Total: 16.84
Average (Total by 8): 2.105

Intel VTune Amplifier XE 2013 Data

slide-125
SLIDE 125

Multi-threaded Results: Factor Speedup from 1 to 8 Threads

[Figure: Speedup [times] (scale 1 to 8) for P-SC, P-CGI, P-RPRS; bars: Initialization (including first query), Total]

slide-129
SLIDE 129

Non-Chunked Algorithms: Analysis (P-RPRS)

Range Partitioning (RP) Phase:
Source A: # Elements n, # Threads k; Thread 1, Thread 2, ..., Thread k each read n/k elements
Destination B: divided into regions written by t1, t2, ..., tk
Socket 1, Socket 2
NUMA 1, NUMA 2

slide-132
SLIDE 132

Non-Chunked Algorithms: Analysis (P-RPRS)

Query Phase:
Destination B: regions written by t1, t2, ..., tk, labeled NUMA 1, NUMA 2, NUMA 1, NUMA 2
Socket 1, Socket 2
Query 1, Thread 1, Socket 1: Remote Access

slide-133
SLIDE 133

Multi-threaded Results: Factor Speedup from 1 to 8 Threads

[Bar chart: Speedup [times] (axis 1 to 8) for P-SC, P-CGI, P-RPRS, P-CSC, P-CCGI, and P-CRS; two series: Initialization (including first query) and Total]

27 / 30

slide-135
SLIDE 135

Chunked Algorithms: Analysis (P-CRS)

All chunks are completely independent - 8x Speedup?

[Diagram: 8 cores, each with its own chunk]

5.413x Speedup

28 / 30
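
As a quick check on the numbers above, the reported speedup can be expressed as parallel efficiency, that is, speedup divided by the number of threads. The helper name `parallel_efficiency` is made up for this illustration.

```python
def parallel_efficiency(speedup, threads):
    """Fraction of ideal linear speedup that was actually achieved."""
    return speedup / threads


if __name__ == "__main__":
    efficiency = parallel_efficiency(5.413, 8)
    print(f"{efficiency:.1%} of ideal linear speedup")
```

With 5.413x on 8 threads, that is roughly two thirds of the ideal 8x, even though the chunks share no data.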

slide-138
SLIDE 138

Chunked Algorithms: Analysis (P-CRS)

All chunks are completely independent - 8x Speedup?

[Diagram: 8 cores, each with its own chunk, arranged in two groups; each group has a Shared LLC and its own Main Memory (NUMA Region 1, NUMA Region 2)]

2.9x Speedup (shown once for each group)

28 / 30
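
The chunked idea can be sketched as follows: split the column into k chunks, let each worker index its own chunk without touching any other, and answer a range query by probing every chunk. This is a minimal sketch under assumptions: Python's built-in `sorted` stands in for the per-chunk sort used by P-CRS, and the names `build_chunks` and `range_query` are hypothetical. Python threads only show the structure; they will not reproduce the measured speedups.

```python
from bisect import bisect_left
from concurrent.futures import ThreadPoolExecutor


def build_chunks(column, k):
    """Split column into k contiguous chunks and sort each one independently."""
    n = len(column)
    slices = [column[t * n // k:(t + 1) * n // k] for t in range(k)]
    with ThreadPoolExecutor(max_workers=k) as pool:
        # No worker reads or writes another worker's chunk.
        return list(pool.map(sorted, slices))


def range_query(chunks, low, high):
    """Collect all values v with low <= v < high from every sorted chunk."""
    result = []
    for chunk in chunks:
        first = bisect_left(chunk, low)
        last = bisect_left(chunk, high)
        result.extend(chunk[first:last])
    return result


if __name__ == "__main__":
    column = [43, 9, 13, 22, 19, 15, 7, 99, 48, 17, 34]  # R.A from the opening slides
    chunks = build_chunks(column, k=2)
    print(range_query(chunks, 10, 20))
```

The cores still share the last-level cache and the memory of their NUMA region, as the diagram shows, even though the chunks themselves are independent.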

slide-146
SLIDE 146

Conclusion

[Plot 1: Initialization Time [s] (axis 1 to 7) vs. Number of Threads (1, 2, 4, 8) for RS, P-RPRS, and P-CRS]

[Plot 2: Total Time [s] (axis 1 to 6) for non-chunked (P-SC, P-CGI) and chunked (P-CSC, P-CCGI) algorithms]

[Plot 3: Accumulated Query Response Time [s] (axis 0.5 to 3) vs. Query Sequence (1 to 10000, log scale) for P-CCGI and P-CRS; 8 Threads, 100 Million Elements]

100 million in less than a second

> 10000 queries to win over best cracking

gap decreased from 5 seconds (1T) to 0.5 seconds (8T)

29 / 30

slide-150
SLIDE 150

Upcoming

[Plot: Accumulated Query Response Time [s] (axis 0.5 to 3) vs. Query Sequence (1 to 10000, log scale) for P-CCGI and P-CRS; 8 Threads, 100 Million Elements]

Uses plain Standard Cracking inside

Improvable? Next talk!

30 / 30

slide-151
SLIDE 151

Thank you!