ADVANCED DATABASE SYSTEMS: Parallel Join Algorithms (Hashing)



Lecture #17

Parallel Join Algorithms (Hashing)

@Andy_Pavlo // 15-721 // Spring 2020

ADVANCED DATABASE SYSTEMS


15-721 (Spring 2020)

Background
Parallel Hash Join
Hash Functions
Hashing Schemes
Evaluation


PARALLEL JOIN ALGORITHMS

Perform a join between two relations on multiple threads simultaneously to speed up the operation. Two main approaches:

→ Hash Join
→ Sort-Merge Join

We won't discuss nested-loop joins…

OBSERVATION

Many OLTP DBMSs do not implement hash join. But an index nested-loop join with a small number of target tuples is, at a high level, equivalent to a hash join.

HASHING VS. SORTING

1970s – Sorting
1980s – Hashing
1990s – Equivalent
2000s – Hashing
2010s – Hashing (Partitioned vs. Non-Partitioned)
2020s – ???

PARALLEL JOIN ALGORITHMS

→ Hashing is faster than Sort-Merge.
→ Sort-Merge is faster w/ wider SIMD.

SORT VS. HASH REVISITED: FAST JOIN IMPLEMENTATION ON MODERN MULTI-CORE CPUS

VLDB 2009

→ Sort-Merge is already faster than Hashing, even without SIMD.

MASSIVELY PARALLEL SORT-MERGE JOINS IN MAIN MEMORY MULTI-CORE DATABASE SYSTEMS

VLDB 2012

→ New optimizations and results for Radix Hash Join.

MAIN-MEMORY HASH JOINS ON MULTI-CORE CPUS: TUNING TO THE UNDERLYING HARDWARE

ICDE 2013

→ Trade-offs between partitioning & non-partitioning Hash-Join.

DESIGN AND EVALUATION OF MAIN MEMORY HASH JOIN ALGORITHMS FOR MULTI-CORE CPUS

SIGMOD 2011

→ Ignore what we said last year.
→ You really want to use Hashing!

MASSIVELY PARALLEL NUMA-AWARE HASH JOINS

IMDM 2013

→ Hold up everyone! Let's look at everything more carefully!

AN EXPERIMENTAL COMPARISON OF THIRTEEN RELATIONAL EQUI-JOINS IN MAIN MEMORY

SIGMOD 2016

JOIN ALGORITHM DESIGN GOALS

Goal #1: Minimize Synchronization

→ Avoid taking latches during execution.

Goal #2: Minimize Memory Access Cost

→ Ensure that data is always local to worker thread.
→ Reuse data while it exists in CPU cache.

IMPROVING CACHE BEHAVIOR

Factors that affect cache misses in a DBMS:

→ Cache + TLB capacity.
→ Locality (temporal and spatial).

Non-Random Access (Scan):

→ Clustering data to a cache line.
→ Execute more operations per cache line.

Random Access (Lookups):

→ Partition data to fit in cache + TLB.


Source: Johannes Gehrke

PARALLEL HASH JOINS

Hash join is the most important operator in a DBMS for OLAP workloads. It is important that we speed up our DBMS's join algorithm by taking advantage of multiple cores.

→ We want to keep all cores busy, without becoming memory bound.

HASH JOIN (R⨝S)

Phase #1: Partition (optional)

→ Divide the tuples of R and S into sets using a hash on the join key.

Phase #2: Build

→ Scan relation R and create a hash table on join key.

Phase #3: Probe

→ For each tuple in S, look up its join key in the hash table for R. If a match is found, output the combined tuple.
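The three phases above can be sketched in a few lines. This is a minimal single-threaded Python sketch (no partition phase); the relations, key positions, and the use of Python's built-in dictionary hashing are illustrative assumptions, not from the slides:

```python
from collections import defaultdict

def hash_join(r, s, r_key=0, s_key=0):
    """Equi-join relations R and S on the given key columns."""
    # Build: hash every tuple of R on its join key.
    table = defaultdict(list)
    for r_tup in r:
        table[r_tup[r_key]].append(r_tup)
    # Probe: look up each tuple of S; emit combined tuples on a match.
    out = []
    for s_tup in s:
        for r_tup in table.get(s_tup[s_key], []):
            out.append(r_tup + s_tup)
    return out

R = [(1, "a"), (2, "b")]
S = [(1, "x"), (1, "y"), (3, "z")]
# Key 1 matches twice; key 3 matches nothing.
```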


AN EXPERIMENTAL COMPARISON OF THIRTEEN RELATIONAL EQUI-JOINS IN MAIN MEMORY

SIGMOD 2016

PARTITION PHASE

Split the input relations into partitioned buffers by hashing the tuples’ join key(s).

→ Ideally the cost of partitioning is less than the cost of cache misses during build phase.
→ Sometimes called hybrid hash join / radix hash join.

The contents of the buffers depend on the storage model:

→ NSM: Usually the entire tuple.
→ DSM: Only the columns needed for the join + offset.

PARTITION PHASE

Approach #1: Non-Blocking Partitioning

→ Only scan the input relation once.
→ Produce output incrementally.

Approach #2: Blocking Partitioning (Radix)

→ Scan the input relation multiple times.
→ Only materialize results all at once.
→ Sometimes called radix hash join.

NON-BLOCKING PARTITIONING

Scan the input relation only once and generate the output on the fly.

Approach #1: Shared Partitions

→ Single global set of partitions that all threads update.
→ Must use a latch to synchronize threads.

Approach #2: Private Partitions

→ Each thread has its own set of partitions.
→ Must consolidate them after all threads finish.

SHARED PARTITIONS

[Diagram: worker threads apply hashP(key) to the tuples of the data table (attributes A, B, C) and append each tuple to a single global set of partitions P1 … Pn.]

PRIVATE PARTITIONS

[Diagram: each worker thread applies hashP(key) and writes tuples into its own private set of partitions; after all threads finish, the private partitions are combined into global partitions P1 … Pn.]

RADIX PARTITIONING

Scan the input relation multiple times to generate the partitions. The pass over the relation has multiple steps:

→ Step #1: Scan R and compute a histogram of the # of tuples per hash key for the radix at some offset.
→ Step #2: Use this histogram to determine output offsets by computing the prefix sum.
→ Step #3: Scan R again and partition the tuples according to the hash key.
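The three steps can be sketched as a single-threaded Python pass. The 2-bit radix, the shift offset, and treating the key itself as the hash are illustrative assumptions:

```python
def radix_partition(keys, bits=2, shift=0):
    """One blocking (radix) partitioning pass over `keys`."""
    fanout = 1 << bits
    radix = lambda k: (k >> shift) & (fanout - 1)

    # Step #1: histogram of how many keys fall into each partition.
    hist = [0] * fanout
    for k in keys:
        hist[radix(k)] += 1

    # Step #2: exclusive prefix sum gives each partition's start offset.
    offsets, total = [], 0
    for count in hist:
        offsets.append(total)
        total += count

    # Step #3: second scan scatters each key to the next free slot
    # of its partition's output region.
    out = [None] * len(keys)
    cursor = offsets[:]
    for k in keys:
        p = radix(k)
        out[cursor[p]] = k
        cursor[p] += 1
    return out, offsets
```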


RADIX

The radix of a key is the value of an integer at a position (using its base).

Keys:                89 12 23 08 41 64
Radix (ones place):   9  2  3  8  1  4
Radix (tens place):   8  1  2  0  4  6

PREFIX SUM

The prefix sum of a sequence of numbers (x0, x1, …, xn) is a second sequence of numbers (y0, y1, …, yn) that is a running total of the input sequence.

Input:      1 2 3 4 5 6
Prefix Sum: 1 3 6 10 15 21
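In Python the running total above is one standard-library call; the exclusive variant (the starting offset of each group) is the form the partitioning step uses:

```python
from itertools import accumulate

values = [1, 2, 3, 4, 5, 6]

# Inclusive prefix sum: running total of the input sequence.
prefix = list(accumulate(values))  # → [1, 3, 6, 10, 15, 21]

# Exclusive prefix sum: offset where each group's output begins.
exclusive = [0] + prefix[:-1]      # → [0, 1, 3, 6, 10, 15]
```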

RADIX PARTITIONS

[Diagram: eight keys (07, 18, 19, 07, 03, 11, 15, 10) are split between two CPUs.
Step #1: Inspect input, create histograms. CPU 0 counts Partition 0: 2, Partition 1: 2; CPU 1 counts Partition 0: 1, Partition 1: 3.
Step #2: Compute output offsets via prefix sum: Partition 0 (CPU 0), Partition 0 (CPU 1), Partition 1 (CPU 0), Partition 1 (CPU 1).
Step #3: Read input and partition: Partition 0 receives (07, 07, 03), Partition 1 receives (18, 19, 11, 15, 10).
Recursively repeat until the target number of partitions have been created.]

BUILD PHASE

The threads then scan the tuples (or partitions) of R. For each tuple, hash the join key attribute and add the tuple to the appropriate bucket in the hash table.

→ The buckets should only be a few cache lines in size.


HASH TABLE

Design Decision #1: Hash Function

→ How to map a large key space into a smaller domain.
→ Trade-off between being fast vs. collision rate.

Design Decision #2: Hashing Scheme

→ How to handle key collisions after hashing.
→ Trade-off between allocating a large hash table vs. additional instructions to find/insert keys.

HASH FUNCTIONS

We do not want to use a cryptographic hash function for our join algorithm. We want something that is fast and will have a low collision rate.

→ Best Speed: Always return '1'
→ Best Collision Rate: Perfect hashing

See SMHasher for a comprehensive hash function benchmark suite.
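As a sketch of the fast-but-imperfect end of the spectrum, here is Fibonacci (multiplicative) hashing, a common cheap scheme for spreading integer keys across a power-of-two table. This is an illustration, not one of the functions discussed below; the constant is 2^64 divided by the golden ratio:

```python
def fib_hash(key, table_bits):
    """Multiplicative (Fibonacci) hashing: one multiply + one shift."""
    # 2**64 / golden ratio, truncated to 64 bits.
    golden = 0x9E3779B97F4A7C15
    # Keep the product in 64 bits, then take the top `table_bits` bits.
    return ((key * golden) & 0xFFFFFFFFFFFFFFFF) >> (64 - table_bits)

# Even consecutive keys land in well-spread slots of a 256-slot table.
slots = [fib_hash(k, 8) for k in range(8)]
```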

HASH FUNCTIONS

CRC-64 (1975)

→ Used in networking for error detection.

MurmurHash (2008)

→ Designed to be a fast, general-purpose hash function.

Google CityHash (2011)

→ Designed to be faster for short keys (<64 bytes).

Facebook XXHash (2012)

→ From the creator of zstd compression.

Google FarmHash (2014)

→ Newer version of CityHash with better collision rates.

HASH FUNCTION BENCHMARK

[Chart: throughput (MB/sec) vs. key size (bytes) for crc64, std::hash, MurmurHash3, CityHash, FarmHash, and XXHash3 on an Intel Core i7-8700K @ 3.70GHz. Source: Fredrik Widlund]

HASHING SCHEMES

Approach #1: Chained Hashing
Approach #2: Linear Probe Hashing
Approach #3: Robin Hood Hashing
Approach #4: Hopscotch Hashing
Approach #5: Cuckoo Hashing

CHAINED HASHING

Maintain a linked list of buckets for each slot in the hash table. Resolve collisions by placing all elements with the same hash key into the same bucket.

→ To determine whether an element is present, hash to its bucket and scan for it.
→ Insertions and deletions are generalizations of lookups.
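A minimal chained hash table sketch in Python. The fixed-size directory and Python lists as chains are simplifying assumptions; a real DBMS chains cache-line-sized buckets instead:

```python
class ChainedHashTable:
    """Each slot holds a chain of (key, value) entries."""

    def __init__(self, num_slots=8):
        self.slots = [[] for _ in range(num_slots)]

    def _slot(self, key):
        return hash(key) % len(self.slots)

    def insert(self, key, value):
        # Collisions simply append to the same chain.
        self.slots[self._slot(key)].append((key, value))

    def lookup(self, key):
        # Scan the whole chain: joins may have duplicate keys.
        return [v for k, v in self.slots[self._slot(key)] if k == key]
```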

[Diagram: keys A–F hash into an array of 64-bit bucket pointers; each slot points to a linked list of buckets holding the keys that hashed there. Each 64-bit pointer can pack a 16-bit Bloom filter next to a 48-bit pointer.]

LINEAR PROBE HASHING

Single giant table of slots. Resolve collisions by linearly searching for the next free slot in the table.

→ To determine whether an element is present, hash to a location in the table and scan for it.
→ Must store the key in the table to know when to stop scanning.
→ Insertions and deletions are generalizations of lookups.
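A minimal open-addressing sketch in Python. The fixed table size, absence of deletes, and the assumption that the table never completely fills are simplifications:

```python
class LinearProbeHashTable:
    """Single flat table; collisions scan forward to the next free slot."""

    def __init__(self, num_slots=16):
        self.keys = [None] * num_slots
        self.vals = [None] * num_slots

    def insert(self, key, value):
        i = hash(key) % len(self.keys)
        # Probe linearly (wrapping) until a free slot appears.
        while self.keys[i] is not None:
            i = (i + 1) % len(self.keys)
        self.keys[i], self.vals[i] = key, value

    def lookup(self, key):
        i = hash(key) % len(self.keys)
        # The stored key tells us when to stop scanning:
        # an empty slot means the key cannot be further along.
        while self.keys[i] is not None:
            if self.keys[i] == key:
                return self.vals[i]
            i = (i + 1) % len(self.keys)
        return None
```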

[Diagram: keys A–F are inserted into a single giant table of slots; keys that collide with an occupied slot scan forward linearly until they find a free slot.]

OBSERVATION

To reduce the # of wasteful comparisons during the join, it is important to avoid collisions of hashed keys. This requires a chained hash table with ~2× the number of slots as there are elements in R.

ROBIN HOOD HASHING

Variant of linear probe hashing that steals slots from "rich" keys and gives them to "poor" keys.

→ Each key tracks the number of positions it is away from its optimal position in the table.
→ On insert, a key takes the slot of another key if the first key is farther away from its optimal position than the second key.

ROBIN HOOD HASHING

FOUNDATIONS OF COMPUTER SCIENCE 1985
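The insert rule can be sketched in Python: each entry carries its displacement, and an arriving key that is farther from home than the resident swaps in, with the evicted resident continuing the probe. Table layout and names are illustrative assumptions:

```python
def robin_hood_insert(table, key):
    """Insert into open-addressed `table` (a list of (key, dist) or None)."""
    n = len(table)
    i = hash(key) % n
    dist = 0  # how far the arriving key is from its optimal slot
    while True:
        if table[i] is None:
            table[i] = (key, dist)
            return
        resident_key, resident_dist = table[i]
        if dist > resident_dist:
            # Arriving key is "poorer": steal the slot, and keep
            # probing on behalf of the evicted "richer" resident.
            table[i] = (key, dist)
            key, dist = resident_key, resident_dist
        i = (i + 1) % n
        dist += 1  # whichever key is still probing moved one slot
```

With integer keys (which hash to themselves in CPython), inserting 0, 8, and 16 into an 8-slot table produces displacements [0], [1], [2] in slots 0, 1, 2.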

[Diagram: keys A–F inserted with displacement counts in brackets. A[0] and B[0] land in their home slots; C collides with A (A[0] == C[0]) and probes to C[1]; D collides with C (C[1] > D[0]) and settles at D[1]; E probes past A and C (A[0] == E[0], C[1] == E[1]) and, because D[1] < E[2], E steals D's slot as E[2] while D shifts down to D[2]; F collides with D, but D[2] > F[0], so F keeps probing and lands at F[1].]

HOPSCOTCH HASHING

Variant of linear probe hashing where keys can move between positions in a neighborhood.

→ A neighborhood is a contiguous range of slots in the table.
→ The size of a neighborhood is a configurable constant.

A key is guaranteed to be in its neighborhood or not exist in the table.


HOPSCOTCH HASHING

SYMPOSIUM ON DISTRIBUTED COMPUTING 2008
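A simplified single-threaded sketch of the displacement idea in Python. Real implementations keep a per-slot neighborhood bitmap and fine-grained synchronization; the no-wraparound probing and the names here are simplifying assumptions:

```python
H = 3  # neighborhood size (configurable constant)

def hopscotch_insert(table, key):
    """Insert `key` so it ends up within H slots of its home bucket."""
    n = len(table)
    home = hash(key) % n
    # Linear probe for the first free slot (no wraparound, for simplicity).
    free = home
    while free < n and table[free] is not None:
        free += 1
    if free == n:
        raise RuntimeError("table full; resize needed")
    # Hop the free slot backwards until it lies inside the neighborhood.
    while free - home >= H:
        moved = False
        # Find an earlier key that may legally move into `free`
        # (i.e., `free` is still inside that key's own neighborhood).
        for j in range(free - (H - 1), free):
            cand = table[j]
            if cand is not None and free - (hash(cand) % n) < H:
                table[free] = cand   # displace the candidate forward
                table[j] = None
                free = j             # its old slot becomes the free slot
                moved = True
                break
        if not moved:
            raise RuntimeError("no displacement possible; resize needed")
    table[free] = key

def hopscotch_lookup(table, key):
    """A key is in its neighborhood or not in the table at all."""
    n = len(table)
    home = hash(key) % n
    return any(table[i] == key for i in range(home, min(home + H, n)))
```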

[Diagram: keys A–F inserted with neighborhood size = 3. Each key must live within the 3-slot neighborhood that starts at its home slot. When a key's neighborhood is full, earlier keys are displaced into later slots of their own neighborhoods to open a slot; E ends up displacing a key within neighborhood #3, and F lands in neighborhood #6.]

CUCKOO HASHING

Use multiple tables with different hash functions.

→ On insert, check every table and pick any one that has a free slot.
→ If no table has a free slot, evict the element from one of them and then re-hash it to find a new location.

Look-ups are always O(1) because only one location per hash table is checked.
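A two-table sketch in Python. The eviction limit, the alternation between tables, and the toy hash functions are illustrative assumptions; a real implementation would rebuild with new hash functions when the insert returns failure:

```python
MAX_DISPLACEMENTS = 32  # give up (and rebuild) after this many evictions

def cuckoo_insert(t1, t2, key, h1, h2):
    """Insert `key`; returns False when a probable cycle means rebuild."""
    which = 0  # which table to evict from when both slots are taken
    for _ in range(MAX_DISPLACEMENTS):
        i1, i2 = h1(key) % len(t1), h2(key) % len(t2)
        if t1[i1] is None:
            t1[i1] = key
            return True
        if t2[i2] is None:
            t2[i2] = key
            return True
        # Neither slot is free: evict an occupant and re-insert it,
        # alternating tables to avoid ping-ponging between two keys.
        if which == 0:
            t1[i1], key = key, t1[i1]
        else:
            t2[i2], key = key, t2[i2]
        which = 1 - which
    return False  # probable cycle; rebuild with new hash functions

def cuckoo_lookup(t1, t2, key, h1, h2):
    # O(1): exactly one candidate slot per table.
    return t1[h1(key) % len(t1)] == key or t2[h2(key) % len(t2)] == key
```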

CUCKOO HASHING

[Diagram: X is inserted into Hash Table #1 at hash1(X). Y finds hash1(Y) occupied, so it goes into Hash Table #2 at hash2(Y). Z finds both of its slots occupied; it evicts Y from table #2 (Z takes hash2(Z)), Y re-hashes into table #1 at hash1(Y) and evicts X, and X re-hashes into table #2 at hash2(X).]

CUCKOO HASHING

Threads have to make sure that they don't get stuck in an infinite loop when moving keys. If we find a cycle, then we can rebuild all of the hash tables with new hash functions.

→ With two hash functions, we (probably) won't need to rebuild the table until it is at about 50% full.
→ With three hash functions, we (probably) won't need to rebuild the table until it is at about 90% full.

PROBE PHASE

For each tuple in S, hash its join key and check whether there is a match in the corresponding bucket of the hash table constructed for R.

→ If inputs were partitioned, then assign each thread a unique partition.
→ Otherwise, synchronize their access to the cursor on S.

PROBE PHASE BLOOM FILTER

Create a Bloom filter during the build phase for when the key is likely to not exist in the hash table.

→ Threads check the filter before probing the hash table. This will be faster since the filter will fit in CPU caches.
→ Sometimes called sideways information passing.


MICRO ADAPTIVITY IN VECTORWISE

SIGMOD 2013
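A minimal Bloom filter sketch in Python. The bit-array size and the two derived hash positions are illustrative assumptions (real filters size the array and pick k hash functions from the expected key count); false positives are possible, false negatives never:

```python
class BloomFilter:
    def __init__(self, num_bits=64):
        self.bits = 0            # the filter is just an integer bitmask
        self.num_bits = num_bits

    def _positions(self, key):
        # Two cheap positions derived from one hash; real filters
        # use k independent hash functions.
        h = hash(key)
        return (h % self.num_bits, (h // self.num_bits) % self.num_bits)

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # False means definitely absent: skip the hash table probe.
        return all(self.bits & (1 << pos) for pos in self._positions(key))
```

During the build phase each join key of R is also added to the filter; the probe phase calls might_contain before touching the hash table.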


HASH JOIN VARIANTS

                           No-P     Shared-P            Private-P             Radix
Partitioning               No       Yes                 Yes                   Yes
Input scans                0        1                   1                     2
Sync during partitioning   –        Spinlock per tuple  Barrier, once at end  Barrier, 4 · #passes
Hash table                 Shared   Private             Private               Private
Sync during build phase    Yes      No                  No                    No
Sync during probe phase    No       No                  No                    No

BENCHMARKS

Primary key – foreign key join

→ Outer Relation (Build): 16M tuples, 16 bytes each
→ Inner Relation (Probe): 256M tuples, 16 bytes each

Uniform and highly skewed (Zipf; s=1.25)
No output materialization

DESIGN AND EVALUATION OF MAIN MEMORY HASH JOIN ALGORITHMS FOR MULTI-CORE CPUS

SIGMOD 2011

HASH JOIN (UNIFORM DATA SET)

[Chart: cycles per output tuple (partition/build/probe) on an Intel Xeon CPU X5650 @ 2.66GHz (6 cores, 2 threads per core): No Partitioning 60.2, Shared Partitioning 67.6, Private Partitioning 76.8, Radix 47.3. Radix is 24% faster than No Partitioning, which incurs 3.3x the cache misses and 70x the TLB misses. Source: Spyros Blanas]

HASH JOIN (SKEWED DATA SET)

[Chart: cycles per output tuple on the same machine: No Partitioning 25.2, Shared Partitioning 167.1, Private Partitioning 56.5, Radix 50.7. Source: Spyros Blanas]

OBSERVATION

We have ignored a lot of important parameters for all these algorithms so far.

→ Whether to use partitioning or not?
→ How many partitions to use?
→ How many passes to take in partitioning phase?

In a real DBMS, the optimizer will select what it thinks are good values based on what it knows about the data (and maybe hardware).


RADIX HASH JOIN (UNIFORM DATA SET)

[Charts: cycles per output tuple (partition/build/probe) for 1-pass and 2-pass radix joins while varying the number of partitions from 64 to 131072 (Intel Xeon CPU X5650 @ 2.66GHz), with the No Partitioning baseline marked; annotations show +24% and -5% relative to that baseline. Source: Spyros Blanas]

EFFECTS OF HYPER-THREADING

Radix join has fewer cache & TLB misses but this has marginal benefit. Non-partitioned join relies on multi-threading for high performance.

[Chart: speedup vs. number of threads (1 to 12) for No Partitioning and Radix against ideal scaling on an Intel Xeon CPU X5650 @ 2.66GHz with a uniform data set; threads beyond the 6 physical cores use Hyper-Threading. Source: Spyros Blanas]

TPC-H Q19

[Chart: runtime in ms (join + remaining query) on 4x Intel Xeon CPU E7-4870v4 at scale factor 100: No-Part (Linear) 250, No-Part (Array) 279, Radix (Linear) 301, Radix (Array) 285. Source: Stefan Schuh]

PARTING THOUGHTS

Partition-based joins outperform no-partitioning algorithms in most settings, but they are non-trivial to tune correctly. AFAIK, every DBMS vendor picks one hash join implementation and does not try to be adaptive.

NEXT CLASS

Parallel Sort-Merge Joins
