Lecture #18
ADVANCED DATABASE SYSTEMS
Parallel Join Algorithms (Sorting)
@Andy_Pavlo // 15-721 // Spring 2019
CMU 15-721 (Spring 2019)
PROJECT #2
This Week
→ Status Meetings (Monday April 8th)
→ Code Review Submission
→ Update Presentation
→ Design Document
PARALLEL JOIN ALGORITHMS
Perform a join between two relations on multiple threads simultaneously to speed up the operation. Two main approaches:
→ Hash Join
→ Sort-Merge Join
SIMD Background
Parallel Sort-Merge Join
Evaluation
SINGLE INSTRUCTION, MULTIPLE DATA
A class of CPU instructions that allow the processor to perform the same operation on multiple data points simultaneously.
All major ISAs have microarchitecture support for SIMD operations:
→ x86: MMX, SSE, SSE2, SSE3, SSE4, AVX
→ PowerPC: Altivec
→ ARM: NEON
SIMD EXAMPLE
X + Y = Z
SISD (scalar): one addition per instruction:
for (i = 0; i < n; i++) { Z[i] = X[i] + Y[i]; }
SIMD: four additions per instruction using 128-bit SIMD registers.
[Figure: element-wise addition of X = [8,7,6,5,4,3,2,1] and Y = [1,1,1,1,1,1,1,1] into Z = [9,8,7,6,5,4,3,2], computed one element at a time with SISD versus four elements at a time with 128-bit SIMD registers.]
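For concreteness, the scalar loop above can be hand-vectorized with x86 SSE2 intrinsics. This is a minimal sketch (not from the slides), assuming 32-bit integer elements, 16-byte-aligned arrays, and n divisible by 4:

#include <emmintrin.h>  // SSE2 intrinsics

// Z = X + Y, four 32-bit additions per instruction.
// Assumes X, Y, Z are 16-byte aligned and n is a multiple of 4.
void simd_add(const int *X, const int *Y, int *Z, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128i x = _mm_load_si128((const __m128i *)&X[i]);  // load 4 ints
        __m128i y = _mm_load_si128((const __m128i *)&Y[i]);
        __m128i z = _mm_add_epi32(x, y);                     // 4 adds at once
        _mm_store_si128((__m128i *)&Z[i], z);                // store 4 ints
    }
}

A compiler can often auto-vectorize a loop this simple, which is why the trade-offs below stress that less regular algorithms still require manual SIMD work.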
SIMD TRADE-OFFS
Advantages:
→ Significant performance gains and better resource utilization if an algorithm can be vectorized.
Disadvantages:
→ Implementing an algorithm using SIMD is still mostly a manual process.
→ SIMD may have restrictions on data alignment.
→ Gathering data into SIMD registers and scattering it to the correct locations is tricky and/or inefficient.
SORT-MERGE JOIN (R⨝S)
Phase #1: Sort
→ Sort the tuples of R and S based on the join key.
Phase #2: Merge
→ Scan the sorted relations and compare tuples (see the sketch below).
→ The outer relation R only needs to be scanned once.
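As a minimal sketch of the two phases (the names Tuple, sort_merge_join, and merge_join are illustrative, not from the slides; merge_join is filled in with the merge phase later in the lecture):

#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

struct Tuple { uint64_t key; uint64_t payload; };

// Phase #2 lockstep scan; defined in the MERGE PHASE sketch later.
std::vector<std::pair<Tuple, Tuple>> merge_join(const std::vector<Tuple> &r,
                                                const std::vector<Tuple> &s);

// Phase #1: sort both inputs on the join key, then merge them.
std::vector<std::pair<Tuple, Tuple>> sort_merge_join(std::vector<Tuple> r,
                                                     std::vector<Tuple> s) {
    auto by_key = [](const Tuple &a, const Tuple &b) { return a.key < b.key; };
    std::sort(r.begin(), r.end(), by_key);  // sort outer relation R
    std::sort(s.begin(), s.end(), by_key);  // sort inner relation S
    return merge_join(r, s);                // scan both relations in lockstep
}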
SORT-MERGE JOIN (R⨝S)
[Figure: Relation R and Relation S are each sorted, then scanned in lockstep to merge matching tuples.]
PARALLEL SORT-MERGE JOINS
Sorting is always the most expensive part. Use the hardware correctly to speed up the join algorithm as much as possible:
→ Utilize as many CPU cores as possible.
→ Be mindful of NUMA boundaries.
→ Use SIMD instructions where applicable.
MULTI-CORE, MAIN-MEMORY JOINS: SORT VS. HASH REVISITED (VLDB 2013)
PARALLEL SORT-MERGE JOIN (R⨝S)
Phase #1: Partitioning (optional)
→ Partition R and assign the partitions to workers / cores.
Phase #2: Sort
→ Sort the tuples of R and S based on the join key.
Phase #3: Merge
→ Scan the sorted relations and compare tuples.
→ The outer relation R only needs to be scanned once.
PARTITIONING PHASE
Approach #1: Implicit Partitioning
→ The data was partitioned on the join key when it was loaded into the database.
→ No extra pass over the data is needed.
Approach #2: Explicit Partitioning
→ Divide only the outer relation and redistribute it among the different CPU cores.
→ Can use the same radix partitioning approach we talked about last time (sketched below).
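A minimal single-threaded sketch of the radix idea for explicit partitioning (illustrative names; the parallel version from last lecture adds a histogram pass so that cores can write to disjoint output ranges):

#include <cstdint>
#include <vector>

struct Tuple { uint64_t key; uint64_t payload; };

// Scatter tuples into 2^B partitions using the low B bits of the join key.
std::vector<std::vector<Tuple>> radix_partition(const std::vector<Tuple> &input,
                                                unsigned B) {
    const uint64_t mask = (1ULL << B) - 1;
    std::vector<std::vector<Tuple>> parts(size_t(1) << B);
    for (const Tuple &t : input)
        parts[t.key & mask].push_back(t);  // route tuple by its radix digits
    return parts;
}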
SORT PHASE
Create runs of sorted chunks of tuples for both input relations.
It used to be that Quicksort was good enough. But NUMA and parallel architectures require us to be more careful…
CACHE-CONSCIOUS SORTING
Level #1: In-Register Sorting
→ Sort runs that fit into CPU registers.
Level #2: In-Cache Sorting
→ Merge Level #1 output into runs that fit into CPU caches.
→ Repeat until sorted runs are ½ cache size.
Level #3: Out-of-Cache Sorting
→ Used when the runs of Level #2 exceed the size of caches.
SORT VS. HASH REVISITED: FAST JOIN IMPLEMENTATION ON MODERN MULTI-CORE CPUS (VLDB 2009)
CACHE-CONSCIOUS SORTING
[Figure: data flows from UNSORTED to SORTED through Level #1 (in-register), Level #2 (in-cache), and Level #3 (out-of-cache).]
LEVEL #1: SORTING NETWORKS
Abstract model for sorting keys.
→ Fixed wiring “paths” for lists with the same # of elements.
→ Efficient to execute on modern CPUs because of limited data dependencies and no branches.
[Figure: a four-element sorting network turns the input 9, 5, 3, 6 into the sorted output 3, 5, 6, 9 through five compare-and-swap stages.]
wires = [9,5,3,6]
lo = min(wires[0], wires[1]); hi = max(wires[0], wires[1])
wires[0] = lo; wires[1] = hi
lo = min(wires[2], wires[3]); hi = max(wires[2], wires[3])
wires[2] = lo; wires[3] = hi
lo = min(wires[0], wires[2]); hi = max(wires[0], wires[2])
wires[0] = lo; wires[2] = hi
lo = min(wires[1], wires[3]); hi = max(wires[1], wires[3])
wires[1] = lo; wires[3] = hi
lo = min(wires[1], wires[2]); hi = max(wires[1], wires[2])
wires[1] = lo; wires[2] = hi
LEVEL #1: SORTING NETWORKS
[Figure: four SIMD registers of four keys each are first sorted across registers with min/max comparators, then transposed so that each register holds one sorted run of four keys.]
Instructions:
→ 4 LOAD
→ 10 MIN/MAX
→ 8 SHUFFLE
→ 4 STORE
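Those instruction counts can be sketched with x86 SSE intrinsics (an illustration assuming 32-bit keys and SSE4.1, not the paper's exact code): each MIN/MAX pair sorts four independent "columns" across the registers at once, and eight unpack shuffles transpose the 4×4 block so each register ends up holding one sorted run.

#include <emmintrin.h>
#include <smmintrin.h>  // SSE4.1: _mm_min_epi32 / _mm_max_epi32

// One comparator of the sorting network, applied to 4 columns at once.
static inline void cswap(__m128i &a, __m128i &b) {
    __m128i lo = _mm_min_epi32(a, b);
    __m128i hi = _mm_max_epi32(a, b);
    a = lo;
    b = hi;
}

// Sort 16 32-bit keys into four sorted runs of four.
// 4 LOAD + 10 MIN/MAX + 8 SHUFFLE + 4 STORE, as on the slide.
void simd_register_sort(const int *in, int *out) {  // 16-byte aligned
    __m128i r0 = _mm_load_si128((const __m128i *)(in + 0));
    __m128i r1 = _mm_load_si128((const __m128i *)(in + 4));
    __m128i r2 = _mm_load_si128((const __m128i *)(in + 8));
    __m128i r3 = _mm_load_si128((const __m128i *)(in + 12));

    // Sort across registers: 5 comparators = 10 MIN/MAX instructions.
    cswap(r0, r1); cswap(r2, r3);
    cswap(r0, r2); cswap(r1, r3);
    cswap(r1, r2);

    // Transpose the 4x4 block: 8 SHUFFLE (unpack) instructions.
    __m128i t0 = _mm_unpacklo_epi32(r0, r1);
    __m128i t1 = _mm_unpacklo_epi32(r2, r3);
    __m128i t2 = _mm_unpackhi_epi32(r0, r1);
    __m128i t3 = _mm_unpackhi_epi32(r2, r3);
    r0 = _mm_unpacklo_epi64(t0, t1);  // first sorted run of four keys
    r1 = _mm_unpackhi_epi64(t0, t1);  // second sorted run
    r2 = _mm_unpacklo_epi64(t2, t3);  // third sorted run
    r3 = _mm_unpackhi_epi64(t2, t3);  // fourth sorted run

    _mm_store_si128((__m128i *)(out + 0), r0);
    _mm_store_si128((__m128i *)(out + 4), r1);
    _mm_store_si128((__m128i *)(out + 8), r2);
    _mm_store_si128((__m128i *)(out + 12), r3);
}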
LEVEL #2: BITONIC MERGE NETWORK
Like a sorting network, but it can merge two locally-sorted lists into a globally-sorted list.
Can expand the network to merge progressively larger lists (up to ½ cache size).
Intel’s Measurements:
→ 2.25–3.5x speed-up over SISD implementation.
EFFICIENT IMPLEMENTATION OF SORTING ON MULTI-CORE SIMD CPU ARCHITECTURE (VLDB 2008)
LEVEL #2: BITONIC MERGE NETWORK
[Figure: a sorted run (a1…a4) and a reverse-sorted run (b4…b1) pass through min/max comparators and shuffles to produce a single sorted run.]
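For intuition, here is a scalar sketch of one bitonic merge step (an assumed helper, not the paper's code; the SIMD version keeps the eight keys in two 128-bit registers and implements each round with min/max plus shuffles). Reversing the second run makes the concatenation a bitonic sequence, which three rounds of fixed-stride compare-and-swaps then sort:

#include <algorithm>
#include <array>

// Merge two sorted runs of 4 keys into one sorted run of 8.
std::array<int, 8> bitonic_merge4(const std::array<int, 4> &a,
                                  const std::array<int, 4> &b) {
    // Append b in reverse so v is bitonic (ascending then descending).
    std::array<int, 8> v = {a[0], a[1], a[2], a[3], b[3], b[2], b[1], b[0]};
    for (int stride = 4; stride >= 1; stride /= 2) {  // rounds: 4, 2, 1
        for (int i = 0; i < 8; ++i) {
            if ((i & stride) == 0) {                  // compare (i, i+stride)
                int lo = std::min(v[i], v[i + stride]);
                int hi = std::max(v[i], v[i + stride]);
                v[i] = lo;
                v[i + stride] = hi;
            }
        }
    }
    return v;
}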
LEVEL #3: MULTI-WAY MERGING
Use the Bitonic Merge Networks, but split the process up into tasks (see the sketch after the figure below).
→ Still one worker thread per core.
→ Link together tasks with a cache-sized FIFO queue.
A task blocks when either its input queue is empty or its output queue is full.
Requires more CPU instructions, but brings bandwidth and compute into balance.
LEVEL #3: MULTI-WAY MERGING
[Figure: sorted runs feed a tree of MERGE tasks linked by cache-sized FIFO queues.]
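For intuition only, a single-threaded multi-way merge over sorted runs (a heap-based sketch with assumed names; the design above instead pipelines per-core MERGE tasks over cache-sized FIFO queues and merges with bitonic networks):

#include <cstdint>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Merge any number of sorted runs into one sorted output.
std::vector<uint64_t> multiway_merge(const std::vector<std::vector<uint64_t>> &runs) {
    using Cursor = std::pair<uint64_t, size_t>;  // (next key, run index)
    std::priority_queue<Cursor, std::vector<Cursor>, std::greater<Cursor>> heap;
    std::vector<size_t> pos(runs.size(), 0);
    for (size_t r = 0; r < runs.size(); ++r)
        if (!runs[r].empty()) heap.push({runs[r][0], r});
    std::vector<uint64_t> out;
    while (!heap.empty()) {
        auto [key, r] = heap.top();  // smallest head among all runs
        heap.pop();
        out.push_back(key);
        if (++pos[r] < runs[r].size()) heap.push({runs[r][pos[r]], r});
    }
    return out;
}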
MERGE PHASE
Iterate through the outer table and inner table in lockstep and compare join keys.
May need to backtrack if there are duplicates.
Can be done in parallel at the different cores without synchronization if there are separate output buffers.
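A sketch of that lockstep scan, filling in the merge_join assumed in the earlier sort-merge sketch (both inputs must already be sorted on key; the inner loop re-scans the current group of equal S keys, which is the backtracking needed for duplicates):

#include <cstdint>
#include <utility>
#include <vector>

struct Tuple { uint64_t key; uint64_t payload; };  // as in the earlier sketch

std::vector<std::pair<Tuple, Tuple>> merge_join(const std::vector<Tuple> &r,
                                                const std::vector<Tuple> &s) {
    std::vector<std::pair<Tuple, Tuple>> out;
    size_t i = 0, j = 0;
    while (i < r.size() && j < s.size()) {
        if (r[i].key < s[j].key) {
            ++i;                         // advance outer cursor
        } else if (r[i].key > s[j].key) {
            ++j;                         // advance inner cursor
        } else {
            size_t group = j;            // start of the equal-key group in S
            while (i < r.size() && r[i].key == s[group].key) {
                for (j = group; j < s.size() && s[j].key == r[i].key; ++j)
                    out.push_back({r[i], s[j]});  // emit every duplicate match
                ++i;                     // next R tuple: backtrack j to group
            }
        }
    }
    return out;
}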
SORT-MERGE JOIN VARIANTS
Multi-Way Sort-Merge (M-WAY)
Multi-Pass Sort-Merge (M-PASS)
Massively Parallel Sort-Merge (MPSM)
MULTI-WAY SORT-MERGE (M-WAY)
Outer Table
→ Each core sorts in parallel on local data (levels #1/#2).
→ Redistribute sorted runs across cores using the multi-way merge (level #3).
Inner Table
→ Same as outer table.
Merge phase is between matching pairs of chunks.
MULTI-CORE, MAIN-MEMORY JOINS: SORT VS. HASH REVISITED (VLDB 2013)
MULTI-WAY SORT-MERGE
[Figure: each relation goes through local-NUMA partitioning, per-core sorting, and a multi-way merge; the inner table repeats the same steps, and matching chunks are then combined with a local merge join.]
MULTI-PASS SORT-MERGE (M-PASS)
Outer Table
→ Same level #1/#2 sorting as Multi-Way.
→ But instead of redistributing, it uses a multi-pass naïve merge on the sorted runs.
Inner Table
→ Same as outer table.
Merge phase is between matching pairs of chunks.
MULTI-CORE, MAIN-MEMORY JOINS: SORT VS. HASH REVISITED (VLDB 2013)
MULTI-PASS SORT-MERGE
[Figure: both relations go through local-NUMA partitioning and per-core sorting, followed by a global merge join.]
MASSIVELY PARALLEL SORT-MERGE (MPSM)
Outer Table
→ Range-partition the outer table and redistribute it to the cores.
→ Each core sorts its partitions in parallel.
Inner Table
→ Not redistributed like the outer table.
→ Each core sorts its local data.
Merge phase is between the entire sorted run of the outer table and a segment of the inner table.
MASSIVELY PARALLEL SORT-MERGE JOINS IN MAIN MEMORY MULTI-CORE DATABASE SYSTEMS (VLDB 2012)
MASSIVELY PARALLEL SORT-MERGE
[Figure: the outer table is cross-NUMA range-partitioned so that per-core sorts yield a globally sorted outer table; each core then runs a cross-partition merge join against the inner table's local sorted runs.]
Rule #1: No random writes to non-local memory
→ Chunk the data, redistribute, and then each core sorts/works on local data.
Rule #2: Only perform sequential reads on non-local memory
→ This allows the hardware prefetcher to hide remote access latency.
Rule #3: No core should ever wait for another
→ Avoid fine-grained latching or sync barriers.
Source: Martina-Cezara Albutiu
EVALUATION
Compare the different join algorithms using a synthetic data set.
→ Sort-Merge: M-WAY, M-PASS, MPSM
→ Hash: Radix Partitioning
Hardware:
→ 4-Socket Intel Xeon E5-4640 @ 2.4GHz
→ 8 Cores with 2 Threads Per Core
→ 512 GB of DRAM
MULTI-CORE, MAIN-MEMORY JOINS: SORT VS. HASH REVISITED (VLDB 2013)
RAW SORTING PERFORMANCE
[Figure: single-threaded sorting performance; throughput (M tuples/sec) vs. number of tuples (in 2^20, from 1 to 256) for C++ STL Sort and SIMD Sort. The SIMD sort is 2.5–3x faster.]
Source: Cagri Balkesen
COMPARISON OF SORT-MERGE JOINS
Workload: 1.6B ⨝ 128M (8-byte tuples)
[Figure: cycles per output tuple (broken down into Partition, Sort, S-Merge, M-Join) and throughput (M tuples/sec) for M-WAY, M-PASS, and MPSM; annotated throughputs are 7.6, 13.6, and 22.9 M tuples/sec, with a hyper-threading region marked.]
Source: Cagri Balkesen
M-WAY JOIN VS. MPSM JOIN
Workload: 1.6B ⨝ 128M (8-byte tuples)
[Figure: throughput (M tuples/sec) vs. number of threads (1 to 64) for Multi-Way and Massively Parallel; annotated points include 54, 90, 108, 130, 259, and 315 M tuples/sec.]
Source: Cagri Balkesen
SORT-MERGE JOIN VS. HASH JOIN
Workload: different table sizes (8-byte tuples)
[Figure: cycles per output tuple (Partition, Sort, S-Merge, M-Join, Build+Probe) for SORT vs. HASH on 128M⨝128M, 1.6B⨝1.6B, 128M⨝512M, and 1.6B⨝6.4B.]
Source: Cagri Balkesen
SORT-MERGE JOIN VS. HASH JOIN
[Figure: throughput (M tuples/sec) while varying the size of the input relations (128M to 1920M tuples) for the Multi-Way Sort-Merge Join and the Radix Hash Join.]
Source: Cagri Balkesen
PARTING THOUGHTS
Both join approaches are equally important. Every serious OLAP DBMS supports both.
We did not consider the impact of queries where the output needs to be sorted.
NEXT CLASS
Query Compilation
→ Or "Why the DSL Project is Awesome"