Hardware Acceleration of Database Operations - Jared Casper and Kunle Olukotun - PowerPoint PPT Presentation



SLIDE 1

Hardware Acceleration of Database Operations

Jared Casper and Kunle Olukotun
Pervasive Parallelism Laboratory, Stanford University

SLIDE 2

Database machines

n Database machines from late 1970s

n Put some compute on the disk track/head/unit n Processors got faster, I/O performance did not n Processor could keep up with disk

n No performance left on the table

n Today's database machines

n Made up of general purpose components n Massive amounts of memory n Very high speed interconnect n Tables, even databases, fit entirely within memory


SLIDE 3

Database Operation Acceleration

n Processors can not keep up with memory

n Join performance is at 100s of million tuples per

second

n 64-bit tuples → 2-3 GB/s n Chips can get over 100 GB/s n Performance is being left on the table

n Follow 10x10 rule, build accelerators n Three acceleration blocks

n Selection, merge join, sort n Combine these to do a sort merge join n Goal is to “keep up with memory”


SLIDE 4

Select

n Software implementation uses SIMD

n Read data into SIMD register n Use SIMD shuffle operation to move selected data to

  • ne end of the register

n Mask used as index into table for shuffle values

n Unaligned write to append to output n Limited by SIMD width, number of SIMD registers


[Figure: select example showing a bitmask applied to a vector of values, with the selected values compacted into the output]
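A minimal software sketch of the SIMD selection scheme described above, assuming 32-bit keys, SSE intrinsics (compile with -mssse3), and a hypothetical "key < pivot" predicate; the names build_shuffle_table and simd_select_lt are illustrative, not from the original talk:

    #include <stdint.h>
    #include <immintrin.h>   /* SSE2 + SSSE3 intrinsics */

    /* shuffle_table[m] holds pshufb control bytes that compact the lanes
     * whose bit is set in the 4-bit mask m down to the low end of the
     * register; unused byte positions are 0x80 so pshufb zeroes them. */
    static uint8_t shuffle_table[16][16];

    static void build_shuffle_table(void) {
        for (int m = 0; m < 16; m++) {
            int out = 0;
            for (int lane = 0; lane < 4; lane++) {
                if (m & (1 << lane)) {
                    for (int b = 0; b < 4; b++)
                        shuffle_table[m][4 * out + b] = (uint8_t)(4 * lane + b);
                    out++;
                }
            }
            for (int b = 4 * out; b < 16; b++)
                shuffle_table[m][b] = 0x80;
        }
    }

    /* Copy keys < pivot from in[0..n) to out (capacity n); returns count. */
    size_t simd_select_lt(const int32_t *in, size_t n, int32_t pivot, int32_t *out) {
        const __m128i vpivot = _mm_set1_epi32(pivot);
        size_t written = 0, i = 0;
        for (; i + 4 <= n; i += 4) {
            __m128i v    = _mm_loadu_si128((const __m128i *)(in + i));
            __m128i cmp  = _mm_cmplt_epi32(v, vpivot);
            int     mask = _mm_movemask_ps(_mm_castsi128_ps(cmp));   /* 4-bit mask */
            __m128i ctrl = _mm_loadu_si128((const __m128i *)shuffle_table[mask]);
            __m128i pack = _mm_shuffle_epi8(v, ctrl);                /* compact lanes */
            _mm_storeu_si128((__m128i *)(out + written), pack);      /* unaligned append */
            written += (size_t)__builtin_popcount((unsigned)mask);   /* GCC/Clang builtin */
        }
        for (; i < n; i++)                                           /* scalar tail */
            if (in[i] < pivot) out[written++] = in[i];
        return written;
    }

build_shuffle_table() must be called once before the first use. Each iteration stores a full register but only advances the output cursor by the number of selected values, which is what the slide's "unaligned write to append" refers to; throughput is bounded by the SIMD width, as the slide notes.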

SLIDE 5

Select


[Figure: select example continued, showing the mask 1011 indexing the shuffle table to pick the matching lanes]

SLIDE 6

Merge Join

n Scan two sorted columns, output matching

values

n Can have associated values or record IDs n Output cross product when multiple values n Generally viewed as the “free” thing after sorting

n More an indication of how slow sorting is

n Software implementations have bad branching

behaviour

n Limits the IPC → hard to keep up with memory


SLIDE 7

Merge Join

- Output is a bitmask of equal keys with the corresponding values
  - Ready for input into the select block

SLIDE 8

Merge Sort

[Figure: bottom-up merge sort example showing an unsorted input, the runs after the 1st and 2nd passes, and the fully sorted output]
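The pass structure in the figure corresponds to one level of a bottom-up merge sort: each pass merges adjacent sorted runs of width w into runs of width 2w, so roughly log2(n) passes sort n values. A minimal C sketch of one such pass, with illustrative names not taken from the talk:

    #include <stddef.h>
    #include <stdint.h>

    /* Merge the sorted runs src[lo..mid) and src[mid..hi) into dst[lo..hi). */
    static void merge_runs(const int32_t *src, int32_t *dst,
                           size_t lo, size_t mid, size_t hi) {
        size_t i = lo, j = mid, k = lo;
        while (i < mid && j < hi) dst[k++] = (src[i] <= src[j]) ? src[i++] : src[j++];
        while (i < mid) dst[k++] = src[i++];
        while (j < hi)  dst[k++] = src[j++];
    }

    /* One pass: merge adjacent runs of width w from src into runs of width 2w in dst. */
    void merge_pass(const int32_t *src, int32_t *dst, size_t n, size_t w) {
        for (size_t lo = 0; lo < n; lo += 2 * w) {
            size_t mid = (lo + w     < n) ? lo + w     : n;
            size_t hi  = (lo + 2 * w < n) ? lo + 2 * w : n;
            merge_runs(src, dst, lo, mid, hi);
        }
    }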

SLIDE 9

Merge Sort Level


SLIDE 10

High Bandwidth Sort Merge Node


SLIDE 11

Sort Merge Join


- Sort, merge join, and select blocks are combined to perform a full sort merge join in hardware (a software analogue is sketched below)
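As a software analogue of that composition, the sketch below chains the hypothetical merge_pass() and merge_join() helpers from the earlier sketches: repeated merge passes sort each key column, and the merge join then emits the matches. In the hardware pipeline the merge join instead produces a match bitmask that the select block compacts; the scalar sketch folds that step into the emit callback.

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    /* Assumed from the earlier sketches; not part of the original slides. */
    void merge_pass(const int32_t *src, int32_t *dst, size_t n, size_t w);
    void merge_join(const int32_t *l, size_t nl,
                    const int32_t *r, size_t nr,
                    void (*emit)(int32_t key, size_t left_rid, size_t right_rid));

    /* Bottom-up merge sort of one key column, ping-ponging between a and tmp. */
    static void sort_column(int32_t *a, int32_t *tmp, size_t n) {
        int32_t *src = a, *dst = tmp;
        for (size_t w = 1; w < n; w *= 2) {
            merge_pass(src, dst, n, w);
            int32_t *t = src; src = dst; dst = t;   /* swap buffers */
        }
        if (src != a)                               /* result ended up in tmp */
            memcpy(a, src, n * sizeof *a);
    }

    /* Sort both key columns in place, then merge-join them. */
    void sort_merge_join(int32_t *l, size_t nl, int32_t *r, size_t nr,
                         void (*emit)(int32_t, size_t, size_t)) {
        size_t cap = nl > nr ? nl : nr;
        int32_t *tmp = malloc(cap * sizeof *tmp);
        if (!tmp) return;                           /* allocation failure: bail out */
        sort_column(l, tmp, nl);
        sort_column(r, tmp, nr);
        merge_join(l, nl, r, nr, emit);
        free(tmp);
    }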

SLIDE 12

Prototyping Platform - Maxeler


SLIDE 13

Select Throughput

n Software achieved 7 GB/s (33%)

n STREAM achieved 12 GB/s (57%)


[Figure: select throughput (GB/s, and % of line bandwidth) vs. cardinality (%); annotation: "Memory System Saturated!"]

SLIDE 14

Select Resources


[Figure: select resource counts (thousands of ROM bits, 16:1 muxes, 4:1 muxes, and registers) vs. throughput (bytes/clock, equivalently GB/s at 400 MHz)]

SLIDE 15

Merge Join Throughput


- Required resources are a quadratic function of the desired bandwidth
  - All of it is comparison logic; routing was the limiting factor
- Above a 1.5x output ratio, write bandwidth dominates
  - The throughput reported is input consumed

[Figure: merge join throughput (GB/s, and % of total line throughput) vs. output ratio, for m = 1, 2, 3, and 8]

SLIDE 16

Sort Throughput


- Required resources are a linear function of the desired input size
  - Dominated by the memory required to hold working sets
- Recent CPU/GPU numbers: ~300M 32-bit values per second

[Figure: sort throughput (million values per second) vs. input size, for 2 passes, 3 passes, and 3 passes (projected)]

SLIDE 17

Sort Merge Join


n Performance limited by intra-FPGA link n Total throughput is 800 million tuples/second

n ~6.5 GB/s n 8x previous work on software joins

SLIDE 18

Conclusions

n FPGAs can be used to saturate memory

bandwidth in ways that processors can not

n Make the most of every byte read n In some cases, address bandwidth is just as important

as raw data bandwidth

n Scaling your design to high bandwidths can

greatly influence the architecture

n Think streaming

n Next step is to interact with the rest of the

system


SLIDE 19

Questions?