Hardware Acceleration of Database Operations
Jared Casper and Kunle Olukotun
Pervasive Parallelism Laboratory, Stanford University
Database machines
- Database machines from late 1970s
  - Put some compute on the disk track/head/unit
  - Processors got faster, I/O performance did not
  - Processor could keep up with disk
  - No performance left on the table
- Today's database machines
  - Made up of general purpose components
  - Massive amounts of memory
  - Very high speed interconnect
  - Tables, even databases, fit entirely within memory
Database Operation Acceleration
- Processors cannot keep up with memory
  - Join performance is at 100s of millions of tuples per second
  - 64-bit tuples → 2-3 GB/s
  - Chips can get over 100 GB/s
  - Performance is being left on the table
- Follow 10x10 rule, build accelerators
- Three acceleration blocks
  - Selection, merge join, sort
  - Combine these to do a sort merge join
  - Goal is to "keep up with memory"
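The gap between join throughput and memory bandwidth can be checked with a little arithmetic. The sketch below assumes 8-byte (64-bit) tuples and decimal gigabytes. The two tuple rates are illustrative picks for "100s of millions", not measured values.

```python
TUPLE_BYTES = 8  # 64-bit tuples


def bandwidth_gb_per_s(tuples_per_second: float, tuple_bytes: int = TUPLE_BYTES) -> float:
    """Data rate in GB/s (1 GB = 10**9 bytes) for a given tuple rate."""
    return tuples_per_second * tuple_bytes / 1e9


# Illustrative rates only: roughly the range that maps onto 2-3 GB/s.
for rate in (250e6, 375e6):
    print(f"{rate / 1e6:.0f}M tuples/s -> {bandwidth_gb_per_s(rate):.1f} GB/s")

# Compared with a chip that can move 100 GB/s, only a few percent is used.
MEMORY_GB_PER_S = 100.0
print(f"fraction used: {bandwidth_gb_per_s(375e6) / MEMORY_GB_PER_S:.0%}")
```

Even at the upper illustrative rate, the join consumes about 3% of a 100 GB/s memory system, which is the sense in which performance is left on the table.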
Select
- Software implementation uses SIMD
  - Read data into SIMD register
  - Use SIMD shuffle operation to move selected data to one end of the register
  - Mask used as index into table for shuffle values
  - Unaligned write to append to output
  - Limited by SIMD width, number of SIMD registers
[Figure: select example. Mask 1 0 1 1 1 0 0 1 applied to input C A F E B A B E gives output C F E B E]
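The shuffle-based selection described above can be modelled in plain Python. This is a minimal sketch, not real SIMD code: `LANES`, `build_shuffle_table`, and `simd_select` are hypothetical names, a Python list stands in for a SIMD register, and the input data is illustrative.

```python
from typing import List, Sequence

LANES = 4  # hypothetical SIMD width, in elements per register


def build_shuffle_table(lanes: int) -> List[List[int]]:
    """For every possible mask, list the lanes whose bit is set, in lane order."""
    return [
        [lane for lane in range(lanes) if mask & (1 << lane)]
        for mask in range(1 << lanes)
    ]


SHUFFLE_TABLE = build_shuffle_table(LANES)


def simd_select(values: Sequence, keep: Sequence) -> list:
    """Keep values[i] where keep[i] is truthy, one register-sized chunk at a time."""
    out = []
    for base in range(0, len(values), LANES):
        register = values[base:base + LANES]       # "read data into SIMD register"
        mask = 0
        for lane in range(len(register)):
            if keep[base + lane]:
                mask |= 1 << lane
        shuffle = SHUFFLE_TABLE[mask]              # mask indexes the shuffle table
        packed = [register[lane] for lane in shuffle]  # selected data moved to one end
        out.extend(packed)                         # models the unaligned append
    return out


print(simd_select(list("CAFEBABE"), [1, 0, 1, 1, 1, 0, 0, 1]))
```

The table has one entry per possible mask, so it grows as 2^LANES. That is one way to see why the software approach is limited by SIMD width.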
Select
[Figure: hardware select block; labels include mask 1011 and values 7 6 5 4]
Merge Join
- Scan two sorted columns, output matching values
  - Can have associated values or record IDs
  - Output cross product when multiple values
- Generally viewed as the "free" thing after sorting
  - More an indication of how slow sorting is
- Software implementations have bad branching behaviour
  - Limits the IPC → hard to keep up with memory
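A textbook software merge join makes the scan and the cross-product rule concrete. This is a minimal sketch, not the implementation measured in the talk. The function name and the `(key, value)` tuple layout are assumptions chosen for illustration.

```python
def merge_join(left, right):
    """Join two lists of (key, value) tuples that are already sorted by key.

    Returns (key, left_value, right_value) for every matching pair. Runs of
    duplicate keys produce the cross product of the two runs.
    """
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lkey, rkey = left[i][0], right[j][0]
        if lkey < rkey:        # data-dependent branch
            i += 1
        elif lkey > rkey:      # data-dependent branch
            j += 1
        else:
            # Find the end of the run of equal keys on each side.
            i_end = i
            while i_end < len(left) and left[i_end][0] == lkey:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == lkey:
                j_end += 1
            for _, lval in left[i:i_end]:
                for _, rval in right[j:j_end]:
                    out.append((lkey, lval, rval))
            i, j = i_end, j_end
    return out


left = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]
right = [(2, "x"), (2, "y"), (3, "z"), (4, "w")]
print(merge_join(left, right))
```

Which branch is taken on each iteration depends on the data, which is hard for a branch predictor and is one plausible reading of the "bad branching behaviour" noted above.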
Merge Join
- Output is a bitmask of equal keys with corresponding values
  - Ready for input into the select block
Merge Sort
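[Figure: merge sort example on eight values, showing the sorted runs after the 1st pass (4 8 | 1 2 | 5 5 | 0 7) and the 2nd pass (1 2 4 8 | 0 5 5 7), followed by the fully sorted output]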
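The pass structure of a merge sort can be sketched as a bottom-up sort that doubles the run length on each pass. This is a plain two-way software version for illustration only. It makes no claim about how the hardware sort block merges runs, and the input list is an illustrative choice.

```python
def merge(a, b):
    """Merge two sorted lists into one sorted list."""
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    out.extend(a[i:])
    out.extend(b[j:])
    return out


def merge_pass(values, run_length):
    """Merge adjacent sorted runs of `run_length` into runs twice as long."""
    out = []
    for start in range(0, len(values), 2 * run_length):
        left = values[start:start + run_length]
        right = values[start + run_length:start + 2 * run_length]
        out.extend(merge(left, right))
    return out


def merge_sort_passes(values):
    """Return the state of the data after each pass of a bottom-up merge sort."""
    passes = []
    current = list(values)
    run_length = 1
    while run_length < len(current):
        current = merge_pass(current, run_length)
        passes.append(current)
        run_length *= 2
    return passes


for number, result in enumerate(merge_sort_passes([4, 8, 2, 1, 5, 5, 0, 7]), start=1):
    print(f"pass {number}: {result}")
```

Each pass streams over the whole input once, so the number of passes directly determines how many times the data crosses the memory interface.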
Merge Sort Level
High Bandwidth Sort Merge Node
Sort Merge Join
- Sort, merge join, and select blocks are combined to perform a full sort merge join in hardware
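How the three blocks compose can be sketched in software. This is a rough model under stated assumptions, not the hardware design: the function names are hypothetical, Python's `sorted` stands in for the sort block, keys in the right table are assumed unique, and the merge join stage emits one match bit per left tuple so that the select stage can compact the result.

```python
def sort_block(table):
    """Stand-in for the sort block: order (key, value) tuples by key."""
    return sorted(table, key=lambda row: row[0])


def merge_join_block(left, right):
    """Scan two sorted inputs and emit (match_bit, key, lval, rval) per left tuple.

    Assumption for this sketch: keys in `right` are unique.
    """
    rows = []
    j = 0
    for key, lval in left:
        while j < len(right) and right[j][0] < key:
            j += 1
        if j < len(right) and right[j][0] == key:
            rows.append((1, key, lval, right[j][1]))
        else:
            rows.append((0, key, lval, None))
    return rows


def select_block(rows):
    """Keep only rows whose match bit is set, dropping the bit itself."""
    return [row[1:] for row in rows if row[0]]


def sort_merge_join(left, right):
    """Chain the three stages: sort both inputs, merge join, then select."""
    return select_block(merge_join_block(sort_block(left), sort_block(right)))


orders = [(3, "o1"), (1, "o2"), (3, "o3"), (5, "o4")]
customers = [(1, "ann"), (2, "bob"), (3, "cat")]
print(sort_merge_join(orders, customers))
```

Each stage consumes and produces a stream of tuples, so the stages can be chained without materialising intermediate results.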
Prototyping Platform - Maxeler
Select Throughput
- Software achieved 7 GB/s (33%)
- STREAM achieved 12 GB/s (57%)
[Figure: select throughput (GB/s) and % of line bandwidth versus cardinality (%); annotation: "Memory System Saturated!"]
Select Resources
[Figure: select block resource count (thousands) versus throughput (bytes/clock, and GB/s @ 400 MHz); series: ROM bits, 16:1 mux, 4:1 mux, registers]
Merge Join Throughput
- Resources required are a quadratic function of desired bandwidth
  - All in comparison logic, routing was the limiting factor
- Above 1.5x output, write bandwidth dominates
  - Throughput above is input consumed
[Figure: merge join throughput (GB/s) and % total line throughput versus output ratio; series: m=1, m=2, m=3, m=8]
Sort throughput
- Resources required are a linear function of desired input size
  - Dominated by the memory required to hold working sets
- Recent CPU/GPU numbers ~300M 32-bit values per second
[Figure: sort throughput (million values per second) versus size of input; series: 2 passes, 3 passes, 3 passes (projected)]
Sort Merge Join
- Performance limited by intra-FPGA link
- Total throughput is 800 million tuples/second
  - ~6.5 GB/s
  - 8x previous work on software joins
Conclusions
- FPGAs can be used to saturate memory bandwidth in ways that processors cannot
  - Make the most of every byte read
  - In some cases, address bandwidth is just as important as raw data bandwidth
- Scaling your design to high bandwidths can greatly influence the architecture
  - Think streaming
- Next step is to interact with the rest of the system