Hardware Acceleration of Database Operations - Jared Casper and Kunle Olukotun - PowerPoint PPT Presentation



SLIDE 1

Hardware Acceleration of Database Operations

Jared Casper and Kunle Olukotun
Pervasive Parallelism Laboratory, Stanford University

SLIDE 2

Database machines

n Database machines from late 1970s

n Put some compute on the disk track/head/unit n Processors got faster, I/O performance did not n Processor could keep up with disk

n No performance left on the table

n Today's database machines

n Made up of general purpose components n Massive amounts of memory n Very high speed interconnect n Tables, even databases, fit entirely within memory


SLIDE 3

Database Operation Acceleration

n Processors can not keep up with memory

n Join performance is at 100s of million tuples per

second

n 64-bit tuples → 2-3 GB/s n Chips can get over 100 GB/s n Performance is being left on the table

n Follow 10x10 rule, build accelerators n Three acceleration blocks

n Selection, merge join, sort n Combine these to do a sort merge join n Goal is to “keep up with memory”


SLIDE 4

Select

n Software implementation uses SIMD

n Read data into SIMD register n Use SIMD shuffle operation to move selected data to

  • ne end of the register

n Mask used as index into table for shuffle values

n Unaligned write to append to output n Limited by SIMD width, number of SIMD registers


[Figure: select example showing a bitmask applied to a vector of values, with the selected values compacted into the output]
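A minimal software sketch of the SIMD selection scheme described above, assuming 32-bit keys, SSE intrinsics (compile with -mssse3), and a hypothetical "key < pivot" predicate; the names build_shuffle_table and simd_select_lt are illustrative, not from the original talk:

    #include <stdint.h>
    #include <immintrin.h>   /* SSE2 + SSSE3 intrinsics */

    /* shuffle_table[m] holds pshufb control bytes that compact the lanes
     * whose bit is set in the 4-bit mask m down to the low end of the
     * register; unused byte positions are 0x80 so pshufb zeroes them. */
    static uint8_t shuffle_table[16][16];

    static void build_shuffle_table(void) {
        for (int m = 0; m < 16; m++) {
            int out = 0;
            for (int lane = 0; lane < 4; lane++) {
                if (m & (1 << lane)) {
                    for (int b = 0; b < 4; b++)
                        shuffle_table[m][4 * out + b] = (uint8_t)(4 * lane + b);
                    out++;
                }
            }
            for (int b = 4 * out; b < 16; b++)
                shuffle_table[m][b] = 0x80;
        }
    }

    /* Copy keys < pivot from in[0..n) to out (capacity n); returns count. */
    size_t simd_select_lt(const int32_t *in, size_t n, int32_t pivot, int32_t *out) {
        const __m128i vpivot = _mm_set1_epi32(pivot);
        size_t written = 0, i = 0;
        for (; i + 4 <= n; i += 4) {
            __m128i v    = _mm_loadu_si128((const __m128i *)(in + i));
            __m128i cmp  = _mm_cmplt_epi32(v, vpivot);
            int     mask = _mm_movemask_ps(_mm_castsi128_ps(cmp));   /* 4-bit mask */
            __m128i ctrl = _mm_loadu_si128((const __m128i *)shuffle_table[mask]);
            __m128i pack = _mm_shuffle_epi8(v, ctrl);                /* compact lanes */
            _mm_storeu_si128((__m128i *)(out + written), pack);      /* unaligned append */
            written += (size_t)__builtin_popcount((unsigned)mask);   /* GCC/Clang builtin */
        }
        for (; i < n; i++)                                           /* scalar tail */
            if (in[i] < pivot) out[written++] = in[i];
        return written;
    }

build_shuffle_table() must be called once before the first use. Each iteration stores a full register but only advances the output cursor by the number of selected values, which is what the slide's "unaligned write to append" refers to; throughput is bounded by the SIMD width, as the slide notes.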

SLIDE 5

Select


[Figure: select example continued, showing the mask 1011 indexing the shuffle table to pick the matching lanes]

SLIDE 6

Merge Join

n Scan two sorted columns, output matching

values

n Can have associated values or record IDs n Output cross product when multiple values n Generally viewed as the “free” thing after sorting

n More an indication of how slow sorting is

n Software implementations have bad branching

behaviour

n Limits the IPC → hard to keep up with memory


SLIDE 7

Merge Join

- Output is a bitmask of equal keys with the corresponding values
  - Ready for input into the select block

SLIDE 8

Merge Sort

[Figure: bottom-up merge sort example showing an unsorted input, the runs after the 1st and 2nd passes, and the fully sorted output]
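The pass structure in the figure corresponds to one level of a bottom-up merge sort: each pass merges adjacent sorted runs of width w into runs of width 2w, so roughly log2(n) passes sort n values. A minimal C sketch of one such pass, with illustrative names not taken from the talk:

    #include <stddef.h>
    #include <stdint.h>

    /* Merge the sorted runs src[lo..mid) and src[mid..hi) into dst[lo..hi). */
    static void merge_runs(const int32_t *src, int32_t *dst,
                           size_t lo, size_t mid, size_t hi) {
        size_t i = lo, j = mid, k = lo;
        while (i < mid && j < hi) dst[k++] = (src[i] <= src[j]) ? src[i++] : src[j++];
        while (i < mid) dst[k++] = src[i++];
        while (j < hi)  dst[k++] = src[j++];
    }

    /* One pass: merge adjacent runs of width w from src into runs of width 2w in dst. */
    void merge_pass(const int32_t *src, int32_t *dst, size_t n, size_t w) {
        for (size_t lo = 0; lo < n; lo += 2 * w) {
            size_t mid = (lo + w     < n) ? lo + w     : n;
            size_t hi  = (lo + 2 * w < n) ? lo + 2 * w : n;
            merge_runs(src, dst, lo, mid, hi);
        }
    }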

SLIDE 9

Merge Sort Level


SLIDE 10

High Bandwidth Sort Merge Node


SLIDE 11

Sort Merge Join


- Sort, merge join, and select blocks are combined to perform a full sort merge join in hardware (a software analogue is sketched below)
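As a software analogue of that composition, the sketch below chains the hypothetical merge_pass() and merge_join() helpers from the earlier sketches: repeated merge passes sort each key column, and the merge join then emits the matches. In the hardware pipeline the merge join instead produces a match bitmask that the select block compacts; the scalar sketch folds that step into the emit callback.

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    /* Assumed from the earlier sketches; not part of the original slides. */
    void merge_pass(const int32_t *src, int32_t *dst, size_t n, size_t w);
    void merge_join(const int32_t *l, size_t nl,
                    const int32_t *r, size_t nr,
                    void (*emit)(int32_t key, size_t left_rid, size_t right_rid));

    /* Bottom-up merge sort of one key column, ping-ponging between a and tmp. */
    static void sort_column(int32_t *a, int32_t *tmp, size_t n) {
        int32_t *src = a, *dst = tmp;
        for (size_t w = 1; w < n; w *= 2) {
            merge_pass(src, dst, n, w);
            int32_t *t = src; src = dst; dst = t;   /* swap buffers */
        }
        if (src != a)                               /* result ended up in tmp */
            memcpy(a, src, n * sizeof *a);
    }

    /* Sort both key columns in place, then merge-join them. */
    void sort_merge_join(int32_t *l, size_t nl, int32_t *r, size_t nr,
                         void (*emit)(int32_t, size_t, size_t)) {
        size_t cap = nl > nr ? nl : nr;
        int32_t *tmp = malloc(cap * sizeof *tmp);
        if (!tmp) return;                           /* allocation failure: bail out */
        sort_column(l, tmp, nl);
        sort_column(r, tmp, nr);
        merge_join(l, nl, r, nr, emit);
        free(tmp);
    }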

SLIDE 12

Prototyping Platform - Maxeler


SLIDE 13

Select Throughput

n Software achieved 7 GB/s (33%)

n STREAM achieved 12 GB/s (57%)


[Figure: select throughput (GB/s, and % of line bandwidth) vs. cardinality (%); annotation: "Memory System Saturated!"]

SLIDE 14

Select Resources


[Figure: select resource counts (thousands of ROM bits, 16:1 muxes, 4:1 muxes, and registers) vs. throughput (bytes/clock, equivalently GB/s at 400 MHz)]

SLIDE 15

Merge Join Throughput


- Required resources are a quadratic function of the desired bandwidth
  - All of it is comparison logic; routing was the limiting factor
- Above a 1.5x output ratio, write bandwidth dominates
  - The throughput reported is input consumed

[Figure: merge join throughput (GB/s, and % of total line throughput) vs. output ratio, for m = 1, 2, 3, and 8]

SLIDE 16

Sort Throughput


- Required resources are a linear function of the desired input size
  - Dominated by the memory required to hold working sets
- Recent CPU/GPU numbers: ~300M 32-bit values per second

[Figure: sort throughput (million values per second) vs. input size, for 2 passes, 3 passes, and 3 passes (projected)]

SLIDE 17

Sort Merge Join


n Performance limited by intra-FPGA link n Total throughput is 800 million tuples/second

n ~6.5 GB/s n 8x previous work on software joins

SLIDE 18

Conclusions

n FPGAs can be used to saturate memory

bandwidth in ways that processors can not

n Make the most of every byte read n In some cases, address bandwidth is just as important

as raw data bandwidth

n Scaling your design to high bandwidths can

greatly influence the architecture

n Think streaming

n Next step is to interact with the rest of the

system


SLIDE 19

Questions?