SLIDE 1

The Glass Half Full

Using Programmable Hardware Accelerators in Analytical Databases

Zsolt István

IMDEA Software Institute

SLIDE 2

IMDEA Software Institute

16 Faculty in the areas of:
Program Analysis and Verification
Languages and Compilers
Security and Privacy
Theoretical Computer Science
Distributed Systems and Databases
~10 Post-docs, ~25 PhD Students, ~10 Interns

Located in UPM Montegancedo Campus, Madrid

We are hiring! https://software.imdea.org/

SLIDE 3

▪ OLAP – Online Analytical Processing

▪ Large datasets – up to TBs
▪ Ad-hoc querying to extract insight, recurring reporting – Possibly complex operations
▪ Read-mostly workloads, updates in batches

▪ OLTP – Online Transaction Processing

▪ Smaller datasets
▪ Queries known, relate to business actions
▪ Makes heavy use of indexes
▪ Reads and updates intermixed

Context: Analytical Databases

SLIDE 4

Databases were a $25 billion market in 2018… Could we specialize machines for them?

https://www.statista.com/statistics/810188/worldwide-commercial-database-market-size/

SLIDE 5

▪ Fully custom machine for databases

▪ Processors – special ISA microprocessors
▪ Memory – magnetic bubbles and CCDs

▪ Semiconductor technology and general purpose CPUs took over

Database Computer – ’70s

“The first goal is to design it with the capability of handling a very large on-line database of 10^10 bytes or beyond since special-purpose machines are not likely to be cost-effective for small databases.”

Jayanta Banerjee, David K. Hsiao, Krishnamurthi Kannan: DBC - A Database Computer for Very Large Databases. IEEE Trans. Computers 28(6): 414-429 (1979)

SLIDE 6

▪ Based on VAX multi-processor system
▪ By the time the software and hardware were developed, CPUs had become much faster

▪ Couldn’t keep up with Moore’s law

Gamma Machine – ’80s

David J. DeWitt, Robert H. Gerber, Goetz Graefe, Michael L. Heytens, Krishna B. Kumar, M. Muralikrishna: GAMMA - A High Performance Dataflow Database Machine. VLDB 1986: 228-237

SLIDE 7

Data/Compute Gap

[Figure labels: CPU Scaling; Commodity in Cloud; Specialized Hardware Revival]

SLIDE 8

Renewed interest in Specialized Hardware

[Figure labels: ASICs, FPGAs, CPUs]

SLIDE 9

Field Programmable Gate Array (FPGA)
▪ Free choice of architecture
▪ Fine-grained pipelining, communication, distributed memory
▪ Tradeoff: all “code” occupies chip space
▪ Evolving platform: larger chips, more heterogeneity

Re-programmable Specialized Hardware

[Figure labels: Op 1, Op 2, Op 3]

SLIDE 10

Integration Options

1) On the side
2) In data-path
3) Co-processor

[Figure: placement of the accelerator (Accel.) relative to the data (Data) for each of the three options]

SLIDE 11

▪ Accelerator

▪ Amazon F1

▪ In data path

▪ Microsoft Catapult

▪ Co-processor

▪ Intel Xeon+FPGA

In the Cloud Today

[Figure: Intel Xeon+FPGA Gen.1 and Intel Xeon+FPGA Gen.2, showing CPU and FPGA placement on sockets]

SLIDE 12

The Glass Half Empty…

SLIDE 13

▪ 1) On the side acceleration introduces overhead
▪ Much related work offers no real speedup once data movement, transformation, and software overhead are factored in…

The Glass Half Empty…

[Figure: Query execution time for “Software” vs. “With Acceleration”, split into Compute and Data Movement; annotated “2x”]
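The overhead argument on this slide can be made concrete with a small calculation. The sketch below is illustrative only: the function name and all numbers are assumptions, not measurements from the talk. It computes the end-to-end speedup when only the compute part of a query is accelerated and the accelerator adds extra data movement.

```python
def end_to_end_speedup(compute_s, move_s, kernel_speedup, extra_move_s=0.0):
    """Whole-query speedup when only the compute part is accelerated.

    compute_s      -- software compute time (seconds)
    move_s         -- data movement time that exists with or without the accelerator
    kernel_speedup -- how much faster the accelerator runs the compute part
    extra_move_s   -- additional copies/transformations needed to feed the accelerator
    """
    before = compute_s + move_s
    after = compute_s / kernel_speedup + move_s + extra_move_s
    return before / after


# Hypothetical numbers: a 10x faster kernel yields well under 2x end to end
# once 30 s of extra data movement is added.
print(round(end_to_end_speedup(80.0, 20.0, 10.0, extra_move_s=30.0), 2))
```

Under these assumed numbers the kernel is 10x faster, yet the query as a whole improves by less than 2x, which is the effect the slide warns about.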

SLIDE 14

▪ 2) “All or nothing” behavior makes query planning difficult

▪ Example: fixed capacity hash table on FPGA
▪ Constant time access for reads and writes
▪ What happens if data doesn’t fit?

▪ Can’t always know the number of keys a priori
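A minimal software sketch of the “all or nothing” problem, assuming a table whose capacity is fixed up front (as on-chip memory would be). The class name is hypothetical and a Python dict stands in for the hardware table; it is not the design used in any of the cited systems.

```python
class FixedCapacityTable:
    """Stand-in for an on-chip hash table with a hard capacity limit."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = {}  # dict used as a stand-in for on-chip memory

    def put(self, key, value):
        # A new key only fits while there is a free slot; updates always fit.
        if key not in self.slots and len(self.slots) >= self.capacity:
            raise OverflowError("table full: the operator cannot continue")
        self.slots[key] = value

    def get(self, key):
        return self.slots.get(key)


table = FixedCapacityTable(capacity=2)
table.put("a", 1)
table.put("b", 2)
try:
    table.put("c", 3)  # one key too many: the whole operation fails
except OverflowError as err:
    print(err)
```

If the planner cannot bound the number of distinct keys, it cannot tell in advance whether such an operator will finish or fail.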

The Glass Half Empty…

SLIDE 15

▪ 3) Analytical databases becoming more optimized / not much compute in core SQL
▪ X100 [CIDR05] showed that <10% of compute time is spent on SQL operators (+, -, *, SUM, AVG) in analytical queries

▪ Columnar stores often memory bound (10s of GB/s)

The Glass Half Empty…

SLIDE 16

▪ On the side acceleration introduces overhead
▪ “All or nothing” behavior makes query planning difficult
▪ Analytical databases becoming more optimized / not much compute in core SQL

The Glass Half Empty…

SLIDE 17

▪ On the side acceleration introduces overhead
✓ Reduce data movement bottlenecks

The Glass Half Full…

SLIDE 18

▪ IBEX: Database storage engine with processing offload

▪ Filter and pre-aggregate for analytic workloads

Processing in data path: Smart Flash

▪ Opportunity to extend SSDs/Flash with complex offload
→ Larger bandwidth, more IOPS (Samsung YourSQL, MIT BlueDBM)

[Figure labels: Database Server, IBEX, SSD; Samsung “smart” SSD]

IBEX – An Intelligent Storage Engine with Support for Advanced SQL Off-loading. L. Woods, Z. Istvan and G. Alonso, VLDB’14

SLIDE 19

Processing in data path: Distributed Processing

Workers (Compute) Storage

+ Provisioning + Scalability

Caribou: Distributed storage with processing

Specialized HW nodes
10 Gbps access
25 W power consumption

Zsolt István, David Sidler, Gustavo Alonso: Caribou: Intelligent Distributed Storage. PVLDB 10(11), 2017.

SLIDE 20

Smart Storage in Databases: Filter push-down

Intel Hyperscan library (Xeon E5-2680 v2) 2.8x

SELECT … FROM customer WHERE age < 35 AND purchases > 2 AND address LIKE '%PO. Box 123%'

▪ Challenge: guarantee that filtering never slows down retrieval
▪ Algorithms can be re-imagined to become bandwidth-bound instead of compute-bound

▪ Extend the state of the art: parameterization without re-programming [FCCM16]
▪ Many options: Regular expressions, comparisons, decompression, …

[FCCM16] Runtime Parameterizable Regular Expression Operators for Databases. Zs. Istvan, D. Sidler, G. Alonso. FCCM’16
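To illustrate filter push-down, here is a minimal software sketch, not the hardware implementation from [FCCM16]. The function names and the sample rows are assumptions made for illustration. The predicate mirrors the WHERE clause of the example query; `storage_scan` models what the storage node would send to the database, with and without the pushed-down filter.

```python
def predicate(row):
    # age < 35 AND purchases > 2 AND address LIKE '%PO. Box 123%'
    return (row["age"] < 35
            and row["purchases"] > 2
            and "PO. Box 123" in row["address"])


def storage_scan(table, pushed_filter=None):
    """Yield the rows that leave the storage node."""
    for row in table:
        if pushed_filter is None or pushed_filter(row):
            yield row


# Hypothetical sample data
customers = [
    {"age": 29, "purchases": 5, "address": "PO. Box 123, Madrid"},
    {"age": 41, "purchases": 7, "address": "PO. Box 123, Zurich"},
    {"age": 33, "purchases": 1, "address": "PO. Box 123, Madrid"},
    {"age": 25, "purchases": 9, "address": "Main Street 4"},
]

shipped_without = list(storage_scan(customers))             # all rows move
filtered_on_cpu = [r for r in shipped_without if predicate(r)]
shipped_with = list(storage_scan(customers, predicate))     # only matches move

print(len(shipped_without), len(shipped_with))
```

Both plans return the same rows; the difference is how many rows cross the link between storage and the database server.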

SLIDE 21

✓ Reduce data movement bottlenecks
▪ “All or nothing” behavior makes query planning difficult
✓ Hybrid processing

The Glass Half Full…

SLIDE 22

▪ Group-by: Compute aggregate function over categories

▪ select avg(salary) from employees group by department

IBEX’s Hybrid Group-by

[Figure: CPU and Ibex with SW-only Group-By. Labels: Input table, Projection, Selection, Filtered data, Group-by, Final Groups]

SLIDE 23

▪ Group-by: Compute aggregate function over categories

▪ select avg(salary) from employees group by department

IBEX’s Hybrid Group-by

[Figure: CPU and Ibex with HW-only Group-By. Labels: Input table, Projection, Selection, Filtered data, Group-by, Final Groups]

SLIDE 24

▪ Group-by: Compute aggregate function over categories

▪ select avg(salary) from employees group by department

▪ What if the number of groups does not fit on the FPGA?

▪ Send partial aggregates – finalize in SW
▪ Worst case: same as no acceleration
▪ Best case: All in HW!

IBEX’s Hybrid Group-by

[Figure: CPU and Ibex with Hybrid Group-by. Labels: Input table, Projection, Selection, Group-by (twice), Filtered data, Partial Groups, Final Groups]

Challenge: How to split across accelerator and software?
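The hybrid idea can be sketched in software as follows. This is a simplified model under stated assumptions, not IBEX's actual hardware logic: the function names, the overflow policy (rows for groups that do not fit are forwarded unaggregated), and the sample data are all hypothetical. AVG is carried as (sum, count) pairs so that partial aggregates can be merged correctly.

```python
from collections import defaultdict


def hw_group_by(rows, capacity):
    """Simulated accelerator stage holding at most `capacity` groups.

    Returns partial aggregates as (group, sum, count) tuples. Rows whose
    group does not fit are forwarded as single-row partial aggregates.
    """
    table = {}      # group -> [sum, count], bounded by `capacity`
    overflow = []   # rows passed through without aggregation
    for group, value in rows:
        if group in table:
            table[group][0] += value
            table[group][1] += 1
        elif len(table) < capacity:
            table[group] = [value, 1]
        else:
            overflow.append((group, value, 1))
    return [(g, s, c) for g, (s, c) in table.items()] + overflow


def sw_finalize(partials):
    """Software stage: merge partial aggregates and compute the averages."""
    merged = defaultdict(lambda: [0, 0])
    for group, total, count in partials:
        merged[group][0] += total
        merged[group][1] += count
    return {g: s / c for g, (s, c) in merged.items()}


# select avg(salary) from employees group by department (hypothetical rows)
employees = [("a", 10), ("b", 20), ("a", 30), ("c", 40), ("c", 60)]
print(sw_finalize(hw_group_by(employees, capacity=2)))
```

With a capacity of 3 or more, every group fits and the software stage only divides (best case). With a capacity of 0, every row is forwarded and the software does all the work (worst case, same as no acceleration). The result is identical in all cases.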

SLIDE 25

✓ Reduce data movement bottlenecks
✓ Hybrid Processing
▪ Analytical databases becoming more optimized / not much compute in core SQL
✓ Emerging compute-intensive workloads

The Glass Half Full

SLIDE 26

▪ Databases adopting new ways of analyzing the data

▪ SAP Hana, Oracle, SQL Server, etc.

▪ Specialized hardware can help with both model building [Kara18] and inference [Owaida18]
▪ Benefits for “classical” algorithms as well

[Kara18] Kara et al: ColumnML: Column-Store Machine Learning with On-The-Fly Data Transformation. PVLDB 12(4): 348-361 (2018)
[Owaida18] Owaida et al: Application Partitioning on FPGA Clusters: Inference over Decision Tree Ensembles. FPL 2018: 295-300

The Rise of Machine Learning

SLIDE 27

CPU FPGA Co-processor

doppioDB: a hybrid database engine

[Figure: Database Engine (MonetDB) with Software Operators and Hardware Operators]

▪ Goal: extend the capabilities of analytical databases

▪ FPGA works on the same data as software (cache-coherent access)
▪ Can combine SW and HW operators inside the same query

▪ Challenge: ensure high utilization of FPGA, use in many queries

DRAM (DB Tables)
No data copy, transformation, partitioning, etc.

SLIDE 28

K-means – Algorithm

◼ Goal: partition unlabeled data into several clusters, where the number of clusters is the “k” in k-means.
◼ Two steps in each iteration:
◼ Assignment: assign data points to the closest centroid according to a distance metric
◼ Centroid update: the centroids are recalculated by averaging all the data points within each cluster
◼ Long process if the data set and number of iterations are large
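The two steps per iteration can be sketched in plain Python as below. This is a minimal reference version for illustration: the function names are assumptions, squared Euclidean distance is assumed as the metric, and an empty cluster simply keeps its previous centroid.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, centroids, iterations):
    """Run a fixed number of k-means iterations; k is len(centroids)."""
    for _ in range(iterations):
        # Assignment: each point goes to its closest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Centroid update: average the points in each cluster.
        centroids = [
            tuple(sum(dim) / len(members) for dim in zip(*members))
            if members else tuple(old)
            for members, old in zip(clusters, centroids)
        ]
    return centroids


points = [(0, 0), (0, 2), (10, 10), (10, 12)]
print(kmeans(points, centroids=[(0, 0), (10, 10)], iterations=5))
```

Every iteration touches every point and every centroid, which is why the run time grows with both the data set size and the number of iterations.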

SLIDE 29

DRAM (DB Tables)

Design – Execution Walk-Through

1. Receives K-Means parameters
2. Fetches the initial centroids and the data
3. Calculates the distance between a data point and all the centroids and assigns it to the closest centroid
4. Accumulates data points per cluster and counts how many data points are assigned to each cluster
5. Collects partial results from each pipeline
6. Performs the division that updates each centroid
7. Writes back the final results

Zhenhao He, David Sidler, Zsolt István, Gustavo Alonso: A Flexible K-Means Operator for Hybrid Databases. FPL 2018
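The accumulate, collect, and divide steps of the walk-through can be mimicked in software as follows. This is a sketch under assumptions, not the FPGA design from the paper: the function names and the round-robin split of the data are hypothetical, and the "pipelines" run one after another here rather than in parallel.

```python
def nearest(point, centroids):
    """Index of the closest centroid (squared Euclidean distance assumed)."""
    return min(range(len(centroids)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(point, centroids[i])))


def pipeline(chunk, centroids):
    """One simulated pipeline: per-cluster sums and counts for its chunk."""
    dims = len(centroids[0])
    sums = [[0.0] * dims for _ in centroids]
    counts = [0] * len(centroids)
    for p in chunk:
        c = nearest(p, centroids)
        counts[c] += 1
        for d in range(dims):
            sums[c][d] += p[d]
    return sums, counts


def kmeans_iteration(points, centroids, num_pipelines):
    k, dims = len(centroids), len(centroids[0])
    # Split the data across pipelines (round-robin split is an assumption).
    chunks = [points[i::num_pipelines] for i in range(num_pipelines)]
    partials = [pipeline(chunk, centroids) for chunk in chunks]
    # Collect partial results from each pipeline.
    total_sums = [[0.0] * dims for _ in range(k)]
    total_counts = [0] * k
    for sums, counts in partials:
        for c in range(k):
            total_counts[c] += counts[c]
            for d in range(dims):
                total_sums[c][d] += sums[c][d]
    # Division: new centroid = accumulated sum / count (empty clusters unchanged).
    return [
        tuple(s / total_counts[c] for s in total_sums[c])
        if total_counts[c] else tuple(centroids[c])
        for c in range(k)
    ]


points = [(0, 0), (0, 2), (10, 10), (10, 12)]
print(kmeans_iteration(points, [(0, 0), (10, 10)], num_pipelines=2))
```

Because sums and counts are additive, the result does not depend on how many pipelines the data is split across.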

SLIDE 30

Uses of Parallelism

[Figure panels: K is known / Centroids known; Need to determine K (Elbow method)]

▪ K-Means algorithm

▪ FPGA outperforms several cores of the CPU
▪ Can use parallelism in two ways – cover more queries

SLIDE 31

▪ Text: Regular expression matching, Edit distance, …
▪ Database ops.: Skyline queries, Group-by aggregations, …
▪ Statistics: Histograms, Count-min sketch, Bloom filters, …
▪ Machine learning: Clustering (K-means), Stochastic Gradient Descent, Decision Trees, …
▪ Data management: Hash tables, hash functions, …
▪ [Your algorithm here]

Wide range of algorithms can benefit from hardware

SLIDE 32

✓ Reduce data movement bottlenecks
✓ Hybrid Processing
✓ Emerging compute-intensive workloads

The Glass Half Full… Future Challenges…

▪ Managing Programmable Hardware accelerators

▪ Is this the job of the OS or does the DB have to take control?
▪ How to share programmable hardware across tenants?

▪ Compilation/synthesis of hardware accelerators

▪ Can we derive accelerators from user queries?
▪ Intermediary DSL or building blocks we could use?

For more details, see: The Glass Half Full: Using Programmable Hardware Accelerators in Analytics. Z. István. IEEE Data Engineering Bulletin, March 2019.