Analytical Query Processing Marco Serafini COMPSCI 532 Lecture 7 - - PowerPoint PPT Presentation

▶

May 29, 2023 403 likes •686 views

Analytical Query Processing Marco Serafini COMPSCI 532 Lecture 7 Announcement Midterm date and location confirmed October 22 at 7-9pm in ILC S331 2 2 MapReduce vs. DBMSs 3 Advantages of DBMSs Abstract data representation

SLIDE 1

Analytical Query Processing

Marco Serafini

COMPSCI 532 Lecture 7

SLIDE 2

Announcement

Midterm date and location confirmed
October 22 at 7-9pm in ILC S331

SLIDE 3

MapReduce vs. DBMSs

SLIDE 4

Advantages of DBMSs

Abstract data representation
Relational model
Data storage is delegated to the DBMS
Functional query language (SQL)
Queries specified as simple relational operators
Actual query execution delegated to the DBMS…
… including parallelism, distribution, pipelining etc.
Support for indexing

SLIDE 5

Disadvantages of DBMSs

SQL is a limited interface for complex analytics
E.g. image analysis, creating maps
Need to define a schema for data a priori
High cost of loading data and indexing
Can be amortized only if same data and schema reused
Too complex for “one shot” analytics

SLIDE 6

Advantages of MapReduce

Support for arbitrary UDFs
Support for a variety of arbitrary data formats
Simple API
Scalability

SLIDE 7

Disadvantages

Many of the optimizations of DBMS must be

reimplemented, for example

Indices
Query execution plans (logical + physical)
Column-based storage
Data format specifications (ProtoBuf)
Support for updates
Several efforts towards closing the gap for analytics

SLIDE 8

Data Analytics

SLIDE 9

In Situ Analytics

Data dumped on GFS/HDFS (data lake)
Some of this data is relational
Several systems to execute relational queries on

HDFS data

SQL-like language
Query optimization
Columnar data representation
Can build on top of MR/Spark (e.g. Hive, SparkSQL)
r not (e.g. Dremel, Impala, Presto)
We will discuss both classes

SLIDE 10

Analytical Queries

Long-running, complex queries
Often aggregates
Run on read-only data or snapshots of dynamic data
Data characteristics
Tuples (rows) have many possible attributes (columns)
A row will have only a subset of attributes set

SLIDE 11

Star Schema: Facts and Dimensions

Popular schema for analytics/data warehousing
Many others exist!
At the center is a large fact table
Foreign-key references to small dimension tables

SLIDE 12

SLIDE 13

13 13

Dremel

In-situ analytics
Independent query executor (not on top of MR)
Uses columnar store

SLIDE 14

Data Model: Column Families

Also called column groups, nested columns
Common to many systems, e.g. Cassandra, HBase

New record Nested in repeated Language Nested in repeated Name

SLIDE 15

Assembling a Row

Finite state machine

SLIDE 16

Pros and Cons of Columnar Model

Pros
Compression: columns have uniform values
Less data to scan on projections (which are common)
Cons
Additional CPU load to decompress columns and rebuild rows

SLIDE 17

Query Execution

SLIDE 18

SLIDE 19

19 19

SparkSQL: Spark + DBMS

Extend Spark with
Simple, high-level SQL-like operators
Query optimization
No need to transfer data across systems
ETL, query processing, complex analytics in one system

SLIDE 20

Architecture

SLIDE 21

DataFrames

Collection of rows with homogeneous schema
Like a table in a DBMS
Can be manipulated like an RDD
DataFrame operations
Similar to Python Pandas or R data frames
Evaluated lazily (query planning is postponed)
Can optimize across multiple queries

SLIDE 22

SparkSQL Query Execution

SLIDE 23

Advantages

Relational structure enables query optimization
In-memory caching using columnar representation
Better compression
Mix SQL-like operators and arbitrary code
More flexible than UDFs in DBMSs
Can optimize across multiple SQL operations

SLIDE 24

Catalyst

Query optimizer of SparkSQL
Rule-based optimization
Rule: find pattern and transform
Used for both logical and physical plans
Can customize rules
Code generation
Directly outputs bytecode (as opposed to interpreting a plan)
Much more CPU efficient
Flexible data sources
Can change the physical representation of DataFrames
Still use the optimizer

SLIDE 25

Catalyst: Rule-Based Optimization

Apply rules to subtree until fixed point

Execution tree Transformation rules

SLIDE 26

Catalyst: Code Generation

Faster than interpreting a physical plan