Analytical Query Processing Marco Serafini COMPSCI 532 Lecture 7 - - PowerPoint PPT Presentation

analytical query processing
SMART_READER_LITE
LIVE PREVIEW

Analytical Query Processing Marco Serafini COMPSCI 532 Lecture 7 - - PowerPoint PPT Presentation

Analytical Query Processing Marco Serafini COMPSCI 532 Lecture 7 Announcement Midterm date and location confirmed October 22 at 7-9pm in ILC S331 2 2 MapReduce vs. DBMSs 3 Advantages of DBMSs Abstract data representation


slide-1
SLIDE 1

Analytical Query Processing

Marco Serafini

COMPSCI 532 Lecture 7

slide-2
SLIDE 2

22

Announcement

  • Midterm date and location confirmed
  • October 22 at 7-9pm in ILC S331
slide-3
SLIDE 3

3

MapReduce vs. DBMSs

slide-4
SLIDE 4

44

Advantages of DBMSs

  • Abstract data representation
  • Relational model
  • Data storage is delegated to the DBMS
  • Functional query language (SQL)
  • Queries specified as simple relational operators
  • Actual query execution delegated to the DBMS…
  • … including parallelism, distribution, pipelining etc.
  • Support for indexing
slide-5
SLIDE 5

55

Disadvantages of DBMSs

  • SQL is a limited interface for complex analytics
  • E.g. image analysis, creating maps
  • Need to define a schema for data a priori
  • High cost of loading data and indexing
  • Can be amortized only if same data and schema reused
  • Too complex for “one shot” analytics
slide-6
SLIDE 6

66

Advantages of MapReduce

  • Support for arbitrary UDFs
  • Support for a variety of arbitrary data formats
  • Simple API
  • Scalability
slide-7
SLIDE 7

77

Disadvantages

  • Many of the optimizations of DBMS must be

reimplemented, for example

  • Indices
  • Query execution plans (logical + physical)
  • Column-based storage
  • Data format specifications (ProtoBuf)
  • Support for updates
  • Several efforts towards closing the gap for analytics
slide-8
SLIDE 8

8

Data Analytics

slide-9
SLIDE 9

9

9

In Situ Analytics

  • Data dumped on GFS/HDFS (data lake)
  • Some of this data is relational
  • Several systems to execute relational queries on

HDFS data

  • SQL-like language
  • Query optimization
  • Columnar data representation
  • Can build on top of MR/Spark (e.g. Hive, SparkSQL)
  • r not (e.g. Dremel, Impala, Presto)
  • We will discuss both classes
slide-10
SLIDE 10

10

10

Analytical Queries

  • Long-running, complex queries
  • Often aggregates
  • Run on read-only data or snapshots of dynamic data
  • Data characteristics
  • Tuples (rows) have many possible attributes (columns)
  • A row will have only a subset of attributes set
slide-11
SLIDE 11

11

11

Star Schema: Facts and Dimensions

  • Popular schema for analytics/data warehousing
  • Many others exist!
  • At the center is a large fact table
  • Foreign-key references to small dimension tables
slide-12
SLIDE 12

12

slide-13
SLIDE 13

13 13

Dremel

  • In-situ analytics
  • Independent query executor (not on top of MR)
  • Uses columnar store
slide-14
SLIDE 14

14

14

Data Model: Column Families

  • Also called column groups, nested columns
  • Common to many systems, e.g. Cassandra, HBase

New record Nested in repeated Language Nested in repeated Name

slide-15
SLIDE 15

15

15

Assembling a Row

  • Finite state machine
slide-16
SLIDE 16

16

16

Pros and Cons of Columnar Model

  • Pros
  • Compression: columns have uniform values
  • Less data to scan on projections (which are common)
  • Cons
  • Additional CPU load to decompress columns and rebuild rows
slide-17
SLIDE 17

17

17

Query Execution

slide-18
SLIDE 18
slide-19
SLIDE 19

19 19

SparkSQL: Spark + DBMS

  • Extend Spark with
  • Simple, high-level SQL-like operators
  • Query optimization
  • No need to transfer data across systems
  • ETL, query processing, complex analytics in one system
slide-20
SLIDE 20

20

20

Architecture

slide-21
SLIDE 21

21

21

DataFrames

  • Collection of rows with homogeneous schema
  • Like a table in a DBMS
  • Can be manipulated like an RDD
  • DataFrame operations
  • Similar to Python Pandas or R data frames
  • Evaluated lazily (query planning is postponed)
  • Can optimize across multiple queries
slide-22
SLIDE 22

22

22

SparkSQL Query Execution

slide-23
SLIDE 23

23

23

Advantages

  • Relational structure enables query optimization
  • In-memory caching using columnar representation
  • Better compression
  • Mix SQL-like operators and arbitrary code
  • More flexible than UDFs in DBMSs
  • Can optimize across multiple SQL operations
slide-24
SLIDE 24

24

24

Catalyst

  • Query optimizer of SparkSQL
  • Rule-based optimization
  • Rule: find pattern and transform
  • Used for both logical and physical plans
  • Can customize rules
  • Code generation
  • Directly outputs bytecode (as opposed to interpreting a plan)
  • Much more CPU efficient
  • Flexible data sources
  • Can change the physical representation of DataFrames
  • Still use the optimizer
slide-25
SLIDE 25

25

25

Catalyst: Rule-Based Optimization

  • Apply rules to subtree until fixed point

Execution tree Transformation rules

slide-26
SLIDE 26

26

26

Catalyst: Code Generation

  • Faster than interpreting a physical plan