
DSC 102: Systems for Scalable Analytics

Topic 5: Dataflow Systems
Chapter 2.2 of MLSys Book

Arun Kumar


Parallel RDBMSs

❖ Parallel RDBMSs are highly successful and widely used
❖ They offer massive scalability (shared-nothing parallelism) and high performance (parallel relational dataflows), along with many other enterprise-grade benefits of RDBMSs:
  ❖ Full power of SQL
  ❖ Business intelligence dashboards/APIs on top
  ❖ Transaction management and crash recovery
  ❖ Index structures, compressed file formats, auto-tuning, etc.

Q: So, why did people need to go beyond parallel RDBMSs?


Beyond RDBMSs: A Brief History

❖ Relational model and RDBMSs are too restrictive:

  1. “Flat” tables with few data/attribute types
  2. Restricted language interface (SQL)
  3. Need to know schema first!
  4. Optimized for static datasets

❖ But the DB community has addressed these issues already!
  1. Object-Relational DBMSs: UDTs, UDFs, text, multimedia, etc.
  2. PL/SQL; recursive SQL; embedded SQL; QBE; visual interfaces
  3. “Schema-later” semi-structured XML data model; XQuery
  4. Stream data model; “standing” queries; time windows
❖ Ad: Take CSE 132B and CSE 135 to learn such extensions


Q: Again, so, why did people still need to go beyond parallel RDBMSs?!


Beyond RDBMSs: A Brief History

❖ DB folks underappreciated 4 key concerns of Web folks:
  1. Developability
  2. Fault Tolerance
  3. Elasticity
  4. Cost/Politics!
❖ The DB community got blindsided by the unstoppable rise of the Web/Internet giants!


DB/Enterprise vs. Web Dichotomy

❖ DB folks underappreciated 4 key concerns of Web folks:
❖ Developability: RDBMS extensibility mechanisms (UDTs, UDFs, etc.) are too painful to use for programmers!
  ❖ DB companies: we write the software and sell it to our customers, viz., enterprise companies (banks, retail, etc.)
  ❖ Web companies: we will hire an army of software engineers to build our own in-house software systems!
  ❖ Need simpler APIs and DBMSs that scale custom programs


DB/Enterprise vs. Web Dichotomy

❖ DB folks underappreciated 4 key concerns of Web folks:
❖ Fault Tolerance: What if we run on 100Ks of machines?!
  ❖ DB companies: our customers do not need more than a few dozen machines to store and analyze their data!
  ❖ Web companies: we need hundreds of thousands of machines for planetary-scale Web services! If a machine fails, the user should not have to rerun the entire query!
  ❖ DBMS should take care of fault tolerance, not user/appl.
  ❖ (Cloud-native RDBMSs now offer fault tolerance by design)


DB/Enterprise vs. Web Dichotomy

❖ DB folks underappreciated 4 key concerns of Web folks:
❖ Elasticity: Resources should adapt to the “query” workload
  ❖ DB companies: our customers have “fairly predictably” sized datasets and workloads; can fix their clusters!
  ❖ Web companies: our workloads could vary widely, and the datasets they need vary widely!
  ❖ Need to be able to upsize and downsize clusters easily on-the-fly, based on current query workload

DB/Enterprise vs. Web Dichotomy

❖ DB folks underappreciated 4 key concerns of Web folks:
❖ Cost/Politics: Commercial RDBMS licenses too costly!
  ❖ DB companies: our customers have $$$! ☺
  ❖ Web companies: our products are mostly free (ads?); why pay so much $$$ if we can build our own DBMSs?
  ❖ Many started with MySQL (!) but then built their own DBMSs
  ❖ New tools were free & open source; led to viral adoption!


This new breed of parallel data systems called Dataflow Systems jolted the DB folks from being smug and complacent!



Outline

❖ Beyond RDBMSs: A Brief History
❖ MapReduce/Hadoop Craze
❖ Spark and Dataflow Programming
❖ More Scalable ML with MapReduce/Spark


The MapReduce/Hadoop Craze

❖ Blame Google!
❖ “Simple” problem: index, store, and search the Web! ☺
❖ Who were their major systems hires? Jeff Dean and Sanjay Ghemawat (Systems, not DB or IR)
❖ Why did they not use RDBMSs? (Haha.)
  ❖ Developability, data model, fault tolerance, scale, cost, …
  ❖ Engineers started with MySQL; abandoned it!


What is MapReduce?

❖ Programming model for writing programs on sharded data + distributed system architecture for processing large data
❖ Map and Reduce are terms/ideas from functional PL
❖ Engineer only implements the logic of Map and Reduce
❖ System implementation handles orchestration of data distribution, parallelization, etc. under the covers
❖ Was radically easier for engineers to write programs with!

MapReduce: Simplified Data Processing on Large Clusters. In OSDI 2004.


What is MapReduce?

❖ Standard example: count word occurrences in a doc corpus
❖ Input: A set of text documents (say, webpages)
❖ Output: A dictionary of unique words and their counts
❖ Hmmm, sounds suspiciously familiar … ☺

Part of the MapReduce API:

    function map(String docname, String doctext):
        for each word w in doctext:
            emit(w, 1)

    function reduce(String word, Iterator partialCounts):
        sum = 0
        for each pc in partialCounts:
            sum += pc
        emit(word, sum)
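To make the semantics concrete, here is a minimal, self-contained Python sketch that simulates the same Map/Reduce flow in memory; the shuffle is emulated with a dictionary, and names like run_mapreduce are our own, not part of any Hadoop API.

```python
from collections import defaultdict

def map_fn(docname, doctext):
    # Map(): runs independently per record; emits (word, 1) pairs
    for word in doctext.split():
        yield (word, 1)

def reduce_fn(word, partial_counts):
    # Reduce(): sees the Iterator of all values for one key
    yield (word, sum(partial_counts))

def run_mapreduce(records, map_fn, reduce_fn):
    # Emulated shuffle: group all map outputs by key,
    # as the system would do across machines
    groups = defaultdict(list)
    for key, value in records.items():
        for k, v in map_fn(key, value):
            groups[k].append(v)
    return dict(kv for k, vals in groups.items() for kv in reduce_fn(k, vals))

docs = {"d1": "the cat sat", "d2": "the cat ran"}
print(run_mapreduce(docs, map_fn, reduce_fn))
# {'the': 2, 'cat': 2, 'sat': 1, 'ran': 1}
```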


How MapReduce Works

Parallel flow of control and data during MapReduce execution:
❖ Under the covers, each Mapper and Reducer is a separate process; Reducers face barrier synchronization (BSP)
❖ Fault tolerance achieved using data replication


Abstract Semantics of MapReduce

❖ Map(): Operates independently on one “record” at a time
  ❖ Can batch multiple data examples on to one record
  ❖ Dependencies across Mappers not allowed
  ❖ Can emit 1 or more key-value pairs as output
  ❖ Data types of inputs and outputs can be different!
❖ Reduce(): Gathers all Map output pairs across machines with the same key into an Iterator (list)
  ❖ Aggregation function applied on the Iterator to produce the final output
❖ Input Split:
  ❖ Physical-level split/shard of dataset that batches multiple examples into one file “block” (~128MB default on HDFS)
  ❖ Custom Input Splits can be written by appl. user


Benefits of MapReduce

❖ Goal: Higher-level abstraction of functional operations (Map; Reduce) to simplify data-parallel programming at scale
❖ Key Benefits:
  ❖ Out-of-the-box scalability and cluster parallelism
  ❖ Fault tolerance offloaded to system impl., not appl./user
  ❖ Map() and Reduce() can be highly general; no restrictions on data types/structures processed; easier to use for ETL and text/multimedia-oriented analytics
  ❖ Free and OSS implementations available (Hadoop)
❖ New burden on users: Converting computations of a data-intensive program/operation to the Map() + Reduce() API
  ❖ But MapReduce libraries available in multiple PLs to mitigate coding pains: Java, C++, Python, R, Scala, etc.


Emulate MapReduce in SQL?

Q: How would you do the word counting in an RDBMS / in SQL?

❖ First step: Transform text docs into relations and load; part of the Extract-Transform-Load (ETL) stage
  ❖ Suppose we pre-divide each document into words and have the schema: DocWords (DocName, Word)
❖ Second step: a single, simple SQL query!

    SELECT Word, COUNT(*)
    FROM DocWords
    GROUP BY Word
    [ORDER BY Word]

❖ Parallelism, scaling, etc. done by the RDBMS under the covers


More MR Examples: Select

❖ Input Split (part of ETL): Let table be sharded tuple-wise
❖ Map(): On tuple, apply selection condition; emit pair with dummy key and entire tuple as value
❖ Reduce(): Not needed! No cross-shard aggregation here
❖ Such kinds of tasks/jobs are called “Map-only” jobs
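A minimal Python sketch of such a Map-only selection, in the same style as the word-count simulation above; the predicate and field names are made up for illustration.

```python
def select_map(_, row):
    # Map-only job: apply the selection condition per tuple; no Reduce()
    # Hypothetical predicate: keep rows whose 'age' exceeds 30
    if row["age"] > 30:
        yield (None, row)  # dummy key; entire tuple as value

shard = [{"name": "a", "age": 25}, {"name": "b", "age": 40}]
print([v for row in shard for _, v in select_map(None, row)])
# [{'name': 'b', 'age': 40}]
```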


More MR Examples: Simple Agg.

❖ Input Split (part of ETL): Let table be sharded tuple-wise
❖ Assume it is an algebraic aggregate (SUM, AVG, MAX, etc.)
❖ Map(): On agg. attribute, compute incremental stats; emit pair with single global dummy key and stats as value
❖ Reduce(): Since only one dummy key across all shards, Iterator has all suff. stats that can be unified to get result
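A sketch for the special case of AVG, whose sufficient statistics are a partial (sum, count) per tuple; this plugs into the run_mapreduce simulator above, and the column name x is hypothetical.

```python
def avg_map(_, row):
    # Incremental stats per tuple: (partial_sum, partial_count)
    yield ("GLOBAL", (row["x"], 1))

def avg_reduce(key, stats):
    # Single dummy key: unify sufficient stats from all shards
    total = count = 0
    for s, c in stats:
        total += s
        count += c
    yield (key, total / count)

rows = {0: {"x": 10.0}, 1: {"x": 20.0}, 2: {"x": 60.0}}
print(run_mapreduce(rows, avg_map, avg_reduce))  # {'GLOBAL': 30.0}
```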


More MR Examples: GROUP BY Agg.

❖ Input Split (part of ETL): Let table be sharded tuple-wise
❖ Assume it is an algebraic aggregate (SUM, AVG, MAX, etc.)
❖ Map(): On agg. attribute, compute incremental stats; emit pair with grouping attribute as key and stats as value
❖ Reduce(): Iterator has all suff. stats for a single group; unify those to get result for that group; different reducers handle different groups
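The only change from the simple-aggregate sketch is the key: the grouping attribute instead of a global dummy. The column names dept and salary are made up; this computes AVG(salary) GROUP BY dept.

```python
def gb_avg_map(_, row):
    # Key = grouping attribute; value = sufficient stats for this tuple
    yield (row["dept"], (row["salary"], 1))

def gb_avg_reduce(dept, stats):
    # Each reduce call sees all sufficient stats of exactly one group
    total = count = 0
    for s, c in stats:
        total += s
        count += c
    yield (dept, total / count)

rows = {0: {"dept": "A", "salary": 100.0},
        1: {"dept": "B", "salary": 50.0},
        2: {"dept": "A", "salary": 200.0}}
print(run_mapreduce(rows, gb_avg_map, gb_avg_reduce))
# {'A': 150.0, 'B': 50.0}
```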


More MR Examples: Matrix Norm

❖ Input Split (part of ETL): Let matrix be sharded tile-wise
❖ Assume it is an algebraic aggregate (Lp,q norm)
❖ Very similar to simple aggregate!
❖ Map(): On agg. attribute, compute incremental stats; emit pair with single global dummy key and stats as value
❖ Reduce(): Since only one dummy key across all shards, Iterator has all suff. stats that can be unified to get result
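A sketch for the special case of the Frobenius norm (p = q = 2): the sufficient statistic per tile is its sum of squared entries, and the final square root happens in Reduce(). Tiles are NumPy arrays here, and this again reuses the run_mapreduce simulator.

```python
import numpy as np

def norm_map(_, tile):
    # Incremental stat per tile: sum of squared entries
    yield ("GLOBAL", float(np.sum(tile ** 2)))

def norm_reduce(key, partial_sums):
    # Single dummy key: unify partial sums, then finish with a sqrt
    yield (key, float(np.sqrt(sum(partial_sums))))

tiles = {0: np.array([[1.0, 2.0]]), 1: np.array([[2.0], [4.0]])}
print(run_mapreduce(tiles, norm_map, norm_reduce))  # {'GLOBAL': 5.0}
```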


Analogue: Parallel RDBMS UDA

❖ Recall how the word count MapReduce can be done with SQL:

    SELECT Word, COUNT(*)
    FROM DocWords
    GROUP BY Word

Q: How can we compute other aggregates not native to SQL?

❖ The User-Defined Aggregate Function (UDAF) abstraction in parallel RDBMSs can do MapReduce-like computations
❖ MapReduce seems more intuitive and succinct to many!


Analogue: Parallel RDBMS UDA

❖ 4 main functions in the UDAF API to work with the BSP model
❖ Aggregation state: data structure computed (independently) by workers and unified by master
❖ Initialize(): Set up info./initialize RAM for agg. state; runs independently on each worker
❖ Transition(): Per-tuple function run by worker to update its agg. state; analogous to Map() in MapReduce
❖ Merge(): Function that combines agg. states from workers; run by master after workers are done; analogous to Reduce()
❖ Finalize(): Run once at the end by master to return final result
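A minimal Python sketch of this four-function pattern, using AVG as the aggregate. The class and method names mirror the API roles above but are our own; they are not any specific RDBMS's UDAF syntax.

```python
class AvgUDA:
    def initialize(self):
        # Agg. state: (running_sum, running_count), one per worker
        return (0.0, 0)

    def transition(self, state, x):
        # Per-tuple update, run independently on each worker
        s, c = state
        return (s + x, c + 1)

    def merge(self, a, b):
        # Master combines the workers' agg. states
        return (a[0] + b[0], a[1] + b[1])

    def finalize(self, state):
        # Master turns the unified state into the final answer
        s, c = state
        return s / c if c else None

# Simulate two workers, each scanning its own shard, then merge on master
uda = AvgUDA()
states = []
for shard in [[10.0, 20.0], [60.0]]:
    st = uda.initialize()
    for x in shard:
        st = uda.transition(st, x)
    states.append(st)
merged = states[0]
for st in states[1:]:
    merged = uda.merge(merged, st)
print(uda.finalize(merged))  # 30.0
```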


Example Parallel RDBMS UDA: BGD

❖ BGD’s gradient (and loss) computation can be easily scaled as an RDBMS UDAF:
  ❖ Initialize(): Alloc. memory for model and gradient
  ❖ Transition(): Compute gradient as running info on tuple
  ❖ Merge(): Add partial sums from workers to get full gradient
  ❖ Finalize(): No-op; just return gradient
❖ Commands for controlling epochs, convergence, etc. can be done external to the RDBMS, since they do not need the dataset
  ❖ E.g., Apache MADlib does this with Python + Greenplum

Q: How will you do the above with MapReduce instead?
Q: What about mini-batch SGD?
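A sketch of one epoch of BGD in this UDA pattern, instantiated for least-squares linear regression; the loss choice, NumPy representation, and learning rate are our own illustration, not MADlib's code.

```python
import numpy as np

def initialize(w):
    # Agg. state: a partial gradient of the same shape as the model w
    return np.zeros_like(w)

def transition(grad, w, x, y):
    # Per-tuple gradient of squared loss: (w.x - y) * x
    return grad + (w @ x - y) * x

def merge(g1, g2):
    # Partial gradients from workers simply add up
    return g1 + g2

def finalize(grad):
    return grad  # no-op

# Epoch loop lives outside the "RDBMS"; it never scans the dataset itself
w = np.zeros(2)
data = [(np.array([1.0, 2.0]), 5.0), (np.array([2.0, 1.0]), 4.0)]
for _ in range(200):
    g = initialize(w)
    for x, y in data:          # Transition(), tuple by tuple
        g = transition(g, w, x, y)
    w -= 0.05 * finalize(g)    # one gradient step per epoch
print(w)  # converges toward [1. 2.]
```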


What is Hadoop then?

❖ Open-source system implementation with MapReduce as prog. model and HDFS as distr. filesystem
❖ Map and Reduce functions in API; input splitting, data distribution, shuffling, fault tolerance, etc. all handled by the Hadoop library under the covers
❖ Exploded in popularity: 100s of papers, 10s of products
❖ A real “revolution” in scalable parallel data processing that took the DB community by surprise!

NB: Do not confuse MapReduce for Hadoop or vice versa!


A Spectacular “War of the Worlds”

Scathing reaction from the DB world:
❖ “No declarativity!”
❖ “Filescan-based!”
❖ “DeWitt’s work on parallel DBMSs!”
❖ “Cheap rip-off of RDBMSs!”


“Young Turks” vs. “Old Guard”?

Swift and scathing rebuttal from the MapReduce/Hadoop world:
❖ DBMSs too high-level/hard to use for low-level text ETL
❖ Meant for “offline” fault-tolerant workloads on cheap nodes
❖ Google awarded a patent for MapReduce (ahem)!
❖ MapReduce/Hadoop not meant to be an RDBMS replacement


Enter Hybrid Systems!

❖ Clever DB researchers: “Let’s get the best of both worlds!”
❖ Numerous projects on hybrid systems in industry/academia:
  ❖ Programming model-level: Bring declarative programming from the RDBMS world to the MapReduce/Hadoop world, e.g., an SQL dialect over Hadoop or a dataflow language over Hadoop
  ❖ Systems-level: Intermix system implementation ideas, e.g., HadoopDB from Yale U. and Microsoft Polybase


Outline

❖ Beyond RDBMSs: A Brief History
❖ MapReduce/Hadoop Craze
❖ Spark and Dataflow Programming
❖ More Scalable ML with MapReduce/Spark


Spark from UC Berkeley

❖ Extended dataflow programming model (subsumes most rel. ops and MapReduce); system (re)designed from the ground up
❖ Inspired by Python Pandas style of function calls for ops
❖ Agenda: Unified system to handle relations, text, etc.; support more general distributed data processing
❖ Tons of sponsors, gazillion bucks, unbelievable hype!
❖ Key idea: exploit distributed memory to cache data
❖ Key novelty: lineage-based fault tolerance, not replication
❖ Open-sourced to Apache; commercialized as Databricks


Spark’s Dataflow Programming Model

Resilient Distributed Datasets: A Fault-tolerant Abstraction for In-memory Cluster Computing. In NSDI 2012

❖ Transformations are relational ops, MR, etc. as functions
❖ Actions are what force computation; aka lazy evaluation


Word Count Example in Spark

❖ Spark RDD API available in Python, Scala, Java, and R
❖ DataFrame API and SparkSQL also offer an SQL interface that can be interleaved with such function calls
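The slide's code figure does not survive in this export; below is a minimal PySpark RDD sketch of word count in the usual flatMap/map/reduceByKey style (the HDFS paths are placeholders).

```python
from pyspark import SparkContext

sc = SparkContext(appName="WordCount")

counts = (sc.textFile("hdfs://.../docs")            # placeholder input path
            .flatMap(lambda line: line.split())     # transformation
            .map(lambda w: (w, 1))                  # transformation
            .reduceByKey(lambda a, b: a + b))       # transformation

counts.saveAsTextFile("hdfs://.../wordcounts")      # action: forces computation
```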


Spark DataFrame API and SparkSQL

❖ Databricks now recommends DataFrame API/SparkSQL; avoid the RDD API unless absolutely needed!
❖ Key Reason: Automatic query optimization more feasible with SparkSQL/DataFrame than with RDD
❖ AKA (painfully) re-learn 40 years of DB research ☺

Spark SQL: Relational Data Processing in Spark. In SIGMOD 2015.
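For contrast with the RDD version above, a sketch of the same word count with the DataFrame API; split and explode are standard Spark SQL functions, and the path is again a placeholder. The declarative, relational plan is what makes this form optimizer-friendly.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode, split

spark = SparkSession.builder.appName("WordCountDF").getOrCreate()

lines = spark.read.text("hdfs://.../docs")   # one 'value' column per line
counts = (lines
          .select(explode(split(col("value"), r"\s+")).alias("word"))
          .groupBy("word")
          .count())   # relational plan; the optimizer can rewrite it

counts.show()
```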


How does Spark work?

Databricks is building yet another parallel RDBMS! ☺

Resilient Distributed Datasets: A Fault-tolerant Abstraction for In-memory Cluster Computing. In NSDI 2012


Comparing Spark’s APIs

❖ RDD: Low abstraction level; no named columns; no support for query optimization; map-reduce programming mode; best suited for unstructured data, low-level ops, and folks who like func. PLs and MapReduce
❖ DataFrame: High abstraction level; named columns; supports query optimization; SQL programming mode; best suited for structured data, high-level ops, and folks who know SQL, Python, R
❖ Koalas: High abstraction level; named columns; supports query optimization; SQL/Pandas programming mode; best suited for structured data; lower barrier to entry for folks who only know Pandas or Dask

Check out Yuhao/Supun’s PA 2 slides for more on Spark APIs


Reinventing the Wheel?


The Berkeley Data Analytics Stack (BDAS)

Spark-based Ecosystem of Tools


Other Dataflow Systems

❖ Stratosphere/Apache Flink from TU Berlin
❖ Myria from U Washington
❖ AsterixDB from UC Irvine
❖ Azure Data Lakes from Microsoft
❖ Google Cloud Dataflow
❖ …


References and More Material

❖ MapReduce/Hadoop:
  ❖ MapReduce: Simplified Data Processing on Large Clusters. Jeffrey Dean and Sanjay Ghemawat. In OSDI 2004.
  ❖ More Examples: http://bit.ly/2rkSRj8
  ❖ Online Tutorial: http://bit.ly/2rS2B5j
❖ Spark:
  ❖ Resilient Distributed Datasets: A Fault-tolerant Abstraction for In-memory Cluster Computing. Matei Zaharia and others. In NSDI 2012.
  ❖ More Examples: http://bit.ly/2rhkhEp, http://bit.ly/2rkT8Tc
  ❖ Online Tutorial: http://bit.ly/2r8lW0S


Outline

❖ Beyond RDBMSs: A Brief History
❖ MapReduce/Hadoop Craze
❖ Spark and Dataflow Programming
❖ More Scalable ML with MapReduce/Spark
  ❖ K-Means Clustering


Primer: K-Means Clustering

❖ Basic Idea: Identify clusters of examples based on Euclidean distances; formulated as an optimization problem
❖ Lloyd’s algorithm is the most popular heuristic for K-Means
❖ Input: n x d examples/points in numeric space
❖ Output: k clusters of those points and their centroids

  1. Initialize k centroid vectors and point-cluster ID assignment
  2. Assignment step: Scan dataset and assign each point to a cluster ID based on which centroid is nearest
  3. Update step: Given new assignment, scan dataset again to recompute centroids for all clusters
  4. Repeat 2 and 3 until convergence or a fixed # of iterations (see the sketch below)
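A compact single-node NumPy sketch of Lloyd's algorithm as described above; initializing centroids by sampling k points is one common choice, not the only one.

```python
import numpy as np

def lloyds(X, k, iters=20, seed=0):
    """X: n x d points. Returns (cluster assignments, k x d centroids)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # step 1
    for _ in range(iters):                                    # step 4
        # Step 2 (Assignment): nearest centroid per point, by Euclidean distance
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Step 3 (Update): recompute each non-empty cluster's centroid
        for j in range(k):
            if np.any(assign == j):
                centroids[j] = X[assign == j].mean(axis=0)
    return assign, centroids

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5.0])
assign, centroids = lloyds(X, k=2)
print(centroids)  # roughly [0, 0] and [5, 5]
```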

K-Means Clustering in MapReduce

❖ Input Split (part of ETL): Let table be sharded tuple-wise
❖ Assume each tuple/example/point has an ExampleID
❖ Need 2 jobs! 1 for the Assignment step and 1 for the Update step
❖ 2 external data structures needed for both jobs: k x d centroid matrix A (dense); n x k assignment matrix B (ultra-sparse)
❖ A and B initially broadcast to all Mappers via HDFS; Mappers can read small data directly as files on HDFS
❖ Job 1 reads A and creates new B
❖ Job 2 reads B and creates new A


K-Means Clustering in MapReduce

❖ 2 external data structures: k x d centroid matrix A (dense); n x k assignment matrix B (ultra-sparse)
❖ Job 1 Map(): Read A from HDFS; compute point’s distance to all k centroids; get nearest centroid; emit new assignment as output pair (PointID, ClusterID)
❖ No Reduce() for Job 1; new B now available on HDFS
❖ Job 2 Map(): Read B from HDFS; look up which cluster the point got assigned to; emit point as output pair (ClusterID, point vector)
❖ Job 2 Reduce(): Iterator has all point vectors of a given ClusterID; add them up and divide by count to get the new centroid; emit output pair (ClusterID, centroid vector)
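A minimal sketch of the two jobs' per-record logic in the emit style used earlier; here A is the broadcast k x d NumPy centroid matrix, B a dict from PointID to ClusterID, and the HDFS reads are elided.

```python
import numpy as np

def job1_map(point_id, x, A):
    # Job 1 (Map-only, Assignment step): A is the broadcast centroid matrix
    dists = np.linalg.norm(A - x, axis=1)
    yield (point_id, int(dists.argmin()))    # (PointID, ClusterID) -> new B

def job2_map(point_id, x, B):
    # Job 2 Map (Update step): B is the broadcast assignment from Job 1
    yield (B[point_id], x)                   # (ClusterID, point vector)

def job2_reduce(cluster_id, points):
    # Job 2 Reduce: average all points of one cluster into its new centroid
    pts = np.vstack(list(points))
    yield (cluster_id, pts.mean(axis=0))     # (ClusterID, centroid vector) -> new A
```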