
DSC 102: Systems for Scalable Analytics

Topic 5: Dataflow Systems
Chapter 2.2 of MLSys Book

Arun Kumar


Parallel RDBMSs

❖ Parallel RDBMSs are highly successful and widely used
❖ They offer massive scalability (shared-nothing parallelism) and high performance (parallel relational dataflows), along with many other enterprise-grade benefits of RDBMSs:
  ❖ Full power of SQL
  ❖ Business intelligence dashboards/APIs on top
  ❖ Transaction management and crash recovery
  ❖ Index structures, compressed file formats, auto-tuning, etc.

Q: So, why did people need to go beyond parallel RDBMSs?


Beyond RDBMSs: A Brief History

❖ Relational model and RDBMSs are too restrictive:

  1. “Flat” tables with few data/attribute types
  2. Restricted language interface (SQL)
  3. Need to know schema first!
  4. Optimized for static datasets

❖ But the DB community has addressed these issues already!
  1. Object-Relational DBMSs: UDTs, UDFs, text, multimedia, etc.
  2. PL/SQL; recursive SQL; embedded SQL; QBE; visual interfaces
  3. “Schema-later” semi-structured XML data model; XQuery
  4. Stream data model; “standing” queries; time windows
❖ Ad: Take CSE 132B and CSE 135 to learn such extensions


Q: Again, so, why did people still need to go beyond parallel RDBMSs?!


Beyond RDBMSs: A Brief History

❖ DB folks underappreciated 4 key concerns of Web folks:
  1. Developability
  2. Fault Tolerance
  3. Elasticity
  4. Cost/Politics!
❖ The DB community got blindsided by the unstoppable rise of the Web/Internet giants!


DB/Enterprise vs. Web Dichotomy

❖ DB folks underappreciated 4 key concerns of Web folks:
❖ Developability: RDBMS extensibility mechanisms (UDTs, UDFs, etc.) are too painful to use for programmers!
  ❖ DB companies: we write the software and sell it to our customers, viz., enterprise companies (banks, retail, etc.)
  ❖ Web companies: we will hire an army of software engineers to build our own in-house software systems!
  ❖ Need simpler APIs and DBMSs that scale custom programs


DB/Enterprise vs. Web Dichotomy

❖ DB folks underappreciated 4 key concerns of Web folks:
❖ Fault Tolerance: What if we run on 100Ks of machines?!
  ❖ DB companies: our customers do not need more than a few dozen machines to store and analyze their data!
  ❖ Web companies: we need hundreds of thousands of machines for planetary-scale Web services! If a machine fails, the user should not have to rerun the entire query!
  ❖ DBMS should take care of fault tolerance, not user/appl.
  ❖ (Cloud-native RDBMSs now offer fault tolerance by design)


DB/Enterprise vs. Web Dichotomy

❖ DB folks underappreciated 4 key concerns of Web folks:
❖ Elasticity: Resources should adapt to the “query” workload
  ❖ DB companies: our customers have “fairly predictably” sized datasets and workloads; can fix their clusters!
  ❖ Web companies: our workloads could vary widely, and the datasets they need vary widely!
  ❖ Need to be able to upsize and downsize clusters easily on-the-fly, based on current query workload

DB/Enterprise vs. Web Dichotomy

❖ DB folks underappreciated 4 key concerns of Web folks:
❖ Cost/Politics: Commercial RDBMS licenses too costly!
  ❖ DB companies: our customers have $$$! ☺
  ❖ Web companies: our products are mostly free (ads?); why pay so much $$$ if we can build our own DBMSs?
  ❖ Many started with MySQL (!) but then built their own DBMSs
  ❖ New tools were free & open source; led to viral adoption!


This new breed of parallel data systems called Dataflow Systems jolted the DB folks from being smug and complacent!



Outline

❖ Beyond RDBMSs: A Brief History
❖ MapReduce/Hadoop Craze
❖ Spark and Dataflow Programming
❖ More Scalable ML with MapReduce/Spark


The MapReduce/Hadoop Craze

❖ Blame Google!
❖ “Simple” problem: index, store, and search the Web! ☺
❖ Who were their major systems hires? Jeff Dean and Sanjay Ghemawat (Systems, not DB or IR)
❖ Why did they not use RDBMSs? (Haha.)
  ❖ Developability, data model, fault tolerance, scale, cost, …
  ❖ Engineers started with MySQL; abandoned it!


What is MapReduce?

❖ Programming model for writing programs on sharded data + distributed system architecture for processing large data
❖ Map and Reduce are terms/ideas from functional PL
❖ Engineer only implements the logic of Map and Reduce
❖ System implementation handles orchestration of data distribution, parallelization, etc. under the covers
❖ Was radically easier for engineers to write programs with!

MapReduce: Simplified Data Processing on Large Clusters. In OSDI 2004.


What is MapReduce?

❖ Standard example: count word occurrences in a doc corpus
❖ Input: A set of text documents (say, webpages)
❖ Output: A dictionary of unique words and their counts
❖ Hmmm, sounds suspiciously familiar … ☺

Part of the MapReduce API:

    function map(String docname, String doctext):
        for each word w in doctext:
            emit(w, 1)

    function reduce(String word, Iterator partialCounts):
        sum = 0
        for each pc in partialCounts:
            sum += pc
        emit(word, sum)
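To make the semantics concrete, here is a minimal, self-contained Python sketch that simulates the same Map/Reduce flow in memory; the shuffle is emulated with a dictionary, and names like run_mapreduce are our own, not part of any Hadoop API.

```python
from collections import defaultdict

def map_fn(docname, doctext):
    # Map(): runs independently per record; emits (word, 1) pairs
    for word in doctext.split():
        yield (word, 1)

def reduce_fn(word, partial_counts):
    # Reduce(): sees the Iterator of all values for one key
    yield (word, sum(partial_counts))

def run_mapreduce(records, map_fn, reduce_fn):
    # Emulated shuffle: group all map outputs by key,
    # as the system would do across machines
    groups = defaultdict(list)
    for key, value in records.items():
        for k, v in map_fn(key, value):
            groups[k].append(v)
    return dict(kv for k, vals in groups.items() for kv in reduce_fn(k, vals))

docs = {"d1": "the cat sat", "d2": "the cat ran"}
print(run_mapreduce(docs, map_fn, reduce_fn))
# {'the': 2, 'cat': 2, 'sat': 1, 'ran': 1}
```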


How MapReduce Works

Parallel flow of control and data during MapReduce execution:
❖ Under the covers, each Mapper and Reducer is a separate process; Reducers face barrier synchronization (BSP)
❖ Fault tolerance achieved using data replication


Abstract Semantics of MapReduce

❖ Map(): Operates independently on one “record” at a time
  ❖ Can batch multiple data examples on to one record
  ❖ Dependencies across Mappers not allowed
  ❖ Can emit 1 or more key-value pairs as output
  ❖ Data types of inputs and outputs can be different!
❖ Reduce(): Gathers all Map output pairs across machines with the same key into an Iterator (list)
  ❖ Aggregation function applied on the Iterator to produce the final output
❖ Input Split:
  ❖ Physical-level split/shard of dataset that batches multiple examples into one file “block” (~128MB default on HDFS)
  ❖ Custom Input Splits can be written by appl. user


Benefits of MapReduce

❖ Goal: Higher-level abstraction of functional operations (Map; Reduce) to simplify data-parallel programming at scale
❖ Key Benefits:
  ❖ Out-of-the-box scalability and cluster parallelism
  ❖ Fault tolerance offloaded to system impl., not appl./user
  ❖ Map() and Reduce() can be highly general; no restrictions on data types/structures processed; easier to use for ETL and text/multimedia-oriented analytics
  ❖ Free and OSS implementations available (Hadoop)
❖ New burden on users: Converting computations of a data-intensive program/operation to the Map() + Reduce() API
  ❖ But MapReduce libraries available in multiple PLs to mitigate coding pains: Java, C++, Python, R, Scala, etc.


Emulate MapReduce in SQL?

Q: How would you do the word counting in an RDBMS / in SQL?

❖ First step: Transform text docs into relations and load; part of the Extract-Transform-Load (ETL) stage
  ❖ Suppose we pre-divide each document into words and have the schema: DocWords (DocName, Word)
❖ Second step: a single, simple SQL query!

    SELECT Word, COUNT(*)
    FROM DocWords
    GROUP BY Word
    [ORDER BY Word]

❖ Parallelism, scaling, etc. done by the RDBMS under the covers


More MR Examples: Select

❖ Input Split (part of ETL): Let table be sharded tuple-wise
❖ Map(): On tuple, apply selection condition; emit pair with dummy key and entire tuple as value
❖ Reduce(): Not needed! No cross-shard aggregation here
❖ Such kinds of tasks/jobs are called “Map-only” jobs
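A minimal Python sketch of such a Map-only selection, in the same style as the word-count simulation above; the predicate and field names are made up for illustration.

```python
def select_map(_, row):
    # Map-only job: apply the selection condition per tuple; no Reduce()
    # Hypothetical predicate: keep rows whose 'age' exceeds 30
    if row["age"] > 30:
        yield (None, row)  # dummy key; entire tuple as value

shard = [{"name": "a", "age": 25}, {"name": "b", "age": 40}]
print([v for row in shard for _, v in select_map(None, row)])
# [{'name': 'b', 'age': 40}]
```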


More MR Examples: Simple Agg.

❖ Input Split (part of ETL): Let table be sharded tuple-wise
❖ Assume it is an algebraic aggregate (SUM, AVG, MAX, etc.)
❖ Map(): On agg. attribute, compute incremental stats; emit pair with single global dummy key and stats as value
❖ Reduce(): Since only one dummy key across all shards, Iterator has all suff. stats that can be unified to get result
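A sketch for the special case of AVG, whose sufficient statistics are a partial (sum, count) per tuple; this plugs into the run_mapreduce simulator above, and the column name x is hypothetical.

```python
def avg_map(_, row):
    # Incremental stats per tuple: (partial_sum, partial_count)
    yield ("GLOBAL", (row["x"], 1))

def avg_reduce(key, stats):
    # Single dummy key: unify sufficient stats from all shards
    total = count = 0
    for s, c in stats:
        total += s
        count += c
    yield (key, total / count)

rows = {0: {"x": 10.0}, 1: {"x": 20.0}, 2: {"x": 60.0}}
print(run_mapreduce(rows, avg_map, avg_reduce))  # {'GLOBAL': 30.0}
```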


More MR Examples: GROUP BY Agg.

❖ Input Split (part of ETL): Let table be sharded tuple-wise
❖ Assume it is an algebraic aggregate (SUM, AVG, MAX, etc.)
❖ Map(): On agg. attribute, compute incremental stats; emit pair with grouping attribute as key and stats as value
❖ Reduce(): Iterator has all suff. stats for a single group; unify those to get result for that group; different reducers handle different groups
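The only change from the simple-aggregate sketch is the key: the grouping attribute instead of a global dummy. The column names dept and salary are made up; this computes AVG(salary) GROUP BY dept.

```python
def gb_avg_map(_, row):
    # Key = grouping attribute; value = sufficient stats for this tuple
    yield (row["dept"], (row["salary"], 1))

def gb_avg_reduce(dept, stats):
    # Each reduce call sees all sufficient stats of exactly one group
    total = count = 0
    for s, c in stats:
        total += s
        count += c
    yield (dept, total / count)

rows = {0: {"dept": "A", "salary": 100.0},
        1: {"dept": "B", "salary": 50.0},
        2: {"dept": "A", "salary": 200.0}}
print(run_mapreduce(rows, gb_avg_map, gb_avg_reduce))
# {'A': 150.0, 'B': 50.0}
```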


More MR Examples: Matrix Norm

❖ Input Split (part of ETL): Let matrix be sharded tile-wise
❖ Assume it is an algebraic aggregate (Lp,q norm)
❖ Very similar to simple aggregate!
❖ Map(): On agg. attribute, compute incremental stats; emit pair with single global dummy key and stats as value
❖ Reduce(): Since only one dummy key across all shards, Iterator has all suff. stats that can be unified to get result
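A sketch for the special case of the Frobenius norm (p = q = 2): the sufficient statistic per tile is its sum of squared entries, and the final square root happens in Reduce(). Tiles are NumPy arrays here, and this again reuses the run_mapreduce simulator.

```python
import numpy as np

def norm_map(_, tile):
    # Incremental stat per tile: sum of squared entries
    yield ("GLOBAL", float(np.sum(tile ** 2)))

def norm_reduce(key, partial_sums):
    # Single dummy key: unify partial sums, then finish with a sqrt
    yield (key, float(np.sqrt(sum(partial_sums))))

tiles = {0: np.array([[1.0, 2.0]]), 1: np.array([[2.0], [4.0]])}
print(run_mapreduce(tiles, norm_map, norm_reduce))  # {'GLOBAL': 5.0}
```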


Analogue: Parallel RDBMS UDA

❖ Recall how the word count MapReduce can be done with SQL:

    SELECT Word, COUNT(*)
    FROM DocWords
    GROUP BY Word

Q: How can we compute other aggregates not native to SQL?

❖ The User-Defined Aggregate Function (UDAF) abstraction in parallel RDBMSs can do MapReduce-like computations
❖ MapReduce seems more intuitive and succinct to many!


Analogue: Parallel RDBMS UDA

❖ 4 main functions in the UDAF API to work with the BSP model
❖ Aggregation state: data structure computed (independently) by workers and unified by master
❖ Initialize(): Set up info./initialize RAM for agg. state; runs independently on each worker
❖ Transition(): Per-tuple function run by worker to update its agg. state; analogous to Map() in MapReduce
❖ Merge(): Function that combines agg. states from workers; run by master after workers are done; analogous to Reduce()
❖ Finalize(): Run once at the end by master to return final result
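A minimal Python sketch of this four-function pattern, using AVG as the aggregate. The class and method names mirror the API roles above but are our own; they are not any specific RDBMS's UDAF syntax.

```python
class AvgUDA:
    def initialize(self):
        # Agg. state: (running_sum, running_count), one per worker
        return (0.0, 0)

    def transition(self, state, x):
        # Per-tuple update, run independently on each worker
        s, c = state
        return (s + x, c + 1)

    def merge(self, a, b):
        # Master combines the workers' agg. states
        return (a[0] + b[0], a[1] + b[1])

    def finalize(self, state):
        # Master turns the unified state into the final answer
        s, c = state
        return s / c if c else None

# Simulate two workers, each scanning its own shard, then merge on master
uda = AvgUDA()
states = []
for shard in [[10.0, 20.0], [60.0]]:
    st = uda.initialize()
    for x in shard:
        st = uda.transition(st, x)
    states.append(st)
merged = states[0]
for st in states[1:]:
    merged = uda.merge(merged, st)
print(uda.finalize(merged))  # 30.0
```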


Example Parallel RDBMS UDA: BGD

❖ BGD’s gradient (and loss) computation can be easily scaled as an RDBMS UDAF:
  ❖ Initialize(): Alloc. memory for model and gradient
  ❖ Transition(): Compute gradient as running info on tuple
  ❖ Merge(): Add partial sums from workers to get full gradient
  ❖ Finalize(): No-op; just return gradient
❖ Commands for controlling epochs, convergence, etc. can be done external to the RDBMS, since they do not need the dataset
  ❖ E.g., Apache MADlib does this with Python + Greenplum

Q: How will you do the above with MapReduce instead?
Q: What about mini-batch SGD?
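A sketch of one epoch of BGD in this UDA pattern, instantiated for least-squares linear regression; the loss choice, NumPy representation, and learning rate are our own illustration, not MADlib's code.

```python
import numpy as np

def initialize(w):
    # Agg. state: a partial gradient of the same shape as the model w
    return np.zeros_like(w)

def transition(grad, w, x, y):
    # Per-tuple gradient of squared loss: (w.x - y) * x
    return grad + (w @ x - y) * x

def merge(g1, g2):
    # Partial gradients from workers simply add up
    return g1 + g2

def finalize(grad):
    return grad  # no-op

# Epoch loop lives outside the "RDBMS"; it never scans the dataset itself
w = np.zeros(2)
data = [(np.array([1.0, 2.0]), 5.0), (np.array([2.0, 1.0]), 4.0)]
for _ in range(200):
    g = initialize(w)
    for x, y in data:          # Transition(), tuple by tuple
        g = transition(g, w, x, y)
    w -= 0.05 * finalize(g)    # one gradient step per epoch
print(w)  # converges toward [1. 2.]
```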


What is Hadoop then?

❖ Open-source system implementation with MapReduce as prog. model and HDFS as distr. filesystem
❖ Map and Reduce functions in API; input splitting, data distribution, shuffling, fault tolerance, etc. all handled by the Hadoop library under the covers
❖ Exploded in popularity: 100s of papers, 10s of products
❖ A real “revolution” in scalable parallel data processing that took the DB community by surprise!

NB: Do not confuse MapReduce for Hadoop or vice versa!


A Spectacular “War of the Worlds”

Scathing reaction from the DB world:
❖ “No declarativity!”
❖ “Filescan-based!”
❖ “DeWitt’s work on parallel DBMSs!”
❖ “Cheap rip-off of RDBMSs!”


“Young Turks” vs. “Old Guard”?

Swift and scathing rebuttal from the MapReduce/Hadoop world:
❖ DBMSs too high-level/hard to use for low-level text ETL
❖ Meant for “offline” fault-tolerant workloads on cheap nodes
❖ Google awarded a patent for MapReduce (ahem)!
❖ MapReduce/Hadoop not meant to be an RDBMS replacement


Enter Hybrid Systems!

❖ Clever DB researchers: “Let’s get the best of both worlds!”
❖ Numerous projects on hybrid systems in industry/academia:
  ❖ Programming model-level: Bring declarative programming from the RDBMS world to the MapReduce/Hadoop world, e.g., an SQL dialect over Hadoop or a dataflow language over Hadoop
  ❖ Systems-level: Intermix system implementation ideas, e.g., HadoopDB from Yale U. and Microsoft Polybase


Outline

❖ Beyond RDBMSs: A Brief History
❖ MapReduce/Hadoop Craze
❖ Spark and Dataflow Programming
❖ More Scalable ML with MapReduce/Spark


Spark from UC Berkeley

❖ Extended dataflow programming model (subsumes most rel. ops and MapReduce); system (re)designed from the ground up
❖ Inspired by Python Pandas style of function calls for ops
❖ Agenda: Unified system to handle relations, text, etc.; support more general distributed data processing
❖ Tons of sponsors, gazillion bucks, unbelievable hype!
❖ Key idea: exploit distributed memory to cache data
❖ Key novelty: lineage-based fault tolerance, not replication
❖ Open-sourced to Apache; commercialized as Databricks


Spark’s Dataflow Programming Model

Resilient Distributed Datasets: A Fault-tolerant Abstraction for In-memory Cluster Computing. In NSDI 2012

❖ Transformations are relational ops, MR, etc. as functions
❖ Actions are what force computation; aka lazy evaluation


Word Count Example in Spark

❖ Spark RDD API available in Python, Scala, Java, and R
❖ DataFrame API and SparkSQL also offer an SQL interface that can be interleaved with such function calls
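The slide's code figure does not survive in this export; below is a minimal PySpark RDD sketch of word count in the usual flatMap/map/reduceByKey style (the HDFS paths are placeholders).

```python
from pyspark import SparkContext

sc = SparkContext(appName="WordCount")

counts = (sc.textFile("hdfs://.../docs")            # placeholder input path
            .flatMap(lambda line: line.split())     # transformation
            .map(lambda w: (w, 1))                  # transformation
            .reduceByKey(lambda a, b: a + b))       # transformation

counts.saveAsTextFile("hdfs://.../wordcounts")      # action: forces computation
```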


Spark DataFrame API and SparkSQL

❖ Databricks now recommends DataFrame API/SparkSQL; avoid the RDD API unless absolutely needed!
❖ Key Reason: Automatic query optimization more feasible with SparkSQL/DataFrame than with RDD
❖ AKA (painfully) re-learn 40 years of DB research ☺

Spark SQL: Relational Data Processing in Spark. In SIGMOD 2015.
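For contrast with the RDD version above, a sketch of the same word count with the DataFrame API; split and explode are standard Spark SQL functions, and the path is again a placeholder. The declarative, relational plan is what makes this form optimizer-friendly.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode, split

spark = SparkSession.builder.appName("WordCountDF").getOrCreate()

lines = spark.read.text("hdfs://.../docs")   # one 'value' column per line
counts = (lines
          .select(explode(split(col("value"), r"\s+")).alias("word"))
          .groupBy("word")
          .count())   # relational plan; the optimizer can rewrite it

counts.show()
```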


How does Spark work?

Databricks is building yet another parallel RDBMS! ☺

Resilient Distributed Datasets: A Fault-tolerant Abstraction for In-memory Cluster Computing. In NSDI 2012


Comparing Spark’s APIs

❖ RDD: Low abstraction level; no named columns; no support for query optimization; map-reduce programming mode; best suited for unstructured data, low-level ops, and folks who like func. PLs and MapReduce
❖ DataFrame: High abstraction level; named columns; supports query optimization; SQL programming mode; best suited for structured data, high-level ops, and folks who know SQL, Python, R
❖ Koalas: High abstraction level; named columns; supports query optimization; SQL/Pandas programming mode; best suited for structured data; lower barrier to entry for folks who only know Pandas or Dask

Check out Yuhao/Supun’s PA 2 slides for more on Spark APIs


Reinventing the Wheel?


The Berkeley Data Analytics Stack (BDAS)

Spark-based Ecosystem of Tools


Other Dataflow Systems

❖ Stratosphere/Apache Flink from TU Berlin
❖ Myria from U Washington
❖ AsterixDB from UC Irvine
❖ Azure Data Lakes from Microsoft
❖ Google Cloud Dataflow
❖ …


References and More Material

❖ MapReduce/Hadoop:
  ❖ MapReduce: Simplified Data Processing on Large Clusters. Jeffrey Dean and Sanjay Ghemawat. In OSDI 2004.
  ❖ More Examples: http://bit.ly/2rkSRj8
  ❖ Online Tutorial: http://bit.ly/2rS2B5j
❖ Spark:
  ❖ Resilient Distributed Datasets: A Fault-tolerant Abstraction for In-memory Cluster Computing. Matei Zaharia and others. In NSDI 2012.
  ❖ More Examples: http://bit.ly/2rhkhEp, http://bit.ly/2rkT8Tc
  ❖ Online Tutorial: http://bit.ly/2r8lW0S


Outline

❖ Beyond RDBMSs: A Brief History
❖ MapReduce/Hadoop Craze
❖ Spark and Dataflow Programming
❖ More Scalable ML with MapReduce/Spark
  ❖ K-Means Clustering


Primer: K-Means Clustering

❖ Basic Idea: Identify clusters of examples based on Euclidean distances; formulated as an optimization problem
❖ Lloyd’s algorithm is the most popular heuristic for K-Means
❖ Input: n x d examples/points in numeric space
❖ Output: k clusters of those points and their centroids

  1. Initialize k centroid vectors and point-cluster ID assignment
  2. Assignment step: Scan dataset and assign each point to a cluster ID based on which centroid is nearest
  3. Update step: Given new assignment, scan dataset again to recompute centroids for all clusters
  4. Repeat 2 and 3 until convergence or a fixed # of iterations (see the sketch below)
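A compact single-node NumPy sketch of Lloyd's algorithm as described above; initializing centroids by sampling k points is one common choice, not the only one.

```python
import numpy as np

def lloyds(X, k, iters=20, seed=0):
    """X: n x d points. Returns (cluster assignments, k x d centroids)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # step 1
    for _ in range(iters):                                    # step 4
        # Step 2 (Assignment): nearest centroid per point, by Euclidean distance
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Step 3 (Update): recompute each non-empty cluster's centroid
        for j in range(k):
            if np.any(assign == j):
                centroids[j] = X[assign == j].mean(axis=0)
    return assign, centroids

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5.0])
assign, centroids = lloyds(X, k=2)
print(centroids)  # roughly [0, 0] and [5, 5]
```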

K-Means Clustering in MapReduce

❖ Input Split (part of ETL): Let table be sharded tuple-wise
❖ Assume each tuple/example/point has an ExampleID
❖ Need 2 jobs! 1 for the Assignment step and 1 for the Update step
❖ 2 external data structures needed for both jobs: k x d centroid matrix A (dense); n x k assignment matrix B (ultra-sparse)
❖ A and B initially broadcast to all Mappers via HDFS; Mappers can read small data directly as files on HDFS
❖ Job 1 reads A and creates new B
❖ Job 2 reads B and creates new A


K-Means Clustering in MapReduce

❖ 2 external data structures: k x d centroid matrix A (dense); n x k assignment matrix B (ultra-sparse)
❖ Job 1 Map(): Read A from HDFS; compute point’s distance to all k centroids; get nearest centroid; emit new assignment as output pair (PointID, ClusterID)
❖ No Reduce() for Job 1; new B now available on HDFS
❖ Job 2 Map(): Read B from HDFS; look up which cluster the point got assigned to; emit point as output pair (ClusterID, point vector)
❖ Job 2 Reduce(): Iterator has all point vectors of a given ClusterID; add them up and divide by count to get the new centroid; emit output pair (ClusterID, centroid vector)
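A minimal sketch of the two jobs' per-record logic in the emit style used earlier; here A is the broadcast k x d NumPy centroid matrix, B a dict from PointID to ClusterID, and the HDFS reads are elided.

```python
import numpy as np

def job1_map(point_id, x, A):
    # Job 1 (Map-only, Assignment step): A is the broadcast centroid matrix
    dists = np.linalg.norm(A - x, axis=1)
    yield (point_id, int(dists.argmin()))    # (PointID, ClusterID) -> new B

def job2_map(point_id, x, B):
    # Job 2 Map (Update step): B is the broadcast assignment from Job 1
    yield (B[point_id], x)                   # (ClusterID, point vector)

def job2_reduce(cluster_id, points):
    # Job 2 Reduce: average all points of one cluster into its new centroid
    pts = np.vstack(list(points))
    yield (cluster_id, pts.mean(axis=0))     # (ClusterID, centroid vector) -> new A
```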