Topic 5: Dataflow Systems
Chapter 2.2 of MLSys Book
Arun Kumar
DSC 102 Systems for Scalable Analytics

Parallel RDBMSs
Parallel RDBMSs are highly successful and widely used. They offer massive scalability (shared-nothing
Resilient Distributed Datasets: A Fault-tolerant Abstraction for In-memory Cluster Computing. In NSDI 2012.
Spark SQL: Relational Data Processing in Spark. In SIGMOD 2015.
                                RDD                  DataFrame              Koalas
Abstraction Level               Low                  High                   High
Named Columns                   No                   Yes                    Yes
Support for Query Optimization  No                   Yes                    Yes
Programming Mode                map-reduce           SQL                    SQL, Pandas
Best suited for                 Unstructured data    Structured data        Structured data
                                Low-level ops        High-level ops         Lower barrier to entry for
                                Folks who like       Folks who know SQL,    folks who only know Pandas
                                MapReduce            Python, R              or Dask
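
The difference in abstraction level and named columns in the comparison above can be sketched in a few lines. This is a plain-Python stand-in, not Spark code: the data, `map_phase`, `reduce_by_key`, and `group_sum` are all hypothetical names made up for illustration. The first half mimics the low-level map-reduce programming mode on positional tuples; the second half mimics a higher-level request over named columns.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical toy data: (city, amount) sales records.
rows = [("SD", 10), ("LA", 5), ("SD", 7), ("LA", 1), ("SF", 4)]

# Low-level, RDD-like style: positional tuples, explicit map and reduce steps.
def map_phase(records):
    # Emit (key, value) pairs; fields are addressed only by position.
    return [(r[0], r[1]) for r in records]

def reduce_by_key(pairs, fn):
    # Group values by key, then fold each group with the user-supplied function.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: reduce(fn, values) for key, values in groups.items()}

totals_low_level = reduce_by_key(map_phase(rows), lambda a, b: a + b)

# High-level, DataFrame-like style: named columns and a declarative request
# ("sum amount per city"); how it is executed is left to the system.
table = [{"city": city, "amount": amount} for city, amount in rows]

def group_sum(records, by, target):
    totals = defaultdict(int)
    for record in records:
        totals[record[by]] += record[target]
    return dict(totals)

totals_high_level = group_sum(table, by="city", target="amount")

print(totals_low_level)
print(totals_high_level)
```

Both halves compute the same per-city totals. In the first, the user spells out how the computation proceeds; in the second, the user states what is wanted by column name, which is the kind of information a system would need before it could apply query optimization.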