SLIDE 1

ScootR: Scaling R Dataframes on Dataflow Systems

Andreas Kunft¹, Lukas Stadler², Daniele Bonetta², Cosmin Basca², Jens Meiners¹, Sebastian Breß¹, Tilmann Rabl¹, Juan Fumero³, Volker Markl²

¹ Technische Universität Berlin, ² Oracle Labs, ³ University of Manchester

SLIDE 2

R gained increased traction

Dynamically typed, open-source language
Rich support for analytics & statistics

SLIDE 3

R gained increased traction

Dynamically typed, open-source language
Rich support for analytics & statistics

But

Standalone R is not well suited for out-of-core data loads

SLIDE 4

Analytics pipelines often work on large amounts of raw data

Dataflow engines (DF), e.g., Apache Flink and Spark, scale-out
Provide rich support for user-defined functions (UDFs)

SLIDE 5

Analytics pipelines often work on large amounts of raw data

Dataflow engines (DF), e.g., Apache Flink and Spark, scale-out
Provide rich support for user-defined functions (UDFs)

But

R users are often unfamiliar with DF APIs and concepts

SLIDE 6

Combine the usability of R with the scalability of dataflow engines

Goals
From function calls to an operator graph
Approaches to execute R UDFs
Our Approach: ScootR
Evaluation

SLIDE 7

GOALS

1. Provide a data.frame API that feels natural to R users

df$km <- df$miles * 1.6
df <- select(df, count = flights, distance)
df <- apply(df, func)

SLIDE 8

GOALS

1. Provide a data.frame API that feels natural to R users
2. Achieve performance comparable to the native dataflow API

df$km <- df$miles * 1.6
df <- select(df, count = flights, distance)
df <- apply(df, func)

SLIDE 9

From function calls to an operator graph

SLIDE 10

MAPPING DATA TYPES

R data.frame(T1,T2,…,TN) as Flink DataSet<TupleN<T1,T2,…,TN>>
E.g., data.frame(integer, character) as DataSet<Tuple2<Integer, String>>

The N columns of the data.frame become the N fields of a Tuple with arity N; each field has a fixed element type.
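
The mapping above can be sketched in a few lines of Python. This is only an illustration, not ScootR code: the helper name `flink_dataset_type` is hypothetical, and the `numeric` and `logical` entries are assumed mappings that the slide does not state (it only shows `integer` and `character`).

```python
# Hypothetical helper, not part of ScootR: derives the Flink type signature
# that corresponds to a list of R column types.
R_TO_JAVA = {
    "integer": "Integer",    # from the slide
    "character": "String",   # from the slide
    "numeric": "Double",     # assumption, not stated on the slide
    "logical": "Boolean",    # assumption, not stated on the slide
}

MAX_TUPLE_ARITY = 25  # Flink's Java API provides Tuple1 through Tuple25


def flink_dataset_type(column_types):
    """Return the DataSet<TupleN<...>> signature for the given R column types."""
    arity = len(column_types)
    if not 1 <= arity <= MAX_TUPLE_ARITY:
        raise ValueError(f"unsupported arity: {arity}")
    fields = ", ".join(R_TO_JAVA[t] for t in column_types)
    return f"DataSet<Tuple{arity}<{fields}>>"


print(flink_dataset_type(["integer", "character"]))
```

Because the arity and the element types are fixed up front, every row of the data.frame has the same tuple type.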

SLIDE 11

MAPPING R FUNCTIONS TO OPERATORS

Functions on data.frames lazily build an operator graph

SLIDE 12

MAPPING R FUNCTIONS TO OPERATORS

Functions on data.frames lazily build an operator graph
1. Functions w/o UDFs are handled before execution, e.g., a select function is mapped to a project operator:

select(df$id, df$arrival) to ds.project(1, 3)

SLIDE 13

MAPPING R FUNCTIONS TO OPERATORS

Functions on data.frames lazily build an operator graph
1. Functions w/o UDFs are handled before execution
2. Functions w/ UDFs call R functions during execution
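
A minimal Python sketch of the two cases above, assuming a toy in-memory "engine". The class `LazyFrame`, its methods, and the four-column schema used in the usage lines are all hypothetical; they only illustrate how calls can record operators first and run UDFs later.

```python
class LazyFrame:
    """Toy stand-in for a data.frame that records operators instead of running them."""

    def __init__(self, columns, operators=()):
        self.columns = list(columns)
        self.operators = list(operators)

    def select(self, *names):
        # No UDF involved: resolved before execution into a project operator
        # over zero-based field positions.
        indices = [self.columns.index(name) for name in names]
        return LazyFrame(names, self.operators + [("project", indices)])

    def apply(self, udf, out_columns):
        # UDF involved: only recorded here, called later for every row.
        return LazyFrame(out_columns, self.operators + [("map", udf)])

    def execute(self, rows):
        # The recorded operator graph (a simple chain here) runs only now.
        for kind, arg in self.operators:
            if kind == "project":
                rows = [tuple(row[i] for i in arg) for row in rows]
            else:
                rows = [tuple(arg(row)) for row in rows]
        return rows


df = LazyFrame(["flight", "id", "carrier", "arrival"])  # hypothetical schema
selected = df.select("id", "arrival")
print(selected.operators)
```

Nothing touches the data until `execute` is called, which is what lets the system hand the whole graph to the dataflow engine at once.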

SLIDE 14

Approaches to execute R UDFs

SLIDE 15

INTER PROCESS COMMUNICATION (IPC)

Figure: a client and a driver, plus two workers; each worker runs two tasks, and each task is paired with its own R process.

SLIDE 16

INTER PROCESS COMMUNICATION (IPC)

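1. Communication + Serialization (R <> Java)
2. JVM and R compete for memory

filter <- function(df) { df$language == 'english' }

Figure: a filter operator in the worker's JVM exchanges data with an external R process; callouts 1 and 2 mark the two costs listed above.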
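
The first cost can be made concrete with a small Python sketch. It is a simulation under stated assumptions, not a measurement: JSON stands in for whatever wire format an IPC setup would use, both "sides" run in one process, and the function names are hypothetical.

```python
import json


def is_english(record):
    # Same predicate as the R filter above.
    return record["language"] == "english"


def filter_in_process(records):
    # Shared runtime: the predicate is called directly on each record.
    return [r for r in records if is_english(r)]


def filter_via_ipc(records, stats):
    # Simulated IPC: every record and every result crosses a serialization boundary.
    kept = []
    for r in records:
        wire = json.dumps(r)                      # JVM side: serialize the record
        stats["bytes_sent"] += len(wire)
        decoded = json.loads(wire)                # R side: deserialize the record
        reply = json.dumps(is_english(decoded))   # R side: serialize the result
        stats["bytes_sent"] += len(reply)
        if json.loads(reply):                     # JVM side: deserialize the result
            kept.append(r)
    return kept


records = [{"id": 1, "language": "english"}, {"id": 2, "language": "german"}]
stats = {"bytes_sent": 0}
print(filter_via_ipc(records, stats), stats)
```

Both paths return the same rows; the IPC path simply does extra encode and decode work per record.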

SLIDE 17

SOURCE-TO-SOURCE TRANSLATION (STS)

Translate restricted set of functions to native dataflow API
Constant translation overhead, but native execution performance

SLIDE 18

SOURCE-TO-SOURCE TRANSLATION (STS)

E.g., STS translation in SparkR to Spark’s Scala Dataframe API:

R:     df <- filter(df, df$language == 'english')
Scala: val df = df.filter($"language" === "english")

R:     df$km <- df$miles * 1.6
Scala: val df = df.withColumn("km", $"miles" * 1.6)
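
The idea of translating a restricted set of expressions can be sketched as a tiny translator in Python. This is an assumption-laden illustration, not how SparkR is implemented: the tuple-based expression tree and the `translate` function are hypothetical, and only the operations shown on this slide are supported.

```python
def translate(expr):
    """Translate a small, hand-built expression tree into Scala DataFrame API text."""
    kind = expr[0]
    if kind == "col":
        return f'$"{expr[1]}"'
    if kind == "lit":
        value = expr[1]
        return f'"{value}"' if isinstance(value, str) else repr(value)
    if kind == "eq":
        return f"{translate(expr[1])} === {translate(expr[2])}"
    if kind == "mul":
        return f"{translate(expr[1])} * {translate(expr[2])}"
    if kind == "filter":
        return f"df.filter({translate(expr[1])})"
    if kind == "with_column":
        return f'df.withColumn("{expr[1]}", {translate(expr[2])})'
    # Anything else is outside the supported subset.
    raise NotImplementedError(f"cannot translate: {kind}")


print(translate(("filter", ("eq", ("col", "language"), ("lit", "english")))))
print(translate(("with_column", "km", ("mul", ("col", "miles"), ("lit", 1.6)))))
```

The `NotImplementedError` branch is the limitation in miniature: any R construct without a translation rule cannot be executed this way.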

SLIDE 19

Inter Process Communication

+ Execute arbitrary R code
- Data serialization
- Data exchange
- Java and R processes compete for memory

Source-to-source translation

+ Native performance
- Restricted to a language subset, or requires building a full-fledged compiler

SLIDE 20

A common runtime for R and Java

SLIDE 21

BACKGROUND: TRUFFLE/GRAAL

Figure labels: HotSpot, JIT, Bytecode

SLIDE 22

BACKGROUND: TRUFFLE/GRAAL

Figure labels: HotSpot, JIT, Bytecode, Graal

SLIDE 23

BACKGROUND: TRUFFLE/GRAAL

Figure labels: HotSpot, Graal, Truffle, GraalVM

SLIDE 24

BACKGROUND: TRUFFLE/GRAAL

Figure based on: Grimmer, Matthias, et al. "High-performance cross-language interoperability in a multi-language runtime." ACM SIGPLAN Notices. Vol. 51. No. 2. ACM, 2015.

Figure labels: Source Code (*.js, *.R, *.java); TruffleJS, TruffleR (fastR), javac; AST Interpreter; Truffle; Graal; HotSpot Runtime (Interpreter, GC, …); GraalVM

SLIDE 25

SCOOTR: FASTR + FLINK

SLIDE 26

SCOOTR OVERVIEW

flink.init(SERVER, PORT)
flink.parallelism(DOP)

df <- flink.readdf(SOURCE,
  list("id", "body", …),
  list(character, character, …)
)

words <- function(df) {
  len <- length(strsplit(df$body, " ")[[1]])
  list(df$id, df$body, len)
}

df <- flink.apply(df, words)

flink.writeAsText(df, SINK)
flink.execute()

SLIDE 28

Efficient data access in R UDFs

SLIDE 29

function(df) {
  len <- length(strsplit(df$body, " ")[[1]])
  list(df$id, df$body, len)
}

SLIDE 30

function(df) {
  len <- length(strsplit(df$body, " ")[[1]])
  list(df$id, df$body, len)
}

1. Dataframe proxy keeps track of columns and provides efficient access

function(tuple) {
  len <- length(strsplit(tuple[[2]], " ")[[1]])
  list(tuple[[1]], tuple[[2]], len)
}

SLIDE 31

function(df) {
  len <- length(strsplit(df$body, " ")[[1]])
  list(df$id, df$body, len)
}

1. Dataframe proxy keeps track of columns and provides efficient access
2. Rewrite to directly instantiate a Flink tuple instead of an R list

function(tuple) {
  len <- length(strsplit(tuple[[2]], " ")[[1]])
  flink.tuple(tuple[[1]], tuple[[2]], len)
}
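
The two rewrite steps can be imitated textually in Python. This is a rough sketch only: the slides do not say how ScootR performs the rewrite, and a regex over source text is a stand-in chosen for brevity. The function `rewrite_udf` is hypothetical, and it assumes plain column names and that the last `list(...)` call is the UDF's result.

```python
import re


def rewrite_udf(source, columns, param="df"):
    """Textually imitate the two rewrites shown on the slides (illustration only)."""
    # Step 1: resolve df$<column> to a positional tuple access (R is 1-based).
    index = {name: pos for pos, name in enumerate(columns, start=1)}
    access = re.compile(r"\b" + re.escape(param) + r"\$(\w+)")
    body = access.sub(lambda m: f"tuple[[{index[m.group(1)]}]]", source)
    body = body.replace(f"function({param})", "function(tuple)", 1)

    # Step 2: build a Flink tuple directly instead of an R list.
    # Assumption: the last list(...) call is the value the UDF returns.
    calls = list(re.finditer(r"\blist\(", body))
    if calls:
        last = calls[-1]
        body = body[:last.start()] + "flink.tuple(" + body[last.end():]
    return body


udf = 'function(df) { len <- length(strsplit(df$body, " ")[[1]]) list(df$id, df$body, len) }'
print(rewrite_udf(udf, ["id", "body"]))
```

Resolving column names to positions once, before execution, is what removes the per-row name lookup from the hot path.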

SLIDE 32

IMPACT OF DIRECT TYPE ACCESS

From list(...) to flink.tuple(...)
Avoids additional pass over R list to create Flink tuple
Up to 1.75x performance improvement

Figure panels: output w/ arity 2, output w/ arity 19. Purple is function execution; pink (hatched) is conversion from list to tuple.

SLIDE 33

Evaluation

SLIDE 34

APPLY FUNCTION MICROBENCHMARK

Airline On-Time Performance Dataset (2005 – 2016)

CSV, 19 columns, 9.5GB

UDF: df$km <- df$miles * 1.6

SLIDE 35

APPLY FUNCTION MICROBENCHMARK

Airline On-Time Performance Dataset (2005 – 2016)

CSV, 19 columns, 9.5GB

UDF: df$km <- df$miles * 1.6

ScootR and SparkR (STS) achieve near native performance

SLIDE 36

APPLY FUNCTION MICROBENCHMARK

Airline On-Time Performance Dataset (2005 – 2016)

CSV, 19 columns, 9.5GB

UDF: df$km <- df$miles * 1.6

ScootR and SparkR (STS) achieve near native performance
Both heavily outperform GNU R and fastR

SLIDE 37

APPLY FUNCTION MICROBENCHMARK: SCALABILITY

SLIDE 38

MIXED PIPELINE W/ PREPROCESSING AND ML

Pipeline:

(Distributed) preprocessing of the dataset
Data is collected locally and a generalized linear model is trained

Majority of the time is spent in preprocessing
ScootR is up to 11x faster than GNU R and fastR

SLIDE 39

RECAP

ScootR provides a data.frame API in R for Apache Flink
R and Flink run within the same runtime
Avoids serialization and data exchange
Avoids type conversion

> Achieves near native performance for a rich set of operators
