ScootR: Scaling R Dataframes on Dataflow Systems

SLIDE 1

ScootR: Scaling R Dataframes on Dataflow Systems

Andreas Kunft¹, Lukas Stadler², Daniele Bonetta², Cosmin Basca², Jens Meiners¹, Sebastian Breß¹, Tilmann Rabl¹, Juan Fumero³, Volker Markl²

¹ Technische Universität Berlin, ² Oracle Labs, ³ University of Manchester

SLIDE 2

R gained increased traction

  • Dynamically typed, open-source language
  • Rich support for analytics & statistics

SLIDE 3

R gained increased traction

  • Dynamically typed, open-source language
  • Rich support for analytics & statistics

But

  • Standalone R is not well suited for out-of-core data loads

SLIDE 4

Analytics pipelines often work on large amounts of raw data

  • Dataflow engines (DF), e.g., Apache Flink and Spark, scale-out
  • Provide rich support for user-defined functions (UDFs)

SLIDE 5

Analytics pipelines often work on large amounts of raw data

  • Dataflow engines (DF), e.g., Apache Flink and Spark, scale-out
  • Provide rich support for user-defined functions (UDFs)

But

  • R users are often unfamiliar with DF APIs and concepts

SLIDE 6

Combine the usability of R with the scalability of dataflow engines

  • Goals
  • From function calls to an operator graph
  • Approaches to execute R UDFs
  • Our Approach: ScootR
  • Evaluation

SLIDE 7

GOALS

  • 1. Provide a data.frame API that feels natural to R users

df$km <- df$miles * 1.6
df <- select(df, count = flights, distance)
df <- apply(df, func)

SLIDE 8

GOALS

  • 1. Provide a data.frame API that feels natural to R users
  • 2. Achieve performance comparable to the native dataflow API

df$km <- df$miles * 1.6
df <- select(df, count = flights, distance)
df <- apply(df, func)

SLIDE 9

From function calls to an operator graph

SLIDE 10

MAPPING DATA TYPES

  • R data.frame(T1,T2,…,TN) as Flink DataSet<TupleN<T1,T2,…,TN>>
  • E.g., data.frame(integer, character) as DataSet<Tuple2<Integer, String>>

[Figure annotations: the N columns of the data.frame correspond to the N fields of the tuple; the DataSet has a fixed element type, a Tuple with arity N.]
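The type mapping on this slide can be sketched in plain R. This is a minimal illustration and not ScootR's implementation: the helper `flink_type_of` and the lookup table `r_to_java` are hypothetical names, and only four R column classes are covered.

```r
# Hypothetical lookup from R column classes to Java field types.
# Assumption: only these four classes are handled in this sketch.
r_to_java <- c(
  integer   = "Integer",
  numeric   = "Double",
  character = "String",
  logical   = "Boolean"
)

# Build the Flink type string for a data.frame: one tuple field per column.
flink_type_of <- function(df) {
  classes <- vapply(df, function(col) class(col)[1], character(1))
  unknown <- setdiff(classes, names(r_to_java))
  if (length(unknown) > 0) {
    stop("unsupported column type(s): ", paste(unknown, collapse = ", "))
  }
  fields <- unname(r_to_java[classes])
  sprintf("DataSet<Tuple%d<%s>>", length(fields), paste(fields, collapse = ", "))
}

df <- data.frame(id = 1:3, name = c("a", "b", "c"), stringsAsFactors = FALSE)
print(flink_type_of(df))
```

For the two-column example, the sketch yields the same type string as the slide's `data.frame(integer, character)` case.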

SLIDE 11

MAPPING R FUNCTIONS TO OPERATORS

  • Functions on data.frames lazily build an operator graph

SLIDE 12

MAPPING R FUNCTIONS TO OPERATORS

  • Functions on data.frames lazily build an operator graph
  • 1. Functions w/o UDFs are handled before execution, e.g., a select function is mapped to a project operator:

select(df$id, df$arrival)  is mapped to  ds.project(1, 3)

SLIDE 13

MAPPING R FUNCTIONS TO OPERATORS

  • Functions on data.frames lazily build an operator graph
  • 1. Functions w/o UDFs are handled before execution
  • 2. Functions w/ UDFs call R functions during execution
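The two cases can be illustrated with a toy lazy data.frame in base R. All names here (`lazy_df`, `lazy_select`, `lazy_apply`, `describe_plan`) are hypothetical and exist only for this sketch; field positions are 1-based, following the slide's `project(1, 3)` example. Nothing is executed, the functions only record operators.

```r
# Hypothetical lazy data.frame: it holds column names and a list of pending
# operators instead of data. Nothing runs until a plan would be executed.
lazy_df <- function(columns, plan = list()) {
  structure(list(columns = columns, plan = plan), class = "lazy_df")
}

# No UDF involved: column names are resolved to field positions right away,
# so the plan only records a project operator with fixed indices.
lazy_select <- function(df, ...) {
  wanted <- c(...)
  idx <- match(wanted, df$columns)
  if (anyNA(idx)) stop("unknown column: ", wanted[is.na(idx)][1])
  op <- list(kind = "project", fields = idx)
  lazy_df(wanted, c(df$plan, list(op)))
}

# UDF involved: the plan stores the R function itself, to be called
# during execution.
lazy_apply <- function(df, udf) {
  op <- list(kind = "map", udf = udf)
  lazy_df(df$columns, c(df$plan, list(op)))
}

# Render the recorded plan, one string per operator.
describe_plan <- function(df) {
  vapply(df$plan, function(op) {
    if (op$kind == "project") {
      sprintf("project(%s)", paste(op$fields, collapse = ", "))
    } else {
      "map(<R UDF>)"
    }
  }, character(1))
}

df <- lazy_df(c("id", "departure", "arrival"))
df <- lazy_select(df, "id", "arrival")
df <- lazy_apply(df, function(row) row)
print(describe_plan(df))
```

The select call is fully resolved when it is recorded, while the apply call carries an R function that can only be evaluated once data flows through the operator.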

SLIDE 14

Approaches to execute R UDFs

SLIDE 15

INTER PROCESS COMMUNICATION (IPC)

[Figure: a driver/client and two workers; each worker runs two tasks, and each task is paired with its own R process.]

SLIDE 16

INTER PROCESS COMMUNICATION (IPC)

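[Figure: a worker's filter task runs in the JVM and hands data to a separate R process, which evaluates the UDF below. Markers 1 and 2 in the figure refer to the two costs listed.]

filter <- function(df) {
  df$language == 'english'
}

  • 1. Communication + serialization (R <> Java)
  • 2. JVM and R compete for memory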
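The first cost can be made concrete in base R: every batch that crosses the process boundary has to be turned into bytes and rebuilt on the other side, in both directions. The sketch below uses base R's `serialize` and `unserialize` as a stand-in for the wire format; it is an assumption for illustration only, since a real R <> Java exchange uses a different format and an actual second process.

```r
# Stand-in for the data a worker would send to an external R process.
batch <- data.frame(
  id       = 1:4,
  language = c("english", "german", "english", "french"),
  stringsAsFactors = FALSE
)

# Step 1: the sender turns the batch into bytes (a raw vector).
wire <- serialize(batch, connection = NULL)

# Step 2: the receiver rebuilds the data.frame from those bytes.
received <- unserialize(wire)

# The UDF from the slide, applied on the receiving side.
filter <- function(df) {
  df$language == "english"
}
keep <- filter(received)

# Step 3: the result travels back the same way.
result <- unserialize(serialize(keep, connection = NULL))

cat("bytes sent:", length(wire), "\n")
print(result)
```

Both copies of the batch (the sender's and the receiver's) exist at the same time, which also hints at the second cost: the two processes hold overlapping data in separate memory spaces.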

SLIDE 17

SOURCE-TO-SOURCE TRANSLATION (STS)

  • Translate restricted set of functions to native dataflow API
  • Constant translation overhead, but native execution performance

SLIDE 18

SOURCE-TO-SOURCE TRANSLATION (STS)

  • E.g., STS translation in SparkR to Spark's Scala DataFrame API:

R:     df <- filter(df, df$language == 'english')
Scala: val df = df.filter($"language" === "english")

R:     df$km <- df$miles * 1.6
Scala: val df = df.withColumn("km", $"miles" * 1.6)
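To show what translating a restricted subset means, here is a toy translator in base R that turns an R expression over `df$<column>` into a Scala column-expression string. It is not SparkR's mechanism; `to_scala` is a hypothetical helper that handles only literals, `df$col`, parentheses, and a few binary operators, and rejects everything else.

```r
# Translate a restricted R expression over df$<column> into a Scala
# column-expression string. Anything outside the subset is rejected.
to_scala <- function(expr) {
  if (is.character(expr)) {
    return(sprintf('"%s"', expr))                          # string literal
  }
  if (is.numeric(expr)) {
    return(format(expr))                                   # numeric literal
  }
  if (is.call(expr) && is.symbol(expr[[1]])) {
    op <- as.character(expr[[1]])
    if (op == "$") {
      return(sprintf('$"%s"', as.character(expr[[3]])))    # df$col -> $"col"
    }
    if (op == "(") {
      return(sprintf("(%s)", to_scala(expr[[2]])))
    }
    scala_ops <- c("==" = "===", "*" = "*", "+" = "+", "-" = "-", "/" = "/")
    if (op %in% names(scala_ops) && length(expr) == 3) {
      return(paste(to_scala(expr[[2]]), scala_ops[[op]], to_scala(expr[[3]])))
    }
  }
  stop("expression is outside the translatable subset: ", deparse(expr))
}

print(to_scala(quote(df$language == "english")))
print(to_scala(quote(df$miles * 1.6)))
```

A call such as `strsplit(df$body, " ")` is rejected by this sketch, which mirrors the restriction of STS to a language subset: every additional R function needs its own translation rule.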

SLIDE 19

Inter Process Communication

+ Execute arbitrary R code
- Data serialization
- Data exchange
- Java and R processes compete for memory

Source-to-source translation

+ Native performance
- Restricted to a language subset
- Requires building a full-fledged compiler

SLIDE 20

A common runtime for R and Java

SLIDE 21

BACKGROUND: TRUFFLE/GRAAL

[Figure components: HotSpot, JIT, Bytecode]

SLIDE 22

BACKGROUND: TRUFFLE/GRAAL

[Figure components: HotSpot, JIT, Bytecode, Graal]

SLIDE 23

BACKGROUND: TRUFFLE/GRAAL

[Figure components: HotSpot, Graal, Truffle, GraalVM]

SLIDE 24

BACKGROUND: TRUFFLE/GRAAL

[Figure components: source code (*.js, *.R, *.java), javac, TruffleJS, TruffleR (fastR), AST interpreter, Truffle, Graal, HotSpot runtime (interpreter, GC, …), GraalVM]

Figure based on: Grimmer, Matthias, et al. "High-performance cross-language interoperability in a multi-language runtime." ACM SIGPLAN Notices. Vol. 51. No. 2. ACM, 2015.

SLIDE 25

SCOOTR: FASTR + FLINK

SLIDE 26

SCOOTR OVERVIEW

flink.init(SERVER, PORT)
flink.parallelism(DOP)

df <- flink.readdf(SOURCE,
  list("id", "body", …),
  list(character, character, …)
)

words <- function(df) {
  len <- length(strsplit(df$body, " ")[[1]])
  list(df$id, df$body, len)
}

df <- flink.apply(df, words)

flink.writeAsText(df, SINK)
flink.execute()

SLIDE 28

Efficient data access in R UDFs

SLIDE 29

function(df) {
  len <- length(strsplit(df$body, " ")[[1]])
  list(df$id, df$body, len)
}

SLIDE 30

function(df) {
  len <- length(strsplit(df$body, " ")[[1]])
  list(df$id, df$body, len)
}

  • 1. Dataframe proxy keeps track of columns and provides efficient access

function(tuple) {
  len <- length(strsplit(tuple[[2]], " ")[[1]])
  list(tuple[[1]], tuple[[2]], len)
}

SLIDE 31

function(df) {
  len <- length(strsplit(df$body, " ")[[1]])
  list(df$id, df$body, len)
}

  • 1. Dataframe proxy keeps track of columns and provides efficient access
  • 2. Rewrite to directly instantiate a Flink tuple instead of an R list

function(tuple) {
  len <- length(strsplit(tuple[[2]], " ")[[1]])
  flink.tuple(tuple[[1]], tuple[[2]], len)
}
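Both rewrite steps can be mimicked on R's own syntax tree. The sketch below is an illustration under stated assumptions, not ScootR's actual rewriting: `rewrite` is a hypothetical helper, the `columns` vector stands in for the dataframe proxy's bookkeeping, and the code is only transformed, never executed (`flink.tuple` is the name used on the slide and is not defined here).

```r
# Replace every df$<col> with tuple[[<index>]] and every call to list(...)
# with flink.tuple(...). `columns` plays the role of the dataframe proxy's
# column bookkeeping. Assumption: calls have no empty arguments (e.g. x[, 1]).
rewrite <- function(expr, columns) {
  if (!is.call(expr)) return(expr)
  if (identical(expr[[1]], as.name("$")) && identical(expr[[2]], as.name("df"))) {
    idx <- match(as.character(expr[[3]]), columns)
    if (is.na(idx)) stop("unknown column: ", as.character(expr[[3]]))
    return(call("[[", as.name("tuple"), as.numeric(idx)))
  }
  if (identical(expr[[1]], as.name("list"))) {
    expr[[1]] <- as.name("flink.tuple")
  }
  # Recurse into the function position and all arguments of this call.
  as.call(lapply(as.list(expr), rewrite, columns = columns))
}

body_in <- quote({
  len <- length(strsplit(df$body, " ")[[1]])
  list(df$id, df$body, len)
})

body_out <- rewrite(body_in, columns = c("id", "body"))
print(body_out)
```

Resolving column names to positions once, at rewrite time, means the per-row function no longer performs name lookups, which is the point of step 1 on the slide.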

SLIDE 32

IMPACT OF DIRECT TYPE ACCESS

  • From list(...) to flink.tuple(...)
  • Avoids additional pass over R list to create Flink tuple
  • Up to 1.75x performance improvement

[Figure: two panels, output w/ arity 2 and output w/ arity 19. Purple is function execution; pink (hatched) is conversion from list to tuple.]

SLIDE 33

Evaluation

SLIDE 34

APPLY FUNCTION MICROBENCHMARK

  • Airline On-Time Performance Dataset (2005 – 2016): CSV, 19 columns, 9.5 GB

  • UDF: df$km <- df$miles * 1.6

SLIDE 35

APPLY FUNCTION MICROBENCHMARK

  • Airline On-Time Performance Dataset (2005 – 2016): CSV, 19 columns, 9.5 GB

  • UDF: df$km <- df$miles * 1.6

ScootR and SparkR (STS) achieve near native performance

SLIDE 36

APPLY FUNCTION MICROBENCHMARK

  • Airline On-Time Performance Dataset (2005 – 2016): CSV, 19 columns, 9.5 GB

  • UDF: df$km <- df$miles * 1.6

ScootR and SparkR (STS) achieve near native performance

Both heavily outperform GNU R and fastR

SLIDE 37

APPLY FUNCTION MICROBENCHMARK: SCALABILITY

SLIDE 38

MIXED PIPELINE W/ PREPROCESSING AND ML

Pipeline:

  • (Distributed) preprocessing of the dataset
  • Data is collected locally and a generalized linear model is trained

Majority of the time is spent in preprocessing

ScootR is up to 11x faster than GNU R and fastR

SLIDE 39

RECAP

  • ScootR provides a data.frame API in R for Apache Flink
  • R and Flink run within the same runtime
  • Avoids serialization and data exchange
  • Avoids type conversion

→ Achieves near native performance for a rich set of operators
