Pipelines for data analysis in R, Hadley Wickham (@hadleywickham)

SLIDE 1

Hadley Wickham

@hadleywickham
Chief Scientist, RStudio

Pipelines for data analysis in R

October 2015

SLIDE 2

Data analysis is the process by which data becomes understanding, knowledge and insight

SLIDE 3

Data analysis is the process by which data becomes understanding, knowledge and insight

SLIDE 4

Import
Tidy: consistent way of storing data
Transform: create new variables & new summaries
Visualise: surprises, but doesn't scale
Model: scales, but doesn't (fundamentally) surprise

SLIDE 5

Import: readr, readxl, haven, DBI, httr
Tidy: tidyr
Transform: dplyr
Visualise: ggplot2, ggvis
Model: broom

SLIDE 6

Pipelines

SLIDE 7

Think it (cognitive)
Describe it (precisely)
Do it (computational)

SLIDE 8

Cognition time ≫ Computation time

http://www.flickr.com/photos/mutsmuts/4695658106

SLIDE 9

magrittr::%>%

Inspirations: Unix, F#, Haskell, Clojure, method chaining

SLIDE 10

foo_foo <- little_bunny()

bop_on(
  scoop_up(
    hop_through(foo_foo, forest),
    field_mouse
  ),
  head
)

# vs

foo_foo %>%
  hop_through(forest) %>%
  scoop_up(field_mouse) %>%
  bop_on(head)

SLIDE 11

x %>% f(y)           # f(x, y)
x %>% f(z, .)        # f(z, x)
x %>% f(y) %>% g(z)  # g(f(x, y), z)

# Turns function composition (hard to read)
# into sequence (easy to read)
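The three rewrite rules on this slide can be checked directly. The following is a minimal sketch using magrittr with base R functions; the vector `x` and the functions chosen are illustrative assumptions, not part of the talk.

```r
library(magrittr)

x <- c(4, 9, 16)

# x %>% f(y) is f(x, y): the left-hand side becomes the first argument
first_arg <- identical(x %>% head(2), head(x, 2))

# x %>% f(z, .) is f(z, x): the dot marks where the left-hand side goes
placeholder <- identical(x %>% paste("value:", .), paste("value:", x))

# x %>% f(y) %>% g(z) is g(f(x, y), z): nesting becomes a sequence
chained <- identical(
  x %>% head(2) %>% rep(times = 2),
  rep(head(x, 2), times = 2)
)

c(first_arg, placeholder, chained)
```

Each comparison pairs a piped call with its nested equivalent, so all three flags should come out `TRUE`.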

SLIDE 12

# Any function can use it. Only needs a simple
# property: the type of the first argument
# needs to be the same as the type of the result.

# tidyr: pipelines for messy -> tidy data
# dplyr: pipelines for data manipulation
# ggvis: pipelines for visualisations
# rvest: pipelines for html
# purrr: pipelines for lists
# xml2: pipelines for xml
# stringr: pipelines for strings
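To illustrate the "same type in, same type out" property, here is a small sketch with two hypothetical helpers, `drop_missing()` and `add_ratio()`. Neither comes from any package; they are made up for this example, as is the `scores` data. Both take a data frame as their first argument and return a data frame, which is all the pipe needs.

```r
library(magrittr)

# Hypothetical helper: data frame in, data frame out
drop_missing <- function(df, col) {
  df[!is.na(df[[col]]), , drop = FALSE]
}

# Hypothetical helper: data frame in, data frame out
add_ratio <- function(df, num, den, name = "ratio") {
  df[[name]] <- df[[num]] / df[[den]]
  df
}

# Made-up data for illustration only
scores <- data.frame(
  id = 1:4,
  correct = c(8, NA, 5, 9),
  attempted = c(10, 10, 10, 10)
)

result <- scores %>%
  drop_missing("correct") %>%
  add_ratio("correct", "attempted", name = "accuracy")

result
```

Because every step returns the type the next step expects, the helpers can be reordered or extended without changing how they are chained.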

SLIDE 13

Tidy

SLIDE 14

Import: readr, readxl, haven, DBI, httr
Tidy: tidyr
Transform: dplyr
Visualise: ggplot2, ggvis
Model: broom

SLIDE 15

Storage        Meaning
Table / File   Data set
Rows           Observations
Columns        Variables

Tidy data = data that makes data analysis easy

SLIDE 16

Source: local data frame [5,769 x 22]

    iso2  year   m04  m514  m014 m1524 m2534 m3544 m4554 m5564   m65    mu   f04  f514
   (chr) (int) (int) (int) (int) (int) (int) (int) (int) (int) (int) (int) (int) (int)
1     AD  1989    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
2     AD  1990    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
3     AD  1991    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
4     AD  1992    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
5     AD  1993    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
6     AD  1994    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
7     AD  1996    NA    NA     0     0     0     4     1     0     0    NA    NA    NA
8     AD  1997    NA    NA     0     0     1     2     2     1     6    NA    NA    NA
9     AD  1998    NA    NA     0     0     0     1     0     0     0    NA    NA    NA
10    AD  1999    NA    NA     0     0     0     1     1     0     0    NA    NA    NA
11    AD  2000    NA    NA     0     0     1     0     0     0     0    NA    NA    NA
12    AD  2001    NA    NA     0    NA    NA     2     1    NA    NA    NA    NA    NA
13    AD  2002    NA    NA     0     0     0     1     0     0     0    NA    NA    NA
14    AD  2003    NA    NA     0     0     0     1     2     0     0    NA    NA    NA
15    AD  2004    NA    NA     0     0     0     1     1     0     0    NA    NA    NA
16    AD  2005     0     0     0     0     1     1     0     0     0     0     0     0
..   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...

Variables not shown: f014 (int), f1524 (int), f2534 (int), f3544 (int),
  f4554 (int), f5564 (int), f65 (int), fu (int)

What are the variables in this dataset?
(Hint: f = female, u = unknown, 1524 = 15-24)

SLIDE 17

# To convert this messy data into tidy data
# we need two verbs. First we need to gather
# together all the columns that aren't variables
tb2 <- tb %>%
  gather(demo, n, -iso2, -year, na.rm = TRUE)
tb2

SLIDE 18

# Then separate the demographic variable into
# sex and age
tb3 <- tb2 %>%
  separate(demo, c("sex", "age"), 1)
tb3

# Many tidyr verbs come in pairs:
# spread vs. gather
# extract/separate vs. unite
# nest vs. unnest
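The full `tb` table is not included in the slides, so the sketch below runs the same two verbs on a made-up miniature. It keeps the column naming scheme of the original, but the counts are invented for illustration and are not the real TB figures.

```r
library(dplyr)
library(tidyr)

# Invented miniature of the TB table: same naming scheme, made-up counts
tb <- data.frame(
  iso2  = c("AD", "AD"),
  year  = c(1996L, 1997L),
  m1524 = c(0L, 0L),
  m3544 = c(4L, 2L),
  f1524 = c(NA, 1L),
  stringsAsFactors = FALSE
)

# Gather every column that is not a variable into key/value pairs,
# dropping the missing counts
tb2 <- tb %>%
  gather(demo, n, -iso2, -year, na.rm = TRUE)

# Split the key after its first character: "m1524" -> "m", "1524"
tb3 <- tb2 %>%
  separate(demo, c("sex", "age"), 1)

tb3
```

In current tidyr releases `gather()` and `spread()` still work but are superseded by `pivot_longer()` and `pivot_wider()`.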

SLIDE 19

Google for “tidyr” & “tidy data”

SLIDE 20

Transform

SLIDE 21

Import: readr, readxl, haven, DBI, httr
Tidy: tidyr
Transform: dplyr
Visualise: ggplot2, ggvis
Model: broom

SLIDE 22

Think it (cognitive)
Describe it (precisely)
Do it (computational)

SLIDE 23

One table verbs

select: subset variables by name
filter: subset observations by value
mutate: add new variables
summarise: reduce to a single obs
arrange: re-order the observations

+ group by
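As a rough sketch of how the five verbs and `group_by()` combine, the pipeline below uses the built-in `mtcars` data. The choice of data set and columns is an assumption made for illustration, not the content of the demo.

```r
library(dplyr)

by_cyl <- mtcars %>%
  select(mpg, cyl, wt) %>%            # subset variables by name
  filter(wt < 4) %>%                  # subset observations by value
  mutate(wt_kg = wt * 453.6) %>%      # add a new variable (wt is in 1000 lb)
  group_by(cyl) %>%                   # + group by
  summarise(                          # reduce each group to a single row
    mean_mpg = mean(mpg),
    mean_wt_kg = mean(wt_kg)
  ) %>%
  arrange(desc(mean_mpg))             # re-order the rows

by_cyl
```

Each verb does one thing, and `group_by()` changes the scope of the verbs that follow it rather than the data itself.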

SLIDE 24

Demo

SLIDE 25

Mutating: inner_join(), left_join(), right_join(), full_join()
Filtering: semi_join(), anti_join()
Set: intersect(), setdiff(), union()
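The three families of two-table verbs behave differently, which a small example makes visible. The tables `members` and `instruments` below are made up for this sketch.

```r
library(dplyr)

# Made-up tables for illustration
members <- data.frame(
  name = c("Mick", "John", "Paul"),
  band = c("Stones", "Beatles", "Beatles"),
  stringsAsFactors = FALSE
)
instruments <- data.frame(
  name  = c("John", "Paul", "Keith"),
  plays = c("guitar", "bass", "guitar"),
  stringsAsFactors = FALSE
)

# Mutating joins add columns from the second table
kept_left <- left_join(members, instruments, by = "name")   # all rows of members
matched   <- inner_join(members, instruments, by = "name")  # only matching rows

# Filtering joins keep or drop rows of the first table and add no columns
with_match    <- semi_join(members, instruments, by = "name")
without_match <- anti_join(members, instruments, by = "name")

# Set operations treat whole rows as set elements
shared_names <- intersect(members["name"], instruments["name"])
```

`anti_join()` is handy for diagnosing a join: it lists the rows that would be left without a match.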

SLIDE 26

dplyr sources

Local data frame (C++)
Local data table
Local data cube (experimental)
RDBMS: Postgres, MySQL, SQLite, Oracle, MS SQL, JDBC, Impala
MonetDB, BigQuery
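As a hedged illustration of the database sources, the sketch below runs a dplyr pipeline against an in-memory SQLite database. It assumes the DBI, RSQLite and dbplyr packages are installed and uses the current `tbl()`/`collect()` interface, which may differ from the API shown in the 2015 demo. The table name `cars` is arbitrary.

```r
library(dplyr)

# In-memory SQLite database holding a copy of mtcars
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(con, "cars", mtcars)

# The pipeline is translated to SQL and runs inside the database
remote <- tbl(con, "cars") %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg, na.rm = TRUE)) %>%
  arrange(cyl)

# collect() pulls the (small) result back into a local data frame
result <- collect(remote)

DBI::dbDisconnect(con)

result
```

The verbs are the same ones used on a local data frame; only the source changes.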

SLIDE 27

Google for “dplyr”

SLIDE 28

Visualise

SLIDE 29

Import: readr, readxl, haven, DBI, httr
Tidy: tidyr
Transform: dplyr
Visualise: ggplot2, ggvis
Model: broom

SLIDE 30

What is ggvis?

A grammar of graphics (like ggplot2)
Reactive (interactive & dynamic) (like shiny)
A pipeline (a la dplyr)
Of the web (drawn with vega)
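A minimal sketch of the pipeline style in ggvis, assuming the ggvis package is installed. The `mtcars` data and the chosen columns are illustrative, not taken from the demo files.

```r
library(ggvis)

# Start from data, map variables with formulas, then add layers with %>%
p <- mtcars %>%
  ggvis(x = ~wt, y = ~mpg) %>%
  layer_points()

p
```

Printing `p` in an interactive session renders the plot in a browser or viewer pane.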

SLIDE 31

Demo

4-ggvis.R 4-ggvis.Rmd

SLIDE 32

Google for “ggvis”

SLIDE 33

Model

with broom, by David Robinson

SLIDE 34

Import: readr, readxl, haven, DBI, httr
Tidy: tidyr
Transform: dplyr
Visualise: ggplot2, ggvis
Model: broom

SLIDE 35

[Plot: log(sales) against date, 1990 to 2015]

46 TX cities, ~25 years of data

What makes it hard to see the long term trend?

SLIDE 36

# Models are useful as a tool for removing
# known patterns
tx <- tx %>%
  group_by(city) %>%
  mutate(
    resid = lm(
      log(sales) ~ factor(month),
      na.action = na.exclude
    ) %>% resid()
  )

SLIDE 37

[Plot: resid against date, 1990 to 2015]

SLIDE 38

# Models are also useful in their own right
models <- tx %>%
  group_by(city) %>%
  do(mod = lm(
    log(sales) ~ factor(month),
    data = .,
    na.action = na.exclude
  ))

SLIDE 39

Model summaries

Model level: one row per model
Coefficient level: one row per coefficient (per model)
Observation level: one row per observation (per model)
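The three levels correspond to three broom functions. The sketch below fits a single model to the built-in `mtcars` data, a stand-in chosen for illustration since the `tx` data is not included in the slides.

```r
library(broom)

mod <- lm(mpg ~ wt, data = mtcars)

model_level <- glance(mod)    # one row per model
coef_level  <- tidy(mod)      # one row per coefficient
obs_level   <- augment(mod)   # one row per observation

nrow(model_level)
nrow(coef_level)
nrow(obs_level)
```

All three results are data frames, so they can be fed straight back into dplyr or ggplot2.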

SLIDE 40

Demo

5-broom.R

SLIDE 41

Google for “broom r”

SLIDE 42

Big data and R

SLIDE 43

Big: can’t fit in memory on one computer: >5 TB
Medium: fits in memory on a server: 10 GB-5 TB
Small: fits in memory on a laptop: <10 GB (R is great at this!)

SLIDE 44

R

R provides an excellent environment for rapid interactive exploration of small data.

There is no technical reason why it can’t also work well with medium size data. (But the work mostly hasn’t been done)

What about big data?

SLIDE 45

1. Can be reduced to a small data problem with subsetting/sampling/summarising (90%)
2. Can be reduced to a very large number of small data problems (9%)
3. Is irreducibly big (1%)

SLIDE 46

The right small data

Rapid iteration essential
dplyr supports this activity by avoiding cognitive costs of switching between languages.

SLIDE 47

Lots of small problems

Embarrassingly parallel (e.g. Hadoop)
R wrappers like foreach, rhipe, rhadoop
The challenge is matching the architecture of computing to data storage
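The shape of such a problem can be sketched on a single machine with base R: split the data, solve each piece independently, then combine. The `mtcars` data and the hypothetical helper `fit_slope()` are assumptions made for illustration. A cluster framework would distribute the same per-piece function rather than loop over it locally.

```r
# Split one table into independent pieces, one per group
pieces <- split(mtcars, mtcars$cyl)

# Hypothetical helper: solve one small problem on one piece
fit_slope <- function(d) coef(lm(mpg ~ wt, data = d))[["wt"]]

# No piece depends on another, so the order of evaluation does not matter
slopes <- lapply(pieces, fit_slope)

# Combine the small results
unlist(slopes)
```

On Unix-like systems `parallel::mclapply(pieces, fit_slope)` is a drop-in replacement for the `lapply()` call.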

SLIDE 48

Irreducibly big

Computation must be performed by specialised system.
Typically C/C++, Fortran, Scala.
R needs to be able to talk to those systems.

SLIDE 49

Future work

SLIDE 50

End game

Provide a fluent interface where you spend your mental energy on the specific data problem, not the general data analysis process.

The best tools become invisible with time!

Still a lot of work to do, especially on the connection between modelling and visualisation.

SLIDE 51

Import: readr, readxl, haven, DBI, httr
Tidy: tidyr
Transform: dplyr
Visualise: ggplot2, ggvis
Model: broom