SLIDE 1 Hadley Wickham
@hadleywickham
Chief Scientist, RStudio
Pipelines for
data analysis in R
October 2015
SLIDE 2 Data analysis is the process by which data becomes understanding, knowledge and insight
SLIDE 4 Import, Tidy, Transform, Visualise, Model
- Tidy: Consistent way
- Transform: Create new variables & new summaries
- Visualise: Surprises, but doesn't scale
- Model: Scales, but doesn't (fundamentally) surprise
SLIDE 5
- Import: readr, readxl, haven, DBI, httr
- Tidy: tidyr
- Transform: dplyr
- Visualise: ggplot2, ggvis
- Model: broom
SLIDE 7 Think it (cognitive), Describe it (precisely), Do it (computational)
SLIDE 8 Cognition time ≫ Computation time
http://www.flickr.com/photos/mutsmuts/4695658106
SLIDE 9 magrittr::%>%
Inspirations: Unix, F#, Haskell, Clojure, method chaining
SLIDE 10
foo_foo <- little_bunny()

bop_on(
  scoop_up(
    hop_through(foo_foo, forest),
    field_mouse
  ),
  head
)

# vs

foo_foo %>%
  hop_through(forest) %>%
  scoop_up(field_mouse) %>%
  bop_on(head)
SLIDE 11
x %>% f(y)
# f(x, y)

x %>% f(z, .)
# f(z, x)

x %>% f(y) %>% g(z)
# g(f(x, y), z)

# Turns function composition (hard to read)
# into sequence (easy to read)
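A minimal runnable sketch of the three rewrites above. The functions f and g and the values of x, y and z are made up for illustration; only %>% comes from magrittr.

```r
library(magrittr)

# Hypothetical functions and values, chosen only so the results are easy to check.
f <- function(a, b) a - b
g <- function(a, b) a * b
x <- 10
y <- 3
z <- 2

first_arg   <- x %>% f(y)            # piped value becomes the first argument: f(x, y)
placeholder <- x %>% f(z, .)         # the dot places it elsewhere: f(z, x)
chained     <- x %>% f(y) %>% g(z)   # steps read left to right: g(f(x, y), z)
```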
SLIDE 12
# Any function can use it. Only needs a simple
# property: the type of the first argument
# needs to be the same as the type of the result.

# tidyr: pipelines for messy -> tidy data
# dplyr: pipelines for data manipulation
# ggvis: pipelines for visualisations
# rvest: pipelines for html
# purrr: pipelines for lists
# xml2: pipelines for xml
# stringr: pipelines for strings
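One way to see the "first argument in, same type out" property is with a few small functions on numeric vectors. The three helpers below are hypothetical and written only for this sketch; they are not part of any package named on the slide.

```r
library(magrittr)

# Hypothetical helpers: each takes a numeric vector first and returns a numeric vector,
# which is the only property the pipe needs.
drop_missing <- function(x) x[!is.na(x)]
rescale01    <- function(x) (x - min(x)) / (max(x) - min(x))
round_to     <- function(x, digits) round(x, digits)

result <- c(2, NA, 4, 6) %>%
  drop_missing() %>%
  rescale01() %>%
  round_to(2)
```

Because every step returns the same kind of object it received, the steps can be reordered, removed or extended without changing how the pipeline is written.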
SLIDE 14
- Import: readr, readxl, haven, DBI, httr
- Tidy: tidyr
- Transform: dplyr
- Visualise: ggplot2, ggvis
- Model: broom
SLIDE 15
Storage        Meaning
Table / File   Data set
Rows           Observations
Columns        Variables

Tidy data = data that makes data analysis easy
SLIDE 16
Source: local data frame [5,769 x 22]

    iso2  year   m04  m514  m014 m1524 m2534 m3544 m4554 m5564   m65    mu   f04  f514
   (chr) (int) (int) (int) (int) (int) (int) (int) (int) (int) (int) (int) (int) (int)
 1    AD  1989    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 2    AD  1990    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 3    AD  1991    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 4    AD  1992    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 5    AD  1993    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 6    AD  1994    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 7    AD  1996    NA    NA     0     0     0     4     1     0     0    NA    NA    NA
 8    AD  1997    NA    NA     0     0     1     2     2     1     6    NA    NA    NA
 9    AD  1998    NA    NA     0     0     0     1     0     0     0    NA    NA    NA
10    AD  1999    NA    NA     0     0     0     1     1     0     0    NA    NA    NA
11    AD  2000    NA    NA     0     0     1     0     0     0     0    NA    NA    NA
12    AD  2001    NA    NA     0    NA    NA     2     1    NA    NA    NA    NA    NA
13    AD  2002    NA    NA     0     0     0     1     0     0     0    NA    NA    NA
14    AD  2003    NA    NA     0     0     0     1     2     0     0    NA    NA    NA
15    AD  2004    NA    NA     0     0     0     1     1     0     0    NA    NA    NA
16    AD  2005     0     0     0     0     1     1     0     0     0     0     0     0
..   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...
Variables not shown: f014 (int), f1524 (int), f2534 (int), f3544 (int), f4554 (int),
  f5564 (int), f65 (int), fu (int)
What are the variables in this dataset? (Hint: f = female,
u = unknown, 1524 = 15-24)
SLIDE 17
# To convert this messy data into tidy data
# we need two verbs. First we need to gather
# together all the columns that aren't variables
tb2 <- tb %>%
  gather(demo, n, -iso2, -year, na.rm = TRUE)
tb2
SLIDE 18
# Then separate the demographic variable into
# sex and age
tb3 <- tb2 %>%
  separate(demo, c("sex", "age"), 1)
tb3

# Many tidyr verbs come in pairs:
# spread vs. gather
# extract/separate vs. unite
# nest vs. unnest
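The two steps can be tried end to end on a tiny stand-in table. The data frame tb_toy below is invented for illustration and only mimics the column layout of the TB data; the gather() and separate() calls mirror the ones on the slides.

```r
library(tidyr)

# Toy data made up for this sketch; it is not the real TB dataset from the slides.
tb_toy <- data.frame(
  iso2  = c("AD", "AD"),
  year  = c(1996L, 1997L),
  m1524 = c(0L, 0L),
  f1524 = c(NA, 1L),
  stringsAsFactors = FALSE
)

tidy_toy <- tb_toy %>%
  # Stack every column except iso2 and year into key/value pairs, dropping NAs.
  gather(demo, n, -iso2, -year, na.rm = TRUE) %>%
  # Split "m1524" after the first character into sex = "m" and age = "1524".
  separate(demo, c("sex", "age"), 1)
```

The result has one row per country, year, sex and age group, with the count in n.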
SLIDE 19 Google for “tidyr” & “tidy data”
SLIDE 21
- Import: readr, readxl, haven, DBI, httr
- Tidy: tidyr
- Transform: dplyr
- Visualise: ggplot2, ggvis
- Model: broom
SLIDE 22 Think it (cognitive), Describe it (precisely), Do it (computational)
SLIDE 23 One table verbs
- select: subset variables by name
- filter: subset observations by value
- mutate: add new variables
- summarise: reduce to a single obs
- arrange: re-order the observations
+ group by
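A short sketch that strings the five verbs and the grouping step into one pipeline. It uses the built-in mtcars data purely as a convenient example; the derived column wt_kg and the cutoff of 4 are arbitrary choices for this illustration.

```r
library(dplyr)

by_cyl <- mtcars %>%
  select(mpg, cyl, wt) %>%                     # subset variables by name
  filter(wt < 4) %>%                           # subset observations by value
  mutate(wt_kg = wt * 453.6) %>%               # add a new variable (wt is in 1000 lbs)
  group_by(cyl) %>%                            # later verbs now work per group
  summarise(mean_mpg = mean(mpg), n = n()) %>% # reduce each group to a single row
  arrange(desc(mean_mpg))                      # re-order the rows
```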
SLIDE 25
- Mutating: inner_join(), left_join(), right_join(), full_join()
- Filtering: semi_join(), anti_join()
- Set: intersect(), union(), setdiff()
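The difference between the three families is easiest to see on two tiny tables. The tables x and y below are invented for this sketch and share a single column called key.

```r
library(dplyr)

# Toy tables made up for illustration.
x <- data.frame(key = c(1, 2, 3), x_val = c("x1", "x2", "x3"))
y <- data.frame(key = c(2, 3, 4), y_val = c("y2", "y3", "y4"))

# Mutating joins add columns from y to x.
inner <- inner_join(x, y, by = "key")   # keys present in both tables
left  <- left_join(x, y, by = "key")    # every row of x
full  <- full_join(x, y, by = "key")    # every key from either table

# Filtering joins keep or drop rows of x and never add columns.
semi <- semi_join(x, y, by = "key")     # rows of x with a match in y
anti <- anti_join(x, y, by = "key")     # rows of x without a match in y

# Set operations compare whole rows, so both inputs need the same columns.
both   <- intersect(x["key"], y["key"])
either <- union(x["key"], y["key"])
only_x <- setdiff(x["key"], y["key"])
```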
SLIDE 26 dplyr sources
- Local data frame (C++)
- Local data table
- Local data cube (experimental)
- RDBMS: Postgres, MySQL, SQLite, Oracle, MS SQL, JDBC, Impala
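A rough sketch of the same verbs running against a database source. It assumes the DBI, RSQLite and dbplyr packages are installed and uses a throwaway in-memory SQLite database; the connection style shown is the current one and may differ from what was available when the talk was given.

```r
library(dplyr)

# Assumption: DBI, RSQLite and dbplyr are installed alongside dplyr.
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(con, "mtcars", mtcars)

# The pipeline is translated to SQL; nothing is computed in R yet.
remote <- tbl(con, "mtcars") %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg, na.rm = TRUE))

local_copy <- collect(remote)   # run the query and pull the result into R
DBI::dbDisconnect(con)
```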
SLIDE 27 Google for “dplyr”
SLIDE 29
- Import: readr, readxl, haven, DBI, httr
- Tidy: tidyr
- Transform: dplyr
- Visualise: ggplot2, ggvis
- Model: broom
SLIDE 30 What is ggvis?
(like ggplot2)
- Reactive (interactive & dynamic)
(like shiny)
- A pipeline (a la dplyr)
- Of the web (drawn with vega)
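As a small sketch of the pipeline style, the snippet below builds a scatterplot specification from the built-in mtcars data. It assumes the ggvis package is installed; the object is only constructed here, and printing it in an interactive session is what would draw the plot.

```r
library(ggvis)

# Build (but do not draw) a scatterplot; ~wt and ~mpg refer to columns of mtcars.
plot_spec <- mtcars %>%
  ggvis(~wt, ~mpg) %>%
  layer_points()
```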
SLIDE 31 Demo
4-ggvis.R 4-ggvis.Rmd
SLIDE 32 Google for “ggvis”
SLIDE 33 Model
with broom, by David Robinson
SLIDE 34
- Import: readr, readxl, haven, DBI, httr
- Tidy: tidyr
- Transform: dplyr
- Visualise: ggplot2, ggvis
- Model: broom
SLIDE 35 [Figure: log(sales) against date, 1990 to 2015]
46 TX cities, ~25 years of data
What makes it hard to see the long term trend?
SLIDE 36
# Models are useful as a tool for removing
# known patterns
tx <- tx %>%
  group_by(city) %>%
  mutate(
    resid = lm(
      log(sales) ~ factor(month),
      na.action = na.exclude
    ) %>% resid()
  )
SLIDE 37 [Figure: resid against date, 1990 to 2015]
SLIDE 38
# Models are also useful in their own right
models <- tx %>%
  group_by(city) %>%
  do(mod = lm(
    log(sales) ~ factor(month),
    data = .,
    na.action = na.exclude
  ))
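Both patterns can be tried without the housing data. In the sketch below the built-in mtcars data stands in for tx, with cyl playing the role of city and a simple mpg ~ wt model replacing the seasonal one; these substitutions are assumptions made only so the example is self-contained.

```r
library(dplyr)

# Per-group residuals: fit one model per group and keep what it does not explain.
with_resid <- mtcars %>%
  group_by(cyl) %>%
  mutate(resid = lm(mpg ~ wt, na.action = na.exclude) %>% resid())

# Per-group models: keep the fitted model objects themselves in a list column.
models <- mtcars %>%
  group_by(cyl) %>%
  do(mod = lm(mpg ~ wt, data = ., na.action = na.exclude))
```

Here models has one row per group, and each element of models$mod is an ordinary lm object.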
SLIDE 39 Model summaries
- Model level: one row per model
- Coefficient level: one row per coefficient (per model)
- Observation level: one row per observation (per model)
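In broom these three levels correspond to glance(), tidy() and augment(). The sketch below applies them to a single model fitted on the built-in mtcars data, which is used here only as a convenient example.

```r
library(broom)

mod <- lm(mpg ~ wt, data = mtcars)

model_level <- glance(mod)    # one row per model (R squared, AIC, ...)
coef_level  <- tidy(mod)      # one row per coefficient (estimate, std.error, ...)
obs_level   <- augment(mod)   # one row per observation (fitted values, residuals, ...)
```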
SLIDE 41 Google for “broom r”
SLIDE 43
- Big: Can’t fit in memory on
- Medium: Fits in memory on a server: 10 GB-5 TB
- Small: Fits in memory on a laptop: <10 GB
R is great at this!
SLIDE 44 R
- R provides an excellent environment for
rapid interactive exploration of small data.
- There is no technical reason why it can’t
also work well with medium size data. (But the work mostly hasn’t been done)
SLIDE 45
1. Can be reduced to a small data problem with subsetting/sampling/summarising (90%)
2. Can be reduced to a very large number of small data problems (9%)
3. Is irreducibly big (1%)
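The three reductions named in the first case can each be written as a one-line dplyr pipeline. The data frame below is simulated for this sketch and is not large; it only stands in for a table that would be too big to explore directly.

```r
library(dplyr)

# Simulated stand-in for a large table; sizes and names are arbitrary.
set.seed(1)
big <- data.frame(
  group = sample(letters[1:5], 1e5, replace = TRUE),
  value = rnorm(1e5)
)

subset_small  <- big %>% filter(group == "a")   # subsetting
sample_small  <- big %>% sample_n(1000)         # sampling
summary_small <- big %>%                        # summarising
  group_by(group) %>%
  summarise(mean_value = mean(value), n = n())
```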
SLIDE 46 The right small data
- Rapid iteration essential
- dplyr supports this activity by avoiding
cognitive costs of switching between languages.
SLIDE 47 Lots of small problems
- Embarrassingly parallel (e.g. Hadoop)
- R wrappers like foreach, rhipe, rhadoop
- The challenge is matching the architecture of
computing to data storage
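A minimal sketch of the split-then-apply shape of such problems, using the parallel package that ships with R rather than the wrappers named above. The data, the grouping column and the model are stand-ins chosen for illustration, and mc.cores is left at 1 so the code runs everywhere; raising it forks worker processes on Unix-like systems.

```r
library(parallel)

# Split one table into independent pieces; each piece is a small data problem.
pieces <- split(mtcars, mtcars$cyl)

# Solve every piece with the same function. The pieces do not depend on each
# other, so they could run on separate cores or machines.
fits <- mclapply(
  pieces,
  function(d) coef(lm(mpg ~ wt, data = d)),
  mc.cores = 1
)
```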
SLIDE 48 Irreducibly big
- Computation must be performed by
specialised system.
- Typically C/C++, Fortran, Scala.
- R needs to be able to talk to those
systems.
SLIDE 50 End game
Provide a fluent interface where you spend your mental energy on the specific data problem, not the general data analysis process.
The best tools become invisible with time!
Still a lot of work to do, especially on the connection between modelling and visualisation.
SLIDE 51
- Import: readr, readxl, haven, DBI, httr
- Tidy: tidyr
- Transform: dplyr
- Visualise: ggplot2, ggvis
- Model: broom