Pipelines for data analysis in R, Hadley Wickham (@hadleywickham)



slide-1
SLIDE 1

Pipelines for data analysis in R

Hadley Wickham
@hadleywickham
Chief Scientist, RStudio

October 2015
slide-2
SLIDE 2

Data analysis is the process by which data becomes understanding, knowledge and insight

slide-3
SLIDE 3

Data analysis is the process by which data becomes understanding, knowledge and insight

slide-4
SLIDE 4

Import → Tidy → Transform / Visualise / Model

  • Tidy: consistent way of storing data
  • Transform: create new variables & new summaries
  • Visualise: surprises, but doesn't scale
  • Model: scales, but doesn't (fundamentally) surprise
slide-5
SLIDE 5

  • Import: readr, readxl, haven, DBI, httr
  • Tidy: tidyr
  • Transform: dplyr
  • Visualise: ggplot2, ggvis
  • Model: broom
slide-6
SLIDE 6

Pipelines

slide-7
SLIDE 7

Think it (cognitive) → Describe it (precisely) → Do it (computational)
slide-8
SLIDE 8

Cognition time ≫ Computation time

http://www.flickr.com/photos/mutsmuts/4695658106
slide-9
SLIDE 9

magrittr::%>%

Inspirations: Unix, F#, Haskell, Clojure, method chaining
slide-10
SLIDE 10

foo_foo <- little_bunny()

bop_on(
  scoop_up(
    hop_through(foo_foo, forest),
    field_mouse
  ),
  head
)

# vs

foo_foo %>%
  hop_through(forest) %>%
  scoop_up(field_mouse) %>%
  bop_on(head)
slide-11
SLIDE 11

x %>% f(y)            # f(x, y)
x %>% f(z, .)         # f(z, x)
x %>% f(y) %>% g(z)   # g(f(x, y), z)

# Turns function composition (hard to read)
# into sequence (easy to read)
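
The three rewrites on this slide can be checked directly. The following is a minimal sketch, assuming the magrittr package is installed; `f`, `g` and the values of `x`, `y`, `z` are hypothetical and exist only for this illustration.

```r
library(magrittr)

# Hypothetical helpers, chosen so that argument order matters.
f <- function(a, b) a - b
g <- function(a, b) a * b

x <- 10
y <- 3
z <- 2

r1 <- x %>% f(y)            # piped value becomes the first argument: f(x, y)
r2 <- x %>% f(z, .)         # the dot places the piped value explicitly: f(z, x)
r3 <- x %>% f(y) %>% g(z)   # a sequence of steps: g(f(x, y), z)

# Each piped form agrees with its nested equivalent.
stopifnot(
  identical(r1, f(x, y)),
  identical(r2, f(z, x)),
  identical(r3, g(f(x, y), z))
)
```

Because `f` is not symmetric in its arguments, the check would fail if the pipe placed `x` anywhere other than where the slide says it does.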
slide-12
SLIDE 12

# Any function can use it. Only needs a simple
# property: the type of the first argument
# needs to be the same as the type of the result.

# tidyr: pipelines for messy -> tidy data
# dplyr: pipelines for data manipulation
# ggvis: pipelines for visualisations
# rvest: pipelines for html
# purrr: pipelines for lists
# xml2: pipelines for xml
# stringr: pipelines for strings
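
To illustrate the property described on this slide, here is a minimal sketch of two user-written functions that compose with `%>%` because each takes a data frame as its first argument and returns a data frame. `add_ratio()` and `keep_above()` are hypothetical names invented for this example, not functions from any package; `mtcars` is the dataset that ships with R.

```r
library(magrittr)

# Hypothetical verb: data frame in, data frame out.
add_ratio <- function(df, num, den) {
  df$ratio <- df[[num]] / df[[den]]
  df
}

# Hypothetical verb: keeps rows where `col` exceeds `threshold`.
keep_above <- function(df, col, threshold) {
  df[df[[col]] > threshold, , drop = FALSE]
}

result <- mtcars %>%
  add_ratio("hp", "wt") %>%
  keep_above("ratio", 50)
```

Since the result of every step has the same type as the first argument of the next, the steps can be reordered or extended without any glue code.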
slide-13
SLIDE 13

Tidy

slide-14
SLIDE 14

  • Import: readr, readxl, haven, DBI, httr
  • Tidy: tidyr
  • Transform: dplyr
  • Visualise: ggplot2, ggvis
  • Model: broom
slide-15
SLIDE 15

Storage        Meaning
Table / File   Data set
Rows           Observations
Columns        Variables

Tidy data = data that makes data analysis easy
slide-16
SLIDE 16

Source: local data frame [5,769 x 22]

    iso2  year   m04  m514  m014 m1524 m2534 m3544 m4554 m5564   m65    mu   f04  f514
   (chr) (int) (int) (int) (int) (int) (int) (int) (int) (int) (int) (int) (int) (int)
 1    AD  1989    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 2    AD  1990    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 3    AD  1991    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 4    AD  1992    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 5    AD  1993    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 6    AD  1994    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 7    AD  1996    NA    NA     0     0     0     4     1     0     0    NA    NA    NA
 8    AD  1997    NA    NA     0     0     1     2     2     1     6    NA    NA    NA
 9    AD  1998    NA    NA     0     0     0     1     0     0     0    NA    NA    NA
10    AD  1999    NA    NA     0     0     0     1     1     0     0    NA    NA    NA
11    AD  2000    NA    NA     0     0     1     0     0     0     0    NA    NA    NA
12    AD  2001    NA    NA     0    NA    NA     2     1    NA    NA    NA    NA    NA
13    AD  2002    NA    NA     0     0     0     1     0     0     0    NA    NA    NA
14    AD  2003    NA    NA     0     0     0     1     2     0     0    NA    NA    NA
15    AD  2004    NA    NA     0     0     0     1     1     0     0    NA    NA    NA
16    AD  2005     0     0     0     0     1     1     0     0     0     0     0     0
..   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...
Variables not shown: f014 (int), f1524 (int), f2534 (int), f3544 (int), f4554 (int), f5564 (int), f65 (int), fu (int)

What are the variables in this dataset?
(Hint: f = female, u = unknown, 1524 = 15-24)
slide-17
SLIDE 17

# To convert this messy data into tidy data
# we need two verbs. First we need to gather
# together all the columns that aren't variables
tb2 <- tb %>%
  gather(demo, n, -iso2, -year, na.rm = TRUE)
tb2
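
The full `tb` data set is not included in this transcript, so the following is only a sketch of the same `gather()` call on a hypothetical three-row, three-column excerpt (`tb_small`) whose values are copied from the printout on the previous slide. It assumes tidyr and magrittr are installed. In current tidyr releases `gather()` still works but is superseded by `pivot_longer()`.

```r
library(tidyr)
library(magrittr)

# Hypothetical excerpt of `tb`: a few rows and columns from the printout.
tb_small <- data.frame(
  iso2  = c("AD", "AD", "AD"),
  year  = c(1996L, 1997L, 2005L),
  m3544 = c(4L, 2L, 1L),
  m65   = c(0L, 6L, 0L),
  f04   = c(NA, NA, 0L),
  stringsAsFactors = FALSE
)

# Collapse every column except iso2 and year into key/value pairs,
# dropping the cells that are NA.
tb2_small <- tb_small %>%
  gather(demo, n, -iso2, -year, na.rm = TRUE)

tb2_small
```

Nine cells go in and two of them are `NA`, so the result has seven rows, one per observed country/year/demographic combination.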
slide-18
SLIDE 18

# Then separate the demographic variable into
# sex and age
tb3 <- tb2 %>%
  separate(demo, c("sex", "age"), 1)
tb3

# Many tidyr verbs come in pairs:
# spread vs. gather
# extract/separate vs. unite
# nest vs. unnest
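
As a sketch of what the positional `1` does in `separate()`, the example below splits a hypothetical stand-in for `tb2` (`tb2_small`, with values taken from the printout two slides back) after the first character, then uses `unite()` to show that the two verbs are inverses. It assumes tidyr and magrittr are installed.

```r
library(tidyr)
library(magrittr)

# Hypothetical stand-in for `tb2`, already in gathered form.
tb2_small <- data.frame(
  iso2 = c("AD", "AD", "AD"),
  year = c(1996L, 1997L, 2005L),
  demo = c("m3544", "m65", "f04"),
  n    = c(4L, 6L, 0L),
  stringsAsFactors = FALSE
)

# A numeric `sep` is a character position: split after the 1st character.
tb3_small <- tb2_small %>%
  separate(demo, c("sex", "age"), sep = 1)

# unite() is the paired verb: it pastes the columns back together.
roundtrip <- tb3_small %>%
  unite(demo, sex, age, sep = "")
```

Note that `age` stays a character column here ("3544", "65", "04"), which preserves the leading zero in "04".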
slide-19
SLIDE 19

Google for “tidyr” & “tidy data”

slide-20
SLIDE 20

Transform

slide-21
SLIDE 21

  • Import: readr, readxl, haven, DBI, httr
  • Tidy: tidyr
  • Transform: dplyr
  • Visualise: ggplot2, ggvis
  • Model: broom
slide-22
SLIDE 22

Think it (cognitive) → Describe it (precisely) → Do it (computational)
slide-23
SLIDE 23

One table verbs

  • select: subset variables by name
  • filter: subset observations by value
  • mutate: add new variables
  • summarise: reduce to a single obs
  • arrange: re-order the observations
+ group by
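
The demo script for this section is not part of the transcript. As a stand-in, here is a minimal sketch that uses each of the five verbs plus `group_by()` once, on the `mtcars` dataset that ships with R. It assumes dplyr is installed; the `wt < 4` cutoff and the `wt_kg` column are arbitrary choices for illustration.

```r
library(dplyr)

by_cyl <- mtcars %>%
  select(mpg, cyl, wt) %>%                   # subset variables by name
  filter(wt < 4) %>%                         # subset observations by value
  mutate(wt_kg = wt * 453.592) %>%           # add a variable (wt is in 1000 lb)
  group_by(cyl) %>%                          # following verbs act per group
  summarise(n = n(), mean_mpg = mean(mpg)) %>%  # reduce each group to one row
  arrange(desc(mean_mpg))                    # re-order the observations

by_cyl
```

With `group_by()` in place, `summarise()` returns one row per cylinder count rather than a single row for the whole table.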
slide-24
SLIDE 24

Demo

slide-25
SLIDE 25

Two table verbs

  • Mutating: inner_join(), left_join(), right_join(), full_join()
  • Filtering: semi_join(), anti_join()
  • Set: intersect(), union(), setdiff()
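
The difference between the mutating and filtering joins is easiest to see on a toy example. This is a minimal sketch assuming dplyr is installed; the tables `x` and `y` and their columns are hypothetical.

```r
library(dplyr)

# Hypothetical tables sharing a `key` column; keys 2 and 3 overlap.
x <- data.frame(key = c(1, 2, 3), x_val = c("a", "b", "c"))
y <- data.frame(key = c(2, 3, 4), y_val = c("B", "C", "D"))

# Mutating joins add columns from y to x.
inner <- inner_join(x, y, by = "key")   # keys present in both: 2, 3
left  <- left_join(x, y, by = "key")    # every row of x: 1, 2, 3
full  <- full_join(x, y, by = "key")    # every key from either: 1, 2, 3, 4

# Filtering joins only keep or drop rows of x; no columns are added.
semi <- semi_join(x, y, by = "key")     # rows of x with a match in y
anti <- anti_join(x, y, by = "key")     # rows of x without a match in y
```

The set operations (`intersect()`, `union()`, `setdiff()`) compare whole rows instead of matching on a key, so they expect both tables to have the same columns.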
slide-26
SLIDE 26

dplyr sources

  • Local data frame (C++)
  • Local data table
  • Local data cube (experimental)
  • RDBMS: Postgres, MySQL, SQLite, Oracle, MS SQL, JDBC, Impala
  • MonetDB, BigQuery
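
As a sketch of what a non-local source looks like in practice, the example below runs ordinary dplyr verbs against an in-memory SQLite database. It is written against current package versions, which is an assumption: it needs DBI, RSQLite and dbplyr installed alongside dplyr, and the interface available at the time of the talk differed in detail.

```r
library(dplyr)

# An in-memory SQLite database holding a copy of mtcars.
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(con, "mtcars", mtcars)

# The same verbs as for a local data frame; they are translated to SQL
# and executed by the database until collect() pulls the result into R.
mpg_by_cyl <- tbl(con, "mtcars") %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg, na.rm = TRUE)) %>%
  arrange(cyl) %>%
  collect()

DBI::dbDisconnect(con)

mpg_by_cyl
```

Only the three summary rows cross from the database into R, which is the point of pushing the computation to the source.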
slide-27
SLIDE 27

Google for “dplyr”

slide-28
SLIDE 28

Visualise

slide-29
SLIDE 29

  • Import: readr, readxl, haven, DBI, httr
  • Tidy: tidyr
  • Transform: dplyr
  • Visualise: ggplot2, ggvis
  • Model: broom
slide-30
SLIDE 30

What is ggvis?

  • A grammar of graphics (like ggplot2)
  • Reactive (interactive & dynamic) (like shiny)
  • A pipeline (a la dplyr)
  • Of the web (drawn with vega)
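
The demo files for this section are not part of the transcript. As a stand-in, here is a minimal sketch of the pipeline style, assuming the ggvis package is installed and using the `mtcars` dataset that ships with R.

```r
library(ggvis)

# Data flows into ggvis(), which maps variables to visual properties;
# each layer_*() call then adds marks to the plot.
p <- mtcars %>%
  ggvis(x = ~wt, y = ~mpg) %>%
  layer_points()

# Printing `p` in an interactive session renders the plot in a
# browser or the RStudio viewer.
```

The `~` marks `wt` and `mpg` as variables to look up in the data rather than constant values.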
slide-31
SLIDE 31

Demo

4-ggvis.R 4-ggvis.Rmd
slide-32
SLIDE 32

Google for “ggvis”

slide-33
SLIDE 33

Model

with broom, by David Robinson
slide-34
SLIDE 34

  • Import: readr, readxl, haven, DBI, httr
  • Tidy: tidyr
  • Transform: dplyr
  • Visualise: ggplot2, ggvis
  • Model: broom
slide-35
SLIDE 35

[Figure: log(sales) against date, 1990-2015]

46 TX cities, ~25 years of data

What makes it hard to see the long term trend?
slide-36
SLIDE 36

# Models are useful as a tool for removing
# known patterns
tx <- tx %>%
  group_by(city) %>%
  mutate(
    resid = lm(
      log(sales) ~ factor(month),
      na.action = na.exclude
    ) %>% resid()
  )
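
The `tx` data set is not included in this transcript, so the following sketch applies the same idea to simulated data. `tx_small`, its two cities "A" and "B", and the shape of the seasonal pattern are all hypothetical; only the `group_by()` plus `mutate()` with `lm()` pattern comes from the slide. It assumes dplyr is installed.

```r
library(dplyr)

set.seed(1)

# Hypothetical stand-in for `tx`: 2 cities, 3 years of monthly sales
# with a built-in seasonal pattern plus a little noise.
tx_small <- expand.grid(
  month = 1:12, year = 2000:2002, city = c("A", "B"),
  stringsAsFactors = FALSE
) %>%
  mutate(sales = exp(5 + 0.3 * sin(2 * pi * month / 12) + rnorm(n(), sd = 0.05)))

# Fit one model per city and keep only what the month effect cannot explain.
tx_small <- tx_small %>%
  group_by(city) %>%
  mutate(
    resid = resid(lm(log(sales) ~ factor(month), na.action = na.exclude))
  ) %>%
  ungroup()
```

`na.exclude` matters on real data: it pads the residuals with `NA` where `sales` is missing, so the result always has as many values as the group has rows.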
slide-37
SLIDE 37

[Figure: resid against date, 1990-2015]
slide-38
SLIDE 38

# Models are also useful in their own right
models <- tx %>%
  group_by(city) %>%
  do(mod = lm(
    log(sales) ~ factor(month),
    data = .,
    na.action = na.exclude
  ))
slide-39
SLIDE 39

Model summaries

  • Model level: one row per model
  • Coefficient level: one row per coefficient (per model)
  • Observation level: one row per observation (per model)
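
In broom, these three levels correspond to the functions `glance()`, `tidy()` and `augment()`. The sketch below applies them to a single linear model fitted to `mtcars`, the dataset that ships with R; the model itself is an arbitrary choice for illustration and is not taken from the talk's demo script. It assumes broom is installed.

```r
library(broom)

mod <- lm(mpg ~ wt, data = mtcars)

model_level <- glance(mod)    # one row per model: R squared, AIC, ...
coef_level  <- tidy(mod)      # one row per coefficient: estimate, std.error, ...
obs_level   <- augment(mod)   # one row per observation: .fitted, .resid, ...
```

All three return data frames, so the summaries of many models can be stacked and then handled with the same verbs as any other table.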
slide-40
SLIDE 40

Demo

5-broom.R
slide-41
SLIDE 41

Google for “broom r”

slide-42
SLIDE 42

Big data and R

slide-43
SLIDE 43

  • Big: can’t fit in memory on one computer: >5 TB
  • Medium: fits in memory on a server: 10 GB-5 TB
  • Small: fits in memory on a laptop: <10 GB (R is great at this!)
slide-44
SLIDE 44

R

  • R provides an excellent environment for rapid interactive exploration of small data.
  • There is no technical reason why it can’t also work well with medium size data. (But the work mostly hasn’t been done)
  • What about big data?
slide-45
SLIDE 45
  1. Can be reduced to a small data problem with subsetting/sampling/summarising (90%)
  2. Can be reduced to a very large number of small data problems (9%)
  3. Is irreducibly big (1%)
slide-46
SLIDE 46

The right small data

  • Rapid iteration essential
  • dplyr supports this activity by avoiding cognitive costs of switching between languages.
slide-47
SLIDE 47

Lots of small problems

  • Embarrassingly parallel (e.g. Hadoop)
  • R wrappers like foreach, rhipe, rhadoop
  • The challenge is matching the architecture of computing to data storage
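
As a small-scale sketch of the "lots of small problems" pattern, the example below splits `mtcars` (the dataset that ships with R) into independent pieces and processes each with foreach. It assumes the foreach package is installed. `%do%` runs the pieces sequentially; swapping in `%dopar%` after registering a parallel backend would distribute them, which is not shown here.

```r
library(foreach)

# Each piece is an independent small problem.
pieces <- split(mtcars, mtcars$cyl)

# Solve every piece on its own, then combine the results row-wise.
per_group <- foreach(piece = pieces, .combine = rbind) %do% {
  data.frame(
    cyl      = piece$cyl[1],
    n        = nrow(piece),
    mean_mpg = mean(piece$mpg)
  )
}

per_group
```

No piece needs anything from any other piece, which is what makes the problem embarrassingly parallel.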
slide-48
SLIDE 48

Irreducibly big

  • Computation must be performed by specialised system.
  • Typically C/C++, Fortran, Scala.
  • R needs to be able to talk to those systems.
slide-49
SLIDE 49

Future work

slide-50
SLIDE 50

End game

Provide a fluent interface where you spend your mental energy on the specific data problem, not the general data analysis process.

The best tools become invisible with time!

Still a lot of work to do, especially on the connection between modelling and visualisation.
slide-51
SLIDE 51

  • Import: readr, readxl, haven, DBI, httr
  • Tidy: tidyr
  • Transform: dplyr
  • Visualise: ggplot2, ggvis
  • Model: broom