Scalable Data Science with Hadoop, Spark and R
Mario Inchiosa, PhD Principal Software Engineer Microsoft Data Group DSC 2016 July 2, 2016
R Server portfolio: Cloud, RDBMS, Desktops & Servers, Hadoop & Spark, EDW
R Server Technology
intermediate combiner)
another iteration is started
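The fragments above appear to describe an iterative, chunk-at-a-time computation: each chunk yields an intermediate result, the results are combined, and another iteration starts if the fit has not converged. A minimal base R sketch of that pattern follows. It uses iteratively reweighted least squares for logistic regression as a stand-in, since the slides do not say which algorithm R Server uses internally, and every function name in it is hypothetical.

```r
# Hypothetical helper: one chunk's contribution to a single IRLS step.
chunkStats <- function(X, y, beta) {
  eta <- as.vector(X %*% beta)          # linear predictor for this chunk
  mu  <- 1 / (1 + exp(-eta))            # fitted probabilities
  w   <- mu * (1 - mu)                  # IRLS weights
  z   <- eta + (y - mu) / w             # working response
  list(xtwx = crossprod(X, w * X),      # X' W X for this chunk
       xtwz = crossprod(X, w * z))      # X' W z for this chunk
}

# Hypothetical combiner: intermediate results simply add up.
combine <- function(a, b) {
  list(xtwx = a$xtwx + b$xtwx, xtwz = a$xtwz + b$xtwz)
}

# Hypothetical driver: process all chunks, combine, update, check convergence.
fitLogitByChunks <- function(chunks, p, tol = 1e-8, maxIter = 25) {
  beta <- rep(0, p)
  for (iter in seq_len(maxIter)) {
    parts   <- lapply(chunks, function(ch) chunkStats(ch$X, ch$y, beta))
    total   <- Reduce(combine, parts)
    newBeta <- as.vector(solve(total$xtwx, total$xtwz))
    converged <- max(abs(newBeta - beta)) < tol
    beta <- newBeta
    if (converged) break                # otherwise another iteration is started
  }
  beta
}

# Simulated data split into four chunks (all values are made up for the sketch).
set.seed(1)
n <- 1000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + 1.2 * x))
X <- cbind(1, x)
idx <- split(seq_len(n), rep(1:4, length.out = n))
chunks <- lapply(idx, function(i) list(X = X[i, , drop = FALSE], y = y[i]))

betaChunked <- fitLogitByChunks(chunks, p = 2)
betaGlm     <- unname(coef(glm(y ~ x, family = binomial())))
```

Because each chunk only returns small summary matrices, the full data set never has to sit in memory at once, and the chunks could in principle be processed on different machines.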
Define Compute Context:
rxSetComputeContext( RxHadoopMR(…) )
Define Data Source:
inData <- RxTextData("/ds/AirOnTime.csv", fileSystem = hdfsFS)
Train Predictive Model:
model <- rxLogit(ARR_DEL15 ~ DAY_OF_WEEK + UNIQUE_CARRIER, data = inData)
Change the Compute Context:
rxSetComputeContext( RxSpark(…) )
Keep other code unchanged
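The point of these two slides is that only the compute context changes between Hadoop MapReduce and Spark. A minimal sketch of that pattern follows, assuming the RevoScaleR package that ships with Microsoft R Server is installed. `fitArrivalDelayModel` is a hypothetical wrapper name, and the compute context and HDFS file system objects are passed in by the caller because their constructor arguments are elided on the slides.

```r
# Sketch only: the body needs RevoScaleR (Microsoft R Server) when it is called.
# fitArrivalDelayModel is a hypothetical helper, not part of RevoScaleR.
fitArrivalDelayModel <- function(computeContext, hdfsFS) {
  # Choose where the rx* functions execute; this is the only line that
  # differs between a MapReduce run and a Spark run.
  RevoScaleR::rxSetComputeContext(computeContext)

  # Data source on HDFS, using the path shown on the slide.
  inData <- RevoScaleR::RxTextData("/ds/AirOnTime.csv", fileSystem = hdfsFS)

  # Logistic regression with the formula shown on the slide.
  RevoScaleR::rxLogit(ARR_DEL15 ~ DAY_OF_WEEK + UNIQUE_CARRIER, data = inData)
}
```

Calling the helper once with an `RxHadoopMR(…)` context and once with an `RxSpark(…)` context would run the same model code on either engine.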
ODBC
ETL
Statistical Tests
Sampling
variables)
form)
Descriptive Statistics
variables)
family distributions: binomial, Gaussian, inverse Gaussian, Poisson, Tweedie. Standard link functions: cauchit, identity, log, logit, probit. User defined distributions & link functions.
Predictive Statistics
Clustering
Machine Learning
Simulation
Custom Parallelization
Variable Selection
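The distribution families and link functions listed under Predictive Statistics follow the usual GLM vocabulary. As a rough illustration, the sketch below uses base R's `glm()` on simulated data rather than the R Server implementation; the variable names and simulated coefficients are assumptions made for the example. Tweedie is left out because base R has no Tweedie family.

```r
# Simulated data; every number here is made up for the illustration.
set.seed(42)
n <- 500
x <- runif(n)

# Poisson family with the log link.
counts  <- rpois(n, exp(0.3 + 1.5 * x))
poisFit <- glm(counts ~ x, family = poisson(link = "log"))

# Binomial family with the probit link.
success   <- rbinom(n, 1, pnorm(-0.2 + 0.9 * x))
probitFit <- glm(success ~ x, family = binomial(link = "probit"))

# Gaussian family with the identity link (ordinary linear regression).
response <- 2 + 3 * x + rnorm(n, sd = 0.5)
gaussFit <- glm(response ~ x, family = gaussian(link = "identity"))

# Inspect which family and link each fit used.
fitLinks <- c(poisson  = family(poisFit)$link,
              binomial = family(probitFit)$link,
              gaussian = family(gaussFit)$link)
```

Swapping the `family` argument is all it takes to move between these models, which is the same idea the slide conveys for the listed distributions and links.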
R Server
Master R process on Edge Node
Apache YARN and Spark
Worker R processes on Data Nodes
Data in Distributed Storage
[Diagram: R process on Edge Node running the R Server Master Task (Initiator, Finalizer), with multiple Worker Tasks]
Remote Execution: ssh, Web Services
DeployR
ssh or R Tools for Visual Studio, BI Tools & Applications, Jupyter Notebooks, Thin Client IDEs
MapReduce
SparkR functions, RevoScaleR functions, R, Spark and Hadoop, Blob Storage, Data Lake Storage
algorithms and statistical functions
machines
Logistic Regression on NYC Taxi Dataset (2.2 TB)
[Chart: Elapsed Time versus Billions of rows, 1 to 13]
Prepare, Model, Operationalize
Data
Processing
Analytics
Presentation
applications
BAM files, VariantTools, GWAS
Platform