Spark Processing 101
September 10, 2015 Justin Sun

Overview
  What is Spark?
  SparkContext
  Resilient Distributed Datasets (RDDs)
  Transformations
  Actions
  Code Examples
  Resources

What is Spark?
  General cluster computing system for Big Data
  Supports in-memory processing
  APIs for Scala, Java, and Python
  Additional libraries:
    Spark Streaming – Process live data streams
    Spark SQL – SQL and DataFrames
    MLlib – Machine learning
    GraphX – Graph processing

SparkContext
  Starting point for working with Spark
  Specifies access to a cluster or the local machine
  Required if you write a standalone program
  Provided as ‘sc’ by the Spark shell
  Scala:
    val conf = new SparkConf().setAppName("Simple App")
    val sc = new SparkContext(conf)
  Java:
    SparkConf conf = new SparkConf().setAppName("Simple App");
    JavaSparkContext sc = new JavaSparkContext(conf);
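
The snippets above omit the imports and the master setting a standalone program needs. A minimal sketch of a complete Scala program follows. The object name SimpleApp, the helper createContext, and the "local[*]" master (run on one machine using all cores) are assumptions for illustration; on a real cluster the master is normally supplied by spark-submit rather than hard-coded.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SimpleApp {
  // Hypothetical helper: builds a context for the given app name and master URL.
  def createContext(appName: String, master: String): SparkContext = {
    val conf = new SparkConf().setAppName(appName).setMaster(master)
    new SparkContext(conf)
  }

  def main(args: Array[String]): Unit = {
    // "local[*]" is an assumption for trying this without a cluster.
    val sc = createContext("Simple App", "local[*]")
    try {
      println(s"Running ${sc.appName} on master ${sc.master}")
    } finally {
      sc.stop() // release the resources held by the context
    }
  }
}
```

In the Spark shell none of this is needed, since the shell already provides the context as sc.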

Resilient Distributed Datasets (RDDs)
  Main abstraction in Spark
  Fault-tolerant
  Supports parallel operations
  Create RDDs by
    Calling sc.parallelize()
    Reading in data from an external source
      Text file – sc.textFile()
      HDFS source
      Cassandra
  Immutable after creation

Transformations
  Enable parallel computations
  Input is an RDD, output is a pointer to an RDD
  Can be chained together
  Arguments are functions or closures
  Lazy evaluation: Nothing happens until an action is run
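
The points above on creating RDDs and chaining lazy transformations can be sketched as follows. This is an illustrative example, not code from the talk: the object name RddPipeline, the helper evenSquares, the sample data, and the "local[*]" master are all assumptions. The text file path is taken from the first command-line argument, if one is given.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object RddPipeline {
  // Two chained transformations. Calling this only describes the computation;
  // no job runs until an action is invoked on the returned RDD.
  def evenSquares(nums: RDD[Int]): RDD[Int] =
    nums.filter(_ % 2 == 0).map(n => n * n)

  def main(args: Array[String]): Unit = {
    // "local[*]" is an assumption for trying this on a single machine.
    val sc = new SparkContext(
      new SparkConf().setAppName("RDD Pipeline").setMaster("local[*]"))
    try {
      // Create an RDD from an in-memory collection.
      val nums = sc.parallelize(1 to 10)

      // Still lazy: this yields a new RDD and leaves nums unchanged.
      val result = evenSquares(nums)

      // collect() is an action, so the work happens here.
      println(result.collect().mkString(", ")) // prints 4, 16, 36, 64, 100

      // Create an RDD from a text file, one element per line.
      args.headOption.foreach { path =>
        val lines = sc.textFile(path)
        println(s"$path has ${lines.count()} lines")
      }
    } finally {
      sc.stop()
    }
  }
}
```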

Actions
  Program is run when an action is called
  Examples:
    reduce()
    collect()
    count()
    first()
    take()
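
As a rough illustration of the actions listed above, the sketch below runs each one against a small RDD of the numbers 1 to 5. The object name ActionsDemo, the helper sample, and the "local[*]" master are assumptions made for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object ActionsDemo {
  // Hypothetical sample data: the integers 1 through 5.
  def sample(sc: SparkContext): RDD[Int] = sc.parallelize(1 to 5)

  def main(args: Array[String]): Unit = {
    // "local[*]" is an assumption for trying this on a single machine.
    val sc = new SparkContext(
      new SparkConf().setAppName("Actions Demo").setMaster("local[*]"))
    try {
      val nums = sample(sc)
      println(nums.reduce(_ + _))            // combines all elements: 15
      println(nums.collect().mkString(", ")) // brings every element to the driver: 1, 2, 3, 4, 5
      println(nums.count())                  // number of elements: 5
      println(nums.first())                  // first element: 1
      println(nums.take(2).mkString(", "))   // first two elements: 1, 2
    } finally {
      sc.stop()
    }
  }
}
```

Each call returns a plain value to the driver program rather than another RDD. collect() copies the whole dataset to the driver, so it is best kept to small results.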

Code Examples
  DataBricks Visual Guide to Spark Transformations and Actions – http://training.databricks.com/visualapi.pdf
    map()
    filter()
    flatMap()
  http://spark.apache.org/docs/latest/quick-start.html
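
The talk's own demo code is not reproduced here, so the following is only a sketch of how the three transformations named above differ. The object name TransformDemo, its helper names, the two sample strings, and the "local[*]" master are assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object TransformDemo {
  // map: exactly one output element per input element.
  def lengths(lines: RDD[String]): RDD[Int] = lines.map(_.length)

  // filter: keeps only the elements for which the predicate is true.
  def withSpaces(lines: RDD[String]): RDD[String] = lines.filter(_.contains(" "))

  // flatMap: zero or more output elements per input element, flattened into one RDD.
  def words(lines: RDD[String]): RDD[String] = lines.flatMap(_.split(" "))

  def main(args: Array[String]): Unit = {
    // "local[*]" is an assumption for trying this on a single machine.
    val sc = new SparkContext(
      new SparkConf().setAppName("Transform Demo").setMaster("local[*]"))
    try {
      val lines = sc.parallelize(Seq("hello world", "hi"))
      println(lengths(lines).collect().mkString(", "))    // 11, 2
      println(withSpaces(lines).collect().mkString(", ")) // hello world
      println(words(lines).collect().mkString(", "))      // hello, world, hi
    } finally {
      sc.stop()
    }
  }
}
```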

Resources
  Spark website – http://spark.apache.org/docs/latest
  Quick Start – http://spark.apache.org/docs/latest/quick-start.html
  DataBricks Developer Resources – https://databricks.com/spark/developer-resources
  Spark YouTube channel – https://www.youtube.com/channel/UCRzsq7k4-kT-h3TDUBQ82-w
  edX.org Online Courses
    CS100.1X – Introduction to Big Data with Apache Spark
    CS190.1X – Scalable Machine Learning