Spark Processing 101 September 10, 2015 Justin Sun Overview What - - PowerPoint PPT Presentation



SLIDE 1

Spark Processing 101

September 10, 2015 Justin Sun

SLIDE 2

Overview

What is Spark? SparkContext Resilient Distributed Datasets (RDDs) Transformations Actions Code Examples Resources

SLIDE 3

What is Spark?

General cluster computing system for Big Data
Supports in-memory processing
APIs for Scala, Java, and Python
Additional libraries:
    Spark Streaming – Process live data streams
    Spark SQL – SQL and DataFrames
    MLlib – Machine learning
    GraphX – Graph processing

SLIDE 4

Spark Context

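Starting point for working with Spark
Specifies access to a cluster or the local machine
Required if you write a standalone program
Provided as ‘sc’ by the Spark shell

Scala:

    import org.apache.spark.{SparkConf, SparkContext}
    // setAppName sets the name shown in the Spark web UI
    val conf = new SparkConf().setAppName("Simple App")
    val sc = new SparkContext(conf)

Java:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;
    SparkConf conf = new SparkConf().setAppName("Simple App");
    JavaSparkContext sc = new JavaSparkContext(conf);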

SLIDE 5

Resilient Distributed Datasets (RDDs)

Main abstraction in Spark
Fault-tolerant
Supports parallel operations
Create RDDs by:
    Calling sc.parallelize()
    Reading in data from an external source
        Text file – sc.textFile()
        HDFS source
        Cassandra

SLIDE 6

Transformations

Immutable after creation (an RDD is never modified in place; a transformation returns a new one)
Enable parallel computations
Input is an RDD, output is a pointer to a new RDD
Can be chained together
Arguments are functions or closures
Lazy evaluation: nothing happens until an action is run
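The lazy-evaluation point can be tried without a cluster. The sketch below is an analogy, not Spark code: it uses Java's standard java.util.stream library, whose intermediate operations (map, filter) are also deferred until a terminal operation runs, much as Spark transformations wait for an action. The class and method names (LazyEvaluationSketch, callsBeforeAndAfter) are made up for this illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Analogy only: java.util.stream stands in for Spark; no Spark classes are used.
class LazyEvaluationSketch {

    // Returns {calls before the terminal operation, calls after it, result size}.
    static int[] callsBeforeAndAfter(List<Integer> input) {
        AtomicInteger calls = new AtomicInteger();

        // Chained "transformations": building the pipeline runs nothing yet.
        Stream<Integer> pipeline = input.stream()
                .map(x -> { calls.incrementAndGet(); return x * 10; })
                .filter(x -> x > 10);
        int before = calls.get();

        // The terminal operation plays the role of an action and triggers the work.
        List<Integer> result = pipeline.collect(Collectors.toList());
        int after = calls.get();

        return new int[] { before, after, result.size() };
    }

    public static void main(String[] args) {
        int[] r = callsBeforeAndAfter(Arrays.asList(1, 2, 3));
        System.out.println("map calls before terminal operation: " + r[0]);
        System.out.println("map calls after terminal operation:  " + r[1]);
        System.out.println("elements kept by the filter:         " + r[2]);
    }
}
```

The mapping function is not called while the pipeline is being assembled; it runs once per element only when collect() is invoked. In Spark the same pattern applies to RDD transformations followed by an action such as collect() or count().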

SLIDE 7

Actions

Program is run when an action is called
Examples:
    reduce()
    collect()
    count()
    first()
    take()

SLIDE 8

Visual Transformations

DataBricks Visual Guide to Spark Transformations and Actions – http://training.databricks.com/visualapi.pdf
    map()
    filter()
    flatMap()

SLIDE 9

Code examples

http://spark.apache.org/docs/latest/quick-start.html

SLIDE 10

Resources

Spark website – http://spark.apache.org/docs/latest
Quick Start – http://spark.apache.org/docs/latest/quick-start.html
DataBricks Developer Resources – https://databricks.com/spark/developer-resources
Spark YouTube channel – https://www.youtube.com/channel/UCRzsq7k4-kT-h3TDUBQ82-w
edX.org Online Courses
    CS100.1X – Introduction to Big Data with Apache Spark
    CS190.1X – Scalable Machine Learning