Power Pig with Spark, Kelly Zhang (liyun.zhang@intel.com), Apache Big Data Europe 2016

SLIDE 1

Power Pig with Spark

Kelly Zhang (liyun.zhang@intel.com)

Apache Big Data Europe 2016

SLIDE 2

Agenda

  • Background
  • Why Pig on Spark?
  • Design Architecture
  • Benchmark
  • Optimization
  • Current Status & Future Work
  • Q&A
SLIDE 3

Background

SLIDE 4

Apache Pig

  • Procedural scripting language
  • Pig Latin: similar to SQL
  • Heavily used for ETL
  • Handles data with or without a schema: "Pig eats everything"
SLIDE 5

Spark

  • Speed
  • Generality
  • Ease of use
SLIDE 7

Why Pig on Spark

  • Better performance
      ○ No intermediate data written to HDFS between stages
      ○ In-memory caching abstraction
      ○ Executor JVM reuse
  • Lets Pig users try Spark conveniently
SLIDE 9

Design Architecture

SLIDE 10

Design Architecture

SLIDE 11

Design Architecture

SLIDE 12

Pig Latin to RDD<Tuple> transformations

SLIDE 13

Pig Latin to RDD<Tuple> transformations

SLIDE 14

Pig Latin to RDD<Tuple> transformations

SLIDE 15

Operator Mapping

Pig Operator    Spark Operator
Load            newAPIHadoopFile
Store           saveAsNewAPIHadoopFile
Filter          filter
GroupBy         groupBy / reduceByKey
Join            CoGroupedRDD
ForEach         mapPartitions
Sort            sortByKey
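To make the mapping concrete, here is a minimal sketch that walks a small Pig Latin script through the corresponding steps. It uses plain Python lists as a stand-in for an RDD of tuples, so it illustrates the shape of the translation only; it is not the Pig on Spark implementation and calls no Spark API. The function name `run_pipeline`, the sample data, and the Pig statements in the comments are hypothetical.

```python
from collections import defaultdict


def run_pipeline(rows):
    """Emulate a small Pig Latin script on a plain list of (name, age) tuples."""
    # FILTER rows BY age >= 18;            (Filter  -> filter)
    adults = [row for row in rows if row[1] >= 18]

    # GROUP adults BY name;                (GroupBy -> groupBy / reduceByKey)
    grouped = defaultdict(list)
    for name, age in adults:
        grouped[name].append(age)

    # FOREACH grouped GENERATE group, COUNT(adults);   (ForEach -> mapPartitions)
    counts = [(name, len(ages)) for name, ages in grouped.items()]

    # ORDER counts BY name;                (Sort    -> sortByKey)
    return sorted(counts)


# Hypothetical input standing in for the result of a Load step.
rows = [("alice", 34), ("bob", 17), ("alice", 20), ("carol", 45)]
print(run_pipeline(rows))
```

Each Pig statement becomes one transformation on the previous result, which is the same chaining the slides describe for RDD<Tuple>.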

SLIDE 17

Benchmark Overview

Component    Version
Pig          Spark branch
Hadoop       2.6.0
Spark        1.6.2
PigMix       Trunk

SLIDE 18

Basic Configuration

spark.master=yarn-client
spark.executor.memory=6553m
spark.yarn.executor.memoryOverhead=1638
spark.executor.cores=8
spark.dynamicAllocation.enabled=true
spark.network.timeout=1200000
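On YARN, each executor container is requested with the executor heap plus the memory overhead, so the two memory settings above are best read together. The short calculation below works that sum out. The helper name `yarn_container_mb` is hypothetical, and the observation that the total lands just under 8 GiB is an inference from the numbers, not something the slides state.

```python
def yarn_container_mb(executor_memory_mb, memory_overhead_mb):
    """Memory requested from YARN per executor: heap plus off-heap overhead, in MB."""
    return executor_memory_mb + memory_overhead_mb


# Values taken from the configuration on this slide.
container = yarn_container_mb(6553, 1638)
overhead_ratio = round(1638 / 6553, 2)

print(container)       # 8191, one MB short of 8192 (8 GiB)
print(overhead_ratio)  # 0.25, overhead is about a quarter of the heap
```

YARN may round the request up to its own allocation increment, so the granted container size can differ from this sum.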

SLIDE 19

Benchmark Overview (cont’d)

SLIDE 21

Optimize GroupBy/Join

SLIDE 22

Optimize GroupBy/Join

SLIDE 23

Optimize GroupBy/Join

SLIDE 24

Optimize GroupBy/Join

SLIDE 25

Skewed Key Sort

SLIDE 26

Skewed Key Sort

SLIDE 27

Skewed Key Sort

SLIDE 28

Salted Key Solution
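The diagram for this slide is not preserved, so the following is only a generic sketch of key salting for a skewed sort, not necessarily how Pig on Spark implements it. It simulates range partitioning in plain Python: without salt, every record of a hot key falls into one partition; with a salt appended to the key, the hot key can span several partitions while the order by the original key is kept. The function `range_partition_sizes`, the round-robin salt, and the sample data are all assumptions made for illustration.

```python
import bisect


def range_partition_sizes(keys, num_partitions):
    """Count records per partition, using boundaries picked from the sorted keys."""
    ordered = sorted(keys)
    step = len(ordered) / num_partitions
    bounds = [ordered[int(step * i)] for i in range(1, num_partitions)]
    sizes = [0] * num_partitions
    for key in keys:
        # Keys equal to a boundary go to the partition on its right.
        sizes[bisect.bisect_right(bounds, key)] += 1
    return sizes


# Hypothetical skewed input: key 1 appears 90 times, keys 2..11 once each.
keys = [1] * 90 + list(range(2, 12))
num_partitions = 4

# Plain sort key: the hot key cannot be split across partitions.
plain_sizes = range_partition_sizes(keys, num_partitions)

# Salted sort key: (key, salt) compares by key first, so stripping the salt
# after sorting still yields records ordered by the original key.
salted = [(key, i % num_partitions) for i, key in enumerate(keys)]
salted_sizes = range_partition_sizes(salted, num_partitions)

print("largest partition without salt:", max(plain_sizes))
print("largest partition with salt:   ", max(salted_sizes))
```

A round-robin salt is used here instead of a random one only to keep the example deterministic.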

SLIDE 29

Skewed Key Sort Performance

There is a significant performance improvement in the sort case (L10) and the skewed key sort case (L9).

SLIDE 31

Current Status: Nearing end of Milestone 1

  • Functional completeness: DONE
  • All Unit Tests Pass: DONE
  • Merge Spark Branch to Master: In Code Review
SLIDE 32

Ongoing Work towards Milestone 2

  • Implement Optimizations
      ○ Optimize Group by/Join - PIG-4797: DONE
      ○ FR Join - PIG-4771: DONE
      ○ Merge Join - PIG-4810: DONE
      ○ Skewed Join: UNDER REVIEW
  • Enhance Test Infrastructure
      ○ Use “local-cluster” mode to run unit tests
  • Spark Integration
      ○ Improved error, progress, stats reporting
      ○ YARN Cluster Mode

SLIDE 33

Future work: Milestone 3

  • Implement More Optimizations
      ○ Split / MultiQuery using RDD.cache()
      ○ Compute optimal Shuffle Parallelism
      ○ Optimize/Redesign Spark Plan
  • Code Stabilization, Bug Fixes
SLIDE 34

Contributions welcome

  • Git:
      ○ https://github.com/apache/pig/tree/spark
  • Wiki:
      ○ https://cwiki.apache.org/confluence/display/PIG/Pig+on+Spark
  • Umbrella JIRA:
      ○ PIG-4059

SLIDE 35

Q&A