Improving Hadoop MapReduce Performance on Supercomputers with JVM Reuse (PowerPoint presentation)


SLIDE 1

Improving Hadoop MapReduce Performance on Supercomputers with JVM Reuse

Thanh-Chung Dao and Shigeru Chiba, The University of Tokyo
SLIDE 2

Supercomputers

  • Expensive clusters
  • Multi-core processors
  • Large capacity of main memory
  • High-speed network
  • Focus mainly on compute-intensive applications
  • Data-intensive workloads are emerging as supercomputing problems
  • Graph processing
  • Pre-processing of simulation data

SLIDE 3

MapReduce

  • Simple parallel paradigm to process large datasets
  • Hidden parallelization & communication
  • PageRank example

[Figure: MapReduce pipeline (Input → Splitting → Mapping → Shuffling → Reducing → Result) on the PageRank example. Each page emits a rank contribution of 1/N to each of its N outbound links, e.g. PageA → PageB, PageC yields <PageB, 0.5> and <PageC, 0.5>; shuffling groups the pairs by page, and reducing sums them to the final ranks PageA 0.5, PageB 1, PageC 1.5. Shuffling is done automatically (users can ignore it).]

Function Mapper
    Input: PageA → PageB, PageC
    Begin
        N = number of outbound links
        For each outbound link Page:
            Output <Page, 1/N>
    End

Function Reducer
    Input: <PageA, x1>, …, <PageA, xn>
    Begin
        rank = 0
        For each item xi:
            rank += xi
        Output <PageA, rank>
    End
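To make the pseudocode concrete, here is a minimal Java sketch of the same rank-contribution step against the Hadoop MapReduce API; the tab/comma input layout and the class names are illustrative assumptions, not taken from the talk.

    import java.io.IOException;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Mapper: an input line "PageA<TAB>PageB,PageC" emits 1/N to each outbound link.
    public class RankContributionMapper
        extends Mapper<Object, Text, Text, DoubleWritable> {
      @Override
      protected void map(Object key, Text value, Context ctx)
          throws IOException, InterruptedException {
        String[] parts = value.toString().split("\t");
        String[] links = parts[1].split(",");
        double share = 1.0 / links.length;          // N = number of outbound links
        for (String link : links) {
          ctx.write(new Text(link.trim()), new DoubleWritable(share));
        }
      }
    }

    // Reducer: sums the contributions <PageA, x1> ... <PageA, xn> into a rank.
    class RankSumReducer
        extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
      @Override
      protected void reduce(Text page, Iterable<DoubleWritable> xs, Context ctx)
          throws IOException, InterruptedException {
        double rank = 0;
        for (DoubleWritable x : xs) rank += x.get();
        ctx.write(page, new DoubleWritable(rank));
      }
    }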

SLIDE 4

Hadoop MapReduce

  • The standard MapReduce implementation
  • Provides easy-to-use MapReduce APIs
  • TCP/IP-based communication
  • Designed to run on commodity clusters
  • Lab clusters, or Amazon EC2
  • Scalability (32,000 nodes at Yahoo) & resilience
  • Written in Java

SLIDE 5

Improving Hadoop MapReduce Performance on Supercomputers

  • Hadoop MapReduce is a good choice on supercomputers
  • Maturity
  • Productivity

                                      Supercomputer       Hadoop
  Resource allocation at runtime      Static              Dynamic
  (# of processes, memory, CPU)
  Communication                       MPI                 TCP/IP
  Workload                            Compute-intensive   Data-intensive

SLIDE 6

Our Approach

  • JVM Reuse
  • Statically create JVM processes and dynamically allocate them to Hadoop tasks
  • Enables efficient MPI communication by Hadoop tasks
  • Statically created processes can exploit efficient MPI
  • Dynamic allocation lets us keep the original Hadoop implementation
  • Shortens the start-up time of processes
  • Technique
  • A process pool is used to implement JVM Reuse
  • Minimize changes to the original Hadoop engine

SLIDE 7

Why MPI is required for Hadoop

  • The de facto standard for high-speed communication on supercomputers
  • Improves slow MapReduce shuffling
  • Enables Hadoop to co-host traditional MPI applications
  • Combines the MPI and MapReduce models
  • Rich data analysis workflows
  • Efficient data sharing between the MPI and MapReduce models
  • E.g. MPI can access data located in the Hadoop file system (HDFS)

[Figure: throughput (Mbps) vs. message size (2^0 to 2^26 bytes) on the FX10 supercomputer; MPI is about 10 times faster than TCP.]

SLIDE 8

Slow MapReduce shuffling on Hadoop

  • TCP/IP-based communication
  • JVM-Bypass (Wang et al., IPDPS 2013)

[Figure: Hadoop's shuffling path. Mapping phase: MapTasks on slave nodes write map outputs 1…n to local disk. Shuffling phase: ReduceTasks fetch them through an HTTP servlet server, issuing multiple requests at once. Reducing phase: sort & merge, then reducing.]

SLIDE 9

Dynamic Process Creation on MPI

  • Discouraged on supercomputers
  • For performance reasons
  • Collective spawn mechanism (MPI_Comm_spawn)
  • Gang scheduling (error-prone if there are not enough resources)
  • Gerbil (Xu et al., CCGrid 2015)
  • Co-hosts MPI applications on Hadoop
  • Creates processes dynamically
  • Its experiments showed significant overhead
  • Resources should be specified before running MPI applications
  • Number of processes is known (static)
  • Memory and CPU cores

SLIDE 10

Dynamic Process Creation on Hadoop

  • Required
  • Resources are allocated on demand to run MapReduce applications
  • Number of processes is unknown (dynamic)

SLIDE 11

Dynamic Process Creation on Hadoop

[Figure: cluster layout with a master node and slave nodes 1, 2, …, n.]

SLIDE 12

Dynamic Process Creation on Hadoop

[Figure: the user submits a job to the master node, which sends requests to the slaves: 6 tasks to Slave 1, 8 tasks to Slave 2, …, 6 tasks to Slave n.]

SLIDE 13

Dynamic Process Creation on Hadoop

[Figure: each slave creates processes for its tasks (6, 8, …, 6 processes); each task is run in a process.]

SLIDE 14

Dynamic Process Creation on Hadoop

[Figure: the created processes run their tasks on each slave; each task is run in a process.]

SLIDE 15

Dynamic Process Creation on Hadoop

[Figure: when their tasks finish, the processes on each slave are terminated.]

SLIDE 16

Dynamic Process Creation on Hadoop

[Figure: the master node reports job completion to the user.]

SLIDE 17

Idea of Reusing

  • JVM Pool added
  • Idle JVM processes
  • Number of processes is statically fixed

[Figure: each slave node now runs a JVM Pool holding idle JVM processes.]

SLIDE 18

Idea of Reusing

[Figure: the user submits a job; the master sends requests (6, 8, …, 6 tasks) to the slaves, whose JVM Pools hold idle processes.]

SLIDE 19

Idea of Reusing

[Figure: idle JVMs in each pool are allocated to the incoming tasks and marked busy; the remaining JVMs stay idle.]

SLIDE 20

Idea of Reusing

[Figure: the allocated JVMs run their tasks; unallocated JVMs remain idle.]

SLIDE 21

Idea of Reusing

[Figure: after the tasks finish, the JVMs are cleaned up instead of being terminated.]

SLIDE 22

Idea of Reusing

[Figure: all JVMs in the pools return to the idle state.]

SLIDE 23

Idea of Reusing

[Figure: the master reports job completion to the user; the pooled JVMs remain idle, ready for the next job.]

SLIDE 24

JVM Reuse enables MPI communication

  • MPI communication is established at the beginning
  • JVM Reuse keeps processes running
  • MPI connection is always available


SLIDE 25

JVM Reuse shortens start-up time

JVM start-up flow of Program A (cmd: java A)

  • OS-level process creation
  • Class loader subsystem: class loading, then class linking (verification & initialization)
  • Execution engine (JIT compiler): invoke the main() method of A and execute A's instructions

After A finishes, Program B wants to reuse the JVM of A:

  • Process creation & the class loader subsystem start-up are skipped
  • Invoke the main() method of B and execute B's instructions
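A minimal sketch of that reuse step, assuming a hypothetical ReuseLauncher helper inside the pooled JVM: instead of forking a new process for "java B", the warm JVM loads B's main class through a fresh class loader and invokes its main() method reflectively, so OS-level process creation and JVM bootstrap are skipped.

    import java.lang.reflect.Method;
    import java.net.URL;
    import java.net.URLClassLoader;

    // Sketch: run program B inside an already-running JVM.
    public class ReuseLauncher {
      public static void runMain(URL[] classpath, String mainClass, String[] args)
          throws Exception {
        // A fresh class loader per program keeps B's classes apart from A's.
        try (URLClassLoader loader = new URLClassLoader(classpath)) {
          Class<?> cls = Class.forName(mainClass, true, loader);
          Method main = cls.getMethod("main", String[].class);
          main.invoke(null, (Object) args);   // enter B at its main() method
        }
      }
    }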

SLIDE 26

Iterative jobs benefit from JVM Reuse

  • Iterative jobs
  • Many short-running JVM processes
  • PageRank is an example (see the driver sketch below)

[Figure: iterative job flow. A pre-processing job turns the initial data into input for the iteration; then Job A (Map and Reduce) runs repeatedly, the maps of each iteration using the results of the previous one, until the stop condition is met.]
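For reference, a sketch of such an iterative driver with the standard Hadoop Job API, reusing the illustrative mapper and reducer from the SLIDE 3 sketch; the paths and the fixed iteration cap are stand-ins for a real stop condition. On stock Hadoop, each waitForCompletion() call spawns fresh task JVMs, which is exactly the per-iteration cost JVM Reuse removes.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class PageRankDriver {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path input = new Path("ranks/iter0");    // produced by the pre-processing job
        int maxIterations = 10;                  // stand-in for "Stop Cond?"
        for (int i = 1; i <= maxIterations; i++) {
          Job job = Job.getInstance(conf, "pagerank-iter-" + i);
          job.setJarByClass(PageRankDriver.class);
          job.setMapperClass(RankContributionMapper.class);
          job.setReducerClass(RankSumReducer.class);
          job.setOutputKeyClass(Text.class);
          job.setOutputValueClass(DoubleWritable.class);
          Path output = new Path("ranks/iter" + i);
          FileInputFormat.addInputPath(job, input);
          FileOutputFormat.setOutputPath(job, output);
          if (!job.waitForCompletion(true)) break; // each call starts new task JVMs
          input = output;  // maps of the next iteration read the previous results
        }
      }
    }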

SLIDE 27

Implementation: Process Pool

  • A Process Pool (JVM Pool) runs on each node, with a Pool Manager under Hadoop YARN (node manager); see the sketch below
  • Process creation requests are served by round-robin scheduling over the pool
  • Allocation (idle → busy)
  • Set busy
  • Export env. variables
  • Create a new class loader
  • Invoke the main() method
  • De-allocation (busy → idle)
  • Clean static fields
  • Unset busy
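The following sketch shows one pool slot going through those allocation and de-allocation steps; TaskDescriptor and the ReuseLauncher from the SLIDE 25 sketch are hypothetical helpers, and the real Pool Manager sits inside the YARN node manager.

    import java.util.concurrent.SynchronousQueue;

    // One pooled JVM slot: it waits for an allocation from the Pool Manager,
    // runs the task's main() in a fresh class loader, cleans up, and goes idle.
    public class PooledJvmSlot implements Runnable {
      private final SynchronousQueue<TaskDescriptor> inbox =
          new SynchronousQueue<TaskDescriptor>();
      private volatile boolean busy = false;

      // Called by the Pool Manager, which tries slots in round-robin order.
      public boolean tryAllocate(TaskDescriptor task) {
        return !busy && inbox.offer(task);
      }

      @Override public void run() {
        while (true) {
          try {
            TaskDescriptor task = inbox.take();
            busy = true;                           // allocation: set busy
            task.exportEnvironment();              // export env. variables
            ReuseLauncher.runMain(task.classpath(), task.mainClass(), task.args());
                                                   // new class loader + main()
            task.resetStaticFields();              // de-allocation: clean statics
          } catch (Exception e) {
            e.printStackTrace();                   // a failed task must not kill the slot
          } finally {
            busy = false;                          // unset busy: back to idle
          }
        }
      }
    }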

SLIDE 28

Our MPI shuffling design

[Figure: mapping phase: MapTasks write map outputs 2…n to local disk. Shuffling phase: a Shuffle Manager on each node (e.g. Node 2) serves the outputs to ReduceTasks over MPI send/recv, one request at once. Reducing phase: sort & merge, then reducing.]
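A sketch of the transfer itself with the Open MPI Java bindings listed in the evaluation setup (Vega-Gisbert et al.); the size-then-data protocol and the tag values are illustrative. Because the pooled JVMs establish MPI communication once at start-up (cf. SLIDE 24), the communicator is already available when shuffling begins.

    import mpi.MPI;
    import mpi.MPIException;

    public class MpiShuffleSketch {
      static final int TAG_SIZE = 0, TAG_DATA = 1;

      // Shuffle Manager side: push one map-output partition to a reducer rank.
      static void sendPartition(byte[] partition, int reducerRank) throws MPIException {
        int[] size = { partition.length };
        MPI.COMM_WORLD.send(size, 1, MPI.INT, reducerRank, TAG_SIZE);
        MPI.COMM_WORLD.send(partition, partition.length, MPI.BYTE, reducerRank, TAG_DATA);
      }

      // ReduceTask side: fetch one partition at a time ("one request at once").
      static byte[] recvPartition(int mapperRank) throws MPIException {
        int[] size = new int[1];
        MPI.COMM_WORLD.recv(size, 1, MPI.INT, mapperRank, TAG_SIZE);
        byte[] data = new byte[size[0]];
        MPI.COMM_WORLD.recv(data, size[0], MPI.BYTE, mapperRank, TAG_DATA);
        return data;
      }
    }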

SLIDE 29

Reuse's Technical Issues

  • Loading the user's classes
  • The original flow exports CLASSPATH before running
  • Reflection
  • Load the user's classes at runtime
  • Create a new class loader for each user
  • Avoids class conflicts
  • Clean-up
  • Static fields
  • Security problem: e.g. the UserGroup static field must be reset
  • Current design
  • Reset the user-information and job-configuration static fields (see the sketch below)
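A sketch of that reset step; the reflective helper and its field list are illustrative. The point is that cached user information from one job must be nulled out before the pooled JVM serves the next user.

    import java.lang.reflect.Field;
    import java.lang.reflect.Modifier;

    // Null out named static fields of a class between two reused jobs.
    public class StaticFieldCleaner {
      public static void resetStatics(Class<?> cls, String... fieldNames)
          throws ReflectiveOperationException {
        for (String name : fieldNames) {
          Field f = cls.getDeclaredField(name);
          if (!Modifier.isStatic(f.getModifiers())) continue;  // statics only
          f.setAccessible(true);
          f.set(null, null);   // drop the previous job's cached value
        }
      }
    }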

SLIDE 30

Other Technical Issues

  • Enable Hadoop YARN to host traditional MPI applications
  • YARN is a resource manager
  • Work in progress
  • MPI AppMaster: monitors the MPI ranks
  • MPI Container: hosts a rank
  • Avoid gang scheduling of MPI
  • Work in progress


SLIDE 31

Evaluation

  • Hadoop version
  • v2.2.0
  • Changes in our implementation
  • Lines of code changed / total in Hadoop: ~1,000 / 1,851,473
  • Classes changed / total in Hadoop: 9 / 35,142


SLIDE 32

Cluster setup

  • FX10 supercomputer
  • SPARC64 IXfx 1.848 GHz (16 cores) & 32 GB RAM
  • MPI over the Tofu interconnect (5 GB/s)
  • Central storage
  • Hadoop setup
  • One master and many slaves
  • OpenJDK 7
  • HDFS is run on the central storage
  • OpenMPI 1.6
  • Java MPI binding (Vega-Gisbert et al.)
  • MCA parameter: plm_ple_cpu_affinity = 0 (see the example below)
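For reference, an Open MPI MCA parameter like this is typically passed on the launcher command line; the process count and the launched command below are placeholders.

    mpirun --mca plm_ple_cpu_affinity 0 -np 8 java ...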

SLIDE 33

Evaluation of JVM Reuse

  • MPI benefit
  • MPI vs. TCP/IP shuffling
  • Tera-sort job
  • Run on 32 FX10 nodes
  • 4-slot pool & -Xmx4096m
  • Start-up time
  • JVM Reuse vs. the original
  • PageRank iterative job
  • 400 GB of Wikipedia data
  • Run on 8 FX10 nodes
  • 6-slot pool & -Xmx4096m


SLIDE 34

MPI vs. TCP/IP shuffling


[Figure: Tera-sort execution time (s) vs. input size (50 to 250 GB), comparing a nonblocking TCP shuffle, a blocking TCP shuffle, and a blocking MPI shuffle. Nonblocking: accepts multiple connections at once; blocking: accepts one connection at a time. The blocking MPI shuffle is up to 10% faster.]

SLIDE 35

Shorten start-up time (PageRank)

[Figure: per-task timelines (Task ID vs. time, 0s to 400s) for original Hadoop and for JVM Reuse Hadoop over the first two PageRank iterations, broken down into start-up (JVM & user info), task initializing, shuffling (at reducers), data reading, task running, and task finishing (MapOutput writing). With JVM Reuse, the second iteration completes 50 seconds faster, largely because the start-up time is eliminated.]

SLIDE 36

More iterations


[Figure: three panels comparing Original Hadoop and JVM Reuse-based Hadoop as the iteration number n grows from 1 to 8: sum of start-up time (s), total execution time (s), and a per-approach breakdown of execution time into start-up time and computation time.]

SLIDE 37

Related work

  • M3R (VLDB 2012)
  • Also applies JVM Reuse, to enable in-memory MapReduce
  • Provides no optimization of JVM Reuse and no evaluation of it
  • Its Hadoop MapReduce (HMR) engine is written in X10
  • We keep the original HMR engine with minimal changes
  • JVM Reuse in Hadoop v1 (2012)
  • Only within a single job
  • JVM processes are terminated after their job completes
  • Gerbil: MPI + YARN (CCGrid 2015)
  • Hadoop YARN co-hosts MPI applications
  • Long start-up time and significant overhead
  • DataMPI (IPDPS 2014)
  • Hadoop-like MapReduce implementation using MPI & C
  • JVM-Bypass (IPDPS 2013)
  • C-based shuffling engine with RDMA support
  • We focus on using MPI over Hadoop processes


SLIDE 38

Summary

  • Improving Hadoop MapReduce performance on supercomputers
  • Approach: JVM Reuse
  • Statically create JVM processes and dynamically allocate them to Hadoop tasks
  • Enables efficient MPI communication on Hadoop
  • Shortens start-up time
  • Minimal changes to the original Hadoop
  • Future work
  • JVM Reuse drawback: it can affect CPU-bound tasks
  • Co-host MPI applications more efficiently
  • Full cleanup