Improving Hadoop MapReduce Performance on Supercomputers with JVM Reuse (PowerPoint presentation)


SLIDE 1

Improving Hadoop MapReduce Performance on Supercomputers with JVM Reuse

Thanh-Chung Dao and Shigeru Chiba, The University of Tokyo
SLIDE 2

Supercomputers

  • Expensive clusters
  • Multi-core processors
  • Large capacity of main memory
  • High-speed network
  • Focus mainly on compute-intensive applications
  • Data-intensive workloads are emerging as supercomputing problems
  • Graph processing
  • Pre-processing of simulation data

SLIDE 3

MapReduce

  • Simple parallel paradigm to process large datasets
  • Hidden parallelization & communication
  • PageRank example

[Figure: MapReduce pipeline (Input → Splitting → Mapping → Shuffling → Reducing → Result) on the PageRank example. Each page emits a rank contribution of 1/N to each of its N outbound links, e.g. PageA → PageB, PageC yields <PageB, 0.5> and <PageC, 0.5>; shuffling groups the pairs by page, and reducing sums them to the final ranks PageA 0.5, PageB 1, PageC 1.5. Shuffling is done automatically (users can ignore it).]

Function Mapper
    Input: PageA → PageB, PageC
    Begin
        N = number of outbound links
        For each outbound link Page:
            Output <Page, 1/N>
    End

Function Reducer
    Input: <PageA, x1>, …, <PageA, xn>
    Begin
        rank = 0
        For each item xi:
            rank += xi
        Output <PageA, rank>
    End
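To make the pseudocode concrete, here is a minimal Java sketch of the same rank-contribution step against the Hadoop MapReduce API; the tab/comma input layout and the class names are illustrative assumptions, not taken from the talk.

    import java.io.IOException;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Mapper: an input line "PageA<TAB>PageB,PageC" emits 1/N to each outbound link.
    public class RankContributionMapper
        extends Mapper<Object, Text, Text, DoubleWritable> {
      @Override
      protected void map(Object key, Text value, Context ctx)
          throws IOException, InterruptedException {
        String[] parts = value.toString().split("\t");
        String[] links = parts[1].split(",");
        double share = 1.0 / links.length;          // N = number of outbound links
        for (String link : links) {
          ctx.write(new Text(link.trim()), new DoubleWritable(share));
        }
      }
    }

    // Reducer: sums the contributions <PageA, x1> ... <PageA, xn> into a rank.
    class RankSumReducer
        extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
      @Override
      protected void reduce(Text page, Iterable<DoubleWritable> xs, Context ctx)
          throws IOException, InterruptedException {
        double rank = 0;
        for (DoubleWritable x : xs) rank += x.get();
        ctx.write(page, new DoubleWritable(rank));
      }
    }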

SLIDE 4

Hadoop MapReduce

  • The standard MapReduce implementation
  • Provides easy-to-use MapReduce APIs
  • TCP/IP-based communication
  • Designed to run on commodity clusters
  • Lab clusters, or Amazon EC2
  • Scalability (32,000 nodes at Yahoo) & resilience
  • Written in Java

SLIDE 5

Improving Hadoop MapReduce Performance on Supercomputers

  • Hadoop MapReduce is a good choice on supercomputers
  • Maturity
  • Productivity

                                      Supercomputer       Hadoop
  Resource allocation at runtime      Static              Dynamic
  (# of processes, memory, CPU)
  Communication                       MPI                 TCP/IP
  Workload                            Compute-intensive   Data-intensive

SLIDE 6

Our Approach

  • JVM Reuse
  • Statically create JVM processes and dynamically allocate them to Hadoop tasks
  • Enables efficient MPI communication by Hadoop tasks
  • Statically created processes can exploit efficient MPI
  • Dynamic allocation lets us keep the original Hadoop implementation
  • Shortens the start-up time of processes
  • Technique
  • A process pool is used to implement JVM Reuse
  • Minimize changes to the original Hadoop engine

SLIDE 7

Why MPI is required for Hadoop

  • The de facto standard for high-speed communication on supercomputers
  • Improves slow MapReduce shuffling
  • Enables Hadoop to co-host traditional MPI applications
  • Combines the MPI and MapReduce models
  • Rich data analysis workflows
  • Efficient data sharing between the MPI and MapReduce models
  • E.g. MPI can access data located in the Hadoop file system (HDFS)

[Figure: throughput (Mbps) vs. message size (2^0 to 2^26 bytes) on the FX10 supercomputer; MPI is about 10 times faster than TCP.]

SLIDE 8

Slow MapReduce shuffling on Hadoop

  • TCP/IP-based communication
  • JVM-Bypass (Wang et al., IPDPS 2013)

[Figure: Hadoop's shuffling path. Mapping phase: MapTasks on slave nodes write map outputs 1…n to local disk. Shuffling phase: ReduceTasks fetch them through an HTTP servlet server, issuing multiple requests at once. Reducing phase: sort & merge, then reducing.]

SLIDE 9

Dynamic Process Creation on MPI

  • Discouraged on supercomputers
  • For performance reasons
  • Collective spawn mechanism (MPI_Comm_spawn)
  • Gang scheduling (error-prone if there are not enough resources)
  • Gerbil (Xu et al., CCGrid 2015)
  • Co-hosts MPI applications on Hadoop
  • Creates processes dynamically
  • Its experiments showed significant overhead
  • Resources should be specified before running MPI applications
  • Number of processes is known (static)
  • Memory and CPU cores

SLIDE 10

Dynamic Process Creation on Hadoop

  • Required
  • Resources are allocated on demand to run MapReduce applications
  • Number of processes is unknown (dynamic)

SLIDE 11

Dynamic Process Creation on Hadoop

[Figure: cluster layout with a master node and slave nodes 1, 2, …, n.]

SLIDE 12

Dynamic Process Creation on Hadoop

[Figure: the user submits a job to the master node, which sends requests to the slaves: 6 tasks to Slave 1, 8 tasks to Slave 2, …, 6 tasks to Slave n.]

SLIDE 13

Dynamic Process Creation on Hadoop

[Figure: each slave creates processes for its tasks (6, 8, …, 6 processes); each task is run in a process.]

SLIDE 14

Dynamic Process Creation on Hadoop

[Figure: the created processes run their tasks on each slave; each task is run in a process.]

SLIDE 15

Dynamic Process Creation on Hadoop

[Figure: when their tasks finish, the processes on each slave are terminated.]

SLIDE 16

Dynamic Process Creation on Hadoop

[Figure: the master node reports job completion to the user.]

SLIDE 17

Idea of Reusing

  • JVM Pool added
  • Idle JVM processes
  • Number of processes is statically fixed

[Figure: each slave node now runs a JVM Pool holding idle JVM processes.]

SLIDE 18

Idea of Reusing

[Figure: the user submits a job; the master sends requests (6, 8, …, 6 tasks) to the slaves, whose JVM Pools hold idle processes.]

SLIDE 19

Idea of Reusing

[Figure: idle JVMs in each pool are allocated to the incoming tasks and marked busy; the remaining JVMs stay idle.]

SLIDE 20

Idea of Reusing

[Figure: the allocated JVMs run their tasks; unallocated JVMs remain idle.]

SLIDE 21

Idea of Reusing

[Figure: after the tasks finish, the JVMs are cleaned up instead of being terminated.]

SLIDE 22

Idea of Reusing

[Figure: all JVMs in the pools return to the idle state.]

SLIDE 23

Idea of Reusing

[Figure: the master reports job completion to the user; the pooled JVMs remain idle, ready for the next job.]

SLIDE 24

JVM Reuse enables MPI communication

  • MPI communication is established at the beginning
  • JVM Reuse keeps processes running
  • MPI connection is always available


SLIDE 25

JVM Reuse shortens start-up time

JVM start-up flow of Program A (cmd: java A)

  • OS-level process creation
  • Class loader subsystem: class loading, then class linking (verification & initialization)
  • Execution engine (JIT compiler): invoke the main() method of A and execute A's instructions

After A finishes, Program B wants to reuse the JVM of A:

  • Process creation & the class loader subsystem start-up are skipped
  • Invoke the main() method of B and execute B's instructions
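A minimal sketch of that reuse step, assuming a hypothetical ReuseLauncher helper inside the pooled JVM: instead of forking a new process for "java B", the warm JVM loads B's main class through a fresh class loader and invokes its main() method reflectively, so OS-level process creation and JVM bootstrap are skipped.

    import java.lang.reflect.Method;
    import java.net.URL;
    import java.net.URLClassLoader;

    // Sketch: run program B inside an already-running JVM.
    public class ReuseLauncher {
      public static void runMain(URL[] classpath, String mainClass, String[] args)
          throws Exception {
        // A fresh class loader per program keeps B's classes apart from A's.
        try (URLClassLoader loader = new URLClassLoader(classpath)) {
          Class<?> cls = Class.forName(mainClass, true, loader);
          Method main = cls.getMethod("main", String[].class);
          main.invoke(null, (Object) args);   // enter B at its main() method
        }
      }
    }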

SLIDE 26

Iterative jobs benefit from JVM Reuse

  • Iterative jobs
  • Many short-running JVM processes
  • PageRank is an example (see the driver sketch below)

[Figure: iterative job flow. A pre-processing job turns the initial data into input for the iteration; then Job A (Map and Reduce) runs repeatedly, the maps of each iteration using the results of the previous one, until the stop condition is met.]
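For reference, a sketch of such an iterative driver with the standard Hadoop Job API, reusing the illustrative mapper and reducer from the SLIDE 3 sketch; the paths and the fixed iteration cap are stand-ins for a real stop condition. On stock Hadoop, each waitForCompletion() call spawns fresh task JVMs, which is exactly the per-iteration cost JVM Reuse removes.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class PageRankDriver {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path input = new Path("ranks/iter0");    // produced by the pre-processing job
        int maxIterations = 10;                  // stand-in for "Stop Cond?"
        for (int i = 1; i <= maxIterations; i++) {
          Job job = Job.getInstance(conf, "pagerank-iter-" + i);
          job.setJarByClass(PageRankDriver.class);
          job.setMapperClass(RankContributionMapper.class);
          job.setReducerClass(RankSumReducer.class);
          job.setOutputKeyClass(Text.class);
          job.setOutputValueClass(DoubleWritable.class);
          Path output = new Path("ranks/iter" + i);
          FileInputFormat.addInputPath(job, input);
          FileOutputFormat.setOutputPath(job, output);
          if (!job.waitForCompletion(true)) break; // each call starts new task JVMs
          input = output;  // maps of the next iteration read the previous results
        }
      }
    }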

SLIDE 27

Implementation: Process Pool

  • A Process Pool (JVM Pool) runs on each node, with a Pool Manager under Hadoop YARN (node manager); see the sketch below
  • Process creation requests are served by round-robin scheduling over the pool
  • Allocation (idle → busy)
  • Set busy
  • Export env. variables
  • Create a new class loader
  • Invoke the main() method
  • De-allocation (busy → idle)
  • Clean static fields
  • Unset busy
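The following sketch shows one pool slot going through those allocation and de-allocation steps; TaskDescriptor and the ReuseLauncher from the SLIDE 25 sketch are hypothetical helpers, and the real Pool Manager sits inside the YARN node manager.

    import java.util.concurrent.SynchronousQueue;

    // One pooled JVM slot: it waits for an allocation from the Pool Manager,
    // runs the task's main() in a fresh class loader, cleans up, and goes idle.
    public class PooledJvmSlot implements Runnable {
      private final SynchronousQueue<TaskDescriptor> inbox =
          new SynchronousQueue<TaskDescriptor>();
      private volatile boolean busy = false;

      // Called by the Pool Manager, which tries slots in round-robin order.
      public boolean tryAllocate(TaskDescriptor task) {
        return !busy && inbox.offer(task);
      }

      @Override public void run() {
        while (true) {
          try {
            TaskDescriptor task = inbox.take();
            busy = true;                           // allocation: set busy
            task.exportEnvironment();              // export env. variables
            ReuseLauncher.runMain(task.classpath(), task.mainClass(), task.args());
                                                   // new class loader + main()
            task.resetStaticFields();              // de-allocation: clean statics
          } catch (Exception e) {
            e.printStackTrace();                   // a failed task must not kill the slot
          } finally {
            busy = false;                          // unset busy: back to idle
          }
        }
      }
    }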

SLIDE 28

Our MPI shuffling design

[Figure: mapping phase: MapTasks write map outputs 2…n to local disk. Shuffling phase: a Shuffle Manager on each node (e.g. Node 2) serves the outputs to ReduceTasks over MPI send/recv, one request at once. Reducing phase: sort & merge, then reducing.]
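A sketch of the transfer itself with the Open MPI Java bindings listed in the evaluation setup (Vega-Gisbert et al.); the size-then-data protocol and the tag values are illustrative. Because the pooled JVMs establish MPI communication once at start-up (cf. SLIDE 24), the communicator is already available when shuffling begins.

    import mpi.MPI;
    import mpi.MPIException;

    public class MpiShuffleSketch {
      static final int TAG_SIZE = 0, TAG_DATA = 1;

      // Shuffle Manager side: push one map-output partition to a reducer rank.
      static void sendPartition(byte[] partition, int reducerRank) throws MPIException {
        int[] size = { partition.length };
        MPI.COMM_WORLD.send(size, 1, MPI.INT, reducerRank, TAG_SIZE);
        MPI.COMM_WORLD.send(partition, partition.length, MPI.BYTE, reducerRank, TAG_DATA);
      }

      // ReduceTask side: fetch one partition at a time ("one request at once").
      static byte[] recvPartition(int mapperRank) throws MPIException {
        int[] size = new int[1];
        MPI.COMM_WORLD.recv(size, 1, MPI.INT, mapperRank, TAG_SIZE);
        byte[] data = new byte[size[0]];
        MPI.COMM_WORLD.recv(data, size[0], MPI.BYTE, mapperRank, TAG_DATA);
        return data;
      }
    }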

SLIDE 29

Reuse's Technical Issues

  • Loading the user's classes
  • The original flow exports CLASSPATH before running
  • Reflection
  • Load the user's classes at runtime
  • Create a new class loader for each user
  • Avoids class conflicts
  • Clean-up
  • Static fields
  • Security problem: e.g. the UserGroup static field must be reset
  • Current design
  • Reset the user-information and job-configuration static fields (see the sketch below)
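A sketch of that reset step; the reflective helper and its field list are illustrative. The point is that cached user information from one job must be nulled out before the pooled JVM serves the next user.

    import java.lang.reflect.Field;
    import java.lang.reflect.Modifier;

    // Null out named static fields of a class between two reused jobs.
    public class StaticFieldCleaner {
      public static void resetStatics(Class<?> cls, String... fieldNames)
          throws ReflectiveOperationException {
        for (String name : fieldNames) {
          Field f = cls.getDeclaredField(name);
          if (!Modifier.isStatic(f.getModifiers())) continue;  // statics only
          f.setAccessible(true);
          f.set(null, null);   // drop the previous job's cached value
        }
      }
    }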

SLIDE 30

Other Technical Issues

  • Enable Hadoop YARN to host traditional MPI applications
  • YARN is a resource manager
  • Work in progress
  • MPI AppMaster: monitors the MPI ranks
  • MPI Container: hosts a rank
  • Avoid gang scheduling of MPI
  • Work in progress


SLIDE 31

Evaluation

  • Hadoop version
  • v2.2.0
  • Changes in our implementation
  • Lines of code changed / total in Hadoop: ~1,000 / 1,851,473
  • Classes changed / total in Hadoop: 9 / 35,142


SLIDE 32

Cluster setup

  • FX10 supercomputer
  • SPARC64 IXfx 1.848 GHz (16 cores) & 32 GB RAM
  • MPI over the Tofu interconnect (5 GB/s)
  • Central storage
  • Hadoop setup
  • One master and many slaves
  • OpenJDK 7
  • HDFS is run on the central storage
  • OpenMPI 1.6
  • Java MPI binding (Vega-Gisbert et al.)
  • MCA parameter: plm_ple_cpu_affinity = 0 (see the example below)
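For reference, an Open MPI MCA parameter like this is typically passed on the launcher command line; the process count and the launched command below are placeholders.

    mpirun --mca plm_ple_cpu_affinity 0 -np 8 java ...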

SLIDE 33

Evaluation of JVM Reuse

  • MPI benefit
  • MPI vs. TCP/IP shuffling
  • Tera-sort job
  • Run on 32 FX10 nodes
  • 4-slot pool & -Xmx4096m
  • Start-up time
  • JVM Reuse vs. the original
  • PageRank iterative job
  • 400 GB of Wikipedia data
  • Run on 8 FX10 nodes
  • 6-slot pool & -Xmx4096m


SLIDE 34

MPI vs. TCP/IP shuffling


[Figure: Tera-sort execution time (s) vs. input size (50 to 250 GB), comparing a nonblocking TCP shuffle, a blocking TCP shuffle, and a blocking MPI shuffle. Nonblocking: accepts multiple connections at once; blocking: accepts one connection at a time. The blocking MPI shuffle is up to 10% faster.]

SLIDE 35

Shorten start-up time (PageRank)

[Figure: per-task timelines (Task ID vs. time, 0s to 400s) for original Hadoop and for JVM Reuse Hadoop over the first two PageRank iterations, broken down into start-up (JVM & user info), task initializing, shuffling (at reducers), data reading, task running, and task finishing (MapOutput writing). With JVM Reuse, the second iteration completes 50 seconds faster, largely because the start-up time is eliminated.]

SLIDE 36

More iterations


[Figure: three panels comparing Original Hadoop and JVM Reuse-based Hadoop as the iteration number n grows from 1 to 8: sum of start-up time (s), total execution time (s), and a per-approach breakdown of execution time into start-up time and computation time.]

SLIDE 37

Related work

  • M3R (VLDB 2012)
  • Also applies JVM Reuse, to enable in-memory MapReduce
  • Provides no optimization of JVM Reuse and no evaluation of it
  • Its Hadoop MapReduce (HMR) engine is written in X10
  • We keep the original HMR engine with minimal changes
  • JVM Reuse in Hadoop v1 (2012)
  • Only within a single job
  • JVM processes are terminated after their job completes
  • Gerbil: MPI + YARN (CCGrid 2015)
  • Hadoop YARN co-hosts MPI applications
  • Long start-up time and significant overhead
  • DataMPI (IPDPS 2014)
  • Hadoop-like MapReduce implementation using MPI & C
  • JVM-Bypass (IPDPS 2013)
  • C-based shuffling engine with RDMA support
  • We focus on using MPI over Hadoop processes


SLIDE 38

Summary

  • Improving Hadoop MapReduce performance on supercomputers
  • Approach: JVM Reuse
  • Statically create JVM processes and dynamically allocate them to Hadoop tasks
  • Enables efficient MPI communication on Hadoop
  • Shortens start-up time
  • Minimal changes to the original Hadoop
  • Future work
  • JVM Reuse drawback: it can affect CPU-bound tasks
  • Co-host MPI applications more efficiently
  • Full cleanup