SLIDE 1

MapReduce Frameworks on Multiple PVAS for Heterogeneous Computing Systems

Mikiko Sato

Tokyo University of Agriculture and Technology

SLIDE 2

Background

A recent tendency for performance improvement is the increase in the number of CPU cores, together with accelerators.

  • GPGPU, Intel XeonPhi
  • Multi-core and many-core CPUs provide differing computational performance, parallelism, latency…

[Figure: An application program spans a multi-core CPU running a multi-core OS (Linux) and a many-core CPU running a many-core OS (light-weight kernel), whose tasks cooperate. Many-core tasks handle highly parallel computational processing; multi-core tasks handle I/O processing and low-parallel, high-latency processing.]

The important issue is how to improve application performance using both types of CPUs cooperatively.

SLIDE 3

MapReduce framework

• Big data analytics has been identified as an exciting area for both academia and industry.
• The MapReduce framework[1] is a popular programming framework for big data analytics and scientific computing.
• MapReduce was originally designed for distributed computing and has been extended to various architectures (HPC systems[2], GPGPUs[3], many-core CPUs[4]).

• MapReduce on a heterogeneous system with XeonPhi

  • The hardware features of the Xeon Phi achieve high performance (512-bit VPUs, MIMD thread parallelism, coherent L2 cache, etc.)
  • The host processor assists the data transfer for MapReduce processing.

[1] Welcome to Apache Hadoop (online), available from http://hadoop.apache.org.
[2] "K MapReduce: A scalable tool for data-processing and search/ensemble applications on large-scale supercomputers", M. Matsuda et al., in CLUSTER, IEEE Computer Society, pp. 1-8, 2013.
[3] "Mars: a MapReduce framework on graphics processors", B. He, W. Fang, Q. Luo, N. K. Govindaraju, and T. Wang, in PACT, pp. 1-8, 2008.
[4] "Optimizing the MapReduce framework on Intel Xeon Phi coprocessor", M. Lu et al., in IEEE International Conference on Big Data, pp. 125-130, Oct. 2013.


SLIDE 4

Previous MapReduce frameworks on XeonPhi


• MRPhi[4]
  • An optimized MapReduce framework for the XeonPhi coprocessor
  • Uses SIMD VPUs for the map phase, SIMD hash-computation algorithms, MIMD hyper-threading, etc.
  • pthreads are used for Master/Worker task control on the Xeon Phi.

Important issues for performance are both the use of advanced XeonPhi features and effective thread control.

• MrPhi[5]

  • The expanded version of MRPhi[4]: MapReduce operations and data are transferred separately from host to XeonPhi.
  • MPI communication is used for data transfer and synchronization control between host and XeonPhi.

The communication overhead will be one of the factors limiting MapReduce performance.

[4] "Optimizing the MapReduce framework on Intel Xeon Phi coprocessor," M. Lu, et al., Big Data, IEEE International Conference on, pp.125-130, Oct. 2013. [5] "MrPhi: An Optimized MapReduce Framework on Intel Xeon Phi Coprocessors", Lu, M., et al., IEEE Transactions on Parallel and Distributed Systems, vol.PP, no.99, pp.1-14, 2014.

SLIDE 5

Inter-task communications

• Turn-around times (TAT) of a null function call in the XeonPhi offloading scheme are measured as the reference for our study.

• The communication overhead is large when sending small data between host and XeonPhi.

→ It is important to reduce the communication cost between host and XeonPhi as much as possible for MapReduce performance.


[Figure: A Delegator Task on the local CPU writes a processing request into a shared buffer; a Delegatee Task on the remote CPU polls the buffer, processes the request, and writes the result back into a second buffer, which the Delegator polls. Turn-around time (TAT) is measured; the request data varies between 8 bytes and 128 bytes, and the response data is fixed at 8 bytes. (Xeon E5-2670, MPSS 3-2.1.6720-13)]
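For concreteness, the following is a minimal C sketch of this kind of polling-based request/response exchange, assuming buffers that both CPUs can read and write; the flag layout and all names are illustrative, not the measured MPSS offload implementation.

/* Minimal sketch (C11 atomics), not the measured implementation:
 * the Delegator publishes a request through a polled flag and the
 * Delegatee answers through a second polled flag. */
#include <stdatomic.h>
#include <string.h>

#define REQ_MAX 128                       /* request payload: 8..128 B    */

typedef struct {
    _Atomic int ready;                    /* flag that the peer polls     */
    char        data[REQ_MAX];
} channel_t;

static channel_t req_buf, res_buf;        /* assumed visible to both CPUs */

/* Delegator (local CPU): write the request, poll for the 8-byte result. */
void delegator(const void *req, size_t len, void *result)
{
    memcpy(req_buf.data, req, len);
    atomic_store(&req_buf.ready, 1);      /* publish the request          */
    while (!atomic_load(&res_buf.ready))  /* poll for the result          */
        ;
    memcpy(result, res_buf.data, 8);      /* response fixed at 8 bytes    */
    atomic_store(&res_buf.ready, 0);
}

/* Delegatee (remote CPU): poll for a request, run it, write the result. */
void delegatee(void)
{
    while (!atomic_load(&req_buf.ready))  /* poll for a request           */
        ;
    atomic_store(&req_buf.ready, 0);
    long result = 0;                      /* "null function": no real work */
    memcpy(res_buf.data, &result, sizeof result);
    atomic_store(&res_buf.ready, 1);      /* publish the result           */
}

The TAT in the measurement corresponds to the time from the Delegator's write of the request until it observes the 8-byte result.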

SLIDE 6

Issues & Goal

• In order to obtain high performance on hybrid-architecture systems, it is important to
  • perform inter-task communication with less overhead
  • execute processing on the suitable CPU, considering the differences in performance and characteristics between the CPUs

Goal

• Enable cooperation with little overhead between tasks for a MapReduce framework on a hybrid system.

• In order to realize this program-execution environment, "Multiple PVAS" (Multiple Partitioned Virtual Address Space) will be provided as system software for task collaboration with less overhead on the hybrid-architecture system.


SLIDE 7

Task Model

• The task model of M-PVAS is based on PVAS[1].
• The PVAS system assigns one partition to one PVAS task.
• PVAS tasks execute using their own PVAS partitions within the same PVAS address space.
→ PVAS tasks can communicate by reading/writing virtual addresses in the PVAS address space, without using separate shared memory.

[Figure: On the many-core CPU, the PVAS address space contains PVAS Partitions for PVAS Task #1 through Task #M; each partition holds the task's TEXT, DATA & BSS, HEAP, and STACK, plus a kernel export region.]

[1] Shimada, A., Gero, B., Hori, A. and Ishikawa, Y.: Proposing a new task model towards many-core architecture (MES '13).

SLIDE 8

M-PVAS Task Model

• M-PVAS maps a number of PVAS address spaces onto a single virtual address space, the "Multiple PVAS Address Space".
• PVAS tasks belonging to the same Multiple PVAS address space can access the other PVAS address spaces, even if they are on a different CPU.
→ M-PVAS tasks can communicate with one another just by accessing virtual addresses.


This makes it convenient to develop parallel programs that collaborate across different CPUs.
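As a minimal illustration of what this permits, the C sketch below reads a peer task's data with an ordinary load; mpvas_partition_base() and the fixed offset are hypothetical stand-ins, since the slides do not show the actual M-PVAS API.

/* Hypothetical sketch: a task reads a variable inside another task's
 * partition by computing its virtual address and dereferencing it.
 * mpvas_partition_base() is NOT a real M-PVAS call; it stands in for
 * however a task learns the base address of a peer's partition. */
#include <stdint.h>
#include <stdio.h>

extern uintptr_t mpvas_partition_base(int task_id);   /* hypothetical */

#define RESULT_OFFSET 0x1000  /* illustrative offset of a peer variable */

void read_peer_result(int peer_task_id)
{
    /* No message passing and no extra shared-memory segment: the
     * peer's partition is simply another range of our own virtual
     * address space, even if the peer runs on a different CPU. */
    double *peer_result =
        (double *)(mpvas_partition_base(peer_task_id) + RESULT_OFFSET);
    printf("peer %d result: %f\n", peer_task_id, *peer_result);
}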

SLIDE 9

Basic Design of M-PVAS MapReduce

• M-PVAS MapReduce was designed based on MRPhi[3]
• The same MapReduce processing model as MRPhi[3] is used:

  • Host sends the MapReduce data to XeonPhi repeatedly.
  • Workers execute MapReduce operations, each accessing its part of the data.

• The inter-task communication and task-control parts are changed in order to compare the performance of the pthread and MPI interfaces against the M-PVAS methods.


Task control: pthread control (MRPhi) vs. M-PVAS Task control (M-PVAS)
Data transfer: MPI communication (MRPhi) vs. Shared Address Space (M-PVAS)

SLIDE 10

Master/Worker Task Control on M-PVAS

• Master Task controls Worker Tasks
  • Master Task notifies Worker Tasks of the MapReduce Control Data (fig. ①) ← the same as with pthread
  • Master/Worker Tasks synchronize using busy-waiting flags and an atomic counter (fig. ②, ③) ← the simple flag sensing is expected to give better performance (see the sketch below)

Control Data: processing information (Map or Reduce), the number of Worker Tasks, the MapReduce data address and size, the MapReduce result data address, etc.

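A minimal C sketch of this flag-and-counter pattern follows; the structure, names, and worker count are illustrative, not the framework's actual code.

/* Illustrative sketch of Master/Worker control with a busy-waited
 * phase flag (start notification) and an atomic counter (completion).
 * The Control Data would live alongside these fields in the shared
 * address space. */
#include <stdatomic.h>

typedef struct {
    _Atomic int phase;    /* bumped by Master to start a Map/Reduce phase */
    _Atomic int done;     /* incremented by each Worker as it finishes    */
    int         n_workers;
    /* ... Control Data: operation, data address/size, result address ... */
} mr_control_t;

static mr_control_t ctl = { .n_workers = 239 }; /* count from the evaluation */

void master_start_phase(void)
{
    atomic_store(&ctl.done, 0);
    atomic_fetch_add(&ctl.phase, 1);            /* ① notify Workers        */
    while (atomic_load(&ctl.done) < ctl.n_workers)
        ;                                       /* ③ wait for all Workers  */
}

void worker_loop(void (*do_chunk)(void))
{
    int seen = 0;
    for (;;) {
        while (atomic_load(&ctl.phase) == seen)
            ;                                   /* ② busy-wait on the flag */
        seen = atomic_load(&ctl.phase);
        do_chunk();                             /* Map or Reduce work      */
        atomic_fetch_add(&ctl.done, 1);         /* report completion       */
    }
}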

SLIDE 11

Data transfer for MapReduce processing

• Non-blocking data transfer is employed by both the Sender Task on the host system and the Master Task on the many-core system.
  • The Sender Task gets the request from the Master Task and transfers the data.
  • Double buffering requires two buffers, with one used to receive the next data chunk while the other holds the current chunk being processed (as sketched below).
  • Workers divide the receive-buffer data and execute their Map processing. With this control, computation and data transfer can be overlapped, which is expected to give better performance.
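The loop below is a minimal C sketch of such a double-buffering scheme; receive_next_chunk() and map_chunk() are hypothetical placeholders, and in the real framework the transfer and the Map work proceed concurrently rather than back-to-back as written here.

/* Illustrative double-buffering loop: while Workers consume one
 * buffer, the next chunk is received into the other. */
#include <stddef.h>

#define N_BUFS 2

extern size_t receive_next_chunk(void *buf);        /* Sender-side transfer */
extern void   map_chunk(const void *buf, size_t n); /* Workers' Map work    */

void master_receive_loop(void *bufs[N_BUFS])
{
    int cur = 0;
    size_t n = receive_next_chunk(bufs[cur]);   /* prime the first buffer */
    while (n > 0) {
        int nxt = (cur + 1) % N_BUFS;
        /* Receive the next chunk into one buffer while the current
         * chunk in the other buffer is divided among the Workers.   */
        size_t n_next = receive_next_chunk(bufs[nxt]);
        map_chunk(bufs[cur], n);
        cur = nxt;
        n   = n_next;
    }
}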


SLIDE 12

Implementations of Data transfer

• M-PVAS

  • The Master writes the buffer address and size information in the Master address space; the Sender checks them and simply copies the memory with the memcpy() function.

• MRPhi

  • MRPhi uses the MPI_Irecv() and MPI_Wait() functions to get the data asynchronously.
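The two paths can be contrasted in a short sketch; MPI_Irecv()/MPI_Wait() are the real MPI calls named above, while the surrounding functions and arguments are illustrative.

#include <string.h>
#include <mpi.h>

/* M-PVAS path: the Master has published the source address and size
 * in its partition; the Sender reads them directly and copies with a
 * plain memcpy() inside the shared address space. */
void mpvas_transfer(void *dst, const void *src, size_t len)
{
    memcpy(dst, src, len);              /* one copy, no protocol overhead */
}

/* MRPhi path: the data arrives over MPI; the receive is posted
 * asynchronously and later completed with MPI_Wait(). */
void mrphi_transfer(void *dst, int count, int src_rank)
{
    MPI_Request req;
    MPI_Irecv(dst, count, MPI_BYTE, src_rank, /* tag */ 0,
              MPI_COMM_WORLD, &req);
    /* ... other work can proceed here ... */
    MPI_Wait(&req, MPI_STATUS_IGNORE);
}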


SLIDE 13

Evaluation

• Execution environment for M-PVAS MapReduce
  • XeonPhi: Master Task = 1, Worker Tasks = 239
  • Host (Xeon): Sender Task = 1
• Benchmark
  • Monte Carlo, which shows good performance on XeonPhi.

Many-core CPU: Intel Xeon Phi 5110P (60 cores, 240 threads, 1.053 GHz); Memory: GDDR5 8 GB; OS: Linux 2.6.38
Multi-core CPU: Intel Xeon E5-2650 x2 (8 cores, 16 threads, 2.6 GHz); Memory: DDR3 64 GB; OS: Linux 2.6.32 (CentOS 6.3)
Intel CCL: MPSS Version 3.4.3
MPI: IMPI Version 5.0.1.035


SLIDE 14

Summary

• In this study, the task execution model "Multiple Partitioned Virtual Address Space (M-PVAS)" is applied to the MapReduce framework.
• The effect of the M-PVAS model is estimated with the MapReduce benchmark Monte Carlo.
  • In its current state, M-PVAS MapReduce shows better performance than the original MapReduce framework.
  • M-PVAS achieves around a 1.8~2.0x speedup.
  • The main factor is the data transfer processing.
• Future Work
  • Investigate the factors behind the performance improvement more deeply.
  • Experiment with different benchmarks.
