SLIDE 1

MapReduce Frameworks on Multiple PVAS for Heterogeneous Computing Systems

Mikiko Sato

Tokyo University of Agriculture and Technology

SLIDE 2

Background

A recent tendency for performance improvement is the increase in the number of CPU cores, together with accelerators.

  • GPGPU, Intel XeonPhi
  • Multi-core and many-core CPUs provide differing computational performance, parallelism, latency…

[Figure: An application program spans a multi-core CPU running a multi-core OS (Linux) and a many-core CPU running a many-core OS (light-weight kernel), whose tasks cooperate. Many-core tasks handle highly parallel computational processing; multi-core tasks handle I/O processing and low-parallel, high-latency processing.]

The important issue is how to improve application performance using both types of CPUs cooperatively.

SLIDE 3

MapReduce framework

• Big data analytics has been identified as an exciting area for both academia and industry.
• The MapReduce framework[1] is a popular programming framework for big data analytics and scientific computing.
• MapReduce was originally designed for distributed computing and has been extended to various architectures (HPC systems[2], GPGPUs[3], many-core CPUs[4]).

• MapReduce on a heterogeneous system with XeonPhi

  • The hardware features of the Xeon Phi achieve high performance (512-bit VPUs, MIMD thread parallelism, coherent L2 cache, etc.)
  • The host processor assists the data transfer for MapReduce processing.

[1] Welcome to Apache Hadoop (online), available from http://hadoop.apache.org.
[2] "K MapReduce: A scalable tool for data-processing and search/ensemble applications on large-scale supercomputers", M. Matsuda et al., in CLUSTER, IEEE Computer Society, pp. 1-8, 2013.
[3] "Mars: a MapReduce framework on graphics processors", B. He, W. Fang, Q. Luo, N. K. Govindaraju, and T. Wang, in PACT, pp. 1-8, 2008.
[4] "Optimizing the MapReduce framework on Intel Xeon Phi coprocessor", M. Lu et al., in IEEE International Conference on Big Data, pp. 125-130, Oct. 2013.


SLIDE 4

Previous MapReduce frameworks on XeonPhi


• MRPhi[4]
  • An optimized MapReduce framework for the XeonPhi coprocessor
  • Uses SIMD VPUs for the map phase, SIMD hash-computation algorithms, MIMD hyper-threading, etc.
  • pthreads are used for Master/Worker task control on the Xeon Phi.

Important issues for performance are both the use of advanced XeonPhi features and effective thread control.

• MrPhi[5]

  • The expanded version of MRPhi[4]: MapReduce operations and data are transferred separately from host to XeonPhi.
  • MPI communication is used for data transfer and synchronization control between host and XeonPhi.

The communication overhead will be one of the factors limiting MapReduce performance.

[4] "Optimizing the MapReduce framework on Intel Xeon Phi coprocessor," M. Lu, et al., Big Data, IEEE International Conference on, pp.125-130, Oct. 2013. [5] "MrPhi: An Optimized MapReduce Framework on Intel Xeon Phi Coprocessors", Lu, M., et al., IEEE Transactions on Parallel and Distributed Systems, vol.PP, no.99, pp.1-14, 2014.

SLIDE 5

Inter-task communications

• Turn-around times (TAT) of a null function call in the XeonPhi offloading scheme are measured as the reference for our study.

• The communication overhead is large when sending small data between host and XeonPhi.

→ It is important to reduce the communication cost between host and XeonPhi as much as possible for MapReduce performance.


[Figure: A Delegator Task on the local CPU writes a processing request into a shared buffer; a Delegatee Task on the remote CPU polls the buffer, processes the request, and writes the result back into a second buffer, which the Delegator polls. Turn-around time (TAT) is measured; the request data varies between 8 bytes and 128 bytes, and the response data is fixed at 8 bytes. (Xeon E5-2670, MPSS 3-2.1.6720-13)]
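For concreteness, the following is a minimal C sketch of this kind of polling-based request/response exchange, assuming buffers that both CPUs can read and write; the flag layout and all names are illustrative, not the measured MPSS offload implementation.

/* Minimal sketch (C11 atomics), not the measured implementation:
 * the Delegator publishes a request through a polled flag and the
 * Delegatee answers through a second polled flag. */
#include <stdatomic.h>
#include <string.h>

#define REQ_MAX 128                       /* request payload: 8..128 B    */

typedef struct {
    _Atomic int ready;                    /* flag that the peer polls     */
    char        data[REQ_MAX];
} channel_t;

static channel_t req_buf, res_buf;        /* assumed visible to both CPUs */

/* Delegator (local CPU): write the request, poll for the 8-byte result. */
void delegator(const void *req, size_t len, void *result)
{
    memcpy(req_buf.data, req, len);
    atomic_store(&req_buf.ready, 1);      /* publish the request          */
    while (!atomic_load(&res_buf.ready))  /* poll for the result          */
        ;
    memcpy(result, res_buf.data, 8);      /* response fixed at 8 bytes    */
    atomic_store(&res_buf.ready, 0);
}

/* Delegatee (remote CPU): poll for a request, run it, write the result. */
void delegatee(void)
{
    while (!atomic_load(&req_buf.ready))  /* poll for a request           */
        ;
    atomic_store(&req_buf.ready, 0);
    long result = 0;                      /* "null function": no real work */
    memcpy(res_buf.data, &result, sizeof result);
    atomic_store(&res_buf.ready, 1);      /* publish the result           */
}

The TAT in the measurement corresponds to the time from the Delegator's write of the request until it observes the 8-byte result.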

SLIDE 6

Issues & Goal

• In order to obtain high performance on hybrid-architecture systems, it is important to
  • perform inter-task communication with less overhead
  • execute processing on the suitable CPU, considering the differences in performance and characteristics between the CPUs

Goal

• Enable cooperation with little overhead between tasks for a MapReduce framework on a hybrid system.

• In order to realize this program-execution environment, "Multiple PVAS" (Multiple Partitioned Virtual Address Space) will be provided as system software for task collaboration with less overhead on the hybrid-architecture system.


SLIDE 7

Task Model

• The task model of M-PVAS is based on PVAS[1].
• The PVAS system assigns one partition to one PVAS task.
• PVAS tasks execute using their own PVAS partitions within the same PVAS address space.
→ PVAS tasks can communicate by reading/writing virtual addresses in the PVAS address space, without using separate shared memory.

[Figure: On the many-core CPU, the PVAS address space contains PVAS Partitions for PVAS Task #1 through Task #M; each partition holds the task's TEXT, DATA & BSS, HEAP, and STACK, plus a kernel export region.]

[1] Shimada, A., Gero, B., Hori, A. and Ishikawa, Y.: Proposing a new task model towards many-core architecture (MES '13).

SLIDE 8

M-PVAS Task Model

• M-PVAS maps a number of PVAS address spaces onto a single virtual address space, the "Multiple PVAS Address Space".
• PVAS tasks belonging to the same Multiple PVAS address space can access the other PVAS address spaces, even if they are on a different CPU.
→ M-PVAS tasks can communicate with one another just by accessing virtual addresses.


This makes it convenient to develop parallel programs that collaborate across different CPUs.
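As a minimal illustration of what this permits, the C sketch below reads a peer task's data with an ordinary load; mpvas_partition_base() and the fixed offset are hypothetical stand-ins, since the slides do not show the actual M-PVAS API.

/* Hypothetical sketch: a task reads a variable inside another task's
 * partition by computing its virtual address and dereferencing it.
 * mpvas_partition_base() is NOT a real M-PVAS call; it stands in for
 * however a task learns the base address of a peer's partition. */
#include <stdint.h>
#include <stdio.h>

extern uintptr_t mpvas_partition_base(int task_id);   /* hypothetical */

#define RESULT_OFFSET 0x1000  /* illustrative offset of a peer variable */

void read_peer_result(int peer_task_id)
{
    /* No message passing and no extra shared-memory segment: the
     * peer's partition is simply another range of our own virtual
     * address space, even if the peer runs on a different CPU. */
    double *peer_result =
        (double *)(mpvas_partition_base(peer_task_id) + RESULT_OFFSET);
    printf("peer %d result: %f\n", peer_task_id, *peer_result);
}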

SLIDE 9

Basic Design of M-PVAS MapReduce

• M-PVAS MapReduce was designed based on MRPhi[3]
• The same MapReduce processing model as MRPhi[3] is used:

  • Host sends the MapReduce data to XeonPhi repeatedly.
  • Workers execute MapReduce operations, each accessing its part of the data.

• The inter-task communication and task-control parts are changed in order to compare the performance of the pthread and MPI interfaces against the M-PVAS methods.


Task control: pthread control (MRPhi) vs. M-PVAS Task control (M-PVAS)
Data transfer: MPI communication (MRPhi) vs. Shared Address Space (M-PVAS)

SLIDE 10

Master/Worker Task Control on M-PVAS

• Master Task controls Worker Tasks
  • Master Task notifies Worker Tasks of the MapReduce Control Data (fig. ①) ← the same as with pthread
  • Master/Worker Tasks synchronize using busy-waiting flags and an atomic counter (fig. ②, ③) ← the simple flag sensing is expected to give better performance (see the sketch below)

Control Data: processing information (Map or Reduce), the number of Worker Tasks, the MapReduce data address and size, the MapReduce result data address, etc.

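A minimal C sketch of this flag-and-counter pattern follows; the structure, names, and worker count are illustrative, not the framework's actual code.

/* Illustrative sketch of Master/Worker control with a busy-waited
 * phase flag (start notification) and an atomic counter (completion).
 * The Control Data would live alongside these fields in the shared
 * address space. */
#include <stdatomic.h>

typedef struct {
    _Atomic int phase;    /* bumped by Master to start a Map/Reduce phase */
    _Atomic int done;     /* incremented by each Worker as it finishes    */
    int         n_workers;
    /* ... Control Data: operation, data address/size, result address ... */
} mr_control_t;

static mr_control_t ctl = { .n_workers = 239 }; /* count from the evaluation */

void master_start_phase(void)
{
    atomic_store(&ctl.done, 0);
    atomic_fetch_add(&ctl.phase, 1);            /* ① notify Workers        */
    while (atomic_load(&ctl.done) < ctl.n_workers)
        ;                                       /* ③ wait for all Workers  */
}

void worker_loop(void (*do_chunk)(void))
{
    int seen = 0;
    for (;;) {
        while (atomic_load(&ctl.phase) == seen)
            ;                                   /* ② busy-wait on the flag */
        seen = atomic_load(&ctl.phase);
        do_chunk();                             /* Map or Reduce work      */
        atomic_fetch_add(&ctl.done, 1);         /* report completion       */
    }
}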

SLIDE 11

Data transfer for MapReduce processing

• Non-blocking data transfer is employed by both the Sender Task on the host system and the Master Task on the many-core system.
  • The Sender Task gets the request from the Master Task and transfers the data.
  • Double buffering requires two buffers, with one used to receive the next data chunk while the other holds the current chunk being processed (as sketched below).
  • Workers divide the receive-buffer data and execute their Map processing. With this control, computation and data transfer can be overlapped, which is expected to give better performance.
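The loop below is a minimal C sketch of such a double-buffering scheme; receive_next_chunk() and map_chunk() are hypothetical placeholders, and in the real framework the transfer and the Map work proceed concurrently rather than back-to-back as written here.

/* Illustrative double-buffering loop: while Workers consume one
 * buffer, the next chunk is received into the other. */
#include <stddef.h>

#define N_BUFS 2

extern size_t receive_next_chunk(void *buf);        /* Sender-side transfer */
extern void   map_chunk(const void *buf, size_t n); /* Workers' Map work    */

void master_receive_loop(void *bufs[N_BUFS])
{
    int cur = 0;
    size_t n = receive_next_chunk(bufs[cur]);   /* prime the first buffer */
    while (n > 0) {
        int nxt = (cur + 1) % N_BUFS;
        /* Receive the next chunk into one buffer while the current
         * chunk in the other buffer is divided among the Workers.   */
        size_t n_next = receive_next_chunk(bufs[nxt]);
        map_chunk(bufs[cur], n);
        cur = nxt;
        n   = n_next;
    }
}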


SLIDE 12

Implementations of Data transfer

• M-PVAS

  • The Master writes the buffer address and size information in the Master address space; the Sender checks them and simply copies the memory with the memcpy() function.

• MRPhi

  • MRPhi uses the MPI_Irecv() and MPI_Wait() functions to get the data asynchronously.
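The two paths can be contrasted in a short sketch; MPI_Irecv()/MPI_Wait() are the real MPI calls named above, while the surrounding functions and arguments are illustrative.

#include <string.h>
#include <mpi.h>

/* M-PVAS path: the Master has published the source address and size
 * in its partition; the Sender reads them directly and copies with a
 * plain memcpy() inside the shared address space. */
void mpvas_transfer(void *dst, const void *src, size_t len)
{
    memcpy(dst, src, len);              /* one copy, no protocol overhead */
}

/* MRPhi path: the data arrives over MPI; the receive is posted
 * asynchronously and later completed with MPI_Wait(). */
void mrphi_transfer(void *dst, int count, int src_rank)
{
    MPI_Request req;
    MPI_Irecv(dst, count, MPI_BYTE, src_rank, /* tag */ 0,
              MPI_COMM_WORLD, &req);
    /* ... other work can proceed here ... */
    MPI_Wait(&req, MPI_STATUS_IGNORE);
}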


SLIDE 13

Evaluation

• Execution environment for M-PVAS MapReduce
  • XeonPhi: Master Task = 1, Worker Tasks = 239
  • Host (Xeon): Sender Task = 1
• Benchmark
  • Monte Carlo, which shows good performance on XeonPhi.

Many-core CPU: Intel Xeon Phi 5110P (60 cores, 240 threads, 1.053 GHz); Memory: GDDR5 8 GB; OS: Linux 2.6.38
Multi-core CPU: Intel Xeon E5-2650 x2 (8 cores, 16 threads, 2.6 GHz); Memory: DDR3 64 GB; OS: Linux 2.6.32 (CentOS 6.3)
Intel CCL: MPSS Version 3.4.3
MPI: IMPI Version 5.0.1.035


SLIDE 14

Summary

• In this study, the task execution model "Multiple Partitioned Virtual Address Space (M-PVAS)" is applied to the MapReduce framework.
• The effect of the M-PVAS model is estimated with the MapReduce benchmark Monte Carlo.
  • In its current state, M-PVAS MapReduce shows better performance than the original MapReduce framework.
  • M-PVAS achieves around a 1.8~2.0x speedup.
  • The main factor is the data transfer processing.
• Future Work
  • Investigate the factors behind the performance improvement more deeply.
  • Experiment with different benchmarks.
