

SLIDE 1

Replicating the Performance Evaluation of an N-Body Application on a Manycore Accelerator

Vinícius Garcia Pinto Vinicius Alves Herbstrith Lucas Mello Schnorr October 18, 2015

1 / 21 PINTO V.G., HERBSTRITH V. A., SCHNORR L.M. 6th Workshop on Applications for Multi-Core Architectures - WAMCA

SLIDE 2

Outline

1. Introduction
2. Background
3. Related Work
4. N-Body Performance Evaluation on XeonPhi
5. Conclusions
6. References


SLIDE 3

Introduction

Reproducibility

  • discoveries are replicated and reproduced by independent scientists
  • in computer science: lack of documentation of the experiments and their methodology → obstacles to repeating/checking third-party results
  • HPC scenario: few works about platforms with accelerators

Source: Nature Education 2015

[ Stodden et al. 2014; TOP500 2015]


SLIDE 4

This work

  • Replication of a performance evaluation reported in the book High Performance Parallelism Pearls: Multicore and Many-core Programming Approaches
  • N-Body OpenMP parallel application on a XeonPhi accelerator
  • Our goals:
  • check whether their results hold on similar, but not identical, hardware
  • improve the reproducibility of the original experiments (no raw data; a description with few details)
  • Despite this, we believe that with the source code and a high-level description of the hardware it is possible to replicate and extend this performance analysis.


SLIDE 5

N-Body Simulation based on Newton’s Gravitation Law

  • Algorithm based on Newton's laws of motion
  • interaction between bodies and the forces acting on them

∀i ∈ {1, ..., N}:   d v⃗_i / dt = G · Σ_{j ≠ i} m_j (x⃗_j − x⃗_i) / d_ij³   (1)

  • N-Body OpenMP Parallel Application for the XeonPhi
  • this kind of application simulates the interaction of particles in space
  • presented in Chapter 11 of the book High Performance Parallelism Pearls
  • open-source code that implements this equation as a parallel OpenMP application

[Reinders et al. ’14]


SLIDE 6

N-Body OpenMP Parallel Application

  • Four versions:
  • v0 - able to run natively either on the host or on the device
  • v1 - starts the execution on the host and offloads specific computations to the device
  • v2 - simultaneous computation on both host and device (the host keeps executing while the device is running)
  • v2.1 - overlapping (bi-directional) data transfers between host and device
  • v3 - adds support for an arbitrary number of accelerator devices

  tag   | variants | platform       | memory transfers | FP prec.
  v0    | v0s/v0d  | host or device | -                | single/double
  v1    | v1s      | both           | one-sided        | single
  v2    | v2s      | both           | one-sided        | single
  v2.1  | v2.1s    | both           | bi-directional   | single
  v3    | v3s      | both           | bi-directional   | single
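The offload versions (v1 onward) rely on the OpenMP 4.0 target directives discussed later in the talk. A minimal sketch of the pattern, with illustrative names and a toy kernel rather than the book's N-body code; on a machine without a XeonPhi, the target region simply falls back to host execution:

```c
#include <stddef.h>

/* Sketch of the v1-style offload pattern: map the array to the device,
 * run the parallel loop there, and copy the results back. */
void scale_on_device(float *a, size_t n, float factor)
{
    #pragma omp target map(tofrom: a[0:n])
    #pragma omp parallel for simd
    for (size_t i = 0; i < n; i++)
        a[i] *= factor;
}
```

The map(tofrom: ...) clause is what makes the transfer one-sided per direction: data goes to the device before the region and back to the host after it.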


SLIDE 7

Related Work

  • XeonPhi supports standard parallel programming tools (e.g. OpenMP)
  • OpenMP 4.0: new directives for accelerators and coprocessors
  • Related works:
  • [Cramer et al. 2012] Evaluation of OpenMP kernel-type benchmarks and a CG solver, XeonPhi vs. a 128-core SMP machine: the overhead of OpenMP synchronization constructs is smaller on the XeonPhi; scientific applications can run efficiently on the device.
  • [Schmidl et al. 2013] Some applications do not perform well on the XeonPhi because of its relatively slow serial performance.
  • [Tian et al. 2015] N-Body used to evaluate SIMD vectorization on the XeonPhi: SIMD instructions accelerate the execution by 10.52 times.
  • [Borovska et al. 2014] Porting of an (MPI/OpenMP) astrophysics simulation application to the XeonPhi: gains of up to 38% from changing the # of MPI processes per device and the # of threads per process; no comparison with standard processors.

[Borovska et al. 2014; Cramer et al. 2012; Schmidl et al. 2013; Tian et al. 2015]


SLIDE 8

Related Work

  • Our work:
  • Similar to the first two related works, we use an OpenMP application; however, ours uses new OpenMP 4.0 directives such as pragma simd and pragma target.
  • We also use an N-body-like application (as the other two works do), but include two more comparisons: one between the XeonPhi and the host, and another between the GPUs and the XeonPhi.


SLIDE 9

N-Body Performance Evaluation on XeonPhi

  • Platforms Setup and Experimental Methodology
  • Experiments conducted on two machines:
  • orion node at INF/UFRGS with two accelerators (Intel XeonPhi and Nvidia K20)
  • Bree desktop with one accelerator (Nvidia GTX760)

  Host                  Orion              Bree
  Processor             Xeon E5-2630       i7-4770
  N of procs. (NUMA)    2 (two)            1 (one)
  Cores per proc.       6 (12 Hyper. T.)   4 (8 Hyper. T.)
  Max. core freq.       2.30 GHz           3.40 GHz
  Main memory           32 GBytes          8 GBytes
  Accelerator #1        XeonPhi 3120A      GTX760
  Accelerator #2        Nvidia K20         -
  OS                    CentOS Linux 7     Ubuntu 14.04
  Kernel                3.10.0 (x86_64)    3.13.0
  MPSS / CUDA           3.4.1 / 6.5        NA / 5.5

  Accelerator           Phi 3120A          K20m         GTX760
  Processor             in-order x86       CUDA cores   CUDA cores
  Cores                 57 (228 HW T.)     2496         1152
  Max. core freq.       1.10 GHz           706 MHz      980 MHz
  L2 cache              512 KBytes         1.3 MBytes   768 KBytes
  Main memory           6 GBytes           5 GBytes     2 GBytes
  Mem. bandwidth        240 GB/s           208 GB/s     192 GB/s
  TDP                   300 W              225 W        170 W

SLIDE 10

N-Body Performance Evaluation on XeonPhi

  • Application input:
  • number of particles (50,000)
  • time step for each iteration (0.01)
  • number of iterations (100)
  • Experimental methodology:
  • average of at least 31 runs
  • standard error taken as 3 times the standard deviation divided by the square root of the number of observations
  • we also adopted the Speedup-Test to declare that an observed speedup is (or is not) statistically significant
  • an open-source R-based tool that uses Student's t-test and the Wilcoxon-Mann-Whitney test [Touati et al. 2013]


SLIDE 11

N-Body Performance Evaluation on XeonPhi

  • Independent executions: v0s/v0d (on host)
[Figure: execution time (left) and speedup (right) vs. number of threads (2-12) on the host, free vs. pinned threads, versions v0s and v0d]

  • Not much difference between free and pinned threads
  • Performance gains seem to be much higher for double precision
  • However, single precision has a good sequential time → similar gains
  • Acceleration is very close no matter which version is used
  • Pinned: after 8 cores, the speedup drifts away from the ideal


SLIDE 12

N-Body Performance Evaluation on XeonPhi

  • Independent executions: v0s/v0d (on device)

[Figure: speedup vs. number of threads on the device, versions v0s and v0d]

  • Native execution model
  • Vertical lines represent the # of physical cores (57) and the # of hardware threads (228)
  • Speedup departs from the ideal well before the first vertical line
  • inadequate load / inability to schedule on this # of cores
  • irregular acceleration
  • maximum speedup obtained with 224 threads


SLIDE 13

N-Body Performance Evaluation on XeonPhi

  • Offloading overhead to the XeonPhi - v0s vs. v1s
  • v1s offloads the computation to the XeonPhi; the host remains idle during the execution on the accelerator (similar to CUDA)
  • We evaluate here the difference between offloading the computation (v1s) and the previous version (v0s)

[Figure: execution times of v0s, v1s, and the difference v0s−v1s]

  • Only executions with 228 threads (max)
  • v1s is 11.19% faster
  • scheduling decisions taken on the host?
  • Note: the best v0s result was with 224 threads


SLIDE 14

N-Body Performance Evaluation on XeonPhi

  • Full-duplex Memory Transfer Gains - v2s vs. v2.1s
  • v2s and v2.1s make use of all available cores: host and accelerator work together through OpenMP scheduling
  • in v2.1s, memory transfers are bi-directional (host→device and device→host)

  Version       Average        Std. dev.    Std. error (3·sd/√n)
  v2s           6.1692480108   0.09636988   0.003351785
  v2.1s         6.1686474234   0.19873992   0.006912257
  v2s − v2.1s   0.0006005874   0.22089767   0.007682913

  • Multidevice Overhead Analysis - v2.1s vs. v3s
  • v3s is exactly like v2.1s, adding support for multiple XeonPhi boards
  • our platform has only one XeonPhi board, so we evaluate the overhead caused by such support

  Version       Average        Std. dev.    Std. error (3·sd/√n)
  v3s           6.166388258    0.2231157    0.007760058
  v2.1s         6.168647423    0.1987399    0.006912257
  v3s − v2.1s   −0.002259165   0.2988195    0.010393067

SLIDE 15

N-Body Performance Evaluation on XeonPhi, GTX760 and K20

  • Comparing versions v0s, v1s, v2s and v3s, and the XeonPhi, GTX760 and K20

[Figure: execution times of v0s, v1s, v2s, v2.1s and v3s on the XeonPhi, and of the gpugems3 code on the GTX760 and K20]

  • v0s, v1s, v2s and v3s:
  • v1s is the slowest
  • v2s, v2.1s and v3s: no significant difference
  • XeonPhi, GTX760, K20:
  • same O(N²) algorithm, different code, same input
  • GPUs are faster than any version on the XeonPhi


SLIDE 16

Discussion about the Experimental Replication

  • Our results vs. the results described in the book:
  • we use only the graphics from the book, since the raw measurements are unavailable
  • hardware:
  • the accelerator differs: a XeonPhi with 57 cores (3120A) vs. a XeonPhi with 60 (5110P)
  • the host differs: two 10-core Intel Xeon E5-2660 vs. two 6-core Xeon E5-2630
  • despite this, the hardware is similar enough to allow a fair comparison
  • Results (ours vs. the book's):
  • v0 host-only: execution-time graphics with similar shapes, but our slowest experiment was 100 seconds faster; the fastest points were similar
  • v0 device-only: our curve is slightly smoother, but both curves behave similarly, converging quickly to smaller values; our slowest execution was faster than the book's
  • v0 speedup: on the host, our and their curves are close to the ideal speedup while using physical cores; on the device, we observe some ups and downs, but even so the highest speedup was the same (153 times)


SLIDE 17

Discussion about the Experimental Replication

  • Our results vs. the results described in the book:
  • single vs. double floating-point precision:
  • our tests: single was 7.3x faster on the host and 3.7x on the device
  • their tests: single was 4.4x faster on the host and 3.1x on the XeonPhi
  • our host was 2x slower than theirs with double precision, which explains our 7.3x speedup
  • v1s: similar performance to v0s running on the device only; both our and their results show that v0s is faster than v1s in its best cases
  • v2s: simultaneous computation on host and device yields a gain of 18% over the offload version and of 366% over the fastest host-only version
  • v2.1s: no major gains (as reported in the book)
  • v3s: the book reports a speedup of 4x, while we achieved 3.72x; however, our platform has only one XeonPhi


SLIDE 18

Conclusions

  • We presented a performance evaluation of an N-Body application on a XeonPhi.
  • We extended this evaluation by comparing the XeonPhi performance with NVIDIA GPUs.
  • Our results were similar to those presented in the book. In some cases we observed different values, but these can be explained by hardware differences.
  • Despite this, we were able to reproduce the original results.
  • XeonPhi vs. GPUs:
  • the XeonPhi was slower than the GPUs
  • but the CUDA sample codes are heavily optimized and written in a low-level language, while the XeonPhi runs OpenMP code
  • Future work:
  • compare the XeonPhi with a traditional SMP machine
  • evaluate the impact of OpenMP thread-placement decisions (e.g. compact or scatter affinity)
  • trace the OpenMP runtime to investigate how some scheduling decisions are taken


SLIDE 19

Thank you

Questions?

This work has been partially supported by CAPES and CNPq.


SLIDE 20

References I

Borovska, P. and D. Ivanova (2014). "Code Optimization and Scaling of the Astrophysics Software Gadget on Intel Xeon Phi." In: Partnership for Advanced Computing in Europe (PRACE) 136.

Cramer, Tim et al. (2012). "OpenMP Programming on Intel Xeon Phi Coprocessors: An Early Performance Comparison." In: Proc. Many-core Applications Research Community (MARC) Symp. at RWTH Aachen University. Aachen, Germany.

Nature Education (2015). English Communication for Scientists. URL: http://www.nature.com/scitable/ebooks/english-communication-for-scientists-14053993/writing-scientific-papers-14239285.

Reinders, J. and J. Jeffers (2014). High Performance Parallelism Pearls Vol. 1: Multicore and Many-core Programming Approaches. Elsevier. ISBN: 9780128021996.


SLIDE 21

References II

Schmidl, Dirk et al. (2013). "Assessing the Performance of OpenMP Programs on the Intel Xeon Phi." In: Euro-Par 2013 Parallel Processing. Ed. by Felix Wolf, Bernd Mohr, and Dieter an Mey. Vol. 8097. Lecture Notes in Computer Science. Springer Berlin Heidelberg, pp. 547-558. ISBN: 978-3-642-40046-9.

Stodden, V., F. Leisch, and R.D. Peng (2014). Implementing Reproducible Research. Chapman & Hall/CRC The R Series. Taylor & Francis. ISBN: 9781466561595.

Tian, Xinmin et al. (2015). "Effective SIMD Vectorization for Intel Xeon Phi Coprocessors." In: Scientific Programming 501, p. 269764.

Touati, Sid-Ahmed-Ali, Julien Worms, and Sébastien Briais (2013). "The Speedup-Test: a statistical methodology for programme speedup analysis and computation." In: Concurrency and Computation: Practice and Experience 25.10, pp. 1410-1426. ISSN: 1532-0634. DOI: 10.1002/cpe.2939.
