

SLIDE 1

Replicating the Performance Evaluation of an N-Body Application on a Manycore Accelerator

Vinícius Garcia Pinto Vinicius Alves Herbstrith Lucas Mello Schnorr October 18, 2015

1 / 21 PINTO V.G., HERBSTRITH V. A., SCHNORR L.M. 6th Workshop on Applications for Multi-Core Architectures - WAMCA

SLIDE 2

Outline

1. Introduction
2. Background
3. Related Work
4. N-Body Performance Evaluation on XeonPhi
5. Conclusions
6. References


SLIDE 3

Introduction

Reproducibility

  • discoveries are replicated and reproduced by independent scientists
  • in computer science: lack of documentation of the experiments and their methodology → obstacles to repeating/checking third-party results
  • HPC scenario: few works about platforms with accelerators

Source: Nature Education 2015

[ Stodden et al. 2014; TOP500 2015]


SLIDE 4

This work

  • Replication of a performance evaluation reported in the book High Performance Parallelism Pearls: Multicore and Many-core Programming Approaches
  • N-Body OpenMP parallel application on a XeonPhi accelerator
  • Our goals:
  • check whether their results hold on similar, but not identical, hardware
  • improve the reproducibility of the original experiments (no raw data; a description with few details)
  • Despite this, we believe that with the source code and a high-level description of the hardware it is possible to replicate and extend this performance analysis.


SLIDE 5

N-Body Simulation based on Newton’s Gravitation Law

  • Algorithm based on Newton's laws of motion
  • interaction between bodies and the forces acting on them

∀i ∈ {1, ..., N}:   d v⃗_i / dt = G · Σ_{j ≠ i} m_j (x⃗_j − x⃗_i) / d_ij³   (1)

  • N-Body OpenMP Parallel Application for the XeonPhi
  • this kind of application simulates the interaction of particles in space
  • presented in Chapter 11 of the book High Performance Parallelism Pearls
  • open-source code that implements this equation as a parallel OpenMP application

[Reinders et al. ’14]


SLIDE 6

N-Body OpenMP Parallel Application

  • Four versions:
  • v0 - able to run natively either on the host or on the device
  • v1 - starts the execution on the host and offloads specific computations to the device
  • v2 - simultaneous computation on both host and device (the host keeps executing while the device is running)
  • v2.1 - overlapping (bi-directional) data transfers between host and device
  • v3 - adds support for an arbitrary number of accelerator devices

  tag   | variants | platform       | memory transfers | FP prec.
  v0    | v0s/v0d  | host or device | -                | single/double
  v1    | v1s      | both           | one-sided        | single
  v2    | v2s      | both           | one-sided        | single
  v2.1  | v2.1s    | both           | bi-directional   | single
  v3    | v3s      | both           | bi-directional   | single
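The offload versions (v1 onward) rely on the OpenMP 4.0 target directives discussed later in the talk. A minimal sketch of the pattern, with illustrative names and a toy kernel rather than the book's N-body code; on a machine without a XeonPhi, the target region simply falls back to host execution:

```c
#include <stddef.h>

/* Sketch of the v1-style offload pattern: map the array to the device,
 * run the parallel loop there, and copy the results back. */
void scale_on_device(float *a, size_t n, float factor)
{
    #pragma omp target map(tofrom: a[0:n])
    #pragma omp parallel for simd
    for (size_t i = 0; i < n; i++)
        a[i] *= factor;
}
```

The map(tofrom: ...) clause is what makes the transfer one-sided per direction: data goes to the device before the region and back to the host after it.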


SLIDE 7

Related Work

  • XeonPhi supports standard parallel programming tools (e.g. OpenMP)
  • OpenMP 4.0: new directives for accelerators and coprocessors
  • Related works:
  • [Cramer et al. 2012] Evaluation of OpenMP kernel-type benchmarks and a CG solver, XeonPhi vs. a 128-core SMP machine: the overhead of OpenMP synchronization constructs is smaller on the XeonPhi; scientific applications can run efficiently on the device.
  • [Schmidl et al. 2013] Some applications do not perform well on the XeonPhi because of its relatively slow serial performance.
  • [Tian et al. 2015] N-Body used to evaluate SIMD vectorization on the XeonPhi: SIMD instructions accelerate the execution by 10.52 times.
  • [Borovska et al. 2014] Porting of an (MPI/OpenMP) astrophysics simulation application to the XeonPhi: gains of up to 38% from changing the # of MPI processes per device and the # of threads per process; no comparison with standard processors.

[Borovska et al. 2014; Cramer et al. 2012; Schmidl et al. 2013; Tian et al. 2015]


SLIDE 8

Related Work

  • Our work:
  • Similar to the first two related works, we use an OpenMP application; however, ours uses new OpenMP 4.0 directives such as pragma simd and pragma target.
  • We also use an N-body-like application (as the other two works do), but include two more comparisons: one between the XeonPhi and the host, and another between the GPUs and the XeonPhi.


SLIDE 9

N-Body Performance Evaluation on XeonPhi

  • Platforms Setup and Experimental Methodology
  • Experiments conducted on two machines:
  • orion node at INF/UFRGS with two accelerators (Intel XeonPhi and Nvidia K20)
  • Bree desktop with one accelerator (Nvidia GTX760)

  Host                  Orion              Bree
  Processor             Xeon E5-2630       i7-4770
  N of procs. (NUMA)    2 (two)            1 (one)
  Cores per proc.       6 (12 Hyper. T.)   4 (8 Hyper. T.)
  Max. core freq.       2.30 GHz           3.40 GHz
  Main memory           32 GBytes          8 GBytes
  Accelerator #1        XeonPhi 3120A      GTX760
  Accelerator #2        Nvidia K20         -
  OS                    CentOS Linux 7     Ubuntu 14.04
  Kernel                3.10.0 (x86_64)    3.13.0
  MPSS / CUDA           3.4.1 / 6.5        NA / 5.5

  Accelerator           Phi 3120A          K20m         GTX760
  Processor             in-order x86       CUDA cores   CUDA cores
  Cores                 57 (228 HW T.)     2496         1152
  Max. core freq.       1.10 GHz           706 MHz      980 MHz
  L2 cache              512 KBytes         1.3 MBytes   768 KBytes
  Main memory           6 GBytes           5 GBytes     2 GBytes
  Mem. bandwidth        240 GB/s           208 GB/s     192 GB/s
  TDP                   300 W              225 W        170 W

SLIDE 10

N-Body Performance Evaluation on XeonPhi

  • Application input:
  • number of particles (50,000)
  • time step for each iteration (0.01)
  • number of iterations (100)
  • Experimental methodology:
  • average of at least 31 runs
  • standard error taken as 3 times the standard deviation divided by the square root of the number of observations
  • we also adopted the Speedup-Test to declare that an observed speedup is (or is not) statistically significant
  • an open-source R-based tool that uses Student's t-test and the Wilcoxon-Mann-Whitney test [Touati et al. 2013]


SLIDE 11

N-Body Performance Evaluation on XeonPhi

  • Independent executions: v0s/v0d (on host)
[Figure: execution time (left) and speedup (right) vs. number of threads (2-12) on the host, free vs. pinned threads, versions v0s and v0d]

  • Not much difference between free and pinned threads
  • Performance gains seem to be much higher for double precision
  • However, single precision has a good sequential time → similar gains
  • Acceleration is very close no matter which version is used
  • Pinned: after 8 cores, the speedup drifts away from the ideal


SLIDE 12

N-Body Performance Evaluation on XeonPhi

  • Independent executions: v0s/v0d (on device)

[Figure: speedup vs. number of threads on the device, versions v0s and v0d]

  • Native execution model
  • Vertical lines represent the # of physical cores (57) and the # of hardware threads (228)
  • Speedup departs from the ideal well before the first vertical line
  • inadequate load / inability to schedule on this # of cores
  • irregular acceleration
  • maximum speedup obtained with 224 threads


SLIDE 13

N-Body Performance Evaluation on XeonPhi

  • Offloading overhead to the XeonPhi - v0s vs. v1s
  • v1s offloads the computation to the XeonPhi; the host remains idle during the execution on the accelerator (similar to CUDA)
  • We evaluate here the difference between offloading the computation (v1s) and the previous version (v0s)

[Figure: execution times of v0s, v1s, and the difference v0s−v1s]

  • Only executions with 228 threads (max)
  • v1s is 11.19% faster
  • scheduling decisions taken on the host?
  • Note: the best v0s result was with 224 threads


SLIDE 14

N-Body Performance Evaluation on XeonPhi

  • Full-duplex Memory Transfer Gains - v2s vs. v2.1s
  • v2s and v2.1s make use of all available cores: host and accelerator work together through OpenMP scheduling
  • in v2.1s, memory transfers are bi-directional (host→device and device→host)

  Version       Average        Std. dev.    Std. error (3·sd/√n)
  v2s           6.1692480108   0.09636988   0.003351785
  v2.1s         6.1686474234   0.19873992   0.006912257
  v2s − v2.1s   0.0006005874   0.22089767   0.007682913

  • Multidevice Overhead Analysis - v2.1s vs. v3s
  • v3s is exactly like v2.1s, adding support for multiple XeonPhi boards
  • our platform has only one XeonPhi board, so we evaluate the overhead caused by such support

  Version       Average        Std. dev.    Std. error (3·sd/√n)
  v3s           6.166388258    0.2231157    0.007760058
  v2.1s         6.168647423    0.1987399    0.006912257
  v3s − v2.1s   −0.002259165   0.2988195    0.010393067

SLIDE 15

N-Body Performance Evaluation on XeonPhi, GTX760 and K20

  • Comparing versions v0s, v1s, v2s and v3s, and the XeonPhi, GTX760 and K20

[Figure: execution times of v0s, v1s, v2s, v2.1s and v3s on the XeonPhi, and of the gpugems3 code on the GTX760 and K20]

  • v0s, v1s, v2s and v3s:
  • v1s is the slowest
  • v2s, v2.1s and v3s: no significant difference
  • XeonPhi, GTX760, K20:
  • same O(N²) algorithm, different code, same input
  • GPUs are faster than any version on the XeonPhi


SLIDE 16

Discussion about the Experimental Replication

  • Our results vs. the results described in the book:
  • we use only the graphics from the book, since the raw measurements are unavailable
  • hardware:
  • the accelerator differs: a XeonPhi with 57 cores (3120A) vs. a XeonPhi with 60 (5110P)
  • the host differs: two 10-core Intel Xeon E5-2660 vs. two 6-core Xeon E5-2630
  • despite this, the hardware is similar enough to allow a fair comparison
  • Results (ours vs. the book's):
  • v0 host-only: execution-time graphics with similar shapes, but our slowest experiment was 100 seconds faster; the fastest points were similar
  • v0 device-only: our curve is slightly smoother, but both curves behave similarly, converging quickly to smaller values; our slowest execution was faster than the book's
  • v0 speedup: on the host, our and their curves are close to the ideal speedup while using physical cores; on the device, we observe some ups and downs, but even so the highest speedup was the same (153 times)


SLIDE 17

Discussion about the Experimental Replication

  • Our results vs. the results described in the book:
  • single vs. double floating-point precision:
  • our tests: single was 7.3x faster on the host and 3.7x on the device
  • their tests: single was 4.4x faster on the host and 3.1x on the XeonPhi
  • our host was 2x slower than theirs with double precision, which explains our 7.3x speedup
  • v1s: similar performance to v0s running on the device only; both our and their results show that v0s is faster than v1s in its best cases
  • v2s: simultaneous computation on host and device yields a gain of 18% over the offload version and of 366% over the fastest host-only version
  • v2.1s: no major gains (as reported in the book)
  • v3s: the book reports a speedup of 4x, while we achieved 3.72x; however, our platform has only one XeonPhi


SLIDE 18

Conclusions

  • We presented a performance evaluation of an N-Body application on a XeonPhi.
  • We extended this evaluation by comparing the XeonPhi performance with NVIDIA GPUs.
  • Our results were similar to those presented in the book. In some cases we observed different values, but these can be explained by hardware differences.
  • Despite this, we were able to reproduce the original results.
  • XeonPhi vs. GPUs:
  • the XeonPhi was slower than the GPUs
  • but the CUDA sample codes are heavily optimized and written in a low-level language, while the XeonPhi runs OpenMP code
  • Future work:
  • compare the XeonPhi with a traditional SMP machine
  • evaluate the impact of OpenMP thread-placement decisions (e.g. compact or scatter affinity)
  • trace the OpenMP runtime to investigate how some scheduling decisions are taken


SLIDE 19

Thank you

Questions?

This work has been partially supported by CAPES and CNPq.


SLIDE 20

References I

Borovska, P. and D. Ivanova (2014). "Code Optimization and Scaling of the Astrophysics Software Gadget on Intel Xeon Phi." In: Partnership for Advanced Computing in Europe (PRACE) 136.

Cramer, Tim et al. (2012). "OpenMP Programming on Intel Xeon Phi Coprocessors: An Early Performance Comparison." In: Proc. Many-core Applications Research Community (MARC) Symp. at RWTH Aachen University. Aachen, Germany.

Nature Education (2015). English Communication for Scientists. URL: http://www.nature.com/scitable/ebooks/english-communication-for-scientists-14053993/writing-scientific-papers-14239285.

Reinders, J. and J. Jeffers (2014). High Performance Parallelism Pearls Vol. 1: Multicore and Many-core Programming Approaches. Elsevier. ISBN: 9780128021996.


SLIDE 21

References II

Schmidl, Dirk et al. (2013). "Assessing the Performance of OpenMP Programs on the Intel Xeon Phi." In: Euro-Par 2013 Parallel Processing. Ed. by Felix Wolf, Bernd Mohr, and Dieter an Mey. Vol. 8097. Lecture Notes in Computer Science. Springer Berlin Heidelberg, pp. 547-558. ISBN: 978-3-642-40046-9.

Stodden, V., F. Leisch, and R.D. Peng (2014). Implementing Reproducible Research. Chapman & Hall/CRC The R Series. Taylor & Francis. ISBN: 9781466561595.

Tian, Xinmin et al. (2015). "Effective SIMD Vectorization for Intel Xeon Phi Coprocessors." In: Scientific Programming 501, p. 269764.

Touati, Sid-Ahmed-Ali, Julien Worms, and Sébastien Briais (2013). "The Speedup-Test: a statistical methodology for programme speedup analysis and computation." In: Concurrency and Computation: Practice and Experience 25.10, pp. 1410-1426. ISSN: 1532-0634. DOI: 10.1002/cpe.2939.
