

SLIDE 1

Measuring the performance improvements as you parallelize and optimize your software

SLIDE 2

Multiplication of Transposed Sparse Matrix by Vector

  • -O2 -march=native -mtune=native -ftree-vectorize: 0.25 s
  • -O2 -march=native -mtune=native -ftree-vectorize: 0.27 s
  • Managing race conditions with OpenMP pragma atomic:
    ○ -O2 -march=native -mtune=native -ftree-vectorize -fopenmp: 0.359 s (0.7x)
    ○ -O2 -march=native -mtune=native -ftree-vectorize -fopenmp: 0.3470 s (0.78x)
  • OpenMP privatization (f arrays):
    ○ -O2 -march=native -mtune=native -ftree-vectorize -fopenmp: 0.1172 s (2.13x)
    ○ -O2 -march=native -mtune=native -ftree-vectorize -fopenmp: 0.1177 s (2.29x)

SLIDE 3
  • Evaluate the performance of your serial, parallel and optimized code


  • Tuning and optimization:



SLIDE 4
  • Avoiding ‘too much parallel’

[Figure: Speedup vs. number of threads/processes/cores, comparing ideal and real scaling]

SLIDE 5
  • Serial and parallel performance
  • Take regular parallel performance measurements as you progress


  • Understand your performance limits


Use Speedup and Efficiency measures


SLIDE 6
  • Measure the relative performance between serial and parallel code.
  • Improvement in speed of execution of a task executed on the same architecture but with different resources.
  • Speedup, S, for problem size N on P processes/threads/cores:

  • Tips:



S(N, P) = T(N, 1) / T(N, P)

SLIDE 7
  • Measure the efficiency of the parallel code.
  • 100% efficiency = using double the resources but taking half the runtime (i.e. the same total resources are used).
  • Parallel efficiency, E, for problem size N on P processes/threads/cores:


E(N, P) = S(N, P) / P = T(N, 1) / (P * T(N, P))

SLIDE 8
  • We can never parallelize every single part of code (e.g. initialising and distributing the data).
  • A fraction of the runtime, α, is completely serial, limiting the parallel runtime even with 100% efficiency of the parallel fraction on P processors/threads/cores.

  • Limited by the serial fraction:


For runtime T, using problem size N for P processes

Known as ‘Amdahl’s Law’


T = α T(N, 1) + (1 - α) T(N, 1) / P

SLIDE 9

Gene Amdahl, 1967

[Figure: Amdahl's Law speedup vs. number of processors (1, 2, 4, 8) for serial fractions α = 50%, 25%, 10% and 5%; for α = 50% the speedup grows as 1x, 1.33x, 1.6x, 1.8x. Bar charts compare the parallel and serial portions for α = 10% ((1-α) = 90%) and α = 50% ((1-α) = 50%). Source: Wikipedia]

SLIDE 10
  • Use the spreadsheet assigned to your team to record timings


  • Use this regularly, particularly once you start trying multiple parallelization methods and tuning your implementations.

  • Try to gain an understanding of your serial fraction, α
