

SLIDE 1

Employing MPI Collectives for Timing Analysis on Embedded Multi-Cores

Martin Frieb, Alexander Stegmeier, Jörg Mische, Theo Ungerer

Department of Computer Science, University of Augsburg

16th International Workshop on Worst-Case Execution Time Analysis, July 5, 2016


SLIDE 2

Motivation

Björn Lisper, WCET 2012: "Towards Parallel Programming Models for Predictability"

– Shared memory does not scale ⇒ replace it with distributed memory
– Replace the bus with a Network-on-Chip (NoC)
– Learn from parallel programming models, e.g. Bulk Synchronous Programming (BSP): execute the program in supersteps (see the sketch after this list):
  1. Local computation
  2. Global communication
  3. Barrier
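A minimal C + MPI sketch of one such superstep; the choice of MPI_Allgather for the communication step and all buffer sizes are illustrative assumptions, not taken from the slides.

```c
/* Sketch of one BSP superstep: compute, communicate, synchronize. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int n = 4;                                    /* illustrative block size */
    double *local  = malloc(n * sizeof(double));
    double *global = malloc(n * size * sizeof(double));

    /* 1. Local computation (placeholder work) */
    for (int i = 0; i < n; i++)
        local[i] = rank + i;

    /* 2. Global communication: every core exchanges its local block */
    MPI_Allgather(local, n, MPI_DOUBLE, global, n, MPI_DOUBLE, MPI_COMM_WORLD);

    /* 3. Barrier: no core enters the next superstep early */
    MPI_Barrier(MPI_COMM_WORLD);

    free(local); free(global);
    MPI_Finalize();
    return 0;
}
```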


SLIDE 3

MPI programs

MPI programs come with a similar programming model:

– At a collective operation, all cores (or a group of them) work together
– Local computation, followed by communication ⇒ implicit barrier
– One core handles coordination and distribution (master), the others do the computation (slaves)
– Examples (see the sketch below): Barrier, Broadcast, Global sum
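A minimal sketch of the three collectives named above, using the standard MPI calls; the concrete values (42, the rank) are illustrative.

```c
/* Barrier, broadcast, and global sum as standard MPI collectives. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Barrier: synchronization point for all cores */
    MPI_Barrier(MPI_COMM_WORLD);

    /* Broadcast: the master (rank 0) distributes a value to all slaves */
    int config = (rank == 0) ? 42 : 0;
    MPI_Bcast(&config, 1, MPI_INT, 0, MPI_COMM_WORLD);

    /* Global sum: every core contributes, every core receives the result */
    int local = rank, sum = 0;
    MPI_Allreduce(&local, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("broadcast value %d, global sum %d\n", config, sum);

    MPI_Finalize();
    return 0;
}
```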


SLIDE 4

Outline

– Background
– Timing Analysis of MPI Collective Operations
– Case Study: Timing Analysis of the CG Benchmark
– Summary and Outlook

SLIDE 6

Underlying Architecture

[Diagram: each node couples a core, local memory, and a network interface; an I/O connection attaches to the NoC]

1. Small and simple cores
2. Distributed memory
3. Statically scheduled network
4. Task analysis + network analysis = WCET

[Metzlaff et al.: A Real-Time Capable Many-Core Model, RTSS-WiP 2012]


SLIDE 7

Structure of an MPI program

[Diagram: timeline with phases A–D across the cores]

Same sequential code on all cores:
(A) Barrier after initialization
(B) Data exchange
(C) Data exchange
(D) Global operation
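A schematic C skeleton of these phases; the ring-neighbor exchange via MPI_Sendrecv is an illustrative assumption, not the actual communication pattern of the benchmark.

```c
/* Skeleton of an MPI program with phases (A)-(D) from the slide. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* ... same sequential initialization on all cores ... */

    MPI_Barrier(MPI_COMM_WORLD);            /* (A) barrier after init */

    int next = (rank + 1) % size, prev = (rank + size - 1) % size;
    double out = rank, in = 0.0;

    /* (B) and (C): data exchanges, passing data to the next core */
    MPI_Sendrecv(&out, 1, MPI_DOUBLE, next, 0,
                 &in,  1, MPI_DOUBLE, prev, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&in,  1, MPI_DOUBLE, next, 1,
                 &out, 1, MPI_DOUBLE, prev, 1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* (D) global operation */
    double sum = 0.0;
    MPI_Allreduce(&out, &sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}
```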


SLIDE 8

Outline

– Background
– Timing Analysis of MPI Collective Operations
– Case Study: Timing Analysis of the CG Benchmark
– Summary and Outlook


SLIDE 9

Structure of MPI Allreduce

[Diagram: phases A–G across Master, Slave 1, and Slave 2]

– Global reduction operation
– Broadcasts the result afterwards

(A) Initialization
(B) Acknowledgement
(C) Data structure initialization
(D) Send values
(E) Collect and store values
(F) Apply global operation
(G) Broadcast result

WCET = Σ (A to G)
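A schematic C sketch of how phases (D) to (G) map to code, using standard MPI point-to-point calls; this only illustrates the master/slave structure and is not the authors' implementation (phases (A) to (C) of their protocol are omitted).

```c
/* Master-based allreduce mirroring phases (D)-(G); illustration only. */
#include <mpi.h>

double master_allreduce_sum(double value, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    double result = value;
    if (rank == 0) {
        /* (E) master collects and stores the slaves' values */
        for (int src = 1; src < size; src++) {
            double v;
            MPI_Recv(&v, 1, MPI_DOUBLE, src, 0, comm, MPI_STATUS_IGNORE);
            result += v;                 /* (F) apply the global operation */
        }
    } else {
        /* (D) slaves send their values to the master */
        MPI_Send(&value, 1, MPI_DOUBLE, 0, 0, comm);
    }
    /* (G) broadcast the result back to all cores */
    MPI_Bcast(&result, 1, MPI_DOUBLE, 0, comm);
    return result;
}
```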

SLIDE 18

Analysis of MPI Allreduce

– WCET of the sequential parts estimated with OTAWA
– Worst-case traversal time (WCTT) of the communication parts has to be added
– Result: an equation with parameters (the full equation is given in the extras):
  – #values to be transmitted
  – #communication partners
  – dimensions of the NoC
  – transportation times
  – time between core and NoC
– The equation can be reused for any application on the same architecture


SLIDE 19

Analysis of MPI Sendrecv

– Purpose: simultaneous sending and receiving to avoid deadlocks
– Often used for data exchange: pass data to the next core
– Structure:
  – Initialization
  – Acknowledgement
  – Sending and receiving of values
– Result: an equation with parameters (the full equation is given in the extras):
  – #values to be transmitted
  – transportation times
  – time between core and NoC
– The equation can be reused for any application on the same architecture

SLIDE 26

Outline

– Background
– Timing Analysis of MPI Collective Operations
– Case Study: Timing Analysis of the CG Benchmark
– Summary and Outlook


SLIDE 27

The CG Benchmark

– Conjugate Gradient method from mathematics: an optimization method to find the minimum/maximum of a multidimensional function
– Operations on a large matrix, distributed over several cores
– Cores exchange data a number of times
– Taken from the NAS Parallel Benchmark suite for highly parallel systems
– Adapted for C + MPI


SLIDE 28

Setting for the analysis

– Simple ARM cores with a 5-stage pipeline
– Distributed memory, no caches, 10-cycle memory access latency
– 4x4 PaterNoster NoC, arranged as a unidirectional torus
– Sending and receiving take 3 assembler instructions + 4 cycles from the pipeline to the NoC
– Due to the use of time division multiplexing (TDM), a WCTT can be estimated
  → Stegmeier et al.: WCTT Bounds for MPI Primitives in the PaterNoster NoC, 14th International Workshop on Real-Time Networks (RTN), July 5, 2016, Toulouse


SLIDE 29

Analysis of the CG Benchmark

– Initialization not analysed
– Structure of one benchmark iteration: alternating sequential and communication parts
– Some parts are repeated 15 times in a for loop
– Summing up the parts gives the equation
– Shortcoming: pipeline states are ignored at the summation

WCET_cg = 1 896 959 + WCET_AR(2, 15) + 17 · WCET_AR(1, 3) + 16 · (WCET_AR(351, 3) + WCET_SR(351))


SLIDE 30

Results

Specific numbers:
– WCET_AR(2, 15) = 8 158 cycles
– WCET_AR(1, 3) = 1 071 cycles
– WCET_AR(351, 3) = 113 073 cycles
– WCET_SR(351) = 11 396 cycles

WCET_cg = 1 896 959 + 8 158 + 17 · 1 071 + 16 · (113 073 + 11 396) = 3 914 828 cycles
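As a check, plugging these numbers into the equation from the previous slide reproduces the stated total:

```c
/* Recomputing WCET_cg from the component WCETs given above. */
#include <stdio.h>

int main(void)
{
    long wcet_ar_2_15  = 8158;    /* WCET_AR(2, 15)  */
    long wcet_ar_1_3   = 1071;    /* WCET_AR(1, 3)   */
    long wcet_ar_351_3 = 113073;  /* WCET_AR(351, 3) */
    long wcet_sr_351   = 11396;   /* WCET_SR(351)    */

    long wcet_cg = 1896959 + wcet_ar_2_15 + 17 * wcet_ar_1_3
                 + 16 * (wcet_ar_351_3 + wcet_sr_351);
    printf("WCET_cg = %ld cycles\n", wcet_cg);   /* prints 3914828 */
    return 0;
}
```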


SLIDE 31

Outline

– Background
– Timing Analysis of MPI Collective Operations
– Case Study: Timing Analysis of the CG Benchmark
– Summary and Outlook


SLIDE 32

Summary and Outlook

Summary:
– Learn from parallel programming models
– MPI collectives: clear separation of computation and communication phases ⇒ combine separate WCET estimates for sequential code and MPI collectives
– Analyzed in the case study: MPI Allreduce, MPI Sendrecv, the CG benchmark
– Results can be reused for any application on the same architecture

Outlook:
– Improve the MPI implementation, e.g. workload distribution
– Optimize hardware support

SLIDE 34

Questions?

Thank you for your attention!


SLIDE 35

Extras

Additional slides


SLIDE 36

Equation of MPI Allreduce

– f: #values to be transmitted
– χ: #communication partners
– n: dimension of the NoC
– t_transm,X: transportation times
– t_Buf: time from the core to the NoC and back

WCET_AR(f, χ) = 273 + 35fχ + 141χ
              + max(23 + 6n² + 11χ, 24 + 2 · (t_transm,χ + t_Buf))
              + (f − 1) · max(35χ, t_transm,χ)
              + (66 + t_transm,χ) · f + t_Buf
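The equation translates directly into code; a sketch in C, where the architecture-dependent times t_transm,χ and t_Buf (and the NoC dimension n) are assumed to be supplied as arguments:

```c
/* WCET_AR equation from the slide above, encoded as a C function.
 * t_transm_chi and t_buf are architecture parameters; how they are
 * obtained is an assumption, not part of the slide. */
static long max2(long a, long b) { return a > b ? a : b; }

long wcet_ar(long f, long chi, long n, long t_transm_chi, long t_buf)
{
    return 273 + 35 * f * chi + 141 * chi
         + max2(23 + 6 * n * n + 11 * chi,
                24 + 2 * (t_transm_chi + t_buf))
         + (f - 1) * max2(35 * chi, t_transm_chi)
         + (66 + t_transm_chi) * f
         + t_buf;
}
```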


SLIDE 37

Equation of MPI Sendrecv

– f: #values to be transmitted
– t_transm,X: transportation times
– t_Buf: time from the core to the NoC and back

WCET_SR(f) = 108 + 2 · (t_transm,1 + t_Buf) + max(f · 32, t_transm,f) + t_Buf
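The same encoding works here; a C sketch with the transport times passed in as parameters (an assumption about how they would be supplied):

```c
/* WCET_SR equation from the slide above, encoded as a C function. */
static long max2(long a, long b) { return a > b ? a : b; }

long wcet_sr(long f, long t_transm_1, long t_transm_f, long t_buf)
{
    return 108 + 2 * (t_transm_1 + t_buf)
         + max2(f * 32, t_transm_f)
         + t_buf;
}
```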
