Parallel Computing Portable Software & Cost-Effective Hardware
2001.05.28-2001.06.01
http://www.ifi.uio.no/~xingca/MPI-COURSE/
Day 1-3: Parallel programming using MPI. Lecturer: Xing Cai
Day 4: Beowulf Linux cluster. Lecturer:
Work format: Lectures 9:15-12:00; on-site exercises/demos 13:00-16:00 in data lab VB 207
Parallel Computing Using MPI
Xing Cai Department of Informatics University of Oslo
Contents Introduction to parallel computing Getting started with MPI Exercises for Day 1 Basic MPI programming Exercises for Day 2 Advanced MPI programming Exercises for Day 3 2
Introduction to Parallel Computing Parallel computing In the field of scientific computing, there is an everlasting wish to do larger simulations in shorter computer time. The development of single-processor computers can hardly keep up with the pace of scientific computing:
- processor speed
- memory size/speed
Hence the need for parallel computing! 3
Introduction to Parallel Computing
[Figure: the Top500 list based on the Linpack Benchmark; http://www.top500.org/slides/2000/11/] 4
Introduction to Parallel Computing
[Figure: the Top500 list based on the Linpack Benchmark; http://www.top500.org/slides/2000/11/] 5
Introduction to Parallel Computing
[Figure: average number of processors in the Top500 list, Nov 93 - Nov 00; the Top500 list based on the Linpack Benchmark] 6
Introduction to Parallel Computing The basic ideas of parallel computing Pursuit of shorter computation time and larger simulation size gives rise to parallel computing. Multiple processors are involved to solve a global problem. The essence is to divide the entire computation evenly among collaborative processors. 7
Introduction to Parallel Computing A rough classification of hardware models Conventional single-processor computers can be called SISD (single-instruction-single-data) machines. SIMD (single-instruction-multiple-data) machines incorporate the idea of parallel processing, using a large number of processing units to execute the same instruction on different data. Modern parallel computers are so-called MIMD (multiple-instruction-multiple-data) machines, which can execute different instruction streams in parallel on different data. 8
Introduction to Parallel Computing Shared memory & distributed memory One way of categorizing modern parallel computers is to look at the memory configuration. In shared memory systems the CPUs share the same address space; any CPU can access any data in the global memory. In distributed memory systems each CPU has its own memory; the CPUs are connected by some network and may exchange messages. A new trend is ccNUMA (cache-coherent non-uniform memory access) systems, which are clusters of SMP (symmetric multiprocessing) machines and have a virtual shared memory. 9
Introduction to Parallel Computing Different parallel programming paradigms Task parallelism – the work of a global problem can be divided into a number of independent tasks, which rarely need to synchronize. Monte Carlo simulation is one example. However, this paradigm is of limited use. Data parallelism – use of multiple threads (e.g. one thread per processor) to dissect loops over arrays etc. This paradigm requires a single memory address space. Communication and synchronization between processors are often hidden, thus easy to program. However, the user surrenders much control to a specialized compiler. Examples of data parallelism are compiler-based parallelization and OpenMP directives; see the sketch below. 10
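As a small illustration (added here, not from the original slides), a data-parallel loop with an OpenMP directive might look like the following sketch; compile with an OpenMP-capable compiler (e.g. with -fopenmp):

/* hypothetical sketch: the directive asks the compiler/runtime
   to split the loop iterations among the available threads */
void daxpy (int n, double a, const double *x, double *y)
{
  int i;
#pragma omp parallel for
  for (i = 0; i < n; i++)
    y[i] = a*x[i] + y[i];
}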
Introduction to Parallel Computing Different parallel programming paradigms (cont’d) Message-passing – all involved processors have an independent memory address space. The user is responsible for partitioning the data/work of a global problem and distributing the sub-problems to the processors. Collaboration between processors is achieved by explicit message passing, which is used for data transfer plus synchronization. This paradigm is the most general one, where the user has full control. Better parallel efficiency is usually achieved by explicit message passing. However, message-passing programming is more difficult. 11
Introduction to Parallel Computing SPMD Although message-passing programming supports MIMD, it suffices with an SPMD (single-program-multiple-data) model, which is flexible enough for practical cases: Same executable for all the processors. Each processor works primarily with its assigned local data. Progression of code is allowed to differ between synchronization points. Possible to have a master/slave model. 12
Introduction to Parallel Computing Today’s situation of parallel computing Distributed memory is the dominant hardware configuration. There is a large diversity in these machines, from MPP (massively parallel processing) systems to clusters of off-the-shelf PCs, which are very cost-effective. Message-passing is a mature and widely accepted programming paradigm. It often provides an efficient match to the hardware. It is primarily for distributed memory systems, but can also be used on shared memory systems. We therefore consider here only message-passing for writing parallel programs. 13
Introduction to Parallel Computing Some useful concepts Speed-up: S(P) = T(1)/T(P). Efficiency: η(P) = S(P)/P. Latency and bandwidth – the cost model of sending a message of length L between two processors: t_C(L) = τ + βL. 14
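As an illustration (hypothetical numbers, not measurements from this course): with latency τ = 100 µs and β = 0.1 µs per byte, an 8192-byte message costs t_C = 100 + 0.1 · 8192 ≈ 919 µs. Short messages are dominated by the latency τ, long messages by the bandwidth term βL.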
Introduction to Parallel Computing Overhead present in parallel computing Uneven load balance → not all the processors can perform useful work at all times. Overhead of synchronization. Overhead of communication. Extra computation due to parallelization. Due to the above overhead, and because a certain part of a sequential algorithm cannot be parallelized, T(P) ≥ T(1)/P ⇒ S(P) ≤ P. However, super-linear speed-up (S(P) > P) may sometimes occur, due to e.g. cache effects. 15
Introduction to Parallel Computing Parallelizing a sequential algorithm Identify the part(s) of a sequential algorithm that can be executed in parallel. Distribute the global work and data among P processors. Insert communication and synchronization functions where necessary. 16
Introduction to Parallel Computing Example 1: Addition of two vectors. w = ax + by, where a and b are two real scalars and the vectors x, y and w are all of length M. Suppose P is a factor of M, i.e., I = M/P. Every processor p, 1 ≤ p ≤ P, is assigned w_p, x_p, and y_p, where e.g.
w_p = (w_{(p−1)I+1}, . . . , w_{pI}).
17
Introduction to Parallel Computing When each processor has the sub-vectors ready, it does its portion of the global computation: w_p = a x_p + b y_p. No inter-processor communication is necessary. Perfect speed-up: S(P) = T(1)/T(P) = 3M t_A/(3I t_A) = M/I = P. (t_A is the time for one floating-point operation.) 18
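A sketch of the local computation on processor p (variable names hypothetical; each process touches only its own sub-vectors, so no MPI calls are needed):

/* local part of w = a*x + b*y; w_p, x_p, y_p each hold I = M/P entries */
for (i = 0; i < I; i++)
  w_p[i] = a*x_p[i] + b*y_p[i];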
Introduction to Parallel Computing Example 2: Inner-product of two vectors.
c = x · y = Σ_{i=1}^{M} x_i y_i.
Again suppose P is a factor of M, I = M/P. Each processor is assigned sub-vectors x_p and y_p. The local computation is thus: c_p = x_p · y_p. 19
Introduction to Parallel Computing To obtain the global result, each processor has to send its local result c_p to all the other processors, at a communication cost of about (P−1) t_C(1). Finally we get c = Σ_p c_p. The speed-up is
S(P) = T(1)/T(P) = (2M−1) t_A / ((2I−1) t_A + (P−1) t_C(1) + (P−1) t_A)
     = (2M−1) / ((2I+P−2) + (P−1)γ)
     ≈ P / (1 + γ P²/(2M)),
which depends on γ = t_C(1)/t_A and the factor P²/M. 20
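In a real program the exchange of the local results is normally done with a single collective call (introduced later in the course) rather than P−1 explicit sends; a minimal sketch, with hypothetical variable names:

double c_p = 0.0, c;
for (i = 0; i < I; i++)        /* local inner product */
  c_p += x_p[i]*y_p[i];
/* sum the local results over all processes; every process receives c */
MPI_Allreduce (&c_p, &c, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);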
Introduction to Parallel Computing Need a standard for writing message-passing programs Distributed memory systems cover a wide spectrum of hardware architecture designs. Each vendor used to have its own special routines for writing message-passing programs. To promote software portability and allow efficient implementation across a range of architectures, there was an obvious need for a standard in the early 1990s. 21
Getting Started With MPI MPI MPI is a library specification for the message passing interface, proposed as a standard.
- independent of hardware;
- not a language or compiler specification;
- not a specific implementation or product.
A message passing standard for portability and ease-of-use. Primarily for SPMD/MIMD. Designed for high performance. 22
Getting Started With MPI Two phases MPI-1: Traditional message-passing. Standard finalized in 1995. Widely available, many implementations. MPI-2: Remote memory, parallel I/O, dynamic process management, threads. Standard finalized in 1997. However, implementations are not yet widespread. We will only learn MPI-1 in this course. 23
Getting Started With MPI What’s MPI-1?
Point-to-point message passing
Collective communication
Support for process groups & communication contexts
Support for process topologies
Environmental inquiry routines
Profiling interface 24
Getting Started With MPI Process and processor We refer to a process as a logical unit which executes its own code, in an MIMD style. A processor is a physical device on which one or several processes are executed. The MPI standard uses the concept of a process consistently throughout its documentation. However, we only consider situations where one processor is responsible for one process, and therefore use the two terms interchangeably. 25
Getting Started With MPI Bindings to MPI routines MPI comprises a message-passing library where every routine has a C binding, MPI_Command_name, and a Fortran binding with the routine name in uppercase, MPI_COMMAND_NAME. (We focus on the C bindings in the lecture notes.) 26
Getting Started With MPI Communicator A group of MPI processes with a name (context). Any process is identified by its rank. The rank is only meaningful within a particular communicator. The default communicator MPI_COMM_WORLD contains all the MPI processes. A mechanism to identify subsets of processes. Promotes modular design of parallel libraries. 27
Getting Started With MPI The 6 most important MPI routines
MPI_Init - initiate an MPI computation
MPI_Finalize - terminate the MPI computation and clean up
MPI_Comm_size - how many processes participate in a given MPI communicator?
MPI_Comm_rank - which one am I? (A number between 0 and size-1.)
MPI_Send - send a message to a particular process within an MPI communicator
MPI_Recv - receive a message from a particular process within an MPI communicator 28
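For reference, the C prototypes of these six routines (as defined by the MPI-1 standard) are:

int MPI_Init(int *argc, char ***argv);
int MPI_Finalize(void);
int MPI_Comm_size(MPI_Comm comm, int *size);
int MPI_Comm_rank(MPI_Comm comm, int *rank);
int MPI_Send(void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm);
int MPI_Recv(void *buf, int count, MPI_Datatype datatype,
             int source, int tag, MPI_Comm comm, MPI_Status *status);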
Getting Started With MPI The first MPI program Let every process write “Hello world” on the standard output.
#include <stdio.h>
#include <mpi.h>

int main (int nargs, char** args)
{
  int size, my_rank;
  MPI_Init (&nargs, &args);
  MPI_Comm_size (MPI_COMM_WORLD, &size);
  MPI_Comm_rank (MPI_COMM_WORLD, &my_rank);
  printf("Hello world, I've rank %d out of %d procs.\n", my_rank, size);
  MPI_Finalize ();
  return 0;
}
29
Getting Started With MPI The Fortran program
program hello
include "mpif.h"
integer size, my_rank, ierr
call MPI_INIT(ierr)
call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierr)
call MPI_COMM_RANK(MPI_COMM_WORLD, my_rank, ierr)
print *,"Hello world, I've rank ",my_rank," out of ",size
call MPI_FINALIZE(ierr)
end
30
Getting Started With MPI Compilation, e.g.:
> mpicc -c hello.c
> mpicc -o hello hello.o
Execution
> mpirun -np 4 ./hello
Example running result (non-deterministic)
Hello world, I've rank 2 out of 4 procs.
Hello world, I've rank 1 out of 4 procs.
Hello world, I've rank 3 out of 4 procs.
Hello world, I've rank 0 out of 4 procs.
31
Getting Started With MPI Parallel execution
[Figure: P identical copies of the hello-world program running side by side as Process 0, Process 1, . . . , Process P-1]
No inter-processor communication, no synchronization. 32
Getting Started With MPI Synchronization Many parallel algorithms require that no process proceeds before all the processes have reached the same state at certain points of a program. Explicit synchronization:
int MPI_Barrier (MPI_Comm comm)
Implicit synchronization through the use of e.g. pairs of MPI_Send and MPI_Recv. Ask yourself the following question: “If Process 1 progresses 100 times faster than Process 2, will the final result still be correct?” 33
Getting Started With MPI Example: ordered output
Explicit synchronization
#include <stdio.h>
#include <mpi.h>

int main (int nargs, char** args)
{
  int size, my_rank, i;
  MPI_Init (&nargs, &args);
  MPI_Comm_size (MPI_COMM_WORLD, &size);
  MPI_Comm_rank (MPI_COMM_WORLD, &my_rank);
  for (i=0; i<size; i++) {
    MPI_Barrier (MPI_COMM_WORLD);
    if (i==my_rank) {
      printf("Hello world, I've rank %d out of %d procs.\n", my_rank, size);
      fflush (stdout);
    }
  }
  MPI_Finalize ();
  return 0;
}
34
Getting Started With MPI
[Figure: P copies of the ordered-output program running as Process 0, Process 1, . . . , Process P-1, with barrier synchronization between them]
The processes synchronize between themselves P times. Parallel execution result:
Hello world, I've rank 0 out of 4 procs.
Hello world, I've rank 1 out of 4 procs.
Hello world, I've rank 2 out of 4 procs.
Hello world, I've rank 3 out of 4 procs.
35
Getting Started With MPI Point-to-point communication An MPI message is an array of elements of a particular MPI datatype:
- predefined standard types
- derived types
MPI message = “data inside an envelope”. Data: start address of the message buffer, count of elements in the buffer, data type. Envelope: source/destination process, message tag, communicator. 36
Getting Started With MPI To send a message
int MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm);
This blocking send function returns when the data has been delivered to the system and the buffer can be reused. The message may not have been received by the destination process. 37
Getting Started With MPI To receive a message
int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status);
This blocking receive function waits until a matching message is received from the system, so that the buffer contains the incoming message. Matching is on data type, source process (or MPI_ANY_SOURCE), and message tag (or MPI_ANY_TAG). Receiving fewer datatype elements than count is ok, but receiving more is an error. 38
Getting Started With MPI MPI_Status The source or tag of a received message may not be known if wildcard values were used in the receive function. In C, MPI_Status is a structure that contains further information. It can be queried as follows:
status.MPI_SOURCE
status.MPI_TAG
MPI_Get_count (MPI_Status *status, MPI_Datatype datatype, int *count);
39
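A small usage sketch (buf and BUFLEN are hypothetical): receive from any source with any tag, then inspect the envelope and the actual message length:

MPI_Status status;
int count;
MPI_Recv (buf, BUFLEN, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
          MPI_COMM_WORLD, &status);
printf("message from %d with tag %d\n", status.MPI_SOURCE, status.MPI_TAG);
MPI_Get_count (&status, MPI_INT, &count);  /* elements actually received */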
Getting Started With MPI Another solution to “ordered output”
Passing a semaphore from process to process, in a ring
#include <stdio.h>
#include <mpi.h>

int main (int nargs, char** args)
{
  int size, my_rank, flag;
  MPI_Status status;
  MPI_Init (&nargs, &args);
  MPI_Comm_size (MPI_COMM_WORLD, &size);
  MPI_Comm_rank (MPI_COMM_WORLD, &my_rank);
  if (my_rank>0)
    MPI_Recv (&flag, 1, MPI_INT, my_rank-1, 100, MPI_COMM_WORLD, &status);
  printf("Hello world, I've rank %d out of %d procs.\n", my_rank, size);
  if (my_rank<size-1)
    MPI_Send (&my_rank, 1, MPI_INT, my_rank+1, 100, MPI_COMM_WORLD);
  MPI_Finalize ();
  return 0;
}
40
Getting Started With MPI
[Figure: P copies of the semaphore-ring program running as Process 0, Process 1, . . . , Process P-1, with messages passed along the chain]
Process 0 sends a message to Process 1, which forwards it further to Process 2, and so on. 41
Getting Started With MPI MPI timer
double MPI_Wtime(void)
This function returns the number of wall-clock seconds elapsed since some time in the past. Example usage:
double starttime, endtime;
starttime = MPI_Wtime();
/* .... work to be timed ... */
endtime = MPI_Wtime();
printf("That took %f seconds\n", endtime-starttime);
42
Getting Started With MPI Timing point-to-point communication (ping-pong)
#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"
#define NUMBER_OF_TESTS 10

int main( int argc, char **argv )
{
  double *buf;
  int rank;
  int n;
  double t1, t2, tmin;
  int i, j, k, nloop;
  MPI_Status status;

  MPI_Init( &argc, &argv );
  MPI_Comm_rank( MPI_COMM_WORLD, &rank );
  if (rank == 0)
    printf( "Kind\t\tn\ttime (sec)\tRate (MB/sec)\n" );
  for (n=1; n<1100000; n*=2) {
    nloop = 1000;
    buf = (double *) malloc( n * sizeof(double) );
43
Getting Started With MPI
    if (!buf) {
      fprintf( stderr, "Could not allocate send/recv buffer of size %d\n", n );
      MPI_Abort( MPI_COMM_WORLD, 1 );
    }
    tmin = 1000;
    for (k=0; k<NUMBER_OF_TESTS; k++) {
      if (rank == 0) {
        /* Make sure both processes are ready */
        MPI_Sendrecv( MPI_BOTTOM, 0, MPI_INT, 1, 14,
                      MPI_BOTTOM, 0, MPI_INT, 1, 14, MPI_COMM_WORLD, &status );
        t1 = MPI_Wtime();
        for (j=0; j<nloop; j++) {
          MPI_Send( buf, n, MPI_DOUBLE, 1, k, MPI_COMM_WORLD );
          MPI_Recv( buf, n, MPI_DOUBLE, 1, k, MPI_COMM_WORLD, &status );
        }
        t2 = (MPI_Wtime() - t1) / nloop;
        if (t2 < tmin) tmin = t2;
      }
      else if (rank == 1) {
        /* Make sure both processes are ready */
        MPI_Sendrecv( MPI_BOTTOM, 0, MPI_INT, 0, 14,
                      MPI_BOTTOM, 0, MPI_INT, 0, 14, MPI_COMM_WORLD, &status );
        for (j=0; j<nloop; j++) {
          MPI_Recv( buf, n, MPI_DOUBLE, 0, k, MPI_COMM_WORLD, &status );
44
Getting Started With MPI
          MPI_Send( buf, n, MPI_DOUBLE, 0, k, MPI_COMM_WORLD );
        }
      }
    }
    /* Convert to half the round-trip time */
    tmin = tmin / 2.0;
    if (rank == 0) {
      double rate;
      if (tmin > 0)
        rate = n * sizeof(double) * 1.0e-6 / tmin;
      else
        rate = 0.0;
      printf( "Send/Recv\t%d\t%f\t%f\n", n, tmin, rate );
    }
    free( buf );
  }
  MPI_Finalize( );
  return 0;
}
45
Getting Started With MPI Measurements of pingpong.c on a Linux cluster
Kind        n        time (sec)   Rate (MB/sec)
Send/Recv   1        0.000154     0.051985
Send/Recv   2        0.000155     0.103559
Send/Recv   4        0.000158     0.202938
Send/Recv   8        0.000162     0.394915
Send/Recv   16       0.000173     0.739092
Send/Recv   32       0.000193     1.323439
Send/Recv   64       0.000244     2.097787
Send/Recv   128      0.000339     3.018741
Send/Recv   256      0.000473     4.329810
Send/Recv   512      0.000671     6.104322
Send/Recv   1024     0.001056     7.757576
Send/Recv   2048     0.001797     9.114882
Send/Recv   4096     0.003232     10.137046
Send/Recv   8192     0.006121     10.706747
Send/Recv   16384    0.012293     10.662762
Send/Recv   32768    0.024315     10.781164
Send/Recv   65536    0.048755     10.753412
Send/Recv   131072   0.097074     10.801766
Send/Recv   262144   0.194003     10.809867
Send/Recv   524288   0.386721     10.845800
Send/Recv   1048576  0.771487     10.873298
46
Getting Started With MPI Exercises for Day 1 Exercise One: Write a new “Hello World” program, where all the processes first generate a text message using sprintf and then send it to Process 0 (you may use strlen(message)+1 to find out the length of the message). Afterwards, Process 0 is responsible for writing out all the messages on the standard output. Exercise Two: Modify the “ping-pong” program to involve more than 2 processes in the measurement. That is, Process 0 sends a message to Process 1, which forwards the message further to Process 2, and so on, until Process P-1 returns the message back to Process 0. Run the test for some different choices of P. 47
Getting Started With MPI Exercise Three: Write a parallel program for calculating π, using the formula:
π = ∫_0^1 4/(1 + x²) dx.
(Hint: use numerical integration and divide the interval [0, 1] into n sub-intervals.)
How to use the PBS scheduling system?
1. mpicc program.c
2. qsub -o res.txt pbs.sh, where the script pbs.sh may be:
#! /bin/sh
#PBS -e stderr.txt
#PBS -l walltime=0:05:00,ncpus=4
cd $PBS_O_WORKDIR
mpirun -np $NCPUS ./a.out
exit $?
48
Basic MPI Programming Example of send & receive: sum of random numbers
Write an MPI program in which each process generates a random number and then Process 0 calculates the sum of these numbers.
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main (int nargs, char** args)
{
  int size, my_rank, i, a, sum;
  MPI_Init (&nargs, &args);
  MPI_Comm_size (MPI_COMM_WORLD, &size);
  MPI_Comm_rank (MPI_COMM_WORLD, &my_rank);
  srand (7654321*(my_rank+1));
  a = rand()%100;
  printf("<%02d> a=%d\n", my_rank, a);
  fflush (stdout);
  if (my_rank==0) {
    MPI_Status status;
    sum = a;
    for (i=1; i<size; i++) {
      MPI_Recv (&a, 1, MPI_INT, i, 500, MPI_COMM_WORLD, &status);
49
Basic MPI Programming
      sum += a;
    }
    printf("<%02d> sum=%d\n", my_rank, sum);
  }
  else
    MPI_Send (&a, 1, MPI_INT, 0, 500, MPI_COMM_WORLD);
  MPI_Finalize ();
  return 0;
}
50
Basic MPI Programming Rules for point-to-point communication Message order preservation – If Process A sends two messages to Process B, which posts two matching receive calls, then the two messages are guaranteed to be received in the order they were sent. Progress – It is not possible for a matching send and receive pair to remain permanently outstanding. That is, if one process posts a send and a second process posts a matching receive, then either the send or the receive will eventually complete. 51
Basic MPI Programming Probing in MPI It is possible in MPI to only read the envelope of a message before choosing whether or not to read the actual message.
int MPI_Probe(int source, int tag, MPI_Comm comm, MPI_Status *status)
The function blocks until a message with the given source and/or tag is available. The result of probing is returned in an MPI_Status data structure. 52
Basic MPI Programming Sum of random numbers: Alternative 1
Use of MPI_Probe
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main (int nargs, char** args)
{
  int size, my_rank, i, a, sum;
  MPI_Init (&nargs, &args);
  MPI_Comm_size (MPI_COMM_WORLD, &size);
  MPI_Comm_rank (MPI_COMM_WORLD, &my_rank);
  srand (7654321*(my_rank+1));
  a = rand()%100;
  printf("<%02d> a=%d\n", my_rank, a);
  fflush (stdout);
  if (my_rank==0) {
    MPI_Status status;
    sum = a;
    for (i=1; i<size; i++) {
      MPI_Probe (MPI_ANY_SOURCE, 500, MPI_COMM_WORLD, &status);
      MPI_Recv (&a, 1, MPI_INT, status.MPI_SOURCE, 500, MPI_COMM_WORLD, &status);
53
Basic MPI Programming
      sum += a;
    }
    printf("<%02d> sum=%d\n", my_rank, sum);
  }
  else
    MPI_Send (&a, 1, MPI_INT, 0, 500, MPI_COMM_WORLD);
  MPI_Finalize ();
  return 0;
}
54
Basic MPI Programming Collective communication Communication carried out by all processes in a communicator
[Figure: one-to-all broadcast with MPI_BCAST (the root's data A0 is replicated on all processes); one-to-all scatter with MPI_SCATTER (the root's data A0 A1 A2 A3 is split among the processes); all-to-one gather with MPI_GATHER (the reverse of scatter)]
55
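As a small sketch (not from the original slides; array is a hypothetical root-owned buffer of size*4 doubles), a scatter distributing 4 entries to each process:

double chunk[4];   /* the local piece received by every process */
/* only the root's send buffer is significant; other processes may pass NULL */
MPI_Scatter (array, 4, MPI_DOUBLE, chunk, 4, MPI_DOUBLE,
             0, MPI_COMM_WORLD);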
Basic MPI Programming Collective communication (Cont’d)
[Figure: examples of MPI_REDUCE with MPI_SUM (root = 1), MPI_ALLREDUCE with MPI_MIN, and MPI_REDUCE with MPI_MIN (root = 0), applied to initial data distributed across the processes]
56
Basic MPI Programming Sum of random numbers: Alternative 2
Use of MPI_Gather
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main (int nargs, char** args)
{
  int size, my_rank, i, a, sum, *array;
  MPI_Init (&nargs, &args);
  MPI_Comm_size (MPI_COMM_WORLD, &size);
  MPI_Comm_rank (MPI_COMM_WORLD, &my_rank);
  srand (7654321*(my_rank+1));
  a = rand()%100;
  printf("<%02d> a=%d\n", my_rank, a);
  fflush (stdout);
  if (my_rank==0)
    array = (int*) malloc(size*sizeof(int));
  MPI_Gather (&a, 1, MPI_INT, array, 1, MPI_INT, 0, MPI_COMM_WORLD);
  if (my_rank==0) {
57
Basic MPI Programming
    sum = a;
    for (i=1; i<size; i++)
      sum += array[i];
    printf("<%02d> sum=%d\n", my_rank, sum);
    free (array);
  }
  MPI_Finalize ();
  return 0;
}
58
Basic MPI Programming Sum of random numbers: Alternative 3
Use of MPI_Reduce (MPI_Allreduce)
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main (int nargs, char** args)
{
  int my_rank, a, sum;
  MPI_Init (&nargs, &args);
  MPI_Comm_rank (MPI_COMM_WORLD, &my_rank);
  srand (7654321*(my_rank+1));
  a = rand()%100;
  printf("<%02d> a=%d\n", my_rank, a);
  fflush (stdout);
  MPI_Reduce (&a, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
  if (my_rank==0)
    printf("<%02d> sum=%d\n", my_rank, sum);
  MPI_Finalize ();
  return 0;
}
59
Basic MPI Programming Example: inner-product Write an MPI program that calculates the inner-product of two vectors u = (u_1, u_2, . . . , u_M) and v = (v_1, v_2, . . . , v_M), where
u_k = x_k, v_k = x_k², x_k = k/M, 1 ≤ k ≤ M.
Partition u and v into P segments (load balancing) with sub-lengths m_i:
m_i = ⌊M/P⌋ + 1 for 0 ≤ i < mod(M, P),
m_i = ⌊M/P⌋ for mod(M, P) ≤ i < P.
First find the local result. Then compute the global result. 60
Basic MPI Programming
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main (int nargs, char** args)
{
  int size, my_rank, i, m = 1000, m_i, lower_bound, res;
  double l_sum, g_sum, time, x, dx, *u, *v;
  MPI_Init (&nargs, &args);
  MPI_Comm_size (MPI_COMM_WORLD, &size);
  MPI_Comm_rank (MPI_COMM_WORLD, &my_rank);
  if (my_rank==0 && nargs>1)
    m = atoi(args[1]);
  MPI_Bcast (&m, 1, MPI_INT, 0, MPI_COMM_WORLD);
  dx = 1.0/m;
  time = MPI_Wtime();
  /* data partition & load balancing */
  m_i = m/size;
  res = m%size;
  lower_bound = my_rank*m_i;
61
Basic MPI Programming
  lower_bound += (my_rank<=res) ? my_rank : res;
  if (my_rank+1<=res)
    ++m_i;
  /* allocation of data storage */
  u = (double*) malloc (m_i*sizeof(double));
  v = (double*) malloc (m_i*sizeof(double));
  /* fill out the u and v vectors */
  x = lower_bound*dx;
  for (i=0; i<m_i; i++) {
    x += dx;
    u[i] = x;
    v[i] = x*x;
  }
  /* calculate the local result */
  l_sum = 0.;
  for (i=0; i<m_i; i++)
    l_sum += u[i]*v[i];
  MPI_Allreduce (&l_sum, &g_sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
  time = MPI_Wtime()-time;
62
Basic MPI Programming
  /* output the global result */
  printf("<%d> m=%d, lower_bound=%d, m_i=%d, u*v=%g, time=%g\n",
         my_rank, m, lower_bound, m_i, g_sum*dx, time);
  free (u);
  free (v);
  MPI_Finalize ();
  return 0;
}
63
Basic MPI Programming Parallelizing explicit finite element schemes Consider the following 1D wave equation:
∂²u/∂t² = γ² ∂²u/∂x², x ∈ (0, 1), t > 0,
u(0, t) = U_L, u(1, t) = U_R,
u(x, 0) = f(x), ∂u/∂t (x, 0) = 0. 64
Basic MPI Programming Discretization Let u^k_i denote the numerical approximation to u(x, t) at the spatial grid point x_i and the temporal grid point t_k, where Δx = 1/(n+1), x_i = iΔx and t_k = kΔt. Defining C = γΔt/Δx,
u^0_i = f(x_i), i = 0, . . . , n+1,
u^{k+1}_i = 2u^k_i − u^{k−1}_i + C² (u^k_{i+1} − 2u^k_i + u^k_{i−1}), i = 1, . . . , n, k ≥ 0,
u^{k+1}_0 = U_L, k ≥ 0,
u^{k+1}_{n+1} = U_R, k ≥ 0,
u^{−1}_i = u^0_i + (1/2) C² (u^0_{i+1} − 2u^0_i + u^0_{i−1}), i = 1, . . . , n. 65
Basic MPI Programming Domain partition Each subdomain has a number of computational points, plus 2 “ghost points” that are used to contain values from neighboring subdomains. It is only on the computational points that u^{k+1}_i is updated, based on u^k_{i−1}, u^k_i, u^k_{i+1}, and u^{k−1}_i. After local computation at each time level, values on the leftmost and rightmost computational points are sent to the left and right neighbors, respectively. Values from neighbors are received into the left and right ghost points. 66
Basic MPI Programming
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main (int nargs, char** args)
{
  int size, my_rank, i, n = 999, n_i;
  double h, x, t, dt, gamma = 1.0, C = 0.9, tstop = 1.0;
  double *up, *u, *um, *tmp, umax = 0.05, UL = 0., UR = 0., time;
  MPI_Status status;
  MPI_Init (&nargs, &args);
  MPI_Comm_size (MPI_COMM_WORLD, &size);
  MPI_Comm_rank (MPI_COMM_WORLD, &my_rank);
  if (my_rank==0 && nargs>1)   /* total number of points in (0,1) */
    n = atoi(args[1]);
  if (my_rank==0 && nargs>2)   /* read in Courant number */
    C = atof(args[2]);
  if (my_rank==0 && nargs>3)   /* length of simulation */
    tstop = atof(args[3]);
  MPI_Bcast (&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
67
Basic MPI Programming
  MPI_Bcast (&C, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
  MPI_Bcast (&tstop, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
  h = 1.0/(n+1);   /* distance between two points */
  time = MPI_Wtime();
  /* number of subdomain computational points, assume mod(n,size)=0 */
  n_i = n/size;
  up = (double*) malloc ((n_i+2)*sizeof(double));
  u  = (double*) malloc ((n_i+2)*sizeof(double));
  um = (double*) malloc ((n_i+2)*sizeof(double));
  /* set initial conditions */
  x = my_rank*n_i*h;
  for (i=0; i<=n_i+1; i++) {
    u[i] = (x<0.7) ? umax/0.7*x : umax/0.3*(1.0-x);
    x += h;
  }
  if (my_rank==0) u[0] = UL;
  if (my_rank==size-1) u[n_i+1] = UR;
  /* artificial condition */
  for (i=1; i<=n_i; i++)
    um[i] = u[i]+0.5*C*C*(u[i+1]-2*u[i]+u[i-1]);
68
Basic MPI Programming
  dt = C/gamma*h;
  t = 0.;
  while (t < tstop) {   /* time stepping loop */
    t += dt;
    for (i=1; i<=n_i; i++)
      up[i] = 2*u[i]-um[i]+C*C*(u[i+1]-2*u[i]+u[i-1]);
    if (my_rank>0) {
      /* receive from left neighbor */
      MPI_Recv (&(up[0]), 1, MPI_DOUBLE, my_rank-1, 501, MPI_COMM_WORLD, &status);
      /* send to left neighbor */
      MPI_Send (&(up[1]), 1, MPI_DOUBLE, my_rank-1, 502, MPI_COMM_WORLD);
    }
    else
      up[0] = UL;
    if (my_rank<size-1) {
      /* send to right neighbor */
      MPI_Send (&(up[n_i]), 1, MPI_DOUBLE, my_rank+1, 501, MPI_COMM_WORLD);
      /* receive from right neighbor */
      MPI_Recv (&(up[n_i+1]), 1, MPI_DOUBLE, my_rank+1, 502, MPI_COMM_WORLD, &status);
    }
    else
      up[n_i+1] = UR;
69
Basic MPI Programming
    /* prepare for next time step */
    tmp = um;
    um = u;
    u = up;
    up = tmp;
  }
  time = MPI_Wtime()-time;
  printf("<%d> time=%g\n", my_rank, time);
  free (um);
  free (u);
  free (up);
  MPI_Finalize ();
  return 0;
}
70
Basic MPI Programming Two-point boundary value problem
−u″(x) = f(x), 0 < x < 1, u(0) = u(1) = 0.
Uniform 1D grid: x_0, x_1, . . . , x_{n+1}, Δx = 1/(n+1).
Finite difference discretization:
(−u_{i−1} + 2u_i − u_{i+1}) / Δx² = f_i, 1 ≤ i ≤ n.
In matrix form, with the tridiagonal matrix A = tridiag(−1, 2, −1):
A (u_1, u_2, . . . , u_n)^T = Δx² (f_1, f_2, . . . , f_n)^T,
written compactly as Au = f. 71
Basic MPI Programming A simple iterative method for Au = f Start with an initial guess u^0. Jacobi iteration:
u^k_i = (Δx² f_i + u^{k−1}_{i−1} + u^{k−1}_{i+1}) / 2.
Calculate the residual r = f − Au^k. Repeat Jacobi iterations until the norm of the residual is small enough. 72
Basic MPI Programming Partition of work load Divide the interior points x_1, x_2, . . . , x_n among the P processors: x_{i,1}, x_{i,2}, . . . , x_{i,n_i}. We need two “ghost” boundary nodes x_{i,0} and x_{i,n_i+1}. Those two ghost boundary nodes need to be updated after each Jacobi iteration by receiving data from the neighbors. 73
Basic MPI Programming The main program
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <mpi.h>

extern void jacobi_iteration (int my_rank, int size, int n_i,
                              double* rhs, double* x_k1, double* x_k);
extern void calc_residual (int my_rank, int size, int n_i,
                           double* rhs, double* x, double* res);
extern double norm (int n_i, double* x);

int main (int nargs, char** args)
{
  int size, my_rank, i, j, n = 99999, n_i, lower_bound;
  double time, base, r_norm, x, dx, *rhs, *res, *x_k, *x_k1;
  MPI_Init (&nargs, &args);
  MPI_Comm_size (MPI_COMM_WORLD, &size);
  MPI_Comm_rank (MPI_COMM_WORLD, &my_rank);
  if (my_rank==0 && nargs>1)
    n = atoi(args[1]);
  MPI_Bcast (&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
  dx = 1.0/(n+1);
74
Basic MPI Programming
  time = MPI_Wtime();
  /* data partition & load balancing */
  n_i = n/size;
  i = n%size;
  lower_bound = my_rank*n_i;
  lower_bound += (my_rank<=i) ? my_rank : i;
  if (my_rank+1<=i)
    ++n_i;
  /* allocation of data storage, plus 2 (ghost) boundary points */
  x_k  = (double*) malloc ((n_i+2)*sizeof(double));
  x_k1 = (double*) malloc ((n_i+2)*sizeof(double));
  rhs  = (double*) malloc ((n_i+2)*sizeof(double));
  res  = (double*) malloc ((n_i+2)*sizeof(double));
  /* fill out the rhs and x_k vectors */
  x = lower_bound*dx;
  for (i=1; i<=n_i; i++) {
    x += dx;
    rhs[i] = dx*dx*M_PI*M_PI*sin(M_PI*x);
    x_k[i] = x_k1[i] = 0.;
  }
  x_k[0] = x_k[n_i+1] = x_k1[0] = x_k1[n_i+1] = 0.;
75
Basic MPI Programming
  base = norm(n_i, rhs);
  r_norm = 1.0;
  i = 0;
  while (i<1000 && (r_norm/base)>1.0e-4) {
    ++i;
    jacobi_iteration(my_rank, size, n_i, rhs, x_k1, x_k);
    calc_residual(my_rank, size, n_i, rhs, x_k, res);
    r_norm = norm(n_i, res);
    for (j=0; j<n_i+2; j++)
      x_k1[j] = x_k[j];
  }
  time = MPI_Wtime()-time;
  printf("<%d> n+1=%d, iters=%d, time=%g\n", my_rank, n+1, i, time);
  free (x_k); free (x_k1);
  free (rhs); free (res);
  MPI_Finalize ();
  return 0;
}
76
Basic MPI Programming The routine for Jacobi iteration
#include <mpi.h>

void jacobi_iteration (int my_rank, int size, int n_i,
                       double* rhs, double* x_k1, double* x_k)
{
  int i;
  MPI_Status status;
  for (i=1; i<=n_i; i++)
    x_k[i] = 0.5*(rhs[i]+x_k1[i-1]+x_k1[i+1]);
  if (my_rank>0) {
    /* receive from left neighbor */
    MPI_Recv (&(x_k[0]), 1, MPI_DOUBLE, my_rank-1, 501, MPI_COMM_WORLD, &status);
    /* send to left neighbor */
    MPI_Send (&(x_k[1]), 1, MPI_DOUBLE, my_rank-1, 502, MPI_COMM_WORLD);
  }
  if (my_rank<size-1) {
    /* send to right neighbor */
    MPI_Send (&(x_k[n_i]), 1, MPI_DOUBLE, my_rank+1, 501, MPI_COMM_WORLD);
    /* receive from right neighbor */
    MPI_Recv (&(x_k[n_i+1]), 1, MPI_DOUBLE, my_rank+1, 502, MPI_COMM_WORLD, &status);
  }
}
77
Basic MPI Programming The routine for calculating the residual
#include <mpi.h>

void calc_residual (int my_rank, int size, int n_i,
                    double* rhs, double* x, double* res)
{
  /* no communication is necessary */
  int i;
  for (i=1; i<=n_i; i++)
    res[i] = rhs[i]-2.0*x[i]+x[i-1]+x[i+1];
}
Note that no communication is necessary, because the latest solution vector is already correctly duplicated between neighboring processes after jacobi_iteration. 78
Basic MPI Programming The routine for calculating the norm of a vector
#include <mpi.h>
#include <math.h>

double norm (int n_i, double* x)
{
  int i;
  double l_sum=0., g_sum;
  for (i=1; i<=n_i; i++)
    l_sum += x[i]*x[i];
  MPI_Allreduce (&l_sum, &g_sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
  return sqrt(g_sum);
}
79
Basic MPI Programming Source of deadlocks Send a large message from one process to another:
- If there is insufficient storage at the destination, the send must wait for the user to provide the memory space (through a receive).

Process 0      Process 1
Send(1)        Send(0)
Recv(1)        Recv(0)

Unsafe, because it depends on the availability of system buffers. 80
Basic MPI Programming Solutions to deadlocks
- Order the operations more carefully
- Use MPI_Sendrecv (see the sketch after this slide)
- Use MPI_Bsend
- Use non-blocking operations
Performance may be improved on many systems by overlapping communication and computation. This is especially true on systems where communication can be executed autonomously by an intelligent communication controller. Use of non-blocking and completion routines allows computation and communication to be overlapped. (Not guaranteed, though.) 81
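For instance, the unsafe exchange on the previous slide can be made safe by combining each send/receive pair into one MPI_Sendrecv call (a sketch; sendbuf, recvbuf, n and other are hypothetical):

MPI_Sendrecv (sendbuf, n, MPI_DOUBLE, other, 0,
              recvbuf, n, MPI_DOUBLE, other, 0,
              MPI_COMM_WORLD, &status);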
Basic MPI Programming Non-blocking send and receive Non-blocking send: returns “immediately”; message buffer should not be written to after return; must check for local completion.
int MPI_Isend(void* buf, int count, MPI_Datatype datatype,
              int dest, int tag, MPI_Comm comm, MPI_Request *request)
Non-blocking receive: returns “immediately”; the message buffer should not be read from after return; must check for local completion.
int MPI_Irecv(void* buf, int count, MPI_Datatype datatype,
              int source, int tag, MPI_Comm comm, MPI_Request *request)
The use of nonblocking receives may also avoid system buffering and memory-to-memory copying, as information is provided early on the location of the receive buffer.
82
Basic MPI Programming MPI_Request A request object identifies various properties of a communication operation. In addition, this object stores information about the status of the pending communication operation. 83
Basic MPI Programming Local completion Two basic ways of checking on non-blocking sends and receives:
- MPI_Wait blocks until the communication is complete:
int MPI_Wait(MPI_Request *request, MPI_Status *status)
- MPI_Test returns “immediately”, and sets flag to true if the communication is complete (see the polling sketch below):
int MPI_Test(MPI_Request *request, int *flag, MPI_Status *status)
84
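A typical MPI_Test pattern (a sketch; do_useful_work is a hypothetical function) polls for completion while doing other work:

int flag = 0;
while (!flag) {
  do_useful_work();                     /* overlap computation... */
  MPI_Test (&request, &flag, &status);  /* ...with the pending communication */
}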
Basic MPI Programming Using non-blocking send & receive
#include <mpi.h>
extern MPI_Request* req;
extern MPI_Status* status;

void jacobi_iteration_nb (int my_rank, int size, int n_i,
                          double* rhs, double* x_k1, double* x_k)
{
  int i;
  /* calculate values on the end-points */
  x_k[1]   = 0.5*(rhs[1]+x_k1[0]+x_k1[2]);
  x_k[n_i] = 0.5*(rhs[n_i]+x_k1[n_i-1]+x_k1[n_i+1]);
  if (my_rank>0) {
    /* receive from left neighbor */
    MPI_Irecv (&(x_k[0]), 1, MPI_DOUBLE,
               my_rank-1, 501, MPI_COMM_WORLD, &req[0]);
    /* send to left neighbor */
    MPI_Isend (&(x_k[1]), 1, MPI_DOUBLE,
               my_rank-1, 502, MPI_COMM_WORLD, &req[1]);
  }
  if (my_rank<size-1) {
85
Basic MPI Programming
    /* send to right neighbor */
    MPI_Isend (&(x_k[n_i]), 1, MPI_DOUBLE,
               my_rank+1, 501, MPI_COMM_WORLD, &req[2]);
    /* receive from right neighbor */
    MPI_Irecv (&(x_k[n_i+1]), 1, MPI_DOUBLE,
               my_rank+1, 502, MPI_COMM_WORLD, &req[3]);
  }
  /* calculate values on the inner points */
  for (i=2; i<n_i; i++)
    x_k[i] = 0.5*(rhs[i]+x_k1[i-1]+x_k1[i+1]);
  if (my_rank>0) {
    MPI_Wait (&req[0], &status[0]);
    MPI_Wait (&req[1], &status[1]);
  }
  if (my_rank<size-1) {
    MPI_Wait (&req[2], &status[2]);
    MPI_Wait (&req[3], &status[3]);
  }
}
86
Basic MPI Programming Exercises for Day 2 Exercise One: Introduce non-blocking send/receive MPI functions into the “1D wave” program. Do you get any performance improvement? Also improve the program so that load balance is maintained for any choice of n. Can you think of other improvements? Exercise Two: Data storage preparation for doing unstructured finite element computation in parallel.
[Figure: a finite element grid partitioned into subdomains]
87
Basic MPI Programming When a non-overlapping partition of a finite element grid is carried out, i.e., each element belongs to only one subdomain, grid points lying on the internal boundaries will be shared between neighboring subdomains. Assume grid points in every subdomain remember their original global number, contained in files: global-ids.00, global-ids.01, and so on. Write an MPI program that can a) find out for each subdomain who the neighboring subdomains are, b) find how many points are shared between each pair of neighboring subdomains, and c) calculate the Euclidean norm of a global vector that is distributed as sub-vectors contained in files: sub-vec.00, sub-vec.01 and so on. 88
Basic MPI Programming The data files are contained in mpi-kurs.tar, which can be found under the directory /site/vitsim/ on the machine pico.uio.no. The Euclidean norm should be 17.3594 if you have programmed everything correctly. 89
Advanced MPI Programming More about point-to-point communication When a standard mode blocking send call returns, the message data and envelope have been “safely stored away”. The message might be copied directly into the matching receive buffer, or it might be copied into a temporary system buffer. MPI decides whether outgoing messages will be buffered. If MPI buffers outgoing messages, the send call may complete before a matching receive is invoked. On the other hand, buffer space may be unavailable, or MPI may choose not to buffer outgoing messages, for performance reasons. Then the send call will not complete until a matching receive has been posted, and the data has been moved to the receiver. 90
Advanced MPI Programming Four communication modes for sending messages standard mode - a send may be initiated even if a matching receive has not been initiated. buffered mode - similar to standard mode, but completion is always independent of a matching receive, and the message may be buffered to ensure this. synchronous mode - a send may be initiated at any time, but will not complete until the matching receive has started. ready mode - a send may be initiated only if a matching receive has already been initiated. All 4 modes have blocking and non-blocking versions for send. 91
Advanced MPI Programming The buffered mode (MPI Bsend) A buffered mode send operation can be started whether or not a matching receive has been posted. It may complete before a matching receive is posted. If a send is executed and no matching receive is posted, then MPI must buffer the outgoing message. An error will occur if there is insufficient buffer space. The amount of available buffer space is controlled by the user. Buffer allocation by the user may be required for the buffered mode to be effective. 92
Advanced MPI Programming Buffer allocation for buffered send A user specifies a buffer to be used for buffering messages sent in buffered mode. To provide a buffer in the user's memory to be used for buffering outgoing messages:
int MPI_Buffer_attach( void* buffer, int size)
To detach the buffer currently associated with MPI:
int MPI_Buffer_detach( void* buffer_addr, int* size)
93
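A usage sketch (data, n, dest and tag are hypothetical); note that the buffer must include MPI_BSEND_OVERHEAD bytes per pending message:

int bsize = n*sizeof(double) + MPI_BSEND_OVERHEAD;
void *bbuf = malloc (bsize);
MPI_Buffer_attach (bbuf, bsize);
MPI_Bsend (data, n, MPI_DOUBLE, dest, tag, MPI_COMM_WORLD);
MPI_Buffer_detach (&bbuf, &bsize);  /* blocks until buffered messages are sent */
free (bbuf);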
Advanced MPI Programming The synchronous mode (MPI_Ssend) A synchronous send can be started whether or not a matching receive was posted. However, the send will complete successfully only if a matching receive is posted and the receive operation has started to receive the message. The completion of a synchronous send not only indicates that the send buffer can be reused, but also that the receiver has reached a certain point in its execution, namely that it has started executing the matching receive. 94
Advanced MPI Programming The ready mode (MPI_Rsend) A send using the ready mode may be started only if the matching receive is already posted. Otherwise, the operation is erroneous and its outcome is undefined. The completion of the send operation does not depend on the status of a matching receive; it merely indicates that the send buffer can be reused. A ready-mode send provides additional information to the system (namely that a matching receive is already posted), which can save some overhead. In a correct program, a ready send could be replaced by a standard send with no effect on the behavior of the program other than performance. 95
Advanced MPI Programming Persistent communication requests Often a communication with the same argument list is repeatedly executed. MPI can bind the list of communication arguments to a persistent communication request once and then repeatedly use the request to initiate and complete messages. This reduces the overhead of communication between the process and the communication controller. A persistent communication request can e.g. be created by (no communication yet):
int MPI_Send_init(void* buf, int count, MPI_Datatype datatype,
                  int dest, int tag, MPI_Comm comm,
                  MPI_Request *request)
96
Advanced MPI Programming Persistent communication requests (cont’d) A communication that uses a persistent request is initiated by
int MPI_Start(MPI_Request *request)
A communication started with a call to MPI_Start can be completed by a call to MPI_Wait or MPI_Test. The request becomes inactive after successful completion. The request is not deallocated and can be activated anew by an MPI_Start call. A persistent request is deallocated by a call to MPI_Request_free. 97
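A usage sketch (buf, n, dest, tag and nsteps are hypothetical): set up the arguments once, then start and complete the same request in every step:

MPI_Request req;
MPI_Status status;
MPI_Send_init (buf, n, MPI_DOUBLE, dest, tag, MPI_COMM_WORLD, &req);
for (k = 0; k < nsteps; k++) {
  /* ... fill buf with this step's data ... */
  MPI_Start (&req);
  MPI_Wait (&req, &status);
}
MPI_Request_free (&req);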
Advanced MPI Programming Send-receive operations MPI send-receive operations combine in one call the sending of a message to one destination and the receiving of another message from another process. A send-receive operation is very useful for executing a shift operation across a chain of processes.
int MPI_Sendrecv(void *sendbuf, int sendcount, MPI_Datatype sendtype,
                 int dest, int sendtag,
                 void *recvbuf, int recvcount, MPI_Datatype recvtype,
                 int source, int recvtag,
                 MPI_Comm comm, MPI_Status *status)
98
Advanced MPI Programming Null process The special value MPI_PROC_NULL can be used to specify a “dummy” source or destination for communication. This may simplify the code needed for dealing with boundaries, e.g., a non-circular shift done with calls to send-receive. A communication with process MPI_PROC_NULL has no effect.

int left_rank = my_rank-1, right_rank = my_rank+1;
if (my_rank==0) left_rank = MPI_PROC_NULL;
if (my_rank==size-1) right_rank = MPI_PROC_NULL;
99
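Combined with MPI_Sendrecv, these ranks give a non-circular shift with no special-casing at the chain ends (a sketch; out and in are hypothetical):

MPI_Sendrecv (&out, 1, MPI_DOUBLE, right_rank, 0,
              &in,  1, MPI_DOUBLE, left_rank,  0,
              MPI_COMM_WORLD, &status);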
Advanced MPI Programming Derived datatypes We have so far used communication involving only contiguous buffers containing elements of the same type. However, one often wants to pass messages that contain values of different datatypes (e.g., an integer count followed by a sequence of real numbers). Also, one often wants to send noncontiguous data (e.g., a sub-block of a matrix). 100
Advanced MPI Programming Derived datatypes (cont’d) More general communication buffers are specified by replacing the basic datatypes with derived datatypes that are constructed from basic datatypes. A general datatype is an opaque object that specifies two things:
- a sequence of basic datatypes
- a sequence of integer (byte) displacements
101
Advanced MPI Programming Datatype constructors
int MPI_Type_contiguous(int count, MPI_Datatype oldtype,
                        MPI_Datatype *newtype)
int MPI_Type_vector(int count, int blocklength, int stride,
                    MPI_Datatype oldtype, MPI_Datatype *newtype)
int MPI_Type_hvector(int count, int blocklength, MPI_Aint stride,
                     MPI_Datatype oldtype, MPI_Datatype *newtype)
int MPI_Type_indexed(int count, int *array_of_blocklengths,
                     int *array_of_displacements,
                     MPI_Datatype oldtype, MPI_Datatype *newtype)
int MPI_Type_hindexed(int count, int *array_of_blocklengths,
                      MPI_Aint *array_of_displacements,
                      MPI_Datatype oldtype, MPI_Datatype *newtype)
int MPI_Type_struct(int count, int *array_of_blocklengths,
                    MPI_Aint *array_of_displacements,
                    MPI_Datatype *array_of_types, MPI_Datatype *newtype)
102
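For example (a sketch, not from the slides), column j of an n×m row-major matrix of doubles can be described with MPI_Type_vector: n blocks of 1 element, with a stride of m elements between blocks:

MPI_Datatype column;
MPI_Type_vector (n, 1, m, MPI_DOUBLE, &column);
MPI_Type_commit (&column);   /* must be committed before use */
MPI_Send (&A[0][j], 1, column, dest, tag, MPI_COMM_WORLD);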
Advanced MPI Programming One example of creating new datatypes in MPI
#include <mpi.h>
#include <stdio.h>

typedef struct {
  float a;
  float b;
  int n;
} MY_TYPE;

int main (int nargs, char** args)
{
  MY_TYPE indata;
  int my_rank;
  int block_lengths[3];
  MPI_Aint displacements[3], addresses[4];
  MPI_Datatype typelist[3], new_type;

  typelist[0] = typelist[1] = MPI_FLOAT;
  typelist[2] = MPI_INT;
  block_lengths[0] = block_lengths[1] = block_lengths[2] = 1;
  MPI_Init (&nargs, &args);
  MPI_Comm_rank (MPI_COMM_WORLD, &my_rank);
  indata.a = indata.b = 1.0*my_rank;
103
Advanced MPI Programming
  indata.n = my_rank;
  /* calculate the displacements of the members relative to MY_TYPE */
  MPI_Address (&indata, &addresses[0]);
  MPI_Address (&(indata.a), &addresses[1]);
  MPI_Address (&(indata.b), &addresses[2]);
  MPI_Address (&(indata.n), &addresses[3]);
  displacements[0] = addresses[1]-addresses[0];
  displacements[1] = addresses[2]-addresses[0];
  displacements[2] = addresses[3]-addresses[0];
  /* create the derived type */
  MPI_Type_struct (3, block_lengths, displacements, typelist, &new_type);
  /* commit it so that it can be used */
  MPI_Type_commit (&new_type);
  if (my_rank==0) {
    MPI_Status status;
    MPI_Recv (&indata, 1, new_type, 1, 500, MPI_COMM_WORLD, &status);
    printf ("From Proc 1, a=%f, b=%f, n=%d\n", indata.a, indata.b, indata.n);
  }
  else if (my_rank==1)
    MPI_Send (&indata, 1, new_type, 0, 500, MPI_COMM_WORLD);
  MPI_Finalize ();
  return 0;
}
104
Advanced MPI Programming Packing and unpacking An alternative approach to grouping data. The user explicitly packs data into a contiguous buffer before sending it, and unpacks it from a contiguous buffer after receiving it.
int MPI_Pack(void* inbuf, int incount, MPI_Datatype datatype,
             void *outbuf, int outsize, int *position, MPI_Comm comm)
int MPI_Unpack(void* inbuf, int insize, int *position,
               void *outbuf, int outcount, MPI_Datatype datatype,
               MPI_Comm comm)
105
Advanced MPI Programming An example of using MPI_Pack
int position, i, j, a[2], myrank;
char buff[1000];
MPI_Status status;
....
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
if (myrank == 0) {   /* SENDER CODE */
  position = 0;
  MPI_Pack(&i, 1, MPI_INT, buff, 1000, &position, MPI_COMM_WORLD);
  MPI_Pack(&j, 1, MPI_INT, buff, 1000, &position, MPI_COMM_WORLD);
  MPI_Send( buff, position, MPI_PACKED, 1, 0, MPI_COMM_WORLD);
}
else   /* RECEIVER CODE */
  MPI_Recv( a, 2, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
Note that MPI_PACKED matches any other type. 106
Advanced MPI Programming When to pack/unpack, when to use derived datatypes? If the data to be sent is already stored in consecutive entries of an array, use the basic MPI datatypes. If there is a large number of elements that are not in contiguous memory locations, build a derived datatype (less overhead than pack/unpack). If one is sending heterogeneous data only once or a few times, use MPI_Pack and MPI_Unpack. This may avoid system buffering. 107
Advanced MPI Programming Parallel libraries Developing parallel libraries requires:
- a safe communication space,
- group scope for collective operations, allowing libraries to avoid unnecessarily synchronizing uninvolved processes,
- abstract process naming, to allow libraries to describe their communication in terms suitable to their own data structures and algorithms,
- the ability to “adorn” a set of communicating processes with additional user-defined attributes, such as extra collective operations. 108
Advanced MPI Programming Communicators revisited The MPI communicators provide a unified mechanism for conveniently denoting communication context, the group of communicating processes, to house abstract process naming, and to store adornments. Communicators = the appropriate scope for all communication operations in MPI. Communicators are divided into two kinds: intra-communicators for operations within a single group of processes, and inter-communicators for point-to-point communication between two groups of processes. 109
Advanced MPI Programming Communicators revisited (cont’d) Groups define an ordered collection of processes, each with a rank used for sending and receiving: a scope for process names in point-to-point communication. In addition, groups define the scope of collective operations. Contexts (a property of communicators) provide the ability to have separate “universes” of message passing in MPI. The use of separate communication contexts by distinct libraries (or distinct library invocations) insulates communication internal to the library execution from external communication. 110
Advanced MPI Programming Communicator constructors & destructors
int MPI_Comm_dup(MPI_Comm comm, MPI_Comm *newcomm)
int MPI_Comm_create(MPI_Comm comm, MPI_Group group, MPI_Comm *newcomm)
int MPI_Comm_split(MPI_Comm comm, int color, int key, MPI_Comm *newcomm)
int MPI_Comm_free(MPI_Comm *comm)
111
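MPI_Comm_split is often the most convenient of these. A sketch (ncols is a hypothetical grid width): processes with the same color end up in the same new communicator, ordered by key:

int row = my_rank / ncols;   /* all processes in one 'row' share a color */
MPI_Comm row_comm;
MPI_Comm_split (MPI_COMM_WORLD, row, my_rank, &row_comm);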
Advanced MPI Programming An example of constructing a new communicator
We construct a new communicator excluding only Process 0 in MPI_COMM_WORLD
int main(int argc, char **argv)
{
  int me, count, count2;
  void *send_buf, *recv_buf, *send_buf2, *recv_buf2;
  MPI_Group MPI_GROUP_WORLD, grprem;
  MPI_Comm commslave;
  static int ranks[] = {0};

  MPI_Init(&argc, &argv);
  MPI_Comm_group(MPI_COMM_WORLD, &MPI_GROUP_WORLD);
  MPI_Comm_rank(MPI_COMM_WORLD, &me);                     /* local */
  MPI_Group_excl(MPI_GROUP_WORLD, 1, ranks, &grprem);     /* local */
  MPI_Comm_create(MPI_COMM_WORLD, grprem, &commslave);
  if (me != 0) {
    /* compute on slave */
    MPI_Reduce(send_buf, recv_buf, count, MPI_INT, MPI_SUM, 1, commslave);
  }
  /* zero falls through immediately to this reduce, others do later... */
112
Advanced MPI Programming
  MPI_Reduce(send_buf2, recv_buf2, count2,
             MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
  MPI_Comm_free(&commslave);
  MPI_Group_free(&MPI_GROUP_WORLD);
  MPI_Group_free(&grprem);
  MPI_Finalize();
  return 0;
}
113
Advanced MPI Programming Process topology MPI does not provide mechanisms to specify the initial allocation of processes to an MPI computation, or their binding to physical processors. A topology is an extra and optional attribute of a communicator: a convenient naming mechanism for the processes of a group (within a communicator) that, additionally, may assist the runtime system in mapping the processes onto hardware. 114
Advanced MPI Programming Process topology (cont’d) The communication pattern of a set of processes can be represented by a graph: the nodes stand for the processes, and the edges connect processes that communicate with each other. A “missing link” in the user-defined process graph means that this connection is neglected in the virtual topology. A large fraction of all parallel applications use process topologies like rings, two- or higher-dimensional grids, or tori. These structures are completely defined by the number of dimensions and the number of processes in each coordinate direction. Also, the mapping of grids and tori is generally an easier problem than that of general graphs. Thus, it is desirable to address these cases explicitly. 115
Advanced MPI Programming Creating topologies The functions MPI_Graph_create and MPI_Cart_create are used to create general (graph) virtual topologies and Cartesian topologies, respectively. These topology creation functions are collective. They take as input an existing communicator comm_old, which defines the set of processes on which the topology is to be mapped. A new communicator comm_topol is created that carries the topological structure. MPI_Cart_create can be used to describe Cartesian structures of arbitrary dimension. For each coordinate direction one specifies 116
Advanced MPI Programming whether the process structure is periodic or not. The local auxiliary function MPI_Dims_create can be used to compute a balanced distribution of processes over a given number of dimensions. For Cartesian topologies, an MPI_Sendrecv operation along a coordinate direction performs a shift of data. The function MPI_Cart_shift provides the source and destination ranks that can be passed to MPI_Sendrecv.
int MPI_Cart_shift(MPI_Comm comm, int direction, int disp, int *rank_source, int *rank_dest)
117
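A minimal sketch of these calls in combination (an illustration assumed here, not taken from the slides): create a periodic 2D Cartesian topology and shift each process’s rank one step along the first coordinate direction.

#include <mpi.h>

int main(int argc, char **argv)
{
  int nprocs, myrank, sent, recvd;
  int dims[2] = {0, 0}, periods[2] = {1, 1};
  int src, dest;
  MPI_Comm cartcomm;
  MPI_Status status;

  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

  MPI_Dims_create(nprocs, 2, dims);   /* balanced 2D process grid */
  MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &cartcomm);

  MPI_Comm_rank(cartcomm, &myrank);
  MPI_Cart_shift(cartcomm, 0, 1, &src, &dest);  /* neighbors along dim 0 */

  sent = myrank;
  /* send my rank "upwards" along dim 0, receive from the neighbor "below" */
  MPI_Sendrecv(&sent, 1, MPI_INT, dest, 0,
               &recvd, 1, MPI_INT, src, 0, cartcomm, &status);

  MPI_Comm_free(&cartcomm);
  MPI_Finalize();
  return 0;
}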
Advanced MPI Programming Partitioning Cartesian topologies We can also partition a communicator carrying a Cartesian topology (created by MPI_Cart_create) into subgroups that form lower-dimensional Cartesian subgrids, and build for each subgroup a communicator with the associated subgrid Cartesian topology.
int MPI_Cart_sub(MPI_Comm comm, int *remain_dims, MPI_Comm *newcomm)
118
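For instance (continuing the hypothetical 2D grid cartcomm from the previous sketch), keeping only the second dimension produces one communicator per grid row:

int remain_dims[2] = {0, 1};   /* drop direction 0, keep direction 1 */
MPI_Comm rowcomm;
MPI_Cart_sub(cartcomm, remain_dims, &rowcomm);
/* rowcomm connects exactly the processes in the same grid row, so a
   collective call on rowcomm stays within that row */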
Advanced MPI Programming “Optimal” placement of processes MPI can compute an “optimal” placement of processes on the physical machine.
int MPI_Cart_map(MPI_Comm comm, int ndims, int *dims, int *periods, int *newrank)
int MPI_Graph_map(MPI_Comm comm, int nnodes, int *index, int *edges, int *newrank)
119
Advanced MPI Programming Profiling Every MPI routine has two entry points: MPI_... and PMPI_... This enables a single level of low-overhead wrapper routines to intercept MPI calls without any source-code modifications: a profiling wrapper redefines an MPI routine and internally calls the PMPI version, as in the example below.
static int totalBytes = 0;
static double totalTime = 0;

int MPI_Send(void *buffer, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)
{
  int extent, result;
  double tstart = MPI_Wtime();

  /* pass on all the arguments */
  result = PMPI_Send(buffer, count, datatype, dest, tag, comm);

  MPI_Type_size(datatype, &extent);   /* compute size */
  totalBytes += count*extent;
  totalTime  += MPI_Wtime() - tstart; /* and time */
  return result;
}
120
Advanced MPI Programming The MPE toolbox MPE (the Multi-Processing Environment library distributed with MPICH) provides, among other things, routines for logging user-defined states and events during a run:
MPE_Init_log();
if (myid == 0) {
  MPE_Describe_state(1, 2, "Broadcast", "red:vlines3");
  MPE_Describe_state(3, 4, "Compute",   "blue:gray3");
  MPE_Describe_state(5, 6, "Reduce",    "green:light_gray");
}
MPE_Start_log();

MPE_Log_event(1, 0, "start broadcast");
MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
MPE_Log_event(2, 0, "end broadcast");

MPE_Log_event(3, 0, "start compute");
/* some computation */
MPE_Log_event(4, 0, "end compute");

MPE_Log_event(5, 0, "start reduce");
MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
MPE_Log_event(6, 0, "end reduce");

MPE_Finish_log("cpilog");
121
Advanced MPI Programming Graphical MPE output
> mpirun -np 4 ./a.out
> clog2alog cpilog
> upshot cpilog.alog
[Upshot timeline plot: the states Broadcast, Compute, Reduce and Sync for processes 1-3, displayed over the time range 0.015-0.192 s]
122
Advanced MPI Programming Message passing programming Keep in mind that: Multiple processes, each with its own memory, co-exist and must work collaboratively to solve a common global problem. The collaboration is done by synchronization and data exchange, in the form of passing messages (with tags) between neighbors. A parallel simulation has certain non-deterministic features:
- The order of incoming messages from neighbors may vary from execution to execution.
- The progress of one process may lag behind that of others.
123
Advanced MPI Programming Some tips on MPI programming Use the predefined MPI reduction operations; Avoid MPI_Barrier if possible; Make use of the local disks attached to the different processors when saving large amounts of data to files; When computing a global maximum/minimum, remember to call MPI_Reduce or MPI_Allreduce after finding the local maximum/minimum value (see the sketch below). 124
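To make the last tip concrete, here is a small sketch (the array my_array and its local length n are assumed, not from the slides): each process first finds its local maximum, then a single MPI_Allreduce with MPI_MAX delivers the global maximum to every process.

double local_max, global_max;
int i;

local_max = my_array[0];          /* assumed local data */
for (i = 1; i < n; i++)
  if (my_array[i] > local_max)
    local_max = my_array[i];

/* combine the local maxima; all processes receive the global maximum */
MPI_Allreduce(&local_max, &global_max, 1, MPI_DOUBLE, MPI_MAX,
              MPI_COMM_WORLD);

With MPI_Reduce instead, only the chosen root process would receive the result.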
Advanced MPI Programming Exercises for Day 3 Exercise One: What is the output of the following program when executed on e.g. 6 CPUs?
#include <mpi.h>
#include <stdio.h>

int main (int nargs, char** args)
{
  int size0, rank0, size1, rank1, size2, rank2, i;
  MPI_Comm comm1, comm2;

  MPI_Init (&nargs, &args);
  MPI_Comm_size (MPI_COMM_WORLD, &size0);
  MPI_Comm_rank (MPI_COMM_WORLD, &rank0);

  i = rank0%2;
  MPI_Comm_split (MPI_COMM_WORLD, i, rank0, &comm1);
  MPI_Comm_size (comm1, &size1);
  MPI_Comm_rank (comm1, &rank1);

  i = rank0/2;
  MPI_Comm_split (MPI_COMM_WORLD, i, rank0, &comm2);
125
Advanced MPI Programming
  MPI_Comm_size (comm2, &size2);
  MPI_Comm_rank (comm2, &rank2);
  printf("size0=%d, rank0=%d, size1=%d, rank1=%d, size2=%d, rank2=%d\n",
         size0, rank0, size1, rank1, size2, rank2);
  MPI_Finalize ();
  return 0;
}
126
Advanced MPI Programming Exercise Two: Parallelize the following existing simulator for the 2D wave equation, using a 2D Cartesian process topology.
\[
\frac{\partial^2 u}{\partial t^2} = \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2}
\quad \text{in } \Omega = [0,1]^2,
\]
\[
u(x,y,0) = u_0(x,y), \qquad
\frac{\partial u}{\partial t}(x,y,0) = 0,
\]
\[
\frac{\partial u}{\partial n} \equiv \nabla u \cdot \mathbf{n} = 0
\quad \text{on } \partial\Omega.
\]
#include <stdlib.h>   /* malloc, atoi */
#include <stdio.h>
#include <math.h>

#define LaplaceU(i,j) \
   (dt*dt/dx/dx)*(u[i+1][j]-2*u[i][j]+u[i-1][j]) \
  +(dt*dt/dy/dy)*(u[i][j+1]-2*u[i][j]+u[i][j-1])

/* routine for carrying out one time step;
   computational points are [1,2,...,nx]x[1,2,...,ny], where
   i=1, i=nx, j=1, j=ny are boundaries fulfilling the homogeneous
   Neumann conditions */
void WAVE (const int nx, const int ny,
127
Advanced MPI Programming
           double** up, double** u, double** um,
           double a, double b, double c,
           double dt, double dx, double dy)
{
  int i, j;

  /* prepare values on ghost boundary nodes */
  for (j=1; j<=ny; j++) {
    u[0][j]    = u[2][j];
    u[nx+1][j] = u[nx-1][j];
  }
  for (i=1; i<=nx; i++) {
    u[i][0]    = u[i][2];
    u[i][ny+1] = u[i][ny-1];
  }

  /* update computational points according to the finite difference scheme */
  for (i=1; i<=nx; i++)
    for (j=1; j<=ny; j++)
      up[i][j] = a*2.*u[i][j] - b*um[i][j] + c*LaplaceU(i,j);
}

int main (int argc, const char* argv[])
{
  int nx=101, ny=101, i, j;
  double dx, dy, t, dt, tstop = 1.0;
128
Advanced MPI Programming
  double **u, **up, **um;

  if (argc>1) nx = atoi(argv[1]);
  if (argc>2) ny = atoi(argv[2]);
  dx = 1.0/(nx-1);
  dy = 1.0/(ny-1);
  dt = 1.0/sqrt(1.0/dx/dx + 1.0/dy/dy);

  /* allocate data storage */
  u  = (double**)malloc((nx+2)*sizeof(double*));
  up = (double**)malloc((nx+2)*sizeof(double*));
  um = (double**)malloc((nx+2)*sizeof(double*));
  u[0]  = (double*)malloc((nx+2)*(ny+2)*sizeof(double));
  up[0] = (double*)malloc((nx+2)*(ny+2)*sizeof(double));
  um[0] = (double*)malloc((nx+2)*(ny+2)*sizeof(double));
  for (i=1; i<=nx+1; i++) {
    u[i]  = u[i-1] +(ny+2);
    up[i] = up[i-1]+(ny+2);
    um[i] = um[i-1]+(ny+2);
  }

  /* initial condition u(x,y,0)=u0(x,y) */
  for (i=1; i<=nx; i++)
129
Advanced MPI Programming
    for (j=1; j<=ny; j++)
      u[i][j] = 0.25*cos(M_PI*(i-1)*dx);

  /* find the artificial um such that du/dt=0 at t=0 */
  WAVE (nx, ny, um, u, um, 0.5, 0, 0.5, dt, dx, dy);

  t = 0.;
  while (t < tstop) {
    double** tmp;
    t += dt;
    WAVE (nx, ny, up, u, um, 1, 1, 1, dt, dx, dy);
    tmp = um; um = u; u = up; up = tmp;
  }

  /* de-allocate data storage */
  free (u[0]);  free (up[0]);  free (um[0]);
  free (u);     free (up);     free (um);
  return 0;
}
130
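As a starting point for Exercise Two (a hint only; the variable names are assumed, not part of the exercise code), the process grid and the four neighbor ranks can be set up as follows. The ghost-boundary loops in WAVE are then replaced by MPI_Sendrecv exchanges with these neighbors.

int nprocs, myrank, coords[2];
int dims[2] = {0, 0}, periods[2] = {0, 0};
int west, east, south, north;
MPI_Comm grid;

MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
MPI_Dims_create(nprocs, 2, dims);
MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &grid);
MPI_Comm_rank(grid, &myrank);
MPI_Cart_coords(grid, myrank, 2, coords);    /* my position in the grid */
MPI_Cart_shift(grid, 0, 1, &west, &east);    /* x-direction neighbors */
MPI_Cart_shift(grid, 1, 1, &south, &north);  /* y-direction neighbors */
/* On the non-periodic grid, missing neighbors come back as MPI_PROC_NULL;
   communication with MPI_PROC_NULL is a no-op, so the physical-boundary
   mirror copies in WAVE must still be done locally on those sides. */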
Advanced MPI Programming References
Message Passing Interface Forum. "MPI: A Message-Passing Interface Standard", 1995.
William Gropp, Ewing Lusk and Anthony Skjellum. "Using MPI: Portable Parallel Programming with the Message-Passing Interface", MIT Press, 1994.
Peter S. Pacheco. "Parallel Programming with MPI", Morgan Kaufmann, 1996. 131
Advanced MPI Programming Related MPI web pages
http://www.mpi-forum.org/
http://www-unix.mcs.anl.gov/mpi/
http://www.mpi.nd.edu/mpi_tutorials/
http://www.nas.nasa.gov/Groups/SciCon/Tutorials/MPIintro/toc.html
http://www-unix.mcs.anl.gov/mpi/tutorial/snir/mpi/mpi.htm
http://www.epm.ornl.gov/~walker/mpitutorial/
http://www-unix.mcs.anl.gov/dbpp/