Overview on Parallel Programming Paradigms, Ivan Girotto


SLIDE 1

Overview on Parallel Programming Paradigms

Ivan Girotto - igirotto@ictp.it

Information & Communication Technology Section (ICTS)
International Centre for Theoretical Physics (ICTP)

SLIDE 2

What Determines Performance?

How fast is my CPU?
How fast can I move data around?
How well can I split work into pieces?

– Very application specific: never assume that a good solution for one problem is as good a solution for another
– Always run benchmarks to understand the requirements of your applications and the properties of your hardware
– Respect Amdahl's law

01/10/2015 - Ivan Girotto igirotto@ictp.it Overview on Parallel Programming Paradigms ICTP, smr2761

SLIDE 3

Parallel Architectures

Distributed Memory
Shared Memory

[Figure: distributed memory drawn as several nodes, each with its own CPU and memory, connected by a NETWORK; shared memory drawn as a single node in which several CPUs access one MEMORY.]

SLIDE 4

Multiple Socket CPUs

SLIDE 5

Paradigm at Shared Memory /1

[Figure: three threads (Thread 1, Thread 2, Thread 3), each with its own PC and private data, all attached to a common block of shared data.]

SLIDE 6

Paradigm at Shared Memory /2

Usually indicated as Multithreading Programming
Commonly implemented in scientific computing using the OpenMP standard (directive based)
Thread management overhead
Limited scalability
Write access to shared data can easily lead to race conditions and incorrect data
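The race-condition warning on this slide can be illustrated with a minimal sketch. It uses Python's standard threading module as a stand-in for OpenMP threads, which is an assumption made for illustration only; the function name count_with_lock is hypothetical. In OpenMP the equivalent protection would typically come from a critical or atomic directive.

```python
import threading

def count_with_lock(n_threads: int, n_increments: int) -> int:
    """Each thread adds to a shared counter; a lock serializes the updates."""
    counter = 0
    lock = threading.Lock()

    def worker() -> None:
        nonlocal counter
        for _ in range(n_increments):
            # Read-modify-write on shared data: without the lock two threads
            # could read the same old value and one update would be lost.
            with lock:
                counter += 1

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter

print(count_with_lock(n_threads=4, n_increments=10_000))
```

The lock makes the result correct, but it also serializes the update, which is one face of the "thread management overhead" and "limited scalability" listed above.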

SLIDE 7

Parallel Programming Paradigms

MPI (Message Passing Interface)

– A standard defined for portable message passing
– It is available in the form of a library which includes interfaces for expressing the data exchange among processes
– A framework is provided for spawning the independent processes (i.e., mpirun)
– Process communication is via network
– It works on both shared and distributed memory architectures
– Ideal for distributing memory among compute nodes

SLIDE 8

MPI Program Design

Multiple and separate processes (can be local and remote) running concurrently, coordinated and exchanging data through "messages" => a "share nothing" parallelization
Best for coarse grained parallelization
Distribute large data sets; replicate small data
Minimize communication or overlap communication and computing for efficiency => Amdahl's law
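"Distribute large data sets" usually starts with deciding which part of the data each process owns. The following is a minimal sketch of a contiguous block distribution, assuming each process knows its rank and the total number of processes; block_range is a hypothetical helper, not an MPI call.

```python
def block_range(n_items: int, size: int, rank: int) -> range:
    """Indices owned by `rank` when n_items are split into `size` contiguous blocks."""
    base, extra = divmod(n_items, size)
    # The first `extra` ranks receive one additional item.
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return range(start, stop)

# Hypothetical case: 10 items distributed over 4 processes.
for rank in range(4):
    print(rank, list(block_range(10, 4, rank)))
```

Each process then allocates and works only on its own block, which is what lets the total memory of a distributed machine hold a data set too large for a single node.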

SLIDE 9

What is MPI?

A standard, i.e. there is a document describing how the API (constants & subroutines) are named and should behave; multiple "levels", MPI-1 (basic), MPI-2 (advanced), MPI-3 (new)
A library or API to hide the details of low-level communication hardware and how to use it
Implemented by multiple vendors
Open source and commercial versions
Vendor specific versions for certain hardware
Not binary compatible between implementations

SLIDE 10

Programming Parallel Paradigms

Are the tools we use to express the parallelism on a given architecture
They differ in how programmers can manage and define key features like:

– parallel regions
– concurrency
– process communication
– synchronism

SLIDE 11

MPI inter process communications

MPI on Multi core CPU

[Figure: MPI_BCAST across four nodes connected by a network.]

1 MPI process / core
Stress network
Stress OS
Many MPI codes (QE) based on ALLTOALL
Messages = processes * processes
We need to exploit the hierarchy
Re-design applications
Mix message passing and multi-threading
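The estimate "Messages = processes * processes" makes the case for the hybrid approach easy to quantify. The sketch below compares one MPI process per core with one MPI process per node; the cluster size (4 nodes with 8 cores each) is a hypothetical example and alltoall_messages is an illustrative helper.

```python
def alltoall_messages(n_processes: int) -> int:
    """Message count of an all-to-all exchange, using the slide's estimate p * p."""
    return n_processes * n_processes

nodes, cores_per_node = 4, 8                        # hypothetical cluster
flat = alltoall_messages(nodes * cores_per_node)    # one MPI process per core
hybrid = alltoall_messages(nodes)                   # one MPI process per node, threads inside
print(flat, hybrid)
```

Under these assumptions the message count drops by a factor of cores_per_node squared, which is the motivation for mixing message passing and multi-threading.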

SLIDE 12

The Hybrid Mode

[Figure: four nodes connected by a network.]

SLIDE 13

The Hybrid Mode

[Figure: four nodes connected by a network.]

SLIDE 14

01/10/2015 - Ivan Girotto igirotto@ictp.it Computer Architecture for HPC - ICTP, smr2761

The Intel Xeon E5-2665 Sandy Bridge-EP 2.4GHz

~ 8 GBytes

mpirun -np 8 pw-gpu.x -inp input file

SLIDE 15

The Intel Xeon E5-2665 Sandy Bridge-EP 2.4GHz

~ 8 GBytes

mpirun -np 1 pw-gpu.x -inp input file

SLIDE 16

The Intel Xeon E5-2665 Sandy Bridge-EP 2.4GHz

~ 8 GBytes

export OMP_NUM_THREADS=4
export OPENBLAS_NUM_THREADS=$OMP_NUM_THREADS
mpirun -np 2 pw-gpu.x -inp input file
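The three launch lines on these slides (8 processes, 1 process, and 2 processes with 4 threads each) are different ways of occupying the same cores. As a rough sketch, assuming the goal is to use exactly 8 cores with no oversubscription, the possible combinations can be enumerated as follows; hybrid_layouts is a hypothetical helper and the printed command is only a template.

```python
def hybrid_layouts(cores: int) -> list[tuple[int, int]]:
    """All (mpi_processes, threads_per_process) pairs that use exactly `cores` cores."""
    return [(p, cores // p) for p in range(1, cores + 1) if cores % p == 0]

for processes, threads in hybrid_layouts(8):
    print(f"OMP_NUM_THREADS={threads} mpirun -np {processes} ...")
```

Which layout performs best depends on the application and has to be measured.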

SLIDE 17

MPI: Domain partition
OpenMP: Node Level shared mem
CUDA/OpenCL/OpenACC: floating point accelerators
Python: Ensemble simulations, workflows
Workload Management: system level, High-throughput

SLIDE 18

Type of Parallelism

Functional (or task) parallelism:
different people are performing different tasks at the same time

Data Parallelism:
different people are performing the same task, but on different equivalent and independent objects
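The two types of parallelism can be contrasted in a few lines. This is a minimal sketch using Python's concurrent.futures thread pool as the "people"; the functions square, total and largest are placeholders chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def square(x: int) -> int:
    return x * x

def total(values: list[int]) -> int:
    return sum(values)

def largest(values: list[int]) -> int:
    return max(values)

data = [1, 2, 3, 4]

with ThreadPoolExecutor(max_workers=4) as pool:
    # Data parallelism: the same task applied to independent items.
    squares = list(pool.map(square, data))
    # Task parallelism: different tasks running at the same time.
    f_total = pool.submit(total, data)
    f_largest = pool.submit(largest, data)
    results = (f_total.result(), f_largest.result())

print(squares, results)
```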

SLIDE 19

Process Interactions

The effective speed-up obtained by the parallelization depends on the amount of overhead we introduce by making the algorithm parallel

There are mainly two key sources of overhead:
1. Time spent in inter-process interactions (communication)
2. Time some processes may spend being idle (synchronization)

SLIDE 20

Effect of load-unbalancing

all here?
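The cost of load imbalance can be estimated from the per-process work times alone. The sketch below assumes that all processes must wait for the slowest one at a synchronization point; the timings and the helper imbalance_report are hypothetical.

```python
def imbalance_report(work_times: list[float]) -> tuple[float, list[float], float]:
    """Return (elapsed, idle time per process, average utilization)."""
    elapsed = max(work_times)                  # everyone waits for the slowest
    idle = [elapsed - t for t in work_times]
    utilization = sum(work_times) / (elapsed * len(work_times))
    return elapsed, idle, utilization

# Hypothetical case: three processes finish in 4 s, one needs 8 s.
elapsed, idle, utilization = imbalance_report([4.0, 4.0, 4.0, 8.0])
print(elapsed, idle, utilization)
```

In this example a single slow process leaves the other three idle for half of the run, so only 62.5% of the available processor time does useful work.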

SLIDE 21

Mapping and Synchronization

SLIDE 22

Amdahl's law

In a massively parallel context, an upper limit for the scalability of parallel applications is determined by the fraction of the overall execution time spent in non-scalable operations (Amdahl's law).

maximum speedup tends to 1 / (1 − P), with P = parallel fraction

1000000 cores: P = 0.999999, serial fraction = 0.000001
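The slide gives only the limit for an unlimited number of cores. For a finite core count N, Amdahl's law is commonly written as S(N) = 1 / ((1 − P) + P / N). A short sketch of both, with amdahl_speedup as an illustrative function name:

```python
def amdahl_speedup(parallel_fraction: float, n_cores: int) -> float:
    """Speedup predicted by Amdahl's law for a fixed problem size."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_cores)

p = 0.999999
print(amdahl_speedup(p, 1_000_000))   # roughly half of the core count
print(1.0 / (1.0 - p))                # limit for an unlimited number of cores
```

Even with a serial fraction of one part in a million, one million cores deliver a speedup of only about 500000, because the serial part then takes as long as the parallel part.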

SLIDE 23

How do we evaluate the improvement?

We want to estimate the amount of the introduced overhead => $T_o = n_{pes} T_P - T_S$

But to quantify the improvement we use the term Speedup:

$S_P = T_S / T_P$
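A worked example may help to read the two definitions. It assumes T_S is the serial run time, T_P the parallel run time and n_pes the number of processing elements; the timings below are hypothetical.

```python
def overhead(t_serial: float, t_parallel: float, n_pes: int) -> float:
    """T_o = n_pes * T_P - T_S: processor time spent beyond the serial work."""
    return n_pes * t_parallel - t_serial

def speedup(t_serial: float, t_parallel: float) -> float:
    """S_P = T_S / T_P."""
    return t_serial / t_parallel

t_s, t_p, n_pes = 100.0, 30.0, 4      # hypothetical timings in seconds
print(overhead(t_s, t_p, n_pes), speedup(t_s, t_p))
```

Here four processing elements consume 120 s of processor time for 100 s of serial work, so the overhead is 20 s and the speedup is about 3.3 rather than the ideal 4.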

SLIDE 24

Speedup

SLIDE 25

Efficiency ¡

Only ¡embarrassing ¡parallel ¡algorithm ¡can ¡obtain ¡an ¡

ideal ¡Speedup ¡ ¡

The ¡Efficiency ¡is ¡a ¡measure ¡of ¡the ¡frac(on ¡of ¡(me ¡for ¡

which ¡a ¡processing ¡element ¡is ¡usefully ¡employed: ¡ ¡

01/10/2015 ¡– ¡ ¡Ivan ¡GiroSo ¡ ¡ ¡ igiroSo@ictp.it ¡ Overview ¡on ¡Parallel ¡Programming ¡Paradigms ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ICTP, ¡smr2761 ¡ ¡ 25 ¡

EP ¡= ¡ ¡ SP ¡ ¡ p ¡ ¡
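Efficiency is typically tabulated from a small scaling test. The sketch below uses hypothetical wall-clock times and, as a simplifying assumption, takes the run on one processing element as the serial time T_S.

```python
def efficiency(t_serial: float, t_parallel: float, p: int) -> float:
    """E_P = S_P / p with S_P = T_S / T_P."""
    return (t_serial / t_parallel) / p

# Hypothetical wall-clock times (seconds) measured on p processing elements.
timings = {1: 80.0, 2: 42.0, 4: 24.0, 8: 16.0}
t_serial = timings[1]
for p, t_parallel in timings.items():
    print(f"p={p}  speedup={t_serial / t_parallel:.2f}  "
          f"efficiency={efficiency(t_serial, t_parallel, p):.2f}")
```

A falling efficiency column shows how much of each added processing element is lost to overhead.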

SLIDE 26

Efficiency

SLIDE 27

Amdahl's Law And Real Life

The speedup of a parallel program is limited by the sequential fraction of the program
This assumes perfect scaling and no overhead

SLIDE 28

Scaling - QE-CP on Fermi BGQ @ CINECA

SLIDE 29

Easy Parallel Computing

Farming, embarrassingly parallel

– Executing multiple instances of the same program with different inputs/initial cond.
– Reading large binary files by splitting the workload among processes
– Searching elements on large data-sets
– Other parallel execution of embarrassingly parallel problems (no communication among tasks)

Ensemble simulations (weather forecast)
Parameter space (find the best wing shape)
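For a parameter-space scan, each process only needs a rule for picking its own cases. A minimal sketch of a cyclic (round-robin) assignment follows; the list of wing angles and the helper my_share are hypothetical.

```python
def my_share(items: list, rank: int, size: int) -> list:
    """Cyclic distribution: rank r takes items r, r + size, r + 2*size, ..."""
    return items[rank::size]

# Hypothetical parameter space: wing angles to evaluate, one independent run each.
angles = [0.0, 2.5, 5.0, 7.5, 10.0, 12.5, 15.0]
for rank in range(3):
    print(rank, my_share(angles, rank, 3))
```

No communication is needed: every process derives its share from its own rank and the total process count.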

SLIDE 30

Single Program on Multiple Data

Performing the same program (set of instructions) on different data
Same model adopted by the MPI library
A parallel tool is needed to handle the different processes working in parallel
The MPI library provides the mpirun application to execute parallel instances of the same program

SLIDE 31

$ mpirun -np 12 my_program.x

[Figure: two nodes, mynode01 and mynode02.]

SLIDE 32

[igirotto@mynode01 ~]$ mpirun -np 12 /bin/hostname
mynode01
mynode02
mynode01
mynode02
mynode01
mynode02
mynode01
mynode02
mynode01
mynode02
mynode01
mynode02

SLIDE 33

Parallel Operations in Practice

Parallel reading and computing in parallel is always allowed
Parallel writing is extremely dangerous!
To control the parallel flow each process should be unique and identifiable (ID)
The OpenMPI implementation of the MPI library provides a series of environment variables defined for each MPI process

SLIDE 34

OMPI_COMM_WORLD_SIZE - the number of processes in this process' MPI Comm_World
OMPI_COMM_WORLD_RANK - the MPI rank of this process
OMPI_COMM_WORLD_LOCAL_RANK - the relative rank of this process on this node within its job. For example, if four processes in a job share a node, they will each be given a local rank ranging from 0 to 3.
OMPI_UNIVERSE_SIZE - the number of process slots allocated to this job. Note that this may be different than the number of processes in the job.
OMPI_COMM_WORLD_LOCAL_SIZE - the number of ranks from this job that are running on this node.
OMPI_COMM_WORLD_NODE_RANK - the relative rank of this process on this node looking across ALL jobs.

http://www.open-mpi.org

SLIDE 35

In Python

import os
myid = os.environ['OMPI_COMM_WORLD_RANK']
[...]

In BASH

#!/bin/bash
myid=${OMPI_COMM_WORLD_RANK}
[...]

[igirotto@mynode01 ~]$ mpirun ./myprogram.[py/sh...]
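A slightly more defensive variant of the Python snippet is sketched below. It reads the rank and the number of processes from the Open MPI variables listed earlier, converts them to integers (environment values are strings), and falls back to a single-process identity when the script is started without mpirun. The helper name mpi_identity and the fallback behaviour are assumptions for illustration.

```python
import os

def mpi_identity(env=os.environ) -> tuple[int, int]:
    """Return (rank, size) from the Open MPI variables, or (0, 1) outside mpirun."""
    # Environment values are strings, so they are converted before use as indices.
    rank = int(env.get("OMPI_COMM_WORLD_RANK", "0"))
    size = int(env.get("OMPI_COMM_WORLD_SIZE", "1"))
    return rank, size

rank, size = mpi_identity()
print(f"process {rank} of {size}")
```

These variables are specific to Open MPI; other MPI implementations expose the rank under different names.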

SLIDE 36

Possible Applications

Executing multiple instances of the same program with different inputs/initial cond.
Reading large binary files by splitting the workload among processes
Searching elements on large data-sets
Other parallel execution of embarrassingly parallel problems (no communication among tasks)

SLIDE 37

Conclusions

Task Farming is a simple model to parallelize simple problems that can be divided into independent tasks
The mpirun application makes it easy to run multiple processes and also takes care of the environment setting
Load balancing remains a main problem, but moving from serial to parallel processing can substantially reduce the simulation time

SLIDE 38

Task Farming

Many independent programs (tasks) running at once

– each task can be serial or parallel
– "independent" means they don't communicate directly
– Processes possibly driven by the mpirun framework

[igirotto@localhost]$ more my_shell_wrapper.sh
#!/bin/bash
#example for the OpenMPI implementation
./prog.x --input input_${OMPI_COMM_WORLD_RANK}.dat
[igirotto@localhost]$ mpirun -np 400 ./my_shell_wrapper.sh
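The same wrapper could be written in Python. The sketch below only builds the command line that the shell wrapper would execute, so it can be inspected without launching anything; task_command is a hypothetical helper, and the fallback to rank 0 outside mpirun is an assumption.

```python
import os

def task_command(env=os.environ, program: str = "./prog.x") -> list[str]:
    """Build the command line for this process, mirroring my_shell_wrapper.sh."""
    rank = env.get("OMPI_COMM_WORLD_RANK", "0")   # falls back to 0 outside mpirun
    return [program, "--input", f"input_{rank}.dat"]

print(" ".join(task_command()))
# To actually launch the task, the list could be passed to subprocess.run(...).
```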

SLIDE 39

Master/Slave

[Figure: a Master process and four workers, W1, W2, W3 and W4.]
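The master/worker pattern can be sketched with a shared task queue. This example uses Python threads and queue.Queue instead of MPI processes, which is a substitution made to keep it self-contained; the squaring step is a placeholder for real work and run_farm is a hypothetical name.

```python
import queue
import threading

def run_farm(tasks: list[int], n_workers: int) -> dict[int, int]:
    """A master fills a queue; workers pull tasks until they receive a stop marker."""
    todo: queue.Queue = queue.Queue()
    done: queue.Queue = queue.Queue()

    def worker() -> None:
        while True:
            task = todo.get()
            if task is None:                # stop marker sent by the master
                break
            done.put((task, task * task))   # placeholder computation

    workers = [threading.Thread(target=worker) for _ in range(n_workers)]
    for w in workers:
        w.start()
    for task in tasks:                      # master: hand out the work
        todo.put(task)
    for _ in workers:                       # master: one stop marker per worker
        todo.put(None)
    for w in workers:
        w.join()
    return dict(done.get() for _ in tasks)

print(run_farm([1, 2, 3, 4, 5, 6, 7, 8], n_workers=4))
```

Because each worker asks for a new task as soon as it finishes the previous one, faster workers naturally process more tasks, which helps with load balancing.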

SLIDE 40

Parallel I/O

[Figure: processes P0 to P4 and a single File System; the link to the file system is labelled "I/O Bandwidth".]

SLIDE 41

Parallel I/O

[Figure: processes P0 to P3, each with its own File System and its own I/O Bandwidth.]

SLIDE 42

Parallel I/O

[Figure: processes P0 to P3, each performing I/O on a shared Parallel File System.]

MPI I/O & Parallel I/O Libraries (HDF5, NetCDF, etc.)
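The idea behind writing one shared file safely is that every process writes to its own, non-overlapping region. The sketch below illustrates this with plain file seeks and a loop that plays the role of the ranks one after another; it is not MPI I/O and does not run in parallel. The record size and the helper names are assumptions for illustration.

```python
import os
import tempfile

RECORD = 8  # bytes per process in this toy example

def write_block(path: str, rank: int, payload: bytes) -> None:
    """Write this rank's fixed-size record at its own offset in a shared file."""
    assert len(payload) == RECORD
    with open(path, "r+b") as f:
        f.seek(rank * RECORD)      # disjoint regions: no two ranks overlap
        f.write(payload)

def demo(size: int = 4) -> bytes:
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "shared.bin")
        with open(path, "wb") as f:
            f.truncate(size * RECORD)          # pre-size the file
        for rank in reversed(range(size)):     # order does not matter
            write_block(path, rank, f"rank{rank:04d}".encode())
        with open(path, "rb") as f:
            return f.read()

print(demo())
```

MPI I/O and libraries such as HDF5 or NetCDF build on the same offset-based layout, and add collective operations and portable file formats on top of it.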

SLIDE 43

What If You Want to Learn How to Program All This?!

Introductory School on Parallel Programming and Parallel Architecture for High Performance Computing | (smr 2877)

3 October 2016 - 14 October 2016

11/09/2015 - Ivan Girotto igirotto@ictp.it Introduction to High-Performance Computing ICTP, smr2706

What If You Want to Master All This?!

SLIDE 44


SLIDE 45
