

SLIDE 1

Qualifying Exam

Student: Vinícius Garcia Pinto
Advisor: Nicolas Maillard

Research line: Parallel and Distributed Processing
Breadth area: Parallel Programming
Depth topic: Hybrid Parallel Programming

June 11, 2014

SLIDE 2

Agenda

1. Why Parallel Programming?
2. Parallel Programming
3. Hybrid Parallel Programming
4. Conclusion
5. References

SLIDE 3

1. Why Parallel Programming?
2. Parallel Programming
3. Hybrid Parallel Programming
4. Conclusion
5. References

SLIDE 4

Introduction

Parallel Computing: the use of two or more processing units to solve a single problem.

Goals:
- solve problems in less time;
- solve larger problems.

Where?
- climate modeling;
- energy research;
- data analysis;
- simulation.

[JáJá 1992; Mattson et al. 2004; Pacheco 2011; Scott et al. 2005]


SLIDE 8

A current parallel computer

[Figure: two current parallel computers. The GPPD "Orion #1" node: two quad-core Xeon E5-2630 processors (cores #0-#3 each) plus a Tesla K20m GPU. The Samsung Galaxy S5: a 1.9 GHz quad-core Cortex-A15 cluster and a 1.3 GHz quad-core Cortex-A7 cluster plus a Mali-T628 GPU.]

SLIDE 9

Introduction

Parallel computers are (now) mainstream:
- vector instructions, multithreaded cores, multicore processors, graphics engines, accelerators;
- not only for scientific (or HPC) applications.

But...
- none of the most popular programming languages was designed for parallel computing;
- many programmers have never written a parallel program;
- tools for parallel computing were designed for homogeneous supercomputers.

All programmers could be parallel programmers!

[McCool et al. 2012]

SLIDE 10

1. Why Parallel Programming?
2. Parallel Programming
3. Hybrid Parallel Programming
4. Conclusion
5. References

SLIDE 11

1. Why Parallel Programming?
2. Parallel Programming
   - Solving the problem in parallel
   - Implementing the solution
   - Testing the solution
3. Hybrid Parallel Programming
   - Programming Accelerators
   - Hot Topics
4. Conclusion
5. References

SLIDE 12

Solving a Problem in Parallel

Programmer's tasks:
1. identify the concurrency in the problem;
2. structure an algorithm to exploit this concurrency;
3. implement the solution with a suitable programming environment.

Main challenges:
- identify and manage dependencies between concurrent tasks;
- manage the additional kinds of errors introduced by parallelism;
- improve on the sequential solution (if one exists).

[Mattson et al. 2004; Sottile et al. 2010]


SLIDE 15

Finding concurrency

Decompose the problem

Identify sequences of steps that can be executed together and (probably) at the same time → tasks [Sottile et al. 2010].

Granularity: size of one task vs. number of tasks [Grama 2003]:
- fine-grained: large number of smaller tasks;
- coarse-grained: small number of larger tasks.

Degree of concurrency: number of tasks that can be executed simultaneously in parallel.

SLIDE 16

Finding concurrency

Describe the dependencies between tasks (see the sketch below):
- logical dependencies: the order in which specific operations must be executed;
- data dependencies: the order in which data elements must be updated.
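To make the two kinds of ordering constraints concrete, here is a minimal C/OpenMP sketch (not from the original slides; the loop bodies are made-up stand-ins). The first loop's iterations are independent and may run in parallel; the second loop carries a data dependence, since iteration i reads the element written by iteration i-1 and must therefore preserve that order:

    /* Independent iterations: safe to parallelize. */
    void fill(double *a, int n) {
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            a[i] = 2.0 * i;          /* touches only a[i] */
    }

    /* Loop-carried data dependence: b[i] needs b[i-1], so the
       iterations must execute in order as written. */
    void running_sum(double *b, const double *a, int n) {
        for (int i = 1; i < n; i++)
            b[i] = b[i-1] + a[i];
    }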

SLIDE 17

Structuring the algorithm

Common patterns: Fork-join
- divide the sequential flow into multiple parallel flows;
- join the parallel flows back into the sequential flow;
- usually used to implement parallel divide and conquer.

[JáJá 1992; McCool et al. 2012]

Example: merge sort (see the sketch below).
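A minimal fork-join sketch of parallel merge sort, written here with OpenMP tasks (the slides do not fix a notation for this pattern); merge() is an assumed sequential helper, and the first call is expected to happen inside a parallel region with a single construct:

    /* Assumed helper: sequentially merges a[lo..mid) and a[mid..hi),
       using tmp as scratch space. */
    void merge(int *a, int *tmp, int lo, int mid, int hi);

    void merge_sort(int *a, int *tmp, int lo, int hi) {
        if (hi - lo < 2) return;            /* base case: 0 or 1 element */
        int mid = (lo + hi) / 2;
        #pragma omp task shared(a, tmp)     /* fork: left half in a new flow */
        merge_sort(a, tmp, lo, mid);
        merge_sort(a, tmp, mid, hi);        /* right half in the current flow */
        #pragma omp taskwait                /* join the two flows */
        merge(a, tmp, lo, mid, hi);
    }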

SLIDE 18

Structuring the algorithm

Common patterns: Stencil
- apply a function to each element and its neighbors;
- the output is a combination of the values of the current element and its neighbors.

[JáJá 1992; McCool et al. 2012]

Example: computational fluid dynamics (see the sketch below).
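A minimal 1D stencil sketch in C/OpenMP (not from the original slides): each output element combines the corresponding input element with its two neighbors, and writing to a separate output array keeps the iterations independent:

    /* One 3-point stencil step: out[i] averages in[i-1..i+1]. */
    void stencil_step(const double *in, double *out, int n) {
        #pragma omp parallel for
        for (int i = 1; i < n - 1; i++)
            out[i] = (in[i-1] + in[i] + in[i+1]) / 3.0;
    }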

SLIDE 19

Structuring the algorithm

Common patterns: Map
- apply a function to all elements of a collection, producing a new collection.

Recurrence
- similar to map, but elements can use the outputs of adjacent elements as inputs.

[McCool et al. 2012]

Example: parallel for (see the sketch below).
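A minimal map sketch in C/OpenMP (not from the original slides): the same function is applied independently to every element and a new collection is produced, which is exactly what makes the parallel for safe here:

    #include <math.h>

    /* Map: out[i] depends only on in[i], so iterations are independent. */
    void map_sqrt(const double *in, double *out, int n) {
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            out[i] = sqrt(in[i]);
    }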

SLIDE 20

Structuring the algorithm

Common patterns: Reduction
- combine all elements of a collection into a single element using an associative combiner function.

[McCool et al. 2012]

Example: MPI_Reduce (see the sketch below).
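A minimal, self-contained MPI_Reduce sketch (the input values are made up, not from the original slides): every rank contributes a partial value, and the associative combiner MPI_SUM folds them into a single result on rank 0:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, global = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        int local = rank + 1;              /* this process's partial value */
        MPI_Reduce(&local, &global, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("sum = %d\n", global);  /* 1 + 2 + ... + #processes */
        MPI_Finalize();
        return 0;
    }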

SLIDE 21

Structuring the algorithm

Common patterns: Scan
- computes all partial reductions of a collection: for every output position, a reduction of the input up to that point is computed.

[McCool et al. 2012]

Example: prefix sum (see the sketch below).
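A minimal prefix-sum sketch using MPI_Scan (one MPI realization of the pattern, not from the original slides): rank i receives the reduction of the inputs of ranks 0 through i, i.e., an inclusive prefix sum across processes:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, prefix = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        int local = rank + 1;                  /* this process's input */
        MPI_Scan(&local, &prefix, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
        printf("rank %d: prefix = %d\n", rank, prefix);
        MPI_Finalize();
        return 0;
    }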

SLIDE 22

1. Why Parallel Programming?
2. Parallel Programming
   - Solving the problem in parallel
   - Implementing the solution
   - Testing the solution
3. Hybrid Parallel Programming
   - Programming Accelerators
   - Hot Topics
4. Conclusion
5. References

SLIDE 23

Implementing the solution

Parallel programming environments, in general, abstract the hardware organization:
- message passing for distributed-memory platforms, e.g. MPI;
- threads for shared-memory platforms, e.g. OpenMP.

But there are exceptions:
- Google Go is a multicore programming language whose concurrent routines communicate by message passing over channels (Hoare's CSP).

SLIDE 24

Programming in distributed-memory platforms

MPI (Message Passing Interface)
- de facto standard;
- processes communicate by exchanging messages:
  - send/receive;
  - point-to-point/collective communications;
- several implementations: Open MPI, MPICH, etc.

[Gropp et al. 1999]

SLIDE 25

Programming in distributed-memory platforms

MPI (Message Passing Interface)
- All processes run the same binary program (MPI-1):
  - SPMD;
  - each process is identified by a rank;
  - conditional tests (if ... then) select the parts of the program each process runs.
- Dynamic process creation (MPI-2):
  - MPMD;
  - processes can be created after the MPI application has started.

[Forum 1997]

SLIDE 26

Programming in distributed-memory platforms

    ...
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    if (myrank == 0) {
        strcpy(message, "Hello, there");
        MPI_Send(message, strlen(message)+1, MPI_CHAR, 1, 99, MPI_COMM_WORLD);
    } else if (myrank == 1) {
        MPI_Recv(message, 13, MPI_CHAR, 0, 99, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Received :%s:\n", message);
    }
    MPI_Finalize();

SLIDE 27

Programming in distributed-memory platforms

MPI (Message Passing Interface) — Version 3 (MPI-3)
- nonblocking collective operations (see the sketch below);
- neighborhood collective communication;
- new one-sided communication operations;
- Fortran 2008 bindings;
- removed/deprecated functionality: the C++ bindings.

[Forum 2012]
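A minimal sketch of a nonblocking collective (the buffers and the overlapped work are made up, not from the original slides): the allreduce is started, independent computation overlaps with the communication, and MPI_Wait completes it:

    #include <mpi.h>
    #include <stdio.h>

    static void do_independent_work(void) { /* assumed stand-in */ }

    int main(int argc, char **argv) {
        int rank, local, global;
        MPI_Request req;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        local = rank;
        /* Start the collective without blocking... */
        MPI_Iallreduce(&local, &global, 1, MPI_INT, MPI_SUM,
                       MPI_COMM_WORLD, &req);
        do_independent_work();               /* ...overlap computation... */
        MPI_Wait(&req, MPI_STATUS_IGNORE);   /* ...then complete it. */
        printf("rank %d: sum = %d\n", rank, global);
        MPI_Finalize();
        return 0;
    }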

SLIDE 28

Programming in distributed-memory platforms

MPI (Message Passing Interface)

Pros:
- widely used;
- scalability;
- portability;
- no (explicit) locks.

Cons:
- low level;
- explicit data distribution;
- "the assembly code of parallel computing".

SLIDE 29

Programming in distributed-memory platforms

Are there alternatives?

Low level:
- sockets (OS).

High level:
- RMI/RPC: Java RMI, Charm++;
- MapReduce (Hadoop);
- Partitioned Global Address Space (PGAS): Unified Parallel C (UPC).

[Dean et al. 2008; Downing 1998; Kale et al. 1993; UPC Consortium 2013]

SLIDE 30

Programming in shared-memory platforms

OpenMP (Open Multi-Processing)
- collection of compiler directives and library functions;
- C, C++, Fortran;
- #pragma omp <parallel, for, task, shared, private, critical, barrier, reduction, ...>
- multithreaded programming;
- sequentially equivalent / incremental parallelism;
- master thread / slave threads.

[Chapman et al. 2008]

SLIDE 31

Programming in shared-memory platforms

OpenMP (Open Multi-Processing)

Loop parallelism:

    #pragma omp parallel for
    for (i = 0; i < N; i++) {
        res[i] = something(i);
    }

Task parallelism (OpenMP 3+):

    #pragma omp task shared(x)
    x = fib(n-1);
    #pragma omp task shared(y)
    y = fib(n-2);
    #pragma omp taskwait
    result = x + y;

[Chapman et al. 2008]

SLIDE 32

Programming in shared-memory platforms

OpenMP (Open Multi-Processing)

Pros:
- widely used;
- (very) simple;
- high level (despite the pragmas);
- accelerator support (OpenMP 4).

Cons:
- risk of race conditions;
- simple scheduler (round-robin);
- the sequential program structure can be inadequate for extracting parallelism.

[Chapman et al. 2008]

SLIDE 33

Programming in shared-memory platforms

Are there alternatives?

Low level (see the Pthreads sketch below):
- processes (OS);
- Pthreads;
- C++ threads (Boost.Thread).

High level:
- Cilk, Cilk++, Intel Cilk Plus;
- Intel TBB;
- Google Go.
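To illustrate how much lower-level the Pthreads option is than OpenMP's one-pragma loop, here is a sketch (sizes and loop body are made up, not from the original slides) of the same parallel-for idea written by hand, with a fixed block of iterations per thread:

    #include <pthread.h>
    #include <stdio.h>

    #define N 1000
    #define T 4                      /* number of threads (assumed) */
    static double res[N];

    static void *worker(void *arg) {
        long id = (long)arg;         /* thread index 0..T-1 */
        for (long i = id * (N / T); i < (id + 1) * (N / T); i++)
            res[i] = 2.0 * i;        /* stand-in for something(i) */
        return NULL;
    }

    int main(void) {
        pthread_t th[T];
        for (long t = 0; t < T; t++)
            pthread_create(&th[t], NULL, worker, (void *)t);
        for (long t = 0; t < T; t++)
            pthread_join(th[t], NULL);
        printf("res[N-1] = %f\n", res[N - 1]);
        return 0;
    }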

SLIDE 34

Programming in shared-memory platforms

Cilk
- C extensions (compiler) + runtime system;
- only 3 keywords: cilk_spawn, cilk_sync, cilk_for;
- task parallelism with parent-child dependencies:

    x = cilk_spawn fib(n-1);
    y = cilk_spawn fib(n-2);
    cilk_sync;
    result = x + y;

- efficient scheduler: work stealing.

[Blumofe et al. 1995; Intel 2014]

SLIDE 35

1. Why Parallel Programming?
2. Parallel Programming
   - Solving the problem in parallel
   - Implementing the solution
   - Testing the solution
3. Hybrid Parallel Programming
   - Programming Accelerators
   - Hot Topics
4. Conclusion
5. References

SLIDE 36

Testing the solution

Metrics:

Speedup — how many times faster is the parallel program (parallel vs. sequential)?

    Sp = T1 / Tp

Efficiency — how effective is the use of the parallel resources?

    Ep = Sp / p

[Mattson et al. 2004]
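To make the definitions concrete (with made-up numbers): if the sequential run takes T1 = 100 s and the run on p = 4 processors takes T4 = 25 s, then S4 = 100/25 = 4 and E4 = 4/4 = 1, i.e., linear speedup at full efficiency; a more typical T4 = 50 s would give S4 = 2 and E4 = 0.5.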

SLIDE 37

Speedup

[Plot: speedup vs. number of processors (1 to 10), with superlinear, linear, and typical curves.]

SLIDE 38

Efficiency

[Plot: efficiency vs. number of processors (1 to 10), with superlinear, linear, and typical curves.]

SLIDE 39

Testing the solution

Other metrics?
- FLOPS;
- energy efficiency:
  - FLOPS per watt;
  - energy to solution;
  - EDP (Energy-Delay Product);
- bandwidth:
  - memory bandwidth;
  - network bandwidth.

[Bode 2013; Green500 2014; Laros III et al. 2013]


SLIDE 44

Until now

- Parallel computers are mainstream;
- distributed-memory: MPI, RMI, Charm, PGAS;
- shared-memory: OpenMP, Cilk, TBB;
- but... what about:
  - clusters with multicore nodes;
  - GPUs;
  - manycore coprocessors?

SLIDE 45

1. Why Parallel Programming?
2. Parallel Programming
3. Hybrid Parallel Programming
4. Conclusion
5. References

SLIDE 46

Combining OpenMP and MPI

Exploit hierarchical parallelism:
- multiple nodes in the cluster (MPI);
- multicore processors within each node (OpenMP).

[Chapman et al. 2008]


SLIDE 48

Combining OpenMP and MPI

Pros:
- the software approach matches the hardware;
- some applications expose two levels of parallelism: coarse-grained → MPI, fine-grained → OpenMP (see the sketch below);
- load balancing;
- increases the amount of parallelism without adding more processes;
- lower memory consumption (data is shared among the threads of a node).

[Chapman et al. 2008]
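A minimal, self-contained sketch of the two-level pattern (the computed work is made up, not from the original slides): each MPI process computes a partial result with OpenMP threads at the fine-grained level, and MPI combines the per-process results at the coarse-grained level:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int provided, rank;
        double local = 0.0, global = 0.0;
        /* Ask for an MPI library that tolerates threads inside a process. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        /* Fine-grained level: OpenMP threads share the node's memory. */
        #pragma omp parallel for reduction(+: local)
        for (int i = 0; i < 1000; i++)
            local += 0.001 * i;
        /* Coarse-grained level: combine per-process results with MPI. */
        MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0,
                   MPI_COMM_WORLD);
        if (rank == 0)
            printf("global = %f\n", global);
        MPI_Finalize();
        return 0;
    }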

SLIDE 49

Combining OpenMP and MPI

Cons:
- some applications expose only one level of parallelism;
- interaction between the MPI and OpenMP runtime libraries;
- portability;
- hard to use: two different models and two environments in the same code;
- slave threads sleep while the master thread communicates.

[Chapman et al. 2008]

SLIDE 50

1. Why Parallel Programming?
2. Parallel Programming
   - Solving the problem in parallel
   - Implementing the solution
   - Testing the solution
3. Hybrid Parallel Programming
   - Programming Accelerators
   - Hot Topics
4. Conclusion
5. References

SLIDE 51

Programming Accelerators

Accelerators can perform some computations faster than a general-purpose CPU:
- designed for execution throughput;
- lots of (simple) cores;
- # of threads ≫ # of cores;
- separate memory spaces;
- slave device(s) attached to a host by PCIe;
- there are exceptions: Cell BE, Tegra, Exynos (accelerator on the same board/chip).

SLIDE 52

Programming Accelerators

GPUs
- good GFlops/$ and GFlops/watt ratios;
- specialized hardware: classical programming tools do not work.

Two main (similar) programming environments for GPGPU:
- CUDA:
  - proprietary technology, only NVIDIA cards;
  - the most popular;
- OpenCL:
  - open specification;
  - works with AMD/Intel/ARM GPUs, multicore CPUs, etc.;
  - less popular.

[Khronos et al. 2008; Nvidia 2014]


SLIDE 55

Programming Accelerators

GPUs — the code is split into host code and device code.

Host code:
- runs on the CPU (x86);
- extended C: runtime API (kernel launches, memory copies, etc.).

Device code:
- runs on the GPU (kernel functions);
- restricted C: "limited recursion", access only to GPU memory, no function pointers, no variable number of arguments, no static variables, etc.

[Khronos et al. 2008; Nvidia 2014]

SLIDE 56

Programming Accelerators

GPUs — logical concepts (OpenCL names in parentheses):
- kernel: a function that executes on a device; it is executed n times in parallel by n different threads (work-items);
- block (work-group): a group of threads; threads inside the same block can synchronize and share memory;
- grid (NDRange): a group of blocks.

[Khronos et al. 2008; Nvidia 2014]


SLIDE 58

Programming Accelerators

GPUs — platform concepts:
- streaming processor, or CUDA core (processing element): ALU + FPU;
- streaming multiprocessor (compute unit): a group of several streaming processors that execute the same instruction.

Memory concepts:
- local memory (private memory): private to a thread;
- shared memory (local memory): private to a block;
- global memory: visible to all kernels (persistent).

SLIDE 59

Programming Accelerators

GPUs — basic steps of host code (CUDA):
1. allocate memory on the device;
2. copy data from host memory to device memory;
3. launch the kernel;
4. copy data from device memory back to host memory;
5. free the device memory.

SLIDE 60

Programming Accelerators

GPUs — CUDA example (host code):

    int main(){
        ...
        float *dA, *dB, *dC;
        /* 1. allocate device memory */
        cudaMalloc((void**)&dA, (N*N)*sizeof(float));
        cudaMalloc((void**)&dB, (N*N)*sizeof(float));
        cudaMalloc((void**)&dC, (N*N)*sizeof(float));
        /* 2. copy the inputs from host to device */
        cudaMemcpy(dA, hA, (N*N)*sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(dB, hB, (N*N)*sizeof(float), cudaMemcpyHostToDevice);
        /* 3. launch the kernel on a 2D grid of 16x16 blocks */
        dim3 threadsPerBlock(16, 16);
        dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y);
        MatAdd<<<numBlocks, threadsPerBlock>>>(dA, dB, dC);
        /* 4. copy the result from device to host */
        cudaMemcpy(hC, dC, (N*N)*sizeof(float), cudaMemcpyDeviceToHost);
        /* 5. free device memory */
        cudaFree(dC);
    }

[Nvidia 2014]


SLIDE 65

Programming Accelerators

GPUs — CUDA example (device code):

    __global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N]){
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = blockIdx.y * blockDim.y + threadIdx.y;
        if (i < N && j < N)
            C[i][j] = A[i][j] + B[i][j];
    }

[Nvidia 2014]

SLIDE 66

Programming Accelerators

Intel Many Integrated Core (Xeon Phi)
- good GFlops/$ and GFlops/watt ratios;
- common hardware (x86): classical programming tools work.

Two programming models:
- offload: parts of the application code are offloaded to the device (similar to the GPU model of execution);
- native execution: the entire application runs on the device.

[Jeffers et al. 2013]

SLIDE 67

Programming Accelerators

Intel Many Integrated Core (Xeon Phi) — platform overview:
- x86-based SMP-on-a-chip with up to 61 cores;
- in-order execution;
- 4 hardware threads per core;
- 64-bit support;
- 512-bit SIMD instructions;
- cores interconnected by a bidirectional ring;
- distributed L2 cache;
- runs a Linux OS.

[Jeffers et al. 2013]

SLIDE 68

Programming Accelerators

Intel Many Integrated Core (Xeon Phi) — offload programming

Execution steps:
1. application execution starts on the host;
2. when a code region tagged for offload is reached, execution is transferred to the device;
3. when this region ends, execution returns to the host;
4. application execution ends on the host.

Two options for offloading:
- Pragma Offload;
- Shared Virtual Memory Model.

[Jeffers et al. 2013]

SLIDE 69

Programming Accelerators

Intel Many Integrated Core (Xeon Phi) — offload programming: Pragma Offload
- C, C++, Fortran;
- the programmer controls data transfers (only contiguous blocks);
- code example:

    #pragma offload target(mic) in(b:length(count), c, d) out(a:length(count))
    #pragma omp parallel for
    for (i = 0; i < count; i++){
        a[i] = b[i] * c + d;
    }

[Jeffers et al. 2013]

SLIDE 70

Programming Accelerators

Intel Many Integrated Core (Xeon Phi) — offload programming: Shared Virtual Memory Model
- Cilk (C, C++);
- the offload runtime controls data transfers (all data types);
- code example:

    _Cilk_offload _Cilk_for (i = 0; i < N; i++){
        a[i] = b[i] + c[i];
    }
    ...
    x = _Cilk_offload func(y);

[Jeffers et al. 2013]

SLIDE 71

Programming Accelerators

Intel Many Integrated Core (Xeon Phi) — native execution

Appropriate for:
- applications without significant amounts of I/O;
- highly parallel applications (that can scale to 200+ threads);
- MPI applications (the device acts as another cluster node).

[Jeffers et al. 2013]

SLIDE 72

1. Why Parallel Programming?
2. Parallel Programming
   - Solving the problem in parallel
   - Implementing the solution
   - Testing the solution
3. Hybrid Parallel Programming
   - Programming Accelerators
   - Hot Topics
4. Conclusion
5. References

SLIDE 73

Hot Topics

Runtime systems for hybrid architectures:
- dynamic scheduling;
- performance portability;
- multi-device management;
- "transparent" data transfers;
- implicit synchronizations;
- examples: StarPU, WormS, XKaapi, etc.

Data-flow programming models:
- task dependencies expressed in terms of data accesses (read, write, read-write);
- example: OpenMP 4.0

    #pragma omp task shared(x) depend(out: x)
    x = 2;
    #pragma omp task shared(x) depend(in: x)
    printf("x = %d\n", x);

[Augonnet et al. 2009; Gautier et al. 2013; Pinto 2013] [OpenMP 2013a,b]

SLIDE 74

Hot Topics

Accelerator support:
- OpenACC;
- OpenMP 4.0;

    int i;
    float p[N], v1[N], v2[N];
    init(v1, v2, N);
    #pragma omp target map(to: v1, v2) map(from: p)
    #pragma omp parallel for
    for (i = 0; i < N; i++)
        p[i] = v1[i] * v2[i];
    output(p, N);

[Group et al. 2011; OpenMP 2013a]

SLIDE 75

1. Why Parallel Programming?
2. Parallel Programming
3. Hybrid Parallel Programming
4. Conclusion
5. References

SLIDE 76

Conclusion

- Parallel computers are mainstream;
- parallel programming tools are (still):
  - designed for homogeneous platforms;
  - low level;
  - difficult to use;
- accelerators: more performance, but more programming effort;
- future trends:
  - even more parallelism and heterogeneity;
  - runtime support;
  - high-level programming models.

SLIDE 77

1. Why Parallel Programming?
2. Parallel Programming
3. Hybrid Parallel Programming
4. Conclusion
5. References

SLIDE 78

Augonnet, Cédric et al. (2009). "StarPU: A Unified Platform for Task Scheduling on Heterogeneous Multicore Architectures". In: Proceedings of the 15th International Euro-Par Conference on Parallel Processing. Euro-Par '09. Delft, The Netherlands: Springer-Verlag, pp. 863-874. ISBN: 978-3-642-03868-6. DOI: 10.1007/978-3-642-03869-3_80. URL: http://dx.doi.org/10.1007/978-3-642-03869-3_80.

Blumofe, Robert D. et al. (1995). "Cilk: an efficient multithreaded runtime system". In: Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. PPOPP '95. Santa Barbara, California, United States: ACM, pp. 207-216. ISBN: 0-89791-700-6. DOI: 10.1145/209936.209958. URL: http://doi.acm.org/10.1145/209936.209958.

SLIDE 79

Bode, Arndt (2013). "Energy to Solution: A New Mission for Parallel Computing". In: Euro-Par 2013 Parallel Processing. Ed. by Felix Wolf, Bernd Mohr and Dieter an Mey. Vol. 8097. Lecture Notes in Computer Science. Springer Berlin Heidelberg, pp. 1-2. ISBN: 978-3-642-40046-9. DOI: 10.1007/978-3-642-40047-6_1. URL: http://dx.doi.org/10.1007/978-3-642-40047-6_1.

Chapman, B., G. Jost and R. van der Pas (2008). Using OpenMP: Portable Shared Memory Parallel Programming. Scientific Computation Series v. 10. MIT Press. ISBN: 9780262533027. URL: http://books.google.com.br/books?id=MeFLQSKmaJYC.

Dean, Jeffrey and Sanjay Ghemawat (2008). "MapReduce: simplified data processing on large clusters". In: Communications of the ACM 51.1, pp. 107-113.

SLIDE 80

Downing, Troy Bryan (1998). Java RMI: Remote Method Invocation. IDG Books Worldwide, Inc.

Forum, Message Passing Interface (1997). MPI-2: Extensions to the Message-Passing Interface. Tech. rep. CDA-9115428. Knoxville, USA: University of Tennessee.

— (2012). MPI: A Message-Passing Interface Standard - Version 3.0. Tech. rep. Knoxville, USA: University of Tennessee.

Gautier, Thierry et al. (2013). "XKaapi: A runtime system for data-flow task programming on heterogeneous architectures". In: Parallel & Distributed Processing (IPDPS), 2013 IEEE 27th International Symposium on. IEEE, pp. 1299-1308.

Grama, A. (2003). Introduction to Parallel Computing. Pearson Education. Addison-Wesley. ISBN: 9780201648652. URL: http://books.google.com.br/books?id=B3jR2EhdZaMC.

Green500 (2014). The Green500 List. URL: http://www.green500.org/.

SLIDE 81

Gropp, W., E. Lusk and A. Skjellum (1999). Using MPI: Portable Parallel Programming with the Message-Passing Interface. Scientific and Engineering Computation v. 1. MIT Press. ISBN: 9780262571326. URL: http://books.google.com.br/books?id=xpBZ0RyRb-oC.

Group, OpenACC Working et al. (2011). The OpenACC Application Programming Interface.

Intel (2014). Cilk Home Page — CilkPlus. Available at http://www.cilkplus.org/. Accessed January 2013.

JáJá, J. (1992). An Introduction to Parallel Algorithms. Addison-Wesley. ISBN: 9780201548563. URL: http://books.google.com.br/books?id=NNmEMgEACAAJ.

Jeffers, James and James Reinders (2013). Intel Xeon Phi Coprocessor High Performance Programming. Newnes.

SLIDE 82

Kale, Laxmikant V and Sanjeev Krishnan (1993). CHARM++: A Portable Concurrent Object Oriented System Based on C++. Vol. 28. 10. ACM.

Khronos OpenCL Working Group et al. (2008). "The OpenCL Specification". A. Munshi, Ed.

Laros III, James H. et al. (2013). "Energy Delay Product". In: Energy-Efficient High Performance Computing. SpringerBriefs in Computer Science. Springer London, pp. 51-55. ISBN: 978-1-4471-4491-5. DOI: 10.1007/978-1-4471-4492-2_8. URL: http://dx.doi.org/10.1007/978-1-4471-4492-2_8.

Mattson, Timothy G., Beverly A. Sanders and Berna L. Massingill (2004). Patterns for Parallel Programming. Vol. 5. Pearson Education, p. 384. ISBN: 0321630033. URL: http://books.google.com/books?id=LNcFvN5Z4RMC&pgis=1.

SLIDE 83

McCool, M., J. Reinders and A. Robison (2012). Structured Parallel Programming: Patterns for Efficient Computation. Elsevier Science. ISBN: 9780123914439. URL: http://books.google.com.br/books?id=2hYqeoO8t8IC.

Nvidia (2014). CUDA Programming Guide. Tech. rep. URL: http://docs.nvidia.com/cuda/cuda-c-programming-guide.

OpenMP (2013a). OpenMP Application Program Interface Examples: Version 4.0.0 - November 2013.

— (2013b). OpenMP Application Program Interface: Version 4.0 - July 2013.

Pacheco, Peter (2011). An Introduction to Parallel Programming. 1st ed. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc. ISBN: 9780123742605.

SLIDE 84

Pinto, Vinicius Garcia (2013). "Escalonamento por roubo de tarefas em sistemas Multi-CPU e Multi-GPU". Master's thesis. Porto Alegre: Programa de Pós-Graduação em Computação, Instituto de Informática, Universidade Federal do Rio Grande do Sul. Handle: 10183/71270. URL: http://www.lume.ufrgs.br/bitstream/handle/10183/71270/000879864.pdf?sequence=1.

Scott, L.R., T. Clark and B. Bagheri (2005). Scientific Parallel Computing. Computer Science/Mathematics. Princeton University Press. ISBN: 9780691119359. URL: http://books.google.com.br/books?id=6vhfQgAACAAJ.

Sottile, M.J., T.G. Mattson and C.E. Rasmussen (2010). Introduction to Concurrency in Programming Languages. A Chapman & Hall Book. Chapman & Hall/CRC Press. ISBN: 9781420072136. URL: http://books.google.com.br/books?id=bBzxmAEACAAJ.

SLIDE 85

UPC Consortium et al. (2013). "UPC Language Specifications v1.3". URL: http://upc.lbl.gov/publications/upc-spec-1.3.pdf.

SLIDE 86

Programming in distributed-memory platforms

Charm++
- C++-based parallel programming system;
- distributed objects (chares):
  - interact by asynchronous method invocations;
  - can be created dynamically;
- runtime system:
  - maps chares to physical processors;
  - usually # of chares ≫ # of processors;
  - load balancing;
  - fault tolerance: automatic checkpointing;
- AMPI (MPI on top of the Charm++ RTS).

SLIDE 87

Hoare channels

Hoare's Communicating Sequential Processes (CSP):
- synchronization;
- communication.

[C.A.R. Hoare. 1985. Communicating Sequential Processes.]

SLIDE 88

Concurrent vs Parallel

[Sottile et al. 2010]


SLIDE 89

Partitioned Global Address Space (PGAS)

- Message passing: scalable, harder to program (?);
- shared memory: easier to program, less scalable (?);
- global address space:
  - use a shared address space (programmability);
  - distinguish local/global (performance);
  - runs on distributed- or shared-memory hardware;
- partitioning a shared address space (see the UPC sketch below):
  - local addresses live on the local processor;
  - remote addresses live on other processors;
  - there may also be a private address space;
  - the programmer controls data placement.

[Source: http://www.cs.cornell.edu/~bindel/class/cs5220-f11/slides/lec10.pdf]
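For a flavor of what PGAS code looks like, here is a minimal UPC sketch (hypothetical, not from the original slides): the shared array is partitioned across threads, and upc_forall's affinity expression &a[i] assigns each iteration to the thread that owns that element, so every write is local:

    #include <upc.h>

    #define N 100
    shared int a[N];            /* partitioned across all UPC threads */

    int main(void) {
        int i;
        upc_forall (i = 0; i < N; i++; &a[i])
            a[i] = MYTHREAD;    /* each thread writes its own elements */
        upc_barrier;            /* wait for all threads */
        return 0;
    }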