Qualifying Exam
Student: Vinícius Garcia Pinto. Advisor: Nicolas Maillard. Research line: Parallel and Distributed Processing. Area: Parallel Programming. Depth topic: Hybrid Parallel Programming.
Agenda
1. Why Parallel Programming?
2. Parallel Programming
3. Hybrid Parallel Programming
4. Conclusion
5. References
Qualifying Exam Why Parallel Programming?
Introduction
Parallel Computing
use of two or more processing units to solve a single problem.
Goals:
Solve problems in less time; Solve larger problems;
Where?
Climate modeling; Energy research; Data analysis; Simulation;
[JáJá 1992; Mattson et al. 2004; Pacheco 2011; Scott et al. 2005]
A current parallel computer

[Figure: two multicore Xeon E5-2630 processors plus a Tesla K20m GPU (GPPD Orion cluster, node #1), next to a mobile SoC with a 1.9 GHz Cortex-A15 cluster, a 1.3 GHz Cortex-A7 cluster, and a Mali-T628 GPU (Samsung Galaxy S5).]
Introduction
Parallel Computers are (now) mainstream:
vector instructions, multithreaded cores, multicore processors, graphics engines, accelerators; not only for scientific (or HPC) applications.
But...
None of the most popular programming languages was designed for parallel computing; Many programmers have never written a parallel program; Tools for parallel computing were designed for homogeneous supercomputers.
All programmers could be parallel programmers!
[McCool et al. 2012]
Qualifying Exam Parallel Programming (breadth)
Qualifying Exam Parallel Programming (breadth) Solving the problem in parallel
1. Why Parallel Programming?
2. Parallel Programming
   - Solving the problem in parallel
   - Implementing the solution
   - Testing the solution
3. Hybrid Parallel Programming
   - Programming Accelerators
   - Hot Topics
4. Conclusion
5. References
Solving a Problem in Parallel
Programmer’s tasks:
1. Identify the concurrency in the problem;
2. Structure an algorithm to exploit this concurrency;
3. Implement the solution with a suitable programming environment.
Main challenges:
Identify and manage dependencies between concurrent tasks; manage the additional kinds of errors introduced by parallelism; improve on the sequential solution (if one exists).
[Mattson et al. 2004; Sottile et al. 2010]
Finding concurrency

Decompose the problem

Identify sequences of steps that can be executed together and (possibly) at the same time → tasks [Sottile et al. 2010]
Granularity: size of each task vs. number of tasks [Grama 2003]:
fine-grained: large number of smaller tasks;
coarse-grained: small number of larger tasks.
Degree of concurrency: number of tasks that can be executed simultaneously.
Finding concurrency
Describe dependencies between tasks
logical dependencies:
the order in which specific operations must be executed;
data dependencies:
the order in which data elements must be updated.
Structuring the algorithm
Common patterns: Fork-join:
divide sequential flow into multiple parallel flows; join parallel flows back to the sequential flow; usually used to implement parallel divide and conquer;
[JáJá 1992; McCool et al. 2012]
Example: merge-sort
Structuring the algorithm
Common patterns: Stencil
apply a function to each element and its neighbors; the output is a combination of the values of the current element and its neighbors.
[JáJá 1992; McCool et al. 2012]
Example: computational fluid dynamics
Structuring the algorithm
Common patterns: Map
apply a function to all elements of a collection, producing a new collection.
Recurrence
similar to map, but elements can use the outputs of adjacent elements as inputs.
[McCool et al. 2012]
Example: parallel for
Structuring the algorithm
Common patterns: Reduction
combine all elements in a collection into a single element using an associative combiner function.
[McCool et al. 2012]
Example: MPI Reduce
Structuring the algorithm
Common patterns: Scan
computes all partial reductions of a collection: for every output position, a reduction of the input up to that point is computed.
[McCool et al. 2012]
Example: prefix sum
Qualifying Exam Parallel Programming (breadth) Implementing the solution
Implementing the solution
Parallel programming environments, in general, abstract the hardware organization:
Message passing for distributed-memory platforms
e.g. MPI;
Threads for shared-memory platforms
e.g. OpenMP;
But, there are exceptions:
Google Go is a multicore programming language whose goroutines communicate by message passing over channels (Hoare's CSP).
Programming in distributed-memory platforms
MPI (Message Passing Interface): the de facto standard; processes communicate by exchanging messages:
send/receive; point-to-point/collective communications;
several implementations:
Open MPI, MPICH, etc;
[Gropp et al. 1999]
Programming in distributed-memory platforms
MPI (Message Passing Interface) All processes run the same binary program (MPI-1);
SPMD; Each process is identified by a rank;
each process executes tests (if... then) to run only the parts of the program relevant to its rank;
Dynamic process creation (MPI-2);
MPMD; process creation after MPI application has started;
[Forum 1997]
Programming in distributed-memory platforms
...
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
if (myrank == 0) {
    strcpy(message, "Hello, there");
    MPI_Send(message, strlen(message)+1, MPI_CHAR, 1, 99, MPI_COMM_WORLD);
} else if (myrank == 1) {
    MPI_Recv(message, 13, MPI_CHAR, 0, 99, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    printf("Received :%s:\n", message);
}
MPI_Finalize();
Programming in distributed-memory platforms
MPI (Message Passing Interface) Version 3 (MPI-3)
nonblocking collective operations; neighborhood collective communication; new one-sided communication operations; Fortran 2008 bindings; removed/deprecated functionalities:
C++ bindings; [Forum 2012]
Programming in distributed-memory platforms
MPI (Message Passing Interface) Pros:
widely used; scalability; portability; no (explicit) locks.
Cons:
low level; explicit data distribution; "the assembly code of parallel computing";
Programming in distributed-memory platforms
Are there alternatives? Low level:
sockets (OS);
High level:
RMI/RPC: Java RMI;
Charm++;
MapReduce (Hadoop);
Partitioned Global Address Space (PGAS): Unified Parallel C (UPC);
[Dean et al. 2008; Downing 1998; Kale et al. 1993; UPC Consortium 2013]
Programming in shared-memory platforms
OpenMP (Open Multi-Processing) Collection of compiler directives and library functions;
C, C++, Fortran;
#pragma omp <parallel, for, task, shared, private, critical, barrier, reduction, ...>
Multi-thread programming; Sequentially equivalent / Incremental parallelism; Master thread / Slave threads;
[Chapman et al. 2008]
Programming in shared-memory platforms
OpenMP (Open Multi-Processing) Loop Parallelism
#pragma omp parallel for
for (i = 0; i < N; i++) {
    res[i] = something(i);
}
Task Parallelism (OpenMP 3+)
#pragma omp task shared(x)
x = fib(n-1);
#pragma omp task shared(y)
y = fib(n-2);
#pragma omp taskwait
result = x + y;
[Chapman et al. 2008]
Programming in shared-memory platforms
OpenMP (Open Multi-Processing) Pros:
widely used; (very) simple; high level (despite the pragmas); accelerator support (OpenMP 4);
Cons:
risk of race conditions; simple scheduling (round-robin); a sequential starting program may be inadequate for extracting parallelism;
[Chapman et al. 2008]
Programming in shared-memory platforms
Are there alternatives? Low level:
Process (OS); Pthreads; C++ Threads (Boost.Thread).
High Level:
Cilk, Cilk++, Intel Cilk Plus; Intel TBB; Google Go.
Programming in shared-memory platforms
Cilk C extensions (compiler) + runtime system;
only 3 keywords: cilk_spawn, cilk_sync, cilk_for;
Task parallelism
parent-child dependencies;
x = cilk_spawn fib(n-1);
y = cilk_spawn fib(n-2);
cilk_sync;
result = x + y;
Efficient scheduler;
work stealing;
[Blumofe et al. 1995; Intel 2014]
Qualifying Exam Parallel Programming (breadth) Testing the solution
Testing the solution
Metrics:
Speedup: how many times faster is the parallel program than the sequential one?
Sp = T1 / Tp   (T1: sequential time; Tp: time on p processors)
Efficiency: how effective is the use of the parallel resources?
Ep = Sp / p
[Mattson et al. 2004]
Speedup
[Plot: speedup vs. number of processors (1-10), showing superlinear, linear, and typical curves.]
Efficiency
[Plot: efficiency vs. number of processors (1-10), showing superlinear, linear, and typical curves.]
Testing the solution
Other metrics?
FLOPS; Energy efficiency:
FLOPS per watt; Energy to solution; EDP (Energy Delay Product).
Bandwidth:
memory bandwidth; network bandwidth. [Bode 2013; Green500 2014; Laros III et al. 2013]
Until now

Parallel computers are mainstream;
Distributed-memory: MPI, RMI, Charm++, PGAS;
Shared-memory: OpenMP, Cilk, TBB;
But... what about?
Clusters with multicore nodes; GPUs; manycore coprocessors;
Qualifying Exam Hybrid Parallel Programming (depth)
Combining OpenMP and MPI
Exploit hierarchical parallelism
Multiple nodes in the cluster (MPI); Multicore processors in the nodes (OpenMP);
[Chapman et al. 2008]
Combining OpenMP and MPI
Pros:
software approach matches the hardware; some applications expose two levels of parallelism:
coarse-grained → MPI; fine-grained → OpenMP;
load balancing; increases the amount of parallelism without adding more processes; reduces memory consumption (fewer MPI processes per node); [Chapman et al. 2008]
Combining OpenMP and MPI
Cons:
some applications expose only one level of parallelism; interaction between the MPI and OpenMP runtime libraries; portability; hard to use:
two different models and environments in the same code;
slave threads sit idle while the master thread communicates;
[Chapman et al. 2008]
Qualifying Exam Hybrid Parallel Programming (depth) Programming Accelerators
Programming Accelerators
Accelerators can perform some computations faster than a general-purpose CPU.
optimized for execution throughput; lots of (simple) cores; # of threads ≫ # of cores; separate memory spaces; slave device(s) attached to a host via PCIe;
there are exceptions: Cell BE, Tegra, Exynos (accelerator in the same board/chip).
Programming Accelerators
GPUs good GFlops/$ and GFlops/Watt ratios; specialized hardware:
classical programming tools do not work;
Two main (similar) programming environments for GPGPU:
CUDA:
proprietary technology; only NVIDIA cards; most popular;
OpenCL:
open specification; works with AMD/Intel/ARM GPUs, multicore CPUs, etc.; less popular;
[Khronos et al. 2008; Nvidia 2014]
Programming Accelerators
GPUs: code is split into Host Code and Device Code
⋄ Host Code:
runs on the CPU (x86); extended C:
runtime API (kernel launches, memory copies, etc.);
⋄ Device Code:
runs on the GPU (kernel functions); restricted C:
limited recursion; access only to GPU memory; restricted function pointers; no variable # of arguments; no static variables; etc.
[Khronos et al. 2008; Nvidia 2014]
Programming Accelerators
GPUs Logical concepts:
kernel - function that executes on a device;
executed n times in parallel by n different threads (work items*)
block (work group) - group of threads;
threads inside the same block can synchronize and share memory;
grid (NDRange) - group of blocks;
[Khronos et al. 2008; Nvidia 2014] [*OpenCL names in parentheses.]
Programming Accelerators
GPUs Platform concepts:
streaming processor or CUDA core (processing element): ALU + FPU;
streaming multiprocessor (compute unit): group of several streaming processors that execute the same instruction;
Memory concepts:
local memory (private memory): private to a thread;
shared memory (local memory): private to a block;
global memory: visible to all kernels (persistent);
Programming Accelerators
GPUs: basic steps of the host code (CUDA):
1. Allocate memory on the device;
2. Copy data from host memory to the allocated device memory;
3. Launch the kernel;
4. Copy data from device memory back to host memory;
5. Free the allocated device memory.
Programming Accelerators
GPUs CUDA example (host code):
int main() {
    ...
    float *dA, *dB, *dC;
    cudaMalloc((void**)&dA, (N*N)*sizeof(float));
    cudaMalloc((void**)&dB, (N*N)*sizeof(float));
    cudaMalloc((void**)&dC, (N*N)*sizeof(float));
    cudaMemcpy(dA, hA, (N*N)*sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, (N*N)*sizeof(float), cudaMemcpyHostToDevice);
    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y);
    MatAdd<<<numBlocks, threadsPerBlock>>>(dA, dB, dC);
    cudaMemcpy(hC, dC, (N*N)*sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dA);  /* free all three device buffers, not only dC */
    cudaFree(dB);
    cudaFree(dC);
}
[Nvidia 2014]
Programming Accelerators
GPUs CUDA example (device code):
__global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N]) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < N && j < N)
        C[i][j] = A[i][j] + B[i][j];
}
[Nvidia 2014]
Programming Accelerators
Intel Many Integrated Core (Xeon Phi) good GFlops/$ and GFlops/Watt ratios; common hardware (x86):
classical programming tools work;
Two programming models:
offload:
parts of the application code are offloaded to the device (similar to the GPU model of execution);
native execution:
the entire application runs on the device. [Jeffers et al. 2013]
Programming Accelerators
Intel Many Integrated Core (Xeon Phi) Platform overview:
x86-based SMP-on-a-chip; up to 61 cores;
in-order execution; 4 hardware threads per core; 64-bit support; 512-bit SIMD instructions; cores interconnected by a bidirectional ring;
distributed L2 cache; runs a Linux OS;
[Jeffers et al. 2013]
Programming Accelerators
Intel Many Integrated Core (Xeon Phi) Offload programming
Execution steps:
1. application execution starts on the host;
2. when a code region tagged as offload is reached, execution is transferred to the device;
3. when this region ends, execution returns to the host;
4. application execution ends on the host.
Two options for offloading:
Pragma Offload; Shared Virtual Memory Model; [Jeffers et al. 2013]
Programming Accelerators
Intel Many Integrated Core (Xeon Phi) Offload programming
Pragma Offload
C, C++, Fortran. Programmer controls data transfers (only contiguous blocks); Code example:
#pragma offload target(mic) in(b:length(count), c, d) out(a:length(count))
#pragma omp parallel for
for (i = 0; i < count; i++) {
    a[i] = b[i] * c + d;
}
[Jeffers et al. 2013]
Programming Accelerators
Intel Many Integrated Core (Xeon Phi) Offload programming
Shared Virtual Memory Model
Cilk (C, C++); Offload controls data transfers (all data types); Code example:
_Cilk_offload _Cilk_for (i = 0; i < N; i++) {
    a[i] = b[i] + c[i];
}
...
x = _Cilk_offload func(y);
[Jeffers et al. 2013]
Programming Accelerators
Intel Many Integrated Core (Xeon Phi) Native Execution
Appropriate when:
applications without significant amounts of I/O; highly parallel applications (that can scale to 200+ threads); with MPI (the device acts as another cluster node); [Jeffers et al. 2013]
Qualifying Exam Hybrid Parallel Programming (depth) Hot Topics
Hot Topics
Runtime systems for Hybrid Architectures
dynamic scheduling; performance portability; multi-device management; "transparent" data transfers; implicit synchronizations; examples: StarPU, WormS, XKaapi, etc.;
Data-flow programming models:
task dependencies expressed in terms of data accesses (read, write, read-write); example: OpenMP 4.0
#pragma omp task shared(x) depend(out: x)
x = 2;
#pragma omp task shared(x) depend(in: x)
printf("x = %d\n", x);
[Augonnet et al. 2009; Gautier et al. 2013; Pinto 2013] [OpenMP 2013a,b]
Hot Topics
Accelerators support:
OpenACC; OpenMP 4.0;
int i;
float p[N], v1[N], v2[N];
init(v1, v2, N);
#pragma omp target map(to: v1, v2) map(from: p)
#pragma omp parallel for
for (i = 0; i < N; i++)
    p[i] = v1[i] * v2[i];
output(p, N);
[Group et al. 2011; OpenMP 2013a]
Qualifying Exam Conclusion
Conclusion
Parallel computers are mainstream; Parallel programming tools (still) are:
designed for homogeneous platforms; low level; difficult to use;
Accelerators:
more performance; more programming effort;
Future trends:
still more parallelism and heterogeneity; runtime support; high-level programming models;
Qualifying Exam References
Augonnet, Cédric et al. (2009). "StarPU: A Unified Platform for Task Scheduling on Heterogeneous Multicore Architectures". In: Proceedings of the 15th International Euro-Par Conference on Parallel Processing. Euro-Par '09. Delft, The Netherlands: Springer-Verlag, pp. 863-874. isbn: 978-3-642-03868-6. doi: 10.1007/978-3-642-03869-3_80.
Blumofe, Robert D. et al. (1995). "Cilk: an efficient multithreaded runtime system". In: Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming. PPOPP '95. Santa Barbara, California, United States: ACM, pp. 207-216. isbn: 0-89791-700-6. doi: 10.1145/209936.209958.
Bode, Arndt (2013). "Energy to Solution: A New Mission for Parallel Computing". In: Euro-Par 2013 Parallel Processing. Ed. by Felix Wolf, Bernd Mohr, and Dieter an Mey. Vol. 8097. Lecture Notes in Computer Science. Springer Berlin Heidelberg, pp. 1-2. isbn: 978-3-642-40046-9. doi: 10.1007/978-3-642-40047-6_1.
Chapman, B., G. Jost, and R. van der Pas (2008). Using OpenMP: Portable Shared Memory Parallel Programming. Scientific Computation Series v. 10. MIT Press. isbn: 9780262533027.
Dean, Jeffrey and Sanjay Ghemawat (2008). "MapReduce: simplified data processing on large clusters". In: Communications of the ACM 51.1, pp. 107-113.
Downing, Troy Bryan (1998). Java RMI: remote method invocation. IDG Books Worldwide, Inc.
Forum, Message Passing Interface (1997). MPI-2: Extensions to the Message-Passing Interface. Tech. rep. CDA-9115428. Knoxville, USA: University of Tennessee.
— (2012). MPI: A Message-Passing Interface Standard - Version 3.0. Tech. rep. Knoxville, USA: University of Tennessee.
Gautier, Thierry et al. (2013). "XKaapi: A runtime system for data-flow task programming on heterogeneous architectures". In: Parallel & Distributed Processing (IPDPS), 2013 IEEE 27th International Symposium on. IEEE, pp. 1299-1308.
Grama, A. (2003). Introduction to Parallel Computing. Pearson Education. Addison-Wesley. isbn: 9780201648652.
Green500 (2014). The Green500 List. url: http://www.green500.org/.
Gropp, W., E. Lusk, and A. Skjellum (1999). Using MPI: Portable Parallel Programming with the Message-passing Interface. Scientific and engineering computation v. 1. MIT Press. isbn: 9780262571326.
Group, OpenACC Working et al. (2011). The OpenACC Application Programming Interface.
Intel (2014). Cilk Home Page — CilkPlus. Available at http://www.cilkplus.org/. Accessed January 2013.
JáJá, J. (1992). An Introduction to Parallel Algorithms. Addison-Wesley. isbn: 9780201548563.
Jeffers, James and James Reinders (2013). Intel Xeon Phi Coprocessor High Performance Programming. Newnes.
Kale, Laxmikant V and Sanjeev Krishnan (1993). CHARM++: a portable concurrent object oriented system based on C++. Vol. 28. 10. ACM.
Khronos, OpenCL Working Group et al. (2008). "The OpenCL specification". In: A. Munshi, Ed.
Laros III, James H. et al. (2013). "Energy Delay Product". In: Energy-Efficient High Performance Computing. SpringerBriefs in Computer Science. Springer London, pp. 51-55. isbn: 978-1-4471-4491-5. doi: 10.1007/978-1-4471-4492-2_8.
Mattson, Timothy G., Beverly A. Sanders, and Berna L. Massingill (2004). Patterns for Parallel Programming. Pearson Education, p. 384. isbn: 0321630033.
McCool, M., J. Reinders, and A. Robison (2012). Structured Parallel Programming: Patterns for Efficient Computation. Elsevier Science. isbn: 9780123914439.
Nvidia (2014). CUDA Programming Guide. Tech. rep. url: http://docs.nvidia.com/cuda/cuda-c-programming-guide.
OpenMP (2013a). OpenMP Application Program Interface Examples: Version 4.0.0 - November 2013.
— (2013b). OpenMP Application Program Interface: Version 4.0 - July 2013.
Pacheco, Peter (2011). An Introduction to Parallel Programming. 1st. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc. isbn: 9780123742605.
Pinto, Vinicius Garcia (2013). "Escalonamento por roubo de tarefas em sistemas Multi-CPU e Multi-GPU" [Task work-stealing scheduling on multi-CPU and multi-GPU systems]. Master's thesis. Porto Alegre: Programa de Pós-Graduação em Computação, Instituto de Informática, Universidade Federal do Rio Grande do Sul. url: http://www.lume.ufrgs.br/bitstream/handle/10183/71270/000879864.pdf?sequence=1.
Scott, L.R., T. Clark, and B. Bagheri (2005). Scientific Parallel Computing. Princeton University Press. isbn: 9780691119359.
Sottile, M.J., T.G. Mattson, and C.E. Rasmussen (2010). Introduction to Concurrency in Programming Languages. Chapman & Hall/CRC Press. isbn: 9781420072136.
UPC Consortium et al. (2013). "UPC language specifications v1.3". url: http://upc.lbl.gov/publications/upc-spec-1.3.pdf.
Programming in distributed-memory platforms
Charm++ C++-based parallel programming system; Distributed Objects (Chares):
Interact by asynchronous method invocations; Can be created dynamically.
Runtime system:
maps chares to physical processors;
Usually: # chares ≫ # processors;
Load balancing; Fault tolerance;
Automatic checkpointing;
AMPI (MPI on top of Charm++ RTS);
Hoare channels
Hoare’s Communicating Sequential Processes (CSP) Synchronization; Communication;
[C.A.R. Hoare. 1985. Communicating sequential processes.]
Concurrent vs Parallel
[Sottile et al. 2010]