Parallel programming 01
Walter Boscheri (walter.boscheri@unife.it)
University of Ferrara - Department of Mathematics and Computer Science
A.Y. 2018/2019 - Semester I
Outline
1. Introduction and motivation
2. Parallel architectures
3. Computational cost
4. Memory
5. Arrays
- 1. Introduction and motivation
Moore’s law
- G. Moore (1965) - co-founder and later CEO of Intel
In 1965 Moore observed that the number of transistors in a dense integrated circuit doubles approximately every two years (his original estimate of every year was revised to every two years in 1975). As a consequence, computation speed has also roughly doubled every two years. Intel Corporation (2019): transistors have become very small. Intel's 10 nm (molecular scale) architecture, released in 2019, arrived several years after the previous 14 nm node, well beyond the two-year cadence the law predicts.
Walter Boscheri Parallel programming 01 2 / 14
- 1. Introduction and motivation
Parallel computing
Parallel computing
Parallel computing is a type of computation in which many calculations, or the execution of processes, are carried out simultaneously. The code is executed on many microprocessors, or on several cores of the same processor, in order to improve the computational efficiency of the simulation.
High Performance Computing (HPC)
High-performance computing (HPC) is the use of supercomputers and parallel processing techniques to solve complex computational problems in science, engineering, or business. HPC technology focuses on developing parallel processing algorithms and systems by incorporating both theoretical and parallel computational techniques.
- 2. Parallel architectures
Von Neumann CPU architecture (1945)
The Central Processing Unit (CPU) is split into the Arithmetic Logic Unit (ALU) and the Control Unit.
The accumulator (register) in the ALU connects input and output.
Random Access Memory (RAM).
Input unit, which enters data into the computer.
Output unit, which returns the elaborated data to the mass storage.
- 2. Parallel architectures
Bus CPU architecture
The bus is a channel which connects all components to each other. A single bus can access only one of the two classes of memory (instructions or data) at a time, thus slowing the rate at which the CPU can work. This seriously limits the effective processing speed when the CPU is required to perform minimal processing on large amounts of data.
- 2. Parallel architectures
Bus CPU architecture
Dual Independent Bus (DIB) Architecture (Intel and AMD)
Two (dual) independent data I/O buses enable the processor to access data from either of its buses simultaneously and in parallel, rather than in a single, sequential manner (as in a single-bus system). The second, or backside, bus in a processor with DIB is used for the L2 cache, allowing it to run at much greater speed than if it had to share the main processor bus.
- 2. Parallel architectures
Parallel architecture: memory structure
SHARED
There is a global memory space accessible by all processors. Processors might also have some local memory. Algorithms may use global data structures efficiently.
Examples: multicore CPU, GPU.
DISTRIBUTED
All memory is associated with individual processors. To retrieve information from another processor's memory, a message must be sent. Algorithms should use distributed data structures.
Examples: network of computers, supercomputers.
- 2. Parallel architectures
Parallel architecture: memory structure
Shared memory architectures (multicore CPU):
Intel 9th generation (2019): up to 8 cores, 16 threads, 5 GHz, 16 MB cache
AMD Ryzen (2019): up to 12 cores, 24 threads, 4.6 GHz, 64 MB cache
Distributed memory architectures:
MARCONI A3 supercomputer (Bologna, Italy)
Nodes: 1512, RAM: 196 GB/node
Processors: 2 × 24-core Intel Xeon 8160 (SkyLake), 2.10 GHz
Peak performance: 8 PFLOP/s
SUPERMUC-NG supercomputer (Munich, Germany)
Nodes: 6336, RAM: 96 GB/node
Processors: 2 × 24-core Intel Xeon Platinum 8174 (SkyLake), 3.90 GHz
Peak performance: 26.3 PFLOP/s
- 3. Computational cost
Cost estimation of an algorithm
The cost of an algorithm is measured by the number of operations needed to execute the code.
FLOP
FLOP = FLoating point OPeration
1 FLOP is the cost of a sum or a multiplication between floating point numbers (real numbers).
1 GFLOP = 10^9 FLOP
1 TFLOP = 10^12 FLOP
1 PFLOP = 10^15 FLOP
1 EFLOP = 10^18 FLOP
- 3. Computational cost
Example
Dot product
Given two vectors a, b ∈ R^N, the dot product c is evaluated as

c = Σ_{i=1}^{N} a(i) · b(i)

Each iteration i performs a total of 2 floating point operations, namely a sum and a product. Thus, the computational cost of the dot product is 2N FLOP.
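The 2N cost can be seen directly in code. A minimal FORTRAN sketch (array contents and size N are illustrative test values, not part of the slide):

```fortran
program dot_cost
  implicit none
  integer, parameter :: N = 1000
  integer :: i
  real(8) :: a(N), b(N), c

  a = 1.0d0                 ! illustrative test data
  b = 2.0d0

  c = 0.0d0
  do i = 1, N
     c = c + a(i)*b(i)      ! one sum and one product: 2 FLOP per iteration
  end do

  print *, 'dot product =', c            ! 2000 for this test data
  print *, 'cost =', 2*N, 'FLOP'
end program dot_cost
```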
Exercise
Work out the computational cost of i) a matrix-vector and ii) a matrix-matrix multiplication of size [N × N]. Write a FORTRAN code to compare the theoretical results against numerical evidence.
- 3. Computational cost
Computational speed of a system
The computational speed of a system is measured as the number of floating point operations that can be executed in one second, thus FLOP/s.
1 GFLOP/s = 10^9 FLOP/s
1 TFLOP/s = 10^12 FLOP/s
1 PFLOP/s = 10^15 FLOP/s
1 EFLOP/s = 10^18 FLOP/s
Examples:
SUPERMUC-NG (Munich): 26.3 PFLOP/s
MARCONI A3 (Bologna): 8 PFLOP/s
HLRS (Stuttgart): 7.4 PFLOP/s
List of TOP 500 supercomputers
- 3. Computational cost
Computational speed of a processor
Intel Core i7-8750H (8th gen)
Cores: 6, Threads: 12
Clock speed: 2.20 GHz
FLOP/cycle: 4
Speed: 6 (cores) × 4 (FLOP/cycle) × 2.20·10^9 (clock) = 52.8 GFLOP/s
Intel Skylake i7-9800X (9th gen)
Cores: 8, Threads: 16
Clock speed: 4.40 GHz
FLOP/cycle: 16
Speed: 8 (cores) × 16 (FLOP/cycle) × 4.40·10^9 (clock) = 563.2 GFLOP/s
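The peak-speed formula is just a product of three factors (cores × operations per cycle × clock frequency); a few lines suffice to check the arithmetic above:

```fortran
program peak_speed
  implicit none
  real(8) :: speed
  ! peak speed = cores × FLOP per cycle × clock frequency
  speed = 6.0d0 * 4.0d0 * 2.20d9      ! Intel Core i7-8750H
  print *, speed / 1.0d9, 'GFLOP/s'   ! 52.8
  speed = 8.0d0 * 16.0d0 * 4.40d9     ! Intel i7-9800X
  print *, speed / 1.0d9, 'GFLOP/s'   ! 563.2
end program peak_speed
```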
- 4. Memory
Memory bandwidth
Memory bandwidth is the rate at which data can be transferred between memory and processor, typically measured in GB/s. The theoretical memory bandwidth is computed as the product of:
base RAM clock frequency;
number of data transfers per clock: double data rate (DDR) memory transfers two bits per clock cycle;
memory bus width: each DDR memory interface is 64 bits wide (a so-called line);
number of interfaces: modern computers typically use two memory interfaces, referred to as dual-channel mode, for a 128-bit total bus width.
Example: dual-channel memory with DDR4-3200 (1600 MHz base clock)

1600·10^6 cycle/s × 2 transfer/cycle × 64 bit/transfer × 2 interfaces = 409.6·10^9 bit/s = 51.2 GB/s
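The bandwidth formula above can be sketched in a few lines of FORTRAN (the DDR4-3200 figures are those from the slide; the final division by 8 converts bits to bytes):

```fortran
program mem_bandwidth
  implicit none
  real(8) :: clock, bw
  clock = 1600.0d6                      ! base RAM clock (DDR4-3200)
  ! 2 transfers/cycle (DDR) × 64-bit bus × 2 channels, converted to bytes
  bw = clock * 2.0d0 * 64.0d0 * 2.0d0 / 8.0d0
  print *, bw / 1.0d9, 'GB/s'           ! 51.2
end program mem_bandwidth
```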
- 4. Memory
Memory bandwidth
Example: floating point operation c = a · b
3 memory accesses: a, b, and c.
24 bytes must be transferred (a, b, c assumed to be double precision floats).
⇓
To run at peak speed, the processor would need to transfer data at the following rates:
24 bytes × 52.8 GFLOP/s = 1267.2 GB/s (Intel Core i7-8750H)
24 bytes × 563.2 GFLOP/s = 13516.8 GB/s (Intel Skylake i7-9800X)
These rates are far beyond the roughly 51.2 GB/s that a dual-channel DDR4 system actually provides. As the speed gap between CPU and memory widens, the memory hierarchy has become the primary factor limiting program performance.
- 4. Memory
Memory hierarchy
registers (on chip): up to 64 bit (vector processors)
cache memory: L1 on chip, and L2
Random Access Memory (RAM)
mass storage
Memory access time
The time interval between the instant at which an instruction control unit initiates a call for data (or a request to store data) and the instant at which the delivery of the data is completed (or the storage is started); in other words, how long it takes to locate a piece of data in memory.
- 4. Memory
Memory hierarchy
Cache is a fast-access memory that allows data to be temporarily kept close to the CPU, thus reducing memory access time. There are two types of locality:
temporal locality: the same data are likely to be accessed more than once within a short time;
spatial locality: when a specific datum is accessed and copied to the cache, it is very likely that spatially close data will be accessed as well.
Data loaded into the cache are kept there as long as possible.
- 5. Arrays
FORTRAN: array allocation
Let A ∈ R^(M×N) be a matrix (array of rank 2). In FORTRAN, arrays of rank greater than 1 are stored by columns: the leftmost index varies fastest. A matrix A of components a_ij is therefore stored in the following order:
col 1: a11, a21, ..., aM1
col 2: a12, a22, ..., aM2
...
col j: a1j, a2j, ..., aMj
...
col N: a1N, a2N, ..., aMN
Elements of the same column are close in memory, thus the fastest way to access array data is to loop column by column. The cache is then fully exploited.
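A minimal sketch contrasting the two loop orders (the matrix size is an illustrative choice; on matrices this large the column-by-column version is noticeably faster because consecutive accesses are adjacent in memory):

```fortran
program loop_order
  implicit none
  integer, parameter :: M = 2000, N = 2000
  real(8), allocatable :: A(:,:)
  real(8) :: s
  integer :: i, j

  allocate(A(M,N))
  A = 1.0d0

  ! cache-friendly: leftmost index i varies fastest (column by column)
  s = 0.0d0
  do j = 1, N
     do i = 1, M
        s = s + A(i,j)
     end do
  end do
  print *, 'column-wise sum =', s

  ! cache-unfriendly: a stride of M elements between consecutive accesses
  s = 0.0d0
  do i = 1, M
     do j = 1, N
        s = s + A(i,j)
     end do
  end do
  print *, 'row-wise sum    =', s
end program loop_order
```

Both loops compute the same sum (M·N); only the memory access pattern differs.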
- 5. Arrays
FORTRAN: matrix multiplication
Let (A, B) ∈ R^(N×N) be arrays of rank 2. The matrix-matrix product C = A × B can be evaluated using:
a DO loop;
the MATMUL intrinsic function of FORTRAN;
the DGEMM routine from the Math Kernel Library (MKL).
DGEMM syntax
DGEMM(transa, transb, m, n, k, alpha, A, lda, B, ldb, beta, C, ldc)
The DGEMM routine computes a scalar-matrix-matrix product and adds the result to a scalar-matrix product. The operation is defined as
C := alpha · op(A) · op(B) + beta · C,
where the operator op(X) can be op(X) = X, op(X) = X^T, or op(X) = X^H.
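A self-contained sketch comparing the first two techniques (DGEMM is shown only as a comment, since it requires linking against MKL or another BLAS library; N is an illustrative size):

```fortran
program matmul_check
  implicit none
  integer, parameter :: N = 100
  real(8) :: A(N,N), B(N,N), C_loop(N,N), C_intr(N,N)
  integer :: i, j, k

  call random_number(A)
  call random_number(B)

  ! triple DO loop, column by column (cost: 2*N**3 FLOP)
  C_loop = 0.0d0
  do j = 1, N
     do k = 1, N
        do i = 1, N
           C_loop(i,j) = C_loop(i,j) + A(i,k)*B(k,j)
        end do
     end do
  end do

  ! intrinsic function
  C_intr = matmul(A, B)

  print *, 'max difference =', maxval(abs(C_loop - C_intr))
  ! With MKL linked, the equivalent BLAS call would be:
  ! call dgemm('N', 'N', N, N, N, 1.0d0, A, N, B, N, 0.0d0, C_intr, N)
end program matmul_check
```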
- 5. Arrays
FORTRAN: matrix multiplication
Exercise
Write a FORTRAN code to compute the matrix-matrix product C = A × B with (A, B, C) ∈ R^(N×N). Compare the efficiency (computational time) of the following techniques:
DO loop (multiplication by rows);
DO loop (multiplication by columns);
the MATMUL intrinsic function;
DGEMM from MKL.
Consider N = 10, N = 100, N = 1000.