Parallel programming 01
Walter Boscheri (walter.boscheri@unife.it)
University of Ferrara - Department of Mathematics and Computer Science
A.Y. 2018/2019 - Semester I
Outline
1. Introduction and motivation
2. Parallel architectures
3. Computational cost
4. Memory
5. Arrays
- 1. Introduction and motivation
Moore’s law
- G. Moore (1965) - co-founder and later CEO of Intel
In 1965 Moore observed that the number of transistors in a dense integrated circuit doubles approximately every two years (his original estimate of every year was revised to every two years in 1975). As a consequence, computation speed has also roughly doubled every two years. Intel Corporation (2019): transistors have become very small. Intel's 10 nm (molecular scale) architecture, released in 2019, arrived several years after the previous 14 nm node, well beyond the two-year cadence the law predicts.
Walter Boscheri Parallel programming 01 2 / 14
- 1. Introduction and motivation
Parallel computing
Parallel computing
Parallel computing is a type of computation in which many calculations, or the execution of processes, are carried out simultaneously. The code is executed on many microprocessors, or on several cores of the same processor, in order to improve the computational efficiency of the simulation.
High Performance Computing (HPC)
High-performance computing (HPC) is the use of supercomputers and parallel processing techniques to solve complex computational problems in science, engineering, or business. HPC technology focuses on developing parallel processing algorithms and systems by incorporating both theoretical and parallel computational techniques.
- 2. Parallel architectures
Von Neumann CPU architecture (1945)
The Central Processing Unit (CPU) is split into the Arithmetic Logic Unit (ALU) and the Control Unit.
The accumulator (register) in the ALU connects input and output.
Random Access Memory (RAM).
Input unit, which enters data into the computer.
Output unit, which returns the elaborated data to the mass storage.
- 2. Parallel architectures
Bus CPU architecture
The bus is a channel which connects all components to each other. A single bus can access only one of the two classes of memory (instructions or data) at a time, thus slowing the rate at which the CPU can work. This seriously limits the effective processing speed when the CPU is required to perform minimal processing on large amounts of data.
- 2. Parallel architectures
Bus CPU architecture
Dual Independent Bus (DIB) Architecture (Intel and AMD)
Two (dual) independent data I/O buses enable the processor to access data from either of its buses simultaneously and in parallel, rather than in a single, sequential manner (as in a single-bus system). The second, or backside, bus in a processor with DIB is used for the L2 cache, allowing it to run at much greater speed than if it had to share the main processor bus.
- 2. Parallel architectures
Parallel architecture: memory structure
SHARED
There is a global memory space accessible by all processors. Processors might also have some local memory. Algorithms may use global data structures efficiently.
Examples: multicore CPU, GPU.
DISTRIBUTED
All memory is associated with individual processors. To retrieve information from another processor's memory, a message must be sent. Algorithms should use distributed data structures.
Examples: network of computers, supercomputers.
- 2. Parallel architectures
Parallel architecture: memory structure
Shared memory architectures (multicore CPU):
Intel 9th generation (2019): up to 8 cores, 16 threads, 5 GHz, 16 MB cache
AMD Ryzen (2019): up to 12 cores, 24 threads, 4.6 GHz, 64 MB cache
Distributed memory architectures:
MARCONI A3 supercomputer (Bologna, Italy)
Nodes: 1512, RAM: 196 GB/node
Processors: 2 × 24-core Intel Xeon 8160 (SkyLake), 2.10 GHz
Peak performance: 8 PFLOP/s
SUPERMUC-NG supercomputer (Munich, Germany)
Nodes: 6336, RAM: 96 GB/node
Processors: 2 × 24-core Intel Xeon Platinum 8174 (SkyLake), 3.90 GHz
Peak performance: 26.3 PFLOP/s
- 3. Computational cost
Cost estimation of an algorithm
The cost of an algorithm is measured by the number of operations needed to execute the code.
FLOP
FLOP = FLoating point OPeration
1 FLOP is the cost of a sum or a multiplication between floating point numbers (real numbers).
1 GFLOP = 10^9 FLOP
1 TFLOP = 10^12 FLOP
1 PFLOP = 10^15 FLOP
1 EFLOP = 10^18 FLOP
- 3. Computational cost
Example
Dot product
Given two vectors a, b ∈ R^N, the dot product c is evaluated as

c = Σ_{i=1}^{N} a(i) · b(i)

Each iteration i performs a total of 2 floating point operations, namely a sum and a product. Thus, the computational cost of the dot product is 2N FLOP.
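The 2N cost can be seen directly in code. A minimal FORTRAN sketch (array contents and size N are illustrative test values, not part of the slide):

```fortran
program dot_cost
  implicit none
  integer, parameter :: N = 1000
  integer :: i
  real(8) :: a(N), b(N), c

  a = 1.0d0                 ! illustrative test data
  b = 2.0d0

  c = 0.0d0
  do i = 1, N
     c = c + a(i)*b(i)      ! one sum and one product: 2 FLOP per iteration
  end do

  print *, 'dot product =', c            ! 2000 for this test data
  print *, 'cost =', 2*N, 'FLOP'
end program dot_cost
```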
Exercise
Work out the computational cost of i) a matrix-vector and ii) a matrix-matrix multiplication of size [N × N]. Write a FORTRAN code to compare the theoretical results against numerical evidence.
- 3. Computational cost
Computational speed of a system
The computational speed of a system is measured as the number of floating point operations that can be executed in one second, thus FLOP/s.
1 GFLOP/s = 10^9 FLOP/s
1 TFLOP/s = 10^12 FLOP/s
1 PFLOP/s = 10^15 FLOP/s
1 EFLOP/s = 10^18 FLOP/s
Examples:
SUPERMUC-NG (Munich): 26.3 PFLOP/s
MARCONI A3 (Bologna): 8 PFLOP/s
HLRS (Stuttgart): 7.4 PFLOP/s
List of TOP 500 supercomputers
- 3. Computational cost
Computational speed of a processor
Intel Core i7-8750H (8th gen)
Cores: 6, Threads: 12
Clock speed: 2.20 GHz
FLOP/cycle: 4
Speed: 6 (cores) × 4 (FLOP/cycle) × 2.20·10^9 (clock) = 52.8 GFLOP/s
Intel Skylake i7-9800X (9th gen)
Cores: 8, Threads: 16
Clock speed: 4.40 GHz
FLOP/cycle: 16
Speed: 8 (cores) × 16 (FLOP/cycle) × 4.40·10^9 (clock) = 563.2 GFLOP/s
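The peak-speed formula is just a product of three factors (cores × operations per cycle × clock frequency); a few lines suffice to check the arithmetic above:

```fortran
program peak_speed
  implicit none
  real(8) :: speed
  ! peak speed = cores × FLOP per cycle × clock frequency
  speed = 6.0d0 * 4.0d0 * 2.20d9      ! Intel Core i7-8750H
  print *, speed / 1.0d9, 'GFLOP/s'   ! 52.8
  speed = 8.0d0 * 16.0d0 * 4.40d9     ! Intel i7-9800X
  print *, speed / 1.0d9, 'GFLOP/s'   ! 563.2
end program peak_speed
```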
- 4. Memory
Memory bandwidth
Memory bandwidth is the rate at which data can be transferred between memory and processor, typically measured in GB/s. The theoretical memory bandwidth is computed as the product of:
base RAM clock frequency;
number of data transfers per clock: double data rate (DDR) memory transfers two bits per clock cycle;
memory bus width: each DDR memory interface is 64 bits wide (a so-called line);
number of interfaces: modern computers typically use two memory interfaces, referred to as dual-channel mode, for a 128-bit total bus width.
Example: dual-channel memory with DDR4-3200 (1600 MHz base clock)

1600·10^6 cycle/s × 2 transfer/cycle × 64 bit/transfer × 2 interfaces = 409.6·10^9 bit/s = 51.2 GB/s
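The bandwidth formula above can be sketched in a few lines of FORTRAN (the DDR4-3200 figures are those from the slide; the final division by 8 converts bits to bytes):

```fortran
program mem_bandwidth
  implicit none
  real(8) :: clock, bw
  clock = 1600.0d6                      ! base RAM clock (DDR4-3200)
  ! 2 transfers/cycle (DDR) × 64-bit bus × 2 channels, converted to bytes
  bw = clock * 2.0d0 * 64.0d0 * 2.0d0 / 8.0d0
  print *, bw / 1.0d9, 'GB/s'           ! 51.2
end program mem_bandwidth
```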
- 4. Memory
Memory bandwidth
Example: floating point operation c = a · b
3 memory accesses: a, b, and c.
24 bytes must be transferred (a, b, c assumed to be double precision floats).
⇓
To run at peak speed, the processor would need to transfer data at the following rates:
24 bytes × 52.8 GFLOP/s = 1267.2 GB/s (Intel Core i7-8750H)
24 bytes × 563.2 GFLOP/s = 13516.8 GB/s (Intel Skylake i7-9800X)
These rates are far beyond the roughly 51.2 GB/s that a dual-channel DDR4 system actually provides. As the speed gap between CPU and memory widens, the memory hierarchy has become the primary factor limiting program performance.
- 4. Memory
Memory hierarchy
registers (on chip): up to 64 bit (vector processors)
cache memory: L1 on chip, and L2
Random Access Memory (RAM)
mass storage
Memory access time
The time interval between the instant at which an instruction control unit initiates a call for data (or a request to store data) and the instant at which the delivery of the data is completed (or the storage is started); in other words, how long it takes to locate a piece of data in memory.
- 4. Memory
Memory hierarchy
Cache is a fast-access memory that allows data to be temporarily kept close to the CPU, thus reducing memory access time. There are two types of locality:
temporal locality: the same data are likely to be accessed more than once within a short time;
spatial locality: when a specific datum is accessed and copied to the cache, it is very likely that spatially close data will be accessed as well.
Data loaded into the cache are kept there as long as possible.
- 5. Arrays
FORTRAN: array allocation
Let A ∈ R^(M×N) be a matrix (array of rank 2). In FORTRAN, arrays of rank greater than 1 are stored by columns: the leftmost index varies fastest. A matrix A of components a_ij is therefore stored in the following order:
col 1: a11, a21, ..., aM1
col 2: a12, a22, ..., aM2
...
col j: a1j, a2j, ..., aMj
...
col N: a1N, a2N, ..., aMN
Elements of the same column are close in memory, thus the fastest way to access array data is to loop column by column. The cache is then fully exploited.
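A minimal sketch contrasting the two loop orders (the matrix size is an illustrative choice; on matrices this large the column-by-column version is noticeably faster because consecutive accesses are adjacent in memory):

```fortran
program loop_order
  implicit none
  integer, parameter :: M = 2000, N = 2000
  real(8), allocatable :: A(:,:)
  real(8) :: s
  integer :: i, j

  allocate(A(M,N))
  A = 1.0d0

  ! cache-friendly: leftmost index i varies fastest (column by column)
  s = 0.0d0
  do j = 1, N
     do i = 1, M
        s = s + A(i,j)
     end do
  end do
  print *, 'column-wise sum =', s

  ! cache-unfriendly: a stride of M elements between consecutive accesses
  s = 0.0d0
  do i = 1, M
     do j = 1, N
        s = s + A(i,j)
     end do
  end do
  print *, 'row-wise sum    =', s
end program loop_order
```

Both loops compute the same sum (M·N); only the memory access pattern differs.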
- 5. Arrays
FORTRAN: matrix multiplication
Let (A, B) ∈ R^(N×N) be arrays of rank 2. The matrix-matrix product C = A × B can be evaluated using:
a DO loop;
the MATMUL intrinsic function of FORTRAN;
the DGEMM routine from the Math Kernel Library (MKL).
DGEMM syntax
DGEMM(transa, transb, m, n, k, alpha, A, lda, B, ldb, beta, C, ldc)
The DGEMM routine computes a scalar-matrix-matrix product and adds the result to a scalar-matrix product. The operation is defined as
C := alpha · op(A) · op(B) + beta · C,
where the operator op(X) can be op(X) = X, op(X) = X^T, or op(X) = X^H.
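A self-contained sketch comparing the first two techniques (DGEMM is shown only as a comment, since it requires linking against MKL or another BLAS library; N is an illustrative size):

```fortran
program matmul_check
  implicit none
  integer, parameter :: N = 100
  real(8) :: A(N,N), B(N,N), C_loop(N,N), C_intr(N,N)
  integer :: i, j, k

  call random_number(A)
  call random_number(B)

  ! triple DO loop, column by column (cost: 2*N**3 FLOP)
  C_loop = 0.0d0
  do j = 1, N
     do k = 1, N
        do i = 1, N
           C_loop(i,j) = C_loop(i,j) + A(i,k)*B(k,j)
        end do
     end do
  end do

  ! intrinsic function
  C_intr = matmul(A, B)

  print *, 'max difference =', maxval(abs(C_loop - C_intr))
  ! With MKL linked, the equivalent BLAS call would be:
  ! call dgemm('N', 'N', N, N, N, 1.0d0, A, N, B, N, 0.0d0, C_intr, N)
end program matmul_check
```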
- 5. Arrays
FORTRAN: matrix multiplication
Exercise
Write a FORTRAN code to compute the matrix-matrix product C = A × B with (A, B, C) ∈ R^(N×N). Compare the efficiency (computational time) of the following techniques:
DO loop (multiplication by rows);
DO loop (multiplication by columns);
the MATMUL intrinsic function;
DGEMM from MKL.
Consider N = 10, N = 100, N = 1000.