

SLIDE 1

Parallel programming 02

Walter Boscheri walter.boscheri@unife.it

University of Ferrara - Department of Mathematics and Computer Science

A.Y. 2018/2019 - Semester I

SLIDE 2

Outline

1. Classification of parallel systems
2. Performance measure
3. Optimization of parallel computational resources

SLIDE 3
  • 1. Classification of parallel systems

Classification of parallel systems

A parallel system can be described by considering:
- number and type of the processors (massively parallel vs coarse-grained parallelism)
- presence of a global control mechanism (Flynn classification)
- synchronism (whether a common clock among processors is present or not)
- connections among processors (shared memory or distributed memory)

Walter Boscheri Parallel programming 02 2 / 12

SLIDE 4
  • 1. Classification of parallel systems

Flynn classification (1966)

SISD: Single Instruction stream - Single Data stream
It includes the Von Neumann model, because one stream of instructions operates on one stream of data.

SIMD: Single Instruction stream - Multiple Data stream
It involves vector processors and pipeline processors, in which all processors follow the same instructions, executing them on different data sets.

MISD: Multiple Instruction stream - Single Data stream
It can be seen as an extension of SISD.

MIMD: Multiple Instruction stream - Multiple Data stream
A MIMD system has independent processors, each with its own local control unit. As a consequence, each processor can load different instructions and operate on different data.


SLIDE 5
  • 1. Classification of parallel systems

Flynn classification (1966)

[Diagram: one control unit (CU) sends a single instruction stream (IS) to one processing unit (PU), which exchanges a single data stream (DS) with a memory module (MM).]

SISD: scalar uniprocessor systems (Von Neumann architecture)


SLIDE 6
  • 1. Classification of parallel systems

Flynn classification (1966)

[Diagram: a single control unit (CU) broadcasts one instruction stream (IS) to the processing units PU1, ..., PUn; each PUi exchanges its own data stream DSi with its memory module MMi.]

SIMD: synchronized parallelism
- one single control unit
- one single instruction operates on several data sets
- typical of vector processors and parallel processing


SLIDE 7
  • 1. Classification of parallel systems

Flynn classification (1966)

[Diagram: each processing unit PUi has its own control unit CUi, which issues its own instruction stream ISi; each PUi exchanges its own data stream DSi with memory module MMi.]

MIMD: non-synchronized parallelism
- several processors execute several instructions and operate on several data sets
- shared or distributed memory


SLIDE 8
  • 1. Classification of parallel systems

Shared and distributed memory

Shared memory

- single address space
- all processors have access to the pool of shared memory


SLIDE 9
  • 1. Classification of parallel systems

Shared and distributed memory

Distributed memory

- each processor has its own local memory
- message passing is used to exchange data among processors


SLIDE 10
  • 1. Classification of parallel systems

Sequential vs vector processors

Sequential processors execute all instructions serially, from the first to the last one.

Vector processors make use of the pipelining technique:
- it is based on the parallel execution of several instructions belonging to the sequential algorithm;
- it is similar to an assembly line: it does not reduce the execution time of a single instruction, but it increases the rate at which instructions are executed.


SLIDE 11
  • 1. Classification of parallel systems

Sequential vs vector processors

[Diagram: execution of two instructions IS1 and IS2 on a 2.5 GHz processor (0.4 ns clock period). On the sequential processor, IS1 runs from 0.0 to 1.6 ns and IS2 from 1.6 to 3.2 ns. On the vector processor, the pipeline overlaps the stages of IS1 and IS2, so IS2 starts one clock period (0.4 ns) after IS1.]


SLIDE 12
  • 1. Classification of parallel systems

Sequential vs vector processors

Example

Sequential processor:

   DO i = 1, N
      A(i) = B(i) + C(i)
      B(i) = 2 * A(i+1)
   ENDDO

Vector processor:

   temp(1:N) = A(2:N+1)
   A(1:N) = B(1:N) + C(1:N)
   B(1:N) = 2 * temp(1:N)

N.B. at iteration i, B(i) must use the old value of A(i+1), which is only overwritten at iteration i+1; therefore the vector version saves A(2:N+1) into temp before updating A.
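The transformation can be checked with a minimal Python sketch (illustrative only; the slides use Fortran array syntax, and the array values below are made up). Both variants must produce the same result:

```python
# Python sketch of the loop above. The loop reads the OLD value of
# A(i+1) before A is fully updated, so the "vectorized" variant must
# save a copy of A(2:N+1) first.

def sequential(A, B, C):
    # Direct translation of the DO loop (0-based indices; len(A) == len(B) + 1).
    n = len(B)
    for i in range(n):
        A[i] = B[i] + C[i]
        B[i] = 2 * A[i + 1]   # A[i+1] still holds its old value here
    return A, B

def vectorized(A, B, C):
    n = len(B)
    temp = A[1:n + 1]                       # copy of A(2:N+1), taken before A changes
    A[:n] = [b + c for b, c in zip(B, C)]   # A(1:N) = B(1:N) + C(1:N)
    B[:] = [2 * x for x in temp]            # B(1:N) = 2 * temp(1:N)
    return A, B
```

Without the temporary copy, the array assignment to A would destroy the values that B still needs, which is exactly the dependence that prevents naive vectorization.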


SLIDE 13
  • 2. Performance measure

Speedup

The speedup S(p) measures the reduction of the computational time t(p) obtained by using a total number of p processors while keeping the size of the problem fixed.

Absolute speedup
The speedup is measured w.r.t. the best serial code, with computational time t_best:

   S(p) = t_best / t(p)

It is also called performance measure.

Relative speedup
The speedup is measured w.r.t. the same code run with p = 1:

   S(p) = t(1) / t(p)

It is also called scalability measure.
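The two definitions can be sketched in Python; the timings below are hypothetical numbers chosen only to illustrate the formulas:

```python
# Hypothetical timings in seconds (made up for illustration).
t_best = 10.0                              # best serial code
t = {1: 12.0, 2: 6.5, 4: 3.8, 8: 2.6}      # same code run on p processors

def absolute_speedup(p):
    # Performance measure: S(p) = t_best / t(p)
    return t_best / t[p]

def relative_speedup(p):
    # Scalability measure: S(p) = t(1) / t(p)
    return t[1] / t[p]
```

Since the parallel code run with p = 1 is usually slower than the best serial code (t(1) >= t_best), the relative speedup is never smaller than the absolute one.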


SLIDE 14
  • 2. Performance measure

Ideal speedup

In the ideal case, in which the work load is perfectly distributed among all processors, the relative speedup equals the number of processors: S(p) = p. This is the case of linear speedup. In practice, linear speedup is never achieved, because:
- load balancing is not guaranteed;
- some portions of the code cannot be parallelized;
- synchronization and communication take additional time.


SLIDE 15
  • 2. Performance measure

Superlinear speedup

Very rarely, one has S(p) > p. This is the case of superlinear speedup. Superlinear speedup can occasionally be achieved because:
- in a distributed memory system, if the number of processors increases, the total amount of memory increases as well. Intermediate data and results can therefore be stored, avoiding the need to compute them again; in this way the number of floating point operations, i.e. the number of computations, is reduced compared to an execution on fewer processors;
- the portion of the problem assigned to one processor might be reduced to the point that it can be entirely stored and managed in the cache.


SLIDE 16
  • 2. Performance measure

Model of Flatt and Kennedy

The model qualitatively describes the speedup S(p) as a function of p.

Definitions
- T_ser → execution time of the serial portion of the algorithm
- T_par → execution time of the parallelizable portion of the algorithm
- T_0(p) → synchronization and communication time for p processors

It holds

   T(1) = T_ser + T_par
   T(p) = T_ser + T_par/p + T_0(p)

   S(p) = T(1) / T(p) = (T_ser + T_par) / (T_ser + T_par/p + T_0(p))


SLIDE 17
  • 2. Performance measure

Model of Flatt and Kennedy

By considering that the communication time is a linear function of p, that is T_0(p) = K·p, the speedup becomes

   S(p) = (T_ser + T_par) / (T_ser + T_par/p + K·p) = (T_ser + T_par)·p / (T_ser·p + T_par + K·p²)

It follows that

   lim_{p→∞} S(p) = 0
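This behaviour is easy to check numerically; here is a minimal Python sketch, with hypothetical values for T_ser, T_par and K:

```python
# Flatt-Kennedy speedup with linear communication cost T0(p) = K*p.
# T_ser, T_par and K are hypothetical values chosen for illustration.
T_ser = 1.0
T_par = 99.0
K = 0.05

def S(p):
    # S(p) = (T_ser + T_par) * p / (T_ser*p + T_par + K*p^2)
    return (T_ser + T_par) * p / (T_ser * p + T_par + K * p ** 2)

# S(p) grows roughly linearly for small p, saturates,
# then decays toward 0 as the K*p^2 term dominates.
```

With these numbers the speedup rises until the communication term K·p² overtakes the gain from dividing T_par, after which adding processors makes the run slower.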


SLIDE 18
  • 2. Performance measure

Model of Flatt and Kennedy

The speedup function:
- is initially linear;
- exhibits a saturation point;
- then decreases, as the communication cost grows with p.


SLIDE 19
  • 2. Performance measure

Efficiency

Efficiency is defined as the ratio

   E(p) = S(p) / p

- if S(p) is linear, then E(p) = 1;
- in practice, E(p) = 1 if p = 1 and E(p) < 1 if p > 1.

N.B. the farther the efficiency is from 1, the worse the parallel computational resources are exploited.
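A minimal sketch of the efficiency computation (the run times below are hypothetical):

```python
# Hypothetical run times (seconds) of the same code on p processors.
t = {1: 12.0, 2: 6.5, 4: 3.8, 8: 2.6}

def S(p):
    # Relative speedup: S(p) = t(1) / t(p)
    return t[1] / t[p]

def E(p):
    # Efficiency: E(p) = S(p) / p
    return S(p) / p
```

With these made-up numbers E(8) ≈ 0.58, i.e. at 8 processors almost half of the parallel resources are wasted, even though the speedup is still growing.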


SLIDE 20
  • 3. Optimization of parallel computational resources

Optimize the number of processors

Speedup
The optimal number of processors is the one which allows us to reach the saturation point.

Efficiency
The optimal number of processors is the one with E(p) = 1, i.e. p = 1.

At the saturation point the speedup is maximum, but the efficiency is low.


SLIDE 21
  • 3. Optimization of parallel computational resources

Function of Kuck

The function of Kuck K(p) is used to measure the efficiency of a parallelization in terms of the number of processors p:

   K(p) = S(p) · E(p)

   p* = arg max K(p)

The maximizer p* yields simultaneously good speedup and efficiency.
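Combined with the Flatt-Kennedy model, the Kuck function can be maximized numerically. A Python sketch, with the same kind of hypothetical model parameters as before:

```python
# Kuck function under the Flatt-Kennedy model with T0(p) = Kc * p.
# T_ser, T_par and Kc are hypothetical values chosen for illustration.
T_ser, T_par, Kc = 1.0, 99.0, 0.05

def S(p):
    # S(p) = (T_ser + T_par) * p / (T_ser*p + T_par + Kc*p^2)
    return (T_ser + T_par) * p / (T_ser * p + T_par + Kc * p ** 2)

def E(p):
    # Efficiency: E(p) = S(p) / p
    return S(p) / p

def kuck(p):
    # Kuck function: K(p) = S(p) * E(p)
    return S(p) * E(p)

# p* = arg max K(p), searched over a finite range of processor counts.
p_star = max(range(1, 201), key=kuck)
```

With these numbers p* lands well below the saturation point of S(p) alone, trading a little speedup for a much better use of the processors.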
