Parallel Processing
Raul Queiroz Feitosa
Parts of these slides are from the support material provided by W. Stallings
Objective
To present the most prominent approaches to parallel computer organization.
Outline
Taxonomy MIMD systems
Symmetric multiprocessing Time shared bus Cache coherence
Multithreading Clusters NUMA systems Vector Computation
Taxonomy
Flynn
Classification by the number of instruction streams being executed simultaneously and the number of data streams being processed simultaneously:

                              1 data stream    multiple data streams
1 instruction stream               SISD               SIMD
multiple instruction streams       MISD               MIMD
Single Instruction, Single Data
SISD
Single processor Single instruction stream Data stored in single memory Uni-processor (von Neumann architecture)
control Unit processing unit memory unit instruction stream data stream
Single Instruction, Multiple Data
SIMD
Single machine instruction
controls simultaneous execution
Multiple processing elements Each processing element has
associated data memory
The same instruction
executed by all processing elements but on different set
Main subclasses: vector and
array processors.
[Diagram: one control unit broadcasting a single instruction stream to several processing elements, each with its own memory unit and data stream]
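The SIMD idea, one instruction stream driving many processing elements, can be sketched as a toy model in Python (illustrative only; the function name simd_execute is made up):

```python
# Toy model of SIMD: one control unit broadcasts a single instruction
# (here, an operation) to several processing elements, each holding
# its own data item in its own data memory.
def simd_execute(operation, data_memories):
    """Apply the same instruction to every processing element's data."""
    return [operation(x) for x in data_memories]

# The same "add 10" instruction, executed by all processing elements,
# but on different data.
result = simd_execute(lambda x: x + 10, [1, 2, 3, 4])
print(result)  # [11, 12, 13, 14]
```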
Multiple Instruction, Single Data
MISD
Sequence of data transmitted to a set of processors. Each processor executes a different instruction
sequence on different “parts” of the same data sequence.
Never been implemented.
Multiple Instruction, Multiple Data
MIMD
Set of processors
simultaneously execute different instruction sequences.
Different sets of data. Main subclasses:
multiprocessors and multicomputers
[Diagram: several control unit / processing unit pairs, each with its own instruction and data stream, all accessing a single shared memory unit]
multiprocessor
Multiple Instruction, Multiple Data
MIMD
Set of processors
simultaneously execute different instruction sequences.
Different sets of data. Main subclasses:
multiprocessors and multicomputers
[Diagram: several control unit / processing unit pairs, each with its own instruction and data stream and a private memory unit, connected by an interconnection network]
multicomputer
Taxonomy tree
Processor organizations
  Single instruction, single data stream (SISD)
  Single instruction, multiple data stream (SIMD)
    Vector processor
    Array processor
  Multiple instruction, single data stream (MISD)
  Multiple instruction, multiple data stream (MIMD)
    Multiprocessor: shared memory (tightly coupled)
      Symmetric multiprocessor (SMP)
      Nonuniform memory access (NUMA)
    Multicomputer: distributed memory (loosely coupled)
      Clusters
Outline
Taxonomy MIMD systems
Symmetric multiprocessing Time shared bus Cache coherence
Multithreading Clusters NUMA systems Vector Computation
MIMD - Overview
Set of general purpose processors. Each can process all instructions necessary Further classified by method of processor communication.
Communication Models
Multiprocessors
All CPUs are able to process all necessary instructions.
All access the same physical shared memory.
All share the same address space.
Communication through shared memory via LOAD/STORE instructions → tightly coupled.
Simple programming model.
Communication Models
Multiprocessors (example)
a) Multiprocessor with 16 CPUs sharing a common memory. b) Memory in 16 sections; each one processed by one processor.
Communication Models
Multicomputers
Each CPU has a private memory → distributed memory system.
Each CPU has its own address space.
Communication through send/receive primitives → loosely coupled system.
More complex programming model.
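The send/receive style of a loosely coupled multicomputer can be sketched with Python's multiprocessing module (an illustrative toy, not part of the slides): each "node" is a process with private memory, and data moves only through explicit messages.

```python
from multiprocessing import Pipe, Process

# Toy model of a multicomputer node: a separate process with private
# memory that communicates only via send/receive primitives.
def node(conn):
    data = conn.recv()        # receive a message from the other node
    conn.send(sum(data))      # send the result back
    conn.close()

def remote_sum(data):
    parent_end, child_end = Pipe()
    p = Process(target=node, args=(child_end,))
    p.start()
    parent_end.send(data)     # data is copied: there is no shared memory
    result = parent_end.recv()
    p.join()
    return result

if __name__ == "__main__":
    print(remote_sum([1, 2, 3, 4]))  # 10
```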
Communication Models
Multicomputers (example)
Multicomputer with 16 CPUs each with its own private memory Image (see previous figure) distributed among the 16 CPUs
Communication Models
Multiprocessors Multicomputers
Multiprocessors:
Potentially easier to program Building a shared memory for hundreds of CPUs is not easy → nonscalable.
Memory contention is a potential performance bottleneck. Multicomputers:
More difficult to program. Building multicomputers with thousands of CPUs is not difficult → scalable.
Outline
Taxonomy MIMD systems
Symmetric multiprocessing Time shared bus Cache coherence
Multithreading Clusters NUMA systems Vector Computation
Symmetric Multiprocessors
A stand-alone computer with the following characteristics:
Two or more similar processors of comparable capacity.
The processors share the same memory and I/O.
The processors are connected by a bus or other internal connection.
Memory access time is approximately the same for each processor.
SMP Advantages
Performance
If some work can be done in parallel.
Availability
Since all processors can perform the same functions, the failure of a single processor does not necessarily halt the system.
Incremental growth
Users can enhance performance by adding additional processors.
Scaling
Vendors can offer a range of products based on the number of processors.
Outline
Taxonomy MIMD systems
Symmetric multiprocessing Time shared bus Cache coherence
Multithreading Clusters NUMA systems Vector Computation
Time Shared Bus
Characteristics:
Simplest form. Structure and interface similar to a single-processor system.
The bus provides the following features:
Addressing: distinguishes the modules on the bus.
Arbitration: any module can temporarily be the bus master.
Time sharing: if one module has the bus, the others must wait and may have to suspend.
Now there are multiple processors as well as multiple I/O modules.
Time Shared Bus - SMP
Time Shared Bus
Advantages:
Simplicity Flexibility Reliability
Disadvantages:
Performance limited by bus cycle time.
Each processor should have a local cache:
Reduces the number of bus accesses
Leads to problems with cache coherence
Solved in hardware (see later)

Outline
Taxonomy MIMD systems
Symmetric multiprocessing Time shared bus Cache coherence
Multithreading Clusters NUMA systems Vector Computation
Cache Coherence Problem
[Diagram: CPUs A … K, each with a private cache (Cache A … Cache K), connected by a shared bus to a shared memory holding variable x]

1. CPU A reads the data x (miss) → x is loaded into cache A.
2. CPU K reads the same data (miss) → x is loaded into cache K.
3. CPU K writes (changes) the data (hit) → cache K now holds a new value y.
4. CPU A reads the data (hit) → gets the outdated value!
Snoopy Protocols

Cache controllers may include a snoop unit, which monitors the shared bus to detect any coherence-relevant activity and acts so as to assure data coherence. This increases bus traffic.
Snoopy Protocols
[Diagram: same configuration as before: CPUs A … K with private caches on a shared bus to the shared memory]

1. CPU K writes (changes) the data (hit): x becomes y.
2. The write propagates to the shared memory.
3. The snoop in cache A invalidates its copy of x.
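These steps can be sketched as a minimal write-invalidate snoop with write-through caches (an illustrative toy model; the Cache and Bus classes are made up, not a real protocol implementation):

```python
# Toy model of write-invalidate snooping with write-through caches.
class Bus:
    def __init__(self, memory):
        self.memory = memory               # shared memory: address -> value
        self.caches = []

    def snoop_invalidate(self, addr, writer):
        for cache in self.caches:
            if cache is not writer:
                cache.lines.pop(addr, None)  # other caches drop stale copy

class Cache:
    def __init__(self, bus):
        self.lines = {}                    # private cache: address -> value
        self.bus = bus
        bus.caches.append(self)

    def read(self, addr):
        if addr not in self.lines:         # miss: fetch from shared memory
            self.lines[addr] = self.bus.memory[addr]
        return self.lines[addr]

    def write(self, addr, value):
        self.lines[addr] = value
        self.bus.memory[addr] = value      # write-through to shared memory
        self.bus.snoop_invalidate(addr, self)

bus = Bus(memory={"x": 1})
cache_a, cache_k = Cache(bus), Cache(bus)
cache_a.read("x")          # CPU A caches x
cache_k.read("x")          # CPU K caches x
cache_k.write("x", 99)     # snoop invalidates A's copy
print(cache_a.read("x"))   # miss -> refetch -> 99, not the stale 1
```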
MESI State Transition Diagram
L1-L2 Cache Consistency
L1 caches do not connect to the bus → they do not engage in the snoop protocol.
Simple solution:
L1 is "write-through".
Updates and invalidations in L2 must be propagated to L1.
Approaches for write-back L1 caches exist → more complex.
Cache Coherence (interconnections other than a shared bus)
Directory Protocols
Collect and maintain information about copies of data in caches.
Typically a central directory stored in main memory.
Requests are checked against the directory.
Appropriate transfers are performed.
Creates a central bottleneck.
Effective in large-scale systems with complex interconnection schemes (according to Stallings).
Cache Coherence
Software Solutions
Compiler and operating system deal with the problem.
Overhead transferred to compile time.
Design complexity transferred from hardware to software.
However, software tends to make conservative decisions → inefficient cache utilization.
Code is analyzed to determine safe periods for caching shared variables.
HW+SW solutions exist.
Outline
Taxonomy MIMD systems
Symmetric multiprocessing Time shared bus Cache coherence
Multithreading Clusters NUMA systems Vector Computation
Increasing Performance
MIPS rate = f × IPC

where f is the processor clock frequency and IPC is the average number of instructions executed per cycle.

Pipelining and superscalar design increase IPC.
Mechanisms to maximize the utilization of each pipeline may be reaching a limit due to:
Complexity
Power consumption
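The MIPS rate relation can be worked through with example numbers (the numbers below are made up for illustration):

```python
# MIPS rate = f * IPC, with f in Hz and IPC in instructions per cycle.
# Example numbers are illustrative only.
f_hz = 2_000_000_000        # 2 GHz clock
ipc = 1.5                   # average instructions completed per cycle
mips = f_hz * ipc / 1_000_000
print(mips)                 # 3000.0 MIPS
```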
Threads and Processes
Process:
An instance of a program running on a computer. It is characterized by:
Resource ownership: main memory, I/O channels, I/O devices, and files
Scheduling/execution
Process switch
Threads and Processes
Thread:
A dispatchable unit of work within a process.
Includes processor context (program counter and stack pointer) and a data area for the stack (to enable subroutine branching).
Threads within a process share resources.
Threads are able to be executed independently.
Switching the processor between threads within the same process is typically less costly than a process switch.
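That threads within one process share the same address space can be seen directly with Python's threading module (a small illustrative sketch):

```python
import threading

# Threads within one process share the same address space: both workers
# append to the same list, and the main thread sees every update.
shared = []

def worker(tag):
    for i in range(3):
        shared.append((tag, i))   # writes go to shared process memory

t1 = threading.Thread(target=worker, args=("a",))
t2 = threading.Thread(target=worker, args=("b",))
t1.start(); t2.start()
t1.join(); t2.join()
print(len(shared))                # 6 items, written by two threads
```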
Implicit and Explicit Multithreading

Explicit multithreading:
user-level threads, visible to the application program
kernel-level threads, visible to the OS
All commercial processors and most experimental ones use explicit multithreading.

Implicit multithreading:
threads defined statically by the compiler or dynamically by hardware
Approaches to Explicit Multithreading
Interleaved (fine-grained)
The processor issues instruction(s) from a single thread at a time, switching threads at each clock cycle. If a thread is blocked, it is skipped.

Blocked (coarse-grained)
The processor issues instruction(s) from a single thread at a time. A thread executes until an event causes a delay, e.g. a cache miss. Effective on in-order processors; avoids pipeline stalls.

Simultaneous (SMT)
The processor issues instructions from multiple threads at a time to the execution units of a superscalar processor.

Chip multiprocessing
The processor is replicated on a single chip; each processor handles separate threads.

Scalar Processor Approaches
Single-threaded scalar
Simple pipeline, no multithreading.

Interleaved multithreaded scalar
Easiest multithreading to implement. Switch threads at each clock cycle. Pipeline stages are kept close to fully occupied. The hardware needs to switch thread contexts between cycles.

Blocked multithreaded scalar
A thread executes until a latency event occurs that would stop the pipeline; the processor then switches to another thread.
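The difference between the interleaved and blocked disciplines can be sketched as toy issue schedules (illustrative Python; the function names and the fixed "stall every N cycles" model are assumptions, not how real hardware decides):

```python
from itertools import cycle

# Interleaved (fine-grained): switch thread every clock cycle.
def interleaved(threads, cycles):
    order = cycle(range(len(threads)))
    return [threads[next(order)] for _ in range(cycles)]

# Blocked (coarse-grained): run one thread until a latency event
# (modeled here as occurring every `stall_every` cycles), then switch.
def blocked(threads, cycles, stall_every):
    schedule, t, run = [], 0, 0
    for _ in range(cycles):
        schedule.append(threads[t])
        run += 1
        if run == stall_every:            # e.g. a cache miss
            t, run = (t + 1) % len(threads), 0
    return schedule

print(interleaved(["A", "B"], 6))         # ['A', 'B', 'A', 'B', 'A', 'B']
print(blocked(["A", "B"], 6, 3))          # ['A', 'A', 'A', 'B', 'B', 'B']
```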
Scalar Diagrams
[Diagrams: issue slots per cycle for each approach; shaded slots mark stalls caused by some dependency, with 3 time slots needed to clear the dependency]
Superscalar Processor Approaches (1)
Superscalar
No multithreading
Interleaved multithreading superscalar:
Each cycle, as many instructions as possible issued from single thread Delays due to thread switches eliminated Number of instructions issued in cycle limited by dependencies
Blocked multithreaded superscalar
Instructions are issued from one thread at a time; blocked multithreading is used.
Superscalar Diagrams (1)
Very long instruction word (VLIW)
E.g. IA-64. Multiple instructions in a single word, typically constructed by the compiler: operations that may be executed in parallel are placed in the same word, which may be padded with no-ops.
Interleaved multithreading VLIW
Similar efficiencies to interleaved multithreading on a superscalar architecture.

Blocked multithreaded VLIW
Similar efficiencies to blocked multithreading on a superscalar architecture.
Superscalar Processor Approaches (2)
Superscalar Diagrams (2)
Simultaneous multithreading (SMT)
Issues multiple instructions at a time.
One thread may fill all horizontal slots.
Instructions from two or more threads may be issued.
With enough threads, the maximum number of instructions can be issued on each cycle.

Chip multiprocessor
Multiple processors, each a two-issue superscalar processor.
Each processor is assigned a thread and can issue up to two instructions per cycle per thread.

Superscalar Processor Approaches (3)
Superscalar Diagrams (3)
Outline
Taxonomy MIMD systems
Symmetric multiprocessing Time shared bus Cache coherence
Multithreading Clusters NUMA systems Vector Computation
Clusters
“A group of interconnected whole computers working together as a unified computing resource that can create the illusion of being one machine.”
Clusters
Benefits
Absolute scalability
Incremental scalability
High availability
Superior price/performance
Typically used for server applications.
Each computer in a cluster is called a node.
Cluster Configurations
Standby Server, No Shared Disk
Cluster Configurations
With Shared Disk
Clustering Methods
Passive Standby
Description: A secondary server takes over in case of primary server failure.
Benefits: Easy to implement.
Limitations: High cost, because the secondary server is unavailable for other processing tasks.

Active Secondary
Description: The secondary server is also used for processing tasks.
Benefits: Reduced cost, because secondary servers can be used for processing.
Limitations: Increased complexity.

Separate Servers
Description: Separate servers have their own disks; data is continuously copied from primary to secondary server.
Benefits: High availability.
Limitations: High network and server overhead due to copying operations.

Servers Connected to Disks
Description: Servers are cabled to the same disks, but each server owns its disks; if one server fails, its disks are taken over by the other server.
Benefits: Reduced network and server overhead due to elimination of copying operations.
Limitations: Usually requires disk mirroring or RAID technology to compensate for the risk of disk failure.

Servers Share Disks
Description: Multiple servers simultaneously share access to disks.
Benefits: Low network and server overhead; reduced risk of downtime caused by disk failure.
Limitations: Requires lock manager software; usually used with disk mirroring or RAID technology.
Operating Systems Design Issues
Failure Management
High availability: when one computer fails, another computer takes over.
Fault tolerance (redundancy).
Failover: switching applications & data from a failed system to an alternative within the cluster.
Failback: restoration of applications and data to the original system after the problem is fixed.
Load balancing:
Incremental scalability: automatically include new computers in scheduling.
Middleware needs to recognise that processes may switch between machines.
Parallelizing
Single application executing in parallel on a number of machines in cluster
Compiler
Determines at compile time which parts can be executed in parallel; these parts are split off for different computers.
Application
The application is written from scratch to be parallel, using message passing to move data between nodes. Hard to program, but gives the best end result.

Cluster Computer Architecture
Middleware provides a unified system image to the user; it handles the communication of software components and connects users to applications.
Blade Servers
Common cluster implementation that houses multiple server modules (blades) in a single chassis.
Saves space
Improves system management
The chassis provides the power supply
Each blade has its own processor, memory, and disk
Example of a Blade Server
Figure source: https://www.supermicro.com/en/products/blade/

Cluster × SMP
Both provide multiprocessing support to high demand applications. Both available commercially
SMP for longer
SMP:
Easier to manage and control Closer to single processor systems
Clustering:
Superior incremental & absolute scalability Superior availability → Redundancy
Scheduling is the main difference
SMP takes less physical space
SMP has lower power consumption

Outline
Taxonomy MIMD systems
Symmetric multiprocessing Time shared bus Cache coherence
Multithreading Clusters NUMA systems Vector Computation
NUMA Systems
Terminology:
Uniform memory access (UMA)
All processors have access to all parts of memory → load & store.
Access time to all regions of memory is the same.
Access time to memory is the same for all processors.
As used by SMP.

Nonuniform memory access (NUMA)
All processors have access to all parts of memory → load & store.
Access time of a processor differs depending on the region of memory.
Different processors access different regions of memory at different speeds.

Cache-coherent NUMA (CC-NUMA)
Cache coherence is maintained among the caches of the various processors.
Significantly different from SMP and clusters.

Motivation for NUMA
SMP has a practical limit to the number of processors:
bus traffic limits it to between 16 and 64 processors.
In clusters each node has its own memory:
applications do not see a large global memory;
coherence is maintained by software, not hardware.
NUMA retains the SMP flavour while giving large-scale multiprocessing:
e.g. the Silicon Graphics Origin, a NUMA machine with 1024 MIPS R10000 processors.
Objective:
to maintain a transparent system-wide memory while permitting multiprocessor nodes, each with its own bus or internal interconnection system.
CC-NUMA Organization
Single addressable memory space.
Memory request order (automatic and transparent to the processor):
1. L1 cache
2. L2 cache
3. Local main memory
4. Remote memory
NUMA Pros & Cons
Higher levels of parallelism than SMP, with minor software changes.
Performance can break down if there is too much access to remote memory.
This can be avoided by:
L1 & L2 cache design reducing all memory accesses (needs good temporal and spatial locality in the software)
Virtual memory management moving pages to the nodes that use them most
Keeping it cache coherent is costly in terms of HW and performance.
Outline
Taxonomy MIMD systems
Symmetric multiprocessing Time shared bus Cache coherence
Multithreading Clusters NUMA systems Vector Computation
Vector Computation
Mathematical problems involving physical processes present particular difficulties for computation:
Aerodynamics, seismology, meteorology, …
Continuous field simulation
High precision
Repeated floating-point calculations on large arrays of numbers
Vector Computation
Supercomputers handle these types of problems:
Hundreds of millions of flops
$10-15 million
Optimised for calculation rather than multitasking and I/O
Limited market: research, government agencies, meteorology

Array processor
An alternative to the supercomputer
Configured as a peripheral to an ordinary computer
Runs just the vector portion of problems
Approaches
Scalar Processing

c(i,j) = Σ_{k=1}^{N} a(i,k) · b(k,j)

for i=1:N
  for j=1:N
    for k=1:N
      C(i,j) = C(i,j) + A(i,k) * B(k,j);
    end
  end
end
Approaches
Vector Processing

c(i,j) = Σ_{k=1}^{N} a(i,k) · b(k,j)

for i=1:N
  for j=1:N
    C(i,j) = sum(A(i,:).*B(:,j)');
  end
end

All elements in the k dimension are handled in one operation.
All operations are independent → no pipeline hazards!
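The same contrast can be sketched in plain Python (illustrative; the slides use MATLAB-style pseudocode): the scalar version performs one multiply-accumulate per step, while the "vector" version treats the whole k dimension as a single inner-product operation.

```python
# Scalar version: one multiply-accumulate at a time (triple loop),
# computing c[i][j] = sum over k of a[i][k] * b[k][j].
def matmul_scalar(a, b):
    n = len(a)
    c = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c

# "Vector" version: the whole k dimension handled as one inner-product
# operation per output element, as a vector unit would.
def matmul_vector(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
print(matmul_scalar(a, b))   # [[19, 22], [43, 50]]
print(matmul_vector(a, b))   # same result
```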
Processor Designs
Pipelined ALU
Within operations Across operations
Parallel ALUs Parallel processors
Approaches to Vector Computation
Chaining
The ability to feed the result issued from one functional unit into another functional unit. (Cray Supercomputers)
A vector operation may start as soon as the first element of the operand vector is available and the functional unit is free.
The result from one functional unit is fed immediately into another.
If vector registers are used, intermediate results do not have to be stored in memory.

[Diagram: memory, vector registers, and a scalar register feeding pipelined ALUs, whose results chain into one another]
Outline
Taxonomy MIMD systems
Symmetric multiprocessing Time shared bus Cache coherence
Multithreading Clusters NUMA systems Vector Computation
Exercises
Exercise 1:
Consider a situation in which two processors (P1 and P2) in an SMP configuration, over time, require access to the same line of data x from main memory.
Initially both caches have an invalid copy of the line. Eventually, processor P1 reads line x. If this is the start of a sequence of accesses, complete the table below indicating the state transitions of this line in the caches of both processors.

Access           State transition in cache 1   State transition in cache 2
P2 reads x       E → S                         I → S
P1 writes to x   S → M                         S → I
P1 writes to x   M → M                         I → I
P2 reads x       M → S (write-back)            I → S
Exercises
Exercise 2: Consider an SMP with both L1 and L2 caches using the MESI protocol. One of four states is associated with each line in the L2 cache. Are all four states also needed for each line in the L1 cache? If so, why? If not, explain which state or states can be eliminated.
Exercises
Exercise 3: Consider a pipeline similar to the ones seen in the chapter on Pipelining, redrawn in Figure (a) with the fetch and decode stages omitted, representing the execution of thread A. Figure (b) illustrates the execution of a separate thread B. In both cases, a simple pipelined processor is used.

[Pipeline execution diagrams: thread A (instructions A1 … A16, with stall cycles) and thread B (instructions B1 … B7) passing through the CO, FO, EI, and WO stages over cycles 1–12]
Exercises
Exercise 3a : Show an instruction issue diagram, for each of the two threads Exercise 3b : Assume that the two threads are to be executed in parallel on a chip processor, with each of the two processors on the chip using a simple pipeline. Show an instruction issue diagram. Also show a pipeline execution diagram.
Exercises
Exercise 3c: Assume a two-issue superscalar architecture. Repeat part (b) for an interleaved multithreading superscalar implementation, assuming no data dependencies. Note: there is no unique answer; you need to make assumptions about latency and priority.
Exercises
Exercise 3d : Repeat part c for a blocked multithreading superscalar implementation.
Exercises
Exercise 3e : Repeat for a four-issue SMT architecture.
Exercises
Exercise 4: An application program is executed on a nine-computer cluster and takes time T. It was found that 25% of T was time in which the application was running simultaneously on all nine computers; during the remaining time, the application ran on a single computer.
a)
Calculate the effective speedup under the aforementioned condition as compared to executing the program on a single computer. Also calculate α, the percentage of code that has been parallelized (programmed or compiled so as to use the cluster mode) in the preceding program.
b)
Suppose that we are able to effectively use 17 computers rather than 9 computers on the parallelized portion of the code. Calculate the effective speedup that is achieved
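Exercise 4 applies the standard speedup model for a partially parallelized program (Amdahl's law). A small helper can be sketched as follows; the example numbers are illustrative and are not the exercise's:

```python
# Speedup of a program whose fraction f of serial execution time is
# parallelized across n processors (Amdahl's law).
def speedup(f, n):
    return 1.0 / ((1.0 - f) + f / n)

# Illustrative numbers (not the exercise's): half the code parallelized
# over 4 processors.
print(speedup(0.5, 4))   # 1.6
```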
Exercises
Exercise 5:
The figures below show the state diagrams of two possible cache coherence protocols. Deduce and explain each protocol, and compare each to MESI.

W(i) = write to line by processor i
R(i) = read line by processor i
Z(i) = displace line by processor i
W(j) = write to line by processor j (j ≠ i)
R(j) = read line by processor j (j ≠ i)
Z(j) = displace line by processor j (j ≠ i)
Note: state diagrams are for a given line in cache i
[State diagram (a): two states, Invalid and Valid, with transitions labeled by R(i), W(i), Z(i), R(j), W(j), Z(j)]
[State diagram (b): three states, Exclusive, Shared, and Invalid, with transitions labeled by R(i), W(i), Z(i), R(j), W(j), Z(j)]
Text Book References
These topics are covered in:
Stallings
Tanenbaum, chapter 8
Parhami