Parallel Processing
Raul Queiroz Feitosa
Parts of these slides are from the support material provided by W. Stallings
Objective
To present the most prominent approaches to parallel computer organization.
Outline
Taxonomy MIMD systems
Symmetric multiprocessing Time shared bus Cache coherence
Multithreading Clusters NUMA systems Vector Computation
Taxonomy
Flynn
Classification by the number of instruction streams being executed simultaneously and the number of data streams being processed simultaneously:

                              1 data stream    multiple data streams
1 instruction stream               SISD               SIMD
multiple instruction streams       MISD               MIMD
Single Instruction, Single Data
SISD
Single processor Single instruction stream Data stored in single memory Uni-processor (von Neumann architecture)
control Unit processing unit memory unit instruction stream data stream
Single Instruction, Multiple Data
SIMD
Single machine instruction
controls simultaneous execution
Multiple processing elements Each processing element has
associated data memory
The same instruction
executed by all processing elements but on different set
Main subclasses: vector and
array processors.
[Diagram: one control unit broadcasting a single instruction stream to several processing elements, each with its own memory unit and data stream]
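The SIMD idea, one instruction stream driving many processing elements, can be sketched as a toy model in Python (illustrative only; the function name simd_execute is made up):

```python
# Toy model of SIMD: one control unit broadcasts a single instruction
# (here, an operation) to several processing elements, each holding
# its own data item in its own data memory.
def simd_execute(operation, data_memories):
    """Apply the same instruction to every processing element's data."""
    return [operation(x) for x in data_memories]

# The same "add 10" instruction, executed by all processing elements,
# but on different data.
result = simd_execute(lambda x: x + 10, [1, 2, 3, 4])
print(result)  # [11, 12, 13, 14]
```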
Multiple Instruction, Single Data
MISD
Sequence of data transmitted to a set of processors. Each processor executes a different instruction
sequence on different “parts” of the same data sequence.
Never been implemented.
Multiple Instruction, Multiple Data
MIMD
Set of processors
simultaneously execute different instruction sequences.
Different sets of data. Main subclasses:
multiprocessors and multicomputers
[Diagram: several control unit / processing unit pairs, each with its own instruction and data stream, all accessing a single shared memory unit]
multiprocessor
Multiple Instruction, Multiple Data
MIMD
Set of processors
simultaneously execute different instruction sequences.
Different sets of data. Main subclasses:
multiprocessors and multicomputers
[Diagram: several control unit / processing unit pairs, each with its own instruction and data stream and a private memory unit, connected by an interconnection network]
multicomputer
Taxonomy tree
Processor organizations
  Single instruction, single data stream (SISD)
  Single instruction, multiple data stream (SIMD)
    Vector processor
    Array processor
  Multiple instruction, single data stream (MISD)
  Multiple instruction, multiple data stream (MIMD)
    Multiprocessor: shared memory (tightly coupled)
      Symmetric multiprocessor (SMP)
      Nonuniform memory access (NUMA)
    Multicomputer: distributed memory (loosely coupled)
      Clusters
Outline
Taxonomy MIMD systems
Symmetric multiprocessing Time shared bus Cache coherence
Multithreading Clusters NUMA systems Vector Computation
MIMD - Overview
Set of general purpose processors. Each can process all instructions necessary Further classified by method of processor communication.
Communication Models
Multiprocessors
All CPUs are able to process all necessary instructions.
All access the same physical shared memory.
All share the same address space.
Communication through shared memory via LOAD/STORE instructions → tightly coupled.
Simple programming model.
Communication Models
Multiprocessors (example)
a) Multiprocessor with 16 CPUs sharing a common memory. b) Memory in 16 sections; each one processed by one processor.
Communication Models
Multicomputers
Each CPU has a private memory → distributed memory system.
Each CPU has its own address space.
Communication through send/receive primitives → loosely coupled system.
More complex programming model.
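The send/receive style of a loosely coupled multicomputer can be sketched with Python's multiprocessing module (an illustrative toy, not part of the slides): each "node" is a process with private memory, and data moves only through explicit messages.

```python
from multiprocessing import Pipe, Process

# Toy model of a multicomputer node: a separate process with private
# memory that communicates only via send/receive primitives.
def node(conn):
    data = conn.recv()        # receive a message from the other node
    conn.send(sum(data))      # send the result back
    conn.close()

def remote_sum(data):
    parent_end, child_end = Pipe()
    p = Process(target=node, args=(child_end,))
    p.start()
    parent_end.send(data)     # data is copied: there is no shared memory
    result = parent_end.recv()
    p.join()
    return result

if __name__ == "__main__":
    print(remote_sum([1, 2, 3, 4]))  # 10
```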
Communication Models
Multicomputers (example)
Multicomputer with 16 CPUs each with its own private memory Image (see previous figure) distributed among the 16 CPUs
Communication Models
Multiprocessors Multicomputers
Multiprocessors:
Potentially easier to program Building a shared memory for hundreds of CPUs is not easy → nonscalable.
Memory contention is a potential performance bottleneck. Multicomputers:
More difficult to program. Building multicomputers with thousands of CPUs is not difficult → scalable.
Outline
Taxonomy MIMD systems
Symmetric multiprocessing Time shared bus Cache coherence
Multithreading Clusters NUMA systems Vector Computation
Symmetric Multiprocessors
A stand-alone computer with the following characteristics:
Two or more similar processors of comparable capacity.
The processors share the same memory and I/O.
The processors are connected by a bus or other internal connection.
Memory access time is approximately the same for each processor.
SMP Advantages
Performance
If some work can be done in parallel.
Availability
Since all processors can perform the same functions, the failure of a single processor does not necessarily halt the system.
Incremental growth
Users can enhance performance by adding additional processors.
Scaling
Vendors can offer a range of products based on the number of processors.
Outline
Taxonomy MIMD systems
Symmetric multiprocessing Time shared bus Cache coherence
Multithreading Clusters NUMA systems Vector Computation
Time Shared Bus
Characteristics:
Simplest form. Structure and interface similar to a single-processor system.
The bus provides the following features:
Addressing: distinguishes the modules on the bus.
Arbitration: any module can temporarily be the bus master.
Time sharing: if one module has the bus, the others must wait and may have to suspend.
Now there are multiple processors as well as multiple I/O modules.
Time Shared Bus - SMP
Time Shared Bus
Advantages:
Simplicity Flexibility Reliability
Disadvantages:
Performance limited by bus cycle time.
Each processor should have a local cache:
Reduces the number of bus accesses
Leads to problems with cache coherence
Solved in hardware (see later)

Outline
Taxonomy MIMD systems
Symmetric multiprocessing Time shared bus Cache coherence
Multithreading Clusters NUMA systems Vector Computation
Cache Coherence Problem
[Diagram: CPUs A … K, each with a private cache (Cache A … Cache K), connected by a shared bus to a shared memory holding variable x]

1. CPU A reads the data x (miss) → x is loaded into cache A.
2. CPU K reads the same data (miss) → x is loaded into cache K.
3. CPU K writes (changes) the data (hit) → cache K now holds a new value y.
4. CPU A reads the data (hit) → gets the outdated value!
Snoopy Protocols

Cache controllers may include a snoop unit, which monitors the shared bus to detect any coherence-relevant activity and acts so as to assure data coherence. This increases bus traffic.
Snoopy Protocols
[Diagram: same configuration as before: CPUs A … K with private caches on a shared bus to the shared memory]

1. CPU K writes (changes) the data (hit): x becomes y.
2. The write propagates to the shared memory.
3. The snoop in cache A invalidates its copy of x.
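These steps can be sketched as a minimal write-invalidate snoop with write-through caches (an illustrative toy model; the Cache and Bus classes are made up, not a real protocol implementation):

```python
# Toy model of write-invalidate snooping with write-through caches.
class Bus:
    def __init__(self, memory):
        self.memory = memory               # shared memory: address -> value
        self.caches = []

    def snoop_invalidate(self, addr, writer):
        for cache in self.caches:
            if cache is not writer:
                cache.lines.pop(addr, None)  # other caches drop stale copy

class Cache:
    def __init__(self, bus):
        self.lines = {}                    # private cache: address -> value
        self.bus = bus
        bus.caches.append(self)

    def read(self, addr):
        if addr not in self.lines:         # miss: fetch from shared memory
            self.lines[addr] = self.bus.memory[addr]
        return self.lines[addr]

    def write(self, addr, value):
        self.lines[addr] = value
        self.bus.memory[addr] = value      # write-through to shared memory
        self.bus.snoop_invalidate(addr, self)

bus = Bus(memory={"x": 1})
cache_a, cache_k = Cache(bus), Cache(bus)
cache_a.read("x")          # CPU A caches x
cache_k.read("x")          # CPU K caches x
cache_k.write("x", 99)     # snoop invalidates A's copy
print(cache_a.read("x"))   # miss -> refetch -> 99, not the stale 1
```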
MESI State Transition Diagram
L1-L2 Cache Consistency
L1 caches do not connect to the bus → they do not engage in the snoop protocol.
Simple solution:
L1 is "write-through".
Updates and invalidations in L2 must be propagated to L1.
Approaches for write-back L1 caches exist → more complex.
Cache Coherence (interconnections other than a shared bus)
Directory Protocols
Collect and maintain information about copies of data in caches.
Typically a central directory stored in main memory.
Requests are checked against the directory.
Appropriate transfers are performed.
Creates a central bottleneck.
Effective in large-scale systems with complex interconnection schemes (according to Stallings).
Cache Coherence
Software Solutions
Compiler and operating system deal with the problem.
Overhead transferred to compile time.
Design complexity transferred from hardware to software.
However, software tends to make conservative decisions → inefficient cache utilization.
Code is analyzed to determine safe periods for caching shared variables.
HW+SW solutions exist.
Outline
Taxonomy MIMD systems
Symmetric multiprocessing Time shared bus Cache coherence
Multithreading Clusters NUMA systems Vector Computation
Increasing Performance
MIPS rate = f × IPC

where f is the processor clock frequency and IPC is the average number of instructions executed per cycle.

Pipelining and superscalar design increase IPC.
Mechanisms to maximize the utilization of each pipeline may be reaching a limit due to:
Complexity
Power consumption
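The MIPS rate relation can be worked through with example numbers (the numbers below are made up for illustration):

```python
# MIPS rate = f * IPC, with f in Hz and IPC in instructions per cycle.
# Example numbers are illustrative only.
f_hz = 2_000_000_000        # 2 GHz clock
ipc = 1.5                   # average instructions completed per cycle
mips = f_hz * ipc / 1_000_000
print(mips)                 # 3000.0 MIPS
```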
Threads and Processes
Process:
An instance of a program running on a computer. It is characterized by:
Resource ownership: main memory, I/O channels, I/O devices, and files
Scheduling/execution
Process switch
Threads and Processes
Thread:
A dispatchable unit of work within a process.
Includes processor context (program counter and stack pointer) and a data area for the stack (to enable subroutine branching).
Threads within a process share resources.
Threads are able to be executed independently.
Switching the processor between threads within the same process is typically less costly than a process switch.
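That threads within one process share the same address space can be seen directly with Python's threading module (a small illustrative sketch):

```python
import threading

# Threads within one process share the same address space: both workers
# append to the same list, and the main thread sees every update.
shared = []

def worker(tag):
    for i in range(3):
        shared.append((tag, i))   # writes go to shared process memory

t1 = threading.Thread(target=worker, args=("a",))
t2 = threading.Thread(target=worker, args=("b",))
t1.start(); t2.start()
t1.join(); t2.join()
print(len(shared))                # 6 items, written by two threads
```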
Implicit and Explicit Multithreading

Explicit multithreading:
user-level threads, visible to the application program
kernel-level threads, visible to the OS
All commercial processors and most experimental ones use explicit multithreading.

Implicit multithreading:
threads defined statically by the compiler or dynamically by hardware
Approaches to Explicit Multithreading
Interleaved (fine-grained)
The processor issues instruction(s) from a single thread at a time, switching threads at each clock cycle. If a thread is blocked, it is skipped.

Blocked (coarse-grained)
The processor issues instruction(s) from a single thread at a time. A thread executes until an event causes a delay, e.g. a cache miss. Effective on in-order processors; avoids pipeline stalls.

Simultaneous (SMT)
The processor issues instructions from multiple threads at a time to the execution units of a superscalar processor.

Chip multiprocessing
The processor is replicated on a single chip; each processor handles separate threads.

Scalar Processor Approaches
Single-threaded scalar
Simple pipeline, no multithreading.

Interleaved multithreaded scalar
Easiest multithreading to implement. Switch threads at each clock cycle. Pipeline stages are kept close to fully occupied. The hardware needs to switch thread contexts between cycles.

Blocked multithreaded scalar
A thread executes until a latency event occurs that would stop the pipeline; the processor then switches to another thread.
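The difference between the interleaved and blocked disciplines can be sketched as toy issue schedules (illustrative Python; the function names and the fixed "stall every N cycles" model are assumptions, not how real hardware decides):

```python
from itertools import cycle

# Interleaved (fine-grained): switch thread every clock cycle.
def interleaved(threads, cycles):
    order = cycle(range(len(threads)))
    return [threads[next(order)] for _ in range(cycles)]

# Blocked (coarse-grained): run one thread until a latency event
# (modeled here as occurring every `stall_every` cycles), then switch.
def blocked(threads, cycles, stall_every):
    schedule, t, run = [], 0, 0
    for _ in range(cycles):
        schedule.append(threads[t])
        run += 1
        if run == stall_every:            # e.g. a cache miss
            t, run = (t + 1) % len(threads), 0
    return schedule

print(interleaved(["A", "B"], 6))         # ['A', 'B', 'A', 'B', 'A', 'B']
print(blocked(["A", "B"], 6, 3))          # ['A', 'A', 'A', 'B', 'B', 'B']
```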
Scalar Diagrams
[Diagrams: issue slots per cycle for each approach; shaded slots mark stalls caused by some dependency, with 3 time slots needed to clear the dependency]
Superscalar Processor Approaches (1)
Superscalar
No multithreading
Interleaved multithreading superscalar:
Each cycle, as many instructions as possible issued from single thread Delays due to thread switches eliminated Number of instructions issued in cycle limited by dependencies
Blocked multithreaded superscalar
Instructions are issued from one thread at a time; blocked multithreading is used.
Superscalar Diagrams (1)
Very long instruction word (VLIW)
E.g. IA-64. Multiple instructions in a single word, typically constructed by the compiler: operations that may be executed in parallel are placed in the same word, which may be padded with no-ops.
Interleaved multithreading VLIW
Similar efficiencies to interleaved multithreading on a superscalar architecture.

Blocked multithreaded VLIW
Similar efficiencies to blocked multithreading on a superscalar architecture.
Superscalar Processor Approaches (2)
Superscalar Diagrams (2)
Simultaneous multithreading (SMT)
Issues multiple instructions at a time.
One thread may fill all horizontal slots.
Instructions from two or more threads may be issued.
With enough threads, the maximum number of instructions can be issued on each cycle.

Chip multiprocessor
Multiple processors, each a two-issue superscalar processor.
Each processor is assigned a thread and can issue up to two instructions per cycle per thread.

Superscalar Processor Approaches (3)
Superscalar Diagrams (3)
Outline
Taxonomy MIMD systems
Symmetric multiprocessing Time shared bus Cache coherence
Multithreading Clusters NUMA systems Vector Computation
Clusters
“A group of interconnected whole computers working together as a unified computing resource that can create the illusion of being one machine.”
Clusters
Benefits
Absolute scalability
Incremental scalability
High availability
Superior price/performance
Typically used for server applications.
Each computer in a cluster is called a node.
Cluster Configurations
Standby Server, No Shared Disk
Cluster Configurations
With Shared Disk
Clustering Methods
Passive Standby
Description: A secondary server takes over in case of primary server failure.
Benefits: Easy to implement.
Limitations: High cost, because the secondary server is unavailable for other processing tasks.

Active Secondary
Description: The secondary server is also used for processing tasks.
Benefits: Reduced cost, because secondary servers can be used for processing.
Limitations: Increased complexity.

Separate Servers
Description: Separate servers have their own disks; data is continuously copied from primary to secondary server.
Benefits: High availability.
Limitations: High network and server overhead due to copying operations.

Servers Connected to Disks
Description: Servers are cabled to the same disks, but each server owns its disks; if one server fails, its disks are taken over by the other server.
Benefits: Reduced network and server overhead due to elimination of copying operations.
Limitations: Usually requires disk mirroring or RAID technology to compensate for the risk of disk failure.

Servers Share Disks
Description: Multiple servers simultaneously share access to disks.
Benefits: Low network and server overhead; reduced risk of downtime caused by disk failure.
Limitations: Requires lock manager software; usually used with disk mirroring or RAID technology.
Operating Systems Design Issues
Failure Management
High availability: when one computer fails, another computer takes over.
Fault tolerance (redundancy).
Failover: switching applications & data from a failed system to an alternative within the cluster.
Failback: restoration of applications and data to the original system after the problem is fixed.
Load balancing:
Incremental scalability: automatically include new computers in scheduling.
Middleware needs to recognise that processes may switch between machines.
Parallelizing
Single application executing in parallel on a number of machines in cluster
Compiler
Determines at compile time which parts can be executed in parallel; these parts are split off for different computers.
Application
The application is written from scratch to be parallel, using message passing to move data between nodes. Hard to program, but gives the best end result.

Cluster Computer Architecture
Middleware provides a unified system image to the user; it handles the communication of software components and connects users to applications.
Blade Servers
Common cluster implementation that houses multiple server modules (blades) in a single chassis.
Saves space
Improves system management
The chassis provides the power supply
Each blade has its own processor, memory, and disk
Example of a Blade Server
Figure source: https://www.supermicro.com/en/products/blade/

Cluster × SMP
Both provide multiprocessing support to high demand applications. Both available commercially
SMP for longer
SMP:
Easier to manage and control Closer to single processor systems
Clustering:
Superior incremental & absolute scalability Superior availability → Redundancy
Scheduling is the main difference
SMP takes less physical space
SMP has lower power consumption

Outline
Taxonomy MIMD systems
Symmetric multiprocessing Time shared bus Cache coherence
Multithreading Clusters NUMA systems Vector Computation
NUMA Systems
Terminology:
Uniform memory access (UMA)
All processors have access to all parts of memory → load & store.
Access time to all regions of memory is the same.
Access time to memory is the same for all processors.
As used by SMP.

Nonuniform memory access (NUMA)
All processors have access to all parts of memory → load & store.
Access time of a processor differs depending on the region of memory.
Different processors access different regions of memory at different speeds.

Cache-coherent NUMA (CC-NUMA)
Cache coherence is maintained among the caches of the various processors.
Significantly different from SMP and clusters.

Motivation for NUMA
SMP has a practical limit to the number of processors:
bus traffic limits it to between 16 and 64 processors.
In clusters each node has its own memory:
applications do not see a large global memory;
coherence is maintained by software, not hardware.
NUMA retains the SMP flavour while giving large-scale multiprocessing:
e.g. the Silicon Graphics Origin, a NUMA machine with 1024 MIPS R10000 processors.
Objective:
to maintain a transparent system-wide memory while permitting multiprocessor nodes, each with its own bus or internal interconnection system.
CC-NUMA Organization
Single addressable memory space.
Memory request order (automatic and transparent to the processor):
1. L1 cache
2. L2 cache
3. Local main memory
4. Remote memory
NUMA Pros & Cons
Higher levels of parallelism than SMP, with minor software changes.
Performance can break down if there is too much access to remote memory.
This can be avoided by:
L1 & L2 cache design reducing all memory accesses (needs good temporal and spatial locality in the software)
Virtual memory management moving pages to the nodes that use them most
Keeping it cache coherent is costly in terms of HW and performance.
Outline
Taxonomy MIMD systems
Symmetric multiprocessing Time shared bus Cache coherence
Multithreading Clusters NUMA systems Vector Computation
Vector Computation
Mathematical problems involving physical processes present particular difficulties for computation:
Aerodynamics, seismology, meteorology, …
Continuous field simulation
High precision
Repeated floating-point calculations on large arrays of numbers
Vector Computation
Supercomputers handle these types of problems:
Hundreds of millions of flops
$10-15 million
Optimised for calculation rather than multitasking and I/O
Limited market: research, government agencies, meteorology

Array processor
An alternative to the supercomputer
Configured as a peripheral to an ordinary computer
Runs just the vector portion of problems
Approaches
Scalar Processing

c(i,j) = Σ_{k=1}^{N} a(i,k) · b(k,j)

for i=1:N
  for j=1:N
    for k=1:N
      C(i,j) = C(i,j) + A(i,k) * B(k,j);
    end
  end
end
Approaches
Vector Processing

c(i,j) = Σ_{k=1}^{N} a(i,k) · b(k,j)

for i=1:N
  for j=1:N
    C(i,j) = sum(A(i,:).*B(:,j)');
  end
end

All elements in the k dimension are handled in one operation.
All operations are independent → no pipeline hazards!
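The same contrast can be sketched in plain Python (illustrative; the slides use MATLAB-style pseudocode): the scalar version performs one multiply-accumulate per step, while the "vector" version treats the whole k dimension as a single inner-product operation.

```python
# Scalar version: one multiply-accumulate at a time (triple loop),
# computing c[i][j] = sum over k of a[i][k] * b[k][j].
def matmul_scalar(a, b):
    n = len(a)
    c = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c

# "Vector" version: the whole k dimension handled as one inner-product
# operation per output element, as a vector unit would.
def matmul_vector(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
print(matmul_scalar(a, b))   # [[19, 22], [43, 50]]
print(matmul_vector(a, b))   # same result
```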
Processor Designs
Pipelined ALU
Within operations Across operations
Parallel ALUs Parallel processors
Approaches to Vector Computation
Chaining
The ability to feed the result issued from one functional unit into another functional unit. (Cray Supercomputers)
A vector operation may start as soon as the first element of the operand vector is available and the functional unit is free.
The result from one functional unit is fed immediately into another.
If vector registers are used, intermediate results do not have to be stored in memory.

[Diagram: memory, vector registers, and a scalar register feeding pipelined ALUs, whose results chain into one another]
Outline
Taxonomy MIMD systems
Symmetric multiprocessing Time shared bus Cache coherence
Multithreading Clusters NUMA systems Vector Computation
Exercises
Exercise 1:
Consider a situation in which two processors (P1 and P2) in an SMP configuration, over time, require access to the same line of data x from main memory.
Initially both caches have an invalid copy of the line. Eventually, processor P1 reads line x. If this is the start of a sequence of accesses, complete the table below indicating the state transitions of this line in the caches of both processors.

Access           State transition in cache 1   State transition in cache 2
P2 reads x       E → S                         I → S
P1 writes to x   S → M                         S → I
P1 writes to x   M → M                         I → I
P2 reads x       M → S (write-back)            I → S
Exercises
Exercise 2: Consider an SMP with both L1 and L2 caches using the MESI protocol. One of four states is associated with each line in the L2 cache. Are all four states also needed for each line in the L1 cache? If so, why? If not, explain which state or states can be eliminated.
Exercises
Exercise 3: Consider a pipeline similar to the ones seen in the chapter on Pipelining, redrawn in Figure (a) with the fetch and decode stages omitted, representing the execution of thread A. Figure (b) illustrates the execution of a separate thread B. In both cases, a simple pipelined processor is used.

[Pipeline execution diagrams: thread A (instructions A1 … A16, with stall cycles) and thread B (instructions B1 … B7) passing through the CO, FO, EI, and WO stages over cycles 1–12]
Exercises
Exercise 3a : Show an instruction issue diagram, for each of the two threads Exercise 3b : Assume that the two threads are to be executed in parallel on a chip processor, with each of the two processors on the chip using a simple pipeline. Show an instruction issue diagram. Also show a pipeline execution diagram.
Exercises
Exercise 3c: Assume a two-issue superscalar architecture. Repeat part (b) for an interleaved multithreading superscalar implementation, assuming no data dependencies. Note: there is no unique answer; you need to make assumptions about latency and priority.
Exercises
Exercise 3d : Repeat part c for a blocked multithreading superscalar implementation.
Exercises
Exercise 3e : Repeat for a four-issue SMT architecture.
Exercises
Exercise 4: An application program is executed on a nine-computer cluster and takes time T. It was found that 25% of T was time in which the application was running simultaneously on all nine computers; during the remaining time, the application ran on a single computer.
a)
Calculate the effective speedup under the aforementioned condition as compared to executing the program on a single computer. Also calculate α, the percentage of code that has been parallelized (programmed or compiled so as to use the cluster mode) in the preceding program.
b)
Suppose that we are able to effectively use 17 computers rather than 9 computers on the parallelized portion of the code. Calculate the effective speedup that is achieved
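Exercise 4 applies the standard speedup model for a partially parallelized program (Amdahl's law). A small helper can be sketched as follows; the example numbers are illustrative and are not the exercise's:

```python
# Speedup of a program whose fraction f of serial execution time is
# parallelized across n processors (Amdahl's law).
def speedup(f, n):
    return 1.0 / ((1.0 - f) + f / n)

# Illustrative numbers (not the exercise's): half the code parallelized
# over 4 processors.
print(speedup(0.5, 4))   # 1.6
```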
Exercises
Exercise 5:
The figures below show the state diagrams of two possible cache coherence protocols. Deduce and explain each protocol, and compare each to MESI.

W(i) = write to line by processor i
R(i) = read line by processor i
Z(i) = displace line by processor i
W(j) = write to line by processor j (j ≠ i)
R(j) = read line by processor j (j ≠ i)
Z(j) = displace line by processor j (j ≠ i)
Note: state diagrams are for a given line in cache i
[State diagram (a): two states, Invalid and Valid, with transitions labeled by R(i), W(i), Z(i), R(j), W(j), Z(j)]
[State diagram (b): three states, Exclusive, Shared, and Invalid, with transitions labeled by R(i), W(i), Z(i), R(j), W(j), Z(j)]
Text Book References
These topics are covered in:
Stallings
Tanenbaum, chapter 8
Parhami