Introduction to High Performance Computing and Optimization

SLIDE 1

Institut für Numerische Mathematik und Optimierung

Introduction to High Performance Computing and Optimization

Oliver Ernst

Audience: 1./3. CMS, 5./7./9. Mm, doctoral students. Wintersemester 2012/13.

SLIDE 2

Contents

  • 1. Introduction
  • 2. Processor Architecture
  • 3. Optimization of Serial Code

    3.1 Performance Measurement
    3.2 Optimization Guidelines
    3.3 Compiler-Aided Optimization
    3.4 Combine example

SLIDE 3

Contents

  • 1. Introduction
  • 2. Processor Architecture
  • 3. Optimization of Serial Code

SLIDE 4

Processor Architecture

Von Neumann Architecture

John von Neumann (1903–1957), Hungarian-American mathematician and computer science pioneer. “First Draft of a Report on the EDVAC” (1945): a computer design based on a stored program, building on previous work by J. P. Eckert and J. W. Mauchly in the U Pennsylvania ENIAC project and on earlier theoretical work by A. Turing. Essentially all electronic digital computers are based on this model. Von Neumann bottleneck: manipulation of data in memory occurs only via traffic to the ALU; the width of this data path constrains all computing throughput (John Backus, 1977). Inherently sequential architecture.

[Block diagram: CPU (Control Unit, Arithmetic Logic Unit (ALU)), Memory, Input, Output]

SLIDE 5

Processor Architecture

Current Microprocessors

Extremely complex manufactured devices. Feature size currently at 22 nm and decreasing. Transistor count ≈ 1.4 billion on 160 mm². Fortunately: it is enough to understand the basic schematic workings of modern microprocessors.

[Images: Intel Westmere die shot; Intel Ivy Bridge die labelling]

SLIDE 6

Processor Architecture

Microprocessor block diagram

[Block diagram: main memory ↔ memory interface ↔ L2 unified cache ↔ L1 data and L1 instruction caches; memory queue and INT/FP queue feeding INT and FP register files; functional units: LD, ST, FP mult, FP add, INT mask/shift. Source: Hager & Wellein]

Arithmetic units for floating-point (FP) and integer (INT) operations. CPU registers (FP and general-purpose). Load (LD) and store (ST) units for transferring operands to/from registers. Instructions are sorted into queues. Caches hold data and instructions.

SLIDE 7

Processor Architecture

Performance

For scientific computing, performance is typically measured in floating-point operations per second, i.e.,

performance = (# floating point operations) / runtime.

Units: FLOPS, FLOP/s, Flop/sec, . . . What constitutes a FLOP?

An add or multiply of IEEE double-precision floating-point (FP) numbers (64 bits); division, square roots etc. take several cycles.

Peak performance is defined as

[max # floating point operations per cycle] × clock rate [Hz] × # cores × # sockets × # nodes.

Question: what is the peak performance of klio?

SLIDE 8

Processor Architecture

Example: Intel Xeon 5160 (Woodcrest, June 2006)

Architecture: Intel 64. Microarchitecture: Core (successor to NetBurst); first CPU with this microarchitecture, the server/workstation version of the Intel Core 2 processor. 65 nm manufacturing process technology, socket LGA771. Dual core, 2 sockets, total of 4 cores. Clock frequency: 3 GHz. Each core of the Woodcrest can perform 4 Flops in each clock cycle.

[Image: Woodcrest die shot. Source: Intel]

Peak performance: 4 Flops × 2 cores × 2 sockets × 3 GHz = 48 GFlops. But: higher rates are possible using SIMD instructions (MMX, SSE, AVX). More later.
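For illustration (not from the original slides), the peak-performance formula can be evaluated directly in a few lines of C; the parameters below are the Woodcrest values from this slide:

    #include <stdio.h>

    int main(void)
    {
        /* Woodcrest parameters from this slide */
        double flops_per_cycle  = 4.0;    /* per core */
        double cores_per_socket = 2.0;
        double sockets          = 2.0;
        double clock_hz         = 3.0e9;  /* 3 GHz */

        double peak = flops_per_cycle * cores_per_socket * sockets * clock_hz;
        printf("Peak performance: %.0f GFlops\n", peak / 1e9);  /* 48 GFlops */
        return 0;
    }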

SLIDE 9

Processor Architecture

Some definitions

Architecture: The instruction set of the CPU, also called instruction set architecture (ISA). The parts of a processor design that one needs to understand to write assembly code. Examples of ISAs:

  • Intel Architectures (IA): IA32/x86, Intel 64/EM64T, IA64
  • MIPS (SGI)
  • POWER (IBM)
  • SPARC (Sun)
  • ARM (Acorn)

Microarchitecture: Implementation of ISA; invisible features such as caches, cache structure, CPU cycle time, details of virtual memory system. Process technology: The size of the physical features (such as transistors) that make up the processor. Roughly: smaller is better due to lower power consumption, more chips per silicon wafer in production.

SLIDE 10

Processor Architecture

Intel architecture/microarchitecture/process roadmap

Intel tick-tock schedule. Tick: die shrink, i.e., new process technology. Tock: new microarchitecture.

Microarchitecture  Processor codename (server)  Process technology  Date introduced
Core               Woodcrest/Clovertown         65 nm               06/2006
Core               Dunnington/Harpertown        45 nm               11/2007
Nehalem            Nehalem                      45 nm               11/2008
Nehalem            Westmere                     32 nm               01/2010
Sandy Bridge       Sandy Bridge                 32 nm               01/2011
Sandy Bridge       Ivy Bridge                   22 nm               04/2012
Haswell            Haswell                      22 nm               03/2013 (?)
Haswell            Broadwell                    14 nm
Skylake            Skylake                      14 nm
Skylake            Skymont                      10 nm

SLIDE 11

Processor Architecture

Modern processor features

  • Pipelined instructions. Separate complex instructions into simpler ones which are executed by different functional units in an overlapping fashion; increases throughput; an example of instruction-level parallelism (ILP).
  • Superscalar architecture. Multiple functional units operating concurrently.
  • SIMD instructions (Single Instruction, Multiple Data). One instruction operates on a vector of data simultaneously. (Examples: Intel’s SSE, AMD’s 3dNow!, Power/PowerPC AltiVec.)
  • Out-of-order execution. When instruction operands are not available in registers, execute the next possible instruction(s) to avoid idle time (eligible instructions are held in a reorder buffer).
  • Caches. Small, fast, on-chip buffers holding data which has recently been used (temporal locality) or is close (in memory) to data which has recently been used (spatial locality).
  • Simplified instruction sets. Reduced Instruction Set Computers (RISC, 1980s), in contrast with CISC; simple instructions executing in few clock cycles allowed higher clock rates and freed up transistors; x86 processors translate to “µ-ops” on the fly.

SLIDE 12

Processor Architecture

Pipelining: Example

Pipelining, it’s natural. [D. Patterson, UC Berkeley] Four students (Anoop, Brian, Christine & Djamal) doing laundry, one load each. Washing takes 30 minutes. Drying takes 40 minutes. Folding takes 20 minutes.


Each complete laundry load, done sequentially, takes 90 minutes.

SLIDE 13

Processor Architecture

Pipelining: Example

Sequential laundry

  • Sequential laundry takes 6 hours for 4 loads.

[Timeline: tasks A–D run back-to-back, each washing (30 min), drying (40 min), folding (20 min), from 6 PM to midnight]

4 loads in 6 hours. How long would the pipelined laundry take?

SLIDE 14

Processor Architecture

Pipelining: Example

Pipelined laundry

Start work ASAP

[Timeline: tasks A–D overlapped; stage lengths 30, 40, 40, 40, 40, 20 minutes, from 6 PM to 9:30 PM]

Start each task as soon as the corresponding functional unit is available. Takes only 3.5 hours.

SLIDE 15

Processor Architecture

Pipelining: Example

Pipelining lessons:
  • Pipelining doesn’t help the latency of an individual task; it helps the throughput of the entire workload.
  • The pipeline rate is limited by the slowest pipeline stage.
  • Multiple stages operate concurrently.
  • Potential speedup = number of pipeline stages.
  • Unbalanced lengths of the pipeline stages reduce the speedup.
  • Time to “fill” (start-up) the pipeline and time to “drain” (wind-down) it reduce the speedup, especially if there are few loads relative to the number of stages.


SLIDE 16

Processor Architecture

Pipelining in computers

Pipelining is the basis of vector processors. Greatest source of ILP today. Invisible to programmer. Typical pipeline stages on CPU (MIPS ISA):

(1) Fetch instruction from memory
(2) Read registers while decoding the instruction
(3) Execute the operation or calculate an address
(4) Access an operand in data memory
(5) Write the result into a register

Helpful: machine instructions of equal length (x86 instructions, by contrast, range from 1 to 17 bytes).

SLIDE 17

Processor Architecture

Pipelining in computers

Pipelined floating point multiply (FPM)

[Figure (Hager & Wellein): timeline of a simplified floating-point multiplication pipeline with stages “separate mantissa/exponent”, “multiply mantissas”, “add exponents”, “normalize result”, “insert sign”, processing operands B(i), C(i) over cycles 1 … N+4, with wind-up and wind-down phases]

Timeline for a simplified floating-point multiplication pipeline executing A(:)=B(:)*C(:). One result is generated in each cycle after a four-cycle “filling” phase.

SLIDE 18

Processor Architecture

Pipelining hazards

Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle.

Structural hazards: the hardware cannot support this combination of instructions (a single person to fold and put clothes away; a washer/dryer combination).
Data hazards: an instruction depends on the result of a prior instruction still in the pipeline (a missing sock).
Control hazards: pipelining of branches and other instructions that change the program counter; an attempt to make a decision before the condition is evaluated (washing football uniforms and needing the proper detergent level; needing to see the dryer’s result before starting the next load); branch instructions.

A common solution is to stall the pipeline until the hazard is resolved, inserting one or more “bubbles” into the pipeline. An alternative solution to the control hazard problem: branch prediction. A third solution: delayed decision.

SLIDE 19

Processor Architecture

Pipeline parameters

An m-stage pipeline processing N tasks:

Speedup: Tseq/Tpipe = Nm/(N + m − 1) = m/(1 + (m − 1)/N) → m (N → ∞).

Throughput: N/Tpipe = N/(N + m − 1) = 1/(1 + (m − 1)/N) → 1 (N → ∞).

Given m, how large must N be to obtain α results per cycle (α ∈ (0, 1])?

α = 1/(1 + (m − 1)/N)  ⇔  Nα = (m − 1)/(1/α − 1) = (m − 1)α/(1 − α).

Often quoted: N1/2 = m − 1 (the N achieving α = 1/2).

Typical for current microprocessors: m = 10–35.
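A small numerical sketch of these formulas (not from the slides; the stage count m and the task counts N are assumed values):

    #include <stdio.h>

    int main(void)
    {
        int m = 10;                          /* assumed number of pipeline stages */
        for (int N = 10; N <= 100000; N *= 10) {
            double speedup    = (double)N * m / (N + m - 1);  /* -> m for large N */
            double throughput = (double)N / (N + m - 1);      /* -> 1 for large N */
            printf("N=%6d  speedup=%6.2f  throughput=%.4f\n", N, speedup, throughput);
        }
        /* tasks needed for a fraction alpha of peak throughput */
        double alpha = 0.5;
        printf("N_1/2 = %.0f\n", (m - 1) * alpha / (1.0 - alpha));  /* m - 1 = 9 */
        return 0;
    }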

SLIDE 20

Processor Architecture

Pipeline parameters

[Plot: throughput N/Tpipe vs. number of tasks N (10¹–10³, log scale) for m = 5, 10, 30, 100]
SLIDE 21

Processor Architecture

Superscalar processors

Design features enabling the generation of multiple results per cycle:
  • Fetch and decode of multiple instructions in one cycle (currently 3–6).
  • Integer operations (arithmetic, addressing) done in multiple units for add, mult, shift, mask etc. (currently 2–6).
  • Floating-point operations on multiple units (add, mult); additionally, fused multiply-add (FMA) pipelines can perform a ← b + c · d in one cycle.
  • Fast caches able to sustain enough loads/stores per cycle to feed these units.
Yet another form of ILP. Needs support from out-of-order execution and the compiler; to get more than 2–3 instructions/cycle, assembly coding is often necessary.

SLIDE 22

Processor Architecture

Vector extensions (SIMD)

Flynn’s taxonomy (Michael J. Flynn, 1966): classification of computer architectures.

                 Single Instruction   Multiple Instruction
Single Data      SISD                 MISD
Multiple Data    SIMD                 MIMD

SIMD first arose with vector supercomputers (CDC, Cray, Fujitsu), and again in the first “massively parallel” supercomputers (Connection Machine). Wide desktop deployment came with the x86 MMX extensions in 1996. On current cache-based microprocessors it appears on a smaller scale: concurrent execution of arithmetic operations on wide registers holding, e.g., 2 DP or 4 SP floating-point operands. Carried to extremes by GPUs. Note: sustained cache/memory bandwidth is necessary to feed the SIMD units.
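For illustration (not from the slides), a minimal SSE2 sketch in C of such a wide-register operation: one instruction adds two packed double-precision operands at a time. The function name vadd and the use of unaligned loads are assumptions:

    #include <emmintrin.h>   /* SSE2 intrinsics */

    void vadd(double *a, const double *b, const double *c, int n)
    {
        int i;
        for (i = 0; i + 1 < n; i += 2) {       /* two doubles per 128-bit register */
            __m128d vb = _mm_loadu_pd(&b[i]);
            __m128d vc = _mm_loadu_pd(&c[i]);
            _mm_storeu_pd(&a[i], _mm_add_pd(vb, vc));
        }
        for (; i < n; i++)                     /* scalar remainder */
            a[i] = b[i] + c[i];
    }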

SLIDE 23

Processor Architecture

Vector extensions x86

1978: Intel 8086 architecture introduced; 16 bit, dedicated registers.
1980: Intel 8087 floating-point coprocessor introduced; added ≈ 60 FP instructions. Stack in place of registers.
1997: Pentium and Pentium Pro architectures expanded with Multi Media Extensions (MMX). 57 new instructions, using the FP stack to accelerate multimedia/communication applications.
1999: Another 70 instructions, labeled Streaming SIMD Extensions (SSE); eight new separate 128-bit wide registers; new 32-bit SP data type.
2001: Yet another 144 instructions, SSE2; new 64-bit DP data type; compilers can choose between the stack and the 8 SSE registers for FP.
2004: SSE3, 13 new instructions; complex arithmetic, video encoding, FP conversion etc.
2006: SSE4, 54 new instructions.
2008: Advanced Vector Extensions (AVX); expand the SSE registers from 128 to 256 bits; redefine ≈ 250 instructions and add 128 more.

SLIDE 24

Processor Architecture

Memory Hierarchy

Ideal: Unlimited amount of immediately accessible storage for data/instructions. Reality: Memory bottleneck of the von Neumann computer.

source: Patterson & Hennessy

DRAM gap: development over time of the average time between memory accesses of a single processor/core (top) and of the latency of a DRAM access (bottom).

SLIDE 25

Processor Architecture

Memory Hierarchy

The fact that this gap can be tolerated, i.e., that computers can give us the illusion of unlimited fast memory, is based on the locality of memory references:

  • temporal: data recently used will likely be used again soon.
  • spatial: data close (in address space) to data recently used will likely be used soon.

Examples: Instructions in memory are accessed sequentially (except for branches), exhibiting spatial locality. Loops access instructions repeatedly, exhibiting temporal locality. Arrays are typically traversed sequentially (spatial locality). But: 2D and 3D grids must involve jumps in the linear address space. Note: locality of reference is a property of the software which the programmer can (and must) influence.
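As a sketch of the last point (not from the slides): C stores 2D arrays row-major, so the loop order below determines whether a traversal has spatial locality. Array and function names are illustrative:

    #define N 1024
    static double a[N][N];

    /* good spatial locality: walks memory sequentially, cache-line friendly */
    double sum_rowwise(void)
    {
        double s = 0.0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                s += a[i][j];
        return s;
    }

    /* poor spatial locality: jumps N*sizeof(double) bytes between accesses */
    double sum_columnwise(void)
    {
        double s = 0.0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                s += a[i][j];
        return s;
    }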

SLIDE 26

Processor Architecture

Memory Hierarchy

Consequence: computer system memory is organized in a hierarchy; levels become smaller, faster and costlier per byte toward the top, and larger, slower and cheaper per byte toward the bottom.

L0: CPU registers (hold words retrieved from the L1 cache)
L1: on-chip L1 cache, SRAM (holds cache lines retrieved from the L2 cache)
L2: on-chip L2 cache, SRAM (holds cache lines retrieved from main memory)
L3: main memory, DRAM (holds disk blocks retrieved from local disks)
L4: local secondary storage, local disks (hold files retrieved from disks on remote network servers)
L5: remote secondary storage (tapes, distributed file systems, Web servers)

source: M. Püschel, ETH Zürich

SLIDE 27

Processor Architecture

Memory Hierarchy

Typical access times and sizes:

Level          Speed    Size
Registers, L1  1 ns     KB
L2             10 ns    MB
Main memory    100 ns   GB
Disk           10 ms    TB
Tape           10 s     PB

SLIDE 28

Processor Architecture

Caches and their terminology

Caches: (one or more) levels of memory between the processor and main memory.

  • Unified caches store both data and instructions (typically L2 or higher); L1 is split into an instruction (L1I) and a data (L1D) cache.
  • Block or line: the smallest unit of data which can be present/absent in the cache. All data transfers occur in multiples of this unit.
  • Cache hit: requested data found in the cache; fast access.
  • Cache miss: requested data not found in the cache; lower memory levels must be accessed; slower access.
  • Hit rate: fraction of requests found in the cache. Miss rate: 1 − hit rate.
  • When the cache is full, the next load operation must evict a resident cache line.
  • Writing: when a cache line is modified, there are two strategies. Write-through caches immediately update the corresponding location in memory. Write-back caches only update the copy in the cache; memory is not updated until the cache line is about to be replaced. Both strategies can use a write buffer.

SLIDE 29

Processor Architecture

Simple cache performance model

β: cache reuse ratio, i.e., the fraction of loads/stores resulting in a cache hit due to a recent load/store of the surrounding cache line.
Tmem: access time (latency + bandwidth) to main memory.
Tc: access time for a cache hit.
τ := Tmem/Tc: relative performance penalty of a cache miss.

Average access time: Tav = βTc + (1 − β)Tmem

Performance gain:

G(τ, β) = Tmem/Tav = τTc/(βTc + (1 − β)Tmem) = τ/(β + τ(1 − β))
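A quick numerical sketch of this model (not from the slides; the 1 ns hit time and 50 ns memory access time are assumed values):

    #include <stdio.h>

    int main(void)
    {
        double T_c = 1.0, T_mem = 50.0;   /* assumed: ns per access */
        double tau = T_mem / T_c;
        for (double beta = 0.5; beta <= 1.0001; beta += 0.1) {
            double T_av = beta * T_c + (1.0 - beta) * T_mem;  /* average access time */
            double G    = tau / (beta + tau * (1.0 - beta));  /* gain over no cache  */
            printf("beta=%.1f  T_av=%5.1f ns  G=%5.2f\n", beta, T_av, G);
        }
        return 0;
    }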

[Plot: performance gain G(τ, β) vs. reuse ratio β ∈ [0.4, 1] for τ = 5, 10, 50]

SLIDE 30

Processor Architecture

Cache lines and spatial locality

Typical behavior of scientific codes: traverse a large data set (read, modify, write); no temporal locality. By always loading an entire cache line, the memory latency is only encountered on the line load; subsequent requests to nearby memory locations are serviced by the cache. For code exhibiting spatial locality, loading data cache-line-wise thus increases the cache hit rate. Example: streaming through memory sequentially with a cache line of length 16 elements:

hit rate γ = (# cache hits)/(# references) = 15/16 ≈ 0.94.

SLIDE 31

Processor Architecture

Caches and memory/compute bound programs

We define the operational intensity I of a program/algorithm as

I = nop/ntrans := (# operations)/(amount of data transferred between cache and RAM).

Programs with high I are called compute bound, those with low I memory bound. Trivial bound: ntrans ≥ nio := size(input data) + size(output data), therefore

I ≤ nop/nio.

Examples:
  • Vector add x ← x + y: I ≤ n/(2n) = O(1).
  • Matrix-matrix multiply (MMM) C ← C + AB: I ≤ 2n³/(3n²) = O(n).
  • Fast Fourier Transform (FFT) y = fft(x): I ≤ (5n log₂ n)/(2n) = O(log n).
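A throwaway sketch (not from the slides; link with -lm) tabulating these three upper bounds for a few problem sizes n:

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        for (double n = 1e2; n <= 1e6; n *= 100) {
            double I_vadd = n / (2.0 * n);                  /* O(1)     */
            double I_mmm  = 2.0 * n*n*n / (3.0 * n*n);      /* O(n)     */
            double I_fft  = 5.0 * n * log2(n) / (2.0 * n);  /* O(log n) */
            printf("n=%8.0f  vadd %4.2f  mmm %9.1f  fft %6.2f\n",
                   n, I_vadd, I_mmm, I_fft);
        }
        return 0;
    }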

SLIDE 32

Processor Architecture

Cache mapping

Where in the cache is a cache line placed?

  • Fully associative cache: anywhere there’s room. Expensive to build (search logic).
  • Direct-mapped cache: a unique location for each cache line. A line may be evicted when a new load maps to the same location even though the cache is not full; cache thrashing: when this happens in rapid succession.
  • E-way set-associative cache: anywhere among E possible positions within a uniquely assigned set. The set is chosen using s consecutive bits from the middle of the address.

More precisely: assume memory is addressed using m bits, yielding M = 2^m unique addresses. Partition the m address bits into (most to least significant) t tag bits, s set index bits and b block offset bits (m = t + s + b). The cache then has S = 2^s sets, each cache line containing B = 2^b words (bytes) of data. With E cache lines per set, this organization is summarized by the notation (S, E, B, m).
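An illustrative sketch (not from the slides) of this address partition in C, with assumed parameters b = 6 (64-byte lines) and s = 6 (64 sets):

    #include <stdint.h>
    #include <stdio.h>

    #define B_BITS 6   /* b: block offset bits, B = 2^6 = 64 bytes per line */
    #define S_BITS 6   /* s: set index bits,    S = 2^6 = 64 sets           */

    int main(void)
    {
        uint64_t addr = 0x7ffe12345678ULL;            /* arbitrary example address */
        uint64_t offset =  addr & ((1ULL << B_BITS) - 1);
        uint64_t set    = (addr >> B_BITS) & ((1ULL << S_BITS) - 1);
        uint64_t tag    =  addr >> (B_BITS + S_BITS); /* remaining t bits */
        printf("tag=0x%llx  set=%llu  offset=%llu\n",
               (unsigned long long)tag, (unsigned long long)set,
               (unsigned long long)offset);
        return 0;
    }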

SLIDE 33

Processor Architecture

Cache mapping

[Figure: cache-to-memory mappings, Way 1 / Way 2. Source: Hager & Wellein]

Direct-mapped (left) and 2-way set-associative (right) caches. Shading indicates mapping of lines in memory to their assigned locations in cache.

SLIDE 34

Processor Architecture

Cache miss taxonomy

Reasons for cache misses (the three Cs)

  • Compulsory. First access to this line; would occur even for a cache of infinite size.
  • Capacity. A previously resident cache line was evicted because the cache was full.
  • Conflict. A previously resident cache line was evicted because its set was full.

SLIDE 35

Processor Architecture

Prefetching

Latency on first access (miss). Example:

for (i=0; i<N; i++)
    s = s + a[i]*a[i];

One load stream. Assume a cache line size of 4 elements ⇒ 3 cache hits before the next miss. Between misses the memory bus is inactive.

[Timeline (Hager & Wellein): each cache miss (LD) incurs the full memory latency, followed by 3 iterations of “use data” served from cache while the memory bus sits idle, then the next miss]

Cache misses, latency penalty and inactive memory bus.

SLIDE 36

Processor Architecture

Prefetching

Prefetching: initiate a cache line load sufficiently far ahead of its use. Can be done by the compiler or by hardware. Requires that the memory system can sustain sufficiently many outstanding prefetch operations.

Number of outstanding prefetches required to hide latency: the time to transfer a cache line of length Lc, given the latency Tℓ and bandwidth B of the memory system, is

T = Tℓ + Lc/B.

With one prefetch operation per cache line transfer, the number of prefetches needed is the number of cache lines which can be moved in time T:

P = T/(Lc/B) = 1 + Tℓ/(Lc/B).
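A worked instance of this formula (the latency and bandwidth values are assumptions, not from the slides):

    #include <stdio.h>

    int main(void)
    {
        double T_l = 100e-9;       /* assumed memory latency: 100 ns  */
        double L_c = 64.0;         /* cache line length: 64 bytes     */
        double B   = 10e9;         /* assumed bandwidth: 10 GB/s      */
        double t_line = L_c / B;   /* time to stream one line: 6.4 ns */
        double P = 1.0 + T_l / t_line;
        printf("P = %.1f outstanding prefetches\n", P);   /* about 16.6 */
        return 0;
    }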

SLIDE 37

Processor Architecture

Prefetching

[Timeline (Hager & Wellein): prefetch (PF) instructions issued several iterations ahead; each cache-line load overlaps with “use data” on the previous line, so the miss latency no longer stalls the loads]

Sufficiently early prefetching permits overlapping of data movement and computation, a form of latency hiding.

SLIDE 38

Processor Architecture

Multicore processors

The laws of physics (essentially heat dissipation) have forced CPU manufacturers to use the increasing transistor count (Moore’s law) for multiple processor cores per chip/die/package rather than for increased clock frequency. See [Hager & Wellein (p. 23)] for the physical explanation.

core = CPU = processor. Socket: physical package containing one or more cores.

source: Tom’s Hardware

AMD Opteron, Intel Dempsey & Intel Woodcrest packages.

Desktop PCs have 1 socket, servers 2–4 (klio: 2, mathmaster: 2, FG compute server: 2). Putting all cores to work necessitates parallel programming. More cores reduce the available memory bandwidth per core.

SLIDE 39

Processor Architecture

Multicore processors: core and cache arrangements

[Diagrams of core/cache arrangements:
  • Dual core; separate L1, L2 and L3 caches per core (Intel Montecito).
  • Quad-core; separate L1 caches, L2 shared between pairs of cores (Intel Harpertown).
  • Hexa-core; separate L1 caches, L2 shared between pairs of cores, L3 shared across all 6 (Intel Dunnington).
  • Quad-core; separate L1 and L2 caches, shared L3; built-in memory interface and HT/QPI links allow attaching memory/more sockets directly (Intel Nehalem, AMD Shanghai).]

SLIDE 40

Processor Architecture

Multithreaded processors

Thread: a stream of instructions from one process/program/task. Modern processors have multiple functional (execution) units: FP add, FP mult, integer units (shift, rotate, arithmetic), load/store, branch, vector, dispatch/issue. At any given time, many/most of these will be idle:
  • a branch misprediction forces the instruction pipeline to be flushed;
  • a memory access must complete before dependent instructions can proceed;
  • the instruction mix only utilizes a small fraction of the functional units.
Hardware multithreading allows multiple threads to share the functional units of a single processor in an overlapping fashion.

SLIDE 41

Processor Architecture

Multithreaded processors

Intel (2002), 3.06 GHz Pentium 4/Xeon: “Hyper-Threading” (HT); neutral name: multithreading. Execute more than one thread simultaneously.
  • Some resources are replicated (registers, program counter) in order to duplicate the state of each active thread. Some are not: ALUs, caches, queues, memory interface.
  • A single physical processor/core appears as two logical processors. Requires OS/compiler support. Some code may take advantage of SMT more than other code.
  • A thread switch is much faster than a process switch (context switch).
  • Fine-grained multithreading: switch threads after every instruction.
  • Coarse-grained multithreading: switch threads only after significant events (e.g., a cache miss).
  • Simultaneous multithreading (SMT): issue multiple instructions from independent threads; the hardware handles dependencies among instructions.

SLIDE 42

Processor Architecture

Multithreaded processors

[Diagram: utilization of issue slots over time by threads A–D under fine-grained MT, coarse-grained MT and SMT]

source: Patterson & Hennessy

SLIDE 43

Processor Architecture

Multithreaded processors

Issues:
  • Single-thread performance is not improved; it may even decrease slightly due to MT overhead.
  • Well-optimized, FP-heavy code tends to benefit less from MT.
  • Pressure on shared resources (caches). Affinity.
Conservative maxim: run different processes on different physical cores unless certain that the code benefits from SMT.

SLIDE 44

Processor Architecture

Performance measurement: vector triad

for (j=0; j<NITER; j++) {
    for (i=0; i<N; i++)
        a[i] = b[i] + c[i]*d[i];
}

3 load streams, 1 store stream. The outer loop produces measurably long run times. Tricks to prevent compiler optimizations are omitted here.
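One way such a measurement might be fleshed out (a hedged sketch, not the course's actual benchmark code; the dummy() device and all sizes are assumptions; uses POSIX clock_gettime):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* opaque call defeats the optimizer's loop elimination */
    void dummy(double *a) { if (a[0] < 0.0) printf("%f", a[0]); }

    int main(void)
    {
        int N = 1000, NITER = 100000;
        double *a = malloc(N * sizeof *a), *b = malloc(N * sizeof *b);
        double *c = malloc(N * sizeof *c), *d = malloc(N * sizeof *d);
        for (int i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; d[i] = 3.0; }

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int j = 0; j < NITER; j++) {
            for (int i = 0; i < N; i++)
                a[i] = b[i] + c[i] * d[i];
            dummy(a);
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
        /* 2 flops (one add, one multiply) per inner iteration */
        printf("%.1f MFlops/s\n", 2.0 * N * (double)NITER / secs / 1e6);
        return 0;
    }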

SLIDE 45

Processor Architecture

Performance measurement: vector triad

Vector triad timings on locally available systems:

[Plot: MFlops/s vs. vector length (log scale) for Intel Woodcrest 5160 @ 3.0 GHz, Intel Clovertown E5335 @ 2.0 GHz, Intel Westmere X5670 @ 2.93 GHz, AMD Opteron 8380 @ 2.5 GHz, Intel Core i5 M460 @ 2.53 GHz]

SLIDE 46

Processor Architecture

Our Woodcrest 5160 system (klio)

CPU: Xeon 5160 @ 3 GHz, 2 sockets × 2 cores
Memory: 8 × 2 GB
Caches (all write-back):
  L1D: 32 KB, 8-way set associative, 64-byte line size
  L2: 4 MB, 16-way set associative, 64-byte line size
Memory bandwidth: to chipset 21.3 GB/s, to memory 21.3 GB/s

[Diagram: two sockets, each with two cores (C0, C1; 32 KB L1I + 32 KB L1D per core) sharing a 4 MB L2 cache; Front Side Bus (up to 2 × 1333 MHz × 64 bit) to the chipset; memory channels (up to 4 × 667 MHz × 64 bit) to memory (4 channels, FB-DIMMs, DDR2-667)]

SLIDE 47

Processor Architecture

Performance measurement: vector triad

Data on the remaining computers in the vector triad measurements (to be completed):

  • Clovertown @ 2.0 GHz (4 cores): 32 KB L1 cache/core; 4 MB L2 cache shared by 2 cores; 1333 MHz Front Side Bus (21.3 GB/s bandwidth)
  • Westmere @ 2.93 GHz (6 cores): 32 KB L1 cache/core; 256 KB L2 cache/core; 12 MB L3 cache shared; memory bandwidth max. 32 GB/s
  • Opteron @ 2.5 GHz: 32 KB L1 cache/core; 128 KB L2 cache/core; 6144 KB shared L3 cache; system bus speed 1 GT/s; memory controller speed 2 GHz
  • Core i5 @ 2.53 GHz: 32 KB L1D cache/core; 32 KB L1I cache/core; 256 KB L2 cache/core; 3 MB shared L3 cache; memory type DDR3-800/1066; max. memory bandwidth 17.1 GB/s; DMI 2.5 GT/s

SLIDE 48

Processor Architecture

Performance measurement: vector triad, interpretation of results

Each loop iteration operates on 4 vector elements, each a double-precision floating-point number (8 bytes), i.e., 32 bytes of data movement. An L1 cache of 32 KB (= 32 × 1024 bytes) can hold at most 1024 such sets of 4 vector elements, i.e., all four vectors fit into the L1 cache only for N ≤ 1024. This explains the flop-rate drop around N = 1024 for all Intel processors.

By the same reasoning, all four vectors fit into an L2 cache of 4 MB (= 4 × 1024 × 1024 bytes) only for N ≤ 128 × 1024 = 131 072. This explains the flop-rate drop around N = 10⁵ for the Woodcrest and Clovertown processors. For the Westmere CPU these events occur at N = 1024 (L1), N = 8192 (L2) and N = 393 216 (L3).

Cache bandwidth: Intel quotes 8.5 bytes/cycle, so one loop iteration (32 bytes) needs 3.75 cycles; at 3 GHz (3 × 10⁹ cycles/s) we should see 1.23 GFlops/s; we see roughly 1.4 GFlops/s.

Memory bandwidth: 10 664 MB/s corresponds to 333.25 M sets of 4 doubles per second; with 2 flops per set we should see 666.5 MFlops/s, but we see 200 MFlops/s. Intel quotes 3.5 GB/s memory bandwidth for servers; this means 3.5 G/32 ≈ 0.109 G loop iterations/s, which at 2 flops each gives 0.218 GFlops/s.
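A quick sketch (not from the slides) that reproduces these cache-fit limits from the cache sizes, at 32 bytes per loop index:

    #include <stdio.h>

    int main(void)
    {
        /* cache sizes of the measured systems; 4 doubles = 32 bytes per index */
        struct { const char *name; long bytes; } c[] = {
            { "L1  32 KB (all Intel)", 32L * 1024 },
            { "L2   4 MB (Woodcrest)", 4L * 1024 * 1024 },
            { "L2 256 KB (Westmere)",  256L * 1024 },
            { "L3  12 MB (Westmere)",  12L * 1024 * 1024 },
        };
        for (int k = 0; k < 4; k++)
            printf("%-22s fits N <= %ld\n", c[k].name, c[k].bytes / 32);
        return 0;    /* prints 1024, 131072, 8192, 393216 */
    }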
