Lecture 18: CSE 260 Parallel Computation (Fall 2015)
Scott B. Baden
Large scale computing
Announcements
- Office hours on Wednesday
  - 3:30 PM until 5:30
  - I'll stay after 5:30 until the last one leaves
- Test on the last day of class
  - BRING A BLUE BOOK
  - Tests your ability to apply the knowledge you've gained in the course
  - Open book, open notes
  - You may bring a PDF viewer (e.g. Preview, Acrobat) to look at course materials only
    - No web browsing: turn off the internet
    - No cell phones
Today’s lecture
- Supercomputers
- Architectures
- Applications
What is the purpose of a supercomputer?
- Improve our understanding of scientifically and technologically important phenomena
- Improve the quality of life through technological innovation, simulations, and data processing
  - Data mining
  - Image processing
  - Simulations: financial modeling, weather, biomedical
- Economic benefits
What is the world’s fastest supercomputer?
- Top500 #1 (> 2 years): Tianhe-2 @ NUDT (China)
  3.12M cores, 54.8 Pflops peak, 17.8 MW power (plus ~6 MW cooling), 12-core Ivy Bridge + Intel Xeon Phi
- #2: Titan @ Oak Ridge, USA: 561K cores, 27 PF peak, 8.2 MW, Cray XK7: AMD Opteron + NVIDIA Kepler K20X
top500.org
What does a supercomputer look like?
- Hierarchically organized servers
- Hybrid communication
  - Threads within the server
  - Pass messages between servers (or among groups of cores)
Edison @ nersc.gov
conferences.computer.org/sc/2012/papers/1000a079.pdf
State-of-the-art applications
Blood simulation on Jaguar (Georgia Tech team)

Strong scaling:
  p           48     384    3072   24576
  Time (sec)  899.8  116.7  16.7   4.9
  Efficiency  1.00   0.96   0.84   0.35

Weak scaling:
  p           24576  98304  196608
  Time (sec)  228.3  258    304.9
  Efficiency  1.00   0.88   0.75
Ab Initio Molecular Dynamics (AIMD) using Plane Waves Density Functional Theory Eric Bylaska (PNNL)
[Figure: exchange time on Hopper]
Slide courtesy Tan Nguyen, UCSD
Performance differs across application domains
- Colella's 7 dwarfs: patterns of communication and computation that persist over time and across implementations
  - Structured grids (e.g. the Panfilov method)
  - Dense linear algebra: matrix multiply, matrix-vector multiply, Gaussian elimination
  - N-body methods
  - Sparse linear algebra: use knowledge about the locations of the non-zeroes to improve some aspect of performance
  - Unstructured grids
  - Spectral methods (FFT)
  - Monte Carlo
C[i,j] += A[i,:] * B[:,j]
Application-specific knowledge is important
- No tool currently exists that can convert a serial
program into an efficient parallel program
… for all applications … all of the time… on all hardware
- The more we know about the application…
… specific problem … math/physics ... initial data … … context for analyzing the output… … the more we can improve performance
- Performance Programming Issues
  - Data motion and locality
  - Load balancing
  - Serial sections
Sparse Matrices
- A matrix where we employ knowledge about the
location of the non-zeroes
- Consider Jacobi’s method with a 5-point stencil
u'[i,j] = (u[i-1,j] + u[i+1,j] + u[i,j-1] + u[i,j+1] - h²·f[i,j]) / 4
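A minimal C sketch of one such Jacobi sweep (a hedged illustration: the function and array names are assumptions, with fixed Dirichlet values in rows/columns 0 and N+1):

    /* One Jacobi sweep with the 5-point stencil above.
       u, unew, f are (N+2) x (N+2) grids; h is the mesh spacing;
       the boundary rows/columns hold fixed Dirichlet values. */
    void jacobi_sweep(int N, double h,
                      double u[N+2][N+2],
                      double unew[N+2][N+2],
                      double f[N+2][N+2])
    {
        for (int i = 1; i <= N; i++)
            for (int j = 1; j <= N; j++)
                unew[i][j] = (u[i-1][j] + u[i+1][j] +
                              u[i][j-1] + u[i][j+1] -
                              h*h*f[i][j]) / 4.0;
    }

Only the 5 stencil neighbors of u (plus f) are touched per output point; the sparsity is exploited so completely that the matrix is never stored at all.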
1M x 1M submatrix of the web connectivity graph, constructed from an archive at the Stanford WebBase
3 non-zeroes/row
Dense: 2^20 × 2^20 = 2^40 = 1024 Gwords
Sparse: (3/2^20) × 2^40 = 3 Mwords
The sparse representation saves a factor of 1 million in storage
Web connectivity Matrix: 1M x 1M
Jim Demmel
Circuit Simulation
www.cise.ufl.edu/research/sparse/matrices/Hamm/scircuit.html
Motorola circuit: 170,998 × 170,998 matrix, 958,936 non-zeroes
0.003% non-zeroes, 5.6 non-zeroes/row
Generating sparse matrices from unstructured grids
A 2D airfoil
- In some applications of sparse
matrices, we generate the matrix from an “unstructured” mesh, e.g. finite element method
- In some cases we apply direct
mesh updates, using nearest neighbors
- Irregular partitioning
Sparse Matrix Vector Multiplication
- Important kernel used in linear algebra
- Assume x[] fits in memory of 1 processor
y[i] += A[i,j] × x[j]
- Many formats, common format for CPUs is
Compressed Sparse Row (CSR)
Jim Demmel
Sparse matrix vector multiply kernel
// y[i] += A[i,j] × x[j]
#pragma omp parallel for schedule(dynamic, chunk)
for i = 0 : N-1              // rows
    i0 = ptr[i]
    i1 = ptr[i+1] - 1
    for k = i0 : i1          // non-zeroes of row i
        y[i] += val[k] * x[ind[k]]
    end k
end i
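The same kernel as compilable C (a sketch: the OpenMP chunk size of 64 and the function name are assumptions; ptr, ind, val follow the CSR naming above):

    /* y[i] += A[i,j] * x[j] for an N x N sparse matrix in CSR:
       ptr[i] .. ptr[i+1]-1 index the non-zeroes of row i;
       ind[k] is the column of the k-th non-zero, val[k] its value.
       Compile with -fopenmp; without it the pragma is ignored. */
    void spmv_csr(int N, const int *ptr, const int *ind,
                  const double *val, const double *x, double *y)
    {
        #pragma omp parallel for schedule(dynamic, 64)
        for (int i = 0; i < N; i++) {
            double sum = y[i];
            for (int k = ptr[i]; k < ptr[i+1]; k++)
                sum += val[k] * x[ind[k]];
            y[i] = sum;
        }
    }

Dynamic scheduling is used because rows can have very different non-zero counts, so a static partition of rows may load imbalance.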
Up and beyond to Exascale
- In 1961, President Kennedy mandated a
landing on the Moon by the end of the decade
- July 20, 1969 at tranquility base
“The Eagle has landed”
- The US Government set an ambitious
schedule to reach 10^18 flops by ~2023
- DOE is taking the lead in the US, EU also
engaged
- Massive technical challenges
The Challenges to landing “Eagle”
- High levels of parallelism within and across nodes
  - 10^18 flops using NVIDIA devices @ 10^12 flops each
  - 10^6 devices, 10^9+ threads
- Power: ≤ 20 MW. Today: 18 MW @ 0.05 Exaflops
  - Power consumption: 1-2 nJ/op today → 20 pJ/op at Exascale (20 pJ/op × 10^18 op/s = 20 MW)
  - Data storage & access consume most of the energy
- Ever lengthening communication delays
  - Complicated memory hierarchies
  - Raise the amount of computation per unit of communication
  - Hide latency, conserve locality
- Reliability and resilience
  - Blue Gene/L's Mean Time Between Failures (MTBF)
measured in days
- Application code complexity; domain specific languages
  - NUMA processors, not fully cache coherent on-chip
  - Mixture of accelerators and conventional cores
Technological trends
- Growth: cores/socket rather than sockets
- Heterogeneous processors
- Memory/core is shrinking
- Complicated software-managed parallel memory hierarchy
- Communication costs increasing relative to computation
Intel Sandy Bridge, anandtech.com
35 years of processor trends
How do we manage these constraints?
- Increase the amount of computation performed per unit of communication
  - Conserve locality: "communication avoiding"
- Hide communication
- Many threads
[Figure: relative improvement vs. year (Giga → Tera → Peta → Exa?) for processor, memory bandwidth, and latency]
A Crosscutting issue: hiding communication
- Little's law [1961]
  - The number of in-flight operations (threads) must equal the parallelism times the latency: T = p × λ
  - p and λ are increasing with time
- Difficult to implement
  - Split-phase algorithms
  - Partitioning and scheduling
- The state of the art enables but doesn't support the activity
- Distracts from the focus on the domain science
- Implementation policies entangled with correctness issues
  - Non-robust performance
  - High development costs
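A worked instance of Little's law (the numbers are illustrative assumptions, not measurements from the course):

    % in-flight operations = parallelism x latency
    T = p \times \lambda
    % e.g. a memory system that can start 4 operations per cycle,
    % each taking 200 cycles to complete:
    T = 4 \times 200 = 800 \text{ operations must be in flight to hide the latency}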
Motivating application
- Solve Poisson's equation in 3 dimensions with Dirichlet boundary conditions: Δϕ = ρ(x,y,z), ϕ = 0 on ∂Ω
- Building block: iterative solver using
Jacobi’s method (7-point stencil)
[Figure: domain Ω with boundary ∂Ω; ρ ≠ 0 in the interior]

for (i,j,k) in 1:N × 1:N × 1:N
    u'[i][j][k] = (u[i-1][j][k] + u[i+1][j][k] +
                   u[i][j-1][k] + u[i][j+1][k] +
                   u[i][j][k-1] + u[i][j][k+1]) / 6.0
Classic message passing implementation
- Decompose domain into sub-regions, one per process
  - Transmit halo regions between processes
  - Compute the inner region after communication completes
- Loop carried dependences impose a strict ordering on
communication and computation
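A minimal MPI sketch of this ordering for a 1D decomposition into planes (a hedged illustration: the function and variable names are assumptions; up and down may be MPI_PROC_NULL at the ends of the process row):

    #include <mpi.h>

    /* Classic ordering: exchange halos, then compute everything.
       Each rank owns planes 1..nlocal, each of 'plane' doubles;
       planes 0 and nlocal+1 are the ghost (halo) planes. */
    void exchange_halos(double *u, int plane, int nlocal,
                        int up, int down, MPI_Comm comm)
    {
        /* Send my top plane up; receive my bottom ghost from below. */
        MPI_Sendrecv(&u[nlocal * plane], plane, MPI_DOUBLE, up, 0,
                     &u[0], plane, MPI_DOUBLE, down, 0,
                     comm, MPI_STATUS_IGNORE);
        /* Send my bottom plane down; receive my top ghost from above. */
        MPI_Sendrecv(&u[plane], plane, MPI_DOUBLE, down, 1,
                     &u[(nlocal + 1) * plane], plane, MPI_DOUBLE, up, 1,
                     comm, MPI_STATUS_IGNORE);
    }

Only after both exchanges return does each process sweep its sub-region, which is the strict ordering the dependences impose.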
Communication tolerant variant
- Only a subset of the domain exhibits loop carried
dependences with respect to the halo region
- Subdivide the domain to remove some of the
dependences
- We may now sweep the inner region in parallel
with communication
- Sweep the annulus after communication finishes
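The reordered structure, sketched with real MPI calls (MPI_Irecv/MPI_Isend/MPI_Waitall) and assumed helper routines standing in for the halo posts and the two sweeps:

    /* Communication-tolerant ordering. */
    MPI_Request reqs[4];
    post_halo_exchange(u, reqs);   /* assumed helper: MPI_Irecv + MPI_Isend
                                      for both halo planes */
    sweep_inner(u, unew);          /* assumed helper: points that do not
                                      read the halo */
    MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);
    sweep_annulus(u, unew);        /* assumed helper: the outer shell
                                      that reads the halo */

The inner sweep now runs while the messages are in flight; only the annulus waits.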
Processor Virtualization
- Virtualize the processors by overdecomposing
- AMPI [Kalé et al.]
- When an MPI call blocks, the thread yields to another virtual process
- How do we inform the scheduler
about ready tasks?
Observations
- The exact execution order depends on the data
dependence structure: communication & computation
- But many other correct orderings are possible, and
some can enable us to hide communication
- We can characterize the running program in terms of
a task precedence graph
- There is a deterministic procedure for translating MPI
code into the graph
[Figure: SPMD MPI code (Irecv j; Send j; Wait; Compute, repeated) translated into a task graph over tasks 1-4]
A data driven execution model
- Parallelism exists among independent tasks
- Independent tasks may execute concurrently
- A task is runnable when its data dependences have been met
- A task suspends if its data dependences are not met
- Computation and data motion are coupled activities
- The scheduler determines which task(s) to run next
- Scheduler and application are only vaguely aware of one another
- Scheduler doesn’t affect graph execution semantics
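A toy C illustration of the dependence-counting idea behind such a scheduler (everything here, names and types alike, is a simplified assumption and not any particular runtime; a real runtime would decrement the counter atomically):

    /* A task becomes runnable when its count of unmet
       data dependences drops to zero. */
    typedef struct Task {
        int unmet;                     /* remaining input dependences   */
        void (*run)(struct Task *);    /* the task body                 */
        struct Task **succs;           /* tasks that consume our output */
        int nsuccs;
    } Task;

    void enqueue_runnable(Task *t);    /* assumed: hand off to a worker */

    /* Called when one of a task's inputs arrives (e.g. a message). */
    void satisfy(Task *t)
    {
        if (--t->unmet == 0)
            enqueue_runnable(t);
    }

    /* When a task finishes, its output satisfies each successor. */
    void complete(Task *t)
    {
        for (int i = 0; i < t->nsuccs; i++)
            satisfy(t->succs[i]);
    }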
Bamboo
- A custom, domain-specific source-to-source translator for translating MPI into an equivalent data-driven formulation (Tan Nguyen, PhD 2014, UCSD)
[Figure: the Bamboo runtime system: SPMD MPI code becomes a task dependency graph, executed by worker threads and communication handlers under dynamic scheduling]
3D Jacobi Results
Strong scaling, size = 3072³, on Hopper @ NERSC (Cray XE-6)
[Figure: TFLOPS vs. cores (12288, 24576, 49152, 98304) for MPI-basic, MPI-olap, MPI+OMP, (MPI+OMP)-olap, Bamboo-basic, and MPI-nocomm; Bamboo speedups S = 1.02 at virtualization factor VF = 8, S = 1.05 at VF = 4, S = 1.27 at VF = 2, S = 1.15 at VF = 2]
Inside a supercomputer
Blue Gene
- IBM-US Dept of Energy collaboration
- Low power, high performance interconnect
- 3rd Generation: Blue Gene/Q
- Sequoia at LLNL, #3 on top 500
  - 17.2 Pflops (Linpack), 20.1 PF peak, 7.9 MW
  - 1.6M cores / 98,304 compute nodes, 16 GB mem/node
  - 96 racks, 3,000 square feet
  - 64-bit PowerPC (A2 core)
  - Weighs about the same as 30 elephants
- https://computing.llnl.gov/tutorials/bgq
Hierarchical Packaging
SC ‘10, via IBM
- 1. Chip: 16 cores
- 2. Module: single chip
- 3. Compute Card: one single-chip module, 16 GB DDR3 memory
- 4. Node Card: 32 compute cards, optical modules, link chips, torus
- 5a. Midplane: 16 node cards
- 5b. I/O Drawer: 8 I/O cards, 8 PCIe Gen2 slots
- 6. Rack: 2 midplanes, 1, 2 or 4 I/O drawers
- 7. System: 96 racks, 20 PF/s
Blue Gene/Q Interconnect
- 5D toroidal mesh (end around)
  - Can scale to > 2M cores
  - 2 GB/sec bidirectional bandwidth per link (raw); on all 10 links, 1.8 GB/s available to the user
  - 5D nearest-neighbor exchange at ~1.8 GB/s/link, 98% efficiency
  - Hardware latency ranges from 80 ns to 3 µs
- Collective network
  - Global barrier, Allreduce, prefix sum
  - Floating-point reductions at ~95% of peak
IBM
Die photograph
IBM
- 16 (user) + 1 (OS) cores
- An 18th redundant core
- Each core 4-way multithreaded @ 1.6 GHz
- Quad-wide (double precision) FPU: SIMD
- 42.6 GB/s DDR3 bandwidth
- Network routing integrated on-chip (5D torus)
- Compare with the layout of the Cray 1 (1976)
- Private L1 (16K/16K), shared L2 cache: 32 MB
Architectural directions
Hybrid processing
- Two types of processors: general purpose + accelerator
  - AMD Fusion: 4 x86 cores + hundreds of Radeon GPU cores
- Accelerator can perform certain tasks more quickly than the
conventional cores
- Accelerator amplifies relative cost of communication, latency hiding
important, but not sufficient
Hothardware.com
Memory structure
- Only partial cache coherence on-chip
- NUMA
http://www.hector.ac.uk/cse/documentation/Phase2b/#arch
Revisiting Broadcast at the Exascale
- Uses a different algorithm for long messages than for short messages, based on van de Geijn's strategy; see cseweb.ucsd.edu/classes/wi14/cse260-a/Lectures/Lec15.pdf
- Scatter the data
  - Divide the data to be broadcast into pieces, and fill the machine with the pieces
- Do an Allgather
  - Now that everyone has a part of the entire result, collect on all processors
- Faster than standard algorithm for long messages
2·((p−1)/p)·nβ << ⌈lg p⌉·nβ
(the bandwidth term is about 2nβ for the two-phase algorithm, versus ⌈lg p⌉·nβ for the standard tree broadcast)
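The two phases compose directly from MPI's built-in collectives (a sketch, assuming p divides n; a real MPI library would use the recursive implementations shown on the next two slides inside these calls):

    #include <mpi.h>
    #include <stdlib.h>

    /* Long-message broadcast as scatter + allgather.
       buf holds n doubles on every rank; root's copy is the source. */
    void bcast_long(double *buf, int n, int root, MPI_Comm comm)
    {
        int p;
        MPI_Comm_size(comm, &p);
        int chunk = n / p;                   /* assume p divides n */
        double *piece = malloc(chunk * sizeof *piece);
        /* Phase 1: fill the machine with the pieces. */
        MPI_Scatter(buf, chunk, MPI_DOUBLE,
                    piece, chunk, MPI_DOUBLE, root, comm);
        /* Phase 2: everyone collects all p pieces into buf. */
        MPI_Allgather(piece, chunk, MPI_DOUBLE,
                      buf, chunk, MPI_DOUBLE, comm);
        free(piece);
    }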
Algorithm for long messages
[Figure: the scatter step: root P0 distributes pieces to P1 ... Pp-1]
The scatter step
- Uses a recursive, hypercube-like algorithm
- Running time is ⌈lg p⌉·α + ((p−1)/p)·nβ
Algorithm for long messages
[Figure: the allgather step among P0, P1, ..., Pp-1]
The allgather step
- Uses an all-to-all recursive doubling algorithm
- For P a power of two, running time is ⌈lg p⌉·α + ((p−1)/p)·nβ
- Combining the two steps, the long-message broadcast costs about 2⌈lg p⌉·α + 2·((p−1)/p)·nβ
Software
- We need to think hard about hiding an
explosion of details
- Parallel programming languages
- Domain specific languages
- Embedded domain specific languages
- Autotuning
What you learned in this class
- How to solve computationally intensive problems on parallel computers effectively
  - Theory and practice
  - Software techniques
  - Performance tradeoffs
- CSE 260 built on what you learned earlier in your career about programming and algorithm design & analysis, and generalized it
How does parallel computing relate to other branches of computer science?
- Parallel processing generalizes problems we
encounter on single processor computers
- A parallel computer is just an extension of
the traditional memory hierarchy
- The need to preserve locality, which
prevails in virtual memory, cache memory, and registers, also applies to a parallel computer
Do you have an application in mind for parallel processing?
- A. Yes
- B. No
- C. Maybe