Lecture 18 CSE 260 Parallel Computation (Fall 2015) Scott B. Baden - PowerPoint PPT Presentation


SLIDE 1

Lecture 18 CSE 260 – Parallel Computation (Fall 2015) Scott B. Baden Large scale computing

SLIDE 2

Announcements

  • Office hours on Wednesday
    ◦ 3:30 PM until 5:30 PM
    ◦ I'll stay after 5:30 until the last one leaves
  • Test on the last day of class
    ◦ BRING A BLUE BOOK
    ◦ Tests your ability to apply the knowledge you've gained in the course
    ◦ Open book, open notes
    ◦ You may bring a PDF viewer (e.g. Preview, Acrobat) to look at course materials only
  • No web browsing - turn off internet access
  • No cell phones

SLIDE 3

Today’s lecture

  • Supercomputers
  • Architectures
  • Applications

SLIDE 4

What is the purpose of a supercomputer?

  • Improve our understanding of scientifically and technologically important phenomena
  • Improve the quality of life through technological innovation, simulations, and data processing
    ◦ Data mining
    ◦ Image processing
    ◦ Simulations - financial modeling, weather, biomedical
  • Economic benefits

SLIDE 5

What is the world’s fastest supercomputer?

  • Top500 #1 (>2 years): Tianhe-2 @ NUDT (China)
    3.12M cores, 54.8 Pflops peak, 17.8 MW power + 6 MW cooling, 12-core Ivy Bridge + Intel Xeon Phi
  • #2: Titan @ Oak Ridge, USA: 561K cores, 27 PF, 8.2 MW, Cray XK7: AMD Opteron + NVIDIA Kepler K20X

top500.org

SLIDE 6

What does a supercomputer look like?

  • Hierarchically organized servers
  • Hybrid communication
    ◦ Threads within the server
    ◦ Pass messages between servers (or among groups of cores)

Edison @ nersc.gov

conferences.computer.org/sc/2012/papers/1000a079.pdf

SLIDE 7

State-of-the-art applications

Blood simulation on Jaguar (Georgia Tech team)

Strong scaling:
  p            48      384     3072    24576
  Time (sec)   899.8   116.7   16.7    4.9
  Efficiency   1.00    0.96    0.84    0.35

Weak scaling:
  p            24576   98304   196608
  Time (sec)   228.3   258     304.9
  Efficiency   1.00    0.88    0.75

Ab Initio Molecular Dynamics (AIMD) using plane-wave Density Functional Theory, Eric Bylaska (PNNL)

[Figure: exchange time on Hopper]

Slide courtesy Tan Nguyen, UCSD
SLIDE 8

Performance differs across application domains

  • Colella's 7 dwarfs: patterns of communication and computation that persist over time and across implementations
    ◦ Structured grids - Panfilov method
    ◦ Dense linear algebra - matrix multiply, matrix-vector multiply, Gaussian elimination
    ◦ N-body methods
    ◦ Sparse linear algebra - with a sparse matrix, use knowledge about the locations of the non-zeroes to improve some aspect of performance
    ◦ Unstructured grids
    ◦ Spectral methods (FFT)
    ◦ Monte Carlo

[Figure: matrix multiply update C[i,j] += A[i,:] * B[:,j]]

SLIDE 9

Application-specific knowledge is important

  • There currently exists no tool that can convert a serial program into an efficient parallel program
    … for all applications … all of the time … on all hardware
  • The more we know about the application…
    … the specific problem … the math/physics … the initial data … the context for analyzing the output…
    … the more we can improve performance
  • Performance programming issues
    ◦ Data motion and locality
    ◦ Load balancing
    ◦ Serial sections

SLIDE 10

Sparse Matrices

  • A matrix where we employ knowledge about the location of the non-zeroes
  • Consider Jacobi's method with a 5-point stencil

    u'[i,j] = (u[i-1,j] + u[i+1,j] + u[i,j-1] + u[i,j+1] - h²f[i,j]) / 4

SLIDE 11

1M x 1M submatrix of the web connectivity graph, constructed from an archive at the Stanford WebBase

3 non-zeroes/row

Dense: 2^20 × 2^20 = 2^40 = 1024 Gwords
Sparse: (3/2^20) × 2^40 = 3 Mwords
The sparse representation saves a factor of about 1 million in storage

Web connectivity Matrix: 1M x 1M

Jim Demmel
SLIDE 12

Circuit Simulation

www.cise.ufl.edu/research/sparse/matrices/Hamm/scircuit.html

Motorola circuit: 170,998 × 170,998 matrix, 958,936 non-zeroes, 0.003% non-zeroes, 5.6 non-zeroes/row

SLIDE 13

Generating sparse matrices from unstructured grids

A 2D airfoil

  • In some applications of sparse matrices, we generate the matrix from an "unstructured" mesh, e.g. the finite element method
  • In some cases we apply direct mesh updates, using nearest neighbors
  • Irregular partitioning

SLIDE 14

Sparse Matrix Vector Multiplication

  • An important kernel used in linear algebra
  • Assume x[] fits in the memory of 1 processor

    y[i] += A[i,j] × x[j]

  • There are many formats; a common format for CPUs is Compressed Sparse Row (CSR)
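As a small illustration of the CSR format (the matrix below is made up for this example, not from the lecture), only the non-zeroes are stored, row by row, together with their column indices and a pointer to where each row starts:

    A = | 10  0  0  2 |      val = [10, 2, 3, 9, 7, 8, 7, 5]
        |  0  3  9  0 |      ind = [ 0, 3, 1, 2, 0, 2, 1, 3]   (column index of each stored value)
        |  7  0  8  0 |      ptr = [ 0, 2, 4, 6, 8]            (row i occupies val[ptr[i] .. ptr[i+1]-1])
        |  0  7  0  5 |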

Jim Demmel
SLIDE 15

Sparse matrix vector multiply kernel

// y[i] += A[i,j] × x[j]
#pragma omp parallel for schedule(dynamic, chunk)
for i = 0 : N-1            // rows
    i0 = ptr[i]
    i1 = ptr[i+1] - 1
    for j = i0 : i1        // non-zeroes in row i
        y[i] += val[j] * x[ind[j]]
    end j
end i
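A compilable C version of the kernel above, given only as a sketch (the function name and the chunk size of 64 are arbitrary illustrative choices):

    /* y += A*x, with A stored in CSR form (val, ind, ptr); n is the number of rows. */
    void spmv_csr(int n, const int *ptr, const int *ind,
                  const double *val, const double *x, double *y)
    {
        #pragma omp parallel for schedule(dynamic, 64)
        for (int i = 0; i < n; i++) {
            double sum = y[i];
            for (int j = ptr[i]; j < ptr[i + 1]; j++)
                sum += val[j] * x[ind[j]];
            y[i] = sum;
        }
    }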


SLIDE 16

Up and beyond to Exascale

  • In 1961, President Kennedy mandated a landing on the Moon by the end of the decade
  • July 20, 1969, at Tranquility Base: "The Eagle has landed"
  • The US Government set an ambitious schedule to reach 10^18 flops by ~2023
  • DOE is taking the lead in the US; the EU is also engaged
  • Massive technical challenges

SLIDE 17

The Challenges to landing “Eagle”

  • High levels of parallelism within and across nodes
    ◦ 10^18 flops using NVIDIA devices @ 10^12 flops each
    ◦ 10^6 devices, 10^9+ threads
  • Power: ≤ 20 MW. Today: 18 MW @ 0.05 Exaflops
    ◦ Power consumption: 1-2 nJ/op today → 20 pJ/op at Exascale (see the worked example after this list)
    ◦ Data storage & access consume most of the energy
  • Ever-lengthening communication delays
    ◦ Complicated memory hierarchies
    ◦ Raise the amount of computation per unit of communication
    ◦ Hide latency, conserve locality
  • Reliability and resilience
    ◦ Blue Gene/L's Mean Time Between Failures (MTBF) is measured in days
  • Application code complexity; domain-specific languages
    ◦ NUMA processors, not fully cache coherent on-chip
    ◦ Mixture of accelerators and conventional cores
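Where the 20 pJ/op target comes from (a back-of-the-envelope check, not taken from the slides): delivering 10^18 op/s within a 20 MW budget allows at most 20 × 10^6 W ÷ 10^18 op/s = 2 × 10^-11 J/op = 20 pJ/op, roughly a 50-100× reduction from today's 1-2 nJ/op.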

SLIDE 18
Technological trends

  • Growth: cores/socket rather than sockets
  • Heterogeneous processors
  • Memory/core is shrinking
  • Complicated, software-managed parallel memory hierarchy
  • Communication costs increasing relative to computation

Intel Sandy Bridge, anandtech.com

SLIDE 19

35 years of processor trends

SLIDE 20
How do we manage these constraints?

  • Increase the amount of computation performed per unit of communication
    ◦ Conserve locality, "communication avoiding"
  • Hide communication
  • Many threads

[Figure: relative improvement by year for processor, memory, bandwidth, and latency, spanning the Giga, Tera, and Peta eras toward Exa]

SLIDE 21

A crosscutting issue: hiding communication

  • Little's law [1961]
    ◦ The number of threads must equal the parallelism times the latency: T = p × λ
    ◦ p and λ are increasing with time
  • Difficult to implement
    ◦ Split-phase algorithms
    ◦ Partitioning and scheduling
  • The state of the art enables, but doesn't support, the activity
  • Distracts from the focus on the domain science
  • Implementation policies become entangled with correctness issues
    ◦ Non-robust performance
    ◦ High development costs
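To make Little's law concrete (the numbers here are illustrative, not from the lecture): if a core can sustain p = 10 independent memory operations per cycle and the memory latency is λ = 200 cycles, then keeping the machine busy requires on the order of T = p × λ = 2000 operations (or threads) in flight at any one time.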

SLIDE 22

Motivating application

  • Solve Laplace's equation in 3 dimensions with Dirichlet boundary conditions:
    Δϕ = ρ(x,y,z), ϕ = 0 on ∂Ω
  • Building block: an iterative solver using Jacobi's method (7-point stencil)

    for (i,j,k) in 1:N × 1:N × 1:N
        u'[i][j][k] = (u[i-1][j][k] + u[i+1][j][k] +
                       u[i][j-1][k] + u[i][j+1][k] +
                       u[i][j][k+1] + u[i][j][k-1]) / 6.0

[Figure: domain Ω with boundary ∂Ω; ρ ≠ 0 inside Ω]

SLIDE 23

Classic message passing implementation

  • Decompose the domain into sub-regions, one per process
    ◦ Transmit halo regions between processes
    ◦ Compute the inner region after communication completes
  • Loop-carried dependences impose a strict ordering on communication and computation
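A minimal sketch of this ordering in MPI for a 1D decomposition along one dimension (the neighbor ranks up/down, the buffer layout with one ghost plane at each end, and the helper sweep(), which applies the 7-point update to a range of planes, are all assumptions made for illustration):

    #include <mpi.h>

    void sweep(double *u, double *unew, int N, int k0, int k1);  /* assumed helper: 7-point update on planes k0..k1 */

    /* Each rank owns nz planes of N*N points, stored between two ghost planes.
       up/down are neighbor ranks (MPI_PROC_NULL at the ends of the domain). */
    void step_blocking(double *u, double *unew, int N, int nz,
                       int up, int down, MPI_Comm comm)
    {
        int plane = N * N;
        MPI_Request req[4];

        /* Exchange halo planes first ... */
        MPI_Irecv(u,                    plane, MPI_DOUBLE, down, 0, comm, &req[0]);
        MPI_Irecv(u + (nz + 1) * plane, plane, MPI_DOUBLE, up,   1, comm, &req[1]);
        MPI_Isend(u + plane,            plane, MPI_DOUBLE, down, 1, comm, &req[2]);
        MPI_Isend(u + nz * plane,       plane, MPI_DOUBLE, up,   0, comm, &req[3]);
        MPI_Waitall(4, req, MPI_STATUSES_IGNORE);

        /* ... then compute over all owned planes. */
        sweep(u, unew, N, 1, nz);
    }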

SLIDE 24

Communication tolerant variant

  • Only a subset of the domain exhibits loop-carried dependences with respect to the halo region
  • Subdivide the domain to remove some of the dependences
  • We may now sweep the inner region in parallel with communication
  • Sweep the annulus after communication finishes
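Reordering the blocking sketch from the previous slide gives the overlapped variant (same assumptions: sweep() is an assumed helper, and planes 1 and nz are the only ones that touch the ghost planes):

    void step_overlapped(double *u, double *unew, int N, int nz,
                         int up, int down, MPI_Comm comm)
    {
        int plane = N * N;
        MPI_Request req[4];

        /* Post the halo exchange ... */
        MPI_Irecv(u,                    plane, MPI_DOUBLE, down, 0, comm, &req[0]);
        MPI_Irecv(u + (nz + 1) * plane, plane, MPI_DOUBLE, up,   1, comm, &req[1]);
        MPI_Isend(u + plane,            plane, MPI_DOUBLE, down, 1, comm, &req[2]);
        MPI_Isend(u + nz * plane,       plane, MPI_DOUBLE, up,   0, comm, &req[3]);

        /* ... sweep the inner region while messages are in flight ... */
        sweep(u, unew, N, 2, nz - 1);

        /* ... then finish the annulus (the two boundary planes) once the halos arrive. */
        MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
        sweep(u, unew, N, 1, 1);
        sweep(u, unew, N, nz, nz);
    }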

SLIDE 25

Processor Virtualization

  • Virtualize the processors by overdecomposing
  • AMPI [Kalé et al.]
  • When an MPI call blocks, the thread yields to another virtual process
  • How do we inform the scheduler about ready tasks?

SLIDE 26

Observations

  • The exact execution order depends on the data dependence structure: communication & computation
  • But many other correct orderings are possible, and some can enable us to hide communication
  • We can characterize the running program in terms of a task precedence graph
  • There is a deterministic procedure for translating MPI code into the graph

[Figure: SPMD MPI operations (Irecv, Send, Wait, Comp) mapped onto a task graph over processes 1-4]

SLIDE 27

A data driven execution model

  • Parallelism exists among independent tasks
  • Independent tasks may execute concurrently
  • A task is runnable when its data dependences have been met
  • A task suspends if its data dependences are not met
  • Computation and data motion are coupled activities
  • The scheduler determines which task(s) to run next
  • The scheduler and the application are only vaguely aware of one another
  • The scheduler doesn't affect graph execution semantics

SLIDE 28
Bamboo

  • A custom, domain-specific source-to-source translator that translates MPI into an equivalent data-driven formulation (Tan Nguyen, PhD 2014, UCSD)

[Figure: SPMD MPI code (Irecv, Send, Wait, Comp) translated into a task dependency graph, executed by a runtime system with worker threads, communication handlers, and dynamic scheduling]

SLIDE 29

3D Jacobi Results

Strong scaling, size = 3072³

[Figure: TFLOPS vs. cores (12288, 24576, 49152, 98304) on Hopper @ NERSC (Cray XE-6); curves compare MPI-basic, MPI-olap, MPI+OMP, (MPI+OMP)-olap, Bamboo-basic, and MPI-nocomm, with annotations S = 1.02 (VF = 8), S = 1.05 (VF = 4), S = 1.27 (VF = 2), and S = 1.15 (VF = 2)]

SLIDE 30

Inside a supercomputer

SLIDE 31

Blue Gene

  • IBM-US Dept. of Energy collaboration
  • Low power, high performance interconnect
  • 3rd generation: Blue Gene/Q
  • Sequoia at LLNL, #3 on the Top500
    ◦ 17.2 Pflops (Linpack), 20.1 PF peak, 7.9 MW
    ◦ 1.6M cores / 98,304 compute nodes, 16 GB memory/node
    ◦ 96 racks, 3,000 square feet
    ◦ 64-bit PowerPC (A2 core)
    ◦ Weighs about the same as 30 elephants
  • https://computing.llnl.gov/tutorials/bgq

SLIDE 32

Hierarchical Packaging

SC ‘10, via IBM

  • 1. Chip: 16 cores
  • 2. Module: single chip
  • 3. Compute Card: one single-chip module, 16 GB DDR3 memory
  • 4. Node Card: 32 compute cards, optical modules, link chips, torus
  • 5a. Midplane: 16 node cards
  • 5b. I/O Drawer: 8 I/O cards, 8 PCIe Gen2 slots
  • 6. Rack: 2 midplanes, 1, 2 or 4 I/O drawers
  • 7. System: 96 racks, 20 PF/s

SLIDE 33

Blue Gene/Q Interconnect

  • 5D toroidal mesh (end around)
    ◦ Can scale to > 2M cores
    ◦ 2 GB/sec bidirectional bandwidth (raw)
    ◦ On all 10 links, 1.8 GB/s available to the user
    ◦ 5D nearest neighbor exchange: ~1.8 GB/s/link, 98% efficiency
    ◦ Hardware latency ranges from 80 ns to 3 µs
  • Collective network
    ◦ Global barrier, Allreduce, prefix sum
    ◦ Floating point reductions at ~95% of peak

IBM

SLIDE 34

Die photograph

IBM

  • 16 (user) + 1 (OS) cores
  • 18th redundant core
  • Each core 4-way multithreaded @ 1.6 GHz
  • Quad-wide (double precision) SIMD FPU
  • 42.6 GB/s DDR3 bandwidth
  • Network routing integrated on-chip (5D torus)
  • Compare with the layout of the Cray-1 (1976)
  • Private L1 (16 KB I-cache / 16 KB D-cache), shared 32 MB L2 cache

SLIDE 35

Architectural directions

SLIDE 36

Hybrid processing

  • Two types of processors: general purpose + accelerator
    ◦ AMD Fusion: 4 x86 cores + hundreds of Radeon GPU cores
  • The accelerator can perform certain tasks more quickly than the conventional cores
  • The accelerator amplifies the relative cost of communication; latency hiding is important, but not sufficient

Hothardware.com

SLIDE 37

Memory structure

  • Only partial cache coherence on-chip
  • NUMA

http://www.hector.ac.uk/cse/documentation/Phase2b/#arch
SLIDE 38

Revisiting Broadcast at the Exascale

  • Uses a different algorithm for long messages than for short messages, based on van de Geijn's strategy
    See cseweb.ucsd.edu/classes/wi14/cse260-a/Lectures/Lec15.pdf
  • Scatter the data
    ◦ Divide the data to be broadcast into pieces, and fill the machine with the pieces
  • Do an Allgather
    ◦ Now that everyone has a part of the entire result, collect it on all processors
  • Faster than the standard algorithm for long messages when

    2 ((p − 1)/p) nβ << ⌈lg p⌉ nβ
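A sketch of the idea using MPI's built-in collectives (assuming, for brevity, that the message length n is divisible by the number of processes; the function and buffer names are illustrative):

    #include <mpi.h>
    #include <stdlib.h>

    /* Broadcast n doubles from root by scattering pieces and then allgathering them. */
    void long_bcast(double *buf, int n, int root, MPI_Comm comm)
    {
        int p;
        MPI_Comm_size(comm, &p);
        int chunk = n / p;                                /* assume p divides n */
        double *piece = malloc(chunk * sizeof(double));   /* this rank's piece */

        /* Scatter: fill the machine with pieces of the message. */
        MPI_Scatter(buf, chunk, MPI_DOUBLE, piece, chunk, MPI_DOUBLE, root, comm);

        /* Allgather: every rank collects all the pieces, completing the broadcast. */
        MPI_Allgather(piece, chunk, MPI_DOUBLE, buf, chunk, MPI_DOUBLE, comm);

        free(piece);
    }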

SLIDE 39

Algorithm for long messages

[Figure: the scatter step, with root P0 scattering pieces of the message to P1 … Pp-1]

The scatter step

  • Uses a recursive, hypercube-like algorithm
  • Running time is

    ⌈lg p⌉ α + ((p − 1)/p) nβ

SLIDE 40

Algorithm for long messages

[Figure: the allgather step among P0, P1, … Pp-1]

The allgather step

  • Uses an all-to-all recursive doubling algorithm
  • For p a power of two, the running time is

    ⌈lg p⌉ α + ((p − 1)/p) nβ
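Putting the two steps together (a standard cost-model comparison offered as a check, not taken from the slides): the scatter/allgather broadcast costs about 2⌈lg p⌉ α + 2((p − 1)/p) nβ ≈ 2⌈lg p⌉ α + 2nβ, while a binomial-tree broadcast costs ⌈lg p⌉ (α + nβ). For long messages the bandwidth terms dominate, so 2nβ beats ⌈lg p⌉ nβ whenever lg p > 2, i.e. for more than 4 processes.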

SLIDE 41

Software

  • We need to think hard about hiding an explosion of details

  • Parallel programming languages
  • Domain specific languages
  • Embedded domain specific languages
  • Autotuning

SLIDE 42

What you learned in this class

  • How to solve computationally intensive problems on parallel computers effectively
    ◦ Theory and practice
    ◦ Software techniques
    ◦ Performance tradeoffs
  • CSE 260 built on what you learned earlier in your career about programming and algorithm design & analysis, and generalized it

SLIDE 43

How does parallel computing relate to other branches of computer science?

  • Parallel processing generalizes problems we encounter on single processor computers
  • A parallel computer is just an extension of the traditional memory hierarchy
  • The need to preserve locality, which prevails in virtual memory, cache memory, and registers, also applies to a parallel computer

SLIDE 44

Do you have an application in mind for parallel processing?

  • A. Yes
  • B. No
  • C. Maybe
