Lecture 18: CSE 260 Parallel Computation (Fall 2015)
Scott B. Baden
Large scale computing
Announcements
- Office hours on Wednesday
  - 3:30 PM until 5:30
  - I'll stay after 5:30 until the last one leaves
- Test on the last day of class
  - BRING A BLUE BOOK
  - Tests your ability to apply the knowledge you've gained in the course
  - Open book, open notes
  - You may bring a PDF viewer (e.g. Preview, Acrobat) to look at course materials only
    - No web browsing: turn off the internet
    - No cell phones
Today’s lecture
- Supercomputers
- Architectures
- Applications
What is the purpose of a supercomputer?
- Improve our understanding of scientifically and technologically important phenomena
- Improve the quality of life through technological innovation, simulations, and data processing
  - Data mining
  - Image processing
  - Simulations: financial modeling, weather, biomedical
- Economic benefits
What is the world’s fastest supercomputer?
- Top500 #1 (> 2 years): Tianhe-2 @ NUDT (China)
  3.12M cores, 54.8 Pflops peak, 17.8 MW power (plus ~6 MW cooling), 12-core Ivy Bridge + Intel Xeon Phi
- #2: Titan @ Oak Ridge, USA: 561K cores, 27 PF peak, 8.2 MW, Cray XK7: AMD Opteron + NVIDIA Kepler K20X
top500.org
What does a supercomputer look like?
- Hierarchically organized servers
- Hybrid communication
  - Threads within the server
  - Pass messages between servers (or among groups of cores)
Edison @ nersc.gov
conferences.computer.org/sc/2012/papers/1000a079.pdf
State-of-the-art applications
Blood simulation on Jaguar (Georgia Tech team)

Strong scaling:
  p           48     384    3072   24576
  Time (sec)  899.8  116.7  16.7   4.9
  Efficiency  1.00   0.96   0.84   0.35

Weak scaling:
  p           24576  98304  196608
  Time (sec)  228.3  258    304.9
  Efficiency  1.00   0.88   0.75
Ab Initio Molecular Dynamics (AIMD) using Plane Waves Density Functional Theory Eric Bylaska (PNNL)
[Figure: exchange time on Hopper]
Slide courtesy Tan Nguyen, UCSD
Performance differs across application domains
- Colella's 7 dwarfs: patterns of communication and computation that persist over time and across implementations
  - Structured grids (e.g. the Panfilov method)
  - Dense linear algebra: matrix multiply, matrix-vector multiply, Gaussian elimination
  - N-body methods
  - Sparse linear algebra: use knowledge about the locations of the non-zeroes to improve some aspect of performance
  - Unstructured grids
  - Spectral methods (FFT)
  - Monte Carlo
C[i,j] += A[i,:] * B[:,j]
Application-specific knowledge is important
- No tool currently exists that can convert a serial
program into an efficient parallel program
… for all applications … all of the time… on all hardware
- The more we know about the application…
… specific problem … math/physics ... initial data … … context for analyzing the output… … the more we can improve performance
- Performance Programming Issues
  - Data motion and locality
  - Load balancing
  - Serial sections
Sparse Matrices
- A matrix where we employ knowledge about the
location of the non-zeroes
- Consider Jacobi’s method with a 5-point stencil
u'[i,j] = (u[i-1,j] + u[i+1,j] + u[i,j-1] + u[i,j+1] - h²·f[i,j]) / 4
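A minimal C sketch of one such Jacobi sweep (a hedged illustration: the function and array names are assumptions, with fixed Dirichlet values in rows/columns 0 and N+1):

    /* One Jacobi sweep with the 5-point stencil above.
       u, unew, f are (N+2) x (N+2) grids; h is the mesh spacing;
       the boundary rows/columns hold fixed Dirichlet values. */
    void jacobi_sweep(int N, double h,
                      double u[N+2][N+2],
                      double unew[N+2][N+2],
                      double f[N+2][N+2])
    {
        for (int i = 1; i <= N; i++)
            for (int j = 1; j <= N; j++)
                unew[i][j] = (u[i-1][j] + u[i+1][j] +
                              u[i][j-1] + u[i][j+1] -
                              h*h*f[i][j]) / 4.0;
    }

Only the 5 stencil neighbors of u (plus f) are touched per output point; the sparsity is exploited so completely that the matrix is never stored at all.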
1M x 1M submatrix of the web connectivity graph, constructed from an archive at the Stanford WebBase
3 non-zeroes/row
Dense: 2^20 × 2^20 = 2^40 = 1024 Gwords
Sparse: (3/2^20) × 2^40 = 3 Mwords
The sparse representation saves a factor of 1 million in storage
Web connectivity Matrix: 1M x 1M
Jim Demmel
Circuit Simulation
www.cise.ufl.edu/research/sparse/matrices/Hamm/scircuit.html
Motorola circuit: 170,998 × 170,998 matrix, 958,936 non-zeroes
0.003% non-zeroes, 5.6 non-zeroes/row
Generating sparse matrices from unstructured grids
A 2D airfoil
- In some applications of sparse
matrices, we generate the matrix from an “unstructured” mesh, e.g. finite element method
- In some cases we apply direct
mesh updates, using nearest neighbors
- Irregular partitioning
Sparse Matrix Vector Multiplication
- Important kernel used in linear algebra
- Assume x[] fits in memory of 1 processor
y[i] += A[i,j] × x[j]
- Many formats, common format for CPUs is
Compressed Sparse Row (CSR)
Jim Demmel
Sparse matrix vector multiply kernel
// y[i] += A[i,j] × x[j]
#pragma omp parallel for schedule(dynamic, chunk)
for i = 0 : N-1              // rows
    i0 = ptr[i]
    i1 = ptr[i+1] - 1
    for k = i0 : i1          // non-zeroes of row i
        y[i] += val[k] * x[ind[k]]
    end k
end i
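The same kernel as compilable C (a sketch: the OpenMP chunk size of 64 and the function name are assumptions; ptr, ind, val follow the CSR naming above):

    /* y[i] += A[i,j] * x[j] for an N x N sparse matrix in CSR:
       ptr[i] .. ptr[i+1]-1 index the non-zeroes of row i;
       ind[k] is the column of the k-th non-zero, val[k] its value.
       Compile with -fopenmp; without it the pragma is ignored. */
    void spmv_csr(int N, const int *ptr, const int *ind,
                  const double *val, const double *x, double *y)
    {
        #pragma omp parallel for schedule(dynamic, 64)
        for (int i = 0; i < N; i++) {
            double sum = y[i];
            for (int k = ptr[i]; k < ptr[i+1]; k++)
                sum += val[k] * x[ind[k]];
            y[i] = sum;
        }
    }

Dynamic scheduling is used because rows can have very different non-zero counts, so a static partition of rows may load imbalance.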
Up and beyond to Exascale
- In 1961, President Kennedy mandated a
landing on the Moon by the end of the decade
- July 20, 1969 at tranquility base
“The Eagle has landed”
- The US Government set an ambitious
schedule to reach 10^18 flops by ~2023
- DOE is taking the lead in the US, EU also
engaged
- Massive technical challenges
The Challenges to landing “Eagle”
- High levels of parallelism within and across nodes
  - 10^18 flops using NVIDIA devices @ 10^12 flops each
  - 10^6 devices, 10^9+ threads
- Power: ≤ 20 MW. Today: 18 MW @ 0.05 Exaflops
  - Power consumption: 1-2 nJ/op today → 20 pJ/op at Exascale (20 pJ/op × 10^18 op/s = 20 MW)
  - Data storage & access consume most of the energy
- Ever lengthening communication delays
  - Complicated memory hierarchies
  - Raise the amount of computation per unit of communication
  - Hide latency, conserve locality
- Reliability and resilience
  - Blue Gene/L's Mean Time Between Failures (MTBF)
measured in days
- Application code complexity; domain specific languages
  - NUMA processors, not fully cache coherent on-chip
  - Mixture of accelerators and conventional cores
Technological trends
- Growth: cores/socket rather than sockets
- Heterogeneous processors
- Memory/core is shrinking
- Complicated software-managed parallel memory hierarchy
- Communication costs increasing relative to computation
Intel Sandy Bridge, anandtech.com
35 years of processor trends
How do we manage these constraints?
- Increase the amount of computation performed per unit of communication
  - Conserve locality: "communication avoiding"
- Hide communication
- Many threads
[Figure: relative improvement vs. year (Giga → Tera → Peta → Exa?) for processor, memory bandwidth, and latency]
A Crosscutting issue: hiding communication
- Little's law [1961]
  - The number of in-flight operations (threads) must equal the parallelism times the latency: T = p × λ
  - p and λ are increasing with time
- Difficult to implement
  - Split-phase algorithms
  - Partitioning and scheduling
- The state of the art enables but doesn't support the activity
- Distracts from the focus on the domain science
- Implementation policies entangled with correctness issues
  - Non-robust performance
  - High development costs
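A worked instance of Little's law (the numbers are illustrative assumptions, not measurements from the course):

    % in-flight operations = parallelism x latency
    T = p \times \lambda
    % e.g. a memory system that can start 4 operations per cycle,
    % each taking 200 cycles to complete:
    T = 4 \times 200 = 800 \text{ operations must be in flight to hide the latency}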
Motivating application
- Solve Poisson's equation in 3 dimensions with Dirichlet boundary conditions: Δϕ = ρ(x,y,z), ϕ = 0 on ∂Ω
- Building block: iterative solver using
Jacobi’s method (7-point stencil)
[Figure: domain Ω with boundary ∂Ω; ρ ≠ 0 in the interior]

for (i,j,k) in 1:N × 1:N × 1:N
    u'[i][j][k] = (u[i-1][j][k] + u[i+1][j][k] +
                   u[i][j-1][k] + u[i][j+1][k] +
                   u[i][j][k-1] + u[i][j][k+1]) / 6.0
Classic message passing implementation
- Decompose domain into sub-regions, one per process
  - Transmit halo regions between processes
  - Compute the inner region after communication completes
- Loop carried dependences impose a strict ordering on
communication and computation
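A minimal MPI sketch of this ordering for a 1D decomposition into planes (a hedged illustration: the function and variable names are assumptions; up and down may be MPI_PROC_NULL at the ends of the process row):

    #include <mpi.h>

    /* Classic ordering: exchange halos, then compute everything.
       Each rank owns planes 1..nlocal, each of 'plane' doubles;
       planes 0 and nlocal+1 are the ghost (halo) planes. */
    void exchange_halos(double *u, int plane, int nlocal,
                        int up, int down, MPI_Comm comm)
    {
        /* Send my top plane up; receive my bottom ghost from below. */
        MPI_Sendrecv(&u[nlocal * plane], plane, MPI_DOUBLE, up, 0,
                     &u[0], plane, MPI_DOUBLE, down, 0,
                     comm, MPI_STATUS_IGNORE);
        /* Send my bottom plane down; receive my top ghost from above. */
        MPI_Sendrecv(&u[plane], plane, MPI_DOUBLE, down, 1,
                     &u[(nlocal + 1) * plane], plane, MPI_DOUBLE, up, 1,
                     comm, MPI_STATUS_IGNORE);
    }

Only after both exchanges return does each process sweep its sub-region, which is the strict ordering the dependences impose.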
Communication tolerant variant
- Only a subset of the domain exhibits loop carried
dependences with respect to the halo region
- Subdivide the domain to remove some of the
dependences
- We may now sweep the inner region in parallel
with communication
- Sweep the annulus after communication finishes
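The reordered structure, sketched with real MPI calls (MPI_Irecv/MPI_Isend/MPI_Waitall) and assumed helper routines standing in for the halo posts and the two sweeps:

    /* Communication-tolerant ordering. */
    MPI_Request reqs[4];
    post_halo_exchange(u, reqs);   /* assumed helper: MPI_Irecv + MPI_Isend
                                      for both halo planes */
    sweep_inner(u, unew);          /* assumed helper: points that do not
                                      read the halo */
    MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);
    sweep_annulus(u, unew);        /* assumed helper: the outer shell
                                      that reads the halo */

The inner sweep now runs while the messages are in flight; only the annulus waits.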
Processor Virtualization
- Virtualize the processors by overdecomposing
- AMPI [Kalé et al.]
- When an MPI call blocks, the thread yields to another virtual process
- How do we inform the scheduler
about ready tasks?
Observations
- The exact execution order depends on the data
dependence structure: communication & computation
- But many other correct orderings are possible, and
some can enable us to hide communication
- We can characterize the running program in terms of
a task precedence graph
- There is a deterministic procedure for translating MPI
code into the graph
[Figure: SPMD MPI code (Irecv j; Send j; Wait; Compute, repeated) translated into a task graph over tasks 1-4]
A data driven execution model
- Parallelism exists among independent tasks
- Independent tasks may execute concurrently
- A task is runnable when its data dependences have been met
- A task suspends if its data dependences are not met
- Computation and data motion are coupled activities
- The scheduler determines which task(s) to run next
- Scheduler and application are only vaguely aware of one another
- Scheduler doesn’t affect graph execution semantics
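A toy C illustration of the dependence-counting idea behind such a scheduler (everything here, names and types alike, is a simplified assumption and not any particular runtime; a real runtime would decrement the counter atomically):

    /* A task becomes runnable when its count of unmet
       data dependences drops to zero. */
    typedef struct Task {
        int unmet;                     /* remaining input dependences   */
        void (*run)(struct Task *);    /* the task body                 */
        struct Task **succs;           /* tasks that consume our output */
        int nsuccs;
    } Task;

    void enqueue_runnable(Task *t);    /* assumed: hand off to a worker */

    /* Called when one of a task's inputs arrives (e.g. a message). */
    void satisfy(Task *t)
    {
        if (--t->unmet == 0)
            enqueue_runnable(t);
    }

    /* When a task finishes, its output satisfies each successor. */
    void complete(Task *t)
    {
        for (int i = 0; i < t->nsuccs; i++)
            satisfy(t->succs[i]);
    }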
Bamboo
- A custom, domain-specific source-to-source translator for translating MPI into an equivalent data-driven formulation (Tan Nguyen, PhD 2014, UCSD)
[Figure: the Bamboo runtime system: SPMD MPI code becomes a task dependency graph, executed by worker threads and communication handlers under dynamic scheduling]
3D Jacobi Results
Strong scaling, size = 3072³, on Hopper @ NERSC (Cray XE-6)
[Figure: TFLOPS vs. cores (12288, 24576, 49152, 98304) for MPI-basic, MPI-olap, MPI+OMP, (MPI+OMP)-olap, Bamboo-basic, and MPI-nocomm; Bamboo speedups S = 1.02 at virtualization factor VF = 8, S = 1.05 at VF = 4, S = 1.27 at VF = 2, S = 1.15 at VF = 2]
Inside a supercomputer
Blue Gene
- IBM-US Dept of Energy collaboration
- Low power, high performance interconnect
- 3rd Generation: Blue Gene/Q
- Sequoia at LLNL, #3 on top 500
  - 17.2 Pflops (Linpack), 20.1 PF peak, 7.9 MW
  - 1.6M cores / 98,304 compute nodes, 16 GB mem/node
  - 96 racks, 3,000 square feet
  - 64-bit PowerPC (A2 core)
  - Weighs about the same as 30 elephants
- https://computing.llnl.gov/tutorials/bgq
Hierarchical Packaging
SC ‘10, via IBM
- 1. Chip: 16 cores
- 2. Module: single chip
- 3. Compute Card: one single-chip module, 16 GB DDR3 memory
- 4. Node Card: 32 compute cards, optical modules, link chips, torus
- 5a. Midplane: 16 node cards
- 5b. I/O Drawer: 8 I/O cards, 8 PCIe Gen2 slots
- 6. Rack: 2 midplanes, 1, 2 or 4 I/O drawers
- 7. System: 96 racks, 20 PF/s
Blue Gene/Q Interconnect
- 5D toroidal mesh (end around)
  - Can scale to > 2M cores
  - 2 GB/sec bidirectional bandwidth per link (raw); on all 10 links, 1.8 GB/s available to the user
  - 5D nearest-neighbor exchange at ~1.8 GB/s/link, 98% efficiency
  - Hardware latency ranges from 80 ns to 3 µs
- Collective network
  - Global barrier, Allreduce, prefix sum
  - Floating-point reductions at ~95% of peak
IBM
Die photograph
IBM
- 16 (user) + 1 (OS) cores
- An 18th redundant core
- Each core 4-way multithreaded @ 1.6 GHz
- Quad-wide (double precision) FPU: SIMD
- 42.6 GB/s DDR3 bandwidth
- Network routing integrated on-chip (5D torus)
- Compare with the layout of the Cray 1 (1976)
- Private L1 (16K/16K), shared L2 cache: 32 MB
Architectural directions
Hybrid processing
- Two types of processors: general purpose + accelerator
  - AMD Fusion: 4 x86 cores + hundreds of Radeon GPU cores
- Accelerator can perform certain tasks more quickly than the
conventional cores
- Accelerator amplifies relative cost of communication, latency hiding
important, but not sufficient
Hothardware.com
Memory structure
- Only partial cache coherence on-chip
- NUMA
http://www.hector.ac.uk/cse/documentation/Phase2b/#arch
Revisiting Broadcast at the Exascale
- Uses a different algorithm for long messages than for short messages, based on van de Geijn's strategy; see cseweb.ucsd.edu/classes/wi14/cse260-a/Lectures/Lec15.pdf
- Scatter the data
  - Divide the data to be broadcast into pieces, and fill the machine with the pieces
- Do an Allgather
  - Now that everyone has a part of the entire result, collect on all processors
- Faster than standard algorithm for long messages
2·((p−1)/p)·nβ << ⌈lg p⌉·nβ
(the bandwidth term is about 2nβ for the two-phase algorithm, versus ⌈lg p⌉·nβ for the standard tree broadcast)
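The two phases compose directly from MPI's built-in collectives (a sketch, assuming p divides n; a real MPI library would use the recursive implementations shown on the next two slides inside these calls):

    #include <mpi.h>
    #include <stdlib.h>

    /* Long-message broadcast as scatter + allgather.
       buf holds n doubles on every rank; root's copy is the source. */
    void bcast_long(double *buf, int n, int root, MPI_Comm comm)
    {
        int p;
        MPI_Comm_size(comm, &p);
        int chunk = n / p;                   /* assume p divides n */
        double *piece = malloc(chunk * sizeof *piece);
        /* Phase 1: fill the machine with the pieces. */
        MPI_Scatter(buf, chunk, MPI_DOUBLE,
                    piece, chunk, MPI_DOUBLE, root, comm);
        /* Phase 2: everyone collects all p pieces into buf. */
        MPI_Allgather(piece, chunk, MPI_DOUBLE,
                      buf, chunk, MPI_DOUBLE, comm);
        free(piece);
    }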
Algorithm for long messages
[Figure: the scatter step: root P0 distributes pieces to P1 ... Pp-1]
The scatter step
- Uses a recursive, hypercube-like algorithm
- Running time is ⌈lg p⌉·α + ((p−1)/p)·nβ
Algorithm for long messages
[Figure: the allgather step among P0, P1, ..., Pp-1]
The allgather step
- Uses an all-to-all recursive doubling algorithm
- For P a power of two, running time is ⌈lg p⌉·α + ((p−1)/p)·nβ
- Combining the two steps, the long-message broadcast costs about 2⌈lg p⌉·α + 2·((p−1)/p)·nβ
Software
- We need to think hard about hiding an
explosion of details
- Parallel programming languages
- Domain specific languages
- Embedded domain specific languages
- Autotuning
What you learned in this class
- How to solve computationally intensive problems on parallel computers effectively
  - Theory and practice
  - Software techniques
  - Performance tradeoffs
- CSE 260 built on what you learned earlier in your career about programming and algorithm design & analysis, and generalized it
How does parallel computing relate to other branches of computer science?
- Parallel processing generalizes problems we
encounter on single processor computers
- A parallel computer is just an extension of
the traditional memory hierarchy
- The need to preserve locality, which
prevails in virtual memory, cache memory, and registers, also applies to a parallel computer
Do you have an application in mind for parallel processing?
- A. Yes
- B. No
- C. Maybe