Sparse Matrix Algorithms and Advanced Topics in FEM
MATH 9830
Timo Heister (heister@clemson.edu)
http://www.math.clemson.edu/~heister/math9830-spring2015/
Tasks Lab 1
- Grab class repo
– git clone https://github.com/tjhei/spring15_9830
– Compile the example: g++ main.cc
– Run it: ./a.out
– Do question 2 from homework 1
– Work on the other tasks listed in the .cc file
- Install deal.II
- Run deal.II step 2
– Print the sparsity pattern, change the refinement
– Use cuthill_mckee; try another ordering?
– Change to dim=3, etc.
Tasks Lab 2
- Deal.II intro
- Work on hw2
- Bonus: step 2
– Print the sparsity pattern, change the refinement
– Use cuthill_mckee; try another ordering?
– Change to dim=3, etc.
deal.II
- “A Finite Element Differential Equations Analysis Library”
- Open source C++ library
- I am one of the four maintainers
- One of the most widely used libraries:
– ~600 papers using and citing deal.II
– ~600 downloads/month
– 100+ people have contributed in the past 10 years
– ~600,000 lines of code
– 10,000+ pages of documentation
- Website: www.dealii.org
Features
- 1d, 2d, 3d computations, adaptive mesh refinement (on quads/hexas only)
- Finite element types:
– Continuous and DG Lagrangian elements
– Higher order elements, hp adaptivity
– Raviart-Thomas, Nedelec, …
– And arbitrary combinations
Features, part II
- Linear Algebra
– Own sparse and dense linear algebra classes
– Interfaces to PETSc, Trilinos, UMFPACK, BLAS, …
- Parallelization
– Multi-threading on multi-core machines
– MPI: 16,000+ processors
- Output in many visualization file formats
Development of deal.II
- Professional-level development style
- Development in the open, open repository
- Mailing lists for users and developers
- Test suite with 6,000+ tests, run after every change
- Platform support:
– Linux/Unix
– Mac
– Work in progress: Windows
Installation
- How to install from source code: configure, compile, test, run "step-1"
- Ubuntu (or any other linux) or Mac OSX
- Steps:
– Detect compilers/dependencies/etc. (cmake)
– Compile & install deal.II (make)
Prerequisites on Linux
- Compiler: GNU g++
- Recommended:
$ sudo apt-get install subversion openmpi1.6-bin openmpi1.6-common \
    g++ gfortran libopenblas-dev liblapack-dev zlib1g-dev git emacs gnuplot
- manually: cmake (in a minute)
- Later: eclipse, paraview
- Optional manually: visit, p4est, PETSc, Trilinos, hdf5
On Mac OS
- If OSX 10.9 follow instructions from wiki:
https://github.com/dealii/dealii/wiki/MacOSX
- Later manually: eclipse, paraview
cmake
- Ubuntu 12.04 has a version that is too old
- On a newer Ubuntu, do:
  $ sudo apt-get install cmake
  … and you are done
- Otherwise: install cmake from source or
download the 32bit binary
cmake from binary
- Do:
export CMAKEVER=2.8.12.1
wget http://www.cmake.org/files/v2.8/cmake-$CMAKEVER-Linux-i386.sh
chmod u+x cmake-$CMAKEVER-Linux-i386.sh
./cmake-$CMAKEVER-Linux-i386.sh
- Answer “q”, yes and yes
- Add the bin directory to your path (.bashrc)
- You might need:
  sudo apt-get install ia32-libs
cmake from source
wget http://www.cmake.org/files/v2.8/cmake-2.8.12.1.tar.gz
tar xf cmake-2.8.12.1.tar.gz
./configure
make install
Install deal.II
- http://www.dealii.org/8.1.0/readme.html
- Extract:
tar xf deal.II-8.1.0.tar.gz
- Build directory:
cd deal.II; mkdir build; cd build
- Configuration:
cmake -DCMAKE_INSTALL_PREFIX=/?/? ..
(where /?/? is your installation directory)
- Compile (5-60 minutes):
make -jX install
(where X is the number of cores you have)
- Test:
make test (in the build directory)
- Test part two:
cd examples/step-1
cmake -DDEAL_II_DIR=/?/? .
make run
- Recommended layout:
deal.II/
  build      <- build files
  installed  <- your installation dir
  examples   <- all examples!
  include
  source
  ...
Running examples
- In short:
cd examples/step-1
cmake .
make run
- cmake:
– Detects the configuration; only needs to be run once
– Input: CMakeLists.txt
– Output: Makefile (and other files like CMakeCache.txt)
- make:
– Tool that executes the commands in the Makefile; run it every time you change your code
– Input: step-1.cc, Makefile
– Output: step-1 (the binary executable)
- Run your program with
./step-1
- Or (compile and run):
make run
How to create an eclipse project
- Run this once in your project:
cmake -G "Eclipse CDT4 - Unix Makefiles" .
- Now create a new project in eclipse (“File->Import->Existing Project” and select your folder for the project above)
- Eclipse intro:
– http://www.math.tamu.edu/~bangerth/videos.676.7.html
– http://www.math.tamu.edu/~bangerth/videos.676.8.html
Templates in C++
- “blueprints” to generate functions and/or classes
- Template arguments are either numbers or types
- No performance penalty!
- Very powerful feature of C++: difficult syntax, ugly
error messages, slow compilation
- More info:
http://www.cplusplus.com/doc/tutorial/templates/
http://www.math.tamu.edu/~bangerth/videos.676.12.html
Why used in deal.II?
- Write your program once and run in 1d, 2d, 3d:
- Also: large parts of the library independent of dimension
DoFHandler<dim>::active_cell_iterator
  cell = dof_handler.begin_active(),
  endc = dof_handler.end();
for (; cell!=endc; ++cell)
  {
    ...
    cell_matrix(i,j) += (fe_values.shape_grad (i, q_point) *
                         fe_values.shape_grad (j, q_point) *
                         fe_values.JxW (q_point));
  }
Function Templates
- Blueprint for a function
- One type called “number”
- You can use “typename” or “class”
- Sometimes you need to state which function you want to call:
template <typename number>
number square(const number x)
{
  return x * x;
}

int x = 3;
int y = square<int>(x);

template <typename T>
void yell()
{
  T test;
  test.shout("HI!");
}

// cat is a class that has shout()
yell<cat>();
Value Templates
- Template arguments can also be values (like int) instead of types:
- Of course this would have worked here too:
template <int dim>
void make_grid(Triangulation<dim> &triangulation)
{ … }

Triangulation<2> tria;
make_grid<2>(tria);

// the alternative with a type parameter:
template <typename T>
void make_grid(T &triangulation)
{ … } // now we can not access “dim” though
Class templates
- Whole classes from a blueprint
- Same idea:
template <int dim>
class Point
{
  double elements[dim];
  // ...
};

Point<2> a_point;
Point<5> different_point;

namespace std {
  template <typename number> class vector;
}
std::vector<int> list_of_ints;
std::vector<cat> cats;
Example
- Value of N known at compile time
- Compiler can optimize (unroll loop)
- Fixed size arrays faster than dynamic
(dealii::Point<dim> vs dealii::Vector<double>)
template <unsigned int N>
double norm(const Point<N> &p)
{
  double tmp = 0;
  for (unsigned int i = 0; i < N; ++i)
    tmp += square(p.elements[i]);
  return sqrt(tmp);
}
Examples in deal.II
- Step-4:
template <int dim>
void make_grid(Triangulation<dim> &triangulation)
{ ... }
- So that we can use Vector<double> and Vector<float>:
template <typename number>
class Vector
{
  number *elements;
  // ...
};
- Default values (embed dim-dimensional object in spacedim):
template <int dim, int spacedim = dim>
class Triangulation
{ ... };
- Already familiar:
template <int dim, int spacedim>
void GridGenerator::hyper_cube(Triangulation<dim, spacedim> &tria,
                               const double left,
                               const double right)
{ ... }
Explicit Specialization
- different blueprint for a specific type T or value
// store some information
// about a Triangulation:
template <int dim>
struct NumberCache
{};

template <>
struct NumberCache<1>
{
  unsigned int n_levels;
  unsigned int n_lines;
};

template <>
struct NumberCache<2>
{
  unsigned int n_levels;
  unsigned int n_lines;
  unsigned int n_quads;
};

// more clever (instead of the above):
template <>
struct NumberCache<2> : public NumberCache<1>
{
  unsigned int n_quads;
};
step-4
- Dimension independent Laplace problem
- Triangulation<2>, DoFHandler<2>, …
replaced by Triangulation<dim>, DoFHandler<dim>, …
- Template class:
template <int dim>
class Step4
{};
Parallel Computing: Introduction
A modern CPU: Intel Core i7
Basics
- Single cores are not getting (much) faster
- “The free lunch is over”:
  http://www.gotw.ca/publications/concurrency-ddj.htm
- Concurrency is the only option:
  – SIMD/vector instructions
  – Several cores
  – Several chips in one node
  – Combine nodes into a supercomputer
Hierarchy of memory
- Latency: time until the CPU receives data after requesting it
- Bandwidth: how much data per second?
- Prefetching of data; “cache misses” are expensive
- Caches are managed automatically by the processor
https://gist.github.com/hellerbarde/2843375
Amdahl's Law
- Task: serial fraction s, parallel fraction p=1-s
- N workers (whatever that means)
- Runtime: T(N) = (1-s)T(1)/N + sT(1)
- Speedup T(1)/T(N); for N to infinity:
  max_speedup = 1/s
- http://en.wikipedia.org/wiki/Amdahl%27s_law
- Reality: T(N) = (1-s)T(1)/N + sT(1) + aN + bN^2
Summary
- Computing much faster than memory access
- Parallel computing required: no free lunch!
- Communication is a serial fraction (or worse when it increases with N!)
- Communication in Amdahl's law is the main challenge in parallel computing
Multithreading
- Idea: call functions in a separate background thread and continue immediately
- All threads can access the same memory (dangerous, but easy)
- Can spawn arbitrarily many threads, scheduled by the operating system (pthreads, CreateThread, …)
- C++ wrappers: std::thread, boost::thread
– See demos
- Higher level libraries (later):
– OpenMP (parallelize loops, tasks, ...)
– TBB = Intel Threading Building Blocks (tasks, algorithms)
– deal.II wraps TBB in a very easy task-based interface
Multithreading limits
- One machine only (no cluster) => MPI
- Wrong abstraction:
– Creating threads is expensive
– How many to create?
– How to split the work?
– Reuse them?
– Synchronization is difficult
- Better: task based library based on threads (later)
MPI
- Need MPI library and compile with compiler wrappers (mpicxx
instead of g++) or use a cmake script that does that
- Need to run with (for 4 processes):
mpirun -n 4 ./main
- Reference http://mpi.deino.net/mpi_functions/
- Most basic program:
#include <mpi.h>

int main(int argc, char **argv)
{
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  MPI_Finalize();
}
MPI_Send and MPI_Recv
- Send and receive a message from <source> to <dest> of
<count> elements
- <comm> is always MPI_COMM_WORLD for us
- <tag> can be any integer but needs to be the same on sender and receiver
- <status> can be MPI_STATUS_IGNORE if you don't care
- Some examples for <datatype>: MPI_INT, MPI_DOUBLE
- <source> can be MPI_ANY_SOURCE if you want to receive
any message
int MPI_Recv(void *buf, int count, MPI_Datatype datatype,
             int source, int tag, MPI_Comm comm, MPI_Status *status);

int MPI_Send(void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm);
MPI_Barrier
- Code:
MPI_Barrier(MPI_Comm comm)
- Waits until all ranks have arrived at this line, then all continue
MPI_Probe
- Wait until a matching message arrives and return info about it in <status>
int MPI_Probe(int source, int tag, MPI_Comm comm, MPI_Status *status);
- <source> source rank or MPI_ANY_SOURCE
- <tag> tag value or MPI_ANY_TAG
- <status> contains:
– .MPI_SOURCE: the source rank
– .MPI_TAG: the tag
- You can get information about the size of the message using MPI_Get_count()
MPI_Bcast
- Send the same data from <root> to all ranks
- Result: rank[j].buffer[i] = rank[root].buffer[i]

int MPI_Bcast(void *buffer, int count, MPI_Datatype datatype,
              int root, MPI_Comm comm);
MPI_Reduce
- Combine elements from <sendbuf> using <op> from all ranks and store in <recvbuf> on <root>
- MPI_Op examples: MPI_SUM, MPI_MIN, MPI_MAX, ...
- Result: rank[root].recvbuf[i] = op(rank[0].sendbuf[i], …, rank[n-1].sendbuf[i])

int MPI_Reduce(void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype,
               MPI_Op op, int root, MPI_Comm comm);
MPI_Allreduce
- Combine elements from <sendbuf> using <op> from all
ranks and store in <recvbuf> on all ranks
- Like reduce, but the result is available on all ranks:
  rank[j].recvbuf[i] = op(rank[0].sendbuf[i], …, rank[n-1].sendbuf[i])

int MPI_Allreduce(void *sendbuf, void *recvbuf, int count,
                  MPI_Datatype datatype, MPI_Op op, MPI_Comm comm);
MPI_Gather
- Collect data from all ranks at <root>
int MPI_Gather(void *sendbuf, int sendcnt, MPI_Datatype sendtype,
               void *recvbuf, int recvcnt, MPI_Datatype recvtype,
               int root, MPI_Comm comm);
MPI_Allgather
- Collect data from all ranks at every rank
- (like Gather but copied to everyone)
int MPI_Allgather(void *sendbuf, int sendcount, MPI_Datatype sendtype,
                  void *recvbuf, int recvcount, MPI_Datatype recvtype,
                  MPI_Comm comm);
MPI_Scatter
- Send different data from <root> to all
- <sendbuf> is only used on <root>
int MPI_Scatter(void *sendbuf, int sendcnt, MPI_Datatype sendtype,
                void *recvbuf, int recvcnt, MPI_Datatype recvtype,
                int root, MPI_Comm comm);
Weak/Strong Scaling
- Weak:
  – Fixed problem size per CPU
  – Increase CPUs
  – Optimal: constant runtime
- Strong:
  – Fixed total problem size
  – Increase CPUs
  – Optimal: runtime decreases linearly
The following slides are for the future....
MPI vs. Multithreading
Boundary conditions
Adaptive Mesh Refinement
- Typical loop:
– Solve – Estimate – Mark – Refine/coarsen
- Estimate is problem dependent:
– Approximate gradient jumps: KellyErrorEstimator class
– Approximate local norm of the gradient: DerivativeApproximation class
– Or something else
- Mark:
  GridRefinement::refine_and_coarsen_fixed_number(...)
  or GridRefinement::refine_and_coarsen_fixed_fraction(...)
- Refine/coarsen:
  – triangulation.execute_coarsening_and_refinement()
  – Transferring the solution: SolutionTransfer class (maybe discussed later)
Constraints
- Used for hanging nodes (and other things!)
- Have the form: x_i = ∑_j α_ij x_j + c_i
- Represented by the class ConstraintMatrix
- Created using DoFTools::make_hanging_node_constraints()
- Will also use for boundary values from now on:
  VectorTools::interpolate_boundary_values(..., constraints);
- Need a different SparsityPattern (see step-6):
  DoFTools::make_sparsity_pattern(..., constraints, ...)
Constraints II
- Old approach (explained in the video):
  – Assemble the global matrix
  – Then eliminate rows/columns: ConstraintMatrix::condense(...)
    (similar to MatrixTools::apply_boundary_values() in step-3)
  – Solve, then set all constrained values correctly: ConstraintMatrix::distribute(...)
- New approach (step-6):
  – Assemble the local matrix as normal
  – Eliminate while transferring into the global matrix:
    constraints.distribute_local_to_global(cell_matrix, cell_rhs,
                                           local_dof_indices,
                                           system_matrix, system_rhs);
  – Solve, then set all constrained values correctly: ConstraintMatrix::distribute(...)
Vector-Valued Problems
- (video 19&20)
- FESystem: list of FEs (can be nested!)
- Will give one FE with N shape functions
- Use FEValuesExtractors to do
  fe_values[velocities].divergence(i, q), ...
- Ordering of DoFs in system matrix is independent
- See module “handling vector valued problems”
- Non-primitive elements (see fe.is_primitive()):
shape functions have more than one non-zero component, example: RT
Computing Errors
- Important for verification!
- See step-7 for an example
- Set up problem with analytical solution and implement it as a Function<dim>
- Quantities of interest:
- Break it down as one operation per cell and the “summation” (local and global error)
- Need quadrature to compute integrals
e = u − u_h
∥e∥_0 = ∥e∥_{L2} = ( ∑_K ∥e∥_{0,K}^2 )^{1/2}
∣e∣_1 = ∣e∣_{H1} = ∥∇e∥_0 = ( ∑_K ∥∇e∥_{0,K}^2 )^{1/2}
∥e∥_1 = ∥e∥_{H1} = ( ∣e∣_1^2 + ∥e∥_0^2 )^{1/2} = ( ∑_K ∥e∥_{1,K}^2 )^{1/2}
∥e∥_{0,K} = ( ∫_K ∣e∣^2 )^{1/2}
Computing Errors
- Example:
Vector<float> difference_per_cell(triangulation.n_active_cells());
VectorTools::integrate_difference(dof_handler,
                                  solution,              // solution vector
                                  Solution<dim>(),       // reference solution
                                  difference_per_cell,
                                  QGauss<dim>(3),        // quadrature
                                  VectorTools::L2_norm); // local norm
const double L2_error = difference_per_cell.l2_norm(); // global norm

- Local norms: mean, L1_norm, L2_norm, Linfty_norm, H1_seminorm, H1_norm, ...
- Global norms are vector norms: l1_norm(), l2_norm(), linfty_norm(), ...
ParameterHandler
- Control program at runtime without recompilation
- You can put in:
– ints (e.g. number of refinements)
– doubles (e.g. coefficients, time step size)
– strings (e.g. choice of algorithm/mesh/problem/etc.)
– functions (e.g. right-hand side, reference solution)
- Stuff can be grouped in sections
- See class-repository: prm/
ParameterHandler
# order of the finite element to use.
set fe order = 1

# Refinement method. Choice between 'global' and 'adaptive'.
set refinement = global

subsection equation
  # expression for the reference solution and boundary values. Function of x,y (and z)
  set reference = sin(pi*x)*cos(pi*y)

  # expression for the gradient of the reference solution. Function of x,y (and z)
  set gradient = pi*cos(pi*x)*cos(pi*y); -pi*sin(pi*x)*sin(pi*y)

  # expression for the right-hand side. Function of x,y (and z)
  set rhs = 2*pi*pi*sin(pi*x)*cos(pi*y) + sin(pi*x)*cos(pi*y)
end