An FPGA Implementation of Reciprocal Sums for SPME
Sam Lee and Paul Chow
Edward S. Rogers Sr. Department of Electrical and Computer Engineering University of Toronto
Objectives
Accelerate part of Molecular Dynamics Simulation: Smooth Particle Mesh Ewald (SPME)
Implementation: FPGA based; try it and learn
Investigation: acceleration bottleneck, precision requirement, parallelization strategy
Outline
Molecular Dynamics
SPME
The Reciprocal Sum Compute Engine
Speedup and Parallelization
Precision
Future work
forces.
equations of motion.
calculations with Newton’s equations of motion.
demanding.
$$\vec{a} = m^{-1} \cdot \vec{F}$$
$$\vec{r}(t + \delta t) = \vec{r}(t) + \delta t\,\vec{v}(t) + 0.5\,\delta t^{2}\,\vec{a}(t)$$
$$\vec{v}(t + \delta t) = \vec{v}(t) + 0.5\,\delta t\,\left[\vec{a}(t) + \vec{a}(t + \delta t)\right]$$
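The position and velocity updates above can be stepped in a few lines of code. This is a minimal sketch, assuming a one-dimensional harmonic oscillator with unit mass and unit spring constant as the test system; the function names, the time step, and the step count are illustrative choices, not taken from the presentation.

```python
import math

def velocity_verlet(r, v, accel, dt, steps):
    """Advance position r and velocity v using the two update equations."""
    a = accel(r)
    for _ in range(steps):
        r = r + dt * v + 0.5 * dt * dt * a   # r(t + dt)
        a_new = accel(r)                     # a(t + dt) from the new position
        v = v + 0.5 * dt * (a + a_new)       # v(t + dt)
        a = a_new
    return r, v

def spring_accel(r):
    """Assumed test force: a = F / m with F = -k * r and m = k = 1."""
    return -r

r, v = velocity_verlet(1.0, 0.0, spring_accel, dt=0.01, steps=1000)
energy = 0.5 * v * v + 0.5 * r * r
print(r, math.cos(10.0))   # numerical position vs. exact cos(t) at t = 10
print(energy)              # should stay close to the initial energy 0.5
```

The integrator needs one force evaluation per step, which is why the cost of the force calculation dominates an MD time step.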
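$$\sum_{\text{All Bonds}} k_l (l - l_0)^2 + \sum_{\text{All Angles}} k_\Theta (\Theta - \Theta_0)^2 + \sum_{\text{All Torsions}} A\,[1 + \cos(n\tau + \phi)] + \sum_{\text{All Pairs}} 4\varepsilon \left[ \left(\frac{\sigma}{r}\right)^{12} - \left(\frac{\sigma}{r}\right)^{6} \right] + \sum_{\text{All Pairs}} \frac{q_1 q_2}{r}$$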
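As a small illustration of one non-bonded term, the Lennard-Jones pair energy 4ε[(σ/r)^12 − (σ/r)^6] can be evaluated as below. This is only a sketch: the function name and the reduced units (ε = σ = 1) are assumptions made for the example.

```python
def lennard_jones(r, epsilon=1.0, sigma=1.0):
    """Pair energy 4 * epsilon * [(sigma / r)**12 - (sigma / r)**6]."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)

r_min = 2.0 ** (1.0 / 6.0)       # separation where the energy is lowest
print(lennard_jones(1.0))        # the energy crosses zero at r = sigma
print(lennard_jones(r_min))      # close to -epsilon at the minimum
```

Because the term decays as r^-6, it is usually cut off at a short distance, unlike the Coulomb term, which decays only as 1/r.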
Problem scientists are facing:
SLOW! O(N^2) complexity.
30 CPU Years
Parallelize to more compute engines
Accelerate with FPGA
Especially: the non-bonded calculations
To be more specific, this paper addresses:
Electrostatic interaction (Reciprocal space)
Smooth Particle Mesh Ewald algorithm
Software SPME Implementations:
Original PME Package written by Toukmaji. Used in NAMD2.
Hardware Implementations:
No previous hardware implementation of reciprocal sums calculation.
MD-Grape & MD-Engine use Ewald Summation.
Ewald Summation is O(N^2); SPME is O(N log N)!
Coulombic equation:
$$v_{coulomb} = \frac{q_1 q_2}{4\pi\varepsilon_0 r}$$
Under the Periodic Boundary Condition, the summation to calculate Electrostatic energy is only … Conditionally Convergent.
$$U = \frac{1}{2} {\sum_{n}}' \sum_{i=1}^{N} \sum_{j=1}^{N} \frac{q_i q_j}{r_{ij,n}}$$
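To make the O(N^2) cost of the pairwise Coulomb sum concrete, the sketch below evaluates it directly for a handful of point charges. It is a simplification under stated assumptions: reduced units with 1/(4πε0) = 1, a single box with no periodic images (only the n = 0 term), and made-up charges and coordinates.

```python
import math

def coulomb_energy(charges, positions):
    """Sum q_i * q_j / r_ij over all distinct pairs; also count the pairs."""
    energy = 0.0
    pairs = 0
    n = len(charges)
    for i in range(n):
        for j in range(i + 1, n):   # each pair once, so no factor 1/2 is needed
            r = math.dist(positions[i], positions[j])
            energy += charges[i] * charges[j] / r
            pairs += 1
    return energy, pairs

# Hypothetical example system: four unit charges of alternating sign on a line.
charges = [1.0, -1.0, 1.0, -1.0]
positions = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
energy, pairs = coulomb_energy(charges, positions)
print(energy, pairs)   # pairs equals N * (N - 1) / 2
```

The pair count grows quadratically with N, and adding periodic images multiplies the work again, which is the motivation for the Ewald and SPME reformulations.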
To combat Surface Effect…
Replication
[Figure: the simulation box containing particles 1 to 5, shown as replicated boxes labeled A to I.]
[Figure: three plots of charge q versus distance r, with parts labeled Direct Sum and Reciprocal Sum.]
To calculate the Coulombic Interactions:
O(N^2) Direct Sum + O(N^2) Reciprocal Sum
Shift the workload to the Reciprocal Sum.
Use Fast Fourier Transform.
O(N) Real + O(N log N) Reciprocal.
RSCE calculates the Reciprocal Sums using the SPME algorithm.
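Energy:
$$\tilde{E}_{rec} = \frac{1}{2\pi V} \sum_{m \neq 0} \frac{\exp(-\pi^2 m^2 / \beta^2)}{m^2}\, B(m_1, m_2, m_3)\, F(Q)(m_1, m_2, m_3)\, F(Q)(-m_1, -m_2, -m_3)$$
(each F(Q) term is computed with an FFT)
$$\tilde{E}_{rec} = \frac{1}{2} \sum_{m_1=0}^{K_1-1} \sum_{m_2=0}^{K_2-1} \sum_{m_3=0}^{K_3-1} Q(m_1, m_2, m_3)\, (\theta_{rec} * Q)(m_1, m_2, m_3)$$
where
$$B(m_1, m_2, m_3) = |b_1(m_1)|^2\, |b_2(m_2)|^2\, |b_3(m_3)|^2$$
$$b_i(m_i) = \exp\!\left(\frac{2\pi i (n-1) m_i}{K_i}\right) \times \left[ \sum_{k=0}^{n-2} M_n(k+1) \exp\!\left(\frac{2\pi i m_i k}{K_i}\right) \right]^{-1}$$
$$C(m_1, m_2, m_3) = \frac{1}{\pi V} \frac{\exp(-\pi^2 m^2 / \beta^2)}{m^2} \quad (m \neq 0), \qquad C(0, 0, 0) = 0$$
Force:
$$F_{\alpha i} = \frac{\partial \tilde{E}_{rec}}{\partial r_{\alpha i}} = \sum_{m_1=0}^{K_1-1} \sum_{m_2=0}^{K_2-1} \sum_{m_3=0}^{K_3-1} \frac{\partial Q}{\partial r_{\alpha i}}(m_1, m_2, m_3)\, (\theta_{rec} * Q)(m_1, m_2, m_3)$$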
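The b_i(m_i) factors can be tabulated once per mesh size, since they depend only on the interpolation order n and the mesh dimension K_i. The sketch below is an illustration under assumptions: it uses the standard recursion for the cardinal B-spline M_n (not shown on the slide), and n = 4 with K = 8 are example values picked for the demonstration. Function names are hypothetical.

```python
import cmath

def bspline(n, u):
    """Cardinal B-spline M_n(u), built recursively from the order-2 hat function."""
    if n == 2:
        return 1.0 - abs(u - 1.0) if 0.0 <= u <= 2.0 else 0.0
    return (u * bspline(n - 1, u) + (n - u) * bspline(n - 1, u - 1.0)) / (n - 1)

def b_coeff(m, K, n):
    """b(m) = exp(2*pi*i*(n-1)*m/K) / sum_{k=0}^{n-2} M_n(k+1) * exp(2*pi*i*m*k/K)."""
    numerator = cmath.exp(2j * cmath.pi * (n - 1) * m / K)
    denominator = sum(bspline(n, k + 1) * cmath.exp(2j * cmath.pi * m * k / K)
                      for k in range(n - 1))
    return numerator / denominator

def b_array_entry(m1, m2, m3, K, n):
    """B(m1, m2, m3) = |b(m1)|^2 * |b(m2)|^2 * |b(m3)|^2 for a cubic K x K x K mesh."""
    return (abs(b_coeff(m1, K, n)) ** 2
            * abs(b_coeff(m2, K, n)) ** 2
            * abs(b_coeff(m3, K, n)) ** 2)

K, n = 8, 4   # assumed example values
moduli = [abs(b_coeff(m, K, n)) ** 2 for m in range(K)]
print(moduli)
```

For m = 0 the spline weights sum to one, so |b(0)|^2 is 1; the values grow toward m = K/2, where the interpolation damps the mesh charges the most. For odd n the denominator can vanish at m = K/2, so an even order is used here.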
RSCE @ 100 MHz vs. Intel P4 @ 2.4 GHz.
Speedup: 3x to 14x
Why so insignificant?
Reciprocal Sums calculations not easily parallelizable.
QMM memory bandwidth limitation.
Improvement:
Using more QMM memories can improve the speedup.
Slight design modifications are required.
Assume a 2-D simulation system.
Assume P = 2, K = 8, N = 6.
Assume NumP = 4.
An 8x8x8 mesh
Four 4x4x4 Mini Meshes
[Figure: the Kx-Ky mesh shared among processors P1 to P4, shown twice: X Direction FFT (1D FFT X direction) and Y Direction FFT (1D FFT Y direction).]
Mini-mesh composed -> 2D-IFFT
2D-IFFT = two passes of 1D-FFT (X and Y).
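The claim that a 2D transform is two passes of 1D transforms can be checked numerically. The sketch below is an illustration only: it uses a naive O(n^2) DFT in pure Python instead of a real FFT, shows the forward transform (the inverse separates in the same way), and runs on a made-up 4x4 mesh.

```python
import cmath

def dft(values):
    """Naive 1D discrete Fourier transform of a list of numbers."""
    n = len(values)
    return [sum(values[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
            for j in range(n)]

def dft2_two_pass(mesh):
    """2D DFT as a pass along X (rows) followed by a pass along Y (columns)."""
    x_pass = [dft(row) for row in mesh]
    y_pass = [dft(list(col)) for col in zip(*x_pass)]
    return [list(row) for row in zip(*y_pass)]   # transpose back to [y][x] order

def dft2_direct(mesh):
    """2D DFT evaluated straight from the double-sum definition."""
    ny, nx = len(mesh), len(mesh[0])
    return [[sum(mesh[y][x] * cmath.exp(-2j * cmath.pi * (u * x / nx + v * y / ny))
                 for y in range(ny) for x in range(nx))
             for u in range(nx)]
            for v in range(ny)]

# Hypothetical 4x4 mesh of charge values.
mesh = [[float((3 * y + 5 * x) % 7) for x in range(4)] for y in range(4)]
two_pass = dft2_two_pass(mesh)
direct = dft2_direct(mesh)
max_diff = max(abs(two_pass[v][u] - direct[v][u]) for v in range(4) for u in range(4))
print(max_diff)   # difference between the two results, at round-off level
```

In the X pass each row is independent of the others, and in the Y pass each column is, so each pass can be split across processors; data has to be exchanged only between the two passes.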
2D-IFFT -> Energy Calculation -> 2D-FFT
2D-FFT -> Force Calculation
Energy Calculation:
$$E_{Total} = \sum_{P=0}^{3} E_P$$
[Figure: panels labeled Energy Calculation and Force Calculation.]
Precision goal: Relative error bound < 10^-5.
Two major calculation steps:
B-Spline Calculation.
3D-FFT/IFFT Calculation.
Due to the limited logic resource & limited precision FFT LogiCORE => Precision goal cannot be achieved.
To achieve the relative error bound of < 10^-5, minimum calculation precision:
FFT {14.30}, B-Spline {1.27}
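To give a feel for what such a fixed-point format means, the sketch below rounds a value to a given number of fractional bits and reports the rounding error. It assumes that {1.27} denotes 1 integer bit and 27 fractional bits (and {14.30} likewise), which is an interpretation of the notation rather than something stated on the slide. It shows only the rounding of a single value, not how errors accumulate through the B-spline and FFT stages.

```python
def quantize(value, frac_bits):
    """Round value to the nearest multiple of 2**-frac_bits."""
    scale = 1 << frac_bits
    return round(value * scale) / scale

def rounding_error(value, frac_bits):
    """Absolute error introduced by quantizing value."""
    return abs(quantize(value, frac_bits) - value)

sample = 0.3   # arbitrary example value
for bits in (27, 30):
    # The error can never exceed half of one step, i.e. 2**-(bits + 1).
    print(bits, rounding_error(sample, bits), 2.0 ** -(bits + 1))
```

A single rounding at 27 fractional bits is already several orders of magnitude below 10^-5; the extra bits leave headroom for the error that builds up over the many multiply-accumulate steps of the calculation.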
RMS Energy Error Fluctuation:
$$\text{RMS Energy Fluctuation} = \frac{\sqrt{\langle E^2 \rangle - \langle E \rangle^2}}{\langle E \rangle}$$
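The fluctuation measure can be computed from a series of total-energy samples taken over a simulation run. The sketch below is a minimal illustration: the function name and the sample values are made up, and taking the absolute value of the mean is an added guard so that negative total energies still give a positive ratio.

```python
import math

def rms_energy_fluctuation(energies):
    """sqrt(<E^2> - <E>^2) divided by the magnitude of <E>."""
    n = len(energies)
    mean = sum(energies) / n
    mean_sq = sum(e * e for e in energies) / n
    variance = max(mean_sq - mean * mean, 0.0)   # clamp tiny negative round-off
    return math.sqrt(variance) / abs(mean)

# Hypothetical energy samples from successive time steps.
samples = [9.0, 11.0]
print(rms_energy_fluctuation(samples))
```

A perfectly conserved energy gives a fluctuation of zero, so a smaller value indicates that the arithmetic precision is disturbing the simulation less.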
Implementation of FPGA-based Reciprocal Sums Compute Engine and its SystemC model.
Integration of the RSCE into a widely used Molecular Dynamics program called NAMD2 for verification.
RSCE Speedup Estimate: 3x to 14x
Precision Requirement: B-Spline {1.27} & FFT {14.30} => 10^-5 rel. error
Parallelization Strategy
More in-depth precision analysis.
Investigation on how to further speed up the SPME algorithm with FPGA.