May 8-11, 2017 | Silicon Valley
Cris Cecka, Senior Research Scientist. May 11, 2017
LOW-COMMUNICATION FFT WITH FAST MULTIPOLE METHOD
2
THE FAST FOURIER TRANSFORM
Operation Count: 4N log2 N − 6N + 8
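As a quick way to get a feel for the operation count above, the sketch below evaluates 4N log2 N − 6N + 8 for power-of-two sizes. The function name `fft_flops` is hypothetical and only wraps the formula from the slide.

```python
def fft_flops(n: int) -> int:
    """Evaluate 4 N log2 N - 6 N + 8 for a power-of-two transform size N."""
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two and at least 2")
    log2_n = n.bit_length() - 1  # exact log2 for powers of two
    return 4 * n * log2_n - 6 * n + 8


for n in (8, 16, 1 << 20):
    print(n, fft_flops(n))  # the first two lines print "8 56" and "16 168"
```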
3
SPLIT-RADIX FFT
Algorithm
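The algorithm figure itself is not reproduced in this transcript. As a stand-in, here is a minimal recursive sketch of the textbook split-radix decomposition (one half-size transform of the even samples, two quarter-size transforms of the samples at 4j+1 and 4j+3). It is an illustration under that assumption, not the implementation profiled in the talk; `split_radix_fft` and `naive_dft` are hypothetical names.

```python
import cmath


def split_radix_fft(x):
    """Forward DFT of a power-of-two-length sequence via split-radix recursion."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(x[0])]
    if n == 2:
        return [x[0] + x[1], x[0] - x[1]]
    u = split_radix_fft(x[0::2])   # size n/2: even-indexed samples
    z = split_radix_fft(x[1::4])   # size n/4: samples 4j + 1
    zp = split_radix_fft(x[3::4])  # size n/4: samples 4j + 3
    q = n // 4
    out = [0j] * n
    for k in range(q):
        w1 = cmath.exp(-2j * cmath.pi * k / n)      # twiddle w^k
        w3 = cmath.exp(-2j * cmath.pi * 3 * k / n)  # twiddle w^(3k)
        s = w1 * z[k] + w3 * zp[k]
        d = w1 * z[k] - w3 * zp[k]
        out[k] = u[k] + s
        out[k + 2 * q] = u[k] - s
        out[k + q] = u[k + q] - 1j * d
        out[k + 3 * q] = u[k + q] + 1j * d
    return out


def naive_dft(x):
    """O(N^2) reference DFT used only to check the sketch."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]
```

Comparing `split_radix_fft(x)` with `naive_dft(x)` on a short input is a simple way to confirm the butterfly signs.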
4
SPLIT-RADIX FFT
Profile
5
FMM-FFT
Edelman et al. 1999
6
STRUCTURED DENSE MATRICES AND FMM
$A = U D V^*$
$K = U \tilde{K}_{r \times r} V^*$
$K_{IJ} = U_I \tilde{K}_{IJ} V_J^*$
$K_{IJ} = U_I \tilde{U}_{\tilde{I}} \tilde{K}_{\tilde{I}\tilde{J}} \tilde{V}_{\tilde{J}}^* V_J^*$
7
FMM-FFT
Algorithm
$M_{M,P} = \mathrm{diag}(I_M, C_1, \ldots, C_{P-1})$
$[C_p]_{mn} = \rho_p \left[ \cot\left( \frac{\pi}{M} \left( n - m + \frac{p}{P} \right) \right) + i \right]$
8
COT FMM
$[C_p]_{mn} = \rho_p \left[ \cot\left( \frac{\pi}{M} \left( n - m + \frac{p}{P} \right) \right) + i \right]$
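To make the structure of C_p concrete, the sketch below builds one dense M x M block from the cotangent formula and exposes the fact that each entry depends only on n − m, so the block is Toeplitz. The scalar ρ_p is not defined in this section, so it is passed in as a caller-supplied placeholder `rho` (an assumption, defaulting to 1); `cot_block` is a hypothetical name.

```python
import math


def cot(t: float) -> float:
    return math.cos(t) / math.sin(t)


def cot_block(p: int, M: int, P: int, rho: complex = 1.0):
    """Dense block with entries rho * (cot(pi/M * (n - m + p/P)) + i).

    rho stands in for the scalar rho_p, which this sketch does not define.
    """
    if not 0 < p < P:
        raise ValueError("p must satisfy 0 < p < P (the p = 0 block is I_M)")
    return [[rho * complex(cot(math.pi / M * (n - m + p / P)), 1.0)
             for n in range(M)]
            for m in range(M)]


C = cot_block(p=1, M=8, P=4)
# Entries are constant along each diagonal because they depend only on n - m.
print(C[0][1] == C[1][2] == C[2][3])
```

Because the block is Toeplitz, a dense build like this is only useful for checking small cases; the FMM applies it without ever forming the matrix.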
9
FMM OPERATORS
Each operator is an (implicit) matrix.
[Figure: FMM tree with operators S2M, M2M, M2L, L2L, L2T, and S2T; leaf boxes hold M/2^L points, each box carries Q coefficients; base level B = 2, leaf level L = 4.]
10
PARAMETERS OF THE FMM-FFT
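$N = M P$
$L = \log_2(M / M_L)$
Parameters: $(N, P, M_L, Q, B)$, where P is the number of FMMs, $M_L$ the points per box per FMM, Q the quadrature order, and B the base level.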
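The relations N = M P and L = log2(M / M_L) can be turned into a small parameter check. The sketch below derives M and L from (N, P, M_L) and rejects combinations where the number of leaf boxes M / M_L is not a power of two; the helper name `fmm_fft_shape` and the validation rules are assumptions made for illustration.

```python
def fmm_fft_shape(N: int, P: int, ML: int):
    """Return (M, L) with N = M * P and L = log2(M / ML)."""
    if N % P:
        raise ValueError("P must divide N")
    M = N // P
    if M % ML:
        raise ValueError("ML must divide M")
    boxes = M // ML  # number of leaf boxes per FMM, expected to be 2**L
    if boxes & (boxes - 1):
        raise ValueError("M / ML must be a power of two")
    L = boxes.bit_length() - 1
    return M, L


M, L = fmm_fft_shape(N=1024, P=4, ML=16)
print(M, L)  # prints "256 4"
```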
11
DISTRIBUTED FMM
[Figure: communication stages labeled All2All, Gather, Halo 2b, and Halo 1b.]
12
INTERPOLATIVE FMM
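$z_j = \cos\left( \frac{(2j+1)\pi}{2Q} \right), \qquad \ell_i(z) = \prod_{0 \le k < Q,\ k \ne i} \frac{z - z_k}{z_i - z_k}$
$C_{ij} = \ell_m(t_i^I)\, \ell_q(\tilde{z}_m^I)\, C(\tilde{z}_q^I, \tilde{z}_r^J)\, \ell_r(\tilde{z}_n^J)\, \ell_n(s_j^J)$
Operators: S2M, M2M, M2L, L2L, L2T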
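The Chebyshev nodes z_j = cos((2j+1)π / (2Q)) and the Lagrange basis ℓ_i(z) defined above are the building blocks of every interpolative operator. The sketch below evaluates them directly from those two formulas and uses them to interpolate a smooth function; the function names are hypothetical and the product is evaluated naively rather than in a numerically optimized form.

```python
import math


def chebyshev_nodes(Q: int):
    """Nodes z_j = cos((2j + 1) * pi / (2Q)) for j = 0..Q-1."""
    return [math.cos((2 * j + 1) * math.pi / (2 * Q)) for j in range(Q)]


def lagrange(i: int, z: float, nodes) -> float:
    """l_i(z): product over k != i of (z - z_k) / (z_i - z_k)."""
    out = 1.0
    for k, zk in enumerate(nodes):
        if k != i:
            out *= (z - zk) / (nodes[i] - zk)
    return out


def interpolate(f, z: float, Q: int) -> float:
    """Degree Q-1 interpolant of f through the Chebyshev nodes, evaluated at z."""
    nodes = chebyshev_nodes(Q)
    return sum(f(zi) * lagrange(i, z, nodes) for i, zi in enumerate(nodes))


print(abs(interpolate(math.cos, 0.3, 8) - math.cos(0.3)))  # small interpolation error
```

Two properties are worth checking by hand: ℓ_i(z_j) is 1 when i = j and 0 otherwise, and the basis functions sum to 1 at any z.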
13
TENSOR REPRESENTATIONS
$A_{ijk\ell} := \texttt{A[i + j*ldA<1> + k*ldA<2> + l*ldA<3>]}$
$S_n \equiv S_{pm} \equiv S_{pmb}, \qquad T_n \equiv T_{pm} \equiv T_{pmb}$
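The strided indexing rule above means a multi-index view is just arithmetic on a flat buffer, so S_n, S_pm, and S_pmb can all alias the same memory. The sketch below spells that out. It assumes the first index is the fastest-varying one and, for the box view, that n = p + P(m + M_L b); that layout and the helper names are assumptions for illustration.

```python
def offset(i, j, k, l, ld1, ld2, ld3):
    """Flat offset of A_ijkl following A[i + j*ldA<1> + k*ldA<2> + l*ldA<3>]."""
    return i + j * ld1 + k * ld2 + l * ld3


def compact_strides(n0, n1, n2):
    """Leading dimensions of a densely packed tensor whose first three extents are n0, n1, n2."""
    return n0, n0 * n1, n0 * n1 * n2


def box_view_offset(p, m, b, P, ML):
    """Offset of S_pmb inside the flat vector S_n (assumed layout, see lead-in)."""
    return offset(p, m, b, 0, P, P * ML, 0)


ld = compact_strides(2, 3, 4)
print(offset(1, 2, 3, 4, *ld))  # last element of a 2x3x4x5 tensor: prints 119
```

No data is moved when switching between the views; only the strides passed to the next operation change.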
14
S2M/L2T
$\mathrm{S2M}_{qm} = \ell_q(s_m), \qquad s_m = -1 + \frac{2m+1}{M_L}$
$M^L_{(p-1)qb} = \mathrm{S2M}_{qm}\, S_{pmb}$
Computed with a single BatchedGEMM.
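A small sketch of the S2M step may help: it builds the Q x M_L matrix S2M_qm = ℓ_q(s_m) with s_m = −1 + (2m+1)/M_L and applies it to every box of a single p slice. The loop over boxes plays the role of the batch dimension; the function names and the list-of-lists layout `S[b][m]` are assumptions for illustration, not the layout used on the GPU.

```python
import math


def chebyshev_nodes(Q):
    return [math.cos((2 * j + 1) * math.pi / (2 * Q)) for j in range(Q)]


def lagrange(i, z, nodes):
    out = 1.0
    for k, zk in enumerate(nodes):
        if k != i:
            out *= (z - zk) / (nodes[i] - zk)
    return out


def s2m_matrix(Q, ML):
    """S2M[q][m] = l_q(s_m) at the equispaced source points s_m in (-1, 1)."""
    nodes = chebyshev_nodes(Q)
    points = [-1 + (2 * m + 1) / ML for m in range(ML)]
    return [[lagrange(q, s, nodes) for s in points] for q in range(Q)]


def s2m_apply(S2M, S):
    """For each box b: M[b][q] = sum over m of S2M[q][m] * S[b][m]."""
    return [[sum(row[m] * Sb[m] for m in range(len(Sb))) for row in S2M]
            for Sb in S]


S2M = s2m_matrix(Q=6, ML=16)
S = [[1.0] * 16, [float(m) for m in range(16)]]  # two boxes of sources
multipoles = s2m_apply(S2M, S)
print([round(sum(mb), 6) for mb in multipoles])  # box totals are preserved
```

Since the Lagrange basis sums to 1, the sum of the multipole coefficients of a box equals the sum of its sources, which is a convenient sanity check.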
15
BATCHED MATRIX-MATRIX MULTIPLY
cublas<T>gemmStridedBatched in cuBLAS 8.0
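For readers unfamiliar with strided batching, the sketch below is a plain-Python reference for what such a call computes: for each batch index b, C_b = alpha * A_b * B_b + beta * C_b, where the b-th matrices start at offsets b * stride in flat column-major buffers. It covers only the no-transpose case, and the argument order is an illustrative choice rather than a binding to the cuBLAS API. Passing a stride of 0 for B in this sketch reuses one operator for every batch.

```python
def gemm_strided_batched(m, n, k, alpha,
                         A, lda, stride_a,
                         B, ldb, stride_b,
                         beta, C, ldc, stride_c, batch_count):
    """Reference semantics: C_b = alpha * A_b @ B_b + beta * C_b, column-major."""
    for b in range(batch_count):
        a0, b0, c0 = b * stride_a, b * stride_b, b * stride_c
        for col in range(n):
            for row in range(m):
                acc = 0.0
                for t in range(k):
                    acc += A[a0 + row + t * lda] * B[b0 + t + col * ldb]
                idx = c0 + row + col * ldc
                C[idx] = alpha * acc + beta * C[idx]


# Two 2x2 matrices stored back to back, multiplied by one shared identity.
A = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
B = [1.0, 0.0, 0.0, 1.0]
C = [0.0] * 8
gemm_strided_batched(2, 2, 2, 2.0, A, 2, 4, B, 2, 0, 0.0, C, 2, 4, 2)
print(C)  # every entry of A doubled
```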
16
S2M/L2T
$T_{pmb} = \mathrm{L2T}_{mq} L_{pqb} \implies T_{pm}[b] = L_{pq}[b]\, \mathrm{S2M}_{qm}$
$M_{pqb} = \mathrm{S2M}_{qm} S_{pmb} \implies M_{pq}[b] = S_{pm}[b]\, \mathrm{S2M}_{qm}^T$
17
M2M/L2L
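$\mathrm{M2M}^{\pm}_{qk} = \ell_q\left( \frac{z_k \pm 1}{2} \right)$
$M^{\ell}_{pqb} = \mathrm{M2M}_{qk}\, M^{\ell+1}_{pk(2b)}$
$L^{\ell+1}_{pq(2b)} = \mathrm{L2L}_{qk}\, L^{\ell}_{pkb} + L^{\ell+1}_{pq(2b)}$
Computed with a single BatchedGEMM.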
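The upward translation can be sketched in a few lines. The code below forms the two matrices M2M±_qk = ℓ_q((z_k ± 1)/2) and merges the coefficients of two child boxes into their parent. It assumes the "−" matrix belongs to the left child (2b) and the "+" matrix to the right child (2b+1), and that both contributions are summed; the slide writes out only one of the two terms, so treat that pairing and the function names as assumptions.

```python
import math


def chebyshev_nodes(Q):
    return [math.cos((2 * j + 1) * math.pi / (2 * Q)) for j in range(Q)]


def lagrange(i, z, nodes):
    out = 1.0
    for k, zk in enumerate(nodes):
        if k != i:
            out *= (z - zk) / (nodes[i] - zk)
    return out


def m2m_matrices(Q):
    """Return (minus, plus) with entries l_q((z_k - 1)/2) and l_q((z_k + 1)/2)."""
    z = chebyshev_nodes(Q)
    minus = [[lagrange(q, (zk - 1) / 2, z) for zk in z] for q in range(Q)]
    plus = [[lagrange(q, (zk + 1) / 2, z) for zk in z] for q in range(Q)]
    return minus, plus


def m2m_apply(minus, plus, left, right):
    """Parent coefficients from the left (2b) and right (2b+1) child coefficients."""
    Q = len(minus)
    return [sum(minus[q][k] * left[k] + plus[q][k] * right[k] for k in range(Q))
            for q in range(Q)]


minus, plus = m2m_matrices(5)
parent = m2m_apply(minus, plus, [1.0, 2.0, 3.0, 4.0, 5.0], [0.5] * 5)
print(round(sum(parent), 6))  # equals the combined child total, 17.5
```

The L2L step uses the transposed relationship, pushing parent local coefficients down to each child and accumulating into what is already there.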
18
S2T/M2L
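$T_{pib} = \mathrm{S2T}_{p(j-i)}\, S_{pjb}$
$\mathrm{S2T}_{pk} = \begin{cases} \cot\left( \frac{\pi}{N} (p + Pk) \right) & p \ne 0 \\ \delta_{k0} & p = 0 \end{cases}$
$L^{\ell}_{pib} = \mathrm{M2L}^{\ell}_{pijs}\, M^{\ell}_{pj(b+s)}$
$\mathrm{M2L}^{\ell}_{pijs} = \cot\left( \frac{\pi}{2^{\ell}} \left( \frac{z_j}{2} - \frac{z_i}{2} + s \right) + \frac{\pi}{N} (p + 1) \right)$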
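The near-field kernel cot(π/N (p + Pk)) is the same cotangent that appears in the definition of [C_p]_mn, cot(π/M (n − m + p/P)), once k = n − m and N = M P are substituted. The sketch below checks that identity numerically; the function names are hypothetical, and the scalar ρ_p and the "+ i" term of C_p are deliberately left out so that only the cotangent parts are compared.

```python
import math


def cot(t: float) -> float:
    return math.cos(t) / math.sin(t)


def s2t(p: int, k: int, N: int, P: int) -> float:
    """S2T_pk: Kronecker delta for p = 0, cot(pi/N * (p + P*k)) otherwise."""
    if p == 0:
        return 1.0 if k == 0 else 0.0
    return cot(math.pi / N * (p + P * k))


def cp_cot(p: int, m: int, n: int, M: int, P: int) -> float:
    """Cotangent part of [C_p]_mn: cot(pi/M * (n - m + p/P))."""
    return cot(math.pi / M * (n - m + p / P))


M, P = 8, 4
N = M * P
worst = max(abs(s2t(p, n - m, N, P) - cp_cot(p, m, n, M, P))
            for p in range(1, P) for m in range(M) for n in range(M))
print(worst < 1e-9)
```

The p = 0 branch matches the identity block I_M in the block-diagonal matrix, which is why it reduces to a Kronecker delta.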
19
INTERPOLATIVE FMM
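Storage and compute per operator:
Storage: $P(4M_L - 1)$, $Q M_L$, $Q M_L$, $2Q^2$, $2Q^2$, $4(L - B) P Q^2$
Compute: $2PMQ$, $2PMQ$, $3P\,2^L M_L^2$, $4(2^L - 2^B) P Q^2$, $4(2^L - 2^B) P Q^2$, $3(2^L - 2^B) P Q^2$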
20
ALGORITHM
21
PROFILE
22
FMM-FFT PROFILE
[Figure: profile timeline with stages S2M, M2M, Halo, S2T, M2L, L2L, L2T, and 2D FFT.]
23
2xK40c FMM-FFT
24
2xP100 FMM-FFT
25
8xP100 FMM-FFT
26
FMM BREAKDOWN
Components
27
EFFICIENCY
28
PARAMETER DEPENDENCE — ML
Points per box per FMM
29
PARAMETER DEPENDENCE — P
Number of FMMs
30
PARAMETER DEPENDENCE — B
Base Level
31
PARAMETER DEPENDENCE — Q
Quadrature Order
32
FUTURE
Sparse FFT
33
CONCLUSION