LOW-COMMUNICATION FFT WITH FAST MULTIPOLE METHOD (Cris Cecka, PowerPoint PPT Presentation)

SLIDE 1

May 8-11, 2017 | Silicon Valley

Cris Cecka, Senior Research Scientist. May 11, 2017

LOW-COMMUNICATION FFT WITH FAST MULTIPOLE METHOD

SLIDE 2

THE FAST FOURIER TRANSFORM

Operation Count: 4N log2 N − 6N + 8
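For example, a length N = 2^20 transform costs 4·20·2^20 − 6·2^20 + 8 ≈ 7.8 × 10^7 real operations.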

SLIDE 3

SPLIT-RADIX FFT

Algorithm

SLIDE 4

SPLIT-RADIX FFT

Profile

SLIDE 5

FMM-FFT

Edelman et al. 1999

SLIDE 6

STRUCTURED DENSE MATRICES AND FMM

  • SVD: A = U D V*
  • Low-Rank: K = U K̃_{r×r} V*
  • Hierarchically LR: K_IJ = U_I K̃_IJ V*_J
  • H-Semi-Separable
  • H2-Matrix/FMM: K_IJ = U_I Ũ_Ĩ K̃_ĨJ̃ Ṽ*_J̃ V*_J
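Applying a rank-r factorization K = U K̃ V* to a vector costs O(Nr) instead of the O(N²) of a dense product; the hierarchical variants extend this saving to matrices that are only blockwise low-rank.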

SLIDE 7

FMM-FFT

Algorithm

M_{M,P} = diag(I_M, C_1, . . . , C_{P−1})

[C_p]_mn = ρ_p [ cot( (π/M)(n − m + p/P) ) + i ]

(remaining stage: a 2D M × P FFT)

SLIDE 8

COT FMM

  • One dimensional
  • Uniform — integers are source/target
  • Periodic
  • Distributed
  • Size M-by-M
  • P of them!
  • Interleaved

[C_p]_mn = ρ_p [ cot( (π/M)(n − m + p/P) ) + i ]
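A minimal host-side sketch (C++) of assembling one C_p block directly from this formula. The prefactor ρ_p is taken as a given scalar (it is not specified on the slide), and forming C_p densely is exactly the O(M²) work the FMM replaces:

#include <cmath>
#include <complex>
#include <vector>

// Dense assembly of [C_p]_{mn} = rho_p * [cot(pi/M * (n - m + p/P)) + i].
// For p in 1..P-1, p/P is a non-integer in (0, 1), so the cotangent
// argument never hits a multiple of pi and every entry is finite.
std::vector<std::complex<double>> make_Cp(int M, int P, int p, double rho_p) {
    const double pi = std::acos(-1.0);
    std::vector<std::complex<double>> C(std::size_t(M) * M);
    for (int m = 0; m < M; ++m)
        for (int n = 0; n < M; ++n) {
            double arg = pi / M * (n - m + double(p) / P);
            C[std::size_t(m) * M + n] =
                rho_p * std::complex<double>(1.0 / std::tan(arg), 1.0);
        }
    return C;
}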

SLIDE 9

FMM OPERATORS

Each operator is an (implicit) matrix.

[Figure: FMM tree with leaf boxes of size M/2^L and rank-Q expansions, annotated with the S2M, M2M, M2L, L2L, L2T, and S2T operators]

  • S: “Source”
  • T: “Target”
  • M: “Multipole”
  • L: “Local”

[Figure: S2T and M2L interaction lists; example with base level B = 2 and leaf level L = 4]

SLIDE 10

PARAMETERS OF THE FMM-FFT

  • FFT: N = M · P
  • FMM:
    • Rank: Q
    • Base level: B
    • Leaf box size: M_L
    • Leaf level: L = log2(M / M_L)

Full parameter set: (N, P, M_L, Q, B)
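For example, with N = 2^27, P = 256, and M_L = 64 (the configuration used in the parameter studies later), M = N/P = 2^19 and L = log2(2^19 / 64) = 13.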

SLIDE 11

DISTRIBUTED FMM

[Figure: distributed FMM communication pattern: All2All, Gather, and Halo (1b/2b) exchanges]

SLIDE 12

INTERPOLATIVE FMM

  • Same operators across all boxes
  • Same operators across all levels
  • Almost same operators across all FMMs

z_j = cos( (2j + 1)π / 2Q ),   ℓ_i(z) = ∏_{0 ≤ k < Q, k ≠ i} (z − z_k) / (z_i − z_k)

Operators: S2M, M2M, M2L, L2L, L2T

C_ij = ℓ_m(t^I_i) · ℓ_q(z̃^Ĩ_m) · C(z̃^Ĩ_q, z̃^J̃_r) · ℓ_r(z̃^J̃_n) · ℓ_n(s^J_j)   (sum over repeated indices)
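A minimal C++ sketch of the two ingredients above, the order-Q Chebyshev nodes and the Lagrange basis; names are illustrative:

#include <cmath>
#include <vector>

// Chebyshev nodes z_j = cos((2j+1)*pi / (2Q)) on [-1, 1].
std::vector<double> chebyshev_nodes(int Q) {
    const double pi = std::acos(-1.0);
    std::vector<double> z(Q);
    for (int j = 0; j < Q; ++j)
        z[j] = std::cos((2.0 * j + 1.0) * pi / (2.0 * Q));
    return z;
}

// Lagrange basis l_i(x) = prod_{k != i} (x - z_k) / (z_i - z_k).
double lagrange(const std::vector<double>& z, int i, double x) {
    double l = 1.0;
    for (int k = 0; k < (int)z.size(); ++k)
        if (k != i) l *= (x - z[k]) / (z[i] - z[k]);
    return l;
}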

SLIDE 13

TENSOR REPRESENTATIONS

  • Input:  S_n ≡ S_pm ≡ S_pmb
  • Output: T_n ≡ T_pm ≡ T_pmb

A_ijkℓ := A[i + j·ldA<1> + k·ldA<2> + ℓ·ldA<3>]
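As a concrete (illustrative) reading of that indexing convention, a minimal strided 4-mode view in C++; this is a sketch, not the talk's actual data structure:

// 4-mode tensor view over a flat array:
//   A_{ijkl} = A[i + j*ld1 + k*ld2 + l*ld3]
struct Tensor4 {
    double* data;
    long ld1, ld2, ld3;  // leading dimensions (strides) of modes j, k, l
    double& operator()(long i, long j, long k, long l) const {
        return data[i + j * ld1 + k * ld2 + l * ld3];
    }
};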

SLIDE 14

S2M/L2T

S2M_qm = ℓ_q(s_m),   s_m = −1 + (2m + 1)/M_L

M^L_(p−1)qb = S2M_qm S_pmb

Computed with a single BatchedGEMM

SLIDE 15

BATCHED MATRIX-MATRIX MULTIPLY

cublas<T>gemmStridedBatched in cuBLAS 8.0

SLIDE 16

S2M/L2T

T_pmb = L2T_mq L_pqb   ⇒   T_pm[b] = L_pq[b] S2M_qm

M_pqb = S2M_qm S_pmb   ⇒   M_pq[b] = S_pm[b] S2M^T_mq
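A sketch of the second identity using the cublas<T>gemmStridedBatched primitive named on the previous slide. The stride-0 trick broadcasts the shared S2M operator across the batch; the column-major layouts, dimensions, and names here are illustrative, not the talk's actual code:

#include <cublas_v2.h>
#include <cuComplex.h>

// Computes Mout[b] = S[b] * S2M^T for b = 0..batch-1 in one call:
//   S[b]  : P x ML  (stride P*ML between batches)
//   S2M   : Q x ML  (stride 0: the same operator for every batch)
//   Mout[b]: P x Q  (stride P*Q)
void s2m_batched(cublasHandle_t h, int P, int ML, int Q, int batch,
                 const cuDoubleComplex* S, const cuDoubleComplex* S2M,
                 cuDoubleComplex* Mout) {
    const cuDoubleComplex one  = make_cuDoubleComplex(1.0, 0.0);
    const cuDoubleComplex zero = make_cuDoubleComplex(0.0, 0.0);
    cublasZgemmStridedBatched(h, CUBLAS_OP_N, CUBLAS_OP_T,
                              P, Q, ML,
                              &one,
                              S,    P, (long long)P * ML,
                              S2M,  Q, 0,
                              &zero,
                              Mout, P, (long long)P * Q,
                              batch);
}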

SLIDE 17

M2M/L2L

M2M^±_qk = ℓ_q( (z_k ± 1)/2 )

M^ℓ_pqb = M2M_qk M^{ℓ+1}_pk(2b)

L^{ℓ+1}_pq(2b) = L2L_qk L^ℓ_pkb + L^{ℓ+1}_pq(2b)

Computed with a single BatchedGEMM

SLIDE 18

S2T/M2L

  • Also Level-3 Linear Algebra computations, but no BLAS primitives.
  • CUSTOM KERNELS

T_pib = S2T_p(j−i) S_pjb

S2T_pk = cot( (π/N)(p + P·k) )   for p > 0
S2T_pk = δ_k0                    for p = 0

L^ℓ_pib = M2L^ℓ_pijs M^ℓ_pj(b+s)

M2L^ℓ_pijs = cot( (π/2^ℓ)(z_j/2 − z_i/2 + s) + (π/N)(p + 1) )
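Since the S2T operator depends only on the difference j − i, the contraction is Toeplitz-like and has no direct BLAS-3 primitive. A minimal custom CUDA kernel sketch, assuming a precomputed real S2T table of size P × (2·M_L − 1) and an illustrative [p][i][b] layout:

#include <cuComplex.h>

// One thread per output element T_{pib} = sum_j S2T_{p,(j-i)} * S_{pjb}.
// S2T is indexed by k = j - i, shifted by ML - 1 so k = 0 sits mid-row.
__global__ void s2t_kernel(int P, int ML, int B,
                           const double* __restrict__ S2T,
                           const cuDoubleComplex* __restrict__ S,
                           cuDoubleComplex* __restrict__ T) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= P * ML * B) return;
    int b = idx % B;
    int i = (idx / B) % ML;
    int p = idx / (B * ML);
    cuDoubleComplex acc = make_cuDoubleComplex(0.0, 0.0);
    for (int j = 0; j < ML; ++j) {
        double w = S2T[p * (2 * ML - 1) + (j - i + ML - 1)];
        cuDoubleComplex s = S[(p * ML + j) * B + b];
        acc = cuCadd(acc, make_cuDoubleComplex(w * cuCreal(s), w * cuCimag(s)));
    }
    T[(p * ML + i) * B + b] = acc;
}

// Launch sketch: s2t_kernel<<<(P*ML*B + 255) / 256, 256>>>(P, ML, B, S2T, S, T);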

SLIDE 19

INTERPOLATIVE FMM

Operator | Storage       | Compute
S2M      | Q·M_L         | 2PMQ
L2T      | Q·M_L         | 2PMQ
M2M      | 2Q²           | 4(2^L − 2^B)PQ²
L2L      | 2Q²           | 4(2^L − 2^B)PQ²
M2L      | 4(L − B)PQ²   | 3(2^L − 2^B)PQ²
S2T      | P(4M_L − 1)   | 3P·2^L·M_L²

SLIDE 20

ALGORITHM

SLIDE 21

PROFILE

SLIDE 22

FMM-FFT PROFILE

[Profile: timeline of the S2M, M2M, Halo, S2T, M2L, L2L, and L2T stages, followed by the 2D FFT]

SLIDE 23

2xK40c FMM-FFT

SLIDE 24

2xP100 FMM-FFT

SLIDE 25

8xP100 FMM-FFT

SLIDE 26

FMM BREAKDOWN

  • T = ComplexDouble, A = 2xP100
  • B-GEMM and S2T dominate
  • Small N: latency-bound, so use one level
  • Large N: compute-bound

[Chart: runtime breakdown by FMM component]

SLIDE 27

EFFICIENCY

  • BatchedGEMM: >95% efficient
  • S2T/M2L custom kernels: 60% efficient
  • Entire FMM-FFT: >90% efficient
SLIDE 28

PARAMETER DEPENDENCE — ML

  • Trade number of levels for S2T compute
  • Flop count alone does not predict performance
  • Increase arithmetic intensity
  • Performance tuned at M_L = 64
  • T = Z, A = 2xP100, N = 2^27, P = 256, B = 3, Q = 16

[Plot: runtime vs. points per box per FMM]

SLIDE 29

PARAMETER DEPENDENCE — P

  • Flops and intensity approximately constant
  • Trade number of levels for number of FMMs
  • Large P is good:
    • Fills up the B-GEMM
    • More square 2D FFT
  • T = Z, A = 2xP100, N = 2^27, M_L = 64, B = 3, Q = 16

[Plot: runtime vs. number of FMMs]

SLIDE 30

PARAMETER DEPENDENCE — B

  • Not very significant
  • Scales to 128 GPUs without complications
  • T = Z, A = 2xP100, N = 2^27, P = 256, M_L = 64, Q = 16

[Plot: runtime vs. base level B]

SLIDE 31

PARAMETER DEPENDENCE — Q

  • Weak performance dependence
  • Used for accuracy tuning
  • T = Z, A = 2xP100, N = 2^27, P = 256, M_L = 64, B = 3

[Plot: runtime vs. quadrature order Q]

SLIDE 32

FUTURE

  • Integration into CUFFT
  • Application to 2D/3D FFTs?
  • Convolutions
  • NUFFT, Sparse FFT

  • Volta predictions and measurements
  • Mixed precision (e.g. FP16 far-field) to use Tensor Core?
  • Persistent Matrix Batched GEMM (cuBLAS optimization)
  • Staged Persistent Matrix Batched GEMM (cooperative groups, RNNs)
SLIDE 33

CONCLUSION

  • FMM-FFT trades two thirds of the communication in a 1D FFT for P FMMs
  • Viable on the highest compute-to-communication-ratio architecture available
  • Detailed implementation that relies heavily on existing primitives:
    • Primitives: >95% efficient
    • Two custom dense kernels: >60% efficient
    • Entire FMM-FFT: >90% efficient
  • Tunable accuracy-performance tradeoff
  • Compute model accurately predicts performance
SLIDE 34

May 8-11, 2017 | Silicon Valley

THANK YOU