SLIDE 1

FFT libraries on Cray XT: CRay Adaptive FFT (CRAFFT)

Jonathan Bentz Cray Inc.

SLIDE 2

Outline

  • Background
  • Current FFT libraries on XT
  • CRAFFT design
  • Example interfaces
  • Performance results
  • Future plans
  • Questions?

May 05 Slide 2 Cray Inc. Proprietary

SLIDE 3

Fourier Transform Background

Discrete Fourier Transform (DFT)

  • Transforms an array x(0:N-1) into X(0:N-1)
  • Calculation by the definition is an O(N^2) algorithm

Fast Fourier Transform (FFT)

  • Algorithm to calculate the DFT in O(N log N) operations
  • The algorithm used depends on N

Applications (among many)

  • Signal processing
  • Solving PDEs


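DFT definition:

$$X_k = \sum_{j=0}^{N-1} x_j \exp\!\left(-\frac{2\pi i\,jk}{N}\right), \qquad k = 0, \dots, N-1, \qquad i = \sqrt{-1}$$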
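The gap between the O(N^2) definition and an O(N log N) FFT can be illustrated with a minimal Python sketch. It assumes the forward transform uses a negative exponent (the FFTW convention) and handles only power-of-two sizes with a textbook radix-2 recursion. The function names are illustrative and are not part of any library discussed in these slides.

```python
import cmath


def dft_naive(x, sign=-1):
    """DFT straight from the definition: N output points, N terms each, so O(N^2)."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(sign * 2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def fft_radix2(x, sign=-1):
    """Recursive radix-2 FFT, O(N log N); this sketch supports power-of-two N only."""
    n = len(x)
    if n < 1 or n & (n - 1):
        raise ValueError("this sketch handles power-of-two sizes only")
    if n == 1:
        return list(x)
    even = fft_radix2(x[0::2], sign)  # transform of even-indexed samples
    odd = fft_radix2(x[1::2], sign)   # transform of odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


if __name__ == "__main__":
    data = [1, 2, 3, 4, 0, 0, 0, 0]
    slow = dft_naive(data)
    fast = fft_radix2(data)
    print("largest difference:", max(abs(a - b) for a, b in zip(slow, fast)))
```

Both routines compute the same sums; the FFT only reorganizes them, which is also why the best reorganization depends on how N factors.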

SLIDE 4


Current FFT libraries on XT

FFTW (MIT, Frigo & Johnson, fftw.org)

  • Serial performance is very competitive
  • SIMD code for x86
  • Sophisticated run-time tuning mechanisms
  • Extremely flexible interface
  • FFT for almost any data distribution you can imagine
  • Complicated and tedious interface
  • Substantive differences between versions 2 and 3
  • Interfaces are incompatible
  • Parallel transforms in version 2 only
  • Superior serial performance in version 3

ACML (AMD, amd.com)

  • Performance is not spectacular, especially on non-powers of 2

SLIDE 5

FFT libraries common practice

Execution of an FFT in application code generally has two steps:

1. PLANNING stage

  • Initialize the FFT library based on the FFT size
  • Some libraries pre-compute a table of trigonometric values
  • FFTW is able to try out various FFTs of that size and choose the fastest one
  • Often this can take orders of magnitude longer than the actual execution of the FFT
  • An FFTW_PATIENT plan for a size 512^3 FFT takes 2758 sec to plan and 9.7 sec to execute!

2. EXECUTION stage

  • Execute the FFT using the information from the planning stage
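The two-step pattern can be sketched in Python as follows. This is a simplified illustration, not how FFTW or any other library is implemented: the "plan" here only pre-computes a table of trigonometric values for one power-of-two size, and `make_plan` and `execute` are hypothetical names.

```python
import cmath


def make_plan(n, sign=-1):
    """PLANNING: pre-compute the twiddle factors exp(sign*2*pi*i*k/n) once for size n."""
    if n < 1 or n & (n - 1):
        raise ValueError("this sketch handles power-of-two sizes only")
    table = [cmath.exp(sign * 2j * cmath.pi * k / n) for k in range(n // 2)]
    return {"n": n, "twiddle": table}


def execute(plan, x):
    """EXECUTION: radix-2 FFT that only looks values up in the planned table."""
    n = plan["n"]
    if len(x) != n:
        raise ValueError("input length does not match the plan")
    w = plan["twiddle"]

    def rec(v, stride):
        m = len(v)
        if m == 1:
            return list(v)
        even = rec(v[0::2], stride * 2)
        odd = rec(v[1::2], stride * 2)
        out = [0j] * m
        for k in range(m // 2):
            # exp(sign*2*pi*i*k/m) equals table entry k*stride because m = n/stride
            t = w[k * stride] * odd[k]
            out[k] = even[k] + t
            out[k + m // 2] = even[k] - t
        return out

    return rec(x, 1)


if __name__ == "__main__":
    plan = make_plan(8)                       # pay the planning cost once
    for signal in ([1, 0, 0, 0, 0, 0, 0, 0], [1] * 8):
        print(execute(plan, signal))          # reuse the plan for every transform
```

The planning cost is paid once per size and amortized over every later execution, which is why an expensive plan only pays off when the same size is transformed many times.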

SLIDE 6

Major problem with FFT libs

Which library to choose?

  • We want the best possible FFT performance
  • To date, we have seen excellent performance from FFTW
  • FFTW also has a rich set of options for different data distributions
  • Do NOT want to change application code frequently

How to use the complicated interfaces???

  • FFTW can be really difficult to use
  • E.g., 2d FFT with LDA > size, 14 arguments!

call dfftw_plan_many_dft(plan,rank,n,howmany, &
                         input,inembed, &
                         istride,idist, &
                         output,onembed, &
                         ostride,odist, &
                         expon,FFTW_flags)
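To make "LDA > size" concrete: the n1 x n2 block being transformed sits inside a larger column-major buffer whose leading dimension exceeds n1, so consecutive columns are separated by padding that the transform must skip. The Python sketch below shows only that indexing, using a naive 2D DFT for brevity. The function name and argument order are hypothetical and do not mirror the FFTW call above.

```python
import cmath


def dft2d_strided(buf, n1, n2, ld, sign=-1):
    """Naive 2D DFT of an n1 x n2 block stored column-major in `buf`.

    Element (j1, j2) lives at buf[j1 + j2 * ld], where ld >= n1 is the
    leading dimension. Entries in the padding rows are never read, and the
    matching output positions are left at zero.
    """
    if ld < n1:
        raise ValueError("leading dimension must be at least n1")
    out = [0j] * (ld * n2)
    for k1 in range(n1):
        for k2 in range(n2):
            acc = 0j
            for j1 in range(n1):
                for j2 in range(n2):
                    phase = sign * 2j * cmath.pi * (j1 * k1 / n1 + j2 * k2 / n2)
                    acc += buf[j1 + j2 * ld] * cmath.exp(phase)
            out[k1 + k2 * ld] = acc
    return out


if __name__ == "__main__":
    # 2 x 2 block with leading dimension 3; the value 99 marks padding.
    buffer = [1, 2, 99, 3, 4, 99]
    print(dft2d_strided(buffer, n1=2, n2=2, ld=3))
```

A general-purpose interface has to carry this layout information for both input and output, which is where most of the 14 arguments come from.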

SLIDE 7

CRAFFT library solves this problem

CRAFFT is designed with simple-to-use interfaces

  • Planning and execution stage can be combined into one subroutine call
  • Underneath the interfaces, CRAFFT calls the appropriate FFT kernel

CRAFFT provides both offline and online tuning

  • Offline tuning: which FFT kernel to use, and pre-computed PLANs for common-sized FFTs
  • Online tuning is performed as necessary at runtime as well

At runtime, CRAFFT adaptively selects the best FFT kernel to use based on both offline and online testing (e.g. ACML, FFTW, Custom FFT)
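The general idea of combining offline and online tuning can be sketched in Python. This is an assumption-laden illustration, not CRAFFT's actual mechanism: the kernel set, the `OFFLINE_CHOICE` table, and all function names are hypothetical. An offline table answers for common sizes, and any other size is timed once at runtime and the winner is cached.

```python
import cmath
import time


def dft_naive(x):
    """O(N^2) kernel that works for any size."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def fft_radix2(x):
    """O(N log N) kernel that supports power-of-two sizes only."""
    n = len(x)
    if n < 1 or n & (n - 1):
        raise ValueError("radix-2 kernel needs a power-of-two size")
    if n == 1:
        return list(x)
    even, odd = fft_radix2(x[0::2]), fft_radix2(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k], out[k + n // 2] = even[k] + t, even[k] - t
    return out


KERNELS = {"naive": dft_naive, "radix2": fft_radix2}
OFFLINE_CHOICE = {8: "radix2", 16: "radix2"}  # hypothetical offline tuning results
_online_cache = {}


def select_kernel(n):
    """Prefer the offline table; otherwise time each kernel once and cache the winner."""
    if n in OFFLINE_CHOICE:
        return OFFLINE_CHOICE[n]
    if n not in _online_cache:
        sample = [complex(j, 0) for j in range(n)]
        timings = {}
        for name, kernel in KERNELS.items():
            try:
                start = time.perf_counter()
                kernel(sample)
                timings[name] = time.perf_counter() - start
            except ValueError:
                continue  # this kernel does not support size n
        _online_cache[n] = min(timings, key=timings.get)
    return _online_cache[n]


def adaptive_fft(x):
    """Single entry point: the caller never names a kernel."""
    return KERNELS[select_kernel(len(x))](x)
```

Calling `adaptive_fft(data)` is all the application does; which kernel ran, and whether it was chosen offline or online, stays hidden behind the interface.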

SLIDE 8

User Interface Choices

Cray-style interface (mostly for legacy compatibility)

  • ZZFFT(…): 1d complex-to-complex double precision FFT

Simple interface

  • CRAFFT_z2z1d(size,array,isign)
  • Just the basics: size and array locations
  • All internals, including possible temporary memory allocation and tuning, are taken care of
  • The easiest choice for users

Advanced interface

  • CRAFFT_z2z1d(size,array,isign,workspace,PLANNING)
  • In addition to size and array, the user also provides workspace and planning parameters
  • In 2D and 3D, the leading-dimension arguments can be used

SLIDE 9

Interfaces (cont.)

All subroutine names have the form crafft_α2βθD

  • α, β = S, D, C or Z as in netlib, e.g., D = double precision real, C = single precision complex
  • θ = 1, 2 or 3, i.e., the dimension of the transform
  • E.g., crafft_d2z1d is a double real to double complex transform in 1d

Interface makes use of F90 modules to overload the names

  • Users must put “use crafft” in their Fortran source code
  • 1D complex-to-complex examples:
      crafft_z2z1d(size,array,isign)           in-place
      crafft_z2z1d(size,input,output,isign)    out-of-place
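The naming scheme can be spelled out with a small Python decoder. It is purely illustrative: `decode_crafft_name` is a hypothetical helper, not part of CRAFFT, and the meanings of S and Z are filled in from the standard netlib letter convention rather than from the slide.

```python
import re

# Netlib-style type letters; S and Z are assumed from the usual BLAS/LAPACK convention.
TYPE_LETTERS = {
    "s": "single precision real",
    "d": "double precision real",
    "c": "single precision complex",
    "z": "double precision complex",
}


def decode_crafft_name(name):
    """Split a name of the form crafft_<in>2<out><dim>d into its three parts."""
    match = re.fullmatch(r"crafft_([sdcz])2([sdcz])([123])d", name.lower())
    if match is None:
        raise ValueError(f"not a crafft_X2YNd style name: {name!r}")
    src, dst, dim = match.groups()
    return TYPE_LETTERS[src], TYPE_LETTERS[dst], int(dim)


if __name__ == "__main__":
    for routine in ("crafft_d2z1d", "crafft_z2z1d", "crafft_z2z2d"):
        print(routine, "->", decode_crafft_name(routine))
```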

SLIDE 10

Simple 1d CRAFFT call resolves to…


crafft_z2z1d(n,input,isign)
  -> z2z1d_simple1_inplace(n,input,isign)
  -> z2z1d_simple_internal(n,input,input,isign,1,1)
  -> dfftw_plan_dft_1d(plan,n,input,output,isign,FFTW_FLAG)
  -> dfftw_execute(plan)

SLIDE 11

Advanced 2d CRAFFT call resolves to…


crafft_z2z2d(n1,n2,input,ld_in,output,ld_out,isign,work)
  -> z2z2d_adv1(n1,n2,input,ld_in,output,ld_out,isign,work)
  -> z2z2d_adv_internal(n1,n2,input,ld_in,output,ld_out,isign,1,1,work)
  -> dfftw_plan_many_dft(plan,rank,n,howmany,input,inembed,istride,idist,output,onembed,ostride,odist,isign,FFTW_FLAG)
  -> dfftw_execute(plan)

SLIDE 12

CRAFFT user code calling sequence


call crafft_init()

  • Initialize the library
  • Set up the offline wisdom

call crafft_z2z1d(n,input,-1)

  • Perform online tuning
  • Execute the forward FFT

Do work

call crafft_z2z1d(n,input,+1)

  • Execute the backward FFT
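The forward (-1) then backward (+1) pattern can be mimicked in Python with a naive DFT. This sketch assumes unnormalized transforms, as in FFTW, so a forward/backward round trip returns the input scaled by n. The slides do not say whether CRAFFT rescales, and `dft` and `round_trip` are illustrative names only.

```python
import cmath


def dft(x, isign):
    """Naive DFT; isign = -1 is the forward transform, +1 the backward one."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(isign * 2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def round_trip(signal):
    """Forward transform, optional work on the spectrum, backward transform."""
    n = len(signal)
    spectrum = dft(signal, -1)       # forward FFT
    # ... work on the spectrum would go here ...
    restored = dft(spectrum, +1)     # backward FFT, unnormalized
    return [value / n for value in restored]  # undo the factor of n


if __name__ == "__main__":
    print(round_trip([1, 2, 0, -1]))
```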

SLIDE 13

CRAFFT 1.0alpha (current status)

  • Largely FFTW centric
  • Includes FFTW offline wisdom to minimize expensive online planning
  • Allows simple interface into advanced FFTW functionality
  • Proposed release in summer 2008

PERFORMANCE???

SLIDE 14

Walltime vs. size, 1D C2C FFT planner

[Figure: time (s) vs. size on logarithmic axes, sizes 1 to 262144, times 0.00001 to 1 s. Series: C2C CRAFFT planner+exe, C2C FFTW planner. Settings: CRAFFT_PLANNER=0, FFTW_ESTIMATE.]

SLIDE 15

Walltime vs. size, 1D C2C FFT execute

[Figure: time (s) vs. size on logarithmic axes, sizes 1 to 262144, times 0.0000001 to 1 s. Series: C2C CRAFFT exe, C2C FFTW exe. Settings: CRAFFT_PLANNER=0, FFTW_ESTIMATE.]

SLIDE 16

Walltime vs. size, 1D C2C FFT planner

[Figure: time (s) vs. size on logarithmic axes, sizes 1 to 262144, times 0.000001 to 100 s. Series: C2C CRAFFT plan+exe, C2C FFTW plan. Settings: CRAFFT_PLANNER=2, FFTW_PATIENT.]

SLIDE 17

Walltime vs. size, 1D C2C FFT execute

[Figure: time (s) vs. size on logarithmic axes, sizes 1 to 262144, times 0.0000001 to 1 s. Series: C2C CRAFFT exe, C2C FFTW exe. Settings: CRAFFT_PLANNER=2, FFTW_PATIENT.]

SLIDE 18

Summary

CRAFFT provides a simple interface into FFT for XT

  • Avoid those nasty 14-argument FFTW calls!

CRAFFT overhead is minimal

CRAFFT performance is excellent when using common-sized FFTs

  • CRAFFT avoids the expensive planning stage

SLIDE 19

Future Work

Additional libraries “under-the-covers”

  • Complete libraries, e.g., SPIRAL (CMU, Franchetti et al., spiral.net)
  • Targeted tuning of kernels for specific sizes

Parallel FFT

  • Again, provide a simple, intuitive interface and handle the details transparently
  • Provide multiple data distributions

SLIDE 20

QUESTIONS???

Email: jnbntz@cray.com
