Carnegie Mellon
Operator Language: A Program Generation Framework for Fast Kernels
Franz Franchetti, Frédéric de Mesmay, Daniel McFarlin, Markus Püschel
Electrical and Computer Engineering, Carnegie Mellon University
Sponsors: DARPA DESA
The Problem: Example MMM
Similar plots can be shown for all numerical kernels in linear algebra, signal processing, coding, crypto, …
What’s going on? Hardware is becoming increasingly complex.
[Figure: Matrix-Matrix Multiplication (MMM) on 2xCore2Duo 3 GHz (double precision). x-axis: matrix size (1,000 to 9,000); y-axis: performance [Gflop/s] (5 to 50). Curves: triple loop and best code (K. Goto); the gap between them is 160x.]
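The "triple loop" baseline in the plot is the textbook algorithm. As a minimal sketch (the slides show no code; the function name `mmm_triple_loop` and the list-of-rows layout are assumptions made here for illustration), it can be written as:

```python
def mmm_triple_loop(a, b):
    """Multiply an m x k matrix by a k x n matrix, both given as lists of rows."""
    m, k, n = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions do not match")
    c = [[0.0] * n for _ in range(m)]
    for i in range(m):          # rows of A and C
        for j in range(n):      # columns of B and C
            for p in range(k):  # inner (reduction) dimension
                c[i][j] += a[i][p] * b[p][j]
    return c


product = mmm_triple_loop([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```

This sketch only shows the loop structure. It is not the code that was measured, and Python timings say nothing about the Gflop/s figures in the plot.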
Automatic Performance Tuning
Current vicious circle: Whenever a new platform comes out, the same functionality needs to be rewritten and reoptimized
Automatic Performance Tuning
- BLAS: ATLAS, PHiPAC
- Linear algebra: Sparsity/OSKI, Flame
- Sorting
- Fourier transform: FFTW
- Linear transforms (and beyond): Spiral
- …others
Proceedings of the IEEE special issue, Feb. 2005
How to build an extensible system? For more problem classes? For yet un-invented platforms?
What is Spiral?
Traditionally:
- High performance library optimized for given platform
Spiral approach:
- Spiral
- High performance library optimized for given platform
Comparable performance
Idea: Common Abstraction and Rewriting
Architecture space: architectural parameters (ν, p, μ): vector length, #processors, …
Algorithm space: kernel: problem size, algorithm choice
Model: common abstraction = spaces of matching formulas = domain-specific language
[Diagram: both spaces are connected to the model by arrows labeled "abstraction"; further labels: "rewriting", "defines", "pick", "search", "optimization".]
Some Kernels as OL Formulas
- Linear Transforms
- Matrix-Matrix Multiplication
- Viterbi Decoding (diagram labels: convolutional encoder, Viterbi decoder)
- Synthetic Aperture Radar (SAR) (diagram labels: interpolation, 2D iFFT, matched filtering, preprocessing)
How Spiral Works
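[Diagram: the Spiral tool flow]
Problem specification (transform)
- Algorithm Generation, Algorithm Optimization: produces the algorithm
- Implementation, Code Optimization: produces C code
- Compilation, Compiler Optimizations: produces the fast executable
- Search: receives the measured performance and controls the earlier stages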
Spiral: Complete automation of the implementation and optimization task
Basic ideas:
- Declarative representation of algorithms
- Rewriting systems to generate and optimize algorithms at a high level of abstraction
Markus Püschel, José M. F. Moura, Jeremy Johnson, David Padua, Manuela Veloso, Bryan Singer, Jianxin Xiong, Franz Franchetti, Aca Gacic, Yevgen Voronenko, Kang Chen, Robert W. Johnson, and Nick Rizzolo: SPIRAL: Code Generation for DSP Transforms. Special issue, Proceedings of the IEEE 93(2), 2005
Organization
- Operator language and algorithms
- Optimizing algorithms for platforms
- Performance results
- Summary
Operators
Definition
- Operator: Multiple complex vectors → multiple complex vectors
- Higher-dimensional data is linearized
- Operators are potentially nonlinear
Example: Matrix-matrix-multiplication (MMM)
[Figure: matrices A, B, and C]
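To make the definition concrete, MMM can be viewed as an operator that takes two linearized input vectors and returns one linearized output vector. The sketch below is an illustration under stated assumptions: row-major linearization and the name `mmm_operator` are choices made here, not taken from the slides.

```python
def mmm_operator(x, y, m, k, n):
    """MMM as an operator: two flat vectors in, one flat vector out.

    x holds the m x k matrix A and y holds the k x n matrix B, both
    linearized in row-major order (an assumption; the slide does not
    fix the layout). The result is the m x n product, also row-major.
    """
    if len(x) != m * k or len(y) != k * n:
        raise ValueError("vector lengths do not match the given sizes")
    z = [0] * (m * n)
    for i in range(m):
        for j in range(n):
            z[i * n + j] = sum(x[i * k + p] * y[p * n + j] for p in range(k))
    return z


# Entries may be complex, as in the definition above.
z = mmm_operator([1, 2, 3, 4], [5, 6, 7, 8], 2, 2, 2)
```

Note that the operator is linear in each input separately but not in the pair of inputs, which is one reason the operator notion has to allow more than plain matrices.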
Operator Language
OL Tensor Product: Repetitive Structure
- Kronecker product (structured matrices)
- OL Tensor product (structured operators)
- Definition (extension to non-linear)
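The repetitive structure can be illustrated with the standard Kronecker product, where A ⊗ B is the block matrix with blocks a_ij·B. For I_n ⊗ A, applying the matrix to a vector is the same as applying A to n contiguous chunks, and that loop also makes sense when A is replaced by a non-linear operator. The following is a minimal sketch with hypothetical helper names (`tensor_identity_left`, `kron`, `matvec`), not the notation of the slides:

```python
def tensor_identity_left(n, op, in_size):
    """Return the operator I_n (x) op: apply op to n contiguous chunks."""
    def apply(x):
        if len(x) != n * in_size:
            raise ValueError("input length must be n * in_size")
        y = []
        for i in range(n):
            y.extend(op(x[i * in_size:(i + 1) * in_size]))
        return y
    return apply


def kron(a, b):
    """Explicit Kronecker product of two matrices given as lists of rows."""
    return [[ea * eb for ea in ra for eb in rb] for ra in a for rb in b]


def matvec(mat, x):
    """Matrix-vector product for a matrix given as a list of rows."""
    return [sum(e * v for e, v in zip(row, x)) for row in mat]


A = [[1, 1], [1, -1]]
I2 = [[1, 0], [0, 1]]
x = [1, 2, 3, 4]

# Linear case: the loop agrees with the explicit structured matrix.
via_matrix = matvec(kron(I2, A), x)
via_loop = tensor_identity_left(2, lambda v: matvec(A, v), 2)(x)

# Non-linear case: the same loop structure, but no matrix exists.
chunk_max = tensor_identity_left(2, lambda v: [max(v)], 2)(x)
```

The linear case never needs the large matrix in memory; the loop form is what a generated program would execute.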
Translating OL Formulas Into Programs
Example: Matrix Multiplication (MMM)
Breakdown rules: capture various forms of blocking
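One form of blocking that such a breakdown rule can express is splitting all three dimensions into tiles and multiplying tile by tile. The sketch below shows that loop structure in plain Python. It is an illustration, not the OL rule itself; the function name `mmm_blocked` and the single block size `bs` for all dimensions are assumptions.

```python
def mmm_blocked(a, b, bs):
    """Blocked matrix multiplication on lists of rows, using block size bs."""
    if bs <= 0:
        raise ValueError("block size must be positive")
    m, k, n = len(a), len(b), len(b[0])
    c = [[0] * n for _ in range(m)]
    for i0 in range(0, m, bs):
        for j0 in range(0, n, bs):
            for p0 in range(0, k, bs):
                # Multiply one block of A by one block of B and
                # accumulate into the matching block of C.
                for i in range(i0, min(i0 + bs, m)):
                    for j in range(j0, min(j0 + bs, n)):
                        acc = c[i][j]
                        for p in range(p0, min(p0 + bs, k)):
                            acc += a[i][p] * b[p][j]
                        c[i][j] = acc
    return c


a = [[1, 2, 3], [4, 5, 6]]
b = [[1, 0], [0, 1], [1, 1]]
results = [mmm_blocked(a, b, bs) for bs in (1, 2, 3)]
```

Every block size gives the same product; only the order of the memory accesses changes, which is what makes the choice of blocking a tuning parameter.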
Example: SAR Computation as OL Rules
- Grid Compute
- Range Interpolation
- Azimuth Interpolation
- 2D FFT
Modeling Multicore: Base Cases
- Tensor product: embarrassingly parallel operator
[Figure: four copies of A map the input x to the output y, one copy each on Processor 0, Processor 1, Processor 2, and Processor 3]
- Permutation: problematic; may produce false sharing
[Figure: a permutation maps x to y]
- Hardware abstraction: shared cache with cache lines
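The tensor-product base case can be sketched as follows: with p processors, I_p ⊗ A splits the input into p contiguous chunks, and each worker applies A to its own chunk without communicating with the others. This is an illustration only. The name `parallel_tensor` is hypothetical, and CPython threads show the structure of the computation rather than a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_tensor(p, op, x):
    """Apply op to p equal contiguous chunks of x, one chunk per worker."""
    if p <= 0 or len(x) % p != 0:
        raise ValueError("p must be positive and divide len(x)")
    size = len(x) // p
    chunks = [x[i * size:(i + 1) * size] for i in range(p)]
    with ThreadPoolExecutor(max_workers=p) as pool:
        # map preserves chunk order, so the results concatenate back into y.
        results = list(pool.map(op, chunks))
    return [v for part in results for v in part]


# Each worker computes a 2-point butterfly on its own chunk.
y = parallel_tensor(4, lambda c: [c[0] + c[1], c[0] - c[1]], list(range(8)))
```

Each worker reads and writes only its own contiguous block, which is why this case is called embarrassingly parallel. A permutation, by contrast, can make neighboring output elements belong to different workers.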
Parallelization: OL Rewriting Rules
- Tags encode hardware constraints
- Rules are algorithm-independent
- Rules encode program transformations
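A toy version of such a rule, written as a pattern match on formula trees, might look like the sketch below. The tuple encoding, the tag name `smp`, and the function `rewrite_smp` are hypothetical illustrations modeled loosely on the idea of the slide, not Spiral's actual rule set. The rule says: a formula I_n ⊗ A tagged for p processors becomes a parallel loop over p copies of I_{n/p} ⊗ A, provided p divides n.

```python
def rewrite_smp(formula):
    """Apply one toy parallelization rule; return the input if it does not match.

    Formulas are nested tuples (a hypothetical encoding):
      ("I_tensor", n, A)      means I_n (x) A
      ("smp", p, f)           means f tagged for p processors
      ("par_I_tensor", p, f)  means I_p (x) f run as a fully parallel loop
    """
    if formula[0] == "smp":
        _, p, inner = formula
        # The hardware constraint (p processors) decides whether the rule fires.
        if inner[0] == "I_tensor" and inner[1] % p == 0:
            _, n, a = inner
            return ("par_I_tensor", p, ("I_tensor", n // p, a))
    return formula


tagged = ("smp", 4, ("I_tensor", 8, "A"))
rewritten = rewrite_smp(tagged)
```

The rule never inspects what A computes, which reflects the point that rules are algorithm-independent: the same rule applies to a transform, an MMM kernel, or any other operator in that position.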
The Joint Rule Set: MMM
- Hardware constraints: base cases
- Program transformations: manipulation rules
- Algorithm rules: breakdown rules
Combined rule set spans search space for empirical optimization
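Empirical optimization over such a search space amounts to generating candidates, running each one, and keeping the fastest. A minimal sketch follows, assuming the candidates are simple values such as block sizes and that `run` executes the corresponding variant; both names, and the stand-in workload, are hypothetical.

```python
import time


def empirical_search(candidates, run):
    """Time run(candidate) for each candidate and return the fastest one."""
    best, best_time = None, float("inf")
    for cand in candidates:
        start = time.perf_counter()
        run(cand)
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best, best_time = cand, elapsed
    return best, best_time


def workload(block_size):
    """Stand-in for a generated kernel variant (illustrative only)."""
    total = 0
    for start in range(0, 10_000, block_size):
        total += sum(range(start, min(start + block_size, 10_000)))
    return total


block_sizes = [8, 64, 512]
best_block, best_seconds = empirical_search(block_sizes, workload)
```

A real search would repeat each measurement and could use smarter strategies than exhaustive enumeration; a single timing per candidate is used here only to keep the sketch short.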
Parallelization Through Rewriting: MMM
- Load-balanced
- No false sharing
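Both properties can be stated as checks on how the output is divided among processors. The sketch below assumes n output elements, p processors, cache lines of μ elements, and an output array that starts on a cache-line boundary; the function names are hypothetical. Equal range sizes give load balance, and range boundaries on multiples of μ mean no cache line is written by two processors.

```python
def partition_output(n, p, mu):
    """Split n output elements into p equal contiguous ranges,
    each a whole number of cache lines of mu elements."""
    if p <= 0 or mu <= 0 or n % (p * mu) != 0:
        raise ValueError("n must be a multiple of p * mu")
    size = n // p
    return [(t * size, (t + 1) * size) for t in range(p)]


def shares_cache_line(ranges, mu):
    """Return True if two different ranges touch the same cache line."""
    owner = {}
    for t, (lo, hi) in enumerate(ranges):
        for line in range(lo // mu, (hi - 1) // mu + 1):
            # setdefault records the first owner; a different owner means sharing.
            if owner.setdefault(line, t) != t:
                return True
    return False


good = partition_output(32, 4, 4)
bad = [(0, 6), (6, 12)]  # boundary at 6 falls inside a 4-element cache line
```

Here `good` is balanced and free of shared lines, while `bad` is balanced but lets two processors write to the cache line holding elements 4 to 7.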
Same Approach for Different Paradigms
- Vectorization
- Threading
- GPUs
- Verilog for FPGAs
Matrix Multiplication Library
[Figure: two plots of performance [Gflop/s] against input size (2 to 512) for Rank-k Update, k=4, on a Dual Intel Xeon 5160, 3 GHz. Left: double precision (axis 1 to 9 Gflop/s). Right: single precision (axis 2 to 18 Gflop/s). Curves in both plots: MKL 10.0, GotoBLAS 1.26, Spiral-generated library.]
Result: Spiral-Generated PFA SAR on Core2 Quad
[Figure: SAR Image Formation on Intel platforms. y-axis: performance [Gflop/s] (10 to 50). Platforms, ordered toward newer platforms: 3.0 GHz Core 2 (65nm), 3.0 GHz Core 2 (45nm), 2.66 GHz Core i7, 3.0 GHz Core i7 (Virtual). Series: 16 Megapixels and 100 Megapixels. Data labels: 44 and 43.]
Algorithm by J. Rudin (best paper award, HPEC 2007): 30 Gflop/s on Cell
Each implementation: vectorized, threaded, cache tuned, ~13 MB of code
Summary
Platforms are powerful yet complicated
- optimization will stay a hard problem
OL: unified mathematical framework
- captures platforms and algorithms
Spiral: program generation and autotuning
- can provide full automation
Performance of supported kernels
- is competitive with expert tuning
[Figure: architecture, labeled A(µ), and kernel, labeled M. Image: Intel]