Operator Language: A Program Generation Framework for Fast Kernels


May 07, 2023


SLIDE 1

Carnegie Mellon

Operator Language: A Program Generation Framework for Fast Kernels

Sponsors: DARPA DESA program, NSF-NGS/ITR, NSF-ACR, and Intel

Franz Franchetti, Frédéric de Mesmay, Daniel McFarlin, Markus Püschel
Electrical and Computer Engineering
Carnegie Mellon University

SLIDE 2


The Problem: Example MMM



Similar plots can be shown for all numerical kernels in linear algebra, signal processing, coding, crypto, …



What’s going on? Hardware is becoming increasingly complex.

[Figure: Matrix-Matrix Multiplication (MMM) on 2xCore2Duo 3 GHz (double precision). x-axis: matrix size (1,000 to 9,000); y-axis: performance [Gflop/s] (5 to 50). Curves: triple loop; best code (K. Goto). Annotated gap between the two curves: 160x.]
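For reference, the "triple loop" baseline in the plot is the textbook algorithm. The following is a minimal sketch of it in Python; the function name and the list-of-lists matrix layout are assumptions made for illustration, and the measured code on the slide was certainly not Python.

```python
def mmm_triple_loop(a, b):
    """Multiply an m x k matrix by a k x n matrix with the textbook triple loop.

    Matrices are lists of rows. This is the unoptimized baseline: no blocking,
    no vectorization, no threading.
    """
    m, k, n = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions do not match")
    c = [[0.0] * n for _ in range(m)]
    for i in range(m):            # rows of A and C
        for j in range(n):        # columns of B and C
            acc = 0.0
            for p in range(k):    # inner (reduction) dimension
                acc += a[i][p] * b[p][j]
            c[i][j] = acc
    return c
```

The point of the slide is that the same arithmetic, reorganized for the memory hierarchy, vector units, and multiple cores, runs far faster than this form.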

SLIDE 3

Automatic Performance Tuning

• Current vicious circle: Whenever a new platform comes out, the same functionality needs to be rewritten and reoptimized

• Automatic Performance Tuning

BLAS: ATLAS, PHiPAC
Linear algebra: Sparsity/OSKI, Flame
Sorting
Fourier transform: FFTW
Linear transforms (and beyond): Spiral
…others

Proceedings of the IEEE special issue, Feb. 2005

How to build an extensible system? For more problem classes? For yet un-invented platforms?

SLIDE 4


What is Spiral?

Traditionally: high performance library optimized for given platform

Spiral approach: Spiral → high performance library optimized for given platform

Comparable performance

SLIDE 5


Idea: Common Abstraction and Rewriting

[Diagram: architecture space and algorithm space, each abstracted into a common model]

Architectural parameter (ν, p, μ): vector length, #processors, …
Kernel: problem size, algorithm choice
Model: common abstraction = spaces of matching formulas = domain-specific language

Diagram labels: abstraction (from the architecture space and from the algorithm space), rewriting, defines, pick, search, optimization

SLIDE 6


Some Kernels as OL Formulas

[Figure: four example kernels]
Linear Transforms
Matrix-Matrix Multiplication
Viterbi Decoding: convolutional encoder, Viterbi decoder (illustrated with bit streams)
Synthetic Aperture Radar (SAR): preprocessing, matched filtering, interpolation, 2D iFFT

SLIDE 7


How Spiral Works

[Diagram: the Spiral pipeline]
Problem specification (transform)
→ Algorithm Generation, Algorithm Optimization → algorithm
→ Implementation, Code Optimization → C code
→ Compilation, Compiler Optimizations → fast executable
Search: receives the measured performance and controls the stages above

Spiral: Complete automation of the implementation and optimization task

Basic ideas:
Declarative representation of algorithms
Rewriting systems to generate and optimize algorithms at a high level of abstraction

Markus Püschel, José M. F. Moura, Jeremy Johnson, David Padua, Manuela Veloso, Bryan Singer, Jianxin Xiong, Franz Franchetti, Aca Gacic, Yevgen Voronenko, Kang Chen, Robert W. Johnson, and Nick Rizzolo: SPIRAL: Code Generation for DSP Transforms. Special issue, Proceedings of the IEEE 93(2), 2005

SLIDE 8

Organization

• Operator language and algorithms
• Optimizing algorithms for platforms
• Performance results
• Summary

SLIDE 9

Organization

• Operator language and algorithms
• Optimizing algorithms for platforms
• Performance results
• Summary

SLIDE 10


Operators

Definition

Operator: Multiple complex vectors → multiple complex vectors
Higher-dimensional data is linearized
Operators are potentially nonlinear

Example: Matrix-matrix multiplication (MMM)

[Figure: matrices A, B, C]
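The operator view of MMM can be sketched as a function that takes two linearized input vectors and returns one linearized output vector. This is only an illustration: the name `mmm_operator`, the row-major linearization, and the parameter names m, k, n are assumptions, not notation taken from the slide.

```python
def mmm_operator(m, k, n):
    """Return MMM as an operator on linearized (row-major) vectors.

    The operator maps a vector of length m*k (matrix A) and a vector of
    length k*n (matrix B) to a vector of length m*n (matrix C = A B).
    """
    def apply(a_flat, b_flat):
        if len(a_flat) != m * k or len(b_flat) != k * n:
            raise ValueError("input vectors have the wrong length")
        c_flat = [0] * (m * n)
        for i in range(m):
            for j in range(n):
                # Element (i, j) of C, addressed in the linearized layout.
                c_flat[i * n + j] = sum(
                    a_flat[i * k + p] * b_flat[p * n + j] for p in range(k)
                )
        return c_flat
    return apply
```

For instance, `mmm_operator(2, 2, 2)([1, 2, 3, 4], [5, 6, 7, 8])` multiplies two 2 x 2 matrices stored as flat lists. The map is linear in each input separately but not in the pair (A, B), which is one reason the operator notion has to allow nonlinear maps.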

SLIDE 11


Operator Language

SLIDE 12


OL Tensor Product: Repetitive Structure

Kronecker product (structured matrices)
OL Tensor product (structured operators)
Definition (extension to non-linear)
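The slide's own definition is not preserved in this transcript. As a rough sketch of the repetitive structure it refers to, the two standard tensor products with an identity can be written as loops: I_n ⊗ A applies A to n contiguous chunks, and A ⊗ I_n applies A to n interleaved subvectors taken at stride n. Here `op` is any function from a list to a list (it need not be linear); the function names and the `in_size` parameter are assumptions for illustration.

```python
def tensor_identity_left(n, op, in_size):
    """I_n (x) op: apply op to n contiguous chunks of length in_size."""
    def apply(x):
        if len(x) != n * in_size:
            raise ValueError("input vector has the wrong length")
        y = []
        for i in range(n):
            y.extend(op(x[i * in_size:(i + 1) * in_size]))
        return y
    return apply


def tensor_identity_right(op, n, in_size):
    """op (x) I_n: apply op to n interleaved subvectors taken at stride n."""
    def apply(x):
        if n < 1 or len(x) != n * in_size:
            raise ValueError("input vector has the wrong length")
        parts = [op(x[j::n]) for j in range(n)]
        y = [None] * (len(parts[0]) * n)
        for j, part in enumerate(parts):
            y[j::n] = part  # scatter the results back at stride n
        return y
    return apply


def butterfly(v):
    """A 2-point butterfly, used here only as a small example operator."""
    return [v[0] + v[1], v[0] - v[1]]
```

With the 2-point butterfly as `op`, the first form works on neighbouring pairs and the second on pairs that are n elements apart, which is the access-pattern difference that later matters for vectorization and parallelization.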

SLIDE 13


Translating OL Formulas Into Programs

SLIDE 14


Example: Matrix Multiplication (MMM)

Breakdown rules: capture various forms of blocking
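The breakdown rules themselves are not preserved in this transcript. As an illustration of what a blocking decomposition does at the program level, here is a minimal sketch of a blocked MMM with a single, uniform block size; the function name and the `block` parameter are assumptions, and a real rule set would allow different block sizes per dimension and recursive application.

```python
def mmm_blocked(a, b, block):
    """Compute C = A B by iterating over block x block tiles.

    Each tile product is a smaller MMM, which is the structure a blocking
    breakdown rule expresses. Edge tiles may be smaller than `block`.
    """
    if block < 1:
        raise ValueError("block size must be positive")
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, block):
        for j0 in range(0, n, block):
            for p0 in range(0, k, block):
                # Small MMM on one tile, accumulated into C.
                for i in range(i0, min(i0 + block, m)):
                    for j in range(j0, min(j0 + block, n)):
                        acc = c[i][j]
                        for p in range(p0, min(p0 + block, k)):
                            acc += a[i][p] * b[p][j]
                        c[i][j] = acc
    return c
```

Every block size gives the same result; the choice only changes the order of memory accesses, which is why it can be left to a search.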

SLIDE 15


Example: SAR Computation as OL Rules

Grid Compute
Range Interpolation
Azimuth Interpolation
2D FFT

SLIDE 16

Organization

• Operator language and algorithms
• Optimizing algorithms for platforms
• Performance results
• Summary

SLIDE 17


Modeling Multicore: Base Cases

Tensor product: embarrassingly parallel operator

[Figure: four copies of A map x to y, one on each of Processor 0, Processor 1, Processor 2, Processor 3]

Permutation: problematic; may produce false sharing

[Figure: permutation mapping x to y]

Hardware abstraction: shared cache with cache lines
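The two base cases can be made concrete with a small counting model. The sketch below assumes a cache line holds `mu` consecutive output elements and that `owner[i]` names the processor that writes element i; a line written by more than one processor is counted as falsely shared. The function names and this simplified model are assumptions for illustration, not the slide's formalism.

```python
def falsely_shared_lines(owner, mu):
    """Count cache lines (mu consecutive elements) written by more than one processor."""
    shared = 0
    for start in range(0, len(owner), mu):
        if len(set(owner[start:start + mu])) > 1:
            shared += 1
    return shared


def block_owner(n, p):
    """Tensor-product case: processor q writes one contiguous chunk of n/p elements."""
    if n % p != 0:
        raise ValueError("p must divide n")
    chunk = n // p
    return [i // chunk for i in range(n)]


def interleaved_owner(n, p):
    """Permutation-like case: consecutive elements are written by different processors."""
    return [i % p for i in range(n)]
```

With 16 outputs, 4 processors, and lines of 4 elements, the contiguous assignment shares no line while the interleaved one shares every line. If the line length grows to 8, even the contiguous assignment shares lines, so in this model the per-processor chunk must be a multiple of the line length.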

SLIDE 18


Parallelization: OL Rewriting Rules

Tags encode hardware constraints
Rules are algorithm-independent
Rules encode program transformations
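The rules themselves are not preserved in this transcript. The toy rewriter below shows the general idea under stated assumptions: a tag `Smp(p, ...)` marks a formula that still has to be mapped to p processors, and a single rule resolves the tag on a tensor product I_n ⊗ A by splitting it into p parallel pieces when p divides n. All class names and the rule's exact form are hypothetical; it is loosely modeled on the kind of rule the slide describes and is not Spiral's actual rule set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Kernel:
    """An opaque operator; the rule never looks inside it."""
    name: str


@dataclass(frozen=True)
class Tensor:
    """I_n (x) child: n independent applications of child."""
    n: int
    child: object


@dataclass(frozen=True)
class ParTensor:
    """Parallel tensor product: one application of child per processor."""
    p: int
    child: object


@dataclass(frozen=True)
class Smp:
    """Tag: child still has to be mapped to p processors."""
    p: int
    child: object


def rewrite(term):
    """Apply the toy parallelization rule once, top-down."""
    if isinstance(term, Smp):
        inner = term.child
        if isinstance(inner, Tensor) and inner.n % term.p == 0:
            rest = inner.n // term.p
            body = inner.child if rest == 1 else Tensor(rest, inner.child)
            return ParTensor(term.p, body)   # tag resolved
        return Smp(term.p, rewrite(inner))   # tag stays; try below it
    if isinstance(term, Tensor):
        return Tensor(term.n, rewrite(term.child))
    if isinstance(term, ParTensor):
        return ParTensor(term.p, rewrite(term.child))
    return term
```

The rule never inspects the kernel, which is what makes it algorithm-independent: the same rule applies whether the kernel is an MMM tile or a small transform. A tag that cannot be resolved simply remains in the formula, signalling that this candidate does not fit the hardware constraint.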

SLIDE 19


The Joint Rule Set: MMM

Hardware constraints: base cases
Program transformations: manipulation rules
Algorithm rules: breakdown rules

Combined rule set spans search space for empirical optimization
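Empirical optimization over such a search space can be sketched as "run every candidate, keep the fastest". In the sketch below each candidate is just a named zero-argument callable standing in for one generated implementation; the function name, the dictionary interface, and the exhaustive strategy are assumptions, and the slides do not say which search strategy Spiral actually uses here.

```python
import time


def empirical_search(candidates, repeats=3):
    """Time each candidate callable and return (name, seconds) of the fastest.

    `candidates` maps a name to a zero-argument callable. Each candidate is
    run `repeats` times and its best time is used, to reduce timing noise.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    best_name, best_time = None, float("inf")
    for name, run in candidates.items():
        timings = []
        for _ in range(repeats):
            start = time.perf_counter()
            run()
            timings.append(time.perf_counter() - start)
        elapsed = min(timings)
        if elapsed < best_time:
            best_name, best_time = name, elapsed
    return best_name, best_time


if __name__ == "__main__":
    data = list(range(2000))
    winner, seconds = empirical_search({
        "generator": lambda: sum(x * x for x in data),
        "list": lambda: sum([x * x for x in data]),
    })
    print(winner, seconds)
```

Which candidate wins depends on the machine, which is the point: the search is rerun per platform instead of hand-tuning the code again.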

SLIDE 20


Parallelization Through Rewriting: MMM

Load-balanced
No false sharing

SLIDE 21


Same Approach for Different Paradigms

Vectorization
Threading
GPUs
Verilog for FPGAs

SLIDE 22

Organization

• Operator language and algorithms
• Optimizing algorithms for platforms
• Performance results
• Summary

SLIDE 23


Matrix Multiplication Library

[Figure, first panel: Rank-k Update, double precision, k=4, on Dual Intel Xeon 5160, 3 GHz. x-axis: input size (2 to 512); y-axis: performance [Gflop/s] (1 to 9). Curves: MKL 10.0, GotoBLAS 1.26, Spiral-generated library.]

[Figure, second panel: Rank-k Update, single precision, k=4, on Dual Intel Xeon 5160, 3 GHz. x-axis: input size (2 to 512); y-axis: performance [Gflop/s] (2 to 18). Curves: MKL 10.0, GotoBLAS 1.26, Spiral-generated library.]

SLIDE 24


Result: Spiral-Generated PFA SAR on Core2 Quad

[Figure: SAR Image Formation on Intel platforms. y-axis: performance [Gflop/s] (10 to 50). Platforms, ordered toward newer platforms: 3.0 GHz Core 2 (65nm), 3.0 GHz Core 2 (45nm), 2.66 GHz Core i7, 3.0 GHz Core i7 (Virtual). Series: 16 Megapixels, 100 Megapixels. Bar labels shown: 44 and 43.]



Algorithm by J. Rudin (best paper award, HPEC 2007): 30 Gflop/s on Cell



Each implementation: vectorized, threaded, cache tuned, ~13 MB of code

SLIDE 25

Organization

• Operator language and algorithms
• Optimizing algorithms for platforms
• Performance results
• Summary

SLIDE 26

Summary

• Platforms are powerful yet complicated: optimization will stay a hard problem
• OL: unified mathematical framework, captures platforms and algorithms
• Spiral: program generation and autotuning can provide full automation
• Performance of supported kernels is competitive with expert tuning

[Diagram: architecture (A) and kernel (M), each with its parameter]

Image: Intel