Operator Language: A Program Generation Framework for Fast Kernels




Slide 1

Carnegie Mellon

Operator Language: A Program Generation Framework for Fast Kernels

Sponsors: DARPA DESA program, NSF-NGS/ITR, NSF-ACR, and Intel

Franz Franchetti, Frédéric de Mesmay, Daniel McFarlin, Markus Püschel Electrical and Computer Engineering Carnegie Mellon University

Slide 2

The Problem: Example MMM

Similar plots can be shown for all numerical kernels in linear algebra, signal processing, coding, crypto, …

What’s going on? Hardware is becoming increasingly complex.

[Plot: Matrix-Matrix Multiplication (MMM) on a 2x Core2 Duo at 3 GHz, double precision. x-axis: matrix size (1,000 to 9,000); y-axis: performance in Gflop/s. The best code (by K. Goto) outperforms a naive triple loop by about 160x.]

Slide 3

Automatic Performance Tuning

  ▪ Current vicious circle: whenever a new platform comes out, the same functionality needs to be rewritten and reoptimized

  ▪ Automatic performance tuning efforts:

  • BLAS: ATLAS, PHiPAC
  • Linear algebra: Sparsity/OSKI, Flame
  • Sorting
  • Fourier transform: FFTW
  • Linear transforms (and beyond): Spiral
  • …others

Proceedings of the IEEE special issue, Feb. 2005

How to build an extensible system? For more problem classes? For yet un-invented platforms?

Slide 4

What is Spiral?

Traditionally vs. the Spiral approach:

  ▪ Traditionally: a high-performance library, optimized for the given platform, must be implemented and tuned anew for each platform
  ▪ Spiral: a high-performance library, optimized for the given platform, is generated automatically
  ▪ Comparable performance

Slide 5

Idea: Common Abstraction and Rewriting

  ▪ Architecture space: architectural parameters (ν, μ, p: vector length, number of processors, …) define the common abstraction via rewriting
  ▪ Algorithm space: kernel parameters (problem size, algorithm choice) are picked by search
  ▪ Model: the common abstraction = spaces of matching formulas = a domain-specific language, in which the optimization takes place
Slide 6

Some Kernels as OL Formulas

  ▪ Linear transforms, matrix-matrix multiplication, Viterbi decoding, and synthetic aperture radar (SAR) image formation are all expressed as OL formulas built from tensor products.
  ▪ SAR pipeline: preprocessing → matched filtering → interpolation → 2D iFFT.
  ▪ Viterbi decoding: a convolutional encoder maps the bit stream 010001 to the coded stream 11 10 00 01 10 01 11 00; the Viterbi decoder recovers 010001 even from the corrupted stream 11 10 01 01 10 10 11 00.

Slide 7

How Spiral Works

[Diagram: problem specification (transform) → algorithm generation → algorithm optimization → algorithm → implementation → code optimization → C code → compilation → compiler optimizations → fast executable. Measured performance feeds a search that controls the algorithm and implementation choices.]

Spiral: complete automation of the implementation and optimization task.

Basic ideas:

  • Declarative representation of algorithms
  • Rewriting systems to generate and optimize algorithms at a high level of abstraction

Markus Püschel, José M. F. Moura, Jeremy Johnson, David Padua, Manuela Veloso, Bryan Singer, Jianxin Xiong, Franz Franchetti, Aca Gacic, Yevgen Voronenko, Kang Chen, Robert W. Johnson, and Nick Rizzolo: SPIRAL: Code Generation for DSP Transforms. Special issue, Proceedings of the IEEE 93(2), 2005
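The search/feedback loop in the picture above can be sketched in much-simplified form. This is illustrative only, not Spiral's implementation; the names `autotune` and `time_once` are made up for this sketch:

```python
import time

# Much-simplified sketch of a search-driven feedback loop (hypothetical
# names, not Spiral's code): generate candidate implementations,
# measure each on real inputs, and keep the fastest one.
def time_once(fn, x):
    start = time.perf_counter()
    fn(x)
    return time.perf_counter() - start

def autotune(candidates, make_input, trials=3):
    """candidates: dict name -> callable. Returns the fastest candidate's name."""
    best_name, best_time = None, float("inf")
    for name, fn in candidates.items():
        # take the best of a few trials to reduce timing noise
        t = min(time_once(fn, make_input()) for _ in range(trials))
        if t < best_time:
            best_name, best_time = name, t
    return best_name
```

In Spiral the "candidates" are generated by applying different breakdown rules, and the chosen implementation is emitted as C code rather than interpreted.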

Slide 8

Organization

  ▪ Operator language and algorithms
  ▪ Optimizing algorithms for platforms
  ▪ Performance results
  ▪ Summary

Slide 9

Organization

  ▪ Operator language and algorithms
  ▪ Optimizing algorithms for platforms
  ▪ Performance results
  ▪ Summary

Slide 10

Operators

Definition

  • Operator: maps multiple complex vectors → multiple complex vectors
  • Higher-dimensional data is linearized
  • Operators are potentially nonlinear

Example: matrix-matrix multiplication (MMM), viewed as the operator that takes the entries of A and B (as linearized vectors) and produces the entries of C = A·B.
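As a concrete sketch of this operator view (illustrative only, not Spiral's API; `mmm_operator` and its parameterization are assumptions for this example), MMM on linearized inputs with a naive triple loop as the base implementation:

```python
import numpy as np

# Illustrative sketch (not Spiral's API): MMM_{m,k,n} as an OL-style
# operator, a map from two linearized input vectors to one output vector.
def mmm_operator(m, k, n):
    """Return the operator (vec(A), vec(B)) -> vec(A * B)."""
    def apply(a_vec, b_vec):
        A = np.asarray(a_vec, dtype=float).reshape(m, k)  # data is linearized
        B = np.asarray(b_vec, dtype=float).reshape(k, n)
        C = np.zeros((m, n))
        for i in range(m):          # naive triple loop: the slow baseline
            for j in range(n):
                for l in range(k):
                    C[i, j] += A[i, l] * B[l, j]
        return C.reshape(m * n)
    return apply
```

Note the operator takes two inputs and produces one output, so it is bilinear rather than linear, which is why OL generalizes beyond matrices.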

Slide 11

Operator Language

Slide 12

OL Tensor Product: Repetitive Structure

  • Kronecker product (structured matrices)
  • OL tensor product (structured operators)
  • Definition (extension to non-linear operators)
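The repetitive structure of the tensor product can be sketched as follows (illustrative, not Spiral code; the helper name `tensor_identity` is made up): I_n ⊗ A applies an operator A independently to n contiguous chunks of the input, which is exactly the structure that later maps onto loops, SIMD, or threads.

```python
import numpy as np

# Sketch of the OL tensor product I_n (x) A: apply operator A
# independently to n contiguous chunks of the (linearized) input.
def tensor_identity(n, A, chunk_len):
    """Return the operator I_n (x) A, where A acts on vectors of length chunk_len."""
    def apply(x):
        x = np.asarray(x)
        parts = [A(x[i * chunk_len:(i + 1) * chunk_len]) for i in range(n)]
        return np.concatenate(parts)
    return apply
```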

Slide 13

Translating OL Formulas Into Programs
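A toy sketch of the idea (the tuple encoding is hypothetical, not Spiral's internal representation): an OL formula is a tree of base operators, compositions, and tensor products, and translation walks the tree, turning tensor products into loops. This version interprets the tree directly; Spiral instead emits optimized C code.

```python
import numpy as np

# Toy interpreter for OL-like formula trees. A formula is a tuple:
#   ("base", f)          a leaf operator, a plain callable
#   ("compose", A, B)    the composition A o B
#   ("tensor_I", n, A)   the tensor product I_n (x) A
def evaluate(formula, x):
    kind = formula[0]
    if kind == "base":
        return formula[1](x)
    if kind == "compose":              # (A o B)(x) = A(B(x))
        return evaluate(formula[1], evaluate(formula[2], x))
    if kind == "tensor_I":             # I_n (x) A becomes a loop over chunks
        _, n, A = formula
        x = np.asarray(x)
        chunk = len(x) // n
        return np.concatenate(
            [evaluate(A, x[i * chunk:(i + 1) * chunk]) for i in range(n)])
    raise ValueError("unknown formula kind: %r" % kind)
```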

Slide 14

Example: Matrix Multiplication (MMM)

Breakdown rules: capture various forms of blocking
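One such blocking rule can be sketched as follows (illustrative, not a Spiral rule verbatim): MMM on m×k and k×n matrices decomposes into small MMMs on mb×kb and kb×nb blocks, and the block sizes are the degrees of freedom the search later tunes for the cache hierarchy.

```python
import numpy as np

# Sketch of a blocking breakdown rule for MMM: decompose the big
# product into small block products. mb, kb, nb are the rule's
# degrees of freedom (assumed to divide the matrix dimensions).
def blocked_mmm(A, B, mb, kb, nb):
    m, k = A.shape
    k2, n = B.shape
    assert k == k2 and m % mb == 0 and k % kb == 0 and n % nb == 0
    C = np.zeros((m, n))
    for i in range(0, m, mb):
        for j in range(0, n, nb):
            for l in range(0, k, kb):
                # base case of the rule: one small block MMM
                C[i:i + mb, j:j + nb] += A[i:i + mb, l:l + kb] @ B[l:l + kb, j:j + nb]
    return C
```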

Slide 15

Example: SAR Computation as OL Rules

[Diagram: SAR computation as OL rules: grid compute → range interpolation → azimuth interpolation → 2D FFT.]

Slide 16

Organization

  ▪ Operator language and algorithms
  ▪ Optimizing algorithms for platforms
  ▪ Performance results
  ▪ Summary

Slide 17

Modeling Multicore: Base Cases

  • Tensor product: embarrassingly parallel operator; in I_p ⊗ A, each of processors 0 … p−1 applies A to its own chunk of x to produce its own chunk of y
  • Permutation: problematic; may produce false sharing
  • Hardware abstraction: shared cache with cache lines
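The embarrassingly parallel tensor-product base case can be sketched as follows (illustrative, not Spiral's generated code; `parallel_tensor_identity` is a made-up name). Each "processor" writes only its own contiguous chunk, which is load-balanced and avoids false sharing as long as each chunk spans whole cache lines:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

# Sketch of the parallel base case I_p (x) A: one chunk per processor.
def parallel_tensor_identity(p, A, x):
    x = np.asarray(x, dtype=float)
    chunk = len(x) // p
    y = np.empty_like(x)
    def work(i):                       # "processor" i writes only its own chunk
        y[i * chunk:(i + 1) * chunk] = A(x[i * chunk:(i + 1) * chunk])
    with ThreadPoolExecutor(max_workers=p) as pool:
        list(pool.map(work, range(p)))
    return y
```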
Slide 18

Parallelization: OL Rewriting Rules

  • Tags encode hardware constraints
  • Rules are algorithm-independent
  • Rules encode program transformations
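Such a rule can be sketched as term rewriting on formula trees (the tuple encoding and rule below are illustrative, not Spiral's syntax): for p processors, I_n ⊗ A rewrites into I_p ⊗ (I_{n/p} ⊗ A), so the outer loop maps one chunk per core, independently of what A computes.

```python
# Toy sketch of an algorithm-independent parallelization rewrite rule.
# Formulas are tuples like ("tensor_I", n, A), ("compose", A, B),
# ("base", name); this encoding is hypothetical.
def parallelize(formula, p):
    kind = formula[0]
    if kind == "tensor_I":
        _, n, A = formula
        if n % p == 0:
            inner = ("tensor_I", n // p, parallelize(A, p))
            return ("parallel_tensor", p, inner)   # tagged: run on p cores
    if kind == "compose":
        return ("compose", parallelize(formula[1], p),
                parallelize(formula[2], p))
    return formula                     # base operators are left unchanged
```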
Slide 19

The Joint Rule Set: MMM

  • Hardware constraints: base cases
  • Program transformations: manipulation rules
  • Algorithm rules: breakdown rules

The combined rule set spans the search space for empirical optimization.

Slide 20

Parallelization Through Rewriting: MMM

  • Load-balanced
  • No false sharing

Slide 21

Same Approach for Different Paradigms

  • Vectorization
  • Threading
  • GPUs
  • Verilog for FPGAs

Slide 22

Organization

  ▪ Operator language and algorithms
  ▪ Optimizing algorithms for platforms
  ▪ Performance results
  ▪ Summary

Slide 23

Matrix Multiplication Library

[Plots: rank-k update (k = 4) on a dual Intel Xeon 5160 at 3 GHz; x-axis: input size (2 to 512); y-axis: performance in Gflop/s. The Spiral-generated library is compared against MKL 10.0 and GotoBLAS 1.26, reaching up to ~9 Gflop/s in double precision and ~18 Gflop/s in single precision.]

Slide 24

Result: Spiral-Generated PFA SAR on Core2 Quad

[Plot: SAR image formation on Intel platforms; y-axis: performance in Gflop/s (up to 50). Platforms, older to newer: 3.0 GHz Core 2 (65nm), 3.0 GHz Core 2 (45nm), 2.66 GHz Core i7, 3.0 GHz Core i7 (Virtual); bars for 16-megapixel and 100-megapixel images, reaching 43 and 44 Gflop/s on the newest platforms.]

Algorithm by J. Rudin (best paper award, HPEC 2007): 30 Gflop/s on Cell

Each implementation: vectorized, threaded, cache tuned, ~13 MB of code

Slide 25

Organization

  ▪ Operator language and algorithms
  ▪ Optimizing algorithms for platforms
  ▪ Performance results
  ▪ Summary

Slide 26

Summary

  ▪ Platforms are powerful yet complicated: optimization will stay a hard problem
  ▪ OL: a unified mathematical framework that captures platforms and algorithms
  ▪ Spiral: program generation and autotuning can provide full automation
  ▪ Performance of supported kernels is competitive with expert tuning

[Diagram: architecture A(µ) and kernel abstractions meeting in the framework. Image: Intel]