A Rewriting System for the Vectorization of Signal Transforms


Jan 08, 2024


SLIDE 1

A Rewriting System for the Vectorization of Signal Transforms

Franz Franchetti Yevgen Voronenko Markus Püschel Department of Electrical & Computer Engineering Carnegie Mellon University

http://www.spiral.net

Supported by: NSF ACR-0234293, ITR/NGS-0325687, DARPA NBCH-105000, Intel, Austrian FWF

SLIDE 2

The Problem (Example FFT Performance)

Reasonable implementation (Numerical Recipes, GNU Scientific Library) vs. best available implementation (FFTW, Intel IPP, Spiral): 10x performance gap at roughly the same operations count

Solution: program generators like ATLAS and Spiral, adaptive libraries like FFTW

SLIDE 3

Organization

• Spiral overview
• SIMD vector instructions
• Vectorization by rewriting
• Extension to SMP and Multicore
• Experimental results
• Summary

SLIDE 4

Spiral

Program generation from a problem specification for linear digital signal processing (DSP) transforms (DFT, DCT, DWT, filters, ...)

Goal 1: A flexible push-button program generation framework for an entire domain of algorithms

Goal 2: With new architectures, update the tool rather than the individual programs in the library

Spiral generates DSP programs for SIMD vector, shared memory, multicore, distributed memory, FPGAs, and embedded CPUs

Principle 1: Domain knowledge in the system
Principle 2: Optimization at a high level of abstraction
Knowledge of the platform: By evaluating runtime

Markus Püschel, José M. F. Moura, Jeremy Johnson, David Padua, Manuela Veloso, Bryan Singer, Jianxin Xiong, Franz Franchetti, Aca Gacic, Yevgen Voronenko, Kang Chen, Robert W. Johnson, and Nick Rizzolo, SPIRAL: Code Generation for DSP Transforms, Proceedings of the IEEE 93(2), 2005

SLIDE 5

What is a DSP Transform?

• Mathematically: Matrix-vector multiplication
• Example: Discrete Fourier transform (DFT)

input vector (signal)

output vector (signal)

transform = matrix
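To make the "transform = matrix" view concrete, here is a minimal Python sketch that builds the DFT matrix and applies it to a signal as a plain matrix-vector product. The helper names `dft_matrix` and `apply_transform` are illustrative, and the sign convention w = exp(-2πi/n) is an assumption, since the slide's formula did not survive extraction.

```python
import cmath


def dft_matrix(n):
    """Return the n x n DFT matrix with entries w**(k*l), w = exp(-2*pi*i/n).

    The negative exponent is an assumed sign convention.
    """
    w = cmath.exp(-2j * cmath.pi / n)
    return [[w ** (k * l) for l in range(n)] for k in range(n)]


def apply_transform(matrix, x):
    """Compute the output vector y = matrix * x (O(n^2) operations)."""
    return [sum(row[j] * x[j] for j in range(len(x))) for row in matrix]


x = [1.0, 2.0, 3.0, 4.0]               # input vector (signal)
y = apply_transform(dft_matrix(4), x)  # output vector (signal)
```

This direct product costs O(n^2) operations, which is the baseline that fast algorithms improve on.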

SLIDE 6

DSP Algorithms: Example 4-point DFT



Algorithm = sparse matrix factorization



Reduce computation cost from O(n^2) to O(n log n)



For every transform there are many fast algorithms



SPIRAL generates the space of algorithms using breakdown rules in the domain-specific Signal Processing Language (SPL)

Operation counts (when multiplied with input vector x): 12 adds and 4 mults for the dense matrix; 4 adds, 1 mult, and 4 adds for the sparse factors
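The "algorithm = sparse matrix factorization" idea can be checked numerically. The sketch below assumes the factorization meant here is the standard Cooley-Tukey one, DFT_4 = (DFT_2 ⊗ I_2) · T · (I_2 ⊗ DFT_2) · L, with T a diagonal twiddle matrix and L a stride permutation; the helper names and the sign convention w = exp(-2πi/n) are illustrative assumptions, not taken from the slides.

```python
import cmath


def identity(n):
    return [[1 if r == c else 0 for c in range(n)] for r in range(n)]


def kron(a, b):
    """Kronecker (tensor) product of two matrices given as lists of rows."""
    return [[a_rc * b_rc for a_rc in a_row for b_rc in b_row]
            for a_row in a for b_row in b]


def matmul(a, b):
    return [[sum(a[r][k] * b[k][c] for k in range(len(b)))
             for c in range(len(b[0]))] for r in range(len(a))]


def dft(n):
    w = cmath.exp(-2j * cmath.pi / n)  # assumed sign convention
    return [[w ** (k * l) for l in range(n)] for k in range(n)]


# The four sparse factors of the 4-point DFT.
stage2 = kron(dft(2), identity(2))          # DFT_2 (x) I_2: 4 additions
twiddles = [[0] * 4 for _ in range(4)]      # diag(1, 1, 1, -i): 1 multiplication
for k, t in enumerate([1, 1, 1, -1j]):
    twiddles[k][k] = t
stage1 = kron(identity(2), dft(2))          # I_2 (x) DFT_2: 4 additions
stride_perm = [[1, 0, 0, 0],                # maps (x0, x1, x2, x3)
               [0, 0, 1, 0],                # to   (x0, x2, x1, x3)
               [0, 1, 0, 0],
               [0, 0, 0, 1]]

product = matmul(matmul(matmul(stage2, twiddles), stage1), stride_perm)
dense = dft(4)
max_error = max(abs(product[r][c] - dense[r][c])
                for r in range(4) for c in range(4))
```

If the factorization holds, `max_error` is at the level of floating-point rounding.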

SLIDE 7

Some Transforms

Spiral currently contains 45 transforms

SLIDE 8

Some Breakdown Rules

Spiral currently contains 165 rules

Base case rules

SLIDE 9

SPL (Signal Processing Language)

• SPL expresses transform algorithms as structured sparse matrix factorization
• Examples: Kronecker product = loop (parallel, vector)

for i = 0:n-1
  y[im:im+m-1] = B·x[im:im+m-1]
endfor
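The loop above reads as the SPL construct I_n ⊗ B for an m x m matrix B: the same small matrix is applied independently to n contiguous blocks of the input. A runnable Python version of that reading might look as follows; the function names are illustrative assumptions.

```python
def apply_matrix(b, x):
    """Dense matrix-vector product for a small block."""
    return [sum(b_rc * x_c for b_rc, x_c in zip(row, x)) for row in b]


def apply_identity_kron(n, b, x):
    """Compute y = (I_n (x) B) x as a loop over n contiguous blocks of x."""
    m = len(b)
    y = [0] * (n * m)
    for i in range(n):
        # Iteration i touches only elements i*m .. i*m+m-1, so the
        # iterations are independent of each other.
        y[i * m:(i + 1) * m] = apply_matrix(b, x[i * m:(i + 1) * m])
    return y


B = [[1, 1],
     [1, -1]]
y = apply_identity_kron(3, B, [1, 2, 3, 4, 5, 6])
# y == [3, -1, 7, -1, 11, -1]
```

Because no iteration depends on another, this loop structure is what makes the construct a candidate for parallel execution.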

SLIDE 10

Formula Level Optimization: Idea

Traditionally: optimizations by C/Fortran compilers

Formulas → Code

Move optimizations to a higher abstraction level: domain knowledge overcomes compiler limitations

Formula-level optimizations in Spiral, implemented through rewriting systems:

Loop merging
Vectorization
Parallelization

SLIDE 11

SIMD (Single Instruction Multiple Data) Vector Instructions in a Nutshell

• What are these instructions?

Extension of the ISA. Data types and instructions for parallel computation on short (2-way–16-way) vectors of integers and floats

• Problems:

Not standardized
Compiler vectorization limited
Low-level issues (data alignment, …)
Reordering data kills runtime

One can easily slow down a program by vectorizing it

Example vector operation: addps xmm0, xmm1
vector register xmm0: 1 2 4 4
vector register xmm1: 5 1 1 3
xmm0 after the addition: 6 3 5 7

Intel MMX
AMD 3DNow!
Intel SSE
AMD Enhanced 3DNow!
Motorola AltiVec
AMD 3DNow! Professional
Itanium
Intel XScale
Intel SSE2
AMD-64
IBM BlueGene/L PPC440FP2
Intel Wireless MMX
Intel SSE3
…

SLIDE 12

Vectorization of Formulas by Rewriting

Naturally vectorizable construct: A ⊗ I_ν, where ν is the vector length (any two-power)

Rewriting rules to vectorize formulas
Introduces data reorganization (permutations)
Operates on 4-way vectors

[Rewriting example; annotated parts: vector construct, further rewriting, base case]

Franchetti and Püschel (IPDPS 2002/2003)

Definition: Vectorized formula := vector constructs and base cases, A·B, and I_m ⊗ A of vectorized formulas
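To illustrate why a tensor product with an identity on the right is naturally vectorizable, the Python sketch below evaluates (A ⊗ I_ν) x using only operations on whole ν-element vectors, so every scalar operation of A becomes one vector operation. It assumes the vector construct has the form A ⊗ I_ν with ν = 4; `vec_add`, `vec_scale`, and `apply_kron_identity` are hypothetical stand-ins for SIMD instructions and generated code, not Spiral's actual output.

```python
NU = 4  # assumed vector length (4-way, as with SSE single precision)


def vec_add(u, v):
    """Stand-in for one SIMD addition on NU-element vectors."""
    return [a + b for a, b in zip(u, v)]


def vec_scale(c, v):
    """Stand-in for one SIMD multiplication by a broadcast constant."""
    return [c * a for a in v]


def apply_kron_identity(a, x, nu=NU):
    """Compute y = (A (x) I_nu) x with whole-vector operations only."""
    cols = len(a[0])
    # Each "register" holds nu contiguous elements; no shuffling is needed.
    regs = [x[j * nu:(j + 1) * nu] for j in range(cols)]
    y = []
    for row in a:
        acc = [0] * nu
        for coeff, reg in zip(row, regs):
            acc = vec_add(acc, vec_scale(coeff, reg))
        y.extend(acc)
    return y


A = [[1, 1],
     [1, -1]]
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = apply_kron_identity(A, x)
# y == [6, 8, 10, 12, -4, -4, -4, -4]
```

Other constructs do not map this directly onto vector registers, which is where rewriting rules and the data reorganization (permutations) they introduce come in.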

SLIDE 13

Example: DFT

vector constructs base cases

Formula is vectorized w.r.t. Definition

SLIDE 14

Some Vectorization Rules

SLIDE 15

Shared Memory Parallelization by Rewriting

Load balanced, contiguous blocks
No false sharing (entire cache lines are swapped)

F. Franchetti, Y. Voronenko, and M. Püschel: “FFT Program Generation for Shared Memory: SMP and Multicore,” to appear in SC|06
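The "load balanced, contiguous blocks" property can be sketched in Python for a formula of the form I_p ⊗ A, where each of p workers applies A to its own equally sized, contiguous slice of the input. This is only a structural illustration under that assumption: the function names are hypothetical, and Python threads do not reproduce the cache-line behavior or the speed-ups of the generated code.

```python
from concurrent.futures import ThreadPoolExecutor


def apply_matrix(a, x):
    """Dense matrix-vector product for one block."""
    return [sum(a_rc * x_c for a_rc, x_c in zip(row, x)) for row in a]


def parallel_block_apply(p, a, x):
    """Compute y = (I_p (x) A) x with one worker per contiguous block."""
    m = len(a)
    # Equal-sized contiguous slices: every worker gets the same amount of work
    # and no two workers share input or output elements.
    blocks = [x[i * m:(i + 1) * m] for i in range(p)]
    with ThreadPoolExecutor(max_workers=p) as pool:
        results = list(pool.map(lambda block: apply_matrix(a, block), blocks))
    return [value for block in results for value in block]


A = [[1, 1],
     [1, -1]]
y = parallel_block_apply(2, A, [1, 2, 3, 4])
# y == [3, -1, 7, -1]
```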

SLIDE 16

How Good is Our Generated Vector Code?

[Plot: Complex 1D DFT on Intel Pentium 4, 3.6 GHz, 4-way SSE (float); x-axis: problem size (log2 N), 4 to 17; y-axis: pseudo Mflop/s, 1000 to 9000; higher is better; annotated gap: 3.5x]

Curves:
scalar (x87) Spiral code (automatically generated)
scalar Spiral code + vectorizing compiler
Spiral vector code (automatically generated)
FFTW 3.1 SSE (adapted, but hand-vectorized)
Intel MKL 8.1 (handcoded)

Spiral-generated code performs comparably to expertly hand-tuned code
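The slides do not define the "pseudo Mflop/s" unit. A common convention in FFT benchmarking, assumed here, normalizes runtime by a nominal count of 5 N log2 N floating-point operations for a complex DFT of size N, regardless of how many operations the code really performs. A minimal sketch with a hypothetical runtime value:

```python
import math


def pseudo_mflops(n, runtime_us):
    """Nominal rate 5 * N * log2(N) / runtime, with runtime in microseconds.

    The 5 N log2 N operation count is an assumed convention, not a figure
    stated in the slides.
    """
    return 5 * n * math.log2(n) / runtime_us


# Hypothetical measurement: a size-1024 DFT taking 10 microseconds.
rate = pseudo_mflops(1024, 10.0)
# rate == 5120.0
```

Under this convention a higher value always means a shorter runtime for the same problem size, so implementations with different real operation counts can be compared on one axis.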

SLIDE 17

What About 8-way Vector Code?

[Plot: Complex 1D DFT on Intel Pentium 4, 3.6 GHz, 8-way SSE2 (16-bit int); x-axis: problem sizes (N), 64 to 8192; y-axis: MIPS, 2000 to 16000; higher is better]

Curves:
Spiral vector code (automatically generated)
Intel IPP 5.0 (handcoded)

Spiral-generated code clearly outperforms expertly hand-tuned code

SLIDE 18

Combined Multicore and Vector Code

[Plot: double precision 1-D DFT on Pentium D 3.6 GHz (Dual Core, 2-way SIMD); x-axis: problem size (log2 N), 7 to 20; y-axis: pseudo Mflop/s, 1000 to 6000; higher is better; curves: sequential, parallel, parallel + vector]

2.5x speed-up from parallel + vector
Parallelization speed-up for small problems

SLIDE 19

Summary

• Parallelization and vectorization in Spiral
  Entirely automatic
  Principled approach
  Rewriting system
  Generated code is very fast

• Works for other hardware as well
  Distributed memory: MPI (with C.W. Ueberhuber, A. Bonelli, and J. Lorenz, Vienna University of Technology)
  Hardware: FPGAs (with J.C. Hoe and Peter Milder, Carnegie Mellon University)

SLIDE 20


www.spiral.net

(Part of the) Spiral Team