A Rewriting System for the Vectorization of Signal Transforms



SLIDE 1

A Rewriting System for the Vectorization of Signal Transforms

Franz Franchetti, Yevgen Voronenko, Markus Püschel

Department of Electrical & Computer Engineering, Carnegie Mellon University

http://www.spiral.net

Supported by: NSF ACR-0234293, ITR/NGS-0325687, DARPA NBCH-105000, Intel, Austrian FWF

SLIDE 2

The Problem (Example FFT Performance)

[Plot: FFT performance of a reasonable implementation (Numerical Recipes, GNU Scientific Library) versus the best available implementation (FFTW, Intel IPP, Spiral)]

The best available implementations are about 10x faster, at roughly the same operations count.

Solution: program generators like ATLAS and Spiral, adaptive libraries like FFTW

SLIDE 3

Organization

  • Spiral overview
  • SIMD vector instructions
  • Vectorization by rewriting
  • Extension to SMP and Multicore
  • Experimental results
  • Summary

SLIDE 4

Knowledge of the platform: By evaluating runtime

Spiral

Program generation from a problem specification for linear digital signal processing (DSP) transforms (DFT, DCT, DWT, filters, ...)

Goal 1: A flexible push-button program generation framework for an entire domain of algorithms

Goal 2: With new architectures, update the tool rather than the individual programs in the library

Spiral: generates DSP programs for SIMD vector, shared memory, multicore, distributed memory, FPGAs, embedded CPUs

Principle 1: Domain knowledge in the system

Principle 2: Optimization at a high level of abstraction

Markus Püschel, José M. F. Moura, Jeremy Johnson, David Padua, Manuela Veloso, Bryan Singer, Jianxin Xiong, Franz Franchetti, Aca Gacic, Yevgen Voronenko, Kang Chen, Robert W. Johnson, and Nick Rizzolo, SPIRAL: Code Generation for DSP Transforms, Proceedings of the IEEE 93(2), 2005

SLIDE 5

What is a DSP Transform?

  • Mathematically: matrix-vector multiplication y = T · x, where x is the input vector (signal), y is the output vector (signal), and the transform T is a matrix
  • Example: discrete Fourier transform (DFT)
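
To make "transform = matrix" concrete, here is a minimal Python sketch that builds the DFT matrix and applies it as a plain matrix-vector product. The helper names `dft_matrix` and `apply_transform` are hypothetical, and the sign convention w = exp(-2πi/n) is an assumption, since the slide's formula did not survive extraction.

```python
import cmath

def dft_matrix(n):
    """Return the n x n DFT matrix [w^(k*l)], assuming w = exp(-2*pi*i/n)."""
    w = cmath.exp(-2j * cmath.pi / n)
    return [[w ** (k * l) for l in range(n)] for k in range(n)]

def apply_transform(matrix, x):
    """Compute y = matrix * x as a dense matrix-vector product, O(n^2) work."""
    return [sum(a * b for a, b in zip(row, x)) for row in matrix]

x = [1, 2, 3, 4]                       # input vector (signal)
y = apply_transform(dft_matrix(4), x)  # output vector (signal)
print(y)
```

Computed this way, an n-point transform costs on the order of n^2 operations, which is the baseline the fast algorithms on the next slide improve on.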

SLIDE 6

DSP Algorithms: Example 4-point DFT

Algorithm = sparse matrix factorization

Reduce computation cost from O(n^2) to O(n log n)

For every transform there are many fast algorithms

SPIRAL generates the space of algorithms using breakdown rules in the domain-specific Signal Processing Language (SPL)

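[Equation: the 4-point DFT matrix written as a product of sparse matrices]

Multiplying the DFT matrix with the input vector x directly costs 12 adds and 4 mults; applying the sparse factors one after another costs 4 adds, 1 mult, and 4 adds.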
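
The equation on this slide was lost in extraction. The sketch below assumes it is the standard Cooley-Tukey factorization DFT_4 = (DFT_2 ⊗ I_2) · T · (I_2 ⊗ DFT_2) · L, which matches the operation counts quoted above, and checks it numerically. The helper names are hypothetical and the sign convention w = -i is an assumption.

```python
def matmul(a, b):
    """Dense matrix product, used only to check the factorization."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def kron(a, b):
    """Kronecker (tensor) product of two matrices."""
    return [[a[i][j] * b[k][l] for j in range(len(a[0])) for l in range(len(b[0]))]
            for i in range(len(a)) for k in range(len(b))]

I2 = [[1, 0], [0, 1]]
DFT2 = [[1, 1], [1, -1]]
# Twiddle factors: the only nontrivial multiplication is by -i (1 mult).
T = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, -1j]]
# Stride permutation: (x0, x1, x2, x3) -> (x0, x2, x1, x3), no arithmetic.
L = [[1, 0, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 1]]

# Dense 4-point DFT matrix with w = -i (assumed sign convention).
DFT4 = [[(-1j) ** (k * l) for l in range(4)] for k in range(4)]

# (DFT2 kron I2) and (I2 kron DFT2) each cost 4 adds when applied to a vector.
factored = matmul(matmul(kron(DFT2, I2), T), matmul(kron(I2, DFT2), L))
matches = all(abs(factored[k][l] - DFT4[k][l]) < 1e-12
              for k in range(4) for l in range(4))
print(matches)
```

The dense matrices are built here only for verification; a real implementation applies each sparse factor directly to the vector and never forms the product.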

SLIDE 7

Some Transforms

Spiral currently contains 45 transforms

SLIDE 8

Some Breakdown Rules

Spiral currently contains 165 rules

Base case rules

SLIDE 9

SPL (Signal Processing Language)

  • SPL expresses transform algorithms as structured sparse matrix factorizations
  • Examples: Kronecker product = loop (parallel, vector)

for i = 0:n-1
    y[im:im+m-1] = B·x[im:im+m-1]
endfor
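
The loop above can be read as y = (I_n ⊗ B) x for an m x m matrix B: the same small matrix is applied to n consecutive blocks of the input. A minimal runnable version in Python follows; the function names are hypothetical and B is assumed square.

```python
def apply_matrix(b, x):
    """Dense matrix-vector product for a small matrix b."""
    return [sum(c * v for c, v in zip(row, x)) for row in b]

def apply_i_kron_b(n, b, x):
    """Compute y = (I_n kron B) x as a loop over n independent blocks."""
    m = len(b)
    y = [0] * (n * m)
    for i in range(n):
        # Iterations touch disjoint blocks, so they could run in parallel.
        y[i * m:(i + 1) * m] = apply_matrix(b, x[i * m:(i + 1) * m])
    return y

b = [[1, 1], [1, -1]]
x = [1, 2, 3, 4, 5, 6]
print(apply_i_kron_b(3, b, x))  # [3, -1, 7, -1, 11, -1]
```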

SLIDE 10

Formula Level Optimization: Idea

Traditionally: optimizations are performed by C/Fortran compilers, on code.

Move optimizations to a higher abstraction level, from code to formulas: domain knowledge overcomes compiler limitations.

Formula-level optimizations in Spiral are implemented through rewriting systems:

  • Loop merging
  • Vectorization
  • Parallelization
SLIDE 11

SIMD (Single Instruction Multiple Data) Vector Instructions in a Nutshell

What are these instructions?

  • Extension of the ISA: data types and instructions for parallel computation on short (2-way to 16-way) vectors of integers and floats

Problems:

  • Not standardized
  • Compiler vectorization limited
  • Low-level issues (data alignment,…)
  • Reordering data kills runtime

One can easily slow down a program by vectorizing it

[Figure: the vector operation addps xmm0, xmm1 adds two 4-way vector registers element-wise: xmm0 = (1, 2, 4, 4) plus xmm1 = (5, 1, 1, 3) gives xmm0 = (6, 3, 5, 7)]
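
The addps example on this slide can be emulated in a few lines of Python. This is only an illustration of the semantics: the function `addps` is a hypothetical stand-in for the instruction, Python floats are double precision rather than single, and which operand values belong to which register is inferred from the figure.

```python
def addps(dst, src):
    """Emulate a 4-way packed add: one instruction produces four sums."""
    assert len(dst) == len(src) == 4
    return [a + b for a, b in zip(dst, src)]

xmm0 = [1.0, 2.0, 4.0, 4.0]
xmm1 = [5.0, 1.0, 1.0, 3.0]
xmm0 = addps(xmm0, xmm1)  # destination register is overwritten
print(xmm0)  # [6.0, 3.0, 5.0, 7.0]
```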

  • Intel MMX
  • AMD 3DNow!
  • Intel SSE
  • AMD Enhanced 3DNow!
  • Motorola AltiVec
  • AMD 3DNow! Professional
  • Itanium
  • Intel XScale
  • Intel SSE2
  • AMD-64
  • IBM BlueGene/L PPC440FP2
  • Intel Wireless MMX
  • Intel SSE3
SLIDE 12

Vectorization of Formulas by Rewriting

  • Naturally vectorizable construct: A ⊗ I_4, which operates on 4-way vectors; in general A ⊗ I_ν with vector length ν (any two-power)

  • Rewriting rules to vectorize formulas; rewriting introduces data reorganization (permutations)

[Formula: a rewriting example whose parts are labeled vector construct, further rewriting, and base case]

Franchetti and Püschel (IPDPS 2002/2003)

Definition: Vectorized formula := vector constructs and base cases, and A · B and I_m ⊗ A of vectorized formulas A and B
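
The following Python sketch illustrates why A ⊗ I_ν is naturally vectorizable and how rewriting introduces permutations. It uses the textbook identity I_m ⊗ A = P · (A ⊗ I_m) · Q with stride permutations P and Q; this identity is chosen for illustration and is not claimed to be one of the rules on the slides. All function names are hypothetical.

```python
def stride_perm(x, s):
    """Read x at stride s: x[0], x[s], ..., then x[1], x[1+s], ..."""
    return [v for i in range(s) for v in x[i::s]]

def apply_a_kron_i(a, nu, x):
    """y = (A kron I_nu) x using only operations on whole nu-element vectors."""
    vecs = [x[j * nu:(j + 1) * nu] for j in range(len(a))]  # vector "registers"
    out = []
    for row in a:
        acc = [0] * nu
        for coeff, vec in zip(row, vecs):
            acc = [s + coeff * v for s, v in zip(acc, vec)]  # vector multiply-add
        out.extend(acc)
    return out

def apply_i_kron_a(m, a, x):
    """Reference: y = (I_m kron A) x, block by block with scalar arithmetic."""
    n = len(a)
    y = []
    for i in range(m):
        block = x[i * n:(i + 1) * n]
        y.extend(sum(c * v for c, v in zip(row, block)) for row in a)
    return y

def apply_i_kron_a_vectorized(m, a, x):
    """Same result, rewritten as permutation, vector construct, permutation."""
    t = stride_perm(x, len(a))   # gather stride-n elements into m-way vectors
    t = apply_a_kron_i(a, m, t)  # naturally vectorizable construct A kron I_m
    return stride_perm(t, m)     # restore the original ordering

a = [[1, 1], [1, -1]]
x = list(range(8))
print(apply_i_kron_a(4, a, x))
print(apply_i_kron_a_vectorized(4, a, x))
```

Both calls produce the same vector. The arithmetic in the rewritten version is fully vectorized, but it pays for two data reorderings, which is why handling the permutations well is central to the approach.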

SLIDE 13

Example: DFT

[Formula: a vectorized DFT breakdown, with its vector constructs and base cases marked]

Formula is vectorized w.r.t. Definition

SLIDE 14

Some Vectorization Rules

SLIDE 15

Shared Memory Parallelization by Rewriting

  • Load balanced, contiguous blocks
  • No false sharing (entire cache lines are swapped)

F. Franchetti, Y. Voronenko, and M. Püschel: “FFT Program Generation for Shared Memory: SMP and Multicore,” to appear in SC|06
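
As a rough illustration of the two properties named on this slide, the sketch below splits an array among threads so that every thread gets an equally sized, contiguous block made of whole cache lines. It shows the target property only, not Spiral's rewriting rules; the function name and the parameters `p` (threads) and `mu` (elements per cache line) are hypothetical.

```python
def partition_blocks(n, p, mu):
    """Split indices 0..n-1 into p equal contiguous blocks of whole cache lines.

    n:  number of array elements
    p:  number of threads
    mu: number of elements per cache line
    """
    if n % (p * mu) != 0:
        raise ValueError("n must be a multiple of p * mu")
    size = n // p  # equal block sizes: load balanced
    # Every block starts at a multiple of mu, so no cache line is written by
    # two threads (assuming the array itself is cache-line aligned).
    return [(t * size, (t + 1) * size) for t in range(p)]

print(partition_blocks(64, 2, 4))  # [(0, 32), (32, 64)]
```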

SLIDE 16

How Good is Our Generated Vector Code?

[Plot: Complex 1D DFT on Intel Pentium 4, 3.6 GHz, 4-way SSE (float). x-axis: problem size (log2 N), 4 to 17; y-axis: pseudo Mflop/s, ticks up to 9000 (higher is better). Annotated speed-up: 3.5x]

Curves:

  • scalar (x87) Spiral code (automatically generated)
  • scalar Spiral code + vectorizing compiler
  • Spiral vector code (automatically generated)
  • FFTW 3.1 SSE (adapted, but hand-vectorized)
  • Intel MKL 8.1 (handcoded)

Spiral generated code performs comparably to expertly hand-tuned code
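
The slides do not define pseudo Mflop/s. A common convention in FFT benchmarking, assumed here, is to charge every complex DFT of size N a nominal 5 N log2 N floating-point operations regardless of the algorithm actually used, and divide by the runtime in microseconds. The function name below is hypothetical.

```python
import math

def pseudo_mflops(n, runtime_us):
    """Pseudo Mflop/s for a complex DFT of size n.

    Assumes the nominal operation count 5 * n * log2(n); the real count of a
    given algorithm may differ, hence "pseudo".
    """
    return 5 * n * math.log2(n) / runtime_us

print(pseudo_mflops(1024, 10.0))  # 5120.0
```

Because the nominal count is the same for every implementation, this metric is an inverse runtime scaled per problem size, which makes curves for different problem sizes comparable on one axis.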

SLIDE 17

What About 8-way Vector Code?

[Plot: Complex 1D DFT on Intel Pentium 4, 3.6 GHz, 8-way SSE2 (16-bit int). x-axis: problem sizes (N), 64 to 8192; y-axis: MIPS, ticks up to 16000 (higher is better)]

Curves:

  • Spiral vector code (automatically generated)
  • Intel IPP 5.0 (handcoded)

Spiral generated code clearly outperforms expertly hand-tuned code

SLIDE 18

Combined Multicore and Vector Code

[Plot: double precision 1-D DFT on Pentium D 3.6 GHz (Dual Core, 2-way SIMD). x-axis: problem size (log2 N), 7 to 20; y-axis: pseudo Mflop/s, ticks up to 6000 (higher is better). Curves: sequential, parallel, parallel + vector]

  • 2.5x speed-up from parallel + vector
  • Parallelization speed-up for small problems

SLIDE 19

Summary

Parallelization and vectorization in Spiral

  • Entirely automatic
  • Principled approach
  • Rewriting system
  • Generated code is very fast

Works for other hardware as well

  • Distributed memory: MPI (with C.W. Ueberhuber, A. Bonelli, and J. Lorenz, Vienna University of Technology)
  • Hardware: FPGAs (with J.C. Hoe and Peter Milder, Carnegie Mellon University)

SLIDE 20

www.spiral.net

(Part of the) Spiral Team