A Language for the Compact Representation of Multiple Program Versions. Proceedings of the 18th International Workshop on Languages and Compilers for Parallel Computing (2005). Sebastien Donadio 1,2, James Brodman 4, Thomas Roeder 5, Kamen Yotov


slide-1
SLIDE 1

A Language for the Compact Representation of Multiple Program Versions

Proceedings of the 18th International Workshop on Languages and Compilers for Parallel Computing (2005)

Sebastien Donadio (1,2), James Brodman (4), Thomas Roeder (5), Kamen Yotov (5), Denis Barthou (2), Albert Cohen (3), María Jesús Garzarán (4), David Padua (4), and Keshav Pingali (5)

(1) BULL SA
(2) University of Versailles St-Quentin-en-Yvelines
(3) INRIA Futurs
(4) University of Illinois at Urbana-Champaign
(5) Cornell University

Pascal Fischli, 9. November 2011

All examples are taken from this paper

slide-6
SLIDE 6

Motivation

■ Wanted: Best Program Version
■ Library Generators have Weaknesses

  • Specification of Transformations

► Which ► Where ► Order ► How

  • Representation of Program Versions

► Natural and Compact

  • Defining new Transformations
slide-9
SLIDE 9

Language X - Workflow

■ Language Usages

  • Write Programs in X directly
  • Intermediate Representation

■ Native C Compilers

  • Low-Level Optimizations
  • May undo Transformations in X

■ Search Engine

  • Exhaustive Search
  • Parameter Values

[Figure: workflow diagram with the boxes Language X, Search Engine, Program Versions, C Compiler, and Optimized Code]
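The exhaustive search over parameter values can be pictured as a loop that tries every candidate and keeps the cheapest one. The following is a minimal sketch in C, not the search engine from the paper: the callback type `cost_fn` is a hypothetical stand-in for "generate, compile, and time one program version", and `exhaustive_search` and `toy_cost` are names invented for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for "generate, compile, and time one version". */
typedef double (*cost_fn)(int param);

/* Try every candidate parameter value and return the one with the lowest
   cost. Returns -1 if there are no candidates, so candidates are assumed
   to be positive. Ties keep the earliest candidate. */
int exhaustive_search(const int *candidates, size_t n, cost_fn cost) {
    assert(cost != NULL);
    int best = -1;
    double best_cost = 0.0;
    for (size_t idx = 0; idx < n; idx++) {
        double c = cost(candidates[idx]);
        if (idx == 0 || c < best_cost) {
            best = candidates[idx];
            best_cost = c;
        }
    }
    return best;
}

/* Toy cost model (an assumption, not measured data): cheapest at 4. */
double toy_cost(int unroll) {
    double d = (double)unroll - 4.0;
    return d * d + 1.0;
}
```

With the candidates 1, 2, 4, 8, 16 and `toy_cost`, the search returns 4. In a real setting the cost of each candidate would come from running the compiled program version, which is why the number of candidates matters.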

slide-14
SLIDE 14

Transformations – Important Features

■ Elementary Transformations

  • Sequences of Statements
  • Loops

■ Composition of Transformations

  • Conditional

■ Mechanism to name Statements
■ Procedural Abstraction
■ Mechanism to define new Transformations

slide-18
SLIDE 18

Macros as Language Representation

■ Simple Example

s = 0;
for (i=0;i<256;i++) {
  s = s + a[i];
}

■ X Representation

s = 0;
for (i=0;i<256;i+=%d) {
  %for (k=0; k<=(%d-1); k++)
    s = s + a[i+%k];
}

■ Which stands for

s = 0;
for (i=0;i<256;i+=%d) {
  s = s + a[i];
  s = s + a[i+1];
  ...
  s = s + a[i+(%d-1)];
}

Seems complicated?
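One way to read the %for macro is as a loop that runs when the code is generated, not when it executes. The sketch below expands this one example by hand for a concrete unroll factor and writes the resulting C text into a buffer. It is only an illustration under that reading: `expand_unrolled_sum` is a hypothetical helper, and it does not reproduce the macro processor of X.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Write into `out` the C text obtained by expanding the %for macro of the
   example for a concrete unroll factor d (the value %d stands for).
   Returns the number of characters written (excluding the terminating
   NUL), or -1 if `out` is too small. */
int expand_unrolled_sum(int d, char *out, size_t cap) {
    assert(d >= 1 && out != NULL);
    size_t used = 0;
    int n = snprintf(out, cap, "for (i=0;i<256;i+=%d) {\n", d);
    if (n < 0 || (size_t)n >= cap) return -1;
    used += (size_t)n;
    /* Mirrors %for (k=0; k<=(%d-1); k++): one statement per value of k. */
    for (int k = 0; k <= d - 1; k++) {
        n = snprintf(out + used, cap - used, "  s = s + a[i+%d];\n", k);
        if (n < 0 || (size_t)n >= cap - used) return -1;
        used += (size_t)n;
    }
    n = snprintf(out + used, cap - used, "}\n");
    if (n < 0 || (size_t)n >= cap - used) return -1;
    used += (size_t)n;
    return (int)used;
}
```

Calling it with d = 4 produces a loop header with step 4 followed by four statements, for `a[i+0]` through `a[i+3]`. Each value of d gives a different program version from the same compact source.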

slide-19
SLIDE 19

Macros again: Tiled MMM-Loop

for (i=0;i<N;i++) {
  for (j=0;j<M;j++) {
    for (k=0;k<K;k++) {
      c[i][j] += a[i][k] * b[k][j];
}}}

for (i=0;i<(N/%tile)*%tile;i+=%tile) {
  for (j=0;j<(M/%tile)*%tile;j+=%tile) {
    for (k=0;k<(K/%tile)*%tile;k+=%tile) {
      for (ii=i;ii<i+%tile;ii++) {
        for (jj=j;jj<j+%tile;jj++) {
          for (kk=k;kk<k+%tile;kk++) {
            c[ii][jj] += a[ii][kk] * b[kk][jj];
    }}}}
    %if (((K/%tile)*%tile)!=K) {
      for (k=(K/%tile)*%tile;k<K;k++) {
        for (ii=i;ii<i+%tile;ii++) {
          for (jj=j;jj<j+%tile;jj++) {
            c[ii][jj] += a[ii][k] * b[k][jj];
      }}}
    }
  }
  ....
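The tiled loop nest can be checked against the plain triple loop in ordinary C. The sketch below is an assumption-laden illustration, not the code X would generate: instead of the separate remainder loops guarded by %if, it clips every tile with a small `min_int` helper, which covers partial tiles in the same nest. The sizes N, M, K and all function names are chosen for this example only.

```c
#include <assert.h>

enum { N = 5, M = 7, K = 6 };   /* deliberately not multiples of the tile */

/* Reference triple loop: c += a * b. */
void mmm_plain(double c[N][M], double a[N][K], double b[K][M]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < M; j++)
            for (int k = 0; k < K; k++)
                c[i][j] += a[i][k] * b[k][j];
}

static int min_int(int x, int y) { return x < y ? x : y; }

/* Tiled version. Each tile is clipped with min_int, so partial tiles at
   the edges are handled without extra remainder loops. */
void mmm_tiled(double c[N][M], double a[N][K], double b[K][M], int tile) {
    assert(tile >= 1);
    for (int i = 0; i < N; i += tile)
        for (int j = 0; j < M; j += tile)
            for (int k = 0; k < K; k += tile)
                for (int ii = i; ii < min_int(i + tile, N); ii++)
                    for (int jj = j; jj < min_int(j + tile, M); jj++)
                        for (int kk = k; kk < min_int(k + tile, K); kk++)
                            c[ii][jj] += a[ii][kk] * b[kk][jj];
}

/* Returns 1 if both versions give identical results for this tile size.
   The inputs are small integers, so the comparison is exact. */
int tiled_matches_plain(int tile) {
    double a[N][K], b[K][M], c1[N][M], c2[N][M];
    for (int i = 0; i < N; i++)
        for (int k = 0; k < K; k++) a[i][k] = i + 2 * k + 1;
    for (int k = 0; k < K; k++)
        for (int j = 0; j < M; j++) b[k][j] = 3 * k - j;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < M; j++) { c1[i][j] = 0.0; c2[i][j] = 0.0; }
    mmm_plain(c1, a, b);
    mmm_tiled(c2, a, b, tile);
    for (int i = 0; i < N; i++)
        for (int j = 0; j < M; j++)
            if (c1[i][j] != c2[i][j]) return 0;
    return 1;
}
```

Clipping keeps the sketch short; the macro version on the slide spells out the remainder loops explicitly, which is part of why it grows so long.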

slide-20
SLIDE 20

Better Representation: Pragmas

■ Begin/End

#pragma xlang begin
. . .
#pragma xlang end

■ Naming

#pragma xlang name <id> {...}

  • {} for set of statements

■ Transformation

  • Basic Syntax

#pragma xlang transform keyword <list-input-par> <list-output-par>

slide-21
SLIDE 21

Implemented Elementary Transformations

■ Full Unrolling
■ Partial Unrolling
■ Strip Mining
■ Interchange
■ Loop Fission
■ Loop Fusion
■ Scalar Promote
■ Lifting
■ Software Pipelining

slide-24
SLIDE 24

Example 1: Loop Unroll

■ Once again the simple Loop

s = 0;
for (i=0;i<256;i++) {
  s = s + a[i];
}

■ X Representation

s = 0;
#pragma xlang name l1
for (i=0;i<256;i++) {
  s = s + a[i];
}
#pragma xlang transform unroll l1 4

■ Resulting Code

s = 0;
#pragma xlang name l1
for (i=0;i<256;i+=4) {
  s = s + a[i];
  s = s + a[i+1];
  s = s + a[i+2];
  s = s + a[i+3];
}
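That unrolling by 4 leaves the result unchanged can be checked directly. A minimal sketch in plain C, with the pragmas left out; the function names and the test data are assumptions made for this illustration:

```c
#include <assert.h>
#include <stddef.h>

/* The simple loop, with s initialised explicitly. */
int sum_simple(const int a[256]) {
    assert(a != NULL);
    int s = 0;
    for (int i = 0; i < 256; i++) s = s + a[i];
    return s;
}

/* The loop after "unroll l1 4": four statements per iteration, step 4. */
int sum_unrolled4(const int a[256]) {
    assert(a != NULL);
    int s = 0;
    for (int i = 0; i < 256; i += 4) {
        s = s + a[i];
        s = s + a[i + 1];
        s = s + a[i + 2];
        s = s + a[i + 3];
    }
    return s;
}

/* Returns 1 if both loops compute the same sum on a fixed test array. */
int unroll_preserves_sum(void) {
    int a[256];
    for (int i = 0; i < 256; i++) a[i] = 3 * i - 100;
    return sum_simple(a) == sum_unrolled4(a);
}
```

No remainder loop is needed here because the trip count 256 is a multiple of the unroll factor 4; for other factors the leftover iterations would have to be handled separately.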

slide-27
SLIDE 27

Example 2: Pipelining

■ The MMM-Loop again

for (i=0;i<N;i++) {
  for (j=0;j<M;j++) {
    for (k=0;k<K;k++) {
      c[i][j] += a[i][k] * b[k][j];
}}}

■ X Representation

for (i=0;i<N;i++){
  for (j=0;j<M;j++) {
    for (k=0;k<K;k++) {
      #pragma xlang name statement st1
      c[i][j] += a[i][k] * b[k][j];
}}}
#pragma xlang transform split st1 st2 temp

■ Resulting Code

double temp[0..K];
for (i=0;i<N;i++){
  for (j=0;j<M;j++) {
    for (k=0;k<K;k++) {
      #pragma xlang name statement st1
      temp[k] = a[i][k] * b[k][j];
      #pragma xlang name statement st2
      c[i][j] = c[i][j] + temp[k];
}}}
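The effect of the split can be tried out in plain C: the multiply goes into a temporary (st1) and the accumulation becomes its own statement (st2). The sketch below assumes a temporary with one slot per k iteration, indices 0 to K-1, which is one reading of `temp[0..K]`. The function names are hypothetical, and the array parameters use C99 variable-length array syntax.

```c
#include <assert.h>

/* Original statement st1: c[i][j] += a[i][k] * b[k][j]. */
void mmm_original(int n, int m, int kd,
                  double c[n][m], double a[n][kd], double b[kd][m]) {
    assert(n > 0 && m > 0 && kd > 0);
    for (int i = 0; i < n; i++)
        for (int j = 0; j < m; j++)
            for (int k = 0; k < kd; k++)
                c[i][j] += a[i][k] * b[k][j];
}

/* After the split: st1 computes the product into temp[k],
   st2 adds it to c[i][j]. */
void mmm_split(int n, int m, int kd,
               double c[n][m], double a[n][kd], double b[kd][m]) {
    assert(n > 0 && m > 0 && kd > 0);
    double temp[kd];                            /* one slot per k iteration */
    for (int i = 0; i < n; i++)
        for (int j = 0; j < m; j++)
            for (int k = 0; k < kd; k++) {
                temp[k] = a[i][k] * b[k][j];    /* st1 */
                c[i][j] = c[i][j] + temp[k];    /* st2 */
            }
}

/* Returns 1 if both versions agree on small integer-valued inputs. */
int split_matches_original(void) {
    enum { N = 4, M = 3, K = 5 };
    double a[N][K], b[K][M], c1[N][M], c2[N][M];
    for (int i = 0; i < N; i++)
        for (int k = 0; k < K; k++) a[i][k] = i - k + 2;
    for (int k = 0; k < K; k++)
        for (int j = 0; j < M; j++) b[k][j] = k + 2 * j;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < M; j++) { c1[i][j] = 1.0; c2[i][j] = 1.0; }
    mmm_original(N, M, K, c1, a, b);
    mmm_split(N, M, K, c2, a, b);
    for (int i = 0; i < N; i++)
        for (int j = 0; j < M; j++)
            if (c1[i][j] != c2[i][j]) return 0;
    return 1;
}
```

Separating the two statements is what makes pipelining possible afterwards: once st1 and st2 carry their own names, later transformations can move them relative to each other.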

slide-28
SLIDE 28

Defining new Transformations

■ Pattern Rewriting

  • 1. Pattern: Matching
  • 2. Pattern: Rewriting

■ Macro Code directly

slide-29
SLIDE 29

Experimental Results

■ Matrix-Matrix Multiplication (DGEMM)
■ Mimic ATLAS
■ Focus on Blocking for L2 and L3 cache
■ Compiler: Intel C compiler (icc) 8.1

  • Pipelining
  • Block Scheduling
slide-34
SLIDE 34

Experimental Results – X Code

#pragma xlang name iloop
for (i=0;i<NB;i++)
  #pragma xlang name jloop
  for (j=0;j<NB;j++)
    #pragma xlang name kloop
    for (k=0;k<NB;k++) {
      c[i][j]=c[i][j]+a[i][k]*b[k][j];
    }

Tiling iloop and jloop (NU = 1, MU = 4)

#pragma xlang transform stripmine iloop NU NUloop
#pragma xlang transform stripmine jloop MU MUloop
#pragma xlang transform interchange kloop MUloop
#pragma xlang transform interchange jloop NUloop
#pragma xlang transform interchange kloop NUloop

Full Unroll the new tiles

#pragma xlang transform fullunroll NUloop
#pragma xlang transform fullunroll MUloop

Control the Loads and Stores

#pragma xlang transform scalarize_in b in kloop
#pragma xlang transform scalarize_in a in kloop
#pragma xlang transform scalarize_in&out c in kloop
#pragma xlang transform lift kloop.loads before kloop
#pragma xlang transform lift kloop.stores after kloop
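For NU = 1 and MU = 4, the combined effect of these pragmas can be imagined as a register-tiled kernel: four elements of c are held in scalars across the k loop, loaded before it and stored after it. The code below is a hand-written guess at that shape, not output of the X tool. The block size NB = 8, the variable names, and the check function are assumptions for this illustration.

```c
#include <assert.h>

enum { NB = 8, MU = 4 };   /* NU = 1, so the i loop needs no unrolling */

/* Plain kernel as written before the transformations. */
void mmm_plain(double c[NB][NB], double a[NB][NB], double b[NB][NB]) {
    for (int i = 0; i < NB; i++)
        for (int j = 0; j < NB; j++)
            for (int k = 0; k < NB; k++)
                c[i][j] = c[i][j] + a[i][k] * b[k][j];
}

/* Hand-written sketch of the kernel after stripmine, interchange,
   fullunroll, scalarize, and lift with NU = 1 and MU = 4. */
void mmm_register_tiled(double c[NB][NB], double a[NB][NB], double b[NB][NB]) {
    assert(NB % MU == 0);   /* this sketch has no remainder loop */
    for (int i = 0; i < NB; i++) {
        for (int j = 0; j < NB; j += MU) {
            /* loads of c lifted before the k loop */
            double c0 = c[i][j],     c1 = c[i][j + 1];
            double c2 = c[i][j + 2], c3 = c[i][j + 3];
            for (int k = 0; k < NB; k++) {
                double a0 = a[i][k];                      /* scalarized a */
                double b0 = b[k][j],     b1 = b[k][j + 1]; /* scalarized b */
                double b2 = b[k][j + 2], b3 = b[k][j + 3];
                c0 = c0 + a0 * b0;
                c1 = c1 + a0 * b1;
                c2 = c2 + a0 * b2;
                c3 = c3 + a0 * b3;
            }
            /* stores of c lifted after the k loop */
            c[i][j] = c0;     c[i][j + 1] = c1;
            c[i][j + 2] = c2; c[i][j + 3] = c3;
        }
    }
}

/* Returns 1 if both kernels agree on small integer-valued inputs. */
int kernel_matches_plain(void) {
    double a[NB][NB], b[NB][NB], c1[NB][NB], c2[NB][NB];
    for (int i = 0; i < NB; i++)
        for (int j = 0; j < NB; j++) {
            a[i][j] = i + j;
            b[i][j] = i - j + 1;
            c1[i][j] = 2.0;
            c2[i][j] = 2.0;
        }
    mmm_plain(c1, a, b);
    mmm_register_tiled(c2, a, b);
    for (int i = 0; i < NB; i++)
        for (int j = 0; j < NB; j++)
            if (c1[i][j] != c2[i][j]) return 0;
    return 1;
}
```

Keeping the four c values in scalars is the point of the scalarize and lift steps: the k loop then touches memory only for a and b.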

slide-35
SLIDE 35

Experimental Results (ctd.)

2x Intel Itanium 2 (Madison), 1.3 GHz, 256 KB L2 and 1.5 MB L3

slide-36
SLIDE 36

Conclusion

■ Pro

  • Easy to Generate Multiple Program Versions
  • No Knowledge of Compiler Internals necessary
  • Precise Specification of Transformations
  • Defining new Transformations
  • Macros and Pragmas

■ Contra

  • No Dependence Analysis
  • No Type Safety
  • Clear Focus on Loops
  • Programmer has to do the Job
  • Gets difficult to read and understand

  • Error prone
  • Exhaustive Search
slide-37
SLIDE 37

Questions?