Performance Visualization of Hybrid Cell Applications (ScicomP 15) - PowerPoint Presentation




SLIDE 1

Computer Science » Computer Engineering » Computer Architecture

Performance Visualization of Hybrid Cell Applications

ScicomP 15, May 19th, Barcelona
{holger.brunst, daniel.hackenberg}@tu-dresden.de

SLIDE 2

Holger Brunst, Daniel Hackenberg Slide 2

Outline

  • Introduction
  • Software Tracing on Cell Systems
  • Implementation and Functionality
  • Examples and Overhead
  • Summary

SLIDE 3

[Block diagram: the PowerPC Processor Element (PPE) with PowerPC core, L1 and L2 caches, and eight SPEs, each with an SPU and Local Store (LS), attached to the Element Interconnect Bus together with the Memory Interface Controller (MIC, dual XDR) and the Bus Interface Controller (BIC, FlexIO). SPE: Synergistic Processor Element; LS: Local Store]

Cell Broadband Engine

SLIDE 4

Cell Broadband Engine

Vast Resources

  • SPEs: SIMD cores for fast calculations, 256 KB local store (LS, software controlled), dedicated DMA engine (MFC)
  • PPE: very simple PowerPC core for OS (Linux) and control tasks

Sophisticated Architecture

  • Complex software development process
  • Different compilers and programs for PPE and SPEs
  • SPEs use DMA commands to access main memory or the LS of other SPEs; asynchronous execution by MFC
  • Mailbox communication between PPE and SPEs

Tool Support

SLIDE 5

Trace-based Analysis

Why do we still need to analyze?

  • HPC: system complexity increases constantly
  • Parallelism enters the mainstream market, and not many people know how to deal with it

Approaches

  • Profilers do not give detailed insight into the timing behavior of an application
  • Detailed online analysis is pretty much impossible because of intrusion and data amount

Tracing

  • Records application behavior step-wise
  • Tracing is an option to capture the dynamic behavior of parallel applications
  • Performance analysis is done on a post-mortem basis
SLIDE 6

Background

What is Vampir?

  • Performance monitoring and analysis tool
  • Targets the visualization of dynamic processes on massively parallel (compute) systems

History

  • Development started more than 15 years ago at Research Centre Jülich, ZAM
  • Since 1997, developed at TU Dresden (first in collaboration with Pallas GmbH; 2003-2005: Intel Software & Solutions Group; since January 2006: TU Dresden, ZIH / GWT-TUD)

Availability

  • Unix, Windows, and Mac OS
  • Visualization components (Vampir) are commercial
  • Monitor components (VampirTrace) are Open Source
SLIDE 7

[Diagram: up to 10,000 application CPUs, each running application code instrumented with VampirTrace, write trace data in the Open Trace Format (OTF) as m trace parts; n VampirServer analysis tasks (n << m) load the OTF trace parts and serve the Vampir visualization client]

SLIDE 8

Flavors

Vampir

  • Sequential event analysis
  • Rich set of graphical performance views
  • For desktops and small parallel production environments
  • Less scalable

VampirServer

  • Distributed client/server approach
  • Parallel analysis
  • New features

Vampir for Windows

  • Modern Qt-based GUI
  • Released at ISC 2009, Hamburg
  • Currently: beta release
SLIDE 9

Outline

  • Introduction
  • Software Tracing on Cell Systems
  • Implementation and Functionality
  • Examples and Overhead
  • Summary

SLIDE 10

Software Tracing on Cell Systems

PPE

  • Conventional tools with PowerPC support run unmodified
  • Modifications necessary to support SPE threads

SPE

  • New concept needs to be designed, suitable for this architecture
  • New monitor necessary to generate events
  • Local store too small, only temporary storage of events
  • Synchronization of PPE and SPE timers necessary
SLIDE 11

Trace Monitor Concept

[Diagram: PPE and SPE 0 ... SPE n on the Element Interconnect Bus, with trace buffers Buf 1/0 ... Buf m/n in main memory and trace files on the I/O system]

  • Instrumented PPE program: conventional monitoring tool with enhancements to cover e.g. mailbox communication with SPEs
  • Instrumented SPE program writes trace events into a small trace buffer (Buf 1 / Buf 2) in the local store; the buffers switch each time the current trace buffer is full
  • DMA transfer of a full trace buffer to main memory runs in the background while the SPE program keeps running
  • The PPE processes the SPE trace buffers (post mortem) and writes the trace files (one for the PPE, one per SPE) to disk

SLIDE 12

Trace Visualization for Cell (1)

[Timeline display: Processes 1 to 4 over time, each alternating between Region 1 and Region 2]

Illustration of parallel processes in a typical timeline display

SLIDE 13

Trace Visualization for Cell (2)

[Timeline: PPE Process 1 with SPE Threads 1 to 3 shown beneath it, each alternating between Region 1 and Region 2]

Illustration of SPE threads as children of the PPE process

SLIDE 14

Trace Visualization for Cell (3)

[Timeline: PPE Process 1 and SPE Threads 1 to 3 with mailbox messages drawn as lines between them]

Illustration of mailbox messages

  • Classic two-sided communication (send/receive)
  • Illustrated by lines similar to MPI messages
SLIDE 15

Trace Visualization for Cell (4)

[Timeline: PPE Process 1, Main Memory, and SPE Threads 1 and 2, with read and write DMA transfers drawn to the main memory bar]

Illustration of DMA transfers between SPEs and main memory

  • PPE is not involved
  • Main memory is represented as an independent bar
  • Allows graphical representation of memory states (read/write)
SLIDE 16

Trace Visualization for Cell (5)

[Timeline: PPE Process 1, Main Memory, and SPE Threads 1 and 2, with DMA get and DMA put operations between the SPEs]

DMA transfers between SPEs

  • Challenge: communication is one-sided
  • Peer-to-peer send/receive representation unsuitable

Distinction of active and passive partner?

  • Additional lines
  • Additional bullets (active partner)
  • Even more bullets? (passive partner)
SLIDE 17

Trace Visualization for Cell (6)

[Timeline: PPE Process 1, Main Memory, and SPE Threads 1 and 2, with a DMA wait phase marked between t1 and t2]

  t_0 = get_timestamp();
  mfc_get();
  [...]
  t_1 = get_timestamp();
  wait_for_dma_tag();
  t_2 = get_timestamp();

  • DMA wait operation creates two events (at t1 and t2)
  • Allows illustration of DMA wait time
  • Similar for mailbox messages
SLIDE 18

Outline

  • Introduction
  • Software Tracing on Cell Systems
  • Implementation and Functionality
  • Examples and Overhead
  • Summary

SLIDE 19

Implementation

Prototype implementation based on VampirTrace (VT)

  • Open Source
  • http://www.tu-dresden.de/zih/vampirtrace

Additional tool: CellTrace

  • Header files for PPE and SPE programs: instrumentation of inline functions provided by the Cell SDK
  • Library for PPE programs + library for SPE programs

[Build flow diagram: spu_code_1.c ... spu_code_n.c are compiled with the SPU compiler (-DCTRACE) using celltrace_spu.h, linked against celltrace_spu.a into spu_object.o, and embedded as spu_lib.a; ppu_code_1.c ... ppu_code_m.c are compiled with vtcc (-DCTRACE) using celltrace_ppu.h and linked against celltrace_ppu.a into ppu_object.o; the archiver combines both into a trace-enabled cell_binary]

SLIDE 20

Trace Visualization with Vampir (1)

Visualization of a Cell trace using Vampir

  • Simple demo program
  • 4 SPEs only
SLIDE 21

Trace Visualization with Vampir (2)

SLIDE 22

Trace Visualization with Vampir (3)

Complex DMA transfers of SPE 3

SLIDE 23

Outline

  • Introduction
  • Software Tracing on Cell Systems
  • Implementation and Functionality
  • Examples and Overhead
  • Summary

SLIDE 24

Example Cell Applications: FFT (1)

FFT at a synchronization point; 8 SPEs, 64 KByte page size, 11.9 GFLOPS

SLIDE 25

Example Cell Applications: FFT (2)

FFT at a synchronization point; 8 SPEs, 16 MByte page size, 42.9 GFLOPS

SLIDE 26

Example Applications: Cholesky (1)

Cholesky transformation with 8 SPEs

  • Overview with DMA communication of SPE 3
SLIDE 27

Example Applications: Cholesky (2)

Cholesky transformation with 8 SPEs; enlargement with DMA communication of SPE 3

SLIDE 28

Example Cell Applications: RAxML (1)

RAxML (Randomized Accelerated Maximum Likelihood) with 8 SPEs, ramp-up phase

SLIDE 29

Example Cell Applications: RAxML (2)

RAxML with 8 SPEs; 4000 ns window, enlargement of a small loop; shifted start of loop, constant runtime

SLIDE 30

Example Cell Applications: RAxML (3)

RAxML with 8 SPEs; 4000 ns window, enlargement of a small loop (modified); synchronous start, memory contention

SLIDE 31

Example Cell Applications: RAxML (4)

RAxML with 16 SPEs, load imbalance

SLIDE 32

Hybrid Cell/MPI Application: PBPI

PBPI (Parallel Bayesian Phylogenetic Inference)

  • On 3 QS21 blades (6 Cell processors)
SLIDE 33

SLIDE 34

Combined Charts in New GUI


SLIDE 35

Overhead

Overhead sources

  • Creating events
  • Transferring trace data from the SPEs to main memory
  • Trace buffer and trace library use space in the local store (< 12 KByte)

Additional overhead

  • Initialization and processing of SPE event data
  • Outside of SPE runtime, analysis unaffected

Experimental overhead measurements (QS21, 8 SPEs):

                      Original (GFLOPS)   Tracing (GFLOPS)   Overhead
  SGEMM                    203.25             200.73           1.3 %
  FFT                       11.93              11.85           0.7 %
  Cholesky, SPOTRF         143.17             139.32           2.8 %
  Cholesky, DGEMM            4.48               4.10           9.2 % (*)
  Cholesky, STRSM            5.73               5.64           1.7 %

(*) Increased overhead due to intense usage of DMA lists. Trace overhead without DMAs: 1.4 %

SLIDE 36

Summary & Future Work

Concept and Prototype for Performance Tracing on Cell

  • CellTrace
  • Typical overhead: less than 5 percent

Visualization of Traces with Vampir

  • Creates valuable insight into the runtime behavior of Cell applications
  • Intuitive performance visualization and verification
  • Support for large, hybrid Cell/MPI applications

Future work

  • Improved tracing, e.g. full integration in VampirTrace, providing additional analysis features such as alignment checks
  • Improved visualization, e.g. by colorizing DMA messages (tag, size or bandwidth), displaying intensity of main memory accesses

SLIDE 37

Computer Science » Computer Engineering » Computer Architecture

THANK YOU

Ronny Brendel, Jens Doleschal, Ronald Geisler, Daniel Hackenberg, Robert Henschel, Matthias Jurenz, Andreas Knüpfer, Matthias Lieber, Holger Mickler, Dr. Hartmut Mix, Dr. Matthias Müller, Prof. Wolfgang E. Nagel, Michael Peter, Matthias Weber, Thomas William

http://www.tu-dresden.de/zih/cell/trace
http://www.vampir.eu
