Performance Visualization of Hybrid Cell Applications (ScicomP 15) - PowerPoint Presentation




SLIDE 1

Computer Science » Computer Engineering » Computer Architecture

Performance Visualization of Hybrid Cell Applications

ScicomP 15, May 19th, Barcelona
{holger.brunst, daniel.hackenberg}@tu-dresden.de

SLIDE 2

Holger Brunst, Daniel Hackenberg Slide 2

Outline

  • Introduction
  • Software Tracing on Cell Systems
  • Implementation and Functionality
  • Examples and Overhead
  • Summary

SLIDE 3

[Block diagram: the PowerPC Processor Element (PPE) with PowerPC core, L1 and L2 caches, and eight SPEs, each with an SPU and Local Store (LS), attached to the Element Interconnect Bus together with the Memory Interface Controller (MIC, dual XDR) and the Bus Interface Controller (BIC, FlexIO). SPE: Synergistic Processor Element; LS: Local Store]

Cell Broadband Engine

SLIDE 4

Cell Broadband Engine

Vast Resources

  • SPEs: SIMD cores for fast calculations, 256 KB local store (LS, software controlled), dedicated DMA engine (MFC)
  • PPE: very simple PowerPC core for OS (Linux) and control tasks

Sophisticated Architecture

  • Complex software development process
  • Different compilers and programs for PPE and SPEs
  • SPEs use DMA commands to access main memory or the LS of other SPEs; asynchronous execution by MFC
  • Mailbox communication between PPE and SPEs

Tool Support

SLIDE 5

Trace-based Analysis

Why do we still need to analyze?

  • HPC: system complexity increases constantly
  • Parallelism enters the mainstream market, and not many people know how to deal with it

Approaches

  • Profilers do not give detailed insight into the timing behavior of an application
  • Detailed online analysis is pretty much impossible because of intrusion and data amount

Tracing

  • Records application behavior step-wise
  • Tracing is an option to capture the dynamic behavior of parallel applications
  • Performance analysis is done on a post-mortem basis
SLIDE 6

Background

What is Vampir?

  • Performance monitoring and analysis tool
  • Targets the visualization of dynamic processes on massively parallel (compute) systems

History

  • Development started more than 15 years ago at Research Centre Jülich, ZAM
  • Since 1997, developed at TU Dresden (first in collaboration with Pallas GmbH; 2003-2005: Intel Software & Solutions Group; since January 2006: TU Dresden, ZIH / GWT-TUD)

Availability

  • Unix, Windows, and Mac OS
  • Visualization components (Vampir) are commercial
  • Monitor components (VampirTrace) are Open Source
SLIDE 7

[Diagram: up to 10,000 application CPUs, each running application code instrumented with VampirTrace, write trace data in the Open Trace Format (OTF) as m trace parts; n VampirServer analysis tasks (n << m) load the OTF trace parts and serve the Vampir visualization client]

SLIDE 8

Flavors

Vampir

  • Sequential event analysis
  • Rich set of graphical performance views
  • For desktops and small parallel production environments
  • Less scalable

VampirServer

  • Distributed client/server approach
  • Parallel analysis
  • New features

Vampir for Windows

  • Modern Qt-based GUI
  • Released at ISC 2009, Hamburg
  • Currently: beta release
SLIDE 9

Outline

  • Introduction
  • Software Tracing on Cell Systems
  • Implementation and Functionality
  • Examples and Overhead
  • Summary

SLIDE 10

Software Tracing on Cell Systems

PPE

  • Conventional tools with PowerPC support run unmodified
  • Modifications necessary to support SPE threads

SPE

  • New concept needs to be designed, suitable for this architecture
  • New monitor necessary to generate events
  • Local store too small, only temporary storage of events
  • Synchronization of PPE and SPE timers necessary
SLIDE 11

Trace Monitor Concept

[Diagram: PPE and SPE 0 ... SPE n on the Element Interconnect Bus, with trace buffers Buf 1/0 ... Buf m/n in main memory and trace files on the I/O system]

  • Instrumented PPE program: conventional monitoring tool with enhancements to cover e.g. mailbox communication with SPEs
  • Instrumented SPE program writes trace events into a small trace buffer (Buf 1 / Buf 2) in the local store; the buffers switch each time the current trace buffer is full
  • DMA transfer of a full trace buffer to main memory runs in the background while the SPE program keeps running
  • The PPE processes the SPE trace buffers (post mortem) and writes the trace files (one for the PPE, one per SPE) to disk

SLIDE 12

Trace Visualization for Cell (1)

[Timeline display: Processes 1 to 4 over time, each alternating between Region 1 and Region 2]

Illustration of parallel processes in a typical timeline display

SLIDE 13

Trace Visualization for Cell (2)

[Timeline: PPE Process 1 with SPE Threads 1 to 3 shown beneath it, each alternating between Region 1 and Region 2]

Illustration of SPE threads as children of the PPE process

SLIDE 14

Trace Visualization for Cell (3)

[Timeline: PPE Process 1 and SPE Threads 1 to 3 with mailbox messages drawn as lines between them]

Illustration of mailbox messages

  • Classic two-sided communication (send/receive)
  • Illustrated by lines similar to MPI messages
SLIDE 15

Trace Visualization for Cell (4)

[Timeline: PPE Process 1, Main Memory, and SPE Threads 1 and 2, with read and write DMA transfers drawn to the main memory bar]

Illustration of DMA transfers between SPEs and main memory

  • PPE is not involved
  • Main memory is represented as an independent bar
  • Allows graphical representation of memory states (read/write)
SLIDE 16

Trace Visualization for Cell (5)

[Timeline: PPE Process 1, Main Memory, and SPE Threads 1 and 2, with DMA get and DMA put operations between the SPEs]

DMA transfers between SPEs

  • Challenge: communication is one-sided
  • Peer-to-peer send/receive representation unsuitable

Distinction of active and passive partner?

  • Additional lines
  • Additional bullets (active partner)
  • Even more bullets? (passive partner)
SLIDE 17

Trace Visualization for Cell (6)

[Timeline: PPE Process 1, Main Memory, and SPE Threads 1 and 2, with a DMA wait phase marked between t1 and t2]

  t_0 = get_timestamp();
  mfc_get();
  [...]
  t_1 = get_timestamp();
  wait_for_dma_tag();
  t_2 = get_timestamp();

  • DMA wait operation creates two events (at t1 and t2)
  • Allows illustration of DMA wait time
  • Similar for mailbox messages
SLIDE 18

Outline

  • Introduction
  • Software Tracing on Cell Systems
  • Implementation and Functionality
  • Examples and Overhead
  • Summary

SLIDE 19

Implementation

Prototype implementation based on VampirTrace (VT)

  • Open Source
  • http://www.tu-dresden.de/zih/vampirtrace

Additional tool: CellTrace

  • Header files for PPE and SPE programs: instrumentation of inline functions provided by the Cell SDK
  • Library for PPE programs + library for SPE programs

[Build flow diagram: spu_code_1.c ... spu_code_n.c are compiled with the SPU compiler (-DCTRACE) using celltrace_spu.h, linked against celltrace_spu.a into spu_object.o, and embedded as spu_lib.a; ppu_code_1.c ... ppu_code_m.c are compiled with vtcc (-DCTRACE) using celltrace_ppu.h and linked against celltrace_ppu.a into ppu_object.o; the archiver combines both into a trace-enabled cell_binary]

SLIDE 20

Trace Visualization with Vampir (1)

Visualization of a Cell trace using Vampir

  • Simple demo program
  • 4 SPEs only
SLIDE 21

Trace Visualization with Vampir (2)

SLIDE 22

Trace Visualization with Vampir (3)

Complex DMA transfers of SPE 3

SLIDE 23

Outline

  • Introduction
  • Software Tracing on Cell Systems
  • Implementation and Functionality
  • Examples and Overhead
  • Summary

SLIDE 24

Example Cell Applications: FFT (1)

FFT at a synchronization point; 8 SPEs, 64 KByte page size, 11.9 GFLOPS

SLIDE 25

Example Cell Applications: FFT (2)

FFT at a synchronization point; 8 SPEs, 16 MByte page size, 42.9 GFLOPS

SLIDE 26

Example Applications: Cholesky (1)

Cholesky transformation with 8 SPEs

  • Overview with DMA communication of SPE 3
SLIDE 27

Example Applications: Cholesky (2)

Cholesky transformation with 8 SPEs; enlargement with DMA communication of SPE 3

SLIDE 28

Example Cell Applications: RAxML (1)

RAxML (Randomized Accelerated Maximum Likelihood) with 8 SPEs, ramp-up phase

SLIDE 29

Example Cell Applications: RAxML (2)

RAxML with 8 SPEs; 4000 ns window, enlargement of a small loop; shifted start of loop, constant runtime

SLIDE 30

Example Cell Applications: RAxML (3)

RAxML with 8 SPEs; 4000 ns window, enlargement of a small loop (modified); synchronous start, memory contention

SLIDE 31

Example Cell Applications: RAxML (4)

RAxML with 16 SPEs, load imbalance

SLIDE 32

Hybrid Cell/MPI Application: PBPI

PBPI (Parallel Bayesian Phylogenetic Inference)

  • On 3 QS21 blades (6 Cell processors)
SLIDE 33

SLIDE 34

Combined Charts in New GUI


SLIDE 35

Overhead

Overhead sources

  • Creating events
  • Transferring trace data from the SPEs to main memory
  • Trace buffer and trace library use space in the local store (< 12 KByte)

Additional overhead

  • Initialization and processing of SPE event data
  • Outside of SPE runtime, analysis unaffected

Experimental overhead measurements (QS21, 8 SPEs):

                      Original (GFLOPS)   Tracing (GFLOPS)   Overhead
  SGEMM                    203.25             200.73           1.3 %
  FFT                       11.93              11.85           0.7 %
  Cholesky, SPOTRF         143.17             139.32           2.8 %
  Cholesky, DGEMM            4.48               4.10           9.2 % (*)
  Cholesky, STRSM            5.73               5.64           1.7 %

(*) Increased overhead due to intense usage of DMA lists. Trace overhead without DMAs: 1.4 %

SLIDE 36

Summary & Future Work

Concept and Prototype for Performance Tracing on Cell

  • CellTrace
  • Typical overhead: less than 5 percent

Visualization of Traces with Vampir

  • Creates valuable insight into the runtime behavior of Cell applications
  • Intuitive performance visualization and verification
  • Support for large, hybrid Cell/MPI applications

Future work

  • Improved tracing, e.g. full integration in VampirTrace, providing additional analysis features such as alignment checks
  • Improved visualization, e.g. by colorizing DMA messages (tag, size or bandwidth), displaying intensity of main memory accesses

SLIDE 37

Computer Science » Computer Engineering » Computer Architecture

THANK YOU

Ronny Brendel, Jens Doleschal, Ronald Geisler, Daniel Hackenberg, Robert Henschel, Matthias Jurenz, Andreas Knüpfer, Matthias Lieber, Holger Mickler, Dr. Hartmut Mix, Dr. Matthias Müller, Prof. Wolfgang E. Nagel, Michael Peter, Matthias Weber, Thomas William

http://www.tu-dresden.de/zih/cell/trace
http://www.vampir.eu
