Performance Visualization of Hybrid Cell Applications
ScicomP 15, May 19th, Barcelona
holger.brunst | daniel.hackenberg @ tu-dresden.de
Holger Brunst, Daniel Hackenberg Slide 2
Outline
- Introduction
- Software Tracing on Cell Systems
- Implementation and Functionality
- Examples and Overhead
- Summary
Cell Broadband Engine

[Diagram] Block diagram: the PowerPC Processor Element (PPE: PowerPC core
with L1/L2 caches) and eight Synergistic Processor Elements (SPEs, each an
SPU with its Local Store) share the Element Interconnect Bus, which also
connects the Memory Interface Controller (MIC, dual XDR) and the Bus
Interface Controller (BIC, FlexIO).
SPE: Synergistic Processor Element, LS: Local Store
Cell Broadband Engine
Vast Resources
- SPEs: SIMD cores for fast calculations, 256 KB local store
  (LS, software controlled), dedicated DMA engine (MFC)
- PPE: very simple PowerPC core for OS (Linux) and control tasks

Sophisticated Architecture
- Complex software development process
- Different compilers and programs for PPE and SPEs
- SPEs use DMA commands to access main memory or the LS of
  other SPEs; asynchronous execution by the MFC
- Mailbox communication between PPE and SPEs
Tool Support
Trace-based Analysis
Why do we still need to analyze?
- HPC: system complexity increases constantly
- Parallelism enters the mainstream market, and not many
  people know how to deal with it

Approaches
- Profilers do not give detailed insight into the timing
  behavior of an application
- Detailed online analysis is pretty much impossible because
  of intrusion and the amount of data

Tracing
- Records application behavior step-wise
- An option to capture the dynamic behavior of parallel
  applications
- Performance analysis is done on a post-mortem basis
Background
What is Vampir?
- Performance monitoring and analysis tool
- Targets the visualization of dynamic processes
  on massively parallel (compute) systems

History
- Development started more than 15 years ago at Research
  Centre Jülich, ZAM
- Since 1997, developed at TU Dresden
  (first in collaboration with Pallas GmbH, from 2003-2005 with the
  Intel Software & Solutions Group, since January 2006 at TU Dresden,
  ZIH / GWT-TUD)

Availability
- Unix, Windows, and Mac OS
- Visualization components (Vampir) are commercial
- Monitor components (VampirTrace) are Open Source
[Diagram] VampirServer architecture: VampirTrace runs alongside the
application on each of up to 10,000 CPUs and records trace data (OTF) into
m OTF trace parts; VampirServer analyzes the parts in parallel using n
tasks, with n << m, and feeds the Vampir visualization.
Flavors
Vampir
- Sequential event analysis
- Rich set of graphical performance views
- For desktops and small parallel production environments
- Less scalable

VampirServer
- Distributed client/server approach
- Parallel analysis
- New features

Vampir for Windows
- Modern Qt-based GUI
- Released at ISC 2009, Hamburg
- Currently: beta release
Outline
- Introduction
- Software Tracing on Cell Systems
- Implementation and Functionality
- Examples and Overhead
- Summary
Software Tracing on Cell Systems
PPE
- Conventional tools with PowerPC support run unmodified
- Modifications necessary to support SPE threads

SPE
- A new concept needs to be designed that suits this architecture
- A new monitor is necessary to generate events
- Local store is too small, allows only temporary storage of events
- Synchronization of PPE and SPE timers is necessary
Trace Monitor Concept
[Diagram] Main memory, the PPE, and SPEs 0..n connected by the Element
Interconnect Bus; each SPE's local store holds two small trace buffers
(Buf 1, Buf 2), main memory holds buffers Buf 1/0 ... Buf m/n, and trace
files go to the I/O system. Annotations:
- PPE: conventional monitoring tool with enhancements to cover e.g.
  mailbox communication with SPEs (instrumented PPE program)
- SPE: instrumented SPE program writes trace events into a small
  trace buffer
- DMA transfer of a full trace buffer to main memory happens in the
  background; the SPE program keeps running
- Buffers switch each time the current trace buffer is full
- The PPE processes the SPE trace buffers (post mortem) and writes
  trace files (PPE, SPE 0 ... SPE n) to disk
Trace Visualization for Cell (1)

[Diagram] Timeline display: Processes 1-4 on the vertical (location) axis,
time on the horizontal axis, with colored bars for Region 1 and Region 2.

Illustration of parallel processes in a typical timeline display
Trace Visualization for Cell (2)

[Diagram] Timeline display: PPE Process 1 with SPE Threads 1-3 below it,
each SPE thread showing Region 1 and Region 2 bars over time.

Illustration of SPE threads as children of the PPE process
Trace Visualization for Cell (3)

[Diagram] Timeline display: PPE Process 1 and SPE Threads 1-3, with
message lines between them.

Illustration of mailbox messages
- Classic two-sided communication (send/receive)
- Illustrated by lines, similar to MPI messages
Trace Visualization for Cell (4)

[Diagram] Timeline displays: Main Memory shown as an independent bar next
to PPE Process 1 and SPE Threads 1-2; read/write and DMA get / DMA put
lines connect the SPE threads with the Main Memory bar.

Illustration of DMA transfers between SPEs and main memory
- PPE is not involved
- Main memory is represented as an independent bar
- Allows graphical representation of memory states (read/write)
Trace Visualization for Cell (5)
DMA transfers between SPEs
- Challenge: communication is one-sided
- Peer-to-peer send/receive representation is unsuitable

Distinction of active and passive partner?
- Additional lines
- Additional bullets (active partner)
- Even more bullets? (passive partner)
[Diagram] Timeline display: Main Memory, PPE Process 1, and SPE Threads
1-2, with lines and bullets marking DMA transfers between the SPE threads.
Trace Visualization for Cell (6)

  t_0 = get_timestamp();
  mfc_get();
  [...]
  t_1 = get_timestamp();
  wait_for_dma_tag();
  t_2 = get_timestamp();

- DMA wait operation creates two events (at t_1 and t_2)
- Allows illustration of DMA wait time
- Similar for mailbox messages
Outline
- Introduction
- Software Tracing on Cell Systems
- Implementation and Functionality
- Examples and Overhead
- Summary
Implementation
Prototype implementation based on VampirTrace (VT)
- Open Source
- http://www.tu-dresden.de/zih/vampirtrace

Additional tool: CellTrace
- Header files for PPE and SPE programs: instrumentation of the inline
  functions provided by the Cell SDK
- One library for PPE programs and one for SPE programs
[Diagram] Build flow: spu_code_1.c ... spu_code_n.c are compiled with the
SPU compiler (-DCTRACE) and, via celltrace_spu.h, linked against
celltrace_spu.a into spu_object.o / spu_lib.a by the archiver and embedder;
ppu_code_1.c ... ppu_code_m.c are compiled with vtcc (-DCTRACE) and, via
celltrace_ppu.h, linked against celltrace_ppu.a into ppu_object.o; the
result is a trace-enabled cell_binary.
Trace Visualization with Vampir (1)

Visualization of a Cell trace using Vampir
- Simple demo program
- 4 SPEs only
Trace Visualization with Vampir (2)
Trace Visualization with Vampir (3)

Complex DMA transfers of SPE 3
Outline
- Introduction
- Software Tracing on Cell Systems
- Implementation and Functionality
- Examples and Overhead
- Summary
Example Cell Applications: FFT (1)

FFT at a synchronization point: 8 SPEs, 64 KByte page size, 11.9 GFLOPS
Example Cell Applications: FFT (2)

FFT at a synchronization point: 8 SPEs, 16 MByte page size, 42.9 GFLOPS
Example Applications: Cholesky (1)

Cholesky factorization with 8 SPEs
- Overview with DMA communication of SPE 3
Example Applications: Cholesky (2)

Cholesky factorization with 8 SPEs
- Enlargement with DMA communication of SPE 3
Example Cell Applications: RAxML (1)
RAxML (Randomized Accelerated Maximum Likelihood) with 8 SPEs, ramp-up phase
Example Cell Applications: RAxML (2)

RAxML with 8 SPEs, 4000 ns window: enlargement of a small loop.
Shifted start of the loop, constant runtime.
Example Cell Applications: RAxML (3)

RAxML with 8 SPEs, 4000 ns window: enlargement of a small loop (modified).
Synchronous start, memory contention.
Example Cell Applications: RAxML (4)
RAxML with 16 SPEs, load imbalance
Hybrid Cell/MPI Application: PBPI

PBPI (Parallel Bayesian Phylogenetic Inference)
- On 3 QS21 blades (6 Cell processors)
Combined Charts in New GUI
Overhead

Overhead sources
- Creating events
- Transferring trace data from the SPEs to main memory
- Trace buffer and trace library use space in the local store (< 12 KByte)

Additional overhead
- Initialization and processing of SPE event data
- Outside of SPE runtime; analysis unaffected

Experimental overhead measurements (QS21, 8 SPEs):

                      Original (GFLOPS)   Tracing (GFLOPS)   Overhead
  SGEMM                     203.25              200.73         1.3 %
  FFT                        11.93               11.85         0.7 %
  Cholesky, SPOTRF          143.17              139.32         2.8 %
  Cholesky, DGEMM             4.48                4.10         9.2 % (*)
  Cholesky, STRSM             5.73                5.64         1.7 %

(*) Increased overhead due to intense use of DMA lists. Trace overhead
without DMAs: 1.4 %
Summary & Future Work

Concept and Prototype for Performance Tracing on Cell
- CellTrace
- Typical overhead: less than 5 percent

Visualization of Traces with Vampir
- Creates valuable insight into the runtime behavior of Cell
  applications
- Intuitive performance visualization and verification
- Support for large, hybrid Cell/MPI applications

Future work
- Improved tracing, e.g. full integration into VampirTrace,
  providing additional analysis features such as alignment checks
- Improved visualization, e.g. colorizing DMA messages by tag,
  size, or bandwidth, and displaying the intensity of main
  memory accesses
THANK YOU
Matthias Jurenz, Andreas Knüpfer, Matthias Lieber, Holger Mickler,
Dr. Hartmut Mix, Ronny Brendel, Jens Doleschal, Ronald Geisler,
Daniel Hackenberg, Robert Henschel, Dr. Matthias Müller,
Prof. Wolfgang E. Nagel, Michael Peter, Matthias Weber, Thomas William
http://www.tu-dresden.de/zih/cell/trace http://www.vampir.eu