Combining Instrumentation and Sampling for Trace-based Application Performance Analysis – PowerPoint PPT Presentation



SLIDE 1

Thomas Ilsche (thomas.ilsche@tu-dresden.de), Joseph Schuchart, Robert Schöne, Daniel Hackenberg

Center for Information Services and High Performance Computing (ZIH)

Combining Instrumentation and Sampling for Trace-based Application Performance Analysis

8th International Parallel Tools Workshop Stuttgart, Germany, October 2, 2014

SLIDE 2

Introduction

Thomas Ilsche 4

Looking at the landscape of performance analysis tools
– Identify established techniques
– Provide a structured overview
– Highlight strengths and weaknesses

Identify novel combinations
– Combine strengths
– Mitigate weaknesses
– Look beyond the traditional fields of tools

SLIDE 3

Classification of performance analysis techniques

Based on [10] Juckeland, G.: Trace-based Performance Analysis for Hardware Accelerators. Ph.D. thesis, TU Dresden (2012)

[Diagram: performance analysis layers and techniques. Data Acquisition: Sampling, Event-based Instrumentation; Data Recording: Logging, Summarization; Data Presentation: Timelines, Profiles]


SLIDE 5

Data Acquisition: Event-based Instrumentation

Event-based instrumentation; also: direct instrumentation, event trigger, probe-based measurement, or simply instrumentation. Modification of the application execution in order to record and present certain intrinsic events of the application execution, e.g., function entry and exit events.

[Diagram: execution of main, foo, and bar over time, with the measurement environment entered at each function entry and exit]
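The idea can be sketched in a few lines of Python using `sys.setprofile`, which invokes a hook on every function entry and exit. This is an illustrative analogue of event-based instrumentation, not a tool from the talk; `foo`, `bar`, and `hook` are made-up names:

```python
import sys

events = []  # the recorded event log

def hook(frame, event, arg):
    # record entry/exit events for the demo functions only
    if event in ("call", "return") and frame.f_code.co_name in ("foo", "bar"):
        events.append((event, frame.f_code.co_name))

def bar():
    return 1

def foo():
    return bar() + bar()

sys.setprofile(hook)   # "instrument" the application
foo()
sys.setprofile(None)   # remove the measurement environment again

print(events)
# [('call', 'foo'), ('call', 'bar'), ('return', 'bar'),
#  ('call', 'bar'), ('return', 'bar'), ('return', 'foo')]
```

Note that the hook runs on every single call, which is exactly why the overhead scales with the function call rate.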

SLIDE 6

Data Acquisition: Event-based Instrumentation

Overhead & perturbation depend on function call rate
– Hard to predict in complex applications
– Can be influenced by filtering function calls
  • Preferably statically, not during runtime

Complete information
– Accurate function call counts
– Message properties (semantics of function call arguments)
– Analysis tools may rely on completeness

SLIDE 7

Data Acquisition: Event-based Instrumentation

Various instrumentation methods available
– Compiler instrumentation *
– Library wrapping **
– Source code transformation *
– Manual instrumentation *
– Binary instrumentation

* Requires recompilation & separate performance measurement binary
** Requires relinking for statically linked binaries

SLIDE 8

Data Acquisition: Sampling

Sampling; also: statistical sampling or (ambiguously) profiling. Periodic interruption of a running program and inspection of its state.

[Diagram: execution of main, foo, and bar, sampled by the measurement environment at 200us, 400us, 600us, and 800us]
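A minimal pure-Python sketch of this "interrupt and inspect" idea: a background thread periodically grabs the main thread's current stack via `sys._current_frames()` and records the call path. All names (`sampler`, `busy_work`) and the interval/duration values are illustrative, not from the talk:

```python
import collections
import sys
import threading
import time

samples = collections.Counter()  # call path -> number of samples

def sampler(target_tid, interval, duration):
    # periodically inspect the target thread's stack (statistical sampling)
    end = time.monotonic() + duration
    while time.monotonic() < end:
        frame = sys._current_frames().get(target_tid)
        path = []
        while frame is not None:           # walk the stack, innermost first
            path.append(frame.f_code.co_name)
            frame = frame.f_back
        samples["|".join(reversed(path))] += 1
        time.sleep(interval)

def busy_work(n):
    return sum(i * i for i in range(n))

t = threading.Thread(target=sampler, args=(threading.get_ident(), 0.005, 0.25))
t.start()
while t.is_alive():
    busy_work(10_000)
t.join()

# the dominant call paths should contain busy_work
print(samples.most_common(3))
```

Unlike the instrumentation sketch, the application code is completely unmodified; the measurement cost depends only on the sampling interval.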

SLIDE 9

Data Acquisition: Sampling

Overhead & perturbation depend on sampling rate
– Can be predicted
– Can be controlled
– Stack unwinding introduces uncertainty

Easy to use (for end users)
– No recompilation or relinking necessary
– No filtering necessary
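Why the overhead is predictable can be shown with simple arithmetic; the per-sample handling cost of 2 µs below is an assumed, illustrative figure (the 2.6 kSa/s rate appears later in the talk):

```python
# assumed cost to take and store one sample (illustrative figure)
cost_per_sample = 2e-6            # seconds
rate = 2600                       # samples per second (2.6 kSa/s)

overhead = rate * cost_per_sample  # fraction of runtime spent sampling
print(f"predicted overhead: {overhead:.2%}")  # predicted overhead: 0.52%
```

The same calculation cannot be done for event-based instrumentation, because the function call rate of a complex application is not known in advance.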

SLIDE 10

Data Acquisition: Sampling

Incomplete information
– No accurate function call counts
– No specific message properties or other semantics of function arguments

Measurement has statistical value
– More reliable for longer-running experiments

Trade-off between accuracy and perturbation via sampling rate

SLIDE 11

Classification of performance analysis techniques

Based on [10] Juckeland, G.: Trace-based Performance Analysis for Hardware Accelerators. Ph.D. thesis, TU Dresden (2012)

[Diagram: performance analysis layers and techniques. Data Acquisition: Sampling, Event-based Instrumentation; Data Recording: Logging, Summarization; Data Presentation: Timelines, Profiles]

SLIDE 12

Summarization vs Logging

Event-based Instrumentation, Logging:
0000us: Enter main
0050us: Enter foo
0100us: Enter bar
0300us: Leave bar
0650us: Leave foo
0700us: Enter bar
0900us: Leave bar
1000us: Leave main

Defines how the recording during runtime is performed.

SLIDE 13

Summarization vs Logging

Event-based Instrumentation, Summarization:
count[main]++
count[foo]++
count[bar]++
time[bar]+=200
time[foo]+=600
count[bar]++
time[bar]+=200
time[main]+=1000

Event-based Instrumentation, Logging:
0000us: Enter main
0050us: Enter foo
0100us: Enter bar
0300us: Leave bar
0650us: Leave foo
0700us: Enter bar
0900us: Leave bar
1000us: Leave main

Defines how the recording during runtime is performed.
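The summarization values follow mechanically from the event log; a short Python sketch using the log from the slide:

```python
from collections import defaultdict

# event log from the slide: (timestamp in us, event, function)
log = [
    (0, "Enter", "main"), (50, "Enter", "foo"), (100, "Enter", "bar"),
    (300, "Leave", "bar"), (650, "Leave", "foo"), (700, "Enter", "bar"),
    (900, "Leave", "bar"), (1000, "Leave", "main"),
]

count = defaultdict(int)      # function call counts
time_incl = defaultdict(int)  # inclusive time per function
stack = []                    # enter timestamps of currently open regions

for ts, kind, name in log:
    if kind == "Enter":
        count[name] += 1
        stack.append(ts)
    else:
        time_incl[name] += ts - stack.pop()

print(dict(count))      # {'main': 1, 'foo': 1, 'bar': 2}
print(dict(time_incl))  # {'bar': 400, 'foo': 600, 'main': 1000}
```

This direction always works: a log can be summarized into a profile later, but a profile cannot be expanded back into a log.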

SLIDE 14

Summarization vs Logging

Event-based Instrumentation, Summarization:
count[main]++
count[foo]++
count[bar]++
time[bar]+=200
time[foo]+=600
count[bar]++
time[bar]+=200
time[main]+=1000

Event-based Instrumentation, Logging:
0000us: Enter main
0050us: Enter foo
0100us: Enter bar
0300us: Leave bar
0650us: Leave foo
0700us: Enter bar
0900us: Leave bar
1000us: Leave main

Sampling, Logging:
200us: main|foo|bar
400us: main|foo
600us: main|foo
800us: main|bar

Defines how the recording during runtime is performed.

SLIDE 15

Summarization vs Logging

Event-based Instrumentation, Summarization:
count[main]++
count[foo]++
count[bar]++
time[bar]+=200
time[foo]+=600
count[bar]++
time[bar]+=200
time[main]+=1000

Event-based Instrumentation, Logging:
0000us: Enter main
0050us: Enter foo
0100us: Enter bar
0300us: Leave bar
0650us: Leave foo
0700us: Enter bar
0900us: Leave bar
1000us: Leave main

Sampling, Summarization:
time_ex[bar] += 200
time_ex[foo] += 200
time_ex[foo] += 200
time_ex[bar] += 200

Sampling, Logging:
200us: main|foo|bar
400us: main|foo
600us: main|foo
800us: main|bar

Defines how the recording during runtime is performed.
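The time_ex values come from attributing each sampling interval to the innermost sampled function; a sketch with the four samples from the slide:

```python
from collections import defaultdict

interval = 200  # us between samples
# call-path samples from the slide, innermost function last
sampled_paths = [
    (200, ["main", "foo", "bar"]),
    (400, ["main", "foo"]),
    (600, ["main", "foo"]),
    (800, ["main", "bar"]),
]

time_ex = defaultdict(int)
for ts, path in sampled_paths:
    # charge the whole interval to the innermost (exclusive) function
    time_ex[path[-1]] += interval

print(dict(time_ex))  # {'bar': 400, 'foo': 400}
```

Comparing with the instrumentation profile shows the statistical error: sampling charges bar 400 us exclusive time, but it has no idea that bar was called exactly twice.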

SLIDE 16

Summarization vs Logging

Event-based Instrumentation, Summarization:
count[main]++
count[foo]++
count[bar]++
time[bar]+=200
time[foo]+=600
count[bar]++
time[bar]+=200
time[main]+=1000

Event-based Instrumentation, Logging:
0000us: Enter main
0050us: Enter foo
0100us: Enter bar
0300us: Leave bar
0650us: Leave foo
0700us: Enter bar
0900us: Leave bar
1000us: Leave main

Sampling, Summarization:
time_ex[bar] += 200
time_ex[foo] += 200
time_ex[foo] += 200
time_ex[bar] += 200

Sampling, Logging:
200us: main|foo|bar
400us: main|foo
600us: main|foo
800us: main|bar

Summarization loses information; logging requires memory at runtime.

Defines how the recording during runtime is performed.

SLIDE 17

Classification of performance analysis techniques

Based on [10] Juckeland, G.: Trace-based Performance Analysis for Hardware Accelerators. Ph.D. thesis, TU Dresden (2012)

[Diagram: performance analysis layers and techniques. Data Acquisition: Sampling, Event-based Instrumentation; Data Recording: Logging, Summarization; Data Presentation: Timelines, Profiles]

SLIDE 18

Data Presentation

Example timeline showing call-path and event annotations (Vampir)
  • Needs logging during recording

Example profile (gprof)
  • Can be generated by summarization, but also from logging

Each sample counts as 0.01 seconds.
   %  cumulative   self             self    total
 time    seconds  seconds  calls  ms/call  ms/call  name
 33.34      0.02     0.02   7208     0.00     0.00  open
 16.67      0.03     0.01    244     0.04     0.12  offtime
 16.67      0.04     0.01      8     1.25     1.25  memccpy
 16.67      0.05     0.01      7     1.43     1.43  write

SLIDE 19

Classification of performance analysis techniques

Based on [10] Juckeland, G.: Trace-based Performance Analysis for Hardware Accelerators. Ph.D. thesis, TU Dresden (2012)

[Diagram: performance analysis layers and techniques. Data Acquisition: Sampling, Event-based Instrumentation; Data Recording: Logging, Summarization; Data Presentation: Timelines, Profiles. Tracing is the path Logging to Timelines; Profiling is the path Summarization to Profiles]

SLIDE 20

Tools

[Diagram: example tools and concepts placed along two axes, Event-based Instrumentation vs. Sampling and Profiling vs. Tracing: VampirTrace, TAU, Scalasca, Extrae, HPCToolkit, gprof, Score-P, perf, Allinea MAP]

SLIDE 21

Combining Performance Analysis Techniques (1)

C++ graph code INDDGO, OpenMP, 4 threads
– Uninstrumented: < 6 seconds
– Instrumented (profiling): 72 seconds, i.e. 1100% overhead!
– A trace file would be ~3.8 GB, with even more overhead

SLIDE 22

Combining Performance Analysis Techniques (1)

Estimated aggregate size of event trace:                   3851MB
Estimated requirements for largest trace buffer (max_buf): 3851MB
Estimated memory requirements (SCOREP_TOTAL_MEMORY):       3860MB
(hint: When tracing set SCOREP_TOTAL_MEMORY=3860MB to avoid intermediate flushes
 or reduce requirements using USR regions filters.)

type  max_buf[B]     visits       time[s]  region
ALL   4,038,048,140  161,849,290  119.61   ALL
USR   4,038,047,650  161,849,275  115.72   USR
OMP   412            12           0.07     OMP
COM   78             3            3.82     COM
USR   365,389,440    14,053,440   3.58     Graph::lcgrand(int)
USR   322,737,636    12,412,986   5.78     std::_List_iterator<int>::operator*() const
USR   208,735,202    8,028,277    3.70     std::_List_iterator<int>::operator++()
USR   201,389,266    7,745,741    3.02     std::_List_iterator<int>::_List_iterator…
USR   200,350,128    12,521,883   6.12     std::_List_iterator<int>::operator!=…
…
USR   1,040,000      40,000       0.01     Graph::Node* std::__addressof…

72 functions with > 1 million visits
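A quick check with the max_buf numbers from the scorep-score output above shows why filtering alone is tedious here: even the five largest USR regions account for only about a third of the estimated buffer, and 72 functions would have to be filtered individually.

```python
# max_buf[B] of the five largest USR regions from the scorep-score output
top_usr = [365_389_440, 322_737_636, 208_735_202, 201_389_266, 200_350_128]
total = 4_038_048_140  # max_buf of ALL

share = sum(top_usr) / total
print(f"top 5 USR regions: {share:.0%} of the estimated buffer")  # 32%
```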

SLIDE 23

Combining Performance Analysis Techniques (1)

MPI instrumentation
– For messages, complete information is very important during analysis
– MPI functions generally imply a certain minimum load
  • Lower relative overhead expected compared to very short computation functions

Call-path sampling
– For function execution, statistical information may be sufficient
– Controlling the overhead of compiler instrumentation via filtering is not straightforward

Tracing (Logging → Timelines)

Prototype in VampirTrace
– MPI: traditional library interposition with PMPI
– Sampling: performance-counter-based interrupt (e.g. every 1 million cycles)
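The interposition idea can be sketched in Python with a wrapper that records the complete event, including argument semantics, before delegating to the real function. This is an illustrative analogue of a PMPI wrapper, not actual MPI code; `send` and its signature are made up for the example:

```python
import functools

trace = []  # recorded events

def interpose(func):
    # analogue of a PMPI wrapper: record complete event information
    # (including argument semantics), then call the real implementation
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        trace.append(("enter", func.__name__, args, kwargs))
        result = func(*args, **kwargs)
        trace.append(("leave", func.__name__))
        return result
    return wrapper

@interpose
def send(dest, nbytes):   # stand-in for a communication routine
    return nbytes

send(1, nbytes=4096)
print(trace)
# [('enter', 'send', (1,), {'nbytes': 4096}), ('leave', 'send')]
```

The wrapper sees every call with its full arguments, which is exactly the "complete information" that sampling cannot provide for messages.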

SLIDE 24

Combining Performance Analysis Techniques (1)

Example with NPB-BT, 1 sample every 1 million cycles (~385 us)

SLIDE 25

Combining Performance Analysis Techniques (1)

MPI instrumentation and call-path sampling. Example with NPB-BT.

* Runtime filter: matmul_sub, matvec_sub, binvrhs, binvcrhs, lhsinit, exact_solution → the filter helps with overhead, but no more information about those functions is recorded
** Sampling rate of 2.6 kSa/s

SLIDE 26

Combining Performance Analysis Techniques (2)

MPI and compiler instrumentation and performance counter sampling
– Tracing (Logging → Timelines)
– Implemented as a VampirTrace/Score-P metrics plugin
– MPI: library interposition with PMPI
– Functions: compiler instrumentation
– Hardware counters: a monitoring thread wakes up in regular intervals and reads performance counters from the application thread
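The monitoring-thread idea can be sketched in pure Python; here `time.process_time` stands in for a hardware performance counter of the application thread, and the interval/duration values are illustrative:

```python
import threading
import time

samples = []  # (wall-clock timestamp, counter value)

def monitor(interval, duration):
    # wake up in regular intervals and read a 'counter' for the process
    # (time.process_time stands in for a real hardware counter)
    end = time.monotonic() + duration
    while time.monotonic() < end:
        samples.append((time.monotonic(), time.process_time()))
        time.sleep(interval)

t = threading.Thread(target=monitor, args=(0.01, 0.1))
t.start()
x = 0
while t.is_alive():   # keep the 'application thread' busy
    x += 1
t.join()

values = [v for _, v in samples]
assert values == sorted(values)  # the counter only ever increases
print(f"{len(samples)} samples recorded")
```

The application thread runs undisturbed; only the monitoring thread pays the cost of reading and recording the metric.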

SLIDE 27

Combining Performance Analysis Techniques (2)

NPB FT, class B, 16 processes, 1 ms sampling interval

SLIDE 28

Combining Performance Analysis Techniques (2)

SLIDE 29

Combining Performance Analysis Techniques (2)

SLIDE 30

Combining Performance Analysis Techniques (2)

SLIDE 31

Combining Performance Analysis Techniques (2)

SLIDE 32

Combining Performance Analysis Techniques (2)

Normalized trace sizes of NPB CLASS B, sampled (1 kSa/s). Baseline: trace without counters. Filtered functions: matmul_sub, matvec_sub, binvcrhs, exact_solution.

SLIDE 33

Combining Performance Analysis Techniques

Event-based instrumentation for:
– MPI, SHMEM, …
– CUDA
– Manual instrumentation of longer program phases
– Function instrumentation together with sophisticated filtering

Sampling for:
– Call stacks of programs where filtering is not feasible (often C++)
– Hardware counters
– External metrics (e.g. power consumption)

SLIDE 34

Conclusion

Combine strengths and mitigate weaknesses by selecting the right technique for different aspects of performance analysis
– Sampling and instrumentation complement each other
– Many tools already cross the borders of a 'single technique'

Use clear terminology for techniques
– Separate the description of data acquisition, data recording, and data presentation
– Event-based instrumentation vs. sampling
– Tracing vs. profiling

SLIDE 35

Outlook

Sampling in Score-P 1.4
– Planned for 2014
– Will be experimental
– New trace records for samples

Analyzing & visualizing merged call-stack samples and region instrumentation
– A sample is valid for a point in time
– Instrumentation covers time ranges
– Overlap?