
SLIDE 1

IOPin: Runtime Profiling of Parallel I/O in HPC Systems

Seong Jo (Shawn) Kim*, Seung Woo Son+, Wei-keng Liao+,
Mahmut Kandemir*, Rajeev Thakur#, and Alok Choudhary+

*: Pennsylvania State University
+: Northwestern University
#: Argonne National Laboratory

Parallel Data Storage Workshop 12

SLIDE 2

Outline

  • Motivation
  • Overview
  • Background: Pin
  • Technical Details
  • Evaluation
  • Conclusion & Future Work

SLIDE 3

Motivation

  • Users of HPC systems frequently find that it is the storage system, not the CPU, memory, or network, that limits application performance.
  • I/O behavior is the key factor in determining overall performance.
  • Many I/O-intensive scientific applications use a parallel I/O software stack to access files with high performance.
  • Understanding how the parallel I/O system operates, and the issues involved, is critically important.
  • Understand I/O behavior!

SLIDE 4

Motivation (cont’d)

  • Manual instrumentation for understanding I/O behavior is extremely difficult and error-prone.
  • Most parallel scientific applications are expected to run on large-scale systems with more than 100,000 processors to achieve better resolution.
  • Collecting and analyzing trace data from such systems is challenging and burdensome.

SLIDE 5

Our Approach

  • IOPin: a dynamic performance profiling and visualization tool
  • We leverage lightweight binary instrumentation using probe mode in Pin:
    – language-independent instrumentation for scientific applications written in C/C++ and Fortran
    – neither source code modification nor recompilation of the application or the I/O stack components is required
  • IOPin provides a hierarchical view of parallel I/O:
    – associating each MPI I/O call issued by the application with its sub-calls in the PVFS layer below
  • It provides detailed I/O performance metrics for each I/O call: I/O latency at each layer, # of disk accesses, and disk throughput.
  • Low overhead: ~7%

SLIDE 6

Background: Pin

  • Pin is a software system that performs runtime binary instrumentation.
  • Pin supports two modes of instrumentation: JIT mode and probe mode.
  • JIT mode uses a just-in-time compiler to recompile the program code and insert instrumentation, while probe mode uses code trampolines (jumps) for instrumentation.
  • In JIT mode, the incurred overhead ranges from 38.7% to 78% of the total execution time with 32, 64, 128, and 256 processes.
  • In probe mode, the overhead is about 7% (see the probe-mode sketch below).
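
As an illustration of probe-mode instrumentation, here is a minimal Pin tool sketch, not the IOPin source: it inserts a trampoline-based call before MPI_File_write_all in the unmodified binary. It only counts calls; a real tool such as IOPin would also record timestamps and instrument the MPI-IO and PVFS layers. The probed routine name comes from the later slides; everything else assumes only the standard Pin probe-mode API (pin.H).

// Minimal Pin probe-mode tool sketch (not the IOPin source).
#include "pin.H"

static UINT64 writeAllCalls = 0;

// Analysis routine: runs natively right before MPI_File_write_all.
static VOID BeforeWriteAll()
{
    ++writeAllCalls;   // a real tool would also take a timestamp here
}

// Instrumentation routine: called once per loaded image in probe mode.
static VOID ImageLoad(IMG img, VOID *)
{
    RTN rtn = RTN_FindByName(img, "MPI_File_write_all");
    if (RTN_Valid(rtn) && RTN_IsSafeForProbedInsertion(rtn))
    {
        // Probe mode: Pin patches the routine entry with a jump (trampoline).
        RTN_InsertCallProbed(rtn, IPOINT_BEFORE,
                             (AFUNPTR)BeforeWriteAll, IARG_END);
    }
}

int main(int argc, char *argv[])
{
    PIN_InitSymbols();
    if (PIN_Init(argc, argv)) return 1;

    IMG_AddInstrumentFunction(ImageLoad, 0);

    // Probe mode: the application runs natively, never returns here.
    PIN_StartProgramProbed();
    return 0;
}

Such a tool is built against the Pin kit and launched by prefixing the unmodified application binary with the pin driver (for an MPI job, under each rank), which is what allows instrumentation without source changes or recompilation.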

SLIDE 7

Overview: IOPin

  • The Pin process on the client creates two trace log records, one for the MPI library and one for the PVFS client:
    – rank, mpi_call_id, pvfs_call_id, I/O type (write/read), latency
  • The Pin process on the server produces a trace log record with server_id, latency, processed bytes, # of disk accesses, and disk throughput.
  • Each log record is sent to the log manager, and the log manager identifies the process that has the maximum latency.
  • The Pin process then instruments that target process (illustrated below).
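
For illustration only, a sketch of what the two kinds of trace log records could look like and how a log manager would pick the record with the maximum latency; the struct and function names are assumptions built from the fields listed above, not the IOPin source.

// Illustrative trace-record layouts (field names follow the slide).
#include <algorithm>
#include <cstdint>
#include <vector>

struct ClientTraceLog {           // emitted by the client-side Pin process
    int      rank;
    uint64_t mpi_call_id;
    uint64_t pvfs_call_id;
    bool     is_write;            // I/O type (write/read)
    double   latency_sec;
};

struct ServerTraceLog {           // emitted by the server-side Pin process
    int      server_id;
    double   latency_sec;
    uint64_t processed_bytes;
    uint64_t disk_accesses;
    double   disk_throughput_mbs;
};

// The log manager keeps the record with the maximum latency so that Pin can
// later instrument only that process in detail (assumes logs is non-empty).
ClientTraceLog SlowestClient(const std::vector<ClientTraceLog> &logs)
{
    return *std::max_element(logs.begin(), logs.end(),
        [](const ClientTraceLog &a, const ClientTraceLog &b) {
            return a.latency_sec < b.latency_sec;
        });
}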

SLIDE 8

Technical Details

Call-flow diagram: the write path through the I/O stack (application, high-level I/O library such as PnetCDF, MPI-IO library, PVFS client, PVFS server) together with the IOPin instrumentation points.

  • Original call flow:
    MPI_File_write_all → PVFS_sys_write → PVFS_sys_io(…, hints) → io_start_flow(*smcb, …) → flow_callback(*flow_d, …) → trove_write_callback_fn(*user_ptr, …)

  • Hint injection: PVFS_sys_write is a macro,
    #define PVFS_sys_write(ref,req,off,buf,mem_req,creds,resp) PVFS_sys_io(ref,req,off,buf,mem_req,creds,resp, PVFS_IO_WRITE, PVFS_HINT_NULL)
    The Pin call flow packs the trace info. (rank, mpi_call_id, pvfs_call_id) into PVFS_hints and replaces PVFS_HINT_NULL with PVFS_hints, so the call becomes
    PVFS_sys_io(ref,req,off,buf,mem_req,creds,resp, PVFS_IO_WRITE, PVFS_hints)

  • Instrumentation points: client starting/ending point, server starting/ending point, and disk starting/ending point.

  • Client-side Pin process: generates trace info. for MPI_File_write_all() and sends a log to the client log manager. The client log manager returns the record with the maximum latency for the I/O, and Pin selectively instruments the corresponding MPI process.

  • Server-side Pin process: searches the hints in *smcb passed from the traced process, extracts the trace info., generates a log, and sends it to the server log manager. The server log manager identifies and instruments the I/O server with the maximum latency.
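
To make the hint-injection step concrete, here is a minimal client-side sketch, not the IOPin source: it packs the trace fields into a PVFS hint that the replaced call passes instead of PVFS_HINT_NULL. The iopin_trace struct and the hint name "pvfs.hint.iopin_trace" are made up for illustration, and the PVFS_hint_add() signature is assumed from pvfs2-hint.h in pvfs-2.8.2; check it against the release in use.

// Hypothetical trace record carried from the MPI layer to the PVFS server
// inside a hint (field names follow the earlier slides).
#include <stdint.h>
extern "C" {
#include "pvfs2.h"   /* PVFS_hint, PVFS_HINT_NULL, PVFS_hint_add() (assumed) */
}

struct iopin_trace {
    int32_t  rank;          /* MPI rank issuing the I/O             */
    uint64_t mpi_call_id;   /* id of the MPI_File_write_all() call  */
    uint64_t pvfs_call_id;  /* id of the PVFS_sys_io() sub-call     */
    int32_t  io_type;       /* write or read                        */
};

/* Builds the hint that the Pin call flow substitutes for PVFS_HINT_NULL in
 * PVFS_sys_io(). "pvfs.hint.iopin_trace" is an illustrative hint name. */
static PVFS_hint pack_trace_hint(const iopin_trace *t)
{
    PVFS_hint hints = PVFS_HINT_NULL;
    PVFS_hint_add(&hints, "pvfs.hint.iopin_trace",
                  sizeof(*t), (void *)t);
    return hints;   /* used as the last argument of PVFS_sys_io() */
}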

SLIDE 9

Computation Methodology: Latency and Throughput

  • For each I/O operation:
    – the I/O latency computed at each layer is the maximum of the I/O latencies reported by the layers below;
    – the I/O throughput computed at each layer is the sum of the I/O throughputs reported by the layers below.
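
Stated as formulas (our notation, not the slide's): for an I/O operation observed at layer $\ell$, with $C(\ell)$ the set of sub-calls it spawns in the layer below, and $L$ and $T$ denoting latency and throughput,

    L_\ell = \max_{c \in C(\ell)} L_c, \qquad T_\ell = \sum_{c \in C(\ell)} T_c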

SLIDE 10

Evaluation

  • Hardware:
    – Breadboard cluster at Argonne National Laboratory
    – 8 quad-core processors per node (supports 32 MPI processes)
    – 16 GB main memory
  • I/O stack configuration:
    – Application: S3D I/O
    – PnetCDF (pnetcdf-1.2.0), mpich2-1.4, pvfs-2.8.2
  • PVFS configuration:
    – 1 metadata server
    – 8 I/O servers
    – 256 MPI processes

SLIDE 11

Evaluation: S3D-IO

  • S3D-IO
    – the I/O kernel of the S3D application
    – a parallel turbulent combustion application using a direct numerical simulation solver developed at SNL
  • A checkpoint is performed at regular intervals.
    – At each checkpoint, four global arrays, representing the variables of mass, velocity, pressure, and temperature, are written to files.
  • We maintain the block size of the partitioned X-Y-Z dimensions at 200 × 200 × 200.
  • It generates three checkpoint files, 976.6 MB each.

SLIDE 12

Evaluation: Comparison of S3D I/O Execution Time

SLIDE 13

Evaluation: Detailed Execution Time of S3D I/O

SLIDE 14

Evaluation: I/O Throughput of S3D I/O

SLIDE 15

Conclusion & Future Work

  • Understanding I/O behavior is one of the most important steps toward efficient execution of parallel scientific applications.
  • IOPin provides dynamic instrumentation to understand I/O behavior without significantly affecting performance:
    – no source code modification or recompilation
    – a hierarchical view of each I/O call from the MPI library down to the PVFS server
    – metrics: latency at each layer, # of fragmented I/O calls, # of disk accesses, I/O throughput
    – ~7% overhead
  • Work is underway (1) to test IOPin at very large process counts and (2) to employ it for runtime I/O optimization.

SLIDE 16

Questions?
