
SLIDE 1

IOPin: Runtime Profiling of Parallel I/O in HPC Systems

Seong Jo (Shawn) Kim*, Seung Woo Son+, Wei-keng Liao+,
Mahmut Kandemir*, Rajeev Thakur#, and Alok Choudhary+

*: Pennsylvania State University
+: Northwestern University
#: Argonne National Laboratory

Parallel Data Storage Workshop 12

SLIDE 2

Outline

  • Motivation
  • Overview
  • Background: Pin
  • Technical Details
  • Evaluation
  • Conclusion & Future Work

SLIDE 3

Motivation

  • Users of HPC systems frequently find that it is the storage system, not the CPU, memory, or network, that limits application performance.
  • I/O behavior is the key factor in determining overall performance.
  • Many I/O-intensive scientific applications use a parallel I/O software stack to access files with high performance.
  • Understanding how the parallel I/O system operates, and the issues involved, is critically important.
  • Understand I/O behavior!

SLIDE 4

Motivation (cont’d)

  • Manual instrumentation for understanding I/O behavior is extremely difficult and error-prone.
  • Most parallel scientific applications are expected to run on large-scale systems with more than 100,000 processors to achieve better resolution.
  • Collecting and analyzing trace data from such systems is challenging and burdensome.

SLIDE 5

Our Approach

  • IOPin: a dynamic performance profiling and visualization tool
  • We leverage lightweight binary instrumentation using probe mode in Pin:
    – language-independent instrumentation for scientific applications written in C/C++ and Fortran
    – neither source code modification nor recompilation of the application or the I/O stack components is required
  • IOPin provides a hierarchical view of parallel I/O:
    – associating each MPI I/O call issued by the application with its sub-calls in the PVFS layer below
  • It provides detailed I/O performance metrics for each I/O call: I/O latency at each layer, # of disk accesses, and disk throughput.
  • Low overhead: ~7%

SLIDE 6

Background: Pin

  • Pin is a software system that performs runtime binary instrumentation.
  • Pin supports two modes of instrumentation: JIT mode and probe mode.
  • JIT mode uses a just-in-time compiler to recompile the program code and insert instrumentation, while probe mode uses code trampolines (jumps) for instrumentation.
  • In JIT mode, the incurred overhead ranges from 38.7% to 78% of the total execution time with 32, 64, 128, and 256 processes.
  • In probe mode, the overhead is about 7% (see the probe-mode sketch below).
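
As an illustration of probe-mode instrumentation, here is a minimal Pin tool sketch, not the IOPin source: it inserts a trampoline-based call before MPI_File_write_all in the unmodified binary. It only counts calls; a real tool such as IOPin would also record timestamps and instrument the MPI-IO and PVFS layers. The probed routine name comes from the later slides; everything else assumes only the standard Pin probe-mode API (pin.H).

// Minimal Pin probe-mode tool sketch (not the IOPin source).
#include "pin.H"

static UINT64 writeAllCalls = 0;

// Analysis routine: runs natively right before MPI_File_write_all.
static VOID BeforeWriteAll()
{
    ++writeAllCalls;   // a real tool would also take a timestamp here
}

// Instrumentation routine: called once per loaded image in probe mode.
static VOID ImageLoad(IMG img, VOID *)
{
    RTN rtn = RTN_FindByName(img, "MPI_File_write_all");
    if (RTN_Valid(rtn) && RTN_IsSafeForProbedInsertion(rtn))
    {
        // Probe mode: Pin patches the routine entry with a jump (trampoline).
        RTN_InsertCallProbed(rtn, IPOINT_BEFORE,
                             (AFUNPTR)BeforeWriteAll, IARG_END);
    }
}

int main(int argc, char *argv[])
{
    PIN_InitSymbols();
    if (PIN_Init(argc, argv)) return 1;

    IMG_AddInstrumentFunction(ImageLoad, 0);

    // Probe mode: the application runs natively, never returns here.
    PIN_StartProgramProbed();
    return 0;
}

Such a tool is built against the Pin kit and launched by prefixing the unmodified application binary with the pin driver (for an MPI job, under each rank), which is what allows instrumentation without source changes or recompilation.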

SLIDE 7

Overview: IOPin

  • The Pin process on the client creates two trace log records, one for the MPI library and one for the PVFS client:
    – rank, mpi_call_id, pvfs_call_id, I/O type (write/read), latency
  • The Pin process on the server produces a trace log record with server_id, latency, processed bytes, # of disk accesses, and disk throughput.
  • Each log record is sent to the log manager, and the log manager identifies the process that has the maximum latency.
  • The Pin process then instruments that target process (illustrated below).
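
For illustration only, a sketch of what the two kinds of trace log records could look like and how a log manager would pick the record with the maximum latency; the struct and function names are assumptions built from the fields listed above, not the IOPin source.

// Illustrative trace-record layouts (field names follow the slide).
#include <algorithm>
#include <cstdint>
#include <vector>

struct ClientTraceLog {           // emitted by the client-side Pin process
    int      rank;
    uint64_t mpi_call_id;
    uint64_t pvfs_call_id;
    bool     is_write;            // I/O type (write/read)
    double   latency_sec;
};

struct ServerTraceLog {           // emitted by the server-side Pin process
    int      server_id;
    double   latency_sec;
    uint64_t processed_bytes;
    uint64_t disk_accesses;
    double   disk_throughput_mbs;
};

// The log manager keeps the record with the maximum latency so that Pin can
// later instrument only that process in detail (assumes logs is non-empty).
ClientTraceLog SlowestClient(const std::vector<ClientTraceLog> &logs)
{
    return *std::max_element(logs.begin(), logs.end(),
        [](const ClientTraceLog &a, const ClientTraceLog &b) {
            return a.latency_sec < b.latency_sec;
        });
}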

SLIDE 8

Technical Details

Call-flow diagram: the write path through the I/O stack (application, high-level I/O library such as PnetCDF, MPI-IO library, PVFS client, PVFS server) together with the IOPin instrumentation points.

  • Original call flow:
    MPI_File_write_all → PVFS_sys_write → PVFS_sys_io(…, hints) → io_start_flow(*smcb, …) → flow_callback(*flow_d, …) → trove_write_callback_fn(*user_ptr, …)

  • Hint injection: PVFS_sys_write is a macro,
    #define PVFS_sys_write(ref,req,off,buf,mem_req,creds,resp) PVFS_sys_io(ref,req,off,buf,mem_req,creds,resp, PVFS_IO_WRITE, PVFS_HINT_NULL)
    The Pin call flow packs the trace info. (rank, mpi_call_id, pvfs_call_id) into PVFS_hints and replaces PVFS_HINT_NULL with PVFS_hints, so the call becomes
    PVFS_sys_io(ref,req,off,buf,mem_req,creds,resp, PVFS_IO_WRITE, PVFS_hints)

  • Instrumentation points: client starting/ending point, server starting/ending point, and disk starting/ending point.

  • Client-side Pin process: generates trace info. for MPI_File_write_all() and sends a log to the client log manager. The client log manager returns the record with the maximum latency for the I/O, and Pin selectively instruments the corresponding MPI process.

  • Server-side Pin process: searches the hints in *smcb passed from the traced process, extracts the trace info., generates a log, and sends it to the server log manager. The server log manager identifies and instruments the I/O server with the maximum latency.
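
To make the hint-injection step concrete, here is a minimal client-side sketch, not the IOPin source: it packs the trace fields into a PVFS hint that the replaced call passes instead of PVFS_HINT_NULL. The iopin_trace struct and the hint name "pvfs.hint.iopin_trace" are made up for illustration, and the PVFS_hint_add() signature is assumed from pvfs2-hint.h in pvfs-2.8.2; check it against the release in use.

// Hypothetical trace record carried from the MPI layer to the PVFS server
// inside a hint (field names follow the earlier slides).
#include <stdint.h>
extern "C" {
#include "pvfs2.h"   /* PVFS_hint, PVFS_HINT_NULL, PVFS_hint_add() (assumed) */
}

struct iopin_trace {
    int32_t  rank;          /* MPI rank issuing the I/O             */
    uint64_t mpi_call_id;   /* id of the MPI_File_write_all() call  */
    uint64_t pvfs_call_id;  /* id of the PVFS_sys_io() sub-call     */
    int32_t  io_type;       /* write or read                        */
};

/* Builds the hint that the Pin call flow substitutes for PVFS_HINT_NULL in
 * PVFS_sys_io(). "pvfs.hint.iopin_trace" is an illustrative hint name. */
static PVFS_hint pack_trace_hint(const iopin_trace *t)
{
    PVFS_hint hints = PVFS_HINT_NULL;
    PVFS_hint_add(&hints, "pvfs.hint.iopin_trace",
                  sizeof(*t), (void *)t);
    return hints;   /* used as the last argument of PVFS_sys_io() */
}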

SLIDE 9

Computation Methodology: Latency and Throughput

  • For each I/O operation:
    – the I/O latency computed at each layer is the maximum of the I/O latencies reported by the layers below;
    – the I/O throughput computed at each layer is the sum of the I/O throughputs reported by the layers below.
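
Stated as formulas (our notation, not the slide's): for an I/O operation observed at layer $\ell$, with $C(\ell)$ the set of sub-calls it spawns in the layer below, and $L$ and $T$ denoting latency and throughput,

    L_\ell = \max_{c \in C(\ell)} L_c, \qquad T_\ell = \sum_{c \in C(\ell)} T_c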

SLIDE 10

Evaluation

  • Hardware:
    – Breadboard cluster at Argonne National Laboratory
    – 8 quad-core processors per node (supports 32 MPI processes)
    – 16 GB main memory
  • I/O stack configuration:
    – Application: S3D I/O
    – PnetCDF (pnetcdf-1.2.0), mpich2-1.4, pvfs-2.8.2
  • PVFS configuration:
    – 1 metadata server
    – 8 I/O servers
    – 256 MPI processes

SLIDE 11

Evaluation: S3D-IO

  • S3D-IO
    – the I/O kernel of the S3D application
    – a parallel turbulent combustion application using a direct numerical simulation solver developed at SNL
  • A checkpoint is performed at regular intervals.
    – At each checkpoint, four global arrays, representing the variables of mass, velocity, pressure, and temperature, are written to files.
  • We maintain the block size of the partitioned X-Y-Z dimensions at 200 × 200 × 200.
  • It generates three checkpoint files, 976.6 MB each.

SLIDE 12

Evaluation: Comparison of S3D I/O Execution Time

SLIDE 13

Evaluation: Detailed Execution Time of S3D I/O

SLIDE 14

Evaluation: I/O Throughput of S3D I/O

SLIDE 15

Conclusion & Future Work

  • Understanding I/O behavior is one of the most important steps toward efficient execution of parallel scientific applications.
  • IOPin provides dynamic instrumentation to understand I/O behavior without significantly affecting performance:
    – no source code modification or recompilation
    – a hierarchical view of each I/O call from the MPI library down to the PVFS server
    – metrics: latency at each layer, # of fragmented I/O calls, # of disk accesses, I/O throughput
    – ~7% overhead
  • Work is underway (1) to test IOPin at very large process counts and (2) to employ it for runtime I/O optimization.

SLIDE 16

Questions?
