Replicating HPC I/O Workloads With Proxy Applications

Oct 24, 2022

SLIDE 1

Replicating HPC I/O Workloads With Proxy Applications

James Dickson, Steven Wright, Stephen Jarvis - University of Warwick
Satheesh Maheswaran, Andy Herdman - UK Atomic Weapons Establishment
Marc C. Miller - Lawrence Livermore National Laboratory

SLIDE 2

Motivation

I/O investigation goals?
– Benchmarking systems
– Tuning application behaviour
– Tuning software stack
– Changing paradigm
– Changing hardware technology

SLIDE 3

Motivation

Working with a mini application or proxy is less cumbersome, more streamlined and more portable than working with the full application.
Developing and maintaining a representative proxy for every application is time consuming and probably redundant.
Ideally we would like to experiment while minimising the time spent making code changes and writing new implementations.

SLIDE 4

Outline

– Background: Proxy app and I/O library
– Replication Components
– Case Study
– Conclusion

SLIDE 5

Background: MACSio

“Multi-purpose Application-Centric, Scalable I/O Proxy Application”
Two key characteristics:
– Level of Abstraction: POSIX, MPI-IO, SILO, HDF5 and beyond…
– Degree of Flexibility: dump type, dataset composition, user defined data objects
Multi-purpose is achieved through a plugin-based design: if you have a library or interface to work with, write a plugin!

SLIDE 6

Background: TyphonIO

[Diagram: I/O software stack, top to bottom: Application, High Level I/O Library, Middleware, Parallel File System, Storage Hardware. Labels: TyphonIO, HDF5, Scientific Data Model, Parallel Interface]

SLIDE 7

Background: TyphonIO

Overlays a hierarchical data model on the parallel I/O interface
Designed to use HDF5 in a consistent way that can be optimised for the data model, e.g. efficient use of chunking in the mesh structure

[Diagram: TyphonIO data hierarchy with objects File, State 1, State 2..N, Mesh, Material, Quants, Vargroup, Variable; legend: Chunked Object]

SLIDE 8

Replication: Profiling

Darshan I/O characterisation chosen for lightweight profiling
Instrumentation overhead indistinguishable from machine noise in our experiments
Profiling produces counters for POSIX, MPI-IO, HDF5

Runtime (seconds)    1 Node    64 Nodes
Uninstrumented       309.25    352.33
Instrumented         307.43    352.29

SLIDE 9

Replication: Parameter Generation

[Diagram: workflow stages: Application Run, Darshan Log, YAML Log, MACSio Parameters, Access Diagram]

SLIDE 10

Replication: Parameter Generation

$$\text{Filesize} = \text{Processors}\,\bigl(\text{PartSize}\,(\alpha\,\text{Variables} + \beta) + \gamma\,\text{Variables} + \delta\bigr) + \psi\,\text{Variables} + \eta$$

MACSio currently weak scales, so increasing processor count increases the file size linearly.
Similarly, part size and dataset variable count give a linear increase in total bytes written.
Combining the linear equations gives the equation above, which provides a good estimate of the resultant file size based on dataset composition.
Constants α, β, γ, δ, ψ, η are derived experimentally from a dataset composition scaling study.
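The file size model on this slide can be evaluated directly, and inverted to find the part size that produces a target file size. The following is a minimal sketch, not MACSio code: the function names and the `FitConstants` type are hypothetical, and the constant values are placeholders, since the slides do not give the fitted values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FitConstants:
    """Constants of the file size model; real values come from a scaling study."""
    alpha: float
    beta: float
    gamma: float
    delta: float
    psi: float
    eta: float


def estimate_file_size(procs, part_size, variables, c):
    """Evaluate the file size model for a given dataset composition."""
    per_proc = (part_size * (c.alpha * variables + c.beta)
                + c.gamma * variables + c.delta)
    return procs * per_proc + c.psi * variables + c.eta


def part_size_for_target(target_bytes, procs, variables, c):
    """Invert the model: part size that yields `target_bytes` in total."""
    per_proc = (target_bytes - c.psi * variables - c.eta) / procs
    return (per_proc - c.gamma * variables - c.delta) / (c.alpha * variables + c.beta)


# Placeholder constants for illustration only, NOT measured values.
EXAMPLE = FitConstants(alpha=1.0, beta=0.5, gamma=64.0,
                       delta=512.0, psi=128.0, eta=2048.0)

size = estimate_file_size(procs=24, part_size=1000, variables=4, c=EXAMPLE)
recovered = part_size_for_target(size, procs=24, variables=4, c=EXAMPLE)
print(size, recovered)
```

The inverse is the useful direction for replication: a bytes-written total taken from a profile fixes the target, and the part size is solved for.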

SLIDE 11

Replication: Parameter Generation

Extracting counters such as BYTES_WRITTEN, NUM_PROCS, COLL_WRITES, [OPEN/CLOSE]_TIMESTAMP is enough to generate an input to MACSio for a similar dataset composition and I/O pattern.
In particular, using timestamps to distribute load across the simulation runtime is important to give an accurate representation of typical ‘bursty’ I/O hotspots spread out across runtime.
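As a rough illustration of this step, the sketch below turns per-file counter values into replay parameters. It makes several assumptions: the counters have already been extracted into plain dictionaries keyed by the abbreviated names used on this slide (actual Darshan logs prefix counters by module), the per-rank volume is a simple even split, and all function names and numbers are hypothetical.

```python
def wait_times(counters_per_file):
    """Gaps (s) between one file's close and the next file's open."""
    ordered = sorted(counters_per_file, key=lambda f: f["OPEN_TIMESTAMP"])
    return [nxt["OPEN_TIMESTAMP"] - prev["CLOSE_TIMESTAMP"]
            for prev, nxt in zip(ordered, ordered[1:])]


def replay_parameters(counters_per_file):
    """Summarise counters as inputs for a proxy run (simplified assumption)."""
    num_procs = counters_per_file[0]["NUM_PROCS"]
    return {
        "num_procs": num_procs,
        # Even split of bytes across ranks; a simplification for illustration.
        "bytes_per_rank": [f["BYTES_WRITTEN"] / num_procs for f in counters_per_file],
        "wait_times": wait_times(counters_per_file),
        "collective": any(f["COLL_WRITES"] > 0 for f in counters_per_file),
    }


# Illustrative values only, not taken from a real log.
dumps = [
    {"NUM_PROCS": 24, "BYTES_WRITTEN": 2_400_000, "COLL_WRITES": 48,
     "OPEN_TIMESTAMP": 5.0, "CLOSE_TIMESTAMP": 9.0},
    {"NUM_PROCS": 24, "BYTES_WRITTEN": 2_400_000, "COLL_WRITES": 48,
     "OPEN_TIMESTAMP": 129.0, "CLOSE_TIMESTAMP": 133.0},
]
params = replay_parameters(dumps)
print(params)
```

The wait times are what reproduce the bursty pattern: the proxy idles for each gap instead of computing, so the file accesses land at similar points in the run.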

SLIDE 12

Case Study: Bookleaf

– 2D unstructured Lagrangian hydrodynamics application
– Fixed checkpoint scheme: two per simulation
– The input deck used solves the Noh verification problem for ideal gases
– I/O is handled by TyphonIO

SLIDE 13

Experimental Setup

ARCHER
– 4920 node Cray XC30
– Two 12-core Ivy Bridge processors per node (118,080 cores total)
– Three Lustre filesystems:
    – 12 OSSs
    – 4 OSTs/OSS
    – 10 4TB Discs/OST (RAID6)
    – 1 MDS + 1 MDT with 14 600GB discs (RAID1+0)
– 10 LNet Router nodes with overlapping routing paths

SLIDE 14

Experimental Setup: Input Parameters

Part size represents the volume of data written from each rank.
Wait time is a basic time buffer between consecutive file accesses.

Nodes    Part Size (Bytes)    Wait Time (s)
1        404,320              266
2        202,205              120
4        101,148              53
8        50,619               22
16       25,355               11
32       12,723               7
64       6,407                5
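A quick arithmetic check on these inputs: multiplying each part size by the rank count gives the aggregate volume per dump. The sketch assumes one rank per core, i.e. 24 ranks per node, which is consistent with the 1,536 ranks quoted for 64 nodes in the results; the variable names are illustrative.

```python
RANKS_PER_NODE = 24  # assumption: one rank per core on two 12-core processors

# (nodes, part size in bytes) taken from the input parameter table
part_sizes = [(1, 404_320), (2, 202_205), (4, 101_148), (8, 50_619),
              (16, 25_355), (32, 12_723), (64, 6_407)]

# Aggregate bytes per dump if every rank writes one part of the given size
totals = {nodes: nodes * RANKS_PER_NODE * size for nodes, size in part_sizes}

for nodes, total in totals.items():
    print(f"{nodes:>2} nodes: {total / 1e6:.2f} MB")
```

Under that assumption the aggregate stays within about 1.5% of 9.7 MB at every scale, which suggests the part size is being halved as the node count doubles so that the proxy writes the same overall dataset as the fixed-size problem.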

SLIDE 15

File Access Pattern

File access times are offset by the initial setup in Bookleaf.
Accounting for this overhead is not necessary to accurately represent the I/O pattern so we don’t factor it in, but this could easily be introduced.

[Figure: file access timeline for MACSio 1, MACSio 2, Bookleaf 1 and Bookleaf 2; x-axis: Time (s)]

SLIDE 16

Results: I/O Time

[Figure: two panels, "Absolute I/O Time" and "Cumulative I/O Time across all ranks"; x-axis: # Nodes (1 to 64), y-axis: Time (s); series: MACSio #1, MACSio #2, Bookleaf #1, Bookleaf #2. Annotation: 17,000s 1,536 ranks ≈ 110s writing per rank]

SLIDE 17

Results: I/O Time

Total, cumulative and slowest individual I/O time remain consistent for the original and replicated runs

Looking at a wider range of Darshan counters, access sizes and frequencies are also consistent

[Figure: Slowest Individual MPIIO Operation; x-axis: # Nodes (1 to 64), y-axis: Time (s); series: MACSio #1, MACSio #2, Bookleaf #1, Bookleaf #2]

SLIDE 18

Results: Testing Independent vs Collective I/O with MACSio

Using the MACSio replication, a parameter tweak can be used to manipulate I/O library behaviour.
The switch to use collective buffering has a very predictable effect, reducing the number of small write operations and lowering the overall I/O time.

[Figure: I/O time for collective vs independent I/O; x-axis: # Nodes (1 to 64), y-axis: Time (s); series: Collective #1, Collective #2, Independent #1, Independent #2]
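Why collective buffering reduces the number of small writes can be shown with a toy model. This is a conceptual sketch only: it does not use MPI or reproduce any MPI-IO implementation, and it assumes each rank owns one contiguous region of the file and that aggregators take equal-sized groups of neighbouring ranks. All names are hypothetical.

```python
def independent_writes(rank_sizes):
    """One (offset, length) request per rank, laid out back to back."""
    requests, offset = [], 0
    for size in rank_sizes:
        requests.append((offset, size))
        offset += size
    return requests


def collective_writes(rank_sizes, aggregators):
    """Merge neighbouring ranks' requests so each aggregator issues one write."""
    requests = independent_writes(rank_sizes)
    per_agg = -(-len(requests) // aggregators)  # ceiling division
    merged = []
    for start in range(0, len(requests), per_agg):
        group = requests[start:start + per_agg]
        merged.append((group[0][0], sum(length for _, length in group)))
    return merged


ranks = [6407] * 24  # 24 ranks writing 6,407 bytes each (one node's worth)
ind = independent_writes(ranks)
col = collective_writes(ranks, aggregators=4)
print(len(ind), "writes of", ind[0][1], "bytes")
print(len(col), "writes of", col[0][1], "bytes")
```

The same bytes reach the file in both cases; only the request count and size change, which is the effect the slide attributes to the collective buffering switch.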

SLIDE 19

Conclusion

We use a proxy application and high level library to mimic an I/O pattern, based on profiling that is as lightweight as possible.
I/O characterisation and a small amount of application familiarity are enough to produce a workable proxy.
Once a parameter set has been identified, we can change strategy, library and platform with relatively little effort.

SLIDE 20

Next Steps

– More irregular I/O patterns from a range of applications
– Exercise different parallel interfaces
– Multiple concurrent workloads

SLIDE 21

Acknowledgements

UK Atomic Weapons Establishment Technical Outreach Programme
UK Engineering and Physical Sciences Research Council

SLIDE 22

Thank You
Any Questions?

J.Dickson@warwick.ac.uk