Replicating HPC I/O Workloads With Proxy Applications
James Dickson, Steven Wright, Stephen Jarvis - University of Warwick
Satheesh Maheswaran, Andy Herdman - UK Atomic Weapons Establishment
Mark C. Miller - Lawrence Livermore National Laboratory
“Multi-purpose, Application-Centric, Scalable I/O Proxy Application”
Two key characteristics:
– Level of abstraction: POSIX, MPI-IO, SILO, HDF5 and beyond…
– Degree of flexibility: dump type, dataset composition, user-defined data objects
Multi-purpose behaviour comes from a plugin-based design: if you have a library or interface to work with, write a plugin (see the sketch below)
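As a rough illustration of the plugin-based design, the sketch below shows what a dump-plugin descriptor could look like; the struct, the callback signature and the "mylib" names are illustrative, not MACSio's actual plugin API.

/* A minimal sketch of a MACSio-style dump plugin; descriptor and
 * callback names here are illustrative, not MACSio's real API. */
#include <stdio.h>

typedef struct json_object json_object;  /* opaque problem description */

/* Hypothetical plugin descriptor: a name plus a dump entry point. */
typedef struct iface_handle {
    const char *name;
    void (*dump)(json_object *problem, int dump_num, double sim_time);
} iface_handle;

/* Example dump callback for an imaginary "mylib" I/O library: calls
 * into the library under study (open file, write parts, close) go here. */
static void mylib_dump(json_object *problem, int dump_num, double sim_time)
{
    (void)problem;
    printf("mylib plugin: writing dump %d at t=%g\n", dump_num, sim_time);
}

/* Registration object the driver would look up by interface name. */
static iface_handle mylib_iface = { "mylib", mylib_dump };

int main(void)
{
    /* Driver side: dispatch a dump through the registered plugin. */
    mylib_iface.dump(NULL, 0, 0.0);
    return 0;
}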
[Figure: the parallel I/O software stack — application; high-level I/O library (scientific data model); middleware (parallel interface); parallel file system; storage hardware]
Silo overlays a hierarchical data model on the parallel I/O interface
Designed to use HDF5 in a consistent way that can be optimised, e.g. efficient use of chunking in the mesh structure (illustrated below)
[Figure: Silo file hierarchy — File → State 1, State 2..N → Mesh, Material, Quants; Vargroup → Variable; with chunked objects]
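To illustrate the chunking optimisation, the sketch below writes a chunked dataset directly through the HDF5 API; the extents, chunk shape and dataset name are arbitrary examples, and this is plain HDF5 rather than Silo's internal code.

/* Writing a chunked HDF5 dataset of the kind Silo can exploit for
 * mesh data; sizes and names here are arbitrary examples. */
#include <hdf5.h>

int main(void)
{
    hsize_t dims[2]  = {1024, 1024};   /* full variable extent */
    hsize_t chunk[2] = {128, 128};     /* chunk shape matching the access pattern */
    static double data[1024][1024];    /* zero-initialised example payload */

    hid_t file  = H5Fcreate("mesh.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t space = H5Screate_simple(2, dims, NULL);

    /* Dataset creation property list: switch on chunked layout. */
    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 2, chunk);

    hid_t dset = H5Dcreate2(file, "quants", H5T_NATIVE_DOUBLE, space,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);

    H5Dclose(dset); H5Pclose(dcpl); H5Sclose(space); H5Fclose(file);
    return 0;
}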
Darshan I/O characterisation was chosen for lightweight profiling
– Instrumentation overhead is indistinguishable from machine noise in our experiments
– Profiling produces counters for the POSIX, MPI-IO and HDF5 layers

Runtime (seconds)   1 Node   64 Nodes
Uninstrumented      309.25   352.33
Instrumented        307.43   352.29
Filesize = Processors × (PartSize × (α × Variables + β) + γ × Variables + δ) + ψ × Variables + η

MACSio currently weak scales, so increasing the processor count increases the file size linearly
Similarly, part size and dataset variable count each give a linear increase in total bytes written
Combining these linear relationships gives the equation above, a good estimate of the resultant file size for a given dataset composition
Constants α, β, γ, δ, ψ, η are derived experimentally from a dataset composition scaling study (a sketch of the calculation follows)
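A minimal sketch of the estimate, assuming placeholder values for the fitted constants (the real values come from the scaling study):

/* File-size estimate from the equation above. The constant values
 * are placeholders, not the experimentally derived fit. */
#include <stdio.h>

/* Placeholder fit constants (alpha..eta); substitute measured values. */
static const double A = 8.0, B = 64.0, G = 256.0, D = 1024.0,
                    PSI = 512.0, ETA = 4096.0;

double estimate_filesize(int processors, double part_size, int variables)
{
    return processors * (part_size * (A * variables + B)
                         + G * variables + D)
           + PSI * variables + ETA;
}

int main(void)
{
    /* e.g. 1,536 ranks, ~400 KB parts, 20 variables */
    printf("estimated bytes: %.0f\n", estimate_filesize(1536, 404320.0, 20));
    return 0;
}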
Extracting counters such as BYTES_WRITTEN, NUM_PROCS, COLL_WRITES and [OPEN/CLOSE]_TIMESTAMP is enough to generate a MACSio input with a similar dataset composition and I/O pattern (see the sketch below)
In particular, using the timestamps to distribute load across the simulation runtime is important for an accurate representation of the typical ‘bursty’ I/O hotspots spread across a run
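A minimal sketch of this mapping, assuming a hand-rolled struct of extracted counter values; the derived part size and the command-line option names printed here are illustrative rather than MACSio's exact flags.

/* Turning Darshan counters into MACSio-style parameters. Fields
 * mirror the counter names above; values and options are examples. */
#include <stdio.h>

typedef struct {
    long   bytes_written;   /* BYTES_WRITTEN        */
    int    num_procs;       /* NUM_PROCS            */
    int    coll_writes;     /* COLL_WRITES          */
    double open_timestamp;  /* OPEN_TIMESTAMP  (s)  */
    double close_timestamp; /* CLOSE_TIMESTAMP (s)  */
    int    num_dumps;       /* dump events observed */
} darshan_counters;

int main(void)
{
    darshan_counters c = { 621000000L, 1536, 1536, 102.4, 159.7, 10 };

    /* Per-rank payload per dump approximates the MACSio part size. */
    long part_size = c.bytes_written / ((long)c.num_procs * c.num_dumps);

    /* Open/close timestamps locate the burst within the run, so dumps
     * can be spaced to reproduce the original 'bursty' hotspot pattern. */
    double burst_len = c.close_timestamp - c.open_timestamp;

    printf("macsio --part_size %ld --num_dumps %d  (burst ~%.1f s, %s I/O)\n",
           part_size, c.num_dumps, burst_len,
           c.coll_writes > 0 ? "collective" : "independent");
    return 0;
}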
[Figure: ARCHER experimental platform and its I/O paths, 118,080 cores total]
Nodes   Part Size (Bytes)   Wait Time (s)
1       404,320             266
2       202,205             120
4       101,148              53
8        50,619              22
16       25,355              11
32       12,723               7
64        6,407               5
[Figure: I/O time (s, 50–350) for MACSio 1, MACSio 2, Bookleaf 1 and Bookleaf 2]
[Figure: Absolute I/O Time (4–128 s) and Cumulative I/O Time across all ranks (64–262,000 s) against node count (1–64) for MACSio #1/#2 and Bookleaf #1/#2]
At 64 nodes, a cumulative ≈170,000 s across 1,536 ranks equates to ≈110 s of writing per rank
Slowest Individual MPI-IO Operation
[Figure: slowest individual MPI-IO operation time (0.5–64 s) against node count (1–64) for MACSio #1/#2 and Bookleaf #1/#2]
Using the MACSio replication, a single parameter change can be used to manipulate I/O library behaviour (a sketch of the corresponding MPI-IO hint follows)
Switching to collective buffering has a very predictable effect, reducing the number of small write operations and lowering the overall I/O time
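A minimal sketch of toggling that parameter via the standard MPI-IO info mechanism; "romio_cb_write" is the ROMIO hint controlling collective buffering, and the file name here is arbitrary.

/* Enabling collective buffering for writes through an MPI-IO hint. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_File fh;
    MPI_Info info;

    MPI_Init(&argc, &argv);

    MPI_Info_create(&info);
    /* Force two-phase collective buffering for writes; set to
     * "disable" to compare against independent I/O, as in the plot. */
    MPI_Info_set(info, "romio_cb_write", "enable");

    MPI_File_open(MPI_COMM_WORLD, "dump.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

    /* ... collective writes, e.g. MPI_File_write_at_all ... */

    MPI_File_close(&fh);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}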
[Figure: I/O time (0.5–128 s) against node count (1–64) for Collective #1/#2 and Independent #1/#2]