Replicating HPC I/O Workloads With Proxy Applications

James Dickson, Steven Wright, Stephen Jarvis - University of Warwick
Satheesh Maheswaran, Andy Herdman - UK Atomic Weapons Establishment
Marc C. Miller - Lawrence Livermore National Laboratory


SLIDE 1

Replicating HPC I/O Workloads With Proxy Applications

James Dickson, Steven Wright, Stephen Jarvis - University of Warwick
Satheesh Maheswaran, Andy Herdman - UK Atomic Weapons Establishment
Marc C. Miller - Lawrence Livermore National Laboratory

SLIDE 2

Motivation

I/O investigation goals?
– Benchmarking systems
– Tuning application behaviour
– Tuning software stack
– Changing paradigm
– Changing hardware technology

SLIDE 3

Motivation

Working with a mini application or proxy is less cumbersome and more streamlined, not to mention more portable.
Developing and maintaining a representative proxy for every application is time consuming and probably redundant.
Ideally, we would like to experiment while minimising time spent making code changes and writing new implementations.

SLIDE 4

Outline

  • Background: Proxy app and I/O library
  • Replication Components
  • Case Study
  • Conclusion

SLIDE 5

Background: MACSio

“Multi-purpose, Application-Centric, Scalable I/O Proxy Application”
Two key characteristics:
– Level of Abstraction: POSIX, MPI-IO, Silo, HDF5 and beyond…
– Degree of Flexibility: dump type, dataset composition, user-defined data objects
Multi-purpose is achieved through a plugin-based design: if you have a library or interface to work with, write a plugin!

SLIDE 6

Background: TyphonIO

[Figure: I/O software stack, top to bottom: Application, High-Level I/O Library, Middleware, Parallel File System, Storage Hardware. TyphonIO is labelled as the scientific data model and HDF5 as the parallel interface.]

SLIDE 7

Background: TyphonIO

Overlays a hierarchical data model on the parallel I/O interface.
Designed to use HDF5 in a consistent way that can be optimised for the data model, e.g. efficient use of chunking in the mesh structure.

[Figure: TyphonIO object hierarchy with the labels File, State 1, State 2..N, Mesh, Material, Quants, Vargroup, Variable and "Chunked Object".]

SLIDE 8

Replication: Profiling

Darshan I/O characterisation was chosen for lightweight profiling.
Instrumentation overhead was indistinguishable from machine noise in our experiments.
Profiling produces counters for POSIX, MPI-IO and HDF5.

Runtime (seconds)    1 Node    64 Nodes
Uninstrumented       309.25    352.33
Instrumented         307.43    352.29
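
The overhead claim can be checked directly from the runtimes in the table. The sketch below only restates those four numbers; the function name `relative_overhead` is an illustrative choice, not part of Darshan.

```python
# Runtimes in seconds, copied from the table above.
runtimes = {
    "1 node": {"uninstrumented": 309.25, "instrumented": 307.43},
    "64 nodes": {"uninstrumented": 352.33, "instrumented": 352.29},
}


def relative_overhead(uninstrumented, instrumented):
    """Instrumentation overhead as a fraction of the uninstrumented runtime."""
    return (instrumented - uninstrumented) / uninstrumented


overheads = {label: relative_overhead(**times) for label, times in runtimes.items()}

for label, value in overheads.items():
    # A negative value means the instrumented run happened to finish sooner.
    print(f"{label}: {value:+.2%}")
```

Both differences are negative and smaller than 1% of the runtime, which is what run-to-run noise rather than a measurable profiling cost would look like.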

SLIDE 9

Replication: Parameter Generation

[Figure: replication workflow with the stages Application Run, Darshan Log, YAML Log, MACSio Parameters and Access Diagram.]

SLIDE 10

Replication: Parameter Generation

Filesize = Processors × (PartSize × (α × Variables + β) + γ × Variables + δ) + ψ × Variables + η

MACSio currently weak scales, so increasing the processor count increases the file size linearly.
Similarly, part size and dataset variable count each give a linear increase in total bytes written.
Combining these linear relationships gives the equation above, which provides a good estimate of the resulting file size from the dataset composition.
The constants α, β, γ, δ, ψ, η are derived experimentally from a dataset composition scaling study.
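
A minimal sketch of how the file size equation might be evaluated, and inverted to find the part size that hits a target file size. The class and method names are hypothetical, and the constants in the example are placeholders chosen for easy arithmetic, not the experimentally derived values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileSizeModel:
    """Linear file size model; the six constants must come from a scaling study."""
    alpha: float
    beta: float
    gamma: float
    delta: float
    psi: float
    eta: float

    def file_size(self, processors, part_size, variables):
        """Evaluate the equation for a given dataset composition."""
        per_processor = (
            part_size * (self.alpha * variables + self.beta)
            + self.gamma * variables
            + self.delta
        )
        return processors * per_processor + self.psi * variables + self.eta

    def part_size_for(self, target_file_size, processors, variables):
        """Solve the same equation for PartSize, given a target file size."""
        per_processor = (
            target_file_size - self.psi * variables - self.eta
        ) / processors
        return (per_processor - self.gamma * variables - self.delta) / (
            self.alpha * variables + self.beta
        )


# Placeholder constants for illustration only (NOT measured values).
model = FileSizeModel(alpha=1.0, beta=0.0, gamma=100.0, delta=1000.0, psi=10.0, eta=2000.0)

size = model.file_size(processors=24, part_size=404_320, variables=4)
recovered = model.part_size_for(size, processors=24, variables=4)
print(size, recovered)
```

The inverse form is the more useful direction for replication: the profiled run supplies the bytes written and processor count, and the model is solved for the part size to pass to MACSio.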

SLIDE 11

Replication: Parameter Generation

Extracting counters such as BYTES_WRITTEN, NUM_PROCS, COLL_WRITES and [OPEN/CLOSE]_TIMESTAMP is enough to generate a MACSio input with a similar dataset composition and I/O pattern.
In particular, using the timestamps to distribute load across the simulation runtime is important for accurately representing typical ‘bursty’ I/O hotspots.
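
As a rough sketch of this step, the function below turns per-file counter values into a dump schedule: bytes written per rank, and the idle gap before each file access. The record layout, the key names and the sample values are all hypothetical stand-ins for whatever has been pulled out of the profiling log; this is not Darshan's API or MACSio's input format.

```python
def derive_dump_schedule(records, num_procs):
    """Build a dump schedule from per-file counter values.

    records: list of dicts with the hypothetical keys 'bytes_written',
             'open_timestamp' and 'close_timestamp' (timestamps in seconds).
    num_procs: number of processes in the profiled run.
    """
    ordered = sorted(records, key=lambda rec: rec["open_timestamp"])
    schedule = []
    previous_close = None
    for rec in ordered:
        if previous_close is None:
            wait = 0.0  # initial application setup time is deliberately ignored
        else:
            wait = max(0.0, rec["open_timestamp"] - previous_close)
        schedule.append(
            {
                "bytes_per_rank": rec["bytes_written"] / num_procs,
                "wait_before": wait,
            }
        )
        previous_close = rec["close_timestamp"]
    return schedule


# Illustrative values only, not taken from a real log.
example_records = [
    {"bytes_written": 9_600_000, "open_timestamp": 280.0, "close_timestamp": 284.0},
    {"bytes_written": 9_600_000, "open_timestamp": 20.0, "close_timestamp": 24.0},
]

schedule = derive_dump_schedule(example_records, num_procs=24)
for dump in schedule:
    print(dump)
```

The gap between one file's close and the next file's open is what spreads the bursts of I/O across the runtime; bytes per rank would still need to be mapped to a MACSio part size.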

SLIDE 12

Case Study: Bookleaf

2D unstructured Lagrangian hydrodynamics application.
Fixed checkpoint scheme: two per simulation.
The input deck used solves the Noh verification problem for ideal gases.
I/O is handled by TyphonIO.

SLIDE 13

Experimental Setup

ARCHER

  • 4920-node Cray XC30
  • Two 12-core Ivy Bridge processors per node (118,080 cores total)
  • Three Lustre filesystems:
      • 12 OSSs
      • 4 OSTs/OSS
      • 10 4TB discs/OST (RAID6)
      • 1 MDS + 1 MDT with 14 600GB discs (RAID1+0)
      • 10 LNet router nodes with overlapping routing paths

SLIDE 14

Experimental Setup: Input Parameters

Part size represents the volume of data written from each rank.
Wait time is a basic time buffer between consecutive file accesses.

Nodes    Part Size (Bytes)    Wait Time (s)
1        404 320              266
2        202 205              120
4        101 148              53
8        50 619               22
16       25 355               11
32       12 723               7
64       6 407                5
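
One way to sanity-check these inputs is to multiply each part size by the number of ranks. The sketch below assumes one MPI rank per core (24 per node), which is an assumption here, although it matches the 1,536 ranks quoted for 64 nodes in the results.

```python
RANKS_PER_NODE = 24  # assumption: one MPI rank per core, two 12-core processors per node

# (nodes, part size in bytes), copied from the table above.
part_sizes = [
    (1, 404_320),
    (2, 202_205),
    (4, 101_148),
    (8, 50_619),
    (16, 25_355),
    (32, 12_723),
    (64, 6_407),
]


def total_bytes(nodes, part_size, ranks_per_node=RANKS_PER_NODE):
    """Aggregate part data across all ranks for one node count."""
    return nodes * ranks_per_node * part_size


totals = {nodes: total_bytes(nodes, size) for nodes, size in part_sizes}

for nodes, total in totals.items():
    print(f"{nodes:>2} nodes: {total / 1e6:.2f} MB")
```

Under that assumption the aggregate stays between roughly 9.70 MB and 9.84 MB at every node count, which would be consistent with a fixed-size problem being divided across more ranks as the node count grows.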

SLIDE 15

File Access Pattern

File access times are offset by the initial setup in Bookleaf.
Accounting for this overhead is not necessary to represent the I/O pattern accurately, so we do not factor it in, but it could easily be introduced.

[Figure: timeline of file accesses for MACSio 1, MACSio 2, Bookleaf 1 and Bookleaf 2; x-axis: Time (s), ticks from 50 to 350.]

SLIDE 16

Results: I/O Time

[Figure: two panels, "Cumulative I/O Time across all ranks" and "Absolute I/O Time"; x-axis: # Nodes (1 to 64), y-axis: Time (s); series: MACSio #1, MACSio #2, Bookleaf #1, Bookleaf #2. Annotation: 17,000s, 1,536 ranks ≈ 110s writing per rank.]

SLIDE 17

Results: I/O Time

Total, cumulative and slowest individual I/O time remain consistent for the original and replicated runs.
Looking at a wider range of Darshan counters, access sizes and frequencies are also consistent.

[Figure: "Slowest Individual MPIIO Operation"; x-axis: # Nodes (1 to 64), y-axis: Time (s); series: MACSio #1, MACSio #2, Bookleaf #1, Bookleaf #2.]

SLIDE 18

Results: Testing Independent vs Collective I/O with MACSio

Using the MACSio replication, a parameter tweak can be used to manipulate I/O library behaviour.
The switch to collective buffering has a very predictable effect: it reduces the number of small write operations and lowers the overall I/O time.

[Figure: I/O time for collective vs independent I/O; x-axis: # Nodes (1 to 64), y-axis: Time (s); series: Collective #1, Collective #2, Independent #1, Independent #2.]
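
Why collective buffering cuts the number of small writes can be illustrated with a toy model. The sketch below is loosely inspired by two-phase collective I/O: the file range is split into one contiguous domain per aggregator, and adjacent requests inside a domain are merged into a single write. It is not the MPI-IO, HDF5 or MACSio implementation, and the function names and request layout are made up for illustration.

```python
import math


def independent_writes(requests):
    """Independent I/O: every (offset, length) request becomes its own write."""
    return list(requests)


def collective_writes(requests, num_aggregators):
    """Toy collective buffering: merge adjacent requests per aggregator domain.

    Assumes every request has a positive length and that requests do not overlap.
    """
    if not requests:
        return []
    lo = min(off for off, _ in requests)
    hi = max(off + length for off, length in requests)
    domain = max(1, math.ceil((hi - lo) / num_aggregators))
    merged = []  # entries are [start, end, aggregator index]
    for off, length in sorted(requests):
        start, end = off, off + length
        while start < end:
            agg = (start - lo) // domain
            # Split the request where it crosses into the next aggregator's domain.
            piece_end = min(end, lo + (agg + 1) * domain)
            if merged and merged[-1][2] == agg and merged[-1][1] == start:
                merged[-1][1] = piece_end  # extend the previous contiguous write
            else:
                merged.append([start, piece_end, agg])
            start = piece_end
    return [(start, end - start) for start, end, _ in merged]


# Example: 8 ranks each write 4 interleaved 1 KiB blocks.
BLOCK = 1024
RANKS = 8
BLOCKS_PER_RANK = 4
requests = [
    ((block * RANKS + rank) * BLOCK, BLOCK)
    for rank in range(RANKS)
    for block in range(BLOCKS_PER_RANK)
]

independent = independent_writes(requests)
collective = collective_writes(requests, num_aggregators=2)
print(len(independent), "independent writes")
print(len(collective), "collective writes:", collective)
```

In this example the 32 small independent writes collapse into two 16 KiB writes, one per aggregator. Real implementations add communication cost and buffer size limits, which the toy model ignores.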

SLIDE 19

Conclusion

We use a proxy application and a high-level library to mimic an I/O pattern, based on profiling that is as lightweight as possible.
I/O characterisation and a small amount of application familiarity are enough to produce a workable proxy.
Once a parameter set has been identified, we can change strategy, library and platform with relative ease.

SLIDE 20

Next Steps

  • More irregular I/O patterns from a range of applications
  • Exercise different parallel interfaces
  • Multiple concurrent workloads

SLIDE 21

Acknowledgements

UK Atomic Weapons Establishment Technical Outreach Programme
UK Engineering and Physical Sciences Research Council

SLIDE 22

Thank You
Any Questions?

J.Dickson@warwick.ac.uk