SLIDE 1

echofs: Enabling Transparent Access to Node-local NVM Burst Buffers for Legacy Applications […with the scheduler’s collaboration]

Alberto Miranda, PhD
Researcher on HPC I/O
alberto.miranda@bsc.es

www.bsc.es

Dagstuhl, May 2017

The NEXTGenIO project has received funding from the European Union’s Horizon 2020 Research and Innovation programme under Grant Agreement no. 671951

slide-2
SLIDE 2
  • Petascale already struggles with I/O…

– Extreme parallelism w/ millions of threads
– Job’s input data read from external PFS
– Checkpoints: periodic writes to external PFS
– Job’s output data written to external PFS

  • HPC & Data Intensive systems merging

– Modelling, simulation and analytic workloads increasing…

I/O -> a fundamental challenge;

[Figure: compute nodes connected through a high-performance network to the external filesystem]

SLIDE 3
  • Petascale already struggles with I/O…

– Extreme parallelism w/ millions of threads
– Job’s input data read from external PFS
– Checkpoints: periodic writes to external PFS
– Job’s output data written to external PFS

  • HPC & Data Intensive systems merging

– Modelling, simulation and analytic workloads increasing…

  • And it will only get worse at Exascale…

I/O -> a fundamental challenge;

[Figure: compute nodes connected through a high-performance network to the external filesystem]

SLIDE 4
  • Fast storage devices that temporarily store application data before sending it to PFS

– Goal: Absorb peak I/O to avoid overtaxing PFS
– Cray DataWarp, DDN IME

  • Growing interest in adding them to next-gen HPC architectures

– NERSC’s Cori, LLNL’s Sierra, ANL’s Aurora, …
– Typically a separate resource from the PFS
– Usage, allocation, data movement, etc. become the user’s responsibility

Burst Buffers -> remote;

[Figure: compute nodes connected through a high-performance network to a burst filesystem in front of the external filesystem]

SLIDE 5
  • Non-volatile storage coming to the node

– Argonne’s Theta has 128 GB SSDs in each compute node

Burst Buffers -> on-node;

[Figure: compute nodes with node-local storage, connected through a high-performance network to the external filesystem]

SLIDE 6


  • Node-local, high-density NVRAM becomes a fundamental component

– Intel’s 3D XPoint™
– Capacity much larger than DRAM
– Slightly slower than DRAM but significantly faster than SSDs
– DIMM form factor → standard memory controller
– No refresh → no/low energy leakage

NEXTGenIO EU Project [http://www.nextgenio.eu]

[Figure: I/O stack with cache, memory and storage layers]

SLIDE 7
  • Node-local, high-density NVRAM becomes a fundamental component

– Intel’s 3D XPoint™
– Capacity much larger than DRAM
– Slightly slower than DRAM but significantly faster than SSDs
– DIMM form factor → standard memory controller
– No refresh → no/low energy leakage

NEXTGenIO EU Project [http://www.nextgenio.eu]

[Figure: I/O stack with cache, memory, NVRAM, fast storage and slow storage layers]

SLIDE 8
  • Node-local, high-density NVRAM becomes a fundamental component

– Intel’s 3D XPoint™
– Capacity much larger than DRAM
– Slightly slower than DRAM but significantly faster than SSDs
– DIMM form factor → standard memory controller
– No refresh → no/low energy leakage


NEXTGenIO EU Project [http://www.nextgenio.eu]

[Figure: I/O stack with cache, memory, NVRAM, fast storage and slow storage layers]

  • 1. How do we manage access to these layers?

  • 2. How can we bring the benefits from these layers to legacy code?

SLIDE 9

OUR SOLUTION: MANAGING ACCESS THROUGH A USER-LEVEL FILESYSTEM

SLIDE 10


  • First goal: Allow legacy applications to transparently benefit from new storage layers

– Accessible storage layers under unique mount point
– Make new layers readily available to applications
– I/O stack complexity hidden from applications
– Allows for automatic management of data location
– POSIX interface [sorry]

echofs -> objectives;

[Figure: I/O stack with cache, memory, NVRAM, SSD and Lustre layers exposed under /mnt/echofs/]

PFS namespace is “echoed”: /mnt/PFS/User/App → /mnt/ECHOFS/User/App
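To make the “echoed namespace” idea concrete, here is a minimal sketch (the mount points and function name are illustrative, not echofs’ actual internals): a legacy application keeps using its PFS-style paths, and only the prefix changes.

# Minimal sketch of namespace "echoing": every path under the PFS prefix has a
# 1:1 counterpart under the echofs mount point (prefixes are illustrative).
PFS_PREFIX = "/mnt/PFS"
ECHOFS_PREFIX = "/mnt/ECHOFS"

def echo_path(pfs_path: str) -> str:
    """Translate a PFS path into its echoed counterpart."""
    if not pfs_path.startswith(PFS_PREFIX):
        return pfs_path  # not part of the echoed namespace
    return ECHOFS_PREFIX + pfs_path[len(PFS_PREFIX):]

print(echo_path("/mnt/PFS/User/App"))  # -> /mnt/ECHOFS/User/App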

SLIDE 11
  • Second goal: Construct a collaborative burst buffer by joining NVM regions assigned to a batch job by the scheduler [SLURM]

– Filesystem’s lifetime linked to batch job’s lifetime
– Input files staged into NVM before job starts
– Allow HPC jobs to perform collaborative NVM I/O
– Output files staged out to PFS when job ends

echofs -> objectives;

[Figure: compute nodes with node-local NVM forming a collaborative burst buffer; parallel processes issue POSIX reads/writes through echofs, which handles stage-in/stage-out to the external filesystem]

SLIDE 12
  • User provides job I/O requirements through SLURM

– Nodes required, files accessed, type of access [in|out|inout], expected lifetime [temporary|persistent], expected “survivability”, required POSIX semantics [?], …

  • SLURM allocates nodes and mounts echofs across them

– Also forwards I/O requirements through API

  • echofs builds the CBB and fills it with input files

– When finished, SLURM starts the batch job

echofs -> intended workflow;


SLIDE 13
  • User provides job I/O requirements through SLURM

– Nodes required, files accessed, type of access [in|out|inout], expected lifetime [temporary|persistent], expected “survivability”, required POSIX semantics [?], …

  • SLURM allocates nodes and mounts echofs across them

– Also forwards I/O requirements through API

  • echofs builds the CBB and fills it with input files

– When finished, SLURM starts the batch job

echofs -> intended workflow;


We can’t expect optimization details from users, but maybe we can expect them to offer us enough hints…
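As a concrete illustration of the hints listed above, a job’s static I/O requirements could be declared along these lines. This is a hedged sketch in Python; the field names and values are hypothetical and do not reflect the actual SLURM/data scheduler interface.

# Hypothetical declaration of a job's static I/O requirements (illustrative
# field names only; the real hints travel through SLURM and the data scheduler).
job_io_requirements = {
    "nodes": 4,
    "files": [
        {"path": "/mnt/PFS/User/App/input.dat", "access": "in",
         "lifetime": "persistent"},              # staged into NVM before the job starts
        {"path": "/mnt/PFS/User/App/checkpoint.dat", "access": "out",
         "lifetime": "temporary"},               # checkpoints never copied back to the PFS
        {"path": "/mnt/PFS/User/App/results.dat", "access": "out",
         "lifetime": "persistent"},              # staged out to the PFS when the job ends
    ],
    "posix_semantics": "relaxed",                # full compliance only where really needed
}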

SLIDE 14
  • Job I/O absorbed by collaborative burst buffer

– Non-CBB open()s forwarded to PFS (throttled to limit PFS congestion)
– Temporary files do not need to make it to PFS (e.g. checkpoints)
– Metadata attributes for temporary files cached in a distributed key-value store

  • When job completes, future of files managed by echofs

– Persistent files eventually sync'd to PFS
– Decision orchestrated by SLURM & DataScheduler component depending on requirements of upcoming jobs

echofs -> intended workflow;


If some other job reuses these files, we can leave them “as is”
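A hedged sketch of the routing and teardown policy described above (all names are illustrative, not echofs code): opens on files registered in the collaborative burst buffer are absorbed by node-local NVM, other opens fall through to the PFS behind a throttle, and only persistent files are synced back at job end.

import threading

# Illustrative sketch of the CBB routing and teardown policy.
pfs_throttle = threading.Semaphore(8)      # limit concurrent opens forwarded to the PFS

def route_open(path, cbb_files, cbb_open, pfs_open):
    if path in cbb_files:
        return cbb_open(path)              # absorbed by the collaborative burst buffer
    with pfs_throttle:                     # throttled to limit PFS congestion
        return pfs_open(path)

def teardown(cbb_files, sync_to_pfs):
    for path, info in cbb_files.items():
        if info["lifetime"] == "persistent":
            sync_to_pfs(path)              # eventually sync'd to the PFS
        # temporary files (e.g. checkpoints) are simply dropped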

SLIDE 15
  • Distributed data servers

– Job’s data space partitioned across compute nodes
– Each node acts as data server for its partition
– Each node acts as data client for other partitions

  • Pseudo-random file segment distribution

– No replication ⇒ avoid coherence mechanisms
– Resiliency through erasure codes (eventually)
– Each node acts as lock manager for its partition

echofs -> data distribution;

[Figure: a shared file’s data space split into segments ([0-8MB), [8-16MB), [16-32MB)) that are hashed onto the NVM of the compute nodes]
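A minimal sketch of the pseudo-random placement the figure suggests (the segment size, the hash function, and the absence of partition-size weighting are simplifications, not echofs’ actual scheme): any client can compute a segment’s owner locally, with no metadata request.

import hashlib

SEGMENT_SIZE = 8 * 1024 * 1024       # illustrative 8 MiB segments, as in the figure

def owner_node(file_id, offset, nodes):
    """Map the segment containing `offset` to a node, purely by hashing."""
    segment = offset // SEGMENT_SIZE
    digest = hashlib.sha1(f"{file_id}:{segment}".encode()).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]

nodes = ["node0", "node1", "node2", "node3"]
print(owner_node("/job/shared_file", 12 * 1024 * 1024, nodes))   # segment 1 -> some node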

SLIDE 16
  • Why pseudo-random?

– Efficient & decentralized segment lookup [no metadata request to lookup segment]
– Balance workload w.r.t. partition size
– Allows for collaborative I/O

echofs -> data distribution;

[Figure: same segment-hashing diagram as the previous slide]

SLIDE 17
  • Why pseudo-random?

– Efficient & decentralized segment lookup [no metadata request to lookup segment]
– Balance workload w.r.t. partition size
– Allows for collaborative I/O
– Guarantees minimal movements of data if node allocation changes [future research on elasticity]

echofs -> data distribution;

[Figure: job phases over time; the allocation grows from 2 nodes by +1 node per phase, with the job scheduler triggering data transfers between the NVM of the allocated nodes]
SLIDE 18
  • Why pseudo-random?

– Efficient & decentralized segment lookup [no metadata request to lookup segment]
– Balance workload w.r.t. partition size
– Allows for collaborative I/O
– Guarantees minimal movements of data if node allocation changes [future research on elasticity]

echofs -> data distribution;

[Figure: job phases over time; the allocation grows from 2 nodes by +1 node per phase, with the job scheduler triggering data transfers between the NVM of the allocated nodes]

Other strategies would be possible depending on job semantics
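The “minimal movement” property is what consistent-hashing-style schemes provide; the sketch below uses rendezvous (highest-random-weight) hashing as one possible realization, which may differ from echofs’ actual placement function, and shows that adding a node only relocates the segments the new node takes over.

import hashlib

def hrw_owner(segment, nodes):
    """Rendezvous hashing: each node gets a pseudo-random weight per segment;
    the highest weight wins, so adding a node only steals ~1/(n+1) of the segments."""
    return max(nodes, key=lambda n: hashlib.sha1(f"{segment}@{n}".encode()).digest())

segments = [f"/job/file:{i}" for i in range(10000)]
before = {s: hrw_owner(s, ["n0", "n1", "n2"]) for s in segments}
after = {s: hrw_owner(s, ["n0", "n1", "n2", "n3"]) for s in segments}   # +1 node
moved = sum(1 for s in segments if before[s] != after[s])
print(f"{moved / len(segments):.1%} of segments moved, all onto n3")    # roughly 25%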
SLIDE 19
  • Data Scheduler daemon external to echofs

– Interfaces SLURM & echofs

  • Allows SLURM to send requests to echofs
  • Allows echofs to ACK these requests

– Offers an API to [non-legacy] applications willing to send I/O hints to echofs
– In the future will coordinate w/ SLURM to decide when different echofs instances should access PFS [data-aware job scheduling]

echofs -> integration with batch scheduler;

[Figure: applications provide static I/O requirements to SLURM and dynamic I/O requirements to the data scheduler; the data scheduler sends asynchronous requests to echofs, which performs stage-in/stage-out against the PFS]
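For non-legacy applications, sending a dynamic I/O hint through the data scheduler might look roughly like the sketch below; the client object, method names and hint fields are purely hypothetical, since the actual API is not shown here.

# Hypothetical use of the data scheduler's hint API (names are illustrative only).
def advise_upcoming_read(ds_client, path):
    """A non-legacy application tells echofs, via the data scheduler, that it is
    about to read `path` intensively; echofs ACKs the forwarded request."""
    request_id = ds_client.send_hint(path=path, access="in", urgency="imminent")
    return ds_client.wait_for_ack(request_id)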

SLIDE 20
  • Main features:

– Ephemeral filesystem linked to job lifetime
– Allows legacy applications to benefit from newer storage technologies
– Provides aggregate I/O for applications

  • Research goals:

– Improve coordination w/ job scheduler and other HPC management infrastructure
– Investigate ad-hoc data distributions tailored for each job’s I/O
– Scheduler-triggered specific optimizations for jobs/files

Summary;


SLIDE 21
  • POSIX compliance is hard…

– But maybe we don’t need FULL COMPLIANCE for ALL jobs…

  • Adding I/O-awareness to the scheduler is important…

– Avoids wasting I/O work already done…
– …but requires user/developer collaboration (tricky…)

  • User-level filesystems/libraries solve very specific I/O problems…

– Can we reuse/integrate these efforts? Can we learn what works for a specific application, characterize it & automatically run similar ones in a “best fit” FS?

Food for thought;


SLIDE 22

Thank you!

For further information please contact alberto.miranda@bsc.es, ramon.nou@bsc.es, toni.cortes@bsc.es

www.bsc.es