[PPT] Applying Machine Learning to Understand Write Performance of Large-scale Parallel Filesystems

SLIDE 1

ORNL is managed by UT-Battelle, LLC for the US Department of Energy

Applying Machine Learning to Understand Write Performance of Large-scale Parallel Filesystems

Presented by Bing Xie

Bing Xie, Zilong Tan, Philip Carns, Jeff Chase, Kevin Harms, Jay Lofstead, Sarp Oral, Sudharshan S. Vazhkudai, Feiyi Wang

SLIDE 2


Applying Machine Learning to Understand Write Performance of Large-scale Parallel Filesystems

Problem

– Understand the write performance of HPC applications running on large-scale systems

Contribution

– Built accurate ML models for predicting the I/O write performance

– Interpreted multi-stage write behaviors of large-scale I/O subsystems

Impact

– Demonstrated that ML can be applied to predict the write performance of large-scale I/O subsystems

– Delivered a generic solution applicable to various large-scale I/O subsystems and technologies

SLIDE 3


Motivation: Reduce the Write Cost

Configure write burst size/rate tradeoffs
Guide I/O middleware (e.g., ROMIO) to adapt write patterns
Inform system job schedulers to yield tighter/better estimates of I/O cost and application runtime

SLIDE 4


Typical Scientific Applications

HPC codes compute for a long time at large scales

Produce write bursts that stall application executions and impact application runtime

A generic example: XGC

– Evaluate physical equations iteratively over space: compute cost is predictable

– 4 types of bursts with different write frequencies and burst sizes:

state snapshots: 500 MB to 1.2 GB

diagnostic analysis bursts: 1 MB to 400 MB

Bursts are stored as independent files

– Write stalls comprise 7-20% of run time

SLIDE 6


Target I/O systems

Titan and Spider 2 at OLCF/ORNL

– Cray XK7

– Lustre filesystem

Cetus and Mira-FS1 at ALCF/ANL

– IBM Blue Gene/Q

– GPFS filesystem

[Figure: system architecture diagram. Labels: Supercomputer, Storage System, Client, SAN, Metadata Server, Server, Target.]

SLIDE 7


Challenges

High performance variability
Limited filesystem visibility for end-users

SLIDE 8


High Performance Variability

1. CDFs of write performance variations on Titan and Cetus.

2. The x-axis represents the relative measure (max/min) of the write bandwidths of the experiment data (IOR benchmarks).

3. Write performance on Titan and Cetus is highly variable.

[Figure: CDF of max/min write bandwidth for Cetus and Titan. x-axis: Max/Min (ticks 5 to 30); y-axis: CDF (ticks 0.2 to 1).]
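The max/min measure and its CDF can be computed directly from repeated bandwidth measurements. The Python sketch below assumes measurements are grouped by benchmark configuration; the configuration names and bandwidth values are hypothetical and are not data from the experiments.

```python
from bisect import bisect_right

def max_min_ratio(bandwidths):
    """Return max/min over repeated bandwidth measurements of one configuration."""
    if not bandwidths or min(bandwidths) <= 0:
        raise ValueError("need at least one positive bandwidth measurement")
    return max(bandwidths) / min(bandwidths)

def empirical_cdf(values):
    """Return a function F where F(x) is the fraction of values <= x."""
    ordered = sorted(values)
    n = len(ordered)

    def cdf(x):
        return bisect_right(ordered, x) / n

    return cdf

# Hypothetical write bandwidths (MB/s) from repeated runs of three configurations.
runs = {
    "config_a": [800.0, 400.0, 200.0],
    "config_b": [500.0, 250.0],
    "config_c": [900.0, 600.0, 300.0, 90.0],
}

ratios = [max_min_ratio(bw) for bw in runs.values()]
cdf = empirical_cdf(ratios)

for x in (2.0, 4.0, 10.0):
    print(f"fraction of configurations with max/min <= {x}: {cdf(x):.2f}")
```

A ratio of 1 would mean perfectly repeatable bandwidth; the further the CDF extends to the right, the more variable the system.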

SLIDE 9


Our Approach

Highly variable, but reverts to mean over time

– Model the mean performance

– Effectively address the repeated I/O writes and aggregate impact

Limited visibility for end users

– Extract features from write patterns and system architecture and configurations

Interference

– Address noise as features

ML solution

– Convergence-guaranteed sampling method

– Lasso models

– Systematic ML methodology

SLIDE 10


End-to-end I/O Write Path

[Figure: end-to-end write path. Components shown: Titan, Client, SAN, Metadata Server, Server, Target, Spider 2 (Atlas1 and 2). Burst 0 is striped into blocks b0 to b7, each of Stripe_size, across Server23/Target23 to Server26/Target26. Example: Stripe_Count=4, Starting_OST=23. Each Target is a RAID array.]
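The striping example in the figure can be sketched in Python. This assumes round-robin placement of stripe-sized blocks over consecutively numbered targets, as Server23 to Server26 in the figure suggest. The 1 MiB stripe size, the 8 MiB burst size, and the function name are illustrative assumptions, not values from the slides.

```python
def stripe_layout(burst_bytes, stripe_size, stripe_count, starting_ost):
    """Map each stripe-sized block of a burst to a target index, round-robin.

    Assumes targets are numbered consecutively from starting_ost.
    """
    if burst_bytes < 0 or stripe_size <= 0 or stripe_count <= 0:
        raise ValueError("sizes and counts must be positive")
    n_blocks = -(-burst_bytes // stripe_size)  # ceiling division
    layout = {}
    for block in range(n_blocks):
        ost = starting_ost + block % stripe_count
        layout.setdefault(ost, []).append(f"b{block}")
    return layout

MIB = 1024 * 1024

# Figure's example: Stripe_Count=4, Starting_OST=23; sizes are assumed.
layout = stripe_layout(burst_bytes=8 * MIB, stripe_size=1 * MIB,
                       stripe_count=4, starting_ost=23)

for ost, blocks in sorted(layout.items()):
    print(f"Target{ost}: {' '.join(blocks)}")
```

With these inputs each of the four targets receives two blocks, so the load is balanced; a burst size that is not a multiple of Stripe_Count × Stripe_size would leave some targets with more data than others.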

SLIDE 11


Extract Features

Insight: infer end-to-end burst absorption time based on performance-related parameters (write load, load skew, resources in use) at each stage

Collectable performance-related parameters on Titan and Cetus

Predictable performance-related parameters on Spider 2 and Mira-FS1

Positive and inverse forms of performance-related parameters on separate stages, adjacent stages, and noise

Titan/Spider 2: 41 features; Cetus/Mira-FS1: 30 features
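As an illustration of positive and inverse forms, the Python sketch below expands each parameter into the value itself and its reciprocal. An inverse form lets a linear model express a quantity that shrinks as the parameter grows. The parameter names and values are hypothetical, and the sketch does not reproduce the actual 41 or 30 features, which the slides do not list.

```python
def build_features(params):
    """Expand named positive parameters into positive and inverse features."""
    features = {}
    for name, value in params.items():
        if value <= 0:
            raise ValueError(f"{name} must be positive to have an inverse form")
        features[name] = float(value)          # positive form
        features[f"inv_{name}"] = 1.0 / value  # inverse form
    return features

# Hypothetical per-stage parameters for one write burst.
params = {
    "write_load_mb": 512.0,  # assumed: data written at this stage
    "load_skew": 2.0,        # assumed: heaviest load relative to the mean
    "osts_in_use": 4,        # assumed: storage targets receiving data
}

features = build_features(params)
for name, value in features.items():
    print(f"{name} = {value}")
```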

SLIDE 12


Systematic Machine Learning Approach

Candidate features → a Lasso model

In each training set, for each model:

1. Train the model with 10-fold cross validation.

2. Evaluate the model by Mean Square Error (MSE).

Search for the model with minimum MSE from the 255 Lasso models, one per training set → BEST MODEL
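The training step can be sketched in plain Python: a Lasso fit by coordinate descent, scored by 10-fold cross-validated MSE. This is a minimal sketch, not the authors' implementation. The objective used here is (1/(2n))·||y − Xw − b||² + alpha·||w||₁; using cross validation to choose alpha, the candidate alpha values, and the synthetic data are all assumptions made for illustration.

```python
import random

def soft_threshold(rho, alpha):
    """Shrink rho toward zero by alpha (the Lasso proximal step)."""
    if rho > alpha:
        return rho - alpha
    if rho < -alpha:
        return rho + alpha
    return 0.0

def fit_lasso(X, y, alpha, n_sweeps=50):
    """Fit Lasso by cyclic coordinate descent on centered data."""
    n, d = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(d)]
    y_mean = sum(y) / n
    cols = [[row[j] - x_mean[j] for row in X] for j in range(d)]
    resid = [yi - y_mean for yi in y]  # residual for w = 0
    w = [0.0] * d
    for _ in range(n_sweeps):
        for j, col in enumerate(cols):
            z = sum(c * c for c in col) / n
            if z == 0.0:
                continue  # constant feature carries no information
            # Correlation of feature j with the residual that excludes it.
            rho = sum(c * (r + w[j] * c) for c, r in zip(col, resid)) / n
            new_wj = soft_threshold(rho, alpha) / z
            resid = [r - (new_wj - w[j]) * c for c, r in zip(col, resid)]
            w[j] = new_wj
    intercept = y_mean - sum(wj * mj for wj, mj in zip(w, x_mean))
    return w, intercept

def mse(model, X, y):
    """Mean square error of a (weights, intercept) model on X, y."""
    w, b = model
    errors = [b + sum(wj * xj for wj, xj in zip(w, row)) - yi
              for row, yi in zip(X, y)]
    return sum(e * e for e in errors) / len(errors)

def cross_val_mse(X, y, alpha, k=10, seed=0):
    """Average held-out MSE over k folds."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    scores = []
    for f in range(k):
        held = set(idx[f::k])
        train = [i for i in idx if i not in held]
        model = fit_lasso([X[i] for i in train], [y[i] for i in train], alpha)
        scores.append(mse(model, [X[i] for i in held], [y[i] for i in held]))
    return sum(scores) / k

# Synthetic data: y depends on the first feature only.
X = [[i / 10.0, ((7 * i) % 11) / 11.0] for i in range(40)]
y = [3.0 * row[0] + 1.0 for row in X]

alphas = [0.01, 0.1, 1.0]  # assumed candidate regularization strengths
best_alpha = min(alphas, key=lambda a: cross_val_mse(X, y, a))
w, b = fit_lasso(X, y, best_alpha)
print("best alpha:", best_alpha, "weights:", w, "intercept:", b)
```

On this synthetic data the L1 penalty drives the weight of the irrelevant second feature to zero, which is the property that makes Lasso useful for judging the effectiveness of features.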

SLIDE 13


Experiments

Train models on a small-scale data set

– 3,465 (Titan) and 4,715 (Cetus) converged samples collected with multiple IOR benchmarks on the scale of 1-128 compute nodes

Evaluate models on medium scale

– 668 (Titan) and 874 (Cetus) converged samples produced by 200-512 compute nodes

Evaluation criteria

– Accuracy of the best model

– Effectiveness of features

SLIDE 14


Reported 4 models

Lasso_best

– With minimum Mean Square Error from 255 Lasso models across the training set candidates

Lasso_base

– The Lasso model trained on the write scales of 1-128 compute nodes

Linear_best

– With minimum Mean Square Error from 255 Linear models across the training set candidates

Linear_base

– The Linear model trained on the write scales of 1-128 compute nodes
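The count of 255 equals the number of non-empty subsets of eight items (2^8 − 1). The Python sketch below assumes, without confirmation from the slides, that the eight items are the write scales 1, 2, 4, ..., 128 compute nodes and that each subset forms one training set candidate. The scoring function is a placeholder with no empirical meaning; in practice it would be the MSE of the model trained on that subset.

```python
from itertools import combinations

def candidate_training_sets(scales):
    """Yield every non-empty subset of the given write scales."""
    for size in range(1, len(scales) + 1):
        yield from combinations(scales, size)

def select_best(candidates, score):
    """Return the candidate with the lowest score (e.g., MSE)."""
    return min(candidates, key=score)

# Assumed write scales, in compute nodes.
scales = [1, 2, 4, 8, 16, 32, 64, 128]
candidates = list(candidate_training_sets(scales))
print("training set candidates:", len(candidates))

def placeholder_score(subset):
    """Stand-in for 'train a model on this subset and return its MSE'."""
    return abs(max(subset) - 128) + 0.01 * len(subset)

best = select_best(candidates, placeholder_score)
print("selected candidate:", best)
```

Under this reading, the base models correspond to the single candidate that contains all eight scales, while the best models are the minimum-MSE candidates.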

SLIDE 15


Results on Titan and Cetus

[Figure: four panels, two for Titan/Spider 2 (Lustre) and two for Cetus/Mira-FS1 (GPFS); for each system, one test set with 200, 256 nodes and one test set with 400, 512 nodes. Axis labels: "Samples sorted by t, Unit: Sec" and "Relative True Error". Curves: Lasso_best, Lasso_base, Linear_best, Linear_base, suffixed _lustre for Titan and _gpfs for Cetus.]

SLIDE 16


Results on Titan and Cetus

[Figure: the same four panels as the previous slide, for Titan/Spider 2 and Cetus/Mira-FS1.]

Lasso_best is highly accurate and the best model

SLIDE 17


Conclusions

Problem

– Understand the I/O write performance of large-scale supercomputers

Our Solution

– Systematic ML approach with Lasso

– Modeling the mean performance, extracting features from application write patterns, system architecture and configurations, convergence-guaranteed sampling

Findings

– Lasso_best is the most accurate model for both Titan and Cetus

– Most effective features are load skew in supercomputers and resources in use on the system side

Applicability

– Lasso models, features: Lustre, GPFS deployment

– Systematic modeling method: generic supercomputer I/O subsystems

SLIDE 18


Related Works and Our Solution

– Profiling supercomputer I/O subsystems under production loads

– Darshan toolkit

– Statistical benchmarking

– ROMIO, ADIOS

– Tune I/O parameters at application level

– Learn I/O patterns from job logs and system monitoring data

– First ML work to predict write performance of large-scale parallel filesystems based on application write patterns, system architecture, and configurations
