Birds-of-Feather: ECP Center and Application Monitoring - Working Group




SLIDE 1

ORNL is managed by UT-Battelle, LLC for the US Department of Energy

Birds-of-Feather: ECP Center and Application Monitoring - Working Group

Day: Thursday, February 6, 2020 Time: 1:30pm – 3:00pm CT Room: Founders I

SLIDE 2

Recent Advances Are Disrupting The Status Quo…

Homogeneous Processors

Homogeneous Memories

SLIDE 3

Where Data-Driven Decisions Make A Difference…

  • Improved feedback to Application Developers on how their jobs performed (e.g., others with your application signature have used the XXXX library)
  • Improved feedback to CS Researchers on how to improve the software environment (e.g., link compiler and memory data)
  • Improved Job Scheduler to decide what runs on our machines
  • Improved feedback to Operations (e.g., Chiller Management)
  • Improved feedback to Planners on the characteristics of our workload (e.g., prefer 5% faster memory over 12% faster interconnect)
  • Improved feedback to Vendors on how we use systems (e.g., 22% of jobs use GPUs in an XXXXX manner)
  • Security & better quantification of our outputs: better ways to identify applications (avoid inappropriate usage like malware or Bitcoin mining) and answer questions about science hours, utilization, and so on
  • Plus many, many other uses
SLIDE 4

But There Are Barriers to Overcome…

  • A tremendous amount of data is currently collected within our computer center, but it is artificially separated by knowledge domains:
      • Sysadmin data
      • Data available from tools and performance counters
      • Resilience & health of system data
      • Workload characteristics & resource usage details for future procurements
      • And so on…

  • The artificially separated domains form a high barrier to understanding & a wealth of information is currently largely untapped:
      • Currently users are not aware of most of the data & do not have adequate access
      • Data available from various environment tools is underutilized
      • System designers don’t have adequate access to how the systems are used

“It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts.”

– Sherlock Holmes (Sir Arthur Conan Doyle)

[Figure: What We Have… SEPARATE STOVEPIPE SYSTEMS: Ops, Users, CS Research, Planning, Security]
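As a rough illustration of what breaking down the stovepipes could look like, the sketch below merges records from three separately kept data domains into one view per job. Everything here is hypothetical: the record layouts, the field names, and the choice of a shared `job_id` key are assumptions made for the example, not a description of any center's actual monitoring data.

```python
from collections import defaultdict

# Hypothetical records from three separately kept data domains.
scheduler_log = [
    {"job_id": 101, "user": "alice", "nodes": 64},
    {"job_id": 102, "user": "bob", "nodes": 8},
]
perf_counters = [
    {"job_id": 101, "peak_mem_gb": 410.0},
    {"job_id": 102, "peak_mem_gb": 12.5},
]
health_events = [
    {"job_id": 101, "event": "ecc_corrected"},
]


def merge_by_job(*sources):
    """Merge records from several sources into one dict keyed by job_id."""
    merged = defaultdict(dict)
    for source in sources:
        for record in source:
            job = merged[record["job_id"]]
            for key, value in record.items():
                if key == "job_id":
                    continue
                if key == "event":
                    # A job can have many health events, so collect them in a list.
                    job.setdefault("events", []).append(value)
                else:
                    job[key] = value
    return dict(merged)


unified = merge_by_job(scheduler_log, perf_counters, health_events)
for job_id, view in sorted(unified.items()):
    print(job_id, view)
```

The point of the sketch is only that a shared key lets one query span several domains; in practice, agreeing on that key across Ops, Users, CS Research, Planning, and Security data is likely the hard part.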

SLIDE 5

ECP’s Interest: Better Data Mitigates Identified Risks…

| Risk ID | WBS | Risk* | Adding Data Component To Support Solution | Mitigation (best) | Mitigation (worst) |
|---|---|---|---|---|---|
| 10000 | 2.2 | If Aurora has lower than anticipated aggregate memory capacity, then some projects may be unable to run challenge problem(s), which would make it impossible to meet KPP-1 or KPP-2. | Track memory utilization | $1.0M | $1.5M |
| 10001 | 2.2 | If Frontier has lower than anticipated aggregate memory capacity, then some projects may be unable to run challenge problem(s), which would make it impossible to meet KPP-1 or KPP-2. | Track memory utilization of apps | $1.0M | $1.5M |
| 10007 | 2.3.1 | If MPICH does not meet the other ECP subproject performance needs, for example, in interactions between CPU and GPU, directly between GPUs, in latency hiding, etc., then this will be a significant impact to the overall project as vendor implementations of MPI often rely heavily on MPICH. | Monitor MPI performance | $2.0M | $3.0M |
| 10008 | 2.3.1 | If OpenMP 4.5/5.0 does not meet the other ECP subproject needs, there will be a significant impact to the overall project as OpenMP is a widely used mechanism for achieving good node-level performance. | Track OpenMP utilization in applications & plugins to tools API | $1.6M | $2.4M |
| 10009 | 2.3.3 | If sparse linear solvers fail to perform well at scale and on multi-node architectures or otherwise don't meet ECP application needs, several ECP applications will be at risk of not being able to meet their KPPs. | Track Libraries utilization and how they scale | $1.6M | $2.4M |
| 10010 | 2.3.3 | If dense linear algebra kernels fail to perform well on multi-node architectures, several ECP applications will be at risk of not being able to meet their KPPs. | Track libraries implementation | $0.8M | $1.2M |
| 10012 | 2.3.5 | If ST products are perceived as, or in fact are, inferior or overly complex, AD performance could suffer and ST products will not be adopted. | Track utilization of ST products | | |
| 10016 | 2.2 | Aurora or Frontier HW or SW has defects (e.g., bugs). | Monitor error bugs and aggregate them until a threshold is met | | |
| 10018 | 2.2 | Language features used by applications perform poorly or are not fully supported on Frontier and/or Aurora. | Compiler tracking of language standard utilization and possibly performance | $0.6M | $0.9M |
| 10019 | 2.3.2 | If vendor software does not provide required functionality or performance, Applications and/or ST products may not perform as required. | Track Application Utilization | | |
| 10022 | 2.3.2 | If we do not have a Fortran compiler on Aurora that supports OpenMP target offload capabilities, then we will not be able to compile applications. | Track Fortran utilization | $1.0M | $1.5M |
| 10023 | 2.3.2 | If we do not have a Fortran compiler on Frontier that supports OpenMP target offload capabilities, then we will not be able to compile applications. | Track Fortran + OpenMP utilization | $1.0M | $1.5M |
| 10025 | 2.3 | If vendors produce new high-performance programming models for next generation architectures, e.g., HIP or SYCL, instead of ST-supported models, ST products or functionality may be underused. | Track SYCL, PM utilization | | |
| 10032 | 2.3 | If ST products do not function, meet performance targets, or support key system capabilities at full system size, then dependent AD and ST codes will not meet goals. Because there are no effective proxy systems, these issues are revealed late in the ECP project. | Track how ST products are used | | |
| 10046 | 2.4 | If the Facilities do not provide reliable, timely access to the systems for integration of ECP ST, ECP AD products, and/or resources in support of ECP efforts, then this will delay demonstration of KPPs. | Track utilization for information sharing with Facilities | | |
| Total | | | | $10.6M | $15.9M |

(* Taken From ECP Risk Register v6.1)
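The totals of $10.6M (best) and $15.9M (worst) can be checked against the per-risk figures. The short sketch below does that sum, assuming each pair of dollar amounts on the slide belongs to the risk listed just before it; risks with no amount shown are left out. The variable names are illustrative only.

```python
from decimal import Decimal

# (risk_id, best_case, worst_case) in millions of dollars, for the risks
# that list a mitigation amount on the slide.
costs = [
    (10000, "1.0", "1.5"),
    (10001, "1.0", "1.5"),
    (10007, "2.0", "3.0"),
    (10008, "1.6", "2.4"),
    (10009, "1.6", "2.4"),
    (10010, "0.8", "1.2"),
    (10018, "0.6", "0.9"),
    (10022, "1.0", "1.5"),
    (10023, "1.0", "1.5"),
]

# Decimal avoids binary floating-point rounding when adding amounts like 0.6.
best_total = sum(Decimal(best) for _, best, _ in costs)
worst_total = sum(Decimal(worst) for _, _, worst in costs)

print(f"best: ${best_total}M, worst: ${worst_total}M")
# prints "best: $10.6M, worst: $15.9M"
```

Under that reading the nine listed amounts add up to the stated totals, and every worst-case figure is 1.5 times its best-case figure.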

SLIDE 6

The ECP Center and Application Monitoring - Working Group

| BoF Agenda | Time | Speaker(s) | Description |
|---|---|---|---|
| Introduction | 5 mins | Jones/Montoya | Some challenges (work toward white paper) |
| ECP Viewpoint | 5 mins | Heroux/Quinn | What ECP Level 2’s and Level 3’s want to see |
| ECP Use Cases | 15 mins | Panel 1 | What does workflow need? What do apps teams need? |
| Current State | 25 mins | Panel 2 | A brief overview of the monitoring activities at several institutions (LANL/OLCF/ALCF/SNL/LLNL/NERSC/Cray) |
| Open Discussion | 35 mins | Full Audience | Audience participation; also cover Next Steps |

Goals for the BoF

  • Capture the current state of center-wide monitoring at ECP institutions
  • Make a determination of need
  • Produce a white paper on the Identified Need

Notes: https://confluence.exascaleproject.org/display/HISD/Annual+Meeting+Notes+-+BoF