

SLIDE 1 (1/36)

ComplexHPC Spring School Day 2: KOALA Tutorial
The KOALA Scheduler

Nezih Yigitbasi Delft University of Technology


May 10, 2011

SLIDE 2 (2/36)

Outline

  • Koala Architecture
  • Job Model
  • System Components
  • Support for different application types:
    • Parallel Applications
    • Parameter Sweep Applications (PSAs)
    • Workflows
SLIDE 3 (3/36)

Introduction

  • Developed in the context of the DAS system
  • First deployed on DAS-2 in September 2005
  • Ported to DAS-3 in April 2007, and to DAS-4 in April 2011
  • Independent of grid middleware such as Globus
  • Runs on top of local schedulers
  • Objectives:
    • Data and processor co-allocation in grids
    • Support for different application types
    • Specialized job placement policies
SLIDE 4 (4/36)

Background (1): DAS-4

[Figure: the DAS-4 sites, connected by SURFnet6 with 10 Gb/s lambdas: VU (148 CPUs), TU Delft (64), Leiden (32), UvA/MultimediaN (72), UvA (32), and Astron (46)]

  • 1,600 cores (quad-core, 2.4 GHz CPUs)
  • Accelerators
  • 180 TB storage
  • InfiniBand
  • Gigabit Ethernet
  • Operational since October 2010

SLIDE 5 (5/36)

Background (2): Grid Applications

  • Different application types with different characteristics:
    • Parallel applications
    • Parameter sweep applications
    • Workflows
    • Data-intensive applications
  • Challenges:
    • Application characteristics and needs
    • The grid infrastructure is highly heterogeneous
    • Grid infrastructure configuration issues
    • Grid resources are highly dynamic
SLIDE 6 (6/36)

Koala Job Model

[Figure: the three job types. Fixed job: the placement of the job components is fixed. Non-fixed job: the scheduler decides on the component placement. Flexible job: same total job size, but the scheduler decides on both the split-up into components and their placement.]

  • A job consists of one or more job components (see the sketch below)
  • A job component contains:
    • An executable name
    • Sufficient information for scheduling
    • Sufficient information for execution
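To make the job model concrete, here is a minimal sketch in Python (illustrative only; Koala is not implemented this way, and the field names are assumptions, not Koala's actual job description format):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class JobComponent:
    executable: str                      # what to run
    num_processors: int                  # scheduling information
    cluster: Optional[str] = None        # fixed placement, or None if the scheduler decides
    arguments: List[str] = field(default_factory=list)    # execution information
    input_files: List[str] = field(default_factory=list)  # for file-aware policies

@dataclass
class Job:
    components: List[JobComponent]

    def is_fixed(self) -> bool:
        # a fixed job: every component already has a placement
        return all(c.cluster is not None for c in self.components)
```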

SLIDE 7 (7/36)

Koala Architecture (1)

SLIDE 8 (8/36)

Koala Architecture (2): A Closer Look

  • PIP/NIP: information services
  • RLS: replica location service
  • CO: co-allocator
  • PC: processor claimer
  • RM: run monitor
  • RL: runners listener
  • DM: data mover
  • Ri: runners

SLIDE 9 (9/36)

Scheduler

  • Enforces the scheduling policies:
  • Co-allocation policies:
    • Worst Fit, Flexible Cluster Minimization, Communication-Aware, Close-to-Files
  • Malleability management policies:
    • Favour Previously Started Malleable Applications
    • Equi Grow Shrink
  • Cycle scavenging policies:
    • Equi-All, Equi-PerSite
  • Workflow scheduling policies:
    • Single-Site, Multi-Site
SLIDE 10 (10/36)

Runners

  • Extend support to different application types:
  • KRunner: the Globus runner
  • PRunner: a simplified job runner
  • IRunner: for Ibis applications
  • OMRunner: for OpenMPI applications
  • MRunner: for malleable applications based on the DYNACO framework
  • WRunner: for workflows (Directed Acyclic Graphs) and Bags-of-Tasks (BoTs)
SLIDE 11 (11/36)

The Runners Framework

SLIDE 12 (12/36)

Support for Different Application Types

  • Parallel Applications
    • MPI, Ibis, …
    • Co-allocation
    • Malleability
  • Parameter Sweep Applications
    • Cycle scavenging: run as low-priority jobs
  • Workflows
SLIDE 13 (13/36)

Support for Co-Allocation

  • What is co-allocation? (a reminder)
  • Co-allocation Policies
  • Experimental Results
SLIDE 14 (14/36)

Co-Allocation

  • Simultaneous allocation of resources in multiple sites
    • Higher system utilization
    • Lower queue wait times
  • Co-allocated applications may be less efficient due to the relatively slow wide-area communication
  • Parallel applications may have different communication characteristics

SLIDE 15 (15/36)

Co-Allocation Policies (1)

  • Dictate where the components of a job go
  • Policies for non-fixed jobs:
    • Load-aware: Worst Fit (WF), which balances the load across clusters (see the sketch below)
    • Input-file-location-aware: Close-to-Files (CF), which reduces file-transfer times
    • Communication-aware: Cluster Minimization (CM), which reduces the number of wide-area messages

See: H.H. Mohamed and D.H.J. Epema, “An Evaluation of the Close-to-Files Processor and Data Co-Allocation Policy in Multiclusters,” IEEE Cluster 2004.
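As an illustration of the load-aware policy, here is a minimal Worst Fit sketch in Python (not Koala's actual code; the cluster and component representations are assumptions):

```python
def worst_fit(component_sizes, idle):
    """Worst Fit: place each job component on the cluster that currently
    has the most idle processors, which balances the load.

    component_sizes: processors requested per component, e.g. [8, 8, 8]
    idle: cluster name -> number of idle processors (mutated)
    Returns the chosen cluster per component, or None if the job
    does not fit right now.
    """
    placement = []
    for size in component_sizes:
        cluster = max(idle, key=idle.get)   # emptiest cluster first
        if idle[cluster] < size:
            return None                     # this component does not fit
        placement.append(cluster)
        idle[cluster] -= size
    return placement

# the example of slide 17: three 8-processor components end up
# on three different clusters
print(worst_fit([8, 8, 8], {"C1": 16, "C2": 16, "C3": 16}))
```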

SLIDE 16 (16/36)

Co-Allocation Policies (2)

  • Placement policies for flexible jobs:
    • Queue-time-aware: Flexible Cluster Minimization (FCM) (CM + reduce queue wait time)
    • Communication-aware: Communication-Aware (CA) (decisions based on inter-cluster communication speeds)

See: O.O. Sonmez, H.H. Mohamed and D.H.J. Epema, “Communication-aware Job Scheduling Policies for the Koala Grid Scheduler,” IEEE e-Science 2006.

SLIDE 17 (17/36)

Co-Allocation Policies (3)

[Figure: a placement example with three clusters C1 (16), C2 (16), and C3 (16). WF places a non-fixed job with three components of 8 processors each on clusters I, II, and III. FCM splits a flexible job of total size 24 into two components and places them on clusters I and II only.]
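A sketch of the FCM side of this example, under the assumed semantics of filling the clusters with the most idle processors first (illustrative Python, not Koala's implementation):

```python
def fcm(total_size, idle):
    """Flexible Cluster Minimization (sketch of the assumed semantics):
    split a flexible job of total_size processors over as few clusters
    as possible, preferring the clusters with the most idle processors,
    which also reduces the queue wait time.

    idle: cluster name -> number of idle processors
    Returns (cluster, component size) pairs, or None if the grid is full.
    """
    components, remaining = [], total_size
    for cluster, free in sorted(idle.items(), key=lambda kv: -kv[1]):
        if remaining == 0:
            break
        take = min(free, remaining)
        if take > 0:
            components.append((cluster, take))
            remaining -= take
    return components if remaining == 0 else None

# the figure's example: a 24-processor flexible job is split over
# only two of the three 16-processor clusters
print(fcm(24, {"C1": 16, "C2": 16, "C3": 16}))   # [('C1', 16), ('C2', 8)]
```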

SLIDE 18 (18/36)

Experimental Results: Co-Allocation vs. No Co-Allocation

  • OpenMPI + DRMAA
  • No co-allocation (left) vs. co-allocation (right)
  • Workloads of real parallel applications, ranging from computation-intensive (Prime) to very communication-intensive (Wave)

[Figure: average job response time (s) of Prime, Poisson, and Wave, without and with co-allocation]

Co-allocation is disadvantageous for communication-intensive applications.

SLIDE 19 (19/36)

Experimental Results: The Performance of the Policies

  • Flexible Cluster Minimization vs. Communication-Aware
  • Workloads of communication-intensive applications

[Figure: average job response time (s) of FCM and CA, without and with the Delft cluster]

Considering the network metrics improves the co-allocation performance.

SLIDE 20 (20/36)

Support for PSAs in Koala

  • Background
  • System Design
  • Scheduling Policies
  • Experimental Results
SLIDE 21 (21/36)

Parameter Sweep Application Model

  • A single executable that is run for a large set of parameters (see the sketch below)
  • E.g., Monte Carlo simulations, bioinformatics applications, …
  • PSAs may run in multiple clusters simultaneously
  • We support OGF’s JSDL 1.0 (XML)
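A PSA thus expands into the cross-product of its parameter values. A minimal, hypothetical Python sketch of that expansion (the real job description is JSDL, not Python, and the command-line convention is an assumption):

```python
from itertools import product

def sweep_tasks(executable, parameters):
    """Expand a parameter sweep into independent tasks: one run of the
    single executable per point in the parameter space.

    parameters: parameter name -> list of values to sweep over
    Yields one command line per task.
    """
    names = list(parameters)
    for values in product(*(parameters[n] for n in names)):
        yield [executable] + [f"--{n}={v}" for n, v in zip(names, values)]

# 3 seeds x 2 temperatures = 6 independent tasks
for task in sweep_tasks("./simulate", {"seed": [1, 2, 3], "T": [0.1, 0.2]}):
    print(" ".join(task))
```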
SLIDE 22 (22/36)

Motivation

  • How to run thousands of tasks on the DAS?
  • Issues:
    • The 15-minute rule!
    • Observational scheduling
    • Overload
  • Run them as cycle scavenging applications!
    • Sets priority classes implicitly
    • No need to watch for empty clusters
SLIDE 23 (23/36)

Cycle Scavenging

  • The technology behind volunteer computing projects
  • Harnessing idle CPU cycles from desktops:
    • Download a piece of software (a screen saver)
    • Receive tasks from a central server
    • Execute a task when the computer is idle
    • Preempt the task immediately when the user becomes active again
SLIDE 24 (24/36)

System Requirements

  • 1. Unobtrusiveness: minimal delay for (higher-priority) local and grid jobs
  • 2. Fairness: multiple cycle scavenging applications running concurrently should be assigned comparable CPU time
  • 3. Dynamic resource allocation: cycle scavenging applications have to grow/shrink at runtime
  • 4. Efficiency: make as much use of the dynamic resources as possible
  • 5. Robustness and fault tolerance: in a long-running, complex system, problems will occur and must be dealt with

SLIDE 25 (25/36)

System Interaction

[Figure: the user submits PSA(s), described in JDL, to the Scheduler; the CS-Runner registers with the Scheduler and receives grow/shrink messages; the CS-Runner submits launchers to the clusters; a KCM on each cluster's head node monitors the idle/demanded resources and informs the Scheduler; the launchers deploy, monitor, and preempt tasks on the nodes]

  • CS policies:
    • Equi-All: on a grid-wide basis
    • Equi-PerSite: per cluster
  • Application-level scheduling:
    • Pull-based approach (see the sketch below)
    • Shrinkage policy
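The pull-based approach can be sketched as follows (illustrative Python; the `runner` client interface is an assumption, as the real launchers speak a custom protocol to the CS-Runner):

```python
import subprocess

def launcher_loop(runner):
    """One launcher, running as a low-priority job on a scavenged node:
    it pulls tasks from the CS-Runner one by one instead of paying the
    scheduler's job-startup overhead for every task.

    `runner` is an assumed client object with preempted(), next_task(),
    and report() methods.
    """
    while not runner.preempted():      # stop when the node is reclaimed
        task = runner.next_task()      # pull the next command line
        if task is None:
            break                      # sweep finished: release the node
        result = subprocess.run(task, capture_output=True)
        runner.report(task, result.returncode)
```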
SLIDE 26 (26/36)

Cycle Scavenging Policies

  • 1. Equipartition-All

[Figure: the idle processors of clusters C1 (12), C2 (12), and C3 (24) are divided equally among CS User-1, CS User-2, and CS User-3 on a grid-wide basis]

SLIDE 27 (27/36)

Cycle Scavenging Policies

  • 2. Equipartition-PerSite (see the sketch below)

[Figure: the idle processors of each of the clusters C1 (12), C2 (12), and C3 (24) are divided equally among CS User-1, CS User-2, and CS User-3, per cluster]
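A minimal sketch of the two equipartition flavours shown above (illustrative Python; remainders are simply ignored here):

```python
def equi_all(idle, users):
    """Equipartition-All: divide the grid-wide pool of idle processors
    equally among the CS users."""
    share = sum(idle.values()) // len(users)
    return {user: share for user in users}

def equi_per_site(idle, users):
    """Equipartition-PerSite: divide the idle processors of each
    cluster separately and equally among the CS users."""
    return {user: {c: free // len(users) for c, free in idle.items()}
            for user in users}

# the example of slides 26-27: C1 (12), C2 (12), C3 (24), three CS users
idle = {"C1": 12, "C2": 12, "C3": 24}
users = ["CS User-1", "CS User-2", "CS User-3"]
print(equi_all(idle, users))       # 16 processors per user, grid-wide
print(equi_per_site(idle, users))  # 4 + 4 + 8 processors per user
```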

SLIDE 28 (28/36)

Experimental Results

  • DAS-3
  • Using launchers vs. not using launchers:
    • 60 s dummy tasks, tested on a 32-node cluster
    • Without launchers: job startup overhead + information delay
  • Equi-All vs. Equi-PerSite:
    • 3 CS users submit the same application with the same parameter range
    • Non-CS workloads: WBlock, WBurst

[Figure: number of completed jobs and makespan (s) for Equi-All and Equi-PerSite under the WBlock and WBurst workloads]

Equi-PerSite is fair and superior to Equi-All.

See: O. Sonmez, B. Grundeken, H.H. Mohamed, A. Iosup and D.H.J. Epema, “Scheduling Strategies for Cycle Scavenging in Multicluster Grid Systems,” IEEE CCGrid 2009.

SLIDE 29 (29/36)

Support for Workflows in Koala

  • Applications with dependencies
  • E.g., the Montage workflow:
    • An astronomy application that generates mosaics of the sky
    • 4,500 tasks
    • Dependencies are file transfers
  • Experience the WRunner in the hands-on session

SLIDE 30 (30/36)

Workflow Scheduling Policies (1/3)

  • 1. Round Robin: submits the eligible tasks to the clusters in round-robin order (see the sketch below)
  • 2. Single Cluster: maps every complete workflow to the least-loaded cluster at its submission
  • 3. All Clusters: submits each eligible task to the least-loaded cluster
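All of these policies operate on the set of eligible tasks, i.e., the tasks whose predecessors in the DAG have all finished. A minimal, illustrative Python sketch of eligibility and of policy 1 (the DAG representation is an assumption: a dict mapping each task to the set of tasks it depends on):

```python
from itertools import cycle

def eligible(dag, done):
    """Tasks that have not run yet and whose dependencies have all
    completed. dag: task -> set of tasks it depends on."""
    return [t for t, deps in dag.items() if t not in done and deps <= done]

def round_robin(dag, clusters, submit):
    """Policy 1 (sketch): submit each wave of eligible tasks to the
    clusters in round-robin order. submit(task, cluster) is assumed to
    block until the task finishes; a real runner overlaps submissions."""
    done, order = set(), cycle(clusters)
    while len(done) < len(dag):
        for task in eligible(dag, done):
            submit(task, next(order))
            done.add(task)

# a diamond-shaped workflow: A feeds B and C, which feed D
dag = {"A": set(), "B": {"A"}, "C": {"A"}, "D": {"B", "C"}}
round_robin(dag, ["C1", "C2"], lambda t, c: print(f"{t} -> {c}"))
```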

SLIDE 31 (31/36)

Workflow Scheduling Policies (2/3)

  • 4. All Clusters File-Aware: submits each eligible task to the cluster that minimizes the transfer costs of the files on which it depends
  • 5. Coarsening*: iteratively reduces the size of a graph by collapsing groups of nodes and their internal edges
    • We use the Heavy Edge Matching* technique to group tasks that are connected by heavy edges

*G. Karypis and V. Kumar, “Multilevel graph partitioning schemes,” Int. Conf. on Parallel Processing, pages 113–122, 1995.

SLIDE 32 (32/36)

Workflow Scheduling Policies (3/3)

  • 6. Cluster Minimization: submits as many eligible tasks as possible to a cluster before considering the next cluster
  • 7. HEFT* (Heterogeneous Earliest-Finish-Time): selects the task with the highest upward-rank value and assigns it to the cluster that ensures its earliest completion time (see the sketch below)

*H. Topcuoglu, S. Hariri, and M. Wu, “Performance-effective and low-complexity task scheduling for heterogeneous computing,” IEEE TPDS, 13(3):260–274, 2002.

We will use Single Cluster and All Clusters in the hands-on session.
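The upward rank of a task is the length of the critical path from that task to the exit of the DAG, computed with average computation and communication costs. A compact, illustrative Python sketch (the cost model is reduced to given averages; not Koala's code):

```python
def upward_ranks(succ, comp_cost, comm_cost):
    """HEFT's upward rank: a task's average computation cost plus the
    heaviest (communication cost + rank) path through its successors.

    succ: task -> list of successor tasks
    comp_cost: task -> average computation cost
    comm_cost: (task, successor) -> average communication cost
    """
    ranks = {}
    def rank(t):
        if t not in ranks:
            ranks[t] = comp_cost[t] + max(
                (comm_cost[(t, s)] + rank(s) for s in succ[t]), default=0.0)
        return ranks[t]
    for t in succ:
        rank(t)
    return ranks
```

HEFT then considers the tasks in decreasing order of upward rank and places each on the resource that gives it the earliest finish time.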

SLIDE 33 (33/36)

Workflow Engine Architecture

[Figure: the workflow engine architecture, with components labeled Workflow Description, DRMAA, DFS, SSH + Custom Protocol, and Execution]

SLIDE 34 (34/36)

Cloud Integration

  • Resource management issues:
    • When and how many resources to allocate
    • What type of resource to allocate
    • Which availability zone
  • Performance issues:
    • No queueing, but resource acquisition/release overheads
    • Virtualization overhead
    • Wide-area file transfers
  • Cost issues (for public clouds)
  • Functionality issues (MapReduce)
SLIDE 35 (35/36)

Conclusion

  • Koala supports multiple application types:
    • Sequential applications
    • Parallel applications that may need co-allocation
    • Parallel applications that can grow/shrink at runtime
    • Parameter sweep applications
    • Workflows
  • Different scheduling policies for each application type
  • No one-size-fits-all policy
SLIDE 36 (36/36)

M.N.Yigitbasi@tudelft.nl
http://www.st.ewi.tudelft.nl/~nezih/

More Information:

  • Koala Project: http://st.ewi.tudelft.nl/koala
  • PDS publication database: http://www.pds.twi.tudelft.nl
  • DAS-4: http://www.cs.vu.nl/das4/