

SLIDE 1 (1/36)

ComplexHPC Spring School Day 2: KOALA Tutorial
The KOALA Scheduler

Nezih Yigitbasi Delft University of Technology


May 10, 2011

SLIDE 2 (2/36)

Outline

  • Koala Architecture
  • Job Model
  • System Components
  • Support for different application types:
    • Parallel Applications
    • Parameter Sweep Applications (PSAs)
    • Workflows
SLIDE 3 (3/36)

Introduction

  • Developed in the context of the DAS system
  • First deployed on DAS-2 in September 2005
  • Ported to DAS-3 in April 2007, and to DAS-4 in April 2011
  • Independent of grid middleware such as Globus
  • Runs on top of local schedulers
  • Objectives:
    • Data and processor co-allocation in grids
    • Support for different application types
    • Specialized job placement policies
SLIDE 4 (4/36)

Background (1): DAS-4

[Figure: the DAS-4 sites, connected by SURFnet6 with 10 Gb/s lambdas: VU (148 CPUs), TU Delft (64), Leiden (32), UvA/MultimediaN (72), UvA (32), and Astron (46)]

  • 1,600 cores (quad-core, 2.4 GHz CPUs)
  • Accelerators
  • 180 TB storage
  • InfiniBand
  • Gigabit Ethernet
  • Operational since October 2010

SLIDE 5 (5/36)

Background (2): Grid Applications

  • Different application types with different characteristics:
    • Parallel applications
    • Parameter sweep applications
    • Workflows
    • Data-intensive applications
  • Challenges:
    • Application characteristics and needs
    • The grid infrastructure is highly heterogeneous
    • Grid infrastructure configuration issues
    • Grid resources are highly dynamic
SLIDE 6 (6/36)

Koala Job Model

[Figure: the three job types. Fixed job: the placement of the job components is fixed. Non-fixed job: the scheduler decides on the component placement. Flexible job: same total job size, but the scheduler decides on both the split-up into components and their placement.]

  • A job consists of one or more job components (see the sketch below)
  • A job component contains:
    • An executable name
    • Sufficient information for scheduling
    • Sufficient information for execution
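To make the job model concrete, here is a minimal sketch in Python (illustrative only; Koala is not implemented this way, and the field names are assumptions, not Koala's actual job description format):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class JobComponent:
    executable: str                      # what to run
    num_processors: int                  # scheduling information
    cluster: Optional[str] = None        # fixed placement, or None if the scheduler decides
    arguments: List[str] = field(default_factory=list)    # execution information
    input_files: List[str] = field(default_factory=list)  # for file-aware policies

@dataclass
class Job:
    components: List[JobComponent]

    def is_fixed(self) -> bool:
        # a fixed job: every component already has a placement
        return all(c.cluster is not None for c in self.components)
```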

SLIDE 7 (7/36)

Koala Architecture (1)

SLIDE 8 (8/36)

Koala Architecture (2): A Closer Look

  • PIP/NIP: information services
  • RLS: replica location service
  • CO: co-allocator
  • PC: processor claimer
  • RM: run monitor
  • RL: runners listener
  • DM: data mover
  • Ri: runners

SLIDE 9 (9/36)

Scheduler

  • Enforces the scheduling policies:
  • Co-allocation policies:
    • Worst Fit, Flexible Cluster Minimization, Communication-Aware, Close-to-Files
  • Malleability management policies:
    • Favour Previously Started Malleable Applications
    • Equi Grow Shrink
  • Cycle scavenging policies:
    • Equi-All, Equi-PerSite
  • Workflow scheduling policies:
    • Single-Site, Multi-Site
SLIDE 10 (10/36)

Runners

  • Extend support to different application types:
  • KRunner: the Globus runner
  • PRunner: a simplified job runner
  • IRunner: for Ibis applications
  • OMRunner: for OpenMPI applications
  • MRunner: for malleable applications based on the DYNACO framework
  • WRunner: for workflows (Directed Acyclic Graphs) and Bags-of-Tasks (BoTs)
SLIDE 11 (11/36)

The Runners Framework

SLIDE 12 (12/36)

Support for Different Application Types

  • Parallel Applications
    • MPI, Ibis, …
    • Co-allocation
    • Malleability
  • Parameter Sweep Applications
    • Cycle scavenging: run as low-priority jobs
  • Workflows
SLIDE 13 (13/36)

Support for Co-Allocation

  • What is co-allocation? (a reminder)
  • Co-allocation Policies
  • Experimental Results
SLIDE 14 (14/36)

Co-Allocation

  • Simultaneous allocation of resources in multiple sites
    • Higher system utilization
    • Lower queue wait times
  • Co-allocated applications may be less efficient due to the relatively slow wide-area communication
  • Parallel applications may have different communication characteristics

SLIDE 15 (15/36)

Co-Allocation Policies (1)

  • Dictate where the components of a job go
  • Policies for non-fixed jobs:
    • Load-aware: Worst Fit (WF), which balances the load across clusters (see the sketch below)
    • Input-file-location-aware: Close-to-Files (CF), which reduces file-transfer times
    • Communication-aware: Cluster Minimization (CM), which reduces the number of wide-area messages

See: H.H. Mohamed and D.H.J. Epema, “An Evaluation of the Close-to-Files Processor and Data Co-Allocation Policy in Multiclusters,” IEEE Cluster 2004.
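As an illustration of the load-aware policy, here is a minimal Worst Fit sketch in Python (not Koala's actual code; the cluster and component representations are assumptions):

```python
def worst_fit(component_sizes, idle):
    """Worst Fit: place each job component on the cluster that currently
    has the most idle processors, which balances the load.

    component_sizes: processors requested per component, e.g. [8, 8, 8]
    idle: cluster name -> number of idle processors (mutated)
    Returns the chosen cluster per component, or None if the job
    does not fit right now.
    """
    placement = []
    for size in component_sizes:
        cluster = max(idle, key=idle.get)   # emptiest cluster first
        if idle[cluster] < size:
            return None                     # this component does not fit
        placement.append(cluster)
        idle[cluster] -= size
    return placement

# the example of slide 17: three 8-processor components end up
# on three different clusters
print(worst_fit([8, 8, 8], {"C1": 16, "C2": 16, "C3": 16}))
```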

SLIDE 16 (16/36)

Co-Allocation Policies (2)

  • Placement policies for flexible jobs:
    • Queue-time-aware: Flexible Cluster Minimization (FCM) (CM + reduce queue wait time)
    • Communication-aware: Communication-Aware (CA) (decisions based on inter-cluster communication speeds)

See: O.O. Sonmez, H.H. Mohamed and D.H.J. Epema, “Communication-aware Job Scheduling Policies for the Koala Grid Scheduler,” IEEE e-Science 2006.

SLIDE 17 (17/36)

Co-Allocation Policies (3)

[Figure: a placement example with three clusters C1 (16), C2 (16), and C3 (16). WF places a non-fixed job with three components of 8 processors each on clusters I, II, and III. FCM splits a flexible job of total size 24 into two components and places them on clusters I and II only.]
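A sketch of the FCM side of this example, under the assumed semantics of filling the clusters with the most idle processors first (illustrative Python, not Koala's implementation):

```python
def fcm(total_size, idle):
    """Flexible Cluster Minimization (sketch of the assumed semantics):
    split a flexible job of total_size processors over as few clusters
    as possible, preferring the clusters with the most idle processors,
    which also reduces the queue wait time.

    idle: cluster name -> number of idle processors
    Returns (cluster, component size) pairs, or None if the grid is full.
    """
    components, remaining = [], total_size
    for cluster, free in sorted(idle.items(), key=lambda kv: -kv[1]):
        if remaining == 0:
            break
        take = min(free, remaining)
        if take > 0:
            components.append((cluster, take))
            remaining -= take
    return components if remaining == 0 else None

# the figure's example: a 24-processor flexible job is split over
# only two of the three 16-processor clusters
print(fcm(24, {"C1": 16, "C2": 16, "C3": 16}))   # [('C1', 16), ('C2', 8)]
```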

SLIDE 18 (18/36)

Experimental Results: Co-Allocation vs. No Co-Allocation

  • OpenMPI + DRMAA
  • No co-allocation (left) vs. co-allocation (right)
  • Workloads of real parallel applications, ranging from computation-intensive (Prime) to very communication-intensive (Wave)

[Figure: average job response time (s) of Prime, Poisson, and Wave, without and with co-allocation]

Co-allocation is disadvantageous for communication-intensive applications.

SLIDE 19 (19/36)

Experimental Results: The Performance of the Policies

  • Flexible Cluster Minimization vs. Communication-Aware
  • Workloads of communication-intensive applications

[Figure: average job response time (s) of FCM and CA, without and with the Delft cluster]

Considering the network metrics improves the co-allocation performance.

SLIDE 20 (20/36)

Support for PSAs in Koala

  • Background
  • System Design
  • Scheduling Policies
  • Experimental Results
SLIDE 21 (21/36)

Parameter Sweep Application Model

  • A single executable that is run for a large set of parameters (see the sketch below)
  • E.g., Monte Carlo simulations, bioinformatics applications, …
  • PSAs may run in multiple clusters simultaneously
  • We support OGF’s JSDL 1.0 (XML)
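A PSA thus expands into the cross-product of its parameter values. A minimal, hypothetical Python sketch of that expansion (the real job description is JSDL, not Python, and the command-line convention is an assumption):

```python
from itertools import product

def sweep_tasks(executable, parameters):
    """Expand a parameter sweep into independent tasks: one run of the
    single executable per point in the parameter space.

    parameters: parameter name -> list of values to sweep over
    Yields one command line per task.
    """
    names = list(parameters)
    for values in product(*(parameters[n] for n in names)):
        yield [executable] + [f"--{n}={v}" for n, v in zip(names, values)]

# 3 seeds x 2 temperatures = 6 independent tasks
for task in sweep_tasks("./simulate", {"seed": [1, 2, 3], "T": [0.1, 0.2]}):
    print(" ".join(task))
```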
SLIDE 22 (22/36)

Motivation

  • How to run thousands of tasks on the DAS?
  • Issues:
    • The 15-minute rule!
    • Observational scheduling
    • Overload
  • Run them as cycle scavenging applications!
    • Sets priority classes implicitly
    • No need to watch for empty clusters
SLIDE 23 (23/36)

Cycle Scavenging

  • The technology behind volunteer computing projects
  • Harnessing idle CPU cycles from desktops:
    • Download a piece of software (a screen saver)
    • Receive tasks from a central server
    • Execute a task when the computer is idle
    • Preempt the task immediately when the user becomes active again
SLIDE 24 (24/36)

System Requirements

  • 1. Unobtrusiveness: minimal delay for (higher-priority) local and grid jobs
  • 2. Fairness: multiple cycle scavenging applications running concurrently should be assigned comparable CPU time
  • 3. Dynamic resource allocation: cycle scavenging applications have to grow/shrink at runtime
  • 4. Efficiency: make as much use of the dynamic resources as possible
  • 5. Robustness and fault tolerance: in a long-running, complex system, problems will occur and must be dealt with

SLIDE 25 (25/36)

System Interaction

[Figure: the user submits PSA(s), described in JDL, to the Scheduler; the CS-Runner registers with the Scheduler and receives grow/shrink messages; the CS-Runner submits launchers to the clusters; a KCM on each cluster's head node monitors the idle/demanded resources and informs the Scheduler; the launchers deploy, monitor, and preempt tasks on the nodes]

  • CS policies:
    • Equi-All: on a grid-wide basis
    • Equi-PerSite: per cluster
  • Application-level scheduling:
    • Pull-based approach (see the sketch below)
    • Shrinkage policy
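The pull-based approach can be sketched as follows (illustrative Python; the `runner` client interface is an assumption, as the real launchers speak a custom protocol to the CS-Runner):

```python
import subprocess

def launcher_loop(runner):
    """One launcher, running as a low-priority job on a scavenged node:
    it pulls tasks from the CS-Runner one by one instead of paying the
    scheduler's job-startup overhead for every task.

    `runner` is an assumed client object with preempted(), next_task(),
    and report() methods.
    """
    while not runner.preempted():      # stop when the node is reclaimed
        task = runner.next_task()      # pull the next command line
        if task is None:
            break                      # sweep finished: release the node
        result = subprocess.run(task, capture_output=True)
        runner.report(task, result.returncode)
```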
SLIDE 26 (26/36)

Cycle Scavenging Policies

  • 1. Equipartition-All

[Figure: the idle processors of clusters C1 (12), C2 (12), and C3 (24) are divided equally among CS User-1, CS User-2, and CS User-3 on a grid-wide basis]

SLIDE 27 (27/36)

Cycle Scavenging Policies

  • 2. Equipartition-PerSite (see the sketch below)

[Figure: the idle processors of each of the clusters C1 (12), C2 (12), and C3 (24) are divided equally among CS User-1, CS User-2, and CS User-3, per cluster]
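A minimal sketch of the two equipartition flavours shown above (illustrative Python; remainders are simply ignored here):

```python
def equi_all(idle, users):
    """Equipartition-All: divide the grid-wide pool of idle processors
    equally among the CS users."""
    share = sum(idle.values()) // len(users)
    return {user: share for user in users}

def equi_per_site(idle, users):
    """Equipartition-PerSite: divide the idle processors of each
    cluster separately and equally among the CS users."""
    return {user: {c: free // len(users) for c, free in idle.items()}
            for user in users}

# the example of slides 26-27: C1 (12), C2 (12), C3 (24), three CS users
idle = {"C1": 12, "C2": 12, "C3": 24}
users = ["CS User-1", "CS User-2", "CS User-3"]
print(equi_all(idle, users))       # 16 processors per user, grid-wide
print(equi_per_site(idle, users))  # 4 + 4 + 8 processors per user
```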

SLIDE 28 (28/36)

Experimental Results

  • DAS-3
  • Using launchers vs. not using launchers:
    • 60 s dummy tasks, tested on a 32-node cluster
    • Without launchers: job startup overhead + information delay
  • Equi-All vs. Equi-PerSite:
    • 3 CS users submit the same application with the same parameter range
    • Non-CS workloads: WBlock, WBurst

[Figure: number of completed jobs and makespan (s) for Equi-All and Equi-PerSite under the WBlock and WBurst workloads]

Equi-PerSite is fair and superior to Equi-All.

See: O. Sonmez, B. Grundeken, H.H. Mohamed, A. Iosup and D.H.J. Epema, “Scheduling Strategies for Cycle Scavenging in Multicluster Grid Systems,” IEEE CCGrid 2009.

SLIDE 29 (29/36)

Support for Workflows in Koala

  • Applications with dependencies
  • E.g., the Montage workflow:
    • An astronomy application that generates mosaics of the sky
    • 4,500 tasks
    • Dependencies are file transfers
  • Experience the WRunner in the hands-on session

SLIDE 30 (30/36)

Workflow Scheduling Policies (1/3)

  • 1. Round Robin: submits the eligible tasks to the clusters in round-robin order (see the sketch below)
  • 2. Single Cluster: maps every complete workflow to the least-loaded cluster at its submission
  • 3. All Clusters: submits each eligible task to the least-loaded cluster
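All of these policies operate on the set of eligible tasks, i.e., the tasks whose predecessors in the DAG have all finished. A minimal, illustrative Python sketch of eligibility and of policy 1 (the DAG representation is an assumption: a dict mapping each task to the set of tasks it depends on):

```python
from itertools import cycle

def eligible(dag, done):
    """Tasks that have not run yet and whose dependencies have all
    completed. dag: task -> set of tasks it depends on."""
    return [t for t, deps in dag.items() if t not in done and deps <= done]

def round_robin(dag, clusters, submit):
    """Policy 1 (sketch): submit each wave of eligible tasks to the
    clusters in round-robin order. submit(task, cluster) is assumed to
    block until the task finishes; a real runner overlaps submissions."""
    done, order = set(), cycle(clusters)
    while len(done) < len(dag):
        for task in eligible(dag, done):
            submit(task, next(order))
            done.add(task)

# a diamond-shaped workflow: A feeds B and C, which feed D
dag = {"A": set(), "B": {"A"}, "C": {"A"}, "D": {"B", "C"}}
round_robin(dag, ["C1", "C2"], lambda t, c: print(f"{t} -> {c}"))
```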

SLIDE 31 (31/36)

Workflow Scheduling Policies (2/3)

  • 4. All Clusters File-Aware: submits each eligible task to the cluster that minimizes the transfer costs of the files on which it depends
  • 5. Coarsening*: iteratively reduces the size of a graph by collapsing groups of nodes and their internal edges
    • We use the Heavy Edge Matching* technique to group tasks that are connected by heavy edges

*G. Karypis and V. Kumar, “Multilevel graph partitioning schemes,” Int. Conf. on Parallel Processing, pages 113–122, 1995.

SLIDE 32 (32/36)

Workflow Scheduling Policies (3/3)

  • 6. Cluster Minimization: submits as many eligible tasks as possible to a cluster before considering the next cluster
  • 7. HEFT* (Heterogeneous Earliest-Finish-Time): selects the task with the highest upward-rank value and assigns it to the cluster that ensures its earliest completion time (see the sketch below)

*H. Topcuoglu, S. Hariri, and M. Wu, “Performance-effective and low-complexity task scheduling for heterogeneous computing,” IEEE TPDS, 13(3):260–274, 2002.

We will use Single Cluster and All Clusters in the hands-on session.
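The upward rank of a task is the length of the critical path from that task to the exit of the DAG, computed with average computation and communication costs. A compact, illustrative Python sketch (the cost model is reduced to given averages; not Koala's code):

```python
def upward_ranks(succ, comp_cost, comm_cost):
    """HEFT's upward rank: a task's average computation cost plus the
    heaviest (communication cost + rank) path through its successors.

    succ: task -> list of successor tasks
    comp_cost: task -> average computation cost
    comm_cost: (task, successor) -> average communication cost
    """
    ranks = {}
    def rank(t):
        if t not in ranks:
            ranks[t] = comp_cost[t] + max(
                (comm_cost[(t, s)] + rank(s) for s in succ[t]), default=0.0)
        return ranks[t]
    for t in succ:
        rank(t)
    return ranks
```

HEFT then considers the tasks in decreasing order of upward rank and places each on the resource that gives it the earliest finish time.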

SLIDE 33 (33/36)

Workflow Engine Architecture

[Figure: the workflow engine architecture, with components labeled Workflow Description, DRMAA, DFS, SSH + Custom Protocol, and Execution]

SLIDE 34 (34/36)

Cloud Integration

  • Resource management issues:
    • When and how many resources to allocate
    • What type of resource to allocate
    • Which availability zone
  • Performance issues:
    • No queueing, but resource acquisition/release overheads
    • Virtualization overhead
    • Wide-area file transfers
  • Cost issues (for public clouds)
  • Functionality issues (MapReduce)
SLIDE 35 (35/36)

Conclusion

  • Koala supports multiple application types:
    • Sequential applications
    • Parallel applications that may need co-allocation
    • Parallel applications that can grow/shrink at runtime
    • Parameter sweep applications
    • Workflows
  • Different scheduling policies for each application type
  • No one-size-fits-all policy
SLIDE 36 (36/36)

M.N.Yigitbasi@tudelft.nl
http://www.st.ewi.tudelft.nl/~nezih/

More Information:

  • Koala Project: http://st.ewi.tudelft.nl/koala
  • PDS publication database: http://www.pds.twi.tudelft.nl
  • DAS-4: http://www.cs.vu.nl/das4/