
SLIDE 1

Mixing Cloud and Grid Resources for Many Task Computing

David Abramson

Monash e-Science and Grid Engineering Lab (MeSsAGE Lab), Faculty of Information Technology
Science Director, Monash e-Research Centre
ARC Professorial Fellow

SLIDE 2

Introduction

  • A typical MTC Driving Application
  • The Nimrod tool family
  • Things the Grid ignored
    – Deployment
    – Deadlines (QoS)
  • Clusters & Grids & Clouds
  • Conclusions and future directions

SLIDE 3

A Typical MTC Driving Application

SLIDE 4

A little quantum chemistry

Wibke Sudholt, Univ Zurich

[Figure: plot of the effective potential U_eff against radius r (a.u.), parameterised by A1, A2, B1, B2, for the C/H/F molecular systems shown.]

U_eff(r) = A_1 exp(-B_1 r) + A_2 exp(-B_2 r)

SLIDE 5

SLIDE 6

SC03 testbed

SLIDE 7

The Nimrod Tools Family

SLIDE 8

Nimrod supporting “real” science

  • A full parameter sweep is the cross product of all the parameters (Nimrod/G)
  • An optimization run minimizes some output metric and returns the parameter combinations that do this (Nimrod/O)
  • Design of experiments limits the number of combinations (Nimrod/E)
  • Workflows (Nimrod/K)


SLIDE 9

[Diagram: the Nimrod Portal, Nimrod/O and Nimrod/E sit above Nimrod/G, which drives the Grid middleware through actuators. An experiment is described by a plan file, for example:]

parameter pressure float range from 5000 to 6000 points 4
parameter concent float range from 0.002 to 0.005 points 2
parameter material text select anyof "Fe" "Al"

task main
    copy compModel node:compModel
    copy inputFile.skel node:inputFile.skel
    node:substitute inputFile.skel inputFile
    node:execute ./compModel < inputFile > results
    copy node:results results.$jobname
endtask
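As an aside, the sweep a plan file like this defines is just the cross product of its parameter ranges. The following is a minimal illustrative Python sketch (not part of Nimrod itself; the value lists simply re-state the plan file above under the assumption that "points N" means N evenly spaced values) showing how the 4 x 2 x 2 = 16 combinations expand into independent tasks.

    import itertools

    # Illustrative re-statement of the plan file above:
    # pressure takes 4 evenly spaced points, concent 2, material 2 values.
    pressure_values = [5000 + i * (6000 - 5000) / 3 for i in range(4)]
    concent_values  = [0.002, 0.005]
    material_values = ["Fe", "Al"]

    # The full sweep is the cross product of all parameter value lists (Nimrod/G).
    jobs = list(itertools.product(pressure_values, concent_values, material_values))
    print(len(jobs))  # 16 independent tasks

    for jobname, (pressure, concent, material) in enumerate(jobs):
        # Each combination would be substituted into inputFile.skel and run as
        # ./compModel < inputFile > results on some remote node.
        print(jobname, pressure, concent, material)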

SLIDE 10

[Diagram: jobs are prepared using the Portal, scheduled and sent dynamically to available machines for execution, and the results are displayed and interpreted.]

SLIDE 11

SLIDE 12

SLIDE 13

[Example applications: antenna design, drug docking, aerofoil design.]

SLIDE 14

Nimrod/G Architecture

[Architecture diagram, three levels. Level 3: Nimrod/G clients (the Nimrod/G GUI and the EnFuzion API + database) supply a plan file and run file, and include a resource scheduler. Level 2: the Generator/Creator, Job Scheduler, Agent Scheduler and DB Server dispatch work through Globus, Condor and Legion actuators, consulting Grid Information Server(s). Level 1: Globus-, Condor- and Legion-enabled nodes, each with a local Resource Manager (RM) and Trade Server (TS), run Nimrod agents over the Grid middleware.]

SLIDE 15

Nimrod/K Workflows

SLIDE 16

Nimrod/K Workflows

  • Nimrod/K integrates Kepler with
    – A massively parallel execution mechanism
    – The special-purpose functions of Nimrod/G/O/E
    – General-purpose workflows from Kepler
    – A flexible IO model: streams to files

[Diagram: Kepler architecture: the Vergil GUI and Kepler GUI extensions, authentication, SMS, actor & data search, type system extensions, the provenance framework, the Kepler object manager, documentation, and smart re-run / failure recovery, built over the Ptolemy core.]

SLIDE 17

Kepler Directors

  • Orchestrate the workflow
  • Synchronous & Dynamic Data Flow
    – Consumer actors are not started until the producer completes
  • Process Networks
    – All actors execute concurrently
  • IO modes produce different performance results
  • Existing directors don't support multiple instances of actors

SLIDE 18

Workflow Threading

  • Nimrod parameter combinations can be viewed as threads
  • Multi-threaded workflows allow independent sequences in a workflow to run concurrently
    – This might be the whole workflow, or part of the workflow
  • Tokens in different threads do not interact with each other in the workflow

SLIDE 19

The Nimrod/K director

  • Implements the Tagged Data Architecture (see the sketch after this list)
  • Provides threading
  • Maintains copies (clones) of actors
  • Maintains token tags
  • Schedules actors' events
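A minimal sketch of the idea, not Nimrod/K code: every token carries a tag (colour), and the director fires one clone of an actor per tag, so differently tagged tokens flow through the workflow concurrently and never interact. All names here are invented for illustration.

    from collections import defaultdict
    from concurrent.futures import ThreadPoolExecutor

    def director_fire(actor, tokens, max_clones=8):
        """Group tokens by tag and fire one clone of `actor` per tag, concurrently."""
        by_tag = defaultdict(list)
        for tag, value in tokens:            # a token is a (tag, value) pair
            by_tag[tag].append(value)

        def clone(tag, values):
            # Each clone sees only tokens of its own colour; outputs keep the tag,
            # so downstream actors can be cloned in the same way.
            return [(tag, actor(v)) for v in values]

        with ThreadPoolExecutor(max_workers=max_clones) as pool:
            results = pool.map(lambda kv: clone(*kv), by_tag.items())
        return [tok for batch in results for tok in batch]

    # Example: three parameter combinations become three tags; the actor runs as
    # three concurrent clones.
    tokens = [(1, 10), (2, 20), (3, 30)]
    print(director_fire(lambda x: x * x, tokens))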

SLIDE 20

MTC through Data Flow Execution

SLIDE 21

[Diagram: dynamic parallelism through token colouring: an actor is replicated into Clone 1, Clone 2 and Clone 3, one clone per token colour.]

SLIDE 22

So …

SLIDE 23

Director controls parallelism

  • Uses Nimrod to perform the execution

SLIDE 24

Complete Parameter Sweep

  • Using a MATLAB actor provided by Kepler
  • Local spawn
    – Multiple threads ran concurrently on a computer with 8 cores (2 x quad-core)
    – Workflow execution was just under 8 times faster (a minimal sketch of this pattern follows)
  • Remote spawn
    – 100’s – 1000’s of remote processes
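A rough illustration of the local-spawn case, not the actual Kepler/MATLAB workflow: running independent parameter combinations across 8 worker processes gives close to an 8x speedup while per-task compute dominates. The model function here is a stand-in.

    import time
    from multiprocessing import Pool

    def run_model(params):
        # Stand-in for one workflow invocation (e.g. the MATLAB actor) on one
        # parameter combination; the sleep imitates a compute-bound task.
        time.sleep(0.5)
        return params

    if __name__ == "__main__":
        combinations = [(p, m) for p in range(8) for m in ("Fe", "Al")]  # 16 tasks

        start = time.time()
        with Pool(processes=8) as pool:   # one worker per core on a 2 x quad-core box
            results = pool.map(run_model, combinations)
        # ~1s with 8 workers versus ~8s serially
        print(f"16 tasks on 8 workers: {time.time() - start:.1f}s")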

SLIDE 25

Parameter Sweep Actor

SLIDE 26

Partial Parameter Sweep

SLIDE 27

Nimrod/EK Actors

  • Actors for generating and analyzing designs
  • Leverage the concurrent infrastructure

SLIDE 28

Nimrod/E Actors

  • No actor parameters need setting
  • No difference from the parameter sweep actors

SLIDE 29

Parameter Optimization: Inverse Problems

[Diagram: Domain Definer -> Points Generator -> Optimizer -> Constraint Enforcer -> Execute Model, evaluating the objective F(x, y, z, w, …).]

SLIDE 30

Nimrod/OK Workflows

  • Nimrod/K supports parallel execution
  • General template for search
    – Built from key components
  • Can mix and match optimization algorithms (see the sketch after this list)
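A minimal sketch of such a search template, using invented names rather than the actual Nimrod/OK actors: a points generator proposes candidates, a constraint enforcer filters them, and the model is evaluated on the survivors in parallel, so different optimization algorithms can be slotted in behind the same interface.

    import random
    from concurrent.futures import ProcessPoolExecutor

    def model(point):
        # Stand-in for executing the computational model F(x, y) at one point.
        x, y = point
        return (x - 1.0) ** 2 + (y + 2.0) ** 2

    def random_points(n=16):
        # One interchangeable points generator: propose n random candidates.
        return [(random.uniform(-5, 5), random.uniform(-5, 5)) for _ in range(n)]

    def constraint(point):
        # Constraint enforcer: keep points inside the allowed domain.
        x, y = point
        return -5 <= x <= 5 and -5 <= y <= 5

    def search(propose, iterations=5, workers=8):
        best = None
        for _ in range(iterations):
            points = [p for p in propose() if constraint(p)]
            with ProcessPoolExecutor(max_workers=workers) as pool:
                scores = list(pool.map(model, points))   # evaluate candidates concurrently
            cand = min(zip(scores, points))
            best = cand if best is None or cand < best else best
        return best

    if __name__ == "__main__":
        print(search(random_points))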

SLIDE 31

Things the Grid ignored

SLIDE 32

Resource Scheduling

  • What’s so hard about scheduling parameter studies?
    – The user has a deadline
    – Grid resources are unpredictable
      • Machine load may change at any time
      • Multiple machine queues
    – No central scheduler
  • It is a soft real-time problem

SLIDE 33

Computational Economy

  • Without cost, ANY shared system becomes unmanageable
  • Resource selection is based on pseudo money and market forces
  • A large number of sellers and buyers (resources may be dedicated/shared)
  • Negotiation: call for tenders/bids and select the offers that meet the requirements
  • Trading and advance resource reservation
  • Schedule computations on those resources that meet all requirements (a toy sketch follows this list)
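The following is only a toy illustration of deadline- and budget-constrained selection, not Nimrod/G's scheduler: given per-resource offers (cost per job and expected throughput), pick the cheapest mix of resources that can still clear the remaining jobs before the deadline, then check the implied spend against the budget.

    def select_resources(offers, jobs_remaining, minutes_to_deadline, budget):
        """Toy greedy selection: add the cheapest offers until their combined
        throughput meets the deadline, then check the cost against the budget."""
        needed_rate = jobs_remaining / minutes_to_deadline    # jobs per minute required
        chosen, rate = [], 0.0
        for name, cost_per_job, jobs_per_min in sorted(offers, key=lambda o: o[1]):
            chosen.append((name, cost_per_job, jobs_per_min))
            rate += jobs_per_min
            if rate >= needed_rate:
                break
        if rate < needed_rate:
            return None                                       # deadline cannot be met
        # Jobs are shared in proportion to each resource's throughput.
        cost = sum(c * jobs_remaining * (r / rate) for _, c, r in chosen)
        if cost > budget:
            return None                                       # over budget: renegotiate offers
        return [name for name, _, _ in chosen], round(cost, 2)

    # Offers: (resource, cost per job in pseudo-dollars, sustained jobs per minute)
    offers = [("Linux cluster - Monash", 1.0, 4.0),
              ("SGI - ANL", 3.0, 6.0),
              ("Sun - ANL", 2.0, 2.0)]
    print(select_resources(offers, jobs_remaining=60, minutes_to_deadline=10, budget=200))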

SLIDE 34

[Plot: number of jobs in execution over time (minutes) on each testbed resource: Linux cluster, Monash (20); Sun, ANL (5); SP2, ANL (5); SGI, ANL (15); SGI, ISI (10).]

Soft real-time scheduling problem

SLIDE 35

[Plot: number of jobs in execution over time (minutes): Linux cluster, Monash (20); Sun, ANL (5); SP2, ANL (5); SGI, ANL (15); SGI, ISI (10).]

SLIDE 36

[Plot: number of jobs in execution over time (minutes): Linux cluster, Monash (5); Sun, ANL (10); SP2, ANL (10); SGI, ANL (15); SGI, ISI (20).]

SLIDE 37

[Plot: number of tasks in execution over time (minutes): Condor, Monash; Linux, Prosecco, CNR; Linux, Barbera, CNR; Solaris/Ultra2, TITech; SGI, ISI; Sun, ANL.]

SLIDE 38

[Plot: number of tasks in execution over time (minutes), as on the previous slide: Condor, Monash; Linux, Prosecco, CNR; Linux, Barbera, CNR; Solaris/Ultra2, TITech; SGI, ISI; Sun, ANL.]

SLIDE 39

Deployment

  • Has largely been ignored in Grid middleware
    – Globus supports file transport, execution, data access
  • Challenges
    – Deployment interfaces lacking
    – Heterogeneity

[Diagram: server side: Globus 4.0 services (RFT, GRAM, Delegation, Index, Trigger, Archiver, CAS, OGSA-DAI, GTCP) plus your own Java services; client side: deployment, high-performance virtualization, and grid deploy-aware clients.]

SLIDE 40

Deployment Service

[Diagram: on the client machine, application source is compiled by the .NET compilers into intermediate code; the deployment service installs it on the Grid resource via Globus/OGSA, where the .NET runtime and a .NET parallel virtual machine hold the installed applications; GRAM then executes the application binary through its application handle.]

  • Hide the complexity of installing software on a remote resource
  • Use local knowledge about
    – the instruction set,
    – machine structure,
    – file system,
    – I/O system, and
    – installed libraries

SLIDE 41

[Diagram: DistAnt deployment, in six numbered steps: on the local host, the DistAnt deployment client takes an Ant build file, the un-configured application files and the user security scope; the files are transferred to the remote host via the Reliable File Transfer service (GridFTP); the DistAnt service, running in the Globus user hosting environment, configures the application; and the Managed Job Service (GRAM), invoked with RSL, instantiates the configured application.]

SLIDE 42

  • Our approach is runtime-internal
  • Why do Java & .NET support web services, UI, security and other libraries as part of the standard environment?
  • Functionality is guaranteed
  • Similarly, we aim to provide guaranteed HPC functionality

[Diagram comparing the two stacks. Runtime-internal (our approach): the HPC application runs on a runtime core whose system libraries include System.HPC and HPC Comm, inside the virtual machine, above the native OS and interconnect. Runtime-external (existing approach): the HPC application calls HPC Comm outside the virtual machine through managed-to-native bindings, alongside the runtime core and system libraries, above the native OS and interconnect.]

SLIDE 43

Clusters & Grids & Clouds

SLIDE 44

Nimrod over Clusters

[Diagram: jobs from a Nimrod experiment flow through a Nimrod actuator (e.g. SGE, PBS, LSF, Condor) into the local batch system.]

SLIDE 45

Nimrod over Grids

  • Advantages
    – Wide-area elastic computing
    – Portal-based point of presence, independent of the location of computational resources
    – Grid-level security
    – Computational economy proposed
  • New scheduling and data challenges
    – Virtualization proposed (based on .NET!)
  • Leveraged Grid middleware
    – Globus, Legion, ad-hoc standards

SLIDE 46

Leveraging Cloud Infrastructure

  • Centralisation is easier
    – (Clusters vs Grid)
  • Virtualisation improves interoperability and scalability
    – Build once, run everywhere
  • Computational economy, for real
    – Deadline driven
      • “I need this finished by Monday morning!”
    – Budget driven
      • “Here’s my credit card, do this as quickly and cheaply as possible.”
  • Cloud bursting
    – Scale out to supplement locally and nationally available resources

SLIDE 47

Cloud Architectures

  • IaaS
    – Build a virtual cluster
  • PaaS
    – Leverage platform services
  • SaaS
    – Nimrod portal installed on the cloud

SLIDE 48

Integrating Nimrod with IaaS

[Diagram: the Nimrod/G portal and Nimrod/O/E/K feed jobs from a Nimrod experiment to actuators. The existing actuators (Globus, ...) submit to Grid middleware, which runs Nimrod agents; new actuators (EC2, Azure, IBM, OCCI?, ...?) drive a RESTful IaaS API to start VMs, each of which runs Nimrod agents.]

SLIDE 49

Integrating Nimrod with IaaS

  • Nimrod is already a meta-scheduler
    – Creates an ad-hoc grid dynamically, overlaying the available resource pool
    – We don't need all the Grid bells and whistles to stand up a resource pool under Nimrod; we just need to launch our code (a hedged sketch follows this list)
  • Requires explicit management of infrastructure
  • Extra level of scheduling: when to initialise the infrastructure?
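For example, here is a minimal sketch of what an IaaS actuator might do against EC2 using boto3. The AMI ID, instance type, region, agent download URL and server address are all placeholders, and the real Nimrod actuators are not claimed to work this way; the point is only that starting the agent on a fresh VM is all that is needed to join it to the experiment.

    import boto3

    # Placeholder values: supply your own image, agent location and server address.
    AMI_ID = "ami-xxxxxxxx"
    AGENT_URL = "https://example.org/nimrod-agent.tar.gz"

    # cloud-init user data: fetch and start the Nimrod agent, which then connects
    # back to the Nimrod server to pull work (no other Grid services needed).
    user_data = f"""#!/bin/bash
    curl -s {AGENT_URL} | tar xz -C /opt
    /opt/nimrod-agent/agent --server nimrod.example.org:9000 &
    """

    ec2 = boto3.client("ec2", region_name="ap-southeast-2")
    response = ec2.run_instances(
        ImageId=AMI_ID,
        InstanceType="t3.medium",
        MinCount=1,
        MaxCount=4,              # how many agent VMs to stand up for the experiment
        UserData=user_data,
    )
    print([i["InstanceId"] for i in response["Instances"]])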

SLIDE 50

Integrating Nimrod with IaaS


SLIDE 51

PaaS is trickier...

  • More variety
    – Azure vs AppEngine
  • Designed for web-app hosting
    – Nimrod provides a generic execution framework
  • Higher-level PaaS is too prescriptive
    – AppEngine: Python and Java only

SLIDE 52

Nimrod-Azure Mk.1

  • The Nimrod server runs on a Linux box external to Azure
  • A Nimrod-Azure actuator module contains the code for managing Nimrod agents on Azure
    – a pre-defined minimal NimrodWorkerService cspkg;
    – a library for speaking XML over HTTP with the Azure Storage and Management REST APIs

SLIDE 53

Integrating Nimrod with Azure

To stand up an Azure compute resource under Nimrod, the actuator:

  • Copies the Nimrod agent package and encryption keys to an Azure blob
  • Adds command-line parameters for the agents to an Azure queue (these two steps are sketched below)
  • Builds an initial cscfg for the deployment, including the relevant blob and queue URLs
  • Deploys the service to the Cloud
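A rough sketch of those first two steps. The actual actuator speaks XML over HTTP to the Azure Storage REST API directly; here the azure-storage-blob and azure-storage-queue Python packages stand in, and the connection string, container, queue, file names and agent parameters are all placeholders.

    from azure.storage.blob import BlobServiceClient
    from azure.storage.queue import QueueClient

    CONN_STR = "<storage-connection-string>"   # placeholder credentials

    # 1. Copy the Nimrod agent package (and keys) to an Azure blob.
    blob_service = BlobServiceClient.from_connection_string(CONN_STR)
    container = blob_service.get_container_client("nimrod-agent")
    with open("nimrod-agent.tar.gz", "rb") as data:
        container.upload_blob("nimrod-agent.tar.gz", data, overwrite=True)

    # 2. Add the agents' command-line parameters to an Azure queue; each worker
    #    role instance will pop one message and launch its agent with it.
    queue = QueueClient.from_connection_string(CONN_STR, "nimrod-agent-params")
    for i in range(4):
        queue.send_message(f"--server nimrod.example.org:9000 --experiment exp{i}")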

SLIDE 54

Integrating Nimrod with Azure

[Diagram: the Azure actuator on the Nimrod server takes the agent package and cspkg from the Nimrod experiment, uploads them to Azure blobs, and posts agent parameters to the Azure queue.]

SLIDE 55

Integrating Nimrod with Azure

Once deployed, the NimrodWorkerService:

  • Pulls the Nimrod agent package from the blobs referenced in the cscfg settings
  • Unpacks and launches the agent with parameters from the queue referenced by the cscfg (roughly sketched below)
  • The agent connects out to the Nimrod server, pulling work and pushing results until there is no work left, its lifetime ends, or an exception occurs
  • But when the agent exits there is no way to de-provision the role instance... scaling without de-scaling?! Please fix this!
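Roughly what each worker role instance does, written here as an illustrative Python sequence rather than the actual .NET NimrodWorkerService; the connection string, container, queue and file names are the same placeholders used in the earlier sketch, and the agent path is assumed.

    import subprocess
    import tarfile
    from azure.storage.blob import BlobClient
    from azure.storage.queue import QueueClient

    CONN_STR = "<storage-connection-string>"   # normally read from the role's cscfg settings

    # Pull and unpack the agent package referenced by the cscfg.
    blob = BlobClient.from_connection_string(CONN_STR, "nimrod-agent", "nimrod-agent.tar.gz")
    with open("nimrod-agent.tar.gz", "wb") as f:
        f.write(blob.download_blob().readall())
    with tarfile.open("nimrod-agent.tar.gz") as tar:
        tar.extractall("agent")

    # Take one set of agent parameters from the queue and launch the agent; the
    # agent then talks to the Nimrod server until no work remains.
    queue = QueueClient.from_connection_string(CONN_STR, "nimrod-agent-params")
    for msg in queue.receive_messages(messages_per_page=1):
        subprocess.run(["agent/nimrod-agent/agent"] + msg.content.split(), check=False)
        queue.delete_message(msg)
        break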

SLIDE 56

Integrating Nimrod with Azure

[Diagram: worker role instances pull agent parameters from the Azure queue and the agent package from blobs, then each runs the agent and the user's application/s, connecting back to the Nimrod server.]

SLIDE 57

Grid + Amazon + Azure

SLIDE 58

Conclusions and Future Directions

  • Commercial Clouds
    – Grid economy == commercial clouds
    – Virtualisation built into the fabric
  • Leverage the MTC paradigm
    – More complex design of experiments
    – More optimization algorithms
  • Make the environment more useful
    – New portal
    – Workflows that interact with IO devices and portals

SLIDE 59

Questions?

More information: http://messagelab.monash.edu.au

SLIDE 60

  • Faculty Members
    – Jeff Tan
    – Maria Indrawan
  • Research Fellows
    – Blair Bethwaite
    – Slavisa Garic
    – Jin Chao
  • Admin
    – Rob Gray
  • Current PhD Students
    – Shahaan Ayyub
    – Philip Chan
    – Colin Enticott
    – ABM Russell
    – Steve Quinette
    – Ngoc Dinh (Minh)
  • Completed PhD Students
    – Greg Watson
    – Rajkumar Buyya
    – Andrew Lewis
    – Nam Tran
    – Wojtek Goscinski
    – Aaron Searle
    – Tim Ho
    – Donny Kurniawan
    – Tirath Ramdas
  • Funding & Support
    – Amazon
    – Axceleon
    – Australian Partnership for Advanced Computing (APAC)
    – Australian Research Council
    – Cray Inc
    – CRC for Enterprise Distributed Systems (DSTC)
    – GrangeNet (DCITA)
    – Hewlett Packard
    – IBM
    – Microsoft
    – Sun Microsystems
    – US Department of Energy