

SLIDE 1

Ryan Taylor - ADC T1/T2/T3 Jamboree, Dec. 10, 2012

Providing IaaS Resources to ATLAS: The UVic-NeCTAR Experience

Ashok Agarwal, Andre Charbonneau, Asoka de Silva, Ian Gable, Joanna Huang, Colin Leavett-Brown, Michael Paterson, Randall Sobie, Ryan Taylor

SLIDE 2

CA Cloud Production Activity, Last 7 Months


SLIDE 3

IAAS

  • Early tests Nov. 2011, standard operation since April 2012
SLIDE 4

Australia-NECTAR

  • Commissioned Dec. 2012, still in early stages

SLIDE 5

Powered by Cloud Scheduler

  • Cloud Scheduler is a simple Python package for managing VMs on IaaS clouds, based on the requirements of Condor jobs
  • Users submit Condor jobs, with additional attributes specifying VM properties

  • Developed at UVic and NRC since 2009
  • Used by BaBar, CANFAR, ATLAS
  • http://cloudscheduler.org/
  • http://goo.gl/G91RA (ADC Cloud Computing Workshop, May 2011)
  • http://arxiv.org/abs/1007.0050
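The extra VM attributes ride along in an otherwise ordinary submit file. A minimal sketch (attribute names follow the full job description shown in the Extra Material; the values here are purely illustrative):

```
universe     = vanilla
executable   = my_job.sh
# Condor itself ignores the +VM* attributes; Cloud Scheduler reads them
Requirements = VMType =?= "examplevm"
+VMName      = "ExampleVM"
+VMLoc       = "http://example.org/images/example.img.gz"
+VMMem       = "2048"   # MB
+VMCPUCores  = "1"
queue
```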
SLIDE 6

Key Features of Cloud Scheduler

  • Securely delegates user credentials to VMs, and authenticates VMs joining the Condor pool
  • Interacts with multiple IaaS sites, and aggregates their resources under one Condor queue
  • Dynamically manages the quantity and type of VMs in response to user demand
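The last point is essentially a reconciliation loop between job demand and VM supply. A minimal Python sketch of the idea (illustrative only, not the actual Cloud Scheduler code; the function names and the single-capacity model are assumptions):

```python
from collections import Counter

def vms_to_boot(idle_jobs, running_vms, max_slots):
    """Decide how many VMs of each type to boot.

    idle_jobs:   list of VM-type strings, one per queued Condor job
    running_vms: list of VM-type strings, one per VM already running
    max_slots:   total number of VMs the clouds can hold
    """
    demand = Counter(idle_jobs)           # jobs waiting, per VM type
    supply = Counter(running_vms)         # VMs already booted, per type
    free = max_slots - len(running_vms)   # remaining cloud capacity
    plan = {}
    for vm_type, wanted in demand.items():
        shortfall = max(0, wanted - supply[vm_type])
        boot = min(shortfall, free)       # never exceed capacity
        if boot:
            plan[vm_type] = boot
            free -= boot
    return plan

# Example: 3 jobs want "x86_64", 1 wants "himem"; one x86_64 VM is up,
# and there is room for 3 VMs in total
print(vms_to_boot(["x86_64", "x86_64", "x86_64", "himem"], ["x86_64"], 3))
# → {'x86_64': 2}
```

A real deployment would also track per-cloud quotas and back off on boot failures; those details are omitted here.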

SLIDE 7

SLIDE 8

Participating Clouds

Alto, Synnefo, Quicksilver, Elephant, Hotel, Nova, Foxtrot, Sierra

SLIDE 9

VM Image

  • Dual-hypervisor image, can run on KVM or Xen
  • Customized batch node v2.6.0
  • Use whole-node VMs for better efficiency
      • cache sharing instead of disk contention
      • fewer image downloads when ramping up
SLIDE 10

Data Access

  • IAAS and Australia-NECTAR are linked to their T2 SEs
  • Our approach has been to dynamically create compute resources, with remote access to static storage outside the cloud
  • Satisfactory for now
  • MC production is low I/O, an ideal use case
  • But not scalable long-term
  • Eventually we should use a storage federation
SLIDE 11

Adding IaaS Resources to the “Grid of Clouds”

  • Step 0 - Get an IaaS cloud
  • Step 1 - Boot VMs
  • Step 2 (optional) - Get a Panda queue
  • Step 3 (optional) - Run your own Cloud Scheduler

SLIDE 12

Step 0: Get An IaaS Cloud

  • Cloud Scheduler supports:
  • Nimbus
  • Amazon EC2
  • OpenStack
  • StratusLab
  • OpenNebula
  • Then, use your cloud!
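Cloud Scheduler learns about each cloud from a resource configuration file. The entry below is a hypothetical sketch (section and key names are assumptions made for illustration; consult the documentation at cloudscheduler.org for the actual format):

```
[example-cloud]           # one section per IaaS site
host: cloud.example.org   # endpoint of the IaaS API
cloud_type: OpenStack     # e.g. Nimbus, AmazonEC2, OpenStack
vm_slots: 20              # maximum concurrent VMs on this cloud
cpu_cores: 8              # cores per VM slot
memory: 16384             # MB available to VMs
```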
SLIDE 13

Step 1: Boot VMs

  • Allow Cloud Scheduler server to boot VMs
  • Analogous to allowing a DN to submit grid jobs to a CE
  • Test the image (may need customization)
  • We can provide an image to use
  • Run some VMs, join the Condor pool
  • Then, run Condor jobs!
  • If joining an existing Panda queue, you're already done!

SLIDE 14

Optional Step 2: Get a Panda Queue

  • Make a Panda site, with prod and analy queues
  • Associate with an SE
  • Use a WAN protocol (e.g. lcg-cp, curl) for stage-in
  • Enable AFT/PFT jobs in HammerCloud, and the switcher for downtimes
  • Create the site in AGIS (but not GOCDB)
  • Then, run Panda jobs!
SLIDE 15

Optional Step 3: Run Your Own Cloud Scheduler

  • For a fully independent and complete solution
  • Install condor server
  • pip install cloud-scheduler
  • Maybe even your own Pilot Factory
SLIDE 16

Missing Pieces

  • APEL accounting in the cloud
  • Ability to declare downtime on a Cloud Scheduler server
  • SW release publication in AGIS without a CE
SLIDE 17

Conclusion

  • Developed and deployed an infrastructure to transparently run jobs in Panda queues spanning multiple IaaS clouds
  • Using it to deliver beyond-pledge resources to ATLAS
  • In IAAS, completed 177K production jobs since April
  • Recently created the Australia-NECTAR cloud site, running on another continent

SLIDE 18

Extra Material

SLIDE 19

CA Production Queues

  • Two are in the cloud: IAAS and Australia-NECTAR


SLIDE 20

Condor Job Description File

Executable    = runpilot3-wrapper.sh
Arguments     = -s IAAS -h IAAS-cloudscheduler -p 25443 -w https://pandaserver.cern.ch -j false -k 0
# Run-environment requirements
Requirements  = VMType =?= "pandacernvm" && Target.Arch == "X86_64"
# User requirements
+VMName       = "PandaCern"
+VMLoc        = "http://images.heprc.uvic.ca/images/cernvm-batch-node-2.5.1-3-1-x86_64.ext3.gz"
+VMMem        = "18000"  # MB
+VMCPUCores   = "8"
+VMStorage    = "160"  # GB
+TargetClouds = "FGHotel,Hermes"
x509userproxy = /tmp/atprd.proxy

SLIDE 21

Ian Gable - 12/09/12

Step 1

Research and commercial clouds are made available through a cloud interface.

SLIDE 22

Step 2

User submits a Condor job. The scheduler might not have any resources available to it yet.

SLIDE 23

Step 3

Cloud Scheduler detects waiting jobs in the Condor queue, and makes a request to boot VMs matching the job requirements.

SLIDE 24

Step 4

The VMs boot, attach themselves to the Condor queue, and begin draining jobs. VMs are kept alive and re-used until no more jobs require that VM type.
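The keep-alive rule in Step 4 amounts to retiring a VM only when its type no longer matches any waiting job. A small Python illustration (hypothetical names, not the actual implementation):

```python
def vms_to_retire(needed_types, running_vms):
    """Return the ids of VMs whose type is no longer required.

    needed_types: set of VM types still required by waiting jobs
    running_vms:  dict mapping VM id -> VM type
    """
    return [vm_id for vm_id, vm_type in running_vms.items()
            if vm_type not in needed_types]

# "himem" has no waiting jobs left, so vm2 can be retired
print(vms_to_retire({"x86_64"}, {"vm1": "x86_64", "vm2": "himem"}))
# → ['vm2']
```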

SLIDE 25

Implementation Details

  • Condor Job Scheduler
      – VMs contextualized with the Condor pool URL and service certificate
      – VM image has the Condor startd daemon installed, which advertises to the central manager at start
      – GSI host authentication used when VMs join the pool
      – User credentials delegated to VMs after boot by job submission
      – Condor Connection Broker handles private-IP clouds
  • Cloud Scheduler
      – User proxy certs used for authenticating with the IaaS service where possible (Nimbus); otherwise a secret API key is used (EC2 style)
      – Can communicate with Condor using the SOAP interface (slow at scale) or via condor_q
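The condor_q path can be wrapped roughly as follows (a sketch only: the -constraint/-format flags are standard condor_q options, but the polling helper and the VMType attribute name are assumptions based on the job file in the Extra Material):

```python
import subprocess
from collections import Counter

def idle_jobs_by_vmtype(condor_q_output):
    """Count idle jobs per VM type, given output with one VMType per line."""
    lines = [ln.strip() for ln in condor_q_output.splitlines()]
    return Counter(ln for ln in lines if ln)

def poll_idle_jobs():
    # JobStatus == 1 means "idle" in Condor; emit one VMType per line
    out = subprocess.run(
        ["condor_q", "-constraint", "JobStatus == 1",
         "-format", "%s\n", "VMType"],
        capture_output=True, text=True, check=True).stdout
    return idle_jobs_by_vmtype(out)

print(idle_jobs_by_vmtype("pandacernvm\npandacernvm\nhimem\n"))
# → Counter({'pandacernvm': 2, 'himem': 1})
```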

SLIDE 26

Credential Transport