ExPERT: Pareto-efficient task replication on grids and a cloud



SLIDE 1

ExPERT: Pareto-efficient task replication on grids and a cloud

Orna Agmon Ben-Yehuda (1), Assaf Schuster (1), Artyom Sharov (1), Mark Silberstein (1), Alexandru Iosup (2)

(1) Department of Computer Science, Technion, Israel Institute of Technology

(2) Faculty of Engineering, Mathematics and Computer Science (EWI), TU Delft

IPDPS, May 2012

Agmon Ben-Yehuda, Schuster, Sharov, Silberstein, Iosup ExPERT 1/33

SLIDE 2

The Shared Resource Game — Players and Goals

[Diagram: a resource OWNER and a resource USER. Labels on the links between them: Costs; Policy; Enforcing QoS; Workload; Paying for resources or QoS; Strategy (Declarations, Resource Usage).]

Owner goals, minimize:
* Effective load
* Operational costs (energy)

User goals, minimize:
* Makespan
* Cost

SLIDE 3

The Unreliable Shared Resource Game

[Diagram: a grid OWNER and a grid USER with a bag of asynchronous tasks. The user can run tasks on the unreliable grid or on a reliable alternative that is slow or costly. Other labels: Costs; (Preemption); Policy; Enforcing QoS; Paying for Credentials.]

Owner goals, minimize:
* Operational costs (energy)
* Effective load

User goals, minimize:
* Makespan
* Cost

An environment of uncertainty:
* Will the task fail on the unreliable resource?
* Which system to use?

SLIDE 4

In the Beginning...

[Diagram: instances wait in the unreliable queue and run on the unreliable pool. Each instance ends in success, failure, or a timeout at D.]

* #machines < #unfinished tasks.
* D: instance deadline.
* No replication (replication is inefficient).

SLIDE 5

Using the Same Strategy After the Tail Starts

#machines > #unfinished tasks

[Figure: number of remaining tasks vs. time [s]. The tail phase start time (T_tail) separates the throughput phase from the tail phase.]

The tail is wagging the dog...

SLIDE 6

Replication - the User’s Bank of NTDMr Strategies

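[Diagram: the first N tail instances go through the unreliable queue to the unreliable pool, where each ends in success, failure, or a timeout at D. Successive instances are separated by T. Instance N+1 in the tail goes through the reliable queue to the reliable pool, where it succeeds.]

* D: instance deadline.
* T: replication time.
* A reliable machine is used to ensure task completion.
* At most N tail instances are sent to unreliable resources.
* Mr: max ratio of reliable to unreliable resources.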

SLIDE 8

Replication - the User’s Bank of NTDMr Strategies

Example: number of unreliable instances N = 3.

[Figure: timeline of the instances of one task (UNRELIABLE1, UNRELIABLE2, ..., RELIABLE) along a time axis marked T, 2T, 3T, D and 3T+Tr.]

Replication wastes work!
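The timeline in this example can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the authors' scheduler: unreliable instance k is assumed to be sent at time k·T if no result has arrived yet, the reliable instance is sent at N·T and returns after Tr, and an unreliable instance that fails or exceeds D never returns a result. The names `Instance` and `run_tail_task` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Instance:
    pool: str                   # "unreliable" or "reliable"
    sent_at: float              # time the instance was sent [s]
    result_at: Optional[float]  # time its result arrives, None if it never does


def run_tail_task(n: int, t: float, d: float, t_r: float,
                  turnarounds: List[Optional[float]]) -> Tuple[float, List[Instance]]:
    """Replay one tail task under an (N, T, D) setting (toy model).

    turnarounds[k] is the turnaround time of the k-th unreliable instance,
    or None if that instance fails. Assumption: a turnaround above D counts
    as a timeout and yields no result.
    """
    finish: Optional[float] = None
    sent_instances: List[Instance] = []
    for k in range(n):
        sent_at = k * t
        if finish is not None and finish <= sent_at:
            break  # a result has already arrived, stop replicating
        turnaround = turnarounds[k]
        ok = turnaround is not None and turnaround <= d
        result_at = sent_at + turnaround if ok else None
        sent_instances.append(Instance("unreliable", sent_at, result_at))
        if result_at is not None:
            finish = result_at if finish is None else min(finish, result_at)
    reliable_at = n * t
    if finish is None or finish > reliable_at:
        # Instance N+1 goes to the reliable pool and is assumed to succeed.
        result_at = reliable_at + t_r
        sent_instances.append(Instance("reliable", reliable_at, result_at))
        finish = result_at if finish is None else min(finish, result_at)
    return finish, sent_instances
```

With N = 3, every instance other than the one whose result arrives first is wasted work, which is the cost side of the tradeoff the slide points at.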

SLIDE 9

The User’s Problem: Optimization of...

The user cares about multi-objective optimization:
* Cost: mean cost per task, or tail cost per tail task.
* MS: mean makespan, or tail makespan.

Each user may have her own objective, depending on those values:
* Below a makespan bound: MS < Const
* As fast as possible: min MS
* Below max budget: Cost < Const
* As cheap as possible: min Cost
* Best price for the goods: min Cost × MS
* Any other function of the means Cost, MS...

SLIDE 10

The Feedback Loop

[Diagram: feedback loop between USER and OWNER. Labels: Costs; Strategy; Directly; Indirectly; Trial & Error (on both sides); Billed Users; Heavy Users; Caring Users; Users who are Owners; Credential.]

Users who do not optimize well behave irrationally and are hard to predict.

SLIDE 11

The Feedback Loop — Our Contribution

[Diagram: the same feedback loop between USER and OWNER, with Optimize replacing one of the two Trial & Error labels.]

Rational users can optimize a general utility function.

SLIDE 12

The Feedback Loop - Lookout

[Diagram: the same feedback loop between USER and OWNER, with Optimize on both sides and no Trial & Error labels.]

Towards the final goal of manipulating users to save energy

SLIDE 13

Solution Concept

[Diagram: the ExPERT flow in five numbered steps. Components: User; BoT; Statistical Characterization; Pareto Frontier Generation; Decision Making; User Scheduler; Execution; Unreliable Pool; Reliable Pool.]

SLIDE 14

Solution Concept - Step 1

Get additional data from the user (costs, reliable pool times).

SLIDE 15

Solution Concept - Step 2

Get unreliable resource statistics (trace analysis).

SLIDE 16

Solution Concept - Step 3

Compute a Pareto frontier for Cost,MS.

SLIDE 17

Solution Concept - Step 3

Estimate Cost, MS for each strategy in the search space: for every working point, the ExPERT Estimator computes several random realizations on the basis of the statistical characterization. The average makespan and cost over these realizations are used as the expectation values Cost, MS.
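The averaging step can be sketched as a small Monte Carlo loop. This is a minimal sketch, not the ExPERT Estimator itself: `toy_realization` is a hypothetical stand-in for one random realization of a single tail task, and its success probability, exponential turnaround time, reliable runtime and per-instance prices are invented for illustration.

```python
import random
from statistics import mean
from typing import Optional, Tuple


def toy_realization(n: int, t: float, rng: random.Random,
                    success_prob: float = 0.7,
                    mean_turnaround: float = 2000.0,
                    t_r: float = 1500.0,
                    cost_unreliable: float = 1.0,
                    cost_reliable: float = 10.0) -> Tuple[float, float]:
    """One random realization (cost, makespan) of a tail task. Toy model only."""
    cost = 0.0
    finish: Optional[float] = None
    for k in range(n):
        sent_at = k * t
        if finish is not None and finish <= sent_at:
            break  # a result has already arrived
        cost += cost_unreliable  # assumption: every sent instance is charged
        if rng.random() < success_prob:
            result_at = sent_at + rng.expovariate(1.0 / mean_turnaround)
            finish = result_at if finish is None else min(finish, result_at)
    reliable_at = n * t
    if finish is None or finish > reliable_at:
        cost += cost_reliable
        result_at = reliable_at + t_r
        finish = result_at if finish is None else min(finish, result_at)
    return cost, finish


def estimate(n: int, t: float, realizations: int = 1000,
             seed: int = 0) -> Tuple[float, float]:
    """Average cost and makespan over several random realizations."""
    rng = random.Random(seed)
    samples = [toy_realization(n, t, rng) for _ in range(realizations)]
    return mean([c for c, _ in samples]), mean([m for _, m in samples])
```

Calling `estimate` once per working point yields the (Cost, MS) pairs that the later frontier step consumes; a fixed seed keeps the estimates reproducible.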

SLIDE 18

Solution Concept - Step 3

Estimate Cost,MS for each strategy in the search space. Filter out dominated strategies. Keep frontier composed of non-dominated strategies.
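The filtering step can be illustrated with a short Python sketch. It assumes each strategy is summarized by a (cost, makespan, label) tuple, that lower is better in both objectives, and it uses a simple quadratic scan; the function name `pareto_frontier` and the sample values are hypothetical.

```python
from typing import List, Tuple

Strategy = Tuple[float, float, str]  # (cost, makespan, label)


def dominates(a: Strategy, b: Strategy) -> bool:
    """a dominates b if it is no worse in both objectives and better in one."""
    return a[0] <= b[0] and a[1] <= b[1] and (a[0] < b[0] or a[1] < b[1])


def pareto_frontier(strategies: List[Strategy]) -> List[Strategy]:
    """Keep the non-dominated strategies, sorted by makespan."""
    frontier = [s for s in strategies
                if not any(dominates(other, s) for other in strategies)]
    return sorted(frontier, key=lambda s: s[1])


# Hypothetical working points: S3 is worse than S2 in both objectives.
points = [(3.0, 5000.0, "S1"), (1.0, 8000.0, "S2"), (3.5, 8500.0, "S3")]
frontier = pareto_frontier(points)
```

The quadratic scan is adequate for a small search space; a sort-based sweep would be the usual choice for many working points.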

[Figure, built up over several slides: strategies plotted in the cost vs. makespan plane, with the dominated area marked. Strategy S3 is dominated and is filtered out. Strategies S1 and S2 are non-dominated and lie on the Pareto frontier.]

SLIDE 23

Solution Concept - Step 4

Choose optimal strategy according to user utility (by expectation value).

Get the N, T, D, Mr parameters of the chosen strategy.

SLIDE 24

Solution Concept - Step 5

Apply strategy: Feed N, T, D, Mr params as input to the user scheduler and deploy tasks on the resource pools.

SLIDE 25

Example - Based on a GridBoT Trace on UW-M

GridBoT:
* Supplies a unified front-end to multiple grids and clouds using their local resource management infrastructure.
* Employs dynamic run-time scheduling and replication strategies to execute BoTs in multiple environments simultaneously.

A BoT trace holds a line per task with the following fields:
* Status (failed/succeeded).
* Runtime: only for successful tasks.
* Wait time: from submission until the task starts running. May be unavailable for failed tasks.

Result time = Runtime + Wait time.

UW-M: Condor cluster of the University of Wisconsin-Madison.

SLIDE 26

Characterize the Unreliable Resource

* ♯ur: the effective size of the unreliable pool.
* F(t, t′) = reliability(t′) · Fs(t): CDF of result turnaround time, on the basis of:
  * Fs(t): the measured CDF of the result turnaround time (t) of successful tasks.
  * reliability(t′): the fraction of successful tasks as a function of the time since the BoT started, t′.

[Figure: probability vs. single result turnaround time [s].]
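The definition of F(t, t′) can be made concrete with a short Python sketch. It assumes Fs is taken as the empirical CDF of a list of turnaround times of successful tasks and that reliability is supplied as a function of t′; the helper names `empirical_cdf` and `make_turnaround_cdf` and the sample numbers are hypothetical.

```python
from bisect import bisect_right
from typing import Callable, List


def empirical_cdf(samples: List[float]) -> Callable[[float], float]:
    """Fs(t): fraction of successful tasks with turnaround time <= t."""
    ordered = sorted(samples)

    def cdf(t: float) -> float:
        return bisect_right(ordered, t) / len(ordered)

    return cdf


def make_turnaround_cdf(successful_turnarounds: List[float],
                        reliability: Callable[[float], float]
                        ) -> Callable[[float, float], float]:
    """F(t, t') = reliability(t') * Fs(t)."""
    fs = empirical_cdf(successful_turnarounds)

    def f(t: float, t_prime: float) -> float:
        return reliability(t_prime) * fs(t)

    return f


# Hypothetical data: four successful turnaround times [s] and a constant
# reliability of 0.8 regardless of the time since the BoT started.
F = make_turnaround_cdf([1000.0, 2000.0, 3000.0, 4000.0], lambda t_prime: 0.8)
```

Because reliability(t′) is below 1, F(t, t′) saturates below 1: some instances never return a result.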

SLIDE 27

Pareto Frontier and Working Points

[Figure: cost [cent/task] vs. tail makespan [s] (axis scale ×10^4) for the working points with N = 0, 1, 2, 3, together with the Pareto frontier.]

Local optimization is not trivial. Unoptimized strategies are wasteful.

SLIDE 28

Optimizing a General User Utility Function along the Pareto Frontier

[Figure: the NTDMr Pareto frontier, cost [cent/task] vs. tail makespan [s], with the points chosen by different utility functions: fastest; cheapest; cheapest within deadline (deadline = 6300 s); fastest within budget (budget = 2.5 cent/task); min cost × makespan. One point is both the fastest within budget and the min cost × makespan choice.]

The choice of utility function can be postponed until the user is aware of the cost-makespan tradeoff. The utility function does not have to be expressed explicitly.
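Selecting a working point along the frontier can be sketched as follows. This is an illustrative sketch: the frontier points are invented, only the deadline (6300 s) and budget (2.5 cent/task) come from the slide, and the helper `choose` is hypothetical.

```python
from typing import Callable, List, Optional, Tuple

Point = Tuple[float, float]  # (tail makespan [s], cost [cent/task])


def choose(frontier: List[Point],
           utility: Callable[[Point], float],
           feasible: Callable[[Point], bool] = lambda p: True) -> Optional[Point]:
    """Return the feasible frontier point with the smallest utility value."""
    candidates = [p for p in frontier if feasible(p)]
    return min(candidates, key=utility) if candidates else None


# Hypothetical Pareto frontier: makespan increases as cost decreases.
frontier = [(4600.0, 4.0), (5500.0, 2.4), (6200.0, 2.3), (7800.0, 2.0)]
deadline, budget = 6300.0, 2.5

fastest = choose(frontier, lambda p: p[0])
cheapest = choose(frontier, lambda p: p[1])
cheapest_within_deadline = choose(frontier, lambda p: p[1],
                                  lambda p: p[0] <= deadline)
fastest_within_budget = choose(frontier, lambda p: p[0],
                               lambda p: p[1] <= budget)
min_cost_x_makespan = choose(frontier, lambda p: p[0] * p[1])
```

Since every candidate already lies on the frontier, changing the utility function only changes which point is picked; the frontier itself does not need to be recomputed.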

SLIDE 29

Validation and Evaluation

[Diagram, built up over several slides. Components: ExPERT Estimator; Simulator; Wrapper.]

Validation:
* Experiments on real grids and a cloud (EC2).
* Validated simulator.
* Validated estimator.

Evaluation:
* Performance compared with static scheduling strategies.
* Insights regarding efficient resource use.

SLIDE 36

Resource Pools

Table: Resource Pools

Reliable Pool      Properties
Technion           20 self-owned CPUs in the Technion.
EC2                20 large EC2 cloud instances.

Unreliable Pool    Properties
UW-M               UW-Madison Condor pool (preempts).
OSG                Open Science Grid (no preemption).
UW-M + OSG         Combined: half ♯ur from each pool.
UW-M + EC2         Combined: 200 UW-M, 20 EC2.
UW-M + Technion    Combined: 200 UW-M, 20 Technion.

SLIDE 37

Validation: Prediction Deviation

[Figure: scatter plot with axes relative tail makespan deviation and relative tail cost deviation. Series: offline deviation; online deviation; mean absolute offline deviation; mean absolute online deviation. Data tips on the plot read (X: 20.38, Y: 13) and (X: 10.38, Y: 7.077).]

SLIDE 38

Augmenting the Estimator with Static Strategies

1. AR: All to Reliable
2. TRR: all Tail Replicated to Reliable (N = 0, T = 0)
3. TR: all Tail to Reliable (N = 0, T = D)
4. AUR: All to UnReliable, no replication
5. B: Budget of 1$ for a BoT of 150 tasks (on average, 2/3 cent per BoT task)
6. CN∞: Combine resources, no replication
7. CT0N1: Combine resources, replicate at tail with N = 1, T = 0
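The budget figure in strategy B is plain arithmetic, and the static strategies that state N and T can be written as parameter settings. The sketch below is illustrative only: the dictionary layout and the name `per_task_budget_cents` are assumptions, and the string "D" stands for the instance deadline rather than a number.

```python
from fractions import Fraction


def per_task_budget_cents(budget_cents: int, tasks: int) -> Fraction:
    """Average budget per task, in cents, kept as an exact fraction."""
    return Fraction(budget_cents, tasks)


# Parameter settings as stated in the list above; strategies whose N and T
# are not given there (AR, AUR, B, CN-infinity) are left out on purpose.
STATIC_STRATEGIES = {
    "TRR": {"N": 0, "T": 0},
    "TR": {"N": 0, "T": "D"},
    "CT0N1": {"N": 1, "T": 0},
}

# 1$ = 100 cents spread over a BoT of 150 tasks.
b = per_task_budget_cents(100, 150)
```

Keeping the result as a `Fraction` avoids rounding 2/3 to a decimal before it is compared with per-task costs.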

SLIDE 39

Bi-Objective Performance, Mr^max = 0.1

[Figure: cost [cent/task] vs. makespan [s] (axis scale ×10^4). Shown: the NTDMr Pareto frontier with the ExPERT recommended strategy, and the static strategies AUR, CN∞, CT0N1, TRR, TR and B = 5 cent/task. Annotations: 72% cost reduction; 33% makespan reduction.]

ExPERT's Pareto frontier dominates most static strategies.

SLIDE 40

Performance: makespan × cost, 3 Mr^max values

[Figure: BoT makespan × cost per task (axis scale ×10^5) for #reliable/#unreliable (Mr^max) = 0.1, 0.3, 0.5. Strategies: AR; TRR; TR; AUR; B = 5.00 cent/task; CN∞; CT0N1; ExPERT Rec. Value labels on the plot: 2×10^6; 5×10^5; 4×10^5.]

Smaller is better. The cost × makespan of ExPERT's recommended strategy is 25% lower than the second-best and at least 72% lower than the third-best.

SLIDE 41

Insight: Efficient Reliable Resource Use Includes a Queue

[Figure: Mr and queue length as a fraction of tail tasks, vs. tail makespan [s]. Series: used Mr; max reliable queue; Mr^max. Annotation: used Mr < Mr^max.]

On the Pareto frontier, usually all reliable resources are used at some point, and a significant queue is built for them.

SLIDE 42

Insight: the Importance of Mr as a Free Parameter

[Figure: axes cost/task [cent/task] and tail makespan. Curves for Mr = 0.02, 0.06, 0.1, 0.2, 0.4 and for all Mr values combined (Mr is a free parameter). Annotation: lower cost for the same makespan using a smaller reliable pool (low Mr).]

The free parameter Mr enables efficient strategies with lower costs for the same makespan. It makes tasks wait in a queue, where they may be canceled.

SLIDE 43

Conclusion

* The NTDMr strategy space is large enough to provide flexibility for user preferences.
* ExPERT-recommended strategies finish in two-thirds of the time and cost a quarter of commonly used static strategies.
* Using ExPERT means you do not waste time or money, and you optimize your own utility function.

SLIDE 44

Questions?

Contact us at:
* Orna Agmon Ben-Yehuda: ladypine@cs.technion.ac.il
* Assaf Schuster: assaf@cs.technion.ac.il
* Artyom Sharov: sharov@cs.technion.ac.il
* Mark Silberstein: marks@cs.technion.ac.il
* Alexandru Iosup: A.Iosup@tudelft.nl

Thank You!
