SLIDE 1 ExPERT: Pareto-efficient task replication on grids and a cloud
Orna Agmon Ben-Yehuda (1), Assaf Schuster (1), Artyom Sharov (1), Mark Silberstein (1), Alexandru Iosup (2)
(1) Department of Computer Science, Technion - Israel Institute of Technology
(2) Faculty of Engineering, Mathematics and Computer Science (EWI), TU Delft
IPDPS, May 2012
Agmon Ben-Yehuda, Schuster, Sharov, Silberstein, Iosup ExPERT 1/33
SLIDE 2
The Shared Resource Game — Players and Goals
[Diagram: resource OWNER and resource USER. Owner side: costs, policy enforcing QoS. User side: workload, paying for resources or QoS, strategy (declarations, resource usage).]
Owner goals - minimize: effective load, operational costs (energy).
User goals - minimize: makespan, cost.
SLIDE 3 The Unreliable Shared Resource Game
[Diagram: grid OWNER and grid USER. Owner side: costs, (preemption) policy enforcing QoS. User side: bag of async tasks, paying for credentials. Resources: an unreliable grid and a slow/costly reliable alternative.]
Owner goals - minimize: operational costs (energy), effective load.
User goals - minimize: makespan, cost.
An environment of uncertainty: Will the task fail on the unreliable resource? Which system to use?
SLIDE 4
In the Beginning...
[Diagram: unreliable queue -> unreliable pool -> success, failure, or timeout after D.]
#machines < #unfinished tasks.
D - instance deadline.
No replication (replication is inefficient).
SLIDE 5 Using the Same Strategy After the Tail Starts
#machines > #unfinished tasks
[Figure: number of remaining tasks vs. time [s]. The tail phase start time (T_tail) separates the throughput phase from the tail phase.]
The tail is wagging the dog...
SLIDE 6
Replication - the User’s Bank of NTDMr Strategies
[Diagram: the first N tail instances go through the unreliable queue to the unreliable pool, separated by T, each ending in success, failure, or timeout after D. Instance N+1 in the tail goes through the reliable queue to the reliable pool and ends in success.]
D - instance deadline, T - replication time.
A reliable machine is used to ensure task completion.
At most N tail instances on unreliable resources.
Mr - max ratio of reliable to unreliable resources.
SLIDE 8 Replication - the User’s Bank of NTDMr Strategies
Example: Number of unreliable instances N = 3
[Timeline figure: instances UNRELIABLE1, UNRELIABLE2, and RELIABLE along a time axis marked 2T, 3T, D, and 3T+Tr.]
Replication wastes work!
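The N = 3 example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the first tail instance is sent at time 0, each later instance is sent T after the previous one if no result has arrived, and the reliable instance starts running as soon as it is sent (no reliable queue). The class and function names are made up for this sketch and are not part of ExPERT.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class NTDMr:
    """One point in the NTDMr strategy space (field names are illustrative)."""
    n: int      # N: max tail instances sent to unreliable resources
    t: float    # T: replication time, wait before sending the next instance [s]
    d: float    # D: instance deadline on the unreliable pool [s]
    mr: float   # Mr: max ratio of reliable to unreliable resources


def tail_schedule(s: NTDMr) -> List[Tuple[float, str]]:
    """Send times for one tail task if no result ever arrives.

    Assumption: instance k (0-based) is sent at k*T to the unreliable pool,
    and instance N+1 is sent at N*T to the reliable pool.
    """
    plan = [(k * s.t, "unreliable") for k in range(s.n)]
    plan.append((s.n * s.t, "reliable"))
    return plan


def worst_case_finish(s: NTDMr, reliable_runtime: float) -> float:
    """Latest completion time: reliable instance sent at N*T, runs for Tr."""
    return s.n * s.t + reliable_runtime


# Hypothetical numbers, chosen only to make the timeline concrete.
strategy = NTDMr(n=3, t=1000.0, d=4000.0, mr=0.1)
for when, pool in tail_schedule(strategy):
    print(f"t={when:>6.0f}s  send to {pool}")
print("finishes by", worst_case_finish(strategy, reliable_runtime=500.0))
```

With N = 3 the reliable instance is sent at 3T and finishes at 3T+Tr, matching the marks on the timeline. Any unreliable instance still running at that point is wasted work.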
SLIDE 9
The User’s Problem: Optimization of...
The user cares about multi-objective optimization:
Cost - mean cost per task, or tail cost per tail task.
MS - mean makespan, or tail makespan.
Each user may have her own objective, depending on those values:
  Below a makespan limit: MS < Const
  As fast as possible: min MS
  Below max budget: Cost < Const
  As cheap as possible: min Cost
  Best price for the goods: min Cost × MS
  Any other function of the means Cost, MS, ...
SLIDE 10 The Feedback Loop
[Diagram: feedback loop between OWNER (costs) and USER (strategy). Costs reach users directly or indirectly. Both sides proceed by trial and error. User labels: Users who are Owners, Billed Users, Heavy Users, Caring Users, Credential.]
Users who do not optimize well behave irrationally and are hard to predict.
SLIDE 11 The Feedback Loop — Our Contribution
[Diagram: feedback loop between OWNER (costs) and USER (strategy). The user side now shows "Optimize" in place of trial and error; the owner still proceeds by trial and error.]
Rational users can optimize a general utility function.
SLIDE 12 The Feedback Loop - Lookout
[Diagram: feedback loop between OWNER (costs) and USER (strategy). Both sides show "Optimize".]
Towards the final goal of manipulating users to save energy.
SLIDE 13 Solution Concept
[Diagram: ExPERT pipeline. (1) User input, (2) statistical characterization, (3) Pareto frontier generation, (4) decision making, (5) BoT execution by the user scheduler on the unreliable and reliable pools.]
SLIDE 14 Solution Concept - Step 1
Get additional data from the user (costs, reliable pool times).
SLIDE 15 Solution Concept - Step 2
Get unreliable resource statistics (trace analysis).
SLIDE 16 Solution Concept - Step 3
Compute a Pareto frontier for Cost,MS.
SLIDE 17 Solution Concept - Step 3
Estimate Cost,MS for each strategy in the search space:
For every working point, the ExPERT Estimator computes several random realizations on the basis of the statistical characterization.
The average makespan and cost over these realizations are used as the expectation values Cost,MS.
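The averaging over random realizations can be illustrated with a small Monte Carlo sketch. It is not the ExPERT Estimator: it models a single tail task rather than a whole BoT, ignores Mr and reliable-queue waiting, and charges a flat, hypothetical price per sent instance. The function names, the `draw_turnaround` callback, and all numbers are assumptions made for this illustration.

```python
import random
from statistics import mean
from typing import Callable, Optional, Tuple


def simulate_tail_task(n: int, t: float, d: float,
                       draw_turnaround: Callable[[random.Random], Optional[float]],
                       reliable_runtime: float,
                       price_unreliable: float, price_reliable: float,
                       rng: random.Random) -> Tuple[float, float]:
    """One random realization for one tail task: (finish time, cost)."""
    best = float("inf")  # earliest result time among instances sent so far
    cost = 0.0
    for k in range(n):
        sent = k * t
        if sent >= best:  # a result has already arrived: stop replicating
            break
        cost += price_unreliable
        turnaround = draw_turnaround(rng)  # None models a failed instance
        if turnaround is not None and turnaround <= d:
            best = min(best, sent + turnaround)
    if n * t < best:  # still no result at N*T: fall back to the reliable pool
        cost += price_reliable
        best = min(best, n * t + reliable_runtime)
    return best, cost


def estimate(n: int, t: float, d: float,
             draw_turnaround: Callable[[random.Random], Optional[float]],
             reliable_runtime: float,
             price_unreliable: float, price_reliable: float,
             realizations: int = 1000, seed: int = 0) -> Tuple[float, float]:
    """Average finish time and cost over several random realizations."""
    rng = random.Random(seed)
    runs = [simulate_tail_task(n, t, d, draw_turnaround, reliable_runtime,
                               price_unreliable, price_reliable, rng)
            for _ in range(realizations)]
    return mean(r[0] for r in runs), mean(r[1] for r in runs)


def toy_unreliable(rng: random.Random) -> Optional[float]:
    """Hypothetical pool: 80% of instances succeed, taking 500 to 3000 s."""
    return rng.uniform(500.0, 3000.0) if rng.random() < 0.8 else None


ms, cost = estimate(n=2, t=1000.0, d=4000.0, draw_turnaround=toy_unreliable,
                    reliable_runtime=600.0, price_unreliable=0.5,
                    price_reliable=4.0)
print(f"mean finish time {ms:.0f} s, mean cost {cost:.2f} cent")
```

Running `estimate` once per (N, T, D) combination yields one (makespan, cost) pair per working point, which is the input the frontier step needs.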
SLIDE 20 Solution Concept - Step 3
Estimate Cost,MS for each strategy in the search space. Filter out dominated strategies. Keep frontier composed of non-dominated strategies.
[Figure: cost vs. makespan plane with the dominated area marked. Strategies S1 and S2 are non-dominated; strategy S3 is dominated.]
SLIDE 22 Solution Concept - Step 3
Estimate Cost,MS for each strategy in the search space. Filter out dominated strategies. Keep frontier composed of non-dominated strategies.
[Figure: cost vs. makespan plane. The non-dominated strategies S1 and S2 form the Pareto frontier.]
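Filtering out dominated strategies can be sketched as follows. The code is a plain quadratic-time dominance check, which may differ from how ExPERT does it; the coordinates for S1, S2, and S3 are invented so that S3 is dominated, as in the figure.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]  # (makespan [s], cost [cent/task]), both minimized


def dominates(a: Point, b: Point) -> bool:
    """a dominates b if it is no worse in both objectives and not identical."""
    return a[0] <= b[0] and a[1] <= b[1] and a != b


def pareto_frontier(points: Dict[str, Point]) -> List[str]:
    """Names of non-dominated strategies, ordered by increasing makespan."""
    frontier = [name for name, p in points.items()
                if not any(dominates(q, p) for q in points.values())]
    return sorted(frontier, key=lambda name: points[name])


# Hypothetical estimates for three strategies.
strategies = {
    "S1": (5000.0, 3.0),
    "S2": (7000.0, 1.0),
    "S3": (7500.0, 3.5),  # slower and costlier than S1, hence dominated
}
print(pareto_frontier(strategies))
```

S1 and S2 survive because neither is better than the other in both objectives: S1 is faster, S2 is cheaper.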
SLIDE 23 Solution Concept - Step 4
Choose optimal strategy according to user utility (by expectation value).
Get N, T, D, Mr params for the desired strategy.
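Choosing a strategy by utility amounts to minimizing a user-supplied function over the frontier, optionally under a constraint. The sketch below is one way to express that; `WorkingPoint`, `choose`, and the four frontier points are hypothetical, while the deadline (6300 s) and budget (2.5 cent/task) are the example values from the talk.

```python
from typing import Callable, List, NamedTuple, Optional


class WorkingPoint(NamedTuple):
    n: int           # N
    t: float         # T [s]
    d: float         # D [s]
    mr: float        # Mr
    makespan: float  # estimated mean makespan [s]
    cost: float      # estimated mean cost [cent/task]


def choose(frontier: List[WorkingPoint],
           utility: Callable[[WorkingPoint], float],
           feasible: Callable[[WorkingPoint], bool] = lambda p: True
           ) -> Optional[WorkingPoint]:
    """Return the feasible frontier point with the smallest utility value."""
    candidates = [p for p in frontier if feasible(p)]
    return min(candidates, key=utility) if candidates else None


# Hypothetical frontier points, not taken from the experiments.
frontier = [
    WorkingPoint(0, 0.0, 2000.0, 0.4, makespan=4800.0, cost=3.8),
    WorkingPoint(1, 1500.0, 3000.0, 0.2, makespan=5600.0, cost=2.0),
    WorkingPoint(2, 2000.0, 4000.0, 0.1, makespan=6200.0, cost=1.9),
    WorkingPoint(3, 2500.0, 5000.0, 0.02, makespan=7800.0, cost=1.6),
]

fastest = choose(frontier, lambda p: p.makespan)
cheapest = choose(frontier, lambda p: p.cost)
cheapest_in_deadline = choose(frontier, lambda p: p.cost,
                              lambda p: p.makespan < 6300.0)
fastest_in_budget = choose(frontier, lambda p: p.makespan,
                           lambda p: p.cost < 2.5)
best_product = choose(frontier, lambda p: p.cost * p.makespan)

for label, p in [("fastest", fastest), ("cheapest", cheapest),
                 ("cheapest within deadline", cheapest_in_deadline),
                 ("fastest within budget", fastest_in_budget),
                 ("min cost x makespan", best_product)]:
    print(f"{label}: N={p.n}, T={p.t}, D={p.d}, Mr={p.mr}")
```

The selected point carries its own N, T, D, and Mr, so the same object can be handed to the user scheduler.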
SLIDE 24 Solution Concept - Step 5
Apply strategy: Feed N, T, D, Mr params as input to the user scheduler and deploy tasks on the resource pools.
SLIDE 25
Example - Based on a GridBoT Trace on UW-M
GridBoT:
  Supplies a unified front-end to multiple grids and clouds using their local resource management infrastructure.
  Employs dynamic run-time scheduling and replication strategies to execute BoTs in multiple environments simultaneously.
A BoT trace holds a line per task with the following fields:
  Status (failed/succeeded).
  Runtime: only for successful tasks.
  Wait time: from submission until the task starts running. May be unavailable for failed tasks.
  Result time = Runtime + Wait time.
UW-M: Condor cluster of the University of Wisconsin-Madison.
SLIDE 26 Characterize the Unreliable Resource
#ur: the effective size of the unreliable pool.
F(t, t′) = reliability(t′) · Fs(t): CDF of result turnaround time, on the basis of:
  Fs(t): the measured CDF of the result turnaround time (t) of successful tasks.
  reliability(t′): the fraction of successful tasks as a function of the time since the BoT started, t′.
[Figure: probability vs. single result turnaround time [s].]
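A minimal sketch of building F(t, t′) from a trace is shown below. Two things here are assumptions rather than part of the talk: the `sent_at` field (the trace format described earlier does not list a submission timestamp) and the sliding-window estimate of reliability(t′). The empirical CDF and the product form follow the definitions above.

```python
from bisect import bisect_right
from typing import Callable, List, NamedTuple, Optional


class TraceLine(NamedTuple):
    sent_at: float              # hypothetical field: seconds since the BoT started
    succeeded: bool             # status
    runtime: Optional[float]    # only for successful tasks
    wait_time: Optional[float]  # may be unavailable for failed tasks


def success_cdf(trace: List[TraceLine]) -> Callable[[float], float]:
    """Fs(t): empirical CDF of result time (runtime + wait time) of successes."""
    times = sorted(r.runtime + r.wait_time for r in trace if r.succeeded)

    def fs(t: float) -> float:
        return bisect_right(times, t) / len(times) if times else 0.0
    return fs


def reliability(trace: List[TraceLine], window: float) -> Callable[[float], float]:
    """reliability(t'): success fraction among instances sent near t'.

    Assumption: "near" means within `window` seconds of t'.
    """
    def rel(t_prime: float) -> float:
        near = [r for r in trace if abs(r.sent_at - t_prime) <= window]
        return sum(r.succeeded for r in near) / len(near) if near else 0.0
    return rel


def turnaround_cdf(trace: List[TraceLine], window: float
                   ) -> Callable[[float, float], float]:
    """F(t, t') = reliability(t') * Fs(t)."""
    fs, rel = success_cdf(trace), reliability(trace, window)
    return lambda t, t_prime: rel(t_prime) * fs(t)


# Hypothetical five-line trace: four successes and one failure.
trace = [
    TraceLine(0.0, True, 100.0, 50.0),
    TraceLine(10.0, True, 300.0, 100.0),
    TraceLine(20.0, False, None, None),
    TraceLine(30.0, True, 500.0, 100.0),
    TraceLine(40.0, True, 700.0, 200.0),
]
F = turnaround_cdf(trace, window=100.0)
print(F(400.0, 15.0))  # probability of a result within 400 s for an instance sent at t'=15 s
```

Because reliability(t′) is at most 1, F never reaches 1 when some instances fail; the missing probability mass is the chance that no result ever arrives.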
SLIDE 27 Pareto Frontier and Working Points
[Figure: cost [cent/task] vs. tail makespan [s] (x axis scaled by 10^4) for working points with N=0, N=1, N=2, and N=3.]
Local optimization is not trivial. Unoptimized strategies are wasteful.
SLIDE 28 Optimizing a General User Utility Function along the Pareto Frontier
[Figure: cost [cent/task] vs. tail makespan [s] along the NTDMr Pareto frontier. Marked points: Fastest; Cheapest within deadline (Deadline=6300 s); a single point that is both Fastest within budget (Budget=2.5 cent/task) and Min Cost x Makespan; Cheapest.]
Utility function choice can be postponed till the user is aware of the cost-makespan tradeoff. It does not have to be expressed.
SLIDE 35 Validation and Evaluation
[Diagram: the ExPERT Estimator and a Simulator Wrapper. Validation: experiments on real grids and a cloud (EC2) give a validated simulator and a validated estimator. Evaluation: scheduling performance compared with static strategies; insights regarding efficient resource use.]
SLIDE 36
Resource Pools
Table: Resource Pools
Reliable Pool      Properties
Technion           20 self-owned CPUs in the Technion.
EC2                20 large EC2 cloud instances.
Unreliable Pool    Properties
UW-M               UW-Madison Condor pool (preempts).
OSG                Open Science Grid (no preemption).
UW-M + OSG         Combined: half #ur from each pool.
UW-M + EC2         Combined: 200 UW-M, 20 EC2.
UW-M + Technion    Combined: 200 UW-M, 20 Technion.
SLIDE 37 Validation: Prediction Deviation
[Figure: relative tail cost deviation vs. relative tail makespan deviation. Series: offline deviation, online deviation, mean absolute offline deviation, mean absolute online deviation. Data tips: (X: 20.38, Y: 13) and (X: 10.38, Y: 7.077).]
SLIDE 38 Augmenting the Estimator with Static Strategies
1. AR: All to Reliable
2. TRR: all Tail Replicated to Reliable (N = 0, T = 0)
3. TR: all Tail to Reliable (N = 0, T = D)
4. AUR: All to UnReliable, no replication
5. B: Budget of 1$ for a BoT of 150 tasks (on average, 2/3 cent per BoT task)
6. CN∞: Combine resources, no replication
7. CT0N1: Combine resources, replicate at tail with N = 1, T = 0
SLIDE 39 Bi-Objective Performance, Mr^max = 0.1
[Figure: cost [cent/task] vs. makespan [s] (x axis scaled by 10^4). The NTDMr Pareto frontier and the ExPERT-recommended point are plotted against the static strategies CT0N1, TRR, TR, B=5 cent/task, AUR, and CN∞. Annotations: 72% cost reduction, 33% makespan reduction.]
ExPERT's Pareto frontier dominates most static strategies.
SLIDE 40 Performance: makespan × cost, 3 Mr^max values
[Figure: BoT makespan * cost/task (y axis scaled by 10^5) vs. #reliable/#unreliable (Mr^max = 0.1, 0.3, 0.5) for AR, TRR, TR, AUR, B=5.00 cent/task, C, CT0N1, and ExPERT Rec. Annotated values: 2*10^6, 5*10^5, 4*10^5.]
Smaller is better. The cost × makespan of ExPERT's recommended strategy is 25% lower than the second-best and at least 72% lower than the third-best.
SLIDE 41 Insight: Efficient Reliable Resource Use Includes a Queue
[Figure: Mr and queue length as a fraction of tail tasks vs. tail makespan [s]. Series: used Mr, Mr^max, max reliable queue. Annotation: Used Mr < Mr^max.]
On the Pareto frontier, usually all reliable resources are used at some point, and a significant queue is built for them.
SLIDE 42 Insight: the Importance of Mr as a Free Parameter
[Figure: cost/task [cent/task] vs. tail makespan for Pareto frontiers with Mr=0.02, Mr=0.06, Mr=0.1, Mr=0.2, Mr=0.4, and with all Mr values combined (Mr is a free parameter). Annotation: lower cost for the same makespan using a smaller reliable pool (low Mr).]
The free parameter Mr enables efficient strategies with lower costs for the same makespan. It makes tasks wait in a queue, where they may be canceled.
SLIDE 43
Conclusion
The NTDMr strategy space is vast enough to provide user preference flexibility.
ExPERT-recommended strategies take two-thirds of the time and a quarter of the cost of commonly used static strategies.
Using ExPERT means you do not waste time or money, and you optimize your own utility function.
SLIDE 44
Questions?
Contact us at:
Orna Agmon Ben-Yehuda ladypine@cs.technion.ac.il
Assaf Schuster assaf@cs.technion.ac.il
Artyom Sharov sharov@cs.technion.ac.il
Mark Silberstein marks@cs.technion.ac.il
Alexandru Iosup A.Iosup@tudelft.nl
Thank You!