Computing Load Aware and Long-View Load Balancing for Cluster Storage Systems
Guoxin Liu, Haiying Shen and Haoyu Wang
Holcombe Department of Electrical and Computer Engineering, Clemson University
Presented by Haoyu Wang
Outline
1. Introduction
2. System design
3. Performance evaluation
4. Conclusion
Introduction
Background (Clemson Palmetto Clusters)
Load balancing problem
Existing approaches balance the I/O load and the data storage. Why not also consider the computing workload?
Introduction
Previous work
Challenges for load balancing:
– Data locality
– Task delay
– Long-term load balance
– Cost-efficiency & scalability
Related work:
– Random data allocation
– Balancing the number of data blocks
– Balancing the I/O load
System Design
Main contributions
1. Trace analysis on computing workloads
2. Computing load aware long-view load balancing method
3. Trace-driven experiments
System Design
Trace Data Analysis
[Figure: CDF of task running time (s)]
[Figure: CDF of the number of currently submitted tasks]
[Figure: CDF of the number of currently submitted tasks from different jobs]
[Figure: CDF of the number of data transmissions of a server]
[Figure: CDF of the waiting time of a task (s)]
System Design
CALV System Overview
Coefficient-based data reallocation
Principle 1: Data blocks that contribute more computing workload at more overloaded epochs, in both the spatial and the temporal space, have higher priority to be selected for reallocation.
Principle 2: Among all data blocks contributing workload at an overloaded epoch, those contributing less workload at more underloaded epochs have higher priority to be selected for reallocation.
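The two selection principles can be sketched as a scoring function over a server's blocks. This is an illustrative sketch only: the function name, the per-epoch workload representation, and the simple "overloaded minus underloaded" score are assumptions, not the paper's actual coefficient definition.

```python
def reallocation_priority(block_load, overloaded, underloaded):
    """Score a data block for reallocation (hypothetical sketch).

    block_load: dict mapping epoch -> computing workload this block contributes
    overloaded / underloaded: sets of epochs where the server is over-/underloaded

    Principle 1: more workload at overloaded epochs -> higher priority.
    Principle 2: less workload at underloaded epochs -> higher priority.
    """
    over = sum(load for e, load in block_load.items() if e in overloaded)
    under = sum(load for e, load in block_load.items() if e in underloaded)
    return over - under

# Example: three blocks on one server, epochs 1-3.
blocks = {
    "d1": {1: 5.0, 2: 1.0},   # heavy at the overloaded epoch 1
    "d2": {1: 2.0, 3: 4.0},   # also busy at the underloaded epoch 3
    "d3": {2: 3.0, 3: 0.5},
}
overloaded, underloaded = {1}, {3}

# Reallocate the highest-priority blocks first.
order = sorted(
    blocks,
    key=lambda b: reallocation_priority(blocks[b], overloaded, underloaded),
    reverse=True,
)
```

Under these toy numbers, d1 ranks first (it dominates the overloaded epoch), while d2 ranks last because moving it would drain the server at its underloaded epoch.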
System Design
CALV System Overview
Coefficient-based data reallocation
Selection of data blocks to reallocate
[Figure: Reallocation across servers Si, Sj and Sk over epochs e1–e3 (data blocks d1–d7; the horizontal line marks each server's computing capacity): (a) reduce the number of reported data blocks in spatial space; (b) reduce the number of reported data blocks in temporal space; (c) avoid server underload.]
System Design
CALV System Overview
Lazy data block transmission
[Figure: Lazy data block transmission between servers Si and Sj over epochs e1–e4; the horizontal line marks each server's computing capacity.]
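The idea behind lazy transmission is to avoid moving all reallocated blocks up front: a block stays on its source server and is only copied shortly before the epoch in which it would otherwise cause overload, spreading the transfer cost over time. A minimal sketch under that assumption (the scheduling rule and names are hypothetical, not the paper's exact mechanism):

```python
def schedule_lazy_transfers(reallocations):
    """Hypothetical lazy-transmission scheduler.

    reallocations: list of (block, first_overload_epoch) pairs, where
    first_overload_epoch is the first epoch at which keeping the block
    on its source server would cause overload.

    Returns a dict mapping epoch -> blocks to transmit at that epoch,
    transmitting each block one epoch before it is needed (never before
    epoch 0) instead of all at reallocation time.
    """
    schedule = {}
    for block, epoch in reallocations:
        schedule.setdefault(max(epoch - 1, 0), []).append(block)
    return schedule

plan = schedule_lazy_transfers([("d1", 3), ("d5", 3), ("d2", 1)])
```

Here d1 and d5 are both deferred to epoch 2 rather than transmitted immediately, which is what lowers the peak number of concurrently reallocated blocks.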
Performance Evaluation
Trace-driven experiments
Simulated environment:
– 3000 servers with a typical fat-tree topology
– 8 computing slots per server
– Epoch length set to 1 second
Comparison methods: Random, Sierra, Ursa, CA
Performance Evaluation
Trace-driven experiments Performance of Data locality
[Figure: Network load as a % of Random vs. the number of jobs (0.5x–1.5x) for Random, Sierra, Ursa, CA and CALV.]
Performance Evaluation
Trace-driven experiments Performance of Task Latency
[Figure: Reduced average latency per task (s), relative to Random, vs. the number of jobs (0.5x–1.5x) for Sierra, Ursa, CA and CALV.]
Performance Evaluation
Trace-driven experiments Performance of Cost-Efficiency
[Figure: Number of reported blocks vs. the number of jobs (0.5x–1.5x) for CALV, CALV-MAX, CALV-Random and CALV-All.]
Performance of lazy data transmission
[Figure: Saved % of network load, saved % of peak number of reallocated blocks, and reduced number of overloads (×20) vs. the number of jobs (0.5x–1.5x).]
Conclusion
– Considering the computing workloads is important for load balancing
– CALV is cost-efficient and achieves long-term load balance
The End