Principled Schedulability Analysis for Distributed Storage Systems



SLIDE 1

Principled Schedulability Analysis for Distributed Storage Systems Using Thread Architecture Models

Suli Yang*, Jing Liu, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau

* work done while at UW-Madison

SLIDE 2

Scheduling: A Fundamental Primitive

  • Modern storage systems are shared
  • Correct and efficient request scheduling is indispensable

[Figure: client applications (including Snapchat) issuing R/W requests to Shared Storage]

SLIDE 3

Broken Scheduling in Current Systems

  • Popular storage systems have fundamental scheduling deficiencies

[MongoDB - #21858]:

“A high throughput update workload … could cause starvation on secondary reads”

[HBase - #8884]:

“ …when the read load is high on a specific RS is high, the write throughput also get impacted dramatically, and even write data loss...”

[Cassandra - #10989]:

“inability to balance writes/reads/compaction/flushing…”

etc.

SLIDE 4

Why Is Scheduling Broken?

  • The complexities in modern storage systems:
  • Distributed: >1000 servers
  • Highly concurrent: ~1000 interacting threads in each server
  • Long execution paths: requests traverse numerous threads across multiple machines

We introduce the Thread Architecture Model (TAM) to describe these scheduling complexities

SLIDE 5

Thread Architecture Model (TAM)

  • Encodes scheduling related info:
  • Request flows
  • Thread interactions
  • Resource consumption patterns
  • Easy to obtain automatically
  • From complicated systems to an understandable and analyzable model

  • HBase
  • Cassandra
  • MongoDB
  • Riak

[Figure: example TAM with RegionServer/DataNode processes. Stage labels: RPC Read, RPC Handle, RPC Respond, LOG Append, LOG Sync, Mem Flush, Data Stream, Data Xceive, Packet Ack, Ack Process. Request-flow labels: r1, r2, w1 to w7, a1 to a3, f1.]

SLIDE 6

TAM Exposes Scheduling Problems

  • We discovered five categories of problems that happen in real systems
  • Lack of scheduling points
  • Unknown resource usage
  • Hidden contention between threads
  • Uncontrolled thread blocking
  • Ordering constraints upon requests

SLIDE 7

Fixing Problems Leads to Effective Scheduling

  • TAM-based simulation finds problem-free thread architectures
  • Provides schedulability: various desired scheduling policies can be realized
  • HBase to Tamed-HBase
  • Implementation transforms the system to be schedulable
  • Muzzled-HBase: approximate implementation
  • Effective scheduling under YCSB and other workloads

SLIDE 8

Thread Architecture Model enables principled schedulability analysis on general distributed storage systems

SLIDE 9

Outline

  • Overview
  • Thread Architecture Model
  • Scheduling Problems
  • Achieve Schedulability: A Case Study
  • Conclusion

SLIDE 10

Thread Architecture Model

[Figure: example TAM with RegionServer/DataNode processes. Stage labels: RPC Read, RPC Handle, RPC Respond, LOG Append, LOG Sync, Mem Flush, Data Stream, Data Xceive, Packet Ack, Ack Process.]

Legend:

  • Stage: a named box (threads performing similar tasks)
  • Resource usage: C (CPU), I (I/O), N (network), L (lock)
  • Request flow
  • Request queue (scheduling point)
  • Blocking

SLIDE 11

Thread Architecture Model

  • TAM encodes scheduling related info:
  • Request flows
  • Thread interactions
  • Resource consumption patterns
  • From complex systems to analyzable models
  • TADalyzer: from live system to TAM automatically
  • Only 20-50 lines of user annotation code required
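
The information a TAM encodes (stages, request flows, resource usage, scheduling points, blocking) can be pictured as a small graph structure. The following is a minimal sketch only: the class and method names are hypothetical, and the wiring between the stages is an illustrative assumption that borrows stage names from the figure; it is not the output of TADalyzer.

```python
from dataclasses import dataclass, field


@dataclass
class Stage:
    """A group of threads performing similar tasks."""
    name: str
    resources: tuple = ()     # any of "CPU", "IO", "NET", "LOCK"
    has_queue: bool = True    # a request queue is a scheduling point


@dataclass
class ThreadArchitectureModel:
    stages: dict = field(default_factory=dict)    # stage name -> Stage
    flows: dict = field(default_factory=dict)     # stage name -> next stage names
    blocking: set = field(default_factory=set)    # (waiter, waited_on) pairs

    def add_stage(self, name, resources=(), has_queue=True):
        self.stages[name] = Stage(name, tuple(resources), has_queue)
        self.flows.setdefault(name, [])

    def add_flow(self, src, dst):
        self.flows[src].append(dst)

    def add_blocking(self, waiter, waited_on):
        self.blocking.add((waiter, waited_on))

    def reachable(self, start):
        """Stages a request entering at `start` may traverse, in BFS order."""
        seen, order, frontier = {start}, [start], [start]
        while frontier:
            next_frontier = []
            for src in frontier:
                for dst in self.flows[src]:
                    if dst not in seen:
                        seen.add(dst)
                        order.append(dst)
                        next_frontier.append(dst)
            frontier = next_frontier
        return order

    def resources_used(self, start):
        """Union of resources touched along all flows from `start`."""
        used = set()
        for name in self.reachable(start):
            used.update(self.stages[name].resources)
        return used


# Hypothetical wiring: stage names come from the slides, edges and
# resource assignments are assumptions made for illustration.
tam = ThreadArchitectureModel()
tam.add_stage("RPC Read", ["NET"])
tam.add_stage("RPC Handle", ["CPU", "IO", "LOCK"])
tam.add_stage("LOG Append", ["CPU"])
tam.add_stage("LOG Sync", ["IO"])
tam.add_stage("RPC Respond", ["NET"])
tam.add_flow("RPC Read", "RPC Handle")
tam.add_flow("RPC Handle", "LOG Append")
tam.add_flow("LOG Append", "LOG Sync")
tam.add_flow("RPC Handle", "RPC Respond")
tam.add_blocking("RPC Handle", "LOG Sync")

print(tam.reachable("RPC Read"))
print(sorted(tam.resources_used("RPC Read")))
```

Keeping flows and blocking relations as separate edge sets mirrors the legend, where request flow and blocking are drawn with different arrows.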

SLIDE 12

Outline

  • Overview
  • Thread Architecture Model
  • Scheduling Problems
  • Achieve Schedulability: A Case Study
  • Conclusion

SLIDE 13

TAM Exposes Scheduling Problems

  • No scheduling
  • Unknown resource usage
  • Hidden contention
  • Blocking
  • Ordering constraint
  • Common in distributed storage systems
  • HBase, Cassandra, MongoDB, Riak…
  • Directly identifiable from TAM
  • No low-level implementation details required

SLIDE 15

Scheduling Problem: Unknown Resource Usage

[Figure: TAM of two Cassandra nodes. Stage labels: C-ReqHandle, Msg In, Read, Mutation, Respond, C-Respond, Msg Out.]

SLIDE 16

Scheduling Problem: Unknown Resource Usage

Workload:

C1: issues cold requests
C2: issues cold and cached requests

Expectation:

C2 has much higher throughput (due to cached requests)

CPU underutilized

SLIDE 17

Unknown Resource Usage: Solution

Workload:

C1: issues cold requests
C2: issues cold and cached requests

Expectation:

C2 has much higher throughput (due to cached requests)

SLIDE 18

Scheduling Problem: Unknown Resource Usage

  • Resource usage patterns are unknown to schedulers until after processing begins
  • Forces schedulers to make decisions before the information is available
  • Identified in TAM by red square brackets around resource symbols
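
Because the problem shows up as a marking on resource symbols, spotting it in a model is a simple scan. The sketch below assumes a hypothetical in-memory representation in which each resource use carries a `known_at_schedule_time` flag standing in for the red square brackets; the stage list and its resource assignments are illustrative, not taken from the real Cassandra TAM.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceUse:
    resource: str                         # e.g. "CPU", "IO", "NET", "LOCK"
    known_at_schedule_time: bool = True   # False models the red square brackets


@dataclass(frozen=True)
class Stage:
    name: str
    uses: tuple


def unknown_resource_stages(stages):
    """Map each stage name to the resources whose usage is unknown when scheduling."""
    report = {}
    for stage in stages:
        unknown = [u.resource for u in stage.uses if not u.known_at_schedule_time]
        if unknown:
            report[stage.name] = unknown
    return report


# Hypothetical example: a read may be served from cache (CPU only) or from
# disk (I/O), and the scheduler cannot tell which before processing starts.
stages = [
    Stage("Msg In", (ResourceUse("NET"),)),
    Stage("Read", (ResourceUse("CPU", False), ResourceUse("IO", False))),
    Stage("Respond", (ResourceUse("CPU"),)),
]

report = unknown_resource_stages(stages)
print(report)
```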

SLIDE 19

Scheduling Problem: Blocking

[Figure: TAM of MongoDB with a Primary Node and a Secondary Node. Stage labels: Worker, NetInterface, Oplog Writer, Fetcher, Batcher, Writer, Feedback.]

SLIDE 20

Scheduling Problem: Blocking

MongoDB

Workload:

C1: reads from primary (does not go to secondary)
C2: writes to primary (replicated to secondary node)
Time 10: the secondary node slows down

Expectation:

C1 read throughput remains stable

[Figure: plot over Time (s)]

SLIDE 21

Blocking: Solution

Workload:

C1: reads
C2: writes (replicated to secondary node)
Time 10: the secondary node slows down

Expectation:

C1 read throughput remains stable

MongoDB

SLIDE 22

Scheduling Problem: Blocking

  • Stages with a fixed number of threads block on other stages
  • Unable to schedule requests that could have been completed because all threads block
  • Identified in TAM by dashed arrows pointing to stages with queues
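
The effect can be reproduced with a toy discrete-time model. This is a sketch under stated assumptions, not a model of MongoDB: one read and one write arrive per tick, a read occupies a thread for one tick, and a write holds its thread for one tick plus however long the downstream stage keeps it blocked. The function name and parameters are hypothetical.

```python
from collections import deque


def reads_completed(n_threads, write_block_ticks, ticks=100):
    """Count reads served by a fixed-size stage whose writes block downstream."""
    busy = [0] * n_threads    # remaining ticks each thread stays occupied
    queue = deque()           # single FIFO request queue in front of the stage
    done = 0
    for _ in range(ticks):
        # One read and one write arrive every tick.
        queue.append("read")
        queue.append("write")
        for i in range(n_threads):
            if busy[i] > 0:
                busy[i] -= 1  # thread is still working or blocked
            if busy[i] == 0 and queue:
                kind = queue.popleft()
                if kind == "read":
                    busy[i] = 1
                    done += 1
                else:
                    # The thread is held while it waits on the other stage.
                    busy[i] = 1 + write_block_ticks
    return done


healthy = reads_completed(n_threads=4, write_block_ticks=0)
degraded = reads_completed(n_threads=4, write_block_ticks=50)
print(healthy, degraded)
```

In the degraded run the reads themselves are still cheap, but every thread ends up parked on a blocked write, so no ordering of the queue could restore read throughput without freeing a thread.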

SLIDE 23

Outline

  • Overview
  • Thread Architecture Model
  • Scheduling Problems
  • Achieve Schedulability: A Case Study
  • Conclusion

SLIDE 24

Fixing Problems Leads to Schedulability

  • TAM-based simulation framework: explore thread architectures
  • Simulates how systems perform under workloads
  • Easily study architecture designs and scheduling policies
  • Implementation: realize schedulable systems
  • Also validates that simulation matches the real world

SLIDE 25

Simulation: HBase to Tamed-HBase

[Figure: the HBase TAM (RegionServer/DataNode) and the simulated Tamed-HBase TAM. Stage labels: RPC Read, RPC Handle, RPC Respond, LOG Append, LOG Sync, Mem Flush, Data Stream, Data Xceive, Packet Ack, Ack Process. The Tamed-HBase diagram carries additional labels: CPU, IO, Network.]

SLIDE 26

Implementation: Tamed-HBase to Muzzled-HBase

  • Some approximations to make implementation easier
  • Supports multiple scheduling policies
  • Proper scheduling under various workloads

SLIDE 27

Muzzled-HBase: Weighted Fairness

Workload:

Five clients, each with a different weight, run YCSB (read-mostly)

Expectation:

Each client receives throughput proportional to its weight
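
To make "throughput proportional to weight" concrete, here is a minimal sketch of stride scheduling, a standard proportional-share technique. The slides do not say which algorithm Muzzled-HBase uses, so this is a swapped-in illustration; the client names, weights, and the constant `BIG` are assumptions, and every client is assumed to always have a request waiting.

```python
import heapq

# Chosen as a multiple of every weight below so that strides are exact integers.
BIG = 60


def stride_schedule(weights, n_requests):
    """Serve n_requests across always-backlogged clients; return served counts."""
    # Heap entries are (pass value, client name); the lowest pass runs next.
    heap = [(0, name) for name in sorted(weights)]
    heapq.heapify(heap)
    served = {name: 0 for name in weights}
    for _ in range(n_requests):
        pass_value, name = heapq.heappop(heap)
        served[name] += 1
        # A larger weight means a smaller stride, so the client returns sooner.
        heapq.heappush(heap, (pass_value + BIG // weights[name], name))
    return served


weights = {"c1": 1, "c2": 2, "c3": 3, "c4": 4, "c5": 5}
served = stride_schedule(weights, 1500)
print(served)
```

A policy like this only delivers proportional shares if the scheduler actually controls the contended resource, which is why the scheduling problems above have to be removed first.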

SLIDE 29

Muzzled-HBase: Tail Latency Guarantee

Workload:

Foreground client: runs YCSB (update-heavy)
Background client: random Gets or Puts

Expectation:

Foreground latency remains stable

SLIDE 30

Muzzled-HBase: Tail Latency Guarantee

Workload:

Foreground client: runs YCSB
Background client: random Gets or Puts

Expectation:

Foreground latency remains stable

SLIDE 32

Outline

  • Overview
  • Thread Architecture Model
  • Scheduling Problems
  • Achieve Schedulability: A Case Study
  • Conclusion

SLIDE 33

Conclusion

  • We introduce thread architecture models
  • Reduce complex distributed scheduling to an understandable representation
  • Enable schedulability analysis
  • We discover five scheduling problems
  • Point to problematic architectures that exist in real systems
  • Fixing them enables effective scheduling
  • Complex systems need to be built with the help of TAM
  • Analyze existing systems and enable schedulability
  • Design systems that are problem-free and natively schedulable

SLIDE 34

Thank you! Questions?

(poster number: 28)

OceanBase: We are Hiring

  • Geo-scale relational database behind Alipay
  • 42,000,000 SQLs per second
  • US and China based
  • Contact OceanBase-Public@list.alibaba-inc.com

OceanBase WeChat official account
