SLIDE 1

Background

  • Distributed Key/Value stores provide a simple put/get interface
  • Great properties: scalability, availability, reliability
  • Increasingly popular within data centers
  • Examples: Dynamo, Cassandra, Voldemort

SLIDE 2

Dynamo: Amazon's Highly Available Key-value Store

Giuseppe DeCandia et al.

Presented by: Tony Huang

SLIDE 3

Motivation

  • Highly scalable and reliable.
  • Tight control over the trade-offs between availability, consistency, cost-effectiveness, and performance.
  • Flexible enough to let designers make those trade-offs.
  • Simple primary-key access to the data store.
  • Target workloads: best-seller lists, shopping carts, customer preferences, session management, sales rank, etc.

SLIDE 4

Assumptions and Design Considerations

  • Query model: simple read and write operations to a data item that is uniquely identified by a key; objects are small (~1 MB).
  • ACID (Atomicity, Consistency, Isolation, Durability): Dynamo trades consistency for availability and does not provide any isolation guarantees.
  • Efficiency: stringent SLA requirements.
  • Assumes a non-hostile environment: no authentication or authorization.
  • Conflict resolution is executed during reads instead of writes, keeping the store always writable; it is performed either by the data store or by the application.

SLIDE 5

Amazon's Platform Architecture

SLIDE 6

Techniques

  • Problem: partitioning. Technique: consistent hashing. Advantage: incremental scalability.
  • Problem: high availability for writes. Technique: vector clocks with reconciliation during reads. Advantage: version size is decoupled from update rates.
  • Problem: handling temporary failures. Technique: sloppy quorum and hinted handoff. Advantage: provides high availability and durability guarantees when some of the replicas are not available.
  • Problem: recovering from permanent failures. Technique: anti-entropy using Merkle trees. Advantage: synchronizes divergent replicas in the background.
  • Problem: membership and failure detection. Technique: gossip-based membership protocol and failure detection. Advantage: preserves symmetry and avoids a centralized registry for storing membership and node liveness information.

SLIDE 7

Partitioning

  • Consistent hashing: the output range of a hash function is treated as a fixed circular space or "ring".
  • "Virtual nodes": each physical node can be responsible for more than one virtual node on the ring.
  • When a node fails, its load is evenly dispersed across the remaining nodes.
  • When a node joins, its virtual nodes accept a roughly equivalent amount of load from the rest.
  • The number of virtual nodes per host can be tuned to its capacity, accommodating heterogeneity.
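To make the ring concrete, here is a minimal consistent-hashing sketch in Lua (matching the deck's snippet language). The toy hash function and all helper names are illustrative assumptions, not Dynamo's implementation, which places nodes on a 128-bit MD5 ring:

    -- Minimal consistent-hashing ring sketch (illustrative only).
    -- Each physical node owns several virtual-node tokens; a key lives
    -- on the first virtual node found clockwise from the key's hash.
    local ring = {}   -- sorted list of { token = number, node = string }

    -- Toy string hash; a real system would use MD5 or similar.
    local function hash(s)
      local h = 0
      for i = 1, #s do h = (h * 31 + s:byte(i)) % 2^32 end
      return h
    end

    local function add_node(name, num_vnodes)
      for v = 1, num_vnodes do
        table.insert(ring, { token = hash(name .. ":" .. v), node = name })
      end
      table.sort(ring, function(a, b) return a.token < b.token end)
    end

    -- Walk clockwise from the key's position to find its coordinator.
    local function coordinator(key)
      local h = hash(key)
      for _, entry in ipairs(ring) do
        if entry.token >= h then return entry.node end
      end
      return ring[1].node   -- wrapped past the largest token
    end

    add_node("A", 3); add_node("B", 3); add_node("C", 3)
    print(coordinator("shopping-cart:42"))

Adding or removing a node only remaps the keys adjacent to its tokens, which is where the incremental scalability comes from.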

SLIDE 8

Load Distribution

  • Strategy 1: T random tokens per node, partition by token value.
  • Ranges vary in size and change frequently.
  • Long bootstrapping.
  • Difficult to take a snapshot.

SLIDE 9

Load Distribution

  • Strategy 2: T random tokens per node, equal-sized partitions. This turned out to be the worst. Why?
  • Strategy 3: Q/S tokens per node, equal-sized partitions (Q partitions across S nodes).
  • Strategy 3 is the best load-balancing configuration. Drawback: changing node membership requires coordination.

SLIDE 10

Replication

  • Each data item is replicated at N hosts.
  • "Preference list": the list of nodes responsible for storing a particular key.
  • Improvement: the preference list contains only distinct physical nodes, skipping virtual nodes that map to a host already on the list.
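A hedged sketch of how such a preference list could be assembled, reusing the ring and hash helpers from the partitioning sketch above (an illustration, not Dynamo's code):

    -- Walk the ring clockwise from the key's hash and collect n
    -- distinct physical hosts, skipping virtual nodes whose host is
    -- already on the list.
    local function preference_list(key, n)
      local h, start = hash(key), 1
      for i, entry in ipairs(ring) do
        if entry.token >= h then start = i; break end
      end
      local list, seen = {}, {}
      for i = 0, #ring - 1 do
        local entry = ring[(start + i - 1) % #ring + 1]
        if not seen[entry.node] then
          seen[entry.node] = true
          list[#list + 1] = entry.node
          if #list == n then break end
        end
      end
      return list
    end

    -- preference_list("shopping-cart:42", 3) --> three distinct hosts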

SLIDE 11

Data Versioning

  • A vector clock is a list of (node, counter) pairs.
  • Every version of every object is associated with one vector clock.
  • Clients perform reconciliation when the system cannot.
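A minimal vector-clock sketch in Lua (names and structure are assumptions for illustration; a clock is a table mapping node name to counter):

    -- A coordinator bumps its own counter when it handles a write.
    local function tick(clock, node)
      clock[node] = (clock[node] or 0) + 1
    end

    -- a descends from b if every counter in b is <= its match in a.
    local function descends(a, b)
      for node, count in pairs(b) do
        if (a[node] or 0) < count then return false end
      end
      return true
    end

    -- Two versions conflict when neither descends from the other;
    -- Dynamo then returns both and lets the client reconcile.
    local function concurrent(a, b)
      return not descends(a, b) and not descends(b, a)
    end

    local v1 = { Sx = 2, Sy = 1 }
    local v2 = { Sx = 2, Sz = 1 }
    print(concurrent(v1, v2))   --> true: the client must reconcile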

SLIDE 12

Quorum for Consistency

  • R: minimum number of nodes that must participate in a successful read.
  • W: minimum number of nodes that must participate in a successful write.
  • N: number of hosts at which each data item is replicated.
  • Different combinations of R and W yield systems for different purposes; choosing R + W > N gives a quorum-like system.

SLIDE 13

Quorum for Consistency

Consistency insurance under different write/read quorum settings:

  • Always writable, but high risk of inconsistency. Write: 1, Read: ?
  • Read engine. Write: 3, Read: 1
  • Normal configuration. Write: 2, Read: 2
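The table reduces to a one-line rule: reads and writes are guaranteed to overlap on at least one replica exactly when R + W > N. A sketch using the definitions from the previous slide:

    -- R + W > N forces every read set to intersect the latest write set.
    local function quorum_consistent(n, r, w)
      return r + w > n
    end

    print(quorum_consistent(3, 2, 2))  --> true: the "normal" setting
    print(quorum_consistent(3, 1, 1))  --> false: fast, but reads can be stale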

SLIDE 14

Hinted Handoff

  • Assume N = 3. When A is temporarily down or unreachable during a write, the replica is sent to D instead.
  • D is hinted that the replica belongs to A, and it delivers the replica back to A once A recovers.
  • What if A never recovers?
  • What if D fails before A recovers?
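A hedged sketch of the handoff mechanics; the failure detector and messaging layer are stubbed out with hypothetical helpers (is_alive, send_put), since the paper does not give this code:

    -- Hypothetical stand-ins for the failure detector and RPC layer.
    local alive = { A = false, D = true }
    local function is_alive(node) return alive[node] end
    local function send_put(node, key, value)
      print(("put %s on %s"):format(key, node))
    end

    -- Hints kept on the fallback node D, remembering the intended owner.
    local hints = {}

    local function write_replica(key, value, intended, fallback)
      if is_alive(intended) then
        send_put(intended, key, value)
      else
        send_put(fallback, key, value)   -- D stores the replica...
        hints[#hints + 1] = { owner = intended, key = key, value = value }
      end
    end

    -- ...and periodically tries to hand it back to its owner.
    local function handoff()
      for i = #hints, 1, -1 do
        local h = hints[i]
        if is_alive(h.owner) then
          send_put(h.owner, h.key, h.value)
          table.remove(hints, i)
        end
      end
    end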

SLIDE 15

Replica Synchronization

  • Merkle tree: a hash tree.
  • Leaves are hashes of the values of individual keys.
  • Parent nodes are hashes of their children.
  • Reduces the amount of data transferred while checking replicas for consistency: subtrees with matching hashes need no further comparison.
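A compact Merkle-root sketch (illustrative; it reuses the toy hash from the partitioning sketch, where a real implementation would use a cryptographic hash such as SHA-1):

    -- Build a root hash over a sorted list of "key=value" strings.
    local function merkle_root(leaves)
      local level = {}
      for _, leaf in ipairs(leaves) do
        level[#level + 1] = tostring(hash(leaf))
      end
      while #level > 1 do
        local parents = {}
        for i = 1, #level, 2 do
          local left, right = level[i], level[i + 1] or level[i]
          parents[#parents + 1] = tostring(hash(left .. "|" .. right))
        end
        level = parents
      end
      return level[1]
    end

    -- Replicas compare roots first and recurse only into differing
    -- subtrees, so key ranges already in sync cost one comparison.
    print(merkle_root({ "k1=v1", "k2=v2" }) ==
          merkle_root({ "k1=v1", "k2=v2" }))   --> true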

SLIDE 16

Membership and Failure Detection

  • Membership changes are signaled manually by an administrator.
  • A gossip-based protocol propagates membership changes.
  • Some Dynamo nodes act as seed nodes for external discovery.
  • A potential single point of failure?
  • Each node detects failures of its neighbors locally.
  • A gossip-style protocol propagates failure information.
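A toy gossip-merge step (the view structure is an assumption, not Dynamo's wire format): each node keeps a versioned view of its peers and adopts whichever entries are newer.

    -- view: node name -> { status = "up" or "down", version = number }
    local function merge_views(mine, theirs)
      for node, info in pairs(theirs) do
        local known = mine[node]
        if known == nil or info.version > known.version then
          mine[node] = { status = info.status, version = info.version }
        end
      end
    end

    -- Each round a node exchanges views with one random peer and both
    -- sides merge; an update reaches all n nodes in O(log n) rounds.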

SLIDE 17

Discussion

  • What applications are suitable for Dynamo (shopping carts, what else)?
  • What applications are NOT suitable for Dynamo?
  • How can you adapt Dynamo to store large data?
  • How can you make Dynamo secure?

SLIDE 18

Comet: An Active Distributed Key-Value Store

Roxana Geambasu, Amit Levy, Yoshi Kohno, Arvind Krishnamurthy, and Hank Levy

OSDI '10. Presented by Shen Li

SLIDE 19

Outline

  • Background
  • Motivation
  • Design
  • Application

SLIDE 20

Background

  • Distributed Key/Value stores provide a simple put/get interface
  • Great properties: scalability, availability, reliability
  • Widely used in P2P systems and becoming increasingly popular in data centers
  • Examples: Dynamo, Cassandra, Voldemort

SLIDE 21

Background

  • Many applications may share the same key/value storage system.

(Example: Amazon S3.)

SLIDE 22

Outline

  • Background
  • Motivation
  • Design
  • Application
SLIDE 23

Motivation

  • Increasingly, key/value stores are shared by many apps

– Avoids per-app storage system deployment

  • Applications have different (even conflicting) needs:

– Availability, security, performance, functionality

  • But today’s key/value stores are one-size-fits-all

SLIDE 24

Motivating Example

  • Vanish is a self-destructing data system built above Vuze
  • Vuze problems for Vanish:

– Fixed 8-hour data timeout
– Overly aggressive replication, which hurts security

  • Changes were simple, but deploying them was difficult:

– Need a Vuze engineer
– Long deployment cycle
– Hard to evaluate before deployment

(Diagram: the Vuze app, Vanish, and future apps all layered on the Vuze DHT.)

Vanish: Enhancing the Privacy of the Web with Self-Destructing Data. USENIX Security '09.

SLIDE 25

Solution

  • Build extensible key/value stores
  • Allow apps to customize the store's functions

– Different data lifetimes
– Different numbers of replicas
– Different replication intervals

  • Allow apps to define new functions

– Tracking popularity: a data item counts the number of reads
– Access logging: a data item logs readers' IPs
– Adapting to context: a data item returns different values to different requestors

SLIDE 26

Solution

  • It should also be simple!

– Allow apps to inject tiny code fragments (10s of lines of code)
– Adding even a tiny amount of programmability into key/value stores can be extremely powerful

SLIDE 27

Outline

  • Background
  • Motivation
  • Design
  • Application

SLIDE 28

Design

  • A DHT that supports application-specific customizations
  • Applications store active objects instead of passive values

– Active objects contain small code snippets that control their behavior in the DHT

(Diagram: App 1, App 2, and App 3 store active objects on Comet nodes.)

SLIDE 29

Active Storage Objects

  • The ASO consists of data and code

– The data is the value
– The code is a set of handlers and user-defined functions

(Diagram: an ASO pairing its data with code such as:)

    function onGet()
      [...]
    end

SLIDE 30

ASO Example

  • Each replica keeps track of the number of gets on an object.

    aso.value = "Hello world!"
    aso.getCount = 0

    function onGet()
      self.getCount = self.getCount + 1
      return {self.value, self.getCount}
    end

SLIDE 31

SLIDE 32

ASO Extension API

  • Intercept accesses: onPut(caller), onGet(caller), onUpdate(caller)
  • Periodic tasks: onTimer()
  • Host interaction: getSystemTime(), getNodeIP(), getNodeID(), getASOKey(), deleteSelf()
  • DHT interaction: get(key, nodes), put(key, data, nodes), lookup(key)
  • Both local and remote resources are restricted

SLIDE 33

Local Restriction

  • Runtime library

– Only the math, string-manipulation, and table-manipulation packages.

  • CPU

– 100K bytecode instructions per handler invocation

  • Memory

– 100 KB per ASO
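A sandbox in Lua (the language of the deck's ASO snippets) can enforce this style of budget with a debug hook; this is a sketch of the mechanism, not Comet's actual sandbox code:

    -- Abort a handler once it exceeds a bytecode-instruction budget.
    local BUDGET = 100000   -- 100K instructions, per the slide
    local spent = 0

    local function guard()
      spent = spent + 1000
      if spent > BUDGET then
        error("ASO exceeded its instruction budget")
      end
    end

    local function run_handler(handler, ...)
      spent = 0
      debug.sethook(guard, "", 1000)   -- fire guard every 1000 instructions
      local ok, err = pcall(handler, ...)
      debug.sethook()                  -- clear the hook afterwards
      return ok, err
    end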

SLIDE 34

Remote Restriction

  • An ASO can only interact with specific nodes

– Neighbors responsible for its replication
– A remote node, once per previous interaction

  • An ASO can only communicate with specific ASOs

– ASOs stored under the same key

  • The message generation rate is limited

SLIDE 35

Outline

  • Background
  • Motivation
  • Design
  • Application

SLIDE 36

Application

  • Three examples:

– Application-specific DHT customization
– Proximity-based distributed tracker
– Self-monitoring DHT

SLIDE 37

Application-Specific DHT

  • Example: customize the replication scheme

    function aso:selectReplicas(neighbors)
      [...]
    end

    function aso:onTimer()
      neighbors = comet.lookup()
      replicas = self.selectReplicas(neighbors)
      comet.put(self, replicas)
    end
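The slide elides the selection policy itself; one possible body for selectReplicas, shown purely as a hypothetical, keeps the first k neighbors returned by the lookup:

    -- Hypothetical policy: take the first k neighbors as replicas.
    -- A real app could instead weigh latency, capacity, or security.
    function aso:selectReplicas(neighbors)
      local k, chosen = 5, {}
      for i = 1, math.min(k, #neighbors) do
        chosen[i] = neighbors[i]
      end
      return chosen
    end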

SLIDE 38

Distributed Tracker

  • Traditional distributed trackers return a randomized subset of the nodes
  • Comet: a proximity-based distributed tracker

– Peers put their IPs and Vivaldi coordinates at torrentID
– On get, the ASO computes and returns the set of peers closest to the requestor
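A hedged sketch of what that tracker ASO could look like; the field names (ip, coords) and the cutoff of 10 peers are assumptions, not the paper's code:

    -- Peers announce themselves with put; onGet returns the nearest.
    aso.peers = {}

    function aso:onPut(caller)
      -- assume the caller carries an IP and 2-D Vivaldi coordinates
      self.peers[#self.peers + 1] = { ip = caller.ip, coords = caller.coords }
    end

    -- Squared Euclidean distance in Vivaldi coordinate space.
    local function dist2(a, b)
      local dx, dy = a[1] - b[1], a[2] - b[2]
      return dx * dx + dy * dy
    end

    function aso:onGet(caller)
      table.sort(self.peers, function(p, q)
        return dist2(p.coords, caller.coords) < dist2(q.coords, caller.coords)
      end)
      local nearest = {}
      for i = 1, math.min(10, #self.peers) do nearest[i] = self.peers[i] end
      return nearest   -- up to the 10 peers closest to the requestor
    end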

SLIDE 39

Distributed Tracker

(Figure: comparison of the Comet proximity tracker and a random tracker.)

SLIDE 40

Self-Monitoring DHT

  • Example: monitor a remote node's neighbors

– Put a monitoring ASO that "pings" its neighbors periodically

  • Useful for internal measurements of DHTs

– Provides additional visibility over external measurement (e.g., across NATs/firewalls)

    aso.neighbors = {}

    function aso:onTimer()
      neighbors = comet.lookup()
      self.neighbors[comet.systemTime()] = neighbors
    end

SLIDE 41

Self-Monitoring DHT

(Figure: measured Vuze node lifetimes, in hours.)

SLIDE 42

Discussion

  • 1. Is Comet safe enough? Can you come up with an idea to bring it down?
  • 2. Do you agree that Comet trades too much performance for security? Why?
  • 3. If you were the service provider of a DHT, would you embed Comet into your network? Why?
  • 4. Can you come up with practical applications that could benefit from Comet?