SLIDE 1

Big Data and Internet Thinking

Chentao Wu, Associate Professor

  • Dept. of Computer Science and Engineering

wuct@cs.sjtu.edu.cn

SLIDE 2

Download lectures

  • ftp://public.sjtu.edu.cn
  • User: wuct
  • Password: wuct123456
  • http://www.cs.sjtu.edu.cn/~wuct/bdit/
SLIDE 3

Schedule

  • lec1: Introduction to big data, cloud computing & IoT
  • lec2: Parallel processing framework (e.g., MapReduce)
  • lec3: Advanced parallel processing techniques (e.g., YARN, Spark)
  • lec4: Cloud & Fog/Edge Computing
  • lec5: Data reliability & data consistency
  • lec6: Distributed file system & object-based storage
  • lec7: Metadata management & NoSQL Database
  • lec8: Big Data Analytics
SLIDE 4

Collaborators

SLIDE 5

Contents

  • 1. Intro. to Data Reliability & Replication

SLIDE 6

Data Reliability Problem (1): Google -- Disk Annual Failure Rate

SLIDE 7

Data Reliability Problem (2): Facebook -- Failed Nodes in a 3000-Node Cluster

SLIDE 8

What is Replication?

  • Replication is the process of creating an exact copy (replica) of data.
  • Replication can be classified as:
  • Local replication: replicating data within the same array or data center
  • Remote replication: replicating data at a remote site

[Figure: Source -> Replica (Target)]

SLIDE 9

File System Consistency: Flushing the Host Buffer

[Figure: I/O stack -- application memory buffers, file system, logical volume manager, physical disk driver; the host buffer is flushed so the replica captures the same data as the source]
SLIDE 10

Database Consistency: Dependent Write I/O Principle

[Figure: dependent writes 1-4 applied in order on both source and replica yield a consistent replica; a replica that captures later writes without the earlier writes they depend on is inconsistent]

SLIDE 11

Host-based Replication: LVM-based Mirroring

  • LVM: Logical Volume Manager

[Figure: a host logical volume mirrored onto Physical Volume 1 and Physical Volume 2]
SLIDE 12

Host-based Replication: File System Snapshot

  • Pointer-based replication
  • Uses the Copy on First Write (CoFW) principle
  • Uses a bitmap and a block map
  • Requires only a fraction of the space used by the production FS

[Figure: production FS blocks and FS snapshot metadata -- the bitmap records which blocks have been copied, and the block map points snapshot reads of unchanged blocks back to the production FS]
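To make the CoFW mechanism concrete, here is a minimal sketch of a pointer-based snapshot; this is my own illustration (class and field names are invented), not an array vendor's implementation:

    # Minimal Copy-on-First-Write (CoFW) snapshot sketch.
    # The snapshot starts empty; the first write to a production block since
    # activation copies the original data aside and sets a bit in the bitmap.

    class CoFWSnapshot:
        def __init__(self, production):
            self.production = production          # list of data blocks
            self.bitmap = [0] * len(production)   # 1 = original saved in snapshot
            self.blockmap = {}                    # block number -> original data

        def write(self, blk, data):
            if not self.bitmap[blk]:              # first write since activation?
                self.blockmap[blk] = self.production[blk]   # copy on first write
                self.bitmap[blk] = 1
            self.production[blk] = data           # then update the production FS

        def read_snapshot(self, blk):
            # Changed blocks come from snapshot space; unchanged blocks are read
            # from the production FS via pointers, so little extra space is used.
            return self.blockmap[blk] if self.bitmap[blk] else self.production[blk]

    fs = CoFWSnapshot(["a", "b", "c", "d"])
    fs.write(2, "C")                   # triggers the copy of the original "c"
    assert fs.read_snapshot(2) == "c"  # snapshot still shows point-in-time data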

SLIDE 13

Storage Array-based Local Replication

  • Replication performed by the array operating environment
  • Source and replica are on the same array
  • Types of array-based replication:
  • Full-volume mirroring
  • Pointer-based full-volume replication
  • Pointer-based virtual replication

[Figure: production host and BC host attached to a single storage array holding both source and replica]

SLIDE 14

Full-Volume Mirroring

  • Attached: the source is read/write; the target is Not Ready to the BC host
  • Detached (point-in-time): both source and target are read/write

[Figure: production host and BC host on the same storage array, shown in the attached and the detached states]

SLIDE 15

Copy on First Access: Write to the Source

  • When a write is issued to the source for the first time after replication session activation:
  • The original data at that address is copied to the target
  • Then the new data is updated on the source
  • This ensures that the original data at the point-in-time of activation is preserved on the target

[Figure: production host writes C' over C on the source; C is first copied to the target]

SLIDE 16

Copy on First Access: Write to the Target

  • When a write is issued to the target for the first time after replication session activation:
  • The original data is copied from the source to the target
  • Then the new data is updated on the target

[Figure: BC host writes B' over B on the target; B is first copied from the source]

SLIDE 17

Copy on First Access: Read from the Target

  • When a read is issued to the target for the first time after replication session activation:
  • The original data is copied from the source to the target and is made available to the BC host

[Figure: BC host issues a read request for data "A"; A is fetched from the source into the target]
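Slides 15-17 describe three first-access cases of the same rule. Here is a hedged sketch of pointer-based replication with copy on first access (names invented for illustration; a real array works on extents, not a Python dict):

    # Copy on First Access (CoFA): target blocks start as pointers to the
    # source; the original data is physically copied on the first write to
    # the source, first write to the target, or first read from the target.

    class CoFASession:
        def __init__(self, source):
            self.source = source
            self.target = {}                 # blocks physically copied so far

        def _copy_if_needed(self, blk):
            if blk not in self.target:       # first access since activation
                self.target[blk] = self.source[blk]

        def write_source(self, blk, data):
            self._copy_if_needed(blk)        # preserve PIT data on the target
            self.source[blk] = data

        def write_target(self, blk, data):
            self._copy_if_needed(blk)        # copy the original, then overwrite
            self.target[blk] = data

        def read_target(self, blk):
            self._copy_if_needed(blk)        # fetch from source on first read
            return self.target[blk]

    s = CoFASession({1: "A", 2: "B", 3: "C"})
    s.write_source(3, "C'")                  # original C is copied to the target
    assert s.read_target(3) == "C"           # BC host sees the point-in-time copy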

SLIDE 18

Tracking Changes to Source and Target

  • A bitmap per volume marks each block as unchanged (0) or changed (1) after the point-in-time
  • The logical OR of the source and target bitmaps gives the set of blocks to copy for resynchronization/restore

[Figure: source and target bitmaps at the PIT and after the PIT, combined with a logical OR]
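A one-line sketch of the resynchronization rule (assumed semantics: bit 1 means the block changed after the PIT):

    # Blocks to copy during resync = logical OR of the two change bitmaps:
    # a block must be refreshed if it changed on either source or target.
    source_bitmap = [0, 1, 0, 0, 1, 0]    # changed on the source after the PIT
    target_bitmap = [0, 0, 1, 0, 1, 0]    # changed on the target after the PIT

    to_copy = [s | t for s, t in zip(source_bitmap, target_bitmap)]
    print(to_copy)   # [0, 1, 1, 0, 1, 0] -> copy blocks 1, 2 and 4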

SLIDE 19

Contents

  • 2. Introduction to Erasure Codes

SLIDE 20

Erasure Coding Basis (1)

  • You've got some data
  • And a collection of storage nodes
  • And you want to store the data on the storage nodes so that you can get the data back, even when the nodes fail

SLIDE 21

Erasure Coding Basis (2)

  • More concretely: you have k disks worth of data
  • And n total disks
  • The erasure code tells you how to create n disks worth of data+coding so that when disks fail, you can still get the data

SLIDE 22

Erasure Coding Basis (3)

  • You have k disks worth of data
  • And n total disks
  • n = k + m
  • A systematic erasure code stores the data in the clear on k of the n disks; there are k data disks and m coding or "parity" disks -> Horizontal Code
SLIDE 23

Erasure Coding Basis (4)

  • You have k disks worth of data
  • And n total disks
  • n = k + m
  • A non-systematic erasure code stores only coding information, but we still use k, m, and n to describe the code -> Vertical Code

SLIDE 24

Erasure Coding Basis (5)

  • You have k disks worth of data
  • And n total disks
  • n = k + m
  • When disks fail, their contents become unusable, and the storage system detects this. This failure mode is called an erasure.

SLIDE 25

Erasure Coding Basis (6)

  • You have k disks worth of data
  • And n total disks
  • n = k + m
  • An MDS ("Maximum Distance Separable") code can reconstruct the data from any m failures -> optimal
  • A code that can reconstruct only from any f failures with f < m is a non-MDS code
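As a tiny concrete example (my own illustration, not from the slides): simple XOR parity is a systematic MDS code with k = 2 and m = 1; it survives any single erasure:

    # k = 2 data disks, m = 1 parity disk, n = 3: parity = d0 XOR d1.
    # Any one erased disk can be rebuilt by XORing the two survivors.
    d0, d1 = 0b10110100, 0b01101001
    p = d0 ^ d1

    # Suppose disk 0 is erased: recover its contents from d1 and the parity.
    recovered_d0 = d1 ^ p
    assert recovered_d0 == d0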
SLIDE 26

Two Views of a Stripe (1)

  • The Theoretical View:
  – The minimum collection of bits that encode and decode together
  – r rows of w-bit symbols from each of n disks

SLIDE 27

Two Views of a Stripe (2)

  • The Systems View:
  – The minimum partition of the system that encodes and decodes together
  – Groups together theoretical stripes for performance

SLIDE 28

Horizontal & Vertical Codes

  • Horizontal Code
  • Vertical Code

SLIDE 29

Expressing Code with Generator Matrix (1)

SLIDE 30

Expressing Code with Generator Matrix (2)

SLIDE 31

Expressing Code with Generator Matrix (3)
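Slides 29-31 are diagrams; the idea they illustrate is that encoding is a matrix-vector product, codeword = G · data, over a Galois field. A hedged sketch with an assumed toy generator matrix (systematic, k = 2 data bits plus one XOR parity row, arithmetic over GF(2)):

    # Encoding with a generator matrix over GF(2): codeword = G * data.
    # A systematic G has an identity block (data in the clear) above the
    # parity rows.
    import numpy as np

    G = np.array([[1, 0],    # data bit d0
                  [0, 1],    # data bit d1
                  [1, 1]])   # parity row: d0 XOR d1

    data = np.array([1, 0])
    codeword = G.dot(data) % 2   # over GF(2), mod-2 sums are XORs
    print(codeword)              # [1 0 1]: two data symbols plus one parity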

SLIDE 32

Encoding -- Linux RAID-6 (1)

SLIDE 33

Encoding -- Linux RAID-6 (2)

SLIDE 34

Encoding -- Linux RAID-6 (3)

SLIDE 35

Accelerate Encoding -- Linux RAID-6
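The RAID-6 slides compute two parities per stripe: P is the plain XOR of the data blocks, and Q is a weighted sum in GF(2^8) with generator 2. A minimal one-byte-per-disk sketch (field polynomial 0x11D, which to my knowledge is the one the Linux md driver uses; treat that as an assumption):

    # Linux RAID-6 parity sketch (one byte per disk for brevity):
    #   P = D0 ^ D1 ^ ... ^ D(k-1)
    #   Q = D0 ^ 2*D1 ^ 2^2*D2 ^ ...   (multiplications in GF(2^8))

    def gf_mul2(b):
        # Multiply a byte by 2 in GF(2^8): shift, then reduce by 0x11D.
        b <<= 1
        return (b ^ 0x11D) if b & 0x100 else b

    def raid6_pq(data):
        p = q = 0
        for d in reversed(data):   # Horner's rule: q = 2*q ^ d
            p ^= d
            q = gf_mul2(q) ^ d
        return p, q

    print(raid6_pq([0x12, 0x34, 0x56, 0x78]))   # (P, Q) for a 4-disk stripe

Multiplying a whole buffer by 2 vectorizes well (a shift, a mask, and a conditional XOR per word), which is what the "accelerate encoding" slide exploits.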

SLIDE 36

Arithmetic for Erasure Codes

  • When w = 1: XORs only
  • Otherwise, Galois Field arithmetic GF(2^w)
  – w is 2, 4, 8, 16, 32, 64 or 128 so that symbols fit evenly into computer words
  – Addition is equal to XOR (nice, because addition equals subtraction)
  – Multiplication is more complicated:
    • It gets more expensive as w grows
    • Multiplying a buffer by a constant is a different operation from a single a * b
    • Buffer * 2 can be done really fast
    • There is open-source library support
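A small sketch of general GF(2^8) multiplication (carry-less shift-and-add with reduction; real libraries use log/antilog or SIMD tables, which is what makes buffer-times-constant fast):

    # GF(2^w) arithmetic: addition is XOR; multiplication is polynomial
    # multiplication modulo the field polynomial (0x11D for w = 8 here).

    def gf_mul(a, b, poly=0x11D):
        r = 0
        while b:
            if b & 1:
                r ^= a           # "add" = XOR
            a <<= 1
            if a & 0x100:
                a ^= poly        # reduce back into 8 bits
            b >>= 1
        return r

    a, b, c = 0x57, 0x83, 0x1C
    assert gf_mul(a, 1) == a                                  # 1 is the identity
    assert gf_mul(a, b ^ c) == gf_mul(a, b) ^ gf_mul(a, c)    # distributivity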

SLIDE 37

Decoding with Generator Matrices (1)

SLIDE 38

Decoding with Generator Matrices (2)

SLIDE 39

Decoding with Generator Matrices (3)

SLIDE 40

Decoding with Generator Matrices (4)

SLIDE 41

Decoding with Generator Matrices (5)
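Slides 37-41 work through decoding graphically. The procedure underneath: drop the generator-matrix rows of the erased devices, invert the surviving square submatrix, and multiply it by the surviving symbols. A hedged GF(2) toy (my own code, reusing the k = 2 example from above):

    # Decode by inverting surviving rows of the generator matrix (over GF(2)).
    import numpy as np

    def gf2_inv(A):
        # Invert a square 0/1 matrix over GF(2) by Gauss-Jordan elimination.
        n = len(A)
        M = np.concatenate([A % 2, np.eye(n, dtype=int)], axis=1)
        for col in range(n):
            pivot = next(r for r in range(col, n) if M[r, col])
            M[[col, pivot]] = M[[pivot, col]]
            for r in range(n):
                if r != col and M[r, col]:
                    M[r] ^= M[col]               # row elimination = XOR
        return M[:, n:]

    G = np.array([[1, 0], [0, 1], [1, 1]])       # systematic code, k = 2, m = 1
    data = np.array([1, 0])
    codeword = G.dot(data) % 2                   # [d0, d1, parity]

    surviving = [1, 2]                           # device 0 was erased
    B = G[surviving]                             # generator rows of survivors
    recovered = gf2_inv(B).dot(codeword[surviving]) % 2
    assert (recovered == data).all()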

SLIDE 42

Erasure Codes -- Reed-Solomon (1)

  • Introduced in 1960
  • MDS erasure codes for any n and k
  – That means any m = (n - k) failures can be tolerated without data loss
  • r = 1 (theoretical view): one word per disk per stripe
  • w constrained so that n ≤ 2^w
  • Systematic and non-systematic forms
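A hedged sketch of the Vandermonde layout the next slides draw: row i holds the powers [1, i, i^2, ..., i^(k-1)], with arithmetic in GF(2^w) (gf_mul as in the slide-36 sketch); any k rows are invertible because the evaluation points are distinct:

    # Vandermonde generator matrix for (non-systematic) RS coding.
    def gf_mul(a, b, poly=0x11D):      # GF(2^8) multiply, as sketched earlier
        r = 0
        while b:
            if b & 1:
                r ^= a
            a <<= 1
            if a & 0x100:
                a ^= poly
            b >>= 1
        return r

    def vandermonde(n, k):
        rows = []
        for i in range(n):             # n distinct evaluation points (n <= 2^w)
            row, p = [], 1
            for _ in range(k):
                row.append(p)
                p = gf_mul(p, i)       # next power of i
            rows.append(row)
        return rows

    for row in vandermonde(n=5, k=3):
        print(row)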
SLIDE 43

Erasure Codes -- Reed-Solomon (2): Systematic RS -- Cauchy generator matrix

SLIDE 44

Erasure Codes -- Reed-Solomon (3): Non-Systematic RS -- Vandermonde generator matrix

SLIDE 45

Erasure Codes -- Reed-Solomon (4): Non-Systematic RS -- Vandermonde generator matrix

SLIDE 46

Contents

  • 3. Replication and EC in Cloud

SLIDE 47

Three Dimensions in Cloud Storage

SLIDE 48

Replication vs. Erasure Coding (RS)

SLIDE 49

Fundamental Tradeoff
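The tradeoff these slides plot can be reduced to two numbers per scheme: storage overhead n/k and repair cost in blocks read per lost block. An illustrative calculation (parameters assumed; RS(6,3) is the layout commonly attributed to GFS II):

    # Storage overhead vs. repair cost (illustrative parameters).
    schemes = {
        "3-way replication": {"n": 3, "k": 1, "repair_reads": 1},
        "RS(6,3)":           {"n": 9, "k": 6, "repair_reads": 6},
    }
    for name, s in schemes.items():
        print(f"{name}: overhead {s['n'] / s['k']:.2f}x, "
              f"repair reads {s['repair_reads']} block(s)")
    # Replication repairs cheaply but costs 3x storage; RS costs 1.5x
    # storage but must read k = 6 blocks to rebuild one.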

SLIDE 50

Pyramid Codes (1)

SLIDE 51

Pyramid Codes (2)

SLIDE 52

Pyramid Codes (3): Multiple Hierarchies

SLIDE 53

Pyramid Codes (4): Multiple Hierarchies

SLIDE 54

Pyramid Codes (5): Multiple Hierarchies

SLIDE 55

Pyramid Codes (6)

SLIDE 56

Google GFS II -- Based on RS

SLIDE 57

Microsoft Azure (1): How to Reduce Cost?

SLIDE 58

Microsoft Azure (2): Recovery Becomes Expensive

SLIDE 59

Microsoft Azure (3): Best of Both Worlds?

SLIDE 60

Microsoft Azure (4): Local Reconstruction Code (LRC)

SLIDE 61

Microsoft Azure (5): Analysis -- LRC vs. RS

SLIDE 62

Microsoft Azure (6): Analysis -- LRC vs. RS
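Azure's LRC(12, 2, 2) splits 12 data fragments into two groups of six, each with a local parity, plus two global parities; a single lost data fragment is rebuilt from its six group members rather than all twelve. A hedged sketch of the local-repair path (local parities shown as plain XOR; the real global parities are GF combinations and are omitted here):

    # LRC local repair sketch: one XOR local parity per group of 6 data blocks.
    from functools import reduce

    data = list(range(101, 113))                  # 12 data blocks, 2 groups of 6
    groups = [data[0:6], data[6:12]]
    local_parity = [reduce(lambda x, y: x ^ y, g) for g in groups]

    # Repair block 3 (group 0) by reading only its 5 group peers + local parity:
    lost = 3
    survivors = [data[i] for i in range(6) if i != lost]
    rebuilt = reduce(lambda x, y: x ^ y, survivors, local_parity[0])
    assert rebuilt == data[lost]                  # 6 reads instead of 12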

SLIDE 63

Recovery Problem in the Cloud

  • Recovery requires I/Os from 6 disks (high network bandwidth)
SLIDE 64

Regenerating Codes (1)

  • Data = {a, b, c}

SLIDE 65

Regenerating Codes (2)

  • Optimal Repair

SLIDE 66

Regenerating Codes (3)

  • Optimal Repair

SLIDE 67

Regenerating Codes (4)

  • Optimal Repair

SLIDE 68

Regenerating Codes (5): Analysis -- Regenerating vs. RS

SLIDE 69

Facebook Xorbas Hadoop: Locally Repairable Codes

SLIDE 70

Combination of Two ECs (1): Recovery Cost vs. Storage Overhead

SLIDE 71

Combination of Two ECs (2): Fast Code and Compact Code

SLIDE 72

Combination of Two ECs (3): Analysis

SLIDE 73

Combination of Two ECs (4): Analysis

SLIDE 74

Combination of Two ECs (5): Analysis

SLIDE 75

Combination of Two ECs (6): Conversion

  • Horizontal parities require no re-computation
  • Vertical parities require no data block transfer
  • All parity updates can be done in parallel and in a distributed manner

SLIDE 76

Combination of Two ECs (7): Results

SLIDE 77

Contents

  • 4. Data Consistency & CAP Theorem

SLIDE 78

Today's data share systems (1)

SLIDE 79

Today's data share systems (2)

SLIDE 80

Fundamental Properties

  • Consistency
  • (informally) "every request receives the right response"
  • E.g., if I get my shopping list on Amazon, I expect it to contain all the previously selected items
  • Availability
  • (informally) "each request eventually receives a response"
  • E.g., eventually I can access my shopping list
  • tolerance to network Partitions
  • (informally) "servers can be partitioned into multiple groups that cannot communicate with one another"

SLIDE 81

The CAP Theorem

  • The CAP Theorem (Eric Brewer):
  • One can achieve at most two of the following:
  • Data Consistency
  • System Availability
  • Tolerance to network Partitions
  • First stated as a conjecture at PODC 2000 by Eric Brewer
  • The conjecture was formalized and confirmed by MIT researchers Seth Gilbert and Nancy Lynch in 2002

SLIDE 82

Proof

SLIDE 83

Consistency (Simplified)

[Figure: Replica A and Replica B over a WAN; an update at one replica must be visible to a retrieve at the other]

SLIDE 84

Tolerance to Network Partitions / Availability

[Figure: Replica A and Replica B each accept updates while the WAN between them is partitioned]

SLIDE 85

CAP

SLIDE 86

Forfeit Partitions

SLIDE 87

Observations

  • CAP states that in case of failures you can have at most two of these three properties for any shared-data system
  • To scale out, you have to distribute resources
  • P is not really an option but rather a necessity
  • The real choice is between consistency and availability
  • In almost all cases, you would choose availability over consistency

SLIDE 88

Forfeit Availability

SLIDE 89

Forfeit Consistency

SLIDE 90

Consistency Boundary Summary

  • We can have consistency & availability within a cluster.
  • No partitions within boundary!
  • OS/Networking better at A than C
  • Databases better at C than A
  • Wide-area databases can’t have both
  • Disconnected clients can’t have both

SLIDE 91

CAP in Database System

SLIDE 92

Another CAP -- BASE

  • BASE stands for Basically Available, Soft State, Eventually Consistent
  • Basically Available: the system is available most of the time, though some subsystems may be temporarily unavailable
  • Soft State: data are "volatile" in the sense that their persistence is in the hands of the user, who must take care of refreshing them
  • Eventually Consistent: the system eventually converges to a consistent state

SLIDE 93

Another CAP -- ACID

  • The relation between ACID and CAP is more complex
  • Atomicity: every operation is executed in an "all-or-nothing" fashion
  • Consistency: every transaction preserves the consistency constraints on data
  • Isolation: transactions do not interfere; every transaction is executed as if it were the only one in the system
  • Durability: after a commit, the updates made are permanent regardless of possible failures

SLIDE 94

CAP vs. ACID

  • ACID
  • C here refers to constraints on data and the data model
  • A refers to atomicity of operations and is always ensured
  • I is deeply related to CAP: isolation can be ensured in at most one partition
  • D is independent of CAP
  • CAP
  • C here refers to single-copy consistency
  • A here refers to service/data availability

SLIDE 95

2 of 3 is misleading (1)

  • In principle, every system should be designed to ensure both C and A in normal situations
  • When a partition occurs, the decision between C and A can be made
  • When the partition is resolved, the system takes corrective action and returns to normal operation

SLIDE 96

2 of 3 is misleading (2)

  • Partitions are rare events
  • There is little reason to forfeit C or A by design
  • Systems evolve over time
  • Depending on the specific partition, service, or data, the decision about which property to sacrifice can change
  • C, A and P are measured along a continuum
  • Several levels of consistency (e.g., ACID vs. BASE)
  • Several levels of availability
  • Several degrees of partition severity
SLIDE 97

Consistency/Latency Tradeoff (1)

  • CAP does not force designers to give up A or C, so why do so many systems trade away C?
  • CAP does not explicitly talk about latency...
  • ... however, latency is crucial to getting the essence of CAP

SLIDE 98

Consistency/Latency Tradeoff (2)

SLIDE 99

Contents

  • 5. Consensus Protocol: 2PC and 3PC

SLIDE 100

2PC: Two-Phase Commit Protocol (1)

  • Coordinator: proposes a vote to the other nodes
  • Participants/Cohorts: send their votes to the coordinator
SLIDE 101

2PC: Phase One

  • Coordinator proposes a vote and waits for the responses of participants
SLIDE 102

2PC: Phase Two

  • Coordinator commits or aborts the transaction according to the participants' feedback
  • If all agree, commit
  • If any one disagrees, abort
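A minimal single-process simulation of the two phases (illustrative names; a real implementation also forces a log record to stable storage at each step):

    # Two-phase commit sketch: Phase 1 collects votes; Phase 2 broadcasts
    # commit only if every participant voted yes, otherwise abort.

    class Participant:
        def vote(self, txn):
            return True              # "yes"; a real cohort checks locks and logs

        def finish(self, txn, decision):
            self.state = decision    # apply the commit/abort decision

    class Coordinator:
        def __init__(self, cohorts):
            self.cohorts = cohorts

        def run(self, txn):
            votes = [c.vote(txn) for c in self.cohorts]      # Phase 1
            decision = "commit" if all(votes) else "abort"   # unanimity needed
            for c in self.cohorts:                           # Phase 2
                c.finish(txn, decision)
            return decision

    print(Coordinator([Participant() for _ in range(3)]).run("T1"))  # commit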
SLIDE 103

Problem of 2PC

  • Scenario:
  – TC sends the commit decision to A; A gets it and commits, and then both TC and A crash
  – B, C, D, who voted Yes, now need to wait for TC or A to reappear (with mutexes locked)
  • They can't commit or abort, as they don't know what A responded
  – If that takes a long time (e.g., a human must replace hardware), then availability suffers
  – If the TC is also a participant, as it typically is, then this protocol is vulnerable to a single-node failure (the TC's failure)!
  • This is why two-phase commit is called a blocking protocol
  • In the context of consensus requirements: 2PC is safe, but not live
SLIDE 104

3PC: Three-Phase Commit Protocol (1)

  • Goal: turn 2PC into a live (non-blocking) protocol
  – 3PC should never block on node failures as 2PC did
  • Insight: 2PC suffers from allowing nodes to irreversibly commit an outcome before ensuring that the others know the outcome, too
  • Idea in 3PC: split the "commit/abort" phase into two phases
  – First communicate the outcome to everyone
  – Let them commit only after everyone knows the outcome
SLIDE 105

3PC: Three-Phase Commit Protocol (2)

SLIDE 106

Can 3PC Solve the Blocking Problem? (1)

  • 1. If one of them has received preCommit, ...
  • 2. If none of them has received preCommit, ...
  • Assuming the same scenario as before (TC, A crash), can B/C/D reach a safe decision when they time out?

SLIDE 107

Can 3PC Solve the Blocking Problem? (2)

3PC is safe for node crashes (including TC + participant)

  • Assuming the same scenario as before (TC, A crash), can B/C/D reach a safe decision when they time out?
  • 1. If one of them has received preCommit, they can all commit
  • This is safe if we assume that A is DEAD and, after coming back, it runs a recovery protocol in which it requires input from B/C/D to complete an uncommitted transaction
  • This conclusion was impossible to reach for 2PC, because A might have already committed and exposed the outcome of the transaction to the world
  • 2. If none of them has received preCommit, they can all abort
  • This is safe, because we know A couldn't have received a doCommit, so it couldn't have committed

SLIDE 108

3PC: Timeout Handling Specs (trouble begins)

SLIDE 109

But Does 3PC Achieve Consensus?

  • Liveness (availability): Yes
  – Doesn't block; it always makes progress by timing out
  • Safety (correctness): No
  – Can you think of scenarios in which the original 3PC would result in inconsistent states between the replicas?
  • Two examples of unsafety in 3PC (network partitions):
  – A hasn't crashed, it's just offline
  – TC hasn't crashed, it's just offline

SLIDE 110

Partition Management

SLIDE 111

3PC with Network Partitions

  • Similar scenario with a partitioned, not crashed, TC
  • One example scenario:
  – A receives prepareCommit from TC
  – Then, A gets partitioned from B/C/D, and TC crashes
  – None of B/C/D have received prepareCommit, hence they all abort upon timeout
  – A is prepared to commit; hence, according to the protocol, after it times out, it unilaterally decides to commit

SLIDE 112

Safety vs. Liveness

  • So, 3PC is doomed under network partitions
  – The way to think about it is that this protocol's design trades safety for liveness
  • Remember that 2PC traded liveness for safety
  • Can we design a protocol that's both safe and live?
SLIDE 113

Contents

  • 6. Paxos

SLIDE 114

Paxos (1)

  • The only known completely-safe and largely-live agreement protocol
  • Lets all nodes agree on the same value despite node failures, network failures, and delays
  – Only blocks in exceptional circumstances that are vanishingly rare in practice
  • Extremely useful, e.g.:
  – nodes agree that client X gets a lock
  – nodes agree that Y is the primary
  – nodes agree that Z should be the next operation to be executed

SLIDE 115

Paxos (2)

  • Widely used in both industry and academia
  • Examples:
  – Google: Chubby (Paxos-based distributed lock service); most Google services use Chubby directly or indirectly
  – Yahoo: ZooKeeper (Paxos-based distributed lock service), now in Hadoop
  – MSR: Frangipani (Paxos-based distributed lock service)
  – UW: Scatter (Paxos-based consistent DHT)
  – Open source: libpaxos (Paxos-based atomic broadcast); ZooKeeper is open-source and integrates with Hadoop
SLIDE 116

Paxos Properties

  • Safety
  – If agreement is reached, everyone agrees on the same value
  – The value agreed upon was proposed by some node
  • Fault tolerance (i.e., as-good-as-it-gets liveness)
  – If fewer than half the nodes fail, the remaining nodes eventually reach agreement
  • No guaranteed termination (i.e., imperfect liveness)
  – Paxos may not always converge on a value, but only in very degenerate cases that are improbable in the real world
  • Lots of awesomeness
  – The basic idea seems natural in retrospect, but why it works in any detail is incredibly complex!

SLIDE 117

Basic Idea (1)

  • Paxos is similar to 2PC, but with some twists
  • One (or more) node decides to be coordinator (proposer)
  • The proposer proposes a value and solicits acceptance from the others (acceptors)
  • The proposer announces the chosen value, or tries again if it failed to converge on a value
  • Values to agree on:
  • Whether to commit/abort a transaction
  • Which client should get the next lock
  • Which write we perform next
  • What time to meet (party example)

SLIDE 118

Basic Idea (2)

  • Paxos is similar to 2PC, but with some twists
  • One (or more) node decides to be coordinator (proposer)
  • The proposer proposes a value and solicits acceptance from the others (acceptors)
  • The proposer announces the chosen value, or tries again if it failed to converge on a value

SLIDE 119

Basic Idea (3)

  • Paxos is similar to 2PC, but with some twists
  • One (or more) node decides to be coordinator (proposer)
  • The proposer proposes a value and solicits acceptance from the others (acceptors)
  • The proposer announces the chosen value, or tries again if it failed to converge on a value
  • Hence, Paxos is egalitarian: any node can propose/accept; no one has special powers
  • Just like the real world, e.g., a group of friends organizing a party -- anyone can take the lead

SLIDE 120

Challenges

  • What if multiple nodes become proposers simultaneously?
  • What if the new proposer proposes different values than an already decided value?
  • What if there is a network partition?
  • What if a proposer crashes in the middle of solicitation?
  • What if a proposer crashes after deciding but before announcing results?

SLIDE 121

Core Differentiating Mechanisms

  • 1. Proposal ordering
  – Lets nodes decide which of several concurrent proposals to accept and which to reject
  • 2. Majority voting
  – 2PC needs all nodes to vote Yes before committing
  • As a result, 2PC may block when a single node fails
  – Paxos requires only a majority of the acceptors (half + 1) to accept a proposal
  • As a result, in Paxos nearly half the nodes can fail to reply and the protocol continues to work correctly
  • Moreover, since no two majorities can exist simultaneously, network partitions do not cause problems (as they did for 3PC)

SLIDE 122

Implementation of Paxos

  • Paxos has rounds; each round has a unique ballot id
  • Rounds are asynchronous
  • Time synchronization is not required
  • If you're in round j and hear a message from round j+1, abort everything and move over to round j+1
  • Use timeouts; may be pessimistic
  • Each round is itself broken into phases (which are also asynchronous)
  • Phase 1: a leader is elected (Election)
  • Phase 2: the leader proposes a value and processes acks (Bill)
  • Phase 3: the leader multicasts the final value (Law)
SLIDE 123

Phase 1 -- Election

  • A potential leader chooses a unique ballot id, higher than anything it has seen so far
  • It sends the ballot id to all processes
  • Processes wait, responding once to the highest ballot id
  • If the potential leader sees a higher ballot id, it can't be a leader
  • Paxos is tolerant to multiple leaders, but we'll only discuss the one-leader case
  • Processes also log the received ballot id on disk
  • If a process has decided on a value v' in a previous round, it includes v' in its response
  • If a majority (i.e., quorum) respond OK, then you are the leader
  • If no one has a majority, start a new round
  • A round cannot have two leaders (why?)

[Figure: "Please elect me!" -- "OK!"]

SLIDE 124

Phase 2 -- Proposal (Bill)

  • The leader sends the proposed value v to all
  • Use v = v' if some process already decided v' in a previous round and sent you its decided value
  • Recipients log the proposal on disk and respond OK

[Figure: "Please elect me!" -- "OK!" -- "Value v ok?" -- "OK!"]

SLIDE 125

Phase 3 -- Decision (Law)

  • If the leader hears a majority of OKs, it lets everyone know of the decision
  • Recipients receive the decision and log it on disk

[Figure: "Please elect me!" -- "OK!" -- "Value v ok?" -- "OK!" -- "v!"]
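A toy single-decree simulation of the three phases (my own sketch: in-memory "logs", no failures or retries; real Paxos persists promised/accepted state and runs over a network):

    # Single-decree Paxos sketch following the slides' three phases.

    class Acceptor:
        def __init__(self):
            self.promised = -1     # highest ballot id seen (logged on disk)
            self.accepted = None   # (ballot, value) accepted in a prior round

        def prepare(self, ballot):            # Phase 1: "Please elect me!"
            if ballot > self.promised:
                self.promised = ballot
                return ("OK", self.accepted)
            return ("reject", None)

        def accept(self, ballot, value):      # Phase 2: "Value v ok?"
            if ballot >= self.promised:
                self.promised = ballot
                self.accepted = (ballot, value)
                return "OK"
            return "reject"

    def propose(acceptors, ballot, value):
        majority = len(acceptors) // 2 + 1
        replies = [a.prepare(ballot) for a in acceptors]          # Phase 1
        promises = [acc for ok, acc in replies if ok == "OK"]
        if len(promises) < majority:
            return None                    # lost the election; start a new round
        prior = [acc for acc in promises if acc is not None]
        if prior:
            value = max(prior)[1]          # must re-propose highest prior value
        oks = sum(a.accept(ballot, value) == "OK" for a in acceptors)  # Phase 2
        return value if oks >= majority else None  # Phase 3 would announce it

    acceptors = [Acceptor() for _ in range(5)]
    print(propose(acceptors, ballot=1, value="client X gets the lock"))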

SLIDE 126

Which is the point of no-return? (1)

  • That is, when is consensus reached in the system?

SLIDE 127

Which is the point of no-return? (2)

  • If/when a majority of processes hear the proposed value and accept it (i.e., are about to respond / have responded with an OK!)
  • Processes may not know it yet, but a decision has been made for the group
  • Even the leader does not know it yet
  • What if the leader fails after that?
  • Keep having rounds until some round completes

SLIDE 128

Safety

  • If some round has a majority (i.e., quorum) hearing the proposed value v' and accepting it (middle of Phase 2), then subsequently every round either: 1) chooses v' as its decision, or 2) fails
  • Proof:
  • A potential leader waits for a majority of OKs in Phase 1
  • At least one of them will contain v' (because two majorities or quorums always intersect)
  • It will choose to send out v' in Phase 2
  • Success requires a majority, and any two majority sets intersect

SLIDE 129

What could go wrong?

  • A process fails
  • A majority that does not include it can still proceed
  • When the process restarts, it uses its log to retrieve past decisions (if any) and past-seen ballot ids, and tries to learn of past decisions
  • The leader fails
  • Start another round
  • Messages are dropped
  • If things are too flaky, just start another round
  • Note that anyone can start a round at any time
  • The protocol may never end -- tough luck, buddy!
  • The impossibility result is not violated
  • If things go well sometime in the future, consensus is reached
SLIDE 130

Contents

  • 7. Chubby and ZooKeeper

SLIDE 131

Google Chubby

  • Research paper
  • "The Chubby Lock Service for Loosely-Coupled Distributed Systems", Proc. of OSDI '06
  • What is Chubby?
  • A lock service in a loosely-coupled distributed system (e.g., 10K 4-processor machines connected by 1 Gbps Ethernet)
  • Client interface similar to whole-file advisory locks, with notification of various events (e.g., file modifications)
  • Primary goals: reliability, availability, easy-to-understand semantics
  • How is it used?
  • Used in Google: GFS, Bigtable, etc.
  • To elect leaders and store small amounts of metadata, as the root of distributed data structures

SLIDE 132

System Structure (1)

  • A Chubby cell consists of a small set of servers (replicas)
  • A master is elected from the replicas via a consensus protocol
  • Master lease: several seconds
  • If a master fails, a new one will be elected when the master lease expires
  • Clients talk to the master via the Chubby library
  • All replicas are listed in DNS; clients discover the master by talking to any replica
SLIDE 133

System Structure (2)

  • Replicas maintain copies of a simple database
  • Clients send read/write requests only to the master
  • For a write:
  • The master propagates it to the replicas via the consensus protocol
  • It replies after the write reaches a majority of replicas
  • For a read:
  • The master satisfies the read alone
SLIDE 134

System Structure (3)

  • If a replica fails and does not recover for a long time (a few hours)
  • A fresh machine is selected to be a new replica, replacing the failed one
  • It updates DNS
  • It obtains a recent copy of the database
  • The current master polls DNS periodically to discover new replicas
SLIDE 135

Simple UNIX-like File System Interface

  • Chubby supports a strict tree of files and directories
  • No symbolic links, no hard links
  • /ls/foo/wombat/pouch
  • 1st component (ls): lock service (common to all names)
  • 2nd component (foo): the Chubby cell (used in the DNS lookup to find the cell master)
  • The rest: the name inside the cell
  • Can be accessed via Chubby's specialized API or other file system interfaces (e.g., GFS)
  • Supports most normal operations (create, delete, open, write, ...)
  • Supports advisory reader/writer locks on a node
SLIDE 136

ACLs and File Handles

  • Access Control List (ACL)
  • A node has three ACL names (read/write/change-ACL names)
  • An ACL name is the name of a file in the ACL directory
  • The file lists the authorized users
  • File handle:
  • Has check digits encoded in it; cannot be forged
  • Sequence number: a master can tell whether the handle was created by a previous master
  • Mode information at open time: if a previous master created the handle, a newly restarted master can learn the mode information

SLIDE 137

Locks and Sequencers

  • Locks: advisory rather than mandatory
  • Potential lock problems in distributed systems:
  • A holds a lock L, issues request W, then fails
  • B acquires L (because A failed), performs actions
  • W arrives (out of order) after B's actions
  • Solution #1: backward compatible
  • The lock server prevents other clients from getting the lock if the lock becomes inaccessible or the holder has failed
  • A lock-delay period can be specified by clients
  • Solution #2: sequencer
  • A lock holder can obtain a sequencer from Chubby
  • It attaches the sequencer to any requests that it sends to other servers (e.g., Bigtable)
  • The other servers can verify the sequencer information
SLIDE 138

Chubby Events

  • Clients can subscribe to events (up-calls from the Chubby library)
  • File contents modified: if the file contains the location of a service, this event can be used to monitor the service location
  • Master failed over
  • Child node added, removed, or modified
  • Handle becomes invalid: probably a communication problem
  • Lock acquired (rarely used)
  • Locks are conflicting (rarely used)
SLIDE 139

APIs

  • Open()
  • Mode: read/write/change ACL; events; lock-delay
  • Create a new file or directory?
  • Close()
  • GetContentsAndStat(), GetStat(), ReadDir()
  • SetContents(): set all contents; SetACL()
  • Delete()
  • Locks: Acquire(), TryAcquire(), Release()
  • Sequencers: GetSequencer(), SetSequencer(), CheckSequencer()
SLIDE 140

Example -- Primary Election

Open("write mode");
If (successful) {
    // primary
    SetContents("identity");
} Else {
    // replica
    Open("read mode", "file-modification event");
    when notified of file modification:
        primary = GetContentsAndStat();
}

SLIDE 141

Caching

  • Strict consistency: easy to understand
  • Lease-based
  • The master invalidates cached copies upon a write request
  • Write-through caches
SLIDE 142

Sessions, Keep-Alives, Master Fail-overs (1)

  • Session:
  • A client sends keep-alive requests to the master
  • The master responds with a keep-alive response
  • Immediately after getting the keep-alive response, the client sends another request for extension
  • The master blocks keep-alives until close to the expiration of the session
  • The extension defaults to 12s
  • Clients maintain a local timer for estimating session timeouts (time is not perfectly synchronized)
  • If the local timer runs out, the client waits for a 45s grace period before ending the session
  • This happens when a master fails over
SLIDE 143

Sessions, Keep-Alives, Master Fail-overs (2)

SLIDE 144

Other Details

  • Database implementation
  • A simple database with write-ahead logging and snapshotting
  • Backup:
  • Writes a snapshot to a GFS server in a different building
  • Mirroring files across multiple cells
  • Configuration files (e.g., locations of other services, access control lists, etc.)

SLIDE 145

ZooKeeper

  • Developed at Yahoo! Research
  • Started as a sub-project of Hadoop; now a top-level Apache project
  • Development is driven by application needs
  • [Book] ZooKeeper by Junqueira & Reed, 2013
SLIDE 146

ZooKeeper in the Hadoop Ecosystem

SLIDE 147

ZooKeeper Service (1)

  • Znode
  • An in-memory data node in the ZooKeeper data tree
  • Znodes form a hierarchical namespace
  • UNIX-like notation for paths
  • Types of znodes
  • Regular
  • Ephemeral
  • Flags of znodes
  • Sequential flag
SLIDE 148

ZooKeeper Service (2)

  • Watch mechanism
  • Get notifications
  • One-time triggers
  • Other properties of znodes
  • Znodes are not designed for general data storage; instead they store metadata or configuration
  • Can store information like timestamps and version numbers
  • Session
  • A connection from a client to a server is a session
  • Timeout mechanism
SLIDE 149

Client API

  • create(path, data, flags)
  • delete(path, version)
  • exists(path, watch)
  • getData(path, watch)
  • setData(path, data, version)
  • getChildren(path, watch)
  • sync(path)
  • Each operation has two versions: synchronous and asynchronous
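A hedged sketch of this API from Python, using the open-source kazoo client (assumed installed, with a ZooKeeper server at localhost:2181); kazoo's method names differ slightly from the slide's but map one-to-one:

    # ZooKeeper client API sketch via kazoo. Watches are one-time triggers.
    from kazoo.client import KazooClient

    zk = KazooClient(hosts="127.0.0.1:2181")
    zk.start()                                   # opens a session (with timeout)

    zk.create("/config", b"v1", makepath=True)   # create(path, data, flags)

    def on_change(event):                        # fires once on the next change
        print("watch fired:", event)

    data, stat = zk.get("/config", watch=on_change)   # getData(path, watch)
    zk.set("/config", b"v2", version=stat.version)    # setData w/ version check
    zk.delete("/config", version=-1)             # delete(path, version); -1 = any
    zk.stop()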
SLIDE 150

Guarantees

  • Linearizable writes
  • All requests that update the state of ZooKeeper are serializable and respect precedence
  • FIFO client order
  • All requests are processed in the order in which they were sent by the client
SLIDE 151

Implementation (1)

  • ZooKeeper data is replicated on each server that composes the service

SLIDE 152

Implementation (2)

  • A ZooKeeper server services clients
  • Clients connect to exactly one server to submit requests
  • Read requests are served from the local replica
  • Write requests are processed by an agreement protocol (an elected server leader initiates processing of the write request)

SLIDE 153

Hadoop Environment

SLIDE 154

Example: Configuration

SLIDE 155

Example: group membership

SLIDE 156

Example: simple locks

SLIDE 157

Example: locking without herd effect
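The herd-free recipe from the ZooKeeper paper: each contender creates an ephemeral sequential znode and watches only its immediate predecessor, so a release wakes exactly one waiter. A hedged kazoo sketch (kazoo also ships a ready-made Lock recipe; this hand-rolled version just mirrors the algorithm):

    # Locking without herd effect: watch only the next-lower sequence znode.
    import threading
    from kazoo.client import KazooClient

    def acquire(zk, lock_path="/lock"):
        zk.ensure_path(lock_path)
        me = zk.create(lock_path + "/guid-", ephemeral=True, sequence=True)
        my_name = me.split("/")[-1]
        while True:
            children = sorted(zk.get_children(lock_path))
            if my_name == children[0]:
                return me                     # lowest sequence number: lock held
            prev = children[children.index(my_name) - 1]
            gone = threading.Event()
            if zk.exists(lock_path + "/" + prev, watch=lambda e: gone.set()):
                gone.wait()                   # sleep until predecessor vanishes

    zk = KazooClient(hosts="127.0.0.1:2181")
    zk.start()
    lock = acquire(zk)
    zk.delete(lock)                           # release = delete ephemeral znode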

SLIDE 158

Example: leader election

SLIDE 159

ZooKeeper Application (1)

  • Fetching Service
  • Uses ZooKeeper to recover from failures of masters
  • Configuration metadata and leader election

SLIDE 160

ZooKeeper Application (2)

  • Yahoo! Message Broker
  • A distributed publish-subscribe system

SLIDE 161

Thank you!