SLIDE 1

NOSQL UNDER THE HOOD:


THE ANATOMY AND EVOLUTION OF CASSANDRA

BEN COVERSTON DSE ARCHITECT — DATASTAX INC. @BCOVERSTON

SLIDE 2

THE GRADUAL DEVELOPMENT OF SOMETHING, ESPECIALLY FROM A SIMPLE TO A MORE COMPLEX FORM.

Evolution

SLIDE 3

A STUDY OF THE STRUCTURE OR INTERNAL WORKINGS OF SOMETHING.

Anatomy

SLIDE 4

EX NIHILO IS A LATIN PHRASE MEANING "OUT OF NOTHING"

Ex Nihilo

SLIDE 5

CASSANDRA WAS NOT CREATED EX NIHILO

SLIDE 6

THE ANATOMY AND EVOLUTION OF CASSANDRA

SLIDE 7

WITH SO MANY OPTIONS, WHY CASSANDRA?

▸ Big Table could Scale Out, Very Well
  ▸ And provide a flexible data model
  ▸ But it wasn’t great at High Availability
▸ Dynamo Could also Scale Out
  ▸ But High Availability was its biggest strength
  ▸ Extreme resilience under exceptionally hostile conditions
  ▸ But… mostly a key value store

SLIDE 8

THE POSITION OR FUNCTION OF AN ORGANISM IN A COMMUNITY OF PLANTS AND ANIMALS.

Niche

SLIDE 9

SO DYNAMO AND BIG TABLE HAD A BABY? WHY?

▸ Originally to fill a Niche
  ▸ Facebook Inbox Search
▸ When the project was complete, they open sourced it.
  ▸ July 2008 (Google Code)

SLIDE 10

I NEED A DISTRIBUTED DATABASE. A REAL DISTRIBUTED DATABASE … SO, I'M WORKING ON CASSANDRA.

Jonathan Ellis - March 27th 2009

SLIDE 11

THE SPARK OF LIFE

▸ Rackspace needed a metadata store for CloudFiles
▸ An engineer needs a real distributed database that is both Partition Tolerant and Highly Available
▸ Cassandra entered the Apache Incubator in March 2009

SLIDE 12

FUNDAMENTAL ANATOMICAL STRENGTHS (CIRCA 2009)

▸ P2P -> No Single Point of Failure
▸ Easy to understand replication model
▸ Focus on Availability and Partition Tolerance
▸ API that allows for more than just Java clients (Thrift)
▸ Raw access to the file system on the individual machines (not abstracted away into HDFS)
▸ Good performance for mixed workloads, and larger-than-memory workloads.

SLIDE 13

WHAT’S DANGEROUS IS NOT TO EVOLVE

Jeff Bezos

SLIDE 14

MAJOR ANATOMICAL STRUCTURES

▸ Log Structured Storage
▸ Data Versioning
▸ Replication (on Write)
▸ Anti-Entropy Repair
▸ Read Repair
▸ Consistent Hashing

SLIDE 15

EVOLUTION OF LOG STRUCTURED STORAGE

SLIDE 16

BTREES / ISAM

▸ Pros
  ▸ Querying is often fast and easy
  ▸ Support for referential integrity
  ▸ You need a master in a distributed system: order matters
▸ Cons
  ▸ Reads and writes are tightly coupled
  ▸ Writes have to seek before an update happens (read before write, RbW)
  ▸ Index Structures (if used) are Expensive
  ▸ Indexes, and most of your data, need to be in your working set for good performance.

SLIDE 17

Before: Alice Carroll, Bob Parr
After: Alice Liddell, Bob Parr

1. Seek for Alice’s record
2. Update record in place

SLIDE 18

WHY LOG STRUCTURED STORAGE

▸ Pros:
  ▸ Writes are fast, they never wait for locks
  ▸ No locking means you don’t need a master
▸ Cons:
  ▸ Reads may require a merge step
  ▸ Background Compaction

SLIDE 19

Records: Alice Carroll, Alice Liddell, Bob Parr (each record has a timestamp)

Write a new record, then merge records

Merged: Alice Liddell, Bob Parr
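A minimal sketch of the append-and-merge idea, assuming a toy in-memory store. `LogStructuredStore` and its methods are hypothetical names for illustration, not Cassandra's storage engine. Writes only append, reads merge newest-first, and compaction collapses old versions.

```python
class LogStructuredStore:
    """Toy append-only store (hypothetical, for illustration only)."""

    def __init__(self):
        # Each segment is a list of (key, value) entries in write order.
        self.segments = [[]]

    def write(self, key, value):
        # Append only: no seek to the old record, no lock on it.
        self.segments[-1].append((key, value))

    def flush(self):
        # Seal the current segment and start a new one.
        self.segments.append([])

    def read(self, key):
        # Merge step: scan newest to oldest and return the first match.
        for segment in reversed(self.segments):
            for k, v in reversed(segment):
                if k == key:
                    return v
        return None

    def compact(self):
        # Background compaction: keep only the newest value per key.
        latest = {}
        for segment in self.segments:
            for k, v in segment:
                latest[k] = v
        self.segments = [list(latest.items())]


store = LogStructuredStore()
store.write("alice", "Alice Carroll")
store.write("bob", "Bob Parr")
store.flush()
store.write("alice", "Alice Liddell")  # a new record, not an in-place update
current = store.read("alice")           # the newest version wins on read
store.compact()                         # old "Alice Carroll" entry is dropped
```

The read path pays for the cheap write path: it may have to look through several segments until compaction has merged them.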

SLIDE 20

DATA VERSIONING

SLIDE 21

IN PLACE UPDATES

▸ Very Common in Single Server or in Master/Slave systems
▸ Requires coordination
  ▸ Coordination is expensive, error prone.
▸ If you choose this, you’re choosing a CP system.

SLIDE 22

VERSIONED UPDATES

▸ Can be easily distributed
▸ Versions can be chosen by the client or the server
▸ Reconciliation can happen later.
  ▸ This can be expensive
▸ Scales, because there is no need for coordination.

SLIDE 23

TIMESTAMPS

▸ Assign a timestamp at write time
▸ Client or Server can reconcile at read
  ▸ To varying degrees (Consistency Level)
▸ Relatively Simple and Straightforward
▸ Clocks have to be synchronized
▸ Old data gets overwritten by new data.

SLIDE 24

Records: Alice Carroll (3456), Alice Liddell (1234), Bob Parr (2345); each record has a timestamp

Write a new record, then merge records

Merged: Alice Carroll (3456), Bob Parr (2345)
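The merge on this slide can be sketched as a last-write-wins reduction, using the slide's values. The function name, the lowercase keys, and the tie rule (the first record seen wins on equal timestamps) are assumptions for illustration, not Cassandra's reconciliation code.

```python
def merge_last_write_wins(records):
    """Merge (key, value, timestamp) records, keeping the highest
    timestamp per key. Hypothetical helper for illustration."""
    merged = {}
    for key, value, ts in records:
        # Keep the record only if it is newer than what we already hold.
        if key not in merged or ts > merged[key][1]:
            merged[key] = (value, ts)
    return merged


records = [
    ("alice", "Alice Carroll", 3456),
    ("alice", "Alice Liddell", 1234),
    ("bob", "Bob Parr", 2345),
]
merged = merge_last_write_wins(records)
```

Because the outcome depends only on the timestamps, a writer with a fast clock can shadow later writes, which is why the clocks have to be synchronized.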

SLIDE 25

VECTOR CLOCKS

▸ Data is decorated with two pieces of data:
  ▸ Where the update happened (node)
  ▸ A Sequence Number
▸ Some updates can be reconciled
▸ Some updates must be manually reconciled
  ▸ This is !FUN!, especially for clients that get multiple versions back for a client request.
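A sketch of the textbook vector clock comparison, assuming clocks are plain `{node: counter}` dicts. `increment` and `compare` are hypothetical helper names. A "concurrent" result is the case where the system cannot pick a winner and the client may get multiple versions back.

```python
def increment(clock, node):
    """Return a copy of the clock with this node's counter advanced."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def compare(a, b):
    """Compare two vector clocks ({node: counter}).
    Returns "equal", "before", "after", or "concurrent"."""
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"      # b descends from a: b can replace a
    if b_le_a:
        return "after"       # a descends from b: a can replace b
    return "concurrent"      # neither descends: needs reconciliation


v1 = increment({}, "n1")     # first write, on node n1
v2 = increment(v1, "n1")     # follow-up write on n1
v3 = increment(v1, "n2")     # a different write on n2 that never saw v2
```

Here `v1` and `v2` are ordered, so they reconcile automatically; `v2` and `v3` are siblings, and something outside the clock has to decide.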

SLIDE 26

Alice Carroll 1 n2 Alice Liddell 1 n1 Bob Parr 2 n1

Record has a version and a node identifier

Alice Liddell 1 n1 Alice Carroll 3 n2 Bob Parr 2 n1

Merge Records Write a new record on another node

* reductionist, many details not included

SLIDE 27

EVOLUTION OF REPLICATION

SLIDE 28

Slave Slave Master

Writes Synchronous Replication

SLIDE 29

MASTER - SLAVE REPLICATION

▸ Pros
  ▸ Consistent
  ▸ Doesn’t require versioning
  ▸ Real-Time Knowledge about Replication Status
▸ Cons
  ▸ Updates happen at the master first
  ▸ Consistent reads happen at the master
  ▸ Single Point of Failure
  ▸ Failover modes are complicated

SLIDE 30

A-E F-J K-M N-Q R-T U-Z

SLIDE 31

MULTI-MASTER REPLICATION

▸ Shard the Data over Multiple Masters
▸ Pros
  ▸ Failures are smaller in scope
▸ Cons
  ▸ Different masters own different ranges
  ▸ Failover Machinery also has to be duplicated
  ▸ SPOF still exists for every range

SLIDE 32

client

SLIDE 33

PEER TO PEER REPLICATION

▸ Pros
  ▸ Failover is simple
  ▸ Write anywhere
  ▸ No master
  ▸ Can choose consistency levels
▸ Cons
  ▸ Writes are not guaranteed to happen at every replica at write time
  ▸ Anti Entropy can be expensive, and is a major consideration in sizing for individual nodes.

SLIDE 34

EVOLUTION OF ANTI-ENTROPY

SLIDE 35

WHY ANTI-ENTROPY

▸ Mostly a Peer to Peer Problem
▸ None of the AE systems are perfect
▸ Multiple compensating mechanisms increase the time between updates and a consistent view of the system.
▸ Evolutionary path has been mostly iterative, lessons have been learned along the way.

SLIDE 36

BACKUP / RESTORE

▸ It works
  ▸ If you have tested it
  ▸ If the backup is not corrupt
▸ Requires operational discipline
▸ Requires time
▸ Most systems have to be down for a restore.

SLIDE 37

HINTED HANDOFF

▸ Missed Updates Revisited Later
▸ Part of the Dynamo Paper
▸ Problems With Early Versions of Hinted Handoff
  ▸ Stampeding Herd
  ▸ Increased Load on Delivery
  ▸ Storage (How Many and Where?)
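A minimal sketch of the hinted handoff idea, assuming in-memory nodes. The `Node` class and both functions are hypothetical, not Cassandra's implementation. The coordinator keeps a hint for each replica that was down and replays it once the replica is back.

```python
class Node:
    """Toy replica (hypothetical, for illustration only)."""

    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}
        self.hints = []  # (target_node_name, key, value) for down replicas


def write(coordinator, replicas, key, value):
    for replica in replicas:
        if replica.up:
            replica.data[key] = value
        else:
            # Missed update: remember it so it can be revisited later.
            coordinator.hints.append((replica.name, key, value))


def deliver_hints(coordinator, nodes_by_name):
    remaining = []
    for target, key, value in coordinator.hints:
        node = nodes_by_name[target]
        if node.up:
            node.data[key] = value      # replay the missed write
        else:
            remaining.append((target, key, value))
    coordinator.hints = remaining


n1, n2, n3 = Node("N1"), Node("N2"), Node("N3")
n3.up = False
for name in ["Alice", "Bob", "Chuck"]:
    write(n1, [n1, n2, n3], name, name)
hints_while_down = list(n1.hints)       # three hints, all addressed to N3
n3.up = True
deliver_hints(n1, {"N1": n1, "N2": n2, "N3": n3})
```

This sketch replays every hint at once and ignores versions, so it shows where the listed problems come from: a long outage produces a burst of load on delivery, and the hints need storage somewhere in the meantime.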

SLIDE 38

HINTED HANDOFF

X

N3 Alice N3 Bob N3 Chuck N3 Dave N3 Eve N3 Frank N3 Gus

Alice Alice Bob Bob Chuck Alice Dave Eve Frank Gus Bob Chuck Dave Eve Frank Gus

SLIDE 39

READ REPAIR

▸ Easy and cheap when you are already reading from multiple replicas
▸ Probabilistic repair when only reading from a single replica
▸ Hot data will be repaired frequently

SLIDE 40

Bob, 1 Bob, 1 Robert, 2

READ AT CONSISTENCY LEVEL 1

SLIDE 41

Bob, 1 Bob, 1 Robert, 2

READ AT CONSISTENCY LEVEL 1

Bob, 1

SLIDE 42

Bob, 1 Bob, 1 Robert, 2

CHECK OTHER REPLICAS AT PROBABILITY 0.1, OR SOME OTHER VALUE

Get Digest Get Digest

SLIDE 43

Robert, 2 Bob, 1 Robert, 2

UPDATE COORDINATOR WITH LATEST RECORD FROM DIGEST

SLIDE 44

Robert, 2 Robert, 2 Robert, 2

REPAIR OUT OF DATE REPLICAS

Update Replica

SLIDE 45

Robert, 2 Robert, 2 Robert, 2

READ AT CONSISTENCY LEVEL 1

Robert, 2
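The walkthrough on the last few slides can be sketched as follows, assuming each replica is a dict of `key -> (value, version)`. The function, its parameters, and the injectable `rng` are hypothetical. This toy compares full records, where the slides show digests being compared.

```python
import random


def read_with_repair(replicas, key, repair_chance=0.1, rng=random.random):
    """Read from one replica (consistency level 1) and, with some
    probability, check and repair the others. Illustration only."""
    result = replicas[0][key]            # answer from a single replica
    if rng() < repair_chance:            # occasionally check the rest
        newest = max((r[key] for r in replicas), key=lambda rec: rec[1])
        for r in replicas:
            if r[key][1] < newest[1]:
                r[key] = newest          # repair out-of-date replicas
    return result                        # this read may still be stale


replicas = [
    {"name": ("Bob", 1)},
    {"name": ("Bob", 1)},
    {"name": ("Robert", 2)},
]
# Force the repair branch so the example is deterministic.
first = read_with_repair(replicas, "name", rng=lambda: 0.0)
# Skip the repair branch: every replica already agrees.
second = read_with_repair(replicas, "name", rng=lambda: 1.0)
```

The first read returns the stale value and triggers the repair; the next read sees the repaired record. Data that is read often gets many chances to hit the repair branch, which is why hot data is repaired frequently.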

SLIDE 46

REPAIR

▸ Similar to rsync
▸ Expensive, Slow
▸ Incremental Repair is a huge (late) improvement in manageability.
▸ How Repair Works*
  • 1. Generate Merkle Trees
  • 2. Compare Differences
  • 3. Stream Differences

SLIDE 47

HASHES FOR EACH RANGE, ARRANGE IN A HASH TREE

Tree 1: 01 02 03 04 05 06 07 08 09 10 11
Tree 2: 22 02 03 14 05 06 17 08 09 10 12

SLIDE 48

FIND THE HASHES THAT DON’T MATCH

Tree 1: 01 02 03 04 05 06 07 08 09 10 11
Tree 2: 22 02 03 14 05 06 17 08 09 10 12

SLIDE 49

LEAF NODES REPRESENT DATA THAT NEEDS TO BE EXCHANGED

Tree 1: 01 02 03 04 05 06 07 08 09 10 11
Tree 2: 22 02 03 14 05 06 17 08 09 10 12
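The three repair steps can be sketched with a small hash tree, assuming one byte string per range and a power-of-two number of ranges. `build_tree` and `diff_leaves` are hypothetical helpers; the real repair trees are more involved.

```python
import hashlib


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build_tree(ranges):
    """Hash each range, then hash pairs upward until one root remains.
    Returns levels: levels[0] holds the leaves, levels[-1] the root.
    Assumes len(ranges) is a power of two."""
    level = [h(r) for r in ranges]
    levels = [level]
    while len(level) > 1:
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def diff_leaves(a, b):
    """Walk both trees from the root, descending only into subtrees
    whose hashes differ. Returns indexes of mismatched leaves."""
    top = len(a) - 1
    if a[top][0] == b[top][0]:
        return []                        # roots match: nothing to stream
    suspects = [0]
    for depth in range(top - 1, -1, -1):
        nxt = []
        for i in suspects:
            for child in (2 * i, 2 * i + 1):
                if a[depth][child] != b[depth][child]:
                    nxt.append(child)
        suspects = nxt
    return suspects


local = build_tree([b"range0", b"range1", b"range2", b"range3"])
remote = build_tree([b"range0", b"range1", b"changed", b"range3"])
to_stream = diff_leaves(local, remote)   # only these ranges are exchanged
```

Matching subtrees are skipped without looking at their leaves, so the comparison is cheap; building the hashes and streaming the mismatched ranges is the expensive part.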

SLIDE 50

EVOLUTION OF SHARDING

SLIDE 51

RANGE SHARDING

▸ Pros
  ▸ Generally good for short term problems
  ▸ Range Scanning
▸ Cons
  ▸ Need to track shards, repartitioning
  ▸ Second System problem
  ▸ Data is “lumpy”

SLIDE 52

A-E F-J K-M N-Q R-T U-Z

RANGE SHARDING
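A sketch of routing by first letter over the six ranges on this slide. `shard_for` is a hypothetical helper, and the sample names are borrowed from the deck's other examples.

```python
import bisect
from collections import Counter

# Upper bound of each range, in order, with the matching shard label.
UPPER_BOUNDS = ["E", "J", "M", "Q", "T", "Z"]
SHARDS = ["A-E", "F-J", "K-M", "N-Q", "R-T", "U-Z"]


def shard_for(name):
    """Pick the shard whose letter range contains the name's first letter."""
    first = name[0].upper()
    if not "A" <= first <= "Z":
        raise ValueError("name must start with a letter A-Z")
    return SHARDS[bisect.bisect_left(UPPER_BOUNDS, first)]


names = ["Alice", "Bob", "Chuck", "Dave", "Eve", "Frank", "Gus"]
load = Counter(shard_for(n) for n in names)
```

With these names, five of the seven keys land on the first shard and four shards stay empty, which is the "lumpy" distribution from the cons list. The sorted bounds are also what makes range scans straightforward.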

SLIDE 53

FUNCTIONAL SHARDING

▸ Pros
  ▸ Natural and easy process
  ▸ Can alleviate some pain and isolate failure conditions
▸ Cons
  ▸ Different Parts scale at different rates, if you need to scale, this probably isn’t a long term solution.

SLIDE 54

Sales User Marketing

FUNCTIONAL SHARDING

SLIDE 55

CONSISTENT HASHING

▸ Pros
  ▸ Scales Exceptionally Well
  ▸ Nothing Else Required (except a sharing algorithm)
  ▸ Uniform Distribution of Data
▸ Cons
  ▸ Where is my data?
  ▸ Global range queries become nonsensical
  ▸ Relationships based on the ‘key’ are scattered

SLIDE 56

CONSISTENT HASHING

Keys: ALICE, BOB, CHUCK, DAVE, EVE, FRANK
F(X) maps each key to a token
Node tokens: 100, 200, 300, 400, 500, 600
Token ranges: 1-100, 101-200, 201-300, 301-400, 401-500, 501-600
Keys after F(X): FRANK, DAVE, BOB, CHUCK, ALICE, EVE
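A sketch of the ring on this slide, assuming six nodes at tokens 100 through 600. The node names and the SHA-256-based `token_for` are assumptions standing in for F(X); the slide does not say which hash function is used.

```python
import bisect
import hashlib

TOKENS = [100, 200, 300, 400, 500, 600]        # one token per node
NODES = ["n1", "n2", "n3", "n4", "n5", "n6"]   # hypothetical node names


def token_for(key):
    """Stand-in for F(X): hash the key into the token space 1..600."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % 600 + 1


def owner(token):
    """The owner is the first node whose token is >= the key's token,
    so n1 owns 1-100, n2 owns 101-200, and so on around the ring."""
    idx = bisect.bisect_left(TOKENS, token)
    return NODES[idx % len(NODES)]


placement = {
    key: owner(token_for(key))
    for key in ["ALICE", "BOB", "CHUCK", "DAVE", "EVE", "FRANK"]
}
```

Any client can compute `owner(token_for(key))` on its own, so no lookup service is needed. The cost is that neighbouring keys end up on unrelated nodes, which is why global range queries stop making sense.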

SLIDE 57

CONCLUSIONS

▸ NoSQL is full of choices and tradeoffs
▸ Every animal has different characteristics
▸ Know your niche.
▸ Find an animal that will thrive in your niche.

SLIDE 58

QUESTIONS?

DATASTAX INC.