NOSQL UNDER THE HOOD:
THE ANATOMY AND EVOLUTION OF CASSANDRA
BEN COVERSTON DSE ARCHITECT — DATASTAX INC. @BCOVERSTON
Evolution: THE GRADUAL DEVELOPMENT OF SOMETHING, ESPECIALLY FROM A SIMPLE TO A MORE COMPLEX FORM.
Anatomy
Ex Nihilo
CASSANDRA WAS NOT CREATED EX NIHILO
WITH SO MANY OPTIONS, WHY CASSANDRA?
▸ Bigtable could Scale Out, Very Well
  ▸ And provide a flexible data model
  ▸ But it wasn’t great at High Availability
▸ Dynamo Could also Scale Out
  ▸ But High Availability was its biggest strength
  ▸ Extreme resilience under exceptionally hostile conditions
  ▸ But… mostly a key value store
Niche
SO DYNAMO AND BIG TABLE HAD A BABY? WHY?
▸ Originally to fill a Niche
  ▸ Facebook Inbox Search
▸ When the project was complete, they open sourced it.
  ▸ July 2008, on Google Code
Jonathan Ellis - March 27th 2009
THE SPARK OF LIFE
▸ Rackspace needed a metadata store for CloudFiles
▸ An engineer needs a real distributed database that is both Partition Tolerant and Highly Available
▸ Cassandra entered the Apache Incubator in March 2009
FUNDAMENTAL ANATOMICAL STRENGTHS (CIRCA 2009)
▸ P2P -> No Single Point of Failure
▸ Easy to understand replication model
▸ Focus on Availability and Partition Tolerance
▸ API that allows for more than just Java clients (Thrift)
▸ Raw access to the file system on the individual machines (not abstracted away into HDFS)
▸ Good performance for mixed workloads, and larger-than-memory workloads.
Jeff Bezos
MAJOR ANATOMICAL STRUCTURES
▸ Log Structured Storage
▸ Data Versioning
▸ Replication (on Write)
▸ Anti-Entropy Repair
▸ Read Repair
▸ Consistent Hashing
EVOLUTION OF LOG STRUCTURED STORAGE
BTREES / ISAM
▸ Pros
  ▸ Querying is often fast and easy
  ▸ Support for referential integrity
  ▸ You need a master in a distributed system: order matters
▸ Cons
  ▸ Reads and writes are tightly coupled
  ▸ Writes have to seek before an update happens (read before write, RbW)
  ▸ Index Structures (if used) are Expensive
  ▸ Indexes, and most of your data, need to be in your working set for good performance.
[Figure: in-place update. Seek for Alice’s record, then update the record in place. Records shown: Alice Carroll, Alice Liddell, Bob Parr.]
WHY LOG STRUCTURED STORAGE
▸ Pros:
  ▸ Writes are fast, they never wait for locks
  ▸ No locking means you don’t need a master
▸ Cons:
  ▸ Reads may require a merge step
  ▸ Background Compaction
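[Figure: log structured update. Write a new record instead of updating in place; each record has a timestamp; merge records to resolve the two versions of Alice’s record (Alice Carroll, Alice Liddell) alongside Bob Parr.]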
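The write, merge, and compaction steps above can be sketched in a few lines. This is a minimal in-memory illustration, not Cassandra's storage engine: the class name `LogStructuredStore`, its methods, and the list-of-segments layout are all assumptions made for the example.

```python
class LogStructuredStore:
    """Toy log structured store (hypothetical): append on write, merge on read."""

    def __init__(self):
        # Each segment is a list of (key, timestamp, value) records.
        self.segments = [[]]

    def write(self, key, value, timestamp):
        # Append only: no seek, no read before write, no lock on an old record.
        self.segments[-1].append((key, timestamp, value))

    def flush(self):
        # Freeze the current segment and start a new one.
        self.segments.append([])

    def read(self, key):
        # Merge step: collect every version of the key, keep the newest.
        versions = [(ts, value)
                    for segment in self.segments
                    for (k, ts, value) in segment
                    if k == key]
        if not versions:
            return None
        return max(versions, key=lambda version: version[0])[1]

    def compact(self):
        # Background compaction: rewrite all segments into one that holds
        # only the newest version of each key.
        newest = {}
        for segment in self.segments:
            for key, ts, value in segment:
                if key not in newest or ts > newest[key][0]:
                    newest[key] = (ts, value)
        merged = [(key, ts, value) for key, (ts, value) in sorted(newest.items())]
        self.segments = [merged, []]


store = LogStructuredStore()
store.write("alice", "Alice Liddell", 1234)
store.write("bob", "Bob Parr", 2345)
store.flush()
store.write("alice", "Alice Carroll", 3456)  # a new record, not an in-place update
```

Reads pay for the cheap writes: `read` has to look at every segment until `compact` has collapsed the old versions.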
DATA VERSIONING
IN PLACE UPDATES
▸ Very Common in Single Server or in Master/Slave systems
▸ Requires coordination
▸ Coordination is expensive, error prone.
▸ If you choose this, you’re choosing a CP system.
VERSIONED UPDATES
▸ Can be easily distributed
▸ Versions can be chosen by the client or the server
▸ Reconciliation can happen later.
  ▸ This can be expensive
▸ Scales, because there is no need for coordination.
TIMESTAMPS
▸ Assign a timestamp at write time
▸ Client or Server can reconcile at read
  ▸ To varying degrees (Consistency Level)
▸ Relatively Simple and Straightforward
▸ Clocks have to be synchronized
▸ Old data gets overwritten by new data.
[Figure: timestamp reconciliation. Each record has a timestamp: Alice Carroll (3456), Alice Liddell (1234), Bob Parr (2345). Write a new record, then merge records; the merged view is Alice Carroll (3456), Bob Parr (2345).]
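Timestamp reconciliation amounts to "highest timestamp wins" per key. The sketch below assumes each replica is a plain dict mapping a key to a `(value, timestamp)` pair; the function name `merge_replicas` and the key names are hypothetical, and the timestamps are the ones shown on the slide.

```python
def merge_replicas(*replicas):
    """Merge replica views, keeping the value with the highest timestamp per key."""
    merged = {}
    for replica in replicas:
        for key, (value, timestamp) in replica.items():
            if key not in merged or timestamp > merged[key][1]:
                merged[key] = (value, timestamp)
    return merged


# Two replicas that disagree about Alice's record.
replica_1 = {"alice": ("Alice Liddell", 1234), "bob": ("Bob Parr", 2345)}
replica_2 = {"alice": ("Alice Carroll", 3456)}

merged = merge_replicas(replica_1, replica_2)
```

The merge never looks at wall-clock order of arrival, only at the timestamps, which is why unsynchronized clocks can let an older write silently win.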
VECTOR CLOCKS
▸ Data is decorated with two pieces of data:
  ▸ Where the update happened (node)
  ▸ A Sequence Number
▸ Some updates can be reconciled
▸ Some updates must be manually reconciled
▸ This is !FUN!, especially for clients that get multiple versions back for a client request.
[Figure: vector clocks*. Each record has a version and a node identifier: Alice Carroll (1, n2), Alice Liddell (1, n1), Bob Parr (2, n1). Write a new record on another node, then merge records: Alice Liddell (1, n1), Alice Carroll (3, n2), Bob Parr (2, n1).]
* Reductionist; many details not included.
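The difference between updates that reconcile automatically and updates that need manual reconciliation comes down to comparing clocks. A minimal sketch, assuming a vector clock is a dict from node name to sequence number; `increment`, `descends`, and `compare` are hypothetical names for this example.

```python
def increment(clock, node):
    """Return a copy of the clock with the node's sequence number bumped."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def descends(a, b):
    """True if clock a has seen every update that clock b has seen."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def compare(a, b):
    """Classify two clocks as equal, ordered, or concurrent."""
    a_has_b = descends(a, b)
    b_has_a = descends(b, a)
    if a_has_b and b_has_a:
        return "equal"
    if a_has_b:
        return "a is newer"
    if b_has_a:
        return "b is newer"
    # Neither side has seen the other's update: someone must reconcile.
    return "concurrent"


base = increment({}, "n1")        # first write, on node n1
on_n1 = increment(base, "n1")     # follow-up write on n1
on_n2 = increment(base, "n2")     # independent write on n2
```

Here `on_n1` supersedes `base`, so it reconciles automatically, while `on_n1` and `on_n2` are concurrent and both versions go back to the client.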
EVOLUTION OF REPLICATION
[Figure: writes go to the Master, with synchronous replication to two Slaves.]
MASTER - SLAVE REPLICATION
▸ Pros
  ▸ Consistent
  ▸ Doesn’t require versioning
  ▸ Real-Time Knowledge about Replication Status
▸ Cons
  ▸ Updates happen at the master first
  ▸ Consistent reads happen at the master
  ▸ Single Point of Failure
  ▸ Failover modes are complicated
[Figure: key ranges A-E, F-J, K-M, N-Q, R-T, U-Z, one per master.]
MULTI-MASTER REPLICATION
▸ Shard the Data over Multiple Masters
▸ Pros
  ▸ Failures are smaller in scope
▸ Cons
  ▸ Different masters own different ranges
  ▸ Failover Machinery also has to be duplicated
  ▸ SPOF still exists for every range
PEER TO PEER REPLICATION
▸ Pros
  ▸ Failover is simple
  ▸ Write anywhere
  ▸ No master
  ▸ Can choose consistency levels
▸ Cons
  ▸ Writes are not guaranteed to happen at every replica at write time
  ▸ Anti Entropy can be expensive, and is a major consideration in sizing for individual nodes.
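One common way to reason about choosing consistency levels, given that a write may not reach every replica, is the quorum overlap rule: with N replicas, a read that waits for R replicas is guaranteed to touch at least one replica that acknowledged a write to W replicas whenever R + W > N. The sketch below checks that rule by brute force; the function names are hypothetical and nothing here is taken from the talk beyond the idea of tunable consistency.

```python
from itertools import combinations


def quorum(n):
    """Smallest majority of n replicas."""
    return n // 2 + 1


def overlap_by_rule(n, r, w):
    """The arithmetic rule: read and write sets must intersect when r + w > n."""
    return r + w > n


def overlap_by_enumeration(n, r, w):
    """Check every possible read set against every possible write set."""
    nodes = range(n)
    return all(set(read_set) & set(write_set)
               for read_set in combinations(nodes, r)
               for write_set in combinations(nodes, w))
```

With three replicas, reading and writing at quorum (2 and 2) always overlaps, while reading and writing at 1 does not, which is where the other anti-entropy mechanisms come in.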
EVOLUTION OF ANTI-ENTROPY
WHY ANTI-ENTROPY
▸ Mostly a Peer to Peer Problem
▸ None of the AE systems are perfect
▸ Multiple compensating mechanisms increase the time between updates and a consistent view of the system.
▸ Evolutionary path has been mostly iterative, lessons have been learned along the way.
BACKUP / RESTORE
▸ It works
  ▸ If you have tested it
  ▸ If the backup is not corrupt
▸ Requires operational discipline
▸ Requires time
▸ Most systems have to be down for a restore.
HINTED HANDOFF
▸ Missed Updates Revisited Later
▸ Part of the Dynamo Paper
▸ Problems With Early Versions of Hinted Handoff
  ▸ Stampeding Herd
  ▸ Increased Load on Delivery
  ▸ Storage (How Many and Where?)
[Figure: hinted handoff. Node N3 is unavailable (X) while writes for Alice, Bob, Chuck, Dave, Eve, Frank, and Gus arrive; the other replicas apply the writes, and a hint addressed to N3 is stored for each missed update.]
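The hint-and-replay idea can be sketched as follows. This is a simplified illustration under several assumptions: the `Node` and `Coordinator` classes are hypothetical, hints live in the coordinator's memory, and delivery is throttled with a `batch_size` parameter as one possible answer to the "increased load on delivery" problem. It ignores versioning, so a replayed hint could overwrite newer data in a real system.

```python
from collections import defaultdict


class Node:
    """A replica that may be up or down (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}


class Coordinator:
    """Writes to every replica, storing a hint for each replica that is down."""

    def __init__(self, replicas):
        self.replicas = replicas
        self.hints = defaultdict(list)  # node name -> list of missed (key, value)

    def write(self, key, value):
        for node in self.replicas:
            if node.up:
                node.data[key] = value
            else:
                self.hints[node.name].append((key, value))

    def deliver_hints(self, node, batch_size=2):
        """Replay at most batch_size hints; return how many are still pending."""
        pending = self.hints[node.name]
        for key, value in pending[:batch_size]:
            node.data[key] = value
        del pending[:batch_size]
        return len(pending)


n1, n2, n3 = Node("N1"), Node("N2"), Node("N3")
coordinator = Coordinator([n1, n2, n3])

n3.up = False
for name in ["Alice", "Bob", "Chuck", "Dave", "Eve", "Frank", "Gus"]:
    coordinator.write(name.lower(), name)
hints_while_down = len(coordinator.hints["N3"])

n3.up = True
remaining_after_first_batch = coordinator.deliver_hints(n3)
while coordinator.deliver_hints(n3) > 0:
    pass
```

Delivering in small batches spreads the replay over time instead of hitting the recovering node with every missed write at once.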
READ REPAIR
▸ Easy and cheap when you are already reading from multiple replicas
▸ Probabilistic repair when only reading from a single replica
▸ Hot data will be repaired frequently
[Figure sequence: read repair across three replicas holding (Bob, 1), (Bob, 1), (Robert, 2).]
▸ READ AT CONSISTENCY LEVEL 1: the read returns (Bob, 1)
▸ CHECK OTHER REPLICAS AT PROBABILITY 0.1, OR SOME OTHER VALUE: Get Digest from each of the other replicas
▸ UPDATE COORDINATOR WITH LATEST RECORD FROM DIGEST: replicas now hold (Robert, 2), (Bob, 1), (Robert, 2)
▸ REPAIR OUT OF DATE REPLICAS: Update Replica, so all three hold (Robert, 2)
▸ READ AT CONSISTENCY LEVEL 1: the read returns (Robert, 2)
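The probabilistic read repair sequence can be sketched as below. It is an illustration only: replicas are modeled as dicts of `(value, version)` records, `read_with_repair` and `digest` are hypothetical names, and MD5 is just a convenient stand-in for whatever digest a real system uses.

```python
import hashlib
import random


def digest(record):
    """Short fingerprint of a record, cheaper to ship than the record itself."""
    return hashlib.md5(repr(record).encode("utf-8")).hexdigest()


def read_with_repair(replicas, key, chance=0.1, rng=random):
    """Read from one replica; with probability `chance`, check and repair the rest."""
    result = replicas[0][key]
    if rng.random() >= chance:
        return result  # the common case: a plain single-replica read

    # Compare digests from the other replicas against what we read.
    others = replicas[1:]
    if all(digest(replica[key]) == digest(result) for replica in others):
        return result

    # Mismatch: fetch the full records and pick the highest version.
    newest = max((replica[key] for replica in replicas),
                 key=lambda record: record[1])
    for replica in replicas:
        if replica[key] != newest:
            replica[key] = newest  # repair the out-of-date replica
    return newest


def make_replicas():
    return [{"k": ("Bob", 1)}, {"k": ("Bob", 1)}, {"k": ("Robert", 2)}]
```

Passing `chance=1.0` forces the check on every read and `chance=0.0` disables it, which makes the behavior easy to exercise; hot keys are read often, so even a small `chance` repairs them quickly.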
REPAIR
▸ Similar to rsync
▸ Expensive, Slow
▸ Incremental Repair is a huge (late) improvement in manageability.
▸ How Repair Works*
[Figure sequence: two replicas' hash trees compared side by side.]
▸ HASHES FOR EACH RANGE, ARRANGE IN A HASH TREE
▸ FIND THE HASHES THAT DON’T MATCH
▸ LEAF NODES REPRESENT DATA THAT NEEDS TO BE EXCHANGED
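The three steps above (hash each range, arrange the hashes in a tree, descend only where hashes differ) can be sketched as follows. This is a simplified illustration, not Cassandra's repair code: `build_tree` and `differing_leaves` are hypothetical names, SHA-256 is an arbitrary choice, and the number of ranges is assumed to be a power of two.

```python
import hashlib


def h(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def build_tree(ranges):
    """Return the tree as a list of levels: leaf hashes first, root last.

    Assumes len(ranges) is a power of two.
    """
    levels = [[h(data) for data in ranges]]
    while len(levels[-1]) > 1:
        below = levels[-1]
        levels.append([h(below[i] + below[i + 1])
                       for i in range(0, len(below), 2)])
    return levels


def differing_leaves(tree_a, tree_b):
    """Walk down from the root, following only subtrees whose hashes differ."""
    top = len(tree_a) - 1
    if tree_a[top][0] == tree_b[top][0]:
        return []  # roots match: the replicas agree on everything
    suspects = [0]
    for level in range(top - 1, -1, -1):
        next_suspects = []
        for parent in suspects:
            for child in (2 * parent, 2 * parent + 1):
                if tree_a[level][child] != tree_b[level][child]:
                    next_suspects.append(child)
        suspects = next_suspects
    return suspects  # indexes of the ranges that must be exchanged


replica_a = ["01", "02", "03", "04", "05", "06", "07", "08"]
replica_b = ["01", "02", "14", "04", "05", "17", "07", "08"]
to_exchange = differing_leaves(build_tree(replica_a), build_tree(replica_b))
```

Only the hashes travel between replicas during the comparison; the data itself is exchanged just for the ranges that end up in `to_exchange`.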
EVOLUTION OF SHARDING
RANGE SHARDING
▸ Pros
  ▸ Generally good for short term problems
  ▸ Range Scanning
▸ Cons
  ▸ Need to track shards, repartitioning
  ▸ Second System problem
  ▸ Data is “lumpy”
[Figure: range sharding. Shards by key range: A-E, F-J, K-M, N-Q, R-T, U-Z.]
FUNCTIONAL SHARDING
▸ Pros
  ▸ Natural and easy process
  ▸ Can alleviate some pain and isolate failure conditions
▸ Cons
  ▸ Different Parts scale at different rates, if you need to scale, this probably isn’t a long term solution.
[Figure: functional sharding. Separate shards for Sales, User, and Marketing.]
CONSISTENT HASHING
▸ Pros
  ▸ Scales Exceptionally Well
  ▸ Nothing Else Required (except a sharding algorithm)
  ▸ Uniform Distribution of Data
▸ Cons
  ▸ Where is my data?
  ▸ Global range queries become nonsensical
  ▸ Relationships based on the ‘key’ are scattered
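[Figure: consistent hashing. F(X) maps the keys ALICE, BOB, CHUCK, DAVE, EVE, FRANK to tokens; nodes with tokens 100, 200, 300, 400, 500, 600 own the ranges 1-100, 101-200, 201-300, 301-400, 401-500, 501-600, so the keys land on the ring in hash order (FRANK, DAVE, BOB, CHUCK, ALICE, EVE) instead of key order.]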
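A token ring like the one in the figure can be sketched with a sorted token list and a binary search. This is a minimal illustration with assumptions called out: the `HashRing` class and its methods are hypothetical, MD5 is an arbitrary hash, and the ring size of 600 with evenly spaced tokens simply mirrors the figure's 1-600 range.

```python
import bisect
import hashlib


class HashRing:
    """Toy consistent hashing ring (hypothetical) with evenly spaced tokens."""

    def __init__(self, nodes, ring_size=600):
        self.nodes = list(nodes)
        self.ring_size = ring_size
        step = ring_size // len(self.nodes)
        # Node i owns the range (previous token, its own token].
        self.tokens = [step * (i + 1) for i in range(len(self.nodes))]

    def token_for(self, key):
        """F(X): hash the key onto the ring, giving a token in 1..ring_size."""
        hex_digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return int(hex_digest, 16) % self.ring_size + 1

    def owner_of_token(self, token):
        # First node whose token is >= the key's token, wrapping around.
        index = bisect.bisect_left(self.tokens, token)
        return self.nodes[index % len(self.nodes)]

    def owner(self, key):
        return self.owner_of_token(self.token_for(key))

    def replicas(self, key, count):
        """Owner plus the next nodes clockwise (count must not exceed node count)."""
        start = bisect.bisect_left(self.tokens, self.token_for(key))
        return [self.nodes[(start + i) % len(self.nodes)] for i in range(count)]


ring = HashRing(["n1", "n2", "n3", "n4", "n5", "n6"])
placement = {key: ring.owner(key)
             for key in ["ALICE", "BOB", "CHUCK", "DAVE", "EVE", "FRANK"]}
```

Any client can compute `owner(key)` locally, which answers "where is my data?" for a single key, but adjacent keys hash to unrelated tokens, which is why global range queries stop making sense.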
CONCLUSIONS
▸ NoSQL is full of choices and tradeoffs
▸ Every animal has different characteristics
▸ Know your niche.
▸ Find an animal that will thrive in your niche.
DATASTAX INC.